ai-infra · CUDA graphs · TensorRT-LLM · Llama 3 8B · Latency Optimization · Batch Size Management

CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090

Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090.

April 20, 2026 · 1 min read

Summary

This post uses CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on an RTX 4090. I measured decode latencies with and without graph capture across a range of batch sizes, quantifying the impact on cold-start latency and the effectiveness of pre-capturing graphs during warmup.

Key Technical Findings

  • CUDA graph capture reduces kernel launch overhead by approximately 5.5ms per step for batch=1.
  • Graph capture supports batch sizes from 1 to 64, with separate graphs pre-captured during warmup.
  • Cold start time increases substantially (8.2s vs 0.3s) because a graph must be captured for each batch size during warmup; pre-capturing at startup keeps this cost out of the request path.
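The per-step saving compounds over a full generation. A back-of-envelope sketch using the ~5.5 ms/step figure above (the 256-token generation length is an assumption for illustration, not a measured value):

```python
# Rough estimate of end-to-end decode savings from CUDA graph capture,
# using the ~5.5 ms/step launch-overhead reduction measured at batch=1.
LAUNCH_OVERHEAD_SAVED_MS = 5.5   # per decode step at batch=1 (measured)
NEW_TOKENS = 256                 # assumed generation length (illustrative)

total_saved_ms = LAUNCH_OVERHEAD_SAVED_MS * NEW_TOKENS
print(f"~{total_saved_ms / 1000:.2f} s saved over {NEW_TOKENS} decode steps")
# 5.5 ms * 256 steps = 1408 ms, i.e. ~1.41 s of pure launch overhead removed
```

At interactive batch sizes this is a meaningful fraction of total generation time, which is why the win shows up as latency rather than throughput.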

Commands Used

`trtllm-build --use_custom_all_reduce disable --use_cuda_graph`
`python warmup_graphs.py --engine_dir ./engine-graph --batch_sizes 1,2,4,8,16,32`
`nsys profile --capture-range=cudaProfilerApi -o decode_graph python decode_bench.py`
`python compare_latency.py --no_graph --batch 1` vs `python compare_latency.py --graph --batch 1`
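Because each captured graph has fixed shapes, a runtime batch has to be mapped onto one of the sizes pre-captured by `warmup_graphs.py`, typically by rounding up and padding. A minimal sketch of that dispatch logic (the helper name and fallback behavior are assumptions, not TRT-LLM API):

```python
from bisect import bisect_left

# Batch sizes pre-captured during warmup (matching the command above).
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]

def select_graph_batch(batch: int, captured=CAPTURED_BATCH_SIZES):
    """Round a runtime batch up to the nearest pre-captured graph size.

    Requests are padded to this size so the replayed graph's fixed shapes
    match. Batches above the largest captured size return None here,
    standing in for an eager-mode fallback. Hypothetical helper.
    """
    i = bisect_left(captured, batch)
    return captured[i] if i < len(captured) else None

print(select_graph_batch(3))   # rounds up to the captured size 4
print(select_graph_batch(33))  # exceeds all captured sizes -> None
```

The padding waste is bounded by the gap between captured sizes, which is why powers of two are a common choice for the warmup list.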

Lessons Learned

  • CUDA graphs are most effective for reducing interactive latency in batch=1 scenarios rather than enhancing batch throughput.
  • CUDA graphs and paged KV cache are incompatible in TRT-LLM, necessitating a choice between them.
  • Pre-warming all expected batch sizes during container startup prevents runtime stalls due to graph capture.
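The pre-warming lesson reduces to a simple startup loop: capture one graph per expected batch size before the server accepts traffic. A structural sketch with a stub in place of the real capture call (the function names are stand-ins, not TRT-LLM APIs):

```python
import time

# Expected serving batch sizes to pre-warm at container startup.
EXPECTED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]

def capture_decode_graph(batch_size: int):
    """Stub for running a dummy decode step under graph capture.

    In a real server this would execute one decode iteration for this
    batch size while recording it into a replayable CUDA graph.
    """
    return {"batch_size": batch_size, "captured_at": time.time()}

def prewarm(batch_sizes):
    """Capture a graph for every expected batch size before serving."""
    return {bs: capture_decode_graph(bs) for bs in batch_sizes}

graph_cache = prewarm(EXPECTED_BATCH_SIZES)
print(sorted(graph_cache))  # every expected batch size has a graph ready
```

Paying the full ~8 s capture cost once at startup, inside the container's readiness probe, is what keeps it off the first user request.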