Tags: gpu, CUDA, TensorRT-LLM, Llama 3, Latency Optimization, RTX 4090, Inference

CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090

Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090 — benchmarks, trade-offs, and lessons learned.

Key Technical Findings

  • CUDA graph capture reduces kernel launch overhead by approximately 5.5 ms per decode step at batch size 1.
  • Graph capture supports batch sizes from 1 to 64, with a separate graph pre-captured for each batch size during warmup.
  • Cold start time increases substantially with graph capture (8.2 s vs 0.3 s without it), because every graph must be captured during warmup before the first request is served.
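
To put the two numbers above side by side, the short calculation below estimates how many decode steps it takes for the per-step saving to repay the longer startup. It is a sketch under one assumption: that 8.2 s is the cold start with graph capture and 0.3 s is the cold start without it. The function and constant names are illustrative and do not come from the benchmark scripts.

```python
import math

# Figures quoted in the findings above, converted to milliseconds.
SAVED_MS_PER_STEP = 5.5      # launch overhead removed per decode step at batch size 1
COLD_START_GRAPH_MS = 8200   # assumption: startup time with graph capture and warmup
COLD_START_EAGER_MS = 300    # assumption: startup time without graph capture


def break_even_steps(saved_ms_per_step, cold_graph_ms, cold_eager_ms):
    """Decode steps needed before the per-step saving repays the extra startup cost."""
    extra_startup_ms = cold_graph_ms - cold_eager_ms
    return math.ceil(extra_startup_ms / saved_ms_per_step)


steps = break_even_steps(SAVED_MS_PER_STEP, COLD_START_GRAPH_MS, COLD_START_EAGER_MS)
print(steps)  # 1437
```

Under that reading, the warmup cost is repaid after roughly 1,400 generated tokens at batch size 1. This favors long-running servers over short-lived processes.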

Commands Used

trtllm-build --use_custom_all_reduce disable --use_cuda_graph
python warmup_graphs.py --engine_dir ./engine-graph --batch_sizes 1,2,4,8,16,32
nsys profile --capture-range=cudaProfilerApi -o decode_graph python decode_bench.py
python compare_latency.py --no_graph vs --graph --batch 1
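
The contents of compare_latency.py are not shown here, so the following is only a rough sketch of how a per-step latency comparison could be structured. The helper median_step_ms and the two stand-in step functions are hypothetical. When timing real GPU work, each step has to block until the device has finished before the clock is read. Otherwise only the launch time is measured.

```python
import statistics
import time


def median_step_ms(step_fn, warmup=10, iters=100):
    """Median wall-clock latency of one call to step_fn, in milliseconds."""
    for _ in range(warmup):
        step_fn()  # discard early iterations (lazy initialization, cold caches)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Hypothetical stand-ins for one decode step with and without graph replay.
def eager_step():
    return sum(range(20000))


def graph_step():
    return sum(range(10000))


eager_ms = median_step_ms(eager_step, warmup=5, iters=50)
graph_ms = median_step_ms(graph_step, warmup=5, iters=50)
print(f"eager {eager_ms:.3f} ms, graph {graph_ms:.3f} ms")
```

The median is used rather than the mean so that a few slow outlier iterations do not dominate the result.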

Lessons Learned

  • CUDA graphs are most effective at reducing interactive latency at batch size 1; they do little to improve batched throughput.
  • CUDA graphs and paged KV cache are incompatible in TRT-LLM, necessitating a choice between them.
  • Pre-warming all expected batch sizes during container startup prevents runtime stalls due to graph capture.
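
The pre-warming lesson can be illustrated with a small cache that captures one graph per batch size at startup and looks it up at request time. This is a minimal sketch, not the TensorRT-LLM implementation. The GraphCache class and capture_fn are hypothetical. Padding a request up to the next captured batch size is an assumption made for this example.

```python
import bisect


class GraphCache:
    """Holds one pre-captured graph per batch size and pads requests up to the nearest one."""

    def __init__(self, capture_fn):
        self._capture_fn = capture_fn  # hypothetical: builds a replayable graph for a batch size
        self._sizes = []               # sorted list of captured batch sizes
        self._graphs = {}              # batch size -> captured graph

    def warmup(self, batch_sizes):
        """Capture every expected batch size up front, for example at container startup."""
        for size in sorted(set(batch_sizes)):
            self._graphs[size] = self._capture_fn(size)
        self._sizes = sorted(self._graphs)

    def lookup(self, batch):
        """Return (padded_size, graph), or None if no captured graph is large enough."""
        i = bisect.bisect_left(self._sizes, batch)
        if i == len(self._sizes):
            return None  # caller falls back to the non-graph path
        size = self._sizes[i]
        return size, self._graphs[size]


cache = GraphCache(capture_fn=lambda size: f"graph[{size}]")  # stand-in for real capture
cache.warmup([1, 2, 4, 8, 16, 32])
print(cache.lookup(3))   # (4, 'graph[4]')
print(cache.lookup(33))  # None
```

Because lookup never triggers a capture, a request with an unexpected batch size falls back to the slower path instead of stalling while a new graph is captured.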