Summary
Using CUDA graph capture in TensorRT-LLM to eliminate kernel launch overhead during the decode phase of Llama 3 8B on RTX 4090. Measured and compared decode latencies with and without graph capture at various batch sizes, identifying the impact on cold start latency and the effectiveness of pre-capturing graphs.
Key Technical Findings
- CUDA graph capture reduces kernel launch overhead by approximately 5.5ms per step for batch=1.
- Graph capture supports batch sizes from 1 to 64, with separate graphs pre-captured during warmup.
- Cold start time is significantly reduced (8.2s vs 0.3s) after graph capture but increases substantially due to the need for warmup.
Commands Used
`trtllm-build --use_custom_all_reduce disable --use_cuda_graph`
`python warmup_graphs.py --engine_dir ./engine-graph --batch_sizes 1,2,4,8,16,32`
`nsys profile --capture-range=cudaProfilerApi -o decode_graph python decode_bench.py`
`python compare_latency.py --no_graph vs --graph --batch 1`
Lessons Learned
- CUDA graphs are most effective for reducing interactive latency in batch=1 scenarios rather than enhancing batch throughput.
- CUDA graphs and paged KV cache are incompatible in TRT-LLM, necessitating a choice between them.
- Pre-warming all expected batch sizes during container startup prevents runtime stalls due to graph capture.