CUDA Graph Capture for Low-Latency LLM Decode on RTX 4090

5.5ms per decode step sounds like a rounding error. But a typical LLM response generates 100 to 300 tokens, and each token is a decode step. That's 550ms to 1.65 seconds of latency you're paying in overhead alone — before the GPU has done any actual work.

This is what CUDA kernel launch overhead looks like at scale, and CUDA graph capture is the tool that eliminates it.

The Problem: Kernel Launch Overhead at Decode Time

Every GPU operation — matrix multiply, softmax, attention kernel — starts with a CPU-side call that launches the kernel onto the GPU. At the scale of a single operation, this overhead is negligible. During the decode phase of an LLM, where the model runs through hundreds of near-identical forward passes (one per generated token), the cost accumulates into something real.

The decode loop has a stable structure. For a given batch size, the same sequence of kernels fires in the same order with the same shapes on every step. The only thing that changes is the KV cache state. This structural stability is exactly what CUDA graphs exploit.

Recording and Replaying the Graph

CUDA graph capture works in two phases. During warmup, you tell CUDA to record a complete forward pass — every kernel launch, every memory operation — into a graph object. After capture, instead of re-issuing all those individual launches on each decode step, you replay the graph: a single call that executes the entire recorded sequence. The CPU overhead collapses from N kernel launches per step to 1 graph replay per step.

TensorRT-LLM exposes this through --use_cuda_graph at build time. The warmup script pre-captures graphs for each expected batch size (1 through 64), storing them for reuse. Running this during container startup means there's no capture penalty on the first real request.

# Build engine with CUDA graph support
trtllm-build --use_custom_all_reduce disable --use_cuda_graph

# Pre-capture graphs for all expected batch sizes during warmup
python warmup_graphs.py --engine_dir ./engine-graph --batch_sizes 1,2,4,8,16,32

# Profile decode step timing
nsys profile --capture-range=cudaProfilerApi -o decode_graph python decode_bench.py

# Compare latency with and without graphs
python compare_latency.py --no_graph vs --graph --batch 1

What the Numbers Show

At batch=1 — a single user making a request — CUDA graphs reduce per-step overhead by approximately 5.5ms. At 100 decode steps, that's 550ms of recovered latency. At 200 steps, over a second.

The tradeoff: container startup takes 8.2 seconds instead of 0.3 seconds while graphs are being captured and warmed. In a production serving environment where containers are long-lived, this is a one-time cost. In short-lived or frequently recycled containers, it's worth profiling whether the warmup pays for itself.

There's one hard incompatibility to know before reaching for this optimization: CUDA graphs and paged KV cache cannot be used together in TensorRT-LLM. Paged KV cache is essential for efficient memory management under variable-length sequences and high concurrency. CUDA graphs are most valuable for batch=1 interactive latency. These optimization goals tend to conflict — paged KV cache matters most when you're serving many concurrent users, and CUDA graphs matter most when you're optimizing the single-user case.

When to Use It

CUDA graph capture is the right optimization for one specific scenario: interactive, low-concurrency serving where a single user's response latency matters more than aggregate throughput. A coding assistant responding to a single developer. A voice interface that needs fast token generation.

If you're running high-concurrency batch inference where paged KV cache is handling memory pressure across hundreds of concurrent sequences, the incompatibility makes graphs the wrong tool. The throughput gains from efficient KV cache management will outweigh the kernel launch savings.

For the interactive case — especially on consumer hardware like the RTX 4090 where you're already memory-constrained and not running paged KV — CUDA graphs recover real latency that users feel.