Optimizing KV Cache Management in TensorRT-LLM for H100 GPUs

Summary

Tuned KV cache allocation and paging strategy in TensorRT-LLM to maximize concurrent sequences on H100 GPUs for Llama 3 70B at long context, measuring memory pressure, calculating theoretical limits, tuning memory fraction settings, comparing fixed vs paged strategies, and implementing prefix reuse.

Key Technical Findings

KV cache per token per layer is 32KB at fp16 for 70B model.
At context length of 4096, each sequence holds 128MB KV cache.
H100 80GB can support up to 600 concurrent sequences with kv_cache_free_gpu_mem_fraction=0.85.
Prefix reuse on a 512-token system prompt reduces TTFT by 31% for cached requests.

Commands

python measure_kv_cache.py --model llama3-70b --layers 80 --heads 64 --head_dim 128 --dtype fp16
trtllm-build --gemm_plugin fp8 --paged_kv_cache enable --kv_cache_free_gpu_mem_fraction 0.85
python kv_reuse_test.py --prefix_len 512 --num_requests 1000 --cache_reuse enabled

Lessons Learned

KV cache is the primary memory bottleneck at long context, not model weights.
Paged KV cache prevents fragmentation but adds a 3-5% overhead compared to contiguous allocation.
Prefix caching significantly benefits chatbot workloads with fixed system prompts.
Max concurrent sequences should be calculated analytically for effective capacity planning.