Summary
Tuned KV cache allocation and paging strategy in TensorRT-LLM to maximize concurrent sequences on H100 GPUs for Llama 3 70B at long context, measuring memory pressure, calculating theoretical limits, tuning memory fraction settings, comparing fixed vs paged strategies, and implementing prefix reuse.
Key Technical Findings
- KV cache per token per layer is 32KB at fp16 for 70B model.
- At context length of 4096, each sequence holds 128MB KV cache.
- H100 80GB can support up to 600 concurrent sequences with kv_cache_free_gpu_mem_fraction=0.85.
- Prefix reuse on a 512-token system prompt reduces TTFT by 31% for cached requests.
Commands
python measure_kv_cache.py --model llama3-70b --layers 80 --heads 64 --head_dim 128 --dtype fp16trtllm-build --gemm_plugin fp8 --paged_kv_cache enable --kv_cache_free_gpu_mem_fraction 0.85python kv_reuse_test.py --prefix_len 512 --num_requests 1000 --cache_reuse enabled
Lessons Learned
- KV cache is the primary memory bottleneck at long context, not model weights.
- Paged KV cache prevents fragmentation but adds a 3-5% overhead compared to contiguous allocation.
- Prefix caching significantly benefits chatbot workloads with fixed system prompts.
- Max concurrent sequences should be calculated analytically for effective capacity planning.