Summary
The document provides a detailed analysis of the trtllm-build flags to optimize the inference engine configuration for Llama 3 70B on H100 SXM5. It compares various configurations including paged KV cache, inflight batching, and chunked context FMHA.
Key Technical Findings
- Paged KV cache and inflight batching are essential for production serving.
- The strongly_typed flag reduces engine size by 18%.
- Without paged KV cache, memory growth is linear with sequence count, leading to OOM errors at batch size 32.
- Context FMHA benefits prefill-heavy workloads but not pure decode serving.
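The linear-growth finding above can be checked with back-of-envelope arithmetic. A minimal sketch, assuming Llama 3 70B's published shape (80 layers, 8 KV heads via GQA, head dim 128) and fp16 KV cache entries; the sequence length comes from the base engine's max_input_len 2048 + max_output_len 1024:

```python
# KV cache sizing for Llama 3 70B (assumed shape: 80 layers,
# 8 KV heads via GQA, head_dim 128, fp16 = 2 bytes per value).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16

# Factor of 2 for the separate K and V caches.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def kv_cache_gib(batch_size: int, seq_len: int) -> float:
    """Contiguous (non-paged) KV cache: every sequence reserves max length."""
    return batch_size * seq_len * bytes_per_token / 2**30

# Each sequence pre-allocates the full 2048 + 1024 = 3072-token window
# when paged KV cache is disabled.
for batch in (8, 16, 32, 64):
    print(f"batch {batch:2d}: {kv_cache_gib(batch, 3072):5.1f} GiB")
```

Under these assumptions the contiguous allocation alone reaches 30 GiB at batch 32, consistent with the OOM observation; a paged KV cache instead allocates fixed-size blocks on demand, so memory tracks actual generated tokens rather than the worst case.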
Commands
- trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-base --gemm_plugin auto --max_batch_size 64 --max_input_len 2048 --max_output_len 1024
- trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-paged --gemm_plugin fp8 --paged_kv_cache enable --max_num_tokens 8192 --use_inflight_batching
- trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-chunked --gemm_plugin fp8 --paged_kv_cache enable --use_inflight_batching --context_fmha enabled --use_paged_context_fmha
Lessons Learned
- paged_kv_cache + use_inflight_batching are non-negotiable for production serving.
- strongly_typed flag enables stricter type inference and reduces engine footprint.
- max_num_tokens controls the token budget for inflight batching; size it to GPU memory, not to a target batch count.
- Build once, profile twice: always benchmark after each flag change.
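The "size to GPU memory" lesson can be sketched as a small sizing helper. Everything here is an illustrative assumption, not a measured value: the Llama 3 70B KV shape from above, a hypothetical 35 GiB per-GPU weight shard, and an 8 GiB activation/runtime overhead:

```python
# Hypothetical helper: derive a max_num_tokens budget from the memory
# left after weights, rather than from a target batch size.
# All constants below are illustrative assumptions, not measured values.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # Llama 3 70B shape, fp16 KV

def max_num_tokens_budget(gpu_mem_gib: float,
                          weights_gib: float,
                          overhead_gib: float = 8.0,
                          kv_fraction: float = 0.9) -> int:
    """Tokens the KV cache can hold in the memory left after weights."""
    free_gib = gpu_mem_gib - weights_gib - overhead_gib
    free_bytes = kv_fraction * free_gib * 2**30
    return int(free_bytes // KV_BYTES_PER_TOKEN)

# e.g. one 80 GiB H100 SXM5 holding a hypothetical 35 GiB weight shard
print(max_num_tokens_budget(80.0, 35.0))
```

The point of the sketch is the direction of the dependency: the budget falls out of free memory divided by per-token KV cost, and a batch-count-derived value that exceeds it simply cannot fit.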
Next Steps
- Script automated engine builds with different flag sets and tabulate throughput
- Test --weight_only_precision int4_awq for memory-constrained deployments
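The first next step could start from a loop like the sketch below. The variant names are made up, the flag sets are the ones recorded in the Commands section, and each trtllm-build line is echoed rather than executed so the matrix can be dry-run before committing GPU hours:

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one trtllm-build command per flag set.
# Swap `echo` for direct execution once the variants look right,
# then run the benchmark harness against each ./engine-<name>.
set -eu

CKPT=./llama3-70b

build() {  # build <variant-name> <extra flags...>
  local name=$1; shift
  echo trtllm-build --checkpoint_dir "$CKPT" \
    --output_dir "./engine-$name" "$@"
}

build base    --gemm_plugin auto --max_batch_size 64 \
              --max_input_len 2048 --max_output_len 1024
build paged   --gemm_plugin fp8 --paged_kv_cache enable \
              --max_num_tokens 8192 --use_inflight_batching
build chunked --gemm_plugin fp8 --paged_kv_cache enable \
              --use_inflight_batching --context_fmha enabled \
              --use_paged_context_fmha
```

Keeping each variant's flags in one place makes it easy to tabulate throughput per engine afterwards and to add an int4_awq variant for the memory-constrained test.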