Summary
The document provides a detailed analysis of the trtllm-build flags to optimize the inference engine configuration for Llama 3 70B on H100 SXM5. It compares various configurations including paged KV cache, inflight batching, and chunked context FMHA.
Key Technical Findings
- Paged KV cache and inflight batching are essential for production serving.
- The strongly_typed flag reduces engine size by 18%.
- Without paged KV cache, memory growth is linear with sequence count, leading to OOM errors at batch size 32.
- Context FMHA benefits prefill-heavy workloads but not pure decode serving.
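The linear-growth finding above can be checked with back-of-envelope arithmetic. A minimal sketch, assuming Llama 3 70B's published shape (80 layers, 8 KV heads via GQA, head dim 128) and fp16 KV cache entries; the sequence length comes from the base engine's max_input_len 2048 + max_output_len 1024:

```python
# KV cache sizing for Llama 3 70B (assumed shape: 80 layers,
# 8 KV heads via GQA, head_dim 128, fp16 = 2 bytes per value).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16

# Factor of 2 for the separate K and V caches.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def kv_cache_gib(batch_size: int, seq_len: int) -> float:
    """Contiguous (non-paged) KV cache: every sequence reserves max length."""
    return batch_size * seq_len * bytes_per_token / 2**30

# Each sequence pre-allocates the full 2048 + 1024 = 3072-token window
# when paged KV cache is disabled.
for batch in (8, 16, 32, 64):
    print(f"batch {batch:2d}: {kv_cache_gib(batch, 3072):5.1f} GiB")
```

Under these assumptions the contiguous allocation alone reaches 30 GiB at batch 32, consistent with the OOM observation; a paged KV cache instead allocates fixed-size blocks on demand, so memory tracks actual generated tokens rather than the worst case.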
Commands
- trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-base --gemm_plugin auto --max_batch_size 64 --max_input_len 2048 --max_output_len 1024
- trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-paged --gemm_plugin fp8 --paged_kv_cache enable --max_num_tokens 8192 --use_inflight_batching
- trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-chunked --gemm_plugin fp8 --paged_kv_cache enable --use_inflight_batching --context_fmha enabled --use_paged_context_fmha
Lessons Learned
- paged_kv_cache + use_inflight_batching are non-negotiable for production serving.
- strongly_typed flag enables stricter type inference and reduces engine footprint.
- max_num_tokens controls the token budget for inflight batching; size it to GPU memory, not to a target batch count.
- Build once, profile twice: always benchmark after each flag change.
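The "size to GPU memory" lesson can be sketched as a small sizing helper. Everything here is an illustrative assumption, not a measured value: the Llama 3 70B KV shape from above, a hypothetical 35 GiB per-GPU weight shard, and an 8 GiB activation/runtime overhead:

```python
# Hypothetical helper: derive a max_num_tokens budget from the memory
# left after weights, rather than from a target batch size.
# All constants below are illustrative assumptions, not measured values.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # Llama 3 70B shape, fp16 KV

def max_num_tokens_budget(gpu_mem_gib: float,
                          weights_gib: float,
                          overhead_gib: float = 8.0,
                          kv_fraction: float = 0.9) -> int:
    """Tokens the KV cache can hold in the memory left after weights."""
    free_gib = gpu_mem_gib - weights_gib - overhead_gib
    free_bytes = kv_fraction * free_gib * 2**30
    return int(free_bytes // KV_BYTES_PER_TOKEN)

# e.g. one 80 GiB H100 SXM5 holding a hypothetical 35 GiB weight shard
print(max_num_tokens_budget(80.0, 35.0))
```

The point of the sketch is the direction of the dependency: the budget falls out of free memory divided by per-token KV cost, and a batch-count-derived value that exceeds it simply cannot fit.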
Next Steps
- Script automated engine builds with different flag sets and tabulate throughput
- Test --weight_only_precision int4_awq for memory-constrained deployments
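The first next step could start from a loop like the sketch below. The variant names are made up, the flag sets are the ones recorded in the Commands section, and each trtllm-build line is echoed rather than executed so the matrix can be dry-run before committing GPU hours:

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one trtllm-build command per flag set.
# Swap `echo` for direct execution once the variants look right,
# then run the benchmark harness against each ./engine-<name>.
set -eu

CKPT=./llama3-70b

build() {  # build <variant-name> <extra flags...>
  local name=$1; shift
  echo trtllm-build --checkpoint_dir "$CKPT" \
    --output_dir "./engine-$name" "$@"
}

build base    --gemm_plugin auto --max_batch_size 64 \
              --max_input_len 2048 --max_output_len 1024
build paged   --gemm_plugin fp8 --paged_kv_cache enable \
              --max_num_tokens 8192 --use_inflight_batching
build chunked --gemm_plugin fp8 --paged_kv_cache enable \
              --use_inflight_batching --context_fmha enabled \
              --use_paged_context_fmha
```

Keeping each variant's flags in one place makes it easy to tabulate throughput per engine afterwards and to add an int4_awq variant for the memory-constrained test.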