ai-infra · TensorRT-LLM · Llama 3 70B · H100 SXM5 · paged KV cache · inflight batching · chunked context FMHA

TensorRT-LLM Engine Build Flags Deep Dive for Llama 3 70B on H100 SXM5

A detailed analysis of the `trtllm-build` flags used to optimize the inference engine configuration for Llama 3 70B on the H100 SXM5.

April 20, 2026·1 min read

Summary

This post analyzes `trtllm-build` flags for tuning Llama 3 70B inference on the H100 SXM5, comparing three engine configurations: a baseline, paged KV cache with inflight batching, and chunked context FMHA.

Key Technical Findings

  • Paged KV cache and inflight batching are essential for production serving.
  • The `strongly_typed` flag reduces engine size by 18%.
  • Without a paged KV cache, memory grows linearly with sequence count, hitting OOM at batch size 32.
  • Context FMHA benefits prefill-heavy workloads but not pure decode serving.
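
The linear-growth finding can be sanity-checked with back-of-envelope math. A minimal sketch, assuming the published Llama 3 70B attention layout (80 layers, 8 KV heads via grouped-query attention, head dim 128) and fp16 KV entries; these are estimates, not measurements:

```python
# Back-of-envelope KV cache sizing for Llama 3 70B (fp16 KV entries).
NUM_LAYERS = 80
NUM_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # fp16

# K and V caches, per token, summed across all layers.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

def kv_cache_gib(batch_size: int, seq_len: int) -> float:
    """Total KV cache footprint in GiB for a given batch and sequence length."""
    return batch_size * seq_len * kv_bytes_per_token / 2**30

# Without a paged KV cache, each sequence reserves the full
# max_input_len + max_output_len window (2048 + 1024 = 3072 tokens).
seq_len = 2048 + 1024
for batch in (8, 16, 32):
    print(f"batch {batch:>2}: {kv_cache_gib(batch, seq_len):.1f} GiB")
```

At batch 32 the preallocated cache alone reaches roughly 30 GiB on top of weights and activations, which is consistent with the OOM noted above; a paged cache instead allocates blocks on demand as sequences actually grow.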

Commands

```shell
# Baseline
trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-base \
  --gemm_plugin auto --max_batch_size 64 --max_input_len 2048 --max_output_len 1024

# Paged KV cache + inflight batching
trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-paged \
  --gemm_plugin fp8 --paged_kv_cache enable --max_num_tokens 8192 --use_inflight_batching

# Chunked context FMHA
trtllm-build --checkpoint_dir ./llama3-70b --output_dir ./engine-chunked \
  --gemm_plugin fp8 --paged_kv_cache enable --use_inflight_batching \
  --context_fmha enabled --use_paged_context_fmha
```
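
After a build, it is worth confirming which options actually landed in the engine. A hedged sketch: TensorRT-LLM writes a `config.json` into the engine directory, but the exact key names vary across releases, so this reads defensively and treats the key paths shown as assumptions:

```python
import json
from pathlib import Path

def summarize_engine(engine_dir: str) -> dict:
    """Pull a few build options out of an engine's config.json.

    Key names differ across TensorRT-LLM releases, so every lookup
    falls back to None rather than assuming a fixed schema.
    """
    cfg = json.loads(Path(engine_dir, "config.json").read_text())
    build = cfg.get("build_config", cfg)   # newer layouts nest build options
    plugins = build.get("plugin_config", {})
    return {
        "max_batch_size": build.get("max_batch_size"),
        "max_num_tokens": build.get("max_num_tokens"),
        "paged_kv_cache": plugins.get("paged_kv_cache"),
        "context_fmha": plugins.get("context_fmha"),
    }

# Example: summarize_engine("./engine-paged")
```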

Lessons Learned

  • `--paged_kv_cache` + `--use_inflight_batching` are non-negotiable for production serving.
  • The `strongly_typed` flag enables stricter type inference and reduces the engine footprint.
  • `--max_num_tokens` controls the inflight batching budget: size it to GPU memory, not batch count.
  • Build once, profile twice: always benchmark after each flag change.
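
The "size to GPU memory" rule can be made concrete. A rough sketch, reusing the same assumed Llama 3 70B KV layout (80 layers, 8 KV heads, head dim 128, fp16); the fraction of free memory reserved for the cache is an illustrative assumption, not a TensorRT-LLM default:

```python
# K+V, per token: layers * KV heads * head dim * fp16 bytes, times 2 for K and V.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

def suggest_max_num_tokens(free_gib: float, kv_fraction: float = 0.9) -> int:
    """Rough max_num_tokens budget: how many tokens' worth of KV cache
    fits in the memory left over after weights and activations."""
    budget_bytes = free_gib * kv_fraction * 2**30
    return int(budget_bytes // KV_BYTES_PER_TOKEN)

# e.g. with ~24 GiB free after weights on one rank:
print(suggest_max_num_tokens(24.0))
```

Benchmarking then confirms whether the budget holds up under real traffic, per the lesson above.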

Next Steps

  • Script automated engine builds with different flag sets and tabulate throughput
  • Test --weight_only_precision int4_awq for memory-constrained deployments
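
The first next step can begin as a small generator that expands flag sets into `trtllm-build` invocations. A dry-run sketch: the flag combinations are the ones from this post, while the runner and throughput-tabulation step are left as assumptions:

```python
import shlex

# Flag sets from this post; extend or edit per experiment.
FLAG_SETS = {
    "base": ["--gemm_plugin", "auto"],
    "paged": ["--gemm_plugin", "fp8", "--paged_kv_cache", "enable",
              "--use_inflight_batching"],
    "chunked": ["--gemm_plugin", "fp8", "--paged_kv_cache", "enable",
                "--use_inflight_batching", "--context_fmha", "enabled",
                "--use_paged_context_fmha"],
}

def build_commands(ckpt: str = "./llama3-70b") -> list:
    """Expand each flag set into a full trtllm-build command string."""
    cmds = []
    for name, flags in FLAG_SETS.items():
        argv = ["trtllm-build", "--checkpoint_dir", ckpt,
                "--output_dir", f"./engine-{name}", *flags]
        cmds.append(shlex.join(argv))
    return cmds

# Dry run: print the commands instead of executing them.
for cmd in build_commands():
    print(cmd)
```

Swapping the `print` for `subprocess.run` (and a benchmark call per engine) turns this into the automated sweep described above.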