## Summary
FP8 (E4M3) quantization on NVIDIA H100 GPUs can deliver 1.6–1.9× throughput improvement over BF16 with near-zero accuracy degradation when calibrated correctly. This post covers the full pipeline: calibration dataset selection, TensorRT-LLM build flags, and production validation methodology.
## What I Did
I benchmarked Llama-3 70B Instruct with FP8 weights + activations on a 4×H100 SXM5 node. The goal was to push throughput per GPU while keeping MMLU accuracy within 0.5% of the BF16 baseline.
Key setup:
- Model: `meta-llama/Meta-Llama-3-70B-Instruct`
- Hardware: 4× H100 80GB SXM5 (NVLink 4.0)
- Framework: TensorRT-LLM 0.9.0
- Calibration: 512 samples from Pile-val, 2048 max tokens
## Key Technical Findings
Throughput gains (batch size 32, seq 2048):
| Precision | Tokens/sec/GPU | vs BF16 |
|---|---|---|
| BF16 | 2,840 | 1.0× |
| FP8 W+A | 5,210 | 1.83× |
| INT8 W | 4,090 | 1.44× |
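The speedup column follows directly from the tokens/sec numbers; a quick sanity check:

```python
bf16 = 2840  # tokens/sec/GPU, BF16 baseline from the table above
results = {"FP8 W+A": 5210, "INT8 W": 4090}

for name, tps in results.items():
    print(f"{name}: {tps / bf16:.2f}x vs BF16")
# FP8 W+A: 1.83x vs BF16
# INT8 W: 1.44x vs BF16
```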
Memory footprint:
- BF16 70B: 140GB across 4× GPUs = 35GB/GPU
- FP8 70B: 76GB across 4× GPUs = 19GB/GPU
- This makes single-node 70B serving possible on 2× H100 80GB
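The footprint numbers check out from first principles (rough decimal-GB arithmetic; the measured 76 GB includes overhead on top of raw weights):

```python
params = 70e9  # Llama-3 70B parameter count (approximate)

bf16_gb = params * 2 / 1e9  # 2 bytes/param -> 140.0 GB
fp8_gb = params * 1 / 1e9   # 1 byte/param  ->  70.0 GB raw weights

# Measured FP8 footprint was 76 GB; the ~6 GB gap is per-tensor scales,
# engine workspace, and layers left unquantized (a rough attribution).
print(bf16_gb, fp8_gb)  # 140.0 70.0
```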
Accuracy on MMLU (5-shot):
- BF16: 82.1%
- FP8 W+A (calibrated): 81.8% (−0.3%)
- INT8 W-only: 81.6% (−0.5%)
The critical insight: the calibration dataset matters more than its size. Using domain-matched data (code + technical text) instead of generic web text reduced the accuracy loss on coding benchmarks from −1.8% to −0.3%.
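Nothing fancy is needed for the domain-matched mix; a minimal sketch (the `build_calib_set` helper and the 50/50 ratio are illustrative, not my exact pipeline):

```python
import random

def build_calib_set(code_samples, text_samples, n=512, code_frac=0.5, seed=0):
    """Mix code and technical text so the calibration distribution
    matches the serving workload (the ratio is an assumption)."""
    rng = random.Random(seed)
    n_code = int(n * code_frac)
    calib = rng.sample(code_samples, n_code) + rng.sample(text_samples, n - n_code)
    rng.shuffle(calib)  # avoid ordering bias in batched amax collection
    return calib
```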
## Commands Used
**Step 1: Build the FP8-calibrated checkpoint**

```bash
python quantization/quantize.py \
    --model_dir meta-llama/Meta-Llama-3-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama3-70b-fp8 \
    --calib_size 512 \
    --calib_dataset pile-val
```
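Under the hood, amax-style per-tensor calibration reduces to a simple scale computation. A simulated sketch, not the TensorRT-LLM internals (448 is the largest finite E4M3 magnitude):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_scale(calib_activations: np.ndarray) -> float:
    """Amax calibration: map the observed absolute maximum onto the
    E4M3 range. Real calibrators aggregate amax over many batches."""
    amax = float(np.abs(calib_activations).max())
    return amax / E4M3_MAX

def quantize_sim(x: np.ndarray, scale: float) -> np.ndarray:
    # Simulated FP8 quantize: scale into range and clamp.
    # (Real kernels also round to the nearest representable E4M3 value.)
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX)

acts = np.array([-4.48, 0.1, 2.0, 3.9])
scale = per_tensor_scale(acts)  # 4.48 / 448 = 0.01
```

An outlier in the calibration batch inflates `amax` and crushes the resolution for everything else, which is exactly why the calibration distribution matters.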
**Step 2: Build the TensorRT-LLM engine**

```bash
trtllm-build \
    --checkpoint_dir ./llama3-70b-fp8 \
    --output_dir ./engines/llama3-70b-fp8 \
    --gemm_plugin fp8 \
    --strongly_typed \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 1024 \
    --tp_size 4
```
**Step 3: Benchmark**

```bash
python benchmarks/benchmark.py \
    --engine_dir ./engines/llama3-70b-fp8 \
    --dataset cnn_dailymail \
    --batch_size 32 \
    --input_output_len 2048,512 \
    --warm_up 5 \
    --num_runs 20
```
**Step 4: Validate accuracy**

```bash
python evaluate/run_mmlu.py \
    --engine_dir ./engines/llama3-70b-fp8 \
    --num_fewshot 5 \
    --output_path ./results/fp8-mmlu.json
```
## Lessons Learned
- Use the `--strongly_typed` flag — without it, TensorRT silently falls back to FP16 for some layers, reducing the throughput gains by ~30%.
- KV cache FP8 is essential — enabling `--kv_cache_dtype fp8` alongside weight FP8 delivers the bulk of the memory savings; skipping it cuts the memory reduction roughly in half.
- Attention layers are sensitive — if accuracy drops by more than 1%, try excluding them from FP8 via `--exclude_modules="attention"`. Usually not needed with good calibration, but worth knowing.
- Profile with Nsight before claiming success — SM efficiency should be above 75% for FP8 to deliver the expected gains; anything lower means the workload is memory-bound and the FP8 compute advantage isn't being used.
```bash
# Quick SM efficiency check (GPU metrics need elevated perf-counter access)
nsys profile --trace=cuda,nvtx --gpu-metrics-device=all \
    python run_benchmark.py --engine_dir ./engines/llama3-70b-fp8
nsys-ui report1.nsys-rep
```
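A quick way to reason about that 75% threshold is machine balance: peak FP8 FLOPs divided by memory bandwidth. The spec numbers below are approximate H100 SXM5 figures, an assumption to check against your part:

```python
PEAK_FP8_FLOPS = 1979e12  # ~dense FP8 throughput on H100 SXM5 (approx.)
HBM_BW_BYTES = 3.35e12    # ~HBM3 bandwidth in bytes/s (approx.)

balance = PEAK_FP8_FLOPS / HBM_BW_BYTES  # ~590 FLOP/byte

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    """Kernels below the balance point are memory-bound: faster FP8
    math cannot speed them up, matching the low-SM-efficiency symptom."""
    return flops / bytes_moved >= balance

# Decode-phase GEMV at batch 1: ~2 FLOPs per FP8 weight byte -> memory-bound.
print(is_compute_bound(flops=2.0, bytes_moved=1.0))  # False
```

This is why batching matters so much for FP8: larger batches raise FLOPs per weight byte moved, pushing GEMMs toward the compute-bound regime where the FP8 units pay off.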
## Next Steps

- Test FP8 on Mixtral 8×22B to compare MoE routing overhead impact
- Evaluate `fp8_rowwise` vs. `fp8_per_tensor` activation scaling on longer sequences
- Integrate the H100 Transformer Engine (TE) native FP8 path as an alternative to TRT-LLM calibration