## Summary
FP8 (E4M3) quantization on NVIDIA H100 GPUs can deliver 1.6–1.9× throughput improvement over BF16 with near-zero accuracy degradation when calibrated correctly. This post covers the full pipeline: calibration dataset selection, TensorRT-LLM build flags, and production validation methodology.
## What I Did
I benchmarked Llama-3 70B Instruct with FP8 weights + activations on a 4×H100 SXM5 node. The goal was to push throughput per GPU while keeping MMLU accuracy within 0.5% of the BF16 baseline.
Key setup:
- Model: `meta-llama/Meta-Llama-3-70B-Instruct`
- Hardware: 4× H100 80GB SXM5 (NVLink 4.0)
- Framework: TensorRT-LLM 0.9.0
- Calibration: 512 samples from Pile-val, 2048 max tokens
## Key Technical Findings
Throughput gains (batch size 32, seq 2048):
| Precision | Tokens/sec/GPU | vs BF16 |
|---|---|---|
| BF16 | 2,840 | 1.0× |
| FP8 W+A | 5,210 | 1.83× |
| INT8 W | 4,090 | 1.44× |
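The speedup column follows directly from the tokens/sec numbers; a quick sanity check:

```python
bf16 = 2840  # tokens/sec/GPU, BF16 baseline from the table above
results = {"FP8 W+A": 5210, "INT8 W": 4090}

for name, tps in results.items():
    print(f"{name}: {tps / bf16:.2f}x vs BF16")
# FP8 W+A: 1.83x vs BF16
# INT8 W: 1.44x vs BF16
```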
Memory footprint:
- BF16 70B: 140GB across 4× GPUs = 35GB/GPU
- FP8 70B: 76GB across 4× GPUs = 19GB/GPU
- This makes single-node 70B serving possible on 2× H100 80GB
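The footprint numbers check out from first principles (rough decimal-GB arithmetic; the measured 76 GB includes overhead on top of raw weights):

```python
params = 70e9  # Llama-3 70B parameter count (approximate)

bf16_gb = params * 2 / 1e9  # 2 bytes/param -> 140.0 GB
fp8_gb = params * 1 / 1e9   # 1 byte/param  ->  70.0 GB raw weights

# Measured FP8 footprint was 76 GB; the ~6 GB gap is per-tensor scales,
# engine workspace, and layers left unquantized (a rough attribution).
print(bf16_gb, fp8_gb)  # 140.0 70.0
```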
Accuracy on MMLU (5-shot):
- BF16: 82.1%
- FP8 W+A (calibrated): 81.8% (−0.3%)
- INT8 W-only: 81.6% (−0.5%)
The critical insight: the calibration dataset matters more than its size. Using domain-matched data (code + technical text) instead of generic web text reduced the accuracy loss on coding benchmarks from −1.8% to −0.3%.
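Nothing fancy is needed for the domain-matched mix; a minimal sketch (the `build_calib_set` helper and the 50/50 ratio are illustrative, not my exact pipeline):

```python
import random

def build_calib_set(code_samples, text_samples, n=512, code_frac=0.5, seed=0):
    """Mix code and technical text so the calibration distribution
    matches the serving workload (the ratio is an assumption)."""
    rng = random.Random(seed)
    n_code = int(n * code_frac)
    calib = rng.sample(code_samples, n_code) + rng.sample(text_samples, n - n_code)
    rng.shuffle(calib)  # avoid ordering bias in batched amax collection
    return calib
```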
## Commands Used
**Step 1: Build the FP8-calibrated checkpoint**

```bash
python quantization/quantize.py \
    --model_dir meta-llama/Meta-Llama-3-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./llama3-70b-fp8 \
    --calib_size 512 \
    --calib_dataset pile-val
```
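Under the hood, amax-style per-tensor calibration reduces to a simple scale computation. A simulated sketch, not the TensorRT-LLM internals (448 is the largest finite E4M3 magnitude):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_scale(calib_activations: np.ndarray) -> float:
    """Amax calibration: map the observed absolute maximum onto the
    E4M3 range. Real calibrators aggregate amax over many batches."""
    amax = float(np.abs(calib_activations).max())
    return amax / E4M3_MAX

def quantize_sim(x: np.ndarray, scale: float) -> np.ndarray:
    # Simulated FP8 quantize: scale into range and clamp.
    # (Real kernels also round to the nearest representable E4M3 value.)
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX)

acts = np.array([-4.48, 0.1, 2.0, 3.9])
scale = per_tensor_scale(acts)  # 4.48 / 448 = 0.01
```

An outlier in the calibration batch inflates `amax` and crushes the resolution for everything else, which is exactly why the calibration distribution matters.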
**Step 2: Build the TensorRT-LLM engine**

```bash
trtllm-build \
    --checkpoint_dir ./llama3-70b-fp8 \
    --output_dir ./engines/llama3-70b-fp8 \
    --gemm_plugin fp8 \
    --strongly_typed \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 1024 \
    --tp_size 4
```
**Step 3: Benchmark**

```bash
python benchmarks/benchmark.py \
    --engine_dir ./engines/llama3-70b-fp8 \
    --dataset cnn_dailymail \
    --batch_size 32 \
    --input_output_len 2048,512 \
    --warm_up 5 \
    --num_runs 20
```
**Step 4: Validate accuracy**

```bash
python evaluate/run_mmlu.py \
    --engine_dir ./engines/llama3-70b-fp8 \
    --num_fewshot 5 \
    --output_path ./results/fp8-mmlu.json
```
## Lessons Learned
- Use the `--strongly_typed` flag — without it, TensorRT silently falls back to FP16 for some layers, reducing the throughput gains by ~30%.
- KV cache FP8 is essential — enabling `--kv_cache_dtype fp8` alongside weight FP8 delivers the bulk of the memory savings; skipping it cuts the memory reduction roughly in half.
- Attention layers are sensitive — if accuracy drops by more than 1%, try excluding them from FP8 via `--exclude_modules="attention"`. Usually not needed with good calibration, but worth knowing.
- Profile with Nsight before claiming success — SM efficiency should be above 75% for FP8 to deliver the expected gains; anything lower means the workload is memory-bound and the FP8 compute advantage isn't being used.
```bash
# Quick SM efficiency check (GPU metrics need elevated perf-counter access)
nsys profile --trace=cuda,nvtx --gpu-metrics-device=all \
    python run_benchmark.py --engine_dir ./engines/llama3-70b-fp8
nsys-ui report1.nsys-rep
```
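A quick way to reason about that 75% threshold is machine balance: peak FP8 FLOPs divided by memory bandwidth. The spec numbers below are approximate H100 SXM5 figures, an assumption to check against your part:

```python
PEAK_FP8_FLOPS = 1979e12  # ~dense FP8 throughput on H100 SXM5 (approx.)
HBM_BW_BYTES = 3.35e12    # ~HBM3 bandwidth in bytes/s (approx.)

balance = PEAK_FP8_FLOPS / HBM_BW_BYTES  # ~590 FLOP/byte

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    """Kernels below the balance point are memory-bound: faster FP8
    math cannot speed them up, matching the low-SM-efficiency symptom."""
    return flops / bytes_moved >= balance

# Decode-phase GEMV at batch 1: ~2 FLOPs per FP8 weight byte -> memory-bound.
print(is_compute_bound(flops=2.0, bytes_moved=1.0))  # False
```

This is why batching matters so much for FP8: larger batches raise FLOPs per weight byte moved, pushing GEMMs toward the compute-bound regime where the FP8 units pay off.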
## Next Steps

- Test FP8 on Mixtral 8×22B to compare MoE routing overhead impact
- Evaluate `fp8_rowwise` vs. `fp8_per_tensor` activation scaling on longer sequences
- Integrate the H100 Transformer Engine (TE) native FP8 path as an alternative to TRT-LLM calibration