Tags: gpu · AWQ · FP8 · quantization · Llama 3 70B · inference throughput

AWQ vs. FP8 Quantization: Balancing Accuracy and Throughput for Llama 3 70B

Comparing AWQ INT4 and FP8 quantization methods on the Llama 3 70B model across accuracy benchmarks (MMLU, HumanEval, GSM8K, MT-Bench) and inference throughput measurements on an H100 GPU.

April 20, 2026 · 1 min read

Summary

This post compares AWQ INT4 and FP8 quantization of Llama 3 70B on accuracy benchmarks (MMLU, HumanEval, GSM8K, MT-Bench) and measures inference throughput on a single H100 GPU, evaluating the memory footprint and speed trade-offs of each technique.

Model size:

  • FP16 = 140 GB, FP8 = 70 GB, AWQ INT4 = 38 GB

Throughput (H100):

  • FP16 = 480 tok/s, FP8 = 920 tok/s, AWQ = 1040 tok/s
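The model-size numbers follow directly from bits-per-parameter arithmetic. A back-of-envelope sketch (the per-group overhead figures are assumptions: one FP16 scale and one INT4 zero point per group of 128 weights; real AWQ checkpoints also keep embeddings and norms in FP16, which is why the measured checkpoint lands at ~38 GB rather than ~36 GB):

```python
# Rough weight-memory estimate for a 70B-parameter model at each precision.
PARAMS = 70e9

def weight_gb(bits_per_param: float) -> float:
    """Convert an effective bits-per-parameter figure to gigabytes."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16_gb = weight_gb(16)                    # 140.0 GB
fp8_gb  = weight_gb(8)                     #  70.0 GB
# AWQ INT4, group_size=128: 4 weight bits plus (16-bit scale + 4-bit
# zero point) amortized over each 128-weight group ~= 4.16 bits/param.
awq_gb  = weight_gb(4 + (16 + 4) / 128)    # ~36.4 GB before FP16 layers

print(f"FP16 {fp16_gb:.1f} GB | FP8 {fp8_gb:.1f} GB | AWQ ~{awq_gb:.1f} GB")
```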

Commands Used

python autoawq_quantize.py --model llama3-70b --bits 4 --group_size 128 --output ./llama3-70b-awq
python modelopt_quantize.py --model llama3-70b --dtype fp8 --calib_size 512 --output ./llama3-70b-fp8
lm_eval --model hf --model_args pretrained=./llama3-70b-fp8 --tasks mmlu,gsm8k --device cuda
python mt_bench_eval.py --model ./llama3-70b-awq --judge gpt-4
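For intuition about what the AWQ command above produces, here is a toy, self-contained sketch of asymmetric per-group INT4 quantization with `group_size=128` (the same storage layout AWQ uses). It is illustrative only: real AWQ also searches for per-channel activation-aware scales before quantizing, which this sketch omits.

```python
import numpy as np

def quantize_int4_groups(w: np.ndarray, group_size: int = 128):
    """Asymmetric per-group INT4 quantization of a weight matrix.

    w: (out_features, in_features), in_features divisible by group_size.
    Returns uint8 codes in [0, 15] plus per-group scales and zero points.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4-bit codes span 0..15
    scale = np.where(scale == 0, 1e-8, scale) # guard constant groups
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4_groups(q, scale, zero, shape):
    """Reconstruct FP32 weights from codes, scales, and zero points."""
    return ((q.astype(np.float32) - zero) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)).astype(np.float32)
q, scale, zero = quantize_int4_groups(w)
w_hat = dequantize_int4_groups(q, scale, zero, w.shape)
```

The round-trip error per element is bounded by roughly one quantization step of its group, which is why small `group_size` values trade a little extra metadata for tighter error.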

Lessons Learned

  • FP8 is the better choice when the GPU supports it (Hopper or newer): near-lossless accuracy at 2x compression.
  • AWQ INT4 is best for memory-constrained deployment (38 GB fits on a single A100 40GB).
  • Never choose quantization method without evaluating on your target task distribution.
  • Mixed precision (FP8 weights, FP16 activations) is the safest starting point.
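To make the "FP8 weights, FP16 activations" point concrete, here is a toy simulation of per-tensor FP8 (e4m3) weight rounding. Everything below is an assumption-laden sketch: it emulates the e4m3 grid in software (3 mantissa bits, max normal value 448) rather than using hardware FP8 kernels like those in TensorRT-LLM or ModelOpt.

```python
import numpy as np

E4M3_MAX = 448.0  # largest normal value in float8 e4m3

def fp8_e4m3_round(x: np.ndarray) -> np.ndarray:
    """Round values to the e4m3 grid (software emulation, no NaN/inf)."""
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    a = np.minimum(np.abs(x), E4M3_MAX)        # saturate at max normal
    e = np.floor(np.log2(np.maximum(a, 2.0**-9)))
    e = np.clip(e, -6, 8)                      # normal + subnormal range
    step = 2.0 ** (e - 3)                      # 3 mantissa bits
    return sign * np.round(a / step) * step

def fp8_weight_quant(w: np.ndarray):
    """Per-tensor scaled FP8 weight quantization; activations stay FP16."""
    scale = E4M3_MAX / np.max(np.abs(w))
    w_fp8 = fp8_e4m3_round(w * scale)
    return w_fp8 / scale                       # dequantized view

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64))
w_hat = fp8_weight_quant(w)
rel_err = np.max(np.abs(w - w_hat)) / np.max(np.abs(w))
```

With 3 mantissa bits the relative rounding error stays around 2^-4 per value, which is why FP8 weight-only quantization is typically near-lossless on downstream benchmarks.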

Next Steps

  • Test GPTQ vs AWQ at INT4 on same benchmarks for completeness
  • Evaluate SmoothQuant as a middle ground between FP8 and INT4