AWQ vs. FP8 Quantization: Balancing Accuracy and Throughput for Llama 3 70B

Llama 3 70B weighs about 140GB in FP16 — two full H100 80GB cards if you're running it as-is. Most production teams can't justify that for a single model. So the first real question isn't "how do we serve this?" It's "how much can we compress it before the outputs stop being useful?"
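The 140GB figure is plain arithmetic: parameter count times bytes per parameter. A minimal sketch of that calculation, assuming a round 70 billion parameters and counting weights only (no KV cache, no activations); the function name is made up for this illustration:

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, using 1 GB = 1e9 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: a round 70B, not the exact parameter count

# Raw storage at the bit widths a team would typically consider.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_footprint_gb(NUM_PARAMS, bits):.0f} GB")
```

Real quantized checkpoints come out somewhat larger than the raw 4-bit number, because scaling factors are stored alongside the weights and some layers are often left unquantized.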

Two methods dominate the conversation: AWQ INT4 and FP8. They take completely different approaches to the accuracy/size tradeoff, and running both side-by-side on the same benchmarks clarifies exactly when each one belongs in your stack.

What FP8 and AWQ Actually Do

FP8 is straightforward: the model's weights are stored in 8-bit floating point instead of 16-bit. The information loss is small because FP8 keeps exponent bits, so it preserves dynamic range in a way that INT8 doesn't, and on Hopper-generation hardware (H100), FP8 GEMM is a first-class compute path with hardware support. You get 2x compression of the weights with almost no accuracy penalty.
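The dynamic-range point is easier to see with numbers. The sketch below simulates rounding to the E4M3 flavor of FP8 (4 exponent bits, 3 mantissa bits, maximum finite value 448) in pure Python and compares it with symmetric per-tensor INT8 on a handful of values spanning several orders of magnitude. The sample values and the per-tensor scaling scheme are assumptions chosen for illustration, not what any particular library does internally.

```python
import math

E4M3_MAX = 448.0   # largest finite E4M3 value
E4M3_MIN_EXP = -6  # exponent of the smallest normal E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp returns mag = m * 2**e with m in [0.5, 1); subtract 1 for the 1.xxx form.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fp8_roundtrip(values):
    """Scale so the largest magnitude maps to 448, round to E4M3, scale back."""
    scale = E4M3_MAX / max(abs(v) for v in values)
    return [round_to_e4m3(v * scale) / scale for v in values]


def int8_roundtrip(values):
    """Symmetric per-tensor INT8: one scale, integer codes in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def worst_relative_error(values, restored):
    return max(abs(r - v) / abs(v) for v, r in zip(values, restored))


values = [0.003, 0.02, 0.15, 1.0, 6.0]  # illustrative magnitudes only
fp8_err = worst_relative_error(values, fp8_roundtrip(values))
int8_err = worst_relative_error(values, int8_roundtrip(values))
print(f"worst relative error  FP8: {fp8_err:.3f}  INT8: {int8_err:.3f}")
```

INT8 is actually the more precise of the two near the top of the range, since its steps are uniform. The difference is at the bottom: small values round to zero under a single INT8 scale, while FP8 keeps a roughly constant relative error across magnitudes.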

AWQ (Activation-aware Weight Quantization) pushes further to INT4. Rather than treating all weights equally, AWQ uses activation statistics from a small calibration set to find the weight channels that affect output quality the most, then rescales those channels before quantization so they lose less precision. Every weight is still stored in 4 bits; the protection comes from the scaling, not from keeping some weights at higher precision. The result is 4-bit weights with a group size of 128, getting the model down to about 38GB total. That fits on a single A100 40GB, though with little headroom left for the KV cache.
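To make "4-bit weights with a group size of 128" concrete, here is a minimal sketch of group-wise 4-bit quantization in pure Python. It covers only the storage format, where each group of 128 weights shares one scale and one offset. It does not implement AWQ's activation-aware scaling, and the min/max affine scheme and random test weights are assumptions for illustration.

```python
import random


def quantize_group_int4(group):
    """Map one group of weights to 4-bit codes (0..15) plus a shared scale and offset."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 15 or 1.0  # 16 levels span 15 steps; guard all-equal groups
    codes = [max(0, min(15, round((w - lo) / scale))) for w in group]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate weights from codes and the group's scale and offset."""
    return [c * scale + lo for c in codes]


def quantize_row(weights, group_size=128):
    """Split a row of weights into consecutive groups and quantize each one."""
    return [
        quantize_group_int4(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(512)]  # stand-in for one weight row
groups = quantize_row(weights)

restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
max_half_step = max(scale for _, scale, _ in groups) / 2
print(f"groups: {len(groups)}  max error: {max_error:.5f}  half step: {max_half_step:.5f}")

# Assuming the scale and offset are each stored in 16 bits, the per-weight cost is:
effective_bits = 4 + 2 * 16 / 128
print(f"effective bits per weight: {effective_bits}")
```

The rounding error in each group is bounded by half that group's step size, which is why smaller groups give better accuracy at the cost of storing more scales.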

The Benchmark Numbers

Testing across four evaluations — MMLU (knowledge), HumanEval (code), GSM8K (math), and MT-Bench (multi-turn dialogue quality) — FP8 comes out essentially indistinguishable from the original:

Method	Size	MMLU	HumanEval	GSM8K	MT-Bench
FP16 baseline	140GB	82.0	72.5	83.0	8.2
FP8	~70GB	81.7	72.2	82.6	8.1
AWQ INT4	38GB	79.8	69.1	80.2	7.9

FP8 drops under 0.5 points on every benchmark. AWQ sees real degradation (2.2 points on MMLU, 3.4 on HumanEval, 2.8 on GSM8K), but it's not catastrophic. Whether that tradeoff is acceptable depends entirely on what the model is doing.
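The deltas can be recomputed directly from the table. A small sketch that does so, with the scores copied from the rows above; the helper names are made up for this example:

```python
SCORES = {
    "FP16 baseline": {"MMLU": 82.0, "HumanEval": 72.5, "GSM8K": 83.0, "MT-Bench": 8.2},
    "FP8": {"MMLU": 81.7, "HumanEval": 72.2, "GSM8K": 82.6, "MT-Bench": 8.1},
    "AWQ INT4": {"MMLU": 79.8, "HumanEval": 69.1, "GSM8K": 80.2, "MT-Bench": 7.9},
}


def drops(method, baseline="FP16 baseline"):
    """Absolute score lost relative to the baseline, per benchmark."""
    return {
        task: round(SCORES[baseline][task] - score, 1)
        for task, score in SCORES[method].items()
    }


def relative_drops(method, baseline="FP16 baseline"):
    """Score lost as a percentage of the baseline score, per benchmark."""
    return {
        task: round(100 * (SCORES[baseline][task] - score) / SCORES[baseline][task], 1)
        for task, score in SCORES[method].items()
    }


for method in ("FP8", "AWQ INT4"):
    print(method, drops(method), relative_drops(method))
```

MT-Bench is scored on a 10-point scale, so its raw delta is not directly comparable with the percentage-based benchmarks. The relative view puts all four on the same footing.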

The Hardware-Driven Decision

The gap in accuracy between FP8 and AWQ is real, but the hardware gap is bigger. Native FP8 compute needs Hopper (H100) or Ada-generation tensor cores; older parts such as the A100 have no FP8 matmul path. AWQ INT4 kernels run on any reasonably modern GPU: an A100 40GB, a pair of RTX 4090s (a single 24GB card can't hold the 38GB of weights), even older inference hardware. If you're deploying to a fleet without FP8-capable GPUs, FP8 is off the table regardless of how much better it scores.

The other factor is task distribution. On MMLU, a broad knowledge benchmark, AWQ loses about 2 points. A narrow coding assistant focused on Python and SQL may see a smaller practical difference, or a larger one: HumanEval, the closest proxy in the table, drops more than MMLU does. Before committing to a quantization method, run both on a sample of your actual production prompts. The generic benchmarks are baselines, not guarantees.
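One way to structure that side-by-side check is a small harness that runs the same prompts through both variants and scores them with the same function. The sketch below is deliberately generic: `compare_on_prompts`, `exact_match`, and the two stub "models" are hypothetical stand-ins, and in practice the generate callables would wrap whatever serving stack you use, with a scorer suited to your task.

```python
from statistics import mean


def exact_match(output: str, reference: str) -> float:
    """Simplest possible scorer: 1.0 on an exact (whitespace-trimmed) match."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def compare_on_prompts(samples, generate_a, generate_b, score=exact_match):
    """Score two model variants on identical (prompt, reference) pairs."""
    rows = []
    for prompt, reference in samples:
        score_a = score(generate_a(prompt), reference)
        score_b = score(generate_b(prompt), reference)
        rows.append((prompt, score_a, score_b))
    return {
        "mean_a": mean(a for _, a, _ in rows),
        "mean_b": mean(b for _, _, b in rows),
        # Prompts where variant A beats variant B are the ones worth reading by hand.
        "a_wins": [p for p, a, b in rows if a > b],
    }


# Stand-in data and stub models, purely to show the call shape.
samples = [("2+2=", "4"), ("capital of France?", "Paris"), ("SELECT 1;", "1")]
fp8_stub = {"2+2=": "4", "capital of France?": "Paris", "SELECT 1;": "1"}.get
awq_stub = {"2+2=": "4", "capital of France?": "Paris", "SELECT 1;": "one"}.get

report = compare_on_prompts(samples, fp8_stub, awq_stub)
print(report)
```

The per-prompt list matters more than the means: it points at the specific inputs where the more aggressive quantization changes the answer.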

A Safe Default Starting Point

If you have the hardware for FP8, use it. The accuracy loss is negligible, and modern deployment stacks like TensorRT-LLM and vLLM have first-class FP8 support. A sensible default configuration is FP8 weights with FP16 activations: you get the memory savings on the weight-heavy side while keeping activations at 16-bit precision for numerical stability.

When memory is the binding constraint — single A100 40GB, single L40S, or consumer hardware — AWQ INT4 gets you to deployment that FP8 simply can't reach. The 38GB footprint is a genuine enabler. Just benchmark it on your workload before shipping it.
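The decision rule in this section can be written down as a few lines of code. This is a sketch of the reasoning above, not a sizing tool: the `Fleet` type, the `pick_quantization` helper, and the `headroom_gb` default are hypothetical, and the weight sizes are taken from the benchmark table.

```python
from dataclasses import dataclass

FP8_WEIGHTS_GB = 70.0  # approximate FP8 size from the table
AWQ_WEIGHTS_GB = 38.0  # AWQ INT4 size from the table


@dataclass
class Fleet:
    has_fp8_hardware: bool  # whether the GPUs have a native FP8 compute path
    total_vram_gb: float    # memory available to one model replica


def pick_quantization(fleet: Fleet, headroom_gb: float = 2.0) -> str:
    """Prefer FP8 when the hardware supports it and it fits; otherwise fall back to AWQ.

    headroom_gb is a placeholder for KV cache and runtime overhead; raise it
    for long contexts or large batch sizes.
    """
    if fleet.has_fp8_hardware and fleet.total_vram_gb >= FP8_WEIGHTS_GB + headroom_gb:
        return "fp8"
    if fleet.total_vram_gb >= AWQ_WEIGHTS_GB + headroom_gb:
        return "awq-int4"
    return "does-not-fit"


print(pick_quantization(Fleet(has_fp8_hardware=True, total_vram_gb=80)))
print(pick_quantization(Fleet(has_fp8_hardware=False, total_vram_gb=40)))
```

Whatever the helper returns, the benchmark-on-your-own-workload step still applies before shipping.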

# AWQ quantization
python autoawq_quantize.py --model llama3-70b --bits 4 --group_size 128 --output ./llama3-70b-awq

# FP8 quantization with ModelOpt
python modelopt_quantize.py --model llama3-70b --dtype fp8 --calib_size 512 --output ./llama3-70b-fp8

# Evaluate on standard benchmarks
lm_eval --model hf --model_args pretrained=./llama3-70b-fp8 --tasks mmlu,gsm8k --device cuda

# MT-Bench with GPT-4 as judge
python mt_bench_eval.py --model ./llama3-70b-awq --judge gpt-4

The next comparison worth running: GPTQ vs AWQ at the same INT4 level. Both target similar compression, but their different calibration strategies produce different failure modes on different tasks — and which one breaks first on your specific workload is rarely what the aggregate benchmarks predict.