Commands Used
python autoawq_quantize.py --model llama3-70b --bits 4 --group_size 128 --output ./llama3-70b-awq
python modelopt_quantize.py --model llama3-70b --dtype fp8 --calib_size 512 --output ./llama3-70b-fp8
lm_eval --model hf --model_args pretrained=./llama3-70b-fp8 --tasks mmlu,gsm8k --device cuda
python mt_bench_eval.py --model ./llama3-70b-awq --judge gpt-4
Lessons Learned
- FP8 is the better choice when the GPU supports it: near-lossless accuracy at 2x compression relative to FP16.
- AWQ INT4 is the best option for memory-constrained deployment (38GB fits on a single A100 40GB).
- Never choose a quantization method without evaluating it on your target task distribution.
- Mixed precision (FP8 weights, FP16 activations) is the safest starting point.
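The compression figures above can be sanity-checked with a back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the round 70e9 parameter count, and the per-group overhead (one FP16 scale plus one 4-bit zero point per group of 128) are assumptions, not values taken from the quantization scripts. It also assumes every parameter is quantized, whereas real checkpoints usually keep some layers (such as embeddings) in higher precision, so actual files come out somewhat larger.

```python
def quantized_weight_gb(n_params, bits, group_size=None, scale_bits=16, zero_bits=0):
    """Rough weight-storage estimate in GB (1 GB = 1e9 bytes).

    Hypothetical helper: assumes all parameters are stored at `bits` bits and,
    for group-wise schemes, that each group adds one scale and one zero point.
    """
    total_bits = n_params * bits
    if group_size:
        n_groups = n_params / group_size
        total_bits += n_groups * (scale_bits + zero_bits)
    return total_bits / 8 / 1e9


N_PARAMS = 70e9  # assumed round parameter count for a 70B model

fp16_gb = quantized_weight_gb(N_PARAMS, 16)
fp8_gb = quantized_weight_gb(N_PARAMS, 8)
# Assumed INT4 layout: group size 128, FP16 scale, 4-bit zero point per group.
int4_gb = quantized_weight_gb(N_PARAMS, 4, group_size=128, scale_bits=16, zero_bits=4)

for name, gb in [("FP16", fp16_gb), ("FP8", fp8_gb), ("INT4 g128", int4_gb)]:
    print(f"{name:>10}: {gb:6.1f} GB")
```

Under these assumptions FP8 halves the FP16 footprint, and INT4 with group size 128 lands at roughly 36 GB before any layers left unquantized are counted, which is in line with the 38GB figure noted above. The estimate covers weights only; KV cache and activation memory need separate headroom.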
Next Steps
- Test GPTQ vs. AWQ at INT4 on the same benchmarks for completeness.
- Evaluate SmoothQuant (INT8 weights and activations) as a middle ground between FP8 and INT4.