# LLM Inference Optimization: From 300 to 3,310 Tokens/sec with FP8
## Summary
The author achieved a roughly 11x improvement in LLM inference throughput, from 300 to 3,310 tokens/sec, serving the DeepSeek-Coder-33B model on Blackwell-class GPUs without changing hardware. The key optimizations were switching from FP16 to FP8 precision and raising the batch size limit.
## Key Technical Findings
- Decode dominates production LLM inference with long outputs.
- FP8 quantization significantly increased throughput and reduced latency by shifting compute to FP8 tensor cores.
- Proper batching configuration is crucial for scaling under load; increasing `max_batch_size` resolved a bottleneck in request handling.
## Commands Used
- Switched model precision from FP16 to FP8 using TensorRT-LLM.
- Increased `max_batch_size` from 32 to 64.
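The two steps above map onto TensorRT-LLM's standard quantize-then-build flow. This is a hedged sketch, not the author's exact invocations: the directory paths are placeholders, and flag details may differ across TensorRT-LLM versions.

```shell
# Sketch of the FP8 conversion and rebuild; paths are hypothetical.

# 1. Quantize the FP16 checkpoint to FP8 (the quantize.py script ships
#    in the TensorRT-LLM repo under examples/quantization/).
python examples/quantization/quantize.py \
    --model_dir ./deepseek-coder-33b \
    --qformat fp8 \
    --output_dir ./ckpt-fp8

# 2. Rebuild the engine with the larger batch limit.
trtllm-build \
    --checkpoint_dir ./ckpt-fp8 \
    --max_batch_size 64 \
    --output_dir ./engine-fp8
```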
## Lessons Learned
- Profiling is essential for identifying bottlenecks and optimizing LLM inference.
- Prefill optimization has minimal impact on decode-heavy workloads.
- The "knee" in the concurrency-throughput curve marks peak efficiency; beyond it, added concurrency raises latency faster than it raises throughput.
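The knee can be located programmatically from a concurrency sweep. The sketch below uses a simplified Kneedle-style heuristic (the point farthest from the chord joining the first and last measurements); the sweep numbers are illustrative, not the author's measurements.

```python
def find_knee(concurrency, throughput):
    """Return the concurrency level farthest from the straight line
    joining the first and last measurements (simplified Kneedle)."""
    x0, y0 = concurrency[0], throughput[0]
    x1, y1 = concurrency[-1], throughput[-1]
    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(concurrency, throughput)):
        # Unnormalized distance from (x, y) to the chord; the ranking
        # is all that matters, so the constant divisor is dropped.
        d = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return concurrency[best_i]

# Hypothetical sweep: throughput saturates past concurrency 32.
levels = [1, 2, 4, 8, 16, 32, 64, 128]
tput = [120, 235, 450, 850, 1600, 2900, 3200, 3310]
print(find_knee(levels, tput))  # → 32
```

Past the knee (32 here), each doubling of concurrency buys only a few percent more throughput while queueing delay keeps growing.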
## Publishability
yes
## Privacy Risk
low
## Tags
- inference
- gpu-optimization
- fp8
- tensorrt-llm
- blackwell
- deepseek
- benchmarking
## Action Items
- Implement FP8 precision for LLM models on compatible hardware.
- Review and adjust batching configurations in production systems.