
DGX GB10: Blackwell FP8 Optimization Journey



LLM Inference Optimization: From 300 to 3,310 Tokens/sec with FP8

Summary

The author achieved an 11x improvement in LLM inference throughput, from 300 to 3,310 tokens/sec, serving the DeepSeek-Coder-33B model on Blackwell-class GPUs without any hardware changes. The key optimizations were switching from FP16 to FP8 precision and raising the maximum batch size.

Key Technical Findings

  • Decode dominates production LLM inference with long outputs.
  • FP8 quantization significantly increased throughput and reduced latency by shifting compute to FP8 tensor cores.
  • Proper batching configuration is crucial for scaling under load; increasing max_batch_size resolved a bottleneck in request handling.
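The first finding — that decode dominates with long outputs — follows from a rough latency model: prefill processes the whole prompt in one parallel pass, while decode emits output tokens one at a time. The sketch below illustrates this; the per-token timings are hypothetical placeholders, not measurements from the post.

```python
# Rough request-latency model: total = prefill + decode.
# Prefill cost scales with prompt length (processed in parallel);
# decode cost scales with output length (generated sequentially).
def request_latency_ms(prompt_tokens, output_tokens,
                       prefill_ms_per_1k=50.0,   # hypothetical
                       decode_ms_per_token=15.0):  # hypothetical
    prefill = prefill_ms_per_1k * prompt_tokens / 1000.0
    decode = decode_ms_per_token * output_tokens
    return prefill, decode

prefill, decode = request_latency_ms(prompt_tokens=2000, output_tokens=800)
decode_share = decode / (prefill + decode)
print(f"prefill={prefill:.0f} ms, decode={decode:.0f} ms, "
      f"decode share={decode_share:.0%}")
```

With an 800-token output, decode accounts for well over 90% of request latency under this model, which is why the post found prefill optimization had little effect on end-to-end numbers.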

Commands Used

  • Switched model precision from FP16 to FP8 using TensorRT-LLM.
  • Increased `max_batch_size` from 32 to 64.
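Why the `max_batch_size` change mattered can be seen with a toy queue model: if requests arrive faster than the scheduler's batch cap can serve them, the queue grows without bound, and raising the cap restores stability. This is an illustrative sketch, not the actual TensorRT-LLM scheduler, and the arrival rate is a made-up number.

```python
# Toy scheduler model: each step admits `arrivals_per_step` new
# requests and serves at most `max_batch_size` of the queue.
def steady_queue_depth(arrivals_per_step, max_batch_size, steps=100):
    queue = 0
    for _ in range(steps):
        queue += arrivals_per_step
        queue -= min(queue, max_batch_size)
    return queue

print(steady_queue_depth(48, 32))  # → 1600: cap too low, backlog grows 16/step
print(steady_queue_depth(48, 64))  # → 0: cap above arrival rate, queue drains
```

The same qualitative behavior showed up in the post as a request-handling bottleneck that disappeared once `max_batch_size` was doubled.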

Lessons Learned

  • Profiling is essential for identifying bottlenecks and optimizing LLM inference.
  • Prefill optimization has minimal impact on decode-heavy workloads.
  • The "knee" in the concurrency throughput curve indicates optimal efficiency.

Publishability

yes

Privacy Risk

low

Tags

  • inference
  • gpu-optimization
  • fp8
  • tensorrt-llm
  • blackwell
  • deepseek
  • benchmarking

Action Items

  • Implement FP8 precision for LLM models on compatible hardware.
  • Review and adjust batching configurations in production systems.