Tags: gpu, TensorRT-LLM, FP8 quantization, NVIDIA DGX Spark, GPU optimization, inference scaling

End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell

How end-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell went from a 300 tok/s baseline to 3,000+ tokens/sec, phase by phase.

April 20, 2026·2 min read

Summary

End-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell took throughput from a 300 tok/s FP16 baseline to 3,000+ tokens/sec. The work proceeded in phases: stabilizing the baseline, enabling FlashAttention FMHA (no gain, since decode dominated), switching GEMM layers to FP8 quantization, raising max_batch_size, and running concurrency scaling sweeps.

What I Did

- Achieved a 300 tok/s baseline with FP16
- Enabled FlashAttention FMHA but saw no gain, as decode dominated
- Switched to FP8 quantization on GEMM layers, increasing speed from 300 to 500 tok/s and reducing latency from 7 s to 4 s
- Increased max_batch_size to 64, unlocking 1493 tok/s at concurrency 64
- Ran a scaling sweep with batch=192, reaching 3030 tok/s at concurrency 160; the optimal range is 128-160 for throughput/latency balance
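The throughput figures above are aggregate numbers: total generated tokens divided by wall-clock time across all concurrent requests. A minimal sketch of that calculation (the helper name and the ~21.9 s wall time are illustrative assumptions, back-solved from the 512-token, concurrency-64 run):

```python
def aggregate_tokens_per_sec(token_counts, wall_seconds):
    """Aggregate throughput: total generated tokens / wall-clock time.

    token_counts: output-token count per completed request.
    wall_seconds: wall-clock duration of the whole batch of requests.
    """
    return sum(token_counts) / wall_seconds

# 64 concurrent requests x 512 output tokens in ~21.9 s of wall time
# lands near the ~1493 tok/s observed at concurrency 64.
print(round(aggregate_tokens_per_sec([512] * 64, 21.9)))  # → 1496
```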

Key Technical Findings

- FP8 GEMM is the single biggest lever on Blackwell, not attention, not batching alone
- Batch size must be scaled together with concurrency or the GPU idles
- Optimal operating point is 128-160 concurrency: 3000+ tok/s with P95 under 7 s
- Always profile decode separately from prefill; they have different bottlenecks
- P95 flat near P50 means the system is stable and not overloaded
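"Profile decode separately from prefill" comes down to splitting each request's timeline at the first token: prefill cost shows up as time-to-first-token (TTFT), while decode cost is the per-token rate after that. A minimal sketch of the split, assuming you have per-request timestamps (the function name and the example timings are hypothetical, not from the post's measurements):

```python
def split_prefill_decode(request_start, first_token_time, end_time, n_tokens):
    """Split one request's latency into prefill (TTFT) and decode rate.

    Prefill is compute-bound on the prompt; decode is a different
    bottleneck (memory-bandwidth-bound token-by-token generation),
    so averaging them together hides which phase limits throughput.
    """
    ttft = first_token_time - request_start                      # prefill latency (s)
    decode_tps = (n_tokens - 1) / (end_time - first_token_time)  # tokens/s after first
    return ttft, decode_tps

# Hypothetical request: first token after 0.3 s, then 511 more tokens over 3.7 s.
ttft, decode_tps = split_prefill_decode(0.0, 0.3, 4.0, 512)
print(ttft, round(decode_tps, 1))  # → 0.3 138.1
```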

Commands

```bash
# Build the FP8 engine
trtllm-build --checkpoint_dir ./model-fp16 --output_dir ./model-fp8 \
  --gemm_plugin fp8 --max_batch_size 192

# Serve the model
tritonserver --model-repository=/models --http-port 8000

# Benchmark at concurrency 160
genai-perf -m ensemble --service-kind triton --backend tensorrtllm \
  --num-prompts 100 --concurrency 160 --output-tokens 512

# Profile with Nsight Systems
nsys profile -w true -t cuda,nvtx --capture-range=cudaProfilerApi \
  -o blackwell_fp8 python benchmark.py
```
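Sweeping concurrency means re-running the genai-perf invocation above with a different `--concurrency` value per point. A small sketch that builds (but does not execute) each sweep command; the helper name is my own, and the flags simply mirror the command shown above:

```python
def genai_perf_cmd(concurrency, num_prompts=100, output_tokens=512):
    """Build the genai-perf argv for one sweep point (not executed here)."""
    return [
        "genai-perf", "-m", "ensemble",
        "--service-kind", "triton",
        "--backend", "tensorrtllm",
        "--num-prompts", str(num_prompts),
        "--concurrency", str(concurrency),
        "--output-tokens", str(output_tokens),
    ]

# One command per sweep point, e.g. for subprocess.run(...) in a driver script.
for c in (64, 128, 160, 192):
    print(" ".join(genai_perf_cmd(c)))
```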


Next Steps

- Automate the concurrency sweep with genai-perf and plot the throughput/latency curve
- Test continuous batching with in-flight sequence padding to push past 3310 tok/s
- Add a Grafana dashboard: tokens/sec, P95 latency, GPU utilization, KV cache pressure
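Once the sweep is automated, picking the operating point is a small optimization: the highest-throughput concurrency whose P95 latency stays under budget. A sketch of that selection step, assuming sweep results as `(concurrency, tok/s, p95_latency_s)` rows; the numbers below are illustrative, loosely shaped like the sweep in this post:

```python
def pick_operating_point(sweep, p95_budget_s=7.0):
    """From (concurrency, tok/s, p95_s) rows, return the highest-throughput
    point whose P95 latency stays under budget, or None if none qualify."""
    feasible = [row for row in sweep if row[2] <= p95_budget_s]
    return max(feasible, key=lambda row: row[1]) if feasible else None

sweep = [
    (64,  1493, 4.1),
    (128, 2700, 5.8),
    (160, 3030, 6.9),
    (192, 3100, 8.4),  # over the 7 s P95 budget, so excluded
]
print(pick_operating_point(sweep))  # → (160, 3030, 6.9)
```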