Summary
End-to-end TensorRT-LLM FP8 inference optimization on an NVIDIA GB10 Blackwell GPU, taken from a 300 tok/s FP16 baseline to over 3,000 tok/s. The work progressed in phases: stabilizing the baseline, enabling FlashAttention FMHA (no gain), switching to FP8 quantization, raising max_batch_size, and running concurrency scaling sweeps.
What I Did
- Achieved a 300 tok/s baseline with FP16
- Enabled FlashAttention FMHA but saw no gain, as decode dominated
- Switched to FP8 quantization on GEMM layers, raising throughput from 300 to 500 tok/s and cutting latency from 7s to 4s
- Increased max_batch_size to 64, unlocking 1493 tok/s at concurrency 64
- Ran a scaling sweep with batch=192, reaching 3030 tok/s at concurrency 160; the optimal range is 128-160 for throughput/latency balance
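The FP8 step above can be sanity-checked with quick arithmetic; a minimal sketch using only the measured figures from the list (the speedup math, not a new measurement):

```python
# Figures from the FP8 quantization step above.
fp16_tps, fp8_tps = 300.0, 500.0   # tokens/sec before and after FP8
fp16_lat, fp8_lat = 7.0, 4.0       # end-to-end request latency (s)

throughput_gain = fp8_tps / fp16_tps   # how many times more tokens/sec
latency_gain = fp16_lat / fp8_lat      # how many times lower latency

print(f"throughput gain: {throughput_gain:.2f}x")  # ~1.67x
print(f"latency gain:    {latency_gain:.2f}x")     # 1.75x
```

The two ratios being close is consistent with GEMM (not attention) dominating the decode path at this stage.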
Key Technical Findings
- FP8 GEMM is the single biggest lever on Blackwell - not attention, not batching alone
- Batch size must scale together with concurrency, or the GPU idles
- The optimal operating point is 128-160 concurrency: 3000+ tok/s with P95 under 7s
- Always profile decode separately from prefill - they have different bottlenecks
- P95 staying flat near P50 means the system is stable and not overloaded
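The last finding can be made concrete as a stability check; a minimal sketch, assuming per-request decode latencies have been collected separately from prefill (the sample values and the 25% spread threshold are illustrative, not measured):

```python
import statistics

def p50_p95(latencies):
    """Return (P50, P95) of a latency sample using inclusive quantiles."""
    qs = statistics.quantiles(latencies, n=100, method="inclusive")
    return statistics.median(latencies), qs[94]  # qs[94] is the 95th percentile

# Illustrative decode latencies (seconds) at high concurrency.
decode = [5.8, 6.0, 6.1, 6.2, 6.3, 6.3, 6.4, 6.5, 6.6, 6.8]
p50, p95 = p50_p95(decode)

# "P95 flat near P50" heuristic: treat under 25% spread as stable.
stable = (p95 - p50) / p50 < 0.25
print(f"P50={p50:.2f}s P95={p95:.2f}s stable={stable}")
```

When the system tips into overload, P95 diverges from P50 first, so this check catches queueing pressure before mean throughput visibly drops.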
Commands
trtllm-build --checkpoint_dir ./model-fp16 --output_dir ./model-fp8 --gemm_plugin fp8 --max_batch_size 192
tritonserver --model-repository=/models --http-port 8000
genai-perf -m ensemble --service-kind triton --backend tensorrtllm --num-prompts 100 --concurrency 160 --output-tokens 512
nsys profile -w true -t cuda,nvtx --capture-range=cudaProfilerApi -o blackwell_fp8 python benchmark.py
Next Steps
- Automate the concurrency sweep with genai-perf and plot the throughput/latency curve
- Test continuous batching with in-flight sequence padding to push past 3310 tok/s
- Add a Grafana dashboard: tokens/sec, P95 latency, GPU utilization, KV cache pressure
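The first next step could start as a dry-run generator; a minimal sketch that builds the sweep commands from the genai-perf invocation shown in Commands (the concurrency grid here is an assumed starting point, and batch size stays at 192):

```python
# Generate the genai-perf sweep commands without executing them, so the
# grid can be reviewed (or piped to a shell) before committing to a long run.
CONCURRENCIES = [32, 64, 96, 128, 160, 192]  # assumed sweep grid

def sweep_commands(concurrencies):
    base = ("genai-perf -m ensemble --service-kind triton "
            "--backend tensorrtllm --num-prompts 100 --output-tokens 512")
    return [f"{base} --concurrency {c}" for c in concurrencies]

for cmd in sweep_commands(CONCURRENCIES):
    print(cmd)
```

Parsing each run's reported tokens/sec and P95 into a CSV would then give the throughput/latency curve directly.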