What I Did
Key Technical Findings
- FP8 GEMM is the single biggest lever on Blackwell - not attention, and not batching alone.
- Batch size must be scaled together with concurrency, or the GPU idles.
- The optimal operating point is 128-160 concurrency: 3000+ tok/s with P95 under 7s.
- Always profile decode separately from prefill - they have different bottlenecks.
- A P95 that stays flat near P50 means the system is stable and not overloaded.
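The P95-versus-P50 finding can be turned into a quick check. The following is a minimal sketch, assuming per-request latencies are available as a list of seconds; the `percentile` and `tail_ratio` helpers and the sample values are hypothetical illustrations, not part of the benchmark or measured data.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def tail_ratio(latencies_s):
    """P95 divided by P50; a value close to 1.0 suggests a flat latency tail."""
    return percentile(latencies_s, 95) / percentile(latencies_s, 50)

# Hypothetical latencies in seconds, chosen only to illustrate the two shapes.
steady = [5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9]
overloaded = [5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 9.0, 12.0, 15.0, 20.0]

print(round(tail_ratio(steady), 2))      # 1.09
print(round(tail_ratio(overloaded), 2))  # 3.7
```

A ratio that climbs as concurrency increases is one sign that requests are queueing rather than being served; what counts as "too high" is a judgment call for the workload.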
Commands
trtllm-build --checkpoint_dir ./model-fp16 --output_dir ./model-fp8 --gemm_plugin fp8 --max_batch_size 192
tritonserver --model-repository=/models --http-port 8000
genai-perf -m ensemble --service-kind triton --backend tensorrtllm --num-prompts 100 --concurrency 160 --output-tokens 512
nsys profile -w true -t cuda,nvtx --capture-range=cudaProfilerApi -o blackwell_fp8 python benchmark.py
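The relationship between `--max_batch_size 192` and `--concurrency 160` in the commands above can be reasoned about with a deliberately simplified model. This sketch assumes each concurrent client keeps exactly one request in flight and that the scheduler can place every in-flight request into the batch; real schedulers are more involved, and `batch_fill` and `queued_requests` are hypothetical helper names.

```python
def batch_fill(concurrency, max_batch_size):
    """Fraction of batch slots occupied under the one-request-per-client assumption."""
    if concurrency < 0 or max_batch_size <= 0:
        raise ValueError("concurrency must be >= 0 and max_batch_size must be > 0")
    in_flight = min(concurrency, max_batch_size)
    return in_flight / max_batch_size

def queued_requests(concurrency, max_batch_size):
    """Requests that cannot fit in the batch and have to wait."""
    return max(0, concurrency - max_batch_size)

# Values taken from the commands above, plus two extremes for contrast.
for concurrency in (32, 160, 256):
    fill = batch_fill(concurrency, 192)
    waiting = queued_requests(concurrency, 192)
    print(f"concurrency={concurrency} fill={fill:.2f} queued={waiting}")
```

Under this model, a low concurrency leaves most batch slots empty (the GPU idles), while a concurrency above the batch size only adds queueing, which is why the two are best tuned together.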
Next Steps
- Automate the concurrency sweep with genai-perf and plot the throughput/latency curve.
- Test continuous batching with in-flight sequence padding to push past 3310 tok/s.
- Add a Grafana dashboard: tokens/sec, P95 latency, GPU utilization, KV cache pressure.
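As a starting point for the concurrency sweep, the sketch below builds one genai-perf command per concurrency level and picks the highest-throughput point within a P95 budget. It reuses the flags exactly as they appear in the Commands section (they may differ between genai-perf versions), only prints the commands rather than running them, and uses hypothetical result tuples in place of real measurements; `sweep_commands` and `best_point` are illustrative names.

```python
import shlex

def sweep_commands(levels, model="ensemble", num_prompts=100, output_tokens=512):
    """Build one genai-perf argument list per concurrency level (flags copied from the Commands section)."""
    commands = []
    for concurrency in levels:
        commands.append([
            "genai-perf", "-m", model,
            "--service-kind", "triton",
            "--backend", "tensorrtllm",
            "--num-prompts", str(num_prompts),
            "--concurrency", str(concurrency),
            "--output-tokens", str(output_tokens),
        ])
    return commands

def best_point(results, p95_budget_s):
    """Pick the (concurrency, tok_per_s, p95_s) tuple with the highest throughput within the P95 budget."""
    within_budget = [r for r in results if r[2] <= p95_budget_s]
    if not within_budget:
        return None
    return max(within_budget, key=lambda r: r[1])

for cmd in sweep_commands([32, 64, 128, 160]):
    print(shlex.join(cmd))  # pass cmd to subprocess.run to actually execute it

# Hypothetical placeholder results, NOT measured data: (concurrency, tok/s, P95 seconds).
placeholder_results = [(32, 1000.0, 3.0), (64, 1800.0, 4.0), (128, 2900.0, 6.0), (256, 3000.0, 11.0)]
print(best_point(placeholder_results, p95_budget_s=7.0))
```

Keeping command construction separate from execution makes the sweep easy to dry-run, and the collected tuples can feed directly into a throughput/latency plot.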