Tags: gpu, TensorRT-LLM, FP8 quantization, NVIDIA DGX Spark, GPU optimization, inference scaling

End-to-End TensorRT-LLM FP8 Inference Optimization on NVIDIA GB10 Blackwell

How end-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell went from a 300 tok/s baseline to 3,000+ tokens/sec, phase by phase.

April 20, 2026·2 min read

Summary

End-to-end TensorRT-LLM FP8 inference optimization on NVIDIA GB10 Blackwell took throughput from a 300 tok/s FP16 baseline to 3,000+ tokens/sec. The work proceeded in phases: stabilizing the baseline, enabling FlashAttention FMHA (no gain, since decode dominated), switching GEMM layers to FP8 quantization, raising max_batch_size, and running concurrency scaling sweeps.

What I Did

- Achieved a 300 tok/s baseline with FP16
- Enabled FlashAttention FMHA but saw no gain, as decode dominated
- Switched to FP8 quantization on GEMM layers, increasing speed from 300 to 500 tok/s and reducing latency from 7 s to 4 s
- Increased max_batch_size to 64, unlocking 1493 tok/s at concurrency 64
- Ran a scaling sweep with batch=192, reaching 3030 tok/s at concurrency 160; the optimal range is 128-160 for throughput/latency balance
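The throughput figures above are aggregate numbers: total generated tokens divided by wall-clock time across all concurrent requests. A minimal sketch of that calculation (the helper name and the ~21.9 s wall time are illustrative assumptions, back-solved from the 512-token, concurrency-64 run):

```python
def aggregate_tokens_per_sec(token_counts, wall_seconds):
    """Aggregate throughput: total generated tokens / wall-clock time.

    token_counts: output-token count per completed request.
    wall_seconds: wall-clock duration of the whole batch of requests.
    """
    return sum(token_counts) / wall_seconds

# 64 concurrent requests x 512 output tokens in ~21.9 s of wall time
# lands near the ~1493 tok/s observed at concurrency 64.
print(round(aggregate_tokens_per_sec([512] * 64, 21.9)))  # → 1496
```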

Key Technical Findings

- FP8 GEMM is the single biggest lever on Blackwell, not attention, not batching alone
- Batch size must be scaled together with concurrency or the GPU idles
- Optimal operating point is 128-160 concurrency: 3000+ tok/s with P95 under 7 s
- Always profile decode separately from prefill; they have different bottlenecks
- P95 flat near P50 means the system is stable and not overloaded
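"Profile decode separately from prefill" comes down to splitting each request's timeline at the first token: prefill cost shows up as time-to-first-token (TTFT), while decode cost is the per-token rate after that. A minimal sketch of the split, assuming you have per-request timestamps (the function name and the example timings are hypothetical, not from the post's measurements):

```python
def split_prefill_decode(request_start, first_token_time, end_time, n_tokens):
    """Split one request's latency into prefill (TTFT) and decode rate.

    Prefill is compute-bound on the prompt; decode is a different
    bottleneck (memory-bandwidth-bound token-by-token generation),
    so averaging them together hides which phase limits throughput.
    """
    ttft = first_token_time - request_start                      # prefill latency (s)
    decode_tps = (n_tokens - 1) / (end_time - first_token_time)  # tokens/s after first
    return ttft, decode_tps

# Hypothetical request: first token after 0.3 s, then 511 more tokens over 3.7 s.
ttft, decode_tps = split_prefill_decode(0.0, 0.3, 4.0, 512)
print(ttft, round(decode_tps, 1))  # → 0.3 138.1
```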

Commands

```bash
# Build the FP8 engine
trtllm-build --checkpoint_dir ./model-fp16 --output_dir ./model-fp8 \
  --gemm_plugin fp8 --max_batch_size 192

# Serve the model
tritonserver --model-repository=/models --http-port 8000

# Benchmark at concurrency 160
genai-perf -m ensemble --service-kind triton --backend tensorrtllm \
  --num-prompts 100 --concurrency 160 --output-tokens 512

# Profile with Nsight Systems
nsys profile -w true -t cuda,nvtx --capture-range=cudaProfilerApi \
  -o blackwell_fp8 python benchmark.py
```
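Sweeping concurrency means re-running the genai-perf invocation above with a different `--concurrency` value per point. A small sketch that builds (but does not execute) each sweep command; the helper name is my own, and the flags simply mirror the command shown above:

```python
def genai_perf_cmd(concurrency, num_prompts=100, output_tokens=512):
    """Build the genai-perf argv for one sweep point (not executed here)."""
    return [
        "genai-perf", "-m", "ensemble",
        "--service-kind", "triton",
        "--backend", "tensorrtllm",
        "--num-prompts", str(num_prompts),
        "--concurrency", str(concurrency),
        "--output-tokens", str(output_tokens),
    ]

# One command per sweep point, e.g. for subprocess.run(...) in a driver script.
for c in (64, 128, 160, 192):
    print(" ".join(genai_perf_cmd(c)))
```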


Next Steps

- Automate the concurrency sweep with genai-perf and plot the throughput/latency curve
- Test continuous batching with in-flight sequence padding to push past 3310 tok/s
- Add a Grafana dashboard: tokens/sec, P95 latency, GPU utilization, KV cache pressure
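Once the sweep is automated, picking the operating point is a small optimization: the highest-throughput concurrency whose P95 latency stays under budget. A sketch of that selection step, assuming sweep results as `(concurrency, tok/s, p95_latency_s)` rows; the numbers below are illustrative, loosely shaped like the sweep in this post:

```python
def pick_operating_point(sweep, p95_budget_s=7.0):
    """From (concurrency, tok/s, p95_s) rows, return the highest-throughput
    point whose P95 latency stays under budget, or None if none qualify."""
    feasible = [row for row in sweep if row[2] <= p95_budget_s]
    return max(feasible, key=lambda row: row[1]) if feasible else None

sweep = [
    (64,  1493, 4.1),
    (128, 2700, 5.8),
    (160, 3030, 6.9),
    (192, 3100, 8.4),  # over the 7 s P95 budget, so excluded
]
print(pick_operating_point(sweep))  # → (160, 3030, 6.9)
```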