Evaluating Inflight Batching and Scheduler Policies in TensorRT-LLM Triton Backend

GPU SM utilization sitting at 40% on an H100 while a Triton server is handling real load is a bad sign. The GPU has the capacity. The problem is the scheduler.

Static batching — the default in most inference setups — collects a fixed batch of requests, runs them all through the model together, and returns results. It's simple. It's also wasteful, because every sequence in the batch has to finish before the next batch starts. A 10-token request ends up waiting behind a 512-token request, and the GPU idles between batches while Triton assembles the next one.

Inflight batching (continuous batching) fixes this by treating the batch as a moving window. Sequences join and leave the batch mid-flight. A short sequence completes, its slot opens, and the next waiting request fills it immediately. The GPU never waits for a full batch to assemble.

Switching to TRT-LLM Inflight Batching

TensorRT-LLM's Triton backend supports inflight batching natively — it just requires the right configuration in the model repository. Two parameters matter most.

max_num_tokens controls how many tokens can be in-flight at once. Set it too low and you're throttling throughput by artificially limiting occupancy. Set it too high and you risk OOM when a spike of long-context requests arrives simultaneously. A safe starting point is 70–80% of your GPU memory budget for KV cache.

batch_scheduler_policy controls what happens when the in-flight queue is full and a new request arrives. The two options have meaningfully different behavior under bursty load:

guaranteed_no_evict never removes an active sequence from the KV cache mid-generation. Once a sequence starts, it runs to completion or failure. Latency is predictable. This is right for APIs with SLA commitments where a P99 spike is worse than slightly lower peak throughput.
max_utilization will preempt sequences if doing so allows better overall throughput. More tokens processed per second, but individual requests can experience retries if they get preempted. This is right for batch processing jobs where average throughput matters more than per-request latency.

What the Numbers Show

Running a 60-second load test at 50 req/s arrival rate with output length 256 on Llama 3 8B:

Configuration	GPU SM utilization	Throughput (tok/s)	P99 latency
Static batching	~42%	baseline	baseline
Inflight, guaranteed_no_evict	~87%	+2.1x	stable
Inflight, max_utilization	~91%	+2.4x	variable

The jump from static to inflight isn't a tuning win — it's a fundamentally different operating model. SM utilization above 85% means the GPU is working on tokens almost continuously rather than waiting for the scheduler.

# Start Triton with the TRT-LLM model repository
tritonserver --model-repository=/models

# Test a single generation
curl -X POST http://localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input":"explain CUDA streams","max_tokens":256}'

# Bursty load test — 50 req/s for 60s
python load_test.py --arrival_rate 50 --duration 60 --output_len 256 --concurrency 64

# Plot in-flight queue depth and SM utilization over time
python plot_inflight_stats.py --log triton_metrics.log

Choosing the Right Policy for Your Workload

The policy decision is simpler than it looks. If you're building a user-facing API — a coding assistant, a chat interface, anything with an end-user waiting — use guaranteed_no_evict. Predictable latency is worth the small throughput trade.

If you're running a batch pipeline — nightly document processing, bulk embeddings, offline eval — use max_utilization. There's no end-user watching a spinner, so maximizing tokens processed per hour is the right objective.

The next meaningful optimization layered on top of this is chunked prefill: splitting long prefill phases into smaller chunks so long-context requests don't block decode-phase sequences while they're still processing their input. That + inflight batching together is what full KV cache utilization looks like.