## Summary
After running both frameworks at production scale for 3 months, the verdict is nuanced. TRT-LLM wins on raw throughput (1.4–2.1× for batch workloads), but vLLM wins on operational simplicity, multi-model serving, and continuous batching latency at low load. Choose based on your traffic pattern.
## What I Did
I deployed both frameworks serving Llama-3 8B and 70B on identical hardware (8× H100 SXM5) under simulated production traffic using real query distributions from a coding assistant application (token length distribution: mean 820, p95 2,400, p99 4,100).
Test duration: 72 hours each, capturing cold/warm/peak states.
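If you want to replay a similar workload without access to our traces, a lognormal is a reasonable stand-in for the shape above. The lognormal assumption and the fitted parameters are mine, derived from the summary stats, not taken from the measured traces:

```python
import numpy as np

# Fit a lognormal to the reported stats (mean 820, p95 ~2,400 tokens).
# Solving exp(mu + s^2/2) = 820 and exp(mu + 1.645*s) = 2400 gives
# s ~= 0.90, mu ~= 6.31. Use your real trace when you have one.
MU, SIGMA = 6.31, 0.90

def sample_lengths(n: int, seed: int = 0) -> np.ndarray:
    """Draw n request token lengths from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.lognormal(MU, SIGMA, size=n).astype(int)

lengths = sample_lengths(100_000)
print(f"mean={lengths.mean():.0f}  "
      f"p95={np.percentile(lengths, 95):.0f}  "
      f"p99={np.percentile(lengths, 99):.0f}")
```

The implied p99 lands around 4,400 tokens, close to the observed 4,100, so the fit is serviceable for load testing even if the true distribution is heavier-tailed.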
## Key Technical Findings
### Throughput (tokens/sec, batch size 64)
| Model | Framework | Tokens/sec | Memory (GB) |
|---|---|---|---|
| 8B | TRT-LLM | 48,200 | 18.4 |
| 8B | vLLM | 34,100 | 21.8 |
| 70B | TRT-LLM | 11,900 | 142.0 |
| 70B | vLLM | 7,800 | 156.0 |
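The batch-64 ratios implied by this table sit at the lower end of the 1.4–2.1× range quoted in the summary (the upper end comes from configurations not shown here). A quick script to recompute them, with the numbers copied straight from the table:

```python
# (tokens/sec, memory GB) per (model, framework), from the table above.
results = {
    ("8B", "TRT-LLM"):  (48_200, 18.4),
    ("8B", "vLLM"):     (34_100, 21.8),
    ("70B", "TRT-LLM"): (11_900, 142.0),
    ("70B", "vLLM"):    (7_800, 156.0),
}

for model in ("8B", "70B"):
    trt_tps, trt_mem = results[(model, "TRT-LLM")]
    vllm_tps, vllm_mem = results[(model, "vLLM")]
    print(f"{model}: TRT-LLM/vLLM throughput ratio {trt_tps / vllm_tps:.2f}x, "
          f"tokens/sec per GB: TRT {trt_tps / trt_mem:.0f} vs vLLM {vllm_tps / vllm_mem:.0f}")
```

Note that TRT-LLM also wins on memory efficiency here, not just raw throughput: fewer GB per token/sec at both model sizes.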
### Latency at realistic load (p50 / p95 / p99 in ms, 50 rps)
| Framework | p50 | p95 | p99 |
|---|---|---|---|
| TRT-LLM | 94ms | 380ms | 1,200ms |
| vLLM | 87ms | 290ms | 780ms |
Interestingly, vLLM has lower tail latency at moderate load thanks to better continuous batching scheduling. TRT-LLM's throughput advantage appears at high utilization: above roughly 70% GPU utilization, its p99 becomes the more favorable of the two.
## Where each wins
TRT-LLM is better when:
- You have predictable batch workloads (offline inference, embedding generation)
- Maximizing GPU utilization matters more than tail latency
- You're using FP8 quantization (TRT has the more mature FP8 pipeline)
- You're serving a small number of fixed models (engine-build overhead is paid once)

vLLM is better when:
- You're serving many different models or adapters (LoRA is first-class)
- Traffic patterns are variable (continuous batching efficiency is better)
- You need an OpenAI-compatible API out of the box
- Your team lacks CUDA/TRT expertise
## Commands Used
### vLLM deployment (OpenAI-compatible)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --port 8000
```
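Once the server is up, any OpenAI-style client works against it. A minimal stdlib-only sketch (the model name must match whatever the server was launched with; the request shape is the standard `/v1/chat/completions` body):

```python
import json
import urllib.request

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request to the vLLM server and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In practice we pointed the official `openai` Python client at the same URL; the raw-HTTP version above just makes the wire format explicit.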
### TRT-LLM Triton deployment
```bash
# Build engine (one-time, ~25 min for 70B on 8×H100)
trtllm-build \
  --checkpoint_dir ./llama3-70b-bf16 \
  --output_dir ./engines/llama3-70b-tp8 \
  --gpt_attention_plugin bfloat16 \
  --gemm_plugin bfloat16 \
  --paged_kv_cache enable \
  --tp_size 8 \
  --max_batch_size 64 \
  --max_input_len 4096 \
  --max_output_len 2048

# Triton serve
tritonserver \
  --model-repository ./triton_models \
  --http-port 8001 \
  --grpc-port 8002
```
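For a quick smoke test against Triton, the HTTP generate extension accepts a flat JSON body. The top-level model name `ensemble` follows the conventional TRT-LLM backend repository layout and may differ in yours; adjust to match your `triton_models` directory:

```python
import json
import urllib.request

def build_generate_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Request body for Triton's generate endpoint (TRT-LLM backend shape)."""
    return {"text_input": prompt, "max_tokens": max_tokens, "stream": False}

def generate(prompt: str, base_url: str = "http://localhost:8001") -> str:
    """POST to /v2/models/ensemble/generate and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/ensemble/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text_output"]
```

This is the request path our load-test scripts hit; production clients went through gRPC for lower overhead.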
### Load test both
```bash
# Using k6 with custom LLM load profile
k6 run \
  --vus 50 \
  --duration 30m \
  --env BASE_URL=http://localhost:8000 \
  scripts/llm_load_test.js
```
## Lessons Learned
- Engine build time is a real operational cost — TRT-LLM takes 20–45 minutes to compile a 70B engine, while vLLM starts serving in ~90 seconds. Factor this into your deployment velocity.
- Paged KV cache matters more than framework choice at high load — both frameworks now support paged KV cache, and configuring it properly (correct memory fraction) has a bigger throughput impact than framework selection.
- Don't benchmark with synthetic uniform-length requests — real production traffic has a long tail. Benchmark with your actual query-length distribution or you'll be surprised in prod.
- TRT-LLM requires GPU driver pinning — we hit a silent performance regression when upgrading from driver 535 to 545. Pinning the driver version in your container image is essential.

  ```dockerfile
  # Pin in Dockerfile
  FROM nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
  # Don't use latest — pin to tested version
  ```
- vLLM's `--enable-chunked-prefill` is non-optional in production — without it, long prefill requests can block the decode queue for seconds. Enable it.
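The "correct memory fraction" in the paged-KV-cache lesson can be sized with a back-of-envelope calculation. The sketch below uses Llama-3 70B's public config (80 layers, 8 KV heads via GQA, head dim 128) and the footprint figures from the table above; verify against your own checkpoint before relying on it:

```python
# KV-cache bytes per token for Llama-3 70B with an fp16 cache:
# K and V, per layer, per KV head, per head dim, 2 bytes each.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 80, 8, 128, 2

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
mib_per_token = bytes_per_token / 2**20

# With ~156 GB total footprint (vLLM 70B row above) on 8x80 GB = 640 GB,
# a 0.90 memory fraction leaves roughly this much room for cached tokens:
free_bytes = (640 * 0.90 - 156) * 2**30
cacheable_tokens = free_bytes / bytes_per_token
print(f"{mib_per_token:.4f} MiB/token, ~{cacheable_tokens:,.0f} tokens of KV head-room")
```

At ~0.31 MiB per token, a few hundred GB of head-room translates to over a million cacheable tokens, which is why getting the memory fraction right dominates framework choice at high load.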
## Next Steps
- Evaluate SGLang as a third option (shows promising throughput + RadixAttention for cached prefixes)
- Test speculative decoding in both frameworks
- Profile per-layer compute vs memory-bound ratios to understand throughput difference root cause