## Summary
After running both frameworks at production scale for 3 months, the verdict is nuanced. TRT-LLM wins on raw throughput (1.4–2.1× for batch workloads), but vLLM wins on operational simplicity, multi-model serving, and continuous batching latency at low load. Choose based on your traffic pattern.
## What I Did
I deployed both frameworks serving Llama-3 8B and 70B on identical hardware (8× H100 SXM5) under simulated production traffic using real query distributions from a coding assistant application (token length distribution: mean 820, p95 2,400, p99 4,100).
Test duration: 72 hours each, capturing cold/warm/peak states.
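If you want to replay a similar workload without access to our traces, a lognormal is a reasonable stand-in for the shape above. The lognormal assumption and the fitted parameters are mine, derived from the summary stats, not taken from the measured traces:

```python
import numpy as np

# Fit a lognormal to the reported stats (mean 820, p95 ~2,400 tokens).
# Solving exp(mu + s^2/2) = 820 and exp(mu + 1.645*s) = 2400 gives
# s ~= 0.90, mu ~= 6.31. Use your real trace when you have one.
MU, SIGMA = 6.31, 0.90

def sample_lengths(n: int, seed: int = 0) -> np.ndarray:
    """Draw n request token lengths from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.lognormal(MU, SIGMA, size=n).astype(int)

lengths = sample_lengths(100_000)
print(f"mean={lengths.mean():.0f}  "
      f"p95={np.percentile(lengths, 95):.0f}  "
      f"p99={np.percentile(lengths, 99):.0f}")
```

The implied p99 lands around 4,400 tokens, close to the observed 4,100, so the fit is serviceable for load testing even if the true distribution is heavier-tailed.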
## Key Technical Findings
### Throughput (tokens/sec, batch size 64)
| Model | Framework | Tokens/sec | Memory (GB) |
|---|---|---|---|
| 8B | TRT-LLM | 48,200 | 18.4 |
| 8B | vLLM | 34,100 | 21.8 |
| 70B | TRT-LLM | 11,900 | 142.0 |
| 70B | vLLM | 7,800 | 156.0 |
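The batch-64 ratios implied by this table sit at the lower end of the 1.4–2.1× range quoted in the summary (the upper end comes from configurations not shown here). A quick script to recompute them, with the numbers copied straight from the table:

```python
# (tokens/sec, memory GB) per (model, framework), from the table above.
results = {
    ("8B", "TRT-LLM"):  (48_200, 18.4),
    ("8B", "vLLM"):     (34_100, 21.8),
    ("70B", "TRT-LLM"): (11_900, 142.0),
    ("70B", "vLLM"):    (7_800, 156.0),
}

for model in ("8B", "70B"):
    trt_tps, trt_mem = results[(model, "TRT-LLM")]
    vllm_tps, vllm_mem = results[(model, "vLLM")]
    print(f"{model}: TRT-LLM/vLLM throughput ratio {trt_tps / vllm_tps:.2f}x, "
          f"tokens/sec per GB: TRT {trt_tps / trt_mem:.0f} vs vLLM {vllm_tps / vllm_mem:.0f}")
```

Note that TRT-LLM also wins on memory efficiency here, not just raw throughput: fewer GB per token/sec at both model sizes.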
### Latency at realistic load (p50 / p95 / p99 in ms, 50 rps)
| Framework | p50 | p95 | p99 |
|---|---|---|---|
| TRT-LLM | 94ms | 380ms | 1,200ms |
| vLLM | 87ms | 290ms | 780ms |
Interestingly, vLLM has lower tail latency at moderate load thanks to better continuous batching scheduling. TRT-LLM's throughput advantage appears at high utilization: above roughly 70% GPU utilization, its p99 becomes the more favorable of the two.
## Where each wins
TRT-LLM is better when:
- You have predictable batch workloads (offline inference, embedding generation)
- Maximizing GPU utilization matters more than tail latency
- You're using FP8 quantization (TRT has the more mature FP8 pipeline)
- You're serving a small number of fixed models (engine-build overhead is paid once)

vLLM is better when:
- You're serving many different models or adapters (LoRA is first-class)
- Traffic patterns are variable (continuous batching efficiency is better)
- You need an OpenAI-compatible API out of the box
- Your team lacks CUDA/TRT expertise
## Commands Used
### vLLM deployment (OpenAI-compatible)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --port 8000
```
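Once the server is up, any OpenAI-style client works against it. A minimal stdlib-only sketch (the model name must match whatever the server was launched with; the request shape is the standard `/v1/chat/completions` body):

```python
import json
import urllib.request

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request to the vLLM server and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In practice we pointed the official `openai` Python client at the same URL; the raw-HTTP version above just makes the wire format explicit.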
### TRT-LLM Triton deployment
```bash
# Build engine (one-time, ~25 min for 70B on 8×H100)
trtllm-build \
  --checkpoint_dir ./llama3-70b-bf16 \
  --output_dir ./engines/llama3-70b-tp8 \
  --gpt_attention_plugin bfloat16 \
  --gemm_plugin bfloat16 \
  --paged_kv_cache enable \
  --tp_size 8 \
  --max_batch_size 64 \
  --max_input_len 4096 \
  --max_output_len 2048

# Triton serve
tritonserver \
  --model-repository ./triton_models \
  --http-port 8001 \
  --grpc-port 8002
```
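For a quick smoke test against Triton, the HTTP generate extension accepts a flat JSON body. The top-level model name `ensemble` follows the conventional TRT-LLM backend repository layout and may differ in yours; adjust to match your `triton_models` directory:

```python
import json
import urllib.request

def build_generate_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Request body for Triton's generate endpoint (TRT-LLM backend shape)."""
    return {"text_input": prompt, "max_tokens": max_tokens, "stream": False}

def generate(prompt: str, base_url: str = "http://localhost:8001") -> str:
    """POST to /v2/models/ensemble/generate and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/ensemble/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text_output"]
```

This is the request path our load-test scripts hit; production clients went through gRPC for lower overhead.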
### Load test both
```bash
# Using k6 with custom LLM load profile
k6 run \
  --vus 50 \
  --duration 30m \
  --env BASE_URL=http://localhost:8000 \
  scripts/llm_load_test.js
```
## Lessons Learned
- Engine build time is a real operational cost — TRT-LLM takes 20–45 minutes to compile a 70B engine, while vLLM starts serving in ~90 seconds. Factor this into your deployment velocity.
- Paged KV cache matters more than framework choice at high load — both frameworks now support paged KV cache, and configuring it properly (correct memory fraction) has a bigger throughput impact than framework selection.
- Don't benchmark with synthetic uniform-length requests — real production traffic has a long tail. Benchmark with your actual query-length distribution or you'll be surprised in prod.
- TRT-LLM requires GPU driver pinning — we hit a silent performance regression when upgrading from driver 535 to 545. Pinning the driver version in your container image is essential.

  ```dockerfile
  # Pin in Dockerfile
  FROM nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
  # Don't use latest — pin to tested version
  ```
- vLLM's `--enable-chunked-prefill` is non-optional in production — without it, long prefill requests can block the decode queue for seconds. Enable it.
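The "correct memory fraction" in the paged-KV-cache lesson can be sized with a back-of-envelope calculation. The sketch below uses Llama-3 70B's public config (80 layers, 8 KV heads via GQA, head dim 128) and the footprint figures from the table above; verify against your own checkpoint before relying on it:

```python
# KV-cache bytes per token for Llama-3 70B with an fp16 cache:
# K and V, per layer, per KV head, per head dim, 2 bytes each.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 80, 8, 128, 2

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
mib_per_token = bytes_per_token / 2**20

# With ~156 GB total footprint (vLLM 70B row above) on 8x80 GB = 640 GB,
# a 0.90 memory fraction leaves roughly this much room for cached tokens:
free_bytes = (640 * 0.90 - 156) * 2**30
cacheable_tokens = free_bytes / bytes_per_token
print(f"{mib_per_token:.4f} MiB/token, ~{cacheable_tokens:,.0f} tokens of KV head-room")
```

At ~0.31 MiB per token, a few hundred GB of head-room translates to over a million cacheable tokens, which is why getting the memory fraction right dominates framework choice at high load.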
## Next Steps
- Evaluate SGLang as a third option (shows promising throughput + RadixAttention for cached prefixes)
- Test speculative decoding in both frameworks
- Profile per-layer compute vs memory-bound ratios to understand throughput difference root cause