Tags: ai-infra, TensorRT-LLM, vLLM, serving, benchmark

TensorRT-LLM vs vLLM: Production Comparison at Scale

A rigorous head-to-head benchmark of TRT-LLM and vLLM for production LLM serving — covering throughput, latency percentiles, memory efficiency, and operational complexity.

April 8, 2026 · 3 min read

Summary

After running both frameworks at production scale for 3 months, the verdict is nuanced. TRT-LLM wins on raw throughput (1.4–2.1× for batch workloads), but vLLM wins on operational simplicity, multi-model serving, and continuous batching latency at low load. Choose based on your traffic pattern.

What I Did

I deployed both frameworks serving Llama-3 8B and 70B on identical hardware (8× H100 SXM5) under simulated production traffic using real query distributions from a coding assistant application (token length distribution: mean 820, p95 2,400, p99 4,100).

Test duration: 72 hours each, capturing cold/warm/peak states.
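A long-tailed length distribution like the one above can be approximated with a lognormal sampler; here's a minimal sketch in awk. The mu/sigma values are illustrative assumptions (chosen so the mean lands near 820 tokens), not parameters fitted to the real trace.

```shell
# Sample 10,000 request lengths from an assumed lognormal and report the mean.
mean_len=$(awk 'BEGIN {
  srand(7); mu = 6.4; sigma = 0.8                   # median ~600, mean ~830
  for (i = 0; i < 10000; i++) {
    u1 = 1 - rand(); u2 = rand()                    # u1 in (0,1], avoids log(0)
    z = sqrt(-2 * log(u1)) * cos(6.283185307 * u2)  # Box-Muller normal draw
    s += exp(mu + sigma * z)                        # lognormal sample
  }
  printf "%d", s / 10000                            # empirical mean length
}')
echo "sampled mean length: ${mean_len} tokens"
```

Replaying your real trace is still better, but a fitted lognormal gets you the long tail that uniform-length synthetic requests miss.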

Key Technical Findings

Throughput (tokens/sec, batch size 64)

| Model | Framework | Tokens/sec | Memory (GB) |
|-------|-----------|------------|-------------|
| 8B    | TRT-LLM   | 48,200     | 18.4        |
| 8B    | vLLM      | 34,100     | 21.8        |
| 70B   | TRT-LLM   | 11,900     | 142.0       |
| 70B   | vLLM      | 7,800      | 156.0       |

Latency at realistic load (p50 / p95 / p99 ms, 50 rps)

| Framework | p50   | p95    | p99      |
|-----------|-------|--------|----------|
| TRT-LLM   | 94 ms | 380 ms | 1,200 ms |
| vLLM      | 87 ms | 290 ms | 780 ms   |

Interesting: at moderate load, vLLM's continuous-batching scheduler gives it lower tail latency. TRT-LLM's throughput advantage only pays off at high load; above roughly 70% GPU utilization, its p99 becomes the more favorable of the two.

Where each wins

TRT-LLM is better when:

  • Predictable batch workloads (offline inference, embedding generation)
  • Maximizing GPU utilization matters more than tail latency
  • You're using FP8 quantization (TRT has a more mature FP8 pipeline)
  • Serving a small number of fixed models (engine build overhead is paid once)

vLLM is better when:

  • Serving many different models or adapters (LoRA is first-class)
  • Variable traffic patterns (continuous batching efficiency is better)
  • OpenAI-compatible API is required out of the box
  • Team lacks CUDA/TRT expertise

Commands Used

vLLM deployment (OpenAI-compatible)

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --port 8000

TRT-LLM Triton deployment

# Build engine (one-time, ~25 min for 70B on 8×H100)
trtllm-build \
  --checkpoint_dir ./llama3-70b-bf16 \
  --output_dir ./engines/llama3-70b-tp8 \
  --gpt_attention_plugin bfloat16 \
  --gemm_plugin bfloat16 \
  --paged_kv_cache enable \
  --tp_size 8 \
  --max_batch_size 64 \
  --max_input_len 4096 \
  --max_output_len 2048

# Triton serve
tritonserver \
  --model-repository ./triton_models \
  --http-port 8001 \
  --grpc-port 8002

Load test both

# Using k6 with custom LLM load profile
k6 run \
  --vus 50 \
  --duration 30m \
  --env BASE_URL=http://localhost:8000 \
  scripts/llm_load_test.js

Lessons Learned

  1. Engine build time is a real operational cost — TRT-LLM takes 20–45 minutes to compile a 70B engine. Factor this into deployment velocity. vLLM starts in ~90 seconds.

  2. Paged KV cache matters more than framework choice at high load — both frameworks now have paged KV cache, and enabling it properly (correct memory fraction) has a bigger throughput impact than framework selection.

  3. Don't benchmark with synthetic uniform-length requests — real production traffic has a long tail. Benchmark with your actual query length distribution or you'll be surprised in prod.

  4. TRT-LLM requires GPU driver pinning — we hit a silent performance regression when upgrading from driver 535 to 545. Pinning driver version in your container image is essential.

# Pin in Dockerfile
FROM nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
# Don't use latest — pin to tested version

  5. vLLM's --enable-chunked-prefill is non-optional in production — without it, long prefill requests can block the decode queue for seconds. Enable it.
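Lesson 2 in numbers: here's what the memory fraction actually buys you. This assumes round figures (80 GB per H100, ~140 GB of bf16 weights for 70B split across TP=8) purely for illustration.

```shell
# KV-cache budget per GPU left by --gpu-memory-utilization 0.90.
budget=$(awk 'BEGIN { printf "%.1f", 80 * 0.90 - 140 / 8 }')
echo "KV cache budget: ~${budget} GB/GPU"
```

Setting the fraction too low starves the paged KV cache and caps concurrency long before compute does; too high risks OOM from activation spikes.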

Next Steps

  • Evaluate SGLang as a third option (shows promising throughput + RadixAttention for cached prefixes)
  • Test speculative decoding in both frameworks
  • Profile per-layer compute vs memory-bound ratios to understand throughput difference root cause