Summary
Tuned speculative decoding with Llama 3 8B as the draft model and Llama 3 70B as the target to reduce decode latency. Measured performance across batch sizes, speculative token counts, and tasks.
What I Did
- Set up speculative decoding with Llama 3 8B as the draft model and Llama 3 70B as the target.
- Tuned the number of speculative tokens from 3 to 10 and measured acceptance rate and latency.
- Compared speculative decoding against standard autoregressive decoding at batch sizes 1, 4, and 8.
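The core draft-then-verify step can be sketched as follows. This is a toy illustration of the greedy acceptance rule, not the production implementation inside TensorRT-LLM; the function name and list-of-ints interface are assumptions for clarity.

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy acceptance sketch (assumed logic, not the TensorRT-LLM code).

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k+1 tokens the target model would emit at the same
                   positions (the extra one enables a 'bonus' token).
    Returns the tokens actually committed this step: the longest matching
    prefix, plus one target token (either the correction at the first
    mismatch, or the bonus token when all drafts are accepted).
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)          # draft token accepted
        else:
            accepted.append(t)          # target's correction, then stop
            return accepted
    accepted.append(target_tokens[len(draft_tokens)])  # bonus token
    return accepted
```

Even on a full rejection this commits one (target-quality) token per step, which is why output quality is unchanged and only latency varies with the acceptance rate.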
Commands Used
`trtllm-build --checkpoint_dir ./llama3-70b --spec_decode_mode draft_target --draft_model_dir ./llama3-8b --num_draft_tokens 5`
`python spec_decode_bench.py --target llama3-70b --draft llama3-8b --num_spec_tokens 5 --batch_size 1 --output_len 256`
`python acceptance_rate_eval.py --dataset humaneval --spec_tokens 5`
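The full sweep over speculative token counts and batch sizes was just repeated invocations of the bench command above. A sketch that assembles those command lines (assuming `spec_decode_bench.py` takes the same flags shown above):

```python
from itertools import product

def sweep_commands(spec_tokens=range(3, 11), batch_sizes=(1, 4, 8)):
    """Build the bench command for every (num_spec_tokens, batch_size) pair.

    Flags mirror the single invocation above; the ranges match the sweep
    described in "What I Did" (spec tokens 3-10, batch sizes 1/4/8).
    """
    cmds = []
    for k, bs in product(spec_tokens, batch_sizes):
        cmds.append(
            "python spec_decode_bench.py --target llama3-70b --draft llama3-8b "
            f"--num_spec_tokens {k} --batch_size {bs} --output_len 256"
        )
    return cmds
```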
Lessons Learned
- Speculative decoding is a latency optimization, not a throughput optimization.
- Draft and target models should share vocabulary and have similar output distributions.
- The sweet spot for num_spec_tokens is typically 4 to 6, but it can vary by task type.
- Only use speculative decoding for interactive, latency-sensitive applications.
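A back-of-envelope model makes the sweet-spot lesson concrete. The constants here are assumptions, not my measurements: an i.i.d. per-token acceptance rate of 0.8, a draft/target cost ratio of 8/70 from parameter counts, and one target verification pass per step.

```python
def expected_tokens(a, k):
    """Expected tokens committed per step with per-token acceptance rate a
    and k draft tokens: 1 + a + a^2 + ... + a^k (geometric series),
    counting the bonus/correction token."""
    return (1 - a ** (k + 1)) / (1 - a)

def speedup(a, k, r):
    """Latency speedup vs. plain autoregressive decoding.
    Step cost: k draft passes (each r target-passes worth) + 1 target pass,
    normalized to a baseline of 1 target pass per token."""
    return expected_tokens(a, k) / (k * r + 1)

# Assumed numbers: a=0.8 acceptance, r=8/70 cost ratio for 8B draft / 70B target.
best_k = max(range(3, 11), key=lambda k: speedup(a=0.8, k=k, r=8 / 70))
```

Under these assumed numbers the optimum lands in the 4-6 range, consistent with the tuning result; a lower acceptance rate or a relatively costlier draft model pushes the optimum down.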