Summary
Configured inflight batching in TensorRT-LLM Triton backend for Llama 3 8B model, tuning parameters to optimize GPU utilization and balance latency and throughput under bursty loads.
What I Did
Replaced static batching with TRT-LLM inflight batching. Tuned max_num_tokens and batch_scheduler_policy. Compared guaranteed_no_evict vs max_utilization scheduler policies. Monitored queue depth and GPU SM utilization at varying arrival rates.
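The switch described above lives in the tensorrt_llm model's config.pbtxt. A minimal fragment, assuming the standard tensorrtllm_backend layout (key names follow the backend docs but can shift between versions, so verify against yours):

```
# config.pbtxt fragment for the tensorrt_llm model (illustrative; check your backend version)
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }  # replaces static batching
}
parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }  # or "max_utilization"
}
```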
Key Technical Findings
- Inflight batching significantly improves throughput over static batching.
- guaranteed_no_evict is preferable for SLA-sensitive APIs, while max_utilization suits batch jobs.
- High GPU utilization (above 85%) indicates efficient scheduling.
- Size max_num_tokens so the KV cache fits within 70-80% of the GPU memory budget.
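The 70-80% guideline can be turned into a back-of-envelope token budget. A sketch assuming FP16 KV cache and Llama 3 8B's GQA shape (32 layers, 8 KV heads, head dim 128); these constants are illustrative assumptions, not measured values:

```python
def kv_cache_tokens(free_mem_gib, budget_frac=0.75,
                    n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate how many KV-cache tokens fit in a fraction of free GPU memory.

    Per-token KV cache = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes.
    Assumed Llama 3 8B GQA geometry; adjust for your engine's actual config.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    budget_bytes = free_mem_gib * (1 << 30) * budget_frac
    return int(budget_bytes // per_token_bytes)

# e.g. ~40 GiB left on an 80 GiB card after FP16 weights and activations
max_num_tokens_estimate = kv_cache_tokens(40)
```

The result is an upper bound on concurrent in-flight tokens, which is what max_num_tokens caps.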
Commands Used
tritonserver --model-repository=/models
curl -X POST http://localhost:8000/v2/models/ensemble/generate -d '{"text_input":"explain CUDA streams","max_tokens":256}'
python load_test.py --arrival_rate 50 --duration 60 --output_len 256 --concurrency 64
python plot_inflight_stats.py --log triton_metrics.log
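The bursty-load runs above presumably draw request inter-arrival times from an exponential distribution (load_test.py internals are not shown in this log, so this is an assumption). A minimal sketch of generating a Poisson arrival schedule for --arrival_rate 50 --duration 60:

```python
import random

def arrival_times(rate_per_s, duration_s, seed=0):
    """Poisson-process arrival timestamps: exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # mean gap = 1 / rate
        if t >= duration_s:
            return times
        times.append(t)

schedule = arrival_times(rate_per_s=50, duration_s=60)
# expected count is roughly rate * duration, i.e. ~3000 requests
```

Firing requests at these timestamps (rather than at a fixed interval) is what produces the queue-depth bursts worth monitoring.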
Next Steps
- Implement Prometheus metrics scraping from Triton.
- Test chunked prefill with inflight batching.
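Triton already serves Prometheus-format text on its metrics port (:8002/metrics by default). A sketch of pulling one counter out of that text format; the metric name used here follows Triton's documented inference metrics, but verify it against your server version:

```python
def parse_metric(metrics_text, name):
    """Sum all samples of one Prometheus metric across its label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        # match the bare name or the labeled form "name{...} value"
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

sample = """# HELP nv_inference_queue_duration_us Cumulative queue time
nv_inference_queue_duration_us{model="ensemble",version="1"} 12500
nv_inference_queue_duration_us{model="tensorrt_llm",version="1"} 7500
"""
queued_us = parse_metric(sample, "nv_inference_queue_duration_us")
```

Scraping this on an interval and differencing consecutive values gives the queue-time rate, which pairs directly with the queue-depth observations above.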