Summary
Deployed Triton Inference Server with two models (Llama 3 8B and Llama 3 70B) to serve simple and complex queries respectively. Implemented a BLS ensemble model as the router, routing based on prompt token length and complexity classifier output. Measured latency, cost savings, and accuracy degradation.
What I Did
Set up Triton serving two models: Llama 3 8B for simple queries and Llama 3 70B for complex ones. Implemented a BLS ensemble model as the router. Routed requests based on prompt token length and complexity classifier output. Measured latency, cost savings from offloading simple requests to the smaller model.
Key Technical Findings
- Set up Triton serving two models: Llama 3 8B for simple queries and Llama 3 70B for complex ones.
- Implemented a BLS ensemble model as the router.
- Routed requests based on prompt token length and complexity classifier output.
- Measured latency, cost savings from offloading simple requests to the smaller model.
Commands Used
`tritonserver --model-repository=/models --model-control-mode=explicit --load-model=llama3-8b --load-model=llama3-70b --load-model=router`
`curl -X POST http://localhost:8000/v2/models/router/generate -d '{"text_input":"what is 2+2","max_tokens":16}'`
`python router_accuracy_eval.py --dataset mixed_complexity_1000 --routing_threshold 256`
`python cost_analysis.py --8b_cost 0.0001 --70b_cost 0.0007 --routing_log router.log`
Lessons Learned
- Routing by token length alone gives 70-80% of the benefit with zero ML overhead.
- A small classifier model (BERT-base) routing by task type improves accuracy by 8%.
- Always pre-warm all models before accepting traffic - cold ensemble start is 12s.
- Monitor per-model queue depth to detect routing imbalance in production.
Next Steps
Train a lightweight complexity classifier fine-tuned on your task distribution. Add latency-based dynamic rerouting when one model queue exceeds threshold.