Model Routing
Send each request to the cheapest model that can handle it
The Problem: Your app sends every request to one big, expensive model — even "what are your hours?" gets the same flagship treatment as a multi-step reasoning task. You are paying surgeon prices to bandage paper cuts, and at scale that wasted spend dominates your bill.
The Solution: Model Routing — Right Model for Each Request
Model routing is the practice of directing each incoming request to the cheapest model that can still handle it correctly, instead of paying flagship prices for every call. A small component — the router — sits in front of your models, looks at the request, and decides where to send it. Because most production traffic is easy (lookups, FAQ-style questions, short extractions) and only a minority is genuinely hard (multi-step reasoning, tricky code, edge cases), sending everything to one big model wastes money on the easy majority. Open-source and commercial routers like RouteLLM popularised this in 2024-2026, reporting 50-85% cost reductions at near-identical quality.
Classifier routing vs cascade
There are two core strategies. A classifier router classifies difficulty or request type upfront: a lightweight model (or a learned classifier) scores how hard the request is before anything answers it, then picks the cheapest model above that bar. That costs one answering-model call per request, plus the small classification step. A cascade works the other way: it tries the cheap model first and escalates to a stronger one only when confidence is low (for example, the small model hedges, refuses, or its self-rated certainty drops below a threshold). Cascades can be more accurate because the hard cases are caught after a real attempt, but they pay for two calls on every escalated request, and the user waits for both. Classifier routers are cheaper per request but depend on the classifier being right. Two additions round out the picture: semantic routing sends requests by intent (a coding question vs a billing question) to a specialised model, and a fallback path fails over to another model or provider when the chosen one errors or times out.
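The difference between the two strategies is easiest to see side by side. The sketch below is illustrative only: the `Model` interface (a function returning an answer plus a self-reported confidence), the `Routed` record, the thresholds, and the toy stand-in models are all assumptions made for this example, not part of any particular router library.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical model interface: prompt -> (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Routed:
    answer: str
    model_used: str
    calls: int  # number of answering-model calls paid for


def classifier_route(prompt: str, difficulty: Callable[[str], float],
                     small: Model, large: Model,
                     threshold: float = 0.5) -> Routed:
    """Decide upfront: score difficulty, then make exactly one answering call."""
    if difficulty(prompt) < threshold:
        answer, _ = small(prompt)
        return Routed(answer, "small", 1)
    answer, _ = large(prompt)
    return Routed(answer, "large", 1)


def cascade_route(prompt: str, small: Model, large: Model,
                  min_confidence: float = 0.7) -> Routed:
    """Try the small model first; escalate only if its confidence is low."""
    answer, confidence = small(prompt)
    if confidence >= min_confidence:
        return Routed(answer, "small", 1)
    answer, _ = large(prompt)
    return Routed(answer, "large", 2)  # paid for both attempts


# Toy stand-ins so the sketch runs without any API (assumptions, not real models).
def toy_difficulty(prompt: str) -> float:
    return min(1.0, len(prompt.split()) / 20)


def toy_small(prompt: str) -> Tuple[str, float]:
    confident = len(prompt.split()) < 10
    return "small-answer", 0.9 if confident else 0.3


def toy_large(prompt: str) -> Tuple[str, float]:
    return "large-answer", 0.95


easy = "what are your hours?"
hard = " ".join(["word"] * 25)  # stands in for a long multi-step request
print(classifier_route(easy, toy_difficulty, toy_small, toy_large))
print(cascade_route(hard, toy_small, toy_large))
```

Note the `calls` field: the classifier router always pays for one answer, while the cascade pays for two whenever it escalates.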
The cost / quality / latency tradeoff, and how to tune it
Every router balances three dials: cost, quality, and latency. Push more traffic to the small model and you save money and time; push more to the big model and quality rises but so does the bill. The danger is mis-routing: a genuinely hard request sent to the cheap model can produce a confidently wrong answer. You bound that risk two ways: a conservative threshold that errs toward the strong model when unsure, and a confidence-based fallback that escalates low-confidence cheap answers. To tune the thresholds, you don't guess. Log every routed request with its predicted difficulty, the model used, the cost, and a quality signal (user feedback or an LLM-as-judge score), then sweep the threshold offline on that data and pick the point that maximises savings while keeping quality above your floor. Worked example: a chat product gets 100,000 requests/day. Routing the easy 80% to a model at a tenth of the price and keeping the hard 20% on the flagship leaves 0.8 × 0.1 + 0.2 × 1 = 0.28 of the original bill, a cut of roughly 70%. The easy majority never needed the expensive model in the first place.
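The offline sweep can be sketched in a few lines. This is a minimal illustration under a strong assumption: each logged request has been replayed through both models and scored by a judge, so a quality estimate exists for either routing choice. The `LoggedRequest` fields, the function names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LoggedRequest:
    difficulty: float     # router's predicted difficulty in [0, 1]
    small_quality: float  # judge score if the small model answers (assumed available)
    large_quality: float  # judge score if the large model answers (assumed available)


def evaluate(logs: List[LoggedRequest], threshold: float,
             small_cost: float, large_cost: float) -> Tuple[float, float]:
    """Mean cost and mean quality if requests below the threshold go to the small model."""
    cost = quality = 0.0
    for r in logs:
        if r.difficulty < threshold:
            cost += small_cost
            quality += r.small_quality
        else:
            cost += large_cost
            quality += r.large_quality
    return cost / len(logs), quality / len(logs)


def best_threshold(logs: List[LoggedRequest], quality_floor: float,
                   small_cost: float, large_cost: float,
                   steps: int = 10) -> Optional[float]:
    """Sweep thresholds in [0, 1]; return the cheapest one that meets the quality floor."""
    best: Optional[Tuple[float, float]] = None  # (threshold, mean cost)
    for i in range(steps + 1):
        t = i / steps
        cost, quality = evaluate(logs, t, small_cost, large_cost)
        if quality >= quality_floor and (best is None or cost < best[1]):
            best = (t, cost)
    return None if best is None else best[0]


# Made-up log sample: three easy requests, one borderline, one hard.
logs = [
    LoggedRequest(0.1, 1.0, 1.0),
    LoggedRequest(0.2, 1.0, 1.0),
    LoggedRequest(0.3, 1.0, 1.0),
    LoggedRequest(0.6, 0.9, 1.0),
    LoggedRequest(0.9, 0.2, 1.0),
]
print(best_threshold(logs, quality_floor=0.95, small_cost=0.05, large_cost=2.0))
```

Raising the quality floor pushes the chosen threshold down (more traffic to the large model); if no threshold meets the floor, the function returns `None` rather than silently picking a bad one.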
Think of it like a hospital triage nurse — simple cases go to a GP, complex ones to a specialist, instead of sending every patient to the surgeon:
- 1. Classify the incoming request: Judge difficulty and type before answering — with a lightweight classifier, a cheap first-pass model, or simple rules. The goal is a cheap, fast signal of how hard this request is
- 2. Pick the cheapest capable model: Map the predicted difficulty to a model tier: small model for easy, mid for moderate, flagship for hard. Choose the cheapest tier that clears the quality bar your task needs
- 3. Optionally cascade on low confidence: Let the cheap model try first; if it hedges, refuses, or its confidence falls below the threshold, escalate to a stronger model. This catches hard cases the classifier missed
- 4. Track cost and quality to tune thresholds: Log every route with its difficulty, model, cost, and a quality signal. Sweep the threshold offline and pick the point that maximises savings while keeping quality above your floor
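The steps above can be sketched as a small pipeline. Everything here is an assumption for illustration: the three tier names, the per-request prices in cents, the keyword list, and the crude rule-based scorer, which a real system would likely replace with a trained classifier. The cascade step is omitted to keep the sketch short.

```python
from typing import Dict, List, Tuple

# Hypothetical tiers: (max difficulty handled, tier name, cost per request in cents).
TIERS: List[Tuple[float, str, float]] = [
    (0.33, "small", 0.05),
    (0.66, "mid", 0.5),
    (1.00, "flagship", 2.0),
]

# Crude, hypothetical signals that a request needs multi-step reasoning.
HARD_HINTS = ("debug", "step by step", "refactor", "trade-off")


def classify(prompt: str) -> float:
    """Step 1: a cheap rule-based difficulty score in [0, 1]."""
    text = prompt.lower()
    score = min(len(text.split()) / 60, 0.5)             # longer prompts lean harder
    score += 0.25 * sum(hint in text for hint in HARD_HINTS)
    return min(score, 1.0)


def pick_tier(difficulty: float) -> Tuple[str, float]:
    """Step 2: the cheapest tier whose ceiling covers the predicted difficulty."""
    for ceiling, name, cost in TIERS:
        if difficulty <= ceiling:
            return name, cost
    return TIERS[-1][1], TIERS[-1][2]


route_log: List[Dict] = []  # Step 4: every route is recorded for offline tuning


def route(prompt: str) -> Dict:
    difficulty = classify(prompt)
    tier, cost = pick_tier(difficulty)
    record = {
        "difficulty": round(difficulty, 3),
        "model": tier,
        "cost_cents": cost,
        "quality": None,  # filled in later from user feedback or a judge score
    }
    route_log.append(record)
    return record


for p in ("What are your hours?",
          "Refactor this and debug it",
          "Debug this function and refactor it step by step"):
    print(route(p)["model"], "<-", p)
```

Keeping the tier table as data rather than code means the thresholds can be changed after an offline sweep without touching the routing logic.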
Where to Apply Model Routing
- High-volume chat products: A support or consumer chatbot serving millions of messages a day: route the ~80% of simple FAQ-style turns to a small model and escalate only ambiguous or multi-step conversations to the flagship — the single biggest lever on the bill
- RAG and agent steps: Inside a pipeline, different steps have different difficulty: query rewriting and extraction are cheap-model work, final synthesis and tool-use planning may need a strong model. Route per step instead of using one model for the whole chain
- Cost optimization at scale: When request volume is high, even a small per-request saving multiplies into thousands of dollars per month. Routing typically cuts 50-85% of spend with quality within a couple of percent — the classic RouteLLM result
- Latency-sensitive routing: Beyond cost, small models answer faster. Route latency-critical turns (autocomplete, real-time chat) to a fast small model and reserve the slower flagship for the requests that truly need its reasoning
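Per-step and latency-sensitive routing can be as simple as a lookup table with an ordered fallback list per step. The sketch below is one possible shape, not a prescribed design: the step names, tier names, and the `clients` mapping of plain callables are assumptions, and a real client would raise its own timeout or provider errors.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step -> ordered model preference.
# The first entry is the primary model; the rest are fallbacks.
STEP_ROUTES: Dict[str, List[str]] = {
    "query_rewrite": ["small", "mid"],
    "extraction": ["small", "mid"],
    "tool_planning": ["flagship", "mid"],
    "final_synthesis": ["flagship", "mid"],
    "autocomplete": ["small"],  # latency-critical: small model only
}


def run_step(step: str, prompt: str,
             clients: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Call the step's primary model; fail over down the list on any error."""
    last_error = None
    for name in STEP_ROUTES[step]:
        try:
            return name, clients[name](prompt)
        except Exception as exc:  # e.g. a timeout or provider error
            last_error = exc
    raise RuntimeError(f"all models failed for step {step!r}") from last_error


# Toy clients: the flagship is "down" to show the fallback path.
def healthy(prompt: str) -> str:
    return "ok:" + prompt


def unavailable(prompt: str) -> str:
    raise TimeoutError("provider timed out")


clients = {"small": healthy, "mid": healthy, "flagship": unavailable}
print(run_step("query_rewrite", "rewrite this query", clients))
print(run_step("final_synthesis", "write the answer", clients))
```

In this toy run the synthesis step falls back to the mid tier because the flagship client raises, while the cheap steps never touch the flagship at all.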
Fun Fact: RouteLLM showed that a well-trained router can retain about 95% of GPT-4's performance while sending most queries to a model ~25x cheaper, cutting cost by over 85% on MT Bench (savings on other benchmarks were smaller). The router itself is tiny: the intelligence is in knowing which questions are actually hard.
Frequently asked questions
What is model routing for LLMs?
Model routing directs each incoming request to the cheapest model that can answer it correctly. A lightweight classifier (or a cheap first-pass model) judges difficulty, sends easy requests to a small model and only escalates hard ones to a flagship model. Routers like RouteLLM cut costs by 50-85% while keeping quality nearly identical.
What is the difference between a classifier router and a cascade?
A classifier router decides upfront, before any answer, which model to use — one model call per request. A cascade tries the cheap model first, then escalates to a stronger model only when the cheap model's confidence is low. Cascades can be more accurate but may pay for two calls on hard requests; classifier routers are cheaper but depend on the classifier being right.
How do you tune routing thresholds without hurting quality?
Log every routed request with its predicted difficulty, the model used, the cost, and a quality signal (user feedback or an LLM-as-judge score). Then sweep the threshold offline on this data: raising it sends more traffic to the small model (cheaper, riskier), lowering it sends more to the big model (safer, costlier). Pick the threshold that maximizes cost savings while keeping quality above your floor, and bound risk with a confidence-based fallback to the strong model.
Try it yourself
Goal: Serve 100,000 support chatbot requests/day at minimum cost without losing quality.
Without routing: All 100,000 requests hit the large model at ~2¢ each → $2,000/day. Quality is excellent (99%), but ~80% of requests are simple FAQs the flagship is overkill for. That is paying surgeon prices to bandage paper cuts.
With routing: 80% (easy) → small model, 20% (hard) → large. Cost: 80k × 0.05¢ + 20k × 2¢ = $40 + $400 = $440/day, down from $2,000/day. ~78% savings at 98% quality. The easy majority never needed the flagship.
Takeaway: Don't pay flagship prices for every request. A router sends the easy majority to a cheap model and escalates only the hard cases, for ~78% savings at nearly the same quality (98% vs 99%).
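The arithmetic for this scenario can be checked with a short script. It uses the figures quoted above (100,000 requests/day, 2¢ per large-model call, 0.05¢ per small-model call, an 80/20 easy/hard split); the function name and parameter names are made up for this example.

```python
def daily_cost_usd(requests: int, easy_share: float,
                   small_cents: float, large_cents: float) -> float:
    """Daily cost in dollars when `easy_share` of traffic goes to the small model."""
    easy = requests * easy_share
    hard = requests - easy
    return (easy * small_cents + hard * large_cents) / 100  # cents -> dollars


baseline = daily_cost_usd(100_000, 0.0, 0.05, 2.0)  # everything on the large model
routed = daily_cost_usd(100_000, 0.8, 0.05, 2.0)    # easy 80% on the small model
savings = 1 - routed / baseline

print(f"baseline: ${baseline:,.0f}/day")
print(f"routed:   ${routed:,.0f}/day")
print(f"savings:  {savings:.0%}")
```

The small-model share contributes only about $40 of the routed total; almost all of the remaining spend is the hard 20% that stays on the large model, so further savings depend on how much of that 20% can safely be moved down.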