Lesson 13: Optimization

Model Routing

Send each request to the cheapest model that can handle it

The Problem: Your app sends every request to one big, expensive model — even "what are your hours?" gets the same flagship treatment as a multi-step reasoning task. You are paying surgeon prices to bandage paper cuts, and at scale that wasted spend dominates your bill.

The Solution: Model Routing — Right Model for Each Request

Model routing is the practice of directing each incoming request to the cheapest model that can still handle it correctly, instead of paying flagship prices for every call. A small component, the router, sits in front of your models, looks at the request, and decides where to send it. Because most production traffic is easy (lookups, FAQ-style questions, short extractions) and only a minority is genuinely hard (multi-step reasoning, tricky code, edge cases), sending everything to one big model wastes money on the easy majority. Open-source and commercial routers such as RouteLLM popularised the approach from 2024 onward, with reported cost reductions of roughly 50-85% at near-identical quality; the exact figure depends on the traffic mix and the models involved.

Classifier routing vs cascade

There are two core strategies. A classifier router classifies difficulty or request type upfront: a lightweight model (or a learned classifier) scores how hard the request is before any answer is generated, then picks the cheapest model above that bar, so each request costs one answering-model call plus the cheap classification step. A cascade works the other way: it tries the cheap model first and escalates to a stronger one only when confidence is low (for example, the small model hedges, refuses, or its self-rated certainty drops below a threshold). Cascades can be more accurate because the hard cases are caught after a real attempt, but they pay for two calls on every escalated request, and escalation adds latency. Classifier routers are cheaper per request but depend on the classifier being right.

Two related mechanisms complement these strategies. Semantic routing adds intent to the picture: it routes by what the user is asking for (a coding question vs a billing question) to a specialised model. A fallback path fails over to another model or provider when the chosen one errors or times out.
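The two core strategies above can be sketched side by side. This is a minimal illustration, not a reference implementation: the `Model` callable, the toy models, and the default thresholds are all assumptions made for the example, and real APIs rarely return a calibrated confidence value.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a "model" is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], tuple[str, float]]


@dataclass
class Routed:
    answer: str
    model_used: str
    calls: int  # number of answering-model calls paid for


def classifier_route(prompt: str, score_difficulty: Callable[[str], float],
                     small: Model, large: Model, threshold: float = 0.5) -> Routed:
    """Decide upfront: one answering call, chosen by a predicted difficulty score."""
    if score_difficulty(prompt) < threshold:
        answer, _ = small(prompt)
        return Routed(answer, "small", calls=1)
    answer, _ = large(prompt)
    return Routed(answer, "large", calls=1)


def cascade_route(prompt: str, small: Model, large: Model,
                  min_confidence: float = 0.7) -> Routed:
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = small(prompt)
    if confidence >= min_confidence:
        return Routed(answer, "small", calls=1)
    answer, _ = large(prompt)
    return Routed(answer, "large", calls=2)


# Toy stand-ins so the sketch runs without any API (hypothetical behaviour).
def toy_small(prompt: str) -> tuple[str, float]:
    confident = len(prompt.split()) <= 6
    return ("small: " + prompt, 0.9 if confident else 0.3)


def toy_large(prompt: str) -> tuple[str, float]:
    return ("large: " + prompt, 0.95)


def toy_difficulty(prompt: str) -> float:
    # Crude proxy: longer prompts are treated as harder.
    return min(len(prompt.split()) / 12, 1.0)
```

The `calls` field makes the tradeoff visible: the classifier router always pays for one answering call, while the cascade pays for two whenever it escalates.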

The cost / quality / latency tradeoff, and how to tune it

Every router balances three dials: cost, quality, and latency. Push more traffic to the small model and you save money and time; push more to the big model and quality rises but so does the bill. The main danger is mis-routing: a genuinely hard request sent to the cheap model can produce a confidently wrong answer. You bound that risk in two ways: a conservative threshold that errs toward the strong model when unsure, and a confidence-based fallback that escalates low-confidence cheap answers.

To tune the thresholds, don't guess. Log every routed request with its predicted difficulty, the model used, the cost, and a quality signal (user feedback or an LLM-as-judge score), then sweep the threshold offline on that data and pick the point that maximises savings while keeping quality above your floor.

Worked example: a chat product gets 100,000 requests/day. Routing the easy 80% to a model at a tenth of the price and keeping the hard 20% on the flagship costs 0.8 × 0.1 + 0.2 × 1 = 0.28 of the original bill, a saving of roughly 72%. The easy majority never needed the expensive model in the first place.
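The offline threshold sweep can be sketched as a replay over logged requests. This is a sketch under stated assumptions: the `LoggedRequest` fields and function names are hypothetical, and it assumes a pass/fail quality signal is available for both models on each logged request (for instance by replaying a sample through both and judging the answers), which production logs alone do not provide.

```python
from dataclasses import dataclass


@dataclass
class LoggedRequest:
    difficulty: float  # router's predicted difficulty in [0, 1]
    small_ok: bool     # assumed quality signal for the small model's answer
    large_ok: bool     # assumed quality signal for the large model's answer


def evaluate(logs, threshold, small_cost, large_cost):
    """Replay logs at one threshold: requests scoring below it go to the small model."""
    cost = 0.0
    passed = 0
    for r in logs:
        if r.difficulty < threshold:
            cost += small_cost
            passed += r.small_ok
        else:
            cost += large_cost
            passed += r.large_ok
    return cost, passed / len(logs)


def sweep(logs, thresholds, small_cost, large_cost, quality_floor):
    """Return (threshold, cost, quality) with the lowest cost that meets the floor, or None."""
    best = None
    for t in thresholds:
        cost, quality = evaluate(logs, t, small_cost, large_cost)
        if quality >= quality_floor and (best is None or cost < best[1]):
            best = (t, cost, quality)
    return best
```

With this convention, raising the threshold sends more traffic to the small model. A stricter `quality_floor` pushes the chosen threshold down, and a looser one lets it rise.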

Think of it like a hospital triage nurse — simple cases go to a GP, complex ones to a specialist, instead of sending every patient to the surgeon:

1. Classify the incoming request: Judge difficulty and type before answering — with a lightweight classifier, a cheap first-pass model, or simple rules. The goal is a cheap, fast signal of how hard this request is
2. Pick the cheapest capable model: Map the predicted difficulty to a model tier: small model for easy, mid for moderate, flagship for hard. Choose the cheapest tier that clears the quality bar your task needs
3. Optionally cascade on low confidence: Let the cheap model try first; if it hedges, refuses, or its confidence falls below the threshold, escalate to a stronger model. This catches hard cases the classifier missed
4. Track cost and quality to tune thresholds: Log every route with its difficulty, model, cost, and a quality signal. Sweep the threshold offline and pick the point that maximises savings while keeping quality above your floor
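Steps 1, 2, and 4 above can be sketched with a rule-based classifier, a tier table, and a log record per route. Everything named here is illustrative: the tier names, the keyword hints, the 25-word cutoff, and the mid-tier price are assumptions (only the ~0.05¢ and ~2¢ figures echo the lesson's demo), and step 3 (the optional cascade) is left out to keep the sketch short.

```python
import time

# Hypothetical tiers: difficulty label -> (model name, price per request in cents).
TIERS = {
    "easy": ("small", 0.05),
    "moderate": ("mid", 0.5),
    "hard": ("flagship", 2.0),
}

# Hypothetical keyword hints that suggest a hard request.
HARD_HINTS = ("bug", "prove", "step by step", "refactor", "why does")


def classify(prompt: str) -> str:
    """Step 1: a cheap rule-based difficulty signal (a learned classifier could replace it)."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    if len(text.split()) > 25:
        return "moderate"
    return "easy"


def route(prompt: str, log: list) -> str:
    """Steps 2 and 4: pick the tier's model and append a record for later tuning."""
    difficulty = classify(prompt)
    model, cost_cents = TIERS[difficulty]
    log.append({
        "ts": time.time(),
        "difficulty": difficulty,
        "model": model,
        "cost_cents": cost_cents,
        "quality": None,  # filled in later from user feedback or a judge score
    })
    return model
```

The `quality` field starts empty on purpose: the quality signal usually arrives after the response, and the offline tuning step needs it joined back onto the routing record.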

Where to Apply Model Routing

High-volume chat products: A support or consumer chatbot serving millions of messages a day: route the ~80% of simple FAQ-style turns to a small model and escalate only ambiguous or multi-step conversations to the flagship — the single biggest lever on the bill
RAG and agent steps: Inside a pipeline, different steps have different difficulty: query rewriting and extraction are cheap-model work, final synthesis and tool-use planning may need a strong model. Route per step instead of using one model for the whole chain
Cost optimization at scale: When request volume is high, even a small per-request saving multiplies into thousands of dollars per month. Reported savings from routing are typically 50-85% of spend with only a small drop in quality, in line with the results published for RouteLLM
Latency-sensitive routing: Beyond cost, small models answer faster. Route latency-critical turns (autocomplete, real-time chat) to a fast small model and reserve the slower flagship for the requests that truly need its reasoning
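Per-step and latency-sensitive routing from the list above can be expressed as a small lookup table. The step names and tier labels below are hypothetical, chosen only to mirror the examples in the text; unknown steps default to the flagship as the conservative choice.

```python
# Hypothetical pipeline steps mapped to model tiers.
STEP_MODELS = {
    "rewrite_query": "small",
    "extract_fields": "small",
    "plan_tools": "flagship",
    "synthesize_answer": "flagship",
}


def model_for_step(step: str, latency_critical: bool = False,
                   default: str = "flagship") -> str:
    """Pick a model per pipeline step; latency-critical calls are pinned to the small model."""
    if latency_critical:
        return "small"
    return STEP_MODELS.get(step, default)
```

Keeping the mapping in data rather than in branching code makes it easy to move a step between tiers once logged quality shows the cheaper model is good enough.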

Fun Fact: RouteLLM showed that a well-trained router can retain about 95% of GPT-4's quality while sending most queries to a model ~25x cheaper, achieving over 85% cost savings on some benchmarks. The router itself is tiny: the intelligence is in knowing which questions are actually hard.

Frequently asked questions

What is model routing for LLMs?

Model routing directs each incoming request to the cheapest model that can answer it correctly. A lightweight classifier (or a cheap first-pass model) judges difficulty, sends easy requests to a small model, and escalates only the hard ones to a flagship model. Routers like RouteLLM have reported cost cuts of 50-85% while keeping quality nearly identical.

What is the difference between a classifier router and a cascade?

A classifier router decides upfront, before any answer, which model to use, so each request costs one answering-model call plus the cheap classification step. A cascade tries the cheap model first, then escalates to a stronger model only when the cheap model's confidence is low. Cascades can be more accurate but pay for two calls on every escalated request; classifier routers are cheaper but depend on the classifier being right.

How do you tune routing thresholds without hurting quality?

Log every routed request with its predicted difficulty, the model used, the cost, and a quality signal (user feedback or an LLM-as-judge score). Then sweep the threshold offline on this data: raising it sends more traffic to the small model (cheaper, riskier), lowering it sends more to the big model (safer, costlier). Pick the threshold that maximizes cost savings while keeping quality above your floor, and bound risk with a confidence-based fallback to the strong model.

Technique Comparison

Task (Intermediate, Analysis)

Serve 100,000 support chatbot requests/day at minimum cost without losing quality

Without technique

Prompt

Route EVERY request to the flagship model (big and expensive), regardless of difficulty. "What are your hours?" and "Find the bug in this code" get the same treatment.

Response

All 100,000 requests hit the large model at ~2¢ each → $2,000/day. Quality is excellent (99%), but ~80% of requests are simple FAQs the flagship is overkill for. Paying surgeon prices to bandage paper cuts.

Tokens: 200/300

Time: 2,200 ms

With technique

Prompt

Put a router in front of the models. Classify each request: simple FAQ → small model (~0.05¢), moderate → medium, hard reasoning → large (~2¢). Add a confidence fallback: if the small model is unsure, escalate to the large one.

Response

80% (easy) → small model, 20% (hard) → large. Cost: 80k × 0.05¢ + 20k × 2¢ = $40 + $400 = $440/day instead of $2,000. ~78% savings at 98% quality: the easy majority never needed the flagship.

👁️Most traffic is easy (FAQ, short extractions) — the flagship is overkill for it

🧠Classify difficulty BEFORE answering and route to the cheapest capable model

🔢80k×0.05¢ + 20k×2¢ = $440 vs $2,000 → ~78% savings, quality within 1%

✅A confidence fallback bounds mis-routing: a hard request that lands on the small model gets escalated

Tokens: 200/300

Time: 700 ms

Why this works

Don't pay flagship prices for every request. A router sends the easy majority to a cheap model and escalates only the hard cases, giving ~78% savings at nearly the same quality (98% vs 99%).
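The cost arithmetic in this comparison can be checked with a few lines of code. The function name and parameters are introduced here for illustration; the prices (0.05¢ and 2¢ per request), the 80/20 split, and the 100,000 requests/day come from the demo above.

```python
def daily_cost_dollars(requests: int, share_small: float,
                       small_cents: float, large_cents: float) -> float:
    """Blended daily cost when `share_small` of traffic goes to the small model."""
    small_requests = requests * share_small
    large_requests = requests - small_requests
    return (small_requests * small_cents + large_requests * large_cents) / 100


baseline = daily_cost_dollars(100_000, 0.0, 0.05, 2.0)  # everything on the flagship
routed = daily_cost_dollars(100_000, 0.8, 0.05, 2.0)    # 80% routed to the small model
savings = 1 - routed / baseline                          # fraction of the bill saved

print(f"${baseline:,.0f}/day -> ${routed:,.0f}/day ({savings:.0%} saved)")
```

Changing `share_small` shows how sensitive the result is to the traffic mix: the saving shrinks quickly if fewer requests turn out to be easy than expected.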

This lesson is part of a structured LLM course.
