Lesson 1

Model Selection Guide

Choosing the right model

The Problem: There are dozens of LLMs available (GPT-5, Claude, o3, Gemini, Llama, DeepSeek, and more), plus reasoning models that think before answering. How do you choose the right model for your specific use case?

The Solution: Choose the Right Tool for the Job

Model selection is the engineering decision of matching your requirements — speed, cost, accuracy, context length, and modality — to the right model. There is no single "best" LLM: a model that tops a leaderboard can be wasteful for routing support tickets and still too weak for proving a math theorem. The job is to find the cheapest, fastest model that clears the quality bar your task actually requires. Think of it like choosing a vehicle — a sports car, a truck, and a bicycle are each "best" only for a specific trip.

How to evaluate a model

Start from the task, not the hype. Public benchmarks like MMLU or HumanEval are a coarse filter for shortlisting candidates, but they rarely predict performance on your data — so build a small eval set of 50–100 real examples and score each candidate on it. Then weigh three operational axes: latency (time to first token and full response), cost (price per million input and output tokens, which differ), and the context window you need. Capabilities matter too: only some models do native tool use, vision, or long-context retrieval well. For privacy or compliance, open-weight models you self-host (Llama, Mistral) trade convenience for control; quantization can shrink them to fit your hardware at a small quality cost.
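The "score each candidate on your own eval set" step can be sketched as below. This is a minimal illustration, not a real harness: a "model" is assumed to be any Python function that maps a prompt to a reply, and `keyword_model`, `always_auth`, and the four-example `EVAL_SET` are hypothetical stand-ins for real API clients and a real 50-100 example set.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a candidate model is any callable from prompt text to reply text.
Model = Callable[[str], str]

def accuracy(model: Model, eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the reply equals the expected label."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    correct = sum(
        1 for prompt, expected in eval_set
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(eval_set)

def rank_candidates(
    models: Dict[str, Model], eval_set: List[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Score every candidate on the same eval set and sort best first."""
    scores = [(name, accuracy(fn, eval_set)) for name, fn in models.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy eval set (illustrative only; use 50-100 real examples in practice).
EVAL_SET = [
    ("Can't log in, forgot password", "auth"),
    ("I was charged twice this month", "billing"),
    ("App crashes when I upload a photo", "bug"),
    ("Please add dark mode", "feature_request"),
]

def keyword_model(prompt: str) -> str:
    """Hypothetical candidate: a trivial keyword classifier."""
    text = prompt.lower()
    if "password" in text or "log in" in text:
        return "auth"
    if "charged" in text or "invoice" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "feature_request"

def always_auth(prompt: str) -> str:
    """Hypothetical weak candidate that ignores its input."""
    return "auth"
```

Exact-match accuracy suits classification; for free-form answers you would swap in a scoring function that fits the task.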

Tradeoffs, pitfalls, and a worked example

The biggest pitfall is over-provisioning: paying flagship prices for tasks a small model handles fine. The opposite trap is under-provisioning a genuinely hard reasoning task and shipping wrong answers. A powerful pattern is model routing, where a cheap classifier sends easy requests to a small model and escalates only the hard ones.

Worked example: a support bot gets 10,000 tickets/day. Sending all of them to a flagship model at, say, $5 per million output tokens adds up quickly; the exact bill depends on how many tokens each ticket consumes, and with long multi-turn conversations it can reach hundreds of dollars daily. Instead, route the ~80% of simple FAQ-style tickets to a small model (Haiku / GPT-4o Mini) at roughly a tenth of the price, and escalate only the ~20% of ambiguous or multi-step cases to the flagship. The blended cost is then about 0.8 × 0.1 + 0.2 × 1 = 0.28 of the original, a 72% reduction before the classifier's own overhead. On a representative eval this typically cuts total spend by 60–70% while keeping answer quality within a couple of percent, because the small model was already good enough for the easy majority.
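The routing arithmetic in the worked example can be made parametric. The sketch below counts output tokens only and ignores the classifier's own cost. `TOKENS_PER_TICKET` is an assumption, since the text does not give a per-ticket token count. It changes the dollar amounts but not the savings ratio.

```python
def daily_cost(tickets: int, tokens_per_ticket: int, price_per_million: float) -> float:
    """Output-token cost in dollars for one day of traffic."""
    return tickets * tokens_per_ticket * price_per_million / 1_000_000

def routed_cost(
    tickets: int,
    tokens_per_ticket: int,
    flagship_price: float,
    small_price: float,
    easy_share: float,
) -> float:
    """Daily cost when the easy share goes to the small model."""
    easy = round(tickets * easy_share)
    hard = tickets - easy
    return (
        daily_cost(easy, tokens_per_ticket, small_price)
        + daily_cost(hard, tokens_per_ticket, flagship_price)
    )

TICKETS = 10_000          # from the worked example
FLAGSHIP_PRICE = 5.00     # $ per 1M output tokens, from the worked example
SMALL_PRICE = 0.50        # "roughly a tenth of the price"
EASY_SHARE = 0.80         # share of simple FAQ-style tickets
TOKENS_PER_TICKET = 500   # assumption: not stated in the text

baseline = daily_cost(TICKETS, TOKENS_PER_TICKET, FLAGSHIP_PRICE)
routed = routed_cost(TICKETS, TOKENS_PER_TICKET, FLAGSHIP_PRICE, SMALL_PRICE, EASY_SHARE)
savings = 1 - routed / baseline
```

Under these assumptions the baseline is $25/day, the routed cost is $7/day, and the savings come to 72%. Input tokens and router overhead would pull a real figure somewhat lower.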

As with choosing a vehicle, start from the constraint that matters most:

1. Latency < 500ms AND quality critical: Use GPT-4o or Claude Sonnet — best balance of speed and intelligence
2. Cost < $0.01/request AND simple task: Use GPT-4o Mini or Claude Haiku — 10-20x cheaper, great for classification, extraction, FAQ
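3. Context > 100K tokens: Check the context window first. Per the comparison table, Claude offers 200K, GPT-5 400K, and Gemini 1M; most other listed models stop at 128K, so larger inputs require document chunking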
4. Complex math / logic / hard reasoning: Use reasoning models (o3, o4-mini) — they use thinking tokens for step-by-step reasoning, but cost more due to hidden token usage
5. On-premise / data privacy required: Use Llama or Mistral — open-weight models you can host yourself
6. Always test on YOUR data: Run 50-100 real examples through each candidate model before committing, because benchmarks lie and your evals don't
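The rules above can be encoded as a small shortlisting function. This is a sketch under stated assumptions: the `Requirements` fields, the order in which rules are checked (hard constraints first), and the use of rule 1 as the default are illustrative choices, not part of the original list. The output is a set of candidates to evaluate per rule 6, not a final pick.

```python
from dataclasses import dataclass
from typing import List

# Context windows quoted in rule 3, in tokens.
FAMILY_WINDOWS = {"Claude": 200_000, "Gemini": 1_000_000}

@dataclass
class Requirements:
    """Hypothetical description of what a task needs."""
    self_hosted: bool = False
    context_tokens: int = 0
    hard_reasoning: bool = False
    simple_task: bool = False

def shortlist(req: Requirements) -> List[str]:
    """Return model families to run through your own evals."""
    if req.self_hosted:                        # rule 5
        return ["Llama", "Mistral"]
    if req.context_tokens > 100_000:           # rule 3
        # An empty result means no quoted window fits: chunk the input.
        return [
            name for name, window in FAMILY_WINDOWS.items()
            if window >= req.context_tokens
        ]
    if req.hard_reasoning:                     # rule 4
        return ["o3", "o4-mini"]
    if req.simple_task:                        # rule 2
        return ["GPT-4o Mini", "Claude Haiku"]
    return ["GPT-4o", "Claude Sonnet"]         # rule 1, used here as the default
```

For example, `shortlist(Requirements(context_tokens=500_000))` keeps only Gemini, because the 200K window quoted for Claude is too small.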

Key Selection Criteria

Quality: Benchmark scores (MMLU, HumanEval) matter less than eval on YOUR data — always test with real examples from your domain
Model Routing: Use a lightweight classifier to route easy tasks (FAQ, extraction) to cheap models and hard tasks (reasoning, coding) to flagship models; when most traffic is easy, this saves roughly 60-70% with minimal quality loss
Cost vs Latency: Flagship models are 10-30x more expensive and 2-5x slower — justify the upgrade with measurable quality difference on your evals
Context Window: Need 100K+ tokens? Check the window before you commit: in the comparison table, Claude offers 200K, GPT-5 400K, and Gemini 1M, while most other models stop at 128K and need chunking strategies beyond that
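A router of the kind described under Model Routing might look like the sketch below. The "lightweight classifier" here is a deliberately crude heuristic (word count plus a few phrases). `ESCALATION_HINTS`, `max_words`, and the callable `Model` type are hypothetical. In practice the classifier is more likely a small trained model or a cheap LLM call, tuned against your eval set.

```python
from typing import Callable, Dict

# Assumption: a model is any callable from ticket text to answer text.
Model = Callable[[str], str]

# Hypothetical phrases suggesting an ambiguous or multi-step ticket.
ESCALATION_HINTS = ("and then", "after that", "not sure", "several", "but also")

def is_easy(ticket: str, max_words: int = 30) -> bool:
    """Cheap stand-in classifier: short tickets without escalation hints."""
    text = ticket.lower()
    if len(text.split()) > max_words:
        return False
    return not any(hint in text for hint in ESCALATION_HINTS)

def route(ticket: str, small: Model, flagship: Model) -> Dict[str, str]:
    """Send easy tickets to the small model and escalate the rest."""
    tier = "small" if is_easy(ticket) else "flagship"
    model = small if tier == "small" else flagship
    return {"tier": tier, "answer": model(ticket)}
```

Logging the chosen tier alongside each answer makes it possible to measure the real easy/hard split and the quality of each tier separately.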

Fun Fact: An A/B test of model routing in production showed that sending 80% of support tickets to Haiku cut the cost of those tickets by 85% with only a 2% quality drop. The remaining 20% of complex cases went to Sonnet, for a total cost reduction of 70% with near-identical user satisfaction.

Try It Yourself!

Explore different models and their trade-offs for various use cases.

Model Comparison

Model	Provider	Context	Price (input)	Quality	Best for
GPT-5	OpenAI	400K	$1.25/1M	Top	General purpose, Agents
Claude Opus 4.5	Anthropic	200K	$15.00/1M	Top	Research, Complex analysis
Claude Sonnet 4	Anthropic	200K	$3.00/1M	Top	Coding, Analysis
o3	OpenAI	200K	$2.00/1M	Top	Complex reasoning, Math
GPT-4o	OpenAI	128K	$2.50/1M	High	Chat, Vision
Gemini 2.5 Pro	Google	1M	$1.25/1M	High	Long documents, Reasoning
DeepSeek V3 (OSS)	DeepSeek	128K	$0.27/1M	High	Budget projects, Coding
Qwen 2.5 72B (OSS)	Alibaba	128K	Self-hosted	High	Asian languages, Self-hosted
Mistral Large 2	Mistral	128K	$2.00/1M	High	EU compliance, Cost-effective
Llama 3.3 70B (OSS)	Meta	128K	Self-hosted	High	Privacy-sensitive, Fine-tuning
o4-mini	OpenAI	200K	$1.10/1M	High	Budget reasoning, Math
Gemini 2.5 Flash	Google	1M	$0.30/1M	Medium	High volume, Long documents
GPT-4o mini	OpenAI	128K	$0.15/1M	Medium	High volume, Simple tasks
Claude 3.5 Haiku	Anthropic	200K	$0.80/1M	Medium	Classification, Simple tasks
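To filter and sort the comparison table by your own constraints, a sketch like the following may help. The rows copy the table's context, input-price, and quality columns. The `ModelRow` type and `filter_models` function are illustrative names. Self-hosted rows have no per-token price, so this sketch leaves them out of price-sorted results.

```python
from typing import List, NamedTuple, Optional, Tuple

class ModelRow(NamedTuple):
    name: str
    context_k: int              # context window, thousands of tokens
    price_in: Optional[float]   # $ per 1M input tokens; None = self-hosted
    quality: str

MODELS = [
    ModelRow("GPT-5", 400, 1.25, "Top"),
    ModelRow("Claude Opus 4.5", 200, 15.00, "Top"),
    ModelRow("Claude Sonnet 4", 200, 3.00, "Top"),
    ModelRow("o3", 200, 2.00, "Top"),
    ModelRow("GPT-4o", 128, 2.50, "High"),
    ModelRow("Gemini 2.5 Pro", 1000, 1.25, "High"),
    ModelRow("DeepSeek V3", 128, 0.27, "High"),
    ModelRow("Qwen 2.5 72B", 128, None, "High"),
    ModelRow("Mistral Large 2", 128, 2.00, "High"),
    ModelRow("Llama 3.3 70B", 128, None, "High"),
    ModelRow("o4-mini", 200, 1.10, "High"),
    ModelRow("Gemini 2.5 Flash", 1000, 0.30, "Medium"),
    ModelRow("GPT-4o mini", 128, 0.15, "Medium"),
    ModelRow("Claude 3.5 Haiku", 200, 0.80, "Medium"),
]

def filter_models(
    min_context_k: int = 0,
    max_price_in: Optional[float] = None,
    qualities: Tuple[str, ...] = ("Top", "High", "Medium"),
) -> List[ModelRow]:
    """Hosted models meeting the constraints, cheapest input price first."""
    rows = [
        m for m in MODELS
        if m.price_in is not None          # skip self-hosted rows
        and m.context_k >= min_context_k
        and m.quality in qualities
        and (max_price_in is None or m.price_in <= max_price_in)
    ]
    return sorted(rows, key=lambda m: m.price_in)
```

For instance, `filter_models(min_context_k=200, qualities=("Top",))` lists GPT-5, o3, Claude Sonnet 4, and Claude Opus 4.5 in that order. Prices change often, so treat the table values as a snapshot.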

Quick Decision

Need the best: Claude Opus 4.5 / GPT-5
Hard reasoning: o3 / o4-mini
Best for coding: Claude Sonnet 4
Save money: DeepSeek V3 / GPT-4o mini / Gemini 2.5 Flash
Long docs: Gemini 2.5 Pro (1M tokens)
Privacy: Llama 3.3 / Qwen 2.5 (self-hosted)

Frequently asked questions

How do I choose the right LLM for my task?

Start from the task, not the leaderboards. Build an eval set of 50–100 real examples from your domain and run each candidate model on it. Then weigh cost (price per million input and output tokens), latency (time to first token and full response), and the context window you need. Pick the cheapest, fastest model that clears the quality bar your task actually requires — not the most powerful one.
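Both latency figures mentioned above, time to first token and full response time, can be measured from any streaming response. The sketch below assumes the response is exposed as a Python iterable of text chunks. `fake_stream` is a hypothetical stand-in for a real streaming client, which you would create inside the timed region so that request setup is included.

```python
import time
from typing import Iterable, Iterator, List, Tuple

def measure_latency(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_token_s, total_time_s, full_text)."""
    start = time.perf_counter()
    first = None
    parts: List[str] = []
    for chunk in stream:
        if first is None:
            # First chunk arrived: record time to first token.
            first = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first is None:
        first = total  # empty stream: no token ever arrived
    return first, total, "".join(parts)

def fake_stream(chunks: List[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for chunk in chunks:
        time.sleep(delay_s)  # simulate generation delay per chunk
        yield chunk
```

Run it over many requests and compare medians and tail percentiles rather than single samples, since latency varies with load and output length.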

When is a flagship model worth the extra cost over a small one?

Flagship models (GPT-5, Claude Sonnet, o3) are typically 10–30x more expensive and 2–5x slower than small ones (Haiku, GPT-4o Mini). The upgrade is justified only when your own evals show a measurable quality gain — for example on hard reasoning, math, or coding. For classification, extraction, and FAQ tasks a small model usually performs just as well at a fraction of the cost.

What is model routing and how much can it save?

Model routing uses a cheap classifier to decide where each request goes: easy tasks (FAQ, extraction) go to a small model, and hard ones (reasoning, multi-step cases) escalate to a flagship. When ~80% of traffic is simple, this typically cuts total spend by 60–70% with only a couple of percent of quality loss, because the small model is already good enough for the easy majority.

Can I trust benchmarks like MMLU and HumanEval when picking a model?

Benchmarks are only a coarse filter for shortlisting candidates — they rarely predict performance on your data. Models can be tuned toward popular tests, and your task differs in domain, format, and language. Always make the final call with your own eval set of real examples: benchmarks lie, your evals don't.

Try it yourself

Task (Beginner, Analysis): Choose the right model for a support ticket categorization task

Without technique

Prompt

Classify support ticket: "Can't log into my account, forgot password". Categories: billing, auth, bug, feature_request.

Response

This request belongs to the "auth" (authentication) category. The user is experiencing difficulty logging in due to a lost password. This is a typical authentication issue that is resolved through the password reset mechanism. I recommend sending the user a password recovery link and checking if the account is locked.

Tokens (input/output): 45/120

Time: 2800 ms

With Model Selection

Prompt

Classify ticket. Reply with ONE WORD from the list: billing, auth, bug, feature_request. Ticket: "Can't log into my account, forgot password"

Response

auth

👁️ Simple classification doesn't need GPT-4: a lightweight model (GPT-4o-mini, Haiku) is enough

🧠 Prompt trimmed: verbosity removed, response limited to one word → fewer output tokens

🔢 Cost: GPT-4 ~$0.03/request vs GPT-4o-mini ~$0.0003/request, a 100x difference

✅ Rule: match the model to task complexity; don't use a cannon to kill a sparrow

Tokens (input/output): 35/2

Time: 180 ms

Why this works

Not every task requires the most powerful model. For simple classification, a lightweight model gives the same result 15x faster and 100x cheaper.
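The trimmed prompt from the demo pairs naturally with a strict check on the reply, so that anything other than one allowed word can be retried or escalated. A minimal sketch, with `build_prompt` and `parse_label` as illustrative helper names:

```python
from typing import Optional, Sequence

CATEGORIES = ("billing", "auth", "bug", "feature_request")

def build_prompt(ticket: str, categories: Sequence[str] = CATEGORIES) -> str:
    """Build the one-word classification prompt used in the demo."""
    options = ", ".join(categories)
    return (
        f"Classify ticket. Reply with ONE WORD from the list: {options}. "
        f'Ticket: "{ticket}"'
    )

def parse_label(reply: str, categories: Sequence[str] = CATEGORIES) -> Optional[str]:
    """Return the category if the reply is exactly one allowed word, else None."""
    label = reply.strip().strip(".\"'").lower()
    return label if label in categories else None
```

A `None` result signals that the small model drifted from the format, which is a reasonable trigger for a retry or for escalation to a larger model.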

Related lessons: Benchmarks, Cost Optimization

This lesson is part of a structured LLM course.
