Model Selection Guide
Choosing the right model
The Problem: There are dozens of LLMs available — GPT-5, Claude, o3, Gemini, Llama, DeepSeek, and more. Plus reasoning models that think before answering. How do you choose the right model for your specific use case?
The Solution: Choose the Right Tool for the Job
Model selection is the engineering decision of matching your requirements — speed, cost, accuracy, context length, and modality — to the right model. There is no single "best" LLM: a model that tops a leaderboard can be wasteful for routing support tickets and still too weak for proving a math theorem. The job is to find the cheapest, fastest model that clears the quality bar your task actually requires. Think of it like choosing a vehicle — a sports car, a truck, and a bicycle are each "best" only for a specific trip.
How to evaluate a model
Start from the task, not the hype. Public benchmarks like MMLU or HumanEval are a coarse filter for shortlisting candidates, but they rarely predict performance on your data — so build a small eval set of 50–100 real examples and score each candidate on it. Then weigh three operational axes: latency (time to first token and full response), cost (price per million input and output tokens, which differ), and the context window you need. Capabilities matter too: only some models do native tool use, vision, or long-context retrieval well. For privacy or compliance, open-weight models you self-host (Llama, Mistral) trade convenience for control; quantization can shrink them to fit your hardware at a small quality cost.
Tradeoffs, pitfalls, and a worked example
The biggest pitfall is over-provisioning: paying flagship prices for tasks a small model handles fine. The opposite trap is under-provisioning a genuinely hard reasoning task and shipping wrong answers. A powerful pattern is model routing — a cheap classifier sends easy requests to a small model and only escalates the hard ones. Worked example: a support bot gets 10,000 tickets/day. Sending all of them to a flagship model at, say, $5 per million output tokens might cost hundreds of dollars daily. Instead, route the ~80% of simple FAQ-style tickets to a small model (Haiku / GPT-4o Mini) at roughly a tenth of the price, and escalate only the ~20% of ambiguous or multi-step cases to the flagship. On a representative eval this typically cuts total spend by 60–70% while keeping answer quality within a couple of percent — because the small model was already good enough for the easy majority.
Think of it like choosing a vehicle for different tasks:
- 1. Latency < 500ms AND quality critical: Use GPT-4o or Claude Sonnet — best balance of speed and intelligence
- 2. Cost < $0.01/request AND simple task: Use GPT-4o Mini or Claude Haiku — 10-20x cheaper, great for classification, extraction, FAQ
- 3. Context > 100K tokens: Use Claude (200K) or Gemini (1M+) — other models require document chunking
- 4. Complex math / logic / hard reasoning: Use reasoning models (o3, o4-mini) — they use thinking tokens for step-by-step reasoning, but cost more due to hidden token usage
- 5. On-premise / data privacy required: Use Llama or Mistral — open-weight models you can host yourself
- 6. Always: test on YOUR data: Run 50-100 real examples through each candidate model before committing — benchmarks lie, your evals don't
Key Selection Criteria
- Quality: Benchmark scores (MMLU, HumanEval) matter less than eval on YOUR data — always test with real examples from your domain
- Model Routing: Use a lightweight classifier to route easy tasks (FAQ, extraction) to cheap models and hard tasks (reasoning, coding) to flagship models — saves 60-80% with minimal quality loss
- Cost vs Latency: Flagship models are 10-30x more expensive and 2-5x slower — justify the upgrade with measurable quality difference on your evals
- Context Window: Need 100K+ tokens? Only Claude (200K) and Gemini (1M+) support it natively — others require chunking strategies
Fun Fact: A/B testing model routing in production showed that sending 80% of support tickets to Haiku saved 85% of costs with only a 2% quality drop. The remaining 20% of complex cases went to Sonnet — total cost reduction of 70% with near-identical user satisfaction.
Try It Yourself!
Explore different models and their trade-offs for various use cases.
Select your use case:
| Model | Context | Price (in) | Vision | Tools | Quality | Best for |
|---|---|---|---|---|---|---|
GPT-5 OpenAI | 400K | $1.25/1M | Top | General purposeAgents | ||
Claude Opus 4.5 Anthropic | 200K | $15.00/1M | Top | ResearchComplex analysis | ||
Claude Sonnet 4 Anthropic | 200K | $3.00/1M | Top | CodingAnalysis | ||
o3 OpenAI | 200K | $2.00/1M | Top | Complex reasoningMath | ||
GPT-4o OpenAI | 128K | $2.50/1M | High | ChatVision | ||
Gemini 2.5 Pro Google | 1M | $1.25/1M | High | Long documentsReasoning | ||
DeepSeek V3OSS DeepSeek | 128K | $0.27/1M | High | Budget projectsCoding | ||
Qwen 2.5 72BOSS Alibaba | 128K | Self-hosted | High | Asian languagesSelf-hosted | ||
Mistral Large 2 Mistral | 128K | $2.00/1M | High | EU complianceCost-effective | ||
Llama 3.3 70BOSS Meta | 128K | Self-hosted | High | Privacy-sensitiveFine-tuning | ||
o4-mini OpenAI | 200K | $1.10/1M | High | Budget reasoningMath | ||
Gemini 2.5 Flash Google | 1M | $0.30/1M | Medium | High volumeLong documents | ||
GPT-4o mini OpenAI | 128K | $0.15/1M | Medium | High volumeSimple tasks | ||
Claude 3.5 Haiku Anthropic | 200K | $0.80/1M | Medium | ClassificationSimple tasks |
Quick Decision
- Need the best: Claude Opus 4.5 / GPT-5
- Hard reasoning: o3 / o4-mini
- Best for coding: Claude Sonnet 4
- Save money: DeepSeek V3 / GPT-4o mini / Gemini 2.5 Flash
- Long docs: Gemini 2.5 Pro (1M tokens)
- Privacy: Llama 3.3 / Qwen 2.5 (self-hosted)
Frequently asked questions
How do I choose the right LLM for my task?
Start from the task, not the leaderboards. Build an eval set of 50–100 real examples from your domain and run each candidate model on it. Then weigh cost (price per million input and output tokens), latency (time to first token and full response), and the context window you need. Pick the cheapest, fastest model that clears the quality bar your task actually requires — not the most powerful one.
When is a flagship model worth the extra cost over a small one?
Flagship models (GPT-5, Claude Sonnet, o3) are typically 10–30x more expensive and 2–5x slower than small ones (Haiku, GPT-4o Mini). The upgrade is justified only when your own evals show a measurable quality gain — for example on hard reasoning, math, or coding. For classification, extraction, and FAQ tasks a small model usually performs just as well at a fraction of the cost.
What is model routing and how much can it save?
Model routing uses a cheap classifier to decide where each request goes: easy tasks (FAQ, extraction) go to a small model, and hard ones (reasoning, multi-step cases) escalate to a flagship. When ~80% of traffic is simple, this typically cuts total spend by 60–70% with only a couple of percent of quality loss, because the small model is already good enough for the easy majority.
Can I trust benchmarks like MMLU and HumanEval when picking a model?
Benchmarks are only a coarse filter for shortlisting candidates — they rarely predict performance on your data. Models can be tuned toward popular tests, and your task differs in domain, format, and language. Always make the final call with your own eval set of real examples: benchmarks lie, your evals don't.
Try it yourself
Interactive demo of this technique
Choose the right model for support ticket categorization task
This request belongs to the "auth" (authentication) category. The user is experiencing difficulty logging in due to a lost password. This is a typical authentication issue that is resolved through the password reset mechanism. I recommend sending the user a password recovery link and checking if the account is locked.
auth
Not every task requires the most powerful model. For simple classification, a lightweight model gives the same result 15x faster and 100x cheaper.
Create a free account to solve challenges
3 AI-verified challenges for this lesson
This lesson is part of a structured LLM course.
My Learning Path