Cost Optimization
Reduce API costs
The Problem: LLM API costs can spiral out of control quickly. A popular app can cost thousands per day. How do you keep AI affordable?
The Solution: Be Energy-Efficient
Cost optimization for LLM applications is the practice of getting the same quality output while paying for fewer tokens and cheaper compute. Almost every hosted LLM bills you per token — separately for input (your prompt) and output (the model's reply), usually priced per million tokens. So your bill is driven by three levers: how many tokens you send, how many you generate, and which model processes them. Optimizing means pulling each lever without hurting accuracy. It's like managing electricity at home — you don't sit in the dark, you just turn off lights in empty rooms and switch to efficient appliances.
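The three levers can be made concrete with a small cost function. This is a minimal sketch: the prices in `PRICES` are hypothetical placeholders, not any provider's real rates, and the function names are invented for illustration.

```javascript
// Hypothetical prices in USD per million tokens; substitute your provider's real rates.
const PRICES = { inputPerMillion: 3.0, outputPerMillion: 15.0 };

// Cost of one request: input and output tokens are billed separately.
function requestCost(inputTokens, outputTokens, prices = PRICES) {
  const inputCost = (inputTokens / 1_000_000) * prices.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * prices.outputPerMillion;
  return inputCost + outputCost;
}

// Daily cost for a steady stream of similar requests.
function dailyCost(inputTokens, outputTokens, requestsPerDay, prices = PRICES) {
  return requestCost(inputTokens, outputTokens, prices) * requestsPerDay;
}

// Each lever shows up as one argument: fewer input tokens, fewer output
// tokens, or a cheaper price table for a smaller model.
console.log(dailyCost(1200, 300, 5000).toFixed(2));
```

Passing a different `prices` object is how a switch to a cheaper model would be modelled here.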
How the main techniques work
Three families of techniques do most of the work. Prompt compression trims the input you send on every call: filler words, redundant instructions, and bloated examples. Prompt caching stores the unchanging prefix of your prompt (system instructions, tool definitions, long documents) on the provider's side, so repeat calls skip re-processing it. Anthropic, for example, charges about 1.25x the normal input price to write a prefix to the cache and only about 0.1x to read it on later calls, as long as the cached entry has not expired. Model selection (routing) sends each request to the cheapest model that can still do the job: a flagship model for hard reasoning, a small "mini" or "haiku"-tier model for classification, extraction, and FAQ. A fourth technique, semantic caching, goes further and skips the model entirely when a near-identical question has already been answered.
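To see when prompt caching pays off, the write and read multipliers quoted above can be turned into a break-even calculation. This sketch assumes the simplest case, one cache write followed only by cache hits, and ignores cache expiry; the function names are illustrative.

```javascript
// Multipliers quoted in the text for Anthropic: ~1.25x to write, ~0.1x to read.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Relative cost of the static prefix over `calls` requests, measured in
// "uncached passes over the prefix". Assumes one write, then only hits.
function cachedPrefixCost(calls) {
  if (calls < 1) return 0;
  return CACHE_WRITE_MULTIPLIER + (calls - 1) * CACHE_READ_MULTIPLIER;
}

// Without caching, every call pays full price for the prefix.
function uncachedPrefixCost(calls) {
  return Math.max(calls, 0);
}

// Fraction of prefix cost saved by caching (negative means caching cost more).
function savingsFraction(calls) {
  const uncached = uncachedPrefixCost(calls);
  return uncached === 0 ? 0 : 1 - cachedPrefixCost(calls) / uncached;
}

for (const calls of [1, 2, 10, 100]) {
  console.log(calls, savingsFraction(calls).toFixed(2));
}
```

Under these assumptions a single call costs more with caching than without, the second call already puts you ahead, and the saving approaches 90% as the number of hits grows.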
When to use it, and the tradeoffs
Reach for cost optimization once an app has real traffic. Early on, engineering time costs more than the API bill, so premature tuning is wasted effort. The golden rule is to measure first: log token counts and cost per request, because you can't optimize what you don't track. Every lever has a tradeoff. Aggressive compression can drop context the model actually needed; over-eager routing sends a hard task to a weak model and quietly degrades quality; a semantic cache can return a stale answer to a question that only looks similar. That's why you keep evals running. Worked example: a support bot sends a 2,000-token system prompt plus 500 tokens of context, 10,000 times a day, at $3 per million input tokens. That is 25 million input tokens a day, or about $75/day ($2,250/month). Caching the static system prompt, routing simple intents to a mini model, and compressing the prompt by 40% together cut that to roughly $18/day: a 76% reduction, about $1,710 saved every month from three changes that touch no model weights at all.
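"Measure first" can start as a small in-memory ledger. The sketch below is one possible shape, not a standard tool: `CostLedger`, the route names, and the prices in `PRICE_TABLE` are all hypothetical, and the token counts are assumed to come from the usage data your provider returns with each response.

```javascript
// Hypothetical per-million-token prices in USD, keyed by a model label you choose.
const PRICE_TABLE = {
  flagship: { input: 3.0, output: 15.0 },
  mini: { input: 0.3, output: 1.5 },
};

class CostLedger {
  constructor(priceTable) {
    this.priceTable = priceTable;
    this.entries = [];
  }

  // Record one request and return its cost in USD.
  record({ route, model, inputTokens, outputTokens }) {
    const price = this.priceTable[model];
    if (!price) throw new Error(`No price configured for model "${model}"`);
    const cost =
      (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
    this.entries.push({ route, model, inputTokens, outputTokens, cost });
    return cost;
  }

  // Total cost and request count per route, most expensive first.
  summaryByRoute() {
    const totals = new Map();
    for (const { route, cost } of this.entries) {
      const total = totals.get(route) ?? { route, requests: 0, cost: 0 };
      total.requests += 1;
      total.cost += cost;
      totals.set(route, total);
    }
    return [...totals.values()].sort((a, b) => b.cost - a.cost);
  }
}

const ledger = new CostLedger(PRICE_TABLE);
ledger.record({ route: "faq", model: "mini", inputTokens: 2500, outputTokens: 200 });
ledger.record({ route: "dispute", model: "flagship", inputTokens: 2500, outputTokens: 400 });
console.log(ledger.summaryByRoute());
```

Sorting by cost puts the most expensive routes at the top, which is where optimization effort is likely to pay off first.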
Think of it like saving electricity:
1. Audit current costs: Measure first! Log every request with token count and cost, because you can't optimize what you don't measure
2. Compress system prompts: Remove filler words, reduce examples from 5 to 2-3, use bullet points instead of paragraphs. Target a 40-60% reduction
3. Add semantic cache: 60%+ of FAQ requests are near-duplicates. A semantic cache finds similar questions and returns stored responses without calling the LLM
4. Route by complexity: 80% of tasks don't need the flagship model. Use a classifier to route simple tasks to mini/haiku (10-20x cheaper)
5. Monitor and iterate: Set up cost dashboards, track cost-per-conversation, and review weekly. Optimization is continuous, not one-time
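Step 4, routing by complexity, might look like the sketch below. The classifier here is a deliberately simple heuristic standing in for whatever classifier you would actually use; the model labels, the intent names, and the length cutoff are all assumptions made for illustration.

```javascript
// Hypothetical model tiers; the labels stand in for whatever models you use.
const CHEAP_MODEL = "mini-tier-model";
const FLAGSHIP_MODEL = "flagship-model";

// Intents assumed simple enough for the cheap tier (the text names FAQ,
// extraction, and classification).
const SIMPLE_INTENTS = new Set(["faq", "extraction", "classification"]);

// Arbitrary cutoff: very long inputs go to the flagship even if the intent is simple.
const LONG_INPUT_CHARS = 2000;

// Toy complexity classifier based on declared intent and input length.
// A production system might use a small model or a trained classifier instead.
function classifyComplexity({ intent, text }) {
  const isLong = text.length > LONG_INPUT_CHARS;
  return SIMPLE_INTENTS.has(intent) && !isLong ? "simple" : "complex";
}

// Pick the cheapest tier the classifier considers adequate.
function routeRequest(request) {
  return classifyComplexity(request) === "simple" ? CHEAP_MODEL : FLAGSHIP_MODEL;
}

console.log(routeRequest({ intent: "faq", text: "What are your opening hours?" }));
console.log(routeRequest({ intent: "contract-review", text: "Compare these two clauses." }));
```

Defaulting unknown intents to the flagship model is a conservative choice: a misrouted request then costs money rather than quality.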
Example: (2,000-token system prompt + 500 tokens of user context) x 10,000 requests/day x $3/1M tokens = $75/day ($2,250/month). With caching, routing, and compression: $18/day, a 76% saving.
Key Strategies
- Prompt Compression: Remove filler words, shorten examples, use structured formats. A 2,000-token system prompt can often be compressed to about 800 tokens with no measurable quality loss, provided you confirm it with evals
- Prompt Caching: With Anthropic, writing a prefix to the cache costs about 1.25x the normal input price, but reading it on later requests costs only about 0.1x, a 90% discount on the cached tokens of system prompts repeated across conversations
- Model Routing: 80% of tasks (FAQ, extraction, classification) don't need the flagship model — route them to mini/haiku and save 10-20x per request
- Semantic Caching: 60%+ of FAQ requests are near-duplicates — semantic cache matches similar (not identical) questions and returns stored responses instantly
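A semantic cache can be sketched in a few lines. To keep the example runnable offline, `toyEmbed` below is a word-count vector, which is only a stand-in for a real embedding model; the class name and the 0.8 similarity threshold are likewise assumptions you would tune against your own traffic.

```javascript
// Toy embedding: a word-count vector. In practice you would call a real
// embedding model here; this stand-in only matches on shared words.
function toyEmbed(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, x) => sum + x * x, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

class SemanticCache {
  constructor({ embed = toyEmbed, threshold = 0.8 } = {}) {
    this.embed = embed;
    this.threshold = threshold;
    this.entries = [];
  }

  // Return the stored answer of the most similar question, or null on a miss.
  lookup(question) {
    const vector = this.embed(question);
    let best = null;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (score >= this.threshold && (!best || score > best.score)) {
        best = { answer: entry.answer, score };
      }
    }
    return best ? best.answer : null;
  }

  // Save an answer so later similar questions can skip the model call.
  store(question, answer) {
    this.entries.push({ vector: this.embed(question), answer });
  }
}

const cache = new SemanticCache();
cache.store("How do I reset my password?", "Use the reset link on the login page.");
console.log(cache.lookup("How can I reset my password?"));
console.log(cache.lookup("What is your refund policy?"));
```

The threshold is where the stale-answer risk lives: set it too low and questions that only look similar will share an answer. This sketch also never expires entries, which a real cache would need to do.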
Fun Fact: In the worked example, a 2,000-token system prompt plus 500 tokens of user context at 10,000 requests/day and $3/1M tokens costs $75/day ($2,250/month). After applying caching, routing, and compression it costs $18/day: a 76% reduction, saving $1,710/month from just three optimizations.
Optimization Tips
- Use smaller models (GPT-4o mini, Claude Haiku) for simple tasks
- Cache repeated prompts to avoid redundant API calls
- Batch requests when possible to reduce per-call overhead
- Optimize prompts: shorter is cheaper
- Stream responses so you can cancel a bad generation early instead of paying for the full output
- Set token limits to cap the cost of each request
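The token-limit tip can be tied directly to a budget. This sketch computes the largest output limit that keeps one request under a spending cap; the function name and the prices in the usage example are hypothetical, and the result would be passed as your provider's output-limit parameter (often called `max_tokens`).

```javascript
// Largest output-token limit that keeps a request under `budgetUsd`.
// Prices are USD per million tokens and must be supplied by the caller.
function maxOutputTokensForBudget({
  budgetUsd,
  inputTokens,
  inputPricePerMillion,
  outputPricePerMillion,
}) {
  const inputCost = (inputTokens / 1_000_000) * inputPricePerMillion;
  const remaining = budgetUsd - inputCost;
  if (remaining <= 0) return 0; // the prompt alone already uses up the budget
  return Math.floor((remaining / outputPricePerMillion) * 1_000_000);
}

// Example with made-up prices: a one-cent budget and a 1,000-token prompt.
console.log(
  maxOutputTokensForBudget({
    budgetUsd: 0.01,
    inputTokens: 1000,
    inputPricePerMillion: 3,
    outputPricePerMillion: 15,
  })
);
```

A result of 0 is a signal to shorten the prompt or raise the budget rather than to send the request anyway.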
Frequently asked questions
How do I reduce LLM API costs?
Measure first: log token counts and cost per request. Then pull three levers: prompt compression (strip filler from the input), prompt caching (store the unchanging system prefix on the provider side), and model routing (send simple tasks to a cheap mini/haiku model and reserve the flagship for hard ones). On real traffic, savings of 50–80% are realistic, as long as evals confirm that quality has not dropped.
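As a rough illustration of the first lever, the sketch below strips a few filler phrases from a prompt and estimates the size change. The phrase list is a hypothetical starting point, and `estimateTokens` uses the common rule of thumb of about four characters per token for English text, which is not a real tokenizer.

```javascript
// Hypothetical filler phrases and their replacements; build your own list
// from the prompts you actually send.
const FILLER_REPLACEMENTS = [
  [/\bin order to\b/gi, "to"],
  [/\bit is important that you\b/gi, ""],
  [/\bplease\b/gi, ""],
  [/\bkindly\b/gi, ""],
];

// Remove filler phrases, then tidy the spacing the removals leave behind.
// Line breaks are preserved so bullet-style prompts keep their structure.
function compressPrompt(prompt) {
  let result = prompt;
  for (const [pattern, replacement] of FILLER_REPLACEMENTS) {
    result = result.replace(pattern, replacement);
  }
  return result
    .replace(/[ \t]+/g, " ")
    .replace(/ +([.,;:!?])/g, "$1")
    .trim();
}

// Rough size estimate: about 4 characters per token (approximation only).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const original = "Please answer briefly in order to save tokens.";
const compressed = compressPrompt(original);
console.log(compressed);
console.log(estimateTokens(original), estimateTokens(compressed));
```

Mechanical stripping like this is easy to overdo, so it is worth rerunning evals on the compressed prompt before shipping it.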
What is prompt caching and how much does it save?
Prompt caching stores the unchanging part of your request (system instructions, tool definitions, long documents) on the provider's side so repeat calls skip re-processing it. With Anthropic, writing a prefix to the cache costs about 1.25x the normal input price, while reading it on later calls costs only about 0.1x: up to a 90% discount on the cached input tokens of repeated system prompts. The write premium is paid again whenever the cached entry expires and has to be rewritten.
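With Anthropic's Messages API, a prefix is marked cacheable by adding a `cache_control` field to a content block. The sketch below only builds the request body and sends nothing; `buildCachedRequest` is an illustrative helper, and the model name and prompt text are placeholders you would supply. Check the provider documentation for minimum cacheable length and cache lifetime.

```javascript
// Build a Messages API request body whose static system prompt is marked
// cacheable. `model`, `systemPrompt`, and `userMessage` are caller-supplied.
function buildCachedRequest({ model, systemPrompt, userMessage, maxTokens = 512 }) {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      {
        type: "text",
        text: systemPrompt,
        // Marks the prompt prefix ending at this block as cacheable.
        cache_control: { type: "ephemeral" },
      },
    ],
    // The per-request part stays outside the cached prefix.
    messages: [{ role: "user", content: userMessage }],
  };
}

const body = buildCachedRequest({
  model: "your-model-name", // placeholder, not a real model identifier
  systemPrompt: "You are a support assistant for an online store.",
  userMessage: "Where is my order?",
});
console.log(JSON.stringify(body, null, 2));
```

Keeping everything that changes per request after the cached block is what makes the prefix identical across calls, which caching depends on.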
When should I start optimizing LLM costs?
Once the app has real traffic and the API bill is noticeable. Early on, engineering time costs more than the API bill, so premature tuning is wasted effort. Set up cost-per-request measurement and dashboards first, find the most expensive scenarios, and optimize those specifically rather than everything at once.
What are the tradeoffs and risks of cost optimization?
Every lever has a cost. Aggressive compression can drop context the model actually needed. Over-eager routing sends a hard task to a weak model and quietly degrades quality. A semantic cache can return a stale answer to a question that only looks similar. Keep evals running and watch your quality metrics after each change.
Demo: reduce token usage in code generation without quality loss

Before (verbose response):

    Of course! I'd be happy to help you! Here's a JavaScript function that does exactly what you asked:

    // Function to remove duplicates from an array of numbers
    function removeDuplicates(numbers) {
      // Create a new array to store unique values
      const uniqueNumbers = [];
      // Iterate through each element of the original array
      for (let i = 0; i < numbers.length; i++) {
        // Check if the current element is already in the unique array
        if (!uniqueNumbers.includes(numbers[i])) {
          uniqueNumbers.push(numbers[i]);
        }
      }
      return uniqueNumbers;
    }

    I hope this helps! If you have any additional questions, don't hesitate to ask!

After (concise prompt with a "code only" instruction):

    const unique = (nums) => [...new Set(nums)];

A concise prompt + "code only" instruction saves up to 90% of tokens. In production with thousands of calls, that's tens of dollars per day.