Lesson 5

Cost Optimization

Reduce API costs

The Problem: LLM API costs can spiral out of control quickly. A popular app can cost thousands of dollars per day. How do you keep AI affordable?

The Solution: Be Energy-Efficient

Cost optimization for LLM applications is the practice of getting the same quality output while paying for fewer tokens and cheaper compute. Almost every hosted LLM bills you per token — separately for input (your prompt) and output (the model's reply), usually priced per million tokens. So your bill is driven by three levers: how many tokens you send, how many you generate, and which model processes them. Optimizing means pulling each lever without hurting accuracy. It's like managing electricity at home — you don't sit in the dark, you just turn off lights in empty rooms and switch to efficient appliances.
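A minimal sketch of how those three levers turn into a bill. The helper names `request_cost` and `daily_cost` are hypothetical, and the prices are illustrative placeholders rather than any provider's actual rate card:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in dollars of one request; prices are quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def daily_cost(requests_per_day: int, per_request: float) -> float:
    """Daily spend for a steady request rate."""
    return requests_per_day * per_request


# Illustrative numbers only: 1,200 input and 300 output tokens per request,
# at $2.00 (input) and $8.00 (output) per million tokens.
per_request = request_cost(1_200, 300, 2.00, 8.00)
print(f"per request: ${per_request:.4f}")
print(f"per day at 5,000 requests: ${daily_cost(5_000, per_request):.2f}")
```

Fewer input tokens, fewer output tokens, or a lower price per token each shrink one term of the same formula, which is why the levers can be tuned independently.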

How the main techniques work

Three families of techniques do most of the work. Prompt compression trims the input you send on every call: filler words, redundant instructions, and bloated examples. Prompt caching stores the unchanging prefix of your prompt (system instructions, tool definitions, long documents) on the provider's side, so repeat calls skip re-processing it. Anthropic, for example, charges about 1.25x the normal input price to write the prefix to the cache and only about 0.1x to read it on later calls, for as long as the cache entry stays alive. Model selection (routing) sends each request to the cheapest model that can still do the job: a flagship model for hard reasoning, a small "mini" or "haiku"-tier model for classification, extraction, and FAQ. A semantic cache goes further and skips the model entirely when a near-identical question has already been answered.
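To make the caching arithmetic concrete, here is a rough sketch that applies the 1.25x write and 0.1x read multipliers quoted above to a shared prefix. The function names and the $3 per million tokens price are assumptions for illustration, and the model deliberately ignores cache expiry (it assumes one write followed only by reads):

```python
def prefix_cost_uncached(prefix_tokens: int, calls: int, price_per_mtok: float) -> float:
    """Every call pays the full input price for the shared prefix."""
    return prefix_tokens * calls * price_per_mtok / 1_000_000


def prefix_cost_cached(prefix_tokens: int, calls: int, price_per_mtok: float,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.1) -> float:
    """First call writes the cache at a premium; later calls read at a discount.

    Simplification: assumes the cache entry never expires between calls.
    """
    if calls <= 0:
        return 0.0
    base = prefix_tokens * price_per_mtok / 1_000_000
    return base * (write_multiplier + read_multiplier * (calls - 1))


def break_even_calls(write_multiplier: float = 1.25, read_multiplier: float = 0.1) -> int:
    """Smallest number of calls at which caching is cheaper than not caching."""
    if read_multiplier >= 1:
        raise ValueError("cached reads must be cheaper than normal input")
    calls = 1
    while write_multiplier + read_multiplier * (calls - 1) >= calls:
        calls += 1
    return calls


# Assumed workload: a 2,000-token prefix, 10,000 calls, $3 per million input tokens.
print(f"uncached: ${prefix_cost_uncached(2_000, 10_000, 3.0):.2f}")
print(f"cached:   ${prefix_cost_cached(2_000, 10_000, 3.0):.2f}")
print(f"caching pays off from call number {break_even_calls()}")
```

Under these assumptions the write premium is recovered on the second call. In practice the saving depends on how often the cache entry expires and has to be rewritten.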

When to use it, and the tradeoffs

Reach for cost optimization once an app has real traffic. Early on, engineering time costs more than the API bill, so premature tuning is wasted effort. The golden rule is measure first: log token counts and cost per request, because you can't optimize what you don't track. Every lever has a tradeoff. Aggressive compression can drop context the model actually needed; over-eager routing sends a hard task to a weak model and quietly degrades quality; a semantic cache can return a stale answer to a question that only looks similar. That's why you keep evals running. Worked example: a support bot sends a 2,000-token system prompt plus 500 tokens of context, 10,000 times a day. That is 25 million input tokens a day, which at $3 per million tokens comes to about $75/day, or $2,250/month. Caching the static system prompt, routing simple intents to a mini model, and compressing the prompt by 40% together cut that to roughly $18/day: a 76% reduction, or about $1,710 saved every month from three changes that touch no model weights at all.
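The savings figures in the worked example can be checked with a few lines. This is only a sketch: the `savings` helper is hypothetical and a 30-day month is assumed.

```python
def savings(before_per_day: float, after_per_day: float, days_per_month: int = 30):
    """Return (percent reduction, dollars saved per month)."""
    saved_per_day = before_per_day - after_per_day
    percent = 100 * saved_per_day / before_per_day
    return percent, saved_per_day * days_per_month


# Daily cost before and after optimization, taken from the worked example.
percent, monthly = savings(75.0, 18.0)
print(f"{percent:.0f}% reduction, ${monthly:,.0f} saved per month")
```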

Think of it like saving electricity:

1. Audit current costs: Measure first! Log every request with token count and cost — you can't optimize what you don't measure
2. Compress system prompts: Remove filler words, reduce examples from 5 to 2-3, use bullet points instead of paragraphs — target 40-60% reduction
3. Add semantic cache: In FAQ-style traffic, 60% or more of requests can be near-duplicates. A semantic cache finds similar questions and returns stored responses without calling the LLM
4. Route by complexity: Roughly 80% of tasks don't need the flagship model. Use a classifier to route simple tasks to a mini/haiku-tier model (10-20x cheaper)
5. Monitor and iterate: Set up cost dashboards, track cost-per-conversation, and review weekly — optimization is continuous, not one-time
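Step 1 of the list above, the audit, might start as small as this. It is an in-memory sketch under assumed prices; `CostLog` and its feature labels are hypothetical, and a real system would write to a database or metrics service instead:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostLog:
    """Records token counts and cost per request, grouped by product feature."""
    input_price_per_mtok: float
    output_price_per_mtok: float
    records: list = field(default_factory=list)

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> float:
        cost = (input_tokens * self.input_price_per_mtok
                + output_tokens * self.output_price_per_mtok) / 1_000_000
        self.records.append((feature, input_tokens, output_tokens, cost))
        return cost

    def cost_by_feature(self) -> dict:
        totals = defaultdict(float)
        for feature, _, _, cost in self.records:
            totals[feature] += cost
        # Most expensive feature first: that is where optimization pays off.
        return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Illustrative prices and traffic.
log = CostLog(input_price_per_mtok=2.0, output_price_per_mtok=8.0)
log.record("faq", 2_500, 150)
log.record("faq", 2_500, 120)
log.record("report", 6_000, 1_500)
for feature, total in log.cost_by_feature().items():
    print(f"{feature}: ${total:.4f}")
```

Grouping by feature rather than by request is a design choice: it points at the scenario to optimize first instead of producing one flat total.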

Example: (2,000-token system prompt + 500 tokens of user context) x 10,000 requests/day x $3/1M tokens = $75/day ($2,250/month). With caching + routing + compression: $18/day, a 76% saving.

Key Strategies

Prompt Compression: Remove filler words, shorten examples, use structured formats. A 2,000-token system prompt can often be compressed to about 800 tokens with no measurable quality loss, provided you confirm it with evals
Prompt Caching: With Anthropic prompt caching, writing the prefix to the cache costs about 1.25x the normal input price and later cache reads cost about 0.1x, a 90% discount on the cached tokens for system prompts repeated across conversations
Model Routing: Roughly 80% of tasks (FAQ, extraction, classification) don't need the flagship model. Route them to a mini/haiku-tier model, which can be 10-20x cheaper per request
Semantic Caching: In FAQ-style traffic, 60% or more of requests can be near-duplicates. A semantic cache matches similar (not identical) questions and returns stored responses without calling the model
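A toy sketch of the semantic caching idea. A production cache would compare vectors from an embedding model; here a word-count vector stands in for the embedding so the example runs with the standard library alone. The `SemanticCache` class, the `embed` stand-in, and the 0.8 threshold are all assumptions for illustration:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) pairs

    def get(self, question: str):
        """Return the stored answer for the most similar question, or None."""
        query = embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((embed(question), answer))


cache = SemanticCache(threshold=0.8)
cache.put("How do I reset my password?", "Use the 'Forgot password' link on the sign-in page.")
print(cache.get("how do i reset my password"))   # near-duplicate: served from the cache
print(cache.get("How do I delete my account?"))  # different intent: None, so call the model
```

The threshold is the knob that trades savings against the risk of answering a question that only looks similar. The second query shares four of six words with the cached one, so a looser threshold would wrongly return the password answer.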

Fun Fact: None of the three optimizations in the example (caching, routing, compression) touches the model itself, yet together they take the bill from $75/day to $18/day. That is a 76% reduction, saving $1,710/month.

Try It Yourself!

Use the cost calculator below to estimate your LLM costs and see how different optimization strategies reduce expenses.

Cost Calculator

Inputs: model, requests per day, average input tokens, average output tokens

Cost Breakdown (sample output; the monthly figure matches the GPT-4o row in the comparison below)

Per request: $0.0033
Daily: $3.25
Monthly: $97.50
Yearly: $1,170

Potential Savings: Switching to Gemini 2.0 Flash could save ~$94.58/mo

Model Cost Comparison

GPT-5: $165/mo
GPT-4o: $97.50/mo
GPT-4o mini: $5.85/mo
Claude Opus 4.5: $675/mo
Claude Sonnet 4: $135/mo
Claude 3.5 Haiku: $36.00/mo

Optimization Tips

• Use smaller models (GPT-4o mini, Claude Haiku) for simple tasks
• Cache repeated prompts to avoid redundant API calls
• Batch non-urgent requests when possible: some providers discount asynchronous batch processing
• Optimize prompts: shorter = cheaper
• Stream responses so you can stop generation early when the output is clearly going wrong
• Set a maximum output token limit to cap the cost of each request
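Two of these tips, using a smaller model for simple tasks and capping tokens per request, can be sketched together. Everything named here is hypothetical: the model identifiers are placeholders, the intent label is assumed to come from an upstream classifier, and the request dictionary is a generic shape rather than any provider's API:

```python
SIMPLE_INTENTS = {"faq", "classification", "extraction"}

# Placeholder identifiers, not real model names.
CHEAP_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"


def choose_model(intent: str, prompt: str, long_prompt_chars: int = 4_000) -> str:
    """Send simple, short tasks to the cheap model and everything else to the flagship."""
    if intent in SIMPLE_INTENTS and len(prompt) <= long_prompt_chars:
        return CHEAP_MODEL
    return FLAGSHIP_MODEL


def build_request(intent: str, prompt: str, max_output_tokens: int = 300) -> dict:
    """Assemble a provider-agnostic request with a hard cap on output length."""
    return {
        "model": choose_model(intent, prompt),
        "prompt": prompt,
        "max_output_tokens": max_output_tokens,
    }


print(build_request("faq", "Where is my order?"))
print(build_request("reasoning", "Compare these two contracts clause by clause."))
```

Defaulting to the flagship model when the intent is unknown is a deliberate choice: a misrouted request then costs more money rather than producing a worse answer.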

Frequently asked questions

How do I reduce LLM API costs?

Measure first: log token counts and cost per request. Then pull three levers: prompt compression (strip filler from the input), prompt caching (store the unchanging system prefix on the provider side), and model routing (send simple tasks to a cheap mini/haiku model and reserve the flagship for hard ones). On real traffic this can save 50–80%, and evals tell you whether quality held up.
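As a rough illustration of the compression lever, the sketch below strips polite filler with regular expressions. The filler list is hypothetical; a real one should be built from phrases that actually appear in your prompts, and any compressed prompt should be re-checked against evals:

```python
import re

# Hypothetical filler list, for illustration only.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bit would be great if you could\b",
    r"\bthank you( very much)?( in advance)?\b",
]


def compress_prompt(prompt: str) -> str:
    """Remove filler phrases and collapse whitespace; the instructions themselves are kept."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    prompt = re.sub(r"\s+([,.!?])", r"\1", prompt)  # drop spaces left before punctuation
    return re.sub(r"\s{2,}", " ", prompt).strip()


original = "Please summarize the report, kindly keep it short."
compressed = compress_prompt(original)
print(compressed)
print(f"{len(original)} characters -> {len(compressed)} characters")
```

Character counts are only a proxy here. Billing is per token, so measure with the tokenizer of the model you actually call.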

What is prompt caching and how much does it save?

Prompt caching stores the unchanging part of your request (system instructions, tool definitions, long documents) on the provider's side so repeat calls skip re-processing it. With Anthropic, writing to cache costs a one-time premium of about 1.25x, while reading from cache afterward is only about 0.1x — up to a 90% discount on input tokens for repeated system prompts.

When should I start optimizing LLM costs?

Once the app has real traffic and the API bill is noticeable. Early on, engineering time costs more than the API bill, so premature tuning is wasted effort. Set up cost-per-request measurement and dashboards first, find the most expensive scenarios, and optimize those specifically rather than everything at once.

What are the tradeoffs and risks of cost optimization?

Every lever has a cost. Aggressive compression can drop context the model actually needed. Over-eager routing sends a hard task to a weak model and quietly degrades quality. A semantic cache can return a stale answer to a question that only looks similar. Keep evals running and watch your quality metrics after each change.
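One way to keep evals in the loop is a simple gate that accepts a cost change only if quality holds. This is a sketch under stated assumptions: `passes_quality_gate` is a hypothetical helper, scores are per-case eval results between 0 and 1, and the 0.02 tolerance is arbitrary:

```python
def passes_quality_gate(baseline_scores, candidate_scores, max_drop: float = 0.02) -> bool:
    """Accept a cost optimization only if the mean eval score drops by at most max_drop."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must be non-empty and cover the same eval cases")
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return baseline - candidate <= max_drop


# 1 = eval case passed, 0 = failed; same cases before and after the change.
before = [1, 1, 1, 0, 1]
after_compression = [1, 1, 1, 0, 1]  # quality held
after_routing = [1, 0, 1, 0, 0]      # weaker model failed two more cases

print(passes_quality_gate(before, after_compression))
print(passes_quality_gate(before, after_routing))
```

Running the gate after each individual change, rather than after all of them together, makes it clear which lever caused a regression.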

Technique Comparison

Task (Beginner, Coding): Reduce token usage in code generation without quality loss

Without technique

Prompt

Hello! I need your help. Please be so kind as to write a function in the JavaScript programming language that would accept an array of numbers as an input parameter and would return a new array containing only the unique values from the original array, that is, without duplicates. It would be great if you could also add comments to the code for better understanding. Thank you very much in advance!

Response

Of course! I'd be happy to help you! Here's a JavaScript function that does exactly what you asked:

// Function to remove duplicates from an array of numbers
function removeDuplicates(numbers) {
  // Create a new array to store unique values
  const uniqueNumbers = [];
  // Iterate through each element of the original array
  for (let i = 0; i < numbers.length; i++) {
    // Check if the current element is already in the unique array
    if (!uniqueNumbers.includes(numbers[i])) {
      uniqueNumbers.push(numbers[i]);
    }
  }
  return uniqueNumbers;
}

I hope this helps! If you have any additional questions, don't hesitate to ask!

Tokens: 95 input / 180 output

Time: 1600 ms

With technique (Cost Optimization)

Prompt

JS: function — unique elements from number array. Code only.

Response

const unique = (nums) => [...new Set(nums)];

🔢 Input prompt: 95 tokens → 18 tokens (81% reduction)

🔢 Output response: 180 tokens → 15 tokens (92% reduction)

🧠 "Code only" removes greetings, farewells, and comments, the main token consumers

✅ At 10K requests/day, savings: ~2.4M tokens/day, or roughly $24/day at an assumed blended rate of $10 per million tokens

Tokens: 18 input / 15 output

Time: 220 ms

Why this works

A concise prompt plus a "code only" instruction cuts total tokens by about 88% in this example (275 → 33). In production with thousands of calls, that adds up to tens of dollars per day.
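The token counts from the comparison can be checked directly. A small sketch, with a hypothetical `reduction` helper and the demo's 10,000 requests per day:

```python
def reduction(before: int, after: int) -> float:
    """Percent reduction from before to after."""
    return 100 * (before - after) / before


before_in, before_out = 95, 180  # verbose prompt and response
after_in, after_out = 18, 15     # concise prompt, code-only response

saved_per_request = (before_in + before_out) - (after_in + after_out)
print(f"input:  {reduction(before_in, after_in):.0f}% fewer tokens")
print(f"output: {reduction(before_out, after_out):.0f}% fewer tokens")
print(f"total:  {reduction(before_in + before_out, after_in + after_out):.0f}% fewer tokens")
print(f"saved at 10,000 requests/day: {saved_per_request * 10_000:,} tokens")
```

Most of the saving comes from the output side, which is also the side providers usually price higher.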


Related lessons: Model Selection, Quantization

This lesson is part of a structured LLM course.
