Lesson 11Optimization

Prompt Caching

Cache static prompt prefixes to slash costs and latency

The Problem: You have a production app sending 10,000 API requests per day. Each includes the same 4,000-token system prompt. That's 40 million tokens daily just for the static part — hundreds of dollars monthly for content the model has already "read" thousands of times.

The Solution: Prompt Caching — Remember, Don't Repeat

Prompt caching is an API-level feature that stores the processed prefix of your prompt across requests so the model doesn't have to re-process the same tokens every time. Only the beginning is cached — never the middle or the end. Anthropic offers a 90% discount on cached input tokens, with a one-time 25% surcharge on the request that writes the cache. OpenAI caches automatically for prompts over 1,024 tokens at a 50% discount. The cache has a TTL (time-to-live, время жизни) of roughly 5 minutes by default, and that timer resets on every cache hit, so an actively used prefix effectively stays warm.

How it actually works under the hood

When a transformer processes your prompt, it computes intermediate attention states for every token — this is the KV-cache (key-value cache). Normally those states are thrown away after each request. Prompt caching keeps the states for the unchanged prefix in memory, so a repeat request skips the expensive "prefill" over those tokens and resumes from where the new content begins. That is why it cuts both cost and latency — the model is doing strictly less compute. The critical constraint: the cached portion must be byte-for-byte identical and sit at the very start. Change a single character near the front — a date, a user name, a reordered example — and the prefix no longer matches, so the cache misses and you pay full price.

When to use it (and a worked example)

Reach for caching whenever a large, stable block leads every request: a long system prompt, a set of few-shot examples, or reference documents in a RAG pipeline. Suppose a support bot prepends a 3,000-token system prompt plus 50 example tickets, then appends only the new user message. The first call is a cache write (you pay ~1.25x for the prefix). Every call in the next 5 minutes reuses that prefix at 10% of the price, while the short per-user suffix is billed normally. Over thousands of daily requests, the savings dominate. The main pitfall is structural: put everything static first, everything dynamic last. If the user query or a timestamp leaks into the front of the prompt, the cache never hits — and you've paid the write surcharge for nothing.

Think of it like a librarian who remembers frequently requested books and puts them on a nearby shelf — the first request goes to the archive (cache miss), but repeat requests are served instantly (cache hit):

1. Identify cacheable prefix: Find the stable part: system instructions, few-shot examples, reference documents. Must be identical across requests and placed at the beginning
2. Set cache breakpoints: Anthropic: add cache_control markers. OpenAI: automatic for prompts >1,024 tokens — just structure your prompt correctly
3. First request (cache write): The first request processes the full prompt and writes to cache. Anthropic charges 25% extra for cache write — this is the investment
4. Subsequent requests (cache hits): Every following request with the same prefix: 90% cost reduction and up to 85% lower time-to-first-token. Each hit resets the 5-minute TTL

Where to Apply Prompt Caching

RAG with stable system prompt: Cache the system instructions + retrieval guidelines. Only the user query and retrieved chunks change per request. Ideal for high-volume Q&A systems
Few-shot classification: Cache 50-100 classification examples in the prefix. Each new input is appended at the end. Perfect for support ticket routing or content moderation
Batch processing: Process thousands of documents with the same analysis prompt. Cache the instructions once, change only the document per request. Massive savings at scale
Common Pitfall: Putting dynamic content before static content in your prompt. If the user query comes before the system prompt, every request has a different prefix and the cache never hits. Always structure: [stable prefix] + [dynamic suffix]

Fun Fact: A system prompt of 4,000 tokens at $3/1M input tokens, 10,000 requests/day. Without caching: $120/day ($3,600/month). With Anthropic caching (95% hit rate): ~$12/day ($360/month). That's a 90% cost reduction — saving $3,240/month from a single API parameter change.

Try It Yourself!

Explore the interactive visualization below to see how caching affects cost, latency, and token usage in real scenarios.

Prompt Caching: How It Works

Interactive: Prompt Caching Explorer

Request

System + User

Cache Check

Frequently asked questions

What is prompt caching in LLM APIs?

Prompt caching is an API-level feature that stores the processed prefix of your prompt (system prompt, few-shot examples, large context) so repeated requests reuse it instead of reprocessing. Anthropic offers 90% discount on cached tokens; OpenAI offers 50% automatic caching.

How is prompt caching different from KV-cache?

KV-cache is an internal model mechanism that caches attention computations during a single generation. Prompt caching is an API-level feature that persists across separate API requests, caching the prefix of your prompt for minutes (typically 5-minute TTL).

When should I use prompt caching?

Use prompt caching when you send the same long prefix (system prompt, few-shot examples, or large context documents) across many API requests. It pays off after just 2 cache hits for Anthropic (despite the 25% write surcharge) and immediately for OpenAI (no write surcharge).

Try it yourself

Interactive demo of this technique

Technique Comparison

Demo Mode

Pre-recorded responses

TaskIntermediateAnalysis

Process 10 documents with the same analysis prompt and estimate costs

Without technique

Prompt

For each of 10 documents, send the full request: [System prompt: 3000 tokens] [20 few-shot examples: 5000 tokens] [Document: 500 tokens] Each request = 8500 input tokens. 10 requests = 85,000 tokens. Cost: 85,000 * $3/1M = $0.255

Response

10 full requests at 8500 tokens each. Total cost: $0.255. Each request is processed from scratch, including the same 8000 tokens of system prompt and examples.

Tokens:85000/2000

Time:12000ms

Quality:

With production-prompt-caching

With technique

Prompt

Structure with caching: [System prompt: 3000 tokens | cache_control: ephemeral] [20 few-shot examples: 5000 tokens | cache_control: ephemeral] [Document: 500 tokens] Request 1 (cache write): 8000 * 1.25 + 500 = $0.030 + $0.0015 Requests 2-10 (cache hit): 8000 * 0.1 + 500 = 9 * ($0.0024 + $0.0015) Total: $0.0315 + 9 * $0.0039 = $0.0666

Response

1 cache write + 9 cache hits. Cached prefix cost (8000 tokens): write $0.030, 9 reads at$ 0.0024. Non-cached tokens (500/request): 10 * $0.0015 =$ 0.015. Total: $0.0666 instead of$ 0.255 — 74% savings.

👁️System prompt (3K) and examples (5K) are identical for all 10 documents — perfect cache candidate

🧠Place stable content (8K) first, document (500 tokens) — last

🔢Cache write costs 1.25x = $0.030. Each cache hit = 0.1x = $0.0024. Already positive ROI after 2nd request.

✅74% savings + ~70% TTFT reduction. One change to API request structure.

Tokens:85000/2000

Time:3500ms

Quality:

Why this works

When batch-processing documents with the same prompt, prompt caching saves 70-90% on input token costs — just add cache_control to the stable part and ensure it comes first.

1 / 2

Practice Challenges

Create a free account to solve challenges

3 AI-verified challenges for this lesson

Related lessons:Cost Optimization Api Patterns Rag

This lesson is part of a structured LLM course.

My Learning Path

The Solution: Prompt Caching — Remember, Don't Repeat

How it actually works under the hood

When to use it (and a worked example)

Think of it like a librarian who remembers frequently requested books and puts them on a nearby shelf — the first request goes to the archive (cache miss), but repeat requests are served instantly (cache hit):

1. Identify cacheable prefix: Find the stable part: system instructions, few-shot examples, reference documents. Must be identical across requests and placed at the beginning
2. Set cache breakpoints: Anthropic: add cache_control markers. OpenAI: automatic for prompts >1,024 tokens — just structure your prompt correctly
3. First request (cache write): The first request processes the full prompt and writes to cache. Anthropic charges 25% extra for cache write — this is the investment
4. Subsequent requests (cache hits): Every following request with the same prefix: 90% cost reduction and up to 85% lower time-to-first-token. Each hit resets the 5-minute TTL

Where to Apply Prompt Caching

RAG with stable system prompt: Cache the system instructions + retrieval guidelines. Only the user query and retrieved chunks change per request. Ideal for high-volume Q&A systems

Few-shot classification: Cache 50-100 classification examples in the prefix. Each new input is appended at the end. Perfect for support ticket routing or content moderation

Batch processing: Process thousands of documents with the same analysis prompt. Cache the instructions once, change only the document per request. Massive savings at scale

Common Pitfall: Putting dynamic content before static content in your prompt. If the user query comes before the system prompt, every request has a different prefix and the cache never hits. Always structure: [stable prefix] + [dynamic suffix]

Frequently asked questions

What is prompt caching in LLM APIs?

How is prompt caching different from KV-cache?

When should I use prompt caching?