Prompt Caching
Cache static prompt prefixes to slash costs and latency
The Problem: You have a production app sending 10,000 API requests per day. Each includes the same 4,000-token system prompt. That's 40 million tokens daily just for the static part — hundreds of dollars monthly for content the model has already "read" thousands of times.
The Solution: Prompt Caching — Remember, Don't Repeat
Prompt caching is an API-level feature that stores the processed prefix of your prompt across requests. Only the beginning of the prompt is cached — not the middle or end. Anthropic offers a 90% discount on cached-token reads, with a 25% surcharge on the cache write; OpenAI caches automatically for prompts over 1,024 tokens at a 50% discount. The cache has a TTL (time-to-live) of about 5 minutes, refreshed on each hit, and the minimum cacheable prefix is 1,024 tokens.
Think of it like a librarian who remembers frequently requested books and puts them on a nearby shelf — the first request goes to the archive (cache miss), but repeat requests are served instantly (cache hit):
- 1. Identify cacheable prefix: Find the stable part: system instructions, few-shot examples, reference documents. Must be identical across requests and placed at the beginning
- 2. Set cache breakpoints: Anthropic: add cache_control markers. OpenAI: automatic for prompts >1,024 tokens — just structure your prompt correctly
- 3. First request (cache write): The first request processes the full prompt and writes to cache. Anthropic charges 25% extra for cache write — this is the investment
- 4. Subsequent requests (cache hits): Every following request with the same prefix: 90% cost reduction and up to 85% lower time-to-first-token. Each hit resets the 5-minute TTL
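Steps 1–3 can be sketched against Anthropic's Messages API. The snippet below only builds the request payload (the model id and prompt text are illustrative assumptions); with the `anthropic` SDK you would pass this dict to `client.messages.create(**build_request(...))`:

```python
# Sketch of a cache-enabled Anthropic Messages request. The cache_control
# marker on the system block sets the cache breakpoint: everything up to
# and including that block is written to (and later read from) the cache.
STABLE_PREFIX = ("You are a customer-support assistant. "
                 "Follow the policies below.\n" + "Policy text... " * 300)

def build_request(user_query: str) -> dict:
    return {
        "model": "claude-3-5-sonnet-latest",  # illustrative model id
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": STABLE_PREFIX,                   # identical across requests
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }],
        # Dynamic content goes after the cached prefix:
        "messages": [{"role": "user", "content": user_query}],
    }

request = build_request("How do I get a refund?")
```

The first call pays the 25% write surcharge on the system block; calls within the TTL that reuse the identical prefix are billed at the discounted read rate.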
Where to Apply Prompt Caching
- RAG with stable system prompt: Cache the system instructions + retrieval guidelines. Only the user query and retrieved chunks change per request. Ideal for high-volume Q&A systems
- Few-shot classification: Cache 50-100 classification examples in the prefix. Each new input is appended at the end. Perfect for support ticket routing or content moderation
- Batch processing: Process thousands of documents with the same analysis prompt. Cache the instructions once, change only the document per request. Massive savings at scale
- Common Pitfall: Putting dynamic content before static content in your prompt. If the user query comes before the system prompt, every request has a different prefix and the cache never hits. Always structure: [stable prefix] + [dynamic suffix]
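The pitfall above can be checked mechanically: the cache keys on a byte-identical prefix, so two requests share cache only if their prompts match from the very first character. A minimal sketch (strings are illustrative):

```python
# Why dynamic-first ordering defeats the cache: with the query first,
# two requests diverge at character 0 and share no cacheable prefix.
SYSTEM = "You are a support agent. Apply the refund policy strictly.\n"

def bad_prompt(query: str) -> str:   # dynamic content first -> no shared prefix
    return query + "\n" + SYSTEM

def good_prompt(query: str) -> str:  # stable prefix first -> prefix shared
    return SYSTEM + "\n" + query

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

q1, q2 = "refund status?", "cancel order?"
# good ordering: shared prefix covers the entire system prompt
# bad ordering: shared prefix is zero characters
```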
Fun Fact: A 4,000-token system prompt at $3/1M input tokens, 10,000 requests/day. Without caching: $120/day ($3,600/month). With Anthropic caching at a 95% hit rate (hits billed at 10% of the base rate, misses at 125%): ≈$19/day (≈$570/month). That's an ~84% cost reduction — saving roughly $3,000/month from a single API parameter change.
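The arithmetic for a scenario like this can be reproduced with a small estimator. The 90% read discount and 25% write surcharge are Anthropic's published figures; the price and hit rate are assumptions:

```python
# Daily input-token cost for a cached prompt prefix: cache hits are billed
# at 10% of the base rate, cache misses (writes) at 125%.
def daily_cost(tokens: int, requests: int, price_per_mtok: float,
               hit_rate: float) -> float:
    read_price = price_per_mtok * 0.10    # cache hit: 90% discount
    write_price = price_per_mtok * 1.25   # cache miss: 25% write surcharge
    per_mtok = hit_rate * read_price + (1 - hit_rate) * write_price
    return tokens * requests * per_mtok / 1e6

uncached = 4000 * 10_000 * 3.00 / 1e6              # $120/day baseline
cached = daily_cost(4000, 10_000, 3.00, 0.95)      # ~$18.90/day at 95% hits
```

Note that the savings converge to the full 90% discount only as the hit rate approaches 100%; the write surcharge on misses eats into it.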
Try It Yourself!
Explore the interactive visualization below to see how caching affects cost, latency, and token usage in real scenarios.
Interactive: Prompt Caching Explorer
Process 10 documents with the same analysis prompt and estimate costs
Without caching: 10 full requests at 8,500 tokens each ($3/1M input tokens) = 85,000 tokens, total cost $0.255. Each request is processed from scratch, including the same 8,000 tokens of system prompt and examples.
With caching: 1 cache write + 9 cache hits. Cache write (8,000 tokens at $3.75/1M): $0.030. Cache reads (9 × 8,000 tokens at $0.30/1M): $0.0216. Non-cached tokens (500/request × 10 at $3/1M): $0.015. Total: ~$0.067 — a 74% savings.
When batch-processing documents with the same prompt, prompt caching saves 70-90% on input token costs — just add cache_control to the stable part and ensure it comes first.
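The demo's arithmetic can be reproduced directly. The per-million-token prices are assumptions matching the figures above (base $3, write 1.25×, read 0.10×):

```python
# Cost of batch-processing 10 documents with an 8,000-token cached prefix
# and 500 dynamic tokens per request.
PRICE = 3.00            # $/1M input tokens (assumed base rate)
WRITE = PRICE * 1.25    # cache-write surcharge
READ = PRICE * 0.10     # cached-read discount
PREFIX, SUFFIX, DOCS = 8000, 500, 10

uncached = DOCS * (PREFIX + SUFFIX) * PRICE / 1e6       # $0.255
cached = (PREFIX * WRITE                                # 1 write
          + (DOCS - 1) * PREFIX * READ                  # 9 reads
          + DOCS * SUFFIX * PRICE) / 1e6                # dynamic tokens
savings_pct = round((1 - cached / uncached) * 100)      # 74
```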