Prompt Caching
Cache static prompt prefixes to slash costs and latency
The Problem: You have a production app sending 10,000 API requests per day. Each includes the same 4,000-token system prompt. That's 40 million tokens daily just for the static part — hundreds of dollars monthly for content the model has already "read" thousands of times.
The Solution: Prompt Caching — Remember, Don't Repeat
Prompt caching is an API-level feature that stores the processed prefix of your prompt across requests so the model doesn't have to re-process the same tokens every time. Only the beginning is cached — never the middle or the end. Anthropic offers a 90% discount on cached input tokens, with a one-time 25% surcharge on the request that writes the cache. OpenAI caches automatically for prompts over 1,024 tokens at a 50% discount. The cache has a TTL (time-to-live, время жизни) of roughly 5 minutes by default, and that timer resets on every cache hit, so an actively used prefix effectively stays warm.
How it actually works under the hood
When a transformer processes your prompt, it computes intermediate attention states for every token — this is the KV-cache (key-value cache). Normally those states are thrown away after each request. Prompt caching keeps the states for the unchanged prefix in memory, so a repeat request skips the expensive "prefill" over those tokens and resumes from where the new content begins. That is why it cuts both cost and latency — the model is doing strictly less compute. The critical constraint: the cached portion must be byte-for-byte identical and sit at the very start. Change a single character near the front — a date, a user name, a reordered example — and the prefix no longer matches, so the cache misses and you pay full price.
When to use it (and a worked example)
Reach for caching whenever a large, stable block leads every request: a long system prompt, a set of few-shot examples, or reference documents in a RAG pipeline. Suppose a support bot prepends a 3,000-token system prompt plus 50 example tickets, then appends only the new user message. The first call is a cache write (you pay ~1.25x for the prefix). Every call in the next 5 minutes reuses that prefix at 10% of the price, while the short per-user suffix is billed normally. Over thousands of daily requests, the savings dominate. The main pitfall is structural: put everything static first, everything dynamic last. If the user query or a timestamp leaks into the front of the prompt, the cache never hits — and you've paid the write surcharge for nothing.
Think of it like a librarian who remembers frequently requested books and puts them on a nearby shelf — the first request goes to the archive (cache miss), but repeat requests are served instantly (cache hit):
- 1. Identify cacheable prefix: Find the stable part: system instructions, few-shot examples, reference documents. Must be identical across requests and placed at the beginning
- 2. Set cache breakpoints: Anthropic: add cache_control markers. OpenAI: automatic for prompts >1,024 tokens — just structure your prompt correctly
- 3. First request (cache write): The first request processes the full prompt and writes to cache. Anthropic charges 25% extra for cache write — this is the investment
- 4. Subsequent requests (cache hits): Every following request with the same prefix: 90% cost reduction and up to 85% lower time-to-first-token. Each hit resets the 5-minute TTL
Where to Apply Prompt Caching
- RAG with stable system prompt: Cache the system instructions + retrieval guidelines. Only the user query and retrieved chunks change per request. Ideal for high-volume Q&A systems
- Few-shot classification: Cache 50-100 classification examples in the prefix. Each new input is appended at the end. Perfect for support ticket routing or content moderation
- Batch processing: Process thousands of documents with the same analysis prompt. Cache the instructions once, change only the document per request. Massive savings at scale
- Common Pitfall: Putting dynamic content before static content in your prompt. If the user query comes before the system prompt, every request has a different prefix and the cache never hits. Always structure: [stable prefix] + [dynamic suffix]
Fun Fact: A system prompt of 4,000 tokens at $3/1M input tokens, 10,000 requests/day. Without caching: $120/day ($3,600/month). With Anthropic caching (95% hit rate): ~$12/day ($360/month). That's a 90% cost reduction — saving $3,240/month from a single API parameter change.
Try It Yourself!
Explore the interactive visualization below to see how caching affects cost, latency, and token usage in real scenarios.
Interactive: Prompt Caching Explorer
Request
System + User
Cache Check
Frequently asked questions
What is prompt caching in LLM APIs?
Prompt caching is an API-level feature that stores the processed prefix of your prompt (system prompt, few-shot examples, large context) so repeated requests reuse it instead of reprocessing. Anthropic offers 90% discount on cached tokens; OpenAI offers 50% automatic caching.
How is prompt caching different from KV-cache?
KV-cache is an internal model mechanism that caches attention computations during a single generation. Prompt caching is an API-level feature that persists across separate API requests, caching the prefix of your prompt for minutes (typically 5-minute TTL).
When should I use prompt caching?
Use prompt caching when you send the same long prefix (system prompt, few-shot examples, or large context documents) across many API requests. It pays off after just 2 cache hits for Anthropic (despite the 25% write surcharge) and immediately for OpenAI (no write surcharge).
Try it yourself
Interactive demo of this technique
Process 10 documents with the same analysis prompt and estimate costs
10 full requests at 8500 tokens each. Total cost: $0.255. Each request is processed from scratch, including the same 8000 tokens of system prompt and examples.
1 cache write + 9 cache hits. Cached prefix cost (8000 tokens): write 0.0024. Non-cached tokens (500/request): 10 * 0.015. Total: 0.255 — 74% savings.
When batch-processing documents with the same prompt, prompt caching saves 70-90% on input token costs — just add cache_control to the stable part and ensure it comes first.
Create a free account to solve challenges
3 AI-verified challenges for this lesson
This lesson is part of a structured LLM course.
My Learning Path