Prompt Caching: How to Cut 80% Off Your LLM Bill
The cheapest LLM optimization isn't a smaller model or a smarter RAG — it's a properly structured prefix cache. We break down how caching actually works, where to place cache points, why segment order matters more than content, and how to measure hit-rate so you don't fool yourself.
Intermediate · AI DevOps · 25 min · Anthropic API, OpenAI API
1. The pain you didn't think about before launch
While you're building the agent, cost seems trivial: ten cents per run. Then the agent hits production — and the API bill reads like server rent. The reason usually isn't call volume, it's the prefix: every call ships the same system prompt, instructions, examples, sometimes an entire knowledge base.
Prefix caching solves this differently than most people think. It's not 'remember the answer to a repeated question'. It's 'remember the already-processed prefix so we don't recompute it'. Claude sees the familiar start, skips the most expensive op — prefill — and jumps to the dynamic part. Discount on cached tokens: up to 90%.
💸 No cache
- Prefill 5000 tokens per call
- Latency: high, always the same
- Cost: ×1 (base per-token price)
⚡ With prefix cache
- Prefill once, then hit
- Latency on hit: −50% time-to-first-token
- Cost: ×0.1 on cached prefix
Don't enable caching 'later'. Turn it on from the first prototype — you'll immediately start writing prompts with stable content on top. Restructuring a prompt for cache later is a separate day-long project.
2. The physics of cache: order matters more than content
To use the cache right, hold one idea in your head: the prefix is cached, not a fragment. Change the first character — cache gone. Change something in the middle — everything from that point onward stops being cached.
On each request, the system finds the longest common prefix with what's already in cache. Found 4000 tokens of match — they bill at cache-hit rate (typically 10% of base). The rest is recomputed from scratch.
Why does it work this way? What's cached isn't the text — it's the model's KV cache, the internal state after running the prefix through attention. That state depends on every preceding token, so changing a single character up front invalidates the whole thing.
Hence the rule that saves more money than any other: the same exact text, reordered, can yield a 90% discount or zero. Added a timestamp to the first line — hit-rate floored. Changed 'You are helpful' to 'You are a helpful' — same outcome.
prompt = [system][tools][context][history][input]
cache_lookup(prompt):
looks for longest_common_prefix with prev_cached
if hit = [system][tools] → discount on 3000 tokens
if hit = nothing → you pay full price for everything
key point: the hit is matched STRICTLY from the start — it's a prefix

One tiny edit to the system prompt invalidates the cache for all your users at once. In production, don't edit prompts on the fly — ship them via release and expect a temporary hit-rate dip.
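The lookup above can be sketched in runnable form. This is a toy cost model (the function name and prices are invented for illustration, not the provider's actual billing code), assuming cached tokens bill at 10% of base:

```python
def cached_cost(tokens: list[str], cached_prefix: list[str],
                base_price: float = 1.0, hit_ratio: float = 0.1) -> float:
    """Toy model: the matched prefix bills at hit_ratio, the rest at full price."""
    hit = 0
    for a, b in zip(tokens, cached_prefix):
        if a != b:
            break  # prefix match is strict: stop at the first mismatch
        hit += 1
    return hit * base_price * hit_ratio + (len(tokens) - hit) * base_price

prompt = ["sys"] * 3000 + ["user_msg"] * 200

# Stable prefix already cached: 3000 tokens at 10%, 200 fresh
print(cached_cost(prompt, ["sys"] * 3000))                    # 500.0

# One changed token at the very start: zero match, full price for everything
print(cached_cost(prompt, ["changed"] + ["sys"] * 2999))      # 3200.0
```

Same 3200 tokens in both calls, a 6× cost difference: that's the whole "order matters more than content" rule in two lines of output.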
3. Stable on top, dynamic at the bottom
The most common mistake: shoving dynamic context before stable content. 'User: hi, here's the order data, now answer according to these rules…' — rules ended up AFTER order data. Every request changes the data — which invalidates cache for the rules too.
Correct order: descending stability. On top, what changes monthly (system prompt, persona, rules). Below, what changes daily (tools, examples). Below, session context. At the very bottom, what changes every request: the latest user message.
Sounds simple, but almost every first-draft prompt is arranged backwards. Flip it — and hit-rate climbs on its own.
❌ Bad: dynamic on top
- user question
- RAG context
- system rules
- examples
- hit-rate: ~0%
✅ Good: stable on top
- system rules
- tools + examples
- RAG context
- dialogue history
- user question → hit-rate >80%
RAG context is the trickiest layer. More stable than user input, less stable than the system prompt. Put it in the middle with its own cache point in front of it — and you'll get discounts even on rarely repeated queries.
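Descending-stability assembly can be as simple as concatenating layers, stable first. A minimal sketch with a hypothetical `build_prompt` helper (the layer names are illustrative):

```python
import os

def build_prompt(system_rules: str, tools_and_examples: str,
                 rag_context: str, history: str, user_message: str) -> str:
    """Concatenate layers in descending order of stability:
    a change in any layer invalidates only what comes after it."""
    return "\n\n".join([
        system_rules,        # changes ~monthly → almost always a cache hit
        tools_and_examples,  # changes ~daily
        rag_context,         # changes per query
        history,             # changes per session
        user_message,        # changes every request → never cached
    ])

p1 = build_prompt("RULES", "TOOLS", "CTX", "H1", "hi")
p2 = build_prompt("RULES", "TOOLS", "CTX", "H1", "bye")

# Two consecutive requests share everything except the last layer
shared = os.path.commonprefix([p1, p2])
print(shared.endswith("H1\n\n"))  # True: only the user message differs
```

With the flipped (bad) order, `user_message` would sit first and `commonprefix` would be empty: every request pays full price.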
4. cache_control: three points, no more
Providers give you explicit control: you place the 'cache up to here' markers yourself. Anthropic uses a `cache_control` marker, OpenAI caches automatically after 1024 prefix tokens. Anthropic allows up to four points, each creating a separate cached prefix.
Rule of three: first point after the system prompt, second after tool definitions, third before the RAG context. More points don't give more discount — they just complicate debugging.
TTL is a separate choice. Anthropic's default is 5 minutes: cheap to write, quick to expire. Extended 1-hour TTL costs more per write but pays off when requests arrive less frequently than every five minutes. Measure patterns first, enable long TTL second.
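A quick break-even sketch of the write-vs-read arithmetic. The multipliers (read ≈ 0.1×, 5-minute write ≈ 1.25×, 1-hour write ≈ 2× the base input price) match Anthropic's published pricing at the time of writing, but verify them against the current price list:

```python
BASE = 1.0               # relative cost of one fresh input token
READ = 0.1 * BASE        # cached-prefix token on a hit
WRITE_5M = 1.25 * BASE   # writing the cache, 5-minute TTL
WRITE_1H = 2.0 * BASE    # writing the cache, 1-hour TTL

def cost(n_calls: int, rewrites: int, write_price: float,
         prefix: int = 5000) -> float:
    """Prefix cost over a window: `rewrites` cache writes, the rest hit."""
    hits = n_calls - rewrites
    return rewrites * prefix * write_price + hits * prefix * READ

# A request every 10 minutes for an hour:
# 5-min TTL expires before each call → 6 writes, 0 hits
print(cost(6, 6, WRITE_5M))   # 37500.0
# 1-hour TTL → 1 write, 5 hits
print(cost(6, 1, WRITE_1H))   # 12500.0
```

At this traffic pattern, the 2× write premium buys a 3× cheaper hour, which is exactly why you measure the request interval before picking a TTL.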
System prompt
← cache_control #1
Tools + base instructions
← cache_control #2
RAG context / knowledge
← cache_control #3
Dialogue history + current message
not cached — dynamic part
messages = [
{ role: "system", content: [...], cache_control: { type: "ephemeral" } }, // #1
{ role: "user", content: tools_def, cache_control: { type: "ephemeral" } }, // #2
{ role: "user", content: rag_ctx, cache_control: { type: "ephemeral" } }, // #3
{ role: "user", content: user_message } // dynamic
]

1-hour TTL pays off when your p50 interval between requests exceeds 5 minutes. If users spam in quick succession, the default 5-minute TTL is enough — and you save on write cost.
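The same three-point layout as a request builder. This is a sketch, not a full client: `build_request` is a hypothetical helper, the model id is a placeholder, and in the real Anthropic API `cache_control` attaches to content blocks exactly as shown here:

```python
def build_request(system_prompt: str, tools_def: str, rag_ctx: str,
                  history: list, user_msg: str) -> dict:
    cc = {"type": "ephemeral"}  # default 5-minute TTL
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt, "cache_control": cc},  # #1
            {"type": "text", "text": tools_def,     "cache_control": cc},  # #2
            {"type": "text", "text": rag_ctx,       "cache_control": cc},  # #3
        ],
        "messages": history + [
            {"role": "user", "content": user_msg},  # dynamic, never cached
        ],
    }

req = build_request("RULES", "TOOLS", "CTX", [], "What's my order status?")
print(len([b for b in req["system"] if "cache_control" in b]))  # 3
```

Everything above cache point #3 bills at the cache-hit rate on repeat calls; only `history` growth and the user message pay full price.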
5. Hit-rate — the metric everyone forgets to watch
You turned caching on. The bill went down. Done? Not until you've checked hit-rate. 'Cache enabled' and 'cache hit' are different things. Every response returns three numbers: tokens read from cache, fresh tokens processed, tokens written to cache. Watch the ratio.
Green zone — hit-rate above 70% on stable endpoints. Yellow (30–70%) — check segment order and hunt for dynamic chunks mid-prefix. Red (<30%) — caching isn't working at all: either TTL is too short, or something is invalidating the prefix. Without monitoring, you won't know if the optimization works until next month's bill arrives.
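Computing the ratio from those three numbers is a few lines. The field names below follow Anthropic's usage object (`cache_read_input_tokens`, `cache_creation_input_tokens`, `input_tokens`); adjust for other providers:

```python
def hit_rate(usage: dict) -> float:
    """Share of prompt tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)        # tokens from cache
    written = usage.get("cache_creation_input_tokens", 0) # tokens written
    fresh = usage.get("input_tokens", 0)                  # fresh tokens
    total = read + written + fresh
    return read / total if total else 0.0

def zone(rate: float) -> str:
    """Map a hit-rate to the traffic-light zones from the text."""
    if rate > 0.7:
        return "green"
    if rate >= 0.3:
        return "yellow"
    return "red"

u = {"cache_read_input_tokens": 4000,
     "cache_creation_input_tokens": 0,
     "input_tokens": 500}
print(zone(hit_rate(u)))  # green  (4000 / 4500 ≈ 0.89)
```

Log this per endpoint, not globally: a healthy chat endpoint averaged with a cold batch job hides exactly the regressions you're hunting.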
Where does cache actually save money?
- ✅ Stable system prompt + changing messages
- ✅ Shared RAG context across users
- ✅ Long-running chat sessions with growing history
- ❌ Unique long document on every request
- ❌ Rare hourly requests with 5-min TTL
If hit-rate suddenly drops, first thing to check: did someone add a timestamp or random ID to the system prompt? One innocent `generated_at: 2026-04-10T14:32:01` line kills the entire cache.
Result
A prompt that caches from the first call: stable content on top, dynamic at the bottom, three cache_control points in the right places, hit-rate above 70%, and an API bill cut 5–10× without a single line of model-level optimization.