Prompt Caching: How to Cut 80% Off Your LLM Bill
The cheapest LLM optimization isn't a smaller model or a smarter RAG — it's a properly structured prefix cache. We break down how caching actually works, where to place cache points, why segment order matters more than content, and how to measure hit-rate so you don't fool yourself.
Intermediate · AI DevOps · 25 min · Anthropic API, OpenAI API
1. The pain you didn't think about before launch
While you're building the agent, cost seems trivial: ten cents per run. Then the agent hits production — and the API bill reads like server rent. The reason usually isn't call volume, it's the prefix: every call ships the same system prompt, instructions, examples, sometimes an entire knowledge base.
Prefix caching solves this differently than most people think. It's not 'remember the answer to a repeated question'. It's 'remember the already-processed prefix so we don't recompute it'. Claude sees the familiar start, skips the most expensive op — prefill — and jumps to the dynamic part. Discount on cached tokens: up to 90%.
💸 No cache
- Prefill 5000 tokens per call
- Latency: high, always the same
- Cost: ×1 (base per-token price)
⚡ With prefix cache
- Prefill once, then hit
- Latency on hit: −50% time-to-first-token
- Cost: ×0.1 on cached prefix
Don't enable caching 'later'. Turn it on from the first prototype — you'll immediately start writing prompts with stable content on top. Restructuring a prompt for cache later is a separate day-long project.
2. The physics of cache: order matters more than content
To use the cache right, hold one idea in your head: the prefix is cached, not a fragment. Change the first character — cache gone. Change something in the middle — everything from that point onward stops being cached.
On each request, the system finds the longest common prefix with what's already in cache. Found 4000 tokens of match — they bill at cache-hit rate (typically 10% of base). The rest is recomputed from scratch.
Why does it work this way? What's cached isn't the text — it's the model's KV cache, the internal state after running the prefix through attention. That state depends on every preceding token, so changing a single character up front invalidates the whole thing.
Hence the rule that saves more money than any other: the same exact text, reordered, can yield a 90% discount or zero. Added a timestamp to the first line — hit-rate floored. Changed 'You are helpful' to 'You are a helpful' — same outcome.
prompt = [system][tools][context][history][input]
cache_lookup(prompt):
looks for longest_common_prefix with prev_cached
if hit = [system][tools] → discount on 3000 tokens
if hit = nothing → you pay full price for everything
key point: the hit is matched STRICTLY from the start — it's a prefix

One tiny edit to the system prompt invalidates the cache for all your users at once. In production, don't edit prompts on the fly — ship them via release and expect a temporary hit-rate dip.
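The lookup above can be sketched in runnable form. This is a toy cost model (the function name and prices are invented for illustration, not the provider's actual billing code), assuming cached tokens bill at 10% of base:

```python
def cached_cost(tokens: list[str], cached_prefix: list[str],
                base_price: float = 1.0, hit_ratio: float = 0.1) -> float:
    """Toy model: the matched prefix bills at hit_ratio, the rest at full price."""
    hit = 0
    for a, b in zip(tokens, cached_prefix):
        if a != b:
            break  # prefix match is strict: stop at the first mismatch
        hit += 1
    return hit * base_price * hit_ratio + (len(tokens) - hit) * base_price

prompt = ["sys"] * 3000 + ["user_msg"] * 200

# Stable prefix already cached: 3000 tokens at 10%, 200 fresh
print(cached_cost(prompt, ["sys"] * 3000))                    # 500.0

# One changed token at the very start: zero match, full price for everything
print(cached_cost(prompt, ["changed"] + ["sys"] * 2999))      # 3200.0
```

Same 3200 tokens in both calls, a 6× cost difference: that's the whole "order matters more than content" rule in two lines of output.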
3. Stable on top, dynamic at the bottom
The most common mistake: shoving dynamic context before stable content. 'User: hi, here's the order data, now answer according to these rules…' — rules ended up AFTER order data. Every request changes the data — which invalidates cache for the rules too.
Correct order: descending stability. On top, what changes monthly (system prompt, persona, rules). Below, what changes daily (tools, examples). Below, session context. At the very bottom, what changes every request: the latest user message.
Sounds simple, but almost every first-draft prompt is arranged backwards. Flip it — and hit-rate climbs on its own.
❌ Bad: dynamic on top
- user question
- RAG context
- system rules
- examples
- hit-rate: ~0%
✅ Good: stable on top
- system rules
- tools + examples
- RAG context
- dialogue history
- user question → hit-rate >80%
RAG context is the trickiest layer. More stable than user input, less stable than the system prompt. Put it in the middle with its own cache point in front of it — and you'll get discounts even on rarely repeated queries.
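Descending-stability assembly can be as simple as concatenating layers, stable first. A minimal sketch with a hypothetical `build_prompt` helper (the layer names are illustrative):

```python
import os

def build_prompt(system_rules: str, tools_and_examples: str,
                 rag_context: str, history: str, user_message: str) -> str:
    """Concatenate layers in descending order of stability:
    a change in any layer invalidates only what comes after it."""
    return "\n\n".join([
        system_rules,        # changes ~monthly → almost always a cache hit
        tools_and_examples,  # changes ~daily
        rag_context,         # changes per query
        history,             # changes per session
        user_message,        # changes every request → never cached
    ])

p1 = build_prompt("RULES", "TOOLS", "CTX", "H1", "hi")
p2 = build_prompt("RULES", "TOOLS", "CTX", "H1", "bye")

# Two consecutive requests share everything except the last layer
shared = os.path.commonprefix([p1, p2])
print(shared.endswith("H1\n\n"))  # True: only the user message differs
```

With the flipped (bad) order, `user_message` would sit first and `commonprefix` would be empty: every request pays full price.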
4. cache_control: three points, no more
Providers give you explicit control: you place the 'cache up to here' markers yourself. Anthropic uses a `cache_control` marker, OpenAI caches automatically after 1024 prefix tokens. Anthropic allows up to four points, each creating a separate cached prefix.
Rule of three: first point after the system prompt, second after tool definitions, third before the RAG context. More points don't give more discount — they just complicate debugging.
TTL is a separate choice. Anthropic's default is 5 minutes: cheap to write, quick to expire. Extended 1-hour TTL costs more per write but pays off when requests arrive less frequently than every five minutes. Measure patterns first, enable long TTL second.
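A quick break-even sketch of the write-vs-read arithmetic. The multipliers (read ≈ 0.1×, 5-minute write ≈ 1.25×, 1-hour write ≈ 2× the base input price) match Anthropic's published pricing at the time of writing, but verify them against the current price list:

```python
BASE = 1.0               # relative cost of one fresh input token
READ = 0.1 * BASE        # cached-prefix token on a hit
WRITE_5M = 1.25 * BASE   # writing the cache, 5-minute TTL
WRITE_1H = 2.0 * BASE    # writing the cache, 1-hour TTL

def cost(n_calls: int, rewrites: int, write_price: float,
         prefix: int = 5000) -> float:
    """Prefix cost over a window: `rewrites` cache writes, the rest hit."""
    hits = n_calls - rewrites
    return rewrites * prefix * write_price + hits * prefix * READ

# A request every 10 minutes for an hour:
# 5-min TTL expires before each call → 6 writes, 0 hits
print(cost(6, 6, WRITE_5M))   # 37500.0
# 1-hour TTL → 1 write, 5 hits
print(cost(6, 1, WRITE_1H))   # 12500.0
```

At this traffic pattern, the 2× write premium buys a 3× cheaper hour, which is exactly why you measure the request interval before picking a TTL.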
System prompt
← cache_control #1
Tools + base instructions
← cache_control #2
RAG context / knowledge
← cache_control #3
Dialogue history + current message
not cached — dynamic part
messages = [
{ role: "system", content: [...], cache_control: { type: "ephemeral" } }, // #1
{ role: "user", content: tools_def, cache_control: { type: "ephemeral" } }, // #2
{ role: "user", content: rag_ctx, cache_control: { type: "ephemeral" } }, // #3
{ role: "user", content: user_message } // dynamic
]

1-hour TTL pays off when your p50 interval between requests exceeds 5 minutes. If users spam in quick succession, the default 5-minute TTL is enough — and you save on write cost.
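The same three-point layout as a request builder. This is a sketch, not a full client: `build_request` is a hypothetical helper, the model id is a placeholder, and in the real Anthropic API `cache_control` attaches to content blocks exactly as shown here:

```python
def build_request(system_prompt: str, tools_def: str, rag_ctx: str,
                  history: list, user_msg: str) -> dict:
    cc = {"type": "ephemeral"}  # default 5-minute TTL
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt, "cache_control": cc},  # #1
            {"type": "text", "text": tools_def,     "cache_control": cc},  # #2
            {"type": "text", "text": rag_ctx,       "cache_control": cc},  # #3
        ],
        "messages": history + [
            {"role": "user", "content": user_msg},  # dynamic, never cached
        ],
    }

req = build_request("RULES", "TOOLS", "CTX", [], "What's my order status?")
print(len([b for b in req["system"] if "cache_control" in b]))  # 3
```

Everything above cache point #3 bills at the cache-hit rate on repeat calls; only `history` growth and the user message pay full price.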
5. Hit-rate — the metric everyone forgets to watch
You turned caching on. The bill went down. Done? Not until you've checked hit-rate. 'Cache enabled' and 'cache hit' are different things. Every response returns three numbers: tokens read from cache, fresh tokens processed, tokens written to cache. Watch the ratio.
Green zone — hit-rate above 70% on stable endpoints. Yellow (30–70%) — check segment order and hunt for dynamic chunks mid-prefix. Red (<30%) — caching isn't working at all: either TTL is too short, or something is invalidating the prefix. Without monitoring, you won't know if the optimization works until next month's bill arrives.
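Computing the ratio from those three numbers is a few lines. The field names below follow Anthropic's usage object (`cache_read_input_tokens`, `cache_creation_input_tokens`, `input_tokens`); adjust for other providers:

```python
def hit_rate(usage: dict) -> float:
    """Share of prompt tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)        # tokens from cache
    written = usage.get("cache_creation_input_tokens", 0) # tokens written
    fresh = usage.get("input_tokens", 0)                  # fresh tokens
    total = read + written + fresh
    return read / total if total else 0.0

def zone(rate: float) -> str:
    """Map a hit-rate to the traffic-light zones from the text."""
    if rate > 0.7:
        return "green"
    if rate >= 0.3:
        return "yellow"
    return "red"

u = {"cache_read_input_tokens": 4000,
     "cache_creation_input_tokens": 0,
     "input_tokens": 500}
print(zone(hit_rate(u)))  # green  (4000 / 4500 ≈ 0.89)
```

Log this per endpoint, not globally: a healthy chat endpoint averaged with a cold batch job hides exactly the regressions you're hunting.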
Where does cache actually save money?
- ✅ Stable system prompt + changing messages
- ✅ Shared RAG context across users
- ✅ Long-running chat sessions with growing history
- ❌ Unique long document on every request
- ❌ Rare hourly requests with 5-min TTL
If hit-rate suddenly drops, first thing to check: did someone add a timestamp or random ID to the system prompt? One innocent `generated_at: 2026-04-10T14:32:01` line kills the entire cache.
Result
A prompt that caches from the first call: stable content on top, dynamic at the bottom, three cache_control points in the right places, hit-rate above 70%, and an API bill cut 5–10× without a single line of model-level optimization.