§4.1 · API

Prompt caching

How to cache the prefix so repeat calls cost a fraction.

Prompt caching is the single biggest lever you have on Anthropic spend if your application sends the same context multiple times. Most API users leave 50–80% of their bill on the table by not using it correctly.

How it works

When you send a prompt with a cache_control breakpoint, Anthropic caches the prefix up to that breakpoint for 5 minutes (refreshed on each hit). The first call pays a small write premium (~1.25× input). Every subsequent call that sends the same prefix pays only the read rate (~0.10× input) for those tokens.

The cache is keyed on the exact prefix. One character of drift and the prefix is "poisoned" — no cache hit.

The three rules

Stable content first, volatile content last. Put the user's question, the current date, and any per-request variables at the END of your messages. Everything that doesn't change goes at the start.
One breakpoint, near the end of the stable section. You don't need a breakpoint on every block. One well-placed one captures the bulk of the savings.
Watch the minimum cacheable size. Sonnet caches at 1,024+ tokens; Haiku at 2,048+. Below those thresholds you pay the cache-write fee with no savings.

A typical mistake

[date: 2026-05-18]
[user query: "..."]
[long stable instructions block]
[cache_control breakpoint]

Nothing here will cache. The volatile date + query come before the stable block, so the stable block is never in the cacheable prefix. Reorder to:

[long stable instructions block]
[cache_control breakpoint]
[date: 2026-05-18]
[user query: "..."]

Now the instructions cache. The date and query are after the breakpoint — they don't matter for the cache key.

What "good" looks like

Production apps that get this right see cache hit rates above 70% on traffic that re-sends a stable system prompt. Below 30% means you have a prefix-ordering bug.