§4.3 · API

Cache hit rate as a north-star metric

Why this number, and how to drive it up.

Cache hit rate is the single most useful number on your Anthropic dashboard. It tells you what fraction of your input tokens are being read from cache (cheap) versus charged at the full input rate (5–10× more expensive per token of cacheable content).

What to watch

Hit rate	What it means
>70%	Your prefix is stable and well-ordered. Keep doing what you're doing.
40–70%	Either your prefix is partially poisoned by volatile content, or you have multiple variants of a stable block that should be consolidated.
<40%	Either you're not using `cache_control` at all, or your prefix is so volatile the cache expires before the next hit.

How to drive it up

Audit prefix order. Any variable that changes per call must come AFTER the cache_control breakpoint. Common offenders: timestamps, user IDs, page-numbered references, conversation turn counters.
Consolidate instruction variants. If your codebase has slightly-different system prompts for similar tasks, each one is its own cache key. Merge them with a single switch sentence near the end.
Pin the model + version. Switching between claude-sonnet-4-6 and claude-haiku-4-5-20251001 resets the cache. If you have an A/B test running, expect a hit-rate dip until one side wins.
Mind the 5-minute window. Sporadic traffic (one call every 10 min) will see most cache writes expire before the next read. Either keep prefixes warm with a tiny background poke or accept that low-volume endpoints won't benefit.

A note on observability

Anthropic returns the cache breakdown in the usage block on every response — cache_read_input_tokens and cache_creation_input_tokens. Don't trust your dashboard alone; instrument the response in your own logs so you can debug a drop in real time.

The Cache Hit Rate Monitor in Mission Control polls your Anthropic key for the live number and pings you when it slips.