§4.3 · API
Cache hit rate as a north-star metric
Why this number, and how to drive it up.
Cache hit rate is the single most useful number on your Anthropic dashboard. It tells you what fraction of your input tokens are being read from cache (cheap) versus charged at the full input rate (5–10× more expensive per token of cacheable content).
What to watch
| Hit rate | What it means |
|---|---|
| >70% | Your prefix is stable and well-ordered. Keep doing what you're doing. |
| 40–70% | Either your prefix is partially poisoned by volatile content, or you have multiple variants of a stable block that should be consolidated. |
| <40% | Either you're not using cache_control at all, or your prefix is so volatile the cache expires before the next hit. |
How to drive it up
- Audit prefix order. Any variable that changes per call must come AFTER the
cache_controlbreakpoint. Common offenders: timestamps, user IDs, page-numbered references, conversation turn counters. - Consolidate instruction variants. If your codebase has slightly-different system prompts for similar tasks, each one is its own cache key. Merge them with a single switch sentence near the end.
- Pin the model + version. Switching between
claude-sonnet-4-6andclaude-haiku-4-5-20251001resets the cache. If you have an A/B test running, expect a hit-rate dip until one side wins. - Mind the 5-minute window. Sporadic traffic (one call every 10 min) will see most cache writes expire before the next read. Either keep prefixes warm with a tiny background poke or accept that low-volume endpoints won't benefit.
A note on observability
Anthropic returns the cache breakdown in the usage block on every response — cache_read_input_tokens and cache_creation_input_tokens. Don't trust your dashboard alone; instrument the response in your own logs so you can debug a drop in real time.
The Cache Hit Rate Monitor in Mission Control polls your Anthropic key for the live number and pings you when it slips.