§4.6 · API
Output length, max_tokens, and schema design
Output costs 5× input. Treat every output token like rent.
Because output tokens cost roughly 5× as much as input tokens, a single sloppy max_tokens setting can inflate your monthly bill by as much as 10× while nothing else about the request changes.
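To make the ratio concrete, here is a minimal sketch of the per-request arithmetic. The prices and token counts are hypothetical placeholders, chosen only to preserve the roughly 5× output-to-input ratio; substitute your model's actual rates.

```python
# Hypothetical prices in USD per million tokens (assumption, not a quoted rate).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00  # 5x the input price


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under the assumed prices."""
    total = (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    )
    return total / 1_000_000


# Same 1,000-token prompt; only the output length differs.
short = request_cost(1_000, 200)
long = request_cost(1_000, 4_000)
print(f"200 output tokens:   ${short:.4f}")
print(f"4,000 output tokens: ${long:.4f}")
print(f"ratio: {long / short:.1f}x")
```

Under these assumed numbers, letting the same prompt run to 4,000 output tokens instead of 200 costs about ten times as much per request.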
Three traps to avoid
- max_tokens unset (or huge default). A long-tail completion will fill it. Cap explicitly per endpoint.
- Verbose JSON schemas. Every field name, every nested object, every closing bracket is a token you pay Claude to emit. Compact field names and flat structures cut output spend without losing information.
- Asking for explanations alongside the answer. "Return X and explain your reasoning" doubles output. Ask for the explanation only on debug paths.
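The first and third traps can be handled in one place. The following is a sketch only: the endpoint names, cap values, and the `build_request` helper are hypothetical, and the returned dict stands in for whatever request object your client library expects.

```python
# Hypothetical endpoints and caps, for illustration only.
# Debug paths get their own cap because they also emit an explanation.
MAX_TOKENS = {
    "classify": {"normal": 16, "debug": 200},
    "summarize": {"normal": 600, "debug": 900},
}

ANSWER_ONLY = "Return only the answer, with no preamble or explanation."
ANSWER_AND_WHY = "Return the answer, then a brief explanation of your reasoning."


def build_request(endpoint: str, question: str, debug: bool = False) -> dict:
    """Build request parameters with an explicit per-endpoint output cap."""
    # An unknown endpoint raises KeyError instead of falling back to a default.
    caps = MAX_TOKENS[endpoint]
    mode = "debug" if debug else "normal"
    instruction = ANSWER_AND_WHY if debug else ANSWER_ONLY
    return {
        "max_tokens": caps[mode],
        "prompt": f"{question}\n\n{instruction}",
    }


print(build_request("classify", "Is this message spam?"))
print(build_request("classify", "Is this message spam?", debug=True))
```

Raising on an unknown endpoint is a deliberate choice here: a new endpoint cannot ship without someone deciding its ceiling.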
Sizing max_tokens honestly
Measure your actual output distribution for two weeks, then set max_tokens to the 95th percentile + ~20% headroom. Anything past the 95th is usually a degenerate prompt that should be re-shaped, not accommodated.
The Output Length Linter scans your endpoint configs and surfaces ceilings that look over-provisioned.
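The sizing rule (95th percentile plus roughly 20% headroom) can be sketched as below. This assumes you have already logged output token counts per response; the function name and the nearest-rank percentile method are illustrative choices, not a description of how the linter works.

```python
import math


def suggest_max_tokens(output_lengths, percentile=95, headroom_pct=20):
    """Suggest a max_tokens ceiling from observed output token counts."""
    if not output_lengths:
        raise ValueError("need at least one observed output length")
    ordered = sorted(output_lengths)
    # Nearest-rank percentile: the smallest observed value with at least
    # `percentile` percent of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    p_value = ordered[rank - 1]
    # Add headroom and round up to a whole token.
    return math.ceil(p_value * (100 + headroom_pct) / 100)


# 99 typical responses of 100 tokens plus one 5,000-token outlier.
observed = [100] * 99 + [5000]
print(suggest_max_tokens(observed))
```

In this example the single 5,000-token outlier does not move the suggested ceiling, which is the point of sizing to the 95th percentile rather than the maximum.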
Schema compaction — the same content, less spend
// Wasteful (Claude emits all of this every response):
{
  "user_first_name": "...",
  "user_last_name": "...",
  "user_email_address": "...",
  "user_account_creation_timestamp": "..."
}
// Same data, smaller output:
{
  "fn": "...",
  "ln": "...",
  "em": "...",
  "cts": "..."
}
Switch field names locally; expand them in your parser. The model doesn't care; your bill does.
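Expanding the compact keys in your parser might look like the sketch below. The key mapping mirrors the two schemas above; the `expand_keys` helper and the sample values are hypothetical.

```python
import json

# Compact key -> full field name, matching the two schemas above.
KEY_MAP = {
    "fn": "user_first_name",
    "ln": "user_last_name",
    "em": "user_email_address",
    "cts": "user_account_creation_timestamp",
}


def expand_keys(raw_json: str) -> dict:
    """Parse a compact model response and restore the full field names."""
    compact = json.loads(raw_json)
    unknown = set(compact) - set(KEY_MAP)
    if unknown:
        # Fail loudly rather than silently dropping unexpected fields.
        raise KeyError(f"unexpected compact keys: {sorted(unknown)}")
    return {KEY_MAP[key]: value for key, value in compact.items()}


response = '{"fn": "Ada", "ln": "Lovelace", "em": "ada@example.com", "cts": "2024-01-01T00:00:00Z"}'
print(expand_keys(response))
```

Keeping the mapping in one table means the rest of your code only ever sees the full names.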
Streaming doesn't change the cost
Streaming makes the UX feel faster, but you still pay for every token emitted. If your stream cuts off mid-response and you retry, you pay twice: once for the partial output and again for the full retry. Decide whether the speed is worth the retry exposure.
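The retry exposure is simple token accounting. The sketch below assumes the retry resends the whole prompt and regenerates the response from scratch, and that tokens emitted before the cutoff are still billed; the function name and the numbers are illustrative.

```python
def billed_tokens(input_tokens, output_tokens, emitted_before_cutoff=None):
    """Return (input, output) tokens billed for one logical request.

    If the stream was cut off after `emitted_before_cutoff` output tokens
    and retried once from scratch, both attempts are counted.
    """
    if emitted_before_cutoff is None:
        return input_tokens, output_tokens
    # The failed attempt bills the prompt plus the partial output;
    # the retry bills the prompt again plus the full output.
    return 2 * input_tokens, emitted_before_cutoff + output_tokens


print(billed_tokens(1_000, 800))                             # clean run
print(billed_tokens(1_000, 800, emitted_before_cutoff=600))  # cut off, retried
```

The later the cutoff happens, the closer a retried request gets to costing double.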
The big lever: rendered vs raw
If you're using Claude to summarize a long document, spell out the exact shape of the summary you want as the output schema (for example: 5 bullet points, 100 characters each, no preamble). Don't give it freedom to add an introduction, transitions, and a closing flourish.
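One way to pin the shape down is to state it in the instruction and check the response against it. This is a sketch under assumptions: the instruction wording, the `- ` bullet marker, and the `check_summary` helper are all hypothetical.

```python
SUMMARY_INSTRUCTION = (
    "Summarize the document as exactly 5 bullet points. "
    "Start each bullet with '- ' and keep it to at most 100 characters. "
    "Output only the bullets: no introduction, transitions, or closing remarks."
)


def check_summary(text: str, bullets: int = 5, max_chars: int = 100) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    problems = []
    if len(lines) != bullets:
        problems.append(f"expected {bullets} lines, got {len(lines)}")
    for number, line in enumerate(lines, start=1):
        if not line.startswith("- "):
            problems.append(f"line {number} is not a bullet")
        elif len(line) - 2 > max_chars:  # length excludes the '- ' marker
            problems.append(f"line {number} exceeds {max_chars} characters")
    return problems


good = "\n".join(f"- point {i}" for i in range(1, 6))
print(check_summary(good))
print(check_summary("Here is your summary:\n" + good))
```

A non-empty result is a signal to tighten the prompt, not to raise max_tokens.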