§4.6 · API

Output length, max_tokens, and schema design

Output costs 5× input. Treat every output token like rent.

Output tokens cost roughly 5× more than input tokens. A single sloppy max_tokens setting can balloon your monthly bill 10× without changing any other behavior.

Three traps to avoid

max_tokens unset (or huge default). A long-tail completion will fill it. Cap explicitly per endpoint.
Verbose JSON schemas. Every field name, every nested object, every closing bracket is a token Claude pays to emit. Compact field names and flat structures cut output spend without losing information.
Asking for explanations alongside the answer. "Return X and explain your reasoning" doubles output. Ask for the explanation only on debug paths.

Sizing `max_tokens` honestly

Measure your actual output distribution for two weeks, then set max_tokens to the 95th percentile + ~20% headroom. Anything past the 95th is usually a degenerate prompt that should be re-shaped, not accommodated.

The Output Length Linter scans your endpoint configs and surfaces ceilings that look over-provisioned.

Schema compaction — the same content, less spend

// Wasteful (Claude emits all of this every response):
{
  "user_first_name": "...",
  "user_last_name":  "...",
  "user_email_address": "...",
  "user_account_creation_timestamp": "..."
}

// Same data, smaller output:
{
  "fn":  "...",
  "ln":  "...",
  "em":  "...",
  "cts": "..."
}

Switch field names locally; expand them in your parser. The model doesn't care; your bill does.

Streaming doesn't change the cost

Streaming makes the UX feel faster but you still pay for every token emitted. If your stream cuts off mid-response and you retry, you pay twice. Decide whether speed is worth the retry exposure.

The big lever: rendered vs raw

If you're using Claude to summarize a long document, give it the summary you want as the output schema (5 bullet points, 100 chars each, no preamble). Don't give it freedom to add introduction, transitions, and a closing flourish.