Prompt Caching
Overview
Prompt caching lets you avoid reprocessing the same prompt content on every request. When you send a long system prompt, tool definitions, or conversation history repeatedly, the LLM provider can cache that content and reuse it on subsequent requests — reducing both latency and cost.
The LLM Gateway supports prompt caching across all major providers, with each provider using its own caching mechanism: Claude requires explicit cache_control markers, while OpenAI and Gemini cache eligible prompts automatically.
Cached input tokens are billed at a discounted rate compared to regular input tokens. The exact discount depends on the model and provider.
Prompt caching only activates when the cacheable portion of your prompt meets a minimum token threshold. Claude requires at least 1,024 tokens (2,048 for certain models), and OpenAI requires 1,024 tokens. Shorter prompts won’t benefit from caching.
Claude models
Claude models require you to explicitly mark which content blocks to cache using the cache_control field. Add cache_control with type set to "ephemeral" on any message you want cached.
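A minimal sketch, assuming the Gateway exposes an OpenAI-compatible Chat Completions endpoint; the base URL, API key, and model name are placeholders:

```python
from openai import OpenAI

# Placeholder Gateway endpoint and credentials.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_API_KEY")

response = client.chat.completions.create(
    model="claude-sonnet-4",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": "<several thousand tokens of static instructions>",
            # Cache everything up to and including this message.
            "cache_control": {"type": "ephemeral"},
        },
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```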
You can also set cache_control on tool result messages to cache tool interaction history in multi-turn agentic conversations.
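For instance, a tool result message with a cache breakpoint might look like the following sketch, where the tool_call_id and content are illustrative:

```python
# Illustrative tool result message; in a real agent loop the ID and
# content come from the preceding tool call.
tool_result_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": '{"results": ["..."]}',
    # Cache the conversation history up to and including this tool output.
    "cache_control": {"type": "ephemeral"},
}
```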
Cache control with TTL
The LLM Gateway extends Anthropic’s native cache_control with an optional ttl field for specifying how long cached content stays valid. The ttl field is a Gateway extension, not part of Anthropic’s native API; if it is omitted, Anthropic’s default cache duration applies.
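A sketch of a breakpoint with a TTL; the duration format shown ("1h") is an assumption:

```python
# The ttl value format ("1h") is an assumption; omit ttl to use
# Anthropic's default cache duration.
system_message = {
    "role": "system",
    "content": "<long static instructions>",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```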
Where to place cache_control
The cache_control field can be placed on:
- System messages — Cache long system prompts that don’t change between requests
- User and assistant messages — Cache conversation history in multi-turn flows
- Tool result messages — Cache tool call outputs in agentic workflows
For best results with Claude, place cache_control on the content that stays the same across requests — typically the system prompt and any static context. Content after the last cache breakpoint is not cached.
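As an illustration, a two-breakpoint layout for a multi-turn conversation might look like this sketch (all contents are placeholders):

```python
LONG_SYSTEM_PROMPT = "<several thousand tokens of static instructions>"

messages = [
    {
        "role": "system",
        "content": LONG_SYSTEM_PROMPT,
        # Breakpoint 1: the static system prompt.
        "cache_control": {"type": "ephemeral"},
    },
    {"role": "user", "content": "First question"},
    {
        "role": "assistant",
        "content": "First answer",
        # Breakpoint 2: the conversation history so far.
        "cache_control": {"type": "ephemeral"},
    },
    # After the last breakpoint: new content, not cached.
    {"role": "user", "content": "Follow-up question"},
]
```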
OpenAI models
OpenAI models cache prompts automatically. No configuration is needed — the gateway passes your requests through and caching happens on OpenAI’s infrastructure.
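A minimal sketch under the same OpenAI-compatible-endpoint assumption as above; note that nothing cache-specific appears in the request:

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_API_KEY")

# No cache fields are needed: OpenAI caches repeated prompt prefixes
# automatically once they pass the 1,024-token minimum.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "<long static instructions>"},
        {"role": "user", "content": "Summarize today's open tickets."},
    ],
)
```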
You can optionally configure cache behavior with two additional request-level fields; see the top-level request parameters in the API reference below.
Gemini models
Gemini models also cache automatically — no configuration needed.
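The same pattern applies, sketched here with a placeholder Gemini model name:

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_API_KEY")

# As with OpenAI, caching happens on the provider side with no
# request-level configuration.
response = client.chat.completions.create(
    model="gemini-2.0-flash",  # placeholder model name
    messages=[
        {"role": "system", "content": "<long static instructions>"},
        {"role": "user", "content": "Summarize today's open tickets."},
    ],
)
```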
Reading cache metrics from the response
All providers return cache usage data in the usage.prompt_tokens_details field of the response.
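For example, with the OpenAI Python SDK the metrics can be read off the response object from any of the earlier sketches:

```python
usage = response.usage
details = usage.prompt_tokens_details

# cached_tokens counts the prompt tokens served from cache and billed
# at the discounted cached-input rate.
cached = (details.cached_tokens or 0) if details else 0
print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
```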
When cached_tokens is greater than zero, those tokens were served from cache and billed at the discounted cached input rate rather than the standard input rate.
Best practices
- Cache your system prompt — System prompts are the best candidates for caching since they stay the same across requests. Place cache_control on the system message for Claude, or rely on automatic caching for OpenAI and Gemini.
- Cache tool definitions — If you use the same tools across multiple requests, the tool definitions in your prompt are automatically eligible for caching.
- Order messages for maximum cache hits — Put static content (system prompt, tool definitions) at the beginning of the message array. Content before the cache breakpoint is more likely to match across requests.
- Monitor cache metrics — Check prompt_tokens_details.cached_tokens in responses to verify caching is working and estimate your cost savings.
API reference
Request
Top-level request parameters
These fields are set at the top level of the request body:
Message-level cache_control
The cache_control field can also be set on individual messages. Message-level cache_control lets you mark specific cache breakpoints — the provider caches all content up to and including the marked message. This is the recommended approach for Claude models.
Cache control object
The cache_control object has the same structure whether used at the request level or message level:
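A sketch of the object shape, reusing the assumed TTL format from above:

```python
cache_control = {
    "type": "ephemeral",  # the cache type used throughout this page
    "ttl": "1h",          # optional Gateway extension; duration format assumed
}
```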