Prompt Caching
Public Beta
Prompt caching is available in Public Beta.
Overview
Prompt caching lets you avoid reprocessing the same prompt content on every request. When you send a long system prompt, tool definitions, or conversation history repeatedly, the LLM provider can cache that content and reuse it on subsequent requests — reducing both latency and cost.
The LLM Gateway supports prompt caching across all major providers, with each provider using its own caching mechanism:
Cached input tokens are billed at a discounted rate compared to regular input tokens. The exact discount depends on the model and provider.
Prompt caching only activates when the cacheable portion of your prompt meets a minimum token threshold. Claude’s minimum varies by model — see Minimum cacheable prompt length for the per-model limits. OpenAI requires 1,024 tokens. Shorter prompts won’t benefit from caching.
Claude models
Claude models require you to explicitly mark which content blocks to cache using the cache_control field. Add cache_control with type set to "ephemeral" on any message you want cached.
Python
JavaScript
You can also set cache_control on tool result messages to cache tool interaction history in multi-turn agentic conversations.
Minimum cacheable prompt length
Claude only caches prompts that meet a minimum token threshold, and the threshold depends on the model. If the cacheable portion of your prompt falls below this threshold, the request is processed without caching and no error is returned.
Cache control with TTL
The LLM Gateway extends Anthropic’s native cache_control with an optional ttl field for specifying cache duration. This is a Gateway-specific parameter — Anthropic’s native API does not support it.
The ttl field is a Gateway extension, not part of Anthropic’s native API. If omitted, Anthropic’s default cache duration applies.
Where to place cache_control
The cache_control field can be placed on:
- System messages — Cache long system prompts that don’t change between requests
- User and assistant messages — Cache conversation history in multi-turn flows
- Tool result messages — Cache tool call outputs in agentic workflows
For best results with Claude, place cache_control on the content that stays the same across requests — typically the system prompt and any static context. Content after the last cache breakpoint is not cached.
OpenAI models
OpenAI models cache prompts automatically. No configuration is needed — the gateway passes your requests through and caching happens on OpenAI’s infrastructure.
Python
JavaScript
You can optionally configure cache behavior with two additional request-level fields:
Gemini models
Gemini models also cache automatically — no configuration needed.
Python
JavaScript
Kimi models
Kimi models also cache automatically — no configuration needed.
Python
JavaScript
You can optionally configure cache behavior with the same request-level fields supported for OpenAI models:
Reading cache metrics from the response
All providers return cache usage data in the usage.prompt_tokens_details field of the response:
When cached_tokens is greater than zero, those tokens were served from cache and billed at the discounted cached input rate rather than the standard input rate.
Best practices
- Cache your system prompt — System prompts are the best candidates for caching since they stay the same across requests. Place
cache_controlon the system message for Claude, or rely on automatic caching for OpenAI and Gemini. - Cache tool definitions — If you use the same tools across multiple requests, the tool definitions in your prompt are automatically eligible for caching.
- Order messages for maximum cache hits — Put static content (system prompt, tool definitions) at the beginning of the message array. Content before the cache breakpoint is more likely to match across requests.
- Monitor cache metrics — Check
prompt_tokens_details.cached_tokensin responses to verify caching is working and estimate your cost savings.
API reference
Request
Top-level request parameters
These fields are set at the top level of the request body:
Message-level cache_control
The cache_control field can also be set on individual messages. Message-level cache_control lets you mark specific cache breakpoints — the provider caches all content up to and including the marked message. This is the recommended approach for Claude models.
Cache control object
The cache_control object has the same structure whether used at the request level or message level: