Prompt Caching
Overview
Prompt caching lets you avoid reprocessing the same prompt content on every request. When you send a long system prompt, tool definitions, or conversation history repeatedly, the LLM provider can cache that content and reuse it on subsequent requests — reducing both latency and cost.
The LLM Gateway supports prompt caching across all major providers, with each provider using its own caching mechanism: Claude requires explicit cache_control markers, while OpenAI and Gemini cache eligible prompts automatically.
Cached input tokens are billed at a discounted rate compared to regular input tokens. The exact discount depends on the model and provider.
Prompt caching only activates when the cacheable portion of your prompt meets a minimum token threshold. Claude requires at least 1,024 tokens (2,048 for certain models), and OpenAI requires 1,024 tokens. Shorter prompts won’t benefit from caching.
Claude models
Claude models require you to explicitly mark which content blocks to cache using the cache_control field. Add cache_control with type set to "ephemeral" on any message you want cached.
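A minimal sketch, assuming the Gateway exposes an OpenAI-compatible Chat Completions endpoint; the base URL, API key, and model name are placeholders:

```python
from openai import OpenAI

# Placeholder Gateway endpoint and credentials.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_API_KEY")

response = client.chat.completions.create(
    model="claude-sonnet-4",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": "<several thousand tokens of static instructions>",
            # Cache everything up to and including this message.
            "cache_control": {"type": "ephemeral"},
        },
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```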
You can also set cache_control on tool result messages to cache tool interaction history in multi-turn agentic conversations.
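For instance, a tool result message with a cache breakpoint might look like the following sketch, where the tool_call_id and content are illustrative:

```python
# Illustrative tool result message; in a real agent loop the ID and
# content come from the preceding tool call.
tool_result_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": '{"results": ["..."]}',
    # Cache the conversation history up to and including this tool output.
    "cache_control": {"type": "ephemeral"},
}
```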
Cache control with TTL
The LLM Gateway extends Anthropic’s native cache_control with an optional ttl field for specifying how long cached content stays valid. The ttl field is a Gateway extension, not part of Anthropic’s native API; if it is omitted, Anthropic’s default cache duration applies.
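A sketch of a breakpoint with a TTL; the duration format shown ("1h") is an assumption:

```python
# The ttl value format ("1h") is an assumption; omit ttl to use
# Anthropic's default cache duration.
system_message = {
    "role": "system",
    "content": "<long static instructions>",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```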
Where to place cache_control
The cache_control field can be placed on:
- System messages — Cache long system prompts that don’t change between requests
- User and assistant messages — Cache conversation history in multi-turn flows
- Tool result messages — Cache tool call outputs in agentic workflows
For best results with Claude, place cache_control on the content that stays the same across requests — typically the system prompt and any static context. Content after the last cache breakpoint is not cached.
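As an illustration, a two-breakpoint layout for a multi-turn conversation might look like this sketch (all contents are placeholders):

```python
LONG_SYSTEM_PROMPT = "<several thousand tokens of static instructions>"

messages = [
    {
        "role": "system",
        "content": LONG_SYSTEM_PROMPT,
        # Breakpoint 1: the static system prompt.
        "cache_control": {"type": "ephemeral"},
    },
    {"role": "user", "content": "First question"},
    {
        "role": "assistant",
        "content": "First answer",
        # Breakpoint 2: the conversation history so far.
        "cache_control": {"type": "ephemeral"},
    },
    # After the last breakpoint: new content, not cached.
    {"role": "user", "content": "Follow-up question"},
]
```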
OpenAI models
OpenAI models cache prompts automatically. No configuration is needed — the gateway passes your requests through and caching happens on OpenAI’s infrastructure.
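A minimal sketch under the same OpenAI-compatible-endpoint assumption as above; note that nothing cache-specific appears in the request:

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_API_KEY")

# No cache fields are needed: OpenAI caches repeated prompt prefixes
# automatically once they pass the 1,024-token minimum.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "<long static instructions>"},
        {"role": "user", "content": "Summarize today's open tickets."},
    ],
)
```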
You can optionally configure cache behavior with two additional request-level fields; see the top-level request parameters in the API reference below.
Gemini models
Gemini models also cache automatically — no configuration needed.
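The same pattern applies, sketched here with a placeholder Gemini model name:

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_API_KEY")

# As with OpenAI, caching happens on the provider side with no
# request-level configuration.
response = client.chat.completions.create(
    model="gemini-2.0-flash",  # placeholder model name
    messages=[
        {"role": "system", "content": "<long static instructions>"},
        {"role": "user", "content": "Summarize today's open tickets."},
    ],
)
```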
Reading cache metrics from the response
All providers return cache usage data in the usage.prompt_tokens_details field of the response.
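For example, with the OpenAI Python SDK the metrics can be read off the response object from any of the earlier sketches:

```python
usage = response.usage
details = usage.prompt_tokens_details

# cached_tokens counts the prompt tokens served from cache and billed
# at the discounted cached-input rate.
cached = (details.cached_tokens or 0) if details else 0
print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
```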
When cached_tokens is greater than zero, those tokens were served from cache and billed at the discounted cached input rate rather than the standard input rate.
Best practices
- Cache your system prompt — System prompts are the best candidates for caching since they stay the same across requests. Place cache_control on the system message for Claude, or rely on automatic caching for OpenAI and Gemini.
- Cache tool definitions — If you use the same tools across multiple requests, the tool definitions in your prompt are automatically eligible for caching.
- Order messages for maximum cache hits — Put static content (system prompt, tool definitions) at the beginning of the message array. Content before the cache breakpoint is more likely to match across requests.
- Monitor cache metrics — Check prompt_tokens_details.cached_tokens in responses to verify caching is working and estimate your cost savings.
API reference
Request
Top-level request parameters
These fields are set at the top level of the request body:
Message-level cache_control
The cache_control field can also be set on individual messages. Message-level cache_control lets you mark specific cache breakpoints — the provider caches all content up to and including the marked message. This is the recommended approach for Claude models.
Cache control object
The cache_control object has the same structure whether used at the request level or message level:
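A sketch of the object shape, reusing the assumed TTL format from above:

```python
cache_control = {
    "type": "ephemeral",  # the cache type used throughout this page
    "ttl": "1h",          # optional Gateway extension; duration format assumed
}
```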