Note
Amazon Bedrock prompt caching is currently only available to a select number of customers.
To learn more about participating in the preview, see Amazon Bedrock prompt caching.
Prompt caching is an optional feature that you can use while getting model inference in Amazon Bedrock to achieve reductions in response latency. You can add portions of your conversation to a cache so that the model can reuse the context in the cache instead of fully processing the input and computing responses each time.
Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. For example, if you have a chatbot where users can upload documents and ask questions about them, it can be time consuming for the model to process the document every time the user provides input. With prompt caching, you can cache the document in the context of the conversation for faster responses.
When you use prompt caching, you're charged at a reduced rate for inference, and at a separate rate for the tokens that are read from and written to the cache. For more information, see the Amazon Bedrock pricing page.
How it works
If you opt to use prompt caching, Amazon Bedrock creates a cache composed of cache checkpoints. These are checkpoints at which the entire prefix of the prompt leading up to that point is cached. In subsequent requests, the model can retrieve this cached information instead of processing it again, resulting in faster response times and reduced cost.
Cache checkpoints have a minimum and maximum number of tokens, dependent on the specific model you're using. You can only create a cache checkpoint if your total prompt prefix meets the minimum number of tokens. For example, the Anthropic Claude 3.5 Sonnet v2 model requires 1,024 tokens for cache checkpoints. You can create your first checkpoint after your prompt and the model's responses reach 1,024 tokens. You can create a second checkpoint after the total reaches 2,048 tokens. If you try to add a cache checkpoint without meeting the minimum number of tokens, your inference request still succeeds but the checkpoint isn't added to the cache.
The cache has a five-minute Time to Live (TTL), which resets with each successful cache hit. During this period, the context in the cache is preserved. If no cache hits occur within the TTL window, your cache expires.
If your cache expires, you can reuse the previously cached context up to that point as the first cache checkpoint of a new cache.
You can use prompt caching anytime you get model inference in Amazon Bedrock for supported models. Prompt caching is supported by the following Amazon Bedrock features:
- Converse and ConverseStream APIs: You can carry on a conversation with a model where you specify cache checkpoints in your prompts.
- InvokeModel and InvokeModelWithResponseStream APIs: You can submit single prompt requests in which you enable prompt caching and specify your cache checkpoints.
- Amazon Bedrock Agents: When you create or update an agent, you can choose to enable or disable prompt caching. Amazon Bedrock automatically handles the prompt caching and checkpoint behavior for you.
The APIs provide you with the most flexibility and granular control over the prompt cache. You can set each individual cache checkpoint within your prompts. You can add to the cache by creating more cache checkpoints, up to the maximum number of cache checkpoints allowed for the specific model. For more information, see Supported models, regions, and limits.
To use prompt caching with other features such as Amazon Bedrock Agents, you simply have to enable the prompt caching field when you create or update your agent. When you enable prompt caching, the caching behavior and cache checkpoints are handled automatically for you by Amazon Bedrock.
Supported models, regions, and limits
The following table lists the supported AWS Regions, token minimums, maximum number of cache checkpoints, and fields that allow cache checkpoints for each supported model.
| Model name | Model ID | Regions that support prompt caching | Minimum number of tokens per cache checkpoint | Maximum number of cache checkpoints | Fields in which you can add cache checkpoints |
| --- | --- | --- | --- | --- | --- |
| Amazon Nova Micro v1 | amazon.nova-micro-v1:0 | US East (N. Virginia), US West (Oregon) | 1 | 1 | |
| Amazon Nova Lite v1 | amazon.nova-lite-v1:0 | US East (N. Virginia), US West (Oregon) | 1 | 1 | |
| Amazon Nova Pro v1 | amazon.nova-pro-v1:0 | US East (N. Virginia), US West (Oregon) | 1 | 1 | |
| Claude 3.5 Haiku | anthropic.claude-3-5-haiku-20241022-v1:0 | Regular inference: US West (Oregon) | 2,048 | 4 | |
| Claude 3.5 Sonnet v2 | anthropic.claude-3-5-sonnet-20241022-v2:0 | Regular inference: US West (Oregon) | 1,024 | 4 | |
Getting started
The following sections provide a brief overview of how to use the prompt caching feature with each method of interacting with models through Amazon Bedrock.
The Converse API provides advanced and flexible options for implementing prompt caching in multi-turn conversations. For more information about the prompt requirements for each model, see the preceding section Supported models, regions, and limits.
Example request
The following examples show a cache checkpoint set in the `messages`, `system`, or `tools` fields of a request to the Converse API. You can place checkpoints in any of these locations for a given request. For example, if sending a request to the Claude 3.5 Sonnet v2 model, you could place two cache checkpoints in `messages`, one cache checkpoint in `system`, and one in `tools`. For more detailed information and examples of structuring and sending Converse API requests, see Carry out a conversation with the Converse API operations.
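The request below is a minimal sketch using boto3 (the AWS SDK for Python). It assumes the Converse API's `cachePoint` content block for marking checkpoints; the Region, model ID, and prompt text are placeholder values that you would replace with your own.

```python
# Sketch of a Converse API request that sets cache checkpoints in the
# system and messages fields. The cachePoint content block, Region, and
# model ID shown here are assumptions; adapt them to your own setup.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[
        # Long, reusable instructions that you want the model to cache.
        {"text": "You are an assistant that answers questions about the attached document."},
        # Cache checkpoint: the prompt prefix up to this point is cached.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {
            "role": "user",
            "content": [
                # A long document or context that meets the model's minimum token requirement.
                {"text": "<long document text goes here>"},
                # Second cache checkpoint, this one in the messages field.
                {"cachePoint": {"type": "default"}},
                {"text": "Summarize the key points of the document."},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```

In a follow-up request, you would resend the same cached prefix (here, the `system` instructions and the document) so that the model can read that context from the cache instead of processing it again.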
The model response from the Converse API includes two new fields that are specific to prompt caching. The `CacheReadInputTokens` and `CacheWriteInputTokens` values tell you how many tokens were read from the cache and how many tokens were written to the cache because of your previous request. These are values that you're charged for by Amazon Bedrock, at a rate that's lower than the cost of full model inference.
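As a rough illustration, assuming the `response` dictionary returned by the boto3 `converse` call sketched earlier, you could inspect these token counts in the response's `usage` field (the exact field names and casing below are assumptions based on the fields described above and may vary by SDK):

```python
# Sketch: read the prompt caching token counts from a Converse API response.
# Assumes `response` was returned by client.converse(...) in the earlier sketch.
usage = response["usage"]
print("Input tokens:", usage["inputTokens"])
print("Output tokens:", usage["outputTokens"])
print("Tokens read from cache:", usage.get("cacheReadInputTokens", 0))
print("Tokens written to cache:", usage.get("cacheWriteInputTokens", 0))
```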
Prompt caching is enabled by default when you call the InvokeModel API. You can set cache checkpoints at any point in your request body, similar to the previous example for the Converse API.
The following example shows how to structure the body of your InvokeModel request for the Anthropic Claude 3.5 Sonnet v2 model. Note that the exact format and fields of the body for InvokeModel requests may vary depending on the model you choose. To see the format and content of the request and response bodies for different models, see Inference request parameters and response fields for foundation models.
body={
"anthropic_version": "bedrock-2023-05-31",
"system":"Reply concisely",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the best way to learn programming."
},
{
"type": "text",
"text": "Add additional context here for the prompt that meets the minimum token requirement for your chosen model.",
"cache_control": {
"type": "ephemeral"
}
}
]
}
],
"max_tokens": 2048,
"temperature": 0.5,
"top_p": 0.8,
"stop_sequences": [
"stop"
],
"top_k": 250
}
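To send this body, you could call the InvokeModel API through an SDK. The following is a minimal sketch using boto3; the Region and model ID are example values that you would replace with your own.

```python
# Sketch: send the request body shown above with the InvokeModel API using boto3.
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body),  # `body` is the dictionary defined above
)

# Anthropic models return the generated text in the content list of the response body.
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```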
For more information about sending an InvokeModel request, see Submit a single prompt with InvokeModel.
In a chat playground in the Amazon Bedrock console, you can turn on the prompt caching option, and Amazon Bedrock automatically creates cache checkpoints for you.
Follow the instructions in Generate responses in the console using playgrounds to get started with prompting in an Amazon Bedrock playground. For supported models, prompt caching is turned on automatically in the playground. If it isn't, do the following to turn on prompt caching:
- In the left side panel, open the Configurations menu.
- Turn on the Prompt caching toggle.
- Run your prompts.
After your combined input and model responses reach the minimum number of tokens required for a checkpoint (which varies by model), Amazon Bedrock automatically creates the first cache checkpoint for you. As you continue chatting, each time you reach the minimum number of tokens again, Amazon Bedrock creates a new checkpoint, up to the maximum number of checkpoints allowed for the model. You can view your cache checkpoints at any time by choosing View cache checkpoints next to the Prompt caching toggle.

You can view how many tokens are read from and written to the cache for each interaction with the model in the Caching metrics pop-up in the playground responses.
If you turn off the prompt caching toggle while in the middle of a conversation, you can continue chatting with the model.