Note
Amazon Bedrock prompt caching is currently only available to a select number of customers.
To learn more about participating in the preview, see Amazon Bedrock prompt caching.
Prompt caching is an optional feature that you can use while getting model inference in Amazon Bedrock to achieve reductions in response latency. You can add portions of your conversation to a cache so that the model can reuse the context in the cache instead of fully processing the input and computing responses each time.
Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. For example, if you have a chatbot where users can upload documents and ask questions about them, it can be time consuming for the model to process the document every time the user provides input. With prompt caching, you can cache the document in the context of the conversation for faster responses.
When you use prompt caching, you're charged at a reduced rate for inference, and at a separate rate for the tokens that are read from and written to the cache. For more information, see the Amazon Bedrock pricing page.
How it works
If you opt to use prompt caching, Amazon Bedrock creates a cache composed of cache checkpoints. These are checkpoints at which the entire prefix of the prompt leading up to that point is cached. In subsequent requests, the model can retrieve this cached information instead of processing it again, resulting in faster response times and reduced cost.
Cache checkpoints have a minimum and maximum number of tokens, dependent on the specific model you're using. You can only create a cache checkpoint if your total prompt prefix meets the minimum number of tokens. For example, the Anthropic Claude 3.5 Sonnet v2 model requires 1,024 tokens for cache checkpoints. You can create your first checkpoint after your prompt and the model's responses reach 1,024 tokens. You can create a second checkpoint after the total reaches 2,048 tokens. If you try to add a cache checkpoint without meeting the minimum number of tokens, your inference request still succeeds but the checkpoint isn't added to the cache.
The cache has a five-minute Time to Live (TTL), which resets with each successful cache hit. During this period, the context in the cache is preserved. If no cache hits occur within the TTL window, your cache expires.
If your cache expires, you can reuse the previously cached context up to that point as the first cache checkpoint of a new cache.
You can use prompt caching anytime you get model inference in Amazon Bedrock for supported models. Prompt caching is supported by the following Amazon Bedrock features:
- Converse and ConverseStream APIs: You can carry on a conversation with a model where you specify cache checkpoints in your prompts.
- InvokeModel and InvokeModelWithResponseStream APIs: You can submit single prompt requests in which you enable prompt caching and specify your cache checkpoints.
- Amazon Bedrock Agents: When you create or update an agent, you can choose to enable or disable prompt caching. Amazon Bedrock automatically handles the prompt caching and checkpoint behavior for you.
The APIs provide you with the most flexibility and granular control over the prompt cache. You can set each individual cache checkpoint within your prompts. You can add to the cache by creating more cache checkpoints, up to the maximum number of cache checkpoints allowed for the specific model. For more information, see Supported models, regions, and limits.
To use prompt caching with other features such as Amazon Bedrock Agents, you simply have to enable the prompt caching field when you create or update your agent. When you enable prompt caching, the caching behavior and cache checkpoints are handled automatically for you by Amazon Bedrock.
Supported models, regions, and limits
The following table lists the supported AWS Regions, token minimums, maximum number of cache checkpoints, and fields that allow cache checkpoints for each supported model.
| Model name | Model ID | Regions that support prompt caching | Minimum number of tokens per cache checkpoint | Maximum number of cache checkpoints | Fields in which you can add cache checkpoints |
| --- | --- | --- | --- | --- | --- |
| Amazon Nova Micro v1 | amazon.nova-micro-v1:0 | US East (N. Virginia), US West (Oregon) | 1 | 1 | |
| Amazon Nova Lite v1 | amazon.nova-lite-v1:0 | US East (N. Virginia), US West (Oregon) | 1 | 1 | |
| Amazon Nova Pro v1 | amazon.nova-pro-v1:0 | US East (N. Virginia), US West (Oregon) | 1 | 1 | |
| Claude 3.5 Haiku | anthropic.claude-3-5-haiku-20241022-v1:0 | Regular inference: US West (Oregon) | 2,048 | 4 | |
| Claude 3.5 Sonnet v2 | anthropic.claude-3-5-sonnet-20241022-v2:0 | Regular inference: US West (Oregon) | 1,024 | 4 | |
Getting started
The following sections provide a brief overview of how to use the prompt caching feature with each method of interacting with models through Amazon Bedrock.
The Converse API provides advanced and flexible options for implementing prompt caching in multi-turn conversations. For more information about the prompt requirements for each model, see the preceding section Supported models, regions, and limits.
Example request
The following examples show a cache checkpoint set in the `messages`, `system`, or `tools` fields of a request to the Converse API. You can place checkpoints in any of these locations for a given request. For example, if sending a request to the Claude 3.5 Sonnet v2 model, you could place two cache checkpoints in `messages`, one cache checkpoint in `system`, and one in `tools`. For more detailed information and examples of structuring and sending Converse API requests, see Carry out a conversation with the Converse API operations.
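The request below is a minimal sketch using boto3 (the AWS SDK for Python). It assumes the Converse API's `cachePoint` content block for marking checkpoints; the Region, model ID, and prompt text are placeholder values that you would replace with your own.

```python
# Sketch of a Converse API request that sets cache checkpoints in the
# system and messages fields. The cachePoint content block, Region, and
# model ID shown here are assumptions; adapt them to your own setup.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[
        # Long, reusable instructions that you want the model to cache.
        {"text": "You are an assistant that answers questions about the attached document."},
        # Cache checkpoint: the prompt prefix up to this point is cached.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {
            "role": "user",
            "content": [
                # A long document or context that meets the model's minimum token requirement.
                {"text": "<long document text goes here>"},
                # Second cache checkpoint, this one in the messages field.
                {"cachePoint": {"type": "default"}},
                {"text": "Summarize the key points of the document."},
            ],
        }
    ],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```

In a follow-up request, you would resend the same cached prefix (here, the `system` instructions and the document) so that the model can read that context from the cache instead of processing it again.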
The model response from the Converse API includes two new fields that are specific to prompt caching. The `CacheReadInputTokens` and `CacheWriteInputTokens` values tell you how many tokens were read from the cache and how many tokens were written to the cache because of your previous request. These are values that you're charged for by Amazon Bedrock, at a rate that's lower than the cost of full model inference.
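As a rough illustration, assuming the `response` dictionary returned by the boto3 `converse` call sketched earlier, you could inspect these token counts in the response's `usage` field (the exact field names and casing below are assumptions based on the fields described above and may vary by SDK):

```python
# Sketch: read the prompt caching token counts from a Converse API response.
# Assumes `response` was returned by client.converse(...) in the earlier sketch.
usage = response["usage"]
print("Input tokens:", usage["inputTokens"])
print("Output tokens:", usage["outputTokens"])
print("Tokens read from cache:", usage.get("cacheReadInputTokens", 0))
print("Tokens written to cache:", usage.get("cacheWriteInputTokens", 0))
```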
Prompt caching is enabled by default when you call the InvokeModel API. You can set cache checkpoints at any point in your request body, similar to the previous example for the Converse API.
The following example shows how to structure the body of your InvokeModel request for the Anthropic Claude 3.5 Sonnet v2 model. Note that the exact format and fields of the body for InvokeModel requests may vary depending on the model you choose. To see the format and content of the request and response bodies for different models, see Inference request parameters and response fields for foundation models.
body={
"anthropic_version": "bedrock-2023-05-31",
"system":"Reply concisely",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the best way to learn programming."
},
{
"type": "text",
"text": "Add additional context here for the prompt that meets the minimum token requirement for your chosen model.",
"cache_control": {
"type": "ephemeral"
}
}
]
}
],
"max_tokens": 2048,
"temperature": 0.5,
"top_p": 0.8,
"stop_sequences": [
"stop"
],
"top_k": 250
}
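To send this body, you could call the InvokeModel API through an SDK. The following is a minimal sketch using boto3; the Region and model ID are example values that you would replace with your own.

```python
# Sketch: send the request body shown above with the InvokeModel API using boto3.
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body),  # `body` is the dictionary defined above
)

# Anthropic models return the generated text in the content list of the response body.
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```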
For more information about sending an InvokeModel request, see Submit a single prompt with InvokeModel.
In a chat playground in the Amazon Bedrock console, you can turn on the prompt caching option, and Amazon Bedrock automatically creates cache checkpoints for you.
Follow the instructions in Generate responses in the console using playgrounds to get started with prompting in an Amazon Bedrock playground. For supported models, prompt caching is turned on automatically in the playground. If it isn't, do the following to turn on prompt caching:
- In the left side panel, open the Configurations menu.
- Turn on the Prompt caching toggle.
- Run your prompts.
After your combined input and model responses reach the minimum number of tokens required for a checkpoint (which varies by model), Amazon Bedrock automatically creates the first cache checkpoint for you. As you continue chatting, each time you reach the minimum number of tokens again, Amazon Bedrock creates a new checkpoint, up to the maximum number of checkpoints allowed for the model. You can view your cache checkpoints at any time by choosing View cache checkpoints next to the Prompt caching toggle.

You can view how many tokens are read from and written to the cache for each interaction with the model in the Caching metrics pop-up in the playground responses.
If you turn off the prompt caching toggle while in the middle of a conversation, you can continue chatting with the model.