When you submit an input to a model, the model predicts a probable sequence of tokens that follows and returns that sequence as the output. Amazon Bedrock lets you run inference with the foundation model of your choice. When you run inference, you provide the following inputs (a short sketch showing how they fit together follows the list):
- Prompt – An input provided to the model in order for it to generate a response. For information about writing prompts, see Prompt engineering concepts. For information about protecting against prompt injection attacks, see Prompt injection security.
- Model – A foundation model or inference profile to run inference with. The model or inference profile that you choose also specifies a level of throughput, which defines the number and rate of input and output tokens that you can process. For more information about the foundation models that are available in Amazon Bedrock, see Amazon Bedrock foundation model information. For more information about inference profiles, see Set up a model invocation resource using inference profiles. For more information about increasing throughput, see Increase throughput with cross-Region inference and Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock.
- Inference parameters – A set of values that can be adjusted to limit or influence the model response. For information about inference parameters, see Influence response generation with inference parameters and Inference request parameters and response fields for foundation models.
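If you use an AWS SDK, these inputs map onto fields of a model invocation request. The following is a minimal sketch using the AWS SDK for Python (Boto3) Converse API; the Region, model ID, prompt text, and parameter values are illustrative assumptions, so substitute values that are valid for your account.

```python
# Minimal sketch: invoking a model with a prompt, a model ID, and inference
# parameters through the Amazon Bedrock Converse API (Boto3). The model ID,
# Region, and parameter values below are assumptions for illustration.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    # Model – a foundation model ID or an inference profile ID/ARN
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    # Prompt – supplied here as a single user message
    messages=[
        {"role": "user", "content": [{"text": "Summarize what Amazon Bedrock does in two sentences."}]}
    ],
    # Inference parameters – values that limit or influence the response
    inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
)

print(response["output"]["message"]["content"][0]["text"])
```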
Invoking models in different AWS Regions
When you invoke a model, you choose the AWS Region in which to invoke it. The quotas for the frequency and size of the requests that you can make depend on the Region. You can find these quotas by searching for the following quota names in Amazon Bedrock service quotas (a sketch for looking them up programmatically follows the list):
- On-demand model inference requests per minute for ${Model}
- On-demand InvokeModel tokens per minute for ${Model}
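You can also retrieve these quotas programmatically. The following is a hedged sketch that uses the Service Quotas API through Boto3; the "bedrock" service code and the per-minute name filter are assumptions and may need adjusting for your account and Region.

```python
# Sketch: listing per-minute Amazon Bedrock quotas in a Region with the
# Service Quotas API (Boto3). The "bedrock" service code and the name filter
# are assumptions for illustration.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "per minute" in quota["QuotaName"]:
            print(f'{quota["QuotaName"]}: {quota["Value"]}')
```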
You can also invoke an inference profile instead of the foundation model itself. An inference profile defines a model and one or more Regions to which the inference profile can route model invocation requests. By invoking an inference profile that includes multiple Regions, you can increase your throughput. For more information, see Increase throughput with cross-Region inference.
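In the API, you invoke an inference profile by passing its ID or ARN in the same modelId field that you would use for a foundation model. A minimal Boto3 sketch, assuming a US cross-Region inference profile for Anthropic Claude 3 Haiku:

```python
# Sketch: invoking a cross-Region inference profile rather than a foundation
# model. The inference profile ID below is an assumption for illustration;
# use an inference profile available in your account.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-haiku-20240307-v1:0",  # inference profile ID, not a base model ID
    messages=[{"role": "user", "content": [{"text": "Hello from a cross-Region inference profile."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```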
Requests made to a Region may be served out of local zones that share the same parent Region. For example, requests made to US East (N. Virginia) (us-east-1) may be served out of any local zone associated with it, such as Atlanta, US (us-east-1-atl-2a).
The same principle applies when using cross-Region inference. For example, requests made to the US Anthropic Claude 3 Haiku inference profile may be served out of any local zone whose parent Region is in the US, such as Seattle, US (us-west-2-sea-1a). When new local zones are added to AWS, they will also be added to the corresponding cross-Region inference endpoint.
To see a list of local zones and the parent Regions they're associated with, see AWS Local Zones Locations.