When you submit an input to a model, the model predicts a probable sequence of tokens that follows and returns that sequence as the output. Amazon Bedrock lets you run inference with the foundation model of your choice. When you run inference, you provide the following inputs (a short sketch showing how they fit together follows the list):
- Prompt – An input provided to the model in order for it to generate a response. For information about writing prompts, see Prompt engineering concepts. For information about protecting against prompt injection attacks, see Prompt injection security.
- Model – A foundation model or inference profile to run inference with. The model or inference profile that you choose also specifies a level of throughput, which defines the number and rate of input and output tokens that you can process. For more information about the foundation models that are available in Amazon Bedrock, see Amazon Bedrock foundation model information. For more information about inference profiles, see Set up a model invocation resource using inference profiles. For more information about increasing throughput, see Increase throughput with cross-Region inference and Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock.
- Inference parameters – A set of values that can be adjusted to limit or influence the model response. For information about inference parameters, see Influence response generation with inference parameters and Inference request parameters and response fields for foundation models.
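If you use an AWS SDK, these inputs map onto fields of a model invocation request. The following is a minimal sketch using the AWS SDK for Python (Boto3) Converse API; the Region, model ID, prompt text, and parameter values are illustrative assumptions, so substitute values that are valid for your account.

```python
# Minimal sketch: invoking a model with a prompt, a model ID, and inference
# parameters through the Amazon Bedrock Converse API (Boto3). The model ID,
# Region, and parameter values below are assumptions for illustration.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    # Model – a foundation model ID or an inference profile ID/ARN
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    # Prompt – supplied here as a single user message
    messages=[
        {"role": "user", "content": [{"text": "Summarize what Amazon Bedrock does in two sentences."}]}
    ],
    # Inference parameters – values that limit or influence the response
    inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
)

print(response["output"]["message"]["content"][0]["text"])
```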
Invoking models in different AWS Regions
When you invoke a model, you choose the AWS Region in which to invoke it. The quotas for the frequency and size of the requests that you can make depend on the Region. You can find these quotas by searching for the following quota names in Amazon Bedrock service quotas (a sketch for looking them up programmatically follows the list):
- On-demand model inference requests per minute for ${Model}
- On-demand InvokeModel tokens per minute for ${Model}
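You can also retrieve these quotas programmatically. The following is a hedged sketch that uses the Service Quotas API through Boto3; the "bedrock" service code and the per-minute name filter are assumptions and may need adjusting for your account and Region.

```python
# Sketch: listing per-minute Amazon Bedrock quotas in a Region with the
# Service Quotas API (Boto3). The "bedrock" service code and the name filter
# are assumptions for illustration.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "per minute" in quota["QuotaName"]:
            print(f'{quota["QuotaName"]}: {quota["Value"]}')
```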
You can also invoke an inference profile instead of the foundation model itself. An inference profile defines a model and one or more Regions to which the inference profile can route model invocation requests. By invoking an inference profile that includes multiple Regions, you can increase your throughput. For more information, see Increase throughput with cross-Region inference.
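In the API, you invoke an inference profile by passing its ID or ARN in the same modelId field that you would use for a foundation model. A minimal Boto3 sketch, assuming a US cross-Region inference profile for Anthropic Claude 3 Haiku:

```python
# Sketch: invoking a cross-Region inference profile rather than a foundation
# model. The inference profile ID below is an assumption for illustration;
# use an inference profile available in your account.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-haiku-20240307-v1:0",  # inference profile ID, not a base model ID
    messages=[{"role": "user", "content": [{"text": "Hello from a cross-Region inference profile."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```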
Requests made to a Region may be served out of local zones that share the same parent Region. For example, requests made to US East (N. Virginia) (us-east-1) may be served out of any local zone associated with it, such as Atlanta, US (us-east-1-atl-2a).
The same principle applies when using cross-Region inference. For example, requests made to the US Anthropic Claude 3 Haiku inference profile may be served out of any local zone whose parent Region is in the US, such as Seattle, US (us-west-2-sea-1a). When new local zones are added to AWS, they will also be added to the corresponding cross-Region inference endpoint.
To see a list of local zones and the parent Regions they're associated with, see AWS Local Zones Locations.