Optimize model inference for latency - Amazon Bedrock

Note

The Latency Optimized Inference feature is in preview release for Amazon Bedrock and is subject to change.

Latency-optimized inference for foundation models in Amazon Bedrock delivers faster response times and improved responsiveness for AI applications. The optimized versions of Anthropic's Claude 3.5 Haiku model and Meta's Llama 3.1 405B and 70B models offer significantly reduced latency without compromising accuracy.

Accessing the latency optimization capability requires no additional setup or model fine-tuning, so existing applications can be enhanced with faster response times immediately. You can set the "latency" parameter to "optimized" in the performanceConfig field when calling the Bedrock Runtime API. If you select "standard" as your invocation option, your requests are served by standard inference. By default, all requests are routed through "standard".

"performanceConfig": { "latency": "standard | optimized" }
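As a sketch of how this parameter might be set from code, the following uses boto3's `bedrock-runtime` client with the Converse API. The exact request shape should be verified against the current boto3 documentation; the model ID and Region are taken from the table later in this page.

```python
# Sketch: requesting latency-optimized inference via the Converse API.
# The performanceConfig field mirrors the JSON shown above; verify the
# request shape against the current boto3 bedrock-runtime documentation.

def build_converse_request(model_id, prompt, latency="optimized"):
    """Build kwargs for a Converse call with a latency preference."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        # "optimized" requests latency-optimized inference; "standard"
        # (the default) routes the request to standard inference.
        "performanceConfig": {"latency": latency},
    }

if __name__ == "__main__":
    import boto3  # imported here so the helper above has no AWS dependency

    # us-east-2 is US East (Ohio), the Region listed for this feature.
    client = boto3.client("bedrock-runtime", region_name="us-east-2")
    request = build_converse_request(
        "us.anthropic.claude-3-5-haiku-20241022-v1:0",
        "Summarize latency-optimized inference in one sentence.",
    )
    response = client.converse(**request)
    print(response["output"]["message"]["content"][0]["text"])
```

The helper only assembles the request dictionary, so the latency preference can be toggled per call without touching the rest of the invocation code.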

Once you reach the usage quota for latency optimization for a model, we attempt to serve the request with standard latency instead. In such cases, the request is charged at standard latency rates. The latency configuration actually used for a served request is visible in the API response and in AWS CloudTrail logs. You can also view metrics for latency-optimized requests in Amazon CloudWatch under "model-id+latency-optimized".
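Because quota exhaustion silently falls back to standard latency, it can be useful to check which tier actually served a request. The sketch below assumes the Converse response echoes a "performanceConfig" object; that field name is an assumption to confirm against the current API reference.

```python
# Sketch: reading the latency tier that actually served a request.
# Assumption: the Converse response includes a "performanceConfig"
# object mirroring the request parameter; confirm against the current
# Bedrock Runtime API reference.

def served_latency(response, default="standard"):
    """Return the latency tier reported in a Converse-style response."""
    return response.get("performanceConfig", {}).get("latency", default)

# Stubbed response (no AWS call) for illustration:
stub = {"performanceConfig": {"latency": "optimized"}, "output": {}}
print(served_latency(stub))
```

A check like this lets an application log or alarm on fallbacks, since requests served at standard latency are also billed at standard rates.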

Latency-optimized inference is available for Meta's Llama 3.1 70B and 405B models, as well as Anthropic's Claude 3.5 Haiku, in the US East (Ohio) Region via cross-region inference. For more information about pricing, visit the pricing page.

Note

Latency-optimized inference for Llama 3.1 405B currently supports requests with a combined input and output token count of up to 11K. For requests with larger token counts, we fall back to standard mode.

| Foundation Model | Model ID | Supported Region |
| --- | --- | --- |
| Anthropic Claude 3.5 Haiku | us.anthropic.claude-3-5-haiku-20241022-v1:0 | US East (Ohio) |
| Meta Llama 3.1 70B Instruct | us.meta.llama3-1-70b-instruct-v1:0 | US East (Ohio) |
| Meta Llama 3.1 405B Instruct | us.meta.llama3-1-405b-instruct-v1:0 | US East (Ohio) |