If you want to configure automatic scaling for a multi-container endpoint using the InvocationsPerInstance metric, we recommend that the model in each container exhibit similar CPU utilization and latency on each inference request. If traffic to the multi-container endpoint shifts from a low CPU utilization model to a high CPU utilization model while the overall call volume stays the same, the endpoint does not scale out, and there might not be enough instances to handle all the requests to the high CPU utilization model. For information about automatically scaling endpoints, see Automatic scaling of Amazon SageMaker AI models.
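As a rough illustration, the following sketch uses the Application Auto Scaling API (through boto3) to attach a target-tracking policy based on the InvocationsPerInstance metric to an endpoint variant. The endpoint name, variant name, capacity limits, and target value are placeholder assumptions; tune them to your workload.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Hypothetical endpoint and variant names; replace with your own.
    resource_id = "endpoint/my-multi-container-endpoint/variant/AllTraffic"

    # Register the variant's instance count as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )

    # Scale on the predefined SageMakerVariantInvocationsPerInstance metric,
    # which tracks invocations per instance for the variant.
    autoscaling.put_scaling_policy(
        PolicyName="invocations-per-instance-policy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Assumed target of 100 invocations per instance per minute.
            "TargetValue": 100.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )

Because the policy tracks only invocation counts, it cannot distinguish between cheap and expensive requests, which is why the per-container models should have similar resource profiles.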