If you want to configure automatic scaling for a multi-container endpoint using the InvocationsPerInstance metric, we recommend that the model in each container exhibit similar CPU utilization and latency on each inference request. If traffic to the multi-container endpoint shifts from a low CPU utilization model to a high CPU utilization model while the overall call volume stays the same, the endpoint does not scale out, and there might not be enough instances to handle all the requests to the high CPU utilization model. For information about automatically scaling endpoints, see Automatic scaling of Amazon SageMaker AI models.
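As a rough illustration, the following sketch uses the Application Auto Scaling API (through boto3) to attach a target-tracking policy based on the InvocationsPerInstance metric to an endpoint variant. The endpoint name, variant name, capacity limits, and target value are placeholder assumptions; tune them to your workload.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Hypothetical endpoint and variant names; replace with your own.
    resource_id = "endpoint/my-multi-container-endpoint/variant/AllTraffic"

    # Register the variant's instance count as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )

    # Scale on the predefined SageMakerVariantInvocationsPerInstance metric,
    # which tracks invocations per instance for the variant.
    autoscaling.put_scaling_policy(
        PolicyName="invocations-per-instance-policy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Assumed target of 100 invocations per instance per minute.
            "TargetValue": 100.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )

Because the policy tracks only invocation counts, it cannot distinguish between cheap and expensive requests, which is why the per-container models should have similar resource profiles.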