Set Auto Scaling Policies for Multi-Model Endpoint Deployments
SageMaker AI multi-model endpoints fully support automatic scaling, which manages replicas of models to ensure models scale based on traffic patterns. We recommend that you configure your multi-model endpoint and the size of your instances based on Instance recommendations for multi-model endpoint deployments and also set up instance based auto scaling for your endpoint. The invocation rates used to trigger an auto-scale event are based on the aggregate set of predictions across the full set of models served by the endpoint. For additional details on setting up endpoint auto scaling, see Automatically Scale Amazon SageMaker AI Models.
You can set up auto scaling policies with predefined and custom metrics on both CPU and GPU backed multi-model endpoints.
Note
SageMaker AI multi-model endpoint metrics are available at one-minute granularity.
Define a scaling policy
To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.
Scaling policy configuration is represented by a JSON block. You save your scaling
policy configuration as a JSON block in a text file. You use that text file when invoking
the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see
TargetTrackingScalingPolicyConfiguration
in the Application Auto Scaling API Reference.
The following options are available for defining a target-tracking scaling policy configuration.
Use a predefined metric
To quickly define a target-tracking scaling policy for a variant, use the
SageMakerVariantInvocationsPerInstance
predefined metric.
SageMakerVariantInvocationsPerInstance
is the average number of times per
minute that each instance for a variant is invoked. We strongly recommend using this
metric.
To use a predefined metric in a scaling policy, create a target tracking configuration
for your policy. In the target tracking configuration, include a
PredefinedMetricSpecification
for the predefined metric and a
TargetValue
for the target value of that metric.
The following example is a typical policy configuration for target-tracking scaling
for a variant. In this configuration, we use the
SageMakerVariantInvocationsPerInstance
predefined metric to adjust the
number of variant instances so that each instance has an
InvocationsPerInstance
metric of 70
.
{"TargetValue": 70.0, "PredefinedMetricSpecification": { "PredefinedMetricType": "InvocationsPerInstance" } }
Note
We recommend that you use InvocationsPerInstance
while using
multi-model endpoints. The TargetValue
for this metric depends on your
application’s latency requirements. We also recommend that you load test your endpoints
to set up suitable scaling parameter values. To learn more about load testing and
setting up autoscaling for your endpoints, see the blog Configuring autoscaling inference endpoints in Amazon SageMaker AI
Use a custom metric
If you need to define a target-tracking scaling policy that meets your custom requirements, define a custom metric. You can define a custom metric based on any production variant metric that changes in proportion to scaling.
Not all SageMaker AI metrics work for target tracking. The metric must be a valid utilization metric, and it must describe how busy an instance is. The value of the metric must increase or decrease in inverse proportion to the number of variant instances. That is, the value of the metric should decrease when the number of instances increases.
Important
Before deploying automatic scaling in production, you must test automatic scaling with your custom metric.
Example custom metric for a CPU backed multi-model endpoint
The following example is a target-tracking configuration for a scaling policy. In
this configuration, for a model named my-model
, a custom metric of
CPUUtilization
adjusts the instance count on the endpoint based on an
average CPU utilization of 50% across all instances.
{"TargetValue": 50, "CustomizedMetricSpecification": {"MetricName": "CPUUtilization", "Namespace": "/aws/sagemaker/Endpoints", "Dimensions": [ {"Name": "EndpointName", "Value": "my-endpoint" }, {"Name": "ModelName","Value": "my-model"} ], "Statistic": "Average", "Unit": "Percent" } }
Example custom metric for a GPU backed multi-model endpoint
The following example is a target-tracking configuration for a scaling policy. In
this configuration, for a model named my-model
, a custom metric of
GPUUtilization
adjusts the instance count on the endpoint based on an
average GPU utilization of 50% across all instances.
{"TargetValue": 50, "CustomizedMetricSpecification": {"MetricName": "GPUUtilization", "Namespace": "/aws/sagemaker/Endpoints", "Dimensions": [ {"Name": "EndpointName", "Value": "my-endpoint" }, {"Name": "ModelName","Value": "my-model"} ], "Statistic": "Average", "Unit": "Percent" } }
Add a cooldown period
To add a cooldown period for scaling out your endpoint, specify a value, in seconds, for
ScaleOutCooldown
. Similarly, to add a cooldown period for scaling in your
model, add a value, in seconds, for ScaleInCooldown
. For more information about
ScaleInCooldown
and ScaleOutCooldown
, see TargetTrackingScalingPolicyConfiguration
in the Application Auto Scaling API Reference.
The following is an example target-tracking configuration for a scaling policy. In this
configuration, the SageMakerVariantInvocationsPerInstance
predefined metric is
used to adjust scaling based on an average of 70
across all instances of that
variant. The configuration provides a scale-in cooldown period of 10 minutes and a scale-out
cooldown period of 5 minutes.
{"TargetValue": 70.0, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance" }, "ScaleInCooldown": 600, "ScaleOutCooldown": 300 }