Scale an endpoint to zero instances
When you set up auto scaling for an endpoint, you can allow the scale-in process to reduce the number of in-service instances to zero. By doing so, you save costs during periods when your endpoint isn't serving inference requests and therefore doesn't require any active instances.
However, after scaling in to zero instances, your endpoint can't respond to any incoming inference requests until it provisions at least one instance. To automate the provisioning process, you create a step scaling policy with Application Auto Scaling. Then, you assign the policy to an Amazon CloudWatch alarm.
After you set up the step scaling policy and the alarm, your endpoint will automatically provision an instance soon after it receives an inference request that it can't respond to. Be aware that the provisioning process takes several minutes. During that time, any attempts to invoke the endpoint will produce an error.
The following procedures explain how to set up auto scaling for an endpoint so that it scales in to, and out from, zero instances. The procedures use commands with the AWS CLI.
Before you begin
Before your endpoint can scale in to, and out from, zero instances, it must meet the following requirements:
- It is in service.
- It hosts one or more inference components. An endpoint can scale to and from zero instances only if it hosts inference components.

  For information about hosting inference components on SageMaker AI endpoints, see Deploy models for real-time inference.
- In the endpoint configuration, for the production variant `ManagedInstanceScaling` object, you've set the `MinInstanceCount` parameter to `0`.

  For reference information about this parameter, see ProductionVariantManagedInstanceScaling.
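For example, a production variant that permits managed instances to scale in to zero might include the following in its endpoint configuration. This is an illustrative sketch: the variant name, instance type, and counts are placeholders you would replace with your own values.

```json
{
    "VariantName": "AllTraffic",
    "InstanceType": "ml.g5.xlarge",
    "InitialInstanceCount": 1,
    "ManagedInstanceScaling": {
        "Status": "ENABLED",
        "MinInstanceCount": 0,
        "MaxInstanceCount": 2
    }
}
```

The key detail is `"MinInstanceCount": 0`; without it, the scale-in process stops at one instance and the endpoint continues to incur instance costs while idle.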
To enable an endpoint to scale in to zero instances (AWS CLI)
For each inference component that the endpoint hosts, do the following:
- Register the inference component as a scalable target. When you register it, set the minimum capacity to `0`, as shown by the following command:

  ```
  aws application-autoscaling register-scalable-target \
      --service-namespace sagemaker \
      --resource-id inference-component/inference-component-name \
      --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
      --min-capacity 0 \
      --max-capacity n
  ```

  In this example, replace `inference-component-name` with the name of your inference component. Replace `n` with the maximum number of inference component copies to provision when scaling out.

  For more information about this command and each of its parameters, see register-scalable-target in the AWS CLI Command Reference.
- Apply a target tracking policy to the inference component, as shown by the following command:

  ```
  aws application-autoscaling put-scaling-policy \
      --policy-name my-scaling-policy \
      --policy-type TargetTrackingScaling \
      --resource-id inference-component/inference-component-name \
      --service-namespace sagemaker \
      --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
      --target-tracking-scaling-policy-configuration file://config.json
  ```

  In this example, replace `inference-component-name` with the name of your inference component.

  In the example, the `config.json` file contains a target tracking policy configuration, such as the following:

  ```
  {
      "PredefinedMetricSpecification": {
          "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
      },
      "TargetValue": 1,
      "ScaleInCooldown": 300,
      "ScaleOutCooldown": 300
  }
  ```

  For more example tracking policy configurations, see Define a scaling policy.

  For more information about this command and each of its parameters, see put-scaling-policy in the AWS CLI Command Reference.
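As a rough mental model of the policy above (an approximation, not the exact Application Auto Scaling algorithm), target tracking resizes the copy count so that the per-copy invocation rate returns to the target value, clamped to the registered minimum and maximum capacity. The function name and parameters here are illustrative only:

```python
import math

def approximate_desired_copies(current_copies, invocations_per_copy,
                               target_value, min_copies=0, max_copies=10):
    """Approximate how target tracking resizes an inference component.

    Target tracking keeps the actual metric near the target, so the new
    copy count is roughly currentCopies * actualMetric / targetMetric,
    clamped to the min/max capacity registered for the scalable target.
    """
    if current_copies == 0:
        # With zero copies, no per-copy metric is emitted, so target
        # tracking alone cannot scale out. That gap is why the step
        # scaling policy and CloudWatch alarm in the next procedure
        # are required.
        return 0
    desired = math.ceil(current_copies * invocations_per_copy / target_value)
    return max(min_copies, min(desired, max_copies))

# Load is double the target of 1 invocation per copy:
print(approximate_desired_copies(2, 2.0, 1))  # → 4 (scale out)
# No traffic for the cooldown period:
print(approximate_desired_copies(2, 0.0, 1))  # → 0 (scale in to zero)
```

Note the zero-copies case: once the component has scaled in to zero, the metric that target tracking watches is no longer produced, so a separate mechanism must handle scale-out from zero.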
To enable an endpoint to scale out from zero instances (AWS CLI)
For each inference component that the endpoint hosts, do the following:
- Apply a step scaling policy to the inference component, as shown by the following command:

  ```
  aws application-autoscaling put-scaling-policy \
      --policy-name my-scaling-policy \
      --policy-type StepScaling \
      --resource-id inference-component/inference-component-name \
      --service-namespace sagemaker \
      --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
      --step-scaling-policy-configuration file://config.json
  ```

  In this example, replace `my-scaling-policy` with a unique name for your policy. Replace `inference-component-name` with the name of your inference component.

  In the example, the `config.json` file contains a step scaling policy configuration, such as the following:

  ```
  {
      "AdjustmentType": "ChangeInCapacity",
      "MetricAggregationType": "Maximum",
      "Cooldown": 60,
      "StepAdjustments": [
          {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
          }
      ]
  }
  ```

  When this step scaling policy is triggered, SageMaker AI provisions the necessary instances to support the inference component copies.

  After you create the step scaling policy, take note of its Amazon Resource Name (ARN). You need the ARN for the CloudWatch alarm in the next step.

  For more information about step scaling policies, see Step scaling policies in the Application Auto Scaling User Guide.
- Create a CloudWatch alarm and assign the step scaling policy to it, as shown by the following example:

  ```
  aws cloudwatch put-metric-alarm \
      --alarm-actions step-scaling-policy-arn \
      --alarm-description "Alarm when SM IC endpoint invoked that has 0 instances." \
      --alarm-name ic-step-scaling-alarm \
      --comparison-operator GreaterThanThreshold \
      --datapoints-to-alarm 1 \
      --dimensions "Name=InferenceComponentName,Value=inference-component-name" \
      --evaluation-periods 1 \
      --metric-name NoCapacityInvocationFailures \
      --namespace AWS/SageMaker \
      --period 60 \
      --statistic Sum \
      --threshold 1
  ```

  In this example, replace `step-scaling-policy-arn` with the ARN of your step scaling policy. Replace `ic-step-scaling-alarm` with a name of your choice. Replace `inference-component-name` with the name of your inference component.

  This example sets the `--metric-name` parameter to `NoCapacityInvocationFailures`. SageMaker AI emits this metric when an endpoint receives an inference request, but the endpoint has no active instances to serve the request. When that event occurs, the alarm initiates the step scaling policy from the previous step.

  For more information about this command and each of its parameters, see put-metric-alarm in the AWS CLI Command Reference.