MLCOST-29: Monitor endpoint usage and right-size the instance fleet

Use compute resources efficiently to run models in production. Monitor your endpoint usage and right-size the instance fleet. Use automatic scaling (autoscaling) for your hosted models: autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload.

Implementation plan

  • Monitor Amazon SageMaker AI endpoints with Amazon CloudWatch - You can monitor Amazon SageMaker AI using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. Use metrics such as CPUUtilization, MemoryUtilization, GPUUtilization, and GPUMemoryUtilization to view your endpoint's resource utilization, and use that information to right-size the endpoint instances (a sketch of querying these metrics follows this list).

  • Use autoscaling with Amazon SageMaker AI - Amazon SageMaker AI supports autoscaling, which monitors your workloads and dynamically adjusts capacity to maintain steady, predictable performance at the lowest possible cost. When the workload increases, autoscaling brings more instances online; when the workload decreases, it removes unnecessary instances, reducing your compute cost. SageMaker AI automatically attempts to distribute your instances across Availability Zones, so we strongly recommend that you deploy multiple instances for each production endpoint for high availability. If you're using a VPC, configure at least two subnets in different Availability Zones so that Amazon SageMaker AI can distribute your instances across them (a sketch of configuring target-tracking autoscaling follows this list).

  • Determine resource placement carefully - Amazon FSx for Lustre can serve as an input data source for Amazon SageMaker AI. Used this way, it accelerates SageMaker AI training jobs by eliminating the initial Amazon S3 download step. As a best practice, however, deploy FSx for Lustre and SageMaker AI in the same Availability Zone: placing them in different Availability Zones or VPCs can incur significant data transfer cost (a sketch of an FSx for Lustre training input follows this list).
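
For the monitoring item above, here is a minimal sketch of querying an endpoint's CPUUtilization history through the CloudWatch API with boto3. The endpoint name, variant name, and time window are hypothetical placeholders; substitute your own.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Per-instance utilization metrics are published to the
# /aws/sagemaker/Endpoints namespace, keyed by endpoint and variant name.
response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},  # hypothetical name
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
)

# Sustained low average and maximum utilization suggests the instance
# type is larger than the workload needs.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```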
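For the autoscaling item, here is a minimal sketch of registering an endpoint variant with the Application Auto Scaling API and attaching a target-tracking policy on the predefined SageMakerVariantInvocationsPerInstance metric. The endpoint name, capacity limits, target value, and cooldowns are hypothetical; tune them to your own traffic.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,  # at least two instances, per the Multi-AZ guidance above
    MaxCapacity=8,
)

# Track invocations per instance; scale out or in to hold the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # example: invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```

Keeping MinCapacity at 2 means the endpoint stays spread across Availability Zones even at the low end of the scaling range, matching the high-availability recommendation above.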
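For the resource placement item, here is a minimal sketch of a training job that reads from FSx for Lustre through a FileSystemDataSource channel, with the job's VPC subnet kept in the same Availability Zone as the file system. All identifiers (job name, role, image, file system, subnet, security group, bucket) are hypothetical placeholders.

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName="fsx-lustre-training",  # hypothetical
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "TrainingInputMode": "File",
    },
    InputDataConfig=[
        {
            "ChannelName": "training",
            # Read directly from FSx for Lustre, skipping the S3 download step.
            "DataSource": {
                "FileSystemDataSource": {
                    "FileSystemId": "fs-0123456789abcdef0",  # hypothetical
                    "FileSystemType": "FSxLustre",
                    "DirectoryPath": "/fsx/train",
                    "FileSystemAccessMode": "ro",
                }
            },
        }
    ],
    # Keep the subnet in the same Availability Zone as the file system
    # to avoid cross-AZ data transfer cost.
    VpcConfig={
        "Subnets": ["subnet-0123456789abcdef0"],  # same AZ as the FSx file system
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.p3.2xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```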
