Next steps for inference with Amazon SageMaker AI

After you create an endpoint and understand the general inference workflow, you can use the following SageMaker AI features to improve it.

Monitoring

To track your model over time through metrics such as model accuracy and drift, you can use Model Monitor. With Model Monitor, you can set alerts that notify you when there are deviations in your model’s quality. To learn more, see the Model Monitor documentation.
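
As an illustration, the following minimal sketch uses the SageMaker Python SDK’s DefaultModelMonitor to baseline a training dataset and schedule hourly data-quality checks against an existing endpoint. The role ARN, bucket path, and endpoint name are hypothetical placeholders, and the endpoint is assumed to have been deployed with data capture enabled.

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Hypothetical values -- replace with your own role, bucket, and endpoint.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
bucket = "s3://my-bucket/monitoring"

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Suggest baseline statistics and constraints from the training dataset.
monitor.suggest_baseline(
    baseline_dataset=f"{bucket}/training-dataset.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f"{bucket}/baseline",
)

# Check captured endpoint traffic against the baseline every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-endpoint-data-quality",
    endpoint_input="my-endpoint",  # endpoint must have data capture enabled
    output_s3_uri=f"{bucket}/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```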

To learn more about the tools you can use to monitor model deployments and events that change your endpoint, see Monitor Amazon SageMaker AI. For example, you can monitor your endpoint’s health through metrics such as invocation errors and model latency by using Amazon CloudWatch metrics. The SageMaker AI endpoint invocation metrics can provide you with valuable information about your endpoint’s performance.
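
For example, the following minimal boto3 sketch pulls invocation errors and model latency for a hypothetical endpoint from the AWS/SageMaker CloudWatch namespace.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical endpoint and variant names.
dimensions = [
    {"Name": "EndpointName", "Value": "my-endpoint"},
    {"Name": "VariantName", "Value": "AllTraffic"},
]

now = datetime.now(timezone.utc)
for metric, stat in [("Invocation5XXErrors", "Sum"), ("ModelLatency", "Average")]:
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,  # 5-minute buckets
        Statistics=[stat],
    )
    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        # Note: ModelLatency is reported in microseconds.
        print(metric, point["Timestamp"], point[stat])
```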

CI/CD for model deployment

To build end-to-end machine learning solutions in SageMaker AI, you can use SageMaker AI MLOps. With this feature, you can automate the steps in your machine learning workflow and practice CI/CD. MLOps Project Templates help you set up and implement SageMaker AI MLOps projects, and SageMaker AI also supports using your own third-party Git repository to create a CI/CD system.
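
As a sketch, the boto3 call below creates an MLOps project from a template; the project name and the Service Catalog product and artifact IDs are hypothetical placeholders that you would look up from the template you choose.

```python
import boto3

sm = boto3.client("sagemaker")

# Provision a project from an MLOps project template.
sm.create_project(
    ProjectName="my-mlops-project",
    ProjectDescription="CI/CD for model build and deploy",
    ServiceCatalogProvisioningDetails={
        "ProductId": "prod-EXAMPLE",            # template product ID
        "ProvisioningArtifactId": "pa-EXAMPLE", # template version ID
    },
)
```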

For your ML pipelines, use Model Registry to catalog your model versions and to manage and automate the deployment of your models.
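
For instance, the following minimal boto3 sketch registers a model version in a model package group; the group name, container image URI, and model artifact path are hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

# Create the group once; subsequent registrations add versions to it.
sm.create_model_package_group(
    ModelPackageGroupName="my-model-group",
    ModelPackageGroupDescription="Versions of my fraud-detection model",
)

# Register a new, versioned model package in the group.
sm.create_model_package(
    ModelPackageGroupName="my-model-group",
    ModelPackageDescription="Candidate from the latest training run",
    ModelApprovalStatus="PendingManualApproval",  # gate deployment on approval
    InferenceSpecification={
        "Containers": [
            {
                "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
                "ModelDataUrl": "s3://my-bucket/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)
```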

Deployment guardrails

If you want to update a model that is already in production without disrupting live traffic, you can use deployment guardrails. Deployment guardrails are a set of fully managed model deployment options in SageMaker AI Inference for updating your machine learning models in production. With these options, you control the switch from the current model in production to a new one: traffic shifting modes give you granular control over the traffic shifting process, and built-in safeguards such as auto-rollbacks help you catch issues early.

To learn more about deployment guardrails, see the deployment guardrails documentation.
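
As a sketch, the boto3 UpdateEndpoint call below applies a new endpoint configuration with a canary traffic-shifting policy and a CloudWatch-alarm-based auto-rollback; the endpoint, configuration, and alarm names are hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config-v2",  # config with the new model
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                # Shift 10% of traffic first, then the rest after the wait.
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
            "MaximumExecutionTimeoutInSeconds": 3600,
        },
        "AutoRollbackConfiguration": {
            # Roll back automatically if this alarm fires during the update.
            "Alarms": [{"AlarmName": "my-endpoint-5xx-alarm"}],
        },
    },
)
```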

Inferentia

If you need to run large-scale machine learning and deep learning applications, you can use an Inf1 instance with a real-time endpoint. This instance type is suitable for use cases such as image or speech recognition, natural language processing (NLP), personalization, forecasting, or fraud detection.

Inf1 instances are built for machine learning inference applications and feature AWS Inferentia chips. They provide higher throughput and lower cost per inference than GPU-based instances.

To deploy a model on Inf1 instances, compile your model with SageMaker Neo and choose an Inf1 instance for your deployment option. To learn more, see Optimize model performance using SageMaker Neo.
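
For example, the following minimal boto3 sketch hosts a compiled model on an Inf1 instance; the model, configuration, and endpoint names are hypothetical, and the model is assumed to already be compiled for the Inferentia target.

```python
import boto3

sm = boto3.client("sagemaker")

# The model is assumed to point at Neo-compiled artifacts for ml_inf1.
sm.create_endpoint_config(
    EndpointConfigName="my-inf1-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-compiled-model",
            "InstanceType": "ml.inf1.xlarge",  # AWS Inferentia instance
            "InitialInstanceCount": 1,
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-inf1-endpoint",
    EndpointConfigName="my-inf1-config",
)
```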

Optimize model performance

SageMaker AI provides features to manage resources and optimize inference performance when you deploy machine learning models. You can use SageMaker AI’s built-in algorithms and prebuilt models, as well as prebuilt Docker images developed for machine learning.

To train models and optimize them for deployment, see Prebuilt SageMaker AI Docker images and Optimize model performance using SageMaker Neo. With SageMaker Neo, you can take trained TensorFlow, Apache MXNet, PyTorch, ONNX, and XGBoost models, optimize them, and deploy them on ARM, Intel, and Nvidia processors.
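
As an illustration, the following minimal boto3 sketch starts a Neo compilation job for a trained PyTorch model; the job name, role ARN, S3 paths, and input shape are hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="my-neo-job",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/model.tar.gz",  # trained model artifacts
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',  # expected input shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "ml_c5",  # e.g. an Intel target; 'ml_inf1' targets Inferentia
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```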

Autoscaling

If you have varying amounts of traffic to your endpoints, you might want to try autoscaling. For example, during peak hours, you might require more instances to process requests. However, during periods of low traffic, you might want to reduce your use of computing resources. To dynamically adjust the number of instances provisioned in response to changes in your workload, see Automatic scaling of Amazon SageMaker AI models.
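
As a sketch, the Application Auto Scaling calls below register a hypothetical endpoint variant as a scalable target and attach a target-tracking policy on invocations per instance.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Allow the variant to scale between 1 and 4 instances.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale to hold roughly 70 invocations per instance per minute.
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```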

If you have unpredictable traffic patterns or don’t want to set up scaling policies, you can also use Serverless Inference for an endpoint. SageMaker AI then manages autoscaling for you: during periods of low traffic, it scales your endpoint down, and when traffic increases, it scales your endpoint up. For more information, see the Deploy models with Amazon SageMaker Serverless Inference documentation.
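
For example, the following minimal boto3 sketch creates a serverless endpoint; the model, configuration, and endpoint names are hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

# A serverless variant specifies memory and concurrency instead of instances.
sm.create_endpoint_config(
    EndpointConfigName="my-serverless-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,  # 1024 to 6144, in 1 GB increments
                "MaxConcurrency": 5,     # concurrent invocations per endpoint
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-serverless-endpoint",
    EndpointConfigName="my-serverless-config",
)
```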