Real-time inference

Real-time inference is ideal for workloads with interactive, low-latency requirements. You deploy your model to SageMaker AI hosting services, which give you an endpoint that you can call for inference. These endpoints are fully managed and support autoscaling (see Automatic scaling of Amazon SageMaker AI models).
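Once an endpoint is in service, a client sends one request and waits for the prediction in the same call. The following is a minimal sketch of that request using the AWS SDK for Python (boto3) `sagemaker-runtime` client and its `invoke_endpoint` operation. It assumes the model container accepts and returns JSON; the helper names, the endpoint name, and the payload shape are hypothetical and depend on your model.

```python
import json
from typing import Any, Optional


def build_invoke_request(endpoint_name: str, payload: Any) -> dict:
    """Assemble keyword arguments for the SageMaker Runtime InvokeEndpoint call.

    Hypothetical helper. Assumes the model container accepts and returns JSON;
    other containers may expect CSV, images, or a custom format.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",  # format of the request body
        "Accept": "application/json",       # format requested for the response
        "Body": json.dumps(payload).encode("utf-8"),
    }


def invoke_realtime_endpoint(endpoint_name: str, payload: Any, client: Optional[Any] = None) -> Any:
    """Send one synchronous request to a real-time endpoint and decode the JSON reply."""
    if client is None:
        # Imported lazily so the request builder can be used without boto3 installed.
        import boto3
        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(**build_invoke_request(endpoint_name, payload))
    # The response body is a streaming object; read() returns the raw bytes.
    return json.loads(response["Body"].read())
```

With AWS credentials configured, a call such as `invoke_realtime_endpoint("my-endpoint", {"instances": [[1.0, 2.0]]})` would return the decoded prediction. The optional `client` parameter lets you pass a preconfigured or stubbed client, for example in tests.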

Topics

Deploy models for real-time inference
Invoke models for real-time inference
Endpoints
Hosting options
Automatic scaling of Amazon SageMaker AI models
Instance storage volumes
Validation of models in production
Online explainability with SageMaker Clarify
Fine-tune models with adapter inference components
