Inference options in Amazon SageMaker AI
SageMaker AI provides multiple inference options so that you can pick the option that best suits your workload:
- Real-Time Inference: Real-time inference is ideal for online inference workloads with low-latency or high-throughput requirements. Use real-time inference for a persistent, fully managed endpoint (REST API) that can handle sustained traffic, backed by the instance type of your choice. Real-time inference supports payload sizes of up to 6 MB and processing times of up to 60 seconds. For a minimal deployment sketch, see the first example below.
- Serverless Inference: Serverless inference is ideal when you have intermittent or unpredictable traffic patterns. SageMaker AI manages all of the underlying infrastructure, so there's no need to manage instances or scaling policies. You pay only for what you use and not for idle time. Serverless inference supports payload sizes of up to 4 MB and processing times of up to 60 seconds. For a minimal sketch, see the serverless example below.
- Batch Transform: Batch transform is suitable for offline processing when large amounts of data are available upfront and you don't need a persistent endpoint. You can also use batch transform to preprocess datasets. It supports datasets that are gigabytes in size and processing times of days. For a minimal sketch, see the batch transform example below.
- Asynchronous Inference: Asynchronous inference is ideal when you want to queue requests and have large payloads with long processing times. Asynchronous inference supports payloads of up to 1 GB and processing times of up to one hour. You can also scale your endpoint down to zero instances when there are no requests to process. For a minimal sketch, see the asynchronous example below.
The following flowchart summarizes the preceding information and can help you choose the option that best fits your use case.
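As a rough illustration of the real-time option, the following sketch deploys a trained model to a persistent endpoint with the SageMaker Python SDK and invokes it synchronously. It is a minimal example rather than a complete recipe: the container image URI, model artifact location, IAM role, and endpoint name are placeholders you would replace with your own values.

```python
import json

import boto3
from sagemaker.model import Model

# Placeholder values -- substitute your own container image, model
# artifact, and IAM execution role.
model = Model(
    image_uri="<ecr-image-uri>",
    model_data="s3://<bucket>/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

# Deploy to a persistent, fully managed real-time endpoint backed by
# the instance type of your choice.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-realtime-endpoint",
)

# Invoke the endpoint synchronously; payloads can be up to 6 MB.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-realtime-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": [1.0, 2.0, 3.0]}),
)
print(response["Body"].read())
```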
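For serverless inference, the same kind of model object is deployed with a ServerlessInferenceConfig instead of an instance type and count. The memory size and concurrency settings below are illustrative assumptions; tune them to your model.

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

# Placeholder values, as in the real-time sketch above.
model = Model(
    image_uri="<ecr-image-uri>",
    model_data="s3://<bucket>/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

# Memory size and max concurrency are illustrative assumptions.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5,
)

# No instance count or type: SageMaker AI manages the underlying
# infrastructure, and you pay only for what you use.
predictor = model.deploy(serverless_inference_config=serverless_config)
```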
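A batch transform sketch looks like the following: the model scores a dataset that is already in S3 and writes results back to S3, with no persistent endpoint left running afterward. The bucket paths and CSV input format are assumptions for illustration.

```python
from sagemaker.model import Model

# Placeholder values, as in the sketches above.
model = Model(
    image_uri="<ecr-image-uri>",
    model_data="s3://<bucket>/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

# Create a transformer for offline scoring.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/batch-output/",
)

# Score the entire input prefix. split_type="Line" sends each line of
# the CSV files as a separate record.
transformer.transform(
    data="s3://<bucket>/batch-input/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```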
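Finally, a minimal asynchronous inference sketch: requests are queued, the input payload lives in S3 (up to 1 GB), and results are written to an S3 output location rather than returned inline. The bucket paths and endpoint name are placeholders.

```python
import boto3
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.model import Model

# Placeholder values, as in the sketches above.
model = Model(
    image_uri="<ecr-image-uri>",
    model_data="s3://<bucket>/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

# Results (and any errors) are written to S3 rather than returned inline.
async_config = AsyncInferenceConfig(
    output_path="s3://<bucket>/async-output/",
)

# Scaling the endpoint down to zero instances when the queue is empty is
# configured separately through Application Auto Scaling.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=async_config,
    endpoint_name="my-async-endpoint",
)

# Queue a request whose payload is already in S3.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation="s3://<bucket>/async-input/payload.json",
)
print(response["OutputLocation"])
```

When the request finishes processing, the result appears at the returned OutputLocation in S3.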