MLSUS-14: Deploy multiple models behind a single endpoint

Host multiple models behind a single endpoint to improve endpoint utilization. Sharing endpoint resources is more sustainable and less expensive than deploying each model behind its own endpoint.

Implementation plan

Amazon SageMaker AI provides three methods to deploy multiple models to a single endpoint:

  • Host multiple models in one container behind one endpoint (MLCOST-24) – SageMaker AI multi-model endpoints (MMEs) are served using a single container. This feature is ideal when you have a large number of similar models that you can serve through a shared serving container and don’t need to access all of them at the same time. This can help cut inference costs and reduce carbon emissions by up to 90% (see the first sketch after this list).

  • Host multiple models that use different containers behind one endpoint (MLCOST-24) – SageMaker AI multi-container endpoints (MCEs) support deploying up to 15 containers that use different models or frameworks on a single endpoint, and invoking them independently or in sequence for low-latency inference and cost savings. The models can be completely heterogeneous, each with its own independent serving stack (see the second sketch after this list).

  • Use SageMaker AI inference pipelines – An inference pipeline is an Amazon SageMaker AI model composed of a linear sequence of containers deployed behind a single endpoint. You can use an inference pipeline to combine preprocessing, prediction, and post-processing data science tasks. The output from one container is passed as input to the next. When you define the containers for a pipeline model, you also specify the order in which they run (see the third sketch after this list).
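As a first illustration, here is a minimal boto3 sketch of a multi-model endpoint. The model name, role ARN, image URI, bucket, and payload are placeholders, not values from this page. The key ideas are setting "Mode": "MultiModel" on the container, pointing ModelDataUrl at an S3 prefix that holds the model archives, and selecting a model per request with the TargetModel parameter of invoke_endpoint.

    import boto3

    sm = boto3.client("sagemaker")
    smr = boto3.client("sagemaker-runtime")

    # One model definition covers every artifact stored under the S3 prefix.
    # "Mode": "MultiModel" tells SageMaker AI to load artifacts on demand.
    sm.create_model(
        ModelName="demo-mme",
        ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
        PrimaryContainer={
            "Image": "<shared-inference-image-uri>",              # placeholder
            "Mode": "MultiModel",
            "ModelDataUrl": "s3://amzn-s3-demo-bucket/models/",   # prefix of model .tar.gz files
        },
    )
    sm.create_endpoint_config(
        EndpointConfigName="demo-mme-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "demo-mme",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }],
    )
    sm.create_endpoint(EndpointName="demo-mme", EndpointConfigName="demo-mme-config")
    sm.get_waiter("endpoint_in_service").wait(EndpointName="demo-mme")

    # TargetModel picks which artifact under the prefix handles this request.
    smr.invoke_endpoint(
        EndpointName="demo-mme",
        TargetModel="model-a.tar.gz",
        ContentType="text/csv",
        Body=b"1.0,2.0,3.0",
    )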
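The second sketch shows a multi-container endpoint, again with placeholder names, images, and S3 paths. Each container gets its own ContainerHostname, InferenceExecutionConfig with Mode "Direct" makes the containers independently invocable, and the TargetContainerHostname parameter of invoke_endpoint routes each request.

    import boto3

    sm = boto3.client("sagemaker")
    smr = boto3.client("sagemaker-runtime")

    # Heterogeneous containers, each with its own serving stack, behind one endpoint.
    sm.create_model(
        ModelName="demo-mce",
        ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
        Containers=[
            {
                "ContainerHostname": "xgboost-model",
                "Image": "<xgboost-image-uri>",                          # placeholder
                "ModelDataUrl": "s3://amzn-s3-demo-bucket/xgb/model.tar.gz",
            },
            {
                "ContainerHostname": "pytorch-model",
                "Image": "<pytorch-image-uri>",                          # placeholder
                "ModelDataUrl": "s3://amzn-s3-demo-bucket/pt/model.tar.gz",
            },
        ],
        InferenceExecutionConfig={"Mode": "Direct"},  # invoke containers independently
    )

    # Create the endpoint config and endpoint as in the previous sketch, then
    # route each request to a specific container by hostname:
    smr.invoke_endpoint(
        EndpointName="demo-mce",
        TargetContainerHostname="pytorch-model",
        ContentType="application/json",
        Body=b'{"inputs": [1.0, 2.0, 3.0]}',
    )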
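The third sketch expresses an inference pipeline with the same boto3 Containers list, using Mode "Serial" so containers run in order; the SageMaker Python SDK's PipelineModel is a higher-level alternative. All names, images, and S3 paths are placeholders.

    import boto3

    sm = boto3.client("sagemaker")

    # Containers run in the order listed; each container's output becomes the
    # next container's input, all behind a single endpoint.
    sm.create_model(
        ModelName="demo-pipeline",
        ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
        Containers=[
            {"Image": "<preprocessing-image-uri>",   # placeholder feature transformer
             "ModelDataUrl": "s3://amzn-s3-demo-bucket/preprocess/model.tar.gz"},
            {"Image": "<inference-image-uri>",       # placeholder predictor
             "ModelDataUrl": "s3://amzn-s3-demo-bucket/predict/model.tar.gz"},
            {"Image": "<postprocessing-image-uri>",  # placeholder label decoder
             "ModelDataUrl": "s3://amzn-s3-demo-bucket/postprocess/model.tar.gz"},
        ],
        InferenceExecutionConfig={"Mode": "Serial"},  # serial execution = pipeline
    )
    # Deploy with create_endpoint_config / create_endpoint as above; a single
    # invoke_endpoint call then flows through all three containers in sequence.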
