

# Multi-model endpoints
<a name="multi-model-endpoints"></a>

Multi-model endpoints provide a scalable and cost-effective solution to deploying large numbers of models. They use the same fleet of resources and a shared serving container to host all of your models. This reduces hosting costs by improving endpoint utilization compared with using single-model endpoints. It also reduces deployment overhead because Amazon SageMaker AI manages loading models in memory and scaling them based on the traffic patterns to your endpoint.

The following diagram shows how multi-model endpoints work compared to single-model endpoints.

![Diagram that shows how multi-model endpoints host models compared to single-model endpoints.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/multi-model-endpoints-diagram.png)


Multi-model endpoints are ideal for hosting a large number of models that use the same ML framework on a shared serving container. If you have a mix of frequently and infrequently accessed models, a multi-model endpoint can efficiently serve this traffic with fewer resources and higher cost savings. Your application should be tolerant of occasional cold start-related latency penalties that occur when invoking infrequently used models.

Multi-model endpoints support hosting both CPU and GPU backed models. By hosting GPU backed models on a multi-model endpoint, you can lower your model deployment costs through increased utilization of the endpoint and its underlying accelerated compute instances.

Multi-model endpoints also enable time-sharing of memory resources across your models. This works best when the models are fairly similar in size and invocation latency. When this is the case, multi-model endpoints can effectively use instances across all models. If you have models that have significantly higher transactions per second (TPS) or latency requirements, we recommend hosting them on dedicated endpoints.

You can use multi-model endpoints with the following features:
+ [AWS PrivateLink](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html) and VPCs
+ [Auto scaling](multi-model-endpoints-autoscaling.md)
+ [Serial inference pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) (but only one multi-model enabled container can be included in an inference pipeline)
+ A/B testing

You can use the AWS SDK for Python (Boto) or the SageMaker AI console to create a multi-model endpoint. For CPU backed multi-model endpoints, you can create your endpoint with custom-built containers by integrating the [Multi Model Server](https://github.com/awslabs/multi-model-server) library.

**Topics**
+ [How multi-model endpoints work](#how-multi-model-endpoints-work)
+ [Sample notebooks for multi-model endpoints](#multi-model-endpoint-sample-notebooks)
+ [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md)
+ [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md)
+ [Create a Multi-Model Endpoint](create-multi-model-endpoint.md)
+ [Invoke a Multi-Model Endpoint](invoke-multi-model-endpoint.md)
+ [Add or Remove Models](add-models-to-endpoint.md)
+ [Build Your Own Container for SageMaker AI Multi-Model Endpoints](build-multi-model-build-container.md)
+ [Multi-Model Endpoint Security](multi-model-endpoint-security.md)
+ [CloudWatch Metrics for Multi-Model Endpoint Deployments](multi-model-endpoint-cloudwatch-metrics.md)
+ [Set SageMaker AI multi-model endpoint model caching behavior](multi-model-caching.md)
+ [Set Auto Scaling Policies for Multi-Model Endpoint Deployments](multi-model-endpoints-autoscaling.md)

## How multi-model endpoints work
<a name="how-multi-model-endpoints-work"></a>

 SageMaker AI manages the lifecycle of models hosted on multi-model endpoints in the container's memory. Instead of downloading all of the models from an Amazon S3 bucket to the container when you create the endpoint, SageMaker AI dynamically loads and caches them when you invoke them. When SageMaker AI receives an invocation request for a particular model, it does the following: 

1. Routes the request to an instance behind the endpoint.

1. Downloads the model from the S3 bucket to that instance's storage volume.

1. Loads the model into the container's memory (CPU or GPU memory, depending on whether you have CPU or GPU backed instances) on that instance. If the model is already loaded in the container's memory, invocation is faster because SageMaker AI doesn't need to download and load it.

SageMaker AI continues to route requests for a model to the instance where the model is already loaded. However, if the model receives many invocation requests, and there are additional instances for the multi-model endpoint, SageMaker AI routes some requests to another instance to accommodate the traffic. If the model isn't already loaded on the second instance, the model is downloaded to that instance's storage volume and loaded into the container's memory.

When an instance's memory utilization is high and SageMaker AI needs to load another model into memory, it unloads unused models from that instance's container to ensure that there is enough memory to load the model. Models that are unloaded remain on the instance's storage volume and can be loaded into the container's memory later without being downloaded again from the S3 bucket. If the instance's storage volume reaches its capacity, SageMaker AI deletes any unused models from the storage volume.
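
Conceptually, this memory management resembles a least-recently-used (LRU) cache backed by a disk tier. The following simplified Python sketch is an illustration only, not SageMaker AI's actual implementation; model names and sizes are arbitrary:

```
from collections import OrderedDict

class ModelCache:
    """Toy illustration of multi-model endpoint behavior: a bounded in-memory
    cache backed by an unbounded "disk" tier (the instance storage volume)."""

    def __init__(self, memory_slots):
        self.memory = OrderedDict()   # model name -> loaded model, in LRU order
        self.disk = set()             # models already downloaded to the volume
        self.memory_slots = memory_slots

    def invoke(self, name):
        if name in self.memory:       # warm: already loaded, fastest path
            self.memory.move_to_end(name)
            return "warm"
        if len(self.memory) >= self.memory_slots:
            evicted, _ = self.memory.popitem(last=False)  # unload the LRU model...
            self.disk.add(evicted)    # ...but keep it on the storage volume
        downloaded = name not in self.disk  # cold start only if not on disk yet
        self.disk.add(name)
        self.memory[name] = object()  # "load" the model into container memory
        return "cold" if downloaded else "loaded-from-disk"

cache = ModelCache(memory_slots=2)
print(cache.invoke("a"))  # cold
print(cache.invoke("b"))  # cold
print(cache.invoke("c"))  # cold; "a" is unloaded but stays on disk
print(cache.invoke("a"))  # loaded-from-disk: no S3 download needed
```

The key point the sketch shows: an unloaded model is reloaded from the instance's storage volume, which is much faster than downloading it again from Amazon S3.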

To delete a model, stop sending requests and delete it from the S3 bucket. SageMaker AI provides multi-model endpoint capability in a serving container. Adding models to, and deleting them from, a multi-model endpoint doesn't require updating the endpoint itself. To add a model, you upload it to the S3 bucket and invoke it. You don’t need code changes to use it.
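
For example, adding a model can be as simple as uploading a new artifact under the endpoint's S3 prefix. The following is a hedged sketch: the bucket, prefix, and model names are illustrative placeholders, and nothing touches AWS until you call `add_model` with real credentials:

```
def artifact_key(prefix, model_name):
    """S3 key for a packed model artifact under the endpoint's ModelDataUrl prefix."""
    return f"{prefix.strip('/')}/{model_name}.tar.gz"

def add_model(local_path, bucket, prefix, model_name):
    """Upload a model artifact; it becomes invocable without any endpoint update.
    Returns the relative path to pass as TargetModel when invoking."""
    import boto3  # imported here so the pure helper above needs no AWS access
    boto3.client("s3").upload_file(local_path, bucket, artifact_key(prefix, model_name))
    return f"{model_name}.tar.gz"

# Example (placeholders):
# add_model("model.tar.gz", "<BUCKET_NAME>", "<PATH_TO_ARTIFACTS>", "model-42")
```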

**Note**  
When you update a multi-model endpoint, initial invocation requests on the endpoint might experience higher latencies as Smart Routing in multi-model endpoints adapts to your traffic pattern. After it learns your traffic pattern, you can expect low latencies for the most frequently used models. Less frequently used models might incur some cold start latency because models are dynamically loaded to an instance.

## Sample notebooks for multi-model endpoints
<a name="multi-model-endpoint-sample-notebooks"></a>

To learn more about how to use multi-model endpoints, you can try the following sample notebooks:
+ Examples for multi-model endpoints using CPU backed instances:
  + [Multi-Model Endpoint XGBoost Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html) – This notebook shows how to deploy multiple XGBoost models to an endpoint.
  + [Multi-Model Endpoints BYOC Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_bring_your_own/multi_model_endpoint_bring_your_own.html) – This notebook shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI.
+ Example for multi-model endpoints using GPU backed instances:
  + [Run multiple deep learning models on GPUs with Amazon SageMaker AI Multi-model endpoints (MME)](https://github.com/aws/amazon-sagemaker-examples/blob/main/multi-model-endpoints/mme-on-gpu/cv/resnet50_mme_with_gpu.ipynb) – This notebook shows how to use an NVIDIA Triton Inference container to deploy ResNet-50 models to a multi-model endpoint.

For instructions on how to create and access Jupyter notebook instances that you can use to run the previous examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you've created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The multi-model endpoint notebooks are located in the **ADVANCED FUNCTIONALITY** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

For more information about use cases for multi-model endpoints, see the following blogs and resources:
+ Video: [Hosting thousands of models on SageMaker AI](https://www.youtube.com/watch?v=XqCNTWmHsLc&t=751s)
+ Video: [SageMaker AI ML for SaaS](https://www.youtube.com/watch?v=BytpYlJ3vsQ)
+ Blog: [How to scale machine learning inference for multi-tenant SaaS use cases](https://aws.amazon.com/blogs/machine-learning/how-to-scale-machine-learning-inference-for-multi-tenant-saas-use-cases/)
+ Case study: [Veeva Systems](https://aws.amazon.com/partners/success/advanced-clinical-veeva/)

# Supported algorithms, frameworks, and instances for multi-model endpoints
<a name="multi-model-support"></a>

For information about the algorithms, frameworks, and instance types that you can use with multi-model endpoints, see the following sections.

## Supported algorithms, frameworks, and instances for multi-model endpoints using CPU backed instances
<a name="multi-model-support-cpu"></a>

The inference containers for the following algorithms and frameworks support multi-model endpoints:
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)
+ [Linear Learner Algorithm](linear-learner.md)
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)
+ [Resources for using TensorFlow with Amazon SageMaker AI](tf.md)
+ [Resources for using Scikit-learn with Amazon SageMaker AI](sklearn.md)
+ [Resources for using Apache MXNet with Amazon SageMaker AI](mxnet.md)
+ [Resources for using PyTorch with Amazon SageMaker AI](pytorch.md)

To use any other framework or algorithm, use the SageMaker AI inference toolkit to build a container that supports multi-model endpoints. For information, see [Build Your Own Container for SageMaker AI Multi-Model Endpoints](build-multi-model-build-container.md).

Multi-model endpoints support all of the CPU instance types.

## Supported algorithms, frameworks, and instances for multi-model endpoints using GPU backed instances
<a name="multi-model-support-gpu"></a>

Hosting multiple GPU backed models on multi-model endpoints is supported through the [SageMaker AI Triton Inference Server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). The server supports all major inference frameworks, such as NVIDIA® TensorRT™, PyTorch, MXNet, Python, ONNX, XGBoost, scikit-learn, RandomForest, OpenVINO, custom C++, and more.

To use any other framework or algorithm, you can use the Triton backend for Python or C++ to write your model logic and serve any custom model. After the server is ready, you can deploy hundreds of deep learning models behind one endpoint.

Multi-model endpoints support the following GPU instance types:


| Instance family | Instance type | vCPUs | GiB of memory per vCPU | GPUs | GPU memory | 
| --- | --- | --- | --- | --- | --- | 
| p2 | ml.p2.xlarge | 4 | 15.25 | 1 | 12 | 
| p3 | ml.p3.2xlarge | 8 | 7.62 | 1 | 16 | 
| g5 | ml.g5.xlarge | 4 | 4 | 1 | 24 | 
| g5 | ml.g5.2xlarge | 8 | 4 | 1 | 24 | 
| g5 | ml.g5.4xlarge | 16 | 4 | 1 | 24 | 
| g5 | ml.g5.8xlarge | 32 | 4 | 1 | 24 | 
| g5 | ml.g5.16xlarge | 64 | 4 | 1 | 24 | 
| g4dn | ml.g4dn.xlarge | 4 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.2xlarge | 8 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.4xlarge | 16 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.8xlarge | 32 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.16xlarge | 64 | 4 | 1 | 16 | 

# Instance recommendations for multi-model endpoint deployments
<a name="multi-model-endpoint-instance"></a>

There are several items to consider when selecting a SageMaker AI ML instance type for a multi-model endpoint:
+ Provision sufficient [Amazon Elastic Block Store (Amazon EBS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html) capacity for all of the models that need to be served.
+ Balance performance (minimize cold starts) and cost (don’t over-provision instance capacity). For information about the size of the storage volume that SageMaker AI attaches for each instance type for an endpoint and for a multi-model endpoint, see [Instance storage volumes](host-instance-storage.md).
+ For a container configured to run in `MultiModel` mode, the storage volume provisioned for its instances is larger than in the default `SingleModel` mode. This allows more models to be cached on the instance storage volume than in `SingleModel` mode.

When choosing a SageMaker AI ML instance type, consider the following:
+ Multi-model endpoints are currently supported on all CPU instance types and on single-GPU instance types.
+ Consider the traffic distribution (access patterns) to the models that you want to host behind the multi-model endpoint, along with the model size (how many models can be loaded in memory on the instance), and keep the following information in mind:
  + Think of the amount of memory on an instance as the cache space for models to be loaded, and think of the number of vCPUs as the concurrency limit to perform inference on the loaded models (assuming that invoking a model is bound to CPU).
  + For CPU backed instances, the number of vCPUs impacts your maximum concurrent invocations per instance (assuming that invoking a model is bound to CPU). A higher amount of vCPUs enables you to invoke more unique models concurrently.
  + For GPU backed instances, a higher amount of instance and GPU memory enables you to have more models loaded and ready to serve inference requests.
  + For both CPU and GPU backed instances, keep some "slack" memory available so that unused models can be unloaded. This is especially important for multi-model endpoints with multiple instances: if an instance or an Availability Zone fails, requests for the models on those instances are rerouted to other instances behind the endpoint.
+ Determine your tolerance to loading/downloading times:
  + Instance types with a local NVMe (non-volatile memory express) SSD, such as the d instance families (for example, m5d, c5d, or r5d) and g5 instances, offer high I/O performance and might reduce the time it takes to download models to the storage volume and for the container to load the model from the storage volume.
  + Because d and g5 instance types come with NVMe SSD storage, SageMaker AI does not attach an Amazon EBS storage volume to the ML compute instances that host the multi-model endpoint.
+ Auto scaling works best when the models are similarly sized and homogeneous, that is, when they have similar inference latency and resource requirements.
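
As a rough way to reason about sizing, you can estimate how many models fit in an instance's memory at once. This back-of-the-envelope sketch uses assumed numbers (average model size, reserved container overhead), not published SageMaker AI figures:

```
def models_in_memory(instance_memory_gib, avg_model_size_gib, reserved_gib=2.0):
    """Rough estimate of concurrently loaded models, leaving headroom
    for the serving container and some "slack" memory for unloading."""
    usable = instance_memory_gib - reserved_gib
    return max(0, int(usable // avg_model_size_gib))

# For example, an assumed 64 GiB instance hosting ~0.5 GiB models:
print(models_in_memory(64, 0.5))  # 124
```

Combine an estimate like this with your traffic distribution to judge how often invocations will hit an already-loaded model versus trigger a load.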

You can also use the following guidance to help you optimize model loading on your multi-model endpoints:

**Choosing an instance type that can't hold all of the targeted models in memory**

In some cases, you might opt to reduce costs by choosing an instance type that can't hold all of the targeted models in memory at once. SageMaker AI dynamically unloads models when it runs out of memory to make room for a newly targeted model. For infrequently requested models, you sacrifice dynamic load latency. In cases with more stringent latency needs, you might opt for larger instance types or more instances. Investing time up front for performance testing and analysis helps you to have successful production deployments.

**Evaluating your model cache hits**

Amazon CloudWatch metrics can help you evaluate your models. For more information about metrics you can use with multi-model endpoints, see [CloudWatch Metrics for Multi-Model Endpoint Deployments](multi-model-endpoint-cloudwatch-metrics.md).

You can use the `Average` statistic of the `ModelCacheHit` metric to monitor the ratio of requests where the model is already loaded. You can use the `SampleCount` statistic of the `ModelUnloadingTime` metric to monitor the number of unload requests sent to the container during a time period. If models are unloaded too frequently (an indicator of *thrashing*, where models are unloaded and loaded again because there is insufficient cache space for the working set of models), consider using a larger instance type with more memory or increasing the number of instances behind the multi-model endpoint. For multi-model endpoints with multiple instances, be aware that a model might be loaded on more than one instance.
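
As a sketch, the following builds the request parameters for such a query with the AWS SDK for Python (Boto3). The endpoint and variant names are placeholders; the dimensions shown are the standard endpoint dimensions for SageMaker metrics:

```
import datetime

def cache_hit_query(endpoint_name, variant_name, hours=1):
    """Parameters for cloudwatch.get_metric_statistics to read the
    average ModelCacheHit ratio over the last `hours` hours."""
    end = datetime.datetime.utcnow()
    return dict(
        Namespace="AWS/SageMaker",
        MetricName="ModelCacheHit",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - datetime.timedelta(hours=hours),
        EndTime=end,
        Period=300,                 # 5-minute buckets
        Statistics=["Average"],
    )

params = cache_hit_query("<ENDPOINT_NAME>", "AllTraffic")
# response = boto3.client("cloudwatch").get_metric_statistics(**params)
```

An average near 1.0 means most invocations hit an already-loaded model; a low average suggests thrashing.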

# Create a Multi-Model Endpoint
<a name="create-multi-model-endpoint"></a>

You can use the SageMaker AI console or the AWS SDK for Python (Boto) to create a multi-model endpoint. To create either a CPU or GPU backed endpoint through the console, see the console procedure in the following sections. If you want to create a multi-model endpoint with the AWS SDK for Python (Boto), use either the CPU or GPU procedure in the following sections. The CPU and GPU workflows are similar but have several differences, such as the container requirements.

**Topics**
+ [Create a multi-model endpoint (console)](#create-multi-model-endpoint-console)
+ [Create a multi-model endpoint using CPUs with the AWS SDK for Python (Boto3)](#create-multi-model-endpoint-sdk-cpu)
+ [Create a multi-model endpoint using GPUs with the AWS SDK for Python (Boto3)](#create-multi-model-endpoint-sdk-gpu)

## Create a multi-model endpoint (console)
<a name="create-multi-model-endpoint-console"></a>

You can create both CPU and GPU backed multi-model endpoints through the console. Use the following procedure to create a multi-model endpoint through the SageMaker AI console.

**To create a multi-model endpoint (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Model**, and then from the **Inference** group, choose **Create model**. 

1. For **Model name**, enter a name.

1. For **IAM role**, choose or create an IAM role that has the `AmazonSageMakerFullAccess` IAM policy attached. 

1.  In the **Container definition** section, for **Provide model artifacts and inference image options**, choose **Use multiple models**.  
![The section of the Create model page where you can choose Use multiple models.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mme-create-model-ux-2.PNG)

1. For the **Inference container image**, enter the Amazon ECR path for your desired container image.

   For GPU models, you must use a container backed by the NVIDIA Triton Inference Server. For a list of container images that work with GPU backed endpoints, see [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). For more information about the NVIDIA Triton Inference Server, see [Use Triton Inference Server with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html).

1. Choose **Create model**.

1. Deploy your multi-model endpoint as you would a single model endpoint. For instructions, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

## Create a multi-model endpoint using CPUs with the AWS SDK for Python (Boto3)
<a name="create-multi-model-endpoint-sdk-cpu"></a>

Use the following section to create a multi-model endpoint backed by CPU instances. You create a multi-model endpoint using the Amazon SageMaker AI [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) APIs just as you would when creating a single model endpoint, but with two changes. When defining the model container, you need to pass a new `Mode` parameter value, `MultiModel`. You also need to pass a `ModelDataUrl` field that specifies the prefix in Amazon S3 where the model artifacts are located, instead of the path to a single model artifact, as you would when deploying a single model.

For a sample notebook that uses SageMaker AI to deploy multiple XGBoost models to an endpoint, see [Multi-Model Endpoint XGBoost Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html). 

The following procedure outlines the key steps used in that sample to create a CPU backed multi-model endpoint.

**To deploy the model (AWS SDK for Python (Boto 3))**

1. Get a container with an image that supports deploying multi-model endpoints. For a list of built-in algorithms and framework containers that support multi-model endpoints, see [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md). For this example, we use the [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md) built-in algorithm. We call the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html) utility function `image_uris.retrieve()` to get the address for the K-Nearest Neighbors built-in algorithm image.

   ```
   import sagemaker
   
   sagemaker_session = sagemaker.Session()
   region = sagemaker_session.boto_region_name
   image = sagemaker.image_uris.retrieve("knn", region=region)
   container = { 
                 'Image':        image,
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   ```

1. Get an AWS SDK for Python (Boto3) SageMaker AI client and create the model that uses this container.

   ```
   import boto3
   sagemaker_client = boto3.client('sagemaker')
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container])
   ```

1. (Optional) If you are using a serial inference pipeline, get the additional container(s) to include in the pipeline, and include it in the `Containers` argument of `CreateModel`:

   ```
   preprocessor_container = { 
                   'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<PREPROCESSOR_IMAGE>:<TAG>'
               }
   
   multi_model_container = { 
                  'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>',
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [preprocessor_container, multi_model_container]
               )
   ```
**Note**  
You can use only one multi-model enabled container in a serial inference pipeline.

1. (Optional) If your use case does not benefit from model caching, set the value of the `ModelCacheSetting` field of the `MultiModelConfig` parameter to `Disabled`, and include it in the `Container` argument of the call to `create_model`. The value of the `ModelCacheSetting` field is `Enabled` by default.

   ```
   container = { 
                   'Image': image, 
                   'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                   'Mode': 'MultiModel',
                   'MultiModelConfig': {
                           # Default value is 'Enabled'
                           'ModelCacheSetting': 'Disabled'
                   }
              }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container]
               )
   ```

1. Configure the multi-model endpoint for the model. We recommend configuring your endpoints with at least two instances. This allows SageMaker AI to provide a highly available set of predictions across multiple Availability Zones for the models.

   ```
   response = sagemaker_client.create_endpoint_config(
                   EndpointConfigName = '<ENDPOINT_CONFIG_NAME>',
                   ProductionVariants=[
                        {
                           'InstanceType':        'ml.m4.xlarge',
                           'InitialInstanceCount': 2,
                           'InitialVariantWeight': 1,
                           'ModelName':            '<MODEL_NAME>',
                           'VariantName':          'AllTraffic'
                         }
                   ]
              )
   ```

1. Create the multi-model endpoint using the `EndpointName` and `EndpointConfigName` parameters.

   ```
   response = sagemaker_client.create_endpoint(
                 EndpointName       = '<ENDPOINT_NAME>',
                 EndpointConfigName = '<ENDPOINT_CONFIG_NAME>')
   ```

## Create a multi-model endpoint using GPUs with the AWS SDK for Python (Boto3)
<a name="create-multi-model-endpoint-sdk-gpu"></a>

Use the following section to create a GPU backed multi-model endpoint. You create a multi-model endpoint using the Amazon SageMaker AI [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) APIs similarly to creating single model endpoints, but there are several changes. When defining the model container, you need to pass a new `Mode` parameter value, `MultiModel`. You also need to pass a `ModelDataUrl` field that specifies the prefix in Amazon S3 where the model artifacts are located, instead of the path to a single model artifact, as you would when deploying a single model. For GPU backed multi-model endpoints, you also must use a container with the NVIDIA Triton Inference Server that is optimized for running on GPU instances. For a list of container images that work with GPU backed endpoints, see [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only).

For an example notebook that demonstrates how to create a multi-model endpoint backed by GPUs, see [Run multiple deep learning models on GPUs with Amazon SageMaker AI Multi-model endpoints (MME)](https://github.com/aws/amazon-sagemaker-examples/blob/main/multi-model-endpoints/mme-on-gpu/cv/resnet50_mme_with_gpu.ipynb).

The following procedure outlines the key steps to create a GPU backed multi-model endpoint.

**To deploy the model (AWS SDK for Python (Boto 3))**

1. Define the container image. To create a multi-model endpoint with GPU support for ResNet models, define the container to use the [NVIDIA Triton Server image](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This container supports multi-model endpoints and is optimized for running on GPU instances. The image URI is the Amazon ECR path for the `sagemaker-tritonserver` image in your Region; the sample notebook defines the Region-specific account IDs in `account_id_map`. For example:

   ```
   import sagemaker
   
   sagemaker_session = sagemaker.Session()
   region = sagemaker_session.boto_region_name
   
   # Find the sagemaker-tritonserver image at
   # https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/resnet50/triton_resnet50.ipynb
   # Find available tags at https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only
   
   image = "{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-tritonserver:<TAG>".format(
       account_id=account_id_map[region], region=region
   )
   
   container = { 
                 'Image':        image,
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel',
                 "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet"},
               }
   ```

1. Get an AWS SDK for Python (Boto3) SageMaker AI client and create the model that uses this container.

   ```
   import boto3
   sagemaker_client = boto3.client('sagemaker')
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container])
   ```

1. (Optional) If you are using a serial inference pipeline, get the additional container(s) to include in the pipeline, and include it in the `Containers` argument of `CreateModel`:

   ```
   preprocessor_container = { 
                   'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<PREPROCESSOR_IMAGE>:<TAG>'
               }
   
   multi_model_container = { 
                  'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>',
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [preprocessor_container, multi_model_container]
               )
   ```
**Note**  
You can use only one multi-model enabled container in a serial inference pipeline.

1. (Optional) If your use case does not benefit from model caching, set the value of the `ModelCacheSetting` field of the `MultiModelConfig` parameter to `Disabled`, and include it in the `Container` argument of the call to `create_model`. The value of the `ModelCacheSetting` field is `Enabled` by default.

   ```
   container = { 
                   'Image': image, 
                   'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                   'Mode': 'MultiModel',
                   'MultiModelConfig': {
                           # Default value is 'Enabled'
                           'ModelCacheSetting': 'Disabled'
                   }
              }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container]
               )
   ```

1. Configure the multi-model endpoint with GPU backed instances for the model. We recommend configuring your endpoints with more than one instance to allow for high availability and higher cache hits.

   ```
   response = sagemaker_client.create_endpoint_config(
                   EndpointConfigName = '<ENDPOINT_CONFIG_NAME>',
                   ProductionVariants=[
                        {
                           'InstanceType':        'ml.g4dn.4xlarge',
                           'InitialInstanceCount': 2,
                           'InitialVariantWeight': 1,
                           'ModelName':            '<MODEL_NAME>',
                           'VariantName':          'AllTraffic'
                         }
                   ]
              )
   ```

1. Create the multi-model endpoint using the `EndpointName` and `EndpointConfigName` parameters.

   ```
   response = sagemaker_client.create_endpoint(
                 EndpointName       = '<ENDPOINT_NAME>',
                 EndpointConfigName = '<ENDPOINT_CONFIG_NAME>')
   ```
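
   Endpoint creation is asynchronous, so `create_endpoint` returns before the endpoint is ready to serve traffic. As a minimal sketch, you can poll `describe_endpoint` until the status settles (the helper name and polling interval below are our own, not part of the SageMaker API):

   ```python
   import time

   def wait_until_in_service(sagemaker_client, endpoint_name, poll_seconds=30):
       """Poll describe_endpoint until the endpoint leaves the 'Creating' state."""
       while True:
           status = sagemaker_client.describe_endpoint(
               EndpointName=endpoint_name)['EndpointStatus']
           if status != 'Creating':
               return status  # 'InService' on success, 'Failed' otherwise
           time.sleep(poll_seconds)
   ```

   Boto3 also ships an equivalent built-in waiter: `sagemaker_client.get_waiter('endpoint_in_service').wait(EndpointName='<ENDPOINT_NAME>')`.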

# Invoke a Multi-Model Endpoint
<a name="invoke-multi-model-endpoint"></a>

To invoke a multi-model endpoint, use [`invoke_endpoint`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) from the SageMaker AI Runtime just as you would invoke a single-model endpoint, with one change: pass a `TargetModel` parameter that specifies which of the models at the endpoint to target. The SageMaker AI Runtime `InvokeEndpoint` request supports `X-Amzn-SageMaker-Target-Model` as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the `CreateModel` API call with the relative path of the model.

The following procedures are the same for both CPU and GPU-backed multi-model endpoints.

------
#### [ AWS SDK for Python (Boto 3) ]

The following example prediction request uses the [AWS SDK for Python (Boto 3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) in the sample notebook.

```
response = runtime_sagemaker_client.invoke_endpoint(
                        EndpointName = "<ENDPOINT_NAME>",
                        ContentType  = "text/csv",
                        TargetModel  = "<MODEL_FILENAME>.tar.gz",
                        Body         = body)
```

------
#### [ AWS CLI ]

 The following example shows how to make a CSV request with two rows using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name "<ENDPOINT_NAME>" \
  --body "1.0,2.0,5.0"$'\n'"2.0,3.0,4.0" \
  --content-type "text/csv" \
  --target-model "<MODEL_NAME>.tar.gz" \
  output_file.txt
```

If the inference request succeeds, the AWS CLI writes the response to `output_file.txt`. For more examples on how to make predictions with the AWS CLI, see [Making predictions with the AWS CLI](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#making-predictions-with-the-aws-cli) in the SageMaker Python SDK documentation.

------

The multi-model endpoint dynamically loads target models as needed. You can observe this when running the [MME Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html) as it iterates through random invocations against multiple target models hosted behind a single endpoint. The first request against a given model takes longer because the model has to be downloaded from Amazon Simple Storage Service (Amazon S3) and loaded into memory. This is called a *cold start*, and it is expected behavior on multi-model endpoints, which trade occasional cold starts for better price performance. Subsequent calls finish faster because there's no additional overhead after the model has loaded.

**Note**  
For GPU backed instances, an HTTP 507 response code from the GPU container indicates a lack of memory or other resources. SageMaker AI then unloads unused models from the container in order to load more frequently used models.

## Retry Requests on ModelNotReadyException Errors
<a name="invoke-multi-model-config-retry"></a>

The first time you call `invoke_endpoint` for a model, the model is downloaded from Amazon Simple Storage Service and loaded into the inference container. This makes the first call take longer to return. Subsequent calls to the same model finish faster, because the model is already loaded.

SageMaker AI returns a response for a call to `invoke_endpoint` within 60 seconds. Some models are too large to download within 60 seconds. If the model does not finish loading before the 60-second timeout limit, the request to `invoke_endpoint` returns with the error code `ModelNotReadyException`, and the model continues to download and load into the inference container for up to 360 seconds. If you get a `ModelNotReadyException` error code for an `invoke_endpoint` request, retry the request. By default, the AWS SDKs for Python (Boto 3) (using [Legacy retry mode](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode)) and Java retry `invoke_endpoint` requests that result in `ModelNotReadyException` errors. You can configure the retry strategy to continue retrying the request for up to 360 seconds. If you expect your model to take longer than 60 seconds to download and load into the container, set the SDK socket timeout to 70 seconds. For more information about configuring the retry strategy for the AWS SDK for Python (Boto3), see [Configuring a retry mode](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#configuring-a-retry-mode). The following code shows an example that configures the retry strategy to retry calls to `invoke_endpoint` for up to 180 seconds.

```
import boto3
from botocore.config import Config

# This example retry strategy sets the retry attempts to 2.
# With this setting, the request can attempt to download and/or load the model
# for up to 180 seconds: 1 original request (60 seconds) + 2 retries (120 seconds)
config = Config(
    read_timeout=70,
    retries={
        'max_attempts': 2  # This value can be adjusted to 5 to go up to the 360s max timeout
    }
)
runtime_sagemaker_client = boto3.client('sagemaker-runtime', config=config)
```
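
If you want explicit control rather than relying on the SDK retry mode, the retry-on-`ModelNotReadyException` behavior can be sketched as a small loop that inspects the error code attached to the raised exception. The helper below is illustrative, not part of any AWS SDK:

```python
import time

def invoke_with_retry(invoke_fn, max_attempts=3, backoff_seconds=1):
    """Call invoke_fn, retrying only while the model is still loading."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke_fn()
        except Exception as err:
            # botocore ClientError carries the error code in err.response
            code = getattr(err, 'response', {}).get('Error', {}).get('Code')
            if code != 'ModelNotReadyException' or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds)
```

For example: `invoke_with_retry(lambda: runtime_sagemaker_client.invoke_endpoint(EndpointName='<ENDPOINT_NAME>', ContentType='text/csv', TargetModel='<MODEL_FILENAME>.tar.gz', Body=body))`.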

# Add or Remove Models
<a name="add-models-to-endpoint"></a>

You can deploy additional models to a multi-model endpoint and invoke them through that endpoint immediately. When adding a new model, you don't need to update or bring down the endpoint, so you avoid the cost of creating and running a separate endpoint for each new model. The process for adding and removing models is the same for CPU and GPU-backed multi-model endpoints.

SageMaker AI unloads unused models from the container when the instance is reaching memory capacity and more models need to be downloaded into the container. SageMaker AI also deletes unused model artifacts from the instance storage volume when the volume is reaching capacity and new models need to be downloaded. The first invocation to a newly added model takes longer because the endpoint takes time to download the model from Amazon S3 into the container's memory on the instance that hosts the endpoint.

With the endpoint already running, copy a new set of model artifacts to the Amazon S3 location where you store your models.

```
# Add an AdditionalModel to the endpoint and exercise it
aws s3 cp AdditionalModel.tar.gz s3://amzn-s3-demo-bucket/path/to/artifacts/
```

**Important**  
To update a model, proceed as you would when adding a new model. Use a new and unique name. Don't overwrite model artifacts in Amazon S3 because the old version of the model might still be loaded in the containers or on the storage volume of the instances on the endpoint. Invocations to the new model could then invoke the old version of the model. 

Client applications can request predictions from the additional target model as soon as it is stored in S3.

```
response = runtime_sagemaker_client.invoke_endpoint(
                        EndpointName='<ENDPOINT_NAME>',
                        ContentType='text/csv',
                        TargetModel='AdditionalModel.tar.gz',
                        Body=body)
```

To delete a model from a multi-model endpoint, stop invoking the model from the clients and remove it from the S3 location where model artifacts are stored.
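
In code, removing the artifact is a single S3 delete. A sketch using a boto3 S3 client (the helper name, bucket, and key below are illustrative):

```python
def delete_model_artifact(s3_client, bucket, key):
    """Remove a model artifact from S3 so the endpoint can no longer load it."""
    s3_client.delete_object(Bucket=bucket, Key=key)

# Example (hypothetical names):
# delete_model_artifact(boto3.client('s3'),
#                       'amzn-s3-demo-bucket',
#                       'path/to/artifacts/AdditionalModel.tar.gz')
```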

# Build Your Own Container for SageMaker AI Multi-Model Endpoints
<a name="build-multi-model-build-container"></a>

Refer to the following sections to bring your own container and dependencies to multi-model endpoints.

**Topics**
+ [Bring your own dependencies for multi-model endpoints on CPU backed instances](#build-multi-model-container-cpu)
+ [Bring your own dependencies for multi-model endpoints on GPU backed instances](#build-multi-model-container-gpu)
+ [Use the SageMaker AI Inference Toolkit](#multi-model-inference-toolkit)
+ [Custom Containers Contract for Multi-Model Endpoints](mms-container-apis.md)

## Bring your own dependencies for multi-model endpoints on CPU backed instances
<a name="build-multi-model-container-cpu"></a>

If none of the pre-built container images serve your needs, you can build your own container for use with CPU backed multi-model endpoints.

Custom Amazon Elastic Container Registry (Amazon ECR) images deployed in Amazon SageMaker AI are expected to adhere to the basic contract described in [Custom Inference Code with Hosting Services](your-algorithms-inference-code.md) that governs how SageMaker AI interacts with a Docker container that runs your own inference code. For a container to be capable of loading and serving multiple models concurrently, there are additional APIs and behaviors that must be followed. This additional contract includes new APIs to load, list, get, and unload models, and a different API to invoke models. There are also different behaviors for error scenarios that the APIs need to abide by. To indicate that the container complies with the additional requirements, add the following label to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
```

SageMaker AI also injects the following environment variable into the container:

```
SAGEMAKER_MULTI_MODEL=true
```
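
Your serving code can read this variable at startup to decide whether to enable its multi-model code paths. A minimal sketch (the function name is our own):

```python
import os

def multi_model_enabled(environ=None):
    """True when SageMaker AI has injected SAGEMAKER_MULTI_MODEL=true."""
    environ = os.environ if environ is None else environ
    return environ.get('SAGEMAKER_MULTI_MODEL') == 'true'
```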

If you are creating a multi-model endpoint for a serial inference pipeline, your Dockerfile must have the required labels for both multi-models and serial inference pipelines. For more information about serial inference pipelines, see [Run Real-time Predictions with an Inference Pipeline](inference-pipeline-real-time.md).

To help you implement these requirements for a custom container, two libraries are available:
+ [Multi Model Server](https://github.com/awslabs/multi-model-server) is an open source framework for serving machine learning models that can be installed in containers to provide the front end that fulfills the requirements for the new multi-model endpoint container APIs. It provides the HTTP front end and model management capabilities required by multi-model endpoints to host multiple models within a single container, load models into and unload models out of the container dynamically, and perform inference on a specified loaded model. It also provides a pluggable backend handler where you can implement your own algorithm.
+ [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) is a library that bootstraps Multi Model Server with a configuration and settings that make it compatible with SageMaker AI multi-model endpoints. It also allows you to tweak important performance parameters, such as the number of workers per model, depending on the needs of your scenario. 

## Bring your own dependencies for multi-model endpoints on GPU backed instances
<a name="build-multi-model-container-gpu"></a>

The bring your own container (BYOC) capability on multi-model endpoints with GPU backed instances is not currently supported by the Multi Model Server and SageMaker AI Inference Toolkit libraries.

To create multi-model endpoints with GPU backed instances, use the SageMaker AI supported [NVIDIA Triton Inference Server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html) with the [NVIDIA Triton Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). To bring your own dependencies, build your own container with the SageMaker AI supported NVIDIA Triton Inference Server as the base image in your Dockerfile:

```
FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3
```

**Important**  
Containers with the Triton Inference Server are the only supported containers you can use for GPU backed multi-model endpoints.

## Use the SageMaker AI Inference Toolkit
<a name="multi-model-inference-toolkit"></a>

**Note**  
The SageMaker AI Inference Toolkit is only supported for CPU backed multi-model endpoints. It is not currently supported for GPU backed multi-model endpoints.

Pre-built containers that support multi-model endpoints are listed in [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md). If you want to use any other framework or algorithm, you need to build a container. The easiest way to do this is to use the [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) to extend an existing pre-built container. The SageMaker AI Inference Toolkit is a library that bootstraps Multi Model Server (MMS) to create endpoints that can be deployed in SageMaker AI. For a sample notebook that shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI, see the [Multi-Model Endpoint BYOC Sample Notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_bring_your_own).

**Note**  
The SageMaker AI inference toolkit supports only Python model handlers. If you want to implement your handler in any other language, you must build your own container that implements the additional multi-model endpoint APIs. For information, see [Custom Containers Contract for Multi-Model Endpoints](mms-container-apis.md).

**To extend a container by using the SageMaker AI inference toolkit**

1. Create a model handler. MMS expects a model handler, which is a Python file that implements functions to preprocess the input, get predictions from the model, and process the output. For an example of a model handler, see [model\_handler.py](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/container/model_handler.py) from the sample notebook.

1. Import the inference toolkit and use its `model_server.start_model_server` function to start MMS. The following example is from the `dockerd-entrypoint.py` file from the sample notebook. Notice that the call to `model_server.start_model_server` passes the model handler described in the previous step:

   ```
   import subprocess
   import sys
   import shlex
   import os
   from retrying import retry
   from subprocess import CalledProcessError
   from sagemaker_inference import model_server
   
   def _retry_if_error(exception):
       return isinstance(exception, (CalledProcessError, OSError))
   
   @retry(stop_max_delay=1000 * 50,
          retry_on_exception=_retry_if_error)
   def _start_mms():
       # by default the number of workers per model is 1, but we can configure it through the
       # environment variable below if desired.
       # os.environ['SAGEMAKER_MODEL_SERVER_WORKERS'] = '2'
       model_server.start_model_server(handler_service='/home/model-server/model_handler.py:handle')
   
   def main():
       if sys.argv[1] == 'serve':
           _start_mms()
       else:
           subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
   
       # prevent docker exit
       subprocess.call(['tail', '-f', '/dev/null'])
       
   main()
   ```

1. In your `Dockerfile`, copy the model handler from the first step and specify the Python file from the previous step as the entrypoint in your `Dockerfile`. The following lines are from the [Dockerfile](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/container/Dockerfile) used in the sample notebook:

   ```
   # Copy the default custom service file to handle incoming data and inference requests
   COPY model_handler.py /home/model-server/model_handler.py
   
   # Define an entrypoint script for the docker image
   ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
   ```

1. Build and register your container. The following shell script from the sample notebook builds the container and uploads it to an Amazon Elastic Container Registry repository in your AWS account:

   ```
   %%sh
   
   # The name of our algorithm
   algorithm_name=demo-sagemaker-multimodel
   
   cd container
   
   account=$(aws sts get-caller-identity --query Account --output text)
   
   # Get the region defined in the current configuration (default to us-west-2 if none defined)
   region=$(aws configure get region)
   region=${region:-us-west-2}
   
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   
   # If the repository doesn't exist in ECR, create it.
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   
   if [ $? -ne 0 ]
   then
       aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   
   # Get the login command from ECR and execute it directly
   $(aws ecr get-login --region ${region} --no-include-email)
   
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build -q -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```

You can now use this container to deploy multi-model endpoints in SageMaker AI.


# Custom Containers Contract for Multi-Model Endpoints
<a name="mms-container-apis"></a>

To handle multiple models, your container must support a set of APIs that enable Amazon SageMaker AI to communicate with the container for loading, listing, getting, and unloading models as required. The `model_name` is used in the new set of APIs as the key input parameter. The customer container is expected to keep track of the loaded models using `model_name` as the mapping key. Also, `model_name` is an opaque identifier and is not necessarily the value of the `TargetModel` parameter passed into the `InvokeEndpoint` API. The original `TargetModel` value in the `InvokeEndpoint` request is passed to the container in the APIs as an `X-Amzn-SageMaker-Target-Model` header that can be used for logging purposes.

**Note**  
Multi-model endpoints for GPU backed instances are currently supported only with SageMaker AI's [NVIDIA Triton Inference Server container](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This container already implements the contract defined below. Customers can directly use this container with their multi-model GPU endpoints, without any additional work.

You can configure the following APIs on your containers for CPU backed multi-model endpoints.

**Topics**
+ [Load Model API](#multi-model-api-load-model)
+ [List Model API](#multi-model-api-list-model)
+ [Get Model API](#multi-model-api-get-model)
+ [Unload Model API](#multi-model-api-unload-model)
+ [Invoke Model API](#multi-model-api-invoke-model)

## Load Model API
<a name="multi-model-api-load-model"></a>

Instructs the container to load a particular model present in the `url` field of the body into the memory of the customer container and to keep track of it with the assigned `model_name`. After a model is loaded, the container should be ready to serve inference requests using this `model_name`.

```
POST /models HTTP/1.1
Content-Type: application/json
Accept: application/json

{
     "model_name" : "{model_name}",
     "url" : "/opt/ml/models/{model_name}/model"
}
```

**Note**  
If `model_name` is already loaded, this API should return 409. Any time a model cannot be loaded due to a lack of memory or other resources, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim resources.

## List Model API
<a name="multi-model-api-list-model"></a>

Returns the list of models loaded into the memory of the customer container.

```
GET /models HTTP/1.1
Accept: application/json

Response = 
{
    "models": [
        {
             "modelName" : "{model_name}",
             "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        {
            "modelName" : "{model_name}",
            "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        ....
    ]
}
```

This API also supports pagination.

```
GET /models?next_page_token={next_page_token} HTTP/1.1
Accept: application/json

Response = 
{
    "models": [
        {
             "modelName" : "{model_name}",
             "modelUrl" : "/opt/ml/models/{model_name}/model"
        },
        ....
    ],
    "nextPageToken" : "{next_page_token}"
}
```

SageMaker AI can initially call the List Models API without providing a value for `next_page_token`. If a `nextPageToken` field is returned as part of the response, it will be provided as the value for `next_page_token` in a subsequent List Models call. If a `nextPageToken` is not returned, it means that there are no more models to return.
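
From the platform's perspective, that paging protocol amounts to the following loop. The sketch below works against any callable that returns responses in the List Models shape shown above:

```python
def list_all_models(list_models_fn):
    """Page through List Models responses until no nextPageToken is returned."""
    models, token = [], None
    while True:
        response = list_models_fn(token)
        models.extend(response['models'])
        token = response.get('nextPageToken')
        if token is None:
            return models
```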

## Get Model API
<a name="multi-model-api-get-model"></a>

This is a simple read API on the `model_name` entity.

```
GET /models/{model_name} HTTP/1.1
Accept: application/json

{
     "modelName" : "{model_name}",
     "modelUrl" : "/opt/ml/models/{model_name}/model"
}
```

**Note**  
If `model_name` is not loaded, this API should return 404.

## Unload Model API
<a name="multi-model-api-unload-model"></a>

Instructs the customer container to unload a model from memory. SageMaker AI calls this API to evict a candidate model when it starts the process of loading a new model. The resources provisioned to `model_name` should be reclaimed by the container when this API returns a response.

```
DELETE /models/{model_name}
```

**Note**  
If `model_name` is not loaded, this API should return 404.

## Invoke Model API
<a name="multi-model-api-invoke-model"></a>

Makes a prediction request from the particular `model_name` supplied. The SageMaker AI Runtime `InvokeEndpoint` request supports `X-Amzn-SageMaker-Target-Model` as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the `CreateModel` API call with the relative path of the model.

```
POST /models/{model_name}/invoke HTTP/1.1
Content-Type: ContentType
Accept: Accept
X-Amzn-SageMaker-Custom-Attributes: CustomAttributes
X-Amzn-SageMaker-Target-Model: [relativePath]/{artifactName}.tar.gz
```

**Note**  
If `model_name` is not loaded, this API should return 404.

Additionally, on GPU instances, if `InvokeEndpoint` fails due to a lack of memory or other resources, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim resources.
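
Independent of the web framework you use, the status-code semantics of this contract can be sketched as an in-memory registry. The class, its method names, and the fixed-capacity check below are illustrative stand-ins for a real container's memory accounting:

```python
class ModelRegistry:
    """Tracks loaded models; methods return (http_status, body) pairs
    that mirror the multi-model container contract."""

    def __init__(self, capacity=10):
        self.capacity = capacity      # stand-in for a real memory/resource check
        self.models = {}              # model_name -> url

    def load(self, model_name, url):
        if model_name in self.models:
            return 409, 'already loaded'
        if len(self.models) >= self.capacity:
            # SageMaker AI responds to a 507 by unloading unused models
            return 507, 'insufficient resources'
        self.models[model_name] = url
        return 200, 'loaded'

    def list(self):
        return 200, {'models': [{'modelName': name, 'modelUrl': url}
                                for name, url in self.models.items()]}

    def get(self, model_name):
        if model_name not in self.models:
            return 404, 'not loaded'
        return 200, {'modelName': model_name,
                     'modelUrl': self.models[model_name]}

    def unload(self, model_name):
        if self.models.pop(model_name, None) is None:
            return 404, 'not loaded'
        return 200, 'unloaded'
```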

# Multi-Model Endpoint Security
<a name="multi-model-endpoint-security"></a>

Models and data in a multi-model endpoint are co-located on the instance storage volume and in container memory. All instances for Amazon SageMaker AI endpoints run on a single-tenant container that you own. Only your models can run on your multi-model endpoint. It's your responsibility to manage the mapping of requests to models and to provide access for users to the correct target models. SageMaker AI uses [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) to provide IAM identity-based policies that you use to specify allowed or denied actions and resources and the conditions under which actions are allowed or denied.

By default, an IAM principal with [`InvokeEndpoint`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html) permissions on a multi-model endpoint can invoke any model at the address of the S3 prefix defined in the [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) operation, provided that the IAM execution role defined in the operation has permissions to download the model. If you need to restrict `InvokeEndpoint` access to a limited set of models in S3, you can do one of the following:
+ Restrict `InvokeEndpoint` calls to specific models hosted at the endpoint by using the `sagemaker:TargetModel` IAM condition key. For example, the following policy allows `InvokeEndpoint` requests only when the value of the `TargetModel` field matches one of the specified patterns:

------
#### [ JSON ]


  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Action": [
                  "sagemaker:InvokeEndpoint"
              ],
              "Effect": "Allow",
              "Resource":
              "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
              "Condition": {
                  "StringLike": {
                      "sagemaker:TargetModel": ["company_a/*", "common/*"]
                  }
              }
          }
      ]
  }
  ```

------

  For information about SageMaker AI condition keys, see [Condition Keys for Amazon SageMaker AI](https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *AWS Identity and Access Management User Guide*.
+ Create multi-model endpoints with more restrictive S3 prefixes. 

For more information about how SageMaker AI uses roles to manage access to endpoints and perform operations on your behalf, see [How to use SageMaker AI execution roles](sagemaker-roles.md). Your customers might also have certain data isolation requirements dictated by their own compliance requirements that can be satisfied using IAM identities.

# CloudWatch Metrics for Multi-Model Endpoint Deployments
<a name="multi-model-endpoint-cloudwatch-metrics"></a>

Amazon SageMaker AI provides metrics for endpoints so you can monitor the cache hit rate, the number of models loaded and the model wait times for loading, downloading, and uploading at a multi-model endpoint. Some of the metrics are different for CPU and GPU backed multi-model endpoints, so the following sections describe the Amazon CloudWatch metrics that you can use for each type of multi-model endpoint.

For more information about the metrics, see **Multi-Model Endpoint Model Loading Metrics** and **Multi-Model Endpoint Model Instance Metrics** in [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). Per-model metrics aren't supported. 

## CloudWatch metrics for CPU backed multi-model endpoints
<a name="multi-model-endpoint-cloudwatch-metrics-cpu"></a>

You can monitor the following metrics on CPU backed multi-model endpoints.

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [`InvokeEndpoint`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Loading Metrics**


| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 

**Dimensions for Multi-Model Endpoint Model Loading Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 
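
For example, to retrieve the average `ModelCacheHit` ratio for a variant with the CloudWatch `GetMetricStatistics` API, you could build the request parameters from the dimensions above. The helper below is our own sketch, not part of any SDK:

```python
import datetime

def model_cache_hit_request(endpoint_name, variant_name, hours=1):
    """Build get_metric_statistics parameters for the ModelCacheHit metric."""
    end = datetime.datetime.utcnow()
    return {
        'Namespace': 'AWS/SageMaker',
        'MetricName': 'ModelCacheHit',
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name},
        ],
        'StartTime': end - datetime.timedelta(hours=hours),
        'EndTime': end,
        'Period': 60,             # metrics are emitted at a 1-minute frequency
        'Statistics': ['Average'],
    }

# cloudwatch = boto3.client('cloudwatch')
# stats = cloudwatch.get_metric_statistics(
#     **model_cache_hit_request('<ENDPOINT_NAME>', 'AllTraffic'))
```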

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [`InvokeEndpoint`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Instance Metrics**


| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 
| CPUUtilization  |  The sum of each individual CPU core's utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 
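
The cache-hit ratio from the `ModelCacheHit` metric is a useful signal for whether an endpoint's models fit in memory. As a minimal sketch, the following builds a CloudWatch `GetMetricStatistics` request for that metric; the endpoint and variant names are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone

def cache_hit_request(endpoint_name, variant_name, hours=1):
    """Build a GetMetricStatistics request for the ModelCacheHit metric.

    The Average statistic gives the ratio of requests for which the
    model was already loaded over the queried period.
    """
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelCacheHit",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 60,  # metrics are emitted at 1-minute frequency
        "Statistics": ["Average"],
    }

# To run against your account (requires boto3 and AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# response = cloudwatch.get_metric_statistics(
#     **cache_hit_request("my-multi-model-endpoint", "AllTraffic"))
```

A consistently low average suggests the working set of models is larger than the instance's memory, which you can address with larger instances or more of them.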

## CloudWatch metrics for GPU multi-model endpoint deployments
<a name="multi-model-endpoint-cloudwatch-metrics-gpu"></a>

You can monitor the following metrics on GPU backed multi-model endpoints.

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [ InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Loading Metrics**


| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request waited for the target model to be downloaded or loaded (or both) before inference could be performed.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 

**Dimensions for Multi-Model Endpoint Model Loading Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [ InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Instance Metrics**


| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 
| CPUUtilization  |  The sum of each individual CPU core's utilization. Each core's utilization ranges from 0% to 100%. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUUtilization |  The percentage of GPU units that are used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUMemoryUtilization |  The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUMemoryUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 

# Set SageMaker AI multi-model endpoint model caching behavior
<a name="multi-model-caching"></a>

By default, multi-model endpoints cache frequently used models in memory (CPU or GPU, depending on whether you have CPU or GPU backed instances) and on disk to provide low latency inference. The cached models are unloaded and/or deleted from disk only when a container runs out of memory or disk space to accommodate a newly targeted model.

You can change the caching behavior of a multi-model endpoint and explicitly enable or disable model caching by setting the `ModelCacheSetting` parameter when you call [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model).

We recommend setting the value of the `ModelCacheSetting` parameter to `Disabled` for use cases that do not benefit from model caching, such as when a large number of models are served from the endpoint but each model is invoked only once or very infrequently. For such use cases, setting the `ModelCacheSetting` parameter to `Disabled` allows higher transactions per second (TPS) for `invoke_endpoint` requests compared to the default caching mode. TPS is higher in these use cases because SageMaker AI does the following after the `invoke_endpoint` request:
+ Asynchronously unloads the model from memory and deletes it from disk immediately after it is invoked.
+ Provides higher concurrency for downloading and loading models in the inference container. For both CPU and GPU backed endpoints, the concurrency is a factor of the number of vCPUs of the container instance.
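
As a minimal sketch of disabling caching, the following builds the arguments for a `create_model` call with `MultiModelConfig.ModelCacheSetting` set to `Disabled`. The names, image URI, S3 prefix, and role ARN in the usage comment are hypothetical placeholders.

```python
def multi_model_definition(name, image_uri, model_data_prefix, role_arn, caching=False):
    """Build CreateModel arguments for a multi-model container, with model
    caching explicitly enabled or disabled via MultiModelConfig."""
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            {
                "Image": image_uri,
                # S3 prefix under which the individual model artifacts live
                "ModelDataUrl": model_data_prefix,
                "Mode": "MultiModel",
                "MultiModelConfig": {
                    "ModelCacheSetting": "Enabled" if caching else "Disabled",
                },
            }
        ],
    }

# To create the model (requires boto3 and AWS credentials):
# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_model(**multi_model_definition(
#     "my-mme-model",
#     "<account>.dkr.ecr.<region>.amazonaws.com/my-image:latest",
#     "s3://my-bucket/models/",
#     "arn:aws:iam::<account>:role/MySageMakerRole"))
```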

For guidelines on choosing a SageMaker AI ML instance type for a multi-model endpoint, see [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md).

# Set Auto Scaling Policies for Multi-Model Endpoint Deployments
<a name="multi-model-endpoints-autoscaling"></a>

SageMaker AI multi-model endpoints fully support automatic scaling, which manages replicas of models to ensure models scale based on traffic patterns. We recommend that you configure your multi-model endpoint and the size of your instances based on [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md) and also set up instance-based auto scaling for your endpoint. The invocation rates used to trigger an auto-scale event are based on the aggregate set of predictions across the full set of models served by the endpoint. For additional details on setting up endpoint auto scaling, see [Automatically Scale Amazon SageMaker AI Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).

You can set up auto scaling policies with predefined and custom metrics on both CPU and GPU backed multi-model endpoints.

**Note**  
SageMaker AI multi-model endpoint metrics are available at one-minute granularity.

## Define a scaling policy
<a name="multi-model-endpoints-autoscaling-define"></a>

To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.

Scaling policy configuration is represented by a JSON block. You save your scaling policy configuration as a JSON block in a text file. You use that text file when invoking the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see `[TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html)` in the *Application Auto Scaling API Reference*.
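
As a minimal sketch of that workflow, the following writes a target-tracking configuration to a file and shows (in a comment) how it could be passed to the AWS CLI. The policy name, endpoint, and variant in the CLI comment are hypothetical placeholders, and the target value of 70 is illustrative.

```python
import json

# Target-tracking configuration for the predefined
# SageMakerVariantInvocationsPerInstance metric.
config = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
}

# Save the configuration as a JSON file for use with the AWS CLI.
with open("scaling-config.json", "w") as f:
    json.dump(config, f, indent=4)

# Then apply it with the AWS CLI, for example:
# aws application-autoscaling put-scaling-policy \
#     --policy-name mme-invocations-policy \
#     --service-namespace sagemaker \
#     --resource-id endpoint/my-endpoint/variant/AllTraffic \
#     --scalable-dimension sagemaker:variant:DesiredInstanceCount \
#     --policy-type TargetTrackingScaling \
#     --target-tracking-scaling-policy-configuration file://scaling-config.json
```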

The following options are available for defining a target-tracking scaling policy configuration.

### Use a predefined metric
<a name="multi-model-endpoints-autoscaling-predefined"></a>

To quickly define a target-tracking scaling policy for a variant, use the `SageMakerVariantInvocationsPerInstance` predefined metric. `SageMakerVariantInvocationsPerInstance` is the average number of times per minute that each instance for a variant is invoked. We strongly recommend using this metric.

To use a predefined metric in a scaling policy, create a target tracking configuration for your policy. In the target tracking configuration, include a `PredefinedMetricSpecification` for the predefined metric and a `TargetValue` for the target value of that metric.

The following example is a typical policy configuration for target-tracking scaling for a variant. In this configuration, we use the `SageMakerVariantInvocationsPerInstance` predefined metric to adjust the number of variant instances so that each instance maintains an average of `70` invocations per minute.

```
{"TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
}
```

**Note**  
We recommend that you use `InvocationsPerInstance` while using multi-model endpoints. The `TargetValue` for this metric depends on your application’s latency requirements. We also recommend that you load test your endpoints to set up suitable scaling parameter values. To learn more about load testing and setting up autoscaling for your endpoints, see the blog [Configuring autoscaling inference endpoints in Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/).

### Use a custom metric
<a name="multi-model-endpoints-autoscaling-custom"></a>

If you need to define a target-tracking scaling policy that meets your custom requirements, define a custom metric. You can define a custom metric based on any production variant metric that changes in proportion to scaling.

Not all SageMaker AI metrics work for target tracking. The metric must be a valid utilization metric that describes how busy an instance is, and its value must change in inverse proportion to the number of variant instances. That is, the value of the metric should decrease when the number of instances increases.

**Important**  
Before deploying automatic scaling in production, you must test automatic scaling with your custom metric.

#### Example custom metric for a CPU backed multi-model endpoint
<a name="multi-model-endpoints-autoscaling-custom-cpu"></a>

The following example is a target-tracking configuration for a scaling policy. In this configuration, for a model named `my-model`, a custom metric of `CPUUtilization` adjusts the instance count on the endpoint based on an average CPU utilization of 50% across all instances.

```
{"TargetValue": 50,
    "CustomizedMetricSpecification":
    {"MetricName": "CPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "ModelName","Value": "my-model"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```

#### Example custom metric for a GPU backed multi-model endpoint
<a name="multi-model-endpoints-autoscaling-custom-gpu"></a>

The following example is a target-tracking configuration for a scaling policy. In this configuration, for a model named `my-model`, a custom metric of `GPUUtilization` adjusts the instance count on the endpoint based on an average GPU utilization of 50% across all instances.

```
{"TargetValue": 50,
    "CustomizedMetricSpecification":
    {"MetricName": "GPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "ModelName","Value": "my-model"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```
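
As a minimal sketch, the custom-metric configuration above could be applied programmatically through the Application Auto Scaling `put_scaling_policy` API. The endpoint, variant, and model names below are hypothetical placeholders.

```python
def gpu_scaling_policy(endpoint_name, variant_name, model_name, target_pct=50):
    """Build put_scaling_policy arguments that target an average
    GPUUtilization of target_pct percent across instances."""
    return {
        "PolicyName": f"{endpoint_name}-gpu-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_pct,
            "CustomizedMetricSpecification": {
                "MetricName": "GPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "ModelName", "Value": model_name},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
        },
    }

# To apply the policy (requires boto3, AWS credentials, and a scalable
# target registered for the variant):
# import boto3
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.put_scaling_policy(
#     **gpu_scaling_policy("my-endpoint", "AllTraffic", "my-model"))
```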

## Add a cooldown period
<a name="multi-model-endpoints-autoscaling-cooldown"></a>

To add a cooldown period for scaling out your endpoint, specify a value, in seconds, for `ScaleOutCooldown`. Similarly, to add a cooldown period for scaling in your model, add a value, in seconds, for `ScaleInCooldown`. For more information about `ScaleInCooldown` and `ScaleOutCooldown`, see `[TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html)` in the *Application Auto Scaling API Reference*.

The following is an example target-tracking configuration for a scaling policy. In this configuration, the `SageMakerVariantInvocationsPerInstance` predefined metric is used to adjust scaling based on an average of `70` across all instances of that variant. The configuration provides a scale-in cooldown period of 10 minutes and a scale-out cooldown period of 5 minutes.

```
{"TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 600,
    "ScaleOutCooldown": 300
}
```