Custom Containers Contract for Multi-Model Endpoints - Amazon SageMaker AI

Custom Containers Contract for Multi-Model Endpoints

To handle multiple models, your container must support a set of APIs that enable Amazon SageMaker AI to communicate with the container for loading, listing, getting, and unloading models as required. The model_name is used in the new set of APIs as the key input parameter. The customer container is expected to keep track of the loaded models using model_name as the mapping key. Also, the model_name is an opaque identifier and is not necessarily the value of the TargetModel parameter passed into the InvokeEndpoint API. The original TargetModel value in the InvokeEndpoint request is passed to container in the APIs as a X-Amzn-SageMaker-Target-Model header that can be used for logging purposes.

Note

Multi-model endpoints for GPU backed instances are currently supported only with SageMaker AI's NVIDIA Triton Inference Server container. This container already implements the contract defined below. Customers can directly use this container with their multi-model GPU endpoints, without any additional work.

You can configure the following APIs on your containers for CPU backed multi-model endpoints.

Load Model API

Instructs the container to load a particular model present in the url field of the body into the memory of the customer container and to keep track of it with the assigned model_name. After a model is loaded, the container should be ready to serve inference requests using this model_name.

POST /models HTTP/1.1 Content-Type: application/json Accept: application/json { "model_name" : "{model_name}", "url" : "/opt/ml/models/{model_name}/model", }
Note

If model_name is already loaded, this API should return 409. Any time a model cannot be loaded due to lack of memory or to any other resource, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim.

List Model API

Returns the list of models loaded into the memory of the customer container.

GET /models HTTP/1.1 Accept: application/json Response = { "models": [ { "modelName" : "{model_name}", "modelUrl" : "/opt/ml/models/{model_name}/model", }, { "modelName" : "{model_name}", "modelUrl" : "/opt/ml/models/{model_name}/model", }, .... ] }

This API also supports pagination.

GET /models HTTP/1.1 Accept: application/json Response = { "models": [ { "modelName" : "{model_name}", "modelUrl" : "/opt/ml/models/{model_name}/model", }, { "modelName" : "{model_name}", "modelUrl" : "/opt/ml/models/{model_name}/model", }, .... ] }

SageMaker AI can initially call the List Models API without providing a value for next_page_token. If a nextPageToken field is returned as part of the response, it will be provided as the value for next_page_token in a subsequent List Models call. If a nextPageToken is not returned, it means that there are no more models to return.

Get Model API

This is a simple read API on the model_name entity.

GET /models/{model_name} HTTP/1.1 Accept: application/json { "modelName" : "{model_name}", "modelUrl" : "/opt/ml/models/{model_name}/model", }
Note

If model_name is not loaded, this API should return 404.

Unload Model API

Instructs the SageMaker AI platform to instruct the customer container to unload a model from memory. This initiates the eviction of a candidate model as determined by the platform when starting the process of loading a new model. The resources provisioned to model_name should be reclaimed by the container when this API returns a response.

DELETE /models/{model_name}
Note

If model_name is not loaded, this API should return 404.

Invoke Model API

Makes a prediction request from the particular model_name supplied. The SageMaker AI Runtime InvokeEndpoint request supports X-Amzn-SageMaker-Target-Model as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the CreateModel API call with the relative path of the model.

POST /models/{model_name}/invoke HTTP/1.1 Content-Type: ContentType Accept: Accept X-Amzn-SageMaker-Custom-Attributes: CustomAttributes X-Amzn-SageMaker-Target-Model: [relativePath]/{artifactName}.tar.gz
Note

If model_name is not loaded, this API should return 404.

Additionally, on GPU instances, if InvokeEndpoint fails due to a lack of memory or other resources, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim.