Set SageMaker multi-model endpoint model caching behavior
By default, multi-model endpoints cache frequently used models in memory (CPU or GPU, depending on whether you have CPU or GPU backed instances) and on disk to provide low latency inference. Cached models are unloaded from memory or deleted from disk only when a container runs out of memory or disk space to accommodate a newly targeted model.
You can change the caching behavior of a multi-model endpoint and explicitly enable or disable model caching by setting the ModelCacheSetting parameter when you call create_model.
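As a minimal sketch of this call, the container definition below disables caching through the MultiModelConfig field of create_model. The image URI, S3 prefix, role ARN, and model name are placeholders, and the actual boto3 call is shown in comments since it requires AWS credentials:

```python
# Sketch: disable model caching for a multi-model endpoint at create_model time.
# All ARNs, URIs, and names below are placeholder values.

def build_container_definition(image_uri, model_data_prefix, cache_setting="Disabled"):
    """Build a ContainerDefinition for a multi-model endpoint with the
    given ModelCacheSetting ("Enabled" or "Disabled")."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_prefix,  # S3 prefix that holds the model artifacts
        "Mode": "MultiModel",               # marks this container as multi-model
        "MultiModelConfig": {
            "ModelCacheSetting": cache_setting,
        },
    }

container = build_container_definition(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
    "s3://my-bucket/models/",
)

# With boto3 installed and AWS credentials configured, the call would be:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(
#     ModelName="my-multi-model",
#     PrimaryContainer=container,
#     ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
# )
```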
We recommend setting the value of the ModelCacheSetting parameter to Disabled for use cases that do not benefit from model caching, for example, when a large number of models need to be served from the endpoint but each model is invoked only once (or very infrequently). For such use cases, setting ModelCacheSetting to Disabled allows higher transactions per second (TPS) for invoke_endpoint requests compared to the default caching mode, because SageMaker does the following after each invoke_endpoint request:
- Asynchronously unloads the model from memory and deletes it from disk immediately after it is invoked.
- Provides higher concurrency for downloading and loading models in the inference container. For both CPU and GPU backed endpoints, the concurrency is a factor of the number of vCPUs of the container instance.
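For context, each invoke_endpoint request against a multi-model endpoint names the model to run via its TargetModel artifact path. A minimal sketch of building such a request follows; the endpoint name, artifact key, and payload are placeholders, and the boto3 call is commented out since it requires a live endpoint:

```python
# Sketch: invoking one specific model on a multi-model endpoint.
# TargetModel is the artifact path relative to the ModelDataUrl S3 prefix
# supplied at create_model time; all values here are placeholders.

def build_invoke_request(endpoint_name, target_model, payload):
    """Build keyword arguments for a sagemaker-runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,       # e.g. "model-42.tar.gz"
        "ContentType": "application/json",
        "Body": payload,
    }

request = build_invoke_request(
    "my-mme-endpoint",
    "model-42.tar.gz",
    b'{"inputs": [1, 2, 3]}',
)

# With boto3 and a deployed endpoint:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**request)
# result = response["Body"].read()
```

With caching disabled, each such request triggers a download-load-invoke-unload cycle for the targeted model, which is why the concurrency factors above matter for throughput.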
For guidelines on choosing a SageMaker ML instance type for a multi-model endpoint, see Instance recommendations for multi-model endpoint deployments.