Invoke a Multi-Model Endpoint
To invoke a multi-model endpoint, call invoke_endpoint with the TargetModel parameter, which specifies which of the models at the endpoint to target. The SageMaker AI Runtime InvokeEndpoint request supports X-Amzn-SageMaker-Target-Model as a header that takes the relative path of the model specified for invocation. SageMaker AI constructs the absolute path of the model by combining the prefix provided in the CreateModel API call with the relative path of the model.
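As a minimal sketch of the call described above, the following helper passes TargetModel to invoke_endpoint on a SageMaker AI Runtime client. The function names, the JSON content type, and the example endpoint, bucket, and artifact names are assumptions made for illustration. resolve_model_location only illustrates the prefix-plus-relative-path combination and is not SageMaker AI's own implementation.

```python
import json


def resolve_model_location(model_data_prefix, target_model):
    """Illustrate how a prefix and a relative path combine into one absolute path."""
    return model_data_prefix.rstrip("/") + "/" + target_model.lstrip("/")


def invoke_target_model(runtime_client, endpoint_name, target_model, payload):
    """Send one request to a specific model hosted behind a multi-model endpoint."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # assumption: the model accepts JSON
        TargetModel=target_model,        # relative path of the model artifact
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it to get the raw prediction bytes.
    return response["Body"].read()


# Example usage (hypothetical names; requires AWS credentials and a live endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   invoke_target_model(client, "my-mme-endpoint", "model-a.tar.gz", {"inputs": [1, 2, 3]})
```

Keeping the client as an argument lets the same helper work with a client that was created with a custom retry configuration.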
The following procedures are the same for both CPU-backed and GPU-backed multi-model endpoints.
The multi-model endpoint dynamically loads target models as needed. You can observe this when running the MME Sample Notebook.
Note
For GPU-backed instances, an HTTP 507 response code from the GPU container indicates a lack of memory or other resources. This causes unused models to be unloaded from the container so that more frequently used models can be loaded.
Retry Requests on ModelNotReadyException Errors
The first time you call invoke_endpoint for a model, the model is downloaded from Amazon Simple Storage Service (Amazon S3) and loaded into the inference container. This makes the first call take longer to return. Subsequent calls to the same model finish faster because the model is already loaded.
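One way to observe the slower first call is to time several consecutive invocations of the same model. The sketch below is a generic timing helper. The name time_calls and the idea of wrapping the request in a zero-argument callable are assumptions for illustration, not part of the SageMaker AI API.

```python
import time


def time_calls(invoke, repeats=3):
    """Call invoke() repeatedly and return the elapsed seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        invoke()
        durations.append(time.perf_counter() - start)
    return durations


# Example usage (hypothetical names; requires a live endpoint):
#   durations = time_calls(lambda: client.invoke_endpoint(
#       EndpointName="my-mme-endpoint", TargetModel="model-a.tar.gz",
#       ContentType="application/json", Body=b"{}"))
#   The first entry is expected to be the largest if the model was not yet loaded.
```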
SageMaker AI returns a response for a call to invoke_endpoint within 60 seconds. Some models are too large to download within 60 seconds. If the model does not finish loading before the 60-second timeout, the invoke_endpoint request returns the error code ModelNotReadyException, and the model continues to download and load into the inference container for up to 360 seconds. If you get a ModelNotReadyException error code for an invoke_endpoint request, retry the request. By default, the AWS SDK for Python (Boto3), when using Legacy retry mode, retries invoke_endpoint requests that result in ModelNotReadyException errors. You can configure the retry strategy to continue retrying the request for up to 360 seconds. If you expect your model to take longer than 60 seconds to download and load into the container, set the SDK socket timeout to 70 seconds. For more information about configuring the retry strategy for the AWS SDK for Python (Boto3), see Configuring a retry mode. The following code configures the retry strategy to retry calls to invoke_endpoint for up to 180 seconds.
import boto3
from botocore.config import Config

# This example retry strategy sets the retry attempts to 2.
# With this setting, the request can attempt to download and/or load the model
# for up to 180 seconds: 1 original request (60 seconds) + 2 retries (120 seconds)
config = Config(
    read_timeout=70,
    retries={
        'max_attempts': 2  # This value can be adjusted to 5 to go up to the 360s max timeout
    }
)
runtime_sagemaker_client = boto3.client('sagemaker-runtime', config=config)
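The arithmetic in the comments above (1 original request plus N retries, each bounded by the 60-second response limit) can be sketched as a small helper that derives max_attempts from a desired load window. The function names are hypothetical. The calculation ignores any backoff delay the SDK adds between retries, so treat the result as an approximation.

```python
import math


def retries_for_load_window(window_seconds, per_call_timeout=60):
    """Return the number of retries needed so all calls together span the window."""
    calls_needed = math.ceil(window_seconds / per_call_timeout)
    # The original request covers the first timeout period; the rest are retries.
    return max(calls_needed - 1, 0)


def retry_settings(window_seconds, read_timeout=70):
    """Build keyword arguments in the shape used by botocore.config.Config above."""
    return {
        "read_timeout": read_timeout,
        "retries": {"max_attempts": retries_for_load_window(window_seconds)},
    }


# Example usage:
#   from botocore.config import Config
#   config = Config(**retry_settings(360))
```

With a 180-second window the helper yields 2 retries, and with the 360-second maximum it yields 5, matching the values in the example configuration.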