Get an inference recommendation for an existing endpoint
Inference recommendation jobs run a set of load tests on recommended instance types and on an existing endpoint. The load tests use the sample data you provided during model version registration to produce performance metrics.
You can benchmark and get inference recommendations for an existing SageMaker Inference endpoint to help you improve its performance. The procedure for getting recommendations for an existing SageMaker Inference endpoint is similar to the procedure for getting inference recommendations without an endpoint. Note the following feature exclusions when benchmarking an existing endpoint:
- You can only use one existing endpoint per Inference Recommender job.
- You can only have one variant on your endpoint.
- You can’t use an endpoint that has autoscaling enabled.
- This functionality is only supported for Real-Time Inference.
- This functionality doesn’t support Real-Time Multi-Model Endpoints.
Warning
We strongly recommend that you don't run an Inference Recommender job on a production endpoint that handles live traffic. The synthetic load during benchmarking can affect your production endpoint, causing throttling or inaccurate benchmark results. We recommend that you use a non-production or developer endpoint for comparison purposes.
The following sections demonstrate how to use Amazon SageMaker Inference Recommender to create an inference recommendation for an existing endpoint based on your model type using the AWS SDK for Python (Boto3) and the AWS CLI.
Note
Before you create an Inference Recommender recommendation job, make sure you have satisfied the Prerequisites for using Amazon SageMaker Inference Recommender.
Prerequisites
If you don’t already have a SageMaker Inference endpoint, you can either get an inference recommendation without an endpoint, or you can create a Real-Time Inference endpoint by following the instructions in Create your endpoint and deploy your model.
Create an inference recommendation job for an existing endpoint
Create an inference recommendation programmatically using the AWS SDK for Python (Boto3) or the AWS CLI. Specify a job name for your inference recommendation, the name of an existing SageMaker Inference endpoint, an AWS IAM role ARN, an input configuration, and the model package ARN from when you registered your model with the model registry.
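The following is a minimal Boto3 sketch of this call. The job name, Region, IAM role ARN, model package ARN, and endpoint name are placeholders that you replace with your own values:

import boto3

# Assumes your AWS credentials and Region are configured for Boto3
sagemaker_client = boto3.client("sagemaker", region_name="region")

sagemaker_client.create_inference_recommendations_job(
    JobName="job-name",  # placeholder job name
    JobType="Default",
    RoleArn="arn:aws:iam::account-id:role/role-name",  # placeholder IAM role ARN
    InputConfig={
        # Model package ARN from when you registered your model
        "ModelPackageVersionArn": "arn:aws:sagemaker:region:account-id:model-package/resource-id",
        # The existing endpoint to benchmark (one endpoint per job)
        "Endpoints": [{"EndpointName": "endpoint-name"}],
    },
)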
Get your inference recommendation job results
You can collect the results of your inference recommendation job programmatically with the same procedure for standard inference recommendation jobs. For more information, see Get your inference recommendation job results.
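For example, with the AWS SDK for Python (Boto3), the following minimal sketch describes the job by name, assuming the placeholder job name from the earlier example:

import boto3

sagemaker_client = boto3.client("sagemaker", region_name="region")

# Returns the JSON shown below once the job reaches the COMPLETED status
response = sagemaker_client.describe_inference_recommendations_job(
    JobName="job-name"
)
print(response["Status"])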
When you get inference recommendation job results for an existing endpoint, you should receive a JSON response similar to the following:
{ "JobName":
"job-name"
, "JobType": "Default", "JobArn": "arn:aws:sagemaker:region
:account-id
:inference-recommendations-job/resource-id
", "RoleArn":"iam-role-arn"
, "Status": "COMPLETED", "CreationTime": 1664922919.2, "LastModifiedTime": 1664924208.291, "InputConfig": { "ModelPackageVersionArn": "arn:aws:sagemaker:region
:account-id
:model-package/resource-id
", "Endpoints": [ { "EndpointName":"endpoint-name"
} ] }, "InferenceRecommendations": [ { "Metrics": { "CostPerHour": 0.7360000014305115, "CostPerInference": 7.456940238625975e-06, "MaxInvocations": 1645, "ModelLatency": 171 }, "EndpointConfiguration": { "EndpointName":"sm-endpoint-name"
, "VariantName":"variant-name"
, "InstanceType": "ml.g4dn.xlarge", "InitialInstanceCount": 1 }, "ModelConfiguration": { "EnvironmentParameters": [ { "Key": "TS_DEFAULT_WORKERS_PER_MODEL", "ValueType": "string", "Value": "4" } ] } } ], "EndpointPerformances": [ { "Metrics": { "MaxInvocations": 184, "ModelLatency": 1312 }, "EndpointConfiguration": { "EndpointName":"endpoint-name"
} } ] }
The first few lines provide information about the inference recommendation job itself. This includes the job name, role ARN, and creation and latest modification times.
The InferenceRecommendations dictionary contains a list of Inference Recommender inference recommendations.
The EndpointConfiguration nested dictionary contains the instance type (InstanceType) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) that was used during the recommendation job.
The Metrics nested dictionary contains information about the estimated cost per hour (CostPerHour) for your real-time endpoint in US dollars, the estimated cost per inference (CostPerInference) in US dollars for your real-time endpoint, the expected maximum number of InvokeEndpoint requests per minute sent to the endpoint (MaxInvocations), and the model latency (ModelLatency), which is the interval of time (in milliseconds) that your model took to respond to SageMaker. The model latency includes the local communication time taken to send the request to and fetch the response from the model container, and the time taken to complete the inference in the container.
The EndpointPerformances nested dictionary contains the name of your existing endpoint on which the recommendation job was run (EndpointName) and the performance metrics for your endpoint (MaxInvocations and ModelLatency).
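To illustrate, the following minimal sketch reads the fields described above from the DescribeInferenceRecommendationsJob response. It assumes the response dictionary returned by the earlier example:

# Recommended instance types and their benchmark metrics
for recommendation in response["InferenceRecommendations"]:
    config = recommendation["EndpointConfiguration"]
    metrics = recommendation["Metrics"]
    print(config["InstanceType"],
          metrics["CostPerHour"],        # estimated USD per hour
          metrics["CostPerInference"],   # estimated USD per inference
          metrics["MaxInvocations"],     # max InvokeEndpoint requests per minute
          metrics["ModelLatency"])       # model latency in milliseconds

# Benchmark metrics for your existing endpoint
for performance in response["EndpointPerformances"]:
    print(performance["EndpointConfiguration"]["EndpointName"],
          performance["Metrics"]["MaxInvocations"],
          performance["Metrics"]["ModelLatency"])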