Collect the results of your inference recommendation job programmatically with AWS SDK for Python (Boto3), the AWS CLI, Studio Classic, or the SageMaker AI console.
Once an inference recommendation job is complete, you can use DescribeInferenceRecommendationsJob to get the job details and recommendations. Provide the job name that you used when you created the inference recommendation job.
import boto3

# Create the SageMaker AI client and provide the name of your recommendation job
sagemaker_client = boto3.client("sagemaker")
job_name = '<INSERT>'

response = sagemaker_client.describe_inference_recommendations_job(JobName=job_name)
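If the job is still in progress when you call the operation, the recommendations aren't populated yet. The following is a minimal polling sketch (the loop structure and sleep interval are illustrative and not part of the original example; the status values come from the DescribeInferenceRecommendationsJob API):

import time

# Poll until the job reaches a terminal status.
while response['Status'] not in ('COMPLETED', 'FAILED', 'STOPPED'):
    time.sleep(30)  # arbitrary wait between polls
    response = sagemaker_client.describe_inference_recommendations_job(JobName=job_name)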
Print the response object. The previous code sample stored the response in a variable named response.

print(response)
This returns a JSON response similar to the following example. Note that this example shows the recommended instance types for real-time inference (for an example showing serverless inference recommendations, see the example after this one).
{
    'JobName': 'job-name',
    'JobDescription': 'job-description',
    'JobType': 'Default',
    'JobArn': 'arn:aws:sagemaker:region:account-id:inference-recommendations-job/resource-id',
    'Status': 'COMPLETED',
    'CreationTime': datetime.datetime(2021, 10, 26, 20, 4, 57, 627000, tzinfo=tzlocal()),
    'LastModifiedTime': datetime.datetime(2021, 10, 26, 20, 25, 1, 997000, tzinfo=tzlocal()),
    'InputConfig': {
        'ModelPackageVersionArn': 'arn:aws:sagemaker:region:account-id:model-package/resource-id',
        'JobDurationInSeconds': 0
    },
    'InferenceRecommendations': [{
        'Metrics': {
            'CostPerHour': 0.20399999618530273,
            'CostPerInference': 5.246913588052848e-06,
            'MaximumInvocations': 648,
            'ModelLatency': 263596
        },
        'EndpointConfiguration': {
            'EndpointName': 'endpoint-name',
            'VariantName': 'variant-name',
            'InstanceType': 'ml.c5.xlarge',
            'InitialInstanceCount': 1
        },
        'ModelConfiguration': {
            'Compiled': False,
            'EnvironmentParameters': []
        }
    },
    {
        'Metrics': {
            'CostPerHour': 0.11500000208616257,
            'CostPerInference': 2.92620870823157e-06,
            'MaximumInvocations': 655,
            'ModelLatency': 826019
        },
        'EndpointConfiguration': {
            'EndpointName': 'endpoint-name',
            'VariantName': 'variant-name',
            'InstanceType': 'ml.c5d.large',
            'InitialInstanceCount': 1
        },
        'ModelConfiguration': {
            'Compiled': False,
            'EnvironmentParameters': []
        }
    },
    {
        'Metrics': {
            'CostPerHour': 0.11500000208616257,
            'CostPerInference': 3.3625731248321244e-06,
            'MaximumInvocations': 570,
            'ModelLatency': 1085446
        },
        'EndpointConfiguration': {
            'EndpointName': 'endpoint-name',
            'VariantName': 'variant-name',
            'InstanceType': 'ml.m5.large',
            'InitialInstanceCount': 1
        },
        'ModelConfiguration': {
            'Compiled': False,
            'EnvironmentParameters': []
        }
    }],
    'ResponseMetadata': {
        'RequestId': 'request-id',
        'HTTPStatusCode': 200,
        'HTTPHeaders': {
            'x-amzn-requestid': 'x-amzn-requestid',
            'content-type': 'content-type',
            'content-length': '1685',
            'date': 'Tue, 26 Oct 2021 20:31:10 GMT'
        },
        'RetryAttempts': 0
    }
}
The first few lines provide information about the inference recommendation job itself. This includes the job name, job ARN, and creation and last modified times.
The InferenceRecommendations field contains a list of Inference Recommender inference recommendations.
The EndpointConfiguration nested dictionary contains the instance type (InstanceType) recommendation along with the endpoint and variant name (a deployed AWS machine learning model) that was used during the recommendation job. You can use the endpoint and variant name for monitoring in Amazon CloudWatch. See Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch for more information.
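For example, the following is a minimal sketch (not part of the original walkthrough) of pulling the Invocations metric for that endpoint and variant from CloudWatch with Boto3, assuming an endpoint with the recommended name and variant is deployed in your account:

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# EndpointName and VariantName come from the EndpointConfiguration in the recommendation.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "endpoint-name"},
        {"Name": "VariantName", "Value": "variant-name"},
    ],
    StartTime=datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.now(datetime.timezone.utc),
    Period=60,
    Statistics=["Sum"],
)
print(stats["Datapoints"])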
The Metrics nested dictionary contains information about the estimated cost per hour (CostPerHour) for your real-time endpoint in US dollars, the estimated cost per inference (CostPerInference) in US dollars for your real-time endpoint, the expected maximum number of InvokeEndpoint requests per minute sent to the endpoint (MaxInvocations), and the model latency (ModelLatency), which is the interval of time (in microseconds) that your model took to respond to SageMaker AI. The model latency includes the local communication time taken to send the request and to fetch the response from the model's container, and the time taken to complete the inference in the container.
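For example, the following is a minimal sketch (not part of the original walkthrough) that sorts the real-time recommendations by estimated cost per inference and prints the key fields, assuming the response variable from the earlier DescribeInferenceRecommendationsJob call for a completed job:

# Sort recommendations by estimated cost per inference (cheapest first).
recommendations = sorted(
    response['InferenceRecommendations'],
    key=lambda rec: rec['Metrics']['CostPerInference'],
)

for rec in recommendations:
    endpoint_config = rec['EndpointConfiguration']
    metrics = rec['Metrics']
    print(
        endpoint_config['InstanceType'],
        metrics['CostPerHour'],
        metrics['CostPerInference'],
        metrics['ModelLatency'],
    )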
The following example shows the InferenceRecommendations part of the response for an inference recommendations job configured to return serverless inference recommendations:
"InferenceRecommendations": [
{
"EndpointConfiguration": {
"EndpointName": "value
",
"InitialInstanceCount": value
,
"InstanceType": "value
",
"VariantName": "value
",
"ServerlessConfig": {
"MaxConcurrency": value
,
"MemorySizeInMb": value
}
},
"InvocationEndTime": value
,
"InvocationStartTime": value
,
"Metrics": {
"CostPerHour": value
,
"CostPerInference": value
,
"CpuUtilization": value
,
"MaxInvocations": value
,
"MemoryUtilization": value
,
"ModelLatency": value
,
"ModelSetupTime": value
},
"ModelConfiguration": {
"Compiled": "False",
"EnvironmentParameters": [],
"InferenceSpecificationName": "value
"
},
"RecommendationId": "value
"
}
]
You can interpret the recommendations for serverless inference similarly to the results for real-time inference, with the exception of the ServerlessConfig, which tells you the metrics returned for a serverless endpoint with the given MemorySizeInMB and when MaxConcurrency = 1. To increase the throughput possible on the endpoint, increase the value of MaxConcurrency linearly. For example, if the inference recommendation shows MaxInvocations as 1000, then increasing MaxConcurrency to 2 would support 2000 MaxInvocations. Note that this is true only up to a certain point, which can vary based on your model and code. Serverless recommendations also include the ModelSetupTime metric, which measures (in microseconds) the time it takes to launch compute resources on a serverless endpoint. For more information about setting up serverless endpoints, see the Serverless Inference documentation.
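As a rough illustration of that arithmetic, the following sketch (not from the original documentation) estimates throughput at a higher concurrency, assuming the response variable holds the output of a serverless recommendation job and keeping in mind that the linear scaling breaks down past a model- and code-dependent point:

# Take the first serverless recommendation from the response.
rec = response['InferenceRecommendations'][0]
baseline_invocations = rec['Metrics']['MaxInvocations']  # measured with MaxConcurrency = 1

target_concurrency = 2  # hypothetical value you plan to configure on the endpoint

# Linear estimate: for example, 1000 invocations per minute at concurrency 1
# suggests roughly 2000 at concurrency 2.
estimated_invocations = baseline_invocations * target_concurrency
print(estimated_invocations)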
For more information about interpreting the results of your recommendation job, see Recommendation results.