This section describes common Inference Recommender errors, the error messages they generate, and guidance on how to prevent and resolve them.
How to troubleshoot
You can attempt to resolve your error by going through the following steps:

- Check that you've covered all the prerequisites to use Inference Recommender. See the Inference Recommender Prerequisites.
- Check that you can deploy your model from Model Registry to an endpoint and that the endpoint can process your payloads without errors. See Deploy a Model from the Registry. A minimal smoke test is sketched after this list.
- When you kick off an Inference Recommender job, check that endpoints are being created in the console and review the CloudWatch logs.
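If you want to confirm the second step programmatically, the following minimal sketch invokes an already-deployed endpoint with a sample payload. The endpoint name, region, content type, and payload file are placeholders you would replace with your own values.

```python
# Minimal smoke test: send one payload to a deployed endpoint and check the response.
# The endpoint name, region, content type, and payload path are placeholders.
import boto3

runtime_client = boto3.client('sagemaker-runtime', region_name='<region>')

with open('sample-payload.json', 'rb') as f:   # hypothetical sample payload file
    payload = f.read()

response = runtime_client.invoke_endpoint(
    EndpointName='<endpoint-name>',
    ContentType='application/json',            # match your model's expected content type
    Body=payload
)

# A 200 status and a readable body indicate the endpoint can process your payload.
print(response['ResponseMetadata']['HTTPStatusCode'])
print(response['Body'].read())
```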
Common errors
Review the following table for common Inference Recommender errors and their solutions.
Error | Solution
---|---
Specify Domain | Make sure you provide the ML domain, or OTHER if it is unknown.
Provided role ARN cannot be assumed and an AccessDeniedException error occurs | Make sure the execution role provided has the necessary permissions specified in the prerequisites.
Specify Framework | Make sure you provide the ML framework, or OTHER if it is unknown.
Users at the end of prev phase is 0 while initial users of current phase is 1 | Users here refers to the virtual users or threads used to send requests. Each phase starts with A users and ends with B users such that B > A. Between sequential phases x_1 and x_2, we require that 0 <= abs(x_2.A - x_1.B) <= 3 (see the sketch after this table).
Total Traffic duration (across all phases) should not be more than Job duration | The total duration of all your phases cannot exceed the job duration.
Burstable instance type ml.t2.medium is not allowed | Inference Recommender doesn't support load testing on the t2 instance family because burstable instances don't provide consistent performance.
ResourceLimitExceeded when calling CreateEndpoint operation | You have exceeded a SageMaker AI resource limit. For example, Inference Recommender might be unable to provision endpoints for benchmarking if the account has reached the endpoint quota. For more information about SageMaker AI limits and quotas, see Amazon SageMaker AI endpoints and quotas.
ModelError when calling InvokeEndpoint operation | A model error can happen for several reasons. Check the endpoint's CloudWatch logs to identify the cause.
PayloadError when calling InvokeEndpoint operation | A payload error can happen for several reasons. Check the endpoint's CloudWatch logs to identify the cause.
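To make the phase rules above concrete, here is a small, self-contained sketch that checks a list of (initial users, ending users) phase pairs against the constraints described in the table. This is not an AWS API; the function and the numbers below are purely illustrative.

```python
# Self-contained illustration of the phase constraints described in the table
# above; this is not an AWS API, and the numbers below are hypothetical.
def validate_phases(phases, phase_durations_s=None, job_duration_s=None):
    """phases is a list of (initial_users, ending_users) pairs."""
    for initial, ending in phases:
        # Each phase must end with more users than it starts with (B > A).
        if ending <= initial:
            return False, f"phase ({initial}, {ending}) must end with more users than it starts with"
    for (_, prev_end), (curr_start, _) in zip(phases, phases[1:]):
        # Consecutive phases may differ by at most 3 users at the hand-off.
        if abs(curr_start - prev_end) > 3:
            return False, f"user gap between phases ({prev_end} -> {curr_start}) exceeds 3"
    # Total traffic duration across all phases cannot exceed the job duration.
    if phase_durations_s is not None and job_duration_s is not None:
        if sum(phase_durations_s) > job_duration_s:
            return False, "total traffic duration exceeds the job duration"
    return True, "ok"

print(validate_phases([(0, 0), (1, 4)]))                   # fails: first phase never grows
print(validate_phases([(1, 3), (4, 8)], [300, 300], 900))  # passes all checks
```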
Check CloudWatch
When you kick off an Inference Recommender job, you should see endpoints being created in the console. Select one of the endpoints and view the CloudWatch logs to monitor for any 4xx/5xx errors. If you have a successful Inference Recommender job, you can see the endpoint names as part of the results. Even if your Inference Recommender job is unsuccessful, you can still check the CloudWatch logs for the deleted endpoints by following these steps:

- Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
- Select the Region in which you created the Inference Recommender job from the Region dropdown list in the top right.
- In the navigation pane of CloudWatch, choose Logs, and then select Log groups.
- Search for the log group called /aws/sagemaker/Endpoints/sm-epc-*. Select the log group based on your most recent Inference Recommender job. You can also perform this search programmatically, as shown in the sketch after this list.
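If you prefer to do this search programmatically, the following sketch uses the CloudWatch Logs API to list the endpoint log groups created by Inference Recommender. The region is a placeholder.

```python
# List CloudWatch log groups for Inference Recommender benchmark endpoints.
# Log group names for these endpoints start with /aws/sagemaker/Endpoints/sm-epc-.
import boto3

logs_client = boto3.client('logs', region_name='<region>')

paginator = logs_client.get_paginator('describe_log_groups')
for page in paginator.paginate(logGroupNamePrefix='/aws/sagemaker/Endpoints/sm-epc-'):
    for group in page['logGroups']:
        # creationTime is epoch milliseconds; the most recent group usually
        # corresponds to your most recent Inference Recommender job.
        print(group['logGroupName'], group.get('creationTime'))
```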
You can also troubleshoot your job by checking the Inference Recommender CloudWatch logs. The Inference Recommender logs, which are published in the /aws/sagemaker/InferenceRecommendationsJobs CloudWatch log group, give a high-level view of the progress of the job in the <jobName>/execution log stream. You can find detailed information on each of the endpoint configurations being tested in the <jobName>/Endpoint/<endpointName> log stream.
Overview of the Inference Recommender log streams

- <jobName>/execution: contains overall job information such as endpoint configurations scheduled for benchmarking, compilation job skip reason, and validation failure reason.
- <jobName>/Endpoint/<endpointName>: contains information such as resource creation progress, test configuration, load test stop reason, and resource cleanup status.
- <jobName>/CompilationJob/<compilationJobName>: contains information on compilation jobs created by Inference Recommender, such as the compilation job configuration and compilation job status.
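As a sketch of how you might read these streams programmatically, the following code pulls events from the <jobName>/execution stream of the /aws/sagemaker/InferenceRecommendationsJobs log group. The region and job name are placeholders.

```python
# Read events from the Inference Recommender execution log stream.
import boto3

logs_client = boto3.client('logs', region_name='<region>')

job_name = '<job-name>'  # placeholder: your Inference Recommender job name

response = logs_client.get_log_events(
    logGroupName='/aws/sagemaker/InferenceRecommendationsJobs',
    logStreamName=f'{job_name}/execution',
    startFromHead=True  # read from the beginning of the stream
)

for event in response['events']:
    print(event['message'])
```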
Create an alarm for Inference Recommender error messages
Inference Recommender outputs log statements for errors, which can be helpful while troubleshooting. With a CloudWatch log group and a metric filter, you can look for terms and patterns in this log data as the data is sent to CloudWatch. Then, you can create a CloudWatch alarm based on the log group-metric filter. For more information, see Create a CloudWatch alarm based on a log group-metric filter.
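The following sketch shows one way to wire this up with boto3. The filter pattern, metric namespace, threshold, and SNS topic are illustrative assumptions, not values prescribed by Inference Recommender; adjust them to your own setup.

```python
# Create a metric filter on the Inference Recommender log group, then an alarm
# on the resulting metric. The filter pattern, namespace, and threshold are
# illustrative assumptions, not values prescribed by Inference Recommender.
import boto3

region = '<region>'
logs_client = boto3.client('logs', region_name=region)
cloudwatch_client = boto3.client('cloudwatch', region_name=region)

# Count log events containing the word "ERROR" in the job log group.
logs_client.put_metric_filter(
    logGroupName='/aws/sagemaker/InferenceRecommendationsJobs',
    filterName='InferenceRecommenderErrors',
    filterPattern='ERROR',
    metricTransformations=[{
        'metricName': 'InferenceRecommenderErrorCount',
        'metricNamespace': 'Custom/InferenceRecommender',  # hypothetical namespace
        'metricValue': '1',
    }]
)

# Alarm when any matching error message appears within a 5-minute period.
cloudwatch_client.put_metric_alarm(
    AlarmName='InferenceRecommenderErrorAlarm',
    Namespace='Custom/InferenceRecommender',
    MetricName='InferenceRecommenderErrorCount',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='notBreaching',
    # AlarmActions=['arn:aws:sns:<region>:<account-id>:<topic>'],  # optional notification
)
```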
Check benchmarks
When you kick off an Inference Recommender job, Inference Recommender creates several benchmarks to evaluate the performance of your model on different instance types. You can use the ListInferenceRecommendationsJobSteps API to view the details for all the benchmarks. If you have a failed benchmark, you can see the failure reasons as part of the results.
To use the ListInferenceRecommendationsJobSteps API, provide the following values:

- For JobName, provide the name of the Inference Recommender job.
- For StepType, use BENCHMARK to return details about the job's benchmarks.
- For Status, use FAILED to return details about only the failed benchmarks. For a list of the other status types, see the Status field in the ListInferenceRecommendationsJobSteps API.
```python
# Create a low-level SageMaker service client.
import boto3

aws_region = '<region>'
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Provide the job name for the SageMaker Inference Recommender job.
job_name = '<job-name>'

# Filter for benchmark steps.
step_type = 'BENCHMARK'

# Filter for benchmarks that have a FAILED status.
status = 'FAILED'

response = sagemaker_client.list_inference_recommendations_job_steps(
    JobName=job_name,
    StepType=step_type,
    Status=status
)
```
You can print the response object to view the results. The preceding code example stored the response in a variable named response:

```python
print(response)
```
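If you want more than a raw dump, a short follow-up sketch like the one below walks the returned steps and prints each failed benchmark's instance type and failure reason. The field names assume the response shape documented for ListInferenceRecommendationsJobSteps, with .get() guarding the optional ones.

```python
# Walk the failed benchmark steps and print a short summary of each.
# Field names follow the documented response shape for
# ListInferenceRecommendationsJobSteps; .get() guards the optional ones.
for step in response.get('Steps', []):
    benchmark = step.get('InferenceBenchmark', {})
    endpoint_config = benchmark.get('EndpointConfiguration', {})
    print('Instance type:', endpoint_config.get('InstanceType'))
    print('Failure reason:', benchmark.get('FailureReason'))
```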