Troubleshoot Inference Pipelines

To troubleshoot inference pipeline issues, use CloudWatch logs and error messages. If you are using custom Docker images in a pipeline that includes Amazon SageMaker AI built-in algorithms, you might also encounter permissions problems. To grant the required permissions, create an Amazon Elastic Container Registry (Amazon ECR) policy.

Topics

Troubleshoot Amazon ECR Permissions for Inference Pipelines
Use CloudWatch Logs to Troubleshoot SageMaker AI Inference Pipelines
Use Error Messages to Troubleshoot Inference Pipelines

Troubleshoot Amazon ECR Permissions for Inference Pipelines

When you use custom Docker images in a pipeline that includes SageMaker AI built-in algorithms, you need an Amazon ECR policy. The policy allows your Amazon ECR repository to grant permission for SageMaker AI to pull the image. The policy must add the following permissions:


{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "allowSageMakerToPull",
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability"
            ]
        }
    ]
}

Use CloudWatch Logs to Troubleshoot SageMaker AI Inference Pipelines

SageMaker AI publishes the container logs for endpoints that deploy an inference pipeline to Amazon CloudWatch at the following path for each container.


/aws/sagemaker/Endpoints/{EndpointName}/{Variant}/{InstanceId}/{ContainerHostname}

For example, logs for this endpoint are published to the following log groups and streams:


EndpointName: MyInferencePipelinesEndpoint
Variant: MyInferencePipelinesVariant
InstanceId: i-0179208609ff7e488
ContainerHostname: MyContainerName1 and MyContainerName2


logGroup: /aws/sagemaker/Endpoints/MyInferencePipelinesEndpoint
logStream: MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName1
logStream: MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName2

A log stream is a sequence of log events that share the same source. Each separate source of logs into CloudWatch makes up a separate log stream. A log group is a group of log streams that share the same retention, monitoring, and access control settings.

To see the log groups and streams

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
In the navigation page, choose Logs.
In Log Groups. filter on MyInferencePipelinesEndpoint:
To see the log streams, on the CloudWatch Log Groups page, choose MyInferencePipelinesEndpoint, and then Search Log Group.

For a list of the logs that SageMaker AI publishes, see Inference Pipeline Logs and Metrics.

Use Error Messages to Troubleshoot Inference Pipelines

The inference pipeline error messages indicate which containers failed.

If an error occurs while SageMaker AI is invoking an endpoint, the service returns a ModelError (error code 424), which indicates which container failed. If the request payload (the response from the previous container) exceeds the limit of 5 MB, SageMaker AI provides a detailed error message, such as:

Received response from MyContainerName1 with status code 200. However, the request payload from MyContainerName1 to MyContainerName2 is 6000000 bytes, which has exceeded the maximum limit of 5 MB.

If a container fails the ping health check while SageMaker AI is creating an endpoint, it returns a ClientError and indicates all of the containers that failed the ping check in the last health check.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Logs and Metrics

Delete Endpoints and Resources