Retry Policy for Pipeline Steps

Retry policies help you automatically retry your Pipelines steps after an error occurs. Any pipeline step can encounter exceptions, and exceptions happen for various reasons. In some cases, a retry can resolve these issues. With a retry policy for pipeline steps, you can choose whether to retry a particular pipeline step or not.

The retry policy only supports the following pipeline steps:

Note

Jobs running inside both the tuning and AutoML steps conduct retries internally and will not retry the SageMaker.JOB_INTERNAL_ERROR exception type, even if a retry policy is configured. You can program your own Retry Strategy using the SageMaker API.

Supported exception types for the retry policy

The retry policy for pipeline steps supports the following exception types:

  • Step.SERVICE_FAULT: These exceptions occur when an internal server error or transient error happens when calling downstream services. Pipelines retries on this type of error automatically. With a retry policy, you can override the default retry operation for this exception type.

  • Step.THROTTLING: Throttling exceptions can occur while calling the downstream services. Pipelines retries on this type of error automatically. With a retry policy, you can override the default retry operation for this exception type.

  • SageMaker.JOB_INTERNAL_ERROR: These exceptions occur when the SageMaker AI job returns InternalServerError. In this case, starting a new job may fix a transient issue.

  • SageMaker.CAPACITY_ERROR: The SageMaker AI job may encounter Amazon EC2 InsufficientCapacityErrors, which cause the SageMaker AI job to fail. You can retry by starting a new SageMaker AI job to avoid the issue.

  • SageMaker.RESOURCE_LIMIT: You can exceed the resource limit quota when running a SageMaker AI job. You can wait a short period and then retry the SageMaker AI job to see whether resources have been released.
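
The list above can be summarized in a small lookup. The following is a minimal sketch in plain Python, not part of any SageMaker SDK: the names `RetryExceptionType`, `RETRIED_BY_DEFAULT`, and `needs_explicit_policy` are hypothetical. It assumes that only the two types described as retried automatically have a default retry, and that the `SageMaker.*` types are retried only when a policy names them.

```python
from enum import Enum


class RetryExceptionType(Enum):
    """Exception types listed as supported by the retry policy."""
    SERVICE_FAULT = "Step.SERVICE_FAULT"
    THROTTLING = "Step.THROTTLING"
    JOB_INTERNAL_ERROR = "SageMaker.JOB_INTERNAL_ERROR"
    CAPACITY_ERROR = "SageMaker.CAPACITY_ERROR"
    RESOURCE_LIMIT = "SageMaker.RESOURCE_LIMIT"


# The text says Pipelines retries these two automatically; a policy
# overrides that default. Treating the rest as opt-in is an assumption.
RETRIED_BY_DEFAULT = frozenset(
    {RetryExceptionType.SERVICE_FAULT, RetryExceptionType.THROTTLING}
)


def needs_explicit_policy(exception_type: str) -> bool:
    """Return True if the type is not described as retried automatically.

    Raises ValueError for a string that is not a supported exception type.
    """
    return RetryExceptionType(exception_type) not in RETRIED_BY_DEFAULT


print(needs_explicit_policy("SageMaker.CAPACITY_ERROR"))
print(needs_explicit_policy("Step.THROTTLING"))
```

Using an enum keeps misspelled exception type strings from passing silently, because an unknown value raises `ValueError` at lookup time.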

The JSON schema for the retry policy

The retry policy for Pipelines has the following JSON schema:

"RetryPolicy": {
   "ExceptionType": [String]
   "IntervalSeconds": Integer
   "BackoffRate": Double
   "MaxAttempts": Integer
   "ExpireAfterMin": Integer
}

  • ExceptionType (required): A string array containing one or more of the following exception types.

    • Step.SERVICE_FAULT

    • Step.THROTTLING

    • SageMaker.JOB_INTERNAL_ERROR

    • SageMaker.CAPACITY_ERROR

    • SageMaker.RESOURCE_LIMIT

  • IntervalSeconds (optional): The number of seconds before the first retry attempt (1 by default). IntervalSeconds has a maximum value of 43200 seconds (12 hours).

  • BackoffRate (optional): The multiplier by which the retry interval increases during each attempt (2.0 by default).

  • MaxAttempts (optional): A positive integer that represents the maximum number of retry attempts (5 by default). If the error recurs more times than MaxAttempts specifies, retries cease and normal error handling resumes. A value of 0 specifies that errors are never retried. MaxAttempts has a maximum value of 20.

  • ExpireAfterMin (optional): A positive integer that represents the maximum time span for retries. If the error recurs more than ExpireAfterMin minutes after the step starts executing, retries cease and normal error handling resumes. A value of 0 specifies that errors are never retried. ExpireAfterMin has a maximum value of 14,400 minutes (10 days).

    Note

    Specify either MaxAttempts or ExpireAfterMin, but not both. If neither is specified, MaxAttempts is used by default. If both properties are specified in one policy, the retry policy generates a validation error.
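
The field rules above can be checked locally before a policy is submitted. The following is a minimal sketch in plain Python that builds a dictionary shaped like the schema and rejects values outside the documented limits. The helper names `build_retry_policy` and `retry_delays` are hypothetical and not part of any SageMaker SDK. Lower bounds for IntervalSeconds and BackoffRate are not stated in this section, so the sketch does not enforce any. The delay calculation assumes plain exponential backoff (IntervalSeconds multiplied by BackoffRate for each further attempt); whether the service caps or jitters these delays is not stated here.

```python
import json

SUPPORTED_TYPES = {
    "Step.SERVICE_FAULT",
    "Step.THROTTLING",
    "SageMaker.JOB_INTERNAL_ERROR",
    "SageMaker.CAPACITY_ERROR",
    "SageMaker.RESOURCE_LIMIT",
}


def build_retry_policy(exception_types, interval_seconds=1, backoff_rate=2.0,
                       max_attempts=None, expire_after_min=None):
    """Return a dict shaped like the RetryPolicy schema, or raise ValueError."""
    if not exception_types:
        raise ValueError("ExceptionType needs at least one entry")
    unknown = set(exception_types) - SUPPORTED_TYPES
    if unknown:
        raise ValueError(f"unsupported exception types: {sorted(unknown)}")
    if interval_seconds > 43200:
        raise ValueError("IntervalSeconds has a maximum of 43200 (12 hours)")
    if max_attempts is not None and expire_after_min is not None:
        raise ValueError("specify MaxAttempts or ExpireAfterMin, not both")

    policy = {
        "ExceptionType": list(exception_types),
        "IntervalSeconds": interval_seconds,
        "BackoffRate": backoff_rate,
    }
    if expire_after_min is not None:
        if not 0 <= expire_after_min <= 14400:
            raise ValueError("ExpireAfterMin must be between 0 and 14400")
        policy["ExpireAfterMin"] = expire_after_min
    else:
        # Neither bound given: fall back to MaxAttempts with its default of 5.
        attempts = 5 if max_attempts is None else max_attempts
        if not 0 <= attempts <= 20:
            raise ValueError("MaxAttempts must be between 0 and 20")
        policy["MaxAttempts"] = attempts
    return policy


def retry_delays(policy):
    """Seconds waited before each retry of a MaxAttempts-bounded policy.

    Assumes plain exponential backoff; raises KeyError if the policy is
    bounded by ExpireAfterMin instead of MaxAttempts.
    """
    return [policy["IntervalSeconds"] * policy["BackoffRate"] ** n
            for n in range(policy["MaxAttempts"])]


policy = build_retry_policy(["Step.THROTTLING"], interval_seconds=10,
                            max_attempts=3)
print(json.dumps(policy, indent=2))
print(retry_delays(policy))  # [10.0, 20.0, 40.0]
```

Raising `ValueError` when both MaxAttempts and ExpireAfterMin are passed mirrors the validation error described in the note, so the mistake surfaces before the pipeline definition is submitted.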

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.