Automated job retries - AWS Batch

Automated job retries

You can apply a retry strategy to your jobs and job definitions that allows failed jobs to be automatically retried. Possible failure scenarios include the following:

  • Any non-zero exit code from a container job

  • Amazon EC2 instance failure or termination

  • Internal AWS service error or outage

When a job is submitted to a job queue and placed into the RUNNING state, that's considered an attempt. By default, each job is given one attempt to move to either the SUCCEEDED or FAILED job state. However, both the job definition and the job submission workflows can be used to specify a retry strategy with between 1 and 10 attempts. If evaluateOnExit is specified, it can contain up to 5 retry strategies. If none of those retry strategies match, the job is retried. To make unmatched failures exit instead, add a final entry that exits for any reason. For example, this evaluateOnExit object has two entries with an action of RETRY and a final entry with an action of EXIT.

"evaluateOnExit": [
  { "action": "RETRY", "onReason": "AGENT" },
  { "action": "RETRY", "onStatusReason": "Task failed to start" },
  { "action": "EXIT", "onReason": "*" }
]

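To make the evaluateOnExit rules more concrete, the following Python sketch models them locally. It is a simplified illustration, not AWS Batch's implementation. It assumes that entries are checked in order, that every condition present in an entry must match, and that a pattern ending in an asterisk matches by prefix. The names validate_retry_strategy, decide_action, and RULES are hypothetical helpers introduced for this example.

```python
from typing import Optional

MAX_ATTEMPTS = 10     # upper bound on attempts, as stated in the text above
MAX_EXIT_RULES = 5    # upper bound on evaluateOnExit entries, as stated above


def validate_retry_strategy(strategy: dict) -> None:
    """Raise ValueError if the strategy exceeds the documented limits."""
    attempts = strategy.get("attempts", 1)  # one attempt is the default
    if not 1 <= attempts <= MAX_ATTEMPTS:
        raise ValueError(f"attempts must be between 1 and {MAX_ATTEMPTS}")
    if len(strategy.get("evaluateOnExit", [])) > MAX_EXIT_RULES:
        raise ValueError(f"at most {MAX_EXIT_RULES} evaluateOnExit entries")


def _matches(pattern: Optional[str], value: str) -> bool:
    """Assumed matching rule: a missing pattern matches anything, and a
    trailing '*' means that only the start of the value has to match."""
    if pattern is None:
        return True
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def decide_action(rules, reason="", status_reason="", exit_code=""):
    """Return the action of the first matching entry (assumed ordering).

    If no entry matches, fall back to RETRY, as the text above describes.
    """
    for rule in rules:
        if (_matches(rule.get("onReason"), reason)
                and _matches(rule.get("onStatusReason"), status_reason)
                and _matches(rule.get("onExitCode"), str(exit_code))):
            return rule["action"].upper()
    return "RETRY"


# The evaluateOnExit example from the text, expressed as a Python list.
RULES = [
    {"action": "RETRY", "onReason": "AGENT"},
    {"action": "RETRY", "onStatusReason": "Task failed to start"},
    {"action": "EXIT", "onReason": "*"},
]
```

Under these assumptions, decide_action(RULES, reason="AGENT") returns "RETRY", while a failure whose reason and status reason match neither of the first two entries falls through to the final catch-all entry and returns "EXIT". Removing that final entry makes the same failure return "RETRY", which is why the catch-all EXIT entry is useful.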
At runtime, the AWS_BATCH_JOB_ATTEMPT environment variable is set to the container's corresponding job attempt number. The first attempt is numbered 1, and subsequent attempts are in ascending order (for example, 2, 3, 4).

If a job attempt fails for any reason and the number of attempts specified in the retry configuration is greater than the AWS_BATCH_JOB_ATTEMPT number, the job is placed back in the RUNNABLE state. For more information, see Job states.
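As a minimal sketch of how code inside the container might use the attempt number, the following Python reads AWS_BATCH_JOB_ATTEMPT and applies the comparison described above. The function names are hypothetical, the fallback to 1 when the variable is missing is an assumption made for local runs, and the configured attempt count is passed in as a parameter because how the job learns that value is left to the caller.

```python
import os


def current_attempt(env=None) -> int:
    """Return the attempt number taken from AWS_BATCH_JOB_ATTEMPT.

    Falls back to 1 when the variable is absent (an assumption that keeps
    the sketch usable outside AWS Batch, for example in local testing).
    """
    env = os.environ if env is None else env
    return int(env.get("AWS_BATCH_JOB_ATTEMPT", "1"))


def retry_expected(attempt: int, configured_attempts: int) -> bool:
    """True when a failure of this attempt would leave attempts remaining,
    that is, when the configured count is greater than the attempt number."""
    return configured_attempts > attempt


if __name__ == "__main__":
    attempt = current_attempt()
    # The value 3 is an illustrative configured attempt count.
    if retry_expected(attempt, configured_attempts=3):
        print(f"Attempt {attempt}: a failure here would be retried.")
    else:
        print(f"Attempt {attempt}: this is the last attempt.")
```

A job could use a check like this to skip expensive cleanup on attempts that will be retried, or to send a failure notification only on the last attempt.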

Note

Jobs that are cancelled or terminated aren't retried. Also, jobs that fail because of an invalid job definition aren't retried.

For more information, see Retry strategy, Create a single-node job definition, Tutorial: submit a job, and Stopped tasks error codes.