Automated job retries
You can apply a retry strategy to your jobs and job definitions that allows failed jobs to be automatically retried. Possible failure scenarios include the following:
- Any non-zero exit code from a container job
- Amazon EC2 instance failure or termination
- Internal AWS service error or outage
When a job is submitted to a job queue and placed into the RUNNING state, that's considered an attempt. By default, each job is given one attempt to move to either the SUCCEEDED or FAILED job state. However, both the job definition and the job submission workflows can be used to specify a retry strategy with between 1 and 10 attempts. If evaluateOnExit is specified, it can contain up to 5 retry strategies. If evaluateOnExit is specified but none of the retry strategies match, then the job is retried. To make jobs that match none of the retry conditions exit instead, add a final entry that exits for any reason. For example, this evaluateOnExit object has two entries with actions of RETRY and a final entry with an action of EXIT.
"evaluateOnExit": [
  { "action": "RETRY", "onReason": "AGENT" },
  { "action": "RETRY", "onStatusReason": "Task failed to start" },
  { "action": "EXIT", "onReason": "*" }
]
At runtime, the AWS_BATCH_JOB_ATTEMPT environment variable is set to the container's corresponding job attempt number. The first attempt is numbered 1, and subsequent attempts are in ascending order (for example, 2, 3, 4).
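A job script can read this variable to tell a first run from a retry. The following is a minimal sketch, not an official sample: the helper name current_attempt is hypothetical, and falling back to 1 when the variable is missing is an assumption made so the script also runs outside AWS Batch.

```python
import os


def current_attempt(env=os.environ):
    """Return the job attempt number as an integer.

    Assumption: when AWS_BATCH_JOB_ATTEMPT is not set (for example, when
    running locally), treat the run as attempt 1.
    """
    return int(env.get("AWS_BATCH_JOB_ATTEMPT", "1"))


if __name__ == "__main__":
    attempt = current_attempt()
    if attempt > 1:
        # A retry might skip work that an earlier attempt already finished.
        print(f"Retry detected: this is attempt {attempt}")
    else:
        print("First attempt")
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary in tests.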
For example, suppose that a job attempt fails for any reason and the number of attempts specified in the retry configuration is greater than the AWS_BATCH_JOB_ATTEMPT number. Then, the job is placed back in the RUNNABLE state. For more information, see Job states.
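The retry decision described in this section can be modeled locally as a rough sketch. This is an illustration only, not how the service is implemented. The function names are hypothetical, and the matching semantics are assumptions: entries are checked in order, the first match wins, a condition that is not specified does not constrain the match, and a trailing * means a prefix match. Only the limits (1 to 10 attempts, up to 5 entries) and the retry-by-default behavior come from the text above.

```python
def _matches(pattern, value):
    """Assumed matching rule: a trailing '*' is a prefix match, else exact."""
    if pattern is None:
        return True  # condition not specified, so it does not constrain the match
    value = "" if value is None else str(value)
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def evaluate_on_exit_action(rules, reason=None, status_reason=None, exit_code=None):
    """Return the action of the first matching entry, or RETRY if none match."""
    for rule in rules:
        if (_matches(rule.get("onReason"), reason)
                and _matches(rule.get("onStatusReason"), status_reason)
                and _matches(rule.get("onExitCode"), exit_code)):
            return rule["action"].upper()
    return "RETRY"  # no entry matched, so the job is retried


def state_after_failure(attempts, current_attempt, rules=None, **failure):
    """Return the state a failed attempt moves to: RUNNABLE or FAILED."""
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    if rules is not None and len(rules) > 5:
        raise ValueError("evaluateOnExit can contain up to 5 entries")
    if rules and evaluate_on_exit_action(rules, **failure) == "EXIT":
        return "FAILED"
    # Retry only while the configured attempts exceed the current attempt number.
    return "RUNNABLE" if attempts > current_attempt else "FAILED"


# The evaluateOnExit example from this section, as a Python list.
RULES = [
    {"action": "RETRY", "onReason": "AGENT"},
    {"action": "RETRY", "onStatusReason": "Task failed to start"},
    {"action": "EXIT", "onReason": "*"},
]

if __name__ == "__main__":
    print(state_after_failure(3, 1, RULES, reason="AGENT"))
    print(state_after_failure(3, 1, RULES, reason="OutOfMemoryError"))
    print(state_after_failure(3, 3))
```

The sketch does not model cancelled or terminated jobs, or failures caused by an invalid job definition.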
Note
Jobs that are cancelled or terminated aren't retried. Also, jobs that fail because of an invalid job definition aren't retried.
For more information, see Retry strategy, Create a single-node job definition, Tutorial: submit a job, and Stopped tasks error codes.