How CloudWatch alarms detect Amazon ECS deployment failures
You can configure Amazon ECS to set the deployment to failed when it detects that a specified CloudWatch alarm has gone into the ALARM state.
You can optionally set the configuration to roll back a failed deployment to the last completed deployment.
The following create-service AWS CLI example shows how to create a Linux service when the deployment alarms are used with the rollback option.
    aws ecs create-service \
        --service-name MyService \
        --deployment-controller type=ECS \
        --desired-count 3 \
        --deployment-configuration "alarms={alarmNames=[alarm1Name,alarm2Name],enable=true,rollback=true}" \
        --task-definition sample-fargate:1 \
        --launch-type FARGATE \
        --platform-family LINUX \
        --platform-version 1.4.0 \
        --network-configuration "awsvpcConfiguration={subnets=[subnet-12344321],securityGroups=[sg-12344321],assignPublicIp=ENABLED}"
Consider the following when you use the Amazon CloudWatch alarms method on a service.
- The bake time is a period of time after a new service version has scaled out and the old service version has scaled in, during which Amazon ECS continues to monitor the alarm associated with the deployment. Amazon ECS computes this time period based on the alarm configuration associated with the deployment. You can't set this value.
- The deploymentConfiguration request parameter now contains the alarms data type. You can specify the alarm names, whether to use the method, and whether to initiate a rollback when the alarms indicate a deployment failure. For more information, see CreateService in the Amazon Elastic Container Service API Reference.
- The DescribeServices response provides insight into the state of a deployment through the rolloutState and rolloutStateReason fields. When a new deployment starts, the rollout state begins in an IN_PROGRESS state. When the service reaches a steady state and the bake time is complete, the rollout state transitions to COMPLETED. If the service fails to reach a steady state and the alarm has gone into the ALARM state, the deployment will transition to a FAILED state. A deployment in a FAILED state won't launch any new tasks.
- In addition to the service deployment state change events Amazon ECS sends for deployments that have started and have completed, Amazon ECS also sends an event when a deployment that uses alarms fails. These events provide details about why a deployment failed or if a deployment was started because of a rollback. For more information, see Amazon ECS service deployment state change events.
- If a new deployment is started because a previous deployment failed and rollback was turned on, the reason field of the service deployment state change event will indicate that the deployment was started because of a rollback.
- If you use both the deployment circuit breaker and the Amazon CloudWatch alarms to detect failures, either one can initiate a deployment failure as soon as the criteria for that method are met. A rollback occurs when you use the rollback option for the method that initiated the deployment failure.
- The Amazon CloudWatch alarms method is only supported for Amazon ECS services that use the rolling update (ECS) deployment controller.
- You can configure this option by using the Amazon ECS console or the AWS CLI. For more information, see Create a service using defined parameters and create-service in the AWS Command Line Interface Reference.
- You might notice that the deployment status remains IN_PROGRESS for a prolonged amount of time. The reason for this is that Amazon ECS does not change the status until it has deleted the active deployment, and this does not happen until after the bake time. Depending on your alarm configuration, the deployment might appear to take several minutes longer than it does when you don't use alarms (even though the new primary task set is scaled up and the old deployment is scaled down). If you use CloudFormation timeouts, consider increasing the timeouts. For more information, see Creating wait conditions in a template in the AWS CloudFormation User Guide.
- Amazon ECS calls DescribeAlarms to poll the alarms. The calls to DescribeAlarms count toward the CloudWatch service quotas associated with your account. If you have other AWS services that call DescribeAlarms, the ability of Amazon ECS to poll the alarms might be affected. For example, if another service makes enough DescribeAlarms calls to reach the quota, that service is throttled, and Amazon ECS is also throttled and unable to poll alarms. If an alarm is generated during the throttling period, Amazon ECS might miss the alarm and the rollback might not occur. There is no other impact on the deployment. For more information on CloudWatch service quotas, see CloudWatch service quotas in the CloudWatch User Guide.
- If an alarm is in the ALARM state at the beginning of a deployment, Amazon ECS will not monitor alarms for the duration of that deployment (Amazon ECS ignores the alarm configuration). This behavior addresses the case where you want to start a new deployment to fix an initial deployment failure.
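The rollout states and the ALARM-at-start behavior described in the list above can be checked from the AWS CLI. The following is a minimal sketch, not a definitive implementation. The function names, the 30-second default polling interval, and the exit codes are assumptions made for illustration. The cluster, service, and alarm names are supplied by the caller. Keep in mind that every DescribeAlarms call counts toward the same CloudWatch quota that Amazon ECS uses for polling.

```shell
# Illustrative helpers (hypothetical names). Assumes the AWS CLI is installed
# and configured with credentials and a default Region.

# List which of the given alarms are currently in the ALARM state.
# If any are, Amazon ECS will not monitor alarms for a deployment started now.
alarms_in_alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-names "$@" \
    --query "MetricAlarms[?StateValue=='ALARM'].AlarmName" \
    --output text
}

# Print the rolloutState of the PRIMARY deployment of a service.
rollout_state() {
  local cluster="$1" service="$2"
  aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query "services[0].deployments[?status=='PRIMARY'] | [0].rolloutState" \
    --output text
}

# Poll until the rollout state is no longer IN_PROGRESS, then print it.
# Returns 0 for COMPLETED, 1 for any other final state, 2 if the CLI call fails.
# The interval (seconds) is an assumed default; the bake time can keep the
# state at IN_PROGRESS for several minutes.
wait_for_rollout() {
  local cluster="$1" service="$2" interval="${3:-30}" state
  while true; do
    state="$(rollout_state "$cluster" "$service")" || return 2
    if [ "$state" != "IN_PROGRESS" ]; then
      break
    fi
    sleep "$interval"
  done
  echo "$state"
  [ "$state" = "COMPLETED" ]
}
```

For example, `alarms_in_alarm_state alarm1Name alarm2Name` could be run before starting a deployment, and `wait_for_rollout MyCluster MyService` afterward, where MyCluster is a placeholder cluster name.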
Recommended alarms
We recommend that you use the following alarm metrics:
- If you use an Application Load Balancer, use the HTTPCode_ELB_5XX_Count and HTTPCode_ELB_4XX_Count Application Load Balancer metrics. These metrics check for HTTP spikes. For more information about the Application Load Balancer metrics, see CloudWatch metrics for your Application Load Balancer in the User Guide for Application Load Balancers.
- If you have an existing application, use the CPUUtilization and MemoryUtilization metrics. These metrics check for the percentage of CPU and memory that the cluster or service uses. For more information, see Considerations.
- If you use Amazon Simple Queue Service queues in your tasks, use the ApproximateNumberOfMessagesNotVisible Amazon SQS metric. This metric checks for the number of messages in the queue that are delayed and not available for reading immediately. For more information about Amazon SQS metrics, see Available CloudWatch metrics for Amazon SQS in the Amazon Simple Queue Service Developer Guide.
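As an illustration of the first recommendation, the following sketch creates a CloudWatch alarm on the HTTPCode_ELB_5XX_Count metric with the put-metric-alarm command. The function name, the default threshold of 10, the 60-second period, and the single evaluation period are assumptions chosen for the example, not recommended values. Tune them to your own traffic.

```shell
# Illustrative helper (hypothetical name). Assumes the AWS CLI is installed
# and configured with credentials and a default Region.
#
# Arguments:
#   $1  alarm name
#   $2  LoadBalancer dimension value, in the form app/<name>/<id>
#   $3  threshold for the number of 5XX responses per period (assumed default: 10)
create_alb_5xx_alarm() {
  local alarm_name="$1" load_balancer="$2" threshold="${3:-10}"
  aws cloudwatch put-metric-alarm \
    --alarm-name "$alarm_name" \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_ELB_5XX_Count \
    --dimensions "Name=LoadBalancer,Value=$load_balancer" \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold "$threshold" \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching
}
```

The alarm name passed as the first argument is the value you would then list in alarmNames in the deployment configuration.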