Monitoring deployments for automatic rollback
During a deployment, you can mitigate situations where malformed or incorrect configuration data causes errors in your application by combining AWS AppConfig deployment strategies with automatic rollbacks based on Amazon CloudWatch alarms. Once configured, if one or more CloudWatch alarms go into the ALARM or INSUFFICIENT_DATA state during a deployment, AWS AppConfig automatically rolls back your configuration data to the previous version, which helps prevent application outages or errors. You can also roll back a configuration by calling the StopDeployment API operation while a deployment is still in progress.
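Because both the ALARM and INSUFFICIENT_DATA states trigger a rollback, it can help to check alarm states before you start a deployment. The following is a minimal sketch, assuming a boto3 `cloudwatch` client and alarm names that you supply. The helper names `alarms_blocking_deployment` and `preflight_check` are hypothetical and are not part of AWS AppConfig.

```python
from typing import Any, Dict, List

# Alarm states that cause AWS AppConfig to roll back a deployment.
ROLLBACK_STATES = {"ALARM", "INSUFFICIENT_DATA"}


def alarms_blocking_deployment(metric_alarms: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted names of alarms whose state would trigger a rollback."""
    return sorted(
        alarm["AlarmName"]
        for alarm in metric_alarms
        if alarm.get("StateValue") in ROLLBACK_STATES
    )


def preflight_check(cloudwatch: Any, alarm_names: List[str]) -> List[str]:
    """Look up the named metric alarms and report any in a rollback state.

    `cloudwatch` is assumed to be a boto3 CloudWatch client (or a stub that
    exposes a compatible describe_alarms method).
    """
    response = cloudwatch.describe_alarms(AlarmNames=alarm_names)
    return alarms_blocking_deployment(response.get("MetricAlarms", []))
```

An empty list from `preflight_check` suggests that none of the monitored metric alarms would immediately roll back a new deployment. This sketch doesn't inspect composite alarms.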
Important
For deployments that successfully complete, AWS AppConfig also supports reverting configuration
data to a previous version by using the AllowRevert
parameter with the StopDeployment API operation. For some customers, reverting to a previous
configuration after a successful deployment guarantees the data will be the same as it was
before the deployment. Reverting also ignores alarm monitors, which may prevent a roll
forward from progressing during an application emergency. For more information, see Reverting a configuration.
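The two uses of StopDeployment described above (stopping an in-progress deployment, and reverting a completed one with AllowRevert) can be sketched as follows. This is an illustrative sketch, assuming a boto3 `appconfig` client recent enough to accept the `AllowRevert` parameter. The helper names and the IDs in the usage note are hypothetical.

```python
from typing import Any, Dict


def build_stop_deployment_request(
    application_id: str,
    environment_id: str,
    deployment_number: int,
    allow_revert: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for the AppConfig StopDeployment operation."""
    if deployment_number < 1:
        raise ValueError("deployment_number must be a positive integer")
    request: Dict[str, Any] = {
        "ApplicationId": application_id,
        "EnvironmentId": environment_id,
        "DeploymentNumber": deployment_number,
    }
    # Only send AllowRevert when reverting a deployment that already completed.
    if allow_revert:
        request["AllowRevert"] = True
    return request


def stop_deployment(client: Any, **kwargs: Any) -> Dict[str, Any]:
    """Call StopDeployment on a boto3 'appconfig' client or a compatible stub."""
    return client.stop_deployment(**build_stop_deployment_request(**kwargs))
```

For example, with `client = boto3.client("appconfig")`, calling `stop_deployment(client, application_id="abc1234", environment_id="def5678", deployment_number=3)` stops deployment 3 while it is in progress. Adding `allow_revert=True` requests a revert of a completed deployment instead.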
To configure automatic rollbacks, you specify the Amazon Resource Name (ARN) of one or more CloudWatch alarms in the CloudWatch alarms field when you create (or edit) an AWS AppConfig environment. For more information, see Creating environments for your application in AWS AppConfig.
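If you create environments programmatically instead of in the console, the same setting is passed as the `Monitors` list of the CreateEnvironment operation. The following is a minimal sketch, assuming a boto3 `appconfig` client. The helper names are hypothetical, and the regular expression is only a loose sanity check for alarm ARNs, not full validation.

```python
import re
from typing import Any, Dict, List, Optional

# Loose pattern for a CloudWatch alarm ARN (a sanity check only).
_ALARM_ARN = re.compile(r"^arn:aws[a-z-]*:cloudwatch:[a-z0-9-]+:\d{12}:alarm:.+$")


def build_monitors(
    alarm_arns: List[str], alarm_role_arn: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build the Monitors list for CreateEnvironment or UpdateEnvironment."""
    monitors: List[Dict[str, str]] = []
    for arn in alarm_arns:
        if not _ALARM_ARN.match(arn):
            raise ValueError(f"not a CloudWatch alarm ARN: {arn}")
        monitor = {"AlarmArn": arn}
        # Optional IAM role that AWS AppConfig assumes to read the alarm.
        if alarm_role_arn:
            monitor["AlarmRoleArn"] = alarm_role_arn
        monitors.append(monitor)
    return monitors


def create_monitored_environment(
    client: Any,
    application_id: str,
    name: str,
    alarm_arns: List[str],
    alarm_role_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Create an environment whose deployments are monitored by the alarms."""
    return client.create_environment(
        ApplicationId=application_id,
        Name=name,
        Monitors=build_monitors(alarm_arns, alarm_role_arn),
    )
```

Rejecting ARNs that don't identify an alarm catches the common mistake of passing a metric name or an ARN from another service.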
Note
If you use a third-party monitoring solution (for example, Datadog), you can create an AWS AppConfig extension that checks for alarms at the AT_DEPLOYMENT_TICK action point and, as a safety guardrail, rolls back the deployment if it triggered an alarm. For more information about AWS AppConfig extensions, see Extending AWS AppConfig workflows using extensions. For more information about custom extensions, see Walkthrough: Creating custom AWS AppConfig extensions. To view a code sample of an AWS AppConfig extension that uses the AT_DEPLOYMENT_TICK action point to integrate with Datadog, see aws-samples/aws-appconfig-tick-extn-for-datadog.
Recommended metrics to monitor for automatic rollback
The metrics you choose to monitor will depend on the hardware and software used by your applications. AWS AppConfig customers often monitor the following metrics. For a complete list of recommended metrics grouped by AWS service, see Recommended alarms in the Amazon CloudWatch User Guide.
After you determine the metrics you want to monitor, use CloudWatch to configure alarms. For more information, see Using Amazon CloudWatch alarms.
Service | Metric | Details |
---|---|---|
Amazon API Gateway | 4XXError | This alarm detects a high rate of client-side errors. This can indicate an issue in the authorization or client request parameters. It could also mean that a resource was removed or a client is requesting one that doesn't exist. Consider enabling Amazon CloudWatch Logs and checking for any errors that may be causing the 4XX errors. Also consider enabling detailed CloudWatch metrics to view this metric per resource and method and narrow down the source of the errors. Errors could also be caused by exceeding the configured throttling limit. |
Amazon API Gateway | 5XXError | This alarm helps to detect a high rate of server-side errors. This can indicate that there is something wrong on the API backend, the network, or the integration between API Gateway and the backend API. |
Amazon API Gateway | Latency | This alarm detects high latency in a stage. Find the |
Amazon EC2 Auto Scaling | GroupInServiceCapacity | This alarm helps to detect when the capacity in the group is below the desired capacity required for your workload. To troubleshoot, check your scaling activities for launch failures and confirm that your desired capacity configuration is correct. |
Amazon EC2 | CPUUtilization | This alarm helps to monitor the CPU utilization of an EC2 instance. Depending on the application, consistently high utilization levels might be normal. But if performance is degraded, and the application is not constrained by disk I/O, memory, or network resources, then a maxed-out CPU might indicate a resource bottleneck or application performance problems. |
Amazon ECS | CPUReservation | This alarm helps you detect a high CPU reservation of the ECS cluster. High CPU reservation might indicate that the cluster is running out of registered CPUs for the task. |
Amazon ECS | HTTPCode_Target_5XX_Count | This alarm helps you detect a high server-side error count for the ECS service. This can indicate that there are errors that cause the server to be unable to serve requests. |
Amazon EKS | node_cpu_utilization | This alarm helps to detect high CPU utilization in worker nodes of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to replace your worker nodes with instances that have greater CPU or a need to scale the system horizontally. |
Amazon EKS | node_memory_utilization | This alarm helps to detect high memory utilization in worker nodes of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to scale the number of pod replicas or optimize your application. |
Amazon EKS | pod_cpu_utilization_over_pod_limit | This alarm helps to detect high CPU utilization in pods of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to increase the CPU limit for the affected pod. |
Amazon EKS | pod_memory_utilization_over_pod_limit | This alarm helps to detect high memory utilization in pods of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to increase the memory limit for the affected pod. |
AWS Lambda | Errors | This alarm detects high error counts. Errors include the exceptions thrown by the code as well as exceptions thrown by the Lambda runtime. |
AWS Lambda | Throttles | This alarm detects a high number of throttled invocation requests. Throttling occurs when no concurrency is available for scaling up. |
AWS Lambda | memory_utilization | This alarm detects whether the memory utilization of a Lambda function is approaching the configured limit. |
Amazon S3 | 4xxErrors | This alarm reports the total number of 4xx error status codes returned in response to client requests. For example, 403 error codes might indicate an incorrect IAM policy, and 404 error codes might indicate a misbehaving client application. |
Amazon S3 | 5xxErrors | This alarm helps you detect a high number of server-side errors. These errors indicate that a client made a request that the server couldn’t complete. This can help you correlate issues your application is facing with S3. |
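After you choose a metric from the table above, you can create the alarm and retrieve its ARN for the environment's CloudWatch alarms field. The following is a minimal sketch for the API Gateway 5XXError metric, assuming a boto3 `cloudwatch` client and a REST API identified by the `ApiName` dimension. The helper names, alarm name, threshold, and evaluation settings are illustrative assumptions, not recommended values.

```python
from typing import Any, Dict


def build_5xx_alarm(api_name: str, threshold: float = 0.05) -> Dict[str, Any]:
    """Build PutMetricAlarm arguments for an API Gateway 5XXError rate alarm."""
    return {
        "AlarmName": f"{api_name}-5XXError-rate",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        # The Average statistic of 5XXError represents the error rate.
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: treat missing data points as healthy so that a quiet API
        # is less likely to sit in INSUFFICIENT_DATA, which also triggers rollback.
        "TreatMissingData": "notBreaching",
    }


def create_alarm_and_get_arn(cloudwatch: Any, params: Dict[str, Any]) -> str:
    """Create or update the alarm, then look up its ARN."""
    cloudwatch.put_metric_alarm(**params)
    described = cloudwatch.describe_alarms(AlarmNames=[params["AlarmName"]])
    return described["MetricAlarms"][0]["AlarmArn"]
```

How missing data is treated matters here because AWS AppConfig rolls back on INSUFFICIENT_DATA as well as ALARM. Whether `notBreaching` is appropriate depends on how your application reports metrics.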