Monitoring deployments for automatic rollback - AWS AppConfig

Monitoring deployments for automatic rollback

During a deployment, you can mitigate situations where malformed or incorrect configuration data causes errors in your application by using a combination of AWS AppConfig deployment strategies and automatic rollbacks based on Amazon CloudWatch alarms. Once configured, if one or more CloudWatch alarms go into the ALARM or INSUFFICIENT_DATAstate during a deployment, AWS AppConfig automatically rolls back your configuration data to the previous version, thereby preventing application outages or errors. You can also roll back a configuration by calling the StopDeployment API operation while a deployment is still in progress.

Important

For deployments that successfully complete, AWS AppConfig also supports reverting configuration data to a previous version by using the AllowRevert parameter with the StopDeployment API operation. For some customers, reverting to a previous configuration after a successful deployment guarantees the data will be the same as it was before the deployment. Reverting also ignores alarm monitors, which may prevent a roll forward from progressing during an application emergency. For more information, see Reverting a configuration.

To configure automatic rollbacks, you specify the Amazon Resource Name (ARN) of one or more CloudWatch metrics in the CloudWatch alarms field when you create (or edit) an AWS AppConfig environment. For more information, see Creating environments for your application in AWS AppConfig.

Note

If you use a third-party monitoring solution (for example, Datadog), you can create an AWS AppConfig extension that checks for alarms at the AT_DEPLOYMENT_TICK action point and, as a safety guardrail, rolls back the deployment if it triggered an alarm. For more information about AWS AppConfig extensions, see Extending AWS AppConfig workflows using extensions. For more information about custom extensions, see Walkthrough: Creating custom AWS AppConfig extensions. To view a code sample of an AWS AppConfig extension that uses the AT_DEPLOYMENT_TICK action point to integrate with Datadog, see aws-samples / aws-appconfig-tick-extn-for-datadog on GitHub.

Recommended metrics to monitor for automatic rollback

The metrics you choose to monitor will depend on the hardware and software used by your applications. AWS AppConfig customers often monitor the following metrics. For a complete list of recommended metrics grouped by AWS service, see Recommended alarms in the Amazon CloudWatch User Guide.

After you determine the metrics you want to monitor, use CloudWatch to configure alarms. For more information, see Using Amazon CloudWatch alarms.

Service Metric Details

Amazon API Gateway

4XXError

This alarm detects a high rate of client-side errors. This can indicate an issue in the authorization or client request parameters. It could also mean that a resource was removed or a client is requesting one that doesn't exist. Consider enabling Amazon CloudWatch Logs and checking for any errors that may be causing the 4XX errors. Moreover, consider enabling detailed CloudWatch metrics to view this metric per resource and method and narrow down the source of the errors. Errors could also be caused by exceeding the configured throttling limit.

Amazon API Gateway

5XXError

This alarm helps to detect a high rate of server-side errors. This can indicate that there is something wrong on the API backend, the network, or the integration between the API gateway and the backend API.

Amazon API Gateway

Latency

This alarm detects high latency in a stage. Find the IntegrationLatency metric value to check the API backend latency. If the two metrics are mostly aligned, the API backend is the source of higher latency and you should investigate there for issues. Consider also enabling CloudWatch Logs and checking for errors that might be causing the high latency.

Amazon EC2 Auto Scaling

GroupInServiceCapacity

This alarm helps to detect when the capacity in the group is below the desired capacity required for your workload. To troubleshoot, check your scaling activities for launch failures and confirm that your desired capacity configuration is correct.

Amazon EC2

CPUUtilization

This alarm helps to monitor the CPU utilization of an EC2 instance. Depending on the application, consistently high utilization levels might be normal. But if performance is degraded, and the application is not constrained by disk I/O, memory, or network resources, then a maxed-out CPU might indicate a resource bottleneck or application performance problems.

Amazon ECS

CPUReservation

This alarm helps you detect a high CPU reservation of the ECS cluster. High CPU reservation might indicate that the cluster is running out of registered CPUs for the task.

Amazon ECS

HTTPCode_Target_5XX_Count

This alarm helps you detect a high server-side error count for the ECS service. This can indicate that there are errors that cause the server to be unable to serve requests.

Amazon EKS with Container insights

node_cpu_utilization

This alarm helps to detect high CPU utilization in worker nodes of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need for replacing your worker nodes with instances that have greater CPU or a need to scale the system horizontally.

Amazon EKS with Container insights

node_memory_utilization

This alarm helps in detecting high memory utilization in worker nodes of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to scale the number of pod replicas, or optimize your application.

Amazon EKS with Container insights

pod_cpu_utilization_over_pod_limit

This alarm helps in detecting high CPU utilization in pods of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to increase the CPU limit for the affected pod.

Amazon EKS with Container insights

pod_memory_utilization_over_pod_limit

This alarm helps in detecting high CPU utilization in pods of the Amazon EKS cluster. If the utilization is consistently high, it might indicate a need to increase the CPU limit for the affected pod.

AWS Lambda

Errors

This alarm detects high error counts. Errors includes the exceptions thrown by the code as well as exceptions thrown by the Lambda runtime.

AWS Lambda

Throttles

This alarm detects a high number of throttled invocation requests. Throttling occurs when there is no concurrency is available for scale up.

Lambda Insights

memory_utilization

This alarm is used to detect if the memory utilization of a lambda function is approaching the configured limit.

Amazon S3

4xxErrors

This alarm helps us report the total number of 4xx error status codes that are made in response to client requests. 403 error codes might indicate an incorrect IAM policy, and 404 error codes might indicate mis-behaving client application, for example.

Amazon S3

5xxErrors

This alarm helps you detect a high number of server-side errors. These errors indicate that a client made a request that the server couldn’t complete. This can help you correlate the issue your application is facing because of S3.