Design for failure
“Everything fails all the time.” – Werner Vogels
This adage is no less true in the container world than it is for the cloud. Achieving high availability is a top priority for workloads, but it remains an arduous undertaking for development teams.
Modern applications running in containers should not be tasked with
managing the underlying layers, from physical infrastructure like
electricity sources or environmental controls all the way to the
stability of the underlying operating system. If a set of containers
fails while tasked with delivering a service, these containers
should be re-instantiated automatically and with no delay.
Similarly, as microservices interact with each other over the
network more than they do locally and synchronously, connections
need to be monitored and managed. Latency and timeouts should be
assumed and gracefully handled. More generally, microservices need
to apply the same error retries and exponential backoff with jitter that are recommended for calls to any remote dependency.
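A minimal sketch of this retry behavior in Python might look like the following; the function name, attempt count, and delay values are illustrative rather than prescriptive.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Invoke `operation`, retrying failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential backoff, capped at max_delay, with full jitter so that
            # many instances retrying at once do not synchronize into a retry storm.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```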
Designing for failure also means testing the design and watching
services cope with deteriorating conditions. Not all technology
departments need to apply this principle to the extent that Netflix does with its chaos engineering practices, but deliberately injecting failures and observing how services recover is a worthwhile exercise.
Designing for failure yields a self-healing infrastructure that acts with the maturity expected of modern workloads. Preventing emergency calls guarantees a base level of satisfaction for the service-owning team and removes a level of stress that can otherwise grow into accelerated attrition. Designing for failure also delivers greater uptime for your products and can shield a company from outages that erode customer trust.
Here are the key factors from the twelve-factor app pattern methodology that play a role in designing for failure:
- Disposability (maximize robustness with fast startup and graceful shutdown) – Produce lean container images and strive for processes that can start and stop in a matter of seconds.
- Logs (treat logs as event streams) – If part of a system fails, troubleshooting is necessary. Ensure that material for forensics exists; see the logging sketch after this list.
- Dev/prod parity – Keep development, staging, and production as similar as possible.
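To illustrate the Logs factor, the sketch below writes each log record as a single JSON event to stdout, where the container runtime or a log router can pick it up; the logger and field names are illustrative.

```python
import json
import logging
import sys

class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (an event stream)."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Write to stdout instead of local files, so the platform owns log routing.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLineFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("payment accepted")
```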
AWS recommends that container hosts be part of a self-healing group. Ideally, container management systems are aware of different data centers and the microservices that span across them, mitigating possible events at the physical level.
Containers offer an abstraction from operating system management. You can treat container instances as immutable servers: containers behave identically whether they run on a developer’s laptop or on a fleet of virtual machines in the cloud.
One very useful container pattern for hardening an application’s resiliency is the circuit breaker. With circuit breakers such as Resilience4j or Hystrix, an application container is proxied by a container in charge of monitoring connection attempts from the application container. As long as connections succeed, the circuit breaker remains in a closed state, letting communication happen. When connections start failing, the circuit breaker logic triggers: if a pre-defined threshold for the failure/success ratio is breached, the breaker enters an open state that prevents further connections. This mechanism offers a predictable and clean breaking point, a departure from partially failing situations that can render recovery difficult. The application container can then move on and switch to a backup service or enter a degraded state.
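The following is a minimal, library-agnostic sketch of that logic in Python; it uses a simple consecutive-failure count and cool-down period instead of a failure/success ratio, and the class name and thresholds are illustrative.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open the circuit after repeated failures,
    then allow a trial call once a cool-down period has elapsed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # open: fail fast, use the degraded path
            self.opened_at = None          # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0                  # success closes the circuit again
        return result
```

In practice, libraries such as Resilience4j, or a service mesh as described next, provide this behavior with richer state tracking and metrics.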
Another useful container pattern for an application’s resilience is the service mesh, which forms a network of microservices communicating with each other. Tools such as AWS App Mesh, Istio, Linkerd, and Cilium Service Mesh are available to manage and monitor such meshes. Service meshes rely on sidecars: a separate process installed along with the service in the same container set. The important feature of the sidecar is that all communication to and from the service is routed through the sidecar process, and this redirection is completely transparent to the service. Service meshes offer several resilience patterns that can be activated by rules in the sidecar, including timeouts, retries, and circuit breakers.
Modern container management services allow developers to retrieve
near real-time, event-driven updates on the state of containers. To start collecting information about the status of containerized microservices, it is essential to use tools like Fluent Bit to gather and route container logs. Sending these logs to the appropriate destination becomes as easy as specifying it in a key/value manner, and you can then define appropriate metrics and alarms in your observability solution. Besides collecting data through the container’s underlying host, you can also collect telemetry and troubleshooting material by using a linked logging container. This pattern is generically referred to as a sidecar. More specifically, when that container standardizes and normalizes the output, the pattern is known as an adapter.
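As an illustration of the adapter variant, the sketch below reads raw application log lines and re-emits them as normalized JSON events for a log shipper such as Fluent Bit to forward; the service name and fields are placeholders.

```python
import json
import sys

def normalize(raw_line):
    """Wrap an arbitrary application log line in a normalized JSON envelope."""
    return {
        "service": "orders",        # placeholder service name
        "severity": "INFO",         # a real adapter would parse this from the line
        "message": raw_line.rstrip("\n"),
    }

# The adapter container consumes the application's log stream (stdin here)
# and emits one JSON event per line for the log router to pick up.
for line in sys.stdin:
    print(json.dumps(normalize(line)), flush=True)
```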
Monitoring helps you track the health of containerized applications. Solutions like Prometheus, as well as Amazon Managed Service for Prometheus, collect and store the metrics that your services expose.
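As a sketch of how a containerized service can expose metrics for Prometheus (or Amazon Managed Service for Prometheus via a collection agent) to scrape, the example below uses the prometheus_client library; the metric names and port are illustrative.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Total requests handled by the orders service")
LATENCY = Histogram("orders_request_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request():
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the Prometheus scraper
    while True:
        handle_request()
```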
Transaction tracing can help you debug and analyze distributed applications. Combined with log collection and
metric capture, tools like AWS X-Ray let you follow individual requests as they travel through your microservices.
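A minimal sketch with the AWS X-Ray SDK for Python is shown below; it assumes an X-Ray daemon or agent is reachable from the container, and the service name, segment names, and charge_card function are placeholders.

```python
from aws_xray_sdk.core import xray_recorder, patch_all

xray_recorder.configure(service="orders")  # illustrative service name
patch_all()  # auto-instrument supported libraries such as requests and boto3

def charge_card():
    pass  # placeholder for a call to a downstream service

# Record one traced unit of work, with a subsegment for the downstream call.
with xray_recorder.in_segment("checkout"):
    with xray_recorder.in_subsegment("charge-card"):
        charge_card()
```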
A good overview of logs, traces, and metrics is essential to evaluate the application state, then investigate and mitigate incidents. Visualization tools like Amazon Managed Grafana can bring these signals together into dashboards and alerts.