Design for failure
“Everything fails all the time.” – Werner Vogels
This adage is no less true in the container world than it is for the cloud. Achieving high availability is a top priority for workloads, but it remains an arduous undertaking for development teams.
Modern applications running in containers should not be tasked with
managing the underlying layers, from physical infrastructure like
electricity sources or environmental controls all the way to the
stability of the underlying operating system. If a set of containers
fails while tasked with delivering a service, these containers
should be re-instantiated automatically and with no delay.
Similarly, as microservices interact with each other over the
network more than they do locally and synchronously, connections
need to be monitored and managed. Latency and timeouts should be
assumed and gracefully handled. More generally, microservices need
to apply the same error retries and exponential backoff with jitter that are recommended for calls to any remote dependency.
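A minimal sketch of this retry behavior in Python might look like the following; the function name, attempt count, and delay values are illustrative rather than prescriptive.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Invoke `operation`, retrying failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential backoff, capped at max_delay, with full jitter so that
            # many instances retrying at once do not synchronize into a retry storm.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```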
Designing for failure also means testing the design and watching
services cope with deteriorating conditions. Not all technology
departments need to apply this principle to the extent that Netflix does with its chaos engineering practices, but deliberately injecting failures and observing how services recover is a worthwhile exercise.
Designing for failure yields a self-healing infrastructure that acts with the maturity expected of modern workloads. Preventing emergency calls guarantees a base level of satisfaction for the service-owning team and removes a level of stress that can otherwise grow into accelerated attrition. Designing for failure also delivers greater uptime for your products and can shield a company from outages that erode customer trust.
Here are the key factors from the twelve-factor app pattern methodology that play a role in designing for failure:
- Disposability (maximize robustness with fast startup and graceful shutdown) – Produce lean container images and strive for processes that can start and stop in a matter of seconds.
- Logs (treat logs as event streams) – If part of a system fails, troubleshooting is necessary. Ensure that material for forensics exists; see the logging sketch after this list.
- Dev/prod parity – Keep development, staging, and production as similar as possible.
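To illustrate the Logs factor, the sketch below writes each log record as a single JSON event to stdout, where the container runtime or a log router can pick it up; the logger and field names are illustrative.

```python
import json
import logging
import sys

class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (an event stream)."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Write to stdout instead of local files, so the platform owns log routing.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLineFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("payment accepted")
```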
AWS recommends that container hosts be part of a self-healing group. Ideally, container management systems are aware of different data centers and the microservices that span across them, mitigating possible events at the physical level.
Containers offer an abstraction from operating system management. You can treat container instances as immutable servers: containers behave identically whether they run on a developer’s laptop or on a fleet of virtual machines in the cloud.
One very useful container pattern for hardening an application’s resiliency is the circuit breaker. With circuit breakers such as Resilience4j or Hystrix, an application container is proxied by a container in charge of monitoring connection attempts from the application container. As long as connections succeed, the circuit breaker remains in a closed state, letting communication happen. When connections start failing, the circuit breaker logic triggers: if a pre-defined threshold for the failure/success ratio is breached, the breaker enters an open state that prevents further connections. This mechanism offers a predictable and clean breaking point, a departure from partially failing situations that can render recovery difficult. The application container can then move on and switch to a backup service or enter a degraded state.
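The following is a minimal, library-agnostic sketch of that logic in Python; it uses a simple consecutive-failure count and cool-down period instead of a failure/success ratio, and the class name and thresholds are illustrative.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open the circuit after repeated failures,
    then allow a trial call once a cool-down period has elapsed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # open: fail fast, use the degraded path
            self.opened_at = None          # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0                  # success closes the circuit again
        return result
```

In practice, libraries such as Resilience4j, or a service mesh as described next, provide this behavior with richer state tracking and metrics.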
Another useful container pattern for an application’s resilience is the service mesh, which forms a network of microservices communicating with each other. Tools such as AWS App Mesh, Istio, Linkerd, and Cilium Service Mesh are available to manage and monitor such meshes. Service meshes rely on sidecars: a separate process installed along with the service in the same container set. The important feature of the sidecar is that all communication to and from the service is routed through the sidecar process, and this redirection is completely transparent to the service. Service meshes offer several resilience patterns that can be activated by rules in the sidecar, including timeouts, retries, and circuit breakers.
Modern container management services allow developers to retrieve
near real-time, event-driven updates on the state of containers. To start collecting information about the status of containerized microservices, it is essential to use tools like Fluent Bit to gather and route container logs. Sending these logs to the appropriate destination becomes as easy as specifying it in a key/value manner, and you can then define appropriate metrics and alarms in your observability solution. Besides collecting data through the container’s underlying host, you can also collect telemetry and troubleshooting material by using a linked logging container. This pattern is generically referred to as a sidecar. More specifically, when that container standardizes and normalizes the output, the pattern is known as an adapter.
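As an illustration of the adapter variant, the sketch below reads raw application log lines and re-emits them as normalized JSON events for a log shipper such as Fluent Bit to forward; the service name and fields are placeholders.

```python
import json
import sys

def normalize(raw_line):
    """Wrap an arbitrary application log line in a normalized JSON envelope."""
    return {
        "service": "orders",        # placeholder service name
        "severity": "INFO",         # a real adapter would parse this from the line
        "message": raw_line.rstrip("\n"),
    }

# The adapter container consumes the application's log stream (stdin here)
# and emits one JSON event per line for the log router to pick up.
for line in sys.stdin:
    print(json.dumps(normalize(line)), flush=True)
```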
Monitoring helps you track the health of containerized applications. Solutions like Prometheus, as well as Amazon Managed Service for Prometheus, collect and store the metrics that your services expose.
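As a sketch of how a containerized service can expose metrics for Prometheus (or Amazon Managed Service for Prometheus via a collection agent) to scrape, the example below uses the prometheus_client library; the metric names and port are illustrative.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Total requests handled by the orders service")
LATENCY = Histogram("orders_request_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request():
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the Prometheus scraper
    while True:
        handle_request()
```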
Transaction tracing can help you debug and analyze distributed applications. Combined with log collection and
metric capture, tools like AWS X-Ray let you follow individual requests as they travel through your microservices.
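A minimal sketch with the AWS X-Ray SDK for Python is shown below; it assumes an X-Ray daemon or agent is reachable from the container, and the service name, segment names, and charge_card function are placeholders.

```python
from aws_xray_sdk.core import xray_recorder, patch_all

xray_recorder.configure(service="orders")  # illustrative service name
patch_all()  # auto-instrument supported libraries such as requests and boto3

def charge_card():
    pass  # placeholder for a call to a downstream service

# Record one traced unit of work, with a subsegment for the downstream call.
with xray_recorder.in_segment("checkout"):
    with xray_recorder.in_subsegment("charge-card"):
        charge_card()
```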
A good overview of logs, traces, and metrics is essential to evaluate the application state, then investigate and mitigate incidents. Visualization tools like Amazon Managed Grafana can bring these signals together into dashboards and alerts.