Operations perspective: health and availability
The operations perspective focuses on ensuring that cloud services are delivered at a level agreed upon with your business stakeholders. Automating and optimizing operations lets you scale effectively while improving the reliability of your workloads. This perspective comprises nine capabilities shown in the following figure. Common stakeholders include infrastructure and operations leaders, site reliability engineers, and information technology service managers.
AWS CAF Operations perspective capabilities
- Observability – Gain actionable insights from your infrastructure and application data. When you are operating at cloud speed and scale, you need to be able to spot problems as they arise, ideally before they disrupt the customer experience. Develop the telemetry (logs, metrics, and traces) necessary to understand the internal state and health of your workloads. Monitor application endpoints, assess the impact on end users, and generate alerts when measurements exceed thresholds. Use synthetic monitoring to create canaries (configurable scripts that run on a schedule) to monitor your endpoints and APIs. Implement traces to track requests as they travel through the entire application and identify bottlenecks or performance issues. Gain insights into resources, servers, databases, and networks using metrics and logs. Set up real-time analysis of time series data to understand causes of performance impacts. Centralize data in a single dashboard, giving you a unified view of critical information about your workloads and their performance.
- Event management (AIOps) – Detect events, assess their potential impact, and determine the appropriate control action. Being able to filter the noise, focus on priority events, predict impending resource exhaustion, automatically generate alerts and incidents, and identify likely causes and remediation actions will help you improve incident detection and response times. Establish an event store pattern and leverage machine learning (AIOps) to automate event correlation, anomaly detection, and causality determination. Integrate with cloud services and third-party tools, including your incident management system and process. Automate responses to events to reduce errors caused by manual processes and ensure prompt and consistent responses.
- Incident and problem management – Quickly restore service operations and minimize adverse business impact. With cloud adoption, processes for response to service issues and application health issues can be highly automated, resulting in greater service uptime. As you move to a more distributed operating model, streamlining interactions between relevant teams, tools, and processes will help you accelerate the resolution of critical and/or complex incidents. Define escalation paths in your runbooks, including what triggers escalation and the procedures for escalation. Practice incident response gamedays and incorporate lessons learned in your runbooks. Identify incident patterns to determine problems and corrective measures. Leverage chatbots and collaboration tools to connect your operations teams, tools, and workflows. Leverage blameless post-incident analyses to identify contributing factors of incidents and develop corresponding action plans.
- Change and release management – Introduce and modify workloads while minimizing the risk to production environments. Traditional release management is a complex process that is slow to deploy and difficult to roll back. Cloud adoption provides the opportunity to leverage CI/CD techniques to rapidly manage releases and rollbacks. Establish change processes that allow for automated approval workflows that align with the agility of the cloud. Use deployment management systems to track and implement changes. Use frequent, small, and reversible changes to reduce the scope of a change. Test changes and validate the results at all lifecycle stages to minimize the risk and impact of failed deployments. Automate rollback to the previous known good state when outcomes are not achieved, to minimize recovery time and reduce errors caused by manual processes.
- Performance and capacity management – Monitor workload performance and ensure that capacity meets current and future demands. Although the capacity of the cloud is virtually unlimited, service quotas, capacity reservations, and resource constraints restrict the actual capacity of your workloads. Such capacity constraints need to be understood and effectively managed. Identify key stakeholders and agree on the objectives, scope, goals, and metrics. Collect and process performance data and regularly review and report performance against targets. Periodically evaluate new technologies to improve performance and recommend changes to the goals and metrics as appropriate. Monitor the utilization of your workloads, create baselines for future comparison, and identify thresholds to expand capacity as required. Analyze demand over time to ensure capacity matches seasonal trends and fluctuating operating conditions.
- Configuration management – Maintain an accurate and complete record of all your cloud workloads, their relationships, and configuration changes over time. Unless effectively managed, the dynamic and virtual nature of cloud resource provisioning can lead to configuration drift. Define and enforce a tagging schema that overlays your business attributes onto your cloud usage, and leverage tags to organize your resources along technical, business, and security dimensions. Specify mandatory tags and enforce compliance through policy. Leverage infrastructure as code (IaC) and configuration management tools for resource provisioning and lifecycle management. Establish configuration baselines and maintain them through version control.
- Patch management – Systematically distribute and apply software updates. Software updates address emerging security vulnerabilities, fix bugs, and introduce new features. A systematic approach to patch management will ensure that you benefit from the latest updates while minimizing risks to production environments. Apply important updates during your specified maintenance window and critical security updates as soon as possible. Notify users in advance with the details of the upcoming updates and allow them to defer patches when other mitigating controls are available. Update your machine images and test patches before rolling out to production. To ensure continued availability during patching, consider separate maintenance windows for each Availability Zone (AZ) and environment. Regularly review patching compliance and alert non-compliant teams to apply required updates.
- Availability and continuity management – Ensure availability of business-critical information, applications, and services. Building cloud-enabled backup solutions requires careful consideration of existing technology investments, recovery objectives, and available resources. Timely restoration after disasters and security events will help you maintain system availability and business continuity. Back up your data and documentation according to a defined schedule. Develop a disaster recovery plan as a subset of your business continuity plan. Identify the threat, risk, impact, and cost of different disaster scenarios for each workload and specify Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) accordingly. Implement your chosen disaster recovery strategy leveraging multi-AZ or multi-Region architecture. Consider leveraging chaos engineering to improve resiliency and performance with controlled experiments. Review and test your plans regularly and adjust your approach based on lessons learned.
- Application management – Investigate and remediate application issues in a single pane of glass. Aggregating application data into a single management console will simplify operational oversight and accelerate remediation of application issues by reducing the need to switch context between different management tools. Integrate with other operational and management systems, such as application portfolio management and CMDB, automate the discovery of your application components and resources, and consolidate application data into a single management console. Include software components and infrastructure resources, and delineate different environments, such as development, staging, and production. To remediate operational issues more quickly and consistently, consider automating your runbooks.
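
As one illustration of the guidance under Configuration management to specify mandatory tags and enforce compliance through policy, the following is a minimal Python sketch of a tag compliance check. It is not an AWS API and not a method prescribed by this framework: the `Resource` record, the `MANDATORY_TAGS` keys, and the function names are all hypothetical, chosen only to show the idea of flagging resources whose required tags are absent or blank.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Sequence

# Hypothetical mandatory tag keys (an assumption for this sketch);
# a real schema would come from your own tagging policy.
MANDATORY_TAGS = ("owner", "cost-center", "environment")


@dataclass
class Resource:
    """Minimal stand-in for a cloud resource record (not an AWS type)."""
    resource_id: str
    tags: Dict[str, str] = field(default_factory=dict)


def missing_tags(resource: Resource,
                 required: Sequence[str] = MANDATORY_TAGS) -> List[str]:
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in required
            if not resource.tags.get(key, "").strip()]


def compliance_report(resources: Iterable[Resource],
                      required: Sequence[str] = MANDATORY_TAGS) -> Dict[str, List[str]]:
    """Map each non-compliant resource ID to the tag keys it is missing."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource, required)
        if missing:
            report[resource.resource_id] = missing
    return report


if __name__ == "__main__":
    # Made-up inventory used only to exercise the check.
    inventory = [
        Resource("web-1", {"owner": "ops", "cost-center": "1001",
                           "environment": "production"}),
        Resource("db-1", {"owner": "data", "environment": "staging"}),
    ]
    for resource_id, missing in compliance_report(inventory).items():
        print(f"{resource_id}: missing {', '.join(missing)}")
```

In practice, the inventory would be populated from your configuration record or resource discovery tooling, and the report could feed the alerts and policy enforcement described above. Treating blank values as missing is a design choice of this sketch, not a requirement of any tagging policy.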