OPS08-BP01 Analyze workload metrics - Operational Excellence Pillar

OPS08-BP01 Analyze workload metrics

After implementing application telemetry, regularly analyze the collected metrics. While latency, requests, errors, and capacity (or quotas) provide insights into system performance, it's vital to prioritize the review of business outcome metrics. This ensures you're making data-driven decisions aligned with your business objectives.

Desired outcome: Accurate insights into workload performance that drive data-informed decisions, ensuring alignment with business objectives.

Common anti-patterns:

  • Analyzing metrics in isolation without considering their impact on business outcomes.

  • Relying too heavily on technical metrics while sidelining business metrics.

  • Reviewing metrics infrequently, which misses opportunities for real-time decision-making.

Benefits of establishing this best practice:

  • Enhanced understanding of the correlation between technical performance and business outcomes.

  • Improved decision-making process informed by real-time data.

  • Proactive identification and mitigation of issues before they affect business outcomes.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Use tools such as Amazon CloudWatch to analyze your metrics. CloudWatch anomaly detection and Amazon DevOps Guru can detect anomalies, which is especially useful when suitable static thresholds are unknown or when a metric follows patterns (for example, daily or weekly cycles) that a fixed threshold cannot capture.
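As a minimal sketch of the anomaly detection approach described above, the following Python builds the arguments for a CloudWatch alarm that compares a metric against an anomaly detection band instead of a static threshold. The namespace `MyApp/Business`, the metric `OrdersPlaced`, the alarm naming scheme, and the band width are illustrative assumptions, not values from this guidance. The resulting dictionary is intended to be passed to the boto3 CloudWatch client's `put_metric_alarm` call.

```python
def anomaly_alarm_params(namespace, metric_name, band_width=2,
                         stat="Average", period=300):
    """Build keyword arguments for a CloudWatch anomaly detection alarm.

    The namespace and metric name are hypothetical examples. band_width is
    the number of standard deviations that defines the expected-value band.
    """
    return {
        "AlarmName": f"{metric_name}-anomaly",  # hypothetical naming scheme
        # Alarm when the metric leaves the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # Compare m1 against the band expression rather than a fixed number.
        "ThresholdMetricId": "band",
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace,
                               "MetricName": metric_name},
                    "Period": period,
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected)",
                "ReturnData": True,
            },
        ],
    }


if __name__ == "__main__":
    params = anomaly_alarm_params("MyApp/Business", "OrdersPlaced")
    # With AWS credentials configured, the alarm could be created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
    print(params["Metrics"][1]["Expression"])
```

Keeping the parameter construction separate from the API call makes the alarm definition easy to review and test without AWS access. A wider band reduces alarm noise at the cost of sensitivity.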

Implementation steps

  1. Analyze and review: Regularly review and interpret your workload metrics.

    1. Prioritize business outcome metrics over purely technical metrics.

    2. Understand the significance of spikes, drops, or patterns in your data.

  2. Utilize Amazon CloudWatch: Use Amazon CloudWatch for a centralized view and deep-dive analysis.

    1. Configure CloudWatch dashboards to visualize your metrics and compare them over time.

    2. Use percentile statistics in CloudWatch (for example, p50, p90, and p99) to see how a metric is distributed, which helps in defining SLAs and understanding outliers.

    3. Set up CloudWatch anomaly detection to identify unusual patterns without relying on static thresholds.

    4. Implement CloudWatch cross-account observability to monitor and troubleshoot applications that span multiple accounts within a Region.

    5. Use CloudWatch Metrics Insights to query and analyze metric data across accounts and Regions, identifying trends and anomalies.

    6. Apply CloudWatch metric math to transform, aggregate, or perform calculations on your metrics for deeper insights.

  3. Employ Amazon DevOps Guru: Use the machine learning-based anomaly detection in Amazon DevOps Guru to identify early signs of operational issues in your serverless applications, and remediate them before they impact your customers.

  4. Optimize based on insights: Make informed decisions based on your metric analysis to adjust and improve your workloads.
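The steps above can be sketched as a single CloudWatch query set that places a business outcome metric next to technical metrics over the same time window. This is an illustration under stated assumptions: the namespaces `MyApp/Business` and `MyApp/Api` and the metric names are hypothetical custom metrics, and the list is shaped for the `MetricDataQueries` parameter of the boto3 CloudWatch client's `get_metric_data` call.

```python
from datetime import datetime, timedelta, timezone


def metric_stat(query_id, namespace, metric_name, stat,
                period=300, return_data=True):
    """One MetricDataQueries entry that reads a single metric statistic."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {"Namespace": namespace, "MetricName": metric_name},
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": return_data,
    }


def build_queries():
    """Pair a business metric with technical metrics for comparison."""
    return [
        # Business outcome metric (hypothetical custom metric).
        metric_stat("orders", "MyApp/Business", "OrdersPlaced", "Sum"),
        # Percentile statistic: p99 exposes outliers that an average hides.
        metric_stat("p99", "MyApp/Api", "Latency", "p99"),
        # Inputs for the metric math expression; not returned themselves.
        metric_stat("errors", "MyApp/Api", "Errors", "Sum",
                    return_data=False),
        metric_stat("requests", "MyApp/Api", "Requests", "Sum",
                    return_data=False),
        # Metric math: error rate as a percentage of requests.
        {
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]


def time_window(hours=24, now=None):
    """Return (start, end) datetimes covering the last `hours` hours."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(hours=hours), end


if __name__ == "__main__":
    start, end = time_window(hours=24)
    queries = build_queries()
    # With AWS credentials configured, the data could be fetched with:
    #   import boto3
    #   boto3.client("cloudwatch").get_metric_data(
    #       MetricDataQueries=queries, StartTime=start, EndTime=end)
    print([q["Id"] for q in queries if q["ReturnData"]])
```

Returning the order count, p99 latency, and error rate from one request keeps their timestamps aligned, which makes it easier to see whether a latency or error spike coincides with a drop in the business metric.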

Level of effort for the Implementation Plan: Medium
