Operations perspective: Health and availability of the AI landscape
Operating ML applications is new for many customers. In the new CAF-AI capability AI lifecycle management and MLOps, we have already introduced perspectives and guidance for tackling this. Beyond what has already been covered, what remains are considerations around incident management and performance. To dive deeper beyond this CAF-AI perspective, we recommend reviewing the MLOps Maturity Framework.
Foundational Capability | Explanation |
---|---|
Incident and Problem Management | Identify and manage unforeseen AI behavior. |
Performance and Capacity | Monitor and handle AI workload performance. |
Observability | This capability is not enriched for AI, refer to the AWS CAF. |
Event Management (AIOps) | This capability is not enriched for AI, refer to the AWS CAF. |
Change and Release Management | This capability is not enriched for AI, refer to the AWS CAF. |
Configuration Management | This capability is not enriched for AI, refer to the AWS CAF. |
Patch Management | This capability is not enriched for AI, refer to the AWS CAF. |
Availability and Continuity Management | This capability is not enriched for AI, refer to the AWS CAF. |
Incident and problem management
Identify and manage unforeseen AI behavior.
AI systems are often used in situations where the expertise of a single person is not enough to grasp or solve a problem. This nature of AI systems makes it hard to understand the general behavior of a system and its edge cases, and difficult to foresee performance that may degrade over time. Therefore, practitioners look at AI systems through proxies and simplified statistics.

When adopting AI, establish observing and monitoring practices that acknowledge that AI systems get validated but never verified, and that they therefore need constant, ongoing control and observation. One example is training-serving skew, where the performance of the AI system developed in the lab differs significantly from what is seen in production.

When needed, allow your customers and users to flag results as unfavorable or wrong, and open up pathways for them to engage directly to report incidents. From the beginning, prepare for a change in data, and hence in performance, through drift, training-serving skew, black swan events, and unobserved data points. Where the system allows for it, provide ways to fail gracefully and to report, react to, and learn from such incidents. Anticipate that customers and users for whom the system does not work well will often not be represented in the data. Finally, expect such incidents to occur and be suspicious if none are reported. Expect this challenge to grow with the size and complexity of your AI system; for example, foundation models are significantly harder to correct and monitor than simple decision trees.
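To make drift and training-serving skew observable in practice, teams often compare the distribution a feature had at training time against what the model currently sees in production. The following is a minimal sketch of that idea, assuming numeric features and using scipy's two-sample Kolmogorov-Smirnov test; the threshold and data are illustrative placeholders, not CAF-AI prescriptions.

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative p-value threshold below which a feature counts as drifted.
DRIFT_P_VALUE = 0.01

def feature_has_drifted(reference: np.ndarray, production: np.ndarray) -> bool:
    """Compare a training-time reference sample against a recent
    production window; a small p-value suggests drift or skew."""
    result = ks_2samp(reference, production)
    return result.pvalue < DRIFT_P_VALUE

# Usage: a synthetic production window whose mean has shifted.
rng = np.random.default_rng(seed=42)
train_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_window = rng.normal(loc=0.4, scale=1.0, size=5_000)

if feature_has_drifted(train_sample, prod_window):
    print("Feature distribution has drifted; raise an incident.")
```

A check like this only covers observed inputs; it will not surface the unobserved data points and unrepresented users mentioned above, which is why the direct reporting pathways remain necessary.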
Performance and capacity
Monitor and handle AI workload performance.
AI follows different development cycles than traditional software and comes with different performance and workload profiles. In the early stages of development, data is explored, and cost and performance require the capability to adapt to numerous and very different workloads, often dominated by experimentation and training workloads that require powerful machines, specialized hardware, and memory-efficient architectures. Use the cloud to enable this multitude of workloads, because it delivers the capability to react dynamically to these workload profiles, each of which occurs sporadically and only at certain points in the development lifecycle.
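As one hedged illustration of this elasticity, an experiment can provision specialized hardware only for the duration of a single run. The sketch below starts an ephemeral GPU training job through the SageMaker CreateTrainingJob API via boto3; the job name, container image, IAM role, S3 path, and instance type are placeholders you would replace, not recommendations.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Launch an ephemeral training job; the compute exists only for the
# duration of the run, matching the sporadic, bursty profile of
# early-stage experimentation.
sagemaker.create_training_job(
    TrainingJobName="exploratory-run-042",  # placeholder name
    AlgorithmSpecification={
        # Placeholder training container image.
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/MySageMakerRole",  # placeholder role
    ResourceConfig={
        "InstanceType": "ml.g5.xlarge",  # specialized hardware, on demand
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/experiments/"},  # placeholder
    StoppingCondition={"MaxRuntimeInSeconds": 3600},  # bound the cost of a run
)
```

Because the job is torn down when it finishes, you pay for the strong machines only while an experiment actually needs them.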
Over time, training and streamlined pre-processing take over and dominate the workload profile, becoming more consistent and predictable. Your speed of innovation will depend on your ability to adapt to this new profile and to move quickly and continuously between the two while keeping clear lines between development and production. Make sure that model artifacts, and the data that has been fueling these streamlined workloads, remain available for potential fallbacks.

Once a model moves into a deployed and operationalized stage, make sure that inference is optimized for non-functional requirements (such as latency or throughput) and that cost, performance, and capacity monitoring are in place. In the AI lifecycle management capability, we introduced the MLOps maturity model; refer to it for deeper operations insights. Over time, multiple types of workload profiles will mix and mingle.
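To make the performance-monitoring half of this concrete, the sketch below wraps a prediction call, measures wall-clock latency, and publishes it to Amazon CloudWatch with boto3's put_metric_data; the namespace, metric name, and the predict callable are illustrative assumptions, and a production setup would typically batch metric publishing rather than call it once per request.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def timed_inference(predict, payload):
    """Run a prediction, measure its latency, and publish the value so
    dashboards and alarms for performance and capacity can build on it."""
    start = time.perf_counter()
    result = predict(payload)
    latency_ms = (time.perf_counter() - start) * 1000.0
    cloudwatch.put_metric_data(
        Namespace="MyAIWorkload",  # illustrative namespace
        MetricData=[{
            "MetricName": "InferenceLatencyMs",
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )
    return result
```

The same metric can then back throughput and capacity alarms, closing the loop between the non-functional requirements named above and day-to-day operations.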