Management and Governance - Machine Learning Best Practices for Public Sector Organizations

Management and Governance

Public sector organizations face increased scrutiny to ensure that funds are properly utilized to serve mission needs. As such, ML workloads need to provide increased visibility for monitoring and auditing. Changes need to be tracked in several places, including data sources, data models, data transfer and transformation processes, and deployment and inference endpoints. A clear separation must be established between development and production workloads, while enforcing separation of duties with appropriate approval mechanisms. In addition, any underlying infrastructure, software, and licenses need to be maintained and managed. This section highlights several AWS services and associated best practices that address these management and governance challenges.

Enable governance and control

AWS Cloud provides several services that enable governance and control. These include:

  • AWS Control Tower. Setup and governance can be complex and time-consuming for organizations with multiple AWS accounts and teams. AWS Control Tower creates a landing zone that consists of a predefined structure of accounts using AWS Organizations, the ability to create accounts using Service Catalog, enforcement of compliance rules called guardrails using Service Control Policies, and detection of policy violations using AWS Config. (See the Cross-account deployments in an AWS Control Tower environment blog post for details on how to set up Control Tower.)

  • AWS License Manager. Public sector organizations may have existing licensed software used for various ML tasks, such as ETL. AWS License Manager can track software obtained from the AWS Marketplace and maintain a consolidated view of all licenses. AWS License Manager also enables sharing of licenses with other accounts in the organization.

  • Resource Tagging. Organizing AI/ML resources can be done using tags. Each tag is a simple label consisting of a customer-defined key and an optional value that can make it easier to manage, search for, and filter resources by purpose, owner, environment, or other criteria. Automated tools such as AWS Resource Groups and the Resource Groups Tagging API enable programmatic control of tags, making it easier to manage, search, and filter tags and resources. To make the most effective use of tags, organizations should create business-relevant tag groupings to organize their resources along technical, business, and security dimensions.
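As a hedged sketch of the programmatic control mentioned above, the snippet below builds the `TagFilters` structure accepted by the Resource Groups Tagging API's `GetResources` operation. The tag keys (`team`, `environment`) are hypothetical examples, not AWS-defined keys; the boto3 call is shown in a comment because it requires AWS credentials.

```python
# Illustrative sketch: filtering AI/ML resources by tag with the
# Resource Groups Tagging API. Tag keys/values here are examples only.

def build_tag_filters(tags: dict) -> list:
    """Translate a {key: [values]} mapping into the TagFilters shape
    expected by the Tagging API's GetResources operation."""
    return [{"Key": key, "Values": values} for key, values in tags.items()]

filters = build_tag_filters({
    "team": ["data-science"],
    "environment": ["dev"],
})

# With credentials configured, the filters could be passed to boto3:
#   import boto3
#   client = boto3.client("resourcegroupstaggingapi")
#   response = client.get_resources(TagFilters=filters)
```

Centralizing the filter-building logic like this makes it easy to reuse the same business-relevant tag groupings across automation scripts.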

Provision ML resources that meet policies

AWS Cloud provides several services that enable consistent and repeatable provisioning of ML resources per organization policies.

  • AWS CloudFormation. A successful AI/ML solution may involve resources from multiple services. Deploying and managing these resources one by one can be time-consuming and inconvenient. AWS CloudFormation provides a mechanism to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles, by treating infrastructure as code.
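To make the infrastructure-as-code idea concrete, the following sketch expresses a minimal CloudFormation template as a Python dictionary and serializes it to JSON. The logical ID `MLNotebook` and the parameter name are illustrative choices, not required names; the `create_stack` call is shown in a comment because it requires AWS credentials.

```python
import json

# Minimal illustrative CloudFormation template modeling one SageMaker
# notebook instance. The logical ID "MLNotebook" is an arbitrary example;
# the execution role ARN is supplied via a template parameter.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "NotebookRoleArn": {"Type": "String"},
    },
    "Resources": {
        "MLNotebook": {
            "Type": "AWS::SageMaker::NotebookInstance",
            "Properties": {
                "InstanceType": "ml.t3.medium",
                "RoleArn": {"Ref": "NotebookRoleArn"},
            },
        }
    },
}

template_body = json.dumps(template, indent=2)

# The serialized body could then be deployed, for example:
#   boto3.client("cloudformation").create_stack(
#       StackName="ml-notebook-stack",
#       TemplateBody=template_body,
#       Parameters=[{"ParameterKey": "NotebookRoleArn",
#                    "ParameterValue": "<role-arn>"}])
```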

  • AWS Cloud Development Kit (AWS CDK). Many team members prefer to define infrastructure in their own programming language, as opposed to using JSON or YAML. The AWS CDK, an open-source software development framework, allows teams to define cloud infrastructure directly in supported programming languages (TypeScript, JavaScript, Python, Java, and C#). The CDK defines reusable cloud components known as constructs, which are composed into stacks and apps. Constructs are synthesized into CloudFormation templates at deployment time.

  • Service Catalog. Deploying and configuring ML workspaces for different groups of users is a common challenge for public sector organizations. Service Catalog addresses this problem by enabling central management of commonly deployed IT services, helping organizations achieve consistent governance and meet compliance requirements. End users can quickly deploy only the approved IT services they need, within the constraints set by the organization. For example, Service Catalog can be used with Amazon SageMaker AI notebooks to give end users a template for quickly deploying and setting up their ML workspace. The following diagram shows how Service Catalog supports two separate workflows: one for cloud system administrators and one for the data scientists or developers who work with Amazon SageMaker AI.


Figure 5: Setting up ML workspace using Service Catalog

By leveraging Service Catalog, cloud administrators can define the right level of controls and enforce data encryption along with centrally mandated tags for any AWS service used by various groups. At the same time, data scientists achieve self-service and a better security posture simply by launching an Amazon SageMaker AI notebook instance through Service Catalog.
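As a hedged sketch of the self-service workflow, the snippet below shapes the request a data scientist's tooling might pass to Service Catalog's `provision_product` operation. The product and artifact IDs, the `NotebookInstanceType` parameter name, and the `owner` tag are all placeholders; real values come from the products the administrators publish.

```python
# Illustrative sketch of a Service Catalog self-service request.
# IDs, parameter names, and tag keys below are hypothetical examples.

def build_provision_request(product_id: str, artifact_id: str,
                            user: str, instance_type: str) -> dict:
    """Shape the keyword arguments for servicecatalog.provision_product."""
    return {
        "ProductId": product_id,
        "ProvisioningArtifactId": artifact_id,
        "ProvisionedProductName": f"ml-workspace-{user}",
        "ProvisioningParameters": [
            {"Key": "NotebookInstanceType", "Value": instance_type},
        ],
        # A centrally mandated tag enforced by the administrators.
        "Tags": [{"Key": "owner", "Value": user}],
    }

request = build_provision_request(
    "prod-example", "pa-example", "jdoe", "ml.t3.medium")

# With credentials configured:
#   boto3.client("servicecatalog").provision_product(**request)
```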

Operate environment with governance

AWS Cloud provides several services that enable the reliable operation of the ML environment.

  • Amazon CloudWatch is a monitoring and observability service used to monitor resources and applications running on AWS in real time. Amazon SageMaker AI has built-in Amazon CloudWatch monitoring and logging to manage production compute infrastructure, perform health checks, apply security patches, and conduct other routine maintenance. For a complete list of metrics that can be monitored, refer to the Monitor Amazon SageMaker AI with Amazon CloudWatch section of the SageMaker user guide.
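As an illustration, the snippet below assembles the parameters for a CloudWatch `get_metric_statistics` query against a SageMaker endpoint's `Invocations` metric (namespace `AWS/SageMaker`, with `EndpointName` and `VariantName` dimensions). The endpoint name is a placeholder, and the boto3 call is commented out because it requires AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Illustrative CloudWatch query for a SageMaker endpoint metric.
# "my-endpoint" is a placeholder endpoint name.
end = datetime(2024, 1, 1, tzinfo=timezone.utc)
params = {
    "Namespace": "AWS/SageMaker",
    "MetricName": "Invocations",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 300,          # 5-minute buckets
    "Statistics": ["Sum"],  # total invocations per bucket
}

# With credentials configured:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```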

  • Amazon EventBridge is a serverless event bus service that can monitor status change events in Amazon SageMaker AI. EventBridge enables automatic responses to events such as a training job status change or endpoint status change. Events from SageMaker are delivered to EventBridge in near real time. Simple rules can be written to indicate which events are of interest, and what automated actions to take when an event matches a rule.
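A simple rule of the kind described above can be expressed as an EventBridge event pattern. The sketch below matches SageMaker training jobs reaching a terminal state; the rule name in the comment is a placeholder.

```python
import json

# Example EventBridge event pattern matching SageMaker training jobs
# that complete or fail, so an automated response can be triggered.
pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
    "detail": {"TrainingJobStatus": ["Completed", "Failed"]},
}
event_pattern = json.dumps(pattern)

# With credentials configured, the pattern could back a rule:
#   boto3.client("events").put_rule(
#       Name="training-job-watch", EventPattern=event_pattern)
```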

  • SageMaker Model Monitor can be used to continuously monitor the quality of ML models in production. Model Monitor can notify team members when there are deviations in model quality. Early and proactive detection of these deviations enables corrective actions, such as retraining models, auditing upstream systems, or fixing quality issues, without having to monitor models manually or build additional tooling. Model Monitor provides several types of monitoring, including data quality drift, model quality drift, bias drift, and feature attribution drift. For a sample notebook with the full end-to-end workflow, see Introduction to Amazon SageMaker AI Model Monitor, or see Monitoring in-production ML models at large scale using Amazon SageMaker AI Model Monitor, which outlines how to monitor ML models in production at scale.
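Model Monitor computes baseline comparisons automatically; the snippet below is only a conceptual illustration of what a data quality drift check means (a feature statistic shifting past a tolerance), not the service's actual implementation. The 10% tolerance is an arbitrary example.

```python
# Conceptual illustration of a data-quality drift check: flag a feature
# whose mean has shifted from its baseline by more than a tolerance.
# Model Monitor performs this kind of comparison automatically.

def drifted(baseline_mean: float, current_mean: float,
            tolerance: float = 0.1) -> bool:
    """Return True when the relative change in the mean exceeds tolerance."""
    if baseline_mean == 0:
        return abs(current_mean) > tolerance
    return abs(current_mean - baseline_mean) / abs(baseline_mean) > tolerance

print(drifted(50.0, 51.0))  # 2% shift  → False (within tolerance)
print(drifted(50.0, 70.0))  # 40% shift → True (alert)
```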

  • AWS CloudTrail. Amazon SageMaker AI is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or AWS service in SageMaker. CloudTrail captures all API calls for SageMaker, including actions from the SageMaker console and code calls to the SageMaker API operations. CloudTrail events, including those for SageMaker, can be continuously delivered to an Amazon S3 bucket. Every event or log entry contains information about who generated the request.
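For auditing, those log entries can be summarized programmatically. The sketch below extracts the "who did what, when" fields from a trimmed, hypothetical CloudTrail record in the documented JSON shape; it is not real log data.

```python
# Hedged sketch: pulling the "who did what" fields out of a CloudTrail
# log entry. sample_event is a trimmed, hypothetical record.
sample_event = {
    "eventSource": "sagemaker.amazonaws.com",
    "eventName": "CreateTrainingJob",
    "eventTime": "2024-01-01T12:00:00Z",
    "userIdentity": {"type": "IAMUser", "userName": "data-scientist-1"},
}

def summarize(event: dict) -> str:
    """One-line audit summary: caller, API action, and timestamp."""
    who = event["userIdentity"].get("userName", "unknown")
    return f'{who} called {event["eventName"]} at {event["eventTime"]}'

print(summarize(sample_event))
# → data-scientist-1 called CreateTrainingJob at 2024-01-01T12:00:00Z
```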