MLSEC-04: Secure data and modeling environment
Secure any system or environment that hosts data or enables model development. Store training data in secured storage and repositories. Run data preparation in a secure cloud environment. Tightly control access to the destination compute instances as data moves from the data repositories to those instances. Encrypt data at rest in the storage infrastructure and in transit to the compute infrastructure.
Implementation plan
- Build a secure analysis environment – During the data preparation and feature engineering phases, there are multiple options for secure data exploration on AWS. Data can be explored in an Amazon SageMaker managed notebook environment or in an Amazon EMR notebook. You can also use managed services such as Amazon Athena and AWS Glue, or a combination of the two, to explore the data without moving it out of your data lake. Use an Amazon SageMaker Jupyter notebook instance to explore, visualize, and engineer features on a small subset of data, then scale up the feature engineering using a managed ETL service such as Amazon EMR or AWS Glue.
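As an illustration, a minimal boto3 sketch of querying data in place with Athena might look like the following; the database name, table, and S3 output location are placeholders, not values from this guide.

```python
import boto3

athena = boto3.client("athena")

# Explore the data where it lives in the data lake instead of copying it out.
# "ml_data_lake", "training_events", and the output bucket are hypothetical.
response = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) FROM training_events GROUP BY label",
    QueryExecutionContext={"Database": "ml_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```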
- Create dedicated AWS IAM and AWS KMS resources – This approach limits the scope of impact of credentials and keys. Create a private S3 bucket and enable version control for the data and intellectual property (IP). In AWS, a centralized data lake is implemented using AWS Lake Formation on Amazon S3. Securing and monitoring a data lake on Amazon S3 is achieved using a combination of services and capabilities to encrypt data in transit and at rest. Monitor access using granular AWS IAM policies, S3 bucket policies, S3 access logs, Amazon CloudWatch, and AWS CloudTrail.
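A minimal boto3 sketch of creating such a bucket follows; the bucket name and KMS key ARN are hypothetical, and outside us-east-1 the create_bucket call also needs a CreateBucketConfiguration with your Region.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names; substitute your own bucket and dedicated KMS key.
bucket = "example-ml-training-data"
kms_key_id = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

s3.create_bucket(Bucket=bucket)

# Enable version control for the data and IP stored in the bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Encrypt objects at rest with the dedicated KMS key by default.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                }
            }
        ]
    },
)

# Keep the bucket private by blocking all public access.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```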
- Use Secrets Manager and Parameter Store to protect credentials – Secrets Manager enables you to replace hard-coded secrets in your code, such as credentials, with an API call to decrypt and retrieve the secret programmatically. Parameter Store was designed for wider use cases than secrets or passwords; it allows you to store application configuration variables such as AMI IDs or license keys. With AWS Secrets Manager and Parameter Store, you can store your credentials and then grant your SageMaker IAM role permission to access Secrets Manager from your notebook.
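For example, a notebook whose IAM role has the right permissions might retrieve credentials and configuration like this; the secret and parameter names are hypothetical.

```python
import json

import boto3

secrets = boto3.client("secretsmanager")
ssm = boto3.client("ssm")

# Retrieve database credentials from Secrets Manager instead of
# hard-coding them in the notebook. The secret ID is a placeholder.
secret = secrets.get_secret_value(SecretId="ml/feature-db/credentials")
credentials = json.loads(secret["SecretString"])

# Parameter Store suits non-secret configuration such as AMI IDs;
# WithDecryption=True also handles SecureString parameters.
param = ssm.get_parameter(Name="/ml/training/ami-id", WithDecryption=True)
ami_id = param["Parameter"]["Value"]
```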
- Automate managing configuration – Use lifecycle configuration scripts to manage Jupyter notebook instances. The scripts run when the notebook instance is first created, or every time it starts. They enable you to install custom packages, preload datasets, and set up source code repositories. Lifecycle configurations can be changed and reused across multiple notebook instances; make a change once and apply the updated configuration by restarting the managed notebook instances. This gives IT, operations, and security teams the flexibility and control they need, while supporting the needs of your developers and data scientists. Use AWS CloudFormation for infrastructure as code, as well as AWS Service Catalog, to simplify configuration for end users.
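A minimal sketch of registering a lifecycle configuration with boto3 follows; the configuration name and the pinned package version are illustrative, not prescribed by this guide.

```python
import base64

import boto3

sagemaker = boto3.client("sagemaker")

# OnStart scripts run every time the notebook instance starts;
# OnCreate scripts run only when it is first created.
on_start = """#!/bin/bash
set -e
# Example: install an approved, pinned package into the python3 environment.
sudo -u ec2-user -i <<'EOF'
source activate python3
pip install scikit-learn==1.3.2
source deactivate
EOF
"""

# Script content must be base64-encoded.
sagemaker.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-approved-packages",
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)
```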
- Create private, isolated network environments – Use Amazon Virtual Private Cloud (Amazon VPC) to enable connectivity to only the services and users you need. Deploy the Amazon SageMaker notebook instance in an Amazon VPC to enable network-level controls that limit communication to the hosted notebook. Additionally, network calls into and out of the notebook instance can be captured in VPC Flow Logs to enable additional visibility and control at the network level. By deploying the notebook in your VPC, you can also query data sources and systems accessible from within your VPC, such as relational databases in Amazon RDS or Amazon Redshift data warehouses. Using IAM, you can further restrict access to the web-based UI of the notebook instance so that it can only be accessed from within your VPC. Use AWS PrivateLink to privately connect your SageMaker notebook instance VPC with supported AWS services; this ensures secure communication between your notebook instance and Amazon S3 within the AWS network. Use AWS KMS to encrypt data on the EBS volumes attached to SageMaker notebook instances.
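A hedged sketch of launching a notebook instance inside a VPC with an encrypted volume; the instance name, subnet, security group, role, and key identifiers are all placeholders.

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_notebook_instance(
    NotebookInstanceName="secure-research-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    # Place the notebook in your VPC so network-level controls apply.
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    # Encrypt the attached EBS volume with your KMS key.
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    # Route all traffic through the VPC (for example, via PrivateLink
    # endpoints) instead of the SageMaker-managed internet path.
    DirectInternetAccess="Disabled",
)
```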
- Restrict access – The Jupyter notebook server provides web-based access to the underlying operating system on an EC2 instance. This gives users the ability to install additional software packages or Jupyter kernels to customize their environment. By default, this access is granted with root (superuser) permissions on the operating system, giving the user total control of the underlying EC2 instance. Restrict this access to remove the user's ability to assume root permissions while still giving them control over their local user environment.
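On SageMaker notebook instances, this restriction maps to the RootAccess setting. A sketch, reusing the hypothetical instance name from the previous example (the instance must be stopped before it can be updated):

```python
import boto3

sagemaker = boto3.client("sagemaker")

name = "secure-research-notebook"  # hypothetical instance name

# The instance must be fully stopped before RootAccess can be changed.
sagemaker.stop_notebook_instance(NotebookInstanceName=name)
sagemaker.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)

# Users keep control of their local environment but cannot assume root.
sagemaker.update_notebook_instance(
    NotebookInstanceName=name,
    RootAccess="Disabled",
)
```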
- Secure ML algorithms – Amazon SageMaker uses container technology to train and host algorithms and models. When creating your own containers, publish them to a private container registry hosted on Amazon Elastic Container Registry (Amazon ECR). Encrypt containers hosted on Amazon ECR at rest using AWS KMS.
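A sketch of creating such a private repository with KMS encryption at rest; the repository name and key ARN are placeholders, and the scan-on-push and immutable-tag settings are additional hardening choices, not requirements from this guide.

```python
import boto3

ecr = boto3.client("ecr")

ecr.create_repository(
    repositoryName="ml/training-containers",
    # Prevent tags from being silently overwritten.
    imageTagMutability="IMMUTABLE",
    # Scan images on push for known vulnerabilities.
    imageScanningConfiguration={"scanOnPush": True},
    # Encrypt stored container images at rest with a customer managed key.
    encryptionConfiguration={
        "encryptionType": "KMS",
        "kmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
)
```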
- Enforce code best practices – Use secure Git repositories for storing code.
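One option on AWS is to associate a private Git repository with SageMaker so that Git credentials stay in Secrets Manager rather than in notebooks; the repository URL and ARNs below are placeholders.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# The Secrets Manager secret holds the Git credentials, so nothing
# is hard-coded in the notebook. All names and ARNs are hypothetical.
sagemaker.create_code_repository(
    CodeRepositoryName="ml-experiments",
    GitConfig={
        "RepositoryUrl": "https://git.example.com/ml/experiments.git",
        "Branch": "main",
        "SecretArn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:ml/git-credentials",
    },
)
```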
- Implement a package mirror for consuming approved packages – Evaluate the license terms to determine which ML packages are appropriate for your business across the phases of the ML lifecycle. Examples of ML Python packages include pandas, PyTorch, Keras, NumPy, and scikit-learn. Once you've determined the approved set and criteria, build a validation mechanism and automate it where possible. An automated mechanism can be as simple as a script that runs the download and installation, then checks package versions and dependencies. Only download packages from approved, private repositories, and validate the contents of downloaded packages so that you can import them safely and confirm their validity. Amazon SageMaker notebook instances come with multiple environments already installed; these environments contain Jupyter kernels and Python packages. You can also install your own environments that contain your choice of packages and kernels. SageMaker enables modifying package channel paths to point to a private repository. Where appropriate, use an internal repository as a proxy for public repositories to minimize network and time overhead.
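A minimal sketch of such a version check, assuming a hypothetical allowlist of approved, pinned versions produced by your license and security review:

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical allowlist of approved packages and pinned versions.
APPROVED = {
    "pandas": "2.1.4",
    "numpy": "1.26.3",
    "scikit-learn": "1.3.2",
}


def validate_environment(approved: dict) -> list:
    """Return a list of packages that are missing or off the approved version."""
    problems = []
    for name, expected in approved.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if installed != expected:
            problems.append(f"{name}: found {installed}, expected {expected}")
    return problems


if __name__ == "__main__":
    for problem in validate_environment(APPROVED):
        print(problem)
```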
Validate what is in the packages downloaded. This will enable importing safely and confirming the validity of packages. Amazon SageMaker notebook instances come with multiple environments already installed. These environments contain Jupyter kernels and Python packages. You can also install your own environments that contain your choice of packages and kernels. SageMaker enables modifying package channel paths to a private repository . Where appropriate, use an internal repository as a proxy for public repositories to minimize the network and time overhead.