AWS Identity and Access Management for SageMaker HyperPod - Amazon SageMaker AI

AWS Identity and Access Management (IAM) is an AWS service that helps an administrator securely control access to AWS resources. IAM administrators control who can be authenticated (signed in) and authorized (have permissions) to use SageMaker HyperPod resources. IAM is an AWS service that you can use with no additional charge.

Important

Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see Provide permissions for tagging SageMaker AI resources.

AWS managed policies for Amazon SageMaker AI that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.
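As an illustration of the tagging requirement, the following Python sketch checks whether a custom policy that allows a SageMaker create action also allows sagemaker:AddTags. The helper names (allows, missing_tag_permission) and the sample policies are hypothetical, and sagemaker:CreateCluster is used only as an example create action. The check looks at Action patterns in Allow statements only; it ignores Deny statements, conditions, and resource scoping, so treat it as a quick sanity check rather than a full policy evaluation.

```python
import fnmatch


def _as_list(value):
    """IAM accepts either a single value or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def allows(policy, action):
    """Return True if any Allow statement has an Action pattern matching `action`."""
    for statement in _as_list(policy["Statement"]):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action", [])):
            # IAM action names are case-insensitive and support * and ? wildcards.
            if fnmatch.fnmatchcase(action.lower(), pattern.lower()):
                return True
    return False


def missing_tag_permission(policy, create_action="sagemaker:CreateCluster"):
    """True if the policy allows the create action but not sagemaker:AddTags."""
    return allows(policy, create_action) and not allows(policy, "sagemaker:AddTags")


# Hypothetical sample policies for illustration only.
create_only = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["sagemaker:CreateCluster"], "Resource": "*"}
    ],
}
create_and_tag = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sagemaker:CreateCluster", "sagemaker:AddTags"],
            "Resource": "*",
        }
    ],
}

print(missing_tag_permission(create_only))
print(missing_tag_permission(create_and_tag))
```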

This guide assumes two main types of SageMaker HyperPod users: cluster admin users and data scientist users.

  • Cluster admin users – Are responsible for creating and managing SageMaker HyperPod clusters. This includes configuring the HyperPod clusters and managing user access to them.

    • Create and configure SageMaker HyperPod clusters with Slurm or Amazon EKS.

    • Create and configure IAM roles for data scientist users and HyperPod cluster resources.

    • For SageMaker HyperPod orchestration with Amazon EKS, create and configure EKS access entries, role-based access control (RBAC), and Pod Identity to fulfill data science use cases.

  • Data scientist users – Focus on ML model training. They use the open-source orchestrator or the SageMaker HyperPod CLI to submit and manage training jobs.

    • Assume and use the IAM role provided by cluster admin users.

    • Interact with the open-source orchestrator CLIs supported by SageMaker HyperPod (Slurm or Kubernetes) or the SageMaker HyperPod CLI to check cluster capacity, connect to a cluster, and submit workloads.

Set up IAM roles for cluster admins by attaching the permissions or policies needed to operate SageMaker HyperPod clusters. Cluster admins should also create IAM roles for SageMaker HyperPod resources to assume, so that those resources can run and communicate with necessary AWS resources, such as Amazon S3, Amazon CloudWatch, and AWS Systems Manager (SSM). Finally, the AWS account admin or the cluster admins should grant scientists permissions to access the SageMaker HyperPod clusters and run ML workloads.

Depending on which orchestrator you choose, the permissions needed for the cluster admin and scientists may vary. You can also control the scope of permissions for various actions in the roles by using the condition keys that each service supports. Use the Service Authorization Reference for each service related to SageMaker HyperPod to add detailed scoping.
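To illustrate scoping with a condition key, the following Python sketch adds a Condition block to a policy statement. It uses the global condition key aws:RequestedRegion as an example. The helper name restrict_to_regions and the Region value are hypothetical, and the condition keys that a given action actually supports should be confirmed in the relevant Service Authorization Reference.

```python
import copy
import json


def restrict_to_regions(statement, regions):
    """Return a copy of an IAM statement that applies only in the given Regions."""
    scoped = copy.deepcopy(statement)
    condition = scoped.setdefault("Condition", {})
    # StringEquals with a list of values matches if any one value is equal.
    condition.setdefault("StringEquals", {})["aws:RequestedRegion"] = list(regions)
    return scoped


# Example statement using actions that appear in this topic.
statement = {
    "Effect": "Allow",
    "Action": ["sagemaker:CreateCluster", "sagemaker:ListClusters"],
    "Resource": "*",
}

# "us-west-2" is a placeholder Region.
scoped_statement = restrict_to_regions(statement, ["us-west-2"])
print(json.dumps(scoped_statement, indent=4))
```

The original statement is left unmodified, so the same base statement can be reused with different conditions for different roles.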

IAM users for cluster admin

Cluster administrators (admins) operate and configure SageMaker HyperPod clusters, performing the tasks in SageMaker HyperPod operation. The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters within your AWS account.

Slurm
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes"
            ],
            "Resource": "arn:aws:sagemaker:region:account-id:cluster/*"
        }
    ]
}
Amazon EKS
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "<execution-role-arn>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes",
                "eks:DescribeCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:AssociateAccessPolicy",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*"
        }
    ]
}
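The Slurm policy above uses the placeholders region and account-id in its cluster ARN. As a minimal sketch, the following Python function fills in those placeholders and returns the policy as a dictionary. The function name slurm_admin_policy and the sample Region and account ID are hypothetical.

```python
import json
import re


def slurm_admin_policy(region, account_id):
    """Build the Slurm cluster admin policy with a concrete cluster ARN pattern."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id must be a 12-digit string")
    cluster_arn = f"arn:aws:sagemaker:{region}:{account_id}:cluster/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # These actions are not scoped to an existing cluster.
                "Effect": "Allow",
                "Action": ["sagemaker:CreateCluster", "sagemaker:ListClusters"],
                "Resource": "*",
            },
            {
                # These actions are scoped to clusters in one Region and account.
                "Effect": "Allow",
                "Action": [
                    "sagemaker:DeleteCluster",
                    "sagemaker:DescribeCluster",
                    "sagemaker:DescribeClusterNode",
                    "sagemaker:ListClusterNodes",
                    "sagemaker:UpdateCluster",
                    "sagemaker:UpdateClusterSoftware",
                    "sagemaker:BatchDeleteClusterNodes",
                ],
                "Resource": cluster_arn,
            },
        ],
    }


# Placeholder Region and account ID for illustration.
admin_policy = slurm_admin_policy("us-west-2", "111122223333")
print(json.dumps(admin_policy, indent=4))
```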

To grant permissions to access the SageMaker AI console, use the sample policy provided at Permissions required to use the Amazon SageMaker AI console.

To grant permissions to access the AWS Systems Manager console, use the sample policy provided at Using the AWS Systems Manager console in the AWS Systems Manager User Guide.

You might also consider attaching the AmazonSageMakerFullAccess policy to the role; however, note that the AmazonSageMakerFullAccess policy grants permissions to all SageMaker API calls, features, and resources.

For guidance on IAM users in general, see IAM users in the AWS Identity and Access Management User Guide.

IAM users for scientists

Scientists log in to SageMaker HyperPod cluster nodes provisioned by cluster admins and run ML workloads on them. For scientists in your AWS account, you should grant the permission "ssm:StartSession" to run the SSM start-session command. The following is a policy example for IAM users.

Slurm

Add the following policy to grant SSM session permissions to connect to an SSM target for all resources. This allows you to access HyperPod clusters.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"
        }
    ]
}
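With these permissions in place, a scientist can open a session with the SSM start-session command. The following Python sketch assembles that command. It assumes that a HyperPod node's SSM target has the form sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]; that format is not stated in this topic, so confirm it against the HyperPod documentation for your cluster. The helper names and all identifiers below are hypothetical placeholders.

```python
import shlex


def ssm_target(cluster_id, instance_group_name, instance_id):
    """Build the SSM target string (assumed format, see the note above)."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group_name}-{instance_id}"


def start_session_command(target, region=None):
    """Return the AWS CLI argument list for starting an SSM session."""
    command = ["aws", "ssm", "start-session", "--target", target]
    if region:
        command += ["--region", region]
    return command


# Placeholder cluster ID, instance group name, and instance ID.
target = ssm_target("aa11bbbbb222", "controller-group", "i-111222333444555aa")
command = start_session_command(target, region="us-west-2")

# Print the command so it can be reviewed before running it in a shell.
print(shlex.join(command))
```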
Amazon EKS

Grant the following IAM role permissions so that data scientists can run the hyperpod list-clusters and hyperpod connect-cluster commands of the HyperPod CLI. To learn more about the HyperPod CLI, see Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS. The policy also includes SSM session permissions to connect to an SSM target for all resources, which allows data scientists to access HyperPod clusters.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeHyperpodClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeCluster"
            ],
            "Resource": "<hyperpod-cluster-arn>"
        },
        {
            "Sid": "UseEksClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "eks:DescribeCluster"
            ],
            "Resource": "<eks-cluster-arn>"
        },
        {
            "Sid": "ListClustersPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"
        }
    ]
}
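Because this policy takes two ARNs, generating it from code can help avoid hand-editing mistakes such as stray commas. The following Python sketch builds the policy for a given HyperPod cluster ARN and EKS cluster ARN and serializes it with the standard json module, which always produces valid JSON. The function name scientist_policy and the sample ARNs are hypothetical.

```python
import json


def scientist_policy(hyperpod_cluster_arn, eks_cluster_arn):
    """Build the data scientist policy for HyperPod orchestrated with Amazon EKS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DescribeHyperpodClusterPermissions",
                "Effect": "Allow",
                "Action": ["sagemaker:DescribeCluster"],
                "Resource": hyperpod_cluster_arn,
            },
            {
                "Sid": "UseEksClusterPermissions",
                "Effect": "Allow",
                "Action": ["eks:DescribeCluster"],
                "Resource": eks_cluster_arn,
            },
            {
                "Sid": "ListClustersPermission",
                "Effect": "Allow",
                "Action": ["sagemaker:ListClusters"],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession", "ssm:TerminateSession"],
                "Resource": "*",
            },
        ],
    }


# Placeholder ARNs for illustration.
policy = scientist_policy(
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcd1234efgh",
    "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster",
)
document = json.dumps(policy, indent=4)
print(document)
```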

To grant data scientist IAM users or roles access to Kubernetes APIs in the cluster, see also Grant IAM users and roles access to Kubernetes APIs in the Amazon EKS User Guide.

IAM role for SageMaker HyperPod

For SageMaker HyperPod clusters to run and communicate with necessary AWS resources, you need to create an IAM role for the HyperPod cluster to assume.

Start by attaching the AWS managed policy AmazonSageMakerHyperPodServiceRolePolicy. With this AWS managed policy, SageMaker HyperPod cluster instance groups assume the role to communicate with Amazon CloudWatch, Amazon S3, and AWS Systems Manager Agent (SSM Agent). This managed policy is the minimum requirement for SageMaker HyperPod resources to run properly, so you must provide an IAM role with this policy to all instance groups.
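For the cluster to assume the role, the role also needs a trust policy. The following Python sketch builds one under the assumption that the role is assumed through the SageMaker service principal, sagemaker.amazonaws.com. This topic does not state the principal, so verify it for your setup. The variable names are hypothetical.

```python
import json

# Trust policy that lets the SageMaker service assume the role.
# The service principal is an assumption; confirm it for your account.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# IAM expects the trust policy as a JSON string when a role is created.
trust_policy_document = json.dumps(trust_policy)
print(json.dumps(trust_policy, indent=4))
```

The resulting string is the kind of value you would supply as the assume-role policy document when creating the role, for example through the IAM console or the IAM CreateRole API.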

Tip

Depending on how you want to design the level of permissions for multiple instance groups, you can also set up multiple IAM roles and attach them to different instance groups.

When you set up access for scientists to specific cluster nodes through AWS Systems Manager (see also Setting up AWS Systems Manager and Run As for cluster user access control), the cluster nodes assume the role with the selective permissions you manually attach.

After you finish creating the IAM roles, note their names and ARNs. You use the roles when creating a SageMaker HyperPod cluster, granting each instance group the permissions it requires to communicate with necessary AWS resources.
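As a rough sketch of where the role ARNs end up, the following Python code assembles a request body in the shape of the SageMaker CreateCluster API, with a different execution role for each instance group. The field names reflect one reading of that API and should be checked against the API reference. The helper name instance_group and every value (cluster name, instance types, role ARNs, S3 URI, script name) are hypothetical placeholders. The code only builds and prints the request; it does not call AWS.

```python
import json


def instance_group(name, instance_type, count, execution_role_arn, lifecycle_s3_uri):
    """Describe one instance group and the IAM role its nodes assume."""
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "ExecutionRole": execution_role_arn,
        "LifeCycleConfig": {
            "SourceS3Uri": lifecycle_s3_uri,
            "OnCreate": "on_create.sh",  # placeholder lifecycle script name
        },
    }


# Placeholder values; replace them with the names and ARNs you noted.
request = {
    "ClusterName": "example-hyperpod-cluster",
    "InstanceGroups": [
        instance_group(
            "controller-group",
            "ml.c5.xlarge",
            1,
            "arn:aws:iam::111122223333:role/HyperPodControllerRole",
            "s3://amzn-s3-demo-bucket/lifecycle/",
        ),
        instance_group(
            "worker-group",
            "ml.g5.8xlarge",
            2,
            "arn:aws:iam::111122223333:role/HyperPodWorkerRole",
            "s3://amzn-s3-demo-bucket/lifecycle/",
        ),
    ],
}

print(json.dumps(request, indent=4))
```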

Slurm

For HyperPod orchestrated with Slurm, you must attach the following managed policy to the SageMaker HyperPod IAM role.

(Optional) Additional permissions for using SageMaker HyperPod with Amazon Virtual Private Cloud

If you want to use your own Amazon Virtual Private Cloud (VPC) instead of the default SageMaker AI VPC, you should add the following additional permissions to the IAM role for SageMaker HyperPod.

{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DetachNetworkInterface"
    ],
    "Resource": "*"
},
{
    "Effect": "Allow",
    "Action": "ec2:CreateTags",
    "Resource": [
        "arn:aws:ec2:*:*:network-interface/*"
    ]
}

The following list breaks down which permissions are needed to enable SageMaker HyperPod cluster functionalities when you configure the cluster with your own Amazon VPC.

  • The following ec2 permissions are required to enable configuring a SageMaker HyperPod cluster with your VPC.

    { "Effect": "Allow", "Action": [ "ec2:CreateNetworkInterface", "ec2:CreateNetworkInterfacePermission", "ec2:DeleteNetworkInterface", "ec2:DeleteNetworkInterfacePermission", "ec2:DescribeNetworkInterfaces", "ec2:DescribeVpcs", "ec2:DescribeDhcpOptions", "ec2:DescribeSubnets", "ec2:DescribeSecurityGroups" ], "Resource": "*" }
  • The following ec2 permission is required to enable the SageMaker HyperPod auto-resume functionality.

    { "Effect": "Allow", "Action": [ "ec2:DetachNetworkInterface" ], "Resource": "*" }
  • The following ec2 permission allows SageMaker HyperPod to create tags on the network interfaces within your account.

    { "Effect": "Allow", "Action": "ec2:CreateTags", "Resource": [ "arn:aws:ec2:*:*:network-interface/*" ] }
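The three statements in the list above are fragments rather than a complete policy. As a minimal sketch, the following Python code combines them into a single policy document, with a flag for leaving out the auto-resume permission. The names vpc_policy and include_auto_resume are hypothetical.

```python
import json

# Required to configure a SageMaker HyperPod cluster with your own VPC.
VPC_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
    ],
    "Resource": "*",
}

# Required for the SageMaker HyperPod auto-resume functionality.
AUTO_RESUME_STATEMENT = {
    "Effect": "Allow",
    "Action": ["ec2:DetachNetworkInterface"],
    "Resource": "*",
}

# Allows tagging of network interfaces within your account.
TAGGING_STATEMENT = {
    "Effect": "Allow",
    "Action": "ec2:CreateTags",
    "Resource": ["arn:aws:ec2:*:*:network-interface/*"],
}


def vpc_policy(include_auto_resume=True):
    """Combine the statement fragments into one complete policy document."""
    statements = [VPC_STATEMENT]
    if include_auto_resume:
        statements.append(AUTO_RESUME_STATEMENT)
    statements.append(TAGGING_STATEMENT)
    return {"Version": "2012-10-17", "Statement": statements}


print(json.dumps(vpc_policy(), indent=4))
```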
Amazon EKS

For HyperPod orchestrated with Amazon EKS, you must attach the following managed policies to the SageMaker HyperPod IAM role.

In addition to the managed policies, attach the following permission policy to the role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AssignPrivateIpAddresses",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeVpcs",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DetachNetworkInterface",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:UnassignPrivateIpAddresses",
                "ecr:BatchGetImage",
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "eks-auth:AssumeRoleForPodIdentity"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*"
            ]
        }
    ]
}
Note

The "eks-auth:AssumeRoleForPodIdentity" permission is optional. It's required only if you plan to use EKS Pod Identity.
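If you do not plan to use EKS Pod Identity, the optional action can be dropped from the policy. The following Python sketch removes it from a policy document without modifying the original. The helper name without_pod_identity is hypothetical, and the sample policy is abbreviated to a few actions for readability.

```python
import copy

POD_IDENTITY_ACTION = "eks-auth:AssumeRoleForPodIdentity"


def without_pod_identity(policy):
    """Return a copy of the policy with the optional Pod Identity action removed."""
    trimmed = copy.deepcopy(policy)
    for statement in trimmed["Statement"]:
        actions = statement["Action"]
        # Only list-valued Action fields are filtered; single strings are left as-is.
        if isinstance(actions, list):
            statement["Action"] = [a for a in actions if a != POD_IDENTITY_ACTION]
    return trimmed


# Abbreviated sample policy for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetAuthorizationToken",
                "eks-auth:AssumeRoleForPodIdentity",
            ],
            "Resource": "*",
        }
    ],
}

trimmed_policy = without_pod_identity(sample_policy)
print(trimmed_policy["Statement"][0]["Action"])
```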

SageMaker HyperPod service-linked role

For Amazon EKS support in SageMaker HyperPod, HyperPod creates a service-linked role with the AWS managed policy AmazonSageMakerHyperPodServiceRolePolicy to monitor and support resiliency on your EKS cluster, such as replacing nodes and restarting jobs.

IAM policies for Amazon EKS