Prerequisites for using SageMaker HyperPod - Amazon SageMaker AI

Prerequisites for using SageMaker HyperPod

The following sections walk you through prerequisites before getting started with SageMaker HyperPod.

SageMaker HyperPod quotas

You can create SageMaker HyperPod clusters within the limits set by the quotas for cluster usage in your AWS account.

Important

To learn more about SageMaker HyperPod pricing, see SageMaker HyperPod pricing and Amazon SageMaker AI Pricing.

View Amazon SageMaker HyperPod quotas using the AWS Management Console

Look up the default and applied values of the quotas (also referred to as limits) for cluster usage, which apply to SageMaker HyperPod.

  1. Open the Service Quotas console.

  2. In the left navigation pane, choose AWS services.

  3. From the AWS services list, search for and select Amazon SageMaker AI.

  4. In the Service quotas list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable.

  5. In the search bar, type cluster usage. This shows quotas for cluster usage, applied quotas, and the default quotas.
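The same lookup can also be scripted. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. It calls the Service Quotas `list_service_quotas` operation with the `sagemaker` service code and keeps the quotas whose name contains "cluster usage". The helper names `fetch_sagemaker_quotas` and `cluster_usage_quotas` are illustrative and not part of any AWS SDK.

```python
def fetch_sagemaker_quotas(client):
    """Return every applied quota listed for Amazon SageMaker AI.

    `client` is assumed to be a boto3 Service Quotas client, for example
    boto3.client("service-quotas", region_name="us-west-2").
    """
    quotas = []
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


def cluster_usage_quotas(quotas):
    """Keep only quotas whose name mentions cluster usage (case-insensitive)."""
    return [
        {
            "name": quota["QuotaName"],
            "code": quota["QuotaCode"],
            "applied_value": quota["Value"],
            "adjustable": quota["Adjustable"],
        }
        for quota in quotas
        if "cluster usage" in quota["QuotaName"].lower()
    ]
```

To use it, pass a real client to `fetch_sagemaker_quotas` and feed the result to `cluster_usage_quotas`. AWS default values are available separately through `list_aws_default_service_quotas`.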

Request an Amazon SageMaker HyperPod quota increase using the AWS Management Console

Increase your quotas at the account or resource level.

  1. To increase the quota of instances for cluster usage, select the quota that you want to increase.

  2. If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the Adjustability column.

  3. For Increase quota value, enter the new value. The new value must be greater than the current value.

  4. Choose Request.

  5. To view any pending or recently resolved requests in the console, navigate to the Request history tab from the service's details page, or choose Dashboard from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is Pending. After the status changes to Quota requested, you see the case number with AWS Support. Choose the case number to open the ticket for your request.

To learn more about requesting a quota increase in general, see Requesting a Quota Increase in the AWS Service Quotas User Guide.
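A quota increase can also be requested programmatically. The sketch below assumes a boto3 Service Quotas client and a quota code you have already looked up. It applies the same rule as the console (the new value must be greater than the current value) before calling `request_service_quota_increase`. The function names are hypothetical helpers, not SDK functions.

```python
def build_quota_increase_request(quota_code, current_value, desired_value):
    """Build keyword arguments for request_service_quota_increase.

    Mirrors the console rule: the new value must exceed the current value.
    """
    if desired_value <= current_value:
        raise ValueError("The new value must be greater than the current value.")
    return {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota_code,
        "DesiredValue": float(desired_value),
    }


def request_increase(client, quota_code, current_value, desired_value):
    """Submit the request and return its initial status string.

    `client` is assumed to be boto3.client("service-quotas").
    """
    kwargs = build_quota_increase_request(quota_code, current_value, desired_value)
    response = client.request_service_quota_increase(**kwargs)
    return response["RequestedQuota"]["Status"]
```

Validating before the call keeps an obviously invalid request from reaching the service. The returned status can then be tracked in the Request history tab as described in step 5.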

Setting up SageMaker HyperPod with your Amazon VPC

To set up a SageMaker HyperPod cluster with your Amazon VPC, check the following items.

Note

Setting up your own VPC is required for orchestrating with Amazon EKS. For orchestrating with Slurm, it is optional.

  • Before creating a SageMaker HyperPod cluster with a custom VPC, ensure that your AWS account has sufficient capacity for creating the required number of Elastic Network Interfaces (ENIs) within that VPC. This limit is controlled by Amazon EC2 and varies by AWS Region. SageMaker HyperPod cannot request limit increases on your behalf.

    To check your current ENI limit:
    1. Open the Service Quotas console.

    2. In the Manage quotas section, use the AWS Services drop-down list to search for VPC.

    3. Choose to view the quotas of Amazon Virtual Private Cloud (Amazon VPC).

    4. Look for the service quota Network interfaces per Region or the Quota code L-DF5E4CA3.

    If your current limit is insufficient for your SageMaker HyperPod cluster needs, request a quota increase. Ensuring adequate ENI capacity beforehand helps prevent cluster creation failures.

  • If you want to use your own VPC to connect SageMaker HyperPod with AWS resources in your VPC, you need to provide the VPC name, ID, AWS Region, subnet ID, and security group ID when you create a SageMaker HyperPod cluster. If you want to create a new VPC, see Create a default VPC or Create a VPC in the Amazon Virtual Private Cloud User Guide.

  • It is important that you create all your resources in the same AWS Region as your SageMaker HyperPod cluster and configure security group rules to allow connections between the resources in your VPC. For example, assume that you create a VPC in us-west-2. You should create subnets in this VPC across one or more Availability Zones as needed (such as us-west-2a or us-west-2b), and create a security group that allows all incoming (inbound) traffic from inside the security group and all outbound traffic.

    Note

    When setting up a SageMaker HyperPod cluster, you can choose to deploy across multiple Availability Zones. For more information, see Setting up SageMaker HyperPod clusters across multiple AZs.

  • You also need to ensure that your VPC has a connection to Amazon Simple Storage Service (Amazon S3). If you configure a VPC, SageMaker HyperPod instance groups don't have access to the internet, and therefore can't connect to Amazon S3 for accessing or storing files such as lifecycle scripts, training data, and model artifacts. To establish a connection with Amazon S3 while using a VPC, create a VPC endpoint. A VPC endpoint allows the SageMaker HyperPod instance groups to access the Amazon S3 buckets within the same VPC. We recommend that you also create a custom policy that only allows requests from your private VPC to access your Amazon S3 buckets. For more information, see Endpoints for Amazon S3 in the AWS PrivateLink Guide.

  • If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. Note that allowing outbound traffic to 0.0.0.0/0 isn't sufficient and can cause EFA health checks to fail. Make sure that you add an explicit outbound traffic rule to the security group so that the instances in the security group can communicate. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
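Two of the checks in the list above can be sketched in code: the ENI quota check and the self-referencing security group rules for EFA. The sketch assumes boto3 clients for Service Quotas and Amazon EC2. It reads the quota with code L-DF5E4CA3 under the `vpc` service code, and estimates usage by counting the network interfaces returned by `describe_network_interfaces`, which is an assumption about how to approximate current usage rather than an official method. All function names are hypothetical.

```python
ENI_QUOTA_CODE = "L-DF5E4CA3"  # "Network interfaces per Region", from the steps above


def eni_headroom(quotas_client, ec2_client):
    """Return the ENI quota minus the ENIs currently visible in the Region.

    Counting describe_network_interfaces results is an approximation
    chosen for this sketch.
    """
    quota = quotas_client.get_service_quota(
        ServiceCode="vpc", QuotaCode=ENI_QUOTA_CODE
    )
    limit = int(quota["Quota"]["Value"])
    in_use = 0
    paginator = ec2_client.get_paginator("describe_network_interfaces")
    for page in paginator.paginate():
        in_use += len(page["NetworkInterfaces"])
    return limit - in_use


def self_referencing_permissions(security_group_id):
    """IpPermissions allowing all protocols to and from the group itself."""
    return [
        {
            "IpProtocol": "-1",  # -1 means all protocols and ports
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
    ]


def allow_efa_traffic(ec2_client, security_group_id):
    """Add explicit inbound and outbound self-referencing rules."""
    permissions = self_referencing_permissions(security_group_id)
    ec2_client.authorize_security_group_ingress(
        GroupId=security_group_id, IpPermissions=permissions
    )
    ec2_client.authorize_security_group_egress(
        GroupId=security_group_id, IpPermissions=permissions
    )
```

The outbound rule is added explicitly because, as noted above, an outbound rule to 0.0.0.0/0 alone isn't sufficient for EFA health checks.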

Setting up SageMaker HyperPod clusters across multiple AZs

You can set up your SageMaker HyperPod clusters across multiple Availability Zones (AZs) to obtain a higher instance capacity.

Note

Elastic Fabric Adapter (EFA) traffic cannot cross AZs or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface. For more information, see EFA limitations.

When you create a HyperPod cluster, all of the HyperPod instances are created within the same AZ, using the VpcConfig specified at the cluster level. To learn more about VPCs and how to create new ones for your cluster, see the preceding section, Setting up SageMaker HyperPod with your Amazon VPC.

You can set up your HyperPod cluster across multiple AZs when you create or update your cluster using the SageMaker AI console. Alternatively, you can use the following APIs.

When you create a new InstanceGroup using the CreateCluster and UpdateCluster APIs, you can use the OverrideVpcConfig property at the InstanceGroup level to override the subnet IDs and security groups for that InstanceGroup. The OverrideVpcConfig field:

  • Is immutable. Once an instance group is created, it will always be associated with the same subnet in the account.

  • Is optional.

    • If not specified, the cluster level VpcConfig will be used as default.

    • When specified, both sub-fields, Subnets and SecurityGroupIds, are required.

  • Has two sub-fields:

    • Subnets sub-field supports a single subnet ID for an instance group.

    • SecurityGroupIds sub-field supports 1 to 5 entries.
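The rules in the list above can be expressed as a small client-side check. This is a minimal sketch that validates an OverrideVpcConfig dictionary before it is placed in a CreateCluster or UpdateCluster request, and resolves which VPC settings an instance group ends up with. The function names and the placeholder IDs are hypothetical, and only the fields discussed here are shown.

```python
def validate_override_vpc_config(override):
    """Raise ValueError if an OverrideVpcConfig dict breaks the listed rules."""
    if override is None:
        return  # The field is optional.
    subnets = override.get("Subnets")
    security_groups = override.get("SecurityGroupIds")
    if not subnets or not security_groups:
        raise ValueError("Subnets and SecurityGroupIds are both required.")
    if len(subnets) != 1:
        raise ValueError("Subnets supports a single subnet ID.")
    if not 1 <= len(security_groups) <= 5:
        raise ValueError("SecurityGroupIds supports 1 to 5 entries.")


def effective_vpc_config(instance_group, cluster_vpc_config):
    """Return the override if present, else the cluster-level VpcConfig."""
    override = instance_group.get("OverrideVpcConfig")
    validate_override_vpc_config(override)
    return override if override is not None else cluster_vpc_config


# Placeholder values for illustration only.
cluster_vpc_config = {
    "Subnets": ["subnet-aaaa1111"],
    "SecurityGroupIds": ["sg-aaaa1111"],
}
group_in_second_az = {
    "InstanceGroupName": "workers-az-b",
    "OverrideVpcConfig": {
        "Subnets": ["subnet-bbbb2222"],
        "SecurityGroupIds": ["sg-aaaa1111"],
    },
}
group_with_default = {"InstanceGroupName": "workers-az-a"}
```

Here `effective_vpc_config(group_in_second_az, cluster_vpc_config)` resolves to the override, while `group_with_default` falls back to the cluster-level VpcConfig.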

Note

Network latency may be degraded for workloads executed across multiple AZs.

Setting up AWS Systems Manager and Run As for cluster user access control

The SageMaker HyperPod DLAMI comes with AWS Systems Manager (SSM) preinstalled to help you manage access to your SageMaker HyperPod cluster instance groups. This section describes how to create operating system (OS) users in your SageMaker HyperPod clusters and associate them with IAM users and roles. This is useful for authenticating SSM sessions using the credentials of the OS user account.

Note

Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.

Enabling Run As in your AWS account

As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in SSM. With this feature, you can start each SSM session using the OS user associated with the IAM role or user.

To enable Run As in your AWS account, follow the steps in Turn on Run As support for Linux and macOS managed nodes. If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in Option 2 of step 5 under To turn on Run As support for Linux and macOS managed nodes.
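The tagging step can be sketched as follows, assuming a boto3 IAM client and that the OS user already exists on the cluster nodes. Session Manager reads the OS user name from the `SSMSessionRunAs` tag on the IAM role or user, as described in the Systems Manager documentation referenced above. The helper names are hypothetical.

```python
RUN_AS_TAG_KEY = "SSMSessionRunAs"  # Tag key that Session Manager reads for Run As


def run_as_tags(os_user):
    """Build the IAM tag list that maps a principal to an OS user."""
    if not os_user or not os_user.strip():
        raise ValueError("An OS user name is required.")
    return [{"Key": RUN_AS_TAG_KEY, "Value": os_user}]


def tag_role_for_run_as(iam_client, role_name, os_user):
    """Associate an IAM role with an OS user for SSM sessions."""
    iam_client.tag_role(RoleName=role_name, Tags=run_as_tags(os_user))


def tag_user_for_run_as(iam_client, user_name, os_user):
    """Associate an IAM user with an OS user for SSM sessions."""
    iam_client.tag_user(UserName=user_name, Tags=run_as_tags(os_user))
```

Tagging only associates the principal with the OS user. Run As support itself still has to be turned on in the Session Manager preferences by following the referenced steps.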

(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre

To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing the AWS Region you prefer, you also should determine which Availability Zone (AZ) to use.

If you use SageMaker HyperPod compute nodes in AZs different from the AZs where your FSx for Lustre file system is set up within the same AWS Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid any cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that you have configured it with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with your VPC.
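One way to check for cross-AZ placement is to compare AZ IDs (such as usw2-az1) rather than AZ names, because an AZ name such as us-west-2a can map to a different physical AZ in each account. The sketch below assumes a boto3 EC2 client and that you know the subnet IDs used by the cluster and by the FSx for Lustre file system. The function names are hypothetical.

```python
def az_ids_by_subnet(ec2_client, subnet_ids):
    """Map each subnet ID to its AZ ID using describe_subnets."""
    response = ec2_client.describe_subnets(SubnetIds=list(subnet_ids))
    return {
        subnet["SubnetId"]: subnet["AvailabilityZoneId"]
        for subnet in response["Subnets"]
    }


def subnets_outside_az(az_ids, fsx_subnet_id):
    """Return the subnets whose AZ ID differs from the FSx subnet's AZ ID."""
    target_az = az_ids[fsx_subnet_id]
    return sorted(
        subnet_id for subnet_id, az_id in az_ids.items() if az_id != target_az
    )
```

An empty result from `subnets_outside_az` suggests that the cluster subnets and the file system share a physical AZ. Any subnet it returns would carry cross-AZ traffic to the file system.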