Prerequisites for using SageMaker HyperPod

The following sections walk you through the prerequisites for getting started with SageMaker HyperPod.

SageMaker HyperPod quotas

How many SageMaker HyperPod clusters and cluster instances you can create is governed by the quotas for cluster usage in your AWS account.

Important

To learn more about SageMaker HyperPod pricing, see SageMaker HyperPod pricing and Amazon SageMaker AI Pricing.

View Amazon SageMaker HyperPod quotas using the AWS Management Console

Look up the default and applied values of the quotas (also referred to as limits) for cluster usage that apply to SageMaker HyperPod.

  1. Open the Service Quotas console.

  2. In the left navigation pane, choose AWS services.

  3. From the AWS services list, search for and select Amazon SageMaker AI.

  4. In the Service quotas list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable.

  5. In the search bar, type cluster usage. The list filters to the cluster usage quotas and shows their applied and default values.
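
You can also look up these quotas programmatically with the Service Quotas API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the substring match on the quota name is an assumption about how the cluster usage quotas are phrased, and the Region is a placeholder.

    import boto3

    # Service Quotas are Regional; use the Region where you plan to run HyperPod.
    client = boto3.client("service-quotas", region_name="us-west-2")

    # Page through the Amazon SageMaker quotas and keep the cluster usage ones.
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page["Quotas"]:
            if "cluster usage" in quota["QuotaName"].lower():
                print(f"{quota['QuotaName']}: {quota['Value']} (code {quota['QuotaCode']})")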

Request an Amazon SageMaker HyperPod quota increase using the AWS Management Console

Increase your quotas at the account or resource level.

  1. To increase the quota of instances for cluster usage, select the quota that you want to increase.

  2. If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the Adjustability column.

  3. For Increase quota value, enter the new value. The new value must be greater than the current value.

  4. Choose Request.

  5. To view any pending or recently resolved requests in the console, navigate to the Request history tab from the service's details page, or choose Dashboard from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is Pending. After the status changes to Quota requested, you see the case number with AWS Support. Choose the case number to open the ticket for your request.

To learn more about requesting a quota increase in general, see Requesting a Quota Increase in the AWS Service Quotas User Guide.
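
If you prefer to automate the request, the following Boto3 sketch submits the same quota increase through the Service Quotas API. The quota code below is a hypothetical placeholder; look up the real code for your instance type's cluster usage quota (for example, with list_service_quotas) before calling this.

    import boto3

    client = boto3.client("service-quotas", region_name="us-west-2")

    # QuotaCode is a placeholder; replace it with the code of the cluster usage
    # quota you want to raise. DesiredValue must exceed the current value.
    response = client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode="L-XXXXXXXX",
        DesiredValue=4.0,
    )
    print(response["RequestedQuota"]["Status"])  # e.g., PENDING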

Setting up SageMaker HyperPod with a custom Amazon VPC

To set up a SageMaker HyperPod cluster with a custom Amazon VPC, review the following prerequisites.

Note

VPC configuration is mandatory for Amazon EKS orchestration. For Slurm orchestration, VPC setup is optional.

  • Validate Elastic Network Interface (ENI) capacity in your AWS account before creating a SageMaker HyperPod cluster with a custom VPC. The ENI limit is controlled by Amazon EC2 and varies by AWS Region. SageMaker HyperPod cannot automatically request quota increases.

    To verify your current ENI quota:
    1. Open the Service Quotas console.

    2. In the Manage quotas section, use the AWS Services drop-down list to search for VPC.

    3. Choose Amazon Virtual Private Cloud (Amazon VPC) to view its quotas.

    4. Look for the service quota Network interfaces per Region or the Quota code L-DF5E4CA3.

    If your current ENI limit is insufficient for your SageMaker HyperPod cluster needs, request a quota increase. Ensuring adequate ENI capacity beforehand helps prevent cluster deployment failures. For a programmatic way to check this quota, see the first sketch following this list.

  • When using a custom VPC to connect a SageMaker HyperPod cluster with AWS resources, provide the VPC name, ID, AWS Region, subnet IDs, and security group IDs during cluster creation. A sketch following this list shows where these values go in a CreateCluster request.

    Note

    When your Amazon VPC and subnets support IPv6 in the VpcConfig of the cluster, or at the instance group level through the OverrideVpcConfig attribute of ClusterInstanceGroupSpecification, network communications differ based on the cluster orchestration platform:

    • Slurm-orchestrated clusters automatically configure nodes with dual IPv6 and IPv4 addresses, allowing immediate IPv6 network communications. No additional configuration is required beyond the VpcConfig IPv6 settings.

    • In EKS-orchestrated clusters, nodes receive dual-stack addressing, but pods can use IPv6 only when the Amazon EKS cluster is explicitly IPv6-enabled. You must create a new IPv6 Amazon EKS cluster; existing IPv4 Amazon EKS clusters cannot be converted to IPv6. For information about deploying an IPv6 Amazon EKS cluster, see Amazon EKS IPv6 Cluster Deployment.

  • Make sure that all resources are deployed in the same AWS Region as the SageMaker HyperPod cluster. Configure security group rules to allow inter-resource communication within the VPC. For example, when creating a VPC in us-west-2, provision subnets across one or more Availability Zones (such as us-west-2a or us-west-2b), and create a security group allowing intra-group traffic.

    Note

    SageMaker HyperPod supports multi-Availability Zone deployment. For more information, see Setting up SageMaker HyperPod clusters across multiple AZs.

  • Establish Amazon Simple Storage Service (Amazon S3) connectivity for VPC-deployed SageMaker HyperPod instance groups by creating a VPC endpoint. Without internet access, instance groups cannot store or retrieve lifecycle scripts, training data, or model artifacts. We recommend that you create a custom IAM policy restricting Amazon S3 bucket access to the private VPC. For more information, see Endpoints for Amazon S3 in the AWS PrivateLink Guide. A sketch following this list shows how to create such an endpoint.

  • For HyperPod clusters using Elastic Fabric Adapter (EFA)-enabled instances, configure the security group to allow all inbound and outbound traffic to and from the security group itself. Avoid using 0.0.0.0/0 for outbound rules, because this can cause EFA health check failures. For more information about EFA security group preparation guidelines, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide. The last sketch following this list shows the self-referencing rules.
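
As a programmatic alternative to the console steps in the first item of the preceding list, the following Boto3 sketch reads the applied value of the ENI quota by its quota code; the Region is a placeholder.

    import boto3

    client = boto3.client("service-quotas", region_name="us-west-2")

    # "Network interfaces per Region" for Amazon VPC (quota code L-DF5E4CA3).
    # If no applied value exists yet, use get_aws_default_service_quota instead.
    quota = client.get_service_quota(ServiceCode="vpc", QuotaCode="L-DF5E4CA3")
    print(f"Applied ENI quota: {quota['Quota']['Value']}")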
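
The following sketch shows where the subnet and security group IDs go when you create a cluster through the CreateCluster API. Every ID, name, ARN, S3 path, and the instance group itself are hypothetical placeholders for illustration.

    import boto3

    sagemaker = boto3.client("sagemaker", region_name="us-west-2")

    # All resource IDs, names, and paths are hypothetical placeholders.
    sagemaker.create_cluster(
        ClusterName="my-hyperpod-cluster",
        VpcConfig={
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
            "Subnets": ["subnet-0123456789abcdef0"],
        },
        InstanceGroups=[
            {
                "InstanceGroupName": "controller-group",
                "InstanceType": "ml.c5.xlarge",
                "InstanceCount": 1,
                "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
                    "OnCreate": "on_create.sh",
                },
            }
        ],
    )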
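
For the Amazon S3 connectivity item, a gateway VPC endpoint can be created as in the following sketch; the VPC and route table IDs are placeholders, and the service name assumes the us-west-2 Region.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Gateway endpoint so instances in the private VPC can reach Amazon S3
    # without internet access. IDs are hypothetical placeholders.
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.us-west-2.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],
    )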
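
Finally, the self-referencing security group rules that EFA needs look like the following sketch; the security group ID is a placeholder. The sketch also removes the default allow-all egress rule, in line with the guidance above about avoiding 0.0.0.0/0 outbound rules.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    sg_id = "sg-0123456789abcdef0"  # hypothetical placeholder

    # Allow all traffic between members of the security group itself,
    # in both directions, as EFA requires.
    self_rule = [{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}]
    ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=self_rule)
    ec2.authorize_security_group_egress(GroupId=sg_id, IpPermissions=self_rule)

    # Remove the default allow-all (0.0.0.0/0) egress rule that new security
    # groups start with, since it can cause EFA health check failures.
    ec2.revoke_security_group_egress(
        GroupId=sg_id,
        IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
    )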

Setting up SageMaker HyperPod clusters across multiple AZs

You can configure your SageMaker HyperPod clusters across multiple Availability Zones (AZs) to improve reliability and availability.

Note

Elastic Fabric Adapter (EFA) traffic cannot cross AZs or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface. For more information, see EFA limitations.

  • Default behavior

    By default, HyperPod deploys all cluster instances in a single Availability Zone. The VPC configuration determines the deployment AZ:

    • For Slurm-orchestrated clusters, VPC configuration is optional. When no VPC configuration is provided, HyperPod defaults to one subnet from the platform VPC.

    • For EKS-orchestrated clusters, VPC configuration is required.

    • For both Slurm and EKS orchestrators, when VpcConfig is provided, HyperPod selects a subnet from the provided VpcConfig's subnet list. All instance groups inherit the subnet's AZ.

    Note

    Once you create a cluster, you cannot modify its VpcConfig settings.

    To learn more about configuring VPCs for HyperPod clusters, see the preceding section, Setting up SageMaker HyperPod with a custom Amazon VPC.

  • Multi-AZ configuration

    You can set up your HyperPod cluster across multiple AZs when creating a cluster or when adding a new instance group to an existing cluster. To configure multi-AZ deployments, you can override the default VPC settings of the cluster by specifying different subnets and security groups, potentially across different Availability Zones, for individual instance groups within your cluster.

    SageMaker HyperPod API users can use the OverrideVpcConfig property within the ClusterInstanceGroupSpecification when working with the CreateCluster or UpdateCluster APIs, as shown in the sketch at the end of this section.

    The OverrideVpcConfig field:

    • Cannot be modified after the instance group is created.

    • Is optional. If not specified, the cluster-level VpcConfig is used as the default.

    • For Slurm-orchestrated clusters, can only be specified when a cluster-level VpcConfig is provided. If no VpcConfig is specified at the cluster level, OverrideVpcConfig cannot be used for any instance group.

    • Contains two required fields:

      • Subnets - accepts between 1 and 16 subnet IDs

      • SecurityGroupIds - accepts between 1 and 5 security group IDs

    For more information about creating or updating a SageMaker HyperPod cluster using the SageMaker HyperPod console UI or the AWS CLI, see the topics on creating and managing SageMaker HyperPod clusters in this guide.

Note

When running workloads across multiple AZs, be aware that network communication between AZs introduces additional latency. Consider this impact when designing latency-sensitive applications.
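
As a concrete illustration of the OverrideVpcConfig field described above, the following Boto3 sketch creates a cluster with one instance group that inherits the cluster-level VpcConfig and a second group placed in a different AZ. All IDs, names, ARNs, paths, and instance types are hypothetical placeholders.

    import boto3

    sagemaker = boto3.client("sagemaker", region_name="us-west-2")

    role = "arn:aws:iam::111122223333:role/HyperPodExecutionRole"  # placeholder
    lifecycle = {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
        "OnCreate": "on_create.sh",
    }

    sagemaker.create_cluster(
        ClusterName="my-multi-az-cluster",
        # Cluster-level VPC settings; instance groups inherit these by default.
        VpcConfig={
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
            "Subnets": ["subnet-0aaaa1111aaaa1111a"],  # e.g., in us-west-2a
        },
        InstanceGroups=[
            {
                # Inherits the cluster-level VpcConfig.
                "InstanceGroupName": "group-a",
                "InstanceType": "ml.p5.48xlarge",
                "InstanceCount": 2,
                "ExecutionRole": role,
                "LifeCycleConfig": lifecycle,
            },
            {
                # Overrides the cluster-level settings with a subnet in another AZ.
                "InstanceGroupName": "group-b",
                "InstanceType": "ml.p5.48xlarge",
                "InstanceCount": 2,
                "ExecutionRole": role,
                "LifeCycleConfig": lifecycle,
                "OverrideVpcConfig": {
                    "SecurityGroupIds": ["sg-0123456789abcdef0"],
                    "Subnets": ["subnet-0bbbb2222bbbb2222b"],  # e.g., in us-west-2b
                },
            },
        ],
    )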

Setting up AWS Systems Manager and Run As for cluster user access control

The SageMaker HyperPod DLAMI comes with AWS Systems Manager (SSM) preinstalled to help you manage access to your SageMaker HyperPod cluster instance groups. This section describes how to create operating system (OS) users in your SageMaker HyperPod clusters and associate them with IAM users and roles. This is useful for authenticating SSM sessions with the credentials of the OS user account.

Note

Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.

Enabling Run As in your AWS account

As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at the IAM role or user level by using the Run As feature in SSM. With this feature, you can start each SSM session using the OS user associated with the IAM role or user.

To enable Run As in your AWS account, follow the steps in Turn on Run As support for Linux and macOS managed nodes. If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in Option 2 of step 5 under To turn on Run As support for Linux and macOS managed nodes.
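
For example, the following Boto3 sketch applies the SSMSessionRunAs tag to an IAM role so that sessions started by principals assuming that role run as the specified OS user; the role name and OS user name are hypothetical placeholders.

    import boto3

    iam = boto3.client("iam")

    # Associate the IAM role with an OS user account on the cluster nodes.
    # Role name and OS user name are hypothetical placeholders.
    iam.tag_role(
        RoleName="HyperPodClusterUserRole",
        Tags=[{"Key": "SSMSessionRunAs", "Value": "data-scientist-1"}],
    )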

(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre

To start using SageMaker HyperPod and map data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing your preferred AWS Region, you should also determine which Availability Zone (AZ) to use.

If your SageMaker HyperPod compute nodes run in AZs different from the AZ where your FSx for Lustre file system is set up within the same AWS Region, there can be communication and network overhead. We recommend that you use the same physical AZ as the one used by the SageMaker HyperPod service account to avoid cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that the file system is configured with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure your SageMaker HyperPod clusters with your VPC.
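
To check which AZ your FSx for Lustre file system lives in before you pick subnets for the cluster, you can resolve the file system's subnet to its AZ, as in the following sketch; the file system ID and Region are placeholders.

    import boto3

    fsx = boto3.client("fsx", region_name="us-west-2")
    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Look up the subnet of the FSx for Lustre file system (placeholder ID),
    # then resolve that subnet's AZ name and physical AZ ID.
    fs = fsx.describe_file_systems(FileSystemIds=["fs-0123456789abcdef0"])["FileSystems"][0]
    subnet = ec2.describe_subnets(SubnetIds=[fs["SubnetIds"][0]])["Subnets"][0]
    print(f"FSx file system AZ: {subnet['AvailabilityZone']} ({subnet['AvailabilityZoneId']})")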