Amazon SageMaker HyperPod release notes

The following release notes track the latest updates for Amazon SageMaker HyperPod. These release notes describe new features, fixes, and improvements made since the previous version.

SageMaker HyperPod release notes: October 31, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Added scaling down SageMaker HyperPod clusters at the instance group level and instance level for both Amazon EKS and Slurm orchestrated clusters. For more information about scaling down Amazon EKS clusters, see Scale down a SageMaker HyperPod cluster. For more information about scaling down Slurm clusters, see Scale down a cluster in Using the AWS CLI.

  • SageMaker HyperPod now supports the G6, G6e, and P5e instance types for both Amazon EKS and Slurm orchestrated clusters.
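As a rough illustration of instance-level scale-down, the sketch below wraps the AWS CLI `batch-delete-cluster-nodes` command. The function name, the `AWS_CLI` override variable, and the cluster name and node IDs in the usage line are hypothetical placeholders, not values from these release notes. Check the linked scale-down pages for the exact procedure that applies to your orchestrator.

```shell
#!/usr/bin/env bash
# AWS_CLI is an illustrative override so the call can be dry-run with a stub;
# it defaults to the real AWS CLI.
AWS_CLI="${AWS_CLI:-aws}"

# Remove specific instances from a cluster.
# $1 is the cluster name; the remaining arguments are the node IDs to remove.
delete_cluster_nodes() {
  local cluster_name="$1"
  shift
  "$AWS_CLI" sagemaker batch-delete-cluster-nodes \
    --cluster-name "$cluster_name" --node-ids "$@"
}
```

For example, `delete_cluster_nodes your-cluster-name i-0123456789abcdef0` would request removal of a single node. Scaling down at the instance group level is typically done by lowering the instance count of the group in an UpdateCluster request instead.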

SageMaker HyperPod release notes: September 10, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features

SageMaker HyperPod DLAMI for Amazon EKS support

The following is a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on Amazon Linux 2 (AL2) and supports a specific Kubernetes version.

The AMIs include the following:

Deep Learning EKS AMI 1.28
  • Amazon EKS Components

    • Kubernetes Version: 1.28.11

    • Containerd Version: 1.7.20

    • Runc Version: 1.1.11

    • AWS IAM Authenticator: 0.6.21

  • Amazon SSM Agent: 3.3.380

  • Linux Kernel: 5.10.223

  • OSS NVIDIA driver: 535.183.01

  • NVIDIA CUDA: 12.2

  • EFA Installer: 1.32.0

  • GDRCopy: 2.4

  • NVIDIA container toolkit: 1.16.1

  • AWS OFI NCCL: 1.9.1

  • aws-neuronx-tools: 2.18.3.0-1

  • aws-neuronx-runtime-lib: 2.21.41.0

  • aws-neuronx-oci-hook: 2.4.4.0-1

  • aws-neuronx-dkms: 2.17.17.0

  • aws-neuronx-collectives: 2.21.46.0

Deep Learning EKS AMI 1.29
  • Amazon EKS Components

    • Kubernetes Version: 1.29.6

    • Containerd Version: 1.7.20

    • Runc Version: 1.1.11

    • AWS IAM Authenticator: 0.6.21

  • Amazon SSM Agent: 3.3.380

  • Linux Kernel: 5.10.223

  • OSS NVIDIA driver: 535.183.01

  • NVIDIA CUDA: 12.2

  • EFA Installer: 1.32.0

  • GDRCopy: 2.4

  • NVIDIA container toolkit: 1.16.1

  • AWS OFI NCCL: 1.9.1

  • aws-neuronx-tools: 2.18.3.0-1

  • aws-neuronx-runtime-lib: 2.21.41.0

  • aws-neuronx-oci-hook: 2.4.4.0-1

  • aws-neuronx-dkms: 2.17.17.0

  • aws-neuronx-collectives: 2.21.46.0

Deep Learning EKS AMI 1.30
  • Amazon EKS Components

    • Kubernetes Version: 1.30.2

    • Containerd Version: 1.7.20

    • Runc Version: 1.1.11

    • AWS IAM Authenticator: 0.6.21

  • Amazon SSM Agent: 3.3.380

  • Linux Kernel: 5.10.223

  • OSS NVIDIA driver: 535.183.01

  • NVIDIA CUDA: 12.2

  • EFA Installer: 1.32.0

  • GDRCopy: 2.4

  • NVIDIA container toolkit: 1.16.1

  • AWS OFI NCCL: 1.9.1

  • aws-neuronx-tools: 2.18.3.0-1

  • aws-neuronx-runtime-lib: 2.21.41.0

  • aws-neuronx-oci-hook: 2.4.4.0-1

  • aws-neuronx-dkms: 2.17.17.0

  • aws-neuronx-collectives: 2.21.46.0

SageMaker HyperPod DLAMI for Slurm support

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI for Slurm.

Note

To find instructions on updating existing HyperPod clusters with the latest HyperPod DLAMI, see Update the SageMaker HyperPod platform software of a cluster.

  • Installed the NVIDIA driver v550.90.07

  • Installed the EFA driver v2.10

  • Installed the latest version of AWS Neuron SDK

    • aws-neuronx-collectives: v2.21.46.0

    • aws-neuronx-dkms: v2.17.17.0

    • aws-neuronx-oci-hook: v2.4.4.0

    • aws-neuronx-runtime-lib: v2.21.41.0

    • aws-neuronx-tools: v2.18.3.0

SageMaker HyperPod release notes: August 20, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Enhanced the SageMaker HyperPod auto-resume functionality, extending the resiliency capability to Slurm nodes with Generic Resources (GRES) attached.

    When Generic Resources (GRES) are attached to a Slurm node, Slurm typically doesn't permit changes to the node allocation, such as replacing nodes, and therefore doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with the GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

Other changes

  • Pre-packaged slurmrestd in the SageMaker HyperPod AMI.

  • Changed the default values for ResumeTimeout and UnkillableStepTimeout from 60 seconds to 300 seconds in slurm.conf to improve system responsiveness and job handling.

  • Made minor improvements to health checks for NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA System Management Interface (nvidia-smi).
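To confirm which timeout values a cluster picked up, one option is to read them directly from slurm.conf. The following is a minimal sketch; the function name is hypothetical, and the path in the usage line is an assumption that may differ on your cluster.

```shell
#!/usr/bin/env bash
# Print the ResumeTimeout and UnkillableStepTimeout settings found in a
# slurm.conf file. The caller supplies the file path.
show_slurm_timeouts() {
  local conf_file="$1"
  # slurm.conf parameter names are case-insensitive, so match without regard
  # to case. Lines that start with '#' are comments and do not match.
  grep -Ei '^[[:space:]]*(ResumeTimeout|UnkillableStepTimeout)[[:space:]]*=' "$conf_file"
}
```

For example, `show_slurm_timeouts /opt/slurm/etc/slurm.conf` (assumed path) prints the matching lines. On a running controller, `scontrol show config` also reports the values that Slurm is using.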

Bug fixes

  • Fixed an issue so that the HyperPod auto-resume plug-in can use idle nodes to resume a job.

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
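The update command returns before patching finishes, so it can help to check the cluster status afterwards. The sketch below pairs the update call with a status lookup through `describe-cluster`. The function names and the `AWS_CLI` override variable are hypothetical helpers introduced for illustration, not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# AWS_CLI is an illustrative override so the calls can be dry-run with a stub;
# it defaults to the real AWS CLI.
AWS_CLI="${AWS_CLI:-aws}"

# Start the platform software update for one cluster.
update_cluster_software() {
  local cluster_name="$1"
  "$AWS_CLI" sagemaker update-cluster-software --cluster-name "$cluster_name"
}

# Print the current status of the cluster, for example InService or SystemUpdating.
cluster_status() {
  local cluster_name="$1"
  "$AWS_CLI" sagemaker describe-cluster --cluster-name "$cluster_name" \
    --query 'ClusterStatus' --output text
}
```

After backing up your data, you might run `update_cluster_software your-cluster-name` and then call `cluster_status your-cluster-name` periodically until it reports InService again.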

SageMaker HyperPod release notes: June 20, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Added a new capability of attaching additional storage to SageMaker HyperPod cluster instances. With this capability, you can configure supplementary storage at the instance group configuration level during the cluster creation or update processes, either through the SageMaker HyperPod console or the CreateCluster and UpdateCluster APIs. The additional EBS volume is attached to each instance within a SageMaker HyperPod cluster and mounted to /opt/sagemaker. To learn more about implementing it in your SageMaker HyperPod cluster, see the updated documentation on the following pages.

    Note that you need to update the HyperPod cluster software to use this capability. After patching the HyperPod cluster software, you can utilize this capability for existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
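    As an illustration of what the instance group configuration might look like, the sketch below prints a JSON fragment that requests one additional EBS volume through the InstanceStorageConfigs field. The function name and all argument values are placeholders. A complete instance group definition for CreateCluster or UpdateCluster also needs fields such as LifeCycleConfig and ExecutionRole, which are omitted here.

```shell
#!/usr/bin/env bash
# Print a partial instance group definition with one additional EBS volume.
# Arguments: group name, instance type, instance count, volume size in GB.
# This fragment is illustrative; it is not a complete CreateCluster request.
instance_group_with_storage() {
  local group_name="$1" instance_type="$2" instance_count="$3" volume_gb="$4"
  cat <<EOF
{
  "InstanceGroupName": "${group_name}",
  "InstanceType": "${instance_type}",
  "InstanceCount": ${instance_count},
  "InstanceStorageConfigs": [
    { "EbsVolumeConfig": { "VolumeSizeInGB": ${volume_gb} } }
  ]
}
EOF
}
```

    For example, `instance_group_with_storage worker-group-1 ml.p5.48xlarge 2 500` prints a fragment that asks for a 500 GB volume per instance. After the instance group is created, running `df -h /opt/sagemaker` on an instance is one way to check that the volume is mounted.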

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

SageMaker HyperPod release notes: April 24, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

Bug fixes

  • Fixed a bug with the ThreadsPerCore parameter in the ClusterInstanceGroupSpecification API. With the fix, the CreateCluster and UpdateCluster APIs properly take and apply the user input through ThreadsPerCore. This fix is effective on HyperPod clusters created after April 24, 2024. If you had issues with this bug and want to get this fix applied to your cluster, you need to create a new cluster. Make sure that you back up and restore your work while moving to a new cluster following the instructions at Use the backup script provided by SageMaker HyperPod.

SageMaker HyperPod release notes: March 27, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

HyperPod software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • In this release of the HyperPod DLAMI, Slurm is built with the REST service (slurmrestd) with JSON, YAML, and JWT support.

  • Upgraded Slurm to v23.11.3

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

Improvements

  • Increased auto-resume service timeout to 60 minutes.

  • Improved instance replacement process to not restart the Slurm controller.

  • Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.

Bug fixes

  • Fixed a bug with chrony service that caused an issue with time synchronization.

  • Fixed a bug with parsing slurm.conf.

  • Fixed an issue with NVIDIA go-dcgm library.

SageMaker HyperPod release notes: March 14, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

HyperPod DLAMI for Slurm software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • Upgraded Slurm to v23.11.1

  • Added OpenPMIx v4.2.6 for enabling Slurm with PMIx.

  • Built upon the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) released on 2023-10-26

  • A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI

    • Slurm: v23.11.1

    • OpenPMIx: v4.2.6

    • Munge: v0.5.15

    • aws-neuronx-dkms: v2.*

    • aws-neuronx-collectives: v2.*

    • aws-neuronx-runtime-lib: v2.*

    • aws-neuronx-tools: v2.*

    • SageMaker HyperPod software packages to support features such as cluster health check and auto-resume

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

SageMaker HyperPod release notes: February 15, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Added a new UpdateClusterSoftware API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update existing SageMaker HyperPod clusters in your account by running aws sagemaker update-cluster-software --cluster-name your-cluster-name. To stay informed about future security patches, keep checking this Amazon SageMaker HyperPod release notes page. To learn how the UpdateClusterSoftware API works, see Update the SageMaker HyperPod platform software of a cluster.

SageMaker HyperPod release notes: November 29, 2023

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.

HyperPod software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • Built upon the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) released on 2023-10-18

  • A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI

    • Slurm: v23.02.3

    • Munge: v0.5.15

    • aws-neuronx-dkms: v2.*

    • aws-neuronx-collectives: v2.*

    • aws-neuronx-runtime-lib: v2.*

    • aws-neuronx-tools: v2.*

    • SageMaker HyperPod software packages to support features such as cluster health check and auto-resume