Amazon SageMaker HyperPod release notes
The following release notes track the latest updates for Amazon SageMaker HyperPod. These release notes describe new features, fixes, and improvements made since the previous version.
SageMaker HyperPod release notes: October 31, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added scaling down SageMaker HyperPod clusters at the instance group level and instance level for both Amazon EKS and Slurm orchestrated clusters. For more information about scaling down Amazon EKS clusters, see Scale down a SageMaker HyperPod cluster. For more information about scaling down Slurm clusters, see Scale down a cluster in Using the AWS CLI.
-
SageMaker HyperPod now supports the G6, G6e, and P5e instance types for both Amazon EKS and Slurm orchestrated clusters.
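As a minimal sketch of instance-level scale-down from the AWS CLI, the helper below only prints a `batch-delete-cluster-nodes` command so that it can be reviewed before it is run. The function name, the cluster name, and the node IDs are placeholders invented for this example; see the scale-down pages referenced above for the supported procedure.

```shell
# Sketch: build (but do not run) the command that removes specific nodes.
# print_delete_nodes_command is a hypothetical helper, not part of any AWS tool.
print_delete_nodes_command() {
  if [ "$#" -lt 2 ]; then
    echo "usage: print_delete_nodes_command <cluster-name> <node-id>..." >&2
    return 1
  fi
  cluster_name="$1"
  shift
  # Remaining arguments are the IDs of the nodes to remove.
  echo "aws sagemaker batch-delete-cluster-nodes --cluster-name $cluster_name --node-ids $*"
}
```

For example, `print_delete_nodes_command your-cluster-name i-EXAMPLE1 i-EXAMPLE2` prints the command to review. Scaling down at the instance group level is instead done by lowering the instance count of the group through the UpdateCluster API.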
SageMaker HyperPod release notes: September 10, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added Amazon EKS support in SageMaker HyperPod. To learn more, see Orchestrating SageMaker HyperPod clusters with Amazon EKS.
-
Added support for managing SageMaker HyperPod clusters through AWS CloudFormation and Terraform. For more information about managing HyperPod clusters through AWS CloudFormation, see the CloudFormation documentation for AWS::SageMaker::Cluster. To learn about managing HyperPod clusters through Terraform, see the Terraform documentation for awscc_sagemaker_cluster.
SageMaker HyperPod DLAMI for Amazon EKS support
The following is a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on Amazon Linux 2 (AL2) and supports a specific Kubernetes version.
The AMIs include the following:
SageMaker HyperPod DLAMI for Slurm support
The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI for Slurm.
Note
To find instructions on updating existing HyperPod clusters with the latest HyperPod DLAMI, see Update the SageMaker HyperPod platform software of a cluster.
-
Installed the NVIDIA driver v550.90.07
-
Installed the EFA driver v2.10
-
Installed the latest version of AWS Neuron SDK
-
aws-neuronx-collectives: v2.21.46.0
-
aws-neuronx-dkms: v2.17.17.0
-
aws-neuronx-oci-hook: v2.4.4.0
-
aws-neuronx-runtime-lib: v2.21.41.0
-
aws-neuronx-tools: v2.18.3.0
SageMaker HyperPod release notes: August 20, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Enhanced the SageMaker HyperPod auto-resume functionality, extending the resiliency capability for Slurm nodes attached with Generic RESources (GRES).
When Generic Resources (GRES) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with the GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.
Other changes
-
Pre-packaged slurmrestd in the SageMaker HyperPod AMI.
-
Changed the default values for ResumeTimeout and UnkillableStepTimeout from 60 seconds to 300 seconds in slurm.conf to improve system responsiveness and job handling.
-
Made minor improvements on health checks for NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA System Management Interface (nvidia-smi).
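To confirm that a cluster picked up the new timeout defaults listed above, the values can be read back from slurm.conf. The following is a minimal sketch: the helper name `slurm_conf_value` is made up for this example, and the path to slurm.conf is left as an argument because it depends on how the cluster was set up.

```shell
# Sketch: print the value of one parameter from a slurm.conf file.
# $1 = parameter name (matched case-insensitively), $2 = path to slurm.conf
slurm_conf_value() {
  awk -F= -v key="$1" '
    /^[ \t]*#/ { next }                      # skip comment lines
    { name = $1; gsub(/[ \t]/, "", name) }   # parameter name without blanks
    tolower(name) == tolower(key) {
      value = $2
      sub(/[ \t]*#.*/, "", value)            # drop a trailing comment
      gsub(/^[ \t]+|[ \t]+$/, "", value)     # trim surrounding blanks
    }
    END { if (value != "") print value }     # last matching line wins
  ' "$2"
}
```

Usage would look like `slurm_conf_value ResumeTimeout /path/to/slurm.conf`. On a running cluster, `scontrol show config` reports the values that the Slurm controller actually loaded, which is the more direct check.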
Bug fixes
-
The HyperPod auto-resume plug-in can use idle nodes to resume a job.
Upgrade steps
-
Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.
Important
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.
aws sagemaker update-cluster-software --cluster-name your-cluster-name
Note
Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
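Because the update replaces the root volume, it can help to wrap the command so that nothing runs until the backup is confirmed. The sketch below is one way to do that: the function name and the `DRY_RUN` variable are assumptions made for this example, and only the `aws sagemaker update-cluster-software` command itself comes from these release notes.

```shell
# Sketch: print the update command by default; run it only when DRY_RUN=0.
# update_cluster_software and DRY_RUN are hypothetical names for this example.
update_cluster_software() {
  cluster_name="$1"
  if [ -z "$cluster_name" ]; then
    echo "usage: update_cluster_software <cluster-name>" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show the command so it can be reviewed after the backup is done.
    echo "aws sagemaker update-cluster-software --cluster-name $cluster_name"
  else
    aws sagemaker update-cluster-software --cluster-name "$cluster_name"
  fi
}
```

Calling `update_cluster_software your-cluster-name` prints the command. After the backup has been verified, `DRY_RUN=0 update_cluster_software your-cluster-name` runs it, and `aws sagemaker describe-cluster --cluster-name your-cluster-name` can then be used to follow the cluster status.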
SageMaker HyperPod release notes: June 20, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added a new capability of attaching additional storage to SageMaker HyperPod cluster instances. With this capability, you can configure supplementary storage at the instance group configuration level during the cluster creation or update processes, either through the SageMaker HyperPod console or the CreateCluster and UpdateCluster APIs. The additional EBS volume is attached to each instance within a SageMaker HyperPod cluster and mounted to /opt/sagemaker. To learn more about implementing it in your SageMaker HyperPod cluster, see the updated documentation on the following pages.
Note that you need to update the HyperPod cluster software to use this capability. After patching the HyperPod cluster software, you can utilize this capability for existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
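As an illustration of where the additional volume is declared, the sketch below writes an instance group definition with an `InstanceStorageConfigs` entry to a file. The helper name and every value in the JSON (group name, instance type, count, S3 URI, role ARN) are placeholders for this example, so check the field names against the CreateCluster API reference before relying on them.

```shell
# Sketch: write an instance group definition that requests an extra EBS volume.
# $1 = volume size in GB, $2 = output file. All JSON values are placeholders.
write_instance_groups_json() {
  case "$1" in
    ''|*[!0-9]*) echo "volume size must be a whole number of GB" >&2; return 1 ;;
  esac
  cat > "$2" <<EOF
[
  {
    "InstanceGroupName": "worker-group-1",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 2,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/ExampleHyperPodRole",
    "InstanceStorageConfigs": [
      { "EbsVolumeConfig": { "VolumeSizeInGB": $1 } }
    ]
  }
]
EOF
}
```

After `write_instance_groups_json 500 instance-groups.json`, the file can be passed to the CLI with `aws sagemaker create-cluster --cluster-name your-cluster-name --instance-groups file://instance-groups.json`.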
Upgrade steps
-
Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.
Important
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.
aws sagemaker update-cluster-software --cluster-name your-cluster-name
Note
Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
SageMaker HyperPod release notes: April 24, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
Bug fixes
-
Fixed a bug with the ThreadsPerCore parameter in the ClusterInstanceGroupSpecification API. With the fix, the CreateCluster and UpdateCluster APIs properly take and apply the user input through ThreadsPerCore. This fix is effective on HyperPod clusters created after April 24, 2024. If you had issues with this bug and want to get this fix applied to your cluster, you need to create a new cluster. Make sure that you back up and restore your work while moving to a new cluster following the instructions at Use the backup script provided by SageMaker HyperPod.
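One way to check whether the requested ThreadsPerCore value was applied is to inspect the CPU topology on a cluster node. The following is a small sketch that extracts the value from `lscpu` output. The helper name is invented for this example, and it assumes the usual English `lscpu` label "Thread(s) per core".

```shell
# Sketch: read lscpu-style text on stdin and print the threads-per-core value.
threads_per_core_from_lscpu() {
  awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}
```

On a node, `lscpu | threads_per_core_from_lscpu` prints the value the operating system sees, which can be compared with the ThreadsPerCore setting passed for that instance group.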
SageMaker HyperPod release notes: March 27, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
HyperPod software patch
The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.
-
In this release of the HyperPod DLAMI, Slurm is built with REST service (slurmrestd) with JSON, YAML, and JWT support.
-
Upgraded Slurm to v23.11.3
Upgrade steps
-
Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.
Important
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.
aws sagemaker update-cluster-software --cluster-name your-cluster-name
Note
Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
Improvements
-
Increased auto-resume service timeout to 60 minutes.
-
Improved instance replacement process to not restart the Slurm controller.
-
Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.
Bug fixes
-
Fixed a bug with chrony service that caused an issue with time synchronization.
-
Fixed a bug with parsing slurm.conf.
-
Fixed an issue with NVIDIA go-dcgm library.
SageMaker HyperPod release notes: March 14, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
HyperPod DLAMI for Slurm software patch
The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.
-
Upgraded Slurm to v23.11.1
-
Added OpenPMIx v4.2.6 for enabling Slurm with PMIx.
-
Built upon the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) released on 2023-10-26
-
A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI
Upgrade steps
-
Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.
Important
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.
aws sagemaker update-cluster-software --cluster-name your-cluster-name
Note
Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
Improvements
-
HyperPod now properly supports passing partition names provided through provisioning_params.json and creates partitions appropriately based on provided inputs. For more information about provisioning_params.json, see SageMaker HyperPod forms and Customize SageMaker HyperPod clusters using lifecycle scripts.
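To make the partition mapping concrete, the sketch below writes a sample provisioning_params.json and lists the partition names it declares. The field layout is an assumption based on the sample lifecycle scripts, and the group names, partition names, and helper names are placeholders. Confirm the exact schema in the lifecycle script documentation referenced above.

```shell
# Sketch: write a sample provisioning_params.json (assumed layout, placeholder values).
write_sample_provisioning_params() {
  cat > "$1" <<'EOF'
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-group",
  "worker_groups": [
    { "instance_group_name": "worker-group-1", "partition_name": "dev" },
    { "instance_group_name": "worker-group-2", "partition_name": "prod" }
  ]
}
EOF
}

# Print each partition_name value found in the file, one per line.
list_partition_names() {
  grep -o '"partition_name"[ ]*:[ ]*"[^"]*"' "$1" | sed 's/.*"\([^"]*\)"$/\1/'
}
```

After the cluster is created, running `sinfo` on the controller node should show the same partition names, which gives a quick way to compare the input file with what Slurm actually configured.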
SageMaker HyperPod release notes: February 15, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added a new UpdateClusterSoftware API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update existing SageMaker HyperPod clusters in your account by running aws sagemaker update-cluster-software --cluster-name your-cluster-name. To follow up with future security patches, keep tracking this Amazon SageMaker HyperPod release notes page. To learn how the UpdateClusterSoftware API works, see Update the SageMaker HyperPod platform software of a cluster.
SageMaker HyperPod release notes: November 29, 2023
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.
HyperPod software patch
The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.
-
Built upon the AWS Deep Learning Base GPU AMI (Ubuntu 20.04) released on 2023-10-18
-
A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI
-
Slurm: v23.02.3
-
Munge: v0.5.15
-
aws-neuronx-dkms: v2.*
-
aws-neuronx-collectives: v2.*
-
aws-neuronx-runtime-lib: v2.*
-
aws-neuronx-tools: v2.*
-
SageMaker HyperPod software packages to support features such as cluster health check and auto-resume
-