Amazon SageMaker HyperPod release notes
This topic contains release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod. Use this page to find general feature releases, updates, and improvements for Amazon SageMaker HyperPod.
The HyperPod AMI releases are documented separately and cover the key components, including general AMI releases, versions, and dependencies. For this information, see Amazon SageMaker HyperPod AMI releases.
SageMaker HyperPod release notes: February 20, 2025
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm and Orchestrating SageMaker HyperPod clusters with Amazon EKS.
New features and improvements
-
Added support for deleting instance groups from your SageMaker HyperPod cluster. For more information, see Delete instance groups from EKS-orchestrated clusters and Scale down a cluster for Slurm-orchestrated clusters.
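As a rough illustration of the instance group deletion described above, the following Python sketch builds the arguments for an UpdateCluster call. It assumes the request field is named InstanceGroupsToDelete and that the client is a boto3 SageMaker client; the helper names, cluster name, and group names are hypothetical. Treat the linked pages as the authoritative procedure.

```python
def build_delete_groups_request(cluster_name, group_names):
    """Build keyword arguments for an UpdateCluster call that removes instance groups.

    Assumption: the request field is named InstanceGroupsToDelete.
    """
    names = [name for name in group_names if name]
    if not names:
        raise ValueError("at least one instance group name is required")
    # Drop duplicate names while keeping the original order.
    unique = list(dict.fromkeys(names))
    return {"ClusterName": cluster_name, "InstanceGroupsToDelete": unique}


def delete_instance_groups(client, cluster_name, group_names):
    """Send the request through a SageMaker client, e.g. boto3.client("sagemaker")."""
    request = build_delete_groups_request(cluster_name, group_names)
    return client.update_cluster(**request)
```

Keeping the request builder separate from the call makes it possible to inspect or log the request before anything is deleted.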
SageMaker HyperPod release notes: February 18, 2025
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm and Orchestrating SageMaker HyperPod clusters with Amazon EKS.
New features
-
This release of SageMaker HyperPod incorporates a security update from the NVIDIA Container Toolkit (from version 1.17.3 to version 1.17.4). For more information, see v1.17.4 release note.
Note
For all container workloads in the NVIDIA Container Toolkit version 1.17.4, the mounting of CUDA compatibility libraries is now disabled. To ensure compatibility with multiple CUDA versions on container workflows, update your LD_LIBRARY_PATH to include your CUDA compatibility libraries. You can find the specific steps in If you use a CUDA compatibility layer.
For information about related AMI releases, see SageMaker HyperPod AMI releases for Slurm: February 18, 2025 and SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025.
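The LD_LIBRARY_PATH update mentioned in the note above could look roughly like the following Python sketch, which prepares an environment for a child process. The directory /usr/local/cuda/compat is an assumed location for the CUDA compatibility libraries, and the helper names are hypothetical; follow the linked steps for the exact path on your image.

```python
import os

# Assumed location of the CUDA compatibility libraries; adjust for your image.
CUDA_COMPAT_DIR = "/usr/local/cuda/compat"


def prepend_library_path(current, new_dir):
    """Return an LD_LIBRARY_PATH value with new_dir first and no duplicate entry."""
    parts = [p for p in (current or "").split(":") if p and p != new_dir]
    return ":".join([new_dir] + parts)


def child_environment(compat_dir=CUDA_COMPAT_DIR):
    """Copy the current environment and put the compatibility directory first."""
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = prepend_library_path(env.get("LD_LIBRARY_PATH"), compat_dir)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the workload, so the change applies to the child process rather than to libraries already loaded by the current one.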
SageMaker HyperPod release notes: February 06, 2025
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm and Orchestrating SageMaker HyperPod clusters with Amazon EKS.
New features and improvements
-
Enhanced SageMaker HyperPod multi-AZ support: You can specify different subnets and security groups, cutting across different Availability Zones, for individual instance groups within your cluster. For more information about SageMaker HyperPod multi-AZ support, see Setting up SageMaker HyperPod clusters across multiple AZs.
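To make the per-group network settings more concrete, here is a minimal Python sketch that adds subnets and security groups to a single instance group specification. It assumes the specification field is named OverrideVpcConfig with Subnets and SecurityGroupIds keys; the helper name and all IDs are hypothetical placeholders.

```python
def instance_group_with_vpc_override(base_group, subnet_ids, security_group_ids):
    """Return a copy of an instance group spec with its own subnets and security groups.

    Assumption: the field is OverrideVpcConfig with Subnets and SecurityGroupIds.
    """
    if not subnet_ids or not security_group_ids:
        raise ValueError("subnet_ids and security_group_ids must both be non-empty")
    group = dict(base_group)  # shallow copy so the caller's spec is left untouched
    group["OverrideVpcConfig"] = {
        "Subnets": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
    }
    return group
```

For example, two groups built from the same base specification can each receive a subnet in a different Availability Zone before being passed in the InstanceGroups list of a cluster request.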
SageMaker HyperPod release notes: January 22, 2025
AMI releases
SageMaker HyperPod release notes: January 09, 2025
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features and improvements
-
Added IPv6 support: Clusters can use IPv6 addressing when configured with IPv6-enabled VPC and subnets. For more information, see Setting up SageMaker HyperPod with a custom Amazon VPC.
SageMaker HyperPod release notes: December 21, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
SageMaker HyperPod now supports the following instance types for both Slurm and Amazon EKS clusters.
-
New instance types: C6gn, C6i, M6i, R6i.
-
New Trainium instance types: Trn1 and Trn1n.
Improvements
-
Enhanced error logging visibility when Slurm interrupts jobs, and prevented unnecessary job step termination during Slurm-initiated job cancellations.
-
Updated base DLAMI for p5en for both Slurm and Amazon EKS clusters.
AMI releases
SageMaker HyperPod release notes: December 13, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New feature
-
SageMaker HyperPod releases a set of Amazon CloudWatch metrics to monitor the health and performance of SageMaker HyperPod Slurm clusters. These metrics cover CPU, GPU, and memory utilization, as well as cluster instance information such as node counts and failed nodes. This monitoring feature is enabled by default, and the metrics can be accessed under the /aws/sagemaker/Clusters CloudWatch namespace. You can also set up CloudWatch alarms based on these metrics to proactively detect and address potential issues within your Slurm-based HyperPod clusters. For more information, see Amazon SageMaker HyperPod Slurm metrics.
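One way to see which metrics are published in the namespace above is to list them through the CloudWatch ListMetrics API. The Python sketch below is illustrative only: it takes a CloudWatch client as a parameter (for example boto3.client("cloudwatch")), and the helper name is hypothetical. The metric names themselves are not assumed here; see the linked metrics page for the documented list.

```python
NAMESPACE = "/aws/sagemaker/Clusters"


def list_metric_names(cloudwatch, namespace=NAMESPACE):
    """Return the sorted, de-duplicated metric names published in a namespace."""
    names = set()
    kwargs = {"Namespace": namespace}
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        names.update(metric["MetricName"] for metric in page.get("Metrics", []))
        token = page.get("NextToken")
        if not token:
            break
        # Request the next page of results.
        kwargs["NextToken"] = token
    return sorted(names)
```

The names returned by this helper can then be used when defining CloudWatch alarms for the cluster.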
AMI releases
SageMaker HyperPod release notes: November 24, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added support for configuring SageMaker HyperPod clusters across multiple Availability Zones. For more information about SageMaker HyperPod multi-AZ support, see Setting up SageMaker HyperPod clusters across multiple AZs.
AMI releases
SageMaker HyperPod release notes: November 15, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm. For more information, see and SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024.
New features and improvements
-
Added support for trn1 and trn1n instance types for both Amazon EKS and Slurm orchestrated clusters.
-
Improved log management for Slurm clusters:
-
Implemented log rotation: weekly or daily based on size.
-
Set log retention to 3 weeks.
-
Compressed logs to reduce storage impact.
-
Continued uploading logs to CloudWatch for long-term retention.
Note
Some logs are still stored in syslogs.
-
Adjusted Fluent Bit settings to prevent tracking issues with files containing long lines.
Bug fixes
-
Prevented unintended truncation with Slurm controller node updates in configuration file slurm.config.
AMI releases
SageMaker HyperPod release notes: November 11, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New feature
-
SageMaker HyperPod AMI now supports G6e instance types.
AMI releases
SageMaker HyperPod release notes: October 31, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added scaling down SageMaker HyperPod clusters at the instance group level and instance level for both Amazon EKS and Slurm orchestrated clusters. For more information about scaling down Amazon EKS clusters, see Scale down a SageMaker HyperPod cluster. For more information about scaling down Slurm clusters, see Scale down a cluster in Using the AWS CLI.
-
SageMaker HyperPod now supports the P5e instance type for both Amazon EKS and Slurm orchestrated clusters.
SageMaker HyperPod release notes: October 21, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New feature
-
SageMaker HyperPod now supports the P5e[n], G6, Gr6, and Trn2[n] instance types for both Slurm and Amazon EKS clusters.
AMI releases
SageMaker HyperPod release notes: September 10, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added Amazon EKS support in SageMaker HyperPod. To learn more, see Orchestrating SageMaker HyperPod clusters with Amazon EKS.
-
Added support for managing SageMaker HyperPod clusters through AWS CloudFormation and Terraform. For more information about managing HyperPod clusters through AWS CloudFormation, see CloudFormation documentation for AWS::SageMaker::Cluster. To learn about managing HyperPod clusters through Terraform, see Terraform documentation for awscc_sagemaker_cluster.
AMI releases
SageMaker HyperPod release notes: August 20, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Enhanced the SageMaker HyperPod auto-resume functionality, extending the resiliency capability to Slurm nodes attached with Generic Resources (GRES).
When Generic Resources (GRES) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with the GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.
Other changes
-
Pre-packaged slurmrestd in the SageMaker HyperPod AMI.
-
Changed the default values for ResumeTimeout and UnkillableStepTimeout from 60 seconds to 300 seconds in slurm.conf to improve system responsiveness and job handling.
-
Made minor improvements on health checks for NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA System Management Interface (nvidia-smi).
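To check whether an existing slurm.conf still carries the older timeout values mentioned above, a small script along the following lines may help. This is a simplified sketch: it only handles plain Key=Value lines and comments, not every slurm.conf construct, and the function names are hypothetical.

```python
def read_slurm_settings(text):
    """Parse simple Key=Value lines from slurm.conf text, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Keys are stored in lowercase so lookups do not depend on letter case.
        settings[key.strip().lower()] = value.strip()
    return settings


def timeouts_below(text, minimum=300, keys=("ResumeTimeout", "UnkillableStepTimeout")):
    """Return {key: seconds} for the named timeouts that are set below minimum."""
    settings = read_slurm_settings(text)
    low = {}
    for key in keys:
        raw = settings.get(key.lower())
        if raw is not None and raw.isdigit() and int(raw) < minimum:
            low[key] = int(raw)
    return low
```

Passing the contents of a slurm.conf file to timeouts_below returns the timeouts that are explicitly set below 300 seconds; parameters that are absent from the file are not reported.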
Bug fixes
-
The HyperPod auto-resume plug-in can use idle nodes to resume a job.
SageMaker HyperPod release notes: June 20, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added a new capability of attaching additional storage to SageMaker HyperPod cluster instances. With this capability, you can configure supplementary storage at the instance group configuration level during cluster creation or update, either through the SageMaker HyperPod console or the CreateCluster and UpdateCluster APIs. The additional EBS volume is attached to each instance within a SageMaker HyperPod cluster and mounted to /opt/sagemaker. To learn more about implementing it in your SageMaker HyperPod cluster, see the updated documentation on the following pages.
Note that you need to update the HyperPod cluster software to use this capability. After patching the HyperPod cluster software, you can use this capability for existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
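As an illustration of configuring the additional storage at the instance group level, the Python sketch below adds an EBS volume entry to an instance group specification. It assumes the specification field is InstanceStorageConfigs with an EbsVolumeConfig entry that carries VolumeSizeInGB; the helper name and sample values are hypothetical.

```python
def instance_group_with_extra_storage(base_group, volume_size_gb):
    """Return a copy of an instance group spec with an additional EBS volume entry.

    Assumption: the field is InstanceStorageConfigs, holding EbsVolumeConfig items.
    """
    size = int(volume_size_gb)
    if size <= 0:
        raise ValueError("volume_size_gb must be a positive integer")
    group = dict(base_group)  # shallow copy so the caller's spec is left untouched
    configs = list(group.get("InstanceStorageConfigs", []))
    configs.append({"EbsVolumeConfig": {"VolumeSizeInGB": size}})
    group["InstanceStorageConfigs"] = configs
    return group
```

The resulting specification would be passed in the InstanceGroups list of a CreateCluster or UpdateCluster request.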
Upgrade steps
-
Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.
Important
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.
aws sagemaker update-cluster-software --cluster-name your-cluster-name
Note
Use the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
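The same UpdateClusterSoftware API can also be called from a script. The following Python sketch is a minimal illustration that assumes a boto3 SageMaker client is passed in; the wrapper name is hypothetical. The backup warning above applies equally to this approach.

```python
def update_cluster_software(client, cluster_name):
    """Call UpdateClusterSoftware for one cluster and return the cluster ARN from the response.

    client is expected to be a SageMaker client, e.g. boto3.client("sagemaker").
    """
    if not cluster_name:
        raise ValueError("cluster_name is required")
    response = client.update_cluster_software(ClusterName=cluster_name)
    return response.get("ClusterArn")
```

Passing the client in as a parameter keeps the wrapper easy to exercise with a stub before running it against a real cluster.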
SageMaker HyperPod release notes: April 24, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
Bug fixes
-
Fixed a bug with the ThreadsPerCore parameter in the ClusterInstanceGroupSpecification API. With the fix, the CreateCluster and UpdateCluster APIs properly take and apply the user input through ThreadsPerCore. This fix is effective on HyperPod clusters created after April 24, 2024. If you had issues with this bug and want to get this fix applied to your cluster, you need to create a new cluster. Make sure that you back up and restore your work while moving to a new cluster following the instructions at Use the backup script provided by SageMaker HyperPod.
SageMaker HyperPod release notes: March 27, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
HyperPod software patch
The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.
-
In this release of the HyperPod DLAMI, Slurm is built with REST service (slurmrestd) with JSON, YAML, and JWT support.
-
Upgraded Slurm to v23.11.3.
Improvements
-
Increased auto-resume service timeout to 60 minutes.
-
Improved instance replacement process to not restart the Slurm controller.
-
Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.
Bug fixes
-
Fixed a bug with chrony service that caused an issue with time synchronization.
-
Fixed a bug with parsing slurm.conf.
-
Fixed an issue with the NVIDIA go-dcgm library.
SageMaker HyperPod release notes: March 14, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
Improvements
-
HyperPod now properly supports passing partition names provided through provisioning_params.json and creates partitions appropriately based on provided inputs. For more information about provisioning_params.json, see SageMaker HyperPod forms and Customize SageMaker HyperPod clusters using lifecycle scripts.
AMI releases
SageMaker HyperPod release notes: February 15, 2024
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Added a new UpdateClusterSoftware API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update existing SageMaker HyperPod clusters in your account by running aws sagemaker update-cluster-software --cluster-name your-cluster-name. To follow up with future security patches, keep tracking this Amazon SageMaker HyperPod release notes page. To learn how the UpdateClusterSoftware API works, see Update the SageMaker HyperPod platform software of a cluster.
SageMaker HyperPod release notes: November 29, 2023
SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.
New features
-
Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.
AMI releases