
Amazon SageMaker HyperPod release notes

This topic covers release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod, including general feature releases and improvements.

The HyperPod AMI releases are documented separately and include information about key components, such as general AMI releases, versions, and dependencies. For this information, see Amazon SageMaker HyperPod AMI releases.

SageMaker HyperPod release notes: February 20, 2025

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm and Orchestrating SageMaker HyperPod clusters with Amazon EKS.

New features and improvements

  • Added support for deleting instance groups from your SageMaker HyperPod cluster. For more information, see Delete instance groups from EKS-orchestrated clusters and Scale down a cluster for Slurm-orchestrated clusters.

SageMaker HyperPod release notes: February 18, 2025

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm and Orchestrating SageMaker HyperPod clusters with Amazon EKS.

New features

  • This release of SageMaker HyperPod incorporates a security update to the NVIDIA Container Toolkit (from version 1.17.3 to version 1.17.4). For more information, see the v1.17.4 release notes.

    Note

    In NVIDIA Container Toolkit version 1.17.4, mounting of CUDA compatibility libraries is disabled for all container workloads. To ensure compatibility with multiple CUDA versions in container workflows, update your LD_LIBRARY_PATH to include your CUDA compatibility libraries. You can find the specific steps in If you use a CUDA compatibility layer.
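    As a minimal sketch, assuming the compatibility libraries are installed under /usr/local/cuda/compat (a common but not guaranteed location; adjust the path for your container image), the LD_LIBRARY_PATH update inside the container could look like:

    ```shell
    # Hypothetical location of the CUDA compatibility libraries; adjust for your image
    CUDA_COMPAT_DIR=/usr/local/cuda/compat

    # Prepend so the compatibility libraries are found before the driver libraries
    export LD_LIBRARY_PATH="${CUDA_COMPAT_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

    echo "$LD_LIBRARY_PATH"
    ```

    Placing this in the container's entrypoint (or the job's launch script) applies it to every process the job starts.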

For information about related AMI releases, see SageMaker HyperPod AMI releases for Slurm: February 18, 2025 and SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025.

SageMaker HyperPod release notes: February 06, 2025

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm and Orchestrating SageMaker HyperPod clusters with Amazon EKS.

New features and improvements

  • Enhanced SageMaker HyperPod multi-AZ support: You can specify different subnets and security groups, spanning different Availability Zones, for individual instance groups within your cluster. For more information about SageMaker HyperPod multi-AZ support, see Setting up SageMaker HyperPod clusters across multiple AZs.
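    As an illustrative sketch, per-instance-group network settings are expressed in the instance group specification passed to CreateCluster. The field names below follow the ClusterInstanceGroupSpecification API shape as best understood; all IDs and group names are placeholders, and required fields such as LifeCycleConfig and ExecutionRole are omitted for brevity:

    ```json
    {
      "InstanceGroups": [
        {
          "InstanceGroupName": "worker-group-az1",
          "InstanceType": "ml.p5.48xlarge",
          "InstanceCount": 2,
          "OverrideVpcConfig": {
            "Subnets": ["subnet-0aaaaaaaaaaaaaaaa"],
            "SecurityGroupIds": ["sg-0aaaaaaaaaaaaaaaa"]
          }
        },
        {
          "InstanceGroupName": "worker-group-az2",
          "InstanceType": "ml.p5.48xlarge",
          "InstanceCount": 2,
          "OverrideVpcConfig": {
            "Subnets": ["subnet-0bbbbbbbbbbbbbbbb"],
            "SecurityGroupIds": ["sg-0bbbbbbbbbbbbbbbb"]
          }
        }
      ]
    }
    ```

    Each group's OverrideVpcConfig takes precedence over the cluster-level VPC configuration for the instances in that group.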

SageMaker HyperPod release notes: January 22, 2025

AMI releases

SageMaker HyperPod release notes: January 09, 2025

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features and improvements

SageMaker HyperPod release notes: December 21, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • SageMaker HyperPod now supports the following instance types for both Slurm and Amazon EKS clusters.

    • New instance types: C6gn, C6i, M6i, R6i.

    • New Trainium instance types: Trn1 and Trn1n.

Improvements

  • Enhanced error logging visibility when Slurm interrupts jobs, and prevented unnecessary job step termination during Slurm-initiated job cancellations.

  • Updated base DLAMI for p5en for both Slurm and Amazon EKS clusters.

AMI releases

SageMaker HyperPod release notes: December 13, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New feature

  • SageMaker HyperPod releases a set of Amazon CloudWatch metrics to monitor the health and performance of SageMaker HyperPod Slurm clusters. These metrics cover CPU, GPU, and memory utilization, as well as cluster instance information such as node counts and failed nodes. This monitoring feature is enabled by default, and the metrics are available under the /aws/sagemaker/Clusters CloudWatch namespace. You can also set up CloudWatch alarms based on these metrics to proactively detect and address potential issues within your Slurm-based HyperPod clusters. For more information, see Amazon SageMaker HyperPod Slurm metrics.
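    For example, an alarm on one of these metrics might be configured with a CLI input file like the following, passed to aws cloudwatch put-metric-alarm --cli-input-json file://alarm.json. The metric name (FailedNodeCount) and dimension used here are illustrative assumptions, not confirmed metric names; run aws cloudwatch list-metrics --namespace /aws/sagemaker/Clusters to see the metrics your cluster actually emits:

    ```json
    {
      "AlarmName": "hyperpod-failed-nodes",
      "Namespace": "/aws/sagemaker/Clusters",
      "MetricName": "FailedNodeCount",
      "Dimensions": [
        { "Name": "ClusterName", "Value": "your-cluster-name" }
      ],
      "Statistic": "Maximum",
      "Period": 300,
      "EvaluationPeriods": 1,
      "Threshold": 0,
      "ComparisonOperator": "GreaterThanThreshold"
    }
    ```

    With this configuration, the alarm fires whenever the maximum failed-node count over a 5-minute period exceeds zero.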

AMI releases

SageMaker HyperPod release notes: November 24, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features

AMI releases

SageMaker HyperPod release notes: November 15, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm. For more information, see SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024.

New features and improvements

  • Added support for Trn1 and Trn1n instance types for both Amazon EKS and Slurm orchestrated clusters.

  • Improved log management for Slurm clusters:

    • Implemented log rotation: weekly or daily based on size.

    • Set log retention to 3 weeks.

    • Compressed logs to reduce storage impact.

    • Continued uploading logs to CloudWatch for long-term retention.

      Note

      Some logs are still stored in syslogs.

  • Adjusted Fluent Bit settings to prevent tracking issues with files containing long lines.

Bug fixes

  • Prevented unintended truncation of the configuration file slurm.conf during Slurm controller node updates.

AMI releases

SageMaker HyperPod release notes: November 11, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New feature

  • SageMaker HyperPod AMI now supports G6e instance types.

AMI releases

SageMaker HyperPod release notes: October 31, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Added scaling down SageMaker HyperPod clusters at the instance group level and instance level for both Amazon EKS and Slurm orchestrated clusters. For more information about scaling down Amazon EKS clusters, see Scale down a SageMaker HyperPod cluster. For more information about scaling down Slurm clusters, see Scale down a cluster in Using the AWS CLI.

  • SageMaker HyperPod now supports the P5e instance type for both Amazon EKS and Slurm orchestrated clusters.

SageMaker HyperPod release notes: October 21, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New feature

  • SageMaker HyperPod now supports the P5e and P5en, G6, Gr6, and Trn2 and Trn2n instance types for both Slurm and Amazon EKS clusters.

AMI releases

SageMaker HyperPod release notes: September 10, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Amazon EKS and Orchestrating SageMaker HyperPod clusters with Slurm.

New features

AMI releases

SageMaker HyperPod release notes: August 20, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Enhanced the SageMaker HyperPod auto-resume functionality, extending the resiliency capability to Slurm nodes with Generic RESources (GRES) attached.

    When GRES are attached to a Slurm node, Slurm typically doesn't permit changes to the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

Other changes

  • Pre-packaged slurmrestd in the SageMaker HyperPod AMI.

  • Changed the default values for ResumeTimeout and UnkillableStepTimeout from 60 seconds to 300 seconds in slurm.conf to improve system responsiveness and job handling.

  • Made minor improvements to health checks for NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA System Management Interface (nvidia-smi).

Bug fixes

  • The HyperPod auto-resume plug-in can use idle nodes to resume a job.

SageMaker HyperPod release notes: June 20, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Added the capability to attach additional storage to SageMaker HyperPod cluster instances. With this capability, you can configure supplementary storage at the instance group level during cluster creation or update, either through the SageMaker HyperPod console or the CreateCluster and UpdateCluster APIs. The additional EBS volume is attached to each instance within a SageMaker HyperPod cluster and mounted to /opt/sagemaker. To learn more about implementing it in your SageMaker HyperPod cluster, see the updated documentation on the following pages.

    Note that you need to update the HyperPod cluster software to use this capability. After patching the HyperPod cluster software, you can use this capability on existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
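    A hedged sketch of what the instance-group-level storage configuration might look like in a CreateCluster or UpdateCluster request follows. The field names track the ClusterInstanceGroupSpecification API as best understood; the group name, instance type, and volume size are placeholders, and required fields such as LifeCycleConfig and ExecutionRole are omitted for brevity:

    ```json
    {
      "InstanceGroups": [
        {
          "InstanceGroupName": "worker-group-1",
          "InstanceType": "ml.p4d.24xlarge",
          "InstanceCount": 4,
          "InstanceStorageConfigs": [
            { "EbsVolumeConfig": { "VolumeSizeInGB": 500 } }
          ]
        }
      ]
    }
    ```

    Each instance in the group then receives its own EBS volume of the configured size, mounted to /opt/sagemaker as described above.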

Upgrade steps

  • Run the following command to call the UpdateClusterSoftware API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see Update the SageMaker HyperPod platform software of a cluster.

    Important

    Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see Use the backup script provided by SageMaker HyperPod.

    aws sagemaker update-cluster-software --cluster-name your-cluster-name

    Note

    Run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

SageMaker HyperPod release notes: April 24, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

Bug fixes

  • Fixed a bug with the ThreadsPerCore parameter in the ClusterInstanceGroupSpecification API. With the fix, the CreateCluster and UpdateCluster APIs properly accept and apply the user input for ThreadsPerCore. This fix is effective for HyperPod clusters created after April 24, 2024. If you encountered this bug and want the fix applied to your cluster, you need to create a new cluster. Make sure that you back up and restore your work while moving to the new cluster by following the instructions at Use the backup script provided by SageMaker HyperPod.
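    For reference, ThreadsPerCore is set per instance group. A minimal, hypothetical fragment of a ClusterInstanceGroupSpecification (other required fields omitted; names and counts are placeholders) might look like the following, with the value 1 disabling multithreading on each instance:

    ```json
    {
      "InstanceGroupName": "compute-group",
      "InstanceType": "ml.p4d.24xlarge",
      "InstanceCount": 2,
      "ThreadsPerCore": 1
    }
    ```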

SageMaker HyperPod release notes: March 27, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

HyperPod software patch

The HyperPod service team distributes software patches through SageMaker HyperPod DLAMI. See the following details about the latest HyperPod DLAMI.

  • In this release of the HyperPod DLAMI, Slurm is built with the REST service (slurmrestd) with JSON, YAML, and JWT support.

  • Upgraded Slurm to v23.11.3.

Improvements

  • Increased auto-resume service timeout to 60 minutes.

  • Improved instance replacement process to not restart the Slurm controller.

  • Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.

Bug fixes

  • Fixed a bug with chrony service that caused an issue with time synchronization.

  • Fixed a bug with parsing slurm.conf.

  • Fixed an issue with NVIDIA go-dcgm library.

SageMaker HyperPod release notes: March 14, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

Improvements

AMI releases

SageMaker HyperPod release notes: February 15, 2024

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Added a new UpdateClusterSoftware API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update existing SageMaker HyperPod clusters in your account by running aws sagemaker update-cluster-software --cluster-name your-cluster-name. To stay current with future security patches, keep tracking this Amazon SageMaker HyperPod release notes page. To learn how the UpdateClusterSoftware API works, see Update the SageMaker HyperPod platform software of a cluster.

SageMaker HyperPod release notes: November 29, 2023

SageMaker HyperPod releases the following for Orchestrating SageMaker HyperPod clusters with Slurm.

New features

  • Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.

AMI releases