Run a training job on HyperPod Slurm
SageMaker HyperPod Recipes supports submitting a training job to a GPU or Trainium Slurm cluster. Before you submit the training job, update the cluster configuration using one of the following methods:
- Modify slurm.yaml
- Override it through the command line (see the example after this list)
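If you choose the command-line approach, you pass overrides to the recipe launcher when you submit the job. The following is a minimal sketch that assumes the Hydra-based main.py launcher from the SageMaker HyperPod recipes repository; the recipe name, results directory, and override keys are placeholders that you should adapt to your own setup:

# Sketch: override Slurm cluster settings at submission time instead of editing slurm.yaml.
# The recipe name and paths below are placeholders.
python3 main.py \
    recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
    base_results_dir="${PWD}/results" \
    cluster=slurm \
    cluster_type=slurm \
    cluster.slurm_create_submission_file_only=True \
    cluster.stderr_to_stdout=False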
After you've updated the cluster configuration, install the environment.
Configure the cluster
To submit a training job to a Slurm cluster, specify the Slurm-specific
configuration in slurm.yaml.
The following is an example of a Slurm cluster configuration. You can modify
this file for your own training needs:
job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False
stderr_to_stdout: True
srun_args:
  # - "--no-container-mount-home"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia"
  post_launch_commands:
container_mounts:
  - "/fsx:/fsx"
- job_name_prefix: Specify a job name prefix so that you can easily identify your submissions to the Slurm cluster.
- slurm_create_submission_file_only: Set this configuration to True to perform a dry run that only creates the submission file, which can help you debug.
- stderr_to_stdout: Specify whether to redirect standard error (stderr) to standard output (stdout).
- srun_args: Customize additional srun arguments, such as excluding specific compute nodes (see the example after this list). For more information, see the srun documentation.
- slurm_docker_cfg: The SageMaker HyperPod recipe launcher launches a Docker container to run your training job. You can specify additional Docker arguments within this parameter.
- container_mounts: Specify the volumes to mount into the container so that your training job can access the files in those volumes.
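As an illustration, the following sketch adjusts the example above to exclude a specific compute node through srun_args (srun's --exclude option) and to mount an additional volume. The node name and the extra mount path are hypothetical placeholders; replace them with values from your own cluster:

job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False
stderr_to_stdout: True
srun_args:
  # Keep a problematic node out of the allocation (placeholder node name)
  - "--exclude=ip-10-1-2-3"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia"
  post_launch_commands:
container_mounts:
  - "/fsx:/fsx"
  # Additional volume for the training job to read from (placeholder path)
  - "/mnt/data:/data"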