What particular configurations HyperPod manages in Slurm configuration files - Amazon SageMaker AI

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the slurm.conf and gres.conf files at /opt/slurm/etc/ to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites.

Important

We strongly recommend that you do not change the parameters that HyperPod manages, because the HyperPod agent overwrites them.

  • In slurm.conf, HyperPod sets up the following basic parameters: ClusterName, SlurmctldHost, PartitionName, and NodeName.

    Also, to enable the Auto-resume functionality, HyperPod requires the TaskPlugin and SchedulerParameters parameters to be set as follows. The HyperPod agent sets these two parameters to the required values by default.

    TaskPlugin=task/none
    SchedulerParameters=permit_job_expansion
  • In gres.conf, HyperPod manages NodeName for GPU nodes.
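
To confirm that a slurm.conf file still carries the two Auto-resume values listed above, a small check such as the following may help. This is a minimal sketch, not a HyperPod tool: the function names `parse_slurm_conf` and `find_mismatches` are hypothetical, the parser handles only simple `Key=Value` lines and `#` comments, and it assumes the values must match exactly as written above.

```python
from pathlib import Path

# Values that, per the list above, HyperPod sets to enable Auto-resume.
REQUIRED = {
    "TaskPlugin": "task/none",
    "SchedulerParameters": "permit_job_expansion",
}


def parse_slurm_conf(text):
    """Return the first-key settings found in slurm.conf text.

    Hypothetical helper: keys are lowercased for lookup, and for lines
    such as NodeName=... only the leading key is recorded.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def find_mismatches(text):
    """Return (name, expected, actual) for each required value that differs.

    `actual` is None when the parameter is missing from the file.
    """
    settings = parse_slurm_conf(text)
    problems = []
    for name, expected in REQUIRED.items():
        actual = settings.get(name.lower())
        if actual != expected:
            problems.append((name, expected, actual))
    return problems


if __name__ == "__main__":
    # Path taken from the description above; adjust if your layout differs.
    conf_path = Path("/opt/slurm/etc/slurm.conf")
    if conf_path.exists():
        for name, expected, actual in find_mismatches(conf_path.read_text()):
            print(f"{name}: expected {expected!r}, found {actual!r}")
```

The check only reads the file and reports differences. It does not modify slurm.conf, which is consistent with the recommendation not to change the parameters that HyperPod manages.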