
SageMaker HyperPod FAQ

Use the following frequently asked questions to troubleshoot problems with using SageMaker HyperPod.

Q. Why can I not find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?

By default, agent logs and instance start-up logs are sent to the HyperPod platform account's CloudWatch. For user lifecycle scripts, the lifecycle configuration logs are sent to your account's CloudWatch.

If you use the sample lifecycle scripts provided by the HyperPod service team, you can expect to find the lifecycle configuration logs written to /var/log/provision/provisioning.log, and you wouldn’t encounter this problem.

However, if you use custom paths for collecting logs from lifecycle provisioning and can't find the log groups in your account's CloudWatch, the likely cause is a mismatch between the log file paths specified in your lifecycle scripts and the paths that the CloudWatch agent running on the HyperPod cluster instances looks for. In that case, you need to set up your lifecycle scripts to send logs to the CloudWatch agent and configure the CloudWatch agent to match. To resolve the problem, choose one of the following options.

  • Option 1: Update your lifecycle scripts to write logs to /var/log/provision/provisioning.log (for an example lifecycle script snippet, see the sketch after this list).

  • Option 2: Update the CloudWatch agent to look for your custom paths for logging lifecycle provisioning.

    1. Each HyperPod cluster instance contains a CloudWatch agent configuration file in JSON format at /opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json. In the configuration file, find the field name logs.logs_collected.files.collect_list.file_path. With the default setup by HyperPod, the key-value pair should be "file_path": "/var/log/provision/provisioning.log" as documented at Logging SageMaker HyperPod at instance level. The following code snippet shows what the JSON file looks like with the HyperPod default configuration.

      "logs": { "logs_collected": { "files": { "collect_list": [ { "file_path": "/var/log/provision/provisioning.log", "log_group_name": "/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]", "log_stream_name": "LifecycleConfig/[InstanceGroupName]/{instance_id}", "retention_in_days": -1 } ] } }, "force_flush_interval": 3 }
    2. Replace the value of the "file_path" field with the custom path that you use in your lifecycle scripts. For example, if you have set up your lifecycle scripts to write to /var/log/custom-provision/custom-provisioning.log, update the value to match it as follows.

      "file_path": "/var/log/custom-provision/custom-provisioning.log"
    3. Restart the CloudWatch agent with the configuration file to finish applying the custom path. For example, the following command restarts the CloudWatch agent with the CloudWatch agent configuration file from step 1. For more information, see also Troubleshooting the CloudWatch agent.

      sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
          -a fetch-config -m ec2 -s -c \
          file:/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json
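
If you go with Option 1 instead, the following is a minimal sketch of the change to a lifecycle script; it assumes a Bash-based provisioning script, and the echo message is illustrative only.

    # Minimal sketch for Option 1 (assumes a Bash lifecycle script).
    # Send the remaining provisioning output to the path that the
    # CloudWatch agent watches by default.
    LOG_FILE=/var/log/provision/provisioning.log
    sudo mkdir -p "$(dirname "$LOG_FILE")"
    exec > >(sudo tee -a "$LOG_FILE") 2>&1
    echo "[provisioning] starting custom setup on $(hostname)"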

Q. What particular configurations does HyperPod manage in Slurm configuration files such as slurm.conf and gres.conf?

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the slurm.conf and gres.conf files at /opt/slurm/etc/ to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows what specific parameters the HyperPod agent handles and overwrites.

Important

We strongly recommend that you DON’T change these parameters managed by HyperPod.

  • In slurm.conf, HyperPod sets up the following basic parameters: ClusterName, SlurmctldHost, PartitionName, and NodeName.

    Also, to enable the Auto-resume functionality, HyperPod requires the TaskPlugin and SchedulerParameters parameters to be set as follows. The HyperPod agent sets these two parameters to the required values by default.

    TaskPlugin=task/none
    SchedulerParameters=permit_job_expansion
  • In gres.conf, HyperPod manages NodeName for GPU nodes.
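
To confirm the values that HyperPod manages on a running cluster without editing the files, you can query the Slurm controller from the head node; the following is a minimal sketch.

    # Minimal sketch (run on the Slurm controller/head node): inspect the
    # HyperPod-managed parameters without modifying them.
    scontrol show config | grep -E 'ClusterName|TaskPlugin|SchedulerParameters'
    grep -E 'PartitionName|NodeName' /opt/slurm/etc/slurm.conf
    cat /opt/slurm/etc/gres.conf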

Q. How do I run Docker on Slurm nodes on HyperPod?

To help you run Docker on your Slurm nodes running on HyperPod, the HyperPod service team provides setup scripts that you can include as part of the lifecycle configuration for cluster creation. To learn more, see Start with base lifecycle scripts provided by HyperPod and Run Docker containers on a Slurm compute node on HyperPod.
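
After Docker is installed on the compute nodes through those lifecycle scripts, a common pattern is to launch the container from a Slurm job step. The following is a minimal sketch; the image is a placeholder.

    # Minimal sketch: launch a Docker container from a Slurm job step.
    # Assumes Docker was installed on the compute nodes by the lifecycle scripts.
    srun --nodes=1 --ntasks=1 docker run --rm hello-world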

Q. How do I use the local NVMe store of P instances for launching Docker or Enroot containers with Slurm?

Because the default root volume of your head node is usually limited to a 100 GB EBS volume, you need to set up Docker and Enroot to use the local NVMe instance store. To learn how to set up the NVMe store and use it for launching Docker containers, see Run Docker containers on a Slurm compute node on HyperPod.
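
As one illustration of that setup, the following sketch points Docker's data directory at an NVMe-backed path. The mount point /opt/dlami/nvme is an assumption; use whatever path your lifecycle scripts mount the instance store on, and note that this overwrites any existing /etc/docker/daemon.json.

    # Minimal sketch (assumed NVMe mount point: /opt/dlami/nvme).
    sudo mkdir -p /opt/dlami/nvme/docker
    echo '{ "data-root": "/opt/dlami/nvme/docker" }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker
    docker info | grep "Docker Root Dir"   # verify the new location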

Q. How do I set up EFA security groups?

If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
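
If you prefer to script that setup, the following sketch adds the self-referencing rules with the AWS CLI; the security group ID is a placeholder.

    # Minimal sketch: allow all traffic to and from the security group itself.
    SG_ID=sg-0123456789abcdef0   # placeholder: your EFA security group ID
    aws ec2 authorize-security-group-ingress \
        --group-id "$SG_ID" \
        --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"
    aws ec2 authorize-security-group-egress \
        --group-id "$SG_ID" \
        --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"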

Q. How do I monitor my HyperPod cluster nodes? Are any CloudWatch metrics exported from HyperPod?

To gain observability into the resource utilization of your HyperPod cluster, we recommend that you integrate the HyperPod cluster with Amazon Managed Grafana and Amazon Managed Service for Prometheus. With various open-source Grafana dashboards and exporter packages, you can export and visualize metrics related to the HyperPod cluster resources. To learn more about setting up SageMaker HyperPod with Amazon Managed Grafana and Amazon Managed Service for Prometheus, see SageMaker HyperPod cluster resources monitoring. Note that SageMaker HyperPod currently doesn't support exporting system metrics to Amazon CloudWatch.
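
As a quick sanity check before wiring the exporters into Amazon Managed Service for Prometheus, you can confirm that they respond on a cluster node; the ports below are the usual defaults for the node exporter and the DCGM exporter, not values set by HyperPod.

    # Minimal sketch: verify that metric exporters respond on a cluster node.
    curl -s http://localhost:9100/metrics | head -n 5    # node exporter (assumed port)
    curl -s http://localhost:9400/metrics | head -n 5    # DCGM GPU exporter (assumed port)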

Q. Can I add additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.

If the default instance storage is insufficient for your workload, you can configure additional storage per instance. Starting from the release on June 20, 2024, you can add an additional Amazon Elastic Block Store (EBS) volume to each instance in your SageMaker HyperPod cluster. Note that this capability cannot be applied to the existing instance groups of a SageMaker HyperPod cluster created before June 20, 2024; to use it with such a cluster, patch the cluster and add new instance groups to it. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
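
The following is a minimal sketch of adding a new instance group with an extra EBS volume through the AWS CLI. The group name, instance type, counts, S3 path, and role ARN are placeholders; depending on your cluster, the UpdateCluster call may also need to include the configurations of your existing instance groups, so check the UpdateCluster API reference before running it.

    # Minimal sketch: add an instance group that attaches an extra 500 GiB EBS
    # volume per instance (requires an AWS CLI version that supports
    # InstanceStorageConfigs). All names and ARNs below are placeholders.
    aws sagemaker update-cluster \
        --cluster-name my-hyperpod-cluster \
        --instance-groups '[
          {
            "InstanceGroupName": "worker-group-2",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 2,
            "LifeCycleConfig": {
              "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
              "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "InstanceStorageConfigs": [
              { "EbsVolumeConfig": { "VolumeSizeInGB": 500 } }
            ]
          }
        ]'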