Example HyperPod task governance AWS CLI commands
You can use HyperPod with EKS through kubectl or through the HyperPod custom CLI. You can run these commands through Studio or the AWS CLI. The following SageMaker HyperPod task governance examples show how to view cluster details and manage jobs using the HyperPod AWS CLI commands. For more information, including how to install the CLI, see the HyperPod CLI GitHub repository.
Get cluster accelerator device quota information
The following example command gets the information on the cluster accelerator device quota.
hyperpod get-clusters -n hyperpod-ns-test-team
The namespace in this example, hyperpod-ns-test-team, is created in Kubernetes based on the team name provided, test-team, when the compute allocation is created. For more information, see Edit policies.
Example response:
[
    {
        "Cluster": "hyperpod-eks-test-cluster-id",
        "InstanceType": "ml.g5.xlarge",
        "TotalNodes": 2,
        "AcceleratorDevicesAvailable": 1,
        "NodeHealthStatus=Schedulable": 2,
        "DeepHealthCheckStatus=Passed": "N/A",
        "Namespaces": {
            "hyperpod-ns-test-team": {
                "TotalAcceleratorDevices": 1,
                "AvailableAcceleratorDevices": 1
            }
        }
    }
]
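Because the response is JSON, it can be post-processed with a short script. The following Python sketch is illustrative only: it embeds the example response as a string (with the cluster-id placeholder left in place) instead of calling the CLI, and the helper name available_by_namespace is hypothetical.

```python
import json

# Example response copied from this page; "cluster-id" is a placeholder, not a real ID.
EXAMPLE_RESPONSE = """
[
  {
    "Cluster": "hyperpod-eks-test-cluster-id",
    "InstanceType": "ml.g5.xlarge",
    "TotalNodes": 2,
    "AcceleratorDevicesAvailable": 1,
    "NodeHealthStatus=Schedulable": 2,
    "DeepHealthCheckStatus=Passed": "N/A",
    "Namespaces": {
      "hyperpod-ns-test-team": {
        "TotalAcceleratorDevices": 1,
        "AvailableAcceleratorDevices": 1
      }
    }
  }
]
"""


def available_by_namespace(response_text):
    """Map (cluster, namespace) to (available, total) accelerator devices."""
    summary = {}
    for cluster in json.loads(response_text):
        for namespace, quota in cluster.get("Namespaces", {}).items():
            summary[(cluster["Cluster"], namespace)] = (
                quota["AvailableAcceleratorDevices"],
                quota["TotalAcceleratorDevices"],
            )
    return summary


quota_summary = available_by_namespace(EXAMPLE_RESPONSE)
for (cluster, namespace), (available, total) in quota_summary.items():
    print(f"{cluster} / {namespace}: {available} of {total} accelerator devices available")
```

In practice you would pass the text printed by hyperpod get-clusters to the helper instead of the embedded string.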
Submit a job to SageMaker AI-managed queue and namespace
The following example command submits a job to your HyperPod cluster. If you have access to only one team, the HyperPod AWS CLI automatically assigns the queue for you. If multiple queues are discovered, the CLI displays all viable options for you to select from.
hyperpod start-job \
    --job-name hyperpod-cli-test \
    --job-kind kubeflow/PyTorchJob \
    --image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
    --entry-script /opt/pytorch-mnist/mnist.py \
    --pull-policy IfNotPresent \
    --instance-type ml.g5.xlarge \
    --node-count 1 \
    --tasks-per-node 1 \
    --results-dir ./result \
    --priority training-priority
The priority classes are defined in the Cluster policy, which defines how tasks are prioritized and idle compute is allocated. When a data scientist submits a job, they use one of the priority class names with the format priority-class-name-priority. In this example, training-priority refers to the priority class named “training”. For more information on policy concepts, see Policies.
If a priority class is not specified, the job is treated as a low-priority job with a task ranking value of 0.
If a priority class is specified, but does not correspond to one of the priority classes defined in the Cluster policy, the submission fails and an error message provides the defined set of priority classes.
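The three rules above (a known class, no class, an unknown class) can be modeled in a few lines. The following Python sketch is only a model of the described behavior, not the CLI's implementation: the class names other than "training", the weights, and the function name resolve_priority are all hypothetical.

```python
# Hypothetical priority classes and weights; real ones come from your Cluster policy.
DEFINED_PRIORITY_CLASSES = {"inference": 100, "training": 50}
PRIORITY_SUFFIX = "-priority"


def resolve_priority(priority, defined=DEFINED_PRIORITY_CLASSES):
    """Return (class_name, ranking) according to the rules described above."""
    if priority is None:
        # No priority class given: treated as low priority with ranking value 0.
        return (None, 0)
    # "training-priority" refers to the priority class named "training".
    if priority.endswith(PRIORITY_SUFFIX):
        name = priority[: -len(PRIORITY_SUFFIX)]
    else:
        name = None
    if name not in defined:
        # Unknown class: fail and report the defined set of priority classes.
        valid = ", ".join(f"{n}{PRIORITY_SUFFIX}" for n in sorted(defined))
        raise ValueError(f"Unknown priority class {priority!r}; defined classes: {valid}")
    return (name, defined[name])


print(resolve_priority("training-priority"))
print(resolve_priority(None))
```

Calling resolve_priority("research-priority") with these assumed classes raises a ValueError that lists inference-priority and training-priority.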
You can also submit the job by using a YAML configuration file. The following example is equivalent to submitting the job with the preceding command:
defaults:
  - override hydra/job_logging: stdout
hydra:
  run:
    dir: .
  output_subdir: null
training_cfg:
  entry_script: /opt/pytorch-mnist/mnist.py
  script_args: []
  run:
    name: hyperpod-cli-test
    nodes: 1
    ntasks_per_node: 1
cluster:
  cluster_type: k8s
  instance_type: ml.g5.xlarge
  custom_labels:
    kueue.x-k8s.io/priority-class: training-priority
  cluster_config:
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: IfNotPresent
base_results_dir: ./result
container: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
env_vars:
  NCCL_DEBUG: INFO
You can submit the job using the above YAML configuration with the following command.
hyperpod start-job --config-file ./basicJob.yaml
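If you submit jobs from a script, both submission forms can be assembled as argument lists. The following Python sketch uses only the flags and values shown on this page; the helper name start_job_argv is hypothetical, and the sketch builds and prints the commands without running them.

```python
import shlex


def start_job_argv(config_file=None, **flags):
    """Build a `hyperpod start-job` argument list from a config file or from flags."""
    argv = ["hyperpod", "start-job"]
    if config_file is not None:
        return argv + ["--config-file", config_file]
    for name, value in flags.items():
        # Keyword job_name becomes the flag --job-name, and so on.
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


flag_form = start_job_argv(
    job_name="hyperpod-cli-test",
    job_kind="kubeflow/PyTorchJob",
    image="docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd",
    entry_script="/opt/pytorch-mnist/mnist.py",
    pull_policy="IfNotPresent",
    instance_type="ml.g5.xlarge",
    node_count=1,
    tasks_per_node=1,
    results_dir="./result",
    priority="training-priority",
)
file_form = start_job_argv(config_file="./basicJob.yaml")

# Print shell-quoted commands for inspection instead of executing them.
print(shlex.join(flag_form))
print(shlex.join(file_form))
```

To actually submit, you could pass either list to subprocess.run with check=True on a machine where the HyperPod CLI is installed and connected to your cluster.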
List jobs
The following command lists the jobs and their details.
hyperpod list-jobs
Example response:
{
    "jobs": [
        {
            "Name": "hyperpod-cli-test",
            "Namespace": "hyperpod-ns-test-team",
            "CreationTime": "2024-11-18T21:21:15Z",
            "Priority": "training",
            "State": "Succeeded"
        }
    ]
}
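A response in this shape is easy to summarize by job state. The following Python sketch is illustrative: it parses the example response embedded as a string instead of calling the CLI, and the variable names are arbitrary.

```python
import json
from collections import Counter

# Example response copied from this page.
EXAMPLE_RESPONSE = """
{
  "jobs": [
    {
      "Name": "hyperpod-cli-test",
      "Namespace": "hyperpod-ns-test-team",
      "CreationTime": "2024-11-18T21:21:15Z",
      "Priority": "training",
      "State": "Succeeded"
    }
  ]
}
"""

jobs = json.loads(EXAMPLE_RESPONSE)["jobs"]

# Count jobs per state and collect the names of the jobs that succeeded.
state_counts = Counter(job["State"] for job in jobs)
succeeded = [job["Name"] for job in jobs if job["State"] == "Succeeded"]

print(dict(state_counts))
print(succeeded)
```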
Get job detailed information
The following command returns a job’s details. If no namespace is specified, the HyperPod AWS CLI fetches a namespace managed by SageMaker AI that you have access to.
hyperpod get-job --job-name hyperpod-cli-test
Example response:
{
    "Name": "hyperpod-cli-test",
    "Namespace": "hyperpod-ns-test-team",
    "Label": {
        "app": "hyperpod-cli-test",
        "app.kubernetes.io/managed-by": "Helm",
        "kueue.x-k8s.io/priority-class": "training"
    },
    "CreationTimestamp": "2024-11-18T21:21:15Z",
    "Status": {
        "completionTime": "2024-11-18T21:25:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2024-11-18T21:21:15Z",
                "lastUpdateTime": "2024-11-18T21:21:15Z",
                "message": "PyTorchJob hyperpod-cli-test is created.",
                "reason": "PyTorchJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2024-11-18T21:21:17Z",
                "lastUpdateTime": "2024-11-18T21:21:17Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test is running.",
                "reason": "PyTorchJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2024-11-18T21:25:24Z",
                "lastUpdateTime": "2024-11-18T21:25:24Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test successfully completed.",
                "reason": "PyTorchJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
        "replicaStatuses": {
            "Worker": {
                "selector": "training.kubeflow.org/job-name=hyperpod-cli-test,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker",
                "succeeded": 1
            }
        },
        "startTime": "2024-11-18T21:21:15Z"
    },
    "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/hyperpod-eks-test-cluster-id"
}
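The Status block carries enough information to work out the job's latest condition and how long it ran. The following Python sketch is illustrative: it uses a trimmed copy of the example Status (only the fields it reads), and the helper names parse_utc, current_condition, and run_seconds are hypothetical.

```python
from datetime import datetime

# Trimmed copy of the "Status" object from the example response on this page.
STATUS = {
    "completionTime": "2024-11-18T21:25:24Z",
    "conditions": [
        {"lastTransitionTime": "2024-11-18T21:21:15Z", "status": "True", "type": "Created"},
        {"lastTransitionTime": "2024-11-18T21:21:17Z", "status": "False", "type": "Running"},
        {"lastTransitionTime": "2024-11-18T21:25:24Z", "status": "True", "type": "Succeeded"},
    ],
    "startTime": "2024-11-18T21:21:15Z",
}


def parse_utc(timestamp):
    """Parse a timestamp such as 2024-11-18T21:21:15Z (Python 3.7+ accepts Z for %z)."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S%z")


def current_condition(status):
    """Return the type of the most recent condition whose status is "True"."""
    active = [c for c in status["conditions"] if c["status"] == "True"]
    latest = max(active, key=lambda c: parse_utc(c["lastTransitionTime"]))
    return latest["type"]


def run_seconds(status):
    """Return the seconds elapsed between startTime and completionTime."""
    elapsed = parse_utc(status["completionTime"]) - parse_utc(status["startTime"])
    return elapsed.total_seconds()


print(current_condition(STATUS))
print(run_seconds(STATUS))
```

For the example job, this reports the Succeeded condition and a run time of 249 seconds (21:21:15 to 21:25:24).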
Suspend and unsuspend jobs
To remove a submitted job from the scheduler, use the suspend command that the HyperPod AWS CLI provides to temporarily remove the job from orchestration. A suspended job is not scheduled again until you manually unsuspend it with the unsuspend command.
To temporarily suspend a job:
hyperpod patch-job suspend --job-name hyperpod-cli-test
To add a job back to the queue:
hyperpod patch-job unsuspend --job-name hyperpod-cli-test
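If you suspend and unsuspend jobs from automation, a thin wrapper keeps the two actions consistent. The following Python sketch uses only the two patch-job commands shown above; the function names patch_job_argv and patch_job are hypothetical, and patch_job assumes the HyperPod CLI is installed and connected to your cluster.

```python
import subprocess

VALID_ACTIONS = ("suspend", "unsuspend")


def patch_job_argv(action, job_name):
    """Build the `hyperpod patch-job` argument list for a suspend or unsuspend action."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {VALID_ACTIONS}, got {action!r}")
    return ["hyperpod", "patch-job", action, "--job-name", job_name]


def patch_job(action, job_name):
    """Run the command; raises CalledProcessError if the CLI exits with an error."""
    return subprocess.run(patch_job_argv(action, job_name), check=True)


# Build (but do not run) the two commands from this section.
suspend_cmd = patch_job_argv("suspend", "hyperpod-cli-test")
unsuspend_cmd = patch_job_argv("unsuspend", "hyperpod-cli-test")
print(" ".join(suspend_cmd))
print(" ".join(unsuspend_cmd))
```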
Debugging jobs
The HyperPod AWS CLI also provides other commands for you to debug job submission issues. For example, see list-pods and get-logs in the HyperPod AWS CLI GitHub repository.