Run jobs using kubectl

The Kubeflow Training Operator must be installed in the cluster using a Helm chart before you can run training jobs. For more information, see Install packages on the Amazon EKS cluster using Helm. Verify that the Kubeflow Training Operator control plane is set up properly by running the following command.

kubectl get pods -n kubeflow

This should return an output similar to the following.

NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn   1/1     Running   0          90s
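
If the command returns no training-operator pod, the operator is not installed yet. The following is a minimal sketch of the Helm installation step, assuming the HyperPod Helm chart repository is checked out locally; the release name and chart path are hypothetical placeholders for your own setup.

helm install hyperpod-dependencies ./helm_chart/HyperPodHelmChart --namespace kube-system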

To submit a training job

To run a training job, prepare the job configuration file and run the kubectl apply command as follows.

kubectl apply -f /path/to/training_job.yaml
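
The following is a minimal sketch of what training_job.yaml might contain for a Kubeflow PyTorchJob; the job name, replica counts, container image, and command are hypothetical placeholders.

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: training-job        # hypothetical job name
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                     # PyTorchJob expects a container named pytorch
              image: <your-training-image>      # hypothetical placeholder
              command: ["python", "train.py"]   # hypothetical entry point
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>
              command: ["python", "train.py"]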

To describe a training job

To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the submission time, completion time, job status, and configuration details.

kubectl get pytorchjob training-job -n kubeflow -o yaml
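
If you only need the job status rather than the full resource, a jsonpath output can narrow the result; the job name training-job is carried over from the example above.

kubectl get pytorchjob training-job -n kubeflow -o jsonpath='{.status.conditions}'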

To stop a training job and delete EKS resources

To stop a training job, use kubectl delete. The following is an example of stopping the training job created from the configuration file training_job.yaml.

kubectl delete -f /path/to/training_job.yaml

This should return the following output.

pytorchjob.kubeflow.org "training-job" deleted
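
Alternatively, if you no longer have the configuration file, you can delete the job by resource kind and name; the kind and name here match the example output above.

kubectl delete pytorchjob training-job -n kubeflow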

To enable job auto-resume

SageMaker HyperPod supports job auto-resume functionality for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.

Ensure that there are sufficient nodes in the cluster that have passed the SageMaker HyperPod health check. These nodes have the label sagemaker.amazonaws.com/node-health-status set to Schedulable. We recommend including a node selector in the job YAML file so that the job is scheduled onto nodes with the appropriate configuration, as follows.

sagemaker.amazonaws.com/node-health-status: Schedulable
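
To check which nodes currently carry this label, you can list the nodes with the label value shown as an extra column, as in the following sketch.

kubectl get nodes -L sagemaker.amazonaws.com/node-health-status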

The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set restartPolicy to OnFailure as follows.

apiVersion: "kubeflow.org/v1" kind: PyTorchJob metadata: name: pytorch-simple namespace: kubeflow annotations: { // config for job auto resume sagemaker.amazonaws.com/enable-job-auto-resume: "true" sagemaker.amazonaws.com/job-max-retry-count: "2" } spec: pytorchReplicaSpecs: ...... Worker: replicas: 10 restartPolicy: OnFailure template: spec: nodeSelector: sagemaker.amazonaws.com/node-health-status: Schedulable

To check the job auto-resume status

Run the following command to check the status of job auto-resume.

kubectl describe pytorchjob -n kubeflow <job-name>

Depending on the failure pattern, you might see one of the following two Kubeflow training job restart patterns.

Pattern 1:

Start Time:  2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.

Pattern 2:

Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
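
To watch restart events like these unfold in real time, you can also stream the pod list in the namespace; the --watch flag keeps the command running until you interrupt it.

kubectl get pods -n kubeflow --watch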