Running jobs using kubectl - Amazon SageMaker


Running jobs using kubectl

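Note that you should install the Kubeflow Training Operator in the cluster using a Helm chart. For more information, see Install packages on the Amazon EKS cluster using Helm. Run the following command to verify that the Kubeflow Training Operator control plane is set up correctly.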

kubectl get pods -n kubeflow

This should return output similar to the following.

NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn   1/1     Running   0          90s
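The same check can be scripted. The following is a minimal Python sketch that parses the table printed by `kubectl get pods`, assuming the default column layout shown above. The helper name `operator_pods_ready` and the `training-operator` name prefix are illustrative assumptions, not part of any SageMaker or Kubeflow API.

```python
def operator_pods_ready(table: str, prefix: str = "training-operator") -> bool:
    """Return True if at least one pod whose name starts with `prefix`
    is Running with all of its containers ready."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        # No header plus data row, e.g. "No resources found ..."
        return False
    header = lines[0].split()
    name_i, ready_i, status_i = (header.index(c) for c in ("NAME", "READY", "STATUS"))
    for line in lines[1:]:
        cols = line.split()
        if not cols[name_i].startswith(prefix):
            continue
        ready, total = cols[ready_i].split("/")
        if cols[status_i] == "Running" and ready == total:
            return True
    return False


# Sample text copied from the expected output above.
SAMPLE = """\
NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn   1/1     Running   0          90s
"""

if __name__ == "__main__":
    print(operator_pods_ready(SAMPLE))
```

Against a live cluster, the text could be obtained with `subprocess.run(["kubectl", "get", "pods", "-n", "kubeflow"], capture_output=True, text=True, check=True).stdout` and passed to the helper.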

To submit a training job

To run a training job, prepare the job configuration file and run the kubectl apply command as follows.

kubectl apply -f /path/to/training_job.yaml

To describe a training job

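To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the submission time, completion time, job status, and configuration details.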

kubectl get pytorchjob training-job -n kubeflow -o yaml
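For scripted checks, the same object can be requested with `-o json` and reduced to the fields mentioned above. The following is a minimal Python sketch, assuming the job is a PyTorchJob whose status carries `startTime`, `completionTime`, and a `conditions` list. The helper name `summarize_job` and the sample document are hypothetical and only illustrate the shape of the data.

```python
import json


def summarize_job(job: dict) -> dict:
    """Extract submission time, start and completion time, and the current
    state from a job object as returned by kubectl with `-o json`."""
    status = job.get("status", {})
    # Assumption: conditions are listed oldest first, so the last one whose
    # status is "True" is treated as the current state.
    active = [c for c in status.get("conditions", []) if c.get("status") == "True"]
    return {
        "name": job.get("metadata", {}).get("name"),
        "submitted": job.get("metadata", {}).get("creationTimestamp"),
        "started": status.get("startTime"),
        "completed": status.get("completionTime"),
        "state": active[-1]["type"] if active else "Unknown",
    }


# Hypothetical, heavily trimmed sample; a real object has many more fields.
SAMPLE = json.loads("""
{
  "metadata": {"name": "training-job", "creationTimestamp": "2024-07-11T05:53:09Z"},
  "status": {
    "startTime": "2024-07-11T05:53:10Z",
    "conditions": [
      {"type": "Created", "status": "True"},
      {"type": "Running", "status": "True"}
    ]
  }
}
""")

if __name__ == "__main__":
    print(json.dumps(summarize_job(SAMPLE), indent=2))
```

A job that has not finished yet has no completion time, so `completed` is `None` in that case.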

To stop a training job and delete EKS resources

To stop a training job, use kubectl delete. The following is an example of stopping the training job created from the configuration file pytorch_job_simple.yaml.

kubectl delete -f /path/to/training_job.yaml

This should return the following output.

pytorchjob.kubeflow.org "training-job" deleted

To enable job auto-resume

SageMaker HyperPod supports job auto-resume functionality for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.

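Ensure that the cluster has enough nodes that have passed the SageMaker HyperPod health check. The nodes should have the taint sagemaker.amazonaws.com/node-health-status set to Schedulable. It is recommended to include a node selector in the job YAML file so that only nodes with the appropriate configuration are selected, as follows.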

sagemaker.amazonaws.com/node-health-status: Schedulable

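The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set restartPolicy to OnFailure, as follows.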

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
  annotations:
    # config for job auto resume
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
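When manifests are generated programmatically, the same three changes (the two annotations, restartPolicy, and the node selector) can be applied to a manifest held as a dictionary. The following is a minimal Python sketch under stated assumptions: the helper name `enable_auto_resume` is hypothetical, `BASE` is a deliberately incomplete placeholder manifest without container definitions, and applying the settings to every replica type rather than only to Worker is an assumption of this sketch.

```python
import copy
import json

# Node selector key and value taken from the configuration above.
NODE_SELECTOR = {"sagemaker.amazonaws.com/node-health-status": "Schedulable"}


def enable_auto_resume(manifest: dict, max_retries: int = 2) -> dict:
    """Return a copy of a PyTorchJob manifest with the auto-resume
    annotations, restartPolicy, and node selector applied."""
    job = copy.deepcopy(manifest)
    annotations = job.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["sagemaker.amazonaws.com/enable-job-auto-resume"] = "true"
    # Annotation values must be strings, so the retry count is stringified.
    annotations["sagemaker.amazonaws.com/job-max-retry-count"] = str(max_retries)
    replicas = job.setdefault("spec", {}).setdefault("pytorchReplicaSpecs", {})
    # Assumption: every replica type present gets the same settings.
    for replica_spec in replicas.values():
        replica_spec["restartPolicy"] = "OnFailure"
        pod_spec = replica_spec.setdefault("template", {}).setdefault("spec", {})
        pod_spec.setdefault("nodeSelector", {}).update(NODE_SELECTOR)
    return job


# Placeholder manifest: a real job also needs container definitions.
BASE = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-simple", "namespace": "kubeflow"},
    "spec": {"pytorchReplicaSpecs": {"Worker": {"replicas": 10, "template": {"spec": {}}}}},
}

if __name__ == "__main__":
    print(json.dumps(enable_auto_resume(BASE), indent=2))
```

Because kubectl accepts JSON as well as YAML manifests, the printed document could be saved to a file and passed to kubectl apply -f once the placeholder is completed.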

To check the job auto-resume status

Run the following command to check the status of job auto-resume.

kubectl describe pytorchjob -n kubeflow <job-name>

Depending on the failure pattern, you might see one of two Kubeflow training job restart patterns, as follows.

Pattern 1

Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.

Pattern 2

Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
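The two patterns can be told apart mechanically: in the first, the events include PyTorchJobRestarting warnings, while in the second the only sign of a restart is that the same pods are created a second time. The following is a minimal Python sketch that summarizes the Events section of the kubectl describe output. The helper name `restart_summary` is hypothetical, and the parsing assumes the whitespace-separated event layout shown above.

```python
import re
from collections import Counter


def restart_summary(events: str) -> dict:
    """Count restart warnings and list pods that were created more than once."""
    restart_warnings = 0
    pod_creations = Counter()
    for line in events.splitlines():
        cols = line.split()
        if len(cols) < 2:
            continue
        reason = cols[1]
        if reason == "PyTorchJobRestarting":
            restart_warnings += 1
        elif reason == "SuccessfulCreatePod":
            # "(x2 over 9m45s)" means the same event was recorded twice.
            repeat = re.search(r"\(x(\d+) over ", line)
            pod_creations[cols[-1]] += int(repeat.group(1)) if repeat else 1
    recreated = sorted(pod for pod, n in pod_creations.items() if n > 1)
    return {"restart_warnings": restart_warnings, "recreated_pods": recreated}


# Event lines copied from the two patterns above (service events omitted).
PATTERN_1 = """\
Warning PyTorchJobRestarting 7m59s pytorchjob-controller PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-worker-0
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-worker-1
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-master-0
Warning PyTorchJobRestarting 7m58s pytorchjob-controller PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
"""

PATTERN_2 = """\
Normal SuccessfulCreatePod 19m pytorchjob-controller Created pod: pt-job-2-worker-0
Normal SuccessfulCreatePod 19m pytorchjob-controller Created pod: pt-job-2-master-0
Normal SuccessfulCreatePod 4m48s pytorchjob-controller Created pod: pt-job-2-worker-0
Normal SuccessfulCreatePod 4m48s pytorchjob-controller Created pod: pt-job-2-master-0
"""

if __name__ == "__main__":
    print(restart_summary(PATTERN_1))
    print(restart_summary(PATTERN_2))
```

A summary with zero restart warnings but a non-empty list of recreated pods corresponds to the second pattern.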