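Running jobs using kubectl
Note that you should install the Kubeflow Training Operator in the cluster using a Helm chart. For more information, see Install packages on the Amazon EKS cluster using Helm. Run the following command to verify that the Kubeflow Training Operator control plane is set up correctly.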
kubectl get pods -n kubeflow
This should return output similar to the following.
NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn   1/1     Running   0          90s
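The same check can be scripted. The following is a minimal Python sketch, not part of any official tooling, that scans the text printed by kubectl get pods -n kubeflow and reports whether a training-operator pod is fully ready and Running. The function name operator_is_ready and the assumption that the pod name starts with training-operator are illustrative, taken only from the sample output above.

```python
def operator_is_ready(pods_output: str, prefix: str = "training-operator") -> bool:
    """Return True if a pod whose name starts with `prefix` is Running and fully ready.

    `pods_output` is the plain-text table printed by `kubectl get pods`.
    The default `prefix` is an assumption based on the sample output.
    """
    for line in pods_output.strip().splitlines():
        fields = line.split()
        # Expect at least NAME, READY, STATUS; skip the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        if not name.startswith(prefix):
            continue
        # READY looks like "1/1": ready containers / total containers.
        ready_count, _, total_count = ready.partition("/")
        if status == "Running" and total_count and ready_count == total_count:
            return True
    return False


sample = """\
NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn   1/1     Running   0          90s
"""
print(operator_is_ready(sample))
```

In practice you would pass in the captured output of the kubectl command instead of the hard-coded sample.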
To submit a training job
To run a training job, prepare the job configuration file and then run kubectl apply as follows.
kubectl apply -f /path/to/training_job.yaml
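Describe a training job
To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the submission time, completion time, job status, and configuration details.
kubectl get pytorchjob training-job -n kubeflow -o yaml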
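If you prefer to inspect the job programmatically, one option is to request JSON instead of YAML and read the status section. The sketch below assumes the job is a PyTorchJob named training-job, and that its status follows the usual Kubeflow layout with startTime, completionTime, and a conditions list ordered oldest first; verify these fields against your own kubectl output. The helper names summarize_job and fetch_job are hypothetical.

```python
import json
import subprocess


def summarize_job(job: dict) -> dict:
    """Pull the timing and latest-condition fields out of a job object.

    Assumes a Kubeflow-style layout: status.startTime,
    status.completionTime, and status.conditions (a list of dicts).
    Missing fields come back as None instead of raising.
    """
    status = job.get("status", {})
    conditions = status.get("conditions") or []
    # Assumption: conditions are listed oldest first, so the last one
    # whose status is "True" is treated as the current state.
    active = [c for c in conditions if c.get("status") == "True"]
    return {
        "name": job.get("metadata", {}).get("name"),
        "start_time": status.get("startTime"),
        "completion_time": status.get("completionTime"),
        "state": active[-1].get("type") if active else None,
    }


def fetch_job(name: str, namespace: str = "kubeflow") -> dict:
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pytorchjob", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Offline example with a hand-written object; on a real cluster,
# use summarize_job(fetch_job("training-job")) instead.
example = {
    "metadata": {"name": "training-job"},
    "status": {
        "startTime": "2024-07-11T05:53:10Z",
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "True"},
        ],
    },
}
print(summarize_job(example))
```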
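To stop a training job and delete EKS resources
To stop a training job, use kubectl delete. The following example stops a training job created from the configuration file pytorch_job_simple.yaml.
kubectl delete -f /path/to/training_job.yaml
This should return the following output.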
pytorchjob.kubeflow.org "training-job" deleted
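Enable job auto-resume
SageMaker HyperPod supports job auto-resume for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.
Ensure that the cluster has enough nodes that have passed the SageMaker HyperPod health check. The nodes should have the taint sagemaker.amazonaws.com/node-health-status set to Schedulable. We recommend including a node selector in the job YAML file to select nodes with the appropriate configuration, as follows.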
sagemaker.amazonaws.com/node-health-status: Schedulable
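The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable job auto-resume. You need to add two annotations and set restartPolicy to OnFailure, as follows.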
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
  annotations:
    # config for job auto resume
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
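For teams that generate job manifests in code, the same changes can be applied to a manifest held as a Python dictionary. This is a sketch under stated assumptions: the annotation keys, the OnFailure policy, and the node selector come from the YAML example, while the function name enable_auto_resume and its parameters are hypothetical. By default it only touches the Worker replica spec, mirroring the example; pass other replica types explicitly if your job needs them.

```python
import copy

AUTO_RESUME = "sagemaker.amazonaws.com/enable-job-auto-resume"
MAX_RETRY = "sagemaker.amazonaws.com/job-max-retry-count"
HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"


def enable_auto_resume(manifest: dict, max_retries: int = 2,
                       replica_types: tuple = ("Worker",)) -> dict:
    """Return a copy of a PyTorchJob manifest with auto-resume settings applied."""
    job = copy.deepcopy(manifest)  # leave the caller's dict untouched

    annotations = job.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[AUTO_RESUME] = "true"
    # Annotation values must be strings, so the count is converted.
    annotations[MAX_RETRY] = str(max_retries)

    replicas = job.setdefault("spec", {}).setdefault("pytorchReplicaSpecs", {})
    for replica_type in replica_types:
        spec = replicas.get(replica_type)
        if spec is None:
            continue  # only modify replica types that already exist
        spec["restartPolicy"] = "OnFailure"
        pod_spec = spec.setdefault("template", {}).setdefault("spec", {})
        pod_spec.setdefault("nodeSelector", {})[HEALTH_LABEL] = "Schedulable"
    return job


base = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-simple", "namespace": "kubeflow"},
    "spec": {"pytorchReplicaSpecs": {"Worker": {"replicas": 10}}},
}
patched = enable_auto_resume(base)
print(patched["metadata"]["annotations"])
```

The function returns a modified copy, so the original manifest can still be submitted without auto-resume if needed.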
To check the job auto-resume status
Run the following command to check the status of job auto-resume.
kubectl describe pytorchjob -n kubeflow <job-name>
Depending on the failure mode, you might see one of two Kubeflow training job restart patterns, as follows.
Pattern 1:
Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
Pattern 2:
Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
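Both patterns can be recognized from the event list with a small script. The sketch below is a rough heuristic, not an official diagnostic: it counts PyTorchJobRestarting warnings and finds pods that were created more than once, treating a kubectl suffix such as (x2 over 9m45s) as a repeat count. It assumes one event per line, as kubectl describe prints them; the function name summarize_restarts is hypothetical.

```python
import re
from collections import Counter

# "(x2 over 9m45s)" means kubectl folded repeated events into one row.
_REPEAT = re.compile(r"\(x(\d+) over [^)]*\)")
_CREATED_POD = re.compile(r"Created pod: (\S+)")


def summarize_restarts(events_text: str) -> dict:
    """Count restart warnings and per-pod creations in `kubectl describe` events."""
    restart_warnings = 0
    pod_creations = Counter()
    for line in events_text.splitlines():
        if "PyTorchJobRestarting" in line:
            restart_warnings += 1
        match = _CREATED_POD.search(line)
        if match and "SuccessfulCreatePod" in line:
            repeat = _REPEAT.search(line)
            pod_creations[match.group(1)] += int(repeat.group(1)) if repeat else 1
    # A pod created more than once was recreated after a failure.
    recreated = sorted(pod for pod, count in pod_creations.items() if count > 1)
    return {"restart_warnings": restart_warnings, "recreated_pods": recreated}


pattern_2 = """\
Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
"""
print(summarize_restarts(pattern_2))
```

With this heuristic, recreated pods accompanied by restart warnings correspond to pattern 1, and recreated pods with no warnings correspond to pattern 2.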