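Note
Training job auto-resume requires Kubeflow Training Operator release version 1.7.0, 1.8.0, or 1.8.1.
Install the Kubeflow Training Operator in your cluster using a Helm chart. For more information, see Install packages on the Amazon EKS cluster using Helm. To verify that the Kubeflow Training Operator control plane is set up correctly, run the following command.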
kubectl get pods -n kubeflow
This should return output similar to the following.
NAME READY STATUS RESTARTS AGE
training-operator-658c68d697-46zmn 1/1 Running 0 90s
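The readiness check above can also be scripted. The following is a minimal sketch, not part of the original procedure, that parses the table printed by kubectl get pods and reports whether an operator pod is running with all of its containers ready. The function name operator_pods_ready is hypothetical, and the pod name prefix training-operator is an assumption taken from the sample output.

```python
def operator_pods_ready(kubectl_output: str,
                        name_prefix: str = "training-operator") -> bool:
    """Return True if a pod whose name starts with name_prefix is
    Running and reports all of its containers as ready (e.g. 1/1).

    kubectl_output is the plain table printed by
    `kubectl get pods -n kubeflow`. The default prefix is an assumption
    based on the sample output; adjust it for your installation.
    """
    rows = [line.split() for line in kubectl_output.strip().splitlines()]
    if not rows:
        return False
    header, pods = rows[0], rows[1:]
    try:
        # Locate columns by header name rather than by fixed position.
        name_i = header.index("NAME")
        ready_i = header.index("READY")
        status_i = header.index("STATUS")
    except ValueError:
        return False  # Not the expected table (for example, no pods found).
    for pod in pods:
        if len(pod) <= max(name_i, ready_i, status_i):
            continue  # Skip malformed rows.
        if not pod[name_i].startswith(name_prefix):
            continue
        ready, _, total = pod[ready_i].partition("/")
        if pod[status_i] == "Running" and ready == total and total != "0":
            return True
    return False
```

To use it, capture the command output (for example, with subprocess.run and capture_output=True) and pass the resulting text to the function.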
Submit a training job
To run a training job, prepare the job configuration file and run kubectl apply as follows.
kubectl apply -f /path/to/training_job.yaml
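Describe a training job
To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the job submission time, completion time, job status, and configuration details.
kubectl get pytorchjob training-job -n kubeflow -o yaml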
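If you prefer to process the job details programmatically, the following sketch condenses a job object into the fields mentioned above. It assumes that the job is a PyTorchJob, that its JSON form was obtained by requesting -o json instead of -o yaml and parsed with json.loads, and that the status carries startTime, completionTime, and conditions as in the Kubeflow Training Operator v1 job status. The function name summarize_job is hypothetical.

```python
def summarize_job(job: dict) -> dict:
    """Condense a PyTorchJob object (parsed from JSON) into a few fields.

    Assumption: the status layout follows the Kubeflow Training Operator
    v1 job status (startTime, completionTime, conditions). Missing
    fields are reported as None rather than raising an error.
    """
    metadata = job.get("metadata", {})
    status = job.get("status", {})
    conditions = status.get("conditions", [])
    # Treat the most recent condition whose status is "True" as the
    # current state of the job.
    state = next(
        (c.get("type", "Unknown") for c in reversed(conditions)
         if c.get("status") == "True"),
        "Unknown",
    )
    return {
        "name": metadata.get("name"),
        "submitted": metadata.get("creationTimestamp"),
        "started": status.get("startTime"),
        "completed": status.get("completionTime"),
        "state": state,
    }
```

A completed value of None indicates that the job has not finished yet.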
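Stop a training job and delete EKS resources
To stop a training job, use kubectl delete. The following example stops a training job created from a configuration file such as pytorch_job_simple.yaml. Pass the path to the same file that you used with kubectl apply.
kubectl delete -f /path/to/training_job.yaml
This should return the following output.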
pytorchjob.kubeflow.org "training-job" deleted
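Enable job auto-resume
SageMaker HyperPod supports job auto-resume for Kubernetes jobs by integrating with the Kubeflow Training Operator control plane.
Ensure that the cluster has enough nodes that have passed the SageMaker HyperPod health check. These nodes should have the label sagemaker.amazonaws.com/node-health-status set to Schedulable. We recommend that you include a node selector in the job YAML file so that pods are scheduled only on such nodes, as follows.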
sagemaker.amazonaws.com/node-health-status: Schedulable
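The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable job auto-resume. You need to add two annotations and set restartPolicy to OnFailure, as follows.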
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
  annotations:
    # config for job auto resume
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    # ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
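When job configurations are generated by a script, the same changes can be applied to a manifest held as a Python dictionary. The following is a minimal sketch under stated assumptions: the function name enable_auto_resume is hypothetical, and applying restartPolicy and the node selector to every replica type (not only Worker, as in the example above) is an assumption rather than a documented requirement.

```python
import copy

# Annotation and node selector keys taken from the YAML example above.
AUTO_RESUME_ANNOTATION = "sagemaker.amazonaws.com/enable-job-auto-resume"
MAX_RETRY_ANNOTATION = "sagemaker.amazonaws.com/job-max-retry-count"
NODE_HEALTH_KEY = "sagemaker.amazonaws.com/node-health-status"


def enable_auto_resume(job: dict, max_retries: int = 2) -> dict:
    """Return a copy of a PyTorchJob manifest with auto-resume enabled.

    The input dictionary is left unchanged. Annotation values must be
    strings, so max_retries is converted with str().
    """
    patched = copy.deepcopy(job)

    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[AUTO_RESUME_ANNOTATION] = "true"
    annotations[MAX_RETRY_ANNOTATION] = str(max_retries)

    replica_specs = patched.setdefault("spec", {}).setdefault(
        "pytorchReplicaSpecs", {}
    )
    # Assumption: every replica type gets the same restart policy and
    # node selector; the original example shows this for Worker only.
    for replica in replica_specs.values():
        replica["restartPolicy"] = "OnFailure"
        pod_spec = replica.setdefault("template", {}).setdefault("spec", {})
        pod_spec.setdefault("nodeSelector", {})[NODE_HEALTH_KEY] = "Schedulable"

    return patched
```

The patched dictionary can then be serialized back to YAML or JSON and submitted with kubectl apply -f.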
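Check the job auto-resume status
Run the following command to check the status of job auto-resume.
kubectl describe pytorchjob -n kubeflow <job-name>
Depending on the failure pattern, you might see one of the following two Kubeflow training job restart patterns.
Pattern 1: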
Start Time: 2024-07-11T05:53:10Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreateService 9m45s pytorchjob-controller Created service: pt-job-1-worker-0
Normal SuccessfulCreateService 9m45s pytorchjob-controller Created service: pt-job-1-worker-1
Normal SuccessfulCreateService 9m45s pytorchjob-controller Created service: pt-job-1-master-0
Warning PyTorchJobRestarting 7m59s pytorchjob-controller PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-worker-0
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-worker-1
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-master-0
Warning PyTorchJobRestarting 7m58s pytorchjob-controller PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
Pattern 2:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 19m pytorchjob-controller Created pod: pt-job-2-worker-0
Normal SuccessfulCreateService 19m pytorchjob-controller Created service: pt-job-2-worker-0
Normal SuccessfulCreatePod 19m pytorchjob-controller Created pod: pt-job-2-master-0
Normal SuccessfulCreateService 19m pytorchjob-controller Created service: pt-job-2-master-0
Normal SuccessfulCreatePod 4m48s pytorchjob-controller Created pod: pt-job-2-worker-0
Normal SuccessfulCreatePod 4m48s pytorchjob-controller Created pod: pt-job-2-master-0
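Both patterns can be detected from the Events section of the describe output. The following sketch counts PyTorchJobRestarting warnings and lists the pods that were created more than once. It relies only on the event text shown in the two samples above; the function name summarize_events is hypothetical, and treating a repeated "Created pod" event as a sign of a restart is an interpretation, not documented behavior.

```python
import re
from collections import Counter

_POD_CREATED = re.compile(r"Created pod: (\S+)")
# Matches the aggregation counter that kubectl prints, e.g. "(x2 over 9m45s)".
_REPEAT = re.compile(r"\(x(\d+) over [^)]*\)")


def summarize_events(describe_output: str) -> dict:
    """Summarize restart-related events from `kubectl describe pytorchjob`.

    Returns the number of PyTorchJobRestarting warnings and the sorted
    names of pods that were created more than once.
    """
    restart_warnings = 0
    pod_creations = Counter()
    for line in describe_output.splitlines():
        repeat = _REPEAT.search(line)
        count = int(repeat.group(1)) if repeat else 1
        if "PyTorchJobRestarting" in line:
            restart_warnings += count
        created = _POD_CREATED.search(line)
        if created:
            pod_creations[created.group(1)] += count
    recreated = sorted(name for name, n in pod_creations.items() if n > 1)
    return {"restart_warnings": restart_warnings, "recreated_pods": recreated}
```

For the samples above, pattern 1 yields restart warnings together with recreated pods, whereas pattern 2 yields recreated pods without any restart warning.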