Run jobs using kubectl - Amazon SageMaker


Run jobs using kubectl

Note that you should have installed the Kubeflow Training Operator in your cluster using Helm charts. For more information, see Install packages on the Amazon EKS cluster using Helm. Run the following command to verify that the Kubeflow Training Operator control plane is set up correctly.

kubectl get pods -n kubeflow

This returns output similar to the following.

NAME                                 READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn   1/1     Running   0          90s

Submit a training job

To run a training job, prepare your job configuration file and run the kubectl apply command as follows.

kubectl apply -f /path/to/training_job.yaml
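For reference, a job configuration file passed to kubectl apply is a Kubeflow training-job manifest. The sketch below is a minimal, hypothetical PyTorchJob: the job name, container image, and entrypoint are placeholders you would replace with your own, and only the commonly required fields are shown.

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: training-job          # hypothetical job name
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/my-training-image:latest   # placeholder image
              command: ["python", "/workspace/train.py"]    # placeholder entrypoint
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/my-training-image:latest
              command: ["python", "/workspace/train.py"]
```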

Describe a training job

To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the job submission time, completion time, job status, and configuration details.

kubectl get -o yaml training-job -n kubeflow
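For scripted monitoring, the same information is also available as JSON via `kubectl get pytorchjob <job-name> -n kubeflow -o json`. The sketch below parses a payload of that shape to pull out the fields most scripts need; the embedded sample is illustrative (not output from a real cluster), and the `status` layout follows the Kubeflow training-operator API.

```python
import json

# Sample payload in the shape returned by:
#   kubectl get pytorchjob training-job -n kubeflow -o json
# (values are illustrative, not captured from a real cluster)
raw = '''
{
  "apiVersion": "kubeflow.org/v1",
  "kind": "PyTorchJob",
  "metadata": {"name": "training-job", "namespace": "kubeflow"},
  "status": {
    "startTime": "2024-07-11T05:53:10Z",
    "completionTime": "2024-07-11T06:10:42Z",
    "conditions": [
      {"type": "Created", "status": "True", "lastTransitionTime": "2024-07-11T05:53:10Z"},
      {"type": "Running", "status": "False", "lastTransitionTime": "2024-07-11T05:53:30Z"},
      {"type": "Succeeded", "status": "True", "lastTransitionTime": "2024-07-11T06:10:42Z"}
    ]
  }
}
'''

def summarize(job: dict) -> dict:
    """Extract submission time, completion time, and current phase."""
    status = job.get("status", {})
    # The most recent condition whose status is "True" reflects the job's phase.
    phase = next(
        (c["type"] for c in reversed(status.get("conditions", []))
         if c.get("status") == "True"),
        "Unknown",
    )
    return {
        "name": job["metadata"]["name"],
        "startTime": status.get("startTime"),
        "completionTime": status.get("completionTime"),
        "phase": phase,
    }

print(summarize(json.loads(raw)))
```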

Stop a training job and delete EKS resources

To stop a training job, use kubectl delete. The following is an example of stopping a training job created from the configuration file pytorch_job_simple.yaml.

kubectl delete -f /path/to/training_job.yaml

This should return the following output.

pytorchjob.kubeflow.org "training-job" deleted

Enable job auto-resume

SageMaker HyperPod supports the job auto-resume functionality for Kubernetes jobs, integrated with the Kubeflow Training Operator control plane.

Make sure that your cluster has a sufficient number of nodes that have passed the SageMaker HyperPod health check. The node label sagemaker.amazonaws.com/node-health-status should be set to Schedulable. We recommend adding a node selector to the job YAML file to select nodes with the appropriate configuration, as follows.

sagemaker.amazonaws.com/node-health-status: Schedulable

The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set restartPolicy to OnFailure, as follows.

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
  annotations: {
    // config for job auto resume
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
  }
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable

Check the job auto-resume status

Run the following command to check the status of job auto-resume.

kubectl describe pytorchjob -n kubeflow <job-name>

Depending on the failure mode, you might see two patterns of Kubeflow training job restarts, as follows.

Pattern 1

Start Time:  2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.

Pattern 2

Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
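The two patterns differ in the events the controller emits: the first includes explicit PyTorchJobRestarting warnings before the pods are recreated, while the second only shows the same pods being created again with no warning. If you collect event text from `kubectl describe pytorchjob`, a small sketch like the one below can tell the two apart; the heuristic (duplicate "Created pod" events imply a silent restart) is an assumption of this example, not an official SageMaker or Kubeflow API.

```python
import re

def classify_restart(events_text: str) -> str:
    """Classify auto-resume behavior from `kubectl describe pytorchjob` events.

    "pattern-1": controller logged explicit PyTorchJobRestarting warnings.
    "pattern-2": pods recreated (duplicate SuccessfulCreatePod events for the
                 same pod name) without a restarting warning.
    "no-restart": neither signal present.
    """
    if "PyTorchJobRestarting" in events_text:
        return "pattern-1"
    pods = re.findall(r"Created pod: (\S+)", events_text)
    if len(pods) != len(set(pods)):  # same pod name created more than once
        return "pattern-2"
    return "no-restart"

# Abbreviated event text from the two patterns shown above.
pattern1 = """
Warning  PyTorchJobRestarting  7m59s  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
Normal   SuccessfulCreatePod   7m58s  pytorchjob-controller  Created pod: pt-job-1-master-0
"""
pattern2 = """
Normal  SuccessfulCreatePod  19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
Normal  SuccessfulCreatePod  4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
"""
print(classify_restart(pattern1))  # pattern-1
print(classify_restart(pattern2))  # pattern-2
```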