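Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
SageMaker HyperPod clusters orchestrated by Amazon EKS can integrate with the MLflow application on Amazon SageMaker Studio. Cluster admins set up the MLflow server and connect it to the SageMaker HyperPod cluster. Data scientists can then gain insights into their models.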
To set up an MLflow server using the AWS CLI
The MLflow tracking server should be created by the cluster admin.
-
Create a SageMaker MLflow tracking server by following the instructions in Create a tracking server using the AWS CLI.
-
Make sure that the eks-auth:AssumeRoleForPodIdentity permission exists in the IAM execution role for SageMaker HyperPod.
-
If the eks-pod-identity-agent add-on is not already installed on your EKS cluster, install it.

aws eks create-addon \
  --cluster-name <eks_cluster_name> \
  --addon-name eks-pod-identity-agent \
  --addon-version vx.y.z-eksbuild.1
-
Create a trust-relationship.json file for a new role that Pods assume to call MLflow APIs.

cat >trust-relationship.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
EOF

Run the following command to create the role and attach the trust relationship.

aws iam create-role --role-name hyperpod-mlflow-role \
  --assume-role-policy-document file://trust-relationship.json \
  --description "allow pods to emit mlflow metrics and put data in s3"
-
Create the following policy, which grants Pods permission to call all sagemaker-mlflow operations and to put model artifacts in S3. The tracking server already has S3 permissions, but when model artifacts are too large, the MLflow code uploads them by calling S3 directly, so the Pod role needs s3:PutObject as well. Because s3:PutObject is an object-level action, the S3 resource is written as an object ARN pattern (bucket ARN followed by /*).

cat >hyperpod-mlflow-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker-mlflow:AccessUI",
        "sagemaker-mlflow:CreateExperiment",
        "sagemaker-mlflow:SearchExperiments",
        "sagemaker-mlflow:GetExperiment",
        "sagemaker-mlflow:GetExperimentByName",
        "sagemaker-mlflow:DeleteExperiment",
        "sagemaker-mlflow:RestoreExperiment",
        "sagemaker-mlflow:UpdateExperiment",
        "sagemaker-mlflow:CreateRun",
        "sagemaker-mlflow:DeleteRun",
        "sagemaker-mlflow:RestoreRun",
        "sagemaker-mlflow:GetRun",
        "sagemaker-mlflow:LogMetric",
        "sagemaker-mlflow:LogBatch",
        "sagemaker-mlflow:LogModel",
        "sagemaker-mlflow:LogInputs",
        "sagemaker-mlflow:SetExperimentTag",
        "sagemaker-mlflow:SetTag",
        "sagemaker-mlflow:DeleteTag",
        "sagemaker-mlflow:LogParam",
        "sagemaker-mlflow:GetMetricHistory",
        "sagemaker-mlflow:SearchRuns",
        "sagemaker-mlflow:ListArtifacts",
        "sagemaker-mlflow:UpdateRun",
        "sagemaker-mlflow:CreateRegisteredModel",
        "sagemaker-mlflow:GetRegisteredModel",
        "sagemaker-mlflow:RenameRegisteredModel",
        "sagemaker-mlflow:UpdateRegisteredModel",
        "sagemaker-mlflow:DeleteRegisteredModel",
        "sagemaker-mlflow:GetLatestModelVersions",
        "sagemaker-mlflow:CreateModelVersion",
        "sagemaker-mlflow:GetModelVersion",
        "sagemaker-mlflow:UpdateModelVersion",
        "sagemaker-mlflow:DeleteModelVersion",
        "sagemaker-mlflow:SearchModelVersions",
        "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
        "sagemaker-mlflow:TransitionModelVersionStage",
        "sagemaker-mlflow:SearchRegisteredModels",
        "sagemaker-mlflow:SetRegisteredModelTag",
        "sagemaker-mlflow:DeleteRegisteredModelTag",
        "sagemaker-mlflow:DeleteModelVersionTag",
        "sagemaker-mlflow:DeleteRegisteredModelAlias",
        "sagemaker-mlflow:SetRegisteredModelAlias",
        "sagemaker-mlflow:GetModelVersionByAlias"
      ],
      "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
    }
  ]
}
EOF

Note
Use the ARN of the MLflow tracking server you created and the S3 bucket that was configured for that server when you followed the instructions in Set up MLflow infrastructure.
-
Attach the mlflow-metrics-emit-policy policy to hyperpod-mlflow-role, using the policy document saved in the previous step.

aws iam put-role-policy \
  --role-name hyperpod-mlflow-role \
  --policy-name mlflow-metrics-emit-policy \
  --policy-document file://hyperpod-mlflow-policy.json
-
Create a Kubernetes service account that Pods use to access the MLflow server.

cat >mlflow-service-account.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mlflow-service-account
  namespace: kubeflow
EOF

Run the following command to apply it to the EKS cluster.

kubectl apply -f mlflow-service-account.yaml
-
Create a Pod identity association that binds the service account to the role.

aws eks create-pod-identity-association \
  --cluster-name EKS_CLUSTER_NAME \
  --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
  --namespace kubeflow \
  --service-account mlflow-service-account
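The two JSON documents in the steps above (the trust relationship and the permissions policy) can also be generated from parameters instead of being edited by hand, which helps avoid ARN typos. The following is a minimal sketch under stated assumptions, not part of the official procedure: the function names `build_trust_policy`, `build_mlflow_policy`, and `write_policy` are hypothetical helpers, and the short action list in the usage note is an illustrative subset of the full list shown earlier.

```python
import json


def build_trust_policy():
    """Trust policy that lets EKS Pod Identity assume the role (same content as trust-relationship.json)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
                "Effect": "Allow",
                "Principal": {"Service": "pods.eks.amazonaws.com"},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def build_mlflow_policy(tracking_server_arn, bucket_name, mlflow_actions):
    """Permissions policy for Pods: sagemaker-mlflow actions plus S3 artifact upload.

    mlflow_actions holds bare action names such as "LogMetric"; the
    "sagemaker-mlflow:" prefix is added here.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"sagemaker-mlflow:{name}" for name in mlflow_actions],
                "Resource": tracking_server_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # s3:PutObject is an object-level action, so the resource is an
                # object ARN pattern rather than the bare bucket ARN.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


def write_policy(path, document):
    """Write a policy document as indented JSON, ready for file:// use with the AWS CLI."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, indent=2)
        handle.write("\n")
```

For example, `write_policy("hyperpod-mlflow-policy.json", build_mlflow_policy(arn, bucket, ["CreateRun", "LogMetric", "LogParam"]))` would produce a file that `aws iam put-role-policy` can consume; pass the full action list from the policy step if Pods need every operation.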
To collect metrics from training jobs to the MLflow server
Data scientists need to set up the training script and Docker image to emit metrics to the MLflow server.
-
Add the following lines at the beginning of your training script.

import os

import mlflow

# Set the tracking server URI using the ARN of the tracking server you created
mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
# Enable autologging in MLflow
mlflow.autolog()
-
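Build a Docker image that contains the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see Pushing a Docker image in the ECR User Guide.
Tip
Make sure that your Dockerfile installs the mlflow and sagemaker-mlflow packages. To learn more about installing the packages, their requirements, and compatible versions, see Install MLflow and the SageMaker MLflow plugin.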
-
Add the service account to the training job Pods to give them access to hyperpod-mlflow-role. This allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template, and save it as mlflow-test.yaml.

defaults:
  - override hydra/job_logging: stdout

hydra:
  run:
    dir: .
  output_subdir: null

training_cfg:
  entry_script: ./train.py
  script_args: []
  run:
    name: test-job-with-mlflow # Current run name
    nodes: 2 # Number of nodes to use for current training
    # ntasks_per_node: 1 # Number of devices to use per node
  cluster:
    cluster_type: k8s # currently k8s only
    instance_type: ml.c5.2xlarge
    cluster_config:
      # name of service account associated with the namespace
      service_account_name: mlflow-service-account
      # persistent volume, usually used to mount FSx
      persistent_volume_claims: null
      namespace: kubeflow
      # required node affinity to select nodes with SageMaker HyperPod
      # labels and passed health check if burn-in enabled
      label_selector:
        required:
          sagemaker.amazonaws.com/node-health-status:
            - Schedulable
        preferred:
          sagemaker.amazonaws.com/deep-health-check-status:
            - Passed
        weights:
          - 100
      pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
      restartPolicy: OnFailure # restart policy

base_results_dir: ./result # Location to store the results, checkpoints and logs.
container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use

env_vars:
  NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
  MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
-
Start the job using the YAML file as follows.

hyperpod start-job --config-file /path/to/mlflow-test.yaml
-
Generate a presigned URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

aws sagemaker create-presigned-mlflow-tracking-server-url \
  --tracking-server-name "tracking-server-name" \
  --session-expiration-duration-in-seconds 1800 \
  --expires-in-seconds 300 \
  --region region
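The last step can also be scripted. The sketch below is an illustration under assumptions, not the documented procedure: it parses the same ARN that the job passes in MLFLOW_TRACKING_ARN, rejects malformed values (for example, an account ID that is not 12 digits), and requests the presigned URL through a SageMaker client that the caller supplies. The helper names `parse_tracking_server_arn` and `presigned_mlflow_url` are hypothetical, and the regular expression assumes the standard `aws` partition and a server name made of letters, digits, and hyphens.

```python
import re

# Assumed shape of a tracking server ARN, based on the examples in this guide.
_ARN_RE = re.compile(
    r"^arn:aws:sagemaker:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):"
    r"mlflow-tracking-server/(?P<name>[A-Za-z0-9-]+)$"
)


def parse_tracking_server_arn(arn):
    """Return (region, account_id, server_name), or raise ValueError if malformed."""
    match = _ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not a valid MLflow tracking server ARN: {arn!r}")
    return match["region"], match["account"], match["name"]


def presigned_mlflow_url(sagemaker_client, arn, session_seconds=1800, expires_seconds=300):
    """Request a presigned URL for the tracking server named in the ARN.

    sagemaker_client is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker", region_name=region). It is passed in so that the
    function can be exercised without AWS credentials.
    """
    _, _, name = parse_tracking_server_arn(arn)
    response = sagemaker_client.create_presigned_mlflow_tracking_server_url(
        TrackingServerName=name,
        SessionExpirationDurationInSeconds=session_seconds,
        ExpiresInSeconds=expires_seconds,
    )
    return response["AuthorizedUrl"]
```

The default durations mirror the CLI example (1800 and 300 seconds). Running the parser on the ARN before submitting a job is a cheap way to catch a mistyped account ID or server name before the training Pods start.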