Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the MLflow application on Amazon SageMaker Studio. Cluster admins set up the MLflow server and connect it with the SageMaker HyperPod clusters. Data scientists can then gain insights into model training by tracking metrics from their training jobs on the MLflow server.
To set up an MLflow server using AWS CLI
The MLflow tracking server should be created by the cluster admin.
- Create a SageMaker AI MLflow tracking server, following the instructions at Create a tracking server using the AWS CLI.
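For example, a create call might look like the following sketch; the tracking server name, artifact store S3 URI, and IAM role ARN are placeholders that you replace with your own values, and the linked instructions describe the full set of options.

```
# Minimal sketch of creating a SageMaker AI MLflow tracking server.
# The server name, S3 URI, and role ARN below are placeholders.
aws sagemaker create-mlflow-tracking-server \
    --tracking-server-name tracking-server-name \
    --artifact-store-uri s3://mlflow-s3-bucket_name/mlflow-artifacts \
    --role-arn arn:aws:iam::111122223333:role/mlflow-tracking-server-role \
    --tracking-server-size Small
```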
- Make sure that the `eks-auth:AssumeRoleForPodIdentity` permission exists in the IAM execution role for SageMaker HyperPod.
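If the permission is missing, one option is to attach a small inline policy to the HyperPod execution role, mirroring the put-role-policy pattern used later in this procedure. The role name, policy name, and file name below are illustrative, not values from this guide.

```
# Illustrative inline policy granting the Pod Identity permission.
# Replace hyperpod-execution-role with your actual HyperPod execution role name.
cat >eks-auth-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "eks-auth:AssumeRoleForPodIdentity",
            "Resource": "*"
        }
    ]
}
EOF

aws iam put-role-policy \
    --role-name hyperpod-execution-role \
    --policy-name eks-auth-assume-role-for-pod-identity \
    --policy-document file://eks-auth-policy.json
```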
- If the `eks-pod-identity-agent` add-on is not already installed on your EKS cluster, install the add-on on the EKS cluster.

```
aws eks create-addon \
    --cluster-name <eks_cluster_name> \
    --addon-name eks-pod-identity-agent \
    --addon-version vx.y.z-eksbuild.1
```
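To check whether the add-on is already present, and which version is installed, you can describe it; an error response means it has not been installed yet.

```
# Returns the add-on details if eks-pod-identity-agent is installed.
aws eks describe-addon \
    --cluster-name <eks_cluster_name> \
    --addon-name eks-pod-identity-agent
```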
- Create a `trust-relationship.json` file for a new role for the Pod to call MLflow APIs.

```
cat >trust-relationship.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
EOF
```
Run the following code to create the role and attach the trust relationship.

```
aws iam create-role --role-name hyperpod-mlflow-role \
    --assume-role-policy-document file://trust-relationship.json \
    --description "allow pods to emit mlflow metrics and put data in s3"
```
- Create the following policy that grants the Pod access to call all `sagemaker-mlflow` operations and to put model artifacts in S3. The S3 permission already exists within the tracking server, but if the model artifacts are too large, the MLflow code calls S3 directly to upload them.

```
cat >hyperpod-mlflow-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker-mlflow:AccessUI",
                "sagemaker-mlflow:CreateExperiment",
                "sagemaker-mlflow:SearchExperiments",
                "sagemaker-mlflow:GetExperiment",
                "sagemaker-mlflow:GetExperimentByName",
                "sagemaker-mlflow:DeleteExperiment",
                "sagemaker-mlflow:RestoreExperiment",
                "sagemaker-mlflow:UpdateExperiment",
                "sagemaker-mlflow:CreateRun",
                "sagemaker-mlflow:DeleteRun",
                "sagemaker-mlflow:RestoreRun",
                "sagemaker-mlflow:GetRun",
                "sagemaker-mlflow:LogMetric",
                "sagemaker-mlflow:LogBatch",
                "sagemaker-mlflow:LogModel",
                "sagemaker-mlflow:LogInputs",
                "sagemaker-mlflow:SetExperimentTag",
                "sagemaker-mlflow:SetTag",
                "sagemaker-mlflow:DeleteTag",
                "sagemaker-mlflow:LogParam",
                "sagemaker-mlflow:GetMetricHistory",
                "sagemaker-mlflow:SearchRuns",
                "sagemaker-mlflow:ListArtifacts",
                "sagemaker-mlflow:UpdateRun",
                "sagemaker-mlflow:CreateRegisteredModel",
                "sagemaker-mlflow:GetRegisteredModel",
                "sagemaker-mlflow:RenameRegisteredModel",
                "sagemaker-mlflow:UpdateRegisteredModel",
                "sagemaker-mlflow:DeleteRegisteredModel",
                "sagemaker-mlflow:GetLatestModelVersions",
                "sagemaker-mlflow:CreateModelVersion",
                "sagemaker-mlflow:GetModelVersion",
                "sagemaker-mlflow:UpdateModelVersion",
                "sagemaker-mlflow:DeleteModelVersion",
                "sagemaker-mlflow:SearchModelVersions",
                "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                "sagemaker-mlflow:TransitionModelVersionStage",
                "sagemaker-mlflow:SearchRegisteredModels",
                "sagemaker-mlflow:SetRegisteredModelTag",
                "sagemaker-mlflow:DeleteRegisteredModelTag",
                "sagemaker-mlflow:DeleteModelVersionTag",
                "sagemaker-mlflow:DeleteRegisteredModelAlias",
                "sagemaker-mlflow:SetRegisteredModelAlias",
                "sagemaker-mlflow:GetModelVersionByAlias"
            ],
            "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>"
        }
    ]
}
EOF
```

Note

The ARNs should be those of the MLflow tracking server and the S3 bucket that you set up with the MLflow server when you created it following the instructions in Set up MLflow infrastructure.
- Attach the `mlflow-metrics-emit-policy` policy to the `hyperpod-mlflow-role` using the policy document saved in the previous step.

```
aws iam put-role-policy \
    --role-name hyperpod-mlflow-role \
    --policy-name mlflow-metrics-emit-policy \
    --policy-document file://hyperpod-mlflow-policy.json
```
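To confirm that the policy is attached to the role as expected, you can read it back.

```
# Shows the inline policy document attached in the previous step.
aws iam get-role-policy \
    --role-name hyperpod-mlflow-role \
    --policy-name mlflow-metrics-emit-policy
```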
- Create a Kubernetes service account for the Pod to access the MLflow server.

```
cat > mlflow-service-account.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mlflow-service-account
  namespace: kubeflow
EOF
```

Run the following command to apply it to the EKS cluster.

```
kubectl apply -f mlflow-service-account.yaml
```
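You can verify that the service account was created in the kubeflow namespace.

```
# Lists the service account created by the manifest above.
kubectl get serviceaccount mlflow-service-account -n kubeflow
```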
- Create a Pod identity association.

```
aws eks create-pod-identity-association \
    --cluster-name EKS_CLUSTER_NAME \
    --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
    --namespace kubeflow \
    --service-account mlflow-service-account
```
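To confirm that the association links the service account to `hyperpod-mlflow-role`, list the associations for the namespace.

```
# Lists Pod identity associations in the kubeflow namespace.
aws eks list-pod-identity-associations \
    --cluster-name EKS_CLUSTER_NAME \
    --namespace kubeflow
```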
To emit metrics from training jobs to the MLflow server
Data scientists need to set up the training script and Docker image to emit metrics to the MLflow server.
- Add the following lines at the beginning of your training script.

```
import os

import mlflow

# Set the tracking server URI using the ARN of the tracking server you created
mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])

# Enable autologging in MLflow
mlflow.autolog()
```
- Build a Docker image with the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see Pushing a Docker image in the ECR User Guide.

Tip

Make sure that you install the mlflow and sagemaker-mlflow packages in the Dockerfile. To learn more about installing the packages, their requirements, and compatible versions, see Install MLflow and the SageMaker AI MLflow plugin.
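As a rough sketch of the build-and-push flow described in the linked ECR guide, assuming an ECR repository named mlflow-training in us-west-2 (both are illustrative values, not taken from this guide):

```
# Authenticate Docker to your ECR registry, then build, tag, and push the image.
# The repository name mlflow-training and the region are illustrative.
aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-west-2.amazonaws.com

docker build -t mlflow-training:latest .
docker tag mlflow-training:latest 111122223333.dkr.ecr.us-west-2.amazonaws.com/mlflow-training:latest
docker push 111122223333.dkr.ecr.us-west-2.amazonaws.com/mlflow-training:latest
```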
- Add a service account to the training job Pods to give them access to `hyperpod-mlflow-role`, which allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template. Create it with the file name `mlflow-test.yaml`.

```
defaults:
  - override hydra/job_logging: stdout

hydra:
  run:
    dir: .
  output_subdir: null

training_cfg:
  entry_script: ./train.py
  script_args: []
  run:
    name: test-job-with-mlflow # Current run name
    nodes: 2 # Number of nodes to use for current training
    # ntasks_per_node: 1 # Number of devices to use per node
cluster:
  cluster_type: k8s # currently k8s only
  instance_type: ml.c5.2xlarge
  cluster_config:
    # name of service account associated with the namespace
    service_account_name: mlflow-service-account
    # persistent volume, usually used to mount FSx
    persistent_volume_claims: null
    namespace: kubeflow
    # required node affinity to select nodes with SageMaker HyperPod
    # labels and passed health check if burn-in enabled
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
    restartPolicy: OnFailure # restart policy
base_results_dir: ./result # Location to store the results, checkpoints and logs.
container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use
env_vars:
  NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
  MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
```
- Start the job using the YAML file as follows.

```
hyperpod start-job --config-file /path/to/mlflow-test.yaml
```
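Assuming the job runs in the kubeflow namespace as configured above, you can watch the Pods start and confirm they use the MLflow service account.

```
# List the training job Pods.
kubectl get pods -n kubeflow

# Print the service account attached to a specific Pod (replace the Pod name).
kubectl get pod <training-pod-name> -n kubeflow -o jsonpath='{.spec.serviceAccountName}'
```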
- Generate a pre-signed URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

```
aws sagemaker create-presigned-mlflow-tracking-server-url \
    --tracking-server-name "tracking-server-name" \
    --session-expiration-duration-in-seconds 1800 \
    --expires-in-seconds 300 \
    --region region
```
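The command returns a JSON document that contains the authorized URL. If you only want the URL itself, you can extract it with a query; the AuthorizedUrl field name below reflects the API response at the time of writing.

```
# Print only the pre-signed URL from the response.
aws sagemaker create-presigned-mlflow-tracking-server-url \
    --tracking-server-name "tracking-server-name" \
    --query AuthorizedUrl --output text
```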