Launching distributed training jobs with SMDDP using the SageMaker Python SDK - Amazon SageMaker AI

Launching distributed training jobs with SMDDP using the SageMaker Python SDK

To run a distributed training job with your adapted script from Adapting your training script to use the SMDDP collective operations, use the SageMaker Python SDK's framework or generic estimators by specifying the prepared training script as an entry point script and the distributed training configuration.

This page walks you through how to use the SageMaker AI Python SDK in two ways.

  • If you want to achieve a quick adoption of your distributed training job in SageMaker AI, configure a SageMaker AI PyTorch or TensorFlow framework estimator class. The framework estimator picks up your training script and automatically matches the right image URI of the pre-built PyTorch or TensorFlow Deep Learning Containers (DLC), given the value specified to the framework_version parameter.

  • If you want to extend one of the pre-built containers or build a custom container to create your own ML environment with SageMaker AI, use the SageMaker AI generic Estimator class and specify the image URI of the custom Docker container hosted in your Amazon Elastic Container Registry (Amazon ECR).

Your training datasets should be stored in Amazon S3 or Amazon FSx for Lustre in the AWS Region in which you are launching your training job. If you use Jupyter notebooks, you should have a SageMaker notebook instance or a SageMaker Studio Classic app running in the same AWS Region. For more information about storing your training data, see the SageMaker Python SDK data inputs documentation.

Tip

We recommend that you use Amazon FSx for Lustre instead of Amazon S3 to improve training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.

Tip

To properly run distributed training on the EFA-enabled instance types, you should enables traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.

Choose one of the following topics for instructions on how to run a distributed training job of your training script. After you launch a training job, you can monitor system utilization and model performance using Amazon SageMaker Debugger or Amazon CloudWatch.

While you follow instructions in the following topics to learn more about technical details, we also recommend that you try the Amazon SageMaker AI data parallelism library examples to get started.