Use the PyTorch framework estimators in the SageMaker Python SDK
You can launch distributed training by adding the distribution argument to the SageMaker framework estimators, PyTorch or TensorFlow.

- PyTorch
The following launcher options are available for launching PyTorch distributed training.
- pytorchddp – This option runs mpirun and sets up environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.

  { "pytorchddp": { "enabled": True } }
- torch_distributed – This option runs torchrun and sets up environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.

  { "torch_distributed": { "enabled": True } }
- smdistributed – This option also runs mpirun, but with smddprun, which sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.

  { "smdistributed": { "dataparallel": { "enabled": True } } }
If you choose to replace NCCL AllGather with SMDDP AllGather, you can use any of the three options. Choose the one that fits your use case.

If you choose to replace NCCL AllReduce with SMDDP AllReduce, you should choose one of the mpirun-based options, smdistributed or pytorchddp. You can also add additional MPI options as follows.

{ "pytorchddp": { "enabled": True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION" } }

{ "smdistributed": { "dataparallel": { "enabled": True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION" } } }
The following code sample shows the basic structure of a PyTorch estimator with distributed training options.
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
Note

PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker PyTorch DLCs. Create the following requirements.txt file and save it in the source directory where you keep the training script.

# requirements.txt
pytorch-lightning
lightning-bolts
For example, the tree-structured directory should look like the following.
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt
For more information about specifying the source directory that holds the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries in the Amazon SageMaker Python SDK documentation.

Considerations for activating SMDDP collective operations and using the right distributed training launcher options
- SMDDP AllReduce and SMDDP AllGather are not mutually compatible at present.

- SMDDP AllReduce is activated by default when using smdistributed or pytorchddp, which are mpirun-based launchers, and NCCL AllGather is used.

- SMDDP AllGather is activated by default when using the torch_distributed launcher, and AllReduce falls back to NCCL.

- SMDDP AllGather can also be activated when using the mpirun-based launchers with an additional environment variable set as follows (a sketch of setting it through the estimator follows this list).

  export SMDATAPARALLEL_OPTIMIZE_SDP=true
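If you prefer to export that variable from the launcher side rather than inside the container, one option is the estimator's environment parameter. The following is a minimal sketch under that assumption; environment is a general SageMaker Python SDK estimator parameter rather than an SMDDP-specific setting, and the placeholder values mirror the earlier example.

from sagemaker.pytorch import PyTorch

# Sketch: enable SMDDP AllGather with an mpirun-based launcher by exporting
# SMDATAPARALLEL_OPTIMIZE_SDP into the training job's environment (assumed wiring).
pt_estimator = PyTorch(
    entry_point="adapted-training-script.py",
    source_dir="subdirectory-to-your-code",
    role="SageMakerRole",
    framework_version="2.0.1",
    py_version="py310",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    environment={"SMDATAPARALLEL_OPTIMIZE_SDP": "true"},
)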
- TensorFlow
Important

The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow versions later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see TensorFlow (deprecated).
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")