Use the SMDDP library in your PyTorch Lightning training script
If you want to bring your PyTorch Lightning training script and run a distributed data parallel training job in SageMaker AI, you can do so with minimal changes to your training script. The necessary changes include the following: import the smdistributed.dataparallel library's PyTorch modules, set up the environment variables for PyTorch Lightning to accept the SageMaker AI environment variables that are preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to "smddp". To learn more, walk through the following instructions, which break down the steps with code examples.
Note
The PyTorch Lightning support is available in the SageMaker AI data parallel library v1.5.0 and later.
- Import the PyTorch Lightning library and the smdistributed.dataparallel.torch modules. Importing torch_smddp registers "smddp" as a process group backend for PyTorch.

      import lightning as pl
      import smdistributed.dataparallel.torch.torch_smddp
- Instantiate the LightningEnvironment class, and set its world size and global rank from the WORLD_SIZE and RANK environment variables that the SageMaker training toolkit presets.

      import os

      from lightning.fabric.plugins.environments.lightning import LightningEnvironment

      env = LightningEnvironment()
      env.world_size = lambda: int(os.environ["WORLD_SIZE"])
      env.global_rank = lambda: int(os.environ["RANK"])
- For PyTorch DDP – Create an object of the DDPStrategy class with "smddp" for process_group_backend and "gpu" for accelerator, and pass it to the Trainer class. (For one way to set num_gpus and num_nodes from SageMaker AI environment variables, see the sketch after this list.)

      import lightning as pl
      from lightning.pytorch.strategies import DDPStrategy

      ddp = DDPStrategy(
          cluster_environment=env,
          process_group_backend="smddp",
          accelerator="gpu"
      )
      trainer = pl.Trainer(
          max_epochs=200,
          strategy=ddp,
          devices=num_gpus,
          num_nodes=num_nodes
      )
- For PyTorch FSDP – Create an object of the FSDPStrategy class (with a wrapping policy of your choice) with "smddp" for process_group_backend, and pass it to the Trainer class. (An alternative, transformer-aware wrapping policy is sketched after this list.)

      import lightning as pl
      from lightning.pytorch.strategies import FSDPStrategy
      from functools import partial
      from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

      policy = partial(
          size_based_auto_wrap_policy,
          min_num_params=10000
      )
      fsdp = FSDPStrategy(
          auto_wrap_policy=policy,
          process_group_backend="smddp",
          cluster_environment=env
      )
      trainer = pl.Trainer(
          max_epochs=200,
          strategy=fsdp,
          devices=num_gpus,
          num_nodes=num_nodes
      )
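The size-based policy above wraps modules once they accumulate a given number of parameters. For transformer models, wrapping at layer boundaries often shards more predictably. The following is a minimal sketch of that alternative, assuming nn.TransformerEncoderLayer stands in for your model's actual block class; it is one option, not a setting prescribed by the SMDDP library.

    import torch.nn as nn
    from functools import partial
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

    # Wrap at transformer-layer boundaries instead of by parameter count.
    # nn.TransformerEncoderLayer is only an example; substitute the layer
    # class that your model actually uses.
    policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={nn.TransformerEncoderLayer},
    )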
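The Trainer examples above reference num_gpus and num_nodes without defining them. The following is a minimal sketch of one way to derive them from the SM_NUM_GPUS and SM_HOSTS environment variables that the SageMaker training toolkit presets, and then start training. MyLightningModule and train_loader are hypothetical placeholders for your own model and data loader.

    import json
    import os

    # Number of GPUs per instance and number of instances in the cluster,
    # read from SageMaker-preset environment variables.
    num_gpus = int(os.environ["SM_NUM_GPUS"])
    num_nodes = len(json.loads(os.environ["SM_HOSTS"]))

    trainer = pl.Trainer(
        max_epochs=200,
        strategy=ddp,  # or the fsdp strategy from the previous step
        devices=num_gpus,
        num_nodes=num_nodes
    )
    trainer.fit(MyLightningModule(), train_dataloaders=train_loader)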
After you have completed adapting your training script, proceed to Launching distributed training jobs with SMDDP using the SageMaker Python SDK.
Note
When you construct a SageMaker AI PyTorch estimator and submit a training job request in Launching distributed training jobs with SMDDP using the SageMaker Python SDK, you need to provide a requirements.txt file to install pytorch-lightning and lightning-bolts in the SageMaker AI PyTorch training container.

    # requirements.txt
    pytorch-lightning
    lightning-bolts
For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries in the SageMaker Python SDK documentation.
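For orientation, the following is a minimal launch sketch using the SageMaker Python SDK. The entry point, source directory, IAM role, instance type, instance count, and framework versions are placeholder assumptions to adapt to your own setup; the distribution argument is what activates the SMDDP library for the job.

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",   # your adapted Lightning training script
        source_dir="src",         # directory that also contains requirements.txt
        role="<your-iam-role-arn>",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        framework_version="2.0.1",
        py_version="py310",
        # Activate the SageMaker AI data parallel library (SMDDP).
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit()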