Use the SMDDP library in your PyTorch Lightning training script
If you want to bring your PyTorch Lightning training script and run a distributed data parallel training job in SageMaker AI, you can do so with minimal changes to your training script. The necessary changes include the following: import the smdistributed.dataparallel library's PyTorch modules, set up the environment variables for PyTorch Lightning to accept the SageMaker AI environment variables that are preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to "smddp". To learn more, walk through the following instructions, which break down the steps with code examples.
Note
The PyTorch Lightning support is available in the SageMaker AI data parallel library v1.5.0 and later.
- Import the PyTorch Lightning library and the smdistributed.dataparallel.torch modules. Importing torch_smddp registers "smddp" as a process group backend for PyTorch.

      import lightning as pl
      import smdistributed.dataparallel.torch.torch_smddp
- Instantiate the LightningEnvironment class, and set its world size and global rank from the WORLD_SIZE and RANK environment variables that the SageMaker training toolkit presets.

      import os

      from lightning.fabric.plugins.environments.lightning import LightningEnvironment

      env = LightningEnvironment()
      env.world_size = lambda: int(os.environ["WORLD_SIZE"])
      env.global_rank = lambda: int(os.environ["RANK"])
- For PyTorch DDP – Create an object of the DDPStrategy class with "smddp" for process_group_backend and "gpu" for accelerator, and pass it to the Trainer class. (For one way to set num_gpus and num_nodes from SageMaker AI environment variables, see the sketch after this list.)

      import lightning as pl
      from lightning.pytorch.strategies import DDPStrategy

      ddp = DDPStrategy(
          cluster_environment=env,
          process_group_backend="smddp",
          accelerator="gpu"
      )
      trainer = pl.Trainer(
          max_epochs=200,
          strategy=ddp,
          devices=num_gpus,
          num_nodes=num_nodes
      )
- For PyTorch FSDP – Create an object of the FSDPStrategy class (with a wrapping policy of your choice) with "smddp" for process_group_backend, and pass it to the Trainer class. (An alternative, transformer-aware wrapping policy is sketched after this list.)

      import lightning as pl
      from lightning.pytorch.strategies import FSDPStrategy
      from functools import partial
      from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

      policy = partial(
          size_based_auto_wrap_policy,
          min_num_params=10000
      )
      fsdp = FSDPStrategy(
          auto_wrap_policy=policy,
          process_group_backend="smddp",
          cluster_environment=env
      )
      trainer = pl.Trainer(
          max_epochs=200,
          strategy=fsdp,
          devices=num_gpus,
          num_nodes=num_nodes
      )
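The size-based policy above wraps modules once they accumulate a given number of parameters. For transformer models, wrapping at layer boundaries often shards more predictably. The following is a minimal sketch of that alternative, assuming nn.TransformerEncoderLayer stands in for your model's actual block class; it is one option, not a setting prescribed by the SMDDP library.

    import torch.nn as nn
    from functools import partial
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

    # Wrap at transformer-layer boundaries instead of by parameter count.
    # nn.TransformerEncoderLayer is only an example; substitute the layer
    # class that your model actually uses.
    policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={nn.TransformerEncoderLayer},
    )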
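The Trainer examples above reference num_gpus and num_nodes without defining them. The following is a minimal sketch of one way to derive them from the SM_NUM_GPUS and SM_HOSTS environment variables that the SageMaker training toolkit presets, and then start training. MyLightningModule and train_loader are hypothetical placeholders for your own model and data loader.

    import json
    import os

    # Number of GPUs per instance and number of instances in the cluster,
    # read from SageMaker-preset environment variables.
    num_gpus = int(os.environ["SM_NUM_GPUS"])
    num_nodes = len(json.loads(os.environ["SM_HOSTS"]))

    trainer = pl.Trainer(
        max_epochs=200,
        strategy=ddp,  # or the fsdp strategy from the previous step
        devices=num_gpus,
        num_nodes=num_nodes
    )
    trainer.fit(MyLightningModule(), train_dataloaders=train_loader)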
After you have completed adapting your training script, proceed to Launching distributed training jobs with SMDDP using the SageMaker Python SDK.
Note
When you construct a SageMaker AI PyTorch estimator and submit a training job request in Launching distributed training jobs with SMDDP using the SageMaker Python SDK, you need to provide a requirements.txt file to install pytorch-lightning and lightning-bolts in the SageMaker AI PyTorch training container.

    # requirements.txt
    pytorch-lightning
    lightning-bolts
For more information about specifying the source directory in which to place the requirements.txt file along with your training script, and about submitting a job, see Using third-party libraries in the SageMaker Python SDK documentation.
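For orientation, the following is a minimal launch sketch using the SageMaker Python SDK. The entry point, source directory, IAM role, instance type, instance count, and framework versions are placeholder assumptions to adapt to your own setup; the distribution argument is what activates the SMDDP library for the job.

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",   # your adapted Lightning training script
        source_dir="src",         # directory that also contains requirements.txt
        role="<your-iam-role-arn>",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        framework_version="2.0.1",
        py_version="py310",
        # Activate the SageMaker AI data parallel library (SMDDP).
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit()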