Use the PyTorch framework estimators in the SageMaker Python SDK

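You can launch distributed training by adding the distribution argument to the SageMaker framework estimators, PyTorch or TensorFlow. For more details, choose one of the frameworks supported by the SageMaker distributed data parallelism (SMDDP) library from the following selections.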

PyTorch

The following launcher options are available for launching PyTorch distributed training.

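pytorchddp: This option runs mpirun and sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.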
```
{ "pytorchddp": { "enabled": True } }
```
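torch_distributed: This option runs torchrun and sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.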
```
{ "torch_distributed": { "enabled": True } }
```
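smdistributed: This option also runs mpirun, but with smddprun, which sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.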
```
{ "smdistributed": { "dataparallel": { "enabled": True } } }
```
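The three dictionaries above differ only in their keys, so they can be built from a launcher name. The following is a minimal sketch of that mapping; `build_distribution` is a hypothetical helper introduced here for illustration and is not part of the SageMaker Python SDK.

```python
# Hypothetical helper (not part of the SageMaker Python SDK): maps a launcher
# name to the dictionary expected by the estimator's distribution parameter.
def build_distribution(launcher):
    if launcher == "pytorchddp":
        return {"pytorchddp": {"enabled": True}}
    if launcher == "torch_distributed":
        return {"torch_distributed": {"enabled": True}}
    if launcher == "smdistributed":
        # smdistributed nests its settings under the "dataparallel" key.
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    raise ValueError(f"Unknown launcher: {launcher!r}")


print(build_distribution("torch_distributed"))
# {'torch_distributed': {'enabled': True}}
```

The returned dictionary would be passed unchanged as the distribution argument of the estimator.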

If you chose to replace NCCL AllGather with SMDDP AllGather, you can use all three options. Choose the one that fits your use case.

If you chose to replace NCCL AllReduce with SMDDP AllReduce, you should choose one of the mpirun-based options: smdistributed or pytorchddp. You can also add additional MPI options as follows.


```
{
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```


```
{
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```
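Because custom_mpi_options sits at a different nesting level for each of the two mpirun-based launchers, it can help to build the dictionary in one place. The sketch below assumes a hypothetical helper named `with_mpi_options`; neither it nor the `MPIRUN_LAUNCHERS` constant comes from the SageMaker Python SDK.

```python
# Hypothetical helper (not part of the SageMaker Python SDK): attaches extra
# MPI options to one of the two mpirun-based launcher dictionaries.
MPIRUN_LAUNCHERS = ("pytorchddp", "smdistributed")


def with_mpi_options(launcher, mpi_options):
    if launcher not in MPIRUN_LAUNCHERS:
        raise ValueError(f"{launcher!r} is not an mpirun-based launcher")
    settings = {"enabled": True, "custom_mpi_options": mpi_options}
    if launcher == "smdistributed":
        # smdistributed nests its settings under the "dataparallel" key.
        return {"smdistributed": {"dataparallel": settings}}
    return {"pytorchddp": settings}


distribution = with_mpi_options("smdistributed", "-verbose -x NCCL_DEBUG=VERSION")
```

Rejecting torch_distributed here reflects that it runs torchrun rather than mpirun, so MPI options do not apply to it.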

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.


```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

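Note

PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker PyTorch DLCs. Create the following requirements.txt file and save it in the source directory where you save the training script.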


```
# requirements.txt
pytorch-lightning
lightning-bolts
```

For example, the tree-structured directory should look like the following.


```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├──  adapted-training-script.py
    └──  requirements.txt
```
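If you prefer to generate the requirements file from the launcher notebook rather than by hand, a few lines of standard-library Python are enough. This is only a sketch: `write_requirements` is a hypothetical helper, and the default package list simply repeats the two packages named above.

```python
from pathlib import Path


# Hypothetical helper: writes one package name per line to
# <source_dir>/requirements.txt, creating the directory if it is missing.
def write_requirements(source_dir, packages=("pytorch-lightning", "lightning-bolts")):
    target = Path(source_dir) / "requirements.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return target
```

Calling `write_requirements("sub-folder-for-your-code")` from the notebook's directory would produce the layout shown in the tree.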

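For more information about specifying the source directory that holds the requirements.txt file along with your training script, and about submitting the job, see Using third-party libraries in the Amazon SageMaker Python SDK documentation.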

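Considerations for activating SMDDP collective operations and using the right distributed training launcher options

SMDDP AllReduce and SMDDP AllGather are currently not mutually compatible.
SMDDP AllReduce is activated by default when using smdistributed or pytorchddp, which are mpirun-based launchers, and NCCL AllGather is used.
SMDDP AllGather is activated by default when using the torch_distributed launcher, and AllReduce falls back to NCCL.
SMDDP AllGather can also be activated when using the mpirun-based launchers by setting an additional environment variable as follows.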
```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```
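The considerations above amount to a small decision table, which the following sketch encodes. `active_collectives` is a hypothetical helper, not an SMDDP or SageMaker API. It also makes one assumption that the text only implies: because the two SMDDP collectives are not mutually compatible, activating SMDDP AllGather under an mpirun-based launcher is taken to leave AllReduce on NCCL.

```python
MPIRUN_LAUNCHERS = ("pytorchddp", "smdistributed")


# Hypothetical helper: returns which library serves each collective.
# optimize_sdp stands for SMDATAPARALLEL_OPTIMIZE_SDP=true being exported.
def active_collectives(launcher, optimize_sdp=False):
    if launcher == "torch_distributed":
        # torchrun: SMDDP AllGather by default, AllReduce falls back to NCCL.
        return {"AllReduce": "NCCL", "AllGather": "SMDDP"}
    if launcher in MPIRUN_LAUNCHERS:
        if optimize_sdp:
            # Assumption: the two SMDDP collectives are not mutually
            # compatible, so AllReduce is taken to fall back to NCCL here.
            return {"AllReduce": "NCCL", "AllGather": "SMDDP"}
        # mpirun default: SMDDP AllReduce with NCCL AllGather.
        return {"AllReduce": "SMDDP", "AllGather": "NCCL"}
    raise ValueError(f"Unknown launcher: {launcher!r}")
```

For example, `active_collectives("pytorchddp")` reports SMDDP for AllReduce and NCCL for AllGather, matching the default for the mpirun-based launchers.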

TensorFlow

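Important

The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see TensorFlow (deprecated).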


```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```
