Use the PyTorch framework estimators in the SageMaker Python SDK

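To launch distributed training, add the distribution argument to one of the SageMaker framework estimators, PyTorch or TensorFlow. For details, choose one of the frameworks supported by the SageMaker distributed data parallelism (SMDDP) library from the following sections.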

PyTorch

The following launcher options are available for launching PyTorch distributed training.

pytorchddp – This option runs mpirun and sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.
```
{ "pytorchddp": { "enabled": True } }
```
torch_distributed – This option runs torchrun and sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.
```
{ "torch_distributed": { "enabled": True } }
```
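smdistributed – This option also runs mpirun, but with smddprun, which sets up the environment variables needed for running PyTorch distributed training on SageMaker. To use this option, pass the following dictionary to the distribution parameter.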
```
{ "smdistributed": { "dataparallel": { "enabled": True } } }
```
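
As a minimal sketch, the three dictionaries above can be produced by one small helper. The function name `build_distribution` and the `LAUNCHERS` table are hypothetical and are not part of the SageMaker Python SDK; only the dictionary shapes come from the options listed above.

```python
# Hypothetical helper; not part of the SageMaker Python SDK.
# Maps each launcher option named above to the key path of its "enabled" flag.
LAUNCHERS = {
    "pytorchddp": ("pytorchddp",),
    "torch_distributed": ("torch_distributed",),
    "smdistributed": ("smdistributed", "dataparallel"),
}


def build_distribution(launcher):
    """Return the distribution dictionary for one launcher option."""
    if launcher not in LAUNCHERS:
        raise ValueError(f"unknown launcher option: {launcher!r}")
    # Build the nested dictionary from the innermost level outward.
    config = {"enabled": True}
    for key in reversed(LAUNCHERS[launcher]):
        config = {key: config}
    return config


print(build_distribution("smdistributed"))
```

The returned dictionary is what would be passed as the distribution argument of the estimator.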

If you chose to replace NCCL AllGather with SMDDP AllGather, you can use any of the three options. Choose the one that fits your use case.

If you chose to replace NCCL AllReduce with SMDDP AllReduce, you must choose one of the mpirun-based options, smdistributed or pytorchddp. You can also add MPI options as follows.


```
{
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```
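
The two dictionaries above place custom_mpi_options at different nesting levels. The following sketch shows one way to attach the option at the right level for either mpirun-based launcher. The helper `with_mpi_options` is hypothetical, and rejecting other launchers is an assumption based only on the two examples shown here.

```python
import copy


# Hypothetical helper; not part of the SageMaker Python SDK.
def with_mpi_options(distribution, mpi_options):
    """Return a copy of an mpirun-based distribution dict with custom_mpi_options set."""
    result = copy.deepcopy(distribution)  # leave the caller's dictionary untouched
    if "pytorchddp" in result:
        target = result["pytorchddp"]
    elif "smdistributed" in result:
        target = result["smdistributed"]["dataparallel"]
    else:
        # Assumption: only the two mpirun-based options shown above take this key.
        raise ValueError("custom_mpi_options applies to mpirun-based launchers only")
    target["custom_mpi_options"] = mpi_options
    return result


base = {"smdistributed": {"dataparallel": {"enabled": True}}}
print(with_mpi_options(base, "-verbose -x NCCL_DEBUG=VERSION"))
```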

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.


```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```
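
Before submitting a job, it can help to check the instance settings against the comments in the sample above. The following is a rough sketch of such a check. The function `check_smddp_settings` and the constant `SMDDP_PYTORCH_INSTANCE_TYPES` are hypothetical names, and the instance-type list is copied from the sample's comment, so it may not reflect the current set of supported types.

```python
# Instance types listed in the sample's comment; treat this list as an assumption
# that may be out of date.
SMDDP_PYTORCH_INSTANCE_TYPES = {"ml.p4d.24xlarge", "ml.p4de.24xlarge"}


def check_smddp_settings(instance_type, instance_count):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if instance_type not in SMDDP_PYTORCH_INSTANCE_TYPES:
        problems.append(f"{instance_type} is not in the instance list from the sample")
    if instance_count < 2:
        problems.append("instance_count must be greater than 1 for a multi-node job")
    return problems


print(check_smddp_settings("ml.p4d.24xlarge", 2))  # the sample's own values
print(check_smddp_settings("ml.g5.xlarge", 1))     # values that fail both checks
```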

Note

Utility libraries such as PyTorch Lightning and Lightning Bolts are not preinstalled in the SageMaker PyTorch DLCs. Create the following requirements.txt file and save it in the source directory where you keep the training script.


```
# requirements.txt
pytorch-lightning
lightning-bolts
```

For example, the directory tree should look like the following.


```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt
```

For more information about specifying the source directory that holds the requirements.txt file along with your training script and job submission, see "Using third-party libraries" in the Amazon SageMaker Python SDK documentation.
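
The layout described above can also be prepared from the launcher notebook. The following sketch writes a requirements.txt file into the source directory. The helper `write_requirements` is a hypothetical name, and the example writes into a temporary directory so that it does not touch a real project.

```python
import tempfile
from pathlib import Path


def write_requirements(source_dir, packages):
    """Write requirements.txt into source_dir, one package per line."""
    source_dir = Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    path = source_dir / "requirements.txt"
    path.write_text("\n".join(packages) + "\n")
    return path


# A temporary directory stands in for the folder that holds the notebook.
workdir = Path(tempfile.mkdtemp())
req = write_requirements(
    workdir / "sub-folder-for-your-code",
    ["pytorch-lightning", "lightning-bolts"],
)
print(req.read_text())
```

In a real project, the directory passed to the helper would be the same one given to the estimator as source_dir.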

Considerations for activating SMDDP collective operations and using the right distributed training launcher options

SMDDP AllReduce and SMDDP AllGather are currently not compatible with each other.
SMDDP AllReduce is activated by default when you use smdistributed or pytorchddp, the mpirun-based launchers, and NCCL AllGather is used.
SMDDP AllGather is activated by default when you use the torch_distributed launcher, and AllReduce falls back to NCCL.
SMDDP AllGather can also be activated when you use the mpirun-based launchers with an additional environment variable set as follows.
```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```
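
The considerations above amount to a small decision table, sketched below. The function `resolve_collectives` is a hypothetical name, and its `optimize_sdp` flag stands for SMDATAPARALLEL_OPTIMIZE_SDP=true being set for the job. That AllReduce is handled by NCCL when the flag is set on an mpirun-based launcher is an assumption inferred from the statement that the two SMDDP operations are not compatible.

```python
MPIRUN_LAUNCHERS = {"pytorchddp", "smdistributed"}


def resolve_collectives(launcher, optimize_sdp=False):
    """Return which library handles AllReduce and AllGather for a launcher option."""
    if launcher == "torch_distributed" or (launcher in MPIRUN_LAUNCHERS and optimize_sdp):
        # SMDDP AllGather is active; AllReduce is assumed to be handled by NCCL.
        return {"AllReduce": "NCCL", "AllGather": "SMDDP"}
    if launcher in MPIRUN_LAUNCHERS:
        # Default for the mpirun-based launchers.
        return {"AllReduce": "SMDDP", "AllGather": "NCCL"}
    raise ValueError(f"unknown launcher option: {launcher!r}")


for name in ("pytorchddp", "torch_distributed"):
    print(name, resolve_collectives(name))
print("smdistributed + OPTIMIZE_SDP", resolve_collectives("smdistributed", optimize_sdp=True))
```

One way the variable might be supplied is through the estimator's environment argument, for example environment={"SMDATAPARALLEL_OPTIMIZE_SDP": "true"}. Verify this against the SageMaker Python SDK documentation before relying on it.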

TensorFlow

Important

The SMDDP library has discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow versions later than v2.11.0. To find earlier TensorFlow DLCs with the SMDDP library installed, see TensorFlow (deprecated).


```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```
