Learn how to activate SageMaker Training Compiler in three ways: using the SageMaker Python SDK, using the SageMaker Python SDK with an extended SageMaker Framework Deep Learning Container, and using the SageMaker CreateTrainingJob API operation.

Run TensorFlow Training Jobs with SageMaker Training Compiler

You can use any of the SageMaker interfaces to run a training job with SageMaker Training Compiler: Amazon SageMaker Studio Classic, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and AWS Command Line Interface.

Topics

Using the SageMaker Python SDK
Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning Containers
Enable SageMaker Training Compiler Using the SageMaker CreateTrainingJob API Operation

Using the SageMaker Python SDK

To turn on SageMaker Training Compiler, add the compiler_config parameter to the SageMaker TensorFlow or Hugging Face estimator. Import the TrainingCompilerConfig class and pass an instance of it to the compiler_config parameter. The following code examples show the structure of the SageMaker estimator classes with SageMaker Training Compiler turned on.

Tip

To get started with prebuilt models provided by the TensorFlow and Transformers libraries, try the batch sizes listed in the reference table at Tested Models.

Note

SageMaker Training Compiler for TensorFlow is available through the SageMaker TensorFlow and Hugging Face framework estimators.

For information that fits your use case, see one of the following options.

TensorFlow


from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64    

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_estimator=TensorFlow(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.9.1',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_estimator.fit()

To prepare your training script, see the following pages.

For single GPU training of a model constructed using TensorFlow Keras (tf.keras.*).
For single GPU training of a model constructed using TensorFlow modules (tf.*, excluding the TensorFlow Keras modules).

Hugging Face Estimator with TensorFlow


from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()

To prepare your training script, see the following pages.

For single GPU training of a TensorFlow Keras model using Hugging Face Transformers
For single GPU training of a TensorFlow model using Hugging Face Transformers

The following list is the minimal set of parameters required to run a SageMaker training job with the compiler.

Note

When using the SageMaker Hugging Face estimator, you must specify the transformers_version, tensorflow_version, hyperparameters, and compiler_config parameters to enable SageMaker Training Compiler. You cannot use image_uri to manually specify the Training Compiler integrated Deep Learning Containers that are listed at Supported Frameworks.

entry_point (str): Required. Specify the file name of your training script.
instance_count (int): Required. Specify the number of instances.
instance_type (str): Required. Specify the instance type.
transformers_version (str): Required only when using the SageMaker Hugging Face estimator. Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To find available versions, see Supported Frameworks.
framework_version or tensorflow_version (str): Required. Specify the TensorFlow version supported by SageMaker Training Compiler. To find available versions, see Supported Frameworks.

Note
When using the SageMaker TensorFlow estimator, you must specify framework_version.
When using the SageMaker Hugging Face estimator, you must specify both transformers_version and tensorflow_version.
hyperparameters (dict): Optional. Specify hyperparameters for the training job, such as n_gpus, batch_size, and learning_rate. When you enable SageMaker Training Compiler, try larger batch sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusting batch sizes to improve training speed, see Tested Models and SageMaker Training Compiler Example Notebooks and Blogs.
compiler_config (TrainingCompilerConfig object): Required. Include this parameter to turn on SageMaker Training Compiler. The following are the parameters of the TrainingCompilerConfig class.
- enabled (bool): Optional. Specify True or False to turn SageMaker Training Compiler on or off. The default value is True.
- debug (bool): Optional. To receive more detailed training logs from your compiler-accelerated training jobs, change it to True. However, the additional logging might add overhead and slow down the compiled training job. The default value is False.
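
The advice to try a larger batch size and adjust the learning rate accordingly can be sketched as follows. This is a minimal sketch that assumes linear scaling of the learning rate with the batch size, the same arithmetic used in the sample estimator code on this page. The helper name scale_learning_rate is hypothetical and not part of the SageMaker Python SDK, and linear scaling is only one heuristic, so validate the result against your own model.

```python
import math


def scale_learning_rate(lr_native: float, batch_size_native: int, batch_size: int) -> float:
    """Scale a learning rate linearly with the batch size (hypothetical helper)."""
    if batch_size_native <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return lr_native / batch_size_native * batch_size


# Values taken from the sample code on this page:
# batch size 12 without the compiler, 64 with the compiler.
hyperparameters = {
    "n_gpus": 1,
    "batch_size": 64,
    "learning_rate": scale_learning_rate(5e-5, 12, 64),
}
```

The resulting dictionary can be passed to the hyperparameters argument of the estimator.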

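Warning

Turning on SageMaker Debugger might impact the performance of SageMaker Training Compiler. We recommend that you turn off Debugger when running SageMaker Training Compiler to make sure there's no impact on performance. For more information, see Considerations. To turn the Debugger functionalities off, add the following two arguments to the estimator: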


disable_profiler=True,
debugger_hook_config=False

If the training job with the compiler launches successfully, you receive the following logs during the job initialization phase:

With TrainingCompilerConfig(debug=False)


Found configuration for Training Compiler
Configuring SM Training Compiler...

With TrainingCompilerConfig(debug=True)


Found configuration for Training Compiler
Configuring SM Training Compiler...
Training Compiler set to debug mode
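
If you collect the job logs as text, the initialization messages above can be checked programmatically. The following is a minimal sketch, assuming the log text has already been retrieved (for example, copied from the training job output). The function name compiler_status is hypothetical, and the check relies only on the log lines shown on this page.

```python
# Log lines quoted from the examples on this page.
CONFIGURED_MARKER = "Configuring SM Training Compiler..."
DEBUG_MARKER = "Training Compiler set to debug mode"


def compiler_status(log_text: str) -> dict:
    """Report whether the compiler was configured and whether debug mode is on."""
    lines = [line.strip() for line in log_text.splitlines()]
    return {
        "configured": any(CONFIGURED_MARKER in line for line in lines),
        "debug": any(DEBUG_MARKER in line for line in lines),
    }


sample_log = """Found configuration for Training Compiler
Configuring SM Training Compiler...
Training Compiler set to debug mode
"""
status = compiler_status(sample_log)
```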

Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning Containers

AWS Deep Learning Containers (DLC) for TensorFlow use adapted versions of TensorFlow that include changes on top of the open source TensorFlow framework. The SageMaker Framework Deep Learning Containers are optimized for the underlying AWS infrastructure and Amazon SageMaker. Building on the DLCs, the SageMaker Training Compiler integration adds further performance improvements over native TensorFlow. Furthermore, you can create a custom training container by extending the DLC image.

Note

This Docker customization feature is currently available only for TensorFlow.

To extend and customize the SageMaker TensorFlow DLCs for your use case, use the following instructions.

Create a Dockerfile

Use the following Dockerfile template to extend the SageMaker TensorFlow DLC. You must use the SageMaker TensorFlow DLC image as the base image of your Docker container. To find the SageMaker TensorFlow DLC image URIs, see Supported Frameworks.


# SageMaker TensorFlow Deep Learning Container image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/tensorflow-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker container 
# to determine user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Add more code lines to customize for your use-case
...

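For more information, see Step 2: Create and upload the Dockerfile and Python training scripts.

Consider the following pitfalls when extending SageMaker Framework DLCs:

Do not explicitly uninstall or change the version of the TensorFlow packages in SageMaker containers. Doing so causes the AWS optimized TensorFlow packages to be overwritten by open source TensorFlow packages, which might result in performance degradation.
Watch out for packages that have a particular TensorFlow version or flavor as a dependency. These packages might implicitly uninstall the AWS optimized TensorFlow and install open source TensorFlow packages.

For example, there is a known issue that the tensorflow/models and tensorflow/text libraries always attempt to reinstall open source TensorFlow. If you need to install these libraries to choose a specific version for your use case, we recommend that you look into the SageMaker TensorFlow DLC Dockerfiles for v2.9 or later. The paths to the Dockerfiles are typically in the following format: tensorflow/training/docker/<tensorflow-version>/py3/<cuda-version>/Dockerfile.gpu. The Dockerfiles contain code lines that reinstall the AWS managed TensorFlow binary (specified in the TF_URL environment variable) and other dependencies in order. The reinstallation section should look like the following example: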


# tf-models does not respect existing installations of TensorFlow 
# and always installs open source TensorFlow

RUN pip3 install --no-cache-dir -U \
    tf-models-official==x.y.z

RUN pip3 uninstall -y tensorflow tensorflow-gpu \
  ; pip3 install --no-cache-dir -U \
    ${TF_URL} \
    tensorflow-io==x.y.z \
    tensorflow-datasets==x.y.z
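
Before adding packages to a custom image, it can help to scan your requirements for entries that might displace the AWS optimized TensorFlow. The following is a minimal sketch under stated assumptions: the function flag_requirements and both package sets are hypothetical, and the set of risky packages assumes that the tensorflow/models and tensorflow/text libraries are installed from the pip distributions tf-models-official and tensorflow-text. The list is illustrative, not exhaustive.

```python
import re

# Assumption: pip distribution names for tensorflow/models and tensorflow/text.
RISKY_PACKAGES = {"tf-models-official", "tensorflow-text"}
# Installing these directly would replace the AWS optimized TensorFlow build.
TENSORFLOW_PACKAGES = {"tensorflow", "tensorflow-gpu"}


def _distribution_name(requirement: str) -> str:
    """Return the normalized package name that precedes any version specifier."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return match.group(1).lower().replace("_", "-") if match else ""


def flag_requirements(requirements):
    """Return (requirement, reason) pairs for entries that need attention."""
    flagged = []
    for requirement in requirements:
        name = _distribution_name(requirement)
        if name in TENSORFLOW_PACKAGES:
            flagged.append((requirement, "replaces the AWS optimized TensorFlow package"))
        elif name in RISKY_PACKAGES:
            flagged.append((requirement, "known to reinstall open source TensorFlow"))
    return flagged


flagged = flag_requirements([
    "numpy>=1.21",
    "tf-models-official==2.9.2",
    "tensorflow-io==0.26.0",
    "tensorflow==2.9.1",
])
```

For any flagged entry, follow the reinstallation pattern from the DLC Dockerfiles so that the AWS managed TensorFlow binary is restored afterward.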

Build and push to ECR

To build your Docker container and push it to Amazon ECR, follow the instructions at the following links.
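
As a rough outline of what building and pushing involves, the following sketch composes the usual AWS CLI and Docker commands as strings without running them. It assumes that the Amazon ECR repository already exists, that the Dockerfile is in the current directory, and that you have separately authenticated Docker to the registry that hosts the base DLC image. The function name ecr_build_and_push_commands is hypothetical.

```python
def ecr_build_and_push_commands(account_id, region, repository, tag="latest"):
    """Compose the shell commands to build an image and push it to Amazon ECR."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image_uri = f"{registry}/{repository}:{tag}"
    return [
        # Authenticate Docker to your private ECR registry.
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # Build the image from the Dockerfile in the current directory.
        f"docker build -t {repository}:{tag} .",
        # Tag the local image with the full ECR image URI.
        f"docker tag {repository}:{tag} {image_uri}",
        # Push the tagged image to ECR.
        f"docker push {image_uri}",
    ]


commands = ecr_build_and_push_commands(
    "111122223333", "us-east-2", "tf-custom-container-test"
)
for command in commands:
    print(command)
```

Printing the commands rather than executing them lets you review each step before running it in a shell.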

Run using the SageMaker Python SDK Estimator

Use the SageMaker TensorFlow framework estimator as usual. You must specify image_uri to use the new container that you hosted in Amazon ECR.


import sagemaker, boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'tf-custom-container-test'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(
    account_id, region, uri_suffix, ecr_repository + tag
)

byoc_image_uri
# This should return something like
# 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest

estimator = TensorFlow(
    image_uri=byoc_image_uri,  # the custom container URI built above
    role=get_execution_role(),
    base_job_name='tf-custom-container-test-job',
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

# Start training
estimator.fit()

Enable SageMaker Training Compiler Using the SageMaker CreateTrainingJob API Operation

SageMaker Training Compiler configuration options must be specified through the AlgorithmSpecification and HyperParameters fields in the request syntax for the CreateTrainingJob API operation.


"AlgorithmSpecification": {
    "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
    "sagemaker_training_compiler_enabled": "true",
    "sagemaker_training_compiler_debug_mode": "false"
}

To find the deep learning container image URIs that have SageMaker Training Compiler implemented, see Supported Frameworks.
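
To show where these two fields sit in a full request, the following sketch builds a CreateTrainingJob request as a Python dictionary. It is a minimal sketch, not a complete job definition: the function build_training_job_request and its default values (instance type, volume size, runtime limit) are assumptions for illustration, the image URI placeholder is left unfilled, and a real job would also need settings for your training script and input data, which are omitted here.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.p3.2xlarge", debug=False):
    """Build a CreateTrainingJob request with Training Compiler turned on."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Hyperparameter values must be strings.
        "HyperParameters": {
            "sagemaker_training_compiler_enabled": "true",
            "sagemaker_training_compiler_debug_mode": "true" if debug else "false",
        },
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,  # illustrative value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # illustrative value
    }


request = build_training_job_request(
    job_name="tf-training-compiler-example",
    image_uri="<sagemaker-training-compiler-enabled-dlc-image>",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
    output_s3_uri="s3://amzn-s3-demo-bucket/output",
)

# To submit the job (requires AWS credentials and real values):
# import boto3
# boto3.client("sagemaker").create_training_job(**request)
```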
