モデルパラメータをデバッグするためのデバッガールールを設定するにはシステムおよびフレームワークのメトリクスをプロファイリングするためのデバッガー組み込みルールを設定するには UpdateTrainingJob API オペレーションを使って Debugger プロファイリング設定を更新する CreateTrainingJob API に Debugger のカスタムルール設定を追加する

JSON (AWS CLI)

Amazon SageMaker Debugger の組み込みルールは、SageMaker AI CreateTrainingJob API オペレーションを通じて DebugHookConfig、DebugRuleConfiguration、ProfilerConfig、ProfilerRuleConfiguration オブジェクトを使用してトレーニングジョブ用に設定できます。RuleEvaluatorImage パラメータで正しいイメージ URI を指定する必要があります。次の例では、CreateTrainingJob をリクエストする JSON 文字列を設定する方法を具体的に説明します。

次のコードは、必要な設定とデバッガー設定でトレーニングジョブを実行するための完全な JSON テンプレートを示しています。テンプレートを JSON ファイルとして作業ディレクトリに保存し、 CLI AWS を使用してトレーニングジョブを実行します。例えば、次のコードを debugger-training-job-cli.json として保存します。

注記

正しい Docker コンテナイメージを使っていることを確認してください。深 AWS 層学習コンテナイメージを見つけるには、「利用可能な深層学習コンテナイメージ」を参照してください。デバッガールールを使用するために利用可能な Docker イメージの完全なリストを見つけるには、「Debugger ルールの Docker イメージ」を参照してください。


{
   "TrainingJobName": "debugger-aws-cli-test",
   "RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-YYYYMMDDT123456",
   "AlgorithmSpecification": {
      // Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
      "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
      "TrainingInputMode": "File",
      "EnableSageMakerMetricsTimeSeries": false
   },
   "HyperParameters": {
      "sagemaker_program": "entry_point/tf-hvd-train.py",
      "sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-profiling-test/source.tar.gz"
   },
   "OutputDataConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
   },
   "DebugHookConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-output",
      "CollectionConfigurations": [
         {
            "CollectionName": "losses",
            "CollectionParameters" : {
                "train.save_interval": "50"
            }
         }
      ]
   },
   "DebugRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "LossNotDecreasing",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
      }
   ],
   "ProfilerConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/profiler-output",
      "ProfilingIntervalInMilliseconds": 500,
      "ProfilingParameters": {
          "DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\", }",
          "DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, }",
          "PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cprofile\", \"cProfileTimer\": \"total_time\"}",
          "LocalPath": "/opt/ml/output/profiler/" 
      }
   },
   "ProfilerRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "ProfilerReport",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "ProfilerReport"}
      }
   ],
   "ResourceConfig": { 
      "InstanceType": "ml.p3.8xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 30
   },
   
   "StoppingCondition": { 
      "MaxRuntimeInSeconds": 86400
   }
}

JSON ファイルを保存したら、ターミナルで次のコマンドを実行します。(Jupyter ノートブックを使用している場合は、行の先頭で ! を使用してください)。


aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json

モデルパラメータをデバッグするためのデバッガールールを設定するには

次のコードサンプルは、この SageMaker API を使って組み込み VanishingGradient ルールを設定する方法を示しています。

デバッガーが出力テンソルを収集できるようにするには

デバッガーフック設定を次のように指定します。


"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters" : {
                "save_interval": "500"
            }
        }
    ]
}

これにより、トレーニングジョブは 500 ステップの save_interval ごとに、テンソルコレクション、gradients を保存します。利用可能な CollectionName 値を見つけるには、SMDebug クライアントライブラリドキュメントの「デバッガー組み込みコレクション」を参照してください。利用可能な CollectionParameters パラメータのキーと値を見つけるには、SageMaker Python SDK ドキュメントの sagemaker.debugger.CollectionConfig クラスを参照してください。

出力テンソルをデバッグするためのデバッガールールを有効にするには

次の DebugRuleConfigurations API の例は、保存された gradients コレクションで組み込みの VanishingGradient ルールを実行する方法を示しています。


"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]

このサンプルのような設定では、デバッガーは gradients テンソルのコレクションで VanishingGradient ルールを使ってトレーニングジョブのルール評価ジョブを開始します。デバッガールールを使用するために利用可能な Docker イメージの完全なリストを見つけるには、「Debugger ルールの Docker イメージ」を参照してください。RuleParameters のキーバリューのペアを見つけるには、「デバッガーの組み込みルールのリスト」を参照してください。

システムおよびフレームワークのメトリクスをプロファイリングするためのデバッガー組み込みルールを設定するには

次のサンプルコードは、Profiler Config API オペレーションを指定して、システムおよびフレームワークのメトリクスの収集を有効にする方法を示しています。

システムおよびフレームワークのメトリクスを収集するためにデバッガープロファイリングを有効にするには

Target Step


"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500, 
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/" 
    }
}

Target Time Duration


"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"  
    }
}

メトリクスをプロファイリングするためのデバッガールールを有効にするには

次のサンプルコードは、ProfilerReport ルールの設定方法を示しています。


"ProfilerRuleConfigurations": [ 
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]

デバッガールールを使用するために利用可能な Docker イメージの完全なリストを見つけるには、「Debugger ルールの Docker イメージ」を参照してください。RuleParameters のキーバリューのペアを見つけるには、「デバッガーの組み込みルールのリスト」を参照してください。

`UpdateTrainingJob` API オペレーションを使って Debugger プロファイリング設定を更新する

UpdateTrainingJob API オペレーションを使用すると、トレーニングジョブの実行中にデバッガープロファイリング設定を更新できます。新しい ProfilerConfig および ProfilerRuleConfiguration オブジェクトを設定し、TrainingJobName パラメータにトレーニングジョブ名を指定します。


{
    "ProfilerConfig": { 
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": { 
            "string" : "string" 
        }
    },
    "ProfilerRuleConfigurations": [ 
        { 
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": { 
                "string" : "string" 
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}

`CreateTrainingJob` API に Debugger のカスタムルール設定を追加する

カスタムルールは、CreateTrainingJob API オペレーションで DebugHookConfig および DebugRuleConfiguration オブジェクトを使って、トレーニングジョブ用に設定できます。次のコードサンプルは、この SageMaker API オペレーションを使用して、smdebug ライブラリを使って記述されたカスタム ImproperActivation ルールを設定する方法を示しています。この例では、カスタムルールを custom_rules.py ファイルに記述し、Amazon S3 バケットにアップロード済みであることを前提としています。例では、カスタムルールを実行するために使用できる構築済み Docker イメージを提供しています。これらについては、「カスタムルール評価用の Amazon SageMaker Debugger イメージの URI」を参照してください。RuleEvaluatorImage パラメータで、ビルド済み Docker イメージの URL レジストリアドレスを指定します。


"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRulesConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
           "source_s3_uri": "s3://bucket/custom_rules.py",
           "rule_to_invoke": "ImproperActivation",
           "collection_names": "relu_activations"
        }
    }
]

ブラウザで JavaScript が無効になっているか、使用できません。

AWS ドキュメントを使用するには、JavaScript を有効にする必要があります。手順については、使用するブラウザのヘルプページを参照してください。

ドキュメントの表記規則

SageMaker API を使って Debugger を設定する

SDK for Python (Boto3)