Launch training jobs with Debugger using the SageMaker Python SDK

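To configure a SageMaker estimator with SageMaker Debugger, use the Amazon SageMaker Python SDK and specify the Debugger-specific parameters. To take full advantage of the debugging functionality, configure three parameters: debugger_hook_config, tensorboard_output_config, and rules.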
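When the training job is created, the SDK passes these settings to the low-level CreateTrainingJob API request, where they appear in fields such as DebugHookConfig, DebugRuleConfigurations, and TensorBoardOutputConfig. The following sketch builds a fragment of that request as plain Python dictionaries to show its shape. The function name, bucket, collection settings, and rule image URI are hypothetical placeholders, and the exact fields the SDK emits may differ, so check the CreateTrainingJob API reference before relying on them.

```python
# Hypothetical sketch (not an SDK function): how the three Debugger estimator
# parameters map onto fields of the CreateTrainingJob request.
# All values below are placeholders.


def debugger_request_fragment(s3_prefix, rule_names, rule_image_uri,
                              tensorboard=True):
    """Return DebugHookConfig, DebugRuleConfigurations and, optionally,
    TensorBoardOutputConfig as plain dictionaries."""
    fragment = {
        # debugger_hook_config -> DebugHookConfig
        "DebugHookConfig": {
            "S3OutputPath": f"{s3_prefix}/debug-output",
            # Example collection; CollectionParameters values are strings.
            "CollectionConfigurations": [
                {"CollectionName": "losses",
                 "CollectionParameters": {"save_interval": "50"}},
            ],
        },
        # rules -> DebugRuleConfigurations, one entry per built-in rule
        "DebugRuleConfigurations": [
            {
                "RuleConfigurationName": name,
                # Region-specific Debugger rule image URI (placeholder here)
                "RuleEvaluatorImage": rule_image_uri,
                "RuleParameters": {"rule_to_invoke": name},
            }
            for name in rule_names
        ],
    }
    if tensorboard:
        # tensorboard_output_config -> TensorBoardOutputConfig
        fragment["TensorBoardOutputConfig"] = {
            "S3OutputPath": f"{s3_prefix}/tensorboard-output",
        }
    return fragment


request = debugger_request_fragment(
    "s3://amzn-s3-demo-bucket/debugger-demo",  # hypothetical bucket/prefix
    ["VanishingGradient", "LossNotDecreasing"],
    "<rule-evaluator-image-uri>",
)
print(sorted(request))
```

With the SDK you normally never write these dictionaries yourself; the sketch only makes visible which request field each estimator parameter feeds.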

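Important

Before you construct the estimator and run its fit method to launch a training job, make sure that you have adapted your training script by following the instructions in Adapt your training script to register a hook.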

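Construct a SageMaker estimator with Debugger-specific parameters

The code examples in this section show how to construct a SageMaker estimator with the Debugger-specific parameters.

Note

The following code examples are templates for constructing SageMaker framework estimators and cannot be run as-is. Placeholders such as DebuggerHookConfig(...), rule_configs.built_in_rule(), and the entry_point path must be filled in; proceed to the following sections to configure the specific Debugger parameters.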

PyTorch


# An example of constructing a SageMaker PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

TensorFlow


# An example of constructing a SageMaker TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

MXNet


# An example of constructing a SageMaker MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

XGBoost


# An example of constructing a SageMaker XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)

Generic estimator


# An example of constructing a SageMaker generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
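All five templates pass the same Debugger-specific keyword arguments to different estimator classes. One way to avoid repeating them is to collect them once and unpack them into whichever estimator you use. The helper below is a hypothetical convenience, not part of the SageMaker Python SDK; it only drops parameters you left unset.

```python
# Hypothetical helper, not part of the SageMaker Python SDK: collect the
# Debugger-specific keyword arguments once and reuse them for any estimator.


def debugger_kwargs(debugger_hook_config=None, rules=None,
                    tensorboard_output_config=None):
    """Return only the Debugger parameters that were explicitly set."""
    candidates = {
        "debugger_hook_config": debugger_hook_config,
        "rules": rules,
        "tensorboard_output_config": tensorboard_output_config,
    }
    # Leave out None so the estimator falls back to its own default;
    # an explicit False (for example, to turn off the hook) is kept.
    return {name: value for name, value in candidates.items()
            if value is not None}


# Usage with a placeholder value; in practice pass the DebuggerHookConfig and
# the list of Rule objects from the templates above, for example:
#   estimator = PyTorch(entry_point="...", ..., **debugger_kwargs(hook, rules))
print(debugger_kwargs(rules=["placeholder-rule"]))
# {'rules': ['placeholder-rule']}
```

Filtering on `is not None` rather than truthiness matters here: `debugger_hook_config=False` is a meaningful setting and should reach the estimator unchanged.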

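Configure the following parameters to activate SageMaker Debugger:

debugger_hook_config (a DebuggerHookConfig object) – Activates the hook that you registered in your training script in Adapt your training script to register a hook, configures the SageMaker training launcher (estimator) to collect output tensors from your training job, and saves the tensors to your secured S3 bucket or local machine. To learn how to configure the debugger_hook_config parameter, see Configuring SageMaker Debugger to save tensors.
rules (a list of Rule objects) – Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in real time. Built-in rules are logic that automatically debugs the training progress of your model and finds training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure the rules parameter, see How to configure Debugger built-in rules. For a complete list of built-in rules for debugging output tensors, see Debugger rules. If you want to write your own logic to detect training issues, see Creating custom rules using the Debugger client library.

Note
Built-in rules are available only on SageMaker training instances. You cannot use them in local mode.
tensorboard_output_config (a TensorBoardOutputConfig object) – Configures SageMaker Debugger to collect output tensors in a TensorBoard-compatible format and save them to the S3 output path specified in the TensorBoardOutputConfig object. To learn more, see Visualize Amazon SageMaker Debugger output tensors in TensorBoard.

Note
tensorboard_output_config must be configured together with the debugger_hook_config parameter, which also requires you to adapt your training script by adding the sagemaker-debugger hook.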
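The two notes above state constraints that are easy to violate when switching between local testing and managed training. The following hypothetical pre-flight check (not an SDK function) encodes them before you call estimator.fit(). It is deliberately strict: it treats a missing (None) or False debugger_hook_config as "not configured", even though the SDK may apply its own default hook when the parameter is omitted.

```python
# Hypothetical pre-flight check (not an SDK function) that mirrors two
# constraints from the notes above.


def check_debugger_settings(instance_type, debugger_hook_config=None,
                            rules=None, tensorboard_output_config=None):
    """Raise ValueError when the Debugger settings break a documented constraint."""
    # Built-in rules run only on SageMaker training instances, not in local
    # mode ("local" or "local_gpu" instance types).
    if rules and instance_type.startswith("local"):
        raise ValueError("Built-in rules are not available in local mode.")
    # tensorboard_output_config has to be configured together with
    # debugger_hook_config (None or False counts as not configured here).
    if tensorboard_output_config is not None and not debugger_hook_config:
        raise ValueError(
            "tensorboard_output_config requires debugger_hook_config.")
    return True


# A managed training instance with rules passes the check.
check_debugger_settings("ml.p3.2xlarge", debugger_hook_config=object(),
                        rules=["placeholder-rule"])

# Local mode with built-in rules is rejected.
try:
    check_debugger_settings("local", rules=["placeholder-rule"])
except ValueError as err:
    print(err)  # Built-in rules are not available in local mode.
```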

Note

SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the default S3 bucket URI in your account has the format s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/. SageMaker Debugger creates two subfolders: debug-output and rule-output. If you add the tensorboard_output_config parameter, you will also find a tensorboard-output folder.
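To locate these folders programmatically, you can assemble the URIs from the format given in the note. The helper below is hypothetical and follows that format literally; the job name that SageMaker actually generates from base_job_name may carry a timestamp suffix, so list the real prefix in S3 rather than assuming this string matches exactly. The region, account ID, and job name in the usage line are placeholders.

```python
# Hypothetical helper: build the Debugger output folder URIs following the
# default layout s3://sagemaker-<region>-<account>/<base-job-name>/<subfolder>/.


def debugger_output_uris(region, account_id, base_job_name, tensorboard=False):
    """Return {subfolder: S3 URI} under the default SageMaker bucket."""
    if not (len(account_id) == 12 and account_id.isdigit()):
        raise ValueError("account_id must be a 12-digit string")
    prefix = f"s3://sagemaker-{region}-{account_id}/{base_job_name}"
    subfolders = ["debug-output", "rule-output"]
    if tensorboard:
        # Present only when tensorboard_output_config is set.
        subfolders.append("tensorboard-output")
    return {name: f"{prefix}/{name}/" for name in subfolders}


uris = debugger_output_uris("us-west-2", "111122223333", "debugger-demo")
print(uris["rule-output"])
# s3://sagemaker-us-west-2-111122223333/debugger-demo/rule-output/
```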

See the following topics for more detailed examples of how to configure the Debugger-specific parameters.

Topics


Configuring SageMaker Debugger to save tensors