List of Debugger built-in profiler rules - Amazon SageMaker AI

List of Debugger built-in profiler rules

Use the Debugger built-in profiler rules provided by Amazon SageMaker Debugger to analyze metrics collected while training your models. The Debugger built-in rules monitor various common conditions that are critical for running a performant training job. You can call the built-in profiler rules using the Amazon SageMaker Python SDK or the low-level SageMaker API operations. There's no additional cost for using the built-in rules. For more information about billing, see the Amazon SageMaker AI Pricing page.

Note

The maximum number of built-in profiler rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.

Important

To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.

import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)

Profiler rules

The following rules are the Debugger built-in rules that are callable using the ProfilerRule.sagemaker classmethod.

Debugger built-in rule for generating the profiling report

Scope of Validity | Built-in Rules
Profiling Report for any SageMaker training job | ProfilerReport

Debugger built-in rules for profiling hardware system resource utilization (system metrics)

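Scope of Validity | Built-in Rules
Generic system monitoring rules for any SageMaker training job | BatchSize, CPUBottleneck, GPUMemoryIncrease, IOBottleneck, LoadBalancing, LowGPUUtilization, OverallSystemUsage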

Debugger built-in rules for profiling framework metrics

Scope of Validity | Built-in Rules
Profiling rules for deep learning frameworks (TensorFlow and PyTorch) | MaxInitializationTime, OverallFrameworkMetrics, StepOutlier

Warning

In favor of Amazon SageMaker Profiler, SageMaker AI Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.

  • SageMaker Python SDK <= v2.130.0

  • PyTorch >= v1.6.0, < v2.0

  • TensorFlow >= v2.3.1, < v2.11

See also the release notes for March 16, 2023.

To use the built-in rules with default parameter values, use the following configuration format:

from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_n())
]

To use the built-in rules with customized parameter values, use the following configuration format:

from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInRuleName(),
        rule_parameters={
            "key": "value"
        }
    )
]

To find available keys for the rule_parameters parameter, see the parameter description tables.

Sample rule configuration code is provided for each built-in rule below the parameter description tables.

ProfilerReport

The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling report and updates it when the individual rules are triggered. You can download a comprehensive profiling report while a training job is running or after the training job is complete. You can adjust the rule parameter values to customize the sensitivity of the built-in monitoring and profiling rules. The following example code shows the basic format to adjust the built-in rule parameters through the ProfilerReport rule.

rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            <BuiltInRuleName>_<parameter_name> = value
        )
    )
]

If you trigger this ProfilerReport rule without any customized parameter as shown in the following example code, then the ProfilerReport rule triggers all of the built-in rules for monitoring and profiling with their default parameter values.

rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]

The following example code shows how to specify and adjust the CPUBottleneck rule's cpu_threshold parameter and the IOBottleneck rule's threshold parameter.

rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            CPUBottleneck_cpu_threshold = 90,
            IOBottleneck_threshold = 90
        )
    )
]
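When many thresholds are adjusted at once, the `<BuiltInRuleName>_<parameter_name>` keyword names can be generated instead of typed by hand. The following is a minimal sketch in plain Python. The helper name `profiler_report_kwargs` and the nested-dictionary input format are assumptions made for illustration and are not part of the SageMaker Python SDK.

```python
def profiler_report_kwargs(overrides):
    """Flatten {rule_name: {parameter_name: value}} into
    {"<rule_name>_<parameter_name>": value} keyword arguments."""
    kwargs = {}
    for rule_name, params in overrides.items():
        for param_name, value in params.items():
            kwargs[f"{rule_name}_{param_name}"] = value
    return kwargs


# Hypothetical overrides, mirroring the thresholds used in the example above.
overrides = {
    "CPUBottleneck": {"cpu_threshold": 90},
    "IOBottleneck": {"threshold": 90},
}
report_kwargs = profiler_report_kwargs(overrides)
print(report_kwargs)
```

The resulting dictionary could then be unpacked into the rule configuration, for example `rule_configs.ProfilerReport(**report_kwargs)`.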

To explore what's in the profiler report, see SageMaker Debugger Profiling Report. Because this rule activates all of the profiling rules, you can also check the rule analysis status using the SageMaker Debugger UI in SageMaker Studio Experiments.

Parameter Descriptions for the ProfilerReport Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

<BuiltInRuleName>_<parameter_name>

Customizable parameter to adjust thresholds of other built-in monitoring and profiling rules.

Optional

Default value: None

BatchSize

The BatchSize rule helps detect whether the GPU is underutilized due to a small batch size. To detect this issue, the rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If utilization on CPU, GPU, and GPU memory is low on average, it may indicate that the training job can either run on a smaller instance type or run with a bigger batch size. Note that increasing the batch size can lead to processing or data loading bottlenecks, because more data preprocessing time is required in each iteration. This analysis does not work for frameworks that heavily overallocate memory.

Parameter Descriptions for the BatchSize Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

cpu_threshold_p95

Defines the threshold for 95th quantile of CPU utilization in percentage.

Optional

Valid values: Integer

Default value: 70 (in percentage)

gpu_threshold_p95

Defines the threshold for 95th quantile of GPU utilization in percentage.

Optional

Valid values: Integer

Default value: 70 (in percentage)

gpu_memory_threshold_p95

Defines the threshold for 95th quantile of GPU memory utilization in percentage.

Optional

Valid values: Integer

Default value: 70 (in percentage)

patience

Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually involve a high volume of data processing, so use this parameter to keep the rule from being invoked too early.

Optional

Valid values: Integer

Default value: 100

window

Window size for computing quantiles.

Optional

Valid values: Integer

Default value: 500

scan_interval_us

Time interval at which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)
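The way the BatchSize parameters interact can be illustrated with a short sketch. This is not the Debugger implementation. It assumes that each metric is a list of utilization percentages, that the first `patience` data points are skipped, that only the latest `window` data points are evaluated, and that the rule fires when the 95th quantile of all three metrics is below its threshold. The function names and the nearest-rank quantile method are illustrative choices.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def batch_size_rule_fires(cpu, gpu, gpu_memory,
                          cpu_threshold_p95=70, gpu_threshold_p95=70,
                          gpu_memory_threshold_p95=70,
                          patience=100, window=500):
    """Return True if CPU, GPU, and GPU memory utilization are all low."""
    checks = [
        (cpu, cpu_threshold_p95),
        (gpu, gpu_threshold_p95),
        (gpu_memory, gpu_memory_threshold_p95),
    ]
    for series, threshold in checks:
        # Skip the first `patience` data points, then keep the latest window.
        recent = series[patience:][-window:]
        if not recent:
            return False  # not enough data to evaluate yet
        if percentile(recent, 95) >= threshold:
            return False
    return True


low = [20] * 300   # hypothetical utilization samples, in percent
busy = [95] * 300
print(batch_size_rule_fires(low, low, low))    # all three metrics are low
print(batch_size_rule_fires(low, busy, low))   # the GPU is busy
```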

CPUBottleneck

The CPUBottleneck rule helps detect whether the GPU is underutilized due to CPU bottlenecks. The rule returns True if the number of CPU bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the CPUBottleneck Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

threshold

Defines the threshold for the proportion of bottlenecked time to the total training time. If the proportion exceeds the percentage specified by the threshold parameter, the rule status switches to True.

Optional

Valid values: Integer

Default value: 50 (in percentage)

gpu_threshold

A threshold that defines low GPU utilization.

Optional

Valid values: Integer

Default value: 10 (in percentage)

cpu_threshold

A threshold that defines high CPU utilization.

Optional

Valid values: Integer

Default value: 90 (in percentage)

patience

Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually involve a high volume of data processing, so use this parameter to keep the rule from being invoked too early.

Optional

Valid values: Integer

Default value: 100

scan_interval_us

Time interval with which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)
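The relationship between `cpu_threshold`, `gpu_threshold`, and `threshold` can be sketched as follows. This is a simplified illustration under stated assumptions, not the Debugger implementation: a data point counts as bottlenecked when CPU utilization is above `cpu_threshold` while GPU utilization is below `gpu_threshold`, and the share of bottlenecked data points stands in for the proportion of bottlenecked time. The function name is hypothetical.

```python
def cpu_bottleneck_rule_fires(cpu, gpu, threshold=50, gpu_threshold=10,
                              cpu_threshold=90, patience=100):
    """Return True if the share of bottlenecked data points exceeds `threshold`."""
    # Pair CPU and GPU samples taken at the same time and skip the warm-up.
    pairs = list(zip(cpu, gpu))[patience:]
    if not pairs:
        return False  # not enough data to evaluate yet
    bottlenecked = sum(
        1 for c, g in pairs if c > cpu_threshold and g < gpu_threshold
    )
    return 100 * bottlenecked / len(pairs) > threshold


# Hypothetical samples: the CPU is saturated, and the GPU is idle 60% of the time.
cpu = [95] * 10
gpu = [5] * 6 + [50] * 4
print(cpu_bottleneck_rule_fires(cpu, gpu, patience=0))                # 60% > 50%
print(cpu_bottleneck_rule_fires(cpu, gpu, threshold=70, patience=0))  # 60% <= 70%
```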

GPUMemoryIncrease

The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.

Parameter Descriptions for the GPUMemoryIncrease Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

increase

Defines the threshold for absolute memory increase.

Optional

Valid values: Integer

Default value: 10 (in percentage)

patience

Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually involve a high volume of data processing, so use this parameter to keep the rule from being invoked too early.

Optional

Valid values: Integer

Default value: 100

window

Window size for computing quantiles.

Optional

Valid values: Integer

Default value: 500

scan_interval_us

Time interval at which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

IOBottleneck

The IOBottleneck rule helps detect whether the GPU is underutilized due to data I/O bottlenecks. The rule returns True if the number of I/O bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the IOBottleneck Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

threshold

Defines the threshold at which the rule returns True.

Optional

Valid values: Integer

Default value: 50 (in percentage)

gpu_threshold

A threshold that defines when GPU is considered underutilized.

Optional

Valid values: Integer

Default value: 70 (in percentage)

io_threshold

A threshold that defines high I/O wait time.

Optional

Valid values: Integer

Default value: 50 (in percentage)

patience

Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually involve a high volume of data processing, so use this parameter to keep the rule from being invoked too early.

Optional

Valid values: Integer

Default value: 1000

scan_interval_us

Time interval at which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

LoadBalancing

The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.

Parameter Descriptions for the LoadBalancing Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

threshold

Defines the workload proportion.

Optional

Valid values: Float

Default value: 0.5 (unitless proportion)

patience

Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually involve a high volume of data processing, so use this parameter to keep the rule from being invoked too early.

Optional

Valid values: Integer

Default value: 10

scan_interval_us

Time interval at which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

LowGPUUtilization

The LowGPUUtilization rule helps detect whether GPU utilization is low or fluctuates. This is checked for each GPU on each worker. The rule returns True if the 95th quantile is below threshold_p95, which indicates underutilization. The rule also returns True if the 95th quantile is above threshold_p95 and the 5th quantile is below threshold_p5, which indicates fluctuations.
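The two conditions described above can be sketched for a single GPU as follows. This is an illustration only, not the Debugger implementation. It assumes utilization samples in percent, uses a nearest-rank quantile, and leaves out the `patience` and `window` handling. The function names and the returned labels are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def classify_gpu_utilization(samples, threshold_p95=70, threshold_p5=10):
    """Return "underutilized", "fluctuating", or None for one GPU."""
    p95 = percentile(samples, 95)
    p5 = percentile(samples, 5)
    if p95 < threshold_p95:
        # Even the busiest moments stay below the threshold.
        return "underutilized"
    if p95 > threshold_p95 and p5 < threshold_p5:
        # Peaks are high, but the GPU also sits nearly idle at times.
        return "fluctuating"
    return None


# Hypothetical utilization traces, in percent.
print(classify_gpu_utilization([20] * 100))             # steadily low
print(classify_gpu_utilization([5] * 10 + [90] * 90))   # high peaks, idle gaps
print(classify_gpu_utilization([80] * 100))             # steadily high
```

In this sketch, a result other than `None` corresponds to the rule returning True for that GPU.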

Parameter Descriptions for the LowGPUUtilization Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

threshold_p95

A threshold for 95th quantile below which GPU is considered to be underutilized.

Optional

Valid values: Integer

Default value: 70 (in percentage)

threshold_p5

A threshold for the 5th quantile of GPU utilization, used to detect fluctuations.

Optional

Valid values: Integer

Default value: 10 (in percentage)

patience

Defines the number of data points to skip before the rule starts evaluation. The first several steps of a training job usually involve a high volume of data processing, so use this parameter to keep the rule from being invoked too early.

Optional

Valid values: Integer

Default value: 1000

window

Window size for computing quantiles.

Optional

Valid values: Integer

Default value: 500

scan_interval_us

Time interval at which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

OverallSystemUsage

The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only aggregates values per node and computes their percentiles.

Parameter Descriptions for the OverallSystemUsage Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

scan_interval_us

Time interval to scan timeline files.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

MaxInitializationTime

The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule waits until the first step is available.

Parameter Descriptions for the MaxInitializationTime Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

threshold

Defines the threshold in minutes to wait for the first step to become available.

Optional

Valid values: Integer

Default value: 20 (in minutes)

scan_interval_us

Time interval with which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

OverallFrameworkMetrics

The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward and backward pass, and data loading.

Parameter Descriptions for the OverallFrameworkMetrics Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

scan_interval_us

Time interval to scan timeline files.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)

StepOutlier

The StepOutlier rule helps detect outliers in step durations. The rule returns True if there are outliers with step durations larger than stddev standard deviations of all step durations in a time range.

Parameter Descriptions for the StepOutlier Rule

Parameter Name Description
base_trial

The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.

Required

Valid values: String

stddev

Defines a factor by which to multiply the standard deviation. For example, with the default value, the rule is invoked when a step duration is larger or smaller than 5 times the standard deviation.

Optional

Valid values: Integer

Default value: 5

mode

The mode under which steps were saved and on which the rule should run.

Optional

Default value: The rule runs on steps from the EVAL and TRAIN phases

n_outliers

The number of outliers to ignore before the rule returns True.

Optional

Valid values: Integer

Default value: 10

scan_interval_us

Time interval with which timeline files are scanned.

Optional

Valid values: Integer

Default value: 60000000 (in microseconds)
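The interaction between `stddev` and `n_outliers` can be sketched in plain Python. This is not the Debugger implementation. It assumes step durations are given as a list of numbers, treats a step as an outlier when its duration exceeds the mean by more than `stddev` population standard deviations (the one-sided reading of the rule summary), and fires once more than `n_outliers` such steps are found. The function name is hypothetical.

```python
import statistics


def step_outlier_rule_fires(durations, stddev=5, n_outliers=10):
    """Return True if more than `n_outliers` steps are unusually long."""
    if len(durations) < 2:
        return False  # not enough steps to estimate a spread
    mean = statistics.mean(durations)
    sigma = statistics.pstdev(durations)
    # Assumption: only unusually long steps count as outliers.
    outliers = [d for d in durations if d > mean + stddev * sigma]
    return len(outliers) > n_outliers


# Hypothetical step durations in seconds: 1000 normal steps and 2 stalls.
durations = [1.0] * 1000 + [100.0] * 2
print(step_outlier_rule_fires(durations, n_outliers=1))   # 2 outliers > 1
print(step_outlier_rule_fires(durations))                 # 2 outliers <= 10
```

With a factor as large as 5, only long runs can contain outliers at all, because a single value cannot lie more than the square root of (n - 1) population standard deviations from the mean of n values.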