Use built-in profiler rules managed by Amazon SageMaker Debugger

The Amazon SageMaker Debugger built-in profiler rules analyze system metrics and framework operations collected during the training of a model. Debugger offers the ProfilerRule API operation that helps configure the rules to monitor training compute resources and operations and to detect anomalies. For example, the profiling rules can help you detect whether there are computational problems such as CPU bottlenecks, excessive I/O wait time, imbalanced workload across GPU workers, and compute resource underutilization. To see a full list of available built-in profiling rules, see List of Debugger built-in profiler rules. The following topics show how to use the Debugger built-in rules with default parameter settings and custom parameter values.

Note

The built-in rules are provided through Amazon SageMaker processing containers and fully managed by SageMaker Debugger at no additional cost. For more information about billing, see the Amazon SageMaker AI Pricing page.

Use SageMaker Debugger built-in profiler rules with their default parameter settings

To add SageMaker Debugger built-in rules to your estimator, configure a rules list object. The following example code shows the basic structure for listing the SageMaker Debugger built-in rules.

    from sagemaker.debugger import Rule, ProfilerRule, rule_configs

    rules=[
        ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_1()),
        ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_2()),
        ...
        ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_n()),
        ... # You can also append more debugging rules in the Rule.sagemaker(rule_configs.*()) format.
    ]

    estimator=Estimator(
        ...
        rules=rules
    )

For a complete list of available built-in rules, see List of Debugger built-in profiler rules.

To use the profiling rules and inspect the computational performance and progress of your training job, add the ProfilerReport rule of SageMaker Debugger. This rule activates all built-in rules under the Debugger ProfilerRule family and also generates an aggregated profiling report. For more information, see Profiling Report Generated Using SageMaker Debugger. You can use the following code to add the profiling report rule to your training estimator.

    # ProfilerRule must be imported because the rules list below uses it.
    from sagemaker.debugger import ProfilerRule, rule_configs

    rules=[
        ProfilerRule.sagemaker(rule_configs.ProfilerReport())
    ]

When you start the training job with the ProfilerReport rule, Debugger collects resource utilization data every 500 milliseconds. Debugger analyzes the resource utilization to identify whether your model has bottleneck problems. If the rules detect training anomalies, the rule evaluation status changes to IssuesFound. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see Action on Amazon SageMaker Debugger rules.
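The status check described above can be sketched in plain Python. This is a minimal sketch, not an official utility: it assumes the response shape of the DescribeTrainingJob API, in which profiler rule results appear under ProfilerRuleEvaluationStatuses with RuleConfigurationName and RuleEvaluationStatus fields, and that the status string is spelled IssuesFound. The helper name rules_with_issues and the sample response are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List


def rules_with_issues(response: Dict[str, Any]) -> List[str]:
    """Return the names of profiler rules whose status is IssuesFound.

    `response` is assumed to look like a DescribeTrainingJob response.
    """
    statuses = response.get("ProfilerRuleEvaluationStatuses", [])
    return [
        status["RuleConfigurationName"]
        for status in statuses
        if status.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Hypothetical, hand-written response fragment (not real job output).
sample_response = {
    "TrainingJobName": "example-job",
    "ProfilerRuleEvaluationStatuses": [
        {
            "RuleConfigurationName": "ProfilerReport",
            "RuleEvaluationStatus": "IssuesFound",
        },
        {
            "RuleConfigurationName": "CPUBottleneck",
            "RuleEvaluationStatus": "NoIssuesFound",
        },
    ],
}

flagged = rules_with_issues(sample_response)
print(flagged)
```

In a real workflow, the dictionary would typically come from a call such as boto3.client("sagemaker").describe_training_job(TrainingJobName=...), and the returned rule names could be passed on to whatever notification or stop-job logic you set up.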

Use Debugger built-in profiler rules with custom parameter values

If you want to adjust the built-in rule parameter values and customize the tensor collection regex, configure the base_config and rule_parameters parameters for the ProfilerRule.sagemaker and Rule.sagemaker class methods. For the Rule.sagemaker class method, you can also customize tensor collections through the collections_to_save parameter. For instructions on how to use the CollectionConfig class, see Configure tensor collections using the CollectionConfig API.

Use the following configuration template for built-in rules to customize parameter values. By changing the rule parameters, you can adjust how sensitive each rule is, that is, how readily it is triggered.

  • The base_config argument is where you call the built-in rule methods.

  • The rule_parameters argument is to adjust the default key values of the built-in rules listed in List of Debugger built-in profiler rules.

For more information about the Debugger rule class, methods, and parameters, see SageMaker AI Debugger Rule class in the Amazon SageMaker Python SDK.

    from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

    rules=[
        ProfilerRule.sagemaker(
            base_config=rule_configs.BuiltInProfilerRuleName(),
            rule_parameters={
                "key": "value"
            }
        )
    ]

The parameter descriptions and value customization examples are provided for each rule at List of Debugger built-in profiler rules.

For a low-level JSON configuration of the Debugger built-in rules using the CreateTrainingJob API, see Configure Debugger using SageMaker API.
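As a rough illustration of what such a low-level configuration might look like, the sketch below builds one ProfilerRuleConfigurations entry as a Python dictionary and prints it as JSON. The field names (RuleConfigurationName, RuleEvaluatorImage, RuleParameters) and the rule_to_invoke key reflect the CreateTrainingJob request format as understood here and should be checked against the API reference. The helper name profiler_rule_config is hypothetical, the image URI is a placeholder rather than a real value, and "key": "value" is the same placeholder used in the template above.

```python
import json
from typing import Any, Dict, Optional


def profiler_rule_config(
    rule_name: str,
    evaluator_image: str,
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one profiler rule entry (hypothetical helper, for illustration)."""
    # Assumption: the built-in rule to run is selected with "rule_to_invoke".
    rule_parameters = {"rule_to_invoke": rule_name}
    # Rule parameter values are sent as strings, so convert them here.
    for key, value in (parameters or {}).items():
        rule_parameters[key] = str(value)
    return {
        "RuleConfigurationName": rule_name,
        "RuleEvaluatorImage": evaluator_image,
        "RuleParameters": rule_parameters,
    }


config = profiler_rule_config(
    "ProfilerReport",
    "<debugger-rules-image-uri-for-your-region>",  # placeholder, not a real URI
    {"key": "value"},  # placeholder parameter, as in the template above
)
request_fragment = {"ProfilerRuleConfigurations": [config]}
print(json.dumps(request_fragment, indent=2))
```

The printed fragment is only one part of a CreateTrainingJob request; the remaining required fields, and the correct rule evaluator image for your Region, are covered in Configure Debugger using SageMaker API.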