

# Analyze data using the Debugger Python client library
<a name="debugger-analyze-data"></a>

While your training job is running or after it has completed, you can access the training data collected by Debugger using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and the [SMDebug client library](https://github.com/awslabs/sagemaker-debugger/). The Debugger Python client library provides analysis and visualization tools that enable you to drill down into your training job data.

**To install the library and use its analysis tools (in a JupyterLab notebook or an IPython kernel)**

```
! pip install -U smdebug
```

The following topics walk you through how to use the Debugger Python tools to visualize and analyze the training data collected by Debugger.

**Analyze system and framework metrics**
+ [Access the profile data](debugger-analyze-data-profiling.md)
+ [Plot the system metrics and framework metrics data](debugger-access-data-profiling-default-plot.md)
+ [Access the profiling data using the pandas data parsing tool](debugger-access-data-profiling-pandas-frame.md)
+ [Access the Python profiling stats data](debugger-access-data-python-profiling.md)
+ [Merge timelines of multiple profile trace files](debugger-merge-timeline.md)
+ [Profiling data loaders](debugger-data-loading-time.md)

# Access the profile data
<a name="debugger-analyze-data-profiling"></a>

The SMDebug `TrainingJob` class reads data from the S3 bucket where the system and framework metrics are saved. 

**To set up a `TrainingJob` object and retrieve profiling event files of a training job**

```
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(training_job_name, region)
```

**Tip**  
You need to specify the `training_job_name` and `region` parameters to connect to a training job. There are two ways to specify the training job information:   
Use the SageMaker Python SDK while the estimator is still attached to the training job.  

  ```
  import sagemaker
  training_job_name=estimator.latest_training_job.job_name
  region=sagemaker.Session().boto_region_name
  ```
Pass strings directly.  

  ```
  training_job_name="your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
  region="us-west-2"
  ```

**Note**  
By default, SageMaker Debugger collects system metrics to monitor hardware resource utilization and system bottlenecks. If you run the following functions, you might receive error messages about the unavailability of framework metrics. To retrieve framework profiling data and gain insights into framework operations, you must enable framework profiling.  
If you use SageMaker Python SDK to manipulate your training job request, pass the `framework_profile_params` to the `profiler_config` argument of your estimator. To learn more, see [Configure SageMaker Debugger Framework Profiling](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-framework-profiling.html).
If you use Studio Classic, turn on profiling using the **Profiling** toggle button in the Debugger insights dashboard. To learn more, see [SageMaker Debugger Insights Dashboard Controller](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio-insights-controllers.html).

**To retrieve a description of the training job and the S3 bucket URI where the metric data are saved**

```
tj.describe_training_job()
tj.get_config_and_profiler_s3_output_path()
```

**To check if the system and framework metrics are available from the S3 URI**

```
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
```

**To create system and framework reader objects after the metric data become available**

```
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
```

**To refresh and retrieve the latest training event files**

The reader objects provide the `refresh_event_file_list()` method to retrieve the latest training event files.

```
system_metrics_reader.refresh_event_file_list()
framework_metrics_reader.refresh_event_file_list()
```

# Plot the system metrics and framework metrics data
<a name="debugger-access-data-profiling-default-plot"></a>

You can use the system and framework metrics reader objects with the following visualization classes to plot timeline graphs and histograms.

**Note**  
To narrow down the metrics visualized by the following plot methods, specify the `select_dimensions` and `select_events` parameters. For example, if you specify `select_dimensions=["GPU"]`, the plot methods filter in only the metrics whose names include the "GPU" keyword. If you specify `select_events=["total"]`, the plot methods filter in only the metrics whose names end with the "total" event tag. When you supply keyword strings to these parameters, the visualization classes return charts with the filtered metrics.
+ The `MetricsHistogram` class

  ```
  from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
  
  metrics_histogram = MetricsHistogram(system_metrics_reader)
  metrics_histogram.plot(
      starttime=0, 
      endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), 
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"]                  # optional
  )
  ```
+ The `StepTimelineChart` class

  ```
  from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart
  
  view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)
  ```
+ The `StepHistogram` class

  ```
  from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
  
  step_histogram = StepHistogram(framework_metrics_reader)
  step_histogram.plot(
      starttime=step_histogram.last_timestamp - 5 * 1000 * 1000, 
      endtime=step_histogram.last_timestamp, 
      show_workers=True
  )
  ```
+ The `TimelineCharts` class

  ```
  from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
  
  view_timeline_charts = TimelineCharts(
      system_metrics_reader, 
      framework_metrics_reader,
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"]                  # optional 
  )
  
  view_timeline_charts.plot_detailed_profiler_data([700,710])
  ```
+ The `Heatmap` class

  ```
  from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap
  
  view_heatmap = Heatmap(
      system_metrics_reader,
      framework_metrics_reader,
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"],                 # optional
      plot_height=450
  )
  ```

# Access the profiling data using the pandas data parsing tool
<a name="debugger-access-data-profiling-pandas-frame"></a>

The following `PandasFrame` class provides tools to convert the collected profiling data to a pandas data frame. 

```
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame
```

The `PandasFrame` class takes the `tj` object's S3 bucket output path, and its methods `get_all_system_metrics()` and `get_all_framework_metrics()` return the system metrics and framework metrics as pandas data frames.

```
pf = PandasFrame(tj.profiler_s3_output_path)
system_metrics_df = pf.get_all_system_metrics()
framework_metrics_df = pf.get_all_framework_metrics(
    selected_framework_metrics=[
        'Step:ModeKeys.TRAIN', 
        'Step:ModeKeys.GLOBAL'
    ]
)
```
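Once the metrics are in a data frame, standard pandas operations apply. The column names in this sketch are illustrative assumptions (the actual columns depend on your SMDebug version), so it uses a mock frame to show the idea of aggregating metric values by dimension:

```python
import pandas as pd

# Mock rows shaped loosely like Debugger system metrics output;
# real frames include timestamps, dimensions, and metric values.
system_metrics_df = pd.DataFrame({
    "timestamp_us": [0, 1_000_000, 2_000_000],
    "dimension": ["CPUUtilization", "CPUUtilization", "GPUUtilization"],
    "value": [35.0, 45.0, 80.0],
})

# Average utilization per metric dimension.
mean_by_dimension = system_metrics_df.groupby("dimension")["value"].mean()
print(mean_by_dimension)
```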

# Access the Python profiling stats data
<a name="debugger-access-data-python-profiling"></a>

Python profiling provides framework metrics on the Python functions and operators in your training scripts and in the SageMaker AI deep learning frameworks. 

<a name="debugger-access-data-python-profiling-modes"></a>**Training Modes and Phases for Python Profiling**

Debugger provides tools to set modes and phases so that you can profile specific intervals during training and partition the statistics for each interval. 

For training modes, use the following `PythonProfileModes` class:

```
from smdebug.profiler.python_profile_utils import PythonProfileModes
```

This class provides the following options:
+ `PythonProfileModes.TRAIN` – Use to profile the target steps in the training phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.EVAL` – Use to profile the target steps in the evaluation phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.PREDICT` – Use to profile the target steps in the prediction phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.GLOBAL` – Use to profile the target steps in the global phase, which includes the previous three phases. This mode option is available only for PyTorch.
+ `PythonProfileModes.PRE_STEP_ZERO` – Use to profile the target steps in the initialization stage, before the first training step of the first epoch starts. This phase includes the initial job submission, uploading the training scripts to EC2 instances, preparing the EC2 instances, and downloading input data. This mode option is available for both TensorFlow and PyTorch.
+ `PythonProfileModes.POST_HOOK_CLOSE` – Use to profile the target steps in the finalization stage, after the training job has finished and the Debugger hook is closed. This phase includes profiling data while the training jobs are finalized and completed. This mode option is available for both TensorFlow and PyTorch.

<a name="debugger-access-data-python-profiling-phases"></a>For training phases, use the following `StepPhase` class:

```
from smdebug.profiler.analysis.utils.python_profile_analysis_utils import StepPhase
```

This class provides the following options:
+ `StepPhase.START` – Use to specify the start point of the initialization phase.
+ `StepPhase.STEP_START` – Use to specify the start step of the training phase.
+ `StepPhase.FORWARD_PASS_END` – Use to specify the steps where the forward pass ends. This option is available only for PyTorch.
+ `StepPhase.STEP_END` – Use to specify the end steps in the training phase. This option is available only for TensorFlow.
+ `StepPhase.END` – Use to specify the ending point of the finalization (post-hook-close) phase. If the callback hook is not closed, the finalization phase profiling does not occur.

**Python Profiling Analysis Tools**

Debugger supports Python profiling with two profiling tools:
+ cProfile – The standard Python profiler. cProfile collects framework metrics on the CPU time of every function called while profiling is enabled.
+ Pyinstrument – A low-overhead Python profiler that samples profiling events every millisecond.
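For context, this is what cProfile does on its own, outside Debugger, using only the Python standard library. The `busy` function is a made-up workload for illustration:

```python
import cProfile
import io
import pstats

def busy():
    # Made-up workload so the profiler has something to measure.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Summarize per-function stats sorted by cumulative CPU time,
# the same kind of data Debugger's cProfile option collects.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The printed table lists each called function with its call count and cumulative time, which is the raw material that the analysis classes below fetch and aggregate.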

To learn more about the Python profiling options and what's collected, see [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md).

The following methods of the `PythonProfileAnalysis`, `cProfileAnalysis`, and `PyinstrumentAnalysis` classes fetch and analyze the Python profiling data. Each function loads the latest data from the default S3 URI.

```
from smdebug.profiler.analysis.python_profile_analysis import PythonProfileAnalysis, cProfileAnalysis, PyinstrumentAnalysis
```

To set up a Python profiling object for analysis, use the `cProfileAnalysis` or `PyinstrumentAnalysis` class as shown in the following example code. The example sets up a `cProfileAnalysis` object; to use `PyinstrumentAnalysis` instead, replace the class name.

```
python_analysis = cProfileAnalysis(
    local_profile_dir=tf_python_stats_dir, 
    s3_path=tj.profiler_s3_output_path
)
```

The following methods are available for the `cProfileAnalysis` and `PyinstrumentAnalysis` classes to fetch the Python profiling stats data:
+ `python_analysis.fetch_python_profile_stats_by_time(start_time_since_epoch_in_secs, end_time_since_epoch_in_secs)` – Takes a start time and end time, and returns the function stats of the step stats whose start or end times overlap with the provided interval.
+ `python_analysis.fetch_python_profile_stats_by_step(start_step, end_step, mode, start_phase, end_phase)` – Takes a start step and end step, and returns the function stats of all step stats whose profiled `step` satisfies `start_step <= step < end_step`. 
  + `start_step` and `end_step` (str) – Specify the start step and end step for fetching the Python profiling stats data.
  + `mode` (str) – Specify the mode of the training job using the `PythonProfileModes` enumerator class. The default is `PythonProfileModes.TRAIN`. Available options are provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-modes) section.
  + `start_phase` (str) – Specify the start phase in the target step(s) using the `StepPhase` enumerator class. This parameter enables profiling between different phases of training. The default is `StepPhase.STEP_START`. Available options are provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-phases) section.
  + `end_phase` (str) – Specify the end phase in the target step(s) using the `StepPhase` enumerator class. Available options are the same as for the `start_phase` parameter. The default is `StepPhase.STEP_END`.
+ `python_analysis.fetch_profile_stats_between_modes(start_mode, end_mode)` – Fetches stats from the Python profiling between the start and end modes.
+ `python_analysis.fetch_pre_step_zero_profile_stats()` – Fetches the stats from the Python profiling until step 0.
+ `python_analysis.fetch_post_hook_close_profile_stats()` – Fetches stats from the Python profiling after the hook is closed.
+ `python_analysis.list_profile_stats()` – Returns a DataFrame of the Python profiling stats. Each row holds the metadata for each instance of profiling and the corresponding stats file (one per step).
+ `python_analysis.list_available_node_ids()` – Returns a list of the available node IDs for the Python profiling stats.

The following methods are specific to the `cProfileAnalysis` class:
+  `fetch_profile_stats_by_training_phase()` – Fetches and aggregates the Python profiling stats for every possible combination of start and end modes. For example, if training and validation phases run while detailed profiling is enabled, the combinations are `(PRE_STEP_ZERO, TRAIN)`, `(TRAIN, TRAIN)`, `(TRAIN, EVAL)`, `(EVAL, EVAL)`, and `(EVAL, POST_HOOK_CLOSE)`. All stats files within each of these combinations are aggregated.
+  `fetch_profile_stats_by_job_phase()` – Fetches and aggregates the Python profiling stats by job phase. The job phases are `initialization` (profiling until step 0), `training_loop` (training and validation), and `finalization` (profiling after the hook is closed).
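The half-open step range used by `fetch_python_profile_stats_by_step` (`start_step <= step < end_step`) can be sketched as a plain filter. The per-step records here are hypothetical stand-ins for the real stats objects, which carry far more detail:

```python
# Hypothetical per-step profiling records.
step_stats = [
    {"step": 0, "mode": "TRAIN"},
    {"step": 1, "mode": "TRAIN"},
    {"step": 2, "mode": "EVAL"},
    {"step": 3, "mode": "TRAIN"},
]

def stats_by_step(stats, start_step, end_step, mode="TRAIN"):
    """Half-open range: keep records with start_step <= step < end_step
    that were profiled in the requested mode."""
    return [s for s in stats
            if start_step <= s["step"] < end_step and s["mode"] == mode]

print(stats_by_step(step_stats, 0, 3))  # steps 0 and 1 in TRAIN mode
```

Note that `end_step` itself is excluded, so fetching steps 0 through 3 returns only the TRAIN records for steps 0 and 1.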

# Merge timelines of multiple profile trace files
<a name="debugger-merge-timeline"></a>

The SMDebug client library provides profiling analysis and visualization tools for merging the timelines of system metrics, framework metrics, and Python profiling data collected by Debugger. 

**Tip**  
Before proceeding, you need to set up a `TrainingJob` object that is used throughout the examples on this page. For more information about setting up a `TrainingJob` object, see [Access the profile data](debugger-analyze-data-profiling.md).

The `MergedTimeline` class provides tools to integrate and correlate different profiling information in a single timeline. After Debugger captures profiling data and annotations from different phases of a training job, JSON files of trace events are saved in a default `tracefolder` directory.
+ For annotations in the Python layers, the trace files are saved in `*pythontimeline.json`. 
+ For annotations in the TensorFlow C++ layers, the trace files are saved in `*model_timeline.json`. 
+ The TensorFlow profiler saves events in a `*trace.json.gz` file. 

**Tip**  
If you want to list all of the JSON trace files, use the following AWS CLI command:  

```
! aws s3 ls {tj.profiler_s3_output_path} --recursive | grep '\.json$'
```

As shown in the following animated screenshot, aligning the trace events captured from different profiling sources on a single plot provides an overview of all the events occurring in different phases of the training job.

![\[An example of merged timeline\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-merged-timeline.gif)


**Tip**  
To interact with the merged timeline on the tracing app using a keyboard, use the `W` key for zooming in, the `A` key for shifting to the left, the `S` key for zooming out, and the `D` key for shifting to the right.

The multiple event trace JSON files can be merged into one trace event JSON file using the following `MergedTimeline` API operation and class method from the `smdebug.profiler.analysis.utils.merge_timelines` module.

```
from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline

combined_timeline = MergedTimeline(path, file_suffix_filter, output_directory)
combined_timeline.merge_timeline(start, end, unit)
```

The `MergedTimeline` API operation accepts the following parameters:
+ `path` (str) – Specify a root folder (`/profiler-output`) that contains the system and framework profiling trace files. You can locate the `profiler-output` folder using the SageMaker AI estimator class method or the `TrainingJob` object, for example, `estimator.latest_job_profiler_artifacts_path()` or `tj.profiler_s3_output_path`.
+ `file_suffix_filter` (list) – Specify a list of file suffix filters for merging timelines. Available suffix filters are `["model_timeline.json", "pythontimeline.json", "trace.json.gz"]`. If this parameter is not specified, all of the trace files are merged by default.
+ `output_directory` (str) – Specify a path to save the merged timeline JSON file. The default is the directory specified for the `path` parameter.

The `merge_timeline()` class method accepts the following parameters to run the merging process:
+ `start` (int) – Specify start time (in microseconds and in Unix time format) or start step to merge timelines.
+ `end` (int) – Specify end time (in microseconds and in Unix time format) or end step to merge timelines.
+ `unit` (str) – Choose between `"time"` and `"step"`. The default is `"time"`.

Use the following example code to run the `merge_timeline()` method and download the merged JSON file. 
+ Merge timelines with the `"time"` unit option. The following example code merges all available trace files between Unix epoch time 0 and the current Unix time, which merges the timelines for the entire training duration.

  ```
  import time
  from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
  from smdebug.profiler.profiler_constants import CONVERT_TO_MICROSECS
  
  combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
  combined_timeline.merge_timeline(0, int(time.time() * CONVERT_TO_MICROSECS))
  ```
+ Merge timeline with the `"step"` unit option. The following example code merges all available timelines between step 3 and step 9.

  ```
  from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
  
  combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
  combined_timeline.merge_timeline(3, 9, unit="step")
  ```

Open the Chrome tracing app at `chrome://tracing` in a Chrome browser, and open the JSON file to explore and plot the merged timeline. 
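The merged file uses the Chrome Trace Event format. As a rough illustration, the following sketch writes a file with one made-up "complete" event (`ph="X"`) that `chrome://tracing` can open; the event name and timestamps are invented for the example:

```python
import json

# A minimal Chrome Trace Event format "complete" event:
# name, category, phase, timestamp (us), duration (us), process and thread IDs.
events = [
    {"name": "Step:ModeKeys.TRAIN", "cat": "framework", "ph": "X",
     "ts": 1_620_000_000_000_000, "dur": 250_000, "pid": 0, "tid": 0},
]

with open("merged_timeline_sketch.json", "w") as f:
    json.dump(events, f, indent=2)
```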

# Profiling data loaders
<a name="debugger-data-loading-time"></a>

In PyTorch, data loader iterators, such as `SingleProcessingDataLoaderIter` and `MultiProcessingDataLoaderIter`, are initialized at the beginning of every iteration over a dataset. During the initialization phase, PyTorch spins up worker processes depending on the configured number of workers, establishes a data queue to fetch data, and starts `pin_memory` threads.

To use the PyTorch data loader profiling analysis tool, import the following `PT_dataloader_analysis` class:

```
from smdebug.profiler.analysis.utils.pytorch_dataloader_analysis import PT_dataloader_analysis
```

Pass the profiling data that you retrieved as a pandas frame object in the [Access the profiling data using the pandas data parsing tool](debugger-access-data-profiling-pandas-frame.md) section:

```
pt_analysis = PT_dataloader_analysis(pf)
```

The following functions are available for the `pt_analysis` object:

+ `pt_analysis.analyze_dataloaderIter_initialization()`

  The analysis outputs the median and maximum duration of these initializations. If there are outliers (that is, durations greater than 2 × the median), the function prints the start and end times for those durations. You can use them to inspect the system metrics during those time intervals.

  The following list shows what analysis is available from this class method:
  + Which types of data loader iterators were initialized.
  + The number of workers per iterator.
  + Whether the iterator was initialized with or without `pin_memory`.
  + The number of times the iterators were initialized during training.
+ `pt_analysis.analyze_dataloaderWorkers()`

  The following list shows what analysis is available from this class method:
  + The number of worker processes that were spun off during the entire training. 
  + Median and maximum duration for the worker processes. 
  + Start and end time for the worker processes that are outliers. 
+ `pt_analysis.analyze_dataloader_getnext()`

  The following list shows what analysis is available from this class method:
  + The number of GetNext calls made during training. 
  + The median and maximum duration in microseconds for the GetNext calls. 
  + The start time, end time, duration, and worker ID for the outlier GetNext call durations. 
+ `pt_analysis.analyze_batchtime(start_timestamp, end_timestamp, select_events=[".*"], select_dimensions=[".*"])`

  Debugger collects the start and end times of all the GetNext calls. You can find the amount of time the training script spends on one batch of data. Within the specified time window, you can identify the calls that are not directly contributing to the training. These calls can come from operations such as computing the accuracy, adding the losses for debugging or logging purposes, and printing debugging information. Operations like these can be compute intensive or time consuming. You can identify such operations by correlating the Python profiler output, system metrics, and framework metrics.

  The following list shows what analysis is available from this class method:
  + Profile time spent on each data batch, `BatchTime_in_seconds`, by finding the difference between start times of current and subsequent GetNext calls. 
  + Find the outliers in `BatchTime_in_seconds` and start and end time for those outliers.
  + Obtain the system and framework metrics during those `BatchTime_in_seconds` timestamps. This indicates where the time was spent.
+ `pt_analysis.plot_the_window()`

  Plots a timeline chart between a start timestamp and an end timestamp.
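The outlier rule used by these analyses, a duration greater than twice the median, reduces to a simple check. The following standalone sketch uses made-up durations to show the idea:

```python
from statistics import median

# Hypothetical GetNext call durations in microseconds.
durations_us = [120, 130, 125, 118, 900, 122]

med = median(durations_us)
# Flag any duration more than twice the median as an outlier.
outliers = [d for d in durations_us if d > 2 * med]
print(med, outliers)  # 123.5 [900]
```

For an outlier such as the 900-microsecond call above, the analysis methods also report its start and end times so you can cross-reference the system and framework metrics in that window.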