Appendix
Use the following information to monitor and analyze training results.
Monitor training results
Monitoring and analyzing training results is essential for developers to assess
convergence and troubleshoot issues. SageMaker HyperPod recipes offer Tensorboard
integration to analyze training behavior. To address the challenges of profiling
large distributed training jobs, these recipes also incorporate VizTracer. VizTracer
is a low-overhead tool for tracing and visualizing Python code execution. For more
information about VizTracer, see the VizTracer documentation.
The following sections guide you through the process of implementing these features in your SageMaker HyperPod recipes.
Tensorboard
Tensorboard is a powerful tool for visualizing and analyzing the training process. To enable Tensorboard, modify your recipe by setting the following parameter:
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
After you enable the Tensorboard logger, the training logs are generated and stored in the experiment directory. The experiment directory is defined in exp_manager.exp_dir. To access and analyze these logs locally, use the following procedure:
To access and analyze logs
-
Download the Tensorboard experiment folder from your training environment to your local machine.
-
Open a terminal or command prompt on your local machine.
-
Navigate to the directory containing the downloaded experiment folder.
-
Launch Tensorboard with the following command.
tensorboard --port=<port> --bind_all --logdir experiment
-
Open your web browser and visit http://localhost:<port>, using the same port that you passed to the command (for example, http://localhost:8008).
You can now see the status and visualizations of your training jobs in the
Tensorboard interface, which helps you monitor the training process and gain
insight into the behavior and performance of your models. For more information
about monitoring and analyzing training with Tensorboard, see the NVIDIA NeMo
Framework User Guide.
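The launch step in the procedure above can also be scripted. The following is a minimal sketch, not part of the recipes themselves: the helper names build_tensorboard_command and launch_tensorboard are hypothetical, the default port 8008 is taken from the example URL, and the tensorboard executable is assumed to be installed and on your PATH.

```python
import shlex
import subprocess
from pathlib import Path


def build_tensorboard_command(logdir, port=8008, bind_all=True):
    """Return the argument list for launching Tensorboard on a log directory."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    cmd = ["tensorboard", f"--port={port}", "--logdir", str(Path(logdir))]
    if bind_all:
        # --bind_all makes the server reachable from other hosts, not only localhost.
        cmd.append("--bind_all")
    return cmd


def launch_tensorboard(logdir, port=8008):
    """Start Tensorboard as a child process and return the process handle."""
    if not Path(logdir).is_dir():
        raise FileNotFoundError(f"experiment folder not found: {logdir}")
    cmd = build_tensorboard_command(logdir, port)
    print("Running:", shlex.join(cmd))
    return subprocess.Popen(cmd)
```

Calling launch_tensorboard("experiment") from the directory that contains the downloaded folder would start the server in the background. Call terminate() on the returned handle to stop it.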
VizTracer
To enable VizTracer, modify your recipe by setting the model.viztracer.enabled parameter to true. For example, you can update your Llama recipe to enable VizTracer by adding the following configuration:
model:
  viztracer:
    enabled: true
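If you build recipe overrides in a script, a dotted name such as model.viztracer.enabled maps onto nested sections of the recipe. The following is a minimal sketch that assumes the recipe has already been loaded into a plain Python dictionary. The helper with_override and the hidden_size entry are hypothetical and only illustrate the nesting.

```python
import copy


def with_override(recipe, dotted_key, value):
    """Return a copy of a nested recipe dict with dotted_key set to value."""
    updated = copy.deepcopy(recipe)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Create the section if it is missing or is not a mapping.
        child = node.get(part)
        if not isinstance(child, dict):
            child = {}
            node[part] = child
        node = child
    node[leaf] = value
    return updated


# Hypothetical recipe fragment; a real recipe contains many more settings.
recipe = {"model": {"hidden_size": 4096}}
enabled = with_override(recipe, "model.viztracer.enabled", True)
print(enabled)
```

The helper copies the recipe instead of modifying it in place, so the original settings stay available for a run without profiling.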
After the training has completed, your VizTracer profile is saved in the experiment folder as exp_dir/result.json. To analyze your profile, download it and open it with the vizviewer tool:
vizviewer --port <port> result.json
This command launches vizviewer on the port that you specify. If you omit the --port option, vizviewer uses port 9001. To view your profile, open http://localhost:<port> in your browser. After the viewer opens, you can begin analyzing the training. For more information about using VizTracer, see the VizTracer documentation.
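For a quick look at a profile without opening the viewer, you can also summarize it in a script. The following is a minimal sketch that assumes result.json follows the Chrome Trace Event format, with a top-level traceEvents list whose complete events ("ph" equal to "X") carry a name and a duration in microseconds in "dur". The function names are hypothetical.

```python
import json
from collections import defaultdict


def summarize_trace(trace, top=5):
    """Return the `top` event names ranked by total duration in microseconds."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event, which carries its duration in "dur".
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


def summarize_trace_file(path, top=5):
    """Load a trace file from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_trace(json.load(handle), top)
```

The totals are inclusive: time spent in nested calls is counted for both the caller and the callee, so treat the ranking as a starting point for inspection in the viewer, not as an exact cost breakdown.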
SageMaker AI JumpStart versus SageMaker HyperPod
While SageMaker AI JumpStart provides fine-tuning capabilities, the SageMaker HyperPod recipes provide the following:
-
Additional fine-grained control over the training loop
-
Recipe customization for your own models and data
-
Support for model parallelism
Use the SageMaker HyperPod recipes when you need access to the model's hyperparameters, multi-node training, and customization options for the training loop.
For more information about fine-tuning your models in SageMaker AI JumpStart, see Fine-tune publicly available foundation models with the JumpStartEstimator class.