SageMaker HyperPod recipe repository
Use the SageMaker HyperPod recipes
- main.py: The primary entry point for submitting a training job, either to a cluster or as a SageMaker training job.
- launcher_scripts: A collection of commonly used scripts for training various large language models (LLMs).
- recipes_collection: A collection of predefined LLM recipes provided by the developers. You can use these recipes with your own data to train LLMs tailored to your specific requirements.
You use the SageMaker HyperPod recipes to launch training or fine-tuning jobs. Regardless of the cluster you're using, the process of submitting the job is the same. For example, you can use the same script to submit a job to a Slurm or Kubernetes cluster. The launcher dispatches a training job based on three configuration files:
- General Configuration (config.yaml): Includes common settings such as the default parameters or environment variables used in the training job.
- Cluster Configuration (cluster): For training jobs using clusters only. If you're submitting a training job to a Kubernetes cluster, you might need to specify information such as volume, label, or restart policy. For Slurm clusters, you might need to specify the Slurm job name. All the parameters are related to the specific cluster that you're using.
- Recipe (recipes): Recipes contain the settings for your training job, such as the model type, sharding degree, or dataset paths. For example, you can specify Llama as your training model and train it using model or data parallelism techniques such as Fully Sharded Data Parallel (FSDP) across eight machines. You can also specify different checkpoint frequencies or paths for your training job.
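As a rough illustration of how three configuration layers can combine into one job definition, the following Python sketch merges a general configuration, a cluster configuration, and a recipe using plain dictionaries. It is not the launcher's actual implementation: the `deep_merge` and `build_job_config` helpers and every key and value shown (for example `restart_policy`, `shard_degree`, and `save_every_n_steps`) are hypothetical placeholders chosen to mirror the settings described above.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged recursively; neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical general configuration: defaults and environment variables.
general_config = {
    "env_vars": {"NCCL_DEBUG": "WARN"},
    "base_results_dir": "./results",
}

# Hypothetical cluster configuration for a Kubernetes cluster.
cluster_config = {
    "cluster_type": "k8s",
    "cluster": {"restart_policy": "OnFailure", "labels": {"team": "research"}},
}

# Hypothetical recipe: model, parallelism, and checkpoint settings.
recipe = {
    "run": {"name": "llama-example"},
    "trainer": {"num_nodes": 8},
    "model": {"model_type": "llama", "shard_degree": 8},
    "checkpoint": {"save_every_n_steps": 500},
}


def build_job_config(general: dict, cluster: dict, recipe: dict) -> dict:
    """Layer the cluster settings and the recipe on top of the general settings."""
    job = deep_merge(general, cluster)
    # Keep the recipe under its own key so its settings stay grouped together.
    return deep_merge(job, {"recipes": recipe})


job_config = build_job_config(general_config, cluster_config, recipe)
```

In this sketch, switching from Kubernetes to Slurm would mean replacing only `cluster_config`; the general settings and the recipe stay unchanged, which reflects the idea that the submission process is the same regardless of the cluster.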
After you've specified a recipe, you run the launcher script to launch an end-to-end training job on a cluster, based on these configurations, through the main.py entry point. Each recipe has accompanying shell scripts located in the launcher_scripts folder. These examples guide you through submitting and initiating training jobs. The following figure illustrates how a SageMaker HyperPod recipe launcher submits a training job to a cluster based on the preceding configurations. Currently, the SageMaker HyperPod recipe launcher is built on top of the NVIDIA NeMo Framework Launcher. For more information, see the NeMo Launcher Guide.