Trainium SageMaker training jobs pre-training tutorial
This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.
- Set up your environment
- Launch a training job
Before you begin, make sure you have the following prerequisites.
Prerequisites
Before you start setting up your environment, make sure you have:
- An Amazon FSx file system or Amazon S3 bucket where you can load the data and output the training artifacts.
- A Service Quota for the ml.trn1.32xlarge instance on Amazon SageMaker AI. To request a service quota increase for the ml.trn1.32xlarge instance, do the following:
  - Navigate to the AWS Service Quotas console.
  - Choose AWS services.
  - Select JupyterLab.
  - Specify one instance for ml.trn1.32xlarge.
- An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess and AmazonEC2FullAccess managed policies. These policies provide Amazon SageMaker AI with permissions to run the examples.
- Data in one of the following formats:
  - JSON
  - JSONGZ (Compressed JSON)
  - ARROW
- (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see User access tokens.
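If you do need the token, one way to make it available in your notebook environment is with the huggingface_hub library. The following is a minimal sketch; it assumes huggingface_hub is installed and that you keep the token in an HF_TOKEN environment variable, and your recipe or launcher may expect a different mechanism.

# Minimal sketch: authenticate with HuggingFace before training.
# Assumes the huggingface_hub package is installed and the token is stored in
# the HF_TOKEN environment variable; never hard-code real tokens in notebooks.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])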
Set up your environment for Trainium SageMaker training jobs
Before you run a SageMaker training job, use the aws configure command to configure your AWS credentials and preferred AWS Region. As an alternative, you can provide your credentials through environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see SageMaker AI Python SDK.
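For example, the following sketch supplies credentials through environment variables from Python and verifies them with an STS call. The placeholder values are assumptions; substitute your own credentials, or rely on aws configure or an attached IAM role instead.

# Sketch: supply AWS credentials through environment variables and verify them.
# The placeholder values below are assumptions; replace them with your own.
import os
import boto3

os.environ["AWS_ACCESS_KEY_ID"] = "<your_access_key_id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your_secret_access_key>"
os.environ["AWS_SESSION_TOKEN"] = "<your_session_token>"  # only needed for temporary credentials
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"            # example Region

# Confirm the credentials resolve to the expected identity.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])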
We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see SageMaker JupyterLab.
- (Optional) If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip running the following commands. Make sure you use Python 3.9 or later.

# set up a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate

# install dependencies after git clone.
git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
- Install the SageMaker AI Python SDK:

pip3 install --upgrade sagemaker
- If you are running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or greater (see the version-check sketch after this list).
  - Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK.
  - If you are using HyperPod recipes to launch with sm_jobs as the cluster type, you don't have to specify the transformers version.
- Container: The Neuron container is set automatically by the SageMaker AI Python SDK.
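Before launching, you can optionally confirm the package versions mentioned above. The following is a minimal sketch; it assumes both packages are installed in the Python environment you will launch from and that the packaging library (installed alongside pip) is available.

# Sketch: confirm the installed sagemaker and transformers versions.
# Assumes both packages are installed in the current environment.
from importlib.metadata import version
from packaging.version import Version

print("sagemaker:", version("sagemaker"))

transformers_version = version("transformers")
print("transformers:", transformers_version)
if Version(transformers_version) < Version("4.45.2"):
    print("transformers must be 4.45.2 or greater for Llama 3.2 multi-modal training")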
Launch the training job with a Jupyter Notebook
You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the SageMaker AI Python SDK.

- compiler_cache_url: The cache used to save the compiled artifacts, for example, an Amazon S3 URL.
import os

import boto3
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

output_path = "<s3_output_path>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>",
}

# Write TensorBoard logs from the container to the S3 output location.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output_path, "tensorboard"),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"],
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "your-inputs"}, wait=True)
The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the fit() method. Use the training_recipe parameter to specify the recipe you want to use for training.
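If you prefer not to block the notebook while the job runs, the following sketch launches the same job without waiting and streams its logs later. It assumes the estimator object from the preceding example.

# Sketch: launch the training job without blocking, then stream logs later.
# Assumes `estimator` was created as in the preceding example.
estimator.fit(inputs={"train": "your-inputs"}, wait=False)

job_name = estimator.latest_training_job.job_name
print("Started training job:", job_name)

# Stream the job's CloudWatch logs in this notebook when you are ready.
estimator.logs()

# Or re-attach to the job from a new session later:
# from sagemaker.pytorch import PyTorch
# attached_estimator = PyTorch.attach(job_name)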
Launch the training job with the recipes launcher
- Update ./recipes_collection/cluster/sm_jobs.yaml with the following:
  - compiler_cache_url: The URL used to save the artifacts. It can be an Amazon S3 URL.
sm_jobs_config:
  output_path: <s3_output_path>
  wait: True # Whether to wait for training job to finish
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard # Path to logs on the container
  inputs: # Inputs to call fit with. Set either s3 or file_system, not both.
    s3: # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs: # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    image_uri: <your_image_uri>
    enable_remote_debug: True
    py_version: py39
  recipe_overrides:
    model:
    exp_manager:
      exp_dir: <exp_dir>
    data:
      train_dir: /opt/ml/input/data/train
      val_dir: /opt/ml/input/data/val
- Update ./recipes_collection/config.yaml

defaults:
  - _self_
  - cluster: sm_jobs
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain

cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
instance_type: ml.trn1.32xlarge
base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
- Launch the job with main.py:

python3 main.py --config-path recipes_collection --config-name config
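After the launcher submits the job, you can check its status from Python with boto3. The following is a sketch; the describe call uses a placeholder name, so substitute the training job name reported by the launcher.

# Sketch: list recent training jobs and check their status.
import boto3

sm = boto3.client("sagemaker")
jobs = sm.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=5)
for job in jobs["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])

# Inspect one job in detail (placeholder name; use the job name reported by the launcher).
# detail = sm.describe_training_job(TrainingJobName="<your_training_job_name>")
# print(detail["TrainingJobStatus"], detail.get("FailureReason"))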
For more information about configuring SageMaker training jobs, see SageMaker training jobs pre-training tutorial (GPU).