Get started with distributed training in Amazon SageMaker AI
The following page gives information about the steps needed to get started with distributed training in Amazon SageMaker AI. If you’re already familiar with distributed training, choose one of the following options that matches your preferred strategy or framework to get started. If you want to learn about distributed training in general, see Distributed training concepts.
The SageMaker AI distributed training libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker AI, and improve training speed and throughput. The libraries offer both data parallel and model parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications, and extend SageMaker AI’s training capabilities with built-in options that require minimal code changes to your training scripts.
Before you get started
SageMaker Training supports distributed training on a single instance as well as multiple instances, so you can run training of any size at scale. We recommend that you use the framework estimator classes, such as PyTorch and TensorFlow, in the SageMaker Python SDK. When you launch a training job, the estimator runs the CreateTrainingJob API in the backend, finds the Region where your current session is running, and pulls one of the pre-built AWS deep learning containers prepackaged with a number of libraries including deep learning frameworks, distributed training frameworks, and the EFA driver. If you want to mount an FSx file system to the training instances, you need to pass your VPC subnet and security group ID to the estimator. Before running your distributed training job in SageMaker AI, read the following general guidance on the basic infrastructure setup.
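For example, the following is a minimal sketch (not a definitive setup) of passing a VPC subnet and security group to a PyTorch estimator so that the training instances can mount an FSx file system. The entry point, IAM role, framework versions, and the subnet and security group IDs are placeholders that you would replace with your own values.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                       # placeholder training script
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",   # placeholder IAM role
    framework_version="2.0.0",                                    # example framework version
    py_version="py310",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    subnets=["subnet-0123456789abcdef0"],                         # VPC subnet that can reach the FSx file system
    security_group_ids=["sg-0123456789abcdef0"],                  # security group allowing FSx traffic
)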
Availability zones and network backplane
When using multiple instances (also called nodes), it’s important to understand the network that connects the instances, how they read the training data, and how they share information between themselves. For example, when you run a distributed data-parallel training job, a number of factors, such as communication between the nodes of a compute cluster for running the AllReduce operation and data transfer between the nodes and data storage in Amazon Simple Storage Service or Amazon FSx for Lustre, play a crucial role in achieving optimal use of compute resources and faster training speed. To reduce communication overhead, make sure that you configure your instances, VPC subnet, and data storage in the same AWS Region and Availability Zone.
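As a hedged illustration, the following sketch shows one way to pass an FSx for Lustre file system, created in the same Region and Availability Zone as the training instances, as a training input channel. It assumes an estimator configured with the VPC subnet and security group as sketched earlier; the file system ID and directory path are placeholders.

from sagemaker.inputs import FileSystemInput

# FSx for Lustre file system in the same Region and Availability Zone as the
# training instances. IDs and paths are placeholders; for FSx for Lustre, the
# directory path typically begins with the file system's mount name.
fsx_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/fsx/train-data",
    file_system_access_mode="ro",
)

estimator.fit({"training": fsx_input})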
GPU instances with faster network and high-throughput storage
You can technically use any instances for distributed training. For cases where you need to run multi-node distributed training jobs for training large models, such as large language models (LLMs) and diffusion models, which require faster inter-node communication, we recommend using EFA-enabled GPU instances supported by SageMaker AI.
Use the SageMaker AI distributed data parallelism (SMDDP) library
The SMDDP library improves communication between nodes with implementations of AllReduce and AllGather collective communication operations that are optimized for AWS network infrastructure and Amazon SageMaker AI ML instance topology. You can use the SMDDP library as the backend of PyTorch-based distributed training packages such as PyTorch distributed data parallel (DDP). The following code example shows how to set up a PyTorch estimator for launching a distributed training job on two ml.p4d.24xlarge instances.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)
To learn how to prepare your training script and launch a distributed data-parallel training job on SageMaker AI, see Run distributed training with the SageMaker AI distributed data parallelism library.
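As a minimal sketch of the training-script side, assuming the SMDDP library is available in the training container (as it is in the supported AWS deep learning containers), importing the SMDDP PyTorch module registers smddp as a torch.distributed backend:

import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend with torch.distributed

# Initialize the process group with the SMDDP backend instead of "nccl"
dist.init_process_group(backend="smddp")

From there, wrapping the model with torch.nn.parallel.DistributedDataParallel proceeds as in standard PyTorch DDP; see the linked guide for the full, authoritative steps.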
Use the SageMaker AI model parallelism library (SMP)
SageMaker AI provides the SMP library and supports various distributed training techniques, such as sharded data parallelism, pipelining, tensor parallelism, optimizer state sharding, and more. To learn more about what the SMP library offers, see Core Features of the SageMaker Model Parallelism Library.
To use SageMaker AI's model parallelism library, configure the distribution parameter of the SageMaker AI framework estimators. Supported framework estimators are PyTorch and TensorFlow. The following code example shows a template for constructing a framework estimator for distributed training with the model parallelism library on two ml.p4d.24xlarge instances.
from sagemaker.framework import Framework

distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                ...  # enter parameter key-value pairs here
            }
        },
    },
    "mpi": {
        "enabled": True,
        ...  # enter parameter key-value pairs here
    }
}

estimator = Framework(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution=distribution
)
To learn how to adapt your training script, configure distribution parameters in the estimator class, and launch a distributed training job, see SageMaker AI's model parallelism library (see also Distributed Training APIs in the SageMaker Python SDK documentation).
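As a hedged, concrete illustration of the template above, using the PyTorch estimator and illustrative parameter values that you would tune for your own model and library version, the configuration might look like the following; consult the library documentation for the parameters that apply to the version you use.

from sagemaker.pytorch import PyTorch

distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 2,       # number of model partitions (illustrative value)
                "microbatches": 4,     # pipeline microbatches (illustrative value)
                "optimize": "speed",
                "ddp": True,           # combine model parallelism with data parallelism
            },
        },
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,       # illustrative; match the GPUs per instance you use
    },
}

estimator = PyTorch(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution=distribution,
)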
Use open source distributed training frameworks
SageMaker AI also supports the following options to operate mpirun and torchrun in the backend.
- To use PyTorch DistributedDataParallel (DDP) in SageMaker AI with the mpirun backend, add distribution={"pytorchddp": {"enabled": True}} to your PyTorch estimator. For more information, see also PyTorch Distributed Training and the SageMaker AI PyTorch Estimator's distribution argument in the SageMaker Python SDK documentation.

  Note: This option is available for PyTorch 1.12.0 and later.

  from sagemaker.pytorch import PyTorch

  estimator = PyTorch(
      ...,
      instance_count=2,
      instance_type="ml.p4d.24xlarge",
      distribution={"pytorchddp": {"enabled": True}}  # runs mpirun in the backend
  )
- SageMaker AI supports the PyTorch torchrun launcher for distributed training on GPU-based Amazon EC2 instances, such as P3 and P4, as well as Trn1 powered by the AWS Trainium device. To use PyTorch DistributedDataParallel (DDP) in SageMaker AI with the torchrun backend, add distribution={"torch_distributed": {"enabled": True}} to the PyTorch estimator.

  Note: This option is available for PyTorch 1.13.0 and later.

  The following code snippet shows an example of constructing a SageMaker AI PyTorch estimator to run distributed training on two ml.p4d.24xlarge instances with the torch_distributed distribution option.

  from sagemaker.pytorch import PyTorch

  estimator = PyTorch(
      ...,
      instance_count=2,
      instance_type="ml.p4d.24xlarge",
      distribution={"torch_distributed": {"enabled": True}}  # runs torchrun in the backend
  )

  For more information, see Distributed PyTorch Training and the SageMaker AI PyTorch Estimator's distribution argument in the SageMaker Python SDK documentation.

  Notes for distributed training on Trn1

  A Trn1 instance consists of up to 16 Trainium devices, and each Trainium device consists of two NeuronCores. For specs of the AWS Trainium devices, see Trainium Architecture in the AWS Neuron Documentation.

  To train on the Trainium-powered instances, you only need to specify the Trn1 instance code, ml.trn1.*, as a string to the instance_type argument of the SageMaker AI PyTorch estimator class. To find available Trn1 instance types, see AWS Trn1 Architecture in the AWS Neuron documentation.

  Note: SageMaker Training on Amazon EC2 Trn1 instances is currently available only for the PyTorch framework in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0. To find a complete list of supported versions of PyTorch Neuron, see Neuron Containers in the AWS Deep Learning Containers GitHub repository.

  When you launch a training job on Trn1 instances using the SageMaker Python SDK, SageMaker AI automatically picks up and runs the right container from the Neuron Containers provided by AWS Deep Learning Containers. The Neuron Containers are prepackaged with training environment settings and dependencies for easier adaptation of your training job to the SageMaker Training platform and Amazon EC2 Trn1 instances.

  Note: To run your PyTorch training job on Trn1 instances with SageMaker AI, you should modify your training script to initialize process groups with the xla backend and use PyTorch/XLA. To support the XLA adoption process, the AWS Neuron SDK provides PyTorch Neuron, which uses XLA to convert PyTorch operations to Trainium instructions. To learn how to modify your training script, see Developer Guide for Training with PyTorch Neuron (torch-neuronx) in the AWS Neuron Documentation; a minimal training-script sketch also follows after this list.

  For more information, see Distributed Training with PyTorch Neuron on Trn1 instances and the SageMaker AI PyTorch Estimator's distribution argument in the SageMaker Python SDK documentation.
- To use MPI in SageMaker AI, add distribution={"mpi": {"enabled": True}} to your estimator. The MPI distribution option is available for the following frameworks: MXNet, PyTorch, and TensorFlow.
- To use a parameter server in SageMaker AI, add distribution={"parameter_server": {"enabled": True}} to your estimator. The parameter server option is available for the following frameworks: MXNet, PyTorch, and TensorFlow.

  Tip: For more information about using the MPI and parameter server options per framework, use the following links to the SageMaker Python SDK documentation.

  - MXNet Distributed Training and the SageMaker AI MXNet Estimator's distribution argument
  - PyTorch Distributed Training and the SageMaker AI PyTorch Estimator's distribution argument
  - TensorFlow Distributed Training and the SageMaker AI TensorFlow Estimator's distribution argument
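As referenced in the Trn1 note above, the following is a minimal, hedged sketch of the training-script changes for Trn1 instances. It assumes the PyTorch Neuron (torch-neuronx) packages available in the AWS Deep Learning Containers for PyTorch Neuron; the model and training loop are placeholders that you would replace with your own.

import torch
import torch.nn as nn
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" backend with torch.distributed

# Initialize the process group with the XLA backend (required on Trn1)
dist.init_process_group("xla")

device = xm.xla_device()              # the NeuronCore assigned to this worker
model = nn.Linear(32, 2).to(device)   # placeholder model; replace with your own
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):                # placeholder loop over synthetic data
    inputs = torch.randn(8, 32).to(device)
    labels = torch.randint(0, 2, (8,)).to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    xm.optimizer_step(optimizer)      # steps the optimizer and triggers XLA graph execution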