PyTorch
Bring your own PyTorch model to SageMaker, and run the training job with SageMaker Training Compiler.
PyTorch Models with Hugging Face Transformers
If you bring a PyTorch model with Hugging Face Transformers, use the SageMaker PyTorch or HuggingFace estimator with the SageMaker Training Compiler configuration when you proceed to the next topic at Enable SageMaker Training Compiler.
Tip
When you create a tokenizer for an NLP model using Transformers in your training script, make sure that you use a static input tensor shape by specifying padding='max_length'. Do not use padding='longest' because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can trigger recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see Padding and truncation in the Hugging Face Transformers documentation.
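For example, a minimal sketch of static-shape tokenization might look like the following; the checkpoint name, the input sentences, and the max_length value are illustrative placeholders.

from transformers import AutoTokenizer

# Illustrative checkpoint; use the tokenizer that matches your model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["SageMaker Training Compiler example sentence.", "Another example."],
    padding="max_length",   # static shape: every batch is padded to max_length
    truncation=True,
    max_length=128,         # illustrative value; match your model's sequence limit
    return_tensors="pt",
)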
Large Language Models Using the Hugging Face Transformers Trainer Class
If you use the transformers library's Trainer class, you don't need to make any additional changes to your training script. SageMaker Training Compiler automatically compiles your Trainer model if you enable it through the estimator class. The following code shows the basic form of a PyTorch training script with the Hugging Face Trainer API.
from transformers import Trainer, TrainingArguments

training_args=TrainingArguments(**kwargs)
trainer=Trainer(args=training_args, **kwargs)
For single GPU training
You don't need to change your code when you use the transformers.Trainer class.
For distributed training
For PyTorch v1.11.0 and later
To run distributed training with SageMaker Training Compiler, you must add the following _mp_fn() function in your training script and wrap the main() function. It redirects the _mp_fn(index) function calls from the SageMaker distributed runtime for PyTorch (pytorchxla) to the main() function of your training script.
def _mp_fn(index):
    main()
This function accepts the index argument to indicate the rank of the current GPU in the cluster for distributed training. To find more example scripts, see the Hugging Face Transformers language modeling example scripts.
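The following is a minimal sketch of how _mp_fn() fits into a Trainer-based training script; the model, the dataset, and the TrainingArguments values are placeholders that you define in your own script.

from transformers import Trainer, TrainingArguments

def main():
    # Your usual Trainer-based training logic; model and train_dataset are
    # placeholders that you define elsewhere in your script.
    training_args = TrainingArguments(
        output_dir="./output",        # illustrative value
        optim="adamw_torch_xla",      # SyncFree optimizer, per the best practices below
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()

def _mp_fn(index):
    # Entry point called by the SageMaker distributed runtime for PyTorch
    # (pytorchxla); index is the rank of the current GPU.
    main()

if __name__ == "__main__":
    main()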
For Transformers v4.17 and before with PyTorch v1.10.2 and before
SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job, and you don't need to make any modifications to your training script. Instead, SageMaker Training Compiler requires you to pass a SageMaker distributed training launcher script to the entry_point argument and pass your training script to the hyperparameters argument in the SageMaker Hugging Face estimator.
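The following is a minimal sketch of that estimator setup. The launcher script name, the training_script hyperparameter key, the version strings, and the instance settings are illustrative assumptions; replace them with the values from your own setup.

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

hyperparameters = {
    "training_script": "train.py",   # illustrative key and value: your actual training script
    # ... any hyperparameters forwarded to the training script
}

huggingface_estimator = HuggingFace(
    entry_point="distributed_training_launcher.py",  # illustrative launcher script name
    source_dir="./scripts",                          # illustrative directory with both scripts
    role=role,                                       # placeholder: your SageMaker execution role
    instance_type="ml.p3.16xlarge",                  # illustrative instance type
    instance_count=1,
    transformers_version="4.17",                     # illustrative versions for this scenario
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),        # enables SageMaker Training Compiler
)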
Best Practices to Use SageMaker Training Compiler with Trainer
- Make sure that you use SyncFree optimizers by setting the optim argument to adamw_torch_xla when setting up transformers.TrainingArguments. See also Optimizer in the Hugging Face Transformers documentation.
- Ensure that the throughput of the data processing pipeline is higher than the training throughput. You can tweak the dataloader_num_workers argument of the transformers.TrainingArguments class and the preprocessing_num_workers argument of the Hugging Face Transformers example scripts to achieve this. Typically, these need to be greater than or equal to the number of GPUs but less than the number of CPUs. A minimal example of these settings follows this list.
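For example, a TrainingArguments setup that reflects these practices might look like the following; the output directory, batch size, and worker count are illustrative values to tune for your cluster.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",           # illustrative value
    optim="adamw_torch_xla",         # SyncFree optimizer, per the first practice
    dataloader_num_workers=8,        # typically >= number of GPUs, < number of CPUs
    per_device_train_batch_size=32,  # illustrative value
)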
After you have completed adapting your training script, proceed to Run PyTorch Training Jobs with SageMaker Training Compiler.
Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)
If you have a training script that uses PyTorch directly, you need to make additional changes to your PyTorch training script to implement PyTorch/XLA. Follow the instructions to modify your script to properly set up the PyTorch/XLA primitives.
For single GPU training
- Import the optimization libraries.

  import torch_xla
  import torch_xla.core.xla_model as xm

- Change the target device to be XLA instead of torch.device("cuda").

  device=xm.xla_device()

- If you're using PyTorch's Automatic Mixed Precision (AMP), do the following:

  - Replace torch.cuda.amp with the following:

    import torch_xla.amp

  - Replace torch.optim.SGD and torch.optim.Adam with the following:

    from torch_xla.amp.syncfree import Adam, SGD

  - Replace torch.cuda.amp.GradScaler with the following:

    from torch_xla.amp import GradScaler

- If you're not using AMP, replace optimizer.step() with the following:

  xm.optimizer_step(optimizer)

- If you're using a distributed dataloader, wrap your dataloader in the PyTorch/XLA ParallelLoader class:

  import torch_xla.distributed.parallel_loader as pl
  parallel_loader=pl.ParallelLoader(dataloader, [device]).per_device_loader(device)

- Add mark_step at the end of the training loop when you're not using parallel_loader:

  xm.mark_step()

- To checkpoint your training, use the PyTorch/XLA model checkpoint method:

  xm.save(model.state_dict(), path_to_save)
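The following is a minimal, self-contained sketch that combines the steps above for a single XLA device without AMP; the toy linear model and random dataset are placeholders for your own model and data.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()                       # XLA device instead of torch.device("cuda")

# Toy model and data to keep the sketch self-contained.
model = torch.nn.Linear(16, 4).to(device)
dataset = torch.utils.data.TensorDataset(
    torch.randn(256, 16), torch.randint(0, 4, (256,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# The per-device loader moves batches to the XLA device and handles mark_step.
train_device_loader = pl.ParallelLoader(dataloader, [device]).per_device_loader(device)

model.train()
for data, target in train_device_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    xm.optimizer_step(optimizer)               # replaces optimizer.step() on the non-AMP path

xm.save(model.state_dict(), "checkpoint.pt")   # XLA-aware checkpointing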
After you have completed adapting your training script, proceed to Run PyTorch Training Jobs with SageMaker Training Compiler.
For distributed training
In addition to the changes listed in the previous For single GPU training section, add the following changes to properly distribute the workload across GPUs.
- If you're using AMP, add all_reduce after scaler.scale(loss).backward():

  gradients=xm._fetch_gradients(optimizer)
  xm.all_reduce('sum', gradients, scale=1.0/xm.xrt_world_size())

- If you need to set variables for local_rank and world_size, use code similar to the following:

  local_rank=xm.get_local_ordinal()
  world_size=xm.xrt_world_size()

- For any world_size (num_gpus_per_node * num_nodes) greater than 1, you must define a train sampler, which should look similar to the following:

  import torch_xla.core.xla_model as xm

  train_sampler=None
  if xm.xrt_world_size() > 1:
      train_sampler=torch.utils.data.distributed.DistributedSampler(
          train_dataset,
          num_replicas=xm.xrt_world_size(),
          rank=xm.get_ordinal(),
          shuffle=True
      )

  train_loader=torch.utils.data.DataLoader(
      train_dataset,
      batch_size=args.batch_size,
      sampler=train_sampler,
      drop_last=args.drop_last,
      shuffle=False if train_sampler else True,
      num_workers=args.num_workers
  )

- Make the following changes to make sure that you use the parallel_loader provided by the torch_xla.distributed module.

  import torch_xla.distributed.parallel_loader as pl
  train_device_loader=pl.MpDeviceLoader(train_loader, device)

  The train_device_loader functions like a regular PyTorch loader, as follows:

  for step, (data, target) in enumerate(train_device_loader):
      optimizer.zero_grad()
      output=model(data)
      loss=torch.nn.functional.nll_loss(output, target)
      loss.backward()
With all of these changes, you should be able to launch distributed training with any PyTorch model without the Transformers Trainer API. Note that these instructions can be used for both single-node multi-GPU and multi-node multi-GPU training.
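The following is a minimal, non-AMP sketch that ties these changes together, including the _mp_fn() entry point covered in the items that follow; the toy linear model and random dataset are placeholders for your own model and data.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def main():
    device = xm.xla_device()
    world_size = xm.xrt_world_size()

    # Toy model and data to keep the sketch self-contained.
    model = torch.nn.Linear(16, 4).to(device)
    train_dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 16), torch.randint(0, 4, (256,)))

    train_sampler = None
    if world_size > 1:
        train_sampler = torch.utils.data.distributed.DistributedSampler(
            train_dataset, num_replicas=world_size, rank=xm.get_ordinal(), shuffle=True)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=32, sampler=train_sampler,
        shuffle=(train_sampler is None))

    # MpDeviceLoader moves batches to the XLA device and handles mark_step.
    train_device_loader = pl.MpDeviceLoader(train_loader, device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for data, target in train_device_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        xm.optimizer_step(optimizer)   # reduces gradients across replicas, then steps

def _mp_fn(index):
    main()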
- For PyTorch v1.11.0 and later

  To run distributed training with SageMaker Training Compiler, you must add the following _mp_fn() function in your training script and wrap the main() function. It redirects the _mp_fn(index) function calls from the SageMaker distributed runtime for PyTorch (pytorchxla) to the main() function of your training script.

  def _mp_fn(index):
      main()

  This function accepts the index argument to indicate the rank of the current GPU in the cluster for distributed training. To find more example scripts, see the Hugging Face Transformers language modeling example scripts.

- For Transformers v4.17 and before with PyTorch v1.10.2 and before

  SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job and requires you to pass a SageMaker distributed training launcher script to the entry_point argument and pass your training script to the hyperparameters argument in the SageMaker Hugging Face estimator.
After you have completed adapting your training script, proceed to Run PyTorch Training Jobs with SageMaker Training Compiler.
Best Practices to Use SageMaker Training Compiler with PyTorch/XLA
If you want to leverage SageMaker Training Compiler on your native PyTorch training script, you may want to first get familiar with PyTorch on XLA devices.
Note
The best practices in this section assume that you use the following PyTorch/XLA modules:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
Understand the lazy mode in PyTorch/XLA
One significant difference between PyTorch/XLA and native PyTorch is that the PyTorch/XLA system runs in lazy mode while native PyTorch runs in eager mode. Tensors in lazy mode are placeholders for building the computational graph until they are materialized after the compilation and evaluation are complete. The PyTorch/XLA system builds the computational graph on the fly when you call PyTorch APIs to build the computation using tensors and operators. The computational graph gets compiled and executed when xm.mark_step() is called explicitly or implicitly by pl.MpDeviceLoader/pl.ParallelLoader, or when you explicitly request the value of a tensor, such as by calling loss.item() or print(loss).
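For example, in the following small sketch nothing is computed on the device until the graph is materialized:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(2, 2, device=device)
b = torch.randn(2, 2, device=device)
c = a + b        # only recorded in the graph; nothing is computed yet
xm.mark_step()   # compiles and executes the pending graph
print(c)         # requesting the value also forces materialization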
Minimize the number of compilation-and-executions using pl.MpDeviceLoader/pl.ParallelLoader and xm.step_closure
For best performance, you should keep in mind the possible ways to initiate compilation-and-executions as described in Understand the lazy mode in PyTorch/XLA and should try to minimize the number of compilation-and-executions. Ideally, only one compilation-and-execution is necessary per training iteration, and it is initiated automatically by pl.MpDeviceLoader/pl.ParallelLoader. The MpDeviceLoader is optimized for XLA and should always be used if possible for best performance. During training, you might want to examine some intermediate results such as loss values. In that case, the printing of lazy tensors should be wrapped using xm.add_step_closure() to avoid unnecessary compilation-and-executions.
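The following sketch shows one way to report the loss without forcing extra compilation-and-executions; it assumes the train_device_loader, model, optimizer, and loss_fn from the sketches above.

import torch_xla.core.xla_model as xm

def _log_loss(loss):
    # Runs once the step's graph has executed, so the value is materialized.
    print(f"loss: {loss.item()}")

for step, (data, target) in enumerate(train_device_loader):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    xm.optimizer_step(optimizer)
    if step % 10 == 0:
        # Defer the print instead of calling print(loss) directly in the loop.
        xm.add_step_closure(_log_loss, args=(loss,))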
Use AMP and syncfree optimizers
Training in Automatic Mixed Precision (AMP) mode significantly accelerates your training speed by leveraging the Tensor cores of NVIDIA GPUs. SageMaker Training Compiler provides syncfree optimizers that are optimized for XLA to improve AMP performance. Currently, the following three syncfree optimizers are available and should be used if possible for best performance.
torch_xla.amp.syncfree.SGD
torch_xla.amp.syncfree.Adam
torch_xla.amp.syncfree.AdamW
These syncfree optimizers should be paired with torch_xla.amp.GradScaler for gradient scaling/unscaling.
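The following is a minimal sketch of pairing a syncfree optimizer with torch_xla.amp.GradScaler; the model, data, target, and loss_fn are placeholders, and the autocast call signature can vary across torch_xla versions.

import torch_xla.amp.syncfree as syncfree
import torch_xla.core.xla_model as xm
from torch_xla.amp import GradScaler, autocast

device = xm.xla_device()
optimizer = syncfree.AdamW(model.parameters(), lr=5e-5)  # model: placeholder
scaler = GradScaler()

with autocast(device):                                   # signature may differ by torch_xla version
    loss = loss_fn(model(data), target)                  # data, target, loss_fn: placeholders
scaler.scale(loss).backward()
gradients = xm._fetch_gradients(optimizer)               # gradient all_reduce for distributed training
xm.all_reduce('sum', gradients, scale=1.0 / xm.xrt_world_size())
scaler.step(optimizer)
scaler.update()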
Tip
Starting with PyTorch v1.13.1, SageMaker Training Compiler improves performance by letting PyTorch/XLA automatically override the optimizers (such as SGD, Adam, AdamW) in torch.optim or transformers.optimization with their syncfree versions in torch_xla.amp.syncfree (such as torch_xla.amp.syncfree.SGD, torch_xla.amp.syncfree.Adam, and torch_xla.amp.syncfree.AdamW). You don't need to change the code lines where you define optimizers in your training script.