SageMaker Distributed Model Parallelism Best Practices
Use the following guidelines when you run a distributed training job with the SageMaker model parallel library.
Setting Up the Right Configuration for a Given Model
When scaling up a model, we recommend that you go over the following list in order. Each list item discusses the advantages of using the library's techniques along with the tradeoffs that might arise.
Tip
If a model fits using a subset of the library's features, adding more model parallelism or memory-saving features does not usually improve performance.
Using large GPU instance types
- In the realm of model parallelism, it is best to use powerful instances with large GPU memory to handle overhead from model parallelism operations such as partitioning models across multiple GPUs. We recommend using ml.p4d or ml.p3dn instances for training large DL models. These instances are also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables large-scale training with model parallelism.
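The following is a minimal sketch of requesting EFA-enabled ml.p4d.24xlarge instances for a training job with the SageMaker Python SDK. The entry point, role ARN, framework versions, and S3 path are placeholders to adapt to your environment.

```python
from sagemaker.pytorch import PyTorch

# Request EFA-enabled P4d instances for the training cluster.
estimator = PyTorch(
    entry_point="train.py",                      # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role ARN
    instance_type="ml.p4d.24xlarge",             # 8 GPUs per instance, EFA-enabled
    instance_count=2,                            # 16 GPUs in total
    framework_version="1.12",                    # placeholder framework/Python versions
    py_version="py38",
)
estimator.fit("s3://amzn-s3-demo-bucket/training-data")   # placeholder S3 input
```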
Sharding optimizer state
- The impact of sharding optimizer state depends on the number of data parallel ranks. Typically, a higher degree of data parallelism (proportional to the size of the compute cluster) can improve the efficiency of memory usage.
- When you want to downsize a cluster, make sure you check the optimizer state sharding configuration. For example, a large DL model with optimizer state sharding that fits on a compute cluster with 16 GPUs (for example, two P4d or P4de instances) might not always fit on a node with 8 GPUs (for example, a single P4d or P4de instance). This is because the combined memory of 8 GPUs is lower than the combined memory of 16 GPUs, and the required memory per GPU for sharding over 8 GPUs is higher than the required memory per GPU for sharding over 16 GPUs. As a result, the increased memory requirement might not fit into the smaller cluster.
For more information, see Optimizer State Sharding.
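As a rough illustration, the following sketch enables optimizer state sharding through the library's shard_optimizer_state parameter in the estimator's distribution configuration. The parallelism degrees, MPI settings, and placeholder values are illustrative, not tuned recommendations.

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,                    # data parallel ranks provide the shards
        "shard_optimizer_state": True,  # shard optimizer state across data parallel ranks
        "tensor_parallel_degree": 4,    # illustrative value
        "pipeline_parallel_degree": 1,
    },
}

estimator = PyTorch(
    entry_point="train.py",
    role="<your-execution-role-arn>",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,                   # 16 GPUs; downsizing to 8 GPUs raises per-GPU memory needs
    framework_version="1.12",
    py_version="py38",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
```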
Activation checkpointing
- Memory efficiency can be improved by using activation checkpointing for a group of modules. The more you group the modules, the more efficient the memory usage. When checkpointing sequential modules for layers, the strategy argument of the smp.set_activation_checkpointing function groups the layers together for checkpointing. For example, grouping two or more layers together for checkpointing is more memory efficient than checkpointing one layer at a time, and this trades extra computation time for reduced memory usage.
For more information, see Activation Checkpointing.
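The following training-script sketch groups sequential transformer layers for checkpointing with the smp.set_activation_checkpointing function mentioned above. The module path model.get_module().transformer.seq_layers is a placeholder for a sequential module in your own model, and "group_2" is shown as an example strategy value that checkpoints layers in groups of two.

```python
import smdistributed.modelparallel.torch as smp

smp.init()
model = build_model()                    # placeholder: construct your model here
model = smp.DistributedModel(model)      # wrap the model for the library

# Checkpoint the sequential transformer layers in groups of two layers,
# trading extra recomputation time for lower activation memory.
smp.set_activation_checkpointing(
    model.get_module().transformer.seq_layers,   # placeholder path to a sequential module
    strategy="group_2",                          # example grouping strategy
)
```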
Tensor parallelism
- The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2^n), where the maximum degree is the number of GPUs per node. For example, if you use a node with 8 GPUs, the possible degrees of tensor parallelism are 2, 4, and 8. We don't recommend arbitrary numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes, misconfiguring the degree of tensor parallelism might result in running tensor parallelism across nodes; this adds significant overhead from communicating activations across nodes and can become computationally expensive.
For more information, see Tensor Parallelism.
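For example, on ml.p4d.24xlarge nodes with 8 GPUs each, the sketch below keeps the degree of tensor parallelism at a power of two no larger than 8; the surrounding settings are illustrative placeholders.

```python
smp_options = {
    "enabled": True,
    "parameters": {
        "tensor_parallel_degree": 8,     # power of two, no larger than the GPUs per node
        "pipeline_parallel_degree": 1,
        "ddp": True,
    },
}
# Pass smp_options to the estimator as in the earlier sketches, for example:
# distribution={"smdistributed": {"modelparallel": smp_options},
#               "mpi": {"enabled": True, "processes_per_host": 8}}
```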
Pipeline parallelism across nodes
- You can run pipeline parallelism both within a single node and across multiple nodes. When you use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline parallelism across multiple nodes and keeping tensor parallelism within individual nodes.
- Pipeline parallelism comes with the following three knobs: microbatches, active_microbatches, and prescaled_batch. A configuration sketch that sets all three follows this list.
  - When you use tensor parallelism with pipeline parallelism, we recommend activating prescaled_batch so that the batch size per model parallel group can be increased for efficient pipelining. With prescaled_batch activated, the batch size set in the training script becomes tp_size times the batch size set for each rank without prescaled_batch.
  - Increasing the number of microbatches helps achieve efficient pipelining and better performance. Note that the effective microbatch size is the batch size divided by the number of microbatches. If you increase the number of microbatches while keeping the batch size constant, each microbatch processes fewer samples.
  - The number of active_microbatches is the maximum number of microbatches that are simultaneously in process during pipelining. For each active microbatch in process, its activations and gradients take up GPU memory. Therefore, increasing active_microbatches takes up more GPU memory.
- If both the GPUs and GPU memory are underutilized, increase active_microbatches for better parallelization during pipelining.
- For more information about how to use tensor parallelism with pipeline parallelism, see Tensor parallelism combined with pipeline parallelism.
- To find descriptions of the aforementioned parameters, see Parameters for smdistributed in the SageMaker Python SDK documentation.
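The following sketch combines the three knobs with pipeline parallelism across two nodes and tensor parallelism within each node. The values are illustrative starting points rather than recommendations for a particular model.

```python
smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,   # pipeline across nodes
        "tensor_parallel_degree": 8,     # keep tensor parallelism within a node
        "microbatches": 4,               # more microbatches allow deeper pipelining
        "active_microbatches": 2,        # raise this if GPUs and GPU memory are underutilized
        "prescaled_batch": True,         # batch size in the script is per model parallel group
        "ddp": True,
    },
}
```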
Offloading activations to CPU
- Make sure that this is used in combination with activation checkpointing and pipeline parallelism. To ensure that the offloading and preloading happen in the background, specify a value greater than 1 for the microbatches parameter.
- When offloading activations, you might be able to increase active_microbatches and sometimes match it with the total number of microbatches. This depends on which modules are checkpointed and how the model is partitioned. A configuration sketch follows.
For more information, see Activation Offloading.
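A minimal sketch of such a configuration, assuming the offload_activations parameter from the smdistributed parameter reference; the values are illustrative.

```python
smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,
        "microbatches": 4,              # must be greater than 1 so offloading overlaps with compute
        "active_microbatches": 2,       # can sometimes be raised toward the total microbatch count
        "offload_activations": True,    # move checkpointed activations to CPU memory
        "ddp": True,
    },
}
```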
Reference configurations
The SageMaker model parallelism training team provides the following reference points based on experiments with the GPT-2 model, the sequence length of 512, and the vocabulary size of 50,000.
| Number of model parameters | Instance type | Pipeline parallelism | Tensor parallelism | Optimizer state sharding | Activation checkpointing | Prescaled batch | Batch size |
|---|---|---|---|---|---|---|---|
| 10 billion | 16 ml.p4d.24xlarge | 1 | 4 | True | Each transformer layer | True | batch_size=40 |
| 30 billion | 16 ml.p4d.24xlarge | 1 | 8 | True | Each transformer layer | True | batch_size=32 |
| 60 billion | 32 ml.p4d.24xlarge | 2 | 8 | True | Each transformer layer | True | batch_size=56, microbatches=4, active_microbatches=2 |
You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.
Modifying Your Training Script
- Before you use the SageMaker model parallel library’s features in your training script, review The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls.
- To launch a training job faster, use SageMaker local mode. This helps you quickly run a training job locally on a SageMaker notebook instance. Depending on the scale of the ML instance on which your SageMaker notebook instance is running, you might need to adjust the size of your model by changing the model configurations, such as the hidden width, number of transformer layers, and attention heads. Validate that the reduced model runs well on the notebook instance before using a large cluster to train the full model.
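The following is a minimal local mode sketch for checking a scaled-down model on the notebook instance itself. The role ARN, versions, and hyperparameter names are placeholders, and local mode requires the SageMaker Python SDK's local extras (pip install "sagemaker[local]").

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="<your-execution-role-arn>",
    instance_type="local_gpu",            # or "local" on a CPU-only notebook instance
    instance_count=1,
    framework_version="1.12",
    py_version="py38",
    hyperparameters={                     # placeholder knobs for a reduced model
        "num_layers": 4,
        "hidden_width": 512,
        "num_heads": 8,
    },
)
estimator.fit("file:///home/ec2-user/SageMaker/sample-data")   # placeholder local data path
```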
Monitoring and Logging a Training Job Using the SageMaker Console and Amazon CloudWatch
To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use the visualization provided through the SageMaker console.
- In the left navigation pane, choose Training.
- Choose Training jobs.
- In the main pane, choose the training job name for which you want to see more details.
- Browse the main pane and find the Monitor section to see the automated visualization.
- To see training job logs, choose View logs in the Monitor section. You can access the distributed training logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of algo-n-1234567890. The algo-1 log stream tracks training logs from the main (0th) node.
For more information, see Amazon CloudWatch Metrics for Monitoring and Analyzing Training Jobs.
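If you prefer to script this, the following boto3 sketch lists the per-node log streams of a training job in the /aws/sagemaker/TrainingJobs log group; the job name is a placeholder.

```python
import boto3

logs = boto3.client("logs")
response = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix="my-smp-training-job",    # placeholder training job name
)
for stream in response["logStreams"]:
    # Multi-node jobs produce one stream per node, for example .../algo-1-..., .../algo-2-...
    print(stream["logStreamName"])
```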
Permissions
To run a SageMaker training job with model parallelism or to run the SageMaker distributed training example notebooks, make sure you have the right permissions in your IAM role, such as the following:
- To use FSx for Lustre, add AmazonFSxFullAccess.
- To use Amazon S3 as a data channel, add AmazonS3FullAccess.
- To use Docker, build your own container, and push it to Amazon ECR, add AmazonEC2ContainerRegistryFullAccess.
- To have full access to the entire suite of SageMaker features, add AmazonSageMakerFullAccess.
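As one way to attach these managed policies, the following boto3 sketch adds them to an execution role; the role name is a placeholder, and you can also attach the policies through the IAM console.

```python
import boto3

iam = boto3.client("iam")
managed_policies = [
    "AmazonSageMakerFullAccess",
    "AmazonS3FullAccess",
    "AmazonFSxFullAccess",
    "AmazonEC2ContainerRegistryFullAccess",
]
for policy_name in managed_policies:
    iam.attach_role_policy(
        RoleName="MySageMakerExecutionRole",                  # placeholder role name
        PolicyArn=f"arn:aws:iam::aws:policy/{policy_name}",
    )
```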