Core features of the SageMaker model parallelism library v2
The Amazon SageMaker AI model parallelism library v2 (SMP v2) offers distribution strategies and memory-saving techniques such as sharded data parallelism, tensor parallelism, and checkpointing. These strategies and techniques help distribute large models across multiple devices while optimizing training speed and memory consumption. SMP v2 also provides the torch.sagemaker Python package, which lets you adapt your training script with only a few lines of code change.
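As a rough illustration of what "a few lines of code change" means in practice, the following sketch shows the shape of an adapted PyTorch training script. This is a skeleton, not a runnable program: torch.sagemaker is available only inside a SageMaker training job with SMP v2, and the model definition and training loop are placeholders.

```python
# Sketch only: torch.sagemaker exists inside SageMaker training jobs with SMP v2.
import torch.sagemaker as tsm

tsm.init()  # initialize the SMP v2 runtime from the SageMaker job configuration

model = ...                   # your existing PyTorch model definition
model = tsm.transform(model)  # apply the configured SMP parallelism strategies

# ... the rest of the training loop continues largely unchanged ...
```

The exact initialization arguments and supported model classes are described in the topics listed below.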
This guide follows the basic two-step flow introduced in Use the SageMaker model parallelism library v2. To dive deep into the core features of SMP v2 and how to use them, see the following topics.
Note
These core features are available in SMP v2.0.0 and later and the SageMaker Python SDK v2.200.0 and later, and work with PyTorch v2.0.1 and later. To check the versions of the packages, see Supported frameworks and AWS Regions.