Supported frameworks and AWS Regions
Before using the SageMaker model parallelism library v2 (SMP v2), check the supported frameworks and instance types and determine if there are enough quotas in your AWS account and AWS Region.
Note
To check the latest updates and release notes of the library, see Release notes for the SageMaker model parallelism library.
Supported frameworks
SMP v2 supports the following deep learning frameworks, which are available through SMP Docker containers and an SMP Conda channel. When you use the framework estimator classes in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use SMP v2, we recommend that you always keep the SageMaker Python SDK up to date in your development environment.
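As an illustrative sketch of that distribution configuration, the dictionary below enables SMP v2 in a SageMaker training job. The `hybrid_shard_degree` value, entry point, and instance settings are placeholder assumptions, not prescriptions; see the SMP v2 configuration parameters for the full set of options.

```python
# Distribution configuration that turns on SMP v2 in a SageMaker training
# job. The "hybrid_shard_degree" value is an illustrative placeholder.
smp_distribution = {
    "torch_distributed": {"enabled": True},
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {"hybrid_shard_degree": 8},
        }
    },
}

# Passing this dictionary to a framework estimator makes SageMaker pick up
# the SMP Docker container automatically (values below are placeholders):
#
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",
#     role=role,
#     instance_type="ml.p4d.24xlarge",
#     instance_count=2,
#     framework_version="2.4.1",
#     py_version="py311",
#     distribution=smp_distribution,
# )

print(sorted(smp_distribution))
```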
PyTorch versions that the SageMaker model parallelism library supports
| PyTorch version | SageMaker model parallelism library version | SMP Docker image URI |
|---|---|---|
| v2.4.1 | smdistributed-modelparallel==v2.6.1 | 658645717510.dkr.ecr. |
| v2.4.1 | smdistributed-modelparallel==v2.6.0 | |
| v2.3.1 | smdistributed-modelparallel==v2.5.0 | 658645717510.dkr.ecr. |
| v2.3.1 | smdistributed-modelparallel==v2.4.0 | |
| v2.2.0 | smdistributed-modelparallel==v2.3.0 | 658645717510.dkr.ecr. |
| v2.2.0 | smdistributed-modelparallel==v2.2.0 | |
| v2.1.2 | smdistributed-modelparallel==v2.1.0 | 658645717510.dkr.ecr. |
| v2.0.1 | smdistributed-modelparallel==v2.0.0 | 658645717510.dkr.ecr. |
SMP Conda channel
The following S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as a SageMaker HyperPod cluster, use this Conda channel to properly install the SMP library.
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/
For more information about Conda channels in general, see Channels in the Conda documentation.
Note
To find previous versions of the SMP library v1.x and pre-packaged DLCs, see Supported Frameworks in the SMP v1 documentation.
Use SMP v2 with open source libraries
The SMP v2 library works with other PyTorch-based open source libraries such as PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate, because SMP v2 is compatible with the PyTorch FSDP APIs. If you have further questions about using the SMP library with other third-party libraries, contact the SMP service team at sm-model-parallel-feedback@amazon.com.
AWS Regions
SMP v2 is available in the following AWS Regions. If you'd like to use the SMP Docker image URIs or the SMP Conda channel, check the following list, choose the AWS Region that matches yours, and update the image URI or the channel URL accordingly.
- ap-northeast-1
- ap-northeast-2
- ap-northeast-3
- ap-south-1
- ap-southeast-1
- ap-southeast-2
- ca-central-1
- eu-central-1
- eu-north-1
- eu-west-1
- eu-west-2
- eu-west-3
- sa-east-1
- us-east-1
- us-east-2
- us-west-1
- us-west-2
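Because only the Region segment of the Conda channel URL changes, a small helper can build the Region-specific channel URL and guard against unsupported Regions. This is a sketch: the URL pattern for Regions other than us-west-2 is assumed to follow the us-west-2 channel URL shown earlier.

```python
# Regions where SMP v2 is available, copied from the list above.
SMP_V2_REGIONS = {
    "ap-northeast-1", "ap-northeast-2", "ap-northeast-3", "ap-south-1",
    "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1",
    "eu-north-1", "eu-west-1", "eu-west-2", "eu-west-3", "sa-east-1",
    "us-east-1", "us-east-2", "us-west-1", "us-west-2",
}

def smp_conda_channel(region: str) -> str:
    """Return the Region-specific SMP Conda channel URL.

    Assumes the channel follows the same URL pattern as the us-west-2
    channel, with only the Region segment changing.
    """
    if region not in SMP_V2_REGIONS:
        raise ValueError(f"SMP v2 is not available in {region}")
    return (
        "https://sagemaker-distributed-model-parallel"
        f".s3.{region}.amazonaws.com/smp-v2/"
    )

print(smp_conda_channel("eu-west-1"))
```

The returned URL can then be passed to `conda install` with the `-c` flag to install the library from the channel.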
Supported instance types
SMP v2 requires one of the following ML instance types.
| Instance type |
|---|
| ml.p4d.24xlarge |
| ml.p4de.24xlarge |
| ml.p5.48xlarge |
| ml.p5e.48xlarge |
Tip
Starting from SMP v2.2.0, which supports PyTorch v2.2.0 and later, mixed precision training with FP8 on P5 instances using Transformer Engine is available.
For specifications of the SageMaker machine learning instance types in general, see the Accelerated Computing section on the Amazon EC2 Instance Types page.
If you encounter an error message similar to the following, follow the instructions at Requesting a quota increase in the AWS Service Quotas User Guide.
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
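The quota to increase is quoted inside the message itself. The following is a minimal sketch of pulling it out programmatically; the message string is the sample above, and the regular expression is an assumption about the message format:

```python
import re

# The sample ResourceLimitExceeded message quoted above.
message = (
    "An error occurred (ResourceLimitExceeded) when calling the "
    "CreateTrainingJob operation: The account-level service limit "
    "'ml.p3dn.24xlarge for training job usage' is 0 Instances, with "
    "current utilization of 0 Instances and a request delta of 1 Instances."
)

# The quota name is quoted in the message; extracting it tells you which
# limit to search for when requesting an increase.
match = re.search(r"service limit '([^']+)'", message)
quota_name = match.group(1) if match else None
print(quota_name)
```

The extracted name is what to search for under Amazon SageMaker in the Service Quotas console.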