Supported Frameworks and AWS Regions
Before using the SageMaker model parallelism library, check the supported frameworks and instance types, and confirm that your AWS account has sufficient service quotas in your AWS Region.
Note
To check the latest updates and release notes of the library, see the SageMaker Model Parallel Release Notes.
Supported Frameworks
The SageMaker model parallelism library supports the following deep learning frameworks and is available in AWS Deep Learning Containers (DLC) or as a downloadable binary file.
PyTorch versions supported by SageMaker and the SageMaker model parallelism library
PyTorch version | SageMaker model parallelism library version | smdistributed-modelparallel integrated DLC image URI | URL of the binary file** |
---|---|---|---|
v2.0.0 | smdistributed-modelparallel==v1.15.0 | | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl |
v1.13.1 | smdistributed-modelparallel==v1.15.0 | | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl |
v1.12.1 | smdistributed-modelparallel==v1.13.0 | | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl |
v1.12.0 | smdistributed-modelparallel==v1.11.0 | | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl |
v1.11.0 | smdistributed-modelparallel==v1.10.0 | | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl |
v1.10.2 | smdistributed-modelparallel==v1.7.0 | | - |
v1.10.0 | smdistributed-modelparallel==v1.5.0 | | - |
v1.9.1 | smdistributed-modelparallel==v1.4.0 | | - |
v1.8.1* | smdistributed-modelparallel==v1.6.0 | | - |
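The version pairing in the table above can be captured as a small lookup, which may help when a training script needs to check which library version goes with the PyTorch version it is running. This is only a sketch: the dictionary is copied from the table, while the names `SMP_VERSION_BY_PYTORCH` and `smp_version_for` are hypothetical helpers and not part of the library.

```python
# Mapping copied from the PyTorch table above (leading "v" dropped).
# The names below are illustrative helpers, not part of smdistributed-modelparallel.
SMP_VERSION_BY_PYTORCH = {
    "2.0.0": "1.15.0",
    "1.13.1": "1.15.0",
    "1.12.1": "1.13.0",
    "1.12.0": "1.11.0",
    "1.11.0": "1.10.0",
    "1.10.2": "1.7.0",
    "1.10.0": "1.5.0",
    "1.9.1": "1.4.0",
    "1.8.1": "1.6.0",
}


def smp_version_for(pytorch_version):
    """Return the library version paired with a PyTorch version in the table."""
    # Accept forms such as "v2.0.0" or "1.12.1+cu113" by dropping a leading
    # "v" and any local build suffix after "+".
    key = pytorch_version.strip().lstrip("v").split("+", 1)[0]
    if key not in SMP_VERSION_BY_PYTORCH:
        raise KeyError(f"PyTorch {pytorch_version!r} is not listed in the table")
    return SMP_VERSION_BY_PYTORCH[key]
```

For example, `smp_version_for("v1.12.1")` looks up the v1.12.1 row and returns the library version listed there.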
Note
The SageMaker model parallelism library v1.6.0 and later provides extended features for PyTorch. For more information, see Core Features of the SageMaker Model Parallelism Library.
** The URLs of the binary files are for installing the SageMaker model parallelism library in custom containers. For more information, see Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library.
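When choosing a binary file for a custom container, the wheel file name itself tells you which Python version it targets (for example, `cp310` for CPython 3.10). The following sketch parses one of the URLs from the table using only the standard library. It assumes the standard wheel naming convention without an optional build tag, and the function names `parse_wheel_url` and `required_python` are hypothetical.

```python
import re
from urllib.parse import urlparse

# Wheel file names follow the pattern
#   {distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl
# This sketch assumes no optional build tag is present.
WHEEL_PATTERN = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-]+)-(?P<python>[^-]+)"
    r"-(?P<abi>[^-]+)-(?P<platform>.+)\.whl$"
)


def parse_wheel_url(url):
    """Split the wheel file name at the end of a URL into its tagged parts."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    match = WHEEL_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognizable wheel file name: {filename!r}")
    return match.groupdict()


def required_python(python_tag):
    """Convert a CPython tag such as 'cp310' into a (major, minor) tuple."""
    match = re.fullmatch(r"cp(\d)(\d+)", python_tag)
    if match is None:
        raise ValueError(f"unsupported Python tag: {python_tag!r}")
    return int(match.group(1)), int(match.group(2))
```

A container build script could compare `required_python(info["python"])` against `sys.version_info[:2]` before installing the wheel, so that a mismatch fails early with a clear message.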
TensorFlow versions supported by SageMaker and the SageMaker model parallelism library
TensorFlow version | SageMaker model parallelism library version | smdistributed-modelparallel integrated DLC image URI |
---|---|---|
v2.6.0 | smdistributed-modelparallel==v1.4.0 | 763104351884.dkr.ecr. |
v2.5.1 | smdistributed-modelparallel==v1.4.0 | 763104351884.dkr.ecr. |
Hugging Face Transformers versions supported by SageMaker and the SageMaker model parallelism library
The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest Hugging Face Containers.
AWS Regions
The SageMaker model parallelism library is available in all of the AWS Regions where the AWS Deep Learning Containers for SageMaker are in service.
Supported Instance Types
The SageMaker model parallelism library requires one of the following ML instance types.
Instance type |
---|
ml.g4dn.12xlarge |
ml.p3.16xlarge |
ml.p3dn.24xlarge |
ml.p4d.24xlarge |
ml.p4de.24xlarge |
For specs of the instance types, see the Accelerated Computing section in the Amazon EC2 Instance Types page.
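Because a training job on an unsupported instance type fails only after submission, it can be convenient to validate the instance type up front. The sketch below checks a value against the list in the table above. The set is copied from that table, and the name `check_instance_type` is a hypothetical helper, not an API of the library or the SageMaker SDK.

```python
# Instance types copied from the table above.
SUPPORTED_INSTANCE_TYPES = frozenset(
    {
        "ml.g4dn.12xlarge",
        "ml.p3.16xlarge",
        "ml.p3dn.24xlarge",
        "ml.p4d.24xlarge",
        "ml.p4de.24xlarge",
    }
)


def check_instance_type(instance_type):
    """Return the normalized instance type, or raise if it is not in the table."""
    normalized = instance_type.strip().lower()
    if normalized not in SUPPORTED_INSTANCE_TYPES:
        supported = ", ".join(sorted(SUPPORTED_INSTANCE_TYPES))
        raise ValueError(
            f"{instance_type!r} is not listed as supported; choose one of: {supported}"
        )
    return normalized
```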
If you encounter an error message similar to the following, follow the instructions at Request a service quota increase for SageMaker resources.
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
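To decide how large a quota increase to request, you can pull the numbers out of the error text. The following is a minimal sketch that assumes the message wording matches the example above exactly; the real wording may differ between services or over time, and `parse_quota_error` is a hypothetical helper.

```python
import re

# Pattern modeled on the example message above; this wording is an assumption
# and is not guaranteed to be stable.
QUOTA_ERROR_PATTERN = re.compile(
    r"The account-level service limit '(?P<quota>[^']+)' is (?P<limit>\d+) Instances, "
    r"with current utilization of (?P<used>\d+) Instances "
    r"and a request delta of (?P<delta>\d+) Instances"
)


def parse_quota_error(message):
    """Extract quota details from a ResourceLimitExceeded message, or return None."""
    match = QUOTA_ERROR_PATTERN.search(message)
    if match is None:
        return None
    limit = int(match.group("limit"))
    used = int(match.group("used"))
    delta = int(match.group("delta"))
    return {
        "quota": match.group("quota"),
        "limit": limit,
        "used": used,
        "delta": delta,
        # Number of additional instances the quota must allow for the job to start.
        "shortfall": max(0, used + delta - limit),
    }
```

Applied to the example message, the parsed quota name is `ml.p3dn.24xlarge for training job usage`, which is the quota to look for when filing the increase request.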