Amazon SageMaker Training Compiler Release Notes
Important
AWS has announced that there will be no new releases or versions of SageMaker Training Compiler. You can continue to use SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. Note that while the existing DLCs remain accessible, they will no longer receive patches or updates from AWS, in accordance with the AWS Deep Learning Containers Framework Support Policy.
See the following release notes to track the latest updates for Amazon SageMaker Training Compiler.
SageMaker Training Compiler Release Notes: February 13, 2023
Currency Updates
- Added support for PyTorch v1.13.1.
Bug Fixes
- Fixed a race condition on GPUs that caused NaN loss in some models, such as vision transformer (ViT) models.
Other Changes
- SageMaker Training Compiler improves performance by letting PyTorch/XLA automatically override the optimizers (such as SGD, Adam, and AdamW) in torch.optim or transformers.optimization with their syncfree versions in torch_xla.amp.syncfree (such as torch_xla.amp.syncfree.SGD, torch_xla.amp.syncfree.Adam, and torch_xla.amp.syncfree.AdamW). You don't need to change the lines in your training script where you define optimizers.
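For illustration, the following fragment of a hypothetical PyTorch training script (the model and learning rate are placeholders) shows an optimizer definition that needs no changes; when the training compiler is active, PyTorch/XLA swaps in the syncfree equivalent:

```python
import torch
import torch.nn as nn

# Hypothetical model; stands in for a real Transformer.
model = nn.Linear(768, 2)

# Define the optimizer as usual. With SageMaker Training Compiler,
# PyTorch/XLA automatically overrides this with
# torch_xla.amp.syncfree.AdamW -- no change to this line is needed.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```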
Migration to AWS Deep Learning Containers
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
- PyTorch v1.13.1
  763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
SageMaker Training Compiler Release Notes: January 9, 2023
Breaking Changes
- tf.keras.optimizers.Optimizer points to a new optimizer in TensorFlow 2.11.0 and later, and the old optimizers are moved to tf.keras.optimizers.legacy. You might encounter job failures due to this breaking change when you do the following:
  - Load checkpoints from an old optimizer. We recommend switching to the legacy optimizers.
  - Use TensorFlow v1. We recommend migrating to TensorFlow v2, or switching to the legacy optimizers if you need to continue using TensorFlow v1.
  For a more detailed list of breaking changes from the optimizer changes, see the official TensorFlow v2.11.0 release notes in the TensorFlow GitHub repository.
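For example, a training script that must keep restoring checkpoints written by a pre-2.11 optimizer can switch to the legacy namespace (a sketch; the optimizer choice and learning rate are illustrative):

```python
import tensorflow as tf

# In TensorFlow 2.11.0 and later, tf.keras.optimizers.Adam resolves to
# the new Keras optimizer. To stay compatible with checkpoints saved by
# the old optimizer, use the legacy implementation instead.
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=1e-3)
```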
Migration to AWS Deep Learning Containers
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
- TensorFlow v2.11.0
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
SageMaker Training Compiler Release Notes: December 8, 2022
Bug Fixes
- Fixed the seed for PyTorch training jobs starting from PyTorch v1.12 to ensure that there is no discrepancy in model initialization across different processes. See also PyTorch Reproducibility.
- Fixed the issue causing PyTorch distributed training jobs on G4dn and G5 instances to not default to communication through PCIe.
Known Issues
- Improper use of PyTorch/XLA APIs in Hugging Face’s vision transformers might lead to convergence issues.
Other Changes
- When using the Hugging Face Transformers Trainer class, make sure that you use syncfree optimizers by setting the optim argument to adamw_torch_xla. For more information, see Large Language Models Using the Hugging Face Transformers Trainer Class. See also Optimizer in the Hugging Face Transformers documentation.
Migration to AWS Deep Learning Containers
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
- PyTorch v1.12.0
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
SageMaker Training Compiler Release Notes: October 4, 2022
Currency Updates
- Added support for TensorFlow v2.10.0.
Other Changes
- Added Hugging Face NLP models using the Transformers library to TensorFlow framework tests. To find the tested Transformer models, see Tested Models.
Migration to AWS Deep Learning Containers
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
- TensorFlow v2.10.0
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
SageMaker Training Compiler Release Notes: September 1, 2022
Currency Updates
- Added support for Hugging Face Transformers v4.21.1 with PyTorch v1.11.0.
Improvements
- Implemented a new distributed training launcher mechanism to activate SageMaker Training Compiler for Hugging Face Transformer models with PyTorch. To learn more, see Run PyTorch Training Jobs with SageMaker Training Compiler for Distributed Training.
- Integrated with Elastic Fabric Adapter (EFA) to improve collective communication in distributed training.
- Added support for G5 instances for PyTorch training jobs. For more information, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
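As a sketch, enabling the compiler together with the distributed training launcher through the SageMaker Python SDK might look like the following job configuration (the entry point, IAM role, and instance settings are placeholders):

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",           # hypothetical training script
    role="MySageMakerExecutionRole",  # placeholder IAM role
    instance_type="ml.g5.12xlarge",
    instance_count=2,
    transformers_version="4.21.1",
    pytorch_version="1.11.0",
    py_version="py38",
    # Turns on SageMaker Training Compiler.
    compiler_config=TrainingCompilerConfig(),
    # Activates the compiler-aware distributed training launcher.
    distribution={"pytorchxla": {"enabled": True}},
)
# estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 input
```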
Migration to AWS Deep Learning Containers
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
- Hugging Face Transformers v4.21.1 with PyTorch v1.11.0
  763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-trcomp-training:1.11.0-transformers4.21.1-gpu-py38-cu113-ubuntu20.04
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
SageMaker Training Compiler Release Notes: June 14, 2022
New Features
- Added support for TensorFlow v2.9.1. SageMaker Training Compiler fully supports compiling TensorFlow modules (tf.*) and TensorFlow Keras modules (tf.keras.*).
- Added support for custom containers created by extending AWS Deep Learning Containers for TensorFlow. For more information, see Enable SageMaker Training Compiler Using the SageMaker Python SDK and Extending SageMaker Framework Deep Learning Containers.
- Added support for G5 instances for TensorFlow training jobs.
Migration to AWS Deep Learning Containers
This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
- TensorFlow v2.9.1
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.1-gpu-py39-cu112-ubuntu20.04-sagemaker
To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.
SageMaker Training Compiler Release Notes: April 26, 2022
Improvements
- Added support for all of the AWS Regions where AWS Deep Learning Containers are in service, except the China Regions.
SageMaker Training Compiler Release Notes: April 12, 2022
Currency Updates
- Added support for Hugging Face Transformers v4.17.0 with TensorFlow v2.6.3 and PyTorch v1.10.2.
SageMaker Training Compiler Release Notes: February 21, 2022
Improvements
- Completed benchmark tests and confirmed training speed-ups on the ml.g4dn instance types. To find a complete list of tested ml instances, see Supported Instance Types.
SageMaker Training Compiler Release Notes: December 01, 2021
New Features
- Launched Amazon SageMaker Training Compiler at AWS re:Invent 2021.
Migration to AWS Deep Learning Containers
Amazon SageMaker Training Compiler passed benchmark testing and is migrated to AWS Deep Learning Containers. To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see Supported Frameworks, AWS Regions, Instance Types, and Tested Models.