MLCOST-15: Use distributed training
When an algorithm supports it, enable distributed training to reduce training time. Use multiple instances in a training cluster, and use managed services to help ensure that all training instances are automatically shut down when training completes.
Implementation plan
- Use Amazon SageMaker AI distributed training libraries - The distributed training libraries in Amazon SageMaker AI automatically split large deep learning models and training datasets across AWS GPU instances in a fraction of the time it takes to do so manually. SageMaker AI achieves these efficiencies through two techniques: data parallelism and model parallelism. Model parallelism splits models that are too large to fit on a single GPU into smaller parts before distributing them across multiple GPUs to train, and data parallelism splits large datasets so partitions train concurrently, improving training speed. A launch sketch follows this item.
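The following is a minimal sketch of launching a data-parallel training job with the SageMaker Python SDK. The training script name (train.py), source directory, IAM role ARN, S3 data path, and framework versions are illustrative placeholders, not values from this document; check the SageMaker AI documentation for currently supported framework versions and instance types.

```python
# Minimal sketch: launch a data-parallel PyTorch training job that uses the
# SageMaker AI distributed data parallel library. Names, role ARN, S3 paths,
# and versions below are placeholders for illustration.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # replace with your execution role

estimator = PyTorch(
    entry_point="train.py",            # your training script (placeholder name)
    source_dir="src",                  # directory containing the script (assumption)
    role=role,
    instance_count=2,                  # multiple instances in the training cluster
    instance_type="ml.p4d.24xlarge",   # GPU instance type supported by the library
    framework_version="2.0",           # example version; verify current support
    py_version="py310",
    sagemaker_session=session,
    # Enable the SageMaker AI data parallelism library for this job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# The managed training job provisions the cluster, runs the script on every
# instance, and shuts all instances down automatically when training completes.
estimator.fit({"training": "s3://my-bucket/training-data/"})
```

Model parallelism is configured through the same distribution argument of the estimator, with parameters that control how the model is partitioned across GPUs. Because the job runs as a managed SageMaker AI training job, the cluster exists only for the duration of the job, which supports the automatic shutdown guidance above.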
Blogs
- New – Data Parallelism Library in Amazon SageMaker AI Simplifies Training on Large Datasets
- How Latent Space used the Amazon SageMaker AI model parallelism library to push the frontiers of large-scale transformers
- The science behind Amazon SageMaker AI’s distributed-training engines
- Amazon SageMaker AI XGBoost now offers fully distributed GPU training