Running training jobs on a heterogeneous cluster - Amazon SageMaker AI

With the heterogeneous cluster feature of SageMaker Training, you can run a single training job on multiple types of ML instances, which improves resource scaling and utilization across different ML training tasks. For example, if a training job on a cluster of GPU instances shows low GPU utilization because CPU-intensive tasks are creating a CPU bottleneck, a heterogeneous cluster lets you offload those tasks to additional, more cost-efficient CPU instance groups. This resolves the bottleneck and improves GPU utilization.

Note

This feature is available in the SageMaker Python SDK v2.98.0 and later.

Note

This feature is available through the SageMaker AI PyTorch and TensorFlow framework estimator classes. Supported frameworks are PyTorch v1.10 or later and TensorFlow v2.6 or later.
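
The GPU-plus-CPU setup described above can be sketched as a resource configuration that lists one entry per instance group. The following is a minimal illustration, not the official implementation: `GroupSpec` and `build_resource_config` are hypothetical helpers written for this example, the group names and instance types are placeholder values, and the unique-name check is an assumption of this sketch. The output keys (`InstanceGroups`, `InstanceGroupName`, `InstanceType`, `InstanceCount`) follow the `ResourceConfig` structure of the `CreateTrainingJob` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSpec:
    """Hypothetical helper (not part of any SDK): one instance group."""
    name: str
    instance_type: str
    instance_count: int


def build_resource_config(groups, volume_size_gb=30):
    """Build a ResourceConfig-style dict with one entry per instance group."""
    if not groups:
        raise ValueError("at least one instance group is required")
    names = [g.name for g in groups]
    # Assumption of this sketch: group names must be distinct so that
    # each group can be addressed unambiguously.
    if len(set(names)) != len(names):
        raise ValueError("instance group names must be unique")
    for g in groups:
        if g.instance_count < 1:
            raise ValueError(f"{g.name}: instance_count must be >= 1")
    return {
        "VolumeSizeInGB": volume_size_gb,
        "InstanceGroups": [
            {
                "InstanceGroupName": g.name,
                "InstanceType": g.instance_type,
                "InstanceCount": g.instance_count,
            }
            for g in groups
        ],
    }


# Placeholder values: a CPU group for CPU-intensive work (for example,
# data preprocessing) and a GPU group for the training computation.
resource_config = build_resource_config([
    GroupSpec("data_group", "ml.c5.18xlarge", 2),
    GroupSpec("dnn_group", "ml.p4d.24xlarge", 1),
])
print(resource_config["InstanceGroups"])
```

In the SageMaker Python SDK, the same idea is typically expressed by passing a list of `sagemaker.instance_group.InstanceGroup` objects to the `instance_groups` parameter of a PyTorch or TensorFlow estimator instead of `instance_type` and `instance_count`. Check the SDK reference for the exact signature in your version.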

See also the blog post "Improve price performance of your model training using Amazon SageMaker AI heterogeneous clusters."