Train a Model with Amazon SageMaker

Amazon SageMaker Training is a fully managed machine learning (ML) service offered by SageMaker that helps you efficiently train a wide range of ML models at scale. At its core, SageMaker Training containerizes ML workloads and manages AWS compute resources on your behalf. The SageMaker Training platform takes care of the heavy lifting associated with setting up and managing infrastructure for ML training workloads, so you can focus on developing, training, and fine-tuning your model. This page introduces three recommended ways to get started with training a model on SageMaker, followed by additional options you can consider.

Tip

For information about training foundation models for Generative AI, see Use SageMaker JumpStart foundation models in Amazon SageMaker Studio.

Choosing a feature within Amazon SageMaker Training

There are three main use cases for training ML models within SageMaker. This section describes those use cases, as well as the SageMaker features we recommend for each use case.

Whether you are training complex deep learning models or implementing smaller machine learning algorithms, SageMaker Training provides streamlined and cost-effective solutions that meet the requirements of your use cases.

Use cases

The following are the main use cases for training ML models within SageMaker.

  • Use case 1: Develop a machine learning model in a low-code or no-code environment.

  • Use case 2: Use code to develop machine learning models with more flexibility and control.

  • Use case 3: Develop machine learning models at scale with maximum flexibility and control.

The following sections describe three common scenarios of training ML models and the corresponding options to get started with SageMaker Training.

Use case 1: Build a model using Amazon SageMaker Canvas.

  • Description: Bring your data. SageMaker helps manage building the ML model and setting up the training infrastructure and resources.

  • Optimized for: Low-code/no-code, UI-driven model development and quick experimentation with a training dataset. When you build a custom model, an algorithm is automatically selected based on your data. For advanced customization options such as algorithm selection, see advanced model building configurations.

  • Considerations: Amazon SageMaker Canvas provides minimal flexibility to customize the model.

  • Recommended environment: Use Amazon SageMaker Canvas. To learn how to set it up, see Getting started with using SageMaker Canvas.

Use case 2: Train a model using one of the SageMaker built-in ML algorithms, such as XGBoost, or Task-Specific Models by SageMaker JumpStart with the SageMaker Python SDK.

  • Description: Bring your data and choose one of the built-in ML algorithms provided by SageMaker. Set the model hyperparameters, output metrics, and basic infrastructure settings using the SageMaker Python SDK. The SageMaker Training platform helps provision the training infrastructure and resources.

  • Optimized for: Training ML models with high-level customization of hyperparameters and infrastructure settings, and the ability to directly use ML frameworks and entry point scripts for more flexibility. Use built-in algorithms, pre-trained models, and JumpStart models through the Amazon SageMaker Python SDK to develop ML models. For more information, see Low-code deployment with the JumpStart class.

  • Considerations: The SageMaker Python SDK provides a simplified interface and fewer configuration options compared to the low-level SageMaker Training API.

  • Recommended environment: Use SageMaker JupyterLab within Amazon SageMaker Studio. To learn how to set it up, see Launch Amazon SageMaker Studio.

Use case 3: Train a model at scale with maximum flexibility by leveraging script mode or custom containers in SageMaker.

  • Description: Develop your own ML code and bring it as a script or a set of scripts to SageMaker. To learn more, see Distributed computing with SageMaker best practices. You can also bring your own Docker container. The SageMaker Training platform helps provision the training infrastructure and resources at scale based on your custom settings.

  • Optimized for: ML training workloads at scale that require multiple instances and maximum flexibility. See Distributed computing with SageMaker best practices. SageMaker uses Docker images to host the training and serving of all models. You can use any SageMaker or external algorithm and use Docker containers to build models.

  • Considerations: Requires knowledge of AWS infrastructure and distributed training options. See also Create your own training container using the SageMaker Training toolkit.

  • Recommended environment: Use SageMaker JupyterLab within Amazon SageMaker Studio. To learn how to set it up, see Launch Amazon SageMaker Studio.
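As a minimal sketch of use case 2, the following shows how a built-in algorithm such as XGBoost might be trained with the SageMaker Python SDK. The S3 URIs, bucket name, and IAM role ARN are placeholders you would replace with your own resources, and running the job requires AWS credentials and data already staged in Amazon S3.

```python
# Sketch: train SageMaker's built-in XGBoost algorithm with the SageMaker Python SDK.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Resolve the built-in XGBoost container image for the current region
xgb_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://your-bucket/xgb-output/",  # placeholder output location
    sagemaker_session=session,
)

# Algorithm hyperparameters and basic settings are configured through the SDK
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Launch the managed training job; SageMaker provisions the infrastructure
estimator.fit({
    "train": TrainingInput("s3://your-bucket/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://your-bucket/validation/", content_type="text/csv"),
})
```

For use case 3, the same `Estimator` pattern applies, but you would point `image_uri` at your own Docker container or use a framework estimator (such as `sagemaker.pytorch.PyTorch`) with an `entry_point` script.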

Additional options

SageMaker offers the following additional options for training ML models.

SageMaker features offering training capabilities

  • SageMaker JumpStart: SageMaker JumpStart provides access to the SageMaker public model hub, which contains the latest publicly available and proprietary foundation models (FMs). You can fine-tune, evaluate, and deploy these models within Amazon SageMaker Studio. SageMaker JumpStart streamlines the process of leveraging foundation models for your generative AI use cases. It also lets you create private model hubs to use foundation models while enforcing governance guardrails, ensuring that your organization can only access approved models. To get started with SageMaker JumpStart, see SageMaker JumpStart Foundation Models.

  • SageMaker HyperPod: SageMaker HyperPod is a persistent cluster service for use cases that need resilient clusters for massive machine learning (ML) workloads and for developing state-of-the-art foundation models (FMs). It accelerates development of such models by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators, such as AWS Trainium chips or NVIDIA A100 and H100 graphics processing units (GPUs). You can use workload manager software such as Slurm on HyperPod.

More features of SageMaker Training

  • Hyperparameter Tuning: This SageMaker feature helps you define a set of hyperparameter ranges for a model and launch many training jobs on a dataset. Model training performance can vary depending on the hyperparameter values, and this feature returns the best-performing set of hyperparameters found within the ranges you set it to search through.
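A hyperparameter tuning job can be sketched with the SageMaker Python SDK's `HyperparameterTuner` class. This assumes `estimator` is a previously configured SageMaker estimator (for example, the built-in XGBoost algorithm); the metric name, ranges, and S3 paths are illustrative placeholders.

```python
# Sketch: search over XGBoost hyperparameters with SageMaker hyperparameter tuning.
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,                     # a previously configured estimator
    objective_metric_name="validation:auc",  # metric the tuner optimizes
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,          # total training jobs to launch
    max_parallel_jobs=4,  # jobs to run concurrently
)

# Launch the tuning job (requires AWS credentials and data in S3)
tuner.fit({
    "train": "s3://your-bucket/train/",            # placeholder
    "validation": "s3://your-bucket/validation/",  # placeholder
})

# Name of the training job with the best objective metric value
print(tuner.best_training_job())
```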

  • Distributed training: Pre-train or fine-tune FMs built with PyTorch, NVIDIA CUDA, and other PyTorch-based frameworks. To efficiently utilize GPU instances, use the SageMaker distributed training libraries, which offer collective communication operations and various model parallelism techniques, such as expert parallelism and sharded data parallelism, that are optimized for AWS infrastructure.
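As a sketch, the SageMaker distributed data parallelism library can be enabled through the `distribution` argument of a framework estimator. The entry point script, role ARN, and S3 path below are placeholders, and the library requires supported GPU instance types such as `ml.p4d.24xlarge`.

```python
# Sketch: launch a multi-instance PyTorch job with SageMaker distributed data parallelism.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # your PyTorch training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    framework_version="2.0",
    py_version="py310",
    instance_count=2,                  # scale out across multiple GPU instances
    instance_type="ml.p4d.24xlarge",   # instance type supported by the library
    # Enable the SageMaker distributed data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# Launch the distributed training job (requires AWS credentials and data in S3)
estimator.fit("s3://your-bucket/train/")  # placeholder input path
```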

  • Observability features: Use the profiling and debugging functionalities of SageMaker Training to gain insights into model training workloads, model performance, and resource utilization. To learn more, see Debug and improve model performance and Profile and optimize computational performance.

  • Cost-saving and efficient instance options: To optimize compute cost and efficiency when provisioning training instances, use heterogeneous clusters, Managed Spot Training, or Managed Warm Pools.
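The Spot and Warm Pool options above map to estimator parameters in the SageMaker Python SDK; the following is a minimal sketch with placeholder image URI and role ARN. Note that a single estimator would use one option or the other: Managed Warm Pools are not combined with Spot instances.

```python
# Sketch: cost-saving estimator settings in the SageMaker Python SDK.
from sagemaker.estimator import Estimator

common = dict(
    image_uri="<training-image-uri>",  # placeholder container image
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Managed Spot Training: use spare capacity at lower cost. The job can be
# interrupted, so max_wait (>= max_run) bounds total time including waiting.
spot_estimator = Estimator(
    **common,
    use_spot_instances=True,
    max_run=3600,   # seconds of training time allowed
    max_wait=7200,  # seconds to wait for Spot capacity plus training
)

# Managed Warm Pools: keep provisioned infrastructure alive between jobs to
# reduce startup latency during iterative experimentation.
warm_estimator = Estimator(
    **common,
    keep_alive_period_in_seconds=1800,  # retain the warm pool for 30 minutes
)
```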