Built-in algorithms and pretrained models in Amazon SageMaker

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners quickly get started with training and deploying machine learning models. If you are new to SageMaker, choosing the right algorithm for your use case can be challenging. The following table is a cheat sheet: start from an example problem or use case and find a built-in algorithm offered by SageMaker that suits that problem type. The sections following the table provide additional guidance organized by learning paradigm (supervised and unsupervised) and by important data domains (text and images).

Table: Mapping use cases to built-in algorithms

| Example problems and use cases | Learning paradigm or domain | Problem types | Data input format | Built-in algorithms |
|---|---|---|---|---|
| Here are a few examples out of the 15 problem types that can be addressed by the pre-trained models and pre-built solution templates provided by SageMaker JumpStart. Question answering: a chatbot that outputs an answer for a given question. Text analysis: analyze texts using models specific to an industry domain such as finance. | Pre-trained models and pre-built solution templates | Image Classification, Tabular Classification, Tabular Regression, Text Classification, Object Detection, Text Embedding, Question Answering, Sentence Pair Classification, Image Embedding, Named Entity Recognition, Instance Segmentation, Text Generation, Text Summarization, Semantic Segmentation, Machine Translation | Image, Text, Tabular | Popular models, including MobileNet, YOLO, Faster R-CNN, BERT, LightGBM, and CatBoost. For a list of pre-trained models available, see JumpStart Models. For a list of pre-built solution templates available, see JumpStart Solutions. |
| Predict if an item belongs to a category: an email spam filter | Supervised learning | Binary/multi-class classification | Tabular | AutoGluon-Tabular, CatBoost, Factorization Machines Algorithm, K-Nearest Neighbors (k-NN) Algorithm, LightGBM, Linear Learner Algorithm, TabTransformer, XGBoost algorithm with Amazon SageMaker AI |
| Predict a numeric/continuous value: estimate the value of a house | Supervised learning | Regression | Tabular | AutoGluon-Tabular, CatBoost, Factorization Machines Algorithm, K-Nearest Neighbors (k-NN) Algorithm, LightGBM, Linear Learner Algorithm, TabTransformer, XGBoost algorithm with Amazon SageMaker AI |
| Based on historical data for a behavior, predict future behavior: predict sales of a new product based on previous sales data | Supervised learning | Time-series forecasting | Tabular | SageMaker AI DeepAR forecasting algorithm |
| Improve the data embeddings of high-dimensional objects: identify duplicate support tickets or find the correct routing based on similarity of text in the tickets | Supervised learning | Embeddings: convert high-dimensional objects into low-dimensional space | Tabular | Object2Vec Algorithm |
| Drop those columns from a dataset that have a weak relation with the label/target variable: the color of a car when predicting its mileage | Unsupervised learning | Feature engineering: dimensionality reduction | Tabular | Principal Component Analysis (PCA) Algorithm |
| Detect abnormal behavior in an application: spot when an IoT sensor is sending abnormal readings | Unsupervised learning | Anomaly detection | Tabular | Random Cut Forest (RCF) Algorithm |
| Protect your application from suspicious users: detect if an IP address accessing a service might be from a bad actor | Unsupervised learning | IP anomaly detection | Tabular | IP Insights |
| Group similar objects/data together: find high-, medium-, and low-spending customers from their transaction histories | Unsupervised learning | Clustering or grouping | Tabular | K-Means Algorithm |
| Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document | Unsupervised learning | Topic modeling | Text | Latent Dirichlet Allocation (LDA) Algorithm, Neural Topic Model (NTM) Algorithm |
| Assign pre-defined categories to documents in a corpus: categorize books in a library into academic disciplines | Textual analysis | Text classification | Text | BlazingText algorithm, Text Classification - TensorFlow |
| Convert text from one language to another: Spanish to English | Textual analysis | Machine translation | Text | Sequence-to-Sequence Algorithm |
| Summarize a long text corpus: an abstract for a research paper | Textual analysis | Text summarization | Text | Sequence-to-Sequence Algorithm |
| Convert audio files to text: transcribe call center conversations for further analysis | Textual analysis | Speech-to-text | Text | Sequence-to-Sequence Algorithm |
| Label/tag an image based on the content of the image: alerts about adult content in an image | Image processing | Image and multi-label classification | Image | Image Classification - MXNet |
| Classify something in an image using transfer learning | Image processing | Image classification | Image | Image Classification - TensorFlow |
| Detect people and objects in an image: police review a large photo gallery for a missing person | Image processing | Object detection and classification | Image | Object Detection - MXNet, Object Detection - TensorFlow |
| Tag every pixel of an image individually with a category: self-driving cars prepare to identify objects in their way | Image processing | Computer vision | Image | Semantic Segmentation Algorithm |
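
The table above can be read as a simple lookup from problem type to candidate algorithms. The following is a minimal sketch in Python of that idea. The dictionary ALGORITHMS_BY_PROBLEM_TYPE and the helper suggest_algorithms are hypothetical names introduced for illustration only; they are not part of the SageMaker AI SDK, and the entries copy only some rows of the table.

```python
from typing import Dict, List

# Hypothetical lookup that copies a few rows of the table; not a SageMaker API.
_TABULAR_SUPERVISED: List[str] = [
    "AutoGluon-Tabular",
    "CatBoost",
    "Factorization Machines",
    "K-Nearest Neighbors (k-NN)",
    "LightGBM",
    "Linear Learner",
    "TabTransformer",
    "XGBoost",
]

ALGORITHMS_BY_PROBLEM_TYPE: Dict[str, List[str]] = {
    "binary/multi-class classification": _TABULAR_SUPERVISED,
    "regression": _TABULAR_SUPERVISED,
    "time-series forecasting": ["DeepAR"],
    "feature engineering: dimensionality reduction": ["Principal Component Analysis (PCA)"],
    "anomaly detection": ["Random Cut Forest (RCF)"],
    "ip anomaly detection": ["IP Insights"],
    "clustering or grouping": ["K-Means"],
    "topic modeling": ["Latent Dirichlet Allocation (LDA)", "Neural Topic Model (NTM)"],
    "text classification": ["BlazingText", "Text Classification - TensorFlow"],
    "machine translation": ["Sequence-to-Sequence"],
}


def suggest_algorithms(problem_type: str) -> List[str]:
    """Return the algorithms listed for a problem type, or an empty list if unknown."""
    key = problem_type.strip().lower()
    return list(ALGORITHMS_BY_PROBLEM_TYPE.get(key, []))


print(suggest_algorithms("Topic modeling"))
```

A lookup like this only narrows the candidates; the sections after the table describe how the listed algorithms differ.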

For important information about the following items common to all of the built-in algorithms provided by SageMaker AI, see Parameters for Built-in Algorithms.

  • Docker registry paths

  • data formats

  • recommended Amazon EC2 instance types

  • CloudWatch logs

The following sections provide additional guidance for the Amazon SageMaker AI built-in algorithms grouped by the supervised and unsupervised learning paradigms to which they belong. For descriptions of these learning paradigms and their associated problem types, see Types of Algorithms. Sections are also provided for the SageMaker AI built-in algorithms available to address two important machine learning domains: textual analysis and image processing.

Pre-trained models and solution templates

SageMaker JumpStart provides a wide range of pre-trained models, pre-built solution templates, and examples for popular problem types. These are available through the SageMaker AI SDK as well as Studio Classic. For more information about these models, solutions, and the example notebooks provided by SageMaker JumpStart, see SageMaker JumpStart pretrained models.

Supervised learning

Amazon SageMaker AI provides several built-in general purpose algorithms that can be used for either classification or regression problems.

  • AutoGluon-Tabular—an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers.

  • CatBoost—an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.

  • Factorization Machines Algorithm—an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.

  • K-Nearest Neighbors (k-NN) Algorithm—a non-parametric method that uses the k nearest labeled points to assign a value to a new data point. For classification, the value is a label, typically the most frequent label among the k nearest points. For regression, it is a predicted target value computed as the average of the k nearest points.

  • LightGBM—an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability. These two novel techniques are Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

  • Linear Learner Algorithm—learns a linear function for regression or a linear threshold function for classification.

  • TabTransformer—a novel deep tabular data modeling architecture built on self-attention-based Transformers.

  • XGBoost algorithm with Amazon SageMaker AI—an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.
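
To make the k-NN description in the list above concrete, here is a minimal sketch in plain Python. It is an illustration of the general technique, not the SageMaker AI implementation: it assumes Euclidean distance, a brute-force search, and a majority vote for classification. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def k_nearest_targets(train, query, k):
    """Return the targets of the k training points closest to query.

    train is a list of (features, target) pairs; distance is Euclidean (an assumption).
    """
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    return [target for _, target in by_distance[:k]]


def knn_classify(train, query, k=3):
    """Classification: the most frequent label among the k nearest points."""
    return Counter(k_nearest_targets(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: the average target value of the k nearest points."""
    neighbors = k_nearest_targets(train, query, k)
    return sum(neighbors) / len(neighbors)


# Toy spam-filter data: two made-up numeric features per email.
labeled_emails = [
    ((0.0, 0.0), "ham"),
    ((0.1, 0.2), "ham"),
    ((1.0, 1.0), "spam"),
    ((0.9, 1.1), "spam"),
    ((1.2, 0.8), "spam"),
]
predicted_label = knn_classify(labeled_emails, (1.0, 0.9))

# Toy house-value data: one made-up feature (floor area) and a price.
priced_houses = [
    ((50.0,), 100.0),
    ((60.0,), 120.0),
    ((80.0,), 170.0),
    ((200.0,), 400.0),
]
predicted_price = knn_regress(priced_houses, (65.0,))

print(predicted_label, predicted_price)
```

The same neighbor search serves both tasks; only the way the k targets are combined differs.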

Amazon SageMaker AI also provides several built-in supervised learning algorithms used for more specialized tasks during feature engineering and forecasting from time series data.

  • Object2Vec Algorithm—a new highly customizable multi-purpose algorithm used for feature engineering. It can learn low-dimensional dense embeddings of high-dimensional objects to produce features that improve training efficiencies for downstream models. While this is a supervised algorithm, there are many scenarios in which the relationship labels can be obtained purely from natural clusterings in data. Even though it requires labeled data for training, this can occur without any explicit human annotation.

  • SageMaker AI DeepAR forecasting algorithm—a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).

Unsupervised learning

Amazon SageMaker AI provides several built-in algorithms that can be used for a variety of unsupervised learning tasks. These tasks include clustering, dimension reduction, pattern recognition, and anomaly detection.

  • Principal Component Analysis (PCA) Algorithm—reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.

  • K-Means Algorithm—finds discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.

  • IP Insights—learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.

  • Random Cut Forest (RCF) Algorithm—detects anomalous data points within a dataset that diverge from otherwise well-structured or patterned data.
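
The PCA entry in the list above says that principal components are eigenvectors of the data's covariance matrix. The following plain-Python sketch illustrates that statement for the first component only. It uses power iteration, which is a technique chosen here for brevity and not necessarily what the SageMaker AI PCA algorithm uses; the function names and the toy data are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(centered, iterations=100):
    """Approximate the dominant eigenvector of the covariance matrix.

    Power iteration: repeatedly multiply a vector by the matrix and renormalize.
    Assumes the start vector is not orthogonal to the dominant eigenvector
    and that the data has nonzero variance.
    """
    cov = covariance(centered)
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    return vec


# Toy data: two strongly correlated features, so one component captures most variation.
points = [[2.0, 1.9], [4.0, 4.1], [6.0, 5.9], [8.0, 8.2]]
centered_points = center(points)
pc1 = first_principal_component(centered_points)

# Project each centered point onto the component: two features become one score.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered_points]
print(pc1, scores)
```

Because the two toy features move together, the component points roughly along the diagonal, and the one-dimensional scores retain most of the variation in the original two columns.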

Textual analysis

SageMaker AI provides algorithms that are tailored to the analysis of textual documents. This includes text used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.

  • BlazingText algorithm—a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.

  • Sequence-to-Sequence Algorithm—a supervised algorithm commonly used for neural machine translation.

  • Latent Dirichlet Allocation (LDA) Algorithm—an algorithm suitable for determining topics in a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with answers during training.

  • Neural Topic Model (NTM) Algorithm—another unsupervised technique for determining topics in a set of documents, using a neural network approach.

  • Text Classification - TensorFlow—a supervised algorithm that supports transfer learning with available pretrained models for text classification.

Image processing

SageMaker AI also provides image processing algorithms that are used for image classification, object detection, and computer vision.

  • Image Classification - MXNet—a supervised algorithm, meaning that it learns from example data with answers. Use this algorithm to classify images.

  • Image Classification - TensorFlow—a supervised algorithm that fine-tunes pretrained TensorFlow Hub models for specific tasks. Use this algorithm to classify images.

  • Semantic Segmentation Algorithm—provides a fine-grained, pixel-level approach to developing computer vision applications.

  • Object Detection - MXNet—detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.

  • Object Detection - TensorFlow—detects bounding boxes and object labels in an image. It is a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow models.