Built-in algorithms and pretrained models in Amazon SageMaker
Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners start training and deploying machine learning models quickly. If you are new to SageMaker, choosing the right algorithm for your use case can be challenging. The following table is a quick cheat sheet: start with an example problem or use case and find a built-in algorithm offered by SageMaker that is valid for that problem type. The sections following the table provide additional guidance organized by learning paradigm (supervised and unsupervised) and by important data domain (text and images).
Table: Mapping use cases to built-in algorithms
Example problems and use cases | Learning paradigm or domain | Problem types | Data input format | Built-in algorithms |
---|---|---|---|---|
Here are a few examples out of the 15 problem types that can be addressed by the pre-trained models and pre-built solution templates provided by SageMaker JumpStart. Question answering: chatbot that outputs an answer for a given question. Text analysis: analyze texts from models specific to an industry domain such as finance. | Pre-trained models and pre-built solution templates | Image Classification, Tabular Classification, Tabular Regression, Text Classification, Object Detection, Text Embedding, Question Answering, Sentence Pair Classification, Image Embedding, Named Entity Recognition, Instance Segmentation, Text Generation, Text Summarization, Semantic Segmentation, Machine Translation | Image, Text, Tabular | Popular models, including MobileNet, YOLO, Faster R-CNN, BERT, LightGBM, and CatBoost. For a list of pre-trained models available, see JumpStart Models. For a list of pre-built solution templates available, see JumpStart Solutions. |
Predict if an item belongs to a category: an email spam filter | Supervised learning | Binary/multi-class classification | Tabular | AutoGluon-Tabular, CatBoost, Factorization Machines Algorithm, K-Nearest Neighbors (k-NN) Algorithm, LightGBM, Linear Learner Algorithm, TabTransformer, XGBoost algorithm with Amazon SageMaker |
Predict a numeric/continuous value: estimate the value of a house | Supervised learning | Regression | Tabular | AutoGluon-Tabular, CatBoost, Factorization Machines Algorithm, K-Nearest Neighbors (k-NN) Algorithm, LightGBM, Linear Learner Algorithm, TabTransformer, XGBoost algorithm with Amazon SageMaker |
Based on historical data for a behavior, predict future behavior: predict sales on a new product based on previous sales data | Supervised learning | Time-series forecasting | Tabular | DeepAR Forecasting Algorithm |
Improve the data embeddings of the high-dimensional objects: identify duplicate support tickets or find the correct routing based on similarity of text in the tickets | Supervised learning | Embeddings: convert high-dimensional objects into low-dimensional space | Tabular | Object2Vec Algorithm |
Drop those columns from a dataset that have a weak relation with the label/target variable: the color of a car when predicting its mileage | Unsupervised learning | Feature engineering: dimensionality reduction | Tabular | Principal Component Analysis (PCA) Algorithm |
Detect abnormal behavior in application: spot when an IoT sensor is sending abnormal readings | Unsupervised learning | Anomaly detection | Tabular | Random Cut Forest (RCF) Algorithm |
Protect your application from suspicious users: detect if an IP address accessing a service might be from a bad actor | Unsupervised learning | IP anomaly detection | Tabular | IP Insights |
Group similar objects/data together: find high-, medium-, and low-spending customers from their transaction histories | Unsupervised learning | Clustering or grouping | Tabular | K-Means Algorithm |
Organize a set of documents into topics (not known in advance): tag a document as belonging to a medical category based on the terms used in the document | Textual analysis | Topic modeling | Text | Latent Dirichlet Allocation (LDA) Algorithm, Neural Topic Model (NTM) Algorithm |
Assign pre-defined categories to documents in a corpus: categorize books in a library into academic disciplines | Textual analysis | Text classification | Text | BlazingText algorithm, Text Classification - TensorFlow |
Convert text from one language to another: Spanish to English | Textual analysis | Machine translation | Text | Sequence-to-Sequence Algorithm |
Summarize a long text corpus: an abstract for a research paper | Textual analysis | Text summarization | Text | Sequence-to-Sequence Algorithm |
Convert audio files to text: transcribe call center conversations for further analysis | Textual analysis | Speech-to-text | Text | Sequence-to-Sequence Algorithm |
Label/tag an image based on the content of the image: alerts about adult content in an image | Image processing | Image and multi-label classification | Image | Image Classification - MXNet |
Classify something in an image using transfer learning | Image processing | Image classification | Image | Image Classification - TensorFlow |
Detect people and objects in an image: police review a large photo gallery for a missing person | Image processing | Object detection and classification | Image | Object Detection - MXNet, Object Detection - TensorFlow |
Tag every pixel of an image individually with a category: self-driving cars prepare to identify objects in their way | Image processing | Computer vision | Image | Semantic Segmentation Algorithm |
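The lookup that the table supports can be sketched as a small dictionary keyed by problem type. This is only an illustration: `ALGORITHMS_BY_PROBLEM` and `suggest_algorithms` are hypothetical names, not part of any SageMaker API, and the entries cover only a subset of problem types, taken from the algorithm descriptions on this page.

```python
# Hypothetical lookup table: problem type -> built-in algorithms described on this page.
# Neither the dictionary nor the helper below is part of the SageMaker SDK.
ALGORITHMS_BY_PROBLEM = {
    "binary/multi-class classification": [
        "AutoGluon-Tabular", "CatBoost", "Factorization Machines",
        "K-Nearest Neighbors (k-NN)", "LightGBM", "Linear Learner",
        "TabTransformer", "XGBoost",
    ],
    "time-series forecasting": ["DeepAR"],
    "embeddings": ["Object2Vec"],
    "dimensionality reduction": ["Principal Component Analysis (PCA)"],
    "anomaly detection": ["Random Cut Forest (RCF)"],
    "ip anomaly detection": ["IP Insights"],
    "clustering": ["K-Means"],
    "topic modeling": ["Latent Dirichlet Allocation (LDA)", "Neural Topic Model (NTM)"],
    "machine translation": ["Sequence-to-Sequence"],
}
# The same general-purpose algorithms apply to regression as to classification.
ALGORITHMS_BY_PROBLEM["regression"] = list(
    ALGORITHMS_BY_PROBLEM["binary/multi-class classification"]
)


def suggest_algorithms(problem_type):
    """Return candidate algorithms for a problem type, or an empty list if unknown."""
    return ALGORITHMS_BY_PROBLEM.get(problem_type.strip().lower(), [])


candidates = suggest_algorithms("Topic modeling")
```

An unknown problem type returns an empty list rather than raising, so a caller can fall back to browsing the table.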
For important information about the following items common to all of the built-in algorithms provided by SageMaker, see Parameters for Built-in Algorithms.
- Docker registry paths
- Data formats
- Recommended Amazon EC2 instance types
- CloudWatch logs
The following sections provide additional guidance for the Amazon SageMaker built-in algorithms grouped by the supervised and unsupervised learning paradigms to which they belong. For descriptions of these learning paradigms and their associated problem types, see Types of Algorithms. Sections are also provided for the SageMaker built-in algorithms available to address two important machine learning domains: textual analysis and image processing.
Pre-trained models and solution templates
SageMaker JumpStart provides a wide range of pre-trained models, pre-built solution templates, and examples for popular problem types. These use the SageMaker SDK as well as Studio Classic. For more information about these models, solutions, and the example notebooks provided by SageMaker JumpStart, see SageMaker JumpStart pretrained models.
Supervised learning
Amazon SageMaker provides several built-in general-purpose algorithms that can be used for either classification or regression problems.
- AutoGluon-Tabular: an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers.
- CatBoost: an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
- Factorization Machines Algorithm: an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.
- K-Nearest Neighbors (k-NN) Algorithm: a non-parametric method that uses the k nearest labeled points to assign a value to a new data point. For classification, the value is the most frequent label among the k nearest points. For regression, it is the average target value of the k nearest points.
- LightGBM: an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
- Linear Learner Algorithm: learns a linear function for regression or a linear threshold function for classification.
- TabTransformer: a novel deep tabular data modeling architecture built on self-attention-based Transformers.
- XGBoost algorithm with Amazon SageMaker: an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.
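To make the k-NN description above concrete, here is a minimal pure-Python sketch of the idea. It is not the SageMaker implementation; the function names and the toy data are invented for illustration, and it assumes Euclidean distance and Python 3.8 or later (for `math.dist`).

```python
from collections import Counter
from math import dist
from statistics import mean


def k_nearest_targets(train, query, k):
    """Return the targets of the k training points closest to the query."""
    ranked = sorted(train, key=lambda pair: dist(pair[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train, query, k=3):
    """Classification: most frequent label among the k nearest points."""
    return Counter(k_nearest_targets(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: average target value of the k nearest points."""
    return mean(k_nearest_targets(train, query, k))


# Toy (features, target) pairs, invented for illustration.
emails = [
    ((0.9, 0.8), "spam"), ((0.8, 0.9), "spam"),
    ((0.1, 0.2), "ham"), ((0.2, 0.1), "ham"), ((0.15, 0.15), "ham"),
]
houses = [((50.0,), 100.0), ((60.0,), 120.0), ((70.0,), 140.0), ((200.0,), 400.0)]

label = knn_classify(emails, (0.85, 0.85), k=3)
price = knn_regress(houses, (62.0,), k=3)
```

Sorting the whole training set on every query is fine for a toy example but scales poorly; production implementations typically rely on indexing or sampling instead.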
Amazon SageMaker also provides several built-in supervised learning algorithms used for more specialized tasks during feature engineering and forecasting from time series data.
- Object2Vec Algorithm: a new, highly customizable, multi-purpose algorithm used for feature engineering. It can learn low-dimensional dense embeddings of high-dimensional objects to produce features that improve training efficiencies for downstream models. Although this is a supervised algorithm that requires labeled data for training, in many scenarios the relationship labels can be obtained purely from natural clusterings in the data, without any explicit human annotation.
- DeepAR Forecasting Algorithm: a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
Unsupervised learning
Amazon SageMaker provides several built-in algorithms that can be used for a variety of unsupervised learning tasks. These tasks include clustering, dimension reduction, pattern recognition, and anomaly detection.
- Principal Component Analysis (PCA) Algorithm: reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.
- K-Means Algorithm: finds discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
- IP Insights: learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
- Random Cut Forest (RCF) Algorithm: detects anomalous data points within a dataset that diverge from otherwise well-structured or patterned data.
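The PCA description above (principal components as eigenvectors of the covariance matrix) can be illustrated with a short pure-Python sketch. It approximates only the first principal component using power iteration, which is a technique chosen here for brevity and not necessarily what the SageMaker algorithm uses. All function names and the toy data are hypothetical, and the sketch assumes the largest eigenvalue is distinct and the starting vector is not orthogonal to its eigenvector.

```python
from math import sqrt


def column_means(rows):
    """Mean of each feature column."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = column_means(rows)
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(rows, iterations=100):
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    cov = covariance_matrix(rows)
    d = len(cov)
    vec = [1.0] * d  # arbitrary non-zero starting vector (an assumption)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


def project(rows, component):
    """Reduce each centered row to a single coordinate along the component."""
    means = column_means(rows)
    return [
        sum((r[j] - means[j]) * component[j] for j in range(len(component)))
        for r in rows
    ]


# Toy data lying exactly on the line y = 2x, so one component captures all variation.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
pc1 = first_principal_component(points)
reduced = project(points, pc1)
```

Here the two original features collapse to one coordinate per point without losing any variation, because the toy data is perfectly correlated; real data would retain only part of the variation.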
Textual analysis
SageMaker provides algorithms that are tailored to the analysis of textual documents. This includes text used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
- BlazingText algorithm: a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.
- Sequence-to-Sequence Algorithm: a supervised algorithm commonly used for neural machine translation.
- Latent Dirichlet Allocation (LDA) Algorithm: an algorithm suitable for determining topics in a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with answers during training.
- Neural Topic Model (NTM) Algorithm: another unsupervised technique for determining topics in a set of documents, using a neural network approach.
- Text Classification - TensorFlow: a supervised algorithm that supports transfer learning with available pretrained models for text classification.
Image processing
SageMaker also provides image processing algorithms that are used for image classification, object detection, and computer vision.
- Image Classification - MXNet: a supervised algorithm (one that uses example data with answers) for classifying images.
- Image Classification - TensorFlow: a supervised algorithm that fine-tunes pretrained TensorFlow Hub models for specific image classification tasks.
- Semantic Segmentation Algorithm: provides a fine-grained, pixel-level approach to developing computer vision applications.
- Object Detection - MXNet: detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
- Object Detection - TensorFlow: detects bounding boxes and object labels in an image. It is a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow models.