Autopilot supports different training modes and algorithms to address machine learning problems. It also reports on quality and objective metrics, and uses cross-validation automatically when needed.
Training modes
SageMaker Autopilot can automatically select the training method based on the dataset size, or you can select it manually. The choices are as follows:
-
Ensembling – Autopilot uses the AutoGluon library to train several base models. To find the best combination for your dataset, ensembling mode runs 10 trials with different model and meta parameter settings. Then Autopilot combines these models using a stacking ensemble method to create an optimal predictive model. For a list of algorithms that Autopilot supports in ensembling mode for tabular data, see the following Algorithms support section.
-
Hyperparameter optimization (HPO) – Autopilot finds the best version of a model by tuning hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs on your dataset. HPO mode selects the algorithms that are most relevant to your dataset and selects the best range of hyperparameters to tune your models. To tune your models, HPO mode runs up to 100 trials (default) to find the optimal hyperparameter settings within the selected range. If your dataset is smaller than 100 MB, Autopilot uses Bayesian optimization. If your dataset is larger than 100 MB, Autopilot uses multi-fidelity optimization.
In multi-fidelity optimization, metrics are continuously emitted from the training containers. A trial that is performing poorly against a selected objective metric is stopped early. A trial that is performing well is allocated more resources.
For a list of algorithms that Autopilot supports in HPO mode, see the following Algorithms support section.
-
Auto – Autopilot automatically chooses either ensembling mode or HPO mode based on your dataset size. If your dataset is larger than 100 MB, Autopilot chooses HPO. Otherwise, it chooses ensembling mode. Autopilot can fail to read the size of your dataset in the following cases:
-
You enable Virtual Private Cloud (VPC) mode for an AutoML job, but the S3 bucket containing the dataset only allows access from the VPC.
-
The input S3DataType of your dataset is a ManifestFile.
-
The input S3Uri contains more than 1000 items.
If Autopilot is unable to read your dataset size, it defaults to choosing HPO mode.
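The selection rules described above can be sketched as two small functions. This is only an illustration of the documented thresholds, not Autopilot's actual implementation: the function names are hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and a dataset of exactly 100 MB is assumed to fall on the "smaller" side because the text does not say how that case is handled.

```python
from typing import Optional

# Assumption: 1 MB = 1024 * 1024 bytes; the documentation only says "100 MB".
THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_training_mode(dataset_size_bytes: Optional[int]) -> str:
    """Mimic Auto mode: pick HPO or ensembling from the dataset size.

    Pass None when the size cannot be read (for example, a ManifestFile
    input); the documented fallback in that case is HPO.
    """
    if dataset_size_bytes is None:
        return "HPO"
    if dataset_size_bytes > THRESHOLD_BYTES:
        return "HPO"
    return "ENSEMBLING"


def choose_hpo_strategy(dataset_size_bytes: int) -> str:
    """Mimic HPO mode: Bayesian for small datasets, multi-fidelity for large."""
    if dataset_size_bytes > THRESHOLD_BYTES:
        return "multi-fidelity"
    return "bayesian"


if __name__ == "__main__":
    fifty_mb = 50 * 1024 * 1024
    print(choose_training_mode(fifty_mb))   # small dataset
    print(choose_training_mode(None))       # size could not be read
    print(choose_hpo_strategy(fifty_mb))
```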
Note
For optimal runtime and performance, use ensembling mode for datasets that are smaller than 100 MB.
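To make the early-stopping idea behind multi-fidelity optimization more concrete, the following sketch uses successive halving, a simple and well-known multi-fidelity scheme. The documentation does not say which scheduler Autopilot uses, so treat this as a stand-in under that assumption. The `successive_halving` and `toy_objective` names, the `learning_rate` parameter, and the scoring formula are all hypothetical.

```python
from typing import Callable, Dict, List

Config = Dict[str, float]


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_budget: int = 1,
    keep_fraction: float = 0.5,
) -> Config:
    """Return the configuration that survives all rounds.

    Each round scores every surviving configuration at the current budget
    (higher is better), stops the worst ones early, and doubles the budget
    for the rest.
    """
    if not configs:
        raise ValueError("at least one configuration is required")
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:keep]   # poorly performing trials are stopped here
        budget *= 2                 # well performing trials get more resources
    return survivors[0]


def toy_objective(config: Config, budget: int) -> float:
    """Hypothetical metric: best near learning_rate = 0.1, less noisy with budget."""
    return 1.0 - abs(config["learning_rate"] - 0.1) * (1.0 + 1.0 / budget)


if __name__ == "__main__":
    candidates = [{"learning_rate": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
    print(successive_halving(candidates, toy_objective))
```

In a real tuning job the budget would correspond to something like training epochs or data fraction, and the score would come from the objective metric emitted by the training container.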
Algorithms support
In HPO mode, Autopilot supports the following types of machine learning algorithms:
-
Linear learner – A supervised learning algorithm that can solve either classification or regression problems.
-
XGBoost – A supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
-
Deep learning algorithm – A multilayer perceptron (MLP), which is a feedforward artificial neural network. This algorithm can handle data that is not linearly separable.
Note
You don't need to specify an algorithm to use for your machine learning problem. Autopilot automatically selects the appropriate algorithm to train.
In ensembling mode, Autopilot supports the following types of machine learning algorithms:
-
LightGBM – An optimized framework that uses tree-based algorithms with gradient boosting. This algorithm grows trees leaf-wise (best-first) rather than level by level, and is highly optimized for speed.
-
CatBoost – A framework that uses tree-based algorithms with gradient boosting. It is optimized for handling categorical variables.
-
XGBoost – A framework that uses tree-based algorithms with gradient boosting. By default, it grows trees level by level (depth-wise) rather than leaf by leaf.
-
Random Forest – A tree-based algorithm that uses several decision trees on random sub-samples of the data with replacement. The trees are split into optimal nodes at each level. The decisions of each tree are averaged together to prevent overfitting and improve predictions.
-
Extra Trees – A tree-based algorithm that uses several decision trees on the entire dataset. The trees are split randomly at each level. The decisions of each tree are averaged to prevent overfitting and to improve predictions. Extra trees add a degree of randomization in comparison to the random forest algorithm.
-
Linear Models – A framework that uses a linear equation to model the relationship between two variables in observed data.
-
Neural network torch – A neural network model that's implemented using PyTorch.
-
Neural network fast.ai – A neural network model that's implemented using fast.ai.
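Although Autopilot selects algorithms for you, it can help to see how a training mode and an optional algorithm restriction might be expressed in a request. The sketch below only builds a dictionary and calls no AWS service. The field names (`TabularJobConfig`, `Mode`, `CandidateGenerationConfig`, `AlgorithmsConfig`, `AutoMLAlgorithms`) and the algorithm identifiers reflect my understanding of the CreateAutoMLJobV2 request and are assumptions to verify against the API reference. In particular, mapping Linear Models to `linear-learner` in ensembling mode is an assumption. The `build_problem_type_config` helper is hypothetical.

```python
from typing import Any, Dict, List, Optional

# Assumption: identifiers as understood from the AutoMLAlgorithms enum.
ALGORITHMS_BY_MODE: Dict[str, List[str]] = {
    "HYPERPARAMETER_TUNING": ["linear-learner", "xgboost", "mlp"],
    "ENSEMBLING": [
        "lightgbm", "catboost", "xgboost", "randomforest",
        "extra-trees", "linear-learner", "nn-torch", "fastai",
    ],
}
VALID_MODES = {"AUTO", "ENSEMBLING", "HYPERPARAMETER_TUNING"}


def build_problem_type_config(
    target_column: str,
    mode: str = "AUTO",
    algorithms: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a tabular problem-type config with an optional algorithm list."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    tabular: Dict[str, Any] = {"TargetAttributeName": target_column, "Mode": mode}
    if algorithms:
        supported = ALGORITHMS_BY_MODE.get(mode)
        if supported is None:
            # In this sketch, restricting algorithms requires an explicit mode.
            raise ValueError("choose ENSEMBLING or HYPERPARAMETER_TUNING to restrict algorithms")
        unsupported = sorted(set(algorithms) - set(supported))
        if unsupported:
            raise ValueError(f"{unsupported} not supported in mode {mode}")
        tabular["CandidateGenerationConfig"] = {
            "AlgorithmsConfig": [{"AutoMLAlgorithms": list(algorithms)}]
        }
    return {"TabularJobConfig": tabular}


if __name__ == "__main__":
    # "label" is a placeholder target column name.
    print(build_problem_type_config("label", "ENSEMBLING", ["lightgbm", "catboost"]))
```

A dictionary like this would typically be passed as the `AutoMLProblemTypeConfig` argument of the boto3 SageMaker client's `create_auto_ml_job_v2` call, alongside the input data, output location, and role settings that are omitted here.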