

# Built-in SageMaker AI Algorithms for Tabular Data
<a name="algorithms-tabular"></a>

Amazon SageMaker AI provides built-in algorithms that are tailored to the analysis of tabular data. Tabular data refers to any datasets that are organized in tables consisting of rows (observations) and columns (features). The built-in SageMaker AI algorithms for tabular data can be used for either classification or regression problems.
+ [AutoGluon-Tabular](autogluon-tabular.md)—an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers. 
+ [CatBoost](catboost.md)—an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
+ [Factorization Machines Algorithm](fact-machines.md)—an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)—a non-parametric method that uses the k nearest labeled points to assign a label to a new data point for classification or a predicted target value from the average of the k nearest points for regression.
+ [LightGBM](lightgbm.md)—an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
+ [Linear Learner Algorithm](linear-learner.md)—learns a linear function for regression or a linear threshold function for classification.
+ [TabTransformer](tabtransformer.md)—a novel deep tabular data modeling architecture built on self-attention-based Transformers. 
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)—an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes | 
| K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes | 
| LightGBM | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes | 

# AutoGluon-Tabular
<a name="autogluon-tabular"></a>

[AutoGluon-Tabular](https://auto.gluon.ai/stable/index.html) is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. This page includes information about Amazon EC2 instance recommendations and sample notebooks for AutoGluon-Tabular.

# How to use SageMaker AI AutoGluon-Tabular
<a name="autogluon-tabular-modes"></a>

You can use AutoGluon-Tabular as an Amazon SageMaker AI built-in algorithm. The following section describes how to use AutoGluon-Tabular with the SageMaker Python SDK. For information on how to use AutoGluon-Tabular from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use AutoGluon-Tabular as a built-in algorithm**

  Use the AutoGluon-Tabular built-in algorithm to build an AutoGluon-Tabular training container as shown in the following code example. You can retrieve the AutoGluon-Tabular built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the AutoGluon-Tabular image URI, you can use the AutoGluon-Tabular container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The AutoGluon-Tabular built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own AutoGluon-Tabular training scripts.

  ```
  import boto3
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Resolve the Region, execution role, and session referenced later in this example
  aws_region = boto3.Session().region_name
  aws_role = sagemaker.get_execution_role()
  sess = sagemaker.Session()
  
  train_model_id, train_model_version, train_scope = "autogluon-classification-ensemble", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "auto_stack"
  ] = "True"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the AutoGluon-Tabular as a built-in algorithm, see the following notebook examples. Any S3 bucket used in these examples must be in the same AWS Region as the notebook instance used to run them.
  + [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)
  + [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)

# Input and Output interface for the AutoGluon-Tabular algorithm
<a name="InputOutput-AutoGluon-Tabular"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of AutoGluon-Tabular supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
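
As a concrete illustration of this format, the following sketch (with hypothetical toy values) writes a headerless training CSV with the label in the first column, and an inference CSV with the label column removed:

```python
import csv

# Hypothetical toy rows: the label comes first, followed by the features,
# and no header row is written (matching the format described above).
rows = [
    (1, 5.1, 3.5, 1.4),
    (0, 4.9, 3.0, 1.3),
]
with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Inference input drops the label column entirely:
with open("inference.csv", "w", newline="") as f:
    csv.writer(f).writerows([r[1:] for r in rows])
```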

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the AutoGluon-Tabular model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix or a full S3 path pointing to one specific CSV file. The target variable should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
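
For example, the following sketch (with hypothetical column indices) writes a `categorical_index.json` file in the expected shape; in practice you would upload it to the same S3 prefix as your training CSV:

```python
import json

# Hypothetical example: suppose columns 3 and 7 of the training CSV hold
# categorical features. Column 0 is the target, so indices must be >= 1.
cat_index = {"cat_index_list": [3, 7]}

# Write the file next to the training CSV under the `training` channel prefix
with open("categorical_index.json", "w") as f:
    json.dump(cat_index, f)
```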

**Use only the `training` channel**

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains a CSV file. You can optionally include another subdirectory in the same location called `validation/` that also has a CSV file. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.
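
As a rough back-of-the-envelope check of this requirement, the sketch below compares a dataset's on-disk CSV size against a fleet's total RAM. The per-instance memory figures match the M5 family's published specs, but the simple size comparison is an illustration only, not official sizing guidance (in-memory representations are typically larger than the CSV on disk):

```python
# Approximate per-instance RAM in GiB (M5 family figures, for illustration)
INSTANCE_MEM_GIB = {"ml.m5.xlarge": 16, "ml.m5.2xlarge": 32, "ml.m5.4xlarge": 64}

def fits_in_memory(csv_bytes: int, instance_type: str, instance_count: int = 1) -> bool:
    """Return True if the dataset is smaller than the fleet's total RAM."""
    total_bytes = INSTANCE_MEM_GIB[instance_type] * instance_count * 1024**3
    return csv_bytes < total_bytes
```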

SageMaker AI AutoGluon-Tabular uses the `autogluon.tabular.TabularPredictor` module to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI AutoGluon-Tabular with the AutoGluon framework**
+ Use the following Python code:

  ```
  import tarfile
  
  from autogluon.tabular import TabularPredictor
  
  # Extract the downloaded model artifact (model.tar.gz) into the current directory
  with tarfile.open("model.tar.gz", "r:gz") as t:
      t.extractall()
  
  # Directory containing the extracted predictor; adjust to your artifact's layout
  model_file_path = "."
  model = TabularPredictor.load(model_file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the AutoGluon-Tabular algorithm
<a name="Instance-AutoGluon-Tabular"></a>

SageMaker AI AutoGluon-Tabular supports single-instance CPU and single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker AI AutoGluon-Tabular currently does not support multi-GPU training.

## AutoGluon-Tabular sample notebooks
<a name="autogluon-tabular-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI AutoGluon-Tabular algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular classification model.  | 
|  [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular regression model.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How AutoGluon-Tabular works
<a name="autogluon-tabular-HowItWorks"></a>

AutoGluon-Tabular performs advanced data processing, deep learning, and multi-layer model ensemble methods. It automatically recognizes the data type in each column for robust data preprocessing, including special handling of text fields. 

AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural networks. These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-wise manner that guarantees raw data can be translated into high-quality predictions within a given time constraint. This process mitigates overfitting by splitting the data in various ways with careful tracking of out-of-fold examples.
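
The layer-wise, out-of-fold idea can be sketched in plain Python. The "models" below are hypothetical mean predictors standing in for AutoGluon's real base learners; the point is that each row's stacked feature comes from a model that never saw that row during training:

```python
def fit_mean_model(rows):
    """Stand-in 'model': always predicts the mean label of its training rows."""
    mean = sum(y for _, y in rows) / len(rows)
    return lambda x: mean

def out_of_fold_features(rows, k=2):
    """Map each input x to a prediction from a model NOT trained on that row."""
    folds = [rows[i::k] for i in range(k)]
    stacked = {}
    for i, fold in enumerate(folds):
        # Train on every fold except the held-out one
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit_mean_model(train)
        for x, _ in fold:
            stacked[x] = model(x)
    return stacked
```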

The AutoGluon-Tabular algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions. You can use AutoGluon-Tabular for regression, classification (binary and multiclass), and ranking problems.

Refer to the following diagram illustrating how the multi-layer stacking strategy works.

![\[AutoGluon's multi-layer stacking strategy shown with two stacking layers.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/autogluon_tabular_illustration.png)


For more information, see *[AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data](https://arxiv.org/pdf/2003.06505.pdf)*.

# AutoGluon-Tabular hyperparameters
<a name="autogluon-tabular-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI AutoGluon-Tabular algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI AutoGluon-Tabular algorithm is an implementation of the open-source [AutoGluon-Tabular](https://github.com/awslabs/autogluon) package.

**Note**  
The default hyperparameters are based on example datasets in the [AutoGluon-Tabular sample notebooks](autogluon-tabular.md#autogluon-tabular-sample-notebooks).

By default, the SageMaker AI AutoGluon-Tabular algorithm automatically chooses an evaluation metric based on the type of classification problem. The algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error. For binary classification problems, the evaluation metric is area under the receiver operating characteristic curve (AUC). For multiclass classification problems, the evaluation metric is accuracy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on AutoGluon-Tabular hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| eval\_metric |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) Valid values: string, refer to the [AutoGluon documentation](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html) for valid values. Default value: `"auto"`.  | 
| presets |  List of preset configurations for various arguments in `fit()`.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, any of the following: `"best_quality"`, `"high_quality"`, `"good_quality"`, `"medium_quality"`, `"optimize_for_deployment"`, or `"interpretable"`. Default value: `"medium_quality"`.  | 
| auto\_stack |  Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy. Set `auto_stack` to `"True"` if you are willing to tolerate longer training times in order to maximize predictive accuracy. This automatically sets the `num_bag_folds` and `num_stack_levels` arguments based on dataset properties.  Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| num\_bag\_folds |  Number of folds used for bagging of models. When `num_bag_folds` is equal to `k`, training time is roughly increased by a factor of `k`. Set `num_bag_folds` to 0 to deactivate bagging. This is disabled by default, but we recommend using values between 5 and 10 to maximize predictive performance. Increasing `num_bag_folds` results in models with lower bias, but that are more prone to overfitting. One is an invalid value for this parameter and raises a `ValueError`. Values greater than 10 may produce diminishing returns and can even harm overall results due to overfitting. To further improve predictions, avoid increasing `num_bag_folds` and instead increase `num_bag_sets`. Valid values: string, any integer between (and including) `"0"` and `"10"`. Default value: `"0"`.  | 
| num\_bag\_sets |  Number of repeats of kfold bagging to perform (values must be greater than or equal to 1). The total number of models trained during bagging is equal to `num_bag_folds` \* `num_bag_sets`. This parameter defaults to one if `time_limit` is not specified. This parameter has no effect if `num_bag_folds` is not specified. Values greater than one result in superior predictive performance, especially on smaller problems and with stacking enabled.  Valid values: integer, range: [`1`, `20`]. Default value: `1`.  | 
| num\_stack\_levels |  Number of stacking levels to use in stack ensemble. Roughly increases model training time by a factor of `num_stack_levels` + 1. Set this parameter to 0 to deactivate stack ensembling. This parameter is deactivated by default, but we recommend using values between 1 and 3 to maximize predictive performance. When stack ensembling is active, `num_bag_folds` must be greater than or equal to 2 to prevent overfitting; otherwise a `ValueError` is raised. Valid values: float, range: [`0`, `3`]. Default value: `0`.  | 
| refit\_full |  Whether or not to retrain all models on all of the data (training and validation) after the normal training procedure. For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| set\_best\_to\_refit\_full |  Whether or not to change the default model that the predictor uses for prediction. If `set_best_to_refit_full` is set to `"True"`, the default model changes to the model that exhibited the highest validation score as a result of refitting (activated by `refit_full`). Only valid if `refit_full` is set. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| save\_space |  Whether or not to reduce the memory and disk size of the predictor by deleting auxiliary model files that aren't needed for prediction on new data. This has no impact on inference accuracy. We recommend setting `save_space` to `"True"` if the only goal is to use the trained model for prediction. Certain advanced functionality may no longer be available if `save_space` is set to `"True"`. Refer to the [predictor.save\_space()](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.save_space.html) documentation for more details. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| verbosity |  The verbosity of print messages. `verbosity` levels range from `0` to `4`, with higher levels corresponding to more detailed print statements. A `verbosity` of `0` suppresses warnings.  Valid values: integer, any of the following: (`0`, `1`, `2`, `3`, or `4`). Default value: `2`.  | 
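
As a quick sanity check, the bagging and stacking constraints from the table above can be collected into a small helper (a sketch restating the table's rules, not code taken from AutoGluon):

```python
def validate_ensemble_params(num_bag_folds: int, num_stack_levels: int) -> None:
    """Raise ValueError for combinations the table above marks as invalid."""
    if num_bag_folds == 1 or not 0 <= num_bag_folds <= 10:
        raise ValueError("num_bag_folds must be 0 (disabled) or an integer in [2, 10]")
    if not 0 <= num_stack_levels <= 3:
        raise ValueError("num_stack_levels must be in [0, 3]")
    if num_stack_levels > 0 and num_bag_folds < 2:
        raise ValueError("stack ensembling requires num_bag_folds >= 2")
```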

# Tuning an AutoGluon-Tabular model
<a name="autogluon-tabular-tuning"></a>

Although AutoGluon-Tabular can be used with model tuning, its design can deliver good performance using stacking and ensemble methods, meaning hyperparameter optimization is not necessary. Rather than focusing on model tuning, AutoGluon-Tabular succeeds by stacking models in multiple layers and training in a layer-wise manner. 

For more information about AutoGluon-Tabular hyperparameters, see [AutoGluon-Tabular hyperparameters](autogluon-tabular-hyperparameters.md).

# CatBoost
<a name="catboost"></a>

[CatBoost](https://catboost.ai/) is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.

CatBoost introduces two critical algorithmic advances to GBDT:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm

1. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. This page includes information about Amazon EC2 instance recommendations and sample notebooks for CatBoost.
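
The residual-fitting idea behind GBDT in general (not CatBoost's ordered variant) can be sketched in a few lines of plain Python, using depth-1 threshold "stumps" as the weak learners:

```python
def boost(xs, ys, rounds=30, lr=0.5):
    """Fit an ensemble of depth-1 'stumps' to residuals, GBDT-style.

    Returns the final ensemble predictions for the training points.
    """
    pred = [0.0] * len(ys)
    for _ in range(rounds):
        # Residuals of the current ensemble are the next learner's targets
        resid = [y - p for y, p in zip(ys, pred)]
        best = None
        for t in xs:  # try each x value as a split threshold
            left = [r for x, r in zip(xs, resid) if x <= t]
            right = [r for x, r in zip(xs, resid) if x > t]
            lmean = sum(left) / len(left) if left else 0.0
            rmean = sum(right) / len(right) if right else 0.0
            sse = sum((r - (lmean if x <= t else rmean)) ** 2
                      for x, r in zip(xs, resid))
            if best is None or sse < best[0]:
                best = (sse, t, lmean, rmean)
        _, t, lmean, rmean = best
        # Add a damped copy of the weak learner to the ensemble prediction
        pred = [p + lr * (lmean if x <= t else rmean) for x, p in zip(xs, pred)]
    return pred
```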

# How to use SageMaker AI CatBoost
<a name="catboost-modes"></a>

You can use CatBoost as an Amazon SageMaker AI built-in algorithm. The following section describes how to use CatBoost with the SageMaker Python SDK. For information on how to use CatBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use CatBoost as a built-in algorithm**

  Use the CatBoost built-in algorithm to build a CatBoost training container as shown in the following code example. You can retrieve the CatBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the CatBoost image URI, you can use the CatBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The CatBoost built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own CatBoost training scripts.

  ```
  import boto3
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Resolve the Region, execution role, and session referenced later in this example
  aws_region = boto3.Session().region_name
  aws_role = sagemaker.get_execution_role()
  sess = sagemaker.Session()
  
  train_model_id, train_model_version, train_scope = "catboost-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "iterations"
  ] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up CatBoost as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the CatBoost algorithm
<a name="InputOutput-CatBoost"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of CatBoost supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the CatBoost model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variable should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `training` or `validation` channels, the CatBoost algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.

**Use only the `training` channel**

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.
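
The expected layout can be sketched locally with `pathlib` (file names and toy contents are hypothetical; only the directory structure matters when mirrored to the S3 prefix):

```python
from pathlib import Path

# Local stand-in for the S3 prefix passed to the `training` channel
root = Path("channel")
(root / "training").mkdir(parents=True, exist_ok=True)
(root / "validation").mkdir(exist_ok=True)

# Headerless CSVs: target first, features after; column 2 is categorical here
(root / "training" / "data.csv").write_text("1,0.5,red\n0,0.1,blue\n")
(root / "validation" / "data.csv").write_text("1,0.4,red\n")
(root / "categorical_index.json").write_text('{"cat_index_list": [2]}')
```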

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI CatBoost uses the `catboost.CatBoostClassifier` and `catboost.CatBoostRegressor` modules to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI CatBoost with `catboost`**
+ Use the following Python code:

  ```
  import os
  import tarfile
  
  from catboost import CatBoostClassifier
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  model_file_path = '.'  # directory into which the archive was extracted
  file_path = os.path.join(model_file_path, "model")
  model = CatBoostClassifier()
  model.load_model(file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the CatBoost algorithm
<a name="Instance-CatBoost"></a>

SageMaker AI CatBoost currently only trains using CPUs. CatBoost is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 
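The memory requirement above can be roughly checked before launching a training job. The instance memory figures in this sketch are illustrative assumptions; confirm them against the current Amazon EC2 instance specifications.

```python
# Rough feasibility check: CSV training requires the dataset to fit in the
# fleet's combined memory. The memory figures below are assumptions for
# illustration only.
INSTANCE_MEMORY_GIB = {"ml.m5.xlarge": 16, "ml.m5.2xlarge": 32, "ml.m5.4xlarge": 64}

def fits_in_memory(dataset_gib, instance_type, instance_count=1, headroom=0.8):
    """Return True if the dataset fits in the headroom-discounted total memory."""
    total = INSTANCE_MEMORY_GIB[instance_type] * instance_count * headroom
    return dataset_gib <= total

print(fits_in_memory(20, "ml.m5.2xlarge"))     # 20 GiB vs ~25.6 GiB usable
print(fits_in_memory(60, "ml.m5.2xlarge", 2))  # 60 GiB vs ~51.2 GiB usable
```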

## CatBoost sample notebooks
<a name="catboost-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI CatBoost algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI CatBoost algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI CatBoost algorithm to train and host a tabular regression model.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How CatBoost Works
<a name="catboost-HowItWorks"></a>

CatBoost implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two critical algorithmic advances:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm

1. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.

The CatBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use CatBoost for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the ordered boosting and categorical feature processing techniques used in the CatBoost method, see *[CatBoost: unbiased boosting with categorical features](https://arxiv.org/pdf/1706.09516.pdf)*.

# CatBoost hyperparameters
<a name="catboost-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI CatBoost algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI CatBoost algorithm is an implementation of the open-source [CatBoost](https://github.com/catboost/catboost) package.

**Note**  
The default hyperparameters are based on example datasets in the [CatBoost sample notebooks](catboost.md#catboost-sample-notebooks).

By default, the SageMaker AI CatBoost algorithm automatically chooses an evaluation metric and loss function based on the type of classification problem. The CatBoost algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric and loss function are both root mean squared error. For binary classification problems, the evaluation metric is Area Under the Curve (AUC) and the loss function is log loss. For multiclass classification problems, the evaluation metric and loss function are both multiclass cross entropy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on CatBoost hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| iterations |  The maximum number of trees that can be built. Valid values: integer, range: Positive integer. Default value: `500`.  | 
| early\_stopping\_rounds |  Training stops if one metric of one validation data point does not improve in the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `5`.  | 
| eval\_metric |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/catboost-hyperparameters.html) Valid values: string, refer to the [CatBoost documentation](https://catboost.ai/en/docs/references/eval-metric__supported-metrics) for valid values. Default value: `"auto"`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.009`.  | 
| depth |  Depth of the tree. Valid values: integer, range: (`1`, `16`). Default value: `6`.  | 
| l2\_leaf\_reg |  Coefficient for the L2 regularization term of the cost function. Valid values: integer, range: Positive integer. Default value: `3`.  | 
| random\_strength |  The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. Valid values: float, range: Positive floating point number. Default value: `1.0`.  | 
| max\_leaves |  The maximum number of leaves in the resulting tree. Can only be used with the `"Lossguide"` growing policy. Valid values: integer, range: [`2`, `64`]. Default value: `31`.  | 
| rsm |  Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| sampling\_frequency |  Frequency to sample weights and objects when building trees. Valid values: string, either `"PerTreeLevel"` or `"PerTree"`. Default value: `"PerTreeLevel"`.  | 
| min\_data\_in\_leaf |  The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with a sample count less than the specified value. Can only be used with the `"Lossguide"` and `"Depthwise"` growing policies. Valid values: integer, range: [`1`, ∞). Default value: `1`.  | 
| bagging\_temperature |  Defines the settings of the Bayesian bootstrap. Use the Bayesian bootstrap to assign random weights to objects. If `bagging_temperature` is set to `1.0`, then the weights are sampled from an exponential distribution. If `bagging_temperature` is set to `0.0`, then all weights are 1.0. Valid values: float, range: Non-negative float. Default value: `1.0`.  | 
| boosting\_type |  The boosting scheme. `"Auto"` means that the `boosting_type` is selected based on the processing unit type, the number of objects in the training dataset, and the selected learning mode. Valid values: string, any of the following: `"Auto"`, `"Ordered"`, or `"Plain"`. Default value: `"Auto"`.  | 
| scale\_pos\_weight |  The weight for the positive class in binary classification. The value is used as a multiplier for the weights of objects from the positive class. Valid values: float, range: Positive float. Default value: `1.0`.  | 
| max\_bin |  The number of splits for numerical features. `"Auto"` means that `max_bin` is selected based on the processing unit type and other parameters. For details, see the CatBoost documentation. Valid values: string, either `"Auto"` or a string of an integer from `"1"` to `"65535"` inclusive. Default value: `"Auto"`.  | 
| grow\_policy |  The tree growing policy. Defines how to perform greedy tree construction. Valid values: string, any of the following: `"SymmetricTree"`, `"Depthwise"`, or `"Lossguide"`. Default value: `"SymmetricTree"`.  | 
| random\_seed |  The random seed used for training. Valid values: integer, range: Non-negative integer. Default value: `1`. | 
| thread\_count |  The number of threads to use during training. If `thread_count` is `-1`, then the number of threads is equal to the number of processor cores. `thread_count` cannot be `0`. Valid values: integer, either `-1` or a positive integer. Default value: `-1`.  | 
| verbose |  The verbosity of print messages, with higher levels corresponding to more detailed print statements. Valid values: integer, range: Positive integer. Default value: `1`.  | 
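When training through SageMaker AI, these hyperparameters are typically passed as strings. The following is a hedged sketch of a configuration dictionary mirroring the table's defaults; the estimator that would consume it is omitted, and nothing here is a required shape.

```python
# Illustrative hyperparameter dictionary matching the documented defaults.
hyperparameters = {
    "iterations": "500",
    "early_stopping_rounds": "5",
    "eval_metric": "auto",
    "learning_rate": "0.009",
    "depth": "6",
    "l2_leaf_reg": "3",
    "grow_policy": "SymmetricTree",
}

# Basic sanity checks against the documented valid ranges.
assert 0.0 < float(hyperparameters["learning_rate"]) < 1.0
assert 1 < int(hyperparameters["depth"]) < 16
```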

# Tune a CatBoost model
<a name="catboost-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters:

**Note**  
The learning loss function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [CatBoost hyperparameters](catboost-hyperparameters.md).
+ A learning loss function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your chosen hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for CatBoost is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the CatBoost algorithm
<a name="catboost-metrics"></a>

The SageMaker AI CatBoost algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.
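The assignment rule can be sketched as follows. This is a simplified illustration of the documented behavior, assuming numeric labels; the actual detection logic is internal to the algorithm.

```python
def detect_task(labels):
    """Simplified mirror of the documented rule: the number of unique
    labels in the label column determines the task and its metric."""
    unique = set(labels)
    if any(not float(v).is_integer() for v in unique):
        return "regression"          # RMSE
    if len(unique) == 2:
        return "binary"              # AUC / log loss
    if len(unique) > 2:
        return "multiclass"          # multiclass cross entropy
    return "regression"              # degenerate single-label case
```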


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| RMSE | root mean square error | minimize | "bestTest = ([0-9\\.]+)" | 
| MAE | mean absolute error | minimize | "bestTest = ([0-9\\.]+)" | 
| MedianAbsoluteError | median absolute error | minimize | "bestTest = ([0-9\\.]+)" | 
| R2 | r2 score | maximize | "bestTest = ([0-9\\.]+)" | 
| Logloss | binary cross entropy | minimize | "bestTest = ([0-9\\.]+)" | 
| Precision | precision | maximize | "bestTest = ([0-9\\.]+)" | 
| Recall | recall | maximize | "bestTest = ([0-9\\.]+)" | 
| F1 | f1 score | maximize | "bestTest = ([0-9\\.]+)" | 
| AUC | auc score | maximize | "bestTest = ([0-9\\.]+)" | 
| MultiClass | multiclass cross entropy | minimize | "bestTest = ([0-9\\.]+)" | 
| Accuracy | accuracy | maximize | "bestTest = ([0-9\\.]+)" | 
| BalancedAccuracy | balanced accuracy | maximize | "bestTest = ([0-9\\.]+)" | 

## Tunable CatBoost hyperparameters
<a name="catboost-tunable-hyperparameters"></a>

Tune the CatBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the CatBoost evaluation metrics are: `learning_rate`, `depth`, `l2_leaf_reg`, and `random_strength`. For a list of all the CatBoost hyperparameters, see [CatBoost hyperparameters](catboost-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| depth | IntegerParameterRanges | MinValue: 4, MaxValue: 10 | 
| l2\_leaf\_reg | IntegerParameterRanges | MinValue: 2, MaxValue: 10 | 
| random\_strength | ContinuousParameterRanges | MinValue: 0, MaxValue: 10 | 
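The recommended ranges above can be expressed as plain data and sampled from, which is roughly what a tuning job does internally. This sketch deliberately avoids the SageMaker SDK; the names and structure are illustrative only.

```python
import random

# Recommended search ranges from the table above.
RANGES = {
    "learning_rate": ("continuous", 0.001, 0.01),
    "depth": ("integer", 4, 10),
    "l2_leaf_reg": ("integer", 2, 10),
    "random_strength": ("continuous", 0.0, 10.0),
}

def sample_config(rng=random):
    """Draw one hyperparameter configuration from the recommended ranges."""
    config = {}
    for name, (kind, low, high) in RANGES.items():
        if kind == "integer":
            config[name] = rng.randint(low, high)
        else:
            config[name] = rng.uniform(low, high)
    return config
```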

# Factorization Machines Algorithm
<a name="fact-machines"></a>

The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high-dimensional sparse datasets, such as click prediction and item recommendation.

**Note**  
The Amazon SageMaker AI implementation of the Factorization Machines algorithm considers only pair-wise (2nd order) interactions between features.

**Topics**
+ [Input/Output Interface for the Factorization Machines Algorithm](#fm-inputoutput)
+ [EC2 Instance Recommendation for the Factorization Machines Algorithm](#fm-instances)
+ [Factorization Machines Sample Notebooks](#fm-sample-notebooks)
+ [How Factorization Machines Work](fact-machines-howitworks.md)
+ [Factorization Machines Hyperparameters](fact-machines-hyperparameters.md)
+ [Tune a Factorization Machines Model](fm-tuning.md)
+ [Factorization Machines Response Formats](fm-in-formats.md)

## Input/Output Interface for the Factorization Machines Algorithm
<a name="fm-inputoutput"></a>

The Factorization Machines algorithm can be run in either binary classification mode or regression mode. In each mode, a dataset can be provided to the **test** channel along with the train channel dataset. The scoring depends on the mode used. In regression mode, the testing dataset is scored using Root Mean Square Error (RMSE). In binary classification mode, the test dataset is scored using Binary Cross Entropy (Log Loss), Accuracy (at threshold = 0.5), and F1 Score (at threshold = 0.5).

For **training**, the Factorization Machines algorithm currently supports only the `recordIO-protobuf` format with `Float32` tensors. Because the algorithm's use case is predominantly sparse data, `CSV` is not a good candidate. Both File and Pipe mode training are supported for recordIO-wrapped protobuf.

For **inference**, the Factorization Machines algorithm supports the `application/json` and `x-recordio-protobuf` formats. 
+ For the **binary classification** problem, the algorithm predicts a score and a label. The label is a number and can be either `0` or `1`. The score is a number that indicates how strongly the algorithm believes that the label should be `1`. The algorithm computes the score first and then derives the label from it. If the score is greater than or equal to 0.5, the label is `1`.
+ For the **regression** problem, just a score is returned and it is the predicted value. For example, if Factorization Machines is used to predict a movie rating, score is the predicted rating value.
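The score-to-label rule for binary classification described above can be written down directly:

```python
def to_label(score, threshold=0.5):
    """Derive the binary label from the score: 1 if the score is greater
    than or equal to the threshold, otherwise 0."""
    return 1 if score >= threshold else 0
```

For example, a score of 0.4 yields label 0, while a score of 0.5 yields label 1.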

See [Factorization Machines Sample Notebooks](#fm-sample-notebooks) for more details on training and inference file formats.

## EC2 Instance Recommendation for the Factorization Machines Algorithm
<a name="fm-instances"></a>

The Amazon SageMaker AI Factorization Machines algorithm is highly scalable and can train across distributed instances. We recommend training and inference with CPU instances for both sparse and dense datasets. In some circumstances, training with one or more GPUs on dense data might provide some benefit. Training with GPUs is available only on dense data. Use CPU instances for sparse data. The Factorization Machines algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

## Factorization Machines Sample Notebooks
<a name="fm-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI Factorization Machines algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to Factorization Machines with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/factorization_machines_mnist/factorization_machines_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. Example notebooks that use the Factorization Machines algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Factorization Machines Work
<a name="fact-machines-howitworks"></a>

The prediction task for a Factorization Machines model is to estimate a function ŷ from a feature set xi to a target domain. This domain is real-valued for regression and binary for classification. The Factorization Machines model is supervised and so has a training dataset (xi, yi) available. The advantages this model presents lie in the way it uses a factorized parametrization to capture the pairwise feature interactions. It can be represented mathematically as follows: 

![\[An image containing the equation for the Factorization Machines model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM1.jpg)


 The three terms in this equation correspond respectively to the three components of the model: 
+ The w0 term represents the global bias.
+ The wi linear terms model the strength of the ith variable.
+ The <vi,vj> factorization terms model the pairwise interaction between the ith and jth variable.

The global bias and linear terms are the same as in a linear model. The pairwise feature interactions are modeled in the third term as the inner product of the corresponding factors learned for each feature. Learned factors can also be considered as embedding vectors for each feature. For example, in a classification task, if a pair of features tends to co-occur more often in positive labeled samples, then the inner product of their factors would be large. In other words, their embedding vectors would be close to each other in cosine similarity. For more information about the Factorization Machines model, see [Factorization Machines](https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf).
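The model equation above can be sketched in NumPy, using the standard rewriting that evaluates the pairwise term in O(dk) rather than O(d²k). The variable names here are illustrative, not part of any SageMaker API.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machines score: global bias + linear terms + pairwise
    interactions, where the interaction weight for (i, j) is <v_i, v_j>.

    x: (d,) feature vector; w0: scalar bias; w: (d,) linear weights;
    V: (d, k) matrix of k-dimensional factor (embedding) vectors.
    """
    linear = w0 + w @ x
    # Efficient form of sum_{i<j} <v_i, v_j> x_i x_j:
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions
```

The rewriting works because squaring the per-factor sum produces all ordered pairs, so subtracting the diagonal and halving leaves exactly the i < j terms.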

For regression tasks, the model is trained by minimizing the squared error between the model prediction ŷn and the target value yn. This is known as the square loss:

![\[An image containing the equation for square loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM2.jpg)


For a classification task, the model is trained by minimizing the cross entropy loss, also known as the log loss: 

![\[An image containing the equation for log loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM3.jpg)


where: 

![\[An image containing the logistic function of the predicted values.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM4.jpg)


For more information about loss functions for classification, see [Loss functions for classification](https://en.wikipedia.org/wiki/Loss_functions_for_classification).
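In code, one common parameterization of this loss (assuming labels in {0, 1} and applying the logistic function to the raw model score, as in the equations above) looks like:

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, raw_score):
    """Cross entropy (log loss) for a single example with label y in {0, 1}."""
    p = sigmoid(raw_score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```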

# Factorization Machines Hyperparameters
<a name="fact-machines-hyperparameters"></a>

The following table contains the hyperparameters for the Factorization Machines algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order.


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim | The dimension of the input feature space. This could be very high with sparse input. **Required** Valid values: Positive integer. Suggested value range: [10000, 10000000]  | 
| num\_factors | The dimensionality of factorization. **Required** Valid values: Positive integer. Suggested value range: [2, 1000]. 64 typically generates good outcomes and is a good starting point.  | 
| predictor\_type | The type of predictor. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Required** Valid values: String: `binary_classifier` or `regressor`  | 
| bias\_init\_method | The initialization method for the bias term: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant` Default value: `normal`  | 
| bias\_init\_scale | Range for initialization of the bias term. Takes effect if `bias_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| bias\_init\_sigma | The standard deviation for initialization of the bias term. Takes effect if `bias_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| bias\_init\_value | The initial value of the bias term. Takes effect if `bias_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| bias\_lr | The learning rate for the bias term.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.1  | 
| bias\_wd | The weight decay for the bias term.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| clip\_gradient | Gradient clipping optimizer parameter. Clips the gradient by projecting onto the interval [-`clip_gradient`, +`clip_gradient`].  **Optional** Valid values: Float Default value: None  | 
| epochs | The number of training epochs to run.  **Optional** Valid values: Positive integer Default value: 1  | 
| eps | Epsilon parameter to avoid division by 0. **Optional** Valid values: Float. Suggested value: small. Default value: None  | 
| factors\_init\_method | The initialization method for factorization terms: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant`. Default value: `normal`  | 
| factors\_init\_scale  | The range for initialization of factorization terms. Takes effect if `factors_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| factors\_init\_sigma | The standard deviation for initialization of factorization terms. Takes effect if `factors_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| factors\_init\_value | The initial value of factorization terms. Takes effect if `factors_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| factors\_lr | The learning rate for factorization terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.0001  | 
| factors\_wd | The weight decay for factorization terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.00001  | 
| linear\_lr | The learning rate for linear terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| linear\_init\_method | The initialization method for linear terms: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant`. Default value: `normal`  | 
| linear\_init\_scale | Range for initialization of linear terms. Takes effect if `linear_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| linear\_init\_sigma | The standard deviation for initialization of linear terms. Takes effect if `linear_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| linear\_init\_value | The initial value of linear terms. Takes effect if `linear_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| linear\_wd | The weight decay for linear terms. **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| mini\_batch\_size | The size of the mini-batch used for training.  **Optional** Valid values: Positive integer Default value: 1000  | 
| rescale\_grad |  Gradient rescaling optimizer parameter. If set, multiplies the gradient by `rescale_grad` before updating. Often set to 1.0/`batch_size`.  **Optional** Valid values: Float Default value: None  | 

# Tune a Factorization Machines Model
<a name="fm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Factorization Machines Algorithm
<a name="fm-metrics"></a>

The Factorization Machines algorithm has both binary classification and regression predictor types. The predictor type determines which metric you can use for automatic model tuning. The algorithm reports a `test:rmse` regressor metric, which is computed during training. When tuning the model for regression tasks, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:rmse | Root Mean Square Error | Minimize | 

The Factorization Machines algorithm reports three binary classification metrics, which are computed during training. When tuning the model for binary classification tasks, choose one of these as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:binary\_classification\_accuracy | Accuracy | Maximize | 
| test:binary\_classification\_cross\_entropy | Cross Entropy | Minimize | 
| test:binary\_f\_beta | F-beta score | Maximize | 

## Tunable Factorization Machines Hyperparameters
<a name="fm-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the Factorization Machines algorithm. The initialization parameters that contain the terms bias, linear, and factorization depend on their initialization method. There are three initialization methods: `uniform`, `normal`, and `constant`. These initialization methods are not themselves tunable. The parameters that are tunable are dependent on this choice of the initialization method. For example, if the initialization method is `uniform`, then only the `scale` parameters are tunable. Specifically, if `bias_init_method==uniform`, then `bias_init_scale`, `linear_init_scale`, and `factors_init_scale` are tunable. Similarly, if the initialization method is `normal`, then only `sigma` parameters are tunable. If the initialization method is `constant`, then only `value` parameters are tunable. These dependencies are listed in the following table. 


| Parameter Name | Parameter Type | Recommended Ranges | Dependency | 
| --- | --- | --- | --- | 
| bias\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| bias\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| bias\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| bias\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| bias\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| epochs | IntegerParameterRange | MinValue: 1, MaxValue: 1000 | None | 
| factors\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| factors\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| factors\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| factors\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| factors\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| linear\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| linear\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| linear\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| linear\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| linear\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| mini\_batch\_size | IntegerParameterRange | MinValue: 100, MaxValue: 10000 | None | 
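The dependency rule in the table can be summarized as plain data. This sketch simply encodes the mapping described above; it is not part of any SageMaker API.

```python
# Which initialization hyperparameters become tunable, keyed by the
# chosen initialization method (per the table above).
TUNABLE_BY_METHOD = {
    "uniform": ["bias_init_scale", "linear_init_scale", "factors_init_scale"],
    "normal": ["bias_init_sigma", "linear_init_sigma", "factors_init_sigma"],
    "constant": ["bias_init_value", "linear_init_value", "factors_init_value"],
}

def tunable_init_params(bias_init_method):
    """Return the initialization parameters that are tunable for a method."""
    return TUNABLE_BY_METHOD[bias_init_method]
```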

# Factorization Machines Response Formats
<a name="fm-in-formats"></a>

Amazon SageMaker AI provides several response formats for getting inference from the Factorization Machines model, such as JSON, JSONLINES, and RECORDIO, with specific structures for binary classification and regression tasks.

## JSON Response Format
<a name="fm-json"></a>

Binary classification

```
let response =   {
    "predictions":    [
        {
            "score": 0.4,
            "predicted_label": 0
        } 
    ]
}
```

Regression

```
let response =   {
    "predictions":    [
        {
            "score": 0.4
        } 
    ]
}
```

## JSONLINES Response Format
<a name="fm-jsonlines"></a>

Binary classification

```
{"score": 0.4, "predicted_label": 0}
```

Regression

```
{"score": 0.4}
```

## RECORDIO Response Format
<a name="fm-recordio"></a>

Binary classification

```
[
    Record = {
        features = {},
        label = {
        'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
```

Regression

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }   
        }
    }
]
```

# K-Nearest Neighbors (k-NN) Algorithm
<a name="k-nearest-neighbors"></a>

The Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the *k* points that are closest to the sample point and returns the most frequent label among them as the predicted label. For regression problems, the algorithm queries the *k* closest points to the sample point and returns the average of their label values as the predicted value. 

Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the memory footprint of the k-NN model and the inference latency. Two dimension reduction methods are provided: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d > 1000) datasets to avoid the “curse of dimensionality,” which troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN training is to construct the index. The index enables efficient lookups of distances between points whose values or class labels have not yet been determined and the *k* nearest points to use for inference.

**Topics**
+ [Input/Output Interface for the k-NN Algorithm](#kNN-input_output)
+ [k-NN Sample Notebooks](#kNN-sample-notebooks)
+ [How the k-NN Algorithm Works](kNN_how-it-works.md)
+ [EC2 Instance Recommendation for the k-NN Algorithm](#kNN-instances)
+ [k-NN Hyperparameters](kNN_hyperparameters.md)
+ [Tune a k-NN Model](kNN-tuning.md)
+ [Data Formats for k-NN Training Input](kNN-in-formats.md)
+ [k-NN Request and Response Formats](kNN-inference-formats.md)

## Input/Output Interface for the k-NN Algorithm
<a name="kNN-input_output"></a>

SageMaker AI k-NN supports train and test data channels.
+ Use a *train channel* for data that you want to sample and construct into the k-NN index.
+ Use a *test channel* to emit scores in log files. Scores are listed as one line per mini-batch: accuracy for `classifier`, mean squared error (MSE) for `regressor`.

For training inputs, k-NN supports `text/csv` and `application/x-recordio-protobuf` data formats. For input type `text/csv`, the first `label_size` columns are interpreted as the label vector for that row. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference inputs, k-NN supports the `application/json`, `application/x-recordio-protobuf`, and `text/csv` data formats. The `text/csv` format accepts `label_size` and `encoding` parameters; by default, it assumes a `label_size` of 0 and UTF-8 encoding.

For inference outputs, k-NN supports the `application/json` and `application/x-recordio-protobuf` data formats. These two data formats also support a verbose output mode. In verbose output mode, the API provides the search results with the distances vector sorted from smallest to largest, and corresponding elements in the labels vector.

For batch transform, k-NN supports the `application/jsonlines` data format for both input and output. An example input is as follows:

```
content-type: application/jsonlines

{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}
```

An example output is as follows:

```
accept: application/jsonlines

{"predicted_label": 0.0}
{"predicted_label": 2.0}
```

For more information on input and output file formats, see [Data Formats for k-NN Training Input](kNN-in-formats.md) for training, [k-NN Request and Response Formats](kNN-inference-formats.md) for inference, and the [k-NN Sample Notebooks](#kNN-sample-notebooks).

## k-NN Sample Notebooks
<a name="kNN-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI k-nearest neighbor algorithm to predict wilderness cover types from geological and forest service data, see the [K-Nearest Neighbor Covertype](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/k_nearest_neighbors_covtype/k_nearest_neighbors_covtype.html) notebook. 

Use a Jupyter notebook instance to run the example in SageMaker AI. To learn how to create and open a Jupyter notebook instance in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI example notebooks. Find the K-Nearest Neighbor notebooks in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# How the k-NN Algorithm Works
<a name="kNN_how-it-works"></a>

The Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm follows a multi-step training process which includes sampling the input data, performing dimension reduction, and building an index. The indexed data is then used during inference to efficiently find the k-nearest neighbors for a given data point and make predictions based on the neighboring labels or values.

## Step 1: Sample
<a name="step1-k-NN-sampling"></a>

To specify the total number of data points to be sampled from the training dataset, use the `sample_size` parameter. For example, if the initial dataset has 1,000 data points and `sample_size` is set to 100 with a total of 2 instances, each worker samples 50 points, for a collected total of 100 data points. Sampling runs in linear time with respect to the number of data points. 
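
That arithmetic can be sketched as follows; the even per-worker split and the seeding scheme are illustrative assumptions, not the algorithm's internal implementation:

```python
import random

def per_worker_sample(dataset, sample_size, instance_count, worker_id, seed=0):
    """Illustrative sketch: each worker draws an equal share of sample_size."""
    share = sample_size // instance_count
    rng = random.Random(seed + worker_id)
    return rng.sample(dataset, share)

dataset = list(range(1000))                        # 1,000 data points
worker_0 = per_worker_sample(dataset, 100, 2, 0)   # 50 points
worker_1 = per_worker_sample(dataset, 100, 2, 1)   # 50 points
total_collected = len(worker_0) + len(worker_1)    # 100 points in total
```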

## Step 2: Perform Dimension Reduction
<a name="step2-kNN-dim-reduction"></a>

The current implementation of the k-NN algorithm has two methods of dimension reduction. You specify the method in the `dimension_reduction_type` hyperparameter. The `sign` method specifies a random projection, which uses a linear projection with a matrix of random signs, and the `fjlt` method specifies a fast Johnson-Lindenstrauss transform, a method based on the Fourier transform. Both methods preserve the L2 and inner product distances. The `fjlt` method should be used when the target dimension is large; it has better performance with CPU inference. The methods differ in their computational complexity. The `sign` method requires O(ndk) time to reduce the dimension of a batch of n points of dimension d into a target dimension k. The `fjlt` method requires O(nd log(d)) time, but the constants involved are larger. Using dimension reduction introduces noise into the data, and this noise can reduce prediction accuracy.
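
As an illustration of the `sign` method, the following numpy sketch projects a batch of d-dimensional points onto k dimensions with a random ±1 matrix. The 1/sqrt(k) scaling is a common normalization choice for random projections, not necessarily the exact constant used internally:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 10, 5                            # input dim, target dim, batch size

# Random sign matrix: each entry is +1 or -1
R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)

X = rng.normal(size=(n, d))                    # a batch of n points
X_reduced = X @ R                              # O(ndk) matrix multiply, shape (n, k)
```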

## Step 3: Build an Index
<a name="step3-kNN-build-index"></a>

During inference, the algorithm queries the index for the k-nearest-neighbors of a sample point. Based on the references to the points, the algorithm makes the classification or regression prediction. It makes the prediction based on the class labels or values provided. k-NN provides three different types of indexes: a flat index, an inverted index, and an inverted index with product quantization. You specify the type with the `index_type` parameter.
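
To illustrate what a flat index computes, the following numpy sketch performs the same exact brute-force search; it is a simplified stand-in for illustration, not the Faiss implementation the algorithm actually uses:

```python
import numpy as np

def flat_index_query(index_points, labels, query, k):
    """Brute-force k-NN lookup: exact distances, as a flat index computes."""
    dists = np.linalg.norm(index_points - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Classification: majority vote over the k nearest labels
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

points = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])
labels = np.array([0.0, 0.0, 1.0])
pred = flat_index_query(points, labels, np.array([0.05, 0.05]), k=2)
```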

## Serialize the Model
<a name="kNN-model-serialization"></a>

When the k-NN algorithm finishes training, it serializes three files to prepare for inference. 
+ model\_algo-1: Contains the serialized index for computing the nearest neighbors.
+ model\_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted label based on the query result from the index.
+ model\_algo-1.json: Contains the JSON-formatted model metadata, which stores the `k` and `predictor_type` hyperparameters from training, along with other relevant state, for inference.

With the current implementation of k-NN, you can modify the metadata file to change the way predictions are computed. For example, you can change `k` to 10 or change `predictor_type` to *regressor*.

```
{
  "k": 5,
  "predictor_type": "classifier",
  "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
  "normalize": false,
  "version": "1.0"
}
```
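
For instance, the following sketch applies those two changes to the metadata shown above; in practice you would read and write the `model_algo-1.json` file extracted from the model artifact:

```python
import json

# The metadata JSON shown above, as extracted from model_algo-1.json
metadata_text = """
{
  "k": 5,
  "predictor_type": "classifier",
  "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
  "normalize": false,
  "version": "1.0"
}
"""

metadata = json.loads(metadata_text)
metadata["k"] = 10                        # use 10 neighbors instead of 5
metadata["predictor_type"] = "regressor"  # switch predictions to regression

updated = json.dumps(metadata, indent=2)  # write this back to model_algo-1.json
```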

## EC2 Instance Recommendation for the k-NN Algorithm
<a name="kNN-instances"></a>

We recommend training on a CPU instance (such as ml.m5.2xlarge) or on a GPU instance. The k-NN algorithm supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

Inference requests from CPUs generally have a lower average latency than requests from GPUs because of the overhead of CPU-to-GPU communication. However, GPUs generally have higher throughput for larger batches.

# k-NN Hyperparameters
<a name="kNN_hyperparameters"></a>

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  The number of features in the input data. **Required** Valid values: positive integer.  | 
| k |  The number of nearest neighbors. **Required** Valid values: positive integer  | 
| predictor\_type |  The type of inference to use on the data labels. **Required** Valid values: *classifier* for classification or *regressor* for regression.  | 
| sample\_size |  The number of data points to be sampled from the training data set.  **Required** Valid values: positive integer  | 
| dimension\_reduction\_target |  The target dimension to reduce to. **Required** when you specify the `dimension_reduction_type` parameter. Valid values: positive integer greater than 0 and less than `feature_dim`.  | 
| dimension\_reduction\_type |  The type of dimension reduction method.  **Optional** Valid values: *sign* for random projection or *fjlt* for the fast Johnson-Lindenstrauss transform. Default value: No dimension reduction  | 
| faiss\_index\_ivf\_nlists |  The number of centroids to construct in the index when `index_type` is *faiss.IVFFlat* or *faiss.IVFPQ*. **Optional** Valid values: positive integer Default value: *auto*, which resolves to `sqrt(sample_size)`.  | 
| faiss\_index\_pq\_m |  The number of vector sub-components to construct in the index when `index_type` is set to *faiss.IVFPQ*.  The Facebook AI Similarity Search (Faiss) library requires that the value of `faiss_index_pq_m` is a divisor of the data dimension. If `faiss_index_pq_m` is not a divisor of the data dimension, we increase the data dimension to the smallest integer divisible by `faiss_index_pq_m`. If no dimension reduction is applied, the algorithm adds a padding of zeros. If dimension reduction is applied, the algorithm increases the value of the `dimension_reduction_target` hyperparameter. **Optional** Valid values: One of the following positive integers: 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64, 96  | 
| index\_metric |  The metric to measure the distance between points when finding nearest neighbors. When training with `index_type` set to `faiss.IVFPQ`, the `INNER_PRODUCT` distance and `COSINE` similarity are not supported. **Optional**  Valid values: *L2* for Euclidean distance, *INNER\_PRODUCT* for inner-product distance, *COSINE* for cosine similarity. Default value: *L2*  | 
| index\_type |  The type of index. **Optional** Valid values: *faiss.Flat*, *faiss.IVFFlat*, *faiss.IVFPQ*. Default value: *faiss.Flat*  | 
| mini\_batch\_size |  The number of observations per mini-batch for the data iterator.  **Optional** Valid values: positive integer Default value: 5000  | 

# Tune a k-NN Model
<a name="kNN-tuning"></a>

The Amazon SageMaker AI k-nearest neighbors algorithm is a supervised algorithm. The algorithm consumes a test data set and emits a metric about the accuracy for a classification task or about the mean squared error for a regression task. These accuracy metrics compare the model predictions for their respective task to the ground truth provided by the empirical test data. To find the best model that reports the highest accuracy or lowest error on the test dataset, run a hyperparameter tuning job for k-NN. 

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric appropriate for the prediction task of the algorithm. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric. The hyperparameters are used only to help estimate model parameters and are not used by the trained model to make predictions.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the k-NN Algorithm
<a name="km-metrics"></a>

The k-nearest neighbors algorithm computes one of the two metrics in the following table during training, depending on the type of task specified by the `predictor_type` hyperparameter. 
+ *classifier* specifies a classification task and computes `test:accuracy`.
+ *regressor* specifies a regression task and computes `test:mse`.

Choose the `predictor_type` value appropriate for the type of task undertaken to calculate the relevant objective metric when tuning a model.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:accuracy |  When `predictor_type` is set to *classifier*, k-NN compares the predicted label, based on the average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel data. The accuracy reported ranges from 0.0 (0%) to 1.0 (100%).  |  Maximize  | 
| test:mse |  When `predictor_type` is set to *regressor*, k-NN compares the predicted label, based on the average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel data. The mean squared error is computed by comparing the two labels.  |  Minimize  | 



## Tunable k-NN Hyperparameters
<a name="km-tunable-hyperparameters"></a>

Tune the Amazon SageMaker AI k-nearest neighbor model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| k |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 1024  | 
| sample\_size |  IntegerParameterRanges  |  MinValue: 256, MaxValue: 20000000  | 

# Data Formats for k-NN Training Input
<a name="kNN-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input training formats described in [Common Data Formats - Training](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html). This topic contains a list of the available input formats for the SageMaker AI k-nearest-neighbor algorithm.

## CSV Data Format
<a name="kNN-training-data-csv"></a>

content-type: text/csv; label\_size=1

```
4,1.2,1.3,9.6,20.3
```

The first `label_size` columns are interpreted as the label vector for that row.
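
For example, with `label_size=1` the row above splits into a one-element label vector and the remaining features; a minimal parsing sketch:

```python
label_size = 1
row = "4,1.2,1.3,9.6,20.3"

values = [float(v) for v in row.split(",")]
label = values[:label_size]      # the first label_size columns
features = values[label_size:]   # the remaining columns
```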

## RECORDIO Data Format
<a name="kNN-training-data-recordio"></a>

content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {
            'values': {
                values: [1.2, 1.3, 9.6, 20.3]  # float32
            }
        },
        label = {
            'values': {
                values: [4]  # float32
            }
        }
    }
]
```

# k-NN Request and Response Formats
<a name="kNN-inference-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI k-nearest-neighbor algorithm.

## INPUT: CSV Request Format
<a name="kNN-input-csv"></a>

content-type: text/csv

```
1.2,1.3,9.6,20.3
```

This format accepts `label_size` and `encoding` parameters; by default, it assumes a `label_size` of 0 and UTF-8 encoding.

## INPUT: JSON Request Format
<a name="kNN-input-json"></a>

content-type: application/json

```
{
  "instances": [
    {"data": {"features": {"values": [-3, -1, -4, 2]}}},
    {"features": [3.0, 0.1, 0.04, 0.002]}
  ]
}
```

## INPUT: JSONLINES Request Format
<a name="kNN-input-jsonlines"></a>

content-type: application/jsonlines

```
{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}
```

## INPUT: RECORDIO Request Format
<a name="kNN-input-recordio"></a>

content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {
            'values': {
                values: [-3, -1, -4, 2]  # float32
            }
        },
        label = {}
    },
    Record = {
        features = {
            'values': {
                values: [3.0, 0.1, 0.04, 0.002]  # float32
            }
        },
        label = {}
    },
]
```

## OUTPUT: JSON Response Format
<a name="kNN-output-json"></a>

accept: application/json

```
{
  "predictions": [
    {"predicted_label": 0.0},
    {"predicted_label": 2.0}
  ]
}
```

## OUTPUT: JSONLINES Response Format
<a name="kNN-output-jsonlines"></a>

accept: application/jsonlines

```
{"predicted_label": 0.0}
{"predicted_label": 2.0}
```

## OUTPUT: VERBOSE JSON Response Format
<a name="KNN-output-verbose-json"></a>

In verbose mode, the API provides the search results with the distances vector sorted from smallest to largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/json; verbose=true

```
{
  "predictions": [
    {
        "predicted_label": 0.0,
        "distances": [3.11792408, 3.89746071, 6.32548437],
        "labels": [0.0, 1.0, 0.0]
    },
    {
        "predicted_label": 2.0,
        "distances": [1.08470316, 3.04917915, 5.25393973],
        "labels": [2.0, 2.0, 0.0]
    }
  ]
}
```
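
Each `predicted_label` is consistent with a majority vote over the returned `labels` vector, which can be sketched as:

```python
from collections import Counter

def majority_label(labels):
    """Most frequent label among the k nearest neighbors (classification)."""
    return Counter(labels).most_common(1)[0][0]

first = majority_label([0.0, 1.0, 0.0])   # two votes for 0.0
second = majority_label([2.0, 2.0, 0.0])  # two votes for 2.0
```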

## OUTPUT: RECORDIO-PROTOBUF Response Format
<a name="kNN-output-recordio-protobuf"></a>

content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [0.0]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [2.0]  # float32
            }
        }
    }
]
```

## OUTPUT: VERBOSE RECORDIO-PROTOBUF Response Format
<a name="kNN-output-verbose-recordio"></a>

In verbose mode, the API provides the search results with the distances vector sorted from smallest to largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/x-recordio-protobuf; verbose=true

```
[
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [0.0]  # float32
            },
            'distances': {
                values: [3.11792408, 3.89746071, 6.32548437]  # float32
            },
            'labels': {
                values: [0.0, 1.0, 0.0]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [2.0]  # float32
            },
            'distances': {
                values: [1.08470316, 3.04917915, 5.25393973]  # float32
            },
            'labels': {
                values: [2.0, 2.0, 0.0]  # float32
            }
        }
    }
]
```

## SAMPLE OUTPUT for the k-NN Algorithm
<a name="kNN-sample-output"></a>

For regressor tasks:

```
[06/08/2018 20:15:33 INFO 140026520049408] #test_score (algo-1) : ('mse', 0.013333333333333334)
```

For classifier tasks:

```
[06/08/2018 20:15:46 INFO 140285487171328] #test_score (algo-1) : ('accuracy', 0.98666666666666669)
```

# LightGBM
<a name="lightgbm"></a>

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT. This page includes information about Amazon EC2 instance recommendations and sample notebooks for LightGBM.

# How to use SageMaker AI LightGBM
<a name="lightgbm-modes"></a>

You can use LightGBM as an Amazon SageMaker AI built-in algorithm. The following section describes how to use LightGBM with the SageMaker Python SDK. For information on how to use LightGBM from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use LightGBM as a built-in algorithm**

  Use the LightGBM built-in algorithm to build a LightGBM training container as shown in the following code example. You can retrieve the LightGBM built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 2). 

  After specifying the LightGBM image URI, you can use the LightGBM container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The LightGBM built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own LightGBM training scripts.

  ```
  from sagemaker import image_uris, model_uris, script_uris
  
  # This example assumes that aws_region, aws_role, and sess (a sagemaker.Session)
  # are already defined in your notebook environment.
  train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass/"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train" 
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation" 
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters["num_boost_round"] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1, # for distributed training, specify an instance_count greater than 1
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "train": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the LightGBM as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the LightGBM algorithm
<a name="InputOutput-LightGBM"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of LightGBM supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
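
The expected layouts can be sketched as follows; the rows and values are illustrative:

```python
import csv
import io

# Training rows: target in the first column, no header record
train_rows = [
    [1, 0.5, 2.3, 4.1],
    [0, 1.2, 0.7, 3.3],
]

buf = io.StringIO()
csv.writer(buf).writerows(train_rows)
train_csv = buf.getvalue()

# Inference rows: same features, with the label column dropped
inference_rows = [row[1:] for row in train_rows]
```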

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the LightGBM model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `train` and `validation` channels to provide your input data. Alternatively, you can use only the `train` channel.

**Note**  
Both `train` and `training` are valid channel names for LightGBM training.

**Use both the `train` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `train` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `train` or `validation` channels, the LightGBM algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `train` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
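
For example, if columns 1 and 3 of your training CSV hold categorical features, a sketch of generating the file looks like this (the column indices are illustrative):

```python
import json

# Column indices of categorical features. Column 0 is the target,
# so the indices here are always >= 1.
categorical_columns = [1, 3]

categorical_index = json.dumps({"cat_index_list": categorical_columns})
# Upload this as categorical_index.json next to the training CSV files
```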

**Use only the `train` channel**:

You can alternatively provide your input data by way of a single S3 path for the `train` channel. This S3 path should point to a directory with a subdirectory named `train/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI LightGBM uses the Python Joblib module to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI LightGBM with the Joblib module**
+ Use the following Python code:

  ```
  import joblib 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path is the path of the model file extracted from model.tar.gz
  model = joblib.load(model_file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the LightGBM algorithm
<a name="Instance-LightGBM"></a>

SageMaker AI LightGBM currently supports single-instance and multi-instance CPU training. For multi-instance CPU training (distributed training), specify an `instance_count` greater than 1 when you define your Estimator. For more information on distributed training with LightGBM, see [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html).

LightGBM is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 

## LightGBM sample notebooks
<a name="lightgbm-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI LightGBM algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular regression model.   | 
|  [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html)  |  This notebook demonstrates distributed training with the Amazon SageMaker AI LightGBM algorithm using the Dask framework.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How LightGBM works
<a name="lightgbm-HowItWorks"></a>

LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of GBDT.
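As an illustration of the GOSS idea, the following sketch keeps the instances with the largest gradient magnitudes, randomly samples the rest, and upweights the sampled instances so the gradient estimate stays unbiased. The `goss_sample` helper and its `top_rate`/`other_rate` arguments are a hypothetical, simplified sketch, not the LightGBM implementation.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Illustrative GOSS step: keep the instances with the largest
    absolute gradients, randomly sample the rest, and upweight the
    sampled small-gradient instances to keep the gradient sum unbiased."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    top = order[:n_top]
    rest = order[n_top:]
    random.seed(seed)
    sampled = random.sample(rest, n_other)
    # Amplification factor (1 - top_rate) / other_rate for sampled instances
    amplify = (1 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights

weights = goss_sample([0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.02, 0.6, 0.1])
# Two top-gradient instances kept at weight 1.0; one sampled instance upweighted to 8.0
```

EFB complements GOSS by bundling mutually exclusive sparse features into single features, reducing the effective feature dimension.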

The LightGBM algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use LightGBM for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the additional GOSS and EFB techniques used in the LightGBM method, see *[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)*.

# LightGBM hyperparameters
<a name="lightgbm-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI LightGBM algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI LightGBM algorithm is an implementation of the open-source [LightGBM](https://github.com/microsoft/LightGBM) package. 

**Note**  
The default hyperparameters are based on example datasets in the [LightGBM sample notebooks](lightgbm.md#lightgbm-sample-notebooks).

By default, the SageMaker AI LightGBM algorithm automatically chooses an evaluation metric and objective function based on the type of classification problem. The LightGBM algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error and the objective function is L2 loss. For binary classification problems, the evaluation metric and objective function are both binary cross entropy. For multiclass classification problems, the evaluation metric is multiclass cross entropy and the objective function is softmax. You can use the `metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on LightGBM hyperparameters, including descriptions, valid values, and default values.
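The selection rules above can be sketched as a small helper. The `choose_defaults` function is hypothetical and only mirrors the behavior described; it is not the SageMaker AI implementation.

```python
def choose_defaults(labels):
    """Illustrate how a default metric and objective could be chosen from
    the label column, mirroring the rules above (hypothetical helper)."""
    if any(not float(v).is_integer() for v in labels):
        return {"metric": "rmse", "objective": "regression"}       # L2 loss
    n_classes = len(set(labels))
    if n_classes == 2:
        return {"metric": "binary_logloss", "objective": "binary"}
    return {"metric": "multi_logloss", "objective": "multiclass"}  # softmax

choose_defaults([0, 1, 1, 0])     # binary cross entropy
choose_defaults([0, 1, 2, 1])     # multiclass cross entropy / softmax
choose_defaults([1.5, 2.3, 0.7])  # regression: rmse / L2 loss
```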


| Parameter Name | Description | 
| --- | --- | 
| num\_boost\_round |  The maximum number of boosting iterations. **Note:** Internally, LightGBM constructs `num_class * num_boost_round` trees for multiclass classification problems. Valid values: integer, range: Positive integer. Default value: `100`.  | 
| early\_stopping\_rounds |  Training stops if one metric on the validation data does not improve in the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `10`.  | 
| metric |  The evaluation metric for validation data. If `metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/lightgbm-hyperparameters.html) Valid values: string, any of the following: (`"auto"`, `"rmse"`, `"l1"`, `"l2"`, `"huber"`, `"fair"`, `"binary_logloss"`, `"binary_error"`, `"auc"`, `"average_precision"`, `"multi_logloss"`, `"multi_error"`, `"auc_mu"`, or `"cross_entropy"`). Default value: `"auto"`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.1`.  | 
| num\_leaves |  The maximum number of leaves in one tree. Valid values: integer, range: (`1`, `131072`). Default value: `64`.  | 
| feature\_fraction |  A subset of features to be selected on each iteration (tree). Must be less than 1.0. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.9`.  | 
| bagging\_fraction |  A subset of the data, similar to `feature_fraction`, but `bagging_fraction` randomly selects part of the data without resampling. Valid values: float, range: (`0.0`, `1.0`]. Default value: `0.9`.  | 
| bagging\_freq |  The frequency to perform bagging. At every `bagging_freq` iteration, LightGBM randomly selects a percentage of the data to use for the next `bagging_freq` iterations. This percentage is determined by the `bagging_fraction` hyperparameter. If `bagging_freq` is zero, then bagging is deactivated. Valid values: integer, range: Non-negative integer. Default value: `1`.  | 
| max\_depth |  The maximum depth for a tree model. This is used to deal with overfitting when the amount of data is small. If `max_depth` is less than or equal to zero, there is no limit on the maximum depth. Valid values: integer. Default value: `6`.  | 
| min\_data\_in\_leaf |  The minimum amount of data in one leaf. Can be used to deal with overfitting. Valid values: integer, range: Non-negative integer. Default value: `3`.  | 
| max\_delta\_step |  Used to limit the maximum output of tree leaves. If `max_delta_step` is less than or equal to `0`, then there is no constraint. The final maximum output of leaves is `learning_rate * max_delta_step`. Valid values: float. Default value: `0.0`.  | 
| lambda\_l1 |  L1 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| lambda\_l2 |  L2 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| boosting |  The boosting type. Valid values: string, any of the following: (`"gbdt"`, `"rf"`, `"dart"`, or `"goss"`). Default value: `"gbdt"`.  | 
| min\_gain\_to\_split |  The minimum gain to perform a split. Can be used to speed up training. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| scale\_pos\_weight |  The weight of the labels with positive class. Used only for binary classification tasks. `scale_pos_weight` cannot be used if `is_unbalance` is set to `"True"`. Valid values: float, range: Positive float. Default value: `1.0`.  | 
| tree\_learner |  The tree learner type. Valid values: string, any of the following: (`"serial"`, `"feature"`, `"data"`, or `"voting"`). Default value: `"serial"`.  | 
| feature\_fraction\_bynode |  Selects a subset of random features on each tree node. For example, if `feature_fraction_bynode` is `0.8`, then 80% of features are selected. Can be used to deal with overfitting. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| is\_unbalance |  Set to `"True"` if training data is unbalanced. Used only for binary classification tasks. `is_unbalance` cannot be used with `scale_pos_weight`. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| max\_bin |  The maximum number of bins used to bucket feature values. A small number of bins may reduce training accuracy, but may improve generalization. Can be used to deal with overfitting. Valid values: integer, range: (`1`, ∞). Default value: `255`.  | 
| tweedie\_variance\_power |  Controls the variance of the Tweedie distribution. Set this closer to `2.0` to shift toward a gamma distribution. Set this closer to `1.0` to shift toward a Poisson distribution. Used only for regression tasks. Valid values: float, range: [`1.0`, `2.0`). Default value: `1.5`.  | 
| num\_threads |  The number of parallel threads used to run LightGBM. A value of `0` means the default number of threads in OpenMP. Valid values: integer, range: Non-negative integer. Default value: `0`.  | 
| verbosity |  The verbosity of print messages. If the `verbosity` is less than `0`, then print messages only show fatal errors. If `verbosity` is set to `0`, then print messages include errors and warnings. If `verbosity` is `1`, then print messages show more information. A `verbosity` greater than `1` shows the most information in print messages and can be used for debugging. Valid values: integer. Default value: `1`.  | 
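Because the `CreateTrainingJob` API accepts hyperparameters as a string-to-string map, numeric settings are serialized before submission. A minimal sketch, with illustrative values rather than recommendations:

```python
# Hyperparameters for a SageMaker training job are passed as a map of
# string keys to string values; numeric settings must be serialized.
# The values below are illustrative, not recommendations.
hyperparameters = {
    "num_boost_round": 500,
    "early_stopping_rounds": 30,
    "learning_rate": 0.05,
    "num_leaves": 64,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "metric": "auto",
}
# Serialize every value to a string, as the CreateTrainingJob API expects.
hyperparameters = {k: str(v) for k, v in hyperparameters.items()}
```

The resulting map can be passed as the `hyperparameters` argument of a SageMaker Python SDK estimator, which performs the same string conversion on your behalf.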

# Tune a LightGBM model
<a name="lightgbm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters: 

**Note**  
The learning objective function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your specified hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for LightGBM is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the LightGBM algorithm
<a name="lightgbm-metrics"></a>

The SageMaker AI LightGBM algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| rmse | root mean square error | minimize | "rmse: ([0-9\\.]+)" | 
| l1 | mean absolute error | minimize | "l1: ([0-9\\.]+)" | 
| l2 | mean squared error | minimize | "l2: ([0-9\\.]+)" | 
| huber | huber loss | minimize | "huber: ([0-9\\.]+)" | 
| fair | fair loss | minimize | "fair: ([0-9\\.]+)" | 
| binary\_logloss | binary cross entropy | minimize | "binary\_logloss: ([0-9\\.]+)" | 
| binary\_error | binary error | minimize | "binary\_error: ([0-9\\.]+)" | 
| auc | AUC | maximize | "auc: ([0-9\\.]+)" | 
| average\_precision | average precision score | maximize | "average\_precision: ([0-9\\.]+)" | 
| multi\_logloss | multiclass cross entropy | minimize | "multi\_logloss: ([0-9\\.]+)" | 
| multi\_error | multiclass error score | minimize | "multi\_error: ([0-9\\.]+)" | 
| auc\_mu | AUC-mu | maximize | "auc\_mu: ([0-9\\.]+)" | 
| cross\_entropy | cross entropy | minimize | "cross\_entropy: ([0-9\\.]+)" | 
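Each regex pattern above is used to extract the metric value from the training log. A minimal sketch of how such a pattern is applied (the log line shown is illustrative):

```python
import re

# The log line format here is illustrative; the pattern mirrors the table above.
log_line = "[50] valid_0's rmse: 0.2471"
match = re.search(r"rmse: ([0-9\.]+)", log_line)
rmse = float(match.group(1))  # 0.2471
```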

## Tunable LightGBM hyperparameters
<a name="lightgbm-tunable-hyperparameters"></a>

Tune the LightGBM model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the LightGBM evaluation metrics are: `learning_rate`, `num_leaves`, `feature_fraction`, `bagging_fraction`, `bagging_freq`, `max_depth` and `min_data_in_leaf`. For a list of all the LightGBM hyperparameters, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| num\_leaves | IntegerParameterRanges | MinValue: 10, MaxValue: 100 | 
| feature\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_freq | IntegerParameterRanges | MinValue: 0, MaxValue: 10 | 
| max\_depth | IntegerParameterRanges | MinValue: 15, MaxValue: 100 | 
| min\_data\_in\_leaf | IntegerParameterRanges | MinValue: 10, MaxValue: 200 | 
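Expressed as plain data, the recommended ranges above look like the following sketch. In the SageMaker Python SDK, continuous and integer ranges correspond to `ContinuousParameter` and `IntegerParameter` objects passed to a `HyperparameterTuner`.

```python
# Recommended tuning ranges from the table above, expressed as plain data.
# In the SageMaker Python SDK these would become ContinuousParameter /
# IntegerParameter objects in the hyperparameter_ranges of a tuner.
hyperparameter_ranges = {
    "learning_rate":    ("continuous", 0.001, 0.01),
    "num_leaves":       ("integer", 10, 100),
    "feature_fraction": ("continuous", 0.1, 1.0),
    "bagging_fraction": ("continuous", 0.1, 1.0),
    "bagging_freq":     ("integer", 0, 10),
    "max_depth":        ("integer", 15, 100),
    "min_data_in_leaf": ("integer", 10, 200),
}
```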

# Linear Learner Algorithm
<a name="linear-learner"></a>

*Linear models* are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labeled examples (*x*, *y*). *x* is a high-dimensional vector and *y* is a numeric label. For binary classification problems, the label must be either 0 or 1. For multiclass classification problems, the labels must be from 0 to `num_classes` - 1. For regression problems, *y* is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector *x* to an approximation of the label *y*. 
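The mapping described above can be sketched in a few lines; the `linear_predict` helper is illustrative, not part of any SageMaker API.

```python
def linear_predict(w, b, x, classification=False):
    """Score a single example with a learned linear function.
    For regression the raw value is returned; for binary classification,
    a linear threshold function maps the score to a 0/1 label."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return int(score > 0) if classification else score

linear_predict([2.0, -1.0], 0.5, [1.0, 3.0])                       # -0.5
linear_predict([2.0, -1.0], 0.5, [1.0, 3.0], classification=True)  # 0
```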

The Amazon SageMaker AI linear learner algorithm provides a solution for both classification and regression problems. With the SageMaker AI algorithm, you can simultaneously explore different training objectives and choose the best solution from a validation set. You can also explore a large number of models and choose the best. The best model optimizes either of the following:
+ Continuous objectives, such as mean square error, cross entropy loss, or absolute error.
+ Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy. 

Compared with methods that provide a solution for only continuous objectives, the SageMaker AI linear learner algorithm provides a significant increase in speed over naive hyperparameter optimization techniques. It is also more convenient. 

The linear learner algorithm requires a data matrix, with rows representing the observations, and columns representing the dimensions of the features. It also requires an additional column that contains the labels that match the data points. At a minimum, Amazon SageMaker AI linear learner requires that you specify the input and output data locations, the objective type (classification or regression), and the feature dimension as arguments. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). You can specify additional parameters in the `HyperParameters` string map of the request body. These parameters control the optimization procedure or specifics of the objective function that you train on, such as the number of epochs, regularization, and the loss type. 

If you're using [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), the linear learner algorithm supports using [checkpoints to take a snapshot of the state of the model](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html).

**Topics**
+ [Input/Output interface for the linear learner algorithm](#ll-input_output)
+ [EC2 instance recommendation for the linear learner algorithm](#ll-instances)
+ [Linear learner sample notebooks](#ll-sample-notebooks)
+ [How linear learner works](ll_how-it-works.md)
+ [Linear learner hyperparameters](ll_hyperparameters.md)
+ [Tune a linear learner model](linear-learner-tuning.md)
+ [Linear learner response formats](LL-in-formats.md)

## Input/Output interface for the linear learner algorithm
<a name="ll-input_output"></a>

The Amazon SageMaker AI linear learner algorithm supports three data channels: train, validation (optional), and test (optional). If you provide validation data, the `S3DataDistributionType` should be `FullyReplicated`. The algorithm logs validation loss at every epoch, and uses a sample of the validation data to calibrate and select the best model. If you don't provide validation data, the algorithm uses a sample of the training data to calibrate and select the model. If you provide test data, the algorithm logs include the test score for the final model.

**For training**, the linear learner algorithm supports both `recordIO-wrapped protobuf` and `CSV` formats. For the `application/x-recordio-protobuf` input type, only Float32 tensors are supported. For the `text/csv` input type, the first column is assumed to be the label, which is the target variable for prediction. You can use either File mode or Pipe mode to train linear learner models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.
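For the `text/csv` format, this means a headerless file whose first column is the label. A minimal sketch of building such a payload (toy values, illustrative only):

```python
import csv
import io

# For text/csv training input, the first column is the label and the file
# has no header row. Feature values follow.
rows = [
    (1, 0.5, 3.2, 0.0),  # label, then features
    (0, 1.1, 0.4, 2.7),
]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
payload = buf.getvalue()
```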

**For inference**, the linear learner algorithm supports the `application/json`, `application/x-recordio-protobuf`, and `text/csv` formats. When you make predictions on new data, the format of the response depends on the type of model. **For regression** (`predictor_type='regressor'`), the `score` is the prediction produced by the model. **For classification** (`predictor_type='binary_classifier'` or `predictor_type='multiclass_classifier'`), the model returns a `score` and also a `predicted_label`. The `predicted_label` is the class predicted by the model and the `score` measures the strength of that prediction. 
+ **For binary classification**, `predicted_label` is `0` or `1`, and `score` is a single floating point number that indicates how strongly the algorithm believes that the label should be 1.
+ **For multiclass classification**, the `predicted_label` will be an integer from `0` to `num_classes-1`, and `score` will be a list of one floating point number per class. 
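For the `application/json` inference format, a binary classification response can be parsed as in the following sketch; the payload shown is illustrative:

```python
import json

# Illustrative binary-classification response body; the "predictions"
# structure follows the linear learner JSON response format.
body = '{"predictions": [{"score": 0.94, "predicted_label": 1}]}'
for p in json.loads(body)["predictions"]:
    label, score = p["predicted_label"], p["score"]
```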

To interpret the `score` in classification problems, consider the loss function used. If the `loss` hyperparameter value is `logistic` for binary classification or `softmax_loss` for multiclass classification, then the `score` can be interpreted as the probability of the corresponding class. These are the loss values used when `loss` is set to its default `auto` value. But if the loss is set to `hinge_loss`, then the score cannot be interpreted as a probability. This is because hinge loss corresponds to a support vector classifier, which does not produce probability estimates.

For more information on input and output file formats, see [Linear learner response formats](LL-in-formats.md) and the [Linear learner sample notebooks](#ll-sample-notebooks).

## EC2 instance recommendation for the linear learner algorithm
<a name="ll-instances"></a>

The linear learner algorithm supports both CPU and GPU instances for training and inference. For GPU, the linear learner algorithm supports P2, P3, G4dn, and G5 GPU families.

During testing, we have not found substantial evidence that multi-GPU instances are faster than single-GPU instances. Results can vary, depending on your specific use case.

## Linear learner sample notebooks
<a name="ll-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI linear learner algorithm.


| **Notebook Title** | **Description** | 
| --- | --- | 
|  [An Introduction with the MNIST dataset](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/linear_learner_mnist/linear_learner_mnist.html)  |   Using the MNIST dataset, we train a binary classifier to predict a single digit.  | 
|  [How to Build a Multiclass Classifier?](https://sagemaker-examples.readthedocs.io/en/latest/scientific_details_of_algorithms/linear_learner_multiclass_classification/linear_learner_multiclass_classification.html)  |   Using UCI's Covertype dataset, we demonstrate how to train a multiclass classifier.   | 
|  [How to Build a Machine Learning (ML) Pipeline for Inference? ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html)  |   Using a Scikit-learn container, we demonstrate how to build an end-to-end ML pipeline.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The example notebooks that use the linear learner algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**. 

# How linear learner works
<a name="ll_how-it-works"></a>

There are three steps involved in the implementation of the linear learner algorithm: preprocess, train, and validate. 

## Step 1: Preprocess
<a name="step1-preprocessing"></a>

Normalization, or feature scaling, is an important preprocessing step for certain loss functions that ensures the model being trained on a dataset does not become dominated by the weight of a single feature. The Amazon SageMaker AI Linear Learner algorithm has a normalization option to assist with this preprocessing step. If normalization is turned on, the algorithm first goes over a small sample of the data to learn the mean value and standard deviation for each feature and for the label. Each of the features in the full dataset is then shifted to have mean of zero and scaled to have a unit standard deviation.

**Note**  
For best results, ensure your data is shuffled before training. Training with unshuffled data may cause training to fail. 

You can configure whether the linear learner algorithm normalizes the feature data and the labels using the `normalize_data` and `normalize_label` hyperparameters, respectively. Normalization is enabled by default for both features and labels for regression. For binary classification, only the features can be normalized, and this is the default behavior. 
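The normalization step described above can be sketched as follows. The helpers are illustrative: the sample-based estimator and the use of the population standard deviation are assumptions, not the algorithm's exact internals.

```python
import statistics

def fit_scaler(sample_column):
    """Learn the mean and standard deviation from a small data sample,
    as described above (illustrative sketch)."""
    mean = statistics.fmean(sample_column)
    std = statistics.pstdev(sample_column) or 1.0  # guard constant features
    return mean, std

def normalize(value, mean, std):
    # Shift to zero mean and scale to unit standard deviation.
    return (value - mean) / std

mean, std = fit_scaler([2.0, 4.0, 6.0])
normalize(6.0, mean, std)  # ≈ 1.22 standard deviations above the mean
```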

## Step 2: Train
<a name="step2-training"></a>

With the linear learner algorithm, you train with a distributed implementation of stochastic gradient descent (SGD). You can control the optimization process by choosing the optimization algorithm. For example, you can choose to use Adam, AdaGrad, stochastic gradient descent, or other optimization algorithms. You also specify their hyperparameters, such as momentum, learning rate, and the learning rate schedule. If you aren't sure which algorithm or hyperparameter value to use, choose a default that works for the majority of datasets. 

During training, you simultaneously optimize multiple models, each with slightly different objectives. For example, you vary L1 or L2 regularization and try out different optimizer settings. 

## Step 3: Validate and set the threshold
<a name="step3-validation"></a>

When training multiple models in parallel, the models are evaluated against a validation set, and the best model is selected once training is complete. For regression, the best model is the one that achieves the lowest loss on the validation set. For classification, a sample of the validation set is used to calibrate the classification threshold. The model selected is the one that achieves the best binary classification criteria on the validation set. Examples of such criteria include the F1 measure, accuracy, and cross-entropy loss. 

**Note**  
If the algorithm is not provided a validation set, then evaluating and selecting the best model is not possible. To take advantage of parallel training and model selection, ensure that you provide a validation set to the algorithm. 
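The threshold-calibration step described above can be sketched by sweeping candidate thresholds over validation scores and keeping the one with the best F1 measure. The `best_threshold` helper is hypothetical and simplified:

```python
def best_threshold(scores, labels):
    """Pick the classification threshold that maximizes F1 on a
    validation sample (illustrative sketch of threshold calibration)."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [int(s >= t) for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

best_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.35
```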

# Linear learner hyperparameters
<a name="ll_hyperparameters"></a>

The following table contains the hyperparameters for the linear learner algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. When a hyperparameter is set to `auto`, Amazon SageMaker AI will automatically calculate and set the value of that hyperparameter. 


| Parameter Name | Description | 
| --- | --- | 
| num\_classes |  The number of classes for the response variable. The algorithm assumes that classes are labeled `0`, ..., `num_classes - 1`. **Required** when `predictor_type` is `multiclass_classifier`. Otherwise, the algorithm ignores it. Valid values: Integers from 3 to 1,000,000  | 
| predictor\_type |  Specifies the type of target variable as a binary classification, multiclass classification, or regression. **Required** Valid values: `binary_classifier`, `multiclass_classifier`, or `regressor`  | 
| accuracy\_top\_k |  When computing the top-k accuracy metric for multiclass classification, the value of *k*. If the model assigns one of the top-k scores to the true label, an example is scored as correct. **Optional** Valid values: Positive integers Default value: 3   | 
| balance\_multiclass\_weights |  Specifies whether to use class weights, which give each class equal importance in the loss function. Used only when the `predictor_type` is `multiclass_classifier`. **Optional** Valid values: `true`, `false` Default value: `false`  | 
| beta\_1 |  The exponential decay rate for first-moment estimates. Applies only when the `optimizer` value is `adam`. **Optional** Valid values: `auto` or floating-point value between 0 and 1.0 Default value: `auto`  | 
| beta\_2 |  The exponential decay rate for second-moment estimates. Applies only when the `optimizer` value is `adam`. **Optional** Valid values: `auto` or floating-point value between 0 and 1.0 Default value: `auto`  | 
| bias\_lr\_mult |  Allows a different learning rate for the bias term. The actual learning rate for the bias is `learning_rate * bias_lr_mult`. **Optional** Valid values: `auto` or positive floating-point value Default value: `auto`  | 
| bias\_wd\_mult |  Allows different regularization for the bias term. The actual L2 regularization weight for the bias is `wd * bias_wd_mult`. By default, there is no regularization on the bias term. **Optional** Valid values: `auto` or non-negative floating-point value Default value: `auto`  | 
| binary\_classifier\_model\_selection\_criteria |  When `predictor_type` is set to `binary_classifier`, the model evaluation criteria for the validation dataset (or for the training dataset if you don't provide a validation dataset). Criteria include: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) **Optional** Valid values: `accuracy`, `f_beta`, `precision_at_target_recall`, `recall_at_target_precision`, or `loss_function` Default value: `accuracy`  | 
| early\_stopping\_patience |  The number of epochs to wait before ending training if no improvement is made in the relevant metric. If you have provided a value for `binary_classifier_model_selection_criteria`, the metric is that value. Otherwise, the metric is the same as the value specified for the `loss` hyperparameter. The metric is evaluated on the validation data. If you haven't provided validation data, the metric is always the same as the value specified for the `loss` hyperparameter and is evaluated on the training data. To disable early stopping, set `early_stopping_patience` to a value greater than the value specified for `epochs`. **Optional** Valid values: Positive integer Default value: 3  | 
| early\_stopping\_tolerance |  The relative tolerance to measure an improvement in loss. If the ratio of the improvement in loss divided by the previous best loss is smaller than this value, early stopping considers the improvement to be zero. **Optional** Valid values: Positive floating-point value Default value: 0.001  | 
| epochs |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer Default value: 15  | 
| f\_beta |  The value of beta to use when calculating F score metrics for binary or multiclass classification. Also used if the value specified for `binary_classifier_model_selection_criteria` is `f_beta`. **Optional** Valid values: Positive floating-point values Default value: 1.0   | 
| feature\_dim |  The number of features in the input data. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| huber\_delta |  The parameter for Huber loss. During training and metric evaluation, compute L2 loss for errors smaller than delta and L1 loss for errors larger than delta. **Optional** Valid values: Positive floating-point value Default value: 1.0   | 
| init\_bias |  Initial weight for the bias term. **Optional** Valid values: Floating-point value Default value: 0  | 
| init\_method |  Sets the initial distribution function used for model weights. Functions include: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) **Optional** Valid values: `uniform` or `normal` Default value: `uniform`  | 
| init\_scale |  Scales an initial uniform distribution for model weights. Applies only when the `init_method` hyperparameter is set to `uniform`. **Optional** Valid values: Positive floating-point value Default value: 0.07  | 
| init\_sigma |  The initial standard deviation for the normal distribution. Applies only when the `init_method` hyperparameter is set to `normal`. **Optional** Valid values: Positive floating-point value Default value: 0.01  | 
| l1 |  The L1 regularization parameter. If you don't want to use L1 regularization, set the value to 0. **Optional** Valid values: `auto` or non-negative float Default value: `auto`  | 
| learning\_rate |  The step size used by the optimizer for parameter updates. **Optional** Valid values: `auto` or positive floating-point value Default value: `auto`, whose value depends on the optimizer chosen.  | 
| loss |  Specifies the loss function.  The available loss functions and their default values depend on the value of `predictor_type`: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) Valid values: `auto`, `logistic`, `squared_loss`, `absolute_loss`, `hinge_loss`, `eps_insensitive_squared_loss`, `eps_insensitive_absolute_loss`, `quantile_loss`, or `huber_loss`  **Optional** Default value: `auto`  | 
| loss\_insensitivity |  The parameter for the epsilon-insensitive loss type. During training and metric evaluation, any error smaller than this value is considered to be zero. **Optional** Valid values: Positive floating-point number Default value: 0.01   | 
| lr\_scheduler\_factor |  For every `lr_scheduler_step` hyperparameter, the learning rate decreases by this quantity. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive floating-point number between 0 and 1 Default value: `auto`  | 
| lr\_scheduler\_minimum\_lr |  The learning rate never decreases to a value lower than the value set for `lr_scheduler_minimum_lr`. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`  | 
| lr\_scheduler\_step |  The number of steps between decreases of the learning rate. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| margin |  The margin for the `hinge_loss` function. **Optional** Valid values: Positive floating-point number Default value: 1.0  | 
| mini\_batch\_size |  The number of observations per mini-batch for the data iterator. **Optional** Valid values: Positive integer Default value: 1000  | 
| momentum |  The momentum of the `sgd` optimizer. **Optional** Valid values: `auto` or a floating-point number between 0 and 1.0 Default value: `auto`  | 
| normalize\_data |  Normalizes the feature data before training. Data normalization shifts the data for each feature to have a mean of zero and scales it to have unit standard deviation. **Optional** Valid values: `auto`, `true`, or `false` Default value: `true`  | 
| normalize\_label |  Normalizes the label. Label normalization shifts the label to have a mean of zero and scales it to have unit standard deviation. The `auto` default value normalizes the label for regression problems but does not for classification problems. If you set the `normalize_label` hyperparameter to `true` for classification problems, the algorithm ignores it. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| num\_calibration\_samples |  The number of observations from the validation dataset to use for model calibration (when finding the best threshold). **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| num\_models |  The number of models to train in parallel. For the default, `auto`, the algorithm decides the number of parallel models to train. One model is trained with the given training parameters (regularization, optimizer, loss), and the rest with nearby parameter values. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| num\_point\_for\_scaler |  The number of data points to use for calculating normalization or unbiasing terms. **Optional** Valid values: Positive integer Default value: 10,000  | 
| optimizer |  The optimization algorithm to use. **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) Default value: `auto`. The default setting for `auto` is `adam`.  | 
| positive\_example\_weight\_mult |  The weight assigned to positive examples when training a binary classifier. The weight of negative examples is fixed at 1. If you want the algorithm to choose a weight so that errors in classifying negative *vs.* positive examples have equal impact on training loss, specify `balanced`. If you want the algorithm to choose the weight that optimizes performance, specify `auto`. **Optional** Valid values: `balanced`, `auto`, or a positive floating-point number Default value: 1.0  | 
| quantile |  The quantile for quantile loss. For quantile q, the model attempts to produce predictions so that the value of `true_label` is greater than the prediction with probability q. **Optional** Valid values: Floating-point number between 0 and 1 Default value: 0.5  | 
| target\_precision |  The target precision. If `binary_classifier_model_selection_criteria` is `recall_at_target_precision`, then precision is held at this value while recall is maximized. **Optional** Valid values: Floating-point number between 0 and 1.0 Default value: 0.8  | 
| target\_recall |  The target recall. If `binary_classifier_model_selection_criteria` is `precision_at_target_recall`, then recall is held at this value while precision is maximized. **Optional** Valid values: Floating-point number between 0 and 1.0 Default value: 0.8  | 
| unbias\_data |  Unbiases the features before training so that the mean is 0. By default, data is unbiased when the `use_bias` hyperparameter is set to `true`. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| unbias\_label |  Unbiases labels before training so that the mean is 0. Applies to regression only when the `use_bias` hyperparameter is set to `true`. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| use\_bias |  Specifies whether the model should include a bias term, which is the intercept term in the linear equation. **Optional** Valid values: `true` or `false` Default value: `true`  | 
| use\_lr\_scheduler |  Whether to use a scheduler for the learning rate. If you want to use a scheduler, specify `true`.  **Optional** Valid values: `true` or `false` Default value: `true`  | 
| wd |  The weight decay parameter, also known as the L2 regularization parameter. If you don't want to use L2 regularization, set the value to 0. **Optional** Valid values: `auto` or non-negative floating-point number Default value: `auto`  | 
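
For intuition, the following sketch shows the effect of `unbias_data` (shift to zero mean) and `normalize_data` (scale to unit standard deviation) on a hypothetical feature matrix. The built-in algorithm computes these statistics from a sample of `num_point_for_scaler` data points rather than from the full dataset, so this is an illustration, not the actual implementation:

```python
import numpy as np

# Hypothetical feature matrix: rows are observations, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# unbias_data: shift each feature to have zero mean.
X_centered = X - X.mean(axis=0)

# normalize_data: additionally scale each feature to unit standard deviation.
X_normalized = X_centered / X.std(axis=0)

print(X_normalized)
```

After this transformation, every column has mean 0 and standard deviation 1, which keeps features on very different scales (here, roughly 1–3 versus 200–600) from dominating the gradient updates.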

# Tune a linear learner model
<a name="linear-learner-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the objective metric. 

The linear learner algorithm also has an internal mechanism for tuning hyperparameters separate from the automatic model tuning feature described here. By default, the linear learner algorithm tunes hyperparameters by training multiple models in parallel. When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel models, `num_models`, to 1. The algorithm ignores any value that you set for `num_models`.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the linear learner algorithm
<a name="linear-learner-metrics"></a>

The linear learner algorithm reports the metrics in the following table, which are computed during training. Choose one of them as the objective metric. To avoid overfitting, we recommend tuning the model against a validation metric instead of a training metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:absolute\_loss |  The absolute loss of the final model on the test dataset. This objective metric is only valid for regression.  |  Minimize  | 
| test:binary\_classification\_accuracy |  The accuracy of the final model on the test dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:binary\_f\_beta |  The F-beta score of the final model on the test dataset. By default, it is the F1 score, which is the harmonic mean of precision and recall. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:dcg |  The discounted cumulative gain of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\_f\_beta |  The F-beta score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\_precision |  The precision score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\_recall |  The recall score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:mse |  The mean square error of the final model on the test dataset. This objective metric is only valid for regression.  |  Minimize  | 
| test:multiclass\_accuracy |  The accuracy of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:multiclass\_top\_k\_accuracy |  The accuracy among the top k labels predicted on the test dataset. If you choose this metric as the objective, we recommend setting the value of k using the `accuracy_top_k` hyperparameter. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:objective\_loss |  The mean value of the objective loss function on the test dataset after the model is trained. By default, the loss is logistic loss for binary classification and squared loss for regression. To set the loss to other types, use the `loss` hyperparameter.  |  Minimize  | 
| test:precision |  The precision of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the `binary_classifier_model_selection_criteria` hyperparameter to `precision_at_target_recall` and setting the value for the `target_recall` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:recall |  The recall of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the `binary_classifier_model_selection_criteria` hyperparameter to `recall_at_target_precision` and setting the value for the `target_precision` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:roc\_auc\_score |  The area under the receiver operating characteristic (ROC) curve of the final model on the test dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:absolute\_loss |  The absolute loss of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:binary\_classification\_accuracy |  The accuracy of the final model on the validation dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:binary\_f\_beta |  The F-beta score of the final model on the validation dataset. By default, the F-beta score is the F1 score, which is the harmonic mean of the `validation:precision` and `validation:recall` metrics. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:dcg |  The discounted cumulative gain of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\_f\_beta |  The F-beta score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\_precision |  The precision score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\_recall |  The recall score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:mse |  The mean square error of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:multiclass\_accuracy |  The accuracy of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:multiclass\_top\_k\_accuracy |  The accuracy among the top k labels predicted on the validation dataset. If you choose this metric as the objective, we recommend setting the value of k using the `accuracy_top_k` hyperparameter. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:objective\_loss |  The mean value of the objective loss function on the validation dataset every epoch. By default, the loss is logistic loss for binary classification and squared loss for regression. To set loss to other types, use the `loss` hyperparameter.  |  Minimize  | 
| validation:precision |  The precision of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the `binary_classifier_model_selection_criteria` hyperparameter to `precision_at_target_recall` and setting the value for the `target_recall` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:recall |  The recall of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the `binary_classifier_model_selection_criteria` hyperparameter to `recall_at_target_precision` and setting the value for the `target_precision` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:rmse |  The root mean square error of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:roc\_auc\_score |  The area under the receiver operating characteristic (ROC) curve of the final model on the validation dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
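
For intuition, the binary and macro F-beta metrics generalize the F1 score; a minimal sketch of the computation on hypothetical confusion counts:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta < 1 weights precision higher, beta > 1 weights recall higher."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
precision = 8 / (8 + 2)   # 0.8
recall = 8 / (8 + 4)      # 0.666...
f1 = f_beta(precision, recall)  # beta=1: the harmonic mean of precision and recall
print(round(f1, 3))
```

The macro variants compute this score per class and average across classes, which is why they apply to multiclass problems.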

## Tuning linear learner hyperparameters
<a name="linear-learner-tunable-hyperparameters"></a>

You can tune a linear learner model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| wd |  `ContinuousParameterRanges`  |  `MinValue`: `1e-7`, `MaxValue`: `1`  | 
| l1 |  `ContinuousParameterRanges`  |  `MinValue`: `1e-7`, `MaxValue`: `1`  | 
| learning\_rate |  `ContinuousParameterRanges`  |  `MinValue`: `1e-5`, `MaxValue`: `1`  | 
| mini\_batch\_size |  `IntegerParameterRanges`  |  `MinValue`: `100`, `MaxValue`: `5000`  | 
| use\_bias |  `CategoricalParameterRanges`  |  `[True, False]`  | 
| positive\_example\_weight\_mult |  `ContinuousParameterRanges`  |  `MinValue`: `1e-5`, `MaxValue`: `1e5`  | 
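
If you tune through the low-level `CreateHyperParameterTuningJob` API rather than the SageMaker Python SDK, the recommended ranges above map onto its `ParameterRanges` structure, in which all minimum and maximum values are passed as strings. A minimal sketch (the objective metric and job configuration are omitted):

```python
# ParameterRanges fragment for the linear learner tunable hyperparameters above.
# Note that Min/Max values are strings in this API.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "wd", "MinValue": "1e-7", "MaxValue": "1"},
        {"Name": "l1", "MinValue": "1e-7", "MaxValue": "1"},
        {"Name": "learning_rate", "MinValue": "1e-5", "MaxValue": "1"},
        {"Name": "positive_example_weight_mult", "MinValue": "1e-5", "MaxValue": "1e5"},
    ],
    "IntegerParameterRanges": [
        {"Name": "mini_batch_size", "MinValue": "100", "MaxValue": "5000"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "use_bias", "Values": ["true", "false"]},
    ],
}
```

This dictionary is what the SDK's `ContinuousParameter`, `IntegerParameter`, and `CategoricalParameter` helpers serialize to under the hood.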

# Linear learner response formats
<a name="LL-in-formats"></a>

## JSON response formats
<a name="LL-json"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). The following are the available output formats for the SageMaker AI linear learner algorithm.

**Binary Classification**

```
let response =   {
    "predictions":    [
        {
            "score": 0.4,
            "predicted_label": 0
        } 
    ]
}
```

**Multiclass Classification**

```
let response =   {
    "predictions":    [
        {
            "score": [0.1, 0.2, 0.4, 0.3],
            "predicted_label": 2
        } 
    ]
}
```

**Regression**

```
let response =   {
    "predictions":    [
        {
            "score": 0.4
        } 
    ]
}
```
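
A client can parse these responses with any JSON library; a minimal sketch for the binary classification format (the payload shown is hypothetical):

```python
import json

# Hypothetical response body from a linear learner binary classification endpoint.
body = '{"predictions": [{"score": 0.4, "predicted_label": 0}]}'

result = json.loads(body)
for prediction in result["predictions"]:
    print(prediction["predicted_label"], prediction["score"])
```

The multiclass and regression formats parse the same way; only the shape of `score` and the presence of `predicted_label` differ.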

## JSONLINES response formats
<a name="LL-jsonlines"></a>

**Binary Classification**

```
{"score": 0.4, "predicted_label": 0}
```

**Multiclass Classification**

```
{"score": [0.1, 0.2, 0.4, 0.3], "predicted_label": 2}
```

**Regression**

```
{"score": 0.4}
```

## RECORDIO response formats
<a name="LL-recordio"></a>

**Binary Classification**

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
```

**Multiclass Classification**

```
[
    Record = {
    "features": [],
    "label":    {
            "score":  {
                    "values":   [0.1, 0.2, 0.3, 0.4]   
            },
            "predicted_label":  {
                    "values":   [3]
            }
       },
    "uid":  "abc123",
    "metadata": "{created_at: '2017-06-03'}"
   }
]
```

**Regression**

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }   
        }
    }
]
```

# TabTransformer
<a name="tabtransformer"></a>

[TabTransformer](https://arxiv.org/abs/2012.06678) is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability. This page includes information about Amazon EC2 instance recommendations and sample notebooks for TabTransformer.

# How to use SageMaker AI TabTransformer
<a name="tabtransformer-modes"></a>

You can use TabTransformer as an Amazon SageMaker AI built-in algorithm. The following section describes how to use TabTransformer with the SageMaker Python SDK. For information on how to use TabTransformer from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use TabTransformer as a built-in algorithm**

  Use the TabTransformer built-in algorithm to build a TabTransformer training container as shown in the following code example. You can retrieve the TabTransformer built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using version 1 of the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable)). 

  After specifying the TabTransformer image URI, you can use the TabTransformer container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The TabTransformer built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own TabTransformer training scripts.

  ```
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Session, region, and execution role used later in this example
  sess = sagemaker.Session()
  aws_region = sess.boto_region_name
  aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "pytorch-tabtransformerclassification-model", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "n_epochs"
  ] = "50"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the TabTransformer as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb)
  + [Tabular regression with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Regression_TabTransformer.ipynb)

# Input and Output interface for the TabTransformer algorithm
<a name="InputOutput-TabTransformer"></a>

TabTransformer operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of TabTransformer supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
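
A minimal sketch of writing training data in this layout, using hypothetical values: the target in the first column, categorical features encoded as integers, and no header row:

```python
import csv

# Hypothetical rows: the target comes first, followed by the feature columns.
rows = [
    [1, 5.1, 3.5, 2],
    [0, 4.9, 3.0, 1],
]

with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

For inference, drop the first column so that each row contains only the feature values.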

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the TabTransformer model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `training` or `validation` channels, the TabTransformer algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
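
A sketch of generating the `categorical_index.json` file described above (the column indices here are hypothetical):

```python
import json

# Columns 3 and 5 of the training CSV hold categorical features.
# Indices must be >= 1 because column 0 holds the target.
categorical_index = {"cat_index_list": [3, 5]}

with open("categorical_index.json", "w") as f:
    json.dump(categorical_index, f)
```

Upload this file to the same S3 prefix that your `training` channel points to.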

**Use only the `training` channel**:

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

## Amazon EC2 instance recommendation for the TabTransformer algorithm
<a name="Instance-TabTransformer"></a>

SageMaker AI TabTransformer supports single-instance CPU and single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker AI TabTransformer currently does not support multi-GPU training.

## TabTransformer sample notebooks
<a name="tabtransformer-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI TabTransformer algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI TabTransformer algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Regression_TabTransformer.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI TabTransformer algorithm to train and host a tabular regression model.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How TabTransformer works
<a name="tabtransformer-HowItWorks"></a>

TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.

TabTransformer performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use TabTransformer for regression, classification (binary and multiclass), and ranking problems.

The following diagram illustrates the TabTransformer architecture.

![\[The architecture of TabTransformer.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/tabtransformer_illustration.png)


For more information, see *[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/abs/2012.06678)*.

# TabTransformer hyperparameters
<a name="tabtransformer-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI TabTransformer algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI TabTransformer algorithm is an implementation of the open-source [TabTransformer](https://github.com/jrzaurin/pytorch-widedeep) package.

**Note**  
The default hyperparameters are based on example datasets in the [TabTransformer sample notebooks](tabtransformer.md#tabtransformer-sample-notebooks).

The SageMaker AI TabTransformer algorithm automatically chooses an evaluation metric and objective function based on the type of classification problem. The TabTransformer algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is R-squared and the objective function is mean squared error. For binary classification problems, the evaluation metric and objective function are both binary cross entropy. For multiclass classification problems, the evaluation metric and objective function are both multiclass cross entropy.
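
The detection described above can be sketched as follows. This is an illustrative approximation, not the built-in algorithm's actual code:

```python
def detect_task_type(labels):
    """Guess the problem type from the label column (illustrative only)."""
    unique = set(labels)
    # Non-integer label values indicate a continuous target.
    if any(isinstance(v, float) and not v.is_integer() for v in unique):
        return "regression"
    # Exactly two unique integer labels indicate binary classification.
    if len(unique) == 2:
        return "binary"
    return "multiclass"

print(detect_task_type([0, 1, 1, 0]))      # binary
print(detect_task_type([0, 1, 2, 3]))      # multiclass
print(detect_task_type([0.5, 1.25, 3.7]))  # regression
```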

**Note**  
The TabTransformer evaluation metric and objective functions are not currently available as hyperparameters. Instead, the SageMaker AI TabTransformer built-in algorithm automatically detects the type of classification task (regression, binary, or multiclass) based on the number of unique integers in the label column and assigns an evaluation metric and objective function.


| Parameter Name | Description | 
| --- | --- | 
| n\_epochs |  The number of epochs to train the deep neural network. Valid values: integer, range: Positive integer. Default value: `5`.  | 
| patience |  Training stops if one metric on the validation data does not improve in the last `patience` rounds. Valid values: integer, range: (`2`, `60`). Default value: `10`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: Positive floating-point number. Default value: `0.001`.  | 
| batch\_size |  The number of examples propagated through the network in each training pass.  Valid values: integer, range: (`1`, `2048`). Default value: `256`.  | 
| input\_dim |  The dimension of the embeddings used to encode the categorical and/or continuous columns. Valid values: string, any of the following: `"16"`, `"32"`, `"64"`, `"128"`, `"256"`, or `"512"`. Default value: `"32"`.  | 
| n\_blocks |  The number of Transformer encoder blocks. Valid values: integer, range: (`1`, `12`). Default value: `4`.  | 
| attn\_dropout |  The dropout rate applied to the Multi-Head Attention layers. Valid values: float, range: (`0`, `1`). Default value: `0.2`.  | 
| mlp\_dropout |  The dropout rate applied to the FeedForward network within the encoder layers and the final MLP layers on top of the Transformer encoders. Valid values: float, range: (`0`, `1`). Default value: `0.1`.  | 
| frac\_shared\_embed |  The fraction of embeddings shared by all the different categories for one particular column. Valid values: float, range: (`0`, `1`). Default value: `0.25`.  | 

# Tune a TabTransformer model
<a name="tabtransformer-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters: 

**Note**  
The learning objective function and evaluation metric are both automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [TabTransformer hyperparameters](tabtransformer-hyperparameters.md).
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your chosen hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for TabTransformer is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the TabTransformer algorithm
<a name="tabtransformer-metrics"></a>

The SageMaker AI TabTransformer algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| r2 | R squared (regression) | maximize | "metrics={'r2': (\\S+)}" | 
| f1\_score | F1 score (binary classification) | maximize | "metrics={'f1': (\\S+)}" | 
| accuracy\_score | accuracy (multiclass classification) | maximize | "metrics={'accuracy': (\\S+)}" | 
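
The regex pattern in each row is a standard Python regular expression applied to the training log. A minimal sketch, assuming a hypothetical log line of this shape:

```
import re

# Hypothetical training-log line; the exact log format is an assumption
# made for illustration only.
log_line = "metrics={'f1': 0.8421}"

# Same shape as the f1_score pattern in the table above.
pattern = r"metrics=\{'f1': (\S+)\}"
match = re.search(pattern, log_line)
f1 = float(match.group(1))
print(f1)  # 0.8421
```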

## Tunable TabTransformer hyperparameters
<a name="tabtransformer-tunable-hyperparameters"></a>

Tune the TabTransformer model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the TabTransformer evaluation metrics are: `learning_rate`, `input_dim`, `n_blocks`, `attn_dropout`, `mlp_dropout`, and `frac_shared_embed`. For a list of all the TabTransformer hyperparameters, see [TabTransformer hyperparameters](tabtransformer-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| input\_dim | CategoricalParameterRanges | [16, 32, 64, 128, 256, 512] | 
| n\_blocks | IntegerParameterRanges | MinValue: 1, MaxValue: 12 | 
| attn\_dropout | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.8 | 
| mlp\_dropout | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.8 | 
| frac\_shared\_embed | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.5 | 

# XGBoost algorithm with Amazon SageMaker AI
<a name="xgboost"></a>

[XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:
+ Its robust handling of a variety of data types, relationships, and distributions.
+ The variety of hyperparameters that you can fine-tune.

You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. 

You can use the new release of the XGBoost algorithm as either:
+ An Amazon SageMaker AI built-in algorithm.
+ A framework to run training scripts in your local environments.

This implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and a larger set of metrics than the original versions. It provides an XGBoost `estimator` that runs a training script in a managed XGBoost environment. The current release of SageMaker AI XGBoost is based on the original XGBoost versions 1.0, 1.2, 1.3, 1.5, 1.7, and 3.0.

For more information about the Amazon SageMaker AI XGBoost algorithm, see the following blog posts:
+ [Introducing the open-source Amazon SageMaker AI XGBoost algorithm container](https://aws.amazon.com/blogs/machine-learning/introducing-the-open-source-amazon-sagemaker-xgboost-algorithm-container/)
+ [Amazon SageMaker AI XGBoost now offers fully distributed GPU training](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-xgboost-now-offers-fully-distributed-gpu-training/)

## Supported versions
<a name="xgboost-supported-versions"></a>

For more details, see our [support policy](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-support-policy.html#pre-built-containers-support-policy-ml-framework).
+ Framework (open source) mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5
+ Algorithm mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5

**Warning**  
Due to required compute capacity, version 3.0-5 of SageMaker AI XGBoost is not compatible with GPU instances from the P3 instance family for training or inference.

**Warning**  
Due to package compatibility issues, version 3.0-5 of SageMaker AI XGBoost does not support SageMaker Debugger.

**Warning**  
Due to required compute capacity, version 1.7-1 of SageMaker AI XGBoost is not compatible with GPU instances from the P2 instance family for training or inference.

**Warning**  
Network Isolation Mode: Do not upgrade pip beyond version 25.2. Newer versions may attempt to fetch setuptools from PyPI during module installation.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

**Warning**  
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for XGBoost 0.90 is discontinued. We highly recommend that you upgrade the XGBoost version to one of the newer versions.

**Note**  
XGBoost v1.1 is not supported on SageMaker AI. XGBoost 1.1 cannot run prediction when the test input has fewer features than the training data in LIBSVM inputs. This capability has been restored in XGBoost v1.2. Consider using SageMaker AI XGBoost 1.2-2 or later.

**Note**  
You can use XGBoost v1.0-1, but it's not officially supported.

## EC2 instance recommendation for the XGBoost algorithm
<a name="Instance-XGBoost"></a>

SageMaker AI XGBoost supports CPU and GPU training and inference. Instance recommendations depend on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the following options for more information:
+ [CPU training](#Instance-XGBoost-training-cpu)
+ [GPU training](#Instance-XGBoost-training-gpu)
+ [Distributed CPU training](#Instance-XGBoost-distributed-training-cpu)
+ [Distributed GPU training](#Instance-XGBoost-distributed-training-gpu)
+ [Inference](#Instance-XGBoost-inference)

### Training
<a name="Instance-XGBoost-training"></a>

The SageMaker AI XGBoost algorithm supports CPU and GPU training.

#### CPU training
<a name="Instance-XGBoost-training-cpu"></a>

SageMaker AI XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. It supports the use of disk space to handle data that does not fit into main memory. This is a result of the out-of-core feature available with the libsvm input mode. Even so, writing cache files onto disk slows the algorithm processing time. 

#### GPU training
<a name="Instance-XGBoost-training-gpu"></a>

SageMaker AI XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. 

SageMaker AI XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

SageMaker AI XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.

SageMaker AI XGBoost version 3.0-5 or later supports G4dn and G5 GPU instance families. Note that due to compute capacity requirements, version 3.0-5 or later does not support the P3 instance family.

To take advantage of GPU training:
+ Specify the instance type as one of the GPU instances (for example, G4dn) 
+ Set the `tree_method` hyperparameter to `gpu_hist` in your existing XGBoost script

### Distributed training
<a name="Instance-XGBoost-distributed-training"></a>

SageMaker AI XGBoost supports CPU and GPU instances for distributed training.

#### Distributed CPU training
<a name="Instance-XGBoost-distributed-training-cpu"></a>

To run CPU training on multiple instances, set the `instance_count` parameter for the estimator to a value greater than one. The input data must be divided between the total number of instances. 

##### Divide input data across instances
<a name="Instance-XGBoost-distributed-training-divide-data"></a>

Divide the input data using the following steps:

1. Break the input data down into smaller files. The number of files should be at least equal to the number of instances used for distributed training. Using multiple smaller files as opposed to one large file also decreases the data download time for the training job.

1. When creating your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html), set the distribution parameter to `ShardedByS3Key`. With this, each instance gets approximately *1/n* of the number of files in S3 if there are *n* instances specified in the training job.
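
Step 1 above can be sketched as a simple round-robin split. The shard count and the in-memory buffers here are illustrative stand-ins for real files written under an S3 prefix:

```
import csv
import io

# Hypothetical dataset: split rows into one shard per training instance so
# that ShardedByS3Key can hand each instance its own files.
rows = [[i, i * 0.5] for i in range(10)]
instance_count = 3

shards = [io.StringIO() for _ in range(instance_count)]
for i, row in enumerate(rows):
    # Round-robin assignment keeps the shards approximately equal in size.
    csv.writer(shards[i % instance_count]).writerow(row)

# In practice, write each shard to a separate object under one S3 prefix.
sizes = [len(s.getvalue().splitlines()) for s in shards]
print(sizes)  # [4, 3, 3]
```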

#### Distributed GPU training
<a name="Instance-XGBoost-distributed-training-gpu"></a>

You can use distributed training with either single-GPU or multi-GPU instances.

**Distributed training with single-GPU instances**

SageMaker AI XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means that even if you select a multi-GPU instance, only one GPU is used per instance.

You must divide your input data between the total number of instances if: 
+ You use XGBoost versions 1.2-2 through 1.3-1.
+ You do not need to use multi-GPU instances.

For more information, see [Divide input data across instances](#Instance-XGBoost-distributed-training-divide-data).

**Note**  
Versions 1.2-2 through 1.3-1 of SageMaker AI XGBoost only use one GPU per instance even if you choose a multi-GPU instance.

**Distributed training with multi-GPU instances**

Starting with version 1.5-1, SageMaker AI XGBoost offers distributed GPU training with [Dask](https://www.dask.org/). With Dask you can utilize all GPUs when using one or more multi-GPU instances. Dask also works when using single-GPU instances. 

Train with Dask using the following steps:

1. Either omit the `distribution` parameter in your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) or set it to `FullyReplicated`.

1. When defining your hyperparameters, set `use_dask_gpu_training` to `"true"`.

**Important**  
Distributed training with Dask only supports CSV and Parquet input formats. If you use other data formats such as LIBSVM or PROTOBUF, the training job fails.   
For Parquet data, ensure that the column names are saved as strings. Columns that have names of other data types will fail to load.

**Important**  
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the training job fails.

There are a few considerations to be aware of when training SageMaker AI XGBoost with Dask. Be sure to split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for every GPU. As a result, the number of files should be greater than the total number of GPUs (instance count x number of GPUs per instance). Having a very large number of files can also degrade performance. For more information, see [Dask Best Practices](https://docs.dask.org/en/stable/best-practices.html).
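
The file-count rule above is simple arithmetic. For example, under a hypothetical cluster of 4 instances with 8 GPUs each:

```
# Hypothetical cluster: 4 instances with 8 GPUs each. Dask starts one worker
# per GPU, so the dataset should be split into more files than total GPUs.
instance_count = 4
gpus_per_instance = 8

total_gpus = instance_count * gpus_per_instance
min_files = total_gpus + 1  # strictly greater than the number of GPUs
print(total_gpus, min_files)  # 32 33
```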

#### Variations in output
<a name="Instance-XGBoost-distributed-training-output"></a>

The specified `tree_method` hyperparameter determines the algorithm that is used for XGBoost training. The tree methods `approx`, `hist` and `gpu_hist` are all approximate methods and use sketching for quantile calculation. For more information, see [Tree Methods](https://xgboost.readthedocs.io/en/stable/treemethod.html) in the XGBoost documentation. Sketching is an approximate algorithm. Therefore, you can expect variations in the model depending on factors such as the number of workers chosen for distributed training. The significance of the variation is data-dependent.
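
The following self-contained sketch hints at why sketching-based quantile calculation varies with the number of workers: quantile boundaries computed per shard and then merged are close to, but not identical with, the exact boundaries. The merge rule here (averaging) is a deliberate simplification, not the actual XGBoost sketch algorithm:

```
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(10_000)]

# Exact quartile boundaries computed on the full dataset.
exact = statistics.quantiles(data, n=4)

# Simplified stand-in for sketching: two "workers" each compute quartiles
# on their own shard, and the boundaries are then averaged.
half = len(data) // 2
shard_a = statistics.quantiles(data[:half], n=4)
shard_b = statistics.quantiles(data[half:], n=4)
merged = [(a + b) / 2 for a, b in zip(shard_a, shard_b)]

# The merged boundaries are close to, but not identical with, the exact ones.
diffs = [abs(e - m) for e, m in zip(exact, merged)]
print(round(max(diffs), 4))
```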

### Inference
<a name="Instance-XGBoost-inference"></a>

SageMaker AI XGBoost supports CPU and GPU instances for inference. For information about the instance types for inference, see [Amazon SageMaker AI ML Instance Types](https://aws.amazon.com/sagemaker/pricing/).

# How to use SageMaker AI XGBoost
<a name="xgboost-how-to-use"></a>

With SageMaker AI, you can use XGBoost as a built-in algorithm or as a framework. When you use XGBoost as a framework, you have more flexibility and access to more advanced scenarios because you can customize your own training scripts. The following sections describe how to use XGBoost with the SageMaker Python SDK, and the input/output interface for the XGBoost algorithm. For information on how to use XGBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

**Topics**
+ [Use XGBoost as a framework](#xgboost-how-to-framework)
+ [Use XGBoost as a built-in algorithm](#xgboost-how-to-built-in)
+ [Input/Output interface for the XGBoost algorithm](#InputOutput-XGBoost)

## Use XGBoost as a framework
<a name="xgboost-how-to-framework"></a>

Use XGBoost as a framework to run your customized training scripts that can incorporate additional data processing into your training jobs. In the following code example, the SageMaker Python SDK provides the XGBoost API as a framework, similar to how SageMaker AI provides other framework APIs such as TensorFlow, MXNet, and PyTorch.

```
import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "verbosity":"1",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker AI XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py", 
                    framework_version='1.7-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For an end-to-end example of using SageMaker AI XGBoost as a framework, see [Regression with Amazon SageMaker AI XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone_dist_script_mode.html).

## Use XGBoost as a built-in algorithm
<a name="xgboost-how-to-built-in"></a>

Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following code example. You can retrieve the XGBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API. If you are using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1, use the `get_image_uri` API instead. To make sure that the `image_uris.retrieve` API finds the correct URI, see [Common parameters for built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html). Then look up `xgboost` from the full list of built-in algorithm image URIs and available regions.

After specifying the XGBoost image URI, use the XGBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. This XGBoost built-in algorithm mode does not incorporate your own XGBoost training script and runs directly on the input datasets.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](xgboost.md#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

```
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker AI estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For more information about how to set up the XGBoost as a built-in algorithm, see the following notebook examples.
+ [Managed Spot Training for XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html)
+ [Regression with Amazon SageMaker AI XGBoost (Parquet input)](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html)

## Input/Output interface for the XGBoost algorithm
<a name="InputOutput-XGBoost"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports the following data formats for training and inference:
+  *text/libsvm* (default) 
+  *text/csv*
+  *application/x-parquet*
+  *application/x-recordio-protobuf*

**Note**  
There are a few considerations to be aware of regarding training and inference input:  
For increased performance, we recommend using XGBoost with *File mode*, in which your data from Amazon S3 is stored on the training instance volumes.
For training with columnar input, the algorithm assumes that the target variable (label) is the first column. For inference, the algorithm assumes that the input has no label column.
For CSV data, the input should not have a header record.
For LIBSVM training, the algorithm assumes that subsequent columns after the label column contain the zero-based index-value pairs for features. So each row has the format: <label> <index0>:<value0> <index1>:<value1> ...
For information on instance types and distributed training, see [EC2 instance recommendation for the XGBoost algorithm](xgboost.md#Instance-XGBoost).
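
As a minimal sketch of the LIBSVM row format described in the note above (the helper function is hypothetical, not part of any SageMaker API):

```
# Build a LIBSVM-format row from a label and a sparse feature mapping.
# Indices are zero-based, matching the convention described above.
def to_libsvm(label, features):
    """features: dict mapping zero-based index -> value; zeros are omitted."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

row = to_libsvm(1, {0: 0.5, 3: 2.0})
print(row)  # 1 0:0.5 3:2.0
```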

For CSV training input mode, the total memory available to the algorithm must be able to hold the training dataset. The total memory available is calculated as `Instance Count * the memory available in the InstanceType`. For LIBSVM training input mode, this is not required, but we recommend it.
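
A back-of-envelope check of that memory rule, under assumed row and column counts and an assumed instance memory size:

```
# Hypothetical check that a CSV training dataset fits in total memory:
# total memory = instance count x memory per instance.
rows, cols = 10_000_000, 50
bytes_per_value = 8                      # float64, a conservative assumption
dataset_gib = rows * cols * bytes_per_value / 2**30

instance_count = 2
memory_per_instance_gib = 32             # assumed memory per instance
total_memory_gib = instance_count * memory_per_instance_gib

print(round(dataset_gib, 2), total_memory_gib, dataset_gib < total_memory_gib)
```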

For v1.3-1 and later, SageMaker AI XGBoost saves the model in the XGBoost internal binary format, using `Booster.save_model`. Previous versions use the Python pickle module to serialize/deserialize the model.

**Note**  
Be mindful of versions when using a SageMaker AI XGBoost model in open source XGBoost. Versions 1.3-1 and later use the XGBoost internal binary format while previous versions use the Python pickle module.

**To use a model trained with SageMaker AI XGBoost v1.3-1 or later in open source XGBoost**
+ Use the following Python code:

  ```
  import xgboost as xgb
  
  xgb_model = xgb.Booster()
  xgb_model.load_model(model_file_path)
  xgb_model.predict(dtest)
  ```

**To use a model trained with previous versions of SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  model = pkl.load(open(model_file_path, 'rb'))
  
  # prediction with test data
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labeled data points, use instance weight support**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labeled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example: `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
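
A minimal sketch of building weighted *text/csv* rows for use with `csv_weights` (the sample values are invented for illustration):

```
import csv
import io

# Hypothetical weighted CSV rows for csv_weights=1: label first, weight
# second, then features. No header record, per the input requirements above.
samples = [
    (1, 2.0, [0.5, 1.2]),   # weight 2.0: counts twice as much in training
    (0, 1.0, [0.1, 0.9]),
]

buf = io.StringIO()
writer = csv.writer(buf)
for label, weight, feats in samples:
    writer.writerow([label, weight, *feats])

print(buf.getvalue().strip())
```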

# XGBoost sample notebooks
<a name="xgboost-sample-notebooks"></a>

The following list contains a variety of sample Jupyter notebooks that address different use cases of the Amazon SageMaker AI XGBoost algorithm.
+ [How to Create a Custom XGBoost container](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/sagemaker_studio_image_build/xgboost_bring_your_own/Batch_Transform_BYO_XGB.html) – This notebook shows you how to build a custom XGBoost Container with Amazon SageMaker AI Batch Transform.
+ [Regression with XGBoost using Parquet](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html) – This notebook shows you how to use the Abalone dataset in Parquet to train an XGBoost model.
+ [How to Train and Host a Multiclass Classification Model](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_mnist/xgboost_mnist.html) – This notebook shows how to use the MNIST dataset to train and host a multiclass classification model.
+ [How to train a Model for Customer Churn Prediction](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html) – This notebook shows you how to train a model to predict mobile customer departure in an effort to identify unhappy customers.
+ [An Introduction to Amazon SageMaker AI Managed Spot infrastructure for XGBoost Training](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html) – This notebook shows you how to use Spot Instances for training with an XGBoost container.
+ [How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) – This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to detect inconsistencies using built-in debugging rules.

For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The example notebooks that use the XGBoost algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How the SageMaker AI XGBoost algorithm works
<a name="xgboost-HowItWorks"></a>

[XGBoost](https://github.com/dmlc/xgboost) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

When using [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) for regression, the weak learners are regression trees, and each regression tree maps an input data point to one of its leaves that contains a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction. It's called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.

The following is a brief illustration of how gradient tree boosting works.

![\[A diagram illustrating gradient tree boosting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/xgboost_illustration.png)


**For more detail on XGBoost, see:**
+ [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754.pdf)
+ [Gradient Tree Boosting ](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf#page=380)
+ [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)

# XGBoost hyperparameters
<a name="xgboost_hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source DMLC XGBoost package. For details about the full set of hyperparameters that can be configured for this version of XGBoost, see [XGBoost Parameters](https://xgboost.readthedocs.io/en/release_1.2.0/).


| Parameter Name | Description | 
| --- | --- | 
| num\_class |  The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: Integer.  | 
| num\_round |  The number of rounds to run the training. **Required** Valid values: Integer.  | 
| alpha |  L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 0  | 
| base\_score |  The initial prediction score of all instances, global bias. **Optional** Valid values: Float. Default value: 0.5  | 
| booster |  Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `"gbtree"`, `"gblinear"`, or `"dart"`. Default value: `"gbtree"`  | 
| colsample\_bylevel |  Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample\_bynode |  Subsample ratio of columns from each node. **Optional** Valid values: Float. Range: (0,1]. Default value: 1  | 
| colsample\_bytree |  Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| csv\_weights |  When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| deterministic\_histogram |  When this flag is enabled, XGBoost builds histogram on GPU deterministically. Used only if `tree_method` is set to `gpu_hist`. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"`. Default value: `"true"`  | 
| early\_stopping\_rounds |  The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: Integer. Default value: -  | 
| eta |  Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval\_metric |  Evaluation metrics for validation data. A default metric is assigned according to the objective: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) For a list of valid inputs, see [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String. Default value: Default according to objective.  | 
| gamma |  Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow\_policy |  Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `"depthwise"` or `"lossguide"`. Default value: `"depthwise"`  | 
| interaction\_constraints |  Specify groups of variables that are allowed to interact. **Optional** Valid values: Nested list of integers. Each integer represents a feature, and each nested list contains features that are allowed to interact e.g., [[1,2], [3,4,5]]. Default value: None  | 
| lambda |  L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 1  | 
| lambda\_bias |  L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max\_bin |  Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: Integer. Default value: 256  | 
| max\_delta\_step |  Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max\$1depth |  Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfit. 0 indicates no limit. A limit is required when `grow_policy`=`depth-wise`. **Optional** Valid values: Integer. Range: [0,∞) Default value: 6  | 
| max\$1leaves |  Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: Integer. Default value: 0  | 
| min\$1child\$1weight |  Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the algorithm, the more conservative it is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| monotone\$1constraints |  Specifies monotonicity constraints on any feature. **Optional** Valid values: Tuple of Integers. Valid integers: -1 (decreasing constraint), 0 (no constraint), 1 (increasing constraint).  E.g., (0, 1): No constraint on first predictor, and an increasing constraint on the second. (-1, 1): Decreasing constraint on first predictor, and an increasing constraint on the second. Default value: (0, 0)  | 
| normalize\$1type |  Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread |  Number of parallel threads used to run *xgboost*. **Optional** Valid values: Integer. Default value: Maximum number of threads.  | 
| objective |  Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `reg:squarederror`. For a full list of valid inputs, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String Default value: `"reg:squarederror"`  | 
| one\_drop |  When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process\_type |  The type of boosting process to run. **Optional** Valid values: String. Either `"default"` or `"update"`. Default value: `"default"`  | 
| rate\_drop |  The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh\_leaf |  This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false`(0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample\_type |  Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale\_pos\_weight |  Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: float Default value: 1  | 
| seed |  Random number seed. **Optional** Valid values: integer Default value: 0  | 
| single\_precision\_histogram |  When this flag is enabled, XGBoost uses single precision to build histograms instead of double precision. Used only if `tree_method` is set to `hist` or `gpu_hist`. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| sketch\_eps |  Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float, Range: [0, 1]. Default value: 0.03  | 
| skip\_drop |  Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample |  Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree\_method |  The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, `hist`, or `gpu_hist`. Default value: `auto`  | 
| tweedie\_variance\_power |  Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater |  A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: comma-separated string. Default value: `grow_colmaker`, prune  | 
| use\_dask\_gpu\_training |  Set `use_dask_gpu_training` to `"true"` if you want to run distributed GPU training with Dask. Dask GPU training is only supported for versions 1.5-1 and later. Do not set this value to `"true"` for versions preceding 1.5-1. For more information, see [Distributed GPU training](xgboost.md#Instance-XGBoost-distributed-training-gpu). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| verbosity | Verbosity of printing messages. Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug). **Optional** Default value: 1  | 

# Tune an XGBoost Model
<a name="xgboost-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

**Note**  
Automatic model tuning for XGBoost 0.90 is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation Metrics Computed by the XGBoost Algorithm
<a name="xgboost-metrics"></a>

The XGBoost algorithm computes the following metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For the full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Classification rate, calculated as \#(right)/\#(all cases).  |  Maximize  | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as \#(wrong cases)/\#(all cases).  |  Minimize  | 
| validation:f1 |  Indicator of classification accuracy, calculated as the harmonic mean of precision and recall.  |  Maximize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as \#(wrong cases)/\#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:mse |  Mean squared error.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 
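To make the optimization directions concrete, the following stand-alone sketch computes `validation:error` and `validation:logloss` by hand for a small binary example. The formulas are simplified for illustration; XGBoost's own implementations are authoritative.

```python
import math

def error_rate(y_true, y_pred_label):
    # validation:error is #(wrong cases)/#(all cases); minimize
    wrong = sum(1 for t, p in zip(y_true, y_pred_label) if t != p)
    return wrong / len(y_true)

def logloss(y_true, y_prob):
    # validation:logloss is the negative log-likelihood; minimize
    eps = 1e-15  # clamp probabilities away from 0 and 1
    return -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]
y_label = [1 if p >= 0.5 else 0 for p in y_prob]

print(error_rate(y_true, y_label))  # 0.25 (one of four cases is wrong)
```

A lower value is better for both metrics, which is why their optimization direction in the table is Minimize.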

## Tunable XGBoost Hyperparameters
<a name="xgboost-tunable-hyperparameters"></a>

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample\_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bynode |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max\_delta\_step |  IntegerParameterRanges  |  [0, 10]  | 
| max\_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min\_child\_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num\_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
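The recommended ranges above can be written out in plain Python to sketch what a tuning job explores. This is an illustrative random-search draw only; in the SageMaker Python SDK these ranges are typically expressed with `ContinuousParameter` and `IntegerParameter` objects passed to a `HyperparameterTuner`.

```python
import random

# Recommended ranges from the table above as (kind, MinValue, MaxValue);
# "continuous" vs "integer" mirrors ContinuousParameterRanges vs
# IntegerParameterRanges.
ranges = {
    "alpha": ("continuous", 0, 1000),
    "eta": ("continuous", 0.1, 0.5),
    "min_child_weight": ("continuous", 0, 120),
    "max_depth": ("integer", 0, 10),
    "subsample": ("continuous", 0.5, 1),
    "num_round": ("integer", 1, 4000),
}

def sample_config(ranges, rng=random):
    # Draw one candidate hyperparameter combination, as a tuning job would.
    return {
        name: rng.randint(lo, hi) if kind == "integer" else rng.uniform(lo, hi)
        for name, (kind, lo, hi) in ranges.items()
    }

candidate = sample_config(ranges)
```

Automatic model tuning is smarter than pure random search (it uses the evaluation metric from completed jobs to choose the next candidates), but each candidate it tests is a combination drawn from ranges like these.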

# Deprecated Versions of XGBoost and their Upgrades
<a name="xgboost-previous-versions"></a>

This topic contains documentation for previous versions of Amazon SageMaker AI XGBoost that are still available but deprecated. It also provides instructions on how to upgrade deprecated versions of XGBoost, when possible, to more current versions.

**Topics**
+ [Upgrade XGBoost Version 0.90 to Version 1.5](xgboost-version-0.90.md)
+ [XGBoost Version 0.72](xgboost-72.md)

# Upgrade XGBoost Version 0.90 to Version 1.5
<a name="xgboost-version-0.90"></a>

If you are using the SageMaker Python SDK, to upgrade existing XGBoost 0.90 jobs to version 1.5, you must have version 2.x of the SDK installed and change the XGBoost `version` and `framework_version` parameters to 1.5-1. If you are using Boto3, you need to update the Docker image, and a few hyperparameters and learning objectives.

**Topics**
+ [Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x](#upgrade-xgboost-version-0.90-sagemaker-python-sdk)
+ [Change the image tag to 1.5-1](#upgrade-xgboost-version-0.90-change-image-tag)
+ [Change Docker Image for Boto3](#upgrade-xgboost-version-0.90-boto3)
+ [Update Hyperparameters and Learning Objectives](#upgrade-xgboost-version-0.90-hyperparameters)

## Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x
<a name="upgrade-xgboost-version-0.90-sagemaker-python-sdk"></a>

If you are still using version 1.x of the SageMaker Python SDK, you must upgrade to version 2.x of the SageMaker Python SDK. For information on the latest version of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). To install the latest version, run:

```
python -m pip install --upgrade sagemaker
```

## Change the image tag to 1.5-1
<a name="upgrade-xgboost-version-0.90-change-image-tag"></a>

If you are using the SageMaker Python SDK and the XGBoost built-in algorithm, change the version parameter in `image_uris.retrieve`.

```
import sagemaker
from sagemaker import image_uris

xgboost_container = image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)
```

If you are using the SageMaker Python SDK and using XGBoost as a framework to run your customized training scripts, change the `framework_version` parameter in the XGBoost API.

```
from sagemaker.xgboost import XGBoost

estimator = XGBoost(entry_point="your_xgboost_abalone_script.py", 
                    framework_version='1.5-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)
```

`sagemaker.session.s3_input` in SageMaker Python SDK version 1.x has been renamed to `sagemaker.inputs.TrainingInput`. You must use `sagemaker.inputs.TrainingInput` as in the following example.

```
from sagemaker.inputs import TrainingInput

content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)
```

 For the full list of SageMaker Python SDK version 2.x changes, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 

## Change Docker Image for Boto3
<a name="upgrade-xgboost-version-0.90-boto3"></a>

If you are using Boto3 to train or deploy your model, change the Docker image tag (`1`, `0.72`, `0.90-1`, or `0.90-2`) to `1.5-1`.

```
{
    "AlgorithmSpecification": {
        "TrainingImage": "746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:1.5-1"
    }
    ...
}
```

If you are using the SageMaker Python SDK to retrieve the registry path, change the `version` parameter in `image_uris.retrieve`.

```
from sagemaker import image_uris
image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")
```

## Update Hyperparameters and Learning Objectives
<a name="upgrade-xgboost-version-0.90-hyperparameters"></a>

The `silent` parameter has been deprecated and is no longer available in XGBoost 1.5 and later versions. Use `verbosity` instead. The `reg:linear` learning objective has also been deprecated. Use `reg:squarederror` instead.

```
hyperparameters = {
    "verbosity": "2",
    "objective": "reg:squarederror",
    "num_round": "50",
    ...
}

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          ...)
```

# XGBoost Version 0.72
<a name="xgboost-72"></a>

**Important**  
XGBoost 0.72 is deprecated by Amazon SageMaker AI. You can still use this old version of XGBoost (as a built-in algorithm) by pulling its image URI as shown in the following code samples. For XGBoost, the image URI ending with `:1` is for the old version. The first sample uses version 1.x of the SageMaker Python SDK (`get_image_uri`); the second uses version 2.x (`image_uris.retrieve`).  

```
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

xgb_image_uri = get_image_uri(boto3.Session().region_name, "xgboost", repo_version="1")
```

```
import boto3
from sagemaker import image_uris

xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")
```
If you want to use newer versions, you have to explicitly specify the image URI tags (see [Supported versions](xgboost.md#xgboost-supported-versions)).

This previous release of the Amazon SageMaker AI XGBoost algorithm is based on the 0.72 release. [XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost has done remarkably well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and because of the large number of hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid choice for problems in regression, classification (binary and multiclass), and ranking.

Customers should consider using the new release of [XGBoost algorithm with Amazon SageMaker AI](xgboost.md). They can use it as a SageMaker AI built-in algorithm or as a framework to run scripts in their local environments, as they typically would with, for example, the TensorFlow deep learning framework. The new implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics. The earlier implementation of XGBoost remains available to customers if they need to postpone migrating to the new version, but it will remain tied to the 0.72 release of XGBoost.

## Input/Output Interface for the XGBoost Release 0.72
<a name="xgboost-72-InputOutput"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports CSV and libsvm formats for training and inference:
+ For Training ContentType, valid inputs are *text/libsvm* (default) or *text/csv*.
+ For Inference ContentType, valid inputs are *text/libsvm* or (the default) *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.   
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent columns contain the zero-based index value pairs for features. So each row has the format: <label> <index0>:<value0> <index1>:<value1> ... Inference requests for libsvm may or may not have labels in the libsvm format.
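The two row formats described in the note can be sketched with small helper functions. This is illustrative only; `to_libsvm` and `to_csv` are hypothetical names, not part of any SDK.

```python
def to_libsvm(label, features):
    # "<label> <index>:<value> ...": zero-based indices; zero-valued features omitted
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features) if v != 0)
    return f"{label} {pairs}".strip()

def to_csv(label, features):
    # target variable in the first column, no header record
    return ",".join(str(x) for x in [label] + list(features))

print(to_libsvm(1, [0.5, 0, 3.2]))  # 1 0:0.5 2:3.2
print(to_csv(1, [0.5, 0, 3.2]))     # 1,0.5,0,3.2
```

Note how the libsvm row skips the zero-valued feature entirely, which is why libsvm is the more economical choice for sparse data.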

This differs from other SageMaker AI algorithms, which use the protobuf training input format; XGBoost accepts CSV and libsvm to maintain greater consistency with standard XGBoost data formats.

For CSV training input mode, the total memory available to the algorithm (Instance Count \* the memory available in the `InstanceType`) must be able to hold the training dataset. For libsvm training input mode, it's not required, but we recommend it.
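As a rough illustration of this sizing rule (a sketch only; real memory overhead varies with the data):

```python
def csv_dataset_fits(dataset_gb, instance_count, memory_per_instance_gb):
    # CSV input mode: total memory across all instances must hold the dataset
    return dataset_gb <= instance_count * memory_per_instance_gb

print(csv_dataset_fits(50, 2, 32))   # True: 50 GB <= 64 GB
print(csv_dataset_fits(100, 2, 32))  # False: 100 GB > 64 GB
```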

SageMaker AI XGBoost uses the Python pickle module to serialize and deserialize the model for saving and loading.

**To use a model trained with SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl 
  import tarfile
  import xgboost
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path is the path of the model file extracted from the archive
  model = pkl.load(open(model_file_path, 'rb'))
  
  # prediction with test data (dtest is an xgboost.DMatrix built from your test features)
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labeled data points, use instance weight supports**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labeled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
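  The weighted variants of the two input formats can likewise be sketched with small helpers (illustrative only; `libsvm_weighted_row` and `csv_weighted_row` are hypothetical names):

  ```python
  def libsvm_weighted_row(label, weight, features):
      # "label:weight idx_0:val_0 idx_1:val_1 ...": weight attached after the label
      pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features) if v != 0)
      return f"{label}:{weight} {pairs}".strip()

  def csv_weighted_row(label, weight, features):
      # "label,weight,val_0,val_1,...": requires csv_weights=1 in the hyperparameters
      return ",".join(str(x) for x in [label, weight] + list(features))

  print(libsvm_weighted_row(1, 0.5, [0.5, 0, 3.2]))  # 1:0.5 0:0.5 2:3.2
  print(csv_weighted_row(1, 0.5, [0.5, 0, 3.2]))     # 1,0.5,0.5,0,3.2
  ```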

## EC2 Instance Recommendation for the XGBoost Release 0.72
<a name="xgboost-72-Instance"></a>

SageMaker AI XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. Although it supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files onto disk slows the algorithm processing time.

## XGBoost Release 0.72 Sample Notebooks
<a name="xgboost-72-sample-notebooks"></a>

For a sample notebook that shows how to use the latest version of SageMaker AI XGBoost as a built-in algorithm to train and host a regression model, see [Regression with Amazon SageMaker AI XGBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.html). To use the 0.72 version of XGBoost, you need to change the version in the sample code to 0.72. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example notebooks that use the XGBoost algorithms are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

## XGBoost Release 0.72 Hyperparameters
<a name="xgboost-72-hyperparameters"></a>

The following table contains the hyperparameters for the XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source XGBoost package. Currently SageMaker AI supports version 0.72. For more detail about hyperparameter configuration for this version of XGBoost, see [ XGBoost Parameters](https://xgboost.readthedocs.io/en/release_0.72/parameter.html).


| Parameter Name | Description | 
| --- | --- | 
| num\_class | The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: integer  | 
| num\_round | The number of rounds to run the training. **Required** Valid values: integer  | 
| alpha | L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 0  | 
| base\_score | The initial prediction score of all instances, global bias. **Optional** Valid values: float Default value: 0.5  | 
| booster | Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `gbtree`, `gblinear`, or `dart`. Default value: `gbtree`  | 
| colsample\_bylevel | Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample\_bytree | Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1 | 
| csv\_weights | When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| early\_stopping\_rounds | The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: integer Default value: -  | 
| eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval\_metric | Evaluation metrics for validation data. A default metric is assigned according to the objective: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-72.html) For a list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: Default according to objective.  | 
| gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow\_policy | Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `depthwise` or `lossguide`. Default value: `depthwise`  | 
| lambda | L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 1  | 
| lambda\_bias | L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max\_bin | Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: integer Default value: 256  | 
| max\_delta\_step | Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max\_depth | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when `grow_policy` is set to `depthwise`. **Optional** Valid values: Integer. Range: [0,∞) Default value: 6  | 
| max\_leaves | Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: integer Default value: 0  | 
| min\_child\_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| normalize\_type | Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread | Number of parallel threads used to run *xgboost*. **Optional** Valid values: integer Default value: Maximum number of threads.  | 
| objective | Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `reg:linear`. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: `reg:linear`  | 
| one\_drop | When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process\_type | The type of boosting process to run. **Optional** Valid values: String. Either `default` or `update`. Default value: `default`  | 
| rate\_drop | The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh\_leaf | This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false`(0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample\_type | Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale\_pos\_weight | Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: float Default value: 1  | 
| seed | Random number seed. **Optional** Valid values: integer Default value: 0  | 
| silent | 0 means print running messages, 1 means silent mode. Valid values: 0 or 1 **Optional** Default value: 0  | 
| sketch\_eps | Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float, Range: [0, 1]. Default value: 0.03  | 
| skip\_drop | Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample | Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree\_method | The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, or `hist`. Default value: `auto`  | 
| tweedie\_variance\_power | Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater | A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: comma-separated string. Default value: `grow_colmaker`, prune  | 

## Tune an XGBoost Release 0.72 Model
<a name="xgboost-72-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

### Metrics Computed by the XGBoost Release 0.72 Algorithm
<a name="xgboost-72-metrics"></a>

The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For the full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as \#(wrong cases)/\#(all cases).  |  Minimize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as \#(wrong cases)/\#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 

### Tunable XGBoost Release 0.72 Hyperparameters
<a name="xgboost-72-tunable-hyperparameters"></a>

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample\_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max\_delta\_step |  IntegerParameterRanges  |  [0, 10]  | 
| max\_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min\_child\_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num\_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 