

# Unsupervised Built-in SageMaker AI Algorithms
<a name="algorithms-unsupervised"></a>

Amazon SageMaker AI provides several built-in algorithms that can be used for a variety of unsupervised learning tasks such as clustering, dimension reduction, pattern recognition, and anomaly detection.
+ [IP Insights](ip-insights.md)—learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
+ [K-Means Algorithm](k-means.md)—finds discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
+ [Principal Component Analysis (PCA) Algorithm](pca.md)—reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)—detects anomalous data points within a data set that diverge from otherwise well-structured or patterned data.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes | 
| K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No | 
| PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes | 
| Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes | 

# IP Insights
<a name="ip-insights"></a>

Amazon SageMaker AI IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insight models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.

SageMaker AI IP insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker AI IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insight score with other features to rank the findings of another security system, such as those from [Amazon GuardDuty](https://docs.aws.amazon.com/guardduty/latest/ug/what-is-guardduty.html).

The SageMaker AI IP Insights algorithm can also learn vector representations of IP addresses, known as *embeddings*. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.

**Topics**
+ [Input/Output Interface for the IP Insights Algorithm](#ip-insights-inputoutput)
+ [EC2 Instance Recommendation for the IP Insights Algorithm](#ip-insights-instances)
+ [IP Insights Sample Notebooks](#ip-insights-sample-notebooks)
+ [How IP Insights Works](ip-insights-howitworks.md)
+ [IP Insights Hyperparameters](ip-insights-hyperparameters.md)
+ [Tune an IP Insights Model](ip-insights-tuning.md)
+ [IP Insights Data Formats](ip-insights-data-formats.md)

## Input/Output Interface for the IP Insights Algorithm
<a name="ip-insights-inputoutput"></a>

**Training and Validation**

The SageMaker AI IP Insights algorithm supports training and validation data channels. It uses the optional validation channel to compute an area-under-curve (AUC) score on a predefined negative sampling strategy. The AUC metric validates how well the model discriminates between positive and negative samples. Training and validation data content types need to be in `text/csv` format. The first column of the CSV data is an opaque string that provides a unique identifier for the entity. The second column is an IPv4 address in decimal-dot notation. IP Insights currently supports only File mode. For more information and some examples, see [IP Insights Training Data Formats](ip-insights-training-data-formats.md).
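As an illustration, the following sketch writes a tiny training file in the expected two-column CSV layout, using the standard library's `ipaddress` module to validate the dotted-decimal addresses. The file name and records are hypothetical.

```python
import csv
import ipaddress

# Hypothetical (entity identifier, IPv4 address) training records.
records = [
    ("entity_id_1", "192.168.1.2"),
    ("entity_id_2", "10.10.1.2"),
]

def validate_record(entity_id, ip):
    """Check that a record matches the two-column format IP Insights expects."""
    if not entity_id:
        raise ValueError("entity identifier must be a non-empty string")
    ipaddress.IPv4Address(ip)  # raises ValueError unless valid dotted-decimal IPv4

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for entity_id, ip in records:
        validate_record(entity_id, ip)
        writer.writerow([entity_id, ip])
```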

**Inference**

For inference, IP Insights supports `text/csv`, `application/json`, and `application/jsonlines` data content types. For more information about the common data formats for inference provided by SageMaker AI, see [Common data formats for inference](cdf-inference.md). IP Insights inference returns output formatted as either `application/json` or `application/jsonlines`. Each record in the output data contains the corresponding `dot_product` (or compatibility score) for each input data point. For more information and some examples, see [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md).

## EC2 Instance Recommendation for the IP Insights Algorithm
<a name="ip-insights-instances"></a>

The SageMaker AI IP Insights algorithm can run on both GPU and CPU instances. For training jobs, we recommend using GPU instances. However, for certain workloads with large training datasets, distributed CPU instances might reduce training costs. For inference, we recommend using CPU instances. IP Insights supports P2, P3, G4dn, and G5 GPU families.

### GPU Instances for the IP Insights Algorithm
<a name="ip-insights-instances-gpu"></a>

IP Insights supports all available GPUs. If you need to speed up training, we recommend starting with a single GPU instance, such as ml.p3.2xlarge, and then moving to a multi-GPU environment, such as ml.p3.8xlarge and ml.p3.16xlarge. Multi-GPU instances automatically divide the mini batches of training data across their GPUs. If you switch from a single GPU to multiple GPUs, the `mini_batch_size` is divided equally among the GPUs used. You might want to increase the value of the `mini_batch_size` to compensate for this.
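The division across GPUs is simple integer arithmetic. The following sketch (the function name is illustrative, not part of any SageMaker API) shows why you may want to scale `mini_batch_size` with the GPU count:

```python
def per_gpu_batch(mini_batch_size, num_gpus):
    """Effective batch size each GPU processes when data-parallel training
    splits a mini batch evenly across devices."""
    return mini_batch_size // num_gpus

# Moving from 1 GPU to an 8-GPU ml.p3.16xlarge with the same mini_batch_size:
assert per_gpu_batch(10000, 1) == 10000
assert per_gpu_batch(10000, 8) == 1250   # each GPU now sees a much smaller batch
# Scaling mini_batch_size by the GPU count restores the per-device batch:
assert per_gpu_batch(80000, 8) == 10000
```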

### CPU Instances for the IP Insights Algorithm
<a name="ip-insights-instances-cpu"></a>

The type of CPU instance that we recommend depends largely on the instance's available memory and the model size. The model size is determined by two hyperparameters: `vector_dim` and `num_entity_vectors`. The maximum supported model size is 8 GB. The following table lists typical EC2 instance types that you would deploy based on these input parameters for various model sizes. In the table, the values for `vector_dim` in the first column range from 32 to 2048, and the values for `num_entity_vectors` in the first row range from 10,000 to 50,000,000.


| `vector_dim` (rows) / `num_entity_vectors` (columns) | 10,000 | 50,000 | 100,000 | 500,000 | 1,000,000 | 5,000,000 | 10,000,000 | 50,000,000 | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| 32 |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.2xlarge | ml.m5.4xlarge | 
|  `64`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.2xlarge |  | 
|  `128`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.4xlarge |  | 
|  `256`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |  |  | 
|  `512`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge |  |  |  | 
|  `1024`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |  |  |  | 
|  `2048`  |  ml.m5.large  | ml.m5.large | ml.m5.xlarge | ml.m5.xlarge |  |  |  |  | 

The values for the `mini_batch_size`, `num_ip_encoder_layers`, `random_negative_sampling_rate`, and `shuffled_negative_sampling_rate` hyperparameters also affect the amount of memory required. If these values are large, you might need to use a larger instance type than normal.
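As a rough way to reason about these limits, you can estimate a lower bound on model size from the entity embedding table alone. This sketch assumes 32-bit floats and ignores all other model state, so it is an illustration rather than the exact formula SageMaker AI uses.

```python
def approx_model_size_gb(num_entity_vectors, vector_dim, bytes_per_float=4):
    """Rough lower bound on model size: the entity embedding table holds
    num_entity_vectors x vector_dim float32 values. Other parameters and
    the IP encoder add overhead not counted here."""
    return num_entity_vectors * vector_dim * bytes_per_float / 1024**3

# 10,000,000 entities with vector_dim=128 needs ~4.8 GB for embeddings alone,
# consistent with the ml.m5.4xlarge recommendation in the table above.
size = approx_model_size_gb(10_000_000, 128)
```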

## IP Insights Sample Notebooks
<a name="ip-insights-sample-notebooks"></a>

For a sample notebook that shows how to train the SageMaker AI IP Insights algorithm and perform inferences with it, see [An Introduction to the SageMaker AI IP Insights Algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ipinsights_login/ipinsights-tutorial.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How IP Insights Works
<a name="ip-insights-howitworks"></a>

Amazon SageMaker AI IP Insights is an unsupervised algorithm that consumes observed data in the form of (entity, IPv4 address) pairs that associate entities with IP addresses. IP Insights determines how likely it is that an entity would use a particular IP address by learning latent vector representations for both entities and IP addresses. The distance between these two representations can then serve as a proxy for how likely this association is.

The IP Insights algorithm uses a neural network to learn the latent vector representations for entities and IP addresses. Entities are first hashed to a large but fixed hash space and then encoded by a simple embedding layer. Character strings such as user names or account IDs can be fed directly into IP Insights as they appear in log files. You don't need to preprocess the data for entity identifiers. You can provide entities as an arbitrary string value during both training and inference. The hash size should be configured with a value that is high enough to ensure that the number of *collisions*, which occur when distinct entities are mapped to the same latent vector, remain insignificant. For more information about how to select appropriate hash sizes, see [Feature Hashing for Large Scale Multitask Learning](https://alex.smola.org/papers/2009/Weinbergeretal09.pdf). For representing IP addresses, on the other hand, IP Insights uses a specially designed encoder network to uniquely represent each possible IPv4 address by exploiting the prefix structure of IP addresses.
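A minimal sketch of the hashing step, assuming an MD5-based hash (the algorithm's internal hash function may differ): distinct entity strings map to slots in a fixed hash space, and a larger space keeps collisions rare.

```python
import hashlib

def entity_to_index(entity_id, hash_space_size):
    """Map an arbitrary entity string to a slot in a fixed hash space.
    MD5 is used here purely for illustration; any well-mixed hash works."""
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_space_size

# With 10,000 distinct entities hashed into 20,000 slots, some collisions
# occur; a larger hash space drives the collision rate down.
hash_space = 20000
indices = {entity_to_index(f"user_{i}", hash_space) for i in range(10000)}
collisions = 10000 - len(indices)
```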

During training, IP Insights automatically generates negative samples by randomly pairing entities and IP addresses. These negative samples represent data that is less likely to occur in reality. The model is trained to discriminate between positive samples that are observed in the training data and these generated negative samples. More specifically, the model is trained to minimize the *cross entropy*, also known as the *log loss*, defined as follows: 

![\[An image containing the equation for log loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-cross-entropy.png)


Here, *y<sub>n</sub>* is the label that indicates whether the sample is from the real distribution governing observed data (*y<sub>n</sub>* = 1) or from the distribution generating negative samples (*y<sub>n</sub>* = 0), and *p<sub>n</sub>* is the probability that the sample is from the real distribution, as predicted by the model.

Generating negative samples is an important process that is used to achieve an accurate model of the observed data. If negative samples are extremely unlikely, for example, if all of the IP addresses in negative samples are 10.0.0.0, then the model trivially learns to distinguish negative samples and fails to accurately characterize the actual observed dataset. To keep negative samples more realistic, IP Insights generates negative samples both by randomly generating IP addresses and randomly picking IP addresses from training data. You can configure the type of negative sampling and the rates at which negative samples are generated with the `random_negative_sampling_rate` and `shuffled_negative_sampling_rate` hyperparameters.
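The two sampling strategies can be sketched as follows. The helper names are illustrative; the real implementation generates negatives inside the training loop.

```python
import random

def random_negative_samples(entities, rate, rng):
    """Random negative sampling: pair each entity with `rate` randomly
    generated IPv4 addresses (the idea behind random_negative_sampling_rate)."""
    negatives = []
    for entity in entities:
        for _ in range(rate):
            ip = ".".join(str(rng.randrange(256)) for _ in range(4))
            negatives.append((entity, ip))
    return negatives

def shuffled_negative_samples(batch, rate, rng):
    """Shuffled negative sampling: re-pair entities with IP addresses drawn
    from the same mini batch (the idea behind shuffled_negative_sampling_rate)."""
    ips = [ip for _, ip in batch]
    negatives = []
    for entity, _ in batch:
        for _ in range(rate):
            negatives.append((entity, rng.choice(ips)))
    return negatives

rng = random.Random(0)
batch = [("entity_id_1", "192.168.1.2"), ("entity_id_2", "10.10.1.2")]
rand_neg = random_negative_samples([e for e, _ in batch], rate=1, rng=rng)
shuf_neg = shuffled_negative_samples(batch, rate=1, rng=rng)
```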

Given the *n*th (entity, IP address) pair, the IP Insights model outputs a *score*, *S<sub>n</sub>*, that indicates how compatible the entity is with the IP address. This score corresponds to the log odds ratio of the given (entity, IP address) pair coming from the real distribution as compared to coming from the negative distribution. It is defined as follows:

![\[An image containing the equation for the score, a log odds ratio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-log-odds.png)


The score is essentially a measure of the similarity between the vector representations of the *n*th entity and IP address. It can be interpreted as how much more likely it would be to observe this event in reality than in a randomly generated dataset. During training, the algorithm uses this score to calculate an estimate of the probability of a sample coming from the real distribution, *p<sub>n</sub>*, to use in the cross entropy minimization, where:

![\[An image showing the equation for probability that the sample is from a real distribution.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-sample-probability.png)
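Putting these pieces together: because the score is a log odds ratio, the probability that a sample comes from the real distribution is the logistic sigmoid of the score, and the training objective is the binary cross entropy over positive and negative samples. A small sketch, with hypothetical scores:

```python
import math

def probability_from_score(score):
    """Convert the compatibility score (a log odds ratio) into the
    probability that the sample comes from the real distribution."""
    return 1.0 / (1.0 + math.exp(-score))

def cross_entropy(labels, probs):
    """Binary cross entropy (log loss) averaged over samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# A real pair with a high score gets a probability near 1 and a low loss;
# a generated negative pair with a low score also incurs little loss.
scores = [4.0, -3.0]   # hypothetical model outputs
labels = [1, 0]        # 1 = observed data, 0 = generated negative sample
probs = [probability_from_score(s) for s in scores]
loss = cross_entropy(labels, probs)
```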


# IP Insights Hyperparameters
<a name="ip-insights-hyperparameters"></a>

In the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI IP Insights algorithm.


| Parameter Name | Description | 
| --- | --- | 
| `num_entity_vectors` | The number of entity vector representations (entity embedding vectors) to train. Each entity in the training set is randomly assigned to one of these vectors using a hash function. Because of hash collisions, multiple entities might be assigned to the same vector. This would cause the same vector to represent multiple entities. This generally has a negligible effect on model performance, as long as the collision rate is not too severe. To keep the collision rate low, set this value as high as possible. However, the model size, and, therefore, the memory requirement, for both training and inference, scales linearly with this hyperparameter. We recommend that you set this value to twice the number of unique entity identifiers. **Required** Valid values: 1 ≤ positive integer ≤ 250,000,000  | 
| `vector_dim` | The size of embedding vectors to represent entities and IP addresses. The larger the value, the more information that can be encoded using these representations. In practice, model size scales linearly with this parameter and limits how large the dimension can be. In addition, using vector representations that are too large can cause the model to overfit, especially for small training datasets. Overfitting occurs when a model doesn't learn any pattern in the data but effectively memorizes the training data and, therefore, cannot generalize well and performs poorly during inference. The recommended value is 128. **Required** Valid values: 4 ≤ positive integer ≤ 4096  | 
| `batch_metrics_publish_interval` | The interval (every X batches) at which the Apache MXNet Speedometer function prints the training speed of the network (samples/second).  **Optional** Valid values: positive integer ≥ 1 Default value: 1,000 | 
| `epochs` | The number of passes over the training data. The optimal value depends on your data size and learning rate. Typical values range from 5 to 100. **Optional** Valid values: positive integer ≥ 1 Default value: 10 | 
| `learning_rate` | The learning rate for the optimizer. IP Insights uses a gradient-descent-based Adam optimizer. The learning rate effectively controls the step size to update model parameters at each iteration. Too large a learning rate can cause the model to diverge because the training is likely to overshoot a minimum. On the other hand, too small a learning rate slows down convergence. Typical values range from 1e-4 to 1e-1. **Optional** Valid values: 1e-6 ≤ float ≤ 10.0 Default value: 0.001 | 
| `mini_batch_size` | The number of examples in each mini batch. The training procedure processes data in mini batches. The optimal value depends on the number of unique account identifiers in the dataset. In general, the larger the `mini_batch_size`, the faster the training and the greater the number of possible shuffled-negative-sample combinations. However, with a large `mini_batch_size`, the training is more likely to converge to a poor local minimum and perform relatively worse for inference.  **Optional** Valid values: 1 ≤ positive integer ≤ 500,000 Default value: 10,000 | 
| `num_ip_encoder_layers` | The number of fully connected layers used to encode the IP address embedding. The larger the number of layers, the greater the model's capacity to capture patterns among IP addresses. However, using a large number of layers increases the chance of overfitting. **Optional** Valid values: 0 ≤ integer ≤ 100 Default value: 1 | 
| `random_negative_sampling_rate` | The number of random negative samples, R, to generate per input example. The training procedure relies on negative samples to prevent the vector representations of the model from collapsing to a single point. Random negative sampling generates R random IP addresses for each input account in the mini batch. The sum of the `random_negative_sampling_rate` (R) and the `shuffled_negative_sampling_rate` (S) must be in the interval: 1 ≤ R + S ≤ 500. **Optional** Valid values: 0 ≤ integer ≤ 500 Default value: 1 | 
| `shuffled_negative_sampling_rate` | The number of shuffled negative samples, S, to generate per input example. In some cases, it helps to use more realistic negative samples that are randomly picked from the training data itself. This kind of negative sampling is achieved by shuffling the data within a mini batch. Shuffled negative sampling generates S negative IP addresses by shuffling the IP address and account pairings within a mini batch. The sum of the `random_negative_sampling_rate` (R) and the `shuffled_negative_sampling_rate` (S) must be in the interval: 1 ≤ R + S ≤ 500. **Optional** Valid values: 0 ≤ integer ≤ 500 Default value: 1 | 
| `weight_decay` | The weight decay coefficient. This parameter adds an L2 regularization factor that helps prevent the model from overfitting the training data. **Optional** Valid values: 0.0 ≤ float ≤ 10.0 Default value: 0.00001 | 

# Tune an IP Insights Model
<a name="ip-insights-tuning"></a>

*Automatic model tuning*, also called hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the IP Insights Algorithm
<a name="ip-insights-metrics"></a>

The Amazon SageMaker AI IP Insights algorithm is an unsupervised learning algorithm that learns associations between IP addresses and entities. The algorithm trains a discriminator model, which learns to separate observed data points (*positive samples*) from randomly generated data points (*negative samples*). Automatic model tuning on IP Insights helps you find the model that can most accurately distinguish between unlabeled validation data and automatically generated negative samples. The model accuracy on the validation dataset is measured by the area under the receiver operating characteristic curve. This `validation:discriminator_auc` metric can take values between 0.0 and 1.0, where 1.0 indicates perfect accuracy.

The IP Insights algorithm computes a `validation:discriminator_auc` metric during validation, the value of which is used as the objective function to optimize for hyperparameter tuning.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| `validation:discriminator_auc` |  Area under the receiver operating characteristic curve on the validation dataset. The validation dataset is not labeled. Area Under the Curve (AUC) is a metric that describes the model's ability to discriminate validation data points from randomly generated data points.  |  Maximize  | 

## Tunable IP Insights Hyperparameters
<a name="ip-insights-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the SageMaker AI IP Insights algorithm. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| epochs |  IntegerParameterRange  |  MinValue: 1, MaxValue: 100  | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 0.1  | 
| mini\_batch\_size |  IntegerParameterRange  |  MinValue: 100, MaxValue: 50000  | 
| num\_entity\_vectors |  IntegerParameterRange  |  MinValue: 10000, MaxValue: 1000000  | 
| num\_ip\_encoder\_layers |  IntegerParameterRange  |  MinValue: 1, MaxValue: 10  | 
| random\_negative\_sampling\_rate |  IntegerParameterRange  |  MinValue: 0, MaxValue: 10  | 
| shuffled\_negative\_sampling\_rate |  IntegerParameterRange  |  MinValue: 0, MaxValue: 10  | 
| vector\_dim |  IntegerParameterRange  |  MinValue: 8, MaxValue: 256  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 1.0  | 

# IP Insights Data Formats
<a name="ip-insights-data-formats"></a>

This section provides examples of the available input and output data formats used by the IP Insights algorithm during training and inference.

**Topics**
+ [IP Insights Training Data Formats](ip-insights-training-data-formats.md)
+ [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md)

# IP Insights Training Data Formats
<a name="ip-insights-training-data-formats"></a>

The following are the available data input formats for the IP Insights algorithm. Amazon SageMaker AI built-in algorithms adhere to the common input training format described in [Common Data Formats for Training](cdf-training.md). However, the SageMaker AI IP Insights algorithm currently supports only the CSV data input format.

## IP Insights Training Data Input Formats
<a name="ip-insights-training-input-format-requests"></a>

### INPUT: CSV
<a name="ip-insights-input-csv"></a>

The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot notation. 

content-type: text/csv

```
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
```

# IP Insights Inference Data Formats
<a name="ip-insights-inference-data-formats"></a>

The following are the available input and output formats for the IP Insights algorithm. Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common data formats for inference](cdf-inference.md). However, the SageMaker AI IP Insights algorithm does not currently support RecordIO format.

## IP Insights Input Request Formats
<a name="ip-insights-input-format-requests"></a>

### INPUT: CSV Format
<a name="ip-insights-input-csv-format"></a>

The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot notation. 

content-type: text/csv

```
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
```

### INPUT: JSON Format
<a name="ip-insights-input-json"></a>

JSON data can be provided in different formats. IP Insights follows the common SageMaker AI formats. For more information about inference formats, see [Common data formats for inference](cdf-inference.md).

content-type: application/json

```
{
  "instances": [
    {"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
    {"features": ["entity_id_2", "10.10.1.2"]}
  ]
}
```

### INPUT: JSONLINES Format
<a name="ip-insights-input-jsonlines"></a>

The JSON Lines content type is useful for running batch transform jobs. For more information on SageMaker AI inference formats, see [Common data formats for inference](cdf-inference.md). For more information on running batch transform jobs, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

content-type: application/jsonlines

```
{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}}
{"features": ["entity_id_2", "10.10.1.2"]}
```

## IP Insights Output Response Formats
<a name="ip-insights-ouput-format-response"></a>

### OUTPUT: JSON Response Format
<a name="ip-insights-output-json"></a>

The default output of the SageMaker AI IP Insights algorithm is the `dot_product` between the input entity and IP address. The `dot_product` signifies how compatible the model considers the entity and IP address. The `dot_product` is unbounded. To make predictions about whether an event is anomalous, you need to set a threshold based on the distribution of scores you observe. For information about how to use the `dot_product` for anomaly detection, see [An Introduction to the SageMaker AI IP Insights Algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ipinsights_login/ipinsights-tutorial.html).

accept: application/json

```
{
  "predictions": [
    {"dot_product": 0.0},
    {"dot_product": 2.0}
  ]
}
```

Advanced users can access the model's learned entity and IP embeddings by providing the additional content-type parameter `verbose=True` in the Accept header. You can use the `entity_embedding` and `ip_embedding` for debugging, visualizing, and understanding the model. Additionally, you can use these embeddings in other machine learning techniques, such as classification or clustering.

accept: application/json;verbose=True

```
{
  "predictions": [
    {
        "dot_product": 0.0,
        "entity_embedding": [1.0, 0.0, 0.0],
        "ip_embedding": [0.0, 1.0, 0.0]
    },
    {
        "dot_product": 2.0,
        "entity_embedding": [1.0, 0.0, 1.0],
        "ip_embedding": [1.0, 0.0, 1.0]
    }
  ]
}
```

### OUTPUT: JSONLINES Response Format
<a name="ip-insights-jsonlines"></a>

accept: application/jsonlines 

```
{"dot_product": 0.0}
{"dot_product": 2.0}
```

accept: application/jsonlines; verbose=True 

```
{"dot_product": 0.0, "entity_embedding": [1.0, 0.0, 0.0], "ip_embedding": [0.0, 1.0, 0.0]}
{"dot_product": 2.0, "entity_embedding": [1.0, 0.0, 1.0], "ip_embedding": [1.0, 0.0, 1.0]}
```

# K-Means Algorithm
<a name="k-means"></a>

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. 

Amazon SageMaker AI uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker AI is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker AI streams mini-batches (small, random subsets) of the training data. For more information about mini-batch k-means, see [Web-scale k-means Clustering](https://dl.acm.org/doi/10.1145/1772690.1772862).

The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The *n* attributes in each row represent a point in *n*-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations. The algorithm groups observations with similar attribute values (the points corresponding to these observations are closer together). For more information about how k-means works in Amazon SageMaker AI, see [How K-Means Clustering Works](algo-kmeans-tech-notes.md).
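A minimal mini-batch k-means sketch, in the spirit of the web-scale algorithm cited above (this illustrates the idea of streaming small random subsets, not SageMaker AI's actual implementation):

```python
import random

def mini_batch_kmeans(points, k, batch_size, iterations, seed=0):
    """Each iteration assigns a random mini batch to the nearest centers,
    then nudges those centers toward the batch points with a per-center
    learning rate that shrinks as the center accumulates assignments."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            # Nearest center by squared Euclidean distance.
            j = min(range(k), key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                                for d in range(len(p))))
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            for d in range(len(p)):
                centers[j][d] += eta * (p[d] - centers[j][d])
    return centers

# Two well-separated groups of 2-D observations.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers = mini_batch_kmeans(points, k=2, batch_size=4, iterations=50)
```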

**Topics**
+ [Input/Output Interface for the K-Means Algorithm](#km-inputoutput)
+ [EC2 Instance Recommendation for the K-Means Algorithm](#km-instances)
+ [K-Means Sample Notebooks](#kmeans-sample-notebooks)
+ [How K-Means Clustering Works](algo-kmeans-tech-notes.md)
+ [K-Means Hyperparameters](k-means-api-config.md)
+ [Tune a K-Means Model](k-means-tuning.md)
+ [K-Means Response Formats](km-in-formats.md)

## Input/Output Interface for the K-Means Algorithm
<a name="km-inputoutput"></a>

For training, the k-means algorithm expects data to be provided in the *train* channel (recommended `S3DataDistributionType=ShardedByS3Key`), with an optional *test* channel (recommended `S3DataDistributionType=FullyReplicated`) to score the data on. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` are supported. k-means returns a `closest_cluster` label and the `distance_to_cluster` for each observation.

For more information on input and output file formats, see [K-Means Response Formats](km-in-formats.md) for inference and the [K-Means Sample Notebooks](#kmeans-sample-notebooks). The k-means algorithm does not support multiple instance learning, in which the training set consists of labeled “bags”, each of which is a collection of unlabeled instances.

## EC2 Instance Recommendation for the K-Means Algorithm
<a name="km-instances"></a>

We recommend training k-means on CPU instances. You can train on GPU instances, but should limit GPU training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per instance. The k-means algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

## K-Means Sample Notebooks
<a name="kmeans-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI k-means algorithm to segment the population of counties in the United States by attributes identified using principal component analysis, see [Analyze US census data for population segmentation using Amazon SageMaker AI](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans/sagemaker-countycensusclustering.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How K-Means Clustering Works
<a name="algo-kmeans-tech-notes"></a>

K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm accomplishes this by mapping each observation in the input dataset to a point in the *n*-dimensional space (where *n* is the number of attributes of the observation). For example, your dataset might contain observations of temperature and humidity in a particular location, which are mapped to points (*t, h*) in 2-dimensional space. 



**Note**  
Clustering algorithms are unsupervised. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used. For more information, see [Unsupervised learning](algorithms-choose.md#algorithms-choose-unsupervised-learning).

In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the distances between the points that correspond to the observations in the dataset and the cluster centers as the basis for clustering. You choose the number of clusters (*k*) to create. 

For example, suppose that you want to create a model to recognize handwritten digits and you choose the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0 through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As part of model training, the k-means algorithm groups the input images into 10 clusters.

Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image belongs. 
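The nearest-center assignment described above can be sketched in a few lines of plain Python (the function name is illustrative, not part of the SageMaker API). Because only the ordering of distances matters, the square root can be skipped:

```python
def closest_center(point, centers):
    """Return the index of the cluster whose center is nearest to point,
    using squared Euclidean distance (no square root needed)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return min(range(len(centers)), key=d2.__getitem__)
```

For an MNIST image, `point` would be its 784-dimensional pixel vector and `centers` the 10 cluster centers.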

**Note**  
Amazon SageMaker AI uses a customized version of the algorithm where, instead of specifying that the algorithm create *k* clusters, you might choose to improve model accuracy by specifying extra cluster centers (*K = k\*x*). However, the algorithm ultimately reduces these to *k* clusters.

In SageMaker AI, you specify the number of clusters when creating a training job. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). In the request body, you add the `HyperParameters` string map to specify the `k` and `extra_center_factor` strings.
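Because `HyperParameters` is a string-to-string map, every value is passed as a string, including numeric ones. A hypothetical fragment (the values here are placeholders for illustration) might look like this:

```python
# Hypothetical HyperParameters string map for a k-means CreateTrainingJob
# request; numeric values must be passed as strings.
hyperparameters = {
    "k": "10",                   # number of clusters to produce
    "extra_center_factor": "2",  # train with K = k * 2 centers, reduce to k
    "feature_dim": "784",        # e.g. 28x28-pixel MNIST images
}
```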

The following is a summary of how k-means works for model training in SageMaker AI:

1. It determines the initial *K* cluster centers. 
**Note**  
In the following topics, *K* clusters refer to *k \* x*, where you specify *k* and *x* when creating a model training job. 

1. It iterates over input training data and recalculates cluster centers.

1. It reduces the resulting clusters to *k* (if the data scientist specified the creation of *k \* x* clusters in the request). 

The following sections also explain some of the parameters that a data scientist might specify to configure a model training job as part of the `HyperParameters` string map. 

**Topics**
+ [Step 1: Determine the Initial Cluster Centers](#kmeans-step1)
+ [Step 2: Iterate over the Training Dataset and Calculate Cluster Centers](#kmeans-step2)
+ [Step 3: Reduce the Clusters from *K* to *k*](#kmeans-step3)

## Step 1: Determine the Initial Cluster Centers
<a name="kmeans-step1"></a>

When using k-means in SageMaker AI, the initial cluster centers are chosen from the observations in a small, randomly sampled batch. Choose one of the following strategies to determine how these initial cluster centers are selected:
+ The random approach—Randomly choose *K* observations in your input dataset as cluster centers. For example, you might choose as cluster centers the points in the 784-dimensional space that correspond to any 10 images in the MNIST training dataset.
+ The k-means++ approach, which works as follows: 

  1. Start with one cluster and determine its center. You randomly select an observation from your training dataset and use the point corresponding to the observation as the cluster center. For example, in the MNIST dataset, randomly choose a handwritten digit image. Then choose the point in the 784-dimensional space that corresponds to the image as your cluster center. This is cluster center 1.

  1. Determine the center for cluster 2. From the remaining observations in the training dataset, pick an observation at random, favoring one whose corresponding point is far away from cluster center 1. Using the MNIST dataset as an example, you do the following:
     + For each of the remaining images, find the distance of the corresponding point from cluster center 1. Square the distance and assign a probability that is proportional to the square of the distance. That way, an image that is different from the one that you previously selected has a higher probability of getting selected as cluster center 2. 
     + Choose one of the images randomly, based on probabilities assigned in the previous step. The point that corresponds to the image is cluster center 2.

  1. Repeat Step 2 to find cluster center 3. This time, find the distances of the remaining images from cluster center 2.

  1. Repeat the process until you have the *K* cluster centers.
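The seeding steps above can be sketched in plain Python (an illustrative sketch, not the SageMaker implementation). Each new center is sampled with probability proportional to the squared distance from the nearest already-chosen center:

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding: first center at random, then sample each subsequent
    center with probability proportional to the squared distance from the
    nearest existing center."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest existing center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Points far from all existing centers receive large weights, so the initial centers tend to spread out across the dataset.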

To train a model in SageMaker AI, you create a training job. In the request, you provide configuration information by specifying the following `HyperParameters` string maps:
+ To specify the number of clusters to create, add the `k` string.
+ For greater accuracy, add the optional `extra_center_factor` string. 
+ To specify the strategy that you want to use to determine the initial cluster centers, add the `init_method` string and set its value to `random` or `k-means++`.

For more information about the SageMaker AI k-means estimator, see [K-means](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/kmeans.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

You now have an initial set of cluster centers. 

## Step 2: Iterate over the Training Dataset and Calculate Cluster Centers
<a name="kmeans-step2"></a>

The cluster centers that you created in the preceding step are mostly random, with some consideration for the training dataset. In this step, you use the training dataset to move these centers toward the true cluster centers. The algorithm iterates over the training dataset, and recalculates the *K* cluster centers.

1. Read a mini-batch of observations (a small, randomly chosen subset of all records) from the training dataset and do the following. 
**Note**  
When creating a model training job, you specify the batch size in the `mini_batch_size` string in the `HyperParameters` string map. 

   1. Assign all of the observations in the mini-batch to one of the clusters with the closest cluster center.

   1. Calculate the number of observations assigned to each cluster. Then, calculate the proportion of new points assigned per cluster.

      For example, consider the following clusters:

      Cluster c1 = 100 previously assigned points. You added 25 points from the mini-batch in this step.

      Cluster c2 = 150 previously assigned points. You added 40 points from the mini-batch in this step.

      Cluster c3 = 450 previously assigned points. You added 5 points from the mini-batch in this step.

      Calculate the proportion of new points assigned to each of the clusters as follows:

      ```
      p1 = proportion of points assigned to c1 = 25/(100+25)
      p2 = proportion of points assigned to c2 = 40/(150+40)
      p3 = proportion of points assigned to c3 = 5/(450+5)
      ```

   1. Compute the center of the new points added to each cluster:

      ```
      d1 = center of the new points added to cluster 1
      d2 = center of the new points added to cluster 2
      d3 = center of the new points added to cluster 3
      ```

   1. Compute the weighted average to find the updated cluster centers as follows:

      ```
      Center of cluster 1 = ((1 - p1) * center of cluster 1) + (p1 * d1)
      Center of cluster 2 = ((1 - p2) * center of cluster 2) + (p2 * d2)
      Center of cluster 3 = ((1 - p3) * center of cluster 3) + (p3 * d3)
      ```

1. Read the next mini-batch, and repeat Step 1 to recalculate the cluster centers. 

For more information about mini-batch *k*-means, see [Web-scale k-means Clustering](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b452a856a3e3d4d37b1de837996aa6813bedfdcf).
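The weighted-average update in the example above can be captured in a short helper (a sketch with illustrative names, not the SageMaker implementation):

```python
def updated_center(old_center, old_count, new_points):
    """Move a cluster center toward the mini-batch points assigned to it,
    weighted by the proportion p of newly assigned points."""
    n_new = len(new_points)
    p = n_new / (old_count + n_new)                   # e.g. 25 / (100 + 25)
    d = [sum(xs) / n_new for xs in zip(*new_points)]  # center of new points
    return [(1 - p) * c + p * di for c, di in zip(old_center, d)]
```

For cluster `c1` above, `p1 = 25/125 = 0.2`, so the 25 new points pull the old center 20% of the way toward their own center `d1`.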

## Step 3: Reduce the Clusters from *K* to *k*
<a name="kmeans-step3"></a>

If the algorithm created *K* clusters—*(K = k\*x)* where *x* is greater than 1—then it reduces the *K* clusters to *k* clusters. (For more information, see `extra_center_factor` in the preceding discussion.) It does this by applying Lloyd's method with `kmeans++` initialization to the *K* cluster centers. For more information about Lloyd's method, see [k-means clustering](https://pdfs.semanticscholar.org/0074/4cb7cc9ccbbcdadbd5ff2f2fee6358427271.pdf). 

# K-Means Hyperparameters
<a name="k-means-api-config"></a>

In the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the k-means training algorithm provided by Amazon SageMaker AI. For more information about how k-means clustering works, see [How K-Means Clustering Works](algo-kmeans-tech-notes.md).


| Parameter Name | Description | 
| --- | --- | 
| `feature_dim` | The number of features in the input data. **Required** Valid values: Positive integer  | 
| `k` |  The number of required clusters. **Required** Valid values: Positive integer  | 
| `epochs` | The number of passes done over the training data. **Optional** Valid values: Positive integer Default value: 1  | 
| `eval_metrics` | A JSON list of metric types used to report a score for the model. Allowed values are `msd` for mean squared deviation and `ssd` for sum of squared distances. If test data is provided, the score is reported for each of the requested metrics. **Optional** Valid values: Either `[\"msd\"]` or `[\"ssd\"]` or `[\"msd\",\"ssd\"]`. Default value: `[\"msd\"]`  | 
| `extra_center_factor` | The algorithm creates K centers = `num_clusters` \* `extra_center_factor` as it runs and reduces the number of centers from K to `k` when finalizing the model. **Optional** Valid values: Either a positive integer or `auto`. Default value: `auto`  | 
| `half_life_time_size` | Used to determine the weight given to an observation when computing a cluster mean. This weight decays exponentially as more points are observed. When a point is first observed, it is assigned a weight of 1 when computing the cluster mean. The decay constant for the exponential decay function is chosen so that after observing `half_life_time_size` points, its weight is 1/2. If set to 0, there is no decay. **Optional** Valid values: Non-negative integer Default value: 0  | 
| `init_method` | Method by which the algorithm chooses the initial cluster centers. The standard k-means approach chooses them at random. An alternative k-means++ method chooses the first cluster center at random. Then it spreads out the positions of the remaining initial centers by weighting the selection with a probability distribution that is proportional to the squared distance of the remaining data points from the existing centers. **Optional** Valid values: Either `random` or `kmeans++`. Default value: `random`  | 
| `local_lloyd_init_method` | The initialization method for Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Either `random` or `kmeans++`. Default value: `kmeans++`  | 
| `local_lloyd_max_iter` | The maximum number of iterations for Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Positive integer Default value: 300  | 
| `local_lloyd_num_trials` | The number of times Lloyd's expectation-maximization (EM) procedure is run when building the final model containing `k` centers; the run with the least loss is kept. **Optional** Valid values: Either a positive integer or `auto`. Default value: `auto`  | 
| `local_lloyd_tol` | The tolerance for change in loss for early stopping of Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Float. Range in [0, 1]. Default value: 0.0001  | 
| `mini_batch_size` | The number of observations per mini-batch for the data iterator. **Optional** Valid values: Positive integer Default value: 5000  | 

# Tune a K-Means Model
<a name="k-means-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI k-means algorithm is an unsupervised algorithm that groups data into clusters whose members are as similar as possible. Because it is unsupervised, it doesn't use a labeled validation dataset against which hyperparameters can be optimized. But it does take a test dataset and emits metrics that depend on the squared distance between the data points and the final cluster centroids at the end of each training run. To find the model that reports the tightest clusters on the test dataset (that is, the clusters whose members are most similar to one another), you can use a hyperparameter tuning job.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the K-Means Algorithm
<a name="km-metrics"></a>

The k-means algorithm computes the following metrics during training. When tuning a model, choose one of these metrics as the objective metric. 


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:msd | Mean squared distances between each record in the test set and the closest center of the model. | Minimize | 
| test:ssd | Sum of the squared distances between each record in the test set and the closest center of the model. | Minimize | 



## Tunable K-Means Hyperparameters
<a name="km-tunable-hyperparameters"></a>

Tune the Amazon SageMaker AI k-means model with the following hyperparameters. The hyperparameters that have the greatest impact on k-means objective metrics are: `mini_batch_size`, `extra_center_factor`, and `init_method`. Tuning the hyperparameter `epochs` generally results in minor improvements.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `epochs` | IntegerParameterRanges | MinValue: 1, MaxValue: 10 | 
| `extra_center_factor` | IntegerParameterRanges | MinValue: 4, MaxValue: 10 | 
| `init_method` | CategoricalParameterRanges | ['kmeans++', 'random'] | 
| `mini_batch_size` | IntegerParameterRanges | MinValue: 3000, MaxValue: 15000 | 
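In a tuning-job request, these recommendations translate into parameter-range structures. The following is a hypothetical fragment (the values mirror the table above; consult the `CreateHyperParameterTuningJob` API reference for the authoritative request shape):

```python
# Hypothetical parameter ranges for a k-means tuning job, with values
# taken from the recommended ranges in the table above.
parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "extra_center_factor", "MinValue": "4", "MaxValue": "10"},
        {"Name": "mini_batch_size", "MinValue": "3000", "MaxValue": "15000"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "init_method", "Values": ["kmeans++", "random"]},
    ],
}
# Objective metric: minimize the mean squared distance on the test channel.
objective = {"Type": "Minimize", "MetricName": "test:msd"}
```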

# K-Means Response Formats
<a name="km-in-formats"></a>

All SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI k-means algorithm.

## JSON Response Format
<a name="km-json"></a>

```
{
    "predictions": [
        {
            "closest_cluster": 1.0,
            "distance_to_cluster": 3.0
        },
        {
            "closest_cluster": 2.0,
            "distance_to_cluster": 5.0
        },

        ....
    ]
}
```

## JSONLINES Response Format
<a name="km-jsonlines"></a>

```
{"closest_cluster": 1.0, "distance_to_cluster": 3.0}
{"closest_cluster": 2.0, "distance_to_cluster": 5.0}
```

## RECORDIO Response Format
<a name="km-recordio"></a>

```
[
    Record = {
        features = {},
        label = {
            'closest_cluster': {
                keys: [],
                values: [1.0, 2.0]  # float32
            },
            'distance_to_cluster': {
                keys: [],
                values: [3.0, 5.0]  # float32
            },
        }
    }
]
```

## CSV Response Format
<a name="km-csv"></a>

The first value in each line corresponds to `closest_cluster`.

The second value in each line corresponds to `distance_to_cluster`.

```
1.0,3.0
2.0,5.0
```

# Principal Component Analysis (PCA) Algorithm
<a name="pca"></a>

PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called *components*, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario: 
+ **regular**: For datasets with sparse data and a moderate number of observations and features.
+ **randomized**: For datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

PCA uses tabular data. 

The rows represent observations you want to embed in a lower dimensional space. The columns represent features that you want to find a reduced approximation for. The algorithm calculates the covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular value decomposition on this summary to produce the principal components. 

**Topics**
+ [Input/Output Interface for the PCA Algorithm](#pca-inputoutput)
+ [EC2 Instance Recommendation for the PCA Algorithm](#pca-instances)
+ [PCA Sample Notebooks](#PCA-sample-notebooks)
+ [How PCA Works](how-pca-works.md)
+ [PCA Hyperparameters](PCA-reference.md)
+ [PCA Response Formats](PCA-in-formats.md)

## Input/Output Interface for the PCA Algorithm
<a name="pca-inputoutput"></a>

For training, PCA expects data provided in the train channel, and optionally supports a dataset passed to the test channel, which is scored by the final algorithm. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, PCA supports `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Results are returned in either `application/json` or `application/x-recordio-protobuf` format with a vector of "projections."

For more information on input and output file formats, see [PCA Response Formats](PCA-in-formats.md) for inference and the [PCA Sample Notebooks](#PCA-sample-notebooks).

## EC2 Instance Recommendation for the PCA Algorithm
<a name="pca-instances"></a>

PCA supports CPU and GPU instances for training and inference. Which instance type is most performant depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and G5.

## PCA Sample Notebooks
<a name="PCA-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Principal Component Analysis algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to PCA with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/pca_mnist/pca_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The PCA example notebooks are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How PCA Works
<a name="how-pca-works"></a>

Principal Component Analysis (PCA) is a learning algorithm that reduces the dimensionality (number of features) within a dataset while still retaining as much information as possible. 

PCA reduces dimensionality by finding a new set of features called *components*, which are composites of the original features, but are uncorrelated with one another. The first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

It is an unsupervised dimensionality reduction algorithm. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used.

Given the input of a matrix with rows ![\[x_1,…,x_n\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-39b.png) each of dimension `1 * d`, the data is partitioned into mini-batches of rows and distributed among the training nodes (workers). Each worker then computes a summary of its data. The summaries of the different workers are then unified into a single solution at the end of the computation. 

**Modes**

The Amazon SageMaker AI PCA algorithm uses either of two modes to calculate these summaries, depending on the situation:
+ **regular**: for datasets with sparse data and a moderate number of observations and features.
+ **randomized**: for datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

As the algorithm's last step, it performs the singular value decomposition on the unified solution, from which the principal components are then derived.

## Mode 1: Regular
<a name="mode-1"></a>

The workers jointly compute both ![\[Equation in text-form: \sum x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-1b.png) and ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

**Note**  
Because ![\[Equation in text-form: x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-3b.png) are `1 * d` row vectors, ![\[Equation in text-form: x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-4b.png) is a matrix (not a scalar). Using row vectors within the code allows us to obtain efficient caching.

The covariance matrix is computed as ![\[Equation in text-form: \sum x_i^T x_i - (1/n) (\sum x_i)^T \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-32b.png) , and its top `num_components` singular vectors form the model.

**Note**  
If `subtract_mean` is `False`, we avoid computing and subtracting ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

Use this algorithm when the dimension `d` of the vectors is small enough so that ![\[Equation in text-form: d^2\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-7b.png) can fit in memory.
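The per-worker summary and the final covariance computation can be sketched in plain Python (illustrative names only; the actual implementation is distributed and optimized):

```python
def worker_summary(rows):
    """One worker's summary for 'regular' mode: the sum of x_i^T x_i
    (a d x d matrix) and the sum of x_i (a length-d vector)."""
    d = len(rows[0])
    gram = [[0.0] * d for _ in range(d)]
    s = [0.0] * d
    for x in rows:
        for i in range(d):
            s[i] += x[i]
            for j in range(d):
                gram[i][j] += x[i] * x[j]
    return gram, s

def covariance(summaries, n):
    """Unify the worker summaries into sum x_i^T x_i - (1/n)(sum x_i)^T(sum x_i);
    the top num_components singular vectors of this matrix form the model."""
    d = len(summaries[0][1])
    gram = [[sum(g[i][j] for g, _ in summaries) for j in range(d)]
            for i in range(d)]
    s = [sum(sv[i] for _, sv in summaries) for i in range(d)]
    return [[gram[i][j] - s[i] * s[j] / n for j in range(d)] for i in range(d)]
```

Because the summaries are simple sums, splitting the rows across workers and unifying their summaries gives exactly the same covariance matrix as processing all rows on one worker.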

## Mode 2: Randomized
<a name="mode-2"></a>

When the number of features in the input dataset is large, we use a method to approximate the covariance matrix. For every mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, we randomly initialize a `(num_components + extra_components) * b` matrix that we multiply by each mini-batch, to create a `(num_components + extra_components) * d` matrix. The sum of these matrices is computed by the workers, and the servers perform SVD on the final `(num_components + extra_components) * d` matrix. The top right `num_components` singular vectors of it are the approximation of the top singular vectors of the input matrix.

Let ![\[Equation in text-form: \ell\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38b.png) ` = num_components + extra_components`. Given a mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, the worker draws a random matrix ![\[Equation in text-form: H_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-24b.png) of dimension ![\[Equation in text-form: \ell * b\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38.png) . Depending on whether the environment uses a GPU or CPU and the dimension size, the matrix is either a random sign matrix where each entry is `+-1` or a *FJLT* (fast Johnson Lindenstrauss transform; for information, see [FJLT Transforms](https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf) and the follow-up papers). The worker then computes ![\[Equation in text-form: H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-26b.png) and maintains ![\[Equation in text-form: B = \sum H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-27b.png) . The worker also maintains ![\[Equation in text-form: h^T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-28b.png) , the sum of columns of ![\[Equation in text-form: H_1,..,H_T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-29b.png) (`T` being the total number of mini-batches), and `s`, the sum of all input rows. After processing the entire shard of data, the worker sends the server `B`, `h`, `s`, and `n` (the number of input rows).

Denote the different inputs to the server as ![\[Equation in text-form: B^1, h^1, s^1, n^1,…\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-30b.png). The server computes `B`, `h`, `s`, and `n`, the sums of the respective inputs. It then computes ![\[Equation in text-form: C = B – (1/n) h^T s\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-31b.png), and finds its singular value decomposition. The top-right singular vectors and singular values of `C` are used as the approximate solution to the problem.
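One worker step for a single mini-batch can be sketched in plain Python with the random-sign variant of the projection matrix (an illustrative sketch; the actual implementation may use an FJLT instead):

```python
import random

def sketch_minibatch(X_t, ell, rng):
    """Multiply a random sign matrix H_t (ell x b) by the mini-batch X_t
    (b x d), producing an ell x d contribution to B = sum of H_t X_t."""
    b, d = len(X_t), len(X_t[0])
    H = [[rng.choice((-1.0, 1.0)) for _ in range(b)] for _ in range(ell)]
    return [[sum(H[i][r] * X_t[r][c] for r in range(b)) for c in range(d)]
            for i in range(ell)]

# ell = num_components + extra_components; one mini-batch of two rows (b=2, d=2)
B = sketch_minibatch([[1.0, 2.0], [3.0, 4.0]], 3, random.Random(0))
```

The worker accumulates these `ell * d` contributions across mini-batches, which keeps memory usage independent of the number of rows processed.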

# PCA Hyperparameters
<a name="PCA-reference"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific HyperParameters as string-to-string maps. The following table lists the hyperparameters for the PCA training algorithm provided by Amazon SageMaker AI. For more information about how PCA works, see [How PCA Works](how-pca-works.md). 


| Parameter Name | Description | 
| --- | --- | 
| `feature_dim` |  Input dimension. **Required** Valid values: positive integer  | 
| `mini_batch_size` |  Number of rows in a mini-batch. **Required** Valid values: positive integer  | 
| `num_components` |  The number of principal components to compute. **Required** Valid values: positive integer  | 
| `algorithm_mode` |  Mode for computing the principal components.  **Optional** Valid values: *regular* or *randomized* Default value: *regular*  | 
| `extra_components` |  As the value increases, the solution becomes more accurate but the runtime and memory consumption increase linearly. The default, -1, means the maximum of 10 and `num_components`. Valid for *randomized* mode only. **Optional** Valid values: Non-negative integer or -1 Default value: -1  | 
| `subtract_mean` |  Indicates whether the data should be unbiased both during training and at inference.  **Optional** Valid values: One of *true* or *false* Default value: *true*  | 

# PCA Response Formats
<a name="PCA-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI PCA algorithm.

## JSON Response Format
<a name="PCA-json"></a>

Accept—application/json

```
{
    "projections": [
        {
            "projection": [1.0, 2.0, 3.0, 4.0, 5.0]
        },
        {
            "projection": [6.0, 7.0, 8.0, 9.0, 0.0]
        },
        ....
    ]
}
```

## JSONLINES Response Format
<a name="PCA-jsonlines"></a>

Accept—application/jsonlines

```
{ "projection": [1.0, 2.0, 3.0, 4.0, 5.0] }
{ "projection": [6.0, 7.0, 8.0, 9.0, 0.0] }
```

## RECORDIO Response Format
<a name="PCA-recordio"></a>

Accept—application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    },
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    }  
]
```

# Random Cut Forest (RCF) Algorithm
<a name="randomcutforest"></a>

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
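The three-standard-deviation rule of thumb can be applied to a batch of scores with a few lines of standard-library Python (a sketch, not part of the RCF API):

```python
import statistics

def flag_anomalies(scores, num_stdevs=3.0):
    """Flag scores lying more than num_stdevs standard deviations above
    the mean score as anomalous."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [s > mu + num_stdevs * sigma for s in scores]
```

In practice you would tune the threshold to the application, trading off false positives against missed anomalies.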

While there are many applications of anomaly detection algorithms to one-dimensional time series data such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-dimensional input. Amazon SageMaker AI RCF scales well with respect to the number of features, data set size, and number of instances.

**Topics**
+ [Input/Output Interface for the RCF Algorithm](#rcf-input_output)
+ [Instance Recommendations for the RCF Algorithm](#rcf-instance-recommend)
+ [RCF Sample Notebooks](#rcf-sample-notebooks)
+ [How RCF Works](rcf_how-it-works.md)
+ [RCF Hyperparameters](rcf_hyperparameters.md)
+ [Tune an RCF Model](random-cut-forest-tuning.md)
+ [RCF Response Formats](rcf-in-formats.md)

## Input/Output Interface for the RCF Algorithm
<a name="rcf-input_output"></a>

Amazon SageMaker AI Random Cut Forest supports the `train` and `test` data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either `application/x-recordio-protobuf` or `text/csv` formats. For the test data, when using `text/csv` format, the content must be specified as `text/csv;label_size=1`, where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

The train channel only supports `S3DataDistributionType=ShardedByS3Key` and the test channel only supports `S3DataDistributionType=FullyReplicated`. The following example specifies the S3 distribution type for the train channel using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

**Note**  
The `sagemaker.inputs.s3_input` method was renamed to `sagemaker.inputs.TrainingInput` in [SageMaker Python SDK v2](https://sagemaker.readthedocs.io/en/stable/v2.html#s3-input).

```
  import sagemaker
    
  # specify Random Cut Forest training job information and hyperparameters
  rcf = sagemaker.estimator.Estimator(...)
    
  # explicitly specify "ShardedByS3Key" distribution type
  train_data = sagemaker.inputs.TrainingInput(
       s3_data=s3_training_data_location,
       content_type='text/csv;label_size=0',
       distribution='ShardedByS3Key')
    
  # run the training job on input data stored in S3
  rcf.fit({'train': train_data})
```

To avoid common errors around execution roles, ensure that you have the required execution role policies, `AmazonSageMakerFullAccess` and `AmazonEC2ContainerRegistryFullAccess`. To avoid errors caused by your image not existing or its permissions being incorrect, ensure that your ECR image is not larger than the allocated disk space on the training instance; run your training job on an instance type that has sufficient disk space. In addition, if your ECR image is in a different AWS account's Amazon Elastic Container Registry (Amazon ECR) repository and you have not set repository permissions to grant access, you will get an error. See [ECR repository permissions](https://docs.aws.amazon.com/AmazonECR/latest/userguide/set-repository-policy.html) for more information on setting a repository policy statement.

See the [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) API reference for more information on customizing the S3 data source attributes. Finally, to take advantage of multi-instance training, the training data must be partitioned into at least as many files as there are instances.

For inference, RCF supports `application/x-recordio-protobuf`, `text/csv`, and `application/json` input data content types. See the [Parameters for Built-in Algorithms](common-info-all-im-models.md) documentation for more information. RCF inference returns `application/x-recordio-protobuf` or `application/json` formatted output. Each record in the output contains the anomaly score for the corresponding input data point. See [Common Data Formats--Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) for more information.

For more information on input and output file formats, see [RCF Response Formats](rcf-in-formats.md) for inference and the [RCF Sample Notebooks](#rcf-sample-notebooks).

## Instance Recommendations for the RCF Algorithm
<a name="rcf-instance-recommend"></a>

For training, we recommend the `ml.m4`, `ml.c4`, and `ml.c5` instance families. For inference, we recommend the `ml.c5.xlarge` instance type in particular, for maximum performance and minimum cost per hour of usage. Although the algorithm can technically run on GPU instance types, it does not take advantage of GPU hardware.

## RCF Sample Notebooks
<a name="rcf-sample-notebooks"></a>

For an example of how to train an RCF model and perform inferences with it, see the [An Introduction to SageMaker AI Random Cut Forests](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.html) notebook. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and select **Create copy**.

For a blog post on using the RCF algorithm, see [Use the built-in Amazon SageMaker AI Random Cut Forest algorithm for anomaly detection](https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/).

# How RCF Works
<a name="rcf_how-it-works"></a>

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a dataset. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a dataset can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

The main idea behind the RCF algorithm is to create a forest of trees, where each tree is obtained from a partition of a sample of the training data. For example, a random sample of the input data is first determined. The random sample is then partitioned according to the number of trees in the forest. Each tree is given such a partition and organizes that subset of points into a k-d tree. The anomaly score assigned to a data point by the tree is defined as the expected change in complexity of the tree as a result of adding that point to the tree; in approximation, this is inversely proportional to the resulting depth of the point in the tree. The random cut forest assigns an anomaly score by computing the average score from each constituent tree and scaling the result with respect to the sample size. The RCF algorithm is based on the one described in reference [1].

## Sample Data Randomly
<a name="rcf-rndm-sample-data"></a>

The first step in the RCF algorithm is to obtain a random sample of the training data. In particular, suppose we want a sample of size `K` from `N` total data points. If the training data is small enough, the entire dataset can be used, and we could randomly draw `K` elements from this set. However, frequently the training data is too large to fit all at once, and this approach isn't feasible. Instead, we use a technique called reservoir sampling.

[Reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) is an algorithm for efficiently drawing random samples from a dataset `S = {S_1, ..., S_N}` where the elements in the dataset can only be observed one at a time or in batches. In fact, reservoir sampling works even when `N` is not known *a priori*. If only one sample is requested, such as when `K = 1`, the algorithm is as follows:

**Algorithm: Reservoir Sampling**
+ Input: dataset or data stream `S = {S_1, ..., S_N}`
+ Initialize the random sample `X = S_1`
+ For each observed sample `S_n, n = 2, ..., N`:
  + Pick a uniform random number `ξ ∈ [0, 1]`
  + If `ξ < 1/n`:
    + Set `X = S_n`
+ Return `X`

This algorithm selects a random sample such that `P(X = S_n) = 1/N` for all `n = 1, ..., N`. When `K > 1` the algorithm is more complicated. Additionally, a distinction must be made between random sampling with and without replacement. RCF performs an augmented reservoir sampling without replacement on the training data, based on the algorithms described in [2].
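The single-sample algorithm above generalizes naturally to `K > 1`: keep the first `K` items, then replace a uniformly chosen reservoir slot with probability `K/n`. The sketch below implements that classic variant (often called Algorithm R) for illustration only; it is not the augmented without-replacement scheme from reference [2] that RCF actually uses:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniformly sample k items from a stream of unknown length.
    Each stream item ends up in the sample with probability k/N."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randrange(n)     # uniform integer in [0, n)
            if j < k:                # replacement happens with probability k/n
                reservoir[j] = item
    return reservoir
```

With `k=1` this reduces exactly to the algorithm stated in the text: the n-th item replaces the sample with probability `1/n`.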

## Train an RCF Model and Produce Inferences
<a name="rcf-training-inference"></a>

The next step in RCF is to construct a random cut forest using the random sample of data. First, the sample is partitioned into a number of equal-sized partitions equal to the number of trees in the forest. Then, each partition is sent to an individual tree. The tree recursively organizes its partition into a binary tree by partitioning the data domain into bounding boxes.
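The initial partitioning step can be sketched as follows. This is a hypothetical helper, not RCF's internal code; in practice the sample size is `num_trees * num_samples_per_tree`, so the partitions come out exactly equal:

```python
def partition_sample(sample, num_trees):
    """Split a sample into num_trees nearly equal contiguous partitions,
    one per tree in the forest (illustrative only)."""
    size, rem = divmod(len(sample), num_trees)
    parts, start = [], 0
    for i in range(num_trees):
        end = start + size + (1 if i < rem else 0)  # spread any remainder
        parts.append(sample[start:end])
        start = end
    return parts
```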

This procedure is best illustrated with an example. Suppose a tree is given the following two-dimensional dataset. The corresponding tree is initialized to the root node:

![\[A two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF1.jpg)


Figure: A two-dimensional dataset where the majority of data lies in a cluster (blue) except for one anomalous data point (orange). The tree is initialized with a root node.

The RCF algorithm organizes these data in a tree by first computing a bounding box of the data, selecting a random dimension (giving more weight to dimensions with higher "variance"), and then randomly determining the position of a hyperplane "cut" through that dimension. The two resulting subspaces define their own subtree. In this example, the cut happens to separate a lone point from the remainder of the sample. The first level of the resulting binary tree consists of two nodes: one representing the subtree of points to the left of the initial cut, and the other representing the single point on the right.

![\[A random cut partitioning the two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF2.jpg)


Figure: A random cut partitioning the two-dimensional dataset. An anomalous data point is more likely to lie isolated in a bounding box at a smaller tree depth than other points. 

Bounding boxes are then computed for the left and right halves of the data and the process is repeated until every leaf of the tree represents a single data point from the sample. Note that if the lone point is sufficiently far away then it is more likely that a random cut would result in point isolation. This observation provides the intuition that tree depth is, loosely speaking, inversely proportional to the anomaly score.

When performing inference using a trained RCF model the final anomaly score is reported as the average across scores reported by each tree. Note that it is often the case that the new data point does not already reside in the tree. To determine the score associated with the new point the data point is inserted into the given tree and the tree is efficiently (and temporarily) reassembled in a manner equivalent to the training process described above. That is, the resulting tree is as if the input data point were a member of the sample used to construct the tree in the first place. The reported score is inversely proportional to the depth of the input point within the tree.
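As a rough illustration of the depth intuition only (this is not RCF's actual scoring, which is based on the expected change in tree complexity), the following sketch recursively applies random cuts to one-dimensional data and records the depth at which a point becomes isolated. A far-away point tends to be isolated at a much smaller depth than points inside the cluster:

```python
import random

def isolation_depth(points, target, rng):
    """Depth at which `target` is isolated by recursive random cuts
    on a 1-D dataset (a simplified, illustrative stand-in for RCF)."""
    depth = 0
    current = points
    while len(current) > 1:
        lo, hi = min(current), max(current)
        cut = rng.uniform(lo, hi)                 # random cut position in the bounding interval
        left = [p for p in current if p <= cut]
        right = [p for p in current if p > cut]
        if not left or not right:                 # degenerate cut; retry
            continue
        current = left if target in left else right
        depth += 1
    return depth

def average_depth(points, target, num_trees=100, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(num_trees)) / num_trees
```

For a cluster of points in a narrow interval plus one distant outlier, the outlier's average depth is close to 1 while the cluster points sit several levels deeper, matching the inverse relationship between depth and anomaly score described above.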

## Choose Hyperparameters
<a name="rcf-choose-hyperparam"></a>

The primary hyperparameters used to tune the RCF model are `num_trees` and `num_samples_per_tree`. Increasing `num_trees` has the effect of reducing the noise observed in anomaly scores, since the final score is the average of the scores reported by each tree. While the optimal value is application-dependent, we recommend starting with 100 trees as a balance between score noise and model complexity. Note that inference time is proportional to the number of trees. Although training time is also affected, it is dominated by the reservoir sampling algorithm described above.

The parameter `num_samples_per_tree` is related to the expected density of anomalies in the dataset. In particular, `num_samples_per_tree` should be chosen such that `1/num_samples_per_tree` approximates the ratio of anomalous data to normal data. For example, if 256 samples are used in each tree then we expect our data to contain anomalies 1/256 or approximately 0.4% of the time. Again, an optimal value for this hyperparameter is dependent on the application.
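Under this heuristic, one hypothetical way to pick `num_samples_per_tree` from an estimated anomaly ratio is to round `1/ratio` to the nearest power of two and clamp it to the hyperparameter's valid range. The helper name and the power-of-two convention are assumptions for illustration, not an RCF requirement:

```python
import math

def samples_per_tree_for_ratio(anomaly_ratio, min_samples=1, max_samples=2048):
    """Choose num_samples_per_tree so that 1/num_samples_per_tree
    approximates the expected fraction of anomalous points."""
    target = 1.0 / anomaly_ratio
    # round to the nearest power of two, then clamp to the valid range
    candidate = 2 ** round(math.log2(target))
    return max(min_samples, min(max_samples, candidate))
```

For example, an estimated anomaly ratio of 0.4% (1/256) yields the default of 256 samples per tree.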

## References
<a name="references"></a>

1.  Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based anomaly detection on streams." In *International Conference on Machine Learning*, pp. 2712-2721. 2016.

1.  Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, and Al Geist. "Reservoir-based random sampling with replacement from data stream." In *Proceedings of the 2004 SIAM International Conference on Data Mining*, pp. 492-496. Society for Industrial and Applied Mathematics, 2004.

# RCF Hyperparameters
<a name="rcf_hyperparameters"></a>

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI RCF algorithm. For more information, including recommendations on how to choose hyperparameters, see [How RCF Works](rcf_how-it-works.md).
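Because the hyperparameters travel as string-to-string maps, every value is a string even when it is numeric. A training request might carry a map like the following sketch (the specific values are illustrative, not recommendations):

```python
# hyperparameters for a CreateTrainingJob request are strings, even numeric ones
hyperparameters = {
    "feature_dim": "2",
    "num_trees": "100",
    "num_samples_per_tree": "256",
}
```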




| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  The number of features in the data set. (If you use the [Random Cut Forest](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/randomcutforest.html) estimator, this value is calculated for you and need not be specified.) **Required** Valid values: Positive integer (min: 1, max: 10000)  | 
| eval\_metrics |  A list of metrics used to score a labeled test data set. The following metrics can be selected for output: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html) **Optional** Valid values: a list with possible values taken from `accuracy` or `precision_recall_fscore`.  Default value: Both `accuracy` and `precision_recall_fscore` are calculated.  | 
| num\_samples\_per\_tree |  Number of random samples given to each tree from the training data set. **Optional** Valid values: Positive integer (min: 1, max: 2048) Default value: 256  | 
| num\_trees |  Number of trees in the forest. **Optional** Valid values: Positive integer (min: 50, max: 1000) Default value: 100  | 

# Tune an RCF Model
<a name="random-cut-forest-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning or hyperparameter optimization, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI RCF algorithm is an unsupervised anomaly-detection algorithm that requires a labeled test dataset for hyperparameter optimization. RCF calculates anomaly scores for test data points and then labels the data points as anomalous if their scores are beyond three standard deviations from the mean score. This is known as the three-sigma limit heuristic. The F1-score is based on the difference between calculated labels and actual labels. The hyperparameter tuning job finds the model that maximizes that score. The success of hyperparameter optimization depends on the applicability of the three-sigma limit heuristic to the test dataset.
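The three-sigma labeling step described above can be sketched directly. This is a hypothetical reimplementation of the heuristic for illustration, not the tuning job's internal code; it labels scores lying more than three standard deviations above the mean, since anomalies correspond to high scores:

```python
def three_sigma_labels(scores):
    """Label a point anomalous (1) if its anomaly score lies more than
    three standard deviations above the mean score, else normal (0)."""
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    cutoff = mean + 3 * std  # the three-sigma limit
    return [1 if s > cutoff else 0 for s in scores]
```

Comparing these calculated labels with the actual labels in the test channel yields the F1-score that the tuning job maximizes.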

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the RCF Algorithm
<a name="random-cut-forest-metrics"></a>

The RCF algorithm computes the following metric during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:f1 | F1-score on the test dataset, based on the difference between calculated labels and actual labels. | Maximize | 

## Tunable RCF Hyperparameters
<a name="random-cut-forest-tunable-hyperparameters"></a>

You can tune an RCF model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_samples\_per\_tree | IntegerParameterRanges | MinValue: 1, MaxValue: 2048 | 
| num\_trees | IntegerParameterRanges | MinValue: 50, MaxValue: 1000 | 

# RCF Response Formats
<a name="rcf-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). Note that SageMaker AI Random Cut Forest supports both dense and sparse JSON and RecordIO formats. This topic contains a list of the available output formats for the SageMaker AI RCF algorithm.

## JSON Response Format
<a name="RCF-json"></a>

ACCEPT: application/json.

```
{
    "scores": [
        {"score": 0.02},
        {"score": 0.25}
    ]
}
```
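A response body in this format can be parsed with the standard library; the variable names below are illustrative:

```python
import json

response_body = '{"scores": [{"score": 0.02}, {"score": 0.25}]}'
# extract a plain list of floats, one anomaly score per input data point
scores = [record["score"] for record in json.loads(response_body)["scores"]]
```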

## JSONLINES Response Format
<a name="RCF-jsonlines"></a>

ACCEPT: application/jsonlines.

```
{"score": 0.02}
{"score": 0.25}
```

## RECORDIO Response Format
<a name="rcf-recordio"></a>

ACCEPT: application/x-recordio-protobuf.

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.25]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.23]  # float32
            }
        }
    }
]
```