

# Random Cut Forest (RCF) Algorithm
<a name="randomcutforest"></a>

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
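The three-standard-deviations cutoff can be sketched in a few lines of Python (the scores below are illustrative values, not SageMaker output):

```python
import statistics

def flag_anomalies(scores, k=3.0):
    """Flag scores lying more than k standard deviations above the mean."""
    cutoff = statistics.mean(scores) + k * statistics.pstdev(scores)
    return [s > cutoff for s in scores]

# 30 "normal" scores plus one clear outlier at the end
scores = [0.9, 1.1] * 15 + [5.0]
flags = flag_anomalies(scores)   # only the last score is flagged
```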

While there are many applications of anomaly detection algorithms to one-dimensional time series data such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-dimensional input. Amazon SageMaker AI RCF scales well with respect to number of features, data set size, and number of instances.

**Topics**
+ [Input/Output Interface for the RCF Algorithm](#rcf-input_output)
+ [Instance Recommendations for the RCF Algorithm](#rcf-instance-recommend)
+ [RCF Sample Notebooks](#rcf-sample-notebooks)
+ [How RCF Works](rcf_how-it-works.md)
+ [RCF Hyperparameters](rcf_hyperparameters.md)
+ [Tune an RCF Model](random-cut-forest-tuning.md)
+ [RCF Response Formats](rcf-in-formats.md)

## Input/Output Interface for the RCF Algorithm
<a name="rcf-input_output"></a>

Amazon SageMaker AI Random Cut Forest supports the `train` and `test` data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either `application/x-recordio-protobuf` or `text/csv` formats. For the test data, when using text/csv format, the content must be specified as `text/csv;label_size=1`, where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

The train channel only supports `S3DataDistributionType=ShardedByS3Key` and the test channel only supports `S3DataDistributionType=FullyReplicated`. The following example specifies the S3 distribution type for the train channel using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

**Note**  
The `sagemaker.inputs.s3_input` method was renamed to `sagemaker.inputs.TrainingInput` in [SageMaker Python SDK v2](https://sagemaker.readthedocs.io/en/stable/v2.html#s3-input).

```
  import sagemaker
    
  # specify Random Cut Forest training job information and hyperparameters
  rcf = sagemaker.estimator.Estimator(...)
    
  # explicitly specify "ShardedByS3Key" distribution type
  train_data = sagemaker.inputs.TrainingInput(
       s3_data=s3_training_data_location,
       content_type='text/csv;label_size=0',
       distribution='ShardedByS3Key')
    
  # run the training job on input data stored in S3
  rcf.fit({'train': train_data})
```
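A labeled test channel can be configured the same way. The following is a sketch only: the S3 location is a placeholder, and `rcf` and `train_data` are the objects from the preceding example.

```python
import sagemaker

# placeholder S3 location -- substitute your own bucket and prefix
s3_test_data_location = 's3://amzn-s3-demo-bucket/rcf/test.csv'

# the first CSV column holds the 0/1 anomaly label, hence label_size=1;
# the test channel only supports the FullyReplicated distribution type
test_data = sagemaker.inputs.TrainingInput(
    s3_data=s3_test_data_location,
    content_type='text/csv;label_size=1',
    distribution='FullyReplicated')

# rcf and train_data come from the preceding example
rcf.fit({'train': train_data, 'test': test_data})
```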

To avoid common errors around execution roles, ensure that you have the required execution roles, `AmazonSageMakerFullAccess` and `AmazonEC2ContainerRegistryFullAccess`. To avoid errors caused by your image not existing or its permissions being incorrect, ensure that your ECR image is not larger than the allocated disk space on the training instance; run your training job on an instance that has sufficient disk space. In addition, if your ECR image is from a different AWS account's Amazon Elastic Container Registry (Amazon ECR) repository and you do not set repository permissions to grant access, you will get an error. See [ECR repository permissions](https://docs.aws.amazon.com/AmazonECR/latest/userguide/set-repository-policy.html) for more information on setting a repository policy statement.

See [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for more information on customizing the S3 data source attributes. Finally, to take advantage of multi-instance training, the training data must be partitioned into at least as many files as instances.

For inference, RCF supports `application/x-recordio-protobuf`, `text/csv`, and `application/json` input data content types. See the [Parameters for Built-in Algorithms](common-info-all-im-models.md) documentation for more information. RCF inference returns `application/x-recordio-protobuf` or `application/json` formatted output. Each record in the output data contains the corresponding anomaly score for each input data point. See [Common Data Formats--Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) for more information.
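As a sketch of real-time inference with the Python SDK, assuming a trained estimator `rcf` (the instance type and input values here are illustrative):

```python
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# deploy the trained estimator to a real-time endpoint
predictor = rcf.deploy(initial_instance_count=1,
                       instance_type='ml.c5.xlarge')
predictor.serializer = CSVSerializer()        # send text/csv
predictor.deserializer = JSONDeserializer()   # receive application/json

# each inner list is one data point; the response carries one score each
results = predictor.predict([[1.0, 2.0], [100.0, 200.0]])
scores = [record['score'] for record in results['scores']]
```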

For more information on input and output file formats, see [RCF Response Formats](rcf-in-formats.md) for inference and the [RCF Sample Notebooks](#rcf-sample-notebooks).

## Instance Recommendations for the RCF Algorithm
<a name="rcf-instance-recommend"></a>

For training, we recommend the `ml.m4`, `ml.c4`, and `ml.c5` instance families. For inference, we recommend the `ml.c5.xl` instance type in particular, for maximum performance and minimum cost per hour of usage. Although the algorithm could technically run on GPU instance types, it does not take advantage of GPU hardware.

## RCF Sample Notebooks
<a name="rcf-sample-notebooks"></a>

For an example of how to train an RCF model and perform inferences with it, see the [An Introduction to SageMaker AI Random Cut Forests](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.html) notebook. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and then choose **Create copy**.

For a blog post on using the RCF algorithm, see [Use the built-in Amazon SageMaker AI Random Cut Forest algorithm for anomaly detection](https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/).

# How RCF Works
<a name="rcf_how-it-works"></a>

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a dataset, that is, observations which diverge from otherwise well-structured or patterned data.

The main idea behind the RCF algorithm is to create a forest of trees where each tree is obtained using a partition of a sample of the training data. For example, a random sample of the input data is first determined. The random sample is then partitioned according to the number of trees in the forest. Each tree is given such a partition and organizes that subset of points into a k-d tree. The anomaly score assigned to a data point by the tree is defined as the expected change in complexity of the tree as a result of adding that point to the tree, which, in approximation, is inversely proportional to the resulting depth of the point in the tree. The random cut forest assigns an anomaly score by computing the average score from each constituent tree and scaling the result with respect to the sample size. The RCF algorithm is based on the one described in reference [1].

## Sample Data Randomly
<a name="rcf-rndm-sample-data"></a>

The first step in the RCF algorithm is to obtain a random sample of the training data. In particular, suppose we want a sample of size `K` from `N` total data points. If the training data is small enough, the entire dataset can be used, and we could randomly draw `K` elements from this set. However, frequently the training data is too large to fit all at once, and this approach isn't feasible. Instead, we use a technique called reservoir sampling.

[Reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) is an algorithm for efficiently drawing random samples from a dataset `S = {S_1, ..., S_N}` where the elements in the dataset can only be observed one at a time or in batches. In fact, reservoir sampling works even when `N` is not known *a priori*. If only one sample is requested, such as when `K = 1`, the algorithm is as follows:

**Algorithm: Reservoir Sampling**
+  Input: dataset or data stream `S = {S_1, ..., S_N}`
+  Initialize the random sample `X = S_1`
+  For each observed sample `S_n, n = 2, ..., N`:
  +  Pick a uniform random number `ξ ∈ [0, 1]`
  +  If `ξ < 1/n`:
    +  Set `X = S_n`
+  Return `X`

This algorithm selects a random sample such that `P(X = S_n) = 1/N` for all `n = 1, ..., N`. When `K > 1` the algorithm is more complicated. Additionally, a distinction must be made between random sampling that is with and without replacement. RCF performs an augmented reservoir sampling without replacement on the training data, based on the algorithms described in [2].
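For `K > 1`, the standard generalization is Algorithm R, which fills a `K`-element reservoir and then replaces a random slot with probability `K/n`. The following is a minimal sketch of that textbook variant, not SageMaker's augmented algorithm from [2]:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: draw k items uniformly, without replacement, from a
    stream whose total length need not be known in advance."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)        # uniform integer in [0, n)
            if j < k:                   # replace a slot with probability k/n
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 256, random.Random(0))
```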

## Train an RCF Model and Produce Inferences
<a name="rcf-training-inference"></a>

The next step in RCF is to construct a random cut forest using the random sample of data. First, the sample is partitioned into a number of equal-sized partitions equal to the number of trees in the forest. Then, each partition is sent to an individual tree. The tree recursively organizes its partition into a binary tree by partitioning the data domain into bounding boxes.

This procedure is best illustrated with an example. Suppose a tree is given the following two-dimensional dataset. The corresponding tree is initialized to the root node:

![\[A two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF1.jpg)


Figure: A two-dimensional dataset where the majority of data lies in a cluster (blue) except for one anomalous data point (orange). The tree is initialized with a root node.

The RCF algorithm organizes these data in a tree by first computing a bounding box of the data, selecting a random dimension (giving more weight to dimensions with higher "variance"), and then randomly determining the position of a hyperplane "cut" through that dimension. The two resulting subspaces define their own subtrees. In this example, the cut happens to separate a lone point from the remainder of the sample. The first level of the resulting binary tree consists of two nodes: one representing the subtree of points to the left of the initial cut, and the other representing the single point on the right.

![\[A random cut partitioning the two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF2.jpg)


Figure: A random cut partitioning the two-dimensional dataset. An anomalous data point is more likely to lie isolated in a bounding box at a smaller tree depth than other points. 

Bounding boxes are then computed for the left and right halves of the data and the process is repeated until every leaf of the tree represents a single data point from the sample. Note that if the lone point is sufficiently far away then it is more likely that a random cut would result in point isolation. This observation provides the intuition that tree depth is, loosely speaking, inversely proportional to the anomaly score.
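The cut procedure above can be sketched as a toy implementation: compute the bounding box, pick a dimension with probability proportional to its side length, cut at a uniform random position, and recurse. This is illustrative only, not the SageMaker implementation:

```python
import random

def build_tree(points, rng, depth=0):
    """Toy random cut tree: recursively cut the bounding box until
    every leaf holds a single point."""
    if len(points) == 1:
        return {"point": points[0], "depth": depth}
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    lengths = [hi[d] - lo[d] for d in range(dims)]
    d = rng.choices(range(dims), weights=lengths)[0]
    cut = rng.uniform(lo[d], hi[d])
    left = [p for p in points if p[d] <= cut]
    right = [p for p in points if p[d] > cut]
    if not left or not right:           # degenerate cut; draw again
        return build_tree(points, rng, depth)
    return {"left": build_tree(left, rng, depth + 1),
            "right": build_tree(right, rng, depth + 1)}

def leaf_depths(node, out=None):
    """Map each leaf point to its depth in the tree."""
    out = {} if out is None else out
    if "point" in node:
        out[node["point"]] = node["depth"]
    else:
        leaf_depths(node["left"], out)
        leaf_depths(node["right"], out)
    return out

# a small cluster plus one distant (anomalous) point
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (10.0, 10.0)]
depths = leaf_depths(build_tree(points, random.Random(7)))
```

Because cuts fall uniformly across the bounding box, the distant point is usually isolated within the first cut or two, ending up at a shallow depth relative to the clustered points.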

When performing inference using a trained RCF model, the final anomaly score is reported as the average of the scores reported by each tree. Note that it is often the case that the new data point does not already reside in the tree. To determine the score associated with the new point, the data point is inserted into the given tree and the tree is efficiently (and temporarily) reassembled in a manner equivalent to the training process described above. That is, the resulting tree is as if the input data point were a member of the sample used to construct the tree in the first place. The reported score is inversely proportional to the depth of the input point within the tree.

## Choose Hyperparameters
<a name="rcf-choose-hyperparam"></a>

The primary hyperparameters used to tune the RCF model are `num_trees` and `num_samples_per_tree`. Increasing `num_trees` has the effect of reducing the noise observed in anomaly scores since the final score is the average of the scores reported by each tree. While the optimal value is application-dependent, we recommend using 100 trees to begin with as a balance between score noise and model complexity. Note that inference time is proportional to the number of trees. Although training time is also affected, it is dominated by the reservoir sampling algorithm described above.

The parameter `num_samples_per_tree` is related to the expected density of anomalies in the dataset. In particular, `num_samples_per_tree` should be chosen such that `1/num_samples_per_tree` approximates the ratio of anomalous data to normal data. For example, if 256 samples are used in each tree then we expect our data to contain anomalies 1/256 or approximately 0.4% of the time. Again, an optimal value for this hyperparameter is dependent on the application.
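This sizing rule can be written as a small helper, clamped to the valid range of `num_samples_per_tree` (the helper name is our own, not part of the SDK):

```python
def samples_per_tree(expected_anomaly_fraction):
    """Choose num_samples_per_tree so that 1/num_samples_per_tree roughly
    matches the expected anomaly ratio, clamped to the valid range."""
    return max(1, min(2048, round(1 / expected_anomaly_fraction)))

print(samples_per_tree(1 / 256))    # 256
print(samples_per_tree(0.01))       # 100
```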

## References
<a name="references"></a>

1.  Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based anomaly detection on streams." In *International Conference on Machine Learning*, pp. 2712-2721. 2016.

1.  Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, and Al Geist. "Reservoir-based random sampling with replacement from data stream." In *Proceedings of the 2004 SIAM International Conference on Data Mining*, pp. 492-496. Society for Industrial and Applied Mathematics, 2004.

# RCF Hyperparameters
<a name="rcf_hyperparameters"></a>

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI RCF algorithm. For more information, including recommendations on how to choose hyperparameters, see [How RCF Works](rcf_how-it-works.md).




| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  The number of features in the data set. (If you use the [Random Cut Forest](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/randomcutforest.html) estimator, this value is calculated for you and need not be specified.) **Required** Valid values: Positive integer (min: 1, max: 10000)  | 
| eval\_metrics |  A list of metrics used to score a labeled test data set. The following metrics can be selected for output: `accuracy`, which returns the fraction of correct predictions, and `precision_recall_fscore`, which returns the precision, recall, and F1 metrics. **Optional** Valid values: a list with possible values taken from `accuracy` or `precision_recall_fscore`.  Default value: Both `accuracy` and `precision_recall_fscore` are calculated.  | 
| num\_samples\_per\_tree |  Number of random samples given to each tree from the training data set. **Optional** Valid values: Positive integer (min: 1, max: 2048) Default value: 256  | 
| num\_trees |  Number of trees in the forest. **Optional** Valid values: Positive integer (min: 50, max: 1000) Default value: 100  | 

# Tune an RCF Model
<a name="random-cut-forest-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning or hyperparameter optimization, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI RCF algorithm is an unsupervised anomaly-detection algorithm that requires a labeled test dataset for hyperparameter optimization. RCF calculates anomaly scores for test data points and then labels the data points as anomalous if their scores are beyond three standard deviations from the mean score. This is known as the three-sigma limit heuristic. The F1-score is based on the difference between calculated labels and actual labels. The hyperparameter tuning job finds the model that maximizes that score. The success of hyperparameter optimization depends on the applicability of the three-sigma limit heuristic to the test dataset.
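The labeling and scoring steps can be sketched in plain Python (illustrative only, not the service's internal implementation; the scores below are made-up values):

```python
import statistics

def three_sigma_labels(scores):
    """Label a score 1 (anomalous) if it exceeds mean + 3 * stddev."""
    cutoff = statistics.mean(scores) + 3 * statistics.pstdev(scores)
    return [1 if s > cutoff else 0 for s in scores]

def f1_score(actual, predicted):
    """F1 = harmonic mean of precision and recall on the 0/1 labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 30 normal points and one true anomaly at the end
scores = [1.0, 1.2] * 15 + [6.0]
actual = [0] * 30 + [1]
predicted = three_sigma_labels(scores)
f1 = f1_score(actual, predicted)
```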

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the RCF Algorithm
<a name="random-cut-forest-metrics"></a>

The RCF algorithm computes the following metric during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:f1 | F1-score on the test dataset, based on the difference between calculated labels and actual labels. | Maximize | 

## Tunable RCF Hyperparameters
<a name="random-cut-forest-tunable-hyperparameters"></a>

You can tune an RCF model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_samples\_per\_tree | IntegerParameterRanges | MinValue: 1, MaxValue: 2048 | 
| num\_trees | IntegerParameterRanges | MinValue: 50, MaxValue: 1000 | 

# RCF Response Formats
<a name="rcf-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). Note that SageMaker AI Random Cut Forest supports both dense and sparse JSON and RecordIO formats. This topic contains a list of the available output formats for the SageMaker AI RCF algorithm.

## JSON Response Format
<a name="RCF-json"></a>

ACCEPT: application/json.

```
{
    "scores": [
        {"score": 0.02},
        {"score": 0.25}
    ]
}
```
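Parsing this response in Python is straightforward:

```python
import json

# the response body shown above, as returned with ACCEPT: application/json
body = '{"scores": [{"score": 0.02}, {"score": 0.25}]}'
scores = [record["score"] for record in json.loads(body)["scores"]]
print(scores)   # [0.02, 0.25]
```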

## JSONLINES Response Format
<a name="RCF-jsonlines"></a>

ACCEPT: application/jsonlines.

```
{"score": 0.02}
{"score": 0.25}
```

## RECORDIO Response Format
<a name="rcf-recordio"></a>

ACCEPT: application/x-recordio-protobuf.

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.25]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.23]  # float32
            }
        }
    }
]
```