

# Principal Component Analysis (PCA) Algorithm
<a name="pca"></a>

PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called *components*, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario: 
+ **regular**: For datasets with sparse data and a moderate number of observations and features.
+ **randomized**: For datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

PCA uses tabular data. 

The rows represent observations you want to embed in a lower dimensional space. The columns represent features that you want to find a reduced approximation for. The algorithm calculates the covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular value decomposition on this summary to produce the principal components. 
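As a simplified, single-machine sketch (using NumPy; not the distributed SageMaker implementation), the covariance-plus-SVD computation might look like this. The data shapes and `num_components` value are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # 200 observations, 5 features

# Covariance of the mean-centered data
mean = X.mean(axis=0)
cov = (X - mean).T @ (X - mean) / X.shape[0]

# SVD of the covariance matrix yields the principal components
U, s, Vt = np.linalg.svd(cov)
num_components = 2
components = Vt[:num_components]        # each row is one component

# Embed the observations in the lower-dimensional space
projections = (X - mean) @ components.T
print(projections.shape)                # (200, 2)
```

Note that the resulting projection columns are uncorrelated with one another, and the first accounts for the largest share of the variance, as described above.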

**Topics**
+ [Input/Output Interface for the PCA Algorithm](#pca-inputoutput)
+ [EC2 Instance Recommendation for the PCA Algorithm](#pca-instances)
+ [PCA Sample Notebooks](#PCA-sample-notebooks)
+ [How PCA Works](how-pca-works.md)
+ [PCA Hyperparameters](PCA-reference.md)
+ [PCA Response Formats](PCA-in-formats.md)

## Input/Output Interface for the PCA Algorithm
<a name="pca-inputoutput"></a>

For training, PCA expects data provided in the train channel, and optionally supports a dataset passed to the test channel, which is scored by the final model. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, PCA supports `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Results are returned in either `application/json` or `application/x-recordio-protobuf` format with a vector of "projections."

For more information on input and output file formats, see [PCA Response Formats](PCA-in-formats.md) for inference and the [PCA Sample Notebooks](#PCA-sample-notebooks).
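As a minimal sketch of the request/response round trip, the following builds a `text/csv` request body and parses a response in the documented `application/json` shape. The feature values and projection values are made up; only the payload structure follows the formats above:

```python
import json

# A text/csv request body: one observation per line (hypothetical 5-feature rows)
csv_body = "1.0,2.0,3.0,4.0,5.0\n6.0,7.0,8.0,9.0,0.0\n"

# An application/json response body in the documented shape
response_body = json.dumps({
    "projections": [
        {"projection": [0.5, -1.2]},
        {"projection": [2.1, 0.3]},
    ]
})

# Extract one lower-dimensional vector per input row
projections = [p["projection"] for p in json.loads(response_body)["projections"]]
print(projections)   # [[0.5, -1.2], [2.1, 0.3]]
```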

## EC2 Instance Recommendation for the PCA Algorithm
<a name="pca-instances"></a>

PCA supports CPU and GPU instances for training and inference. Which instance type is most performant depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and G5.

## PCA Sample Notebooks
<a name="PCA-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Principal Component Analysis algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to PCA with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/pca_mnist/pca_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example notebooks that use the PCA algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# How PCA Works
<a name="how-pca-works"></a>

Principal Component Analysis (PCA) is a learning algorithm that reduces the dimensionality (number of features) within a dataset while still retaining as much information as possible. 

PCA reduces dimensionality by finding a new set of features called *components*, which are composites of the original features, but are uncorrelated with one another. The first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

It is an unsupervised dimensionality reduction algorithm. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used.

Given the input of a matrix with rows ![\[x_1,…,x_n\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-39b.png) each of dimension `1 * d`, the data is partitioned into mini-batches of rows and distributed among the training nodes (workers). Each worker then computes a summary of its data. The summaries of the different workers are then unified into a single solution at the end of the computation. 

**Modes**

The Amazon SageMaker AI PCA algorithm uses either of two modes to calculate these summaries, depending on the situation:
+ **regular**: for datasets with sparse data and a moderate number of observations and features.
+ **randomized**: for datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

As the algorithm's last step, it performs the singular value decomposition on the unified solution, from which the principal components are then derived.

## Mode 1: Regular
<a name="mode-1"></a>

The workers jointly compute both ![\[Equation in text-form: \sum x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-1b.png) and ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

**Note**  
Because ![\[Equation in text-form: x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-3b.png) are `1 * d` row vectors, ![\[Equation in text-form: x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-4b.png) is a matrix (not a scalar). Using row vectors within the code allows us to obtain efficient caching.

The covariance matrix is computed as ![\[Equation in text-form: \sum x_i^T x_i - (1/n) (\sum x_i)^T \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-32b.png) , and its top `num_components` singular vectors form the model.

**Note**  
If `subtract_mean` is `False`, we avoid computing and subtracting ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

Use this algorithm when the dimension `d` of the vectors is small enough so that ![\[Equation in text-form: d^2\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-7b.png) can fit in memory.
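A single-machine NumPy sketch of the regular-mode summary follows (illustrative only; the actual implementation distributes these sums across workers and unifies them). The dimensions and row counts are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, num_components = 4, 500, 2
subtract_mean = True

# Accumulate sum(x_i^T x_i) and sum(x_i) over the rows
xtx = np.zeros((d, d))
xsum = np.zeros(d)
for x in rng.normal(size=(n, d)):      # x is a 1 * d row vector
    xtx += np.outer(x, x)              # x_i^T x_i is a d * d matrix
    xsum += x

# Unified summary -> covariance matrix (skip the mean term if subtract_mean is False)
cov = xtx - np.outer(xsum, xsum) / n if subtract_mean else xtx

# The top num_components singular vectors form the model
U, s, Vt = np.linalg.svd(cov)
components = Vt[:num_components]
```

The `d * d` matrix `xtx` is the memory bottleneck this mode implies, which is why it suits a small feature dimension `d`.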

## Mode 2: Randomized
<a name="mode-2"></a>

When the number of features in the input dataset is large, we use a method to approximate the covariance matrix. For every mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, we randomly initialize a `(num_components + extra_components) * b` matrix that we multiply by each mini-batch to create a `(num_components + extra_components) * d` matrix. The sum of these matrices is computed by the workers, and the servers perform SVD on the final `(num_components + extra_components) * d` matrix. Its top `num_components` right singular vectors are the approximation of the top singular vectors of the input matrix.

Let ![\[Equation in text-form: \ell\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38b.png) ` = num_components + extra_components`. Given a mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, the worker draws a random matrix ![\[Equation in text-form: H_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-24b.png) of dimension ![\[Equation in text-form: \ell * b\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38.png) . Depending on whether the environment uses a GPU or CPU and the dimension size, the matrix is either a random sign matrix where each entry is `+-1` or a *FJLT* (fast Johnson Lindenstrauss transform; for information, see [FJLT Transforms](https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf) and the follow-up papers). The worker then computes ![\[Equation in text-form: H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-26b.png) and maintains ![\[Equation in text-form: B = \sum H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-27b.png) . The worker also maintains ![\[Equation in text-form: h^T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-28b.png) , the sum of columns of ![\[Equation in text-form: H_1,..,H_T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-29b.png) (`T` being the total number of mini-batches), and `s`, the sum of all input rows. After processing the entire shard of data, the worker sends the server `B`, `h`, `s`, and `n` (the number of input rows).

Denote the different inputs to the server as ![\[Equation in text-form: B^1, h^1, s^1, n^1,…\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-30b.png) . The server computes `B`, `h`, `s`, and `n`, the sums of the respective inputs. It then computes ![\[Equation in text-form: C = B – (1/n) h^T s\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-31b.png) , and finds its singular value decomposition. The top right singular vectors and singular values of `C` are used as the approximate solution to the problem.
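The randomized-mode pipeline can be sketched in NumPy as follows. This is a simplified single-worker illustration using a random sign matrix for the sketching step (the FJLT variant is omitted); the dimensions, batch size, and batch count are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
d, b, T = 50, 20, 10                    # features, mini-batch size, batches
num_components, extra_components = 3, 7
ell = num_components + extra_components

# Worker pass: accumulate B = sum(H_t X_t), h = column sums of the H_t,
# s = sum of input rows, n = number of input rows
B = np.zeros((ell, d))
h = np.zeros(ell)
s = np.zeros(d)
n = 0
for _ in range(T):
    X_t = rng.normal(size=(b, d))                     # mini-batch, b * d
    H_t = rng.choice([-1.0, 1.0], size=(ell, b))      # random sign matrix
    B += H_t @ X_t
    h += H_t.sum(axis=1)
    s += X_t.sum(axis=0)
    n += b

# Server step: C = B - (1/n) h^T s, then SVD; keep top right singular vectors
C = B - np.outer(h, s) / n
U, sv, Vt = np.linalg.svd(C, full_matrices=False)
approx_components = Vt[:num_components]               # approximate top components
```

Note that only the small `ell * d` sketch `C` is ever decomposed, so the `d * d` covariance matrix is never materialized.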

# PCA Hyperparameters
<a name="PCA-reference"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific HyperParameters as string-to-string maps. The following table lists the hyperparameters for the PCA training algorithm provided by Amazon SageMaker AI. For more information about how PCA works, see [How PCA Works](how-pca-works.md). 


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  Input dimension. **Required** Valid values: positive integer  | 
| mini\_batch\_size |  Number of rows in a mini-batch. **Required** Valid values: positive integer  | 
| num\_components |  The number of principal components to compute. **Required** Valid values: positive integer  | 
| algorithm\_mode |  Mode for computing the principal components.  **Optional** Valid values: *regular* or *randomized* Default value: *regular*  | 
| extra\_components |  As the value increases, the solution becomes more accurate but the runtime and memory consumption increase linearly. The default, -1, means the maximum of 10 and `num_components`. Valid for *randomized* mode only. **Optional** Valid values: Non-negative integer or -1 Default value: -1  | 
| subtract\_mean |  Indicates whether the data should be unbiased both during training and at inference.  **Optional** Valid values: One of *true* or *false* Default value: *true*  | 
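For example, the hyperparameters might be supplied in a `CreateTrainingJob` request as a string-to-string map like the following sketch (the values shown are illustrative, not recommendations):

```python
# All hyperparameter values must be passed as strings, even numeric ones
hyperparameters = {
    "feature_dim": "784",
    "mini_batch_size": "200",
    "num_components": "10",
    "algorithm_mode": "randomized",
    "extra_components": "-1",
    "subtract_mean": "true",
}
```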

# PCA Response Formats
<a name="PCA-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI PCA algorithm.

## JSON Response Format
<a name="PCA-json"></a>

Accept: application/json

```
{
    "projections": [
        {
            "projection": [1.0, 2.0, 3.0, 4.0, 5.0]
        },
        {
            "projection": [6.0, 7.0, 8.0, 9.0, 0.0]
        },
        ....
    ]
}
```

## JSONLINES Response Format
<a name="PCA-jsonlines"></a>

Accept: application/jsonlines

```
{ "projection": [1.0, 2.0, 3.0, 4.0, 5.0] }
{ "projection": [6.0, 7.0, 8.0, 9.0, 0.0] }
```

## RECORDIO Response Format
<a name="PCA-recordio"></a>

Accept: application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    },
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    }  
]
```