

# Statistical Variance and Deviation Functions
<a name="sql-reference-statistical-variance-deviation-functions"></a>

Each of these functions takes a set of numbers, ignores nulls, and can be used as either an aggregate function or an analytical function. For more information, see [Aggregate Functions](sql-reference-aggregate-functions.md) and [Analytic Functions](sql-reference-analytic-functions.md).

The relationships among these functions are described in the following table.


| Function purpose | Function name | Formula | Comments | 
| --- | --- | --- | --- | 
|  Hotspots  |  [HOTSPOTS](sqlrf-hotspots.md) (expr)  |  Detects hotspots of frequently occurring data in the data stream.  |    | 
|  Random Cut Forest  |  [RANDOM\_CUT\_FOREST](sqlrf-random-cut-forest.md) (expr)  |  Detects anomalies in the data stream.  |    | 
|  Random Cut Forest with Explanation  |  [RANDOM\_CUT\_FOREST\_WITH\_EXPLANATION](sqlrf-random-cut-forest-with-explanation.md) (expr)  |  Detects anomalies in the data stream, and returns an attribution score based on how anomalous the data in each column is.  |    | 
|  Population variance  |  [VAR\_POP](sql-reference-VARPOP.md) (expr)  |  (SUM(expr\*expr) - SUM(expr)\*SUM(expr) / COUNT(expr)) / COUNT(expr)  |  Applied to an empty set, it returns null.  | 
|  Population standard deviation  |  [STDDEV\_POP](sql-reference-STDDEVPOP.md) (expr)  |  Square root of the population variance (VAR\_POP).  |  When VAR\_POP returns null, STDDEV\_POP returns null.  | 
|  Sample variance  |  [VAR\_SAMP](sql-reference-VARSAMP.md) (expr)  |  (SUM(expr\*expr) - SUM(expr)\*SUM(expr) / COUNT(expr)) / (COUNT(expr) - 1)  |  Applied to an empty set, it returns null. Applied to an input set of one element, VAR\_SAMP returns null.  | 
|  Sample standard deviation  |  [STDDEV\_SAMP](sql-reference-STDDEVSAMP.md) (expr)  |  Square root of the sample variance (VAR\_SAMP).  |  Applied to only one row of input data, STDDEV\_SAMP returns null.  | 
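
The running-sum formulas in the table can be checked numerically. The following Python sketch (illustrative only; the function names are ours, not part of the SQL reference) mirrors the formulas and their null-return rules:

```python
def var_pop(xs):
    # (SUM(x*x) - SUM(x)*SUM(x) / COUNT) / COUNT; null for an empty set
    n = len(xs)
    if n == 0:
        return None
    s, ss = sum(xs), sum(x * x for x in xs)
    return (ss - s * s / n) / n

def var_samp(xs):
    # Same numerator, divided by COUNT - 1; null for 0 or 1 elements
    n = len(xs)
    if n < 2:
        return None
    s, ss = sum(xs), sum(x * x for x in xs)
    return (ss - s * s / n) / (n - 1)

def stddev_pop(xs):
    # Square root of the population variance; null when VAR_POP is null
    v = var_pop(xs)
    return None if v is None else v ** 0.5

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(var_pop(data))     # 4.0
print(stddev_pop(data))  # 2.0
print(var_samp([3.0]))   # None (single-element set)
```

The running-sum form lets a streaming engine maintain only `SUM(expr)`, `SUM(expr*expr)`, and `COUNT(expr)` rather than buffering the whole window.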

# HOTSPOTS
<a name="sqlrf-hotspots"></a>

Detects *hotspots*, or regions where activity is significantly higher than the norm, in your data stream. A hotspot is a small region of space that is relatively dense with data points. 

With the `HOTSPOTS` function, you can identify relatively dense regions in your data using a simple SQL function, without having to explicitly build and train complicated machine learning models. You can then identify subsections of your data that need attention so that you can take immediate action.

For example, hotspots in your data might indicate a collection of overheated servers in a data center, a high concentration of vehicles indicating a traffic bottleneck, ride share rides in a certain area indicating a high-traffic event, or increased sales of products in a category that share similar features.

**Note**  
The ability of the `HOTSPOTS` function to detect frequent data points is application-dependent. To cast your business problem so that it can be solved with this function requires domain expertise. For example, you might want to determine which combination of columns in your input stream to pass to the function, and how to normalize the data if necessary. 

The algorithm accepts the `DOUBLE`, `INTEGER`, `FLOAT`, `TINYINT`, `SMALLINT`, `REAL`, and `BIGINT` data types. `DECIMAL` is not a supported type. Use `DOUBLE` instead.

**Note**  
The `HOTSPOTS` function does not return the records that make up a hotspot. You can use the `ROWTIME` column to determine which records belong to a given hotspot.



## Syntax
<a name="hotspots-syntax"></a>

```
HOTSPOTS (inputStream,    
  windowSize,
  scanRadius,
  minimumNumberOfPointsInAHotspot)
```

## Parameters
<a name="hotspots-parameters"></a>

The following sections describe the `HOTSPOTS` function parameters. 

### inputStream
<a name="hotspots-input-stream"></a>

Pointer to your input stream. You set a pointer using the `CURSOR` function. For example, the following statements set a pointer to `InputStream`:

 

```
--Select all columns from input stream 
CURSOR(SELECT STREAM * FROM InputStream) 
--Select specific columns from input stream
CURSOR(SELECT STREAM PRICE, CHANGE FROM InputStream) 
-- Normalize the column X value.
CURSOR(SELECT STREAM IntegerColumnX / 100, IntegerColumnY FROM InputStream) 
-- Combine columns before passing to the function.
CURSOR(SELECT STREAM IntegerColumnX - IntegerColumnY FROM InputStream)
```

**Note**  
Only numeric columns from the input stream will be analyzed for hotspots. The `HOTSPOTS` function ignores other columns included in the cursor.

### windowSize
<a name="hotspots-windowsize"></a>

Specifies the number of records that are considered for each timestep by the sliding window over the stream.

You can set this value between 100 and 1000 inclusive. 

By increasing the window size, you can get a better estimate of hotspot position and density (relevance), but this also increases the running time.

### scanRadius
<a name="hotspots-scan-radius"></a>

Specifies the typical distance between a hotspot point and its nearest neighbors. 

This parameter is analogous to the **ε** value in the [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) algorithm.

Set this parameter to a value that is smaller than the typical distance between points that are not in a hotspot, but large enough so that points in a hotspot have neighbors within this distance. 

You can set this value to any double value greater than zero. The lower the value for `scanRadius`, the more similar any two records belonging to the same hotspot are. However, low values for `scanRadius` also increase running time. Lower values of `scanRadius` result in hotspots that are smaller but more numerous.
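
To build intuition for `scanRadius`, the following Python sketch shows the DBSCAN-style neighborhood test it is analogous to. This is a hypothetical illustration, not the service's algorithm; the function and threshold names are ours:

```python
import math

def neighbors_within(points, center, scan_radius):
    """Points whose Euclidean distance from `center` is <= scan_radius."""
    return [p for p in points
            if p != center and math.dist(p, center) <= scan_radius]

# A tight cluster near (0, 0) plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# With scanRadius = 0.2 and a minimum of 2 neighbors, only the
# clustered points qualify as belonging to a dense region.
dense = [p for p in points if len(neighbors_within(points, p, 0.2)) >= 2]
print(dense)  # the four clustered points; (5.0, 5.0) has no neighbors
```

Shrinking the radius below the cluster's typical point spacing (for example, to 0.05 here) would leave every point without neighbors, which is why `scanRadius` should sit between the within-hotspot and between-hotspot distances.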

### minimumNumberOfPointsInAHotspot
<a name="hotspots-minimumpoints"></a>

Specifies the minimum number of records required to form a hotspot. 

**Note**  
Set this parameter in conjunction with [windowSize](#hotspots-windowsize). It is best to think of `minimumNumberOfPointsInAHotspot` as some fraction of `windowSize`. The exact fraction is discoverable through experimentation. 

You can set this value between 2 and your configured value for window size inclusive. Choose a value that best models the problem you are solving in light of your chosen value for window size. 

## Output
<a name="hotspots-output"></a>

The output of the HOTSPOTS function is a table object that has the same schema as the input, with the following additional column:

### HOTSPOTS\_RESULT
<a name="hotspots-hotspots"></a>

A JSON string describing all the hotspots found around the record. The function returns all potential hotspots; you can filter out hotspots below a certain `density` threshold in your application. The field has the following nodes, with values for each of the input columns:
+ `density`: The number of records in the hotspot divided by the hotspot size. You can use this value to determine the relative relevance of the hotspot.
+ `maxValues`: The maximum values for the records in the hotspot for each data column.
+ `minValues`: The minimum values for the records in the hotspot for each data column.

Data type: VARCHAR.
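
Because the function returns all potential hotspots, a typical consumer parses the JSON and keeps only sufficiently dense ones. A minimal Python sketch follows; the top-level `hotspots` array and the sample values are assumptions for illustration, so adjust the parsing to the payload your application actually receives:

```python
import json

# Hypothetical HOTSPOTS_RESULT payload with the nodes described above.
payload = json.dumps({
    "hotspots": [
        {"density": 35.2, "minValues": [1.2, 10.0], "maxValues": [1.9, 12.5]},
        {"density": 2.1,  "minValues": [7.0, 3.0],  "maxValues": [9.5, 4.0]},
    ]
})

def relevant_hotspots(result_json, min_density):
    """Keep only hotspots at or above a density threshold."""
    return [h for h in json.loads(result_json).get("hotspots", [])
            if h["density"] >= min_density]

print(relevant_hotspots(payload, 10.0))  # only the density-35.2 hotspot
```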

**Note**  
The trends that machine learning functions use to determine analysis scores are infrequently reset when the Kinesis Data Analytics service performs service maintenance. You might unexpectedly see analysis scores of 0 after service maintenance occurs. We recommend you set up filters or other mechanisms to treat these values appropriately as they occur.

## Example
<a name="hotspots-examples"></a>

The following example executes the `HOTSPOTS` function on the demo stream, which contains random data without meaningful hotspots. For an example that executes the `HOTSPOTS` function on a custom data stream with meaningful data hotspots, see [Example: Detect Hotspots](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-hotspots-detection.html).

### Example Dataset
<a name="hotspots-examples-dataset"></a>

The following example is based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Data Analytics Developer Guide*. To run the example, you need a Kinesis Data Analytics application that has the sample stock ticker input stream. To learn how to create a Kinesis Data Analytics application and configure the sample stock ticker input stream, see [Getting Started](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Data Analytics Developer Guide*. 

The sample stock dataset has the following schema:

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

### Example 1: Return Hotspots on the Sample Data Stream
<a name="hotspots_example_1"></a>

In this example, a destination stream is created for the output of the `HOTSPOTS` function. A pump is then created that runs the `HOTSPOTS` function on the specified values in the sample data stream.

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM"(
    CHANGE REAL, 
    PRICE REAL,
    HOTSPOTS_RESULT VARCHAR(10000));
    
CREATE OR REPLACE PUMP "STREAM_PUMP" AS 
    INSERT INTO "DESTINATION_SQL_STREAM"
        SELECT 
            "CHANGE", 
            "PRICE",
            "HOTSPOTS_RESULT" 
        FROM TABLE (
            HOTSPOTS(
                CURSOR(SELECT STREAM "CHANGE", "PRICE" FROM "SOURCE_SQL_STREAM_001"), 
                100, 
                0.013, 
                20)
            );
```

#### Results
<a name="hotspots_example_1_results"></a>

This example outputs a stream similar to the following.

![\[Results of SQL code showing rowtime, hotspot, and hotspot_results.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/hotspots-example-1.png)


# RANDOM\_CUT\_FOREST
<a name="sqlrf-random-cut-forest"></a>

Detects anomalies in your data stream.  A record is an anomaly if it is distant from other records.  To detect anomalies in individual record columns, see [RANDOM\_CUT\_FOREST\_WITH\_EXPLANATION](sqlrf-random-cut-forest-with-explanation.md).

 

**Note**  
The `RANDOM_CUT_FOREST` function's ability to detect anomalies is application-dependent. To cast your business problem so that it can be solved with this function requires domain expertise. For example, you might need to determine which combination of columns in your input stream to pass to the function, and whether to normalize the data. For more information, see [inputStream](#random-cut-forest-input-view).

A stream record can have non-numeric columns, but the function uses only numeric columns to assign an anomaly score. A record can have one or more numeric columns. The algorithm uses all of the numeric data in computing an anomaly score.  If a record has ***n*** numeric columns, the underlying algorithm assumes each record is a point in ***n***-dimensional space. A point in ***n***-dimensional space that is distant from other points receives a higher anomaly score. 

The algorithm starts developing the machine learning model using current records in the stream when you start the application. The algorithm does not use older records in the stream for machine learning, nor does it use statistics from previous executions of the application.

The algorithm accepts the `DOUBLE`, `INTEGER`, `FLOAT`, `TINYINT`, `SMALLINT`, `REAL`, and `BIGINT` data types.

 

**Note**  
DECIMAL is not a supported type. Use DOUBLE instead.

The following is an example of anomaly detection. The diagram shows three clusters and a few anomalies randomly interjected. The red squares show the records that received the highest anomaly score according to the `RANDOM_CUT_FOREST` function. The blue diamonds represent the remaining records.  Note how the highest scoring records tend to be outside the clusters. 

 

![\[Scatter plot showing three clusters of blue data points and several red outlier points.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/rcf-10.png)




For a sample application with step-by-step instructions, see [Detect Anomalies](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection.html).

## Syntax
<a name="random-cut-forest-syntax"></a>

 

```
RANDOM_CUT_FOREST (inputStream,    
                   numberOfTrees,
                   subSampleSize,
                   timeDecay,
                   shingleSize)
```



## Parameters
<a name="random-cut-forest-parameters"></a>

The following sections describe the parameters. 

### inputStream
<a name="random-cut-forest-input-view"></a>

Pointer to your input stream. You set a pointer using the `CURSOR` function. For example, the following statements set a pointer to `InputStream`.

 

```
CURSOR(SELECT STREAM * FROM InputStream) 
CURSOR(SELECT STREAM IntegerColumnX, IntegerColumnY FROM InputStream) 
-- Perhaps normalize the column X value.
CURSOR(SELECT STREAM IntegerColumnX / 100, IntegerColumnY FROM InputStream) 
-- Combine columns before passing to the function.
CURSOR(SELECT STREAM IntegerColumnX - IntegerColumnY FROM InputStream)
```

The `CURSOR` function is the only required parameter for the `RANDOM_CUT_FOREST` function. The function assumes the following default values for the other parameters:

`numberOfTrees` = 100

`subSampleSize` = 256

`timeDecay` = 100,000

`shingleSize` = 1

When using this function, your input stream can have up to 30 numeric columns. 

### numberOfTrees
<a name="random-cut-forest-no-of-trees"></a>

Using this parameter, you specify the number of random cut trees in the forest.  

**Note**  
The algorithm constructs the specified number of trees, each built using a given number of sample records (see `subSampleSize` later in this list) from the input stream. The algorithm uses each tree to assign an anomaly score. The average of all these scores is the final anomaly score.

The default value for `numberOfTrees` is 100. You can set this value between 1 and 1,000 (inclusive). By increasing the number of trees in the forest, you can get a better estimate of the anomaly score, but this also increases the running time.

### subSampleSize
<a name="random-cut-forest-subsample-size"></a>

Using this parameter, you can specify the size of the random sample that you want the algorithm to use when constructing each tree.  Each tree in the forest is constructed with a (different) random sample of records.  The algorithm uses each tree to assign an anomaly score. When the sample reaches `subSampleSize` records, records are removed randomly, with older records having a higher probability of removal than newer records. 

The default value for `subSampleSize` is 256.  You can set this value between 10 and 1,000 (inclusive).

Note that the `subSampleSize` must be less than the `timeDecay` parameter (which is set to 100,000 by default).  Increasing the sample size provides each tree a larger view of the data, but also increases the running time.

**Note**  
The algorithm returns zero for the first `subSampleSize` records while the machine learning model is trained.
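
The eviction behavior described above, in which a bounded sample is kept and older records are more likely to be removed, can be sketched as follows. This is a rough illustration under our own assumptions (linear age weights), not the service's exact sampling scheme:

```python
import random

def decayed_sample(stream, sub_sample_size, seed=0):
    """Keep at most sub_sample_size records, evicting randomly with
    probability weighted toward older records."""
    rng = random.Random(seed)
    sample = []  # (arrival_index, record) pairs, oldest first
    for i, record in enumerate(stream):
        sample.append((i, record))
        if len(sample) > sub_sample_size:
            # Eviction weight grows with age, so older entries are
            # removed more often; the newest entries rarely are.
            weights = [i - idx + 1 for idx, _ in sample]
            victim = rng.choices(range(len(sample)), weights=weights)[0]
            sample.pop(victim)
    return [record for _, record in sample]

kept = decayed_sample(range(1000), sub_sample_size=256)
print(len(kept))  # 256
```

After 1,000 records the sample is still capped at 256, and it skews toward recent arrivals, which is what lets each tree track the evolving stream.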

### timeDecay
<a name="random-cut-forest-time-decay"></a>

The `timeDecay` parameter allows you to specify how much of the recent past to consider when computing an anomaly score. This is because data streams naturally evolve over time. For example, an eCommerce website’s revenue may continuously increase, or global temperatures may rise over time.  In such situations, we want an anomaly to be flagged relative to recent data, as opposed to data from the distant past. 

The default value is 100,000 records (or 100,000 shingles if shingling is used, as described in the following section).  You can set this value between 1 and the maximum integer (that is, 2147483647). The algorithm exponentially decays the importance of older data.

If you choose the `timeDecay` default of 100,000, the anomaly detection algorithm does the following:
+ Uses only the most recent 100,000 records in the calculations (and ignores older records). 
+ Within the most recent 100,000 records, assigns exponentially more weight to recent records and less to older records in anomaly detection calculations. 
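
As a rough illustration of that exponential weighting, consider the sketch below. The functional form (a decay rate of `1/timeDecay` per record) is our assumption for illustration; the service's actual decay constant is not documented here:

```python
def decay_weight(age, time_decay):
    """Hypothetical exponentially decaying weight for a record that is
    `age` positions old, with decay rate 1/time_decay per record."""
    return (1.0 - 1.0 / time_decay) ** age

# The newest record counts fully; a record a full timeDecay horizon old
# contributes roughly 1/e (about 37%) as much under this form.
print(decay_weight(0, 100_000))
print(round(decay_weight(100_000, 100_000), 3))
```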

If you don't want to use the default value, you can calculate the number of records to use in the algorithm. To do this, multiply the number of expected records per day by the number of days you want the algorithm to consider. For example, if you expect 1,000 records per day, and you want to analyze 7 days of records, set this parameter to 7,000 (1,000 \* 7).

The `timeDecay` parameter determines the maximum quantity of recent records kept in the working set of the anomaly detection algorithm.  Smaller `timeDecay` values are desirable if the data is changing rapidly.  The best `timeDecay` value is application-dependent. 

### shingleSize
<a name="random-cut-forest-shingle-size"></a>

The explanation given here applies to a one-dimensional stream (that is, a stream with one numeric column), but shingling can also be used for multi-dimensional streams.

A shingle is a consecutive sequence of the most recent records.  For example, a `shingleSize` of 10 at time *t* corresponds to a vector of the last 10 records received up to and including time *t*.  The algorithm treats this sequence as a vector over the last `shingleSize` number of records.  

If data is arriving uniformly in time, a shingle of size 10 at time *t* corresponds to the data received at times *t-9, t-8, …, t*.  At time *t+1*, the shingle slides over one unit and consists of data from times *t-8, t-7, …, t, t+1*.  These shingled records gathered over time correspond to a collection of 10-dimensional vectors over which the anomaly detection algorithm runs.  



The intuition is that a shingle captures the shape of the recent past.   Your data may have a typical shape.  For example, if your data is collected hourly, a shingle of size 24 may capture the daily rhythm of your data.
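
The sliding-vector construction described above can be sketched in a few lines of Python (an illustration of the shingling idea, not the service's implementation):

```python
from collections import deque

def shingles(stream, shingle_size):
    """Yield one vector of the last `shingle_size` values per record,
    once the sliding window has filled."""
    window = deque(maxlen=shingle_size)
    for value in stream:
        window.append(value)       # oldest value falls off automatically
        if len(window) == shingle_size:
            yield tuple(window)

print(list(shingles([10, 11, 12, 13, 14], 3)))
# [(10, 11, 12), (11, 12, 13), (12, 13, 14)]
```

Each emitted tuple is one point in `shingle_size`-dimensional space, which is what the anomaly detector scores.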

The default `shingleSize` is one record (because shingle size is data dependent). You can set this value between 1 and 30 (inclusive).



Note the following about setting the `shingleSize`:
+ If you set the `shingleSize` too small, the algorithm is more susceptible to minor fluctuations in the data, leading to high anomaly scores for records that are not anomalous.
+ If you set the `shingleSize` too large, it may take more time to detect anomalous records because there are more records in the shingle that are not anomalous.  It may also take more time to determine that the anomaly has ended.
+ Identifying the right shingle size is application-dependent.  Experiment with different shingle sizes to ascertain the effect.

  

The following example illustrates how you can catch anomalies when you monitor the records with the highest anomaly score. In this particular example, the two highest anomaly scores also signal the beginning and end of an artificially injected anomaly.

Consider this stylized one-dimensional stream represented as a sine wave, intended to capture a circadian rhythm.  This curve illustrates the typical number of orders that an eCommerce site receives per hour, the number of users logged into a server, the number of ad clicks received per hour, etc.  A severe dip of 20 consecutive records is artificially injected in the middle of the plot.  

 

![\[Sine wave graph showing regular oscillations with an anomaly spike in the third cycle.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/rcf-20.png)




We ran the `RANDOM_CUT_FOREST` function with a shingle size of four records. The result is shown below. The red line shows the anomaly score.  Note that the beginning and the end of the anomaly received high scores.

 

![\[Graph showing sine wave with anomaly and anomaly score over time, with a spike in both lines.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/rcf-30.png)


When you use this function, we recommend that you investigate the highest scoring points as potential anomalies.  

**Note**  
The trends that machine learning functions use to determine analysis scores are infrequently reset when the Kinesis Data Analytics service performs service maintenance. You might unexpectedly see analysis scores of 0 after service maintenance occurs. We recommend you set up filters or other mechanisms to treat these values appropriately as they occur.



For more information, see the [Robust Random Cut Forest Based Anomaly Detection on Streams](http://proceedings.mlr.press/v48/guha16.pdf) paper on the Proceedings of Machine Learning Research website.

# RANDOM\_CUT\_FOREST\_WITH\_EXPLANATION
<a name="sqlrf-random-cut-forest-with-explanation"></a>

Computes an anomaly score and explains it for each record in your data stream. The anomaly score for a record indicates how different it is from the trends that have recently been observed for your stream. The function also returns an attribution score for each column in a record, based on how anomalous the data in that column is. For each record, the sum of the attribution scores of all columns is equal to the anomaly score. 

You also have the option of getting information about the direction in which a given column is anomalous (whether it's high or low relative to the recently observed data trends for that column in the stream).

For example, in e-commerce applications, you might want to know when there's a change in the recently observed pattern of transactions. You also might want to know how much of the change is due to a change in the number of purchases made per hour, and how much is due to a change in the number of carts abandoned per hour—information that is represented by the attribution scores. You also might want to look at directionality to know whether you're being notified of the change due to an increase or a decrease in each of those values.

 

**Note**  
The `RANDOM_CUT_FOREST_WITH_EXPLANATION` function's ability to detect anomalies is application-dependent. Casting your business problem so that it can be solved with this function requires domain expertise. For example, you may need to determine which combination of columns in your input stream to pass to the function, and you may potentially benefit from normalizing the data. For more information, see [inputStream](#random-cut-forest-with-explanation-input-view).

A stream record can have non-numeric columns, but the function uses only numeric columns to assign an anomaly score. A record can have one or more numeric columns. The algorithm uses all the numeric data in computing an anomaly score. 

The algorithm starts developing the machine learning model using current records in the stream when you start the application. The algorithm does not use older records in the stream for machine learning, nor does it use statistics from previous executions of the application.

The algorithm accepts the `DOUBLE`, `INTEGER`, `FLOAT`, `TINYINT`, `SMALLINT`, `REAL`, and `BIGINT` data types.

 

**Note**  
`DECIMAL` is not a supported type. Use `DOUBLE` instead.

The following is a simple visual example of anomaly detection with different attribution scores in two-dimensional space. The diagram shows a cluster of blue data points and four outliers shown as red points. The red points have similar anomaly scores, but these four points are anomalous for different reasons. For points A1 and A2, most of the anomaly is attributable to their outlying y-values. In the case of A3 and A4, you can attribute most of the anomaly to their outlying x-values. Directionality is LOW for the y-value of A1, HIGH for the y-value of A2, HIGH for the x-value of A3, and LOW for the x-value of A4.

 

![\[Anomaly detection chart showing 4 points that appear outside a cluster.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/different_attributions.png)




## Syntax
<a name="random-cut-forest-with-explanation-syntax"></a>

 

```
RANDOM_CUT_FOREST_WITH_EXPLANATION (inputStream,    
                   numberOfTrees,
                   subSampleSize,
                   timeDecay,
                   shingleSize,
                   withDirectionality
)
```



## Parameters
<a name="random-cut-forest-with-explanation-parameters"></a>

The following sections describe the parameters of the `RANDOM_CUT_FOREST_WITH_EXPLANATION` function. 

### inputStream
<a name="random-cut-forest-with-explanation-input-view"></a>

Pointer to your input stream. You set a pointer using the `CURSOR` function. For example, the following statements set a pointer to `InputStream`.

 

```
CURSOR(SELECT STREAM * FROM InputStream) 
CURSOR(SELECT STREAM IntegerColumnX, IntegerColumnY FROM InputStream) 
-- Perhaps normalize the column X value.
CURSOR(SELECT STREAM IntegerColumnX / 100, IntegerColumnY FROM InputStream) 
-- Combine columns before passing to the function.
CURSOR(SELECT STREAM IntegerColumnX - IntegerColumnY FROM InputStream)
```

The `CURSOR` function is the only required parameter for the `RANDOM_CUT_FOREST_WITH_EXPLANATION` function. The function assumes the following default values for the other parameters:

`numberOfTrees` = 100

`subSampleSize` = 256

`timeDecay` = 100,000

`shingleSize` = 1

`withDirectionality` = FALSE

When you use this function, your input stream can have up to 30 numeric columns. 

### numberOfTrees
<a name="random-cut-forest-with-explanation-no-of-trees"></a>

Using this parameter, you specify the number of random cut trees in the forest.  

**Note**  
The algorithm constructs the specified number of trees, each built using a given number of sample records (see `subSampleSize` later in this list) from the input stream. The algorithm uses each tree to assign an anomaly score. The average of all these scores is the final anomaly score.

The default value for `numberOfTrees` is 100. You can set this value between 1 and 1,000 (inclusive). By increasing the number of trees in the forest, you can get a better estimate of the anomaly and attribution scores, but this also increases the running time.

### subSampleSize
<a name="random-cut-forest-with-explanation-subsample-size"></a>

Using this parameter, you can specify the size of the random sample that you want the algorithm to use when constructing each tree. Each tree in the forest is constructed with a (different) random sample of records. The algorithm uses each tree to assign an anomaly score. When the sample reaches `subSampleSize` records, records are removed randomly, with older records having a higher probability of removal than newer records. 

The default value for `subSampleSize` is 256. You can set this value between 10 and 1,000 (inclusive).

The `subSampleSize` must be less than the `timeDecay` parameter (which is set to 100,000 by default). Increasing the sample size provides each tree a larger view of the data, but it also increases the running time.

**Note**  
The algorithm returns zero for the first `subSampleSize` records while the machine learning model is trained.

### timeDecay
<a name="random-cut-forest-with-explanation--time-decay"></a>

You can use the `timeDecay` parameter to specify how much of the recent past to consider when computing an anomaly score. Data streams naturally evolve over time. For example, an e-commerce website’s revenue might continuously increase, or global temperatures might rise over time. In such situations, you want an anomaly to be flagged relative to recent data, as opposed to data from the distant past. 

The default value is 100,000 records (or 100,000 shingles if shingling is used, as described in the following section). You can set this value between 1 and the maximum integer (that is, 2147483647). The algorithm exponentially decays the importance of older data.

If you choose the `timeDecay` default of 100,000, the anomaly detection algorithm does the following:
+ Uses only the most recent 100,000 records in the calculations (and ignores older records). 
+ Within the most recent 100,000 records, assigns exponentially more weight to recent records and less to older records in anomaly detection calculations. 

The `timeDecay` parameter determines the maximum quantity of recent records kept in the working set of the anomaly detection algorithm. Smaller `timeDecay` values are desirable if the data is changing rapidly. The best `timeDecay` value is application-dependent. 

### shingleSize
<a name="random-cut-forest-with-explanation-shingle-size"></a>

The explanation given here applies to a one-dimensional stream (that is, a stream with one numeric column), but shingling can also be used for multi-dimensional streams.

A shingle is a consecutive sequence of the most recent records. For example, a `shingleSize` of 10 at time *t* corresponds to a vector of the last 10 records received up to and including time *t*. The algorithm treats this sequence as a vector over the last `shingleSize` number of records.  

If data is arriving uniformly in time, a shingle of size 10 at time *t* corresponds to the data received at times *t-9, t-8, …, t*. At time *t+1*, the shingle slides over one unit and consists of data from times *t-8, t-7, …, t, t+1*. These shingled records gathered over time correspond to a collection of 10-dimensional vectors over which the anomaly detection algorithm runs.  



The intuition is that a shingle captures the shape of the recent past. Your data might have a typical shape. For example, if your data is collected hourly, a shingle of size 24 might capture the daily rhythm of your data.

The default `shingleSize` is one record (because shingle size is data-dependent). You can set this value between 1 and 30 (inclusive).



Note the following about setting the `shingleSize`:
+ If you set the `shingleSize` too small, the algorithm is more susceptible to minor fluctuations in the data, leading to high anomaly scores for records that are not anomalous.
+ If you set the `shingleSize` too large, it might take more time to detect anomalous records because there are more records in the shingle that are not anomalous. It also might take more time to determine that the anomaly has ended.
+ Identifying the right shingle size is application-dependent. Experiment with different shingle sizes to determine the effects.

  

### withDirectionality
<a name="random-cut-forest-with-explanation-with-directionality"></a>

A Boolean parameter that defaults to `false`. When set to `true`, it tells you the direction in which each individual dimension makes a contribution to the anomaly score. It also provides the strength of the recommendation for that directionality.

## Results
<a name="random-cut-forest-with-explanation-results"></a>

The function returns an anomaly score of 0 or more and an explanation in JSON format.

The anomaly score starts out at 0 for all the records in the stream while the algorithm goes through the learning phase. You then start to see positive values for the anomaly score. Not all positive anomaly scores are significant; only the highest ones are. To get a better understanding of the results, look at the explanation.

The explanation provides the following values for each column in the record:
+ **Attribution score:** A nonnegative number that indicates how much this column has contributed to the anomaly score of the record. In other words, it indicates how different the value of this column is from what’s expected based on the recently observed trend. The sum of the attribution scores of all columns for the record is equal to the anomaly score.
+ **Strength:** A nonnegative number representing the strength of the directional recommendation. A high value for strength indicates a high confidence in the directionality that is returned by the function. During the learning phase, the strength is 0.
+ **Directionality:** This is either HIGH if the value of the column is above the recently observed trend or LOW if it’s below the trend. During the learning phase, this defaults to LOW.
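
A consumer of these results typically parses the explanation JSON and checks which column dominates. The sketch below illustrates the sum property described above; the key names and sample values are our assumptions for illustration, so map them to the actual payload your application receives:

```python
import json

# Hypothetical anomaly score and explanation payload for a two-column
# record, mirroring the fields described above.
anomaly_score = 2.5
explanation = json.dumps({
    "PRICE":  {"attribution": 2.0, "strength": 0.9, "direction": "HIGH"},
    "CHANGE": {"attribution": 0.5, "strength": 0.2, "direction": "LOW"},
})

cols = json.loads(explanation)

# Attribution scores of all columns sum to the record's anomaly score.
total = sum(c["attribution"] for c in cols.values())
assert abs(total - anomaly_score) < 1e-9

# The column with the largest attribution explains most of the anomaly.
top = max(cols, key=lambda k: cols[k]["attribution"])
print(top, cols[top]["direction"])  # PRICE HIGH
```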

**Note**  
The trends that machine learning functions use to determine analysis scores are infrequently reset when the Kinesis Data Analytics service performs service maintenance. You might unexpectedly see analysis scores of 0 after service maintenance occurs. We recommend you set up filters or other mechanisms to treat these values appropriately as they occur.

## Examples
<a name="random-cut-forest-with-explanation-examples"></a>

### Stock Ticker Data Example
<a name="random-cut-forest-with-explanation-examples-stockdata"></a>

This example is based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. To run the example, you need a Kinesis Data Analytics application that has the sample stock ticker input stream. To learn how to create a Kinesis Data Analytics application and configure the sample stock ticker input stream, see [Getting Started](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. 

The sample stock dataset has the following schema:

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

In this example, the application calculates an anomaly score for the record and an attribution score for the `PRICE` and `CHANGE` columns, which are the only numeric columns in the input stream. 

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (anomaly REAL, ANOMALY_EXPLANATION VARCHAR(20480));
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT "ANOMALY_SCORE", "ANOMALY_EXPLANATION" FROM TABLE (RANDOM_CUT_FOREST_WITH_EXPLANATION(CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"), 100, 256, 100000, 1, true)) WHERE ANOMALY_SCORE > 0;
```

The preceding example outputs a stream similar to the following.

![\[Screenshot showing an output stream containing anomaly scores and explanation information.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/anomaly_results.png)


### Network and CPU Utilization Example
<a name="random-cut-forest-with-explanation-examples-networkandcpu"></a>

This theoretical example shows two sets of data that follow an oscillating pattern. In the following graph, they're represented by the red and blue curves at the top. The red curve shows network utilization over time, and the blue curve shows idle CPU over time for the same computer system. The two signals, which are out of phase with each other, are regular most of the time. But they both also show occasional anomalies, which appear as irregularities in the graph. The following is an explanation of what the curves in the graph represent from the top curve to the bottom curve.
+ The top curve, which is red, represents network utilization over time. It follows a cyclical pattern and is regular most of the time, except for two anomalous periods, each representing a drop in utilization. The first anomalous period occurs between time values 500 and 1,000. The second anomalous period occurs between time values 1,500 and 2,000.
+ The second curve from the top (blue in color) is idle CPU over time. It follows a cyclical pattern and is regular most of the time, with the exception of two anomalous periods. The first anomalous period occurs between time values 1,000 and 1,500 and shows a drop in idle CPU time. The second anomalous period occurs between time values 1,500 and 2,000 and shows an increase in idle CPU time.
+ The third curve from the top shows the anomaly score. At the beginning, there's a learning phase during which the anomaly score is 0. After the learning phase, there's steady noise in the curve, but the anomalies stand out. 

  The first anomaly, which is marked in red on the black anomaly score curve, is more attributable to the network utilization data. The second anomaly, marked in blue, is more attributable to the CPU data. The red and blue markings are provided in this graph as a visual aid. They aren't produced by the `RANDOM_CUT_FOREST_WITH_EXPLANATION` function. Here's how we obtained these red and blue markings:
  + After running the function, we selected the top 20 anomaly score values. 
  + From this set of top 20 anomaly score values, we selected those values for which the network utilization had an attribution greater than or equal to 1.5 times the attribution for CPU. We colored the points in this new set with red markers in the graph.
  + We colored with blue markers the points for which the CPU attribution score was greater than or equal to 1.5 times the attribution for network utilization.
+ The second curve from the bottom is a graphical representation of directionality for the network utilization signal. We obtained this curve by running the function, multiplying the strength by -1 for LOW directionality and by +1 for HIGH directionality, and plotting the results against time.

  When there's a drop in the cyclical pattern of network utilization, there’s a corresponding negative spike in directionality. When network utilization shows an increase back to the regular pattern, directionality shows a positive spike corresponding to that increase. Later on, there's another negative spike, followed closely by another positive spike. Together they represent the second anomaly seen in the network utilization curve.
+ The bottom curve is a graphical representation of directionality for the CPU signal. We obtained it by multiplying the strength by -1 for LOW directionality and by +1 for HIGH directionality, and then plotting the results against time.

  With the first anomaly in the idle CPU curve, this directionality curve shows a negative spike followed immediately by a smaller, positive spike. The second anomaly in the idle CPU curve produces a positive spike followed by a negative spike in directionality.
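The signed value plotted in the two directionality curves (strength multiplied by -1 for LOW and by +1 for HIGH) can be sketched as a small helper; the strength values here are hypothetical:

```python
def directionality_value(strength, direction):
    """Signed directionality plotted against time: the strength is
    multiplied by -1 for LOW and by +1 for HIGH."""
    return strength * (1 if direction == "HIGH" else -1)

# Hypothetical strength values around an anomaly.
print(directionality_value(0.8, "LOW"))   # -0.8
print(directionality_value(0.8, "HIGH"))  # 0.8
```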

![\[Stylized graph showing network utilization over time; idle CPU over time; anomaly score with explanation over time; directionality for network utilization; and directionality for idle CPU.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-rcfwa-example-3.png)


### Blood Pressure Example
<a name="random-cut-forest-with-explanation-examples-bloodpressure"></a>

For a more detailed example, with code that detects and explains anomalies in blood pressure readings, see [Example: Detecting Data Anomalies and Getting an Explanation](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection-with-explanation.html).

# STDDEV\_POP
<a name="sql-reference-STDDEVPOP"></a>

Returns the square root of the [VAR\_POP](sql-reference-VARPOP.md) population variance of `number-expression`, evaluated for each row remaining in the group.

When you use `STDDEV_POP`, be aware of the following:
+ When the input set has no non-null data, `STDDEV_POP` returns `NULL`.
+ If you don't use the `OVER` clause, `STDDEV_POP` is calculated as an aggregate function. In this case, the aggregate query must contain a [GROUP BY clause](sql-reference-group-by-clause.md) on a monotonic expression based on `ROWTIME` that groups the stream into finite rows. Otherwise, the group is the infinite stream, and the query will never complete and no rows will be emitted. For more information, see [Aggregate Functions](sql-reference-aggregate-functions.md). 
+ A windowed query that uses a `GROUP BY` clause processes rows in a tumbling window. For more information, see [Tumbling Windows (Aggregations Using GROUP BY)](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/tumbling-window-concepts.html).
+ If you use the `OVER` clause, `STDDEV_POP` is calculated as an analytic function. For more information, see [Analytic Functions](sql-reference-analytic-functions.md).
+ A windowed query that uses an `OVER` clause processes rows in a sliding window. For more information, see [Sliding Windows](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sliding-window-concepts.html).
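Because `STDDEV_POP` is defined as the square root of the population variance, the definition can be checked with Python's `statistics` module; the input values below are hypothetical:

```python
import math
import statistics

values = [25.0, 30.0, 28.0, 31.0]  # hypothetical PRICE values in one window

# STDDEV_POP is the square root of the population variance (VAR_POP).
var_pop = statistics.pvariance(values)
stddev_pop = math.sqrt(var_pop)

# statistics.pstdev computes the same quantity directly.
assert math.isclose(stddev_pop, statistics.pstdev(values))
print(round(stddev_pop, 4))  # 2.2913
```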

## Syntax
<a name="stddevpop-syntax"></a>

```
 STDDEV_POP ( [DISTINCT | ALL] number-expression )
```

## Parameters
<a name="stddevpop-parameters"></a>

### ALL
<a name="stddevpop-parameters-all"></a>

Includes duplicate values in the input set. `ALL` is the default.

### DISTINCT
<a name="stddevpop-parameters-distinct"></a>

Excludes duplicate values in the input set.

## Examples
<a name="stddevpop-examples"></a>

### Example Dataset
<a name="w2aac22c29c16c17b2"></a>

The examples following are based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. To run each example, you need an Amazon Kinesis Analytics application that has the sample stock ticker input stream. To learn how to create an Analytics application and configure the sample stock ticker input stream, see [Getting Started](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. 

The sample stock dataset has the schema following.

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

### Example 1: Determine the standard deviation of the population in a column in a tumbling window query
<a name="w2aac22c29c16c17b4"></a>

The following example demonstrates how to use the `STDDEV_POP` function to determine the standard deviation of the values in a tumbling window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

#### Using STEP (Recommended)
<a name="stddevpop-examples-step"></a>

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), stddev_pop_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, STDDEV_POP(price) AS stddev_pop_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, STEP(("SOURCE_SQL_STREAM_001".ROWTIME) BY INTERVAL '60' SECOND);
```

#### Using FLOOR
<a name="stddevpop-examples-floor"></a>

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), stddev_pop_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, STDDEV_POP(price) AS stddev_pop_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
```

#### Results
<a name="stddevpop-results"></a>

The preceding examples output a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and STDDEV_POP_PRICE columns with sample data entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-stddev-pop-1.png)


### Example 2: Determine the standard deviation of the population of the values in a column in a sliding window query
<a name="w2aac22c29c16c17b6"></a>

The following example demonstrates how to use the `STDDEV_POP` function to determine the standard deviation of the values in a sliding window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), stddev_pop_price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, STDDEV_POP(price) OVER TEN_SECOND_SLIDING_WINDOW AS stddev_pop_price
FROM "SOURCE_SQL_STREAM_001"
 
WINDOW TEN_SECOND_SLIDING_WINDOW AS (
  PARTITION BY ticker_symbol
  RANGE INTERVAL '10' SECOND PRECEDING);
```

The preceding example outputs a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and STDEV_POP_PRICE columns with sample data entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-stddev-pop-2.png)


## See Also
<a name="stddepop-seealso"></a>
+ Sample standard deviation: [STDDEV\_SAMP](sql-reference-STDDEVSAMP.md)
+ Sample variance: [VAR\_SAMP](sql-reference-VARSAMP.md)
+ Population variance: [VAR\_POP](sql-reference-VARPOP.md)

# STDDEV\_SAMP
<a name="sql-reference-STDDEVSAMP"></a>

Returns the statistical standard deviation of all values in `number-expression`, evaluated for each row remaining in the group and defined as the square root of the sample variance ([VAR\_SAMP](sql-reference-VARSAMP.md)).

When you use `STDDEV_SAMP`, be aware of the following:
+ When the input set has no non-null data, `STDDEV_SAMP` returns `NULL`.
+ If you don't use the `OVER` clause, `STDDEV_SAMP` is calculated as an aggregate function. In this case, the aggregate query must contain a [GROUP BY clause](sql-reference-group-by-clause.md) on a monotonic expression based on `ROWTIME` that groups the stream into finite rows. Otherwise, the group is the infinite stream, and the query will never complete and no rows will be emitted. For more information, see [Aggregate Functions](sql-reference-aggregate-functions.md). 
+ A windowed query that uses a `GROUP BY` clause processes rows in a tumbling window. For more information, see [Tumbling Windows (Aggregations Using GROUP BY)](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/tumbling-window-concepts.html).
+ If you use the `OVER` clause, `STDDEV_SAMP` is calculated as an analytic function. For more information, see [Analytic Functions](sql-reference-analytic-functions.md).
+ A windowed query that uses an `OVER` clause processes rows in a sliding window. For more information, see [Sliding Windows](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sliding-window-concepts.html).
+ `STD_DEV` is an alias of `STDDEV_SAMP`.
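The null-handling rules above (nulls ignored; `NULL` when fewer than two values remain) can be mimicked with a small Python wrapper around `statistics.stdev`; the wrapper name and inputs are illustrative:

```python
import statistics

def stddev_samp(values):
    """SQL-like sample standard deviation: nulls (None) are ignored,
    and fewer than two remaining values yields None (NULL)."""
    non_null = [v for v in values if v is not None]
    return statistics.stdev(non_null) if len(non_null) >= 2 else None

print(stddev_samp([10.0]))        # None (one row of input)
print(stddev_samp([10.0, None]))  # None (null is ignored, one value left)
print(stddev_samp([10.0, 14.0]))  # sqrt(8) ≈ 2.8284
```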

## Syntax
<a name="stddevsamp-syntax"></a>

```
 STDDEV_SAMP ( [DISTINCT | ALL] number-expression )
```

## Parameters
<a name="stddevsamp-parameters"></a>

### ALL
<a name="stddevsamp-parameters-all"></a>

Includes duplicate values in the input set. `ALL` is the default.

### DISTINCT
<a name="stddevsamp-parameters-distinct"></a>

Excludes duplicate values in the input set.

## Examples
<a name="stddevsamp-examples"></a>

### Example Dataset
<a name="w2aac22c29c18c23b2"></a>

The examples following are based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. To run each example, you need an Amazon Kinesis Analytics application that has the sample stock ticker input stream. To learn how to create an Analytics application and configure the sample stock ticker input stream, see [Getting Started](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. 

The sample stock dataset has the schema following.

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

### Example 1: Determine the statistical standard deviation of the values in a column in a tumbling window query
<a name="w2aac22c29c18c23b4"></a>

The following example demonstrates how to use the `STDDEV_SAMP` function to determine the standard deviation of the values in a tumbling window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), stddev_samp_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, STDDEV_SAMP(price) AS stddev_samp_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
```

### Results
<a name="stddev-example-results"></a>

The preceding examples output a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and STDDEV_SAMP_PRICE columns with sample data.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-stddev-samp-1.png)


### Example 2: Determine the statistical standard deviation of the values in a column in a sliding window query
<a name="w2aac22c29c18c23c10"></a>

The following example demonstrates how to use the `STDDEV_SAMP` function to determine the standard deviation of the values in a sliding window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), stddev_samp_price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, STDDEV_SAMP(price) OVER TEN_SECOND_SLIDING_WINDOW AS stddev_samp_price
FROM "SOURCE_SQL_STREAM_001"
 
WINDOW TEN_SECOND_SLIDING_WINDOW AS (
  PARTITION BY ticker_symbol
  RANGE INTERVAL '10' SECOND PRECEDING);
```

The preceding example outputs a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and STDDEV_SAMP_PRICE columns with sample data.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-stddev-samp-2.png)


## See Also
<a name="stddevsamp-seealso"></a>
+ Population standard deviation: [STDDEV\$1POP](sql-reference-STDDEVPOP.md)
+ Sample variance: [VAR\$1SAMP](sql-reference-VARSAMP.md)
+ Population variance: [VAR\$1POP](sql-reference-VARPOP.md)

# VAR\_POP
<a name="sql-reference-VARPOP"></a>

Returns the population variance of a set of numbers, ignoring nulls.

`VAR_POP` uses the following calculation:
+ `(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr)`

In other words, for a given set of non-null values, using S1 as the sum of the values and S2 as the sum of the squares of the values, `VAR_POP` returns the result `(S2 - S1*S1/N)/N`.
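This calculation can be checked numerically against Python's `statistics.pvariance`; the input values are hypothetical:

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 6.0]  # hypothetical non-null inputs

# (S2 - S1*S1/N) / N, with S1 the sum and S2 the sum of squares.
s1 = sum(values)
s2 = sum(v * v for v in values)
n = len(values)
var_pop = (s2 - s1 * s1 / n) / n

# Agrees with the direct population-variance computation.
assert math.isclose(var_pop, statistics.pvariance(values))
print(var_pop)  # 2.0
```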

When you use `VAR_POP`, be aware of the following:
+ When the input set has no non-null data, or when applied to an empty set, `VAR_POP` returns `NULL`.
+ If you don't use the `OVER` clause, `VAR_POP` is calculated as an aggregate function. In this case, the aggregate query must contain a [GROUP BY clause](sql-reference-group-by-clause.md) on a monotonic expression based on `ROWTIME` that groups the stream into finite rows. Otherwise, the group is the infinite stream, and the query will never complete and no rows will be emitted. For more information, see [Aggregate Functions](sql-reference-aggregate-functions.md). 
+ A windowed query that uses a `GROUP BY` clause processes rows in a tumbling window. For more information, see [Tumbling Windows (Aggregations Using GROUP BY)](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/tumbling-window-concepts.html).
+ If you use the `OVER` clause, `VAR_POP` is calculated as an analytic function. For more information, see [Analytic Functions](sql-reference-analytic-functions.md).
+ A windowed query that uses an `OVER` clause processes rows in a sliding window. For more information, see [Sliding Windows](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sliding-window-concepts.html).

## Syntax
<a name="varpop-syntax"></a>

```
 VAR_POP ( [DISTINCT | ALL] number-expression )
```

## Parameters
<a name="varpop-parameters"></a>

### ALL
<a name="varpop-parameters-all"></a>

Includes duplicate values in the input set. `ALL` is the default.

### DISTINCT
<a name="varpop-parameters-distinct"></a>

Excludes duplicate values in the input set.

## Examples
<a name="varpop-examples"></a>

### Example Dataset
<a name="w2aac22c29c20c29b2"></a>

The examples following are based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. To run each example, you need an Amazon Kinesis Analytics application that has the sample stock ticker input stream. To learn how to create an Analytics application and configure the sample stock ticker input stream, see [Getting Started](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. 

The sample stock dataset has the schema following.

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

### Example 1: Determine the population variance in a column in a tumbling window query
<a name="varpop_example_1"></a>

The following example demonstrates how to use the `VAR_POP` function to determine the population variance of the values in a tumbling window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

#### Using STEP (Recommended)
<a name="varpop_example_step"></a>

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), var_pop_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, VAR_POP(price) AS var_pop_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, STEP(("SOURCE_SQL_STREAM_001".ROWTIME) BY INTERVAL '60' SECOND);
```

#### Using FLOOR
<a name="varpop_example_floor"></a>

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), var_pop_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, VAR_POP(price) AS var_pop_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
```

#### Results
<a name="varpop_example_results"></a>

The preceding examples output a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and VAR_POP_PRICE columns with sample data entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-varpop-1.png)


### Example 2: Determine the population variance of the values in a column in a sliding window query
<a name="w2aac22c29c20c29b6"></a>

The following example demonstrates how to use the `VAR_POP` function to determine the population variance of the values in a sliding window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), var_pop_price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, VAR_POP(price) OVER TEN_SECOND_SLIDING_WINDOW AS var_pop_price
FROM "SOURCE_SQL_STREAM_001"
 
WINDOW TEN_SECOND_SLIDING_WINDOW AS (
  PARTITION BY ticker_symbol
  RANGE INTERVAL '10' SECOND PRECEDING);
```

The preceding example outputs a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and VAR_POP_PRICE columns with sample data entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-varpop-2.png)


## See Also
<a name="varpop-seealso"></a>
+ Population standard deviation: [STDDEV\_POP](sql-reference-STDDEVPOP.md)
+ Sample standard deviation: [STDDEV\_SAMP](sql-reference-STDDEVSAMP.md)
+ Sample variance: [VAR\_SAMP](sql-reference-VARSAMP.md)

# VAR\_SAMP
<a name="sql-reference-VARSAMP"></a>

Returns the sample variance of a non-null set of numbers (nulls being ignored).

`VAR_SAMP` uses the following calculation:
+ `(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr)-1)`

In other words, for a given set of non-null values, using S1 as the sum of the values and S2 as the sum of the squares of the values, `VAR_SAMP` returns the result `(S2 - S1*S1/N)/(N-1)`.
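This calculation can be checked numerically against Python's `statistics.variance`; the input values are hypothetical:

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 6.0]  # hypothetical non-null inputs

# (S2 - S1*S1/N) / (N - 1), with S1 the sum and S2 the sum of squares.
s1 = sum(values)
s2 = sum(v * v for v in values)
n = len(values)
var_samp = (s2 - s1 * s1 / n) / (n - 1)

# Agrees with the direct sample-variance computation.
assert math.isclose(var_samp, statistics.variance(values))
print(round(var_samp, 4))  # 8/3 rounded: 2.6667
```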

When you use `VAR_SAMP`, be aware of the following:
+ When the input set has no non-null data, `VAR_SAMP` returns `NULL`. Applied to an input set of one element, `VAR_SAMP` also returns `NULL`.
+ If you don't use the `OVER` clause, `VAR_SAMP` is calculated as an aggregate function. In this case, the aggregate query must contain a [GROUP BY clause](sql-reference-group-by-clause.md) on a monotonic expression based on `ROWTIME` that groups the stream into finite rows. Otherwise, the group is the infinite stream, and the query will never complete and no rows will be emitted. For more information, see [Aggregate Functions](sql-reference-aggregate-functions.md). 
+ A windowed query that uses a `GROUP BY` clause processes rows in a tumbling window. For more information, see [Tumbling Windows (Aggregations Using GROUP BY)](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/tumbling-window-concepts.html).
+ If you use the `OVER` clause, `VAR_SAMP` is calculated as an analytic function. For more information, see [Analytic Functions](sql-reference-analytic-functions.md).
+ A windowed query that uses an `OVER` clause processes rows in a sliding window. For more information, see [Sliding Windows](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sliding-window-concepts.html).

## Syntax
<a name="varsamp-syntax"></a>

```
 VAR_SAMP ( [DISTINCT | ALL] number-expression )
```

## Parameters
<a name="varsamp-parameters"></a>

### ALL
<a name="varsamp-parameters-all"></a>

Includes duplicate values in the input set. `ALL` is the default.

### DISTINCT
<a name="varsamp-parameters-distinct"></a>

Excludes duplicate values in the input set.

## Examples
<a name="varsamp-examples"></a>

### Example Dataset
<a name="w2aac22c29c22c29b2"></a>

The examples following are based on the sample stock dataset that is part of the [Getting Started Exercise](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. To run each example, you need an Amazon Kinesis Analytics application that has the sample stock ticker input stream. To learn how to create an Analytics application and configure the sample stock ticker input stream, see [Getting Started](https://docs.aws.amazon.com/kinesisanalytics/latest/dev/get-started-exercise.html) in the *Amazon Kinesis Analytics Developer Guide*. 

The sample stock dataset has the schema following.

```
(ticker_symbol  VARCHAR(4),
sector          VARCHAR(16),
change          REAL,
price           REAL)
```

### Example 1: Determine the sample variance in a column in a tumbling window query
<a name="varsamp-example-1"></a>

The following example demonstrates how to use the `VAR_SAMP` function to determine the sample variance of the values in a tumbling window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

#### Using STEP (Recommended)
<a name="varsamp-example-step"></a>

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), var_samp_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, VAR_SAMP(price) AS var_samp_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, STEP(("SOURCE_SQL_STREAM_001".ROWTIME) BY INTERVAL '60' SECOND);
```

#### Using FLOOR
<a name="varsamp-example-floor"></a>

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), var_samp_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, VAR_SAMP(price) AS var_samp_price
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker_symbol, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
```

#### Results
<a name="varsamp-example-results"></a>

The preceding examples output a stream similar to the following:

![\[Table showing ROWTIME, TICKER_SYMBOL, and VAR_SAMP_PRICE columns with sample data entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-varsamp-1.png)


### Example 2: Determine the sample variance of the values in a column in a sliding window query
<a name="w2aac22c29c22c29b6"></a>

The following example demonstrates how to use the `VAR_SAMP` function to determine the sample variance of the values in a sliding window of the PRICE column of the example dataset. `DISTINCT` is not specified, so duplicate values are included in the calculation.

```
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), var_samp_price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"

SELECT STREAM ticker_symbol, VAR_SAMP(price) OVER TEN_SECOND_SLIDING_WINDOW AS var_samp_price
FROM "SOURCE_SQL_STREAM_001"
 
WINDOW TEN_SECOND_SLIDING_WINDOW AS (
  PARTITION BY ticker_symbol
  RANGE INTERVAL '10' SECOND PRECEDING);
```

The preceding example outputs a stream similar to the following:

![\[Table showing data for ROWTIME, TICKER_SYMBOL, and VAR_SAMP_PRICE columns with sample entries.\]](http://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/images/sql-reference-varsamp-2.png)


## See Also
<a name="varsamp-seealso"></a>
+ Population standard deviation: [STDDEV\_POP](sql-reference-STDDEVPOP.md)
+ Sample standard deviation: [STDDEV\_SAMP](sql-reference-STDDEVSAMP.md)
+ Population variance: [VAR\_POP](sql-reference-VARPOP.md)