

# Hyperloglog functions
<a name="hyperloglog-functions"></a>

The HyperLogLog (HLL) functions in SQL provide a way to efficiently estimate the number of unique elements (cardinality) in a large dataset, even when the actual set of unique elements isn't stored. 

The main benefits of using HLL functions are:
+ **Memory efficiency**: HLL sketches require much less memory than storing the full set of unique elements, making them suitable for large datasets.
+ **Distributed computing**: HLL sketches can be combined across multiple data sources or processing nodes, allowing for efficient distributed unique count estimation.
+ **Approximate results**: HLL provides an approximate unique count estimation, with a tunable trade-off between accuracy and memory usage (via the precision parameter).

These functions are particularly useful in scenarios where you need to estimate the number of unique items, such as in analytics, data warehousing, and real-time stream processing applications.

AWS Clean Rooms supports the following HLL functions.

**Topics**
+ [HLL\$1SKETCH\$1AGG function](HLL_SKETCH_AGG.md)
+ [HLL\$1SKETCH\$1ESTIMATE function](HLL_SKETCH_ESTIMATE.md)
+ [HLL\$1UNION function](HLL_UNION.md)
+ [HLL\$1UNION\$1AGG function](HLL_UNION_AGG.md)

# HLL\$1SKETCH\$1AGG function
<a name="HLL_SKETCH_AGG"></a>

The HLL\$1SKETCH\$1AGG aggregate function creates an HLL sketch from the values in the specified column. It returns an HLLSKETCH data type that encapsulates the input expression values. 

The HLL\$1SKETCH\$1AGG aggregate function works with any data type and ignores NULL values. 

When there are no rows in a table or all rows are NULL, the resulting sketch has no index-value pairs such as `{"version":1,"logm":15,"sparse":{"indices":[],"values":[]}}`.

## Syntax
<a name="HLL_SKETCH_AGG-synopsis"></a>

```
HLL_SKETCH_AGG (aggregate_expression[, lgConfigK ] )
```

## Argument
<a name="HLL_SKETCH_AGG-argument"></a>

 *aggregate\$1expression*   
Any expression of type INT, BIGINT, STRING, or BINARY against which unique counting will occur. Any `NULL` values are ignored.

*lgConfigK*  
An optional INT constant between 4 and 21 inclusive with default 12. The log-base-2 of K, where K is the number of buckets or slots for the sketch.

## Return type
<a name="HLL_SKETCH_AGG-return-type"></a>

The HLL\$1SKETCH\$1AGG function returns a non-NULL BINARY buffer containing the HyperLogLog sketch computed because of consuming and aggregating all input values in the aggregation group.

## Examples
<a name="HLL_SKETCH_AGG-examples"></a>

The following examples use the HyperLogLog (HLL) algorithm to estimate the distinct count of values in the `col` column. The `hll_sketch_agg(col, 12)` function aggregates the values in the col column, creating an HLL sketch using a precision of 12. The `hll_sketch_estimate()` function is then used to estimate the distinct count of values based on the generated HLL sketch. The final result of the query is 3, which represents the estimated distinct count of values in the `col` column. In this case, the distinct values are 1, 2, and 3. 

```
SELECT hll_sketch_estimate(hll_sketch_agg(col, 12))
    FROM VALUES (1), (1), (2), (2), (3) tab(col);
  3
```

The following example also uses the HLL algorithm to estimate the distinct count of values in the `col` column, but it doesn't specify a precision value for the HLL sketch. In this case, it uses the default precision of 14. The `hll_sketch_agg(col)` function takes the values in the `col` column and creates an HyperLogLog (HLL) sketch, which is a compact data structure that can be used to estimate the distinct count of elements. The `hll_sketch_estimate(hll_sketch_agg(col))` function takes the HLL sketch created in the previous step and calculates an estimate of the distinct count of values in the `col` column. The final result of the query is 3, which represents the estimated distinct count of values in the `col` column. In this case, the distinct values are 1, 2, and 3. 

```
SELECT hll_sketch_estimate(hll_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) tab(col);
3
```

# HLL\$1SKETCH\$1ESTIMATE function
<a name="HLL_SKETCH_ESTIMATE"></a>

The HLL\$1SKETCH\$1ESTIMATE function takes an HLL sketch and estimates the number of unique elements represented by the sketch. It uses the HyperLogLog (HLL) algorithm to count a probabilistic approximation of the number of unique values in a given column, consuming a binary representation known as a sketch buffer previously generated by the HLL\$1SKETCH\$1AGG function and returning the result as a big integer. 

The HLL sketching algorithm provides an efficient way to estimate the number of unique elements, even for large datasets, without having to store the full set of unique values.

The `hll_union` and `hll_union_agg` functions can also combine sketches together by consuming and merging these buffers as inputs.

## Syntax
<a name="HLL_SKETCH_ESTIMATE-synopsis"></a>

```
HLL_SKETCH_ESTIMATE (hllsketch_expression)
```

## Argument
<a name="HLL_SKETCH_ESTIMATE-argument"></a>

 *hllsketch\$1expression*   
A `BINARY` expression holding a sketch generated by HLL\$1SKETCH\$1AGG

## Return type
<a name="HLL_SKETCH_ESTIMATE-return-type"></a>

The HLL\$1SKETCH\$1ESTIMATE function returns a BIGINT value that is the approximate distinct count represented by the input sketch.

## Examples
<a name="HLL_SKETCH_ESTIMATE-examples"></a>

The following examples use the HyperLogLog (HLL) sketching algorithm to estimate the cardinality (unique count) of values in the `col` column. The `hll_sketch_agg(col, 12)` function takes the `col` column and creates an HLL sketch using a precision of 12 bits. The HLL sketch is an approximate data structure that can efficiently estimate the number of unique elements in a set. The `hll_sketch_estimate()` function takes the HLL sketch created by `hll_sketch_agg` and estimates the cardinality (unique count) of the values represented by the sketch. The `FROM VALUES (1), (1), (2), (2), (3) tab(col);` generates a test dataset with 5 rows, where the `col` column contains the values 1, 1, 2, 2, and 3. The result of this query is the estimated unique count of the values in the `col` column, which is 3.

```
SELECT hll_sketch_estimate(hll_sketch_agg(col, 12))
    FROM VALUES (1), (1), (2), (2), (3) tab(col);
  3
```

The difference between the following example and the previous one is that the precision parameter (12 bits) isn't specified in the `hll_sketch_agg` function call. In this case, the default precision of 14 bits is used, which may provide a more accurate estimate for the unique count compared to the previous example that used 12 bits of precision.

```
SELECT hll_sketch_estimate(hll_sketch_agg(col))
FROM VALUES (1), (1), (2), (2), (3) tab(col);
3
```

# HLL\$1UNION function
<a name="HLL_UNION"></a>

The HLL\$1UNION function combines two HLL sketches into a single, unified sketch. It uses the HyperLogLog (HLL) algorithm to combine two sketches into a single sketch. Queries can use the resulting buffers to compute approximate unique counts as long integers with the `hll_sketch_estimate` function.

## Syntax
<a name="HLL_UNION-syntax"></a>

```
HLL_UNION (( expr1, expr2 [, allowDifferentLgConfigK ] ))
```

## Argument
<a name="HLL_UNION-argument"></a>

 *exprN*   
A `BINARY` expression holding a sketch generated by HLL\$1SKETCH\$1AGG.

*allowDifferentLgConfigK*  
A optional BOOLEAN expression controlling whether to allow merging two sketches with different lgConfigK values. The default value is `false`.

## Return type
<a name="HLL_UNION-return-type"></a>

The HLL\$1UNION function returns a BINARY buffer containing the HyperLogLog sketch computed as a result of combining the input expressions. When the `allowDifferentLgConfigK` parameter is `true`, the result sketch uses the smaller of the two provided `lgConfigK` values.

## Examples
<a name="HLL_UNION-examples"></a>

The following examples use the HyperLogLog (HLL) sketching algorithm to estimate the unique count of values across two columns, `col1` and `col2`, in a dataset.

 The `hll_sketch_agg(col1)` function creates an HLL sketch for the unique values in the `col1` column. 

The `hll_sketch_agg(col2)` function creates an HLL sketch for the unique values in the col2 column. 

The `hll_union(...)` function combines the two HLL sketches created in steps 1 and 2 into a single, unified HLL sketch.

The `hll_sketch_estimate(...)` function takes the combined HLL sketch and estimates the unique count of values across both `col1` and `col2`.

The `FROM VALUES` clause generates a test dataset with 5 rows, where `col1` contains the values 1, 1, 2, 2, and 3, and `col2` contains the values 4, 4, 5, 5, and 6. 

The result of this query is the estimated unique count of values across both `col1` and `col2`, which is 6. The HLL sketching algorithm provides an efficient way to estimate the number of unique elements, even for large datasets, without having to store the full set of unique values. In this example, the `hll_union` function is used to combine the HLL sketches from the two columns, which allows the unique count to be estimated across the entire dataset, rather than just for each column individually.

```
SELECT hll_sketch_estimate(
  hll_union(
    hll_sketch_agg(col1),
    hll_sketch_agg(col2)))
  FROM VALUES
    (1, 4),
    (1, 4),
    (2, 5),
    (2, 5),
    (3, 6) AS tab(col1, col2);
  6
```

The difference between the following example and the previous one is that the precision parameter (12 bits) isn't specified in the `hll_sketch_agg` function call. In this case, the default precision of 14 bits is used, which may provide a more accurate estimate for the unique count compared to the previous example that used 12 bits of precision.

```
SELECT hll_sketch_estimate(
  hll_union(
    hll_sketch_agg(col1, 14),
    hll_sketch_agg(col2, 14)))
  FROM VALUES
    (1, 4),
    (1, 4),
    (2, 5),
    (2, 5),
    (3, 6) AS tab(col1, col2);
```

# HLL\$1UNION\$1AGG function
<a name="HLL_UNION_AGG"></a>

The HLL\$1UNION\$1AGG function combines multiple HLL sketches into a single, unified sketch. It uses the HyperLogLog (HLL) algorithm to combine a group of sketches into a single one. Queries can use the resulting buffers to compute approximate unique counts with the `hll_sketch_estimate` function.

## Syntax
<a name="HLL_UNION_AGG-syntax"></a>

```
HLL_UNION_AGG ( expr [, allowDifferentLgConfigK ] )
```

## Argument
<a name="HLL_UNION_AGG-argument"></a>

 *expr*   
A `BINARY` expression holding a sketch generated by HLL\$1SKETCH\$1AGG.

*allowDifferentLgConfigK*  
A optional BOOLEAN expression controlling whether to allow merging two sketches with different lgConfigK values. The default value is `false`.

## Return type
<a name="HLL_UNION_AGG-return-type"></a>

The HLL\$1UNION\$1AGG function returns a BINARY buffer containing the HyperLogLog sketch computed as a result of combining the input expressions of the same group. When the `allowDifferentLgConfigK` parameter is `true`, the result sketch uses the smaller of the two provided `lgConfigK` values.

## Examples
<a name="HLL_UNION_AGG-examples"></a>

The following examples use the HyperLogLog (HLL) sketching algorithm to estimate the unique count of values across multiple HLL sketches.

The first example estimates the unique count of values in a dataset.

```
SELECT hll_sketch_estimate(hll_union_agg(sketch, true))
    FROM (SELECT hll_sketch_agg(col) as sketch
            FROM VALUES (1) AS tab(col)
          UNION ALL
          SELECT hll_sketch_agg(col, 20) as sketch
            FROM VALUES (1) AS tab(col));
  1
```

The inner query creates two HLL sketches:
+ The first SELECT statement creates a sketch from a single value of 1. 
+ The second SELECT statement creates a sketch from another single value of 1, but with a precision of 20. 

The outer query uses the HLL\$1UNION\$1AGG function to combine the two sketches into a single sketch. Then it applies the HLL\$1SKETCH\$1ESTIMATE function to this combined sketch to estimate the unique count of values.

The result of this query is the estimated unique count of the values in the `col` column, which is `1`. This means that the two input values of 1 are considered to be unique, even though they have the same value.

The second example includes a different precision parameter for the HLL\$1UNION\$1AGG function. In this case, both HLL sketches are created with a precision of 14 bits, which allows them to be successfully combined using `hll_union_agg` with the `true` parameter.

```
SELECT hll_sketch_estimate(hll_union_agg(sketch, true))
    FROM (SELECT hll_sketch_agg(col, 14) as sketch
            FROM VALUES (1) AS tab(col)
          UNION ALL
          SELECT hll_sketch_agg(col, 14) as sketch
            FROM VALUES (1) AS tab(col));
  1
```

The final result of the query is the estimated unique count, which in this case is also `1`. This means that the two input values of 1 are considered to be unique, even though they have the same value.