

Amazon Redshift will no longer support the creation of new Python UDFs starting Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/).

# HyperLogLog sketches
<a name="hyperloglog-overview"></a>

This topic describes how to use HyperLogLog sketches in Amazon Redshift. HyperLogLog is an algorithm for the count-distinct problem: it approximates the number of distinct elements in a data set. A HyperLogLog sketch is a compact data structure that summarizes the distinct values observed in a data set.

*HyperLogLog* is an algorithm used for estimating the cardinality of a multiset. *Cardinality* refers to the number of distinct values in a multiset. For example, in the multiset {4,3,6,2,2,6,4,3,6,2,2,3}, the cardinality is 4, with distinct values 4, 3, 6, and 2.

The precision of the HyperLogLog algorithm (also known as the logm value) affects the accuracy of the estimated cardinality. During cardinality estimation, Amazon Redshift uses a default precision value of 15. This value can be up to 26 for smaller data sets. Thus, the average relative error ranges from 0.01% to 0.6%.
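These error figures follow from the standard HyperLogLog error bound of approximately 1.04/√m, where m = 2^logm is the number of registers. The following Python snippet (an illustration of that published formula, not Redshift internals) shows the expected relative error at the default precision of 15 and at 26:

```python
import math

def hll_relative_error(logm: int) -> float:
    """Approximate standard error of a HyperLogLog estimate
    with m = 2**logm registers (Flajolet et al., 2007)."""
    return 1.04 / math.sqrt(2 ** logm)

# Redshift's default precision (15) and the upper value for small data sets (26)
for logm in (15, 26):
    print(f"logm={logm}: ~{hll_relative_error(logm):.4%} expected relative error")
```

At logm=15 the expected error is about 0.57%, and at logm=26 about 0.013%, consistent with the 0.01–0.6% range stated above.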

When calculating the cardinality of a multiset, the HyperLogLog algorithm generates a construct called an HLL sketch. An *HLL sketch* encapsulates information about the distinct values in a multiset. The Amazon Redshift data type HLLSKETCH represents such sketch values. You can use this data type to store sketches in an Amazon Redshift table. Additionally, Amazon Redshift supports aggregate and scalar functions that operate on HLLSKETCH values. You can use these functions to extract the cardinality of an HLLSKETCH and to combine multiple HLLSKETCH values.

The HLLSKETCH data type offers significant query performance benefits when extracting the cardinality from large datasets. You can preaggregate these datasets using HLLSKETCH values and store them in tables. Amazon Redshift can extract the cardinality directly from the stored HLLSKETCH values without accessing the underlying datasets.

When processing HLL sketches, Amazon Redshift performs optimizations that minimize the memory footprint of the sketch and maximize the precision of the extracted cardinality. Amazon Redshift uses two representations for HLL sketches, sparse and dense. An HLLSKETCH starts in sparse format. As new values are inserted into it, its size increases. After its size reaches the size of the dense representation, Amazon Redshift automatically converts the sketch from sparse to dense.
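As an illustration of the two representations (a simplified conceptual sketch, not Redshift's actual implementation), a sparse sketch can be kept as a map from register index to register value, while the dense form is a full array of 2^logm registers:

```python
import hashlib

LOGM = 4            # tiny precision for illustration; Redshift uses 15
M = 2 ** LOGM       # number of registers in the dense representation

def rho(w: int, bits: int) -> int:
    """Position of the leftmost 1-bit in the lower `bits` bits of w."""
    for i in range(bits):
        if w & (1 << (bits - 1 - i)):
            return i + 1
    return bits + 1

def add(sparse: dict, value: str) -> None:
    """Insert a value into a sparse sketch (index -> register value)."""
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    idx = h >> (64 - LOGM)            # top LOGM bits select the register
    val = rho(h, 64 - LOGM)           # rank of the remaining hash bits
    if val > sparse.get(idx, 0):
        sparse[idx] = val

sketch = {}
for v in ["a", "b", "c", "a"]:
    add(sketch, v)

# The sparse map stores only the registers that were touched; converting
# to dense materializes all M registers, zero-filling the untouched ones.
dense = [sketch.get(i, 0) for i in range(M)]
```

A real implementation switches to the dense array once the sparse map would occupy more memory than the registers themselves, which mirrors the automatic conversion described above.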

Amazon Redshift imports, exports, and prints an HLLSKETCH as JSON when the sketch is in sparse format, and as a Base64 string when the sketch is in dense format. For more information about UNLOAD, see [Unloading the HLLSKETCH data type](r_UNLOAD.md#unload-usage-hll). To import text or comma-separated value (CSV) data into Amazon Redshift, use the COPY command. For more information, see [Loading the HLLSKETCH data type](copy-usage_notes-hll.md).

For information about functions used with HyperLogLog, see [HyperLogLog functions](hyperloglog-functions.md).

**Topics**
+ [Considerations](hyperloglog-functions-usage-notes.md)
+ [Limitations](hyperloglog-functions-limitations.md)
+ [Examples](r_HLL-examples.md)

# Considerations
<a name="hyperloglog-functions-usage-notes"></a>

This topic describes usage details for HyperLogLog in Amazon Redshift.

The following are considerations for using HyperLogLog in Amazon Redshift:
+ The following non-HyperLogLog functions can accept an input of type HLLSKETCH or columns of type HLLSKETCH:
  + The aggregate function COUNT
  + The conditional expressions COALESCE and NVL
  + CASE expressions
+ The supported encoding is RAW.
+ You can run an UNLOAD operation on a table with HLLSKETCH columns to write the sketch data to text or CSV files. Amazon Redshift writes the data in a JSON format for a sparse representation or a Base64 format for a dense representation. For more information about UNLOAD, see [Unloading the HLLSKETCH data type](r_UNLOAD.md#unload-usage-hll).

  The following shows the format used for a sparse HyperLogLog sketch represented in a JSON format.

  ```
  {"version":1,"logm":15,"sparse":{"indices":[15099259,33107846,37891580,50065963],"values":[2,3,2,1]}}
  ```
+ You can import text or CSV data into Amazon Redshift using the COPY command. For more information, see [Loading the HLLSKETCH data type](copy-usage_notes-hll.md).
+ The default encoding for HLLSKETCH is RAW. For more information, see [Compression encodings](c_Compression_encodings.md).
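Because the sparse representation is plain JSON, unloaded sketch values can be inspected with standard tooling. The following Python snippet (an illustrative example using the sample sketch shown above) pairs each register index with its value:

```python
import json

# A sparse HLLSKETCH in the JSON format shown above
raw = ('{"version":1,"logm":15,"sparse":'
       '{"indices":[15099259,33107846,37891580,50065963],"values":[2,3,2,1]}}')

sketch = json.loads(raw)
registers = dict(zip(sketch["sparse"]["indices"], sketch["sparse"]["values"]))
print(sketch["logm"], registers)
```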

# Limitations
<a name="hyperloglog-functions-limitations"></a>

This topic describes limitations for HyperLogLog in Amazon Redshift.

The following are limitations for using HyperLogLog in Amazon Redshift:
+ Amazon Redshift doesn't support an HLLSKETCH column as a sort key or a distribution key for a table.
+ Amazon Redshift doesn't support HLLSKETCH columns in ORDER BY, GROUP BY, or DISTINCT clauses.
+ You can only UNLOAD HLLSKETCH columns to text or CSV format. Amazon Redshift then writes the HLLSKETCH data in either a JSON format or a Base64 format. For more information about UNLOAD, see [UNLOAD](r_UNLOAD.md).
+ Amazon Redshift only supports HyperLogLog sketches with a precision (logm value) of 15.
+ JDBC and ODBC drivers don't support the HLLSKETCH data type. Therefore, the result set uses VARCHAR to represent the HLLSKETCH values.
+ Amazon Redshift Spectrum doesn't natively support the HLLSKETCH data type. Therefore, you can't create or alter an external table with an HLLSKETCH column.
+ Data types for Python user-defined functions (UDFs) don't support the HLLSKETCH data type. For more information about Python UDFs, see [Scalar Python UDFs](udf-creating-a-scalar-udf.md).

**Note**  
Starting November 1, 2025, Amazon Redshift will no longer support the creation of new Python UDFs. Existing Python UDFs will continue to function until June 30, 2026. Starting July 1, 2026, Amazon Redshift will no longer support Python UDFs. We recommend that you migrate your existing Python UDFs to Lambda UDFs before November 1, 2025. For information on creating and using Lambda UDFs, see [Scalar Lambda UDFs](udf-creating-a-lambda-sql-udf.md). For information on converting existing Python UDFs to Lambda UDFs, see the [ blog post ](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/).

# Examples
<a name="r_HLL-examples"></a>

This section contains examples for using HyperLogLog with Amazon Redshift.

**Topics**
+ [Example: Return cardinality in a subquery](#hll-examples-subquery)
+ [Example: Return an HLLSKETCH type from combined sketches in a subquery](#hll-examples-combined-subquery)
+ [Example: Return a HyperLogLog sketch from combining multiple sketches](#hll-examples-multiple-sketches)
+ [Example: Generate HyperLogLog sketches over S3 data using external tables](#hll-examples-cache-sketches)

## Example: Return cardinality in a subquery
<a name="hll-examples-subquery"></a>

The following example returns the cardinality for each sketch in a subquery for a table named *Sales*.

```
CREATE TABLE Sales (customer VARCHAR, country VARCHAR, amount DECIMAL(8,2));

INSERT INTO Sales VALUES
  ('David Joe', 'Greece', 14.5),
  ('David Joe', 'Greece', 19.95),
  ('John Doe', 'USA', 29.95),
  ('John Doe', 'USA', 19.95),
  ('George Spanos', 'Greece', 9.95),
  ('George Spanos', 'Greece', 2.95);
```

The following query generates an HLL sketch for the customers of each country and extracts the cardinality from each sketch. The result shows the number of unique customers for each country.

```
SELECT hll_cardinality(sketch), country
FROM (SELECT hll_create_sketch(customer) AS sketch, country
        FROM Sales
        GROUP BY country) AS hll_subquery;
        
hll_cardinality | country
----------------+---------
              1 | USA
              2 | Greece
(2 rows)
```

## Example: Return an HLLSKETCH type from combined sketches in a subquery
<a name="hll-examples-combined-subquery"></a>

The following example returns a single HLLSKETCH value that represents the combination of the individual sketches from a subquery. The sketches are combined by using the HLL_COMBINE aggregate function.

```
SELECT hll_combine(sketch)
FROM (SELECT hll_create_sketch(customer) AS sketch
        FROM Sales
        GROUP BY country) AS hll_subquery;
       
                                        hll_combine
--------------------------------------------------------------------------------------------
 {"version":1,"logm":15,"sparse":{"indices":[29808639,35021072,47612452],"values":[1,1,1]}}
(1 row)
```

## Example: Return a HyperLogLog sketch from combining multiple sketches
<a name="hll-examples-multiple-sketches"></a>

For the following example, suppose that the table `page_users` stores preaggregated sketches for each page that users visited on a given website. Each row in this table contains a HyperLogLog sketch that represents the IDs of all users who visited that page.

```
page_users
-- +----------------+-----------+--------------+
-- | _PARTITIONTIME | page      | sketch       |
-- +----------------+-----------+--------------+
-- | 2019-07-28     | homepage  | CHAQkAQYA... |
-- | 2019-07-28     | Product A | CHAQxPnYB... |
-- +----------------+-----------+--------------+
```

The following example combines the preaggregated sketches into a single sketch that encapsulates the collective cardinality of the individual sketches.

```
SELECT hll_combine(sketch) AS sketch
FROM page_users;
```

The output looks similar to the following.

```
-- +-----------------------------------------+
-- | sketch                                  |
-- +-----------------------------------------+
-- | CHAQ3sGoCxgCIAuCB4iAIBgTIBgqgIAgAwY.... |
-- +-----------------------------------------+
```
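Conceptually, combining sketches takes the per-register maximum, so merging is lossless with respect to distinct counting and never double-counts a value. The following Python snippet (an illustrative sketch over sparse index-to-value maps, not Redshift internals) shows the merge:

```python
def combine(a: dict, b: dict) -> dict:
    """Merge two sparse sketches (index -> register value) by taking
    the per-register maximum."""
    merged = dict(a)
    for idx, val in b.items():
        if val > merged.get(idx, 0):
            merged[idx] = val
    return merged

s1 = {10: 2, 42: 1}
s2 = {42: 3, 99: 1}
print(combine(s1, s2))   # per-register max of the two sketches
```

Because the merge is a maximum, combining a sketch with itself returns the same sketch, which is why re-aggregating overlapping data doesn't inflate the estimate.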

When a new sketch is created, you can use the HLL_CARDINALITY function to estimate the collective number of distinct values, as shown following.

```
SELECT hll_cardinality(sketch)
FROM (
  SELECT hll_combine(sketch) AS sketch
  FROM page_users
) AS hll_subquery;
```

The output looks similar to the following.

```
-- +-----------------+
-- | hll_cardinality |
-- +-----------------+
-- |           54356 |
-- +-----------------+
```

## Example: Generate HyperLogLog sketches over S3 data using external tables
<a name="hll-examples-cache-sketches"></a>

The following examples cache HyperLogLog sketches to avoid directly accessing Amazon S3 for cardinality estimation. 

You can preaggregate and cache HyperLogLog sketches in external tables defined to hold Amazon S3 data. By doing this, you can extract cardinality estimates without accessing the underlying base data. 

For example, suppose that you have unloaded a set of tab-delimited text files into Amazon S3. You run the following query to define an external table named `sales` in the Amazon Redshift external schema named `spectrum`. The Amazon S3 bucket for this example is in the US East (N. Virginia) AWS Region.

```
create external table spectrum.sales(
salesid integer,
listid integer,
sellerid smallint,
buyerid smallint,
eventid integer,
dateid integer,
qtysold integer,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
row format delimited
fields terminated by '\t' stored as textfile
location 's3://redshift-downloads/tickit/spectrum/sales/';
```

Suppose that you want to compute the number of distinct buyers who purchased an item on arbitrary dates. To do so, the following example generates sketches for the buyer IDs for each day of the year and stores the result in the Amazon Redshift table `hll_sales`.

```
CREATE TABLE hll_sales AS
SELECT saletime, hll_create_sketch(buyerid) AS sketch
FROM spectrum.sales
GROUP BY saletime;
            
SELECT TOP 5 * FROM hll_sales;
```

The output looks similar to the following.

```
-- hll_sales

-- +-----------------+----------------------------------------------------------------------+
-- | saletime        | sketch                                                               |
-- +-----------------+----------------------------------------------------------------------+
-- | 7/22/2008 8:30  | {"version":1,"logm":15,"sparse":{"indices":[9281416],"values":[1]}}  |
-- | 2/19/2008 0:38  | {"version":1,"logm":15,"sparse":{"indices":[48735497],"values":[3]}} |
-- | 11/5/2008 4:49  | {"version":1,"logm":15,"sparse":{"indices":[27858661],"values":[1]}} |
-- | 10/27/2008 4:08 | {"version":1,"logm":15,"sparse":{"indices":[65295430],"values":[2]}} |
-- | 2/16/2008 9:37  | {"version":1,"logm":15,"sparse":{"indices":[56869618],"values":[2]}} |
-- +-----------------+----------------------------------------------------------------------+
```

The following query shows the estimated number of distinct buyers that purchased an item during the Friday after Thanksgiving in 2008.

```
SELECT hll_cardinality(hll_combine(sketch)) as distinct_buyers
FROM hll_sales
WHERE trunc(saletime) = '2008-11-28';
```

The output looks similar to the following.

```
distinct_buyers
---------------
386
```

Suppose that you want the number of distinct users who bought an item on a certain range of dates. An example might be from the Friday after Thanksgiving to the following Monday. To get this, the following query uses the `hll_combine` aggregate function. This function enables you to avoid double-counting buyers who purchased an item on more than one day of the selected range. 

```
SELECT hll_cardinality(hll_combine(sketch)) as distinct_buyers
FROM hll_sales
WHERE saletime BETWEEN '2008-11-28' AND '2008-12-01';
```

The output looks similar to the following.

```
distinct_buyers
---------------
1166
```

To keep the `hll_sales` table up to date, run the following query at the end of each day. Doing this generates a HyperLogLog sketch based on the IDs of buyers that purchased an item that day and adds it to the `hll_sales` table.

```
INSERT INTO hll_sales 
SELECT saletime, hll_create_sketch(buyerid) 
FROM spectrum.sales 
WHERE TRUNC(saletime) = TRUNC(GETDATE())
GROUP BY saletime;
```