

 Amazon Redshift will no longer support the creation of new Python UDFs starting Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/). 

# Column compression to reduce the size of stored data
<a name="t_Compressing_data_on_disk"></a>

*Compression* is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

ENCODE AUTO is the default for tables. When a table is set to ENCODE AUTO, Amazon Redshift automatically manages compression encoding for all columns in the table. For more information, see [CREATE TABLE](r_CREATE_TABLE_NEW.md) and [ALTER TABLE](r_ALTER_TABLE.md).

However, if you specify compression encoding for any column in the table, the table is no longer set to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all columns in the table. 

You can apply a compression type, or *encoding*, to the columns in a table manually when you create the table. Or you can use the COPY command to analyze and apply compression automatically. For more information, see [Let COPY choose compression encodings](c_best-practices-use-auto-compression.md). For details about applying automatic compression, see [Loading tables with automatic compression](c_Loading_tables_auto_compress.md).
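For example, a COPY load that lets Amazon Redshift analyze the data and apply compression automatically might look like the following sketch. The bucket path and IAM role are placeholders; COMPUPDATE ON requests compression analysis on an empty table even when encodings are already defined.

```
copy product
from 's3://amzn-s3-demo-bucket/product/data/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
compupdate on;
```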

**Note**  
We strongly recommend using the COPY command to apply automatic compression.

You might choose to apply compression encodings manually if the new table shares the same data characteristics as another table. Or you might do so if you discover in testing that the compression encodings applied during automatic compression are not the best fit for your data. If you choose to apply compression encodings manually, you can run the [ANALYZE COMPRESSION](r_ANALYZE_COMPRESSION.md) command against an already populated table and use the results to choose compression encodings.

To apply compression manually, you specify compression encodings for individual columns as part of the CREATE TABLE statement. The syntax is as follows.

```
CREATE TABLE table_name (column_name 
data_type ENCODE encoding-type)[, ...]
```

Here, *encoding-type* is taken from the keyword table in the following section.

For example, the following statement creates a two-column table, PRODUCT. When data is loaded into the table, the PRODUCT\_ID column is not compressed, but the PRODUCT\_NAME column is compressed, using the byte dictionary encoding (BYTEDICT).

```
create table product(
product_id int encode raw,
product_name char(20) encode bytedict);
```

You can specify the encoding for a column when it is added to a table using the ALTER TABLE command.

```
ALTER TABLE table_name ADD [ COLUMN ] column_name column_type ENCODE encoding-type
```
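For example, the following statement adds a compressed column to the PRODUCT table created earlier. The new column name is illustrative.

```
alter table product
add column product_desc varchar(100) encode lzo;
```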

**Topics**
+ [Compression encodings](c_Compression_encodings.md)
+ [Testing compression encodings](t_Verifying_data_compression.md)

# Compression encodings
<a name="c_Compression_encodings"></a>

<a name="compression-encoding-list"></a>A *compression encoding* specifies the type of compression that is applied to a column of data values as rows are added to a table.

ENCODE AUTO is the default for tables. When a table is set to ENCODE AUTO, Amazon Redshift automatically manages compression encoding for all columns in the table. For more information, see [CREATE TABLE](r_CREATE_TABLE_NEW.md) and [ALTER TABLE](r_ALTER_TABLE.md).

However, if you specify compression encoding for any column in the table, the table is no longer set to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all columns in the table.

When you use CREATE TABLE, ENCODE AUTO is disabled when you specify compression encoding for any column in the table. If ENCODE AUTO is disabled, Amazon Redshift automatically assigns compression encoding to columns for which you don't specify an ENCODE type as follows:
+ Columns that are defined as sort keys are assigned RAW compression.
+ Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
+ Columns that are defined as SMALLINT, INTEGER, BIGINT, DECIMAL, DATE, TIMESTAMP, or TIMESTAMPTZ data types are assigned AZ64 compression.
+ Columns that are defined as CHAR or VARCHAR data types are assigned LZO compression.

You can change a table's encoding after creating it by using ALTER TABLE. If you disable ENCODE AUTO using ALTER TABLE, Amazon Redshift no longer automatically manages compression encodings for your columns. All columns will keep the compression encoding types that they had when you disabled ENCODE AUTO until you change them or you enable ENCODE AUTO again.
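As a sketch, the following statements first change the encoding of a single column and then re-enable ENCODE AUTO for the whole table, using the PRODUCT table from the earlier example:

```
-- change one column's encoding; this does not re-enable ENCODE AUTO
alter table product alter column product_name encode zstd;

-- hand compression management for all columns back to Amazon Redshift
alter table product alter encode auto;
```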

Amazon Redshift supports the following compression encodings:

------
#### [ Raw ]

Raw encoding is the default encoding for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types. With raw encoding, data is stored in raw, uncompressed form.

------
#### [ AZ64 ]

AZ64 is a proprietary compression encoding algorithm designed by Amazon to achieve a high compression ratio and improved query processing. At its core, the AZ64 algorithm compresses smaller groups of data values and uses single instruction, multiple data (SIMD) instructions for parallel processing. Use AZ64 to achieve significant storage savings and high performance for numeric, date, and time data types. 

You can use AZ64 as the compression encoding when defining columns using CREATE TABLE and ALTER TABLE statements with the following data types:
+ SMALLINT
+ INTEGER
+ BIGINT
+ DECIMAL
+ DATE
+ TIMESTAMP
+ TIMESTAMPTZ
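For example, the following statement sketches a table that applies AZ64 to its numeric, date, and timestamp columns. The table and column names are illustrative.

```
create table sales_az64(
saleid bigint encode az64,
quantity smallint encode az64,
price decimal(8,2) encode az64,
saledate date encode az64,
saletime timestamp encode az64);
```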

------
#### [ Byte-dictionary ]

In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-byte values that are stored as indexes to the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block.

This encoding is very effective on low cardinality string columns. This encoding is optimal when the data domain of a column is fewer than 256 unique values.

For columns with the string data type (CHAR and VARCHAR) encoded with BYTEDICT, Amazon Redshift performs vectorized scans and predicate evaluations that operate over compressed data directly. These scans use hardware-specific single instruction, multiple data (SIMD) instructions for parallel processing. This significantly speeds up the scanning of string columns. Byte-dictionary encoding is especially space-efficient if a CHAR/VARCHAR column holds long character strings.

Suppose that a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon Redshift creates the dictionary and populates the COUNTRY column with the index value. The dictionary contains the indexed unique values, and the table itself contains only the one-byte subscripts of the corresponding values.

**Note**  
Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column, every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.

The following table represents the dictionary for the COUNTRY column.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

The following table represents the values in the COUNTRY column.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

The total compressed size in this example is calculated as follows: 6 different entries are stored in the dictionary (6 \* 30 = 180 bytes), and the table contains 10 1-byte compressed values, for a total of 190 bytes.

------
#### [ Delta ]

Delta encodings are very useful for date and time columns.

Delta encoding compresses data by recording the difference between values that follow each other in the column. This difference is recorded in a separate dictionary for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) For example, suppose that the column contains 10 integers in sequence from 1 to 10. The first value is stored as a 4-byte integer (plus a 1-byte flag). The next nine are each stored as a byte with the value 1, indicating that it is one greater than the previous value.

Delta encoding comes in two variations: 
+ DELTA records the differences as 1-byte values (8-bit integers)
+ DELTA32K records differences as 2-byte values (16-bit integers)

If most of the values in the column could be compressed by using a single byte, the 1-byte variation is very effective. However, if the deltas are larger, this encoding, in the worst case, is somewhat less effective than storing the uncompressed data. Similar logic applies to the 16-bit version.

If the difference between two values exceeds the 1-byte range (DELTA) or 2-byte range (DELTA32K), the full original value is stored, with a leading 1-byte flag. The 1-byte range is from -127 to 127, and the 2-byte range is from -32K to 32K.
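For example, an illustrative table that applies delta encodings to a consecutive ID column and a date column might look like the following:

```
create table event_log(
eventid integer encode delta,      -- consecutive IDs: differences fit in 1 byte
eventdate date encode delta32k);   -- nearby dates: differences fit in 2 bytes
```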

The following table shows how a delta encoding works for a numeric column.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

------
#### [ LZO ]

LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings. They are especially good for free-form text, such as product descriptions, user comments, or JSON strings. 

------
#### [ Mostly ]

Mostly encodings are useful when the data type for a column is larger than most of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to 8-bit storage.

In general, the mostly encodings work with the following data types:
+ SMALLINT/INT2 (16-bit)
+ INTEGER/INT (32-bit)
+ BIGINT/INT8 (64-bit)
+ DECIMAL/NUMERIC (64-bit)

Choose the appropriate variation of the mostly encoding to suit the size of the data type for the column. For example, apply MOSTLY8 to a column that is defined as a 16-bit integer column. Applying MOSTLY16 to a column with a 16-bit data type or MOSTLY32 to a column with a 32-bit data type is disallowed.

Mostly encodings might be less effective than no compression when a relatively high number of the values in the column can't be compressed. Before applying one of these encodings to a column, perform a check. *Most* of the values that you are going to load now (and are likely to load in the future) should fit into the ranges shown in the following table.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

**Note**  
For decimal values, ignore the decimal point to determine whether the value fits into the range. For example, 1,234.56 is treated as 123,456 and can be compressed in a MOSTLY32 column.

For example, the VENUEID column in the VENUE table is defined as a raw integer column, which means that its values consume 4 bytes of storage. However, the current range of values in the column is **0** to **309**. Therefore, recreating and reloading this table with MOSTLY16 encoding for VENUEID would reduce the storage of every value in that column to 2 bytes.

If the VENUEID values referenced in another table were mostly in the range of 0 to 127, it might make sense to encode that foreign-key column as MOSTLY8. Before making the choice, run several queries against the referencing table data to find out whether the values mostly fall into the 8-bit, 16-bit, or 32-bit range.
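For example, you might run a range check like the following before deciding (a sketch against the sample VENUE table; the new table name is illustrative):

```
-- find the current range of VENUEID values
select min(venueid), max(venueid) from venue;

-- if the range fits in 2 bytes, MOSTLY16 is a candidate
create table venue_mostly(
venueid integer encode mostly16,
venuename varchar(100));
```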

The following table shows compressed sizes for specific numeric values when the MOSTLY8, MOSTLY16, and MOSTLY32 encodings are used:

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

------
#### [ Run length ]

Run length encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run). A separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) This encoding is best suited to a table in which data values are often repeated consecutively, for example, when the table is sorted by those values.

For example, suppose that a column in a large dimension table has a predictably small domain, such as a COLOR column with fewer than 10 possible values. These values are likely to fall in long sequences throughout the table, even if the data is not sorted.

We don't recommend applying run length encoding on any column that is designated as a sort key. Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly.
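For example, the COLOR column described above might be defined with run length encoding as follows (illustrative names):

```
create table product_dim(
productid integer,
color varchar(10) encode runlength);
```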

The following table uses the COLOR column example to show how the run length encoding works.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

------
#### [ Text255 and Text32k ]

Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often. A separate dictionary of unique words is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the first 245 unique words in the column. Those words are replaced on disk by a one-byte index value representing one of the 245 values, and any words that are not represented in the dictionary are stored uncompressed. The process repeats for each 1-MB disk block. If the indexed words occur frequently in the column, the column yields a high compression ratio.

For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a specific number of words. Instead, the dictionary indexes each unique word it finds until the combined entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.

For example, consider the VENUENAME column in the VENUE table. Words such as **Arena**, **Center**, and **Theatre** recur in this column and are likely to be among the first 245 words encountered in each block if text255 compression is applied. If so, this column benefits from compression. This is because every time those words appear, they occupy only 1 byte of storage (instead of 5, 6, or 7 bytes, respectively).
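For example, you might recreate the VENUENAME column with text255 encoding, as in the following sketch:

```
create table venue_text(
venueid smallint,
venuename varchar(100) encode text255);
```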

------
#### [ ZSTD ]

Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse datasets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings, such as product descriptions, user comments, logs, and JSON strings. Where some algorithms, such as Delta encoding or Mostly encoding, can potentially use more storage space than no compression, ZSTD is unlikely to increase disk usage. 

ZSTD supports SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, TIMESTAMP, and TIMESTAMPTZ data types.
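For example, the following statement sketches a table that applies ZSTD to a free-form text column (the names are illustrative):

```
create table user_comments(
commentid bigint encode az64,
commenttext varchar(5000) encode zstd);
```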

------

The following table identifies the supported compression encodings and the data types that support the encoding.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

# Testing compression encodings
<a name="t_Verifying_data_compression"></a>

If you decide to manually specify column encodings, you might want to test different encodings with your data.

**Note**  
We recommend that you use the COPY command to load data whenever possible, and allow the COPY command to choose the optimal encodings based on your data. Or you can use the [ANALYZE COMPRESSION](r_ANALYZE_COMPRESSION.md) command to view the suggested encodings for existing data. For details about applying automatic compression, see [Loading tables with automatic compression](c_Loading_tables_auto_compress.md).

To perform a meaningful test of data compression, you must have a large number of rows. For this example, we create a table and insert rows by using a statement that selects from two tables: VENUE and LISTING. We leave out the WHERE clause that would normally join the two tables. The result is that *each* row in the VENUE table is joined to *all* of the rows in the LISTING table, for a total of over 32 million rows. This is known as a Cartesian join and normally is not recommended. However, for this purpose, it's a convenient method of creating many rows. If you have an existing table with data that you want to test, you can skip this step.

After we have a table with sample data, we create a table with seven columns. Each has a different compression encoding: raw, bytedict, lzo, run length, text255, text32k, and zstd. We populate each column with exactly the same data by running an INSERT command that selects the data from the first table.

To test compression encodings, do the following:

1.  (Optional) First, use a Cartesian join to create a table with a large number of rows. Skip this step if you want to test an existing table. 

   ```
   create table cartesian_venue(
   venueid smallint not null distkey sortkey,
   venuename varchar(100),
   venuecity varchar(30),
   venuestate char(2),
   venueseats integer);
   
   insert into cartesian_venue
   select venueid, venuename, venuecity, venuestate, venueseats
   from venue, listing;
   ```

1.  Next, create a table with the encodings that you want to compare.  

   ```
   create table encodingvenue (
   venueraw varchar(100) encode raw,
   venuebytedict varchar(100) encode bytedict,
   venuelzo varchar(100) encode lzo,
   venuerunlength varchar(100) encode runlength,
   venuetext255 varchar(100) encode text255,
   venuetext32k varchar(100) encode text32k,
   venuezstd varchar(100) encode zstd);
   ```

1.  Insert the same data into all of the columns using an INSERT statement with a SELECT clause. 

   ```
   insert into encodingvenue
   select venuename as venueraw, venuename as venuebytedict, venuename as venuelzo, venuename as venuerunlength, venuename as venuetext255, venuename as venuetext32k, venuename as venuezstd
   from cartesian_venue;
   ```

1.  Verify the number of rows in the new table. 

   ```
   select count(*) from encodingvenue;
   
     count
   ----------
    38884394
   (1 row)
   ```

1.  Query the [STV\_BLOCKLIST](r_STV_BLOCKLIST.md) system table to compare the number of 1 MB disk blocks used by each column.

   The MAX aggregate function returns the highest block number for each column. The STV\_BLOCKLIST table includes details for three system-generated columns. This example uses `col < 7` in the WHERE clause to exclude the system-generated columns. 

   ```
   select col, max(blocknum)
   from stv_blocklist b, stv_tbl_perm p
   where (b.tbl=p.id) and name ='encodingvenue'
   and col < 7
   group by name, col
   order by col;
   ```

   The query returns the following results. The columns are numbered beginning with zero. Depending on how your cluster is configured, your result might have different numbers, but the relative sizes should be similar. You can see that BYTEDICT encoding on the second column produced the best results for this dataset. This approach has a compression ratio of better than 20:1. LZO and ZSTD encoding also produced excellent results. Different datasets produce different results, of course. When a column contains longer text strings, LZO often produces the best compression results.

   ```
    col | max
   -----+-----
      0 | 203
      1 |  10
      2 |  22
      3 | 204
      4 |  56
      5 |  72
      6 |  20
   (7 rows)
   ```

If you have data in an existing table, you can use the [ANALYZE COMPRESSION](r_ANALYZE_COMPRESSION.md) command to view the suggested encodings for the table. For example, the following example shows the recommended encoding for a copy of the VENUE table, CARTESIAN\_VENUE, that contains 38 million rows. Notice that ANALYZE COMPRESSION recommends LZO encoding for the VENUENAME column. ANALYZE COMPRESSION chooses optimal compression based on multiple factors, which include percent of reduction. In this specific case, BYTEDICT provides better compression, but LZO also produces greater than 90 percent compression. 

```
analyze compression cartesian_venue;

Table           | Column     | Encoding | Est_reduction_pct
----------------+------------+----------+------------------
cartesian_venue | venueid    | lzo      | 97.54
cartesian_venue | venuename  | lzo      | 91.71
cartesian_venue | venuecity  | lzo      | 96.01
cartesian_venue | venuestate | lzo      | 97.68
cartesian_venue | venueseats | lzo      | 98.21
```

## Example
<a name="Examples__compression_encodings_in_CREATE_TABLE_statements"></a>

The following example creates a CUSTOMER table that has columns with various data types. This CREATE TABLE statement shows one of many possible combinations of compression encodings for these columns. 

```
create table customer(
custkey int encode delta,
custname varchar(30) encode raw,
gender varchar(7) encode text255,
address varchar(200) encode text255,
city varchar(30) encode text255,
state char(2) encode raw,
zipcode char(5) encode bytedict,
start_date date encode delta32k);
```

The following table shows the column encodings that were chosen for the CUSTOMER table and gives an explanation for the choices:


| Column | Data type | Encoding | Explanation | 
| --- | --- | --- | --- | 
| CUSTKEY | int | delta | CUSTKEY consists of unique, consecutive integer values. Because the differences between consecutive values fit in one byte, DELTA is a good choice. | 
| CUSTNAME | varchar(30) | raw | CUSTNAME has a large domain with few repeated values. Any compression encoding would probably be ineffective. | 
| GENDER | varchar(7) | text255 | GENDER is a very small domain with many repeated values. Text255 works well with VARCHAR columns in which the same words recur. | 
| ADDRESS | varchar(200) | text255 | ADDRESS is a large domain, but contains many repeated words, such as Street, Avenue, North, South, and so on. Text255 and text32k are useful for compressing VARCHAR columns in which the same words recur. The column length is short, so text255 is a good choice. | 
| CITY | varchar(30) | text255 | CITY is a large domain, with some repeated values. Certain city names are used much more commonly than others. Text255 is a good choice for the same reasons as ADDRESS. | 
| STATE | char(2) | raw | In the United States, STATE is a precise domain of 50 two-character values. Bytedict encoding would yield some compression, but because the column size is only two characters, compression might not be worth the overhead of uncompressing the data. | 
| ZIPCODE | char(5) | bytedict | ZIPCODE is a known domain of fewer than 50,000 unique values. Certain zip codes occur much more commonly than others. Bytedict encoding is very effective when a column contains a limited number of unique values.  | 
| START\_DATE | date | delta32k | Delta encodings are very useful for date and time columns, especially if the rows are loaded in date order. | 