


# Data distribution for query optimization
<a name="t_Distributing_data"></a>

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table's distribution style. When you run a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in choosing a table distribution style is to minimize the impact of the redistribution step by locating the data where it must be before the query is run.

**Note**  
This section introduces the principles of data distribution in an Amazon Redshift database. We recommend that you create your tables with `DISTSTYLE AUTO`. If you do so, Amazon Redshift uses automatic table optimization to choose the data distribution style. For more information, see [Automatic table optimization](t_Creating_tables.md). The rest of this section provides details about distribution styles. 

**Topics**
+ [Data distribution concepts](#t_data_distribution_concepts)
+ [Distribution styles](c_choosing_dist_sort.md)
+ [Viewing distribution styles](viewing-distribution-styles.md)
+ [Evaluating query patterns](t_evaluating_query_patterns.md)
+ [Designating distribution styles](t_designating_distribution_styles.md)
+ [Evaluating the query plan](c_data_redistribution.md)
+ [Query plan example](t_explain_plan_example.md)
+ [Distribution examples](c_Distribution_examples.md)

## Data distribution concepts
<a name="t_data_distribution_concepts"></a>

Some data distribution concepts for Amazon Redshift follow.

 **Nodes and slices** 

 An Amazon Redshift cluster is a set of nodes. Each node in the cluster has its own operating system, dedicated memory, and dedicated disk storage. One node is the *leader node*, which manages the distribution of data and query processing tasks to the compute nodes. The *compute nodes* provide resources to do those tasks. 

 The disk storage for a compute node is divided into a number of *slices*. The number of slices per node depends on the node size of the cluster. The nodes all participate in running parallel queries, working on data that is distributed as evenly as possible across the slices. For more information about the number of slices that each node size has, see [About clusters and nodes](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes) in the *Amazon Redshift Management Guide*.

 **Data redistribution** 

 When you load data into a table, Amazon Redshift distributes the rows of the table to each of the node slices according to the table's distribution style. As part of a query plan, the optimizer determines where blocks of data must be located to best run the query. The data is then physically moved, or redistributed, while the query runs. Redistribution might involve either sending specific rows to nodes for joining or broadcasting an entire table to all of the nodes. 

 Data redistribution can account for a substantial portion of the cost of a query plan, and the network traffic it generates can affect other database operations and slow overall system performance. To the extent that you anticipate where best to locate data initially, you can minimize the impact of data redistribution. 

 **Data distribution goals** 

 When you load data into a table, Amazon Redshift distributes the table's rows to the compute nodes and slices according to the distribution style that you chose when you created the table. Data distribution has two primary goals: 
+ To distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data distribution skew, forces some nodes to do more work than others, which impairs query performance.
+ To minimize data movement as a query runs. If the rows that participate in joins or aggregates are already collocated on the nodes with their joining rows in other tables, the optimizer doesn't need to redistribute as much data when queries run.

The distribution strategy that you choose for your database has important consequences for query performance, storage requirements, data loading, and maintenance. By choosing the best distribution style for each table, you can balance your data distribution and significantly improve overall system performance.

# Distribution styles
<a name="c_choosing_dist_sort"></a>

When you create a table, you can designate one of the following distribution styles: AUTO, EVEN, KEY, or ALL. 

If you don't specify a distribution style, Amazon Redshift uses AUTO distribution.

 **AUTO distribution** 

With AUTO distribution, Amazon Redshift assigns an optimal distribution style based on the size of the table data. For example, if AUTO distribution style is specified, Amazon Redshift initially assigns the ALL distribution style to a small table. When the table grows larger, Amazon Redshift might change the distribution style to KEY, choosing the primary key (or a column of the composite primary key) as the distribution key. If the table grows larger and none of the columns are suitable to be the distribution key, Amazon Redshift changes the distribution style to EVEN. The change in distribution style occurs in the background with minimal impact to user queries. 
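
As a minimal sketch (the table and column names here are hypothetical), you can create a table with an explicit `DISTSTYLE AUTO` clause, or no distribution clause at all, and then check which style Amazon Redshift is currently applying:

```
create table public.sales_auto (
    saleid   integer not null,
    custid   integer not null,
    saletime timestamp
)
diststyle auto;

-- SVV_TABLE_INFO reports the effective style, such as AUTO(ALL) for a
-- small table that might later become AUTO(KEY) or AUTO(EVEN).
select "table", diststyle
from svv_table_info
where "table" = 'sales_auto';
```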

To view actions that Amazon Redshift automatically performed to alter a table distribution key, see [SVL\_AUTO\_WORKER\_ACTION](r_SVL_AUTO_WORKER_ACTION.md). To view current recommendations regarding altering a table distribution key, see [SVV\_ALTER\_TABLE\_RECOMMENDATIONS](r_SVV_ALTER_TABLE_RECOMMENDATIONS.md). 
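
As a hedged sketch, you might check both views like the following (the column selections are illustrative assumptions; consult the view references above for the full column lists):

```
-- Actions the automatic optimizer has already taken
select table_id, type, status, eventtime
from svl_auto_worker_action
order by eventtime desc;

-- Pending recommendations, including the suggested DDL to apply
select database, table_id, type, ddl
from svv_alter_table_recommendations;
```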

To view the distribution style applied to a table, query the PG\_CLASS\_INFO system catalog view. For more information, see [Viewing distribution styles](viewing-distribution-styles.md). If you don't specify a distribution style with the CREATE TABLE statement, Amazon Redshift applies AUTO distribution. 

 **EVEN distribution** 

 The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table doesn't participate in joins. It's also appropriate when there isn't a clear choice between KEY distribution and ALL distribution.

 **KEY distribution** 

 The rows are distributed according to the values in one column. The leader node places matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns. This way, matching values from the common columns are physically stored together. 
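
A minimal sketch of KEY distribution on a joining column (hypothetical tables): distributing both tables on `custid` stores matching rows on the same slice, so the join below can run without redistribution.

```
create table customer (
    custid integer not null distkey,
    name   varchar(50)
);

create table orders (
    orderid integer not null,
    custid  integer not null distkey,
    total   decimal(8,2)
);

-- Matching custid values are collocated on the same slice in both
-- tables, so this join avoids moving rows between nodes.
select c.name, sum(o.total)
from customer c
join orders o on o.custid = c.custid
group by c.name;
```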

 **ALL distribution** 

 A copy of the entire table is distributed to every node. Whereas EVEN distribution or KEY distribution places only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in. 

 ALL distribution multiplies the storage required by the number of nodes in the cluster, so it takes much longer to load, update, or insert data into multiple tables. ALL distribution is appropriate only for relatively slow-moving tables; that is, tables that are not updated frequently or extensively. Because the cost of redistributing small tables during a query is low, there isn't a significant benefit to defining small dimension tables as DISTSTYLE ALL. 

**Note**  
 After you have specified a distribution style for a table, Amazon Redshift handles data distribution at the cluster level. Amazon Redshift does not require or support the concept of partitioning data within database objects. You don't need to create table spaces or define partitioning schemes for tables. 

In certain scenarios, you can change the distribution style of a table after it is created. For more information, see [ALTER TABLE](r_ALTER_TABLE.md). In scenarios where you can't change the distribution style after creation, you can recreate the table and populate the new table with a deep copy. For more information, see [Performing a deep copy](performing-a-deep-copy.md).
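
Both approaches can be sketched as follows (the table `mytable` and column `col1` are hypothetical):

```
-- Change the distribution style in place where ALTER TABLE supports it
alter table mytable alter diststyle key distkey col1;

-- Otherwise, recreate the table with the new style and deep copy the rows
create table mytable_new distkey(col1) as select * from mytable;
alter table mytable rename to mytable_old;
alter table mytable_new rename to mytable;
drop table mytable_old;
```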

# Viewing distribution styles
<a name="viewing-distribution-styles"></a>

To view the distribution style of a table, query the PG\_CLASS\_INFO view or the SVV\_TABLE\_INFO view.

The RELEFFECTIVEDISTSTYLE column in PG\_CLASS\_INFO indicates the current distribution style for the table. If the table uses automatic distribution, RELEFFECTIVEDISTSTYLE is 10, 11, or 12, which indicates whether the effective distribution style is AUTO (ALL), AUTO (EVEN), or AUTO (KEY). If the table uses automatic distribution, the distribution style might initially show AUTO (ALL), then change to AUTO (EVEN) or AUTO (KEY) when the table grows. 

The following table gives the distribution style for each value in the RELEFFECTIVEDISTSTYLE column: 


| RELEFFECTIVEDISTSTYLE | Current distribution style | 
| --- | --- | 
| 0 | EVEN | 
| 1 | KEY | 
| 8 | ALL | 
| 10 | AUTO (ALL) | 
| 11 | AUTO (EVEN) | 
| 12 | AUTO (KEY) | 
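
For example, a hedged sketch of such a query, listing every table whose name starts with `dist`:

```
select relname, releffectivediststyle
from pg_class_info
where relname like 'dist%';
```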

The DISTSTYLE column in SVV\_TABLE\_INFO indicates the current distribution style for the table. If the table uses automatic distribution, DISTSTYLE is AUTO (ALL), AUTO (EVEN), or AUTO (KEY).

The following example creates four tables using the three distribution styles and automatic distribution, then queries SVV\_TABLE\_INFO to view the distribution styles. 

```
create table public.dist_key (col1 int)
diststyle key distkey (col1);

insert into public.dist_key values (1);

create table public.dist_even (col1 int)
diststyle even;

insert into public.dist_even values (1);

create table public.dist_all (col1 int)
diststyle all;

insert into public.dist_all values (1);

create table public.dist_auto (col1 int);

insert into public.dist_auto values (1);

select "schema", "table", diststyle from SVV_TABLE_INFO
where "table" like 'dist%';

        schema   |    table        | diststyle
     ------------+-----------------+------------
      public     | dist_key        | KEY(col1)
      public     | dist_even       | EVEN
      public     | dist_all        | ALL
      public     | dist_auto       | AUTO(ALL)
```

# Evaluating query patterns
<a name="t_evaluating_query_patterns"></a>

 Choosing distribution styles is only one aspect of database design. Consider distribution styles within the context of the entire system, balancing distribution with other important factors such as cluster size, compression encoding methods, sort keys, and table constraints. 

 Test your system with data that is as close to real data as possible. 

To make good choices for distribution styles, you must understand the query patterns for your Amazon Redshift application. Identify the most costly queries in your system and base your initial database design on the demands of those queries. Factors that determine the total cost of a query include how long the query takes to run and how much computing resources it consumes. Other factors that determine query cost are how often it is run, and how disruptive it is to other queries and database operations. 

 Identify the tables that are used by the most costly queries, and evaluate their role in query runtime. Consider how the tables are joined and aggregated. 

 Use the guidelines in this section to choose a distribution style for each table. When you have done so, create the tables and load them with data that is as close as possible to real data. Then test the tables for the types of queries that you expect to use. You can evaluate the query explain plans to identify tuning opportunities. Compare load times, storage space, and query runtimes to balance your system's overall requirements. 

# Designating distribution styles
<a name="t_designating_distribution_styles"></a>

 The considerations and recommendations for designating distribution styles in this section use a star schema as an example. Your database design might be based on a star schema, some variant of a star schema, or an entirely different schema. Amazon Redshift is designed to work effectively with whatever schema design you choose. The principles in this section can be applied to any design schema. 

1.  **Specify the primary key and foreign keys for all your tables.** 

   Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer uses them when it generates query plans. If you set primary keys and foreign keys, your application must maintain the validity of the keys. 

1.  **Distribute the fact table and its largest dimension table on their common columns.** 

   Choose the largest dimension based on the size of the dataset that participates in the most common join, not only the size of the table. If a table is commonly filtered using a WHERE clause, only a portion of its rows participates in the join. Such a table has less impact on redistribution than a smaller table that contributes more data. Designate both the dimension table's primary key and the fact table's corresponding foreign key as DISTKEY. If multiple tables use the same distribution key, they are also collocated with the fact table. Your fact table can have only one distribution key. Any tables that join on another key aren't collocated with the fact table. 

1.  **Designate distribution keys for the other dimension tables.** 

   Distribute the tables on their primary keys or their foreign keys, depending on how they most commonly join with other tables. 

1.  **Evaluate whether to change some of the dimension tables to use ALL distribution.** 

   If a dimension table cannot be collocated with the fact table or other important joining tables, you can improve query performance significantly by distributing the entire table to all of the nodes. Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations, so you should weigh all factors before choosing ALL distribution. The following section explains how to identify candidates for ALL distribution by evaluating the EXPLAIN plan. 

1.  **Use AUTO distribution for the remaining tables.** 

   If a table is largely denormalized and does not participate in joins, or if you don't have a clear choice for another distribution style, use AUTO distribution. 

To let Amazon Redshift choose the appropriate distribution style, don't explicitly specify a distribution style.
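
The steps above can be sketched on a hypothetical star schema: the fact table and its largest dimension share a distribution key (steps 1 and 2), a small dimension that can't be collocated uses ALL distribution (step 4), and remaining tables default to AUTO (step 5).

```
create table dim_product (
    productid integer not null distkey,  -- step 2: common join column
    name      varchar(100),
    primary key (productid)              -- step 1: declared, not enforced
);

create table dim_region (
    regionid integer not null,
    name     varchar(50),
    primary key (regionid)
) diststyle all;                         -- step 4: can't be collocated

create table fact_sales (
    saleid    integer not null,
    productid integer not null distkey,  -- step 2: collocated with dim_product
    regionid  integer not null,
    amount    decimal(8,2),
    primary key (saleid),
    foreign key (productid) references dim_product(productid),
    foreign key (regionid) references dim_region(regionid)
);
```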

# Evaluating the query plan
<a name="c_data_redistribution"></a>

You can use query plans to identify candidates for optimizing the distribution style. 

After making your initial design decisions, create your tables, load them with data, and test them. Use a test dataset that is as close as possible to the real data. Measure load times to use as a baseline for comparisons. 

Evaluate queries that are representative of the most costly queries you expect to run, specifically queries that use joins and aggregations. Compare runtimes for various design options. When you compare runtimes, don't count the first time the query is run, because the first runtime includes the compilation time. The EXPLAIN output labels each join step with one of the following data-movement attributes:

**DS\_DIST\_NONE**  
No redistribution is required, because corresponding slices are collocated on the compute nodes. You typically have only one DS\_DIST\_NONE step, the join between the fact table and one dimension table. 

**DS\_DIST\_ALL\_NONE**  
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node. 

**DS\_DIST\_INNER**  
The inner table is redistributed. 

**DS\_DIST\_OUTER**  
The outer table is redistributed. 

**DS\_BCAST\_INNER**  
A copy of the entire inner table is broadcast to all the compute nodes. 

**DS\_DIST\_ALL\_INNER**  
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.

**DS\_DIST\_BOTH**  
Both tables are redistributed. 

**DS\_DIST\_ERR**  
The table doesn't have a distribution style selected.

DS\_DIST\_NONE and DS\_DIST\_ALL\_NONE are good. They indicate that no distribution was required for that step because all of the joins are collocated. 

DS\_DIST\_INNER means that the step probably has a relatively high cost because the inner table is being redistributed to the nodes. DS\_DIST\_INNER indicates that the outer table is already properly distributed on the join key. Set the inner table's distribution key to the join key to convert this to DS\_DIST\_NONE. In some cases, distributing the inner table on the join key isn't possible because the outer table isn't distributed on the join key. If this is the case, evaluate whether to use ALL distribution for the inner table. If the table isn't updated frequently or extensively, and it's large enough to carry a high redistribution cost, change the distribution style to ALL and test again. ALL distribution causes increased load times, so when you retest, include the load time in your evaluation factors. 
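
For example (the dimension table and join column names here are hypothetical), either remedy can be applied with ALTER TABLE:

```
-- Distribute the inner table on the join key to aim for DS_DIST_NONE
alter table dim_event alter distkey eventid;

-- Or, if the join key can't be matched, replicate the inner table
alter table dim_event alter diststyle all;
```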

DS\_DIST\_ALL\_INNER is not good. It means that the entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL, so that a copy of the entire outer table is located on each node. This results in inefficient serial execution of the join on a single node instead of taking advantage of parallel execution across all of the nodes. DISTSTYLE ALL is meant to be used only for the inner join table. Instead, specify a distribution key or use EVEN distribution for the outer table.

DS\_BCAST\_INNER and DS\_DIST\_BOTH are not good. Usually these redistributions occur because the tables are not joined on their distribution keys. If the fact table does not already have a distribution key, specify the joining column as the distribution key for both tables. If the fact table already has a distribution key on another column, evaluate whether changing the distribution key to collocate this join would improve overall performance. If changing the distribution key of the outer table isn't an optimal choice, you can achieve collocation by specifying DISTSTYLE ALL for the inner table. 

 The following example shows a portion of a query plan with DS\_BCAST\_INNER and DS\_DIST\_NONE labels.

```
->  XN Hash Join DS_BCAST_INNER  (cost=112.50..3272334142.59 rows=170771 width=84)
        Hash Cond: ("outer".venueid = "inner".venueid)
        ->  XN Hash Join DS_BCAST_INNER  (cost=109.98..3167290276.71 rows=172456 width=47)
              Hash Cond: ("outer".eventid = "inner".eventid)
              ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                    Merge Cond: ("outer".listid = "inner".listid)
                    ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                    ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
```

After changing the dimension tables to use DISTSTYLE ALL, the query plan for the same query shows DS\_DIST\_ALL\_NONE in place of DS\_BCAST\_INNER. Also, there is a dramatic change in the relative cost for the join steps. The total cost is `14142.59` compared to `3272334142.59` in the previous query.

```
->  XN Hash Join DS_DIST_ALL_NONE  (cost=112.50..14142.59 rows=170771 width=84)
        Hash Cond: ("outer".venueid = "inner".venueid)
        ->  XN Hash Join DS_DIST_ALL_NONE  (cost=109.98..10276.71 rows=172456 width=47)
              Hash Cond: ("outer".eventid = "inner".eventid)
              ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                    Merge Cond: ("outer".listid = "inner".listid)
                    ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                    ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
```

# Query plan example
<a name="t_explain_plan_example"></a>

This example shows how to evaluate a query plan to find opportunities to optimize the distribution.

Run the following query with an EXPLAIN command to produce a query plan.

```
explain
select lastname, catname, venuename, venuecity, venuestate, eventname, 
month, sum(pricepaid) as buyercost, max(totalprice) as maxtotalprice
from category join event on category.catid = event.catid
join venue on venue.venueid = event.venueid
join sales on sales.eventid = event.eventid
join listing on sales.listid = listing.listid
join date on sales.dateid = date.dateid
join users on users.userid = sales.buyerid
group by lastname, catname, venuename, venuecity, venuestate, eventname, month
having sum(pricepaid)>9999
order by catname, buyercost desc;
```

In the TICKIT database, SALES is a fact table and LISTING is its largest dimension. To collocate the tables, SALES is distributed on LISTID, its foreign key referencing LISTING, and LISTING is distributed on its primary key, LISTID. The following example shows the CREATE TABLE commands for SALES and LISTING.

```
create table sales(
	salesid integer not null,
	listid integer not null distkey,
	sellerid integer not null,
	buyerid integer not null,
	eventid integer not null encode mostly16,
	dateid smallint not null,
	qtysold smallint not null encode mostly8,
	pricepaid decimal(8,2) encode delta32k,
	commission decimal(8,2) encode delta32k,
	saletime timestamp,
	primary key(salesid),
	foreign key(listid) references listing(listid),
	foreign key(sellerid) references users(userid),
	foreign key(buyerid) references users(userid),
	foreign key(dateid) references date(dateid))
        sortkey(listid,sellerid);

create table listing(
	listid integer not null distkey sortkey,
	sellerid integer not null,
	eventid integer not null encode mostly16,
	dateid smallint not null,
	numtickets smallint not null encode mostly8,
	priceperticket decimal(8,2) encode bytedict,
	totalprice decimal(8,2) encode mostly32,
	listtime timestamp,
	primary key(listid),
	foreign key(sellerid) references users(userid),
	foreign key(eventid) references event(eventid),
	foreign key(dateid) references date(dateid));
```

In the following query plan, the Merge Join step for the join on SALES and LISTING shows DS\_DIST\_NONE, which indicates that no redistribution is required for the step. However, moving up the query plan, the other inner joins show DS\_BCAST\_INNER, which indicates that the inner table is broadcast as part of the query execution. Because only one pair of tables can be collocated using key distribution, five tables must be rebroadcast.

```
QUERY PLAN
XN Merge  (cost=1015345167117.54..1015345167544.46 rows=1000 width=103)
  Merge Key: category.catname, sum(sales.pricepaid)
  ->  XN Network  (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
        Send to leader
        ->  XN Sort  (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
              Sort Key: category.catname, sum(sales.pricepaid)
              ->  XN HashAggregate  (cost=15345150568.37..15345152276.08 rows=170771 width=103)
                    Filter: (sum(pricepaid) > 9999.00)
                    ->  XN Hash Join DS_BCAST_INNER  (cost=742.08..15345146299.10 rows=170771 width=103)
                          Hash Cond: ("outer".catid = "inner".catid)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=741.94..15342942456.61 rows=170771 width=97)
                                Hash Cond: ("outer".dateid = "inner".dateid)
                                ->  XN Hash Join DS_BCAST_INNER  (cost=737.38..15269938609.81 rows=170766 width=90)
                                      Hash Cond: ("outer".buyerid = "inner".userid)
                                      ->  XN Hash Join DS_BCAST_INNER  (cost=112.50..3272334142.59 rows=170771 width=84)
                                            Hash Cond: ("outer".venueid = "inner".venueid)
                                            ->  XN Hash Join DS_BCAST_INNER  (cost=109.98..3167290276.71 rows=172456 width=47)
                                                  Hash Cond: ("outer".eventid = "inner".eventid)
                                                  ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                                                        Merge Cond: ("outer".listid = "inner".listid)
                                                        ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                                                        ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
                                                  ->  XN Hash  (cost=87.98..87.98 rows=8798 width=25)
                                                        ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=25)
                                            ->  XN Hash  (cost=2.02..2.02 rows=202 width=41)
                                                  ->  XN Seq Scan on venue  (cost=0.00..2.02 rows=202 width=41)
                                      ->  XN Hash  (cost=499.90..499.90 rows=49990 width=14)
                                            ->  XN Seq Scan on users  (cost=0.00..499.90 rows=49990 width=14)
                                ->  XN Hash  (cost=3.65..3.65 rows=365 width=11)
                                      ->  XN Seq Scan on date  (cost=0.00..3.65 rows=365 width=11)
                          ->  XN Hash  (cost=0.11..0.11 rows=11 width=10)
                                ->  XN Seq Scan on category  (cost=0.00..0.11 rows=11 width=10)
```

One solution is to alter the tables to have DISTSTYLE ALL.

```
ALTER TABLE users ALTER DISTSTYLE ALL;
ALTER TABLE venue ALTER DISTSTYLE ALL;
ALTER TABLE category ALTER DISTSTYLE ALL;
ALTER TABLE date ALTER DISTSTYLE ALL;
ALTER TABLE event ALTER DISTSTYLE ALL;
```

Run the same query with EXPLAIN again, and examine the new query plan. The joins now show DS\_DIST\_ALL\_NONE, indicating that no redistribution is required because the data was distributed to every node using DISTSTYLE ALL.

```
QUERY PLAN
XN Merge  (cost=1000000047117.54..1000000047544.46 rows=1000 width=103)
  Merge Key: category.catname, sum(sales.pricepaid)
  ->  XN Network  (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
        Send to leader
        ->  XN Sort  (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
              Sort Key: category.catname, sum(sales.pricepaid)
              ->  XN HashAggregate  (cost=30568.37..32276.08 rows=170771 width=103)
                    Filter: (sum(pricepaid) > 9999.00)
                    ->  XN Hash Join DS_DIST_ALL_NONE  (cost=742.08..26299.10 rows=170771 width=103)
                          Hash Cond: ("outer".buyerid = "inner".userid)
                          ->  XN Hash Join DS_DIST_ALL_NONE  (cost=117.20..21831.99 rows=170766 width=97)
                                Hash Cond: ("outer".dateid = "inner".dateid)
                                ->  XN Hash Join DS_DIST_ALL_NONE  (cost=112.64..17985.08 rows=170771 width=90)
                                      Hash Cond: ("outer".catid = "inner".catid)
                                      ->  XN Hash Join DS_DIST_ALL_NONE  (cost=112.50..14142.59 rows=170771 width=84)
                                            Hash Cond: ("outer".venueid = "inner".venueid)
                                            ->  XN Hash Join DS_DIST_ALL_NONE  (cost=109.98..10276.71 rows=172456 width=47)
                                                  Hash Cond: ("outer".eventid = "inner".eventid)
                                                  ->  XN Merge Join DS_DIST_NONE  (cost=0.00..6286.47 rows=172456 width=30)
                                                        Merge Cond: ("outer".listid = "inner".listid)
                                                        ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=14)
                                                        ->  XN Seq Scan on sales  (cost=0.00..1724.56 rows=172456 width=24)
                                                  ->  XN Hash  (cost=87.98..87.98 rows=8798 width=25)
                                                        ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=25)
                                            ->  XN Hash  (cost=2.02..2.02 rows=202 width=41)
                                                  ->  XN Seq Scan on venue  (cost=0.00..2.02 rows=202 width=41)
                                      ->  XN Hash  (cost=0.11..0.11 rows=11 width=10)
                                            ->  XN Seq Scan on category  (cost=0.00..0.11 rows=11 width=10)
                                ->  XN Hash  (cost=3.65..3.65 rows=365 width=11)
                                      ->  XN Seq Scan on date  (cost=0.00..3.65 rows=365 width=11)
                          ->  XN Hash  (cost=499.90..499.90 rows=49990 width=14)
                                ->  XN Seq Scan on users  (cost=0.00..499.90 rows=49990 width=14)
```

# Distribution examples
<a name="c_Distribution_examples"></a>

The following examples show how data is distributed according to the options that you define in the CREATE TABLE statement.

## DISTKEY examples
<a name="c_Distribution_examples-distkey-examples"></a>

Look at the schema of the USERS table in the TICKIT database. USERID is defined as the SORTKEY column and the DISTKEY column: 

```
select "column", type, encoding, distkey, sortkey 
from pg_table_def where tablename = 'users';
    
    column     |          type          | encoding | distkey | sortkey
---------------+------------------------+----------+---------+---------
 userid        | integer                | none     | t       |       1
 username      | character(8)           | none     | f       |       0
 firstname     | character varying(30)  | text32k  | f       |       0

...
```

USERID is a good choice for the distribution column on this table. If you query the SVV\_DISKUSAGE system view, you can see that the table is very evenly distributed. Column numbers are zero-based, so USERID is column 0.

```
select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name='users' and col=0 and rows>0
order by slice, col;

slice| col | rows  | minvalue | maxvalue
-----+-----+-------+----------+----------
0    | 0   | 12496 | 4        | 49987
1    | 0   | 12498 | 1        | 49988
2    | 0   | 12497 | 2        | 49989
3    | 0   | 12499 | 3        | 49990
(4 rows)
```

The table contains 49,990 rows. The rows (num\_values) column shows that each slice contains about the same number of rows. The minvalue and maxvalue columns show the range of values on each slice. Each slice includes nearly the entire range of values, so there's a good chance that every slice participates in running a query that filters for a range of user IDs.

This example demonstrates distribution on a small test system. The total number of slices is typically much higher.

If you commonly join or group using the STATE column, you might choose to distribute on the STATE column. The following example shows a case where you create a new table with the same data as the USERS table but set the DISTKEY to the STATE column. In this case, the distribution isn't as even. Slice 0 (13,587 rows) holds approximately 30 percent more rows than slice 3 (10,150 rows). In a much larger table, this amount of distribution skew can have an adverse impact on query processing.

```
create table userskey distkey(state) as select * from users;

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'userskey' and col=0 and rows>0
order by slice, col;

slice | col | rows  | minvalue | maxvalue
------+-----+-------+----------+----------
    0 |   0 | 13587 |        5 |    49989
    1 |   0 | 11245 |        2 |    49990
    2 |   0 | 15008 |        1 |    49976
    3 |   0 | 10150 |        4 |    49986
(4 rows)
```
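
Beyond inspecting SVV\_DISKUSAGE slice by slice, the `skew_rows` column of SVV\_TABLE\_INFO summarizes skew as the ratio of rows in the slice with the most rows to rows in the slice with the fewest; the following is a hedged sketch, where values near 1.0 indicate even distribution:

```
select "table", diststyle, skew_rows
from svv_table_info
where "table" in ('users', 'userskey');
```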

## DISTSTYLE EVEN example
<a name="c_Distribution_examples-diststyle-even-example"></a>

If you create a new table with the same data as the USERS table but set the DISTSTYLE to EVEN, rows are always evenly distributed across slices. 

```
create table userseven diststyle even as 
select * from users;

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'userseven' and col=0 and rows>0
order by slice, col;

slice | col | rows  | minvalue | maxvalue
------+-----+-------+----------+----------
    0 |   0 | 12497 |        4 |    49990
    1 |   0 | 12498 |        8 |    49984
    2 |   0 | 12498 |        2 |    49988
    3 |   0 | 12497 |        1 |    49989  
(4 rows)
```

However, because distribution is not based on a specific column, query processing can be degraded, especially if the table is joined to other tables. The lack of distribution on a joining column often influences the type of join operation that can be performed efficiently. Joins, aggregations, and grouping operations are optimized when both tables are distributed and sorted on their respective joining columns.

## DISTSTYLE ALL example
<a name="c_Distribution_examples-diststyle-all-example"></a>

If you create a new table with the same data as the USERS table but set the DISTSTYLE to ALL, all the rows are distributed to the first slice of each node. 

```
select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'usersall' and col=0 and rows > 0
order by slice, col;

slice | col | rows  | minvalue | maxvalue
------+-----+-------+----------+----------
    0 |   0 | 49990 |        4 |    49990
    2 |   0 | 49990 |        2 |    49990
(2 rows)
```