

# How to use partition keys effectively in Amazon Keyspaces
<a name="bp-partition-key-design"></a>

The primary key that uniquely identifies each row in an Amazon Keyspaces table can consist of one or multiple partition key columns, which determine which partitions the data is stored in, and one or more optional clustering column, which define how data is clustered and sorted within a partition. 

Because the partition key establishes the number of partitions your data is stored in and how the data is distributed across these partitions, how you chose your partition key can have a significant impact upon the performance of your queries. In general, you should design your application for uniform activity across all partitions on disk. 

Distributing read and write activity of your application evenly across all partitions helps to minimize throughput costs and this applies to on-demand as well as provisioned read/write capacity modes. For example, if you are using provisioned capacity mode, you can determine the access patterns that your application needs, and estimate the total read capacity units (RCU) and write capacity units (WCU) that each table requires. Amazon Keyspaces supports your access patterns using the throughput that you provisioned as long as the traffic against a given partition does not exceed 3,000 RCUs and 1,000 WCUs. 

When a partition experiences sustained high read or write throughput, depending on traffic patterns Amazon Keyspaces may automatically split the partition into two new partitions. Each new partition contains a subset of the original partition's rows, distributing the throughput evenly across both partitions.

Amazon Keyspaces offers additional flexibility in your per-partition throughput provisioning by providing burst capacity, for more information see [Use burst capacity effectively in Amazon Keyspaces](throughput-bursting.md).

**Topics**
+ [Use write sharding to evenly distribute workloads across partitions](bp-partition-key-sharding.md)

# Use write sharding to evenly distribute workloads across partitions
<a name="bp-partition-key-sharding"></a>

One way to better distribute writes across a partition in Amazon Keyspaces is to expand the space. You can do this in several different ways. You can add an additional partition key column to which you write random numbers to distribute the rows among partitions. Or you can use a number that is calculated based on something that you're querying on.

## Sharding using compound partition keys and random values
<a name="bp-partition-key-sharding-random"></a>

One strategy for distributing loads more evenly across a partition is to add an additional partition key column to which you write random numbers. Then you randomize the writes across the larger space.

For example, consider the following table which has a single partition key representing a date.

```
CREATE TABLE IF NOT EXISTS tracker.blogs (
   publish_date date,
   title text,
   description int,
   PRIMARY KEY (publish_date));
```

To more evenly distribute this table across partitions, you could include an additional partition key column `shard` that stores random numbers. For example:

```
CREATE TABLE IF NOT EXISTS tracker.blogs (
   publish_date date, 
   shard int, 
   title text, 
   description int, 
   PRIMARY KEY ((publish_date, shard)));
```

When inserting data you might choose a random number between `1` and `200` for the `shard` column. This yields compound partition key values like `(2020-07-09, 1)`, `(2020-07-09, 2)`, and so on, through `(2020-07-09, 200)`. Because you are randomizing the partition key, the writes to the table on each day are spread evenly across multiple partitions. This results in better parallelism and higher overall throughput.

However, to read all the rows for a given day, you would have to query the rows for all the shards and then merge the results. For example, you would first issue a `SELECT` statement for the partition key value `(2020-07-09, 1)`. Then issue another `SELECT` statement for `(2020-07-09, 2)`, and so on, through `(2020-07-09, 200)`. Finally, your application would have to merge the results from all those `SELECT` statements.

## Sharding using compound partition keys and calculated values
<a name="bp-partition-key-sharding-calculated"></a>

A randomizing strategy can greatly improve write throughput. But it's difficult to read a specific row because you don't know which value was written to the `shard` column when the row was written. To make it easier to read individual rows, you can use a different strategy. Instead of using a random number to distribute the rows among partitions, use a number that you can calculate based upon something that you want to query on.

Consider the previous example, in which a table uses today's date in the partition key. Now suppose that each row has an accessible `title` column, and that you most often need to find rows by title in addition to date. Before your application writes the row to the table, it could calculate a hash value based on the title and use it to populate the `shard` column. The calculation might generate a number between 1 and 200 that is fairly evenly distributed, similar to what the random strategy produces.

A simple calculation would likely suffice, such as the product of the UTF-8 code point values for the characters in the title, modulo 200, \$1 1. The compound partition key value would then be the combination of the date and calculation result.

With this strategy, the writes are spread evenly across the partition key values, and thus across the physical partitions. You can easily perform a `SELECT` statement for a particular row and date because you can calculate the partition key value for a specific `title` value.

To read all the rows for a given day, you still must `SELECT` each of the `(2020-07-09, N)` keys (where `N` is 1–200), and your application then has to merge all the results. The benefit is that you avoid having a single "hot" partition key value taking all of the workload.