

# Tutorial: Loading data into Amazon Keyspaces using cqlsh
<a name="bulk-upload"></a>

This tutorial guides you through the process of migrating data from Apache Cassandra to Amazon Keyspaces using the `cqlsh COPY FROM` command. The `cqlsh COPY FROM` command is useful for quickly and easily uploading small datasets to Amazon Keyspaces for academic or test purposes. For more information about how to migrate production workloads, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md). In this tutorial, you'll complete the following steps:

Prerequisites – Set up an AWS account with credentials, create a JKS trust store file for the certificate, and configure `cqlsh` to connect to Amazon Keyspaces. 

1. **Create source CSV and target table** – Prepare a CSV file as the source data and create the target keyspace and table in Amazon Keyspaces.

1. **Prepare the data** – Randomize the data in the CSV file and analyze it to determine the average and maximum row sizes.

1. **Set throughput capacity** – Calculate the required write capacity units (WCUs) based on the data size and desired load time, and configure the table's provisioned capacity.

1. **Configure cqlsh parameters** – Determine optimal values for `cqlsh COPY FROM` parameters like `INGESTRATE`, `NUMPROCESSES`, `MAXBATCHSIZE`, and `CHUNKSIZE` to distribute the workload evenly. 

1. **Run the `cqlsh COPY FROM` command** – Run the `cqlsh COPY FROM` command to upload the data from the CSV file to the Amazon Keyspaces table, and monitor the progress.

Troubleshooting – Resolve common issues like invalid requests, parser errors, capacity errors, and cqlsh errors during the data upload process. 

**Topics**
+ [Prerequisites: Steps to complete before you can upload data using `cqlsh COPY FROM`](bulk-upload-prequs.md)
+ [Step 1: Create the source CSV file and a target table for the data upload](bulk-upload-source.md)
+ [Step 2: Prepare the source data for a successful data upload](bulk-upload-prepare-data.md)
+ [Step 3: Set throughput capacity for the table](bulk-upload-capacity.md)
+ [Step 4: Configure `cqlsh COPY FROM` settings](bulk-upload-config.md)
+ [Step 5: Run the `cqlsh COPY FROM` command to upload data from the CSV file to the target table](bulk-upload-run.md)
+ [Troubleshooting](bulk-upload-troubleshooting.md)

# Prerequisites: Steps to complete before you can upload data using `cqlsh COPY FROM`
<a name="bulk-upload-prequs"></a>

You must complete the following tasks before you can start this tutorial.

1. If you have not already done so, sign up for an AWS account by following the steps at [Setting up AWS Identity and Access Management](accessing.md#SettingUp.IAM).

1. Create service-specific credentials by following the steps at [Create service-specific credentials for programmatic access to Amazon Keyspaces](programmatic.credentials.ssc.md).

1. Set up the Cassandra Query Language shell (cqlsh) connection and confirm that you can connect to Amazon Keyspaces by following the steps at [Using `cqlsh` to connect to Amazon Keyspaces](programmatic.cqlsh.md). 

# Step 1: Create the source CSV file and a target table for the data upload
<a name="bulk-upload-source"></a>

For this tutorial, we use a comma-separated values (CSV) file with the name `keyspaces_sample_table.csv` as the source file for the data migration. The provided sample file contains a few rows of data for a table with the name `book_awards`.

1. Create the source file. You can choose one of the following options:
   + Download the sample CSV file (`keyspaces_sample_table.csv`) contained in the following archive file [samplemigration.zip](samples/samplemigration.zip). Unzip the archive and take note of the path to `keyspaces_sample_table.csv`.
   + To create a CSV file with your own data stored in an Apache Cassandra database, you can export the source table by using the `cqlsh` `COPY TO` statement as shown in the following example.

     ```
     cqlsh localhost 9042 -u "username" -p "password" --execute "COPY mykeyspace.mytable TO 'keyspaces_sample_table.csv' WITH HEADER=true";
     ```

     Make sure the CSV file you create meets the following requirements:
     + The first row contains the column names.
     + The column names in the source CSV file match the column names in the target table.
     + The data is delimited with a comma.
     + All data values are valid Amazon Keyspaces data types. See [Data types](cql.elements.md#cql.data-types).

1. Create the target keyspace and table in Amazon Keyspaces.

   1. Connect to Amazon Keyspaces using `cqlsh`, replacing the service endpoint, user name, and password in the following example with your own values.

      ```
      cqlsh cassandra.us-east-1.amazonaws.com 9142 -u "111122223333" -p "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" --ssl
      ```

   1. Create a new keyspace with the name `catalog` as shown in the following example. 

      ```
      CREATE KEYSPACE catalog WITH REPLICATION = {'class': 'SingleRegionStrategy'};
      ```

   1. When the new keyspace is available, use the following code to create the target table `book_awards`.

      ```
      CREATE TABLE catalog.book_awards (
         year int,
         award text,
         rank int, 
         category text,
         book_title text,
         author text, 
         publisher text,
         PRIMARY KEY ((year, award), category, rank)
         );
      ```

   If Apache Cassandra is your original data source, a simple way to create the Amazon Keyspaces target table with matching headers is to generate the `CREATE TABLE` statement from the source table, as shown in the following statement.

   ```
   cqlsh localhost 9042  -u "username" -p "password" --execute "DESCRIBE TABLE mykeyspace.mytable;"
   ```

   Then create the target table in Amazon Keyspaces with the column names and data types matching the description from the Cassandra source table.
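One of the requirements listed earlier is that the column names in the source CSV file match the column names in the target table. As a quick sanity check before the upload, you can compare the CSV header against the expected columns. The following Python sketch is an illustrative helper, not part of cqlsh; the expected column set comes from this tutorial's `book_awards` table, so adjust it for your own schema.

```python
import csv

# Columns of the target table from this tutorial; adjust for your own schema.
EXPECTED_COLUMNS = {"year", "award", "rank", "category",
                    "book_title", "author", "publisher"}

def header_matches(csv_path, expected=EXPECTED_COLUMNS):
    """Return True if the CSV's first row contains exactly the expected column names."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return {h.strip() for h in header} == set(expected)
```

For example, `header_matches('keyspaces_sample_table.csv')` returns `False` if a column was renamed or dropped during the export, which you can then fix before running `COPY FROM`.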

# Step 2: Prepare the source data for a successful data upload
<a name="bulk-upload-prepare-data"></a>

Preparing the source data for an efficient transfer is a two-step process. First, you randomize the data. In the second step, you analyze the data to determine the appropriate `cqlsh` parameter values and required table settings to ensure that the data upload is successful.

**Randomize the data**  
The `cqlsh COPY FROM` command reads and writes data in the same order that it appears in the CSV file. If you use the `cqlsh COPY TO` command to create the source file, the data is written in key-sorted order in the CSV. Internally, Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more efficient if you randomize the order. This is because you can take advantage of the built-in load balancing that occurs when Amazon Keyspaces is writing to different partitions.

To spread the writes across the partitions evenly, you must randomize the data in the source file. You can write an application to do this or use an open-source tool, such as [Shuf](https://en.wikipedia.org/wiki/Shuf). Shuf is freely available on Linux distributions, on macOS (by installing coreutils in [homebrew](https://brew.sh)), and on Windows (by using Windows Subsystem for Linux (WSL)). An extra step is required to prevent the header row with the column names from being shuffled along with the data.

To randomize the source file while preserving the header, enter the following code.

```
tail -n +2 keyspaces_sample_table.csv | shuf -o keyspace.table.csv && (head -1 keyspaces_sample_table.csv && cat keyspace.table.csv ) > keyspace.table.csv1 && mv keyspace.table.csv1 keyspace.table.csv
```

Shuf rewrites the data to a new CSV file called `keyspace.table.csv`. You can now delete the `keyspaces_sample_table.csv` file—you no longer need it.
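If `shuf` isn't available on your system, the same header-preserving shuffle can be sketched in Python. This is an illustrative alternative (the file names follow this tutorial), not part of the official procedure.

```python
import random

def shuffle_csv(src, dst, seed=None):
    """Shuffle the data rows of a CSV file while keeping the header row first."""
    with open(src, newline="") as f:
        header, *rows = f.readlines()
    random.Random(seed).shuffle(rows)  # seed is optional, for reproducible runs
    with open(dst, "w", newline="") as f:
        f.write(header)
        f.writelines(rows)

# Example, using this tutorial's file names:
# shuffle_csv("keyspaces_sample_table.csv", "keyspace.table.csv")
```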

**Analyze the data**  
Determine the average and maximum row size by analyzing the data.

You do this for the following reasons:
+ The average row size helps to estimate the total amount of data to be transferred.
+ You need the average row size to provision the write capacity needed for the data upload.
+ You can make sure that each row is less than 1 MB in size, which is the maximum row size in Amazon Keyspaces.

**Note**  
This quota refers to row size, not partition size. Unlike Apache Cassandra partitions, Amazon Keyspaces partitions can be virtually unbounded in size. Partition keys and clustering columns require additional storage for metadata, which you must add to the raw size of rows. For more information, see [Estimate row size in Amazon Keyspaces](calculating-row-size.md).

The following code uses [AWK](https://en.wikipedia.org/wiki/AWK) to analyze a CSV file and print the average and maximum row size.

```
awk -F, 'BEGIN {samp=10000;max=-1;}{if(NR>1){len=length($0);t+=len;avg=t/NR;max=(len>max ? len : max)}}NR==samp{exit}END{printf("{lines: %d, average: %d bytes, max: %d bytes}\n",NR,avg,max);}' keyspace.table.csv
```

Running this code results in the following output.

```
{lines: 10000, average: 123 bytes, max: 225 bytes}
```

You use the average row size in the next step of this tutorial to provision the write capacity for the table.
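On systems without AWK, the same analysis can be sketched in Python. The following is an illustrative equivalent of the AWK script above, not an official tool: it samples up to 10,000 data rows and reports the count, average, and maximum row size in bytes.

```python
def analyze_csv(path, sample=10000):
    """Return (rows_sampled, average_bytes, max_bytes) for a CSV, skipping the header."""
    total = max_len = count = 0
    with open(path, "rb") as f:
        next(f)  # skip the header row
        for count, line in enumerate(f, start=1):
            n = len(line.rstrip(b"\r\n"))
            total += n
            max_len = max(max_len, n)
            if count >= sample:
                break
    average = total // count if count else 0
    return count, average, max_len
```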

# Step 3: Set throughput capacity for the table
<a name="bulk-upload-capacity"></a>

This tutorial shows you how to tune cqlsh to load data within a set time range. Because you know how many reads and writes you perform in advance, use provisioned capacity mode. After you finish the data transfer, you should set the capacity mode of the table to match your application’s traffic patterns. To learn more about capacity management, see [Managing serverless resources in Amazon Keyspaces (for Apache Cassandra)](serverless_resource_management.md).

With provisioned capacity mode, you specify how much read and write capacity you want to provision to your table in advance. Write capacity is billed hourly and metered in write capacity units (WCUs). Each WCU is enough write capacity to support writing 1 KB of data per second. When you load the data, the write rate must be under the max WCUs (parameter: `write_capacity_units`) that are set on the target table. 

By default, you can provision up to 40,000 WCUs to a table and 80,000 WCUs across all the tables in your account. If you need additional capacity, you can request a quota increase in the [Service Quotas](https://console.aws.amazon.com/servicequotas/home#!/services/cassandra/quotas) console. For more information about quotas, see [Quotas for Amazon Keyspaces (for Apache Cassandra)](quotas.md).

**Calculate the average number of WCUs required for an insert**  
Inserting 1 KB of data per second requires 1 WCU. If your CSV file has 360,000 rows and you want to load all the data in 1 hour, you must write 100 rows per second (360,000 rows / 60 minutes / 60 seconds = 100 rows per second). If each row has up to 1 KB of data, to insert 100 rows per second, you must provision 100 WCUs to your table. If each row has 1.5 KB of data, you need two WCUs to insert one row per second. Therefore, to insert 100 rows per second, you must provision 200 WCUs.

To determine how many WCUs you need to insert one row per second, divide the average row size in bytes by 1024 and round up to the nearest whole number.

For example, if the average row size is 3000 bytes, you need three WCUs to insert one row per second.

```
ROUNDUP(3000 / 1024) = ROUNDUP(2.93) = 3 WCUs
```

**Calculate data load time and capacity**  
Now that you know the average size and number of rows in your CSV file, you can calculate how many WCUs you need to load the data in a given amount of time, and the approximate time it takes to load all the data in your CSV file using different WCU settings.

For example, if each row in your file is 1 KB and you have 1,000,000 rows in your CSV file, to load the data in 1 hour, you need to provision at least 278 WCUs to your table for that hour.

```
1,000,000 rows * 1 KB = 1,000,000 KB
1,000,000 KB / 3600 seconds = 277.8 KB/second = 278 WCUs
```
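The two calculations in this step can be sketched as a pair of helper functions. This is an illustrative planning aid under the 1 WCU = 1 KB/second assumption; the numbers mirror this tutorial's worked examples.

```python
import math

def wcus_per_row(avg_row_bytes):
    """WCUs needed to insert one row per second (1 WCU writes 1 KB per second)."""
    return math.ceil(avg_row_bytes / 1024)

def wcus_for_load(total_rows, avg_row_bytes, load_seconds):
    """WCUs to provision so the whole file loads within the given time window."""
    return math.ceil(total_rows * wcus_per_row(avg_row_bytes) / load_seconds)
```

For the examples above, `wcus_per_row(3000)` yields 3, and `wcus_for_load(1_000_000, 1024, 3600)` yields 278.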

**Configure provisioned capacity settings**  
You can set a table’s write capacity settings when you create the table or by using the `ALTER TABLE` CQL command. The following is the syntax for altering a table’s provisioned capacity settings with the `ALTER TABLE` CQL statement.

```
ALTER TABLE mykeyspace.mytable WITH custom_properties={'capacity_mode':{'throughput_mode': 'PROVISIONED', 'read_capacity_units': 100, 'write_capacity_units': 278}} ; 
```

For the complete language reference, see [ALTER TABLE](cql.ddl.table.md#cql.ddl.table.alter).

# Step 4: Configure `cqlsh COPY FROM` settings
<a name="bulk-upload-config"></a>

This section outlines how to determine the parameter values for `cqlsh COPY FROM`. The `cqlsh COPY FROM` command reads the CSV file that you prepared earlier and inserts the data into Amazon Keyspaces using CQL. The command divides up the rows and distributes the `INSERT` operations among a set of workers. Each worker establishes a connection with Amazon Keyspaces and sends `INSERT` requests along this channel. 

The `cqlsh COPY` command doesn’t have internal logic to distribute work evenly among its workers. However, you can configure it manually to make sure that the work is distributed evenly. Start by reviewing these key cqlsh parameters:
+ **DELIMITER** – If you used a delimiter other than a comma, you can set this parameter, which defaults to comma.
+ **INGESTRATE** – The target number of rows that `cqlsh COPY FROM` attempts to process per second. If unset, it defaults to 100,000.
+ **NUMPROCESSES** – The number of child worker processes that cqlsh creates for `COPY FROM` tasks. The maximum for this setting is 16; the default is `num_cores - 1`, where `num_cores` is the number of processing cores on the host running cqlsh.
+ **MAXBATCHSIZE** – The batch size determines the maximum number of rows inserted into the destination table in a single batch. If unset, cqlsh uses batches of 20 inserted rows.
+ **CHUNKSIZE** – The size of the work unit that passes to the child worker. By default, it is set to 5,000. 
+ **MAXATTEMPTS** – The maximum number of times to retry a failed worker chunk. After the maximum attempt is reached, the failed records are written to a new CSV file that you can run again later after investigating the failure.

Set `INGESTRATE` based on the number of WCUs that you provisioned to the target destination table. The `INGESTRATE` of the `cqlsh COPY FROM` command isn’t a limit—it’s a target average. This means it can (and often does) burst above the number you set. To allow for bursts and make sure that enough capacity is in place to handle the data load requests, set `INGESTRATE` to 90% of the table’s write capacity.

```
INGESTRATE = WCUs * .90
```

Next, set the `NUMPROCESSES` parameter equal to one less than the number of cores on your system. To find the number of cores on your system, you can run the following code.

```
python -c "import multiprocessing; print(multiprocessing.cpu_count())"
```

For this tutorial, we use the following value.

```
NUMPROCESSES = 4
```

Each process creates a worker, and each worker establishes a connection to Amazon Keyspaces. Amazon Keyspaces can support up to 3,000 CQL requests per second on every connection. This means that you have to make sure that each worker is processing fewer than 3,000 requests per second. 

As with `INGESTRATE`, the workers often burst above the number you set and aren’t limited by clock seconds. Therefore, to account for bursts, set your cqlsh parameters to target each worker to process 2,500 requests per second. To calculate the amount of work distributed to a worker, use the following guideline.
+ Divide `INGESTRATE` by `NUMPROCESSES`.
+ If `INGESTRATE` / `NUMPROCESSES` > 2,500, lower the `INGESTRATE` to make this formula true.

```
INGESTRATE / NUMPROCESSES <= 2,500
```

Before you configure the settings to optimize the upload of our sample data, let's review the `cqlsh` default settings and see how using them impacts the data upload process. Because `cqlsh COPY FROM` uses the `CHUNKSIZE` to create chunks of work (`INSERT` statements) to distribute to workers, the work is not automatically distributed evenly. Some workers might sit idle, depending on the `INGESTRATE` setting.

To distribute work evenly among the workers and keep each worker at the optimal 2,500 requests per second rate, you must set `CHUNKSIZE`, `MAXBATCHSIZE`, and `INGESTRATE` by changing the input parameters. To optimize network traffic utilization during the data load, choose a value for `MAXBATCHSIZE` that is close to the maximum value of 30. By changing `CHUNKSIZE` to 100 and `MAXBATCHSIZE` to 25, the 10,000 rows are spread evenly among the four workers (10,000 / 4 = 2,500 requests per second per worker).

The following code example illustrates this.

```
INGESTRATE = 10,000
NUMPROCESSES = 4
CHUNKSIZE = 100
MAXBATCHSIZE = 25
Work Distribution:
Connection 1 / Worker 1 : 2,500 Requests per second
Connection 2 / Worker 2 : 2,500 Requests per second
Connection 3 / Worker 3 : 2,500 Requests per second
Connection 4 / Worker 4 : 2,500 Requests per second
```

To summarize, use the following formulas when setting `cqlsh COPY FROM` parameters:
+ **INGESTRATE** = `write_capacity_units` \* .90
+ **NUMPROCESSES** = `num_cores` - 1 (default)
+ **INGESTRATE / NUMPROCESSES** <= 2,500 (This must be a true statement.)
+ **MAXBATCHSIZE** = 30 (Defaults to 20. Amazon Keyspaces accepts batches up to 30.)
+ **CHUNKSIZE** = (INGESTRATE / NUMPROCESSES) / MAXBATCHSIZE
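The summary formulas can be collected into a small helper that suggests a consistent set of parameters. This is an illustrative planning aid, not part of cqlsh; it assumes the 2,500 requests per second per worker target from this step.

```python
def copy_from_settings(wcus, num_cores, maxbatchsize=30):
    """Suggest cqlsh COPY FROM parameters from provisioned WCUs and host core count."""
    numprocesses = min(num_cores - 1, 16)     # cqlsh default, capped at 16
    ingestrate = int(wcus * 0.90)             # leave headroom for bursts
    # Keep each worker (connection) well under 3,000 requests/second; target 2,500.
    ingestrate = min(ingestrate, 2500 * numprocesses)
    chunksize = (ingestrate // numprocesses) // maxbatchsize
    return {"INGESTRATE": ingestrate, "NUMPROCESSES": numprocesses,
            "MAXBATCHSIZE": maxbatchsize, "CHUNKSIZE": chunksize}
```

With 11,112 WCUs, 5 cores, and a `MAXBATCHSIZE` of 25, this reproduces the example above: an `INGESTRATE` of 10,000, 4 workers, and a `CHUNKSIZE` of 100.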

Now that you have calculated `NUMPROCESSES`, `INGESTRATE`, and `CHUNKSIZE`, you’re ready to load your data.

# Step 5: Run the `cqlsh COPY FROM` command to upload data from the CSV file to the target table
<a name="bulk-upload-run"></a>

To run the `cqlsh COPY FROM` command, complete the following steps.

1. Connect to Amazon Keyspaces using cqlsh.

1. Choose your keyspace with the following code.

   ```
   USE catalog;
   ```

1. Set write consistency to `LOCAL_QUORUM`. To ensure data durability, Amazon Keyspaces doesn’t allow other write consistency settings. See the following code.

   ```
   CONSISTENCY LOCAL_QUORUM;
   ```

1. Prepare your `cqlsh COPY FROM` syntax using the following code example. 

   ```
   COPY book_awards FROM './keyspace.table.csv' WITH HEADER=true 
   AND INGESTRATE=calculated ingestrate 
   AND NUMPROCESSES=calculated numprocess
   AND MAXBATCHSIZE=20 
   AND CHUNKSIZE=calculated chunksize;
   ```

1. Run the statement prepared in the previous step. cqlsh echoes back all the settings that you've configured.

   1. Make sure that the settings match your input. See the following example.

      ```
      Reading options from the command line: {'chunksize': '120', 'header': 'true', 'ingestrate': '36000', 'numprocesses': '15', 'maxbatchsize': '20'}
      Using 15 child processes
      ```

   1. Review the number of rows transferred and the current average rate, as shown in the following example.

      ```
      Processed: 57834 rows; Rate: 6561 rows/s; Avg. rate: 31751 rows/s
      ```

   1. When cqlsh has finished uploading the data, review the summary of the data load statistics (the number of files read, runtime, and skipped rows) as shown in the following example.

      ```
      15556824 rows imported from 1 files in 8 minutes and 8.321 seconds (0 skipped).
      ```

In this final step of the tutorial, you have uploaded the data to Amazon Keyspaces.

**Important**  
Now that you have transferred your data, adjust the capacity mode settings of your target table to match your application’s regular traffic patterns. You incur charges at the hourly rate for your provisioned capacity until you change it.

# Troubleshooting
<a name="bulk-upload-troubleshooting"></a>

After the data upload has completed, check to see if rows were skipped. To do so, navigate to the directory that contains the source CSV file and search for a file with the following name.

```
import_yourcsvfilename.err.timestamp.csv
```

cqlsh writes any skipped rows of data into a file with that name. If the file exists in your source directory and has data in it, these rows didn't upload to Amazon Keyspaces. To retry these rows, first check for any errors that were encountered during the upload and adjust the data accordingly. Then you can rerun the process using that file as the source.

**Common errors**  
The most common reasons why rows aren’t loaded are capacity errors and parsing errors.

**Invalid request errors when uploading data to Amazon Keyspaces**

In the following example, the source table contains a counter column, which results in logged batch calls from the cqlsh `COPY` command. Logged batch calls are not supported by Amazon Keyspaces.

```
Failed to import 10 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Only UNLOGGED Batches are supported at this time.", will retry later, attempt 22 of 25
```

To resolve this error, use DSBulk to migrate the data. For more information, see [Tutorial: Loading data into Amazon Keyspaces using DSBulk](dsbulk-upload.md).

**Parser errors when uploading data to Amazon Keyspaces**

The following example shows a skipped row due to a `ParseError`.

```
Failed to import 1 rows: ParseError - Invalid ... – 
```

To resolve this error, make sure that the data to be imported matches the table schema in Amazon Keyspaces. Review the import file for parsing errors. You can try inserting a single row of data with an `INSERT` statement to isolate the error.

**Capacity errors when uploading data to Amazon Keyspaces**

```
Failed to import 1 rows: WriteTimeout - Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses]
 message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 2, 'write_type': 'SIMPLE', 'consistency': 
 'LOCAL_QUORUM'}, will retry later, attempt 1 of 100
```

Amazon Keyspaces uses the `ReadTimeout` and `WriteTimeout` exceptions to indicate when a write request fails due to insufficient throughput capacity. To help diagnose insufficient capacity exceptions, Amazon Keyspaces publishes `WriteThrottleEvents` and `ReadThrottleEvents` metrics in Amazon CloudWatch. For more information, see [Monitoring Amazon Keyspaces with Amazon CloudWatch](monitoring-cloudwatch.md).

**cqlsh errors when uploading data to Amazon Keyspaces**

To help troubleshoot cqlsh errors, rerun the failing command with the `--debug` flag.

When using an incompatible version of cqlsh, you see the following error.

```
AttributeError: 'NoneType' object has no attribute 'is_up'
Failed to import 3 rows: AttributeError - 'NoneType' object has no attribute 'is_up',  given up after 1 attempts
```

Confirm that the correct version of cqlsh is installed by running the following command.

```
cqlsh --version
```

You should see output similar to the following.

```
cqlsh 5.0.1
```

If you're using Windows, replace all instances of `cqlsh` with `cqlsh.bat`. For example, to check the version of cqlsh in Windows, run the following command.

```
cqlsh.bat --version
```

The connection to Amazon Keyspaces fails after the cqlsh client receives three consecutive errors of any type from the server. The cqlsh client fails with the following message. 

```
Failed to import 1 rows: NoHostAvailable - , will retry later, attempt 3 of 100
```


The client automatically attempts to reestablish the connection.