

# Migrating to Amazon Keyspaces (for Apache Cassandra)
<a name="migrating"></a>

Migrating to Amazon Keyspaces (for Apache Cassandra) presents a range of compelling benefits for businesses and organizations. Here are some key advantages that make Amazon Keyspaces an attractive choice for migration.
+ **Scalability** – Amazon Keyspaces is designed to handle massive workloads and scale seamlessly to accommodate growing data volumes and traffic. With traditional Cassandra, scaling is not performed on demand and requires planning for future peaks. With Amazon Keyspaces, you can easily scale your tables up or down based on demand, ensuring that your applications can handle sudden spikes in traffic without compromising performance.
+ **Performance** – Amazon Keyspaces offers low-latency data access, enabling applications to retrieve and process data with exceptional speed. Its distributed architecture ensures that read and write operations are distributed across multiple nodes, delivering consistent, single-digit millisecond response times even at high request rates.
+ **Fully managed** – Amazon Keyspaces is a fully managed service provided by AWS. This means that AWS handles the operational aspects of database management, including provisioning, configuration, patching, backups, and scaling. This allows you to focus more on developing your applications and less on database administration tasks.
+ **Serverless architecture** – Amazon Keyspaces is serverless. You pay only for capacity consumed with no upfront capacity provisioning required. You don't have servers to manage or instances to choose. This pay-per-request model offers cost efficiency and minimal operational overhead, as you only pay for the resources you consume without the need to provision and monitor capacity.
+ **NoSQL flexibility with schema** – Amazon Keyspaces follows a NoSQL data model, providing flexibility in schema design. With Amazon Keyspaces, you can store structured, semi-structured, and unstructured data, making it well-suited for handling diverse and evolving data types. Additionally, Amazon Keyspaces performs schema validation on write, allowing for a centralized evolution of the data model. This flexibility enables faster development cycles and easier adaptation to changing business requirements. 
+ **High availability and durability** – Amazon Keyspaces replicates data across multiple [Availability Zones](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/) within an AWS Region, ensuring high availability and data durability. It automatically handles replication, failover, and recovery, minimizing the risk of data loss or service disruptions. Amazon Keyspaces provides an availability SLA of up to 99.999%. For even more resiliency and low-latency local reads, Amazon Keyspaces offers [multi-Region replication](multiRegion-replication.md).
+ **Security and compliance** – Amazon Keyspaces integrates with AWS Identity and Access Management for fine-grained access control. It provides encryption at rest and in-transit, helping to improve the security of your data. Amazon Keyspaces has been assessed by third-party auditors for security and compliance with specific programs, including HIPAA, PCI DSS, and SOC, enabling you to meet regulatory requirements. For more information, see [Compliance validation for Amazon Keyspaces (for Apache Cassandra)](Keyspaces-compliance.md).
+ **Integration with AWS Ecosystem** – As part of the AWS ecosystem, Amazon Keyspaces seamlessly integrates with other AWS services, for example AWS CloudFormation, Amazon CloudWatch, and AWS CloudTrail. This integration enables you to build serverless architectures, leverage infrastructure as code, and create real-time data-driven applications. For more information, see [Monitoring Amazon Keyspaces (for Apache Cassandra)](monitoring-overview.md).

**Topics**
+ [Create a migration plan for migrating from Apache Cassandra to Amazon Keyspaces](migrating-cassandra.md)
+ [How to select the right tool for bulk uploading or migrating data to Amazon Keyspaces](migrating-tools.md)

# Create a migration plan for migrating from Apache Cassandra to Amazon Keyspaces
<a name="migrating-cassandra"></a>

For a successful migration from Apache Cassandra to Amazon Keyspaces, we recommend a review of the applicable migration concepts and best practices as well as a comparison of the available options. 

 This topic outlines how the migration process works by introducing several key concepts and the tools and techniques available to you. You can evaluate the different migration strategies to select the one that best meets your requirements.

**Topics**
+ [Functional compatibility](#migrating-cassandra-compatibility)
+ [Estimate Amazon Keyspaces pricing](#migrating-cassandra-sizing)
+ [Choose a migration strategy](#migrating-cassandra-strategy)
+ [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md)
+ [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md)
+ [Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces](migrating-hybrid.md)

## Functional compatibility
<a name="migrating-cassandra-compatibility"></a>

Consider the functional differences between Apache Cassandra and Amazon Keyspaces carefully before the migration. Amazon Keyspaces supports all commonly used Cassandra data-plane operations, such as creating keyspaces and tables, reading data, and writing data.

However, there are some Cassandra APIs that Amazon Keyspaces doesn't support. For more information about supported APIs, see [Supported Cassandra APIs, operations, functions, and data types](cassandra-apis.md). For an overview of all functional differences between Amazon Keyspaces and Apache Cassandra, see [Functional differences: Amazon Keyspaces vs. Apache Cassandra](functional-differences.md). 

To compare the Cassandra APIs and schema that you're using with supported functionality in Amazon Keyspaces, you can run a compatibility script available in the Amazon Keyspaces toolkit on [GitHub](https://github.com/aws-samples/amazon-keyspaces-toolkit/blob/master/bin/toolkit-compat-tool.py). 

**How to use the compatibility script**

1. Download the compatibility Python script from [GitHub](https://github.com/aws-samples/amazon-keyspaces-toolkit/blob/master/bin/toolkit-compat-tool.py) and move it to a location that has access to your existing Apache Cassandra cluster.

1. The compatibility script uses parameters similar to `CQLSH`. For `--host` and `--port`, enter the IP address and the port that you use to connect to one of the Cassandra nodes in your cluster and run queries. 

   If your Cassandra cluster uses authentication, you also need to provide `-u` (user name) and `-p` (password). To run the compatibility script, you can use the following command.

   ```
   python toolkit-compat-tool.py --host <hostname or IP> -u "username" -p "password" --port <native transport port>
   ```

## Estimate Amazon Keyspaces pricing
<a name="migrating-cassandra-sizing"></a>

This section provides an overview of the information you need to gather from your Apache Cassandra tables to calculate the estimated cost for Amazon Keyspaces. Each one of your tables requires different data types, needs to support different CQL queries, and maintains distinct read/write traffic. 

Thinking of your requirements based on tables aligns with Amazon Keyspaces table-level resource isolation and [read/write throughput capacity modes](ReadWriteCapacityMode.md). With Amazon Keyspaces, you can define read/write capacity and [automatic scaling policies](autoscaling.md) for tables independently. 

Understanding table requirements helps you prioritize tables for migration based on functionality, cost, and migration effort.

Collect the following Cassandra table metrics before a migration. This information helps to estimate the cost of your workload on Amazon Keyspaces. 
+ **Table name** – The fully qualified name of the keyspace and table.
+ **Description** – A description of the table, for example how it’s used, or what type of data is stored in it.
+ **Average reads per second** – The average number of coordinator-level reads against the table over a large time interval.
+ **Average writes per second** – The average number of coordinator-level writes against the table over a large time interval.
+ **Average row size in bytes** – The average row size in bytes. 
+ **Storage size in GBs** – The raw storage size for a table.
+ **Read consistency breakdown** – The percentage of reads that use eventual consistency (`LOCAL_ONE` or `ONE`) vs. strong consistency (`LOCAL_QUORUM`).

This table shows an example of the information about your tables that you need to pull together when planning a migration.



| Table name | Description | Average reads per second | Average writes per second | Average row size in bytes | Storage size in GBs | Read consistency breakdown | 
| --- | --- | --- | --- | --- | --- | --- | 
|  mykeyspace.mytable  |  Used to store shopping cart history  |  10,000  |  5,000  | 2,200 | 2,000 | 100% `LOCAL_ONE` | 
| mykeyspace.mytable2 | Used to store latest profile information | 20,000 | 1,000 | 850 | 1,000 | 25% `LOCAL_QUORUM` 75% `LOCAL_ONE` | 
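As a sketch, you could capture these per-table metrics in a simple structure to feed the cost estimates later in this topic. The field names here are illustrative, not part of any Amazon Keyspaces API.

```python
from dataclasses import dataclass

@dataclass
class TableMetrics:
    """Per-table metrics collected from the Cassandra source before migration."""
    table_name: str             # fully qualified keyspace.table
    description: str
    avg_reads_per_second: int   # coordinator-level reads
    avg_writes_per_second: int  # coordinator-level writes
    avg_row_size_bytes: int
    storage_size_gb: float      # raw (uncompressed) size of a single replica
    eventual_read_pct: float    # share of LOCAL_ONE/ONE reads, 0.0 to 1.0

# Example values from the first row of the table above
cart_history = TableMetrics("mykeyspace.mytable",
                            "Used to store shopping cart history",
                            10_000, 5_000, 2_200, 2_000, 1.0)
```

Recording the metrics per table makes it easier to prioritize tables for migration and to compare capacity-mode pricing table by table.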

### How to collect table metrics
<a name="migrating-table-metrics"></a>

This section provides step by step instructions on how to collect the necessary table metrics from your existing Cassandra cluster. These metrics include row size, table size, and read/write requests per second (RPS). They allow you to assess throughput capacity requirements for an Amazon Keyspaces table and estimate pricing.

**How to collect table metrics on the Cassandra source table**

1. Determine row size

   Row size is important for determining the read capacity and write capacity utilization in Amazon Keyspaces. The following diagram shows the typical data distribution over a Cassandra token range.   
![\[A diagram showing the typical data distribution over a Cassandra token range using the murmur3 partitioner.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-data-distribution.png)

   You can use a row size sampler script available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-toolkit/blob/master/bin/row-size-sampler.sh) to collect row size metrics for each table in your Cassandra cluster. 

   The script exports table data from Apache Cassandra by using `cqlsh` and `awk` to calculate the min, max, average, and standard deviation of row size over a configurable sample set of table data. The row size sampler passes the arguments to `cqlsh`, so the same parameters can be used to connect and read from your Cassandra cluster. 

   The following statement is an example of this.

   ```
   ./row-size-sampler.sh 10.22.33.44 9142 \
      -u "username" -p "password" --ssl
   ```

   For more information on how row size is calculated in Amazon Keyspaces, see [Estimate row size in Amazon Keyspaces](calculating-row-size.md).

1. Determine table size

   With Amazon Keyspaces, you don't need to provision storage in advance. Amazon Keyspaces monitors the billable size of your tables continuously to determine your storage charges. Storage is billed per GB-month. Amazon Keyspaces table size is based on the raw size (uncompressed) of a single replica. 

   To monitor the table size in Amazon Keyspaces, you can use the metric `BillableTableSizeInBytes`, which is displayed for each table in the AWS Management Console.

   To estimate the billable size of your Amazon Keyspaces table, you can use either one of these two methods:
   + Use the average row size and multiply by the number of rows.

     You can estimate the size of the Amazon Keyspaces table by multiplying the average row size by the number of rows from your Cassandra source table. Use the row size sample script from the previous section to capture the average row size. To capture the row count, you can use tools like `dsbulk count` to determine the total number of rows in your source table. 
   + Use the `nodetool` to gather table metadata.

     `nodetool` is an administrative tool included in the Apache Cassandra distribution that provides insight into the state of the Cassandra process and returns table metadata. You can use `nodetool` to sample metadata about the table size and extrapolate from it the table size in Amazon Keyspaces. 

     The command to use is `nodetool tablestats`, which returns the table's size and compression ratio. The table's size is stored as `tablelivespace`. Divide it by the `compression ratio`, then multiply the result by the number of nodes. Finally, divide by the replication factor (typically three). 

     This is the complete formula for the calculation that you can use to assess table size.

     ```
     ((tablelivespace / compression ratio) * (total number of nodes))/ (replication factor)
     ```

     Let's assume that your Cassandra cluster has 12 nodes. Running the `nodetool tablestats` command returns a `tablelivespace` of 200 GB and a `compression ratio` of 0.5. The keyspace has a replication factor of three. 

     This is how the calculation looks for this example.

     ```
     (200 GB / 0.5) * (12 nodes) / (replication factor of 3)
     = 4,800 GB / 3
     = 1,600 GB is the table size estimate for Amazon Keyspaces
     ```
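   The calculation above can be sketched as a small Python helper. This is illustrative only; the function name is hypothetical, and the inputs come from `nodetool tablestats` output as described.

   ```python
   def estimate_table_size_gb(tablelivespace_gb, compression_ratio,
                              node_count, replication_factor=3):
       # Uncompress the live space, scale to the whole cluster,
       # then remove the replica copies to get a single-replica size.
       return (tablelivespace_gb / compression_ratio) * node_count / replication_factor

   # Example from above: 200 GB live space, 0.5 compression ratio, 12 nodes, RF 3
   print(estimate_table_size_gb(200, 0.5, 12))  # 1600.0
   ```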

1. Capture the number of reads and writes

   To determine the capacity and scaling requirements for your Amazon Keyspaces tables, capture the read and write request rate of your Cassandra tables before the migration. 

   Amazon Keyspaces is serverless and you only pay for what you use. In general, the price of read/write throughput in Amazon Keyspaces is based on the number and size of the requests. 

   There are two capacity modes in Amazon Keyspaces:
   + [On-demand](ReadWriteCapacityMode.OnDemand.md) – This is a flexible billing option capable of serving thousands of requests per second without the need for capacity planning. It offers pay-per-request pricing for read and write requests so that you pay only for what you use.
   + [Provisioned](ReadWriteCapacityMode.Provisioned.md) – If you choose provisioned throughput capacity mode, you specify the number of reads and writes per second that are required for your application. This helps you manage your Amazon Keyspaces usage to stay at or below a defined request rate to maintain predictability. 

     Provisioned mode offers [auto scaling](autoscaling.md) to automatically adjust your provisioned rate to scale up or scale down to improve operational efficiency. For more information about serverless resource management, see [Managing serverless resources in Amazon Keyspaces (for Apache Cassandra)](serverless_resource_management.md).

   Because you provision read and write throughput capacity in Amazon Keyspaces separately, you need to measure the request rate for reads and writes in your existing tables independently. 

   To gather the most accurate utilization metrics from your existing Cassandra cluster, capture the average requests per second (RPS) for coordinator-level read and write operations for a table, aggregated over all nodes in a single data center, over an extended period of time. 

   Capturing the average RPS over a period of at least several weeks captures peaks and valleys in your traffic patterns, as shown in the following diagram.  
![\[A diagram showing the average rate of requests per second per day over a period of two weeks.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-rps.png)

   You have two options to determine the read and write request rate of your Cassandra table.
   + Use existing Cassandra monitoring

     You can use the metrics shown in the following table to observe read and write requests. Note that the metric names can change based on the monitoring tool that you're using.  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/migrating-cassandra.html)
   + Use the `nodetool`

      Use `nodetool tablestats` and `nodetool info` to capture average read and write operations from the table. `tablestats` returns the total read and write counts since the node was started. `nodetool info` provides the uptime for a node in seconds.

      To get the per-second average of reads and writes, divide the read and write counts by the node uptime in seconds. Then, for reads, divide by the read consistency quorum, and for writes, divide by the replication factor. These calculations are expressed in the following formulas. 

     Formula for average reads per second:

     ```
     ((number of reads * number of nodes in cluster) / read consistency quorum (2)) / uptime
     ```

     Formula for average writes per second:

     ```
     ((number of writes * number of nodes in cluster) / replication factor of 3) / uptime
     ```

      Let's assume we have a 12-node cluster that has been up for 4 weeks. `nodetool info` returns 2,419,200 seconds of uptime, and `nodetool tablestats` returns 1 billion writes and 2 billion reads. This example results in the following calculation.

     ```
      ((2 billion reads * 12 nodes in cluster) / read consistency quorum (2)) / 2,419,200 seconds
      = 12 billion reads / 2,419,200 seconds
      = 4,960 read requests per second

      ((1 billion writes * 12 nodes in cluster) / replication factor of 3) / 2,419,200 seconds
      = 4 billion writes / 2,419,200 seconds
      = 1,653 write requests per second
     ```
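   As a sketch, the same formulas can be written as a small Python helper, using the example numbers above. The function and variable names here are illustrative.

   ```python
   uptime_seconds = 2_419_200  # 4 weeks, from nodetool info

   def per_second(total_ops, node_count, divisor, uptime):
       # divisor is the read consistency quorum for reads,
       # or the replication factor for writes
       return (total_ops * node_count / divisor) / uptime

   reads = per_second(2_000_000_000, 12, 2, uptime_seconds)
   writes = per_second(1_000_000_000, 12, 3, uptime_seconds)
   print(int(reads), int(writes))  # 4960 1653
   ```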

1. Determine the capacity utilization of the table

   To estimate the average capacity utilization, start with the average request rates and the average row size of your Cassandra source table.

   Amazon Keyspaces uses *read capacity units* (RCUs) and *write capacity units* (WCUs) to measure provisioned throughput capacity for reads and writes for tables. For this estimate we use these units to calculate the read and write capacity needs of the new Amazon Keyspaces table after migration. 

    Later in this topic we'll discuss how the choice between provisioned and on-demand capacity mode affects billing. But for the estimate of capacity utilization in this example, we assume that the table is in provisioned mode.
   + **Reads** – One RCU represents one `LOCAL_QUORUM` read request, or two `LOCAL_ONE` read requests, for a row up to 4 KB in size. If you need to read a row that is larger than 4 KB, the read operation uses additional RCUs. The total number of RCUs required depends on the row size, and whether you want to use `LOCAL_QUORUM` or `LOCAL_ONE` read consistency. 

     For example, reading an 8 KB row requires 2 RCUs using `LOCAL_QUORUM` read consistency, and 1 RCU if you choose `LOCAL_ONE` read consistency. 
   + **Writes** – One WCU represents one write for a row up to 1 KB in size. All writes are using `LOCAL_QUORUM` consistency, and there is no additional charge for using lightweight transactions (LWTs). 

     The total number of WCUs required depends on the row size. If you need to write a row that is larger than 1 KB, the write operation uses additional WCUs. For example, if your row size is 2 KB, you require 2 WCUs to perform one write request. 

   The following formula can be used to estimate the required RCUs and WCUs. 
   + **Read capacity in RCUs** – Multiply the read requests per second by the number of rows returned per read, then by the average row size divided by 4 KB, rounded up to the nearest whole number.
   + **Write capacity in WCUs** – Multiply the write requests per second by the average row size divided by 1 KB, rounded up to the nearest whole number. 

   This is expressed in the following formulas.

   ```
   Read requests per second * ROUNDUP(Average Row Size / 4096 bytes per unit) = RCUs per second

   Write requests per second * ROUNDUP(Average Row Size / 1024 bytes per unit) = WCUs per second
   ```

   For example, if you're performing 4,960 read requests per second with a row size of 2.5 KB on your Cassandra table, you need 4,960 RCUs in Amazon Keyspaces. If you're currently performing 1,653 write requests per second with a row size of 2.5 KB on your Cassandra table, you need 4,959 WCUs per second in Amazon Keyspaces. 

   This example is expressed in the following formulas.

   ```
   4,960 read requests per second * ROUNDUP(2.5 KB / 4 KB per unit)
   = 4,960 read requests per second * 1 RCU
   = 4,960 RCUs

   1,653 write requests per second * ROUNDUP(2.5 KB / 1 KB per unit)
   = 1,653 write requests per second * 3 WCUs
   = 4,959 WCUs
   ```

   Using eventual consistency allows you to save up to half of the throughput capacity on each read request. Each eventually consistent read can consume up to 8 KB. You can calculate the capacity for eventually consistent reads by multiplying the previous calculation by 0.5, as shown in the following formula. 

   ```
   4,960 read requests per second * ROUNDUP(2.5 KB / 4 KB per unit) * 0.5
   = 2,480 read requests per second * 1 RCU
   = 2,480 RCUs
   ```
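   The RCU and WCU estimates in this step can be sketched in Python, with `math.ceil` standing in for the ROUNDUP step. This is a simplified model and the function names are illustrative.

   ```python
   import math

   def rcus(read_rps, row_size_bytes, eventually_consistent=False):
       # One RCU reads up to 4 KB with LOCAL_QUORUM consistency;
       # eventual consistency halves the consumed capacity.
       units = read_rps * math.ceil(row_size_bytes / 4096)
       return units / 2 if eventually_consistent else units

   def wcus(write_rps, row_size_bytes):
       # One WCU writes up to 1 KB (writes always use LOCAL_QUORUM).
       return write_rps * math.ceil(row_size_bytes / 1024)

   # 2.5 KB rows, request rates from the previous step
   print(rcus(4_960, 2_560))        # 4960
   print(wcus(1_653, 2_560))        # 4959
   print(rcus(4_960, 2_560, True))  # 2480.0
   ```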

1. Calculate the monthly pricing estimate for Amazon Keyspaces

   To estimate the monthly billing for the table based on read/write capacity throughput, you can calculate the pricing for on-demand and for provisioned mode using different formulas and compare the options for your table. 

   **Provisioned mode** – Read and write capacity consumption is billed at an hourly rate based on the capacity units per second. First, divide that rate by 0.7 to represent the default auto scaling target utilization of 70%. Then multiply by 30 calendar days, 24 hours per day, and the regional rate pricing. 

   This calculation is summarized in the following formulas.

   ```
   (read capacity per second / 0.7) * 24 hours * 30 days * regional rate
   (write capacity per second / 0.7) * 24 hours * 30 days * regional rate
   ```

   **On-demand mode** – Read and write capacity are billed at a per-request rate. First, multiply the request rate per second by the number of seconds in 30 calendar days. Then divide by one million request units. Finally, multiply by the regional rate. 

   This calculation is summarized in the following formulas. 

   ```
   ((read capacity per second * 30 * 24 * 60 * 60) / 1 million read request units) * regional rate
   ((write capacity per second * 30 * 24 * 60 * 60) / 1 million write request units) * regional rate
   ```
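   As a sketch, both pricing estimates can be compared with a few lines of Python. The rate values below are hypothetical placeholders, not actual Amazon Keyspaces prices; substitute the current rates for your AWS Region.

   ```python
   HOURS_PER_MONTH = 24 * 30
   SECONDS_PER_MONTH = 30 * 24 * 60 * 60

   def provisioned_monthly(capacity_units_per_second, hourly_rate,
                           target_utilization=0.7):
       # Hourly billing, adjusted for the default auto scaling target of 70%
       return (capacity_units_per_second / target_utilization) * HOURS_PER_MONTH * hourly_rate

   def on_demand_monthly(requests_per_second, rate_per_million_requests):
       # Per-request billing, priced per one million request units
       return requests_per_second * SECONDS_PER_MONTH / 1_000_000 * rate_per_million_requests

   # Hypothetical regional rates, for illustration only
   print(round(provisioned_monthly(4_960, 0.00065), 2))
   print(round(on_demand_monthly(4_960, 1.45), 2))
   ```

   Running both formulas against each table's read and write rates lets you pick the cheaper capacity mode per table.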

## Choose a migration strategy
<a name="migrating-cassandra-strategy"></a>

You can choose between the following migration strategies when migrating from Apache Cassandra to Amazon Keyspaces:
+ **Online** – This is a live migration using dual writes to start writing new data to Amazon Keyspaces and the Cassandra cluster simultaneously. This migration type is recommended for applications that require zero downtime during migration and read-after-write consistency.

  For more information about how to plan and implement an online migration strategy, see [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md).
+ **Offline** – This migration technique involves copying a data set from Cassandra to Amazon Keyspaces during a downtime window. Offline migration can simplify the migration process, because it doesn't require changes to your application or conflict resolution between historical data and new writes.

  For more information about how to plan an offline migration, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md).
+ **Hybrid** – This migration technique allows changes to be replicated to Amazon Keyspaces in near real time, but without read-after-write consistency. 

  For more information about how to plan a hybrid migration, see [Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces](migrating-hybrid.md).

After reviewing the migration techniques and best practices discussed in this topic, you can place the available options in a decision tree to design a migration strategy based on your requirements and available resources.

# Online migration to Amazon Keyspaces: strategies and best practices
<a name="migrating-online"></a>

If you need to maintain application availability during a migration from Apache Cassandra to Amazon Keyspaces, you can prepare a custom online migration strategy by implementing the key components discussed in this topic. By following these best practices for online migrations, you can ensure that application availability and read-after-write consistency are maintained during the entire migration process, minimizing the impact on your users.

When designing an online migration strategy from Apache Cassandra to Amazon Keyspaces, you need to consider the following key steps.

1. **Writing new data**
   + **ZDM Dual Write Proxy for Amazon Keyspaces Migration** – Use the ZDM Dual Write Proxy available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/blob/main/migration/online/zdm-proxy/README.md) to perform zero-downtime migration from Apache Cassandra to Amazon Keyspaces. The ZDM Proxy performs dual writes without the need to refactor existing applications and performs dual reads for query validation.
   + Application dual-writes: You can implement dual writes in your application using existing Cassandra client libraries and drivers. Designate one database as the leader and the other as the follower. Write failures to the follower database are recorded in a [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) for analysis.
   + Messaging tier dual-writes: Alternatively, you can configure your existing messaging platform to send writes to both Cassandra and Amazon Keyspaces using an additional consumer. This creates eventually consistent views across both databases.

1. **Migrating historical data**
   + Copy historical data: You can migrate historical data from Cassandra to Amazon Keyspaces using AWS Glue or custom extract, transform, and load (ETL) scripts. Handle conflict resolution between dual writes and bulk loads using techniques like lightweight transactions or timestamps.
   + Use Time-To-Live (TTL): For shorter data retention periods, you can use TTL in both Cassandra and Amazon Keyspaces to avoid uploading unnecessary historical data. As old data expires in Cassandra and new data is written via dual-writes, Amazon Keyspaces eventually catches up.

1. **Validating data**
   + Dual reads: Implement dual reads from both Cassandra (primary) and Amazon Keyspaces (secondary) databases, comparing results asynchronously. Differences are logged or sent to a DLQ.
   + Sample reads: Use AWS Lambda functions to periodically sample and compare data across both systems, logging any discrepancies to a DLQ.

1. **Migrating the application**
   + Blue-green strategy: Switch your application to treat Amazon Keyspaces as the primary and Cassandra as the secondary data store in a single step. Monitor performance and roll back if issues arise.
   + Canary deployment: Gradually roll out the migration to a subset of users first, incrementally increasing traffic to Amazon Keyspaces as the primary until fully migrated.

1. **Decommissioning Cassandra**

   Once your application is fully migrated to Amazon Keyspaces and data consistency is validated, you can plan to decommission your Cassandra cluster based on data retention policies.

By planning an online migration strategy with these components, you can transition smoothly to the fully managed Amazon Keyspaces service with minimal downtime or disruption. The following sections go into each component in more detail.
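As an illustration of the dual-read validation in step 3, the following Python sketch compares a row read from the primary (Cassandra) with the same row read from the secondary (Amazon Keyspaces) and collects discrepancy records. In a real deployment the rows would come from your database drivers and the records would be sent to a DLQ rather than only logged; all names here are illustrative.

```python
import logging

logger = logging.getLogger("migration.validation")

def compare_rows(primary_row, secondary_row, row_key):
    """Return a list of discrepancy records for one sampled row."""
    discrepancies = []
    if secondary_row is None:
        discrepancies.append({"key": row_key, "issue": "row missing in secondary"})
    else:
        for column, expected in primary_row.items():
            actual = secondary_row.get(column)
            if actual != expected:
                discrepancies.append({"key": row_key, "column": column,
                                      "primary": expected, "secondary": actual})
    for record in discrepancies:
        # In production, send the record to a dead letter queue for analysis
        logger.warning("validation discrepancy: %s", record)
    return discrepancies
```

Because the comparison runs asynchronously outside the application's critical path, validation adds no latency to production reads.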

**Topics**
+ [Writing new data during an online migration](migration-online-dw.md)
+ [Uploading historical data during an online migration](migration-online-historical.md)
+ [Validating data consistency during an online migration](migration-online-validation.md)
+ [Migrating the application during an online migration](migration-online-app-migration.md)
+ [Decommissioning Cassandra after an online migration](migration-online-decommission.md)

# Writing new data during an online migration
<a name="migration-online-dw"></a>

The first step in an online migration plan is to ensure that any new data written by the application is stored in both databases, your existing Cassandra cluster and Amazon Keyspaces. The goal is to provide a consistent view across the two data stores. You can do this by applying all new writes to both databases. To implement dual writes, consider one of the following three options.
+ **ZDM Dual Write Proxy for Amazon Keyspaces Migration** – Using the ZDM Proxy for Amazon Keyspaces available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/blob/main/migration/online/zdm-proxy/README.md), you can migrate your Apache Cassandra workloads to Amazon Keyspaces without application downtime. This enhanced solution implements AWS best practices and extends the official ZDM Proxy capabilities.
  + Perform online migrations between Apache Cassandra and Amazon Keyspaces.
  + Write data to both source and target tables simultaneously without refactoring applications.
  + Validate queries through dual-read operations.

  The solution offers the following enhancements to work with AWS and Amazon Keyspaces.
  + **Container deployment** – Use a pre-configured Docker image from Amazon Elastic Container Registry (Amazon ECR) for VPC-accessible deployments.
  + **Infrastructure as code** – Deploy using AWS CloudFormation templates for automated setup on AWS Fargate.
  + **Amazon Keyspaces compatibility** – Access system tables with custom adaptations for Amazon Keyspaces.

  The solution runs on Amazon ECS with Fargate, providing serverless scalability based on your workload demands. A network load balancer distributes incoming application traffic across multiple Amazon ECS tasks for high availability.  
![\[Implementing the ZDM dual write proxy for migrating data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-zdm.png)
+ **Application dual writes** – You can implement dual writes with minimal changes to your application code by leveraging the existing Cassandra client libraries and drivers. You can either implement dual writes in your existing application, or create a new layer in the architecture to handle dual writes. For more information and a customer case study that shows how dual writes were implemented in an existing application, see [Cassandra migration case study](https://aws.amazon.com/solutions/case-studies/intuit-apache-migration-case-study/).

  When implementing dual writes, you can designate one database as the leader and the other database as the follower. This allows you to keep writing to your original source (leader) database without letting write failures to the follower (destination) database disrupt the critical path of your application.

  Instead of retrying failed writes to the follower, you can use Amazon Simple Queue Service to record failed writes in a [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html). The DLQ lets you analyze the failed writes to the follower and determine why processing did not succeed in the destination database.

  For a more sophisticated dual write implementation, you can follow AWS best practices for designing a sequence of local transactions using the [saga pattern](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/saga.html). A saga pattern ensures that if a transaction fails, the saga runs compensating transactions to revert the database changes made by the previous transactions. 

  When using dual writes for an online migration, you can configure them following the saga pattern so that each write is a local transaction, ensuring atomic operations across heterogeneous databases. For more information about designing distributed applications using recommended design patterns for the AWS Cloud, see [Cloud design patterns, architectures, and implementations](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/introduction).  
![\[Implementing dual writes at the application layer when migrating from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-dual-writes.png)
+ **Messaging tier dual writes** – Instead of implementing dual writes at the application layer, you can use your existing messaging tier to perform dual writes to Cassandra and Amazon Keyspaces. 

  To do this, you can configure an additional consumer on your messaging platform to send writes to both data stores. This approach provides a simple, low-code strategy that uses the messaging tier to create two eventually consistent views across both databases.
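
The leader/follower dual-write flow described above can be sketched as follows. This is a minimal illustration with in-memory stand-ins; in a real implementation, `write_leader`, `write_follower`, and `send_to_dlq` (all hypothetical names) would wrap your Cassandra driver session, your Amazon Keyspaces driver session, and an Amazon SQS client.

```python
def dual_write(record, write_leader, write_follower, send_to_dlq):
    """Write to the leader first; a follower failure must not disrupt the critical path."""
    write_leader(record)  # the leader (source) write stays on the critical path
    try:
        write_follower(record)  # the follower (destination) write is best effort
    except Exception as exc:
        # Instead of retrying, record the failed follower write for later analysis.
        send_to_dlq({"record": record, "error": str(exc)})

# In-memory stand-ins for the two databases and the DLQ.
leader, follower, dlq = [], [], []

def flaky_follower(record):
    raise RuntimeError("destination unavailable")

dual_write({"id": 1}, leader.append, follower.append, dlq.append)
dual_write({"id": 2}, leader.append, flaky_follower, dlq.append)
# Both writes reached the leader; the failed follower write landed in the DLQ.
```

The key design choice is that only the leader write can fail the request; everything destined for the follower is fire-and-forget with DLQ capture for later reconciliation.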

# Uploading historical data during an online migration
<a name="migration-online-historical"></a>

After implementing dual writes to ensure that new data is written to both data stores in real time, the next step in the migration plan is to evaluate how much historical data you need to copy or bulk upload from Cassandra to Amazon Keyspaces. This ensures that both new data and historical data are available in Amazon Keyspaces before you migrate the application.

Depending on your data retention requirements, for example how much historical data you need to preserve based on your organization's policies, you can consider one of the following two options.
+ **Bulk upload of historical data** – The migration of historical data from your existing Cassandra deployment to Amazon Keyspaces can be achieved through various techniques, for example using AWS Glue or custom scripts to extract, transform, and load (ETL) the data. For more information about using AWS Glue to upload historical data, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md). 

  When planning the bulk upload of historical data, you need to consider how to resolve conflicts that can occur when new writes update the same data that is in the process of being uploaded. The bulk upload is eventually consistent, which means the data reaches all nodes eventually.

  If an update of the same data occurs at the same time due to a new write, you want to ensure that it isn't overwritten by the historical data upload. To preserve the latest updates to your data even during the bulk import, you must add conflict resolution either to the bulk upload scripts or to the application logic for dual writes.

  For example, you can use [Lightweight transactions](functional-differences.md#functional-differences.light-transactions) (LWT) to perform compare-and-set operations. To do this, you can add an additional field to your data model that represents the time of modification or state.

  Additionally, Amazon Keyspaces supports the Cassandra `WRITETIME` timestamp function. You can use Amazon Keyspaces client-side timestamps to preserve source database timestamps and implement last-writer-wins conflict resolution. For more information, see [Client-side timestamps in Amazon Keyspaces](client-side-timestamps.md).
+ **Using Time-to-Live (TTL)** – For data retention periods shorter than 30, 60, or 90 days, you can use TTL in Cassandra and Amazon Keyspaces during migration to avoid uploading unnecessary historical data to Amazon Keyspaces. TTL allows you to set a time period after which the data is automatically removed from the database. 

  During the migration phase, instead of copying historical data to Amazon Keyspaces, you can configure the TTL settings to let the historical data expire automatically in the old system (Cassandra) while only applying the new writes to Amazon Keyspaces using the dual-write method. Over time, as old data continually expires in the Cassandra cluster and new data is written using the dual-write method, the data in Amazon Keyspaces automatically catches up to match Cassandra.

   This approach can significantly reduce the amount of data to be migrated, resulting in a more efficient and streamlined migration process. You can consider this approach when dealing with large datasets with varying data retention requirements. For more information about TTL, see [Expire data with Time to Live (TTL) for Amazon Keyspaces (for Apache Cassandra)](TTL.md).

  Consider the following example of a migration from Cassandra to Amazon Keyspaces using TTL data expiration. In this example, we set the TTL for both databases to 60 days and show how the migration process progresses over a period of 90 days. Both databases receive the same newly written data during this period using the dual writes method. We look at three different phases of the migration, each 30 days long.

  How the migration process works for each phase is shown in the following images.   
![\[Using TTL to expire historical data when migrating from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-TTL.png)

  1. After the first 30 days, the Cassandra cluster and Amazon Keyspaces have been receiving new writes. The Cassandra cluster also contains historical data that has not yet reached 60 days of retention, which makes up 50% of the data in the cluster. 

     Data that is older than 60 days is being automatically deleted in the Cassandra cluster using TTL. At this point Amazon Keyspaces contains 50% of the data stored in the Cassandra cluster, which is made up of the new writes minus the historical data.

  1. After 60 days, both the Cassandra cluster and Amazon Keyspaces contain the same data written in the last 60 days.

  1. Within 90 days, both Cassandra and Amazon Keyspaces contain the same data and are expiring data at the same rate. 

  This example illustrates how to avoid the step of uploading historical data by using TTL with an expiration date set to 60 days.
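
The 60-day TTL timeline above can be sanity-checked with a small model. This sketch assumes a constant write rate and exact TTL expiration, so the share of Cassandra's data that Amazon Keyspaces holds is simply the ratio of the two live retention windows:

```python
TTL_DAYS = 60

def live_days(store_start_day, now):
    """Days of live data a store holds on day `now`, assuming writes began on
    `store_start_day` and every row expires TTL_DAYS after it was written."""
    oldest_live = max(store_start_day, now - TTL_DAYS)
    return now - oldest_live

# Cassandra received writes long before the migration began on day 0;
# Amazon Keyspaces starts receiving dual writes on day 0.
ratios = {}
for day in (30, 60, 90):
    cassandra = live_days(store_start_day=-10**6, now=day)
    keyspaces = live_days(store_start_day=0, now=day)
    ratios[day] = keyspaces / cassandra
# ratios: 0.5 after 30 days, 1.0 after 60 and 90 days
```

This matches the phases above: 50% overlap after 30 days, and identical data from day 60 onward.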

# Validating data consistency during an online migration
<a name="migration-online-validation"></a>

The next step in the online migration process is data validation. Dual writes add new data to your Amazon Keyspaces database, and you have completed the migration of historical data either through bulk upload or data expiration with TTL.

Now you can use the validation phase to confirm that both data stores do in fact contain the same data and return the same read results. You can choose one of the following two options to validate that both of your databases contain identical data.
+ **Dual reads** – To validate that both the source and the destination database contain the same set of newly written and historical data, you can implement dual reads. To do so, you read from both your primary Cassandra database and your secondary Amazon Keyspaces database, similar to the dual writes method, and compare the results asynchronously.

  The results from the primary database are returned to the client, and the results from the secondary database are used to validate against the primary resultset. Differences found can be logged or sent to a [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) for later reconciliation. 

  In the following diagram, the application performs a synchronous read from Cassandra, which is the primary data store, and an asynchronous read from Amazon Keyspaces, which is the secondary data store.  
![\[Using dual reads to validate data consistency during an online migration from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-dual-reads.png)
+ **Sample reads** – An alternative solution that doesn’t require application code changes is to use AWS Lambda functions to periodically and randomly sample data from both the source Cassandra cluster and the destination Amazon Keyspaces database. 

  These Lambda functions can be configured to run at regular intervals. The Lambda function retrieves a random subset of data from both the source and destination systems, and then performs a comparison of the sampled data. Any discrepancies or mismatches between the two datasets can be recorded and sent to a dedicated [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) for later reconciliation.

  This process is illustrated in the following diagram.  
![\[Using sample reads to validate data consistency during an online migration from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-sample-reads.png)
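
The comparison that such a sampling function performs can be sketched as follows. The `read_source` and `read_target` callables are hypothetical stand-ins for reads against the Cassandra cluster and the Amazon Keyspaces table, and the DLQ is modeled as a list:

```python
import random

def sample_and_compare(keys, read_source, read_target, sample_size, send_to_dlq):
    """Randomly sample keys, read each row from both data stores, and record
    any mismatch for later reconciliation."""
    for key in random.sample(keys, min(sample_size, len(keys))):
        source_row = read_source(key)
        target_row = read_target(key)
        if source_row != target_row:
            send_to_dlq({"key": key, "source": source_row, "target": target_row})

# In-memory stand-ins: the target is missing the value for key "c".
source = {"a": 1, "b": 2, "c": 3}
target = {"a": 1, "b": 2, "c": None}
dlq = []
sample_and_compare(list(source), source.get, target.get,
                   sample_size=3, send_to_dlq=dlq.append)
# dlq now holds the single discrepancy for key "c"
```

Run at regular intervals, this gives you a statistical signal of drift between the two data stores without changing application code.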

# Migrating the application during an online migration
<a name="migration-online-app-migration"></a>

In the fourth phase of an online migration, you are migrating your application and transitioning to Amazon Keyspaces as the primary data store. This means that you switch your application to read and write directly from and to Amazon Keyspaces. To ensure minimal disruption to your users, this should be a well-planned and coordinated process. 

Two recommended solutions for application migration are available: the blue green cutover strategy and the canary cutover strategy. The following sections outline these strategies in more detail.
+ **Blue green strategy** – Using this approach, you switch your application to treat Amazon Keyspaces as the primary data store and Cassandra as the secondary data store in a single step. You can do this using an AWS AppConfig feature flag to control the election of primary and secondary data stores across the application instance. For more information about feature flags, see [Creating a feature flag configuration profile in AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/appconfig-creating-configuration-and-profile-feature-flags.html).

  After making Amazon Keyspaces the primary data store, you monitor the application's behavior and performance, ensuring that Amazon Keyspaces meets your requirements and that the migration is successful.

  For example, if you implemented dual reads for your application, during the application migration phase you transition the primary reads from Cassandra to Amazon Keyspaces and the secondary reads from Amazon Keyspaces to Cassandra. After the transition, you continue to monitor and compare results as described in the [data validation](migration-online-validation.md) section to ensure consistency across both databases before decommissioning Cassandra.

  If you detect any issues, you can quickly roll back to the previous state by reverting to Cassandra as the primary data store. You only proceed to the decommissioning phase of the migration if Amazon Keyspaces is meeting all your needs as the primary data store.  
![\[Using the blue green strategy for migrating an application from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-switch.png)
+ **Canary strategy** – In this approach, you gradually roll out the migration to a subset of your users or traffic. Initially, a small percentage of your application's traffic, for example 5% of all traffic is routed to the version using Amazon Keyspaces as the primary data store, while the rest of the traffic continues to use Cassandra as the primary data store. 

  This allows you to thoroughly test the migrated version with real-world traffic, monitor its performance and stability, and investigate potential issues. If you don't detect any issues, you can incrementally increase the percentage of traffic routed to Amazon Keyspaces until it becomes the primary data store for all users and traffic.

  This staged rollout minimizes the risk of widespread service disruptions and allows for a more controlled migration process. If any critical issues arise during the canary deployment, you can quickly roll back to the previous version using Cassandra as the primary data store for the affected traffic segment. You only proceed to the decommissioning phase of the migration after you have validated that Amazon Keyspaces processes 100% of your users and traffic as expected.

  The following diagram illustrates the individual steps of the canary strategy.  
![\[Using the canary strategy for migrating an application from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-canary.png)
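
One common way to implement the canary percentage is deterministic, hash-based bucketing of users, so each user consistently lands on the same data store while the rollout percentage grows. The following is a minimal sketch; the hashing scheme is illustrative, not a prescribed implementation:

```python
import hashlib

def routes_to_keyspaces(user_id: str, canary_percent: int) -> bool:
    """Deterministically bucket a user into one of 100 buckets; buckets below
    canary_percent are served by the Amazon Keyspaces version."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < canary_percent

# Roughly 5% of users land on the new primary data store, and each user
# always lands on the same side as the percentage is increased.
users = [f"user-{i}" for i in range(1000)]
share = sum(routes_to_keyspaces(u, 5) for u in users) / len(users)
```

Because bucketing is a pure function of the user ID, raising `canary_percent` from 5 to 25 to 100 only ever moves users from Cassandra to Amazon Keyspaces, never back and forth.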

# Decommissioning Cassandra after an online migration
<a name="migration-online-decommission"></a>

After the application migration is complete, your application is fully running on Amazon Keyspaces, and you have validated data consistency over a period of time, you can plan to decommission your Cassandra cluster. During this phase, you can evaluate whether the data remaining in your Cassandra cluster needs to be archived or can be deleted. This depends on your organization's policies for data handling and retention.

By following this strategy and considering the recommended best practices described in this topic when planning your online migration from Cassandra to Amazon Keyspaces, you can ensure a seamless transition to Amazon Keyspaces while maintaining read-after-write consistency and availability of your application.

Migrating from Apache Cassandra to Amazon Keyspaces can provide numerous benefits, including reduced operational overhead, automatic scaling, improved security, and a framework that helps you reach your compliance goals. By planning an online migration strategy with dual writes, historical data upload, data validation, and a gradual rollout, you can ensure a smooth transition with minimal disruption to your application and its users.

Implementing the online migration strategy discussed in this topic allows you to validate the migration results, identify and address any issues, and ultimately decommission your existing Cassandra deployment in favor of the fully managed Amazon Keyspaces service. 

# Offline migration process: Apache Cassandra to Amazon Keyspaces
<a name="migrating-offline"></a>

Offline migrations are suitable when you can afford downtime to perform the migration. It's common among enterprises to have maintenance windows for patching, large releases, or downtime for hardware upgrades or major upgrades. Offline migration can use this window to copy data and switch over the application traffic from Apache Cassandra to Amazon Keyspaces.

Offline migration reduces modifications to the application because it doesn't require communication to both Cassandra and Amazon Keyspaces simultaneously. Additionally, with the data flow paused, the exact state can be copied without maintaining mutations.

In this example, we use Amazon Simple Storage Service (Amazon S3) as a staging area for data during the offline migration to minimize downtime. You can automatically import the data you stored in Parquet format in Amazon S3 into an Amazon Keyspaces table using the Spark Cassandra connector and AWS Glue. The following section shows a high-level overview of the process. You can find code examples for this process on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/tree/main/scala/datastax-v4/aws-glue).

The offline migration process from Apache Cassandra to Amazon Keyspaces using Amazon S3 and AWS Glue requires the following AWS Glue jobs.

1. An ETL job that extracts and transforms CQL data and stores it in an Amazon S3 bucket.

1. A second job that imports the data from the bucket to Amazon Keyspaces.

1. A third job to import incremental data.

**How to perform an offline migration to Amazon Keyspaces from Cassandra running on Amazon EC2 in an Amazon Virtual Private Cloud**

1. First, you use AWS Glue to export table data from Cassandra in Parquet format and save it to an Amazon S3 bucket. You need to run an AWS Glue job using an AWS Glue connector to a VPC where the Amazon EC2 instance running Cassandra resides. Then, using the Amazon S3 private endpoint, you can save data to the Amazon S3 bucket.

   The following diagram illustrates these steps.  
![\[Migrating Apache Cassandra data from Amazon EC2 running in a VPC to an Amazon S3 bucket using AWS Glue.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-export.png)

1. Shuffle the data in the Amazon S3 bucket to improve data randomization. Evenly imported data allows for more distributed traffic in the target table. 

   This step is required when exporting data from Cassandra with large partitions (partitions with more than 1000 rows) to avoid hot key patterns when inserting the data into Amazon Keyspaces. Hot key issues cause `WriteThrottleEvents` in Amazon Keyspaces and result in increased load time.   
![\[An AWS Glue job shuffles data from an Amazon S3 bucket and writes it to another Amazon S3 bucket.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-shuffle.png)

1. Use another AWS Glue job to import data from the Amazon S3 bucket into Amazon Keyspaces. The shuffled data in the Amazon S3 bucket is stored in Parquet format.  
![\[The AWS Glue import job takes shuffled data from the Amazon S3 bucket and moves it into an Amazon Keyspaces table.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-import.png)
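
Conceptually, the shuffle step is a random reordering of the exported rows so that consecutive writes during the import touch many partitions instead of hammering one at a time. The AWS Glue job does this with Spark; the following stand-alone sketch only illustrates the effect:

```python
import random

# Rows exported from Cassandra typically arrive grouped by partition key, so
# importing them in order would write to one partition at a time (a hot key).
rows = [{"pk": pk, "ck": ck} for pk in ("A", "B", "C") for ck in range(1000)]

random.seed(42)           # seeded only to make the example deterministic
shuffled = rows[:]
random.shuffle(shuffled)  # spreads each partition's rows across the whole import

# Before shuffling, the first batch of writes targets a single partition;
# afterwards it touches many partitions.
before = {row["pk"] for row in rows[:100]}
after = {row["pk"] for row in shuffled[:100]}
```

The same total volume is written either way; shuffling only changes the order so that write throughput spreads evenly across partitions in the target table.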

For more information about the offline migration process, see the workshop [Amazon Keyspaces with AWS Glue](https://catalog.workshops.aws/unlocking-amazonkeyspaces/en-US/keyspaces-with-glue).

# Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces
<a name="migrating-hybrid"></a>

The following migration solution can be considered a hybrid between online and offline migration. With this hybrid approach, data is written to the destination database in near real time without read-after-write consistency. This means that newly written data isn't immediately available and delays are to be expected. If you need read-after-write consistency, see [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md).

For a near real time migration from Apache Cassandra to Amazon Keyspaces, you can choose between two available methods.
+ **CQLReplicator** – (Recommended) CQLReplicator is an open-source utility available on [GitHub](https://github.com/aws-samples/cql-replicator) that helps you migrate data from Apache Cassandra to Amazon Keyspaces in near real time.

  To determine the writes and updates to propagate to the destination database, CQLReplicator scans the Apache Cassandra token range and uses an AWS Glue job to remove duplicate events and apply writes and updates directly to Amazon Keyspaces.
+ **Change data capture (CDC)** – Apache Cassandra's built-in CDC feature captures changes by copying the commit log to a separate CDC directory. If you're familiar with Cassandra CDC, it is another option for implementing a hybrid migration.

  You can implement this by replicating the captured data changes to Amazon Keyspaces, making CDC an alternative option for data migration scenarios.

If you don't need read-after-write consistency, you can use either CQLReplicator or a CDC pipeline to migrate data from Apache Cassandra to Amazon Keyspaces, based on your preferences and familiarity with the tools and AWS services used in each solution. Using these methods to migrate data in near real time can be considered a hybrid approach to migration that offers an alternative to online migration.

This strategy is considered a hybrid approach because, in addition to the options outlined in this topic, you have to implement some steps of the online migration process, for example the historical data copy and the application migration strategies discussed in the [online migration](migrating-online.md) topic.

The following sections go over the hybrid migration options in more detail.

**Topics**
+ [Migrate data using CQLReplicator](migration-hybrid-cql-rep.md)
+ [Migrate data using change data capture (CDC)](migration-hybrid-cdc.md)

# Migrate data using CQLReplicator
<a name="migration-hybrid-cql-rep"></a>

With [CQLReplicator](https://github.com/aws-samples/cql-replicator), you can read data from Apache Cassandra in near real time by intelligently scanning the Cassandra token ring using CQL queries. CQLReplicator doesn't use Cassandra CDC and instead implements a caching strategy to reduce the performance penalties of full scans.

To reduce the number of writes to the destination, CQLReplicator automatically removes duplicate replication events. With CQLReplicator, you can tune the replication of changes from the source database to the destination database, allowing for a near real time migration of data from Apache Cassandra to Amazon Keyspaces. 

The following diagram shows the typical architecture of a CQLReplicator job using AWS Glue. 

1. To allow access to Apache Cassandra running in a private VPC, configure an AWS Glue connection with the connection type **Network**.

1. To remove duplicates and enable key caching with the CQLReplicator job, configure Amazon Simple Storage Service (Amazon S3).

1. The CQLReplicator job streams verified source database changes directly to Amazon Keyspaces.

![\[Using CQLReplicator to migrate data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/hybrid-migration-CQLRep.png)


For more information about the migration process using CQLReplicator, see the following post on the AWS Database blog, [Migrate Cassandra workloads to Amazon Keyspaces using CQLReplicator](https://aws.amazon.com/blogs/database/migrate-cassandra-workloads-to-amazon-keyspaces-using-cqlreplicator/), and the AWS prescriptive guidance [Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-using-aws-glue.html).

# Migrate data using change data capture (CDC)
<a name="migration-hybrid-cdc"></a>

If you're already familiar with configuring a change data capture (CDC) pipeline with [Debezium](https://debezium.io/), you can use this option to migrate data to Amazon Keyspaces as an alternative to using CQLReplicator. Debezium is an open-source, distributed platform for CDC, designed to monitor a database and capture row-level changes reliably. 

The [Debezium connector for Apache Cassandra](https://debezium.io/documentation/reference/stable/connectors/cassandra.html) uploads changes to Amazon Managed Streaming for Apache Kafka (Amazon MSK) so that they can be consumed and processed by downstream consumers, which in turn write the data to Amazon Keyspaces. For more information, see [Guidance for continuous data migration from Apache Cassandra to Amazon Keyspaces](https://aws.amazon.com/solutions/guidance/continuous-data-migration-from-apache-cassandra-to-amazon-keyspaces/).

To address any potential data consistency issues, you can implement a process with Amazon MSK where a consumer compares the keys or partitions in Cassandra with those in Amazon Keyspaces.

To implement this solution successfully, we recommend that you consider the following.
+ How to parse the CDC commit log, for example how to remove duplicate events.
+ How to maintain the CDC directory, for example how to delete old logs.
+ How to handle partial failures in Apache Cassandra, for example if a write only succeeds in one out of three replicas.
+ How to handle resource allocation, for example increasing the size of the instance to account for the additional CPU, memory, disk, and I/O requirements of the CDC process that occurs on a node.

This pattern treats changes from Cassandra as a "hint" that a key may have changed from its previous state. To determine if there are changes to propagate to the destination database, you must first read from the source Cassandra cluster using a `LOCAL_QUORUM` operation to receive the latest records and then write them to Amazon Keyspaces. 

In the case of range deletes or range updates, you may need to perform a comparison against the entire partition to determine which write or update events need to be written to your destination database. 

In cases where writes are not idempotent, you also need to compare your writes with what is already in the destination database before writing to Amazon Keyspaces.
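The hint pattern described above (re-reading the authoritative row from the source before writing to the destination) can be sketched as follows, with hypothetical callables standing in for the `LOCAL_QUORUM` read against Cassandra and the read and write against Amazon Keyspaces:

```python
def propagate_change(key, read_source_quorum, read_target, write_target):
    """Treat a CDC event for `key` as a hint: re-read the latest row from the
    source (a LOCAL_QUORUM read in practice) and write it to the destination
    only when it differs from what the destination already holds."""
    latest = read_source_quorum(key)
    if latest == read_target(key):
        return False  # destination is already up to date; skip the write
    write_target(key, latest)
    return True

# In-memory stand-ins for the source cluster and the destination table.
source = {"k1": {"v": 2}}
target = {"k1": {"v": 1}}
first = propagate_change("k1", source.get, target.get, target.__setitem__)
second = propagate_change("k1", source.get, target.get, target.__setitem__)
# the first call propagates the change; the second hint is a no-op
```

Comparing against the destination before writing, as shown here, is one way to keep duplicate CDC events and non-idempotent writes from producing redundant or conflicting writes.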

The following diagram shows the typical architecture of a CDC pipeline using Debezium and Amazon MSK. 

![\[Using a change data capture pipeline to migrate data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/hybrid-migration-CDC.png)


# How to select the right tool for bulk uploading or migrating data to Amazon Keyspaces
<a name="migrating-tools"></a>

In this section you can review the different tools that you can use to bulk upload or migrate data to Amazon Keyspaces, and learn how to select the correct tool based on your needs. In addition, this section provides an overview and use cases for the available step-by-step tutorials that demonstrate how to import data into Amazon Keyspaces. 

To review the available strategies to migrate workloads from Apache Cassandra to Amazon Keyspaces, see [Create a migration plan for migrating from Apache Cassandra to Amazon Keyspaces](migrating-cassandra.md).
+ **Migration tools**
  + With the [pricing calculator for Amazon Keyspaces (for Apache Cassandra)](https://aws-samples.github.io/sample-pricing-calculator-for-keyspaces/#cassandra) available on GitHub, you can estimate your monthly costs for Amazon Keyspaces based on your existing Apache Cassandra workload. Enter metrics from your Cassandra `nodetool status` output and your intended serverless configuration for Amazon Keyspaces to compare direct costs between the two solutions. Note that this calculator focuses only on the operational costs of Amazon Keyspaces compared to your existing Cassandra deployment. It doesn't include total cost of ownership (TCO) factors like infrastructure maintenance, operational overhead, or support costs for Cassandra.
  + **ZDM Dual Write Proxy for Amazon Keyspaces Migration** – The ZDM Dual Write Proxy, available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/blob/main/migration/online/zdm-proxy/README.md), supports zero-downtime migration from Apache Cassandra to Amazon Keyspaces.
  + **CQLReplicator** – CQLReplicator is an open-source utility available on [GitHub](https://github.com/aws-samples/cql-replicator) that helps you migrate data from Apache Cassandra to Amazon Keyspaces in near real time.

    For more information, see [Migrate data using CQLReplicator](migration-hybrid-cql-rep.md).
  + To learn more about how to use Amazon Managed Streaming for Apache Kafka to implement an [online migration](migrating-online.md) process with dual-writes, see [Guidance for continuous data migration from Apache Cassandra to Amazon Keyspaces](https://aws.amazon.com/solutions/guidance/continuous-data-migration-from-apache-cassandra-to-amazon-keyspaces/).
  + For large migrations, consider using an extract, transform, and load (ETL) tool. You can use AWS Glue to quickly and effectively perform data transformation migrations. For more information, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md).
  + To learn how to use the Apache Cassandra Spark connector to write data to Amazon Keyspaces, see [Tutorial: Integrate with Apache Spark to import or export data](spark-integrating.md).
  + Get started quickly with loading data into Amazon Keyspaces by using the cqlsh `COPY FROM` command. cqlsh is included with Apache Cassandra and is best suited for loading small datasets or test data. For step-by-step instructions, see [Tutorial: Loading data into Amazon Keyspaces using cqlsh](bulk-upload.md).
  + You can also use the DataStax Bulk Loader for Apache Cassandra to load data into Amazon Keyspaces using the `dsbulk` command. DSBulk provides more robust import capabilities than cqlsh and is available from the [GitHub repository](https://github.com/datastax/dsbulk). For step-by-step instructions, see [Tutorial: Loading data into Amazon Keyspaces using DSBulk](dsbulk-upload.md).

**General considerations for data uploads to Amazon Keyspaces**
+ **Break the data upload down into smaller components.**

  Consider the following units of migration and their potential footprint in terms of raw data size. Uploading smaller amounts of data in one or more phases may help simplify your migration.
  + **By cluster** – Migrate all of your Cassandra data at once. This approach may be fine for smaller clusters.
  + **By keyspace or table** – Break up your migration into groups of keyspaces or tables. This approach can help you migrate data in phases based on your requirements for each workload.
  + **By data** – Consider migrating data for a specific group of users or products, to bring the size of data down even more.
+ **Prioritize what data to upload first based on simplicity.**

  Consider if you have data that could be migrated first more easily—for example, data that does not change during specific times, data from nightly batch jobs, data not used during offline hours, or data from internal apps.

**Topics**
+ [Tutorial: Loading data into Amazon Keyspaces using cqlsh](bulk-upload.md)
+ [Tutorial: Loading data into Amazon Keyspaces using DSBulk](dsbulk-upload.md)

# Tutorial: Loading data into Amazon Keyspaces using cqlsh
<a name="bulk-upload"></a>

This tutorial guides you through the process of migrating data from Apache Cassandra to Amazon Keyspaces using the `cqlsh COPY FROM` command. The `cqlsh COPY FROM` command is useful for quickly and easily uploading small datasets to Amazon Keyspaces for academic or test purposes. For more information about how to migrate production workloads, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md). In this tutorial, you'll complete the following steps:

Prerequisites – Set up an AWS account with credentials, create a JKS trust store file for the certificate, and configure `cqlsh` to connect to Amazon Keyspaces. 

1. **Create source CSV and target table** – Prepare a CSV file as the source data and create the target keyspace and table in Amazon Keyspaces.

1. **Prepare the data** – Randomize the data in the CSV file and analyze it to determine the average and maximum row sizes.

1. **Set throughput capacity** – Calculate the required write capacity units (WCUs) based on the data size and desired load time, and configure the table's provisioned capacity.

1. **Configure cqlsh parameters** – Determine optimal values for `cqlsh COPY FROM` parameters like `INGESTRATE`, `NUMPROCESSES`, `MAXBATCHSIZE`, and `CHUNKSIZE` to distribute the workload evenly. 

1. **Run the `cqlsh COPY FROM` command** – Run the `cqlsh COPY FROM` command to upload the data from the CSV file to the Amazon Keyspaces table, and monitor the progress.

Troubleshooting – Resolve common issues like invalid requests, parser errors, capacity errors, and cqlsh errors during the data upload process. 
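
As a rough model of the throughput sizing in step 3, one WCU covers one write of up to 1 KB per second, so the capacity needed for a bulk load can be estimated from the row count, the average row size, and the desired load window. The numbers below are hypothetical; the tutorial steps refine this estimate with the `cqlsh COPY FROM` parameters.

```python
import math

def estimate_wcus(num_rows: int, avg_row_bytes: int, load_time_seconds: int) -> int:
    """Estimate provisioned write capacity for a bulk load: each write consumes
    one WCU per started KB of row size, spread over the desired load window."""
    wcus_per_row = math.ceil(avg_row_bytes / 1024)
    return math.ceil(num_rows * wcus_per_row / load_time_seconds)

# Hypothetical example: one million rows of ~1.5 KB each, loaded within an hour.
needed = estimate_wcus(1_000_000, avg_row_bytes=1536, load_time_seconds=3600)
# needed == 556 WCUs
```

Halving the load window doubles the required WCUs, which is why the tutorial has you choose a target load time before setting the table's provisioned capacity.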

**Topics**
+ [Prerequisites: Steps to complete before you can upload data using `cqlsh COPY FROM`](bulk-upload-prequs.md)
+ [Step 1: Create the source CSV file and a target table for the data upload](bulk-upload-source.md)
+ [Step 2: Prepare the source data for a successful data upload](bulk-upload-prepare-data.md)
+ [Step 3: Set throughput capacity for the table](bulk-upload-capacity.md)
+ [Step 4: Configure `cqlsh COPY FROM` settings](bulk-upload-config.md)
+ [Step 5: Run the `cqlsh COPY FROM` command to upload data from the CSV file to the target table](bulk-upload-run.md)
+ [Troubleshooting](bulk-upload-troubleshooting.md)

# Prerequisites: Steps to complete before you can upload data using `cqlsh COPY FROM`
<a name="bulk-upload-prequs"></a>

You must complete the following tasks before you can start this tutorial.

1. If you have not already done so, sign up for an AWS account by following the steps at [Setting up AWS Identity and Access Management](accessing.md#SettingUp.IAM).

1. Create service-specific credentials by following the steps at [Create service-specific credentials for programmatic access to Amazon Keyspaces](programmatic.credentials.ssc.md).

1. Set up the Cassandra Query Language shell (cqlsh) connection and confirm that you can connect to Amazon Keyspaces by following the steps at [Using `cqlsh` to connect to Amazon Keyspaces](programmatic.cqlsh.md). 

# Step 1: Create the source CSV file and a target table for the data upload
<a name="bulk-upload-source"></a>

For this tutorial, we use a comma-separated values (CSV) file with the name `keyspaces_sample_table.csv` as the source file for the data migration. The provided sample file contains a few rows of data for a table with the name `book_awards`.

1. Create the source file. You can choose one of the following options:
   + Download the sample CSV file (`keyspaces_sample_table.csv`) contained in the following archive file [samplemigration.zip](samples/samplemigration.zip). Unzip the archive and take note of the path to `keyspaces_sample_table.csv`.
   + To populate a CSV file with your own data stored in an Apache Cassandra database, use the `cqlsh` `COPY TO` statement as shown in the following example.

     ```
     cqlsh localhost 9042 -u "username" -p "password" --execute "COPY mykeyspace.mytable TO 'keyspaces_sample_table.csv' WITH HEADER=true";
     ```

     Make sure the CSV file you create meets the following requirements:
     + The first row contains the column names.
     + The column names in the source CSV file match the column names in the target table.
     + The data is delimited with a comma.
     + All data values are valid Amazon Keyspaces data types. See [Data types](cql.elements.md#cql.data-types).

1. Create the target keyspace and table in Amazon Keyspaces.

   1. Connect to Amazon Keyspaces using `cqlsh`, replacing the service endpoint, user name, and password in the following example with your own values.

      ```
      cqlsh cassandra.us-east-1.amazonaws.com 9142 -u "111122223333" -p "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" --ssl
      ```

   1. Create a new keyspace with the name `catalog` as shown in the following example. 

      ```
      CREATE KEYSPACE catalog WITH REPLICATION = {'class': 'SingleRegionStrategy'};
      ```

   1. When the new keyspace is available, use the following code to create the target table `book_awards`.

      ```
      CREATE TABLE catalog.book_awards (
         year int,
         award text,
         rank int, 
         category text,
         book_title text,
         author text, 
         publisher text,
         PRIMARY KEY ((year, award), category, rank)
         );
      ```

   If Apache Cassandra is your original data source, a simple way to create the Amazon Keyspaces target table with matching headers is to generate the `CREATE TABLE` statement from the source table, as shown in the following statement.

   ```
   cqlsh localhost 9042  -u "username" -p "password" --execute "DESCRIBE TABLE mykeyspace.mytable;"
   ```

   Then create the target table in Amazon Keyspaces with the column names and data types matching the description from the Cassandra source table.

# Step 2: Prepare the source data for a successful data upload
<a name="bulk-upload-prepare-data"></a>

Preparing the source data for an efficient transfer is a two-step process. First, you randomize the data. In the second step, you analyze the data to determine the appropriate `cqlsh` parameter values and required table settings to ensure that the data upload is successful.

**Randomize the data**  
The `cqlsh COPY FROM` command reads and writes data in the same order that it appears in the CSV file. If you use the `cqlsh COPY TO` command to create the source file, the data is written in key-sorted order in the CSV. Internally, Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more efficient if you randomize the order. This is because you can take advantage of the built-in load balancing that occurs when Amazon Keyspaces is writing to different partitions.

To spread the writes across the partitions evenly, you must randomize the data in the source file. You can write an application to do this or use an open-source tool, such as [Shuf](https://en.wikipedia.org/wiki/Shuf). Shuf is freely available on Linux distributions, on macOS (by installing coreutils in [homebrew](https://brew.sh)), and on Windows (by using Windows Subsystem for Linux (WSL)). An extra step is required to prevent the header row with the column names from being shuffled along with the data.

To randomize the source file while preserving the header, enter the following code.

```
tail -n +2 keyspaces_sample_table.csv | shuf -o keyspace.table.csv && (head -1 keyspaces_sample_table.csv && cat keyspace.table.csv ) > keyspace.table.csv1 && mv keyspace.table.csv1 keyspace.table.csv
```

Shuf rewrites the data to a new CSV file called `keyspace.table.csv`. You can now delete the `keyspaces_sample_table.csv` file—you no longer need it.
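If `shuf` isn't available on your platform, you can perform the same shuffle with a short script. The following is a minimal Python sketch (the function name `shuffle_csv` is illustrative, not part of this tutorial's tooling); it assumes the source file fits in memory.

```python
import csv
import random

def shuffle_csv(source, target):
    """Shuffle the data rows of a CSV file while keeping the header row first."""
    with open(source, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    random.shuffle(data)  # randomize the row order in place
    with open(target, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(data)

# Usage with this tutorial's file names:
# shuffle_csv("keyspaces_sample_table.csv", "keyspace.table.csv")
```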

**Analyze the data**  
Determine the average and maximum row size by analyzing the data.

You do this for the following reasons:
+ The average row size helps to estimate the total amount of data to be transferred.
+ You need the average row size to provision the write capacity needed for the data upload.
+ You can make sure that each row is less than 1 MB in size, which is the maximum row size in Amazon Keyspaces.

**Note**  
This quota refers to row size, not partition size. Unlike Apache Cassandra partitions, Amazon Keyspaces partitions can be virtually unbounded in size. Partition keys and clustering columns require additional storage for metadata, which you must add to the raw size of rows. For more information, see [Estimate row size in Amazon Keyspaces](calculating-row-size.md).

The following code uses [AWK](https://en.wikipedia.org/wiki/AWK) to analyze a CSV file and print the average and maximum row size.

```
awk -F, 'BEGIN {samp=10000}{if(NR>1){len=length($0);t+=len;n++;max=(len>max ? len : max)}}n==samp{exit}END{printf("{lines: %d, average: %d bytes, max: %d bytes}\n",n,t/n,max)}' keyspace.table.csv
```

Running this code results in the following output.

```
{lines: 10000, average: 123 bytes, max: 225 bytes}
```

You use the average row size in the next step of this tutorial to provision the write capacity for the table.
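As an alternative to AWK, the following Python sketch (the helper name `analyze_csv` is illustrative) samples data rows and reports the same statistics. You can also use its result to confirm that no row exceeds the 1 MB row size quota.

```python
MAX_ROW_BYTES = 1024 * 1024  # 1 MB maximum row size in Amazon Keyspaces

def analyze_csv(path, sample=10000):
    """Return (rows sampled, average row bytes, max row bytes) for data rows."""
    total = count = longest = 0
    with open(path, "rb") as f:
        next(f)  # skip the header row
        for line in f:
            size = len(line.rstrip(b"\r\n"))
            total += size
            count += 1
            longest = max(longest, size)
            if count >= sample:
                break
    if count == 0:
        return 0, 0, 0
    return count, total // count, longest

# Example:
# rows, avg, largest = analyze_csv("keyspace.table.csv")
# if largest > MAX_ROW_BYTES: the file contains rows that exceed the quota
```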

# Step 3: Set throughput capacity for the table
<a name="bulk-upload-capacity"></a>

This tutorial shows you how to tune cqlsh to load data within a set time range. Because you know how many reads and writes you perform in advance, use provisioned capacity mode. After you finish the data transfer, you should set the capacity mode of the table to match your application’s traffic patterns. To learn more about capacity management, see [Managing serverless resources in Amazon Keyspaces (for Apache Cassandra)](serverless_resource_management.md).

With provisioned capacity mode, you specify how much read and write capacity you want to provision to your table in advance. Write capacity is billed hourly and metered in write capacity units (WCUs). Each WCU is enough write capacity to support writing 1 KB of data per second. When you load the data, the write rate must be under the max WCUs (parameter: `write_capacity_units`) that are set on the target table. 

By default, you can provision up to 40,000 WCUs to a table and 80,000 WCUs across all the tables in your account. If you need additional capacity, you can request a quota increase in the [Service Quotas](https://console.aws.amazon.com/servicequotas/home#!/services/cassandra/quotas) console. For more information about quotas, see [Quotas for Amazon Keyspaces (for Apache Cassandra)](quotas.md).

**Calculate the average number of WCUs required for an insert**  
Inserting 1 KB of data per second requires 1 WCU. If your CSV file has 360,000 rows and you want to load all the data in 1 hour, you must write 100 rows per second (360,000 rows / 60 minutes / 60 seconds = 100 rows per second). If each row has up to 1 KB of data, to insert 100 rows per second, you must provision 100 WCUs to your table. If each row has 1.5 KB of data, you need two WCUs to insert one row per second. Therefore, to insert 100 rows per second, you must provision 200 WCUs.

To determine how many WCUs you need to insert one row per second, divide the average row size in bytes by 1024 and round up to the nearest whole number.

For example, if the average row size is 3000 bytes, you need three WCUs to insert one row per second.

```
ROUNDUP(3000 / 1024) = ROUNDUP(2.93) = 3 WCUs
```

**Calculate data load time and capacity**  
Now that you know the average size and number of rows in your CSV file, you can calculate how many WCUs you need to load the data in a given amount of time, and the approximate time it takes to load all the data in your CSV file using different WCU settings.

For example, if each row in your file is 1 KB and you have 1,000,000 rows in your CSV file, to load the data in 1 hour, you need to provision at least 278 WCUs to your table for that hour.

```
1,000,000 rows * 1 KB = 1,000,000 KB
1,000,000 KB / 3600 seconds = 277.8 KB/second, rounded up to 278 WCUs
```
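The calculations in this section can be expressed as a short script. The following Python sketch (the function names are illustrative) reproduces the WCU math from the examples above.

```python
import math

def wcus_per_row(avg_row_bytes):
    # One WCU supports writing up to 1 KB of data per second,
    # so a row larger than 1 KB needs multiple WCUs.
    return math.ceil(avg_row_bytes / 1024)

def wcus_for_load(row_count, avg_row_bytes, hours):
    # WCUs required to load all rows within the given number of hours.
    rows_per_second = row_count / (hours * 3600)
    return math.ceil(rows_per_second * wcus_per_row(avg_row_bytes))
```

For example, `wcus_per_row(3000)` returns 3, and `wcus_for_load(1_000_000, 1024, 1)` returns 278, matching the calculations above.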

**Configure provisioned capacity settings**  
You can set a table’s write capacity settings when you create the table or by using the `ALTER TABLE` CQL command. The following is the syntax for altering a table’s provisioned capacity settings with the `ALTER TABLE` CQL statement.

```
ALTER TABLE mykeyspace.mytable WITH custom_properties={'capacity_mode':{'throughput_mode': 'PROVISIONED', 'read_capacity_units': 100, 'write_capacity_units': 278}} ; 
```

For the complete language reference, see [ALTER TABLE](cql.ddl.table.md#cql.ddl.table.alter).

# Step 4: Configure `cqlsh COPY FROM` settings
<a name="bulk-upload-config"></a>

This section outlines how to determine the parameter values for `cqlsh COPY FROM`. The `cqlsh COPY FROM` command reads the CSV file that you prepared earlier and inserts the data into Amazon Keyspaces using CQL. The command divides up the rows and distributes the `INSERT` operations among a set of workers. Each worker establishes a connection with Amazon Keyspaces and sends `INSERT` requests along this channel. 

The `cqlsh COPY` command doesn’t have internal logic to distribute work evenly among its workers. However, you can configure it manually to make sure that the work is distributed evenly. Start by reviewing these key cqlsh parameters:
+ **DELIMITER** – If you used a delimiter other than a comma, you can set this parameter, which defaults to comma.
+ **INGESTRATE** – The target number of rows that `cqlsh COPY FROM` attempts to process per second. If unset, it defaults to 100,000.
+ **NUMPROCESSES** – The number of child worker processes that cqlsh creates for `COPY FROM` tasks. The maximum for this setting is 16. The default is `num_cores - 1`, where `num_cores` is the number of processing cores on the host running cqlsh.
+ **MAXBATCHSIZE** – The batch size determines the maximum number of rows inserted into the destination table in a single batch. If unset, cqlsh uses batches of 20 inserted rows.
+ **CHUNKSIZE** – The size of the work unit that passes to the child worker. By default, it is set to 5,000. 
+ **MAXATTEMPTS** – The maximum number of times to retry a failed worker chunk. After the maximum attempt is reached, the failed records are written to a new CSV file that you can run again later after investigating the failure.

Set `INGESTRATE` based on the number of WCUs that you provisioned to the target destination table. The `INGESTRATE` of the `cqlsh COPY FROM` command isn’t a limit—it’s a target average. This means it can (and often does) burst above the number you set. To allow for bursts and make sure that enough capacity is in place to handle the data load requests, set `INGESTRATE` to 90% of the table’s write capacity.

```
INGESTRATE = WCUs * .90
```

Next, set the `NUMPROCESSES` parameter to one less than the number of cores on your system. To find out the number of cores on your system, you can run the following code.

```
python -c "import multiprocessing; print(multiprocessing.cpu_count())"
```

For this tutorial, we use the following value.

```
NUMPROCESSES = 4
```

Each process creates a worker, and each worker establishes a connection to Amazon Keyspaces. Amazon Keyspaces can support up to 3,000 CQL requests per second on every connection. This means that you have to make sure that each worker is processing fewer than 3,000 requests per second. 

As with `INGESTRATE`, the workers often burst above the number you set and aren’t limited by clock seconds. Therefore, to account for bursts, set your cqlsh parameters to target each worker to process 2,500 requests per second. To calculate the amount of work distributed to a worker, use the following guideline.
+ Divide `INGESTRATE` by `NUMPROCESSES`.
+ If `INGESTRATE` / `NUMPROCESSES` > 2,500, lower the `INGESTRATE` to make this formula true.

```
INGESTRATE / NUMPROCESSES <= 2,500
```

Before you configure the settings to optimize the upload of our sample data, let's review the `cqlsh` default settings and see how using them impacts the data upload process. Because `cqlsh COPY FROM` uses the `CHUNKSIZE` to create chunks of work (`INSERT` statements) to distribute to workers, the work is not automatically distributed evenly. Some workers might sit idle, depending on the `INGESTRATE` setting.

To distribute work evenly among the workers and keep each worker at the optimal 2,500 requests per second rate, you must set `CHUNKSIZE`, `MAXBATCHSIZE`, and `INGESTRATE` by changing the input parameters. To optimize network traffic utilization during the data load, choose a value for `MAXBATCHSIZE` that is close to the maximum value of 30. By changing `CHUNKSIZE` to 100 and `MAXBATCHSIZE` to 25, the 10,000 rows per second are spread evenly among the four workers (10,000 / 4 = 2,500).

The following code example illustrates this.

```
INGESTRATE = 10,000
NUMPROCESSES = 4
CHUNKSIZE = 100
MAXBATCHSIZE = 25
Work Distribution:
Connection 1 / Worker 1 : 2,500 Requests per second
Connection 2 / Worker 2 : 2,500 Requests per second
Connection 3 / Worker 3 : 2,500 Requests per second
Connection 4 / Worker 4 : 2,500 Requests per second
```

To summarize, use the following formulas when setting `cqlsh COPY FROM` parameters:
+ **INGESTRATE** = write_capacity_units * .90
+ **NUMPROCESSES** = num_cores - 1 (default)
+ **INGESTRATE / NUMPROCESSES** <= 2,500 (This must be a true statement.)
+ **MAXBATCHSIZE** = 30 (Defaults to 20. Amazon Keyspaces accepts batches up to 30.)
+ **CHUNKSIZE** = (INGESTRATE / NUMPROCESSES) / MAXBATCHSIZE

Now that you have calculated `NUMPROCESSES`, `INGESTRATE`, and `CHUNKSIZE`, you’re ready to load your data.
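Putting the formulas together, the following Python sketch (the helper name `copy_from_settings` is illustrative) derives a consistent set of `cqlsh COPY FROM` parameters from a table's provisioned write capacity and the number of cores on the host.

```python
import math

MAX_REQUESTS_PER_WORKER = 2500  # per-connection target that leaves room for bursts
MAX_BATCH_SIZE = 30             # largest batch that Amazon Keyspaces accepts

def copy_from_settings(write_capacity_units, num_cores):
    """Derive cqlsh COPY FROM parameters from the formulas in this section."""
    ingestrate = int(write_capacity_units * 0.90)
    numprocesses = num_cores - 1  # the cqlsh default
    # Lower INGESTRATE if it would push any worker past the target rate.
    if ingestrate / numprocesses > MAX_REQUESTS_PER_WORKER:
        ingestrate = numprocesses * MAX_REQUESTS_PER_WORKER
    chunksize = math.ceil(ingestrate / numprocesses / MAX_BATCH_SIZE)
    return {
        "INGESTRATE": ingestrate,
        "NUMPROCESSES": numprocesses,
        "MAXBATCHSIZE": MAX_BATCH_SIZE,
        "CHUNKSIZE": chunksize,
    }
```

For example, on a 5-core host with 11,112 WCUs provisioned, this sketch yields `INGESTRATE=10000` and `NUMPROCESSES=4`, keeping each worker at 2,500 requests per second.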

# Step 5: Run the `cqlsh COPY FROM` command to upload data from the CSV file to the target table
<a name="bulk-upload-run"></a>

To run the `cqlsh COPY FROM` command, complete the following steps.

1. Connect to Amazon Keyspaces using cqlsh.

1. Choose your keyspace with the following code.

   ```
   USE catalog;
   ```

1. Set write consistency to `LOCAL_QUORUM`. To ensure data durability, Amazon Keyspaces doesn’t allow other write consistency settings. See the following code.

   ```
   CONSISTENCY LOCAL_QUORUM;
   ```

1. Prepare your `cqlsh COPY FROM` syntax using the following code example. 

   ```
   COPY book_awards FROM './keyspace.table.csv' WITH HEADER=true 
   AND INGESTRATE=calculated ingestrate 
   AND NUMPROCESSES=calculated numprocess
   AND MAXBATCHSIZE=20 
   AND CHUNKSIZE=calculated chunksize;
   ```

1. Run the statement prepared in the previous step. cqlsh echoes back all the settings that you've configured.

   1. Make sure that the settings match your input. See the following example.

      ```
      Reading options from the command line: {'chunksize': '120', 'header': 'true', 'ingestrate': '36000', 'numprocesses': '15', 'maxbatchsize': '20'}
      Using 15 child processes
      ```

   1. Review the number of rows transferred and the current average rate, as shown in the following example.

      ```
      Processed: 57834 rows; Rate: 6561 rows/s; Avg. rate: 31751 rows/s
      ```

   1. When cqlsh has finished uploading the data, review the summary of the data load statistics (the number of files read, runtime, and skipped rows) as shown in the following example.

      ```
      15556824 rows imported from 1 files in 8 minutes and 8.321 seconds (0 skipped).
      ```

In this final step of the tutorial, you have uploaded the data to Amazon Keyspaces.

**Important**  
Now that you have transferred your data, adjust the capacity mode settings of your target table to match your application’s regular traffic patterns. You incur charges at the hourly rate for your provisioned capacity until you change it.

# Troubleshooting
<a name="bulk-upload-troubleshooting"></a>

After the data upload has completed, check whether any rows were skipped. To do so, navigate to the directory of the source CSV file and look for a file with the following name.

```
import_yourcsvfilename.err.timestamp.csv
```

cqlsh writes any skipped rows of data into a file with that name. If the file exists in your source directory and contains data, these rows didn't upload to Amazon Keyspaces. To retry these rows, first check for any errors that were encountered during the upload and adjust the data accordingly. Then rerun the process.

**Common errors**  
The most common reasons why rows aren’t loaded are capacity errors and parsing errors.

**Invalid request errors when uploading data to Amazon Keyspaces**

In the following example, the source table contains a counter column, which results in logged batch calls from the cqlsh `COPY` command. Logged batch calls are not supported by Amazon Keyspaces.

```
Failed to import 10 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Only UNLOGGED Batches are supported at this time.", will retry later, attempt 22 of 25
```

To resolve this error, use DSBulk to migrate the data. For more information, see [Tutorial: Loading data into Amazon Keyspaces using DSBulk](dsbulk-upload.md).

**Parser errors when uploading data to Amazon Keyspaces**

The following example shows a skipped row due to a `ParseError`.

```
Failed to import 1 rows: ParseError - Invalid ... – 
```

To resolve this error, make sure that the data to be imported matches the table schema in Amazon Keyspaces. Review the import file for parsing errors. You can try inserting a single row of data with an `INSERT` statement to isolate the error.

**Capacity errors when uploading data to Amazon Keyspaces**

```
Failed to import 1 rows: WriteTimeout - Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses]
 message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 2, 'write_type': 'SIMPLE', 'consistency': 
 'LOCAL_QUORUM'}, will retry later, attempt 1 of 100
```

Amazon Keyspaces uses the `ReadTimeout` and `WriteTimeout` exceptions to indicate when a write request fails due to insufficient throughput capacity. To help diagnose insufficient capacity exceptions, Amazon Keyspaces publishes `WriteThrottleEvents` and `ReadThrottledEvents` metrics in Amazon CloudWatch. For more information, see [Monitoring Amazon Keyspaces with Amazon CloudWatch](monitoring-cloudwatch.md).

**cqlsh errors when uploading data to Amazon Keyspaces**

To help troubleshoot cqlsh errors, rerun the failing command with the `--debug` flag.

When using an incompatible version of cqlsh, you see the following error.

```
AttributeError: 'NoneType' object has no attribute 'is_up'
Failed to import 3 rows: AttributeError - 'NoneType' object has no attribute 'is_up',  given up after 1 attempts
```

Confirm that the correct version of cqlsh is installed by running the following command.

```
cqlsh --version
```

You should see output similar to the following.

```
cqlsh 5.0.1
```

If you're using Windows, replace all instances of `cqlsh` with `cqlsh.bat`. For example, to check the version of cqlsh in Windows, run the following command.

```
cqlsh.bat --version
```

The connection to Amazon Keyspaces fails after the cqlsh client receives three consecutive errors of any type from the server. The cqlsh client fails with the following message. 

```
Failed to import 1 rows: NoHostAvailable - , will retry later, attempt 3 of 100
```

If you receive this error, make sure that the data to be imported matches the table schema in Amazon Keyspaces, and review the import file for parsing errors.

The client automatically attempts to reestablish the connection.

# Tutorial: Loading data into Amazon Keyspaces using DSBulk
<a name="dsbulk-upload"></a>

This step-by-step tutorial guides you through migrating data from Apache Cassandra to Amazon Keyspaces using the DataStax Bulk Loader (DSBulk) available on [GitHub](https://github.com/datastax/dsbulk.git). DSBulk is useful for uploading datasets to Amazon Keyspaces for academic or test purposes. For more information about how to migrate production workloads, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md). In this tutorial, you complete the following steps.

Prerequisites – Set up an AWS account with credentials, create a JKS trust store file for the certificate, configure `cqlsh`, download and install DSBulk, and configure an `application.conf` file. 

1. **Create source CSV and target table** – Prepare a CSV file as the source data and create the target keyspace and table in Amazon Keyspaces.

1. **Prepare the data** – Randomize the data in the CSV file and analyze it to determine the average and maximum row sizes.

1. **Set throughput capacity** – Calculate the required write capacity units (WCUs) based on the data size and desired load time, and configure the table's provisioned capacity.

1. **Configure DSBulk settings** – Create a DSBulk configuration file with settings like authentication, SSL/TLS, consistency level, and connection pool size.

1. **Run the DSBulk load command** – Run the DSBulk load command to upload the data from the CSV file to the Amazon Keyspaces table, and monitor the progress.

**Topics**
+ [Prerequisites: Steps you have to complete before you can upload data with DSBulk](dsbulk-upload-prequs.md)
+ [Step 1: Create the source CSV file and a target table for the data upload using DSBulk](dsbulk-upload-source.md)
+ [Step 2: Prepare the data to upload using DSBulk](dsbulk-upload-prepare-data.md)
+ [Step 3: Set the throughput capacity for the target table](dsbulk-upload-capacity.md)
+ [Step 4: Configure `DSBulk` settings to upload data from the CSV file to the target table](dsbulk-upload-config.md)
+ [Step 5: Run the DSBulk `load` command to upload data from the CSV file to the target table](dsbulk-upload-run.md)

# Prerequisites: Steps you have to complete before you can upload data with DSBulk
<a name="dsbulk-upload-prequs"></a>

You must complete the following tasks before you can start this tutorial.

1. If you have not already done so, sign up for an AWS account by following the steps at [Setting up AWS Identity and Access Management](accessing.md#SettingUp.IAM).

1. Create credentials by following the steps at [Create and configure AWS credentials for Amazon Keyspaces](access.credentials.md).

1. Create a JKS trust store file.

   1.  Download the following digital certificates and save the files locally or in your home directory.

      1. AmazonRootCA1

      1. AmazonRootCA2

      1. AmazonRootCA3

      1. AmazonRootCA4

      1. Starfield Class 2 Root (optional – for backward compatibility)

      To download the certificates, you can use the following commands.

      ```
      curl -O https://www.amazontrust.com/repository/AmazonRootCA1.pem
      curl -O https://www.amazontrust.com/repository/AmazonRootCA2.pem
      curl -O https://www.amazontrust.com/repository/AmazonRootCA3.pem
      curl -O https://www.amazontrust.com/repository/AmazonRootCA4.pem
      curl -O https://certs.secureserver.net/repository/sf-class2-root.crt
      ```
**Note**  
Amazon Keyspaces previously used TLS certificates anchored to the Starfield Class 2 CA. AWS is migrating all AWS Regions to certificates issued under Amazon Trust Services (Amazon Root CAs 1–4). During this transition, configure clients to trust both Amazon Root CAs 1–4 and the Starfield root to ensure compatibility across all Regions.

   1. Convert the digital certificates into trustStore files and add them to the keystore.

      ```
      openssl x509 -outform der -in AmazonRootCA1.pem -out temp_file.der
      keytool -import -alias amazon-root-ca-1 -keystore cassandra_truststore.jks -file temp_file.der
      
      openssl x509 -outform der -in AmazonRootCA2.pem -out temp_file.der
      keytool -import -alias amazon-root-ca-2 -keystore cassandra_truststore.jks -file temp_file.der
      
      openssl x509 -outform der -in AmazonRootCA3.pem -out temp_file.der
      keytool -import -alias amazon-root-ca-3 -keystore cassandra_truststore.jks -file temp_file.der
      
      openssl x509 -outform der -in AmazonRootCA4.pem -out temp_file.der
      keytool -import -alias amazon-root-ca-4 -keystore cassandra_truststore.jks -file temp_file.der
                   
      openssl x509 -outform der -in sf-class2-root.crt -out temp_file.der
      keytool -import -alias cassandra -keystore cassandra_truststore.jks -file temp_file.der
      ```

      In the last step, you need to create a password for the keystore and trust each certificate. The interactive command looks like this.

      ```
      Enter keystore password:  
      Re-enter new password: 
      Owner: CN=Amazon Root CA 1, O=Amazon, C=US
      Issuer: CN=Amazon Root CA 1, O=Amazon, C=US
      Serial number: 66c9fcf99bf8c0a39e2f0788a43e696365bca
      Valid from: Tue May 26 00:00:00 UTC 2015 until: Sun Jan 17 00:00:00 UTC 2038
      Certificate fingerprints:
           SHA1: 8D:A7:F9:65:EC:5E:FC:37:91:0F:1C:6E:59:FD:C1:CC:6A:6E:DE:16
           SHA256: 8E:CD:E6:88:4F:3D:87:B1:12:5B:A3:1A:C3:FC:B1:3D:70:16:DE:7F:57:CC:90:4F:E1:CB:97:C6:AE:98:19:6E
      Signature algorithm name: SHA256withRSA
      Subject Public Key Algorithm: 2048-bit RSA key
      Version: 3
      
      Extensions: 
      
      #1: ObjectId: 2.5.29.19 Criticality=true
      BasicConstraints:[
        CA:true
        PathLen:2147483647
      ]
      
      #2: ObjectId: 2.5.29.15 Criticality=true
      KeyUsage [
        DigitalSignature
        Key_CertSign
        Crl_Sign
      ]
      
      #3: ObjectId: 2.5.29.14 Criticality=false
      SubjectKeyIdentifier [
      KeyIdentifier [
      0000: 84 18 CC 85 34 EC BC 0C   94 94 2E 08 59 9C C7 B2  ....4.......Y...
      0010: 10 4E 0A 08                                        .N..
      ]
      ]
      
      Trust this certificate? [no]:  yes
      Certificate was added to keystore
      Enter keystore password:  
      Owner: CN=Amazon Root CA 2, O=Amazon, C=US
      Issuer: CN=Amazon Root CA 2, O=Amazon, C=US
      Serial number: 66c9fd29635869f0a0fe58678f85b26bb8a37
      Valid from: Tue May 26 00:00:00 UTC 2015 until: Sat May 26 00:00:00 UTC 2040
      Certificate fingerprints:
           SHA1: 5A:8C:EF:45:D7:A6:98:59:76:7A:8C:8B:44:96:B5:78:CF:47:4B:1A
           SHA256: 1B:A5:B2:AA:8C:65:40:1A:82:96:01:18:F8:0B:EC:4F:62:30:4D:83:CE:C4:71:3A:19:C3:9C:01:1E:A4:6D:B4
      Signature algorithm name: SHA384withRSA
      Subject Public Key Algorithm: 4096-bit RSA key
      Version: 3
      
      Extensions: 
      
      #1: ObjectId: 2.5.29.19 Criticality=true
      BasicConstraints:[
        CA:true
        PathLen:2147483647
      ]
      
      #2: ObjectId: 2.5.29.15 Criticality=true
      KeyUsage [
        DigitalSignature
        Key_CertSign
        Crl_Sign
      ]
      
      #3: ObjectId: 2.5.29.14 Criticality=false
      SubjectKeyIdentifier [
      KeyIdentifier [
      0000: B0 0C F0 4C 30 F4 05 58   02 48 FD 33 E5 52 AF 4B  ...L0..X.H.3.R.K
      0010: 84 E3 66 52                                        ..fR
      ]
      ]
      
      Trust this certificate? [no]:  yes
      Certificate was added to keystore
      Enter keystore password:  
      Owner: CN=Amazon Root CA 3, O=Amazon, C=US
      Issuer: CN=Amazon Root CA 3, O=Amazon, C=US
      Serial number: 66c9fd5749736663f3b0b9ad9e89e7603f24a
      Valid from: Tue May 26 00:00:00 UTC 2015 until: Sat May 26 00:00:00 UTC 2040
      Certificate fingerprints:
           SHA1: 0D:44:DD:8C:3C:8C:1A:1A:58:75:64:81:E9:0F:2E:2A:FF:B3:D2:6E
           SHA256: 18:CE:6C:FE:7B:F1:4E:60:B2:E3:47:B8:DF:E8:68:CB:31:D0:2E:BB:3A:DA:27:15:69:F5:03:43:B4:6D:B3:A4
      Signature algorithm name: SHA256withECDSA
      Subject Public Key Algorithm: 256-bit EC (secp256r1) key
      Version: 3
      
      Extensions: 
      
      #1: ObjectId: 2.5.29.19 Criticality=true
      BasicConstraints:[
        CA:true
        PathLen:2147483647
      ]
      
      #2: ObjectId: 2.5.29.15 Criticality=true
      KeyUsage [
        DigitalSignature
        Key_CertSign
        Crl_Sign
      ]
      
      #3: ObjectId: 2.5.29.14 Criticality=false
      SubjectKeyIdentifier [
      KeyIdentifier [
      0000: AB B6 DB D7 06 9E 37 AC   30 86 07 91 70 C7 9C C4  ......7.0...p...
      0010: 19 B1 78 C0                                        ..x.
      ]
      ]
      
      Trust this certificate? [no]:  yes
      Certificate was added to keystore
      Enter keystore password:  
      Owner: CN=Amazon Root CA 4, O=Amazon, C=US
      Issuer: CN=Amazon Root CA 4, O=Amazon, C=US
      Serial number: 66c9fd7c1bb104c2943e5717b7b2cc81ac10e
      Valid from: Tue May 26 00:00:00 UTC 2015 until: Sat May 26 00:00:00 UTC 2040
      Certificate fingerprints:
           SHA1: F6:10:84:07:D6:F8:BB:67:98:0C:C2:E2:44:C2:EB:AE:1C:EF:63:BE
           SHA256: E3:5D:28:41:9E:D0:20:25:CF:A6:90:38:CD:62:39:62:45:8D:A5:C6:95:FB:DE:A3:C2:2B:0B:FB:25:89:70:92
      Signature algorithm name: SHA384withECDSA
      Subject Public Key Algorithm: 384-bit EC (secp384r1) key
      Version: 3
      
      Extensions: 
      
      #1: ObjectId: 2.5.29.19 Criticality=true
      BasicConstraints:[
        CA:true
        PathLen:2147483647
      ]
      
      #2: ObjectId: 2.5.29.15 Criticality=true
      KeyUsage [
        DigitalSignature
        Key_CertSign
        Crl_Sign
      ]
      
      #3: ObjectId: 2.5.29.14 Criticality=false
      SubjectKeyIdentifier [
      KeyIdentifier [
      0000: D3 EC C7 3A 65 6E CC E1   DA 76 9A 56 FB 9C F3 86  ...:en...v.V....
      0010: 6D 57 E5 81                                        mW..
      ]
      ]
      
      Trust this certificate? [no]:  yes
      Certificate was added to keystore
      Enter keystore password:  
      Owner: OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US
      Issuer: OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US
      Serial number: 0
      Valid from: Tue Jun 29 17:39:16 UTC 2004 until: Thu Jun 29 17:39:16 UTC 2034
      Certificate fingerprints:
           SHA1: AD:7E:1C:28:B0:64:EF:8F:60:03:40:20:14:C3:D0:E3:37:0E:B5:8A
           SHA256: 14:65:FA:20:53:97:B8:76:FA:A6:F0:A9:95:8E:55:90:E4:0F:CC:7F:AA:4F:B7:C2:C8:67:75:21:FB:5F:B6:58
      Signature algorithm name: SHA1withRSA (weak)
      Subject Public Key Algorithm: 2048-bit RSA key
      Version: 3
      
      Extensions: 
      
      #1: ObjectId: 2.5.29.35 Criticality=false
      AuthorityKeyIdentifier [
      KeyIdentifier [
      0000: BF 5F B7 D1 CE DD 1F 86   F4 5B 55 AC DC D7 10 C2  ._.......[U.....
      0010: 0E A9 88 E7                                        ....
      ]
      [OU=Starfield Class 2 Certification Authority, O="Starfield Technologies, Inc.", C=US]
      SerialNumber: [    00]
      ]
      
      #2: ObjectId: 2.5.29.19 Criticality=false
      BasicConstraints:[
        CA:true
        PathLen:2147483647
      ]
      
      #3: ObjectId: 2.5.29.14 Criticality=false
      SubjectKeyIdentifier [
      KeyIdentifier [
      0000: BF 5F B7 D1 CE DD 1F 86   F4 5B 55 AC DC D7 10 C2  ._.......[U.....
      0010: 0E A9 88 E7                                        ....
      ]
      ]
      
      
      Warning:
      The input uses the SHA1withRSA signature algorithm which is considered a security risk. This algorithm will be disabled in a future update.
      
      Trust this certificate? [no]:  yes
      Certificate was added to keystore
      ```

1. Set up the Cassandra Query Language shell (cqlsh) connection and confirm that you can connect to Amazon Keyspaces by following the steps at [Using `cqlsh` to connect to Amazon Keyspaces](programmatic.cqlsh.md). 

1. Download and install DSBulk. 
**Note**  
The version shown in this tutorial might not be the latest version available. Before you download DSBulk, check the [DataStax Bulk Loader download page](https://downloads.datastax.com/#bulk-loader) for the latest version, and update the version number in the following commands accordingly.

   1. To download DSBulk, you can use the following code.

      ```
      curl -OL https://downloads.datastax.com/dsbulk/dsbulk-1.8.0.tar.gz
      ```

   1. Then unpack the tar file and add DSBulk to your `PATH` as shown in the following example.

      ```
      tar -zxvf dsbulk-1.8.0.tar.gz
      # add the DSBulk directory to the path
      export PATH=$PATH:./dsbulk-1.8.0/bin
      ```

   1. Create an `application.conf` file to store the settings that DSBulk uses. You can save the following example as `./dsbulk_keyspaces.conf`. If you're not running DSBulk on the Cassandra node itself, replace `localhost` with the contact point of your Cassandra cluster, for example its DNS name or IP address. Take note of the file name and path, because you need to specify them later in the `dsbulk load` command. 

      ```
      datastax-java-driver {
        basic.contact-points = [ "localhost"]
        advanced.auth-provider {
              class = software.aws.mcs.auth.SigV4AuthProvider
              aws-region = us-east-1
        }
      }
      ```

   1. To enable SigV4 support, download the shaded `jar` file from [GitHub](https://github.com/aws/aws-sigv4-auth-cassandra-java-driver-plugin/releases/) and place it in the DSBulk `lib` folder as shown in the following example.

      ```
      curl -O -L https://github.com/aws/aws-sigv4-auth-cassandra-java-driver-plugin/releases/download/4.0.6-shaded-v2/aws-sigv4-auth-cassandra-java-driver-plugin-4.0.6-shaded.jar
      ```

# Step 1: Create the source CSV file and a target table for the data upload using DSBulk
<a name="dsbulk-upload-source"></a>

For this tutorial, we use a comma-separated values (CSV) file with the name `keyspaces_sample_table.csv` as the source file for the data migration. The provided sample file contains a few rows of data for a table with the name `book_awards`.

1. Create the source file. You can choose one of the following options:
   + Download the sample CSV file (`keyspaces_sample_table.csv`) contained in the following archive file [samplemigration.zip](samples/samplemigration.zip). Unzip the archive and take note of the path to `keyspaces_sample_table.csv`.
   + To populate a CSV file with your own data stored in an Apache Cassandra database, you can populate the source CSV file by using `dsbulk unload` as shown in the following example.

     ```
     dsbulk unload -k mykeyspace -t mytable -f ./my_application.conf > keyspaces_sample_table.csv
     ```

     Make sure the CSV file you create meets the following requirements:
     + The first row contains the column names.
     + The column names in the source CSV file match the column names in the target table.
     + The data is delimited with a comma.
     + All data values are valid Amazon Keyspaces data types. See [Data types](cql.elements.md#cql.data-types).

1. Create the target keyspace and table in Amazon Keyspaces.

   1. Connect to Amazon Keyspaces using `cqlsh`, replacing the service endpoint, user name, and password in the following example with your own values.

      ```
      cqlsh cassandra.us-east-1.amazonaws.com 9142 -u "111122223333" -p "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" --ssl
      ```

   1. Create a new keyspace with the name `catalog` as shown in the following example. 

      ```
      CREATE KEYSPACE catalog WITH REPLICATION = {'class': 'SingleRegionStrategy'};
      ```

   1. After the new keyspace has a status of available, use the following code to create the target table `book_awards`. To learn more about asynchronous resource creation and how to check if a resource is available, see [Check keyspace creation status in Amazon Keyspaces](keyspaces-create.md).

      ```
      CREATE TABLE catalog.book_awards (
         year int,
         award text,
         rank int, 
         category text,
         book_title text,
         author text, 
         publisher text,
         PRIMARY KEY ((year, award), category, rank)
         );
      ```

   If Apache Cassandra is your original data source, a simple way to create the Amazon Keyspaces target table with matching headers is to generate the `CREATE TABLE` statement from the source table as shown in the following statement.

   ```
    cqlsh localhost 9042 -u "username" -p "password" --execute "DESCRIBE TABLE mykeyspace.mytable;"
   ```

   Then create the target table in Amazon Keyspaces with the column names and data types matching the description from the Cassandra source table.
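Before moving on, you can sanity-check that a source CSV meets the requirements listed earlier in this step (header row present, comma-delimited, consistent column count). The following Python sketch is one way to do this; the file path and column names are placeholders for your own values, and the script is illustrative rather than part of the migration commands.

```python
import csv

def validate_csv(path, expected_columns):
    """Check that the first row is a header matching the target table's
    column names and that every data row has the same number of fields."""
    with open(path, newline="") as f:
        reader = csv.reader(f)          # comma-delimited by default
        header = next(reader)
        if set(header) != set(expected_columns):
            return False
        return all(len(row) == len(header) for row in reader)
```

For example, `validate_csv("keyspaces_sample_table.csv", ["year", "award", "rank", "category", "book_title", "author", "publisher"])` returns `True` only if the header and row widths line up.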

# Step 2: Prepare the data to upload using DSBulk
<a name="dsbulk-upload-prepare-data"></a>

Preparing the source data for an efficient transfer is a two-step process. First, you randomize the data. In the second step, you analyze the data to determine the appropriate `dsbulk` parameter values and required table settings.

**Randomize the data**  
The `dsbulk` command reads and writes data in the same order that it appears in the CSV file. If you use the `dsbulk` command to create the source file, the data is written in key-sorted order in the CSV. Internally, Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more efficient if you randomize the order. This is because you can take advantage of the built-in load balancing that occurs when Amazon Keyspaces is writing to different partitions.

To spread the writes across the partitions evenly, you must randomize the data in the source file. You can write an application to do this or use an open-source tool, such as [`shuf`](https://en.wikipedia.org/wiki/Shuf). `shuf` is freely available on Linux distributions, on macOS (by installing coreutils in [homebrew](https://brew.sh)), and on Windows (by using Windows Subsystem for Linux (WSL)). One extra step is required to prevent the header row with the column names from being shuffled along with the data.

To randomize the source file while preserving the header, enter the following code.

```
tail -n +2 keyspaces_sample_table.csv | shuf -o keyspace.table.csv && (head -1 keyspaces_sample_table.csv && cat keyspace.table.csv ) > keyspace.table.csv1 && mv keyspace.table.csv1 keyspace.table.csv
```

`shuf` writes the randomized data to a new CSV file called `keyspace.table.csv`. You can now delete the `keyspaces_sample_table.csv` file because you no longer need it.
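If `shuf` isn't available on your system, a short Python script achieves the same header-preserving shuffle. This is an illustrative sketch, not part of the tutorial's commands; the file names match the ones used above.

```python
import random

def shuffle_preserving_header(src, dst, seed=None):
    """Shuffle the data rows of a CSV while keeping the header
    row (line 1) in place."""
    with open(src) as f:
        header = f.readline()
        rows = f.readlines()
    random.Random(seed).shuffle(rows)
    with open(dst, "w") as f:
        f.write(header)
        f.writelines(rows)

# Example: shuffle_preserving_header("keyspaces_sample_table.csv", "keyspace.table.csv")
```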

**Analyze the data**  
Determine the average and maximum row size by analyzing the data.

You do this for the following reasons:
+ The average row size helps to estimate the total amount of data to be transferred.
+ You need the average row size to provision the write capacity needed for the data upload.
+ You can make sure that each row is less than 1 MB in size, which is the maximum row size in Amazon Keyspaces.

**Note**  
This quota refers to row size, not partition size. Unlike Apache Cassandra partitions, Amazon Keyspaces partitions can be virtually unbounded in size. Partition keys and clustering columns require additional storage for metadata, which you must add to the raw size of the rows. For more information, see [Estimate row size in Amazon Keyspaces](calculating-row-size.md).

The following code uses [AWK](https://en.wikipedia.org/wiki/AWK) to analyze a CSV file and print the average and maximum row size.

```
awk -F, 'BEGIN {samp=10000;max=-1}NR>1{len=length($0);t+=len;max=(len>max ? len : max);if(NR-1==samp)exit}END{rows=NR-1;printf("{lines: %d, average: %d bytes, max: %d bytes}\n",rows,t/rows,max)}' keyspace.table.csv
```

Running this code results in the following output.

```
{lines: 10000, average: 123 bytes, max: 225 bytes}
```

Make sure that your maximum row size doesn't exceed 1 MB. If it does, you have to break up the row or compress the data to bring the row size below 1 MB. In the next step of this tutorial, you use the average row size to provision the write capacity for the table. 
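As a cross-check, the same row-size analysis can be sketched in Python. This is illustrative only; the file name matches the shuffled CSV created above, and the 10,000-row sample size mirrors the AWK example.

```python
def analyze_rows(path, sample=10000):
    """Return the number of data rows plus the average and maximum row
    size in bytes, skipping the header row, for up to `sample` rows."""
    total = max_len = count = 0
    with open(path, "rb") as f:
        next(f)                          # skip the header row
        for count, line in enumerate(f, start=1):
            size = len(line.rstrip(b"\r\n"))
            total += size
            max_len = max(max_len, size)
            if count == sample:
                break
    return {"lines": count, "average": total // max(count, 1), "max": max_len}

# Example: analyze_rows("keyspace.table.csv")
```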

# Step 3: Set the throughput capacity for the target table
<a name="dsbulk-upload-capacity"></a>

This tutorial shows you how to tune DSBulk to load data within a set time range. Because you know how many reads and writes you perform in advance, use provisioned capacity mode. After you finish the data transfer, you should set the capacity mode of the table to match your application’s traffic patterns. To learn more about capacity management, see [Managing serverless resources in Amazon Keyspaces (for Apache Cassandra)](serverless_resource_management.md).

With provisioned capacity mode, you specify how much read and write capacity you want to provision to your table in advance. Write capacity is billed hourly and metered in write capacity units (WCUs). Each WCU is enough write capacity to support writing 1 KB of data per second. When you load the data, the write rate must be under the max WCUs (parameter: `write_capacity_units`) that are set on the target table. 

By default, you can provision up to 40,000 WCUs to a table and 80,000 WCUs across all the tables in your account. If you need additional capacity, you can request a quota increase in the [Service Quotas](https://console.aws.amazon.com/servicequotas/home#!/services/cassandra/quotas) console. For more information about quotas, see [Quotas for Amazon Keyspaces (for Apache Cassandra)](quotas.md).

**Calculate the average number of WCUs required for an insert**  
Inserting 1 KB of data per second requires 1 WCU. If your CSV file has 360,000 rows and you want to load all the data in 1 hour, you must write 100 rows per second (360,000 rows / 60 minutes / 60 seconds = 100 rows per second). If each row has up to 1 KB of data, to insert 100 rows per second, you must provision 100 WCUs to your table. If each row has 1.5 KB of data, you need two WCUs to insert one row per second. Therefore, to insert 100 rows per second, you must provision 200 WCUs.

To determine how many WCUs you need to insert one row per second, divide the average row size in bytes by 1024 and round up to the nearest whole number.

For example, if the average row size is 3000 bytes, you need three WCUs to insert one row per second.

```
ROUNDUP(3000 / 1024) = ROUNDUP(2.93) = 3 WCUs
```
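This rounding rule can be expressed as a one-line helper, shown here purely for illustration.

```python
import math

def wcus_per_row(avg_row_bytes):
    # One WCU supports writing 1 KB (1024 bytes) per second; round up.
    return math.ceil(avg_row_bytes / 1024)
```

For a 3,000-byte average row size, `wcus_per_row(3000)` returns `3`, matching the calculation above.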

**Calculate data load time and capacity**  
Now that you know the average size and number of rows in your CSV file, you can calculate how many WCUs you need to load the data in a given amount of time, and the approximate time it takes to load all the data in your CSV file using different WCU settings.

For example, if each row in your file is 1 KB and you have 1,000,000 rows in your CSV file, to load the data in 1 hour, you need to provision at least 278 WCUs to your table for that hour.

```
1,000,000 rows * 1 KB = 1,000,000 KB
1,000,000 KB / 3600 seconds = 277.8 KB per second = 278 WCUs (rounded up)
```
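The same capacity calculation can be captured in a small helper function; this sketch is illustrative, and the inputs are the row count, average row size in KB, and load window in hours.

```python
import math

def wcus_for_load(rows, avg_row_kb, hours):
    # Total KB to write, divided by the load window in seconds, rounded up.
    return math.ceil(rows * avg_row_kb / (hours * 3600))
```

For example, `wcus_for_load(1_000_000, 1, 1)` returns `278`, matching the calculation above.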

**Configure provisioned capacity settings**  
You can set a table’s write capacity settings when you create the table or by using the `ALTER TABLE` command. The following is the syntax for altering a table’s provisioned capacity settings with the `ALTER TABLE` command.

```
ALTER TABLE catalog.book_awards WITH custom_properties={'capacity_mode':{'throughput_mode': 'PROVISIONED', 'read_capacity_units': 100, 'write_capacity_units': 278}} ;  
```

For the complete language reference, see [CREATE TABLE](cql.ddl.table.md#cql.ddl.table.create) and [ALTER TABLE](cql.ddl.table.md#cql.ddl.table.alter).

# Step 4: Configure `DSBulk` settings to upload data from the CSV file to the target table
<a name="dsbulk-upload-config"></a>

This section outlines the steps required to configure DSBulk for data upload to Amazon Keyspaces. You configure DSBulk by using a configuration file. You specify the configuration file directly from the command line.

1. Create a DSBulk configuration file for the migration to Amazon Keyspaces. In this example, we use the file name `dsbulk_keyspaces.conf`. Specify the following settings in the DSBulk configuration file.

   1. *`PlainTextAuthProvider`* – Create the authentication provider with the `PlainTextAuthProvider` class. `ServiceUserName` and `ServicePassword` should match the user name and password you obtained when you generated the service-specific credentials by following the steps at [Create credentials for programmatic access to Amazon Keyspaces](programmatic.credentials.md).

   1. *`local-datacenter`* – Set the value for `local-datacenter` to the AWS Region that you're connecting to. For example, if the application is connecting to `cassandra.us-east-1.amazonaws.com`, then set the local data center to `us-east-1`. For all available AWS Regions, see [Service endpoints for Amazon Keyspaces](programmatic.endpoints.md). Also set `slow-replica-avoidance` to `false`, because Amazon Keyspaces manages replication internally and doesn't expose individual replicas to the driver.

   1. *`SSLEngineFactory`* – To configure SSL/TLS, initialize the `SSLEngineFactory` by adding a section in the configuration file with a single line that specifies the class with `class = DefaultSslEngineFactory`. Provide the path to `cassandra_truststore.jks` and the password that you created previously.

   1. *`consistency`* – Set the consistency level to `LOCAL_QUORUM`. Other write consistency levels are not supported. For more information, see [Supported Apache Cassandra read and write consistency levels and associated costs](consistency.md).

   1. The number of connections per pool is configurable in the Java driver. For this example, set `advanced.connection.pool.local.size` to 3.

   The following is the complete sample configuration file.

   ```
   datastax-java-driver {
   basic.contact-points = [ "cassandra.us-east-1.amazonaws.com:9142"]
   advanced.auth-provider {
       class = PlainTextAuthProvider
       username = "ServiceUserName"
       password = "ServicePassword"
   }
   
   basic.load-balancing-policy {
       local-datacenter = "us-east-1"
       slow-replica-avoidance = false           
   }
   
   basic.request {
       consistency = LOCAL_QUORUM
       default-idempotence = true
   }
   advanced.ssl-engine-factory {
       class = DefaultSslEngineFactory
       truststore-path = "./cassandra_truststore.jks"
       truststore-password = "my_password"
       hostname-validation = false
     }
   advanced.connection.pool.local.size = 3
   }
   ```

1. Review the parameters for the DSBulk `load` command.

   1. *`executor.maxPerSecond`* – The maximum number of rows that the `load` command attempts to process concurrently per second. If unset, this setting defaults to -1 (disabled).

      Set `executor.maxPerSecond` based on the number of WCUs that you provisioned to the target table. The `executor.maxPerSecond` setting of the `load` command isn't a hard limit; it's a target average, which means the actual rate can (and often does) burst above the number you set. To allow for bursts and to make sure that enough capacity is in place to handle the data load requests, set `executor.maxPerSecond` to 90% of the table's write capacity.

      ```
      executor.maxPerSecond = WCUs * .90
      ```

      In this tutorial, we set `executor.maxPerSecond` to 5.
**Note**  
If you are using DSBulk 1.6.0 or higher, you can use `dsbulk.engine.maxConcurrentQueries` instead.
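The 90% headroom rule can be sketched as a tiny helper; the WCU value passed in is illustrative.

```python
def max_per_second(wcus):
    # Target 90% of provisioned write capacity to leave headroom for bursts.
    return int(wcus * 0.90)
```

With the 278 WCUs provisioned in Step 3, `max_per_second(278)` returns `250`.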

   1. Configure these additional parameters for the DSBulk `load` command.
      + *`batch-mode`* – This parameter tells the system to group operations by partition key. We recommend disabling batch mode because it can result in hot key scenarios and cause `WriteThrottleEvents`.
      + *`driver.advanced.retry-policy-max-retries`* – This determines how many times to retry a failed query. If unset, the default is 10. You can adjust this value as needed.
      + *`driver.basic.request.timeout`* – How long the system waits for a query to return. If unset, the default is "5 minutes". You can adjust this value as needed.

# Step 5: Run the DSBulk `load` command to upload data from the CSV file to the target table
<a name="dsbulk-upload-run"></a>

In the final step of this tutorial, you upload the data into Amazon Keyspaces.

To run the DSBulk `load` command, complete the following steps.

1. Run the following code to upload the data from your CSV file to your Amazon Keyspaces table. Make sure to update the path to the application configuration file that you created earlier.

   ```
   dsbulk load -f ./dsbulk_keyspaces.conf  --connector.csv.url keyspace.table.csv -header true --batch.mode DISABLED --executor.maxPerSecond 5 --driver.basic.request.timeout "5 minutes" --driver.advanced.retry-policy.max-retries 10 -k catalog -t book_awards
   ```

1. The output includes the location of a log file that details successful and unsuccessful operations. The file is stored in the following directory.

   ```
   Operation directory: /home/user_name/logs/LOAD_20210308-202317-801911
   ```

1. The log file entries include metrics, as in the following example. Check to make sure that the number of rows is consistent with the number of rows in your CSV file.

   ```
   total | failed | rows/s | p50ms | p99ms | p999ms
      200 |      0 |    200 | 21.63 | 21.89 |  21.89
   ```
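If you want to check the totals programmatically, the summary line can be split on the `|` delimiter. This sketch is illustrative and assumes the totals line format shown above.

```python
def parse_totals(line):
    """Split a DSBulk totals row on '|' and return (total, failed)."""
    fields = [field.strip() for field in line.split("|")]
    return int(fields[0]), int(fields[1])
```

For example, parsing the sample line above yields `(200, 0)`, which you can compare against the row count of your source CSV file.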

**Important**  
Now that you have transferred your data, adjust the capacity mode settings of your target table to match your application’s regular traffic patterns. You incur charges at the hourly rate for your provisioned capacity until you change it. For more information, see [Configure read/write capacity modes in Amazon Keyspaces](ReadWriteCapacityMode.md).