

# Create a migration plan for migrating from Apache Cassandra to Amazon Keyspaces
<a name="migrating-cassandra"></a>

For a successful migration from Apache Cassandra to Amazon Keyspaces, we recommend a review of the applicable migration concepts and best practices as well as a comparison of the available options. 

This topic outlines how the migration process works by introducing several key concepts and the tools and techniques available to you. You can evaluate the different migration strategies to select the one that best meets your requirements.

**Topics**
+ [Functional compatibility](#migrating-cassandra-compatibility)
+ [Estimate Amazon Keyspaces pricing](#migrating-cassandra-sizing)
+ [Choose a migration strategy](#migrating-cassandra-strategy)
+ [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md)
+ [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md)
+ [Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces](migrating-hybrid.md)

## Functional compatibility
<a name="migrating-cassandra-compatibility"></a>

Consider the functional differences between Apache Cassandra and Amazon Keyspaces carefully before the migration. Amazon Keyspaces supports all commonly used Cassandra data-plane operations, such as creating keyspaces and tables, reading data, and writing data.

However, there are some Cassandra APIs that Amazon Keyspaces doesn't support. For more information about supported APIs, see [Supported Cassandra APIs, operations, functions, and data types](cassandra-apis.md). For an overview of all functional differences between Amazon Keyspaces and Apache Cassandra, see [Functional differences: Amazon Keyspaces vs. Apache Cassandra](functional-differences.md). 

To compare the Cassandra APIs and schema that you're using with supported functionality in Amazon Keyspaces, you can run a compatibility script available in the Amazon Keyspaces toolkit on [GitHub](https://github.com/aws-samples/amazon-keyspaces-toolkit/blob/master/bin/toolkit-compat-tool.py). 

**How to use the compatibility script**

1. Download the compatibility Python script from [GitHub](https://github.com/aws-samples/amazon-keyspaces-toolkit/blob/master/bin/toolkit-compat-tool.py) and move it to a location that has access to your existing Apache Cassandra cluster.

1. The compatibility script uses parameters similar to `cqlsh`. For `--host` and `--port`, enter the IP address and the port that you use to connect to and run queries against one of the Cassandra nodes in your cluster. 

   If your Cassandra cluster uses authentication, you also need to provide `--username` and `--password`. To run the compatibility script, you can use the following command.

   ```
   python toolkit-compat-tool.py --host <hostname or IP> -u "username" -p "password" --port <native transport port>
   ```

## Estimate Amazon Keyspaces pricing
<a name="migrating-cassandra-sizing"></a>

This section provides an overview of the information you need to gather from your Apache Cassandra tables to calculate the estimated cost for Amazon Keyspaces. Each one of your tables requires different data types, needs to support different CQL queries, and maintains distinctive read/write traffic. 

Thinking of your requirements based on tables aligns with Amazon Keyspaces table-level resource isolation and [read/write throughput capacity modes](ReadWriteCapacityMode.md). With Amazon Keyspaces, you can define read/write capacity and [automatic scaling policies](autoscaling.md) for tables independently. 

Understanding table requirements helps you prioritize tables for migration based on functionality, cost, and migration effort.

Collect the following Cassandra table metrics before a migration. This information helps to estimate the cost of your workload on Amazon Keyspaces. 
+ **Table name** – The fully qualified keyspace and table name.
+ **Description** – A description of the table, for example how it’s used, or what type of data is stored in it.
+ **Average reads per second** – The average number of coordinator-level reads against the table over a large time interval.
+ **Average writes per second** – The average number of coordinator-level writes against the table over a large time interval.
+ **Average row size in bytes** – The average row size in bytes. 
+ **Storage size in GBs** – The raw storage size for a table.
+ **Read consistency breakdown** – The percentage of reads that use eventual consistency (`LOCAL_ONE` or `ONE`) vs. strong consistency (`LOCAL_QUORUM`).

This table shows an example of the information about your tables that you need to pull together when planning a migration.



| Table name | Description | Average reads per second | Average writes per second | Average row size in bytes | Storage size in GBs | Read consistency breakdown | 
| --- | --- | --- | --- | --- | --- | --- | 
|  mykeyspace.mytable  |  Used to store shopping cart history  |  10,000  |  5,000  | 2,200 | 2,000 | 100% `LOCAL_ONE` | 
| mykeyspace.mytable2 | Used to store latest profile information | 20,000 | 1,000 | 850 | 1,000 | 25% `LOCAL_QUORUM` 75% `LOCAL_ONE` | 

### How to collect table metrics
<a name="migrating-table-metrics"></a>

This section provides step-by-step instructions on how to collect the necessary table metrics from your existing Cassandra cluster. These metrics include row size, table size, and read/write requests per second (RPS). They allow you to assess the throughput capacity requirements for an Amazon Keyspaces table and estimate pricing.

**How to collect table metrics on the Cassandra source table**

1. Determine row size

   Row size is important for determining the read capacity and write capacity utilization in Amazon Keyspaces. The following diagram shows the typical data distribution over a Cassandra token range.   
![\[A diagram showing the typical data distribution over a Cassandra token range using the murmur3 partitioner.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-data-distribution.png)

   You can use a row size sampler script available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-toolkit/blob/master/bin/row-size-sampler.sh) to collect row size metrics for each table in your Cassandra cluster. 

   The script exports table data from Apache Cassandra by using `cqlsh` and `awk` to calculate the min, max, average, and standard deviation of row size over a configurable sample set of table data. The row size sampler passes the arguments to `cqlsh`, so the same parameters can be used to connect and read from your Cassandra cluster. 

   The following statement is an example of this.

   ```
   ./row-size-sampler.sh 10.22.33.44 9142 \
      -u "username" -p "password" --ssl
   ```

   For more information on how row size is calculated in Amazon Keyspaces, see [Estimate row size in Amazon Keyspaces](calculating-row-size.md).
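
   The statistics that the sampler emits can be sketched in plain Python. The sample values below are hypothetical; in practice they would come from the `cqlsh` export that the script performs.

   ```python
   import math

   def row_size_stats(row_sizes_bytes):
       """Summarize sampled row sizes the way the sampler script does:
       min, max, average, and (population) standard deviation."""
       n = len(row_sizes_bytes)
       mean = sum(row_sizes_bytes) / n
       variance = sum((size - mean) ** 2 for size in row_sizes_bytes) / n
       return {
           "min": min(row_sizes_bytes),
           "max": max(row_sizes_bytes),
           "mean": mean,
           "stddev": math.sqrt(variance),
       }

   # Hypothetical sample of row sizes in bytes
   stats = row_size_stats([1800, 2200, 2600])
   ```

   The `mean` value is the average row size in bytes that feeds into the capacity estimates later in this topic.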

1. Determine table size

   With Amazon Keyspaces, you don't need to provision storage in advance. Amazon Keyspaces monitors the billable size of your tables continuously to determine your storage charges. Storage is billed per GB-month. Amazon Keyspaces table size is based on the raw size (uncompressed) of a single replica. 

   To monitor the table size in Amazon Keyspaces, you can use the metric `BillableTableSizeInBytes`, which is displayed for each table in the AWS Management Console.

   To estimate the billable size of your Amazon Keyspaces table, you can use either of the following two methods:
   + Use the average row size and multiply it by the number of rows.

     You can estimate the size of the Amazon Keyspaces table by multiplying the average row size by the number of rows from your Cassandra source table. Use the row size sample script from the previous section to capture the average row size. To capture the row count, you can use tools like `dsbulk count` to determine the total number of rows in your source table. 
   + Use the `nodetool` to gather table metadata.

      `nodetool` is an administrative tool included in the Apache Cassandra distribution that provides insight into the state of the Cassandra process and returns table metadata. You can use `nodetool` to sample metadata about table size and, based on that, extrapolate the table size in Amazon Keyspaces. 

      The command to use is `nodetool tablestats`. `tablestats` returns the table's size and compression ratio. The table's size is stored as `tablelivespace` for the table. Divide it by the `compression ratio`, then multiply this value by the number of nodes. Finally, divide by the replication factor (typically three). 

     This is the complete formula for the calculation that you can use to assess table size.

     ```
     ((tablelivespace / compression ratio) * (total number of nodes))/ (replication factor)
     ```

     Let's assume that your Cassandra cluster has 12 nodes. Running the `nodetool tablestats` command returns a `tablelivespace` of 200 GB and a `compression ratio` of 0.5. The keyspace has a replication factor of three. 

      This is what the calculation for this example looks like.

     ```
     (200 GB / 0.5) * (12 nodes)/ (replication factor of 3)
                             = 4,800 GB / 3
                             = 1,600 GB is the table size estimate for Amazon Keyspaces
     ```
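
   The same calculation can be expressed as a short Python helper. The input values are the ones from the example above.

   ```python
   def estimate_table_size_gb(table_live_space_gb, compression_ratio,
                              node_count, replication_factor=3):
       """Estimate the billable (uncompressed, single-replica) table size in
       Amazon Keyspaces: ((tablelivespace / compression ratio) * nodes) / RF."""
       uncompressed = table_live_space_gb / compression_ratio
       return (uncompressed * node_count) / replication_factor

   # Example from the text: 12 nodes, 200 GB live space, 0.5 compression ratio, RF of 3
   size_gb = estimate_table_size_gb(200, 0.5, 12)  # 1600.0 GB
   ```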

1. Capture the number of reads and writes

   To determine the capacity and scaling requirements for your Amazon Keyspaces tables, capture the read and write request rate of your Cassandra tables before the migration. 

   Amazon Keyspaces is serverless and you only pay for what you use. In general, the price of read/write throughput in Amazon Keyspaces is based on the number and size of the requests. 

   There are two capacity modes in Amazon Keyspaces:
   + [On-demand](ReadWriteCapacityMode.OnDemand.md) – This is a flexible billing option capable of serving thousands of requests per second without the need for capacity planning. It offers pay-per-request pricing for read and write requests so that you pay only for what you use.
   + [Provisioned](ReadWriteCapacityMode.Provisioned.md) – If you choose provisioned throughput capacity mode, you specify the number of reads and writes per second that are required for your application. This helps you manage your Amazon Keyspaces usage to stay at or below a defined request rate to maintain predictability. 

     Provisioned mode offers [auto scaling](autoscaling.md) to automatically adjust your provisioned rate to scale up or scale down to improve operational efficiency. For more information about serverless resource management, see [Managing serverless resources in Amazon Keyspaces (for Apache Cassandra)](serverless_resource_management.md).

   Because you provision read and write throughput capacity in Amazon Keyspaces separately, you need to measure the request rate for reads and writes in your existing tables independently. 

   To gather the most accurate utilization metrics from your existing Cassandra cluster, capture the average requests per second (RPS) for coordinator-level read and write operations over an extended period of time, aggregated across all nodes in a single data center. 

   Capturing the average RPS over a period of at least several weeks captures peaks and valleys in your traffic patterns, as shown in the following diagram.  
![\[A diagram showing the average rate of requests per second per day over a period of two weeks.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-rps.png)

   You have two options to determine the read and write request rate of your Cassandra table.
   + Use existing Cassandra monitoring

      You can use the metrics exposed by your existing monitoring tool to observe coordinator-level read and write requests. Note that the metric names can change based on the monitoring tool that you're using. For the applicable metrics, see the [AWS documentation website](http://docs.aws.amazon.com/keyspaces/latest/devguide/migrating-cassandra.html).
   + Use the `nodetool`

      Use `nodetool tablestats` and `nodetool info` to capture the average read and write operations from the table. `tablestats` returns the total read and write counts since the node was started. `nodetool info` provides the uptime for a node in seconds.

      To get the per-second average of reads and writes, divide the read and write counts by the node uptime in seconds. Then, for reads you divide by the consistency level, and for writes you divide by the replication factor. These calculations are expressed in the following formulas. 

     Formula for average reads per second:

     ```
     ((number of reads * number of nodes in cluster) / read consistency quorum (2)) / uptime
     ```

     Formula for average writes per second:

     ```
     ((number of writes * number of nodes in cluster) / replication factor of 3) / uptime
     ```

      Let's assume we have a 12-node cluster that has been up for 4 weeks. `nodetool info` returns 2,419,200 seconds of uptime, and `nodetool tablestats` returns 1 billion writes and 2 billion reads. This example results in the following calculation.

     ```
      ((2 billion reads * 12 nodes) / read consistency quorum (2)) / 2,419,200 seconds
      = 12 billion reads / 2,419,200 seconds
      = 4,960 read requests per second

      ((1 billion writes * 12 nodes) / replication factor of 3) / 2,419,200 seconds
      = 4 billion writes / 2,419,200 seconds
      = 1,653 write requests per second
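
   Expressed as Python, the two formulas share one shape; only the divisor changes. The numbers are the ones from the example above.

   ```python
   def avg_requests_per_second(operation_count, node_count, divisor, uptime_seconds):
       """((operations * nodes) / divisor) / uptime. For reads, the divisor is
       the read consistency quorum (2 for LOCAL_QUORUM at RF 3); for writes,
       it's the replication factor (typically 3)."""
       return (operation_count * node_count) / divisor / uptime_seconds

   uptime = 4 * 7 * 24 * 60 * 60  # 4 weeks = 2,419,200 seconds
   reads_rps = avg_requests_per_second(2_000_000_000, 12, 2, uptime)   # ~4,960
   writes_rps = avg_requests_per_second(1_000_000_000, 12, 3, uptime)  # ~1,653
   ```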

1. Determine the capacity utilization of the table

   To estimate the average capacity utilization, start with the average request rates and the average row size of your Cassandra source table.

   Amazon Keyspaces uses *read capacity units* (RCUs) and *write capacity units* (WCUs) to measure provisioned throughput capacity for reads and writes for tables. For this estimate we use these units to calculate the read and write capacity needs of the new Amazon Keyspaces table after migration. 

   Later in this topic we'll discuss how the choice between provisioned and on-demand capacity mode affects billing. But for the estimate of capacity utilization in this example, we assume that the table is in provisioned mode.
   + **Reads** – One RCU represents one `LOCAL_QUORUM` read request, or two `LOCAL_ONE` read requests, for a row up to 4 KB in size. If you need to read a row that is larger than 4 KB, the read operation uses additional RCUs. The total number of RCUs required depends on the row size, and whether you want to use `LOCAL_QUORUM` or `LOCAL_ONE` read consistency. 

     For example, reading an 8 KB row requires 2 RCUs using `LOCAL_QUORUM` read consistency, and 1 RCU if you choose `LOCAL_ONE` read consistency. 
   + **Writes** – One WCU represents one write for a row up to 1 KB in size. All writes are using `LOCAL_QUORUM` consistency, and there is no additional charge for using lightweight transactions (LWTs). 

     The total number of WCUs required depends on the row size. If you need to write a row that is larger than 1 KB, the write operation uses additional WCUs. For example, if your row size is 2 KB, you require 2 WCUs to perform one write request. 

   The following formula can be used to estimate the required RCUs and WCUs. 
   + **Read capacity in RCUs** can be determined by multiplying the read requests per second by the average row size divided by 4 KB, rounded up to the nearest whole number.
   + **Write capacity in WCUs** can be determined by multiplying the write requests per second by the average row size divided by 1 KB, rounded up to the nearest whole number. 

   This is expressed in the following formulas.

   ```
   Read requests per second * ROUNDUP(Average Row Size / 4096 per unit) = RCUs per second

   Write requests per second * ROUNDUP(Average Row Size / 1024 per unit) = WCUs per second
   ```

   For example, if you're performing 4,960 read requests per second with a row size of 2.5 KB on your Cassandra table, you need 4,960 RCUs per second in Amazon Keyspaces. If you're currently performing 1,653 write requests per second with a row size of 2.5 KB on your Cassandra table, you need 4,959 WCUs per second in Amazon Keyspaces. 

   This example is expressed in the following formulas.

   ```
   4,960 read requests per second * ROUNDUP(2.5KB / 4KB per unit)
   = 4,960 read requests per second * 1 RCU
   = 4,960 RCUs

   1,653 write requests per second * ROUNDUP(2.5KB / 1KB per unit)
   = 1,653 write requests per second * 3 WCUs
   = 4,959 WCUs
   ```

   Using eventual consistency allows you to save up to half of the throughput capacity on each read request. Each eventually consistent read can consume up to 8 KB per RCU. You can calculate eventually consistent reads by multiplying the previous calculation by 0.5, as shown in the following formula. 

   ```
   4,960 read requests per second * ROUNDUP(2.5KB / 4KB per unit) * 0.5
   = 2,480 read requests per second * 1 RCU
   = 2,480 RCUs
   ```
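
   The capacity formulas, including the eventual consistency discount, can be sketched like this. The request rates and the 2.5 KB row size are the ones from the examples above.

   ```python
   import math

   def rcus_per_second(reads_per_second, avg_row_size_bytes, eventual=False):
       """One RCU covers a LOCAL_QUORUM read of a row up to 4 KB;
       eventually consistent (LOCAL_ONE) reads consume half as much."""
       rcus = reads_per_second * math.ceil(avg_row_size_bytes / 4096)
       return rcus * 0.5 if eventual else rcus

   def wcus_per_second(writes_per_second, avg_row_size_bytes):
       """One WCU covers a LOCAL_QUORUM write of a row up to 1 KB."""
       return writes_per_second * math.ceil(avg_row_size_bytes / 1024)

   row_size = 2.5 * 1024  # 2.5 KB in bytes

   strong_rcus = rcus_per_second(4960, row_size)          # 4,960 RCUs
   eventual_rcus = rcus_per_second(4960, row_size, True)  # 2,480 RCUs
   wcus = wcus_per_second(1653, row_size)                 # 4,959 WCUs
   ```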

1. Calculate the monthly pricing estimate for Amazon Keyspaces

   To estimate the monthly billing for the table based on read/write capacity throughput, you can calculate the pricing for on-demand and for provisioned mode using different formulas and compare the options for your table. 

   **Provisioned mode** – Read and write capacity consumption is billed at an hourly rate based on the capacity units per second. First, divide that rate by 0.7 to account for the default auto scaling target utilization of 70%. Then multiply by 30 calendar days, 24 hours per day, and the regional rate pricing. 

   This calculation is summarized in the following formulas.

   ```
   (read capacity per second / 0.7) * 24 hours * 30 days * regional rate

   (write capacity per second / 0.7) * 24 hours * 30 days * regional rate
   ```

   **On-demand mode** – Read and write capacity is billed at a per-request rate. First, multiply the per-second request rate by the number of seconds in a month (30 days * 24 hours * 60 minutes * 60 seconds). Then divide by one million request units. Finally, multiply by the regional rate. 

   This calculation is summarized in the following formulas. 

   ```
   ((read capacity per second * 30 * 24 * 60 * 60) / 1 Million read request units) * regional rate
                   ((write capacity per second * 30 * 24 * 60 * 60) / 1 Million write request units) * regional rate
   ```
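
   The following sketch compares both billing modes for the capacity numbers in this example. The regional rates below are placeholders for illustration only, not actual Amazon Keyspaces pricing; substitute the published rates for your AWS Region.

   ```python
   HOURS_PER_MONTH = 30 * 24
   SECONDS_PER_MONTH = 30 * 24 * 60 * 60
   AUTOSCALING_TARGET = 0.7  # default auto scaling target utilization of 70%

   def provisioned_monthly_cost(capacity_units_per_second, hourly_rate_per_unit):
       """(capacity per second / 0.7) * 24 hours * 30 days * regional rate."""
       return (capacity_units_per_second / AUTOSCALING_TARGET) * HOURS_PER_MONTH * hourly_rate_per_unit

   def on_demand_monthly_cost(requests_per_second, rate_per_million_requests):
       """((requests per second * seconds per month) / 1 million) * regional rate."""
       return requests_per_second * SECONDS_PER_MONTH / 1_000_000 * rate_per_million_requests

   # Placeholder rates, not actual pricing
   provisioned_reads = provisioned_monthly_cost(4960, 0.00013)
   on_demand_writes = on_demand_monthly_cost(1653, 1.45)
   ```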

## Choose a migration strategy
<a name="migrating-cassandra-strategy"></a>

You can choose between the following migration strategies when migrating from Apache Cassandra to Amazon Keyspaces:
+ **Online** – This is a live migration using dual writes to start writing new data to Amazon Keyspaces and the Cassandra cluster simultaneously. This migration type is recommended for applications that require zero downtime during migration and read after write consistency.

  For more information about how to plan and implement an online migration strategy, see [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md).
+ **Offline** – This migration technique involves copying a data set from Cassandra to Amazon Keyspaces during a downtime window. Offline migration can simplify the migration process, because it doesn't require changes to your application or conflict resolution between historical data and new writes.

  For more information about how to plan an offline migration, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md).
+ **Hybrid** – This migration technique allows for changes to be replicated to Amazon Keyspaces in near real time, but without read after write consistency. 

  For more information about how to plan a hybrid migration, see [Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces](migrating-hybrid.md).

After reviewing the migration techniques and best practices discussed in this topic, you can place the available options in a decision tree to design a migration strategy based on your requirements and available resources.

# Online migration to Amazon Keyspaces: strategies and best practices
<a name="migrating-online"></a>

If you need to maintain application availability during a migration from Apache Cassandra to Amazon Keyspaces, you can prepare a custom online migration strategy by implementing the key components discussed in this topic. By following these best practices for online migrations, you can ensure that application availability and read-after-write consistency are maintained during the entire migration process, minimizing the impact on your users.

When designing an online migration strategy from Apache Cassandra to Amazon Keyspaces, you need to consider the following key steps.

1. **Writing new data**
   + **ZDM Dual Write Proxy for Amazon Keyspaces Migration** – Use the ZDM Dual Write Proxy available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/blob/main/migration/online/zdm-proxy/README.md) to perform a zero-downtime migration from Apache Cassandra to Amazon Keyspaces. The ZDM Proxy performs dual writes without the need to refactor existing applications and performs dual reads for query validation.
   + **Application dual writes** – You can implement dual writes in your application using existing Cassandra client libraries and drivers. Designate one database as the leader and the other as the follower. Write failures to the follower database are recorded in a [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) for analysis.
   + **Messaging tier dual writes** – Alternatively, you can configure your existing messaging platform to send writes to both Cassandra and Amazon Keyspaces using an additional consumer. This creates eventually consistent views across both databases.

1. **Migrating historical data**
   + Copy historical data: You can migrate historical data from Cassandra to Amazon Keyspaces using AWS Glue or custom extract, transform, and load (ETL) scripts. Handle conflict resolution between dual writes and bulk loads using techniques like lightweight transactions or timestamps.
   + Use Time-To-Live (TTL): For shorter data retention periods, you can use TTL in both Cassandra and Amazon Keyspaces to avoid uploading unnecessary historical data. As old data expires in Cassandra and new data is written via dual-writes, Amazon Keyspaces eventually catches up.

1. **Validating data**
   + Dual reads: Implement dual reads from both Cassandra (primary) and Amazon Keyspaces (secondary) databases, comparing results asynchronously. Differences are logged or sent to a DLQ.
   + Sample reads: Use AWS Lambda functions to periodically sample and compare data across both systems, logging any discrepancies to a DLQ.

1. **Migrating the application**
   + Blue-green strategy: Switch your application to treat Amazon Keyspaces as the primary and Cassandra as the secondary data store in a single step. Monitor performance and roll back if issues arise.
   + Canary deployment: Gradually roll out the migration to a subset of users first, incrementally increasing traffic to Amazon Keyspaces as the primary until fully migrated.

1. **Decommissioning Cassandra**

   Once your application is fully migrated to Amazon Keyspaces and data consistency is validated, you can plan to decommission your Cassandra cluster based on data retention policies.

By planning an online migration strategy with these components, you can transition smoothly to the fully managed Amazon Keyspaces service with minimal downtime or disruption. The following sections go into each component in more detail.

**Topics**
+ [Writing new data during an online migration](migration-online-dw.md)
+ [Uploading historical data during an online migration](migration-online-historical.md)
+ [Validating data consistency during an online migration](migration-online-validation.md)
+ [Migrating the application during an online migration](migration-online-app-migration.md)
+ [Decommissioning Cassandra after an online migration](migration-online-decommission.md)

# Writing new data during an online migration
<a name="migration-online-dw"></a>

The first step in an online migration plan is to ensure that any new data written by the application is stored in both databases, your existing Cassandra cluster and Amazon Keyspaces. The goal is to provide a consistent view across the two data stores. You can do this by applying all new writes to both databases. To implement dual writes, consider one of the following three options.
+ **ZDM Dual Write Proxy for Amazon Keyspaces Migration** – Using the ZDM Proxy for Amazon Keyspaces available on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/blob/main/migration/online/zdm-proxy/README.md), you can migrate your Apache Cassandra workloads to Amazon Keyspaces without application downtime. This enhanced solution implements AWS best practices and extends the official ZDM Proxy capabilities.
  + Perform online migrations between Apache Cassandra and Amazon Keyspaces.
  + Write data to both source and target tables simultaneously without refactoring applications.
  + Validate queries through dual-read operations.

  The solution offers the following enhancements to work with AWS and Amazon Keyspaces.
  + **Container deployment** – Use a pre-configured Docker image from Amazon Elastic Container Registry (Amazon ECR) for VPC-accessible deployments.
  + **Infrastructure as code** – Deploy using AWS CloudFormation templates for automated setup on AWS Fargate.
  + **Amazon Keyspaces compatibility** – Access system tables with custom adaptations for Amazon Keyspaces.

  The solution runs on Amazon ECS with Fargate, providing serverless scalability based on your workload demands. A network load balancer distributes incoming application traffic across multiple Amazon ECS tasks for high availability.  
![\[Implementing the ZDM dual write proxy for migrating data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-zdm.png)
+ **Application dual writes** – You can implement dual writes with minimal changes to your application code by leveraging the existing Cassandra client libraries and drivers. You can either implement dual writes in your existing application, or create a new layer in the architecture to handle dual writes. For more information and a customer case study that shows how dual writes were implemented in an existing application, see [Cassandra migration case study](https://aws.amazon.com/solutions/case-studies/intuit-apache-migration-case-study/).

  When implementing dual writes, you can designate one database as the leader and the other database as the follower. This allows you to keep writing to your original source, or leader, database without letting write failures to the follower, or destination, database disrupt the critical path of your application.

  Instead of retrying failed writes to the follower, you can use Amazon Simple Queue Service to record failed writes in a [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html). The DLQ lets you analyze the failed writes to the follower and determine why processing did not succeed in the destination database.

  For a more sophisticated dual write implementation, you can follow AWS best practices for designing a sequence of local transactions using the [saga pattern](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/saga.html). A saga pattern ensures that if a transaction fails, the saga runs compensating transactions to revert the database changes made by the previous transactions. 

  When using dual writes for an online migration, you can configure the dual writes following the saga pattern so that each write is a local transaction to ensure atomic operations across heterogeneous databases. For more information about designing distributed applications using recommended design patterns for the AWS Cloud, see [Cloud design patterns, architectures, and implementations](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/introduction).  
![\[Implementing dual writes at the application layer when migrating from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-dual-writes.png)
+ **Messaging tier dual writes** – Instead of implementing dual writes at the application layer, you can use your existing messaging tier to perform dual writes to Cassandra and Amazon Keyspaces. 

  To do this, you can configure an additional consumer for your messaging platform to send writes to both data stores. This approach provides a simple, low-code strategy that uses the messaging tier to create two views across both databases that are eventually consistent. 
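
The leader/follower pattern with a DLQ, common to the application-tier and messaging-tier approaches, can be sketched independently of any particular driver. The callables below are placeholders for your actual Cassandra write, Amazon Keyspaces write, and Amazon SQS `send_message` calls.

```python
def dual_write(record, write_leader, write_follower, send_to_dlq):
    """Write to the leader first; record follower failures in a dead letter
    queue instead of disrupting the application's critical path."""
    write_leader(record)  # a leader failure still propagates to the caller
    try:
        write_follower(record)
    except Exception as err:
        # keep enough context in the DLQ message to analyze the failure later
        send_to_dlq({"record": record, "error": str(err)})

# Minimal in-memory stand-ins to exercise the logic
leader, follower, dlq = [], [], []

def flaky_follower(record):
    if record.get("fail"):
        raise RuntimeError("follower unavailable")
    follower.append(record)

dual_write({"id": 1}, leader.append, flaky_follower, dlq.append)
dual_write({"id": 2, "fail": True}, leader.append, flaky_follower, dlq.append)
```

Both writes land in the leader, while the failed follower write ends up in the DLQ for later analysis.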

# Uploading historical data during an online migration
<a name="migration-online-historical"></a>

After implementing dual writes to ensure that new data is written to both data stores in real time, the next step in the migration plan is to evaluate how much historical data you need to copy or bulk upload from Cassandra to Amazon Keyspaces. This ensures that both new data and historical data are available in the new Amazon Keyspaces database before you migrate the application. 

Depending on your data retention requirements, for example how much historical data you need to preserve based on your organization's policies, you can consider one of the following two options.
+ **Bulk upload of historical data** – The migration of historical data from your existing Cassandra deployment to Amazon Keyspaces can be achieved through various techniques, for example using AWS Glue or custom scripts to extract, transform, and load (ETL) the data. For more information about using AWS Glue to upload historical data, see [Offline migration process: Apache Cassandra to Amazon Keyspaces](migrating-offline.md). 

  When planning the bulk upload of historical data, you need to consider how to resolve conflicts that can occur when new writes try to update the same data that is in the process of being uploaded. The bulk upload is expected to be eventually consistent, which means the data is going to reach all nodes eventually. 

  If an update of the same data occurs at the same time due to a new write, you want to ensure that it's not going to be overwritten by the historical data upload. To ensure that you preserve the latest updates to your data even during the bulk import, you must add conflict resolution either into the bulk upload scripts or into the application logic for dual writes. 

  For example, you can use [Lightweight transactions](functional-differences.md#functional-differences.light-transactions) (LWT) to perform compare-and-set operations. To do this, you can add an additional field to your data model that represents the time of modification or state. 

  Additionally, Amazon Keyspaces supports the Cassandra `WRITETIME` timestamp function. You can use Amazon Keyspaces client-side timestamps to preserve source database timestamps and implement last-writer-wins conflict resolution. For more information, see [Client-side timestamps in Amazon Keyspaces](client-side-timestamps.md).
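
  The last-writer-wins decision can be sketched as a timestamp comparison. In a real migration, the stored timestamp would come from the `WRITETIME` function and the incoming one from the source record; the helper below only illustrates the decision logic.

  ```python
  def should_apply(stored_writetime_us, incoming_writetime_us):
      """Apply the incoming write only if it's newer than what's stored
      (last-writer-wins). None means the row doesn't exist yet."""
      if stored_writetime_us is None:
          return True
      return incoming_writetime_us > stored_writetime_us

  # A bulk-loaded historical row must not overwrite a newer dual write
  apply_historical = should_apply(stored_writetime_us=1_700_000_500_000_000,
                                  incoming_writetime_us=1_700_000_000_000_000)
  ```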
+ **Using Time-to-Live (TTL)** – For data retention periods shorter than 30, 60, or 90 days, you can use TTL in Cassandra and Amazon Keyspaces during migration to avoid uploading unnecessary historical data to Amazon Keyspaces. TTL allows you to set a time period after which the data is automatically removed from the database. 

  During the migration phase, instead of copying historical data to Amazon Keyspaces, you can configure the TTL settings to let the historical data expire automatically in the old system (Cassandra) while only applying the new writes to Amazon Keyspaces using the dual-write method. Over time and with old data continually expiring in the Cassandra cluster and new data written using the dual-write method, Amazon Keyspaces automatically catches up to contain the same data as Cassandra.

   This approach can significantly reduce the amount of data to be migrated, resulting in a more efficient and streamlined migration process. You can consider this approach when dealing with large datasets with varying data retention requirements. For more information about TTL, see [Expire data with Time to Live (TTL) for Amazon Keyspaces (for Apache Cassandra)](TTL.md).

  Consider the following example of a migration from Cassandra to Amazon Keyspaces using TTL data expiration. In this example, we set TTL for both databases to 60 days and show how the migration process progresses over a period of 90 days. Both databases receive the same newly written data during this period using the dual-write method. We look at three different phases of the migration, each of which is 30 days long. 

  The following image shows how the migration process works in each phase.   
![\[Using TTL to expire historical data when migrating from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-TTL.png)

  1. After the first 30 days, the Cassandra cluster and Amazon Keyspaces have been receiving new writes. The Cassandra cluster also contains historical data that has not yet reached 60 days of retention, which makes up 50% of the data in the cluster. 

     Data that is older than 60 days is being automatically deleted in the Cassandra cluster using TTL. At this point Amazon Keyspaces contains 50% of the data stored in the Cassandra cluster, which is made up of the new writes minus the historical data.

  1. After 60 days, both the Cassandra cluster and Amazon Keyspaces contain the same data written in the last 60 days.

  1. Within 90 days, both Cassandra and Amazon Keyspaces contain the same data and are expiring data at the same rate. 

  This example illustrates how to avoid the step of uploading historical data by using TTL with an expiration date set to 60 days.
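The last-writer-wins conflict resolution described earlier can be sketched in plain Python. This is a simplified illustration with a hypothetical in-memory store; in practice, the timestamp would come from the `WRITETIME` of the source row or from a client-side timestamp.

```python
def apply_write(store, key, value, write_time):
    """Last-writer-wins: only apply the write if it is newer than
    what the store already holds for this key."""
    current = store.get(key)
    if current is None or write_time > current["write_time"]:
        store[key] = {"value": value, "write_time": write_time}
        return True  # write applied
    return False     # stale write ignored

# A dual write lands first with a current timestamp ...
store = {}
apply_write(store, "user#1", {"email": "new@example.com"}, write_time=1700000200)

# ... then the bulk upload replays historical data with its original
# (older) source timestamp, which must not win.
applied = apply_write(store, "user#1", {"email": "old@example.com"}, write_time=1700000100)

print(applied)                            # False: historical row was stale
print(store["user#1"]["value"]["email"])  # new@example.com
```

The same comparison can be expressed server side with an LWT `IF` condition on the modification-time field, which avoids the read-modify-write round trip in application code.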

# Validating data consistency during an online migration
<a name="migration-online-validation"></a>

 The next step in the online migration process is data validation. Dual writes are adding new data to your Amazon Keyspaces database, and you have completed the migration of historical data using either bulk upload or data expiration with TTL. 

Now you can use the validation phase to confirm that both data stores in fact contain the same data and return the same read results. You can choose one of the following two options to validate that both your databases contain identical data. 
+ **Dual reads** – To validate that both the source and the destination database contain the same set of newly written and historical data, you can implement dual reads. To do so, you read from both your primary Cassandra database and your secondary Amazon Keyspaces database, similar to the dual-write method, and compare the results asynchronously. 

  The results from the primary database are returned to the client, and the results from the secondary database are used to validate against the primary result set. Differences found can be logged or sent to a [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) for later reconciliation. 

  In the following diagram, the application performs a synchronous read from Cassandra, which is the primary data store, and an asynchronous read from Amazon Keyspaces, which is the secondary data store.  
![\[Using dual reads to validate data consistency during an online migration from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-dual-reads.png)
+ **Sample reads** – An alternative solution that doesn’t require application code changes is to use AWS Lambda functions to periodically and randomly sample data from both the source Cassandra cluster and the destination Amazon Keyspaces database. 

  These Lambda functions can be configured to run at regular intervals. Each Lambda function retrieves a random subset of data from both the source and destination systems, and then performs a comparison of the sampled data. Any discrepancies or mismatches between the two datasets can be recorded and sent to a dedicated [dead letter queue (DLQ)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) for later reconciliation.

  This process is illustrated in the following diagram.  
![\[Using sample reads to validate data consistency during an online migration from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-sample-reads.png)
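The comparison step that both validation options share can be sketched as follows. This is a minimal, hypothetical illustration; in a real deployment the mismatches would be logged or sent to the DLQ rather than returned to the caller.

```python
def compare_results(primary_rows, secondary_rows, key_field="id"):
    """Compare the primary (Cassandra) result set with the secondary
    (Amazon Keyspaces) result set and collect mismatches for later
    reconciliation, for example by sending them to a DLQ."""
    secondary_by_key = {row[key_field]: row for row in secondary_rows}
    mismatches = []
    for row in primary_rows:
        other = secondary_by_key.get(row[key_field])
        if other != row:
            mismatches.append({"key": row[key_field],
                               "primary": row,
                               "secondary": other})
    return mismatches

primary = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
secondary = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}]

# In a dual-read setup, the primary rows are returned to the client
# immediately; this comparison runs asynchronously in the background.
dlq_candidates = compare_results(primary, secondary)
print(dlq_candidates)  # one mismatch, for key 2
```

With sample reads, the same function would be called inside a scheduled Lambda function against a random subset of keys instead of every read.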

# Migrating the application during an online migration
<a name="migration-online-app-migration"></a>

In the fourth phase of an online migration, you are migrating your application and transitioning to Amazon Keyspaces as the primary data store. This means that you switch your application to read and write directly from and to Amazon Keyspaces. To ensure minimal disruption to your users, this should be a well-planned and coordinated process. 

Two recommended solutions for application migration are available: the blue/green cutover strategy and the canary cutover strategy. The following sections outline these strategies in more detail. 
+ **Blue/green strategy** – Using this approach, you switch your application to treat Amazon Keyspaces as the primary data store and Cassandra as the secondary data store in a single step. You can do this using an AWS AppConfig feature flag to control the selection of primary and secondary data stores across the application instances. For more information about feature flags, see [Creating a feature flag configuration profile in AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/appconfig-creating-configuration-and-profile-feature-flags.html).

  After making Amazon Keyspaces the primary data store, you monitor the application's behavior and performance, ensuring that Amazon Keyspaces meets your requirements and that the migration is successful.

  For example, if you implemented dual reads for your application, during the application migration phase you transition the primary reads from Cassandra to Amazon Keyspaces and the secondary reads from Amazon Keyspaces to Cassandra. After the transition, you continue to monitor and compare results as described in the [data validation](migration-online-validation.md) section to ensure consistency across both databases before decommissioning Cassandra. 

  If you detect any issues, you can quickly roll back to the previous state by reverting to Cassandra as the primary data store. You only proceed to the decommissioning phase of the migration if Amazon Keyspaces is meeting all your needs as the primary data store.  
![\[Using the blue green strategy for migrating an application from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-switch.png)
+ **Canary strategy** – In this approach, you gradually roll out the migration to a subset of your users or traffic. Initially, a small percentage of your application's traffic, for example 5%, is routed to the version that uses Amazon Keyspaces as the primary data store, while the rest of the traffic continues to use Cassandra as the primary data store. 

  This allows you to test the migrated version thoroughly with real-world traffic, monitor its performance and stability, and investigate potential issues. If you don't detect any issues, you can incrementally increase the percentage of traffic routed to Amazon Keyspaces until it becomes the primary data store for all users and traffic. 

  This staged rollout minimizes the risk of widespread service disruptions and allows for a more controlled migration process. If any critical issues arise during the canary deployment, you can quickly roll back to the previous version that uses Cassandra as the primary data store for the affected traffic segment. You only proceed to the decommissioning phase of the migration after you have validated that Amazon Keyspaces serves 100% of your users and traffic as expected.

  The following diagram illustrates the individual steps of the canary strategy.  
![\[Using the canary strategy for migrating an application from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/online-migration-canary.png)
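The traffic-splitting idea behind the canary strategy can be sketched as follows. This is a hypothetical illustration; in practice, the routing decision is typically made by a load balancer, a deployment service, or a feature-flag service rather than in application code. Hashing the user ID keeps each user pinned to the same data store between requests.

```python
import hashlib

def routes_to_keyspaces(user_id: str, canary_percent: int) -> bool:
    """Deterministically route a fixed percentage of users to the
    version that uses Amazon Keyspaces as the primary data store."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in 0..99
    return bucket < canary_percent

# Start the canary at 5% of traffic; increase the percentage only
# after monitoring shows no issues at the current level.
users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_keyspaces(u, 5) for u in users) / len(users)
print(f"{share:.1%} of users routed to Amazon Keyspaces")  # roughly 5%
```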

# Decommissioning Cassandra after an online migration
<a name="migration-online-decommission"></a>

After the application migration is complete, your application is fully running on Amazon Keyspaces, and you have validated data consistency over a period of time, you can plan to decommission your Cassandra cluster. During this phase, you can evaluate whether the data remaining in your Cassandra cluster needs to be archived or can be deleted. This depends on your organization’s policies for data handling and retention.

By following this strategy and considering the recommended best practices described in this topic when planning your online migration from Cassandra to Amazon Keyspaces, you can ensure a seamless transition to Amazon Keyspaces while maintaining read-after-write consistency and availability of your application.

Migrating from Apache Cassandra to Amazon Keyspaces can provide numerous benefits, including reduced operational overhead, automatic scaling, improved security, and a framework that helps you to reach your compliance goals. By planning an online migration strategy with dual writes, historical data upload, data validation, and a gradual rollout, you can ensure a smooth transition with minimal disruption to your application and its users. 

Implementing the online migration strategy discussed in this topic allows you to validate the migration results, identify and address any issues, and ultimately decommission your existing Cassandra deployment in favor of the fully managed Amazon Keyspaces service. 

# Offline migration process: Apache Cassandra to Amazon Keyspaces
<a name="migrating-offline"></a>

Offline migrations are suitable when you can afford downtime to perform the migration. It's common among enterprises to have maintenance windows for patching, large releases, or downtime for hardware upgrades or major upgrades. Offline migration can use this window to copy data and switch over the application traffic from Apache Cassandra to Amazon Keyspaces.

Offline migration reduces modifications to the application because it doesn't require communication to both Cassandra and Amazon Keyspaces simultaneously. Additionally, with the data flow paused, the exact state can be copied without maintaining mutations.

In this example, we use Amazon Simple Storage Service (Amazon S3) as a staging area for data during the offline migration to minimize downtime. You can automatically import the data that you stored in Parquet format in Amazon S3 into an Amazon Keyspaces table using the Spark Cassandra connector and AWS Glue. The following section shows a high-level overview of the process. You can find code examples for this process on [GitHub](https://github.com/aws-samples/amazon-keyspaces-examples/tree/main/scala/datastax-v4/aws-glue).

The offline migration process from Apache Cassandra to Amazon Keyspaces using Amazon S3 and AWS Glue requires the following AWS Glue jobs.

1. An ETL job that extracts and transforms CQL data and stores it in an Amazon S3 bucket.

1. A second job that imports the data from the bucket to Amazon Keyspaces.

1. A third job to import incremental data.

**How to perform an offline migration to Amazon Keyspaces from Cassandra running on Amazon EC2 in an Amazon Virtual Private Cloud**

1. First, you use AWS Glue to export table data from Cassandra in Parquet format and save it to an Amazon S3 bucket. You need to run an AWS Glue job using an AWS Glue connector to the VPC where the Amazon EC2 instance running Cassandra resides. Then, using the Amazon S3 private endpoint, you can save data to the Amazon S3 bucket. 

   The following diagram illustrates these steps.  
![\[Migrating Apache Cassandra data from Amazon EC2 running in a VPC to an Amazon S3 bucket using AWS Glue.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-export.png)

1. Shuffle the data in the Amazon S3 bucket to improve data randomization. Evenly distributed data allows the import to spread write traffic more evenly across the partitions of the target table. 

   This step is required when exporting data from Cassandra with large partitions (partitions with more than 1000 rows) to avoid hot key patterns when inserting the data into Amazon Keyspaces. Hot key issues cause `WriteThrottleEvents` in Amazon Keyspaces and result in increased load time.   
![\[An AWS Glue job shuffles data from one Amazon S3 bucket and writes it into another Amazon S3 bucket.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-shuffle.png)

1. Use another AWS Glue job to import data from the Amazon S3 bucket into Amazon Keyspaces. The shuffled data in the Amazon S3 bucket is stored in Parquet format.  
![\[The AWS Glue import job takes shuffled data from the Amazon S3 bucket and moves it into an Amazon Keyspaces table.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/migration-import.png)
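In the actual AWS Glue jobs, the shuffle in step 2 is a Spark transformation over the Parquet files in Amazon S3. The following standalone Python sketch only illustrates why randomizing row order before import avoids hot keys; the function name and row layout are hypothetical.

```python
import random

def shuffle_exported_rows(rows, seed=None):
    """Randomize row order before import so that writes are spread
    across partitions instead of hitting one partition at a time."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Exported rows typically arrive grouped by partition key -- importing
# them in this order would hammer one partition at a time (a hot key)
# and cause WriteThrottleEvents in Amazon Keyspaces.
export = [{"pk": pk, "ck": ck} for pk in ("a", "b", "c") for ck in range(1000)]

load_order = shuffle_exported_rows(export, seed=42)
# The first rows of the import now mix partition keys instead of
# being all from partition "a".
print([row["pk"] for row in load_order[:8]])
```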

For more information about the offline migration process, see the workshop [Amazon Keyspaces with AWS Glue](https://catalog.workshops.aws/unlocking-amazonkeyspaces/en-US/keyspaces-with-glue).

# Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces
<a name="migrating-hybrid"></a>

The following migration solution can be considered a hybrid between online and offline migration. With this hybrid approach, data is written to the destination database in near real time without providing read-after-write consistency. This means that newly written data isn't immediately available and delays are to be expected. If you need read-after-write consistency, see [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md). 

For a near real time migration from Apache Cassandra to Amazon Keyspaces, you can choose between two available methods.
+ **CQLReplicator** – (Recommended) CQLReplicator is an open-source utility available on [GitHub](https://github.com/aws-samples/cql-replicator) that helps you to migrate data from Apache Cassandra to Amazon Keyspaces in near real time.

  To determine the writes and updates to propagate to the destination database, CQLReplicator scans the Apache Cassandra token range and uses an AWS Glue job to remove duplicate events and apply writes and updates directly to Amazon Keyspaces.
+ **Change data capture (CDC)** – Apache Cassandra's built-in CDC feature captures changes by copying the commit log to a separate CDC directory. If you are familiar with Cassandra CDC, this is another option for implementing a hybrid migration.

  You can do this by replicating the data changes to Amazon Keyspaces, making CDC an alternative option for data migration scenarios. 

If you don't need read-after-write consistency, you can use either CQLReplicator or a CDC pipeline to migrate data from Apache Cassandra to Amazon Keyspaces based on your preferences and familiarity with the tools and AWS services used in each solution. Using these methods to migrate data in near real time can be considered a hybrid approach to migration that offers an alternative to online migration.

This strategy is considered a hybrid approach because, in addition to the options outlined in this topic, you have to implement some steps of the online migration process, for example the historical data copy and the application migration strategies discussed in the [online migration](migrating-online.md) topic. 

The following sections go over the hybrid migration options in more detail.

**Topics**
+ [Migrate data using CQLReplicator](migration-hybrid-cql-rep.md)
+ [Migrate data using change data capture (CDC)](migration-hybrid-cdc.md)

# Migrate data using CQLReplicator
<a name="migration-hybrid-cql-rep"></a>

With [CQLReplicator](https://github.com/aws-samples/cql-replicator), you can read data from Apache Cassandra in near real time by intelligently scanning the Cassandra token ring using CQL queries. CQLReplicator doesn’t use Cassandra CDC and instead implements a caching strategy to reduce the performance penalty of full scans. 

To reduce the number of writes to the destination, CQLReplicator automatically removes duplicate replication events. With CQLReplicator, you can tune the replication of changes from the source database to the destination database, allowing for a near real time migration of data from Apache Cassandra to Amazon Keyspaces. 

The following diagram shows the typical architecture of a CQLReplicator job using AWS Glue. 

1. To allow access to Apache Cassandra running in a private VPC, configure an AWS Glue connection with the connection type **Network**.

1. To remove duplicates and enable key caching with the CQLReplicator job, configure Amazon Simple Storage Service (Amazon S3).

1. The CQLReplicator job streams verified source database changes directly to Amazon Keyspaces.

![\[Using CQLReplicator to migrate data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/hybrid-migration-CQLRep.png)
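CQLReplicator's actual implementation lives in the GitHub repository linked above. The following simplified Python sketch only illustrates the core idea of a token-range scan combined with a key cache that suppresses duplicate replication events; the function and field names are hypothetical.

```python
import hashlib
import json

def scan_and_replicate(source_rows, key_cache, write_to_target):
    """One pass over a token range: replicate only rows whose content
    changed since the last scan, using a hash cache to drop duplicates."""
    writes = 0
    for row in source_rows:
        key = row["pk"]
        fingerprint = hashlib.sha256(
            json.dumps(row, sort_keys=True).encode()).hexdigest()
        if key_cache.get(key) != fingerprint:
            write_to_target(row)          # propagate the insert/update
            key_cache[key] = fingerprint  # remember what was shipped
            writes += 1
    return writes

target = []
cache = {}  # in CQLReplicator, this role is played by Amazon S3
rows = [{"pk": "a", "v": 1}, {"pk": "b", "v": 1}]

first_pass = scan_and_replicate(rows, cache, target.append)   # both rows shipped
rows[1]["v"] = 2                                              # row "b" changes at the source
second_pass = scan_and_replicate(rows, cache, target.append)  # only "b" shipped
print(first_pass, second_pass)  # 2 1
```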


For more information about the migration process using CQLReplicator, see the AWS Database Blog post [Migrate Cassandra workloads to Amazon Keyspaces using CQLReplicator](https://aws.amazon.com/blogs/database/migrate-cassandra-workloads-to-amazon-keyspaces-using-cqlreplicator/) and the AWS Prescriptive Guidance pattern [Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-using-aws-glue.html).

# Migrate data using change data capture (CDC)
<a name="migration-hybrid-cdc"></a>

If you're already familiar with configuring a change data capture (CDC) pipeline with [Debezium](https://debezium.io/), you can use this option to migrate data to Amazon Keyspaces as an alternative to using CQLReplicator. Debezium is an open-source, distributed platform for CDC, designed to monitor a database and capture row-level changes reliably. 

The [Debezium connector for Apache Cassandra](https://debezium.io/documentation/reference/stable/connectors/cassandra.html) uploads changes to Amazon Managed Streaming for Apache Kafka (Amazon MSK) so that they can be consumed and processed by downstream consumers, which in turn write the data to Amazon Keyspaces. For more information, see [Guidance for continuous data migration from Apache Cassandra to Amazon Keyspaces](https://aws.amazon.com/solutions/guidance/continuous-data-migration-from-apache-cassandra-to-amazon-keyspaces/).

To address any potential data consistency issues, you can implement a process with Amazon MSK where a consumer compares the keys or partitions in Cassandra with those in Amazon Keyspaces.

To implement this solution successfully, we recommend that you consider the following. 
+ How to parse the CDC commit log, for example how to remove duplicate events.
+ How to maintain the CDC directory, for example how to delete old logs.
+ How to handle partial failures in Apache Cassandra, for example if a write only succeeds in one out of three replicas.
+ How to handle resource allocation, for example increasing the size of the instance to account for the additional CPU, memory, disk, and I/O requirements of the CDC process that occurs on a node.

This pattern treats changes from Cassandra as a "hint" that a key may have changed from its previous state. To determine if there are changes to propagate to the destination database, you must first read from the source Cassandra cluster using a `LOCAL_QUORUM` operation to receive the latest records and then write them to Amazon Keyspaces. 

In the case of range deletes or range updates, you may need to perform a comparison against the entire partition to determine which write or update events need to be written to your destination database. 

In cases where writes are not idempotent, you also need to compare your writes with what is already in the destination database before writing to Amazon Keyspaces.
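The hint pattern described above can be sketched as follows. This is a simplified illustration with in-memory stand-ins for the Cassandra `LOCAL_QUORUM` read and the Amazon Keyspaces table; the function names are hypothetical.

```python
def process_cdc_hint(key, read_source_quorum, destination):
    """Treat a CDC event as a hint: re-read the latest row from the
    source at LOCAL_QUORUM, then write it to the destination only if
    it differs from what the destination already holds."""
    # Authoritative read of the current state, not the raw commit-log entry.
    latest = read_source_quorum(key)
    if latest is not None and destination.get(key) != latest:
        destination[key] = latest
        return True   # change propagated
    return False      # duplicate or already-applied hint, skipped

# In-memory stand-ins for the source cluster and the destination table.
source = {"user#1": {"email": "new@example.com"}}
destination = {}

first = process_cdc_hint("user#1", source.get, destination)   # applies the change
replay = process_cdc_hint("user#1", source.get, destination)  # duplicate hint is a no-op
print(first, replay)  # True False
```

Comparing against the destination before writing is what makes the replication safe for non-idempotent writes and duplicate commit-log events; range deletes and range updates would additionally require comparing the entire partition.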

The following diagram shows the typical architecture of a CDC pipeline using Debezium and Amazon MSK. 

![\[Using a change data capture pipeline to migrate data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/hybrid-migration-CDC.png)
