

# Using a hybrid migration solution: Apache Cassandra to Amazon Keyspaces
<a name="migrating-hybrid"></a>

The following migration solution can be considered a hybrid between online and offline migration. With this hybrid approach, data is written to the destination database in near real time, but without read-after-write consistency. This means that newly written data isn't immediately available and you should expect delays. If you need read-after-write consistency, see [Online migration to Amazon Keyspaces: strategies and best practices](migrating-online.md). 

For a near real time migration from Apache Cassandra to Amazon Keyspaces, you can choose between two available methods.
+ **CQLReplicator** – (Recommended) CQLReplicator is an open source utility available on [GitHub](https://github.com/aws-samples/cql-replicator) that helps you migrate data from Apache Cassandra to Amazon Keyspaces in near real time.

  To determine the writes and updates to propagate to the destination database, CQLReplicator scans the Apache Cassandra token range and uses an AWS Glue job to remove duplicate events and apply writes and updates directly to Amazon Keyspaces.
+ **Change data capture (CDC)** – If you're familiar with Cassandra CDC, you can use Apache Cassandra's built-in CDC feature, which captures changes by copying the commit log to a separate CDC directory, as another option for implementing a hybrid migration.

  To use this option, you replicate the captured data changes to Amazon Keyspaces, making CDC an alternative option for data migration scenarios. 

If you don't need read-after-write consistency, you can use either CQLReplicator or a CDC pipeline to migrate data from Apache Cassandra to Amazon Keyspaces, based on your preference and familiarity with the tools and AWS services used in each solution. Migrating data in near real time with either method is a hybrid approach that offers an alternative to online migration.

This strategy is considered a hybrid approach because, in addition to the options outlined in this topic, you have to implement some steps of the online migration process, for example the historical data copy and the application migration strategies discussed in the [online migration](migrating-online.md) topic. 

The following sections go over the hybrid migration options in more detail.

**Topics**
+ [Migrate data using CQLReplicator](migration-hybrid-cql-rep.md)
+ [Migrate data using change data capture (CDC)](migration-hybrid-cdc.md)

# Migrate data using CQLReplicator
<a name="migration-hybrid-cql-rep"></a>

With [CQLReplicator](https://github.com/aws-samples/cql-replicator), you can read data from Apache Cassandra in near real time by intelligently scanning the Cassandra token ring using CQL queries. CQLReplicator doesn't use Cassandra CDC; instead, it implements a caching strategy to reduce the performance penalty of full scans. 

To reduce the number of writes to the destination, CQLReplicator automatically removes duplicate replication events. With CQLReplicator, you can tune the replication of changes from the source database to the destination database, allowing for a near real time migration of data from Apache Cassandra to Amazon Keyspaces. 
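
As a rough illustration of these two ideas, the following sketch splits the Murmur3 token ring into contiguous slices that can each be scanned with a CQL `token()` range query, and uses a simple hash cache to skip rows that haven't changed since the previous scan. The keyspace, table, and column names are hypothetical, and CQLReplicator's actual implementation differs; this only shows the shape of the technique.

```python
# Sketch only: token-ring slicing plus a change-detection cache.
# "ks.tbl" and "pk"/"col" are hypothetical names, not CQLReplicator's.
import hashlib

# Murmur3 partitioner token range used by Cassandra.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_ranges(slices):
    """Split the full Murmur3 token ring into contiguous (start, end] slices."""
    step = (MAX_TOKEN - MIN_TOKEN) // slices
    bounds = [MIN_TOKEN + i * step for i in range(slices)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))

def scan_query(start, end):
    """CQL statement that reads one token slice of a hypothetical table."""
    return (f"SELECT pk, col, writetime(col) FROM ks.tbl "
            f"WHERE token(pk) > {start} AND token(pk) <= {end}")

def changed_rows(rows, cache):
    """Yield only rows whose hashed content differs from the cached value,
    so unchanged rows aren't re-written to the destination on each scan."""
    for key, payload in rows:
        digest = hashlib.md5(repr(payload).encode()).hexdigest()
        if cache.get(key) != digest:
            cache[key] = digest
            yield key, payload
```

In a real job, each token slice becomes an independent unit of work, which is what lets AWS Glue parallelize the scan across workers.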

The following diagram shows the typical architecture of a CQLReplicator job using AWS Glue. 

1. To allow access to Apache Cassandra running in a private VPC, configure an AWS Glue connection with the connection type **Network**.

1. To remove duplicates and enable key caching for the CQLReplicator job, configure an Amazon Simple Storage Service (Amazon S3) bucket.

1. The CQLReplicator job streams verified source database changes directly to Amazon Keyspaces.
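
As a sketch of the first step, the following shows how the Glue **Network** connection could be created with boto3. The connection name, subnet ID, security group ID, and Availability Zone are placeholders that you would replace with values from your own VPC.

```python
# Sketch: create an AWS Glue NETWORK connection that gives Glue jobs
# network access to a Cassandra cluster running in a private VPC.
# All identifiers below are placeholders, not real resources.
connection_input = {
    "Name": "cassandra-vpc-connection",  # hypothetical connection name
    "ConnectionType": "NETWORK",
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-0123456789abcdef0",           # placeholder
        "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
        "AvailabilityZone": "us-east-1a",                 # placeholder
    },
}

def create_connection(client=None):
    """Create the Glue connection. Pass a boto3 Glue client, or let the
    function create one (requires valid AWS credentials)."""
    if client is None:
        import boto3
        client = boto3.client("glue")
    return client.create_connection(ConnectionInput=connection_input)
```

The security group must allow traffic to the Cassandra CQL port (9042 by default) for the Glue job to reach the cluster.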

![\[Using CQLReplicator to migrate data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/hybrid-migration-CQLRep.png)


For more information about the migration process using CQLReplicator, see the following post on the AWS Database Blog, [Migrate Cassandra workloads to Amazon Keyspaces using CQLReplicator](https://aws.amazon.com/blogs/database/migrate-cassandra-workloads-to-amazon-keyspaces-using-cqlreplicator/), and the AWS Prescriptive Guidance pattern [Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-using-aws-glue.html).

# Migrate data using change data capture (CDC)
<a name="migration-hybrid-cdc"></a>

If you're already familiar with configuring a change data capture (CDC) pipeline with [Debezium](https://debezium.io/), you can use this option to migrate data to Amazon Keyspaces as an alternative to using CQLReplicator. Debezium is an open-source, distributed platform for CDC, designed to monitor a database and capture row-level changes reliably. 

The [Debezium connector for Apache Cassandra](https://debezium.io/documentation/reference/stable/connectors/cassandra.html) uploads changes to Amazon Managed Streaming for Apache Kafka (Amazon MSK) so that they can be consumed and processed by downstream consumers, which in turn write the data to Amazon Keyspaces. For more information, see [Guidance for continuous data migration from Apache Cassandra to Amazon Keyspaces](https://aws.amazon.com/solutions/guidance/continuous-data-migration-from-apache-cassandra-to-amazon-keyspaces/).

To address any potential data consistency issues, you can implement a process with Amazon MSK where a consumer compares the keys or partitions in Cassandra with those in Amazon Keyspaces.

To implement this solution successfully, we recommend that you consider the following.
+ How to parse the CDC commit log, for example, how to remove duplicate events.
+ How to maintain the CDC directory, for example, how to delete old logs.
+ How to handle partial failures in Apache Cassandra, for example, when a write succeeds on only one out of three replicas.
+ How to handle resource allocation, for example, increasing the size of the instance to account for the additional CPU, memory, disk, and I/O requirements of the CDC process that runs on each node.
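
As a minimal sketch of one of these housekeeping tasks, the following prunes commit-log segments from Cassandra's CDC directory once a downstream reader has fully processed them. The directory layout and file names are placeholders; real CDC agents such as the Debezium connector handle this with their own bookkeeping.

```python
# Sketch only: delete CDC commit-log files that a downstream consumer has
# already processed, so the CDC directory doesn't fill the disk. File
# names here are illustrative.
import os

def prune_cdc_dir(cdc_dir, processed):
    """Delete commit-log files whose names are in the processed set.
    Returns the sorted list of files that were removed."""
    removed = []
    for name in sorted(os.listdir(cdc_dir)):
        if name in processed:
            os.remove(os.path.join(cdc_dir, name))
            removed.append(name)
    return removed
```

If the CDC directory fills up because segments aren't pruned, Cassandra stops accepting writes on CDC-enabled tables, so this maintenance step is not optional in production.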

This pattern treats changes from Cassandra as a "hint" that a key may have changed from its previous state. To determine if there are changes to propagate to the destination database, you must first read from the source Cassandra cluster using a `LOCAL_QUORUM` operation to receive the latest records and then write them to Amazon Keyspaces. 

In the case of range deletes or range updates, you may need to perform a comparison against the entire partition to determine which write or update events need to be written to your destination database. 

In cases where writes are not idempotent, you also need to compare your writes with what is already in the destination database before writing to Amazon Keyspaces.
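
The read-back pattern described in the preceding paragraphs can be reduced to a small sketch: a CDC event only signals that a key *may* have changed, so the latest row is re-read from the source and written to the destination only when it differs from what the destination already holds. The two dicts below stand in for the Cassandra cluster and Amazon Keyspaces; in real code, the reads and writes are CQL statements, with the source read at `LOCAL_QUORUM`.

```python
# Sketch only: treat a CDC event as a hint, re-read the source of truth,
# and propagate the row only if the destination is missing or stale.
def propagate_hint(key, source, destination):
    """Returns True if a write to the destination happened."""
    latest = source.get(key)           # real code: SELECT ... at LOCAL_QUORUM
    if latest is None:
        return False                   # key deleted or never existed
    if destination.get(key) == latest:
        return False                   # already current; skip the write
    destination[key] = latest          # real code: write to Amazon Keyspaces
    return True
```

The comparison before the write is what makes the pipeline safe for non-idempotent writes: replaying the same hint twice results in at most one write to the destination.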

The following diagram shows the typical architecture of a CDC pipeline using Debezium and Amazon MSK. 

![\[Using a change data capture pipeline to migrate data from Apache Cassandra to Amazon Keyspaces.\]](http://docs.aws.amazon.com/keyspaces/latest/devguide/images/migration/hybrid-migration-CDC.png)
