Replicate database changes to Apache Iceberg Tables with Amazon Data Firehose
Note
Firehose supports databases as a source in all AWS Regions except China Regions, AWS GovCloud (US) Regions, and Asia Pacific (Malaysia). This feature is in preview and is subject to change. Do not use it for your production workloads.
Organizations use relational databases to store and retrieve transactional data. These databases are optimized to interact very quickly with one or a few rows of data at a time; they are not optimized for querying large sets of aggregated data. To support analytics and machine learning use cases, organizations move transactional data from relational databases to analytical data stores such as data lakes, data warehouses, and other tools. To keep analytical data stores in sync with relational databases, a design pattern called change data capture (CDC) is used to capture all changes to the databases in real time. When data is changed through INSERT, UPDATE, or DELETE in a source database, those changes must be continuously streamed without impacting the performance of the database.
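To make the CDC pattern concrete, the following minimal sketch replays a stream of INSERT, UPDATE, and DELETE events against an in-memory replica keyed by primary key. The `ChangeEvent` shape and the function names are hypothetical and invented for illustration; they are not the Firehose event format or API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional

Row = Dict[str, Any]


@dataclass
class ChangeEvent:
    """Hypothetical CDC event shape, invented for this sketch."""
    op: str                    # "INSERT", "UPDATE", or "DELETE"
    key: int                   # primary key of the affected row
    row: Optional[Row] = None  # full row image; not needed for DELETE


def apply_change(replica: Dict[int, Row], event: ChangeEvent) -> None:
    """Apply one change event to the replica in place."""
    if event.op in ("INSERT", "UPDATE"):
        if event.row is None:
            raise ValueError(f"{event.op} event requires a row image")
        # Upsert: the latest row image replaces any earlier version.
        replica[event.key] = dict(event.row)
    elif event.op == "DELETE":
        # Deleting a key that is already absent is treated as a no-op.
        replica.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


def replay(events: Iterable[ChangeEvent]) -> Dict[int, Row]:
    """Build a replica by applying events in the order they occurred."""
    replica: Dict[int, Row] = {}
    for event in events:
        apply_change(replica, event)
    return replica
```

Because events are applied in order, the replica converges to the current state of the source table without rereading the whole table, which is why CDC places less load on the source database than repeated full exports.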
Firehose provides an end-to-end solution to replicate changes from MySQL and PostgreSQL databases into Apache Iceberg Tables. With this feature, you can select the specific databases, tables, and columns that you want Firehose to capture in CDC events. If you don’t already have Iceberg Tables, you can opt in for Firehose to create them. Firehose creates the databases and tables using the same schema as your relational database tables. After the stream is created, Firehose takes an initial copy of the data in the tables and writes it to Apache Iceberg Tables. When the initial copy is complete, Firehose starts continuously capturing real-time CDC changes in your databases and replicates them to Apache Iceberg Tables. If you opt in for schema evolution, Firehose evolves your Iceberg Table schema based on the schema changes in your relational databases.
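The two-phase flow described above (an initial copy followed by continuous change capture, with optional schema evolution) can be sketched conceptually as follows. This is an illustration under stated assumptions, not Firehose's implementation: the `replicate` function, the tuple-based change format, and the choice to drop unknown columns when schema evolution is off are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Change = Tuple[str, int, Row]  # hypothetical format: (op, primary key, row image)


def replicate(
    snapshot: Dict[int, Row],
    changes: Iterable[Change],
    evolve_schema: bool,
) -> Tuple[List[str], Dict[int, Row]]:
    """Return (target columns, target rows) after snapshot plus changes."""
    # Phase 1: initial copy. The target starts with the source schema and rows.
    columns: List[str] = sorted({c for row in snapshot.values() for c in row})
    target: Dict[int, Row] = {key: dict(row) for key, row in snapshot.items()}

    # Phase 2: apply changes captured after the snapshot, in order.
    for op, key, row in changes:
        if op == "DELETE":
            target.pop(key, None)
            continue
        if evolve_schema:
            for col in row:
                if col not in columns:
                    # Add the new column and backfill existing rows with None.
                    columns.append(col)
                    for existing in target.values():
                        existing.setdefault(col, None)
        # Assumption of this sketch: columns unknown to the target are dropped
        # when schema evolution is off.
        target[key] = {col: row.get(col) for col in columns}
    return columns, target
```

With `evolve_schema=True`, a column that first appears in a change event is added to the target schema; with `evolve_schema=False`, the target keeps the schema it had at snapshot time.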