Working with Aurora zero-ETL integrations with Amazon Redshift - Amazon Aurora

Working with Aurora zero-ETL integrations with Amazon Redshift

An Aurora zero-ETL integration with Amazon Redshift enables near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from Aurora. It's a fully managed solution for making transactional data available in Amazon Redshift after it is written to an Aurora DB cluster. Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central data warehouse.

A zero-ETL integration makes the data in your Aurora DB cluster available in Amazon Redshift in near real-time. Once that data is in Amazon Redshift, you can power your analytics, ML, and AI workloads using the built-in capabilities of Amazon Redshift, such as machine learning, materialized views, data sharing, federated access to multiple data stores and data lakes, and integrations with Amazon SageMaker, Amazon QuickSight, and other AWS services.

To create a zero-ETL integration, you specify an Aurora DB cluster as the source, and an Amazon Redshift data warehouse as the target. The integration replicates data from the source database into the target data warehouse.

The following diagram illustrates this functionality:

A zero-ETL integration

The integration monitors the health of the data pipeline and recovers from issues when possible. You can create integrations from multiple Aurora DB clusters into a single Amazon Redshift namespace, enabling you to derive insights across multiple applications.

For information about pricing for zero-ETL integrations, see Amazon Aurora pricing and Amazon Redshift pricing.

Benefits

Aurora zero-ETL integrations with Amazon Redshift have the following benefits:

  • Help you derive holistic insights from multiple data sources.

  • Eliminate the need to build and maintain complex data pipelines that perform extract, transform, and load (ETL) operations. Zero-ETL integrations remove the challenges that come with building and managing pipelines by provisioning and managing them for you.

  • Reduce operational burden and cost, and let you focus on improving your applications.

  • Let you leverage Amazon Redshift's analytics and ML capabilities to derive insights from transactional and other data, to respond effectively to critical, time-sensitive events.

Key concepts

As you get started with zero-ETL integrations, consider the following concepts:

Integration

A fully managed data pipeline that automatically replicates transactional data and schemas from an Aurora DB cluster to an Amazon Redshift data warehouse.

Source DB cluster

The Aurora DB cluster where data is replicated from. You can specify a DB cluster that uses provisioned DB instances or Aurora Serverless v2 DB instances as the source.

Target data warehouse

The Amazon Redshift data warehouse where the data is replicated to. There are two types of data warehouse: a provisioned cluster data warehouse and a serverless data warehouse. A provisioned cluster data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. A serverless data warehouse is comprised of a workgroup that stores compute resources, and a namespace that houses the database objects and users. Both data warehouses run an Amazon Redshift engine and contain one or more databases.

Multiple source DB clusters can write to the same target.

For more information, see Data warehouse system architecture in the Amazon Redshift Developer Guide.

Limitations

The following limitations apply to Aurora zero-ETL integrations with Amazon Redshift.

General limitations

  • The source DB cluster must be in the same Region as the target Amazon Redshift data warehouse.

  • You can't rename a DB cluster or any of its instances if it has existing integrations.

  • You can't create multiple integrations between the same source and target databases.

  • You can't delete a DB cluster that has existing integrations. You must delete all associated integrations first.

  • If you stop the source DB cluster, the last few transactions might not be replicated to the target data warehouse until you resume the cluster.

  • If your cluster is the source of a blue/green deployment, the blue and green environments can't have existing zero-ETL integrations during switchover. You must delete the integration first and switch over, then recreate it.

  • A DB cluster must contain at least one DB instance in order to be the source of an integration.

  • If your source cluster is the primary DB cluster in an Aurora global database and it fails over to one of its secondary clusters, the integration becomes inactive. You must delete and recreate the integration.

  • You can't create an integration for a source database that has another integration being actively created.

  • When you initially create an integration, or when a table is being resynchronized, data seeding from the source to the target can take 20-25 minutes or more depending on the size of the source database. This delay can lead to increased replica lag.

  • Some data types aren't supported. For more information, see Data type differences between Aurora and Amazon Redshift databases.

  • Object identifiers (including database name, table name, column names, and others) can contain only alphanumeric characters, numbers, $, and _ (underscore).

  • System tables, temporary tables, and views aren't replicated to Amazon Redshift.

Aurora MySQL limitations

  • Your source DB cluster must be running a supported version of Aurora MySQL. For a list of supported versions, see Supported Regions and Aurora DB engines for zero-ETL integrations with Amazon Redshift.

  • Zero-ETL integrations rely on MySQL binary logging (binlog) to capture ongoing data changes. Don't use binlog-based data filtering, as it can cause data inconsistencies between the source and target databases.

  • Zero-ETL integrations are supported only for databases configured to use the InnoDB storage engine.

  • Foreign key references with predefined table updates aren't supported. Specifically, ON DELETE and ON UPDATE rules aren't supported with CASCADE, SET NULL, and SET DEFAULT actions. Attempting to create or update a table with such references to another table will put the table into a failed state.

  • ALTER TABLE partition operations cause your table to resynchronize in order to reload data from Aurora to Amazon Redshift. The table will be unavailable for querying while it's resynchronizing. For more information, see One or more of my Amazon Redshift tables requires a resync.

  • Two-phase transactions performed on the source DB cluster cause the integration to enter a state of Syncing.

Aurora PostgreSQL limitations

  • Your source DB cluster must be running a supported version of Aurora PostgreSQL. For a list of supported versions, see Supported Regions and Aurora DB engines for zero-ETL integrations with Amazon Redshift.

  • If you select an Aurora PostgreSQL source DB cluster, you must specify at least one data filter pattern. At minimum, the pattern must include a single database (database-name.*.*) for replication to Amazon Redshift. For more information, see Data filtering for Aurora zero-ETL integrations with Amazon Redshift.

  • All databases created within the source Aurora PostgreSQL DB cluster must use UTF-8 encoding.

  • If you perform declarative partitioning transactions on the source DB cluster, all affected tables enter a failed state and are no longer accessible in Amazon Redshift.

  • Two-phase transactions aren't supported.

  • If you delete all DB instances from a DB cluster that is the source of an integration and then re-add a DB instance, replication breaks between the source and the target clusters.

  • The source DB cluster can't use use Aurora Limitless Database.

Amazon Redshift limitations

For a list of Amazon Redshift limitations related to zero-ETL integrations, see Considerations when using zero-ETL integrations with Amazon Redshift in the Amazon Redshift Management Guide.

Quotas

Your account has the following quotas related to Aurora zero-ETL integrations with Amazon Redshift. Each quota is per-Region unless otherwise specified.

Name Default Description
Integrations 100 The total number of integrations within an AWS account.
Integrations per target data warehouse 50 The number of integrations sending data to a single target Amazon Redshift data warehouse.
Integrations per source cluster

5 for Aurora MySQL

1 for Aurora PostgreSQL
The number of integrations sending data from a single source DB cluster.

In addition, Amazon Redshift places certain limits on the number of tables allowed in each DB instance or cluster node. For more information, see Quotas and limits in Amazon Redshift in the Amazon Redshift Management Guide.

Supported Regions

Aurora zero-ETL integrations with Amazon Redshift are available in a subset of AWS Regions. For a list of supported Regions, see Supported Regions and Aurora DB engines for zero-ETL integrations with Amazon Redshift.