Data filtering for Aurora zero-ETL integrations with Amazon Redshift - Amazon Aurora

Data filtering for Aurora zero-ETL integrations with Amazon Redshift

You can use data filtering for Aurora zero-ETL integrations to define the scope of replication from the source Aurora DB cluster to the target Amazon Redshift data warehouse. Rather than replicating all data to the target, you can define one or more filters that selectively include or exclude certain tables from being replicated. Only filtering at the database and table level is available for zero-ETL integrations. You can't filter by columns or rows.

Data filtering can be useful when you want to:

  • Join certain tables from two or more different source clusters and you don't need complete data from either cluster.

  • Save costs by performing analytics using only a subset of tables rather than an entire fleet of databases.

  • Filter out sensitive information—such as phone numbers, addresses, or credit card details—from certain tables.

You can add data filters to a zero-ETL integration using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the Amazon RDS API.

If the integration has a provisioned Amazon Redshift cluster as its target, the cluster must be on patch 180 or higher.

Format of a data filter

You can define multiple filters for a single integration. Each filter either includes or excludes any existing and future database tables that match one of the patterns in the filter expression. Aurora zero-ETL integrations use Maxwell filter syntax for data filtering.

Each filter has the following elements:

Element Description
Filter type

An Include filter type includes all tables that match one of the patterns in the filter expression. An Exclude filter type excludes all tables that match one of the patterns.

Filter expression

A comma-separated list of patterns. Expressions must use Maxwell filter syntax.

Pattern

A filter pattern in the format database.table for Aurora MySQL, or database.schema.table for Aurora PostgreSQL. You can specify literal names, or define regular expressions.

Note

For Aurora MySQL, regular expressions are supported in both the database and table name. For Aurora PostgreSQL, regular expressions are supported only in the schema and table name, not in the database name.

You can't include column-level filters or denylists.

A single integration can have a maximum of 99 total patterns. In the console, you can enter patterns within a single filter expression, or spread them out among multiple expressions. A single pattern can't exceed 256 characters in length.

Important

If you select an Aurora PostgreSQL source DB cluster, you must specify at least one data filter pattern. At minimum, the pattern must include a single database (database-name.*.*) for replication to Amazon Redshift.

The following image shows the structure of Aurora MySQL data filters in the console:

Data filters for a zero-ETL integration
Important

Do not include personally identifying, confidential, or sensitive information in your filter patterns.

Data filters in the AWS CLI

When using the AWS CLI to add a data filter, the syntax differs slightly compared to the console. Each individual pattern must be associated with its own filter type (Include or Exclude). You can't group multiple patterns with a single filter type.

For example, in the console you can group the following comma-separated patterns within a single Include statement:

Aurora MySQL

mydb.mytable, mydb./table_\d+/

Aurora PostgreSQL

mydb.myschema.mytable, mydb.myschema./table_\d+/

However, when using the AWS CLI, the same data filter must be in the following format:

Aurora MySQL

'include: mydb.mytable, include: mydb./table_\d+/'

Aurora PostgreSQL

'include: mydb.myschema.mytable, include: mydb.myschema./table_\d+/'

Filter logic

If you don't specify any data filters in your integration, Aurora assumes a default filter of include:*.* and replicates all tables to the target data warehouse. However, if you specify at least one filter, the logic starts with an assumed exclude:*.*, meaning that all tables are automatically excluded from replication. This allows you to directly define which tables and databases to include.

For example, if you define the following filter:

'include: db.table1, include: db.table2'

Aurora evaluates the filter as follows:

'exclude:*.*, include: db.table1, include: db.table2'

Therefore, only table1 and table2 from the database named db are replicated to the target data warehouse.

Filter precedence

Aurora evaluates data filters in the order in which they're specified. In the AWS Management Console, this means that Aurora evaluates filter expressions from left to right and from top to bottom. If you specify a certain pattern for the first filter, then a second filter or even an individual pattern specified immediately after it can override it.

For example, your first filter might be Include books.stephenking, which includes a single table named stephenking from within the books database. However, if you add a second filter of Exclude books.*, it overrides the Include filter defined before it. Thus, no tables from the books index are replicated to Amazon Redshift.

If you specify at least one filter, the logic starts with an assumed exclude:*.*, meaning that all tables are automatically excluded from replication. Therefore, as a general best practice, define your filters from most broad to least broad. For example, use one or more Include statements to define all of the data that you want to replicate. Then, begin adding Exclude filters to selectively exclude certain tables from being replicated.

The same principle applies to filters that you define using the AWS CLI. Aurora evaluates these filter patterns in the order that they're specified, so a pattern might override one specified before it.

Aurora MySQL examples

The following examples demonstrate how data filtering works for Aurora MySQL zero-ETL integrations:

  • Include all databases and all tables:

    'include: *.*'
  • Include all tables within the books database:

    'include: books.*'
  • Exclude any tables named mystery:

    'include: *.*, exclude: *.mystery'
  • Include two specific tables within the books database:

    'include: books.stephen_king, include: books.carolyn_keene'
  • Include all tables in the books database, except for those containing the substring mystery:

    'include: books.*, exclude: books./.*mystery.*/'
  • Include all tables in the books database, except those starting with mystery:

    'include: books.*, exclude: books./mystery.*/'
  • Include all tables in the books database, except those ending with mystery:

    'include: books.*, exclude: books./.*mystery/'
  • Include all tables in the books database that start with table_, except for the one named table_stephen_king. For example, table_movies or table_books would be replicated, but not table_stephen_king.

    'include: books./table_.*/, exclude: books.table_stephen_king'

Aurora PostgreSQL examples

The following examples demonstrate how data filtering works for Aurora PostgreSQL zero-ETL integrations:

  • Include all tables within the books database:

    'include: books.*.*'
  • Exclude any tables named mystery in the books database:

    'include: books.*.*, exclude: books.*.mystery'
  • Include one table within the books database in the mystery schema, and one table within employee database in the finance schema:

    'include: books.mystery.stephen_king, include: employee.finance.benefits'
  • Include all tables in the books database and science_fiction schema, except for those containing the substring king:

    'include: books.science_fiction.*, exclude: books.*./.*king.*/
  • Include all tables in the books database, except those with a schema name starting with sci:

    'include: books.*.*, exclude: books./sci.*/.*'
  • Include all tables in the books database, except those in the mystery schema ending with king:

    'include: books.*.*, exclude: books.mystery./.*king/'
  • Include all tables in the books database that start with table_, except for the one named table_stephen_king. For example, table_movies in the fiction schema and table_books in the mystery schema are replicated, but not table_stephen_king in either schema:

    'include: books.*./table_.*/, exclude: books.*.table_stephen_king'

Adding data filters to an integration

You can configure data filtering using the AWS Management Console, the AWS CLI, or the Amazon RDS API.

Important

If you add a filter after creating an integration, then Aurora reevaluates the filter as if it always existed. It removes any data that is currently in the target Amazon Redshift data warehouse that doesn't match the new filtering criteria. This action causes all affected tables to resynchronize.

To add data filters to a zero-ETL integration
  1. Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/.

  2. In the navigation pane, choose Zero-ETL integrations. Select the integration that you want to add data filters to, and then choose Modify.

  3. Under Source, add one or more Include and Exclude statements.

    The following image shows an example of data filters for an Aurora MySQL integration:

    Data filters for a zero-ETL integration in the RDS console
  4. When all the changes are as you want them, choose Continue and Save changes.

To add data filters to a zero-ETL integration using the AWS CLI, call the modify-integration command. In addition to the integration identifier, specify the --data-filter parameter with a comma-separated list of Include and Exclude Maxwell filters.

The following example adds filter patterns to my-integration.

For Linux, macOS, or Unix:

aws rds modify-integration \ --integration-identifier my-integration \ --data-filter 'include: foodb.*, exclude: foodb.tbl, exclude: foodb./table_\d+/'

For Windows:

aws rds modify-integration ^ --integration-identifier my-integration ^ --data-filter 'include: foodb.*, exclude: foodb.tbl, exclude: foodb./table_\d+/'

To modify a zero-ETL integration using the RDS API, call the ModifyIntegration operation. Specify the integration identifier and provide a comma-separated list of filter patterns.

Removing data filters from an integration

When you remove a data filter from an integration, Aurora reevaluates the remaining filters as if the removed filter never existed. Aurora then replicates any data that previously didn't match the filtering criteria (but now does) into the target Amazon Redshift data warehouse.

Removing one or more data filters causes all affected tables to resynchronize.