Set up the Firehose stream

To create a Firehose stream with Apache Iceberg Tables as your destination, you need to configure the following settings.

Note

The setup of a Firehose stream for delivering to tables in S3 table buckets is the same as for delivering to Apache Iceberg Tables in Amazon S3.

Configure source and destination

To deliver data to Apache Iceberg Tables, choose the source for your stream.

To configure the source for your stream, see Configure source settings.

Next, choose Apache Iceberg Tables as the destination and provide a Firehose stream name.

Configure data transformation

To perform custom transformations on your data, such as adding or modifying records in your incoming stream, you can add a Lambda function to your Firehose stream. For more information on data transformation using Lambda in a Firehose stream, see Transform source data in Amazon Data Firehose.

For Apache Iceberg Tables, you need to specify how you want to route incoming records to different destination tables and the operations that you want to perform. One way to provide the required routing information to Firehose is with a Lambda function.

For more information, see Route records to different Iceberg tables.
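
As an illustration, the following is a minimal sketch of such a Lambda function in Python. It assumes JSON source records that carry a type field used to pick the destination table; the database name, routing field, and operation shown are placeholders, and the otfMetadata structure follows the record format described in Route records to different Iceberg tables.

import base64
import json

def lambda_handler(event, context):
    """Transform each record and attach Iceberg routing metadata."""
    output = []
    for record in event['records']:
        # Firehose delivers each record base64-encoded.
        payload = json.loads(base64.b64decode(record['data']))

        # Hypothetical routing rule: choose the destination table from a
        # 'type' field in the source record.
        table_name = payload.get('type', 'default_table')

        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            # Return the (possibly modified) payload, base64-encoded again.
            'data': base64.b64encode(json.dumps(payload).encode('utf-8')).decode('utf-8'),
            # Routing information that Firehose reads for Iceberg destinations.
            'metadata': {
                'otfMetadata': {
                    'destinationDatabaseName': 'MySampleDatabase',  # placeholder
                    'destinationTableName': table_name,
                    'operation': 'insert'  # or 'update' / 'delete'
                }
            }
        })
    return {'records': output}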

Connect data catalog

Apache Iceberg requires a data catalog to write to Apache Iceberg Tables. Firehose integrates with AWS Glue Data Catalog for Apache Iceberg Tables.

You can use an AWS Glue Data Catalog in the same account as your Firehose stream or in a different account, and in the same Region as your Firehose stream (the default) or in a different Region.
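
If you create the stream through the AWS SDK instead of the console, the catalog is referenced by its ARN. The following fragment is a sketch, assuming the CatalogConfiguration field of the Firehose Iceberg destination configuration in recent SDK versions; the Region and account ID are placeholders. A catalog in another account or Region is referenced by putting that account or Region in the ARN.

# Fragment of an Iceberg destination configuration (boto3).
catalog_configuration = {
    'CatalogARN': 'arn:aws:glue:us-east-1:123456789012:catalog'  # placeholder
}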

Configure JQ expressions

For Apache Iceberg Tables, you must specify how you want to route incoming records to different destination tables and the operations, such as insert, update, and delete, that you want to perform. You can do this by configuring JQ expressions that Firehose parses to get the required information. For more information, see Provide routing information to Firehose with JSONQuery expression.
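
If you configure the stream programmatically rather than in the console, the JQ expressions can be supplied through the stream's processing configuration. The sketch below assumes the MetadataExtraction processor that Firehose also uses for dynamic partitioning; the field names in the query (.db, .table, .op) are placeholders for fields in your own source records.

# Fragment of a Firehose processing configuration (boto3) that extracts
# Iceberg routing information from each record with a JQ expression.
processing_configuration = {
    'Enabled': True,
    'Processors': [
        {
            'Type': 'MetadataExtraction',
            'Parameters': [
                {
                    'ParameterName': 'MetadataExtractionQuery',
                    'ParameterValue': '{destinationDatabaseName: .db, '
                                      'destinationTableName: .table, '
                                      'operation: .op}'
                },
                {
                    'ParameterName': 'JsonParsingEngine',
                    'ParameterValue': 'JQ-1.6'
                }
            ]
        }
    ]
}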

Configure unique keys

Updates and deletes with more than one table – Unique keys are one or more fields in your source record that uniquely identify a row in Apache Iceberg Tables. If you have an insert-only scenario with more than one table, then you don't have to configure unique keys. If you want to perform updates and deletes on certain tables, then you must configure unique keys for those tables. Note that an update automatically inserts the row if it is missing from the table. If you have only a single table, then you can configure unique keys.

You can either configure unique keys per table as part of Firehose stream creation, or you can set identifier-field-ids natively in Iceberg during a create table or alter table operation. Configuring unique keys per table during stream creation is optional. If you don't configure unique keys per table during stream creation, Firehose checks for identifier-field-ids on the required tables and uses them as unique keys. If neither is configured, then delivery of data with update and delete operations fails.

To configure this section, provide the database name, table name, and unique keys for the tables where you want to update or delete data. You can have only one entry for each table in the configuration. Optionally, you can also provide an error bucket prefix to use if data from the table fails to deliver, as shown in the following example.

[ { "DestinationDatabaseName": "MySampleDatabase", "DestinationTableName": "MySampleTable", "UniqueKeys": [ "COLUMN_PLACEHOLDER" ], "S3ErrorOutputPrefix": "OPTIONAL_PREFIX_PLACEHOLDER" } ]

Specify retry duration

You can use this configuration to specify the duration in seconds for which Firehose should attempt retries if it encounters failures in writing to Apache Iceberg Tables in Amazon S3. You can set any value from 0 to 7200 seconds for performing retries. By default, Firehose retries for 300 seconds.
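
In the API, this setting corresponds to the retry options of the Iceberg destination configuration. A minimal sketch, assuming the RetryOptions field of recent SDK versions:

retry_options = {
    'DurationInSeconds': 300  # 0-7200 seconds; 300 is the default
}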

Handle failed delivery or processing

You must configure Firehose to deliver records to an S3 backup bucket in case it encounters failures in processing or delivering a stream after the retry duration expires. For this, configure the S3 backup bucket and the S3 backup bucket error output prefix from Backup settings in the console.
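
In the API, this roughly corresponds to the S3 configuration of the Iceberg destination. The following fragment is a sketch assuming that shape; the role ARN, bucket ARN, and prefix are placeholders.

s3_backup_configuration = {
    'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseIcebergRole',  # placeholder
    'BucketARN': 'arn:aws:s3:::my-backup-bucket',                     # placeholder
    # Failed records are written under this prefix after retries expire.
    'ErrorOutputPrefix': 'iceberg-errors/'                            # placeholder
}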

Configure buffer hints

Firehose buffers incoming streaming data in memory to a certain size (Buffering size) and for a certain period of time (Buffering interval) before delivering it to Apache Iceberg Tables. You can choose a buffer size of 1–128 MiBs and a buffer interval of 0–900 seconds. Higher buffer hint values result in fewer S3 writes, lower compaction cost due to larger data files, and faster query execution, but at higher latency. Lower buffer hint values deliver the data with lower latency.
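
For example, a stream tuned for fewer, larger data files might use buffer hints like the following sketch (the values are illustrative, within the documented 1–128 MiB and 0–900 second ranges):

buffering_hints = {
    'SizeInMBs': 128,         # larger files: fewer S3 writes, cheaper compaction
    'IntervalInSeconds': 900  # delivers at most every 15 minutes
}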

Configure advanced settings

You can configure server-side encryption, error logging, permissions, and tags for your Apache Iceberg Tables. For more information, see Configure advanced settings. You need to add the IAM role that you created as part of the Prerequisites to use Apache Iceberg Tables as a destination. Firehose assumes this role to access AWS Glue tables and write to Amazon S3 buckets.

Firehose stream creation can take several minutes to complete. After you successfully create the Firehose stream, you can start ingesting data into it and view the data in your Apache Iceberg tables.
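
To see how the pieces above fit together outside the console, the following is a minimal end-to-end sketch using boto3. It assumes the IcebergDestinationConfiguration shape of recent SDK versions; every ARN, name, and key column is a placeholder, so check the Firehose API reference for the authoritative field list.

import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='my-iceberg-stream',
    DeliveryStreamType='DirectPut',
    IcebergDestinationConfiguration={
        # IAM role from the Prerequisites; Firehose assumes it to access
        # AWS Glue tables and write to Amazon S3.
        'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseIcebergRole',
        'CatalogConfiguration': {
            'CatalogARN': 'arn:aws:glue:us-east-1:123456789012:catalog'
        },
        # Unique keys per table, needed for update and delete operations.
        'DestinationTableConfigurationList': [
            {
                'DestinationDatabaseName': 'MySampleDatabase',
                'DestinationTableName': 'MySampleTable',
                'UniqueKeys': ['my_key_column'],
                'S3ErrorOutputPrefix': 'table-errors/'
            }
        ],
        'RetryOptions': {'DurationInSeconds': 300},
        'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 300},
        # Backup bucket for records that fail processing or delivery.
        'S3Configuration': {
            'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseIcebergRole',
            'BucketARN': 'arn:aws:s3:::my-backup-bucket',
            'ErrorOutputPrefix': 'stream-errors/'
        }
    }
)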