Using Apache Iceberg tables with Amazon Redshift

Note

To achieve the best performance when using Apache Iceberg tables with Amazon Redshift, you must generate column statistics for the tables using AWS Glue. For more information, see Generating column statistics for Iceberg tables in the AWS Glue Developer Guide.

This topic describes how to use tables in Apache Iceberg format with Amazon Redshift. Apache Iceberg is a high-performance open-source table format for data lakes. For more information, see Apache Iceberg in the Apache Iceberg documentation.

You can query Apache Iceberg tables cataloged in the AWS Glue Data Catalog with Amazon Redshift. RG instance types and Redshift Serverless use their own compute to process data lake queries, while RA3 instance types use Redshift Spectrum. For more information, see Querying your Data Lake.

Amazon Redshift provides transactional consistency for querying Apache Iceberg tables. You can manipulate the data in your tables using ACID-compliant (atomicity, consistency, isolation, durability) services such as Amazon Athena and Amazon EMR while running queries using Amazon Redshift. Amazon Redshift can use the table statistics stored in Apache Iceberg metadata to optimize query plans and reduce file scans during query processing. With Amazon Redshift SQL, you can join Redshift tables with data lake tables.
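
As an illustration of joining a local Redshift table with a data lake table, the following is a minimal sketch that composes such a query as a string. It assumes an external schema for the Data Catalog database already exists. The function name and the schema, table, and column names in the usage example are hypothetical and are not taken from this guide.

```python
def join_count_sql(external_schema: str, iceberg_table: str,
                   local_table: str, key: str) -> str:
    """Compose a query that joins an Iceberg table with a local Redshift table.

    Names are interpolated as-is, so pass only trusted identifiers.
    """
    return (
        "SELECT COUNT(*)\n"
        f"FROM {external_schema}.{iceberg_table} AS lake\n"
        f"JOIN {local_table} AS wh\n"
        f"  ON lake.{key} = wh.{key};"
    )


if __name__ == "__main__":
    # Hypothetical names: an external schema "iceberg_schema" with an
    # Iceberg table "orders", and a local table "public.customers".
    print(join_count_sql("iceberg_schema", "orders",
                         "public.customers", "customer_id"))
```

You can run the resulting statement from query editor v2 or any SQL client connected to your cluster or workgroup.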

To get started using Iceberg tables with Amazon Redshift:

1. Create an Apache Iceberg table on an AWS Glue Data Catalog database using a compatible service such as Amazon Athena or Amazon EMR. To create an Iceberg table using Athena, see Using Apache Iceberg tables in the Amazon Athena User Guide.
2. Create an Amazon Redshift cluster or Redshift Serverless workgroup with an associated IAM role that allows access to your data lake. For information on how to create clusters or workgroups, see Get started with Amazon Redshift provisioned data warehouses and Get started with Redshift Serverless data warehouses in the Amazon Redshift Getting Started Guide.
3. Connect to your cluster or workgroup using query editor v2 or a third-party SQL client. For information about how to connect using query editor v2, see Connecting to an Amazon Redshift data warehouse using SQL client tools in the Amazon Redshift Management Guide.
4. Create an external schema in your Amazon Redshift database for a specific Data Catalog database that includes your Iceberg tables. For information about creating an external schema, see External schemas in Amazon Redshift Spectrum.
5. Run SQL queries to access the Iceberg tables in the external schema you created.
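
The last two steps above can be sketched as follows. This is a minimal illustration that only builds the SQL text, not a definitive setup script. The schema name, Data Catalog database name, table name, and IAM role ARN are hypothetical placeholders, and the helper functions are assumptions introduced for this example.

```python
import re

# Hypothetical values used only for illustration; replace with your own.
SCHEMA = "iceberg_schema"        # external schema to create in Redshift
CATALOG_DB = "my_glue_database"  # Data Catalog database with Iceberg tables
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/MyRedshiftDataLakeRole"


def ident(name: str) -> str:
    """Return name unchanged if it is a simple SQL identifier, else raise."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"not a simple identifier: {name!r}")
    return name


def create_external_schema_sql(schema: str, catalog_db: str, role_arn: str) -> str:
    """Compose a statement that maps a Data Catalog database to an external schema."""
    if "'" in catalog_db or "'" in role_arn:
        raise ValueError("quote characters are not allowed in these values")
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {ident(schema)}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{role_arn}';"
    )


def preview_sql(schema: str, table: str, limit: int = 10) -> str:
    """Compose a query that reads a few rows from a table in the external schema."""
    return f"SELECT * FROM {ident(schema)}.{ident(table)} LIMIT {int(limit)};"


if __name__ == "__main__":
    print(create_external_schema_sql(SCHEMA, CATALOG_DB, IAM_ROLE_ARN))
    print(preview_sql(SCHEMA, "orders"))  # "orders" is a hypothetical table
```

The identifier check is a deliberate simplification: it rejects anything other than plain unquoted names so that the composed statements stay predictable.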

Considerations when using Apache Iceberg tables with Amazon Redshift

Consider the following when using Amazon Redshift with Iceberg tables:

Iceberg version support – Amazon Redshift supports running queries against the following versions of Iceberg tables:
- Version 1 defines how large analytic tables are managed using immutable data files.
- Version 2 adds the ability to support row-level updates and deletes while keeping the existing data files unchanged, and handling table data changes using delete files.
For the difference between version 1 and version 2 tables, see Format version changes in the Apache Iceberg documentation.
Adding partitions – You don't need to manually add partitions for your Apache Iceberg tables. Amazon Redshift automatically detects new partitions, so the table definition needs no manual update. Changes to the partition specification are also applied to your queries automatically.
Ingesting Iceberg data into Amazon Redshift – You can use INSERT INTO or CREATE TABLE AS commands to import data from your Iceberg table into a local Amazon Redshift table. You currently cannot use the COPY command to ingest the contents of an Apache Iceberg table into a local Amazon Redshift table.
Materialized views – You can create materialized views on Apache Iceberg tables like any other external table in Amazon Redshift. The same considerations for other data lake table formats apply to Apache Iceberg tables. Automatic query rewriting and automatic materialized views on data lake tables are currently not supported.
AWS Lake Formation fine-grained access control – Amazon Redshift supports AWS Lake Formation fine-grained access control on Apache Iceberg tables.
User-defined data handling parameters – Amazon Redshift supports user-defined data handling parameters on Apache Iceberg tables. You use these parameters on existing files to tailor the data being queried in external tables and avoid scan errors. They handle mismatches between the table schema and the actual data in the files.
Time travel queries – Time travel queries are currently not supported with Apache Iceberg tables.
Pricing – When you access Iceberg tables from an RG cluster or a Redshift Serverless workgroup, data lake queries run on the cluster's or workgroup's own compute resources, so there is no separate charge for data lake queries. When you access Iceberg tables from a DC2 or RA3 cluster, you are charged Redshift Spectrum pricing. For information about pricing, see Amazon Redshift pricing.
Metadata caching – Metadata caching assumes that metadata files are immutable, as the Iceberg specification requires. Amazon Redshift relies on this immutability for data integrity.
Federated identity – Federated identity is not supported when writing to Apache Iceberg tables. This includes using the SESSION keyword for the IAM_ROLE parameter when creating external schemas. For more information about IAM_ROLE parameters, see CREATE EXTERNAL SCHEMA.
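
The ingestion and materialized view considerations above can be illustrated with a minimal sketch that composes the corresponding statements. The function names and every schema, table, and view name in the usage example are hypothetical, and the sketch assumes the target table for INSERT INTO already exists with compatible columns.

```python
def ctas_sql(target: str, schema: str, table: str) -> str:
    """Compose CREATE TABLE AS to copy an Iceberg table into a new local table."""
    return f"CREATE TABLE {target} AS\nSELECT * FROM {schema}.{table};"


def insert_sql(target: str, schema: str, table: str) -> str:
    """Compose INSERT INTO to append Iceberg rows to an existing local table."""
    return f"INSERT INTO {target}\nSELECT * FROM {schema}.{table};"


def materialized_view_sql(view: str, schema: str, table: str) -> str:
    """Compose a materialized view defined over an Iceberg table."""
    return f"CREATE MATERIALIZED VIEW {view} AS\nSELECT * FROM {schema}.{table};"


if __name__ == "__main__":
    # Hypothetical names; identifiers are interpolated as-is, so pass only
    # trusted values.
    print(ctas_sql("public.orders_local", "iceberg_schema", "orders"))
    print(insert_sql("public.orders_local", "iceberg_schema", "orders"))
    print(materialized_view_sql("public.orders_mv", "iceberg_schema", "orders"))
```

Both ingestion statements read through the external schema with an ordinary SELECT, which is why they work where the COPY command does not.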
