Working with Apache Iceberg tables by using Amazon Athena SQL
Amazon Athena provides built-in support for Apache Iceberg, and doesn't require additional steps or configuration. This section provides a detailed overview of supported features and high-level guidance for using Athena to interact with Iceberg tables.
Version and feature compatibility
Note
The following sections assume that you're using Athena engine version 3.
Iceberg table specification support
The Apache Iceberg table specification defines how Iceberg tables should behave. Athena supports table format version 2, so any Iceberg table that you create with the console, the CLI, or an SDK inherently uses that version.
If you use an Iceberg table that was created with another engine, such as Apache Spark on Amazon EMR or AWS Glue, make sure to set the table format version to 2 by using table properties.
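For example, in Spark SQL you can set the format version through a table property. This is a minimal sketch; `glue_catalog`, `iceberg_db`, and `my_table` are placeholders for your own catalog, database, and table names:

```sql
-- Upgrade an existing Iceberg table to table format version 2
-- (Spark SQL with the Iceberg extensions enabled).
-- glue_catalog, iceberg_db, and my_table are placeholder names.
ALTER TABLE glue_catalog.iceberg_db.my_table
SET TBLPROPERTIES ('format-version' = '2');
```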
Iceberg feature support
You can use Athena to read from and write to Iceberg tables. When you change data by using the UPDATE, MERGE INTO, and DELETE FROM statements, Athena supports merge-on-read mode only. This property cannot be changed. To update or delete data with copy-on-write, you have to use another engine such as Apache Spark on Amazon EMR or AWS Glue. The following table summarizes Iceberg feature support in Athena.
| Engine | Table format | DDL support: Create table | DDL support: Schema evolution | DML support: Reading data | DML support: Writing data | AWS Lake Formation row/column access control (optional) |
|---|---|---|---|---|---|---|
| Amazon Athena | Version 2 | ✓ | ✓ | ✓ | ✗ Copy-on-write, ✓ Merge-on-read | ✓ |
Note
Athena doesn't support incremental queries.
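As an illustration, the merge-on-read DML statements that Athena supports look like the following. The database, table, and column names here are hypothetical:

```sql
-- Hypothetical Iceberg table and columns; Athena applies these
-- changes by using merge-on-read (it writes delete files instead
-- of rewriting the affected data files).
UPDATE iceberg_db.orders SET status = 'shipped' WHERE order_id = 100;

DELETE FROM iceberg_db.orders WHERE status = 'cancelled';

MERGE INTO iceberg_db.orders AS t
USING iceberg_db.orders_staging AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);
```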
Working with Iceberg tables
For a quick start to using Iceberg in Athena, see the section Getting started with Iceberg tables in Athena SQL earlier in this guide.
The following table lists limitations and recommendations.
| Scenario | Limitation | Recommendation |
|---|---|---|
| Table DDL generation | Iceberg tables created with other engines can have properties that are not exposed in Athena. For these tables, it's not possible to generate the DDL. | Use the equivalent statement in the engine that created the table (for example, the `SHOW CREATE TABLE` statement in Spark). |
| Random Amazon S3 prefixes in objects that are written to an Iceberg table | By default, Iceberg tables that are created with Athena have the `write.object-storage.enabled` property set to `true`, which adds random prefixes to the Amazon S3 object keys. | To disable this behavior and gain full control over Iceberg table properties, create an Iceberg table with another engine such as Spark on Amazon EMR or AWS Glue. |
| Incremental queries | Not currently supported in Athena. | To use incremental queries to enable incremental data ingestion pipelines, use Spark on Amazon EMR or AWS Glue. |
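For example, when you create the table from Spark instead of Athena, you can control the object storage layout property yourself. This is a sketch only; the catalog, database, table, and column names are placeholders:

```sql
-- Create the Iceberg table from Spark on Amazon EMR or AWS Glue and
-- opt out of hashed (random) object prefixes; all names are placeholders.
CREATE TABLE glue_catalog.iceberg_db.my_table (
  id bigint,
  data string)
USING iceberg
TBLPROPERTIES ('write.object-storage.enabled' = 'false');
```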
Migrating existing tables to Iceberg
To migrate your current Athena or AWS Glue tables (also known as Hive tables) to Iceberg format, you can use either in-place or full data migration:
- In-place migration is the process of generating Iceberg's metadata files on top of existing data files.
- Full data migration creates the Iceberg metadata layer and also rewrites existing data files from the original table to the new Iceberg table.
The following sections provide an overview of the APIs available to migrate tables and guidance for choosing a migration strategy. For more information about these two strategies, see the Table Migration page in the Iceberg documentation.
In-place migration
In-place migration eliminates the need to rewrite all the data files. Instead, Iceberg metadata files are generated and linked to your existing data files. Iceberg offers three options for implementing in-place migration:
- Using the `snapshot` procedure, as explained in the sections Snapshot Table and Spark procedure: snapshot in the Iceberg documentation.
- Using the `add_files` procedure, as explained in the sections Add Files and Spark procedure: add_files in the Iceberg documentation.
- Using the `migrate` procedure, as explained in the sections Migrate Table and Spark procedure: migrate in the Iceberg documentation.
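For instance, the `snapshot` and `add_files` procedures can be called from Spark SQL as follows. The catalog name `glue_catalog` and the table names are placeholders for your environment:

```sql
-- Create an Iceberg table that references the Hive table's existing
-- data files without modifying the source table (names are placeholders).
CALL glue_catalog.system.snapshot('iceberg_db.hive_table', 'iceberg_db.iceberg_table');

-- Or add the data files of an existing Hive table to an
-- already-created Iceberg table.
CALL glue_catalog.system.add_files(
  table => 'iceberg_db.iceberg_table',
  source_table => 'iceberg_db.hive_table');
```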
Currently, the `migrate` procedure doesn't work directly with the AWS Glue Data Catalog; it works only with the Hive metastore. If you must use the `migrate` procedure instead of `snapshot` or `add_files`, you can use a temporary Amazon EMR cluster with the Hive metastore (HMS). This approach requires Iceberg version 1.2 or later.
Let's say you want to create the following Hive table:
![Migrating a Hive table to Amazon Athena](/images/prescriptive-guidance/latest/apache-iceberg-on-aws/images/hive-table.png)
You can create this Hive table and insert a sample row by running this code in the Athena console:

    CREATE EXTERNAL TABLE iceberg_db.hive_table (
      id bigint,
      data string)
    STORED AS PARQUET
    LOCATION 's3://datalake-xxxx/aws_workshop/iceberg_db/hive_table';

    INSERT INTO iceberg_db.hive_table VALUES (1, 'a');
If your Hive table is partitioned, include the partition statement and add the partitions according to Hive requirements.

    ALTER TABLE default.placeholder_table_for_migration ADD PARTITION (date = '2023-10-10');
Steps:

1. Create an Amazon EMR cluster without enabling the AWS Glue Data Catalog integration; that is, don't select the check boxes for Hive or Spark table metadata. That's because you will use the native Hive metastore (HMS) that's available in the cluster for this workaround.

2. Configure the Spark session to use the Iceberg Hive catalog implementation.

        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",

3. Validate that your Amazon EMR cluster isn't connected to the AWS Glue Data Catalog by running `show databases` or `show tables`.

4. Register the Hive table in the Hive metastore of your Amazon EMR cluster, and then use the Iceberg `migrate` procedure. This procedure creates the Iceberg metadata files in the same location as the Hive table.

5. Register the migrated Iceberg table in the AWS Glue Data Catalog.

6. Switch back to an Amazon EMR cluster that has the AWS Glue Data Catalog integration enabled.

7. Use the following Iceberg configuration in the Spark session.

        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://datalake-xxxx/aws_workshop",
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
You can now query this table from Amazon EMR, AWS Glue, or Athena.
![Show tables command for Iceberg table](/images/prescriptive-guidance/latest/apache-iceberg-on-aws/images/show-tables.png)
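As a sketch of the migration step above, the `migrate` procedure can be invoked from Spark SQL against the cluster's Hive metastore; the table name is a placeholder:

```sql
-- Replace the Hive table with an Iceberg table that reuses its data
-- files; by default, Iceberg retains the original table as a backup.
-- The table name is a placeholder.
CALL spark_catalog.system.migrate('default.hive_table');
```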
Full data migration
Full data migration recreates the data files as well as the metadata. This approach takes longer and requires additional computing resources compared with in-place migration. However, this option helps improve table quality: you can validate the data, make schema and partition changes, re-sort the data, and so on. To implement full data migration, use one of the following options:
- Use the `CREATE TABLE ... AS SELECT` (CTAS) statement in Spark on Amazon EMR, AWS Glue, or Athena. You can set the partition specification and table properties for the new Iceberg table by using the `PARTITIONED BY` and `TBLPROPERTIES` clauses. You can fine-tune the schema and partitioning for the new table according to your needs instead of simply inheriting them from the source table.
- Read from the source table and write the data as a new Iceberg table by using Spark on Amazon EMR or AWS Glue (see Creating a table in the Iceberg documentation).
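A minimal CTAS sketch in Spark SQL, assuming a Glue-backed catalog named `glue_catalog` and hypothetical database, table, and column names:

```sql
-- Full data migration: rewrite the source data into a new Iceberg
-- table with a new partition spec (all names are placeholders).
CREATE TABLE glue_catalog.iceberg_db.new_iceberg_table
USING iceberg
PARTITIONED BY (days(created_at))
TBLPROPERTIES ('format-version' = '2')
AS SELECT * FROM iceberg_db.hive_table;
```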
Choosing a migration strategy
To choose the best migration strategy, consider the questions in the following table.
| Question | Recommendation |
|---|---|
| What is the data file format (for example, CSV or Apache Parquet)? | Consider in-place migration if the data files are in a format that Iceberg supports, such as Apache Parquet. For formats such as CSV, use full data migration to rewrite the data. |
| Do you want to update or consolidate the table schema? | Use full data migration, which lets you define the schema of the new table instead of inheriting it from the source table. |
| Would the table benefit from changing the partition strategy? | Use full data migration, which lets you set a new partition specification while the data is rewritten. |
| Would the table benefit from adding or changing the sort order strategy? | Use full data migration, which lets you re-sort the data as it is rewritten. |
| Does the table have many small files? | Use full data migration, which consolidates small files as the data is rewritten. |