Apache Iceberg and Lake Formation with Amazon EMR

Amazon EMR releases 6.15.0 and higher include support for fine-grained access control based on AWS Lake Formation with Apache Iceberg when you read and write data with Spark SQL. Amazon EMR supports table, row, column, and cell-level access control with Apache Iceberg. With this feature, you can run snapshot queries on copy-on-write tables to query the latest snapshot of the table at a given commit or compaction instant.

To use the Iceberg format, set the following configurations. Replace DB_LOCATION with the Amazon S3 path where your Iceberg tables are located, and replace the Region and account ID placeholders (AWS_REGION and ACCOUNT_ID) with your own values.

spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
  --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg_catalog.warehouse=s3://DB_LOCATION \
  --conf spark.sql.catalog.iceberg_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.iceberg_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.iceberg_catalog.glue.account-id=ACCOUNT_ID \
  --conf spark.sql.catalog.iceberg_catalog.glue.id=ACCOUNT_ID \
  --conf spark.sql.catalog.iceberg_catalog.client.assume-role.region=AWS_REGION \
  --conf spark.sql.secureCatalog=iceberg_catalog
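As a minimal sketch, the same settings can be generated from shell variables so that each placeholder is substituted in one place. The variable names, the function name iceberg_conf, and the example values (bucket, account ID, Region) are assumptions for illustration and do not come from Amazon EMR. The script only prints the resulting command for review; it does not launch spark-sql.

```shell
#!/bin/sh
# Hypothetical example values; substitute your own.
DB_LOCATION="amzn-s3-demo-bucket/iceberg"   # S3 path, without the s3:// prefix
ACCOUNT_ID="111122223333"
AWS_REGION="us-east-1"
CATALOG="iceberg_catalog"

# Print one key=value setting per line, mirroring the spark-sql command above.
iceberg_conf() {
  prefix="spark.sql.catalog.${CATALOG}"
  printf '%s\n' \
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension" \
    "${prefix}=org.apache.iceberg.spark.SparkCatalog" \
    "${prefix}.warehouse=s3://${DB_LOCATION}" \
    "${prefix}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog" \
    "${prefix}.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
    "${prefix}.glue.account-id=${ACCOUNT_ID}" \
    "${prefix}.glue.id=${ACCOUNT_ID}" \
    "${prefix}.client.assume-role.region=${AWS_REGION}" \
    "spark.sql.secureCatalog=${CATALOG}"
}

# Assemble the full command line. The settings contain no whitespace,
# so splitting the function output into words is safe here.
cmd="spark-sql"
for kv in $(iceberg_conf); do
  cmd="${cmd} --conf ${kv}"
done

# Print the command for review instead of running it.
echo "$cmd"
```

To run the command on a cluster, paste the printed output into a shell on the primary node after checking the values.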

To have Lake Formation use the record server to manage your Spark catalog, set spark.sql.catalog.<managed_catalog_name>.lf.managed to true.
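For illustration, the property name can be expanded for a concrete catalog as follows. This is a sketch only: the helper name lf_managed_conf is hypothetical, and iceberg_catalog is assumed to be the catalog you want Lake Formation to manage.

```shell
#!/bin/sh
# Hypothetical helper: print the lf.managed setting for the catalog name in $1.
lf_managed_conf() {
  printf '%s\n' "--conf spark.sql.catalog.$1.lf.managed=true"
}

# Example: emit the argument for a catalog named iceberg_catalog (an assumed name).
lf_managed_conf iceberg_catalog
```

The printed argument would be added alongside the other --conf arguments passed to spark-sql.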

Also, do not pass the following assume-role settings:

--conf spark.sql.catalog.my_catalog.client.assume-role.region
--conf spark.sql.catalog.my_catalog.client.assume-role.arn
--conf spark.sql.catalog.my_catalog.client.assume-role.tags.LakeFormationAuthorizedCaller
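A small pre-flight check along these lines could catch the settings listed above before a job is submitted. This is a sketch under assumptions: the function name check_no_assume_role is hypothetical, and the check is scoped to the catalog named in its first argument (my_catalog in the list above), since the restriction is stated for that catalog.

```shell
#!/bin/sh
# Hypothetical helper. Usage: check_no_assume_role CATALOG_NAME [spark-sql args...]
# Returns 1 if any argument sets one of the three assume-role keys for that
# catalog, and 0 otherwise.
check_no_assume_role() {
  _cat="$1"
  shift
  for _arg in "$@"; do
    case "$_arg" in
      "spark.sql.catalog.${_cat}.client.assume-role.region"*|\
      "spark.sql.catalog.${_cat}.client.assume-role.arn"*|\
      "spark.sql.catalog.${_cat}.client.assume-role.tags.LakeFormationAuthorizedCaller"*)
        echo "disallowed setting: ${_arg}" >&2
        return 1
        ;;
    esac
  done
  return 0
}

# Example: this argument list contains none of the three settings, so the check passes.
if check_no_assume_role my_catalog --conf spark.sql.catalog.my_catalog.lf.managed=true; then
  echo "no assume-role settings found"
fi
```

Settings for other catalogs are ignored, because the match is anchored on the catalog name passed as the first argument.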

The following support matrix lists some core features of Apache Iceberg with Lake Formation:

Feature | Copy on Write | Merge on Read
Snapshot queries - Spark SQL
Read-optimized queries - Spark SQL
Incremental queries
Time travel queries
Metadata tables
DML INSERT commands
DDL commands
Spark datasource queries
Spark datasource writes