Apache Iceberg
You can populate Iceberg, Hudi, and Delta Lake tables in the AWS Glue Data Catalog using the following methods:
-
AWS Glue crawler – AWS Glue crawlers can automatically discover and populate Iceberg, Hudi, and Delta Lake table metadata in the Data Catalog. For more information, see Using crawlers to populate the Data Catalog.
-
AWS Glue ETL Jobs – You can create ETL jobs to write data to Iceberg, Hudi, and Delta Lake tables and populate their metadata in the Data Catalog. For more information, see Using data lake frameworks with AWS Glue ETL jobs.
-
AWS Glue console, AWS Lake Formation console, AWS CLI, or API – You can use the AWS Glue console, the Lake Formation console, the AWS CLI, or the API to create and manage Iceberg table definitions in the Data Catalog.
Creating Apache Iceberg tables
You can create Apache Iceberg tables that use the Apache Parquet data format in the
AWS Glue Data Catalog with data residing in Amazon S3. A table in the Data Catalog is the metadata definition that
represents the data in a data store. By default, AWS Glue creates Iceberg v2 tables. For the
difference between v1 and v2 tables, see Format version changes in the Apache Iceberg documentation.
You can use the AWS Glue console, the Lake Formation console, or the CreateTable operation in the AWS Glue API to create an Iceberg table in the Data Catalog. For more information, see CreateTable action (Python: create_table).
When you create an Iceberg table in the Data Catalog, you must specify the table format and metadata file path in Amazon S3 to be able to perform reads and writes.
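As a rough illustration of the CreateTable route, the sketch below assembles the arguments for a boto3 `create_table` call that requests an Iceberg table. The database name, table name, bucket, columns, and region are hypothetical placeholders, and the helper functions are not part of any AWS SDK. The request is built separately from the call so it can be inspected without AWS credentials.

```python
def build_create_table_request(database, table, location, columns, version="2"):
    """Build keyword arguments for the AWS Glue CreateTable operation.

    `columns` is a list of (name, hive_type) pairs. `location` is the
    Amazon S3 path for the table. All values here are illustrative.
    """
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [{"Name": n, "Type": t} for n, t in columns],
                "Location": location,
            },
        },
        # Declares the table format: asks the Data Catalog to create
        # Iceberg metadata at the given format version.
        "OpenTableFormatInput": {
            "IcebergInput": {"MetadataOperation": "CREATE", "Version": version}
        },
    }


def create_iceberg_table(request, region="us-west-2"):
    """Send the request to AWS Glue. Requires boto3 and valid credentials."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue", region_name=region)
    return glue.create_table(**request)


# Hypothetical names, for illustration only.
request = build_create_table_request(
    database="iceberg_db",
    table="orders",
    location="s3://amzn-s3-demo-bucket/iceberg/orders/",
    columns=[("order_id", "int"), ("customer", "string")],
)

# Uncomment to run against a real account:
# create_iceberg_table(request)
```

Passing `version="1"` to the builder would request a v1 table instead of the default v2.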
You can use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with AWS Lake Formation. For source data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by IAM permissions policies for Amazon S3 and AWS Glue actions. For more information, see Managing permissions.
Note
The Data Catalog doesn’t support creating partitions or adding Iceberg table properties.
Prerequisites
To create Iceberg tables in the Data Catalog, and set up Lake Formation data access permissions, you need to complete the following requirements:
-
Permissions required to create Iceberg tables without the data registered with Lake Formation:
In addition to the permissions required to create a table in the Data Catalog, the table creator requires the following permissions:
-
s3:PutObject on resource arn:aws:s3:::{bucketName}
-
s3:GetObject on resource arn:aws:s3:::{bucketName}
-
s3:DeleteObject on resource arn:aws:s3:::{bucketName}
-
Permissions required to create Iceberg tables with data registered with Lake Formation:
To use Lake Formation to manage and secure the data in your data lake, register your Amazon S3 location that has the data for tables with Lake Formation. This is so that Lake Formation can vend credentials to AWS analytical services such as Athena, Redshift Spectrum, and Amazon EMR to access data. For more information on registering an Amazon S3 location, see Adding an Amazon S3 location to your data lake.
A principal who reads and writes the underlying data that is registered with Lake Formation requires the following permissions:
-
lakeformation:GetDataAccess
-
DATA_LOCATION_ACCESS
A principal who has data location permissions on a location also has location permissions on all child locations.
For more information on data location permissions, see Underlying data access control.
-
To enable compaction, the service needs to assume an IAM role that has permissions to update tables in the Data Catalog. For details, see Table optimization prerequisites.
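The permissions listed above could be expressed roughly as follows. This is a minimal sketch, not an official policy template: the helper names, bucket name, and principal ARN are hypothetical. The list above names the bucket ARN as the resource; because S3 object actions are evaluated against object ARNs, the sketch also includes the `/*` form, which is an assumption about the intended scope. The second helper builds the arguments for the Lake Formation `grant_permissions` call that grants `DATA_LOCATION_ACCESS`.

```python
import json


def build_table_creator_policy(bucket_name, include_lake_formation=False):
    """Return an IAM policy document (as a dict) for an Iceberg table creator."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    statements = [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            # Assumption: the object-level ARN is added alongside the
            # bucket ARN named in the prerequisites.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
        }
    ]
    if include_lake_formation:
        # Needed by principals that read or write data registered
        # with Lake Formation.
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess"],
                "Resource": "*",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


def build_data_location_grant(principal_arn, bucket_name):
    """Return keyword arguments for the Lake Formation grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "DataLocation": {"ResourceArn": f"arn:aws:s3:::{bucket_name}"}
        },
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


# Hypothetical bucket and role, for illustration only.
policy = build_table_creator_policy("amzn-s3-demo-bucket", include_lake_formation=True)
grant = build_data_location_grant(
    "arn:aws:iam::111122223333:role/IcebergWriter", "amzn-s3-demo-bucket"
)
print(json.dumps(policy, indent=2))

# With credentials configured, the grant could be applied with:
# boto3.client("lakeformation").grant_permissions(**grant)
```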
Creating an Iceberg table
You can create Iceberg v1 and v2 tables using the AWS Glue console, the Lake Formation console, or the AWS Command Line Interface (AWS CLI), as documented on this page. You can also create Iceberg tables using an AWS Glue crawler. For more information, see Data Catalog and Crawlers in the AWS Glue Developer Guide.
To create an Iceberg table
Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Under Data Catalog, choose Tables, and use the Create table button to specify the following attributes:
-
Table name – Enter a name for the table. If you’re using Athena to access tables, follow the table naming tips in the Amazon Athena User Guide.
-
Database – Choose an existing database or create a new one.
-
Description – The description of the table. You can write a description to help you understand the contents of the table.
-
Table format – For Table format, choose Apache Iceberg.
-
Enable compaction – Choose Enable compaction to compact small Amazon S3 objects in the table into larger objects.
-
IAM role – To run compaction, the service assumes an IAM role on your behalf. You can choose an IAM role using the drop-down. Ensure that the role has the permissions required to enable compaction.
To learn more about the required permissions, see Table optimization prerequisites.
-
Location – Specify the path to the folder in Amazon S3 that stores the metadata table. Iceberg needs a metadata file and location in the Data Catalog to be able to perform reads and writes.
-
Schema – Choose Add columns to add columns and their data types. You have the option to create an empty table and update the schema later. The Data Catalog supports Hive data types. For more information, see Hive data types. Iceberg allows you to evolve the schema and partitioning after you create the table. You can use Athena queries to update the table schema and Spark queries to update partitions.
-