Accessing Amazon S3 tables with Amazon EMR

Note

The integration with AWS analytics services for table buckets is in preview release and is subject to change.

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases.

You can use Apache Iceberg clusters in Amazon EMR to work with S3 tables by connecting to table buckets in a Spark session. To connect to table buckets in Amazon EMR, you can use the AWS analytics services integration through AWS Glue Data Catalog, or you can use the open-source Amazon S3 Tables Catalog for Apache Iceberg client catalog.

Note

S3 Tables is supported on Amazon EMR version 7.5 or higher.

Connecting to S3 table buckets with Spark on an Amazon EMR Iceberg cluster

In this procedure, you set up an Amazon EMR cluster configured for Apache Iceberg and then launch a Spark session that connects to your table buckets. You can set this up using the AWS analytics services integration through AWS Glue, or you can use the open-source Amazon S3 Tables Catalog for Apache Iceberg client catalog. For information about the client catalog, see Accessing Amazon S3 tables from open source query engines.

Choose your method of using tables with Amazon EMR from the following options.

AWS analytics services integration

The following prerequisites are required to query tables with Spark on Amazon EMR using the AWS analytics services integration.

Prerequisites

Integrate your table buckets with AWS analytics services.
Create a resource link to your namespace.
Create the default service role for Amazon EMR (EMR_DefaultRole_V2). For details, see Service role for Amazon EMR (EMR role).
Create the Amazon EC2 instance profile for Amazon EMR (EMR_EC2_DefaultRole). For details, see Service role for cluster EC2 instances (EC2 instance profile).
Attach the AmazonS3TablesFullAccess policy to the IAM principal you use to create the Amazon EMR cluster.
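
The last prerequisite above can be sketched with the AWS CLI. This is a minimal sketch, not an official procedure: it assumes the policy uses the standard AWS managed policy ARN form, the principal name my-emr-admin-role is a hypothetical placeholder, and the run helper is an illustrative wrapper that only prints the command unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: attach the AmazonS3TablesFullAccess managed policy to an IAM principal.
# Assumption: the policy follows the standard AWS managed policy ARN form.
POLICY_ARN="arn:aws:iam::aws:policy/AmazonS3TablesFullAccess"

# Illustrative helper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# attach_s3tables_policy <user|role> <principal-name>
attach_s3tables_policy() {
  case "$1" in
    user) run aws iam attach-user-policy --user-name "$2" --policy-arn "$POLICY_ARN" ;;
    role) run aws iam attach-role-policy --role-name "$2" --policy-arn "$POLICY_ARN" ;;
    *)    echo "principal type must be 'user' or 'role'" >&2; return 1 ;;
  esac
}

# my-emr-admin-role is a hypothetical role name; replace it with your own.
attach_s3tables_policy role my-emr-admin-role
```

With the default dry run, the script only prints the command so you can review it. Run it with DRY_RUN=0 to call the AWS CLI.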

To set up an Amazon EMR cluster to query tables with Spark

Create an Iceberg cluster in Amazon EMR.
Connect to the Spark primary node using SSH.

Enter the following command to initialize a Spark session for Iceberg that connects to your tables. Replace the input placeholder with your table bucket name.


spark-shell \
--conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tablesbucket.type=glue \
--conf spark.sql.catalog.s3tablesbucket.warehouse=s3://amzn-s3-demo-table-bucket \
--conf spark.sql.defaultCatalog=s3tablesbucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark.
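
The cluster creation step in the procedure above can also be sketched with the AWS CLI instead of the console. This sketch rests on several assumptions: that Iceberg is enabled through the iceberg-defaults configuration classification, that the release label is emr-7.5.0, and that the cluster name, key pair name, instance type, and instance count are hypothetical placeholders. The run helper is an illustrative wrapper that prints the command unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: create an Iceberg-enabled Amazon EMR cluster (dry run by default).
# CLUSTER_NAME and KEY_NAME are hypothetical placeholders.
CLUSTER_NAME="iceberg-s3tables-demo"
KEY_NAME="my-key-pair"

# Assumption: this classification is what enables Iceberg for Spark on the cluster.
ICEBERG_CONFIG='[{"Classification":"iceberg-defaults","Properties":{"iceberg.enabled":"true"}}]'

# Illustrative helper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_iceberg_cluster() {
  # Instance type and count are example values, not recommendations.
  run aws emr create-cluster \
    --name "$CLUSTER_NAME" \
    --release-label emr-7.5.0 \
    --applications Name=Spark \
    --configurations "$ICEBERG_CONFIG" \
    --service-role EMR_DefaultRole_V2 \
    --ec2-attributes "InstanceProfile=EMR_EC2_DefaultRole,KeyName=$KEY_NAME" \
    --instance-type m5.xlarge \
    --instance-count 3
}

create_iceberg_cluster
```

The service role and instance profile names match the roles listed in the prerequisites. The key pair is what lets you connect to the primary node using SSH afterward.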

Amazon S3 Tables Catalog for Apache Iceberg

The following prerequisites are required to query tables with Spark on Amazon EMR using the Amazon S3 Tables Catalog for Apache Iceberg.

Prerequisites

Attach the AmazonS3TablesFullAccess policy to the IAM role you use for Amazon EMR.

To set up an Amazon EMR cluster to query tables with Spark

Create an Iceberg cluster in Amazon EMR.
Connect to the Spark primary node using SSH.

To initialize a Spark session for Iceberg that connects to your table bucket, enter the following command. Replace the input placeholder with your table bucket ARN.


spark-shell \
--packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
--conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
--conf spark.sql.defaultCatalog=s3tablesbucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark.
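
As a rough illustration of the final step, the following sketch writes a few Spark SQL statements to a file. The namespace example_namespace, the table example_table, and the file location are hypothetical. The statements assume the catalog is named s3tablesbucket, as in the spark-shell command in this procedure, and that your role is allowed to create namespaces and tables.

```shell
#!/bin/sh
# Sketch: write example Spark SQL statements to a file.
# example_namespace and example_table are hypothetical names.
SQL_FILE="${TMPDIR:-/tmp}/s3tables_example.sql"

cat > "$SQL_FILE" <<'EOF'
CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.example_namespace;

CREATE TABLE IF NOT EXISTS s3tablesbucket.example_namespace.example_table (
  id INT,
  name STRING,
  value INT
) USING iceberg;

INSERT INTO s3tablesbucket.example_namespace.example_table
VALUES (1, 'ABC', 100), (2, 'XYZ', 200);

SELECT * FROM s3tablesbucket.example_namespace.example_table;
EOF

echo "Wrote $SQL_FILE"
```

Inside spark-shell, each statement can be passed to spark.sql as a string. Alternatively, the file can be given to the spark-sql command with the -f option, assuming you supply the same --packages and --conf options used for spark-shell.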

Note

If you are using the DROP TABLE PURGE command with EMR:

EMR version 7.5

Set the Spark configuration property spark.sql.catalog.<your-catalog-name>.cache-enabled to false. If this property is set to true, run the command in a new session or application so that the table cache is not activated.

EMR versions < 7.5

DROP TABLE is not supported. You can use the S3 Tables DeleteTable REST API to delete a table.
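
Both cases in the note above can be sketched in shell. This sketch makes several assumptions: the catalog is named s3tablesbucket, the table bucket ARN is the example ARN from this page, and the namespace and table names are hypothetical. The delete_table function uses the aws s3tables delete-table command as one way to call the DeleteTable API. The run helper is an illustrative wrapper that prints the command unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: options for DROP TABLE PURGE and table deletion on Amazon EMR.
CATALOG_NAME="s3tablesbucket"
# Example ARN from this page; replace it with your table bucket ARN.
TABLE_BUCKET_ARN="arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1"

# Print the extra spark-shell option that turns off the catalog's table cache.
cache_disabled_conf() {
  echo "--conf spark.sql.catalog.$1.cache-enabled=false"
}

# Illustrative helper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# delete_table <namespace> <table-name>
# Deletes a table through the S3 Tables DeleteTable API using the AWS CLI.
delete_table() {
  run aws s3tables delete-table \
    --table-bucket-arn "$TABLE_BUCKET_ARN" \
    --namespace "$1" \
    --name "$2"
}

cache_disabled_conf "$CATALOG_NAME"
# example_namespace and example_table are hypothetical names.
delete_table example_namespace example_table
```

Append the printed --conf option to the spark-shell command when you start the session. Run the script with DRY_RUN=0 only when you intend to delete the table.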
