Accessing Amazon S3 tables with Amazon EMR
Note
The integration with AWS analytics services for table buckets is in preview release and is subject to change.
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases.
You can use Apache Iceberg clusters in Amazon EMR to work with S3 tables by connecting to table buckets in a Spark session. To connect to table buckets in Amazon EMR, you can use the AWS analytics services integration through AWS Glue Data Catalog, or you can use the open-source Amazon S3 Tables Catalog for Apache Iceberg client catalog.
Note
S3 Tables is supported on Amazon EMR version 7.5 or higher.
Connecting to S3 table buckets with Spark on an Amazon EMR Iceberg cluster
In this procedure, you set up an Amazon EMR cluster configured for Apache Iceberg and then launch a Spark session that connects to your table buckets. You can set this up using the AWS analytics services integration through AWS Glue, or you can use the open-source Amazon S3 Tables Catalog for Apache Iceberg client catalog. For information about the client catalog, see Accessing Amazon S3 tables from open source query engines.
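As a rough illustration of the client catalog option, the following sketch builds the Spark properties that a session would typically need to register a table bucket as an Iceberg catalog. It is not the official procedure. The helper names, the catalog name `s3tablesbucket`, and the example ARN are hypothetical placeholders. The class names are the ones commonly documented for Iceberg and the S3 Tables client catalog, so verify them against the version installed on your cluster.

```python
def s3tables_catalog_conf(catalog_name, table_bucket_arn):
    """Return Spark properties that register a table bucket as an Iceberg catalog.

    Assumption: the cluster already has the Iceberg runtime and the
    S3 Tables Catalog for Apache Iceberg JAR on the classpath.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Register the catalog name with Iceberg's Spark catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Delegate catalog operations to the S3 Tables client catalog.
        f"{prefix}.catalog-impl": "software.amazon.s3tables.iceberg.S3TablesCatalog",
        # The warehouse is the ARN of the table bucket.
        f"{prefix}.warehouse": table_bucket_arn,
        # Enable Iceberg's SQL extensions.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


def to_cli_args(conf):
    """Flatten a properties dict into `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical example values; replace with your own catalog name and ARN.
example_conf = s3tables_catalog_conf(
    "s3tablesbucket",
    "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1",
)
example_args = to_cli_args(example_conf)
```

The resulting arguments could be appended to a `spark-shell` or `spark-submit` command, or the dictionary could be passed key by key to `SparkSession.builder.config(...)` in a PySpark application.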
Choose your method of using tables with Amazon EMR from the following options.
Note
If you are using the DROP TABLE PURGE command with EMR:
- EMR version 7.5: Set the Spark config spark.sql.catalog.<your-catalog-name>.cache-enabled to false. If this config is set to true, run the command in a new session or application so the table cache is not activated.
- EMR versions < 7.5: DROP TABLE is not supported. You can use the S3 Tables DeleteTable REST API to delete a table.
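The two cases in this note can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. The function names are hypothetical. The `delete_table` call and its `tableBucketARN`, `namespace`, and `name` parameters are assumed to match the `s3tables` client in the AWS SDK for Python (Boto3), so confirm them against the SDK reference before relying on them.

```python
def drop_purge_conf(catalog_name):
    """Spark property that turns off the catalog's table cache.

    Intended for the case where DROP TABLE PURGE is run from Spark.
    """
    return {f"spark.sql.catalog.{catalog_name}.cache-enabled": "false"}


def delete_table_via_api(client, table_bucket_arn, namespace, name):
    """Delete a table through the S3 Tables DeleteTable API.

    `client` is expected to behave like boto3.client("s3tables")
    (assumption: the method is `delete_table` with these keyword names).
    """
    return client.delete_table(
        tableBucketARN=table_bucket_arn,
        namespace=namespace,
        name=name,
    )


# Usage sketch (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("s3tables")
# delete_table_via_api(
#     client,
#     "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1",
#     "example_namespace",
#     "example_table",
# )
```

Passing the client in as an argument keeps the helper free of credentials and network access, which makes it straightforward to exercise with a stub before pointing it at a real table bucket.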