Accessing Amazon S3 tables with Amazon EMR

Note

The integration with AWS analytics services for table buckets is in preview release and is subject to change.

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases.

You can use Apache Iceberg clusters in Amazon EMR to work with S3 tables by connecting to table buckets in a Spark session. To connect to table buckets in Amazon EMR, you can use the AWS analytics services integration through AWS Glue Data Catalog, or you can use the open source Amazon S3 Tables Catalog for Apache Iceberg client catalog.

Note

S3 Tables is supported on Amazon EMR version 7.5 or higher.

Connecting to S3 table buckets with Spark on an Amazon EMR Iceberg cluster

In this procedure, you set up an Amazon EMR cluster configured for Apache Iceberg and then launch a Spark session that connects to your table buckets. You can set this up using the AWS analytics services integration through AWS Glue, or you can use the open source Amazon S3 Tables Catalog for Apache Iceberg client catalog. For information about the client catalog, see Accessing Amazon S3 tables from open source query engines.

Choose your method of using tables with Amazon EMR from the following options.

AWS analytics services integration

The following prerequisites are required to query tables with Spark on Amazon EMR using the AWS analytics services integration.

Prerequisites
To set up an Amazon EMR cluster to query tables with Spark
  1. Create an Iceberg cluster in Amazon EMR.

  2. Connect to the Spark primary node using SSH.

  3. Enter the following command to initialize a Spark session for Iceberg that connects to your tables. Replace the input placeholder with your table bucket name.

    spark-shell \
    --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.s3tablesbucket.type=glue \
    --conf spark.sql.catalog.s3tablesbucket.warehouse=s3://amzn-s3-demo-bucket1 \
    --conf spark.sql.defaultCatalog=s3tablesbucket \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  4. Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark.
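
The following is a minimal sketch of statements you can run in the spark-shell session started above. The namespace and table names (example_namespace, example_table) are illustrative placeholders, not names defined elsewhere in this guide.

    // Create a namespace in the table bucket, then create an Iceberg table in it
    spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.example_namespace")
    spark.sql("""
      CREATE TABLE IF NOT EXISTS s3tablesbucket.example_namespace.example_table (
        id   INT,
        name STRING
      ) USING iceberg
    """)

    // Write a row and read it back
    spark.sql("INSERT INTO s3tablesbucket.example_namespace.example_table VALUES (1, 'first')")
    spark.sql("SELECT * FROM s3tablesbucket.example_namespace.example_table").show()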

Amazon S3 Tables Catalog for Apache Iceberg

The following prerequisites are required to query tables with Spark on Amazon EMR using the Amazon S3 Tables Catalog for Apache Iceberg.

Prerequisites
  • Attach the AmazonS3TablesFullAccess policy to the IAM role you use for Amazon EMR.
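
    For example, if you manage the role with the AWS CLI, attaching the managed policy looks like the following sketch, where EMR_S3Tables_Role is a placeholder for the name of your own Amazon EMR role:

    aws iam attach-role-policy \
    --role-name EMR_S3Tables_Role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3TablesFullAccess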

To set up an Amazon EMR cluster to query tables with Spark
  1. Create an Iceberg cluster in Amazon EMR.

  2. Connect to the Spark primary node using SSH.

  3. To initialize a Spark session for Iceberg that connects to your table bucket, enter the following command. Replace the input placeholder with your table bucket ARN.

    spark-shell \
    --packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
    --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
    --conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
    --conf spark.sql.defaultCatalog=s3tablesbucket \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  4. Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark.
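
As a quick check that the catalog is wired up, here is a minimal sketch of what you can run in the spark-shell session. The namespace and table names (example_namespace, example_table) are placeholders for an existing table in your bucket.

    // List the namespaces the S3 Tables catalog can see
    spark.sql("SHOW NAMESPACES IN s3tablesbucket").show()

    // Append a small DataFrame to an existing Iceberg table, then read it back
    val df = Seq((1, "first"), (2, "second")).toDF("id", "name")
    df.writeTo("s3tablesbucket.example_namespace.example_table").append()
    spark.table("s3tablesbucket.example_namespace.example_table").show()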

Note

If you are using the DROP TABLE PURGE command with Amazon EMR:

  • EMR version 7.5

    Set the Spark configuration spark.sql.catalog.<your-catalog-name>.cache-enabled to false (a sketch follows this note). If this configuration is set to true, run the command in a new session or application so that the table cache is not activated.

  • EMR versions < 7.5

    DROP TABLE is not supported. You can use the S3 Tables DeleteTable REST API to delete a table.
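
For example, on Amazon EMR 7.5 you can disable the table cache for the catalog when you start the shell by adding the following option to either spark-shell command shown earlier. This sketch assumes the catalog name s3tablesbucket used throughout this page.

    --conf spark.sql.catalog.s3tablesbucket.cache-enabled=false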