Accessing Amazon S3 tables with Amazon EMR
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that
simplifies running big data frameworks, such as Apache Hadoop and
Apache Spark, on AWS to process and analyze vast amounts of data. Using
these frameworks and related open-source projects, you can process data for analytics
purposes and business intelligence workloads. Amazon EMR also lets you transform and move
large amounts of data into and out of other AWS data stores and databases.
You can use Apache Iceberg clusters in Amazon EMR to work with S3 tables by
connecting to table buckets in a Spark session. To connect to
table buckets in Amazon EMR, you can use the AWS analytics services integration through
AWS Glue Data Catalog, or you can use the open source Amazon S3 Tables Catalog for Apache Iceberg client catalog.
Connecting to S3 table buckets with Spark on
an Amazon EMR Iceberg cluster
In this procedure, you set up an Amazon EMR cluster configured for Apache
Iceberg and then launch a Spark session that connects to your
table buckets. You can set this up using the AWS analytics services integration
through AWS Glue, or you can use the open source Amazon S3 Tables Catalog for Apache Iceberg client
catalog. For information about the client catalog, see Accessing Amazon S3 tables from open source
query engines.
Choose your method of using tables with Amazon EMR from the following options.
- Amazon S3 Tables Catalog for Apache Iceberg
-
The following prerequisites are required to query tables with Spark on
Amazon EMR using the Amazon S3 Tables Catalog for Apache Iceberg.
To set up an Amazon EMR cluster to query tables with Spark
Create a cluster with the following configuration. To use this example, replace the
user input placeholder
values
with your own.
aws emr create-cluster --release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name My_Spark_Iceberg_Cluster \
--log-uri s3://amzn-s3-demo-bucket/ \
--instance-type m5.xlarge \
--instance-count 2 \
--service-role EMR_DefaultRole \
--ec2-attributes \
InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0,KeyName=my-key-pair
configurations.json:
[{
  "Classification": "iceberg-defaults",
  "Properties": {"iceberg.enabled": "true"}
}]
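Before connecting, the cluster must be fully provisioned. As a sketch, you can poll for this with the AWS CLI; the cluster ID below is a placeholder for the ID returned by the create-cluster call.

```
# Block until the cluster reaches the RUNNING state
aws emr wait cluster-running --cluster-id j-XXXXXXXXXXXXX

# Or check the current state directly
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.Status.State' --output text
```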
-
Connect to the Spark primary node using
SSH.
-
To initialize a Spark
session for Iceberg that connects to your table
bucket, enter the following command. Replace the input placeholder
with your table bucket ARN.
spark-shell \
--packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
--conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
--conf spark.sql.defaultCatalog=s3tablesbucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
-
Query your tables with Spark SQL. For example
queries, see Querying S3 tables with Spark.
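As a minimal sketch of what such a session might look like, the following statements create a namespace and table, write a row, and read it back. The namespace and table names here are illustrative placeholders, not part of this guide.

```scala
// Run inside the spark-shell session started above; "s3tablesbucket"
// is the catalog name configured in the spark-shell command.
spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.example_namespace")
spark.sql("""
  CREATE TABLE IF NOT EXISTS s3tablesbucket.example_namespace.orders (
    id INT,
    amount DOUBLE
  ) USING iceberg
""")
spark.sql("INSERT INTO s3tablesbucket.example_namespace.orders VALUES (1, 9.99)")
spark.sql("SELECT * FROM s3tablesbucket.example_namespace.orders").show()
```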
- AWS analytics services integration
-
The following prerequisites are required to query tables with
Spark on Amazon EMR using the AWS analytics services
integration.
To set up an Amazon EMR cluster to query tables with Spark
Create a cluster with the following configuration. To use this example, replace the
user input placeholder
values
with your own.
aws emr create-cluster --release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name My_Spark_Iceberg_Cluster \
--log-uri s3://amzn-s3-demo-bucket/ \
--instance-type m5.xlarge \
--instance-count 2 \
--service-role EMR_DefaultRole \
--ec2-attributes \
InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0,KeyName=my-key-pair
configurations.json:
[{
  "Classification": "iceberg-defaults",
  "Properties": {"iceberg.enabled": "true"}
}]
-
Connect to the Spark primary node using
SSH.
-
Enter the following command to initialize a Spark
session for Iceberg that connects to your tables.
Replace the input placeholder
with your
table bucket name.
spark-shell \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.defaultCatalog=s3tables \
--conf spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tables.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.s3tables.client.region=us-east-1 \
--conf spark.sql.catalog.s3tables.glue.id=111122223333
-
Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark.
If you are using the DROP TABLE PURGE command with Amazon EMR:
- EMR version 7.5
Set the Spark config spark.sql.catalog.<your-catalog-name>.cache-enabled to false. If this config is set to true, run the command in a new session or application so the table cache is not activated.
- EMR versions higher than 7.5
DROP TABLE is not supported. You can use the S3 Tables DeleteTable REST API to delete a table.
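The DeleteTable operation is also exposed through the AWS CLI. As a sketch, assuming the example table bucket ARN from earlier and placeholder namespace and table names:

```
# Delete a table directly through the S3 Tables API
aws s3tables delete-table \
  --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
  --namespace example_namespace \
  --name example_table
```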