AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use AWS Glue jobs to run extract, transform, and load (ETL) pipelines to load data into your data lakes. For more information about AWS Glue, see What is AWS Glue? in the AWS Glue Developer Guide.
An AWS Glue job encapsulates a script that connects to your source data, processes it, and then writes it out to your data target. Typically, a job runs extract, transform, and load (ETL) scripts. Jobs can run scripts designed for Apache Spark and Ray runtime environments or general-purpose Python scripts (Python shell jobs). You can monitor job runs to understand runtime metrics such as completion status, duration, and start time.
You can use AWS Glue jobs to process data in your S3 tables by connecting to your tables directly using the Amazon S3 Tables Catalog for Apache Iceberg client catalog JAR.
Note
S3 Tables is supported on AWS Glue version 5.0 or higher.
Prerequisites for S3 Tables AWS Glue ETL jobs
Before you can query tables from an AWS Glue job, you must configure an IAM role that AWS Glue can use to run the job, and upload the Amazon S3 Tables Catalog for Apache Iceberg JAR to an S3 bucket that AWS Glue can access when it runs the job.
- Create an IAM role for AWS Glue.
  - Attach the AmazonS3TablesFullAccess managed policy to the role.
  - Attach the AmazonS3FullAccess managed policy to the role.
- Download the latest version of the Amazon S3 Tables Catalog for Apache Iceberg client catalog JAR from Maven, and then upload it to an Amazon S3 bucket.
  - Check for the latest version on Maven Central. You can download the JAR from Maven Central using your browser, or using the following command. Make sure to replace the version number with the latest version.

    wget https://repo1.maven.org/maven2/software/amazon/s3tables/s3-tables-catalog-for-iceberg-runtime/0.1.4/s3-tables-catalog-for-iceberg-runtime-0.1.4.jar

  - Upload the downloaded JAR to an S3 bucket that your AWS Glue IAM role can access. You can use the following AWS CLI command to upload the JAR. Make sure to replace the version number with the latest version, and the bucket name and path with your own.

    aws s3 cp s3-tables-catalog-for-iceberg-runtime-0.1.4.jar s3://amzn-s3-demo-bucket1/jars/
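If you prefer to script these prerequisites, the following is a minimal boto3 sketch of the same steps. The role name, bucket name, and JAR version are hypothetical placeholders; replace them with your own values. The trust policy simply allows AWS Glue to assume the role.

import json
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

# Hypothetical placeholder values -- replace with your own.
role_name = "AWSGlueServiceRole-S3Tables"
bucket_name = "amzn-s3-demo-bucket1"
jar_file = "s3-tables-catalog-for-iceberg-runtime-0.1.4.jar"

# Create a role that AWS Glue can assume.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies listed in the prerequisites.
for policy_arn in (
    "arn:aws:iam::aws:policy/AmazonS3TablesFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)

# Upload the client catalog JAR you downloaded from Maven Central.
s3.upload_file(jar_file, bucket_name, f"jars/{jar_file}")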
Create an AWS Glue ETL PySpark script to query S3 tables
To access your table data when you run an AWS Glue ETL job, you use a PySpark script to configure a Spark session for Apache Iceberg that connects to your S3 table bucket when the job runs. You can modify an existing script to connect to your table buckets or create a new script. For more information on creating AWS Glue scripts, see Tutorial: Writing an AWS Glue for Spark script in the AWS Glue Developer Guide.
Use the following code snippet in your PySpark script for configuring Spark's connection to your table bucket. Replace the placeholder values with the information for your own table bucket.
# Configure Spark session for Iceberg
spark_conf = SparkSession.builder.appName("GlueJob") \
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl", "software.amazon.s3tables.iceberg.S3TablesCatalog") \
    .config("spark.sql.catalog.s3tablesbucket.warehouse", "arn:aws:s3tables:REGION:111122223333:bucket/amzn-s3-demo-table-bucket") \
    .config("spark.sql.defaultCatalog", "s3tablesbucket") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.s3tablesbucket.cache-enabled", "false")
Sample script
The following is a sample PySpark script you can use to test querying S3 tables with an AWS Glue job. The script connects to your table bucket and then runs queries to create a new namespace, create a sample table, insert data into the table, and query the table data. To use the script, replace the placeholder values with the information for your own table bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

# Configure Spark session for Iceberg
spark_conf = SparkSession.builder.appName("GlueJob") \
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl", "software.amazon.s3tables.iceberg.S3TablesCatalog") \
    .config("spark.sql.catalog.s3tablesbucket.warehouse", "arn:aws:s3tables:REGION:111122223333:bucket/amzn-s3-demo-table-bucket") \
    .config("spark.sql.defaultCatalog", "s3tablesbucket") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.s3tablesbucket.cache-enabled", "false")

# Initialize Glue context with custom Spark configuration
sc = SparkContext.getOrCreate(conf=spark_conf.getOrCreate().sparkContext.getConf())
glueContext = GlueContext(sc)
spark = glueContext.spark_session

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

namespace = "new_namespace"
table = "new_table"

def run_sql(query):
    try:
        result = spark.sql(query)
        result.show()
        return result
    except Exception as e:
        print(f"Error executing query '{query}': {str(e)}")
        return None

def main():
    try:
        # Create a new namespace if it doesn't exist
        print("CREATE NAMESPACE")
        run_sql(f"CREATE NAMESPACE IF NOT EXISTS {namespace}")

        # Show all namespaces
        print("SHOW NAMESPACES")
        run_sql("SHOW NAMESPACES")

        # Describe a specific namespace
        print("DESCRIBE NAMESPACE")
        run_sql(f"DESCRIBE NAMESPACE {namespace}")

        # Create table in the namespace
        print("CREATE TABLE")
        create_table_query = f"""
        CREATE TABLE IF NOT EXISTS {namespace}.{table} (
            id INT,
            name STRING,
            value INT
        )
        """
        run_sql(create_table_query)

        # Insert data into table
        print("INSERT INTO")
        insert_query = f"""
        INSERT INTO {namespace}.{table}
        VALUES
            (1, 'ABC', 100),
            (2, 'XYZ', 200)
        """
        run_sql(insert_query)

        # Show tables in the namespace
        print("SHOW TABLES")
        run_sql(f"SHOW TABLES IN {namespace}")

        # Select all from a specific table
        print("SELECT FROM TABLE")
        run_sql(f"SELECT * FROM {namespace}.{table} LIMIT 20")

    except Exception as e:
        print(f"An error occurred in main execution: {str(e)}")
        raise  # Re-raise the exception for Glue to handle
    finally:
        job.commit()

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print(f"Job failed with error: {str(e)}")
        sys.exit(1)
Create an AWS Glue ETL job that queries S3 tables
When you set up an AWS Glue job for S3 tables, you include the Amazon S3 Tables Catalog for Apache Iceberg JAR as an extra dependency so that the job can query your tables directly. The following procedures show how you can do this using the AWS CLI, or using the console with AWS Glue Studio. You can use AWS Glue Studio to create a job through a visual interface, an interactive code notebook, or a script editor. For more information, see Authoring jobs in AWS Glue in the AWS Glue User Guide.
The following procedure shows how to use the AWS Glue Studio script editor to create an ETL job that queries your S3 tables.
Prerequisites
- Create an AWS Glue ETL PySpark script to query S3 tables.
- Open the AWS Glue console at https://console.aws.amazon.com/glue/.
- From the navigation pane, choose ETL jobs.
- Choose Script editor, then choose Upload script and upload the PySpark script you created to query S3 tables.
- Select the Job details tab and enter the following for Basic properties.
  - For Name, enter a name for the job.
  - For IAM Role, select the role you created for AWS Glue.
- Expand Advanced properties and for Dependent JARs path, enter the S3 URI of the client catalog JAR you uploaded to an S3 bucket as a prerequisite. For example, s3://amzn-s3-demo-bucket1/jars/s3-tables-catalog-for-iceberg-runtime-0.1.4.jar
- Choose Save to create the job.
- Choose Run to start the job, and review the job status under the Runs tab.
The following procedure shows how to use the AWS CLI to create an ETL job that queries your S3 tables. To use the commands, replace the placeholder values with your own.
Prerequisites
Create an AWS Glue ETL PySpark script to query S3 tables and upload it to an S3 bucket.
- Create a Glue job.

  aws glue create-job \
      --name etl-tables-job \
      --role arn:aws:iam::111122223333:role/AWSGlueServiceRole \
      --command '{
          "Name": "glueetl",
          "ScriptLocation": "s3://amzn-s3-demo-bucket1/scripts/glue-etl-query.py",
          "PythonVersion": "3"
      }' \
      --default-arguments '{
          "--job-language": "python",
          "--class": "GlueApp",
          "--extra-jars": "s3://amzn-s3-demo-bucket1/jars/s3-tables-catalog-for-iceberg-runtime-0.1.4.jar"
      }' \
      --glue-version "5.0"

- Start your job.

  aws glue start-job-run \
      --job-name etl-tables-job
To review your job status, copy the run ID from the output of the previous command and enter it into the following command.
aws glue get-job-run --job-name etl-tables-job \
    --run-id jr_ec9a8a302e71f8483060f87b6c309601ea9ee9c1ffc2db56706dfcceb3d0e1ad
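As an alternative to the CLI, the following is a minimal boto3 sketch that starts the same job and polls the run until it reaches a terminal state. The job name matches the one created above; adjust the polling interval to suit your workflow.

import time
import boto3

glue = boto3.client("glue")

# Start the job and capture the run ID.
run_id = glue.start_job_run(JobName="etl-tables-job")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="etl-tables-job", RunId=run_id)["JobRun"]["JobRunState"]
    print(f"Job run {run_id} is {state}")
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)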