
Migrating AWS Glue for Spark jobs to AWS Glue version 5.0


This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, 3.0, and 4.0 and AWS Glue 5.0 so that you can migrate your Spark applications and ETL jobs to AWS Glue 5.0. It also describes the features in AWS Glue 5.0 and the advantages of using it.

To use this feature with your AWS Glue ETL jobs, choose 5.0 for the Glue version when creating your jobs.

New features

This section describes new features and advantages of AWS Glue version 5.0.

  • Apache Spark update from 3.3.0 in AWS Glue 4.0 to 3.5.2 in AWS Glue 5.0. See Major enhancements from Spark 3.3.0 to Spark 3.5.2.

  • Spark-native fine-grained access control (FGAC) using Lake Formation. For more information, see Using AWS Glue with AWS Lake Formation for fine-grained access control.

    Note the following considerations or limitations for Spark-native FGAC:

    • Currently data writes are not supported

    • Writing into Iceberg through GlueContext using Lake Formation requires use of IAM access control instead

    For a complete list of limitations and considerations when using Spark-native FGAC, see Using AWS Glue with AWS Lake Formation for fine-grained access control.

  • Support for Amazon S3 Access Grants as a scalable access control solution to your Amazon S3 data from AWS Glue. For more information, see Using Amazon S3 Access Grants with AWS Glue.

  • Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1

  • Amazon SageMaker Unified Studio support.

  • Amazon SageMaker Lakehouse and data abstraction integration. For more information, see Querying metastore data catalogs from AWS Glue ETL.

  • Support to install additional Python libraries using requirements.txt. For more information, see Installing additional Python libraries in AWS Glue 5.0 using requirements.txt.

  • AWS Glue 5.0 supports data lineage in Amazon DataZone. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone. For more information, see Data lineage in Amazon DataZone.

    To configure this on the AWS Glue console, turn on Generate lineage events, and enter your Amazon DataZone domain ID on the Job details tab.

    The screenshot shows turning on Amazon DataZone data lineage for AWS Glue.

    Alternatively, you can provide the following job parameter (provide your DataZone domain ID):

    • Key: --conf

    • Value:

      extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<your-domain-ID>
  • Connector and JDBC driver updates. For more information, see Appendix B: JDBC driver upgrades and Appendix C: Connector upgrades.

  • Java update from 8 to 17.

  • Increased storage for AWS Glue G.1X and G.2X workers, with disk space increasing to 94 GB and 138 GB respectively. For more information, see Jobs.

  • Support for AWS SDK for Java, version 2 - AWS Glue 5.0 jobs can use the AWS SDK for Java version 1.12.569 or 2.28.8 if the job supports v2. The AWS SDK for Java 2.x is a major rewrite of the version 1.x code base. It's built on top of Java 8+ and adds several frequently requested features, including support for non-blocking I/O and the ability to plug in a different HTTP implementation at runtime. For more information, including a migration guide from SDK for Java v1 to v2, see the AWS SDK for Java, version 2 guide.
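Because the `--conf` value for lineage collection is long and easy to mistype, it can help to assemble it programmatically before passing it to the Glue job APIs. A minimal sketch; the function name is illustrative and the domain ID is a placeholder you must replace:

```python
def build_lineage_conf(domain_id: str) -> dict:
    """Assemble the --conf job parameter that turns on OpenLineage
    event collection for Amazon DataZone, as documented above."""
    value = (
        "extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
        " --conf spark.openlineage.transport.type=amazon_datazone_api"
        f" --conf spark.openlineage.transport.domainId={domain_id}"
    )
    return {"--conf": value}

# Merge the result into a job's DefaultArguments map (for example when
# calling the CreateJob or UpdateJob API); "dzd_example" is a
# placeholder domain ID, not a real one.
lineage_args = build_lineage_conf("dzd_example")
```

The returned dict can be merged into any other job arguments you already pass, so the lineage setting does not clobber existing parameters.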

Breaking changes

Note the following breaking changes:

  • GlueContext-based table-level access control with AWS Lake Formation permissions, supported in Glue 4.0 and earlier, is not supported in Glue 5.0. Use the new Spark-native fine-grained access control (FGAC) in Glue 5.0. Note the following details:

    • If you need fine-grained access control (FGAC) for row-, column-, or cell-level access, you will need to migrate from GlueContext/Glue DynamicFrame in Glue 4.0 and earlier to Spark DataFrame in Glue 5.0.

    • If you need database- or table-level access control, you can grant database/table permissions to your roles. This bypasses the need to migrate from GlueContext to Spark DataFrames.

    • If you don't need FGAC, then no migration to Spark DataFrame is necessary, and GlueContext features such as job bookmarks and pushdown predicates will continue to work.

  • Vectorized SIMD CSV reader is not supported.

  • Continuous logging to output log group is not supported. Use error log group instead.

  • The AWS Glue job run insights job-insights-rule-driver has been deprecated. The job-insights-rca-driver log stream is now located in the error log group.

  • Athena-based custom/marketplace connectors are not supported.

  • Adobe Marketo Engage, Facebook Ads, Google Ads, Google Analytics 4, Google Sheets, Hubspot, Instagram Ads, Intercom, Jira Cloud, Oracle NetSuite, Salesforce, Salesforce Marketing Cloud, Salesforce Marketing Cloud Account Engagement, SAP OData, ServiceNow, Slack, Snapchat Ads, Stripe, Zendesk and Zoho CRM connectors are not supported.

  • Custom log4j properties are not supported in Glue 5.0.

Major enhancements from Spark 3.3.0 to Spark 3.5.2

Note the following enhancements:

  • Python client for Spark Connect (SPARK-39375).

  • Implement support for DEFAULT values for columns in tables (SPARK-38334).

  • Support "Lateral Column Alias References" (SPARK-27561).

  • Harden SQLSTATE usage for error classes (SPARK-41994).

  • Enable Bloom filter Joins by default (SPARK-38841).

  • Better Spark UI scalability and driver stability for large applications (SPARK-41053).

  • Async Progress Tracking in Structured Streaming (SPARK-39591).

  • Python arbitrary stateful processing in structured streaming (SPARK-40434).

  • Pandas API coverage improvements (SPARK-42882) and NumPy input support in PySpark (SPARK-39405).

  • Provide a memory profiler for PySpark user-defined functions (SPARK-40281).

  • Implement PyTorch distributor (SPARK-41589).

  • Publish SBOM artifacts (SPARK-41893).

  • Support IPv6-only environment (SPARK-39457).

  • Customized K8s scheduler (Apache YuniKorn and Volcano) GA (SPARK-42802).

  • Scala and Go client support in Spark Connect (SPARK-42554) and (SPARK-43351).

  • PyTorch-based distributed ML Support for Spark Connect (SPARK-42471).

  • Structured streaming support for Spark Connect in Python and Scala (SPARK-42938).

  • Pandas API support for the Python Spark Connect Client (SPARK-42497).

  • Introduce Arrow Python UDFs (SPARK-40307).

  • Support Python user-defined table functions (SPARK-43798).

  • Migrate PySpark errors onto error classes (SPARK-42986).

  • PySpark test framework (SPARK-44042).

  • Add support for Datasketches HllSketch (SPARK-16484).

  • Built-in SQL function improvement (SPARK-41231).

  • IDENTIFIER clause (SPARK-43205).

  • Add SQL functions into Scala, Python and R API (SPARK-43907).

  • Add named argument support for SQL functions (SPARK-43922).

  • Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated (SPARK-41469).

  • Distributed ML support through Spark Connect (SPARK-42471).

  • DeepSpeed distributor (SPARK-44264).

  • Implement changelog checkpointing for RocksDB state store (SPARK-43421).

  • Introduce watermark propagation among operators (SPARK-42376).

  • Introduce dropDuplicatesWithinWatermark (SPARK-42931).

  • RocksDB state store provider memory management enhancements (SPARK-43311).

Actions to migrate to AWS Glue 5.0

For existing jobs, change the Glue version from the previous version to Glue 5.0 in the job configuration.

  • In AWS Glue Studio, choose Glue 5.0 - Supports Spark 3.5.2, Scala 2, Python 3 in Glue version.

  • In the API, choose 5.0 in the GlueVersion parameter in the UpdateJob API operation.

For new jobs, choose Glue 5.0 when you create a job.

  • In the console, choose Spark 3.5.2, Python 3 (Glue Version 5.0) or Spark 3.5.2, Scala 2 (Glue Version 5.0) in Glue version.

  • In AWS Glue Studio, choose Glue 5.0 - Supports Spark 3.5.2, Scala 2, Python 3 in Glue version.

  • In the API, choose 5.0 in the GlueVersion parameter in the CreateJob API operation.

To view Spark event logs of AWS Glue 5.0 when coming from AWS Glue 2.0 or earlier, launch an upgraded Spark history server for AWS Glue 5.0 using AWS CloudFormation or Docker.

Migration checklist

Review this checklist for migration:

  • Java 17 updates

  • [Scala] Upgrade AWS SDK calls from v1 to v2

  • Python 3.10 to 3.11 migration

  • [Python] Update boto references from 1.26 to 1.34

AWS Glue 5.0 features

This section describes AWS Glue features in more detail.

Querying metastore data catalogs from AWS Glue ETL

You can register your AWS Glue job to access the AWS Glue Data Catalog, which makes tables and other metastore resources available to disparate consumers. The Data Catalog supports a multi-catalog hierarchy, which unifies all your data across Amazon S3 data lakes. It also provides both a Hive metastore API and an open-source Apache Iceberg API for accessing the data. These features are available to AWS Glue and other data-oriented services like Amazon EMR, Amazon Athena and Amazon Redshift.

When you create resources in the Data Catalog, you can access them from any SQL engine that supports the Apache Iceberg REST API. AWS Lake Formation manages permissions. After configuration, you can use AWS Glue's capabilities to query disparate data sources by querying these metastore resources with familiar applications, including Apache Spark and Trino.

How metadata resources are organized

Data is organized in a logical hierarchy of catalogs, databases and tables, using the AWS Glue Data Catalog:

  • Catalog – A logical container that holds objects from a data store, such as schemas or tables.

  • Database – Organizes data objects such as tables and views in a catalog.

  • Tables and views – Data objects in a database that provide an abstraction layer with an understandable schema. They make it easy to access underlying data, which could be in various formats and in various locations.
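One way to reach this catalog hierarchy from Spark is through Apache Iceberg's REST catalog support. The sketch below builds a Spark configuration map for that; the property names follow the open-source Iceberg REST catalog options, while the endpoint URL format and the warehouse value are assumptions to verify against the current AWS Glue documentation for your Region:

```python
def iceberg_rest_catalog_conf(catalog_name: str, account_id: str, region: str) -> dict:
    """Sketch of Spark settings for reaching the Data Catalog through
    Apache Iceberg's REST catalog interface. The sigv4 properties come
    from Iceberg's AWS integration; the uri and warehouse values are
    assumptions, not guaranteed endpoints."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": account_id,  # assumed: the catalog owner's account
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }

# Example: settings for a catalog exposed to Spark as "dc".
conf = iceberg_rest_catalog_conf("dc", "123456789012", "us-east-1")
```

Each key/value pair would be passed to Spark (for example through the `--conf` job parameter) so that `dc.<database>.<table>` resolves against the Data Catalog.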

Migrating from AWS Glue 4.0 to AWS Glue 5.0

All existing job parameters and major features that exist in AWS Glue 4.0 will exist in AWS Glue 5.0, except machine learning transforms.

The following new parameters were added:

  • --enable-lakeformation-fine-grained-access: Enables the fine-grained access control (FGAC) feature in AWS Lake Formation tables.
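The parameter is passed like any other Glue job argument, for example in a job's DefaultArguments map. A minimal sketch; only the parameter name comes from the documentation above, everything else is illustrative:

```python
def enable_fgac(job_arguments: dict) -> dict:
    """Return a copy of a job's default arguments with Lake Formation
    fine-grained access control turned on (Glue 5.0 parameter)."""
    updated = dict(job_arguments)
    updated["--enable-lakeformation-fine-grained-access"] = "true"
    return updated

# The merged dict could be supplied as DefaultArguments to the
# CreateJob or UpdateJob API (not executed here).
args = enable_fgac({"--job-language": "python"})
```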

Also refer to the Spark migration documentation.

Migrating from AWS Glue 3.0 to AWS Glue 5.0

Note

For migration steps related to AWS Glue 4.0, see Migrating from AWS Glue 3.0 to AWS Glue 4.0.

All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 5.0, except machine learning transforms.

Migrating from AWS Glue 2.0 to AWS Glue 5.0

Note

For migration steps related to AWS Glue 4.0 and a list of migration differences between AWS Glue version 3.0 and 4.0, see Migrating from AWS Glue 3.0 to AWS Glue 4.0.

Also note the following migration differences between AWS Glue versions 3.0 and 2.0:

  • All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 5.0, except machine learning transforms.

  • Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced. For example, Spark 3.1.1 and later does not enable Scala untyped UDFs, but Spark 2.4 does allow them.

  • Python 2.7 is not supported.

  • Any extra jars supplied in existing AWS Glue 2.0 jobs may bring in conflicting dependencies since there were upgrades in several dependencies. You can avoid classpath conflicts with the --user-jars-first job parameter.

  • Changes to the behavior of loading/saving of timestamps from/to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1.

  • Different Spark task parallelism for driver/executor configuration. You can adjust task parallelism by passing the --executor-cores job argument.
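Both of the knobs mentioned above are ordinary Glue job parameters. A hedged sketch of run-level overrides for a migrated AWS Glue 2.0 job; the parameter names come from the bullets above, and the values are examples to tune:

```python
# Run-level overrides for a migrated AWS Glue 2.0 job.
migration_run_arguments = {
    "--user-jars-first": "true",  # prefer your extra jars over Glue's on the classpath
    "--executor-cores": "4",      # adjust Spark task parallelism per executor
}

# This dict could be passed as the Arguments map of a StartJobRun API
# call (for example via boto3's glue.start_job_run; not executed here).
```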

Connector and JDBC driver migration for AWS Glue 5.0

For the versions of JDBC and data lake connectors that were upgraded, see Appendix B: JDBC driver upgrades and Appendix C: Connector upgrades.

The following changes apply to the connector or driver versions identified in the appendices for Glue 5.0.

Amazon Redshift

Note the following changes:

  • Adds support for three-part table names to allow the connector to query Redshift data sharing tables.

  • Corrects mapping of Spark ShortType to use Redshift SMALLINT instead of INTEGER to better match expected data size.

  • Added support for Custom Cluster Names (CNAME) for Amazon Redshift Serverless.

Apache Hudi

Note the following changes:

  • Support record level index.

  • Support auto generation of record keys. Now you don’t have to specify the record key field.

Apache Iceberg

Note the following changes:

  • Support fine-grained access control with AWS Lake Formation.

  • Support branching and tagging which are named references to snapshots with their own independent lifecycles.

  • Added a changelog view procedure which generates a view that contains the changes made to a table over a specified period or between specific snapshots.

Delta Lake

Note the following changes:

  • Support Delta Universal Format (UniForm) which enables seamless access through Apache Iceberg and Apache Hudi.

  • Support for Deletion Vectors, which implement a merge-on-read paradigm.

AzureCosmos

Note the following changes:

  • Added hierarchical partition key support.

  • Added option to use custom Schema with StringType (raw json) for a nested property.

  • Added config option spark.cosmos.auth.aad.clientCertPemBase64 to allow using SPN (ServicePrincipal name) authentication with certificate instead of client secret.

For more information, see the Azure Cosmos DB Spark connector change log.

Microsoft SQL Server

Note the following changes:

  • TLS encryption is enabled by default.

  • When encrypt = false but the server requires encryption, the certificate is validated based on the trustServerCertificate connection setting.

  • The aadSecurePrincipalId and aadSecurePrincipalSecret connection properties are deprecated.

  • The getAADSecretPrincipalId API was removed.

  • Added support for the SQL_Variant datatype when retrieving DateTimeOffset.

  • Added CNAME resolution when realm is specified.
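Because the 10.x driver encrypts by default, jobs that previously connected to servers without a trusted certificate may start failing after migration. The following is an illustrative map of the two standard Microsoft JDBC connection properties involved; relaxing certificate validation trades security for compatibility, so fixing the server certificate is preferable:

```python
# JDBC connection properties reflecting the new 10.x defaults
# described above. These are standard Microsoft JDBC driver options;
# the values shown are illustrative.
sqlserver_tls_options = {
    "encrypt": "true",                 # the 10.x default
    "trustServerCertificate": "true",  # skips certificate validation (not recommended)
}
```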

MongoDB

Note the following changes:

  • Support for micro-batch mode with Spark Structured Streaming.

  • Support for BSON data types.

  • Added support for reading multiple collections when using micro-batch or continuous streaming modes.

    • If the name of a collection used in your collection configuration option contains a comma, the Spark Connector treats it as two different collections. To avoid this, you must escape the comma by preceding it with a backslash (\).

    • If the name of a collection used in your collection configuration option is "*", the Spark Connector interprets it as a specification to scan all collections. To avoid this, you must escape the asterisk by preceding it with a backslash (\).

    • If the name of a collection used in your collection configuration option contains a backslash (\), the Spark Connector treats the backslash as an escape character, which might change how it interprets the value. To avoid this, you must escape the backslash by preceding it with another backslash.
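The three escaping rules above can be captured in a small helper. A sketch, assuming the connector's `collection` option is being built from an arbitrary name (the function name is illustrative):

```python
def escape_collection_name(name: str) -> str:
    """Escape a MongoDB collection name for the Spark Connector's
    collection option, per the rules above: backslash first (it is the
    escape character), then comma, then a name that is exactly '*'."""
    escaped = name.replace("\\", "\\\\").replace(",", "\\,")
    if name == "*":
        # A bare '*' means "scan all collections"; escape it to target
        # a collection literally named '*'.
        escaped = "\\*"
    return escaped

# "a,b" would otherwise be read as two collections named "a" and "b".
single_collection = escape_collection_name("a,b")
```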

For more information, see the MongoDB connector for Spark release notes.

Snowflake

Note the following changes:

  • Introduced a new trim_space parameter that you can use to trim values of StringType columns automatically when saving to a Snowflake table. Default: false.

  • Disabled the abort_detached_query parameter at the session level by default.

  • Removed the requirement of the SFUSER parameter when using OAUTH.

  • Removed the Advanced Query Pushdown feature. Alternatives to the feature are available. For example, instead of loading data from Snowflake tables, users can directly load data from Snowflake SQL queries.
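With Advanced Query Pushdown removed, you can push work into Snowflake yourself by reading from a SQL query rather than a whole table. A hedged sketch using the connector's standard reader options (`sfURL`, `sfDatabase`, `sfSchema`, `query`); the values are placeholders:

```python
# Reader options using 'query' instead of 'dbtable', so filtering and
# projection run inside Snowflake. All values are placeholders.
snowflake_read_options = {
    "sfURL": "example.snowflakecomputing.com",  # placeholder account URL
    "sfDatabase": "DB",
    "sfSchema": "PUBLIC",
    "query": "SELECT id, total FROM orders WHERE total > 100",
}

# In a Glue 5.0 job with the Snowflake connector available:
# df = spark.read.format("snowflake").options(**snowflake_read_options).load()
```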

For more information, see the Snowflake Connector for Spark release notes.

Appendix A: Notable dependency upgrades

The following are dependency upgrades:

Dependency | AWS Glue 5.0 | AWS Glue 4.0 | AWS Glue 3.0 | AWS Glue 2.0 | AWS Glue 1.0
---------- | ------------ | ------------ | ------------ | ------------ | ------------
Java | 17 | 8 | 8 | 8 | 8
Spark | 3.5.2-amzn-1 | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3
Hadoop | 3.4.0-amzn-1 | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1
Scala | 2.12.18 | 2.12 | 2.12 | 2.11 | 2.11
Jackson | 2.15.2 | 2.12 | 2.12 | 2.11 | 2.11
Hive | 2.3.9-amzn-4 | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2
EMRFS | 2.66.0 | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0
Json4s | 3.7.0-M11 | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x
Arrow | 12.0.1 | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0
AWS Glue Data Catalog client | 4.2.0 | 3.7.0 | 3.0.0 | 1.10.0 | N/A
AWS SDK for Java | 2.28.8 | 1.12 | 1.12 | |
Python | 3.11 | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6
Boto | 1.34.131 | 1.26 | 1.18 | 1.12 | N/A
EMR DynamoDB connector | 5.6.0 | 4.16.0 | | |

Appendix B: JDBC driver upgrades

The following are JDBC driver upgrades:

Driver | AWS Glue 5.0 | AWS Glue 4.0 | AWS Glue 3.0 | Past AWS Glue versions
------ | ------------ | ------------ | ------------ | ----------------------
MySQL | 8.0.33 | 8.0.23 | 8.0.23 | 5.1
Microsoft SQL Server | 10.2.0 | 9.4.0 | 7.0.0 | 6.1.0
Oracle Database | 23.3.0.23.09 | 21.7 | 21.1 | 11.2
PostgreSQL | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.0
Amazon Redshift | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.16 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017
SAP HANA | 2.20.17 | 2.17.12 | |
Teradata | 20.00.00.33 | 20.00.00.06 | |

Appendix C: Connector upgrades

The following are connector upgrades:

Connector | AWS Glue 5.0 | AWS Glue 4.0 | AWS Glue 3.0
--------- | ------------ | ------------ | ------------
EMR DynamoDB connector | 5.6.0 | 4.16.0 |
Amazon Redshift | 6.3.0 | 6.1.3 |
OpenSearch | 1.2.0 | 1.0.1 |
MongoDB | 10.4.0 | 10.0.4 | 3.0.0
Snowflake | 3.0.0 | 2.12.0 |
Google BigQuery | 0.32.2 | 0.32.2 |
AzureCosmos | 4.33.0 | 4.22.0 |
AzureSQL | 1.3.0 | 1.3.0 |
Vertica | 3.3.5 | 3.3.5 |

Appendix D: Open table format upgrades

The following are open table format upgrades:

OTF | AWS Glue 5.0 | AWS Glue 4.0 | AWS Glue 3.0
--- | ------------ | ------------ | ------------
Hudi | 0.15.0 | 0.12.1 | 0.10.1
Delta Lake | 3.2.1 | 2.1.0 | 1.0.0
Iceberg | 1.6.1 | 1.0.0 | 0.13.1
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.