This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, 3.0, and 4.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 5.0. It also describes the features in AWS Glue 5.0 and the advantages of using it.
To use this feature with your AWS Glue ETL jobs, choose 5.0 for the Glue version when creating your jobs.
New features
This section describes new features and advantages of AWS Glue version 5.0.
- Apache Spark update from 3.3.0 in AWS Glue 4.0 to 3.5.2 in AWS Glue 5.0. See Major enhancements from Spark 3.3.0 to Spark 3.5.2.
- Spark-native fine-grained access control (FGAC) using Lake Formation. For more information, see Using AWS Glue with AWS Lake Formation for fine-grained access control.
  Note the following considerations or limitations for Spark-native FGAC:
  - Currently, data writes are not supported.
  - Writing into Iceberg through GlueContext using Lake Formation requires use of IAM access control instead.
  - For a complete list of limitations and considerations when using Spark-native FGAC, see Using AWS Glue with AWS Lake Formation for fine-grained access control.
- Support for Amazon S3 Access Grants as a scalable access control solution for your Amazon S3 data from AWS Glue. For more information, see Using Amazon S3 Access Grants with AWS Glue.
- Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1.
- Amazon SageMaker Unified Studio support.
- Amazon SageMaker Lakehouse and data abstraction integration. For more information, see Querying metastore data catalogs from AWS Glue ETL.
- Support to install additional Python libraries using requirements.txt. For more information, see Installing additional Python libraries in AWS Glue 5.0 using requirements.txt.
- AWS Glue 5.0 supports data lineage in Amazon DataZone. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone. For more information, see Data lineage in Amazon DataZone.
  To configure this on the AWS Glue console, turn on Generate lineage events, and enter your Amazon DataZone domain ID on the Job details tab.
  Alternatively, you can provide the following job parameter (provide your DataZone domain ID):
  Key: --conf
  Value: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<your-domain-ID>
- Connector and JDBC driver updates. For more information, see Appendix B: JDBC driver upgrades and Appendix C: Connector upgrades.
- Java update from 8 to 17.
- Increased storage for AWS Glue G.1X and G.2X workers, with disk space increasing to 94 GB and 138 GB respectively. For more information, see Jobs.
- Support for AWS SDK for Java, version 2. AWS Glue 5.0 jobs can use the AWS SDK for Java versions 1.12.569 or 2.28.8 if the job supports v2. The AWS SDK for Java 2.x is a major rewrite of the version 1.x code base. It's built on top of Java 8+ and adds several frequently requested features, including support for non-blocking I/O and the ability to plug in a different HTTP implementation at runtime. For more information, including a migration guide from SDK for Java v1 to v2, see the AWS SDK for Java, version 2 guide.
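The lineage job parameter described above can also be assembled programmatically, for example when creating jobs with a script. Below is a minimal sketch; the helper name is ours, and it builds the same OpenLineage listener configuration shown in the console instructions:

```python
def datazone_lineage_args(domain_id: str) -> dict:
    """Build the --conf job parameter that enables OpenLineage lineage
    collection and sends events to an Amazon DataZone domain."""
    value = (
        "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
        " --conf spark.openlineage.transport.type=amazon_datazone_api"
        f" --conf spark.openlineage.transport.domainId={domain_id}"
    )
    return {"--conf": value}

# Example: merge the result into a job's DefaultArguments before CreateJob
lineage_args = datazone_lineage_args("dzd_example123")
```

Building the string in one place avoids the easy mistake of typing a single dash (or an em-dash pasted from a document) in front of `conf`, which Spark would not recognize.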
Breaking changes
Note the following breaking changes:
- GlueContext-based table-level access control with AWS Lake Formation permissions, supported in Glue 4.0 and earlier, is not supported in Glue 5.0. Use the new Spark-native fine-grained access control (FGAC) in Glue 5.0. Note the following details:
  - If you need fine-grained access control (FGAC) for row-, column-, or cell-level access, you need to migrate from GlueContext/GlueDynamicFrame in Glue 4.0 and earlier to Spark DataFrame in Glue 5.0.
  - If you need database- or table-level access control, you can grant database/table permissions to your roles. This bypasses the need to migrate from GlueContext to Spark DataFrames.
  - If you don't need FGAC, then no migration to Spark DataFrame is necessary, and GlueContext features such as job bookmarks and pushdown predicates will continue to work.
- Vectorized SIMD CSV reader is not supported.
- Continuous logging to the output log group is not supported. Use the error log group instead.
- The AWS Glue job run insights job-insights-rule-driver has been deprecated. The job-insights-rca-driver log stream is now located in the error log group.
- Athena-based custom/marketplace connectors are not supported.
- Adobe Marketo Engage, Facebook Ads, Google Ads, Google Analytics 4, Google Sheets, Hubspot, Instagram Ads, Intercom, Jira Cloud, Oracle NetSuite, Salesforce, Salesforce Marketing Cloud, Salesforce Marketing Cloud Account Engagement, SAP OData, ServiceNow, Slack, Snapchat Ads, Stripe, Zendesk, and Zoho CRM connectors are not supported.
- Custom log4j properties are not supported in Glue 5.0.
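The GlueContext-to-DataFrame migration for FGAC can be sketched as follows. This is an illustration, not a complete job script: the function names are ours, and the database/table names are placeholders.

```python
def read_glue4_style(glue_context, database, table_name):
    """Glue 4.0 and earlier: read through GlueContext as a DynamicFrame.
    Table-level Lake Formation permissions applied on this path are no
    longer supported in Glue 5.0."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )

def read_glue5_style(spark, database, table_name):
    """Glue 5.0: read a native Spark DataFrame, which is eligible for
    Spark-native fine-grained access control through Lake Formation."""
    return spark.table(f"{database}.{table_name}")
```

If you only need database- or table-level permissions, granting them to the job role lets you keep the GlueContext path and its features such as job bookmarks.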
Major enhancements from Spark 3.3.0 to Spark 3.5.2
Note the following enhancements:
- Python client for Spark Connect (SPARK-39375).
- Implement support for DEFAULT values for columns in tables (SPARK-38334).
- Support "Lateral Column Alias References" (SPARK-27561).
- Harden SQLSTATE usage for error classes (SPARK-41994).
- Enable Bloom filter joins by default (SPARK-38841).
- Better Spark UI scalability and driver stability for large applications (SPARK-41053).
- Async progress tracking in Structured Streaming (SPARK-39591).
- Python arbitrary stateful processing in Structured Streaming (SPARK-40434).
- Pandas API coverage improvements (SPARK-42882) and NumPy input support in PySpark (SPARK-39405).
- Provide a memory profiler for PySpark user-defined functions (SPARK-40281).
- Implement PyTorch distributor (SPARK-41589).
- Publish SBOM artifacts (SPARK-41893).
- Support IPv6-only environment (SPARK-39457).
- Customized K8s scheduler (Apache YuniKorn and Volcano) GA (SPARK-42802).
- Scala and Go client support in Spark Connect (SPARK-42554 and SPARK-43351).
- PyTorch-based distributed ML support for Spark Connect (SPARK-42471).
- Structured Streaming support for Spark Connect in Python and Scala (SPARK-42938).
- Pandas API support for the Python Spark Connect client (SPARK-42497).
- Introduce Arrow Python UDFs (SPARK-40307).
- Support Python user-defined table functions (SPARK-43798).
- Migrate PySpark errors onto error classes (SPARK-42986).
- PySpark test framework (SPARK-44042).
- Add support for Datasketches HllSketch (SPARK-16484).
- Built-in SQL function improvement (SPARK-41231).
- IDENTIFIER clause (SPARK-43205).
- Add SQL functions into Scala, Python and R API (SPARK-43907).
- Add named argument support for SQL functions (SPARK-43922).
- Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated (SPARK-41469).
- Distributed ML <> Spark Connect (SPARK-42471).
- DeepSpeed distributor (SPARK-44264).
- Implement changelog checkpointing for RocksDB state store (SPARK-43421).
- Introduce watermark propagation among operators (SPARK-42376).
- Introduce dropDuplicatesWithinWatermark (SPARK-42931).
- RocksDB state store provider memory management enhancements (SPARK-43311).
Actions to migrate to AWS Glue 5.0
For existing jobs, change the Glue version from the previous version to Glue 5.0 in the job configuration.
- In AWS Glue Studio, choose Glue 5.0 - Supports Spark 3.5.2, Scala 2, Python 3 in Glue version.
- In the API, choose 5.0 in the GlueVersion parameter in the UpdateJob API operation.
For new jobs, choose Glue 5.0 when you create a job.
- In the console, choose Spark 3.5.2, Python 3 (Glue Version 5.0) or Spark 3.5.2, Scala 2 (Glue Version 5.0) in Glue version.
- In AWS Glue Studio, choose Glue 5.0 - Supports Spark 3.5.2, Scala 2, Python 3 in Glue version.
- In the API, choose 5.0 in the GlueVersion parameter in the CreateJob API operation.
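Existing jobs can be switched over with the AWS SDK. The sketch below uses boto3 (the job name is a placeholder); because UpdateJob replaces the whole job definition, it copies the existing definition and changes only GlueVersion:

```python
def build_job_update(job: dict, glue_version: str = "5.0") -> dict:
    """Turn a GetJob result into an UpdateJob payload with a new GlueVersion.
    UpdateJob replaces the whole definition, so existing settings are carried
    over; fields returned by GetJob that UpdateJob does not accept are dropped."""
    read_only = {"Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"}
    update = {k: v for k, v in job.items() if k not in read_only}
    update["GlueVersion"] = glue_version
    return update

def upgrade_job(job_name: str) -> None:
    """Fetch the job definition and resubmit it on Glue 5.0."""
    import boto3  # assumes AWS credentials and region are configured
    glue = boto3.client("glue")
    job = glue.get_job(JobName=job_name)["Job"]
    glue.update_job(JobName=job_name, JobUpdate=build_job_update(job))
```

Test the change on a copy of the job first; any capacity-related fields your job uses (for example WorkerType and NumberOfWorkers versus MaxCapacity) must remain mutually consistent in the payload.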
If you are coming from AWS Glue 2.0 or earlier, to view Spark event logs of AWS Glue 5.0, launch an upgraded Spark history server for AWS Glue 5.0 using AWS CloudFormation or Docker.
Migration checklist
Review this checklist for migration:
- Java 17 updates
- [Scala] Upgrade AWS SDK calls from v1 to v2
- Python 3.10 to 3.11 migration
- [Python] Update boto references from 1.26 to 1.34
AWS Glue 5.0 features
This section describes AWS Glue features in more detail.
Querying metastore data catalogs from AWS Glue ETL
You can register your AWS Glue job to access the AWS Glue Data Catalog, which makes tables and other metastore resources available to disparate consumers. The Data Catalog supports a multi-catalog hierarchy, which unifies all your data across Amazon S3 data lakes. It also provides both a Hive metastore API and an open-source Apache Iceberg API for accessing the data. These features are available to AWS Glue and other data-oriented services like Amazon EMR, Amazon Athena and Amazon Redshift.
When you create resources in the Data Catalog, you can access them from any SQL engine that supports the Apache Iceberg REST API. AWS Lake Formation manages permissions. After configuration, you can use AWS Glue's capabilities to query disparate data sources by querying these metastore resources from familiar applications, including Apache Spark and Trino.
How metadata resources are organized
Using the AWS Glue Data Catalog, data is organized in a logical hierarchy of catalogs, databases, and tables:
- Catalog – A logical container that holds objects from a data store, such as schemas or tables.
- Database – Organizes data objects such as tables and views in a catalog.
- Tables and views – Data objects in a database that provide an abstraction layer with an understandable schema. They make it easy to access underlying data, which could be in various formats and in various locations.
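As an illustration of the Iceberg REST path, a Spark session might be pointed at the Data Catalog with properties along these lines. The catalog name `rest`, the region, and the account ID are placeholders; verify the endpoint and the required authentication settings against the current AWS Glue documentation before use:

```
spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type=rest
spark.sql.catalog.rest.uri=https://glue.us-east-1.amazonaws.com/iceberg
spark.sql.catalog.rest.warehouse=<account-id>
```

Once a catalog is configured, tables are addressed with the usual three-part names, for example `SELECT * FROM rest.sales_db.orders`.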
Migrating from AWS Glue 4.0 to AWS Glue 5.0
All existing job parameters and major features that exist in AWS Glue 4.0 will exist in AWS Glue 5.0, except machine learning transforms.
The following new parameters were added:
- --enable-lakeformation-fine-grained-access: Enables the fine-grained access control (FGAC) feature in AWS Lake Formation tables.
Refer to the Spark migration documentation.
Migrating from AWS Glue 3.0 to AWS Glue 5.0
Note
For migration steps related to AWS Glue 4.0, see Migrating from AWS Glue 3.0 to AWS Glue 4.0.
All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 5.0, except machine learning transforms.
Migrating from AWS Glue 2.0 to AWS Glue 5.0
Note
For migration steps related to AWS Glue 4.0 and a list of migration differences between AWS Glue version 3.0 and 4.0, see Migrating from AWS Glue 3.0 to AWS Glue 4.0.
Also note the following migration differences between AWS Glue versions 3.0 and 2.0:
All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 5.0, except machine learning transforms.
- Several Spark changes alone may require revision of your scripts to ensure removed features are not being referenced. For example, Spark 3.1.1 and later does not enable Scala untyped UDFs, but Spark 2.4 does allow them.
- Python 2.7 is not supported.
- Any extra jars supplied in existing AWS Glue 2.0 jobs may bring in conflicting dependencies, because several dependencies were upgraded. You can avoid classpath conflicts with the --user-jars-first job parameter.
- Changes to the behavior of loading/saving timestamps from/to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1.
- Different Spark task parallelism for driver/executor configuration. You can adjust task parallelism by passing the --executor-cores job argument.
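Taken together, a job migrated from AWS Glue 2.0 might set the arguments above like this. This is a sketch; the values are illustrative, not recommendations:

```python
# Illustrative DefaultArguments for a job migrated from AWS Glue 2.0.
migration_args = {
    # Prefer the jars you supply over Glue 5.0's upgraded dependencies
    # to avoid classpath conflicts.
    "--user-jars-first": "true",
    # Tune Spark task parallelism per executor if the new defaults
    # change your job's behavior.
    "--executor-cores": "4",
}
```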
Connector and JDBC driver migration for AWS Glue 5.0
For the versions of JDBC and data lake connectors that were upgraded, see Appendix B: JDBC driver upgrades and Appendix C: Connector upgrades.
The following changes apply to the connector or driver versions identified in the appendices for Glue 5.0.
Amazon Redshift
Note the following changes:
- Adds support for three-part table names to allow the connector to query Redshift data sharing tables.
- Corrects the mapping of Spark ShortType to use Redshift SMALLINT instead of INTEGER to better match the expected data size.
- Added support for custom cluster names (CNAME) for Amazon Redshift Serverless.
Apache Hudi
Note the following changes:
- Support record-level index.
- Support auto-generation of record keys. Now you don't have to specify the record key field.
Apache Iceberg
Note the following changes:
- Support fine-grained access control with AWS Lake Formation.
- Support branching and tagging, which are named references to snapshots with their own independent lifecycles.
- Added a changelog view procedure, which generates a view that contains the changes made to a table over a specified period or between specific snapshots.
Delta Lake
Note the following changes:
- Support Delta Universal Format (UniForm), which enables seamless access through Apache Iceberg and Apache Hudi.
- Support deletion vectors, which implement a merge-on-read paradigm.
AzureCosmos
Note the following changes:
- Added hierarchical partition key support.
- Added an option to use a custom schema with StringType (raw JSON) for a nested property.
- Added the config option spark.cosmos.auth.aad.clientCertPemBase64 to allow using service principal (SPN) authentication with a certificate instead of a client secret.
For more information, see the Azure Cosmos DB Spark connector change log.
Microsoft SQL Server
Note the following changes:
- TLS encryption is enabled by default.
- When encrypt = false but the server requires encryption, the certificate is validated based on the trustServerCertificate connection setting.
- aadSecurePrincipalId and aadSecurePrincipalSecret are deprecated.
- The getAADSecretPrincipalId API was removed.
- Added support for the SQL_Variant data type when retrieving DateTimeOffset.
- Added CNAME resolution when a realm is specified.
MongoDB
Note the following changes:
- Support for micro-batch mode with Spark Structured Streaming.
- Support for BSON data types.
- Added support for reading multiple collections when using micro-batch or continuous streaming modes.
- If the name of a collection used in your collection configuration option contains a comma, the Spark Connector treats it as two different collections. To avoid this, you must escape the comma by preceding it with a backslash (\).
- If the name of a collection used in your collection configuration option is "*", the Spark Connector interprets it as a specification to scan all collections. To avoid this, you must escape the asterisk by preceding it with a backslash (\).
- If the name of a collection used in your collection configuration option contains a backslash (\), the Spark Connector treats the backslash as an escape character, which might change how it interprets the value. To avoid this, you must escape the backslash by preceding it with another backslash.
For more information, see the MongoDB connector for Spark release notes.
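The escaping rules above can be captured in a small helper. This is our own sketch of the rules as stated, not part of the connector's API:

```python
def escape_collection_name(name: str) -> str:
    """Escape a collection name for the MongoDB Spark Connector's
    `collection` option: backslashes are doubled first, commas are escaped
    so the name is not split into two collections, and a bare "*" is
    escaped so it is not treated as "scan all collections"."""
    escaped = name.replace("\\", "\\\\").replace(",", "\\,")
    if escaped == "*":
        escaped = "\\*"
    return escaped

# Example: a collection literally named "a,b" becomes backslash-comma: a\,b
option_value = escape_collection_name("a,b")
```

Doing the backslash replacement first matters; otherwise the backslashes introduced for commas would be doubled again.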
Snowflake
Note the following changes:
- Introduced a new trim_space parameter that you can use to trim values of StringType columns automatically when saving to a Snowflake table. Default: false.
- Disabled the abort_detached_query parameter at the session level by default.
- Removed the requirement of the SFUSER parameter when using OAUTH.
- Removed the Advanced Query Pushdown feature. Alternatives to the feature are available. For example, instead of loading data from Snowflake tables, users can directly load data from Snowflake SQL queries.
For more information, see the Snowflake Connector for Spark release notes.
Appendix A: Notable dependency upgrades
The following are dependency upgrades:
Dependency | Version in AWS Glue 5.0 | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 |
---|---|---|---|---|---|
Java | 17 | 8 | 8 | 8 | 8 |
Spark | 3.5.2-amzn-1 | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 |
Hadoop | 3.4.0-amzn-1 | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 |
Scala | 2.12.18 | 2.12 | 2.12 | 2.11 | 2.11 |
Jackson | 2.15.2 | 2.12 | 2.12 | 2.11 | 2.11 |
Hive | 2.3.9-amzn-4 | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 |
EMRFS | 2.66.0 | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 |
Json4s | 3.7.0-M11 | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x |
Arrow | 12.0.1 | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 |
AWS Glue Data Catalog client | 4.2.0 | 3.7.0 | 3.0.0 | 1.10.0 | N/A |
AWS SDK for Java | 2.28.8 | 1.12 | 1.12 | | |
Python | 3.11 | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 |
Boto | 1.34.131 | 1.26 | 1.18 | 1.12 | N/A |
EMR DynamoDB connector | 5.6.0 | 4.16.0 |
Appendix B: JDBC driver upgrades
The following are JDBC driver upgrades:
Driver | JDBC driver version in AWS Glue 5.0 | JDBC driver version in AWS Glue 4.0 | JDBC driver version in AWS Glue 3.0 | JDBC driver version in past AWS Glue versions |
---|---|---|---|---|
MySQL | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 |
Microsoft SQL Server | 10.2.0 | 9.4.0 | 7.0.0 | 6.1.0 |
Oracle Databases | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 |
PostgreSQL | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.0 |
Amazon Redshift | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.16 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017 |
SAP HANA | 2.20.17 | 2.17.12 | | |
Teradata | 20.00.00.33 | 20.00.00.06 | | |
Appendix C: Connector upgrades
The following are connector upgrades:
Driver | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 |
---|---|---|---|
EMR DynamoDB connector | 5.6.0 | 4.16.0 | |
Amazon Redshift | 6.3.0 | 6.1.3 | |
OpenSearch | 1.2.0 | 1.0.1 | |
MongoDB | 10.4.0 | 10.0.4 | 3.0.0 |
Snowflake | 3.0.0 | 2.12.0 | |
Google BigQuery | 0.32.2 | 0.32.2 | |
AzureCosmos | 4.33.0 | 4.22.0 | |
AzureSQL | 1.3.0 | 1.3.0 | |
Vertica | 3.3.5 | 3.3.5 |
Appendix D: Open table format upgrades
The following are open table format upgrades:
OTF | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 |
---|---|---|---|
Hudi | 0.15.0 | 0.12.1 | 0.10.1 |
Delta Lake | 3.2.1 | 2.1.0 | 1.0.0 |
Iceberg | 1.6.1 | 1.0.0 | 0.13.1 |