Migrating AWS Glue for Spark jobs to AWS Glue version 4.0
This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, and 3.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 4.0. It also describes the features in AWS Glue 4.0 and the advantages of using it.
To use this feature with your AWS Glue ETL jobs, choose 4.0 for the Glue version when creating your jobs.
Topics
- New features supported
- Actions to migrate to AWS Glue 4.0
- Migration checklist
- Migrating from AWS Glue 3.0 to AWS Glue 4.0
- Migrating from AWS Glue 2.0 to AWS Glue 4.0
- Migrating from AWS Glue 1.0 to AWS Glue 4.0
- Migrating from AWS Glue 0.9 to AWS Glue 4.0
- Connector and JDBC driver migration for AWS Glue 4.0
- Appendix A: Notable dependency upgrades
- Appendix B: JDBC driver upgrades
- Appendix C: Connector upgrades
New features supported
This section describes new features and advantages of AWS Glue version 4.0.
- It is based on Apache Spark 3.3.0, but includes optimizations in AWS Glue and Amazon EMR, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.
- Upgraded JDBC drivers for all AWS Glue native sources, including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB, as well as upgraded Spark libraries and dependencies brought in by Spark 3.3.0.
- Updated with a new Amazon Redshift connector and JDBC driver.
- Optimized Amazon S3 access with an upgraded EMR File System (EMRFS) and Amazon S3-optimized output committers enabled by default.
- Optimized Data Catalog access with partition indexes, pushdown predicates, partition listing, and an upgraded Hive metastore client.
- Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.
- Reduced startup latency, improving overall job completion times and interactivity.
- Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration (from a 10-minute minimum to a 1-minute minimum).
- Native support for open data lake frameworks with Apache Hudi, Delta Lake, and Apache Iceberg (a sketch of the job parameter that requests them follows this list).
- Native support for the Amazon S3-based Cloud Shuffle Storage Plugin (an Apache Spark plugin) to use Amazon S3 for shuffling and elastic storage capacity.
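For the data lake frameworks, the bundled libraries are requested through the --datalake-formats job parameter. The following is a minimal sketch of the default argument; the value shown is just one of the supported formats.

```python
# Minimal sketch: request the bundled Iceberg libraries on an AWS Glue 4.0 job.
# The dictionary would be passed as the job's default arguments.
default_arguments = {
    "--datalake-formats": "iceberg",  # also accepts "hudi" or "delta"
}
```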
Major enhancements from Spark 3.1.1 to Spark 3.3.0
Note the following enhancements:
- Row-level runtime filtering (SPARK-32268).
- ANSI enhancements (SPARK-38860).
- Error message improvements (SPARK-38781).
- Support complex types for Parquet vectorized reader (SPARK-34863).
- Hidden file metadata support for Spark SQL (SPARK-37273).
- Provide a profiler for Python/Pandas UDFs (SPARK-37443).
- Introduce Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches (SPARK-36533).
- More comprehensive Datasource V2 pushdown capabilities (SPARK-38788).
- Migrating from log4j 1 to log4j 2 (SPARK-37814).
Other notable changes
Note the following changes:
- Breaking changes:
  - Drop references to Python 3.6 support in docs and Python/docs (SPARK-36977).
  - Remove named tuple hack by replacing built-in pickle to cloudpickle (SPARK-32079).
  - Bump minimum pandas version to 1.0.5 (SPARK-37465).
Actions to migrate to AWS Glue 4.0
For existing jobs, change the Glue version from the previous version to Glue 4.0 in the job configuration.
- In AWS Glue Studio, choose Glue 4.0 - Supports Spark 3.3, Scala 2, Python 3 in Glue version.
- In the API, choose 4.0 in the GlueVersion parameter in the UpdateJob API operation (a scripted sketch follows these lists).
For new jobs, choose Glue 4.0 when you create a job.
- In the console, choose Spark 3.3, Python 3 (Glue Version 4.0) or Spark 3.3, Scala 2 (Glue Version 4.0) in Glue version.
- In AWS Glue Studio, choose Glue 4.0 - Supports Spark 3.3, Scala 2, Python 3 in Glue version.
- In the API, choose 4.0 in the GlueVersion parameter in the CreateJob API operation.
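The following boto3 sketch switches an existing job to Glue 4.0 through the UpdateJob operation. The job name is a placeholder, and because UpdateJob overwrites the entire job definition, the sketch starts from the definition that GetJob returns; depending on how the job was created, you might need to drop other read-only or conflicting fields.

```python
import boto3

glue = boto3.client("glue")
job_name = "my-etl-job"  # placeholder

# UpdateJob replaces the whole job definition, so start from the current one.
job = glue.get_job(JobName=job_name)["Job"]

# Remove fields that GetJob returns but UpdateJob does not accept.
for field in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"):
    job.pop(field, None)

job["GlueVersion"] = "4.0"
glue.update_job(JobName=job_name, JobUpdate=job)
```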
To view Spark event logs for AWS Glue 4.0 when coming from AWS Glue 2.0 or earlier, launch an upgraded Spark history server for AWS Glue 4.0 using AWS CloudFormation or Docker.
Migration checklist
Review this checklist for migration:
Note
For checklist items related to AWS Glue 3.0, see Migration checklist.
- Do your job's external Python libraries depend on Python 2.7/3.6?
  - Update the dependent libraries from Python 2.7/3.6 to Python 3.10, as Spark 3.3.0 completely removed Python 2.7 and 3.6 support.
Migrating from AWS Glue 3.0 to AWS Glue 4.0
Note the following changes when migrating:
- All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 4.0.
- AWS Glue 3.0 uses Amazon EMR-optimized Spark 3.1.1, and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0. Several Spark changes alone might require revision of your scripts to ensure that removed features are not being referenced.
- AWS Glue 4.0 also features an update to EMRFS and Hadoop. For the specific versions, see Appendix A: Notable dependency upgrades.
- The AWS SDK provided in ETL jobs is now upgraded from 1.11 to 1.12.
- All Python jobs will use Python version 3.10. Previously, Python 3.7 was used in AWS Glue 3.0. As a result, some Python modules provided out of the box by AWS Glue are upgraded.
- Log4j has been upgraded to Log4j2.
  - For information on the Log4j2 migration path, see the Log4j documentation.
  - You must rename any custom log4j.properties file to a log4j2.properties file with the appropriate Log4j2 properties (a minimal sketch follows this list).
- For migrating certain connectors, see Connector and JDBC driver migration for AWS Glue 4.0.
- The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs that use AWS Glue security configurations and jobs that depend on the AWS Encryption SDK dependency provided in the runtime are affected. See the instructions for AWS Glue job migration. You can safely upgrade an AWS Glue 2.0/3.0 job to an AWS Glue 4.0 job because AWS Glue 2.0/3.0 already contains the AWS Encryption SDK bridge version.
- Refer to the Spark migration documentation.
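As a reference point only, a minimal log4j2.properties sketch for console logging might look like the following; adjust appenders, pattern, and levels to match whatever your old log4j.properties configured.

```properties
# Minimal console-only Log4j2 configuration (illustrative; tune to your needs)
status = warn
appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = STDOUT
```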
Migrating from AWS Glue 2.0 to AWS Glue 4.0
Note the following changes when migrating:
Note
For migration steps related to AWS Glue 3.0, see Migrating from AWS Glue 3.0 to AWS Glue 4.0.
- All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 4.0.
- The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default since AWS Glue 3.0. However, you can still disable it by setting --enable-s3-parquet-optimized-committer to false.
- AWS Glue 2.0 uses open-source Spark 2.4 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  - Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced.
  - For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
- The AWS SDK provided in ETL jobs is now upgraded from 1.11 to 1.12.
- AWS Glue 4.0 also features an update to EMRFS, updated JDBC drivers, and additional optimizations on Spark itself provided by AWS Glue.
- Scala is updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
- Python 3.10 is the default version used for Python scripts, as AWS Glue 2.0 used only Python 3.7 and 2.7.
  - Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  - A new mechanism of installing additional Python modules is available since AWS Glue 2.0.
- Several dependency updates, highlighted in Appendix A: Notable dependency upgrades.
- Any extra JAR files supplied in existing AWS Glue 2.0 jobs might bring in conflicting dependencies because several dependencies were upgraded between 2.0 and 4.0. You can avoid classpath conflicts in AWS Glue 4.0 with the --user-jars-first AWS Glue job parameter.
- AWS Glue 4.0 uses Spark 3.3. Starting with Spark 3.1, there was a change in the behavior of loading and saving timestamps from and to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1. We recommend setting the following parameters when reading or writing Parquet data that contains timestamp columns. Setting these parameters can resolve the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both the AWS Glue Dynamic Frame and the Spark Data Frame. Use the CORRECTED option to read the datetime values as-is, and the LEGACY option to rebase the datetime values with regard to the calendar difference during reading. (A script-level sketch of these settings follows this list.)
  - Key: --conf
  - Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
- For migrating certain connectors, see Connector and JDBC driver migration for AWS Glue 4.0.
- The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs that use AWS Glue security configurations and jobs that depend on the AWS Encryption SDK dependency provided in the runtime are affected. See these instructions for AWS Glue job migration:
  - You can safely upgrade an AWS Glue 2.0 job to an AWS Glue 4.0 job because AWS Glue 2.0 already contains the AWS Encryption SDK bridge version.
- Refer to the Spark migration documentation.
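Because the rebase settings listed above are runtime SQL configurations, they can also be applied inside the job script as an alternative to the --conf job parameter. A minimal PySpark sketch follows, assuming CORRECTED is the mode you want.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # typically glueContext.spark_session in a Glue job

# CORRECTED reads/writes datetime values as-is; use LEGACY to rebase across the calendar change.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
```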
Migrating from AWS Glue 1.0 to AWS Glue 4.0
Note the following changes when migrating:
- AWS Glue 1.0 uses open-source Spark 2.4 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  - Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced.
  - For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
- All jobs in AWS Glue 4.0 will run with significantly improved startup times. Spark jobs will be billed in 1-second increments with a 10x lower minimum billing duration because startup latency will go from a maximum of 10 minutes to a maximum of 1 minute.
- Logging behavior has changed significantly in AWS Glue 4.0 because Spark 3.3.0 requires Log4j2 at a minimum.
- Several dependency updates, highlighted in the appendix.
- Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
- Python 3.10 is also the default version used for Python scripts, as AWS Glue 1.0 used Python 2.7 and 3.6.
  - Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  - A new mechanism of installing additional Python modules through pip is available since AWS Glue 2.0. For more information, see Installing additional Python modules with pip in AWS Glue 2.0+. (A sketch of the job argument follows this list.)
- AWS Glue 4.0 does not run on Apache YARN, so YARN settings do not apply.
- AWS Glue 4.0 does not have a Hadoop Distributed File System (HDFS).
- Any extra JAR files supplied in existing AWS Glue 1.0 jobs might bring in conflicting dependencies because several dependencies were upgraded between 1.0 and 4.0. To avoid this problem, AWS Glue 4.0 enables the --user-jars-first AWS Glue job parameter by default.
- AWS Glue 4.0 supports auto scaling. Therefore, the ExecutorAllocationManager metric will be available when auto scaling is enabled.
- In AWS Glue version 4.0 jobs, you specify the number of workers and the worker type, but do not specify a maxCapacity.
- AWS Glue 4.0 does not yet support machine learning transforms.
- For migrating certain connectors, see Connector and JDBC driver migration for AWS Glue 4.0.
- The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs that use AWS Glue security configurations and jobs that depend on the AWS Encryption SDK dependency provided in the runtime are affected. See these instructions for AWS Glue job migration:
  - You cannot migrate an AWS Glue 0.9/1.0 job directly to an AWS Glue 4.0 job. This is because when upgrading directly to version 2.x or later and enabling all new features immediately, the AWS Encryption SDK won't be able to decrypt the ciphertext encrypted under earlier versions of the AWS Encryption SDK.
  - To safely upgrade, we first recommend that you migrate to an AWS Glue 2.0/3.0 job that contains the AWS Encryption SDK bridge version. Run the job once to use the AWS Encryption SDK bridge version.
  - Upon completion, you can safely migrate the AWS Glue 2.0/3.0 job to AWS Glue 4.0.
- Refer to the Spark migration documentation.
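For reference, the extra modules are requested through the --additional-python-modules job argument. A minimal sketch follows; the module names and versions are placeholders, and the dictionary would be passed as the job's default arguments (for example, with the update pattern sketched earlier).

```python
# Placeholder modules: AWS Glue installs these with pip when the job starts.
default_arguments = {
    "--additional-python-modules": "pyarrow==7.0.0,scikit-learn==1.1.3",
}
```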
Migrating from AWS Glue 0.9 to AWS Glue 4.0
Note the following changes when migrating:
- AWS Glue 0.9 uses open-source Spark 2.2.1 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  - Several Spark changes alone might require revision of your scripts to ensure that removed features are not being referenced.
  - For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.2 does allow them.
- All jobs in AWS Glue 4.0 will run with significantly improved startup times. Spark jobs will be billed in 1-second increments with a 10x lower minimum billing duration because startup latency will go from a maximum of 10 minutes to a maximum of 1 minute.
- Logging behavior has changed significantly in AWS Glue 4.0 because Spark 3.3.0 requires Log4j2 at a minimum, as mentioned in the Spark core migration guide (https://spark.apache.org/docs/latest/core-migration-guide.html#upgrading-from-core-32-to-33).
- Several dependency updates, highlighted in the appendix.
- Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
- Python 3.10 is also the default version used for Python scripts, as AWS Glue 0.9 was only using Python 2.
  - Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  - A new mechanism of installing additional Python modules through pip is available.
- AWS Glue 4.0 does not run on Apache YARN, so YARN settings do not apply.
- AWS Glue 4.0 does not have a Hadoop Distributed File System (HDFS).
- Any extra JAR files supplied in existing AWS Glue 0.9 jobs might bring in conflicting dependencies because several dependencies were upgraded between 0.9 and 4.0. You can avoid classpath conflicts in AWS Glue 4.0 with the --user-jars-first AWS Glue job parameter.
- AWS Glue 4.0 supports auto scaling. Therefore, the ExecutorAllocationManager metric will be available when auto scaling is enabled.
- In AWS Glue version 4.0 jobs, you specify the number of workers and the worker type, but do not specify a maxCapacity.
- AWS Glue 4.0 does not yet support machine learning transforms.
- For migrating certain connectors, see Connector and JDBC driver migration for AWS Glue 4.0.
- The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs that use AWS Glue security configurations and jobs that depend on the AWS Encryption SDK dependency provided in the runtime are affected. See these instructions for AWS Glue job migration:
  - You cannot migrate an AWS Glue 0.9/1.0 job directly to an AWS Glue 4.0 job. This is because when upgrading directly to version 2.x or later and enabling all new features immediately, the AWS Encryption SDK won't be able to decrypt the ciphertext encrypted under earlier versions of the AWS Encryption SDK.
  - To safely upgrade, we first recommend that you migrate to an AWS Glue 2.0/3.0 job that contains the AWS Encryption SDK bridge version. Run the job once to use the AWS Encryption SDK bridge version.
  - Upon completion, you can safely migrate the AWS Glue 2.0/3.0 job to AWS Glue 4.0.
- Refer to the Spark migration documentation.
Connector and JDBC driver migration for AWS Glue 4.0
For the versions of JDBC and data lake connectors that were upgraded, see:
Hudi
- Spark SQL support improvements:
  - Through the Call Procedure command, there is added support for upgrade, downgrade, bootstrap, clean, and repair. Create/Drop/Show/Refresh Index syntax is possible in Spark SQL.
  - A performance gap has been closed between usage through a Spark DataSource as opposed to Spark SQL. DataSource writes in the past used to be faster than SQL.
  - All built-in key generators implement more performant Spark-specific API operations.
  - Replaced UDF transformation in the bulk insert operation with RDD transformations to cut down on costs of using SerDe.
  - Spark SQL with Hudi requires a primaryKey to be specified by tblproperties or options in the SQL statement. For update and delete operations, the preCombineField is required as well (a sketch of such a table definition follows this list).
- Any Hudi table created before version 0.10.0 without a primaryKey needs to be recreated with a primaryKey field since version 0.10.0.
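To illustrate the primaryKey and preCombineField requirement, here is a hedged Spark SQL sketch; the table and column names are made up, and the job is assumed to be configured for Hudi (for example via --datalake-formats hudi).

```python
from pyspark.sql import SparkSession

# In an AWS Glue job this session typically comes from glueContext.spark_session.
spark = SparkSession.builder.getOrCreate()

# Illustrative table: primaryKey is mandatory for Spark SQL with Hudi;
# preCombineField is additionally required for update and delete operations.
spark.sql("""
    CREATE TABLE IF NOT EXISTS hudi_orders (
        order_id   INT,
        status     STRING,
        updated_at BIGINT
    )
    USING hudi
    TBLPROPERTIES (
        primaryKey = 'order_id',
        preCombineField = 'updated_at'
    )
""")
```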
PostgreSQL
- Several vulnerabilities (CVEs) were addressed.
- Java 8 is natively supported.
- If the job uses arrays of arrays (with the exception of byte arrays), they can be treated as multidimensional arrays.
MongoDB
- The current MongoDB connector supports Spark version 3.1 or later and MongoDB version 4.0 or later.
- Due to the connector upgrade, a few property names changed. For example, the URI property name changed to connection.uri (a sketch of a read using the new option names follows this list). For more information on the current options, see the MongoDB Spark Connector blog.
- Using MongoDB 4.0 hosted by Amazon DocumentDB has some functional differences. For more information, see these topics:
  - The "partitioner" option is restricted to ShardedPartitioner, PaginateIntoPartitionsPartitioner, and SinglePartitionPartitioner. It cannot use the default SamplePartitioner and PaginateBySizePartitioner for Amazon DocumentDB because the stage operator does not support the MongoDB API. For more information, see Supported MongoDB APIs, Operations, and Data Types.
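As a hedged illustration of the renamed option, the following shows the upstream 10.x Spark connector read path; the connection string, database, and collection are placeholders, and AWS Glue DynamicFrame connection options may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # typically glueContext.spark_session in a Glue job

# The 10.x connector reads with format "mongodb" and uses "connection.uri" instead of "uri".
df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://example-host:27017")  # placeholder URI
    .option("database", "sales")        # placeholder database
    .option("collection", "orders")     # placeholder collection
    .load()
)
```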
Delta Lake
- Delta Lake now supports time travel in SQL to query older data easily. With this update, time travel is now available both in Spark SQL and through the DataFrame API (a sketch of a time travel query follows this list). Support has been added for the current version of TIMESTAMP in SQL.
- Spark 3.3 introduces Trigger.AvailableNow for running streaming queries as an equivalent to Trigger.Once for batch queries. This support is also available when using Delta tables as a streaming source.
- Support for SHOW COLUMNS to return the list of columns in a table.
- Support for DESCRIBE DETAIL in the Scala and Python DeltaTable API. It retrieves detailed information about a Delta table using either the DeltaTable API or Spark SQL.
- Support for returning operation metrics from SQL Delete, Merge, and Update commands. Previously these SQL commands returned an empty DataFrame; now they return a DataFrame with useful metrics about the operation performed.
- Optimize performance improvements:
  - Set the configuration option spark.databricks.delta.optimize.repartition.enabled=true to use repartition(1) instead of coalesce(1) in the Optimize command for better performance when compacting many small files.
  - Improved performance by using a queue-based approach to parallelize compaction jobs.
- Other notable changes:
  - Support for using variables in the VACUUM and OPTIMIZE SQL commands.
  - Improvements for CONVERT TO DELTA with catalog tables, including:
    - Autofill the partition schema from the catalog when it's not provided.
    - Use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
  - Support for Change Data Feed (CDF) batch reads on column mapping-enabled tables when DROP COLUMN and RENAME COLUMN have not been used. For more information, see the Delta Lake documentation.
  - Improved Update command performance by enabling schema pruning in the first pass.
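A hedged sketch of the SQL time travel support follows; the table name, version, and timestamp are placeholders, and the job is assumed to be configured for Delta Lake (for example via --datalake-formats delta).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # typically glueContext.spark_session in a Glue job

# Query an older snapshot of a Delta table by version or by timestamp (placeholder values).
old_by_version = spark.sql("SELECT * FROM delta_events VERSION AS OF 5")
old_by_timestamp = spark.sql("SELECT * FROM delta_events TIMESTAMP AS OF '2022-10-01 00:00:00'")
```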
Apache Iceberg
- Added several performance improvements for scan planning and Spark queries.
- Added a common REST catalog client that uses change-based commits to resolve commit conflicts on the service side.
- AS OF syntax for SQL time travel queries is supported.
- Added merge-on-read support for MERGE and UPDATE queries.
- Added support to rewrite partitions using Z-order.
- Added a spec and implementation for Puffin, a format for large stats and index blobs, like Theta sketches or Bloom filters.
- Added new interfaces for consuming data incrementally (both append and changelog scans).
- Added support for bulk operations and ranged reads to FileIO interfaces.
- Added more metadata tables to show delete files in the metadata tree.
- The drop table behavior changed. In Iceberg 0.13.1, running DROP TABLE removes the table from the catalog and deletes the table contents as well. In Iceberg 1.0.0, DROP TABLE only removes the table from the catalog. To delete the table contents, use DROP TABLE PURGE (a sketch follows this list).
- Parquet vectorized reads are enabled by default in Iceberg 1.0.0. If you want to disable vectorized reads, set read.parquet.vectorization.enabled to false.
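A hedged sketch of the changed DROP TABLE behavior and the vectorization table property follows; the catalog, database, and table names are placeholders, and the job is assumed to be configured with an Iceberg catalog (for example via --datalake-formats iceberg).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # typically glueContext.spark_session in a Glue job

# Iceberg 1.0.0: DROP TABLE only removes the catalog entry; add PURGE to delete the data too.
spark.sql("DROP TABLE IF EXISTS glue_catalog.db.events")             # catalog entry only
spark.sql("DROP TABLE IF EXISTS glue_catalog.db.old_events PURGE")   # also deletes table contents

# Optionally turn off Parquet vectorized reads (enabled by default in Iceberg 1.0.0).
spark.sql("""
    ALTER TABLE glue_catalog.db.other_events
    SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')
""")
```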
Oracle
Changes are minor.
MySQL
Changes are minor.
Amazon Redshift
AWS Glue 4.0 features a new Amazon Redshift connector with a new JDBC driver. For information about the enhancements and how to migrate from previous AWS Glue versions, see Redshift connections.
Appendix A: Notable dependency upgrades
The following are dependency upgrades:
Dependency | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 |
---|---|---|---|---|
Spark | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 |
Hadoop | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 |
Scala | 2.12 | 2.12 | 2.11 | 2.11 |
Jackson | 2.13.3 | 2.10.x | 2.7.x | 2.7.x |
Hive | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 |
EMRFS | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 |
Json4s | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x |
Arrow | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 |
AWS Glue Data Catalog client | 3.7.0 | 3.0.0 | 1.10.0 | N/A |
Python | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 |
Boto | 1.26 | 1.18 | 1.12 | N/A |
Appendix B: JDBC driver upgrades
The following are JDBC driver upgrades:
Driver | JDBC driver version in past AWS Glue versions | JDBC driver version in AWS Glue 3.0 | JDBC driver version in AWS Glue 4.0 |
---|---|---|---|
MySQL | 5.1 | 8.0.23 | 8.0.23 |
Microsoft SQL Server | 6.1.0 | 7.0.0 | 9.4.0 |
Oracle Databases | 11.2 | 21.1 | 21.7 |
PostgreSQL | 42.1.0 | 42.2.18 | 42.3.6 |
MongoDB | 2.0.0 | 4.0.0 | 4.7.2 |
Amazon Redshift | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc42-2.1.0.16 |
Appendix C: Connector upgrades
The following are connector upgrades:
Driver | Connector version in AWS Glue 3.0 | Connector version in AWS Glue 4.0 |
---|---|---|
MongoDB | 3.0.0 | 10.0.4 |
Hudi | 0.10.1 | 0.12.1 |
Delta Lake | 1.0.0 | 2.1.0 |
Iceberg | 0.13.1 | 1.0.0 |
DynamoDB | 1.11 | 1.12 |