Migrating AWS Glue for Spark jobs to AWS Glue version 3.0
This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, and 3.0 so that you can migrate your Spark applications and ETL jobs to AWS Glue 3.0.
To use this feature with your AWS Glue ETL jobs, choose 3.0 for the Glue version when creating your jobs.
New features supported
This section describes new features and advantages of AWS Glue version 3.0.
It is based on Apache Spark 3.1.1, which includes optimizations from open-source Spark as well as optimizations developed by the AWS Glue and Amazon EMR services, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.
Upgraded JDBC drivers for all AWS Glue native sources, including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB, as well as upgraded Spark libraries and dependencies brought in by Spark 3.1.1.
Optimized Amazon S3 access with upgraded EMRFS and enabled Amazon S3 optimized output committers by default.
Optimized Data Catalog access with partition indexes, push down predicates, partition listing, and upgraded Hive metastore client.
Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.
Improved Spark UI experience with Spark 3.1.1 with new Spark executor memory metrics and Spark structured streaming metrics.
Reduced startup latency improving overall job completion times and interactivity, similar to AWS Glue 2.0.
Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration—from a 10-minute minimum to a 1-minute minimum, similar to AWS Glue 2.0.
Actions to migrate to AWS Glue 3.0
For existing jobs, change the Glue version from the previous version to Glue 3.0 in the job configuration.
In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.
In AWS Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.
In the API, choose 3.0 in the GlueVersion parameter in the UpdateJob API.
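If you update jobs through the API, note that UpdateJob replaces the whole job definition: unspecified configuration is removed or reset to default values. A minimal sketch of the precaution this implies, assuming you rebuild the JobUpdate payload from a GetJob result (the helper name and dropped-field list here are illustrative, not part of the Glue API):

```python
# Sketch: bump an existing job definition to Glue 3.0.
# GetJob returns fields (e.g. Name, CreatedOn) that UpdateJob's JobUpdate
# does not accept; the capacity fields are dropped here because Glue 3.0
# jobs use WorkerType/NumberOfWorkers rather than MaxCapacity.
_DROP_FIELDS = {"Name", "CreatedOn", "LastModifiedOn",
                "AllocatedCapacity", "MaxCapacity"}

def build_job_update(job_definition, glue_version="3.0"):
    """Copy an existing job definition, changing only the GlueVersion."""
    update = {k: v for k, v in job_definition.items() if k not in _DROP_FIELDS}
    update["GlueVersion"] = glue_version
    return update

# Usage with boto3 (requires AWS credentials):
#   glue = boto3.client("glue")
#   job = glue.get_job(JobName="my-job")["Job"]
#   glue.update_job(JobName="my-job", JobUpdate=build_job_update(job))
```

Fetching the current definition first preserves the job's role, command, and arguments that a bare UpdateJob call would otherwise reset.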
For new jobs, choose Glue 3.0 when you create a job.
In the console, choose Spark 3.1, Python 3 (Glue Version 3.0) or Spark 3.1, Scala 2 (Glue Version 3.0) in Glue version.
In AWS Glue Studio, choose Glue 3.0 - Supports spark 3.1, Scala 2, Python 3 in Glue version.
In the API, choose 3.0 in the GlueVersion parameter in the CreateJob API.
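As a sketch of a CreateJob call for a new Glue 3.0 job, the following builds a minimal argument set (the helper function is illustrative; the parameter names come from the CreateJob API). Note that Glue 3.0 jobs take WorkerType and NumberOfWorkers rather than MaxCapacity:

```python
def build_create_job_args(name, role, script_location,
                          worker_type="G.1X", number_of_workers=10):
    """Minimal CreateJob arguments for a Glue 3.0 Spark job (sketch)."""
    return {
        "Name": name,
        "Role": role,
        "GlueVersion": "3.0",  # selects Spark 3.1.1 / Python 3.7 / Scala 2.12
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }

# Usage with boto3 (requires AWS credentials):
#   glue = boto3.client("glue")
#   glue.create_job(**build_create_job_args(
#       "my-job", "arn:aws:iam::123456789012:role/GlueRole",
#       "s3://my-bucket/scripts/job.py"))
```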
To view Spark event logs of AWS Glue 3.0, launch an upgraded Spark history server for Glue 3.0 using CloudFormation or Docker.
Migration checklist
Review this checklist before migration.
Does your job depend on HDFS? If yes, try replacing HDFS with Amazon S3.
Search the job script code for file system paths starting with hdfs:// or with / as a DFS path.
Check that your default file system is not configured with HDFS. If it is configured explicitly, you need to remove the fs.defaultFS configuration.
Check whether your job contains any dfs.* parameters. If it does, verify that it is okay to disable them.
Does your job depend on YARN? If yes, verify the impact by checking whether your job contains any of the following parameters. If it does, verify that it is okay to disable them.
spark.yarn.*
For example:
spark.yarn.executor.memoryOverhead
spark.yarn.driver.memoryOverhead
spark.yarn.scheduler.reporterThread.maxFailures
yarn.*
For example:
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
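The HDFS and YARN checks above can be automated with a simple scan of the job script. A sketch, assuming your scripts are available as text (the pattern list and function name are illustrative):

```python
import re

# Legacy settings from the checklist: HDFS paths and defaults, plus
# YARN-era parameters that do not apply on AWS Glue 3.0.
LEGACY_PATTERNS = [
    r"hdfs://",               # HDFS file system paths
    r"fs\.defaultFS",         # explicitly configured default file system
    r"\bdfs\.[\w.-]+",        # dfs.* parameters
    r"spark\.yarn\.[\w.-]+",  # spark.yarn.* parameters
    r"\byarn\.[\w.-]+",       # yarn.* parameters
]

def find_legacy_settings(script_text):
    """Return the legacy HDFS/YARN references found in a job script."""
    hits = []
    for pattern in LEGACY_PATTERNS:
        hits.extend(re.findall(pattern, script_text))
    return sorted(set(hits))
```

Any hit is a candidate for removal or for replacement with an Amazon S3 path before moving the job to Glue 3.0.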
Does your job depend on Spark 2.2.1 or Spark 2.4.3? If yes, verify the impact by checking whether your job uses features changed in Spark 3.1.1.
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23
For example, the percentile_approx function, or SparkSession with SparkSession.builder.getOrCreate() when there is an existing SparkContext.
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24
For example, the array_contains function, or the CURRENT_DATE and CURRENT_TIMESTAMP functions with spark.sql.caseSensitive=true.
Do your job's extra jars conflict in Glue 3.0?
From AWS Glue 0.9/1.0: extra jars supplied in existing AWS Glue 0.9/1.0 jobs may bring in classpath conflicts due to upgraded or new dependencies available in Glue 3.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter or by shading your dependencies.
From AWS Glue 2.0: you can still avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter or by shading your dependencies.
Do your jobs depend on Scala 2.11?
AWS Glue 3.0 uses Scala 2.12, so you need to rebuild your libraries with Scala 2.12 if they depend on Scala 2.11.
Do your job's external Python libraries depend on Python 2.7/3.6?
Use the --additional-python-modules parameter instead of setting egg/wheel/zip files in the Python library path.
Update the dependent libraries from Python 2.7/3.6 to Python 3.7, as Spark 3.1.1 removed Python 2.7 support.
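For example, a job that previously shipped an egg or wheel file can instead declare its requirements through --additional-python-modules in the job's default arguments. A small illustrative helper (the function name is ours; the argument key is the real Glue job parameter):

```python
def python_modules_argument(modules):
    """Build the --additional-python-modules job argument from a list of
    pip-style requirement strings, e.g. ["pyarrow==2", "awswrangler"]."""
    # Glue passes this comma-separated list to pip at job start.
    return {"--additional-python-modules": ",".join(modules)}

# The resulting dict would be merged into DefaultArguments when creating
# or updating the job, e.g.:
#   args = python_modules_argument(["pyarrow==2", "awswrangler"])
#   glue.create_job(..., DefaultArguments=args)
```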
Migrating from AWS Glue 0.9 to AWS Glue 3.0
Note the following changes when migrating:
AWS Glue 0.9 uses open-source Spark 2.2.1 and AWS Glue 3.0 uses EMR-optimized Spark 3.1.1.
Several Spark changes alone may require revising your scripts to ensure that removed features are not referenced.
For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.2 does allow them.
All jobs in AWS Glue 3.0 execute with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because the minimum billing duration goes from 10 minutes to 1 minute.
Logging behavior has changed since AWS Glue 2.0.
Several dependency updates, highlighted in Appendix A: notable dependency upgrades.
Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
Python 3.7 is also the default version used for Python scripts, as AWS Glue 0.9 used only Python 2.
Python 2.7 is not supported with Spark 3.1.1.
A new mechanism of installing additional Python modules is available.
AWS Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.
AWS Glue 3.0 does not have a Hadoop Distributed File System (HDFS).
Any extra jars supplied in existing AWS Glue 0.9 jobs may bring in conflicting dependencies, since several dependencies were upgraded from 0.9 to 3.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter.
AWS Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.
In AWS Glue version 3.0 jobs, you specify the number of workers and the worker type, but do not specify a maxCapacity.
AWS Glue 3.0 does not yet support machine learning transforms.
AWS Glue 3.0 does not yet support development endpoints.
Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html
Migrating from AWS Glue 1.0 to AWS Glue 3.0
Note the following changes when migrating:
AWS Glue 1.0 uses open-source Spark 2.4 and AWS Glue 3.0 uses EMR-optimized Spark 3.1.1.
Several Spark changes alone may require revising your scripts to ensure that removed features are not referenced.
For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.4 does allow them.
All jobs in AWS Glue 3.0 execute with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because the minimum billing duration goes from 10 minutes to 1 minute.
Logging behavior has changed since AWS Glue 2.0.
Several dependency updates, highlighted in Appendix A: notable dependency upgrades.
Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
Python 3.7 is also the default version used for Python scripts.
Python 2.7 is not supported with Spark 3.1.1.
A new mechanism of installing additional Python modules is available.
AWS Glue 3.0 does not run on Apache YARN, so YARN settings do not apply.
AWS Glue 3.0 does not have a Hadoop Distributed File System (HDFS).
Any extra jars supplied in existing AWS Glue 1.0 jobs may bring in conflicting dependencies, since several dependencies were upgraded from 1.0 to 3.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter.
AWS Glue 3.0 does not yet support dynamic allocation, hence the ExecutorAllocationManager metrics are not available.
In AWS Glue version 3.0 jobs, you specify the number of workers and the worker type, but do not specify a maxCapacity.
AWS Glue 3.0 does not yet support machine learning transforms.
AWS Glue 3.0 does not yet support development endpoints.
Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html
Migrating from AWS Glue 2.0 to AWS Glue 3.0
Note the following changes when migrating:
All existing job parameters and major features that exist in AWS Glue 2.0 also exist in AWS Glue 3.0.
The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default in AWS Glue 3.0. However, you can still disable it by setting --enable-s3-parquet-optimized-committer to false.
AWS Glue 2.0 uses open-source Spark 2.4 and AWS Glue 3.0 uses EMR-optimized Spark 3.1.1.
Several Spark changes alone may require revising your scripts to ensure that removed features are not referenced.
For example, Spark 3.1.1 does not enable untyped Scala UDFs, but Spark 2.4 does allow them.
AWS Glue 3.0 also features an updated EMRFS, updated JDBC drivers, and additional optimizations to Spark itself provided by AWS Glue.
All jobs in AWS Glue 3.0 execute with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration, because the minimum billing duration goes from 10 minutes to 1 minute.
Python 2.7 is not supported with Spark 3.1.1.
Several dependency updates, highlighted in Appendix A: notable dependency upgrades.
Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backwards compatible with Scala 2.11.
Any extra jars supplied in existing AWS Glue 2.0 jobs may bring in conflicting dependencies, since several dependencies were upgraded from 2.0 to 3.0. You can avoid classpath conflicts in AWS Glue 3.0 with the --user-jars-first AWS Glue job parameter.
AWS Glue 3.0 uses different Spark task parallelism for the driver/executor configuration compared to AWS Glue 2.0, which improves performance and better utilizes the available resources. Both spark.driver.cores and spark.executor.cores are configured to the number of cores on AWS Glue 3.0 (4 on the Standard and G.1X workers, and 8 on the G.2X worker). These configurations do not change the worker type or hardware for the AWS Glue job. You can use these configurations to calculate the number of partitions or splits to match the Spark task parallelism in your Spark application.
In general, jobs will see either similar or improved performance compared to AWS Glue 2.0. If jobs run slower, you can increase the task parallelism by passing the following job argument:
key: --executor-cores
value: <desired number of tasks that can run in parallel>
The value should not exceed 2x the number of vCPUs on the worker type, which is 8 on G.1X, 16 on G.2X, 32 on G.4X, and 64 on G.8X. You should exercise caution when updating this configuration, as increased parallelism causes memory and disk pressure and could throttle the source and target systems.
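To match your partitioning to the new parallelism, you can derive a target partition count from the worker fleet. A rough sketch using the vCPU figures above; the one-executor-per-worker and driver-takes-one-worker assumptions are simplifications, not guarantees of the Glue runtime:

```python
# vCPU counts per worker type, per the limits quoted above (2x these
# values is the ceiling for --executor-cores).
WORKER_VCPUS = {"Standard": 4, "G.1X": 4, "G.2X": 8, "G.4X": 16, "G.8X": 32}

def max_executor_cores(worker_type):
    """Upper bound for the --executor-cores job argument (2x vCPUs)."""
    return 2 * WORKER_VCPUS[worker_type]

def target_partitions(worker_type, number_of_workers,
                      executor_cores=None, tasks_per_core=2):
    """Estimate a partition count that keeps all Spark task slots busy.

    Assumes one executor per worker, with one worker reserved for the
    driver; 2-3 tasks per core is a common rule of thumb.
    """
    cores = executor_cores or WORKER_VCPUS[worker_type]
    executors = max(number_of_workers - 1, 1)
    return executors * cores * tasks_per_core
```

For example, a 10-worker G.1X job would target roughly 9 executors x 4 cores x 2 = 72 partitions for a repartition() or spark.sql.shuffle.partitions setting.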
AWS Glue 3.0 uses Spark 3.1, which changes the behavior of loading/saving timestamps from/to Parquet files. For more details, see Upgrading from Spark SQL 3.0 to 3.1. We recommend setting the following parameters when reading/writing Parquet data that contains timestamp columns. Setting those parameters resolves the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both the AWS Glue DynamicFrame and the Spark DataFrame. Use the CORRECTED option to read the datetime values as they are, and the LEGACY option to rebase the datetime values with regard to the calendar difference during reading.
Key: --conf
Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
Refer to the Spark migration documentation: https://spark.apache.org/docs/latest/sql-migration-guide.html
Appendix A: notable dependency upgrades
The following are dependency upgrades:
Dependency | Version in AWS Glue 0.9 | Version in AWS Glue 1.0 | Version in AWS Glue 2.0 | Version in AWS Glue 3.0 |
---|---|---|---|---|
Spark | 2.2.1 | 2.4.3 | 2.4.3 | 3.1.1-amzn-0 |
Hadoop | 2.7.3-amzn-6 | 2.8.5-amzn-1 | 2.8.5-amzn-5 | 3.2.1-amzn-3 |
Scala | 2.11 | 2.11 | 2.11 | 2.12 |
Jackson | 2.7.x | 2.7.x | 2.7.x | 2.10.x |
Hive | 1.2 | 1.2 | 1.2 | 2.3.7-amzn-4 |
EMRFS | 2.20.0 | 2.30.0 | 2.38.0 | 2.46.0 |
Json4s | 3.2.x | 3.5.x | 3.5.x | 3.6.6 |
Arrow | N/A | 0.10.0 | 0.10.0 | 2.0.0 |
AWS Glue Catalog client | N/A | N/A | 1.10.0 | 3.0.0 |
Appendix B: JDBC driver upgrades
The following are JDBC driver upgrades:
Driver | JDBC driver version in past AWS Glue versions | JDBC driver version in AWS Glue 3.0 |
---|---|---|
MySQL | 5.1 | 8.0.23 |
Microsoft SQL Server | 6.1.0 | 7.0.0 |
Oracle Databases | 11.2 | 21.1 |
PostgreSQL | 42.1.0 | 42.2.18 |
MongoDB | 2.0.0 | 4.0.0 |