

# AWS Glue versions
<a name="release-notes"></a>

You can configure the AWS Glue version parameter when you add or update a job. The AWS Glue version determines the versions of Apache Spark and Python that AWS Glue supports. The Python version indicates the version that's supported for jobs of type Spark. The following table lists the available AWS Glue versions, the corresponding Spark and Python versions, and other changes in functionality.

You can use the [Generative AI upgrades for Apache Spark](upgrade-analysis.md) feature to upgrade your AWS Glue ETL jobs from older AWS Glue versions (2.0 and later) to the latest AWS Glue version.

## AWS Glue versions
<a name="release-notes-versions"></a>

<a name="table-glue-versions"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/release-notes.html)

**Note**  
The following AWS Glue versions support these PythonShell versions:  
PythonShell 3.6 is supported in AWS Glue version 1.0.
PythonShell 3.9 is supported in AWS Glue version 3.0.
Additionally, development endpoints are supported only in AWS Glue versions 0.9 and 1.0.

# AWS Glue version support policy
<a name="glue-version-support-policy"></a>

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. An *AWS Glue job* contains the business logic that performs the data integration work in AWS Glue. There are three types of jobs in AWS Glue: *Spark (batch and streaming)*, *Ray*, and *Python shell*. When you define your job, you specify the AWS Glue version, which configures the versions of the underlying Spark, Ray, or Python runtime environment. For example, an AWS Glue version 5.0 Spark job supports Spark 3.5.4 and Python 3.11.

## Support policy
<a name="glue-version-support-policy-milestones"></a>

AWS Glue versions are built around a combination of operating system, programming language, and software libraries that are subject to maintenance and security updates. AWS Glue's version support policy is to end support for a version when any major component of the version reaches the end of community long-term support (LTS) and security updates are no longer available. AWS Glue's version support policy includes the following statuses: 

**End of Support (EOS) -** When an AWS Glue version reaches EOS:
+ AWS Glue will no longer apply security patches or other updates to EOS versions.
+ AWS Glue jobs on EOS versions are not eligible for technical support.
+ AWS Glue may not honor SLAs when jobs are run on EOS versions.

**End of Life (EOL) -** When an AWS Glue version reaches EOL:
+ You can no longer create new AWS Glue jobs or interactive sessions on EOL versions.
+ You can no longer start job runs on these AWS Glue versions.
+ AWS Glue will stop existing job runs and interactive sessions on EOL versions.
+ EOL versions will be removed from AWS Glue SDKs and APIs.

The following AWS Glue versions have reached or are approaching end of support and will no longer be available after their end-of-life date. Changes to a version's support status take effect at midnight (Pacific time zone) on the specified date.


| **Type** | **Version** | **End of support** | **End of life** |
| --- | --- | --- | --- |
| Spark | Glue version 0.9 (Spark 2.2, Scala 2, Python 2) | 6/1/2022 | 4/1/2026 |
| Spark | Glue version 1.0 (Spark 2.4, Python 2) | 6/1/2022 | 4/1/2026 |
| Spark | Glue version 1.0 (Spark 2.4, Scala 2, Python 3) | 9/30/2022 | 4/1/2026 |
| Spark | Glue version 2.0 (Spark 2.4, Python 3) | 1/31/2024 | 4/1/2026 |
| Python shell | Python 2 (AWS Glue version 1.0) | 6/1/2022 | 4/1/2026 |
| Python shell | PythonShell 3.6 (Glue version 1.0) | 3/31/2026 | NA |
| Development endpoint | Zeppelin notebook | 9/30/2022 | NA |

**Note**  
Creating new AWS Glue Python Shell 3.6 jobs will not be allowed once end of support is reached on March 31, 2026, but you can continue to update and run existing jobs. However, jobs that run on discontinued versions are not eligible for technical support, AWS Glue will not apply security patches or other updates to those versions, and AWS Glue will not honor SLAs when jobs run on them.

AWS strongly recommends that you migrate your jobs to supported versions.

For information on migrating your Spark jobs to the latest AWS Glue version, see [Migrating AWS Glue jobs to AWS Glue version 5.1](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-51.html). 

For migrating your Python shell jobs to the latest AWS Glue version:
+ In the console, choose `Python 3 (Glue Version 4.0)`.
+ In the [CreateJob](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateJob.html)/[UpdateJob](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateJob.html) API, set the `GlueVersion` parameter to `2.0`, and the `PythonVersion` to `3` under the `Command` parameter. The `GlueVersion` configuration does not affect the behavior of Python shell jobs, so there is no advantage to incrementing `GlueVersion`.
+ Make sure that your job script is compatible with Python 3.
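As an illustrative sketch (the job name, role ARN, and script location are placeholders, and the payload is only built here rather than sent to AWS), the API settings above map onto an `UpdateJob` request like this:

```python
# Sketch of the UpdateJob payload described above; all names are placeholders.
# With boto3 this would be sent as:
#   boto3.client("glue").update_job(JobName="my-shell-job", JobUpdate=job_update)
job_update = {
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",  # existing job role
    "GlueVersion": "2.0",  # does not change Python shell behavior
    "Command": {
        "Name": "pythonshell",
        "PythonVersion": "3",  # selects the Python 3 runtime
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/job.py",
    },
}
print(job_update["Command"]["PythonVersion"])
```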

# Migrating AWS Glue for Spark jobs to AWS Glue version 5.1
<a name="migrating-version-51"></a>

This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, 3.0, 4.0 and 5.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 5.1. It also describes the features in AWS Glue 5.1 and the advantages of using it. 

To use this feature with your AWS Glue ETL jobs, choose **5.1** for the `Glue version` when creating your jobs.

**Topics**
+ [New features](#migrating-version-51-features)
+ [Actions to migrate to AWS Glue 5.1](#migrating-version-51-actions)
+ [Migration checklist](#migrating-version-51-checklist)
+ [Migrating from AWS Glue 5.0 to AWS Glue 5.1](#migrating-version-51-from-50)
+ [Migrating from Older AWS Glue Versions to AWS Glue 5.1](#migrating-older-versions-to-51)
+ [Connector and JDBC driver migration for AWS Glue 5.1](#migrating-version-51-connector-driver-migration)

## New features
<a name="migrating-version-51-features"></a>

This section describes new features and advantages of AWS Glue version 5.1.
+ Apache Spark update from 3.5.4 in AWS Glue 5.0 to 3.5.6 in AWS Glue 5.1.
+ Open Table Formats (OTF) updated to Hudi 1.0.2, Iceberg 1.10.0, and Delta Lake 3.3.2.
+ **Iceberg Materialized Views** - Create and manage Iceberg Materialized Views (MV). For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/introducing-apache-iceberg-materialized-views-in-aws-glue-data-catalog/).
+ **Iceberg format version 3.0** - Extends data types and existing metadata structures to add new capabilities. For more information, see the [Iceberg Table Spec](https://iceberg.apache.org/spec/). 
+ **Hudi Full Table Access** - Full Table Access (FTA) control for Apache Hudi in Apache Spark based on your policies defined in AWS Lake Formation. This feature enables read and write operations from your AWS Glue ETL jobs on AWS Lake Formation registered tables when the job role has full table access.
+ **Spark native fine-grained access control (FGAC) support using AWS Lake Formation** - DDL/DML operations (like CREATE, ALTER, DELETE, DROP) with fine grained access control for Apache Hive, Apache Iceberg and Delta Lake tables registered in AWS Lake Formation.
+ **Audit context for Spark jobs** - Audit context for AWS Glue ETL jobs will be available for AWS Glue and AWS Lake Formation API calls in the AWS CloudTrail logs.

**Known Issues and Limitations**  
Note the following known issues and limitations:
+ Limited support for the view SQL clause for creation of materialized views, query rewrite, and incremental refresh. More details can be found on the [Iceberg Materialized Views feature documentation page](https://docs.aws.amazon.com/lake-formation/latest/dg/materialized-views.html#materialized-views-considerations-limitations).
+ **Hudi FTA writes** require using HoodieCredentialedHadoopStorage for credential vending during job execution. Set the following configuration when running Hudi jobs:

  `hoodie.storage.class=org.apache.spark.sql.hudi.storage.HoodieCredentialedHadoopStorage` 
+ Hudi FTA write support works only with the default Hudi configurations. Custom or non-default Hudi settings may not be fully supported and could result in unexpected behavior. Clustering for Hudi Merge-On-Read (MOR) tables is also not supported under FTA write mode.

**Breaking changes**  
Note the following breaking changes:
+  S3A filesystem has replaced EMRFS as the default S3 connector. For information on how to migrate, see [Migrating from AWS Glue 5.0 to AWS Glue 5.1](#migrating-version-51-from-50). 

## Actions to migrate to AWS Glue 5.1
<a name="migrating-version-51-actions"></a>

For existing jobs, change the `Glue version` from the previous version to `Glue 5.1` in the job configuration.
+ In AWS Glue Studio, choose `Glue 5.1 - Supports Spark 3.5.6, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.1** in the `GlueVersion` parameter in the [UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) API operation.

For new jobs, choose `Glue 5.1` when you create a job.
+ In the console, choose `Spark 3.5.6, Python 3 (Glue Version 5.1)` or `Spark 3.5.6, Scala 2 (Glue Version 5.1)` in `Glue version`.
+ In AWS Glue Studio, choose `Glue 5.1 - Supports Spark 3.5.6, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.1** in the `GlueVersion` parameter in the [CreateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob) API operation.
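As a sketch (job name, role ARN, worker settings, and script location are placeholders, and the request payload is only built here rather than sent to AWS), a `CreateJob` request targeting AWS Glue 5.1 for a Spark ETL job might look like:

```python
# Sketch of a CreateJob payload for an AWS Glue 5.1 Spark job; names are placeholders.
# With boto3 this would be sent as:
#   boto3.client("glue").create_job(**create_job_request)
create_job_request = {
    "Name": "my-etl-job",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "GlueVersion": "5.1",  # selects Spark 3.5.6, Python 3
    "Command": {
        "Name": "glueetl",  # Spark ETL job type
        "PythonVersion": "3",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/etl.py",
    },
    "NumberOfWorkers": 2,
    "WorkerType": "G.1X",
}
print(create_job_request["GlueVersion"])
```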

To view Spark event logs of AWS Glue 5.1 coming from AWS Glue 2.0 or earlier, [launch an upgraded Spark history server for AWS Glue 5.1 using CloudFormation or Docker](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html).

## Migration checklist
<a name="migrating-version-51-checklist"></a>

Review this checklist for migration:
+ [Python] Update boto references from 1.34 to 1.40.

## Migrating from AWS Glue 5.0 to AWS Glue 5.1
<a name="migrating-version-51-from-50"></a>

All existing job parameters and major features that exist in AWS Glue 5.0 will exist in AWS Glue 5.1. Note the following changes when migrating:
+ In AWS Glue 5.1, the S3A filesystem has replaced EMRFS as the default S3 connector. If neither `spark.hadoop.fs.s3a.endpoint` nor `spark.hadoop.fs.s3a.endpoint.region` is set, the default region used by S3A is `us-east-2`. This can cause issues, such as S3 upload timeout errors, especially for VPC jobs. To mitigate the issues caused by this change, set the `spark.hadoop.fs.s3a.endpoint.region` Spark configuration when using the S3A file system in AWS Glue 5.1.
+ To continue using EMRFS instead of S3A, set the following spark configurations:

  ```
      --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem
      --conf spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem
      --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate
  ```
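AWS Glue accepts multiple Spark properties through a single `--conf` job parameter by chaining further properties inside its value. As a sketch (assuming that chained form), the three EMRFS settings above can be assembled programmatically into a job's default arguments:

```python
# Build the single "--conf" default-argument value that chains the three
# EMRFS settings shown above; later properties are joined with " --conf ".
emrfs_settings = [
    "spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
    "spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
    "spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate",
]
default_arguments = {"--conf": " --conf ".join(emrfs_settings)}
print(default_arguments["--conf"])
```

The resulting dictionary could then be passed as the job's `DefaultArguments` when creating or updating it.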

Refer to the Spark migration documentation:
+ [Migration Guide: Spark Core](https://spark.apache.org/docs/3.5.6/core-migration-guide.html)
+ [Migration Guide: SQL, Datasets and DataFrame](https://spark.apache.org/docs/3.5.6/sql-migration-guide.html)
+ [Migration Guide: Structured Streaming](https://spark.apache.org/docs/3.5.6/ss-migration-guide.html)
+ [Upgrading PySpark](https://spark.apache.org/docs/3.5.6/api/python/migration_guide/pyspark_upgrade.html)

## Migrating from Older AWS Glue Versions to AWS Glue 5.1
<a name="migrating-older-versions-to-51"></a>
+ For migration steps related to AWS Glue 4.0 to AWS Glue 5.0, see [Migrating from AWS Glue 4.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html#migrating-version-50-from-40).
+ For migration steps related to AWS Glue 3.0 to AWS Glue 5.0, see [Migrating from AWS Glue 3.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html#migrating-version-50-from-30).
+ For migration steps related to AWS Glue 2.0 to AWS Glue 5.0 and a list of migration differences between AWS Glue version 2.0 and 4.0, see [Migrating from AWS Glue 2.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html#migrating-version-50-from-20). 

## Connector and JDBC driver migration for AWS Glue 5.1
<a name="migrating-version-51-connector-driver-migration"></a>

For the versions of JDBC and data lake connectors that were upgraded, see:
+ [Appendix B: JDBC driver upgrades](#migrating-version-51-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-51-appendix-connector)
+ [Appendix D: Open table format upgrades](#migrating-version-51-appendix-open-table-formats)

The following changes apply to the OTF version upgrades identified in [Appendix D: Open table format upgrades](#migrating-version-51-appendix-open-table-formats) for AWS Glue 5.1.

**Apache Hudi**  
Note the following changes:
+ Support FTA read and write access on Lake Formation registered tables.

**Apache Iceberg**  
Note the following changes:
+ Support Iceberg format version 3. The following features are supported:
  + Multi-argument transforms for partitioning and sorting.
  + Row Lineage tracking.
  + Deletion vectors. Learn more in the [blog post](https://aws.amazon.com/blogs/big-data/unlock-the-power-of-apache-iceberg-v3-deletion-vectors-on-amazon-emr/).
  + Table encryption keys.
  + Default value support for columns.
+ Support Spark-native FGAC writes on AWS Lake Formation registered tables.
+ Athena SQL compatibility - Athena cannot read Iceberg V3 tables created by EMR Spark; such reads fail with the error `GENERIC_INTERNAL_ERROR: Cannot read unsupported version 3`.

**Delta Lake**  
Note the following changes:
+ Support FTA read and write access on Lake Formation registered tables.

### Appendix A: Notable dependency upgrades
<a name="migrating-version-51-appendix-dependencies"></a>

The following are dependency upgrades:


| Dependency | Version in AWS Glue 5.1 | Version in AWS Glue 5.0 | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 | 
| --- | --- | --- | --- | --- | --- | --- | 
| Java | 17 | 17 | 8 | 8 | 8 | 8 | 
| Spark | 3.5.6 | 3.5.4 | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 | 
| Hadoop | 3.4.1 | 3.4.1 | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 | 
| Scala | 2.12.18 | 2.12.18 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Jackson | 2.15.2 | 2.15.2 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Hive | 2.3.9-amzn-4 | 2.3.9-amzn-4 | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 | 
| EMRFS | 2.73.0 | 2.69.0 | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 | 
| Json4s | 3.7.0-M11 | 3.7.0-M11 | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x | 
| Arrow | 12.0.1 | 12.0.1 | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 | 
| AWS Glue Data Catalog client | 4.9.0 | 4.5.0 | 3.7.0 | 3.0.0 | 1.10.0 | N/A | 
| AWS SDK for Java | 2.35.5 | 2.29.52 | 1.12 | 1.12 |  |  | 
| Python | 3.11 | 3.11 | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 | 
| Boto | 1.40.61 | 1.34.131 | 1.26 | 1.18 | 1.12 | N/A | 
| EMR DynamoDB connector | 5.7.0 | 5.6.0 | 4.16.0 |  |  |  | 

### Appendix B: JDBC driver upgrades
<a name="migrating-version-51-appendix-jdbc-driver"></a>

The following are JDBC driver upgrades:


| Driver | JDBC driver version in AWS Glue 5.1 | JDBC driver version in AWS Glue 5.0 | JDBC driver version in AWS Glue 4.0 | JDBC driver version in AWS Glue 3.0 | JDBC driver version in past AWS Glue versions | 
| --- | --- | --- | --- | --- | --- | 
| MySQL | 8.0.33 | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 | 
| Microsoft SQL Server | 10.2.0 | 10.2.0 | 9.4.0 | 7.0.0 | 6.1.0 | 
| Oracle Databases | 23.3.0.23.09 | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 | 
| PostgreSQL | 42.7.3 | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.0 | 
| Amazon Redshift |  redshift-jdbc42-2.1.0.29  |  redshift-jdbc42-2.1.0.29  |  redshift-jdbc42-2.1.0.16  |  redshift-jdbc41-1.2.12.1017   |  redshift-jdbc41-1.2.12.1017   | 
| SAP Hana | 2.20.17 | 2.20.17 | 2.17.12 |  |  | 
| Teradata | 20.00.00.33 | 20.00.00.33 | 20.00.00.06 |  |  | 

### Appendix C: Connector upgrades
<a name="migrating-version-51-appendix-connector"></a>

The following are connector upgrades:


| Driver | Connector version in AWS Glue 5.1 | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | --- | 
| EMR DynamoDB connector | 5.7.0 | 5.6.0 | 4.16.0 |  | 
| Amazon Redshift | 6.4.2 | 6.4.0 | 6.1.3 |  | 
| OpenSearch | 1.2.0 | 1.2.0 | 1.0.1 |  | 
| MongoDB | 10.3.0 | 10.3.0 | 10.0.4 | 3.0.0 | 
| Snowflake | 3.1.1 | 3.0.0 | 2.12.0 |  | 
| Google BigQuery | 0.32.2 | 0.32.2 | 0.32.2 |  | 
| AzureCosmos | 4.33.0 | 4.33.0 | 4.22.0 |  | 
| AzureSQL | 1.3.0 | 1.3.0 | 1.3.0 |  | 
| Vertica | 3.3.5 | 3.3.5 | 3.3.5 |  | 

### Appendix D: Open table format upgrades
<a name="migrating-version-51-appendix-open-table-formats"></a>

The following are open table format upgrades:


| OTF | Connector version in AWS Glue 5.1 | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | --- | 
| Hudi | 1.0.2 | 0.15.0 | 0.12.1 | 0.10.1 | 
| Delta Lake | 3.3.2 | 3.3.0 | 2.1.0 | 1.0.0 | 
| Iceberg | 1.10.0 | 1.7.1 | 1.0.0 | 0.13.1 | 

# Migrating AWS Glue for Spark jobs to AWS Glue version 5.0
<a name="migrating-version-50"></a>

This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, 3.0, and 4.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 5.0. It also describes the features in AWS Glue 5.0 and the advantages of using it. 

To use this feature with your AWS Glue ETL jobs, choose **5.0** for the `Glue version` when creating your jobs.

**Topics**
+ [New features](#migrating-version-50-features)
+ [Actions to migrate to AWS Glue 5.0](#migrating-version-50-actions)
+ [Migration checklist](#migrating-version-50-checklist)
+ [AWS Glue 5.0 features](#migrating-version-50-features)
+ [Migrating from AWS Glue 4.0 to AWS Glue 5.0](#migrating-version-50-from-40)
+ [Migrating from AWS Glue 3.0 to AWS Glue 5.0](#migrating-version-50-from-30)
+ [Migrating from AWS Glue 2.0 to AWS Glue 5.0](#migrating-version-50-from-20)
+ [Logging behavior changes in AWS Glue 5.0](#enable-continous-logging-changes-glue-50)
+ [Connector and JDBC driver migration for AWS Glue 5.0](#migrating-version-50-connector-driver-migration)

## New features
<a name="migrating-version-50-features"></a>

This section describes new features and advantages of AWS Glue version 5.0.
+ Apache Spark update from 3.3.0 in AWS Glue 4.0 to 3.5.4 in AWS Glue 5.0. See [Major enhancements from Spark 3.3.0 to Spark 3.5.4](#migrating-version-50-features-spark). 
+ Spark-native fine-grained access control (FGAC) using Lake Formation. This includes FGAC for Iceberg, Delta and Hudi tables. For more information, see [ Using AWS Glue with AWS Lake Formation for fine-grained access control](https://docs.aws.amazon.com/glue/latest/dg/security-lf-enable.html). 

  Note the following considerations or limitations for Spark-native FGAC:
  + Currently, data writes are not supported.
  + Writing into Iceberg through `GlueContext` using Lake Formation requires use of IAM access control instead.

  For a complete list of limitations and considerations when using Spark-native FGAC, see [Considerations and limitations](security-lf-enable-considerations.md).
+ Support for Amazon S3 Access Grants as a scalable access control solution to your Amazon S3 data from AWS Glue. For more information, see [Using Amazon S3 Access Grants with AWS Glue](security-s3-access-grants.md).
+ Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.7.1, and Delta Lake 3.3.0.
+ Amazon SageMaker Unified Studio support.
+ Amazon SageMaker Lakehouse and data abstraction integration. For more information, see [Querying metastore data catalogs from AWS Glue ETL](#migrating-version-50-features-metastore).
+ Support to install additional Python libraries using `requirements.txt`. For more information, see [Installing additional Python libraries in AWS Glue 5.0 or above using requirements.txt](aws-glue-programming-python-libraries.md#addl-python-modules-requirements-txt).
+ AWS Glue 5.0 supports data lineage in Amazon DataZone. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone. For more information, see [Data lineage in Amazon DataZone](https://docs.aws.amazon.com/datazone/latest/userguide/datazone-data-lineage.html).

  To configure this on the AWS Glue console, turn on **Generate lineage events**, and enter your Amazon DataZone domain ID on the **Job details** tab.  
![\[The screenshot shows turning on Amazon DataZone data lineage for AWS Glue.\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-50-data-lineage.png)

  Alternatively, you can provide the following job parameter (provide your DataZone domain ID):
  + Key: `--conf`
  + Value:

    ```
    spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
    --conf spark.openlineage.transport.type=amazon_datazone_api
    --conf spark.openlineage.transport.domainId=<your-domain-ID>
    ```
+ Connector and JDBC driver updates. For more information, see [Appendix B: JDBC driver upgrades](#migrating-version-50-appendix-jdbc-driver) and [Appendix C: Connector upgrades](#migrating-version-50-appendix-connector).
+ Java update from 8 to 17.
+ Increased storage for AWS Glue `G.1X` and `G.2X` workers, with disk space increasing to 94 GB and 138 GB respectively. Additionally, new worker types `G.12X` and `G.16X`, and memory-optimized `R.1X`, `R.2X`, `R.4X`, and `R.8X`, are available in AWS Glue 4.0 and later versions. For more information, see [Jobs](aws-glue-api-jobs-job.md).
+ **Support for AWS SDK for Java, version 2** - AWS Glue 5.0 jobs can use AWS SDK for Java versions [1.12.569](https://github.com/aws/aws-sdk-java/tree/1.12.569) or [2.28.8](https://github.com/aws/aws-sdk-java-v2/tree/2.28.8) if the job supports v2. The AWS SDK for Java 2.x is a major rewrite of the version 1.x code base. It's built on top of Java 8+ and adds several frequently requested features, including support for non-blocking I/O and the ability to plug in a different HTTP implementation at runtime. For more information, including a migration guide from SDK for Java v1 to v2, see the [AWS SDK for Java, version 2](https://docs.aws.amazon.com/sdk-for-java) guide.

**Breaking changes**  
Note the following breaking changes:
+ In AWS Glue 5.0, when using the S3A file system, if neither `fs.s3a.endpoint` nor `fs.s3a.endpoint.region` is set, the default region used by S3A is `us-east-2`. This can cause issues, such as S3 upload timeout errors, especially for VPC jobs. To mitigate the issues caused by this change, set the `fs.s3a.endpoint.region` Spark configuration when using the S3A file system in AWS Glue 5.0.
+ Lake Formation Fine-grained Access Control (FGAC)
  + AWS Glue 5.0 only supports the new Spark-native FGAC using Spark DataFrames. It does not support FGAC using AWS Glue DynamicFrames.
    + Using FGAC in AWS Glue 5.0 requires migrating from AWS Glue DynamicFrames to Spark DataFrames.
    + If you don't need FGAC, it is not necessary to migrate to Spark DataFrames, and `GlueContext` features, such as job bookmarks and pushdown predicates, will continue to work.
  + Jobs with Spark-native FGAC require a minimum of 4 workers: one user driver, one system driver, one system executor, and one standby user executor.
  + For more information, see [ Using AWS Glue with AWS Lake Formation for fine-grained access control](https://docs.aws.amazon.com/glue/latest/dg/security-lf-enable.html). 
+ Lake Formation Full Table Access (FTA)
  + AWS Glue 5.0 supports FTA with Spark-native DataFrames (new) and GlueContext DynamicFrames (legacy, with limitations).
  + Spark-native FTA
    + If your AWS Glue 4.0 script uses `GlueContext`, migrate to native Spark DataFrames.
    + This feature is limited to Hive and Iceberg tables.
    + For more info on configuring a 5.0 job to use Spark native FTA, see 
  + GlueContext DynamicFrame FTA
    + No code change is necessary.
    + This feature is limited to non-OTF tables; it will not work with Iceberg, Delta Lake, or Hudi tables.
+ [Vectorized SIMD CSV reader](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html#aws-glue-programming-etl-format-simd-csv-reader) is not supported.
+ [Continuous logging](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html) to output log group is not supported. Use `error` log group instead.
+ The AWS Glue job run insights `job-insights-rule-driver` has been deprecated. The `job-insights-rca-driver` log stream is now located in the error log group.
+ Athena-based custom/marketplace connectors are not supported.
+ Adobe Marketo Engage, Facebook Ads, Google Ads, Google Analytics 4, Google Sheets, Hubspot, Instagram Ads, Intercom, Jira Cloud, Oracle NetSuite, Salesforce, Salesforce Marketing Cloud, Salesforce Marketing Cloud Account Engagement, SAP OData, ServiceNow, Slack, Snapchat Ads, Stripe, Zendesk and Zoho CRM connectors are not supported.
+ Custom log4j properties are not supported in AWS Glue 5.0.

### Major enhancements from Spark 3.3.0 to Spark 3.5.4
<a name="migrating-version-50-features-spark"></a>

Note the following enhancements:
+ Python client for Spark Connect ([SPARK-39375](https://issues.apache.org/jira/browse/SPARK-39375)).
+ Implement support for DEFAULT values for columns in tables ([SPARK-38334](https://issues.apache.org/jira/browse/SPARK-38334)).
+ Support "Lateral Column Alias References" ([SPARK-27561](https://issues.apache.org/jira/browse/SPARK-27561)).
+ Harden SQLSTATE usage for error classes ([SPARK-41994](https://issues.apache.org/jira/browse/SPARK-41994)).
+ Enable Bloom filter Joins by default ([SPARK-38841](https://issues.apache.org/jira/browse/SPARK-38841)).
+ Better Spark UI scalability and driver stability for large applications ([SPARK-41053](https://issues.apache.org/jira/browse/SPARK-41053)).
+ Async Progress Tracking in Structured Streaming ([SPARK-39591](https://issues.apache.org/jira/browse/SPARK-39591)).
+ Python arbitrary stateful processing in structured streaming ([SPARK-40434](https://issues.apache.org/jira/browse/SPARK-40434)).
+ Pandas API coverage improvements ([SPARK-42882](https://issues.apache.org/jira/browse/SPARK-42882)) and NumPy input support in PySpark ([SPARK-39405](https://issues.apache.org/jira/browse/SPARK-39405)).
+ Provide a memory profiler for PySpark user-defined functions ([SPARK-40281](https://issues.apache.org/jira/browse/SPARK-40281)).
+ Implement PyTorch distributor ([SPARK-41589](https://issues.apache.org/jira/browse/SPARK-41589)).
+ Publish SBOM artifacts ([SPARK-41893](https://issues.apache.org/jira/browse/SPARK-41893)).
+ Support IPv6-only environment ([SPARK-39457](https://issues.apache.org/jira/browse/SPARK-39457)).
+ Customized K8s scheduler (Apache YuniKorn and Volcano) GA ([SPARK-42802](https://issues.apache.org/jira/browse/SPARK-42802)).
+ Scala and Go client support in Spark Connect ([SPARK-42554](https://issues.apache.org/jira/browse/SPARK-42554)) and ([SPARK-43351](https://issues.apache.org/jira/browse/SPARK-43351)).
+ PyTorch-based distributed ML Support for Spark Connect ([SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)).
+ Structured streaming support for Spark Connect in Python and Scala ([SPARK-42938](https://issues.apache.org/jira/browse/SPARK-42938)).
+ Pandas API support for the Python Spark Connect Client ([SPARK-42497](https://issues.apache.org/jira/browse/SPARK-42497)).
+ Introduce Arrow Python UDFs ([SPARK-40307](https://issues.apache.org/jira/browse/SPARK-40307)).
+ Support Python user-defined table functions ([SPARK-43798](https://issues.apache.org/jira/browse/SPARK-43798)).
+ Migrate PySpark errors onto error classes ([SPARK-42986](https://issues.apache.org/jira/browse/SPARK-42986)).
+ PySpark test framework ([SPARK-44042](https://issues.apache.org/jira/browse/SPARK-44042)).
+ Add support for Datasketches HllSketch ([SPARK-16484](https://issues.apache.org/jira/browse/SPARK-16484)).
+ Built-in SQL function improvement ([SPARK-41231](https://issues.apache.org/jira/browse/SPARK-41231)).
+ IDENTIFIER clause ([SPARK-43205](https://issues.apache.org/jira/browse/SPARK-43205)).
+ Add SQL functions into Scala, Python and R API ([SPARK-43907](https://issues.apache.org/jira/browse/SPARK-43907)).
+ Add named argument support for SQL functions ([SPARK-43922](https://issues.apache.org/jira/browse/SPARK-43922)).
+ Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated ([SPARK-41469](https://issues.apache.org/jira/browse/SPARK-41469)).
+ Distributed ML <> spark connect ([SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)).
+ DeepSpeed distributor ([SPARK-44264](https://issues.apache.org/jira/browse/SPARK-44264)).
+ Implement changelog checkpointing for RocksDB state store ([SPARK-43421](https://issues.apache.org/jira/browse/SPARK-43421)).
+ Introduce watermark propagation among operators ([SPARK-42376](https://issues.apache.org/jira/browse/SPARK-42376)).
+ Introduce dropDuplicatesWithinWatermark ([SPARK-42931](https://issues.apache.org/jira/browse/SPARK-42931)).
+ RocksDB state store provider memory management enhancements ([SPARK-43311](https://issues.apache.org/jira/browse/SPARK-43311)).

## Actions to migrate to AWS Glue 5.0
<a name="migrating-version-50-actions"></a>

For existing jobs, change the `Glue version` from the previous version to `Glue 5.0` in the job configuration.
+ In AWS Glue Studio, choose `Glue 5.0 - Supports Spark 3.5.4, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.0** in the `GlueVersion` parameter in the [UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) API operation.

For new jobs, choose `Glue 5.0` when you create a job.
+ In the console, choose `Spark 3.5.4, Python 3 (Glue Version 5.0)` or `Spark 3.5.4, Scala 2 (Glue Version 5.0)` in `Glue version`.
+ In AWS Glue Studio, choose `Glue 5.0 - Supports Spark 3.5.4, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.0** in the `GlueVersion` parameter in the [CreateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob) API operation.

To view Spark event logs of AWS Glue 5.0 coming from AWS Glue 2.0 or earlier, [launch an upgraded Spark history server for AWS Glue 5.0 using CloudFormation or Docker](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html).

## Migration checklist
<a name="migrating-version-50-checklist"></a>

Review this checklist for migration:
+ Java 17 updates
+ [Scala] Upgrade AWS SDK calls from v1 to v2
+ Python 3.10 to 3.11 migration
+ [Python] Update boto references from 1.26 to 1.34

## AWS Glue 5.0 features
<a name="migrating-version-50-features"></a>

This section describes AWS Glue features in more detail.

### Querying metastore data catalogs from AWS Glue ETL
<a name="migrating-version-50-features-metastore"></a>

You can register your AWS Glue job to access the AWS Glue Data Catalog, which makes tables and other metastore resources available to disparate consumers. The Data Catalog supports a multi-catalog hierarchy, which unifies all your data across Amazon S3 data lakes. It also provides both a Hive metastore API and an open-source Apache Iceberg API for accessing the data. These features are available to AWS Glue and other data-oriented services like Amazon EMR, Amazon Athena and Amazon Redshift.

When you create resources in the Data Catalog, you can access them from any SQL engine that supports the Apache Iceberg REST API. AWS Lake Formation manages permissions. After configuration, you can use AWS Glue to query disparate data by querying these metastore resources from familiar applications, including Apache Spark and Trino.

#### How metadata resources are organized
<a name="migrating-version-50-features-metastore-organized"></a>

Data is organized in a logical hierarchy of catalogs, databases and tables, using the AWS Glue Data Catalog:
+ Catalog – A logical container that holds objects from a data store, such as schemas or tables.
+ Database – Organizes data objects such as tables and views in a catalog.
+ Tables and views – Data objects in a database that provide an abstraction layer with an understandable schema. They make it easy to access underlying data, which could be in various formats and in various locations.
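As an illustration of this hierarchy, the sketch below builds the dotted three-part identifier that SQL engines accept against the catalog. The helper and all names are hypothetical, and the back-tick quoting rule is a simplified assumption:

```python
def qualified_name(catalog: str, database: str, table: str) -> str:
    """Join the catalog > database > table hierarchy into a dotted identifier,
    back-quoting parts that contain characters SQL would misparse."""
    def quote(part: str) -> str:
        return f"`{part}`" if "." in part or "-" in part else part
    return ".".join(quote(p) for p in (catalog, database, table))

# Hypothetical names; in a Glue 5.0 job you could pass the result to spark.sql()
sql = f"SELECT * FROM {qualified_name('my_catalog', 'sales_db', 'orders')}"
```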

## Migrating from AWS Glue 4.0 to AWS Glue 5.0
<a name="migrating-version-50-from-40"></a>

All existing job parameters and major features that exist in AWS Glue 4.0 will exist in AWS Glue 5.0, except machine learning transforms.

The following new parameters were added:
+ `--enable-lakeformation-fine-grained-access`: Enables the fine-grained access control (FGAC) feature in AWS Lake Formation tables.
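A minimal sketch of passing this parameter through a job's `DefaultArguments` map; the helper name is hypothetical, and only the flag itself comes from the documentation above:

```python
def fgac_default_arguments(extra_args=None):
    """Build a DefaultArguments map that turns on Lake Formation
    fine-grained access control for a Glue 5.0 job."""
    args = {"--enable-lakeformation-fine-grained-access": "true"}
    if extra_args:
        args.update(extra_args)  # merge any job-specific arguments
    return args

# Example: combine the FGAC flag with an existing argument
job_args = fgac_default_arguments({"--job-language": "python"})
```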

Refer to the Spark migration documentation:
+ [Migration Guide: Spark Core](https://spark.apache.org/docs/3.5.6/core-migration-guide.html)
+ [Migration Guide: SQL, Datasets and DataFrame](https://spark.apache.org/docs/3.5.6/sql-migration-guide.html)
+ [Migration Guide: Structured Streaming](https://spark.apache.org/docs/3.5.6/ss-migration-guide.html)
+ [Upgrading PySpark](https://spark.apache.org/docs/3.5.6/api/python/migration_guide/pyspark_upgrade.html)

## Migrating from AWS Glue 3.0 to AWS Glue 5.0
<a name="migrating-version-50-from-30"></a>

**Note**  
For migration steps related to AWS Glue 4.0, see [Migrating from AWS Glue 3.0 to AWS Glue 4.0](migrating-version-40.md#migrating-version-40-from-30).

All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 5.0, except machine learning transforms.

## Migrating from AWS Glue 2.0 to AWS Glue 5.0
<a name="migrating-version-50-from-20"></a>

**Note**  
For migration steps related to AWS Glue 4.0 and a list of migration differences between AWS Glue version 3.0 and 4.0, see [Migrating from AWS Glue 3.0 to AWS Glue 4.0](migrating-version-40.md#migrating-version-40-from-30).

Also note the following migration differences between AWS Glue versions 3.0 and 2.0:
+ All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 5.0, except machine learning transforms.
+ Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced. For example, Spark 3.1.1 and later do not enable untyped Scala UDFs, but Spark 2.4 allows them.
+ Python 2.7 is not supported.
+ Any extra JAR files supplied in existing AWS Glue 2.0 jobs may bring in conflicting dependencies because there were upgrades in several dependencies. You can avoid classpath conflicts with the `--user-jars-first` job parameter.
+ The behavior of loading and saving timestamps from/to Parquet files has changed. For more details, see [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31).
+ Spark task parallelism differs due to changes in the driver/executor configuration. You can adjust task parallelism by passing the `--executor-cores` job argument.
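The migration-related flags above can be folded into an existing job's `DefaultArguments`. A hedged sketch — the helper name and the example core count are illustrative assumptions, not recommendations:

```python
def apply_migration_args(default_arguments, executor_cores=4):
    """Return a copy of a job's DefaultArguments with the flags commonly
    adjusted when moving a Glue 2.0 job to Glue 5.0."""
    args = dict(default_arguments)                 # don't mutate the caller's map
    args["--user-jars-first"] = "true"             # avoid classpath conflicts from extra JARs
    args["--executor-cores"] = str(executor_cores) # tune task parallelism per executor
    return args

migrated = apply_migration_args({"--job-language": "python"})
```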

## Logging behavior changes in AWS Glue 5.0
<a name="enable-continous-logging-changes-glue-50"></a>

 The following are changes in logging behavior in AWS Glue 5.0. For more information, see [Logging for AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html). 
+  All logs (system logs, Spark daemon logs, user logs, and Glue Logger logs) are now written to the `/aws-glue/jobs/error` log group by default. 
+  The `/aws-glue/jobs/logs-v2` log group used for continuous logging in previous versions is no longer used. 
+  You can no longer rename or customize the log group or log stream names using the removed continuous logging arguments. Instead, see the new job arguments in AWS Glue 5.0. 

### Two new job arguments are introduced in AWS Glue 5.0
<a name="enable-continous-logging-new-arguments-glue-50"></a>
+ `--custom-logGroup-prefix`: Allows you to specify a custom prefix for the `/aws-glue/jobs/error` and `/aws-glue/jobs/output` log groups.
+ `--custom-logStream-prefix`: Allows you to specify a custom prefix for the log stream names within the log groups.

   Validation rules and limitations for custom prefixes include: 
  +  The entire log stream name must be between 1 and 512 characters. 
  +  The custom prefix for log stream names is limited to 400 characters. 
  + Allowed characters in prefixes include alphanumeric characters, underscores (`_`), hyphens (`-`), and forward slashes (`/`).
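A sketch that assembles these arguments while applying the documented limits. The helper and its checks are illustrative assumptions, not the service's exact validation logic:

```python
import re

# Allowed prefix characters per the rules above: alphanumerics, _, -, /
_PREFIX_OK = re.compile(r"^[A-Za-z0-9_\-/]+$")

def custom_logging_args(group_prefix, stream_prefix):
    """Build the Glue 5.0 custom logging arguments, rejecting prefixes
    that violate the documented character set or the 400-character limit."""
    if not (1 <= len(stream_prefix) <= 400) or not _PREFIX_OK.match(stream_prefix):
        raise ValueError("invalid log stream prefix")
    if not _PREFIX_OK.match(group_prefix):
        raise ValueError("invalid log group prefix")
    return {
        "--custom-logGroup-prefix": group_prefix,
        "--custom-logStream-prefix": stream_prefix,
    }

args = custom_logging_args("team-a", "etl/run")  # hypothetical prefixes
```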

### Deprecated continuous logging arguments in AWS Glue 5.0
<a name="enabling-continuous-logging-deprecated-arguments"></a>

The following job arguments for continuous logging are deprecated in AWS Glue 5.0:
+ `--enable-continuous-cloudwatch-log`
+ `--continuous-log-logGroup`
+ `--continuous-log-logStreamPrefix`
+ `--continuous-log-conversionPattern`
+ `--enable-continuous-log-filter`

## Connector and JDBC driver migration for AWS Glue 5.0
<a name="migrating-version-50-connector-driver-migration"></a>

For the versions of JDBC and data lake connectors that were upgraded, see:
+ [Appendix B: JDBC driver upgrades](#migrating-version-50-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-50-appendix-connector)
+ [Appendix D: Open table format upgrades](#migrating-version-50-appendix-open-table-formats)

The following changes apply to the connector or driver versions identified in the appendices for Glue 5.0.

**Amazon Redshift**  
Note the following changes:
+ Adds support for three-part table names to allow the connector to query Redshift data sharing tables.
+ Corrects the mapping of Spark `ShortType` to use Redshift `SMALLINT` instead of `INTEGER` to better match the expected data size.
+ Adds support for custom cluster names (CNAME) for Amazon Redshift Serverless.

**Apache Hudi**  
Note the following changes:
+ Supports record-level indexes.
+ Supports automatic generation of record keys, so you no longer have to specify the record key field.

**Apache Iceberg**  
Note the following changes:
+ Supports fine-grained access control with AWS Lake Formation.
+ Supports branching and tagging, which are named references to snapshots with their own independent lifecycles.
+ Adds a changelog view procedure that generates a view containing the changes made to a table over a specified period or between specific snapshots.

**Delta Lake**  
Note the following changes:
+ Supports Delta Universal Format (UniForm), which enables seamless access through Apache Iceberg and Apache Hudi.
+ Supports deletion vectors, which implement a merge-on-read paradigm.

**AzureCosmos**  
Note the following changes:
+ Added hierarchical partition key support.
+ Added an option to use a custom schema with `StringType` (raw JSON) for a nested property.
+ Added the config option `spark.cosmos.auth.aad.clientCertPemBase64` to allow service principal name (SPN) authentication with a certificate instead of a client secret.

For more information, see the [Azure Cosmos DB Spark connector change log](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-2_2-12/CHANGELOG.md).

**Microsoft SQL Server**  
Note the following changes:
+ TLS encryption is enabled by default.
+ When `encrypt` is set to `false` but the server requires encryption, the certificate is validated based on the `trustServerCertificate` connection setting.
+ `aadSecurePrincipalId` and `aadSecurePrincipalSecret` are deprecated.
+ The `getAADSecretPrincipalId` API is removed.
+ Added CNAME resolution when realm is specified.

**MongoDB**  
Note the following changes:
+ Support for micro-batch mode with Spark Structured Streaming.
+ Support for BSON data types.
+ Added support for reading multiple collections when using micro-batch or continuous streaming modes.
  + If the name of a collection used in your `collection` configuration option contains a comma, the Spark Connector treats it as two different collections. To avoid this, you must escape the comma by preceding it with a backslash (`\`).
  + If the name of a collection used in your `collection` configuration option is "*", the Spark Connector interprets it as a specification to scan all collections. To avoid this, you must escape the asterisk by preceding it with a backslash (`\`).
  + If the name of a collection used in your `collection` configuration option contains a backslash (`\`), the Spark Connector treats the backslash as an escape character, which might change how it interprets the value. To avoid this, you must escape the backslash by preceding it with another backslash.
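The escaping rules above can be captured in a small helper; this sketch is hypothetical and not part of the connector — it escapes the backslash first so that the escapes it adds for commas and asterisks are not themselves re-escaped:

```python
def escape_collection_name(name: str) -> str:
    """Escape the characters the MongoDB Spark connector treats specially
    in the `collection` option: backslash, comma, and asterisk."""
    return (name.replace("\\", "\\\\")  # existing backslashes first
                .replace(",", "\\,")    # comma would split into two collections
                .replace("*", "\\*"))   # bare * would mean "all collections"

# e.g. pass escape_collection_name("sales,2024") as the collection option value
```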

For more information, see the [MongoDB connector for Spark release notes](https://www.mongodb.com/docs/spark-connector/current/release-notes/).

**Snowflake**  
Note the following changes:
+ Introduced a new `trim_space` parameter that you can use to trim values of `StringType` columns automatically when saving to a Snowflake table. Default: `false`.
+ Disabled the `abort_detached_query` parameter at the session level by default.
+ Removed the requirement of the `SFUSER` parameter when using OAUTH.
+ Removed the Advanced Query Pushdown feature. Alternatives to the feature are available. For example, instead of loading data from Snowflake tables, users can directly load data from Snowflake SQL queries.

For more information, see the [Snowflake Connector for Spark release notes](https://docs.snowflake.com/en/release-notes/clients-drivers/spark-connector-2024).

### Appendix A: Notable dependency upgrades
<a name="migrating-version-50-appendix-dependencies"></a>

The following are dependency upgrades:


| Dependency | Version in AWS Glue 5.0 | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 | 
| --- | --- | --- | --- | --- | --- | 
| Java | 17 | 8 | 8 | 8 | 8 | 
| Spark | 3.5.4 | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 | 
| Hadoop | 3.4.1 | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 | 
| Scala | 2.12.18 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Jackson | 2.15.2 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Hive | 2.3.9-amzn-4 | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 | 
| EMRFS | 2.69.0 | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 | 
| Json4s | 3.7.0-M11 | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x | 
| Arrow | 12.0.1 | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 | 
| AWS Glue Data Catalog client | 4.5.0 | 3.7.0 | 3.0.0 | 1.10.0 | N/A | 
| AWS SDK for Java | 2.29.52 | 1.12 | 1.12 |  |  | 
| Python | 3.11 | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 | 
| Boto | 1.34.131 | 1.26 | 1.18 | 1.12 | N/A | 
| EMR DynamoDB connector | 5.6.0 | 4.16.0 |  |  |  | 

### Appendix B: JDBC driver upgrades
<a name="migrating-version-50-appendix-jdbc-driver"></a>

The following are JDBC driver upgrades:


| Driver | JDBC driver version in AWS Glue 5.0 | JDBC driver version in AWS Glue 4.0 | JDBC driver version in AWS Glue 3.0 | JDBC driver version in past AWS Glue versions | 
| --- | --- | --- | --- | --- | 
| MySQL | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 | 
| Microsoft SQL Server | 10.2.0 | 9.4.0 | 7.0.0 | 6.1.0 | 
| Oracle Databases | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 | 
| PostgreSQL | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.0 | 
| Amazon Redshift |  redshift-jdbc42-2.1.0.29  |  redshift-jdbc42-2.1.0.16  |  redshift-jdbc41-1.2.12.1017   |  redshift-jdbc41-1.2.12.1017   | 
| SAP Hana | 2.20.17 | 2.17.12 |  |  | 
| Teradata | 20.00.00.33 | 20.00.00.06 |  |  | 

### Appendix C: Connector upgrades
<a name="migrating-version-50-appendix-connector"></a>

The following are connector upgrades:


| Driver | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | 
| EMR DynamoDB connector | 5.6.0 | 4.16.0 |  | 
| Amazon Redshift | 6.4.0 | 6.1.3 |  | 
| OpenSearch | 1.2.0 | 1.0.1 |  | 
| MongoDB | 10.3.0 | 10.0.4 | 3.0.0 | 
| Snowflake | 3.0.0 | 2.12.0 |  | 
| Google BigQuery | 0.32.2 | 0.32.2 |  | 
| AzureCosmos | 4.33.0 | 4.22.0 |  | 
| AzureSQL | 1.3.0 | 1.3.0 |  | 
| Vertica | 3.3.5 | 3.3.5 |  | 

### Appendix D: Open table format upgrades
<a name="migrating-version-50-appendix-open-table-formats"></a>

The following are open table format upgrades:


| OTF | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | 
| Hudi | 0.15.0 | 0.12.1 | 0.10.1 | 
| Delta Lake | 3.3.0 | 2.1.0 | 1.0.0 | 
| Iceberg | 1.7.1 | 1.0.0 | 0.13.1 | 

# Migrating AWS Glue for Spark jobs to AWS Glue version 4.0
<a name="migrating-version-40"></a>

This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, and 3.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 4.0. It also describes the features in AWS Glue 4.0 and the advantages of using it. 

To use this feature with your AWS Glue ETL jobs, choose **4.0** for the `Glue version` when creating your jobs.

**Topics**
+ [New features supported](#migrating-version-40-features)
+ [Actions to migrate to AWS Glue 4.0](#migrating-version-40-actions)
+ [Migration checklist](#migrating-version-40-checklist)
+ [Migrating from AWS Glue 3.0 to AWS Glue 4.0](#migrating-version-40-from-30)
+ [Migrating from AWS Glue 2.0 to AWS Glue 4.0](#migrating-version-40-from-20)
+ [Migrating from AWS Glue 1.0 to AWS Glue 4.0](#migrating-version-40-from-10)
+ [Migrating from AWS Glue 0.9 to AWS Glue 4.0](#migrating-version-40-from-09)
+ [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration)
+ [Appendix A: Notable dependency upgrades](#migrating-version-40-appendix-dependencies)
+ [Appendix B: JDBC driver upgrades](#migrating-version-40-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-40-appendix-connector)

## New features supported
<a name="migrating-version-40-features"></a>

This section describes new features and advantages of AWS Glue version 4.0.
+ It is based on Apache Spark 3.3.0 and includes optimizations from AWS Glue and Amazon EMR, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.
+ Upgraded JDBC drivers for all AWS Glue native sources including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, MongoDB, and upgraded Spark libraries and dependencies brought in by Spark 3.3.0.
+ Updated with a new Amazon Redshift connector and JDBC driver.
+ Optimized Amazon S3 access with upgraded EMR File System (EMRFS) and enabled Amazon S3-optimized output committers, by default.
+ Optimized Data Catalog access with partition indexes, pushdown predicates, partition listing, and an upgraded Hive metastore client.
+ Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.
+ Reduced startup latency to improve overall job completion times and interactivity.
+ Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration—from a 10-minute minimum to a 1-minute minimum.
+ Native support for open-data lake frameworks with Apache Hudi, Delta Lake, and Apache Iceberg.
+ Native support for the Amazon S3-based Cloud Shuffle Storage Plugin (an Apache Spark plugin) to use Amazon S3 for shuffling and elastic storage capacity.

**Major enhancements from Spark 3.1.1 to Spark 3.3.0**  
Note the following enhancements:
+ Row-level runtime filtering ([SPARK-32268](https://issues.apache.org/jira/browse/SPARK-32268)).
+ ANSI enhancements ([SPARK-38860](https://issues.apache.org/jira/browse/SPARK-38860)).
+ Error message improvements ([SPARK-38781](https://issues.apache.org/jira/browse/SPARK-38781)).
+ Support complex types for Parquet vectorized reader ([SPARK-34863](https://issues.apache.org/jira/browse/SPARK-34863)).
+ Hidden file metadata support for Spark SQL ([SPARK-37273](https://issues.apache.org/jira/browse/SPARK-37273)).
+ Provide a profiler for Python/Pandas UDFs ([SPARK-37443](https://issues.apache.org/jira/browse/SPARK-37443)).
+ Introduce Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches ([SPARK-36533](https://issues.apache.org/jira/browse/SPARK-36533)).
+ More comprehensive Datasource V2 pushdown capabilities ([SPARK-38788](https://issues.apache.org/jira/browse/SPARK-38788)).
+ Migrating from log4j 1 to log4j 2 ([SPARK-37814](https://issues.apache.org/jira/browse/SPARK-37814)).

**Other notable changes**  
Note the following changes:
+ Breaking changes
  + Drop references to Python 3.6 support in docs and Python/docs ([SPARK-36977](https://issues.apache.org/jira/browse/SPARK-36977)).
  + Remove named tuple hack by replacing built-in pickle to cloudpickle ([SPARK-32079](https://issues.apache.org/jira/browse/SPARK-32079)).
  + Bump minimum pandas version to 1.0.5 ([SPARK-37465](https://issues.apache.org/jira/browse/SPARK-37465)).

## Actions to migrate to AWS Glue 4.0
<a name="migrating-version-40-actions"></a>

For existing jobs, change the `Glue version` from the previous version to `Glue 4.0` in the job configuration.
+ In AWS Glue Studio, choose `Glue 4.0 - Supports Spark 3.3, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **4.0** in the `GlueVersion` parameter in the [UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) API operation.

For new jobs, choose `Glue 4.0` when you create a job.
+ In the console, choose `Spark 3.3, Python 3 (Glue Version 4.0)` or `Spark 3.3, Scala 2 (Glue Version 4.0)` in `Glue version`.
+ In AWS Glue Studio, choose `Glue 4.0 - Supports Spark 3.3, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **4.0** in the `GlueVersion` parameter in the [CreateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob) API operation.

To view Spark event logs of AWS Glue 4.0 coming from AWS Glue 2.0 or earlier, [launch an upgraded Spark history server for AWS Glue 4.0 using CloudFormation or Docker](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html).

## Migration checklist
<a name="migrating-version-40-checklist"></a>
+ Do your job's external Python libraries depend on Python 2.7/3.6?
  + Update the dependent libraries from Python 2.7/3.6 to Python 3.10 as Spark 3.3.0 completely removed Python 2.7 and 3.6 support.

## Migrating from AWS Glue 3.0 to AWS Glue 4.0
<a name="migrating-version-40-from-30"></a>

Note the following changes when migrating:
+ All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 4.0.
+ AWS Glue 3.0 uses Amazon EMR-optimized Spark 3.1.1, and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.

  Several Spark changes alone might require revision of your scripts to ensure that removed features are not being referenced.
+ AWS Glue 4.0 also features an update to EMRFS and Hadoop. For the specific version, see [Appendix A: Notable dependency upgrades](#migrating-version-40-appendix-dependencies).
+ The AWS SDK provided in ETL jobs is now upgraded from 1.11 to 1.12.
+ All Python jobs use Python version 3.10. Previously, Python 3.7 was used in AWS Glue 3.0.

  As a result, some Python modules provided out of the box by AWS Glue are upgraded.
+ Log4j has been upgraded to Log4j2.
  + For information on the Log4j2 migration path, see the [Log4j documentation](https://logging.apache.org/log4j/2.x/manual/migration.html#Log4j2API).
  + You must rename any custom log4j.properties file to log4j2.properties and update it with the appropriate Log4j2 properties.
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs using AWS Glue security configurations and jobs dependent on the AWS Encryption SDK dependency provided in runtime are affected. See the instructions for AWS Glue job migration.

  You can safely upgrade an AWS Glue 2.0/3.0 job to an AWS Glue 4.0 job because AWS Glue 2.0/3.0 already contains the AWS Encryption SDK bridge version.

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)

## Migrating from AWS Glue 2.0 to AWS Glue 4.0
<a name="migrating-version-40-from-20"></a>

Note the following changes when migrating:

**Note**  
For migration steps related to AWS Glue 3.0, see [Migrating from AWS Glue 3.0 to AWS Glue 4.0](#migrating-version-40-from-30).
+ All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 4.0.
+ The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default since AWS Glue 3.0. However, you can still disable it by setting `--enable-s3-parquet-optimized-committer` to `false`.
+ AWS Glue 2.0 uses open-source Spark 2.4 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  + Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced.
  + For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
+ The AWS SDK provided in ETL jobs is now upgraded from 1.11 to 1.12.
+ AWS Glue 4.0 also features an update to EMRFS, updated JDBC drivers, and inclusions of additional optimizations onto Spark itself provided by AWS Glue.
+ Scala is updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
+ Python 3.10 is the default version used for Python scripts, as AWS Glue 2.0 was only using Python 3.7 and 2.7.
  + Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  + A new mechanism of installing additional Python modules is available since AWS Glue 2.0.
+ Several dependency updates, highlighted in [Appendix A: Notable dependency upgrades](#migrating-version-40-appendix-dependencies).
+ Any extra JAR files supplied in existing AWS Glue 2.0 jobs might bring in conflicting dependencies because there were upgrades in several dependencies in 4.0 from 2.0. You can avoid classpath conflicts in AWS Glue 4.0 with the `--user-jars-first` AWS Glue job parameter.
+ AWS Glue 4.0 uses Spark 3.3. Starting with Spark 3.1, there was a change in the behavior of loading/saving of timestamps from/to parquet files. For more details, see [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31).

  We recommend setting the following parameters when reading or writing Parquet data that contains timestamp columns. Setting these parameters can resolve the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both AWS Glue DynamicFrames and Spark DataFrames. Use the CORRECTED option to read the datetime value as-is, and the LEGACY option to rebase the datetime values with regard to the calendar difference during reading.

  ```
  - Key: --conf
  - Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
  ```
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs using AWS Glue security configurations and jobs dependent on the AWS Encryption SDK dependency provided in runtime are affected. See these instructions for AWS Glue job migration:
  + You can safely upgrade an AWS Glue 2.0 job to an AWS Glue 4.0 job because AWS Glue 2.0 already contains the AWS Encryption SDK bridge version.
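The `--conf` key/value pair for the timestamp rebase settings shown earlier in this list can be assembled programmatically. A sketch — the helper name is hypothetical, and only the Spark property names come from the documentation above:

```python
def parquet_rebase_conf(mode="CORRECTED"):
    """Build the --conf job argument value that sets the Parquet timestamp
    rebase modes for the Spark 2 -> Spark 3 calendar change."""
    if mode not in ("CORRECTED", "LEGACY"):
        raise ValueError(f"unknown rebase mode: {mode}")
    settings = [
        f"spark.sql.legacy.parquet.int96RebaseModeInRead={mode}",
        f"spark.sql.legacy.parquet.int96RebaseModeInWrite={mode}",
        f"spark.sql.legacy.parquet.datetimeRebaseModeInRead={mode}",
    ]
    # Glue chains multiple Spark settings inside one --conf job argument
    return {"--conf": " --conf ".join(settings)}

job_conf = parquet_rebase_conf("CORRECTED")
```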

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 2.4 to 3.0](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)
+ [Changes in Datetime behavior to be expected since Spark 3.0.](https://issues.apache.org/jira/browse/SPARK-31408)

## Migrating from AWS Glue 1.0 to AWS Glue 4.0
<a name="migrating-version-40-from-10"></a>

Note the following changes when migrating:
+ AWS Glue 1.0 uses open-source Spark 2.4 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  + Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced.
  + For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
+ All jobs in AWS Glue 4.0 run with significantly improved startup times. Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration because startup latency goes from 10 minutes maximum to 1 minute maximum.
+ Logging behavior has changed significantly in AWS Glue 4.0; Spark 3.3.0 requires Log4j2 at minimum.
+ Several dependency updates, highlighted in the appendix.
+ Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
+ Python 3.10 is also the default version used for Python scripts, as AWS Glue 0.9 was only using Python 2.

  Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
+ A new mechanism of installing additional Python modules through pip is available since AWS Glue 2.0. For more information, see [Installing additional Python modules with pip in AWS Glue 2.0](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#addl-python-modules-support).
+ AWS Glue 4.0 does not run on Apache YARN, so YARN settings do not apply.
+ AWS Glue 4.0 does not have a Hadoop Distributed File System (HDFS).
+ Any extra JAR files supplied in existing AWS Glue 1.0 jobs might bring in conflicting dependencies because there were upgrades in several dependencies in 4.0 from 1.0. We enable AWS Glue 4.0 with the `--user-jars-first` AWS Glue job parameter by default, to avoid this problem.
+ AWS Glue 4.0 supports auto scaling. Therefore, the ExecutorAllocationManager metric will be available when auto scaling is enabled.
+ In AWS Glue version 4.0 jobs, you specify the number of workers and worker type, but do not specify a `maxCapacity`.
+ AWS Glue 4.0 does not yet support machine learning transforms.
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs using AWS Glue security configurations and jobs dependent on the AWS Encryption SDK dependency provided in runtime are affected. See these instructions for AWS Glue job migration.
  + You cannot migrate an AWS Glue 0.9/1.0 job to an AWS Glue 4.0 job directly. This is because when upgrading directly to version 2.x or later and enabling all new features immediately, the AWS Encryption SDK won't be able to decrypt the ciphertext encrypted under earlier versions of the AWS Encryption SDK. 
  + To safely upgrade, we first recommend that you migrate to an AWS Glue 2.0/3.0 job that contains the AWS Encryption SDK bridge version. Run the job once to utilize the AWS Encryption SDK bridge version.
  + Upon completion, you can safely migrate the AWS Glue 2.0/3.0 job to AWS Glue 4.0.

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 2.4 to 3.0](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)
+ [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31)
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)
+ [Changes in Datetime behavior to be expected since Spark 3.0.](https://issues.apache.org/jira/browse/SPARK-31408)

## Migrating from AWS Glue 0.9 to AWS Glue 4.0
<a name="migrating-version-40-from-09"></a>

Note the following changes when migrating:
+ AWS Glue 0.9 uses open-source Spark 2.2.1 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  + Several Spark changes alone might require revision of your scripts to ensure that removed features are not being referenced.
  + For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.2 does allow them.
+ All jobs in AWS Glue 4.0 will be run with significantly improved startup times. Spark jobs will be billed in 1-second increments with a 10x lower minimum billing duration because startup latency will go from 10 minutes maximum to 1 minute maximum.
+ Logging behavior has changed significantly in AWS Glue 4.0; Spark 3.3.0 requires Log4j2 at minimum, as mentioned in [Upgrading from Core 3.2 to 3.3](https://spark.apache.org/docs/latest/core-migration-guide.html#upgrading-from-core-32-to-33).
+ Several dependency updates, highlighted in the appendix.
+ Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
+ Python 3.10 is also the default version used for Python scripts, as AWS Glue 0.9 was only using Python 2.
  + Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  + A new mechanism of installing additional Python modules through pip is available.
+ AWS Glue 4.0 does not run on Apache YARN, so YARN settings do not apply.
+ AWS Glue 4.0 does not have a Hadoop Distributed File System (HDFS).
+ Any extra JAR files supplied in existing AWS Glue 0.9 jobs might bring in conflicting dependencies because there were upgrades in several dependencies in 3.0 from 0.9. You can avoid classpath conflicts in AWS Glue 3.0 with the `--user-jars-first` AWS Glue job parameter.
+ AWS Glue 4.0 supports auto scaling. Therefore, the ExecutorAllocationManager metric will be available when auto scaling is enabled.
+ In AWS Glue version 4.0 jobs, you specify the number of workers and worker type, but do not specify a `maxCapacity`.
+ AWS Glue 4.0 does not yet support machine learning transforms.
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs that use AWS Glue security configurations, and jobs that depend on the AWS Encryption SDK dependency provided at runtime, are affected. Use the following instructions for AWS Glue job migration.
  + You cannot migrate an AWS Glue 0.9/1.0 job to an AWS Glue 4.0 job directly. This is because when upgrading directly to version 2.x or later and enabling all new features immediately, the AWS Encryption SDK won't be able to decrypt the ciphertext encrypted under earlier versions of the AWS Encryption SDK. 
  + To safely upgrade, we first recommend that you migrate to an AWS Glue 2.0/3.0 job that contains the AWS Encryption SDK bridge version. Run the job once to utilize the AWS Encryption SDK bridge version.
  + Upon completion, you can safely migrate the AWS Glue 2.0/3.0 job to AWS Glue 4.0.
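
Two of the job parameters mentioned above can be set together when migrating a job. The parameter names `--additional-python-modules` and `--user-jars-first` are documented AWS Glue job parameters; the specific module pins below are placeholders for illustration, not recommendations:

```python
# Illustrative DefaultArguments for a migrated AWS Glue 4.0 job.
glue4_default_arguments = {
    # Install additional Python modules through pip at job start.
    "--additional-python-modules": "backoff==2.2.1,annoy",
    # Load user-supplied JARs ahead of bundled ones to avoid classpath
    # conflicts introduced by the dependency upgrades since 0.9.
    "--user-jars-first": "true",
}
```

You would pass a map like this as the default arguments of the job when creating or updating it.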

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 2.2 to 2.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23)
+ [Upgrading from Spark SQL 2.3 to 2.4](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24)
+ [Upgrading from Spark SQL 2.4 to 3.0](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)
+ [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31)
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)
+ [Changes in Datetime behavior to be expected since Spark 3.0](https://issues.apache.org/jira/browse/SPARK-31408).

## Connector and JDBC driver migration for AWS Glue 4.0
<a name="migrating-version-40-connector-driver-migration"></a>

For the versions of JDBC and data lake connectors that were upgraded, see:
+ [Appendix B: JDBC driver upgrades](#migrating-version-40-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-40-appendix-connector)

### Hudi
<a name="migrating-version-40-connector-driver-migration-hudi"></a>
+ Spark SQL support improvements:
  + The `Call Procedure` command adds support for upgrade, downgrade, bootstrap, clean, and repair operations. `Create/Drop/Show/Refresh Index` syntax is also available in Spark SQL.
  + The performance gap between using a Spark DataSource and Spark SQL has been closed. DataSource writes were previously faster than SQL writes.
  + All built-in key generators implement more performant Spark-specific API operations.
  + Replaced UDF transformation in the bulk `insert` operation with RDD transformations to cut down on costs of using SerDe.
  + Spark SQL with Hudi requires a `primaryKey` to be specified by `tblproperties` or options in the SQL statement. For update and delete operations, the `preCombineField` is required as well.
+ Any Hudi table created before version 0.10.0 without a `primaryKey` needs to be recreated with a `primaryKey` field since version 0.10.0.
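
As a sketch of the `primaryKey` and `preCombineField` requirement above, the following shows Hudi table DDL for Spark SQL; the table and column names are illustrative:

```python
# Hudi CREATE TABLE for Spark SQL on Glue 4.0: primaryKey is mandatory,
# and preCombineField is required for UPDATE and DELETE support.
create_hudi_table_sql = """
CREATE TABLE IF NOT EXISTS product_hudi (
    product_id STRING,
    product_name STRING,
    updated_at TIMESTAMP
) USING hudi
TBLPROPERTIES (
    primaryKey = 'product_id',
    preCombineField = 'updated_at'
)
"""
```

In a job you would run this with `spark.sql(create_hudi_table_sql)`.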

### PostgreSQL
<a name="migrating-version-40-connector-driver-migration-postgresql"></a>
+ Several vulnerabilities (CVEs) were addressed.
+ Java 8 is natively supported.
+ If the job uses arrays of arrays (with the exception of byte arrays), they can be treated as multidimensional arrays.

### MongoDB
<a name="migrating-version-40-connector-driver-migration-mongodb"></a>
+ The current MongoDB connector supports Spark version 3.1 or later and MongoDB version 4.0 or later.
+ Due to the connector upgrade, a few property names changed. For example, the URI property name changed to `connection.uri`. For more information on the current options, see the [MongoDB Spark Connector blog](https://www.mongodb.com/docs/spark-connector/current/configuration/).
+ Using MongoDB 4.0 hosted by Amazon DocumentDB has some functional differences. For more information, see these topics:
  + [Functional Differences: Amazon DocumentDB and MongoDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/functional-differences.html)
  + [Supported MongoDB APIs, Operations, and Data Types](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html)
+ The "partitioner" option is restricted to `ShardedPartitioner`, `PaginateIntoPartitionsPartitioner`, and `SinglePartitionPartitioner`. It cannot use the default `SamplePartitioner` and `PaginateBySizePartitioner` for Amazon DocumentDB because the stage operator does not support the MongoDB API. For more information, see [Supported MongoDB APIs, Operations, and Data Types](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html).
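
As a sketch of the option renames, a small helper that rewrites legacy option maps before passing them to the 10.x connector. Only the `uri` to `connection.uri` rename is stated above, so treat the rename table as a starting point and consult the connector documentation for the full list:

```python
# Rewrite MongoDB connector options from 3.x names to their 10.x equivalents.
LEGACY_OPTION_RENAMES = {
    "uri": "connection.uri",  # rename stated in the migration notes
}

def rename_mongodb_options(options):
    """Return a copy of `options` with legacy keys renamed."""
    return {LEGACY_OPTION_RENAMES.get(key, key): value
            for key, value in options.items()}
```

For example, `rename_mongodb_options({"uri": "mongodb://host:27017", "database": "sales"})` returns `{"connection.uri": "mongodb://host:27017", "database": "sales"}`.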

### Delta Lake
<a name="migrating-version-40-connector-driver-migration-delta"></a>
+ Delta Lake now supports [time travel in SQL](https://docs.delta.io/2.1.0/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel) to query older data easily. With this update, time travel is now available both in Spark SQL and through the DataFrame API, including `TIMESTAMP AS OF` and `VERSION AS OF` clauses in SQL.
+ Spark 3.3 introduces [Trigger.AvailableNow](https://issues.apache.org/jira/browse/SPARK-36533) for running streaming queries as an equivalent to `Trigger.Once` for batch queries. This support is also available when using Delta tables as a streaming source.
+ Support for SHOW COLUMNS to return the list of columns in a table.
+ Support for [DESCRIBE DETAIL](https://docs.delta.io/2.1.0/delta-utility.html#retrieve-delta-table-details) in the Scala and Python DeltaTable API. It retrieves detailed information about a Delta table using either the DeltaTable API or Spark SQL.
+ Support for returning operation metrics from SQL [Delete](https://github.com/delta-io/delta/pull/1328), [Merge](https://github.com/delta-io/delta/pull/1327), and [Update](https://github.com/delta-io/delta/pull/1331) commands. Previously these SQL commands returned an empty DataFrame, now they return a DataFrame with useful metrics about the operation performed.
+ Optimize performance improvements:
  + Set the configuration option `spark.databricks.delta.optimize.repartition.enabled=true` to use `repartition(1)` instead of `coalesce(1)` in the Optimize command for better performance when compacting many small files.
  + [Improved performance](https://github.com/delta-io/delta/pull/1315) by using a queue-based approach to parallelize compaction jobs.
+ Other notable changes:
  + [Support for using variables](https://github.com/delta-io/delta/issues/1267) in the VACUUM and OPTIMIZE SQL commands.
  + Improvements for CONVERT TO DELTA with catalog tables including:
    + [Autofill the partition schema](https://github.com/delta-io/delta/commit/18d4d12ed06f973006501f6c39c8785db51e2b1f) from the catalog when it's not provided.
    + [Use partition information](https://github.com/delta-io/delta/commit/ebff29904f3ababb889897343f8f8f7a010a1f71) from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
  + [Support for Change Data Feed (CDF) batch reads](https://github.com/delta-io/delta/issues/1349) on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. For more information, see the [Delta Lake documentation](https://docs.delta.io/2.1.0/delta-change-data-feed.html#known-limitations).
  + [Improve Update command performance](https://github.com/delta-io/delta/pull/1202) by enabling schema pruning in the first pass.
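
The SQL time-travel support described above can be sketched as a small query builder; the table name, version, and timestamp are illustrative:

```python
# Build a Delta Lake 2.1 time-travel query in Spark SQL using either
# a snapshot version or a timestamp.
def time_travel_query(table, version=None, timestamp=None):
    """Return a SELECT that reads an older snapshot of a Delta table."""
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {version}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
```

In a job you would run the result with `spark.sql(...)`, for example `spark.sql(time_travel_query("delta_products", version=5))`.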

### Apache Iceberg
<a name="migrating-version-40-connector-driver-migration-iceberg"></a>
+ Added several [performance improvements](https://iceberg.apache.org/releases/#performance-improvements) for scan planning and Spark queries.
+ Added a common REST catalog client that uses change-based commits to resolve commit conflicts on the service side.
+ `AS OF` syntax for SQL time travel queries is supported.
+ Added merge-on-read support for MERGE and UPDATE queries.
+ Added support to rewrite partitions using Z-order.
+ Added a spec and implementation for Puffin, a format for large stats and index blobs, like [Theta sketches](https://datasketches.apache.org/docs/Theta/InverseEstimate.html) or bloom filters.
+ Added new interfaces for consuming data incrementally (both append and changelog scans).
+ Added support for bulk operations and ranged reads to FileIO interfaces.
+ Added more metadata tables to show delete files in the metadata tree.
+ The drop table behavior changed. In Iceberg 0.13.1, running `DROP TABLE` removes the table from the catalog and deletes the table contents as well. In Iceberg 1.0.0, `DROP TABLE` only removes the table from the catalog. To delete the table contents use `DROP TABLE PURGE`.
+ Parquet vectorized reads are enabled by default in Iceberg 1.0.0. If you want to disable vectorized reads, set `read.parquet.vectorization.enabled` to `false`.
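
The drop-table and vectorization changes above can be summarized in SQL; the catalog, database, and table names are illustrative:

```python
# Iceberg 1.0.0 semantics: DROP TABLE removes only the catalog entry,
# while DROP TABLE ... PURGE also deletes the table contents.
drop_catalog_entry_only = "DROP TABLE glue_catalog.db.events"
drop_with_contents = "DROP TABLE glue_catalog.db.events PURGE"

# Vectorized Parquet reads are on by default; disable them per table if needed.
disable_vectorized_reads = (
    "ALTER TABLE glue_catalog.db.events "
    "SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='false')"
)
```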

### Oracle
<a name="migrating-version-40-connector-driver-migration-oracle"></a>

Changes are minor.

### MySQL
<a name="migrating-version-40-connector-driver-migration-mysql"></a>

Changes are minor.

### Amazon Redshift
<a name="migrating-version-40-connector-driver-migration-redshift"></a>

AWS Glue 4.0 features a new Amazon Redshift connector with a new JDBC driver. For information about the enhancements and how to migrate from previous AWS Glue versions, see [Redshift connections](aws-glue-programming-etl-connect-redshift-home.md).

## Appendix A: Notable dependency upgrades
<a name="migrating-version-40-appendix-dependencies"></a>

The following are dependency upgrades:


| Dependency | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 | 
| --- | --- | --- | --- | --- | 
| Spark | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 | 
| Hadoop | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 | 
| Scala | 2.12 | 2.12 | 2.11 | 2.11 | 
| Jackson | 2.13.3 | 2.10.x | 2.7.x | 2.7.x | 
| Hive | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 | 
| EMRFS | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 | 
| Json4s | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x | 
| Arrow | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 | 
| AWS Glue Data Catalog client | 3.7.0 | 3.0.0 | 1.10.0 | N/A | 
| Python | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 | 
| Boto | 1.26 | 1.18 | 1.12 | N/A | 

## Appendix B: JDBC driver upgrades
<a name="migrating-version-40-appendix-jdbc-driver"></a>

The following are JDBC driver upgrades:


| Driver | JDBC driver version in past AWS Glue versions | JDBC driver version in AWS Glue 3.0 | JDBC driver version in AWS Glue 4.0 | 
| --- | --- | --- | --- | 
| MySQL | 5.1 | 8.0.23 | 8.0.23 | 
| Microsoft SQL Server | 6.1.0 | 7.0.0 | 9.4.0 | 
| Oracle Databases | 11.2 | 21.1 | 21.7 | 
| PostgreSQL | 42.1.0 | 42.2.18 | 42.3.6 | 
| MongoDB | 2.0.0 | 4.0.0 | 4.7.2 | 
| Amazon Redshift |  redshift-jdbc41-1.2.12.1017   |  redshift-jdbc41-1.2.12.1017   |  redshift-jdbc42-2.1.0.16  | 

## Appendix C: Connector upgrades
<a name="migrating-version-40-appendix-connector"></a>

The following are connector upgrades:


| Driver | Connector version in AWS Glue 3.0 | Connector version in AWS Glue 4.0 | 
| --- | --- | --- | 
| MongoDB | 3.0.0 | 10.0.4 | 
| Hudi | 0.10.1 | 0.12.1 | 
| Delta Lake | 1.0.0 | 2.1.0 | 
| Iceberg | 0.13.1 | 1.0.0 | 
| DynamoDB | 1.11 | 1.12 | 

# Generative AI upgrades for Apache Spark in AWS Glue
<a name="upgrade-analysis"></a>

 Spark Upgrades in AWS Glue enables data engineers and developers to upgrade and migrate their existing AWS Glue Spark jobs to the latest Spark releases using generative AI. Data engineers can use it to scan their AWS Glue Spark jobs, generate upgrade plans, execute plans, and validate outputs. It reduces the time and cost of Spark upgrades by automating the undifferentiated work of identifying and updating Spark scripts, configurations, dependencies, methods, and features. 

![\[The GIF shows an end to end implementation of a sample upgrade analysis workflow.\]](http://docs.aws.amazon.com/glue/latest/dg/images/demo_lumos.gif)


## How it works
<a name="upgrade-analysis-how-it-works"></a>

 When you use upgrade analysis, AWS Glue identifies differences between versions and configurations in your job's code to generate an upgrade plan. The upgrade plan details all code changes, and required migration steps. Next, AWS Glue builds and runs the upgraded application in an environment to validate changes and generates a list of code changes for you to migrate your job. You can view the updated script along with the summary that details the proposed changes. After running your own tests, accept the changes and the AWS Glue job will be updated automatically to the latest version with the new script. 

 The upgrade analysis process can take some time to complete, depending on the complexity of the job and the workload. The results of the upgrade analysis will be stored in the specified Amazon S3 path, which can be reviewed to understand the upgrade and any potential compatibility issues. After reviewing the upgrade analysis results, you can decide whether to proceed with the actual upgrade or make any necessary changes to the job before upgrading. 

## Prerequisites
<a name="upgrade-analysis-prerequisites"></a>

 The following prerequisites are required to use generative AI to upgrade jobs in AWS Glue: 
+  AWS Glue 2.0 PySpark jobs – only AWS Glue 2.0 jobs can be upgraded to AWS Glue 5.0. 
+  IAM permissions are required to start the analysis, review the results and upgrade your job. For more information, see the examples in the [Permissions](#auto-upgrade-permissions) section below. 
+  If you use AWS KMS to encrypt analysis artifacts, additional AWS KMS permissions are needed. For more information, see the examples in the [AWS KMS policy](#auto-upgrade-kms-policy) section below. 

### Permissions
<a name="auto-upgrade-permissions"></a>

#### To start a new upgrade analysis, you need the following permissions:
<a name="collapsible-section-1"></a>

1.  Update the IAM policy of the caller with the following permission: 

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "glue:StartJobUpgradeAnalysis",
                   "glue:StartJobRun",
                   "glue:GetJobRun",
                   "glue:GetJob",
                   "glue:BatchStopJobRun"
               ],
               "Resource": [
                   "arn:aws:glue:us-east-1:111122223333:job/jobName"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::amzn-s3-demo-bucket/script-location/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject",
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::amzn-s3-demo-bucket/results/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "kms:Decrypt",
                   "kms:GenerateDataKey"
               ],
               "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id"
           }
       ]
   }
   ```

------

1.  Update the Execution role of the job you are upgrading to include the following in-line policy: 

   ```
   {
     "Effect": "Allow",
     "Action": ["s3:GetObject"],
     "Resource": [
       "ARN of the Amazon S3 path provided on API",
       "ARN of the Amazon S3 path provided on API/*"
     ]
   }
   ```

    For example, if you are using the Amazon S3 path `s3://amzn-s3-demo-bucket/upgraded-result`, then the policy will be: 

   ```
   {
     "Effect": "Allow",
     "Action": ["s3:GetObject"],
     "Resource": [
       "arn:aws:s3:::amzn-s3-demo-bucket/upgraded-result/",
       "arn:aws:s3:::amzn-s3-demo-bucket/upgraded-result/*"
     ]
   }
   ```
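
A hypothetical helper showing how the two resource ARNs in the example above are derived from the Amazon S3 path you pass to the API; the bucket and prefix are examples, and whether the non-wildcard ARN carries a trailing slash depends on the path you supply:

```python
# Map an Amazon S3 path such as s3://bucket/prefix to the pair of ARNs
# the execution-role policy needs: the prefix itself and everything under it.
def s3_path_to_policy_arns(s3_path):
    suffix = s3_path[len("s3://"):].rstrip("/")
    base = f"arn:aws:s3:::{suffix}"
    return [base, f"{base}/*"]
```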

#### To retrieve the details of an analysis, you need the following permissions:
<a name="collapsible-section-2"></a>

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetJobUpgradeAnalysis"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:111122223333:job/jobName"
      ]
    }
  ]
}
```

------

#### To stop an in-progress analysis, you need the following permissions:
<a name="collapsible-section-3"></a>

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StopJobUpgradeAnalysis",
        "glue:BatchStopJobRun"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:111122223333:job/jobName"
      ]
    }
  ]
}
```

------

#### To list all the analyses submitted on a specific job, you need the following permissions:
<a name="collapsible-section-4"></a>

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:ListJobUpgradeAnalyses"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:111122223333:job/jobName"
      ]
    }
  ]
}
```

------

#### To accept the changes from an analysis and upgrade a job, you need the following permissions:
<a name="collapsible-section-5"></a>

### AWS KMS policy
<a name="auto-upgrade-kms-policy"></a>

 To pass your own custom AWS KMS key when starting an analysis, configure the appropriate permissions on your AWS KMS key as shown in the following section. 

#### Configuring the result artifact encryption using an AWS KMS key:
<a name="w2aac37b7c20c13c13b5b5"></a>

 This policy ensures that you have both the encryption and decryption permissions on the AWS KMS key. 

```
{
    "Effect": "Allow",
    "Principal":{
        "AWS": "<IAM Customer caller ARN>"
    },
    "Action": [
      "kms:Decrypt",
      "kms:GenerateDataKey"
    ],
    "Resource": "<key-arn-passed-on-start-api>"
}
```

## Running an upgrade analysis and applying the upgrade script
<a name="auto-upgrade-procedure"></a>

 You can run an upgrade analysis, which will generate an upgrade plan on a job you select from the **Jobs** view. 

1.  From **Jobs**, select an AWS Glue 2.0 job, then choose **Run upgrade analysis** from the **Actions** menu.   
![\[The screenshot shows the Upgrade analysis with AI from the action menu.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-run-action-menu.png)

1.  In the dialog box, choose an Amazon S3 path for **Result path**, where your generated upgrade plan will be stored. This must be an Amazon S3 bucket that you can access and write to.   
![\[The screenshot shows the upgrade analysis configuration options.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-configuration-options.png)

1.  Configure additional options, if needed: 
   +  **Run configuration** – optional: The run configuration customizes the validation runs performed during the upgrade analysis. It is used to execute the upgraded script and lets you select compute environment properties (worker type, number of workers, and so on). Note that you should use non-production developer accounts to run the validations on sample datasets before reviewing and accepting the changes and applying them to production environments. The run configuration includes the following customizable parameters: 
     + Worker type: You can specify the type of worker to be used for the validation runs, allowing you to choose appropriate compute resources based on your requirements.
     + Number of workers: You can define the number of workers to be provisioned for the validation runs, enabling you to scale resources according to your workload needs.
     + Job timeout (in minutes): This parameter allows you to set a time limit for the validation runs, ensuring that the jobs terminate after a specified duration to prevent excessive resource consumption.
     + Security configuration: You can configure security settings, such as encryption and access control, to ensure the protection of your data and resources during the validation runs.
     + Additional job parameters: If needed, you can add new job parameters to further customize the execution environment for the validation runs.

      By leveraging the run configuration, you can tailor the validation runs to suit your specific requirements. For example, you can configure the validation runs to use a smaller dataset, which allows the analysis to complete more quickly and optimizes costs. This approach ensures that the upgrade analysis is performed efficiently while minimizing resource utilization and associated costs during the validation phase. 
   +  **Encryption configuration** – optional: 
     + **Enable upgrade artifacts encryption**: Enable at-rest encryption when writing data to the result path. If you don't want to encrypt your upgrade artifacts, leave this option unchecked.

1.  Choose **Run** to start the upgrade analysis. While the analysis is running, you can view the results on the **Upgrade analysis** tab. The analysis details window will show you information about the analysis as well as links to the upgrade artifacts. 
   +  **Result path** – this is where the results summary and upgrade script are stored. 
   +  **Upgraded script in Amazon S3** – the location of the upgrade script in Amazon S3. You can view the script prior to applying the upgrade. 
   +  **Upgrade summary in Amazon S3** – the location of the upgrade summary in Amazon S3. You can view the upgrade summary prior to applying the upgrade. 

1.  When the upgrade analysis is completed successfully, you can apply the upgrade script to automatically upgrade your job by choosing **Apply upgraded script**. 

    Once applied, the AWS Glue version will be updated to 4.0. You can view the new script in the **Script** tab.   
![\[The screenshot shows the completed upgrade analysis. The button for Apply upgraded script is visible.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-analysis-details-preview.png)

## Understanding your upgrade summary
<a name="auto-upgrade-analysis-summary"></a>

 This example demonstrates the process of upgrading an AWS Glue job from version 2.0 to version 4.0. The sample job reads product data from an Amazon S3 bucket, applies several transformations to the data using Spark SQL, and then saves the transformed results back to an Amazon S3 bucket. 

### Original code (AWS Glue 2.0) - before upgrade
<a name="w2aac37b7c20c21b5b1"></a>

```
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.job import Job
import json
from pyspark.sql.types import StructType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

gdc_database = "s3://aws-glue-scripts-us-east-1-gamma/demo-database/"
schema_location = (
    "s3://aws-glue-scripts-us-east-1-gamma/DataFiles/"
)

products_schema_string = spark.read.text(
    f"{schema_location}schemas/products_schema"
).first()[0]

product_schema = StructType.fromJson(json.loads(products_schema_string))

products_source_df = (
    spark.read.option("header", "true")
    .schema(product_schema)
    .option(
        "path",
        f"{gdc_database}products/",
    )
    .csv(f"{gdc_database}products/")
)

products_source_df.show()
products_temp_view_name = "spark_upgrade_demo_product_view"
products_source_df.createOrReplaceTempView(products_temp_view_name)

query = f"select {products_temp_view_name}.*, format_string('%0$s-%0$s', category, subcategory) as unique_category from {products_temp_view_name}"
products_with_combination_df = spark.sql(query)
products_with_combination_df.show()

products_with_combination_df.createOrReplaceTempView(products_temp_view_name)
product_df_attribution = spark.sql(
    f"""
SELECT *,
unbase64(split(product_name, ' ')[0]) as product_name_decoded,
unbase64(split(unique_category, '-')[1]) as subcategory_decoded
FROM {products_temp_view_name}
"""
)
product_df_attribution.show()


product_df_attribution.write.mode("overwrite").option("header", "true").option(
    "path", f"{gdc_database}spark_upgrade_demo_product_agg/"
).saveAsTable("spark_upgrade_demo_product_agg", external=True)

spark_upgrade_demo_product_agg_table_df = spark.sql(
    f"SHOW TABLE EXTENDED in default like 'spark_upgrade_demo_product_agg'"
)
spark_upgrade_demo_product_agg_table_df.show()
job.commit()
```

### New code (Glue 4.0) - after upgrade
<a name="upgrade-analysis-example-new-code-glue-4"></a>

```
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.job import Job
import json
from pyspark.sql.types import StructType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# change 1
spark.conf.set("spark.sql.adaptive.enabled", "false")
# change 2
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
job = Job(glueContext)

gdc_database = "s3://aws-glue-scripts-us-east-1-gamma/demo-database/"
schema_location = (
    "s3://aws-glue-scripts-us-east-1-gamma/DataFiles/"
)

products_schema_string = spark.read.text(
    f"{schema_location}schemas/products_schema"
).first()[0]

product_schema = StructType.fromJson(json.loads(products_schema_string))

products_source_df = (
    spark.read.option("header", "true")
    .schema(product_schema)
    .option(
        "path",
        f"{gdc_database}products/",
    )
    .csv(f"{gdc_database}products/")
)

products_source_df.show()
products_temp_view_name = "spark_upgrade_demo_product_view"
products_source_df.createOrReplaceTempView(products_temp_view_name)

# change 3
query = f"select {products_temp_view_name}.*, format_string('%1$s-%1$s', category, subcategory) as unique_category from {products_temp_view_name}"
products_with_combination_df = spark.sql(query)
products_with_combination_df.show()

products_with_combination_df.createOrReplaceTempView(products_temp_view_name)
# change 4
product_df_attribution = spark.sql(
    f"""
SELECT *,
try_to_binary(split(product_name, ' ')[0], 'base64') as product_name_decoded,
try_to_binary(split(unique_category, '-')[1], 'base64') as subcategory_decoded
FROM {products_temp_view_name}
"""
)
product_df_attribution.show()


product_df_attribution.write.mode("overwrite").option("header", "true").option(
    "path", f"{gdc_database}spark_upgrade_demo_product_agg/"
).saveAsTable("spark_upgrade_demo_product_agg", external=True)

spark_upgrade_demo_product_agg_table_df = spark.sql(
    f"SHOW TABLE EXTENDED in default like 'spark_upgrade_demo_product_agg'"
)
spark_upgrade_demo_product_agg_table_df.show()
job.commit()
```

### Explanation of analysis summary
<a name="upgrade-analysis-explanation-summary"></a>

![\[The screenshot shows the Upgrade analysis summary.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-explanation-summary.png)


 Based on the summary, there are four changes proposed by AWS Glue in order to successfully upgrade the script from AWS Glue 2.0 to AWS Glue 4.0: 

1.  **Spark SQL configuration (spark.sql.adaptive.enabled)**: This change restores the original application behavior, because Spark SQL adaptive query execution is a new feature introduced starting with Spark 3.2. You can inspect this configuration change and enable or disable it according to your preference. 

1.  **DataFrame API change**: The path option cannot coexist with other DataFrameReader operations like `load()`. To retain the previous behavior, AWS Glue updated the script to add a new SQL configuration **(spark.sql.legacy.pathOptionBehavior.enabled)**. 

1.  **Spark SQL API change**: The behavior of `strfmt` in `format_string(strfmt, obj, ...)` has been updated to disallow `0$` as the first argument. To ensure compatibility, AWS Glue has modified the script to use `1$` as the first argument instead. 

1.  **Spark SQL API change**: The `unbase64` function does not allow malformed string inputs. To retain the previous behavior, AWS Glue updated the script to use the `try_to_binary` function. 

## Stopping an upgrade analysis in progress
<a name="auto-upgrade-stopping-analysis"></a>

 You can stop an upgrade analysis while it is in progress. 

1.  Choose the **Upgrade Analysis** tab. 

1.  Select the job that is running, then choose **Stop**. This will stop the analysis. You can then run another upgrade analysis on the same job.   
![\[The screenshot shows the Upgrade analysis tab with a job selected. The job is still running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-tab.png)

## Considerations
<a name="upgrade-analysis-considerations"></a>

 As you begin using Spark Upgrades, there are several important aspects to consider for optimal usage of the service. 
+  **Service Scope and Limitations**: The current release focuses on PySpark code upgrades from AWS Glue version 2.0 to version 5.0. At this time, the service handles PySpark code that doesn't rely on additional library dependencies. You can run automated upgrades for up to 10 jobs concurrently in an AWS account, allowing you to efficiently upgrade multiple jobs while maintaining system stability. 
  +  Only PySpark jobs are supported. 
  +  Upgrade analysis will time out after 24 hours. 
  +  Only one active upgrade analysis can be run at a time for one job. At the account level, up to 10 active upgrade analyses can be run at the same time. 
+  **Optimizing Costs During Upgrade Process**: Since Spark Upgrades use generative AI to validate the upgrade plan through multiple iterations, with each iteration running as a AWS Glue job in your account, it’s essential to optimize the validation job run configurations for cost efficiency. To achieve this, we recommend specifying a Run Configuration when starting an Upgrade Analysis as follows: 
  +  Use non-production developer accounts and select sample mock datasets that represent your production data but are smaller in size for validation with Spark Upgrades. 
  +  Use right-sized compute resources, such as G.1X workers, and select an appropriate number of workers for processing your sample data. 
  +  Enable AWS Glue job auto scaling when applicable to automatically adjust resources based on workload. 

   For example, if your production job processes terabytes of data with 20 G.2X workers, you might configure the upgrade job to process a few gigabytes of representative data with 2 G.2X workers and auto-scaling enabled for validation. 
+  **Best Practices**: We strongly recommend starting your upgrade journey with non-production jobs. This approach allows you to familiarize yourself with the upgrade workflow, and understand how the service handles different types of Spark code patterns. 
+  **Alarms and notifications**: When utilizing the Generative AI upgrades feature on a job, ensure that alarms/notifications for failed job runs are turned off. During the upgrade process, there may be up to 10 failed job runs in your account before the upgraded artifacts are provided. 
+  **Anomaly detection rules**: Turn off any anomaly detection rules on the job being upgraded as well, because the data written to output folders during intermediate job runs might not be in the expected format while the upgrade validation is in progress. 
+  **Use upgrade analysis with idempotent jobs**: Idempotent jobs are jobs that can be run multiple times with the same input data and produce the same output each time. When you use Generative AI upgrades for Apache Spark in AWS Glue, the service runs multiple iterations of your job as part of the validation process, making changes to your Spark code and configurations during each iteration to validate the upgrade plan. If your Spark job is not idempotent, running it multiple times with the same input data could lead to issues, so use idempotent jobs to ensure each subsequent validation job run behaves like the previous one. 
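
   As a concrete sketch of the cost-optimization guidance above, these are the kinds of parameters you might pass to the AWS Glue `StartJobRun` API (for example, via boto3's `start_job_run`) for a right-sized validation run. The job name, S3 path, and `--input_path` argument are hypothetical placeholders, not values the service requires:

   ```python
   # Hypothetical right-sized run configuration for an upgrade validation run.
   # These are keyword arguments you might pass to boto3's
   # glue_client.start_job_run(**validation_run_config); the job name,
   # bucket, and --input_path argument are illustrative placeholders.
   validation_run_config = {
       "JobName": "my-upgrade-validation-job",   # hypothetical job name
       "WorkerType": "G.1X",                     # smaller workers than production
       "NumberOfWorkers": 2,                     # sized for a few GB of sample data
       "Arguments": {
           # Point the job at a small, representative sample dataset.
           "--input_path": "s3://my-dev-bucket/sample-data/",
           # Let AWS Glue scale workers down when the sample workload is light.
           "--enable-auto-scaling": "true",
       },
   }

   # A production run, by contrast, might use 20 G.2X workers over terabytes of data.
   print(validation_run_config["WorkerType"], validation_run_config["NumberOfWorkers"])
   ```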
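
   The idempotency requirement can be illustrated with a minimal, Spark-free sketch: an overwrite-style write converges to the same output no matter how many validation runs execute, while an append-style write accumulates duplicates. In PySpark, this corresponds to the difference between `df.write.mode("overwrite")` and `df.write.mode("append")`. The function names and file layout below are illustrative only:

   ```python
   import os
   import tempfile

   def run_job_overwrite(path, records):
       # Idempotent: every run replaces the output file, so reruns converge
       # to the same state.
       with open(path, "w") as f:
           f.writelines(r + "\n" for r in records)

   def run_job_append(path, records):
       # Not idempotent: every rerun adds duplicate rows to the output.
       with open(path, "a") as f:
           f.writelines(r + "\n" for r in records)

   workdir = tempfile.mkdtemp()
   overwrite_out = os.path.join(workdir, "overwrite.txt")
   append_out = os.path.join(workdir, "append.txt")
   records = ["row-1", "row-2"]

   # Simulate the repeated validation runs the upgrade service performs.
   for _ in range(3):
       run_job_overwrite(overwrite_out, records)
       run_job_append(append_out, records)

   with open(overwrite_out) as f:
       print(len(f.readlines()))  # 2 -- same output after every run
   with open(append_out) as f:
       print(len(f.readlines()))  # 6 -- duplicates accumulate across reruns
   ```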

## Supported regions
<a name="upgrade-analysis-supported-regions"></a>

Generative AI upgrades for Apache Spark is available in the following regions:
+ **Asia Pacific**: Tokyo (ap-northeast-1), Seoul (ap-northeast-2), Mumbai (ap-south-1), Singapore (ap-southeast-1), and Sydney (ap-southeast-2)
+ **North America**: Canada (ca-central-1)
+ **Europe**: Frankfurt (eu-central-1), Stockholm (eu-north-1), Ireland (eu-west-1), London (eu-west-2), and Paris (eu-west-3)
+ **South America**: São Paulo (sa-east-1)
+ **United States**: North Virginia (us-east-1), Ohio (us-east-2), and Oregon (us-west-2)

## Cross-region inference in Spark Upgrades
<a name="w2aac37b7c20c37"></a>

Spark Upgrades is powered by Amazon Bedrock and uses cross-region inference (CRIS). With CRIS, Spark Upgrades automatically selects the optimal Region within your geography (as described in more detail in [Cross-region inference](https://docs.aws.amazon.com/bedrock/latest/userguide/cross-region-inference.html)) to process your inference request, maximizing available compute resources and model availability, and providing the best customer experience. There's no additional cost for using cross-region inference. 

Cross-region inference requests are kept within the AWS Regions that are part of the geography where the data originally resides. For example, a request made within the US is kept within the AWS Regions in the US. Although the data remains stored only in the primary Region, when using cross-region inference, your input prompts and output results may move outside of your primary Region. All data is transmitted encrypted across Amazon's secure network. 