Change | Description | Date |
---|---|---|
Support for 14 more SaaS native connectors for AWS Glue | Fourteen more SaaS native connectors were added to AWS Glue. For more information, see Adding an AWS Glue connection. | January 30, 2025 |
Support for 16 more SaaS native connectors for AWS Glue | Sixteen more SaaS native connectors were added to AWS Glue. For more information, see Adding an AWS Glue connection. | December 17, 2024 |
Automatic column statistics generation | AWS Glue Data Quality now supports Amazon SageMaker AI LakeHouse tables and AWS Lake Formation managed Iceberg, Delta, and Hudi tables in Data Catalog and ETL. For more information, see AWS Glue Data Quality. | December 6, 2024 |
Support for zero-ETL integrations | Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. For more information, see Zero-ETL integrations. | December 3, 2024 |
Support for reusable connections | A new AWS Glue connections schema provides a unified way to manage data connections across AWS services and applications, such as AWS Glue, Amazon Athena, and Amazon SageMaker Unified Studio. For more information, see Connecting to data. | December 3, 2024 |
Support for AWS Glue version 5.0. | Added information about support for AWS Glue version 5.0. Features include an Apache Spark update to 3.5.2, Java update to 17, open table format updates, Spark native fine-grained access control, SageMaker Lakehouse and data warehouse abstraction integration, support in SageMaker Unified Studio, and more. For more information, see AWS Glue Release Notes and Migrating AWS Glue jobs to AWS Glue version 5.0. | December 3, 2024 |
Connecting to AWS Glue Data Catalog using AWS Glue Iceberg REST endpoint | AWS Glue's Iceberg REST endpoint supports API operations specified in the Apache Iceberg REST specification. Using an Iceberg REST client, you can connect your application running on an analytics engine to the REST catalog hosted in the Data Catalog. For more information, see Accessing the Data Catalog. | December 3, 2024 |
Automatic column statistics generation | Automatically generate column statistics for new tables in the AWS Glue Data Catalog. For more information, see Automatic column statistics generation. | December 3, 2024 |
Support for Generative AI upgrades for Apache Spark in AWS Glue | Spark Upgrades in AWS Glue enables data engineers and developers to upgrade and migrate their existing AWS Glue Spark jobs to the latest Spark releases using generative AI. For more information, see Upgrade analysis with AI. | November 22, 2024 |
Support for Generative AI troubleshooting for Apache Spark in AWS Glue | Generative AI Troubleshooting for Apache Spark jobs in AWS Glue helps data engineers and scientists diagnose and fix issues in their Spark applications with ease. For more information, see Troubleshooting Spark jobs with AI. | November 22, 2024 |
Support for Iceberg optimizers to access Amazon S3 buckets in a VPC | AWS Glue Data Catalog supports Iceberg table optimizers to access Amazon S3 buckets from a specific Virtual Private Cloud (VPC) using AWS Glue network connection. For more information, see Optimizing Iceberg tables. | November 20, 2024 |
Support for nine more SaaS native connectors for AWS Glue | Nine more SaaS native connectors were added to AWS Glue. For more information, see Adding an AWS Glue connection. | November 19, 2024 |
Support for ten new SaaS native connectors for AWS Glue | Ten SaaS native connectors were added to AWS Glue. For more information, see Adding an AWS Glue connection. | November 15, 2024 |
Support for job run queuing for AWS Glue ETL jobs | You can turn on job run queuing to run jobs later, when they cannot run immediately due to service quotas. For more information, see Configuring job properties for Spark jobs in AWS Glue. | September 3, 2024 |
Updated policy changes | Documented changes to the AwsGlueSessionUserRestrictedNotebookPolicy and AwsGlueSessionUserRestrictedNotebookServiceRole policies, required to support tag-on-create for sessions with the owner tag key. For more information, see AWS Glue updates to AWS managed policies. | August 30, 2024 |
Anomaly Detection and Dynamic Rules are now generally available | AWS Glue Data Quality uses a machine learning algorithm to learn from past trends and then predict future values to detect anomalies. Dynamic Rules allows you to provide dynamic thresholds. For more information, see Anomaly detection in AWS Glue Data Quality and Dynamic rules. | August 7, 2024 |
Updated policy changes | Documented changes to the AwsGlueSessionUserRestrictedPolicy and AwsGlueSessionUserRestrictedServiceRole policies, required to support tag-on-create for sessions with the owner tag key. For more information, see AWS Glue updates to AWS managed policies. | August 5, 2024 |
Generating column statistics for Iceberg tables is now generally available | AWS Glue supports calculating and updating the number of distinct values (NDVs) for each column in Iceberg tables. For more information, see Optimizing query performance for Iceberg tables. | July 9, 2024 |
Support for AWS Glue usage profiles | Admins can create AWS Glue usage profiles for various classes of users within the account, such as developers, testers, and product teams. This flexibility allows administrators to apply different usage and cost controls for each class of users. For more information, see Setting up AWS Glue usage profiles. | June 18, 2024 |
Support for a Salesforce connector for AWS Glue for Spark | Added information about a new AWS Glue connector for Salesforce. This feature allows you to use AWS Glue for Spark to read from and write to Salesforce in AWS Glue 4.0 and later versions. For more information, see Connecting to Salesforce. | May 22, 2024 |
Amazon Q data integration in AWS Glue (GA) | Amazon Q data integration in AWS Glue is a new generative AI capability of AWS Glue that enables data engineers and ETL developers to build data integration jobs using natural language. Engineers and developers can ask Q to author jobs, troubleshoot issues and answer questions about AWS Glue and data integration. For more information, see Amazon Q data integration in AWS Glue. This feature includes an update to the | April 30, 2024 |
Amazon Q data integration in AWS Glue (preview) | Amazon Q data integration in AWS Glue is a new generative AI capability of AWS Glue that enables data engineers and ETL developers to build data integration jobs using natural language. Engineers and developers can ask Q to author jobs, troubleshoot issues and answer questions about AWS Glue and data integration. For more information, see Amazon Q data integration in AWS Glue. This feature includes an update to the | January 30, 2024 |
Update to the documentation for AWS Glue Streaming | Added a new chapter with new and reorganized content for AWS Glue Streaming. This content describes how streaming works with AWS Glue, the characteristics of real-time data processing, and how to monitor your streaming jobs. For more information, see AWS Glue Streaming. | December 27, 2023 |
Support for using fine-grained sensitive data detection | The Detect Sensitive Data transform provides the ability to detect, mask, or remove entities that you define or that are predefined by AWS Glue. Fine-grained actions further allow you to apply a specific action per entity. For more information, see Using fine-grained sensitive data detection. | November 26, 2023 |
Support for monitoring jobs with AWS Glue Observability metrics | Use AWS Glue Observability metrics to generate insights into what is happening inside your AWS Glue for Apache Spark jobs to improve triaging and analysis of issues. For more information, see Monitoring with AWS Glue Observability metrics. | November 26, 2023 |
Support for anomaly detection in AWS Glue Data Quality | AWS Glue Data Quality anomaly detection applies machine learning (ML) algorithms on data statistics over time to detect abnormal patterns and hidden data quality issues that are hard to detect through rules. For more information, see Anomaly detection in AWS Glue Data Quality. | November 26, 2023 |
Update to default Spark UI logging behavior | Spark jobs generating Spark UI logs will now write with a different filename pattern to support Spark UI in the AWS Glue console. This does not change CloudWatch log behavior. You can revert to the legacy behavior by updating your job configuration. For more information, see Monitoring jobs using the Apache Spark web UI. | November 17, 2023 |
Support for new data sources in AWS Glue for Spark | Connections to Amazon OpenSearch Service, Azure SQL, Azure Cosmos for NoSQL, SAP HANA, Teradata Vantage, and Vertica are now supported natively within AWS Glue. Additionally, connections to these data sources, along with MongoDB, are now available for use in the AWS Glue Studio visual editor. For information about AWS Glue for Spark support, see Connection types and options for ETL in AWS Glue for Spark. For information about use in the AWS Glue Studio visual editor, see Adding an AWS Glue connection. | November 17, 2023 |
Support for generating column statistics | You can compute column-level statistics for AWS Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. For more information, see Working with column statistics. | November 16, 2023 |
Support for data compaction for Iceberg tables | For better read performance by AWS analytics services such as Amazon Athena and Amazon EMR, and AWS Glue ETL jobs, Data Catalog provides managed compaction (a process that compacts small Amazon S3 objects into larger objects) for Iceberg tables in Data Catalog. For more information, see Optimizing Iceberg tables. | November 13, 2023 |
Update to job run wait behavior | Standard Spark and Python shell job runs will now transition to | November 8, 2023 |
AWS Glue Studio user guide consolidated into AWS Glue developer guide | The AWS Glue Studio user guide has been moved into the developer guide to create a single unified user guide for AWS Glue Studio, the AWS Glue console, and AWS Glue Studio programmatic access. | October 25, 2023 |
Update to the AWSGlueServiceNotebookRole AWS managed policy | Added information about a minor update to the AWSGlueServiceNotebookRole AWS managed policy. For more information, see AWS Glue Updates to AWS Managed Policies. | October 9, 2023 |
AWS Glue Studio supports five new built-in transforms | AWS Glue Studio supports the following five new built-in transforms: Record matching, Remove null rows, Parse JSON column, Extract JSON path, and Regex extractor. For more information, see Editing AWS Glue managed data transform nodes. | August 11, 2023 |
Update to the AWSGlueServiceRole AWS managed policy | Added information about a minor update to the AWSGlueServiceRole AWS managed policy. For more information, see AWS Glue Updates to AWS Managed Policies. | August 4, 2023 |
Support for crawling Apache Hudi tables | Added information about using AWS Glue to crawl Hudi tables in Amazon S3 buckets and registering the Hudi tables to the AWS Glue Data Catalog. For more information, see Which data stores can I crawl? and Crawler properties. | July 21, 2023 |
Update to the AWSGlueConsoleFullAccess AWS managed policy | Added information about a minor update to the AWSGlueConsoleFullAccess AWS managed policy. For more information, see AWS Glue Updates to AWS Managed Policies. | July 14, 2023 |
Support for crawling Apache Iceberg tables | Added information about using AWS Glue to crawl Iceberg tables in Amazon S3 buckets and registering the Iceberg tables to the AWS Glue Data Catalog. For more information, see Which data stores can I crawl? and Crawler properties. | July 7, 2023 |
Support for AWS Glue with Ray | Added information about AWS Glue with Ray, a new engine that can back AWS Glue jobs. Reorganized existing AWS Glue with Spark content to disambiguate. | May 30, 2023 |
Support for AWS Glue Data Quality (GA) | AWS Glue Data Quality is now generally available. AWS Glue Data Quality helps you evaluate and monitor the quality of your data. For information about how to use AWS Glue Data Quality with Data Catalog, see AWS Glue Data Quality. To learn about AWS Glue Data Quality for AWS Glue Studio, see Evaluating data quality with AWS Glue Studio. | May 24, 2023 |
Support for larger worker types for Apache Spark jobs | Support is now available for use of the | May 8, 2023 |
Support for creating partition indexes when crawling tables | Added information about how crawlers support the creation of partition indexes for tables that the crawler detects. For more information, see Setting the partition index crawler configuration option. | April 24, 2023 |
Support for resource usage metrics | Added information about viewing the service's resource usage and configuring alarms in Amazon CloudWatch. For more information, see AWS Glue resource monitoring. | April 7, 2023 |
Update to the AWSGlueConsoleFullAccess AWS managed policy | Added information about a minor update to the AWSGlueConsoleFullAccess AWS managed policy. For more information, see AWS Glue Updates to AWS Managed Policies. | March 28, 2023 |
Added guidance for using AWS Glue with an AWS SDK with examples | The AWS Glue Developer Guide has two new sections that provide information to help you use AWS Glue with an AWS SDK. For more information, see Using AWS Glue with an AWS SDK and Code examples for AWS Glue using AWS SDKs. | February 23, 2023 |
Update to the documentation for IAM with AWS Glue | Reorganized and added information on using IAM with AWS Glue. For more information, see Identity and access management for AWS Glue. | February 15, 2023 |
Support for running streaming ETL jobs in AWS Glue version 4.0 | Added information about support for running streaming ETL jobs in AWS Glue version 4.0, and new options for connecting to a Kafka cluster, an Amazon Managed Streaming for Apache Kafka cluster, and Amazon Kinesis Data Streams. For more information, see Adding Streaming ETL Jobs in AWS Glue and Connection types and options for ETL in AWS Glue. | February 8, 2023 |
Support for crawling MongoDB Atlas data sources | Added information about using AWS Glue to crawl MongoDB Atlas data sources. For more information, see Which data stores can I crawl?, MongoDB and MongoDB Atlas connection properties, and Using a MongoDB or MongoDB Atlas connection. | February 6, 2023 |
Support for crawling Delta Lake tables using a native Delta Lake connector | Added information about using AWS Glue to crawl Delta Lake tables using a native Delta Lake connector. This feature allows you to use AWS query engines to query the Delta transaction log directly and use features such as time travel and ACID guarantees, and to sync your Delta Lake metadata from Amazon S3 transaction files into the Data Catalog to enable column permissions on your queries in Lake Formation. For more information, see How to specify configuration options for a Delta Lake data store, and Querying Delta Lake tables. | December 15, 2022 |
Support for AWS Glue Data Quality (preview) | Support is now available for AWS Glue Data Quality (preview). AWS Glue Data Quality helps you evaluate and monitor the quality of your data when you use AWS Glue 3.0. For information about how to use AWS Glue Data Quality with Data Catalog, see AWS Glue Data Quality (preview). To learn about AWS Glue Data Quality for AWS Glue Studio, see Evaluating data quality with AWS Glue Studio. | November 30, 2022 |
Support for a new Amazon Redshift Spark connector with new features and performance improvements | Support is now available for a new Amazon Redshift Spark connector with a new JDBC driver for use with AWS Glue ETL jobs to build Apache Spark applications that read from and write to data in Amazon Redshift as part of your data ingestion and transformation pipelines. For more information, see Moving data to and from Amazon Redshift. | November 29, 2022 |
Support for AWS Glue version 4.0. | Added information about support for AWS Glue version 4.0. Features include native support for open-data lake frameworks with Apache Hudi, Delta Lake, and Apache Iceberg, and native support for the Amazon S3-based Cloud Shuffle Storage Plugin (an Apache Spark plugin) to use Amazon S3 for shuffling and elastic storage capacity. For more information, see AWS Glue Release Notes and Migrating AWS Glue jobs to AWS Glue version 4.0. | November 28, 2022 |
AWS Glue Studio now offers custom visual transforms | Custom visual transforms let customers define, reuse, and share business-specific ETL logic among their teams. For more information, see Custom visual transforms. | November 28, 2022 |
Support for using the AWS Glue crawler to publish metadata for JDBC data stores | Support is now available for using the AWS Glue crawler to publish metadata such as comments and rawtypes to the Data Catalog for JDBC data stores. For more information, see Parameters set on Data Catalog tables by crawler, Crawler properties, and JdbcTarget structure. | November 18, 2022 |
Support for crawling Snowflake data stores | Support is now available for using AWS Glue to crawl Snowflake tables and views, and to publish the metadata to the Data Catalog as a table entry. For Snowflake external tables in Amazon S3, the crawler also crawls the Amazon S3 location and the file format type of the external table and populates them as table parameters. For more information, see Which data stores can I crawl?, AWS Glue connection properties, and Parameters set on Data Catalog tables by crawler. | November 18, 2022 |
Support for improved shuffle management of your Spark applications | Support is now available for a new Cloud Shuffle Storage Plugin for Apache Spark. For more information, see AWS Glue Spark shuffle plugin with Amazon S3 and Cloud Shuffle Storage Plugin for Apache Spark. | November 15, 2022 |
Added support for Data Catalog targets when accelerating crawls using Amazon S3 event notifications | In addition to the existing support for Amazon S3 targets, support is now available for accelerating crawls for Data Catalog targets using Amazon S3 event notifications. For more information, see Accelerating Crawls Using Amazon S3 Event Notifications. | October 13, 2022 |
Support for specifying the maximum number of tables a crawler can create | Support is now available for specifying the maximum number of tables the crawler is allowed to create. For more information, see How to specify the maximum number of tables the crawler is allowed to create. | September 6, 2022 |
Support for Python 3.9 in Python shell jobs in AWS Glue | Support is now available for running scripts compatible with Python 3.9 in Python shell jobs in AWS Glue, and for choosing to use pre-packaged library sets. For more information, see Python shell jobs in AWS Glue. | August 11, 2022 |
Support for running non-urgent or non-time sensitive AWS Glue jobs on spare capacity | Support is now available for the configuration of flexible job runs for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. For more information, see Adding jobs in AWS Glue. | August 9, 2022 |
Support for a new worker type for streaming jobs | Support is now available for use of the | July 14, 2022 |
Support for the use of Kafka SASL in AWS Glue connections | Support is now available for use of Kafka SASL in AWS Glue connections. For more information, see AWS Glue Kafka connection properties for client authentication. | July 5, 2022 |
Support for Apache Kafka connector for Protobuf schemas | Support is now available for the Apache Kafka connector for Protobuf schemas. For more information, see AWS Glue Schema Registry. | June 9, 2022 |
Support for Auto Scaling for AWS Glue jobs (GA) | Added information on using Auto Scaling for jobs in AWS Glue version 3.0 to dynamically scale compute resources. For more information, see Using Auto Scaling for AWS Glue. | April 14, 2022 |
Update to the documentation for AWS Glue developing and testing AWS Glue job scripts | Reorganized and added information on the available development and testing methods for AWS Glue, including instructions for developing with Docker. For more information, see Developing and testing AWS Glue job scripts. | March 14, 2022 |
Addition of Protocol Buffers (Protobuf) as a supported data format for the AWS Glue Schema Registry | Added information about Protobuf as a supported data format (in addition to Avro and JSON). For more information, see AWS Glue Schema Registry. | February 25, 2022 |
Support for crawling Delta Lake tables | Added information about using AWS Glue to crawl Delta Lake tables. For more information, see How to specify configuration options for a Delta Lake data store. | February 24, 2022 |
Support for AWS Glue job insights | Added information about using AWS Glue job insights to simplify job debugging and optimization for your AWS Glue jobs. For more information, see Monitoring with AWS Glue job insights. | February 8, 2022 |
Support for crawling Amazon S3 backed Data Catalog tables using a VPC endpoint | In addition to Amazon S3 data stores, you can configure your Amazon S3 backed Data Catalog tables to be accessed only by an Amazon Virtual Private Cloud environment (Amazon VPC), for security, auditing, or control purposes. For more information, see Crawling an Amazon S3 Data Store or Amazon S3 backed Data Catalog tables using a VPC Endpoint. | February 3, 2022 |
Support for Lake Formation governed tables | Added information about AWS Glue support for Lake Formation governed tables, which support ACID transactions, automatic data compaction, and time-travel queries. For more information, see AWS Glue API and the AWS Lake Formation developer guide. | November 30, 2021 |
New AWS managed policies added for interactive sessions and notebooks | New managed policies for IAM provide enhanced security for using AWS Glue with interactive sessions and notebooks. For more information, see AWS Managed Policies for AWS Glue. | November 30, 2021 |
AWS Glue Schema Registry now supported with streaming jobs | You can create streaming jobs that access tables that are part of the AWS Glue Schema Registry. For more information, see AWS Glue Schema Registry and Adding Streaming ETL Jobs in AWS Glue. | November 15, 2021 |
Support for new machine learning features | Added information about new features for the Find matches machine learning transform, including incremental matching and match scoring. For more information, see Finding Incremental Matches and Estimating the Quality of Matches using Match Confidence Scores. | October 31, 2021 |
(Private preview) Support for AWS Glue flex jobs | Added information about configuring AWS Glue Spark jobs with a flexible execution class, appropriate for time-insensitive jobs whose start and completion times may vary. For more information, see Adding Jobs in AWS Glue. | October 29, 2021 |
Support for accelerating crawls using Amazon S3 event notifications | Added information about accelerating crawls using Amazon S3 event notifications. For more information, see Accelerating Crawls Using Amazon S3 Event Notifications. | October 15, 2021 |
Additional security configuration options related to access-control and VPCs | Added information about how you can configure new access control permissions on AWS Glue and configuration of VPCs. For more information, see AWS Tags in AWS Glue, Identity-Based Policies (IAM Policies) that Control Settings Using Condition Keys or Context Keys, and Configuring all AWS calls to go through your VPC. | October 13, 2021 |
Support for VPC endpoint policies | Added information about support for Virtual Private Cloud (VPC) endpoint policies in AWS Glue. For more information, see AWS Glue and interface VPC endpoints (AWS PrivateLink). | October 11, 2021 |
AWS Glue Studio is now available in China | AWS Glue Studio is now available in the China (Beijing) and China (Ningxia) Regions. | October 11, 2021 |
AWS Glue Studio offers notebook authoring, for interactive job editing | Notebooks help you to write and execute code, visualize the results, and share insights. Typically, data scientists use notebooks for experiments and data exploration tasks. For more information, see Using Notebooks. | October 1, 2021 |
Direct access to streaming sources now available | When adding data sources to your ETL job in the visual editor, you can supply information to access the data stream instead of having to use a Data Catalog database and table. | September 30, 2021 |
Documented the AWS Glue version support policy | Added information about the AWS Glue version support policy and the end of life phases for certain AWS Glue versions. For more information, see AWS Glue version support policy. | September 24, 2021 |
Custom connectors can now be used with data previews | When editing a data source node that uses a custom connector, you can preview the dataset by choosing the Data preview tab. For more information, see Custom Connectors. | September 24, 2021 |
Support for AWS Glue interactive sessions (private preview) | (Private preview) Added information about using AWS Glue interactive sessions to run Spark workloads in the cloud from any Jupyter Notebook. Interactive sessions are the preferred method for developing your AWS Glue extract, transform, and load (ETL) code when you use AWS Glue 2.0 or later. For more information, see Setting Up and Running AWS Glue interactive sessions for Jupyter Notebook. | August 24, 2021 |
Support for creating workflows from blueprints (GA) | Added information about coding common extract, transform, and load (ETL) use cases in blueprints and then creating workflows from blueprints. This feature enables data analysts to easily create and run complex ETL processes. For more information, see Performing Complex ETL Activities Using blueprints and Workflows in AWS Glue. | August 23, 2021 |
Support for AWS Glue version 3.0. | Added information about support for AWS Glue version 3.0, which supports the Apache Spark 3.0 engine upgrade for running Apache Spark ETL jobs, and other optimizations and upgrades. For more information, see AWS Glue Release Notes and Migrating AWS Glue jobs to AWS Glue version 3.0. Other features in this release include the AWS Glue shuffle manager, a SIMD vectorized CSV reader, and catalog partition predicates. For more information, see AWS Glue Spark shuffle manager with Amazon S3, Format Options for ETL Inputs and Outputs in AWS Glue, and Server-side filtering using catalog partition predicates. | August 18, 2021 |
AWS GovCloud (US) Region | AWS Glue Studio is now available in the AWS GovCloud (US) Region. | August 18, 2021 |
Python shell authoring available in AWS Glue Studio | When creating a new job, you can now choose to create a Python shell job. For more information, see Start the job creation process and Editing Python shell jobs in AWS Glue Studio. | August 13, 2021 |
Support for starting a workflow with an Amazon EventBridge event | Added information about how AWS Glue can be an event consumer in an event-driven architecture. For more information, see Starting an AWS Glue Workflow with an Amazon EventBridge Event and Viewing the EventBridge Events That Started a Workflow. | July 14, 2021 |
Addition of JSON as a supported data format for the AWS Glue Schema Registry | Added information about JSON as a supported data format (in addition to Avro). For more information, see AWS Glue Schema Registry. | June 30, 2021 |
Create AWS Glue streaming jobs without a Data Catalog table | The | June 15, 2021 |
AWS Glue machine learning transforms now support AWS Key Management Service keys | You can specify a security configuration or AWS KMS key when configuring AWS Glue Machine Learning transforms with the console, the CLI, or the AWS Glue APIs. For more information, see Using Data Encryption with Machine Learning Transforms and AWS Glue Machine Learning API. | June 15, 2021 |
Update to the AWSGlueConsoleFullAccess AWS managed policy | Added information about a minor update to the AWSGlueConsoleFullAccess AWS managed policy. For more information, see AWS Glue Updates to AWS Managed Policies. | June 10, 2021 |
View your job's dataset while creating and editing jobs | You can use the new Data preview tab for a node in your job diagram to see a sample of the data processed by that node. For more information, see Using data previews in the visual job editor. | June 7, 2021 |
Support for specifying a value that indicates the table location for the crawler output. | Added information about specifying a value that indicates the table location when configuring the crawler's output. For more information, see How to specify the table location. | June 4, 2021 |
Support for crawling a sample of files in a dataset when crawling an Amazon S3 data store | Added information about how to crawl a sample of files when crawling Amazon S3. For more information, see Crawler Properties. | May 10, 2021 |
Support for the AWS Glue optimized parquet writer | Added information about using the AWS Glue optimized parquet writer for DynamicFrames to create or update tables with the | May 4, 2021 |
Support for Kafka client authentication passwords | Added information about how streaming ETL jobs in AWS Glue support SSL client certificate authentication with Apache Kafka stream producers. You can now provide a custom certificate while defining an AWS Glue connection to an Apache Kafka cluster, which AWS Glue will use when authenticating with it. For more information, see AWS Glue Connection Properties and Connection API. | April 28, 2021 |
Support for consuming data from Amazon Kinesis Data Streams in another account in streaming ETL jobs | Added information about how to create a streaming ETL job to consume data from Amazon Kinesis Data Streams in another account. For more information, see Adding Streaming ETL Jobs in AWS Glue. | March 30, 2021 |
SQL transform available | You can use a SQL transform node to write your own transform in the form of a SQL query. For more information, see Using a SQL query to transform data. | March 23, 2021 |
Support for creating workflows from blueprints (public preview) | (Public preview) Added information about coding common extract, transform, and load (ETL) use cases in blueprints and then creating workflows from blueprints. This feature enables data analysts to easily create and run complex ETL processes. For more information, see Performing Complex ETL Activities Using blueprints and Workflows in AWS Glue. | March 22, 2021 |
Connectors can be used for data targets | Using a custom or AWS Marketplace connector for your data target is now supported. For more information, see Authoring jobs with custom connectors. | March 15, 2021 |
Support for column importance metrics for AWS Glue machine learning transforms | Added information about viewing column importance metrics when working with AWS Glue machine learning transforms. For more information, see Working with Machine Learning Transforms on the AWS Glue Console. | February 5, 2021 |
Job scheduling now available in AWS Glue Studio | You can define a time-based schedule for your job runs in AWS Glue Studio. You can use the console to create a basic schedule, or define a more complex schedule using the Unix-like cron | December 21, 2020 |
AWS Glue Custom Connectors released | AWS Glue Custom Connectors allow you to discover and subscribe to connectors in AWS Marketplace. We also released AWS Glue Spark runtime interfaces to plug in connectors built for Apache Spark Datasource, Athena federated query, and JDBC APIs. For more information, see Using Connectors and connections with AWS Glue Studio. | December 21, 2020 |
Support for running streaming ETL jobs in AWS Glue version 2.0 | Added information about support for running streaming ETL jobs in AWS Glue version 2.0. For more information, see Adding Streaming ETL Jobs in AWS Glue. | December 18, 2020 |
Support for workload partitioning with bounded execution | Added information about enabling workload partitioning to configure the upper bounds on the dataset size, or the number of files processed on ETL job runs. For more information, see Workload Partitioning with Bounded Execution. | November 23, 2020 |
Support for enhanced partition management | Added information about how to use new APIs to add or delete a partition index to/from an existing table. For more information, see Working with Partition Indexes. | November 23, 2020 |
Support for the AWS Glue Schema Registry | Added information about using the AWS Glue Schema Registry to centrally discover, control, and evolve schemas. For more information, see AWS Glue Schema Registry. | November 19, 2020 |
Support for the grok input format in streaming ETL jobs | Added information about applying Grok patterns to streaming sources such as log files. For more information, see Applying Grok Patterns to Streaming Sources. | November 17, 2020 |
Support for adding tags to workflows on the AWS Glue console | Added information about adding tags when creating a workflow using the AWS Glue console. For more information, see Creating and Building Out a Workflow Using the AWS Glue Console. | October 27, 2020 |
Support for incremental crawler runs | Added information about support for incremental crawler runs, which crawl only Amazon S3 folders added since the last run. For more information, see Incremental Crawls. | October 21, 2020 |
Support for schema detection for streaming ETL data sources, Avro streaming ETL data sources, and self-managed Kafka | Streaming extract, transform, and load (ETL) jobs in AWS Glue can now automatically detect the schema of incoming records and handle schema changes on a per-record basis. Self-managed Kafka data sources are now supported. Streaming ETL jobs now support the Avro format in data sources. For more information, see Streaming ETL in AWS Glue, Defining Job Properties for a Streaming ETL Job, and Notes and Restrictions for Avro Streaming Sources. | October 7, 2020 |
Support for crawling MongoDB and DocumentDB data sources | Added information about support for crawling MongoDB and Amazon DocumentDB (with MongoDB compatibility) data sources. For more information, see Defining Crawlers. | October 5, 2020 |
Support for FIPS compliance | Added information about FIPS endpoints for customers who require FIPS 140-2 validated cryptographic modules when accessing data using AWS Glue. For more information, see FIPS Compliance. | September 23, 2020 |
AWS Glue Studio provides an easy-to-use visual interface for creating and monitoring jobs | You can now use a simple graph-based interface to compose jobs that move and transform data and run them on AWS Glue. You can then use the job run dashboard in AWS Glue Studio to monitor ETL execution and ensure that your jobs are operating as intended. For more information, see AWS Glue Studio User Guide. | September 23, 2020 |
Support for creating table indexes to improve query performance | Added information about creating table indexes to allow you to retrieve a subset of the partitions from a table. For more information, see Working with Partition Indexes. | September 9, 2020 |
Support for reduced startup times when running Apache Spark ETL jobs in AWS Glue version 2.0. | Added information about support for AWS Glue version 2.0 which provides an upgraded infrastructure for running Apache Spark ETL jobs with reduced startup times, changes in logging, and support for specifying additional Python modules at the job level. For more information, see AWS Glue Release Notes and Running Spark ETL Jobs with Reduced Startup Times. | August 10, 2020 |
Support for limiting the number of concurrent workflow runs. | Added information about how to limit the number of concurrent workflow runs for a particular workflow. For more information, see Creating and Building Out a Workflow Using the AWS Glue Console. | August 10, 2020 |
Support for crawling an Amazon S3 data store using a VPC endpoint | Added information about configuring your Amazon S3 data store to be accessed only by an Amazon Virtual Private Cloud environment (Amazon VPC), for security, auditing, or control purposes. For more information, see Crawling an Amazon S3 Data Store using a VPC Endpoint. | August 7, 2020 |
Support for resuming workflow runs | Added information about how to resume workflow runs that only partially completed because one or more nodes (jobs or crawlers) did not complete successfully. For more information, see Repairing and Resuming a Workflow Run. | July 27, 2020 |
Support for enabling private CA certificates in Kafka connections in AWS Glue. | Added information about new connection options that support enabling private CA certificates for Kafka connections in AWS Glue. For more information, see Connection Types and Options for ETL in AWS Glue and Special Parameters Used by AWS Glue. | July 20, 2020 |
Support for reading DynamoDB data in another account | Added information about AWS Glue support for reading data from another AWS account's DynamoDB table. For more information, see Reading from DynamoDB Data in Another Account. | July 17, 2020 |
Support for a DynamoDB writer connection in AWS Glue version 1.0 or later | Added information about support for DynamoDB writer, and new or updated connection options for DynamoDB to read or write. For more information, see Connection Types and Options for ETL in AWS Glue. | July 17, 2020 |
Support for resource links and for cross-account access control using both AWS Glue and Lake Formation | Added content about new Data Catalog objects called resource links, and about how to manage sharing Data Catalog resources across accounts with both AWS Glue and AWS Lake Formation. For more information, see Granting Cross-Account Access and Table Resource Links. | July 7, 2020 |
Support for sampling records when crawling DynamoDB data stores | Added information about new properties that you can configure when crawling a DynamoDB data store. For more information, see Crawler Properties. | June 12, 2020 |
Support for stopping a workflow run. | Added information about how to stop a workflow run for a particular workflow. For more information, see Stopping a Workflow Run. | May 14, 2020 |
Support for Spark streaming ETL jobs | Added information about creating extract, transform, and load (ETL) jobs with streaming data sources. For more information, see Adding Streaming ETL Jobs in AWS Glue. | April 27, 2020 |
Support for creating tables, updating the schema, and adding new partitions in the Data Catalog after running an ETL job | Added information about how you can enable creating tables, updating the schema, and adding new partitions to see the results of your ETL job in the Data Catalog. For more information, see Creating Tables, Updating Schema, and Adding New Partitions in the Data Catalog from AWS Glue ETL Jobs. | April 2, 2020 |
Support for specifying a version for the Apache Avro data format as an ETL input and output in AWS Glue | Added information about specifying a version for the Apache Avro data format as an ETL input and output in AWS Glue. The default version is 1.7. You can use the | March 31, 2020 |
Support for the EMRFS S3-optimized committer for writing Parquet data into Amazon S3 | Added information about how to set a new flag to enable the EMRFS S3-optimized committer for writing Parquet data into Amazon S3 when creating or updating an AWS Glue job. For more information, see Special Parameters Used by AWS Glue. | March 30, 2020 |
Support for machine learning transforms as a resource managed by AWS resource tags | Added information about using AWS resource tags to manage and control access to your machine learning transforms in AWS Glue. You can assign AWS resource tags to jobs, triggers, endpoints, crawlers, and machine learning transforms in AWS Glue. For more information, see AWS Tags in AWS Glue. | March 2, 2020 |
Support for non-overridable job arguments | Added information about support for special job parameters that cannot be overridden in triggers or when you run the job. For more information, see Adding Jobs in AWS Glue. | February 12, 2020 |
Support for new transforms to work with datasets in Amazon S3 | Added information about new transforms (Merge, Purge, and Transition) and Amazon S3 storage class exclusions for Apache Spark applications to work with datasets in Amazon S3. For more information on support for these transforms for Python, see mergeDynamicFrame and Working with Datasets in Amazon S3. For Scala, see mergeDynamicFrames and AWS Glue Scala GlueContext APIs. | January 16, 2020 |
Support for updating the Data Catalog with new partition information from an ETL job | Added information about how to code an extract, transform, and load (ETL) script to update the AWS Glue Data Catalog with new partition information. With this capability, you no longer have to rerun the crawler after job completion to view the new partitions. For more information, see Updating the Data Catalog with New Partitions. | January 15, 2020 |
New tutorial: Using a SageMaker AI notebook | Added a tutorial that demonstrates how to use an Amazon SageMaker notebook to help develop your ETL and machine learning scripts. See Tutorial: Use an Amazon SageMaker Notebook with Your Development Endpoint. | January 3, 2020 |
Support for reading from MongoDB and Amazon DocumentDB (with MongoDB compatibility) | Added information about new connection types and connection options for reading from and writing to MongoDB and Amazon DocumentDB (with MongoDB Compatibility). For more information, see Connection Types and Options for ETL in AWS Glue. | December 17, 2019 |
Various corrections and clarifications | Added corrections and clarifications throughout. Removed entries from the Known Issues chapter. Added warnings that AWS Glue supports only symmetrical customer master keys (CMKs) when specifying Data Catalog encryption settings and creating security configurations. Added a note that AWS Glue does not support writing to Amazon DynamoDB. | December 9, 2019 |
Support for custom JDBC drivers | Added information about connecting to data sources and targets with JDBC drivers that AWS Glue does not natively support, such as MySQL version 8 and Oracle Database version 18. For more information, see JDBC connectionType Values. | November 25, 2019 |
Support for connecting SageMaker AI notebooks to different development endpoints | Added information about how you can connect a SageMaker AI notebook to different development endpoints. Updated the content to describe the new console action for switching to a new development endpoint, and the new SageMaker AI IAM policy. For more information, see Working with Notebooks on the AWS Glue Console and Create an IAM Policy for Amazon SageMaker AI Notebooks. | November 21, 2019 |
Support for AWS Glue version in machine learning transforms | Added information about defining the AWS Glue version in a machine learning transform to indicate which version of AWS Glue the transform is compatible with. For more information, see Working with Machine Learning Transforms on the AWS Glue Console. | November 21, 2019 |
Support for rewinding your job bookmarks | Added information about rewinding your job bookmarks to any previous job run, resulting in the subsequent job run reprocessing data only from the bookmarked job run. Described two new sub-options for the | October 22, 2019 |
Support for custom JDBC certificates for connecting to a data store | Added information about AWS Glue support of custom JDBC certificates for SSL connections to AWS Glue data sources or targets. For more information, see Working with Connections on the AWS Glue Console. | October 10, 2019 |
Support for Python wheel | Added information about AWS Glue support of wheel files (along with egg files) as dependencies for Python shell jobs. For more information, see Providing Your Own Python Library. | September 26, 2019 |
Support for versioning of development endpoints in AWS Glue | Added information about defining the | September 19, 2019 |
Support for monitoring AWS Glue using Spark UI | Added information about using Apache Spark UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and Spark applications on AWS Glue development endpoints. For more information, see Monitoring AWS Glue Using Spark UI. | September 19, 2019 |
Enhancement of support for local ETL script development using the public AWS Glue ETL library | Updated the AWS Glue ETL library content to reflect that AWS Glue version 1.0 is now supported. For more information, see Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library. | September 18, 2019 |
Support for excluding Amazon S3 storage classes when running jobs | Added information about excluding Amazon S3 storage classes when running AWS Glue ETL jobs that read files or partitions from Amazon S3. For more information, see Excluding Amazon S3 Storage Classes. | August 29, 2019 |
Support for local ETL script development using the public AWS Glue ETL library | Added information about how to develop and test Python and Scala ETL scripts locally without the need for a network connection. For more information, see Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library. | August 28, 2019 |
Known issues | Added information about known issues in AWS Glue. For more information, see Known Issues for AWS Glue. | August 28, 2019 |
Support for machine learning transforms in AWS Glue | Added information about machine learning capabilities provided by AWS Glue to create custom transforms. You can create these transforms when you create a job. For more information, see Machine Learning Transforms in AWS Glue. | August 8, 2019 |
Support for shared Amazon Virtual Private Cloud | Added information about AWS Glue support for shared Amazon Virtual Private Cloud. For more information, see Shared Amazon VPCs. | August 6, 2019 |
Support for versioning in AWS Glue | Added information about defining the | July 24, 2019 |
Support for additional configuration options for development endpoints | Added information about configuration options for development endpoints that have memory-intensive workloads. You can choose from two new configurations that provide more memory per executor. For more information, see Working with Development Endpoints on the AWS Glue Console. | July 24, 2019 |
Support for performing extract, transform, and load (ETL) activities using workflows | Added information about using a new construct called a workflow to design a complex multi-job extract, transform, and load (ETL) activity that AWS Glue can run and track as a single entity. For more information, see Performing Complex ETL Activities Using Workflows in AWS Glue. | June 20, 2019 |
Support for Python 3.6 in Python shell jobs | Added information about support for Python 3.6 in Python shell jobs. You can specify either Python 2.7 or Python 3.6 as a job property. For more information, see Adding Python Shell Jobs in AWS Glue. | June 5, 2019 |
Support for virtual private cloud (VPC) endpoints | Added information about connecting directly to AWS Glue through an interface endpoint in your VPC. When you use a VPC interface endpoint, communication between your VPC and AWS Glue is conducted entirely and securely within the AWS network. For more information, see Using AWS Glue with VPC Endpoints. | June 4, 2019 |
Support for real-time, continuous logging for AWS Glue jobs. | Added information about enabling and viewing real-time Apache Spark job logs in CloudWatch including the driver logs, each of the executor logs, and a Spark job progress bar. For more information, see Continuous Logging for AWS Glue Jobs. | May 28, 2019 |
Support for existing Data Catalog tables as crawler sources | Added information about specifying a list of existing Data Catalog tables as crawler sources. Crawlers can then detect changes to table schemas, update table definitions, and register new partitions as new data becomes available. For more information, see Crawler Properties. | May 10, 2019 |
Support for additional configuration options for memory-intensive jobs | Added information about configuration options for Apache Spark jobs with memory-intensive workloads. You can choose from two new configurations that provide more memory per executor. For more information, see Adding Jobs in AWS Glue. | April 5, 2019 |
Support for CSV custom classifiers | Added information about using a custom CSV classifier to infer the schema of various types of CSV data. For more information, see Writing Custom Classifiers. | March 26, 2019 |
Support for AWS resource tags | Added information about using AWS resource tags to help you manage and control access to your AWS Glue resources. You can assign AWS resource tags to jobs, triggers, endpoints, and crawlers in AWS Glue. For more information, see AWS Tags in AWS Glue. | March 20, 2019 |
Support of Data Catalog for Spark SQL jobs | Added information about configuring your AWS Glue jobs and development endpoints to use the AWS Glue Data Catalog as an external Apache Hive Metastore. This allows jobs and development endpoints to directly run Apache Spark SQL queries against the tables stored in the AWS Glue Data Catalog. For more information, see AWS Glue Data Catalog Support for Spark SQL Jobs. | March 14, 2019 |
Support for Python shell jobs | Added information about Python shell jobs and the new field Maximum capacity. For more information, see Adding Python Shell Jobs in AWS Glue. | January 18, 2019 |
Support for notifications when there are changes to databases and tables | Added information about events that are generated for changes to database, table, and partition API calls. You can configure actions in CloudWatch Events to respond to these events. For more information, see Automating AWS Glue with CloudWatch Events. | January 16, 2019 |
Support for encrypting connection passwords | Added information about encrypting passwords used in connection objects. For more information, see Encrypting Connection Passwords. | December 11, 2018 |
Support for resource-level permission and resource-based policies | Added information about using resource-level permissions and resource-based policies with AWS Glue. For more information, see the topics within Security in AWS Glue. | October 15, 2018 |
Support for SageMaker AI notebooks | Added information about using SageMaker AI notebooks with AWS Glue development endpoints. For more information, see Managing Notebooks. | October 5, 2018 |
Support for encryption | Added information about using encryption with AWS Glue. For more information, see Encryption at Rest, Encryption in Transit, and Setting Up Encryption in AWS Glue. | August 24, 2018 |
Support for Apache Spark job metrics | Added information about the use of Apache Spark metrics for better debugging and profiling of ETL jobs. You can easily track runtime metrics such as bytes read and written, memory usage and CPU load of the driver and executors, and data shuffles among executors from the AWS Glue console. For more information, see Monitoring AWS Glue Using CloudWatch Metrics, Job Monitoring and Debugging, and Working with Jobs on the AWS Glue Console. | July 13, 2018 |
Support of DynamoDB as a data source | Added information about crawling DynamoDB and using it as a data source of ETL jobs. For more information, see Cataloging Tables with a Crawler and Connection Parameters. | July 10, 2018 |
Updates to create notebook server procedure | Updated information about how to create a notebook server on an Amazon EC2 instance associated with a development endpoint. For more information, see Creating a Notebook Server Associated with a Development Endpoint. | July 9, 2018 |
Updates now available over RSS | You can now subscribe to an RSS feed to receive notifications about updates to the AWS Glue Developer Guide. | June 25, 2018 |
Support delay notifications for jobs | Added information about configuring a delay threshold when a job runs. For more information, see Adding Jobs in AWS Glue. | May 25, 2018 |
Configure a crawler to append new columns | Added information about a new configuration option for crawlers, MergeNewColumns. For more information, see Configuring a Crawler. | May 7, 2018 |
Support timeout of jobs | Added information about setting a timeout threshold when a job runs. For more information, see Adding Jobs in AWS Glue. | April 10, 2018 |
Support Scala ETL script and trigger jobs based on additional run states | Added information about using Scala as the ETL programming language. In addition, the trigger API now supports firing when any conditions are met (in addition to all conditions). Also, jobs can be triggered based on a "failed" or "stopped" job run (in addition to a "succeeded" job run). | January 12, 2018 |
Earlier updates
The following table describes the important changes in each release of the AWS Glue Developer Guide before January 2018.
Change | Description | Date |
---|---|---|
Support XML data sources and new crawler configuration option | Added information about classifying XML data sources and a new crawler option for partition changes. | November 16, 2017 |
New transforms, support for additional Amazon RDS database engines, and development endpoint enhancements | Added information about the map and filter transforms, support for Amazon RDS Microsoft SQL Server and Amazon RDS Oracle, and new features for development endpoints. | September 29, 2017 |
AWS Glue initial release | This is the initial release of the AWS Glue Developer Guide. | August 14, 2017 |