

# AWS Glue programming guide
<a name="edit-script"></a>

A script contains the code that extracts data from sources, transforms it, and loads it into targets. AWS Glue runs a script when it starts a job.

AWS Glue ETL scripts are coded in Python or Scala. While all job types can be written in Python, AWS Glue for Spark jobs can be written in Scala as well. When you automatically generate the source code logic for your job in AWS Glue Studio, a script is created. You can edit this script, or you can provide your own script to process your ETL work.

# Providing your own custom scripts
<a name="console-custom-created"></a>

Scripts perform the extract, transform, and load (ETL) work in AWS Glue. A script is created when you automatically generate the source code logic for a job. You can either edit this generated script, or you can provide your own custom script.

**To provide your own custom script in AWS Glue, follow these general steps:**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose the **ETL Jobs** tab, and then view the **Create job** section. Choose a **script editor** option.

1. Under **This job runs**, choose one of the following:
   + **Create a new script with boilerplate code**
   + **Upload and edit an existing script**

1. On the **Job details** page, choose the **IAM role** that is required for your custom script to run. For more information, see [Identity and access management for AWS Glue](security-iam.md).

1. Choose any connections that your script references. These objects are needed to connect to the necessary JDBC data stores.

   An elastic network interface is a virtual network interface that you can attach to an instance in a virtual private cloud (VPC). Choose the elastic network interface that is required to connect to the data store that's used in the script.

1. Provide additional configuration, including parameters, specific to your job type. For more information about configuration for your job type, see the [Building visual ETL jobs](author-job-glue.md) section.

1. On the **Script** tab, paste or write your custom script.

Use the content in this section to guide the process of writing your custom script.

For more information about adding jobs in AWS Glue, see [Building visual ETL jobs](author-job-glue.md). 

For step-by-step guidance, see the **Add job** tutorial in the AWS Glue console.

# Programming Spark scripts
<a name="aws-glue-programming"></a>

AWS Glue makes it easy to write or autogenerate extract, transform, and load (ETL) scripts, in addition to testing and running them. This section describes the extensions to Apache Spark that AWS Glue has introduced, and provides examples of how to code and run ETL scripts in Python and Scala.

**Important**  
Different versions of AWS Glue support different versions of Apache Spark. Your custom script must be compatible with the supported Apache Spark version. For information about AWS Glue versions, see the [Glue version job property](add-job.md#glue-version-table).

**Topics**
+ [Tutorial: Writing an AWS Glue for Spark script](aws-glue-programming-intro-tutorial.md)
+ [Program AWS Glue ETL scripts in PySpark](aws-glue-programming-python.md)
+ [Programming AWS Glue ETL scripts in Scala](aws-glue-programming-scala.md)
+ [Features and optimizations for programming AWS Glue for Spark ETL scripts](aws-glue-programming-general.md)

# Tutorial: Writing an AWS Glue for Spark script
<a name="aws-glue-programming-intro-tutorial"></a>

This tutorial introduces you to the process of writing AWS Glue scripts. You can run scripts on a schedule with jobs, or interactively with interactive sessions. For more information about jobs, see [Building visual ETL jobs](author-job-glue.md). For more information about interactive sessions, see [Overview of AWS Glue interactive sessions](interactive-sessions-chapter.md#interactive-sessions-overview). 

The AWS Glue Studio visual editor offers a graphical, no-code interface for building AWS Glue jobs. AWS Glue scripts back visual jobs. They give you access to the expanded set of tools available to work with Apache Spark programs. You can access native Spark APIs, as well as AWS Glue libraries that facilitate extract, transform, and load (ETL) workflows from within an AWS Glue script.

In this tutorial, you extract, transform, and load a dataset of parking tickets. The script that does this work is identical in form and function to the one generated in [Making ETL easier with AWS Glue Studio](https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/) on the AWS Big Data Blog, which introduces the AWS Glue Studio visual editor. By running this script in a job, you can compare it to visual jobs and see how AWS Glue ETL scripts work. This prepares you to use additional functionalities that aren't yet available in visual jobs.

You use the Python language and libraries in this tutorial. Similar functionality is available in Scala. After going through this tutorial, you should be able to generate and inspect a sample Scala script to understand how the same process works in Scala. 

## Prerequisites
<a name="aws-glue-programming-intro-tutorial-prerequisites"></a>

 This tutorial has the following prerequisites: 
+ The same prerequisites as the AWS Glue Studio blog post, which instructs you to run a CloudFormation template.

  This template uses the AWS Glue Data Catalog to manage the parking ticket dataset available in `s3://aws-bigdata-blog/artifacts/gluestudio/`. It creates the following resources, which are referenced in this tutorial:
  + **AWSGlueStudioRole** – IAM role to run AWS Glue jobs
  + **AWSGlueStudioS3Bucket** – Name of the Amazon S3 bucket to store blog-related files
  + **AWSGlueStudioTicketsYYZDB** – AWS Glue Data Catalog database
  + **AWSGlueStudioTableTickets** – Data Catalog table to use as a source
  + **AWSGlueStudioTableTrials** – Data Catalog table to use as a source
  + **AWSGlueStudioParkingTicketCount** – Data Catalog table to use as the destination
+ The script generated in the AWS Glue Studio blog post. If the blog post changes, the script is also available in the following text.

### Generate a sample script
<a name="aws-glue-programming-intro-tutorial-sample-script"></a>

 You can use the AWS Glue Studio visual editor as a powerful code generation tool to create a scaffold for the script you want to write. You will use this tool to create a sample script.

 If you want to skip these steps, the script is provided.

#### Tutorial sample script
<a name="aws-glue-programming-intro-tutorial-code-sample"></a>

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets", table_name="tickets", transformation_ctx="S3bucket_node1"
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("tag_number_masked", "string", "tag_number_masked", "string"),
        ("date_of_infraction", "string", "date_of_infraction", "string"),
        ("ticket_date", "string", "ticket_date", "string"),
        ("ticket_number", "decimal", "ticket_number", "float"),
        ("officer", "decimal", "officer_name", "decimal"),
        ("infraction_code", "decimal", "infraction_code", "decimal"),
        ("infraction_description", "string", "infraction_description", "string"),
        ("set_fine_amount", "decimal", "set_fine_amount", "float"),
        ("time_of_infraction", "decimal", "time_of_infraction", "decimal"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://amzn-s3-demo-bucket", "partitionKeys": []},
    format_options={"compression": "gzip"},
    transformation_ctx="S3bucket_node3",
)

job.commit()
```

**To generate a sample script**

1. Complete the AWS Glue Studio tutorial. To complete this tutorial, see [Creating a job in AWS Glue Studio from an example job](https://docs.aws.amazon.com/glue/latest/dg/edit-nodes-chapter.html#create-jobs-start).

1. Navigate to the **Script** tab on the job page, as shown in the following screenshot:   
![\[The Script tab for an AWS Glue job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/programming-intro-generated-script.png)

1. Copy the complete contents of the **Script** tab. By setting the script language in **Job details**, you can switch back and forth between generating Python or Scala code.

## Step 1. Create a job and paste your script
<a name="aws-glue-programming-intro-tutorial-create-job"></a>

In this step, you create an AWS Glue job in the AWS Management Console. This sets up a configuration that allows AWS Glue to run your script. It also creates a place for you to store and edit your script. 

**To create a job**

1. In the AWS Management Console, navigate to the AWS Glue landing page.

1. In the side navigation pane, choose **Jobs**.

1. Choose **Spark script editor** in **Create job**, and then choose **Create**.

1. **Optional** - Paste the full text of your script into the **Script** pane. Alternatively, you can follow along with the tutorial.

## Step 2. Import AWS Glue libraries
<a name="aws-glue-programming-intro-tutorial-import-statements"></a>

You need to set your script up to interact with code and configuration that are defined outside of the script. This work is done behind the scenes in AWS Glue Studio. 

In this step, you perform the following actions. 
+ Import and initialize a `GlueContext` object. This is the most important import, from the script writing perspective. This exposes standard methods for defining source and target datasets, which is the starting point for any ETL script. To learn more about the `GlueContext` class, see [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md).
+ Initialize a `SparkContext` and `SparkSession`. These allow you to configure the Spark engine available inside the AWS Glue job. You won't need to use them directly within introductory AWS Glue scripts.
+ Call `getResolvedOptions` to prepare your job arguments for use within the script. For more information about resolving job parameters, see [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md).
+ Initialize a `Job`. The `Job` object sets configuration and tracks the state of various optional AWS Glue features. Your script can run without a `Job` object, but the best practice is to initialize it so that you don't encounter confusion if those features are later integrated. 

  One of these features is job bookmarks, which you can optionally configure in this tutorial. You can learn about job bookmarks in the following section, [Optional - Enable job bookmarks](#aws-glue-programming-intro-tutorial-create-job-bookmarks).
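As a rough illustration of what `getResolvedOptions` does with your job arguments, consider the following simplified plain-Python stand-in. This is a mental model only, not the awsglue implementation; the real function also handles reserved argument names and option formats this sketch ignores.

```python
def resolve_options(argv, option_names):
    """Toy stand-in for awsglue.utils.getResolvedOptions: collect
    '--NAME value' pairs from argv for the requested option names."""
    resolved = {}
    for i, token in enumerate(argv):
        name = token.lstrip("-")
        if token.startswith("--") and name in option_names:
            resolved[name] = argv[i + 1]
    missing = set(option_names) - set(resolved)
    if missing:
        raise ValueError(f"Missing required arguments: {sorted(missing)}")
    return resolved

# AWS Glue passes job parameters on the script's command line, so a
# run configured with a job name behaves roughly like this:
args = resolve_options(["job.py", "--JOB_NAME", "tickets-etl"], ["JOB_NAME"])
print(args["JOB_NAME"])  # tickets-etl
```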

 In this procedure, you write the following code. This code is a portion of the generated sample script. 

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
```

**To import AWS Glue libraries**
+ Copy this section of code and paste it into the **Script** editor. 
**Note**  
You might consider copying code to be a bad engineering practice. In this tutorial, we suggest this to encourage you to consistently name your core variables across all AWS Glue ETL scripts.

## Step 3. Extract data from a source
<a name="aws-glue-programming-intro-tutorial-create-data-source"></a>

In any ETL process, you first need to define a source dataset that you want to change. In the AWS Glue Studio visual editor, you provide this information by creating a **Source** node. 

In this step, you provide the `create_dynamic_frame.from_catalog` method a `database` and `table_name` to extract data from a source configured in the AWS Glue Data Catalog.

In the previous step, you initialized a `GlueContext` object. You use this object to find methods that are used to configure sources, such as `create_dynamic_frame.from_catalog`.

In this procedure, you write the following code using `create_dynamic_frame.from_catalog`. This code is a portion of the generated sample script. 

```
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets", table_name="tickets", transformation_ctx="S3bucket_node1"
    )
```

**To extract data from a source**

1. Examine the documentation to find a method on `GlueContext` to extract data from a source defined in the AWS Glue Data Catalog. These methods are documented in [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md). Choose the [create_dynamic_frame.from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog) method. Call this method on `glueContext`. 

1. Examine the documentation for `create_dynamic_frame.from_catalog`. This method requires `database` and `table_name` parameters. Provide the necessary parameters to `create_dynamic_frame.from_catalog`. 

   The AWS Glue Data Catalog stores information about the location and format of your source data, and was set up in the prerequisite section. You don't have to directly provide your script with that information.

1.  **Optional** – Provide the `transformation_ctx` parameter to the method in order to support job bookmarks. You can learn about job bookmarks in the following section, [Optional - Enable job bookmarks](#aws-glue-programming-intro-tutorial-create-job-bookmarks).

**Note**  
**Common methods for extracting data**  
[create_dynamic_frame_from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog) is used to connect to tables in the AWS Glue Data Catalog.  
If you need to directly provide your job with configuration that describes the structure and location of your source, see the [create_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method. You will need to provide more detailed parameters describing your data than when using `create_dynamic_frame.from_catalog`.  
Refer to the supplemental documentation about `format_options` and `connection_parameters` to identify your required parameters. For an explanation of how to provide your script information about your source data format, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). For an explanation of how to provide your script information about your source data location, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).   
If you're reading information from a streaming source, you provide your job with source information through the [create_data_frame_from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog) or [create_data_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options) methods. Note that these methods return Apache Spark `DataFrames`.  
Our generated code calls `create_dynamic_frame.from_catalog` while the reference documentation refers to `create_dynamic_frame_from_catalog`. These methods ultimately call the same code, and are included so you can write cleaner code. You can verify this by viewing the source for our Python wrapper, available at [https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py](https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py). 

## Step 4. Transform data with AWS Glue
<a name="aws-glue-programming-intro-tutorial-create-transform"></a>

After extracting source data in an ETL process, you need to describe how you want to change your data. You provide this information by creating a **Transform** node in the AWS Glue Studio visual editor.

In this step, you provide the `ApplyMapping` method with a map of current and desired field names and types to transform your `DynamicFrame`. 

You perform the following transformations.
+ Drop the four `location` and `province` keys.
+ Change the name of `officer` to `officer_name`.
+ Change the type of `ticket_number` and `set_fine_amount` to `float`.

 `create_dynamic_frame.from_catalog` provides you with a `DynamicFrame` object. A `DynamicFrame` represents a dataset in AWS Glue. AWS Glue transforms are operations that change `DynamicFrames`. 

**Note**  
What is a `DynamicFrame`?  
A `DynamicFrame` is an abstraction that allows you to connect a dataset with a description of the names and types of entries in the data. In Apache Spark, a similar abstraction exists called a DataFrame. For an explanation of DataFrames, see [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html).  
With `DynamicFrames`, you can describe dataset schemas dynamically. Consider a dataset with a price column, where some entries store price as a string, and others store price as a double. AWS Glue computes a schema on-the-fly—it creates a self-describing record for each row.   
Inconsistent fields (like price) are explicitly represented with a type (`ChoiceType`) in the schema for the frame. You can address your inconsistent fields by dropping them with `DropFields` or resolving them with `ResolveChoice`. These are transforms that are available on the `DynamicFrame`. You can then write your data back to your data lake with `writeDynamicFrame`.  
You can call many of the same transforms from methods on the `DynamicFrame` class, which can lead to more readable scripts. For more information about `DynamicFrame`, see [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md). 
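To make the `ChoiceType` idea concrete, the price example can be modeled in plain Python. This sketch illustrates the concept only; it is not how AWS Glue or `ResolveChoice` is implemented.

```python
# Each record carries its own schema, so the same field can hold
# different types across rows (AWS Glue models this as a ChoiceType).
rows = [
    {"item": "ticket", "price": "30.00"},   # price stored as a string
    {"item": "towing", "price": 180.0},     # price stored as a double
]

def resolve_price_as_double(records):
    """Rough analogue of resolving a choice with a cast: coerce every
    variant of the inconsistent field to one concrete type."""
    return [{**r, "price": float(r["price"])} for r in records]

resolved = resolve_price_as_double(rows)
print([type(r["price"]).__name__ for r in resolved])  # ['float', 'float']
```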

 In this procedure, you write the following code using `ApplyMapping`. This code is a portion of the generated sample script. 

```
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("tag_number_masked", "string", "tag_number_masked", "string"),
        ("date_of_infraction", "string", "date_of_infraction", "string"),
        ("ticket_date", "string", "ticket_date", "string"),
        ("ticket_number", "decimal", "ticket_number", "float"),
        ("officer", "decimal", "officer_name", "decimal"),
        ("infraction_code", "decimal", "infraction_code", "decimal"),
        ("infraction_description", "string", "infraction_description", "string"),
        ("set_fine_amount", "decimal", "set_fine_amount", "float"),
        ("time_of_infraction", "decimal", "time_of_infraction", "decimal"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
```

**To transform data with AWS Glue**

1. Examine the documentation to identify a transform to change and drop fields. For details, see [GlueTransform base class](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md). Choose the `ApplyMapping` transform. For more information about `ApplyMapping`, see [ApplyMapping class](aws-glue-api-crawler-pyspark-transforms-ApplyMapping.md). Call `apply` on the `ApplyMapping` transform object.
**Note**  
What is `ApplyMapping`?  
`ApplyMapping` takes a `DynamicFrame` and transforms it. It takes a list of tuples that represent transformations on fields—a "mapping". The first two tuple elements, a field name and type, identify a field in the frame. The second two elements are the target field name and type.  
`ApplyMapping` converts the source field to the target name and type in a new `DynamicFrame`, which it returns. Fields that aren't provided in the mapping are dropped from the return value.  
Rather than calling `apply`, you can call the same transform with the `apply_mapping` method on the `DynamicFrame` to create more fluent, readable code. For more information, see [apply_mapping](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-apply_mapping). 

1. Examine the documentation for `ApplyMapping` to identify required parameters. See [ApplyMapping class](aws-glue-api-crawler-pyspark-transforms-ApplyMapping.md). You will find that this method requires `frame` and `mappings` parameters. Provide the necessary parameters to `ApplyMapping`.

1. **Optional** – Provide `transformation_ctx` to the method to support job bookmarks. You can learn about job bookmarks in the following section, [Optional - Enable job bookmarks](#aws-glue-programming-intro-tutorial-create-job-bookmarks).

**Note**  
**Apache Spark functionality**  
We provide transforms to streamline ETL workflows within your job. You also have access to the libraries that are available in a Spark program in your job, built for more general purposes. In order to use them, you convert between `DynamicFrame` and `DataFrame`.   
You can create a `DataFrame` with [toDF](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF). Then, you can use methods available on the DataFrame to transform your dataset. For more information on these methods, see [DataFrame](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.html). You can then convert backwards with [fromDF](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF) to use AWS Glue operations for loading your frame to a target.
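The mapping semantics described in this step can be modeled in plain Python. The following sketch illustrates the 4-tuple behavior only, not the AWS Glue implementation; the sample field names are taken from the tutorial dataset.

```python
def apply_mapping(records, mappings):
    """Plain-Python sketch of ApplyMapping semantics: each tuple is
    (source_name, source_type, target_name, target_type). Fields not
    listed in the mapping are dropped from the result."""
    casts = {"string": str, "float": float, "decimal": float, "int": int}
    out = []
    for record in records:
        new_record = {}
        for src_name, _src_type, dst_name, dst_type in mappings:
            if src_name in record:
                new_record[dst_name] = casts[dst_type](record[src_name])
        out.append(new_record)
    return out

rows = [{"officer": 1042, "province": "ON", "set_fine_amount": "30"}]
mapped = apply_mapping(rows, [
    ("officer", "decimal", "officer_name", "decimal"),     # rename
    ("set_fine_amount", "decimal", "set_fine_amount", "float"),  # retype
])
# province is absent from the mapping, so it is dropped:
print(mapped)  # [{'officer_name': 1042.0, 'set_fine_amount': 30.0}]
```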

## Step 5. Load data into a target
<a name="aws-glue-programming-intro-tutorial-create-data-target"></a>

After you transform your data, you typically store the transformed data in a different place from the source. You perform this operation by creating a **target** node in the AWS Glue Studio visual editor. 

In this step, you provide the `write_dynamic_frame.from_options` method a `connection_type`, `connection_options`, `format`, and `format_options` to load data into a target bucket in Amazon S3.

In Step 2, you initialized a `GlueContext` object. In AWS Glue, this is where you will find methods that are used to configure targets, much like sources.

In this procedure, you write the following code using `write_dynamic_frame.from_options`. This code is a portion of the generated sample script. 

```
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://amzn-s3-demo-bucket", "partitionKeys": []},
    format_options={"compression": "gzip"},
    transformation_ctx="S3bucket_node3",
    )
```

**To load data into a target**

1. Examine the documentation to find a method to load data into a target Amazon S3 bucket. These methods are documented in [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md). Choose the [write_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method. Call this method on `glueContext`. 
**Note**  
**Common methods for loading data**  
`write_dynamic_frame.from_options` is the most common method used to load data. It supports all targets that are available in AWS Glue.  
If you're writing to a JDBC target defined in an AWS Glue connection, use the [write_dynamic_frame_from_jdbc_conf](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_jdbc_conf) method. AWS Glue connections store information about how to connect to a data source. This removes the need to provide that information in `connection_options`. However, you still need to use `connection_options` to provide `dbtable`.  
`write_dynamic_frame.from_catalog` is not a common method for loading data. This method updates the AWS Glue Data Catalog without updating the underlying dataset, and is used in combination with other processes that change the underlying dataset. For more information, see [Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs](update-from-job.md).

1. Examine the documentation for [write_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options). This method requires `frame`, `connection_type`, `format`, `connection_options`, and `format_options`. Call this method on `glueContext`.

   1. Refer to the supplemental documentation about `format_options` and `format` to identify the parameters you need. For an explanation of data formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).

   1. Refer to the supplemental documentation about `connection_type` and `connection_options` to identify the parameters you need. For an explanation of connections, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

   1. Provide the necessary parameters to `write_dynamic_frame.from_options`. This method has a similar configuration to `create_dynamic_frame.from_options`. 

1. **Optional** – Provide `transformation_ctx` to `write_dynamic_frame.from_options` to support job bookmarks. You can learn about job bookmarks in the following section, [Optional - Enable job bookmarks](#aws-glue-programming-intro-tutorial-create-job-bookmarks).

## Step 6. Commit the `Job` object
<a name="aws-glue-programming-intro-tutorial-commit-job"></a>

 You initialized a `Job` object in Step 2. You need to manually conclude its lifecycle at the end of your script when certain optional features, such as job bookmarks, require it to function properly. In AWS Glue Studio, this work is done behind the scenes. 

 In this step, call the `commit` method on the `Job` object. 

 In this procedure, you write the following code. This code is a portion of the generated sample script. 

```
job.commit()
```

**To commit the `Job` object**

1. If you have not yet done this, perform the optional steps outlined in previous sections to include `transformation_ctx`.

1. Call `commit`.

## Optional - Enable job bookmarks
<a name="aws-glue-programming-intro-tutorial-create-job-bookmarks"></a>

 In every prior step, you have been instructed to set `transformation_ctx` parameters. This is related to a feature called job bookmarks. 

With job bookmarks, you can save time and money with jobs that run on a recurring basis, against datasets where previous work can easily be tracked. Job bookmarks track the progress of an AWS Glue transform across a dataset from previous runs. By tracking where previous runs ended, AWS Glue can limit its work to rows it hasn't processed before. For more information about job bookmarks, see [Tracking processed data using job bookmarks](monitor-continuations.md).

To enable job bookmarks, first add the `transformation_ctx` parameters to the provided methods, as described in the previous examples. Job bookmark state is persisted across runs. `transformation_ctx` parameters are keys used to access that state. On their own, these parameters do nothing; you also need to activate the feature in the configuration for your job.
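Conceptually, the bookmark behavior looks like the following plain-Python sketch, where `transformation_ctx` keys the saved state. This is an illustration only; AWS Glue persists real bookmark state in the service, not in your script, and tracks sources by criteria such as file names and timestamps rather than a simple row id.

```python
# Hypothetical model of job-bookmark state, keyed by transformation_ctx
# and persisted between runs so each run only sees unprocessed rows.
bookmark_state = {}

def read_new_rows(rows, transformation_ctx):
    last_seen = bookmark_state.get(transformation_ctx, -1)
    new_rows = [r for r in rows if r["id"] > last_seen]
    if new_rows:
        bookmark_state[transformation_ctx] = max(r["id"] for r in new_rows)
    return new_rows

dataset = [{"id": 1}, {"id": 2}]
first_run = read_new_rows(dataset, "S3bucket_node1")   # processes ids 1 and 2
dataset.append({"id": 3})                              # new data arrives
second_run = read_new_rows(dataset, "S3bucket_node1")  # processes only id 3
print(len(first_run), len(second_run))  # 2 1
```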

In this procedure, you enable job bookmarks using the AWS Management Console.

**To enable job bookmarks**

1. Navigate to the **Job details** section of your corresponding job.

1. Set **Job bookmark** to **Enable**.

## Step 7. Run your code as a job
<a name="aws-glue-programming-intro-tutorial-running-as-job"></a>

In this step, you run your job to verify that you successfully completed this tutorial. This is done with the click of a button, as in the AWS Glue Studio visual editor.

**To run your code as a job**

1. Choose **Untitled job** on the title bar to edit and set your job name.

1. Navigate to the **Job details** tab. Assign your job an **IAM Role**. You can use the one created by the CloudFormation template in the prerequisites for the AWS Glue Studio tutorial. If you have completed that tutorial, it should be available as `AWSGlueStudioRole`. 

1. Choose **Save** to save your script.

1. Choose **Run** to run your job.

1. Navigate to the **Runs** tab to verify that your job completes.

1. Navigate to *amzn-s3-demo-bucket*, the target for `write_dynamic_frame.from_options`. Confirm that the output matches your expectations. 

For more information about configuring and managing jobs, see [Providing your own custom scripts](console-custom-created.md).

## More information
<a name="aws-glue-programming-intro-tutorial-further-info"></a>

 Apache Spark libraries and methods are available in AWS Glue scripts. You can look at the Spark documentation to understand what you can do with those included libraries. For more information, see the [examples section of the Spark source repository](https://github.com/apache/spark/tree/master/examples/src/main/python). 

 AWS Glue versions 2.0 and later include several common Python libraries by default. There are also mechanisms for loading your own dependencies into an AWS Glue job in a Scala or Python environment. For information about Python dependencies, see [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md). 

For more examples of how to use AWS Glue features in Python, see [AWS Glue Python code samples](aws-glue-programming-python-samples.md). Scala and Python jobs have feature parity, so our Python examples should give you a sense of how to perform similar work in Scala. 

# Program AWS Glue ETL scripts in PySpark
<a name="aws-glue-programming-python"></a>

You can find Python code examples and utilities for AWS Glue in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

## Using Python with AWS Glue
<a name="aws-glue-programming-python-using"></a>

AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs. This section describes how to use Python in ETL scripts and with the AWS Glue API.
+ [Setting up to use Python with AWS Glue](aws-glue-programming-python-setup.md)
+ [Calling AWS Glue APIs in Python](aws-glue-programming-python-calling.md)
+ [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md)
+ [AWS Glue Python code samples](aws-glue-programming-python-samples.md)

## AWS Glue PySpark extensions
<a name="aws-glue-programming-python-extensions-list"></a>

AWS Glue has created the following extensions to the PySpark Python dialect.
+ [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md)
+ [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md)
+ [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md)
+ [DynamicFrameCollection class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection.md)
+ [DynamicFrameWriter class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.md)
+ [DynamicFrameReader class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md)
+ [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md)

## AWS Glue PySpark transforms
<a name="aws-glue-programming-python-transforms-list"></a>

AWS Glue has created the following transform classes to use in PySpark ETL operations.
+ [GlueTransform base class](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md)
+ [ApplyMapping class](aws-glue-api-crawler-pyspark-transforms-ApplyMapping.md)
+ [DropFields class](aws-glue-api-crawler-pyspark-transforms-DropFields.md)
+ [DropNullFields class](aws-glue-api-crawler-pyspark-transforms-DropNullFields.md)
+ [ErrorsAsDynamicFrame class](aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame.md)
+ [FillMissingValues class](aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.md)
+ [Filter class](aws-glue-api-crawler-pyspark-transforms-filter.md)
+ [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ [FindMatches class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ [FlatMap class](aws-glue-api-crawler-pyspark-transforms-flat-map.md)
+ [Join class](aws-glue-api-crawler-pyspark-transforms-join.md)
+ [Map class](aws-glue-api-crawler-pyspark-transforms-map.md)
+ [MapToCollection class](aws-glue-api-crawler-pyspark-transforms-MapToCollection.md)
+ [mergeDynamicFrame](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge)
+ [Relationalize class](aws-glue-api-crawler-pyspark-transforms-Relationalize.md)
+ [RenameField class](aws-glue-api-crawler-pyspark-transforms-RenameField.md)
+ [ResolveChoice class](aws-glue-api-crawler-pyspark-transforms-ResolveChoice.md)
+ [SelectFields class](aws-glue-api-crawler-pyspark-transforms-SelectFields.md)
+ [SelectFromCollection class](aws-glue-api-crawler-pyspark-transforms-SelectFromCollection.md)
+ [Spigot class](aws-glue-api-crawler-pyspark-transforms-spigot.md)
+ [SplitFields class](aws-glue-api-crawler-pyspark-transforms-SplitFields.md)
+ [SplitRows class](aws-glue-api-crawler-pyspark-transforms-SplitRows.md)
+ [Unbox class](aws-glue-api-crawler-pyspark-transforms-Unbox.md)
+ [UnnestFrame class](aws-glue-api-crawler-pyspark-transforms-UnnestFrame.md)

# Setting up to use Python with AWS Glue
<a name="aws-glue-programming-python-setup"></a>

Use Python to develop your ETL scripts for Spark jobs. The supported Python versions for ETL jobs depend on the AWS Glue version of the job. For more information on AWS Glue versions, see the [Glue version job property](add-job.md#glue-version-table).

**To set up your system for using Python with AWS Glue**

Follow these steps to install Python and to be able to invoke the AWS Glue APIs. 

1. If you don't already have Python installed, download and install it from the [Python.org download page](https://www.python.org/downloads/).

1. Install the AWS Command Line Interface (AWS CLI) as documented in the [AWS CLI documentation](https://docs.aws.amazon.com/cli/latest/userguide/installing.html).

   The AWS CLI is not directly necessary for using Python. However, installing and configuring it is a convenient way to set up AWS with your account credentials and verify that they work.

1. Install the AWS SDK for Python (Boto 3), as documented in the [Boto3 Quickstart](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html).

   Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be used.

   For more information about Boto 3, see [AWS SDK for Python (Boto3) Getting Started](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

You can find Python code examples and utilities for AWS Glue in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

# Calling AWS Glue APIs in Python
<a name="aws-glue-programming-python-calling"></a>

Note that Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be used.

## AWS Glue API names in Python
<a name="aws-glue-programming-python-calling-names"></a>

AWS Glue API names in Java and other programming languages are generally CamelCased. However, when called from Python, these generic names are changed to lowercase, with the parts of the name separated by underscore characters to make them more "Pythonic". In the [AWS Glue API](aws-glue-api.md) reference documentation, these Pythonic names are listed in parentheses after the generic CamelCased names.

However, although the AWS Glue API names themselves are transformed to lowercase, their parameter names remain capitalized. It is important to remember this, because parameters should be passed by name when calling AWS Glue APIs, as described in the following section.
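For example (this helper is purely illustrative and not part of Boto 3 or AWS Glue), the transformation from a CamelCased API name to its Pythonic method name can be sketched as:

```python
import re

def pythonic_name(camel_cased: str) -> str:
    """Convert a CamelCased AWS Glue API name to the snake_case method
    name that Boto 3 exposes (illustrative helper only)."""
    # Insert an underscore before each uppercase letter that follows a
    # lowercase letter or digit, then lowercase the whole string.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', camel_cased).lower()

print(pythonic_name('CreateJob'))       # create_job
print(pythonic_name('StartJobRun'))     # start_job_run
print(pythonic_name('GetDevEndpoint'))  # get_dev_endpoint
```

Note that the parameters passed to these methods (such as `JobName`) stay CamelCased, as described above.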

## Passing and accessing Python parameters in AWS Glue
<a name="aws-glue-programming-python-calling-parameters"></a>

In Python calls to AWS Glue APIs, it's best to pass parameters explicitly by name. For example:

```
job = glue.create_job(Name='sample', Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})
```

It is helpful to understand that Python creates a dictionary of the name/value tuples that you specify as arguments to an ETL script in a [Job structure](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-Job) or [JobRun structure](aws-glue-api-jobs-runs.md#aws-glue-api-jobs-runs-JobRun). Boto 3 then passes them to AWS Glue in JSON format by way of a REST API call. This means that you cannot rely on the order of the arguments when you access them in your script.

For example, suppose that you're starting a `JobRun` in a Python Lambda handler function, and you want to specify several parameters. Your code might look something like the following:

```
import boto3
from datetime import datetime, timedelta

client = boto3.client('glue')

def lambda_handler(event, context):
  last_hour_date_time = datetime.now() - timedelta(hours = 1)
  day_partition_value = last_hour_date_time.strftime("%Y-%m-%d")
  hour_partition_value = last_hour_date_time.strftime("%-H")

  response = client.start_job_run(
               JobName = 'my_test_Job',
               Arguments = {
                 '--day_partition_key':   'partition_0',
                 '--hour_partition_key':  'partition_1',
                 '--day_partition_value':  day_partition_value,
                 '--hour_partition_value': hour_partition_value } )
```

To access these parameters reliably in your ETL script, specify them by name using AWS Glue's `getResolvedOptions` function and then access them from the resulting dictionary:

```
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day partition key is:", args['day_partition_key'])
print("and the day partition value is:", args['day_partition_value'])
```

To preserve a parameter value that is a nested JSON string as it gets passed to your AWS Glue ETL job, you must encode the parameter string before starting the job run, and then decode the parameter string in your job script before referencing it. For example, consider the following argument string:

```
glue_client.start_job_run(JobName = "gluejobname", Arguments={
"--my_curly_braces_string": '{"a": {"b": {"c": [{"d": {"e": 42}}]}}}'
})
```

To pass this parameter correctly, encode the argument as a Base64-encoded string.

```
import base64
...
sample_string = '{"a": {"b": {"c": [{"d": {"e": 42}}]}}}'
sample_string_bytes = sample_string.encode("ascii")

base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
...
glue_client.start_job_run(JobName = "gluejobname", Arguments={
"--my_curly_braces_string": base64_string})
...
sample_string_bytes = base64.b64decode(base64_string)
sample_string = sample_string_bytes.decode("ascii")
print(f"Decoded string: {sample_string}")
...
```
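The same round trip can be wrapped in a pair of small helpers (the function names here are made up for illustration) so that encoding and decoding stay symmetric:

```python
import base64
import json

def encode_glue_arg(value: dict) -> str:
    """Serialize a nested structure to JSON and Base64-encode it so it
    survives the trip through AWS Glue job arguments (illustrative)."""
    return base64.b64encode(json.dumps(value).encode("utf-8")).decode("ascii")

def decode_glue_arg(encoded: str) -> dict:
    """Reverse of encode_glue_arg: Base64-decode, then parse the JSON."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))

payload = {"a": {"b": {"c": [{"d": {"e": 42}}]}}}
encoded = encode_glue_arg(payload)
assert decode_glue_arg(encoded) == payload
```

You would call `encode_glue_arg` before `start_job_run` and `decode_glue_arg` on the value returned by `getResolvedOptions` in the job script.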

## Example: Create and run a job
<a name="aws-glue-programming-python-calling-example"></a>

The following example shows how to call the AWS Glue APIs using Python to create and run an ETL job.

**To create and run a job**

1. Create an instance of the AWS Glue client:

   ```
   import boto3
   glue = boto3.client(service_name='glue', region_name='us-east-1',
                 endpoint_url='https://glue.us-east-1.amazonaws.com')
   ```

1. Create a job. You must use `glueetl` as the name for the ETL command, as shown in the following code:

   ```
   myJob = glue.create_job(Name='sample', Role='Glue_DefaultRole',
                             Command={'Name': 'glueetl',
                                      'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})
   ```

1. Start a new run of the job that you created in the previous step:

   ```
   myNewJobRun = glue.start_job_run(JobName=myJob['Name'])
   ```

1. Get the job status:

   ```
   status = glue.get_job_run(JobName=myJob['Name'], RunId=myNewJobRun['JobRunId'])
   ```

1. Print the current state of the job run:

   ```
   print(status['JobRun']['JobRunState'])
   ```
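If you want to block until the run completes, you can poll `get_job_run` until the job run reaches a terminal state. The terminal `JobRunState` values below follow the AWS Glue API reference; the polling helper itself is only a sketch that assumes a configured Boto 3 Glue client:

```python
import time

# Job run states that indicate the run has finished
# (see the JobRun documentation in the AWS Glue API reference).
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'}

def is_terminal(state: str) -> bool:
    """Return True when a JobRunState means the run is complete."""
    return state in TERMINAL_STATES

def wait_for_job_run(glue, job_name, run_id, poll_seconds=30):
    """Poll until the job run reaches a terminal state and return it
    (illustrative sketch; 'glue' is a Boto 3 Glue client)."""
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run['JobRun']['JobRunState']
        if is_terminal(state):
            return state
        time.sleep(poll_seconds)
```

For example, `wait_for_job_run(glue, myJob['Name'], myNewJobRun['JobRunId'])` would return `'SUCCEEDED'` for a successful run.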

# Using Python libraries with AWS Glue
<a name="aws-glue-programming-python-libraries"></a>

 You can install additional Python modules and libraries for use with AWS Glue ETL. For AWS Glue 2.0 and later, AWS Glue uses the Python Package Installer (pip3) to install the additional modules used by AWS Glue ETL. AWS Glue provides multiple options for bringing additional Python modules into your AWS Glue job environment. You can use the `--additional-python-modules` parameter to bring in new modules using zip files containing bundled Python wheels (also known as a "zip of wheels", available for AWS Glue 5.0 and later), individual Python wheel files, a requirements file (requirements.txt, available for AWS Glue 5.0 and later), or a comma-separated list of Python modules. You can also use this parameter to change the version of the Python modules provided in the AWS Glue environment (see [Python modules already provided in AWS Glue](#glue-modules-provided) for more details). 

**Topics**
+ [Installing additional Python modules with pip in AWS Glue 2.0 or later](#addl-python-modules-support)
+ [Including Python files with PySpark native features](#extra-py-files-support)
+ [Programming scripts that use visual transforms](#aws-glue-programming-with-cvt)
+ [Zipping libraries for inclusion](#aws-glue-programming-python-libraries-zipping)
+ [Loading Python libraries in AWS Glue Studio notebooks](#aws-glue-programming-python-libraries-notebooks)
+ [Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0](#aws-glue-programming-python-libraries-dev-endpoint)
+ [Using Python libraries in a job or JobRun](#aws-glue-programming-python-libraries-job)
+ [Proactively analyze Python dependencies](#aws-glue-programming-analyzing-python-dependencies)
+ [Python modules already provided in AWS Glue](#glue-modules-provided)
+ [Appendix A: Creating a Zip of Wheels Artifact](#glue-python-library-zip-of-wheels-appendix)
+ [Appendix B: AWS Glue environment details](#glue-python-libraries-environment-details)

## Installing additional Python modules with pip in AWS Glue 2.0 or later
<a name="addl-python-modules-support"></a>

AWS Glue uses the Python Package Installer (pip3) to install additional modules to be used by AWS Glue ETL. You can use the `--additional-python-modules` parameter with a list of comma-separated Python modules to add a new module or change the version of an existing module. You can install built wheel artifacts, either through a zip of wheels or as a standalone wheel file, by uploading the file to Amazon S3 and then including the path to the Amazon S3 object in your list of modules. For more information about setting job parameters, see [Using job parameters in AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html).

 You can pass additional options to pip3 with the `--python-modules-installer-option` parameter. For example, you could pass `--only-binary` to force pip to install only pre-built artifacts for the packages specified by `--additional-python-modules`. For more examples, see [ Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0 ](https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/). 
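Because the parameter value is a single comma-separated string with no spaces, it can help to assemble it from a list when you build the job configuration. The following is a sketch; the module pins and bucket path are placeholders:

```python
def build_module_arg(modules):
    """Join module specifiers into the comma-separated, space-free
    string that --additional-python-modules expects."""
    return ",".join(m.strip() for m in modules)

default_arguments = {
    "--additional-python-modules": build_module_arg([
        "scikit-learn==1.5.2",
        "s3://amzn-s3-demo-bucket/path/to/package-1.0.0-py3-none-any.whl",
    ]),
    # Extra flags passed straight through to pip3 (optional).
    "--python-modules-installer-option": "--only-binary :all:",
}
print(default_arguments["--additional-python-modules"])
```

The resulting dictionary could then be supplied as `DefaultArguments` when creating the job.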

### Best Practices for Python Dependency Management
<a name="glue-python-library-best-practices"></a>

For production workloads, AWS Glue recommends packaging all of your Python dependencies as wheel files in a single zip artifact. This approach provides:
+ **Deterministic execution**: Exact control over which package versions are installed
+ **Reliability**: No dependency on external package repositories during job execution
+ **Performance**: Single download operation instead of multiple network calls
+ **Offline installation**: Works in private VPC environments without internet access

#### Important Considerations
<a name="glue-python-library-important-considerations"></a>

Under the [AWS shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/), you are responsible for managing additional Python modules, libraries, and their dependencies. This includes:
+ **Security updates**: Regularly updating packages to address security vulnerabilities
+ **Version compatibility**: Ensuring packages are compatible with your AWS Glue version
+ **Testing**: Validating that your packaged dependencies work correctly in the Glue environment

If you have minimal dependencies, you may consider using individual wheel files instead.

### (Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using Zip of Wheels
<a name="glue-python-library-installing-zip-of-wheels"></a>

AWS Glue 5.0 and above supports packaging multiple wheel files into a single zip artifact containing bundled Python wheels for more reliable and deterministic dependency management. To use this approach, create a zip file containing all your wheel dependencies and their transitive dependencies with the `.gluewheels.zip` suffix, upload it to Amazon S3, and reference it using the `--additional-python-modules` parameter. Be sure to add `--no-index` to the `--python-modules-installer-option` job parameter. With this configuration, the zip of wheels file essentially acts as a local index for pip to resolve dependencies from at runtime. This eliminates dependencies on external package repositories like PyPI during job execution, providing greater stability and consistency for production workloads. For example: 

```
--additional-python-modules s3://amzn-s3-demo-bucket/path/to/zip-of-wheels-1.0.0.gluewheels.zip
--python-modules-installer-option --no-index
```

For instructions on how to create a zip of wheels file, see [Appendix A: Creating a Zip of Wheels Artifact](#glue-python-library-zip-of-wheels-appendix).

### Installing additional Python libraries using Wheel
<a name="glue-python-library-installing-wheel"></a>

AWS Glue supports installing custom Python packages using wheel (`.whl`) files stored in Amazon S3. To include wheel files in your AWS Glue jobs, provide a comma-separated list of the wheel files stored in Amazon S3 to the `--additional-python-modules` job parameter. For example: 

```
--additional-python-modules s3://amzn-s3-demo-bucket/path/to/package-1.0.0-py3-none-any.whl,s3://your-bucket/path/to/another-package-2.1.0-cp311-cp311-linux_x86_64.whl
```

This approach is also useful when you need custom distributions, or packages with native dependencies that are pre-compiled for the correct operating system. For more examples, see [Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0](https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/). 

### Installing additional Python libraries in AWS Glue 5.0 or above using requirements.txt
<a name="addl-python-modules-requirements-txt"></a>

In AWS Glue 5.0 and later, you can provide the de facto standard `requirements.txt` file to manage Python library dependencies. To do so, provide the following two job parameters:
+ Key: `--python-modules-installer-option`

  Value: `-r`
+ Key: `--additional-python-modules`

  Value: `s3://path_to_requirements.txt`

AWS Glue 5.0 nodes load the Python libraries specified in `requirements.txt` when they initialize.

Here's a sample requirements.txt:

```
awswrangler==3.9.1
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36
```

**Important**  
Use this option with caution, especially in production workloads. Pulling dependencies from PyPI at runtime is risky because you cannot be sure which artifact pip resolves to. Unpinned library versions are especially risky because pip pulls the latest version of each module, which can introduce breaking changes or bring in incompatible modules, and can cause the job to fail during Python installation in the AWS Glue job environment. While pinning library versions increases stability, pip resolution is still not fully deterministic, so similar issues can arise. As a best practice, AWS Glue recommends using frozen artifacts such as a zip of wheels or individual wheel files (see [(Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using Zip of Wheels](#glue-python-library-installing-zip-of-wheels) for more details). 

**Important**  
If you do not pin the versions of your transitive dependencies, a primary dependency may pull incompatible transitive dependency versions. As a best practice, all library versions should be pinned for increased consistency in AWS Glue jobs. Even better, AWS Glue recommends packaging your dependencies into a zip of wheels file to ensure maximum consistency and reliability for your production workloads. 
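A quick pre-flight check along these lines (purely illustrative, not an AWS tool) can catch unpinned entries before you upload a requirements.txt:

```python
def unpinned_requirements(requirements_text: str) -> list:
    """Return requirement lines that are not pinned with '=='
    (illustrative check; skips comments and blank lines)."""
    bad = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "==" not in line:
            bad.append(line)
    return bad

sample = """\
awswrangler==3.9.1
scipy>=1.14
PyYAML==6.0.2
"""
print(unpinned_requirements(sample))  # ['scipy>=1.14']
```

An empty result means every listed requirement is pinned to an exact version.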

### Installing additional Python libraries directly configuring as comma separated list
<a name="glue-python-library-installing-comma-separated-list"></a>

To update or add a new Python module, AWS Glue lets you pass the `--additional-python-modules` parameter with a comma-separated list of Python modules as the value. For example, to update or add the scikit-learn module, use the following key/value pair: `"--additional-python-modules", "scikit-learn==0.21.3"`. You have two options for directly configuring the Python modules.
+ **Pinned Python Module**

  `"--additional-python-modules", "scikit-learn==0.21.3,ephem==4.1.6"`
+ **Unpinned Python Module: (Not recommended for Production Workloads)**

  `"--additional-python-modules", "scikit-learn>=0.20.0,ephem>=4.0.0"`

  OR

  `"--additional-python-modules", "scikit-learn,ephem"`

**Important**  
Use this option with caution, especially in production workloads. Pulling dependencies from PyPI at runtime is risky because you cannot be sure which artifact pip resolves to. Unpinned library versions are especially risky because pip pulls the latest version of each module, which can introduce breaking changes or bring in incompatible modules, and can cause the job to fail during Python installation in the AWS Glue job environment. While pinning library versions increases stability, pip resolution is still not fully deterministic, so similar issues can arise. As a best practice, AWS Glue recommends using frozen artifacts such as a zip of wheels or individual wheel files (see [(Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using Zip of Wheels](#glue-python-library-installing-zip-of-wheels) for more details).

**Important**  
If you do not pin the versions of your transitive dependencies, a primary dependency may pull incompatible transitive dependency versions. As a best practice, all library versions should be pinned for increased consistency in AWS Glue jobs. Even better, AWS Glue recommends packaging your dependencies into a zip of wheels file to ensure maximum consistency and reliability for your production workloads. 

## Including Python files with PySpark native features
<a name="extra-py-files-support"></a>

AWS Glue uses PySpark to include Python files in AWS Glue ETL jobs. Use `--additional-python-modules` to manage your dependencies when it is available. You can use the `--extra-py-files` job parameter to include Python files. Dependencies must be hosted in Amazon S3, and the argument value should be a comma-delimited list of Amazon S3 paths with no spaces. This functionality behaves like the Python dependency management you would use with Spark. For more information about Python dependency management in Spark, see the [Using PySpark Native Features](https://spark.apache.org/docs/latest/api/python/tutorial/python_packaging.html#using-pyspark-native-features) page in the Apache Spark documentation. `--extra-py-files` is useful when your additional code is not packaged, or when you are migrating a Spark program with an existing toolchain for managing dependencies. For your dependency tooling to be maintainable, you must bundle your dependencies before submitting. 

## Programming scripts that use visual transforms
<a name="aws-glue-programming-with-cvt"></a>

 When you create an AWS Glue job using the AWS Glue Studio visual interface, you can transform your data with managed data transform nodes and custom visual transforms. For more information about managed data transform nodes, see [Transform data with AWS Glue managed transforms](edit-jobs-transforms.md). For more information about custom visual transforms, see [Transform data with custom visual transforms](custom-visual-transform.md). Scripts using visual transforms can be generated only when your job **Language** is set to use Python.

 When generating an AWS Glue job using visual transforms, AWS Glue Studio includes these transforms in the runtime environment using the `--extra-py-files` parameter in the job configuration. For more information about job parameters, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). When making changes to a generated script or runtime environment, you need to preserve this job configuration for your script to run successfully.

## Zipping libraries for inclusion
<a name="aws-glue-programming-python-libraries-zipping"></a>

Unless a library is contained in a single `.py` file, it should be packaged in a `.zip` archive. The package directory should be at the root of the archive, and must contain an `__init__.py` file for the package. Python will then be able to import the package in the normal way.

If your library only consists of a single Python module in one `.py` file, you do not need to place it in a `.zip` file.
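The layout described above can be verified end to end with Python's standard library: build a `.zip` with the package directory (and its `__init__.py`) at the root of the archive, put the archive on `sys.path`, and import from it. The package name `demo_pkg` is made up for this sketch:

```python
import os
import sys
import tempfile
import zipfile

# Build a minimal package zip in a temp directory: the package directory
# (with __init__.py) sits at the root of the archive.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "site-packages.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "GREETING = 'hello from the zip'\n")

# Python can import directly from a zip archive on sys.path, which is
# how a correctly structured library zip becomes importable.
sys.path.insert(0, zip_path)
import demo_pkg

print(demo_pkg.GREETING)  # hello from the zip
```

If the import fails with `ModuleNotFoundError`, the package directory is probably not at the root of the archive.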

## Loading Python libraries in AWS Glue Studio notebooks
<a name="aws-glue-programming-python-libraries-notebooks"></a>

 To specify Python libraries in AWS Glue Studio notebooks, see [Installing additional Python modules](https://docs.aws.amazon.com/glue/latest/dg/manage-notebook-sessions.html#specify-default-modules). 

## Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0
<a name="aws-glue-programming-python-libraries-dev-endpoint"></a>

If you are using different library sets for different ETL scripts, you can either set up a separate development endpoint for each set, or you can overwrite the library `.zip` file(s) that your development endpoint loads every time you switch scripts.

You can use the console to specify one or more library .zip files for a development endpoint when you create it. After assigning a name and an IAM role, choose **Script Libraries and job parameters (optional)** and enter the full Amazon S3 path to your library `.zip` file in the **Python library path** box. For example:

```
s3://bucket/prefix/site-packages.zip
```

If you want, you can specify multiple full paths to files, separating them with commas but no spaces, like this:

```
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
```

If you update these `.zip` files later, you can use the console to re-import them into your development endpoint. Navigate to the developer endpoint in question, check the box beside it, and choose **Update ETL libraries** from the **Action** menu.

In a similar way, you can specify library files using the AWS Glue APIs. When you create a development endpoint by calling [CreateDevEndpoint action (Python: create\_dev\_endpoint)](aws-glue-api-dev-endpoint.md#aws-glue-api-dev-endpoint-CreateDevEndpoint), you can specify one or more full paths to libraries in the `ExtraPythonLibsS3Path` parameter, in a call that looks like this:

```
dep = glue.create_dev_endpoint(
             EndpointName="testDevEndpoint",
             RoleArn="arn:aws:iam::123456789012",
             SecurityGroupIds="sg-7f5ad1ff",
             SubnetId="subnet-c12fdba4",
             PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...",
             NumberOfNodes=3,
             ExtraPythonLibsS3Path="s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip")
```

When you update a development endpoint, you can also update the libraries it loads by using a [DevEndpointCustomLibraries](aws-glue-api-dev-endpoint.md#aws-glue-api-dev-endpoint-DevEndpointCustomLibraries) object and setting the `UpdateEtlLibraries` parameter to `True` when calling [UpdateDevEndpoint (update\_dev\_endpoint)](aws-glue-api-dev-endpoint.md#aws-glue-api-dev-endpoint-UpdateDevEndpoint).

## Using Python libraries in a job or JobRun
<a name="aws-glue-programming-python-libraries-job"></a>

When you are creating a new Job on the console, you can specify one or more library .zip files by choosing **Script Libraries and job parameters (optional)** and entering the full Amazon S3 library path(s) in the same way you would when creating a development endpoint:

```
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
```

If you are calling [CreateJob (create\_job)](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-CreateJob), you can specify one or more full paths to default libraries using the `--extra-py-files` default parameter, like this:

```
job = glue.create_job(Name='sampleJob',
                      Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'},
                      DefaultArguments={'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'})
```

Then when you are starting a JobRun, you can override the default library setting with a different one:

```
runId = glue.start_job_run(JobName='sampleJob',
                           Arguments={'--extra-py-files': 's3://bucket/prefix/lib_B.zip'})
```
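The override behaves like a key-by-key dictionary merge, with the run's `Arguments` taking precedence over the job's `DefaultArguments`. The following sketch illustrates that behavior; it is not AWS Glue's actual implementation:

```python
def effective_arguments(default_arguments: dict, run_arguments: dict) -> dict:
    """Run arguments override job defaults key by key (illustrative)."""
    merged = dict(default_arguments)
    merged.update(run_arguments)
    return merged

defaults = {'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'}
overrides = {'--extra-py-files': 's3://bucket/prefix/lib_B.zip'}
print(effective_arguments(defaults, overrides))
# {'--extra-py-files': 's3://bucket/prefix/lib_B.zip'}
```

Keys you do not override in `start_job_run` keep the values set in `DefaultArguments`.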

## Proactively analyze Python dependencies
<a name="aws-glue-programming-analyzing-python-dependencies"></a>

 To proactively identify potential dependency issues before deploying to AWS Glue, you can use the dependency analysis tool to validate your Python packages against your target AWS Glue environment. 

 AWS provides an open-source Python dependency analyzer tool specifically designed for AWS Glue environments. This tool is available in the AWS Glue samples repository and can be used locally to validate your dependencies before deployment. 

 This analysis helps ensure your dependencies follow the recommended practice of pinning all library versions for consistent production deployments. For more details, see the tool's [README](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/glue_python_dependency_analyzer). 

### Using the AWS Glue Dependency Analyzer
<a name="w2aac67c11c14c18c37c11b1"></a>

 The AWS Glue Python Dependency Analyzer helps identify unpinned dependencies and version conflicts by simulating pip installation with platform-specific constraints that match your target AWS Glue environment. 

```
# Analyze a single Glue job
python glue_dependency_analyzer.py -j my-glue-job

# Analyze multiple jobs with specific AWS configuration
python glue_dependency_analyzer.py -j job1 -j job2 --aws-profile production --aws-region us-west-2
```

 The tool will flag: 
+  Unpinned dependencies that could install different versions across job runs 
+  Version conflicts between packages 
+  Dependencies not available for your target AWS Glue environment 

## Analyze and fix job failures due to Python dependencies with Amazon Q Developer
<a name="aws-glue-programming-analyze-job-failures-with-amazon-q"></a>

 Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. You can download it by following the instructions in the Getting started guide for Amazon Q. 

 Amazon Q Developer can be used to analyze and fix job failures caused by Python dependency issues. We suggest using the following prompt, replacing the <Job-Name> placeholder with the name of your AWS Glue job. 

```
I have an AWS Glue job named <Job-Name> that has failed due to Python module installation conflicts. Please assist in diagnosing and resolving this issue using the following systematic approach. Proceed once sufficient information is available.

Objective: Implement a fix that addresses the root cause module while minimizing disruption to the existing working environment.

Step 1: Root Cause Analysis
• Retrieve the most recent failed job run ID for the specified Glue job
• Extract error logs from CloudWatch Logs using the job run ID as a log stream prefix
• Analyze the logs to identify:
  • The recently added or modified Python module that triggered the dependency conflict
  • The specific dependency chain causing the installation failure
  • Version compatibility conflicts between required and existing modules

Step 2: Baseline Configuration Identification
• Locate the last successful job run ID prior to the dependency failure
• Document the Python module versions that were functioning correctly in that baseline run
• Establish the compatible version constraints for conflicting dependencies

Step 3: Targeted Resolution Implementation
• Apply pinning by updating the job's additional_python_modules parameter
• Pin only the root cause module and its directly conflicting dependencies to compatible versions, and do not remove python modules unless necessary
• Preserve flexibility for non-conflicting modules by avoiding unnecessary version constraints
• Deploy the configuration changes with minimal changes to the existing configuration and execute a validation test run. Do not change the Glue versions.

Implementation Example:
Scenario: Recently added pandas==2.0.0 to additional_python_modules
Error: numpy version conflict (pandas 2.0.0 requires numpy>=1.21, but existing job code requires numpy<1.20)
Resolution: Update additional_python_modules to "pandas==1.5.3,numpy==1.19.5"
Rationale: Use pandas 1.5.3 (compatible with numpy 1.19.5) and pin numpy to last known working version

Expected Outcome: Restore job functionality with minimal configuration changes while maintaining system stability.
```

 The prompt instructs Q to: 

1. Fetch the latest failed job run ID

1. Find associated logs and details

1. Find successful job runs to detect any changed Python packages

1. Make any configuration fixes and trigger another test run
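The first and last of these steps can be sketched with plain Python helpers; the run list and default arguments below mirror the shapes returned by the AWS Glue `get_job_runs` and `get_job` APIs, and the module pins are illustrative:

```python
def latest_run_id(job_runs, state="FAILED"):
    """Return the Id of the most recent run in the given state, or None.

    job_runs is the "JobRuns" list from the Glue get_job_runs API,
    which lists runs most recent first.
    """
    for run in job_runs:
        if run.get("JobRunState") == state:
            return run["Id"]
    return None


def pin_modules(default_args, pins):
    """Return a copy of a job's DefaultArguments with module versions pinned
    through --additional-python-modules, leaving other arguments untouched."""
    args = dict(default_args)
    args["--additional-python-modules"] = ",".join(
        f"{mod}=={ver}" for mod, ver in pins.items())
    return args
```

With boto3, the run list comes from `boto3.client("glue").get_job_runs(JobName=job_name)["JobRuns"]`, the error logs from the CloudWatch log group `/aws-glue/jobs/error` filtered with the run ID as the log stream prefix, and the updated arguments go back through `update_job`.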

## Python modules already provided in AWS Glue
<a name="glue-modules-provided"></a>

To change the version of these provided modules, provide new versions with the `--additional-python-modules` job parameter.
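For example, to override the provided pandas and NumPy versions, set the job parameter as follows (the versions shown are illustrative):

```
--additional-python-modules pandas==2.2.2,numpy==1.26.4
```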

------
#### [ AWS Glue version 5.1 ]

AWS Glue version 5.1 includes the following Python modules out of the box:
+ aiobotocore==2.25.1
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.13.2
+ aioitertools==0.12.0
+ aiosignal==1.4.0
+ appdirs==1.4.4
+ attrs==25.4.0
+ boto3==1.40.61
+ botocore==1.40.61
+ certifi==2025.10.5
+ charset-normalizer==3.4.4
+ choreographer==1.2.0
+ contourpy==1.3.3
+ cycler==0.12.1
+ distlib==0.4.0
+ filelock==3.20.0
+ fonttools==4.60.1
+ frozenlist==1.8.0
+ fsspec==2025.10.0
+ idna==3.11
+ iniconfig==2.3.0
+ jmespath==1.0.1
+ kaleido==1.2.0
+ kiwisolver==1.4.9
+ logistro==2.0.1
+ matplotlib==3.10.7
+ multidict==6.7.0
+ narwhals==2.10.2
+ numpy==2.3.4
+ orjson==3.11.4
+ packaging==25.0
+ pandas==2.3.3
+ pillow==12.0.0
+ pip==24.0
+ platformdirs==4.5.0
+ plotly==6.4.0
+ pluggy==1.6.0
+ propcache==0.4.1
+ pyarrow==22.0.0
+ Pygments==2.19.2
+ pyparsing==3.2.5
+ pytest-timeout==2.4.0
+ pytest==8.4.2
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ requests==2.32.5
+ s3fs==2025.10.0
+ s3transfer==0.14.0
+ seaborn==0.13.2
+ setuptools==79.0.1
+ simplejson==3.20.2
+ six==1.17.0
+ tenacity==9.1.2
+ typing_extensions==4.15.0
+ tzdata==2025.2
+ urllib3==2.5.0
+ uv==0.9.7
+ virtualenv==20.35.4
+ wrapt==1.17.3
+ yarl==1.22.0

------
#### [ AWS Glue version 5.0 ]

AWS Glue version 5.0 includes the following Python modules out of the box:
+ aiobotocore==2.13.1
+ aiohappyeyeballs==2.3.5
+ aiohttp==3.10.1
+ aioitertools==0.11.0
+ aiosignal==1.3.1
+ appdirs==1.4.4
+ attrs==24.2.0
+ boto3==1.34.131
+ botocore==1.34.131
+ certifi==2024.7.4
+ charset-normalizer==3.3.2
+ contourpy==1.2.1
+ cycler==0.12.1
+ fonttools==4.53.1
+ frozenlist==1.4.1
+ fsspec==2024.6.1
+ idna==2.10
+ jmespath==0.10.0
+ kaleido==0.2.1
+ kiwisolver==1.4.5
+ matplotlib==3.9.0
+ multidict==6.0.5
+ numpy==1.26.4
+ packaging==24.1
+ pandas==2.2.2
+ pillow==10.4.0
+ pip==23.0.1
+ plotly==5.23.0
+ pyarrow==17.0.0
+ pyparsing==3.1.2
+ python-dateutil==2.9.0.post0
+ pytz==2024.1
+ requests==2.32.2
+ s3fs==2024.6.1
+ s3transfer==0.10.2
+ seaborn==0.13.2
+ setuptools==59.6.0
+ six==1.16.0
+ tenacity==9.0.0
+ tzdata==2024.1
+ urllib3==1.25.10
+ virtualenv==20.4.0
+ wrapt==1.16.0
+ yarl==1.9.4

------
#### [ AWS Glue version 4.0 ]

AWS Glue version 4.0 includes the following Python modules out of the box:
+ aiobotocore==2.4.1
+ aiohttp==3.8.3
+ aioitertools==0.11.0
+ aiosignal==1.3.1
+ async-timeout==4.0.2
+ asynctest==0.13.0
+ attrs==22.2.0
+ avro-python3==1.10.2
+ boto3==1.24.70
+ botocore==1.27.59
+ certifi==2021.5.30
+ chardet==3.0.4
+ charset-normalizer==2.1.1
+ click==8.1.3
+ cycler==0.10.0
+ Cython==0.29.32
+ fsspec==2021.8.1
+ idna==2.10
+ importlib-metadata==5.0.0
+ jmespath==0.10.0
+ joblib==1.0.1
+ kaleido==0.2.1
+ kiwisolver==1.4.4
+ matplotlib==3.4.3
+ mpmath==1.2.1
+ multidict==6.0.4
+ nltk==3.7
+ numpy==1.23.5
+ packaging==23.0
+ pandas==1.5.1
+ patsy==0.5.1
+ Pillow==9.4.0
+ pip==23.0.1
+ plotly==5.16.0
+ pmdarima==2.0.1
+ ptvsd==4.3.2
+ pyarrow==10.0.0
+ pydevd==2.5.0
+ pyhocon==0.3.58
+ PyMySQL==1.0.2
+ pyparsing==2.4.7
+ python-dateutil==2.8.2
+ pytz==2021.1
+ PyYAML==6.0.1
+ regex==2022.10.31
+ requests==2.23.0
+ s3fs==2022.11.0
+ s3transfer==0.6.0
+ scikit-learn==1.1.3
+ scipy==1.9.3
+ setuptools==49.1.3
+ six==1.16.0
+ statsmodels==0.13.5
+ subprocess32==3.5.4
+ sympy==1.8
+ tbats==1.1.0
+ threadpoolctl==3.1.0
+ tqdm==4.64.1
+ typing_extensions==4.4.0
+ urllib3==1.25.11
+ wheel==0.37.0
+ wrapt==1.14.1
+ yarl==1.8.2
+ zipp==3.10.0

------
#### [ AWS Glue version 3.0 ]

AWS Glue version 3.0 includes the following Python modules out of the box:
+ aiobotocore==1.4.2
+ aiohttp==3.8.3
+ aioitertools==0.11.0
+ aiosignal==1.3.1
+ async-timeout==4.0.2
+ asynctest==0.13.0
+ attrs==22.2.0
+ avro-python3==1.10.2
+ boto3==1.18.50
+ botocore==1.21.50
+ certifi==2021.5.30
+ chardet==3.0.4
+ charset-normalizer==2.1.1
+ click==8.1.3
+ cycler==0.10.0
+ Cython==0.29.4
+ docutils==0.17.1
+ enum34==1.1.10
+ frozenlist==1.3.3
+ fsspec==2021.8.1
+ idna==2.10
+ importlib-metadata==6.0.0
+ jmespath==0.10.0
+ joblib==1.0.1
+ kiwisolver==1.3.2
+ matplotlib==3.4.3
+ mpmath==1.2.1
+ multidict==6.0.4
+ nltk==3.6.3
+ numpy==1.19.5
+ packaging==23.0
+ pandas==1.3.2
+ patsy==0.5.1
+ Pillow==9.4.0
+ pip==23.0
+ pmdarima==1.8.2
+ ptvsd==4.3.2
+ pyarrow==5.0.0
+ pydevd==2.5.0
+ pyhocon==0.3.58
+ PyMySQL==1.0.2
+ pyparsing==2.4.7
+ python-dateutil==2.8.2
+ pytz==2021.1
+ PyYAML==5.4.1
+ regex==2022.10.31
+ requests==2.23.0
+ s3fs==2021.8.1
+ s3transfer==0.5.0
+ scikit-learn==0.24.2
+ scipy==1.7.1
+ six==1.16.0
+ Spark==1.0
+ statsmodels==0.12.2
+ subprocess32==3.5.4
+ sympy==1.8
+ tbats==1.1.0
+ threadpoolctl==3.1.0
+ tqdm==4.64.1
+ typing_extensions==4.4.0
+ urllib3==1.25.11
+ wheel==0.37.0
+ wrapt==1.14.1
+ yarl==1.8.2
+ zipp==3.12.0

------
#### [ AWS Glue version 2.0 ]

AWS Glue version 2.0 includes the following Python modules out of the box:
+ avro-python3==1.10.0
+ awscli==1.27.60
+ boto3==1.12.4
+ botocore==1.15.4
+ certifi==2019.11.28
+ chardet==3.0.4
+ click==8.1.3
+ colorama==0.4.4
+ cycler==0.10.0
+ Cython==0.29.15
+ docutils==0.15.2
+ enum34==1.1.9
+ fsspec==0.6.2
+ idna==2.9
+ importlib-metadata==6.0.0
+ jmespath==0.9.4
+ joblib==0.14.1
+ kiwisolver==1.1.0
+ matplotlib==3.1.3
+ mpmath==1.1.0
+ nltk==3.5
+ numpy==1.18.1
+ pandas==1.0.1
+ patsy==0.5.1
+ pmdarima==1.5.3
+ ptvsd==4.3.2
+ pyarrow==0.16.0
+ pyasn1==0.4.8
+ pydevd==1.9.0
+ pyhocon==0.3.54
+ PyMySQL==0.9.3
+ pyparsing==2.4.6
+ python-dateutil==2.8.1
+ pytz==2019.3
+ PyYAML==5.3.1
+ regex==2022.10.31
+ requests==2.23.0
+ rsa==4.7.2
+ s3fs==0.4.0
+ s3transfer==0.3.3
+ scikit-learn==0.22.1
+ scipy==1.4.1
+ setuptools==45.2.0
+ six==1.14.0
+ Spark==1.0
+ statsmodels==0.11.1
+ subprocess32==3.5.4
+ sympy==1.5.1
+ tbats==1.0.9
+ tqdm==4.64.1
+ typing-extensions==4.4.0
+ urllib3==1.25.8
+ wheel==0.35.1
+ zipp==3.12.0

------

## Appendix A: Creating a Zip of Wheels Artifact
<a name="glue-python-library-zip-of-wheels-appendix"></a>

The following example demonstrates how to create a zip of wheels artifact. It downloads the `cryptography` and `scipy` packages into a zip of wheels and copies the archive to an Amazon S3 location.

1. You must run the commands that create the zip of wheels in an Amazon Linux environment similar to the AWS Glue environment. See [Appendix B: AWS Glue environment details](#glue-python-libraries-environment-details). AWS Glue 5.1 uses AL2023 with Python 3.11. Create a Dockerfile that builds this environment:

   ```
   FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
   
   # Install Python 3.11, pip, and zip utility
   RUN dnf install -y python3.11 pip zip && \
       dnf clean all
   
   WORKDIR /build
   ```

1. Create a `requirements.txt` file:

   ```
   cryptography
   scipy
   ```

1. Build the image and start the Docker container:

   ```
   # Build docker image
   docker build --platform linux/amd64 -t glue-wheel-builder .
   
   # Spin up container
   docker run --platform linux/amd64 -v $(pwd)/requirements.txt:/input/requirements.txt:ro -v $(pwd):/output -it glue-wheel-builder bash
   ```

1. Run the following commands inside the Docker container:

   ```
   # Create a directory for the wheels
   mkdir wheels
   
   # Copy requirements.txt into wheels directory
   cp /input/requirements.txt wheels/
   
   # Download the wheels with the correct platform and Python version
   pip3 download \
       -r wheels/requirements.txt \
       --dest wheels/ \
       --platform manylinux2014_x86_64 \
       --python-version 311 \
       --only-binary=:all:
   
   # Package the wheels into a zip archive with the .gluewheels.zip suffix
   zip -r mylibraries-1.0.0.gluewheels.zip wheels/
   
   # Copy zip to output
   cp mylibraries-1.0.0.gluewheels.zip /output/
   
   # Exit the container
   exit
   ```

1. Upload the zip of wheels to an Amazon S3 location:

   ```
   aws s3 cp mylibraries-1.0.0.gluewheels.zip s3://amzn-s3-demo-bucket/example-prefix/
   ```

1. Optional cleanup

   ```
   rm mylibraries-1.0.0.gluewheels.zip
   rm Dockerfile
   rm requirements.txt
   ```

1. Run the AWS Glue job with the following job arguments:

   ```
   --additional-python-modules s3://amzn-s3-demo-bucket/example-prefix/mylibraries-1.0.0.gluewheels.zip
   --python-modules-installer-option --no-index
   ```

## Appendix B: AWS Glue environment details
<a name="glue-python-libraries-environment-details"></a>


**Glue Version Compatibility and Installation Methods**  

| AWS Glue version | Python version | Base image | glibc version | Compatible platform tags | 
| --- | --- | --- | --- | --- | 
| 5.1 | 3.11 | [Amazon Linux 2023 (AL2023)](https://aws.amazon.com/linux/amazon-linux-2023/) | 2.34 |  manylinux_2_34_x86_64 manylinux_2_28_x86_64 manylinux2014_x86_64  | 
| 5.0 | 3.11 | [Amazon Linux 2023 (AL2023)](https://aws.amazon.com/linux/amazon-linux-2023/) | 2.34 |  manylinux_2_34_x86_64 manylinux_2_28_x86_64 manylinux2014_x86_64  | 
| 4.0 | 3.10 | [Amazon Linux 2 (AL2)](https://aws.amazon.com/amazon-linux-2/) | 2.26 | manylinux2014_x86_64 | 
| 3.0 | 3.7 | [Amazon Linux 2 (AL2)](https://aws.amazon.com/amazon-linux-2/) | 2.26 | manylinux2014_x86_64 | 
| 2.0 | 3.7 | [Amazon Linux AMI (AL1)](https://aws.amazon.com/amazon-linux-ami/) | 2.17 | manylinux2014_x86_64 | 

 Under the [AWS shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/), you are responsible for the management of additional Python modules, libraries, and their dependencies that you use with your AWS Glue ETL jobs. This includes applying updates and security patches. 

 AWS Glue does not support compiling native code in the job environment. However, AWS Glue jobs run within an Amazon-managed Linux environment. You may be able to provide your native dependencies in a compiled form through a Python wheel file. Refer to the table above for AWS Glue version compatibility details. 

**Important**  
 Using incompatible dependencies can result in runtime issues, particularly for libraries with native extensions that must match the target environment's architecture and system libraries. Each AWS Glue version runs on a specific Python version with pre-installed libraries and system configurations. 
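As a rough illustration of how the platform tags in the table gate wheel compatibility, the following sketch parses the tag triple from a wheel filename (per the PEP 427 naming convention) and checks it against a Glue version's supported tags. The tag sets are transcribed from the table above; this is a simplification for illustration, not a substitute for pip's full tag resolution:

```python
def wheel_tags(filename):
    """Extract the (python, abi, platform) tags from a wheel filename.

    Wheel filenames follow PEP 427:
    {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    """
    stem = filename[: -len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


# Supported platform tags per AWS Glue version (from the table above)
GLUE_PLATFORM_TAGS = {
    "5.1": {"manylinux_2_34_x86_64", "manylinux_2_28_x86_64", "manylinux2014_x86_64"},
    "5.0": {"manylinux_2_34_x86_64", "manylinux_2_28_x86_64", "manylinux2014_x86_64"},
    "4.0": {"manylinux2014_x86_64"},
    "3.0": {"manylinux2014_x86_64"},
    "2.0": {"manylinux2014_x86_64"},
}


def is_compatible(filename, glue_version):
    """Check a wheel's platform tag against a Glue version's supported tags.

    Pure-Python wheels carry the 'any' platform tag and run everywhere."""
    _, _, platform = wheel_tags(filename)
    return platform == "any" or platform in GLUE_PLATFORM_TAGS.get(glue_version, set())
```

For example, `numpy-1.26.4-cp311-cp311-manylinux2014_x86_64.whl` is accepted by every listed Glue version, while a wheel tagged only `manylinux_2_28_x86_64` would pass for Glue 5.x but not Glue 4.0.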

# AWS Glue Python code samples
<a name="aws-glue-programming-python-samples"></a>
+ [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md)
+ [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md)

# Code example: Joining and relationalizing data
<a name="aws-glue-programming-python-samples-legislators"></a>

This example uses a dataset that was downloaded from [http://everypolitician.org/](http://everypolitician.org/) to the `sample-dataset` bucket in Amazon Simple Storage Service (Amazon S3): `s3://awsglue-datasets/examples/us-legislators/all`. The dataset contains data in JSON format about United States legislators and the seats that they have held in the US House of Representatives and Senate, and has been modified slightly and made available in a public Amazon S3 bucket for purposes of this tutorial.

You can find the source code for this example in the `join_and_relationalize.py` file in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

Using this data, this tutorial shows you how to do the following:
+ Use an AWS Glue crawler to classify objects that are stored in a public Amazon S3 bucket and save their schemas into the AWS Glue Data Catalog.
+ Examine the table metadata and schemas that result from the crawl.
+ Write a Python extract, transfer, and load (ETL) script that uses the metadata in the Data Catalog to do the following:
  + Join the data in the different source files together into a single data table (that is, denormalize the data).
  + Filter the joined table into separate tables by type of legislator.
  + Write out the resulting data to separate Apache Parquet files for later analysis.

The preferred way to debug Python or PySpark scripts while running on AWS is to use [Notebooks on AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html).

## Step 1: Crawl the data in the Amazon S3 bucket
<a name="aws-glue-programming-python-samples-legislators-crawling"></a>

1. Sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Following the steps in [Configuring a crawler](define-crawler.md), create a new crawler that can crawl the `s3://awsglue-datasets/examples/us-legislators/all` dataset into a database named `legislators` in the AWS Glue Data Catalog. The example data is already in this public Amazon S3 bucket.

1. Run the new crawler, and then check the `legislators` database. 

   The crawler creates the following metadata tables:
   + `persons_json`
   + `memberships_json`
   + `organizations_json`
   + `events_json`
   + `areas_json`
   + `countries_r_json`

   This is a semi-normalized collection of tables containing legislators and their histories.

## Step 2: Add boilerplate script to the development endpoint notebook
<a name="aws-glue-programming-python-samples-legislators-boilerplate"></a>

Paste the following boilerplate script into the development endpoint notebook to import the AWS Glue libraries that you need, and set up a single `GlueContext`:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
```

## Step 3: Examine the schemas from the data in the Data Catalog
<a name="aws-glue-programming-python-samples-legislators-schemas"></a>

Next, you can create a `DynamicFrame` from the AWS Glue Data Catalog and examine the schemas of the data. For example, to see the schema of the `persons_json` table, add the following in your notebook:

```
persons = glueContext.create_dynamic_frame.from_catalog(
             database="legislators",
             table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
```

Here's the output from the print calls:

```
Count:  1961
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
```

Each person in the table is a member of some US congressional body.

To view the schema of the `memberships_json` table, type the following:

```
memberships = glueContext.create_dynamic_frame.from_catalog(
                 database="legislators",
                 table_name="memberships_json")
print("Count: ", memberships.count())
memberships.printSchema()
```

The output is as follows:

```
Count:  10439
root
|-- area_id: string
|-- on_behalf_of_id: string
|-- organization_id: string
|-- role: string
|-- person_id: string
|-- legislative_period_id: string
|-- start_date: string
|-- end_date: string
```

The `organizations` are parties and the two chambers of Congress, the Senate and House of Representatives. To view the schema of the `organizations_json` table, type the following:

```
orgs = glueContext.create_dynamic_frame.from_catalog(
           database="legislators",
           table_name="organizations_json")
print("Count: ", orgs.count())
orgs.printSchema()
```

The output is as follows:

```
Count:  13
root
|-- classification: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- id: string
|-- name: string
|-- seats: int
|-- type: string
```

## Step 4: Filter the data
<a name="aws-glue-programming-python-samples-legislators-filtering"></a>

Next, keep only the fields that you want, and rename `id` to `org_id`. The dataset is small enough that you can view the whole thing. 

The `toDF()` method converts a `DynamicFrame` to an Apache Spark `DataFrame`, so you can apply the transforms that already exist in Apache Spark SQL:

```
orgs = orgs.drop_fields(['other_names',
                        'identifiers']).rename_field(
                            'id', 'org_id').rename_field(
                               'name', 'org_name')
orgs.toDF().show()
```

The following shows the output:

```
+--------------+--------------------+--------------------+--------------------+-----+-----------+--------------------+
|classification|              org_id|            org_name|               links|seats|       type|               image|
+--------------+--------------------+--------------------+--------------------+-----+-----------+--------------------+
|         party|            party/al|                  AL|                null| null|       null|                null|
|         party|      party/democrat|            Democrat|[[website,http://...| null|       null|https://upload.wi...|
|         party|party/democrat-li...|    Democrat-Liberal|[[website,http://...| null|       null|                null|
|   legislature|d56acebe-8fdc-47b...|House of Represen...|                null|  435|lower house|                null|
|         party|   party/independent|         Independent|                null| null|       null|                null|
|         party|party/new_progres...|     New Progressive|[[website,http://...| null|       null|https://upload.wi...|
|         party|party/popular_dem...|    Popular Democrat|[[website,http://...| null|       null|                null|
|         party|    party/republican|          Republican|[[website,http://...| null|       null|https://upload.wi...|
|         party|party/republican-...|Republican-Conser...|[[website,http://...| null|       null|                null|
|         party|      party/democrat|            Democrat|[[website,http://...| null|       null|https://upload.wi...|
|         party|   party/independent|         Independent|                null| null|       null|                null|
|         party|    party/republican|          Republican|[[website,http://...| null|       null|https://upload.wi...|
|   legislature|8fa6c3d2-71dc-478...|              Senate|                null|  100|upper house|                null|
+--------------+--------------------+--------------------+--------------------+-----+-----------+--------------------+
```

Type the following to view the `organizations` that appear in `memberships`:

```
memberships.select_fields(['organization_id']).toDF().distinct().show()
```

The following shows the output:

```
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+
```

## Step 5: Put it all together
<a name="aws-glue-programming-python-samples-legislators-joining"></a>

Now, use AWS Glue to join these relational tables and create one full history table of legislator `memberships` and their corresponding `organizations`.

1. First, join `persons` and `memberships` on `id` and `person_id`.

1. Next, join the result with `orgs` on `org_id` and `organization_id`.

1. Then, drop the redundant fields, `person_id` and `org_id`.

You can do all these operations in one (extended) line of code:

```
l_history = Join.apply(orgs,
                       Join.apply(persons, memberships, 'id', 'person_id'),
                       'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
print("Count: ", l_history.count())
l_history.printSchema()
```

The output is as follows:

```
Count:  10439
root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- death_date: string
|-- legislative_period_id: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- image: string
|-- given_name: string
|-- family_name: string
|-- id: string
|-- start_date: string
|-- end_date: string
```

You now have the final table that you can use for analysis. You can write it out in a compact, efficient format for analytics—namely Parquet—that you can run SQL over in AWS Glue, Amazon Athena, or Amazon Redshift Spectrum.

The following call writes the table across multiple files to support fast parallel reads when doing analysis later:

```
glueContext.write_dynamic_frame.from_options(frame = l_history,
          connection_type = "s3",
          connection_options = {"path": "s3://glue-sample-target/output-dir/legislator_history"},
          format = "parquet")
```

To put all the history data into a single file, you must convert it to a data frame, repartition it, and write it out:

```
s_history = l_history.toDF().repartition(1)
s_history.write.parquet('s3://glue-sample-target/output-dir/legislator_single')
```

Or, if you want to separate it by the Senate and the House:

```
l_history.toDF().write.parquet('s3://glue-sample-target/output-dir/legislator_part',
                               partitionBy=['org_name'])
```

## Step 6: Transform the data for relational databases
<a name="aws-glue-programming-python-samples-legislators-writing"></a>

AWS Glue makes it easy to write the data to relational databases like Amazon Redshift, even with semi-structured data. It offers a transform `relationalize`, which flattens `DynamicFrames` no matter how complex the objects in the frame might be.

Using the `l_history` `DynamicFrame` in this example, pass in the name of a root table (`hist_root`) and a temporary working path to `relationalize`. This returns a `DynamicFrameCollection`. You can then list the names of the `DynamicFrames` in that collection:

```
dfc = l_history.relationalize("hist_root", "s3://glue-sample-target/temp-dir/")
dfc.keys()
```

The following is the output of the `keys` call:

```
['hist_root', 'hist_root_contact_details', 'hist_root_links',
 'hist_root_other_names', 'hist_root_images', 'hist_root_identifiers']
```

`Relationalize` broke the history table out into six new tables: a root table that contains a record for each object in the `DynamicFrame`, and auxiliary tables for the arrays. Array handling in relational databases is often suboptimal, especially as those arrays become large. Separating the arrays into different tables makes the queries go much faster.

Next, look at the separation by examining `contact_details`:

```
l_history.select_fields('contact_details').printSchema()
dfc.select('hist_root_contact_details').toDF().where("id = 10 or id = 75").orderBy(['id','index']).show()
```

The following is the output of the `show` call:

```
root
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 10|    0|                     fax|                         |
| 10|    1|                        |             202-225-1314|
| 10|    2|                   phone|                         |
| 10|    3|                        |             202-225-3772|
| 10|    4|                 twitter|                         |
| 10|    5|                        |          MikeRossUpdates|
| 75|    0|                     fax|                         |
| 75|    1|                        |             202-225-7856|
| 75|    2|                   phone|                         |
| 75|    3|                        |             202-225-2711|
| 75|    4|                 twitter|                         |
| 75|    5|                        |                SenCapito|
+---+-----+------------------------+-------------------------+
```

The `contact_details` field was an array of structs in the original `DynamicFrame`. Each element of those arrays is a separate row in the auxiliary table, indexed by `index`. The `id` here is a foreign key into the `hist_root` table with the key `contact_details`:

```
dfc.select('hist_root').toDF().where(
    "contact_details = 10 or contact_details = 75").select(
       ['id', 'given_name', 'family_name', 'contact_details']).show()
```

The following is the output:

```
+--------------------+----------+-----------+---------------+
|                  id|given_name|family_name|contact_details|
+--------------------+----------+-----------+---------------+
|f4fc30ee-7b42-432...|      Mike|       Ross|             10|
|e3c60f34-7d1b-4c0...|   Shelley|     Capito|             75|
+--------------------+----------+-----------+---------------+
```

Notice in these commands that `toDF()` and then a `where` expression are used to filter for the rows that you want to see.

So, joining the `hist_root` table with the auxiliary tables lets you do the following:
+ Load data into databases without array support.
+ Query each individual item in an array using SQL.

Safely store and access your Amazon Redshift credentials with an AWS Glue connection. For information about how to create your own connection, see [Connecting to data](glue-connections.md).

You are now ready to write your data to a connection by cycling through the `DynamicFrames` one at a time:

```
for df_name in dfc.keys():
  m_df = dfc.select(df_name)
  print("Writing to table: ", df_name)
  glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, connection settings here)
```

Your connection settings will differ based on your type of relational database: 
+ For instructions on writing to Amazon Redshift, consult [Redshift connections](aws-glue-programming-etl-connect-redshift-home.md).
+ For other databases, consult [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

## Conclusion
<a name="aws-glue-programming-python-samples-legislators-conclusion"></a>

Overall, AWS Glue is very flexible. It lets you accomplish, in a few lines of code, what normally would take days to write. You can find the entire source-to-target ETL scripts in the Python file `join_and_relationalize.py` in the [AWS Glue samples](https://github.com/awslabs/aws-glue-samples) on GitHub.

# Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping
<a name="aws-glue-programming-python-samples-medicaid"></a>

The dataset that is used in this example consists of Medicare Provider payment data that was downloaded from two [Data.CMS.gov](https://data.cms.gov) data sets: "Inpatient Prospective Payment System Provider Summary for the Top 100 Diagnosis-Related Groups - FY2011" and "Inpatient Charge Data FY 2011". After downloading the data, we modified the dataset to introduce a couple of erroneous records at the end of the file. This modified file is located in a public Amazon S3 bucket at `s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv`.

You can find the source code for this example in the `data_cleaning_and_lambda.py` file in the [AWS Glue examples](https://github.com/awslabs/aws-glue-samples) GitHub repository.

The preferred way to debug Python or PySpark scripts while running on AWS is to use [Notebooks on AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html).

## Step 1: Crawl the data in the Amazon S3 bucket
<a name="aws-glue-programming-python-samples-medicaid-crawling"></a>

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Following the process described in [Configuring a crawler](define-crawler.md), create a new crawler that can crawl the `s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv` file, and can place the resulting metadata into a database named `payments` in the AWS Glue Data Catalog.

1. Run the new crawler, and then check the `payments` database. You should find that the crawler has created a metadata table named `medicare` in the database after reading the beginning of the file to determine its format and delimiter.

   The schema of the new `medicare` table is as follows:

   ```
   Column  name                            Data type
   ==================================================
   drg definition                             string
   provider id                                bigint
   provider name                              string
   provider street address                    string
   provider city                              string
   provider state                             string
   provider zip code                          bigint
   hospital referral region description       string
   total discharges                           bigint
   average covered charges                    string
   average total payments                     string
   average medicare payments                  string
   ```

## Step 2: Add boilerplate script to the development endpoint notebook
<a name="aws-glue-programming-python-samples-medicaid-boilerplate"></a>

Paste the following boilerplate script into the development endpoint notebook to import the AWS Glue libraries that you need, and set up a single `GlueContext`:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
```

## Step 3: Compare different schema parsings
<a name="aws-glue-programming-python-samples-medicaid-schemas"></a>

Next, you can see if the schema that was recognized by an Apache Spark `DataFrame` is the same as the one that your AWS Glue crawler recorded. Run this code:

```
medicare = spark.read.format(
   "com.databricks.spark.csv").option(
   "header", "true").option(
   "inferSchema", "true").load(
   's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()
```

Here's the output from the `printSchema` call:

```
root
 |-- DRG Definition: string (nullable = true)
 |-- Provider Id: string (nullable = true)
 |-- Provider Name: string (nullable = true)
 |-- Provider Street Address: string (nullable = true)
 |-- Provider City: string (nullable = true)
 |-- Provider State: string (nullable = true)
 |-- Provider Zip Code: integer (nullable = true)
 |-- Hospital Referral Region Description: string (nullable = true)
 |--  Total Discharges : integer (nullable = true)
 |--  Average Covered Charges : string (nullable = true)
 |--  Average Total Payments : string (nullable = true)
 |-- Average Medicare Payments: string (nullable = true)
```

Next, look at the schema that an AWS Glue `DynamicFrame` generates:

```
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
       database = "payments",
       table_name = "medicare")
medicare_dynamicframe.printSchema()
```

The output from `printSchema` is as follows:

```
root
 |-- drg definition: string
 |-- provider id: choice
 |    |-- long
 |    |-- string
 |-- provider name: string
 |-- provider street address: string
 |-- provider city: string
 |-- provider state: string
 |-- provider zip code: long
 |-- hospital referral region description: string
 |-- total discharges: long
 |-- average covered charges: string
 |-- average total payments: string
 |-- average medicare payments: string
```

The `DynamicFrame` generates a schema in which `provider id` could be either a `long` or a `string` type. The `DataFrame` schema lists `Provider Id` as being a `string` type, and the Data Catalog lists `provider id` as being a `bigint` type.

Which one is correct? There are two records at the end of the file (out of 160,000 records) with `string` values in that column. These are the erroneous records that were introduced to illustrate a problem.

To address this kind of problem, the AWS Glue `DynamicFrame` introduces the concept of a *choice* type. In this case, the `DynamicFrame` shows that both `long` and `string` values can appear in that column. The AWS Glue crawler missed the `string` values because it considered only a 2 MB prefix of the data. The Apache Spark `DataFrame` considered the whole dataset, but it was forced to assign the most general type to the column, namely `string`. In fact, Spark often resorts to the most general case when there are complex types or variations with which it is unfamiliar.

To query the `provider id` column, resolve the choice type first. You can use the `resolveChoice` transform method in your `DynamicFrame` to convert those `string` values to `long` values with a `cast:long` option:

```
medicare_res = medicare_dynamicframe.resolveChoice(specs = [('provider id','cast:long')])
medicare_res.printSchema()
```

The `printSchema` output is now:

```
root
 |-- drg definition: string
 |-- provider id: long
 |-- provider name: string
 |-- provider street address: string
 |-- provider city: string
 |-- provider state: string
 |-- provider zip code: long
 |-- hospital referral region description: string
 |-- total discharges: long
 |-- average covered charges: string
 |-- average total payments: string
 |-- average medicare payments: string
```

Where the value was a `string` that could not be cast, AWS Glue inserted a `null`.

Another option is to convert the choice type to a `struct`, which keeps values of both types.
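
For example, a minimal sketch of that alternative (continuing with the `medicare_dynamicframe` from above, and assuming the `make_struct` action name) looks like this:

```
# Sketch: instead of casting, keep values of both types by turning the
# choice column into a struct with one field per candidate type.
medicare_struct = medicare_dynamicframe.resolveChoice(
    specs = [('provider id', 'make_struct')])
medicare_struct.printSchema()
```

With this option, each record carries either its `long` or its `string` value in the corresponding struct field, so no data is lost to a failed cast.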

Next, look at the rows that were anomalous:

```
medicare_res.toDF().where("`provider id` is NULL").show()
```

You see the following:

```
+--------------------+-----------+---------------+-----------------------+-------------+--------------+-----------------+------------------------------------+----------------+-----------------------+----------------------+-------------------------+
|      drg definition|provider id|  provider name|provider street address|provider city|provider state|provider zip code|hospital referral region description|total discharges|average covered charges|average total payments|average medicare payments|
+--------------------+-----------+---------------+-----------------------+-------------+--------------+-----------------+------------------------------------+----------------+-----------------------+----------------------+-------------------------+
|948 - SIGNS & SYM...|       null|            INC|       1050 DIVISION ST|      MAUSTON|            WI|            53948|                        WI - Madison|              12|              $11961.41|              $4619.00|                 $3775.33|
|948 - SIGNS & SYM...|       null| INC- ST JOSEPH|     5000 W CHAMBERS ST|    MILWAUKEE|            WI|            53210|                      WI - Milwaukee|              14|              $10514.28|              $5562.50|                 $4522.78|
+--------------------+-----------+---------------+-----------------------+-------------+--------------+-----------------+------------------------------------+----------------+-----------------------+----------------------+-------------------------+
```

Now remove the two malformed records, as follows:

```
medicare_dataframe = medicare_res.toDF()
medicare_dataframe = medicare_dataframe.where("`provider id` is NOT NULL")
```

## Step 4: Map the data and use Apache Spark Lambda functions
<a name="aws-glue-programming-python-samples-medicaid-lambda-mapping"></a>

AWS Glue does not yet directly support lambda functions, also known as user-defined functions (UDFs). But you can always convert a `DynamicFrame` to and from an Apache Spark `DataFrame` to take advantage of Spark functionality in addition to the special features of `DynamicFrames`.

Next, turn the payment information into numbers, so analytic engines like Amazon Redshift or Amazon Athena can do their number crunching faster:

```
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

chop_f = udf(lambda x: x[1:], StringType())
medicare_dataframe = medicare_dataframe.withColumn(
        "ACC", chop_f(
            medicare_dataframe["average covered charges"])).withColumn(
                "ATP", chop_f(
                    medicare_dataframe["average total payments"])).withColumn(
                        "AMP", chop_f(
                            medicare_dataframe["average medicare payments"]))
medicare_dataframe.select(['ACC', 'ATP', 'AMP']).show()
```

The output from the `show` call is as follows:

```
+--------+-------+-------+
|     ACC|    ATP|    AMP|
+--------+-------+-------+
|32963.07|5777.24|4763.73|
|15131.85|5787.57|4976.71|
|37560.37|5434.95|4453.79|
|13998.28|5417.56|4129.16|
|31633.27|5658.33|4851.44|
|16920.79|6653.80|5374.14|
|11977.13|5834.74|4761.41|
|35841.09|8031.12|5858.50|
|28523.39|6113.38|5228.40|
|75233.38|5541.05|4386.94|
|67327.92|5461.57|4493.57|
|39607.28|5356.28|4408.20|
|22862.23|5374.65|4186.02|
|31110.85|5366.23|4376.23|
|25411.33|5282.93|4383.73|
| 9234.51|5676.55|4509.11|
|15895.85|5930.11|3972.85|
|19721.16|6192.54|5179.38|
|10710.88|4968.00|3898.88|
|51343.75|5996.00|4962.45|
+--------+-------+-------+
only showing top 20 rows
```

These values are all still strings in the data. You can use the powerful `apply_mapping` transform method to drop, rename, cast, and nest the data so that other data programming languages and systems can access it easily:

```
from awsglue.dynamicframe import DynamicFrame
medicare_tmp_dyf = DynamicFrame.fromDF(medicare_dataframe, glueContext, "nested")
medicare_nest_dyf = medicare_tmp_dyf.apply_mapping([('drg definition', 'string', 'drg', 'string'),
                 ('provider id', 'long', 'provider.id', 'long'),
                 ('provider name', 'string', 'provider.name', 'string'),
                 ('provider city', 'string', 'provider.city', 'string'),
                 ('provider state', 'string', 'provider.state', 'string'),
                 ('provider zip code', 'long', 'provider.zip', 'long'),
                 ('hospital referral region description', 'string','rr', 'string'),
                 ('ACC', 'string', 'charges.covered', 'double'),
                 ('ATP', 'string', 'charges.total_pay', 'double'),
                 ('AMP', 'string', 'charges.medicare_pay', 'double')])
medicare_nest_dyf.printSchema()
```

The `printSchema` output is as follows:

```
root
 |-- drg: string
 |-- provider: struct
 |    |-- id: long
 |    |-- name: string
 |    |-- city: string
 |    |-- state: string
 |    |-- zip: long
 |-- rr: string
 |-- charges: struct
 |    |-- covered: double
 |    |-- total_pay: double
 |    |-- medicare_pay: double
```

Turning the data back into a Spark `DataFrame`, you can show what it looks like now:

```
medicare_nest_dyf.toDF().show()
```

The output is as follows:

```
+--------------------+--------------------+---------------+--------------------+
|                 drg|            provider|             rr|             charges|
+--------------------+--------------------+---------------+--------------------+
|039 - EXTRACRANIA...|[10001,SOUTHEAST ...|    AL - Dothan|[32963.07,5777.24...|
|039 - EXTRACRANIA...|[10005,MARSHALL M...|AL - Birmingham|[15131.85,5787.57...|
|039 - EXTRACRANIA...|[10006,ELIZA COFF...|AL - Birmingham|[37560.37,5434.95...|
|039 - EXTRACRANIA...|[10011,ST VINCENT...|AL - Birmingham|[13998.28,5417.56...|
|039 - EXTRACRANIA...|[10016,SHELBY BAP...|AL - Birmingham|[31633.27,5658.33...|
|039 - EXTRACRANIA...|[10023,BAPTIST ME...|AL - Montgomery|[16920.79,6653.8,...|
|039 - EXTRACRANIA...|[10029,EAST ALABA...|AL - Birmingham|[11977.13,5834.74...|
|039 - EXTRACRANIA...|[10033,UNIVERSITY...|AL - Birmingham|[35841.09,8031.12...|
|039 - EXTRACRANIA...|[10039,HUNTSVILLE...|AL - Huntsville|[28523.39,6113.38...|
|039 - EXTRACRANIA...|[10040,GADSDEN RE...|AL - Birmingham|[75233.38,5541.05...|
|039 - EXTRACRANIA...|[10046,RIVERVIEW ...|AL - Birmingham|[67327.92,5461.57...|
|039 - EXTRACRANIA...|[10055,FLOWERS HO...|    AL - Dothan|[39607.28,5356.28...|
|039 - EXTRACRANIA...|[10056,ST VINCENT...|AL - Birmingham|[22862.23,5374.65...|
|039 - EXTRACRANIA...|[10078,NORTHEAST ...|AL - Birmingham|[31110.85,5366.23...|
|039 - EXTRACRANIA...|[10083,SOUTH BALD...|    AL - Mobile|[25411.33,5282.93...|
|039 - EXTRACRANIA...|[10085,DECATUR GE...|AL - Huntsville|[9234.51,5676.55,...|
|039 - EXTRACRANIA...|[10090,PROVIDENCE...|    AL - Mobile|[15895.85,5930.11...|
|039 - EXTRACRANIA...|[10092,D C H REGI...|AL - Tuscaloosa|[19721.16,6192.54...|
|039 - EXTRACRANIA...|[10100,THOMAS HOS...|    AL - Mobile|[10710.88,4968.0,...|
|039 - EXTRACRANIA...|[10103,BAPTIST ME...|AL - Birmingham|[51343.75,5996.0,...|
+--------------------+--------------------+---------------+--------------------+
only showing top 20 rows
```

## Step 5: Write the data to Apache Parquet
<a name="aws-glue-programming-python-samples-medicaid-writing"></a>

AWS Glue makes it easy to write the data in a format such as Apache Parquet that relational databases can effectively consume:

```
glueContext.write_dynamic_frame.from_options(
       frame = medicare_nest_dyf,
       connection_type = "s3",
       connection_options = {"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
       format = "parquet")
```
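
To verify the output, you can read the Parquet data back into a `DynamicFrame`. The following is a sketch that reuses the same S3 path as the write above; substitute your own bucket if you changed it:

```
# Sketch: read the Parquet output back to confirm the nested schema
# survived the round trip. The path matches the write_dynamic_frame
# call above.
medicare_parquet_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://glue-sample-target/output-dir/medicare_parquet"]},
    format = "parquet")
medicare_parquet_dyf.printSchema()
```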

# AWS Glue PySpark extensions reference
<a name="aws-glue-programming-python-extensions"></a>

AWS Glue has created the following extensions to the PySpark Python dialect.
+ [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md)
+ [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md)
+ [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md)
+ [DynamicFrameCollection class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection.md)
+ [DynamicFrameWriter class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.md)
+ [DynamicFrameReader class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md)
+ [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md)

# Accessing parameters using `getResolvedOptions`
<a name="aws-glue-api-crawler-pyspark-extensions-get-resolved-options"></a>

The AWS Glue `getResolvedOptions(args, options)` utility function gives you access to the arguments that are passed to your script when you run a job. To use this function, start by importing it from the AWS Glue `utils` module, along with the `sys` module:

```
import sys
from awsglue.utils import getResolvedOptions
```

**`getResolvedOptions(args, options)`**
+ `args` – The list of arguments contained in `sys.argv`.
+ `options` – A Python list of the argument names that you want to retrieve.

**Example Retrieving arguments passed to a JobRun**  
Suppose that you created a JobRun in a script, perhaps within a Lambda function:  

```
response = client.start_job_run(
             JobName = 'my_test_Job',
             Arguments = {
               '--day_partition_key':   'partition_0',
               '--hour_partition_key':  'partition_1',
               '--day_partition_value':  day_partition_value,
               '--hour_partition_value': hour_partition_value } )
```
To retrieve the arguments that are passed, you can use the `getResolvedOptions` function as follows:  

```
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day-partition key is: ", args['day_partition_key'])
print("and the day-partition value is: ", args['day_partition_value'])
```
Note that each argument is defined with two leading hyphens when it is passed, and then referenced in the script without the hyphens. The argument names use only underscores, not hyphens. Your arguments must follow this convention to be resolved.

# PySpark extension types
<a name="aws-glue-api-crawler-pyspark-extensions-types"></a>

The types that are used by the AWS Glue PySpark extensions.

## DataType
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-datatype"></a>

The base class for the other AWS Glue types.

**`__init__(properties={})`**
+ `properties` – Properties of the data type (optional).

 

**`typeName(cls)`**

Returns the type of the AWS Glue type class (that is, the class name with "Type" removed from the end).
+ `cls` – An AWS Glue class instance derived from `DataType`.

 

**`jsonValue( )`**

Returns a JSON object that contains the data type and properties of the class:

```
  {
    "dataType": typeName,
    "properties": properties
  }
```
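
For example, assuming the type classes are importable from the `awsglue.gluetypes` module, you could inspect that JSON representation as follows (the exact `dataType` and `properties` values shown by your environment may differ):

```
from awsglue.gluetypes import DecimalType

# Inspect the JSON representation of a decimal type; the returned
# object follows the {"dataType": ..., "properties": ...} shape above.
print(DecimalType(precision=12, scale=4).jsonValue())
```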

## AtomicType and simple derivatives
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype"></a>

Inherits from and extends the [DataType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-datatype) class, and serves as the base class for all the AWS Glue atomic data types.

**`fromJsonValue(cls, json_value)`**

Initializes a class instance with values from a JSON object.
+ `cls` – An AWS Glue type class instance to initialize.
+ `json_value` – The JSON object to load key-value pairs from.

 

The following types are simple derivatives of the [AtomicType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype) class:
+ `BinaryType` – Binary data.
+ `BooleanType` – Boolean values.
+ `ByteType` – A byte value.
+ `DateType` – A date value.
+ `DoubleType` – A floating-point double value.
+ `IntegerType` – An integer value.
+ `LongType` – A long integer value.
+ `NullType` – A null value.
+ `ShortType` – A short integer value.
+ `StringType` – A text string.
+ `TimestampType` – A timestamp value (typically in seconds since January 1, 1970).
+ `UnknownType` – A value of unidentified type.

## DecimalType(AtomicType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-decimaltype"></a>

Inherits from and extends the [AtomicType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype) class to represent a decimal number (a number expressed in decimal digits, as opposed to binary base-2 numbers).

**`__init__(precision=10, scale=2, properties={})`**
+ `precision` – The number of digits in the decimal number (optional; the default is 10).
+ `scale` – The number of digits to the right of the decimal point (optional; the default is 2).
+ `properties` – The properties of the decimal number (optional).

## EnumType(AtomicType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-enumtype"></a>

Inherits from and extends the [AtomicType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype) class to represent an enumeration of valid options.

**`__init__(options)`**
+ `options` – A list of the options being enumerated.

##  — collection types —
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-collections"></a>
+ [ArrayType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-arraytype)
+ [ChoiceType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype)
+ [MapType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-maptype)
+ [Field(Object)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-field)
+ [StructType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-structtype)
+ [EntityType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-entitytype)

## ArrayType(DataType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-arraytype"></a>

**`__init__(elementType=UnknownType(), properties={})`**
+ `elementType` – The type of elements in the array (optional; the default is UnknownType).
+ `properties` – Properties of the array (optional).

## ChoiceType(DataType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype"></a>

**`__init__(choices=[], properties={})`**
+ `choices` – A list of possible choices (optional).
+ `properties` – Properties of these choices (optional).

 

**`add(new_choice)`**

Adds a new choice to the list of possible choices.
+ `new_choice` – The choice to add to the list of possible choices.

 

**`merge(new_choices)`**

Merges a list of new choices with the existing list of choices.
+ `new_choices` – A list of new choices to merge with existing choices.
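
A short sketch of these constructors and methods, assuming the type classes are importable from the `awsglue.gluetypes` module:

```
# Sketch: build a choice type that admits both long and string values,
# then widen it using the methods documented above.
from awsglue.gluetypes import ChoiceType, DoubleType, LongType, StringType

choice = ChoiceType(choices = [LongType(), StringType()])
choice.add(DoubleType())                   # add one more possible type
choice.merge([LongType(), DoubleType()])   # merge another list of candidates
```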

## MapType(DataType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-maptype"></a>

**`__init__(valueType=UnknownType, properties={})`**
+ `valueType` – The type of values in the map (optional; the default is UnknownType).
+ `properties` – Properties of the map (optional).

## Field(Object)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-field"></a>

Creates a field object out of an object that derives from [DataType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-datatype).

**`__init__(name, dataType, properties={})`**
+ `name` – The name to be assigned to the field.
+ `dataType` – The object to create a field from.
+ `properties` – Properties of the field (optional).

## StructType(DataType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-structtype"></a>

Defines a data structure (`struct`).

**`__init__(fields=[], properties={})`**
+ `fields` – A list of the fields (of type `Field`) to include in the structure (optional).
+ `properties` – Properties of the structure (optional).

 

**`add(field)`**
+ `field` – An object of type `Field` to add to the structure.

 

**`hasField(field)`**

Returns `True` if this structure has a field of the same name, or `False` if not.
+ `field` – A field name, or an object of type `Field` whose name is used.

 

**`getField(field)`**
+ `field` – A field name or an object of type `Field` whose name is used. If the structure has a field of the same name, it is returned.
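
The following sketch assembles a struct schema using only the constructors and methods documented above, assuming the classes are importable from the `awsglue.gluetypes` module:

```
# Sketch: build a struct from Field objects, then query it.
from awsglue.gluetypes import Field, LongType, StringType, StructType

provider = StructType([
    Field("id", LongType()),
    Field("name", StringType()),
])
provider.add(Field("zip", LongType()))

provider.hasField("zip")    # True, per the hasField contract above
provider.getField("name")   # returns the Field named "name"
```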

## EntityType(DataType)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-entitytype"></a>

`__init__(entity, base_type, properties)`

This class is not yet implemented.

##  — other types —
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-other-types"></a>
+ [DataSource(object)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-source)
+ [DataSink(object)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-sink)

## DataSource(object)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-source"></a>

**`__init__(j_source, sql_ctx, name)`**
+ `j_source` – The data source.
+ `sql_ctx` – The SQL context.
+ `name` – The data-source name.

 

**`setFormat(format, **options)`**
+ `format` – The format to set for the data source.
+ `options` – A collection of options to set for the data source. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).

 

**`getFrame()`**

Returns a `DynamicFrame` for the data source.

## DataSink(object)
<a name="aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-sink"></a>

**`__init__(j_sink, sql_ctx)`**
+ `j_sink` – The sink to create.
+ `sql_ctx` – The SQL context for the data sink.

 

**`setFormat(format, **options)`**
+ `format` – The format to set for the data sink.
+ `options` – A collection of options to set for the data sink. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).

 

**`setAccumulableSize(size)`**
+ `size` – The accumulable size to set, in bytes.

 

**`writeFrame(dynamic_frame, info="")`**
+ `dynamic_frame` – The `DynamicFrame` to write.
+ `info` – Information about the `DynamicFrame` (optional).

 

**`write(dynamic_frame_or_dfc, info="")`**

Writes a `DynamicFrame` or a `DynamicFrameCollection`.
+ `dynamic_frame_or_dfc` – Either a `DynamicFrame` object or a `DynamicFrameCollection` object to be written.
+ `info` – Information about the `DynamicFrame` or `DynamicFrames` to be written (optional).

# DynamicFrame class
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame"></a>

One of the major abstractions in Apache Spark is the SparkSQL `DataFrame`, which is similar to the `DataFrame` construct found in R and Pandas. A `DataFrame` is similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).

`DataFrames` are powerful and widely used, but they have limitations with respect to extract, transform, and load (ETL) operations. Most significantly, they require a schema to be specified before any data is loaded. SparkSQL addresses this by making two passes over the data—the first to infer the schema, and the second to load the data. However, this inference is limited and doesn't address the realities of messy data. For example, the same field might be of a different type in different records. Apache Spark often gives up and reports the type as `string` using the original field text. This might not be correct, and you might want finer control over how schema discrepancies are resolved. And for large datasets, an additional pass over the source data might be prohibitively expensive.

To address these limitations, AWS Glue introduces the `DynamicFrame`. A `DynamicFrame` is similar to a `DataFrame`, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type. You can resolve these inconsistencies to make your datasets compatible with data stores that require a fixed schema.

Similarly, a `DynamicRecord` represents a logical record within a `DynamicFrame`. It is like a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema. When using AWS Glue with PySpark, you do not typically manipulate independent `DynamicRecords`. Rather, you will transform the dataset together through its `DynamicFrame`.

You can convert `DynamicFrames` to and from `DataFrames` after you resolve any schema inconsistencies. 
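
A minimal round trip might look like the following sketch. It assumes an existing `GlueContext` named `glueContext`; the database, table, and column names are placeholders for your own Data Catalog entries:

```
# Sketch of a DynamicFrame <-> DataFrame round trip. "my_database",
# "my_table", and "id" are placeholder names.
from awsglue.dynamicframe import DynamicFrame

dyf = glueContext.create_dynamic_frame.from_catalog(
    database = "my_database", table_name = "my_table")

df = dyf.toDF()                          # Spark DataFrame: any Spark API applies
df_filtered = df.where("id IS NOT NULL")

dyf2 = DynamicFrame.fromDF(df_filtered, glueContext, "filtered")
```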

##  — construction —
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-_constructing"></a>
+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-__init__)
+ [fromDF](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF)
+ [toDF](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF)

## \_\_init\_\_
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-__init__"></a>

**`__init__(jdf, glue_ctx, name)`**
+ `jdf` – A reference to the data frame in the Java Virtual Machine (JVM).
+ `glue_ctx` – A [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) object.
+ `name` – An optional name string, empty by default.

## fromDF
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF"></a>

**`fromDF(dataframe, glue_ctx, name)`**

Converts a `DataFrame` to a `DynamicFrame` by converting `DataFrame` fields to `DynamicRecord` fields. Returns the new `DynamicFrame`.

A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.

This function expects columns with duplicated names in your `DataFrame` to have already been resolved.
+ `dataframe` – The Apache Spark SQL `DataFrame` to convert (required).
+ `glue_ctx` – The [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) object that specifies the context for this transform (required).
+ `name` – The name of the resulting `DynamicFrame` (optional since AWS Glue 3.0).

## toDF
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF"></a>

**`toDF(options)`**

Converts a `DynamicFrame` to an Apache Spark `DataFrame` by converting `DynamicRecords` into `DataFrame` fields. Returns the new `DataFrame`.

A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.
+  `options` – A list of `ResolveOption` objects that specify how to resolve choice types during the conversion. This parameter is used to handle schema inconsistencies, not for format options like CSV parsing. 

   For CSV parsing and other format options, specify these in the `from_options` method when creating the DynamicFrame, not in the `toDF` method. 

   Here's an example of the correct way to handle CSV format options: 

  ```
  from awsglue.context import GlueContext
  from awsglue.dynamicframe import DynamicFrame
  from pyspark.context import SparkContext
  
  sc = SparkContext()
  glueContext = GlueContext(sc)
  
  # Correct: Specify format options in from_options
  csv_dyf = glueContext.create_dynamic_frame.from_options(
      connection_type="s3",
      connection_options={"paths": ["s3://my-bucket/path/to/csv/"]},
      format="csv",
      format_options={
          "withHeader": True,
          "separator": ",",
          "inferSchema": True
      }
  )
  
  # Convert to DataFrame (no format options needed here)
  csv_df = csv_dyf.toDF()
  ```

   The `options` parameter in `toDF` is specifically for resolving choice types. Specify the target type if you choose the `Project` and `Cast` action type. Examples include the following. 

  ```
  >>>toDF([ResolveOption("a.b.c", "KeepAsStruct")])
  >>>toDF([ResolveOption("a.b.c", "Project", DoubleType())])
  ```

##  — information —
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-_informational"></a>
+ [count](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-count)
+ [schema](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-schema)
+ [printSchema](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-printSchema)
+ [show](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-show)
+ [repartition](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-repartition)
+ [coalesce](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-coalesce)

## count
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-count"></a>

`count( )` – Returns the number of rows in the underlying `DataFrame`.

## schema
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-schema"></a>

`schema( )` – Returns the schema of this `DynamicFrame`, or if that is not available, the schema of the underlying `DataFrame`.

For more information about the `DynamicFrame` types that make up this schema, see [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md).

## printSchema
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-printSchema"></a>

`printSchema( )` – Prints the schema of the underlying `DataFrame`.

## show
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-show"></a>

`show(num_rows)` – Prints a specified number of rows from the underlying `DataFrame`.

## repartition
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-repartition"></a>

`repartition(numPartitions)` – Returns a new `DynamicFrame` with `numPartitions` partitions.

## coalesce
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-coalesce"></a>

`coalesce(numPartitions)` – Returns a new `DynamicFrame` with `numPartitions` partitions.
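
Both methods return a `DynamicFrame` with `numPartitions` partitions; as in Spark, `coalesce` avoids a full shuffle when only reducing the partition count, which typically makes it cheaper than `repartition`. A sketch, continuing with the `medicare_nest_dyf` frame from the earlier example:

```
# Sketch: control the number of output files by adjusting partitions
# before writing. coalesce() narrows partitions without a full shuffle.
dyf_single = medicare_nest_dyf.coalesce(1)     # one output file
dyf_spread = medicare_nest_dyf.repartition(8)  # full shuffle into 8 partitions
```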

##  — transforms —
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-_transforms"></a>
+ [apply\_mapping](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-apply_mapping)
+ [drop\_fields](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-drop_fields)
+ [filter](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-filter)
+ [join](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-join)
+ [map](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-map)
+ [mergeDynamicFrame](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge)
+ [relationalize](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-relationalize)
+ [rename\_field](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-rename_field)
+ [resolveChoice](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-resolveChoice)
+ [select\_fields](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-select_fields)
+ [spigot](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-spigot)
+ [split\_fields](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_fields)
+ [split\_rows](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_rows)
+ [unbox](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unbox)
+ [union](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-union)
+ [unnest](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest)
+ [unnest\_ddb\_json](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest_ddb_json)
+ [write](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-write)

## apply\_mapping
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-apply_mapping"></a>

**`apply_mapping(mappings, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Applies a declarative mapping to a `DynamicFrame` and returns a new `DynamicFrame` with those mappings applied to the fields that you specify. Unspecified fields are omitted from the new `DynamicFrame`.
+ `mappings` – A list of mapping tuples (required). Each consists of: (source column, source type, target column, target type). 

  If the source column has a dot "`.`" in the name, you must enclose the name in backticks (`` ` ``). For example, to map `this.old.name` (string) to `thisNewName`, you would use the following tuple:

  ```
  ("`this.old.name`", "string", "thisNewName", "string")
  ```
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use apply\_mapping to rename fields and change field types
<a name="pyspark-apply_mapping-example"></a>

The following code example shows how to use the `apply_mapping` method to rename selected fields and change field types.

**Note**  
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use apply_mapping to reshape source data into
# the desired column names and types as a new DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame and view its schema
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()

# Select and rename fields, change field type
print("Schema for the persons_mapped DynamicFrame, created with apply_mapping:")
persons_mapped = persons.apply_mapping(
    [
        ("family_name", "String", "last_name", "String"),
        ("name", "String", "first_name", "String"),
        ("birth_date", "String", "date_of_birth", "Date"),
    ]
)
persons_mapped.printSchema()
```

#### Output
<a name="apply_mapping-example-output"></a>

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the persons_mapped DynamicFrame, created with apply_mapping:
root
|-- last_name: string
|-- first_name: string
|-- date_of_birth: date
```

## drop\_fields
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-drop_fields"></a>

**`drop_fields(paths, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Calls the [FlatMap class](aws-glue-api-crawler-pyspark-transforms-flat-map.md) transform to remove fields from a `DynamicFrame`. Returns a new `DynamicFrame` with the specified fields dropped.
+ `paths` – A list of strings. Each contains the full path to a field node that you want to drop. You can use dot notation to specify nested fields. For example, if field `first` is a child of field `name` in the tree, you specify `"name.first"` for the path.

  If a field node has a literal `.` in the name, you must enclose the name in backticks (`` ` ``).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use drop\_fields to remove fields from a `DynamicFrame`
<a name="pyspark-drop_fields-example"></a>

This code example uses the `drop_fields` method to remove selected top-level and nested fields from a `DynamicFrame`.

**Example dataset**

The example uses the following dataset that is represented by the `EXAMPLE-FRIENDS-DATA` table in the code:

```
{"name": "Sally", "age": 23, "location": {"state": "WY", "county": "Fremont"}, "friends": []}
{"name": "Varun", "age": 34, "location": {"state": "NE", "county": "Douglas"}, "friends": [{"name": "Arjun", "age": 3}]}
{"name": "George", "age": 52, "location": {"state": "NY"}, "friends": [{"name": "Fred"}, {"name": "Amy", "age": 15}]}
{"name": "Haruki", "age": 21, "location": {"state": "AK", "county": "Denali"}}
{"name": "Sheila", "age": 63, "friends": [{"name": "Nancy", "age": 22}]}
```

**Example code**

```
# Example: Use drop_fields to remove top-level and nested fields from a DynamicFrame.
# Replace MY-EXAMPLE-DATABASE with your Glue Data Catalog database name.
# Replace EXAMPLE-FRIENDS-DATA with your table name.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame from Glue Data Catalog
glue_source_database = "MY-EXAMPLE-DATABASE"
glue_source_table = "EXAMPLE-FRIENDS-DATA"

friends = glueContext.create_dynamic_frame.from_catalog(
    database=glue_source_database, table_name=glue_source_table
)
print("Schema for friends DynamicFrame before calling drop_fields:")
friends.printSchema()

# Remove location.county, remove friends.age, remove age
friends = friends.drop_fields(paths=["age", "location.county", "friends.age"])
print("Schema for friends DynamicFrame after removing age, county, and friend age:")
friends.printSchema()
```

#### Output
<a name="drop_fields-example-output"></a>

```
Schema for friends DynamicFrame before calling drop_fields:
root
|-- name: string
|-- age: int
|-- location: struct
|    |-- state: string
|    |-- county: string
|-- friends: array
|    |-- element: struct
|    |    |-- name: string
|    |    |-- age: int

Schema for friends DynamicFrame after removing age, county, and friend age:
root
|-- name: string
|-- location: struct
|    |-- state: string
|-- friends: array
|    |-- element: struct
|    |    |-- name: string
```

## filter
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-filter"></a>

**`filter(f, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrame` that contains all `DynamicRecords` within the input `DynamicFrame` that satisfy the specified predicate function `f`.
+ `f` – The predicate function to apply to the `DynamicFrame`. The function must take a `DynamicRecord` as an argument and return True if the `DynamicRecord` meets the filter requirements, or False if not (required).

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It's similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that doesn't conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use filter to get a filtered selection of fields
<a name="pyspark-filter-example"></a>

This example uses the `filter` method to create a new `DynamicFrame` that includes a filtered selection of another `DynamicFrame`'s fields. 

Like the `map` method, `filter` takes a function as an argument that gets applied to each record in the original `DynamicFrame`. The function takes a record as an input and returns a Boolean value. If the return value is true, the record gets included in the resulting `DynamicFrame`. If it's false, the record is left out.

**Note**  
To access the dataset that is used in this example, see [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-medicaid.md#aws-glue-programming-python-samples-medicaid-crawling).

```
# Example: Use filter to create a new DynamicFrame
# with a filtered selection of records

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create DynamicFrame from Glue Data Catalog
medicare = glueContext.create_dynamic_frame.from_options(
    "s3",
    {
        "paths": [
            "s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv"
        ]
    },
    "csv",
    {"withHeader": True},
)

# Create filtered DynamicFrame with custom lambda
# to filter records by Provider State and Provider City
sac_or_mon = medicare.filter(
    f=lambda x: x["Provider State"] in ["CA", "AL"]
    and x["Provider City"] in ["SACRAMENTO", "MONTGOMERY"]
)

# Compare record counts
print("Unfiltered record count: ", medicare.count())
print("Filtered record count:  ", sac_or_mon.count())
```

#### Output
<a name="filter-example-output"></a>

```
Unfiltered record count:  163065
Filtered record count:   564
```

## join
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-join"></a>

**`join(paths1, paths2, frame2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Performs an equality join with another `DynamicFrame` and returns the resulting `DynamicFrame`.
+ `paths1` – A list of the keys in this frame to join.
+ `paths2` – A list of the keys in the other frame to join.
+ `frame2` – The other `DynamicFrame` to join.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
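
Conceptually, the equality join pairs records whose key fields match, and the fields from both frames, including both sets of join keys, survive into the result. The following plain-Python sketch models only that record-level behavior; it is not the Glue implementation.

```python
# Illustrative model of the join semantics in plain Python.
# NOT the Glue API: it shows that matching records are combined and
# that both key columns are kept (hence the drop_fields cleanup in
# the example that follows).

def join_model(frame1, paths1, frame2, paths2):
    joined = []
    for a in frame1:
        for b in frame2:
            if all(a[k1] == b[k2] for k1, k2 in zip(paths1, paths2)):
                joined.append({**a, **b})  # fields from both sides
    return joined

persons = [{"id": "p1", "name": "Jane"}]
memberships = [{"person_id": "p1", "role": "member"}]
result = join_model(persons, ["id"], memberships, ["person_id"])
# Both "id" and "person_id" appear in the joined record
```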

### Example: Use join to combine `DynamicFrames`
<a name="pyspark-join-example"></a>

This example uses the `join` method to perform a join on three `DynamicFrames`. AWS Glue performs the join based on the field keys that you provide. The resulting `DynamicFrame` contains rows from the two original frames where the specified keys match.

Note that the `join` transform keeps all fields intact. This means that the fields that you specify to match appear in the resulting DynamicFrame, even if they're redundant and contain the same keys. In this example, we use `drop_fields` to remove these redundant keys after the join.

**Note**  
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use join to combine data from three DynamicFrames

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load DynamicFrames from Glue Data Catalog
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
memberships = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="memberships_json"
)
orgs = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="organizations_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()
print("Schema for the memberships DynamicFrame:")
memberships.printSchema()
print("Schema for the orgs DynamicFrame:")
orgs.printSchema()

# Join persons and memberships by ID
persons_memberships = persons.join(
    paths1=["id"], paths2=["person_id"], frame2=memberships
)

# Rename and drop fields from orgs
# to prevent field name collisions with persons_memberships
orgs = (
    orgs.drop_fields(["other_names", "identifiers"])
    .rename_field("id", "org_id")
    .rename_field("name", "org_name")
)

# Create final join of all three DynamicFrames
legislators_combined = orgs.join(
    paths1=["org_id"], paths2=["organization_id"], frame2=persons_memberships
).drop_fields(["person_id", "org_id"])

# Inspect the schema for the joined data
print("Schema for the new legislators_combined DynamicFrame:")
legislators_combined.printSchema()
```

#### Output
<a name="join-example-output"></a>

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the memberships DynamicFrame:
root
|-- area_id: string
|-- on_behalf_of_id: string
|-- organization_id: string
|-- role: string
|-- person_id: string
|-- legislative_period_id: string
|-- start_date: string
|-- end_date: string

Schema for the orgs DynamicFrame:
root
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- id: string
|-- classification: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- seats: int
|-- type: string

Schema for the new legislators_combined DynamicFrame:
root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- legislative_period_id: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- image: string
|-- given_name: string
|-- start_date: string
|-- family_name: string
|-- id: string
|-- death_date: string
|-- end_date: string
```

## map
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-map"></a>

**`map(f, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrame` that results from applying the specified mapping function to all records in the original `DynamicFrame`.
+ `f` – The mapping function to apply to all records in the `DynamicFrame`. The function must take a `DynamicRecord` as an argument and return a new `DynamicRecord` (required).

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It's similar to a row in an Apache Spark `DataFrame`, except that it is self-describing and can be used for data that doesn't conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string that is associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

### Example: Use map to apply a function to every record in a `DynamicFrame`
<a name="pyspark-map-example"></a>

This example shows how to use the `map` method to apply a function to every record of a `DynamicFrame`. Specifically, this example applies a function called `MergeAddress` to each record in order to merge several address fields into a single `struct` type.

**Note**  
To access the dataset that is used in this example, see [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-medicaid.md#aws-glue-programming-python-samples-medicaid-crawling).

```
# Example: Use map to combine fields in all records
# of a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame and view its schema
medicare = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": ["s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv"]},
        "csv",
        {"withHeader": True})
print("Schema for medicare DynamicFrame:")
medicare.printSchema()

# Define a function to supply to the map transform
# that merges address fields into a single field
def MergeAddress(rec):
  rec["Address"] = {}
  rec["Address"]["Street"] = rec["Provider Street Address"]
  rec["Address"]["City"] = rec["Provider City"]
  rec["Address"]["State"] = rec["Provider State"]
  rec["Address"]["Zip.Code"] = rec["Provider Zip Code"]
  rec["Address"]["Array"] = [rec["Provider Street Address"], rec["Provider City"], rec["Provider State"], rec["Provider Zip Code"]]
  del rec["Provider Street Address"]
  del rec["Provider City"]
  del rec["Provider State"]
  del rec["Provider Zip Code"]
  return rec


# Use map to apply MergeAddress to every record
mapped_medicare = medicare.map(f = MergeAddress)
print("Schema for mapped_medicare DynamicFrame:")
mapped_medicare.printSchema()
```

#### Output
<a name="map-example-output"></a>

```
Schema for medicare DynamicFrame:
root
|-- DRG Definition: string
|-- Provider Id: string
|-- Provider Name: string
|-- Provider Street Address: string
|-- Provider City: string
|-- Provider State: string
|-- Provider Zip Code: string
|-- Hospital Referral Region Description: string
|-- Total Discharges: string
|-- Average Covered Charges: string
|-- Average Total Payments: string
|-- Average Medicare Payments: string

Schema for mapped_medicare DynamicFrame:
root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string
```

## mergeDynamicFrame
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge"></a>

**`mergeDynamicFrame(stage_dynamic_frame, primary_keys, transformation_ctx = "", options = {}, info = "", stageThreshold = 0, totalThreshold = 0)`**

Merges this `DynamicFrame` with a staging `DynamicFrame` based on the specified primary keys to identify records. Duplicate records (records with the same primary keys) are not deduplicated. If there is no matching record in the staging frame, all records (including duplicates) are retained from the source. If the staging frame has matching records, the records from the staging frame overwrite the records in the source in AWS Glue.
+ `stage_dynamic_frame` – The staging `DynamicFrame` to merge.
+ `primary_keys` – The list of primary key fields to match records from the source and staging dynamic frames.
+ `transformation_ctx` – A unique string that is used to retrieve metadata about the current transformation (optional).
+ `options` – A string of JSON name-value pairs that provide additional information for this transformation. This argument is not currently used.
+ `info` – A `String`. Any string to be associated with errors in this transformation.
+ `stageThreshold` – A `Long`. The number of errors in the given transformation for which the processing needs to error out.
+ `totalThreshold` – A `Long`. The total number of errors up to and including this transformation for which the processing needs to error out.

This method returns a new `DynamicFrame` that is obtained by merging this `DynamicFrame` with the staging `DynamicFrame`.

The returned `DynamicFrame` contains record A in these cases:
+ If `A` exists in both the source frame and the staging frame, then `A` in the staging frame is returned.
+ If `A` exists in the source frame and `A.primaryKeys` does not match any record in the staging frame, then `A` is returned from the source unchanged.

The source frame and staging frame don't need to have the same schema.
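
The record-selection rules above can be sketched in plain Python. This models only the semantics (staging records overwrite source records with the same primary key; unmatched source records, duplicates included, pass through); it is not the Glue API.

```python
# Illustrative model of mergeDynamicFrame record selection in plain
# Python. NOT the Glue API.

def merge_model(source, staging, primary_keys):
    def key(rec):
        return tuple(rec[k] for k in primary_keys)

    staging_keys = {key(rec) for rec in staging}
    # Source records survive only if no staging record shares their key
    kept = [rec for rec in source if key(rec) not in staging_keys]
    return kept + list(staging)

source = [{"id": 1, "v": "old"}, {"id": 1, "v": "old-dup"}, {"id": 2, "v": "keep"}]
staging = [{"id": 1, "v": "new"}]
merged = merge_model(source, staging, ["id"])
# id=1 records (both duplicates) are replaced by the staging record;
# id=2 passes through unchanged
```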

### Example: Use mergeDynamicFrame to merge two `DynamicFrames` based on a primary key
<a name="pyspark-mergeDynamicFrame-example"></a>

The following code example shows how to use the `mergeDynamicFrame` method to merge a `DynamicFrame` with a "staging" `DynamicFrame`, based on the primary key `id`.

**Example dataset**

The example uses two `DynamicFrames` from a `DynamicFrameCollection` called `split_rows_collection`. The following is the list of keys in `split_rows_collection`.

```
dict_keys(['high', 'low'])
```

**Example code**

```
# Example: Use mergeDynamicFrame to merge DynamicFrames
# based on a set of specified primary keys

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection

# Inspect the original DynamicFrames
frame_low = SelectFromCollection.apply(dfc=split_rows_collection, key="low")
print("Inspect the DynamicFrame that contains rows where ID < 10")
frame_low.toDF().show()

frame_high = SelectFromCollection.apply(dfc=split_rows_collection, key="high")
print("Inspect the DynamicFrame that contains rows where ID > 10")
frame_high.toDF().show()

# Merge the DynamicFrames based on the "id" primary key
merged_high_low = frame_high.mergeDynamicFrame(
    stage_dynamic_frame=frame_low, primary_keys=["id"]
)

# View the results where the ID is 1 or 20
print("Inspect the merged DynamicFrame that contains the combined rows")
merged_high_low.toDF().where("id = 1 or id = 20").orderBy("id").show()
```

#### Output
<a name="mergeDynamicFrame-example-output"></a>

```
Inspect the DynamicFrame that contains rows where ID < 10
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                     fax|             202-225-3307|
|  1|    1|                   phone|             202-225-5731|
|  2|    0|                     fax|             202-225-3307|
|  2|    1|                   phone|             202-225-5731|
|  3|    0|                     fax|             202-225-3307|
|  3|    1|                   phone|             202-225-5731|
|  4|    0|                     fax|             202-225-3307|
|  4|    1|                   phone|             202-225-5731|
|  5|    0|                     fax|             202-225-3307|
|  5|    1|                   phone|             202-225-5731|
|  6|    0|                     fax|             202-225-3307|
|  6|    1|                   phone|             202-225-5731|
|  7|    0|                     fax|             202-225-3307|
|  7|    1|                   phone|             202-225-5731|
|  8|    0|                     fax|             202-225-3307|
|  8|    1|                   phone|             202-225-5731|
|  9|    0|                     fax|             202-225-3307|
|  9|    1|                   phone|             202-225-5731|
| 10|    0|                     fax|             202-225-6328|
| 10|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

Inspect the DynamicFrame that contains rows where ID > 10
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 11|    0|                     fax|             202-225-6328|
| 11|    1|                   phone|             202-225-4576|
| 11|    2|                 twitter|           RepTrentFranks|
| 12|    0|                     fax|             202-225-6328|
| 12|    1|                   phone|             202-225-4576|
| 12|    2|                 twitter|           RepTrentFranks|
| 13|    0|                     fax|             202-225-6328|
| 13|    1|                   phone|             202-225-4576|
| 13|    2|                 twitter|           RepTrentFranks|
| 14|    0|                     fax|             202-225-6328|
| 14|    1|                   phone|             202-225-4576|
| 14|    2|                 twitter|           RepTrentFranks|
| 15|    0|                     fax|             202-225-6328|
| 15|    1|                   phone|             202-225-4576|
| 15|    2|                 twitter|           RepTrentFranks|
| 16|    0|                     fax|             202-225-6328|
| 16|    1|                   phone|             202-225-4576|
| 16|    2|                 twitter|           RepTrentFranks|
| 17|    0|                     fax|             202-225-6328|
| 17|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

Inspect the merged DynamicFrame that contains the combined rows
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                     fax|             202-225-3307|
|  1|    1|                   phone|             202-225-5731|
| 20|    0|                     fax|             202-225-5604|
| 20|    1|                   phone|             202-225-6536|
| 20|    2|                 twitter|                USRepLong|
+---+-----+------------------------+-------------------------+
```

## relationalize
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-relationalize"></a>

**`relationalize(root_table_name, staging_path, options, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Converts a `DynamicFrame` into a form that fits within a relational database. Relationalizing a `DynamicFrame` is especially useful when you want to move data from a NoSQL environment like DynamoDB into a relational database like MySQL.

The transform generates a list of frames by unnesting nested columns and pivoting array columns. You can join the pivoted array columns to the root table by using the join key that is generated during the unnest phase.
+ `root_table_name` – The name for the root table.
+ `staging_path` – The path where the method can store partitions of pivoted tables in CSV format (optional). Pivoted tables are read back from this path.
+ `options` – A dictionary of optional parameters.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
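
To make the unnest-and-pivot idea concrete before the full example, here is a plain-Python sketch of what the transform conceptually produces for a single array column: a root table plus a child table linked back by a generated `id` join key and an `index` column for array position. This is not the Glue implementation, only a model of the output shape.

```python
# Illustrative model of the unnest-and-pivot idea behind relationalize,
# in plain Python. NOT the Glue implementation.

def relationalize_model(records, array_field):
    root, child = [], []
    for rec_id, rec in enumerate(records):
        flat = {k: v for k, v in rec.items() if k != array_field}
        flat["id"] = rec_id  # generated join key back to the root table
        root.append(flat)
        for index, element in enumerate(rec.get(array_field, [])):
            # One child row per array element, keyed by (id, index)
            child.append({"id": rec_id, "index": index, **element})
    return {"root": root, f"root_{array_field}": child}

data = [{"name": "Jane",
         "contact_details": [{"type": "fax", "value": "202-555-0100"}]}]
tables = relationalize_model(data, "contact_details")
```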

### Example: Use relationalize to flatten a nested schema in a `DynamicFrame`
<a name="pyspark-relationalize-example"></a>

This code example uses the `relationalize` method to flatten a nested schema into a form that fits into a relational database.

**Example dataset**

The example uses a `DynamicFrame` called `legislators_combined` with the following schema. `legislators_combined` has multiple nested fields such as `links`, `images`, and `contact_details`, which will be flattened by the `relationalize` transform.

```
root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- legislative_period_id: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- image: string
|-- given_name: string
|-- start_date: string
|-- family_name: string
|-- id: string
|-- death_date: string
|-- end_date: string
```

**Example code**

```
# Example: Use relationalize to flatten
# a nested schema into a format that fits
# into a relational database.
# Replace DOC-EXAMPLE-BUCKET/tmpDir with your own location.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Apply relationalize and inspect new tables
legislators_relationalized = legislators_combined.relationalize(
    "l_root", "s3://DOC-EXAMPLE-BUCKET/tmpDir"
)
legislators_relationalized.keys()

# Compare the schema of the contact_details
# nested field to the new relationalized table that
# represents it
legislators_combined.select_fields("contact_details").printSchema()
legislators_relationalized.select("l_root_contact_details").toDF().where(
    "id = 10 or id = 75"
).orderBy(["id", "index"]).show()
```

#### Output
<a name="relationalize-example-output"></a>

The following output lets you compare the schema of the nested field called `contact_details` to the table that the `relationalize` transform created. Notice that the table records link back to the main table using a foreign key called `id` and an `index` column that represents the positions of the array.

```
dict_keys(['l_root', 'l_root_images', 'l_root_links', 'l_root_other_names', 'l_root_contact_details', 'l_root_identifiers'])

root
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 10|    0|                     fax|             202-225-4160|
| 10|    1|                   phone|             202-225-3436|
| 75|    0|                     fax|             202-225-6791|
| 75|    1|                   phone|             202-225-2861|
| 75|    2|                 twitter|               RepSamFarr|
+---+-----+------------------------+-------------------------+
```
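The foreign-key relationship described above can be pictured in plain Python. The following sketch is illustrative only (it is not the Glue implementation, and the sample values are hypothetical): child rows carry the root record's `id` plus an `index` for array position, which is enough to re-nest the array.

```python
# Sketch: how relationalize links a child table back to the root.
# Child rows carry the root record's id as a foreign key, plus an
# index giving each element's position in the original array.
root = [{"id": 10, "name": "Example Person"}]
contact_details = [
    {"id": 10, "index": 0, "type": "fax", "value": "202-225-4160"},
    {"id": 10, "index": 1, "type": "phone", "value": "202-225-3436"},
]

# Re-nest the array by grouping child rows under their parent id,
# ordered by index
by_parent = {}
for row in sorted(contact_details, key=lambda r: (r["id"], r["index"])):
    by_parent.setdefault(row["id"], []).append(
        {"type": row["type"], "value": row["value"]}
    )

rejoined = [dict(r, contact_details=by_parent.get(r["id"], [])) for r in root]
print(rejoined[0]["contact_details"][1]["value"])
```

In a Glue job you would express the same join with Spark SQL or `Join.apply` on the actual tables; the sketch only shows why the `id` and `index` columns are sufficient to reverse the flattening.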

## rename\_field
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-rename_field"></a>

**`rename_field(oldName, newName, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Renames a field in this `DynamicFrame` and returns a new `DynamicFrame` with the field renamed.
+ `oldName` – The full path to the node you want to rename.

  If the old name has dots in it, `RenameField` doesn't work unless you place backticks around it (`` ` ``). For example, to replace `this.old.name` with `thisNewName`, you would call `rename_field` as follows.

  ```
  newDyF = oldDyF.rename_field("`this.old.name`", "thisNewName")
  ```
+ `newName` – The new name, as a full path.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use rename\_field to rename fields in a `DynamicFrame`
<a name="pyspark-rename_field-example"></a>

This code example uses the `rename_field` method to rename fields in a `DynamicFrame`. Notice that the example uses method chaining to rename multiple fields at the same time.

**Note**  
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

**Example code**

```
# Example: Use rename_field to rename fields
# in a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Inspect the original orgs schema
orgs = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="organizations_json"
)
print("Original orgs schema: ")
orgs.printSchema()

# Rename fields and view the new schema
orgs = orgs.rename_field("id", "org_id").rename_field("name", "org_name")
print("New orgs schema with renamed fields: ")
orgs.printSchema()
```

#### Output
<a name="rename_field-example-output"></a>

```
Original orgs schema: 
root
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- id: string
|-- classification: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- seats: int
|-- type: string

New orgs schema with renamed fields: 
root
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- classification: string
|-- org_id: string
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- seats: int
|-- type: string
```

## resolveChoice
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-resolveChoice"></a>

**`resolveChoice(specs = None, choice = "" , database = None , table_name = None , transformation_ctx="", info="", stageThreshold=0, totalThreshold=0, catalog_id = None)`**

Resolves a choice type within this `DynamicFrame` and returns the new `DynamicFrame`.
+ `specs` – A list of specific ambiguities to resolve, each in the form of a tuple: `(field_path, action)`. 

  There are two ways to use `resolveChoice`. The first is to use the `specs` argument to specify a sequence of specific fields and how to resolve them. The other mode for `resolveChoice` is to use the `choice` argument to specify a single resolution for all `ChoiceTypes`.

  Values for `specs` are specified as tuples made up of `(field_path, action)` pairs. The `field_path` value identifies a specific ambiguous element, and the `action` value identifies the corresponding resolution. The following are the possible actions: 
  + `cast:type` – Attempts to cast all values to the specified type. For example: `cast:int`.
  + `make_cols` – Converts each distinct type to a column with the name `columnName_type`. It resolves a potential ambiguity by flattening the data. For example, if `columnA` could be an `int` or a `string`, the resolution would be to produce two columns named `columnA_int` and `columnA_string` in the resulting `DynamicFrame`.
  + `make_struct` – Resolves a potential ambiguity by using a `struct` to represent the data. For example, if data in a column could be an `int` or a `string`, the `make_struct` action produces a column of structures in the resulting `DynamicFrame`. Each structure contains both an `int` and a `string`.
  + `project:type` – Resolves a potential ambiguity by projecting all the data to one of the possible data types. For example, if data in a column could be an `int` or a `string`, using a `project:string` action produces a column in the resulting `DynamicFrame` where all the `int` values have been converted to strings.

  If the `field_path` identifies an array, place empty square brackets after the name of the array to avoid ambiguity. For example, suppose you are working with data structured as follows:

  ```
  "myList": [
    { "price": 100.00 },
    { "price": "$100.00" }
  ]
  ```

  You can select the numeric rather than the string version of the price by setting the `field_path` to `"myList[].price"`, and setting the `action` to `"cast:double"`.
**Note**  
You can only use one of the `specs` and `choice` parameters. If the `specs` parameter is not `None`, then the `choice` parameter must be an empty string. Conversely, if the `choice` is not an empty string, then the `specs` parameter must be `None`. 
+ `choice` – Specifies a single resolution for all `ChoiceTypes`. You can use this in cases where the complete list of `ChoiceTypes` is unknown before runtime. In addition to the actions listed previously for `specs`, this argument also supports the following action:
  + `match_catalog` – Attempts to cast each `ChoiceType` to the corresponding type in the specified Data Catalog table. 
+ `database` – The Data Catalog database to use with the `match_catalog` action.
+ `table_name` – The Data Catalog table to use with the `match_catalog` action.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `catalog_id` – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). When set to `None` (default value), it uses the catalog ID of the calling account. 
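The `make_cols` and `project` actions can be pictured with plain Python. The following sketch is illustrative only (not the Glue implementation); it resolves a hypothetical column holding mixed `int`/`str` values both ways.

```python
# Sketch: resolving a mixed-type "columnA" the way make_cols and
# project:string would, using plain Python rows for illustration.
rows = [{"columnA": 10001}, {"columnA": "10005"}, {"columnA": 10006}]

# make_cols: one column per observed type, null where the type
# doesn't apply
make_cols = [
    {
        "columnA_int": r["columnA"] if isinstance(r["columnA"], int) else None,
        "columnA_string": r["columnA"] if isinstance(r["columnA"], str) else None,
    }
    for r in rows
]

# project:string: force every value into the single projected type
project_string = [{"columnA": str(r["columnA"])} for r in rows]

print(make_cols[1])
print(project_string[0])
```

`make_cols` preserves every value at the cost of extra columns, while `project` keeps the schema narrow at the cost of converting (or losing) values that don't fit the projected type.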

### Example: Use resolveChoice to handle a column that contains multiple types
<a name="pyspark-resolveChoice-example"></a>

This code example uses the `resolveChoice` method to specify how to handle a `DynamicFrame` column that contains values of multiple types. The example demonstrates two common ways to handle a column with different types:
+ Cast the column to a single data type.
+ Retain all types in separate columns.

**Example dataset**

**Note**  
To access the dataset that is used in this example, see [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-medicaid.md#aws-glue-programming-python-samples-medicaid-crawling).

The example uses a `DynamicFrame` called `medicare` with the following schema:

```
root
|-- drg definition: string
|-- provider id: choice
|    |-- long
|    |-- string
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string
```

**Example code**

```
# Example: Use resolveChoice to handle
# a column that contains multiple types

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load the input data and inspect the "provider id" column
medicare = glueContext.create_dynamic_frame.from_catalog(
    database="payments", table_name="medicare_hospital_provider_csv"
)
print("Inspect the 'provider id' column:")
medicare.toDF().select("provider id").show()

# Cast provider id to type long
medicare_resolved_long = medicare.resolveChoice(specs=[("provider id", "cast:long")])
print("Schema after casting 'provider id' to type long:")
medicare_resolved_long.printSchema()
medicare_resolved_long.toDF().select("provider id").show()

# Create separate columns
# for each provider id type
medicare_resolved_cols = medicare.resolveChoice(choice="make_cols")
print("Schema after creating separate columns for each type:")
medicare_resolved_cols.printSchema()
medicare_resolved_cols.toDF().select("provider id_long", "provider id_string").show()
```

#### Output
<a name="resolveChoice-example-output"></a>

```
Inspect the 'provider id' column:
+-----------+
|provider id|
+-----------+
|   [10001,]|
|   [10005,]|
|   [10006,]|
|   [10011,]|
|   [10016,]|
|   [10023,]|
|   [10029,]|
|   [10033,]|
|   [10039,]|
|   [10040,]|
|   [10046,]|
|   [10055,]|
|   [10056,]|
|   [10078,]|
|   [10083,]|
|   [10085,]|
|   [10090,]|
|   [10092,]|
|   [10100,]|
|   [10103,]|
+-----------+
only showing top 20 rows

Schema after casting 'provider id' to type long:
root
|-- drg definition: string
|-- provider id: long
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string

+-----------+
|provider id|
+-----------+
|      10001|
|      10005|
|      10006|
|      10011|
|      10016|
|      10023|
|      10029|
|      10033|
|      10039|
|      10040|
|      10046|
|      10055|
|      10056|
|      10078|
|      10083|
|      10085|
|      10090|
|      10092|
|      10100|
|      10103|
+-----------+
only showing top 20 rows

Schema after creating separate columns for each type:
root
|-- drg definition: string
|-- provider id_string: string
|-- provider id_long: long
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string

+----------------+------------------+
|provider id_long|provider id_string|
+----------------+------------------+
|           10001|              null|
|           10005|              null|
|           10006|              null|
|           10011|              null|
|           10016|              null|
|           10023|              null|
|           10029|              null|
|           10033|              null|
|           10039|              null|
|           10040|              null|
|           10046|              null|
|           10055|              null|
|           10056|              null|
|           10078|              null|
|           10083|              null|
|           10085|              null|
|           10090|              null|
|           10092|              null|
|           10100|              null|
|           10103|              null|
+----------------+------------------+
only showing top 20 rows
```

## select\_fields
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-select_fields"></a>

**`select_fields(paths, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrame` that contains the selected fields.
+ `paths` – A list of strings. Each string is a path to a top-level node that you want to select.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use select\_fields to create a new `DynamicFrame` with chosen fields
<a name="pyspark-select_fields-example"></a>

The following code example shows how to use the `select_fields` method to create a new `DynamicFrame` with a chosen list of fields from an existing `DynamicFrame`.

**Note**  
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use select_fields to select specific fields from a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame and view its schema
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()

# Create a new DynamicFrame with chosen fields
names = persons.select_fields(paths=["family_name", "given_name"])
print("Schema for the names DynamicFrame, created with select_fields:")
names.printSchema()
names.toDF().show()
```

#### Output
<a name="select_fields-example-output"></a>

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the names DynamicFrame, created with select_fields:
root
|-- family_name: string
|-- given_name: string

+-----------+----------+
|family_name|given_name|
+-----------+----------+
|    Collins|   Michael|
|   Huizenga|      Bill|
|    Clawson|    Curtis|
|    Solomon|    Gerald|
|     Rigell|    Edward|
|      Crapo|   Michael|
|      Hutto|      Earl|
|      Ertel|     Allen|
|     Minish|    Joseph|
|    Andrews|    Robert|
|     Walden|      Greg|
|      Kazen|   Abraham|
|     Turner|   Michael|
|      Kolbe|     James|
|  Lowenthal|      Alan|
|    Capuano|   Michael|
|   Schrader|      Kurt|
|     Nadler|   Jerrold|
|     Graves|       Tom|
|   McMillan|      John|
+-----------+----------+
only showing top 20 rows
```

## simplify\_ddb\_json
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-simplify"></a>

**`simplify_ddb_json(): DynamicFrame`**

Simplifies nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new simplified `DynamicFrame`. If a List type contains multiple types or a Map type, the elements in the List are not simplified. Note that this is a specific type of transform that behaves differently from the regular `unnest` transform and requires the data to already be in the DynamoDB JSON structure. For more information, see [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html#DataExport.Output.Data).

For example, the schema of a read of an export with the DynamoDB JSON structure might look like the following:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

The `simplify_ddb_json()` transform would convert this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```
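Conceptually, the transform strips the DynamoDB type descriptors (`S`, `N`, `M`, `SS`, and so on) from each value. The following is a minimal pure-Python sketch of that idea, illustrative only; the actual Glue transform handles additional cases (and, as noted above, skips lists with mixed types).

```python
# Sketch: stripping DynamoDB JSON type descriptors from a nested
# value. Illustrative only -- not the Glue implementation.
def simplify(value):
    if isinstance(value, dict) and len(value) == 1:
        (tag, inner), = value.items()
        if tag in ("S", "N", "BOOL"):
            return inner                      # scalar descriptors
        if tag in ("SS", "NS", "BS"):
            return list(inner)                # set descriptors -> arrays
        if tag == "NULL":
            return None
        if tag == "M":                        # map descriptor -> struct
            return {k: simplify(v) for k, v in inner.items()}
    if isinstance(value, dict):               # plain struct: recurse
        return {k: simplify(v) for k, v in value.items()}
    return value

item = {"parentMap": {"M": {"childMap": {"M": {"appName": {"S": "myApp"}}}}},
        "isDDBJson": {"BOOL": True}}
print(simplify(item))
```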

### Example: Use simplify\_ddb\_json to invoke a DynamoDB JSON simplify
<a name="pyspark-simplify-ddb-json-example"></a>

This code example reads a DynamoDB table with the AWS Glue DynamoDB export connector, invokes a DynamoDB JSON simplify with the `simplify_ddb_json` method, and prints the number of partitions.

**Example code**

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "dynamodb",
    connection_options = {
        'dynamodb.export': 'ddb',
        'dynamodb.tableArn': '<table arn>',
        'dynamodb.s3.bucket': '<bucket name>',
        'dynamodb.s3.prefix': '<bucket prefix>',
        'dynamodb.s3.bucketOwner': '<account_id of bucket>'
    }
)
simplified = dynamicFrame.simplify_ddb_json()
print(simplified.getNumPartitions())
```

## spigot
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-spigot"></a>

**`spigot(path, options={})`**

Writes sample records to a specified destination to help you verify the transformations performed by your job.
+ `path` – The path of the destination to write to (required).
+ `options` – Key-value pairs that specify options (optional). The `"topk"` option specifies that the first `k` records should be written. The `"prob"` option specifies the probability (as a decimal) of choosing any given record. You can use either option to choose which records to write.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
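The sampling behavior of `"topk"` and `"prob"` can be sketched in plain Python. This is illustrative only, not the Glue implementation:

```python
import random

# Sketch: the two spigot sampling options, shown on plain rows.
def sample_records(records, options):
    if "topk" in options:
        # topk: write only the first k records
        return records[: options["topk"]]
    if "prob" in options:
        # prob: include each record independently with the
        # given probability
        return [r for r in records if random.random() < options["prob"]]
    return records

rows = list(range(100))
print(len(sample_records(rows, {"topk": 10})))
```

`topk` gives a deterministic prefix sample, which is cheap but can be unrepresentative of the full dataset; `prob` trades determinism for a more uniform sample.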

### Example: Use spigot to write sample fields from a `DynamicFrame` to Amazon S3
<a name="pyspark-spigot-example"></a>

This code example uses the `spigot` method to write sample records to an Amazon S3 bucket after applying the `select_fields` transform. 

**Example dataset**

**Note**  
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

The example uses a `DynamicFrame` called `persons` with the following schema:

```
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
```

**Example code**

```
# Example: Use spigot to write sample records
# to a destination during a transformation
# Replace DOC-EXAMPLE-BUCKET with your own location.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load table data into a DynamicFrame
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)

# Perform the select_fields on the DynamicFrame
persons = persons.select_fields(paths=["family_name", "given_name", "birth_date"])

# Use spigot to write a sample of the transformed data
# (the first 10 records)
spigot_output = persons.spigot(
    path="s3://DOC-EXAMPLE-BUCKET", options={"topk": 10}
)
```

#### Output
<a name="spigot-example-output"></a>

The following is an example of the data that `spigot` writes to Amazon S3. Because the example code specified `options={"topk": 10}`, the sample data contains the first 10 records.

```
{"family_name":"Collins","given_name":"Michael","birth_date":"1944-10-15"}
{"family_name":"Huizenga","given_name":"Bill","birth_date":"1969-01-31"}
{"family_name":"Clawson","given_name":"Curtis","birth_date":"1959-09-28"}
{"family_name":"Solomon","given_name":"Gerald","birth_date":"1930-08-14"}
{"family_name":"Rigell","given_name":"Edward","birth_date":"1960-05-28"}
{"family_name":"Crapo","given_name":"Michael","birth_date":"1951-05-20"}
{"family_name":"Hutto","given_name":"Earl","birth_date":"1926-05-12"}
{"family_name":"Ertel","given_name":"Allen","birth_date":"1937-11-07"}
{"family_name":"Minish","given_name":"Joseph","birth_date":"1916-09-01"}
{"family_name":"Andrews","given_name":"Robert","birth_date":"1957-08-04"}
```

## split\_fields
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_fields"></a>

**`split_fields(paths, name1, name2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrameCollection` that contains two `DynamicFrames`. The first `DynamicFrame` contains all the nodes that have been split off, and the second contains the nodes that remain.
+ `paths` – A list of strings, each of which is a full path to a node that you want to split into a new `DynamicFrame`.
+ `name1` – A name string for the `DynamicFrame` that is split off.
+ `name2` – A name string for the `DynamicFrame` that remains after the specified nodes have been split off.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
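The split can be pictured with plain dict rows. This sketch is illustrative only (not the Glue implementation): the named paths go to one frame, and everything else goes to the other.

```python
# Sketch: split_fields on plain dict rows -- the named paths go to
# one result, the remaining fields to the other. Illustrative only.
def split_fields(rows, paths):
    left = [{k: r[k] for k in paths} for r in rows]
    right = [{k: v for k, v in r.items() if k not in paths} for r in rows]
    return {"left": left, "right": right}

rows = [{"id": 1, "index": 0, "type": "phone", "value": "202-225-5265"}]
parts = split_fields(rows, ["id", "index"])
print(parts["left"][0])
print(parts["right"][0])
```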

### Example: Use split\_fields to split selected fields into a separate `DynamicFrame`
<a name="pyspark-split_fields-example"></a>

This code example uses the `split_fields` method to split a list of specified fields into a separate `DynamicFrame`.

**Example dataset**

The example uses a `DynamicFrame` called `l_root_contact_details` that is from a collection named `legislators_relationalized`.

`l_root_contact_details` has the following schema and entries.

```
root
|-- id: long
|-- index: int
|-- contact_details.val.type: string
|-- contact_details.val.value: string

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                   phone|             202-225-5265|
|  1|    1|                 twitter|              kathyhochul|
|  2|    0|                   phone|             202-225-3252|
|  2|    1|                 twitter|            repjackyrosen|
|  3|    0|                     fax|             202-225-1314|
|  3|    1|                   phone|             202-225-3772|
...
```

**Example code**

```
# Example: Use split_fields to split selected
# fields into a separate DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load the input DynamicFrame and inspect its schema
frame_to_split = legislators_relationalized.select("l_root_contact_details")
print("Inspect the input DynamicFrame's schema:")
frame_to_split.printSchema()

# Split id and index fields into a separate DynamicFrame
split_fields_collection = frame_to_split.split_fields(["id", "index"], "left", "right")

# Inspect the resulting DynamicFrames
print("Inspect the schemas of the DynamicFrames created with split_fields:")
split_fields_collection.select("left").printSchema()
split_fields_collection.select("right").printSchema()
```

#### Output
<a name="split_fields-example-output"></a>

```
Inspect the input DynamicFrame's schema:
root
|-- id: long
|-- index: int
|-- contact_details.val.type: string
|-- contact_details.val.value: string

Inspect the schemas of the DynamicFrames created with split_fields:
root
|-- id: long
|-- index: int

root
|-- contact_details.val.type: string
|-- contact_details.val.value: string
```

## split\_rows
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_rows"></a>

**`split_rows(comparison_dict, name1, name2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Splits one or more rows in a `DynamicFrame` off into a new `DynamicFrame`.

The method returns a new `DynamicFrameCollection` that contains two `DynamicFrames`. The first `DynamicFrame` contains all the rows that have been split off, and the second contains the rows that remain.
+ `comparison_dict` – A dictionary where the key is a path to a column, and the value is another dictionary for mapping comparators to values that the column values are compared to. For example, `{"age": {">": 10, "<": 20}}` splits off all rows whose value in the age column is greater than 10 and less than 20.
+ `name1` – A name string for the `DynamicFrame` that is split off.
+ `name2` – A name string for the `DynamicFrame` that remains after the specified nodes have been split off.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
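The `comparison_dict` evaluation can be sketched in plain Python. This is illustrative only (not the Glue implementation); it shows how every comparator for every listed column must hold for a row to be split off.

```python
import operator

# Sketch: evaluating a comparison_dict such as
# {"age": {">": 10, "<": 20}} against plain dict rows.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "=": operator.eq}

def split_rows(rows, comparison_dict):
    def matches(row):
        # All comparators on all listed columns must be satisfied
        return all(OPS[op](row[col], val)
                   for col, comps in comparison_dict.items()
                   for op, val in comps.items())
    split_off = [r for r in rows if matches(r)]
    remaining = [r for r in rows if not matches(r)]
    return split_off, remaining

rows = [{"age": 5}, {"age": 15}, {"age": 25}]
split_off, remaining = split_rows(rows, {"age": {">": 10, "<": 20}})
print(split_off)
print(remaining)
```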

### Example: Use split\_rows to split rows in a `DynamicFrame`
<a name="pyspark-split_rows-example"></a>

This code example uses the `split_rows` method to split rows in a `DynamicFrame` based on the `id` field value.

**Example dataset**

The example uses a `DynamicFrame` called `l_root_contact_details` that is selected from a collection named `legislators_relationalized`.

`l_root_contact_details` has the following schema and entries.

```
root
|-- id: long
|-- index: int
|-- contact_details.val.type: string
|-- contact_details.val.value: string

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                   phone|             202-225-5265|
|  1|    1|                 twitter|              kathyhochul|
|  2|    0|                   phone|             202-225-3252|
|  2|    1|                 twitter|            repjackyrosen|
|  3|    0|                     fax|             202-225-1314|
|  3|    1|                   phone|             202-225-3772|
|  3|    2|                 twitter|          MikeRossUpdates|
|  4|    0|                     fax|             202-225-1314|
|  4|    1|                   phone|             202-225-3772|
|  4|    2|                 twitter|          MikeRossUpdates|
|  5|    0|                     fax|             202-225-1314|
|  5|    1|                   phone|             202-225-3772|
|  5|    2|                 twitter|          MikeRossUpdates|
|  6|    0|                     fax|             202-225-1314|
|  6|    1|                   phone|             202-225-3772|
|  6|    2|                 twitter|          MikeRossUpdates|
|  7|    0|                     fax|             202-225-1314|
|  7|    1|                   phone|             202-225-3772|
|  7|    2|                 twitter|          MikeRossUpdates|
|  8|    0|                     fax|             202-225-1314|
+---+-----+------------------------+-------------------------+
```

**Example code**

```
# Example: Use split_rows to split up 
# rows in a DynamicFrame based on value

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Retrieve the DynamicFrame to split
frame_to_split = legislators_relationalized.select("l_root_contact_details")

# Split up rows by ID
split_rows_collection = frame_to_split.split_rows({"id": {">": 10}}, "high", "low")

# Inspect the resulting DynamicFrames
print("Inspect the DynamicFrame that contains IDs < 10")
split_rows_collection.select("low").toDF().show()
print("Inspect the DynamicFrame that contains IDs > 10")
split_rows_collection.select("high").toDF().show()
```

#### Output
<a name="split_rows-example-output"></a>

```
Inspect the DynamicFrame that contains IDs < 10
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                   phone|             202-225-5265|
|  1|    1|                 twitter|              kathyhochul|
|  2|    0|                   phone|             202-225-3252|
|  2|    1|                 twitter|            repjackyrosen|
|  3|    0|                     fax|             202-225-1314|
|  3|    1|                   phone|             202-225-3772|
|  3|    2|                 twitter|          MikeRossUpdates|
|  4|    0|                     fax|             202-225-1314|
|  4|    1|                   phone|             202-225-3772|
|  4|    2|                 twitter|          MikeRossUpdates|
|  5|    0|                     fax|             202-225-1314|
|  5|    1|                   phone|             202-225-3772|
|  5|    2|                 twitter|          MikeRossUpdates|
|  6|    0|                     fax|             202-225-1314|
|  6|    1|                   phone|             202-225-3772|
|  6|    2|                 twitter|          MikeRossUpdates|
|  7|    0|                     fax|             202-225-1314|
|  7|    1|                   phone|             202-225-3772|
|  7|    2|                 twitter|          MikeRossUpdates|
|  8|    0|                     fax|             202-225-1314|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

Inspect the DynamicFrame that contains IDs > 10
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 11|    0|                   phone|             202-225-5476|
| 11|    1|                 twitter|            RepDavidYoung|
| 12|    0|                   phone|             202-225-4035|
| 12|    1|                 twitter|           RepStephMurphy|
| 13|    0|                     fax|             202-226-0774|
| 13|    1|                   phone|             202-225-6335|
| 14|    0|                     fax|             202-226-0774|
| 14|    1|                   phone|             202-225-6335|
| 15|    0|                     fax|             202-226-0774|
| 15|    1|                   phone|             202-225-6335|
| 16|    0|                     fax|             202-226-0774|
| 16|    1|                   phone|             202-225-6335|
| 17|    0|                     fax|             202-226-0774|
| 17|    1|                   phone|             202-225-6335|
| 18|    0|                     fax|             202-226-0774|
| 18|    1|                   phone|             202-225-6335|
| 19|    0|                     fax|             202-226-0774|
| 19|    1|                   phone|             202-225-6335|
| 20|    0|                     fax|             202-226-0774|
| 20|    1|                   phone|             202-225-6335|
+---+-----+------------------------+-------------------------+
only showing top 20 rows
```

## unbox
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unbox"></a>

**`unbox(path, format, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0, **options)`**

Unboxes (reformats) a string field in a `DynamicFrame` and returns a new `DynamicFrame` that contains the unboxed `DynamicRecords`.

A `DynamicRecord` represents a logical record in a `DynamicFrame`. It's similar to a row in an Apache Spark `DataFrame`, except that it is self-describing and can be used for data that doesn't conform to a fixed schema.
+ `path` – A full path to the string node you want to unbox.
+ `format` – A format specification (optional). You use this for an Amazon S3 or AWS Glue connection that supports multiple formats. For the formats that are supported, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `options` – One or more of the following:
  + `separator` – A string that contains the separator character.
  + `escaper` – A string that contains the escape character.
  + `skipFirst` – A Boolean value that indicates whether to skip the first instance.
  + `withSchema` – A string containing a JSON representation of the node's schema. The format of a schema's JSON representation is defined by the output of `StructType.json()`.
  + `withHeader` – A Boolean value that indicates whether a header is included.
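Conceptually, unboxing parses each string value according to the given format and replaces it with structured data. The following is a minimal plain-Python sketch of that idea for JSON-formatted strings; the record and field names are hypothetical, and real unboxing operates on a distributed `DynamicFrame` rather than a list:

```python
import json

# Hypothetical records with a JSON-encoded string field, "AddressString"
records = [
    {
        "Provider Id": "10001",
        "AddressString": '{"Street": "1108 ROSS CLARK CIRCLE", "City": "DOTHAN"}',
    },
]

# Unboxing replaces each string value with its parsed structure
unboxed = [
    {**rec, "AddressString": json.loads(rec["AddressString"])}
    for rec in records
]

print(unboxed[0]["AddressString"]["City"])  # DOTHAN
```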

### Example: Use unbox to unbox a string field into a struct
<a name="pyspark-unbox-example"></a>

This code example uses the `unbox` method to *unbox*, or reformat, a string field in a `DynamicFrame` into a field of type struct.

**Example dataset**

The example uses a `DynamicFrame` called `mapped_with_string` with the following schema and entries.

Notice the field named `AddressString`. This is the field that the example unboxes into a struct.

```
root
|-- Average Total Payments: string
|-- AddressString: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string

+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
|Average Total Payments|       AddressString|Average Covered Charges|      DRG Definition|Average Medicare Payments|Hospital Referral Region Description|             Address|Provider Id|Total Discharges|       Provider Name|
+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
|              $5777.24|{"Street": "1108 ...|              $32963.07|039 - EXTRACRANIA...|                 $4763.73|                         AL - Dothan|[36301, DOTHAN, [...|      10001|              91|SOUTHEAST ALABAMA...|
|              $5787.57|{"Street": "2505 ...|              $15131.85|039 - EXTRACRANIA...|                 $4976.71|                     AL - Birmingham|[35957, BOAZ, [25...|      10005|              14|MARSHALL MEDICAL ...|
|              $5434.95|{"Street": "205 M...|              $37560.37|039 - EXTRACRANIA...|                 $4453.79|                     AL - Birmingham|[35631, FLORENCE,...|      10006|              24|ELIZA COFFEE MEMO...|
|              $5417.56|{"Street": "50 ME...|              $13998.28|039 - EXTRACRANIA...|                 $4129.16|                     AL - Birmingham|[35235, BIRMINGHA...|      10011|              25|   ST VINCENT'S EAST|
...
```

**Example code**

```
# Example: Use unbox to unbox a string field
# into a struct in a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

unboxed = mapped_with_string.unbox("AddressString", "json")
unboxed.printSchema()
unboxed.toDF().show()
```

#### Output
<a name="unbox-example-output"></a>

```
root
|-- Average Total Payments: string
|-- AddressString: struct
|    |-- Street: string
|    |-- City: string
|    |-- State: string
|    |-- Zip.Code: string
|    |-- Array: array
|    |    |-- element: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string

+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
|Average Total Payments|       AddressString|Average Covered Charges|      DRG Definition|Average Medicare Payments|Hospital Referral Region Description|             Address|Provider Id|Total Discharges|       Provider Name|
+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
|              $5777.24|[1108 ROSS CLARK ...|              $32963.07|039 - EXTRACRANIA...|                 $4763.73|                         AL - Dothan|[36301, DOTHAN, [...|      10001|              91|SOUTHEAST ALABAMA...|
|              $5787.57|[2505 U S HIGHWAY...|              $15131.85|039 - EXTRACRANIA...|                 $4976.71|                     AL - Birmingham|[35957, BOAZ, [25...|      10005|              14|MARSHALL MEDICAL ...|
|              $5434.95|[205 MARENGO STRE...|              $37560.37|039 - EXTRACRANIA...|                 $4453.79|                     AL - Birmingham|[35631, FLORENCE,...|      10006|              24|ELIZA COFFEE MEMO...|
|              $5417.56|[50 MEDICAL PARK ...|              $13998.28|039 - EXTRACRANIA...|                 $4129.16|                     AL - Birmingham|[35235, BIRMINGHA...|      10011|              25|   ST VINCENT'S EAST|
|              $5658.33|[1000 FIRST STREE...|              $31633.27|039 - EXTRACRANIA...|                 $4851.44|                     AL - Birmingham|[35007, ALABASTER...|      10016|              18|SHELBY BAPTIST ME...|
|              $6653.80|[2105 EAST SOUTH ...|              $16920.79|039 - EXTRACRANIA...|                 $5374.14|                     AL - Montgomery|[36116, MONTGOMER...|      10023|              67|BAPTIST MEDICAL C...|
|              $5834.74|[2000 PEPPERELL P...|              $11977.13|039 - EXTRACRANIA...|                 $4761.41|                     AL - Birmingham|[36801, OPELIKA, ...|      10029|              51|EAST ALABAMA MEDI...|
|              $8031.12|[619 SOUTH 19TH S...|              $35841.09|039 - EXTRACRANIA...|                 $5858.50|                     AL - Birmingham|[35233, BIRMINGHA...|      10033|              32|UNIVERSITY OF ALA...|
|              $6113.38|[101 SIVLEY RD, H...|              $28523.39|039 - EXTRACRANIA...|                 $5228.40|                     AL - Huntsville|[35801, HUNTSVILL...|      10039|             135| HUNTSVILLE HOSPITAL|
|              $5541.05|[1007 GOODYEAR AV...|              $75233.38|039 - EXTRACRANIA...|                 $4386.94|                     AL - Birmingham|[35903, GADSDEN, ...|      10040|              34|GADSDEN REGIONAL ...|
|              $5461.57|[600 SOUTH THIRD ...|              $67327.92|039 - EXTRACRANIA...|                 $4493.57|                     AL - Birmingham|[35901, GADSDEN, ...|      10046|              14|RIVERVIEW REGIONA...|
|              $5356.28|[4370 WEST MAIN S...|              $39607.28|039 - EXTRACRANIA...|                 $4408.20|                         AL - Dothan|[36305, DOTHAN, [...|      10055|              45|    FLOWERS HOSPITAL|
|              $5374.65|[810 ST VINCENT'S...|              $22862.23|039 - EXTRACRANIA...|                 $4186.02|                     AL - Birmingham|[35205, BIRMINGHA...|      10056|              43|ST VINCENT'S BIRM...|
|              $5366.23|[400 EAST 10TH ST...|              $31110.85|039 - EXTRACRANIA...|                 $4376.23|                     AL - Birmingham|[36207, ANNISTON,...|      10078|              21|NORTHEAST ALABAMA...|
|              $5282.93|[1613 NORTH MCKEN...|              $25411.33|039 - EXTRACRANIA...|                 $4383.73|                         AL - Mobile|[36535, FOLEY, [1...|      10083|              15|SOUTH BALDWIN REG...|
|              $5676.55|[1201 7TH STREET ...|               $9234.51|039 - EXTRACRANIA...|                 $4509.11|                     AL - Huntsville|[35609, DECATUR, ...|      10085|              27|DECATUR GENERAL H...|
|              $5930.11|[6801 AIRPORT BOU...|              $15895.85|039 - EXTRACRANIA...|                 $3972.85|                         AL - Mobile|[36608, MOBILE, [...|      10090|              27| PROVIDENCE HOSPITAL|
|              $6192.54|[809 UNIVERSITY B...|              $19721.16|039 - EXTRACRANIA...|                 $5179.38|                     AL - Tuscaloosa|[35401, TUSCALOOS...|      10092|              31|D C H REGIONAL ME...|
|              $4968.00|[750 MORPHY AVENU...|              $10710.88|039 - EXTRACRANIA...|                 $3898.88|                         AL - Mobile|[36532, FAIRHOPE,...|      10100|              18|     THOMAS HOSPITAL|
|              $5996.00|[701 PRINCETON AV...|              $51343.75|039 - EXTRACRANIA...|                 $4962.45|                     AL - Birmingham|[35211, BIRMINGHA...|      10103|              33|BAPTIST MEDICAL C...|
+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
only showing top 20 rows
```

## union
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-union"></a>

**`union(frame1, frame2, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)`**

Unions two `DynamicFrame` objects and returns a new `DynamicFrame` that contains all records from both inputs. This transform can return different results than the union of two `DataFrame` objects with equivalent data. If you need the Spark `DataFrame` union behavior, consider using `toDF`.
+ `frame1` – The first `DynamicFrame` to union.
+ `frame2` – The second `DynamicFrame` to union.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
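Because each record in a `DynamicFrame` is self-describing, a union keeps every record even when the two inputs have different fields. The following plain-Python sketch illustrates that behavior conceptually; the field names are hypothetical, and the sketch glosses over how a real union resolves type conflicts:

```python
# Records from two hypothetical DynamicFrames with slightly different schemas
frame1_records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
frame2_records = [{"id": 3, "name": "c", "extra": True}]

# A DynamicFrame union keeps every record as-is; the combined schema
# is the union of the fields seen across both inputs.
union_records = frame1_records + frame2_records

print(len(union_records))           # 3
print("extra" in union_records[2])  # True
```

In contrast, a Spark `DataFrame` union requires the two inputs to have compatible column sets, which is why results can differ between the two APIs.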

## unnest
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest"></a>

**`unnest(transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Unnests nested objects in a `DynamicFrame`, which makes them top-level objects, and returns a new unnested `DynamicFrame`.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use unnest to turn nested fields into top-level fields
<a name="pyspark-unnest-example"></a>

This code example uses the `unnest` method to flatten all of the nested fields in a `DynamicFrame` into top-level fields.

**Example dataset**

The example uses a `DynamicFrame` called `mapped_medicare` with the following schema. Notice that the `Address` field is the only field that contains nested data.

```
root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string
```

**Example code**

```
# Example: Use unnest to unnest nested
# objects in a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Unnest all nested fields
unnested = mapped_medicare.unnest()
unnested.printSchema()
```

#### Output
<a name="unnest-example-output"></a>

```
root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address.Zip.Code: string
|-- Address.City: string
|-- Address.Array: array
|    |-- element: string
|-- Address.State: string
|-- Address.Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string
```
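As the output shows, unnesting promotes each nested field to a top-level field whose name is the dot-separated path to it. The following plain-Python sketch illustrates that naming rule; the field names are hypothetical, and a real unnest operates on the `DynamicFrame` schema rather than Python dictionaries:

```python
def flatten(record, prefix=""):
    """Recursively promote nested dict fields to top-level,
    dot-joined keys, mirroring how unnest names columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

record = {"Provider Id": "10001", "Address": {"City": "DOTHAN", "State": "AL"}}
print(flatten(record))
# {'Provider Id': '10001', 'Address.City': 'DOTHAN', 'Address.State': 'AL'}
```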

## unnest_ddb_json
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest_ddb_json"></a>

Unnests nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new unnested `DynamicFrame`. Columns that are arrays of structs are not unnested. Note that this transform behaves differently from the regular `unnest` transform and requires the data to already be in the DynamoDB JSON structure. For more information, see [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html#DataExport.Output.Data).

**`unnest_ddb_json(transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional: zero by default, indicating that the process should not error out).
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional: zero by default, indicating that the process should not error out).

For example, the schema that results from reading an export in the DynamoDB JSON structure might look like the following:

```
root
|-- Item: struct
|    |-- ColA: struct
|    |    |-- S: string
|    |-- ColB: struct
|    |    |-- S: string
|    |-- ColC: struct
|    |    |-- N: string
|    |-- ColD: struct
|    |    |-- L: array
|    |    |    |-- element: null
```

The `unnest_ddb_json()` transform would convert this to:

```
root
|-- ColA: string
|-- ColB: string
|-- ColC: string
|-- ColD: array    
|    |-- element: null
```
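Conceptually, the transform strips the top-level `Item` wrapper and the single-letter DynamoDB type descriptors (`S`, `N`, `L`, and so on). The following simplified plain-Python sketch illustrates that unwrapping for the single-value cases shown above; it is a conceptual illustration, not the actual transform logic:

```python
def unwrap_ddb_item(item):
    """Remove single-letter DynamoDB type wrappers such as
    {"S": "value"} or {"N": "123"} from each column."""
    unwrapped = {}
    for col, typed_value in item.items():
        # Each value is a one-entry dict: {type_descriptor: raw_value}
        (_descriptor, raw), = typed_value.items()
        unwrapped[col] = raw
    return unwrapped

record = {"Item": {"ColA": {"S": "abc"}, "ColC": {"N": "42"}}}
print(unwrap_ddb_item(record["Item"]))
# {'ColA': 'abc', 'ColC': '42'}
```

Note that numeric values (`N`) remain strings, consistent with the string-typed columns in the converted schema above.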

The following code example shows how to use the AWS Glue DynamoDB export connector, invoke a DynamoDB JSON unnest, and print the number of partitions:

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dynamicFrame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "<test_source>",
        "dynamodb.s3.bucket": "<bucket name>",
        "dynamodb.s3.prefix": "<bucket prefix>",
        "dynamodb.s3.bucketOwner": "<account_id>",
    }
)
unnested = dynamicFrame.unnest_ddb_json()
print(unnested.getNumPartitions())

job.commit()
```

## write
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-write"></a>

**`write(connection_type, connection_options, format, format_options, accumulator_size)`**

Gets a [DataSink(object)](aws-glue-api-crawler-pyspark-extensions-types.md#aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-sink) of the specified connection type from the [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) of this `DynamicFrame`, and uses it to format and write the contents of this `DynamicFrame`. Returns the new `DynamicFrame` formatted and written as specified.
+ `connection_type` – The connection type to use. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, and `oracle`.
+ `connection_options` – The connection options to use (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
**Warning**  
Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable, "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```
+ `format` – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `accumulator_size` – The accumulable size to use, in bytes (optional).
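Because each connection type expects a different set of options, it can help to validate the `connection_options` dictionary before calling `write`. The following hypothetical helper sketches such a check; the required-key lists are illustrative only, not an exhaustive specification of any connection type:

```python
# Illustrative required keys per connection type; consult the
# connection documentation for the authoritative list.
REQUIRED_KEYS = {
    "s3": {"path"},
    "redshift": {"url", "user", "password", "dbtable", "redshiftTmpDir"},
    "mysql": {"url", "user", "password", "dbtable"},
}

def check_connection_options(connection_type, connection_options):
    """Return a sorted list of option keys missing for this connection type."""
    required = REQUIRED_KEYS.get(connection_type, set())
    return sorted(required - connection_options.keys())

missing = check_connection_options("s3", {"path": "s3://aws-glue-target/temp"})
print(missing)  # []
```

Running the check before `write` surfaces configuration mistakes early, instead of at job run time.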

## — errors —
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-_errors"></a>
+ [assertErrorThreshold](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-assertErrorThreshold)
+ [errorsAsDynamicFrame](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsAsDynamicFrame)
+ [errorsCount](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsCount)
+ [stageErrorsCount](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-stageErrorsCount)

## assertErrorThreshold
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-assertErrorThreshold"></a>

`assertErrorThreshold( )` – An assert for errors in the transformations that created this `DynamicFrame`. Returns an `Exception` from the underlying `DataFrame`.

## errorsAsDynamicFrame
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsAsDynamicFrame"></a>

`errorsAsDynamicFrame( )` – Returns a `DynamicFrame` that has error records nested inside.

### Example: Use errorsAsDynamicFrame to view error records
<a name="pyspark-errorsAsDynamicFrame-example"></a>

The following code example shows how to use the `errorsAsDynamicFrame` method to view an error record for a `DynamicFrame`.

**Example dataset**

The example uses the following dataset that you can upload to Amazon S3 as JSON. Notice that the second record is malformed. Malformed data typically breaks file parsing when you use SparkSQL. However, `DynamicFrame` recognizes malformation issues and turns malformed lines into error records that you can handle individually.

```
{"id": 1, "name": "george", "surname": "washington", "height": 178}
{"id": 2, "name": "benjamin", "surname": "franklin", 
{"id": 3, "name": "alexander", "surname": "hamilton", "height": 171}
{"id": 4, "name": "john", "surname": "jay", "height": 190}
```

**Example code**

```
# Example: Use errorsAsDynamicFrame to view error records.
# Replace s3://DOC-EXAMPLE-S3-BUCKET/error_data.json with your location.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create errors DynamicFrame, view schema
errors = glueContext.create_dynamic_frame.from_options(
    "s3", {"paths": ["s3://DOC-EXAMPLE-S3-BUCKET/error_data.json"]}, "json"
)
print("Schema of errors DynamicFrame:")
errors.printSchema()

# Show that errors only contains valid entries from the dataset
print("errors contains only valid records from the input dataset (2 of 4 records)")
errors.toDF().show()

# View errors
print("Errors count:", str(errors.errorsCount()))
print("Errors:")
errors.errorsAsDynamicFrame().toDF().show()

# View error fields and error data
error_record = errors.errorsAsDynamicFrame().toDF().head()

error_fields = error_record["error"]
print("Error fields: ")
print(error_fields.asDict().keys())

print("\nError record data:")
for key in error_fields.asDict().keys():
    print("\n", key, ": ", str(error_fields[key]))
```

#### Output
<a name="errorsAsDynamicFrame-example-output"></a>

```
Schema of errors DynamicFrame:
root
|-- id: int
|-- name: string
|-- surname: string
|-- height: int

errors contains only valid records from the input dataset (2 of 4 records)
+---+------+----------+------+
| id|  name|   surname|height|
+---+------+----------+------+
|  1|george|washington|   178|
|  4|  john|       jay|   190|
+---+------+----------+------+

Errors count: 1
Errors:
+--------------------+
|               error|
+--------------------+
|[[  File "/tmp/20...|
+--------------------+

Error fields: 
dict_keys(['callsite', 'msg', 'stackTrace', 'input', 'bytesread', 'source', 'dynamicRecord'])

Error record data:

 callsite :  Row(site='  File "/tmp/2060612586885849088", line 549, in <module>\n    sys.exit(main())\n  File "/tmp/2060612586885849088", line 523, in main\n    response = handler(content)\n  File "/tmp/2060612586885849088", line 197, in execute_request\n    result = node.execute()\n  File "/tmp/2060612586885849088", line 103, in execute\n    exec(code, global_dict)\n  File "<stdin>", line 10, in <module>\n  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 625, in from_options\n    format_options, transformation_ctx, push_down_predicate, **kwargs)\n  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 233, in create_dynamic_frame_from_options\n    source.setFormat(format, **format_options)\n', info='')

 msg :  error in jackson reader

 stackTrace :  com.fasterxml.jackson.core.JsonParseException: Unexpected character ('{' (code 123)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name
 at [Source: com.amazonaws.services.glue.readers.BufferedStream@73492578; line: 3, column: 2]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:462)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleOddName(UTF8StreamJsonParser.java:2012)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1650)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:740)
	at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$hasNextGoodToken$1.apply(JacksonReader.scala:57)
	at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$hasNextGoodToken$1.apply(JacksonReader.scala:57)
	at scala.collection.Iterator$$anon$9.next(Iterator.scala:162)
	at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:599)
	at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:598)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$1.apply(JacksonReader.scala:120)
	at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$1.apply(JacksonReader.scala:116)
	at com.amazonaws.services.glue.DynamicRecordBuilder.handleErr(DynamicRecordBuilder.scala:209)
	at com.amazonaws.services.glue.DynamicRecordBuilder.handleErrorWithException(DynamicRecordBuilder.scala:202)
	at com.amazonaws.services.glue.readers.JacksonReader.nextFailSafe(JacksonReader.scala:116)
	at com.amazonaws.services.glue.readers.JacksonReader.next(JacksonReader.scala:109)
	at com.amazonaws.services.glue.readers.JSONReader.next(JSONReader.scala:247)
	at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.nextKeyValue(TapeHadoopRecordReaderSplittable.scala:103)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:230)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)


 input :  

 bytesread :  252

 source :  

 dynamicRecord :  Row(id=2, name='benjamin', surname='franklin')
```

## Comprehensive DynamicFrame examples
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-comprehensive-examples"></a>

The following examples demonstrate additional ways to create and work with DynamicFrames beyond basic AWS Glue Data Catalog scenarios.

### Loading from PostgreSQL with SQL SELECT query
<a name="dynamicframe-postgresql-example"></a>

This example shows how to load data from a PostgreSQL database using a custom SQL SELECT query:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Load specific data from PostgreSQL with custom query
postgres_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "(SELECT customer_id, customer_name, email FROM customers WHERE active = true) AS filtered_customers"
    }
)
```

### Loading specific columns to avoid full table scans
<a name="dynamicframe-column-selection-example"></a>

This example demonstrates how to load only specific columns from a large database table:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Load only specific columns from a large table
selected_columns_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://your-mysql-host:3306/your-database",
        "user": "your-username", 
        "password": "your-password",
        "dbtable": "(SELECT order_id, customer_id FROM large_orders_table) AS selected_data"
    }
)

# Alternative approach using column selection in query
efficient_load_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password", 
        "query": "SELECT product_id, product_name FROM products WHERE category = 'electronics'"
    }
)
```

### Row-level filtering via JDBC connections
<a name="dynamicframe-row-filtering-example"></a>

This example shows how to use row-level filtering to load only specific rows from a database table:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Load filtered rows using WHERE clause
filtered_rows_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "(SELECT * FROM transactions WHERE transaction_date >= '2024-01-01' AND amount > 100) AS recent_large_transactions"
    }
)

# Using partitionColumn for parallel loading with filtering
partitioned_load_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://your-mysql-host:3306/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "sales_data",
        "partitionColumn": "sale_date",
        "lowerBound": "2024-01-01",
        "upperBound": "2024-12-31",
        "numPartitions": "10"
    }
)
```
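The `partitionColumn`, `lowerBound`, and `upperBound` options in the second example control how the read is split into parallel range queries; they don't filter rows. The following plain-Python sketch (a hypothetical helper, not part of the AWS Glue API) shows the kind of per-partition predicates a JDBC reader generates for a numeric partition column; the exact clauses Spark emits may differ:

```
# Hypothetical illustration of JDBC range partitioning: split
# [lowerBound, upperBound) into numPartitions per-partition WHERE clauses.
# The first and last ranges are left open so rows outside the bounds
# are still read rather than dropped.
def partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            predicates.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

predicates = partition_predicates("id", 1, 1000001, 4)
```

Because the first and last ranges are open-ended, the bounds determine only the stride of each partition, not which rows are read.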

### Creating DynamicFrame from in-memory Python data
<a name="dynamicframe-in-memory-example"></a>

This example demonstrates how to create a DynamicFrame from Python lists, tuples, or dictionaries:

```
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import Row

sc = SparkContext()
glueContext = GlueContext(sc)

# Method 1: From list of tuples
data_tuples = [
    ("John", "Doe", 30, "Engineer"),
    ("Jane", "Smith", 25, "Designer"), 
    ("Bob", "Johnson", 35, "Manager")
]

# Convert to RDD of Rows
rdd = sc.parallelize([Row(first_name=row[0], last_name=row[1], age=row[2], job=row[3]) for row in data_tuples])
df = glueContext.spark_session.createDataFrame(rdd)
dyf_from_tuples = DynamicFrame.fromDF(df, glueContext, "employees_from_tuples")

# Method 2: From list of dictionaries
data_dicts = [
    {"product_id": 1, "product_name": "Laptop", "price": 999.99, "category": "Electronics"},
    {"product_id": 2, "product_name": "Book", "price": 19.99, "category": "Education"},
    {"product_id": 3, "product_name": "Chair", "price": 149.99, "category": "Furniture"}
]

df_from_dicts = glueContext.spark_session.createDataFrame(data_dicts)
dyf_from_dicts = DynamicFrame.fromDF(df_from_dicts, glueContext, "products_from_dicts")

# Method 3: From nested data structures
nested_data = [
    {
        "customer_id": 1,
        "customer_info": {
            "name": "Alice Brown",
            "email": "alice@example.com"
        },
        "orders": [
            {"order_id": 101, "amount": 250.00},
            {"order_id": 102, "amount": 175.50}
        ]
    }
]

df_nested = glueContext.spark_session.createDataFrame(nested_data)
dyf_nested = DynamicFrame.fromDF(df_nested, glueContext, "customers_with_orders")
```

### Performance optimization for large datasets
<a name="dynamicframe-performance-tips"></a>

When working with large datasets, consider these performance optimization techniques:

```
# Use partitioning for parallel reads
large_table_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "large_table",
        "partitionColumn": "id",
        "lowerBound": "1",
        "upperBound": "1000000", 
        "numPartitions": "20"
    }
)

# Use pushdown predicates to filter at source
filtered_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://your-mysql-host:3306/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "transactions"
    },
    push_down_predicate="transaction_date >= '2024-01-01'"
)
```

## errorsCount
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsCount"></a>

`errorsCount( )` – Returns the total number of errors in a `DynamicFrame`.

## stageErrorsCount
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-stageErrorsCount"></a>

`stageErrorsCount` – Returns the number of errors that occurred in the process of generating this `DynamicFrame`.
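To make the error-counting behavior concrete: rather than failing a job on malformed input, a `DynamicFrame` sets the offending records aside as error records, and `errorsCount` reports how many there are. The following is a plain-Python analogue of that bookkeeping, not the actual AWS Glue implementation:

```
import json

# Conceptual sketch only: malformed records are kept aside rather than
# failing the load, mirroring how a DynamicFrame tracks error records.
raw_records = ['{"id": 1}', '{"id": 2}', 'not-json', '{"id": 3', '{"id": 4}']

good, errors = [], []
for rec in raw_records:
    try:
        good.append(json.loads(rec))
    except json.JSONDecodeError:
        errors.append(rec)  # retained for inspection, like an error record

errors_count = len(errors)  # the figure errorsCount() would report
```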

# DynamicFrameCollection class
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection"></a>

A `DynamicFrameCollection` is a dictionary of [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md) objects, in which the keys are the names of the `DynamicFrames` and the values are the `DynamicFrame` objects.

## \_\_init\_\_
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection-__init__"></a>

**`__init__(dynamic_frames, glue_ctx)`**
+ `dynamic_frames` – A dictionary of [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md) objects.
+ `glue_ctx` – A [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) object.

## Keys
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection-keys"></a>

`keys( )` – Returns a list of the keys in this collection, which generally consists of the names of the corresponding `DynamicFrame` values.

## Values
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection-values"></a>

`values(key)` – Returns a list of the `DynamicFrame` values in this collection.

## Select
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection-select"></a>

**`select(key)`**

Returns the `DynamicFrame` that corresponds to the specified key (which is generally the name of the `DynamicFrame`).
+ `key` – A key in the `DynamicFrameCollection`, which usually represents the name of a `DynamicFrame`.

## Map
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection-map"></a>

**`map(callable, transformation_ctx="")`**

Uses a passed-in function to create and return a new `DynamicFrameCollection` based on the `DynamicFrames` in this collection.
+ `callable` – A function that takes a `DynamicFrame` and the specified transformation context as parameters and returns a `DynamicFrame`.
+ `transformation_ctx` – A transformation context to be used by the callable (optional).

## Flatmap
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection-flatmap"></a>

**`flatmap(f, transformation_ctx="")`**

Uses a passed-in function to create and return a new `DynamicFrameCollection` based on the `DynamicFrames` in this collection.
+ `f` – A function that takes a `DynamicFrame` as a parameter and returns a `DynamicFrame` or `DynamicFrameCollection`.
+ `transformation_ctx` – A transformation context to be used by the function (optional).
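Conceptually, the collection behaves like a dictionary of named frames. The following plain-Python sketch mirrors the `keys`, `select`, and `map` semantics described above; lists of rows stand in for `DynamicFrame` objects, so it is illustrative only:

```
# A dict of named "frames" standing in for a DynamicFrameCollection.
collection = {
    "customers": [{"id": 1}, {"id": 2}],
    "orders": [{"order_id": 101}, {}],
}

keys = list(collection.keys())        # like keys()
customers = collection["customers"]   # like select("customers")

# like map(callable): apply a function to every frame, yielding a new
# collection with the same keys
def drop_empty_rows(frame, transformation_ctx=""):
    return [row for row in frame if row]

mapped = {name: drop_empty_rows(f) for name, f in collection.items()}
```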

# DynamicFrameWriter class
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer"></a>

Writes a `DynamicFrame` to external data stores using the specified connection and format information.

##  — methods —
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-_methods"></a>
+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-__init__)
+ [from\_options](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_options)
+ [from\_catalog](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_catalog)
+ [from\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_jdbc_conf)

## \_\_init\_\_
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-__init__"></a>

**`__init__(glue_context)`**
+ `glue_context` – The [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) to use.

## from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_options"></a>

**`from_options(frame, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")`**

Writes a `DynamicFrame` using the specified connection and format.
+ `frame` – The `DynamicFrame` to write.
+ `connection_type` – The connection type. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, and `oracle`.
+ `connection_options` – Connection options, such as path and database table (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
**Warning**  
Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"} 
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – A transformation context to use (optional).

## from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_catalog"></a>

**`from_catalog(frame, name_space, table_name, redshift_tmp_dir="", transformation_ctx="", additional_options={})`**

Writes a `DynamicFrame` using the specified catalog database and table name.
+ `frame` – The `DynamicFrame` to write.
+ `name_space` – The database to use.
+ `table_name` – The name of the table to use.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – A transformation context to use (optional).
+ `additional_options` – Additional options provided to AWS Glue. 

  To write to Lake Formation governed tables, you can use these additional options:
  + `transactionId` – (String) The transaction ID at which to do the write to the governed table. The transaction cannot already be committed or aborted, or the write will fail.
  + `callDeleteObjectsOnCancel` – (Boolean, optional) If set to `true` (default), AWS Glue automatically calls the `DeleteObjectsOnCancel` API after the object is written to Amazon S3. For more information, see [DeleteObjectsOnCancel](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-transactions-api.html#aws-lake-formation-api-transactions-api-DeleteObjectsOnCancel) in the *AWS Lake Formation Developer Guide*.  
**Example: Writing to a governed table in Lake Formation**  

  ```
  txId = glueContext.start_transaction(read_only=False)
  glueContext.write_dynamic_frame.from_catalog(
      frame=dyf,
      database = db, 
      table_name = tbl, 
      transformation_ctx = "datasource0", 
      additional_options={"transactionId":txId})
  ...
  glueContext.commit_transaction(txId)
  ```

## from\_jdbc\_conf
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_jdbc_conf"></a>

**`from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")`**

Writes a `DynamicFrame` using the specified JDBC connection information.
+ `frame` – The `DynamicFrame` to write.
+ `catalog_connection` – A catalog connection to use.
+ `connection_options` – Connection options, such as path and database table (optional).
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – A transformation context to use (optional).

## Example for write\_dynamic\_frame
<a name="pyspark-WriteDynamicFrame-examples"></a>

This example writes the output locally using a `connection_type` of `s3` with a POSIX path argument in `connection_options`, which allows writing to local storage.

```
glueContext.write_dynamic_frame.from_options(
    frame=dyf_splitFields,
    connection_options={'path': '/home/glue/GlueLocalOutput/'},
    connection_type='s3',
    format='json')
```

# DynamicFrameReader class
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader"></a>

Reads `DynamicFrame` objects from external sources using the specified connection and format information.

##  — methods —
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-_methods"></a>
+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-__init__)
+ [from\_rdd](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_rdd)
+ [from\_options](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options)
+ [from\_catalog](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_catalog)

## \_\_init\_\_
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-__init__"></a>

**`__init__(glue_context)`**
+ `glue_context` – The [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) to use.

## from\_rdd
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_rdd"></a>

**`from_rdd(data, name, schema=None, sampleRatio=None)`**

Reads a `DynamicFrame` from a Resilient Distributed Dataset (RDD).
+ `data` – The dataset to read from.
+ `name` – The name to read from.
+ `schema` – The schema to read (optional).
+ `sampleRatio` – The sample ratio (optional).

## from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options"></a>

**`from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")`**

Reads a `DynamicFrame` using the specified connection and format.
+ `connection_type` – The connection type. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, `dynamodb`, and `snowflake`.
+ `connection_options` – Connection options, such as path and database table (optional). For more information, see [ Connection types and options for ETL in AWS Glue for Spark ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html). For a `connection_type` of `s3`, Amazon S3 paths are defined in an array.

  ```
  connection_options = {"paths": [ "s3://amzn-s3-demo-bucket/object_a", "s3://amzn-s3-demo-bucket/object_b"]}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
**Warning**  
Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"} 
  ```

  For a JDBC connection that performs parallel reads, you can set the `hashfield` option. For example:

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path" , "hashfield": "month"} 
  ```

  For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md). 
+ `format` – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-Filtering Using Pushdown Predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns).
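The idea behind `hashfield` can be sketched in plain Python: each parallel reader fetches only the rows whose hashed column value falls in its bucket, so together the readers cover the table exactly once. This is an illustrative model, not AWS Glue code:

```
# Illustrative model of hash-based parallel reads: route each row to a
# reader by hashing a column value. A deterministic stand-in for the
# hash keeps the example reproducible.
rows = [
    {"id": 1, "month": "jan"}, {"id": 2, "month": "feb"},
    {"id": 3, "month": "jan"}, {"id": 4, "month": "mar"},
]

num_readers = 2
buckets = {i: [] for i in range(num_readers)}
for row in rows:
    bucket = sum(ord(c) for c in row["month"]) % num_readers
    buckets[bucket].append(row)

total_read = sum(len(b) for b in buckets.values())  # each row read exactly once
```

Choosing a column with evenly distributed values (such as `month` here) keeps the buckets balanced, which is why the guidance recommends such a column for `hashfield`.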

## from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_catalog"></a>

**`from_catalog(database, table_name, redshift_tmp_dir="", transformation_ctx="", push_down_predicate="", additional_options={})`**

Reads a `DynamicFrame` using the specified catalog namespace and table name.
+ `database` – The database to read from.
+ `table_name` – The name of the table to read from.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional if not reading data from Redshift).
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additional_options` – Additional options provided to AWS Glue. 
  + To use a JDBC connection that performs parallel reads, you can set the `hashfield`, `hashexpression`, or `hashpartitions` options. For example:

    ```
    additional_options = {"hashfield": "month"} 
    ```

    For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md). 
  + To pass a catalog expression that filters based on the index columns, you can use the `catalogPartitionPredicate` option.

    `catalogPartitionPredicate` — You can pass a catalog expression to filter based on the index columns. This pushes down the filtering to the server side. For more information, see [AWS Glue Partition Indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). Note that `push_down_predicate` and `catalogPartitionPredicate` use different syntaxes. The former uses standard Spark SQL syntax and the latter uses the JSQL parser.

    For more information, see [Managing partitions for ETL output in AWS Glue](aws-glue-programming-etl-partitions.md). 

# GlueContext class
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context"></a>

Wraps the Apache Spark [SparkContext](https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html) object, and thereby provides mechanisms for interacting with the Apache Spark platform.

## \_\_init\_\_
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-__init__"></a>

**`__init__(sparkContext)`**
+ `sparkContext` – The Apache Spark context to use.

## Creating
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-_creating"></a>
+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-glue-context-__init__)
+ [getSource](#aws-glue-api-crawler-pyspark-extensions-glue-context-get-source)
+ [create\_dynamic\_frame\_from\_rdd](#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_rdd)
+ [create\_dynamic\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog)
+ [create\_dynamic\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options)
+ [create\_sample\_dynamic\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-sample-dynamic-frame-from-catalog)
+ [create\_sample\_dynamic\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-sample-dynamic-frame-from-options)
+ [add\_ingestion\_time\_columns](#aws-glue-api-crawler-pyspark-extensions-glue-context-add-ingestion-time-columns)
+ [create\_data\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog)
+ [create\_data\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options)
+ [forEachBatch](#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch)

## getSource
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-get-source"></a>

**`getSource(connection_type, transformation_ctx = "", **options)`**

Creates a `DataSource` object that can be used to read `DynamicFrames` from external sources.
+ `connection_type` – The connection type to use, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, and `dynamodb`.
+ `transformation_ctx` – The transformation context to use (optional).
+ `options` – A collection of optional name-value pairs. For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

The following is an example of using `getSource`.

```
>>> data_source = context.getSource("file", paths=["/in/path"])
>>> data_source.setFormat("json")
>>> myFrame = data_source.getFrame()
```

## create\_dynamic\_frame\_from\_rdd
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_rdd"></a>

**`create_dynamic_frame_from_rdd(data, name, schema=None, sample_ratio=None, transformation_ctx="")`**

Returns a `DynamicFrame` that is created from an Apache Spark Resilient Distributed Dataset (RDD).
+ `data` – The data source to use.
+ `name` – The name of the data to use.
+ `schema` – The schema to use (optional).
+ `sample_ratio` – The sample ratio to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).

## create\_dynamic\_frame\_from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog"></a>

**`create_dynamic_frame_from_catalog(database, table_name, redshift_tmp_dir, transformation_ctx = "", push_down_predicate= "", additional_options = {}, catalog_id = None)`**

Returns a `DynamicFrame` that is created using a Data Catalog database and table name. When using this method, you provide `format_options` through table properties on the specified AWS Glue Data Catalog table and other options through the `additional_options` argument.
+ `database` – The database to read from.
+ `table_name` – The name of the table to read from.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For supported sources and limitations, see [Optimizing reads with pushdown in AWS Glue ETL](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-pushdown.html). For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additional_options` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) except for `endpointUrl`, `streamName`, `bootstrap.servers`, `security.protocol`, `topicName`, `classification`, and `delimiter`. Another supported option is `catalogPartitionPredicate`:

  `catalogPartitionPredicate` — You can pass a catalog expression to filter based on the index columns. This pushes down the filtering to the server side. For more information, see [AWS Glue Partition Indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). Note that `push_down_predicate` and `catalogPartitionPredicate` use different syntaxes. The former uses standard Spark SQL syntax and the latter uses the JSQL parser.
+ `catalog_id` — The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used. 

## create\_dynamic\_frame\_from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options"></a>

**`create_dynamic_frame_from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")`**

Returns a `DynamicFrame` created with the specified connection and format.
+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, and `dynamodb`.
+ `connection_options` – Connection options, such as paths and database table (optional). For a `connection_type` of `s3`, a list of Amazon S3 paths is defined.

  ```
  connection_options = {"paths": ["s3://aws-glue-target/temp"]}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
**Warning**  
Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"} 
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For supported sources and limitations, see [Optimizing reads with pushdown in AWS Glue ETL](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-pushdown.html). For more information, see [Pre-Filtering Using Pushdown Predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns).

## create\_sample\_dynamic\_frame\_from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create-sample-dynamic-frame-from-catalog"></a>

**`create_sample_dynamic_frame_from_catalog(database, table_name, num, redshift_tmp_dir, transformation_ctx = "", push_down_predicate= "", additional_options = {}, sample_options = {}, catalog_id = None)`**

Returns a sample `DynamicFrame` that is created using a Data Catalog database and table name. The `DynamicFrame` contains only the first `num` records from the data source.
+ `database` – The database to read from.
+ `table_name` – The name of the table to read from.
+ `num` – The maximum number of records in the returned sample dynamic frame.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additional_options` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) except for `endpointUrl`, `streamName`, `bootstrap.servers`, `security.protocol`, `topicName`, `classification`, and `delimiter`.
+ `sample_options` – Parameters to control sampling behavior (optional). Current available parameters for Amazon S3 sources:
  + `maxSamplePartitions` – The maximum number of partitions the sampling will read. The default value is 10.
  + `maxSampleFilesPerPartition` – The maximum number of files the sampling will read in one partition. Default value is 10.

    These parameters help to reduce the time consumed by file listing. For example, suppose the dataset has 1,000 partitions, and each partition has 10 files. If you set `maxSamplePartitions` = 10 and `maxSampleFilesPerPartition` = 10, instead of listing all 10,000 files, the sampling only lists and reads the first 10 partitions with the first 10 files in each: 10\*10 = 100 files in total.
+ `catalog_id` – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to `None` by default. `None` defaults to the catalog ID of the calling account in the service.
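The file-listing arithmetic in the `sample_options` description above can be captured in a small helper (hypothetical, not part of the AWS Glue API):

```
# How many files a sample read lists, given the dataset layout and the
# maxSamplePartitions / maxSampleFilesPerPartition caps.
def files_listed(partitions, files_per_partition,
                 max_sample_partitions=10, max_sample_files_per_partition=10):
    return (min(partitions, max_sample_partitions)
            * min(files_per_partition, max_sample_files_per_partition))

full_listing = 1000 * 10          # 10,000 files in the full dataset
sampled = files_listed(1000, 10)  # only 10 * 10 = 100 files listed
```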

## create\_sample\_dynamic\_frame\_from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create-sample-dynamic-frame-from-options"></a>

**`create_sample_dynamic_frame_from_options(connection_type, connection_options={}, num, sample_options={}, format=None, format_options={}, transformation_ctx = "")`**

Returns a sample `DynamicFrame` created with the specified connection and format. The `DynamicFrame` contains only the first `num` records from the data source.
+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, and `dynamodb`.
+ `connection_options` – Connection options, such as paths and database table (optional). For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `num` – The maximum number of records in the returned sample dynamic frame.
+ `sample_options` – Parameters to control sampling behavior (optional). Current available parameters for Amazon S3 sources:
  + `maxSamplePartitions` – The maximum number of partitions that the sampling reads. The default value is 10.
  + `maxSampleFilesPerPartition` – The maximum number of files that the sampling reads in one partition. The default value is 10.

    These parameters help to reduce the time consumed by file listing. For example, suppose the dataset has 1,000 partitions, and each partition has 10 files. If you set `maxSamplePartitions` = 10 and `maxSampleFilesPerPartition` = 10, instead of listing all 10,000 files, the sampling only lists and reads the first 10 partitions with the first 10 files in each: 10 * 10 = 100 files in total.
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).

## add\_ingestion\_time\_columns
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-add-ingestion-time-columns"></a>

**`add_ingestion_time_columns(dataFrame, timeGranularity = "")`**

Appends ingestion time columns such as `ingest_year`, `ingest_month`, `ingest_day`, `ingest_hour`, and `ingest_minute` to the input `DataFrame`. This function is automatically generated in the script that AWS Glue generates when you specify a Data Catalog table with Amazon S3 as the target. The function automatically updates the partition with ingestion time columns on the output table, which allows the output data to be partitioned on ingestion time without requiring explicit ingestion time columns in the input data.
+ `dataFrame` – The `dataFrame` to append the ingestion time columns to.
+ `timeGranularity` – The granularity of the time columns. Valid values are "`day`", "`hour`" and "`minute`". For example, if "`hour`" is passed in to the function, the original `dataFrame` will have "`ingest_year`", "`ingest_month`", "`ingest_day`", and "`ingest_hour`" time columns appended.

Returns the data frame after appending the time granularity columns.

Example:

```
# fromDF also requires the GlueContext and a name for the resulting frame
dynamic_frame = DynamicFrame.fromDF(
    glueContext.add_ingestion_time_columns(dataFrame, "hour"),
    glueContext,
    "with_ingestion_time"
)
```

## create\_data\_frame\_from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog"></a>

**`create_data_frame_from_catalog(database, table_name, transformation_ctx = "", additional_options = {})`**

Returns a `DataFrame` that is created using information from a Data Catalog table.
+ `database` – The Data Catalog database to read from.
+ `table_name` – The name of the Data Catalog table to read from.
+ `transformation_ctx` – The transformation context to use (optional).
+ `additional_options` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) for streaming sources, such as `startingPosition`, `maxFetchTimeInMs`, and `startingOffsets`.
  + `useSparkDataSource` – When set to true, forces AWS Glue to use the native Spark Data Source API to read the table. The Spark Data Source API supports the following formats: AVRO, binary, CSV, JSON, ORC, Parquet, and text. In a Data Catalog table, you specify the format using the `classification` property. To learn more about the Spark Data Source API, see the official [Apache Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html).

    Using `create_data_frame_from_catalog` with `useSparkDataSource` has the following benefits:
    + Directly returns a `DataFrame` and provides an alternative to `create_dynamic_frame.from_catalog().toDF()`.
    + Supports AWS Lake Formation table-level permission control for native formats.
    + Supports reading data lake formats without AWS Lake Formation table-level permission control. For more information, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

    When you enable `useSparkDataSource`, you can also add any of the [Spark Data Source options](https://spark.apache.org/docs/latest/sql-data-sources.html) in `additional_options` as needed. AWS Glue passes these options directly to the Spark reader.
  + `useCatalogSchema` – When set to true, AWS Glue applies the Data Catalog schema to the resulting `DataFrame`. Otherwise, the reader infers the schema from the data. When you enable `useCatalogSchema`, you must also set `useSparkDataSource` to true.

**Limitations**

Consider the following limitations when you use the `useSparkDataSource` option:
+ When you use `useSparkDataSource`, AWS Glue creates a new `DataFrame` in a separate Spark session that is different from the original Spark session.
+ Spark DataFrame partition filtering doesn't work with the following AWS Glue features. 
  + [Job bookmarks](monitor-continuations.md)
  + [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md#aws-glue-programming-etl-storage-classes-dynamic-frame)
  + [Catalog partition predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-cat-predicates)

  To use partition filtering with these features, you can use the AWS Glue pushdown predicate. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns). Filtering on non-partitioned columns is not affected.

  The following example script demonstrates the incorrect way to perform partition filtering with the `excludeStorageClasses` option.

  ```
  # Incorrect partition filtering using Spark filter with excludeStorageClasses
  read_df = glueContext.create_data_frame.from_catalog(
      database=database_name,
      table_name=table_name,
      additional_options={
          "useSparkDataSource": True,
          "excludeStorageClasses": ["GLACIER", "DEEP_ARCHIVE"]
      }
  )

  # Suppose year and month are partition keys.
  # Filtering on year and month won't work; filtered_df will still
  # contain data with other year/month values.
  filtered_df = read_df.filter("year == '2017' and month == '04' and state == 'CA'")
  ```

  The following example script demonstrates the correct way to use a pushdown predicate in order to perform partition filtering with the `excludeStorageClasses` option.

  ```
  # Correct partition filtering using the AWS Glue pushdown predicate
  # with excludeStorageClasses
  read_df = glueContext.create_data_frame.from_catalog(
      database=database_name,
      table_name=table_name,
      # Use the AWS Glue pushdown predicate to perform partition filtering
      push_down_predicate="(year=='2017' and month=='04')",
      additional_options={
          "useSparkDataSource": True,
          "excludeStorageClasses": ["GLACIER", "DEEP_ARCHIVE"]
      }
  )

  # Use the Spark filter only on non-partitioned columns
  filtered_df = read_df.filter("state == 'CA'")
  ```

**Example: Creating a CSV table using the Spark data source reader**

```
# Read a CSV table with '\t' as the separator
read_df = glueContext.create_data_frame.from_catalog(
    database=<database_name>,
    table_name=<table_name>,
    additional_options = {"useSparkDataSource": True,  "sep": '\t'}
)
```

## create\_data\_frame\_from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options"></a>

**`create_data_frame_from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")`**

This API is deprecated. Instead, use the `getSource()` API. Returns a `DataFrame` created with the specified connection and format. Use this function only with AWS Glue streaming sources.
+ `connection_type` – The streaming connection type. Valid values include `kinesis` and `kafka`.
+ `connection_options` – Connection options, which are different for Kinesis and Kafka. You can find the list of all connection options for each streaming data source at [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md). Note the following differences in streaming connection options:
  + Kinesis streaming sources require `streamARN`, `startingPosition`, `inferSchema`, and `classification`.
  + Kafka streaming sources require `connectionName`, `topicName`, `startingOffsets`, `inferSchema`, and `classification`.
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. For information about the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `format_options` – Format options for the specified format. For information about the supported format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – The transformation context to use (optional).

Example for Amazon Kinesis streaming source:

```
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "true",
    "classification": "json"
}
data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kinesis", connection_options=kinesis_options)
```

Example for Kafka streaming source:

```
kafka_options = {
    "connectionName": "ConfluentKafka",
    "topicName": "kafka-auth-topic",
    "startingOffsets": "earliest",
    "inferSchema": "true",
    "classification": "json"
}
data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kafka", connection_options=kafka_options)
```

## forEachBatch
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch"></a>

**`forEachBatch(frame, batch_function, options)`**

Applies the `batch_function` passed in to every micro batch that is read from the streaming source.
+ `frame` – The `DataFrame` containing the current micro batch.
+ `batch_function` – A function that is applied to every micro batch.
+ `options` – A collection of key-value pairs that holds information about how to process micro batches. The following options are required:
  + `windowSize` – The amount of time to spend processing each batch.
  + `checkpointLocation` – The location where checkpoints are stored for the streaming ETL job.
  + `batchMaxRetries` – The maximum number of times to retry the batch if it fails. The default value is 3. This option is only configurable for AWS Glue version 2.0 and later.

**Example:**

```
# Define the batch function before passing it to forEachBatch
def processBatch(data_frame, batchId):
    if data_frame.count() > 0:
        datasource0 = DynamicFrame.fromDF(
            glueContext.add_ingestion_time_columns(data_frame, "hour"),
            glueContext, "from_data_frame"
        )
        additionalOptions_datasink1 = {"enableUpdateCatalog": True}
        additionalOptions_datasink1["partitionKeys"] = ["ingest_yr", "ingest_mo", "ingest_day"]
        datasink1 = glueContext.write_dynamic_frame.from_catalog(
            frame=datasource0,
            database="tempdb",
            table_name="kafka-auth-table-output",
            transformation_ctx="datasink1",
            additional_options=additionalOptions_datasink1
        )

glueContext.forEachBatch(
    frame=data_frame_datasource0,
    batch_function=processBatch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://kafka-auth-dataplane/confluent-test/output/checkpoint/"
    }
)
```

## Working with datasets in Amazon S3
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-_storage_layer"></a>
+ [purge\_table](#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_table)
+ [purge\_s3\_path](#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_s3_path)
+ [transition\_table](#aws-glue-api-crawler-pyspark-extensions-glue-context-transition_table)
+ [transition\_s3\_path](#aws-glue-api-crawler-pyspark-extensions-glue-context-transition_s3_path)

## purge\_table
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-purge_table"></a>

**`purge_table(catalog_id=None, database="", table_name="", options={}, transformation_ctx="")`**

Deletes files from Amazon S3 for the specified catalog's database and table. If all files in a partition are deleted, that partition is also deleted from the catalog. The `purge_table` action is not supported on tables registered with Lake Formation.

If you want to be able to recover deleted objects, you can turn on [object versioning](https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html) on the Amazon S3 bucket. When an object is deleted from a bucket that doesn't have object versioning enabled, the object can't be recovered. For more information about how to recover deleted objects in a version-enabled bucket, see [How can I retrieve an Amazon S3 object that was deleted?](https://aws.amazon.com/premiumsupport/knowledge-center/s3-undelete-configuration/) in the AWS Support Knowledge Center.
+ `catalog_id` – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to `None` by default. `None` defaults to the catalog ID of the calling account in the service.
+ `database` – The database to use.
+ `table_name` – The name of the table to use.
+ `options` – Options to filter files to be deleted and for manifest file generation.
  + `retentionPeriod` – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  + `partitionPredicate` – Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. Set to `""` – empty by default.
  + `excludeStorageClasses` – Files with storage class in the `excludeStorageClasses` set are not deleted. The default is `Set()` – an empty set.
  + `manifestFilePath` – An optional path for manifest file generation. All files that were successfully purged are recorded in `Success.csv`, and those that failed in `Failed.csv`.
+ `transformation_ctx` – The transformation context to use (optional). Used in the manifest file path.
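The `retentionPeriod` rule above can be sketched in plain Python; the helper is illustrative, and the actual check runs inside AWS Glue:

```python
# Sketch of the retentionPeriod check: a file is retained when it is
# newer than the cutoff (now minus retentionPeriod hours). The default
# of 168 hours matches the 7-day default described above.
from datetime import datetime, timedelta

def is_retained(last_modified, retention_period_hours=168, now=None):
    now = now or datetime.utcnow()
    cutoff = now - timedelta(hours=retention_period_hours)
    return last_modified > cutoff
```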

**Example**  

```
glueContext.purge_table("database", "table", {"partitionPredicate": "(month=='march')", "retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/"})
```

## purge\_s3\_path
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-purge_s3_path"></a>

**`purge_s3_path(s3_path, options={}, transformation_ctx="")`**

Deletes files from the specified Amazon S3 path recursively.

If you want to be able to recover deleted objects, you can turn on [object versioning](https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html) on the Amazon S3 bucket. When an object is deleted from a bucket that doesn't have object versioning turned on, the object can't be recovered. For more information about how to recover deleted objects in a version-enabled bucket, see [How can I retrieve an Amazon S3 object that was deleted?](https://aws.amazon.com/premiumsupport/knowledge-center/s3-undelete-configuration/) in the AWS Support Knowledge Center.
+ `s3_path` – The path in Amazon S3 of the files to be deleted in the format `s3://<bucket>/<prefix>/`
+ `options` – Options to filter files to be deleted and for manifest file generation.
  + `retentionPeriod` – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  + `excludeStorageClasses` – Files with storage class in the `excludeStorageClasses` set are not deleted. The default is `Set()` – an empty set.
  + `manifestFilePath` – An optional path for manifest file generation. All files that were successfully purged are recorded in `Success.csv`, and those that failed in `Failed.csv`.
+ `transformation_ctx` – The transformation context to use (optional). Used in the manifest file path.

**Example**  

```
glueContext.purge_s3_path("s3://bucket/path/", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/"})
```

## transition\_table
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-transition_table"></a>

**`transition_table(database, table_name, transition_to, options={}, transformation_ctx="", catalog_id=None)`**

Transitions the storage class of the files stored on Amazon S3 for the specified catalog's database and table.

You can transition between any two storage classes, including to the `GLACIER` and `DEEP_ARCHIVE` storage classes. However, to transition from `GLACIER` or `DEEP_ARCHIVE`, you must use an Amazon S3 restore operation instead.

If you're running AWS Glue ETL jobs that read files or partitions from Amazon S3, you can exclude some Amazon S3 storage class types. For more information, see [Excluding Amazon S3 Storage Classes](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-storage-classes.html).
+ `database` – The database to use.
+ `table_name` – The name of the table to use.
+ `transition_to` – The [Amazon S3 storage class](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/StorageClass.html) to transition to.
+ `options` – Options to filter files to be transitioned and for manifest file generation.
  + `retentionPeriod` – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  + `partitionPredicate` – Partitions satisfying this predicate are transitioned. Files within the retention period in these partitions are not transitioned. Set to `""` – empty by default.
  + `excludeStorageClasses` – Files with storage class in the `excludeStorageClasses` set are not transitioned. The default is `Set()` – an empty set.
  + `manifestFilePath` – An optional path for manifest file generation. All files that were successfully transitioned are recorded in `Success.csv`, and those that failed in `Failed.csv`.
  + `accountId` – The Amazon Web Services account ID to run the transition transform. Mandatory for this transform.
  + `roleArn` – The AWS role to run the transition transform. Mandatory for this transform.
+ `transformation_ctx` – The transformation context to use (optional). Used in the manifest file path.
+ `catalog_id` – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to `None` by default. `None` defaults to the catalog ID of the calling account in the service.

**Example**  

```
glueContext.transition_table("database", "table", "STANDARD_IA", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/", "accountId": "123456789012", "roleArn": "arn:aws:iam::123456789012:role/example-role"})
```

## transition\_s3\_path
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-transition_s3_path"></a>

**`transition_s3_path(s3_path, transition_to, options={}, transformation_ctx="")`**

Transitions the storage class of the files in the specified Amazon S3 path recursively.

You can transition between any two storage classes, including to the `GLACIER` and `DEEP_ARCHIVE` storage classes. However, to transition from `GLACIER` or `DEEP_ARCHIVE`, you must use an Amazon S3 restore operation instead.

If you're running AWS Glue ETL jobs that read files or partitions from Amazon S3, you can exclude some Amazon S3 storage class types. For more information, see [Excluding Amazon S3 Storage Classes](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-storage-classes.html).
+ `s3_path` – The path in Amazon S3 of the files to be transitioned in the format `s3://<bucket>/<prefix>/`
+ `transition_to` – The [Amazon S3 storage class](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/StorageClass.html) to transition to.
+ `options` – Options to filter files to be transitioned and for manifest file generation.
  + `retentionPeriod` – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  + `partitionPredicate` – Partitions satisfying this predicate are transitioned. Files within the retention period in these partitions are not transitioned. Set to `""` – empty by default.
  + `excludeStorageClasses` – Files with storage class in the `excludeStorageClasses` set are not transitioned. The default is `Set()` – an empty set.
  + `manifestFilePath` – An optional path for manifest file generation. All files that were successfully transitioned are recorded in `Success.csv`, and those that failed in `Failed.csv`.
  + `accountId` – The Amazon Web Services account ID to run the transition transform. Mandatory for this transform.
  + `roleArn` – The AWS role to run the transition transform. Mandatory for this transform.
+ `transformation_ctx` – The transformation context to use (optional). Used in the manifest file path.

**Example**  

```
glueContext.transition_s3_path("s3://bucket/prefix/", "STANDARD_IA", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/", "accountId": "123456789012", "roleArn": "arn:aws:iam::123456789012:role/example-role"})
```

## Extracting
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-_extracting"></a>
+ [extract\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf)

## extract\_jdbc\_conf
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf"></a>

**`extract_jdbc_conf(connection_name, catalog_id = None)`**

Returns a `dict` whose keys are the configuration properties from the AWS Glue connection object in the Data Catalog:
+ `user` – The database user name.
+ `password` – The database password.
+ `vendor` – Specifies a vendor (`mysql`, `postgresql`, `oracle`, `sqlserver`, etc.).
+ `enforceSSL` – A boolean string indicating if a secure connection is required.
+ `customJDBCCert` – Use a specific client certificate from the Amazon S3 path indicated.
+ `skipCustomJDBCCertValidation` – A boolean string indicating if the `customJDBCCert` must be validated by a CA.
+ `customJDBCCertString` – Additional information about the custom certificate, specific for the driver type.
+ `url` – (Deprecated) JDBC URL with only protocol, server and port.
+ `fullUrl` – JDBC URL as entered when the connection was created (Available in AWS Glue version 3.0 or later).

Example retrieving JDBC configurations:

```
jdbc_conf = glueContext.extract_jdbc_conf(connection_name="your_glue_connection_name")
print(jdbc_conf)
>>> {'enforceSSL': 'false', 'skipCustomJDBCCertValidation': 'false', 'url': 'jdbc:mysql://myserver:3306', 'fullUrl': 'jdbc:mysql://myserver:3306/mydb', 'customJDBCCertString': '', 'user': 'admin', 'customJDBCCert': '', 'password': '1234', 'vendor': 'mysql'}
```

## Transactions
<a name="aws-glue-api-pyspark-extensions-glue-context-transactions"></a>
+ [start\_transaction](#aws-glue-api-pyspark-extensions-glue-context-start-transaction)
+ [commit\_transaction](#aws-glue-api-pyspark-extensions-glue-context-commit-transaction)
+ [cancel\_transaction](#aws-glue-api-pyspark-extensions-glue-cancel-transaction)

## start\_transaction
<a name="aws-glue-api-pyspark-extensions-glue-context-start-transaction"></a>

**`start_transaction(read_only)`**

Starts a new transaction. Internally calls the Lake Formation [startTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-StartTransaction) API.
+ `read_only` – (Boolean) Indicates whether this transaction should be read only or read and write. Writes made using a read-only transaction ID will be rejected. Read-only transactions do not need to be committed.

Returns the transaction ID.

## commit\_transaction
<a name="aws-glue-api-pyspark-extensions-glue-context-commit-transaction"></a>

**`commit_transaction(transaction_id, wait_for_commit = True)`**

Attempts to commit the specified transaction. `commit_transaction` may return before the transaction has finished committing. Internally calls the Lake Formation [commitTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CommitTransaction) API.
+ `transaction_id` – (String) The transaction to commit.
+ `wait_for_commit` – (Boolean) Determines whether `commit_transaction` returns immediately. The default value is true. If false, `commit_transaction` polls and waits until the transaction is committed. The wait time is restricted to 1 minute using exponential backoff with a maximum of 6 retry attempts.

Returns a Boolean indicating whether the commit is done.
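The bounded polling described above can be sketched as follows. The base delay of one second is an assumption for illustration; only the one-minute cap and the six-retry limit come from the behavior described here:

```python
# Sketch of an exponential-backoff polling schedule capped at 60 seconds
# total over at most 6 retries. The actual schedule used by
# commit_transaction is internal to AWS Glue.
def backoff_delays(base_seconds=1.0, max_retries=6, total_cap_seconds=60.0):
    delays, total = [], 0.0
    for attempt in range(max_retries):
        # Double the delay each attempt, truncated by the remaining budget
        delay = min(base_seconds * (2 ** attempt), total_cap_seconds - total)
        if delay <= 0:
            break
        delays.append(delay)
        total += delay
    return delays
```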

## cancel\_transaction
<a name="aws-glue-api-pyspark-extensions-glue-cancel-transaction"></a>

**`cancel_transaction(transaction_id)`**

Attempts to cancel the specified transaction. Raises a `TransactionCommittedException` if the transaction was previously committed. Internally calls the Lake Formation [CancelTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CancelTransaction) API.
+ `transaction_id` – (String) The transaction to cancel.
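The three calls above combine into a common pattern: start a transaction, do the writes, then commit on success or cancel on failure. This sketch takes the calls as parameters so the flow can be shown without a live Glue session; in a job you would pass `glueContext.start_transaction`, `glueContext.commit_transaction`, and `glueContext.cancel_transaction`:

```python
# Sketch of the start/commit/cancel control flow. The helper name and
# its callable parameters are illustrative, not part of the Glue API.
def run_in_transaction(start, commit, cancel, work):
    tx_id = start(read_only=False)  # writes require a read-write transaction
    try:
        work(tx_id)  # perform the writes using this transaction ID
    except Exception:
        cancel(tx_id)  # roll back on failure
        raise
    return commit(tx_id)  # may return before the commit finishes
```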

## Writing
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-_writing"></a>
+ [getSink](#aws-glue-api-crawler-pyspark-extensions-glue-context-get-sink)
+ [write\_dynamic\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options)
+ [write\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_from_options)
+ [write\_dynamic\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_catalog)
+ [write\_data\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_data_frame_from_catalog)
+ [write\_dynamic\_frame\_from\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_jdbc_conf)
+ [write\_from\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_from_jdbc_conf)

## getSink
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-get-sink"></a>

**`getSink(connection_type, format = None, transformation_ctx = "", **options)`**

Gets a `DataSink` object that can be used to write `DynamicFrames` to external sources. Check the SparkSQL `format` first to make sure that you get the expected sink.
+ `connection_type` – The connection type to use, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, `kinesis`, and `kafka`.
+ `format` – The SparkSQL format to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `options` – A collection of name-value pairs used to specify the connection options. Some of the possible values are:
  + `user` and `password`: For authorization
  + `url`: The endpoint for the data store
  + `dbtable`: The name of the target table
  + `bulkSize`: Degree of parallelism for insert operations

The options that you can specify depend on the connection type. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) for additional values and examples.

Example:

```
>>> data_sink = context.getSink("s3")
>>> data_sink.setFormat("json")
>>> data_sink.writeFrame(myFrame)
```

## write\_dynamic\_frame\_from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options"></a>

**`write_dynamic_frame_from_options(frame, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")`**

Writes and returns a `DynamicFrame` using the specified connection and format.
+ `frame` – The `DynamicFrame` to write.
+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, `kinesis`, and `kafka`.
+ `connection_options` – Connection options, such as path and database table (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
**Warning**  
Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable, "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – A transformation context to use (optional).
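As a minimal sketch of the options described above, here are illustrative `connection_options` and `format_options` for an Amazon S3 write in Parquet format. The bucket path comes from the example above; the `compression` value is an assumption for illustration:

```python
# Illustrative option dicts for an Amazon S3 Parquet write. In a Glue
# job script these would be passed to write_dynamic_frame_from_options:
#
#   glueContext.write_dynamic_frame.from_options(
#       frame=dyf,
#       connection_type="s3",
#       connection_options=connection_options,
#       format="parquet",
#       format_options=format_options,
#   )
connection_options = {"path": "s3://aws-glue-target/temp"}
format_options = {"compression": "snappy"}
```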

## write\_from\_options
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-write_from_options"></a>

**`write_from_options(frame_or_dfc, connection_type, connection_options={}, format={}, format_options={}, transformation_ctx = "")`**

Writes and returns a `DynamicFrame` or `DynamicFrameCollection` that is created with the specified connection and format information.
+ `frame_or_dfc` – The `DynamicFrame` or `DynamicFrameCollection` to write.
+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, or JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, and `oracle`.
+ `connection_options` – Connection options, such as path and database table (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
**Warning**  
Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable, "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – A transformation context to use (optional).

## write\_dynamic\_frame\_from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_catalog"></a>

**`write_dynamic_frame_from_catalog(frame, database, table_name, redshift_tmp_dir, transformation_ctx = "", additional_options = {}, catalog_id = None)`**

Writes and returns a `DynamicFrame` using information from a Data Catalog database and table.
+ `frame` – The `DynamicFrame` to write.
+ `database` – The Data Catalog database that contains the table.
+ `table_name` – The name of the Data Catalog table associated with the target.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `additional_options` – A collection of optional name-value pairs.
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
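As a sketch, a common use of `additional_options` for a catalog-based write is `partitionKeys`, which partitions the output by the listed columns. The database and table names below are placeholders.

```python
# Illustrative only: additional_options for write_dynamic_frame_from_catalog.
# "partitionKeys" partitions the output by the listed columns.
additional_options = {"partitionKeys": ["year", "month"]}

# Inside a Glue job (requires the Glue runtime):
# glueContext.write_dynamic_frame_from_catalog(
#     frame=frame,
#     database="example_db",
#     table_name="example_table",
#     additional_options=additional_options,
#     transformation_ctx="write_catalog",
# )
```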

## write\_data\_frame\_from\_catalog
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-write_data_frame_from_catalog"></a>

**`write_data_frame_from_catalog(frame, database, table_name, redshift_tmp_dir, transformation_ctx = "", additional_options = {}, catalog_id = None)`**

Writes and returns a `DataFrame` using information from a Data Catalog database and table. This method supports writing to data lake formats (Hudi, Iceberg, and Delta Lake). For more information, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).
+ `frame` – The `DataFrame` to write.
+ `database` – The Data Catalog database that contains the table.
+ `table_name` – The name of the Data Catalog table that is associated with the target.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `additional_options` – A collection of optional name-value pairs.
  + `useSparkDataSink` – When set to true, forces AWS Glue to use the native Spark Data Sink API to write to the table. When you enable this option, you can add any [Spark Data Source options](https://spark.apache.org/docs/latest/sql-data-sources.html) to `additional_options` as needed. AWS Glue passes these options directly to the Spark writer.
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When you don't specify a value, the default account ID of the caller is used. 

**Limitations**

Consider the following limitations when you use the `useSparkDataSink` option:
+ The [`enableUpdateCatalog`](update-from-job.md) option isn't supported when you use the `useSparkDataSink` option.

**Example: Writing to a Hudi table using the Spark Data Source writer**

```
hudi_options = {
    'useSparkDataSink': True,
    'hoodie.table.name': <table_name>,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'product_id',
    'hoodie.datasource.write.table.name': <table_name>,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': <database_name>,
    'hoodie.datasource.hive_sync.table': <table_name>,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'}

glueContext.write_data_frame.from_catalog(
    frame = <df_product_inserts>,
    database = <database_name>,
    table_name = <table_name>,
    additional_options = hudi_options
)
```

## write\_dynamic\_frame\_from\_jdbc\_conf
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_jdbc_conf"></a>

**`write_dynamic_frame_from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)`**

Writes and returns a `DynamicFrame` using the specified JDBC connection information.
+ `frame` – The `DynamicFrame` to write.
+ `catalog_connection` – A catalog connection to use.
+ `connection_options` – Connection options, such as path and database table (optional). For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – A transformation context to use (optional).
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
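The following sketch shows typical `connection_options` for a JDBC write; `dbtable` uses the `schema.table-name` form described in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md). The connection name, schema, and table are placeholders.

```python
# Illustrative only: connection_options for write_dynamic_frame_from_jdbc_conf.
# "dbtable" can use the schema.table-name form; all names here are placeholders.
connection_options = {
    "dbtable": "public.sales",
    "database": "analytics",
}

# Inside a Glue job (requires the Glue runtime and a catalog connection):
# glueContext.write_dynamic_frame_from_jdbc_conf(
#     frame=frame,
#     catalog_connection="my-jdbc-connection",
#     connection_options=connection_options,
#     redshift_tmp_dir="s3://aws-glue-target/temp",
# )
```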

## write\_from\_jdbc\_conf
<a name="aws-glue-api-crawler-pyspark-extensions-glue-context-write_from_jdbc_conf"></a>

**`write_from_jdbc_conf(frame_or_dfc, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)`**

Writes and returns a `DynamicFrame` or `DynamicFrameCollection` using the specified JDBC connection information.
+ `frame_or_dfc` – The `DynamicFrame` or `DynamicFrameCollection` to write.
+ `catalog_connection` – A catalog connection to use.
+ `connection_options` – Connection options, such as path and database table (optional). For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – A transformation context to use (optional).
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.

# AWS Glue PySpark transforms reference
<a name="aws-glue-programming-python-transforms"></a>

AWS Glue provides the following built-in transforms that you can use in PySpark ETL operations. Your data passes from transform to transform in a data structure called a *DynamicFrame*, which is an extension to an Apache Spark SQL `DataFrame`. The `DynamicFrame` contains your data, and you reference its schema to process your data. 

Most of these transforms also exist as methods of the `DynamicFrame` class. For more information, see [DynamicFrame transforms ](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-_transforms).
+ [GlueTransform base class](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md)
+ [ApplyMapping class](aws-glue-api-crawler-pyspark-transforms-ApplyMapping.md)
+ [DropFields class](aws-glue-api-crawler-pyspark-transforms-DropFields.md)
+ [DropNullFields class](aws-glue-api-crawler-pyspark-transforms-DropNullFields.md)
+ [ErrorsAsDynamicFrame class](aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame.md)
+ [EvaluateDataQuality class](aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality.md)
+ [FillMissingValues class](aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.md)
+ [Filter class](aws-glue-api-crawler-pyspark-transforms-filter.md)
+ [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ [FindMatches class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ [FlatMap class](aws-glue-api-crawler-pyspark-transforms-flat-map.md)
+ [Join class](aws-glue-api-crawler-pyspark-transforms-join.md)
+ [Map class](aws-glue-api-crawler-pyspark-transforms-map.md)
+ [MapToCollection class](aws-glue-api-crawler-pyspark-transforms-MapToCollection.md)
+ [mergeDynamicFrame](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge)
+ [Relationalize class](aws-glue-api-crawler-pyspark-transforms-Relationalize.md)
+ [RenameField class](aws-glue-api-crawler-pyspark-transforms-RenameField.md)
+ [ResolveChoice class](aws-glue-api-crawler-pyspark-transforms-ResolveChoice.md)
+ [SelectFields class](aws-glue-api-crawler-pyspark-transforms-SelectFields.md)
+ [SelectFromCollection class](aws-glue-api-crawler-pyspark-transforms-SelectFromCollection.md)
+ [Simplify\_ddb\_json class](aws-glue-api-crawler-pyspark-transforms-simplify-ddb-json.md)
+ [Spigot class](aws-glue-api-crawler-pyspark-transforms-spigot.md)
+ [SplitFields class](aws-glue-api-crawler-pyspark-transforms-SplitFields.md)
+ [SplitRows class](aws-glue-api-crawler-pyspark-transforms-SplitRows.md)
+ [Unbox class](aws-glue-api-crawler-pyspark-transforms-Unbox.md)
+ [UnnestFrame class](aws-glue-api-crawler-pyspark-transforms-UnnestFrame.md)

# GlueTransform base class
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform"></a>

The base class that all the `awsglue.transforms` classes inherit from.

The classes all define a `__call__` method. They either override the `GlueTransform` class methods listed in the following sections, or they are called using the class name by default.

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-_methods"></a>
+ [apply(cls, \*args, \*\*kwargs)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply)
+ [name(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name)
+ [describeArgs(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs)
+ [describeReturn(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn)
+ [describeTransform(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform)
+ [describeErrors(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors)
+ [describe(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe)

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply"></a>

Applies the transform by calling the transform class, and returns the result.
+ `cls` – The `self` class object.

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-name"></a>

Returns the name of the derived transform class.
+ `cls` – The `self` class object.

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs"></a>
+ `cls` – The `self` class object.

Returns a list of dictionaries, each corresponding to a named argument, in the following format:

```
[
  {
    "name": "(name of argument)",
    "type": "(type of argument)",
    "description": "(description of argument)",
    "optional": "(Boolean, True if the argument is optional)",
    "defaultValue": "(String; the default value, or None)"
  },
...
]
```

Raises a `NotImplementedError` exception when called in a derived transform where it is not implemented.

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn"></a>
+ `cls` – The `self` class object.

Returns a dictionary with information about the return type, in the following format:

```
{
  "type": "(return type)",
  "description": "(description of output)"
}
```

Raises a `NotImplementedError` exception when called in a derived transform where it is not implemented.

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform"></a>

Returns a string describing the transform.
+ `cls` – The `self` class object.

Raises a `NotImplementedError` exception when called in a derived transform where it is not implemented.

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors"></a>
+ `cls` – The `self` class object.

Returns a list of dictionaries, each describing a possible exception thrown by this transform, in the following format:

```
[
  {
    "type": "(type of error)",
    "description": "(description of error)"
  },
...
]
```

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe"></a>
+ `cls` – The `self` class object.

Returns an object with the following format:

```
{
  "transform" : {
    "name" : cls.name( ),
    "args" : cls.describeArgs( ),
    "returns" : cls.describeReturn( ),
    "raises" : cls.describeErrors( ),
    "location" : "internal"
  }
}
```

# ApplyMapping class
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping"></a>

Applies a mapping in a `DynamicFrame`.

## Example
<a name="pyspark-ApplyMapping-examples"></a>

We recommend that you use the [`DynamicFrame.apply_mapping()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-apply_mapping) method to apply a mapping in a `DynamicFrame`. To view a code example, see [Example: Use apply\_mapping to rename fields and change field types](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-apply_mapping-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describe)

## \_\_call\_\_(frame, mappings, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-__call__"></a>

Applies a declarative mapping to a specified `DynamicFrame`.
+ `frame` – The `DynamicFrame` to apply the mapping to (required).
+ `mappings` – A list of mapping tuples (required). Each consists of: (source column, source type, target column, target type).

  If the source column has a dot "`.`" in the name, you must enclose the name in backticks (`` ` ``). For example, to map `this.old.name` (string) to `thisNewName`, you would use the following tuple:

  ```
  ("`this.old.name`", "string", "thisNewName", "string")
  ```
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string that is associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns only the fields of the `DynamicFrame` that are specified in the "mapping" tuples.
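To illustrate the mapping-tuple semantics described above, the following pure-Python sketch renames fields, casts target types, and drops unmapped fields. This is an illustration of the behavior, not the Glue implementation; the helper name and the cast table are hypothetical.

```python
# Pure-Python sketch of mapping-tuple semantics: each
# (source, source_type, target, target_type) tuple renames a field and casts
# it, and fields without a mapping are dropped. Illustration only.
def sketch_apply_mapping(record, mappings):
    casts = {"string": str, "int": int, "long": int, "double": float}
    out = {}
    for source, _src_type, target, tgt_type in mappings:
        if source in record:
            out[target] = casts.get(tgt_type, lambda v: v)(record[source])
    return out

record = {"this.old.name": "Ada", "age": "36", "ignored": True}
mappings = [
    ("this.old.name", "string", "thisNewName", "string"),
    ("age", "string", "age", "int"),
]
print(sketch_apply_mapping(record, mappings))
# {'thisNewName': 'Ada', 'age': 36}
```

Note that the unmapped `ignored` field does not appear in the result, matching the behavior of `ApplyMapping`.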

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# DropFields class
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields"></a>

Drops fields within a `DynamicFrame`.

## Example
<a name="pyspark-DropFields-examples"></a>

We recommend that you use the [`DynamicFrame.drop_fields()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-drop_fields) method to drop fields from a `DynamicFrame`. To view a code example, see [Example: Use drop\_fields to remove fields from a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-drop_fields-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-DropFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-DropFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-DropFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-DropFields-describe)

## \_\_call\_\_(frame, paths, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-__call__"></a>

Drops nodes within a `DynamicFrame`.
+ `frame` – The `DynamicFrame` to drop the nodes in (required).
+ `paths` – A list of full paths to the nodes to drop (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` without the specified fields.
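As a minimal sketch of the drop semantics (not the Glue implementation), the following pure-Python helper removes the listed top-level paths from a record and keeps everything else; the helper name and sample data are hypothetical.

```python
# Pure-Python sketch of DropFields semantics: the listed top-level paths are
# removed and all other fields are kept. Illustration only.
def sketch_drop_fields(record, paths):
    return {k: v for k, v in record.items() if k not in set(paths)}

record = {"id": 1, "name": "Ada", "debug_info": {"raw": "..."}}
print(sketch_drop_fields(record, ["debug_info"]))
# {'id': 1, 'name': 'Ada'}
```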

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropFields-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# DropNullFields class
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields"></a>

Drops all null fields in a `DynamicFrame` whose type is `NullType`. These are fields with missing or null values in every record in the `DynamicFrame` dataset.

## Example
<a name="pyspark-DropNullFields-examples"></a>

This example uses `DropNullFields` to create a new `DynamicFrame` where fields of type `NullType` have been dropped. In order to demonstrate `DropNullFields`, we add a new column named `empty_column` with type null to the already-loaded `persons` dataset.

**Note**  
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use DropNullFields to create a new DynamicFrame without NullType fields

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import lit
from pyspark.sql.types import NullType
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import DropNullFields

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create DynamicFrame
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()

# Add new column "empty_column" with NullType
persons_with_nulls = persons.toDF().withColumn("empty_column", lit(None).cast(NullType()))
persons_with_nulls_dyf = DynamicFrame.fromDF(persons_with_nulls, glueContext, "persons_with_nulls")
print("Schema for the persons_with_nulls_dyf DynamicFrame:")
persons_with_nulls_dyf.printSchema()

# Remove the NullType field
persons_no_nulls = DropNullFields.apply(persons_with_nulls_dyf)
print("Schema for the persons_no_nulls DynamicFrame:")
persons_no_nulls.printSchema()
```

### Output
<a name="drop_null_fields-example-output"></a>

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the persons_with_nulls_dyf DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
|-- empty_column: null

null_fields ['empty_column']
Schema for the persons_no_nulls DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describe)

## \_\_call\_\_(frame, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-__call__"></a>

Drops all null fields in a `DynamicFrame` whose type is `NullType`. These are fields with missing or null values in every record in the `DynamicFrame` dataset.
+ `frame` – The `DynamicFrame` to drop null fields in (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` with no null fields.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-apply"></a>
+ `cls` – The `self` class object.

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-name"></a>
+ `cls` – The `self` class object.

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeArgs"></a>
+ `cls` – The `self` class object.

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeReturn"></a>
+ `cls` – The `self` class object.

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeTransform"></a>
+ `cls` – The `self` class object.

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeErrors"></a>
+ `cls` – The `self` class object.

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-DropNullFields-describe"></a>
+ `cls` – The `self` class object.

# ErrorsAsDynamicFrame class
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame"></a>

Returns a `DynamicFrame` that contains nested records for errors that occurred while the source `DynamicFrame` was created.

## Example
<a name="pyspark-ErrorsAsDynamicFrame-examples"></a>

We recommend that you use the [`DynamicFrame.errorsAsDynamicFrame()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsAsDynamicFrame) method to retrieve and view error records. To view a code example, see [Example: Use errorsAsDynamicFrame to view error records](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-errorsAsDynamicFrame-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describe)

## \_\_call\_\_(frame)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-__call__"></a>

Returns a `DynamicFrame` that contains nested error records that relate to the source `DynamicFrame`.
+ `frame` – The source `DynamicFrame` (required).

## apply(cls, \$1args, \$1\$1kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-apply"></a>
+ `cls` – cls

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-name"></a>
Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeArgs"></a>
Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeReturn"></a>
Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeTransform"></a>
Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeErrors"></a>
Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describe"></a>
Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# EvaluateDataQuality class
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality"></a>

Evaluates a data quality ruleset against a `DynamicFrame` and returns a new `DynamicFrame` with results of the evaluation.

## Example
<a name="pyspark-EvaluateDataQuality-example"></a>

The following example code demonstrates how to evaluate data quality for a `DynamicFrame` and then view the data quality results. 

```
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

#Create Glue context
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Define DynamicFrame
legislatorsAreas = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="areas_json")

# Create data quality ruleset
ruleset = """Rules = [ColumnExists "id", IsComplete "id"]"""

# Evaluate data quality
dqResults = EvaluateDataQuality.apply(
    frame=legislatorsAreas,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "legislatorsAreas",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": "amzn-s3-demo-bucket1",
    },
)


# Inspect data quality results
dqResults.printSchema()
dqResults.toDF().show()
```

### Output
<a name="pyspark-EvaluateDataQuality-example-output"></a>

```
root
|-- Rule: string
|-- Outcome: string
|-- FailureReason: string
|-- EvaluatedMetrics: map
|    |-- keyType: string
|    |-- valueType: double


+-----------------------+-------+-------------+---------------------------------------+
|Rule                   |Outcome|FailureReason|EvaluatedMetrics                       |
+-----------------------+-------+-------------+---------------------------------------+
|ColumnExists "id"      |Passed |null         |{}                                     |
|IsComplete "id"        |Passed |null         |{Column.id.Completeness -> 1.0}        |
+-----------------------+-------+-------------+---------------------------------------+
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describe)

## \_\_call\_\_(frame, ruleset, publishing\_options = \{\})
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-__call__"></a>
+ `frame` – The `DynamicFrame` that you want to evaluate the data quality of.
+ `ruleset` – A Data Quality Definition Language (DQDL) ruleset in string format. To learn more about DQDL, see the [Data Quality Definition Language (DQDL) reference](dqdl.md) guide.
+ `publishing_options` – A dictionary that specifies the following options for publishing evaluation results and metrics:
  + `dataQualityEvaluationContext` – A string that specifies the namespace under which AWS Glue should publish Amazon CloudWatch metrics and the data quality results. The aggregated metrics appear in CloudWatch, while the full results appear in the AWS Glue Studio interface.
    + Required: No
    + Default value: `default_context`
  + `enableDataQualityCloudWatchMetrics` – Specifies whether the results of the data quality evaluation should be published to CloudWatch. You specify a namespace for the metrics using the `dataQualityEvaluationContext` option.
    + Required: No
    + Default value: False
  + `enableDataQualityResultsPublishing` – Specifies whether the data quality results should be visible on the **Data Quality** tab in the AWS Glue Studio interface.
    + Required: No
    + Default value: True
  + `resultsS3Prefix` – Specifies the Amazon S3 location where AWS Glue can write the data quality evaluation results.
    + Required: No
    + Default value: "" (empty string)
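A common pattern is to fail the job when any rule fails. The following is only a minimal sketch: result rows are represented here as plain Python dicts mirroring the example output above, whereas in a real job you would obtain them with `dqResults.toDF().collect()`.

```python
# Illustrative post-processing of data quality results. The rows below are
# hand-written stand-ins for what dqResults.toDF().collect() would return.
dq_rows = [
    {"Rule": 'ColumnExists "id"', "Outcome": "Passed", "FailureReason": None},
    {"Rule": 'IsComplete "id"', "Outcome": "Passed", "FailureReason": None},
]

# Collect the names of any rules that did not pass.
failed = [r["Rule"] for r in dq_rows if r["Outcome"] == "Failed"]
if failed:
    # Raising stops the job so bad data doesn't propagate downstream.
    raise RuntimeError(f"Data quality evaluation failed for rules: {failed}")
print("all rules passed")
```

Stopping the job on failure keeps downstream targets from receiving data that violated the ruleset.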

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FillMissingValues class
<a name="aws-glue-api-crawler-pyspark-transforms-fillmissingvalues"></a>

The `FillMissingValues` class locates null values and empty strings in a specified `DynamicFrame` and uses machine learning methods, such as linear regression and random forest, to predict the missing values. The ETL job uses the values in the input dataset to train the machine learning model, which then predicts what the missing values should be.

**Tip**  
If you use incremental data sets, then each incremental set is used as the training data for the machine learning model, so the results might not be as accurate.

To import:

```
from awsglueml.transforms import FillMissingValues
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-fillmissingvalues-_methods"></a>
+ [apply](#aws-glue-api-crawler-pyspark-transforms-fillmissingvalues-apply)

## apply(frame, missing\_values\_column, output\_column = "", transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-fillmissingvalues-apply"></a>

Fills a dynamic frame's missing values in a specified column and returns a new frame with estimates in a new column. For rows without missing values, the specified column's value is duplicated to the new column.
+ `frame` – The `DynamicFrame` in which to fill missing values. Required.
+ `missing_values_column` – The column containing missing values (`null` values and empty strings). Required.
+ `output_column` – The name of the new column that will contain estimated values for all rows whose value was missing. Optional; the default is the name of `missing_values_column` suffixed by `"_filled"`.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional; the default is zero).
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional; the default is zero).

Returns a new `DynamicFrame` with one additional column that contains estimations for rows with missing values and the present value for other rows.
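The input/output contract can be illustrated without a Glue environment. The sketch below is a toy stand-in that mean-imputes over plain dicts instead of training a machine learning model, but it shows the same shape of result: estimates land in a new `<column>_filled` field, and present values are copied through.

```python
def fill_missing_values(records, missing_values_column, output_column=""):
    """Toy stand-in for FillMissingValues: mean imputation, not ML."""
    out_col = output_column or missing_values_column + "_filled"
    # Values that count as "missing" are None and the empty string.
    present = [r[missing_values_column] for r in records
               if r.get(missing_values_column) not in (None, "")]
    estimate = sum(present) / len(present) if present else None
    filled = []
    for r in records:
        value = r.get(missing_values_column)
        new = dict(r)
        # Present values are duplicated into the new column; missing ones
        # receive the estimate.
        new[out_col] = value if value not in (None, "") else estimate
        filled.append(new)
    return filled

rows = [{"age": 30}, {"age": None}, {"age": 50}]
print(fill_missing_values(rows, "age"))
# The middle row's "age_filled" becomes the mean of the present values (40.0).
```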

# Filter class
<a name="aws-glue-api-crawler-pyspark-transforms-filter"></a>

Builds a new `DynamicFrame` that contains records from the input `DynamicFrame` that satisfy a specified predicate function.

## Example
<a name="aws-glue-api-crawler-pyspark-transforms-filter-example"></a>

We recommend that you use the [`DynamicFrame.filter()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-filter) method to filter records in a `DynamicFrame`. To view a code example, see [Example: Use filter to get a filtered selection of fields](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-filter-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-filter-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-filter-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-filter-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-filter-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-filter-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-filter-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-filter-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-filter-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-filter-describe)

## \_\_call\_\_(frame, f, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-__call__"></a>

Returns a new `DynamicFrame` that is built by selecting records from the input `DynamicFrame` that satisfy a specified predicate function.
+ `frame` – The source `DynamicFrame` to apply the specified filter function to (required).
+ `f` – The predicate function to apply to each `DynamicRecord` in the `DynamicFrame`. The function must take a `DynamicRecord` as its argument and return True if the `DynamicRecord` meets the filter requirements, or False if it doesn't (required).

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It's similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that doesn't conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string that is associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
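The predicate contract can be illustrated without a Glue environment. In this sketch, records are plain Python dictionaries standing in for `DynamicRecord` objects (the record contents are illustrative):

```python
# A predicate returns True for records to keep, mirroring the Filter contract.
def in_us(record):
    return record["address"]["country"] == "US"

records = [
    {"firstname": "Mary", "address": {"country": "US"}},
    {"firstname": "Paulo", "address": {"country": "CA"}},
]

# Filter keeps only the records for which the predicate returns True.
kept = [r for r in records if in_us(r)]
print([r["firstname"] for r in kept])  # only the US record survives
```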

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-filter-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FindIncrementalMatches class
<a name="aws-glue-api-crawler-pyspark-transforms-findincrementalmatches"></a>

Identifies matching records in the existing and incremental `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

To import:

```
from awsglueml.transforms import FindIncrementalMatches
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-findincrementalmatches-_methods"></a>
+ [apply](#aws-glue-api-crawler-pyspark-transforms-findincrementalmatches-apply)

## apply(existingFrame, incrementalFrame, transformId, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0, enforcedMatches = None, computeMatchConfidenceScores = False)
<a name="aws-glue-api-crawler-pyspark-transforms-findincrementalmatches-apply"></a>

Identifies matching records in the input `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.
+ `existingFrame` – The existing, previously matched `DynamicFrame` to which to apply the FindIncrementalMatches transform. Required.
+ `incrementalFrame` – The incremental `DynamicFrame` to match against the `existingFrame`. Required.
+ `transformId` – The unique ID associated with the FindIncrementalMatches transform to apply on records in the `DynamicFrames`. Required.
+ `transformation_ctx` – A unique string that is used to identify stats/state information. Optional.
+ `info` – A string to be associated with errors in the transformation. Optional.
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out. Optional. The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out. Optional. The default is zero.
+ `enforcedMatches` – The `DynamicFrame` used to enforce matches. Optional. The default is None.
+ `computeMatchConfidenceScores` – A Boolean value indicating whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new `DynamicFrame` with a unique identifier assigned to each group of matching records.
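The incremental behavior can be sketched in plain Python. This toy stand-in matches on an exact key, whereas the real transform applies the trained ML model referenced by `transformId`; the point is only that incremental records join an existing group's identifier when they match, and start a new group otherwise.

```python
import itertools

def find_incremental_matches(existing, incremental, key):
    """Toy stand-in: attach incremental records to existing match groups
    by exact key equality (real matching uses a trained ML transform)."""
    # New groups continue numbering after the largest existing match_id.
    next_id = itertools.count(
        max((r["match_id"] for r in existing), default=-1) + 1)
    by_key = {r[key]: r["match_id"] for r in existing}
    out = [dict(r) for r in existing]
    for r in incremental:
        group = by_key.get(r[key])
        if group is None:
            group = next(next_id)          # no match: open a new group
            by_key[r[key]] = group
        out.append({**r, "match_id": group})
    return out

existing = [{"email": "mary@example.com", "match_id": 0}]
incremental = [{"email": "mary@example.com"}, {"email": "paulo@example.com"}]
print(find_incremental_matches(existing, incremental, "email"))
```

Because the pre-matched frame is not re-scanned against itself, the incremental form avoids redoing work already captured in `existingFrame`.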

# FindMatches class
<a name="aws-glue-api-crawler-pyspark-transforms-findmatches"></a>

Identifies matching records in the input `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

To import:

```
from awsglueml.transforms import FindMatches
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-findmatches-_methods"></a>
+ [apply](#aws-glue-api-crawler-pyspark-transforms-findmatches-apply)

## apply(frame, transformId, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0, enforcedMatches = None, computeMatchConfidenceScores = False)
<a name="aws-glue-api-crawler-pyspark-transforms-findmatches-apply"></a>

Identifies matching records in the input `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.
+ `frame` – The `DynamicFrame` to which to apply the FindMatches transform. Required.
+ `transformId` – The unique ID associated with the FindMatches transform to apply on records in the `DynamicFrame`. Required.
+ `transformation_ctx` – A unique string that is used to identify stats/state information. Optional.
+ `info` – A string to be associated with errors in the transformation. Optional.
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out. Optional. The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out. Optional. The default is zero.
+ `enforcedMatches` – The `DynamicFrame` used to enforce matches. Optional. The default is None.
+ `computeMatchConfidenceScores` – A Boolean value indicating whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new `DynamicFrame` with a unique identifier assigned to each group of matching records.
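The shape of the result can be sketched in plain Python. This toy stand-in groups records by exact key equality, whereas the real transform matches fuzzy duplicates using the ML model referenced by `transformId`; what it shows is how every record in the output carries a group identifier shared by its matches.

```python
def find_matches(records, key):
    """Toy stand-in for FindMatches: group by exact key equality and give
    each group a unique match_id (real matching is fuzzy and ML-driven)."""
    groups = {}
    out = []
    for r in records:
        # setdefault assigns the next group id the first time a key is seen.
        match_id = groups.setdefault(r[key], len(groups))
        out.append({**r, "match_id": match_id})
    return out

people = [
    {"name": "Mary Major", "email": "mary@example.com"},
    {"name": "M. Major", "email": "mary@example.com"},
    {"name": "Paulo Santos", "email": "paulo@example.com"},
]
print(find_matches(people, "email"))  # first two records share match_id 0
```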

# FlatMap class
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map"></a>

 Applies a transform to each `DynamicFrame` in a collection. Results are not flattened into a single `DynamicFrame`, but preserved as a collection. 

## Examples for FlatMap
<a name="aws-glue-api-crawler-pyspark-flat-map-examples"></a>

 The following example snippet demonstrates how to use the `ResolveChoice` transform on a collection of dynamic frames when applied to a `FlatMap`. The data used for input is in the JSON located at the placeholder Amazon S3 address `s3://bucket/path-for-data/sample.json` and contains the following data. 

### Example JSON data
<a name="aws-glue-api-crawler-pyspark-flat-map-examples-json"></a>

```
[{
    "firstname": "Arnav",
    "lastname": "Desai",
    "address": {
        "street": "6 Anyroad Avenue",
        "city": "London",
        "state": "England",
        "country": "UK"
    },
    "phone": 17235550101,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Independent Research",
        "Government Department of Examples"
    ]
},
{
    "firstname": "Mary",
    "lastname": "Major",
    "address": {
        "street": "7821 Spot Place",
        "city": "Centerville",
        "state": "OK",
        "country": "US"
    },
    "phone": 19185550023,
    "affiliations": [
        "Example Dot Com",
        "Example Independent Research",
        "Example.io"
    ]
},
{
    "firstname": "Paulo",
    "lastname": "Santos",
    "address": {
        "street": "123 Maple Street",
        "city": "London",
        "state": "Ontario",
        "country": "CA"
    },
    "phone": 12175550181,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Dot Com"
    ]
}]
```

**Example Apply ResolveChoice to a DynamicFrameCollection and show output.**  

```
# Imports and context setup so that the snippet is self-contained
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import FlatMap, ResolveChoice

glueContext = GlueContext(SparkContext.getOrCreate())

## Read a DynamicFrame
datasource = glueContext.create_dynamic_frame_from_options(
    "s3",
    connection_options={"paths": ["s3://bucket/path-for-data/sample.json"]},
    format="json")
datasource.printSchema()
datasource.show()

## Split to create a DynamicFrameCollection
split_frame = datasource.split_fields(
    ["firstname", "lastname", "address"], "personal_info", "business_info")
print(split_frame.keys())
print("---")

## Use FlatMap to run ResolveChoice
kwargs = {"choice": "cast:string"}
flat = FlatMap.apply(split_frame, ResolveChoice, frame_name="frame",
                     transformation_ctx="tcx", **kwargs)
print(flat.keys())

## Select one of the DynamicFrames
personal_info = flat.select("personal_info")
personal_info.printSchema()
personal_info.show()
print("---")

business_info = flat.select("business_info")
business_info.printSchema()
business_info.show()
```
 When calling `FlatMap.apply`, the `frame_name` parameter **must** be `"frame"`. No other value is currently accepted. 

### Example output
<a name="aws-glue-api-crawler-pyspark-flat-map-examples-output"></a>

```
root
|-- firstname: string
|-- lastname: string
|-- address: struct
|    |-- street: string
|    |-- city: string
|    |-- state: string
|    |-- country: string
|-- phone: long
|-- affiliations: array
|    |-- element: string
---
{
    "firstname": "Mary",
    "lastname": "Major",
    "address": {
        "street": "7821 Spot Place",
        "city": "Centerville",
        "state": "OK",
        "country": "US"
    },
    "phone": 19185550023,
    "affiliations": [
        "Example Dot Com",
        "Example Independent Research",
        "Example.io"
    ]
}

{
    "firstname": "Paulo",
    "lastname": "Santos",
    "address": {
        "street": "123 Maple Street",
        "city": "London",
        "state": "Ontario",
        "country": "CA"
    },
    "phone": 12175550181,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Dot Com"
    ]
}
---
root
|-- firstname: string
|-- lastname: string
|-- address: struct
|    |-- street: string
|    |-- city: string
|    |-- state: string
|    |-- country: string

{
    "firstname": "Mary",
    "lastname": "Major",
    "address": {
        "street": "7821 Spot Place",
        "city": "Centerville",
        "state": "OK",
        "country": "US"
    }
}

{
    "firstname": "Paulo",
    "lastname": "Santos",
    "address": {
        "street": "123 Maple Street",
        "city": "London",
        "state": "Ontario",
        "country": "CA"
    }
}
---
root
|-- phone: long
|-- affiliations: array
|    |-- element: string

{
    "phone": 19185550023,
    "affiliations": [
        "Example Dot Com",
        "Example Independent Research",
        "Example.io"
    ]
}

{
    "phone": 12175550181,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Dot Com"
    ]
}
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-flat-map-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-flat-map-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-flat-map-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-flat-map-describe)

## \_\_call\_\_(dfc, BaseTransform, frame\_name, transformation\_ctx = "", \*\*base\_kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-__call__"></a>

Applies a transform to each `DynamicFrame` in a collection. The results are returned as a new `DynamicFrameCollection`; they are not flattened into a single `DynamicFrame`.
+ `dfc` – The `DynamicFrameCollection` over which to flatmap (required).
+ `BaseTransform` – A transform derived from `GlueTransform` to apply to each member of the collection (required).
+ `frame_name` – The argument name to pass the elements of the collection to (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `base_kwargs` – Arguments to pass to the base transform (required).

Returns a new `DynamicFrameCollection` created by applying the transform to each `DynamicFrame` in the source `DynamicFrameCollection`.
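The collection-level contract can be sketched in plain Python, with a dict standing in for the `DynamicFrameCollection` and an ordinary function standing in for the `GlueTransform` (all names here are illustrative):

```python
def flat_map(dfc, base_transform, **base_kwargs):
    """Toy stand-in for FlatMap: apply base_transform to every member of
    the collection and return a new collection with the same keys."""
    return {name: base_transform(frame, **base_kwargs)
            for name, frame in dfc.items()}

# A "collection" of two toy frames, each a list of records.
collection = {
    "personal_info": [{"phone": "17235550101"}],
    "business_info": [{"phone": "19185550023"}],
}

# A toy member transform: cast every field value with the given callable.
def cast_values(frame, target=str):
    return [{k: target(v) for k, v in r.items()} for r in frame]

result = flat_map(collection, cast_values, target=int)
print(result["personal_info"])  # [{'phone': 17235550101}]
```

Note that the result keeps both keys; the members are transformed individually rather than merged, matching the class description above.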

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-flat-map-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Join class
<a name="aws-glue-api-crawler-pyspark-transforms-join"></a>

Performs an equality join on two `DynamicFrames`.

## Example
<a name="pyspark-Join-example"></a>

We recommend that you use the [`DynamicFrame.join()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-join) method to join `DynamicFrames`. To view a code example, see [Example: Use join to combine `DynamicFrames`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-join-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-join-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-join-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-join-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-join-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-join-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-join-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-join-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-join-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-join-describe)

## \_\_call\_\_(frame1, frame2, keys1, keys2, transformation\_ctx = "")
<a name="aws-glue-api-crawler-pyspark-transforms-join-__call__"></a>

Performs an equality join on two `DynamicFrames`.
+ `frame1` – The first `DynamicFrame` to join (required).
+ `frame2` – The second `DynamicFrame` to join (required).
+ `keys1` – The keys to join on for the first frame (required).
+ `keys2` – The keys to join on for the second frame (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

Returns a new `DynamicFrame` that is created by joining the two `DynamicFrames`.
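An equality join can be sketched in plain Python over dict records; `keys1` and `keys2` name the fields whose values are compared for equality, and only rows with a match on the other side appear in the output (an inner join). The data below is illustrative.

```python
def join(frame1, frame2, keys1, keys2):
    """Toy equality join: emit a merged record for every pair of rows
    whose key fields are equal."""
    # Index frame2 by its key tuple for O(1) probe per frame1 row.
    index = {}
    for r in frame2:
        index.setdefault(tuple(r[k] for k in keys2), []).append(r)
    out = []
    for left in frame1:
        for right in index.get(tuple(left[k] for k in keys1), []):
            out.append({**right, **left})  # merge fields from both rows
    return out

people = [{"id": 1, "name": "Mary"}, {"id": 2, "name": "Paulo"}]
phones = [{"person_id": 1, "phone": "19185550023"}]
print(join(people, phones, ["id"], ["person_id"]))
# Only the id=1 row has a partner, so one merged record comes back.
```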

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-join-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-join-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-join-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-join-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-join-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-join-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-join-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Map class
<a name="aws-glue-api-crawler-pyspark-transforms-map"></a>

Builds a new `DynamicFrame` by applying a function to all records in the input `DynamicFrame`.

## Example
<a name="aws-glue-api-crawler-pyspark-transforms-map-examples"></a>

We recommend that you use the [`DynamicFrame.map()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-map) method to apply a function to all records in a `DynamicFrame`. To view a code example, see [Example: Use map to apply a function to every record in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-map-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-map-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-map-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-map-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-map-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-map-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-map-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-map-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-map-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-map-describe)

## \_\_call\_\_(frame, f, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-map-__call__"></a>

Returns a new `DynamicFrame` that results from applying the specified function to all `DynamicRecords` in the original `DynamicFrame`.
+ `frame` – The original `DynamicFrame` to apply the mapping function to (required).
+ `f` – The function to apply to all `DynamicRecords` in the `DynamicFrame`. The function must take a `DynamicRecord` as an argument and return a new `DynamicRecord` that is produced by the mapping (required).

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It's similar to a row in an Apache Spark `DataFrame`, except that it is self-describing and can be used for data that doesn't conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` that results from applying the specified function to all `DynamicRecords` in the original `DynamicFrame`.
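As a sketch, the record function itself is plain Python: a `DynamicRecord` can be read and modified like a dictionary. The field names below are hypothetical, and the `Map.apply` call is shown commented because it requires a running Glue job with a `GlueContext`:

```python
# Hypothetical record-level function: merge two name fields into one.
# A DynamicRecord can be manipulated like a Python dictionary.
def merge_names(rec):
    rec["name"] = rec["first_name"] + " " + rec["last_name"]
    del rec["first_name"]
    del rec["last_name"]
    return rec

# In a Glue job, you would apply it like this:
# from awsglue.transforms import Map
# mapped_dyf = Map.apply(frame=dyf, f=merge_names, transformation_ctx="map_names")

# The function is ordinary Python, so it can be exercised directly:
print(merge_names({"first_name": "Ada", "last_name": "Lovelace"}))
```

Because the function runs once per record, keep it free of side effects and make sure it always returns a `DynamicRecord`.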

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-map-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-map-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-map-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-map-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-map-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-map-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-map-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# MapToCollection class
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection"></a>

Applies a transform to each `DynamicFrame` in the specified `DynamicFrameCollection`.

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describe)

## \_\_call\_\_(dfc, BaseTransform, frame\_name, transformation\_ctx = "", \*\*base\_kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-__call__"></a>

Applies a transform function to each `DynamicFrame` in the specified `DynamicFrameCollection`.
+ `dfc` – The `DynamicFrameCollection` over which to apply the transform function (required).
+ `BaseTransform` – A callable transform to apply to each member of the collection (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

Returns a new `DynamicFrameCollection` created by applying the transform to each `DynamicFrame` in the source `DynamicFrameCollection`.
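The collection semantics can be sketched in plain Python, with dictionaries standing in for frames (all names here are hypothetical); the real call takes a transform class and runs inside a Glue job:

```python
# Plain-Python sketch of the MapToCollection semantics: apply one
# transform to every member of a named collection. Lists stand in
# for DynamicFrames here; the keys mirror DynamicFrameCollection keys.
def map_to_collection_like(collection, transform):
    return {name: transform(frame) for name, frame in collection.items()}

doubled = map_to_collection_like(
    {"high": [3, 4], "low": [1, 2]},
    lambda frame: [x * 2 for x in frame],
)
print(doubled)

# In a Glue job, the equivalent call would look like:
# from awsglue.transforms import MapToCollection
# result_dfc = MapToCollection.apply(dfc=my_collection,
#                                    BaseTransform=MyTransform,
#                                    transformation_ctx="map_collection")
```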

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MapToCollection-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Relationalize class
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize"></a>

Flattens a nested schema in a `DynamicFrame` and pivots out array columns from the flattened frame.

## Example
<a name="pyspark-Relationalize-example"></a>

We recommend that you use the [`DynamicFrame.relationalize()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-relationalize) method to relationalize a `DynamicFrame`. To view a code example, see [Example: Use relationalize to flatten a nested schema in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-relationalize-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-Relationalize-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Relationalize-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Relationalize-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describe)

## \_\_call\_\_(frame, staging\_path=None, name='roottable', options=None, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-__call__"></a>

Relationalizes a `DynamicFrame` and produces a list of frames that are generated by unnesting nested columns and pivoting array columns. You can join a pivoted array column to the root table by using the join key that is generated in the unnest phase.
+ `frame` – The `DynamicFrame` to relationalize (required).
+ `staging_path` – The path where the method can store partitions of pivoted tables in CSV format (optional). Pivoted tables are read back from this path.
+ `name` – The name of the root table (optional).
+ `options` – A dictionary of optional parameters. Currently unused. 
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Relationalize-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# RenameField class
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField"></a>

Renames a node within a `DynamicFrame`.

## Example
<a name="pyspark-RenameField-example"></a>

We recommend that you use the [`DynamicFrame.rename_field()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-rename_field) method to rename a field in a `DynamicFrame`. To view a code example, see [Example: Use rename\_field to rename fields in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-rename_field-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-RenameField-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-RenameField-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-RenameField-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-RenameField-describe)

## \_\_call\_\_(frame, old\_name, new\_name, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-__call__"></a>

Renames a node within a `DynamicFrame`.
+ `frame` – The `DynamicFrame` in which to rename a node (required).
+ `old_name` – The full path to the node to rename (required).

  If the old name has dots in it, `RenameField` doesn't work unless you place backticks around it (`` ` ``). For example, to replace `this.old.name` with `thisNewName`, call `RenameField` as follows:

  ```
  newDyF = RenameField(oldDyF, "`this.old.name`", "thisNewName")
  ```
+ `new_name` – The new name, including full path (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
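The rename semantics can be sketched in plain Python for a single flat record (the real transform operates on a whole `DynamicFrame` and resolves nested paths); the field names here are hypothetical:

```python
# Plain-Python sketch of a field rename. A dict stands in for one record.
def rename_field_like(record, old_name, new_name):
    record[new_name] = record.pop(old_name)
    return record

print(rename_field_like({"this.old.name": 5}, "this.old.name", "thisNewName"))

# In a Glue job (note the backticks around a dotted old name):
# from awsglue.transforms import RenameField
# renamed_dyf = RenameField.apply(frame=dyf, old_name="`this.old.name`",
#                                 new_name="thisNewName",
#                                 transformation_ctx="rename")
```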

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RenameField-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# ResolveChoice class
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice"></a>

Resolves a choice type within a `DynamicFrame`.

## Example
<a name="pyspark-ResolveChoice-example"></a>

We recommend that you use the [`DynamicFrame.resolveChoice()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-resolveChoice) method to handle fields that contain multiple types in a `DynamicFrame`. To view a code example, see [Example: Use resolveChoice to handle a column that contains multiple types](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-resolveChoice-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describe)

## \_\_call\_\_(frame, specs = None, choice = "", transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-__call__"></a>

Provides information for resolving ambiguous types within a `DynamicFrame`. It returns the resulting `DynamicFrame`.
+ `frame` – The `DynamicFrame` in which to resolve the choice type (required).
+ `specs` – A list of specific ambiguities to resolve, each in the form of a tuple: `(path, action)`. The `path` value identifies a specific ambiguous element, and the `action` value identifies the corresponding resolution. 

  You can use only one of the `specs` and `choice` parameters. If the `specs` parameter is not `None`, then the `choice` parameter must be an empty string. Conversely, if `choice` is not an empty string, then the `specs` parameter must be `None`. If neither parameter is provided, AWS Glue tries to parse the schema and use it to resolve ambiguities.

  You can specify one of the following resolution strategies in the `action` portion of a `specs` tuple:
  + `cast` – Allows you to specify a type to cast to (for example, `cast:int`).
  + `make_cols` – Resolves a potential ambiguity by flattening the data. For example, if `columnA` could be an `int` or a `string`, the resolution is to produce two columns named `columnA_int` and `columnA_string` in the resulting `DynamicFrame`.
  + `make_struct` – Resolves a potential ambiguity by using a struct to represent the data. For example, if data in a column could be an `int` or a `string`, using the `make_struct` action produces a column of structures in the resulting `DynamicFrame` with each containing both an `int` and a `string`.
  + `project` – Resolves a potential ambiguity by retaining only values of a specified type in the resulting `DynamicFrame`. For example, if data in a `ChoiceType` column could be an `int` or a `string`, specifying a `project:string` action drops values from the resulting `DynamicFrame` that are not type `string`. 

  If the `path` identifies an array, place empty square brackets after the name of the array to avoid ambiguity. For example, suppose you are working with data structured as follows:

  ```
  "myList": [
    { "price": 100.00 },
    { "price": "$100.00" }
  ]
  ```

  You can select the numeric rather than the string version of the price by setting the `path` to `"myList[].price"`, and setting the `action` to `"cast:double"`.
+ `choice` – The default resolution action to use if the `specs` parameter is `None`. If `specs` is not `None`, then this parameter must be an empty string.

  In addition to the `specs` actions previously described, this argument also supports the following action:
  + `MATCH_CATALOG` – Attempts to cast each `ChoiceType` to the corresponding type in the specified Data Catalog table.
+ `database` – The AWS Glue Data Catalog database to use with the `MATCH_CATALOG` action (required for `MATCH_CATALOG`).
+ `table_name` – The AWS Glue Data Catalog table name to use with the `MATCH_CATALOG` action (required for `MATCH_CATALOG`).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
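As a sketch of how the `specs` argument is assembled (the frame name `dyf` and the column paths are hypothetical), each entry pairs a path with a resolution action; the Glue call is shown commented because it requires a running Glue job:

```python
# Hypothetical specs: pair each ambiguous path with a resolution action.
specs = [
    ("myList[].price", "cast:double"),  # keep the numeric price values
    ("columnA", "make_cols"),           # split into columnA_int / columnA_string
]

# In a Glue job you would pass the list directly (requires a GlueContext):
# from awsglue.transforms import ResolveChoice
# resolved_dyf = ResolveChoice.apply(frame=dyf, specs=specs,
#                                    transformation_ctx="resolve_choices")
```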

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SelectFields class
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields"></a>

The `SelectFields` class creates a new `DynamicFrame` from an existing `DynamicFrame`, and keeps only the fields that you specify. `SelectFields` provides similar functionality to a SQL `SELECT` statement.

## Example
<a name="pyspark-SelectFields-examples"></a>

We recommend that you use the [`DynamicFrame.select_fields()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-select_fields) method to select fields from a `DynamicFrame`. To view a code example, see [Example: Use select\_fields to create a new `DynamicFrame` with chosen fields](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-select_fields-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SelectFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SelectFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SelectFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describe)

## \_\_call\_\_(frame, paths, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-__call__"></a>

Gets fields (nodes) in a `DynamicFrame`.
+ `frame` – The `DynamicFrame` to select fields in (required).
+ `paths` – A list of full paths to the fields to select (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string that is associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` that contains only the specified fields.
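The keep-only semantics can be sketched in plain Python for a flat record (the real transform also resolves nested paths such as `a.b.c`); the field names are hypothetical, and the Glue call is commented because it requires a `GlueContext`:

```python
# Plain-Python sketch: keep only the listed top-level fields of one record.
def select_fields_like(record, paths):
    return {key: value for key, value in record.items() if key in paths}

print(select_fields_like({"id": 1, "name": "x", "price": 9.5}, ["id", "price"]))

# In a Glue job:
# from awsglue.transforms import SelectFields
# selected_dyf = SelectFields.apply(frame=dyf, paths=["id", "price"],
#                                   transformation_ctx="select_ids")
```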

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFields-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SelectFromCollection class
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection"></a>

Selects one `DynamicFrame` in a `DynamicFrameCollection`.

## Example
<a name="pyspark-SelectFromCollection-example"></a>

This example uses `SelectFromCollection` to select a `DynamicFrame` from a `DynamicFrameCollection`.

**Example dataset**

The example selects two `DynamicFrames` from a `DynamicFrameCollection` called `split_rows_collection`. The following is the list of keys in `split_rows_collection`.

```
dict_keys(['high', 'low'])
```

**Example code**

```
# Example: Use SelectFromCollection to select
# DynamicFrames from a DynamicFrameCollection

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Select frames and inspect entries
frame_low = SelectFromCollection.apply(dfc=split_rows_collection, key="low")
frame_low.toDF().show()

frame_high = SelectFromCollection.apply(dfc=split_rows_collection, key="high")
frame_high.toDF().show()
```

### Output
<a name="SelectFromCollection-example-output"></a>

```
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                     fax|             202-225-3307|
|  1|    1|                   phone|             202-225-5731|
|  2|    0|                     fax|             202-225-3307|
|  2|    1|                   phone|             202-225-5731|
|  3|    0|                     fax|             202-225-3307|
|  3|    1|                   phone|             202-225-5731|
|  4|    0|                     fax|             202-225-3307|
|  4|    1|                   phone|             202-225-5731|
|  5|    0|                     fax|             202-225-3307|
|  5|    1|                   phone|             202-225-5731|
|  6|    0|                     fax|             202-225-3307|
|  6|    1|                   phone|             202-225-5731|
|  7|    0|                     fax|             202-225-3307|
|  7|    1|                   phone|             202-225-5731|
|  8|    0|                     fax|             202-225-3307|
|  8|    1|                   phone|             202-225-5731|
|  9|    0|                     fax|             202-225-3307|
|  9|    1|                   phone|             202-225-5731|
| 10|    0|                     fax|             202-225-6328|
| 10|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 11|    0|                     fax|             202-225-6328|
| 11|    1|                   phone|             202-225-4576|
| 11|    2|                 twitter|           RepTrentFranks|
| 12|    0|                     fax|             202-225-6328|
| 12|    1|                   phone|             202-225-4576|
| 12|    2|                 twitter|           RepTrentFranks|
| 13|    0|                     fax|             202-225-6328|
| 13|    1|                   phone|             202-225-4576|
| 13|    2|                 twitter|           RepTrentFranks|
| 14|    0|                     fax|             202-225-6328|
| 14|    1|                   phone|             202-225-4576|
| 14|    2|                 twitter|           RepTrentFranks|
| 15|    0|                     fax|             202-225-6328|
| 15|    1|                   phone|             202-225-4576|
| 15|    2|                 twitter|           RepTrentFranks|
| 16|    0|                     fax|             202-225-6328|
| 16|    1|                   phone|             202-225-4576|
| 16|    2|                 twitter|           RepTrentFranks|
| 17|    0|                     fax|             202-225-6328|
| 17|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows
```

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describe)

## \_\_call\_\_(dfc, key, transformation\_ctx = "")
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-__call__"></a>

Gets one `DynamicFrame` from a `DynamicFrameCollection`.
+ `dfc` – The `DynamicFrameCollection` that the `DynamicFrame` should be selected from (required).
+ `key` – The key of the `DynamicFrame` to select (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Simplify\_ddb\_json class
<a name="aws-glue-api-crawler-pyspark-transforms-simplify-ddb-json"></a>

Simplifies nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new simplified `DynamicFrame`.

## Example
<a name="pyspark-simplify-ddb-json-examples"></a>

We recommend that you use the `DynamicFrame.simplify_ddb_json()` method to simplify nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure. To view a code example, see [Example: Use simplify\_ddb\_json to invoke a DynamoDB JSON simplify](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-simplify-ddb-json-example).
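To see what "simplify" means here: DynamoDB JSON wraps every value in a type descriptor such as `{"S": ...}`, `{"N": ...}`, `{"M": ...}`, or `{"L": ...}`. The following plain-Python sketch (not the Glue implementation) shows the conceptual unwrapping:

```python
# Hypothetical sketch: strip DynamoDB type descriptors so each column
# holds a plain value. The real transform operates on DynamicFrame columns.
def simplify_ddb_value(value):
    (tag, inner), = value.items()         # e.g. ("N", "7")
    if tag == "S":
        return inner                      # string
    if tag == "N":
        return float(inner) if "." in inner else int(inner)  # number
    if tag == "BOOL":
        return inner
    if tag == "NULL":
        return None
    if tag == "M":                        # map -> nested dict
        return {k: simplify_ddb_value(v) for k, v in inner.items()}
    if tag == "L":                        # list -> Python list
        return [simplify_ddb_value(v) for v in inner]
    raise ValueError(f"unhandled DynamoDB type tag: {tag}")

item = {"id": {"N": "7"}, "name": {"S": "Ana"},
        "tags": {"L": [{"S": "a"}, {"S": "b"}]}}
simplified = {k: simplify_ddb_value(v) for k, v in item.items()}
```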

# Spigot class
<a name="aws-glue-api-crawler-pyspark-transforms-spigot"></a>

Writes sample records to a specified destination to help you verify the transformations performed by your AWS Glue job.

## Example
<a name="pyspark-spigot-examples"></a>

We recommend that you use the [`DynamicFrame.spigot()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-spigot) method to write a subset of records from a `DynamicFrame` to a specified destination. To view a code example, see [Example: Use spigot to write sample fields from a `DynamicFrame` to Amazon S3](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-spigot-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-spigot-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-spigot-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-spigot-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-spigot-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-spigot-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-spigot-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-spigot-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-spigot-describe)

## \_\_call\_\_(frame, path, options, transformation\_ctx = "")
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-__call__"></a>

Writes sample records to a specified destination during a transformation.
+ `frame` – The `DynamicFrame` to spigot (required).
+ `path` – The path of the destination to write to (required).
+ `options` – JSON key-value pairs that specify options (optional). The `"topk"` option specifies that the first *k* records should be written. The `"prob"` option specifies the probability (as a decimal) of picking any given record. You use this in selecting records to write.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
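The two sampling options behave differently: `"topk"` is deterministic while `"prob"` keeps each record independently at random. A plain-Python sketch of both rules (illustrative only, not the Glue implementation):

```python
import random

# Hypothetical sketch of spigot's record selection: either take the
# first k records, or keep each record with probability `prob`.
def pick_sample(records, topk=None, prob=None, seed=0):
    if topk is not None:
        return records[:topk]             # "topk": first k records
    rng = random.Random(seed)             # seeded for reproducibility
    # "prob": keep each record independently with the given probability
    return [r for r in records if rng.random() < prob]

rows = list(range(100))
first_ten = pick_sample(rows, topk=10)
sampled = pick_sample(rows, prob=0.5, seed=1)
```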

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-spigot-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SplitFields class
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields"></a>

Splits a `DynamicFrame` into two new ones, by specified fields.

## Example
<a name="pyspark-SplitFields-examples"></a>

We recommend that you use the [`DynamicFrame.split_fields()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_fields) method to split fields in a `DynamicFrame`. To view a code example, see [Example: Use split\_fields to split selected fields into a separate `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-split_fields-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SplitFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SplitFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SplitFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describe)

## \_\_call\_\_(frame, paths, name1 = None, name2 = None, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-__call__"></a>

Splits one or more fields in a `DynamicFrame` off into a new `DynamicFrame`, and creates another new `DynamicFrame` that contains the fields that remain.
+ `frame` – The source `DynamicFrame` to split into two new ones (required).
+ `paths` – A list of full paths to the fields to be split (required).
+ `name1` – The name to assign to the `DynamicFrame` that will contain the fields to be split off (optional). If no name is supplied, the name of the source frame is used with "1" appended.
+ `name2` – The name to assign to the `DynamicFrame` that will contain the fields that remain after the specified fields are split off (optional). If no name is provided, the name of the source frame is used with "2" appended.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
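The split semantics can be sketched on a single record: fields named in `paths` go to one output, everything else to the other. This is a hypothetical plain-Python illustration, not the Glue implementation (which splits whole `DynamicFrame`s):

```python
# Illustrative sketch: partition one record's fields into two records
# according to a list of field paths.
def split_fields(record, paths):
    split_off = {k: v for k, v in record.items() if k in paths}
    remaining = {k: v for k, v in record.items() if k not in paths}
    return split_off, remaining

row = {"id": 1, "name": "Ana", "phone": "555-0100"}
contact, rest = split_fields(row, ["phone"])
```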

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitFields-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SplitRows class
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows"></a>

Creates a `DynamicFrameCollection` that contains two `DynamicFrames`. One `DynamicFrame` contains only the specified rows to be split, and the other contains all remaining rows.

## Example
<a name="pyspark-SplitRows-examples"></a>

We recommend that you use the [`DynamicFrame.split_rows()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_rows) method to split rows in a `DynamicFrame`. To view a code example, see [Example: Use split\_rows to split rows in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-split_rows-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SplitRows-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SplitRows-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SplitRows-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describe)

## \_\_call\_\_(frame, comparison\_dict, name1="frame1", name2="frame2", transformation\_ctx = "", info = None, stageThreshold = 0, totalThreshold = 0)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-__call__"></a>

Splits one or more rows in a `DynamicFrame` off into a new `DynamicFrame`.
+ `frame` – The source `DynamicFrame` to split into two new ones (required).
+ `comparison_dict` – A dictionary where the key is the full path to a column, and the value is another dictionary for mapping comparators to values that the column values are compared to. For example, `{"age": {">": 10, "<": 20}}` splits rows where the value of "age" is between 10 and 20, exclusive, from rows where "age" is outside that range (required).
+ `name1` – The name to assign to the `DynamicFrame` that will contain the rows to be split off (optional).
+ `name2` – The name to assign to the `DynamicFrame` that will contain the rows that remain after the specified rows are split off (optional).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
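The way a `comparison_dict` such as `{"age": {">": 10, "<": 20}}` partitions rows can be sketched in plain Python (illustrative only, not the Glue implementation):

```python
import operator

# Hypothetical sketch: a row matches when every (column, comparator,
# bound) test in the comparison_dict holds for that row.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "=": operator.eq}

def matches(row, comparison_dict):
    return all(OPS[op](row[col], bound)
               for col, tests in comparison_dict.items()
               for op, bound in tests.items())

rows = [{"age": 5}, {"age": 15}, {"age": 25}]
criteria = {"age": {">": 10, "<": 20}}
selected = [r for r in rows if matches(r, criteria)]
remaining = [r for r in rows if not matches(r, criteria)]
```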

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-SplitRows-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Unbox class
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox"></a>

Unboxes (reformats) a string field in a `DynamicFrame`.

## Example
<a name="pyspark-Unbox-example"></a>

We recommend that you use the [`DynamicFrame.unbox()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unbox) method to unbox a field in a `DynamicFrame`. To view a code example, see [Example: Use unbox to unbox a string field into a struct](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-unbox-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-Unbox-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Unbox-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Unbox-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Unbox-describe)

## \_\_call\_\_(frame, path, format, transformation\_ctx = "", info="", stageThreshold=0, totalThreshold=0, \*\*options)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-__call__"></a>

Unboxes a string field in a `DynamicFrame`.
+ `frame` – The `DynamicFrame` in which to unbox a field (required).
+ `path` – The full path to the `StringNode` to unbox (required).
+ `format` – A format specification (optional). This is used for an Amazon S3 or AWS Glue connection that supports multiple formats. For the formats that are supported, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
+ `separator` – A separator token (optional).
+ `escaper` – An escape token (optional).
+ `skipFirst` – `True` if the first line of data should be skipped, or `False` if it should not be skipped (optional).
+ `withSchema` – A string that contains a schema for the data to be unboxed (optional). This should always be created using `StructType.json`.
+ `withHeader` – `True` if the data being unpacked includes a header, or `False` if not (optional).
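The core idea of unboxing, parsing a string field into a structured value, can be sketched for the JSON case in plain Python (illustrative only, not the Glue implementation):

```python
import json

# Hypothetical sketch: replace a string field containing JSON with the
# parsed struct, leaving the other fields untouched.
def unbox_json_field(record, path):
    out = dict(record)
    out[path] = json.loads(out[path])     # string -> struct
    return out

row = {"id": 1, "payload": '{"lat": 47.6, "lon": -122.3}'}
unboxed = unbox_json_field(row, "payload")
```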

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Unbox-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# UnnestFrame class
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame"></a>

Unnests a `DynamicFrame`, flattens nested objects to top-level elements, and generates join keys for array objects.

## Example
<a name="pyspark-UnnestFrame-example"></a>

We recommend that you use the [`DynamicFrame.unnest()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest) method to flatten nested structures in a `DynamicFrame`. To view a code example, see [Example: Use unnest to turn nested fields into top-level fields](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-unnest-example).

## Methods
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describe)

## \_\_call\_\_(frame, transformation\_ctx = "", info="", stageThreshold=0, totalThreshold=0)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-__call__"></a>

Unnests a `DynamicFrame`, flattens nested objects to top-level elements, and generates join keys for array objects.
+ `frame` – The `DynamicFrame` to unnest (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
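The flattening step can be sketched on a single nested record: nested keys are promoted to top-level, dot-delimited names. This is a plain-Python illustration, not the Glue implementation (which additionally generates join keys for array objects):

```python
# Hypothetical sketch: recursively promote nested dict keys to
# top-level "parent.child" column names.
def unnest(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(unnest(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

row = {"id": 1, "address": {"city": "Seattle", "zip": "98101"}}
flattened = unnest(row)
```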

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FlagDuplicatesInColumn class
<a name="aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn"></a>

The `FlagDuplicatesInColumn` transform returns a new column with a specified value in each row that indicates whether the value in the row's source column matches a value in an earlier row of the source column. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

## Example
<a name="pyspark-FlagDuplicatesInColumn-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession      
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

datasource1 = spark.read.json("s3://${BUCKET}/json/zips/raw/data")

try:
    df_output = column.FlagDuplicatesInColumn.apply(
        data_frame=datasource1,
        spark_context=sc,
        source_column="city",
        target_column="flag_col",
        true_string="True",
        false_string="False"
    )
except Exception:
    print("Unexpected error occurred")
    raise
```

## Output
<a name="pyspark-FlagDuplicatesInColumn-output"></a>

The `FlagDuplicatesInColumn` transformation adds a new column, `flag_col`, to the `df_output` DataFrame. This column contains a string value indicating whether the corresponding row has a duplicate value in the `city` column. If a row duplicates an earlier `city` value, `flag_col` contains the `true_string` value "True". If a row has a unique `city` value, `flag_col` contains the `false_string` value "False".

The resulting `df_output` DataFrame contains all columns from the original `datasource1` DataFrame, plus the additional `flag_col` column indicating duplicate `city` values.

## Methods
<a name="aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, target\_column, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING)
<a name="aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn-__call__"></a>

The `FlagDuplicatesInColumn` transform returns a new column with a specified value in each row that indicates whether the value in the row's source column matches a value in an earlier row of the source column. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.
+ `source_column` – Name of the source column.
+ `target_column` – Name of the target column.
+ `true_string` – String to be inserted in the target column when a source column value duplicates an earlier value in that column.
+ `false_string` – String to be inserted in the target column when a source column value is distinct from earlier values in that column.
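The flagging rule described above, first occurrence unflagged, every later repeat flagged, can be sketched over a single column of values in plain Python (illustrative only, not the Glue implementation):

```python
# Hypothetical sketch: walk the column in order, flagging a value only
# when it has already been seen in an earlier row.
def flag_duplicates(values, true_string="True", false_string="False"):
    seen = set()
    flags = []
    for v in values:
        flags.append(true_string if v in seen else false_string)
        seen.add(v)
    return flags

cities = ["Austin", "Boise", "Austin", "Cody", "Boise"]
flags = flag_duplicates(cities)
```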

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FormatPhoneNumber class
<a name="aws-glue-api-pyspark-transforms-FormatPhoneNumber"></a>

The `FormatPhoneNumber` transform returns a column in which a phone number string is converted into a formatted value.

## Example
<a name="pyspark-FormatPhoneNumber-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        ("408-341-5669",),
        ("4083415669",)
    ],
    ["phone"],
)

try:
    df_output = column_formatting.FormatPhoneNumber.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="phone",
        default_region="US"
    )
    df_output.show()
except Exception:
    print("Unexpected error occurred")
    raise
```

## Output
<a name="pyspark-FormatPhoneNumber-output"></a>

The output will be:

```
+--------------+
|         phone|
+--------------+
|(408) 341-5669|
|(408) 341-5669|
+--------------+
```

The `FormatPhoneNumber` transformation takes `"phone"` as the `source_column` and `"US"` as the `default_region`.

The transformation formats both phone numbers, regardless of their initial format, into the standard US format `(408) 341-5669`.

## Methods
<a name="aws-glue-api-pyspark-transforms-FormatPhoneNumber-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FormatPhoneNumber-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, phone\_number\_format=None, default\_region=None, default\_region\_column=None)
<a name="aws-glue-api-pyspark-transforms-FormatPhoneNumber-__call__"></a>

The `FormatPhoneNumber` transform returns a column in which a phone number string is converted into a formatted value.
+ `source_column` – The name of an existing column.
+ `phone_number_format` – The format to convert the phone number to. If no format is specified, the default is `E.164`, an internationally-recognized standard phone number format. Valid values include the following: 
  + E164 (omit the period after E)
+ `default_region` – A valid region code consisting of two or three uppercase letters that specifies the region for the phone number when no country code is present in the number itself. At most, one of `defaultRegion` or `defaultRegionColumn` can be provided.
+ `default_region_column` – The name of a column of the advanced data type `Country`. The region code from the specified column is used to determine the country code for the phone number when no country code is present in the number itself. At most, one of `defaultRegion` or `defaultRegionColumn` can be provided.
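To illustrate the kind of normalization involved, here is a minimal plain-Python sketch of E.164 formatting for US numbers. This is not the AWS Glue implementation — production code should use a dedicated library such as `phonenumbers` — and the `to_e164` helper is hypothetical:

```python
import re

def to_e164(phone: str, default_region: str = "US") -> str:
    # Hypothetical helper: E.164 normalization for US numbers only.
    digits = re.sub(r"\D", "", phone)          # strip everything but digits
    if default_region == "US":
        if len(digits) == 10:                  # no country code present
            return "+1" + digits
        if len(digits) == 11 and digits.startswith("1"):
            return "+" + digits                # country code already present
    raise ValueError(f"cannot normalize {phone!r} for region {default_region!r}")

print(to_e164("408-341-5669"))   # +14083415669
print(to_e164("4083415669"))     # +14083415669
```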

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FormatCase class
<a name="aws-glue-api-pyspark-transforms-FormatCase"></a>

The `FormatCase` transform changes each string in a column to the specified case type.

## Example
<a name="pyspark-FormatCase-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

datasource1 = spark.read.json("s3://${BUCKET}/json/zips/raw/data")

try:
    df_output = data_cleaning.FormatCase.apply(
        data_frame=datasource1,
        spark_context=sc,
        source_column="city",
        case_type="LOWER"
    )    
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-FormatCase-output"></a>

The `FormatCase` transformation will convert the values in the `city` column to lowercase based on the `case_type="LOWER"` parameter. The resulting `df_output` DataFrame will contain all columns from the original `datasource1` DataFrame, but with the `city` column values in lowercase.

## Methods
<a name="aws-glue-api-pyspark-transforms-FormatCase-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FormatCase-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FormatCase-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FormatCase-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, case\_type)
<a name="aws-glue-api-pyspark-transforms-FormatCase-__call__"></a>

The `FormatCase` transform changes each string in a column to the specified case type.
+ `source_column` – The name of an existing column.
+ `case_type` – The case type to apply. Supported values are `CAPITAL`, `LOWER`, `UPPER`, and `SENTENCE`.
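
As a rough plain-Python illustration of the four case types (an assumption about their exact semantics, not the library implementation — in particular, `CAPITAL` is sketched here as title case and `SENTENCE` as capitalizing only the first character):

```python
def format_case(text: str, case_type: str) -> str:
    # Illustrative only; the exact AWS Glue semantics may differ.
    if case_type == "UPPER":
        return text.upper()
    if case_type == "LOWER":
        return text.lower()
    if case_type == "CAPITAL":     # sketched as title case
        return text.title()
    if case_type == "SENTENCE":    # capitalize only the first character
        return text[:1].upper() + text[1:].lower()
    raise ValueError(f"unsupported case_type: {case_type!r}")

print(format_case("NEW YORK", "LOWER"))    # new york
print(format_case("new york", "CAPITAL"))  # New York
```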

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FormatCase-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FillWithMode class
<a name="aws-glue-api-pyspark-transforms-FillWithMode"></a>

The `FillWithMode` transform fills null values in a column with the mode (the value that occurs most frequently) of that column's non-null values. You can also specify tie-breaker logic for cases where several values occur equally often. For example, consider the following values: `1 2 2 3 3 4`

A `mode_type` of `MINIMUM` causes `FillWithMode` to return 2 as the mode value. If `mode_type` is `MAXIMUM`, the mode is 3. For `AVERAGE`, the mode is 2.5.
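
The tie-breaker logic can be sketched in plain Python — an illustration of the documented behavior, not the library implementation:

```python
from collections import Counter

def mode_with_tiebreak(values, mode_type="MINIMUM"):
    # Count only non-null values, then resolve ties per mode_type.
    counts = Counter(v for v in values if v is not None)
    top = max(counts.values())
    tied = sorted(v for v, c in counts.items() if c == top)
    if mode_type == "MINIMUM":
        return tied[0]
    if mode_type == "MAXIMUM":
        return tied[-1]
    if mode_type == "AVERAGE":
        return sum(tied) / len(tied)
    raise ValueError(f"unsupported mode_type: {mode_type!r}")

values = [1, 2, 2, 3, 3, 4]   # 2 and 3 tie for the mode
print(mode_with_tiebreak(values, "MINIMUM"))  # 2
print(mode_with_tiebreak(values, "MAXIMUM"))  # 3
print(mode_with_tiebreak(values, "AVERAGE"))  # 2.5
```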

## Example
<a name="pyspark-FillWithMode-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (1055.123, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.FillWithMode.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_1",
        mode_type="MAXIMUM"
    )
    df_output.show()    
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-FillWithMode-output"></a>

 The output of the given code will be: 

```
+---------------+---------------+
|source_column_1|source_column_2|
+---------------+---------------+
|        105.111|          13.12|
|       1055.123|          13.12|
|       1055.123|          13.12|
|          13.12|          13.12|
|       1055.123|          13.12|
+---------------+---------------+
```

The `FillWithMode` transformation from the `data_quality` module is applied to the `input_df` DataFrame. It replaces the `null` values in the `source_column_1` column using `mode_type="MAXIMUM"`: because each non-null value occurs exactly once, they tie for the mode, and the tie is resolved by taking the maximum.

In this case, the maximum of the tied values in the `source_column_1` column is `1055.123`. Therefore, the `null` values in `source_column_1` are replaced by `1055.123` in the output DataFrame `df_output`.

## Methods
<a name="aws-glue-api-pyspark-transforms-FillWithMode-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FillWithMode-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, mode\_type)
<a name="aws-glue-api-pyspark-transforms-FillWithMode-__call__"></a>

The `FillWithMode` transform fills null values in a column with the mode of that column's non-null values.
+ `source_column` – The name of an existing column.
+ `mode_type` – How to resolve tie values in the data. This value must be one of `MINIMUM`, `NONE`, `AVERAGE`, or `MAXIMUM`. 

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FillWithMode-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FlagDuplicateRows class
<a name="aws-glue-api-pyspark-transforms-FlagDuplicateRows"></a>

The `FlagDuplicateRows` transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row. 

## Example
<a name="pyspark-FlagDuplicateRows-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.FlagDuplicateRows.apply(
        data_frame=input_df,
        spark_context=sc,
        target_column="flag_row",
        true_string="True",
        false_string="False",
        target_index=1
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-FlagDuplicateRows-output"></a>

The output will be a PySpark DataFrame with an additional column `flag_row` that indicates whether each row is an exact duplicate of an earlier row. The resulting `df_output` DataFrame will contain the following rows:

```
+---------------+---------------+--------+
|source_column_1|source_column_2|flag_row|
+---------------+---------------+--------+
|        105.111|          13.12|   False|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
+---------------+---------------+--------+
```

The `flag_row` column indicates whether a row is a duplicate. The `true_string` is set to "True", and the `false_string` is set to "False". The `target_index` is set to 1, which means that the `flag_row` column will be inserted at the second position (index 1) in the output DataFrame.

## Methods
<a name="aws-glue-api-pyspark-transforms-FlagDuplicateRows-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FlagDuplicateRows-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING, target\_index=None)
<a name="aws-glue-api-pyspark-transforms-FlagDuplicateRows-__call__"></a>

The `FlagDuplicateRows` transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row. 
+ `true_string` – Value to be inserted if the row matches an earlier row.
+ `false_string` – Value to be inserted if the row is unique. 
+ `target_column` – Name of the new column that is inserted in the dataset.
+ `target_index` – Index at which the new column is inserted.
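
The row-flagging behavior described above (the first occurrence unflagged, later exact matches flagged) can be sketched in plain Python — an illustration, not the library implementation:

```python
def flag_duplicate_rows(rows, true_string="True", false_string="False"):
    # Append a flag to each row: true_string if the row exactly matches
    # an earlier row, false_string otherwise.
    seen = set()
    flagged = []
    for row in rows:
        key = tuple(row)
        flagged.append(list(row) + [true_string if key in seen else false_string])
        seen.add(key)
    return flagged

for row in flag_duplicate_rows([(1, "a"), (2, "b"), (1, "a")]):
    print(row)
# [1, 'a', 'False']
# [2, 'b', 'False']
# [1, 'a', 'True']
```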

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# RemoveDuplicates class
<a name="aws-glue-api-pyspark-transforms-RemoveDuplicates"></a>

 The `RemoveDuplicates` transform deletes an entire row, if a duplicate value is encountered in a selected source column. 

## Example
<a name="pyspark-RemoveDuplicates-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.RemoveDuplicates.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_1"
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-RemoveDuplicates-output"></a>

The output will be a PySpark DataFrame with duplicates removed based on the `source_column_1` column. The resulting `df_output` DataFrame will contain the following rows:

```
+---------------+---------------+
|source_column_1|source_column_2|
+---------------+---------------+
|        105.111|          13.12|
|          13.12|          13.12|
|           null|          13.12|
+---------------+---------------+
```

 Note that the rows with `source_column_1` values of `13.12` and `null` appear only once in the output DataFrame, as the duplicates have been removed based on the `source_column_1` column. 

## Methods
<a name="aws-glue-api-pyspark-transforms-RemoveDuplicates-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-RemoveDuplicates-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column)
<a name="aws-glue-api-pyspark-transforms-RemoveDuplicates-__call__"></a>

 The `RemoveDuplicates` transform deletes an entire row, if a duplicate value is encountered in a selected source column. 
+ `source_column` – The name of an existing column.
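
Conceptually, the transform keeps the first row seen for each distinct value of the source column, as in this plain-Python sketch (illustrative, not the library implementation):

```python
def remove_duplicates(rows, key_index):
    # Keep only the first row for each distinct value in column key_index.
    seen = set()
    kept = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [(105.111, 13.12), (13.12, 13.12), (None, 13.12),
        (13.12, 13.12), (None, 13.12)]
print(remove_duplicates(rows, 0))
# [(105.111, 13.12), (13.12, 13.12), (None, 13.12)]
```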

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# MonthName class
<a name="aws-glue-api-pyspark-transforms-MonthName"></a>

 The `MonthName` transform creates a new column containing the name of the month, from a string that represents a date. 

## Example
<a name="pyspark-MonthName-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

input_df = spark.createDataFrame(
    [
        ("20-2018-12",),
        ("2018-20-12",),
        ("20182012",),
        ("12202018",),
        ("20122018",),
        ("20-12-2018",),
        ("12/20/2018",),
        ("02/02/02",),
        ("02 02 2009",),
        ("02/02/2009",),
        ("August/02/2009",),
        ("02/june/2009",),
        ("02/2020/june",),
        ("2013-02-21 06:35:45.658505",),
        ("August 02 2009",),
        ("2013/02/21",),
        (None,),
    ],
    ["column_1"],
)

try:
    df_output = datetime_functions.MonthName.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="column_1",
        target_column="target_column"
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-MonthName-output"></a>

 The output will be: 

```
+--------------------------+-------------+
|                  column_1|target_column|
+--------------------------+-------------+
|                20-2018-12|     December|
|                2018-20-12|         null|
|                  20182012|         null|
|                  12202018|         null|
|                  20122018|         null|
|                20-12-2018|     December|
|                12/20/2018|     December|
|                  02/02/02|     February|
|                02 02 2009|     February|
|                02/02/2009|     February|
|            August/02/2009|       August|
|              02/june/2009|         null|
|              02/2020/june|         null|
|2013-02-21 06:35:45.658505|     February|
|            August 02 2009|       August|
|                2013/02/21|     February|
|                      null|         null|
+--------------------------+-------------+
```

The `MonthName` transformation takes the `source_column` as `"column_1"` and the `target_column` as `"target_column"`. It attempts to extract the month name from the date/time strings in the `column_1` column and places it in the `target_column` column. If a date/time string is in an unrecognized format or cannot be parsed, the `target_column` value is set to `null`.

 The transformation successfully extracts the month name from various date/time formats, such as "20-12-2018", "12/20/2018", "02/02/2009", "2013-02-21 06:35:45.658505", and "August 02 2009". 

## Methods
<a name="aws-glue-api-pyspark-transforms-MonthName-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-MonthName-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-MonthName-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-MonthName-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-MonthName-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, value=None)
<a name="aws-glue-api-pyspark-transforms-MonthName-__call__"></a>

 The `MonthName` transform creates a new column containing the name of the month, from a string that represents a date. 
+ `source_column` – The name of an existing column.
+ `value` – A character string to evaluate.
+ `target_column` – A name for the newly created column.
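
Conceptually, the transform tries to parse each string as a date and emits the month name, falling back to `null` when parsing fails. Here is a minimal plain-Python sketch using a small assumed subset of formats (not the actual parser, which recognizes many more):

```python
from datetime import datetime

# Assumed subset of recognized formats, for illustration only.
FORMATS = ["%d-%m-%Y", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S.%f", "%B %d %Y"]

def month_name(value):
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%B")
        except (ValueError, TypeError):
            continue
    return None  # unparseable (or null) input maps to null

print(month_name("20-12-2018"))      # December
print(month_name("August 02 2009"))  # August
print(month_name("20182012"))        # None
```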

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-MonthName-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# IsEven class
<a name="aws-glue-api-pyspark-transforms-IsEven"></a>

 The `IsEven` transform returns a Boolean value in a new column that indicates whether the source column or value is even. If the source column or value is a decimal, the result is false. 

## Example
<a name="pyspark-IsEven-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-IsEven-output"></a>

 The output will be: 

```
+-------------+-------------+
|source_column|target_column|
+-------------+-------------+
|            5|     Not even|
|            0|         Even|
|           -1|     Not even|
|            2|         Even|
|         null|         null|
+-------------+-------------+
```

The `IsEven` transformation takes the `source_column` as `"source_column"` and the `target_column` as `"target_column"`. It checks whether the value in `source_column` is even. If the value is even, it sets the `target_column` value to the `true_string` "Even". If the value is odd, it sets the `target_column` value to the `false_string` "Not even". If the `source_column` value is `null`, the `target_column` value is set to `null`.

The transformation correctly identifies the even numbers (0 and 2) and sets the `target_column` value to "Even". For odd numbers (5 and -1), it sets the `target_column` value to "Not even". For the `null` value in `source_column`, the `target_column` value is set to `null`.

## Methods
<a name="aws-glue-api-pyspark-transforms-IsEven-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-IsEven-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-IsEven-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-IsEven-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-IsEven-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING, value=None)
<a name="aws-glue-api-pyspark-transforms-IsEven-__call__"></a>

 The `IsEven` transform returns a Boolean value in a new column that indicates whether the source column or value is even. If the source column or value is a decimal, the result is false. 
+ `source_column` – The name of an existing column.
+ `target_column` – The name of the new column to be created.
+ `true_string` – The string inserted when the value is even.
+ `false_string` – The string inserted when the value is not even.
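
The per-value logic can be sketched in plain Python (illustrative, not the library implementation):

```python
def is_even(value, true_string="Even", false_string="Not even"):
    if value is None:
        return None              # null input yields null output
    if value != int(value):      # decimals are never even
        return false_string
    return true_string if int(value) % 2 == 0 else false_string

for v in [5, 0, -1, 2, None]:
    print(v, "->", is_even(v))
# 5 -> Not even
# 0 -> Even
# -1 -> Not even
# 2 -> Even
# None -> None
```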

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IsEven-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# CryptographicHash class
<a name="aws-glue-api-pyspark-transforms-CryptographicHash"></a>

 The `CryptographicHash` transform applies an algorithm to hash values in the column. 

## Example
<a name="pyspark-CryptographicHash-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

secret = "${SECRET}"
sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (1, "1234560000"),
        (2, "1234560001"),
        (3, "1234560002"),
        (4, "1234560003"),
        (5, "1234560004"),
        (6, "1234560005"),
        (7, "1234560006"),
        (8, "1234560007"),
        (9, "1234560008"),
        (10, "1234560009"),
    ],
    ["id", "phone"],
)

try:
    df_output = pii.CryptographicHash.apply(
        data_frame=input_df,
        spark_context=sc,
        source_columns=["id", "phone"],
        secret_id=secret,
        algorithm="HMAC_SHA256",
        output_format="BASE64",
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-CryptographicHash-output"></a>

The output will be:

```
+---+----------+-----------------+-----------------+
| id|     phone|        id_hashed|     phone_hashed|
+---+----------+-----------------+-----------------+
|  1|1234560000|QUI1zXTJiXmfIb...| juDBAmiRnnO3g...|
|  2|1234560001| ZAUWiZ3dVTzCo...| vC8lgUqBVDMNQ...|
|  3|1234560002| ZP4VvZWkqYifu...| Kl3QAkgswYpzB...|
|  4|1234560003| 3u8vO3wQ8EQfj...| CPBzK1P8PZZkV...|
|  5|1234560004| eWkQJk4zAOIzx...| aLf7+mHcXqbLs...|
|  6|1234560005| xtI9fZCJZCvsa...| dy2DFgdYWmr0p...|
|  7|1234560006| iW9hew7jnHuOf...| wwfGMCOEv6oOv...|
|  8|1234560007| H9V1pqvgkFhfS...| g9WKhagIXy9ht...|
|  9|1234560008| xDhEuHaxAUbU5...| b3uQLKPY+Q5vU...|
| 10|1234560009| GRN6nFXkxk349...| VJdsKt8VbxBbt...|
+---+----------+-----------------+-----------------+
```

 The transformation computes the cryptographic hashes of the values in the `id` and `phone` columns using the specified algorithm and secret key, and encodes the hashes in Base64 format. The resulting `df_output` DataFrame contains all columns from the original `input_df` DataFrame, plus the additional `id_hashed` and `phone_hashed` columns with the computed hashes. 
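
What the transform computes for each cell can be sketched with the Python standard library. `hmac_sha256_b64` is a hypothetical helper, and the literal secret stands in for the key that the transform fetches from Secrets Manager:

```python
import base64
import hashlib
import hmac

def hmac_sha256_b64(secret: str, value: str) -> str:
    """Compute an HMAC-SHA256 digest of value keyed by secret, Base64-encoded --
    the same shape of output as algorithm="HMAC_SHA256", output_format="BASE64"."""
    digest = hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")

hashed = hmac_sha256_b64("my-secret-key", "1234560000")
print(len(hashed))  # → 44 (a 32-byte SHA-256 digest Base64-encodes to 44 characters)
```

Because the digest is keyed, the same input always produces the same hash under the same secret, which is what makes the hashed columns joinable across datasets.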

## Methods
<a name="aws-glue-api-pyspark-transforms-CryptographicHash-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-CryptographicHash-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_columns, secret\_id, algorithm=None, secret\_version=None, create\_secret\_if\_missing=False, output\_format=None, entity\_type\_filter=None)
<a name="aws-glue-api-pyspark-transforms-CryptographicHash-__call__"></a>

 The `CryptographicHash` transform applies an algorithm to hash values in the column. 
+ `source_columns` – An array of existing columns.
+ `secret_id` – The ARN of the Secrets Manager secret that holds the key used by the hash-based message authentication code (HMAC) algorithms to hash the source columns.
+ `secret_version` – Optional. Defaults to the latest version of the secret.
+ `entity_type_filter` – Optional array of entity types. Can be used to hash only detected PII in a free-text column.
+ `create_secret_if_missing` – Optional Boolean. If true, the transform attempts to create the secret on behalf of the caller.
+ `algorithm` – The algorithm used to hash your data. Valid enum values: MD5, SHA1, SHA256, SHA512, HMAC\_MD5, HMAC\_SHA1, HMAC\_SHA256, HMAC\_SHA512.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Decrypt class
<a name="aws-glue-api-pyspark-transforms-Decrypt"></a>

The `Decrypt` transform decrypts encrypted columns inside of AWS Glue. Your data can also be decrypted outside of AWS Glue with the AWS Encryption SDK. If the provided KMS key ARN does not match the key that was used to encrypt the column, the decrypt operation fails.

## Example
<a name="pyspark-Decrypt-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

kms = "${KMS}"
sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (1, "1234560000"),
        (2, "1234560001"),
        (3, "1234560002"),
        (4, "1234560003"),
        (5, "1234560004"),
        (6, "1234560005"),
        (7, "1234560006"),
        (8, "1234560007"),
        (9, "1234560008"),
        (10, "1234560009"),
    ],
    ["id", "phone"],
)

try:
    df_encrypt = pii.Encrypt.apply(
        data_frame=input_df,
        spark_context=sc,
        source_columns=["phone"],
        kms_key_arn=kms
    )
    df_decrypt = pii.Decrypt.apply(
        data_frame=df_encrypt,
        spark_context=sc,
        source_columns=["phone"],
        kms_key_arn=kms
    )
    df_decrypt.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-Decrypt-output"></a>

 The output will be a PySpark DataFrame with the original `id` column and the decrypted `phone` column: 

```
+---+----------+
| id|     phone|
+---+----------+
|  1|1234560000|
|  2|1234560001|
|  3|1234560002|
|  4|1234560003|
|  5|1234560004|
|  6|1234560005|
|  7|1234560006|
|  8|1234560007|
|  9|1234560008|
| 10|1234560009|
+---+----------+
```

 The `Encrypt` transform takes the `source_columns` as `["phone"]` and the `kms_key_arn` as the value of the `kms` variable, which holds the `${KMS}` placeholder. The transformation encrypts the values in the `phone` column using the specified KMS key. The encrypted DataFrame `df_encrypt` is then passed to the `Decrypt` transform from the `pii` module, again with `source_columns` as `["phone"]` and the same `kms_key_arn`. The transformation decrypts the encrypted values in the `phone` column using the same KMS key. The resulting `df_decrypt` DataFrame contains the original `id` column and the decrypted `phone` column. 

## Methods
<a name="aws-glue-api-pyspark-transforms-Decrypt-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-Decrypt-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Decrypt-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Decrypt-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_columns, kms\_key\_arn)
<a name="aws-glue-api-pyspark-transforms-Decrypt-__call__"></a>

The `Decrypt` transform decrypts encrypted columns inside of AWS Glue. Your data can also be decrypted outside of AWS Glue with the AWS Encryption SDK. If the provided KMS key ARN does not match the key that was used to encrypt the column, the decrypt operation fails.
+ `source_columns` – An array of existing columns.
+ `kms_key_arn` – The key ARN of the AWS Key Management Service key to use to decrypt the source columns.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Decrypt-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Encrypt class
<a name="aws-glue-api-pyspark-transforms-Encrypt"></a>

 The `Encrypt` transform encrypts source columns using an AWS Key Management Service (AWS KMS) key. The `Encrypt` transform can encrypt up to 128 MiB per cell. It attempts to preserve the format on decryption. To preserve the data type, the data type metadata must serialize to less than 1 KB; otherwise, you must set the `preserve_data_type` parameter to false. The data type metadata is stored in plaintext in the encryption context. 

## Example
<a name="pyspark-Encrypt-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

kms = "${KMS}"
sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (1, "1234560000"),
        (2, "1234560001"),
        (3, "1234560002"),
        (4, "1234560003"),
        (5, "1234560004"),
        (6, "1234560005"),
        (7, "1234560006"),
        (8, "1234560007"),
        (9, "1234560008"),
        (10, "1234560009"),
    ],
    ["id", "phone"],
)

try:
    df_encrypt = pii.Encrypt.apply(
        data_frame=input_df,
        spark_context=sc,
        source_columns=["phone"],
        kms_key_arn=kms
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-Encrypt-output"></a>

 The output will be a PySpark DataFrame with the original `id` column and an additional column containing the encrypted values of the `phone` column. 

```
+---+----------+-----------------------+
| id|     phone|        phone_encrypted|
+---+----------+-----------------------+
|  1|1234560000|EncryptedData1234...abc|
|  2|1234560001|EncryptedData5678...def|
|  3|1234560002|EncryptedData9012...ghi|
|  4|1234560003|EncryptedData3456...jkl|
|  5|1234560004|EncryptedData7890...mno|
|  6|1234560005|EncryptedData1234...pqr|
|  7|1234560006|EncryptedData5678...stu|
|  8|1234560007|EncryptedData9012...vwx|
|  9|1234560008|EncryptedData3456...yz0|
| 10|1234560009|EncryptedData7890...123|
+---+----------+-----------------------+
```

 The `Encrypt` transform takes the `source_columns` as `["phone"]` and the `kms_key_arn` as the value of the `kms` variable, which holds the `${KMS}` placeholder. The transformation encrypts the values in the `phone` column using the specified KMS key. The resulting `df_encrypt` DataFrame contains the original `id` column, the original `phone` column, and an additional `phone_encrypted` column containing the encrypted values of the `phone` column. 

## Methods
<a name="aws-glue-api-pyspark-transforms-Encrypt-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-Encrypt-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Encrypt-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Encrypt-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_columns, kms\_key\_arn, entity\_type\_filter=None, preserve\_data\_type=None)
<a name="aws-glue-api-pyspark-transforms-Encrypt-__call__"></a>

 The `Encrypt` transform encrypts source columns using the AWS Key Management Service key. 
+ `source_columns` – An array of existing columns.
+ `kms_key_arn` – The key ARN of the AWS Key Management Service key to use to encrypt the source columns.
+ `entity_type_filter` – Optional array of entity types. Can be used to encrypt only detected PII in a free-text column.
+ `preserve_data_type` – Optional Boolean. Defaults to true. If false, the data type is not stored.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-Encrypt-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# IntToIp class
<a name="aws-glue-api-pyspark-transforms-IntToIp"></a>

The `IntToIp` transform converts the integer value of the source column (or a supplied value) to the corresponding IPv4 address, and returns the result in a new target column.

## Example
<a name="pyspark-IntToIp-examples"></a>

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (3221225473,),
        (0,),
        (1,),
        (100,),
        (168430090,),
        (4294967295,),
        (4294967294,),
        (4294967296,),
        (-1,),
        (None,),
    ],
    ["source_column_int"],
)

try:
    df_output = web_functions.IntToIp.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_int",
        target_column="target_column",
        value=None
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-IntToIp-output"></a>

The output will be:

```
+-----------------+---------------+
|source_column_int|  target_column|
+-----------------+---------------+
|       3221225473|      192.0.0.1|
|                0|        0.0.0.0|
|                1|        0.0.0.1|
|              100|      0.0.0.100|
|        168430090|    10.10.10.10|
|       4294967295|255.255.255.255|
|       4294967294|255.255.255.254|
|       4294967296|           null|
|               -1|           null|
|             null|           null|
+-----------------+---------------+
```

 The `IntToIp.apply` transformation takes the `source_column` as `"source_column_int"` and the `target_column` as `"target_column"`, converts the integer values in the `source_column_int` column to their corresponding IPv4 address representation, and stores the result in the `target_column` column. 

 For valid integer values within the range of IPv4 addresses (0 to 4294967295), the transformation successfully converts them to their IPv4 address representation (for example, 192.0.0.1, 0.0.0.0, 10.10.10.10, 255.255.255.255). 

 For integer values outside the valid range (for example, 4294967296 and -1), the `target_column` value is set to `null`. For `null` values in the `source_column_int` column, the `target_column` value is also set to `null`. 
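
The conversion and its null handling can be sketched with the standard `ipaddress` module. `int_to_ip` is a hypothetical helper for illustration, not part of the AWS Glue API:

```python
import ipaddress

def int_to_ip(value):
    """Return the dotted-quad IPv4 string for an integer in [0, 2**32 - 1];
    return None for out-of-range integers or None input, as described above."""
    if value is None or not 0 <= value <= 0xFFFFFFFF:
        return None
    return str(ipaddress.IPv4Address(value))

print(int_to_ip(3221225473))  # → 192.0.0.1
print(int_to_ip(4294967296))  # → None (one past the IPv4 maximum)
```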

## Methods
<a name="aws-glue-api-pyspark-transforms-IntToIp-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-IntToIp-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-IntToIp-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-IntToIp-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, value=None)
<a name="aws-glue-api-pyspark-transforms-IntToIp-__call__"></a>

The `IntToIp` transform converts the integer value of the source column (or a supplied value) to the corresponding IPv4 address, and returns the result in a new target column.
+ `source_column` – The name of an existing column.
+ `value` – A character string to evaluate.
+ `target_column` – The name of the new column to be created.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IntToIp-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# IpToInt class
<a name="aws-glue-api-pyspark-transforms-IpToInt"></a>

The `IpToInt` transform converts the Internet Protocol version 4 (IPv4) value of the source column or other value to the corresponding integer value in the target column, and returns the result in a new column. 

## Example
<a name="pyspark-IpToInt-examples"></a>

 For AWS Glue 4.0 and above, create or update job arguments with `key: --enable-glue-di-transforms, value: true` 

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        ("192.0.0.1",),
        ("10.10.10.10",),
        ("1.2.3.4",),
        ("1.2.3.6",),
        ("http://12.13.14.15",),
        ("https://16.17.18.19",),
        ("1.2.3.4",),
        (None,),
        ("abc",),
        ("abc.abc.abc.abc",),
        ("321.123.123.123",),
        ("244.4.4.4",),
        ("255.255.255.255",),
    ],
    ["source_column_ip"],
)

try:
    df_output = web_functions.IpToInt.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_ip",
        target_column="target_column",
        value=None
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output
<a name="pyspark-IpToInt-output"></a>

 The output will be: 

```
+-------------------+-------------+
|   source_column_ip|target_column|
+-------------------+-------------+
|          192.0.0.1|   3221225473|
|        10.10.10.10|    168430090|
|            1.2.3.4|     16909060|
|            1.2.3.6|     16909062|
| http://12.13.14.15|         null|
|https://16.17.18.19|         null|
|            1.2.3.4|     16909060|
|               null|         null|
|                abc|         null|
|    abc.abc.abc.abc|         null|
|    321.123.123.123|         null|
|          244.4.4.4|   4093903876|
|    255.255.255.255|   4294967295|
+-------------------+-------------+
```

 The `IpToInt` transformation takes the `source_column` as `"source_column_ip"` and the `target_column` as `"target_column"`, converts the valid IPv4 address strings in the `source_column_ip` column to their corresponding 32-bit integer representation, and stores the result in the `target_column` column. 

 For valid IPv4 address strings (for example, "192.0.0.1", "10.10.10.10", "1.2.3.4"), the transformation successfully converts them to their integer representation (3221225473, 168430090, 16909060). For strings that are not valid IPv4 addresses (URLs, non-IP strings like "abc", invalid IP formats like "abc.abc.abc.abc"), the `target_column` value is set to `null`. For `null` values in the `source_column_ip` column, the `target_column` value is also set to `null`. 
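
The parsing and null handling can be sketched with the standard `ipaddress` module. `ip_to_int` is a hypothetical helper for illustration, not part of the AWS Glue API:

```python
import ipaddress

def ip_to_int(value):
    """Return the 32-bit integer for a valid IPv4 dotted-quad string;
    return None for invalid strings or None input, as described above."""
    if value is None:
        return None
    try:
        return int(ipaddress.IPv4Address(value))
    except ipaddress.AddressValueError:
        # Covers URLs, non-numeric strings, and out-of-range octets
        return None

print(ip_to_int("192.0.0.1"))  # → 3221225473
print(ip_to_int("abc"))        # → None
```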

## Methods
<a name="aws-glue-api-pyspark-transforms-IpToInt-_methods"></a>
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-IpToInt-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-IpToInt-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-IpToInt-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, value=None)
<a name="aws-glue-api-pyspark-transforms-IpToInt-__call__"></a>

 The `IpToInt` transform converts the Internet Protocol version 4 (IPv4) value of the source column or other value to the corresponding integer value in the target column, and returns the result in a new column. 
+ `source_column` – The name of an existing column.
+ `value` – A character string to evaluate.
+ `target_column` – The name of the new column to be created.

## apply(cls, \*args, \*\*kwargs)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-apply"></a>

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-name"></a>

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-describeArgs"></a>

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-describeReturn"></a>

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-describeTransform"></a>

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-describeErrors"></a>

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)
<a name="aws-glue-api-crawler-pyspark-transforms-IpToInt-describe"></a>

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

## Data integration transforms
<a name="aws-glue-programming-python-di-transforms"></a>

 For AWS Glue 4.0 and above, create or update job arguments with `key: --enable-glue-di-transforms, value: true`. 

 Example job script: 

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

 Example: AWS Glue interactive sessions using notebooks 

```
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%region eu-west-1
```

```
%%configure
{
    "--enable-glue-di-transforms": "true"
}
```

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

 Example: creating a session using the AWS CLI 

```
aws glue create-session --default-arguments "--enable-glue-di-transforms=true"
```

 DI transforms: 
+  [FlagDuplicatesInColumn class](aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn.md) 
+  [FormatPhoneNumber class](aws-glue-api-pyspark-transforms-FormatPhoneNumber.md) 
+  [FormatCase class](aws-glue-api-pyspark-transforms-FormatCase.md) 
+  [FillWithMode class](aws-glue-api-pyspark-transforms-FillWithMode.md) 
+  [FlagDuplicateRows class](aws-glue-api-pyspark-transforms-FlagDuplicateRows.md) 
+  [RemoveDuplicates class](aws-glue-api-pyspark-transforms-RemoveDuplicates.md) 
+  [MonthName class](aws-glue-api-pyspark-transforms-MonthName.md) 
+  [IsEven class](aws-glue-api-pyspark-transforms-IsEven.md) 
+  [CryptographicHash class](aws-glue-api-pyspark-transforms-CryptographicHash.md) 
+  [Decrypt class](aws-glue-api-pyspark-transforms-Decrypt.md) 
+  [Encrypt class](aws-glue-api-pyspark-transforms-Encrypt.md) 
+  [IntToIp class](aws-glue-api-pyspark-transforms-IntToIp.md) 
+  [IpToInt class](aws-glue-api-pyspark-transforms-IpToInt.md) 

### Maven: Bundle the plugin with your Spark applications
<a name="aws-glue-programming-python-di-transforms-maven"></a>

You can bundle the transforms dependency with your Spark applications and Spark distributions (version 3.3) by adding the plugin dependency to your Maven `pom.xml` while developing your Spark applications locally.

```
<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>AWSGlueTransforms</artifactId>
    <version>4.0.0</version>
</dependency>
```

Alternatively, you can download the binaries directly from the AWS Glue Maven artifacts and include them in your Spark application as follows.

```
#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/AWSGlueTransforms/4.0.0/AWSGlueTransforms-4.0.0.jar -P /usr/lib/spark/jars/
```

# Programming AWS Glue ETL scripts in Scala
<a name="aws-glue-programming-scala"></a>

You can find Scala code examples and utilities for AWS Glue in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

AWS Glue supports an extension of the PySpark Scala dialect for scripting extract, transform, and load (ETL) jobs. The following sections describe how to use the AWS Glue Scala library and the AWS Glue API in ETL scripts, and provide reference documentation for the library.

**Contents**
+ [Using Scala](glue-etl-scala-using.md)
  + [Testing on a DevEndpoint notebook](glue-etl-scala-using.md#aws-glue-programming-scala-using-notebook)
  + [Testing on a DevEndpoint REPL](glue-etl-scala-using.md#aws-glue-programming-scala-using-repl)
+ [Scala script example](glue-etl-scala-example.md)
+ [Scala API list](glue-etl-scala-apis.md)
  + [com.amazonaws.services.glue](glue-etl-scala-apis.md#glue-etl-scala-apis-glue)
  + [com.amazonaws.services.glue.ml](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-ml)
  + [com.amazonaws.services.glue.dq](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-dq)
  + [com.amazonaws.services.glue.types](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-types)
  + [com.amazonaws.services.glue.util](glue-etl-scala-apis.md#glue-etl-scala-apis-glue-util)
  + [ChoiceOption](glue-etl-scala-apis-glue-choiceoption.md)
    + [ChoiceOption trait](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoption-trait)
    + [ChoiceOption object](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoption-object)
      + [Apply](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoption-object-def-apply)
    + [ChoiceOptionWithResolver](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-choiceoptionwithresolver-case-class)
    + [MatchCatalogSchemaChoiceOption](glue-etl-scala-apis-glue-choiceoption.md#glue-etl-scala-apis-glue-matchcatalogschemachoiceoption-case-class)
  + [DataSink](glue-etl-scala-apis-glue-datasink-class.md)
    + [writeDynamicFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-writeDynamicFrame)
    + [pyWriteDynamicFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDynamicFrame)
    + [writeDataFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-writeDataFrame)
    + [pyWriteDataFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDataFrame)
    + [setCatalogInfo](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-setCatalogInfo)
    + [supportsFormat](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-supportsFormat)
    + [setFormat](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-setFormat)
    + [withFormat](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-withFormat)
    + [setAccumulableSize](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-setAccumulableSize)
    + [getOutputErrorRecordsAccumulable](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-getOutputErrorRecordsAccumulable)
    + [errorsAsDynamicFrame](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-class-defs-errorsAsDynamicFrame)
    + [DataSink object](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-object)
      + [recordMetrics](glue-etl-scala-apis-glue-datasink-class.md#glue-etl-scala-apis-glue-datasink-object-defs-recordMetrics)
  + [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md)
  + [DynamicFrame](glue-etl-scala-apis-glue-dynamicframe.md)
    + [DynamicFrame class](glue-etl-scala-apis-glue-dynamicframe-class.md)
      + [errorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-vals-errorsCount)
      + [applyMapping](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping)
      + [assertErrorThreshold](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-assertErrorThreshold)
      + [Count](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-count)
      + [dropField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropField)
      + [dropFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropFields)
      + [dropNulls](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropNulls)
      + [errorsAsDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame)
      + [Filter](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-filter)
      + [getName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getName)
      + [getNumPartitions](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getNumPartitions)
      + [getSchemaIfComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getSchemaIfComputed)
      + [isSchemaComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-isSchemaComputed)
      + [javaToPython](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-javaToPython)
      + [Join](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-join)
      + [Map](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-map)
      + [mergeDynamicFrames](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-merge)
      + [printSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-printSchema)
      + [recomputeSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema)
      + [Relationalize](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-relationalize)
      + [renameField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-renameField)
      + [Repartition](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition)
      + [resolveChoice](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice)
      + [Schema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-schema)
      + [selectField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectField)
      + [selectFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectFields)
      + [Show](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-show)
      + [SimplifyDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-simplifyDDBJson)
      + [Spigot](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-spigot)
      + [splitFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitFields)
      + [Def splitRows](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitRows)
      + [stageErrorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-stageErrorsCount)
      + [toDF](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-toDF)
      + [Unbox](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unbox)
      + [Unnest](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest)
      + [unnestDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson)
      + [withFrameSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withFrameSchema)
      + [Def withName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withName)
      + [withTransformationContext](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withTransformationContext)
    + [DynamicFrame object](glue-etl-scala-apis-glue-dynamicframe-object.md)
      + [Def apply](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-apply)
      + [Def emptyDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-emptyDynamicFrame)
      + [Def fromPythonRDD](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-fromPythonRDD)
      + [Def ignoreErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-ignoreErrors)
      + [Def inlineErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-inlineErrors)
      + [Def newFrameWithErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-newFrameWithErrors)
  + [DynamicRecord](glue-etl-scala-apis-glue-dynamicrecord-class.md)
    + [addField](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-addField)
    + [dropField](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-dropField)
    + [setError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-setError)
    + [isError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-isError)
    + [getError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getError)
    + [clearError](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clearError)
    + [Write](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-write)
    + [readFields](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-readFields)
    + [Clone](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clone)
    + [Schema](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-schema)
    + [getRoot](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getRoot)
    + [toJson](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-toJson)
    + [getFieldNode](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getFieldNode)
    + [getField](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getField)
    + [hashCode](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-hashCode)
    + [Equals](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-class-defs-equals)
    + [DynamicRecord object](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-object)
      + [Apply](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-dynamicrecord-object-defs-apply)
    + [RecordTraverser trait](glue-etl-scala-apis-glue-dynamicrecord-class.md#glue-etl-scala-apis-glue-recordtraverser-trait)
  + [GlueContext](glue-etl-scala-apis-glue-gluecontext.md)
    + [addIngestionTimeColumns](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-addIngestionTimeColumns)
    + [createDataFrameFromOptions](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions)
    + [forEachBatch](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-forEachBatch)
    + [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink)
    + [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource)
    + [getJDBCSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getJDBCSink)
    + [getSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSink)
    + [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat)
    + [getSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSource)
    + [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat)
    + [getSparkSession](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSparkSession)
    + [startTransaction](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-start-transaction)
    + [commitTransaction](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-commit-transaction)
    + [cancelTransaction](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-cancel-transaction)
    + [this](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-this-1)
    + [this](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-this-2)
    + [this](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-this-3)
  + [MappingSpec](glue-etl-scala-apis-glue-mappingspec.md)
    + [MappingSpec case class](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-case-class)
    + [MappingSpec object](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object)
    + [orderingByTarget](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-val-orderingbytarget)
    + [Apply](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-defs-apply-1)
    + [Apply](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-defs-apply-2)
    + [Apply](glue-etl-scala-apis-glue-mappingspec.md#glue-etl-scala-apis-glue-mappingspec-object-defs-apply-3)
  + [ResolveSpec](glue-etl-scala-apis-glue-resolvespec.md)
    + [ResolveSpec object](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-object)
      + [Def](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-object-def-apply_1)
      + [Def](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-object-def-apply_2)
    + [ResolveSpec case class](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-case-class)
      + [Def methods](glue-etl-scala-apis-glue-resolvespec.md#glue-etl-scala-apis-glue-resolvespec-case-class-defs)
  + [ArrayNode](glue-etl-scala-apis-glue-types-arraynode.md)
    + [ArrayNode case class](glue-etl-scala-apis-glue-types-arraynode.md#glue-etl-scala-apis-glue-types-arraynode-case-class)
      + [Def methods](glue-etl-scala-apis-glue-types-arraynode.md#glue-etl-scala-apis-glue-types-arraynode-case-class-defs)
  + [BinaryNode](glue-etl-scala-apis-glue-types-binarynode.md)
    + [BinaryNode case class](glue-etl-scala-apis-glue-types-binarynode.md#glue-etl-scala-apis-glue-types-binarynode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-binarynode.md#glue-etl-scala-apis-glue-types-binarynode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-binarynode.md#glue-etl-scala-apis-glue-types-binarynode-case-class-defs)
  + [BooleanNode](glue-etl-scala-apis-glue-types-booleannode.md)
    + [BooleanNode case class](glue-etl-scala-apis-glue-types-booleannode.md#glue-etl-scala-apis-glue-types-booleannode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-booleannode.md#glue-etl-scala-apis-glue-types-booleannode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-booleannode.md#glue-etl-scala-apis-glue-types-booleannode-case-class-defs)
  + [ByteNode](glue-etl-scala-apis-glue-types-bytenode.md)
    + [ByteNode case class](glue-etl-scala-apis-glue-types-bytenode.md#glue-etl-scala-apis-glue-types-bytenode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-bytenode.md#glue-etl-scala-apis-glue-types-bytenode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-bytenode.md#glue-etl-scala-apis-glue-types-bytenode-case-class-defs)
  + [DateNode](glue-etl-scala-apis-glue-types-datenode.md)
    + [DateNode case class](glue-etl-scala-apis-glue-types-datenode.md#glue-etl-scala-apis-glue-types-datenode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-datenode.md#glue-etl-scala-apis-glue-types-datenode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-datenode.md#glue-etl-scala-apis-glue-types-datenode-case-class-defs)
  + [DecimalNode](glue-etl-scala-apis-glue-types-decimalnode.md)
    + [DecimalNode case class](glue-etl-scala-apis-glue-types-decimalnode.md#glue-etl-scala-apis-glue-types-decimalnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-decimalnode.md#glue-etl-scala-apis-glue-types-decimalnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-decimalnode.md#glue-etl-scala-apis-glue-types-decimalnode-case-class-defs)
  + [DoubleNode](glue-etl-scala-apis-glue-types-doublenode.md)
    + [DoubleNode case class](glue-etl-scala-apis-glue-types-doublenode.md#glue-etl-scala-apis-glue-types-doublenode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-doublenode.md#glue-etl-scala-apis-glue-types-doublenode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-doublenode.md#glue-etl-scala-apis-glue-types-doublenode-case-class-defs)
  + [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md)
    + [DynamicNode class](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-class)
      + [Def methods](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-class-defs)
    + [DynamicNode object](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-object)
      + [Def methods](glue-etl-scala-apis-glue-types-dynamicnode.md#glue-etl-scala-apis-glue-types-dynamicnode-object-defs)
  + [EvaluateDataQuality](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md)
    + [apply](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md#glue-etl-scala-apis-glue-dq-EvaluateDataQuality-defs-apply)
    + [Example](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md#glue-etl-scala-apis-glue-dq-EvaluateDataQuality-example)
  + [FloatNode](glue-etl-scala-apis-glue-types-floatnode.md)
    + [FloatNode case class](glue-etl-scala-apis-glue-types-floatnode.md#glue-etl-scala-apis-glue-types-floatnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-floatnode.md#glue-etl-scala-apis-glue-types-floatnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-floatnode.md#glue-etl-scala-apis-glue-types-floatnode-case-class-defs)
  + [FillMissingValues](glue-etl-scala-apis-glue-ml-fillmissingvalues.md)
    + [Apply](glue-etl-scala-apis-glue-ml-fillmissingvalues.md#glue-etl-scala-apis-glue-ml-fillmissingvalues-defs-apply)
  + [FindMatches](glue-etl-scala-apis-glue-ml-findmatches.md)
    + [Apply](glue-etl-scala-apis-glue-ml-findmatches.md#glue-etl-scala-apis-glue-ml-findmatches-defs-apply)
  + [FindIncrementalMatches](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)
    + [Apply](glue-etl-scala-apis-glue-ml-findincrementalmatches.md#glue-etl-scala-apis-glue-ml-findincrementalmatches-defs-apply)
  + [IntegerNode](glue-etl-scala-apis-glue-types-integernode.md)
    + [IntegerNode case class](glue-etl-scala-apis-glue-types-integernode.md#glue-etl-scala-apis-glue-types-integernode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-integernode.md#glue-etl-scala-apis-glue-types-integernode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-integernode.md#glue-etl-scala-apis-glue-types-integernode-case-class-defs)
  + [LongNode](glue-etl-scala-apis-glue-types-longnode.md)
    + [LongNode case class](glue-etl-scala-apis-glue-types-longnode.md#glue-etl-scala-apis-glue-types-longnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-longnode.md#glue-etl-scala-apis-glue-types-longnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-longnode.md#glue-etl-scala-apis-glue-types-longnode-case-class-defs)
  + [MapLikeNode](glue-etl-scala-apis-glue-types-maplikenode.md)
    + [MapLikeNode class](glue-etl-scala-apis-glue-types-maplikenode.md#glue-etl-scala-apis-glue-types-maplikenode-class)
      + [Def methods](glue-etl-scala-apis-glue-types-maplikenode.md#glue-etl-scala-apis-glue-types-maplikenode-class-defs)
  + [MapNode](glue-etl-scala-apis-glue-types-mapnode.md)
    + [MapNode case class](glue-etl-scala-apis-glue-types-mapnode.md#glue-etl-scala-apis-glue-types-mapnode-case-class)
      + [Def methods](glue-etl-scala-apis-glue-types-mapnode.md#glue-etl-scala-apis-glue-types-mapnode-case-class-defs)
  + [NullNode](glue-etl-scala-apis-glue-types-nullnode.md)
    + [NullNode class](glue-etl-scala-apis-glue-types-nullnode.md#glue-etl-scala-apis-glue-types-nullnode-class)
    + [NullNode case object](glue-etl-scala-apis-glue-types-nullnode.md#glue-etl-scala-apis-glue-types-nullnode-case-object)
  + [ObjectNode](glue-etl-scala-apis-glue-types-objectnode.md)
    + [ObjectNode object](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-object)
      + [Def methods](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-object-defs)
    + [ObjectNode case class](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-case-class)
      + [Def methods](glue-etl-scala-apis-glue-types-objectnode.md#glue-etl-scala-apis-glue-types-objectnode-case-class-defs)
  + [ScalarNode](glue-etl-scala-apis-glue-types-scalarnode.md)
    + [ScalarNode class](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-class)
      + [Def methods](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-class-defs)
    + [ScalarNode object](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-object)
      + [Def methods](glue-etl-scala-apis-glue-types-scalarnode.md#glue-etl-scala-apis-glue-types-scalarnode-object-defs)
  + [ShortNode](glue-etl-scala-apis-glue-types-shortnode.md)
    + [ShortNode case class](glue-etl-scala-apis-glue-types-shortnode.md#glue-etl-scala-apis-glue-types-shortnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-shortnode.md#glue-etl-scala-apis-glue-types-shortnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-shortnode.md#glue-etl-scala-apis-glue-types-shortnode-case-class-defs)
  + [StringNode](glue-etl-scala-apis-glue-types-stringnode.md)
    + [StringNode case class](glue-etl-scala-apis-glue-types-stringnode.md#glue-etl-scala-apis-glue-types-stringnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-stringnode.md#glue-etl-scala-apis-glue-types-stringnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-stringnode.md#glue-etl-scala-apis-glue-types-stringnode-case-class-defs)
  + [TimestampNode](glue-etl-scala-apis-glue-types-timestampnode.md)
    + [TimestampNode case class](glue-etl-scala-apis-glue-types-timestampnode.md#glue-etl-scala-apis-glue-types-timestampnode-case-class)
      + [Val fields](glue-etl-scala-apis-glue-types-timestampnode.md#glue-etl-scala-apis-glue-types-timestampnode-case-class-vals)
      + [Def methods](glue-etl-scala-apis-glue-types-timestampnode.md#glue-etl-scala-apis-glue-types-timestampnode-case-class-defs)
  + [GlueArgParser](glue-etl-scala-apis-glue-util-glueargparser.md)
    + [GlueArgParser object](glue-etl-scala-apis-glue-util-glueargparser.md#glue-etl-scala-apis-glue-util-glueargparser-object)
      + [Def methods](glue-etl-scala-apis-glue-util-glueargparser.md#glue-etl-scala-apis-glue-util-glueargparser-object-defs)
  + [Job](glue-etl-scala-apis-glue-util-job.md)
    + [Job object](glue-etl-scala-apis-glue-util-job.md#glue-etl-scala-apis-glue-util-job-object)
      + [Def methods](glue-etl-scala-apis-glue-util-job.md#glue-etl-scala-apis-glue-util-job-object-defs)

# Using Scala to program AWS Glue ETL scripts
<a name="glue-etl-scala-using"></a>

You can automatically generate a Scala extract, transform, and load (ETL) program using the AWS Glue console, and modify it as needed before assigning it to a job. Or, you can write your own program from scratch. For more information, see [Configuring job properties for Spark jobs in AWS Glue](add-job.md). AWS Glue then compiles your Scala program on the server before running the associated job.

To ensure that your program compiles without errors and runs as expected, it's important that you load it on a development endpoint in a REPL (Read-Eval-Print Loop) or a Jupyter Notebook and test it there before running it in a job. Because the compile process occurs on the server, you will not have good visibility into any problems that happen there.

## Testing a Scala ETL program in a Jupyter notebook on a development endpoint
<a name="aws-glue-programming-scala-using-notebook"></a>

To test a Scala program on an AWS Glue development endpoint, set up the development endpoint as described in [Adding a development endpoint](add-dev-endpoint.md).

Next, connect it to a Jupyter Notebook that is either running locally on your machine or remotely on an Amazon EC2 notebook server. To install a local version of a Jupyter Notebook, follow the instructions in [Tutorial: Jupyter notebook in JupyterLab](dev-endpoint-tutorial-local-jupyter.md).

The only difference between running Scala code and running PySpark code in your notebook is that you should start each paragraph in the notebook with the following:

```
%spark
```

This prevents the Notebook server from defaulting to the PySpark flavor of the Spark interpreter.

## Testing a Scala ETL program in a Scala REPL
<a name="aws-glue-programming-scala-using-repl"></a>

You can test a Scala program on a development endpoint using the AWS Glue Scala REPL. Follow the instructions in [Tutorial: Use a REPL shell](dev-endpoint-tutorial-repl.md), except at the end of the SSH-to-REPL command, replace `-t gluepyspark` with `-t glue-spark-shell`. This invokes the AWS Glue Scala REPL.

To close the REPL when you are finished, type `sys.exit`.

# Scala script example - streaming ETL
<a name="glue-etl-scala-example"></a>

**Example**  
The following example script connects to Amazon Kinesis Data Streams, uses a schema from the Data Catalog to parse a data stream, joins the stream to a static dataset on Amazon S3, and outputs the joined results to Amazon S3 in Parquet format.  

```
// This script connects to an Amazon Kinesis stream, uses a schema from the data catalog to parse the stream,
// joins the stream to a static dataset on Amazon S3, and outputs the joined results to Amazon S3 in parquet format.
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import java.util.Calendar
import org.apache.spark.SparkContext
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import scala.collection.JavaConverters._

object streamJoiner {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val staticData = sparkSession.read          // read() returns type DataFrameReader
      .format("csv")
      .option("header", "true")
      .load("s3://amzn-s3-demo-bucket/inputs/productsStatic.csv")  // load() returns a DataFrame

    val datasource0 = sparkSession.readStream   // readStream returns type DataStreamReader
      .format("kinesis")
      .option("streamName", "stream-join-demo")
      .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
      .option("startingPosition", "TRIM_HORIZON")
      .load                                     // load() returns a DataFrame

    val selectfields1 = datasource0.select(from_json($"data".cast("string"), glueContext.getCatalogSchemaAsSparkSchema("stream-demos", "stream-join-demo2")) as "data").select("data.*")

    val datasink2 = selectfields1.writeStream.foreachBatch { (dataFrame: Dataset[Row], batchId: Long) => {   //foreachBatch() returns type DataStreamWriter
      val joined = dataFrame.join(staticData, "product_id")
      val year: Int = Calendar.getInstance().get(Calendar.YEAR)
      val month :Int = Calendar.getInstance().get(Calendar.MONTH) + 1
      val day: Int = Calendar.getInstance().get(Calendar.DATE)
      val hour: Int = Calendar.getInstance().get(Calendar.HOUR_OF_DAY)

      if (dataFrame.count() > 0) {
        joined.write                           // joined.write returns type DataFrameWriter
          .mode(SaveMode.Append)
          .format("parquet")
          .option("quote", " ")
          .save("s3://amzn-s3-demo-bucket/output/" + "/year=" + "%04d".format(year) + "/month=" + "%02d".format(month) + "/day=" + "%02d".format(day) + "/hour=" + "%02d".format(hour) + "/")
      }
    }
    }  // end foreachBatch()
      .trigger(Trigger.ProcessingTime("100 seconds"))
      .option("checkpointLocation", "s3://amzn-s3-demo-bucket/checkpoint/")
      .start().awaitTermination()              // start() returns type StreamingQuery
    Job.commit()
  }
}
```

# APIs in the AWS Glue Scala library
<a name="glue-etl-scala-apis"></a>

AWS Glue supports an extension of the PySpark Scala dialect for scripting extract, transform, and load (ETL) jobs. The following sections describe the APIs in the AWS Glue Scala library.

## com.amazonaws.services.glue
<a name="glue-etl-scala-apis-glue"></a>

The **com.amazonaws.services.glue** package in the AWS Glue Scala library contains the following APIs:
+ [ChoiceOption](glue-etl-scala-apis-glue-choiceoption.md)
+ [DataSink](glue-etl-scala-apis-glue-datasink-class.md)
+ [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md)
+ [DynamicFrame](glue-etl-scala-apis-glue-dynamicframe.md)
+ [DynamicRecord](glue-etl-scala-apis-glue-dynamicrecord-class.md)
+ [GlueContext](glue-etl-scala-apis-glue-gluecontext.md)
+ [MappingSpec](glue-etl-scala-apis-glue-mappingspec.md)
+ [ResolveSpec](glue-etl-scala-apis-glue-resolvespec.md)

## com.amazonaws.services.glue.ml
<a name="glue-etl-scala-apis-glue-ml"></a>

The **com.amazonaws.services.glue.ml** package in the AWS Glue Scala library contains the following APIs:
+ [FillMissingValues](glue-etl-scala-apis-glue-ml-fillmissingvalues.md)
+ [FindIncrementalMatches](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)
+ [FindMatches](glue-etl-scala-apis-glue-ml-findmatches.md)

## com.amazonaws.services.glue.dq
<a name="glue-etl-scala-apis-glue-dq"></a>

The **com.amazonaws.services.glue.dq** package in the AWS Glue Scala library contains the following APIs:
+ [EvaluateDataQuality](glue-etl-scala-apis-glue-dq-EvaluateDataQuality.md)

## com.amazonaws.services.glue.types
<a name="glue-etl-scala-apis-glue-types"></a>

The **com.amazonaws.services.glue.types** package in the AWS Glue Scala library contains the following APIs:
+ [ArrayNode](glue-etl-scala-apis-glue-types-arraynode.md)
+ [BinaryNode](glue-etl-scala-apis-glue-types-binarynode.md)
+ [BooleanNode](glue-etl-scala-apis-glue-types-booleannode.md)
+ [ByteNode](glue-etl-scala-apis-glue-types-bytenode.md)
+ [DateNode](glue-etl-scala-apis-glue-types-datenode.md)
+ [DecimalNode](glue-etl-scala-apis-glue-types-decimalnode.md)
+ [DoubleNode](glue-etl-scala-apis-glue-types-doublenode.md)
+ [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md)
+ [FloatNode](glue-etl-scala-apis-glue-types-floatnode.md)
+ [IntegerNode](glue-etl-scala-apis-glue-types-integernode.md)
+ [LongNode](glue-etl-scala-apis-glue-types-longnode.md)
+ [MapLikeNode](glue-etl-scala-apis-glue-types-maplikenode.md)
+ [MapNode](glue-etl-scala-apis-glue-types-mapnode.md)
+ [NullNode](glue-etl-scala-apis-glue-types-nullnode.md)
+ [ObjectNode](glue-etl-scala-apis-glue-types-objectnode.md)
+ [ScalarNode](glue-etl-scala-apis-glue-types-scalarnode.md)
+ [ShortNode](glue-etl-scala-apis-glue-types-shortnode.md)
+ [StringNode](glue-etl-scala-apis-glue-types-stringnode.md)
+ [TimestampNode](glue-etl-scala-apis-glue-types-timestampnode.md)

## com.amazonaws.services.glue.util
<a name="glue-etl-scala-apis-glue-util"></a>

The **com.amazonaws.services.glue.util** package in the AWS Glue Scala library contains the following APIs:
+ [GlueArgParser](glue-etl-scala-apis-glue-util-glueargparser.md)
+ [Job](glue-etl-scala-apis-glue-util-job.md)

# AWS Glue Scala ChoiceOption APIs
<a name="glue-etl-scala-apis-glue-choiceoption"></a>

**Topics**
+ [ChoiceOption trait](#glue-etl-scala-apis-glue-choiceoption-trait)
+ [ChoiceOption object](#glue-etl-scala-apis-glue-choiceoption-object)
+ [Case class ChoiceOptionWithResolver](#glue-etl-scala-apis-glue-choiceoptionwithresolver-case-class)
+ [Case class MatchCatalogSchemaChoiceOption](#glue-etl-scala-apis-glue-matchcatalogschemachoiceoption-case-class)

**Package: com.amazonaws.services.glue**

## ChoiceOption trait
<a name="glue-etl-scala-apis-glue-choiceoption-trait"></a>

```
trait ChoiceOption extends Serializable 
```

## ChoiceOption object
<a name="glue-etl-scala-apis-glue-choiceoption-object"></a>


```
object ChoiceOption
```

A general strategy to resolve choice applicable to all `ChoiceType` nodes in a `DynamicFrame`.
+ `val CAST`
+ `val MAKE_COLS`
+ `val MAKE_STRUCT`
+ `val MATCH_CATALOG`
+ `val PROJECT`

### Def apply
<a name="glue-etl-scala-apis-glue-choiceoption-object-def-apply"></a>

```
def apply(choice: String): ChoiceOption
```
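
The constants above name the resolution strategies that `DynamicFrame.resolveChoice` accepts. The following sketch shows how a `ChoiceOption` might be applied; the parameter name `choiceOption` and the `"make_struct"` choice string are illustrative and assume a `DynamicFrame` named `df` already exists in a Glue job.

```
// Sketch: resolve every ChoiceType column in df by converting ambiguous
// values into a struct with one field per observed type.
val resolved = df.resolveChoice(choiceOption = Some(ChoiceOption("make_struct")))
```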



## Case class ChoiceOptionWithResolver
<a name="glue-etl-scala-apis-glue-choiceoptionwithresolver-case-class"></a>

```
case class ChoiceOptionWithResolver(name: String, choiceResolver: ChoiceResolver) extends ChoiceOption {}
```



## Case class MatchCatalogSchemaChoiceOption
<a name="glue-etl-scala-apis-glue-matchcatalogschemachoiceoption-case-class"></a>

```
case class MatchCatalogSchemaChoiceOption() extends ChoiceOption {}
```



# Abstract DataSink class
<a name="glue-etl-scala-apis-glue-datasink-class"></a>

**Topics**
+ [Def writeDynamicFrame](#glue-etl-scala-apis-glue-datasink-class-defs-writeDynamicFrame)
+ [Def pyWriteDynamicFrame](#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDynamicFrame)
+ [Def writeDataFrame](#glue-etl-scala-apis-glue-datasink-class-defs-writeDataFrame)
+ [Def pyWriteDataFrame](#glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDataFrame)
+ [Def setCatalogInfo](#glue-etl-scala-apis-glue-datasink-class-defs-setCatalogInfo)
+ [Def supportsFormat](#glue-etl-scala-apis-glue-datasink-class-defs-supportsFormat)
+ [Def setFormat](#glue-etl-scala-apis-glue-datasink-class-defs-setFormat)
+ [Def withFormat](#glue-etl-scala-apis-glue-datasink-class-defs-withFormat)
+ [Def setAccumulableSize](#glue-etl-scala-apis-glue-datasink-class-defs-setAccumulableSize)
+ [Def getOutputErrorRecordsAccumulable](#glue-etl-scala-apis-glue-datasink-class-defs-getOutputErrorRecordsAccumulable)
+ [Def errorsAsDynamicFrame](#glue-etl-scala-apis-glue-datasink-class-defs-errorsAsDynamicFrame)
+ [DataSink object](#glue-etl-scala-apis-glue-datasink-object)

**Package: com.amazonaws.services.glue**

```
abstract class DataSink
```

The writer analog to a `DataSource`. `DataSink` encapsulates a destination and a format that a `DynamicFrame` can be written to.
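
A `DataSink` is typically obtained from a `GlueContext` factory method such as `getSinkWithFormat` rather than constructed directly. The following is a minimal sketch; the bucket path and option JSON are illustrative, and it assumes `glueContext` and a `DynamicFrame` named `frame` already exist in the job.

```
// Sketch: write a DynamicFrame to Amazon S3 as Parquet through a DataSink.
val sink = glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/output/"}"""),
  format = "parquet")
sink.writeDynamicFrame(frame)
```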

## Def writeDynamicFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-writeDynamicFrame"></a>

```
def writeDynamicFrame( frame : DynamicFrame,
                       callSite : CallSite = CallSite("Not provided", "")
                     ) : DynamicFrame
```



## Def pyWriteDynamicFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDynamicFrame"></a>

```
def pyWriteDynamicFrame( frame : DynamicFrame,
                         site : String = "Not provided",
                         info : String = "" )
```



## Def writeDataFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-writeDataFrame"></a>

```
def writeDataFrame(frame: DataFrame,
                   glueContext: GlueContext,
                   callSite: CallSite = CallSite("Not provided", "")
                   ): DataFrame
```



## Def pyWriteDataFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-pyWriteDataFrame"></a>

```
def pyWriteDataFrame(frame: DataFrame,
                     glueContext: GlueContext,
                     site: String = "Not provided",
                     info: String = ""
                     ): DataFrame
```



## Def setCatalogInfo
<a name="glue-etl-scala-apis-glue-datasink-class-defs-setCatalogInfo"></a>

```
def setCatalogInfo(catalogDatabase: String, 
                   catalogTableName : String, 
                   catalogId : String = "")
```



## Def supportsFormat
<a name="glue-etl-scala-apis-glue-datasink-class-defs-supportsFormat"></a>

```
def supportsFormat( format : String ) : Boolean
```



## Def setFormat
<a name="glue-etl-scala-apis-glue-datasink-class-defs-setFormat"></a>

```
def setFormat( format : String,
               options : JsonOptions
             ) : Unit
```



## Def withFormat
<a name="glue-etl-scala-apis-glue-datasink-class-defs-withFormat"></a>

```
def withFormat( format : String,
                options : JsonOptions = JsonOptions.empty
              ) : DataSink
```



## Def setAccumulableSize
<a name="glue-etl-scala-apis-glue-datasink-class-defs-setAccumulableSize"></a>

```
def setAccumulableSize( size : Int ) : Unit
```



## Def getOutputErrorRecordsAccumulable
<a name="glue-etl-scala-apis-glue-datasink-class-defs-getOutputErrorRecordsAccumulable"></a>

```
def getOutputErrorRecordsAccumulable : Accumulable[List[OutputError], OutputError]
```



## Def errorsAsDynamicFrame
<a name="glue-etl-scala-apis-glue-datasink-class-defs-errorsAsDynamicFrame"></a>

```
def errorsAsDynamicFrame : DynamicFrame
```



## DataSink object
<a name="glue-etl-scala-apis-glue-datasink-object"></a>

```
object DataSink
```

### Def recordMetrics
<a name="glue-etl-scala-apis-glue-datasink-object-defs-recordMetrics"></a>

```
def recordMetrics( frame : DynamicFrame,
                   ctxt : String
                 ) : DynamicFrame
```



# AWS Glue Scala DataSource trait
<a name="glue-etl-scala-apis-glue-datasource-trait"></a>

**Package: com.amazonaws.services.glue**

A high-level interface for producing a `DynamicFrame`.

```
trait DataSource {

  def getDynamicFrame : DynamicFrame 

  def getDynamicFrame( minPartitions : Int,
                       targetPartitions : Int
                     ) : DynamicFrame 
  def getDataFrame : DataFrame
					 
  /** @param num: the number of records for sampling.
    * @param options: optional parameters to control sampling behavior. Current available parameter for Amazon S3 sources in options:
    *  1. maxSamplePartitions: the maximum number of partitions the sampling will read. 
    *  2. maxSampleFilesPerPartition: the maximum number of files the sampling will read in one partition.
    */
  def getSampleDynamicFrame(num:Int, options: JsonOptions = JsonOptions.empty): DynamicFrame 

  def glueContext : GlueContext

  def setFormat( format : String,
                 options : String
               ) : Unit 

  def setFormat( format : String,
                 options : JsonOptions
               ) : Unit

  def supportsFormat( format : String ) : Boolean

  def withFormat( format : String,
                  options : JsonOptions = JsonOptions.empty
                ) : DataSource 
}
```
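
In practice, `DataSource` instances come from `GlueContext` factory methods such as `getSourceWithFormat`. A sketch follows; the S3 path and option JSON are illustrative, and it assumes a `glueContext` exists in the job.

```
// Sketch: build a DataSource over JSON files in S3, then materialize
// a DynamicFrame from it.
val source = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"paths": ["s3://amzn-s3-demo-bucket/inputs/"]}"""),
  format = "json")
val dynf = source.getDynamicFrame
```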

# AWS Glue Scala DynamicFrame APIs
<a name="glue-etl-scala-apis-glue-dynamicframe"></a>

**Package: com.amazonaws.services.glue**

**Contents**
+ [AWS Glue Scala DynamicFrame class](glue-etl-scala-apis-glue-dynamicframe-class.md)
  + [Val errorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-vals-errorsCount)
  + [Def applyMapping](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping)
  + [Def assertErrorThreshold](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-assertErrorThreshold)
  + [Def count](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-count)
  + [Def dropField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropField)
  + [Def dropFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropFields)
  + [Def dropNulls](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-dropNulls)
  + [Def errorsAsDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame)
  + [Def filter](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-filter)
  + [Def getName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getName)
  + [Def getNumPartitions](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getNumPartitions)
  + [Def getSchemaIfComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-getSchemaIfComputed)
  + [Def isSchemaComputed](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-isSchemaComputed)
  + [Def javaToPython](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-javaToPython)
  + [Def join](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-join)
  + [Def map](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-map)
  + [Def mergeDynamicFrames](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-merge)
  + [Def printSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-printSchema)
  + [Def recomputeSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema)
  + [Def relationalize](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-relationalize)
  + [Def renameField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-renameField)
  + [Def repartition](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition)
  + [Def resolveChoice](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice)
  + [Def schema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-schema)
  + [Def selectField](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectField)
  + [Def selectFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-selectFields)
  + [Def show](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-show)
  + [Def simplifyDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-simplifyDDBJson)
  + [Def spigot](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-spigot)
  + [Def splitFields](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitFields)
  + [Def splitRows](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-splitRows)
  + [Def stageErrorsCount](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-stageErrorsCount)
  + [Def toDF](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-toDF)
  + [Def unbox](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unbox)
  + [Def unnest](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest)
  + [Def unnestDDBJson](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson)
  + [Def withFrameSchema](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withFrameSchema)
  + [Def withName](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withName)
  + [Def withTransformationContext](glue-etl-scala-apis-glue-dynamicframe-class.md#glue-etl-scala-apis-glue-dynamicframe-class-defs-withTransformationContext)
+ [The DynamicFrame object](glue-etl-scala-apis-glue-dynamicframe-object.md)
  + [Def apply](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-apply)
  + [Def emptyDynamicFrame](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-emptyDynamicFrame)
  + [Def fromPythonRDD](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-fromPythonRDD)
  + [Def ignoreErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-ignoreErrors)
  + [Def inlineErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-inlineErrors)
  + [Def newFrameWithErrors](glue-etl-scala-apis-glue-dynamicframe-object.md#glue-etl-scala-apis-glue-dynamicframe-object-defs-newFrameWithErrors)

# AWS Glue Scala DynamicFrame class
<a name="glue-etl-scala-apis-glue-dynamicframe-class"></a>

**Package: com.amazonaws.services.glue**

```
class DynamicFrame( val glueContext : GlueContext,
           _records : RDD[DynamicRecord],
           val name : String = "",
           val transformationContext : String = DynamicFrame.UNDEFINED,
           callSite : CallSite = CallSite("Not provided", ""),
           stageThreshold : Long = 0,
           totalThreshold : Long = 0,
           prevErrors : => Long = 0,
           errorExpr : => Unit = {}
         ) extends Serializable with Logging
```

A `DynamicFrame` is a distributed collection of self-describing [DynamicRecord](glue-etl-scala-apis-glue-dynamicrecord-class.md) objects.

`DynamicFrame`s are designed to provide a flexible data model for ETL (extract, transform, and load) operations. They don't require a schema to create, and you can use them to read and transform data that contains messy or inconsistent values and types. A schema can be computed on demand for those operations that need one.

`DynamicFrame`s provide a range of transformations for data cleaning and ETL. They also support conversion to and from SparkSQL DataFrames to integrate with existing code and the many analytics operations that DataFrames provide.

The following parameters are shared across many of the AWS Glue transformations that construct `DynamicFrame`s:
+ `transformationContext` — The identifier for this `DynamicFrame`. The `transformationContext` is used as a key for job bookmark state that is persisted across runs.
+ `callSite` — Provides context information for error reporting. These values are automatically set when calling from Python.
+ `stageThreshold` — The maximum number of error records that are allowed from the computation of this `DynamicFrame` before throwing an exception, excluding records that are present in the previous `DynamicFrame`.
+ `totalThreshold` — The maximum number of total error records before an exception is thrown, including those from previous frames.
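
As noted above, `DynamicFrame`s convert to and from Spark SQL `DataFrame`s. The following sketch shows a round trip; it assumes a `DynamicFrame` named `dynf` and a `glueContext` exist in the job, and that the `DynamicFrame` object's `apply` accepts a `DataFrame` and a `GlueContext`.

```
// Sketch: drop into the DataFrame API for a transformation, then return
// to a DynamicFrame.
val df = dynf.toDF()                            // Spark SQL DataFrame
val filtered = df.where(df("age") > 21)         // any DataFrame operation
val back = DynamicFrame(filtered, glueContext)  // back to a DynamicFrame
```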

## Val errorsCount
<a name="glue-etl-scala-apis-glue-dynamicframe-class-vals-errorsCount"></a>

```
val errorsCount
```

The number of error records in this `DynamicFrame`. This includes errors from previous operations.

## Def applyMapping
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping"></a>

```
def applyMapping( mappings : Seq[Product4[String, String, String, String]],
                  caseSensitive : Boolean = true,
                  transformationContext : String = "",
                  callSite : CallSite = CallSite("Not provided", ""),
                  stageThreshold : Long = 0,
                  totalThreshold : Long = 0
                ) : DynamicFrame
```
+ `mappings` — A sequence of mappings to construct a new `DynamicFrame`.
+ `caseSensitive` — Whether to treat source columns as case sensitive. Setting this to false might help when integrating with case-insensitive stores like the AWS Glue Data Catalog.

Selects, projects, and casts columns based on a sequence of mappings.

Each mapping is made up of a source column and type and a target column and type. Mappings can be specified as either a four-tuple (`source_path`, `source_type`, `target_path`, `target_type`) or a [MappingSpec](glue-etl-scala-apis-glue-mappingspec.md) object containing the same information.

In addition to using mappings for simple projections and casting, you can use them to nest or unnest fields by separating components of the path with '`.`' (period). 

For example, suppose that you have a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: struct
|    |-- state: string
|    |-- zip: int
```

You can make the following call to unnest the `state` and `zip` fields.

```
df.applyMapping(
  Seq(("name", "string", "name", "string"),
      ("age", "int", "age", "int"),
      ("address.state", "string", "state", "string"),
      ("address.zip", "int", "zip", "int")))
```

The resulting schema is as follows.

```
root
|-- name: string
|-- age: int
|-- state: string
|-- zip: int
```

You can also use `applyMapping` to re-nest columns. For example, the following inverts the previous transformation and creates a struct named `address` in the target.

```
df.applyMapping(
  Seq(("name", "string", "name", "string"),
      ("age", "int", "age", "int"),
      ("state", "string", "address.state", "string"),
      ("zip", "int", "address.zip", "int")))
```

Field names that contain '`.`' (period) characters can be quoted by using backticks (`` ` ``).

**Note**  
Currently, you can't use the `applyMapping` method to map columns that are nested under arrays.

## Def assertErrorThreshold
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-assertErrorThreshold"></a>

```
def assertErrorThreshold : Unit
```

An action that forces computation and verifies that the number of error records falls below `stageThreshold` and `totalThreshold`. Throws an exception if either condition fails.

## Def count
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-count"></a>

```
lazy def count
```

Returns the number of elements in this `DynamicFrame`.

## Def dropField
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-dropField"></a>

```
def dropField( path : String,
               transformationContext : String = "",
               callSite : CallSite = CallSite("Not provided", ""),
               stageThreshold : Long = 0,
               totalThreshold : Long = 0
             ) : DynamicFrame
```

Returns a new `DynamicFrame` with the specified column removed.

## Def dropFields
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-dropFields"></a>

```
def dropFields( fieldNames : Seq[String],   // The column names to drop.
                transformationContext : String = "",
                callSite : CallSite = CallSite("Not provided", ""),
                stageThreshold : Long = 0,
                totalThreshold : Long = 0
              ) : DynamicFrame
```

Returns a new `DynamicFrame` with the specified columns removed.

You can use this method to delete nested columns, including those inside of arrays, but not to drop specific array elements.
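
For example, the following sketch drops both a top-level column and a nested one in a single call; the field names are illustrative and assume a `DynamicFrame` named `df` with the schema shown in the `applyMapping` example above.

```
// Sketch: remove the top-level age column and the nested zip field.
val trimmed = df.dropFields(Seq("age", "address.zip"))
```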

## Def dropNulls
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-dropNulls"></a>

```
def dropNulls( transformationContext : String = "",
               callSite : CallSite = CallSite("Not provided", ""),
               stageThreshold : Long = 0,
               totalThreshold : Long = 0 )
```

Returns a new `DynamicFrame` with all null columns removed.

**Note**  
This only removes columns of type `NullType`. Individual null values in other columns are not removed or modified.

## Def errorsAsDynamicFrame
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame"></a>

```
def errorsAsDynamicFrame
```

Returns a new `DynamicFrame` containing the error records from this `DynamicFrame`.

## Def filter
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-filter"></a>

```
def filter( f : DynamicRecord => Boolean,
            errorMsg : String = "",
            transformationContext : String = "",
            callSite : CallSite = CallSite("Not provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame
```

Constructs a new `DynamicFrame` containing only those records for which the function '`f`' returns `true`. The filter function '`f`' should not mutate the input record.
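
As a sketch, the predicate below keeps only records whose `age` field is at least 18. It assumes `DynamicRecord.getField` returns an `Option[Any]`; the field name and the cast are illustrative.

```
// Sketch: filter on a numeric field without mutating the input records.
val adults = df.filter(rec =>
  rec.getField("age").exists {
    case n: Int => n >= 18
    case _      => false
  })
```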

## Def getName
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-getName"></a>

```
def getName : String 
```

Returns the name of this `DynamicFrame`.

## Def getNumPartitions
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-getNumPartitions"></a>

```
def getNumPartitions
```

Returns the number of partitions in this `DynamicFrame`.

## Def getSchemaIfComputed
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-getSchemaIfComputed"></a>

```
def getSchemaIfComputed : Option[Schema] 
```

Returns the schema if it has already been computed. Does not scan the data if the schema has not already been computed.

## Def isSchemaComputed
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-isSchemaComputed"></a>

```
def isSchemaComputed : Boolean 
```

Returns `true` if the schema has been computed for this `DynamicFrame`, or `false` if not. If this method returns false, then calling the `schema` method requires another pass over the records in this `DynamicFrame`.

## Def javaToPython
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-javaToPython"></a>

```
def javaToPython : JavaRDD[Array[Byte]] 
```



## Def join
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-join"></a>

```
def join( keys1 : Seq[String],
          keys2 : Seq[String],
          frame2 : DynamicFrame,
          transformationContext : String = "",
          callSite : CallSite = CallSite("Not provided", ""),
          stageThreshold : Long = 0,
          totalThreshold : Long = 0
        ) : DynamicFrame
```
+ `keys1` — The columns in this `DynamicFrame` to use for the join.
+ `keys2` — The columns in `frame2` to use for the join. Must be the same length as `keys1`.
+ `frame2` — The `DynamicFrame` to join against.

Returns the result of performing an equijoin with `frame2` using the specified keys.
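
A sketch of a typical call follows; the frame and key names are illustrative. Note that the join keys can have different names in the two frames, as long as the two sequences line up positionally.

```
// Sketch: equijoin orders to customers, matching orders.customer_id
// to customers.id.
val joined = ordersFrame.join(Seq("customer_id"), Seq("id"), customersFrame)
```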

## Def map
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-map"></a>

```
def map( f : DynamicRecord => DynamicRecord,
         errorMsg : String = "",
         transformationContext : String = "",
         callSite : CallSite = CallSite("Not provided", ""),
         stageThreshold : Long = 0,
         totalThreshold : Long = 0
       ) : DynamicFrame
```

Returns a new `DynamicFrame` constructed by applying the specified function '`f`' to each record in this `DynamicFrame`.

This method copies each record before applying the specified function, so it is safe to mutate the records. If the mapping function throws an exception on a given record, that record is marked as an error, and the stack trace is saved as a column in the error record.
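
The following sketch derives a new field on every record. The `addField` call and `StringNode` constructor assume the shape of the `com.amazonaws.services.glue.types` API; treat both as illustrative.

```
import com.amazonaws.services.glue.types.StringNode

// Sketch: tag each record with its source system. Mutation is safe here
// because map copies each record before applying the function.
val tagged = df.map(rec => {
  rec.addField("source_system", new StringNode("legacy-crm"))
  rec
})
```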

## Def mergeDynamicFrames
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-merge"></a>

```
def mergeDynamicFrames( stageDynamicFrame: DynamicFrame,  primaryKeys: Seq[String], transformationContext: String = "",
                         options: JsonOptions = JsonOptions.empty, callSite: CallSite = CallSite("Not provided"),
                         stageThreshold: Long = 0, totalThreshold: Long = 0): DynamicFrame
```
+ `stageDynamicFrame` — The staging `DynamicFrame` to merge.
+ `primaryKeys` — The list of primary key fields to match records from the source and staging `DynamicFrame`s.
+ `transformationContext` — A unique string that is used to retrieve metadata about the current transformation (optional).
+ `options` — A string of JSON name-value pairs that provide additional information for this transformation.
+ `callSite` — Used to provide context information for error reporting.
+ `stageThreshold` — A `Long`. The number of errors in the given transformation for which the processing needs to error out.
+ `totalThreshold` — A `Long`. The total number of errors up to and including in this transformation for which the processing needs to error out.

Merges this `DynamicFrame` with a staging `DynamicFrame` based on the specified primary keys to identify records. Duplicate records (records with the same primary keys) are not de-duplicated. If there is no matching record in the staging frame, all records (including duplicates) are retained from the source. If the staging frame has matching records, the records from the staging frame overwrite the records in the source in AWS Glue.

The returned `DynamicFrame` contains record A in the following cases:

1. If `A` exists in both the source frame and the staging frame, then `A` in the staging frame is returned.

1. If `A` is in the source table and `A.primaryKeys` is not in the `stageDynamicFrame` (that is, `A` is not updated in the staging table).

The source frame and staging frame do not need to have the same schema.

**Example**  

```
val mergedFrame: DynamicFrame = srcFrame.mergeDynamicFrames(stageFrame, Seq("id1", "id2"))
```

## Def printSchema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-printSchema"></a>

```
def printSchema : Unit 
```

Prints the schema of this `DynamicFrame` to `stdout` in a human-readable format.

## Def recomputeSchema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema"></a>

```
def recomputeSchema : Schema 
```

Forces a schema recomputation. This requires a scan over the data, but it might "tighten" the schema if there are some fields in the current schema that are not present in the data.

Returns the recomputed schema.

## Def relationalize
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-relationalize"></a>

```
def relationalize( rootTableName : String,
                   stagingPath : String,
                   options : JsonOptions = JsonOptions.empty,
                   transformationContext : String = "",
                   callSite : CallSite = CallSite("Not provided"),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : Seq[DynamicFrame]
```
+ `rootTableName` — The name to use for the base `DynamicFrame` in the output. `DynamicFrame`s that are created by pivoting arrays start with this as a prefix.
+ `stagingPath` — The Amazon Simple Storage Service (Amazon S3) path for writing intermediate data.
+ `options` — Relationalize options and configuration. Currently unused.

Flattens all nested structures and pivots arrays into separate tables.

You can use this operation to prepare deeply nested data for ingestion into a relational database. Nested structs are flattened in the same manner as the [Unnest](#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest) transform. Additionally, arrays are pivoted into separate tables with each array element becoming a row. For example, suppose that you have a `DynamicFrame` with the following data.

```
 {"name": "Nancy", "age": 47, "friends": ["Fred", "Lakshmi"]}
 {"name": "Stephanie", "age": 28, "friends": ["Yao", "Phil", "Alvin"]}
 {"name": "Nathan", "age": 54, "friends": ["Nicolai", "Karen"]}
```

Run the following code.

```
df.relationalize("people", "s3://my_bucket/my_path", JsonOptions.empty)
```

This produces two tables. The first table is named "people" and contains the following.

```
{"name": "Nancy", "age": 47, "friends": 1}
{"name": "Stephanie", "age": 28, "friends": 2}
{"name": "Nathan", "age": 54, "friends": 3}
```

Here, the friends array has been replaced with an auto-generated join key. A separate table named `people.friends` is created with the following content.

```
{"id": 1, "index": 0, "val": "Fred"}
{"id": 1, "index": 1, "val": "Lakshmi"}
{"id": 2, "index": 0, "val": "Yao"}
{"id": 2, "index": 1, "val": "Phil"}
{"id": 2, "index": 2, "val": "Alvin"}
{"id": 3, "index": 0, "val": "Nicolai"}
{"id": 3, "index": 1, "val": "Karen"}
```

In this table, '`id`' is a join key that identifies which record the array element came from, '`index`' refers to the position in the original array, and '`val`' is the actual array entry.

The `relationalize` method returns the sequence of `DynamicFrame`s created by applying this process recursively to all arrays.

**Note**  
The AWS Glue library automatically generates join keys for new tables. To ensure that join keys are unique across job runs, you must enable job bookmarks.
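
The join-key mechanics above can be sketched in plain Scala. This is a self-contained illustration only; the map-based record representation and the sequential key generation here are assumptions for the example, not the library's internals:

```scala
// Sketch of relationalize-style array pivoting: each record's array is
// replaced by a generated join key, and the elements move to a child
// table keyed by (id, index).
case class ChildRow(id: Int, index: Int, value: String)

def pivotArrays(records: Seq[Map[String, Any]], arrayField: String)
    : (Seq[Map[String, Any]], Seq[ChildRow]) = {
  val pairs = records.zipWithIndex.map { case (rec, i) =>
    val id = i + 1 // auto-generated join key
    val elems = rec(arrayField).asInstanceOf[Seq[String]]
    val parent = rec.updated(arrayField, id)
    val children = elems.zipWithIndex.map { case (v, idx) => ChildRow(id, idx, v) }
    (parent, children)
  }
  (pairs.map(_._1), pairs.flatMap(_._2))
}

val people = Seq(
  Map("name" -> "Nancy", "friends" -> Seq("Fred", "Lakshmi")),
  Map("name" -> "Stephanie", "friends" -> Seq("Yao", "Phil", "Alvin")))
val (parents, friends) = pivotArrays(people, "friends")
```

Running this against the sample data produces a parent table whose `friends` column holds join keys and a child table with one row per array element, mirroring the `people` and `people.friends` tables shown earlier.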

## Def renameField
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-renameField"></a>

```
def renameField( oldName : String,
                 newName : String,
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : DynamicFrame
```
+ `oldName` — The original name of the column.
+ `newName` — The new name of the column.

Returns a new `DynamicFrame` with the specified field renamed.

You can use this method to rename nested fields. For example, the following code would rename `state` to `state_code` inside the address struct.

```
df.renameField("address.state", "address.state_code")
```

## Def repartition
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition"></a>

```
def repartition( numPartitions : Int,
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : DynamicFrame
```

Returns a new `DynamicFrame` with `numPartitions` partitions.

## Def resolveChoice
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice"></a>

```
def resolveChoice( specs : Seq[Product2[String, String]] = Seq.empty[ResolveSpec],
                   choiceOption : Option[ChoiceOption] = None,
                   database : Option[String] = None,
                   tableName : Option[String] = None,
                   transformationContext : String = "",
                   callSite : CallSite = CallSite("Not provided", ""),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : DynamicFrame
```
+ `choiceOption` — An action to apply to all `ChoiceType` columns not listed in the specs sequence.
+ `database` — The Data Catalog database to use with the `match_catalog` action.
+ `tableName` — The Data Catalog table to use with the `match_catalog` action.

Returns a new `DynamicFrame` by replacing one or more `ChoiceType`s with a more specific type.

There are two ways to use `resolveChoice`. The first is to specify a sequence of specific columns and how to resolve them. These are specified as tuples made up of (column, action) pairs.

The following are the possible actions:
+ `cast:type` — Attempts to cast all values to the specified type.
+ `make_cols` — Converts each distinct type to a column with the name `columnName_type`.
+ `make_struct` — Converts a column to a struct with keys for each distinct type.
+ `project:type` — Retains only values of the specified type.

The other mode for `resolveChoice` is to specify a single resolution for all `ChoiceType`s. You can use this in cases where the complete list of `ChoiceType`s is unknown before execution. In addition to the actions listed preceding, this mode also supports the following action:
+ `match_catalog` — Attempts to cast each `ChoiceType` to the corresponding type in the specified catalog table.

**Examples:**

Resolve the `user.id` column by casting to an int, and make the `address` field retain only structs.

```
df.resolveChoice(specs = Seq(("user.id", "cast:int"), ("address", "project:struct")))
```

Resolve all `ChoiceType`s by converting each choice to a separate column.

```
df.resolveChoice(choiceOption = Some(ChoiceOption("make_cols")))
```

Resolve all `ChoiceType`s by casting to the types in the specified catalog table.

```
df.resolveChoice(choiceOption = Some(ChoiceOption("match_catalog")),
                 database = Some("my_database"),
                 tableName = Some("my_table"))
```
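
The difference between `cast` and `make_cols` can be sketched on plain Scala data. This is an illustration under assumed names, not the library's resolution code:

```scala
// Sketch: resolving a "choice" column whose values carry mixed runtime
// types, using cast:int vs. make_cols strategies.
sealed trait Resolution
case object CastToInt extends Resolution
case object MakeCols extends Resolution

def resolve(values: Seq[Any], how: Resolution): Seq[Map[String, Any]] = how match {
  case CastToInt => // cast:int — every value becomes an Int where possible
    values.map {
      case i: Int    => Map("col" -> i)
      case s: String => Map("col" -> s.toInt)
    }
  case MakeCols => // make_cols — one column per observed type
    values.map {
      case i: Int    => Map("col_int" -> i)
      case s: String => Map("col_string" -> s)
    }
}

val mixed: Seq[Any] = Seq(1, "2", 3)
val cast = resolve(mixed, CastToInt)
val cols = resolve(mixed, MakeCols)
```

With `cast:int`, every row keeps a single `col` column; with `make_cols`, each row populates only the column matching its runtime type, which is why the output schema gains `columnName_type` columns.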

## Def schema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-schema"></a>

```
def schema : Schema 
```

Returns the schema of this `DynamicFrame`.

The returned schema is guaranteed to contain every field that is present in a record in this `DynamicFrame`. But in a small number of cases, it might also contain additional fields. You can use the [recomputeSchema](#glue-etl-scala-apis-glue-dynamicframe-class-defs-recomputeSchema) method to "tighten" the schema based on the records in this `DynamicFrame`.

## Def selectField
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-selectField"></a>

```
def selectField( fieldName : String,
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : DynamicFrame
```

Returns a single field as a `DynamicFrame`.

## Def selectFields
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-selectFields"></a>

```
def selectFields( paths : Seq[String],
                  transformationContext : String = "",
                  callSite : CallSite = CallSite("Not provided", ""),
                  stageThreshold : Long = 0,
                  totalThreshold : Long = 0
                ) : DynamicFrame
```
+ `paths` — The sequence of column names to select.

Returns a new `DynamicFrame` containing the specified columns.

**Note**  
You can only use the `selectFields` method to select top-level columns. You can use the [applyMapping](#glue-etl-scala-apis-glue-dynamicframe-class-defs-applyMapping) method to select nested columns.

## Def show
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-show"></a>

```
def show( numRows : Int = 20 ) : Unit 
```
+ `numRows` — The number of rows to print.

Prints rows from this `DynamicFrame` in JSON format.

## Def simplifyDDBJson
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-simplifyDDBJson"></a>

Exporting a DynamoDB table with the AWS Glue DynamoDB export connector results in JSON files with a specific nested structure. For more information, see [Data objects](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html). `simplifyDDBJson` simplifies nested columns in a `DynamicFrame` of this type of data and returns a new simplified `DynamicFrame`. If a List type contains multiple types or a Map type, the elements in the List are not simplified. This method only supports data in the DynamoDB export JSON format. Consider using `unnest` to perform similar changes on other kinds of data.

```
def simplifyDDBJson() : DynamicFrame 
```

This method does not take any parameters.

**Example input**

Consider the following schema generated by a DynamoDB export:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

**Example code**

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> "ddbTableARN",
        "dynamodb.s3.bucket" -> "exportBucketLocation",
        "dynamodb.s3.prefix" -> "exportBucketPrefix",
        "dynamodb.s3.bucketOwner" -> "exportBucketAccountID",
      ))
    ).getDynamicFrame()
    
    val simplified = dynamicFrame.simplifyDDBJson()
    simplified.printSchema()

    Job.commit()
  }

}
```

### Example output
<a name="simplifyDDBJson-example-output"></a>

The `simplifyDDBJson` transform will simplify this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```
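
The attribute-type unwrapping shown in the schemas above can be illustrated in plain Scala. This sketch handles only the scalar `S`/`N`/`BOOL` tags and nested `M` maps, and is an assumption-laden stand-in, not the connector's implementation:

```scala
// Sketch: unwrap DynamoDB JSON type tags ("S", "N", "BOOL", "M") from a
// nested map, the way simplifyDDBJson collapses the attribute-type layer.
def unwrap(node: Any): Any = node match {
  case m: Map[_, _] =>
    val mm = m.asInstanceOf[Map[String, Any]]
    mm.toList match {
      case List(("S", s))     => s           // string attribute
      case List(("N", n))     => n           // numbers are exported as strings
      case List(("BOOL", b))  => b
      case List(("M", inner)) => unwrap(inner) // nested map attribute
      case _ => mm.map { case (k, v) => k -> unwrap(v) }
    }
  case other => other
}

val item = Map(
  "appName"  -> Map("S" -> "glue"),
  "childMap" -> Map("M" -> Map("updatedAt" -> Map("N" -> "1700000000"))))
val simplified = unwrap(item)
```

Note that, as in the real transform, number attributes (`N`) stay strings after simplification; only the type-tag wrapper struct is removed.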

## Def spigot
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-spigot"></a>

```
def spigot( path : String,
            options : JsonOptions = new JsonOptions("{}"),
            transformationContext : String = "",
            callSite : CallSite = CallSite("Not provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame
```

Passthrough transformation that returns the same records but writes out a subset of records as a side effect.
+ `path` — The path in Amazon S3 to write output to, in the form `s3://bucket/path`.
+ `options`  — An optional `JsonOptions` map describing the sampling behavior.

Returns a `DynamicFrame` that contains the same records as this one.

By default, writes 100 arbitrary records to the location specified by `path`. You can customize this behavior by using the `options` map. Valid keys include the following:
+ `topk` — Specifies the total number of records written out. The default is 100.
+ `prob` — Specifies the probability (as a decimal) that an individual record is included. Default is 1.

For example, the following call would sample the dataset by selecting each record with a 20 percent probability and stopping after 200 records have been written.

```
df.spigot("s3://my_bucket/my_path", JsonOptions(Map("topk" -> 200, "prob" -> 0.2)))
```
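
The interaction of `prob` and `topk` can be sketched with plain Scala collections. The helper below is hypothetical and purely illustrative of the sampling semantics, not the library code:

```scala
import scala.util.Random

// Sketch of spigot-style sampling: keep each record with probability
// `prob`, and stop once `topk` records have been kept.
def sample[T](records: Seq[T], topk: Int, prob: Double, seed: Long = 42L): Seq[T] = {
  val rng = new Random(seed)
  records.iterator.filter(_ => rng.nextDouble() < prob).take(topk).toSeq
}

val kept = sample(1 to 10000, topk = 200, prob = 0.2)
```

Because `topk` caps the output, raising `prob` changes which records are kept, not how many, once the cap is reached.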

## Def splitFields
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-splitFields"></a>

```
def splitFields( paths : Seq[String],
                 transformationContext : String = "",
                 callSite : CallSite = CallSite("Not provided", ""),
                 stageThreshold : Long = 0,
                 totalThreshold : Long = 0
               ) : Seq[DynamicFrame]
```
+ `paths` — The paths to include in the first `DynamicFrame`.

Returns a sequence of two `DynamicFrame`s. The first `DynamicFrame` contains the specified paths, and the second contains all other columns.

**Example**

This example takes a DynamicFrame created from the `persons` table in the `legislators` database in the AWS Glue Data Catalog and splits the DynamicFrame into two, with the specified fields going into the first DynamicFrame and the remaining fields going into a second DynamicFrame. The example then chooses the first DynamicFrame from the result.

```
val InputFrame = glueContext.getCatalogSource(database="legislators", tableName="persons", 
transformationContext="InputFrame").getDynamicFrame()

val SplitField_collection = InputFrame.splitFields(paths=Seq("family_name", "name", "links.note", 
"links.url", "gender", "image", "identifiers.scheme", "identifiers.identifier", "other_names.lang", 
"other_names.note", "other_names.name"), transformationContext="SplitField_collection")

val ResultFrame = SplitField_collection(0)
```

## Def splitRows
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-splitRows"></a>

```
def splitRows( paths : Seq[String],
               values : Seq[Any],
               operators : Seq[String],
               transformationContext : String,
               callSite : CallSite,
               stageThreshold : Long,
               totalThreshold : Long
             ) : Seq[DynamicFrame]
```

Splits rows based on predicates that compare columns to constants.
+ `paths` — The columns to use for comparison.
+ `values` — The constant values to use for comparison.
+ `operators` — The operators to use for comparison.

Returns a sequence of two `DynamicFrame`s. The first contains rows for which the predicate is true and the second contains those for which it is false.

Predicates are specified using three sequences: '`paths`' contains the (possibly nested) column names, '`values`' contains the constant values to compare to, and '`operators`' contains the operators to use for comparison. All three sequences must be the same length: The `n`th operator is used to compare the `n`th column with the `n`th value.

Each operator must be one of "`!=`", "`=`", "`<=`", "`<`", "`>=`", or "`>`".

As an example, the following call would split a `DynamicFrame` so that the first output frame would contain records of people over 65 from the United States, and the second would contain all other records.

```
df.splitRows(Seq("age", "address.country"), Seq(65, "USA"), Seq(">=", "="))
```
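
The three-parallel-sequence predicate logic can be sketched in plain Scala, with column access simplified to map lookups. This is an illustrative sketch (supporting only a few operators), not the library implementation:

```scala
// Sketch: partition rows by ANDed (column, operator, value) predicates,
// where the nth operator compares the nth column with the nth value.
def splitRowsSketch(rows: Seq[Map[String, Any]],
                    paths: Seq[String],
                    values: Seq[Any],
                    operators: Seq[String])
    : (Seq[Map[String, Any]], Seq[Map[String, Any]]) = {
  require(paths.length == values.length && values.length == operators.length)
  def holds(row: Map[String, Any]): Boolean = paths.indices.forall { i =>
    (operators(i), row(paths(i)), values(i)) match {
      case ("=",  a, b)           => a == b
      case ("!=", a, b)           => a != b
      case (">=", a: Int, b: Int) => a >= b
      case ("<",  a: Int, b: Int) => a < b
      case _                      => false
    }
  }
  rows.partition(holds)
}

val rows = Seq(
  Map("age" -> 70, "country" -> "USA"),
  Map("age" -> 40, "country" -> "USA"),
  Map("age" -> 70, "country" -> "France"))
val (matched, rest) =
  splitRowsSketch(rows, Seq("age", "country"), Seq(65, "USA"), Seq(">=", "="))
```

Only rows satisfying every predicate land in the first frame; all others, including partial matches, land in the second.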

## Def stageErrorsCount
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-stageErrorsCount"></a>

```
def stageErrorsCount
```

Returns the number of error records created while computing this `DynamicFrame`. This excludes errors from previous operations that were passed into this `DynamicFrame` as input.

## Def toDF
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-toDF"></a>

```
def toDF( specs : Seq[ResolveSpec] = Seq.empty[ResolveSpec] ) : DataFrame 
```

Converts this `DynamicFrame` to an Apache Spark SQL `DataFrame` with the same schema and records.

**Note**  
Because `DataFrame`s don't support `ChoiceType`s, this method automatically converts `ChoiceType` columns into `StructType`s. For more information and options for resolving choice, see [resolveChoice](#glue-etl-scala-apis-glue-dynamicframe-class-defs-resolveChoice).

## Def unbox
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-unbox"></a>

```
def unbox( path : String,
           format : String,
           optionString : String = "{}",
           transformationContext : String = "",
           callSite : CallSite = CallSite("Not provided"),
           stageThreshold : Long = 0,
           totalThreshold : Long = 0
         ) : DynamicFrame
```
+ `path` — The column to parse. Must be a string or binary.
+ `format` — The format to use for parsing.
+ `optionString` — Options to pass to the format, such as the CSV separator.

Parses an embedded string or binary column according to the specified format. Parsed columns are nested under a struct with the original column name.

For example, suppose that you have a CSV file with an embedded JSON column.

```
name, age, address
Sally, 36, {"state": "NE", "city": "Omaha"}
...
```

After an initial parse, you would get a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: string
```

You can call `unbox` on the address column to parse the specific components.

```
df.unbox("address", "json")
```

This gives us a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: struct
|    |-- state: string
|    |-- city: string
```

## Def unnest
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-unnest"></a>

```
def unnest( transformationContext : String = "",
            callSite : CallSite = CallSite("Not Provided"),
            stageThreshold : Long = 0,
            totalThreshold : Long = 0
          ) : DynamicFrame
```

Returns a new `DynamicFrame` with all nested structures flattened. Names are constructed using the '`.`' (period) character.

For example, suppose that you have a `DynamicFrame` with the following schema.

```
root
|-- name: string
|-- age: int
|-- address: struct
|    |-- state: string
|    |-- city: string
```

The following call unnests the address struct.

```
df.unnest()
```

The resulting schema is as follows.

```
root
|-- name: string
|-- age: int
|-- address.state: string
|-- address.city: string
```

This method also unnests nested structs inside of arrays. But for historical reasons, the names of such fields are prepended with the name of the enclosing array and "`.val`".
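
The dot-name flattening can be sketched over nested Scala maps. This self-contained illustration covers only nested structs, not the array `.val` case:

```scala
// Sketch: flatten nested maps into a single level, joining names with '.'.
def flatten(node: Map[String, Any], prefix: String = ""): Map[String, Any] =
  node.flatMap {
    case (k, m: Map[_, _]) =>
      flatten(m.asInstanceOf[Map[String, Any]], prefix + k + ".")
    case (k, v) => Map(prefix + k -> v)
  }

val record = Map(
  "name" -> "Sally",
  "address" -> Map("state" -> "NE", "city" -> "Omaha"))
val flat = flatten(record)
```

Applied to the sample record, `address.state` and `address.city` appear as top-level keys, matching the unnested schema shown above.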

## Def unnestDDBJson
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson"></a>

```
def unnestDDBJson( transformationContext : String = "",
                   callSite : CallSite = CallSite("Not Provided"),
                   stageThreshold : Long = 0,
                   totalThreshold : Long = 0
                 ) : DynamicFrame
```

Unnests nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new unnested `DynamicFrame`. Columns that are of an array of struct types will not be unnested. Note that this is a specific type of unnesting transform that behaves differently from the regular `unnest` transform and requires the data to already be in the DynamoDB JSON structure. For more information, see [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html#DataExport.Output.Data).

For example, the schema of a read of an export with the DynamoDB JSON structure might look like the following:

```
root
|-- Item: struct
|    |-- ColA: struct
|    |    |-- S: string
|    |-- ColB: struct
|    |    |-- S: string
|    |-- ColC: struct
|    |    |-- N: string
|    |-- ColD: struct
|    |    |-- L: array
|    |    |    |-- element: null
```

The `unnestDDBJson()` transform would convert this to:

```
root
|-- ColA: string
|-- ColB: string
|-- ColC: string
|-- ColD: array    
|    |-- element: null
```

The following code example shows how to use the AWS Glue DynamoDB export connector, invoke a DynamoDB JSON unnest, and print the number of partitions:

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> "<test_source>",
        "dynamodb.s3.bucket" -> "<bucket name>",
        "dynamodb.s3.prefix" -> "<bucket prefix>",
        "dynamodb.s3.bucketOwner" -> "<account_id of bucket>",
      ))
    ).getDynamicFrame()
    
    val unnested = dynamicFrame.unnestDDBJson()
    print(unnested.getNumPartitions())

    Job.commit()
  }

}
```

## Def withFrameSchema
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-withFrameSchema"></a>

```
def withFrameSchema( getSchema : () => Schema ) : DynamicFrame 
```
+ `getSchema` — A function that returns the schema to use. Specified as a zero-parameter function to defer potentially expensive computation.

Sets the schema of this `DynamicFrame` to the specified value. This is primarily used internally to avoid costly schema recomputation. The passed-in schema must contain all columns present in the data.
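
The zero-parameter function exists so the computation runs only if the schema is actually needed, a by-name pattern you can reproduce in plain Scala. A minimal sketch with assumed names:

```scala
// Sketch: deferring an expensive computation behind a () => T thunk,
// evaluated lazily and cached on first use.
var computeCount = 0
def expensiveSchema(): String = { computeCount += 1; "struct<name:string,age:int>" }

class Frame(getSchema: () => String) {
  lazy val schema: String = getSchema() // runs the thunk at most once
}

val frame = new Frame(() => expensiveSchema())
// Nothing is computed until the first access of frame.schema.
val s = frame.schema
```

Constructing `Frame` costs nothing; the expensive call happens once, on first access, and subsequent accesses reuse the cached result.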

## Def withName
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-withName"></a>

```
def withName( name : String ) : DynamicFrame 
```
+ `name` — The new name to use.

Returns a copy of this `DynamicFrame` with a new name.

## Def withTransformationContext
<a name="glue-etl-scala-apis-glue-dynamicframe-class-defs-withTransformationContext"></a>

```
def withTransformationContext( ctx : String ) : DynamicFrame 
```

Returns a copy of this `DynamicFrame` with the specified transformation context.

# The DynamicFrame object
<a name="glue-etl-scala-apis-glue-dynamicframe-object"></a>

**Package: com.amazonaws.services.glue**

```
object DynamicFrame
```

## Def apply
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-apply"></a>

```
def apply( df : DataFrame,
           glueContext : GlueContext
         ) : DynamicFrame
```



## Def emptyDynamicFrame
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-emptyDynamicFrame"></a>

```
def emptyDynamicFrame( glueContext : GlueContext ) : DynamicFrame 
```



## Def fromPythonRDD
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-fromPythonRDD"></a>

```
def fromPythonRDD( rdd : JavaRDD[Array[Byte]],
                   glueContext : GlueContext
                 ) : DynamicFrame
```



## Def ignoreErrors
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-ignoreErrors"></a>

```
def ignoreErrors( fn : DynamicRecord => DynamicRecord ) : DynamicRecord 
```



## Def inlineErrors
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-inlineErrors"></a>

```
def inlineErrors( msg : String,
                  callSite : CallSite
                ) : (DynamicRecord => DynamicRecord)
```



## Def newFrameWithErrors
<a name="glue-etl-scala-apis-glue-dynamicframe-object-defs-newFrameWithErrors"></a>

```
def newFrameWithErrors( prevFrame : DynamicFrame,
                        rdd : RDD[DynamicRecord],
                        name : String = "",
                        transformationContext : String = "",
                        callSite : CallSite,
                        stageThreshold : Long,
                        totalThreshold : Long
                      ) : DynamicFrame
```



# AWS Glue Scala DynamicRecord class
<a name="glue-etl-scala-apis-glue-dynamicrecord-class"></a>

**Topics**
+ [Def addField](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-addField)
+ [Def dropField](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-dropField)
+ [Def setError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-setError)
+ [Def isError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-isError)
+ [Def getError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getError)
+ [Def clearError](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clearError)
+ [Def write](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-write)
+ [Def readFields](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-readFields)
+ [Def clone](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-clone)
+ [Def schema](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-schema)
+ [Def getRoot](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getRoot)
+ [Def toJson](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-toJson)
+ [Def getFieldNode](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getFieldNode)
+ [Def getField](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-getField)
+ [Def hashCode](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-hashCode)
+ [Def equals](#glue-etl-scala-apis-glue-dynamicrecord-class-defs-equals)
+ [DynamicRecord object](#glue-etl-scala-apis-glue-dynamicrecord-object)
+ [RecordTraverser trait](#glue-etl-scala-apis-glue-recordtraverser-trait)

**Package: com.amazonaws.services.glue**

```
class DynamicRecord extends Serializable with Writable with Cloneable
```

A `DynamicRecord` is a self-describing data structure that represents a row of data in the dataset that is being processed. It is self-describing in the sense that you can get the schema of the row that is represented by the `DynamicRecord` by inspecting the record itself. A `DynamicRecord` is similar to a `Row` in Apache Spark.

## Def addField
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-addField"></a>

```
def addField( path : String,
              dynamicNode : DynamicNode
            ) : Unit
```

Adds a [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md) to the specified path.
+ `path` — The path for the field to be added.
+ `dynamicNode` — The [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md) to be added at the specified path.

## Def dropField
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-dropField"></a>

```
 def dropField(path: String, underRename: Boolean = false): Option[DynamicNode]
```

Drops a [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md) from the specified path and returns the dropped node, provided that the specified path does not contain an array.
+ `path` — The path to the field to drop.
+ `underRename` — True if `dropField` is called as part of a rename transform, or false otherwise (false by default).

Returns a `scala.Option` containing the dropped [DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md).

## Def setError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-setError"></a>

```
def setError( error : Error )
```

Sets this record as an error record, as specified by the `error` parameter.

Returns a `DynamicRecord`.

## Def isError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-isError"></a>

```
def isError
```

Checks whether this record is an error record.

## Def getError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getError"></a>

```
def getError
```

Gets the `Error` if the record is an error record. Returns `scala.Some(Error)` if this record is an error record, or `scala.None` otherwise.

## Def clearError
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-clearError"></a>

```
def clearError
```

Sets the `Error` to `scala.None`.

## Def write
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-write"></a>

```
override def write( out : DataOutput ) : Unit 
```



## Def readFields
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-readFields"></a>

```
override def readFields( in : DataInput ) : Unit 
```



## Def clone
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-clone"></a>

```
override def clone : DynamicRecord 
```

Clones this record to a new `DynamicRecord` and returns it.

## Def schema
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-schema"></a>

```
def schema
```

Gets the `Schema` by inspecting the record.

## Def getRoot
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getRoot"></a>

```
def getRoot : ObjectNode 
```

Gets the root `ObjectNode` for the record.

## Def toJson
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-toJson"></a>

```
def toJson : String 
```

Gets the JSON string for the record.

## Def getFieldNode
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getFieldNode"></a>

```
def getFieldNode( path : String ) : Option[DynamicNode] 
```

Gets the field's value at the specified `path` as an option of `DynamicNode`.

Returns `scala.Some` ([DynamicNode](glue-etl-scala-apis-glue-types-dynamicnode.md)) if the field exists, or `scala.None` otherwise.

## Def getField
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-getField"></a>

```
def getField( path : String ) : Option[Any] 
```

Gets the field's value at the specified `path`.

Returns `scala.Some(value)` if the field exists, or `scala.None` otherwise.

## Def hashCode
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-hashCode"></a>

```
override def hashCode : Int 
```



## Def equals
<a name="glue-etl-scala-apis-glue-dynamicrecord-class-defs-equals"></a>

```
override def equals( other : Any )
```



## DynamicRecord object
<a name="glue-etl-scala-apis-glue-dynamicrecord-object"></a>

```
object DynamicRecord
```

### Def apply
<a name="glue-etl-scala-apis-glue-dynamicrecord-object-defs-apply"></a>

```
def apply( row : Row,
           schema : SparkStructType )
```

Apply method to convert an Apache Spark SQL `Row` to a [DynamicRecord](#glue-etl-scala-apis-glue-dynamicrecord-class).
+ `row` — A Spark SQL `Row`.
+ `schema` — The `Schema` of that row.

Returns a `DynamicRecord`.

## RecordTraverser trait
<a name="glue-etl-scala-apis-glue-recordtraverser-trait"></a>

```
trait RecordTraverser {
  def nullValue(): Unit
  def byteValue(value: Byte): Unit
  def binaryValue(value: Array[Byte]): Unit
  def booleanValue(value: Boolean): Unit
  def shortValue(value: Short) : Unit
  def intValue(value: Int) : Unit
  def longValue(value: Long) : Unit
  def floatValue(value: Float): Unit
  def doubleValue(value: Double): Unit
  def decimalValue(value: BigDecimal): Unit
  def stringValue(value: String): Unit
  def dateValue(value: Date): Unit
  def timestampValue(value: Timestamp): Unit
  def objectStart(length: Int): Unit
  def objectKey(key: String): Unit
  def objectEnd(): Unit
  def mapStart(length: Int): Unit
  def mapKey(key: String): Unit
  def mapEnd(): Unit
  def arrayStart(length: Int): Unit
  def arrayEnd(): Unit
}
```
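
A traverser receives these callbacks as a record's structure is walked. The sketch below uses a reduced local copy of the trait (only a few of the callbacks, under assumed behavior) with a collector that rebuilds a JSON-like string:

```scala
// Reduced local sketch of the RecordTraverser callback pattern:
// the walker fires events; the traverser accumulates output.
trait MiniTraverser {
  def stringValue(value: String): Unit
  def intValue(value: Int): Unit
  def objectStart(length: Int): Unit
  def objectKey(key: String): Unit
  def objectEnd(): Unit
}

class JsonCollector extends MiniTraverser {
  private val sb = new StringBuilder
  // Insert a comma unless we're at the start, right after '{', or after a key.
  private def sep(): Unit = {
    val s = sb.toString
    if (s.nonEmpty && !s.endsWith("{") && !s.endsWith(":")) sb.append(",")
  }
  def stringValue(v: String): Unit   = { sep(); sb.append("\"" + v + "\"") }
  def intValue(v: Int): Unit         = { sep(); sb.append(v) }
  def objectStart(length: Int): Unit = { sep(); sb.append("{") }
  def objectKey(key: String): Unit   = { sep(); sb.append("\"" + key + "\":") }
  def objectEnd(): Unit              = sb.append("}")
  def result: String = sb.toString
}

// Simulate a walk over the record {"name": "Sally", "age": 36}.
val t = new JsonCollector
t.objectStart(2)
t.objectKey("name"); t.stringValue("Sally")
t.objectKey("age");  t.intValue(36)
t.objectEnd()
```

The full trait follows the same event-driven shape, with additional callbacks for maps, arrays, and the remaining scalar types.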

# AWS Glue Scala GlueContext APIs
<a name="glue-etl-scala-apis-glue-gluecontext"></a>

**Package: com.amazonaws.services.glue**

```
class GlueContext extends SQLContext(sc) (
           @transient val sc : SparkContext,
           val defaultSourcePartitioner : PartitioningStrategy )
```

`GlueContext` is the entry point for reading and writing a [DynamicFrame](glue-etl-scala-apis-glue-dynamicframe.md) from and to Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, JDBC, and so on. This class provides utility functions to create [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) and [DataSink](glue-etl-scala-apis-glue-datasink-class.md) objects that can in turn be used to read and write `DynamicFrame`s.

You can also use `GlueContext` to set a target number of partitions (default 20) in the `DynamicFrame` if the number of partitions created from the source is less than a minimum threshold for partitions (default 10).

## def addIngestionTimeColumns
<a name="glue-etl-scala-apis-glue-gluecontext-defs-addIngestionTimeColumns"></a>

```
def addIngestionTimeColumns(
         df : DataFrame, 
         timeGranularity : String = "") : DataFrame
```

Appends ingestion time columns such as `ingest_year`, `ingest_month`, `ingest_day`, `ingest_hour`, and `ingest_minute` to the input `DataFrame`. AWS Glue generates a call to this function automatically in the generated script when you specify a Data Catalog table with Amazon S3 as the target. The function updates the partition with ingestion time columns on the output table, so the output data is automatically partitioned on ingestion time without requiring explicit ingestion time columns in the input data.
+ `df` – The `DataFrame` to append the ingestion time columns to.
+ `timeGranularity` – The granularity of the time columns. Valid values are "`day`", "`hour`", and "`minute`". For example, if "`hour`" is passed in to the function, the original `DataFrame` will have "`ingest_year`", "`ingest_month`", "`ingest_day`", and "`ingest_hour`" time columns appended.

Returns the data frame after appending the time granularity columns.

Example:

```
glueContext.addIngestionTimeColumns(dataFrame, "hour")
```

## def createDataFrameFromOptions
<a name="glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions"></a>

```
def createDataFrameFromOptions( connectionType : String,
                         connectionOptions : JsonOptions,
                         transformationContext : String = "",
                         format : String = null,
                         formatOptions : JsonOptions = JsonOptions.empty
                       ) : DataFrame
```

Returns a `DataFrame` created with the specified connection and format. Use this function only with AWS Glue streaming sources.
+ `connectionType` – The streaming connection type. Valid values include `kinesis` and `kafka`.
+ `connectionOptions` – Connection options, which are different for Kinesis and Kafka. You can find the list of all connection options for each streaming data source at [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md). Note the following differences in streaming connection options:
  + Kinesis streaming sources require `streamARN`, `startingPosition`, `inferSchema`, and `classification`.
  + Kafka streaming sources require `connectionName`, `topicName`, `startingOffsets`, `inferSchema`, and `classification`.
+ `transformationContext` – The transformation context to use (optional).
+ `format` – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. For information about the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `formatOptions` – Format options for the specified format. For information about the supported format options, see [Data format options](aws-glue-programming-etl-format.md).

Example for Amazon Kinesis streaming source:

```
val data_frame_datasource0 = 
glueContext.createDataFrameFromOptions(transformationContext = "datasource0", connectionType = "kinesis", 
connectionOptions = JsonOptions("""{"streamName": "example_stream", "startingPosition": "TRIM_HORIZON", "inferSchema": "true", "classification": "json"}"""))
```

Example for Kafka streaming source:

```
val data_frame_datasource0 = 
glueContext.createDataFrameFromOptions(transformationContext = "datasource0", connectionType = "kafka", 
connectionOptions = JsonOptions("""{"connectionName": "example_connection", "topicName": "example_topic", "startingOffsets": "earliest", "inferSchema": "false", "classification": "json", "schema":"`column1` STRING, `column2` STRING"}"""))
```

## forEachBatch
<a name="glue-etl-scala-apis-glue-gluecontext-defs-forEachBatch"></a>

**`forEachBatch(frame, batch_function, options)`**

Applies the `batch_function` passed in to every micro batch that is read from the streaming source.
+ `frame` – The `DataFrame` containing the current micro batch.
+ `batch_function` – A function that is applied to every micro batch.
+ `options` – A collection of key-value pairs that holds information about how to process micro batches. The following options are required:
  + `windowSize` – The amount of time to spend processing each batch.
  + `checkpointLocation` – The location where checkpoints are stored for the streaming ETL job.
  + `batchMaxRetries` – The maximum number of times to retry the batch if it fails. The default value is 3. This option is only configurable for AWS Glue version 2.0 and later.

**Example:**

```
glueContext.forEachBatch(data_frame_datasource0, (dataFrame: Dataset[Row], batchId: Long) => 
   {
      if (dataFrame.count() > 0) 
        {
          val datasource0 = DynamicFrame(glueContext.addIngestionTimeColumns(dataFrame, "hour"), glueContext)
          // @type: DataSink
          // @args: [database = "tempdb", table_name = "fromoptionsoutput", stream_batch_time = "100 seconds", 
          //      stream_checkpoint_location = "s3://from-options-testing-eu-central-1/fromOptionsOutput/checkpoint/", 
          //      transformation_ctx = "datasink1"]
          // @return: datasink1
          // @inputs: [frame = datasource0]
          val options_datasink1 = JsonOptions(
             Map("partitionKeys" -> Seq("ingest_year", "ingest_month","ingest_day", "ingest_hour"), 
             "enableUpdateCatalog" -> true))
          val datasink1 = glueContext.getCatalogSink(
             database = "tempdb", 
             tableName = "fromoptionsoutput", 
             redshiftTmpDir = "", 
             transformationContext = "datasink1", 
             additionalOptions = options_datasink1).writeDynamicFrame(datasource0)
        }
   }, JsonOptions("""{"windowSize" : "100 seconds", 
         "checkpointLocation" : "s3://from-options-testing-eu-central-1/fromOptionsOutput/checkpoint/"}"""))
```

## def getCatalogSink
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink"></a>

```
def getCatalogSink( database : String,
        tableName : String,
        redshiftTmpDir : String = "",
        transformationContext : String = "",
        additionalOptions: JsonOptions = JsonOptions.empty,
        catalogId: String = null   
) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes to a location specified in a table that is defined in the Data Catalog.
+ `database` — The database name in the Data Catalog.
+ `tableName` — The table name in the Data Catalog.
+ `redshiftTmpDir` — The temporary staging directory to be used with certain data sinks. Set to empty by default.
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.
+ `additionalOptions` – Additional options provided to AWS Glue. 
+ `catalogId` — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used. 

Returns the `DataSink`.
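
The following sketch shows one way to use the returned `DataSink`; the database and table names are placeholders, and `glueContext` and a `DynamicFrame` named `dynamicFrame` are assumed to already be in scope:

```
val datasink = glueContext.getCatalogSink(
    database = "tempdb",
    tableName = "outputtable",
    transformationContext = "datasink0")
datasink.writeDynamicFrame(dynamicFrame)
```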

## def getCatalogSource
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource"></a>

```
def getCatalogSource( database : String,
                      tableName : String,
                      redshiftTmpDir : String = "",
                      transformationContext : String = "",
                      pushDownPredicate : String = " ",
                      additionalOptions: JsonOptions = JsonOptions.empty,
                      catalogId: String = null
                    ) : DataSource
```

Creates a [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) that reads data from a table definition in the Data Catalog.
+ `database` — The database name in the Data Catalog.
+ `tableName` — The table name in the Data Catalog.
+ `redshiftTmpDir` — The temporary staging directory to be used with certain data sinks. Set to empty by default.
+ `transformationContext` — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.
+ `pushDownPredicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additionalOptions` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) except for `endpointUrl`, `streamName`, `bootstrap.servers`, `security.protocol`, `topicName`, `classification`, and `delimiter`. Another supported option is `catalogPartitionPredicate`:

  `catalogPartitionPredicate` — You can pass a catalog expression to filter based on the index columns. This pushes down the filtering to the server side. For more information, see [AWS Glue Partition Indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). Note that `push_down_predicate` and `catalogPartitionPredicate` use different syntaxes. The former uses Spark SQL standard syntax, and the latter uses the JSQL parser.
+ `catalogId` — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used. 

Returns the `DataSource`.

**Example for streaming source**

```
val data_frame_datasource0 = glueContext.getCatalogSource(
    database = "tempdb",
    tableName = "test-stream-input", 
    redshiftTmpDir = "", 
    transformationContext = "datasource0", 
    additionalOptions = JsonOptions("""{
        "startingPosition": "TRIM_HORIZON", "inferSchema": "false"}""")
    ).getDataFrame()
```
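
A non-streaming read follows the same pattern. In the following sketch, the database and table names are placeholders, and `glueContext` is assumed to be in scope:

```
val datasource0 = glueContext.getCatalogSource(
    database = "tempdb",
    tableName = "mytable",
    transformationContext = "datasource0"
    ).getDynamicFrame()
```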

## def getJDBCSink
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getJDBCSink"></a>

```
def getJDBCSink( catalogConnection : String,
                 options : JsonOptions,
                 redshiftTmpDir : String = "",
                 transformationContext : String = "",
                 catalogId: String = null
               ) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes to a JDBC database that is specified in a `Connection` object in the Data Catalog. The `Connection` object has information to connect to a JDBC sink, including the URL, user name, password, VPC, subnet, and security groups.
+ `catalogConnection` — The name of the connection in the Data Catalog that contains the JDBC URL to write to.
+ `options` — A string of JSON name-value pairs that provide additional information that is required to write to a JDBC data store. This includes: 
  + *dbtable* (required) — The name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used. The following example shows an options parameter that points to a schema named `test` and a table named `test_table` in database `test_db`.

    ```
    options = JsonOptions("""{"dbtable": "test.test_table", "database": "test_db"}""")
    ```
  + *database* (required) — The name of the JDBC database.
  + Any additional options passed directly to the SparkSQL JDBC writer. For more information, see [Redshift data source for Spark](https://github.com/databricks/spark-redshift).
+ `redshiftTmpDir` — A temporary staging directory to be used with certain data sinks. Set to empty by default.
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.
+ `catalogId` — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used. 

Example code:

```
getJDBCSink(catalogConnection = "my-connection-name", options = JsonOptions("""{"dbtable": "my-jdbc-table", "database": "my-jdbc-db"}"""), redshiftTmpDir = "", transformationContext = "datasink4")
```

Returns the `DataSink`.

## def getSink
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSink"></a>

```
def getSink( connectionType : String,
             connectionOptions : JsonOptions,
             transformationContext : String = ""
           ) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes data to a destination such as Amazon Simple Storage Service (Amazon S3), JDBC, the AWS Glue Data Catalog, or an Apache Kafka or Amazon Kinesis data stream.
+ `connectionType` — The type of the connection. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `connectionOptions` — A string of JSON name-value pairs that provide additional information to establish the connection with the data sink. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.

Returns the `DataSink`.
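
For example, the following sketch writes a `DynamicFrame` to an Amazon S3 location. The bucket path is a placeholder, and `glueContext` and `dynamicFrame` are assumed to be in scope:

```
val datasink = glueContext.getSink(
    connectionType = "s3",
    connectionOptions = JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/output/"}"""),
    transformationContext = "datasink0")
datasink.writeDynamicFrame(dynamicFrame)
```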

## def getSinkWithFormat
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat"></a>

```
def getSinkWithFormat( connectionType : String,
                       options : JsonOptions,
                       transformationContext : String = "",
                       format : String = null,
                       formatOptions : JsonOptions = JsonOptions.empty
                     ) : DataSink
```

Creates a [DataSink](glue-etl-scala-apis-glue-datasink-class.md) that writes data to a destination such as Amazon S3, JDBC, the Data Catalog, or an Apache Kafka or Amazon Kinesis data stream. Also sets the format for the data to be written out to the destination.
+ `connectionType` — The type of the connection. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `options` — A string of JSON name-value pairs that provide additional information to establish a connection with the data sink. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `transformationContext` — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.
+ `format` — The format of the data to be written out to the destination.
+ `formatOptions` — A string of JSON name-value pairs that provide additional options for formatting data at the destination. See [Data format options](aws-glue-programming-etl-format.md).

Returns the `DataSink`.
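
For example, the following sketch writes a `DynamicFrame` to Amazon S3 in Parquet format. The bucket path is a placeholder, and `glueContext` and `dynamicFrame` are assumed to be in scope:

```
val datasink = glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/output/"}"""),
    transformationContext = "datasink0",
    format = "parquet")
datasink.writeDynamicFrame(dynamicFrame)
```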

## def getSource
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSource"></a>

```
def getSource( connectionType : String,
               connectionOptions : JsonOptions,
               transformationContext : String = "",
               pushDownPredicate : String = " "
             ) : DataSource
```

Creates a [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) that reads data from a source like Amazon S3, JDBC, or the AWS Glue Data Catalog. Also supports Kafka and Kinesis streaming data sources.
+ `connectionType` — The type of the data source. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `connectionOptions` — A string of JSON name-value pairs that provide additional information for establishing a connection with the data source. For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

  A Kinesis streaming source requires the following connection options: `streamARN`, `startingPosition`, `inferSchema`, and `classification`.

  A Kafka streaming source requires the following connection options: `connectionName`, `topicName`, `startingOffsets`, `inferSchema`, and `classification`.
+ `transformationContext` — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.
+ `pushDownPredicate` — Predicate on partition columns.

Returns the `DataSource`.

Example for Amazon Kinesis streaming source:

```
val kinesisOptions = jsonOptions()
val data_frame_datasource0 = glueContext.getSource("kinesis", kinesisOptions).getDataFrame()

private def jsonOptions(): JsonOptions = {
    new JsonOptions(
      s"""{"streamARN": "arn:aws:kinesis:eu-central-1:123456789012:stream/fromOptionsStream",
         |"startingPosition": "TRIM_HORIZON",
         |"inferSchema": "true",
         |"classification": "json"}""".stripMargin)
}
```

Example for Kafka streaming source:

```
val kafkaOptions = jsonOptions()
val data_frame_datasource0 = glueContext.getSource("kafka", kafkaOptions).getDataFrame()

private def jsonOptions(): JsonOptions = {
    new JsonOptions(
      s"""{"connectionName": "ConfluentKafka",
         |"topicName": "kafka-auth-topic",
         |"startingOffsets": "earliest",
         |"inferSchema": "true",
         |"classification": "json"}""".stripMargin)
 }
```

## def getSourceWithFormat
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat"></a>

```
def getSourceWithFormat( connectionType : String,
                         options : JsonOptions,
                         transformationContext : String = "",
                         format : String = null,
                         formatOptions : JsonOptions = JsonOptions.empty
                       ) : DataSource
```

Creates a [DataSource trait](glue-etl-scala-apis-glue-datasource-trait.md) that reads data from a source like Amazon S3, JDBC, or the AWS Glue Data Catalog, and also sets the format of data stored in the source.
+ `connectionType` – The type of the data source. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `options` – A string of JSON name-value pairs that provide additional information for establishing a connection with the data source. See [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `transformationContext` – The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.
+ `format` – The format of the data that is stored at the source. When the `connectionType` is "`s3`", you can also specify `format`. Can be one of "`avro`", "`csv`", "`grokLog`", "`ion`", "`json`", "`xml`", "`parquet`", or "`orc`".
+ `formatOptions` – A string of JSON name-value pairs that provide additional options for parsing data at the source. See [Data format options](aws-glue-programming-etl-format.md).

Returns the `DataSource`.

**Examples**

Create a DynamicFrame from a data source that is a comma-separated values (CSV) file on Amazon S3:

```
val datasource0 = glueContext.getSourceWithFormat(
    connectionType="s3",
    options =JsonOptions(s"""{"paths": [ "s3://csv/nycflights.csv"]}"""),
    transformationContext = "datasource0", 
    format = "csv",
    formatOptions=JsonOptions(s"""{"withHeader":"true","separator": ","}""")
    ).getDynamicFrame()
```

Create a DynamicFrame from a PostgreSQL data source using a JDBC connection:

```
val datasource0 = glueContext.getSourceWithFormat(
    connectionType="postgresql",
    options =JsonOptions(s"""{
      "url":"jdbc:postgresql://databasePostgres-1.rds.amazonaws.com:5432/testdb",
      "dbtable": "public.company",
      "redshiftTmpDir":"", 
      "user":"username", 
      "password":"password123"
    }"""),
    transformationContext = "datasource0").getDynamicFrame()
```

Create a DynamicFrame from a MySQL data source using a JDBC connection:

```
 val datasource0 = glueContext.getSourceWithFormat(
    connectionType="mysql",
    options =JsonOptions(s"""{
      "url":"jdbc:mysql://databaseMysql-1.rds.amazonaws.com:3306/testdb",
      "dbtable": "athenatest_nycflights13_csv",
      "redshiftTmpDir":"", 
      "user":"username", 
      "password":"password123"
    }"""),
    transformationContext = "datasource0").getDynamicFrame()
```

## def getSparkSession
<a name="glue-etl-scala-apis-glue-gluecontext-defs-getSparkSession"></a>

```
def getSparkSession : SparkSession 
```

Gets the `SparkSession` object associated with this `GlueContext`. Use this `SparkSession` object to register tables and UDFs for use with `DataFrame`s created from `DynamicFrame`s.

Returns the SparkSession.
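
For example, the following sketch registers a `DynamicFrame` as a temporary view so it can be queried with Spark SQL. The view, column, and variable names are illustrative, and `glueContext` and `dynamicFrame` are assumed to be in scope:

```
val sparkSession = glueContext.getSparkSession
// Convert the DynamicFrame to a DataFrame and expose it to Spark SQL
dynamicFrame.toDF().createOrReplaceTempView("my_table")
val filtered = sparkSession.sql("SELECT * FROM my_table WHERE column1 IS NOT NULL")
```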

## def startTransaction
<a name="glue-etl-scala-apis-glue-gluecontext-defs-start-transaction"></a>

```
def startTransaction(readOnly: Boolean):String
```

Start a new transaction. Internally calls the Lake Formation [startTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-StartTransaction) API.
+ `readOnly` – (Boolean) Indicates whether this transaction should be read only or read and write. Writes made using a read-only transaction ID will be rejected. Read-only transactions do not need to be committed.

Returns the transaction ID.

## def commitTransaction
<a name="glue-etl-scala-apis-glue-gluecontext-defs-commit-transaction"></a>

```
def commitTransaction(transactionId: String, waitForCommit: Boolean): Boolean
```

Attempts to commit the specified transaction. `commitTransaction` may return before the transaction has finished committing. Internally calls the Lake Formation [commitTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CommitTransaction) API.
+ `transactionId` – (String) The transaction to commit.
+ `waitForCommit` – (Boolean) Determines whether the `commitTransaction` returns immediately. The default value is true. If false, `commitTransaction` polls and waits until the transaction is committed. The amount of wait time is restricted to 1 minute using exponential backoff with a maximum of 6 retry attempts.

Returns a Boolean to indicate whether the commit is done or not. 

## def cancelTransaction
<a name="glue-etl-scala-apis-glue-gluecontext-defs-cancel-transaction"></a>

```
def cancelTransaction(transactionId: String): Unit
```

Attempts to cancel the specified transaction. Internally calls the Lake Formation [CancelTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CancelTransaction) API.
+ `transactionId` – (String) The transaction to cancel.

Throws a `TransactionCommittedException` exception if the transaction was previously committed.
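
Taken together, `startTransaction`, `commitTransaction`, and `cancelTransaction` support a start/commit/cancel pattern, sketched below. The write step is elided because its details depend on the governed table being written:

```
val txId = glueContext.startTransaction(readOnly = false)
try {
  // ... write to the governed table using txId ...
  glueContext.commitTransaction(txId, waitForCommit = true)
} catch {
  case e: Exception =>
    // Roll back on any failure so the transaction doesn't dangle
    glueContext.cancelTransaction(txId)
    throw e
}
```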

## def this
<a name="glue-etl-scala-apis-glue-gluecontext-defs-this-1"></a>

```
def this( sc : SparkContext,
          minPartitions : Int,
          targetPartitions : Int )
```

Creates a `GlueContext` object using the specified `SparkContext`, minimum partitions, and target partitions.
+ `sc` — The `SparkContext`.
+ `minPartitions` — The minimum number of partitions.
+ `targetPartitions` — The target number of partitions.

Returns the `GlueContext`.

## def this
<a name="glue-etl-scala-apis-glue-gluecontext-defs-this-2"></a>

```
def this( sc : SparkContext )
```

Creates a `GlueContext` object with the provided `SparkContext`. Sets the minimum partitions to 10 and target partitions to 20.
+ `sc` — The `SparkContext`.

Returns the `GlueContext`.

## def this
<a name="glue-etl-scala-apis-glue-gluecontext-defs-this-3"></a>

```
def this( sparkContext : JavaSparkContext )
```

Creates a `GlueContext` object with the provided `JavaSparkContext`. Sets the minimum partitions to 10 and target partitions to 20.
+ `sparkContext` — The `JavaSparkContext`.

Returns the `GlueContext`.

# MappingSpec
<a name="glue-etl-scala-apis-glue-mappingspec"></a>

**Package: com.amazonaws.services.glue**

## MappingSpec case class
<a name="glue-etl-scala-apis-glue-mappingspec-case-class"></a>

```
case class MappingSpec( sourcePath: SchemaPath,
                        sourceType: DataType,
                        targetPath: SchemaPath,
                        targetType: DataType
                       ) extends Product4[String, String, String, String] {
  override def _1: String = sourcePath.toString
  override def _2: String = ExtendedTypeName.fromDataType(sourceType)
  override def _3: String = targetPath.toString
  override def _4: String = ExtendedTypeName.fromDataType(targetType)
}
```
+ `sourcePath` — The `SchemaPath` of the source field.
+ `sourceType` — The `DataType` of the source field.
+ `targetPath` — The `SchemaPath` of the target field.
+ `targetType` — The `DataType` of the target field.

A `MappingSpec` specifies a mapping from a source path and a source data type to a target path and a target data type. The value at the source path in the source frame appears in the target frame at the target path. The source data type is cast to the target data type.

It extends from `Product4` so that you can handle any `Product4` in your `applyMapping` interface.
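
For example, the following sketch builds a sequence of `MappingSpec` values and passes it to a `DynamicFrame`'s `applyMapping`. The field names are placeholders, and `dynamicFrame` is assumed to be in scope:

```
import com.amazonaws.services.glue.MappingSpec
import org.apache.spark.sql.types.{IntegerType, StringType}

// Rename two source fields while preserving their data types
val mappings = Seq(
  MappingSpec("id", IntegerType, "user_id", IntegerType),
  MappingSpec("name", StringType, "user_name", StringType))
val mapped = dynamicFrame.applyMapping(mappings)
```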

## MappingSpec object
<a name="glue-etl-scala-apis-glue-mappingspec-object"></a>

```
object MappingSpec
```

The `MappingSpec` object has the following members:

## Val orderingByTarget
<a name="glue-etl-scala-apis-glue-mappingspec-object-val-orderingbytarget"></a>

```
val orderingByTarget: Ordering[MappingSpec]
```



## Def apply
<a name="glue-etl-scala-apis-glue-mappingspec-object-defs-apply-1"></a>

```
def apply( sourcePath : String,
           sourceType : DataType,
           targetPath : String,
           targetType : DataType
         ) : MappingSpec
```

Creates a `MappingSpec`.
+ `sourcePath` — A string representation of the source path.
+ `sourceType` — The source `DataType`.
+ `targetPath` — A string representation of the target path.
+ `targetType` — The target `DataType`.

Returns a `MappingSpec`.

## Def apply
<a name="glue-etl-scala-apis-glue-mappingspec-object-defs-apply-2"></a>

```
def apply( sourcePath : String,
           sourceTypeString : String,
           targetPath : String,
           targetTypeString : String
         ) : MappingSpec
```

Creates a `MappingSpec`.
+ `sourcePath` — A string representation of the source path.
+ `sourceTypeString` — A string representation of the source data type.
+ `targetPath` — A string representation of the target path.
+ `targetTypeString` — A string representation of the target data type.

Returns a `MappingSpec`.

## Def apply
<a name="glue-etl-scala-apis-glue-mappingspec-object-defs-apply-3"></a>

```
def apply( product : Product4[String, String, String, String] ) : MappingSpec 
```

Creates a `MappingSpec`.
+ `product` — The `Product4` of the source path, source data type, target path, and target data type.

Returns a `MappingSpec`.

# AWS Glue Scala ResolveSpec APIs
<a name="glue-etl-scala-apis-glue-resolvespec"></a>

**Topics**
+ [ResolveSpec object](#glue-etl-scala-apis-glue-resolvespec-object)
+ [ResolveSpec case class](#glue-etl-scala-apis-glue-resolvespec-case-class)

**Package: com.amazonaws.services.glue**

## ResolveSpec object
<a name="glue-etl-scala-apis-glue-resolvespec-object"></a>

 **ResolveSpec**

```
object ResolveSpec
```

### Def
<a name="glue-etl-scala-apis-glue-resolvespec-object-def-apply_1"></a>

```
def apply( path : String,
           action : String
         ) : ResolveSpec
```

Creates a `ResolveSpec`.
+ `path` — A string representation of the choice field that needs to be resolved.
+ `action` — A resolution action. The action can be one of the following: `Project`, `KeepAsStruct`, or `Cast`.

Returns the `ResolveSpec`.

### Def
<a name="glue-etl-scala-apis-glue-resolvespec-object-def-apply_2"></a>

```
def apply( product : Product2[String, String] ) : ResolveSpec 
```

Creates a `ResolveSpec`.
+ `product` — `Product2` of: source path, resolution action.

Returns the `ResolveSpec`.
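
For example, the following sketch resolves a choice column by casting it to a single type. The field name and action string are illustrative, and `dynamicFrame` is assumed to be in scope:

```
// Resolve an ambiguous (choice) column by casting it to double
val specs = Seq(ResolveSpec("price", "cast:double"))
val resolved = dynamicFrame.resolveChoice(specs)
```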

## ResolveSpec case class
<a name="glue-etl-scala-apis-glue-resolvespec-case-class"></a>

```
case class ResolveSpec extends Product2[String, String]  (
           path : SchemaPath,
           action : String )
```

Creates a `ResolveSpec`.
+ `path` — The `SchemaPath` of the choice field that needs to be resolved.
+ `action` — A resolution action. The action can be one of the following: `Project`, `KeepAsStruct`, or `Cast`.

### ResolveSpec def methods
<a name="glue-etl-scala-apis-glue-resolvespec-case-class-defs"></a>

```
def _1 : String 
```

```
def _2 : String 
```

# AWS Glue Scala ArrayNode APIs
<a name="glue-etl-scala-apis-glue-types-arraynode"></a>

**Package: com.amazonaws.services.glue.types**

## ArrayNode case class
<a name="glue-etl-scala-apis-glue-types-arraynode-case-class"></a>

 **ArrayNode**

```
case class ArrayNode extends DynamicNode  (
           value : ArrayBuffer[DynamicNode] )
```

### ArrayNode def methods
<a name="glue-etl-scala-apis-glue-types-arraynode-case-class-defs"></a>

```
def add( node : DynamicNode )
```

```
def clone
```

```
def equals( other : Any )
```

```
def get( index : Int ) : Option[DynamicNode] 
```

```
def getValue
```

```
def hashCode : Int 
```

```
def isEmpty : Boolean 
```

```
def nodeType
```

```
def remove( index : Int )
```

```
def this
```

```
def toIterator : Iterator[DynamicNode] 
```

```
def toJson : String 
```

```
def update( index : Int,
            node : DynamicNode )
```
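
As a sketch of how these methods combine, the following builds an array node and renders it as JSON. It assumes the companion `IntegerNode` scalar node class, which is not shown in this section:

```
import com.amazonaws.services.glue.types.{ArrayNode, DynamicNode, IntegerNode}
import scala.collection.mutable.ArrayBuffer

val arr = new ArrayNode(ArrayBuffer.empty[DynamicNode])
arr.add(new IntegerNode(1))
arr.add(new IntegerNode(2))
// toJson renders the node tree as a JSON string
println(arr.toJson)
```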

# AWS Glue Scala BinaryNode APIs
<a name="glue-etl-scala-apis-glue-types-binarynode"></a>

**Package: com.amazonaws.services.glue.types**

## BinaryNode case class
<a name="glue-etl-scala-apis-glue-types-binarynode-case-class"></a>

 **BinaryNode**

```
case class BinaryNode extends ScalarNode(value, TypeCode.BINARY)  (
           value : Array[Byte] )
```

### BinaryNode val fields
<a name="glue-etl-scala-apis-glue-types-binarynode-case-class-vals"></a>
+ `ordering`

### BinaryNode def methods
<a name="glue-etl-scala-apis-glue-types-binarynode-case-class-defs"></a>

```
def clone
```

```
def equals( other : Any )
```

```
def hashCode : Int 
```

# AWS Glue Scala BooleanNode APIs
<a name="glue-etl-scala-apis-glue-types-booleannode"></a>

**Package: com.amazonaws.services.glue.types**

## BooleanNode case class
<a name="glue-etl-scala-apis-glue-types-booleannode-case-class"></a>

 **BooleanNode**

```
case class BooleanNode extends ScalarNode(value, TypeCode.BOOLEAN)  (
           value : Boolean )
```

### BooleanNode val fields
<a name="glue-etl-scala-apis-glue-types-booleannode-case-class-vals"></a>
+ `ordering`

### BooleanNode def methods
<a name="glue-etl-scala-apis-glue-types-booleannode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala ByteNode APIs
<a name="glue-etl-scala-apis-glue-types-bytenode"></a>

**Package: com.amazonaws.services.glue.types**

## ByteNode case class
<a name="glue-etl-scala-apis-glue-types-bytenode-case-class"></a>

 **ByteNode**

```
case class ByteNode extends ScalarNode(value, TypeCode.BYTE)  (
           value : Byte )
```

### ByteNode val fields
<a name="glue-etl-scala-apis-glue-types-bytenode-case-class-vals"></a>
+ `ordering`

### ByteNode def methods
<a name="glue-etl-scala-apis-glue-types-bytenode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala DateNode APIs
<a name="glue-etl-scala-apis-glue-types-datenode"></a>

**Package: com.amazonaws.services.glue.types**

## DateNode case class
<a name="glue-etl-scala-apis-glue-types-datenode-case-class"></a>

 **DateNode**

```
case class DateNode extends ScalarNode(value, TypeCode.DATE)  (
           value : Date )
```

### DateNode val fields
<a name="glue-etl-scala-apis-glue-types-datenode-case-class-vals"></a>
+ `ordering`

### DateNode def methods
<a name="glue-etl-scala-apis-glue-types-datenode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : Int )
```

# AWS Glue Scala DecimalNode APIs
<a name="glue-etl-scala-apis-glue-types-decimalnode"></a>

**Package: com.amazonaws.services.glue.types**

## DecimalNode case class
<a name="glue-etl-scala-apis-glue-types-decimalnode-case-class"></a>

 **DecimalNode**

```
case class DecimalNode extends ScalarNode(value, TypeCode.DECIMAL)  (
           value : BigDecimal )
```

### DecimalNode val fields
<a name="glue-etl-scala-apis-glue-types-decimalnode-case-class-vals"></a>
+ `ordering`

### DecimalNode def methods
<a name="glue-etl-scala-apis-glue-types-decimalnode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : Decimal )
```

# AWS Glue Scala DoubleNode APIs
<a name="glue-etl-scala-apis-glue-types-doublenode"></a>

**Package: com.amazonaws.services.glue.types**

## DoubleNode case class
<a name="glue-etl-scala-apis-glue-types-doublenode-case-class"></a>

 **DoubleNode**

```
case class DoubleNode extends ScalarNode(value, TypeCode.DOUBLE)  (
           value : Double )
```

### DoubleNode val fields
<a name="glue-etl-scala-apis-glue-types-doublenode-case-class-vals"></a>
+ `ordering`

### DoubleNode def methods
<a name="glue-etl-scala-apis-glue-types-doublenode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala DynamicNode APIs
<a name="glue-etl-scala-apis-glue-types-dynamicnode"></a>

**Topics**
+ [DynamicNode class](#glue-etl-scala-apis-glue-types-dynamicnode-class)
+ [DynamicNode object](#glue-etl-scala-apis-glue-types-dynamicnode-object)

**Package: com.amazonaws.services.glue.types**

## DynamicNode class
<a name="glue-etl-scala-apis-glue-types-dynamicnode-class"></a>

**DynamicNode**

```
class DynamicNode extends Serializable with Cloneable 
```

### DynamicNode def methods
<a name="glue-etl-scala-apis-glue-types-dynamicnode-class-defs"></a>

```
def getValue : Any
```

Gets the plain value bound to the current record:

```
def nodeType : TypeCode
```

```
def toJson : String
```

A method for debugging:

```
def toRow( schema : Schema,
           options : Map[String, ResolveOption]
         ) : Row
```

```
def typeName : String 
```

## DynamicNode object
<a name="glue-etl-scala-apis-glue-types-dynamicnode-object"></a>

 **DynamicNode**

```
object DynamicNode
```

### DynamicNode def methods
<a name="glue-etl-scala-apis-glue-types-dynamicnode-object-defs"></a>

```
def quote( field : String,
           useQuotes : Boolean
         ) : String
```

```
def quote( node : DynamicNode,
           useQuotes : Boolean
         ) : String
```

# EvaluateDataQuality class
<a name="glue-etl-scala-apis-glue-dq-EvaluateDataQuality"></a>


|  | 
| --- |
|  AWS Glue Data Quality is in preview release for AWS Glue and is subject to change.  | 

**Package: com.amazonaws.services.glue.dq**

```
object EvaluateDataQuality
```

## Def apply
<a name="glue-etl-scala-apis-glue-dq-EvaluateDataQuality-defs-apply"></a>

```
def apply(frame: DynamicFrame,
            ruleset: String,
            publishingOptions: JsonOptions = JsonOptions.empty): DynamicFrame
```

Evaluates a data quality ruleset against a `DynamicFrame`, and returns a new `DynamicFrame` with results of the evaluation. To learn more about AWS Glue Data Quality, see [AWS Glue Data Quality](glue-data-quality.md).
+ `frame` – The `DynamicFrame` that you want to evaluate the data quality of.
+ `ruleset` – A Data Quality Definition Language (DQDL) ruleset in string format. To learn more about DQDL, see the [Data Quality Definition Language (DQDL) reference](dqdl.md) guide.
+ `publishingOptions` – A dictionary that specifies the following options for publishing evaluation results and metrics:
  + `dataQualityEvaluationContext` – A string that specifies the namespace under which AWS Glue should publish Amazon CloudWatch metrics and the data quality results. The aggregated metrics appear in CloudWatch, while the full results appear in the AWS Glue Studio interface.
    + Required: No
    + Default value: `default_context`
  + `enableDataQualityCloudWatchMetrics` – Specifies whether the results of the data quality evaluation should be published to CloudWatch. You specify a namespace for the metrics using the `dataQualityEvaluationContext` option.
    + Required: No
    + Default value: False
  + `enableDataQualityResultsPublishing` – Specifies whether the data quality results should be visible on the **Data Quality** tab in the AWS Glue Studio interface.
    + Required: No
    + Default value: True
  + `resultsS3Prefix` – Specifies the Amazon S3 location where AWS Glue can write the data quality evaluation results.
    + Required: No
    + Default value: "" (empty string)

## Example
<a name="glue-etl-scala-apis-glue-dq-EvaluateDataQuality-example"></a>

The following example code demonstrates how to evaluate data quality for a `DynamicFrame` before performing a `SelectFields` transform. The script verifies that all data quality rules pass before it attempts the transform.

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import com.amazonaws.services.glue.dq.EvaluateDataQuality

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    // Create DynamicFrame with data
    val Legislators_Area = glueContext.getCatalogSource(database="legislators", tableName="areas_json", transformationContext="S3bucket_node1").getDynamicFrame()

    // Define data quality ruleset
    val DQ_Ruleset = """
      Rules = [ColumnExists "id"]
    """

    // Evaluate data quality
    val DQ_Results = EvaluateDataQuality.apply(frame=Legislators_Area, ruleset=DQ_Ruleset, publishingOptions=JsonOptions("""{"dataQualityEvaluationContext": "Legislators_Area", "enableDataQualityCloudWatchMetrics": "true", "enableDataQualityResultsPublishing": "true"}"""))
    assert(DQ_Results.filter(_.getField("Outcome").contains("Failed")).count == 0, "Failing DQ rules for Legislators_Area caused the job to fail.")

    // Script generated for node Select Fields
    val SelectFields_Results = Legislators_Area.selectFields(paths=Seq("id", "name"), transformationContext="Legislators_Area")

    Job.commit()
  }
}
```

# AWS Glue Scala FloatNode APIs
<a name="glue-etl-scala-apis-glue-types-floatnode"></a>

**Package: com.amazonaws.services.glue.types**

## FloatNode case class
<a name="glue-etl-scala-apis-glue-types-floatnode-case-class"></a>

**FloatNode**

```
case class FloatNode(value: Float) extends ScalarNode(value, TypeCode.FLOAT)
```

### FloatNode val fields
<a name="glue-etl-scala-apis-glue-types-floatnode-case-class-vals"></a>
+ `ordering`

### FloatNode def methods
<a name="glue-etl-scala-apis-glue-types-floatnode-case-class-defs"></a>

```
def equals( other : Any )
```

# FillMissingValues class
<a name="glue-etl-scala-apis-glue-ml-fillmissingvalues"></a>

**Package: com.amazonaws.services.glue.ml**

```
object FillMissingValues
```

## Def apply
<a name="glue-etl-scala-apis-glue-ml-fillmissingvalues-defs-apply"></a>

```
def apply(frame: DynamicFrame,
          missingValuesColumn: String,
          outputColumn: String = "",
          transformationContext: String = "",
          callSite: CallSite = CallSite("Not provided", ""),
          stageThreshold: Long = 0,
          totalThreshold: Long = 0): DynamicFrame
```

Fills a dynamic frame's missing values in a specified column and returns a new frame with estimates in a new column. For rows without missing values, the specified column's value is duplicated to the new column.
+ `frame` — The DynamicFrame in which to fill missing values. Required.
+ `missingValuesColumn` — The column containing missing values (`null` values and empty strings). Required.
+ `outputColumn` — The name of the new column that will contain estimated values for all rows whose value was missing. Optional; the default is the value of `missingValuesColumn` suffixed by `"_filled"`.
+ `transformationContext` — A unique string that is used to identify state information (optional).
+ `callSite` — Used to provide context information for error reporting. (optional).
+ `stageThreshold` — The maximum number of errors that can occur in the transformation before it errors out (optional; the default is zero).
+ `totalThreshold` — The maximum number of errors that can occur overall before processing errors out (optional; the default is zero).

Returns a new dynamic frame with one additional column that contains estimations for rows with missing values and the present value for other rows.
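As a conceptual illustration of that result shape (this is not a call to the Glue API — the row and column names are hypothetical, and Glue's actual fill values come from an ML model, not a placeholder), a pure-Scala sketch looks like this:

```scala
// Illustrative only: mimics the row shape FillMissingValues produces,
// not the ML estimation itself. Field names are hypothetical.
def fillMissing(
    rows: Seq[Map[String, Any]],
    missingValuesColumn: String,
    outputColumn: String = ""
): Seq[Map[String, Any]] = {
  // Default output column: the input column name suffixed with "_filled"
  val out = if (outputColumn.isEmpty) missingValuesColumn + "_filled" else outputColumn
  rows.map { row =>
    row.get(missingValuesColumn) match {
      case Some(v) if v != null && v != "" =>
        row + (out -> v)              // a present value is duplicated to the new column
      case _ =>
        row + (out -> "<estimated>")  // Glue would insert an ML estimate here
    }
  }
}

val rows = Seq(
  Map[String, Any]("id" -> 1, "age" -> 34),
  Map[String, Any]("id" -> 2, "age" -> "")
)
val filled = fillMissing(rows, "age")
```

Note how every row gains exactly one additional column, matching the description above.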

# FindMatches class
<a name="glue-etl-scala-apis-glue-ml-findmatches"></a>

**Package: com.amazonaws.services.glue.ml**

```
object FindMatches
```

## Def apply
<a name="glue-etl-scala-apis-glue-ml-findmatches-defs-apply"></a>

```
def apply(frame: DynamicFrame,
          transformId: String,
          transformationContext: String = "",
          callSite: CallSite = CallSite("Not provided", ""),
          stageThreshold: Long = 0,
          totalThreshold: Long = 0,
          enforcedMatches: DynamicFrame = null,
          computeMatchConfidenceScores: Boolean = false): DynamicFrame
```

Find matches in an input frame and return a new frame with a new column containing a unique ID per match group.
+ `frame` — The DynamicFrame in which to find matches. Required.
+ `transformId` — A unique ID associated with the FindMatches transform to apply on the input frame. Required.
+ `transformationContext` — Identifier for this `DynamicFrame`. The `transformationContext` is used as a key for the job bookmark state that is persisted across runs. Optional.
+ `callSite` — Used to provide context information for error reporting. These values are automatically set when calling from Python. Optional.
+ `stageThreshold` — The maximum number of error records allowed from the computation of this `DynamicFrame` before throwing an exception, excluding records present in the previous `DynamicFrame`. Optional. The default is zero.
+ `totalThreshold` — The maximum number of total error records before an exception is thrown, including those from previous frames. Optional. The default is zero.
+ `enforcedMatches` — The frame for enforced matches. Optional. The default is `null`.
+ `computeMatchConfidenceScores` — A Boolean value indicating whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new dynamic frame with a unique identifier assigned to each group of matching records.
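The matching logic itself is a trained ML transform, so it can't be reproduced here; but the output shape — a shared identifier per group of matching records — can be sketched in plain Scala. The normalized-name key below is a hypothetical stand-in for the learned matcher:

```scala
// Illustrative only: assigns one match-group ID per group of "matching"
// records. A simple normalized-name key stands in for the ML matcher.
def assignMatchIds(records: Seq[Map[String, String]]): Seq[Map[String, String]] = {
  val groups = records.groupBy(r => r("name").toLowerCase.trim)
  groups.toSeq.zipWithIndex.flatMap { case ((_, recs), id) =>
    recs.map(_ + ("match_id" -> id.toString))  // same ID for the whole group
  }
}

val recs = Seq(
  Map("name" -> "Jane Doe"),
  Map("name" -> "jane doe "),
  Map("name" -> "John Smith")
)
val matched = assignMatchIds(recs)
val janeIds = matched
  .filter(_("name").toLowerCase.trim == "jane doe")
  .map(_("match_id"))
  .distinct
```

Both "Jane Doe" records end up with the same `match_id`, which is the property the transform's output guarantees.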

# FindIncrementalMatches class
<a name="glue-etl-scala-apis-glue-ml-findincrementalmatches"></a>

**Package: com.amazonaws.services.glue.ml**

```
object FindIncrementalMatches
```

## Def apply
<a name="glue-etl-scala-apis-glue-ml-findincrementalmatches-defs-apply"></a>

```
def apply(existingFrame: DynamicFrame,
          incrementalFrame: DynamicFrame,
          transformId: String,
          transformationContext: String = "",
          callSite: CallSite = CallSite("Not provided", ""),
          stageThreshold: Long = 0,
          totalThreshold: Long = 0,
          enforcedMatches: DynamicFrame = null,
          computeMatchConfidenceScores: Boolean = false): DynamicFrame
```

Find matches across the existing and incremental frames and return a new frame with a column containing a unique ID per match group.
+ `existingFrame` — An existing frame that has been assigned a matching ID for each group. Required.
+ `incrementalFrame` — An incremental frame used to find matches against the existing frame. Required.
+ `transformId` — A unique ID associated with the FindIncrementalMatches transform to apply on the input frames. Required.
+ `transformationContext` — Identifier for this `DynamicFrame`. The `transformationContext` is used as a key for the job bookmark state that is persisted across runs. Optional.
+ `callSite` — Used to provide context information for error reporting. These values are automatically set when calling from Python. Optional.
+ `stageThreshold` — The maximum number of error records allowed from the computation of this `DynamicFrame` before throwing an exception, excluding records present in the previous `DynamicFrame`. Optional. The default is zero.
+ `totalThreshold` — The maximum number of total error records before an exception is thrown, including those from previous frames. Optional. The default is zero.
+ `enforcedMatches` — The frame for enforced matches. Optional. The default is `null`.
+ `computeMatchConfidenceScores` — A Boolean value indicating whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new dynamic frame with a unique identifier assigned to each group of matching records.
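The incremental behavior can also be sketched without the Glue API: incremental records that match an existing record keep that record's ID, while unmatched records start new groups. As before, the normalized-name key is a hypothetical stand-in for the trained matcher:

```scala
// Illustrative only: incremental records matching an existing record's
// key reuse its match ID; unmatched records get fresh IDs.
def incrementalMatch(
    existing: Seq[Map[String, String]],    // these already carry "match_id"
    incremental: Seq[Map[String, String]]
): Seq[Map[String, String]] = {
  val known: Map[String, String] =
    existing.map(r => r("name").toLowerCase.trim -> r("match_id")).toMap
  var nextId = existing.map(_("match_id").toInt).maxOption.getOrElse(-1) + 1
  incremental.map { r =>
    known.get(r("name").toLowerCase.trim) match {
      case Some(id) => r + ("match_id" -> id)   // joins an existing group
      case None =>
        val id = nextId.toString                // starts a new group
        nextId += 1
        r + ("match_id" -> id)
    }
  }
}

val existingRecs = Seq(Map("name" -> "Jane Doe", "match_id" -> "0"))
val incrementalRecs = Seq(Map("name" -> "jane doe"), Map("name" -> "New Person"))
val out = incrementalMatch(existingRecs, incrementalRecs)
```

This is why the existing frame must already carry group IDs: they are the anchor that new records are matched against.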

# AWS Glue Scala IntegerNode APIs
<a name="glue-etl-scala-apis-glue-types-integernode"></a>

**Package: com.amazonaws.services.glue.types**

## IntegerNode case class
<a name="glue-etl-scala-apis-glue-types-integernode-case-class"></a>

**IntegerNode**

```
case class IntegerNode(value: Int) extends ScalarNode(value, TypeCode.INT)
```

### IntegerNode val fields
<a name="glue-etl-scala-apis-glue-types-integernode-case-class-vals"></a>
+ `ordering`

### IntegerNode def methods
<a name="glue-etl-scala-apis-glue-types-integernode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala LongNode APIs
<a name="glue-etl-scala-apis-glue-types-longnode"></a>

**Package: com.amazonaws.services.glue.types**

## LongNode case class
<a name="glue-etl-scala-apis-glue-types-longnode-case-class"></a>

 **LongNode**

```
case class LongNode(value: Long) extends ScalarNode(value, TypeCode.LONG)
```

### LongNode val fields
<a name="glue-etl-scala-apis-glue-types-longnode-case-class-vals"></a>
+ `ordering`

### LongNode def methods
<a name="glue-etl-scala-apis-glue-types-longnode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala MapLikeNode APIs
<a name="glue-etl-scala-apis-glue-types-maplikenode"></a>

**Package: com.amazonaws.services.glue.types**

## MapLikeNode class
<a name="glue-etl-scala-apis-glue-types-maplikenode-class"></a>

**MapLikeNode**

```
class MapLikeNode(value: mutable.Map[String, DynamicNode]) extends DynamicNode
```

### MapLikeNode def methods
<a name="glue-etl-scala-apis-glue-types-maplikenode-class-defs"></a>

```
def clear : Unit 
```

```
def get( name : String ) : Option[DynamicNode] 
```

```
def getValue
```

```
def has( name : String ) : Boolean 
```

```
def isEmpty : Boolean 
```

```
def put( name : String,
         node : DynamicNode
       ) : Option[DynamicNode]
```

```
def remove( name : String ) : Option[DynamicNode] 
```

```
def toIterator : Iterator[(String, DynamicNode)] 
```

```
def toJson : String 
```

```
def toJson( useQuotes : Boolean ) : String 
```

**Example:** Given this JSON: 

```
{"foo": "bar"}
```

If `useQuotes == true`, `toJson` yields `{"foo": "bar"}`. If `useQuotes == false`, `toJson` yields `{foo: bar}`.
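That quoting behavior can be sketched in plain Scala for a flat map of string values (illustrative only; this does not use the Glue classes, and real nodes can of course be nested):

```scala
// Illustrative only: mirrors the documented useQuotes behavior of
// MapLikeNode.toJson for a flat map of string values.
def toJson(value: Map[String, String], useQuotes: Boolean): String = {
  val body = value
    .map { case (k, v) =>
      if (useQuotes) s""""$k": "$v"""" else s"$k: $v"
    }
    .mkString(", ")
  s"{$body}"
}

val quoted   = toJson(Map("foo" -> "bar"), useQuotes = true)
val unquoted = toJson(Map("foo" -> "bar"), useQuotes = false)
```

Only the quoted form is valid JSON; the unquoted form is useful for compact debug output.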

# AWS Glue Scala MapNode APIs
<a name="glue-etl-scala-apis-glue-types-mapnode"></a>

**Package: com.amazonaws.services.glue.types**

## MapNode case class
<a name="glue-etl-scala-apis-glue-types-mapnode-case-class"></a>

 **MapNode**

```
case class MapNode(value: mutable.Map[String, DynamicNode]) extends MapLikeNode(value)
```

### MapNode def methods
<a name="glue-etl-scala-apis-glue-types-mapnode-case-class-defs"></a>

```
def clone
```

```
def equals( other : Any )
```

```
def hashCode : Int 
```

```
def nodeType
```

```
def this
```

# AWS Glue Scala NullNode APIs
<a name="glue-etl-scala-apis-glue-types-nullnode"></a>

**Topics**
+ [NullNode class](#glue-etl-scala-apis-glue-types-nullnode-class)
+ [NullNode case object](#glue-etl-scala-apis-glue-types-nullnode-case-object)

**Package: com.amazonaws.services.glue.types**

## NullNode class
<a name="glue-etl-scala-apis-glue-types-nullnode-class"></a>

 **NullNode**

```
class NullNode
```

## NullNode case object
<a name="glue-etl-scala-apis-glue-types-nullnode-case-object"></a>

 **NullNode**

```
case object NullNode extends NullNode 
```

# AWS Glue Scala ObjectNode APIs
<a name="glue-etl-scala-apis-glue-types-objectnode"></a>

**Topics**
+ [ObjectNode object](#glue-etl-scala-apis-glue-types-objectnode-object)
+ [ObjectNode case class](#glue-etl-scala-apis-glue-types-objectnode-case-class)

**Package: com.amazonaws.services.glue.types**

## ObjectNode object
<a name="glue-etl-scala-apis-glue-types-objectnode-object"></a>

**ObjectNode**

```
object ObjectNode
```

### ObjectNode def methods
<a name="glue-etl-scala-apis-glue-types-objectnode-object-defs"></a>

```
def apply( frameKeys : Set[String],
           v1 : mutable.Map[String, DynamicNode],
           v2 : mutable.Map[String, DynamicNode],
           resolveWith : String
         ) : ObjectNode
```

## ObjectNode case class
<a name="glue-etl-scala-apis-glue-types-objectnode-case-class"></a>

 **ObjectNode**

```
case class ObjectNode(value: mutable.Map[String, DynamicNode]) extends MapLikeNode(value)
```

### ObjectNode def methods
<a name="glue-etl-scala-apis-glue-types-objectnode-case-class-defs"></a>

```
def clone
```

```
def equals( other : Any )
```

```
def hashCode : Int 
```

```
def nodeType
```

```
def this
```

# AWS Glue Scala ScalarNode APIs
<a name="glue-etl-scala-apis-glue-types-scalarnode"></a>

**Topics**
+ [ScalarNode class](#glue-etl-scala-apis-glue-types-scalarnode-class)
+ [ScalarNode object](#glue-etl-scala-apis-glue-types-scalarnode-object)

**Package: com.amazonaws.services.glue.types**

## ScalarNode class
<a name="glue-etl-scala-apis-glue-types-scalarnode-class"></a>

**ScalarNode**

```
class ScalarNode(value: Any, scalarType: TypeCode) extends DynamicNode
```

### ScalarNode def methods
<a name="glue-etl-scala-apis-glue-types-scalarnode-class-defs"></a>

```
def compare( other : Any,
             operator : String
           ) : Boolean
```

```
def getValue
```

```
def hashCode : Int 
```

```
def nodeType
```

```
def toJson
```

## ScalarNode object
<a name="glue-etl-scala-apis-glue-types-scalarnode-object"></a>

 **ScalarNode**

```
object ScalarNode
```

### ScalarNode def methods
<a name="glue-etl-scala-apis-glue-types-scalarnode-object-defs"></a>

```
def apply( v : Any ) : DynamicNode 
```

```
def compare( tv : Ordered[T],
             other : T,
             operator : String
           ) : Boolean
```

```
def compareAny( v : Any,
                y : Any,
                o : String )
```

```
def withEscapedSpecialCharacters( jsonToEscape : String ) : String 
```
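The reference doesn't spell out which characters `withEscapedSpecialCharacters` handles. As a hedged sketch (the actual Glue rule set may differ), the standard JSON string escapes such a helper would typically apply look like this:

```scala
// Illustrative only: the usual JSON string escapes. The exact set of
// characters the Glue helper escapes is not documented here.
def escapeJson(s: String): String =
  s.flatMap {
    case '"'  => "\\\""   // double quote
    case '\\' => "\\\\"   // backslash
    case '\n' => "\\n"    // newline
    case '\r' => "\\r"    // carriage return
    case '\t' => "\\t"    // tab
    case c    => c.toString
  }

val escapedQuote   = escapeJson("a\"b")
val escapedNewline = escapeJson("line\n")
```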

# AWS Glue Scala ShortNode APIs
<a name="glue-etl-scala-apis-glue-types-shortnode"></a>

**Package: com.amazonaws.services.glue.types**

## ShortNode case class
<a name="glue-etl-scala-apis-glue-types-shortnode-case-class"></a>

**ShortNode**

```
case class ShortNode(value: Short) extends ScalarNode(value, TypeCode.SHORT)
```

### ShortNode val fields
<a name="glue-etl-scala-apis-glue-types-shortnode-case-class-vals"></a>
+ `ordering`

### ShortNode def methods
<a name="glue-etl-scala-apis-glue-types-shortnode-case-class-defs"></a>

```
def equals( other : Any )
```

# AWS Glue Scala StringNode APIs
<a name="glue-etl-scala-apis-glue-types-stringnode"></a>

**Package: com.amazonaws.services.glue.types**

## StringNode case class
<a name="glue-etl-scala-apis-glue-types-stringnode-case-class"></a>

 **StringNode**

```
case class StringNode(value: String) extends ScalarNode(value, TypeCode.STRING)
```

### StringNode val fields
<a name="glue-etl-scala-apis-glue-types-stringnode-case-class-vals"></a>
+ `ordering`

### StringNode def methods
<a name="glue-etl-scala-apis-glue-types-stringnode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : UTF8String )
```

# AWS Glue Scala TimestampNode APIs
<a name="glue-etl-scala-apis-glue-types-timestampnode"></a>

**Package: com.amazonaws.services.glue.types**

## TimestampNode case class
<a name="glue-etl-scala-apis-glue-types-timestampnode-case-class"></a>

**TimestampNode**

```
case class TimestampNode(value: Timestamp) extends ScalarNode(value, TypeCode.TIMESTAMP)
```

### TimestampNode val fields
<a name="glue-etl-scala-apis-glue-types-timestampnode-case-class-vals"></a>
+ `ordering`

### TimestampNode def methods
<a name="glue-etl-scala-apis-glue-types-timestampnode-case-class-defs"></a>

```
def equals( other : Any )
```

```
def this( value : Long )
```

# AWS Glue Scala GlueArgParser APIs
<a name="glue-etl-scala-apis-glue-util-glueargparser"></a>

**Package: com.amazonaws.services.glue.util**

## GlueArgParser object
<a name="glue-etl-scala-apis-glue-util-glueargparser-object"></a>

**GlueArgParser**

```
object GlueArgParser
```

This is strictly consistent with the Python version of `utils.getResolvedOptions` in the `AWSGlueDataplanePython` package.

### GlueArgParser def methods
<a name="glue-etl-scala-apis-glue-util-glueargparser-object-defs"></a>

```
def getResolvedOptions( args : Array[String],
                        options : Array[String]
                      ) : Map[String, String]
```

```
def initParser( userOptionsSet : mutable.Set[String] ) : ArgumentParser 
```

**Example Retrieving arguments passed to a job**  
To retrieve job arguments, you can use the `getResolvedOptions` method. Consider the following example, which retrieves a job argument named `aws_region`.  

```
val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME","aws_region").toArray)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val region = args("aws_region")
println(region)
```

# AWS Glue Scala job APIs
<a name="glue-etl-scala-apis-glue-util-job"></a>

**Package: com.amazonaws.services.glue.util**

## Job object
<a name="glue-etl-scala-apis-glue-util-job-object"></a>

 **Job**

```
object Job
```

### Job def methods
<a name="glue-etl-scala-apis-glue-util-job-object-defs"></a>

```
def commit
```

```
def init( jobName : String,
          glueContext : GlueContext,
          args : java.util.Map[String, String] = Map[String, String]().asJava
        ) : this.type
```

```
def init( jobName : String,
          glueContext : GlueContext,
          endpoint : String,
          args : java.util.Map[String, String]
        ) : this.type
```

```
def isInitialized
```

```
def reset
```

```
def runId
```

# Features and optimizations for programming AWS Glue for Spark ETL scripts
<a name="aws-glue-programming-general"></a>

The following sections describe techniques and values that apply generally to AWS Glue for Spark ETL (extract, transform, and load) programming in any language.

**Topics**
+ [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md)
+ [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md)
+ [AWS Glue Data Catalog support for Spark SQL jobs](aws-glue-programming-etl-glue-data-catalog-hive.md)
+ [Using job bookmarks](programming-etl-connect-bookmarks.md)
+ [Using Sensitive Data Detection outside AWS Glue Studio](aws-glue-api-sensitive-data-example.md)
+ [AWS Glue Visual Job API](visual-job-api-chapter.md)

# Connection types and options for ETL in AWS Glue for Spark
<a name="aws-glue-programming-etl-connect"></a>

In AWS Glue for Spark, various PySpark and Scala methods and transforms specify the connection type using a `connectionType` parameter. They specify connection options using a `connectionOptions` or `options` parameter.

The `connectionType` parameter can take the values shown in the following table. The associated `connectionOptions` (or `options`) parameter values for each type are documented in the following sections. Except where otherwise noted, the parameters apply when the connection is used as a source or sink.

For sample code that demonstrates setting and using connection options, see the homepage for each connection type.


| `connectionType` | Connects to | 
| --- | --- | 
| [dynamodb](aws-glue-programming-etl-connect-dynamodb-home.md) | [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/) database | 
| [kinesis](aws-glue-programming-etl-connect-kinesis-home.md) | [Amazon Kinesis Data Streams](https://docs.aws.amazon.com/streams/latest/dev/introduction.html) | 
| [s3](aws-glue-programming-etl-connect-s3-home.md) | [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/) | 
| [documentdb](aws-glue-programming-etl-connect-documentdb-home.md#aws-glue-programming-etl-connect-documentdb) | [Amazon DocumentDB (with MongoDB compatibility)](https://docs.aws.amazon.com/documentdb/latest/developerguide/) database | 
| [opensearch](aws-glue-programming-etl-connect-opensearch-home.md) | [Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/). | 
| [redshift](aws-glue-programming-etl-connect-redshift-home.md) | [Amazon Redshift](https://aws.amazon.com/redshift/) database | 
| [kafka](aws-glue-programming-etl-connect-kafka-home.md) |  [Kafka](https://kafka.apache.org/) or [Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html) | 
| [azurecosmos](aws-glue-programming-etl-connect-azurecosmos-home.md) | Azure Cosmos for NoSQL. | 
| [azuresql](aws-glue-programming-etl-connect-azuresql-home.md) | Azure SQL. | 
| [bigquery](aws-glue-programming-etl-connect-bigquery-home.md) | Google BigQuery. | 
| [mongodb](aws-glue-programming-etl-connect-mongodb-home.md) | [MongoDB](https://www.mongodb.com/what-is-mongodb) database, including MongoDB Atlas. | 
| [sqlserver](aws-glue-programming-etl-connect-jdbc-home.md) |  Microsoft SQL Server database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [mysql](aws-glue-programming-etl-connect-jdbc-home.md) | [MySQL](https://www.mysql.com/) database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [oracle](aws-glue-programming-etl-connect-jdbc-home.md) | [Oracle](https://www.oracle.com/database/) database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [postgresql](aws-glue-programming-etl-connect-jdbc-home.md) |  [PostgreSQL](https://www.postgresql.org/) database (see [JDBC connections](aws-glue-programming-etl-connect-jdbc-home.md)) | 
| [saphana](aws-glue-programming-etl-connect-saphana-home.md) | SAP HANA. | 
| [snowflake](aws-glue-programming-etl-connect-snowflake-home.md) | [Snowflake](https://www.snowflake.com/) data lake | 
| [teradata](aws-glue-programming-etl-connect-teradata-home.md) | Teradata Vantage. | 
| [vertica](aws-glue-programming-etl-connect-vertica-home.md) | Vertica. | 
| [custom.\*](#aws-glue-programming-etl-connect-market) | Spark, Athena, or JDBC data stores (see [Custom and AWS Marketplace connectionType values](#aws-glue-programming-etl-connect-market))  | 
| [marketplace.\*](#aws-glue-programming-etl-connect-market) | Spark, Athena, or JDBC data stores (see [Custom and AWS Marketplace connectionType values](#aws-glue-programming-etl-connect-market))  | 

# DynamoDB connections
<a name="aws-glue-programming-etl-connect-dynamodb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in DynamoDB in AWS Glue. You connect to DynamoDB using IAM permissions attached to your AWS Glue job. AWS Glue supports writing data into another AWS account's DynamoDB table. For more information, see [Cross-account cross-Region access to DynamoDB tables](aws-glue-programming-etl-dynamo-db-cross-account.md).

The original DynamoDB connector uses Glue DynamicFrame objects to work with the data extracted from DynamoDB. AWS Glue 5.0 introduces a new [DynamoDB connector with Spark DataFrame support](aws-glue-programming-etl-connect-dynamodb-dataframe-support.md) that provides native Spark DataFrame support.

In addition to the AWS Glue DynamoDB ETL connector, you can read from DynamoDB by using the DynamoDB export connector, which invokes a DynamoDB `ExportTableToPointInTime` request and stores the exported data in an Amazon S3 location that you supply, in [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html) format. AWS Glue then creates a DynamicFrame object by reading the data from the Amazon S3 export location.

The DynamoDB writer is available in AWS Glue version 1.0 or later versions. The AWS Glue DynamoDB export connector is available in AWS Glue version 2.0 or later versions. The new DataFrame-based DynamoDB connector is available in AWS Glue version 5.0 or later versions.

For more information about DynamoDB, consult the [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/) documentation.

**Note**  
The DynamoDB ETL reader does not support filters or pushdown predicates.

## Configuring DynamoDB connections
<a name="aws-glue-programming-etl-connect-dynamodb-configure"></a>

To connect to DynamoDB from AWS Glue, grant the IAM role associated with your AWS Glue job permission to interact with DynamoDB. For more information about permissions necessary to read or write from DynamoDB, consult [Actions, resources, and condition keys for DynamoDB](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazondynamodb.html) in the IAM documentation.

In the following situations, you may need additional configuration:
+ When using the DynamoDB export connector, you will need to configure IAM so that your job can request DynamoDB table exports. In addition, you will need to identify an Amazon S3 bucket for the export and provide appropriate permissions in IAM for DynamoDB to write to it, and for your AWS Glue job to read from it. For more information, consult [Request a table export in DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport_Requesting.html).
+ If your AWS Glue job has specific Amazon VPC connectivity requirements, use the `NETWORK` AWS Glue connection type to provide network options. Because access to DynamoDB is authorized by IAM, there is no need for an AWS Glue DynamoDB connection type.

## Reading from and writing to DynamoDB
<a name="aws-glue-programming-etl-connect-dynamodb-read-write"></a>

The following code examples show how to read from (via the ETL connector) and write to DynamoDB tables. They demonstrate reading from one table and writing to another table.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context= GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": test_source,
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100"
    }
)
print(dyf.getNumPartitions())

glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": test_sink,
        "dynamodb.throughput.write.percent": "1.0"
    }
)

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.input.tableName" -> test_source,
        "dynamodb.throughput.read.percent" -> "1.0",
        "dynamodb.splits" -> "100"
      ))
    ).getDynamicFrame()
    
    print(dynamicFrame.getNumPartitions())

    val dynamoDbSink: DynamoDbDataSink =  glueContext.getSinkWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.output.tableName" -> test_sink,
        "dynamodb.throughput.write.percent" -> "1.0"
      ))
    ).asInstanceOf[DynamoDbDataSink]
    
    dynamoDbSink.writeDynamicFrame(dynamicFrame)

    Job.commit()
  }

}
```

------

## Using the DynamoDB export connector
<a name="aws-glue-programming-etl-connect-dynamodb-export-connector"></a>

The export connector performs better than the ETL connector when the DynamoDB table is larger than 80 GB. In addition, because the export request runs outside of the Spark processes in an AWS Glue job, you can enable [auto scaling of AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/auto-scaling.html) to save DPU usage during the export request. With the export connector, you also do not need to configure the number of splits for Spark executor parallelism or the DynamoDB throughput read percentage.

**Note**  
DynamoDB has specific requirements to invoke the `ExportTableToPointInTime` requests. For more information, see [Requesting a table export in DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Requesting.html). For example, Point-in-Time-Restore (PITR) needs to be enabled on the table to use this connector. The DynamoDB connector also supports AWS KMS encryption for DynamoDB exports to Amazon S3. Supplying your security configuration in the AWS Glue job configuration enables AWS KMS encryption for a DynamoDB export. The KMS key must be in the same Region as the Amazon S3 bucket.  
Note that additional charges for the DynamoDB export and Amazon S3 storage apply. Exported data in Amazon S3 persists after a job run finishes, so you can reuse it without additional DynamoDB exports.  
Neither the DynamoDB ETL connector nor the export connector supports filters or pushdown predicates applied at the DynamoDB source.

The following code examples show how to read from a DynamoDB table by using the export connector and print the number of partitions.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": test_source,
        "dynamodb.s3.bucket": bucket_name,
        "dynamodb.s3.prefix": bucket_prefix,
        "dynamodb.s3.bucketOwner": account_id_of_bucket,
    }
)
print(dyf.getNumPartitions())

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "dynamodb",
      options = JsonOptions(Map(
        "dynamodb.export" -> "ddb",
        "dynamodb.tableArn" -> test_source,
        "dynamodb.s3.bucket" -> bucket_name,
        "dynamodb.s3.prefix" -> bucket_prefix,
        "dynamodb.s3.bucketOwner" -> account_id_of_bucket,
      ))
    ).getDynamicFrame()
    
    print(dynamicFrame.getNumPartitions())

    Job.commit()
  }

}
```

------

These examples show how to read (via the export connector) from an AWS Glue Data Catalog table that has a `dynamodb` classification and print the number of partitions:

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dynamicFrame = glue_context.create_dynamic_frame.from_catalog(
        database=catalog_database,
        table_name=catalog_table_name,
        additional_options={
            "dynamodb.export": "ddb", 
            "dynamodb.s3.bucket": s3_bucket,
            "dynamodb.s3.prefix": s3_bucket_prefix
        }
    )
print(dynamicFrame.getNumPartitions())

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamoDbDataSink
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val dynamicFrame = glueContext.getCatalogSource(
        database = catalog_database,
        tableName = catalog_table_name,
        additionalOptions = JsonOptions(Map(
            "dynamodb.export" -> "ddb", 
            "dynamodb.s3.bucket" -> s3_bucket,
            "dynamodb.s3.prefix" -> s3_bucket_prefix
        ))
    ).getDynamicFrame()
    print(dynamicFrame.getNumPartitions())

    Job.commit()
  }

}
```

------

## Simplifying usage of DynamoDB export JSON
<a name="etl-connect-dynamodb-traversing-structure"></a>

DynamoDB exports with the AWS Glue DynamoDB export connector result in JSON files with a specific nested structure. For more information, see [Data objects](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html). AWS Glue supplies a DynamicFrame transformation that can unnest such structures into an easier-to-use form for downstream applications.

The transform can be invoked in one of two ways. You can set the connection option `"dynamodb.simplifyDDBJson"` to `"true"` when calling a method that reads from DynamoDB, or you can call the transform as a standalone method available in the AWS Glue library.

Consider the following schema generated by a DynamoDB export:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

The `simplifyDDBJson` transform will simplify this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```
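To make the mapping between the two schemas concrete, the following plain-Python sketch (an illustration only, not the AWS Glue implementation) shows how DynamoDB-JSON type descriptors such as `S`, `N`, `M`, and `SS` are unwrapped:

```python
# Hypothetical sketch that strips DynamoDB JSON type descriptors the way
# simplifyDDBJson does conceptually. This is NOT the AWS Glue implementation;
# it only illustrates the schema mapping shown above.

def simplify_ddb_json(value):
    """Recursively unwrap a DynamoDB-JSON attribute value."""
    if isinstance(value, dict) and len(value) == 1:
        tag, inner = next(iter(value.items()))
        if tag in ("S", "N"):            # strings and numbers stay as strings
            return inner
        if tag == "BOOL":
            return inner
        if tag == "NULL":
            return None
        if tag in ("SS", "NS", "BS"):    # sets become plain arrays
            return list(inner)
        if tag == "M":                   # maps are unwrapped recursively
            return {k: simplify_ddb_json(v) for k, v in inner.items()}
        if tag == "L":                   # lists are unwrapped element by element
            return [simplify_ddb_json(v) for v in inner]
    return value

item = {
    "parentMap": {"M": {"childMap": {"M": {"appName": {"S": "myApp"},
                                           "updatedAt": {"N": "1710000000"}}}}},
    "strings": {"SS": ["a", "b"]},
    "isDDBJson": {"BOOL": True},
    "nullValue": {"NULL": True},
}
simplified = {k: simplify_ddb_json(v) for k, v in item.items()}
print(simplified["parentMap"]["childMap"]["appName"])  # myApp
```

Note that, as in the simplified schema above, number values remain strings after the transformation.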

**Note**  
`simplifyDDBJson` is available in AWS Glue 3.0 and later versions. The `unnestDDBJson` transform is also available to simplify DynamoDB export JSON, but we encourage you to transition from `unnestDDBJson` to `simplifyDDBJson`.

## Configuring parallelism in DynamoDB operations
<a name="aws-glue-programming-etl-connect-dynamodb-parallelism"></a>

To improve performance, you can tune certain parameters available for the DynamoDB connector. Your goal when tuning parallelism parameters is to maximize the use of the provisioned AWS Glue workers. If you then need more performance, we recommend scaling out your job by increasing the number of DPUs.

You can alter the parallelism of a DynamoDB read operation with the `dynamodb.splits` parameter when using the ETL connector. When reading with the export connector, you do not need to configure the number of splits for Spark executor parallelism. You can alter the parallelism of a DynamoDB write operation with `dynamodb.output.numParallelTasks`.

**Reading with the DynamoDB ETL connector**

We recommend that you calculate `dynamodb.splits` based on the maximum number of workers set in your job configuration and the following `numSlots` calculation. If auto scaling is enabled, the actual number of workers available may vary below that cap. For more information about setting the maximum number of workers, see **Number of workers** (`NumberOfWorkers`) in [Configuring job properties for Spark jobs in AWS Glue](add-job.md). 
+ `numExecutors = NumberOfWorkers - 1`

   For context, one executor is reserved for the Spark driver; other executors are used to process data.
+ `numSlotsPerExecutor =`

------
#### [ AWS Glue 3.0 and later versions ]
  + `4` if `WorkerType` is `G.1X`
  + `8` if `WorkerType` is `G.2X`
  + `16` if `WorkerType` is `G.4X`
  + `32` if `WorkerType` is `G.8X`

------
#### [ AWS Glue 2.0 and legacy versions ]
  + `8` if `WorkerType` is `G.1X`
  + `16` if `WorkerType` is `G.2X`

------
+ `numSlots = numSlotsPerExecutor * numExecutors`

We recommend that you set `dynamodb.splits` to `numSlots`, the number of slots available.
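The calculation above can be sketched as a small helper. This is an illustration only; the per-worker slot counts below mirror the documented values for AWS Glue 3.0 and later:

```python
# Sketch of the numSlots calculation described above, for AWS Glue 3.0+.
SLOTS_PER_EXECUTOR = {"G.1X": 4, "G.2X": 8, "G.4X": 16, "G.8X": 32}

def recommended_splits(number_of_workers, worker_type):
    """Return numSlots = numSlotsPerExecutor * numExecutors."""
    num_executors = number_of_workers - 1   # one executor is reserved for the driver
    return SLOTS_PER_EXECUTOR[worker_type] * num_executors

# For example, a job capped at 10 G.1X workers yields (10 - 1) * 4 = 36 slots,
# so dynamodb.splits would be set to "36".
print(recommended_splits(10, "G.1X"))
```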

**Writing to DynamoDB**

The `dynamodb.output.numParallelTasks` parameter is used to determine WCU per Spark task, using the following calculation:

`permittedWcuPerTask = ( TableWCU * dynamodb.throughput.write.percent ) / dynamodb.output.numParallelTasks`

The DynamoDB writer functions best when the configuration accurately represents the number of Spark tasks writing to DynamoDB. In some cases, you may need to override the default calculation to improve write performance. If you do not specify this parameter, the permitted WCU per Spark task is automatically calculated using the following formula:
+ `numPartitions = dynamicframe.getNumPartitions()`
+ `numSlots` (as defined previously in this section)
+ `numParallelTasks = min(numPartitions, numSlots)`
+ Example 1. DPU=10, WorkerType=Standard. Input DynamicFrame has 100 RDD partitions.
  + `numPartitions = 100`
  + `numExecutors = (10 - 1) * 2 - 1 = 17`
  + `numSlots = 4 * 17 = 68`
  + `numParallelTasks = min(100, 68) = 68`
+ Example 2. DPU=10, WorkerType=Standard. Input DynamicFrame has 20 RDD partitions.
  + `numPartitions = 20`
  + `numExecutors = (10 - 1) * 2 - 1 = 17`
  + `numSlots = 4 * 17 = 68`
  + `numParallelTasks = min(20, 68) = 20`
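The default calculation and the resulting per-task WCU budget can be sketched in plain Python. This is a sketch only, with hypothetical table numbers; the real values are computed inside the connector:

```python
# Sketch of the default numParallelTasks and permitted-WCU calculation
# described above; not the connector's actual implementation.

def num_parallel_tasks(num_partitions, num_slots):
    # The connector caps parallel write tasks at the available slots.
    return min(num_partitions, num_slots)

def permitted_wcu_per_task(table_wcu, write_percent, parallel_tasks):
    # permittedWcuPerTask = (TableWCU * write.percent) / numParallelTasks
    return (table_wcu * write_percent) / parallel_tasks

# Example 1 from above: 100 partitions, 68 slots -> 68 parallel tasks.
tasks = num_parallel_tasks(100, 68)
# With a hypothetical 1000-WCU table and the default write percent of 0.5,
# each task may consume about 7.35 WCU.
print(round(permitted_wcu_per_task(1000, 0.5, tasks), 2))
```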

**Note**  
Jobs on legacy AWS Glue versions and jobs that use Standard workers require different methods to calculate the number of slots. If you need to tune the performance of these jobs, we recommend that you transition to supported AWS Glue versions.

## DynamoDB connection option reference
<a name="aws-glue-programming-etl-connect-dynamodb"></a>

Designates a connection to Amazon DynamoDB.

Connection options differ for a source connection and a sink connection.

### "connectionType": "dynamodb" with the ETL connector as source
<a name="etl-connect-dynamodb-as-source"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a source, when using the AWS Glue DynamoDB ETL connector:
+ `"dynamodb.input.tableName"`: (Required) The DynamoDB table to read from.
+ `"dynamodb.throughput.read.percent"`: (Optional) The percentage of read capacity units (RCU) to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
  + `0.5` represents the default read rate, meaning that AWS Glue will attempt to consume half of the read capacity of the table. If you increase the value above `0.5`, AWS Glue increases the request rate; decreasing the value below `0.5` decreases the read request rate. (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)
  + When the DynamoDB table is in on-demand mode, AWS Glue handles the read capacity of the table as `40000`. For exporting a large table, we recommend switching your DynamoDB table to on-demand mode.
+ `"dynamodb.splits"`: (Optional) Defines how many splits we partition this DynamoDB table into while reading. The default is set to "1". Acceptable values are from "1" to "1,000,000", inclusive.

  `1` means there is no parallelism. We highly recommend that you specify a larger value for better performance. For guidance on choosing a value, see [Configuring parallelism in DynamoDB operations](#aws-glue-programming-etl-connect-dynamodb-parallelism).
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access. This parameter is available in AWS Glue 1.0 or later.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-read-sts-session". This parameter is available in AWS Glue 1.0 or later.
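The read-rate semantics above can be sketched as a simple calculation. This is an illustration under the stated rules (on-demand tables are treated as having a read capacity of 40000), not connector code:

```python
# Sketch of how the read target described above is reasoned about;
# not the connector's implementation.
ON_DEMAND_READ_CAPACITY = 40000  # how AWS Glue treats on-demand tables

def target_read_rcu(table_rcu=None, read_percent=0.5):
    """Return the RCU that AWS Glue attempts to consume for the table."""
    capacity = table_rcu if table_rcu is not None else ON_DEMAND_READ_CAPACITY
    return capacity * read_percent

# A 1000-RCU provisioned table at the default percent yields a 500-RCU target.
print(target_read_rcu(1000))
# An on-demand table at the default percent yields a 20000-RCU target.
print(target_read_rcu())
```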

### "connectionType": "dynamodb" with the AWS Glue DynamoDB export connector as source
<a name="etl-connect-dynamodb-as-source-export-connector"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a source when using the AWS Glue DynamoDB export connector, which is available only in AWS Glue version 2.0 and later:
+ `"dynamodb.export"`: (Required) A string value:
  + If set to `ddb`, enables the AWS Glue DynamoDB export connector, where a new `ExportTableToPointInTimeRequest` is invoked during the AWS Glue job. A new export is generated with the location passed from `dynamodb.s3.bucket` and `dynamodb.s3.prefix`.
  + If set to `s3`, enables the AWS Glue DynamoDB export connector but skips the creation of a new DynamoDB export and instead uses `dynamodb.s3.bucket` and `dynamodb.s3.prefix` as the Amazon S3 location of a past export of that table.
+ `"dynamodb.tableArn"`: (Required) The DynamoDB table to read from.
+ `"dynamodb.unnestDDBJson"`: (Optional) Default: false. Valid values: boolean. If set to true, performs an unnest transformation of the DynamoDB JSON structure that is present in exports. It is an error to set `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` to true at the same time. In AWS Glue 3.0 and later versions, we recommend you use `"dynamodb.simplifyDDBJson"` for better behavior when simplifying DynamoDB Map types. For more information, see [Simplifying usage of DynamoDB export JSON](#etl-connect-dynamodb-traversing-structure). 
+ `"dynamodb.simplifyDDBJson"`: (Optional) Default: false. Valid values: boolean. If set to true, performs a transformation to simplify the schema of the DynamoDB JSON structure that is present in exports. This has the same purpose as the `"dynamodb.unnestDDBJson"` option but provides better support for DynamoDB Map types or even nested Map types in your DynamoDB table. This option is available in AWS Glue 3.0 and later versions. It is an error to set `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` to true at the same time. For more information, see [Simplifying usage of DynamoDB export JSON](#etl-connect-dynamodb-traversing-structure).
+ `"dynamodb.s3.bucket"`: (Optional) Indicates the Amazon S3 bucket location in which the DynamoDB `ExportTableToPointInTime` process is to be conducted. The file format for the export is DynamoDB JSON.
  + `"dynamodb.s3.prefix"`: (Optional) Indicates the Amazon S3 prefix location inside the Amazon S3 bucket in which the DynamoDB `ExportTableToPointInTime` loads are to be stored. If neither `dynamodb.s3.prefix` nor `dynamodb.s3.bucket` are specified, these values will default to the Temporary Directory location specified in the AWS Glue job configuration. For more information, see [Special Parameters Used by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html).
  + `"dynamodb.s3.bucketOwner"`: Indicates the bucket owner needed for cross-account Amazon S3 access.
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access and/or cross-Region access for the DynamoDB table. Note: The same IAM role ARN will be used to access the Amazon S3 location specified for the `ExportTableToPointInTime` request.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-read-sts-session".
+ `"dynamodb.exportTime"`: (Optional) Valid values: strings representing ISO-8601 instants. A point in time at which the export should be made. 
+ `"dynamodb.sts.region"`: (Required if making a cross-Region call using a Regional endpoint) The Region hosting the DynamoDB table that you want to read.
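Because `"dynamodb.unnestDDBJson"` and `"dynamodb.simplifyDDBJson"` are mutually exclusive, a hypothetical helper (not part of the AWS Glue API) for assembling export-connector options could enforce that constraint before the job runs:

```python
# Hypothetical helper that assembles export-connector options and rejects
# the invalid combination described above. Not part of the AWS Glue API.

def build_export_options(table_arn, bucket=None, prefix=None,
                         unnest=False, simplify=False):
    if unnest and simplify:
        raise ValueError(
            "dynamodb.unnestDDBJson and dynamodb.simplifyDDBJson "
            "cannot both be true")
    options = {"dynamodb.export": "ddb", "dynamodb.tableArn": table_arn}
    if bucket:
        options["dynamodb.s3.bucket"] = bucket
    if prefix:
        options["dynamodb.s3.prefix"] = prefix
    if unnest:
        options["dynamodb.unnestDDBJson"] = "true"
    if simplify:
        options["dynamodb.simplifyDDBJson"] = "true"
    return options

opts = build_export_options(
    "arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
    bucket="my-bucket", simplify=True)
print(opts["dynamodb.simplifyDDBJson"])  # true
```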

### "connectionType": "dynamodb" with the ETL connector as sink
<a name="etl-connect-dynamodb-as-sink"></a>

Use the following connection options with `"connectionType": "dynamodb"` as a sink:
+ `"dynamodb.output.tableName"`: (Required) The DynamoDB table to write to.
+ `"dynamodb.throughput.write.percent"`: (Optional) The percentage of write capacity units (WCU) to use. The default is set to "0.5". Acceptable values are from "0.1" to "1.5", inclusive.
  + `0.5` represents the default write rate, meaning that AWS Glue will attempt to consume half of the write capacity of the table. If you increase the value above 0.5, AWS Glue increases the request rate; decreasing the value below 0.5 decreases the write request rate. (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table).
  + When the DynamoDB table is in on-demand mode, AWS Glue handles the write capacity of the table as `40000`. For importing a large table, we recommend switching your DynamoDB table to on-demand mode.
+ `"dynamodb.output.numParallelTasks"`: (Optional) Defines how many parallel tasks write into DynamoDB at the same time. Used to calculate the permitted WCU per Spark task. In most cases, AWS Glue calculates a reasonable default for this value. For more information, see [Configuring parallelism in DynamoDB operations](#aws-glue-programming-etl-connect-dynamodb-parallelism).
+ `"dynamodb.output.retry"`: (Optional) Defines how many retries we perform when there is a `ProvisionedThroughputExceededException` from DynamoDB. The default is set to "10".
+ `"dynamodb.sts.roleArn"`: (Optional) The IAM role ARN to be assumed for cross-account access.
+ `"dynamodb.sts.roleSessionName"`: (Optional) STS session name. The default is set to "glue-dynamodb-write-sts-session".

# Cross-account cross-Region access to DynamoDB tables
<a name="aws-glue-programming-etl-dynamo-db-cross-account"></a>

AWS Glue ETL jobs support both cross-Region and cross-account access to DynamoDB tables: you can read data from, and write data into, a DynamoDB table in another AWS account, and you can read from, and write into, a DynamoDB table in another Region. This section gives instructions for setting up the access and provides an example script. 

The procedures in this section reference an IAM tutorial for creating an IAM role and granting access to the role. The tutorial also discusses assuming a role, but here you will instead use a job script to assume the role in AWS Glue. This tutorial also contains information about general cross-account practices. For more information, see [Tutorial: Delegate Access Across AWS Accounts Using IAM Roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html) in the *IAM User Guide*.

## Create a role
<a name="aws-glue-programming-etl-dynamo-db-create-role"></a>

Follow [step 1 in the tutorial](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html#tutorial_cross-account-with-roles-1) to create an IAM role in account A. When defining the permissions of the role, you can choose to attach existing policies such as `AmazonDynamoDBReadOnlyAccess`, or `AmazonDynamoDBFullAccess` to allow the role to read/write DynamoDB. The following example shows creating a role named `DynamoDBCrossAccessRole`, with the permission policy `AmazonDynamoDBFullAccess`.

## Grant access to the role
<a name="aws-glue-programming-etl-dynamo-db-grant-access"></a>

Follow [step 2 in the tutorial](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html#tutorial_cross-account-with-roles-2) in the *IAM User Guide* to allow account B to switch to the newly-created role. The following example creates a new policy with the following statement:

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::111122223333:role/DynamoDBCrossAccessRole"
  }
}
```

------

Then, you can attach this policy to the group/role/user you would like to use to access DynamoDB.

## Assume the role in the AWS Glue job script
<a name="aws-glue-programming-etl-dynamo-db-assume-role"></a>

Now, you can log in to account B and create an AWS Glue job. To create a job, refer to the instructions at [Configuring job properties for Spark jobs in AWS Glue](add-job.md). 

In the job script, use the `dynamodb.sts.roleArn` parameter to assume the `DynamoDBCrossAccessRole` role. Assuming this role provides the temporary credentials that are used to access DynamoDB in account B. Review the following example scripts.

For a cross-account read across regions (ETL connector):

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
    "dynamodb.region": "us-east-1",
    "dynamodb.input.tableName": "test_source",
    "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
dyf.show()
job.commit()
```

For a cross-account read across regions (export connector):

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "<test_source ARN>",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
dyf.show()
job.commit()
```

For a read and cross-account write across regions:

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
 
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
 
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "us-east-1",
        "dynamodb.input.tableName": "test_source"
    }
)
dyf.show()
 
glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": "us-west-2",
        "dynamodb.output.tableName": "test_sink",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>"
    }
)
 
job.commit()
```

# DynamoDB connector with Spark DataFrame support
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-support"></a>

The DynamoDB connector with Spark DataFrame support allows you to read from and write to tables in DynamoDB using Spark DataFrame APIs. The connector setup steps are the same as for the DynamicFrame-based connector and are described in [the connector configuration section](aws-glue-programming-etl-connect-dynamodb-home.md#aws-glue-programming-etl-connect-dynamodb-configure).

To load the DataFrame-based connector library, attach a DynamoDB connection to the AWS Glue job.

**Note**  
The AWS Glue console currently does not support creating a DynamoDB connection. You can use the AWS CLI ([CreateConnection](https://docs.aws.amazon.com/cli/latest/reference/glue/create-connection.html)) to create a DynamoDB connection:  

```
aws glue create-connection \
    --connection-input '{
        "Name": "my-dynamodb-connection",
        "ConnectionType": "DYNAMODB",
        "ConnectionProperties": {},
        "ValidateCredentials": false,
        "ValidateForComputeEnvironments": ["SPARK"]
    }'
```

After you create the DynamoDB connection, you can attach it to your AWS Glue job via the CLI ([CreateJob](https://docs.aws.amazon.com/cli/latest/reference/glue/create-job.html), [UpdateJob](https://docs.aws.amazon.com/cli/latest/reference/glue/update-job.html)) or directly on the **Job details** page:

![\[Attaching a DynamoDB connection on the Job details page.\]](http://docs.aws.amazon.com/glue/latest/dg/images/dynamodb-dataframe-connector.png)


After confirming that a connection of type DYNAMODB is attached to your AWS Glue job, you can use the following read, write, and export operations from the DataFrame-based connector.

## Reading from and writing to DynamoDB with the DataFrame-based connector
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-read-write"></a>

The following code examples show how to read from and write to DynamoDB tables via the DataFrame-based connector. They demonstrate reading from one table and writing to another table.

------
#### [ Python ]

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from DynamoDB
df = spark.read.format("dynamodb") \
    .option("dynamodb.input.tableName", "test-source") \
    .option("dynamodb.throughput.read.ratio", "0.5") \
    .option("dynamodb.consistentRead", "false") \
    .load()

print(df.rdd.getNumPartitions())

# Write to DynamoDB
df.write \
  .format("dynamodb") \
  .option("dynamodb.output.tableName", "test-sink") \
  .option("dynamodb.throughput.write.ratio", "0.5") \
  .option("dynamodb.item.size.check.enabled", "true") \
  .save()

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val spark = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val df = spark.read
      .format("dynamodb")
      .option("dynamodb.input.tableName", "test-source")
      .option("dynamodb.throughput.read.ratio", "0.5")
      .option("dynamodb.consistentRead", "false")
      .load()

    print(df.rdd.getNumPartitions)

    df.write
      .format("dynamodb")
      .option("dynamodb.output.tableName", "test-sink")
      .option("dynamodb.throughput.write.ratio", "0.5")
      .option("dynamodb.item.size.check.enabled", "true")
      .save()

    Job.commit()
  }
}
```

------

## Using DynamoDB export via the DataFrame-based connector
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-export"></a>

The export operation is preferred to the read operation for DynamoDB tables larger than 80 GB. The following code example shows how to read from a table, export to Amazon S3, and print the number of partitions via the DataFrame-based connector.

**Note**  
The DynamoDB export functionality is available through the Scala `DynamoDBExport` object. Python users can access it via Spark's JVM interop or use the AWS SDK for Python (boto3) with the DynamoDB `ExportTableToPointInTime` API.

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import glue.spark.dynamodb.DynamoDBExport
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(SparkContext.getOrCreate())
    val spark = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    val options = Map(
      "dynamodb.export" -> "ddb",
      "dynamodb.tableArn" -> "arn:aws:dynamodb:us-east-1:123456789012:table/my-table",
      "dynamodb.s3.bucket" -> "my-s3-bucket",
      "dynamodb.s3.prefix" -> "my-s3-prefix",
      "dynamodb.simplifyDDBJson" -> "true"
    )
    val df = DynamoDBExport.fullExport(spark, options)
    
    print(df.rdd.getNumPartitions)
    df.count()
    
    Job.commit()
  }
}
```

------

## Configuration Options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-options"></a>

### Read options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-read-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.input.tableName | DynamoDB table name (required) | - | 
| dynamodb.throughput.read | The read capacity units (RCU) to use. If unspecified, dynamodb.throughput.read.ratio is used for calculation. | - | 
| dynamodb.throughput.read.ratio | The ratio of read capacity units (RCU) to use | 0.5 | 
| dynamodb.table.read.capacity | The read capacity of an on-demand table, used for calculating the throughput. This parameter is effective only for on-demand capacity tables. Defaults to the table's warm throughput read units. | - | 
| dynamodb.splits | Defines how many segments are used in parallel scan operations. If not provided, the connector calculates a reasonable default value. | - | 
| dynamodb.consistentRead | Whether to use strongly consistent reads | FALSE | 
| dynamodb.input.retry | Defines how many retries we perform when there is a retryable exception. | 10 | 

### Write options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-write-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.output.tableName | DynamoDB table name (required) | - | 
| dynamodb.throughput.write | The write capacity units (WCU) to use. If unspecified, dynamodb.throughput.write.ratio is used for calculation. | - | 
| dynamodb.throughput.write.ratio | The ratio of write capacity units (WCU) to use | 0.5 | 
| dynamodb.table.write.capacity | The write capacity of an on-demand table, used for calculating the throughput. This parameter is effective only for on-demand capacity tables. Defaults to the table's warm throughput write units. | - | 
| dynamodb.item.size.check.enabled | If true, the connector calculate the item size and abort if the size exceeds the maximum size, before writing to DynamoDB table. | TRUE | 
| dynamodb.output.retry | Defines how many retries we perform when there is a retryable exception. | 10 | 

### Export options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-export-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.export | If set to ddb, enables the AWS Glue DynamoDB export connector, and a new ExportTableToPointInTimeRequest is invoked during the AWS Glue job. A new export is generated with the location built from dynamodb.s3.bucket and dynamodb.s3.prefix. If set to s3, enables the AWS Glue DynamoDB export connector but skips the creation of a new DynamoDB export and instead uses dynamodb.s3.bucket and dynamodb.s3.prefix as the Amazon S3 location of a past export of that table. | ddb | 
| dynamodb.tableArn | The DynamoDB table to read from. Required if dynamodb.export is set to ddb. |  | 
| dynamodb.simplifyDDBJson | If set to true, performs a transformation to simplify the schema of the DynamoDB JSON structure that is present in exports. | FALSE | 
| dynamodb.s3.bucket | The S3 bucket to store temporary data during DynamoDB export (required) |  | 
| dynamodb.s3.prefix | The S3 prefix to store temporary data during DynamoDB export |  | 
| dynamodb.s3.bucketOwner | Indicate the bucket owner needed for cross-account Amazon S3 access |  | 
| dynamodb.s3.sse.algorithm | Type of encryption used on the bucket where temporary data will be stored. Valid values are AES256 and KMS. |  | 
| dynamodb.s3.sse.kmsKeyId | The ID of the AWS KMS managed key used to encrypt the S3 bucket where temporary data will be stored (if applicable). |  | 
| dynamodb.exportTime | A point-in-time at which the export should be made. Valid values: strings representing ISO-8601 instants. |  | 
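Putting several of the export options above together, the following is a hedged Python sketch. The table ARN, bucket, and prefix are placeholder values, and `glue_context` is assumed to be an initialized `GlueContext` inside a Glue job; adapt the option map to your own table and export location.

```python
# Sketch: connection options for the AWS Glue DynamoDB export connector.
# The table ARN, bucket name, and prefix are placeholder values.
export_options = {
    "dynamodb.export": "ddb",            # create a new point-in-time export
    "dynamodb.tableArn": "arn:aws:dynamodb:us-east-1:123456789012:table/my_table",
    "dynamodb.s3.bucket": "amzn-s3-demo-bucket",
    "dynamodb.s3.prefix": "temp/ddb-export/",
    "dynamodb.simplifyDDBJson": "true",  # simplify the exported DynamoDB JSON schema
}

def read_dynamodb_export(glue_context):
    """Read the exported table into a DynamicFrame inside a Glue job."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options=export_options,
    )
```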

### General options
<a name="aws-glue-programming-etl-connect-dynamodb-dataframe-general-options"></a>


| Option | Description | Default | 
| --- | --- | --- | 
| dynamodb.sts.roleArn | The IAM role ARN to be assumed for cross-account access | - | 
| dynamodb.sts.roleSessionName | STS session name | glue-dynamodb-sts-session | 
| dynamodb.sts.region | Region for the STS client (for cross-region role assumption) | Same as region option | 

# Kinesis connections
<a name="aws-glue-programming-etl-connect-kinesis-home"></a>

You can use a Kinesis connection to read and write to Amazon Kinesis data streams using information stored in a Data Catalog table, or by providing information to directly access the data stream. You can read information from Kinesis into a Spark DataFrame, then convert it to an AWS Glue DynamicFrame. You can write DynamicFrames to Kinesis in JSON format. If you directly access the data stream, use these options to provide the information about how to access the data stream.

If you use `getCatalogSource` or `create_data_frame_from_catalog` to consume records from a Kinesis streaming source, the job has the Data Catalog database and table name information, and can use that to obtain some basic parameters for reading from the Kinesis streaming source. If you use `getSource`, `getSourceWithFormat`, `createDataFrameFromOptions` or `create_data_frame_from_options`, you must specify these basic parameters using the connection options described here.

You can specify the connection options for Kinesis using the following arguments for the specified methods in the `GlueContext` class.
+ Scala
  + `connectionOptions`: Use with `getSource`, `createDataFrameFromOptions`, `getSink` 
  + `additionalOptions`: Use with `getCatalogSource`, `getCatalogSink`
  + `options`: Use with `getSourceWithFormat`, `getSinkWithFormat`
+ Python
  + `connection_options`: Use with `create_data_frame_from_options`, `write_dynamic_frame_from_options`
  + `additional_options`: Use with `create_data_frame_from_catalog`, `write_dynamic_frame_from_catalog`
  + `options`: Use with `getSource`, `getSink`

For notes and restrictions about Streaming ETL jobs, consult [Streaming ETL notes and restrictions](add-job-streaming.md#create-job-streaming-restrictions).

## Configure Kinesis
<a name="aws-glue-programming-etl-connect-kinesis-configure"></a>

To connect to a Kinesis data stream in an AWS Glue Spark job, you will need some prerequisites:
+ If reading, the AWS Glue job must have Read access level IAM permissions to the Kinesis data stream.
+ If writing, the AWS Glue job must have Write access level IAM permissions to the Kinesis data stream.

In certain cases, you will need to configure additional prerequisites:
+ If your AWS Glue job is configured with **Additional network connections** (typically to connect to other datasets) and one of those connections provides Amazon VPC **Network options**, this will direct your job to communicate over Amazon VPC. In this case you will also need to configure your Kinesis data stream to communicate over Amazon VPC. You can do this by creating an interface VPC endpoint between your Amazon VPC and Kinesis data stream. For more information, see [Using Kinesis Data Streams with Interface VPC Endpoints](https://docs.aws.amazon.com//streams/latest/dev/vpc.html).
+ When specifying Amazon Kinesis Data Streams in another account, you must set up the roles and policies to allow cross-account access. For more information, see [Example: Read From a Kinesis Stream in a Different Account](https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html).

For more information about Streaming ETL job prerequisites, consult [Streaming ETL jobs in AWS Glue](add-job-streaming.md).

## Example: Reading from Kinesis streams
<a name="aws-glue-programming-etl-connect-kinesis-read"></a>

### Example: Reading from Kinesis streams
<a name="section-etl-connect-kinesis-read"></a>

Used in conjunction with [forEachBatch](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch).

Example for Amazon Kinesis streaming source:

```
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "true",
    "classification": "json"
}
data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kinesis", connection_options=kinesis_options)
```
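Because this read is used in conjunction with `forEachBatch`, a hedged sketch of the full streaming pattern follows. The stream ARN and checkpoint location are placeholder values, the batch logic is illustrative, and `glue_context` is assumed to come from the standard Glue streaming job setup:

```python
# Sketch: consuming the Kinesis DataFrame with forEachBatch.
# The stream ARN and checkpoint location are placeholder values.
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "true",
    "classification": "json",
}

def process_batch(data_frame, batch_id):
    """Runs once per microbatch; data_frame is a Spark DataFrame."""
    if data_frame.count() > 0:
        data_frame.show(5)

def run(glue_context):
    """Inside a Glue streaming job, wire the Kinesis source to the batch function."""
    frame = glue_context.create_data_frame.from_options(
        connection_type="kinesis", connection_options=kinesis_options
    )
    glue_context.forEachBatch(
        frame=frame,
        batch_function=process_batch,
        options={
            "windowSize": "100 seconds",
            "checkpointLocation": "s3://amzn-s3-demo-bucket/checkpoints/",
        },
    )
```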

## Example: Writing to Kinesis streams
<a name="aws-glue-programming-etl-connect-kinesis-write"></a>

### Example: Writing to Kinesis streams
<a name="section-etl-connect-kinesis-write"></a>

Example for an Amazon Kinesis streaming sink. The stream ARN and partition key are illustrative, and `dynamic_frame` is assumed to be a DynamicFrame produced earlier in the job:

```
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "partitionKey": "my_partition_key"
}
glueContext.write_dynamic_frame_from_options(
    dynamic_frame,
    connection_type="kinesis",
    connection_options=kinesis_options)
```

## Kinesis connection option reference
<a name="aws-glue-programming-etl-connect-kinesis"></a>

Designates connection options for Amazon Kinesis Data Streams.

Use the following connection options for Kinesis streaming data sources: 
+ `"streamARN"` – (Required) Used for Read/Write. The ARN of the Kinesis data stream.
+ `"classification"` – (Required for read) The file format used by the data in the record. Required unless provided through the Data Catalog.
+ `"streamName"` – (Optional) Used for Read. The name of a Kinesis data stream to read from. Used with `endpointUrl`.
+ `"endpointUrl"` – (Optional) Used for Read. Default: "https://kinesis.us-east-1.amazonaws.com". The AWS endpoint of the Kinesis stream. You do not need to change this unless you are connecting to a special region.
+ `"partitionKey"` – (Optional) Used for Write. The Kinesis partition key used when producing records.
+ `"delimiter"` – (Optional) Used for Read. The value separator used when `classification` is CSV. The default is "`,`".
+ `"startingPosition"`: (Optional) Used for Read. The starting position in the Kinesis data stream to read data from. The possible values are `"latest"`, `"trim_horizon"`, `"earliest"`, or a Timestamp string in UTC format in the pattern `yyyy-mm-ddTHH:MM:SSZ` (where `Z` represents a UTC timezone offset with a +/-. For example, "2023-04-04T08:00:00-04:00"). The default value is `"latest"`. Note: the Timestamp string in UTC format for `"startingPosition"` is supported only for AWS Glue version 4.0 or later.
+ `"failOnDataLoss"`: (Optional) Fail the job if any active shard is missing or expired. The default value is `"false"`.
+ `"awsSTSRoleARN"`: (Optional) Used for Read/Write. The Amazon Resource Name (ARN) of the role to assume using AWS Security Token Service (AWS STS). This role must have permissions for describe or read record operations for the Kinesis data stream. You must use this parameter when accessing a data stream in a different account. Used in conjunction with `"awsSTSSessionName"`.
+ `"awsSTSSessionName"`: (Optional) Used for Read/Write. An identifier for the session assuming the role using AWS STS. You must use this parameter when accessing a data stream in a different account. Used in conjunction with `"awsSTSRoleARN"`.
+ `"awsSTSEndpoint"`: (Optional) The AWS STS endpoint to use when connecting to Kinesis with an assumed role. This allows using the regional AWS STS endpoint in a VPC, which is not possible with the default global endpoint.
+ `"maxFetchTimeInMs"`: (Optional) Used for Read. The maximum time spent for the job executor to read records for the current batch from the Kinesis data stream, specified in milliseconds (ms). Multiple `GetRecords` API calls may be made within this time. The default value is `1000`.
+ `"maxFetchRecordsPerShard"`: (Optional) Used for Read. The maximum number of records to fetch per shard in the Kinesis data stream per microbatch. Note: The client can exceed this limit if the streaming job has already read extra records from Kinesis (in the same get-records call). If `maxFetchRecordsPerShard` needs to be strict then it needs to be a multiple of `maxRecordPerRead`. The default value is `100000`.
+ `"maxRecordPerRead"`: (Optional) Used for Read. The maximum number of records to fetch from the Kinesis data stream in each `getRecords` operation. The default value is `10000`.
+ `"addIdleTimeBetweenReads"`: (Optional) Used for Read. Adds a time delay between two consecutive `getRecords` operations. The default value is `"False"`. This option is only configurable for Glue version 2.0 and above. 
+ `"idleTimeBetweenReadsInMs"`: (Optional) Used for Read. The minimum time delay between two consecutive `getRecords` operations, specified in ms. The default value is `1000`. This option is only configurable for Glue version 2.0 and above. 
+ `"describeShardInterval"`: (Optional) Used for Read. The minimum time interval between two `ListShards` API calls for your script to consider resharding. For more information, see [Strategies for Resharding](https://docs.aws.amazon.com//streams/latest/dev/kinesis-using-sdk-java-resharding-strategies.html) in *Amazon Kinesis Data Streams Developer Guide*. The default value is `1s`.
+ `"numRetries"`: (Optional) Used for Read. The maximum number of retries for Kinesis Data Streams API requests. The default value is `3`.
+ `"retryIntervalMs"`: (Optional) Used for Read. The cool-off time period (specified in ms) before retrying the Kinesis Data Streams API call. The default value is `1000`.
+ `"maxRetryIntervalMs"`: (Optional) Used for Read. The maximum cool-off time period (specified in ms) between two retries of a Kinesis Data Streams API call. The default value is `10000`.
+ `"avoidEmptyBatches"`: (Optional) Used for Read. Avoids creating an empty microbatch job by checking for unread data in the Kinesis data stream before the batch is started. The default value is `"False"`.
+ `"schema"`: (Required when `inferSchema` is set to false) Used for Read. The schema to use to process the payload. If the classification is `avro`, the provided schema must be in the Avro schema format. If the classification is not `avro`, the provided schema must be in the DDL schema format.

  The following are schema examples.

------
#### [ Example in DDL schema format ]

  ```
  `column1` INT, `column2` STRING , `column3` FLOAT
  ```

------
#### [ Example in Avro schema format ]

  ```
  {
    "type":"array",
    "items":
    {
      "type":"record",
      "name":"test",
      "fields":
      [
        {
          "name":"_id",
          "type":"string"
        },
        {
          "name":"index",
          "type":
          [
            "int",
            "string",
            "float"
          ]
        }
      ]
    }
  }
  ```

------
+ `"inferSchema"`: (Optional) Used for Read. The default value is 'false'. If set to 'true', the schema will be detected at runtime from the payload within `forEachBatch`.
+ `"avroSchema"`: (Deprecated) Used for Read. Parameter used to specify a schema of Avro data when Avro format is used. This parameter is now deprecated. Use the `schema` parameter.
+ `"addRecordTimestamp"`: (Optional) Used for Read. When this option is set to 'true', the data output will contain an additional column named "__src_timestamp" that indicates the time when the corresponding record was received by the stream. The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
+ `"emitConsumerLagMetrics"`: (Optional) Used for Read. When this option is set to 'true', for each batch it emits to CloudWatch a metric measuring the duration between the oldest record received by the stream and the time it arrives in AWS Glue. The metric's name is "glue.driver.streaming.maxConsumerLagInMs". The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
+ `"fanoutConsumerARN"`: (Optional) Used for Read. The ARN of a Kinesis stream consumer for the stream specified in `streamARN`. Used to enable enhanced fan-out mode for your Kinesis connection. For more information on consuming a Kinesis stream with enhanced fan-out, see [Using enhanced fan-out in Kinesis streaming jobs](aws-glue-programming-etl-connect-kinesis-efo.md).
+ `"recordMaxBufferedTime"` – (Optional) Used for Write. Default: 1000 (ms). Maximum time a record is buffered while waiting to be written.
+ `"aggregationEnabled"` – (Optional) Used for Write. Default: true. Specifies if records should be aggregated before sending them to Kinesis.
+ `"aggregationMaxSize"` – (Optional) Used for Write. Default: 51200 (bytes). If a record is larger than this limit, it will bypass the aggregator. Note that Kinesis enforces a limit of 50 KB on record size; if you set this beyond 50 KB, oversized records will be rejected by Kinesis.
+ `"aggregationMaxCount"` – (Optional) Used for Write. Default: 4294967295. Maximum number of items to pack into an aggregated record.
+ `"producerRateLimit"` – (Optional) Used for Write. Default: 150 (%). Limits per-shard throughput sent from a single producer (such as your job), as a percentage of the backend limit.
+ `"collectionMaxCount"` – (Optional) Used for Write. Default: 500. Maximum number of items to pack into a PutRecords request.
+ `"collectionMaxSize"` – (Optional) Used for Write. Default: 5242880 (bytes). Maximum amount of data to send with a PutRecords request.
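Several of the read options above are often combined. As a hedged illustration (the stream and role ARNs are placeholder values), a connection options map for a cross-account read with an explicit DDL schema instead of schema inference might look like this:

```python
# Sketch: Kinesis read options combining cross-account access
# (awsSTSRoleARN / awsSTSSessionName) with an explicit DDL schema.
# The ARNs below are placeholder values.
cross_account_options = {
    "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/otherAccountStream",
    "awsSTSRoleARN": "arn:aws:iam::111122223333:role/GlueKinesisCrossAccountRole",
    "awsSTSSessionName": "glue-streaming-session",
    "startingPosition": "TRIM_HORIZON",
    "classification": "json",
    "inferSchema": "false",
    "schema": "`column1` INT, `column2` STRING, `column3` FLOAT",
}
```

Pass this map as `connection_options` (Python) or `connectionOptions` (Scala) to the reading methods described earlier.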

# Using enhanced fan-out in Kinesis streaming jobs
<a name="aws-glue-programming-etl-connect-kinesis-efo"></a>

An enhanced fan-out consumer is able to receive records from a Kinesis stream with dedicated throughput that can be greater than typical consumers receive. This is done by optimizing the transfer protocol used to provide data to a Kinesis consumer, such as your job. For more information about Kinesis Enhanced Fan-Out, see [the Kinesis documentation](https://docs.aws.amazon.com//streams/latest/dev/enhanced-consumers.html).

In enhanced fan-out mode, the `maxRecordPerRead` and `idleTimeBetweenReadsInMs` connection options no longer apply, as those parameters are not configurable when using enhanced fan-out. The configuration options for retries perform as described.

Use the following procedures to enable and disable enhanced fan-out for your streaming job. You should register a stream consumer for each job that will consume data from your stream.

**To enable enhanced fan-out consumption on your job:**

1. Register a stream consumer for your job using the Kinesis API. Follow the instructions to *register a consumer with enhanced fan-out using the Kinesis Data Streams API* in the [Kinesis documentation](https://docs.aws.amazon.com//streams/latest/dev/building-enhanced-consumers-api). You will only need to follow the first step - calling [RegisterStreamConsumer](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_RegisterStreamConsumer.html). Your request should return an ARN, *consumerARN*. 

1. Set the connection option `fanoutConsumerARN` to *consumerARN* in your connection method arguments.

1. Restart your job.

**To disable enhanced fan-out consumption on your job:**

1. Remove the `fanoutConsumerARN` connection option from your method call.

1. Restart your job.

1. Follow the instructions to *deregister a consumer* in the [Kinesis documentation](https://docs.aws.amazon.com/streams/latest/dev/building-enhanced-consumers-console.html). These instructions apply to the console, but can also be achieved through the Kinesis API. For more information about stream consumer deregistration through the Kinesis API, consult [DeregisterStreamConsumer](https://docs.aws.amazon.com//kinesis/latest/APIReference/API_DeregisterStreamConsumer.html) in the Kinesis documentation.
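The register and deregister steps above can also be performed programmatically. The following is a hedged boto3 sketch (the consumer name is a placeholder; `RegisterStreamConsumer` and `DeregisterStreamConsumer` are the Kinesis API operations named in the linked documentation):

```python
def register_glue_consumer(stream_arn, consumer_name):
    """Sketch: register an enhanced fan-out consumer and return its ARN,
    which you then pass as the fanoutConsumerARN connection option."""
    import boto3  # assumed available where the script runs
    kinesis = boto3.client("kinesis")
    response = kinesis.register_stream_consumer(
        StreamARN=stream_arn, ConsumerName=consumer_name
    )
    return response["Consumer"]["ConsumerARN"]

def deregister_glue_consumer(stream_arn, consumer_arn):
    """Sketch: deregister the consumer after disabling enhanced fan-out."""
    import boto3
    kinesis = boto3.client("kinesis")
    kinesis.deregister_stream_consumer(
        StreamARN=stream_arn, ConsumerARN=consumer_arn
    )
```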

# Amazon S3 connections
<a name="aws-glue-programming-etl-connect-s3-home"></a>

You can use AWS Glue for Spark to read and write files in Amazon S3. AWS Glue for Spark supports many common data formats stored in Amazon S3 out of the box, including CSV, Avro, JSON, Orc and Parquet. For more information about supported data formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). Each data format may support a different set of AWS Glue features. Consult the page for your data format for the specifics of feature support. Additionally, you can read and write versioned files stored in the Hudi, Iceberg and Delta Lake data lake frameworks. For more information about data lake frameworks, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md). 

With AWS Glue you can partition your Amazon S3 objects into a folder structure while writing, then retrieve it by partition to improve performance using simple configuration. You can also set configuration to group small files together when transforming your data to improve performance. You can read and write `bzip2` and `gzip` archives in Amazon S3.

**Topics**
+ [Configuring S3 connections](#aws-glue-programming-etl-connect-s3-configure)
+ [Amazon S3 connection option reference](#aws-glue-programming-etl-connect-s3)
+ [Deprecated connection syntaxes for data formats](#aws-glue-programming-etl-connect-legacy-format)
+ [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md)
+ [Managing partitions for ETL output in AWS Glue](aws-glue-programming-etl-partitions.md)
+ [Reading input files in larger groups](grouping-input-files.md)
+ [Amazon VPC endpoints for Amazon S3](vpc-endpoints-s3.md)

## Configuring S3 connections
<a name="aws-glue-programming-etl-connect-s3-configure"></a>

To connect to Amazon S3 in an AWS Glue for Spark job, you will need some prerequisites:
+ The AWS Glue job must have IAM permissions for relevant Amazon S3 buckets.

In certain cases, you will need to configure additional prerequisites:
+ When configuring cross-account access, appropriate access controls on the Amazon S3 bucket.
+ For security reasons, you may choose to route your Amazon S3 requests through an Amazon VPC. This approach can introduce bandwidth and availability challenges. For more information, see [Amazon VPC endpoints for Amazon S3](vpc-endpoints-s3.md). 

## Amazon S3 connection option reference
<a name="aws-glue-programming-etl-connect-s3"></a>

Designates a connection to Amazon S3.

Since Amazon S3 manages files rather than tables, in addition to specifying the connection properties provided in this document, you will need to specify additional configuration about your file type. You specify this information through data format options. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). You can also specify this information by integrating with the AWS Glue Data Catalog.

For an example of the distinction between connection options and format options, consider how the [create_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method takes `connection_type`, `connection_options`, `format` and `format_options`. This section specifically discusses parameters provided to `connection_options`.

Use the following connection options with `"connectionType": "s3"`:
+ `"paths"`: (Required) A list of the Amazon S3 paths to read from.
+ `"exclusions"`: (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, `"[\"**.pdf\"]"` excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see [Include and Exclude Patterns](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude).
+ `"compressionType"` or `"compression"`: (Optional) Specifies how the data is compressed. Use `"compressionType"` for Amazon S3 sources and `"compression"` for Amazon S3 targets. This is generally not necessary if the data has a standard file extension. Possible values are `"gzip"` and `"bzip2"`. Additional compression formats may be supported for specific formats. For the specifics of feature support, consult the data format page. 
+ `"groupFiles"`: (Optional) Grouping files is turned on by default when the input contains more than 50,000 files. To turn on grouping with fewer than 50,000 files, set this parameter to `"inPartition"`. To disable grouping when there are more than 50,000 files, set this parameter to `"none"`.
+ `"groupSize"`: (Optional) The target group size in bytes. The default is computed based on the input data size and the size of your cluster. When there are fewer than 50,000 input files, `"groupFiles"` must be set to `"inPartition"` for this to take effect.
+ `"recurse"`: (Optional) If set to true, recursively reads files in all subdirectories under the specified paths.
+ `"maxBand"`: (Optional, advanced) This option controls the duration in milliseconds after which the `s3` listing is likely to be consistent. Files with modification timestamps falling within the last `maxBand` milliseconds are tracked specially when using `JobBookmarks` to account for Amazon S3 eventual consistency. Most users don't need to set this option. The default is 900000 milliseconds, or 15 minutes.
+ `"maxFilesInBand"`: (Optional, advanced) This option specifies the maximum number of files to save from the last `maxBand` seconds. If this number is exceeded, extra files are skipped and only processed in the next job run. Most users don't need to set this option.
+ `"isFailFast"`: (Optional) This option determines if an AWS Glue ETL job throws reader parsing exceptions. If set to `true`, jobs fail fast if four retries of the Spark task fail to parse the data correctly.
+ `"catalogPartitionPredicate"`: (Optional) Used for Read. The contents of a SQL `WHERE` clause. Used when reading from Data Catalog tables with a very large quantity of partitions. Retrieves matching partitions from Data Catalog indices. Used with `push_down_predicate`, an option on the [create_dynamic_frame_from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog) method (and other similar methods). For more information, see [Server-side filtering using catalog partition predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-cat-predicates).
+ `"partitionKeys"`: (Optional) Used for Write. An array of column label strings. AWS Glue will partition your data as specified by this configuration. For more information, see [Writing partitions](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-writing).
+ `"excludeStorageClasses"`: (Optional) Used for Read. An array of strings specifying Amazon S3 storage classes. AWS Glue will exclude Amazon S3 objects based on this configuration. For more information, see [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md).
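Tying several of these options together, the following is a hedged Python sketch of an Amazon S3 read. The bucket path is a placeholder, and `glue_context` is assumed to be an initialized `GlueContext` inside a Glue job:

```python
# Sketch: connection options for an Amazon S3 read that excludes PDFs,
# recurses into subdirectories, and skips archived storage classes.
# The bucket path is a placeholder value.
s3_connection_options = {
    "paths": ["s3://amzn-s3-demo-bucket/logs/"],
    "recurse": True,
    "exclusions": "[\"**.pdf\"]",
    "excludeStorageClasses": ["GLACIER", "DEEP_ARCHIVE"],
}

def read_s3_json(glue_context):
    """Combine connection options with a format option inside a Glue job."""
    return glue_context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options=s3_connection_options,
        format="json",
    )
```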

## Deprecated connection syntaxes for data formats
<a name="aws-glue-programming-etl-connect-legacy-format"></a>

Certain data formats can be accessed using a specific connection type syntax. This syntax is deprecated. We recommend you specify your formats using the `s3` connection type and the format options provided in [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) instead.

### "connectionType": "orc"
<a name="aws-glue-programming-etl-connect-orc"></a>

Designates a connection to files stored in Amazon S3 in the [Apache Hive Optimized Row Columnar (ORC)](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) file format.

Use the following connection options with `"connectionType": "orc"`:
+ `paths`: (Required) A list of the Amazon S3 paths to read from.
+ *(Other option name/value pairs)*: Any additional options, including formatting options, are passed directly to the SparkSQL `DataSource`.

### "connectionType": "parquet"
<a name="aws-glue-programming-etl-connect-parquet"></a>

Designates a connection to files stored in Amazon S3 in the [Apache Parquet](https://parquet.apache.org/docs/) file format.

Use the following connection options with `"connectionType": "parquet"`:
+ `paths`: (Required) A list of the Amazon S3 paths to read from.
+ *(Other option name/value pairs)*: Any additional options, including formatting options, are passed directly to the SparkSQL `DataSource`.

# Excluding Amazon S3 storage classes
<a name="aws-glue-programming-etl-storage-classes"></a>

If you're running AWS Glue ETL jobs that read files or partitions from Amazon Simple Storage Service (Amazon S3), you can exclude some Amazon S3 storage class types.

The following storage classes are available in Amazon S3:
+ `STANDARD` — For general-purpose storage of frequently accessed data.
+ `INTELLIGENT_TIERING` — For data with unknown or changing access patterns.
+ `STANDARD_IA` and `ONEZONE_IA` — For long-lived, but less frequently accessed data.
+ `GLACIER`, `DEEP_ARCHIVE`, and `REDUCED_REDUNDANCY` — For long-term archive and digital preservation.

For more information, see [Amazon S3 Storage Classes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html) in the *Amazon S3 Developer Guide*.

The examples in this section show how to exclude the `GLACIER` and `DEEP_ARCHIVE` storage classes. These classes allow you to list files, but they won't let you read the files unless they are restored. (For more information, see [Restoring Archived Objects](https://docs.aws.amazon.com/AmazonS3/latest/dev/restoring-objects.html) in the *Amazon S3 Developer Guide*.)

By using storage class exclusions, you can ensure that your AWS Glue jobs will work on tables that have partitions across these storage class tiers. Without exclusions, jobs that read data from these tiers fail with the following error: AmazonS3Exception: The operation is not valid for the object's storage class.

There are different ways that you can filter Amazon S3 storage classes in AWS Glue.

**Topics**
+ [Excluding Amazon S3 storage classes when creating a Dynamic Frame](#aws-glue-programming-etl-storage-classes-dynamic-frame)
+ [Excluding Amazon S3 storage classes on a Data Catalog table](#aws-glue-programming-etl-storage-classes-table)

## Excluding Amazon S3 storage classes when creating a Dynamic Frame
<a name="aws-glue-programming-etl-storage-classes-dynamic-frame"></a>

To exclude Amazon S3 storage classes while creating a dynamic frame, use `excludeStorageClasses` in `additionalOptions`. AWS Glue automatically uses its own Amazon S3 `Lister` implementation to list and exclude files corresponding to the specified storage classes.

The following Python and Scala examples show how to exclude the `GLACIER` and `DEEP_ARCHIVE` storage classes when creating a dynamic frame.

Python example:

```
glueContext.create_dynamic_frame.from_catalog(
    database = "my_database",
    table_name = "my_table_name",
    redshift_tmp_dir = "",
    transformation_ctx = "my_transformation_context",
    additional_options = {
        "excludeStorageClasses" : ["GLACIER", "DEEP_ARCHIVE"]
    }
)
```

Scala example:

```
val df = glueContext.getCatalogSource(
    nameSpace, tableName, "", "my_transformation_context",  
    additionalOptions = JsonOptions(
        Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE"))
    )
).getDynamicFrame()
```

## Excluding Amazon S3 storage classes on a Data Catalog table
<a name="aws-glue-programming-etl-storage-classes-table"></a>

You can specify storage class exclusions to be used by an AWS Glue ETL job as a table parameter in the AWS Glue Data Catalog. You can include this parameter in the `CreateTable` operation using the AWS Command Line Interface (AWS CLI) or programmatically using the API. For more information, see [Table Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table) and [CreateTable](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html). 
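Because `UpdateTable` replaces the table definition, a merge is usually needed before writing the parameter back. The following is a hedged boto3 sketch; the helper names are illustrative, and the allowed-key set is an assumption about which `get_table` response fields are valid in `TableInput`:

```python
# Sketch: set the excludeStorageClasses table parameter via the Glue API.
# ALLOWED_TABLE_INPUT_KEYS is an assumption about valid TableInput fields.
ALLOWED_TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}

def build_table_input(table):
    """Pure helper: keep only fields TableInput accepts, then add the
    excludeStorageClasses parameter (stored as a JSON string)."""
    table_input = {k: v for k, v in table.items() if k in ALLOWED_TABLE_INPUT_KEYS}
    params = dict(table_input.get("Parameters", {}))
    params["excludeStorageClasses"] = '["GLACIER", "DEEP_ARCHIVE"]'
    table_input["Parameters"] = params
    return table_input

def add_exclude_storage_classes(database, table_name):
    """Read the current definition and write it back with the new parameter."""
    import boto3  # assumed available where the script runs
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=build_table_input(table))
```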

You can also specify excluded storage classes on the AWS Glue console.

**To exclude Amazon S3 storage classes (console)**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane on the left, choose **Tables**.

1. Choose the table name in the list, and then choose **Edit table**.

1. In **Table properties**, add **excludeStorageClasses** as a key and **["GLACIER", "DEEP_ARCHIVE"]** as a value.

1. Choose **Apply**.

# Managing partitions for ETL output in AWS Glue
<a name="aws-glue-programming-etl-partitions"></a>

Partitioning is an important technique for organizing datasets so they can be queried efficiently. It organizes data in a hierarchical directory structure based on the distinct values of one or more columns.

For example, you might decide to partition your application logs in Amazon Simple Storage Service (Amazon S3) by date, broken down by year, month, and day. Files that correspond to a single day's worth of data are then placed under a prefix such as `s3://my_bucket/logs/year=2018/month=01/day=23/`. Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by partition value without having to read all the underlying data from Amazon S3.

Crawlers not only infer file types and schemas, they also automatically identify the partition structure of your dataset when they populate the AWS Glue Data Catalog. The resulting partition columns are available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.

After you crawl a table, you can view the partitions that the crawler created. In the AWS Glue console, choose **Tables** in the left navigation pane. Choose the table created by the crawler, and then choose **View Partitions**.

For Apache Hive-style partitioned paths in `key=val` style, crawlers automatically populate the column name using the key name. Otherwise, it uses default names like `partition_0`, `partition_1`, and so on. You can change the default names on the console. To do so, navigate to the table. Check if indexes exist under the **Indexes** tab. If that's the case, you need to delete them to proceed (you can recreate them using the new column names afterwards). Then, choose **Edit Schema**, and modify the names of the partition columns there.

In your ETL scripts, you can then filter on the partition columns. Because the partition information is stored in the Data Catalog, use the `from_catalog` API calls to include the partition columns in the `DynamicFrame`. For example, use `create_dynamic_frame.from_catalog` instead of `create_dynamic_frame.from_options`.

Partitioning is an optimization technique that reduces data scan. For more information about the process of identifying when this technique is appropriate, consult [Reduce the amount of data scan](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/reduce-data-scan.html) in the *Best practices for performance tuning AWS Glue for Apache Spark jobs* guide on AWS Prescriptive Guidance.

## Pre-filtering using pushdown predicates
<a name="aws-glue-programming-etl-partitions-pushdowns"></a>

In many cases, you can use a pushdown predicate to filter on partitions without having to list and read all the files in your dataset. Instead of reading the entire dataset and then filtering in a DynamicFrame, you can apply the filter directly on the partition metadata in the Data Catalog. Then you only list and read what you actually need into a DynamicFrame.

For example, in Python, you could write the following.

```
glue_context.create_dynamic_frame.from_catalog(
    database = "my_S3_data_set",
    table_name = "catalog_data_table",
    push_down_predicate = my_partition_predicate)
```

This creates a DynamicFrame that loads only the partitions in the Data Catalog that satisfy the predicate expression. Depending on how small a subset of your data you are loading, this can save a great deal of processing time.

The predicate expression can be any Boolean expression supported by Spark SQL. Anything you could put in a `WHERE` clause in a Spark SQL query will work. For example, the predicate expression `pushDownPredicate = "(year=='2017' and month=='04')"` loads only the partitions in the Data Catalog that have both `year` equal to 2017 and `month` equal to 04. For more information, see the [Apache Spark SQL documentation](https://spark.apache.org/docs/2.1.1/sql-programming-guide.html), and in particular, the [Scala SQL functions reference](https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.sql.functions$).
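Predicate strings like this can be assembled from partition values instead of being written by hand. The following is a minimal sketch; the helper function is illustrative, not part of the AWS Glue API.

```python
def partition_predicate(**values: str) -> str:
    """Build a Spark SQL predicate string such as "(year=='2017' and month=='04')"."""
    clauses = [f"{col}=='{val}'" for col, val in values.items()]
    return "(" + " and ".join(clauses) + ")"

my_partition_predicate = partition_predicate(year="2017", month="04")
# Pass the result as push_down_predicate in create_dynamic_frame.from_catalog.
```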

## Server-side filtering using catalog partition predicates
<a name="aws-glue-programming-etl-partitions-cat-predicates"></a>

The `push_down_predicate` option is applied after listing all the partitions from the catalog and before listing files from Amazon S3 for those partitions. If you have a lot of partitions for a table, catalog partition listing can still incur additional time overhead. To address this overhead, you can use server-side partition pruning with the `catalogPartitionPredicate` option that uses [partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) in the AWS Glue Data Catalog. This makes partition filtering much faster when you have millions of partitions in one table. You can use both `push_down_predicate` and `catalogPartitionPredicate` in `additional_options` together if your `catalogPartitionPredicate` requires predicate syntax that is not yet supported with the catalog partition indexes.

Python:

```
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=dbname, 
    table_name=tablename,
    transformation_ctx="datasource0",
    push_down_predicate="day>=10 and customer_id like '10%'",
    additional_options={"catalogPartitionPredicate":"year='2021' and month='06'"}
)
```

Scala:

```
val dynamicFrame = glueContext.getCatalogSource(
    database = dbname,
    tableName = tablename, 
    transformationContext = "datasource0",
    pushDownPredicate="day>=10 and customer_id like '10%'",
    additionalOptions = JsonOptions("""{
        "catalogPartitionPredicate": "year='2021' and month='06'"}""")
    ).getDynamicFrame()
```

**Note**  
The `push_down_predicate` and `catalogPartitionPredicate` options use different syntaxes. The former uses Spark SQL standard syntax, while the latter uses the JSQL parser.

## Writing partitions
<a name="aws-glue-programming-etl-partitions-writing"></a>

By default, a DynamicFrame is not partitioned when it is written. All of the output files are written at the top level of the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it to a Spark SQL DataFrame before writing.

However, DynamicFrames now support native partitioning using a sequence of keys, using the `partitionKeys` option when you create a sink. For example, the following Python code writes out a dataset to Amazon S3 in the Parquet format, into directories partitioned by the type field. From there, you can process these partitions using other systems, such as Amazon Athena.

```
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",    
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
```
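With multiple partition keys, the sink writes a nested directory tree. The following sketch shows the shape of the connection options; the output path and partition columns are hypothetical examples.

```python
# Writing with partitionKeys ["year", "month", "day"] produces prefixes like
# s3://my_bucket/events/year=2018/month=01/day=23/part-....parquet
connection_options = {
    "path": "s3://my_bucket/events/",
    "partitionKeys": ["year", "month", "day"],
}

# In a Glue job, you would then pass these options to the sink, for example:
# glue_context.write_dynamic_frame.from_options(
#     frame = projectedEvents,
#     connection_type = "s3",
#     connection_options = connection_options,
#     format = "parquet")
```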

# Reading input files in larger groups
<a name="grouping-input-files"></a>

You can set properties of your tables to enable an AWS Glue ETL job to group files when they are read from an Amazon S3 data store. These properties enable each ETL task to read a group of input files into a single in-memory partition. This is especially useful when there is a large number of small files in your Amazon S3 data store. When you set certain properties, you instruct AWS Glue to group files within an Amazon S3 data partition and set the size of the groups to be read. You can also set these options when reading from an Amazon S3 data store with the `create_dynamic_frame.from_options` method. 

To enable grouping files for a table, you set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see [Viewing and managing table details](tables-described.md#console-tables-details). 

You can use this method to enable grouping for tables in the Data Catalog with Amazon S3 data stores. 

**groupFiles**  
Set **groupFiles** to `inPartition` to enable the grouping of files within an Amazon S3 data partition. AWS Glue automatically enables grouping if there are more than 50,000 input files, as in the following example.  

```
  'groupFiles': 'inPartition'
```

**groupSize**  
Set **groupSize** to the target size of groups in bytes. The **groupSize** property is optional. If it is not provided, AWS Glue calculates a size to use all the CPU cores in the cluster while still reducing the overall number of ETL tasks and in-memory partitions.   
For example, the following sets the group size to 1 MB.  

```
  'groupSize': '1048576'
```
Note that `groupSize` should be set with the result of a calculation. For example, 1024 * 1024 = 1048576.
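Rather than hard-coding the byte count, you can compute it. A minimal sketch:

```python
MB = 1024 * 1024  # bytes in one mebibyte

# groupSize is passed as a string in the table parameters or connection options.
group_size = str(1 * MB)
```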

**recurse**  
Set **recurse** to `True` to recursively read files in all subdirectories when specifying `paths` as an array of paths. You do not need to set **recurse** if `paths` is an array of object keys in Amazon S3, or if the input format is parquet/orc, as in the following example.  

```
  'recurse':True
```

If you are reading from Amazon S3 directly using the `create_dynamic_frame.from_options` method, add these connection options. For example, the following attempts to group files into 1 MB groups.

```
df = glueContext.create_dynamic_frame.from_options("s3", {'paths': ["s3://s3path/"], 'recurse':True, 'groupFiles': 'inPartition', 'groupSize': '1048576'}, format="json")
```

**Note**  
`groupFiles` is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.

# Amazon VPC endpoints for Amazon S3
<a name="vpc-endpoints-s3"></a>

For security reasons, many AWS customers run their applications within an Amazon Virtual Private Cloud environment (Amazon VPC). With Amazon VPC, you can launch Amazon EC2 instances into a virtual private cloud, which is logically isolated from other networks—including the public internet. With an Amazon VPC, you have control over its IP address range, subnets, routing tables, network gateways, and security settings.

**Note**  
If you created your AWS account after 2013-12-04, you already have a default VPC in each AWS Region. You can immediately start using your default VPC without any additional configuration.  
For more information, see [Your Default VPC and Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) in the Amazon VPC User Guide.

Many customers have legitimate privacy and security concerns about sending and receiving data across the public internet. Customers can address these concerns by using a virtual private network (VPN) to route all Amazon S3 network traffic through their own corporate network infrastructure. However, this approach can introduce bandwidth and availability challenges.

VPC endpoints for Amazon S3 can alleviate these challenges. A VPC endpoint for Amazon S3 enables AWS Glue to use private IP addresses to access Amazon S3 with no exposure to the public internet. AWS Glue does not require public IP addresses, and you don't need an internet gateway, a NAT device, or a virtual private gateway in your VPC. You use endpoint policies to control access to Amazon S3. Traffic between your VPC and the AWS service does not leave the Amazon network.

When you create a VPC endpoint for Amazon S3, any requests to an Amazon S3 endpoint within the Region (for example, *s3.us-west-2.amazonaws.com*) are routed to a private Amazon S3 endpoint within the Amazon network. You don't need to modify your applications running on Amazon EC2 instances in your VPC—the endpoint name remains the same, but the route to Amazon S3 stays entirely within the Amazon network, and does not access the public internet.

For more information about VPC endpoints, see [VPC Endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html) in the Amazon VPC User Guide.

The following diagram shows how AWS Glue can use a VPC endpoint to access Amazon S3.

![\[Network traffic flow showing VPC connection to Amazon S3.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PopulateCatalog-vpc-endpoint.png)


**To set up access for Amazon S3**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. In the left navigation pane, choose **Endpoints**.

1. Choose **Create Endpoint**, and follow the steps to create an Amazon S3 VPC endpoint of type Gateway. 

# Amazon DocumentDB connections
<a name="aws-glue-programming-etl-connect-documentdb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Amazon DocumentDB. You can connect to Amazon DocumentDB using credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Amazon DocumentDB, consult the [Amazon DocumentDB documentation](https://docs.aws.amazon.com/documentdb/latest/developerguide/what-is.html).

**Note**  
Amazon DocumentDB elastic clusters are not currently supported when using the AWS Glue connector. For more information about elastic clusters, see [Using Amazon DocumentDB elastic clusters](https://docs.aws.amazon.com/documentdb/latest/developerguide/docdb-using-elastic-clusters.html).

## Reading and writing to Amazon DocumentDB collections
<a name="aws-glue-programming-etl-connect-documentdb-read-write"></a>

**Note**  
When you create an ETL job that connects to Amazon DocumentDB, for the `Connections` job property, you must designate a connection object that specifies the virtual private cloud (VPC) in which Amazon DocumentDB is running. For the connection object, the connection type must be `JDBC`, and the `JDBC URL` must be `mongo://<DocumentDB_host>:27017`.

**Note**  
These code samples were developed for AWS Glue 3.0. To migrate to AWS Glue 4.0, consult [MongoDB](migrating-version-40.md#migrating-version-40-connector-driver-migration-mongodb). The `uri` parameter has changed.

**Note**  
When using Amazon DocumentDB, `retryWrites` must be set to false in certain situations, such as when the document written specifies `_id`. For more information, consult [Functional Differences with MongoDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/functional-differences.html#functional-differences.retryable-writes) in the Amazon DocumentDB documentation.

The following Python script demonstrates using connection types and connection options for reading and writing to Amazon DocumentDB.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

output_path = "s3://some_bucket/output/" + str(time.time()) + "/"
documentdb_uri = "mongodb://<mongo-instanced-ip-address>:27017"
documentdb_write_uri = "mongodb://<mongo-instanced-ip-address>:27017"

read_docdb_options = {
    "uri": documentdb_uri,
    "database": "test",
    "collection": "coll",
    "username": "username",
    "password": "1234567890",
    "ssl": "true",
    "ssl.domain_match": "false",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB": "10",
    "partitionerOptions.partitionKey": "_id"
}

write_documentdb_options = {
    "retryWrites": "false",
    "uri": documentdb_write_uri,
    "database": "test",
    "collection": "coll",
    "username": "username",
    "password": "pwd"
}

# Get DynamicFrame from DocumentDB
dynamic_frame2 = glueContext.create_dynamic_frame.from_options(connection_type="documentdb",
                                                               connection_options=read_docdb_options)

# Write DynamicFrame to DocumentDB
glueContext.write_dynamic_frame.from_options(dynamic_frame2, connection_type="documentdb",
                                             connection_options=write_documentdb_options)

job.commit()
```

The following Scala script demonstrates using connection types and connection options for reading and writing to Amazon DocumentDB.

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val DOC_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  val DOC_WRITE_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  lazy val documentDBJsonOption = jsonOptions(DOC_URI)
  lazy val writeDocumentDBJsonOption = jsonOptions(DOC_WRITE_URI)
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get DynamicFrame from DocumentDB
    val resultFrame2: DynamicFrame = glueContext.getSource("documentdb", documentDBJsonOption).getDynamicFrame()

    // Write DynamicFrame to DocumentDB
    glueContext.getSink("documentdb", writeDocumentDBJsonOption).writeDynamicFrame(resultFrame2)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database":"test",
         |"collection":"coll",
         |"username": "username",
         |"password": "pwd",
         |"ssl":"true",
         |"ssl.domain_match":"false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
```

## Amazon DocumentDB connection option reference
<a name="aws-glue-programming-etl-connect-documentdb"></a>

Designates a connection to Amazon DocumentDB (with MongoDB compatibility). 

Connection options differ for a source connection and a sink connection.

### "connectionType": "documentdb" as source
<a name="etl-connect-documentdb-as-source"></a>

Use the following connection options with `"connectionType": "documentdb"` as a source:
+ `"uri"`: (Required) The Amazon DocumentDB host to read from, formatted as `mongodb://<host>:<port>`.
+ `"database"`: (Required) The Amazon DocumentDB database to read from.
+ `"collection"`: (Required) The Amazon DocumentDB collection to read from.
+ `"username"`: (Required) The Amazon DocumentDB user name.
+ `"password"`: (Required) The Amazon DocumentDB password.
+ `"ssl"`: (Required if using SSL) If your connection uses SSL, then you must include this option with the value `"true"`.
+ `"ssl.domain_match"`: (Required if using SSL) If your connection uses SSL, then you must include this option with the value `"false"`.
+ `"batchSize"`: (Optional): The number of documents to return per batch, used within the cursor of internal batches.
+ `"partitioner"`: (Optional): The class name of the partitioner for reading input data from Amazon DocumentDB. The connector provides the following partitioners:
  + `MongoDefaultPartitioner` (default) (Not supported in AWS Glue 4.0)
  + `MongoSamplePartitioner` (Not supported in AWS Glue 4.0)
  + `MongoShardedPartitioner`
  + `MongoSplitVectorPartitioner`
  + `MongoPaginateByCountPartitioner`
  + `MongoPaginateBySizePartitioner` (Not supported in AWS Glue 4.0)
+ `"partitionerOptions"` (Optional): Options for the designated partitioner. The following options are supported for each partitioner:
  + `MongoSamplePartitioner`: `partitionKey`, `partitionSizeMB`, `samplesPerPartition`
  + `MongoShardedPartitioner`: `shardkey`
  + `MongoSplitVectorPartitioner`: `partitionKey`, `partitionSizeMB`
  + `MongoPaginateByCountPartitioner`: `partitionKey`, `numberOfPartitions`
  + `MongoPaginateBySizePartitioner`: `partitionKey`, `partitionSizeMB`

  For more information about these options, see [Partitioner Configuration](https://docs.mongodb.com/spark-connector/master/configuration/#partitioner-conf) in the MongoDB documentation.
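As an illustration, read options using `MongoPaginateByCountPartitioner` might look like the following sketch. The host, database, collection, and credentials are placeholders.

```python
# Sketch of connection options for reading from Amazon DocumentDB with a
# count-based paginating partitioner. Partitioner options are namespaced
# under "partitionerOptions.".
read_docdb_options = {
    "uri": "mongodb://<host>:27017",
    "database": "test",
    "collection": "coll",
    "username": "username",
    "password": "pwd",
    "ssl": "true",
    "ssl.domain_match": "false",
    "partitioner": "MongoPaginateByCountPartitioner",
    "partitionerOptions.partitionKey": "_id",
    "partitionerOptions.numberOfPartitions": "10",
}

# You would then pass these options to create_dynamic_frame.from_options with
# connection_type="documentdb".
```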

### "connectionType": "documentdb" as sink
<a name="etl-connect-documentdb-as-sink"></a>

Use the following connection options with `"connectionType": "documentdb"` as a sink:
+ `"uri"`: (Required) The Amazon DocumentDB host to write to, formatted as `mongodb://<host>:<port>`.
+ `"database"`: (Required) The Amazon DocumentDB database to write to.
+ `"collection"`: (Required) The Amazon DocumentDB collection to write to.
+ `"username"`: (Required) The Amazon DocumentDB user name.
+ `"password"`: (Required) The Amazon DocumentDB password.
+ `"extendedBsonTypes"`: (Optional) If `true`, allows extended BSON types when writing data to Amazon DocumentDB. The default is `true`.
+ `"replaceDocument"`: (Optional) If `true`, replaces the whole document when saving datasets that contain an `_id` field. If `false`, only fields in the document that match the fields in the dataset are updated. The default is `true`.
+ `"maxBatchSize"`: (Optional): The maximum batch size for bulk operations when saving data. The default is 512.
+ `"retryWrites"`: (Optional): Automatically retry certain write operations a single time if AWS Glue encounters a network error.

# OpenSearch Service connections
<a name="aws-glue-programming-etl-connect-opensearch-home"></a>

You can use AWS Glue for Spark to read from and write to tables in OpenSearch Service in AWS Glue 4.0 and later versions. You can define what to read from OpenSearch Service with an OpenSearch query. You connect to OpenSearch Service using HTTP basic authentication credentials stored in AWS Secrets Manager through an AWS Glue connection. This feature is not compatible with Amazon OpenSearch Serverless.

For more information about Amazon OpenSearch Service, see the [Amazon OpenSearch Service documentation](https://docs.aws.amazon.com/opensearch-service/).

## Configuring OpenSearch Service connections
<a name="aws-glue-programming-etl-connect-opensearch-configure"></a>

To connect to OpenSearch Service from AWS Glue, create and store your OpenSearch Service credentials in an AWS Secrets Manager secret, and then associate that secret with an OpenSearch Service AWS Glue connection.

**Prerequisites:** 
+ Identify the domain endpoint, *aosEndpoint*, and port, *aosPort*, that you would like to read from, or create the resource by following the instructions in the Amazon OpenSearch Service documentation. For more information on creating a domain, see [Creating and managing Amazon OpenSearch Service domains](https://docs.aws.amazon.com//opensearch-service/latest/developerguide/createupdatedomains.html) in the Amazon OpenSearch Service documentation.

  An Amazon OpenSearch Service domain endpoint will have the following default form, https://search-*domainName*-*unstructuredIdContent*.*region*.es.amazonaws.com. For more information on identifying your domain endpoint, see [Creating and managing Amazon OpenSearch Service domains](https://docs.aws.amazon.com//opensearch-service/latest/developerguide/createupdatedomains.html) in the Amazon OpenSearch Service documentation. 

  Identify or generate HTTP basic authentication credentials, *aosUser* and *aosPassword*, for your domain.

**To configure a connection to OpenSearch Service:**

1. In AWS Secrets Manager, create a secret using your OpenSearch Service credentials. To create a secret in Secrets Manager, follow the tutorial in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `USERNAME` with the value *aosUser*.
   + When selecting **Key/value pairs**, create a pair for the key `PASSWORD` with the value *aosPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select OpenSearch Service.
   + When selecting a Domain endpoint, provide *aosEndpoint*.
   + When selecting a port, provide *aosPort*.
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue OpenSearch Service connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Reading from OpenSearch Service indexes
<a name="aws-glue-programming-etl-connect-opensearch-read"></a>

**Prerequisites:** 
+ An OpenSearch Service index you would like to read from, *aosIndex*.
+ An AWS Glue OpenSearch Service connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to OpenSearch Service*. You will need the name of the AWS Glue connection, *connectionName*. 

This example reads an index from Amazon OpenSearch Service. You will need to provide the `pushdown` parameter.

For example: 

```
opensearch_read = glueContext.create_dynamic_frame.from_options(
    connection_type="opensearch",
    connection_options={
        "connectionName": "connectionName",
        "opensearch.resource": "aosIndex",
        "pushdown": "true",
    }
)
```

You can also provide a query string to filter the results returned in your DynamicFrame. You will need to configure `opensearch.query`.

`opensearch.query` can take a URL query parameter string *queryString* or a query DSL JSON object *queryObject*. For more information about the query DSL, see [Query DSL](https://opensearch.org/docs/latest/query-dsl/index/) in the OpenSearch documentation. To provide a URL query parameter string, prepend `?q=` to your query, as you would in a fully qualified URL. To provide a query DSL object, string escape the JSON object before providing it.

For example: 

```
queryObject = '{ "query": { "multi_match": { "query": "Sample", "fields": [ "sample" ] } } }'
queryString = "?q=queryString"

opensearch_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="opensearch",
    connection_options={
        "connectionName": "connectionName",
        "opensearch.resource": "aosIndex",
        "opensearch.query": queryString,
        "pushdown": "true",
    }
)
```

For more information about how to build a query outside of its specific syntax, see [Query string syntax](https://opensearch.org/docs/latest/query-dsl/full-text/query-string/#query-string-syntax) in the OpenSearch documentation.

When reading from OpenSearch collections that contain array type data, you must specify which fields are array type in your method call using the `opensearch.read.field.as.array.include` parameter. 

For example, when reading the following document, you will encounter the `genre` and `actor` array fields:

```
{
    "_index": "movies",
    "_id": "2",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "director": "Frankenheimer, John",
        "genre": [
            "Drama",
            "Mystery",
            "Thriller",
            "Crime"
        ],
        "year": 1962,
        "actor": [
            "Lansbury, Angela",
            "Sinatra, Frank",
            "Leigh, Janet",
            "Harvey, Laurence",
            "Silva, Henry",
            "Frees, Paul",
            "Gregory, James",
            "Bissell, Whit",
            "McGiver, John",
            "Parrish, Leslie",
            "Edwards, James",
            "Flowers, Bess",
            "Dhiegh, Khigh",
            "Payne, Julie",
            "Kleeb, Helen",
            "Gray, Joe",
            "Nalder, Reggie",
            "Stevens, Bert",
            "Masters, Michael",
            "Lowell, Tom"
        ],
        "title": "The Manchurian Candidate"
    }
}
```

In this case, you would include those field names in your method call. For example:

```
"opensearch.read.field.as.array.include": "genre,actor"
```

If your array field is nested inside of your document structure, refer to it using dot notation: `"genre,actor,foo.bar.baz"`. This would specify an array `baz` included in your source document through the embedded document `foo` containing the embedded document `bar`.
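A minimal sketch of assembling the comma-separated option value, using the field names from the example document above:

```python
# Array-typed field paths to surface from OpenSearch documents. Nested
# fields would use dot notation, for example "foo.bar.baz".
array_fields = ["genre", "actor"]

connection_options_fragment = {
    "opensearch.read.field.as.array.include": ",".join(array_fields)
}
```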

## Writing to OpenSearch Service tables
<a name="aws-glue-programming-etl-connect-opensearch-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to OpenSearch Service. If the index already contains information, AWS Glue appends the data from your DynamicFrame.

**Prerequisites:** 
+ An OpenSearch Service index you would like to write to. You will need identification information for the index. Let's call this *aosIndex*.
+ An AWS Glue OpenSearch Service connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to OpenSearch Service*. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="opensearch",
    connection_options={
      "connectionName": "connectionName",
      "opensearch.resource": "aosIndex",
    },
)
```

## OpenSearch Service connection option reference
<a name="aws-glue-programming-etl-connect-opensearch-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue OpenSearch Service connection configured to provide auth and network location information to your connection method.
+ `opensearch.resource` — Required. Used for Read/Write. Valid Values: OpenSearch index names. The name of the index your connection method will interact with.
+ `opensearch.query` — Used for Read. Valid Values: String escaped JSON or, when this string begins with `?`, the search part of a URL. An OpenSearch query that filters what should be retrieved when reading. For more information on using this parameter, consult the previous section [Reading from OpenSearch Service indexes](#aws-glue-programming-etl-connect-opensearch-read).
+ `pushdown` — Required. Used for Read. Valid Values: boolean. Instructs Spark to pass read queries down to OpenSearch so the database only returns relevant documents.
+ `opensearch.read.field.as.array.include` — Required if reading array type data. Used for Read. Valid Values: comma separated lists of field names. Specifies fields to read as arrays from OpenSearch documents. For more information on using this parameter, consult the previous section [Reading from OpenSearch Service indexes](#aws-glue-programming-etl-connect-opensearch-read).

# Redshift connections
<a name="aws-glue-programming-etl-connect-redshift-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Amazon Redshift databases. When connecting to Amazon Redshift databases, AWS Glue moves data through Amazon S3 to achieve maximum throughput, using the Amazon Redshift SQL `COPY` and `UNLOAD` commands. In AWS Glue 4.0 and later, you can use the [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html) to read and write with optimizations and features specific to Amazon Redshift beyond those available when connecting through previous versions. 

Learn about how AWS Glue is making it easier than ever for Amazon Redshift users to migrate to AWS Glue for serverless data integration and ETL.

[![AWS Videos](http://img.youtube.com/vi/ZapycBq8TKU/0.jpg)](http://www.youtube.com/watch?v=ZapycBq8TKU)


## Configuring Redshift connections
<a name="aws-glue-programming-etl-connect-redshift-configure"></a>

To use Amazon Redshift clusters in AWS Glue, you will need some prerequisites:
+ An Amazon S3 directory to use for temporary storage when reading from and writing to the database.
+ An Amazon VPC enabling communication between your Amazon Redshift cluster, your AWS Glue job and your Amazon S3 directory.
+ Appropriate IAM permissions on the AWS Glue job and Amazon Redshift cluster.

### Configuring IAM roles
<a name="aws-glue-programming-etl-redshift-config-iam"></a>

**Set up the role for the Amazon Redshift cluster**  
Your Amazon Redshift cluster needs to be able to read and write to Amazon S3 in order to integrate with AWS Glue jobs. To allow this, you can associate IAM roles with the Amazon Redshift cluster you want to connect to. Your role should have a policy allowing read from and write to your Amazon S3 temporary directory. Your role should have a trust relationship allowing the `redshift.amazonaws.com` service to `AssumeRole`.

**To associate an IAM role with Amazon Redshift**

1. **Prerequisites:** An Amazon S3 bucket or directory used for the temporary storage of files.

1. Identify which Amazon S3 permissions your Amazon Redshift cluster will need. When moving data to and from an Amazon Redshift cluster, AWS Glue jobs issue COPY and UNLOAD statements against Amazon Redshift. If your job modifies a table in Amazon Redshift, AWS Glue will also issue CREATE LIBRARY statements. For information on specific Amazon S3 permissions required for Amazon Redshift to execute these statements, refer to the Amazon Redshift documentation: [ Amazon Redshift: Permissions to access other AWS Resources](https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html).

1. In the IAM console, create an IAM policy with the necessary permissions. For more information about creating a policy, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html). 

1. In the IAM console, create a role and trust relationship allowing Amazon Redshift to assume the role. Follow the instructions in the IAM documentation: [To create a role for an AWS service (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html#roles-creatingrole-service-console).
   + When asked to choose an AWS service use case, choose "Redshift - Customizable".
   + When asked to attach a policy, choose the policy you previously defined.
**Note**  
For more information about configuring roles for Amazon Redshift, see [Authorizing Amazon Redshift to access other AWS services on your behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) in the Amazon Redshift documentation. 

1. In the Amazon Redshift console, associate the role with your Amazon Redshift cluster. Follow the instructions in [the Amazon Redshift documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html).

   Select the highlighted option in the Amazon Redshift console to configure this setting:  
![\[An example of where to manage IAM permissions in the Amazon Redshift console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/RS-role-config.png)

**Note**  
 By default, AWS Glue jobs pass Amazon Redshift temporary credentials that are created using the role that you specified to run the job. We do not recommend using these credentials. For security purposes, these credentials expire after 1 hour. 

**Set up the role for the AWS Glue job**  
The AWS Glue job needs a role to access the Amazon S3 bucket. You do not need IAM permissions for the Amazon Redshift cluster; access is controlled by connectivity in Amazon VPC and your database credentials.
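A minimal policy for the job role's Amazon S3 access might look like the following sketch. The bucket name and prefix are placeholders standing in for your *temp-s3-dir*; adjust them to your own temporary directory.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::temp-s3-bucket",
                "arn:aws:s3:::temp-s3-bucket/temp-s3-dir/*"
            ]
        }
    ]
}
```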

### Set up Amazon VPC
<a name="aws-glue-programming-etl-redshift-config-vpc"></a>

**To set up access for Amazon Redshift data stores**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. In the left navigation pane, choose **Clusters**.

1. Choose the cluster name that you want to access from AWS Glue.

1. In the **Cluster Properties** section, choose a security group in **VPC security groups** to allow AWS Glue to use. Record the name of the security group that you chose for future reference. Choosing the security group opens the Amazon EC2 console **Security Groups** list.

1. Choose the security group to modify and navigate to the **Inbound** tab.

1. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule where **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and whose **Source** is the same security group name as the **Group ID**. 

   The inbound rule looks similar to the following:   

   For example:  
![\[An example of a self-referencing inbound rule.\]](http://docs.aws.amazon.com/glue/latest/dg/images/SetupSecurityGroup-Start.png)

1. Add a rule for outbound traffic as well. Either open outbound traffic to all ports, or create a self-referencing rule where **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and whose **Destination** is the same security group name as the **Group ID**. If using an Amazon S3 VPC endpoint, also add an HTTPS rule for Amazon S3 access. The *s3-prefix-list-id* is required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.
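The self-referencing rules above can also be applied programmatically. The sketch below only builds the `IpPermissions` request structure; the commented-out boto3 calls show how it would be applied to a real security group (the group ID is a placeholder, and the helper is ours, not part of AWS Glue).

```
# Build the self-referencing all-TCP rule described in the steps above.
# The helper and the placeholder security group ID are illustrative only.

def self_referencing_rule(security_group_id):
    return [{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,  # all ports
        # source (ingress) / destination (egress) is the group itself
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }]

sg_id = "sg-0123456789abcdef0"  # placeholder
permissions = self_referencing_rule(sg_id)

# Applied with boto3, for example:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=permissions)
# ec2.authorize_security_group_egress(GroupId=sg_id, IpPermissions=permissions)
```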

### Set up AWS Glue
<a name="aws-glue-programming-etl-redshift-config-glue"></a>

You will need to create an AWS Glue Data Catalog connection that provides Amazon VPC connection information.

**To configure Amazon Redshift Amazon VPC connectivity to AWS Glue in the console**

1. Create a Data Catalog connection by following the steps in: [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step.
   + When selecting a **Connection type**, select **Amazon Redshift**.
   + When selecting a **Redshift cluster**, select your cluster by name.
   + Provide default connection information for an Amazon Redshift user on your cluster.
   + Your Amazon VPC settings will be automatically configured.
**Note**  
You will need to manually provide `PhysicalConnectionRequirements` for your Amazon VPC when creating an **Amazon Redshift** connection through the AWS SDK.

1. In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Example: Reading from Amazon Redshift tables
<a name="aws-glue-programming-etl-connect-redshift-read"></a>

 You can read from Amazon Redshift clusters and Amazon Redshift serverless environments. 

**Prerequisites:** An Amazon Redshift table you would like to read from. Follow the steps in the previous section [Configuring Redshift connections](#aws-glue-programming-etl-connect-redshift-configure), after which you should have the Amazon S3 URI for a temporary directory, *temp-s3-dir*, and an IAM role, *rs-role-name* (in account *role-account-id*).

------
#### [ Using the Data Catalog ]

**Additional Prerequisites:** A Data Catalog Database and Table for the Amazon Redshift table you would like to read from. For more information about Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md). After creating an entry for your Amazon Redshift table, you will identify your connection with a *redshift-dc-database-name* and *redshift-table-name*.

**Configuration:** In your function options you will identify your Data Catalog Table with the `database` and `table_name` parameters. You will identify your Amazon S3 temporary directory with `redshift_tmp_dir`. You will also provide *rs-role-name* using the `aws_iam_role` key in the `additional_options` parameter.

```
 glueContext.create_dynamic_frame.from_catalog(
    database = "redshift-dc-database-name", 
    table_name = "redshift-table-name", 
    redshift_tmp_dir = args["temp-s3-dir"], 
    additional_options = {"aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"})
```

------
#### [ Connecting directly ]

**Additional Prerequisites:** You will need the name of your Amazon Redshift table (*redshift-table-name*) and the JDBC connection information for the Amazon Redshift cluster storing that table. You will supply your connection information with *host*, *port*, *redshift-database-name*, *username* and *password*.

You can retrieve your connection information from the Amazon Redshift console when working with Amazon Redshift clusters. When using Amazon Redshift serverless, consult [Connecting to Amazon Redshift Serverless](https://docs.aws.amazon.com//redshift/latest/mgmt/serverless-connecting.html) in the Amazon Redshift documentation.

**Configuration:** In your function options you will identify your connection parameters with `url`, `dbtable`, `user` and `password`. You will identify your Amazon S3 temporary directory with `redshift_tmp_dir`. You can specify your IAM role using `aws_iam_role` when you use `from_options`. The syntax is similar to connecting through the Data Catalog, but you put the parameters in the `connection_options` map.

It is bad practice to hardcode passwords into AWS Glue scripts. Consider storing your passwords in AWS Secrets Manager and retrieving them in your script with SDK for Python (Boto3).
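The pattern might look like the following sketch. The secret name and its JSON shape (`{"username": ..., "password": ...}`) are assumptions for illustration; the Secrets Manager retrieval is shown commented out, and only the merge logic is concrete.

```
import json

# Keep credentials out of the script: parse a secret that stores
# {"username": ..., "password": ...} and merge it into the connection options.
# The secret name and JSON shape are assumptions for illustration.

def redshift_options_from_secret(secret_string, url, dbtable, tmp_dir, iam_role):
    creds = json.loads(secret_string)
    return {
        "url": url,
        "dbtable": dbtable,
        "user": creds["username"],
        "password": creds["password"],
        "redshiftTmpDir": tmp_dir,
        "aws_iam_role": iam_role,
    }

# In a Glue job, the secret string would come from Secrets Manager, for example:
# import boto3
# secret = boto3.client("secretsmanager").get_secret_value(
#     SecretId="redshift/etl-user")["SecretString"]
secret = '{"username": "etl_user", "password": "example-only"}'

my_conn_options = redshift_options_from_secret(
    secret,
    url="jdbc:redshift://host:port/redshift-database-name",
    dbtable="redshift-table-name",
    tmp_dir="s3://temp-s3-dir/",
    iam_role="arn:aws:iam::role-account-id:role/rs-role-name",
)
```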

```
my_conn_options = {  
    "url": "jdbc:redshift://host:port/redshift-database-name",
    "dbtable": "redshift-table-name",
    "user": "username",
    "password": "password",
    "redshiftTmpDir": args["temp-s3-dir"],
    "aws_iam_role": "arn:aws:iam::account id:role/rs-role-name"
}

df = glueContext.create_dynamic_frame.from_options("redshift", my_conn_options)
```

------

## Example: Writing to Amazon Redshift tables
<a name="aws-glue-programming-etl-connect-redshift-write"></a>

 You can write to Amazon Redshift clusters and Amazon Redshift serverless environments. 

**Prerequisites:** An Amazon Redshift cluster. Follow the steps in the previous section [Configuring Redshift connections](#aws-glue-programming-etl-connect-redshift-configure), after which you should have the Amazon S3 URI for a temporary directory, *temp-s3-dir*, and an IAM role, *rs-role-name* (in account *role-account-id*). You will also need a `DynamicFrame` whose contents you would like to write to the database. 

------
#### [ Using the Data Catalog ]

**Additional Prerequisites:** A Data Catalog Database for the Amazon Redshift cluster and table you would like to write to. For more information about Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md). You will identify your connection with *redshift-dc-database-name* and the target table with *redshift-table-name*.

**Configuration:** In your function options you will identify your Data Catalog Database with the `database` parameter, then provide table with `table_name`. You will identify your Amazon S3 temporary directory with `redshift_tmp_dir`. You will also provide *rs-role-name* using the `aws_iam_role` key in the `additional_options` parameter.

```
 glueContext.write_dynamic_frame.from_catalog(
    frame = input dynamic frame, 
    database = "redshift-dc-database-name", 
    table_name = "redshift-table-name", 
    redshift_tmp_dir = args["temp-s3-dir"], 
    additional_options = {"aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"})
```

------
#### [ Connecting through an AWS Glue connection ]

You can connect to Amazon Redshift directly using the `write_dynamic_frame.from_options` method. However, rather than insert your connection details directly into your script, you can reference connection details stored in a Data Catalog connection with the `from_jdbc_conf` method. You can do this without crawling or creating Data Catalog tables for your database. For more information about Data Catalog connections, see [Connecting to data](glue-connections.md).

**Additional Prerequisites:** A Data Catalog connection for your database and an Amazon Redshift table you would like to write to.

**Configuration:** You will identify your Data Catalog connection with *dc-connection-name*. You will identify your Amazon Redshift database and table with *redshift-table-name* and *redshift-database-name*. You will provide your Data Catalog connection information with `catalog_connection` and your Amazon Redshift information with `dbtable` and `database`. The syntax is similar to connecting through the Data Catalog, but you put the parameters in the `connection_options` map. 

```
my_conn_options = {
    "dbtable": "redshift-table-name",
    "database": "redshift-database-name",
    "aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name"
}

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = input dynamic frame, 
    catalog_connection = "dc-connection-name", 
    connection_options = my_conn_options, 
    redshift_tmp_dir = args["temp-s3-dir"])
```

------

## Amazon Redshift connection option reference
<a name="w2aac67c11c24b8c21c15"></a>

The basic connection options used for all AWS Glue JDBC connections to set up information like `url`, `user` and `password` are consistent across all JDBC types. For more information about standard JDBC parameters, see [JDBC connection option reference](aws-glue-programming-etl-connect-jdbc-home.md#aws-glue-programming-etl-connect-jdbc).

The Amazon Redshift connection type takes some additional connection options:
+ `"redshiftTmpDir"`: (Required) The Amazon S3 path where temporary data can be staged when copying out of the database.
+ `"aws_iam_role"`: (Optional) ARN for an IAM role. The AWS Glue job will pass this role to the Amazon Redshift cluster to grant the cluster permissions needed to complete instructions from the job.

### Additional connection options available in AWS Glue 4.0 and later
<a name="aws-glue-programming-etl-redshift-enhancements"></a>

You can also pass options for the new Amazon Redshift connector through AWS Glue connection options. For a complete list of supported connector options, see the *Spark SQL parameters* section in [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html).

For your convenience, we reiterate certain new options here:


| Name | Required | Default | Description | 
| --- | --- | --- | --- | 
|  autopushdown  | No | TRUE |  Applies predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. The operations are translated into a SQL query, and then run in Amazon Redshift to improve performance.  | 
|  autopushdown.s3\_result\_cache  | No | FALSE |  Caches the SQL query to unload data for Amazon S3 path mapping in memory so that the same query doesn't need to run again in the same Spark session. Only supported when `autopushdown` is enabled.  | 
|  unload\_s3\_format  | No | PARQUET |  PARQUET - Unloads the query results in Parquet format. TEXT - Unloads the query results in pipe-delimited text format.  | 
|  sse\_kms\_key  | No | N/A |  The AWS SSE-KMS key to use for encryption during `UNLOAD` operations instead of the default encryption for AWS.  | 
|  extracopyoptions  | No | N/A |  A list of extra options to append to the Amazon Redshift `COPY` command when loading data, such as `TRUNCATECOLUMNS` or `MAXERROR n` (for other options see [COPY: Optional parameters](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#r_COPY-syntax-overview-optional-parameters)).  Note that because these options are appended to the end of the `COPY` command, only options that make sense at the end of the command can be used. That should cover most possible use cases.  | 
|  csvnullstring (experimental)  | No | NULL |  The String value to write for nulls when using the CSV `tempformat`. This should be a value that doesn't appear in your actual data.  | 

These new parameters can be used in the following ways.

**New options for performance improvement**  
The new connector introduces some new performance improvement options:
+ `autopushdown`: Enabled by default.
+ `autopushdown.s3_result_cache`: Disabled by default.
+ `unload_s3_format`: `PARQUET` by default.

For information about using these options, see [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html). We recommend that you don't turn on `autopushdown.s3_result_cache` when you have mixed read and write operations because the cached results might contain stale information. The option `unload_s3_format` is set to `PARQUET` by default for the `UNLOAD` command, to improve performance and reduce storage cost. To use the `UNLOAD` command default behavior, reset the option to `TEXT`.
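As a sketch, these options can be passed through `additional_options`. The database and table names below are placeholders, and the Glue read call is shown commented out for context only.

```
# Pass the AWS Glue 4.0 connector options above through additional_options.
# Database/table names and the IAM role ARN are placeholders.

additional_options = {
    "aws_iam_role": "arn:aws:iam::role-account-id:role/rs-role-name",
    "autopushdown": "true",       # default; shown explicitly for clarity
    "unload_s3_format": "TEXT",   # revert UNLOAD to pipe-delimited text
    # "autopushdown.s3_result_cache": "true",  # avoid with mixed read/write workloads
}

# In a Glue job, these would be supplied as, for example:
# glueContext.create_dynamic_frame.from_catalog(
#     database="redshift-dc-database-name",
#     table_name="redshift-table-name",
#     redshift_tmp_dir=args["temp-s3-dir"],
#     additional_options=additional_options)
```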

**New encryption option for reading**  
By default, the data in the temporary folder that AWS Glue uses when it reads data from the Amazon Redshift table is encrypted using `SSE-S3` encryption. To use customer managed keys from AWS Key Management Service (AWS KMS) to encrypt your data, you can set up `("sse_kms_key" → kmsKey)` where kmsKey is the [key ID from AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/find-cmk-id-arn.html), instead of the legacy setting option `("extraunloadoptions" → s"ENCRYPTED KMS_KEY_ID '$kmsKey'")` in AWS Glue version 3.0.

```
datasource0 = glueContext.create_dynamic_frame.from_catalog(
  database = "database-name", 
  table_name = "table-name", 
  redshift_tmp_dir = args["TempDir"],
  additional_options = {"sse_kms_key":"<KMS_KEY_ID>"}, 
  transformation_ctx = "datasource0"
)
```

**Support IAM-based JDBC URL**  
The new connector supports an IAM-based JDBC URL so you don't need to pass in a user/password or secret. With an IAM-based JDBC URL, the connector uses the job runtime role to access the Amazon Redshift data source. 

Step 1: Attach the following minimal required policy to your AWS Glue job runtime role.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:us-east-1:111122223333:dbgroup:<cluster name>/*",
                "arn:aws:redshift:*:111122223333:dbuser:*/*",
                "arn:aws:redshift:us-east-1:111122223333:dbname:<cluster name>/<database name>"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        }
    ]
}
```

------

Step 2: Use the IAM-based JDBC URL as follows. Specify a new option `DbUser` with the Amazon Redshift user name that you're connecting with.

```
conn_options = {
    # IAM-based JDBC URL
    "url": "jdbc:redshift:iam://<cluster name>:<region>/<database name>",
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "DbUser": "<Redshift User name>"  # required for IAM-based JDBC URL
}

redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="redshift",
    connection_options=conn_options
)

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=conn_options
)
```

**Note**  
A `DynamicFrame` currently only supports an IAM-based JDBC URL with a `DbUser` in the `GlueContext.create_dynamic_frame.from_options` workflow. 

## Migrating from AWS Glue version 3.0 to version 4.0
<a name="aws-glue-programming-etl-redshift-migrating"></a>

In AWS Glue 4.0, ETL jobs have access to a new Amazon Redshift Spark connector and a new JDBC driver with different options and configuration. The new Amazon Redshift connector and driver are written with performance in mind, and keep transactional consistency of your data. These products are documented in the Amazon Redshift documentation. For more information, see:
+ [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html)
+ [Amazon Redshift JDBC driver, version 2.1](https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-download-driver.html)

**Table/column names and identifiers restriction**  
The new Amazon Redshift Spark connector and driver have a more restricted requirement for the Redshift table name. For more information, see [Names and identifiers](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html) to define your Amazon Redshift table name. The job bookmark workflow might not work with a table name that doesn't match the rules and with certain characters, such as a space.

If you have legacy tables with names that don't conform to the [Names and identifiers](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html) rules and see issues with bookmarks (jobs reprocessing old Amazon Redshift table data), we recommend that you rename your table names. For more information, see [ALTER TABLE examples](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_examples_basic.html). 
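A rename might look like the following (the table names here are placeholders, not from this guide):

```
ALTER TABLE "my legacy table" RENAME TO my_legacy_table;
```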

**Default tempformat change in Dataframe**  
The AWS Glue version 3.0 Spark connector defaults the `tempformat` to CSV while writing to Amazon Redshift. To be consistent, in AWS Glue version 4.0, the `DynamicFrame` still defaults the `tempformat` to use `CSV`. If you've previously used Spark Dataframe APIs directly with the Amazon Redshift Spark connector, you can explicitly set the `tempformat` to CSV in the `DataframeReader`/`Writer` options. Otherwise, `tempformat` defaults to `AVRO` in the new Spark connector.

**Behavior change: map Amazon Redshift data type REAL to Spark data type FLOAT instead of DOUBLE**  
In AWS Glue version 3.0, Amazon Redshift `REAL` is converted to a Spark `DOUBLE` type. The new Amazon Redshift Spark connector has updated the behavior so that the Amazon Redshift `REAL` type is converted to, and back from, the Spark `FLOAT` type. If you have a legacy use case where you still want the Amazon Redshift `REAL` type to be mapped to a Spark `DOUBLE` type, you can use the following workaround:
+ For a `DynamicFrame`, map the `Float` type to a `Double` type with `DynamicFrame.ApplyMapping`. For a `Dataframe`, you need to use `cast`.

Code example:

```
dyf_cast = dyf.apply_mapping([('a', 'long', 'a', 'long'), ('b', 'float', 'b', 'double')])
```

**Handling VARBYTE Data Type**  
When working with AWS Glue 3.0 and Amazon Redshift data types, AWS Glue 3.0 converts Amazon Redshift `VARBYTE` to Spark `STRING` type. However, the latest Amazon Redshift Spark connector doesn't support the `VARBYTE` data type. To work around this limitation, you can [create a Redshift view](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_VIEW.html) that transforms `VARBYTE` columns to a supported data type. Then, use the new connector to load data from this view instead of the original table, which ensures compatibility while maintaining access to your `VARBYTE` data.

Example for Redshift query:

```
CREATE VIEW view_name AS SELECT FROM_VARBYTE(varbyte_column, 'hex') FROM table_name
```

# Kafka connections
<a name="aws-glue-programming-etl-connect-kafka-home"></a>

You can use a Kafka connection to read and write to Kafka data streams using information stored in a Data Catalog table, or by providing information to directly access the data stream. The connection supports a Kafka cluster or an Amazon Managed Streaming for Apache Kafka cluster. You can read information from Kafka into a Spark DataFrame, then convert it to an AWS Glue DynamicFrame. You can write DynamicFrames to Kafka in a JSON format. If you directly access the data stream, use these options to provide the information about how to access the data stream.

If you use `getCatalogSource` or `create_data_frame_from_catalog` to consume records from a Kafka streaming source, or `getCatalogSink` or `write_dynamic_frame_from_catalog` to write records to Kafka, then the job has the Data Catalog database and table name information and can use that to obtain some basic parameters for reading from the Kafka streaming source. If you use `getSource`, `getSourceWithFormat`, `getSinkWithFormat`, `createDataFrameFromOptions`, `create_data_frame_from_options`, or `write_dynamic_frame_from_options`, you must specify these basic parameters using the connection options described here.

You can specify the connection options for Kafka using the following arguments for the specified methods in the `GlueContext` class.
+ Scala
  + `connectionOptions`: Use with `getSource`, `createDataFrameFromOptions`, `getSink` 
  + `additionalOptions`: Use with `getCatalogSource`, `getCatalogSink`
  + `options`: Use with `getSourceWithFormat`, `getSinkWithFormat`
+ Python
  + `connection_options`: Use with `create_data_frame_from_options`, `write_dynamic_frame_from_options`
  + `additional_options`: Use with `create_data_frame_from_catalog`, `write_dynamic_frame_from_catalog`
  + `options`: Use with `getSource`, `getSink`

For notes and restrictions about streaming ETL jobs, consult [Streaming ETL notes and restrictions](add-job-streaming.md#create-job-streaming-restrictions).

**Topics**
+ [Configure Kafka](#aws-glue-programming-etl-connect-kafka-configure)
+ [Example: Reading from Kafka streams](#aws-glue-programming-etl-connect-kafka-read)
+ [Example: Writing to Kafka streams](#aws-glue-programming-etl-connect-kafka-write)
+ [Kafka connection option reference](#aws-glue-programming-etl-connect-kafka)

## Configure Kafka
<a name="aws-glue-programming-etl-connect-kafka-configure"></a>

There are no AWS prerequisites for connecting to Kafka streams that are available through the internet.

You can create an AWS Glue Kafka connection to manage your connection credentials. For more information, see [Creating an AWS Glue connection for an Apache Kafka data stream](add-job-streaming.md#create-conn-streaming). In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**, then, in your method call, provide *connectionName* to the `connectionName` parameter.

In certain cases, you will need to configure additional prerequisites:
+ If using Amazon Managed Streaming for Apache Kafka with IAM authentication, you will need appropriate IAM configuration.
+ If using Amazon Managed Streaming for Apache Kafka within an Amazon VPC, you will need appropriate Amazon VPC configuration. You will need to create an AWS Glue connection that provides Amazon VPC connection information. You will need your job configuration to include the AWS Glue connection as an **Additional network connection**.

For more information about Streaming ETL job prerequisites, consult [Streaming ETL jobs in AWS Glue](add-job-streaming.md).

## Example: Reading from Kafka streams
<a name="aws-glue-programming-etl-connect-kafka-read"></a>

Used in conjunction with [forEachBatch](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch).

Example for Kafka streaming source:

```
kafka_options = {
    "connectionName": "ConfluentKafka",
    "topicName": "kafka-auth-topic",
    "startingOffsets": "earliest",
    "inferSchema": "true",
    "classification": "json"
}
data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kafka", connection_options=kafka_options)
```

## Example: Writing to Kafka streams
<a name="aws-glue-programming-etl-connect-kafka-write"></a>

Examples for writing to Kafka:

Example with the `getSink` method:

```
data_frame_datasink0 = glueContext.getSink(
	connectionType="kafka",
	connectionOptions=JsonOptions("""{
		"connectionName": "ConfluentKafka",
		"classification": "json",
		"topic": "kafka-auth-topic",
		"typeOfData": "kafka"}"""),
	transformationContext="dataframe_ApacheKafka_node1711729173428")
```

Example with the `write_dynamic_frame.from_options` method:

```
kafka_options = {
    "connectionName": "ConfluentKafka",
    "topicName": "kafka-auth-topic",
    "classification": "json"
}
glueContext.write_dynamic_frame.from_options(frame=dyf, connection_type="kafka", connection_options=kafka_options)
```

## Kafka connection option reference
<a name="aws-glue-programming-etl-connect-kafka"></a>

When reading, use the following connection options with `"connectionType": "kafka"`:
+ `"bootstrap.servers"` (Required) A list of bootstrap server URLs, for example, as `b-1.vpc-test-2.o4q88o.c6.kafka.us-east-1.amazonaws.com:9094`. This option must be specified in the API call or defined in the table metadata in the Data Catalog.
+ `"security.protocol"` (Required) The protocol used to communicate with brokers. The possible values are `"SSL"` or `"PLAINTEXT"`.
+ `"topicName"` (Required) A comma-separated list of topics to subscribe to. You must specify one and only one of `"topicName"`, `"assign"` or `"subscribePattern"`.
+ `"assign"`: (Required) A JSON string specifying the specific `TopicPartitions` to consume. You must specify one and only one of `"topicName"`, `"assign"` or `"subscribePattern"`.

  Example: `'{"topicA":[0,1],"topicB":[2,4]}'`
+ `"subscribePattern"`: (Required) A Java regex string that identifies the topic list to subscribe to. You must specify one and only one of `"topicName"`, `"assign"` or `"subscribePattern"`.

  Example: `'topic.*'`
+ `"classification"` (Required) The file format used by the data in the record. Required unless provided through the Data Catalog.
+ `"delimiter"` (Optional) The value separator used when `classification` is CSV. Default is "`,`."
+ `"startingOffsets"`: (Optional) The starting position in the Kafka topic to read data from. The possible values are `"earliest"` or `"latest"`. The default value is `"latest"`.
+ `"startingTimestamp"`: (Optional, supported only for AWS Glue version 4.0 or later) The Timestamp of the record in the Kafka topic to read data from. The possible value is a Timestamp string in UTC format in the pattern `yyyy-mm-ddTHH:MM:SSZ` (where `Z` represents a UTC timezone offset with a \$1/-. For example: "2023-04-04T08:00:00-04:00").

  Note: Only one of `startingOffsets` or `startingTimestamp` can be present in the connection options of the AWS Glue streaming script; including both properties results in job failure.
+ `"endingOffsets"`: (Optional) The end point when a batch query is ended. Possible values are either `"latest"` or a JSON string that specifies an ending offset for each `TopicPartition`.

  For the JSON string, the format is `{"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}`. The value `-1` as an offset represents `"latest"`.
+ `"pollTimeoutMs"`: (Optional) The timeout in milliseconds to poll data from Kafka in Spark job executors. The default value is `600000`.
+ `"numRetries"`: (Optional) The number of times to retry before failing to fetch Kafka offsets. The default value is `3`.
+ `"retryIntervalMs"`: (Optional) The time in milliseconds to wait before retrying to fetch Kafka offsets. The default value is `10`.
+ `"maxOffsetsPerTrigger"`: (Optional) The rate limit on the maximum number of offsets that are processed per trigger interval. The specified total number of offsets is proportionally split across `topicPartitions` of different volumes. The default value is null, which means that the consumer reads all offsets until the known latest offset.
+ `"minPartitions"`: (Optional) The desired minimum number of partitions to read from Kafka. The default value is null, which means that the number of spark partitions is equal to the number of Kafka partitions.
+ `"includeHeaders"`: (Optional) Whether to include the Kafka headers. When the option is set to "true", the data output will contain an additional column named "glue_streaming_kafka_headers" with type `Array[Struct(key: String, value: String)]`. The default value is "false". This option is available in AWS Glue version 3.0 or later.
+ `"schema"`: (Required when `inferSchema` is set to `false`) The schema to use to process the payload. If the classification is `avro`, the provided schema must be in the Avro schema format. If the classification is not `avro`, the provided schema must be in the DDL schema format.

  The following are schema examples.

------
#### [ Example in DDL schema format ]

  ```
  'column1' INT, 'column2' STRING, 'column3' FLOAT
  ```

------
#### [ Example in Avro schema format ]

  ```
  {
  "type":"array",
  "items":
  {
  "type":"record",
  "name":"test",
  "fields":
  [
    {
      "name":"_id",
      "type":"string"
    },
    {
      "name":"index",
      "type":
      [
        "int",
        "string",
        "float"
      ]
    }
  ]
  }
  }
  ```

------
+ `"inferSchema"`: (Optional) The default value is 'false'. If set to 'true', the schema will be detected at runtime from the payload within `foreachbatch`.
+ `"avroSchema"`: (Deprecated) Parameter used to specify a schema of Avro data when Avro format is used. This parameter is now deprecated. Use the `schema` parameter.
+ `"addRecordTimestamp"`: (Optional) When this option is set to 'true', the data output will contain an additional column named "__src_timestamp" that indicates the time when the corresponding record was received by the topic. The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
+ `"emitConsumerLagMetrics"`: (Optional) When this option is set to 'true', for each batch, AWS Glue emits a CloudWatch metric for the duration between the oldest record received by the topic and the time it arrives in AWS Glue. The metric's name is "glue.driver.streaming.maxConsumerLagInMs". The default value is 'false'. This option is supported in AWS Glue version 4.0 or later.
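Putting a few of these options together, a minimal read sketch might look like the following. The connection and topic names are hypothetical placeholders, and the exact options you need depend on your cluster and AWS Glue version.

```
# Hypothetical connection and topic names; replace with your own.
kafka_read_options = {
    "connectionName": "my-kafka-connection",
    "topicName": "sensor-events",  # exactly one of topicName, assign, subscribePattern
    "classification": "json",
    "startingOffsets": "earliest",
    "inferSchema": "true",
}

def create_kafka_frame(glueContext):
    # In a Glue streaming job, glueContext is the job's GlueContext instance.
    return glueContext.create_data_frame.from_options(
        connection_type="kafka",
        connection_options=kafka_read_options,
        transformation_ctx="kafka_source",
    )
```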

When writing, use the following connection options with `"connectionType": "kafka"`:
+ `"connectionName"` (Required) Name of the AWS Glue connection used to connect to the Kafka cluster (similar to Kafka source).
+ `"topic"` (Required) If a topic column exists, then its value is used as the topic when writing the given row to Kafka, unless the topic configuration option is set. That is, the `topic` configuration option overrides the topic column.
+ `"partition"` (Optional) If a valid partition number is specified, that `partition` will be used when sending the record.

  If no partition is specified but a `key` is present, a partition will be chosen using a hash of the key.

  If neither `key` nor `partition` is present, a partition will be chosen based on sticky partitioning, which changes when at least batch.size bytes are produced to the partition.
+ `"key"` (Optional) Used for partitioning if `partition` is null.
+ `"classification"` (Optional) The file format used by the data in the record. Only JSON, CSV, and Avro are supported.

  With the Avro format, you can provide a custom `avroSchema` to serialize with, but note that this schema also needs to be provided on the source for deserializing. Otherwise, the Apache Avro schema is used for serializing by default.

Additionally, you can fine-tune the Kafka sink as required by updating the [Kafka producer configuration parameters](https://kafka.apache.org/documentation/#producerconfigs). Note that there is no allow listing on connection options; all the key-value pairs are persisted on the sink as is.

However, there is a small deny list of options that will not take effect. For more information, see [Kafka specific configurations](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html).
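As a hedged sketch, a write call using the options above might look like the following; the connection and topic names are placeholders, and you should verify the options against your cluster setup.

```
# Hypothetical connection and topic names; replace with your own.
kafka_write_options = {
    "connectionName": "my-kafka-connection",
    "topic": "enriched-events",
    "classification": "json",
}

def write_to_kafka(glueContext, frame):
    # frame is a DynamicFrame produced earlier in the job.
    return glueContext.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="kafka",
        connection_options=kafka_write_options,
    )
```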

# Azure Cosmos DB connections
<a name="aws-glue-programming-etl-connect-azurecosmos-home"></a>

You can use AWS Glue for Spark to read from and write to existing containers in Azure Cosmos DB using the NoSQL API in AWS Glue 4.0 and later versions. You can define what to read from Azure Cosmos DB with a SQL query. You connect to Azure Cosmos DB using an Azure Cosmos DB Key stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Azure Cosmos DB for NoSQL, consult [the Azure documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/).

## Configuring Azure Cosmos DB connections
<a name="aws-glue-programming-etl-connect-azurecosmos-configure"></a>

To connect to Azure Cosmos DB from AWS Glue, you will need to create and store your Azure Cosmos DB Key in an AWS Secrets Manager secret, then associate that secret with an Azure Cosmos DB AWS Glue connection.

**Prerequisites:** 
+ In Azure, you will need to identify or generate an Azure Cosmos DB Key for use by AWS Glue, `cosmosKey`. For more information, see [Secure access to data in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/secure-access-to-data?tabs=using-primary-key) in the Azure documentation.

**To configure a connection to Azure Cosmos DB:**

1. In AWS Secrets Manager, create a secret using your Azure Cosmos DB Key. To create a secret in Secrets Manager, follow the tutorial available in [ Create an AWS Secrets Manager secret ](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `spark.cosmos.accountKey` with the value *cosmosKey*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select Azure Cosmos DB.
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue Azure Cosmos DB connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
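The first step above, granting the job role read access to the secret, can be sketched as the following IAM policy statement. The Region, account ID, and secret name in the ARN are placeholders; the action `secretsmanager:GetSecretValue` is the permission Secrets Manager requires for reading a secret value.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:secretName-*"
        }
    ]
}
```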

## Reading from Azure Cosmos DB for NoSQL containers
<a name="aws-glue-programming-etl-connect-azurecosmos-read"></a>

**Prerequisites:** 
+ An Azure Cosmos DB for NoSQL container you would like to read from. You will need identification information for the container.

  An Azure Cosmos for NoSQL container is identified by its database and container. You must provide the database, *cosmosDBName*, and container, *cosmosContainerName*, names when connecting to the Azure Cosmos for NoSQL API.
+ An AWS Glue Azure Cosmos DB connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to Azure Cosmos DB*. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azurecosmos_read = glueContext.create_dynamic_frame.from_options(
    connection_type="azurecosmos",
    connection_options={
        "connectionName": connectionName,
        "spark.cosmos.database": cosmosDBName,
        "spark.cosmos.container": cosmosContainerName,
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `spark.cosmos.read.customQuery`.

For example:

```
azurecosmos_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="azurecosmos",
    connection_options={
        "connectionName": "connectionName",
        "spark.cosmos.database": cosmosDBName,
        "spark.cosmos.container": cosmosContainerName,
        "spark.cosmos.read.customQuery": "query"
    }
)
```

## Writing to Azure Cosmos DB for NoSQL containers
<a name="aws-glue-programming-etl-connect-azurecosmos-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to Azure Cosmos DB. If the container already has information, AWS Glue will append data from your DynamicFrame. If the information in the container has a different schema from the information you write, you will run into errors.

**Prerequisites:** 
+ An Azure Cosmos DB container you would like to write to. You will need identification information for the container. **You must create the container before calling the connection method.**

  An Azure Cosmos for NoSQL container is identified by its database and container. You must provide the database, *cosmosDBName*, and container, *cosmosContainerName*, names when connecting to the Azure Cosmos for NoSQL API.
+ An AWS Glue Azure Cosmos DB connection configured to provide auth and network location information. To acquire this, complete the steps in the previous procedure, *To configure a connection to Azure Cosmos DB*. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azurecosmos_write = glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="azurecosmos",
    connection_options={
        "connectionName": connectionName,
        "spark.cosmos.database": cosmosDBName,
        "spark.cosmos.container": cosmosContainerName
    }
)
```

## Azure Cosmos DB connection option reference
<a name="aws-glue-programming-etl-connect-azurecosmos-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Azure Cosmos DB connection configured to provide auth and network location information to your connection method.
+ `spark.cosmos.database` — Required. Used for Read/Write. Valid Values: database names. Azure Cosmos DB for NoSQL database name.
+ `spark.cosmos.container` — Required. Used for Read/Write. Valid Values: container names. Azure Cosmos DB for NoSQL container name.
+ `spark.cosmos.read.customQuery` — Used for Read. Valid Values: SELECT SQL queries. Custom query to select documents to be read.

# Azure SQL connections
<a name="aws-glue-programming-etl-connect-azuresql-home"></a>

You can use AWS Glue for Spark to read from and write to tables on Azure SQL Managed Instances in AWS Glue 4.0 and later versions. You can define what to read from Azure SQL with a SQL query. You connect to Azure SQL using user and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Azure SQL, consult the [Azure SQL documentation](https://azure.microsoft.com/en-us/products/azure-sql).

## Configuring Azure SQL connections
<a name="aws-glue-programming-etl-connect-azuresql-configure"></a>

To connect to Azure SQL from AWS Glue, you will need to create and store your Azure SQL credentials in an AWS Secrets Manager secret, then associate that secret with an Azure SQL AWS Glue connection.

**To configure a connection to Azure SQL:**

1. In AWS Secrets Manager, create a secret using your Azure SQL credentials. To create a secret in Secrets Manager, follow the tutorial available in [ Create an AWS Secrets Manager secret ](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `user` with the value *azuresqlUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password` with the value *azuresqlPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select Azure SQL.
   + When providing **Azure SQL URL**, provide a JDBC endpoint URL.

      The URL must be in the following format: `jdbc:sqlserver://databaseServerName:databasePort;databaseName=azuresqlDBname;`.

     AWS Glue requires the following URL properties: 
     + `databaseName` – A default database in Azure SQL to connect to.

     For more information about JDBC URLs for Azure SQL Managed Instances, see the [Microsoft documentation](https://learn.microsoft.com/en-us/sql/connect/jdbc/building-the-connection-url?view=azuresqldb-mi-current).
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue Azure SQL connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Reading from Azure SQL tables
<a name="aws-glue-programming-etl-connect-azuresql-read"></a>

**Prerequisites:** 
+ An Azure SQL table you would like to read from. You will need identification information for the table, *databaseName* and *tableIdentifier*.

  An Azure SQL table is identified by its database, schema, and table name. You must provide the database name and table name when connecting to Azure SQL. You must also provide the schema if it is not the default, "public". The database is provided through a URL property in *connectionName*; the schema and table name are provided through `dbtable`.
+ An AWS Glue Azure SQL connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to Azure SQL*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azuresql_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="azuresql",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableIdentifier"
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`.

For example:

```
azuresql_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="azuresql",
    connection_options={
        "connectionName": "connectionName",
        "query": "query"
    }
)
```

## Writing to Azure SQL tables
<a name="aws-glue-programming-etl-connect-azuresql-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to Azure SQL. If the table already has information, AWS Glue will append data from your DynamicFrame.

**Prerequisites:** 
+ An Azure SQL table you would like to write to. You will need identification information for the table, *databaseName* and *tableIdentifier*.

  An Azure SQL table is identified by its database, schema, and table name. You must provide the database name and table name when connecting to Azure SQL. You must also provide the schema if it is not the default, "public". The database is provided through a URL property in *connectionName*; the schema and table name are provided through `dbtable`.
+ Azure SQL auth information. Complete the steps in the previous procedure, *To configure a connection to Azure SQL*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
azuresql_write = glueContext.write_dynamic_frame.from_options(
    connection_type="azuresql",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableIdentifier"
    }
)
```

## Azure SQL connection option reference
<a name="aws-glue-programming-etl-connect-azuresql-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Azure SQL connection configured to provide auth information to your connection method.
+ `databaseName` — Used for Read/Write. Valid Values: Azure SQL database names. The name of the database in Azure SQL to connect to.
+ `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. Valid Values: Names of Azure SQL tables, or period-separated schema/table name combinations. Used to specify the table and schema that identify the table to connect to. The default schema is "public". If your table is in a non-default schema, provide this information in the form `schemaName.tableName`.
+ `query` — Used for Read. A Transact-SQL SELECT query defining what should be retrieved when reading from Azure SQL. For more information, see the [Microsoft documentation](https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql?view=azuresqldb-mi-current).

# BigQuery connections
<a name="aws-glue-programming-etl-connect-bigquery-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Google BigQuery in AWS Glue 4.0 and later versions. You can read from BigQuery with a Google SQL query. You connect to BigQuery using credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Google BigQuery, see the [Google Cloud BigQuery website](https://cloud.google.com/bigquery).

## Configuring BigQuery connections
<a name="aws-glue-programming-etl-connect-bigquery-configure"></a>

To connect to Google BigQuery from AWS Glue, you will need to create and store your Google Cloud Platform credentials in an AWS Secrets Manager secret, then associate that secret with a Google BigQuery AWS Glue connection.

**To configure a connection to BigQuery:**

1. In Google Cloud Platform, create and identify relevant resources:
   + Create or identify a GCP project containing BigQuery tables you would like to connect to.
   + Enable the BigQuery API. For more information, see [ Use the BigQuery Storage Read API to read table data ](https://cloud.google.com/bigquery/docs/reference/storage/#enabling_the_api).

1. In Google Cloud Platform, create and export service account credentials:

   You can use the BigQuery credentials wizard to expedite this step: [Create credentials](https://console.cloud.google.com/apis/credentials/wizard?api=bigquery.googleapis.com).

   To create a service account in GCP, follow the tutorial available in [Create service accounts](https://cloud.google.com/iam/docs/service-accounts-create).
   + When selecting **project**, select the project containing your BigQuery table.
   + When selecting GCP IAM roles for your service account, add or create a role that would grant appropriate permissions to run BigQuery jobs to read, write or create BigQuery tables.

   To create credentials for your service account, follow the tutorial available in [Create a service account key](https://cloud.google.com/iam/docs/keys-create-delete#creating).
   + When selecting key type, select **JSON**.

   You should now have downloaded a JSON file with credentials for your service account. It should look similar to the following:

   ```
   {
     "type": "service_account",
     "project_id": "*****",
     "private_key_id": "*****",
     "private_key": "*****",
     "client_email": "*****",
     "client_id": "*****",
     "auth_uri": "https://accounts.google.com/o/oauth2/auth",
     "token_uri": "https://oauth2.googleapis.com/token",
     "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
     "client_x509_cert_url": "*****",
     "universe_domain": "googleapis.com"
   }
   ```

1. Upload your credentials JSON file to an appropriately secure Amazon S3 location. Retain the path to the file, *s3secretpath* for future steps.

1. In AWS Secrets Manager, create a secret using your Google Cloud Platform credentials. To create a secret in Secrets Manager, follow the tutorial available in [ Create an AWS Secrets Manager secret ](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 

   When creating Key/value pairs, specify keys and values as follows:
   + For `token_uri`, `client_x509_cert_url`, `private_key_id`, `project_id`, `universe_domain`, `auth_provider_x509_cert_url`, `auth_uri`, `client_email`, `private_key`, `type`, `client_id` keys, specify the corresponding values in the downloaded JSON file.
   + For `spark.hadoop.google.cloud.auth.service.account.json.keyfile` key, specify the *s3secretpath*.

1. In the AWS Glue Data Catalog, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Google BigQuery.
   + When selecting an **AWS Secret**, provide *secretName*.

1. Grant the IAM role associated with your AWS Glue job permission to read *secretName*.

1. In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

## Reading from BigQuery tables
<a name="aws-glue-programming-etl-connect-bigquery-read"></a>

**Prerequisites:** 
+ A BigQuery table you would like to read from. You will need the BigQuery table and dataset names, in the form `[dataset].[table]`. Let's call this *tableName*.
+ The billing project for the BigQuery table. You will need the name of the project, *parentProject*. If there is no billing parent project, use the project containing the table.
+ BigQuery auth information. Complete the steps in the previous procedure, *To configure a connection to BigQuery*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "sourceType": "table",
        "table": "tableName",
    }
)
```

You can also provide a query to filter the results returned to your DynamicFrame. You will need to configure `query`, `sourceType`, `viewsEnabled`, and `materializationDataset`.

For example:

**Additional prerequisites:**

You will need to create or identify a BigQuery dataset, *materializationDataset*, where BigQuery can write materialized views for your queries.

You will need to grant appropriate GCP IAM permissions to your service account to create tables in *materializationDataset*.

```
glueContext.create_dynamic_frame.from_options(
            connection_type="bigquery",
            connection_options={
                "connectionName": "connectionName",
                "materializationDataset": materializationDataset,
                "parentProject": "parentProject",
                "viewsEnabled": "true",
                "sourceType": "query",
                "query": "select * from bqtest.test"
            }
        )
```

## Writing to BigQuery tables
<a name="aws-glue-programming-etl-connect-bigquery-write"></a>

This example writes directly to the BigQuery service. BigQuery also supports the "indirect" writing method. For more information about configuring indirect writes, see [Using indirect write with Google BigQuery](#aws-glue-programming-etl-connect-bigquery-indirect-write).

**Prerequisites:** 
+ A BigQuery table you would like to write to. You will need the BigQuery table and dataset names, in the form `[dataset].[table]`. You can also provide a new table name that will automatically be created. Let's call this *tableName*.
+ The billing project for the BigQuery table. You will need the name of the project, *parentProject*. If there is no billing parent project, use the project containing the table.
+ BigQuery auth information. Complete the steps in the previous procedure, *To configure a connection to BigQuery*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "writeMethod": "direct",
        "table": "tableName",
    }
)
```

## BigQuery connection option reference
<a name="aws-glue-programming-etl-connect-bigquery-reference"></a>
+ `project` — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with your table.
+ `table` — (Required) Used for Read/Write. The name of your BigQuery table in the format `[[project:]dataset.]table`.
+ `dataset` — Required when not defined through the `table` option. Used for Read/Write. The name of the dataset containing your BigQuery table.
+ `parentProject` — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with `project` used for billing.
+ `sourceType` — Used for Read. Required when reading. Valid Values: `table`, `query`. Informs AWS Glue of whether you will read by table or by query. 
+ `materializationDataset` — Used for Read. Valid Values: strings. The name of a BigQuery dataset used to store materializations for views.
+ `viewsEnabled` — Used for Read. Default: false. Valid Values: true, false. Configures whether BigQuery will use views. 
+ `query` — Used for Read. Used when `viewsEnabled` is true. A GoogleSQL DQL query.
+ `temporaryGcsBucket` — Used for Write. Required when `writeMethod` is set to default (`indirect`). Name of a Google Cloud Storage bucket used to store an intermediate form of your data while writing to BigQuery.
+ `writeMethod` — Default: `indirect`. Valid Values: `direct`, `indirect`. Used for Write. Specifies the method used to write your data.
  + If set to `direct`, your connector will write using the BigQuery Storage Write API.
  + If set to `indirect`, your connector will write your data to Google Cloud Storage, then transfer it to BigQuery using a load operation. Your Google Cloud service account will need appropriate GCS permissions.

## Using indirect write with Google BigQuery
<a name="aws-glue-programming-etl-connect-bigquery-indirect-write"></a>

This example uses indirect write, which writes data to Google Cloud Storage and copies it to Google BigQuery.

**Prerequisites:**

You will need a temporary Google Cloud Storage bucket, *temporaryBucket*.

The GCP IAM role for AWS Glue's GCP service account will need appropriate GCS permissions to access *temporaryBucket*.

**Additional Configuration:**

**To configure indirect write with BigQuery:**

1. Review [Configuring BigQuery connections](#aws-glue-programming-etl-connect-bigquery-configure) and locate or redownload your GCP credentials JSON file. Identify *secretName*, the AWS Secrets Manager secret for the Google BigQuery AWS Glue connection used in your job.

1. Upload your credentials JSON file to an appropriately secure Amazon S3 location. Retain the path to the file, *s3secretpath* for future steps.

1. Edit *secretName*, adding the `spark.hadoop.google.cloud.auth.service.account.json.keyfile` key. Set the value to *s3secretpath*.

1. Grant your AWS Glue job Amazon S3 IAM permissions to access *s3secretpath*.

You can now provide your temporary GCS bucket location to your write method. You do not need to provide `writeMethod`, as `indirect` is historically the default.

```
bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "temporaryGcsBucket": "temporaryBucket",
        "table": "tableName",
    }
)
```

# JDBC connections
<a name="aws-glue-programming-etl-connect-jdbc-home"></a>

 Certain, typically relational, database types support connecting through the JDBC standard. For more information about JDBC, see the [Java JDBC API](https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/) documentation. AWS Glue natively supports connecting to certain databases through their JDBC connectors - the JDBC libraries are provided in AWS Glue Spark jobs. When connecting to these database types using AWS Glue libraries, you have access to a standard set of options. 

The JDBC connectionType values include the following:
+ `"connectionType": "sqlserver"`: Designates a connection to a Microsoft SQL Server database.
+ `"connectionType": "mysql"`: Designates a connection to a MySQL database.
+ `"connectionType": "oracle"`: Designates a connection to an Oracle database.
+ `"connectionType": "postgresql"`: Designates a connection to a PostgreSQL database.
+ `"connectionType": "redshift"`: Designates a connection to an Amazon Redshift database. For more information, see [Redshift connections](aws-glue-programming-etl-connect-redshift-home.md).

The following table lists the JDBC driver versions that AWS Glue supports.


| Product | JDBC driver versions for Glue 5.1 | JDBC driver versions for Glue 5.0 | JDBC driver versions for Glue 4.0 | JDBC driver versions for Glue 3.0 | JDBC driver versions for Glue 0.9, 1.0, 2.0 | 
| --- | --- | --- | --- | --- | --- | 
| Microsoft SQL Server | 10.2.0 | 10.2.0 | 9.4.0 | 7.x | 6.x | 
| MySQL | 8.0.33 | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 | 
| Oracle Database | 23.3.0.23.09 | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 | 
| PostgreSQL | 42.7.3 | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.x | 
| Amazon Redshift * | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.29 | redshift-jdbc42-2.1.0.16 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017 | 

* For the Amazon Redshift connection type, all other option name/value pairs that are included in connection options for a JDBC connection, including formatting options, are passed directly to the underlying SparkSQL DataSource. In AWS Glue with Spark jobs in AWS Glue 4.0 and later versions, the AWS Glue native connector for Amazon Redshift uses the Amazon Redshift integration for Apache Spark. For more information, see [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html). In previous versions, see [Amazon Redshift data source for Spark](https://github.com/databricks/spark-redshift).

To configure your Amazon VPC to connect to Amazon RDS data stores using JDBC, refer to [Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue](setup-vpc-for-glue-access.md).

**Note**  
AWS Glue jobs are only associated with one subnet during a run. This may impact your ability to connect to multiple data sources through the same job. This behavior is not limited to JDBC sources.

**Topics**
+ [JDBC connection option reference](#aws-glue-programming-etl-connect-jdbc)
+ [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery)
+ [Use custom JDBC driver](#aws-glue-programming-etl-jdbc-custom-driver)
+ [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md)
+ [Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue](setup-vpc-for-glue-access.md)

## JDBC connection option reference
<a name="aws-glue-programming-etl-connect-jdbc"></a>

If you already have a JDBC AWS Glue connection defined, you can reuse the configuration properties defined in it, such as url, user, and password, so you don't have to specify them in the code as connection options. This feature is available in AWS Glue 3.0 and later versions. To do so, use the following connection properties:
+ `"useConnectionProperties"`: Set it to "true" to indicate you want to use the configuration from a connection.
+ `"connectionName"`: Enter the connection name to retrieve the configuration from, the connection must be defined in the same region as the job.
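
As a minimal sketch, the two properties can be combined with the usual reader call. The connection name, table, and connection type below are placeholder assumptions, not values from this guide:

```python
# Hypothetical sketch: reuse the url, user, and password stored in an
# existing AWS Glue connection instead of repeating them in the script.
# "my-jdbc-connection" and "public.sales" are placeholder names.
connection_options = {
    "useConnectionProperties": "true",
    "connectionName": "my-jdbc-connection",  # must exist in the job's Region
    "dbtable": "public.sales",
}

# Inside a Glue job script, pass the options to a reader, for example:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql", connection_options=connection_options)
```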

Use these connection options with JDBC connections:
+ `"url"`: (Required) The JDBC URL for the database.
+ `"dbtable"`: (Required) The database table to read from. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.
+ `"user"`: (Required) The user name to use when connecting.
+ `"password"`: (Required) The password to use when connecting.
+ (Optional) The following options allow you to supply a custom JDBC driver. Use these options if you must use a driver that AWS Glue does not natively support. 

  ETL jobs can use different JDBC driver versions for the data source and target, even if the source and target are the same database product. This allows you to migrate data between source and target databases with different versions. To use these options, you must first upload the JAR file of the JDBC driver to Amazon S3.
  + `"customJdbcDriverS3Path"`: The Amazon S3 path of the custom JDBC driver.
  + `"customJdbcDriverClassName"`: The class name of the JDBC driver.
+ `"bulkSize"`: (Optional) Used to configure parallel inserts for speeding up bulk loads into JDBC targets. Specify an integer value for the degree of parallelism to use when writing or inserting data. This option is helpful for improving the performance of writes into databases such as Amazon Aurora.
+ `"hashfield"` (Optional) A string, used to specify the name of a column in the JDBC table to be used to divide the data into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `"hashexpression"` (Optional) A SQL select clause returning a whole number. Used to divide the data in a JDBC table into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `"hashpartitions"` (Optional) A positive integer. Used to specify the number of parallel reads of the JDBC table when reading from JDBC tables in parallel. Default: 7. For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `"sampleQuery"`: (Optional) A custom SQL query statement. Used to specify a subset of information in a table to retrieve a sample of the table contents. **When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors.** For more information, see [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery).
+ `"enablePartitioningForSampleQuery"`: (Optional) A boolean. Default: false. Used to enable reading from JDBC tables in parallel when specifying `sampleQuery`. **If set to true, `sampleQuery` must end with "where" or "and" for AWS Glue to append partitioning conditions.** For more information, see [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery).
+ `"sampleSize"`: (Optional) A positive integer. Limits the number of rows returned by the sample query. Works only when `enablePartitioningForSampleQuery` is true. If partitioning is not enabled, you should instead directly add `"limit x"` in the `sampleQuery` to limit the size. For more information, see [Use sampleQuery](#aws-glue-programming-etl-jdbc-samplequery).
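
Putting the required options together, a minimal JDBC read might look like the following sketch. The host, table, and credentials are placeholders, not values from this guide, and in practice credentials should come from AWS Secrets Manager or a stored connection rather than being hard-coded:

```python
# Placeholder values; substitute your own endpoint and credentials.
connection_options = {
    "url": "jdbc:postgresql://<jdbc-host-name>:5432/db",  # required
    "dbtable": "public.test",                             # required
    "user": "admin",                                      # required
    "password": "pwd",                                    # required
}

# In a Glue job script:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql", connection_options=connection_options)
```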

## Use sampleQuery
<a name="aws-glue-programming-etl-jdbc-samplequery"></a>

This section explains how to use `sampleQuery`, `sampleSize` and `enablePartitioningForSampleQuery`.

`sampleQuery` can be an efficient way to sample a few rows of your dataset. By default, the query is run by a single executor. When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors. Running SQL on the underlying database as part of your ETL pipeline is generally only needed for performance purposes. If you are trying to preview a few rows of your dataset, consider using [show](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-show). If you are trying to transform your dataset using SQL, consider using [toDF](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF) to define a SparkSQL transform against your data in a DataFrame form.

While your query may manipulate a variety of tables, `dbtable` remains required.

**Using sampleQuery to retrieve a sample of your table**

When using the default `sampleQuery` behavior to retrieve a sample of your data, AWS Glue does not expect substantial throughput, so it runs your query on a single executor. To limit the data you retrieve and avoid performance problems, we suggest you provide SQL with a `LIMIT` clause.

**Example Use sampleQuery without partitioning**  
The following code example shows how to use `sampleQuery` without partitioning.  

```
//A full SQL query statement.
val query = s"select name from $tableName where age > 0 limit 1"
val connectionOptions = JsonOptions(Map(
    "url" -> url, 
    "dbtable" -> tableName, 
    "user" -> user, 
    "password" -> password, 
    "sampleQuery" -> query ))
val dyf = glueContext.getSource("mysql", connectionOptions)
          .getDynamicFrame()
```

**Using sampleQuery against larger datasets**

 If you're reading a large dataset, you might need to enable JDBC partitioning to query a table in parallel. For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md). To use `sampleQuery` with JDBC partitioning, set `enablePartitioningForSampleQuery` to true. Enabling this feature requires you to make some changes to your `sampleQuery`.

When using JDBC partitioning with `sampleQuery`, your query must end with "where" or "and" for AWS Glue to append partitioning conditions.

 If you would like to limit the results of your sampleQuery when reading from JDBC tables in parallel, set the `"sampleSize"` parameter rather than specifying a `LIMIT` clause.

**Example Use sampleQuery with JDBC partitioning**  
The following code example shows how to use `sampleQuery` with JDBC partitioning.  

```
//Note that the query should end with "where" or "and" if used with JDBC partitioning.
val query = s"select name from $tableName where age > 0 and"

//Enable JDBC partitioning by setting hashfield.
//to use sampleQuery with partitioning, set enablePartitioningForSampleQuery.
//use sampleSize to limit the size of returned data.
val connectionOptions = JsonOptions(Map(
    "url" -> url, 
    "dbtable" -> tableName, 
    "user" -> user, 
    "password" -> password, 
    "hashfield" -> primaryKey,
    "sampleQuery" -> query,
    "enablePartitioningForSampleQuery" -> true,
    "sampleSize" -> "1" ))
val dyf = glueContext.getSource("mysql", connectionOptions)
          .getDynamicFrame()
```

 **Notes and Restrictions:** 

Sample queries cannot be used together with job bookmarks. The bookmark state is ignored when both are configured.

## Use custom JDBC driver
<a name="aws-glue-programming-etl-jdbc-custom-driver"></a>

The following code examples show how to read from and write to JDBC databases with custom JDBC drivers. They demonstrate reading from one version of a database product, and writing to a later version of the same product.

------
#### [ Python ]

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Construct JDBC connection options
connection_mysql5_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"}

connection_mysql8_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://amzn-s3-demo-bucket/mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}

connection_oracle11_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"}

connection_oracle18_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://amzn-s3-demo-bucket/ojdbc10.jar",
    "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}

# Read from JDBC databases with custom driver
df_mysql8 = glueContext.create_dynamic_frame.from_options(connection_type="mysql",
                                                          connection_options=connection_mysql8_options)

# Read DynamicFrame from MySQL 5 and write to MySQL 8
df_mysql5 = glueContext.create_dynamic_frame.from_options(connection_type="mysql",
                                                          connection_options=connection_mysql5_options)
glueContext.write_from_options(frame_or_dfc=df_mysql5, connection_type="mysql",
                               connection_options=connection_mysql8_options)

# Read DynamicFrame from Oracle 11 and write to Oracle 18
df_oracle11 = glueContext.create_dynamic_frame.from_options(connection_type="oracle",
                                                            connection_options=connection_oracle11_options)
glueContext.write_from_options(frame_or_dfc=df_oracle11, connection_type="oracle",
                               connection_options=connection_oracle18_options)
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueApp {
  val MYSQL_5_URI: String = "jdbc:mysql://<jdbc-host-name>:3306/db"
  val MYSQL_8_URI: String = "jdbc:mysql://<jdbc-host-name>:3306/db"
  val ORACLE_11_URI: String = "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL"
  val ORACLE_18_URI: String = "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL"

  // Construct JDBC connection options
  lazy val mysql5JsonOption = jsonOptions(MYSQL_5_URI)
  lazy val mysql8JsonOption = customJDBCDriverJsonOptions(MYSQL_8_URI, "s3://amzn-s3-demo-bucket/mysql-connector-java-8.0.17.jar", "com.mysql.cj.jdbc.Driver")
  lazy val oracle11JsonOption = jsonOptions(ORACLE_11_URI)
  lazy val oracle18JsonOption = customJDBCDriverJsonOptions(ORACLE_18_URI, "s3://amzn-s3-demo-bucket/ojdbc10.jar", "oracle.jdbc.OracleDriver")

  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read from JDBC database with custom driver
    val df_mysql8: DynamicFrame = glueContext.getSource("mysql", mysql8JsonOption).getDynamicFrame()

    // Read DynamicFrame from MySQL 5 and write to MySQL 8
    val df_mysql5: DynamicFrame = glueContext.getSource("mysql", mysql5JsonOption).getDynamicFrame()
    glueContext.getSink("mysql", mysql8JsonOption).writeDynamicFrame(df_mysql5)

    // Read DynamicFrame from Oracle 11 and write to Oracle 18
    val df_oracle11: DynamicFrame = glueContext.getSource("oracle", oracle11JsonOption).getDynamicFrame()
    glueContext.getSink("oracle", oracle18JsonOption).writeDynamicFrame(df_oracle11)

    Job.commit()
  }

  private def jsonOptions(url: String): JsonOptions = {
    new JsonOptions(
      s"""{"url": "${url}",
         |"dbtable":"test",
         |"user": "admin",
         |"password": "pwd"}""".stripMargin)
  }

  private def customJDBCDriverJsonOptions(url: String, customJdbcDriverS3Path: String, customJdbcDriverClassName: String): JsonOptions = {
    new JsonOptions(
      s"""{"url": "${url}",
         |"dbtable":"test",
         |"user": "admin",
         |"password": "pwd",
         |"customJdbcDriverS3Path": "${customJdbcDriverS3Path}",
         |"customJdbcDriverClassName" : "${customJdbcDriverClassName}"}""".stripMargin)
  }
}
```

------

# Reading from JDBC tables in parallel
<a name="run-jdbc-parallel-read-job"></a>

You can set properties of your JDBC table to enable AWS Glue to read data in parallel. When you set certain properties, you instruct AWS Glue to run parallel SQL queries against logical partitions of your data. You can control partitioning by setting a hash field or a hash expression. You can also control the number of parallel reads that are used to access your data. 

Reading from JDBC tables in parallel is an optimization technique that may improve performance. For more information about the process of identifying when this technique is appropriate, consult [Reduce the amount of data scan](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/parallelize-tasks.html) in the *Best practices for performance tuning AWS Glue for Apache Spark jobs* guide on AWS Prescriptive Guidance.

To enable parallel reads, you can set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see [Viewing and managing table details](tables-described.md#console-tables-details). You can also enable parallel reads when you call the ETL (extract, transform, and load) methods `create_dynamic_frame_from_options` and `create_dynamic_frame_from_catalog`. For more information about specifying options in these methods, see [from_options](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options) and [from_catalog](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_catalog). 

You can use this method for JDBC tables, that is, most tables whose base data is a JDBC data store. These properties are ignored when reading Amazon Redshift and Amazon S3 tables.

**hashfield**  
Set `hashfield` to the name of a column in the JDBC table to be used to divide the data into partitions. For best results, this column should have an even distribution of values to spread the data between partitions. This column can be of any data type. AWS Glue generates non-overlapping queries that run in parallel to read the data partitioned by this column. For example, if your data is evenly distributed by month, you can use the `month` column to read each month of data in parallel.  

```
  'hashfield': 'month'
```
AWS Glue creates a query to hash the field value to a partition number and runs the query for all partitions in parallel. To use your own query to partition a table read, provide a `hashexpression` instead of a `hashfield`.

**hashexpression**  
Set `hashexpression` to an SQL expression (conforming to the JDBC database engine grammar) that returns a whole number. A simple expression is the name of any numeric column in the table. AWS Glue generates SQL queries to read the JDBC data in parallel using the `hashexpression` in the `WHERE` clause to partition data.  
For example, use the numeric column `customerID` to read data partitioned by a customer number.  

```
  'hashexpression': 'customerID'
```
To have AWS Glue control the partitioning, provide a `hashfield` instead of a `hashexpression`.

**hashpartitions**  
Set `hashpartitions` to the number of parallel reads of the JDBC table. If this property is not set, the default value is 7.  
For example, set the number of parallel reads to `5` so that AWS Glue reads your data with five queries (or fewer).  

```
  'hashpartitions': '5'
```
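
Taken together, these properties can be supplied as connection options when reading with `create_dynamic_frame_from_options`. The sketch below uses placeholder endpoint, table, and credential values; the column name `month` follows the `hashfield` example above:

```python
# Placeholder endpoint and credentials; "orders" is an assumed table name.
parallel_read_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "orders",
    "user": "admin",
    "password": "pwd",
    "hashfield": "month",   # AWS Glue hashes this column to a partition number
    "hashpartitions": "5",  # at most five parallel read queries
}

# In a Glue job script:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="mysql", connection_options=parallel_read_options)
```

Provide either `hashfield` or `hashexpression` here, not both, for the reasons described above.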

# Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue
<a name="setup-vpc-for-glue-access"></a>

 When using JDBC to connect to databases in Amazon RDS, you will need to perform additional setup. To enable AWS Glue components to communicate with Amazon RDS, you must set up access to your Amazon RDS data stores in Amazon VPC. To enable AWS Glue to communicate between its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC. A self-referencing rule will not open the VPC to all networks. The default security group for your VPC might already have a self-referencing inbound rule for ALL Traffic. 

**To set up access between AWS Glue and Amazon RDS data stores**

1. Sign in to the AWS Management Console and open the Amazon RDS console at [https://console.aws.amazon.com/rds/](https://console.aws.amazon.com/rds/).

1. In the Amazon RDS console, identify the security group(s) used to control access to your Amazon RDS database.

   In the left navigation pane, choose **Databases**, then select the instance you would like to connect to from the list in the main pane.

   In the database detail page, find **VPC security groups** on the **Connectivity & security** tab.

1. Based on your network architecture, identify which associated security group is best to modify to allow access for the AWS Glue service. Save its name, *database-security-group*, for future reference. If there is no appropriate security group, follow the directions to [Provide access to your DB instance in your VPC by creating a security group](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SettingUp.html) in the Amazon RDS documentation.

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. In the Amazon VPC console, identify how to update *database-security-group*.

   In the left navigation pane, choose **Security groups**, then select *database-security-group* from the list in the main pane.

1. Identify the security group ID for *database-security-group*, *database-sg-id*. Save it for future reference.

   In the security group detail page, find **Security group ID**.

1. Alter the inbound rules for *database-security-group* by adding a self-referencing rule that allows AWS Glue components to communicate. Specifically, add or confirm that there is a rule where **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and **Source** is *database-sg-id*. Verify that the security group you have entered for **Source** is the same as the security group you are editing.

   In the security group detail page, select **Edit inbound rules**.

   The inbound rule looks similar to this:  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)

1. Add rules for outbound traffic.

   In the security group detail page, select **Edit outbound rules**.

   If your security group allows all outbound traffic, you do not need separate rules. For example:  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)

   If your network architecture is designed for you to restrict outbound traffic, create the following outbound rules:

   Create a self-referencing rule where **Type** is `All TCP`, **Protocol** is `TCP`, **Port Range** includes all ports, and **Destination** is *database-sg-id*. Verify that the security group you have entered for **Destination** is the same as the security group you are editing.

    If using an Amazon S3 VPC endpoint, add an HTTPS rule to allow traffic from the VPC to Amazon S3. Create a rule where **Type** is `HTTPS`, **Protocol** is `TCP`, **Port Range** is `443` and **Destination** is the ID of the managed prefix list for the Amazon S3 gateway endpoint, *s3-prefix-list-id*. For more information about prefix lists and Amazon S3 gateway endpoints, see [Gateway endpoints for Amazon S3](https://docs.aws.amazon.com//vpc/latest/privatelink/vpc-endpoints-s3.html) in the Amazon VPC documentation.

   For example:  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html)

# MongoDB connections
<a name="aws-glue-programming-etl-connect-mongodb-home"></a>

You can use AWS Glue for Spark to read from and write to tables in MongoDB and MongoDB Atlas in AWS Glue 4.0 and later versions. You can connect to MongoDB using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about MongoDB, consult the [MongoDB documentation](https://www.mongodb.com/docs/).

## Configuring MongoDB connections
<a name="aws-glue-programming-etl-connect-mongodb-configure"></a>

To connect to MongoDB from AWS Glue, you will need your MongoDB credentials, *mongodbUser* and *mongodbPass*.

To connect to MongoDB from AWS Glue, you may need some prerequisites:
+ If your MongoDB instance is in an Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the MongoDB instance without traffic traversing the public internet. 

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your MongoDB instance and this location. Based on your network layout, this may require changes to security group rules, Network ACLs, NAT Gateways and Peering connections.

You can then proceed to configure AWS Glue for use with MongoDB.

**To configure a connection to MongoDB:**

1. Optionally, in AWS Secrets Manager, create a secret using your MongoDB credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step. 
   + When selecting **Key/value pairs**, create a pair for the key `username` with the value *mongodbUser*.

     When selecting **Key/value pairs**, create a pair for the key `password` with the value *mongodbPass*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select **MongoDB** or **MongoDB Atlas**.
   + When selecting **MongoDB URL** or **MongoDB Atlas URL**, provide the hostname of your MongoDB instance.

     A MongoDB URL is provided in the format `mongodb://mongoHost:mongoPort/mongoDBname`.

     A MongoDB Atlas URL is provided in the format `mongodb+srv://mongoHost/mongoDBname`.
   + If you chose to create a Secrets Manager secret, choose the AWS Secrets Manager **Credential type**.

     Then, in **AWS Secret** provide *secretName*.
   + If you choose to provide **Username and password**, provide *mongodbUser* and *mongodbPass*.

1. In the following situations, you may require additional configuration:
   + For MongoDB instances hosted on AWS in an Amazon VPC, you will need to provide Amazon VPC connection information to the AWS Glue connection that defines your MongoDB security credentials. When creating or updating your connection, set **VPC**, **Subnet**, and **Security groups** in **Network options**.

After creating an AWS Glue MongoDB connection, you will need to perform the following actions before calling your connection method:
+ If you chose to create a Secrets Manager secret, grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.

To use your AWS Glue MongoDB connection in AWS Glue for Spark, provide the `connectionName` option in your connection method call. Alternatively, you can follow the steps in [Working with MongoDB connections in ETL jobs](integrate-with-mongo-db.md) to use the connection in conjunction with the AWS Glue Data Catalog.

## Reading from MongoDB using an AWS Glue connection
<a name="aws-glue-programming-etl-connect-mongodb-read"></a>

**Prerequisites:** 
+ A MongoDB collection you would like to read from. You will need identification information for the collection.

  A MongoDB collection is identified by a database name and a collection name, *mongodbName*, *mongodbCollection*.
+ An AWS Glue MongoDB connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to MongoDB*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
mongodb_read = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "connectionName": "connectionName",
        "database": "mongodbName",
        "collection": "mongodbCollection",
        "partitioner": "com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id",
        "disableUpdateUri": "false",
    }
)
```

## Writing to MongoDB tables
<a name="aws-glue-programming-etl-connect-mongodb-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame* to MongoDB.

**Prerequisites:** 
+ A MongoDB collection you would like to write to. You will need identification information for the collection.

  A MongoDB collection is identified by a database name and a collection name, *mongodbName*, *mongodbCollection*.
+ An AWS Glue MongoDB connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to MongoDB*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="mongodb",
    connection_options={
        "connectionName": "connectionName",
        "database": "mongodbName",
        "collection": "mongodbCollection",
        "disableUpdateUri": "false",
        "retryWrites": "false", 
    },
)
```

## Reading and writing to MongoDB tables
<a name="aws-glue-programming-etl-connect-mongodb-read-write"></a>

This example reads a DynamicFrame from MongoDB and writes it back to MongoDB.

**Prerequisites:** 
+ A MongoDB collection you would like to read from. You will need identification information for the collection.

  A MongoDB collection you would like to write to. You will need identification information for the collection.

  A MongoDB collection is identified by a database name and a collection name, *mongodbName*, *mongodbCollection*.
+ MongoDB auth information, *mongodbUser* and *mongodbPassword*.

For example: 

------
#### [ Python ]

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

output_path = "s3://some_bucket/output/" + str(time.time()) + "/"
mongo_uri = "mongodb://<mongo-instanced-ip-address>:27017"
mongo_ssl_uri = "mongodb://<mongo-instanced-ip-address>:27017"
write_uri = "mongodb://<mongo-instanced-ip-address>:27017"

read_mongo_options = {
    "uri": mongo_uri,
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "username": "mongodbUsername",
    "password": "mongodbPassword",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB": "10",
    "partitionerOptions.partitionKey": "_id"}

ssl_mongo_options = {
    "uri": mongo_ssl_uri,
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "ssl": "true",
    "ssl.domain_match": "false"
}

write_mongo_options = {
    "uri": write_uri,
    "database": "mongodbName",
    "collection": "mongodbCollection",
    "username": "mongodbUsername",
    "password": "mongodbPassword",
}

# Get DynamicFrame from MongoDB
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="mongodb",
                                                              connection_options=read_mongo_options)

# Write DynamicFrame to MongoDB
glueContext.write_dynamic_frame.from_options(frame=dynamic_frame, connection_type="mongodb", connection_options=write_mongo_options)

job.commit()
```

------
#### [ Scala ]

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val DEFAULT_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  val WRITE_URI: String = "mongodb://<mongo-instanced-ip-address>:27017"
  lazy val defaultJsonOption = jsonOptions(DEFAULT_URI)
  lazy val writeJsonOption = jsonOptions(WRITE_URI)
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get DynamicFrame from MongoDB
    val dynamicFrame: DynamicFrame = glueContext.getSource("mongodb", defaultJsonOption).getDynamicFrame()

    // Write DynamicFrame to MongoDB
    glueContext.getSink("mongodb", writeJsonOption).writeDynamicFrame(dynamicFrame)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database":"mongodbName",
         |"collection":"mongodbCollection",
         |"username": "mongodbUsername",
         |"password": "mongodbPassword",
         |"ssl":"true",
         |"ssl.domain_match":"false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
```

------

## MongoDB connection option reference
<a name="aws-glue-programming-etl-connect-mongodb"></a>

Designates a connection to MongoDB. Connection options differ for a source connection and a sink connection.

These connection properties are shared between source and sink connections:
+ `connectionName` — Used for Read/Write. The name of an AWS Glue MongoDB connection configured to provide auth and networking information to your connection method. When an AWS Glue connection is configured as described in the previous section, [Configuring MongoDB connections](#aws-glue-programming-etl-connect-mongodb-configure), providing `connectionName` replaces the need to provide the `"uri"`, `"username"`, and `"password"` connection options. 
+ `"uri"`: (Required) The MongoDB host to read from, formatted as `mongodb://<host>:<port>`. Used in AWS Glue versions prior to AWS Glue 4.0.
+ `"connection.uri"`: (Required) The MongoDB host to read from, formatted as `mongodb://<host>:<port>`. Used in AWS Glue 4.0 and later versions.
+ `"username"`: (Required) The MongoDB user name.
+ `"password"`: (Required) The MongoDB password.
+ `"database"`: (Required) The MongoDB database to read from. This option can also be passed in `additional_options` when calling `glue_context.create_dynamic_frame_from_catalog` in your job script.
+ `"collection"`: (Required) The MongoDB collection to read from. This option can also be passed in `additional_options` when calling `glue_context.create_dynamic_frame_from_catalog` in your job script.

### "connectionType": "mongodb" as source
<a name="etl-connect-mongodb-as-source"></a>

Use the following connection options with `"connectionType": "mongodb"` as a source:
+ `"ssl"`: (Optional) If `true`, initiates an SSL connection. The default is `false`.
+ `"ssl.domain_match"`: (Optional) If `true` and `ssl` is `true`, domain match check is performed. The default is `true`.
+ `"batchSize"`: (Optional): The number of documents to return per batch, used within the cursor of internal batches.
+ `"partitioner"`: (Optional): The class name of the partitioner for reading input data from MongoDB. The connector provides the following partitioners:
  + `MongoDefaultPartitioner` (default) (Not supported in AWS Glue 4.0)
  + `MongoSamplePartitioner` (Requires MongoDB 3.2 or later) (Not supported in AWS Glue 4.0)
  + `MongoShardedPartitioner` (Not supported in AWS Glue 4.0)
  + `MongoSplitVectorPartitioner` (Not supported in AWS Glue 4.0)
  + `MongoPaginateByCountPartitioner` (Not supported in AWS Glue 4.0)
  + `MongoPaginateBySizePartitioner` (Not supported in AWS Glue 4.0)
  + `com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner`
  + `com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner`
  + `com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner`
+ `"partitionerOptions"` (Optional): Options for the designated partitioner. The following options are supported for each partitioner:
  + `MongoSamplePartitioner`: `partitionKey`, `partitionSizeMB`, `samplesPerPartition`
  + `MongoShardedPartitioner`: `shardkey`
  + `MongoSplitVectorPartitioner`: `partitionKey`, `partitionSizeMB`
  + `MongoPaginateByCountPartitioner`: `partitionKey`, `numberOfPartitions`
  + `MongoPaginateBySizePartitioner`: `partitionKey`, `partitionSizeMB`

  For more information about these options, see [Partitioner Configuration](https://docs.mongodb.com/spark-connector/master/configuration/#partitioner-conf) in the MongoDB documentation.

### "connectionType": "mongodb" as sink
<a name="etl-connect-mongodb-as-sink"></a>

Use the following connection options with `"connectionType": "mongodb"` as a sink:
+ `"ssl"`: (Optional) If `true`, initiates an SSL connection. The default is `false`.
+ `"ssl.domain_match"`: (Optional) If `true` and `ssl` is `true`, domain match check is performed. The default is `true`.
+ `"extendedBsonTypes"`: (Optional) If `true`, allows extended BSON types when writing data to MongoDB. The default is `true`.
+ `"replaceDocument"`: (Optional) If `true`, replaces the whole document when saving datasets that contain an `_id` field. If `false`, only fields in the document that match the fields in the dataset are updated. The default is `true`.
+ `"maxBatchSize"`: (Optional): The maximum batch size for bulk operations when saving data. The default is 512.
+ `"retryWrites"`: (Optional): Automatically retry certain write operations a single time if AWS Glue encounters a network error.
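As a quick Python illustration of the option reference above (host, database, collection, and credential values are placeholders, not working credentials), source and sink options might be assembled like this:

```python
# Hedged sketch: connection option dictionaries assembled from the
# reference above. All values are illustrative placeholders.
read_options = {
    "connection.uri": "mongodb://<host>:27017",  # use "uri" before Glue 4.0
    "database": "records",
    "collection": "events",
    "username": "mongodbUsername",
    "password": "mongodbPassword",
    "ssl": "true",
    "ssl.domain_match": "false",
}

write_options = {
    "connection.uri": "mongodb://<host>:27017",
    "database": "records",
    "collection": "events_copy",
    "username": "mongodbUsername",
    "password": "mongodbPassword",
    "retryWrites": "true",
    "maxBatchSize": "256",  # default is 512
}

# In a Glue job, these dicts would be passed as connection_options:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="mongodb", connection_options=read_options)
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf, connection_type="mongodb", connection_options=write_options)
```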

# SAP HANA connections
<a name="aws-glue-programming-etl-connect-saphana-home"></a>

You can use AWS Glue for Spark to read from and write to tables in SAP HANA in AWS Glue 4.0 and later versions. You can define what to read from SAP HANA with a SQL query. You connect to SAP HANA using JDBC credentials stored in AWS Secrets Manager through an AWS Glue SAP HANA connection.

For more information about SAP HANA JDBC, consult [the SAP HANA documentation](https://help.sap.com/docs/SAP_HANA_PLATFORM/0eec0d68141541d1b07893a39944924e/ff15928cf5594d78b841fbbe649f04b4.html).

## Configuring SAP HANA connections
<a name="aws-glue-programming-etl-connect-saphana-configure"></a>

To connect to SAP HANA from AWS Glue, you will need to create and store your SAP HANA credentials in an AWS Secrets Manager secret, then associate that secret with a SAP HANA AWS Glue connection. You will also need to configure network connectivity between your SAP HANA service and AWS Glue.

To connect to SAP HANA, you may need some prerequisites:
+ If your SAP HANA service is in an Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the SAP HANA service without traffic traversing the public internet.

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your SAP HANA endpoint and this location. Your job will need to establish a TCP connection with your SAP HANA JDBC port. For more information about SAP HANA ports, see the [SAP HANA documentation](https://help.sap.com/docs/HANA_SMART_DATA_INTEGRATION/7952ef28a6914997abc01745fef1b607/88e2e8bded9e4041ad3ad87dc46c7b55.html?locale=en-US). Based on your network layout, this may require changes to security group rules, Network ACLs, NAT Gateways and Peering connections.
+ There are no additional prerequisites if your SAP HANA endpoint is internet accessible.

**To configure a connection to SAP HANA:**

1. In AWS Secrets Manager, create a secret using your SAP HANA credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `username/USERNAME` with the value *saphanaUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password/PASSWORD` with the value *saphanaPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for future use in AWS Glue. 
   + When selecting a **Connection type**, select SAP HANA.
   + When providing **SAP HANA URL**, provide the URL for your instance.

     SAP HANA JDBC URLs are in the form `jdbc:sap://saphanaHostname:saphanaPort/?databaseName=saphanaDBname,ParameterName=ParameterValue`

     AWS Glue requires the following JDBC URL parameters: 
     + `databaseName` – A default database in SAP HANA to connect to.
   + When selecting an **AWS Secret**, provide *secretName*.

After creating an AWS Glue SAP HANA connection, you will need to perform the following steps before running your AWS Glue job:
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
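The SAP HANA JDBC URL format from the procedure above can be sketched as a small helper; the hostname, port, and database name here are placeholders, and extra keyword arguments are appended as comma-separated URL parameters as the format requires:

```python
def saphana_jdbc_url(host, port, database, **params):
    """Build a SAP HANA JDBC URL in the form
    jdbc:sap://host:port/?databaseName=db,Name=Value,... (a sketch)."""
    url = f"jdbc:sap://{host}:{port}/?databaseName={database}"
    for name, value in params.items():
        url += f",{name}={value}"
    return url

# Placeholder host and database; not a real endpoint.
url = saphana_jdbc_url("saphana.example.com", 39015, "HDB")
# → "jdbc:sap://saphana.example.com:39015/?databaseName=HDB"
```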

## Reading from SAP HANA tables
<a name="aws-glue-programming-etl-connect-saphana-read"></a>

**Prerequisites:** 
+ A SAP HANA table you would like to read from. You will need identification information for the table.

  A table can be specified with a SAP HANA table name and schema name, in the form `schemaName.tableName`. The schema name and "." separator are not required if the table is in the default schema, "public". Call this *tableIdentifier*. Note that the database is provided as a JDBC URL parameter in `connectionName`.
+ An AWS Glue SAP HANA connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to SAP HANA*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
saphana_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="saphana",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableIdentifier",
    }
)
```

You can also provide a `SELECT` SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`.

For example:

```
saphana_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="saphana",
    connection_options={
        "connectionName": "connectionName",
        "query": "query"
    }
)
```

## Writing to SAP HANA tables
<a name="aws-glue-programming-etl-connect-saphana-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to SAP HANA. If the table already has information, AWS Glue will error.

**Prerequisites:** 
+ A SAP HANA table you would like to write to. 

  A table can be specified with a SAP HANA table name and schema name, in the form `schemaName.tableName`. The schema name and "." separator are not required if the table is in the default schema, "public". Call this *tableIdentifier*. Note that the database is provided as a JDBC URL parameter in `connectionName`.
+ SAP HANA auth information. Complete the steps in the previous procedure, *To configure a connection to SAP HANA*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example: 

```
options = {
    "connectionName": "connectionName",
    "dbtable": "tableIdentifier"
}

saphana_write = glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="saphana",
    connection_options=options
)
```

## SAP HANA connection option reference
<a name="aws-glue-programming-etl-connect-saphana-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue SAP HANA connection configured to provide auth and networking information to your connection method.
+ `databaseName` — Used for Read/Write. Valid Values: names of databases in SAP HANA. Name of database to connect to.
+ `dbtable` — Required for writing, and required for reading unless `query` is provided. Used for Read/Write. Valid Values: contents of a SAP HANA SQL FROM clause. Identifies a table in SAP HANA to connect to. You may also provide SQL other than a table name, such as a subquery. For more information, see the [From clause](https://help.sap.com/docs/SAP_HANA_PLATFORM/4fe29514fd584807ac9f2a04f6754767/20fcf24075191014a89e9dc7b8408b26.html#loio20fcf24075191014a89e9dc7b8408b26__from_clause) in the SAP HANA documentation.
+ `query` — Used for Read. A SAP HANA SQL SELECT query defining what should be retrieved when reading from SAP HANA.

# Snowflake connections
<a name="aws-glue-programming-etl-connect-snowflake-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Snowflake in AWS Glue 4.0 and later versions. You can read from Snowflake with a SQL query. You can connect to Snowflake using one of three methods: basic authentication (username and password), OAuth authentication, or key-pair authentication. You can refer to Snowflake credentials stored in AWS Secrets Manager through AWS Glue Data connections. Data connection Snowflake credentials for AWS Glue for Spark are stored separately from Data Catalog Snowflake credentials for crawlers. You must choose a `SNOWFLAKE` type connection, not a `JDBC` type connection configured to connect to Snowflake.

For more information about Snowflake, see the [Snowflake website](https://www.snowflake.com/). For more information about Snowflake on AWS, see [Snowflake Data Warehouse on Amazon Web Services](https://aws.amazon.com/financial-services/partner-solutions/snowflake/).

## Configuring Snowflake connections
<a name="aws-glue-programming-etl-connect-snowflake-configure"></a>

There are no AWS prerequisites for connecting to Snowflake databases that are available through the internet.

Optionally, you can perform the following configuration to manage your connection credentials with AWS Glue.

**To manage your connection credentials with AWS Glue**

1. In AWS Secrets Manager, create a secret using your Snowflake credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html#create_secret_cli) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + For OAuth authentication:
     + When selecting **Key/value pairs**, create a pair for *snowflakeUser* with the key `sfUser`
     + When selecting **Key/value pairs**, create a pair for *OAUTH_CLIENT_SECRET* with the key `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET`
   + For Key-pair authentication:
     + When selecting **Key/value pairs**, create a pair for *snowflakeUser* with the key `sfUser`
     + When selecting **Key/value pairs**, create a pair for *private key* with the key `pem_private_key`
   + For basic authentication:
     + When selecting **Key/value pairs**, create a pair for *snowflakeUser* with the key `USERNAME`
     + When selecting **Key/value pairs**, create a pair for *snowflakePassword* with the key `PASSWORD`
   + When selecting **Key/value pairs**, you can provide your Snowflake warehouse with the key `sfWarehouse`.
   + When selecting **Key/value pairs**, you can provide additional Snowflake connection properties using their corresponding Spark property names as keys. Supported properties include:
     + `sfDatabase` - Snowflake database name
     + `sfSchema` - Snowflake schema name
     + `sfRole` - Snowflake role name

1. In the AWS Glue Studio Console, create a connection by choosing **Data Connections**, then **Create connection**. Follow the steps in the connection wizard to complete the process:
   + When selecting a **Data source**, select Snowflake, then choose **Next**.
   + Enter the connection details such as host and port. When entering the host **Snowflake URL**, provide the URL of your Snowflake instance. The URL will typically use a hostname in the form `account_identifier.snowflakecomputing.com`. However, the URL format may vary depending on your Snowflake account type (for example, AWS, Azure, or Snowflake-hosted).
   + When selecting the IAM service role, choose from the drop-down menu. This is the IAM role from your account that will be used to access AWS Secrets Manager and to assign an IP address if a VPC is specified.
   + When selecting an **AWS Secret**, provide *secretName*.

1. In the next step in the wizard, set properties for your Snowflake connection. 

1. In the final step in the wizard, review your settings and then complete the process to create your connection.

For Snowflake hosted on AWS in an Amazon VPC, you may require the following:
+ You will need appropriate Amazon VPC configuration for Snowflake. For more information on how to configure your Amazon VPC, consult [AWS PrivateLink & Snowflake ](https://docs.snowflake.com/en/user-guide/admin-security-privatelink) in the Snowflake documentation.
+ You will need appropriate Amazon VPC configuration for AWS Glue. For more information, see [Configuring interface VPC endpoints (AWS PrivateLink) for AWS Glue](vpc-interface-endpoints.md).
+ You will need to create an AWS Glue Data Catalog connection that provides Amazon VPC connection information (in addition to the ID of an AWS Secrets Manager secret that defines your Snowflake security credentials). Your URL will change when using AWS PrivateLink, as described in the Snowflake documentation linked in a previous item.
+ You will need your job configuration to include the Data Catalog connection as an **Additional network connection**.

## Reading from Snowflake tables
<a name="aws-glue-programming-etl-connect-snowflake-read"></a>

**Prerequisites:** A Snowflake table you would like to read from. You will need the Snowflake table name, *tableName*. If your Snowflake user does not have a default namespace set, you will need the Snowflake database name, *databaseName*, and the schema name, *schemaName*. Additionally, if your Snowflake user does not have a default warehouse set, you will need a warehouse name, *warehouseName*. The `connectionName` parameter selects which **Additional network connection** to connect with.

```
snowflake_read = glueContext.create_dynamic_frame.from_options(
  connection_type="snowflake",
  connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "sfDatabase": "databaseName",
        "sfSchema": "schemaName",
        "sfWarehouse": "warehouseName",
    }
)
```

 Additionally, you can use the `autopushdown` and `query` parameters to read a portion of a Snowflake table. This can be substantially more efficient than filtering your results after they have been loaded into Spark. Consider an example where all sales are stored in the same table, but you only need to analyze sales from a certain store on holidays. If that information is stored in the table, you could use predicate pushdown to retrieve the results as follows:

```
snowflake_node = glueContext.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "autopushdown": "on",
        "query": "select * from sales where store='1' and IsHoliday='TRUE'",
        "connectionName": "snowflake-glue-conn",
        "sfDatabase": "databaseName",
        "sfSchema": "schemaName",
        "sfWarehouse": "warehouseName",
    }
)
```

## Writing to Snowflake tables
<a name="aws-glue-programming-etl-connect-snowflake-write"></a>

**Prerequisites:** A Snowflake database you would like to write to. You will need a current or desired table name, *tableName*. If your Snowflake user does not have a default namespace set, you will need the Snowflake database name, *databaseName*, and the schema name, *schemaName*. Additionally, if your Snowflake user does not have a default warehouse set, you will need a warehouse name, *warehouseName*. The `connectionName` parameter selects which **Additional network connection** to connect with.

```
glueContext.write_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "sfDatabase": "databaseName",
        "sfSchema": "schemaName",
        "sfWarehouse": "warehouseName",
    },
)
```

## Snowflake connection option reference
<a name="aws-glue-programming-etl-connect-snowflake-reference"></a>

The Snowflake connection type takes the following connection options:

You can retrieve some of the parameters in this section from an AWS Glue connection (`sfUrl`, `sfUser`, `sfPassword`), in which case you are not required to provide them. You can do this by providing the parameter `connectionName`.

You can retrieve connection parameters from AWS Secrets Manager secrets using the `secretId` parameter. When using Secrets Manager, the following Spark properties can be automatically retrieved if present in the secret:
+ `sfUser` (using key `USERNAME` or `sfUser`)
+ `sfPassword` (using key `PASSWORD` or `sfPassword`, when using basic authentication)
+ `sfWarehouse` (using key `sfWarehouse`)
+ `sfDatabase` (using key `sfDatabase`)
+ `sfSchema` (using key `sfSchema`)
+ `sfRole` (using key `sfRole`)
+ `pem_private_key` (using key `pem_private_key`, when using key-pair authentication)
+ `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET` (when using OAuth authentication)

**Property Precedence Order:** When the same property is specified in multiple locations, AWS Glue uses the following precedence order (highest to lowest):

1. Explicitly provided connection options in your job code

1. Glue connection properties

1. AWS Secrets Manager secret values (when `secretId` is specified)

1. Snowflake user defaults
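As an illustration of this precedence order (not AWS Glue's actual resolution code), higher-priority sources behave like later entries in a Python dict merge; all names and values below are hypothetical:

```python
# Hypothetical values for each configuration source, lowest priority first.
user_defaults   = {"sfWarehouse": "DEFAULT_WH", "sfRole": "PUBLIC"}
secret_values   = {"sfWarehouse": "ETL_WH", "sfUser": "glue_user"}
glue_connection = {"sfUrl": "myaccount.snowflakecomputing.com"}
job_options     = {"sfWarehouse": "ANALYTICS_WH"}  # explicit options win

# Later (higher-priority) dicts overwrite earlier ones.
resolved = {**user_defaults, **secret_values, **glue_connection, **job_options}
# The explicitly provided option overrides both the secret value and the
# user default, so resolved["sfWarehouse"] is "ANALYTICS_WH".
```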

The following parameters are used generally when connecting to Snowflake.
+ `sfDatabase` — Required if a user default is not set in Snowflake. Used for Read/Write. The database to use for the session after connecting.
+ `sfSchema` — Required if a user default is not set in Snowflake. Used for Read/Write. The schema to use for the session after connecting.
+ `sfWarehouse` — Required if a user default is not set in Snowflake. Used for Read/Write. The default virtual warehouse to use for the session after connecting.
+ `sfRole` — Required if a user default is not set in Snowflake. Used for Read/Write. The default security role to use for the session after connecting.
+ `sfUrl` — (Required) Used for Read/Write. Specifies the hostname for your account in the following format: `account_identifier.snowflakecomputing.com`. For more information about account identifiers, see [Account Identifiers](https://docs.snowflake.com/en/user-guide/admin-account-identifier) in the Snowflake documentation.
+ `sfUser` — (Required) Used for Read/Write. Login name for the Snowflake user.
+ `sfPassword` — (Required when using basic authentication) Used for Read/Write. Password for the Snowflake user.
+ `dbtable` — Required when working with full tables. Used for Read/Write. The name of the table to be read or the table to which data is written. When reading, all columns and records are retrieved.
+ `pem_private_key` — (Required when using key-pair authentication) Used for Read/Write. An unencrypted b64-encoded private key string. The private key for the Snowflake user. It is common to copy this out of a PEM file. For more information, see [Key-pair authentication and key-pair rotation](https://docs.snowflake.com/en/user-guide/key-pair-auth) in the Snowflake documentation.
+ `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET` — (Required when using OAuth authentication) Used for Read/Write. This value corresponds to the `OAUTH_CLIENT_SECRET`, which can be obtained from the Snowflake security integration configured to enable OAuth-based authentication for your account. For more details, see [Configure Snowflake OAuth for custom clients](https://docs.snowflake.com/en/user-guide/oauth-custom) in the Snowflake documentation.
+ `query` — Required when reading with a query. Used for Read. The exact query (`SELECT` statement) to run.

The following options are used to configure specific behaviors during the process of connecting to Snowflake.
+ `preactions` — Used for Read/Write. Valid Values: Semicolon separated list of SQL statements as String. SQL statements run before data is transferred between AWS Glue and Snowflake. If a statement contains `%s`, the `%s` is replaced with the table name referenced for the operation.
+ `postactions` — Used for Read/Write. SQL statements run after data is transferred between AWS Glue and Snowflake. If a statement contains `%s`, the `%s` is replaced with the table name referenced for the operation.
+ `autopushdown` — Default: `"on"`. Valid Values: `"on"`, `"off"`. This parameter controls whether automatic query pushdown is enabled. If pushdown is enabled, then when a query is run on Spark, if part of the query can be "pushed down" to the Snowflake server, it is pushed down. This improves performance of some queries. For information about whether your query can be pushed down, consult [Pushdown](https://docs.snowflake.com/en/user-guide/spark-connector-use#pushdown) in the Snowflake documentation.
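The `%s` substitution described for `preactions` and `postactions` can be sketched in Python; the statements and table name below are illustrative, and this is not Glue's internal implementation:

```python
# A hypothetical preactions string; "%s" stands in for the target table.
preactions = "CREATE TABLE IF NOT EXISTS %s (id INT); TRUNCATE TABLE %s"
table = "analytics.sales_staging"

# Each semicolon-separated statement has %s replaced with the table name
# referenced by the read or write operation.
statements = [stmt.strip().replace("%s", table)
              for stmt in preactions.split(";")]
# statements[1] == "TRUNCATE TABLE analytics.sales_staging"
```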

Additionally, some of the options available on the Snowflake Spark connector may be supported in AWS Glue. For more information about options available on the Snowflake Spark connector, see [Setting Configuration Options for the Connector](https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector) in the Snowflake documentation. 

## Snowflake authentication methods
<a name="aws-glue-programming-etl-connect-snowflake-authentication"></a>

AWS Glue supports the following authentication methods for connecting to Snowflake:
+ **Basic authentication:** Provide `sfUser` and `sfPassword` parameters.
+ **Key-pair authentication:** Provide `sfUser` and `pem_private_key` parameters. When using key-pair authentication, the `sfPassword` parameter is not required.
+ **OAuth authentication:** The Snowflake Connector supports the `AUTHORIZATION_CODE` grant type to request access to your Snowflake data. This grant type is referred to as “3-legged OAuth”, as it involves redirecting users to a third-party authorization server where they can authenticate and approve access. This method is used when creating a connection through the AWS Glue Console.
  + **Prerequisite:** To use this authentication method, ensure the following setup is complete: 
    + **Configure Snowflake OAuth for a custom client** by following the official Snowflake documentation: [Configure Snowflake OAuth for custom clients.](https://docs.snowflake.com/en/user-guide/oauth-custom) 
    + **Set the correct redirect URI** when creating the Snowflake security integration. For example: If you are creating the connection in the DUB (eu-west-1) region, your redirect URI should be: `https://eu-west-1.console.aws.amazon.com/gluestudio/oauth` 
    + After creating the security integration, retain the following information for use when creating the Glue connection: 
      + `OAUTH_CLIENT_ID`: This value should be provided as **User Managed Client Application Client ID** on the Glue connection creation page.
      + `OAUTH_CLIENT_SECRET`: This value should be stored in the AWS Secret used for the connection, under the key `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET`.
  + OAuth Scopes — (Optional) Defines the specific permissions or levels of access requested from the Snowflake account. For example, a scope might limit access to a particular resource or operation.
    + This value can be specified in the following format: `session:role:Snowflake_Role_Name`
    + Example: `session:role:ANALYST_ROLE`
  + Authorization Code URL — (Required) The endpoint where the user is redirected to log in and grant authorization.
    + Example: `https://host/oauth/authorize`
  + Authorization Token URL — (Required) The endpoint used to exchange the authorization code for an access token.
    + Example: `https://host/oauth/token-request`
  + User Managed Client Application Client Id — (Required) The unique identifier for your registered OAuth client application in Snowflake
  + AWS Secret — (Required) Refers to an AWS Secrets Manager secret containing the following key-value pairs:
    + `sfUser` - The Snowflake username
    + `USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET` - The client secret associated with the OAuth client application

All three authentication methods are fully supported and can be configured using any combination of connection options, Glue connections, or AWS Secrets Manager secrets.

## Snowflake connector limitations
<a name="aws-glue-programming-etl-connect-snowflake-limitations"></a>

Connecting to Snowflake with AWS Glue for Spark is subject to the following limitations. 
+ This connector does not support job bookmarks. For more information about job bookmarks, see [Tracking processed data using job bookmarks](monitor-continuations.md).
+ This connector does not support Snowflake reads and writes through tables in the AWS Glue Data Catalog using the `create_dynamic_frame.from_catalog` and `write_dynamic_frame.from_catalog` methods.
+ This connector supports basic authentication, key-pair authentication, and OAuth authentication. Other authentication methods (such as SAML) are not currently supported.
+ This connector is not supported within streaming jobs.
+ This connector supports `SELECT` statement based queries when retrieving information (such as with the `query` parameter). Other kinds of queries (such as `SHOW`, `DESC`, or DML statements) are not supported.
+ Snowflake limits the size of query text (i.e. SQL statements) submitted through Snowflake clients to 1 MB per statement. For more details, see [Limits on Query Text Size](https://docs.snowflake.com/en/user-guide/query-size-limits).
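To stay under this limit, a job could check the query text size before submitting it; this is a defensive sketch, where the 1 MB figure comes from the limit above and everything else is illustrative:

```python
# Snowflake rejects query text over 1 MB per statement.
MAX_QUERY_BYTES = 1_048_576

def check_query_size(query: str) -> int:
    """Return the UTF-8 size of the query text, raising if over the limit."""
    size = len(query.encode("utf-8"))
    if size > MAX_QUERY_BYTES:
        raise ValueError(
            f"query text is {size} bytes; Snowflake's limit is {MAX_QUERY_BYTES}")
    return size

check_query_size("select * from sales where store='1'")  # well under the limit
```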

# Teradata Vantage connections
<a name="aws-glue-programming-etl-connect-teradata-home"></a>

You can use AWS Glue for Spark to read from and write to existing tables in Teradata Vantage in AWS Glue 4.0 and later versions. You can define what to read from Teradata with a SQL query. You can connect to Teradata using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Teradata, consult the [Teradata documentation](https://docs.teradata.com/).

## Configuring Teradata connections
<a name="aws-glue-programming-etl-connect-teradata-configure"></a>

To connect to Teradata from AWS Glue, you will need to create and store your Teradata credentials in an AWS Secrets Manager secret, then associate that secret with an AWS Glue Teradata connection. If your Teradata instance is in an Amazon VPC, you will also need to provide networking options to your AWS Glue Teradata connection.

To connect to Teradata from AWS Glue, you may need some prerequisites:
+ If you are accessing your Teradata environment through Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the Teradata environment. We discourage accessing the Teradata environment over the public internet.

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your Teradata instance and this location. Your job will need to establish a TCP connection with your Teradata client port. For more information about Teradata ports, see the [Teradata documentation](https://docs.teradata.com/r/Teradata-VantageTM-on-AWS-DIY-Installation-and-Administration-Guide/April-2020/Before-Deploying-Vantage-on-AWS-DIY/Security-Groups-and-Ports).

  Based on your network layout, secure VPC connectivity may require changes in Amazon VPC and other networking services. For more information about AWS connectivity, consult [AWS Connectivity Options](https://docs.teradata.com/r/Teradata-VantageCloud-Enterprise/Get-Started/Connecting-Your-Environment/AWS-Connectivity-Options) in the Teradata documentation.

**To configure a AWS Glue Teradata connection:**

1. In your Teradata configuration, identify or create a user and password AWS Glue will connect with, *teradataUser* and *teradataPassword*. For more information, consult [Vantage Security Overview](https://docs.teradata.com/r/Configuring-Teradata-VantageTM-After-Installation/January-2021/Security-Overview/Vantage-Security-Overview) in the Teradata documentation.

1. In AWS Secrets Manager, create a secret using your Teradata credentials. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `user` with the value *teradataUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password` with the value *teradataPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Teradata.
   + When providing a **JDBC URL**, provide the URL for your instance. You can also hardcode certain comma-separated connection parameters in your JDBC URL. The URL must conform to the following format: `jdbc:teradata://teradataHostname/ParameterName=ParameterValue,ParameterName=ParameterValue`

     Supported URL parameters include:
     + `DATABASE` – the name of the database on the host to access by default.
     + `DBS_PORT` – the database port; used when running on a nonstandard port.
   + When selecting a **Credential type**, select **AWS Secrets Manager**, then set **AWS Secret** to *secretName*.

1. In the following situations, you may require additional configuration:
   + For Teradata instances hosted on AWS in an Amazon VPC, you will need to provide Amazon VPC connection information to the AWS Glue connection that defines your Teradata security credentials. When creating or updating your connection, set **VPC**, **Subnet**, and **Security groups** in **Network options**.

After creating an AWS Glue Teradata connection, you will need to perform the following steps before calling your connection method.
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
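The secret created in step 2 and the secret-read permission granted to the job role can be sketched together. This is a minimal sketch, not part of the AWS Glue API; the secret name, credential values, Region, and account ID are placeholders:

```python
import json

# Hypothetical credential values -- replace with your own.
secret_string = json.dumps({"user": "teradataUsername", "password": "teradataPassword"})

# With boto3 (call not executed here), the secret could be created as:
#   boto3.client("secretsmanager").create_secret(Name="secretName", SecretString=secret_string)

# A minimal IAM policy statement letting the job role read the secret;
# the ARN is a placeholder, not a real resource.
secret_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:secretName-AbCdEf",
        }
    ],
}
print(json.dumps(secret_read_policy, indent=2))
```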

## Reading from Teradata
<a name="aws-glue-programming-etl-connect-teradata-read"></a>

**Prerequisites:**
+ A Teradata table you would like to read from. You will need the table name, *tableName*.
+ An AWS Glue Teradata connection configured to provide auth information. Complete the steps in *To configure an AWS Glue Teradata connection* to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 

For example: 

```
teradata_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="teradata",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName"
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`.

For example:

```
teradata_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="teradata",
    connection_options={
        "connectionName": "connectionName",
        "query": "query"
    }
)
```

## Writing to Teradata tables
<a name="aws-glue-programming-etl-connect-teradata-write"></a>

**Prerequisites:** A Teradata table you would like to write to, *tableName*. **You must create the table before calling the connection method.**

For example, this writes an existing DynamicFrame, *dynamicFrame*, to the table:

```
teradata_write = glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="teradata",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName"
    }
)
```

## Teradata connection option reference
<a name="aws-glue-programming-etl-connect-teradata-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Teradata connection configured to provide auth and networking information to your connection method.
+ `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. The name of a table your connection method will interact with.
+ `query` — Used for Read. A SELECT SQL query defining what should be retrieved when reading from Teradata.
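As the reference notes, a read takes either `dbtable` or `query`, but not both. An illustrative helper (not part of the AWS Glue API) that assembles `connection_options` under that rule:

```python
def teradata_read_options(connection_name, dbtable=None, query=None):
    # Illustrative only: enforce that exactly one of dbtable/query is given,
    # mirroring the option reference above.
    if (dbtable is None) == (query is None):
        raise ValueError("Provide exactly one of dbtable or query")
    opts = {"connectionName": connection_name}
    if dbtable is not None:
        opts["dbtable"] = dbtable
    else:
        opts["query"] = query
    return opts

opts = teradata_read_options("connectionName", dbtable="tableName")
```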

# Teradata Vantage NOS connections
<a name="connecting-to-data-teradata-nos"></a>

 The Teradata NOS (Native Object Store) connection is a connection for Teradata Vantage that leverages the Teradata WRITE\_NOS query to read from existing tables and the READ\_NOS query to write to tables. These queries use Amazon S3 as a staging directory, so the Teradata NOS connector is faster than the existing JDBC-based Teradata connector, especially when handling large amounts of data. 

 You can use the Teradata NOS connection in AWS Glue for Spark to read from and write to existing tables in Teradata Vantage in AWS Glue 5.0 and later versions. You can define what to read from Teradata with a SQL query. You can connect to Teradata using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection. 

 For more information about Teradata, consult the [Teradata documentation](https://docs.teradata.com/). 

**Topics**
+ [Creating a Teradata NOS connection](#creating-teradata-nos-connection)
+ [Reading from Teradata tables](#reading-from-teradata-nos-tables)
+ [Writing to Teradata tables](#writing-to-teradata-nos-tables)
+ [Teradata connection option reference](#teradata-nos-connection-option-reference)
+ [Provide Options in AWS Glue Visual ETL UI](#teradata-nos-connection-option-visual-etl-ui)

## Creating a Teradata NOS connection
<a name="creating-teradata-nos-connection"></a>

To connect to Teradata NOS from AWS Glue, you will need to create and store your Teradata credentials in an AWS Secrets Manager secret, then associate that secret with an AWS Glue Teradata NOS connection. If your Teradata instance is in an Amazon VPC, you will also need to provide networking options to your AWS Glue Teradata NOS connection. 

 **Prerequisites**: 
+  If you are accessing your Teradata environment through Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the Teradata environment. We discourage accessing the Teradata environment over the public internet. 
+  In Amazon VPC, identify or create a VPC, Subnet and Security group that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your Teradata instance and this location. Your job will need to establish a TCP connection with your Teradata client port. For more information about Teradata ports, see the [ Security Groups for Teradata Vantage ](https://docs.teradata.com/r/Teradata-VantageTM-on-AWS-DIY-Installation-and-Administration-Guide/April-2020/Before-Deploying-Vantage-on-AWS-DIY/Security-Groups-and-Ports). 
+  Based on your network layout, secure VPC connectivity may require changes in Amazon VPC and other networking services. For more information about AWS connectivity, see [AWS Connectivity Options ](https://docs.teradata.com/r/Teradata-VantageCloud-Enterprise/Get-Started/Connecting-Your-Environment/AWS-Connectivity-Options) in the Teradata documentation. 

### To configure an AWS Glue Teradata NOS connection:
<a name="creating-teradata-nos-connection-procedure"></a>

1. In your Teradata configuration, identify or create a username and password that AWS Glue will connect with, *teradataUsername* and *teradataPassword*. For more information, see [Vantage Security Overview](https://docs.teradata.com/r/Configuring-Teradata-VantageTM-After-Installation/January-2021/Security-Overview/Vantage-Security-Overview) in the Teradata documentation.

1. In AWS Secrets Manager, create a secret using your Teradata credentials. To create a secret in AWS Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `USERNAME` with the value *teradataUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `PASSWORD` with the value *teradataPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](https://docs.aws.amazon.com/glue/latest/dg/console-connections.html). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Teradata Vantage NOS.
   + When providing a **JDBC URL**, provide the URL for your instance. You can also hardcode certain comma-separated connection parameters in your JDBC URL. The URL must conform to the following format: `jdbc:teradata://teradataHostname/ParameterName=ParameterValue,ParameterName=ParameterValue`.
   + Supported URL parameters include:
     + `DATABASE` – the name of the database on the host to access by default.
     + `DBS_PORT` – the database port; used when running on a nonstandard port.
   + When selecting a **Credential type**, select **AWS Secrets Manager**, then set **AWS Secret** to *secretName*.

1.  In the following situations, you may require additional configuration: 
   +  For Teradata instances hosted on AWS in an Amazon VPC, you will need to provide Amazon VPC connection information to the AWS Glue connection that defines your Teradata security credentials. When creating or updating your connection, set **VPC**, **Subnet**, and **Security groups** in **Network options**. 

 After creating an AWS Glue Teradata Vantage NOS connection, you will need to perform the following steps before calling your connection method. 

1.  Grant the IAM role associated with your AWS Glue job permission to read *secretName*. 

1. In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection** under **Connections**.
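The JDBC URL format shown above can be assembled programmatically. A sketch; the hostname and parameter values are placeholders:

```python
def teradata_jdbc_url(hostname, **params):
    # Builds jdbc:teradata://host/Name=Value,Name=Value per the format above.
    suffix = ",".join(f"{k}={v}" for k, v in params.items())
    return f"jdbc:teradata://{hostname}" + (f"/{suffix}" if suffix else "")

url = teradata_jdbc_url("teradataHostname", DATABASE="example_db", DBS_PORT="1025")
```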

## Reading from Teradata tables
<a name="reading-from-teradata-nos-tables"></a>

### Prerequisites:
<a name="w2aac67c11c24b8c41c17b3"></a>
+  A Teradata table you would like to read from. You will need the table name, *tableName*. 
+ The Teradata environment has write access to the Amazon S3 path specified by the `staging_fs_url` option, *stagingFsUrl*.
+ The IAM role associated with your AWS Glue job has write access to the Amazon S3 location specified by the `staging_fs_url` option.
+ An AWS Glue Teradata NOS connection configured to provide auth information. Complete the steps in [To configure an AWS Glue Teradata NOS connection](#creating-teradata-nos-connection-procedure) to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

 Example: 

```
teradata_read_table = glueContext.create_dynamic_frame.from_options(
    connection_type="teradatanos",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "staging_fs_url": "stagingFsUrl"
    }
)
```

 You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame. You will need to configure `query`. If you configure both `dbtable` and `query`, the connector fails to read the data. For example: 

```
teradata_read_query = glueContext.create_dynamic_frame.from_options(
    connection_type="teradatanos",
    connection_options={
        "connectionName": "connectionName",
        "query": "query",
        "staging_fs_url": "stagingFsUrl"
    }
)
```

 Additionally, you can use Spark DataFrame API to read from Teradata tables. For example: 

```
options = {
    "url": "JDBC_URL",
    "dbtable": "tableName",
    "user": "teradataUsername", # or use "username" as key here
    "password": "teradataPassword",
    "staging_fs_url": "stagingFsUrl"
}
teradata_read_table = spark.read.format("teradatanos").options(**options).load()
```

## Writing to Teradata tables
<a name="writing-to-teradata-nos-tables"></a>

### Prerequisites
<a name="writing-to-teradata-nos-tables-prerequisites"></a>
+ A Teradata table you would like to write to: *tableName*.
+ The Teradata environment has read access to the Amazon S3 location specified by the `staging_fs_url` option, *stagingFsUrl*.
+ The IAM role associated with your AWS Glue job has write access to the Amazon S3 location specified by the `staging_fs_url` option.
+ An AWS Glue Teradata connection configured to provide auth information. Complete the steps in [To configure an AWS Glue Teradata NOS connection](#creating-teradata-nos-connection-procedure) to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*.

For example:

```
teradata_write = glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="teradatanos",
    connection_options={
        "connectionName": "connectionName",
        "dbtable": "tableName",
        "staging_fs_url": "stagingFsUrl"
    }
)
```

## Teradata connection option reference
<a name="teradata-nos-connection-option-reference"></a>

 **Connection and Operation Options:** 
+  `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Teradata connection configured to provide auth and networking information to your connection method. 
+ `staging_fs_url` — Required. Used for Read/Write. A writable location in Amazon S3, used for unloaded data when reading from Teradata and for Parquet data to be loaded into Teradata when writing. The S3 bucket must be in the same AWS Region as your AWS Glue job.
+  `dbtable` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. The name of a table your connection method will interact with. 
+ `query` — Used for Read. A SELECT SQL query defining what should be retrieved when reading from Teradata. You cannot use this option together with the `dbtable` option.
+  `clean_staging_s3_dir` — Optional. Used for Read/Write. If true, clean up the staging Amazon S3 objects after a read or a write. The default value is true. 
+  `pre_actions` — Optional. Used for Write. Semicolon-separated list of SQL commands that are executed before data is transferred between Spark and Teradata Vantage. 
+  `post_actions` — Optional. Used for Write. Semicolon-separated list of SQL commands that are executed after data is transferred between Spark and Teradata Vantage. 
+  `truncate` — Optional. Used for Write. If true, the connector truncates the table when writing in overwrite mode. If false, the connector drops the table when writing in overwrite mode. The default value is false. 
+ `create_table_script` — Optional. Used for Write. A SQL statement used to create the table when writing to Teradata Vantage. This is useful when you want to create a table with custom metadata (for example, a CREATE MULTISET or SET table, or a changed primary index). The table name used in the create table script must match the table name specified in the `dbtable` option.
+  `partition_size_in_mb` — Optional. Used for Read. Maximum size of a Spark partition in megabytes while reading staging Amazon S3 objects. The default value is 128. 
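To illustrate `partition_size_in_mb`, a rough sketch of how the staged data size maps to a Spark partition count (illustrative arithmetic, not connector code):

```python
import math

def estimated_partition_count(staged_size_mb, partition_size_in_mb=128):
    # With the default of 128 MB, a 1000 MB staged dataset yields 8 partitions.
    return math.ceil(staged_size_mb / partition_size_in_mb)
```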

 You can provide advanced options when creating a Teradata node. These options are the same as those available when programming AWS Glue for Spark scripts.

 See [Teradata Vantage connections](aws-glue-programming-etl-connect-teradata-home.md). 

 **Authorization Options:** 

 Below are options used to provide AWS account credentials that the connector uses to access the staging Amazon S3 bucket. You can choose to (1) not provide any authorization options, and use temporary credentials generated from your AWS Glue execution role; (2) provide an authorization object, `auth_object`, that you created; or (3) provide `aws_access_key_id` and `aws_secret_access_key` if using long-term credentials, or `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token` if using temporary credentials. 
+  `auth_object` – Optional. Used for accessing the staging Amazon S3 bucket. An authorization object string created in Teradata instance. If provided, the connector will use this authorization object to access the staging Amazon S3 bucket. If not provided, and `aws_access_key_id` and `aws_secret_access_key` are also not provided, a temporary credential will be retrieved from AWS Glue execution role and used by the connector. The AWS account associated with this authorization object must be in the same region as your AWS Glue jobs and your staging Amazon S3 bucket or configured with cross account trust. 
+  `aws_access_key_id` – Optional. Used for accessing the staging Amazon S3 bucket. Part of an AWS account security credential. If `auth_object` is not provided, and `aws_access_key_id` is provided with `aws_secret_access_key`, the connector will use them to access staging Amazon S3 bucket. The AWS account associated with this access key must be in the same region as your AWS Glue jobs and your staging Amazon S3 bucket or configured with cross account trust. 
+  `aws_secret_access_key` – Optional. Used for accessing the staging Amazon S3 bucket. Part of an AWS account security credential. If `auth_object` is not provided, and `aws_secret_access_key` is provided with `aws_access_key_id` , the connector will use them to access staging Amazon S3 bucket. The AWS account associated with this secret key must be in the same region as your AWS Glue jobs and your staging Amazon S3 bucket or configured with cross account trust. 
+  `aws_session_token` – Optional. Used for accessing the staging Amazon S3 bucket. Part of a temporary AWS account security credential. Should be provided with `aws_access_key_id` and `aws_secret_access_key`. 
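The three mutually exclusive credential styles above can be written as option fragments. A sketch in which every value is a placeholder:

```python
# (1) No authorization options: the connector falls back to temporary
#     credentials from the AWS Glue execution role.
role_based_auth = {}

# (2) An authorization object created in the Teradata instance (hypothetical name).
auth_object_auth = {"auth_object": "MyAuthObject"}

# (3) Long-term credentials, or temporary credentials with a session token.
long_term_auth = {
    "aws_access_key_id": "AKIAEXAMPLE",
    "aws_secret_access_key": "exampleSecretKey",
}
temporary_auth = {**long_term_auth, "aws_session_token": "exampleSessionToken"}
```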

## Provide Options in AWS Glue Visual ETL UI
<a name="teradata-nos-connection-option-visual-etl-ui"></a>

 You can provide all of the above options in your visual ETL job UI. For the `connectionName` option, choose it from the Teradata Vantage NOS connection drop-down list. Provide all other options through the custom Teradata Vantage NOS properties as key-value pairs. 

![\[The window pane displays the Teradata NOS Vantage connection is selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/teradata-nos-vantage-connection-options.png)


# Vertica connections
<a name="aws-glue-programming-etl-connect-vertica-home"></a>

You can use AWS Glue for Spark to read from and write to tables in Vertica in AWS Glue 4.0 and later versions. You can define what to read from Vertica with a SQL query. You connect to Vertica using username and password credentials stored in AWS Secrets Manager through an AWS Glue connection.

For more information about Vertica, consult the [Vertica documentation](https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/UsingVerticaOnAWS/UsingVerticaOnAWS.htm).

## Configuring Vertica connections
<a name="aws-glue-programming-etl-connect-vertica-configure"></a>

To connect to Vertica from AWS Glue, you will need to create and store your Vertica credentials in an AWS Secrets Manager secret, then associate that secret with a Vertica AWS Glue connection. If your Vertica instance is in an Amazon VPC, you will also need to provide networking options to your AWS Glue Vertica connection. You will also need an Amazon S3 bucket or folder to use for temporary storage when reading from and writing to the database.

To connect to Vertica from AWS Glue, you will need some prerequisites:
+ An Amazon S3 bucket or folder to use for temporary storage when reading from and writing to the database, referred to by *tempS3Path*.
**Note**  
When using Vertica in AWS Glue job data previews, temporary files may not be automatically removed from *tempS3Path*. To ensure the removal of temporary files, directly end the data preview session by choosing **End session** in the **Data preview** pane.  
If you cannot guarantee the data preview session is ended directly, consider setting Amazon S3 Lifecycle configuration to remove old data. We recommend removing data older than 49 hours, based on maximum job runtime plus a margin. For more information about configuring Amazon S3 Lifecycle, see [Managing your storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) in the Amazon S3 documentation.
+ An IAM policy with appropriate permissions to your Amazon S3 path you can associate with your AWS Glue job role.
+ If your Vertica instance is in an Amazon VPC, configure Amazon VPC to allow your AWS Glue job to communicate with the Vertica instance without traffic traversing the public internet. 

  In Amazon VPC, identify or create a **VPC**, **Subnet** and **Security group** that AWS Glue will use while executing the job. Additionally, you need to ensure Amazon VPC is configured to permit network traffic between your Vertica instance and this location. Your job will need to establish a TCP connection with your Vertica client port (default 5433). Based on your network layout, this may require changes to security group rules, network ACLs, NAT gateways, and peering connections.
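The lifecycle cleanup suggested in the note above could look like the following sketch: a dict in the shape that boto3's `put_bucket_lifecycle_configuration` expects. The rule ID and prefix are hypothetical; 49 hours rounds up to 3 days because Lifecycle expiration is day-granular:

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-glue-vertica-temp",
            "Filter": {"Prefix": "glue-vertica-temp/"},
            "Status": "Enabled",
            # 49 hours rounds up to 3 days (expiration is day-granular).
            "Expiration": {"Days": 3},
        }
    ]
}

# With boto3 (call not executed here):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-temp-bucket", LifecycleConfiguration=lifecycle_configuration)
```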

You can then proceed to configure AWS Glue for use with Vertica.

**To configure a connection to Vertica:**

1. In AWS Secrets Manager, create a secret using your Vertica credentials, *verticaUsername* and *verticaPassword*. To create a secret in Secrets Manager, follow the tutorial available in [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com//secretsmanager/latest/userguide/create_secret.html) in the AWS Secrets Manager documentation. After creating the secret, keep the secret name, *secretName*, for the next step.
   + When selecting **Key/value pairs**, create a pair for the key `user` with the value *verticaUsername*.
   + When selecting **Key/value pairs**, create a pair for the key `password` with the value *verticaPassword*.

1. In the AWS Glue console, create a connection by following the steps in [Adding an AWS Glue connection](console-connections.md). After creating the connection, keep the connection name, *connectionName*, for the next step. 
   + When selecting a **Connection type**, select Vertica.
   + When selecting **Vertica Host**, provide the hostname of your Vertica installation.
   + When selecting **Vertica Port**, provide the port your Vertica installation is available through.
   + When selecting an **AWS Secret**, provide *secretName*.

1. In the following situations, you may require additional configuration:
   + For Vertica instances hosted on AWS in an Amazon VPC, provide Amazon VPC connection information to the AWS Glue connection that defines your Vertica security credentials. When creating or updating your connection, set **VPC**, **Subnet**, and **Security groups** in **Network options**.

After creating an AWS Glue Vertica connection, you will need to perform the following steps before calling your connection method.
+ Grant the IAM role associated with your AWS Glue job permissions to *tempS3Path*.
+ Grant the IAM role associated with your AWS Glue job permission to read *secretName*.
+ In your AWS Glue job configuration, provide *connectionName* as an **Additional network connection**.
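Note that *tempS3Path* must be referenced with the `s3a` scheme in the connection options. A small sketch converting a plain S3 URI (the bucket name is a placeholder) into the required form:

```python
# Hypothetical temporary-storage location.
temp_s3_path = "s3://my-glue-temp-bucket/vertica/"

# The Vertica connector's staging_fs_url expects the s3a scheme.
staging_fs_url = "s3a://" + temp_s3_path.removeprefix("s3://")
```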

## Reading from Vertica
<a name="aws-glue-programming-etl-connect-vertica-read"></a>

**Prerequisites:** 
+ A Vertica table you would like to read from. You will need the Vertica database name, *dbName*, and the table name, *tableName*.
+ An AWS Glue Vertica connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to Vertica*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 
+ An Amazon S3 bucket or folder to use for temporary storage, mentioned previously. You will need the name, *tempS3Path*. You will need to connect to this location using the `s3a` protocol.

For example: 

```
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="vertica",
    connection_options={
        "connectionName": "connectionName",
        "staging_fs_url": "s3a://tempS3Path",
        "db": "dbName",
        "table": "tableName",
    }
)
```

You can also provide a SELECT SQL query to filter the results returned to your DynamicFrame or to access a dataset from multiple tables.

For example:

```
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="vertica",
    connection_options={
        "connectionName": "connectionName",
        "staging_fs_url": "s3a://tempS3Path",
        "db": "dbName",
        "query": "SELECT * FROM tableName",
    },
)
```

## Writing to Vertica tables
<a name="aws-glue-programming-etl-connect-vertica-write"></a>

This example writes information from an existing DynamicFrame, *dynamicFrame*, to Vertica. If the table already has information, AWS Glue will append data from your DynamicFrame.

**Prerequisites:** 
+ A current or desired table name, *tableName*, you would like to write to. You will also need the corresponding Vertica database name, *dbName*.
+ An AWS Glue Vertica connection configured to provide auth information. Complete the steps in the previous procedure, *To configure a connection to Vertica*, to configure your auth information. You will need the name of the AWS Glue connection, *connectionName*. 
+ An Amazon S3 bucket or folder to use for temporary storage, mentioned previously. You will need the name, *tempS3Path*. You will need to connect to this location using the `s3a` protocol.

For example: 

```
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="vertica",
    connection_options={
        "connectionName": "connectionName",
        "staging_fs_url": "s3a://tempS3Path",
        "db": "dbName",
        "table": "tableName",
    }
)
```

## Vertica connection option reference
<a name="aws-glue-programming-etl-connect-vertica-reference"></a>
+ `connectionName` — Required. Used for Read/Write. The name of an AWS Glue Vertica connection configured to provide auth and networking information to your connection method.
+ `db` — Required. Used for Read/Write. The name of a database in Vertica your connection method will interact with.
+ `dbSchema` — Required if needed to identify your table. Used for Read/Write. Default: `public`. The name of a schema your connection method will interact with.
+ `table` — Required for writing, required for reading unless `query` is provided. Used for Read/Write. The name of a table your connection method will interact with.
+ `query` — Used for Read. A SELECT SQL query defining what should be retrieved when reading from Vertica.
+ `staging_fs_url` — Required. Used for Read/Write. Valid Values: `s3a` URLs. The URL of an Amazon S3 bucket or folder to use for temporary storage.
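An illustrative validator (not part of the AWS Glue API) for two rules in this reference: `staging_fs_url` must be an `s3a` URL, and a read needs `table` or `query`:

```python
def validate_vertica_read_options(options):
    # Illustrative only: checks the constraints from the option reference above.
    if not options.get("staging_fs_url", "").startswith("s3a://"):
        return False
    return ("table" in options) or ("query" in options)

ok = validate_vertica_read_options({
    "connectionName": "connectionName",
    "staging_fs_url": "s3a://tempS3Path",
    "db": "dbName",
    "table": "tableName",
})
```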

## DataFrame options for ETL in AWS Glue 5.0 for Spark
<a name="aws-glue-programming-etl-connect-dataframe"></a>

A DataFrame is a Dataset organized into named columns similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).

To create a DataFrame for a data source supported by AWS Glue, the following are required:
+ data source connector `ClassName`
+ data source connection `Options`

Similarly, to write a DataFrame to a data sink supported by AWS Glue, the same are required:
+ data sink connector `ClassName`
+ data sink connection `Options`

Note that AWS Glue features such as job bookmarks and DynamicFrame options such as `connectionName` are not supported in DataFrame. For more details about DataFrame and the supported operations, see the Spark documentation for [DataFrame](https://spark.apache.org/docs/3.5.2/api/python/reference/pyspark.sql/dataframe.html).

### Specifying the connector ClassName
<a name="aws-glue-programming-etl-connect-dataframe-classname"></a>

To specify the `ClassName` of a data source/sink, use the `.format` option to provide the corresponding connector `ClassName` that defines the data source/sink.

**JDBC connectors**  
For JDBC connectors, specify `jdbc` as the value of the `.format` option and provide the JDBC driver `ClassName` in the `driver` option.

```
df = spark.read.format("jdbc").option("driver", "<DATA SOURCE JDBC DRIVER CLASSNAME>")...

df.write.format("jdbc").option("driver", "<DATA SINK JDBC DRIVER CLASSNAME>")...
```

The following table lists the JDBC driver `ClassName` of the supported data source in AWS Glue for DataFrames.

| Data source | Driver ClassName | 
| --- |--- |
| PostgreSQL | org.postgresql.Driver | 
| Oracle | oracle.jdbc.driver.OracleDriver | 
| SQLServer | com.microsoft.sqlserver.jdbc.SQLServerDriver | 
| MySQL | com.mysql.jdbc.Driver | 
| SAPHana | com.sap.db.jdbc.Driver | 
| Teradata | com.teradata.jdbc.TeraDriver | 

**Spark connectors**  
For spark connectors, specify the `ClassName` of the connector as the value of the `.format` option.

```
df = spark.read.format("<DATA SOURCE CONNECTOR CLASSNAME>")...

df.write.format("<DATA SINK CONNECTOR CLASSNAME>")...
```

The following table lists the Spark connector `ClassName` of the supported data source in AWS Glue for DataFrames.

| Data source | ClassName | 
| --- |--- |
| MongoDB/DocumentDB | glue.spark.mongodb | 
| Redshift | io.github.spark\_redshift\_community.spark.redshift | 
| AzureCosmos | cosmos.oltp | 
| AzureSQL | com.microsoft.sqlserver.jdbc.spark | 
| BigQuery | com.google.cloud.spark.bigquery | 
| OpenSearch | org.opensearch.spark.sql | 
| Snowflake | net.snowflake.spark.snowflake | 
| Vertica | com.vertica.spark.datasource.VerticaSource | 

### Specifying the connection Options
<a name="aws-glue-programming-etl-connect-dataframe-connection-options"></a>

To specify the `Options` of the connection to a data source/sink, use the `.option(<KEY>, <VALUE>)` to provide individual options or `.options(<MAP>)` to provide multiple options as a key-value map.

Each data source/sink supports its own set of connection `Options`. For details on the available `Options`, refer to the public documentation of the specific data source/sink Spark connector listed in the following table.
+ [JDBC](https://spark.apache.org/docs/3.5.2/sql-data-sources-jdbc.html)
+ [MongoDB/DocumentDB](https://www.mongodb.com/docs/spark-connector/v10.4/)
+ [Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html)
+ [AzureCosmos](https://github.com/Azure/azure-cosmosdb-spark)
+ [AzureSQL](https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16)
+ [BigQuery](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example)
+ [OpenSearch](https://github.com/opensearch-project/opensearch-hadoop/blob/main/USER_GUIDE.md#apache-spark)
+ [Snowflake](https://docs.snowflake.com/en/user-guide/spark-connector-use#setting-configuration-options-for-the-connector)
+ [Vertica](https://github.com/vertica/spark-connector)

### Examples
<a name="aws-glue-programming-etl-connect-dataframe-examples"></a>

The following examples read from PostgreSQL and write into Snowflake:

**Python**  
Example:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataSourceClassName = "jdbc"
dataSourceOptions = {
  "driver": "org.postgresql.Driver",
  "url": "<url>",
  "user": "<user>",
  "password": "<password>",
  "dbtable": "<dbtable>",
}

dataframe = spark.read.format(dataSourceClassName).options(**dataSourceOptions).load()

dataSinkClassName = "net.snowflake.spark.snowflake"
dataSinkOptions = {
  "sfUrl": "<url>",
  "sfUsername": "<username>",
  "sfPassword": "<password>",
  "sfDatabase": "<database>",
  "sfSchema": "<schema>",
  "sfWarehouse": "<warehouse>"
}

dataframe.write.format(dataSinkClassName).options(**dataSinkOptions).save()
```

**Scala**  
Example:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val dataSourceClassName = "jdbc"
val dataSourceOptions = Map(
  "driver" -> "org.postgresql.Driver",
  "url" -> "<url>",
  "user" -> "<user>",
  "password" -> "<password>",
  "dbtable" -> "<dbtable>"
)

val dataframe = spark.read.format(dataSourceClassName).options(dataSourceOptions).load()

val dataSinkClassName = "net.snowflake.spark.snowflake"
val dataSinkOptions = Map(
  "sfUrl" -> "<url>",
  "sfUsername" -> "<username>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

dataframe.write.format(dataSinkClassName).options(dataSinkOptions).save()
```

## Custom and AWS Marketplace connectionType values
<a name="aws-glue-programming-etl-connect-market"></a>

The custom and AWS Marketplace `connectionType` values include the following:
+ `"connectionType": "marketplace.athena"`: Designates a connection to an Amazon Athena data store. The connection uses a connector from AWS Marketplace.
+ `"connectionType": "marketplace.spark"`: Designates a connection to an Apache Spark data store. The connection uses a connector from AWS Marketplace.
+ `"connectionType": "marketplace.jdbc"`: Designates a connection to a JDBC data store. The connection uses a connector from AWS Marketplace.
+ `"connectionType": "custom.athena"`: Designates a connection to an Amazon Athena data store. The connection uses a custom connector that you upload to AWS Glue Studio.
+ `"connectionType": "custom.spark"`: Designates a connection to an Apache Spark data store. The connection uses a custom connector that you upload to AWS Glue Studio.
+ `"connectionType": "custom.jdbc"`: Designates a connection to a JDBC data store. The connection uses a custom connector that you upload to AWS Glue Studio.

### Connection options for type custom.jdbc or marketplace.jdbc
<a name="marketplace-jdbc-connect-options"></a>
+ `className` – String, required, driver class name.
+ `connectionName` – String, required, name of the connection that is associated with the connector.
+ `url` – String, required, JDBC URL with placeholders (`${}`) which are used to build the connection to the data source. The placeholder `${secretKey}` is replaced with the secret of the same name in AWS Secrets Manager. Refer to the data store documentation for more information about constructing the URL. 
+ `secretId` or `user/password` – String, required, used to retrieve credentials for the URL. 
+ `dbTable` or `query` – String, required, the table or SQL query to get the data from. You can specify either `dbTable` or `query`, but not both. 
+ `partitionColumn` – String, optional, the name of an integer column that is used for partitioning. This option works only when it's included with `lowerBound`, `upperBound`, and `numPartitions`. This option works the same way as in the Spark SQL JDBC reader. For more information, see [JDBC To Other Databases](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) in the *Apache Spark SQL, DataFrames and Datasets Guide*.

  The `lowerBound` and `upperBound` values are used to decide the partition stride, not for filtering the rows in the table. All rows in the table are partitioned and returned. 
**Note**  
When using a query instead of a table name, you should validate that the query works with the specified partitioning condition. For example:   
If your query format is `"SELECT col1 FROM table1"`, then test the query by appending a `WHERE` clause at the end of the query that uses the partition column. 
If your query format is `"SELECT col1 FROM table1 WHERE col2=val"`, then test the query by extending the `WHERE` clause with `AND` and an expression that uses the partition column.
+ `lowerBound` – Integer, optional, the minimum value of `partitionColumn` that is used to decide partition stride. 
+ `upperBound` – Integer, optional, the maximum value of `partitionColumn` that is used to decide partition stride. 
+ `numPartitions` – Integer, optional, the number of partitions. This value, along with `lowerBound` (inclusive) and `upperBound` (exclusive), form partition strides for generated `WHERE` clause expressions that are used to split the `partitionColumn`. 
**Important**  
Be careful with the number of partitions because too many partitions might cause problems on your external database systems. 
+ `filterPredicate` – String, optional, extra condition clause to filter data from source. For example: 

  ```
  BillingCity='Mountain View'
  ```

  When using a *query* instead of a *table* name, you should validate that the query works with the specified `filterPredicate`. For example: 
  + If your query format is `"SELECT col1 FROM table1"`, then test the query by appending a `WHERE` clause at the end of the query that uses the filter predicate. 
  + If your query format is `"SELECT col1 FROM table1 WHERE col2=val"`, then test the query by extending the `WHERE` clause with `AND` and an expression that uses the filter predicate.
+ `dataTypeMapping` – Dictionary, optional, custom data type mapping that builds a mapping from a JDBC data type to an AWS Glue data type. For example, the option `"dataTypeMapping":{"FLOAT":"STRING"}` maps data fields of JDBC type `FLOAT` into the Java `String` type by calling the `ResultSet.getString()` method of the driver, and uses it to build AWS Glue records. The `ResultSet` object is implemented by each driver, so the behavior is specific to the driver you use. Refer to the documentation for your JDBC driver to understand how the driver performs the conversions. 
+ The currently supported AWS Glue data types are:
  + DATE
  + STRING
  + TIMESTAMP
  + INT
  + FLOAT
  + LONG
  + BIGDECIMAL
  + BYTE
  + SHORT
  + DOUBLE

  The supported JDBC data types are [Java 8 java.sql.Types](https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html).

  The default data type mappings (from JDBC to AWS Glue) are:
  +  DATE -> DATE
  +  VARCHAR -> STRING
  +  CHAR -> STRING
  +  LONGNVARCHAR -> STRING
  +  TIMESTAMP -> TIMESTAMP
  +  INTEGER -> INT
  +  FLOAT -> FLOAT
  +  REAL -> FLOAT
  +  BIT -> BOOLEAN
  +  BOOLEAN -> BOOLEAN
  +  BIGINT -> LONG
  +  DECIMAL -> BIGDECIMAL
  +  NUMERIC -> BIGDECIMAL
  +  TINYINT -> SHORT
  +  SMALLINT -> SHORT
  +  DOUBLE -> DOUBLE

  If you use a custom data type mapping with the option `dataTypeMapping`, then you can override a default data type mapping. Only the JDBC data types listed in the `dataTypeMapping` option are affected; the default mapping is used for all other JDBC data types. You can add mappings for additional JDBC data types if needed. If a JDBC data type is not included in either the default mapping or a custom mapping, then the data type converts to the AWS Glue `STRING` data type by default. 

The following Python code example shows how to read from JDBC databases with AWS Marketplace JDBC drivers. It demonstrates reading from a database and writing to an S3 location. 

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from the JDBC source with a partitioned query and a custom
# data type mapping.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.jdbc",
    connection_options={
        "dataTypeMapping": {"INTEGER": "STRING"},
        "query": "select id, name, department from department where id < 200",
        "partitionColumn": "id",
        "lowerBound": "0",
        "upperBound": "200",
        "numPartitions": "4",
        "connectionName": "test-connection-jdbc",
    },
    transformation_ctx="DataSource0",
)

# Map source columns to target columns.
Transform0 = ApplyMapping.apply(
    frame=DataSource0,
    mappings=[
        ("department", "string", "department", "string"),
        ("name", "string", "name", "string"),
        ("id", "int", "id", "int"),
    ],
    transformation_ctx="Transform0",
)

# Write the transformed data to Amazon S3 as JSON.
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=Transform0,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://<S3 path>/", "partitionKeys": []},
    transformation_ctx="DataSink0",
)
job.commit()
```

### Connection options for type custom.athena or marketplace.athena
<a name="marketplace-athena-connect-options"></a>
+ `className` – String, required, driver class name. When you're using the Athena-CloudWatch connector, this parameter value is the prefix of the class name (for example, `"com.amazonaws.athena.connectors"`). The Athena-CloudWatch connector is composed of two classes: a metadata handler and a record handler. If you supply the common prefix here, then the API loads the correct classes based on that prefix.
+ `tableName` – String, required, the name of the CloudWatch log stream to read. This code snippet uses the special view name `all_log_streams`, which means that the dynamic data frame returned will contain data from all log streams in the log group.
+ `schemaName` – String, required, the name of the CloudWatch log group to read from. For example, `/aws-glue/jobs/output`.
+ `connectionName` – String, required, name of the connection that is associated with the connector.

For additional options for this connector, see the [Amazon Athena CloudWatch Connector README](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch) file on GitHub.

The following Python code example shows how to read from an Athena data store using an AWS Marketplace connector. It demonstrates reading from Athena and writing to an S3 location. 

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read all log streams in the CloudWatch log group through the
# Athena-CloudWatch connector.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.athena",
    connection_options={
        "tableName": "all_log_streams",
        "schemaName": "/aws-glue/jobs/output",
        "connectionName": "test-connection-athena",
    },
    transformation_ctx="DataSource0",
)

# Map source columns to target columns.
Transform0 = ApplyMapping.apply(
    frame=DataSource0,
    mappings=[
        ("department", "string", "department", "string"),
        ("name", "string", "name", "string"),
        ("id", "int", "id", "int"),
    ],
    transformation_ctx="Transform0",
)

# Write the transformed data to Amazon S3 as JSON.
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=Transform0,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://<S3 path>/", "partitionKeys": []},
    transformation_ctx="DataSink0",
)
job.commit()
```

### Connection options for type custom.spark or marketplace.spark
<a name="marketplace-spark-connect-options"></a>
+ `className` – String, required, connector class name. 
+ `secretId` – String, optional, used to retrieve credentials for the connector connection.
+ `connectionName` – String, required, name of the connection that is associated with the connector.
+ Other options depend on the data store. For example, OpenSearch configuration options start with the prefix `es`, as described in the [Elasticsearch for Apache Hadoop](https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html) documentation. Spark connections to Snowflake use options such as `sfUser` and `sfPassword`, as described in [Using the Spark Connector](https://docs.snowflake.com/en/user-guide/spark-connector-use.html) in the *Connecting to Snowflake* guide.

The following Python code example shows how to read from an OpenSearch data store using a `marketplace.spark` connection.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the "test" index from OpenSearch through the Spark connector.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "path": "test",
        "es.nodes": "https://<AWS endpoint>",
        "es.port": "443",
        "es.nodes.wan.only": "true",
        "connectionName": "test-spark-es",
    },
    transformation_ctx="DataSource0",
)

# Write the data to Amazon S3 as JSON.
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=DataSource0,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://<S3 path>/", "partitionKeys": []},
    transformation_ctx="DataSink0",
)
job.commit()
```

## General options
<a name="aws-glue-programming-etl-connect-general-options"></a>

The options in this section are provided as `connection_options`, but do not specifically apply to one connector.

The following parameters are used generally when configuring bookmarks. They may apply to Amazon S3 or JDBC workflows. For more information, see [Using job bookmarks](programming-etl-connect-bookmarks.md).
+ `jobBookmarkKeys` — Array of column names. 
+ `jobBookmarkKeysSortOrder` — String defining how to compare values based on sort order. Valid values: `"asc"`, `"desc"`.
+ `useS3ListImplementation` — Used to manage memory performance when listing Amazon S3 bucket contents. For more information, see [Optimize memory management in AWS Glue](https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/).
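As a hedged sketch, the bookmark-related keys above can be supplied alongside the usual source options. The URL, table, and key column below are placeholders for illustration, not values from this guide:

```python
# Hypothetical connection_options for a JDBC read that uses job bookmarks.
# All connection values are placeholders.
connection_options = {
    "url": "jdbc:postgresql://<host>:5432/<database>",
    "dbtable": "<dbtable>",
    "user": "<user>",
    "password": "<password>",
    # Bookmark configuration: track progress by this column, in ascending order.
    "jobBookmarkKeys": ["<id_column>"],
    "jobBookmarkKeysSortOrder": "asc",
}
```

You would then pass this dictionary as `connection_options` to `glueContext.create_dynamic_frame.from_options`, along with a `transformation_ctx`, which job bookmarks require in order to track state.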

# Data format options for inputs and outputs in AWS Glue for Spark
<a name="aws-glue-programming-etl-format"></a>

These pages offer information about feature support and configuration parameters for data formats supported by AWS Glue for Spark. See the following for a description of the usage and applicability of this information. 

## Feature support across data formats in AWS Glue
<a name="aws-glue-programming-etl-format-features"></a>

Each data format may support different AWS Glue features. The following common features may or may not be supported, depending on your format type. Refer to the documentation for your data format to understand how to use these features to meet your requirements. 


| Feature | Description | 
| --- |--- |
| Read | AWS Glue can recognize and interpret this data format without additional resources, such as connectors. | 
| Write | AWS Glue can write data in this format without additional resources. You can include third-party libraries in your job and use standard Apache Spark functions to write data, as you would in other Spark environments. For more information about including libraries, see [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md). | 
| Streaming read | AWS Glue can recognize and interpret this data format from an Apache Kafka, Amazon Managed Streaming for Apache Kafka, or Amazon Kinesis message stream. We expect streams to present data in a consistent format, so they are read in as DataFrames. | 
| Group small files | AWS Glue can group files together to batch work sent to each node when performing AWS Glue transforms. This can significantly improve performance for workloads involving large amounts of small files. For more information, see [Reading input files in larger groups](grouping-input-files.md).  | 
| Job bookmarks | AWS Glue can track the progress of transforms performing the same work on the same dataset across job runs with job bookmarks. This can improve performance for workloads involving datasets where work only needs to be done on new data since the last job run. For more information, see [Tracking processed data using job bookmarks](monitor-continuations.md). | 

## Parameters used to interact with data formats in AWS Glue
<a name="aws-glue-programming-etl-format-parameters"></a>

Certain AWS Glue connection types support multiple `format` types, requiring you to specify information about your data format with a `format_options` object when using methods like `GlueContext.write_dynamic_frame.from_options`.
+ `s3` – For more information, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can also view the documentation for the methods facilitating this connection type: [create_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) and [write_dynamic_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) in Python and the corresponding Scala methods [def getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) and [def getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat). 
+ `kinesis` – For more information, see Connection types and options for ETL in AWS Glue: [Kinesis connection parameters](aws-glue-programming-etl-connect-kinesis-home.md#aws-glue-programming-etl-connect-kinesis). You can also view the documentation for the method facilitating this connection type: [create_data_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options) and the corresponding Scala method [def createDataFrameFromOptions](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions).
+ `kafka` – For more information, see Connection types and options for ETL in AWS Glue: [Kafka connection parameters](aws-glue-programming-etl-connect-kafka-home.md#aws-glue-programming-etl-connect-kafka). You can also view the documentation for the method facilitating this connection type: [create_data_frame_from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options) and the corresponding Scala method [def createDataFrameFromOptions](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-createDataFrameFromOptions).

Some connection types do not require `format_options`. For example, in normal use, a JDBC connection to a relational database retrieves data in a consistent, tabular data format. Therefore, reading from a JDBC connection would not require `format_options`.

Some methods to read and write data in AWS Glue do not require `format_options`. For example, `GlueContext.create_dynamic_frame.from_catalog` used with AWS Glue crawlers does not. Crawlers determine the shape of your data. When you use crawlers, an AWS Glue classifier examines your data to make smart decisions about how to represent your data format. It then stores a representation of your data in the AWS Glue Data Catalog, which an AWS Glue ETL script can use to retrieve your data with the `GlueContext.create_dynamic_frame.from_catalog` method. Crawlers remove the need to manually specify information about your data format.

For jobs that access AWS Lake Formation governed tables, AWS Glue supports reading and writing all formats supported by Lake Formation governed tables. For the current list of supported formats for AWS Lake Formation governed tables, see [Notes and Restrictions for Governed Tables](https://docs.aws.amazon.com/lake-formation/latest/dg/governed-table-restrictions.html) in the *AWS Lake Formation Developer Guide*.

**Note**  
For writing Apache Parquet, AWS Glue ETL only supports writing to a governed table by specifying an option for a custom Parquet writer type optimized for Dynamic Frames. When writing to a governed table with the `parquet` format, you should add the key `useGlueParquetWriter` with a value of `true` in the table parameters.
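As a minimal sketch of the note above, the key can be supplied through `format_options` on the write; this sketch assumes a DynamicFrame write with `format="parquet"`, and the option name is taken from the note:

```python
# Hedged sketch: format_options enabling the Parquet writer type
# optimized for Dynamic Frames, as described in the note above.
parquet_format_options = {
    "useGlueParquetWriter": True,
}
```

These options would be passed as `format_options` to a write method such as `glueContext.write_dynamic_frame.from_options`; for a governed table, the note above states the key belongs in the table parameters.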

**Topics**
+ [Feature support across data formats in AWS Glue](#aws-glue-programming-etl-format-features)
+ [Parameters used to interact with data formats in AWS Glue](#aws-glue-programming-etl-format-parameters)
+ [Using the CSV format in AWS Glue](aws-glue-programming-etl-format-csv-home.md)
+ [Using the Parquet format in AWS Glue](aws-glue-programming-etl-format-parquet-home.md)
+ [Using the XML format in AWS Glue](aws-glue-programming-etl-format-xml-home.md)
+ [Using the Avro format in AWS Glue](aws-glue-programming-etl-format-avro-home.md)
+ [Using the grokLog format in AWS Glue](aws-glue-programming-etl-format-grokLog-home.md)
+ [Using the Ion format in AWS Glue](aws-glue-programming-etl-format-ion-home.md)
+ [Using the JSON format in AWS Glue](aws-glue-programming-etl-format-json-home.md)
+ [Using the ORC format in AWS Glue](aws-glue-programming-etl-format-orc-home.md)
+ [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md)
+ [Shared configuration reference](#aws-glue-programming-etl-format-shared-reference)

# Using the CSV format in AWS Glue
<a name="aws-glue-programming-etl-format-csv-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces you to the available features for using your data in AWS Glue. 

 AWS Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSVs often don't strictly conform to a standard, but you can refer to [RFC 4180](https://tools.ietf.org/html/rfc4180) and [RFC 7111](https://tools.ietf.org/html/rfc7111) for more information. 

You can use AWS Glue to read CSVs from Amazon S3 and from streaming sources as well as write CSVs to Amazon S3. You can read and write `bzip` and `gzip` archives containing CSV files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page. 

The following table shows which common AWS Glue features support the CSV format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Supported | Supported | 

## Example: Read CSV files or folders from S3
<a name="aws-glue-programming-etl-format-csv-read"></a>

 **Prerequisites:** You will need the S3 paths (`s3path`) to the CSV files or folders that you want to read. 

 **Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets CSV files in your `format_options`. For details, see [CSV Configuration Reference](#aws-glue-programming-etl-format-csv-reference). 

The following AWS Glue ETL script shows the process of reading CSV files or folders from S3.

 We provide a custom CSV reader with performance optimizations for common workflows through the `optimizePerformance` configuration key. To determine if this reader is right for your workload, see [Optimize read performance with vectorized SIMD CSV reader](#aws-glue-programming-etl-format-simd-csv-reader). 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read CSV from S3
# For show, we handle a CSV with a header row.  Set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .format("csv")\
    .option("header", "true")\
    .load("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read CSV from S3
// For show, we handle a CSV with a header row.  Set the withHeader option.
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"withHeader": true}"""),
      connectionType="s3",
      format="csv",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
val dataFrame = spark.read
  .option("header","true")
  .format("csv")
  .load("s3://s3path")
```

------

## Example: Write CSV files and folders to S3
<a name="aws-glue-programming-etl-format-csv-write"></a>

 **Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or a DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`. 

 **Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the writer interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how your operation writes the contents of your files in `format_options`. For details, see [CSV Configuration Reference](#aws-glue-programming-etl-format-csv-reference). The following AWS Glue ETL script shows the process of writing CSV files and folders to S3. 

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write CSV to S3
# For show, customize how we write string type values.  Set quoteChar to -1 so our values are not quoted.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="csv",
    format_options={
        "quoteChar": -1,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write\
    .format("csv")\
    .option("quote", None)\
    .mode("append")\
    .save("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write CSV to S3
// For show, customize how we write string type values. Set quoteChar to -1 so our values are not quoted.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="csv"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write
    .format("csv")
    .option("quote", null)
    .mode("Append")
    .save("s3://s3path")
```

------

## CSV configuration reference
<a name="aws-glue-programming-etl-format-csv-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="csv"`: 
+ `separator` – Specifies the delimiter character. The default is a comma, but any other character can be specified.
  + **Type:** Text, **Default:** `","`
+ `escaper` – Specifies a character to use for escaping. This option is used only when reading CSV files, not writing. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (`\n`, `\r`, `\t`, and `\0`).
  + **Type:** Text, **Default:** none
+ `quoteChar` – Specifies the character to use for quoting. The default is a double quote. Set this to `-1` to turn off quoting entirely.
  + **Type:** Text, **Default:** `'"'`
+ `multiLine` – Specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to `True` if any record spans multiple lines. Enabling `multiLine` might decrease performance because it requires more cautious file-splitting while parsing.
  + **Type:** Boolean, **Default:** `false`
+ `withHeader` – Specifies whether to treat the first line as a header. This option can be used in the `DynamicFrameReader` class.
  + **Type:** Boolean, **Default:** `false`
+ `writeHeader` – Specifies whether to write the header to output. This option can be used in the `DynamicFrameWriter` class.
  + **Type:** Boolean, **Default:** `true`
+ `skipFirst` – Specifies whether to skip the first data line.
  + **Type:** Boolean, **Default:** `false`
+ `optimizePerformance` – Specifies whether to use the advanced SIMD CSV reader along with Apache Arrow–based columnar memory formats. Available only in AWS Glue version 3.0 and later.
  + **Type:** Boolean, **Default:** `false`
+ `strictCheckForQuoting` – When writing CSVs, AWS Glue might add quotes to values that it interprets as strings, to prevent ambiguity in what is written out. To save time when deciding what to write, AWS Glue might quote in certain situations where quotes are not necessary. Enabling a strict check performs a more intensive computation and quotes only when strictly necessary. Available only in AWS Glue version 3.0 and later.
  + **Type:** Boolean, **Default:** `false`
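As a quick illustration of the delimiter and quoting behavior that these options control (AWS Glue libraries aren't importable outside a Glue job, so this sketch uses Python's standard `csv` module, which follows the same conventions):

```
import csv
import io

# Illustrative sketch only, not the Glue reader itself: a quoted field can
# contain the separator character without splitting the record.
sample = 'id|name\n1|"Doe|Jane"\n'
rows = list(csv.reader(io.StringIO(sample), delimiter="|", quotechar='"'))

print(rows[0])  # ['id', 'name']
print(rows[1])  # ['1', 'Doe|Jane']
```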

## Optimize read performance with vectorized SIMD CSV reader
<a name="aws-glue-programming-etl-format-simd-csv-reader"></a>

AWS Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers. 

 The optimized reader:
+ Uses CPU SIMD instructions to read from disk
+ Immediately writes records to memory in a columnar format (Apache Arrow) 
+ Divides the records into batches

This saves processing time when records would be batched or converted to a columnar format later on. Some examples are when changing schemas or retrieving data by column. 

To use the optimized reader, set `"optimizePerformance"` to `true` in the `format_options` or table property.

```
glueContext.create_dynamic_frame.from_options(
    frame = datasource1,
    connection_type = "s3", 
    connection_options = {"paths": ["s3://s3path"]}, 
    format = "csv", 
    format_options={
        "optimizePerformance": True, 
        "separator": ","
        }, 
    transformation_ctx = "datasink2")
```

**Limitations for the vectorized CSV reader**  
Note the following limitations of the vectorized CSV reader:
+ It doesn't support the `multiLine` and `escaper` format options. It uses the default `escaper` of double quote char `'"'`. When these options are set, AWS Glue automatically falls back to using the row-based CSV reader.
+ It doesn't support creating a DynamicFrame with [ChoiceType](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-types.html#aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype). 
+ It doesn't support creating a DynamicFrame with [error records](https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame).
+ It doesn't support reading CSV files with multibyte characters such as Japanese or Chinese characters.

# Using the Parquet format in AWS Glue
<a name="aws-glue-programming-etl-format-parquet-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Parquet data format, this document introduces you to the available features for using your data in AWS Glue. 

AWS Glue supports using the Parquet format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see [Apache Parquet Documentation Overview](https://parquet.apache.org/docs/overview/).

You can use AWS Glue to read Parquet files from Amazon S3 and from streaming sources as well as write Parquet files to Amazon S3. You can read and write `bzip` and `gzip` archives containing Parquet files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the Parquet format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Unsupported | Supported* | 

\* Supported in AWS Glue version 1.0 and later.

## Example: Read Parquet files or folders from S3
<a name="aws-glue-programming-etl-format-parquet-read"></a>

**Prerequisites:** You will need the S3 paths (`s3path`) to the Parquet files or folders that you want to read. 

**Configuration:** In your function options, specify `format="parquet"`. In your `connection_options`, use the `paths` key to specify your `s3path`. 

You can configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

 You can configure how the reader interprets Parquet files in your `format_options`. For details, see [Parquet Configuration Reference](#aws-glue-programming-etl-format-parquet-reference).

The following AWS Glue ETL script shows the process of reading Parquet files or folders from S3: 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read Parquet from S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3", 
    connection_options = {"paths": ["s3://s3path/"]}, 
    format = "parquet"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read.parquet("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) method.

```
// Example: Read Parquet from S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="parquet",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
spark.read.parquet("s3://s3path/")
```

------

## Example: Write Parquet files and folders to S3
<a name="aws-glue-programming-etl-format-parquet-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`. 

**Configuration:** In your function options, specify `format="parquet"`. In your `connection_options`, use the `paths` key to specify `s3path`. 

You can further alter how the writer interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how your operation writes the contents of your files in `format_options`. For details, see [Parquet Configuration Reference](#aws-glue-programming-etl-format-parquet-reference).

The following AWS Glue ETL script shows the process of writing Parquet files and folders to S3. 

We provide a custom Parquet writer with performance optimizations for DynamicFrames, through the `useGlueParquetWriter` configuration key. To determine if this writer is right for your workload, see [Glue Parquet Writer](#aws-glue-programming-etl-format-glue-parquet-writer). 

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write Parquet to S3
# Consider whether useGlueParquetWriter is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://s3path",
    },
    format_options={
        # "useGlueParquetWriter": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
df.write.parquet("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write Parquet to S3
// Consider whether useGlueParquetWriter is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="parquet"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
df.write.parquet("s3://s3path/")
```

------

## Parquet configuration reference
<a name="aws-glue-programming-etl-format-parquet-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="parquet"`: 
+ `useGlueParquetWriter` – Specifies the use of a custom Parquet writer that has performance optimizations for DynamicFrame workflows. For usage details, see [Glue Parquet Writer](#aws-glue-programming-etl-format-glue-parquet-writer). 
  + **Type:** Boolean, **Default:** `false`
+ `compression` – Specifies the compression codec used. Values are fully compatible with `org.apache.parquet.hadoop.metadata.CompressionCodecName`. 
  + **Type:** Enumerated Text, **Default:** `"snappy"`
  + Values: `"uncompressed"`, `"snappy"`, `"gzip"`, and `"lzo"`
+ `blockSize` – Specifies the size in bytes of a row group being buffered in memory. You use this for tuning performance. The size should be an exact number of megabytes.
  + **Type:** Numerical, **Default:** `134217728`
  + The default value is equal to 128 MB.
+ `pageSize` – Specifies the size in bytes of a page. You use this for tuning performance. A page is the smallest unit that must be read fully to access a single record.
  + **Type:** Numerical, **Default:** `1048576`
  + The default value is equal to 1 MB.

**Note**  
Additionally, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the `connection_options` map parameter. For example, you can set a Spark configuration such as [mergeSchema](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging) for the AWS Glue Spark reader to merge the schema for all files.
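The `blockSize` and `pageSize` defaults listed above are raw byte counts; the megabyte arithmetic behind them is:

```
# Defaults from the reference above, expressed as megabyte multiples.
MB = 1024 * 1024
default_block_size = 128 * MB  # row-group buffer: 134217728 bytes
default_page_size = 1 * MB     # page size: 1048576 bytes

print(default_block_size, default_page_size)
```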

## Optimize write performance with AWS Glue Parquet writer
<a name="aws-glue-programming-etl-format-glue-parquet-writer"></a>

**Note**  
The AWS Glue Parquet writer has historically been accessed through the `glueparquet` format type. This access pattern is no longer recommended. Instead, use the `parquet` type with `useGlueParquetWriter` enabled. 

The AWS Glue Parquet writer has performance enhancements that allow faster Parquet file writes. The traditional writer computes a schema before writing. The Parquet format doesn't store the schema in a quickly retrievable fashion, so this might take some time. With the AWS Glue Parquet writer, a pre-computed schema isn't required. The writer computes and modifies the schema dynamically, as data comes in.

Note the following limitations when you specify `useGlueParquetWriter`:
+ The writer supports only schema evolution (such as adding or removing columns), but not changing column types, such as with `ResolveChoice`.
+ The writer doesn't support writing empty DataFrames—for example, to write a schema-only file. When integrating with the AWS Glue Data Catalog by setting `enableUpdateCatalog=True`, attempting to write an empty DataFrame will not update the Data Catalog. This will result in creating a table in the Data Catalog without a schema.

If these limitations don't affect your transform, turning on the AWS Glue Parquet writer should increase performance.

# Using the XML format in AWS Glue
<a name="aws-glue-programming-etl-format-xml-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the XML data format, this document introduces you to the available features for using your data in AWS Glue. 

AWS Glue supports using the XML format. This format represents highly configurable, rigidly defined data structures that aren't row or column based. XML is highly standardized. For an introduction to the format by the standard authority, see [XML Essentials](https://www.w3.org/standards/xml/core). 

You can use AWS Glue to read XML files from Amazon S3, as well as `bzip` and `gzip` archives containing XML files. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page. 

The following table shows which common AWS Glue features support the XML format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Unsupported | Unsupported | Supported | Supported | 

## Example: Read XML from S3
<a name="aws-glue-programming-etl-format-xml-read"></a>

 The XML reader takes an XML tag name. It examines elements with that tag within its input to infer a schema and populates a DynamicFrame with corresponding values. The AWS Glue XML functionality behaves similarly to the [XML Data Source for Apache Spark](https://github.com/databricks/spark-xml). You might be able to gain insight around basic behavior by comparing this reader to that project's documentation. 

**Prerequisites:** You will need the S3 paths (`s3path`) to the XML files or folders that you want to read. You will also need the tag of the XML element that you want to read, `xmlTag`. 

 **Configuration:** In your function options, specify `format="xml"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can further configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). In your `format_options`, use the `rowTag` key to specify `xmlTag`. You can further configure how the reader interprets XML files in your `format_options`. For details, see [XML Configuration Reference](#aws-glue-programming-etl-format-xml-reference).

The following AWS Glue ETL script shows the process of reading XML files or folders from S3. 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read XML from S3
# Set the rowTag option to configure the reader.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="xml",
    format_options={"rowTag": "xmlTag"},
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .format("xml")\
    .option("rowTag", "xmlTag")\
    .load("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read XML from S3
// Set the rowTag option to configure the reader.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"rowTag": "xmlTag"}"""), 
      connectionType="s3", 
      format="xml", 
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
val dataFrame = spark.read
  .option("rowTag", "xmlTag")
  .format("xml")
  .load("s3://s3path")
```

------

## XML configuration reference
<a name="aws-glue-programming-etl-format-xml-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="xml"`:
+ `rowTag` – Specifies the XML tag in the file to treat as a row. Row tags cannot be self-closing.
  + **Type:** Text, **Required**
+ `encoding` – Specifies the character encoding. It can be the name or alias of a [Charset](https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html) supported by our runtime environment. We don't make specific guarantees around encoding support, but major encodings should work. 
  + **Type:** Text, **Default:** `"UTF-8"`
+ `excludeAttribute` – Specifies whether you want to exclude attributes in elements or not.
  + **Type:** Boolean, **Default:** `false`
+ `treatEmptyValuesAsNulls` – Specifies whether to treat white space as a null value.
  + **Type:** Boolean, **Default:** `false`
+ `attributePrefix` – A prefix for attributes to differentiate them from child element text. This prefix is used for field names.
  + **Type:** Text, **Default:** `"_"`
+ `valueTag` – The tag used for a value when there are attributes in the element that have no child.
  + **Type:** Text, **Default:** `"_VALUE"`
+ `ignoreSurroundingSpaces` – Specifies whether the white space that surrounds values should be ignored.
  + **Type:** Boolean, **Default:** `false`
+ `withSchema` – Contains the expected schema, in situations where you want to override the inferred schema. If you don't use this option, AWS Glue infers the schema from the XML data.
  + **Type:** Text, **Default:** Not applicable
  + The value should be a JSON object that represents a `StructType`.

## Manually specify the XML schema
<a name="aws-glue-programming-etl-format-xml-withschema"></a>

**Manual XML schema example**

This is an example of using the `withSchema` format option to specify the schema for XML data.

```
import json

from awsglue.gluetypes import *

schema = StructType([ 
  Field("id", IntegerType()),
  Field("name", StringType()),
  Field("nested", StructType([
    Field("x", IntegerType()),
    Field("y", StringType()),
    Field("z", ChoiceType([IntegerType(), StringType()]))
  ]))
])

datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml",
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx=""
)
```

# Using the Avro format in AWS Glue
<a name="aws-glue-programming-etl-format-avro-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Avro data format, this document introduces you to the available features for using your data in AWS Glue.

AWS Glue supports using the Avro format. This format is a performance-oriented, row-based data format. For an introduction to the format by the standard authority, see [Apache Avro 1.8.2 Documentation](https://avro.apache.org/docs/1.8.2/).

You can use AWS Glue to read Avro files from Amazon S3 and from streaming sources as well as write Avro files to Amazon S3. You can read and write `bzip2` and `gzip` archives containing Avro files from S3. Additionally, you can write `deflate`, `snappy`, and `xz` archives containing Avro files. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page. 

The following table shows which common AWS Glue operations support the Avro format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported* | Unsupported | Supported | 

\* Supported with restrictions. For more information, see [Notes and restrictions for Avro streaming sources](add-job-streaming.md#streaming-avro-notes).

## Example: Read Avro files or folders from S3
<a name="aws-glue-programming-etl-format-avro-read"></a>

**Prerequisites:** You will need the S3 paths (`s3path`) to the Avro files or folders that you want to read. 

**Configuration:** In your function options, specify `format="avro"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in the `connection_options`. For details, see Data format options for ETL inputs and outputs in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets Avro files in your `format_options`. For details, see [Avro Configuration Reference](#aws-glue-programming-etl-format-avro-reference).

The following AWS Glue ETL script shows the process of reading Avro files or folders from S3: 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read Avro from S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="avro"
)
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="avro",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

------

## Example: Write Avro files and folders to S3
<a name="aws-glue-programming-etl-format-avro-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`. 

**Configuration:** In your function options, specify `format="avro"`. In your `connection_options`, use the `paths` key to specify your `s3path`. You can further alter how the writer interacts with S3 in the `connection_options`. For details, see Data format options for ETL inputs and outputs in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can alter how the writer interprets Avro files in your `format_options`. For details, see [Avro Configuration Reference](#aws-glue-programming-etl-format-avro-reference). 

The following AWS Glue ETL script shows the process of writing Avro files or folders to S3.

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write Avro to S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="avro",
    connection_options={
        "path": "s3://s3path"
    }
)
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
      connectionType="s3",
      options=JsonOptions("""{"path": "s3://s3path"}"""),
      format="avro"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

------

## Avro configuration reference
<a name="aws-glue-programming-etl-format-avro-reference"></a>

You can use the following `format_options` values wherever AWS Glue libraries specify `format="avro"`:
+ `version` — Specifies the version of Apache Avro reader/writer format to support. The default is "1.7". You can specify `format_options={"version": "1.8"}` to enable Avro logical type reading and writing. For more information, see the [Apache Avro 1.7.7 Specification](https://avro.apache.org/docs/1.7.7/spec.html) and [Apache Avro 1.8.2 Specification](https://avro.apache.org/docs/1.8.2/spec.html).

  The Apache Avro 1.8 connector supports the following logical type conversions:

For the reader: this table shows the conversion between Avro data type (logical type and Avro primitive type) and AWS Glue `DynamicFrame` data type for Avro reader 1.7 and 1.8.


| Avro Data Type: Logical Type | Avro Data Type: Avro Primitive Type | AWS Glue DynamicFrame Data Type: Avro Reader 1.7 | AWS Glue DynamicFrame Data Type: Avro Reader 1.8 | 
| --- | --- | --- | --- | 
| Decimal | bytes | BINARY | Decimal | 
| Decimal | fixed | BINARY | Decimal | 
| Date | int | INT | Date | 
| Time (millisecond) | int | INT | INT | 
| Time (microsecond) | long | LONG | LONG | 
| Timestamp (millisecond) | long | LONG | Timestamp | 
| Timestamp (microsecond) | long | LONG | LONG | 
| Duration (not a logical type) | fixed of 12 | BINARY | BINARY | 

For the writer: this table shows the conversion between AWS Glue `DynamicFrame` data type and Avro data type for Avro writer 1.7 and 1.8.


| AWS Glue `DynamicFrame` Data Type | Avro Data Type: Avro Writer 1.7 | Avro Data Type: Avro Writer 1.8 | 
| --- | --- | --- | 
| Decimal | String | decimal | 
| Date | String | date | 
| Timestamp | String | timestamp-micros | 
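For example, Avro 1.8's `timestamp-micros` logical type (which the 1.8 writer uses for `Timestamp` values) encodes a timestamp as a long count of microseconds since the Unix epoch, a sketch of which is:

```
from datetime import datetime, timezone

# timestamp-micros: microseconds since 1970-01-01T00:00:00Z, stored as a long.
ts = datetime(2023, 1, 1, tzinfo=timezone.utc)
micros = int(ts.timestamp()) * 1_000_000

print(micros)
```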

## Avro Spark DataFrame support
<a name="aws-glueprogramming-etl-format-avro-dataframe-support"></a>

To use Avro from the Spark DataFrame API, you need to install the Spark Avro plugin for the corresponding Spark version. The version of Spark available in your job is determined by your AWS Glue version. For more information about Spark versions, see [AWS Glue versions](release-notes.md). This plugin is maintained by Apache; we do not make specific guarantees of support.

In AWS Glue 2.0, use version 2.4.3 of the Spark Avro plugin. You can find this JAR on Maven Central; see [org.apache.spark:spark-avro_2.12:2.4.3](https://search.maven.org/artifact/org.apache.spark/spark-avro_2.12/2.4.3/jar).

In AWS Glue 3.0, use version 3.1.1 of the Spark Avro plugin. You can find this JAR on Maven Central; see [org.apache.spark:spark-avro_2.12:3.1.1](https://search.maven.org/artifact/org.apache.spark/spark-avro_2.12/3.1.1/jar).

To include extra JARs in an AWS Glue ETL job, use the `--extra-jars` job parameter. For more information about job parameters, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). You can also configure this parameter in the AWS Management Console.
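For example, the plugin JAR can be supplied through the job's `DefaultArguments` when you create or update the job. The bucket and key below are placeholders; substitute the S3 location where you uploaded the JAR:

```
{
  "--extra-jars": "s3://amzn-s3-demo-bucket/jars/spark-avro_2.12-3.1.1.jar"
}
```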

# Using the grokLog format in AWS Glue
<a name="aws-glue-programming-etl-format-grokLog-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in a loosely structured plaintext format, this document introduces you to the available features for using your data in AWS Glue through Grok patterns.

AWS Glue supports using Grok patterns. Grok patterns are similar to regular expression capture groups. They recognize patterns of character sequences in a plaintext file and give them a type and purpose. In AWS Glue, their primary purpose is to read logs. For an introduction to Grok by its authors, see [Logstash Reference: Grok filter plugin](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html).


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Not Applicable | Supported | Supported | Unsupported | 

## grokLog configuration reference
<a name="aws-glue-programming-etl-format-groklog-reference"></a>

You can use the following `format_options` values with `format="grokLog"`:
+ `logFormat` — Specifies the Grok pattern that matches the log's format.
+ `customPatterns` — Specifies additional Grok patterns used here.
+ `MISSING` — Specifies the signal to use in identifying missing values. The default is `'-'`.
+ `LineCount` — Specifies the number of lines in each log record. The default is `'1'`, and currently only single-line records are supported.
+ `StrictMode` — A Boolean value that specifies whether strict mode is turned on. In strict mode, the reader doesn't do automatic type conversion or recovery. The default value is `"false"`.
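Under the hood, Grok patterns behave like named regular-expression captures. The AWS Glue Grok engine isn't available outside a job, but this toy expansion (using a hypothetical two-entry pattern table) sketches how a `logFormat` such as `%{IP:client} %{WORD:method}` translates into a regex:

```
import re

# Hypothetical miniature pattern table; the real Logstash library defines
# many more patterns (COMMONAPACHELOG, NUMBER, and so on).
PATTERNS = {"IP": r"\d{1,3}(?:\.\d{1,3}){3}", "WORD": r"\w+"}

def grok_to_regex(log_format):
    # Replace each %{NAME:field} with a named capture group.
    return re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: "(?P<{}>{})".format(m.group(2), PATTERNS[m.group(1)]),
        log_format,
    )

match = re.match(grok_to_regex("%{IP:client} %{WORD:method}"), "192.168.0.1 GET")
print(match.groupdict())  # {'client': '192.168.0.1', 'method': 'GET'}
```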

# Using the Ion format in AWS Glue
<a name="aws-glue-programming-etl-format-ion-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Ion data format, this document introduces you to the available features for using your data in AWS Glue.

AWS Glue supports using the Ion format. This format represents data structures (that aren't row or column based) in interchangeable binary and plaintext representations. For an introduction to the format by the authors, see [Amazon Ion](https://amzn.github.io/ion-docs/). (For more information, see the [Amazon Ion Specification](https://amzn.github.io/ion-docs/spec.html).)

You can use AWS Glue to read Ion files from Amazon S3. You can read `bzip` and `gzip` archives containing Ion files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue operations support the Ion format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Unsupported | Unsupported | Supported | Unsupported | 

## Example: Read Ion files and folders from S3
<a name="aws-glue-programming-etl-format-ion-read"></a>

**Prerequisites:** You will need the S3 paths (`s3path`) to the Ion files or folders that you want to read.

**Configuration:** In your function options, specify `format="ion"`. In your `connection_options`, use the `paths` key to specify your `s3path`. You can configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

The following AWS Glue ETL script shows the process of reading Ion files or folders from S3:

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read ION from S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="ion"
)
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read ION from S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="ion",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

------

## Ion configuration reference
<a name="aws-glue-programming-etl-format-ion-reference"></a>

There are no `format_options` values for `format="ion"`.

# Using the JSON format in AWS Glue
<a name="aws-glue-programming-etl-format-json-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the JSON data format, this document introduces you to available features for using your data in AWS Glue.

AWS Glue supports using the JSON format. This format represents data structures with consistent shape but flexible contents that aren't row or column based. JSON is defined by parallel standards issued by several authorities, one of which is ECMA-404. For an introduction to the format by a commonly referenced source, see [Introducing JSON](https://www.json.org/).

You can use AWS Glue to read JSON files from Amazon S3, as well as `bzip` and `gzip` compressed JSON files. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page. 

The following table shows which common AWS Glue operations support the JSON format option.

| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Supported | Supported | 

## Example: Read JSON files or folders from S3
<a name="aws-glue-programming-etl-format-json-read"></a>

**Prerequisites:** You will need the S3 paths (`s3path`) to the JSON files or folders that you want to read.

**Configuration:** In your function options, specify `format="json"`. In your `connection_options`, use the `paths` key to specify your `s3path`. You can further alter how the read operation traverses Amazon S3 in the `connection_options`. For details, see [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets JSON files in your `format_options`. For details, see [JSON Configuration Reference](#aws-glue-programming-etl-format-json-reference).

 The following AWS Glue ETL script shows the process of reading JSON files or folders from S3: 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read JSON from S3
# For show, we handle a nested JSON file that we can limit with the JsonPath parameter
# For show, we also handle a JSON where a single entry spans multiple lines
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "jsonPath": "$.id",
        "multiline": True,
        # "optimizePerformance": True, -> not compatible with jsonPath, multiline
    }
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .option("multiline", "true")\
    .json("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read JSON from S3
// For show, we handle a nested JSON file that we can limit with the JsonPath parameter
// For show, we also handle a JSON where a single entry spans multiple lines
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"jsonPath": "$.id", "multiline": true, "optimizePerformance":false}"""),
      connectionType="s3",
      format="json",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
val dataFrame = spark.read
    .option("multiline", "true")
    .json("s3://s3path")
```

------

## Example: Write JSON files and folders to S3
<a name="aws-glue-programming-etl-format-json-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="json"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can further alter how the writer interacts with S3 in the `connection_options`. For details, see Data format options for ETL inputs and outputs in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the writer interprets JSON files in your `format_options`. For details, see [JSON Configuration Reference](#aws-glue-programming-etl-format-json-reference).

The following AWS Glue ETL script shows the process of writing JSON files or folders to S3:

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write JSON to S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="json"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
df.write.json("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write JSON to S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="json"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
df.write.json("s3://s3path")
```

------

## JSON configuration reference
<a name="aws-glue-programming-etl-format-json-reference"></a>

You can use the following `format_options` values with `format="json"`:
+ `jsonPath` — A [JsonPath](https://github.com/json-path/JsonPath) expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the `id` field of a JSON object.

  ```
  format="json", format_options={"jsonPath": "$.id"}
  ```
+ `multiline` — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to `"true"` if any record spans multiple lines. The default value is `"false"`, which allows for more aggressive file-splitting during parsing.
+ `optimizePerformance` — A Boolean value that specifies whether to use the advanced SIMD JSON reader along with Apache Arrow based columnar memory formats. Only available in AWS Glue 3.0. Not compatible with `multiline` or `jsonPath`. Providing either of those options will instruct AWS Glue to fall back to the standard reader.
+ `withSchema` — A String value that specifies a table schema in the format described in [Manually specify the XML schema](aws-glue-programming-etl-format-xml-home.md#aws-glue-programming-etl-format-xml-withschema). Only used with `optimizePerformance` when reading from non-Catalog connections.
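To illustrate what `jsonPath` selects, here is a minimal pure-Python sketch. The sample document and field values are hypothetical; the Glue reader applies the JsonPath expression itself, emitting one record per match:

```python
import json

# Hypothetical top-level JSON object; "$.id" selects the nested "id" object
# and ignores the sibling "meta" field.
doc = '{"id": {"name": "alice", "age": 30}, "meta": {"source": "test"}}'

# Equivalent of applying the JsonPath expression "$.id" to this one object.
record = json.loads(doc)["id"]
print(record)
```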

## Using vectorized SIMD JSON reader with Apache Arrow columnar format
<a name="aws-glue-programming-etl-format-simd-json-reader"></a>

AWS Glue version 3.0 adds a vectorized reader for JSON data. Under certain conditions, it performs 2x faster than the standard reader. This reader has certain limitations that you should be aware of before use, documented in this section.

To use the optimized reader, set `"optimizePerformance"` to `True` in the `format_options` or table property. You also need to provide `withSchema` unless you read from the catalog. `withSchema` expects an input as described in [Manually specify the XML schema](aws-glue-programming-etl-format-xml-home.md#aws-glue-programming-etl-format-xml-withschema).

```
# Read from S3 data source
glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://s3path"]},
    format = "json",
    format_options={
        "optimizePerformance": True,
        "withSchema": SchemaString
    })

# Read from catalog table
glueContext.create_dynamic_frame.from_catalog(
    database = database,
    table_name = table,
    additional_options = {
        # The vectorized reader for JSON can read your schema from a catalog table property.
        "optimizePerformance": True,
    })
```

For more information about building a *SchemaString* in the AWS Glue library, see [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md).

**Limitations for the vectorized JSON reader**  
Note the following limitations:
+ JSON elements with nested objects or array values are not supported. If provided, AWS Glue will fall back to the standard reader.
+ A schema must be provided, either from the Catalog or with `withSchema`.
+ Not compatible with `multiline` or `jsonPath`. Providing either of those options will instruct AWS Glue to fall back to the standard reader.
+ Providing input records that do not match the input schema will cause the reader to fail.
+ [Error records](https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame) will not be created.
+ JSON files with multi-byte characters (such as Japanese or Chinese characters) are not supported.

# Using the ORC format in AWS Glue
<a name="aws-glue-programming-etl-format-orc-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the ORC data format, this document introduces you to the available features for using your data in AWS Glue.

AWS Glue supports using the ORC format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see [Apache ORC](https://orc.apache.org/docs/).

You can use AWS Glue to read ORC files from Amazon S3 and from streaming sources as well as write ORC files to Amazon S3. You can read and write `bzip` and `gzip` archives containing ORC files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue operations support the ORC format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Unsupported | Supported\* | 

\*Supported in AWS Glue version 1.0\+

## Example: Read ORC files or folders from S3
<a name="aws-glue-programming-etl-format-orc-read"></a>

**Prerequisites:** You will need the S3 paths (`s3path`) to the ORC files or folders that you want to read.

**Configuration:** In your function options, specify `format="orc"`. In your `connection_options`, use the `paths` key to specify your `s3path`. You can configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

 The following AWS Glue ETL script shows the process of reading ORC files or folders from S3: 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="orc"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .orc("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="orc",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
val dataFrame = spark.read
    .orc("s3://s3path")
```

------

## Example: Write ORC files and folders to S3
<a name="aws-glue-programming-etl-format-orc-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`.

**Configuration:** In your function options, specify `format="orc"`. In your connection options, use the `paths` key to specify `s3path`. You can further alter how the writer interacts with S3 in the `connection_options`. For details, see Data format options for ETL inputs and outputs in AWS Glue: [Amazon S3 connection option reference](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). The following code example shows the process: 

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="orc",
    connection_options={
        "path": "s3://s3path"
    }
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
df.write.orc("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
      connectionType="s3",
      options=JsonOptions("""{"path": "s3://s3path"}"""),
      format="orc"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
df.write.orc("s3://s3path/")
```

------

## ORC configuration reference
<a name="aws-glue-programming-etl-format-orc-reference"></a>

There are no `format_options` values for `format="orc"`. However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the `connection_options` map parameter. 
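As a sketch of this pass-through behavior, a standard Spark ORC option such as `compression` might be supplied alongside the output path. The option name and values below are Spark's ORC writer options, not Glue-specific keys, so consult the Spark documentation for the set your Glue version accepts:

```python
# Hypothetical connection_options forwarding a SparkSQL ORC option.
# "compression" is a standard Spark ORC option ("none", "snappy", "zlib", "lzo").
connection_options = {
    "path": "s3://s3path",
    "compression": "snappy",  # passed through to the underlying Spark ORC writer
}
print(connection_options)
```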

# Using data lake frameworks with AWS Glue ETL jobs
<a name="aws-glue-programming-etl-datalake-native-frameworks"></a>

Open-source data lake frameworks simplify incremental data processing for files that you store in data lakes built on Amazon S3. AWS Glue 3.0 and later supports the following open-source data lake frameworks:
+ Apache Hudi
+ Linux Foundation Delta Lake
+ Apache Iceberg

We provide native support for these frameworks so that you can read and write data that you store in Amazon S3 in a transactionally consistent manner. There's no need to install a separate connector or complete extra configuration steps in order to use these frameworks in AWS Glue ETL jobs.

When you manage datasets through the AWS Glue Data Catalog, you can use AWS Glue methods to read and write data lake tables with Spark DataFrames. You can also read and write Amazon S3 data using the Spark DataFrame API.

In this video, you can learn about the basics of how Apache Hudi, Apache Iceberg, and Delta Lake work. You'll see how to insert, update, and delete data in your data lake and how each of these frameworks works.

[![AWS Videos](http://img.youtube.com/vi/fryfx0Zg7KA/0.jpg)](http://www.youtube.com/watch?v=fryfx0Zg7KA)


**Topics**
+ [Limitations](aws-glue-programming-etl-datalake-native-frameworks-limitations.md)
+ [Using the Hudi framework in AWS Glue](aws-glue-programming-etl-format-hudi.md)
+ [Using the Delta Lake framework in AWS Glue](aws-glue-programming-etl-format-delta-lake.md)
+ [Using the Iceberg framework in AWS Glue](aws-glue-programming-etl-format-iceberg.md)

# Limitations
<a name="aws-glue-programming-etl-datalake-native-frameworks-limitations"></a>

Consider the following limitations before you use data lake frameworks with AWS Glue.
+ The following AWS Glue `GlueContext` methods for DynamicFrame don't support reading and writing data lake framework tables. Use the `GlueContext` methods for DataFrame or Spark DataFrame API instead.
  + `create_dynamic_frame.from_catalog`
  + `write_dynamic_frame.from_catalog`
  + `getDynamicFrame`
  + `writeDynamicFrame`
+ The following `GlueContext` methods for DataFrame are supported with Lake Formation permission control:
  + `create_data_frame.from_catalog`
  + `write_data_frame.from_catalog`
  + `getDataFrame`
  + `writeDataFrame`
+ [Grouping small files](grouping-input-files.md) is not supported.
+ [Job bookmarks](monitor-continuations.md) are not supported.
+ Apache Hudi 0.10.1 for AWS Glue 3.0 doesn't support Hudi Merge on Read (MoR) tables.
+ `ALTER TABLE … RENAME TO` is not available for Apache Iceberg 0.13.1 for AWS Glue 3.0.

## Limitations for data lake format tables managed by Lake Formation permissions
<a name="w2aac67c11c24c11c31c17b7"></a>

The data lake formats are integrated with AWS Glue ETL via Lake Formation permissions. Creating a DynamicFrame using `create_dynamic_frame` is not supported. For more information, see the following examples:
+ [Example: Read and write Iceberg table with Lake Formation permission control](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html#aws-glue-programming-etl-format-iceberg-read-write-lake-formation-tables)
+ [Example: Read and write Hudi table with Lake Formation permission control](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-read-write-lake-formation-tables)
+ [Example: Read and write Delta Lake table with Lake Formation permission control](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-delta-lake.html#aws-glue-programming-etl-format-delta-lake-read-write-lake-formation-tables)

**Note**  
The integration with AWS Glue ETL via Lake Formation permissions for Apache Hudi, Apache Iceberg, and Delta Lake is supported only in AWS Glue version 4.0.

Apache Iceberg has the best integration with AWS Glue ETL via Lake Formation permissions. It supports almost all operations and includes SQL support.

Hudi supports most basic operations, with the exception of administrative operations, because those operations are generally performed by writing DataFrames with options specified via `additional_options`. Because Spark SQL is not supported, you need to use the AWS Glue APIs to create DataFrames for your operations.

Delta Lake supports only reading, appending, and overwriting table data. Performing other tasks, such as updates, requires the Delta Lake libraries.

The following features are not available for Iceberg tables managed by Lake Formation permissions.
+ Compaction using AWS Glue ETL
+ Spark SQL support via AWS Glue ETL

The following are limitations of Hudi tables managed by Lake Formation permissions:
+ Removal of orphaned files

The following are limitations of Delta Lake tables managed by Lake Formation permissions:
+ All features other than inserting and reading from Delta Lake tables.

# Using the Hudi framework in AWS Glue
<a name="aws-glue-programming-etl-format-hudi"></a>

AWS Glue 3.0 and later supports Apache Hudi framework for data lakes. Hudi is an open-source data lake storage framework that simplifies incremental data processing and data pipeline development. This topic covers available features for using your data in AWS Glue when you transport or store your data in a Hudi table. To learn more about Hudi, see the official [Apache Hudi documentation](https://hudi.apache.org/docs/overview/). 

You can use AWS Glue to perform read and write operations on Hudi tables in Amazon S3, or work with Hudi tables using the AWS Glue Data Catalog. Additional operations including insert, update, and all of the [Apache Spark operations](https://hudi.apache.org/docs/quick-start-guide/) are also supported.

**Note**  
The Apache Hudi 0.15.0 implementation in AWS Glue 5.0 internally reverts [HUDI-7001](https://github.com/apache/hudi/pull/9936). It does not exhibit the regression related to complex key generation when the record key consists of a single field. However, this behavior differs from OSS Apache Hudi 0.15.0.  
Apache Hudi 0.10.1 for AWS Glue 3.0 doesn't support Hudi Merge on Read (MoR) tables.

The following table lists the Hudi version that is included in each AWS Glue version.


****  

| AWS Glue version | Supported Hudi version | 
| --- | --- | 
| 5.1 | 1.0.2 | 
| 5.0 | 0.15.0 | 
| 4.0 | 0.12.1 | 
| 3.0 | 0.10.1 | 

To learn more about the data lake frameworks that AWS Glue supports, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

## Enabling Hudi
<a name="aws-glue-programming-etl-format-hudi-enable"></a>

To enable Hudi for AWS Glue, complete the following tasks:
+ Specify `hudi` as a value for the `--datalake-formats` job parameter. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).
+ Create a key named `--conf` for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Hudi tables.

  ```
  spark.serializer=org.apache.spark.serializer.KryoSerializer
  ```
+ Lake Formation permission support for Hudi is enabled by default for AWS Glue 4.0. No additional configuration is needed for reading/writing to Lake Formation-registered Hudi tables. To read a registered Hudi table, the AWS Glue job IAM role must have the SELECT permission. To write to a registered Hudi table, the AWS Glue job IAM role must have the SUPER permission. To learn more about managing Lake Formation permissions, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).
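Taken together, the first two tasks above amount to two job parameters. As a sketch, they might appear as follows in a job definition's argument map (the map shape is how parameters are commonly passed when defining a job programmatically; the values come straight from this section):

```python
# Hypothetical default_arguments map for an AWS Glue job definition enabling Hudi.
default_arguments = {
    "--datalake-formats": "hudi",
    "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer",
}
print(default_arguments)
```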

**Using a different Hudi version**

To use a version of Hudi that AWS Glue doesn't support, specify your own Hudi JAR files using the `--extra-jars` job parameter. Do not include `hudi` as a value for the `--datalake-formats` job parameter. If you use AWS Glue 5.0 or later, you must also set the `--user-jars-first` job parameter to `true`.
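As a sketch, the parameters for bringing your own Hudi version might look like the following. The JAR path is a placeholder, and the argument-map shape is how parameters are commonly passed when defining a job programmatically:

```python
# Hypothetical job parameters for a custom Hudi version (AWS Glue 5.0 or later).
# Note that "--datalake-formats" is deliberately absent.
default_arguments = {
    "--extra-jars": "s3://<bucket>/jars/hudi-spark-bundle.jar",  # placeholder path
    "--user-jars-first": "true",
}
print(default_arguments)
```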

## Example: Write a Hudi table to Amazon S3 and register it in the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-hudi-write"></a>

This example script demonstrates how to write a Hudi table to Amazon S3 and register the table to the AWS Glue Data Catalog. The example uses the Hudi [Hive Sync tool](https://hudi.apache.org/docs/syncing_metastore/) to register the table.

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

```
# Example: Create a Hudi table from a DataFrame 
# and register the table to Glue Data Catalog

additional_options={
    "hoodie.table.name": "<your_table_name>",
    "hoodie.database.name": "<your_database_name>",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
    "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
    "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "<your_database_name>",
    "hoodie.datasource.hive_sync.table": "<your_table_name>",
    "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": "s3://<s3Path/>"
}

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save()
```

------
#### [ Scala ]

```
// Example: Create a Hudi table from a DataFrame
// and register the table to Glue Data Catalog

val additionalOptions = Map(
  "hoodie.table.name" -> "<your_table_name>",
  "hoodie.database.name" -> "<your_database_name>",
  "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
  "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
  "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
  "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
  "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.mode" -> "hms",
  "path" -> "s3://<s3Path/>")

dataFrame.write.format("hudi")
  .options(additionalOptions)
  .mode("append")
  .save()
```

------

## Example: Read a Hudi table from Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-hudi-read"></a>

This example reads the Hudi table that you created in the [Example: Write a Hudi table to Amazon S3 and register it in the AWS Glue Data Catalog](#aws-glue-programming-etl-format-hudi-write) from Amazon S3.

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the `GlueContext.create_data_frame.from_catalog()` method.

```
# Example: Read a Hudi table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

dataFrame = glueContext.create_data_frame.from_catalog(
    database = "<your_database_name>",
    table_name = "<your_table_name>"
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) method.

```
// Example: Read a Hudi table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dataFrame = glueContext.getCatalogSource(
      database = "<your_database_name>",
      tableName = "<your_table_name>"
    ).getDataFrame()
  }
}
```

------

## Example: Update and insert a `DataFrame` into a Hudi table in Amazon S3
<a name="aws-glue-programming-etl-format-hudi-update-insert"></a>

This example uses the AWS Glue Data Catalog to insert a DataFrame into the Hudi table that you created in [Example: Write a Hudi table to Amazon S3 and register it in the AWS Glue Data Catalog](#aws-glue-programming-etl-format-hudi-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the `GlueContext.write_data_frame.from_catalog()` method.

```
# Example: Upsert a Hudi table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame = dataFrame,
    database = "<your_database_name>",
    table_name = "<your_table_name>",
    additional_options={
        "hoodie.table.name": "<your_table_name>",
        "hoodie.database.name": "<your_database_name>",
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "<your_recordkey_field>",
        "hoodie.datasource.write.precombine.field": "<your_precombine_field>",
        "hoodie.datasource.write.partitionpath.field": "<your_partitionkey_field>",
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "<your_database_name>",
        "hoodie.datasource.hive_sync.table": "<your_table_name>",
        "hoodie.datasource.hive_sync.partition_fields": "<your_partitionkey_field>",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms"
    }
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) method.

```
// Example: Upsert a Hudi table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = JsonOptions(Map(
        "hoodie.table.name" -> "<your_table_name>",
        "hoodie.database.name" -> "<your_database_name>",
        "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
        "hoodie.datasource.write.operation" -> "upsert",
        "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
        "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
        "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
        "hoodie.datasource.write.hive_style_partitioning" -> "true",
        "hoodie.datasource.hive_sync.enable" -> "true",
        "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
        "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
        "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
        "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc" -> "false",
        "hoodie.datasource.hive_sync.mode" -> "hms"
      )))
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## Example: Read a Hudi table from Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-hudi-read-spark"></a>

This example reads a Hudi table from Amazon S3 using the Spark DataFrame API.

------
#### [ Python ]

```
# Example: Read a Hudi table from S3 using a Spark DataFrame

dataFrame = spark.read.format("hudi").load("s3://<s3path/>")
```

------
#### [ Scala ]

```
// Example: Read a Hudi table from S3 using a Spark DataFrame

val dataFrame = spark.read.format("hudi").load("s3://<s3path/>")
```

------

## Example: Write a Hudi table to Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-hudi-write-spark"></a>

This example writes a Hudi table to Amazon S3 using Spark.

------
#### [ Python ]

```
# Example: Write a Hudi table to S3 using a Spark DataFrame

dataFrame.write.format("hudi") \
    .options(**additional_options) \
    .mode("overwrite") \
    .save("s3://<s3Path/>")
```

------
#### [ Scala ]

```
// Example: Write a Hudi table to S3 using a Spark DataFrame

dataFrame.write.format("hudi")
  .options(additionalOptions)
  .mode("overwrite")
  .save("s3://<s3path/>")
```

------

## Example: Read and write Hudi table with Lake Formation permission control
<a name="aws-glue-programming-etl-format-hudi-read-write-lake-formation-tables"></a>

This example reads and writes a Hudi table with Lake Formation permission control.

1. Create a Hudi table and register it in Lake Formation.

   1. To enable Lake Formation permission control, you first need to register the table's Amazon S3 path with Lake Formation. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html). You can register it either from the Lake Formation console or by using the AWS CLI:

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

      Once you register an Amazon S3 location, any AWS Glue table that points to the location (or any of its child locations) returns `true` for the `IsRegisteredWithLakeFormation` parameter in the `GetTable` response.

   1. Create a Hudi table that points to the registered Amazon S3 path through the Spark dataframe API:

      ```
      hudi_options = {
          'hoodie.table.name': table_name,
          'hoodie.database.name': database_name,
          'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
          'hoodie.datasource.write.recordkey.field': 'product_id',
          'hoodie.datasource.write.table.name': table_name,
          'hoodie.datasource.write.operation': 'upsert',
          'hoodie.datasource.write.precombine.field': 'updated_at',
          'hoodie.datasource.write.hive_style_partitioning': 'true',
          'hoodie.upsert.shuffle.parallelism': 2,
          'hoodie.insert.shuffle.parallelism': 2,
          'path': <S3_TABLE_LOCATION>,
          'hoodie.datasource.hive_sync.enable': 'true',
          'hoodie.datasource.hive_sync.database': database_name,
          'hoodie.datasource.hive_sync.table': table_name,
          'hoodie.datasource.hive_sync.use_jdbc': 'false',
          'hoodie.datasource.hive_sync.mode': 'hms'
      }
      
      df_products.write.format("hudi")  \
          .options(**hudi_options)  \
          .mode("overwrite")  \
          .save()
      ```

1. Grant Lake Formation permissions to the AWS Glue job IAM role. You can grant permissions either from the Lake Formation console or by using the AWS CLI. For more information, see [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html).

1. Read the Hudi table registered in Lake Formation. The code is the same as for reading a non-registered Hudi table. Note that the AWS Glue job IAM role needs the SELECT permission for the read to succeed.

   ```
    val dataFrame = glueContext.getCatalogSource(
         database = "<your_database_name>",
         tableName = "<your_table_name>"
       ).getDataFrame()
   ```

1. Write to a Hudi table registered in Lake Formation. The code is the same as for writing to a non-registered Hudi table. Note that the AWS Glue job IAM role needs the SUPER permission for the write to succeed.

   ```
   glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
         additionalOptions = JsonOptions(Map(
           "hoodie.table.name" -> "<your_table_name>",
           "hoodie.database.name" -> "<your_database_name>",
           "hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE",
           "hoodie.datasource.write.operation" -> "<write_operation>",
           "hoodie.datasource.write.recordkey.field" -> "<your_recordkey_field>",
           "hoodie.datasource.write.precombine.field" -> "<your_precombine_field>",
           "hoodie.datasource.write.partitionpath.field" -> "<your_partitionkey_field>",
           "hoodie.datasource.write.hive_style_partitioning" -> "true",
           "hoodie.datasource.hive_sync.enable" -> "true",
           "hoodie.datasource.hive_sync.database" -> "<your_database_name>",
           "hoodie.datasource.hive_sync.table" -> "<your_table_name>",
           "hoodie.datasource.hive_sync.partition_fields" -> "<your_partitionkey_field>",
           "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
           "hoodie.datasource.hive_sync.use_jdbc" -> "false",
           "hoodie.datasource.hive_sync.mode" -> "hms"
         )))
         .writeDataFrame(dataFrame, glueContext)
   ```
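
As a quick check after step 1, you can look for the `IsRegisteredWithLakeFormation` flag in the `GetTable` response. The following sketch assumes the response shape described above; the helper name is illustrative, not part of any AWS SDK:

```python
# Hypothetical helper: inspect an AWS Glue GetTable response (a plain dict
# shaped like the API output) for the IsRegisteredWithLakeFormation flag.
def is_registered_with_lake_formation(get_table_response):
    table = get_table_response.get("Table", {})
    return bool(table.get("IsRegisteredWithLakeFormation", False))

# Example response fragment for a table whose S3 location is registered.
response = {"Table": {"Name": "products", "IsRegisteredWithLakeFormation": True}}
print(is_registered_with_lake_formation(response))  # True
```

In a real job, the response would come from the `GetTable` API (for example, through the AWS SDK) rather than a literal dict.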

# Using the Delta Lake framework in AWS Glue
<a name="aws-glue-programming-etl-format-delta-lake"></a>

AWS Glue 3.0 and later supports the Linux Foundation Delta Lake framework. Delta Lake is an open-source data lake storage framework that helps you perform ACID transactions, scale metadata handling, and unify streaming and batch data processing. This topic covers available features for using your data in AWS Glue when you transport or store your data in a Delta Lake table. To learn more about Delta Lake, see the official [Delta Lake documentation](https://docs.delta.io/latest/delta-intro.html). 

You can use AWS Glue to perform read and write operations on Delta Lake tables in Amazon S3, or work with Delta Lake tables using the AWS Glue Data Catalog. Additional operations such as insert, update, and [Table batch reads and writes](https://docs.delta.io/0.7.0/api/python/index.html) are also supported. When you use Delta Lake tables, you also have the option to use methods from the Delta Lake Python library such as `DeltaTable.forPath`. For more information about the Delta Lake Python library, see Delta Lake's Python documentation.

The following table lists the version of Delta Lake included in each AWS Glue version.



| AWS Glue version | Supported Delta Lake version | 
| --- | --- | 
| 5.1 | 3.3.2 | 
| 5.0 | 3.3.0 | 
| 4.0 | 2.1.0 | 
| 3.0 | 1.0.0 | 

To learn more about the data lake frameworks that AWS Glue supports, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

## Enabling Delta Lake for AWS Glue
<a name="aws-glue-programming-etl-format-delta-lake-enable"></a>

To enable Delta Lake for AWS Glue, complete the following tasks:
+ Specify `delta` as a value for the `--datalake-formats` job parameter. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).
+ Create a key named `--conf` for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Delta Lake tables.

  ```
  spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
  ```
+ Lake Formation permission support for Delta tables is enabled by default for AWS Glue 4.0. No additional configuration is needed for reading from or writing to Lake Formation-registered Delta tables. To read a registered Delta table, the AWS Glue job IAM role must have the SELECT permission. To write to a registered Delta table, the AWS Glue job IAM role must have the SUPER permission. To learn more about managing Lake Formation permissions, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).
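
The `--conf` value above packs three settings into a single string. As an illustrative sketch (not an AWS Glue API), the same settings can be kept as a mapping and rendered into `spark-submit`-style arguments:

```python
# Delta Lake settings for Spark, kept as a plain mapping. In a job these
# are applied through the --conf key or SparkConf, as described above.
delta_spark_conf = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.delta.logStore.class": "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore",
}

def as_conf_args(conf):
    """Render the settings as spark-submit style --conf arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]

for arg in as_conf_args(delta_spark_conf):
    print(arg)
```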

**Using a different Delta Lake version**

To use a version of Delta Lake that AWS Glue doesn't support, specify your own Delta Lake JAR files using the `--extra-jars` job parameter. Do not include `delta` as a value for the `--datalake-formats` job parameter. If you use AWS Glue 5.0 or later, you must also set the `--user-jars-first` job parameter to `true`. To use the Delta Lake Python library in this case, you must specify the library JAR files using the `--extra-py-files` job parameter. The Python library comes packaged in the Delta Lake JAR files.
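
Putting these rules together, the job arguments for a custom Delta Lake version might look like the following sketch. The S3 paths and JAR file name are placeholders, not real artifacts:

```python
# Hypothetical default arguments for a job running an unsupported
# Delta Lake version. Replace the placeholder paths with your own JARs.
custom_delta_args = {
    "--extra-jars": "s3://<your-bucket>/jars/<delta-jar-file>.jar",
    # The Python library ships inside the Delta Lake JARs, so the same
    # files are passed to --extra-py-files.
    "--extra-py-files": "s3://<your-bucket>/jars/<delta-jar-file>.jar",
    "--user-jars-first": "true",  # required on AWS Glue 5.0 and later
}

# Note what is deliberately absent: --datalake-formats must not name delta.
assert "--datalake-formats" not in custom_delta_args
```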

## Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-delta-lake-write"></a>

The following AWS Glue ETL script demonstrates how to write a Delta Lake table to Amazon S3 and register the table to the AWS Glue Data Catalog.

------
#### [ Python ]

```
# Example: Create a Delta Lake table from a DataFrame 
# and register the table to Glue Data Catalog

additional_options = {
    "path": "s3://<s3Path>"
}
dataFrame.write \
    .format("delta") \
    .options(**additional_options) \
    .mode("append") \
    .partitionBy("<your_partitionkey_field>") \
    .saveAsTable("<your_database_name>.<your_table_name>")
```

------
#### [ Scala ]

```
// Example: Create a Delta Lake table from a DataFrame
// and register the table to Glue Data Catalog

val additional_options = Map(
  "path" -> "s3://<s3Path>"
)
dataFrame.write.format("delta")
  .options(additional_options)
  .mode("append")
  .partitionBy("<your_partitionkey_field>")
  .saveAsTable("<your_database_name>.<your_table_name>")
```

------

## Example: Read a Delta Lake table from Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-delta-lake-read"></a>

The following AWS Glue ETL script reads the Delta Lake table that you created in [Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-delta-lake-write).

------
#### [ Python ]

For this example, use the [create_data_frame.from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog) method.

```
# Example: Read a Delta Lake table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) method.

```
// Example: Read a Delta Lake table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .getDataFrame()
  }
}
```

------

## Example: Insert a `DataFrame` into a Delta Lake table in Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-delta-lake-insert"></a>

This example inserts data into the Delta Lake table that you created in [Example: Write a Delta Lake table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-delta-lake-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the [write_data_frame.from_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_data_frame_from_catalog) method.

```
# Example: Insert into a Delta Lake table in S3 using Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) method.

```
// Example: Insert into a Delta Lake table in S3 using Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## Example: Read a Delta Lake table from Amazon S3 using the Spark API
<a name="aws-glue-programming-etl-format-delta_lake-read-spark"></a>

This example reads a Delta Lake table from Amazon S3 using the Spark API.

------
#### [ Python ]

```
# Example: Read a Delta Lake table from S3 using a Spark DataFrame

dataFrame = spark.read.format("delta").load("s3://<s3path/>")
```

------
#### [ Scala ]

```
// Example: Read a Delta Lake table from S3 using a Spark DataFrame

val dataFrame = spark.read.format("delta").load("s3://<s3path/>")
```

------

## Example: Write a Delta Lake table to Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-delta_lake-write-spark"></a>

This example writes a Delta Lake table to Amazon S3 using Spark.

------
#### [ Python ]

```
# Example: Write a Delta Lake table to S3 using a Spark DataFrame

dataFrame.write.format("delta") \
    .options(**additional_options) \
    .mode("overwrite") \
    .partitionBy("<your_partitionkey_field>") \
    .save("s3://<s3Path>")
```

------
#### [ Scala ]

```
// Example: Write a Delta Lake table to S3 using a Spark DataFrame

dataFrame.write.format("delta")
  .options(additionalOptions)
  .mode("overwrite")
  .partitionBy("<your_partitionkey_field>")
  .save("s3://<s3path/>")
```

------

## Example: Read and write Delta Lake table with Lake Formation permission control
<a name="aws-glue-programming-etl-format-delta-lake-read-write-lake-formation-tables"></a>

This example reads and writes a Delta Lake table with Lake Formation permission control.

1. Create a Delta table and register it in Lake Formation

   1. To enable Lake Formation permission control, you first need to register the table's Amazon S3 path with Lake Formation. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html). You can register it either from the Lake Formation console or by using the AWS CLI:

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

      Once you register an Amazon S3 location, any AWS Glue table that points to the location (or any of its child locations) returns `true` for the `IsRegisteredWithLakeFormation` parameter in the `GetTable` response.

   1. Create a Delta table that points to the registered Amazon S3 path through Spark:

      **Note**  
      The following are Python examples.

      ```
      dataFrame.write \
      	.format("delta") \
      	.mode("overwrite") \
      	.partitionBy("<your_partitionkey_field>") \
      	.save("s3://<the_s3_path>")
      ```

      After the data has been written to Amazon S3, use the AWS Glue crawler to create a new Delta catalog table. For more information, see [Introducing native Delta Lake table support with AWS Glue crawlers](https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/).

      You can also create the table manually through the AWS Glue `CreateTable` API.

1. Grant Lake Formation permissions to the AWS Glue job IAM role. You can grant permissions either from the Lake Formation console or by using the AWS CLI. For more information, see [Granting table permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html).

1. Read the Delta table registered in Lake Formation. The code is the same as for reading a non-registered Delta table. Note that the AWS Glue job IAM role needs the SELECT permission for the read to succeed.

   ```
   # Example: Read a Delta Lake table from Glue Data Catalog
   
   df = glueContext.create_data_frame.from_catalog(
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

1. Write to a Delta table registered in Lake Formation. The code is the same as for writing to a non-registered Delta table. Note that the AWS Glue job IAM role needs the SUPER permission for the write to succeed.

   By default, AWS Glue uses `Append` as the save mode. You can change this by setting the `saveMode` option in `additional_options`. For information about `saveMode` support in Delta tables, see [Write to a table](https://docs.delta.io/latest/delta-batch.html#write-to-a-table).

   ```
   glueContext.write_data_frame.from_catalog(
       frame=dataFrame,
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```
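
The `saveMode` override described in the step above can be expressed as one extra entry in `additional_options`, leaving the surrounding write call unchanged. A sketch with placeholder values (the exact mode strings follow Delta Lake's batch-write documentation):

```python
# Sketch: additional_options carrying a saveMode override for the Delta
# table write. Other entries, such as the table path, stay as before.
additional_options = {
    "path": "s3://<s3Path>",  # placeholder table location
    "saveMode": "Overwrite",  # replaces the default Append
}
print(additional_options["saveMode"])
```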

# Using the Iceberg framework in AWS Glue
<a name="aws-glue-programming-etl-format-iceberg"></a>

AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. Iceberg provides a high-performance table format that works just like a SQL table. This topic covers available features for using your data in AWS Glue when you transport or store your data in an Iceberg table. To learn more about Iceberg, see the official [Apache Iceberg documentation](https://iceberg.apache.org/docs/latest/). 

You can use AWS Glue to perform read and write operations on Iceberg tables in Amazon S3, or work with Iceberg tables using the AWS Glue Data Catalog. Additional operations, including insert and all [Spark Queries](https://iceberg.apache.org/docs/latest/spark-queries/) and [Spark Writes](https://iceberg.apache.org/docs/latest/spark-writes/), are also supported. Update is not supported for Iceberg tables.

**Note**  
`ALTER TABLE … RENAME TO` is not available for Apache Iceberg 0.13.1 for AWS Glue 3.0.

The following table lists the version of Iceberg included in each AWS Glue version.



| AWS Glue version | Supported Iceberg version | 
| --- | --- | 
| 5.1 | 1.10.0 | 
| 5.0 | 1.7.1 | 
| 4.0 | 1.0.0 | 
| 3.0 | 0.13.1 | 

To learn more about the data lake frameworks that AWS Glue supports, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

## Enabling the Iceberg framework
<a name="aws-glue-programming-etl-format-iceberg-enable"></a>

To enable Iceberg for AWS Glue, complete the following tasks:
+ Specify `iceberg` as a value for the `--datalake-formats` job parameter. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).
+ Create a key named `--conf` for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Iceberg tables.

  ```
  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog 
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ 
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  ```

  If you are reading or writing to Iceberg tables that are registered with Lake Formation, follow the guidance in [Using AWS Glue with AWS Lake Formation for fine-grained access control](security-lf-enable.md) in AWS Glue 5.0 and later. In AWS Glue 4.0, add the following configuration to enable Lake Formation support.

  ```
  --conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true
  --conf spark.sql.catalog.glue_catalog.glue.id=<table-catalog-id>
  ```

  If you use AWS Glue 3.0 with Iceberg 0.13.1, you must set the following additional configurations to use the Amazon DynamoDB lock manager to ensure atomic transactions. AWS Glue 4.0 and later use optimistic locking by default. For more information, see [Iceberg AWS Integrations](https://iceberg.apache.org/docs/latest/aws/#dynamodb-lock-manager) in the official Apache Iceberg documentation.

  ```
  --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager 
  --conf spark.sql.catalog.glue_catalog.lock.table=<your-dynamodb-table-name>
  ```
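
Across the settings above, only the warehouse path and (on AWS Glue 3.0) the lock table vary from job to job. As an illustrative sketch, they can be assembled with a small helper; the function below is hypothetical and not part of any AWS Glue or Iceberg API:

```python
# Hypothetical helper that assembles the Spark settings enabling the
# Iceberg Glue catalog. Pass dynamodb_lock_table only on AWS Glue 3.0
# (Iceberg 0.13.1), where the DynamoDB lock manager is required.
def iceberg_glue_conf(warehouse_dir, dynamodb_lock_table=None):
    conf = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.warehouse": warehouse_dir,
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }
    if dynamodb_lock_table is not None:
        conf["spark.sql.catalog.glue_catalog.lock-impl"] = "org.apache.iceberg.aws.glue.DynamoLockManager"
        conf["spark.sql.catalog.glue_catalog.lock.table"] = dynamodb_lock_table
    return conf

print(len(iceberg_glue_conf("s3://<your-warehouse-dir>/")))  # 5 settings
```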

**Using a different Iceberg version**

To use a version of Iceberg that AWS Glue doesn't support, specify your own Iceberg JAR files using the `--extra-jars` job parameter. Do not include `iceberg` as a value for the `--datalake-formats` job parameter. If you use AWS Glue 5.0 or later, you must also set the `--user-jars-first` job parameter to `true`.

**Enabling encryption for Iceberg tables**

**Note**  
Iceberg tables have their own mechanisms to enable server-side encryption. You should enable this configuration in addition to AWS Glue's security configuration.

To enable server-side encryption on Iceberg tables, review the guidance from the [Iceberg documentation](https://iceberg.apache.org/docs/latest/aws/#s3-server-side-encryption).

**Adding Spark configuration for cross-Region Iceberg table access**

To add Spark configuration for cross-Region Iceberg table access with the AWS Glue Data Catalog and AWS Lake Formation, follow these steps:

1. Create a [Multi-region access point](https://docs.aws.amazon.com/AmazonS3/latest/userguide/multi-region-access-point-create-examples.html).

1. Set the following Spark properties:

   ```
   --conf spark.sql.catalog.{CATALOG}.s3.use-arn-region-enabled=true \
   --conf spark.sql.catalog.{CATALOG}.s3.access-points.bucket1=arn:aws:s3::<account-id>:accesspoint/<mrap-id>.mrap \
   --conf spark.sql.catalog.{CATALOG}.s3.access-points.bucket2=arn:aws:s3::<account-id>:accesspoint/<mrap-id>.mrap
   ```

## Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-iceberg-write"></a>

This example script demonstrates how to write an Iceberg table to Amazon S3. The example uses [Iceberg AWS Integrations](https://iceberg.apache.org/docs/latest/aws/) to register the table to the AWS Glue Data Catalog.

------
#### [ Python ]

```
# Example: Create an Iceberg table from a DataFrame 
# and register the table to Glue Data Catalog

dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

query = f"""
CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ("format-version"="2")
AS SELECT * FROM tmp_<your_table_name>
"""
spark.sql(query)
```

------
#### [ Scala ]

```
// Example: Create an Iceberg table from a DataFrame
// and register the table to Glue Data Catalog

dataFrame.createOrReplaceTempView("tmp_<your_table_name>")

val query = """CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
USING iceberg
TBLPROPERTIES ("format-version"="2")
AS SELECT * FROM tmp_<your_table_name>
"""
spark.sql(query)
```

------

Alternatively, you can write an Iceberg table to Amazon S3 and the Data Catalog using Spark methods.

Prerequisites: You will need to provision a catalog for the Iceberg library to use. When using the AWS Glue Data Catalog, AWS Glue makes this straightforward. The AWS Glue Data Catalog is pre-configured for use by the Spark libraries as `glue_catalog`. Data Catalog tables are identified by a *databaseName* and a *tableName*. For more information about the AWS Glue Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md).

If you are not using the AWS Glue Data Catalog, you will need to provision a catalog through the Spark APIs. For more information, see [Spark Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/) in the Iceberg documentation.

This example writes an Iceberg table to Amazon S3 and the Data Catalog using Spark.

------
#### [ Python ]

```
# Example: Write an Iceberg table to S3 on the Glue Data Catalog

# Create (equivalent to CREATE TABLE AS SELECT)
dataFrame.writeTo("glue_catalog.databaseName.tableName") \
    .tableProperty("format-version", "2") \
    .create()

# Append (equivalent to INSERT INTO)
dataFrame.writeTo("glue_catalog.databaseName.tableName") \
    .tableProperty("format-version", "2") \
    .append()
```

------
#### [ Scala ]

```
// Example: Write an Iceberg table to S3 on the Glue Data Catalog

// Create (equivalent to CREATE TABLE AS SELECT)
dataFrame.writeTo("glue_catalog.databaseName.tableName")
    .tableProperty("format-version", "2")
    .create()

// Append (equivalent to INSERT INTO)
dataFrame.writeTo("glue_catalog.databaseName.tableName")
    .tableProperty("format-version", "2")
    .append()
```

------

## Example: Read an Iceberg table from Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-iceberg-read"></a>

This example reads the Iceberg table that you created in [Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-iceberg-write).

------
#### [ Python ]

For this example, use the `GlueContext.create_data_frame.from_catalog()` method.

```
# Example: Read an Iceberg table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSource](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSource) method.

```
// Example: Read an Iceberg table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val df = glueContext.getCatalogSource("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .getDataFrame()
  }
}
```

------

## Example: Insert a `DataFrame` into an Iceberg table in Amazon S3 using the AWS Glue Data Catalog
<a name="aws-glue-programming-etl-format-iceberg-insert"></a>

This example inserts data into the Iceberg table that you created in [Example: Write an Iceberg table to Amazon S3 and register it to the AWS Glue Data Catalog](#aws-glue-programming-etl-format-iceberg-write).

**Note**  
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To learn more, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

------
#### [ Python ]

For this example, use the `GlueContext.write_data_frame.from_catalog()` method.

```
# Example: Insert into an Iceberg table from Glue Data Catalog

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,
    database="<your_database_name>",
    table_name="<your_table_name>",
    additional_options=additional_options
)
```

------
#### [ Scala ]

For this example, use the [getCatalogSink](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getCatalogSink) method.

```
// Example: Insert into an Iceberg table from Glue Data Catalog

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getCatalogSink("<your_database_name>", "<your_table_name>",
      additionalOptions = additionalOptions)
      .writeDataFrame(dataFrame, glueContext)
  }
}
```

------

## Example: Read an Iceberg table from Amazon S3 using Spark
<a name="aws-glue-programming-etl-format-iceberg-read-spark"></a>

Prerequisites: You will need to provision a catalog for the Iceberg library to use. When using the AWS Glue Data Catalog, AWS Glue makes this straightforward. The AWS Glue Data Catalog is pre-configured for use by the Spark libraries as `glue_catalog`. Data Catalog tables are identified by a *databaseName* and a *tableName*. For more information about the AWS Glue Data Catalog, see [Data discovery and cataloging in AWS Glue](catalog-and-crawler.md).

If you are not using the AWS Glue Data Catalog, you will need to provision a catalog through the Spark APIs. For more information, see [Spark Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/) in the Iceberg documentation.

This example reads an Iceberg table in Amazon S3 from the Data Catalog using Spark.

------
#### [ Python ]

```
# Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog

dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
```

------
#### [ Scala ]

```
// Example: Read an Iceberg table on S3 as a DataFrame from the Glue Data Catalog

val dataFrame = spark.read.format("iceberg").load("glue_catalog.databaseName.tableName")
```

------

## Example: Read and write Iceberg table with Lake Formation permission control
<a name="aws-glue-programming-etl-format-iceberg-read-write-lake-formation-tables"></a>

This example reads and writes an Iceberg table with Lake Formation permission control.

**Note**  
This example works only in AWS Glue 4.0. In AWS Glue 5.0 and later, follow the guidance in [Using AWS Glue with AWS Lake Formation for fine-grained access control](security-lf-enable.md).

1. Create an Iceberg table and register it in Lake Formation:

   1. To enable Lake Formation permission control, you must first register the table's Amazon S3 path with Lake Formation. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html). You can register it either from the Lake Formation console or by using the AWS CLI:

      ```
      aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
      ```

      After you register an Amazon S3 location, any AWS Glue table that points to the location (or any of its child locations) returns `true` for the `IsRegisteredWithLakeFormation` parameter in the `GetTable` call.

   1. Create an Iceberg table that points to the registered path through Spark SQL:
**Note**  
The following are Python examples.

      ```
      dataFrame.createOrReplaceTempView("tmp_<your_table_name>")
      
      query = f"""
      CREATE TABLE glue_catalog.<your_database_name>.<your_table_name>
      USING iceberg
      AS SELECT * FROM tmp_<your_table_name>
      """
      spark.sql(query)
      ```

      You can also create the table manually through AWS Glue `CreateTable` API. For more information, see [Creating Apache Iceberg tables](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-iceberg-tables.html).
**Note**  
The `UpdateTable` API does not currently support Iceberg table format as an input to the operation.

1. Grant Lake Formation permission to the job IAM role. You can grant permissions either from the Lake Formation console or by using the AWS CLI. For more information, see [Granting table permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html).

1. Read an Iceberg table registered with Lake Formation. The code is the same as reading a non-registered Iceberg table. Note that your AWS Glue job IAM role needs the SELECT permission for the read to succeed.

   ```
   # Example: Read an Iceberg table from the AWS Glue Data Catalog
   from awsglue.context import GlueContext
   from pyspark.context import SparkContext
   
   sc = SparkContext()
   glueContext = GlueContext(sc)
   
   df = glueContext.create_data_frame.from_catalog(
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

1. Write to an Iceberg table registered with Lake Formation. The code is the same as writing to a non-registered Iceberg table. Note that your AWS Glue job IAM role needs the SUPER permission for the write to succeed.

   ```
   glueContext.write_data_frame.from_catalog(
       frame=dataFrame,
       database="<your_database_name>",
       table_name="<your_table_name>",
       additional_options=additional_options
   )
   ```

## Shared configuration reference
<a name="aws-glue-programming-etl-format-shared-reference"></a>

 You can use the following `format_options` values with any format type. 
+ `attachFilename` — A string in the appropriate format to be used as a column name. If you provide this option, the name of the source file for the record will be appended to the record. The parameter value will be used as the column name.
+ `attachTimestamp` — A string in the appropriate format to be used as a column name. If you provide this option, the modification time of the source file for the record will be appended to the record. The parameter value will be used as the column name.
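As a hedged sketch, the following shows how these options might be passed when reading CSV from Amazon S3; the column names `source_file` and `source_mtime` and the bucket path are illustrative choices, not required values.

```python
# Sketch: shared format_options with attachFilename and attachTimestamp.
# The column names and S3 path below are illustrative placeholders.
format_options = {
    "withHeader": True,
    "attachFilename": "source_file",    # append the source file name to each record
    "attachTimestamp": "source_mtime",  # append the source file's modification time
}

# Within a Glue job context (glueContext is assumed to exist):
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options={"paths": ["s3://your-bucket/input/"]},
#     format="csv",
#     format_options=format_options,
#     transformation_ctx="src",
# )
```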

# AWS Glue Data Catalog support for Spark SQL jobs
<a name="aws-glue-programming-etl-glue-data-catalog-hive"></a>

The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog. You can configure your AWS Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. AWS Glue dynamic frames integrate with the Data Catalog by default. However, with this feature, Spark SQL jobs can start using the Data Catalog as an external Hive metastore.

This feature requires network access to the AWS Glue API endpoint. For AWS Glue jobs with connections located in private subnets, you must configure either a VPC endpoint or NAT gateway to provide the network access. For information about configuring a VPC endpoint, see [Setting up network access to data stores](start-connecting.md). To create a NAT gateway, see [NAT Gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the *Amazon VPC User Guide*.

You can configure AWS Glue jobs and development endpoints by adding the `"--enable-glue-datacatalog": ""` argument to job arguments and development endpoint arguments respectively. Passing this argument sets certain configurations in Spark that enable it to access the Data Catalog as an external Hive metastore. It also [enables Hive support](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.Builder.html#enableHiveSupport--) in the `SparkSession` object created in the AWS Glue job or development endpoint. 

To enable Data Catalog access, select the **Use AWS Glue Data Catalog as the Hive metastore** check box in the **Catalog options** group on the **Add job** or **Add endpoint** page on the console. Note that the IAM role used for the job or development endpoint must have `glue:CreateDatabase` permissions. A database called "`default`" is created in the Data Catalog if it does not exist. 

Let's look at an example of how you can use this feature in your Spark SQL jobs. The following example assumes that you have crawled the US legislators dataset available at `s3://awsglue-datasets/examples/us-legislators`.

To serialize/deserialize data from the tables defined in the AWS Glue Data Catalog, Spark SQL needs the [Hive SerDe](https://cwiki.apache.org/confluence/display/Hive/SerDe) class for the format defined in the AWS Glue Data Catalog in the classpath of the Spark job. 

SerDes for certain common formats are distributed by AWS Glue. The following are the Amazon S3 links for these:
+ [JSON](https://s3.us-west-2.amazonaws.com/crawler-public/json/serde/json-serde.jar)
+ [XML](https://s3.us-west-2.amazonaws.com/crawler-public/xml/serde/hivexmlserde-1.0.5.3.jar)
+ [Grok](https://s3.us-west-2.amazonaws.com/crawler-public/grok/serde/AWSGlueHiveGrokSerDe-1.0-super.jar)

Add the JSON SerDe as an [extra JAR to the development endpoint](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-dev-endpoint.html#aws-glue-api-dev-endpoint-DevEndpointCustomLibraries). For jobs, you can add the SerDe using the `--extra-jars` argument in the arguments field. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). 

Here is an example input JSON to create a development endpoint with the Data Catalog enabled for Spark SQL.

```
{
    "EndpointName": "Name",
    "RoleArn": "role_ARN",
    "PublicKey": "public_key_contents",
    "NumberOfNodes": 2,
    "Arguments": {
      "--enable-glue-datacatalog": ""
    },
    "ExtraJarsS3Path": "s3://crawler-public/json/serde/json-serde.jar"
}
```

Now query the tables created from the US legislators dataset using Spark SQL.

```
>>> spark.sql("use legislators")
DataFrame[]
>>> spark.sql("show tables").show()
+-----------+------------------+-----------+
|   database|         tableName|isTemporary|
+-----------+------------------+-----------+
|legislators|        areas_json|      false|
|legislators|    countries_json|      false|
|legislators|       events_json|      false|
|legislators|  memberships_json|      false|
|legislators|organizations_json|      false|
|legislators|      persons_json|      false|
+-----------+------------------+-----------+
>>> spark.sql("describe memberships_json").show()
+--------------------+---------+-----------------+
|            col_name|data_type|          comment|
+--------------------+---------+-----------------+
|             area_id|   string|from deserializer|
|     on_behalf_of_id|   string|from deserializer|
|     organization_id|   string|from deserializer|
|                role|   string|from deserializer|
|           person_id|   string|from deserializer|
|legislative_perio...|   string|from deserializer|
|          start_date|   string|from deserializer|
|            end_date|   string|from deserializer|
+--------------------+---------+-----------------+
```

If the SerDe class for the format is not available in the job's classpath, you will see an error similar to the following.

```
>>> spark.sql("describe memberships_json").show()

Caused by: MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:399)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
    ... 64 more
```

To view only the distinct `organization_id`s from the `memberships` table, run the following SQL query.

```
>>> spark.sql("select distinct organization_id from memberships_json").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+
```

If you need to do the same with dynamic frames, run the following.

```
>>> memberships = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="memberships_json")
>>> memberships.toDF().createOrReplaceTempView("memberships")
>>> spark.sql("select distinct organization_id from memberships").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+
```

While DynamicFrames are optimized for ETL operations, enabling Spark SQL to access the Data Catalog directly provides a concise way to run complex SQL statements or port existing applications.

# Using job bookmarks
<a name="programming-etl-connect-bookmarks"></a>

AWS Glue for Spark uses job bookmarks to track data that has already been processed. For a summary of the job bookmarks feature and what it supports, see [Tracking processed data using job bookmarks](monitor-continuations.md). When programming an AWS Glue job with bookmarks, you have access to flexibility unavailable in visual jobs.
+  When reading from JDBC, you can specify the column(s) to use as bookmark keys in your AWS Glue script. 
+  You can choose which `transformation_ctx` to apply to each method call. 

*Always call `job.init` at the beginning of the script and `job.commit` at the end of the script with appropriately configured parameters*. These two functions initialize the bookmark service and update the state change to the service. Bookmarks won't work without calling them.

## Specify bookmark keys
<a name="programming-etl-connect-bookmarks-columns"></a>

For JDBC workflows, the bookmark keeps track of which rows your job has read by comparing the values of key fields to a bookmarked value. This is not necessary or applicable for Amazon S3 workflows. When writing an AWS Glue script without the visual editor, you can specify which column to track with bookmarks. You can also specify multiple columns. Gaps in the sequence of values are permitted when specifying user-defined bookmark keys. 

**Warning**  
If user-defined bookmark keys are used, they must each be strictly monotonically increasing or decreasing. When selecting additional fields for a compound key, fields for concepts like "minor versions" or "revision numbers" do not meet this criterion, because their values are reused throughout your dataset.

You can specify `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` in the following ways: 
+ `create_dynamic_frame.from_catalog` — Use `additional_options`.
+ `create_dynamic_frame.from_options` — Use `connection_options`.
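As a sketch, the same bookmark-key options can be expressed for either read path; the database, table, connection, and column names below are hypothetical.

```python
# Sketch: bookmark-key options shared by both read paths.
# Database, table, connection, and column names are hypothetical.
bookmark_options = {
    "jobBookmarkKeys": ["empno"],
    "jobBookmarkKeysSortOrder": "asc",
}

# With create_dynamic_frame.from_catalog, pass them as additional_options
# (glueContext is assumed to exist in a Glue job):
# datasource = glueContext.create_dynamic_frame.from_catalog(
#     database="hr",
#     table_name="emp",
#     transformation_ctx="datasource",
#     additional_options=bookmark_options,
# )

# With create_dynamic_frame.from_options, merge them into connection_options:
# datasource = glueContext.create_dynamic_frame.from_options(
#     connection_type="mysql",
#     connection_options={"dbtable": "emp",
#                         "connectionName": "my-jdbc-connection",
#                         **bookmark_options},
#     transformation_ctx="datasource",
# )
```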

## Transformation context
<a name="monitor-continuations-implement-context"></a>

Many of the AWS Glue PySpark dynamic frame methods include an optional parameter named `transformation_ctx`, which is a unique identifier for the ETL operator instance. The `transformation_ctx` parameter is used to identify state information within a job bookmark for the given operator. Specifically, AWS Glue uses `transformation_ctx` to index the key to the bookmark state. 

**Warning**  
The `transformation_ctx` serves as the key to search the bookmark state for a specific source in your script. For the bookmark to work properly, you should always keep the source and the associated `transformation_ctx` consistent. Changing the source property or renaming the `transformation_ctx` may make the previous bookmark invalid and the time stamp based filtering may not yield the correct result.

For job bookmarks to work properly, enable the job bookmark parameter and set the `transformation_ctx` parameter. If you don't pass in the `transformation_ctx` parameter, then job bookmarks are not enabled for a dynamic frame or a table used in the method. For example, if you have an ETL job that reads and joins two Amazon S3 sources, you might choose to pass the `transformation_ctx` parameter only to those methods that you want to enable bookmarks. If you reset the job bookmark for a job, it resets all transformations that are associated with the job regardless of the `transformation_ctx` used. 

For more information about the `DynamicFrameReader` class, see [DynamicFrameReader class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md). For more information about PySpark extensions, see [AWS Glue PySpark extensions reference](aws-glue-programming-python-extensions.md). 

## Examples
<a name="monitor-continuations-implement-examples"></a>

**Example**  
The following is an example of a generated script for an Amazon S3 data source. The portions of the script that are required for using job bookmarks are shown in italics. For more information about these elements, see the [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) API and the [DynamicFrameWriter class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.md) API.  

```
# Sample Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "database",
    table_name = "relatedqueries_csv",
    transformation_ctx = "datasource0"
)

applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [("col0", "string", "name", "string"), ("col1", "string", "number", "string")],
    transformation_ctx = "applymapping1"
)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://input_path"},
    format = "json",
    transformation_ctx = "datasink2"
)


job.commit()
```

**Example**  
The following is an example of a generated script for a JDBC source. The source table is an employee table with the `empno` column as the primary key. By default, if no bookmark key is specified, the job uses a sequential primary key as the bookmark key. Because `empno` is not necessarily sequential (there can be gaps in the values), it does not qualify as a default bookmark key. Therefore, the script explicitly designates `empno` as the bookmark key. That portion of the code is shown in italics.  

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hr",
    table_name = "emp",
    transformation_ctx = "datasource0",
    additional_options = {"jobBookmarkKeys":["empno"],"jobBookmarkKeysSortOrder":"asc"}
)

applymapping1 = ApplyMapping.apply(
    frame = datasource0,
    mappings = [("ename", "string", "ename", "string"), ("hrly_rate", "decimal(38,0)", "hrly_rate", "decimal(38,0)"), ("comm", "decimal(7,2)", "comm", "decimal(7,2)"), ("hiredate", "timestamp", "hiredate", "timestamp"), ("empno", "decimal(5,0)", "empno", "decimal(5,0)"), ("mgr", "decimal(5,0)", "mgr", "decimal(5,0)"), ("photo", "string", "photo", "string"), ("job", "string", "job", "string"), ("deptno", "decimal(3,0)", "deptno", "decimal(3,0)"), ("ssn", "decimal(9,0)", "ssn", "decimal(9,0)"), ("sal", "decimal(7,2)", "sal", "decimal(7,2)")],
    transformation_ctx = "applymapping1"
)

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://hr/employees"},
    format = "csv",
    transformation_ctx = "datasink2"
)

job.commit()
```

# Using Sensitive Data Detection outside AWS Glue Studio
<a name="aws-glue-api-sensitive-data-example"></a>

 AWS Glue Studio allows you to detect sensitive data; however, you can also use the Sensitive Data Detection functionality outside of AWS Glue Studio. 

 For a full list of managed sensitive data types, see [Managed data types](https://docs.aws.amazon.com/glue/latest/dg/sensitive-data-managed-data-types.html). 

## Detecting sensitive data using AWS managed PII types
<a name="sensitive-data-managed-pii-types"></a>

 AWS Glue provides two APIs in an AWS Glue ETL job: `detect()` and `classifyColumns()`. 

```
  detect(frame: DynamicFrame, 
      entityTypesToDetect: Seq[String], 
      outputColumnName: String = "DetectedEntities",
      detectionSensitivity: String = "LOW"): DynamicFrame

 detect(frame: DynamicFrame, 
      detectionParameters: JsonOptions,
      outputColumnName: String = "DetectedEntities",
      detectionSensitivity: String = "LOW"): DynamicFrame
      
  classifyColumns(frame: DynamicFrame, 
      entityTypesToDetect: Seq[String], 
      sampleFraction: Double = 0.1, 
      thresholdFraction: Double = 0.1,
      detectionSensitivity: String = "LOW")
```

 You can use the `detect()` API to identify AWS managed PII types and custom entity types. A new column is automatically created with the detection result. The `classifyColumns()` API returns a map where keys are column names and values are lists of detected entity types. `sampleFraction` indicates the fraction of the data to sample when scanning for PII entities, whereas `thresholdFraction` indicates the fraction of sampled cells that must contain an entity for the column to be identified as PII data. 
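The return shape of `classifyColumns()` can be illustrated with a small sketch; the column names and detected types below are made up for illustration.

```python
# Sketch: the shape of a classifyColumns() result. Keys are column names;
# values are lists of detected entity types. All values are illustrative.
detected = {
    "customer_email": ["EMAIL"],
    "payment_card": ["CREDIT_CARD"],
    "order_total": [],
}

# Columns flagged as containing PII are those with a non-empty type list:
pii_columns = sorted(col for col, types in detected.items() if types)
```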

### Row-level detection
<a name="w2aac67c11c24c19b9c11"></a>

 In the example, the job performs the following actions using the `detect()` API: 
+  reads data from an Amazon S3 bucket and turns it into a DynamicFrame 
+  detects instances of "Email" and "Credit Card" in the DynamicFrame 
+  returns a DynamicFrame with the original values plus one column that contains the detection result for each row 
+  writes the returned DynamicFrame to another Amazon S3 path 

```
  import com.amazonaws.services.glue.GlueContext
  import com.amazonaws.services.glue.MappingSpec
  import com.amazonaws.services.glue.errors.CallSite
  import com.amazonaws.services.glue.util.GlueArgParser
  import com.amazonaws.services.glue.util.Job
  import com.amazonaws.services.glue.util.JsonOptions
  import org.apache.spark.SparkContext
  import scala.collection.JavaConverters._
  import com.amazonaws.services.glue.ml.EntityDetector
  
  object GlueApp {
    def main(sysArgs: Array[String]) {
      val spark: SparkContext = new SparkContext()
      val glueContext: GlueContext = new GlueContext(spark)
      val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
      Job.init(args("JOB_NAME"), glueContext, args.asJava)
      val frame= glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ","}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://pathToSource"], "recurse": true}"""), transformationContext="AmazonS3_node1650160158526").getDynamicFrame()
  
      val frameWithDetectedPII = EntityDetector.detect(frame, Seq("EMAIL", "CREDIT_CARD"))
  
      glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://pathToOutput/", "partitionKeys": []}"""), transformationContext="someCtx", format="json").writeDynamicFrame(frameWithDetectedPII)
  
      Job.commit()
    }
  }
```

### Row-level detection with fine-grained actions
<a name="w2aac67c11c24c19b9c15"></a>

 In the example, the job performs the following actions using the `detect()` API: 
+  reads data from an Amazon S3 bucket and turns it into a DynamicFrame 
+  detects the sensitive data types "USA_DRIVING_LICENSE", "BANK_ACCOUNT", "USA_SSN", "IP_ADDRESS", and "PHONE_NUMBER" in the DynamicFrame 
+  returns a DynamicFrame with masked values plus one column that contains the detection result for each row 
+  writes the returned DynamicFrame to another Amazon S3 path 

 In contrast with the preceding `detect()` example, this one uses fine-grained actions for the entity types to detect. For more information, see [Detection parameters for using `detect()`](#sensitive-data-detect-parameters-fine-grained-actions). 

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import com.amazonaws.services.glue.ml.EntityDetector

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    val frame = glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ","}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://pathToSource"], "recurse": true}"""), transformationContext="AmazonS3_node_source").getDynamicFrame()

    val detectionParameters = JsonOptions(
      """
        {
          "USA_DRIVING_LICENSE": [{
            "action": "PARTIAL_REDACT",
            "sourceColumns": ["Driving License"],
            "actionOptions": {
              "matchPattern": "[0-9]",
              "redactChar": "*"
            }
          }],
          "BANK_ACCOUNT": [{
            "action": "DETECT",
            "sourceColumns": ["*"]
          }],
          "USA_SSN": [{
            "action": "SHA256_HASH",
            "sourceColumns": ["SSN"]
          }],
          "IP_ADDRESS": [{
            "action": "REDACT",
            "sourceColumns": ["IP Address"],
            "actionOptions": {"redactText": "*****"}
          }],
          "PHONE_NUMBER": [{
            "action": "PARTIAL_REDACT",
            "sourceColumns": ["Phone Number"],
            "actionOptions": {
              "numLeftCharsToExclude": 1,
              "numRightCharsToExclude": 0,
              "redactChar": "*"
            }
          }]
        }
      """
    )

    val frameWithDetectedPII = EntityDetector.detect(frame, detectionParameters, "DetectedEntities", "HIGH")

    glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://pathToOutput/", "partitionKeys": []}"""), transformationContext="AmazonS3_node_target", format="json").writeDynamicFrame(frameWithDetectedPII)

    Job.commit()
  }
}
```

### Column-level detection
<a name="w2aac67c11c24c19b9c19"></a>

 In the example, the job performs the following actions using the `classifyColumns()` API: 
+  reads data from an Amazon S3 bucket and turns it into a DynamicFrame 
+  detects instances of "Credit Card" and "Phone Number" in the DynamicFrame 
+  sets parameters to sample 100 percent of each column, mark an entity as detected if it appears in at least 10 percent of cells, and use "LOW" detection sensitivity 
+  returns a map where keys are column names and values are lists of detected entity types 
+  writes the result as a DynamicFrame to another Amazon S3 path 

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.ml.EntityDetector

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    val frame = glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ",", "optimizePerformance": false}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://pathToSource"], "recurse": true}"""), transformationContext="frame").getDynamicFrame()
    
    import glueContext.sparkSession.implicits._

    val detectedDataFrame = EntityDetector.classifyColumns(
        frame, 
        entityTypesToDetect = Seq("CREDIT_CARD", "PHONE_NUMBER"), 
        sampleFraction = 1.0, 
        thresholdFraction = 0.1,
        detectionSensitivity = "LOW"
    )
    val detectedDF = (detectedDataFrame).toSeq.toDF("columnName", "entityTypes")
    val DetectSensitiveData_node = DynamicFrame(detectedDF, glueContext)

    glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://pathToOutput", "partitionKeys": []}"""), transformationContext="someCtx", format="json").writeDynamicFrame(DetectSensitiveData_node)

    Job.commit()
  }
}
```

## Detecting sensitive data using AWS CustomEntityType PII types
<a name="sensitive-data-custom-entity-PII-types"></a>

 You can define custom entities through AWS Glue Studio. However, to use this feature outside of AWS Glue Studio, you must first define the custom entity types and then add them to the list of `entityTypesToDetect`. 

 If you have specific sensitive data types in your data (such as 'Employee Id'), you can create custom entities by calling the `CreateCustomEntityType()` API. The following example defines the custom entity type 'EMPLOYEE_ID' by calling the `CreateCustomEntityType()` API with these request parameters: 

```
  { 
      "name": "EMPLOYEE_ID",
      "regexString": "\d{4}-\d{3}",
      "contextWords": ["employee"]
  }
```
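Before registering the entity, you can sanity-check the regular expression locally. The sample values below are made up, and note that in a JSON request body the backslash may need escaping.

```python
import re

# Sketch: verify the EMPLOYEE_ID pattern against sample values before
# calling CreateCustomEntityType. The sample IDs are made up.
pattern = re.compile(r"\d{4}-\d{3}")

assert pattern.fullmatch("1234-567") is not None   # expected shape matches
assert pattern.fullmatch("123-4567") is None       # wrong grouping is rejected
```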

 Then, modify the job to use the new custom sensitive data type by adding the custom entity type (EMPLOYEE_ID) to the `EntityDetector()` API: 

```
  import com.amazonaws.services.glue.GlueContext
  import com.amazonaws.services.glue.MappingSpec
  import com.amazonaws.services.glue.errors.CallSite
  import com.amazonaws.services.glue.util.GlueArgParser
  import com.amazonaws.services.glue.util.Job
  import com.amazonaws.services.glue.util.JsonOptions
  import org.apache.spark.SparkContext
  import scala.collection.JavaConverters._
  import com.amazonaws.services.glue.ml.EntityDetector
  
  object GlueApp {
    def main(sysArgs: Array[String]) {
      val spark: SparkContext = new SparkContext()
      val glueContext: GlueContext = new GlueContext(spark)
      val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
      Job.init(args("JOB_NAME"), glueContext, args.asJava)
      val frame = glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ","}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://pathToSource"], "recurse": true}"""), transformationContext="AmazonS3_node1650160158526").getDynamicFrame()
  
      val frameWithDetectedPII = EntityDetector.detect(frame, Seq("EMAIL", "CREDIT_CARD", "EMPLOYEE_ID"))
  
      glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://pathToOutput/", "partitionKeys": []}"""), transformationContext="someCtx", format="json").writeDynamicFrame(frameWithDetectedPII)
  
      Job.commit()
    }
  }
```

**Note**  
 If a custom sensitive data type is defined with the same name as an existing managed entity type, then the custom sensitive data type takes precedence and overrides the managed entity type's logic. 

## Detection parameters for using `detect()`
<a name="sensitive-data-detect-parameters-fine-grained-actions"></a>

 This method detects entities in a DynamicFrame. It returns a new DynamicFrame with the original values and an additional column, `outputColumnName`, that contains PII detection metadata. You can perform custom masking after this DynamicFrame is returned within the AWS Glue script, or you can use the `detect()` with fine-grained actions API instead. 

```
detect(frame: DynamicFrame, 
       entityTypesToDetect: Seq[String], 
       outputColumnName: String = "DetectedEntities",
       detectionSensitivity: String = "LOW"): DynamicFrame
```

 Parameters: 
+  **frame** – (type: `DynamicFrame`) The input DynamicFrame containing the data to be processed. 
+  **entityTypesToDetect** – (type: `Seq[String]`) List of entity types to detect. Can be either Managed Entity Types or Custom Entity Types. 
+  **outputColumnName** – (type: `String`, default: "DetectedEntities") The name of the column where detected entities will be stored. If not provided, the default column name is "DetectedEntities". 
+  **detectionSensitivity** – (type: `String`, options: "LOW" or "HIGH", default: "LOW") Specifies the sensitivity of the detection process. Valid options are "LOW" or "HIGH". If not provided, the default sensitivity is set to "LOW". 

 `outputColumnName` settings: 

 The name of the column where detected entities will be stored. If not provided, the default column name is "DetectedEntities". For each row in the output column, the supplementary column includes a map of the column name to the detected entity metadata with the following key-value pairs: 
+  **entityType** – The detected entity type. 
+  **start** – The starting position of the detected entity in the original data. 
+  **end** – The ending position of the detected entity in the original data. 
+  **actionUsed** – The action performed on the detected entity (for example, "DETECT", "REDACT", "PARTIAL\_REDACT", "SHA256\_HASH"). 

 Example: 

```
{
   "DetectedEntities":{
      "SSN Col":[
         {
            "entityType":"USA_SSN",
            "actionUsed":"DETECT",
            "start":4,
            "end":15
         }
      ],
      "Random Data col":[
         {
            "entityType":"BANK_ACCOUNT",
            "actionUsed":"PARTIAL_REDACT",
            "start":4,
            "end":13
         },
         {
            "entityType":"IP_ADDRESS",
            "actionUsed":"REDACT",
            "start":4,
            "end":13
         }
      ]
   }
}
```
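Because `detect()` leaves the original values in place, the `start`/`end` offsets in this metadata can drive custom masking later in the script. The sketch below shows the offset arithmetic on a plain string; the sample value is hypothetical, and in a real job the offsets would be read from the `DetectedEntities` column:

```scala
object OffsetMasking {
  // Replace the span [start, end) of a detected entity with a mask character.
  def maskSpan(value: String, start: Int, end: Int, maskChar: Char = '*'): String =
    value.substring(0, start) + maskChar.toString * (end - start) + value.substring(end)

  def main(args: Array[String]): Unit = {
    // e.g. an SSN detected at offsets 4..15, as in the metadata example above
    println(maskSpan("ID: 123-45-6789", 4, 15)) // "ID: ***********"
  }
}
```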

 **Detection parameters for `detect()` with fine-grained actions** 

 This method detects entities in a DynamicFrame using the specified parameters. It returns a new DynamicFrame with the original values replaced by masked sensitive data and an additional column, `outputColumnName`, that contains PII detection metadata. 

```
detect(frame: DynamicFrame, 
       detectionParameters: JsonOptions,
       outputColumnName: String = "DetectedEntities",
       detectionSensitivity: String = "LOW"): DynamicFrame
```

 Parameters: 
+  **frame** – (type: `DynamicFrame`): The input DynamicFrame containing the data to be processed. 
+  **detectionParameters** – (type: `JsonOptions`): JSON options specifying parameters for the detection process. 
+  **outputColumnName** – (type: `String`, default: "DetectedEntities"): The name of the column where detected entities will be stored. If not provided, the default column name is "DetectedEntities". 
+  **detectionSensitivity** – (type: `String`, options: "LOW" or "HIGH", default: "LOW"): Specifies the sensitivity of the detection process. Valid options are "LOW" or "HIGH". If not provided, the default sensitivity is set to "LOW". 

<a name="detection-parameters-settings"></a> `detectionParameters` settings 

 If no settings are included, default values will be used. 
+  **action** – (type: `String`, options: "DETECT", "REDACT", "PARTIAL\_REDACT", "SHA256\_HASH") Specifies the action to be performed on the entity. Required. Note that actions that perform masking (all but "DETECT") can only perform one action per column. This is a preventative measure for masking coalesced entities. 
+  **sourceColumns** – (type: `List[String]`, default: `["*"]`) List of source column names to perform detection on for the entity. Defaults to `["*"]` (all columns) if not present. Raises `IllegalArgumentException` if an invalid column name is used. 
+  **sourceColumnsToExclude** – (type: `List[String]`) List of source column names to exclude from detection for the entity. Use either `sourceColumns` or `sourceColumnsToExclude`, not both. Raises `IllegalArgumentException` if an invalid column name is used. 
+  **actionOptions** – Additional options based on the specified action: 
  +  For "DETECT" and "SHA256\_HASH", no options are allowed. 
  +  For "REDACT": 
    + **redactText** – (type: `String`, default: "\*\*\*\*\*") Text to replace the detected entity.
  +  For "PARTIAL\_REDACT": 
    +  **redactChar** – (type: `String`, default: "\*") Character to replace each detected character in the entity. 
    +  **matchPattern** – (type: `String`) Regex pattern for partial redaction. Cannot be combined with `numLeftCharsToExclude` or `numRightCharsToExclude`. 
    +  **numLeftCharsToExclude** – (type: `String, integer`) Number of left characters to exclude. Cannot be combined with `matchPattern`, but can be used with `numRightCharsToExclude`. 
    +  **numRightCharsToExclude** – (type: `String, integer`) Number of right characters to exclude. Cannot be combined with `matchPattern`, but can be used with `numLeftCharsToExclude`. 
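As a rough illustration of how the partial-redaction options interact, the sketch below masks everything except a number of leading and trailing characters. The helper is hypothetical — it only models the exclusion arithmetic, not the actual service-side behavior:

```scala
object PartialRedactSketch {
  // Mask every character except numLeft leading and numRight trailing ones.
  def partialRedact(value: String,
                    redactChar: Char = '*',
                    numLeft: Int = 0,
                    numRight: Int = 0): String = {
    val keepLeft  = value.take(numLeft)
    val keepRight = if (numRight > 0) value.takeRight(numRight) else ""
    val maskedLen = math.max(0, value.length - numLeft - numRight)
    keepLeft + redactChar.toString * maskedLen + keepRight
  }

  def main(args: Array[String]): Unit = {
    // Keep the last four digits of a card-like value
    println(partialRedact("4111111111111111", numRight = 4)) // "************1111"
  }
}
```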

 `outputColumnName` settings 

 [See outputColumnName settings](#sensitive-data-detect-parameters-fine-grained-actions) 

## Detection Parameters for `classifyColumns()`
<a name="detection-parameters-classifycolumns"></a>

 This method detects entities in a DynamicFrame. It returns a map where keys are column names and values are lists of detected entity types. You can perform custom masking after this map is returned within the AWS Glue script. 

```
classifyColumns(frame: DynamicFrame, 
                entityTypesToDetect: Seq[String], 
                sampleFraction: Double = 0.1, 
                thresholdFraction: Double = 0.1,
                detectionSensitivity: String = "LOW")
```

 Parameters: 
+  **frame** – (type: `DynamicFrame`) The input DynamicFrame containing the data to be processed. 
+  **entityTypesToDetect** – (type: `Seq[String]`) List of entity types to detect. Can be either Managed Entity Types or Custom Entity Types. 
+  **sampleFraction** – (type: `Double`, default: 0.1) The fraction of the data to sample when scanning for PII entities. 
+  **thresholdFraction** – (type: `Double`, default: 0.1) The fraction of the sampled data that must contain an entity in order for a column to be identified as PII data. 
+  **detectionSensitivity** – (type: `String`, options: "LOW" or "HIGH", default: "LOW") Specifies the sensitivity of the detection process. Valid options are "LOW" or "HIGH". If not provided, the default sensitivity is set to "LOW". 
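The threshold logic can be pictured as a simple fraction test over sampled rows: a column is flagged when the share of sampled values containing an entity meets `thresholdFraction`. A hypothetical sketch of that decision (not the service implementation):

```scala
object ThresholdSketch {
  // Flag a column as PII when the hit fraction meets the threshold.
  def isPiiColumn(hits: Int, sampledRows: Int, thresholdFraction: Double = 0.1): Boolean =
    sampledRows > 0 && hits.toDouble / sampledRows >= thresholdFraction

  def main(args: Array[String]): Unit = {
    println(isPiiColumn(hits = 15, sampledRows = 100)) // true  (0.15 >= 0.1)
    println(isPiiColumn(hits = 5,  sampledRows = 100)) // false (0.05 <  0.1)
  }
}
```

Lowering `thresholdFraction` catches sparsely populated PII columns at the cost of more false positives; raising `sampleFraction` scans more rows at the cost of runtime.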

# Managed Sensitive Data Types
<a name="sensitive-data-managed-data-types"></a>

 **Global entities** 


| Data Type | Category | Description | 
| --- | --- | --- | 
| PERSON\_NAME | Universal |  The name of the person.  | 
| EMAIL | Personal |  The email address.  | 
| IP\_ADDRESS | Computer |  The IP address.  | 
| MAC\_ADDRESS | Personal |  The MAC address.  | 



 **US data types** 


| Data Type | Description | 
| --- | --- | 
| BANK\_ACCOUNT |  The bank account number. Not specific to a country or region; however, only US and Canadian account formats are detected.  | 
| CREDIT\_CARD |  The credit card number.  | 
| PHONE\_NUMBER |  The phone number. Not specific to a country or region; however, only US and Canadian phone numbers are detected at this time.  | 
| USA\_ATIN |  The US Adoption Taxpayer Identification Number issued by the Internal Revenue Service.  | 
| USA\_CPT\_CODE |  The CPT Code (US specific).  | 
| USA\_DEA\_NUMBER |  The DEA number (US specific).  | 
| USA\_DRIVING\_LICENSE |  The driver license number (US specific).  | 
| USA\_HCPCS\_CODE |  The HCPCS code (US specific).  | 
| USA\_HEALTH\_INSURANCE\_CLAIM\_NUMBER |  Health Insurance Claim Number (US specific).  | 
| USA\_ITIN |  The ITIN (for US persons or entities).  | 
| USA\_MEDICARE\_BENEFICIARY\_IDENTIFIER |  Medicare Beneficiary Identifier (US specific).  | 
| USA\_NATIONAL\_DRUG\_CODE |  The NDC code (US specific).  | 
| USA\_NATIONAL\_PROVIDER\_IDENTIFIER |  The National Provider Identifier number (US specific).  | 
| USA\_PASSPORT\_NUMBER |  The passport number (for US persons).  | 
| USA\_PTIN |  The US Preparer Tax Identification Number issued by the Internal Revenue Service.  | 
| USA\_SSN |  The social security number (for US persons).  | 



 **Argentina data types** 


| Data Type | Description | 
| --- | --- | 
| ARGENTINA\_TAX\_IDENTIFICATION\_NUMBER |  Argentina Tax Identification Number. Also known as CUIT or CUIL.  | 

 **Australian data types** 


| Data Type | Description | 
| --- | --- | 
| AUSTRALIA\_BUSINESS\_NUMBER |  Australia Business Number (ABN). A unique identifier issued by the Australian Business Register (ABR) to identify businesses to the government and community.  | 
| AUSTRALIA\_COMPANY\_NUMBER |  Australia Company Number (ACN). Unique identifier issued by the Australian Securities and Investments Commission.  | 
| AUSTRALIA\_DRIVING\_LICENSE |  A driver's license number for Australia.  | 
| AUSTRALIA\_MEDICARE\_NUMBER |  Australian Medicare Number. Personal identifier issued by the Australian Health Insurance Commission.  | 
| AUSTRALIA\_PASSPORT\_NUMBER |  Australian passport number.  | 
| AUSTRALIA\_TAX\_FILE\_NUMBER |  Australia Tax File Number (TFN). Issued by the Australian Taxation Office (ATO) to taxpayers (individual, company, etc.) for tax dealings.  | 

 **Austria data types** 


| Data Type | Description | 
| --- | --- | 
| AUSTRIA\_DRIVING\_LICENSE |  The driver license number (Austria specific).  | 
| AUSTRIA\_PASSPORT\_NUMBER |  The passport number (Austria specific).  | 
| AUSTRIA\_SSN |  The social security number (for Austria persons).  | 
| AUSTRIA\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Austria specific).  | 
| AUSTRIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Austria specific).  | 

 **Balkans data types** 


| Data Type | Description | 
| --- | --- | 
| BOSNIA\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number (JMBG) for Bosnia-Herzegovina citizens.  | 
| KOSOVO\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number (JMBG) for Kosovo.  | 
| MACEDONIA\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number for Macedonia.  | 
| MONTENEGRO\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number (JMBG) for Montenegro.  | 
| SERBIA\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number (JMBG) for Serbia.  | 
| SERBIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Serbia specific).  | 
| VOJVODINA\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number (JMBG) for Vojvodina.  | 

 **Belgium data types** 


| Data Type | Description | 
| --- | --- | 
| BELGIUM\_DRIVING\_LICENSE |  The driver license number (Belgium specific).  | 
| BELGIUM\_NATIONAL\_IDENTIFICATION\_NUMBER |  The Belgian National Number (BNN).  | 
| BELGIUM\_PASSPORT\_NUMBER |  The passport number (Belgium specific).  | 
| BELGIUM\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Belgium specific).  | 
| BELGIUM\_VALUE\_ADDED\_TAX |  Value-Added Tax (Belgium specific).  | 

 **Brazil data types** 


| Data Type | Description | 
| --- | --- | 
| BRAZIL\_BANK\_ACCOUNT |  The bank account number (Brazil specific).  | 
| BRAZIL\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier (Brazil specific).  | 
| BRAZIL\_NATIONAL\_REGISTRY\_OF\_LEGAL\_ENTITIES\_NUMBER |  The identification number issued to companies (Brazil specific), also known as the CNPJ.  | 
| BRAZIL\_NATURAL\_PERSON\_REGISTRY\_NUMBER |  Natural Person Registry Number, also known as CPF.  | 

 **Bulgaria data types** 


| Data Type | Description | 
| --- | --- | 
| BULGARIA\_DRIVING\_LICENSE |  The driver license number (Bulgaria specific).  | 
| BULGARIA\_UNIFORM\_CIVIL\_NUMBER |  Unified Civil Number (EGN) that serves as a national identification number.  | 
| BULGARIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Bulgaria specific).  | 

 **Canada data types** 


| Data Type | Description | 
| --- | --- | 
| CANADA\_DRIVING\_LICENSE |  The driver license number (Canada specific).  | 
| CANADA\_GOVERNMENT\_IDENTIFICATION\_CARD\_NUMBER |  The national identifier (Canada specific).  | 
| CANADA\_PASSPORT\_NUMBER |  The passport number (Canada specific).  | 
| CANADA\_PERMANENT\_RESIDENCE\_NUMBER |  Permanent residence number (PR Card number).  | 
| CANADA\_PERSONAL\_HEALTH\_NUMBER |  The unique identifier for healthcare (PHN number).  | 
| CANADA\_SOCIAL\_INSURANCE\_NUMBER |  The social insurance number (SIN) in Canada.  | 

 **Chile data types** 


| Data Type | Description | 
| --- | --- | 
| CHILE\_DRIVING\_LICENSE |  The driver license number (Chile specific).  | 
| CHILE\_NATIONAL\_IDENTIFICATION\_NUMBER |  The Chile national identifier, also known as RUT or RUN.  | 

 **China, Hong Kong, Macau, and Taiwan data types** 


| Data Type | Description | 
| --- | --- | 
| CHINA\_IDENTIFICATION |  The China identifier.  | 
| CHINA\_LICENSE\_PLATE\_NUMBER |  The license plate number (China specific).  | 
| CHINA\_MAINLAND\_TRAVEL\_PERMIT\_ID\_HONG\_KONG\_MACAU |  The Mainland Travel Permit for Hong Kong and Macao Residents.  | 
| CHINA\_MAINLAND\_TRAVEL\_PERMIT\_ID\_TAIWAN |  The Mainland Travel Permit for Taiwan Residents issued by the Government of the People's Republic of China (PRC).  | 
| CHINA\_PASSPORT\_NUMBER |  The passport number (China specific).  | 
| CHINA\_PHONE\_NUMBER |  The phone number (China specific).  | 
| HONG\_KONG\_IDENTITY\_CARD |  The official identity document issued by the Immigration Department of Hong Kong.  | 
| MACAU\_RESIDENT\_IDENTITY\_CARD |  The Macau Resident Identity Card (BIR), an official identity card issued by the Identification Services Bureau of Macau.  | 
| TAIWAN\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier (Taiwan specific).  | 
| TAIWAN\_PASSPORT\_NUMBER |  The passport number (Taiwan specific).  | 

 **Colombia data types** 


| Data Type | Description | 
| --- | --- | 
| COLOMBIA\_PERSONAL\_IDENTIFICATION\_NUMBER |  Unique identifier assigned to Colombians at birth.  | 
| COLOMBIA\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Colombia specific).  | 

 **Croatia data types** 


| Data Type | Description | 
| --- | --- | 
| CROATIA\_DRIVING\_LICENSE |  The driver license number (Croatia specific).  | 
| CROATIA\_IDENTITY\_NUMBER |  The national identifier (Croatia specific).  | 
| CROATIA\_PASSPORT\_NUMBER |  The passport number (Croatia specific).  | 
| CROATIA\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (OIB).  | 

 **Cyprus data types** 


| Data Type | Description | 
| --- | --- | 
| CYPRUS\_DRIVING\_LICENSE |  The driver license number (Cyprus specific).  | 
| CYPRUS\_NATIONAL\_IDENTIFICATION\_NUMBER |  The Cypriot identity card.  | 
| CYPRUS\_PASSPORT\_NUMBER |  The passport number (Cyprus specific).  | 
| CYPRUS\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Cyprus specific).  | 
| CYPRUS\_VALUE\_ADDED\_TAX |  Value-Added Tax (Cyprus specific).  | 

 **Czechia data types** 


| Data Type | Description | 
| --- | --- | 
| CZECHIA\_DRIVING\_LICENSE |  The driver license number (Czechia specific).  | 
| CZECHIA\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (Czechia specific).  | 
| CZECHIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Czechia specific).  | 

 **Denmark data types** 


| Data Type | Description | 
| --- | --- | 
| DENMARK\_DRIVING\_LICENSE |  The driver license number (Denmark specific).  | 
| DENMARK\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (Denmark specific).  | 
| DENMARK\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Denmark specific).  | 
| DENMARK\_VALUE\_ADDED\_TAX |  Value-Added Tax (Denmark specific).  | 

 **Estonia data types** 


| Data Type | Description | 
| --- | --- | 
| ESTONIA\_DRIVING\_LICENSE |  The driver license number (Estonia specific).  | 
| ESTONIA\_PASSPORT\_NUMBER |  The passport number (Estonia specific).  | 
| ESTONIA\_PERSONAL\_IDENTIFICATION\_CODE |  The personal identifier number (Estonia specific).  | 
| ESTONIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Estonia specific).  | 

 **Finland data types** 


| Data Type | Description | 
| --- | --- | 
| FINLAND\_DRIVING\_LICENSE |  The driver license number (Finland specific).  | 
| FINLAND\_HEALTH\_INSURANCE\_NUMBER |  The health insurance number (Finland specific).  | 
| FINLAND\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier number (Finland specific).  | 
| FINLAND\_PASSPORT\_NUMBER |  The passport number (Finland specific).  | 
| FINLAND\_VALUE\_ADDED\_TAX |  Value-Added Tax (Finland specific).  | 

 **France data types** 


| Data Type | Description | 
| --- | --- | 
| FRANCE\_BANK\_ACCOUNT |  The bank account number (France specific).  | 
| FRANCE\_DRIVING\_LICENSE |  The driver license number (France specific).  | 
| FRANCE\_HEALTH\_INSURANCE\_NUMBER |  France health insurance number.  | 
| FRANCE\_INSEE\_CODE |  France social security, SSN, or NIR number.  | 
| FRANCE\_NATIONAL\_IDENTIFICATION\_NUMBER |  France national identifier number (CNI).  | 
| FRANCE\_PASSPORT\_NUMBER |  The passport number (France specific).  | 
| FRANCE\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (France specific).  | 
| FRANCE\_VALUE\_ADDED\_TAX |  Value-Added Tax (France specific).  | 

 **Germany data types** 


| Data Type | Description | 
| --- | --- | 
| GERMANY\_BANK\_ACCOUNT |  The bank account number (Germany specific).  | 
| GERMANY\_DRIVING\_LICENSE |  The driver license number (Germany specific).  | 
| GERMANY\_PASSPORT\_NUMBER |  The passport number (Germany specific).  | 
| GERMANY\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identification number (Germany specific).  | 
| GERMANY\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Germany specific).  | 
| GERMANY\_VALUE\_ADDED\_TAX |  Value-Added Tax (Germany specific).  | 

 **Greece data types** 


| Data Type | Description | 
| --- | --- | 
| GREECE\_DRIVING\_LICENSE |  The driver license number (Greece specific).  | 
| GREECE\_PASSPORT\_NUMBER |  The passport number (Greece specific).  | 
| GREECE\_SSN |  The social security number (for Greece persons).  | 
| GREECE\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Greece specific).  | 
| GREECE\_VALUE\_ADDED\_TAX |  Value-Added Tax (Greece specific).  | 

 **Hungary data types** 


| Data Type | Description | 
| --- | --- | 
| HUNGARY\_DRIVING\_LICENSE |  The driver license number (Hungary specific).  | 
| HUNGARY\_PASSPORT\_NUMBER |  The passport number (Hungary specific).  | 
| HUNGARY\_SSN |  The social security number (for Hungary persons).  | 
| HUNGARY\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Hungary specific).  | 
| HUNGARY\_VALUE\_ADDED\_TAX |  Value-Added Tax (Hungary specific).  | 

 **Iceland data types** 


| Data Type | Description | 
| --- | --- | 
| ICELAND\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier (Iceland specific).  | 
| ICELAND\_PASSPORT\_NUMBER |  The passport number (Iceland specific).  | 
| ICELAND\_VALUE\_ADDED\_TAX |  Value-Added Tax (Iceland specific).  | 

 **India data types** 


| Data Type | Description | 
| --- | --- | 
| INDIA\_AADHAAR\_NUMBER |  Aadhaar identification number issued by the Unique Identification Authority of India.  | 
| INDIA\_PERMANENT\_ACCOUNT\_NUMBER |  India Permanent Account Number (PAN).  | 

 **Indonesia data types** 


| Data Type | Description | 
| --- | --- | 
| INDONESIA\_IDENTITY\_CARD\_NUMBER |  The national identifier (Indonesia specific).  | 

 **Ireland data types** 


| Data Type | Description | 
| --- | --- | 
| IRELAND\_DRIVING\_LICENSE |  The driver license number (Ireland specific).  | 
| IRELAND\_PASSPORT\_NUMBER |  The passport number (Ireland specific).  | 
| IRELAND\_PERSONAL\_PUBLIC\_SERVICE\_NUMBER |  Ireland personal public service number (PPS).  | 
| IRELAND\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Ireland specific).  | 
| IRELAND\_VALUE\_ADDED\_TAX |  Value-Added Tax (Ireland specific).  | 

 **Israel data types** 


| Data Type | Description | 
| --- | --- | 
| ISRAEL\_IDENTIFICATION\_NUMBER |  The national identifier (Israel specific).  | 

 **Italy data types** 


| Data Type | Description | 
| --- | --- | 
| ITALY\_BANK\_ACCOUNT |  The bank account number (Italy specific).  | 
| ITALY\_DRIVING\_LICENSE |  The driver license number (Italy specific).  | 
| ITALY\_FISCAL\_CODE |  The identifier number, also known as the Italian Codice Fiscale.  | 
| ITALY\_PASSPORT\_NUMBER |  The passport number (Italy specific).  | 
| ITALY\_VALUE\_ADDED\_TAX |  Value-Added Tax (Italy specific).  | 

 **Japan data types** 


| Data Type | Description | 
| --- | --- | 
| JAPAN\_BANK\_ACCOUNT |  Japan bank account.  | 
| JAPAN\_DRIVING\_LICENSE |  A driver's license number for Japan.  | 
| JAPAN\_MY\_NUMBER |  The unique identifier for Japan citizens or corporations used for tax administration, social security administration, and disaster response.  | 
| JAPAN\_PASSPORT\_NUMBER |  Japan passport number.  | 

 **Korea data types** 


| Data Type | Description | 
| --- | --- | 
| KOREA\_PASSPORT\_NUMBER |  The passport number (Korea specific).  | 
| KOREA\_RESIDENCE\_REGISTRATION\_NUMBER\_FOR\_CITIZENS |  Korea residence registration number for citizens.  | 
| KOREA\_RESIDENCE\_REGISTRATION\_NUMBER\_FOR\_FOREIGNERS |  Korea residence registration number for foreigners.  | 

 **Latvia data types** 


| Data Type | Description | 
| --- | --- | 
| LATVIA\_DRIVING\_LICENSE |  The driver license number (Latvia specific).  | 
| LATVIA\_PASSPORT\_NUMBER |  The passport number (Latvia specific).  | 
| LATVIA\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (Latvia specific).  | 
| LATVIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Latvia specific).  | 

 **Liechtenstein data types** 


| Data Type | Description | 
| --- | --- | 
| LIECHTENSTEIN\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier (Liechtenstein specific).  | 
| LIECHTENSTEIN\_PASSPORT\_NUMBER |  The passport number (Liechtenstein specific).  | 
| LIECHTENSTEIN\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Liechtenstein specific).  | 

 **Lithuania data types** 


| Data Type | Description | 
| --- | --- | 
| LITHUANIA\_DRIVING\_LICENSE |  The driver license number (Lithuania specific).  | 
| LITHUANIA\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (Lithuania specific).  | 
| LITHUANIA\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Lithuania specific).  | 
| LITHUANIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Lithuania specific).  | 

 **Luxembourg data types** 


| Data Type | Description | 
| --- | --- | 
| LUXEMBOURG\_DRIVING\_LICENSE |  The driver license number (Luxembourg specific).  | 
| LUXEMBOURG\_NATIONAL\_INDIVIDUAL\_NUMBER |  The national identifier (Luxembourg specific).  | 
| LUXEMBOURG\_PASSPORT\_NUMBER |  The passport number (Luxembourg specific).  | 
| LUXEMBOURG\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Luxembourg specific).  | 
| LUXEMBOURG\_VALUE\_ADDED\_TAX |  Value-Added Tax (Luxembourg specific).  | 

 **Malaysia data types** 


| Data Type | Description | 
| --- | --- | 
| MALAYSIA\_MYKAD\_NUMBER |  The national identifier (Malaysia specific).  | 
| MALAYSIA\_PASSPORT\_NUMBER |  The passport number (Malaysia specific).  | 

 **Malta data types** 


| Data Type | Description | 
| --- | --- | 
| MALTA\_DRIVING\_LICENSE |  The driver license number (Malta specific).  | 
| MALTA\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier (Malta specific).  | 
| MALTA\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Malta specific).  | 
| MALTA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Malta specific).  | 

 **Mexico data types** 


| Data Type | Description | 
| --- | --- | 
| MEXICO\_CLABE\_NUMBER |  Mexico CLABE (Clave Bancaria Estandarizada) bank number.  | 
| MEXICO\_DRIVING\_LICENSE |  The driver license number (Mexico specific).  | 
| MEXICO\_PASSPORT\_NUMBER |  The passport number (Mexico specific).  | 
| MEXICO\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Mexico specific).  | 
| MEXICO\_UNIQUE\_POPULATION\_REGISTRY\_CODE |  The Clave Única de Registro de Población (CURP) unique identity code for Mexico.  | 

 **Netherlands data types** 


| Data Type | Description | 
| --- | --- | 
| NETHERLANDS\_BANK\_ACCOUNT |  The bank account number (Netherlands specific).  | 
| NETHERLANDS\_CITIZEN\_SERVICE\_NUMBER |  Netherlands citizen number (BSN, burgerservicenummer).  | 
| NETHERLANDS\_DRIVING\_LICENSE |  The driver license number (Netherlands specific).  | 
| NETHERLANDS\_PASSPORT\_NUMBER |  The passport number (Netherlands specific).  | 
| NETHERLANDS\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Netherlands specific).  | 
| NETHERLANDS\_VALUE\_ADDED\_TAX |  Value-Added Tax (Netherlands specific).  | 

 **New Zealand data types** 


| Data Type | Description | 
| --- | --- | 
| NEW\_ZEALAND\_DRIVING\_LICENSE |  The driver license number (New Zealand specific).  | 
| NEW\_ZEALAND\_NATIONAL\_HEALTH\_INDEX\_NUMBER |  New Zealand national health index number.  | 
| NEW\_ZEALAND\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number, also known as inland revenue number (New Zealand specific).  | 

 **Norway data types** 


| Data Type | Description | 
| --- | --- | 
| NORWAY\_BIRTH\_NUMBER |  Norwegian national identity number.  | 
| NORWAY\_DRIVING\_LICENSE |  The driver license number (Norway specific).  | 
| NORWAY\_HEALTH\_INSURANCE\_NUMBER |  Norway health insurance number.  | 
| NORWAY\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier number (Norway specific).  | 
| NORWAY\_VALUE\_ADDED\_TAX |  Value-Added Tax (Norway specific).  | 

 **Philippines data types** 


| Data Type | Description | 
| --- | --- | 
| PHILIPPINES\_DRIVING\_LICENSE |  The driver license number (Philippines specific).  | 
| PHILIPPINES\_PASSPORT\_NUMBER |  The passport number (Philippines specific).  | 

 **Poland data types** 


| Data Type | Description | 
| --- | --- | 
| POLAND\_DRIVING\_LICENSE |  The driver license number (Poland specific).  | 
| POLAND\_IDENTIFICATION\_NUMBER |  The Poland identifier.  | 
| POLAND\_PASSPORT\_NUMBER |  The passport number (Poland specific).  | 
| POLAND\_REGON\_NUMBER |  The REGON identifier number, also known as the Statistical Identification Number.  | 
| POLAND\_SSN |  The social security number (for Poland persons).  | 
| POLAND\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Poland specific).  | 
| POLAND\_VALUE\_ADDED\_TAX |  Value-Added Tax (Poland specific).  | 

 **Portugal data types** 


| Data Type | Description | 
| --- | --- | 
| PORTUGAL\_DRIVING\_LICENSE |  The driver license number (Portugal specific).  | 
| PORTUGAL\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier number (Portugal specific).  | 
| PORTUGAL\_PASSPORT\_NUMBER |  The passport number (Portugal specific).  | 
| PORTUGAL\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Portugal specific).  | 
| PORTUGAL\_VALUE\_ADDED\_TAX |  Value-Added Tax (Portugal specific).  | 

 **Romania data types** 


| Data Type | Description | 
| --- | --- | 
| ROMANIA\_DRIVING\_LICENSE |  The driver license number (Romania specific).  | 
| ROMANIA\_NUMERICAL\_PERSONAL\_CODE |  The personal identifier number (Romania specific).  | 
| ROMANIA\_PASSPORT\_NUMBER |  The passport number (Romania specific).  | 
| ROMANIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Romania specific).  | 

 **Singapore data types** 


| Data Type | Description | 
| --- | --- | 
| SINGAPORE\_DRIVING\_LICENSE |  The driver license number (Singapore specific).  | 
| SINGAPORE\_NATIONAL\_REGISTRY\_IDENTIFICATION\_NUMBER |  The national registration identity card for Singapore.  | 
| SINGAPORE\_PASSPORT\_NUMBER |  The passport number (Singapore specific).  | 
| SINGAPORE\_UNIQUE\_ENTITY\_NUMBER |  The Unique Entity Number for Singapore.  | 

 **Slovakia data types** 


| Data Type | Description | 
| --- | --- | 
| SLOVAKIA\_DRIVING\_LICENSE |  The driver license number (Slovakia specific).  | 
| SLOVAKIA\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier number (Slovakia specific).  | 
| SLOVAKIA\_PASSPORT\_NUMBER |  The passport number (Slovakia specific).  | 
| SLOVAKIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Slovakia specific).  | 

 **Slovenia data types** 


| Data Type | Description | 
| --- | --- | 
| SLOVENIA\_DRIVING\_LICENSE |  The driver license number (Slovenia specific).  | 
| SLOVENIA\_PASSPORT\_NUMBER |  The passport number (Slovenia specific).  | 
| SLOVENIA\_TAX\_IDENTIFICATION\_NUMBER |  Tax identification number (Slovenia specific).  | 
| SLOVENIA\_UNIQUE\_MASTER\_CITIZEN\_NUMBER |  Unique master citizen number (JMBG) for Slovenia citizens.  | 
| SLOVENIA\_VALUE\_ADDED\_TAX |  Value-Added Tax (Slovenia specific).  | 

 **South Africa data types** 


| Data Type | Description | 
| --- | --- | 
| SOUTH\_AFRICA\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (South Africa specific).  | 

 **Spain data types** 


| Data Type | Description | 
| --- | --- | 
| SPAIN\_BANK\_ACCOUNT |  The bank account number (Spain specific).  | 
| SPAIN\_DNI |  The national identity card (Documento Nacional de Identidad) of Spain.  | 
| SPAIN\_DRIVING\_LICENSE |  The driver license number (Spain specific).  | 
| SPAIN\_NIE |  The foreigner identity number (Spain specific), also known as the NIE.  | 
| SPAIN\_NIF |  Tax identification number (Spain specific), also known as the NIF.  | 
| SPAIN\_PASSPORT\_NUMBER |  The passport number (Spain specific).  | 
| SPAIN\_SSN |  The social security number (for Spain persons).  | 
| SPAIN\_VALUE\_ADDED\_TAX |  Value-Added Tax (Spain specific).  | 

 **Sri Lanka data types** 


| Data Type | Description | 
| --- | --- | 
| SRI\_LANKA\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier (Sri Lanka specific).  | 

 **Sweden data types** 


| Data Type | Description | 
| --- | --- | 
| SWEDEN\_DRIVING\_LICENSE |  The driver license number (Sweden specific).  | 
| SWEDEN\_PASSPORT\_NUMBER |  The passport number (Sweden specific).  | 
| SWEDEN\_PERSONAL\_IDENTIFICATION\_NUMBER |  The national identifier number (Sweden specific).  | 
| SWEDEN\_TAX\_IDENTIFICATION\_NUMBER |  Sweden tax identification number (personnummer).  | 
| SWEDEN\_VALUE\_ADDED\_TAX |  Value-Added Tax (Sweden specific).  | 

 **Switzerland data types** 


| Data Type | Description | 
| --- | --- | 
| SWITZERLAND\_AHV |  The social security number for Swiss persons (AHV).  | 
| SWITZERLAND\_HEALTH\_INSURANCE\_NUMBER |  Swiss health insurance number.  | 
| SWITZERLAND\_PASSPORT\_NUMBER |  The passport number (Switzerland specific).  | 
| SWITZERLAND\_VALUE\_ADDED\_TAX |  Value-Added Tax (Switzerland specific).  | 

 **Thailand data types** 


| Data Type | Description | 
| --- | --- | 
| THAILAND\_PASSPORT\_NUMBER |  The passport number (Thailand specific).  | 
| THAILAND\_PERSONAL\_IDENTIFICATION\_NUMBER |  The personal identifier number (Thailand specific).  | 

 **Turkey data types** 


| Data Type | Description | 
| --- | --- | 
| TURKEY\_NATIONAL\_IDENTIFICATION\_NUMBER |  The national identifier number (Turkey specific).  | 
| TURKEY\_PASSPORT\_NUMBER |  The passport number (Turkey specific).  | 
| TURKEY\_VALUE\_ADDED\_TAX |  Value-Added Tax (Turkey specific).  | 

 **Ukraine data types** 


| Data Type | Description | 
| --- | --- | 
| UKRAINE\$1INDIVIDUAL\$1IDENTIFICATION\$1NUMBER |  The unique identifier (Ukraine specific).  | 
| UKRAINE\$1PASSPORT\$1NUMBER\$1DOMESTIC |  The domestic passport number (Ukraine specific).  | 
| UKRAINE\$1PASSPORT\$1NUMBER\$1INTERNATIONAL |  The international passport number (Ukraine specific).  | 

 **United Arab Emirates (UAE) data types** 


| Data Type | Description | 
| --- | --- | 
| UNITED\$1ARAB\$1EMIRATES\$1PERSONAL\$1NUMBER |  The personal identifier number (UAE specific).  | 

 **UK data types** 


| Data Type | Description | 
| --- | --- | 
| UK\$1BANK\$1ACCOUNT |  United Kingdom (UK) bank account.  | 
| UK\$1BANK\$1SORT\$1CODE |   United Kingdom (UK) bank sort code. Sort codes are bank codes used to route money transfers between banks within their respective countries via their respective clearance organizations.   | 
| UK\$1DRIVING\$1LICENSE |  The driver's license number for the United Kingdom of Great Britain and Northern Ireland (UK specific)  | 
| UK\$1ELECTORAL\$1ROLL\$1NUMBER |  The Electoral Roll Number (ERN) is the identification number issued to an individual for UK election registration. The format of this number is specified by the UK Government Standards of the UK Cabinet Office.  | 
| UK\$1NATIONAL\$1HEALTH\$1SERVICE\$1NUMBER |  The National Health Service (NHS) number is the unique number allocated to a registered user of public health services in the United Kingdom.  | 
| UK\$1NATIONAL\$1INSURANCE\$1NUMBER |  The National Insurance number (NINO) is a number used in the United Kingdom (UK) to identify an individual for the national insurance program or social security system. The number is sometimes referred to as NI No or NINO.  | 
| UK\$1PASSPORT\$1NUMBER |  United Kingdom (UK) passport number.  | 
| UK\$1UNIQUE\$1TAXPAYER\$1REFERENCE\$1NUMBER |  The United Kingdom (UK) Unique Taxpayer Reference (UTR) number. An identifier used by the UK government to manage the taxation system.   | 
| UK\$1VALUE\$1ADDED\$1TAX |  VAT is a consumption tax that is borne by the end consumer. VAT is paid for each transaction in the manufacturing and distribution process. For the United Kingdom, the VAT number is issued by the VAT office for the region in which the business is established.  | 
| UK\$1PHONE\$1NUMBER |  United Kingdom (UK) phone number.  | 

 **Venezuela data types** 


| Data Type | Description | 
| --- | --- | 
| VENEZUELA\$1DRIVING\$1LICENSE |  The driver license number (Venezuela specific).  | 
| VENEZUELA\$1NATIONAL\$1IDENTIFICATION\$1NUMBER |  The national identifier number (Venezuela specific).  | 
| VENEZUELA\$1VALUE\$1ADDED\$1TAX |  Value-Added Tax (Venezuela specific).  | 

# Using fine-grained sensitive data detection
<a name="sensitive-data-fine-grained-actions"></a>

**Note**  
 Fine-grained actions are only available in AWS Glue 3.0 and 4.0. This includes the AWS Glue Studio experience. The persistent audit log changes are also not available in AWS Glue 2.0.   
 All AWS Glue Studio 3.0 and 4.0 visual jobs have a generated script that automatically uses the fine-grained actions APIs. 

 The Detect Sensitive Data transform provides the ability to detect, mask, or remove entities that you define, or that are predefined by AWS Glue. Fine-grained actions additionally allow you to apply a specific action per entity. Additional benefits include: 
+  Improved performance, because actions are applied as soon as data is detected. 
+  The option to include or exclude specific columns. 
+  The ability to use partial masking. This allows you to mask detected sensitive data entities partially, rather than masking the entire string. Both simple parameters with offsets and regular expressions are supported. 
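
For illustration, offset-based partial masking behaves like this minimal Python sketch (illustrative only, not the AWS Glue implementation):

```python
# Sketch of PARTIAL_REDACT with offset options: keep a number of characters
# on each side and replace everything in between with the redact character.
def partial_redact(value, num_left_chars_to_exclude, num_right_chars_to_exclude, redact_char="*"):
    middle_len = len(value) - num_left_chars_to_exclude - num_right_chars_to_exclude
    if middle_len <= 0:
        return value  # value too short; nothing left to redact
    left = value[:num_left_chars_to_exclude]
    right = value[len(value) - num_right_chars_to_exclude:]
    return left + redact_char * middle_len + right

print(partial_redact("1-234-567-1718", 3, 4, "#"))  # 1-2#######1718
```

With `numLeftCharsToExclude` of 3 and `numRightCharsToExclude` of 4, this produces the `Phone Number` value shown in the sample output later in this section.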

 The following are code snippets of sensitive data detection APIs and fine-grained actions used in the sample jobs referenced in the next section. 

 **Detect API** – fine-grained actions use the new `detectionParameters` parameter: 

```
def detect(
    frame: DynamicFrame,
    detectionParameters: JsonOptions,
    outputColumnName: String = "DetectedEntities",
    detectionSensitivity: String = "LOW"
): DynamicFrame = {}
```

## Using Sensitive Data Detection APIs with fine-grained actions
<a name="sensitive-data-fine-grained-actions-glue-jobs"></a>

 Sensitive data detection APIs that use **detect** analyze the given data, determine whether the rows or columns are sensitive data entity types, and run the actions that you specify for each entity type. 

### Using the detect API with fine-grained actions
<a name="sensitive-data-fine-grained-actions-glue-jobs-detect"></a>

 Use the **detect** API and specify the `outputColumnName` and `detectionParameters`. 

```
    object GlueApp {
      def main(sysArgs: Array[String]) {
      
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)
        
        // @params: [JOB_NAME]
        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
        Job.init(args("JOB_NAME"), glueContext, args.asJava)
        
        // Script generated for node S3 bucket. Creates DataFrame from data stored in S3.
        val S3bucket_node1 = glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ",", "optimizePerformance": false}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://189657479688-ddevansh-pii-test-bucket/tiny_pii.csv"], "recurse": true}"""), transformationContext="S3bucket_node1").getDynamicFrame()
     
        // Script generated for node Detect Sensitive Data. Will run detect API for the DataFrame
        // detectionParameter contains information on which EntityType are being detected
        // and what actions are being applied to them when detected. 
        val DetectSensitiveData_node2 = EntityDetector.detect(
            frame = S3bucket_node1, 
            detectionParameters = JsonOptions(
             """
                {
                    "PHONE_NUMBER": [
                        {
                            "action": "PARTIAL_REDACT",
                            "actionOptions": {
                                "numLeftCharsToExclude": "3",
                                "numRightCharsToExclude": "4",
                                "redactChar": "#"
                            },
                            "sourceColumnsToExclude": [ "Passport No", "DL NO#" ]
                        }
                    ],
                    "USA_PASSPORT_NUMBER": [
                        {
                            "action": "SHA256_HASH",
                            "sourceColumns": [ "Passport No" ]
                        }
                    ],
                    "USA_DRIVING_LICENSE": [
                        {
                            "action": "REDACT",
                            "actionOptions": {
                                "redactText": "USA_DL"
                            },
                            "sourceColumns": [ "DL NO#" ]
                        }
                    ]
                    
                }
            """
            ),
            outputColumnName = "DetectedEntities"
        )
     
        // Script generated for node S3 bucket. Store Results of detect to S3 location
        val S3bucket_node3 = glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/test-output/", "partitionKeys": []}"""), transformationContext="S3bucket_node3", format="json").writeDynamicFrame(DetectSensitiveData_node2)
     
        Job.commit()
      }
    }
```

 The preceding script creates a DynamicFrame from a location in Amazon S3 and then runs the `detect` API. The `detect` API requires the `detectionParameters` field, a map from each entity name to the list of action settings to use for that entity. Because this field is represented by AWS Glue’s `JsonOptions` object, it also allows the functionality of the API to be extended. 

 For each action specified per entity, enter a list of all column names to which to apply the entity/action combination. This allows you to customize the entities to detect for every column in your dataset and to skip entities that you know are not in a specific column. It also makes your jobs more performant by avoiding unnecessary detection calls for those entities, and lets you perform actions unique to each column and entity combination. 
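
If you assemble `detectionParameters` programmatically (for example, in a deployment script), building it as a dict and serializing it with `json.dumps` keeps the JSON valid. This is a sketch of that approach, not AWS Glue API code:

```python
import json

# Build the detectionParameters JSON string from a dict instead of
# hand-writing it inline; the result is what you'd pass to JsonOptions(...).
detection_parameters = {
    "PHONE_NUMBER": [
        {
            "action": "PARTIAL_REDACT",
            "actionOptions": {
                "numLeftCharsToExclude": "3",
                "numRightCharsToExclude": "4",
                "redactChar": "#",
            },
            "sourceColumnsToExclude": ["Passport No", "DL NO#"],
        }
    ],
    "USA_PASSPORT_NUMBER": [
        {"action": "SHA256_HASH", "sourceColumns": ["Passport No"]}
    ],
}

json_string = json.dumps(detection_parameters)
```

Serializing from a dict avoids the quoting mistakes that are easy to make in a long inline JSON literal.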

 Taking a closer look at the `detectionParameters`, there are three entity types in the sample job: `PHONE_NUMBER`, `USA_PASSPORT_NUMBER`, and `USA_DRIVING_LICENSE`. For each of these entity types, AWS Glue runs a different action: `PARTIAL_REDACT`, `SHA256_HASH`, or `REDACT` (a `DETECT` action is also available). Each entity type also has `sourceColumns` to apply the action to and/or `sourceColumnsToExclude` to skip. 

**Note**  
 Only one edit-in-place action (`PARTIAL_REDACT`, `SHA256_HASH`, or `REDACT`) can be used per column but the `DETECT` action can be used with any of these actions. 
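
The rule in this note can be checked with a small sketch (a hypothetical helper, not part of the AWS Glue API; it only inspects explicit `sourceColumns` entries):

```python
# Sketch: enforce at most one edit-in-place action (PARTIAL_REDACT,
# SHA256_HASH, or REDACT) per column; DETECT is always allowed.
EDIT_IN_PLACE = {"PARTIAL_REDACT", "SHA256_HASH", "REDACT"}

def validate_one_edit_in_place(detection_parameters):
    seen = {}  # column -> edit-in-place action already assigned to it
    for entity, actions in detection_parameters.items():
        for spec in actions:
            if spec["action"] not in EDIT_IN_PLACE:
                continue  # DETECT can be combined with any other action
            for col in spec.get("sourceColumns", []):
                if col in seen:
                    raise ValueError(f"column {col!r} already has {seen[col]}")
                seen[col] = spec["action"]
    return seen

params = {
    "USA_PASSPORT_NUMBER": [{"action": "SHA256_HASH", "sourceColumns": ["Passport No"]}],
    "USA_DRIVING_LICENSE": [{"action": "REDACT", "sourceColumns": ["DL NO#"]}],
}
print(validate_one_edit_in_place(params))
```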

 The `detectionParameters` field has the following layout: 

```
    ENTITY_NAME -> List[Actions]
    {
    	"ENTITY_NAME": [{
    		Action, // required
    		ColumnSpecs,
    		ActionOptionsMap
        }],
        "ENTITY_NAME2": [{
    		...
        }]
    }
```

 The types of `actions` and `actionOptions` are listed below: 

```
DETECT
{
    # Required
    "action": "DETECT",
    # Optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for DETECT 
    },
    # 1 of the below is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

SHA256_HASH
{
    # Required
    "action": "SHA256_HASH",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for SHA256_HASH
    },
    
    # 1 of the below is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

REDACT
{
    # Required
    "action": "REDACT",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // The text that is being replaced
        "redactText": "USA_DL"
    },
    
    # 1 of the below is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

PARTIAL_REDACT
{
    # Required
    "action": "PARTIAL_REDACT",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // number of characters to not redact from the left side 
        "numLeftCharsToExclude": "3",
        // number of characters to not redact from the right side
        "numRightCharsToExclude": "4",
        // the partial redact will be made with this redacted character  
        "redactChar": "#",
        // regex pattern for partial redaction
        "matchPattern": "[0-9]"
    },
    
    # 1 of the below is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}
```
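
The interaction between `matchPattern` and the offset options isn't spelled out above; one plausible reading (an assumption for illustration, not documented behavior) is that only characters matching the pattern within the redactable middle section are replaced:

```python
import re

# Sketch (assumption): combine offsets with a regex so that only characters
# matching match_pattern in the middle section are redacted.
def partial_redact_pattern(value, left, right, redact_char, match_pattern):
    middle = value[left:len(value) - right]
    redacted = re.sub(match_pattern, redact_char, middle)
    return value[:left] + redacted + value[len(value) - right:]

print(partial_redact_pattern("1-234-567-1718", 3, 4, "#", "[0-9]"))  # 1-2##-###-1718
```

With `"[0-9]"`, separators in the phone number survive while the digits are masked.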

 Once the script runs, results are output to the given Amazon S3 location. You can view your data in Amazon S3, with the selected entity types sanitized based on the selected action. In this case, the output would include rows that look like this: 

```
{
    "Name": "Colby Schuster",
    "Address": "39041 Antonietta Vista, South Rodgerside, Nebraska 24151",
    "Car Owned": "Fiat",
    "Email": "Kitty46@gmail.com",
    "Company": "O'Reilly Group",
    "Job Title": "Dynamic Functionality Facilitator",
    "ITIN": "991-22-2906",
    "Username": "Cassandre.Kub43",
    "SSN": "914-22-2906",
    "DOB": "2020-08-27",
    "Phone Number": "1-2#######1718",
    "Bank Account No": "69741187",
    "Credit Card Number": "6441-6289-6867-2162-2711",
    "Passport No": "94f311e93a623c72ccb6fc46cf5f5b0265ccb42c517498a0f27fd4c43b47111e",
    "DL NO#": "USA_DL"
}
```

 In the above script, the `Phone Number` was partially redacted with `#`. The `Passport No` was changed into a SHA256 hash. The `DL NO#` was detected as a USA driver license number and was redacted to “USA_DL”, just as specified in the `detectionParameters`. 

**Note**  
 The classifyColumns API is not available for use with fine-grained actions due to the nature of the API. The classifyColumns API performs column sampling (adjustable by the user, with default values) to perform detection more quickly, whereas fine-grained actions require iterating over every value. 

### Persistent Audit Log
<a name="sensitive-data-fine-grained-actions-persistent-audit-log"></a>

 A feature introduced with fine-grained actions (but also available when using the normal APIs) is the presence of a persistent audit log. Running the detect API adds an additional column (named `DetectedEntities` by default, and customizable through the `outputColumnName` parameter) with PII detection metadata. This metadata now has an “actionUsed” key, which is one of `DETECT`, `PARTIAL_REDACT`, `SHA256_HASH`, or `REDACT`. 

```
"DetectedEntities": {
    "Credit Card Number": [
        {
            "entityType": "CREDIT_CARD",
            "actionUsed": "DETECT",
            "start": 0,
            "end": 19
        }
    ],
    "Phone Number": [
        {
            "entityType": "PHONE_NUMBER",
            "actionUsed": "REDACT",
            "start": 0,
            "end": 14
        }
    ]
}
```
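
A sketch of reading this audit metadata back, for example after exporting results to JSON, to summarize which action was applied to each column:

```python
import json

# Parse a DetectedEntities value and map each column to the actions used on it.
audit = json.loads("""{
    "Credit Card Number": [
        {"entityType": "CREDIT_CARD", "actionUsed": "DETECT", "start": 0, "end": 19}
    ],
    "Phone Number": [
        {"entityType": "PHONE_NUMBER", "actionUsed": "REDACT", "start": 0, "end": 14}
    ]
}""")

summary = {col: [e["actionUsed"] for e in entities] for col, entities in audit.items()}
print(summary)  # {'Credit Card Number': ['DETECT'], 'Phone Number': ['REDACT']}
```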

 Even customers using APIs without fine-grained actions, such as `detect(entityTypesToDetect, outputColumnName)`, will see this persistent audit log in the resulting dataframe. 

 Customers using APIs with fine-grained actions will see all of the actions, regardless of whether the values are redacted. Example: 

```
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Credit Card Number  |  Phone Number  |                                                                                            DetectedEntities                                                                                             |
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 622126741306XXXX    | +12#####7890   | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":16}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":12}]}} |
| 6221 2674 1306 XXXX | +12#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}} |
| 6221-2674-1306-XXXX | 22#######7890  | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}} |
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

 If you do not want to see the **DetectedEntities** column, you can simply drop the additional column in a custom script. 

# AWS Glue Visual Job API
<a name="visual-job-api-chapter"></a>

 AWS Glue provides an API that allows customers to create data integration jobs using the AWS Glue API from a JSON object that represents a visual step workflow. Customers can then use the visual editor in AWS Glue Studio to work with these jobs. 

 For more information on Visual Job API data types, see [Visual Job API](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-visual-job-api.html). 

**Topics**
+ [API design and CRUD APIs](#visual-job-api-design)
+ [Getting started](#getting-started-visual-job-api)
+ [Visual job limitations](#visual-job-limitations)

## API design and CRUD APIs
<a name="visual-job-api-design"></a>

 The CreateJob and UpdateJob [APIs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html) now support an additional optional parameter, `codeGenConfigurationNodes`. Providing a non-empty JSON structure for this field results in the DAG being registered in AWS Glue Studio for the created job and the associated code being generated. A null value or empty string for this field on job creation is ignored. 

 Updates to the `codeGenConfigurationNodes` field are made through the UpdateJob AWS Glue API in a similar way as CreateJob. Specify the entire field in UpdateJob after changing the DAG as desired. A null value is ignored and no update to the DAG is performed. An empty structure or string causes `codeGenConfigurationNodes` to be set as empty and any previous DAG to be removed. The GetJob API returns a DAG if one exists. The DeleteJob API also deletes any associated DAG. 
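
These update semantics can be summarized in a small sketch (plain Python, not the AWS Glue API):

```python
# Sketch of UpdateJob semantics for codeGenConfigurationNodes:
#   None      -> ignored; the existing DAG is left untouched
#   empty     -> the DAG is cleared
#   non-empty -> the provided field replaces the old DAG wholesale
def apply_dag_update(existing_dag, provided):
    if provided is None:
        return existing_dag
    if not provided:
        return None
    return provided

current = {"node-1": {"S3CatalogSource": {}}}
print(apply_dag_update(current, None))  # unchanged
print(apply_dag_update(current, {}))    # None (DAG removed)
```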

## Getting started
<a name="getting-started-visual-job-api"></a>

 To create a job, use the [CreateJob action](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob). The `CreateJob` request input has an additional field, `codeGenConfigurationNodes`, where you can specify the DAG object in JSON. 

 Things to keep in mind: 
+  The `codeGenConfigurationNodes` field is a map of node ID to node. 
+  Each node begins with a key identifying what kind of node it is. 
+  There can only be one key specified since a node can only be of one type. 
+  The input field contains the parent nodes of the current node. 

 The following is a JSON representation of the `codeGenConfigurationNodes` field in a **CreateJob** input. 

```
{
  "node-1": {
    "S3CatalogSource": {
      "Table": "csvFormattedTable",
      "PartitionPredicate": "",
      "Name": "S3 bucket",
      "AdditionalOptions": {},
      "Database": "myDatabase"
    }
  },
  "node-3": {
    "S3DirectTarget": {
      "Inputs": ["node-2"],
      "PartitionKeys": [],
      "Compression": "none",
      "Format": "json",
      "SchemaChangePolicy": { "EnableUpdateCatalog": false },
      "Path": "",
      "Name": "S3 bucket"
    }
  },
  "node-2": {
    "ApplyMapping": {
      "Inputs": ["node-1"],
      "Name": "ApplyMapping",
      "Mapping": [
        {
          "FromType": "long",
          "ToType": "long",
          "Dropped": false,
          "ToKey": "myheader1",
          "FromPath": ["myheader1"]
        },
        {
          "FromType": "long",
          "ToType": "long",
          "Dropped": false,
          "ToKey": "myheader2",
          "FromPath": ["myheader2"]
        },
        {
          "FromType": "long",
          "ToType": "long",
          "Dropped": false,
          "ToKey": "myheader3",
          "FromPath": ["myheader3"]
        }
      ]
    }
  }
}
```
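
The node-map rules listed above can be checked with a short validator sketch (plain Python; the node map mirrors the JSON example, and the helper itself is hypothetical):

```python
# Sketch: each node has exactly one type key, and every entry in Inputs
# references another node ID present in the map.
def validate_dag(nodes):
    for node_id, node in nodes.items():
        if len(node) != 1:
            raise ValueError(f"{node_id}: a node can only be of one type")
        (body,) = node.values()
        for parent in body.get("Inputs", []):
            if parent not in nodes:
                raise ValueError(f"{node_id}: unknown input {parent!r}")
    return True

dag = {
    "node-1": {"S3CatalogSource": {"Table": "csvFormattedTable"}},
    "node-2": {"ApplyMapping": {"Inputs": ["node-1"], "Mapping": []}},
    "node-3": {"S3DirectTarget": {"Inputs": ["node-2"], "Format": "json"}},
}
print(validate_dag(dag))  # True
```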

 **Updating and getting jobs** 

 Since *UpdateJob* also has a `codeGenConfigurationNodes` field, the input format is the same. See the [UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) action. 

 The *GetJob* action returns a `codeGenConfigurationNodes` field in the same format as well. See the [GetJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-GetJob) action. 

## Visual job limitations
<a name="visual-job-limitations"></a>

 Since the `codeGenConfigurationNodes` parameter has been added to existing APIs, any limitations in those APIs are inherited. In addition, the `codeGenConfigurationNodes` field and some nodes are limited in size. See [Job Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-Job) for more information. 

# Programming Ray scripts
<a name="aws-glue-programming-ray"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

AWS Glue makes it easy to write and run Ray scripts. This section describes the supported Ray capabilities that are available in AWS Glue for Ray. You program Ray scripts in Python.

Your custom script must be compatible with the version of Ray that's defined by the `Runtime` field in your job definition. For more information about `Runtime` in the Jobs API, see [Jobs](aws-glue-api-jobs-job.md). For information about each runtime environment, see [Supported Ray runtime environments](ray-jobs-section.md#author-job-ray-runtimes).

**Topics**
+ [Tutorial: Writing an ETL script in AWS Glue for Ray](edit-script-ray-intro-tutorial.md)
+ [Using Ray Core and Ray Data in AWS Glue for Ray](edit-script-ray-scripting.md)
+ [Providing files and Python libraries to Ray jobs](edit-script-ray-env-dependencies.md)
+ [Connecting to data in Ray jobs](edit-script-ray-connections-formats.md)

# Tutorial: Writing an ETL script in AWS Glue for Ray
<a name="edit-script-ray-intro-tutorial"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

Ray gives you the ability to write and scale distributed tasks natively in Python. AWS Glue for Ray offers serverless Ray environments that you can access from both jobs and interactive sessions (Ray interactive sessions are in preview). The AWS Glue job system provides a consistent way to manage and run your tasks—on a schedule, from a trigger, or from the AWS Glue console. 

Combining these AWS Glue tools creates a powerful toolchain that you can use for extract, transform, and load (ETL) workloads, a popular use case for AWS Glue. In this tutorial, you will learn the basics of putting together this solution.

We also support using AWS Glue for Spark for your ETL workloads. For a tutorial on writing an AWS Glue for Spark script, see [Tutorial: Writing an AWS Glue for Spark script](aws-glue-programming-intro-tutorial.md). For more information about available engines, see [AWS Glue for Spark and AWS Glue for Ray](how-it-works-engines.md). Ray is capable of addressing many different kinds of tasks in analytics, machine learning (ML), and application development. 

In this tutorial, you will extract, transform, and load a CSV dataset that is hosted in Amazon Simple Storage Service (Amazon S3). You will begin with the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset, which is stored in a public Amazon S3 bucket. For more information about this dataset, see the [Registry of Open Data on AWS](https://registry.opendata.aws/nyc-tlc-trip-records-pds/). 

You will transform your data with predefined transforms that are available in the Ray Data library. Ray Data is a dataset preparation library designed by Ray and included by default in AWS Glue for Ray environments. For more information about libraries included by default, see [Modules provided with Ray jobs](edit-script-ray-env-dependencies.md#edit-script-ray-modules-provided). You will then write your transformed data to an Amazon S3 bucket that you control.

**Prerequisites** – For this tutorial, you need an AWS account with access to AWS Glue and Amazon S3. 

## Step 1: Create a bucket in Amazon S3 to hold your output data
<a name="edit-script-ray-intro-tutorial-s3"></a>

You will need an Amazon S3 bucket that you control to serve as a sink for data created in this tutorial. You can create this bucket with the following procedure.

**Note**  
If you want to write your data to an existing bucket that you control, you can skip this step. Take note of *yourBucketName*, the existing bucket's name, to use in later steps.

**To create a bucket for your Ray job output**
+ Create a bucket by following the steps in [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the *Amazon S3 User Guide*.
  + When choosing a bucket name, take note of *yourBucketName*, which you will refer to in later steps.
  + For other configuration, the suggested settings provided in the Amazon S3 console should work fine in this tutorial.

  As an example, the bucket creation dialog box might look like this in the Amazon S3 console.  
![\[A dialog box in the Amazon S3 console that is used in configuring a new bucket.\]](http://docs.aws.amazon.com/glue/latest/dg/images/ray-tutorial-create-bucket.jpg)

## Step 2: Create an IAM role and policy for your Ray job
<a name="edit-script-ray-intro-tutorial-iam"></a>

Your job will need an AWS Identity and Access Management (IAM) role with the following:
+ Permissions granted by the `AWSGlueServiceRole` managed policy. These are the basic permissions that are necessary to run an AWS Glue job.
+ `Read` access level permissions for the `nyc-tlc/*` Amazon S3 resource.
+ `Write` access level permissions for the `yourBucketName/*` Amazon S3 resource.
+ A trust relationship that allows the `glue.amazonaws.com` principal to assume the role.

You can create this role with the following procedure.

**To create an IAM role for your AWS Glue for Ray job**
**Note**  
You can create an IAM role by following many different procedures. For more information or options about how to provision IAM resources, see the [AWS Identity and Access Management documentation](https://docs.aws.amazon.com/iam/index.html).

1. Create a policy that defines the previously outlined Amazon S3 permissions by following the steps in [Creating IAM policies (console) with the visual editor](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html#access_policies_create-visual-editor) in the *IAM User Guide*.
   + When selecting a service, choose Amazon S3.
   + When selecting permissions for your policy, attach the following sets of actions for the following resources (mentioned previously):
     + Read access level permissions for the `nyc-tlc/*` Amazon S3 resource.
     + Write access level permissions for the `yourBucketName/*` Amazon S3 resource.
   + When selecting the policy name, take note of *YourPolicyName*, which you will refer to in a later step.

1. Create a role for your AWS Glue for Ray job by following the steps in [ Creating a role for an AWS service (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html#roles-creatingrole-service-console) in the *IAM User Guide*.
   + When selecting a trusted AWS service entity, choose `Glue`. This will automatically populate the necessary trust relationship for your job.
   + When selecting policies for the permissions policy, attach the following policies:
     + `AWSGlueServiceRole`
     + *YourPolicyName*
   + When selecting the role name, take note of *YourRoleName*, which you will refer to in later steps.
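
As an illustration only, the Amazon S3 permissions described above correspond to a policy document along these lines (a sketch: the console's Read and Write access levels expand to more actions than shown, and *yourBucketName* is a placeholder):

```python
import json

# Sketch of an IAM policy granting read access to the nyc-tlc dataset and
# write access to your output bucket. Action lists are abbreviated.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::nyc-tlc", "arn:aws:s3:::nyc-tlc/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::yourBucketName/*"],
        },
    ],
}
print(json.dumps(policy, indent=2))
```

The visual policy editor generates an equivalent document for you; this is only to show the shape of what the procedure produces.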

## Step 3: Create and run an AWS Glue for Ray job
<a name="edit-script-ray-intro-tutorial-author-job"></a>

In this step, you create an AWS Glue job using the AWS Management Console, provide it with a sample script, and run the job. When you create a job, it creates a place in the console for you to store, configure, and edit your Ray script. For more information about creating jobs, see [Managing AWS Glue Jobs in the AWS Console](author-job-glue.md#console-jobs).

In this tutorial, we address the following ETL scenario: you would like to read the January 2022 records from the New York City TLC Trip Record dataset, add a new column (`tip_rate`) to the dataset by combining data in existing columns, then remove a number of columns that aren't relevant to your current analysis, and then you would like to write the results to *yourBucketName*. The following Ray script performs these steps:

```
import ray
import pandas
from ray import data

ray.init('auto')

ds = ray.data.read_csv("s3://nyc-tlc/opendata_repo/opendata_webconvert/yellow/yellow_tripdata_2022-01.csv")

# Add a tip_rate column derived from the existing tip_amount and total_amount columns
ds = ds.add_column("tip_rate", lambda df: df["tip_amount"] / df["total_amount"])

# Drop columns that aren't relevant to the current analysis
ds = ds.drop_columns(["payment_type", "fare_amount", "extra", "tolls_amount", "improvement_surcharge"])

ds.write_parquet("s3://yourBucketName/ray/tutorial/output/")
```
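
The `tip_rate` column added by the script above is simply `tip_amount` divided by `total_amount`. For a single record, in plain Python:

```python
# The tip_rate computation on one sample record (in the job, Ray Data applies
# the same expression across batches of records via add_column).
record = {"tip_amount": 2.0, "total_amount": 10.0}
record["tip_rate"] = record["tip_amount"] / record["total_amount"]
print(record["tip_rate"])  # 0.2
```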

**To create and run an AWS Glue for Ray job**

1. In the AWS Management Console, navigate to the AWS Glue landing page.

1. In the side navigation pane, choose **ETL Jobs**.

1. In **Create job**, choose **Ray script editor**, and then choose **Create**, as in the following illustration.  
![\[A dialog box in the AWS Glue console used to create a Ray job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-script-ray-create.png)

1. Paste the full text of the script into the **Script** pane, replacing any existing text.

1. Navigate to **Job details** and set the **IAM Role** property to *YourRoleName*.

1. Choose **Save**, and then choose **Run**.

## Step 4: Inspect your output
<a name="edit-script-ray-intro-tutorial-inspect"></a>

After running your AWS Glue job, you should validate that the output matches the expectations of this scenario. You can do so with the following procedure.

**To validate whether your Ray job ran successfully**

1. On the job details page, navigate to **Runs**.

1. After a few minutes, you should see a run with a **Run status** of **Succeeded**.

1. Navigate to the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/) and inspect *yourBucketName*. You should see files written to your output bucket.

1. Read the Parquet files and verify their contents. You can do this with your existing tools. If you don't have a process for validating Parquet files, you can do this in the AWS Glue console with an AWS Glue interactive session, using either Spark or Ray (in preview).

   In an interactive session, you have access to Ray Data, Spark, or pandas libraries, which are provided by default (based on your choice of engine). To verify your file contents, you can use common inspection methods that are available in those libraries—methods like `count`, `schema`, and `show`. For more information about interactive sessions in the console, see [Using notebooks with AWS Glue Studio and AWS Glue](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html). 

   Because you have confirmed that files were written to the bucket, you can say with relative certainty that any problems with your output are not related to IAM configuration. Configure your session with *yourRoleName* so that it has access to the relevant files.
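If you choose pandas in an interactive session, the checks look like the following sketch. The miniature DataFrame here is a hypothetical stand-in for your data; in a real session you would read your Parquet output from *yourBucketName* instead.

```
import pandas as pd

# Hypothetical miniature of the TLC data, standing in for your Parquet output
df = pd.DataFrame({
    "tip_amount": [2.0, 0.0, 3.5],
    "total_amount": [10.0, 8.0, 14.0],
})
df["tip_rate"] = df["tip_amount"] / df["total_amount"]

print(len(df))     # row count, analogous to Ray's ds.count()
print(df.dtypes)   # column types, analogous to ds.schema()
print(df.head())   # sample records, analogous to ds.show()
```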

If you don't see the expected outcomes, examine the troubleshooting content in this guide to identify and remediate the source of the error. You can find the troubleshooting content in the [Troubleshooting AWS Glue](troubleshooting-glue.md) chapter. For specific errors that are related to Ray jobs, see [Troubleshooting AWS Glue for Ray errors from logs](troubleshooting-ray.md) in the troubleshooting chapter. 

## Next steps
<a name="edit-script-ray-intro-tutorial-next"></a>

 You have now seen and performed an ETL process using AWS Glue for Ray from end to end. You can use the following resources to understand what tools AWS Glue for Ray provides to transform and interpret your data at scale. 
+  For more information about Ray's task model, see [Using Ray Core and Ray Data in AWS Glue for Ray](edit-script-ray-scripting.md). For more experience in using Ray tasks, follow the examples in the Ray Core documentation. See [Ray Core: Ray Tutorials and Examples (2.4.0)](https://docs.ray.io/en/releases-2.4.0/ray-core/examples/overview.html) in the Ray documentation. 
+  For guidance about available data management libraries in AWS Glue for Ray, see [Connecting to data in Ray jobs](edit-script-ray-connections-formats.md). For more experience using Ray Data to transform and write datasets, follow the examples in the Ray Data documentation. See [Ray Data: Examples (2.4.0)](https://docs.ray.io/en/releases-2.4.0/data/examples/index.html). 
+ For more information about configuring AWS Glue for Ray jobs, see [Working with Ray jobs in AWS Glue](ray-jobs-section.md).
+ For more information about writing AWS Glue for Ray scripts, continue reading the documentation in this section.

# Using Ray Core and Ray Data in AWS Glue for Ray
<a name="edit-script-ray-scripting"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to those of AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

Ray is a framework for scaling up Python scripts by distributing work across a cluster. You can use Ray as a solution to many sorts of problems, so Ray provides libraries to optimize certain tasks. In AWS Glue, we focus on using Ray to transform large datasets. AWS Glue offers support for Ray Data and parts of Ray Core to facilitate this task. 

## What is Ray Core?
<a name="edit-script-ray-scripting-core-what"></a>

The first step of building a distributed application is identifying and defining work that can be performed concurrently. Ray Core contains the parts of Ray that you use to define tasks that can be performed concurrently. Ray provides reference and quick start information that you can use to learn the tools they provide. For more information, see [What is Ray Core?](https://docs.ray.io/en/latest/ray-core/walkthrough.html) and [Ray Core Quick Start](https://docs.ray.io/en/latest/ray-overview/getting-started.html#ray-core-quick-start). For more information about effectively defining concurrent tasks in Ray, see [Tips for first-time users](https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html). 

**Ray tasks and actors**  
In AWS Glue for Ray documentation, we might refer to *tasks* and *actors*, which are core concepts in Ray.  
Ray uses Python functions and classes as the building blocks of a distributed computing system. Much like when Python functions and variables become "methods" and "attributes" when used in a class, functions become "tasks" and classes become "actors" when they're used in Ray to send code to workers. You can identify functions and classes that might be used by Ray by the `@ray.remote` annotation.  
Tasks and actors are configurable, they have a lifecycle, and they take up compute resources throughout their life. Code that throws errors can be traced back to a task or actor when you're finding the root cause of problems. Thus, these terms might come up when you're learning how to configure, monitor, or debug AWS Glue for Ray jobs.   
To begin learning how to effectively use tasks and actors to build a distributed application, see [Key Concepts](https://docs.ray.io/en/latest/ray-core/key-concepts.html) in the Ray docs.

## Ray Core in AWS Glue for Ray
<a name="edit-script-ray-scripting-core-glue"></a>

AWS Glue for Ray environments manage cluster formation and scaling, as well as collecting and visualizing logs. Because we manage these concerns, we consequently limit access to and support for the APIs in Ray Core that would be used to address these concerns in an open-source cluster.

In the managed `Ray2.4` runtime environment, we do not support:
+ [Ray Core CLI](https://docs.ray.io/en/releases-2.4.0/ray-core/api/cli.html)
+ [Ray State CLI](https://docs.ray.io/en/releases-2.4.0/ray-observability/api/state/cli.html)
+ `ray.util.metrics` Prometheus metric utility methods:
  + [Counter](https://docs.ray.io/en/releases-2.4.0/ray-core/api/doc/ray.util.metrics.Counter.html)
  + [Gauge](https://docs.ray.io/en/releases-2.4.0/ray-core/api/doc/ray.util.metrics.Gauge.html)
  + [Histogram](https://docs.ray.io/en/releases-2.4.0/ray-core/api/doc/ray.util.metrics.Histogram.html)
+ Other debugging tools:
  + [ray.util.pdb.set\_trace](https://docs.ray.io/en/releases-2.4.0/ray-core/api/doc/ray.util.pdb.set_trace.html)
  + [ray.util.inspect\_serializability](https://docs.ray.io/en/releases-2.4.0/ray-core/api/doc/ray.util.inspect_serializability.html)
  + [ray.timeline](https://docs.ray.io/en/releases-2.4.0/ray-core/api/doc/ray.timeline.html)

## What is Ray Data?
<a name="edit-script-ray-scripting-data-what"></a>

Ray Data offers a straightforward approach to connecting to data sources and destinations, handling datasets, and initiating common transforms on Ray datasets. For more information about using Ray Data, see [Ray Datasets: Distributed Data Preprocessing](https://docs.ray.io/en/releases-2.4.0/data/dataset.html). 

You can use Ray Data or other tools to access your data. For more information on accessing your data in Ray, see [Connecting to data in Ray jobs](edit-script-ray-connections-formats.md).

## Ray Data in AWS Glue for Ray
<a name="edit-script-ray-scripting-data-glue"></a>

Ray Data is supported and provided by default in the managed `Ray2.4` runtime environment. For more information about provided modules, see [Modules provided with Ray jobs](edit-script-ray-env-dependencies.md#edit-script-ray-modules-provided).

# Providing files and Python libraries to Ray jobs
<a name="edit-script-ray-env-dependencies"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to those of AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

This section provides information that you need for using Python libraries with AWS Glue Ray jobs. You can use certain common libraries included by default in all Ray jobs. You can also provide your own Python libraries to your Ray job. 

## Modules provided with Ray jobs
<a name="edit-script-ray-modules-provided"></a>

You can perform data integration workflows in a Ray job with the following provided packages. These packages are available by default in Ray jobs.

------
#### [ AWS Glue version 4.0 ]

In AWS Glue 4.0, the Ray (`Ray2.4` runtime) environment provides the following packages:
+ boto3 == 1.26.133
+ ray == 2.4.0
+ pyarrow == 11.0.0
+ pandas == 1.5.3
+ numpy == 1.24.3
+ fsspec == 2023.4.0

This list includes all packages that would be installed with `ray[data] == 2.4.0`. Ray Data is supported out of the box.

------

## Providing files to your Ray job
<a name="edit-script-ray-working-directory"></a>

You can provide files to your Ray job with the `--working-dir` parameter. Provide this parameter with a path to a .zip file hosted on Amazon S3. Within the .zip file, your files must be contained in a single top-level directory. No other files should be at the top level.

Your files will be distributed to each Ray node before your script begins to run. Consider how this might impact the disk space that's available to each Ray node. Available disk space is determined by the WorkerType set in the job configuration. If you want to provide your job data at scale, this mechanism is not the right solution. For more information on providing data to your job, see [Connecting to data in Ray jobs](edit-script-ray-connections-formats.md). 

Your files will be accessible as if the directory was provided to Ray through the `working_dir` parameter. For example, to read a file named `sample.txt` in your .zip file's top-level directory, you could call:

```
import ray

@ray.remote
def do_work():
    # Files from your .zip are available relative to the working directory
    with open("sample.txt", "r") as f:
        print(f.read())
```

For more information about `working_dir`, see the [Ray documentation](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#remote-uris). This feature behaves similarly to Ray's native capabilities.

## Additional Python modules for Ray jobs
<a name="edit-script-ray-python-libraries-additional"></a>

**Additional modules from PyPI**

Ray jobs use the Python package installer (`pip3`) to install additional modules for use by a Ray script. You can use the `--pip-install` parameter with a comma-separated list of Python modules to add a new module or change the version of an existing module. 

For example, to update or add a new `scikit-learn` module, use the following key-value pair: 

`"--pip-install", "scikit-learn==0.21.3"`
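If you create the job through the AWS CLI or the AWS Glue API rather than the console, the same key-value pair goes in the job's default arguments. The following is a hypothetical fragment of a job definition:

```
"DefaultArguments": {
    "--pip-install": "scikit-learn==0.21.3"
}
```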

If you have custom modules or custom patches, you can distribute your own libraries from Amazon S3 with the `--s3-py-modules` parameter. Before uploading your distribution, it might need to be repackaged and rebuilt. Follow the guidelines in [Including Python code in Ray jobs](#edit-script-ray-packaging).

**Custom distributions from Amazon S3**

Custom distributions should adhere to Ray packaging guidelines for dependencies. You can find out how to build these distributions in the following section. For more information about how Ray sets up dependencies, see [Environment Dependencies](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html) in the Ray documentation. 

To include a custom distributable after assessing its contents, upload your distributable to a bucket available to the job's IAM role. Specify the Amazon S3 path to a Python zip archive in your parameter configuration. If you're providing multiple distributables, separate them by comma. For example:

`"--s3-py-modules", "s3://s3bucket/pythonPackage.zip"` 

**Limitations**

Ray jobs do not support compiling native code in the job environment. This can limit you if your Python dependencies transitively depend on native, compiled code. Ray jobs can run provided binaries, but they must be compiled for Linux on ARM64. This means you might be able to use the contents of `aarch64` `manylinux` wheels. You can provide your native dependencies in a compiled form by repackaging a wheel to Ray standards. Typically, this means removing `dist-info` folders so that there is only one folder at the root of the archive. 

You cannot upgrade the version of `ray` or `ray[data]` using this parameter. In order to use a new version of Ray, you will need to change the runtime field on your job, after we have released support for it. For more information about supported Ray versions, see [AWS Glue versions](release-notes.md#release-notes-versions).

## Including Python code in Ray jobs
<a name="edit-script-ray-packaging"></a>

The Python Software Foundation offers standardized behaviors for packaging Python files for use across different runtimes. Ray introduces limitations to packaging standards that you should be aware of. AWS Glue does not specify packaging standards beyond those specified to Ray. The following instructions provide standard guidance on packaging simple Python packages.

Package your files in a `.zip` archive. A directory should be at the root of the archive. **There should be no other files at the root level of the archive, or this may lead to unexpected behavior.** The root directory is the package, and its name is used to refer to your Python code when importing it.

If you provide a distribution in this form to a Ray job with `--s3-py-modules`, you will be able to import Python code from your package in your Ray script.

Your package can provide a single Python module with some Python files, or you can package together many modules. When repackaging dependencies, such as libraries from PyPI, **check for hidden files and metadata directories** inside of those packages. 

**Warning**  
 Certain OS behaviors can make it difficult to properly follow these packaging instructions.   
macOS may add hidden files such as `__MACOSX` to your zip file at the top level.
Windows may add your files to a folder inside the zip automatically, unintentionally creating a nested folder.

The following procedures assume you are interacting with your files in Amazon Linux 2 or a similar OS that provides a distribution of the Info-ZIP `zip` and `zipinfo` utilities. We recommend using these tools to prevent unexpected behaviors. 

**To package Python files for use in Ray**

1. Create a temporary directory with your package name, then confirm your working directory is its parent directory. You can do this with the following commands:

   ```
   cd parent_directory
   mkdir temp_dir
   ```

1. Copy your files into the temporary directory, then confirm your directory structure. The contents of this directory will be directly accessed as your Python module. You can do this with the following command:

   ```
   ls -AR temp_dir
   # my_file_1.py
   # my_file_2.py
   ```

1. Compress your temporary folder using zip. You can do this with the following commands:

   ```
   zip -r zip_file.zip temp_dir
   ```

1. Confirm your file is properly packaged. `zip_file.zip` should now be found in your working directory. You can inspect it with the following command:

   ```
   zipinfo -1 zip_file.zip
   # temp_dir/
   # temp_dir/my_file_1.py
   # temp_dir/my_file_2.py
   ```
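The manual steps above can be sketched programmatically with Python's standard `zipfile` module, which avoids the OS pitfalls described in the warning. The file names and helper here are hypothetical, with empty stand-in contents:

```
import os
import tempfile
import zipfile

def package_for_ray(zip_path, package_dir, files):
    """Zip `files` (name -> content) under a single top-level directory,
    as Ray expects for --s3-py-modules and --working-dir archives."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name, content in files.items():
            zf.writestr(f"{package_dir}/{name}", content)

tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "zip_file.zip")
package_for_ray(zip_path, "temp_dir", {"my_file_1.py": "", "my_file_2.py": ""})

# Equivalent to inspecting the archive with zipinfo -1
with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)  # ['temp_dir/my_file_1.py', 'temp_dir/my_file_2.py']
```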

**To repackage a Python package for use in Ray**

1. Create a temporary directory with your package name, then confirm your working directory is its parent directory. You can do this with the following commands:

   ```
   cd parent_directory
   mkdir temp_dir
   ```

1. Decompress your package and copy the contents into your temporary directory. Remove files related to your previous packaging standard, leaving only the contents of the module. Confirm your file structure looks correct with the following command:

   ```
   ls -AR temp_dir
   # my_module
   # my_module/__init__.py
   # my_module/my_file_1.py
   # my_module/my_submodule/__init__.py
   # my_module/my_submodule/my_file_2.py
   # my_module/my_submodule/my_file_3.py
   ```

1. Compress your temporary folder using zip. You can do this with the following commands:

   ```
   zip -r zip_file.zip temp_dir
   ```

1. Confirm your file is properly packaged. `zip_file.zip` should now be found in your working directory. You can inspect it with the following command:

   ```
   zipinfo -1 zip_file.zip
   # temp_dir/my_module/
   # temp_dir/my_module/__init__.py
   # temp_dir/my_module/my_file_1.py
   # temp_dir/my_module/my_submodule/
   # temp_dir/my_module/my_submodule/__init__.py
   # temp_dir/my_module/my_submodule/my_file_2.py
   # temp_dir/my_module/my_submodule/my_file_3.py
   ```
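A quick way to guard against the packaging pitfalls described above is to check the archive before uploading it to Amazon S3. The following is a minimal sketch using the standard library; the helper name is hypothetical:

```
import io
import zipfile

def check_ray_archive(zip_bytes):
    """Return True if the archive has exactly one top-level directory
    and no OS metadata such as __MACOSX at the root."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        tops = {name.split("/", 1)[0] for name in zf.namelist()}
    return len(tops) == 1 and "__MACOSX" not in tops

# Build a well-formed archive in memory: one top-level directory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("temp_dir/my_module/__init__.py", "")

print(check_ray_archive(buf.getvalue()))  # True
```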

# Connecting to data in Ray jobs
<a name="edit-script-ray-connections-formats"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to those of AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

AWS Glue Ray jobs can use a broad array of Python packages that are designed for you to quickly integrate data. We provide a minimal set of dependencies in order to not clutter your environment. For more information about what is included by default, see [Modules provided with Ray jobs](edit-script-ray-env-dependencies.md#edit-script-ray-modules-provided).

**Note**  
AWS Glue extract, transform, and load (ETL) provides the DynamicFrame abstraction to streamline ETL workflows where you resolve schema differences between rows in your dataset. AWS Glue ETL provides additional features—job bookmarks and grouping input files. We don't currently provide corresponding features in Ray jobs.  
AWS Glue for Spark provides direct support for connecting to certain data formats, sources and sinks. In Ray, AWS SDK for pandas and current third-party libraries substantively cover that need. You will need to consult those libraries to understand what capabilities are available.

AWS Glue for Ray integration with Amazon VPC is not currently available. Resources in Amazon VPC will not be accessible without a public route. For more information about using AWS Glue with Amazon VPC, see [Configuring interface VPC endpoints (AWS PrivateLink) for AWS Glue](vpc-interface-endpoints.md). 

## Common libraries for working with data in Ray
<a name="edit-script-ray-etl-libraries"></a>

**Ray Data** – Ray Data provides methods to handle common data formats, sources and sinks. For more information about supported formats and sources in Ray Data, see [Input/Output](https://docs.ray.io/en/latest/data/api/input_output.html) in the Ray Data documentation. Ray Data is an opinionated library, rather than a general-purpose library, for handling datasets. 

Ray provides certain guidance around use cases where Ray Data might be the best solution for your job. For more information, see [Ray use cases](https://docs.ray.io/en/latest/ray-overview/use-cases.html) in the Ray documentation. 

**AWS SDK for pandas (awswrangler)** – AWS SDK for pandas is an AWS product that delivers clean, tested solutions for reading from and writing to AWS services when your transformations manage data with pandas DataFrames. For more information about supported formats and sources in the AWS SDK for pandas, see the [API Reference](https://aws-sdk-pandas.readthedocs.io/en/stable/api.html) in the AWS SDK for pandas documentation. 

For examples of how to read and write data with the AWS SDK for pandas, see [Quick Start](https://aws-sdk-pandas.readthedocs.io/en/stable/) in the AWS SDK for pandas documentation. The AWS SDK for pandas doesn't provide transforms for your data. It only provides support for reading and writing from sources. 

**Modin** – Modin is a Python library that implements common pandas operations in a distributable way. For more information about Modin, see the [Modin documentation](https://modin.readthedocs.io/en/stable/). Modin itself doesn't provide support for reading and writing from sources. It provides distributed implementations of common transforms. Modin is supported by the AWS SDK for pandas. 

When you run Modin and the AWS SDK for pandas together in a Ray environment, you can perform common ETL tasks with performant results. For more information about using Modin with the AWS SDK for pandas, see [At scale](https://aws-sdk-pandas.readthedocs.io/en/stable/scale.html) in the AWS SDK for pandas documentation. 

**Other frameworks** – For more information about frameworks that Ray supports, see [The Ray Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) in the Ray documentation. We don't provide support for other frameworks in AWS Glue for Ray.

## Connecting to data through the Data Catalog
<a name="edit-script-ray-gludc"></a>

Managing your data through the Data Catalog in conjunction with Ray jobs is supported with the AWS SDK for pandas. For more information, see [Glue Catalog](https://aws-sdk-pandas.readthedocs.io/en/3.0.0rc2/tutorials/005%20-%20Glue%20Catalog.html) on the AWS SDK for pandas website.