AWS Glue PySpark transforms reference
AWS Glue provides the following built-in transforms that you can use in PySpark ETL operations. Your data passes from transform to transform in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its schema to process your data.
Most of these transforms also exist as methods of the DynamicFrame class. For more information, see DynamicFrame transforms.
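For example, here is a minimal sketch (assuming it runs in a Glue job environment where a GlueContext can be created) of converting between a Spark DataFrame and a DynamicFrame and calling a transform such as drop_fields directly as a DynamicFrame method:
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Build a small Spark DataFrame, then wrap it in a DynamicFrame.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
dyf = DynamicFrame.fromDF(df, glue_context, "example")

# The DropFields transform also exists as the drop_fields method.
trimmed = dyf.drop_fields(["letter"])

# Convert back to a Spark DataFrame when you need DataFrame APIs.
trimmed.toDF().show()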
Data integration transforms
For AWS Glue 4.0 and above, create or update job arguments with key: --enable-glue-di-transforms, value: true.
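You can also set this argument programmatically. The following is a minimal boto3 sketch, not a prescribed setup; the job name, role ARN, and script location are placeholders you would replace with your own:
import boto3

glue = boto3.client("glue")

# Placeholder job name, role ARN, and script path -- replace with your own.
glue.create_job(
    Name="my-di-transforms-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/di_job.py",
        "PythonVersion": "3",
    },
    # Enables the data integration transforms for this job.
    DefaultArguments={"--enable-glue-di-transforms": "true"},
)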
Example job script:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except Exception:
    print("Unexpected error happened")
    raise
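If the transform succeeds, df_output should contain the original source_column plus a new target_column labeling each row with the configured true_string or false_string ("Even" or "Not even").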
Example Sessions using Notebooks
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%region eu-west-1
%%configure
{
    "--enable-glue-di-transforms": "true"
}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except Exception:
    print("Unexpected error happened")
    raise
Example Sessions using AWS CLI
aws glue create-session --default-arguments "--enable-glue-di-transforms=true"
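This command is abbreviated to show the relevant argument; aws glue create-session also requires additional parameters such as --id, --role, and --command. See the AWS CLI reference for the full invocation.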
Maven: Bundle the plugin with your Spark applications
You can bundle the transforms dependency with your Spark applications and Spark distributions (version 3.3) by adding the plugin dependency to your Maven pom.xml while developing your Spark applications locally.
<repositories>
    ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>AWSGlueTransforms</artifactId>
    <version>4.0.0</version>
</dependency>
You can alternatively download the binaries from AWS Glue Maven artifacts directly and include them in your Spark application as follows.
#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/AWSGlueTransforms/4.0.0/AWSGlueTransforms-4.0.0.jar -P /usr/lib/spark/jars/