Language	Package
.NET	`Amazon.CDK.AWS.Glue.Alpha`
Go	`github.com/aws/aws-cdk-go/awscdkgluealpha/v2`
Java	`software.amazon.awscdk.services.glue.alpha`
Python	`aws_cdk.aws_glue_alpha`
TypeScript	`@aws-cdk/aws-glue-alpha`

AWS Glue Construct Library

cdk-constructs: Experimental

The APIs of higher level constructs in this module are experimental and under active development. They are subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model and breaking changes will be announced in the release notes. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.

This module is part of the AWS Cloud Development Kit project.

README

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

The Glue L2 construct provides convenience methods that work backwards from common use cases, and it sets required parameters to defaults that align with recommended best practices for each job type. It also balances flexibility, via optional parameter overrides, with opinionated interfaces that discourage anti-patterns, which reduces the time needed to develop and deploy new resources.

References

Glue Launch Announcement
Glue Documentation
Glue L1 (CloudFormation) Constructs
Prior version of the @aws-cdk/aws-glue-alpha module

Create a Glue Job

A Job encapsulates a script that connects to data sources, processes them, and then writes output to a data target. There are four types of Glue Jobs: Spark (ETL and Streaming), Python Shell, Ray, and Flex Jobs. Most of the required parameters for these jobs are common across all types, but there are a few differences depending on the languages supported and features provided by each type. For all job types, the L2 defaults to AWS best practice recommendations, such as:

Use of Secrets Manager for Connection JDBC strings
Glue job autoscaling
Default parameter values for Glue job creation

This iteration of the L2 construct introduces breaking changes to the existing glue-alpha module, but these changes streamline the developer experience, introduce new constants for defaults, and replace synth-time validations with interface contracts that enforce the parameter combinations Glue supports. As an opinionated construct, the Glue L2 construct does not allow developers to create resources that use non-current versions of Glue or deprecated language dependencies (e.g. deprecated versions of Python). As always, L1s allow you to specify a wider range of parameters if you need or want to use alternative configurations.

Optional and required parameters for each job are enforced via interface rather than validation; see Glue's public documentation for more granular details.
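To make the idea of enforcement "via interface rather than validation" concrete, here is a minimal TypeScript sketch. Every name in it (SketchCommonJobProps, SketchPySparkJobProps, SketchScalaJobProps, describeJob) is hypothetical and is not the module's actual API, and the extra className requirement for the Scala variant is an assumption made for illustration. The point is only that a missing required property shows up as a compile-time error rather than a synth-time exception.

```typescript
// Hypothetical types for illustration only; this is not the aws-glue-alpha API.
interface SketchCommonJobProps {
  readonly role: string;     // required for every job type
  readonly script: string;   // required for every job type
  readonly jobName?: string; // optional
}

// A PySpark-style job needs nothing beyond the common properties.
interface SketchPySparkJobProps extends SketchCommonJobProps {
  readonly language: 'python';
}

// A Scala-style job additionally requires a main class (assumed for this sketch).
interface SketchScalaJobProps extends SketchCommonJobProps {
  readonly language: 'scala';
  readonly className: string;
}

type SketchJobProps = SketchPySparkJobProps | SketchScalaJobProps;

function describeJob(props: SketchJobProps): string {
  const name = props.jobName ?? 'unnamed';
  // Narrowing on `language` tells the compiler whether `className` exists.
  return props.language === 'scala'
    ? `${name}: scala job with main class ${props.className}`
    : `${name}: python job`;
}

// Compiles: all required properties are present.
const ok = describeJob({ language: 'python', role: 'my-role', script: 's3://bucket/job.py' });
console.log(ok);

// Would not compile, because `className` is missing for a scala job:
// describeJob({ language: 'scala', role: 'my-role', script: 's3://bucket/job.jar' });
```

With this shape, an unsupported combination never reaches synthesis because the type checker rejects it first.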

Spark Jobs

ETL Jobs

ETL jobs support the pySpark and Scala languages, for which there are separate but similar constructors. ETL jobs default to the G2 worker type, but you can override this default with any other supported worker type (G1, G2, G4 and G8). ETL jobs default to Glue version 4.0, which you can override to 3.0. The following ETL features are enabled by default: --enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log. You can find more details about versions, worker types and other features in Glue's public documentation.
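The default-and-override behavior described above can be pictured with a small sketch. This is not the construct's implementation: SketchEtlOptions, ETL_DEFAULTS, and resolveEtlOptions are hypothetical names, the worker type and version labels are copied from the paragraph above rather than from the library's enums, and the feature flags are written in the double-hyphen form that Glue job arguments use.

```typescript
// Hypothetical model of the ETL defaults described above; not the module's code.
type SketchWorkerType = 'G1' | 'G2' | 'G4' | 'G8';
type SketchGlueVersion = '3.0' | '4.0';

interface SketchEtlOptions {
  readonly workerType: SketchWorkerType;
  readonly glueVersion: SketchGlueVersion;
  readonly defaultArguments: readonly string[];
}

const ETL_DEFAULTS: SketchEtlOptions = {
  workerType: 'G2',
  glueVersion: '4.0',
  defaultArguments: ['--enable-metrics', '--enable-spark-ui', '--enable-continuous-cloudwatch-log'],
};

// Caller-supplied overrides win; anything omitted falls back to the defaults.
function resolveEtlOptions(overrides: Partial<SketchEtlOptions> = {}): SketchEtlOptions {
  return { ...ETL_DEFAULTS, ...overrides };
}

const defaults = resolveEtlOptions();
const custom = resolveEtlOptions({ workerType: 'G8', glueVersion: '3.0' });
console.log(defaults.workerType, defaults.glueVersion); // G2 4.0
console.log(custom.workerType, custom.glueVersion);     // G8 3.0
```

Because the union types list only the supported values, an unsupported worker type or Glue version is rejected by the compiler before any override is applied.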

See the pyspark-etl-jobs.test.ts and scalaspark-etl-jobs.test.ts unit tests for examples of creating these types of jobs with only the required parameters and with optional parameters.

For the sake of brevity, examples are shown using the pySpark job variety.

Example with only required parameters:

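import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as glue from '@aws-cdk/aws-glue-alpha';
// These are assumed to be defined elsewhere in your app.
declare const stack: cdk.Stack;
declare const role: iam.IRole;
declare const script: glue.Code;
// Only the required parameters are set; everything else uses the ETL defaults.
new glue.PySparkEtlJob(stack, 'PySparkETLJob', {
  role,
  script,
  jobName: 'PySparkETLJob',
});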