AWS Glue Best Practices: Building a Performant and Cost Optimized Data Pipeline
Publication date: August 26, 2022
Abstract
Data integration is one of the most critical elements in a data analytics ecosystem, and a key step in building a data lake or a data warehouse. It enables data from different sources to be cleaned, harmonized, transformed, and finally loaded. When building a data warehouse, the bulk of the development effort goes into building the data integration pipeline. An efficient, well-designed data integration pipeline is critical for making data available to, and trusted by, analytics consumers.
This whitepaper shows you some of the considerations and best practices for building high-performance, cost-optimized data pipelines with AWS Glue.
To get the most out of reading this whitepaper, it’s helpful to be familiar with AWS Glue.
- For best practices around Operational Excellence for your data pipelines, refer to AWS Glue Best Practices: Building an Operationally Efficient Data Pipeline.
- For best practices around Security and Reliability for your data pipelines, refer to AWS Glue Best Practices: Building a Secure and Reliable Data Pipeline.
Are you Well-Architected?
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud.
For more expert guidance and best practices for your cloud architecture, including reference architecture deployments, diagrams, and whitepapers, refer to the AWS Architecture Center.
Introduction
Data volumes and complexity are increasing at an unprecedented rate, exploding from terabytes to petabytes or even exabytes. Traditional on-premises approaches to building a data pipeline do not translate well to a cloud-based strategy, and most of the time do not provide the elasticity and cost effectiveness of cloud-native approaches. We hear from customers all the time that they are looking to extract more value from their data but are struggling to capture, store, and analyze all the data generated by today’s modern, digital businesses. Data is growing exponentially and coming from new sources; it is increasingly diverse, and it needs to be securely accessed and analyzed by any number of applications and people.
With changing data and business needs, building a high-performing, cost-effective, and low-maintenance data pipeline is paramount. Introduced in 2017, AWS Glue is a fully managed, serverless data integration service that allows customers to scale with their workload, with no infrastructure to manage. In the following sections, we discuss common best practices around performance efficiency and cost optimization for data pipelines built with AWS Glue. This document is intended for advanced users, data engineers, and architects.
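As a point of reference before diving into best practices, the following is a minimal sketch of what an AWS Glue ETL job script looks like in Python (PySpark). The database, table, and Amazon S3 path names (example_db, example_table, s3://example-bucket/output/) are placeholders for illustration, not values from this whitepaper.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job initialization; Glue supplies JOB_NAME at run time
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the AWS Glue Data Catalog
# ("example_db" and "example_table" are placeholder names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

# Write the data to Amazon S3 as Parquet (placeholder bucket path)
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet",
)

job.commit()

Because the service is serverless, this script is all a customer authors; provisioning, scaling, and tearing down the underlying Apache Spark infrastructure are handled by AWS Glue.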
Refer to AWS Glue Best Practices: Building an Operationally Efficient Data Pipeline to understand more about the AWS Glue product family before proceeding to the next sections.