Getting started with serverless ETL on AWS Glue
Dheer Toprani and Adnan Alvee, Amazon Web Services (AWS)
March 2024
On the Amazon Web Services (AWS) Cloud, AWS Glue is a fully managed serverless environment where you can extract, transform, and load (ETL) data at scale. With AWS Glue, you can categorize data, clean it, enrich it, and move it reliably across various data stores and streams in a cost-effective manner.
Because AWS Glue is serverless, you don’t provision or manage servers. You pay only for the resources you use, and you can scale up or down as needed.
AWS Glue consists of the following components:
- AWS Glue ETL – AWS Glue ETL provides batch and streaming options to extract, transform, and load data from one source to another.
- AWS Glue Data Catalog – Data Catalog is a central repository for organizing the metadata of all your data assets. Data Catalog provides a unified interface where you can search, discover, and share data assets across data analytics services.
- AWS Glue DataBrew – DataBrew is a no-code data preparation tool that you can use to visually explore, clean, and transform data. You can choose from more than 250 prebuilt transformations to automate data preparation tasks without writing any code.
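As a rough illustration of the extract, transform, and load pattern that these components support, the following is a minimal sketch in plain Python. It does not use any AWS Glue API. The sample data, the `REGION_BY_COUNTRY` lookup table, and all function names are hypothetical and chosen only to show the clean and enrich steps described earlier.

```python
import csv
import io
import json

# Hypothetical sample input; a real job would read from a data store or stream.
RAW_CSV = """order_id,country,amount
1001,us,19.99
1002,DE,
1003,de,5.00
"""

# Hypothetical lookup table used for the enrichment step.
REGION_BY_COUNTRY = {"US": "Americas", "DE": "Europe"}


def extract(text):
    """Extract: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop incomplete rows, normalize values, and add a region."""
    cleaned = []
    for row in records:
        if not row["amount"]:
            continue  # clean: skip rows with a missing amount
        country = row["country"].upper()  # clean: normalize country codes
        cleaned.append({
            "order_id": int(row["order_id"]),
            "country": country,
            "amount": float(row["amount"]),
            # enrich: attach a region derived from the lookup table
            "region": REGION_BY_COUNTRY.get(country, "Unknown"),
        })
    return cleaned


def load(records):
    """Load: serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record) for record in records)


def run_etl(text):
    """Run the three stages in order and return the serialized output."""
    return load(transform(extract(text)))


if __name__ == "__main__":
    print(run_etl(RAW_CSV))
```

In this sketch, the row with the missing amount is dropped and the two remaining rows are written out with normalized country codes and an added `region` field. A managed service takes over the parts that this example leaves out, such as provisioning compute, scaling, and connecting to data stores.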
This guide provides a high-level introduction to AWS Glue, including how it works and how you can get started using it. It covers the key concepts that you need to know before authoring AWS Glue jobs, such as automation, monitoring, and integration with other AWS services. The Next steps section helps you get started with writing code in AWS Glue. If you already have some experience with AWS Glue, the Best practices section can help you fill gaps in your knowledge. By the end of this guide, you will have the knowledge and resources you need to start using AWS Glue effectively.