Deploy and manage a serverless data lake on the AWS Cloud by using infrastructure as code
Created by Kirankumar Chandrashekar (AWS) and Abdel Jaidi (AWS)
Summary
Notice: AWS CodeCommit is no longer available to new customers. Existing customers of AWS CodeCommit can continue to use the service as normal.
This pattern describes how to use serverless computing and infrastructure as code (IaC) to implement and administer a data lake on the Amazon Web Services (AWS) Cloud. It is based on the serverless data lake framework (SDLF) developed by AWS.
SDLF is a collection of reusable resources that accelerate the delivery of enterprise data lakes on the AWS Cloud and help with faster deployment to production. It implements the foundational structure of a data lake by following best practices.
SDLF implements a continuous integration / continuous deployment (CI/CD) process throughout the code and infrastructure deployment by using AWS services such as AWS CodePipeline, AWS CodeBuild, and AWS CodeCommit.
This pattern uses multiple AWS serverless services to simplify data lake management. These include Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB for storage, AWS Lambda and AWS Glue for computing, and Amazon CloudWatch Events, Amazon Simple Queue Service (Amazon SQS), and AWS Step Functions for orchestration.
AWS CloudFormation and AWS code services act as the IaC layer to provide reproducible and fast deployments with easy operations and administration.
Prerequisites and limitations
Prerequisites
An active AWS account.
AWS Command Line Interface (AWS CLI), installed and configured.
A Git client, installed and configured.
The SDLF workshop, open in a web browser window and ready to use.
Architecture
The architecture diagram illustrates an event-driven process with the following steps.
After a file is added to the raw data S3 bucket, an Amazon S3 event notification is placed in an SQS queue. Each notification is delivered as a JSON file, which contains metadata such as the S3 bucket name, object key, or timestamp.
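For illustration, the following sketch (in Python with boto3) shows how a consumer might read these notifications from the queue and extract the object metadata. The queue URL and function name are placeholders for the example, not the actual SDLF resources.

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/raw-bucket-events"  # placeholder

def poll_raw_bucket_events():
    """Read S3 event notifications from the SQS queue and extract object metadata."""
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10)
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            timestamp = record["eventTime"]
            print(f"New object: s3://{bucket}/{key} at {timestamp}")
        # Delete the message after it has been handled.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```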
This notification is consumed by a Lambda function that routes the event to the correct extraction, transformation, and loading (ETL) process based on the metadata. The Lambda function can also use contextual configurations stored in an Amazon DynamoDB table. This step enables decoupling and scaling to multiple applications in the data lake.
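A minimal sketch of such a routing function is shown below. It assumes a DynamoDB configuration table keyed by dataset name; the table schema, environment variables, and S3 key layout are illustrative assumptions rather than the actual SDLF implementation.

```python
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
lambda_client = boto3.client("lambda")

# Placeholder name; the real resources are defined by the SDLF CloudFormation templates.
CONFIG_TABLE = os.environ.get("CONFIG_TABLE", "datalake-dataset-config")

def handler(event, context):
    """Route each S3 event notification to the ETL entry point registered for its dataset."""
    table = dynamodb.Table(CONFIG_TABLE)
    for record in event.get("Records", []):
        body = json.loads(record["body"])          # SQS message body
        for s3_record in body.get("Records", []):  # S3 event notification records
            key = s3_record["s3"]["object"]["key"]
            dataset = key.split("/")[0]            # assumes a raw/<dataset>/... prefix layout
            config = table.get_item(Key={"dataset": dataset}).get("Item", {})
            target = config.get("etl_entry_function")
            if target:
                lambda_client.invoke(
                    FunctionName=target,
                    InvocationType="Event",  # asynchronous hand-off to the ETL process
                    Payload=json.dumps({"bucket": s3_record["s3"]["bucket"]["name"], "key": key}),
                )
```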
The event is routed to the first Lambda function in the ETL process, which transforms and moves data from the raw data area to the staging area for the data lake. The first step is to update the comprehensive catalog. This is a DynamoDB table that contains all the file metadata of the data lake. Each row in this table holds operational metadata about a single object stored in Amazon S3. A synchronous call is made to a Lambda function that performs a light transformation, which is a computationally inexpensive operation (such as converting a file from one format to another), on the S3 object. When the new object is added to the staging S3 bucket, the comprehensive catalog is updated and a message is sent to the SQS queue for the next ETL phase.
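The following sketch illustrates the shape of this first stage (light transformation, catalog update, and hand-off to the next queue). All resource names, the catalog item attributes, and the raw/stage prefix convention are placeholders, not the actual SDLF code.

```python
import json
import os

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

# Placeholder names; the real resources are created by the SDLF CloudFormation templates.
CATALOG_TABLE = os.environ.get("CATALOG_TABLE", "datalake-comprehensive-catalog")
STAGE_BUCKET = os.environ.get("STAGE_BUCKET", "datalake-stage")
NEXT_PHASE_QUEUE = os.environ.get(
    "NEXT_PHASE_QUEUE_URL",
    "https://sqs.us-east-1.amazonaws.com/123456789012/stage-b-queue",
)

def handler(event, context):
    """Light transformation: stage the raw object, record it in the catalog, queue the next phase."""
    bucket, key = event["bucket"], event["key"]

    # 1. Light, computationally inexpensive transformation (a copy stands in here for a
    #    format conversion such as CSV to Parquet).
    staged_key = key.replace("raw/", "stage/", 1)
    s3.copy_object(Bucket=STAGE_BUCKET, Key=staged_key,
                   CopySource={"Bucket": bucket, "Key": key})

    # 2. Record operational metadata about the new object in the comprehensive catalog.
    dynamodb.Table(CATALOG_TABLE).put_item(Item={
        "id": f"s3://{STAGE_BUCKET}/{staged_key}",
        "source_object": f"s3://{bucket}/{key}",
        "stage": "staging",
    })

    # 3. Notify the next ETL phase through its SQS queue.
    sqs.send_message(QueueUrl=NEXT_PHASE_QUEUE,
                     MessageBody=json.dumps({"bucket": STAGE_BUCKET, "key": staged_key}))
```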
A CloudWatch Events rule invokes a Lambda function every 5 minutes. This function checks whether messages were delivered to the SQS queue from the previous ETL phase. If so, the Lambda function starts the second phase of the ETL process, which is orchestrated by AWS Step Functions.
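A sketch of such a scheduled function, using placeholder queue and state machine names, might look like the following:

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
sfn = boto3.client("stepfunctions")

# Placeholder values; the real ones come from the SDLF stack outputs.
STAGE_B_QUEUE = os.environ.get(
    "STAGE_B_QUEUE_URL",
    "https://sqs.us-east-1.amazonaws.com/123456789012/stage-b-queue",
)
STATE_MACHINE_ARN = os.environ.get(
    "STATE_MACHINE_ARN",
    "arn:aws:states:us-east-1:123456789012:stateMachine:stage-b-etl",
)

def handler(event, context):
    """Invoked every 5 minutes by the CloudWatch Events rule: drain the queue and start the heavy ETL."""
    objects = []
    while True:
        response = sqs.receive_message(QueueUrl=STAGE_B_QUEUE, MaxNumberOfMessages=10)
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            objects.append(json.loads(message["Body"]))
            sqs.delete_message(QueueUrl=STAGE_B_QUEUE, ReceiptHandle=message["ReceiptHandle"])

    if objects:
        # Start one Step Functions execution for the whole batch of files.
        sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN,
                            input=json.dumps({"objects": objects}))
```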
A heavy transformation is then applied to a batch of files. This heavy transformation is a computationally expensive operation, such as a synchronous call to an AWS Glue job, AWS Fargate task, Amazon EMR step, or Amazon SageMaker notebook. Table metadata is extracted from the output files by using an AWS Glue crawler, which updates the AWS Glue catalog. File metadata is also added to the comprehensive catalog table in DynamoDB. Finally, a data quality step that uses Deequ is run.
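For illustration, the following sketch shows how a step in that workflow could start an AWS Glue job synchronously and then run the crawler by using boto3. The job and crawler names, the argument passed to the job, and the polling loop are assumptions for the example; in the actual pattern, Step Functions manages the wait states.

```python
import time

import boto3

glue = boto3.client("glue")

# Placeholder names; in SDLF these are provisioned by CloudFormation and passed to the state machine.
GLUE_JOB = "datalake-heavy-transform"
GLUE_CRAWLER = "datalake-stage-crawler"

def run_heavy_transformation(object_keys):
    """Run the heavy transformation as a synchronous Glue job, then refresh the Glue Data Catalog.

    object_keys: list of staged S3 object keys to process in this batch.
    """
    run = glue.start_job_run(JobName=GLUE_JOB,
                             Arguments={"--objects": ",".join(object_keys)})
    run_id = run["JobRunId"]

    # Poll the job until it finishes (a Step Functions wait state would normally do this).
    while True:
        status = glue.get_job_run(JobName=GLUE_JOB, RunId=run_id)["JobRun"]["JobRunState"]
        if status in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(30)

    if status != "SUCCEEDED":
        raise RuntimeError(f"Glue job {GLUE_JOB} ended with state {status}")

    # Update table metadata in the AWS Glue Data Catalog from the output files.
    glue.start_crawler(Name=GLUE_CRAWLER)
```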
Technology stack
Amazon CloudWatch Events
AWS CloudFormation
AWS CodePipeline
AWS CodeBuild
AWS CodeCommit
Amazon DynamoDB
AWS Glue
AWS Lambda
Amazon S3
Amazon SQS
AWS Step Functions
Tools
Amazon CloudWatch Events – CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources.
AWS CloudFormation – CloudFormation helps create and provision AWS infrastructure deployments predictably and repeatedly.
AWS CodeBuild – CodeBuild is a fully managed build service that compiles your source code, runs unit tests, and produces artifacts that are ready to deploy.
AWS CodeCommit – CodeCommit is a version control service hosted by AWS that you can use to privately store and manage assets (such as source code and binary files).
AWS CodePipeline – CodePipeline is a continuous delivery service that you can use to model, visualize, and automate the steps required to release your software changes continuously.
Amazon DynamoDB – DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with scalability.
AWS Glue – AWS Glue is a fully managed ETL service that makes it easier to prepare and load data for analytics.
AWS Lambda – Lambda supports running code without provisioning or managing servers. Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second.
Amazon S3 – Amazon Simple Storage Service (Amazon S3) is a highly scalable object storage service. Amazon S3 can be used for a wide range of storage solutions, including websites, mobile applications, backups, and data lakes.
AWS Step Functions – Step Functions is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications.
Amazon SQS – Amazon Simple Queue Service (Amazon SQS) is a fully managed message queuing service that helps you decouple and scale microservices, distributed systems, and serverless applications.
Deequ – Deequ is a tool that helps you compute data quality metrics for large datasets, define and verify data quality constraints, and stay informed about changes in the data distribution.
Code repository
The source code and resources for the SDLF are available in the AWS Labs GitHub repository.
Epics
Set up the CI/CD pipeline to provision IaC

Task | Description | Skills required |
---|---|---|
Set up the CI/CD pipeline to manage IaC for the data lake. | Sign in to the AWS Management Console and follow the steps from the Initial setup section of the SDLF workshop. | DevOps engineer |
Version-control the IaC

Task | Description | Skills required |
---|---|---|
Clone the CodeCommit repository on your local machine. | Follow the steps from the Deploying the foundations section of the SDLF workshop. For more information, see Connecting to CodeCommit repositories from the CodeCommit documentation. | DevOps engineer |
Modify the CloudFormation templates. | Use your local workstation and a code editor to modify the CloudFormation templates according to your use cases or requirements. Commit them to the locally cloned Git repository. For more information, see Working with AWS CloudFormation templates from the AWS CloudFormation documentation. | DevOps engineer |
Push the changes to the CodeCommit repository. | Your infrastructure code is now under version control, and modifications to your code base are tracked. When you push a change to the CodeCommit repository, CodePipeline automatically applies it to your infrastructure and delivers it to CodeBuild. Important: If you use the AWS SAM CLI in CodeBuild, run the | DevOps engineer |
Related resources
Set up the CI/CD pipeline to provision IaC
Version-control the IaC
Other resources