Guidance for Data Lakes on AWS

Overview

This Guidance demonstrates an automatically configured data lake on AWS using an event-driven, serverless, and scalable architecture. It leverages AWS managed services to ingest, store, process, and analyze data, offering a secure, flexible, and cost-effective design with proper data governance. This approach provides greater agility, flexibility, and reliability compared to traditional data management systems. The entire solution is built as a codified application using infrastructure-as-code (IaC) and a continuous integration, continuous delivery (CI/CD) pipeline.

Benefits

Accelerate data-driven decisions

Architecture 1: Deploy a serverless data lake framework that transforms raw data into actionable insights using AWS analytics services. Enable your business analysts to query processed data through Amazon Athena while maintaining comprehensive data governance through AWS Lake Formation.

Architecture 2: Deploy a unified data lakehouse architecture that seamlessly ingests data from diverse sources, including streaming inventory, relational databases, and enterprise systems like SAP. Reduce development time by leveraging zero-ETL capabilities and open table formats that eliminate complex data pipeline engineering.

Streamline data processing workflows

Architecture 1: Implement automated, event-driven data pipelines that efficiently transform and catalog your data across its lifecycle. AWS Step Functions orchestrates the workflow while AWS Glue handles ETL processes, converting data to optimized formats for improved query performance.

Architecture 2: Enable business teams to generate actionable insights through natural language queries with Amazon Bedrock and visualize results with Amazon QuickSight. Query data through an existing Spark platform or through AWS services on one unified platform. Marketing and sales teams can independently access the data they need while IT maintains centralized governance through AWS Lake Formation.

Enhance data accessibility securely

Architecture 1: Create a unified data environment where teams can access and analyze data through their preferred tools while maintaining centralized governance. AWS Lake Formation provides fine-grained access controls while Amazon SageMaker and Amazon Bedrock enable advanced analytics and AI-powered insights from your data lake.

Architecture 2: Implement a secure, well-governed data environment using Lakehouse for Amazon SageMaker with federated catalogs and AWS Lake Formation controls. This architecture allows you to maintain data security and compliance while enabling cross-account data sharing between producer and consumer accounts.

How it works

AWS Serverless Data Lake Framework

This architecture diagram shows how to build a data lake on AWS and demonstrates how to process, store, and consume data using serverless AWS analytics services.

Step 1
The data administrator uploads JSON files to the Amazon Simple Storage Service (Amazon S3) raw bucket. Object creation in Amazon S3 emits an event to Amazon EventBridge.
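A minimal boto3 sketch of this step follows; the bucket and key names are placeholders, and the raw bucket is assumed to have EventBridge notifications enabled so that object creation emits an event:

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key names for illustration. The raw bucket must
# have EventBridge notifications enabled for object creation to emit events.
RAW_BUCKET = "my-datalake-raw"

record = {"order_id": "1001", "status": "shipped"}
s3.put_object(
    Bucket=RAW_BUCKET,
    Key="orders/2024/01/orders-0001.json",
    Body=json.dumps(record).encode("utf-8"),
    ContentType="application/json",
)
```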
Step 2
An EventBridge rule sends a message to an Amazon Simple Queue Service (Amazon SQS) queue, which invokes an AWS Lambda function.
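As a sketch, the rule and its queue target could be wired up with boto3 as shown below; the rule name, bucket name, and queue ARN are assumptions for illustration:

```python
import json

import boto3

events = boto3.client("events")

# Match "Object Created" events from the raw bucket (names are placeholders).
events.put_rule(
    Name="raw-bucket-object-created",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-datalake-raw"]}},
    }),
)

# Route matched events to the SQS queue that fronts the Lambda function.
events.put_targets(
    Rule="raw-bucket-object-created",
    Targets=[{
        "Id": "raw-queue",
        "Arn": "arn:aws:sqs:us-east-1:111122223333:raw-object-queue",
    }],
)
```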
Step 3
The Lambda function starts the AWS Step Functions workflow, in which another Lambda function reads files from the S3 raw bucket, performs the transformation, and writes the new set of JSON files to the S3 stage bucket.
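A minimal sketch of the Lambda handler that consumes the SQS messages and starts the workflow; the state machine ARN is a placeholder:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine ARN for the raw-to-stage workflow.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:raw-to-stage"


def handler(event, context):
    # Each SQS record body carries the EventBridge event for one S3 object.
    for record in event["Records"]:
        detail = json.loads(record["body"])["detail"]
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({
                "bucket": detail["bucket"]["name"],
                "key": detail["object"]["key"],
            }),
        )
```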
Step 4
A Lambda function updates the Amazon DynamoDB table with the Step Functions job status.
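A sketch of the status-tracking write, assuming a hypothetical table name and key schema:

```python
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical table for tracking workflow executions.
table = dynamodb.Table("pipeline-job-status")


def record_status(execution_arn: str, status: str) -> None:
    table.put_item(Item={
        "execution_arn": execution_arn,  # assumed partition key
        "status": status,                # e.g. RUNNING, SUCCEEDED, FAILED
        "updated_at": datetime.now(timezone.utc).isoformat(),
    })
```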
Step 5
File creation in the S3 stage bucket emits an event to EventBridge, where a rule sends a message with the created file details to Amazon SQS.
Step 6
An Amazon EventBridge Scheduler schedule runs at a regular interval and invokes a Lambda function that retrieves messages from Amazon SQS and starts another Step Functions workflow.
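A minimal sketch of that scheduled Lambda function; the queue URL and state machine ARN are placeholders:

```python
import boto3

sqs = boto3.client("sqs")
sfn = boto3.client("stepfunctions")

# Hypothetical queue URL and state machine ARN.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/stage-object-queue"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:stage-to-analytics"


def handler(event, context):
    # Drain up to 10 messages per scheduled invocation.
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        # Pass the stage-bucket file details through as workflow input.
        sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN, input=message["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```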
Step 7
An AWS Glue extract, transform, load (ETL) job reads the data from the AWS Glue stage database, then converts the files from JSON to Parquet format.
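A minimal Glue ETL script sketch for this conversion; the database, table, and output path are assumptions:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the stage table registered in the Data Catalog (names are placeholders).
stage = glue_context.create_dynamic_frame.from_catalog(
    database="stage",
    table_name="orders",
)

# Rewrite the same records as Parquet into the analytics bucket.
glue_context.write_dynamic_frame.from_options(
    frame=stage,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-analytics/orders/"},
    format="parquet",
)
```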
Step 8
The AWS Glue ETL job writes the Parquet files to the S3 analytics bucket. An AWS Glue crawler crawls the Parquet files in the same bucket and creates analytics tables in the AWS Glue analytics database.
Step 9
All the staging and analytics catalogs are maintained in the AWS Glue Data Catalog.
Step 10
A Lambda function updates the DynamoDB table with the Step Functions job status.
Step 11
Business analysts use Amazon Athena to query the AWS Glue analytics database.
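A sketch of running such a query programmatically; the database, table, and results location are placeholders:

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and query-results location.
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS orders FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```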
Multi-Source Analytics Lakehouse with AI-Powered Insights

This architecture diagram shows how to build a data lake on AWS and demonstrates how to process, store, and consume data using Lakehouse for Amazon SageMaker and Amazon SageMaker Unified Studio.

Step 1
Ingest store inventory data with Amazon Kinesis Data Streams, which feeds the data into Amazon Data Firehose.
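A minimal producer sketch for this step; the stream name and record shape are assumptions:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name and inventory record shape.
event = {"store_id": "S-042", "sku": "SKU-123", "on_hand": 17}
kinesis.put_record(
    StreamName="store-inventory-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["store_id"],
)
```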
Step 2
Upload store inventory streaming data into Amazon Simple Storage Service (Amazon S3) Tables.
Step 3
Catalog store inventory data into Lakehouse for Amazon SageMaker, managed with AWS Lake Formation, as a federated AWS Glue catalog.
Step 4
Ingest store, product, and promotions dimension data from Amazon Aurora (MySQL) to Amazon Redshift Serverless via zero-ETL integration.
Step 5
Catalog dimension data into Lakehouse for Amazon SageMaker as a federated catalog.
Step 6
Ingest store sales data from SAP using AWS Glue via zero-ETL. AWS Glue writes the store sales data into Amazon S3 in the Apache Iceberg open table format.
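A minimal sketch of the Iceberg write from a Glue Spark job; the catalog name, warehouse path, and source location are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# Configure an Iceberg catalog backed by the AWS Glue Data Catalog
# (catalog name and warehouse path are placeholders).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-lakehouse/warehouse/")
    .getOrCreate()
)

# Extracted SAP sales data staged in S3 (hypothetical path).
sales = spark.read.parquet("s3://my-landing-zone/sap-sales/")

# Create or replace the Iceberg table registered in the Glue catalog.
sales.writeTo("glue_catalog.sales.store_sales").createOrReplace()
```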
Step 7
Catalog store sales data into Lakehouse for Amazon SageMaker as a federated catalog.
Step 8
Control access and governance through Amazon SageMaker Unified Studio from the central governance account. The producer account publishes the sales data. The consumer account subscribes and accesses the sales data.
Step 9
The marketing team generates insights from unified data using Amazon Athena. The data is pulled from Lakehouse for Amazon SageMaker. The sales team can also use Amazon QuickSight to visualize the data.
Step 10
The marketing team's data engineer with an existing Spark platform accesses sales data from Lakehouse for Amazon SageMaker by running Spark jobs on Amazon Elastic Compute Cloud (Amazon EC2) using an open Iceberg REST API.
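A sketch of pointing an existing Spark platform at the lakehouse through an Iceberg REST catalog; the endpoint URI and catalog name are assumptions, and authentication settings (such as SigV4 signing) are omitted for brevity:

```python
from pyspark.sql import SparkSession

# Register the lakehouse as an Iceberg REST catalog (names are placeholders).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://glue.us-east-1.amazonaws.com/iceberg")
    .getOrCreate()
)

# Query the shared sales table through the REST catalog.
spark.sql("SELECT * FROM lakehouse.sales.store_sales LIMIT 10").show()
```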
Step 11
The sales team generates insights from unified data in the producer account by running natural language queries against Amazon Bedrock foundation models with Amazon Bedrock Knowledge Bases using Retrieval Augmented Generation (RAG).
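A minimal sketch of such a natural language query; the knowledge base ID and model ARN are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock-agent-runtime")

# Hypothetical knowledge base ID and model ARN.
response = bedrock.retrieve_and_generate(
    input={"text": "Which promotion drove the most sales last quarter?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "EXAMPLEKBID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])
```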

Deploy with confidence

Everything you need to launch this Guidance in your account is right here.

Let's make it happen

Ready to deploy? Review the sample code on GitHub for detailed deployment instructions, and deploy it as-is or customize it to fit your needs.