AWS Serverless Data Analytics Pipeline
Publication date: April 7, 2021 (Document revisions)
Abstract
This whitepaper discusses a layered, component-oriented, and logical architecture of modern analytics platforms. It also includes a reference architecture for building a serverless data platform that includes a data lake, data processing pipelines, and a consumption layer that enables several ways to analyze the data in the data lake without moving it, including business intelligence (BI) dashboarding, exploratory interactive SQL, big data processing, predictive analytics, and machine learning (ML).
The
AWS
Well-Architected Framework
Are you Well-Architected?
The
AWS Well-Architected Framework
For more expert guidance and best practices for your cloud
architecture—reference architecture deployments, diagrams, and
whitepapers—refer to the
AWS Architecture Center
Introduction
Onboarding new data or building new analytics pipelines in traditional analytics architectures usually require extensive coordination across multiple teams such as, business, data engineering, and data science and analytics. Before starting, these teams must first negotiate requirements, schema, infrastructure capacity needs, and workload management.
It is becoming increasingly difficult and inefficient to pre-define constantly changing schemas, time consuming to negotiate capacity slots on a shared infrastructure. For these reasons, business users, data scientists, and analysts want easy, frictionless, and self-service options to build end-to-end data pipelines. The exploratory nature of ML and many analytics tasks means you need to rapidly ingest new datasets and clean, normalize, and feature engineer them without worrying about operational overhead when you must think about the infrastructure that runs data pipelines.
A serverless data lake architecture enables agile and self-service
data onboarding and analytics for all data consumer roles across a
company. By using
AWS serverless
technologies
In this whitepaper, we discuss a layered, component-oriented logical architecture of modern analytics platforms and present a reference architecture for building a serverless data platform. This architecture includes a data lake, data processing pipelines, and a consumption layer that enables several ways to analyze the data in the data lake without moving it, including business intelligence (BI) dashboarding, exploratory interactive SQL, big data processing, predictive analytics, and ML.