Recommended data layers - AWS Prescriptive Guidance

Recommended data layers

If you work with non-sensitive data, such as data that contains no personally identifiable information (PII), we recommend that you use at least three data layers in a data lake on the AWS Cloud.

However, you might require additional layers depending on the data’s complexity and use cases. For example, if you work with sensitive data such as PII, we recommend that you use an additional Amazon Simple Storage Service (Amazon S3) bucket as a landing zone and mask the data before it is moved into the raw data layer. For more information, see the Handling sensitive data section of this guide.
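The masking step in the landing zone can be illustrated with a minimal sketch. This guide section does not prescribe a masking technique, so the keyed SHA-256 hashing used here is an assumption, and the names `PII_FIELDS`, `mask_value`, and `mask_record` are hypothetical. The idea is only that records are masked before they are written to the raw layer.

```python
import hashlib
import hmac

# Hypothetical set of PII columns; adjust to your own schema.
PII_FIELDS = frozenset({"email", "phone"})


def mask_value(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest (one-way pseudonymization)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, secret_key: bytes, pii_fields=PII_FIELDS) -> dict:
    """Return a copy of the record with PII fields masked; other fields are unchanged."""
    masked = {}
    for key, value in record.items():
        if key in pii_fields and value is not None:
            masked[key] = mask_value(str(value), secret_key)
        else:
            masked[key] = value
    return masked


if __name__ == "__main__":
    # Placeholder key for illustration only; keep a real key in a secrets store.
    key = b"example-only-key"
    landing_record = {"order_id": 1001, "email": "jane@example.com", "amount": 42.5}
    print(mask_record(landing_record, key))
```

Keyed hashing keeps masked values consistent across files, so joins on a masked column still work in later layers. Whether that property is acceptable depends on your own privacy requirements.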

Each data layer must have its own S3 bucket. The following table describes our recommended data layers:

| Data layer name | Description | Sample lifecycle policy strategy |
| --- | --- | --- |
| Raw | Contains the raw, unprocessed data and is the layer in which data is ingested into the data lake. If possible, you should keep the original file format and turn on versioning in the S3 bucket. | After one year, move files into the Amazon S3 Infrequent Access (IA) storage class. After two years in Amazon S3 IA, archive them to Amazon S3 Glacier. |
| Stage | Contains intermediate, processed data that is optimized for consumption (for example, raw files converted from CSV to Apache Parquet, or data transformations). An AWS Glue job reads the files from the raw layer and validates the data. The AWS Glue job then stores the data in an Apache Parquet-formatted file, and the metadata is stored in a table in the AWS Glue Data Catalog. | Data can be deleted after a defined time period or according to your organization's requirements. Some data derivatives (for example, an Apache Avro transform of an original JSON format) can be removed from the data lake after a shorter amount of time (for example, after 90 days). |
| Analytics | Contains the aggregated data for your specific use cases in a consumption-ready format (for example, Apache Parquet). | Data can be moved to Amazon S3 IA and then deleted after a defined time period or according to your organization's requirements. |

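
The one-bucket-per-layer rule, with versioning turned on for the raw layer, can be sketched as follows. This is a minimal illustration, not a prescribed setup: the bucket naming scheme (`<prefix>-raw`, and so on) and the names `LayerBucket`, `plan_layer_buckets`, and `create_layer_buckets` are assumptions. The boto3 calls used are `create_bucket` and `put_bucket_versioning`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerBucket:
    layer: str
    bucket_name: str
    versioning: bool


def plan_layer_buckets(prefix: str) -> List[LayerBucket]:
    """Plan one bucket per data layer; versioning is enabled on the raw layer only."""
    return [
        LayerBucket("raw", f"{prefix}-raw", versioning=True),
        LayerBucket("stage", f"{prefix}-stage", versioning=False),
        LayerBucket("analytics", f"{prefix}-analytics", versioning=False),
    ]


def create_layer_buckets(plan: List[LayerBucket], region: str) -> None:
    """Create the planned buckets. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so that planning works without the AWS SDK

    s3 = boto3.client("s3", region_name=region)
    for item in plan:
        if region == "us-east-1":
            # us-east-1 does not accept a LocationConstraint.
            s3.create_bucket(Bucket=item.bucket_name)
        else:
            s3.create_bucket(
                Bucket=item.bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        if item.versioning:
            s3.put_bucket_versioning(
                Bucket=item.bucket_name,
                VersioningConfiguration={"Status": "Enabled"},
            )


if __name__ == "__main__":
    # "example-lake" is a placeholder prefix; S3 bucket names must be globally unique.
    for bucket in plan_layer_buckets("example-lake"):
        print(bucket)
```

Separating the plan from the API calls lets you review the bucket layout before anything is created in your account.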
Note

You must evaluate all the recommended lifecycle policy strategies against your organizational needs, regulatory requirements, query patterns, and cost considerations.
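
The sample strategies in the table could be expressed as S3 lifecycle configurations along the following lines. This is a sketch under stated assumptions: the function names, the `derivative_prefix` parameter, and the retention values you pass in are hypothetical, and the periods must be replaced with ones that match your own requirements. Lifecycle `Days` values count from object creation, so "two years in Amazon S3 IA" after a one-year transition corresponds to day 365 + 730 = 1095.

```python
RAW_TO_IA_DAYS = 365                                # "after one year"
RAW_TO_GLACIER_DAYS = RAW_TO_IA_DAYS + 2 * 365      # two more years in IA: day 1095
MIN_DAYS_BEFORE_IA = 30                             # S3 minimum age for a STANDARD_IA transition


def raw_lifecycle() -> dict:
    """Raw layer: Standard-IA after one year, Glacier after two further years."""
    return {"Rules": [{
        "ID": "raw-ia-then-glacier",
        "Filter": {"Prefix": ""},  # empty prefix applies the rule to the whole bucket
        "Status": "Enabled",
        "Transitions": [
            {"Days": RAW_TO_IA_DAYS, "StorageClass": "STANDARD_IA"},
            {"Days": RAW_TO_GLACIER_DAYS, "StorageClass": "GLACIER"},
        ],
    }]}


def stage_lifecycle(retention_days: int, derivative_prefix: str,
                    derivative_days: int = 90) -> dict:
    """Stage layer: delete everything after retention_days, derivatives sooner."""
    return {"Rules": [
        {"ID": "stage-expire-derivatives",
         "Filter": {"Prefix": derivative_prefix},  # hypothetical prefix for derivatives
         "Status": "Enabled",
         "Expiration": {"Days": derivative_days}},
        {"ID": "stage-expire-all",
         "Filter": {"Prefix": ""},
         "Status": "Enabled",
         "Expiration": {"Days": retention_days}},
    ]}


def analytics_lifecycle(ia_after_days: int, delete_after_days: int) -> dict:
    """Analytics layer: move to Standard-IA, then delete after a defined period."""
    if ia_after_days < MIN_DAYS_BEFORE_IA:
        raise ValueError("objects must be at least 30 days old before moving to STANDARD_IA")
    if delete_after_days <= ia_after_days:
        raise ValueError("expiration must come after the transition")
    return {"Rules": [{
        "ID": "analytics-ia-then-expire",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Transitions": [{"Days": ia_after_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": delete_after_days},
    }]}


def apply_lifecycle(bucket_name: str, config: dict) -> None:
    """Attach a configuration to a bucket. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above run without the AWS SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    print(raw_lifecycle())
    print(stage_lifecycle(retention_days=365, derivative_prefix="derived/"))
    print(analytics_lifecycle(ia_after_days=90, delete_after_days=730))
```

With versioning turned on in the raw bucket, `Transitions` apply to current object versions only. Noncurrent versions need their own `NoncurrentVersionTransitions` or `NoncurrentVersionExpiration` settings if you want them tiered or removed.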