

# Analytics
<a name="analytics-pattern-list"></a>

**Topics**
+ [Analyze Amazon Redshift data in Microsoft SQL Server Analysis Services](analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services.md)
+ [Analyze and visualize nested JSON data with Amazon Athena and Amazon Quick Sight](analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight.md)
+ [Automate data ingestion from AWS Data Exchange into Amazon S3](automate-data-ingestion-from-aws-data-exchange-into-amazon-s3.md)
+ [Automate encryption enforcement in AWS Glue using an AWS CloudFormation template](automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template.md)
+ [Build a data pipeline to ingest, transform, and analyze Google Analytics data using the AWS DataOps Development Kit](build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit.md)
+ [Build a video processing pipeline by using Amazon Kinesis Video Streams and AWS Fargate](build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate.md)
+ [Build an ETL service pipeline to load data incrementally from Amazon S3 to Amazon Redshift using AWS Glue](build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.md)
+ [Build an enterprise data mesh with Amazon DataZone, AWS CDK, and AWS CloudFormation](build-enterprise-data-mesh-amazon-data-zone.md)
+ [Calculate value at risk (VaR) by using AWS services](calculate-value-at-risk-var-by-using-aws-services.md)
+ [Configure cross-account access to a shared AWS Glue Data Catalog using Amazon Athena](configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena.md)
+ [Convert the Teradata NORMALIZE temporal feature to Amazon Redshift SQL](convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql.md)
+ [Convert the Teradata RESET WHEN feature to Amazon Redshift SQL](convert-the-teradata-reset-when-feature-to-amazon-redshift-sql.md)
+ [Deploy and manage a serverless data lake on the AWS Cloud by using infrastructure as code](deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code.md)
+ [Enforce tagging of Amazon EMR clusters at launch](enforce-tagging-of-amazon-emr-clusters-at-launch.md)
+ [Ensure Amazon EMR logging to Amazon S3 is enabled at launch](ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch.md)
+ [Generate test data using an AWS Glue job and Python](generate-test-data-using-an-aws-glue-job-and-python.md)
+ [Cost-effectively ingest IoT data directly into Amazon S3 using AWS IoT Greengrass](cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.md)
+ [Launch a Spark job in a transient EMR cluster using a Lambda function](launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function.md)
+ [Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue](migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue.md)
+ [Migrate Oracle Business Intelligence 12c to the AWS Cloud from on-premises servers](migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.md)
+ [Migrate an ELK Stack to Elastic Cloud on AWS](migrate-an-elk-stack-to-elastic-cloud-on-aws.md)
+ [Migrate data to the AWS Cloud by using Starburst](migrate-data-to-the-aws-cloud-by-using-starburst.md)
+ [Optimize the ETL ingestion of input file size on AWS](optimize-the-etl-ingestion-of-input-file-size-on-aws.md)
+ [Orchestrate an ETL pipeline with validation, transformation, and partitioning using AWS Step Functions](orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.md)
+ [Perform advanced analytics using Amazon Redshift ML](perform-advanced-analytics-using-amazon-redshift-ml.md)
+ [Query Amazon DynamoDB tables with SQL by using Amazon Athena](query-amazon-dynamodb-tables-sql-amazon-athena.md)
+ [Access, query, and join Amazon DynamoDB tables using Athena](access-query-and-join-amazon-dynamodb-tables-using-athena.md)
+ [Set up language-specific sorting for Amazon Redshift query results using a scalar Python UDF](set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf.md)
+ [Subscribe a Lambda function to event notifications from S3 buckets in different AWS Regions](subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions.md)
+ [Three AWS Glue ETL job types for converting data to Apache Parquet](three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet.md)
+ [Visualize Amazon Redshift audit logs using Amazon Athena and Amazon QuickSight](visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight.md)
+ [Visualize IAM credential reports for all AWS accounts using Amazon Quick Sight](visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight.md)
+ [More patterns](analytics-more-patterns-pattern-list.md)

# Analyze Amazon Redshift data in Microsoft SQL Server Analysis Services
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services"></a>

*Sunil Vora, Amazon Web Services*

## Summary
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services-summary"></a>

This pattern describes how to connect to Amazon Redshift and analyze its data in Microsoft SQL Server Analysis Services by using the Intellisoft OLE DB Provider or the CData ADO.NET Provider for database access.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. SQL Server Analysis Services is an online analytical processing (OLAP) tool that you can use to analyze data from data marts and data warehouses such as Amazon Redshift. You can use SQL Server Analysis Services to create OLAP cubes from your data for rapid, advanced data analysis.  

## Prerequisites and limitations
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services-prereqs"></a>

**Assumptions**
+ This pattern describes how to set up SQL Server Analysis Services and Intellisoft OLE DB Provider or CData ADO.NET Provider for Amazon Redshift on an Amazon Elastic Compute Cloud (Amazon EC2) instance. Alternatively, you can install both on a host in your corporate data center.

**Prerequisites**
+ An active AWS account
+ An Amazon Redshift cluster with credentials

## Architecture
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services-architecture"></a>

**Source technology stack**
+ An Amazon Redshift cluster

**Target technology stack**
+ Microsoft SQL Server Analysis Services

**Source and target architecture**

![Analyzing Amazon Redshift data in Microsoft SQL Server Analysis Services](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/e444fec0-e00f-4cc6-acc6-4ffc61b654a0/images/6f29dab5-1ea7-452f-9b07-d1d23ae469a2.png)


## Tools
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services-tools"></a>
+ [Microsoft Visual Studio 2019 (Community Edition)](https://visualstudio.microsoft.com/vs/)
+ [Intellisoft OLE DB Provider for Amazon Redshift (Trial)](https://www.pgoledb.com/index.php?option=com_filecabinet&view=files&id=1&Itemid=68) or [CData ADO.NET Provider for Amazon Redshift (Trial)](https://www.cdata.com/kb/tech/redshift-ado-ssas.rst)

## Epics
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services-epics"></a>

### Analyze tables
<a name="analyze-tables"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Analyze the tables and data to be imported. | Identify the Amazon Redshift tables to be imported and their sizes. | DBA | 

### Set up EC2 instance and install tools
<a name="set-up-ec2-instance-and-install-tools"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Set up an EC2 instance. | In your AWS account, create an EC2 instance in a private or public subnet. | Systems administrator | 
| Install tools for database access. | Download and install the [Intellisoft OLE DB Provider for Amazon Redshift](https://www.pgoledb.com/index.php?option=com_filecabinet&view=files&id=1&Itemid=68) (or [CData ADO.NET Provider for Amazon Redshift](https://www.cdata.com/kb/tech/redshift-ado-ssas.rst)).  | Systems administrator | 
| Install Visual Studio. | Download and install [Visual Studio 2019 (Community Edition)](https://visualstudio.microsoft.com/vs/).  | Systems administrator | 
| Install extensions. | Install the **Microsoft Analysis Services Projects** extension in Visual Studio. | Systems administrator | 
| Create a project. | Create a new tabular model project in Visual Studio to store your Amazon Redshift data. In Visual Studio, choose the **Analysis Services Tabular Project** option when creating your project. | DBA | 

### Create data source and import tables
<a name="create-data-source-and-import-tables"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an Amazon Redshift data source. | Create an Amazon Redshift data source by using the Intellisoft OLE DB Provider for Amazon Redshift (or CData ADO.NET Provider for Amazon Redshift) and your Amazon Redshift credentials. | Amazon Redshift, DBA | 
| Import tables. | Select and import tables and views from Amazon Redshift into your SQL Server Analysis Services project. | Amazon Redshift, DBA | 

### Clean up after migration
<a name="clean-up-after-migration"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Delete the EC2 instance. | Delete the EC2 instance you launched previously. | Systems administrator | 

## Related resources
<a name="analyze-amazon-redshift-data-in-microsoft-sql-server-analysis-services-resources"></a>
+ [Amazon Redshift](https://docs.aws.amazon.com/redshift/) (AWS documentation)
+ [Install SQL Server Analysis Services](https://docs.microsoft.com/en-us/analysis-services/instances/install-windows/install-analysis-services?view=asallproducts-allversions) (Microsoft documentation)
+ [Tabular Model Designer](https://docs.microsoft.com/en-us/analysis-services/tabular-models/tabular-model-designer-ssas?view=asallproducts-allversions) (Microsoft documentation)
+ [Overview of OLAP cubes for advanced analytics](https://docs.microsoft.com/en-us/system-center/scsm/olap-cubes-overview?view=sc-sm-2019) (Microsoft documentation)
+ [Microsoft Visual Studio 2019 (Community Edition)](https://visualstudio.microsoft.com/vs/)
+ [Intellisoft OLE DB Provider for Amazon Redshift (Trial)](https://www.pgoledb.com/index.php?option=com_filecabinet&view=files&id=1&Itemid=68) 
+ [CData ADO.NET Provider for Amazon Redshift (Trial)](https://www.cdata.com/kb/tech/redshift-ado-ssas.rst)

# Analyze and visualize nested JSON data with Amazon Athena and Amazon Quick Sight
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight"></a>

*Anoop Singh, Amazon Web Services*

## Summary
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight-summary"></a>

This pattern explains how to translate a nested, JSON-formatted data structure into a tabular view by using Amazon Athena, and then visualize the data in Amazon Quick Sight.

You can use JSON-formatted data for API-powered data feeds from operational systems to create data products. This data can also help you understand your customers and their interactions with your products better, so you can tailor user experiences and predict outcomes.

## Prerequisites and limitations
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ A JSON file that represents a nested data structure (this pattern provides a sample file)

**Limitations**
+ JSON features integrate well with existing SQL-oriented functions in Athena. However, they aren’t ANSI SQL compatible, and Athena expects the JSON file to carry each record on a separate line. You might need to use the `ignore.malformed.json` table property in Athena to indicate whether malformed JSON records should be turned into NULL values or should generate errors. For more information, see [Best practices for reading JSON data](https://docs.aws.amazon.com/athena/latest/ug/parsing-JSON.html) in the Athena documentation.
+ This pattern covers only simple, small amounts of JSON-formatted data. If you want to apply these concepts at scale, consider partitioning your data and consolidating it into larger files.
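Because Athena expects one JSON record per line, a pretty-printed export often needs to be rewritten as JSON Lines before you upload it to Amazon S3. The following Python sketch shows one way to do this; the helper name is ours, not part of the pattern.

```python
import json

def to_json_lines(records):
    """Serialize each record onto its own line (JSON Lines), as Athena expects."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# Two records collapse to exactly two lines, one complete JSON object per line.
records = [
    {"symbol": "AAPL", "reportDate": "2017-03-31"},
    {"symbol": "AMZN", "reportDate": "2017-03-31"},
]
print(to_json_lines(records))
```

You would write the resulting string to the object you upload to the S3 bucket in the first epic.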

## Architecture
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight-architecture"></a>

The following diagram shows the architecture and workflow for this pattern. The nested data structures are stored in Amazon Simple Storage Service (Amazon S3) in JSON format. In Athena, the JSON data is mapped to an Athena data structure. You then create a view to analyze the data, and visualize the data structure in Quick Sight.

![Analyzing and visualizing nested JSON data on AWS](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/e9ad39a1-e0a4-4429-bdc0-594b68707761/images/474e8747-626f-468c-9c27-c007af79bf2d.png)


## Tools
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight-tools"></a>

**AWS services**
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data. This pattern uses Amazon S3 to store the JSON file.
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL. This pattern uses Athena to query and transform the JSON data. With a few actions in the AWS Management Console, you can point Athena at your data in Amazon S3 and use standard SQL to run one-time queries. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries that you run. Athena scales automatically and runs queries in parallel, so results are fast, even with large datasets and complex queries.     
+ [Amazon Quick Sight](https://docs.aws.amazon.com/quicksuite/latest/userguide/quick-bi.html) is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data on a single dashboard. Quick Sight lets you easily create and publish interactive dashboards that include machine learning (ML) insights. You can access these dashboards from any device, and embed them into your applications, portals, and websites.

**Example code**

The following JSON file provides a nested data structure that you can use in this pattern.

```
{
  "symbol": "AAPL",
  "financials": [
    {
      "reportDate": "2017-03-31",
      "grossProfit": 20591000000,
      "costOfRevenue": 32305000000,
      "operatingRevenue": 52896000000,
      "totalRevenue": 52896000000,
      "operatingIncome": 14097000000,
      "netIncome": 11029000000,
      "researchAndDevelopment": 2776000000,
      "operatingExpense": 6494000000,
      "currentAssets": 101990000000,
      "totalAssets": 334532000000,
      "totalLiabilities": 200450000000,
      "currentCash": 15157000000,
      "currentDebt": 13991000000,
      "totalCash": 67101000000,
      "totalDebt": 98522000000,
      "shareholderEquity": 134082000000,
      "cashChange": -1214000000,
      "cashFlow": 12523000000,
      "operatingGainsLosses": null
    }
  ]
}
```
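The Athena view you create later effectively unnests the `financials` array into one flat row per report. As a mental model only (the real unnesting happens in Athena SQL, not in Python), the same flattening looks like this:

```python
import json

# A trimmed version of the sample file above, used purely for illustration.
SAMPLE = '{"symbol": "AAPL", "financials": [{"reportDate": "2017-03-31", "grossProfit": 20591000000}]}'

def flatten(doc):
    """Yield one flat, tabular row per entry in the nested financials array."""
    for report in doc.get("financials", []):
        row = {"symbol": doc["symbol"]}
        row.update(report)
        yield row

rows = list(flatten(json.loads(SAMPLE)))
print(rows)
```

Each emitted row carries the top-level `symbol` alongside the fields of one nested report, which is exactly the tabular shape that Quick Sight can visualize.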

## Epics
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight-epics"></a>

### Set up an S3 bucket
<a name="set-up-an-s3-bucket"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket. | To create a bucket to store the JSON file, sign in to the AWS Management Console, open the [Amazon S3 console](https://console.aws.amazon.com/s3/), and then choose **Create bucket**. For more information, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the Amazon S3 documentation.  | Systems administrator | 
| Add the nested JSON data. | Upload your JSON file to the S3 bucket. For a sample JSON file, see the previous section. For instructions, see [Uploading objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) in the Amazon S3 documentation. | Systems administrator | 

### Analyze data in Athena
<a name="analyze-data-in-ate"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a table for mapping the JSON data. | [See the AWS documentation website for more details](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight.html). For more information about creating tables, see the [Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html). | Developer | 
| Create a view for data analysis. | [See the AWS documentation website for more details](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight.html). For more information about creating views, see the [Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/create-view.html). | Developer | 
| Analyze and validate the data. | [See the AWS documentation website for more details](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight.html). | Developer | 
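For orientation, the table-and-view pair for the sample file might look like the following. The table name, view name, column subset, and bucket path are hypothetical placeholders; you would run these statements in the Athena query editor. The Python wrapper below only holds the SQL strings so they can be checked as text.

```python
# Hypothetical DDL: map the nested JSON structure with the OpenX JSON SerDe.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS financials_raw (
  symbol string,
  financials array<struct<
    reportDate: string,
    grossProfit: bigint,
    totalRevenue: bigint,
    netIncome: bigint
  >>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://DOC-EXAMPLE-BUCKET/myfolder/'
"""

# Hypothetical view: unnest the financials array into one row per report.
VIEW = """
CREATE OR REPLACE VIEW financials_view AS
SELECT symbol, f.reportDate, f.grossProfit, f.netIncome
FROM financials_raw
CROSS JOIN UNNEST(financials) AS t(f)
"""

print("statements prepared")
```

The `CROSS JOIN UNNEST` construct is how Athena flattens an `array<struct<...>>` column, and it is the view that Quick Sight then reads as an ordinary table.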

### Visualize data in Quick Sight
<a name="visualize-data-in-qsight"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Set up Athena as a data source in Quick Sight. | [See the AWS documentation website for more details](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight.html). | Systems administrator | 
| Visualize data in Quick Sight. | [See the AWS documentation website for more details](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight.html). | Data analyst | 

## Related resources
<a name="analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight-resources"></a>
+ [Amazon Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html)
+ [Amazon Quick Sight tutorials](https://docs.aws.amazon.com/quicksuite/latest/userguide/example-analysis.html)
+ [Working with nested JSON ](https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/)(blog post)

# Automate data ingestion from AWS Data Exchange into Amazon S3
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3"></a>

*Adnan Alvee and Manikanta Gona, Amazon Web Services*

## Summary
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3-summary"></a>

This pattern provides an AWS CloudFormation template that enables you to automatically ingest data from AWS Data Exchange into your data lake in Amazon Simple Storage Service (Amazon S3). 

AWS Data Exchange is a service that makes it easy to securely exchange file-based data sets in the AWS Cloud. AWS Data Exchange data sets are subscription-based. As a subscriber, you can also access data set revisions as providers publish new data. 

The CloudFormation template creates an event in Amazon CloudWatch Events and an AWS Lambda function. The event watches for any updates to the data set you have subscribed to. If there is an update, CloudWatch initiates a Lambda function, which copies the data over to the S3 bucket you specify. When the data has been copied successfully, Lambda sends you an Amazon Simple Notification Service (Amazon SNS) notification.
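The copy step in this workflow can be sketched as a Lambda handler that exports newly published revisions to your S3 bucket. This is an illustrative sketch, not the function shipped in the attached template: the event shape, bucket name, and key pattern are assumptions, and the client is injectable so the logic can be exercised without AWS credentials. The `create_job`/`start_job` calls with the `EXPORT_REVISIONS_TO_S3` job type are the boto3 AWS Data Exchange operations for copying revision assets to Amazon S3.

```python
import os

def handler(event, context, dx_client=None):
    """Export the revisions named in the CloudWatch event to the data lake bucket.

    Assumed event shape: the data set ID in event["resources"][0] and the new
    revision IDs in event["detail"]["RevisionIds"].
    """
    if dx_client is None:  # real invocation path; tests inject a fake client
        import boto3
        dx_client = boto3.client("dataexchange")

    data_set_id = event["resources"][0]
    revision_ids = event["detail"]["RevisionIds"]

    job = dx_client.create_job(
        Type="EXPORT_REVISIONS_TO_S3",
        Details={
            "ExportRevisionsToS3": {
                "DataSetId": data_set_id,
                "RevisionDestinations": [
                    {
                        "RevisionId": rid,
                        "Bucket": os.environ.get("BUCKET", "DOC-EXAMPLE-BUCKET"),
                        "KeyPattern": "myfolder/${Revision.Id}/${Asset.Name}",
                    }
                    for rid in revision_ids
                ],
            }
        },
    )
    dx_client.start_job(JobId=job["Id"])
    return job["Id"]
```

In the real deployment, the template also wires this function to an SNS topic so that you are notified after a successful copy.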

## Prerequisites and limitations
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ Subscription to a data set in AWS Data Exchange

**Limitations**
+ The CloudFormation template must be deployed separately for each subscribed data set in AWS Data Exchange.

## Architecture
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3-architecture"></a>

**Target technology stack**
+ AWS Lambda
+ Amazon S3
+ AWS Data Exchange
+ Amazon CloudWatch
+ Amazon SNS

**Target architecture**

![CloudWatch initiates a Lambda function to copy data to the S3 bucket and send an Amazon SNS notification](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/059816dc-5a71-4942-9c7f-ac977072eebc/images/ec021620-47c1-4fb5-95a9-3b8985accc56.png)


**Automation and scale**

You can use the CloudFormation template multiple times for the data sets you want to ingest into the data lake.

## Tools
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3-tools"></a>
+ [AWS Data Exchange](https://docs.aws.amazon.com/data-exchange/latest/userguide/what-is.html) makes it easy for AWS customers to securely exchange file-based data sets in the AWS Cloud. As a subscriber, you can find and subscribe to hundreds of products from qualified data providers. Then, you can quickly download the data set or copy it to Amazon S3 for use across a variety of AWS analytics and machine learning services. Anyone with an AWS account can be an AWS Data Exchange subscriber.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) lets you run code without provisioning or managing servers. Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second. You pay only for the compute time you consume; there is no charge when your code isn't running. With Lambda, you can run code for virtually any type of application or backend service with zero administration. Lambda runs your code on a high-availability compute infrastructure and manages all the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring, and logging.
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html) provides storage for the internet. You can use Amazon S3 to store and retrieve any amount of data at any time, from anywhere on the web.
+ [Amazon CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams. CloudWatch Events becomes aware of operational changes as they occur. It responds to these operational changes and takes corrective action as necessary, by sending messages to respond to the environment, activating functions, making changes, and capturing state information. You can also use CloudWatch Events to schedule automated actions that self-initiate at certain times using **cron** or **rate** expressions.
+ [Amazon Simple Notification Service (Amazon SNS)](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) enables applications, end-users, and devices to instantly send and receive notifications from the cloud. Amazon SNS provides topics (communication channels) for high-throughput, push-based, many-to-many messaging. Using Amazon SNS topics, publishers can distribute messages to a large number of subscribers for parallel processing, including Amazon Simple Queue Service (Amazon SQS) queues, Lambda functions, and HTTP/S webhooks. You can also use Amazon SNS to send notifications to end users using mobile push, SMS, and email.

## Epics
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3-epics"></a>

### Subscribe to a data set
<a name="subscribe-to-a-data-set"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Subscribe to a data set. | In the AWS Data Exchange console, subscribe to a data set. For instructions, see [Subscribing to data products on AWS Data Exchange](https://docs.aws.amazon.com/data-exchange/latest/userguide/subscribe-to-data-sets.html) in the AWS documentation. | General AWS | 
| Note the data set attributes. | Note the AWS Region, ID, and revision ID for the data set. You will need these values for the CloudFormation template in the next step. | General AWS | 

### Deploy the CloudFormation template
<a name="deploy-the-cfn-template"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket and folder. | If you already have a data lake in Amazon S3, create a folder to store the data to ingest from AWS Data Exchange. If you are deploying the template for testing purposes, create a new S3 bucket, and note the bucket name and folder prefix for the next step. | General AWS | 
| Deploy the CloudFormation template. | Deploy the CloudFormation template that's provided as an attachment to this pattern. For instructions, see the [CloudFormation documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html). Configure the following parameters to correspond to your AWS account, data set, and S3 bucket settings: **Dataset AWS Region**, **Dataset ID**, **Revision ID**, **S3 Bucket Name** (for example, `DOC-EXAMPLE-BUCKET`), **Folder Prefix** (for example, `myfolder/`), and **Email for SNS Notification**. You can set the **Dataset Name** parameter to any name. When you deploy the template, it runs a Lambda function to automatically ingest the first set of data available in the data set. Subsequent ingestion then takes place automatically, as new data arrives in the data set. | General AWS | 

## Related resources
<a name="automate-data-ingestion-from-aws-data-exchange-into-amazon-s3-resources"></a>
+ [Subscribing to data products on AWS Data Exchange](https://docs.aws.amazon.com/data-exchange/latest/userguide/subscribe-to-data-sets.html) (AWS Data Exchange documentation)

## Attachments
<a name="attachments-059816dc-5a71-4942-9c7f-ac977072eebc"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/059816dc-5a71-4942-9c7f-ac977072eebc/attachments/attachment.zip)

# Automate encryption enforcement in AWS Glue using an AWS CloudFormation template
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template"></a>

*Diogo Guedes, Amazon Web Services*

## Summary
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-summary"></a>

This pattern shows you how to set up and automate encryption enforcement in AWS Glue by using an AWS CloudFormation template. The template creates all the required configurations and resources for enforcing encryption. These resources include an initial configuration, a detective control created by an Amazon EventBridge rule, and an AWS Lambda function.

## Prerequisites and limitations
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-prereqs"></a>

**Prerequisites **
+ An active AWS account
+ Permissions to deploy the CloudFormation template and its resources

**Limitations **

This security control is regional. You must deploy the security control in each AWS Region where you want to set up encryption enforcement in AWS Glue.

## Architecture
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-architecture"></a>

**Target technology stack**
+ Amazon CloudWatch Logs (from AWS Lambda)
+ Amazon EventBridge rule
+ AWS CloudFormation stack
+ AWS CloudTrail
+ AWS Identity and Access Management (IAM) managed role and policy
+ AWS Key Management Service (AWS KMS)
+ AWS KMS alias
+ AWS Lambda function
+ AWS Systems Manager Parameter Store

**Target architecture**

The following diagram shows how to automate encryption enforcement in AWS Glue.

![Diagram shows how to automate encryption enforcement in AWS Glue using a CloudFormation template](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/d50d0659-5592-44d0-8fcb-7a2983712640/images/272a7fb2-ecbc-41f7-a556-d555e4e39a59.png)


The diagram shows the following workflow:

1. A [CloudFormation template](https://github.com/aws-samples/aws-custom-guardrail-event-driven/blob/main/CloudFormation/aws-custom-guardrail-event-driven.yaml) creates all the resources, including the initial configuration and detective control for encryption enforcement in AWS Glue.

1. An EventBridge rule detects a state change in the encryption configuration.

1. A Lambda function is invoked for evaluation and logging through CloudWatch Logs. If a non-compliant configuration is detected, the function retrieves the Amazon Resource Name (ARN) of an AWS KMS key from Parameter Store and remediates the service to a compliant status with encryption enabled.
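The remediation in step 3 can be sketched in Python as follows. This is a hypothetical reconstruction of the logic, not the code in the linked repository: the Parameter Store name is an invented placeholder, and the clients are injectable so the sketch can be exercised without AWS credentials. `ssm.get_parameter` and `glue.put_data_catalog_encryption_settings` are the boto3 operations involved.

```python
# Hypothetical Parameter Store key that holds the KMS key ARN.
PARAMETER_NAME = "/guardrails/glue/kms-key-arn"

def remediate(glue_client=None, ssm_client=None):
    """Re-enable AWS Glue Data Catalog encryption with the stored KMS key ARN."""
    if glue_client is None or ssm_client is None:  # real path; tests inject fakes
        import boto3
        glue_client = glue_client or boto3.client("glue")
        ssm_client = ssm_client or boto3.client("ssm")

    # Recover the KMS key ARN from Parameter Store.
    key_arn = ssm_client.get_parameter(Name=PARAMETER_NAME)["Parameter"]["Value"]

    # Restore the compliant encryption-at-rest configuration.
    glue_client.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings={
            "EncryptionAtRest": {
                "CatalogEncryptionMode": "SSE-KMS",
                "SseAwsKmsKeyId": key_arn,
            }
        }
    )
    return key_arn
```

In the deployed control, this function runs inside Lambda in response to the EventBridge rule, and its evaluation results are visible in CloudWatch Logs.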

**Automation and scale**

If you’re using [AWS Organizations](https://aws.amazon.com/organizations/), you can use [AWS CloudFormation StackSets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/what-is-cfnstacksets.html) to deploy this template in multiple accounts where you want to enable encryption enforcement in AWS Glue.

## Tools
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-tools"></a>
+ [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) helps you monitor the metrics of your AWS resources and the applications you run on AWS in real time.
+ [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) is a serverless event bus service that helps you connect your applications with real-time data from a variety of sources. For example, Lambda functions, HTTP invocation endpoints using API destinations, or event buses in other AWS accounts.
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) helps you set up AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle across AWS accounts and Regions.
+ [AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html) helps you enable operational and risk auditing, governance, and compliance of your AWS account.
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) is a fully managed extract, transform, and load (ETL) service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.
+ [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) helps you create and control cryptographic keys to help protect your data.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
+ [AWS Systems Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html) helps you manage your applications and infrastructure running in the AWS Cloud. It simplifies application and resource management, shortens the time to detect and resolve operational problems, and helps you manage your AWS resources securely at scale.

**Code**

The code for this pattern is available in the GitHub [aws-custom-guardrail-event-driven](https://github.com/aws-samples/aws-custom-guardrail-event-driven/blob/main/CloudFormation/aws-custom-guardrail-event-driven.yaml) repository.

## Best practices
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-best-practices"></a>

AWS Glue supports data encryption at rest for [authoring jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/author-job-glue.html) and [developing scripts using development endpoints](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html).

Consider the following best practices:
+ Configure ETL jobs and development endpoints to use AWS KMS keys to write encrypted data at rest.
+ Encrypt the metadata stored in the [AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro) by using keys that you manage through AWS KMS.
+ Use AWS KMS keys to encrypt job bookmarks and the logs generated by [crawlers](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) and ETL jobs.
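The three practices above can be captured in an AWS Glue security configuration plus Data Catalog encryption settings. The following boto3 sketch shows one way to do that; the configuration name and KMS key ARN are hypothetical placeholders, and the AWS calls run only under the `__main__` guard.

```python
def security_configuration(name, kms_key_arn):
    """Encryption settings covering S3 output, CloudWatch logs, and job
    bookmarks, all tied to a customer managed AWS KMS key."""
    return {
        "Name": name,
        "EncryptionConfiguration": {
            "S3Encryption": [
                {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
            ],
            "CloudWatchEncryption": {
                "CloudWatchEncryptionMode": "SSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
            "JobBookmarksEncryption": {
                "JobBookmarksEncryptionMode": "CSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
        },
    }


if __name__ == "__main__":
    import boto3  # requires AWS credentials

    key_arn = "arn:aws:kms:us-east-1:111122223333:key/example"  # hypothetical
    glue = boto3.client("glue")
    glue.create_security_configuration(
        **security_configuration("enforce-kms", key_arn)
    )
    # Encrypt the Data Catalog metadata at rest with the same key.
    glue.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings={
            "EncryptionAtRest": {
                "CatalogEncryptionMode": "SSE-KMS",
                "SseAwsKmsKeyId": key_arn,
            }
        }
    )
```

Jobs and development endpoints that reference this security configuration then write encrypted output, logs, and bookmarks.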

## Epics
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-epics"></a>

### Launch the CloudFormation template
<a name="launch-the-cloudformation-template"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Deploy the CloudFormation template. | Download the `aws-custom-guardrail-event-driven.yaml` template from the GitHub [repository](https://github.com/aws-samples/aws-custom-guardrail-event-driven/blob/main/CloudFormation/aws-custom-guardrail-event-driven.yaml), and then [deploy](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/cloudformation/deploy/index.html) the template. The `CREATE_COMPLETE` status indicates that your template was successfully deployed. The template requires no input parameters. | Cloud architect | 

### Verify the encryption settings in AWS Glue
<a name="verify-the-encryption-settings-in-aws-glue"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Check the AWS KMS key configurations. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template.html) | Cloud architect | 
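One way to spot-check the Data Catalog settings programmatically is with the Glue `get_data_catalog_encryption_settings` API. The helper below inspects the response shape that call returns; the live call itself runs only under the `__main__` guard and requires credentials.

```python
def catalog_encryption_enabled(settings):
    """Return True if the response from get_data_catalog_encryption_settings
    shows Data Catalog encryption at rest enabled."""
    at_rest = settings["DataCatalogEncryptionSettings"]["EncryptionAtRest"]
    return at_rest.get("CatalogEncryptionMode", "DISABLED") != "DISABLED"


if __name__ == "__main__":
    import boto3  # requires AWS credentials

    glue = boto3.client("glue")
    response = glue.get_data_catalog_encryption_settings()
    print("Catalog encrypted:", catalog_encryption_enabled(response))
```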

### Test the encryption enforcement
<a name="test-the-encryption-enforcement"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Identify the encryption setting in CloudFormation. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template.html) | Cloud architect | 
| Switch the provisioned infrastructure to a noncompliant state. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template.html) The guardrail detects the noncompliant state in AWS Glue after you clear the check boxes, and then enforces compliance by automatically remediating the encryption misconfiguration. As a result, the encryption check boxes should again be selected after you refresh the page. | Cloud architect | 

## Related resources
<a name="automate-encryption-enforcement-in-aws-glue-using-an-aws-cloudformation-template-resources"></a>
+ [Creating a stack on the AWS CloudFormation console](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) (AWS CloudFormation documentation)
+ [Creating a CloudWatch Events rule that triggers on an AWS API call using AWS CloudTrail](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Create-CloudWatch-Events-CloudTrail-Rule.html) (Amazon CloudWatch documentation)
+ [Setting up encryption in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/set-up-encryption.html) (AWS Glue documentation)

# Build a data pipeline to ingest, transform, and analyze Google Analytics data using the AWS DataOps Development Kit
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit"></a>

*Anton Kukushkin and Rudy Puig, Amazon Web Services*

## Summary
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-summary"></a>

This pattern describes how to build a data pipeline to ingest, transform, and analyze Google Analytics data by using the AWS DataOps Development Kit (AWS DDK) and other AWS services. The AWS DDK is an open-source development framework that helps you build data workflows and modern data architecture on AWS. One of the main objectives of the AWS DDK is to save you the time and effort that's typically devoted to labor-intensive data pipeline tasks, such as orchestrating pipelines, building infrastructure, and creating the DevOps behind that infrastructure. You can offload these labor-intensive tasks to AWS DDK so that you can focus on writing code and other high-value activities.

## Prerequisites and limitations
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ An Amazon AppFlow connector for Google Analytics, [configured](https://docs.aws.amazon.com/appflow/latest/userguide/google-analytics.html)
+ [Python](https://www.python.org/downloads/) and [pip](https://pip.pypa.io/en/stable/cli/pip_download/) (Python’s package manager)
+ Git, installed and [configured](https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup)
+ AWS Command Line Interface (AWS CLI), [installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
+ AWS Cloud Development Kit (AWS CDK), [installed](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install)

**Product versions**
+ Python 3.7 or later
+ pip 9.0.3 or later

## Architecture
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-architecture"></a>

**Technology stack**
+ Amazon AppFlow
+ Amazon Athena
+ Amazon CloudWatch
+ Amazon EventBridge
+ Amazon Simple Storage Service (Amazon S3)
+ Amazon Simple Queue Service (Amazon SQS)
+ AWS DataOps Development Kit (AWS DDK)
+ AWS Lambda

**Target architecture**

The following diagram shows the event-driven process that ingests, transforms, and analyzes Google Analytics data.

![\[Ingesting, transforming, and analyzing Google Analytics data with AWS services.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/edf40222-2867-4d4a-9153-ab29785b6662/images/8c38b472-153b-4497-982c-8efb97d2f7a5.png)


The diagram shows the following workflow:

1. An Amazon CloudWatch scheduled event rule invokes Amazon AppFlow.

1. Amazon AppFlow ingests Google Analytics data into an S3 bucket.

1. After the data is ingested into the S3 bucket, Amazon S3 generates event notifications in EventBridge. A CloudWatch Events rule captures these notifications and puts them into an Amazon SQS queue.

1. A Lambda function consumes events from the Amazon SQS queue, reads the respective S3 objects, transforms the objects to Apache Parquet format, writes the transformed objects to the S3 bucket, and then creates or updates the AWS Glue Data Catalog table definition.

1. An Athena query runs against the table.
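The final step could also be run ad hoc with boto3. The sketch below builds arguments for `start_query_execution`, reusing the aggregation over the `ga_sample` table shown later in this pattern; the database name and results location are hypothetical placeholders, and the call runs only under the `__main__` guard.

```python
def athena_query_args(database, output_s3_uri):
    """Arguments for Athena start_query_execution, mirroring this pattern's
    sample aggregation over the ga_sample table."""
    return {
        "QueryString": (
            "SELECT year, month, day, device, count(user_count) AS cnt "
            f"FROM {database}.ga_sample "
            "GROUP BY year, month, day, device ORDER BY cnt DESC LIMIT 10;"
        ),
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


if __name__ == "__main__":
    import boto3  # requires AWS credentials

    athena = boto3.client("athena")
    # Hypothetical database name and results bucket.
    args = athena_query_args("ga_database", "s3://example-bucket/query-results/")
    execution = athena.start_query_execution(**args)
    print(execution["QueryExecutionId"])
```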

## Tools
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-tools"></a>

**AWS tools**
+ [Amazon AppFlow](https://docs.aws.amazon.com/appflow/latest/userguide/what-is-appflow.html) is a fully managed integration service that enables you to securely exchange data between software as a service (SaaS) applications.
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL.
+ [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) helps you monitor the metrics of your AWS resources and the applications you run on AWS in real time.
+ [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) is a serverless event bus service that helps you connect your applications with real-time data from a variety of sources, such as AWS Lambda functions, HTTP invocation endpoints using API destinations, or event buses in other AWS accounts.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [Amazon Simple Queue Service (Amazon SQS)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) provides a secure, durable, and available hosted queue that helps you integrate and decouple distributed software systems and components.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
+ [AWS Cloud Development Kit (AWS CDK)](https://docs.aws.amazon.com/cdk/v2/guide/home.html) is a framework for defining cloud infrastructure in code and provisioning it through CloudFormation.
+ [AWS DataOps Development Kit (AWS DDK)](https://github.com/awslabs/aws-ddk) is an open-source development framework to help you build data workflows and modern data architecture on AWS.

**Code**

The code for this pattern is available in the GitHub [AWS DataOps Development Kit (AWS DDK)](https://github.com/awslabs/aws-ddk) and [Analyzing Google Analytics data with Amazon AppFlow, Amazon Athena, and AWS DataOps Development Kit](https://github.com/aws-samples/aws-ddk-examples/tree/main/google-analytics-data-using-appflow/python) repositories.

## Epics
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-epics"></a>

### Prepare the environment
<a name="prepare-the-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Clone the source code. | To clone the source code, run the following command:<pre>git clone https://github.com/aws-samples/aws-ddk-examples.git</pre> | DevOps engineer | 
| Create a virtual environment. | Navigate to the source code directory, and then run the following command to create a virtual environment:<pre>cd google-analytics-data-using-appflow/python && python3 -m venv .venv</pre> | DevOps engineer | 
| Install the dependencies. | To activate the virtual environment and install the dependencies, run the following command:<pre>source .venv/bin/activate && pip install -r requirements.txt</pre> | DevOps engineer | 

### Deploy the application that uses your data pipeline
<a name="deploy-the-application-that-uses-your-data-pipeline"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Bootstrap the environment. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit.html) | DevOps engineer | 
| Deploy the data pipeline. | To deploy the data pipeline, run the `cdk deploy --profile [AWS_PROFILE]` command. | DevOps engineer | 

### Test the deployment
<a name="test-the-deployment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Validate stack status. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit.html) | DevOps engineer | 

## Troubleshooting
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| Deployment fails during the creation of an `AWS::AppFlow::Flow` resource and you receive the following error: `Connector Profile with name ga-connection does not exist` | Confirm that you created an Amazon AppFlow connector for Google Analytics and named it `ga-connection`. For instructions, see [Google Analytics](https://docs.aws.amazon.com/appflow/latest/userguide/google-analytics.html) in the Amazon AppFlow documentation. | 

## Related resources
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-resources"></a>
+ [AWS DataOps Development Kit (AWS DDK)](https://github.com/awslabs/aws-ddk) (GitHub)
+ [AWS DDK Examples](https://github.com/aws-samples/aws-ddk-examples) (GitHub)

## Additional information
<a name="build-a-data-pipeline-to-ingest-transform-and-analyze-google-analytics-data-using-the-aws-dataops-development-kit-additional"></a>

AWS DDK data pipelines are composed of one or many stages. In the following code examples, you use `AppFlowIngestionStage` to ingest data from Google Analytics, `SqsToLambdaStage` to handle data transformation, and `AthenaSQLStage` to run the Athena query.

First, the data transformation and ingestion stages are created, as the following code example shows:

```
        appflow_stage = AppFlowIngestionStage(
            self,
            id="appflow-stage",
            flow_name=flow.flow_name,
        )
        sqs_lambda_stage = SqsToLambdaStage(
            self,
            id="lambda-stage",
            lambda_function_props={
                "code": Code.from_asset("./ddk_app/lambda_handlers"),
                "handler": "handler.lambda_handler",
                "layers": [
                    LayerVersion.from_layer_version_arn(
                        self,
                        id="layer",
                        layer_version_arn=f"arn:aws:lambda:{self.region}:336392948345:layer:AWSDataWrangler-Python39:1",
                    )
                ],
                "runtime": Runtime.PYTHON_3_9,
            },
        )
        # Grant lambda function S3 read & write permissions
        bucket.grant_read_write(sqs_lambda_stage.function)
        # Grant Glue database & table permissions
        sqs_lambda_stage.function.add_to_role_policy(
            self._get_glue_db_iam_policy(database_name=database.database_name)
        )
        athena_stage = AthenaSQLStage(
            self,
            id="athena-sql",
            query_string=[
                (
                    "SELECT year, month, day, device, count(user_count) as cnt "
                    f"FROM {database.database_name}.ga_sample "
                    "GROUP BY year, month, day, device "
                    "ORDER BY cnt DESC "
                    "LIMIT 10; "
                )
            ],
            output_location=Location(
                bucket_name=bucket.bucket_name, object_key="query-results/"
            ),
            additional_role_policy_statements=[
                self._get_glue_db_iam_policy(database_name=database.database_name)
            ],
        )
```

Next, the `DataPipeline` construct is used to "wire" the stages together by using EventBridge rules, as the following code example shows:

```
        (
            DataPipeline(self, id="ingestion-pipeline")
            .add_stage(
                stage=appflow_stage,
                override_rule=Rule(
                    self,
                    "schedule-rule",
                    schedule=Schedule.rate(Duration.hours(1)),
                    targets=appflow_stage.targets,
                ),
            )
            .add_stage(
                stage=sqs_lambda_stage,
                # By default, AppFlowIngestionStage stage emits an event after the flow run finishes successfully
            # Override rule below changes that behavior to call the stage when data lands in the bucket instead
                override_rule=Rule(
                    self,
                    "s3-object-created-rule",
                    event_pattern=EventPattern(
                        source=["aws.s3"],
                        detail={
                            "bucket": {"name": [bucket.bucket_name]},
                            "object": {"key": [{"prefix": "ga-data"}]},
                        },
                        detail_type=["Object Created"],
                    ),
                    targets=sqs_lambda_stage.targets,
                ),
            )
            .add_stage(stage=athena_stage)
        )
```

For more code examples, see the GitHub [Analyzing Google Analytics data with Amazon AppFlow, Amazon Athena, and AWS DataOps Development Kit](https://github.com/aws-samples/aws-ddk-examples/tree/main/google-analytics-data-using-appflow/python) repository.

# Build a video processing pipeline by using Amazon Kinesis Video Streams and AWS Fargate
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate"></a>

*Piotr Chotkowski and Pushparaju Thangavel, Amazon Web Services*

## Summary
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-summary"></a>

This pattern demonstrates how to use [Amazon Kinesis Video Streams](https://aws.amazon.com/kinesis/video-streams/) and [AWS Fargate](https://aws.amazon.com/fargate) to extract frames from a video stream and store them as image files for further processing in [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/).

The pattern provides a sample application in the form of a Java Maven project. This application defines the AWS infrastructure by using the [AWS Cloud Development Kit](https://aws.amazon.com/cdk/) (AWS CDK). Both the frame processing logic and the infrastructure definitions are written in the Java programming language. You can use this sample application as a basis for developing your own real-time video processing pipeline or to build the video preprocessing step of a machine learning pipeline. 

## Prerequisites and limitations
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ Java SE Development Kit (JDK) 11, installed
+ [Apache Maven](https://maven.apache.org/), installed
+ [AWS Cloud Development Kit (AWS CDK)](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html), installed
+ [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) version 2, installed
+ [Docker](https://docs.docker.com/get-docker/) (required for building Docker images to use in AWS Fargate task definitions), installed

**Limitations**

This pattern is intended as a proof of concept, or as a basis for further development. It should not be used in its current form in production deployments.

**Product versions**
+ This pattern was tested with the AWS CDK version 1.77.0 (see [AWS CDK versions](https://docs.aws.amazon.com/cdk/api/latest/versions.html))
+ JDK 11
+ AWS CLI version 2

## Architecture
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-architecture"></a>

**Target technology stack**
+ Amazon Kinesis Video Streams
+ AWS Fargate task
+ Amazon Simple Queue Service (Amazon SQS) queue
+ Amazon S3 bucket

**Target architecture**

![\[Architecture for using Kinesis Video Streams and Fargate to build a video processing pipeline.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/9d1442c2-f3ee-47fd-8cce-90d9206ce4d4/images/a60e585f-27be-4dd6-897b-c38adf1d283f.png)


The user creates a Kinesis video stream, uploads a video, and sends a JSON message that contains details about the input Kinesis video stream and the output S3 bucket to an SQS queue. AWS Fargate, which is running the main application in a container, pulls the message from the SQS queue and starts extracting frames. Each frame is saved in an image file and stored in the target S3 bucket.
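The JSON message format described above (the `streamARN`, `bucket`, and `s3Directory` keys shown in the epics later in this pattern) can be sent with boto3 as well as with the AWS CLI. In the sketch below, the queue URL, stream ARN, and bucket name are hypothetical placeholders for the values from your CDK deployment output; the AWS call runs only under the `__main__` guard.

```python
import json


def processing_message(stream_arn, bucket, s3_directory):
    """Build the JSON body that the frame-splitter application expects."""
    return json.dumps(
        {"streamARN": stream_arn, "bucket": bucket, "s3Directory": s3_directory}
    )


if __name__ == "__main__":
    import boto3  # requires AWS credentials

    sqs = boto3.client("sqs")
    # All values below are hypothetical; use your CDK deployment output.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/111122223333/frame-splitter-queue",
        MessageBody=processing_message(
            "arn:aws:kinesisvideo:us-east-1:111122223333:stream/example/123",
            "example-output-bucket",
            "test-output",
        ),
    )
```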

**Automation and scale**

The sample application can scale both horizontally and vertically within a single AWS Region. Horizontal scaling can be achieved by increasing the number of deployed AWS Fargate tasks that read from the SQS queue. Vertical scaling can be achieved by increasing the number of frame-splitting and image-publishing threads in the application. These settings are passed as environment variables to the application in the definition of the [QueueProcessingFargateService](https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-ecs-patterns.QueueProcessingFargateService.html) resource in the AWS CDK. Due to the nature of AWS CDK stack deployment, you can deploy this application in multiple AWS Regions and accounts with no additional effort.

## Tools
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-tools"></a>

**Tools**
+ [AWS CDK](https://aws.amazon.com/cdk/) is a software development framework for defining your cloud infrastructure and resources by using programming languages such as TypeScript, JavaScript, Python, Java, and C#/.NET.
+ [Amazon Kinesis Video Streams](https://aws.amazon.com/kinesis/video-streams/) is a fully managed AWS service that you can use to stream live video from devices to the AWS Cloud, or build applications for real-time video processing or batch-oriented video analytics.
+ [AWS Fargate](https://aws.amazon.com/fargate) is a serverless compute engine for containers. Fargate removes the need to provision and manage servers, and lets you focus on developing your applications.
+ [Amazon S3](https://aws.amazon.com/s3/) is an object storage service that offers scalability, data availability, security, and performance.
+ [Amazon SQS](https://aws.amazon.com/sqs/) is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.

**Code**
+ A .zip file of the sample application project (`frame-splitter-code.zip`) is attached.

## Epics
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-epics"></a>

### Deploy the infrastructure
<a name="deploy-the-infrastructure"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Start the Docker daemon. | Start the Docker daemon on your local system. The AWS CDK uses Docker to build the image that is used in the AWS Fargate task. You must run Docker before you proceed to the next step. | Developer, DevOps engineer | 
| Build the project. | Download the `frame-splitter-code` sample application (attached) and extract its contents into a folder on your local machine. Before you can deploy the infrastructure, you have to build the [Java Maven](https://maven.apache.org/) project. At a command prompt, navigate to the root directory of the project, and build the project by running the command: <pre>mvn clean install</pre> | Developer, DevOps engineer | 
| Bootstrap the AWS CDK. | (First-time AWS CDK users only) If this is the first time you’re using the AWS CDK, you might have to bootstrap the environment by running the AWS CLI command:<pre>cdk bootstrap --profile "$AWS_PROFILE_NAME" </pre>where `$AWS_PROFILE_NAME` holds the name of the AWS profile from your AWS credentials. Or, you can remove this parameter to use the default profile. For more information, see the [AWS CDK documentation](https://docs.aws.amazon.com/cdk/latest/guide/bootstrapping.html). | Developer, DevOps engineer | 
| Deploy the AWS CDK stack. | In this step, you create the required infrastructure resources (SQS queue, S3 bucket, AWS Fargate task definition) in your AWS account, build the Docker image that is required for the AWS Fargate task, and deploy the application. At a command prompt, navigate to the root directory of the project, and run the command:<pre>cdk deploy --profile "$AWS_PROFILE_NAME" --all </pre>where `$AWS_PROFILE_NAME` holds the name of the AWS profile from your AWS credentials. Or, you can remove this parameter to use the default profile. Confirm the deployment. Note the **QueueUrl** and **Bucket** values from the CDK deployment output; you will need these in later steps. The AWS CDK creates the assets, uploads them to your AWS account, and creates all infrastructure resources. You can observe the resource creation process in the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation/). For more information, see the [AWS CloudFormation documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) and the [AWS CDK documentation](https://docs.aws.amazon.com/cdk/latest/guide/hello_world.html#hello_world_tutorial_deploy). | Developer, DevOps engineer | 
| Create a video stream. | In this step, you create a Kinesis video stream that will serve as an input stream for video processing. Make sure that you have the AWS CLI installed and configured. In the AWS CLI, run:<pre>aws kinesisvideo --profile "$AWS_PROFILE_NAME" create-stream --stream-name "$STREAM_NAME" --data-retention-in-hours "24" </pre>where `$AWS_PROFILE_NAME` holds the name of the AWS profile from your AWS credentials (or remove this parameter to use the default profile) and `$STREAM_NAME` is any valid stream name. Alternatively, you can create a video stream by using the Kinesis console by following the steps in the [Kinesis Video Streams documentation](https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/gs-createstream.html#gs-createstream-console). Note the AWS Resource Name (ARN) of the created stream; you will need it later. | Developer, DevOps engineer | 

### Run an example
<a name="run-an-example"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Upload the video to the stream. | In the project folder for the sample `frame-splitter-code` application, open the `ProcessingTaskTest.java` file in the `src/test/java/amazon/awscdk/examples/splitter` folder. Replace the `profileName` and `streamName` variables with the values you used in the previous steps. To upload the example video to the Kinesis video stream you created in the previous step, run: <pre>amazon.awscdk.examples.splitter.ProcessingTaskTest#testExample test</pre> Alternatively, you can upload your video by using one of the methods described in the [Kinesis Video Streams documentation](https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/producer-sdk.html). | Developer, DevOps engineer | 
| Initiate video processing. | Now that you have uploaded a video to the Kinesis video stream, you can start processing it. To initiate the processing logic, send a message with details to the SQS queue that the AWS CDK created during deployment. To send a message by using the AWS CLI, run:<pre>aws sqs --profile "$AWS_PROFILE_NAME" send-message --queue-url QUEUE_URL --message-body MESSAGE </pre>where `$AWS_PROFILE_NAME` holds the name of the AWS profile from your AWS credentials (remove this parameter to use the default profile), `QUEUE_URL` is the **QueueUrl** value from the AWS CDK output, and `MESSAGE` is a JSON string in the following format: <pre>{ "streamARN": "STREAM_ARN", "bucket": "BUCKET_NAME", "s3Directory": "test-output" }</pre>where `STREAM_ARN` is the ARN of the video stream you created in an earlier step and `BUCKET_NAME` is the **Bucket** value from the AWS CDK output. Sending this message initiates video processing. Alternatively, you can send a message by using the Amazon SQS console, as described in the [Amazon SQS documentation](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-using-send-messages.html). | Developer, DevOps engineer | 
| View images of the video frames. | You can see the resulting images in the S3 output bucket `s3://BUCKET_NAME/test-output` where `BUCKET_NAME` is the **Bucket** value from the AWS CDK output. | Developer, DevOps engineer | 

## Related resources
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-resources"></a>
+ [AWS CDK documentation](https://docs.aws.amazon.com/cdk/latest/guide/home.html)
+ [AWS CDK API reference](https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html)
+ [AWS CDK introductory workshop](https://cdkworkshop.com/)
+ [Amazon Kinesis Video Streams documentation](https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/what-is-kinesis-video.html)
+ [Example: Identifying Objects in Video Streams Using SageMaker](https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/examples-sagemaker.html)
+ [Example: Parsing and Rendering Kinesis Video Streams Fragments](https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/examples-renderer.html)
+ [Analyze live video at scale in real time using Amazon Kinesis Video Streams and Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/analyze-live-video-at-scale-in-real-time-using-amazon-kinesis-video-streams-and-amazon-sagemaker/) (AWS Machine Learning blog post)
+ [AWS Fargate Getting Started](https://aws.amazon.com/fargate/getting-started/)

## Additional information
<a name="build-a-video-processing-pipeline-by-using-amazon-kinesis-video-streams-and-aws-fargate-additional"></a>

**Choosing an IDE**

We recommend that you use your favorite Java IDE to build and explore this project.  

**Cleaning up**

After you finish running this example, remove all deployed resources to avoid incurring additional AWS infrastructure costs. 

To remove the infrastructure and the video stream, use these two commands in the AWS CLI:

```
cdk destroy --profile "$AWS_PROFILE_NAME" --all
```

```
aws kinesisvideo --profile "$AWS_PROFILE_NAME" delete-stream --stream-arn "$STREAM_ARN"
```

Alternatively, you can remove the resources manually by using the AWS CloudFormation console to remove the AWS CloudFormation stack, and the Kinesis console to remove the Kinesis video stream. Note that `cdk destroy` doesn’t remove the output S3 bucket or the images in Amazon Elastic Container Registry (Amazon ECR) repositories (`aws-cdk/assets`). You have to remove them manually.
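Because `cdk destroy` leaves the output bucket behind, you may want to empty it programmatically before deleting it. The boto3 sketch below lists all object keys and deletes them in batches; `delete_objects` accepts at most 1,000 keys per request. The bucket name is a hypothetical placeholder for the **Bucket** value from the CDK output, and the AWS calls run only under the `__main__` guard.

```python
def batch_keys(keys, batch_size=1000):
    """Group object keys into delete_objects-sized batches (the S3 API caps
    each delete request at 1,000 keys)."""
    return [keys[i : i + batch_size] for i in range(0, len(keys), batch_size)]


if __name__ == "__main__":
    import boto3  # requires AWS credentials

    s3 = boto3.client("s3")
    bucket = "example-output-bucket"  # hypothetical; use your CDK output value
    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket)
        for obj in page.get("Contents", [])
    ]
    for batch in batch_keys(keys):
        s3.delete_objects(
            Bucket=bucket, Delete={"Objects": [{"Key": k} for k in batch]}
        )
```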

## Attachments
<a name="attachments-9d1442c2-f3ee-47fd-8cce-90d9206ce4d4"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/9d1442c2-f3ee-47fd-8cce-90d9206ce4d4/attachments/attachment.zip)

# Build an ETL service pipeline to load data incrementally from Amazon S3 to Amazon Redshift using AWS Glue
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue"></a>

*Rohan Jamadagni and Arunabha Datta, Amazon Web Services*

## Summary
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-summary"></a>

This pattern provides guidance on how to configure Amazon Simple Storage Service (Amazon S3) for optimal data lake performance, and then load incremental data changes from Amazon S3 into Amazon Redshift by using AWS Glue to perform extract, transform, and load (ETL) operations. 

The source files in Amazon S3 can have different formats, including comma-separated values (CSV), XML, and JSON files. This pattern describes how you can use AWS Glue to convert the source files into a cost-optimized and performance-optimized format like Apache Parquet. You can query Parquet files directly from Amazon Athena and Amazon Redshift Spectrum. You can also load Parquet files into Amazon Redshift, aggregate them, and share the aggregated data with consumers, or visualize the data by using Amazon Quick Sight.

## Prerequisites and limitations
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-prereqs"></a>

**Prerequisites**
+ An active AWS account.
+ An S3 source bucket that has the right privileges and contains CSV, XML, or JSON files.

**Assumptions**
+ The CSV, XML, or JSON source files are already loaded into Amazon S3 and are accessible from the account where AWS Glue and Amazon Redshift are configured.
+ Best practices for loading the files, splitting the files, compression, and using a manifest are followed, as discussed in the [Amazon Redshift documentation](https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html).
+ The source file structure is unaltered.
+ The source system is able to ingest data into Amazon S3 by following the folder structure defined in Amazon S3.
+ The Amazon Redshift cluster spans a single Availability Zone. (This architecture is appropriate because AWS Lambda, AWS Glue, and Amazon Athena are serverless.) For high availability, cluster snapshots are taken at a regular frequency.

**Limitations**
+ The file formats are limited to those that are [currently supported by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html).
+ Real-time downstream reporting isn't supported.

## Architecture
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-architecture"></a>

**Source technology stack**
+ S3 bucket with CSV, XML, or JSON files

**Target technology stack**
+ S3 data lake (with partitioned Parquet file storage)
+ Amazon Redshift

**Target architecture**

![\[Architecture for loading incremental changes from Amazon S3 to Amazon Redshift using AWS Glue.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/105b58ec-56c1-464a-8e69-f625360caa14/images/626aa365-e6e6-4874-a873-1c71adbe5306.png)


 

**Data flow**

![\[Data flow for loading incremental changes from Amazon S3 to Amazon Redshift using AWS Glue.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/105b58ec-56c1-464a-8e69-f625360caa14/images/29569e48-9f2d-4f48-bc59-1f33949d01ca.png)


## Tools
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-tools"></a>
+ [Amazon S3](https://aws.amazon.com/s3/) – Amazon Simple Storage Service (Amazon S3) is a highly scalable object storage service. Amazon S3 can be used for a wide range of storage solutions, including websites, mobile applications, backups, and data lakes.
+ [AWS Lambda](https://aws.amazon.com/lambda/) – AWS Lambda lets you run code without provisioning or managing servers. AWS Lambda is an event-driven service; you can set up your code to automatically initiate from other AWS services.
+ [Amazon Redshift](https://aws.amazon.com/redshift/) – Amazon Redshift is a fully managed, petabyte-scale data warehouse service. With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse and your data lake using standard SQL.
+ [AWS Glue](https://aws.amazon.com/glue/) – AWS Glue is a fully managed ETL service that makes it easier to prepare and load data for analytics. AWS Glue discovers your data and stores the associated metadata (for example, table definitions and schema) in the AWS Glue Data Catalog. Your cataloged data is immediately searchable, can be queried, and is available for ETL.
+ [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) – AWS Secrets Manager facilitates protection and central management of secrets needed for application or service access. The service stores database credentials, API keys, and other secrets, and eliminates the need to hardcode sensitive information in plaintext format. Secrets Manager also offers key rotation to meet security and compliance needs. It has built-in integration for Amazon Redshift, Amazon Relational Database Service (Amazon RDS), and Amazon DocumentDB. You can store and centrally manage secrets by using the Secrets Manager console, the command-line interface (CLI), or the Secrets Manager API and SDKs.
+ [Amazon Athena](https://aws.amazon.com/athena/) – Amazon Athena is an interactive query service that makes it easy to analyze data that's stored in Amazon S3. Athena is serverless and integrated with AWS Glue, so it can directly query the data that's cataloged using AWS Glue. Athena is elastically scaled to deliver interactive query performance.

## Epics
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-epics"></a>

### Create the S3 buckets and folder structure
<a name="create-the-s3-buckets-and-folder-structure"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Analyze source systems for data structure and attributes. | Perform this task for each data source that contributes to the Amazon S3 data lake. | Data engineer | 
| Define the partition and access strategy. | This strategy should be based on the frequency of data captures, delta processing, and consumption needs. Make sure that S3 buckets are not open to the public and that access is controlled by specific service role-based policies only. For more information, see the [Amazon S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html). | Data engineer | 
| Create separate S3 buckets for each data source type and a separate S3 bucket per source for the processed (Parquet) data. | Create a separate bucket for each source, and then create a folder structure that's based on the source system's data ingestion frequency; for example, `s3://source-system-name/date/hour`. For the processed (converted to Parquet format) files, create a similar structure; for example, `s3://source-processed-bucket/date/hour`. For more information about creating S3 buckets, see the [Amazon S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html). | Data engineer | 

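The `s3://source-system-name/date/hour` folder convention described in the table above can be sketched as a small helper. This is an illustrative sketch only; the bucket names are the placeholder names used in the table, not real resources.

```python
from datetime import datetime, timezone

def partition_prefix(bucket: str, ts: datetime) -> str:
    """Build an hourly partition prefix such as s3://bucket/2024-05-01/13/.

    Mirrors the s3://source-system-name/date/hour layout described above;
    the bucket name is a placeholder.
    """
    return f"s3://{bucket}/{ts:%Y-%m-%d}/{ts:%H}/"

# Example: a raw landing prefix and its processed (Parquet) counterpart.
ts = datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)
raw_prefix = partition_prefix("source-system-name", ts)
processed_prefix = partition_prefix("source-processed-bucket", ts)
```

Keeping the raw and processed buckets on the same date/hour layout makes it straightforward to map a landed partition to its Parquet output location later in the pipeline.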
### Create a data warehouse in Amazon Redshift
<a name="create-a-data-warehouse-in-amazon-redshift"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Launch the Amazon Redshift cluster with the appropriate parameter groups and maintenance and backup strategy. | Use the Secrets Manager database secret for admin user credentials while creating the Amazon Redshift cluster. For information about creating and sizing an Amazon Redshift cluster, see the [Amazon Redshift documentation](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-cluster.html) and the [Sizing Cloud Data Warehouses](https://d1.awsstatic.com/whitepapers/Size-Cloud-Data-Warehouse-on-AWS.pdf) whitepaper. | Data engineer | 
| Create and attach the IAM service role to the Amazon Redshift cluster. | The AWS Identity and Access Management (IAM) service role ensures access to Secrets Manager and the source S3 buckets. For more information, see the AWS documentation on [authorization](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) and [adding a role](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html). | Data engineer | 
| Create the database schema. | Follow Amazon Redshift best practices for table design. Based on the use case, choose the appropriate sort and distribution keys, and the best possible compression encoding. For best practices, see the [AWS documentation](https://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html). | Data engineer | 
| Configure workload management. | Configure workload management (WLM) queues, short query acceleration (SQA), or concurrency scaling, depending on your requirements. For more information, see [Implementing workload management](https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html) in the Amazon Redshift documentation. | Data engineer | 

### Create a secret in Secrets Manager
<a name="create-a-secret-in-secrets-manager"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a new secret to store the Amazon Redshift sign-in credentials in Secrets Manager. | This secret stores the credentials for the admin user as well as individual database service users. For instructions, see the [Secrets Manager documentation](https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_create-basic-secret.html). Choose **Amazon Redshift Cluster** as the secret type. Additionally, on the **Secret rotation** page, turn on the rotation. This will create the appropriate user in the Amazon Redshift cluster and will rotate the key secrets at defined intervals. | Data engineer | 
| Create an IAM policy to restrict Secrets Manager access. | Restrict Secrets Manager access to only Amazon Redshift administrators and AWS Glue. | Data engineer | 

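As an illustration of how a Glue job might consume the secret created above, here is a minimal sketch of parsing an Amazon Redshift‒type secret value. The field names below follow the default Redshift secret structure, but verify them against your own secret; the boto3 retrieval call is shown only as a comment, and the secret ID is hypothetical.

```python
import json

def redshift_connection_params(secret_string: str) -> dict:
    """Extract connection parameters from a Secrets Manager secret value.

    Field names follow the default Amazon Redshift secret structure;
    verify them against your actual secret.
    """
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret["port"]),
        "user": secret["username"],
        "password": secret["password"],
    }

# In a Glue job, the secret string would come from Secrets Manager:
#
# import boto3
# client = boto3.client("secretsmanager")
# secret_string = client.get_secret_value(
#     SecretId="redshift/etl"  # hypothetical secret ID
# )["SecretString"]

# Illustrative secret value (not real credentials):
sample = ('{"username": "etl_user", "password": "example", '
          '"host": "cluster.example.us-east-1.redshift.amazonaws.com", '
          '"port": "5439"}')
params = redshift_connection_params(sample)
```

Fetching credentials at run time this way is what lets the jobs avoid hardcoded plaintext credentials, as noted in the *Tools* section.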
### Configure AWS Glue
<a name="configure-aws-glue"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| In the AWS Glue Data Catalog, add a connection for Amazon Redshift. | For instructions, see the [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/console-connections.html). | Data engineer | 
| Create and attach an IAM service role for AWS Glue to access Secrets Manager, Amazon Redshift, and S3 buckets. | For more information, see the [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/create-service-policy.html). | Data engineer | 
| Define the AWS Glue Data Catalog for the source. | This step involves creating a database and required tables in the AWS Glue Data Catalog. You can either use a crawler to catalog the tables in the AWS Glue database, or define them as Amazon Athena external tables. You can also access the external tables defined in Athena through the AWS Glue Data Catalog. See the AWS documentation for more information about [defining the Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html) and [creating an external table](https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html) in Athena. | Data engineer | 
| Create an AWS Glue job to process source data. | The AWS Glue job can be a Python shell or PySpark job that standardizes, deduplicates, and cleanses the source data files. To optimize performance and avoid querying the entire S3 source bucket, partition the S3 data by year, month, day, and hour, and pass the partition to the AWS Glue job as a pushdown predicate. For more information, see the [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html). Load the processed and transformed data to the processed S3 bucket partitions in Parquet format. You can query the Parquet files from Athena. | Data engineer | 
| Create an AWS Glue job to load data into Amazon Redshift. | The AWS Glue job can be a Python shell or PySpark job that loads the data by using either an upsert or a complete refresh. For details, see the [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/author-job.html) and the *Additional information* section. | Data engineer | 
| (Optional) Schedule AWS Glue jobs by using triggers as necessary. | The incremental data load is primarily driven by an Amazon S3 event that causes an AWS Lambda function to call the AWS Glue job. Use AWS Glue trigger-based scheduling for any data loads that demand time-based instead of event-based scheduling. | Data engineer | 

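The pushdown predicate mentioned in the table above can be sketched as follows. The predicate-building helper is pure Python; the Glue `create_dynamic_frame.from_catalog` call that would consume it runs only inside the Glue runtime, so it is shown as a comment, and the database and table names in it are hypothetical.

```python
def pushdown_predicate(year: str, month: str, day: str, hour: str) -> str:
    """Build a Glue pushdown predicate that limits the read to a single
    year/month/day/hour partition of the cataloged source table."""
    return (
        f"year = '{year}' AND month = '{month}' "
        f"AND day = '{day}' AND hour = '{hour}'"
    )

predicate = pushdown_predicate("2024", "05", "01", "13")

# Inside a PySpark Glue job, the predicate would be applied like this
# (glueContext exists only in the Glue runtime):
#
# frame = glueContext.create_dynamic_frame.from_catalog(
#     database="source_db",            # hypothetical catalog database
#     table_name="source_table",       # hypothetical catalog table
#     push_down_predicate=predicate,
# )
```

Because the predicate is evaluated against partition columns before any data is read, the job lists and loads only the matching S3 partition instead of scanning the whole bucket.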
### Create a Lambda function
<a name="create-a-lambda-function"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create and attach an IAM service-linked role for AWS Lambda to access S3 buckets and the AWS Glue job. | Create an IAM service-linked role for AWS Lambda with a policy to read Amazon S3 objects and buckets, and a policy to access the AWS Glue API to start an AWS Glue job. For more information, see the [Knowledge Center](https://aws.amazon.com/premiumsupport/knowledge-center/lambda-execution-role-s3-bucket/). | Data engineer | 
| Create a Lambda function to run the AWS Glue job based on the defined Amazon S3 event. | The Lambda function should be initiated by the creation of the Amazon S3 manifest file. The Lambda function should pass the Amazon S3 folder location (for example, source-bucket/year/month/date/hour) to the AWS Glue job as a parameter. The AWS Glue job will use this parameter as a pushdown predicate to optimize file access and job processing performance. For more information, see the [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html). | Data engineer | 
| Create an Amazon S3 PUT object event to detect object creation, and call the respective Lambda function. | The Amazon S3 PUT object event should be initiated only by the creation of the manifest file. The manifest file controls the Lambda function and the AWS Glue job concurrency, and processes the load as a batch instead of processing individual files that arrive in a specific partition of the S3 source bucket. For more information, see the [Lambda documentation](https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html). | Data engineer | 

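The manifest-driven Lambda function described in the table above could look like the following sketch. The partition-extraction helper is pure Python; the `start_job_run` call uses the standard AWS Glue API, but the job name and argument name are hypothetical examples, not values from this pattern.

```python
import posixpath

def partition_prefix_from_event(event: dict) -> str:
    """Return the S3 folder that contains the manifest object from an
    S3 PUT event, e.g. '2024-05-01/13/' for '2024-05-01/13/manifest.json'."""
    key = event["Records"][0]["s3"]["object"]["key"]
    return posixpath.dirname(key) + "/"

def handler(event, context):
    """Lambda entry point (sketch): start the Glue load job, passing the
    partition prefix as a job argument for use as a pushdown predicate.
    Job and argument names are hypothetical."""
    import boto3  # available in the Lambda runtime
    prefix = partition_prefix_from_event(event)
    boto3.client("glue").start_job_run(
        JobName="s3-to-redshift-load",             # hypothetical job name
        Arguments={"--partition_prefix": prefix},  # read by the Glue job
    )
    return {"started_for": prefix}

# Minimal shape of an S3 PUT event record, for illustration:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "source-bucket"},
                "object": {"key": "2024-05-01/13/manifest.json"}}}
    ]
}
prefix = partition_prefix_from_event(sample_event)
```

Triggering on the manifest file rather than on every object keeps the Lambda and Glue concurrency at one run per batch, as the table above notes.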
## Related resources
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-resources"></a>
+ [Amazon S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html)
+ [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html)
+ [Amazon Redshift documentation](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html)
+ [AWS Lambda](https://aws.amazon.com/lambda/)
+ [Amazon Athena](https://aws.amazon.com/athena/)
+ [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/)

## Additional information
<a name="build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue-additional"></a>

**Detailed approach for upsert and complete refresh**

**Upsert**: This is for datasets that require historical aggregation, depending on the business use case. Follow one of the approaches described in [Updating and inserting new data](https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html) (Amazon Redshift documentation) based on your business needs.

**Complete refresh**: This is for small datasets that don't need historical aggregations. Follow one of these approaches.

**Approach 1:**

1. Truncate the Amazon Redshift table.

1. Load the current partition from the staging area.

**Approach 2:**

1. Create a temporary table with current partition data.

1. Drop the target Amazon Redshift table.

1. Rename the temporary table to the target table.
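
The two complete-refresh approaches can be sketched as SQL statement sequences. This is an illustrative sketch with hypothetical table names; adapt the column lists, grants, and table attributes to your schema before using anything like it.

```python
def truncate_and_reload(target: str, staging: str) -> list[str]:
    """Approach 1: truncate the target table, then reload it from staging."""
    return [
        f"TRUNCATE {target};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
    ]

def swap_via_temp_table(target: str, staging: str) -> list[str]:
    """Approach 2: build a replacement table from the current partition,
    drop the target, and rename the replacement into place."""
    return [
        f"CREATE TABLE {target}_new AS SELECT * FROM {staging};",
        f"DROP TABLE {target};",
        f"ALTER TABLE {target}_new RENAME TO {target};",
    ]

# Hypothetical target and staging table names:
stmts = truncate_and_reload("sales", "sales_staging")
```

Note that a `CREATE TABLE ... AS` replacement table does not inherit the original table's grants, sort keys, or distribution style, so Approach 2 usually needs those reapplied explicitly.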

# Build an enterprise data mesh with Amazon DataZone, AWS CDK, and AWS CloudFormation
<a name="build-enterprise-data-mesh-amazon-data-zone"></a>

*Dhrubajyoti Mukherjee, Ravi Kumar, Weizhou Sun, and Adjoa Taylor, Amazon Web Services*

## Summary
<a name="build-enterprise-data-mesh-amazon-data-zone-summary"></a>

Amazon Web Services (AWS) customers understand that data is key to accelerating innovation and driving business value for their enterprise. To manage data at this scale, you can adopt a decentralized architecture such as a data mesh. A data mesh architecture facilitates product thinking, a mindset that takes customers, goals, and the market into account. Data mesh also helps to establish a federated governance model that provides fast, secure access to your data.

[Strategies for building a data mesh-based enterprise solution on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-data-mesh/introduction.html) discusses how you can use the Data Mesh Strategy Framework to formulate and implement a data mesh strategy for your organization. By using the Data Mesh Strategy Framework, you can optimize the organization of teams and their interactions to accelerate your data mesh journey.

This document provides guidance on how to build an enterprise data mesh with [Amazon DataZone](https://docs.aws.amazon.com/datazone/latest/userguide/what-is-datazone.html). Amazon DataZone is a data management service for cataloging, discovering, sharing, and governing data stored across AWS, on premises, and third-party sources. The pattern includes code artifacts that help you deploy the data mesh‒based data solution infrastructure using AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation. This pattern is intended for cloud architects and DevOps engineers.

For information about the objectives of this pattern and the solution scope, see the [Additional information](#build-enterprise-data-mesh-amazon-data-zone-additional) section.

## Prerequisites and limitations
<a name="build-enterprise-data-mesh-amazon-data-zone-prereqs"></a>

**Prerequisites**
+ A minimum of two active AWS accounts: one for the central governance account and another for the member account
+ AWS administrator credentials for the central governance account in your development environment
+ AWS Command Line Interface (AWS CLI) [installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to manage your AWS services from the command line
+ Node.js and Node Package Manager (npm) [installed](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) to manage AWS CDK applications
+ AWS CDK Toolkit [installed](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install) globally in your development environment by using npm, to synthesize and deploy AWS CDK applications

  ```
  npm install -g aws-cdk
  ```
+ Python version 3.12 installed in your development environment
+ TypeScript installed in your development environment, or the TypeScript compiler installed globally by using npm:

  ```
  npm install -g typescript
  ```
+ Docker installed in your development environment
+ A version control system such as Git to maintain the source code of the solution (recommended)
+ An integrated development environment (IDE) or text editor with support for Python and TypeScript (strongly recommended)

**Limitations**
+ The solution has been tested only on machines that are running Linux or macOS.
+ In the current version, the solution doesn’t support the integration of Amazon DataZone and AWS IAM Identity Center by default. However, you can configure it to support this integration.

**Product versions**
+ Python version 3.12

## Architecture
<a name="build-enterprise-data-mesh-amazon-data-zone-architecture"></a>

The following diagram shows a data mesh reference architecture. The architecture is based on Amazon DataZone and uses Amazon Simple Storage Service (Amazon S3) and AWS Glue Data Catalog as data sources. The AWS services that you use with Amazon DataZone in your data mesh implementation might differ, based on your organization's requirements.

![\[Five step workflow for members accounts and central governance account.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/79056dd9-e669-4302-afb8-21e3e0e8a136/images/eacbffca-4f25-405a-a7c8-f89d0fee5682.png)


1. In the producer accounts, raw data is either fit for consumption in its current form or it’s transformed for consumption by using AWS Glue. The data is stored in Amazon S3, and its technical metadata is cataloged by using an AWS Glue crawler. The data quality is measured by using [AWS Glue Data Quality](https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html). The source database in the Data Catalog is registered as an asset in the Amazon DataZone catalog, which is hosted in the central governance account, by using Amazon DataZone data source jobs.

1. The central governance account hosts the Amazon DataZone domain and the Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. The Amazon DataZone projects of the data producers and consumers are organized under the corresponding Amazon DataZone domain units.

1. End users of the data assets log into the Amazon DataZone data portal by using their AWS Identity and Access Management (IAM) credentials or single sign-on (with integration through IAM Identity Center). They search, filter, and view asset information (for example, data quality information or business and technical metadata) in the Amazon DataZone data catalog.

1. After an end user finds the data asset that they want, they use the Amazon DataZone subscription feature to request access. The data owner on the producer team receives a notification and evaluates the subscription request in the Amazon DataZone data portal. The data owner approves or rejects the subscription request based on its validity.

1. After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for the following activities:
   + AI/ML model development by using Amazon SageMaker AI
   + Analytics and reporting by using Amazon Athena and Amazon Quick Sight

## Tools
<a name="build-enterprise-data-mesh-amazon-data-zone-tools"></a>

**AWS services**
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon Simple Storage Service (Amazon S3) by using standard SQL.
+ [AWS Cloud Development Kit (AWS CDK)](https://docs.aws.amazon.com/cdk/v2/guide/home.html) is a software development framework that helps you define and provision AWS Cloud infrastructure in code.
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) helps you set up AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle across AWS accounts and AWS Regions.
+ [Amazon DataZone](https://docs.aws.amazon.com/datazone/latest/userguide/what-is-datazone.html) is a data management service that helps you catalog, discover, share, and govern data stored across AWS, on premises, and in third-party sources.
+ [Amazon Quick Sight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html) is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard.
+ [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/?id=docs_gateway) is a managed machine learning (ML) service that helps you build and train ML models and then deploy them into a production-ready hosted environment.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [Amazon Simple Queue Service (Amazon SQS)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) provides a secure, durable, and available hosted queue that helps you integrate and decouple distributed software systems and components.

**Code repository**

The solution is available in the GitHub [data-mesh-datazone-cdk-cloudformation](https://github.com/aws-samples/data-mesh-datazone-cdk-cloudformation) repository.

## Epics
<a name="build-enterprise-data-mesh-amazon-data-zone-epics"></a>

### Set up the environment
<a name="set-up-the-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Clone the repository. | To clone the repository, run the following command in your local development environment (Linux or macOS):<pre>git clone https://github.com/aws-samples/data-mesh-datazone-cdk-cloudformation</pre> | Cloud architect, DevOps engineer | 
| Create the environment. | To create the Python virtual environment, run the following commands:<pre> python3 -m venv .venv<br /> source .venv/bin/activate<br /> pip install -r requirements.txt</pre> | Cloud architect, DevOps engineer | 
| Bootstrap the account. | To bootstrap the central governance account by using AWS CDK, run the following command:<pre>cdk bootstrap aws://<GOVERNANCE_ACCOUNT_ID>/<AWS_REGION></pre>Sign in to the AWS Management Console, open the central governance account console, and get the Amazon Resource Name (ARN) of the AWS CDK execution role. | Cloud architect, DevOps engineer | 
| Construct the `DzDataMeshMemberStackSet.yaml` file. | To construct the `DzDataMeshMemberStackSet.yaml` file, from the root directory of the repository, initiate the following bash script:<pre>./lib/scripts/create_dz_data_mesh_member_stack_set.sh</pre> | Cloud architect, DevOps engineer | 
| Confirm template creation. | Ensure that the CloudFormation template file is created at the `lib/cfn-templates/DzDataMeshMemberStackSet.yaml` location. | Cloud architect, DevOps engineer | 

### Deploy resources in the central governance account
<a name="deploy-resources-in-the-central-governance-account"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Modify the configuration. | In the `config/Config.ts` file, modify the following parameters:<pre>DZ_APPLICATION_NAME - Name of the application<br />DZ_STAGE_NAME - Name of the stage<br />DZ_DOMAIN_NAME - Name of the Amazon DataZone domain<br />DZ_DOMAIN_DESCRIPTION - Description of the Amazon DataZone domain<br />DZ_DOMAIN_TAG - Tag of the Amazon DataZone domain<br />DZ_ADMIN_PROJECT_NAME - Name of the Amazon DataZone project for administrators<br />DZ_ADMIN_PROJECT_DESCRIPTION - Description of the Amazon DataZone project for administrators<br />CDK_EXEC_ROLE_ARN - ARN of the AWS CDK execution role<br />DZ_ADMIN_ROLE_ARN - ARN of the administrator role</pre>Keep the remaining parameters empty. | Cloud architect, DevOps engineer | 
| Update the Amazon DataZone glossary configuration. | To update the Amazon DataZone glossary configuration in the `lib/utils/glossary_config.json` file, use the following example configuration:<pre>{<br />    "GlossaryName": "PII Data",<br />    "GlossaryDescription": "If data source contains PII attributes",<br />    "GlossaryTerms": [{<br />            "Name": "Yes",<br />            "ShortDescription": "Yes",<br />            "LongDescription": "Yes Glossary Term"<br />        },<br />        {<br />            "Name": "No",<br />            "ShortDescription": "No",<br />            "LongDescription": "No Glossary Term"<br />        }<br />    ]<br />}</pre> | Cloud architect, DevOps engineer | 
| Update the Amazon DataZone metadata form configuration. | To update the Amazon DataZone metadata form configuration in the `lib/utils/metadata_form_config.json` file, use the following example configuration:<pre>{<br />    "FormName": "ScheduleDataRefresh",<br />    "FormDescription": "Form for data refresh schedule",<br />    "FormSmithyModel": "@amazon.datazone#displayname(defaultName: \"Data Refresh Schedule\")\nstructure ScheduleDataRefresh {\n    @documentation(\"Schedule of Data Refresh\")\n    @required\n    @amazon.datazone#searchable\n    @amazon.datazone#displayname(defaultName: \"Data Refresh Schedule\")\n    data_refresh_schedule: String\n}"<br />}</pre> | Cloud architect, DevOps engineer | 
| Export the AWS credentials. | To export AWS credentials to your development environment for the IAM role with administrative permissions, use the following format:<pre>export AWS_ACCESS_KEY_ID=<br />export AWS_SECRET_ACCESS_KEY=<br />export AWS_SESSION_TOKEN=</pre> | Cloud architect, DevOps engineer | 
| Synthesize the template. | To synthesize the CloudFormation template, run the following command:<pre>npx cdk synth</pre> | Cloud architect, DevOps engineer | 
| Deploy the solution. | To deploy the solution, run the following command:<pre>npx cdk deploy --all</pre> | Cloud architect, DevOps engineer | 

### Configure new member accounts
<a name="configure-new-member-accounts"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Deploy the template. | Deploy the CloudFormation template located at `lib/cfn-templates/DzDataMeshCfnStackSetExecutionRole.yaml` in the member account with the following input parameters: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-enterprise-data-mesh-amazon-data-zone.html) | Cloud architect, DevOps engineer | 
| Update the ARNs. | To update the list of CloudFormation StackSet execution role ARNs for the member accounts, use the following code:<pre>DZ_MEMBER_STACK_SET_EXEC_ROLE_LIST - List of StackSet execution role ARNs for the member accounts</pre> | Cloud architect, DevOps engineer | 
| Synthesize and deploy. | To synthesize the CloudFormation template and deploy the solution, run the following commands:<pre>npx cdk synth<br />npx cdk deploy --all</pre> | Cloud architect, DevOps engineer | 
| Associate the member account. | To associate the member account with the central governance account, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-enterprise-data-mesh-amazon-data-zone.html) | Cloud architect, DevOps engineer | 
| Update the parameters. | To update the member account‒specific parameters in the config file at `config/Config.ts`, use the following format:<pre>export const DZ_MEMBER_ACCOUNT_CONFIG: memberAccountConfig = {<br />  '123456789012' : {<br />    PROJECT_NAME: 'TEST-PROJECT-123456789012',<br />    PROJECT_DESCRIPTION: 'TEST-PROJECT-123456789012',<br />    PROJECT_EMAIL: 'user@xyz.com'<br />  }<br />}</pre> | Cloud architect, DevOps engineer | 
| Synthesize and deploy the template. | To synthesize the CloudFormation template and deploy the solution, run the following commands:<pre>npx cdk synth<br />npx cdk deploy --all</pre> | Cloud architect, DevOps engineer | 
| Add member accounts. | To create and configure additional member accounts in the data solution, repeat the previous steps for each member account. This solution doesn’t differentiate between data producers and consumers. | Cloud architect, DevOps engineer | 

### Clean up
<a name="clean-up"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Disassociate the member accounts. | To disassociate the accounts, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-enterprise-data-mesh-amazon-data-zone.html) | Cloud architect, DevOps engineer | 
| Delete the stack instances. | To delete the CloudFormation stack instances, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-enterprise-data-mesh-amazon-data-zone.html) | Cloud architect, DevOps engineer | 
| Destroy all resources. | To destroy resources, implement the following steps in your local development environment (Linux or macOS):[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-enterprise-data-mesh-amazon-data-zone.html) | Cloud architect, DevOps engineer | 

## Related resources
<a name="build-enterprise-data-mesh-amazon-data-zone-resources"></a>
+ [Amazon DataZone quickstart with AWS Glue data](https://docs.aws.amazon.com/datazone/latest/userguide/quickstart-glue.html)
+ [Tutorial: Create your first AWS CDK app](https://docs.aws.amazon.com/cdk/v2/guide/hello_world.html)
+ [Getting started with CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/GettingStarted.html)
+ [Strategies for building a data mesh-based enterprise solution on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-data-mesh/introduction.html)

## Additional information
<a name="build-enterprise-data-mesh-amazon-data-zone-additional"></a>

**Objectives**

Implementing this pattern achieves the following:
+ **Decentralized ownership of data** ‒ Shift data ownership from a central team to teams that represent the source systems, business units, or use cases of your organization.
+ **Product thinking** ‒ Introduce a product-based mindset that includes customers, the market, and other factors when considering the data assets in your organization.
+ **Federated governance** ‒ Improve security guardrails, controls, and compliance across your organization's data products.
+ **Multi-account and multiple-project support** ‒ Support efficient, secure data sharing and collaboration across the business units or projects of your organization.
+ **Centralized monitoring and notifications** ‒ Monitor the cloud resources of your data mesh by using Amazon CloudWatch, and notify users when a new member account is associated.
+ **Scalability and extensibility** ‒ Add new use cases into the data mesh as your organization evolves.

**Solution scope**

When you use this solution, you can start small and scale as you progress in your data mesh journey. Often, when a member account adopts the data solution, it contains account configurations specific to the organization, project, or business unit. This solution accommodates these diverse AWS account configurations by supporting the following features:
+ AWS Glue Data Catalog as the data source for Amazon DataZone
+ Management of the Amazon DataZone data domain and the related data portal
+ Management of adding member accounts in the data mesh‒based data solution
+ Management of Amazon DataZone projects and environments
+ Management of Amazon DataZone glossaries and metadata forms
+ Management of IAM roles that correspond to the data mesh‒based data solution users
+ Notification of data mesh‒based data solution users
+ Monitoring of the provisioned cloud infrastructure

  This solution uses AWS CDK and CloudFormation to deploy the cloud infrastructure. It uses CloudFormation to do the following:
  + Define and deploy cloud resources at a lower level of abstraction.
  + Deploy cloud resources from the AWS Management Console. By using this approach, you can deploy infrastructure without a development environment.

  The data mesh solution uses AWS CDK to define resources at a higher level of abstraction. As a result, the solution provides a decoupled, modular, and scalable approach by using the most suitable tool to deploy each set of cloud resources.

**Next steps**

You can reach out to [AWS experts](https://aws.amazon.com/professional-services) for guidance on building a data mesh with Amazon DataZone.

The modular nature of this solution supports building data management solutions with different architectures, such as data fabric and data lakes. In addition, based on the requirements of your organization, you can extend the solution to other Amazon DataZone data sources.

# Calculate value at risk (VaR) by using AWS services
<a name="calculate-value-at-risk-var-by-using-aws-services"></a>

*Sumon Samanta, Amazon Web Services*

## Summary
<a name="calculate-value-at-risk-var-by-using-aws-services-summary"></a>

This pattern describes how to implement a value at risk (VaR) calculation system by using AWS services. In an on-premises environment, most VaR systems use a large, dedicated infrastructure and in-house or commercial grid scheduling software to run batch processes. This pattern presents a simple, reliable, and scalable architecture to handle VaR processing in the AWS Cloud. It builds a serverless architecture that uses Amazon Kinesis Data Streams as a streaming service, Amazon Simple Queue Service (Amazon SQS) as a managed queue service, Amazon ElastiCache as a caching service, and AWS Lambda to process orders and calculate risk.

VaR is a statistical measure that traders and risk managers use to estimate potential loss in their portfolio beyond a certain confidence level. Most VaR systems involve running a large number of mathematical and statistical calculations and storing the results. These calculations require significant compute resources, so VaR batch processes have to be broken into smaller sets of compute tasks. Splitting a large batch into smaller tasks is possible because these tasks are mostly independent (that is, calculations for one task don’t depend on other tasks).
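The two ideas in this paragraph ‒ splitting a batch into independent tasks and computing a historical VaR from a PnL distribution ‒ can be sketched as follows. This is a minimal illustration, not the pattern’s actual calculation engine; the task size, ticker names, and scenario distribution are made up for the example.

```python
import random

def split_into_tasks(positions, task_size):
    """Break a large batch into smaller, independent compute tasks."""
    return [positions[i:i + task_size] for i in range(0, len(positions), task_size)]

def historical_var(pnl_scenarios, confidence=0.99):
    """VaR is the loss exceeded in only (1 - confidence) of the scenarios."""
    worst_first = sorted(pnl_scenarios)
    index = int((1 - confidence) * len(worst_first))
    return -worst_first[index]

# Ten hypothetical positions split into tasks of at most four positions each.
# Every task can run in its own Lambda invocation because no task depends on another.
positions = [f"TICKER{i}" for i in range(10)]
tasks = split_into_tasks(positions, 4)

random.seed(7)
pnl_scenarios = [random.gauss(0, 1000) for _ in range(10000)]
var_99 = historical_var(pnl_scenarios)
```

Because the tasks are independent, the results from each task can be combined after all invocations complete, which is what makes the fan-out architecture in this pattern possible.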

Another important requirement for a VaR architecture is compute scalability. This pattern uses a serverless architecture that automatically scales in or out based on compute load. Because the batch or online compute demand is difficult to predict, dynamic scaling is required to complete the process within the timeline imposed by a service-level agreement (SLA). Also, a cost-optimized architecture should be able to scale down each compute resource as soon as the tasks on that resource are complete. 

AWS services are well-suited for VaR calculations because they offer scalable compute and storage capacity, analytics services for processing in a cost-optimized way, and different types of schedulers to run your risk management workflows. Also, you pay only for the compute and storage resources that you use on AWS.

## Prerequisites and limitations
<a name="calculate-value-at-risk-var-by-using-aws-services-prereqs"></a>

**Prerequisites**
+ An active AWS account.
+ Input files, which depend on your business requirements. A typical use case involves the following input files:
  + Market data file (input to the VaR calculation engine)
  + Trade data file (unless trade data comes through a stream)
  + Configuration data file (model and other static configuration data)
  + Calculation engine model files (quantitative libraries)
  + Time series data file (for historical data such as the stock price for the last five years)
+ If the market data or other input comes in through a stream, an Amazon Kinesis data stream set up, and AWS Identity and Access Management (IAM) permissions configured to write to the stream.

This pattern builds an architecture in which trade data is written from a trading system to a Kinesis data stream. Instead of using a streaming service, you can save your trade data in small batch files, store them in an Amazon Simple Storage Service (Amazon S3) bucket, and invoke an event to start processing the data.

**Limitations**
+ Kinesis data stream sequencing is guaranteed on each shard, so trade orders that are written to multiple shards are not guaranteed to be delivered in the same order as write operations.
+ The AWS Lambda runtime limit is currently 15 minutes. (For more information, see the [Lambda FAQ](https://aws.amazon.com/lambda/faqs/).)
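The shard-ordering limitation follows from how Kinesis assigns records to shards: the MD5 hash of the record’s partition key determines the shard, so records that share a partition key (for example, a ticker) keep their order, while records on different shards don’t. The following sketch illustrates that mapping; the even split of the hash range and the `shard_for_key` helper are simplifications for illustration.

```python
import hashlib

def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard the way Kinesis does: by the MD5 hash
    of the key. This sketch assumes shards split the 128-bit range evenly."""
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return min(key_hash * shard_count // 2**128, shard_count - 1)

# All orders for one ticker use the same partition key, so they land on the
# same shard and keep their write order; ordering across tickers is not guaranteed.
amzn_shard = shard_for_key("AMZN", 4)
```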

## Architecture
<a name="calculate-value-at-risk-var-by-using-aws-services-architecture"></a>

**Target architecture **

The following architecture diagram displays the AWS services and workflows for the risk assessment system.

![\[VaR calculation system with AWS services\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/eb615fc5-3cc3-445a-af2c-8446ee7b5276/images/c60aec03-ff6c-410c-8ee8-f1f6efa22cf7.png)


The diagram illustrates the following:

1. Trades stream in from the order management system.

1. The *ticker position netting* Lambda function processes the orders and writes consolidated messages for each ticker to a risk queue in Amazon SQS.

1. The *risk calculation engine* Lambda function processes the messages from Amazon SQS, performs risk calculations, and updates the VaR profit and loss (PnL) information in the risk cache in Amazon ElastiCache.

1. The *read ElastiCache data* Lambda function retrieves the risk results from ElastiCache and stores them in a database and S3 bucket.

For more information about these services and steps, see the *Epics* section.
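The netting step (step 2 above) reduces queue traffic by collapsing a mini-batch of orders into one message per ticker. A minimal sketch of that consolidation follows; the order tuples and the `net_positions` helper are illustrative, not part of the pattern’s code.

```python
from collections import defaultdict

def net_positions(orders):
    """Consolidate a mini-batch of trade orders into a single net quantity
    per ticker, so only one message per ticker goes to the risk queue."""
    net = defaultdict(int)
    for ticker, signed_quantity in orders:  # buys are positive, sells negative
        net[ticker] += signed_quantity
    return dict(net)

orders = [("AMZN", 100), ("MSFT", 50), ("AMZN", -30), ("MSFT", 25)]
messages = net_positions(orders)  # one consolidated message per ticker
```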

**Automation and scale**

You can deploy the entire architecture by using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation templates. The architecture can support both batch processing and intraday (real-time) processing.

Scaling is built into the architecture. As more trades are written into the Kinesis data stream and are waiting to be processed, additional Lambda functions can be invoked to process those trades and can then scale down after processing is complete. Processing through multiple Amazon SQS risk calculation queues is also an option. If strict ordering or consolidation is required across queues, processing cannot be parallelized. However, for an end-of-the-day batch or a mini intraday batch, the Lambda functions can process in parallel and store the final results in ElastiCache. 

## Tools
<a name="calculate-value-at-risk-var-by-using-aws-services-tools"></a>

**AWS services**
+ [Amazon Aurora MySQL-Compatible Edition](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraMySQL.html) is a fully managed, MySQL-compatible relational database engine that helps you set up, operate, and scale MySQL deployments. This pattern uses MySQL as an example, but you can use any RDBMS system to store data.
+ [Amazon ElastiCache](https://docs.aws.amazon.com/elasticache/) helps you set up, manage, and scale distributed in-memory cache environments in the AWS Cloud.
+ [Amazon Kinesis Data Streams](https://docs.aws.amazon.com/streams/latest/dev/introduction.html) helps you collect and process large streams of data records in real time.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
+ [Amazon Simple Queue Service (Amazon SQS)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) provides a secure, durable, and available hosted queue that helps you integrate and decouple distributed software systems and components.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

**Code**

This pattern provides an example architecture for a VaR system in the AWS Cloud and describes how you can use Lambda functions for VaR calculations. To create your Lambda functions, see the code examples in the [Lambda documentation](https://docs.aws.amazon.com/lambda/latest/dg/service_code_examples.html). For assistance, contact [AWS Professional Services](https://pages.awscloud.com/AWS-Professional-Services.html).

## Best practices
<a name="calculate-value-at-risk-var-by-using-aws-services-best-practices"></a>
+ Keep each VaR compute task as small and lightweight as possible. Experiment with different numbers of trades per compute task to find the best balance of compute time and cost.
+ Store reusable objects in Amazon ElastiCache. Use a framework such as Apache Arrow to reduce serialization and deserialization overhead.
+ Consider the Lambda time limitation. If your compute tasks might exceed 15 minutes, break them into smaller tasks to avoid the Lambda timeout. If that isn’t possible, consider a container orchestration solution such as AWS Fargate with Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS).

## Epics
<a name="calculate-value-at-risk-var-by-using-aws-services-epics"></a>

### Trade flow to risk system
<a name="trade-flow-to-risk-system"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Start writing trades.  | New, settled, or partially settled trades are written from the order management system to a risk stream. This pattern uses Amazon Kinesis Data Streams as the managed streaming service. The hash of each trade order’s ticker is used as the partition key to distribute trade orders across multiple shards. | Amazon Kinesis | 

### Run Lambda functions for order processing
<a name="run-lambda-functions-for-order-processing"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Start risk processing with Lambda. | Run an AWS Lambda function for the new orders. Lambda scales automatically based on the number of pending trade orders. Each Lambda instance receives one or more orders and retrieves the latest position for each ticker from Amazon ElastiCache. (For other financial derivative products, you can use a CUSIP ID, a curve name, or an index name as the key to store and retrieve data from ElastiCache.) In ElastiCache, the total position (quantity) is updated once for each ticker as the key-value pair <*ticker*, *net position*>, and the net position is used to derive the scaling factor. | Amazon Kinesis, AWS Lambda, Amazon ElastiCache | 

### Write messages for each ticker into queue
<a name="write-messages-for-each-ticker-into-queue"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Write consolidated messages to the risk queue. | Write the message to a queue. This pattern uses Amazon SQS as a managed queue service. A single Lambda instance might get a mini batch of trade orders at any given time, but will write only a single message for each ticker to Amazon SQS. A scaling factor is calculated: (*old net position* − *current position*) / *old net position*. | Amazon SQS, AWS Lambda | 
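Reading the scaling factor as the relative change in the net position, (old net position − current position) / old net position, it reduces to a one-line helper. The function name and sample values below are illustrative only.

```python
def scaling_factor(old_net_position, current_position):
    """Relative change in the net position; assumes old_net_position != 0."""
    return (old_net_position - current_position) / old_net_position

# A position that shrank from 100 to 70 yields a scaling factor of 0.3.
scale = scaling_factor(100, 70)
```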

### Invoke risk engine
<a name="invoke-risk-engine"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Start risk calculations. | The risk calculation engine Lambda function is invoked. Each position is processed by a single Lambda function. However, for optimization purposes, each Lambda function can process multiple messages from Amazon SQS. | Amazon SQS, AWS Lambda | 

### Retrieve risk results from cache
<a name="retrieve-risk-results-from-cache"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Retrieve and update risk cache. | Lambda retrieves the current net position for each ticker from ElastiCache. It also retrieves a VaR profit and loss (PnL) array for each ticker from ElastiCache. If the PnL array already exists, the Lambda function updates the array and VaR with a scale, which comes from the Amazon SQS message written by the netting Lambda function. If the PnL array isn’t in ElastiCache, a new PnL and VaR are calculated by using simulated ticker price series data. | Amazon SQS, AWS Lambda, Amazon ElastiCache | 
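The update path for an existing cached PnL array can be sketched as rescaling each scenario by the scale factor and recomputing VaR from the scaled array. This is an assumption about how the scale is applied, shown for illustration; the helper and sample data are not the pattern’s actual code.

```python
def update_pnl_and_var(pnl_array, scale, confidence=0.99):
    """Rescale a cached PnL array by the position's scale factor
    (an assumed update rule), then recompute VaR from the scaled scenarios."""
    scaled = [pnl * scale for pnl in pnl_array]
    index = int((1 - confidence) * len(scaled))
    var = -sorted(scaled)[index]
    return scaled, var

cached_pnl = [-120.0, -40.0, 10.0, 55.0, 90.0] * 20  # 100 cached scenarios
scaled_pnl, var = update_pnl_and_var(cached_pnl, 0.5)
```

This avoids rerunning the full simulation when only the position size changed, which is why caching the PnL array in ElastiCache pays off.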

### Update data in ElastiCache and store in database
<a name="update-data-in-elastic-cache-and-store-in-database"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Store risk results. | After the VaR and PnL numbers are updated in ElastiCache, a new Lambda function is invoked every five minutes. This function reads all stored data from ElastiCache and stores it in an Aurora MySQL-Compatible database and in an S3 bucket. | AWS Lambda, Amazon ElastiCache | 

## Related resources
<a name="calculate-value-at-risk-var-by-using-aws-services-resources"></a>
+ [Basel VaR Framework](https://www.bis.org/basel_framework/chapter/DIS/50.htm)

# Configure cross-account access to a shared AWS Glue Data Catalog using Amazon Athena
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena"></a>

*Denis Avdonin, Amazon Web Services*

## Summary
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-summary"></a>

This pattern provides step-by-step instructions, including AWS Identity and Access Management (IAM) policy samples, to configure cross-account sharing of a dataset stored in an Amazon Simple Storage Service (Amazon S3) bucket by using the AWS Glue Data Catalog. The dataset’s metadata is collected by an AWS Glue crawler and put into the AWS Glue Data Catalog. The S3 bucket and the AWS Glue Data Catalog reside in an AWS account referred to as the *data account*. You can provide access to IAM principals in another AWS account referred to as the *consumer account*. Users can query the data in the consumer account by using the Amazon Athena serverless query engine.

## Prerequisites and limitations
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-prereqs"></a>

**Prerequisites**
+ Two active [AWS accounts](https://docs.aws.amazon.com/accounts/latest/reference/manage-acct-creating.html)
+ An [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in one of the AWS accounts
+ [Athena engine version 2](https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html#engine-versions-reference-0002) or version 3
+ AWS Command Line Interface (AWS CLI), [installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) (or [AWS CloudShell](https://aws.amazon.com/cloudshell/) for running AWS CLI commands)

**Product versions**

This pattern works with [Athena engine version 2](https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0002.html) and [Athena engine version 3](https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0003.html) only. We recommend that you upgrade to Athena engine version 3. If you can’t upgrade from Athena engine version 1 to Athena engine version 3, follow the approach discussed in [Cross-account AWS Glue Data Catalog access with Amazon Athena](https://aws.amazon.com/blogs/big-data/cross-account-aws-glue-data-catalog-access-with-amazon-athena/) in the AWS Big Data Blog.

## Architecture
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-architecture"></a>

**Target technology stack**
+ Amazon Athena
+ Amazon Simple Storage Service (Amazon S3)
+ AWS Glue
+ AWS Identity and Access Management (IAM)
+ AWS Key Management Service (AWS KMS)

The following diagram shows an architecture that uses IAM permissions to share data in an S3 bucket in one AWS account (data account) with another AWS account (consumer account) through the AWS Glue Data Catalog.

![\[Sharing a dataset in an S3 bucket between a data account and a consumer account by using the AWS Glue Data Catalog.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/4ee1d6f5-f8e1-4acb-8a9c-7601c832a87d/images/e73a2b47-c0b1-49f1-be66-9c563c11c9f9.png)


The diagram shows the following workflow:

1. The S3 bucket policy in the data account grants permissions to an IAM role in the consumer account and to the AWS Glue crawler service role in the data account.

1. The AWS KMS key policy in the data account grants permissions to the IAM role in the consumer account and to the AWS Glue crawler service role in the data account.

1. The AWS Glue crawler in the data account discovers the schema of the data that’s stored in the S3 bucket.

1. The resource policy of the AWS Glue Data Catalog in the data account grants access to the IAM role in the consumer account.

1. A user creates a named catalog reference in the consumer account by using an AWS CLI command.

1. An IAM policy grants an IAM role in the consumer account access to resources in the data account. The IAM role’s trust policy allows users in the consumer account to assume the IAM role.

1. A user in the consumer account assumes the IAM role and accesses objects in the data catalog by using SQL queries.

1.  The Athena serverless engine runs the SQL queries.

**Note**  
[IAM best practices](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html) recommend that you grant permissions to an IAM role and use [identity federation](https://aws.amazon.com/identity/federation/).

## Tools
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-tools"></a>
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) is a fully managed extract, transform, and load (ETL) service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.
+ [AWS Identity and Access Management (IAM)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.
+ [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) helps you create and control cryptographic keys to protect your data.

## Epics
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-epics"></a>

### Set up permissions in the data account
<a name="set-up-permissions-in-the-data-account"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Grant access to data in the S3 bucket. | [Create an S3 bucket policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/add-bucket-policy.html) based on the following template and assign the policy to the bucket where the data is stored.<pre>{<br />    "Version": "2012-10-17",		 	 	 <br />    "Statement": [<br />        {<br />            "Effect": "Allow",<br />            "Principal": {<br />                "AWS": [<br />                     "arn:aws:iam::<consumer account id>:role/<role name>",<br />                     "arn:aws:iam::<data account id>:role/service-role/AWSGlueServiceRole-data-bucket-crawler"<br />                ]<br />            },<br />            "Action": "s3:GetObject",<br />            "Resource": "arn:aws:s3:::data-bucket/*"<br />        },<br />        {<br />            "Effect": "Allow",<br />            "Principal": {<br />                "AWS": [<br />                      "arn:aws:iam::<consumer account id>:role/<role name>",<br />                      "arn:aws:iam::<data account id>:role/service-role/AWSGlueServiceRole-data-bucket-crawler"<br />                ]<br />            },<br />            "Action": "s3:ListBucket",<br />            "Resource": "arn:aws:s3:::data-bucket"<br />        }<br />    ]<br />}</pre>The bucket policy grants permissions to the IAM role in the consumer account and to the AWS Glue crawler service role in the data account. | Cloud administrator | 
| (If required) Grant access to the data encryption key. | If the S3 bucket is encrypted by an AWS KMS key, grant `kms:Decrypt` permission on the key to the IAM role in the consumer account and to the AWS Glue crawler service role in the data account.Update the [key policy](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html) with the following statement:<pre>{<br />    "Effect": "Allow",<br />    "Principal": {<br />        "AWS": [<br />            "arn:aws:iam::<consumer account id>:role/<role name>",<br />            "arn:aws:iam::<data account id>:role/service-role/AWSGlueServiceRole-data-bucket-crawler"<br />        ]<br />    },<br />    "Action": "kms:Decrypt",<br />    "Resource": "arn:aws:kms:<region>:<data account id>:key/<key id>"<br />}</pre> | Cloud administrator | 
| Grant the crawler access to the data. | Attach the following IAM policy to the crawler’s service role:<pre>{<br />    "Version": "2012-10-17",		 	 	 <br />    "Statement": [<br />        {<br />            "Effect": "Allow",<br />            "Action": "s3:GetObject",<br />            "Resource": "arn:aws:s3:::data-bucket/*"<br />        },<br />        {<br />            "Effect": "Allow",<br />            "Action": "s3:ListBucket",<br />            "Resource": "arn:aws:s3:::data-bucket"<br />        }<br />    ]<br />} </pre> | Cloud administrator | 
| (If required) Grant the crawler access to the data encryption key. | If the S3 bucket is encrypted by an AWS KMS key, grant `kms:Decrypt` permission on the key to the crawler’s service role by attaching the following policy to it:<pre>{<br />    "Effect": "Allow",<br />    "Action": "kms:Decrypt",<br />    "Resource": "arn:aws:kms:<region>:<data account id>:key/<key id>"<br />}</pre> | Cloud administrator | 
| Grant the IAM role in the consumer account and the crawler access to the data catalog. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena.html)<pre>{<br />    "Version" : "2012-10-17",<br />    "Statement" : [ <br />        {<br />            "Effect" : "Allow",<br />            "Principal" : {<br />                "AWS" : [<br />                     "arn:aws:iam::<consumer account id>:role/<role name>",<br />                    "arn:aws:iam::<data account id>:role/service-role/AWSGlueServiceRole-data-bucket-crawler"<br />                ]<br />            },<br />            "Action" : "glue:*",<br />            "Resource" : [ <br />                 "arn:aws:glue:<region>:<data account id>:catalog", <br />                 "arn:aws:glue:<region>:<data account id>:database/*", <br />                 "arn:aws:glue:<region>:<data account id>:table/*" <br />            ]<br />        } <br />    ]<br />}</pre>This policy allows all AWS Glue actions on all databases and tables in the data account. You can customize the policy to grant only required permissions to the consumer principals. For example, you can provide read-only access to specific tables or views in a database. | Cloud administrator | 

### Access data from the consumer account
<a name="access-data-from-the-consumer-account"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a named reference for the data catalog. | To create a named data catalog reference, use [CloudShell](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html) or a locally installed AWS CLI to run the following command:<pre>aws athena create-data-catalog --name <shared catalog name> --type GLUE --parameters catalog-id=<data account id></pre> | Cloud administrator | 
| Grant the IAM role in the consumer account access to the data. | Attach the following policy to the IAM role in the consumer account to grant the role cross-account access to the data:<pre>{<br />     "Version": "2012-10-17",<br />     "Statement": [<br />        {<br />             "Effect": "Allow",<br />             "Action": "s3:GetObject",<br />             "Resource": "arn:aws:s3:::data-bucket/*"<br />        },<br />        {<br />             "Effect": "Allow",<br />             "Action": "s3:ListBucket",<br />             "Resource": "arn:aws:s3:::data-bucket"<br />        },<br />        {<br />            "Effect": "Allow",<br />            "Action": "glue:*",<br />            "Resource": [<br />                 "arn:aws:glue:<region>:<data account id>:catalog",<br />                 "arn:aws:glue:<region>:<data account id>:database/*",<br />                 "arn:aws:glue:<region>:<data account id>:table/*"<br />            ]<br />        }<br />    ]<br />}</pre>Next, use the following template to specify which users can assume the IAM role in its trust policy:<pre>{<br />    "Version": "2012-10-17",<br />    "Statement": [<br />       {<br />            "Effect": "Allow",<br />            "Principal": {<br />                "AWS": "arn:aws:iam::<consumer account id>:user/<IAM user>"<br />            },<br />            "Action": "sts:AssumeRole"<br />        }<br />    ]<br />}</pre>Finally, grant user permissions to assume the IAM role by attaching the same policy to the user group they belong to. | Cloud administrator | 
| (If required) Grant the IAM role in the consumer account access to the data encryption key. | If the S3 bucket is encrypted by an AWS KMS key, grant `kms:Decrypt` permission on the key to the IAM role in the consumer account by attaching the following policy to it:<pre>{<br />     "Effect": "Allow",<br />     "Action": "kms:Decrypt",<br />     "Resource": "arn:aws:kms:<region>:<data account id>:key/<key id>"<br />}</pre> | Cloud administrator | 
| Switch to the IAM role in the consumer account to access data. | As a data consumer, [switch to the IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-console.html) to access data in the data account. | Data consumer | 
| Access the data. | Query data using Athena. For example, open the Athena query editor and run the following query:<pre>SELECT *<br />  FROM <shared catalog name>.<database name>.<table name></pre>Instead of using a named catalog reference, you can also refer to the catalog by its Amazon Resource Name (ARN). If you use a dynamic catalog reference in a query or view, surround the reference with escaped double quotation marks (\"). For example:<pre>SELECT * <br />  FROM \"glue:arn:aws:glue:<region>:<data account id>:catalog\".<database name>.<table name></pre>For more information, see [Cross-account access to AWS Glue data catalogs](https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html) in the Amazon Athena User Guide. | Data consumer | 
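If you generate such queries programmatically, the escaped quotation marks around a dynamic catalog reference are easy to get wrong. The sketch below builds the query string; the `dynamic_catalog_query` helper, Region, and account ID are illustrative examples, not part of this pattern’s deliverables.

```python
def dynamic_catalog_query(region, data_account_id, database, table):
    """Build an Athena query that refers to a shared AWS Glue Data Catalog by
    ARN. The dynamic reference is wrapped in escaped double quotation marks."""
    catalog_ref = f'\\"glue:arn:aws:glue:{region}:{data_account_id}:catalog\\"'
    return f"SELECT * FROM {catalog_ref}.{database}.{table}"

query = dynamic_catalog_query("us-east-1", "111122223333", "sales_db", "orders")
```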

## Related resources
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-resources"></a>
+ [Cross-account access to AWS Glue data catalogs](https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html) (Athena documentation)
+ [(AWS CLI) create-data-catalog](https://docs.aws.amazon.com/cli/latest/reference/athena/create-data-catalog.html) (AWS CLI Command Reference)
+ [Cross-account AWS Glue Data Catalog access with Amazon Athena](https://aws.amazon.com/blogs/big-data/cross-account-aws-glue-data-catalog-access-with-amazon-athena/) (AWS Big Data Blog)
+ [Security best practices in IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html) (IAM documentation)

## Additional information
<a name="configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena-additional"></a>

**Using Lake Formation as an alternative for cross-account sharing**

You can also use AWS Lake Formation to share access to AWS Glue catalog objects across accounts. Lake Formation provides fine-grained access control at the column and row level, tag-based access control, governed tables for ACID transactions, and other functionality. Although Lake Formation is well-integrated with Athena, it does require additional configuration compared to this pattern’s IAM-only approach. We recommend that you consider the decision to use Lake Formation or IAM-only access controls within the wider context of your overall solution architecture. Considerations include what other services are involved and how they integrate with both approaches.

# Convert the Teradata NORMALIZE temporal feature to Amazon Redshift SQL
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql"></a>

*Po Hong, Amazon Web Services*

## Summary
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql-summary"></a>

**NORMALIZE** is a Teradata extension to the ANSI SQL standard. When a SQL table includes a column that has a **PERIOD** data type, **NORMALIZE** combines values that meet or overlap in that column, to form a single period that consolidates multiple, individual period values. To use **NORMALIZE**, at least one column in the SQL **SELECT** list must be of Teradata's temporal **PERIOD** data type. For more information about **NORMALIZE**, see the [Teradata documentation](https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/MIGI5UMEwgERC3Un7tEZ6g). 

Amazon Redshift doesn’t support **NORMALIZE**, but you can implement this functionality by using native SQL syntax and the **LAG** window function in Amazon Redshift. This pattern focuses on using the Teradata **NORMALIZE** extension with the **ON MEETS OR OVERLAPS** condition, which is the most popular format. It explains how this feature works in Teradata and how it can be converted into Amazon Redshift native SQL syntax.

## Prerequisites and limitations
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql-prereqs"></a>

**Prerequisites**
+ Basic Teradata SQL knowledge and experience
+ Amazon Redshift knowledge and experience

## Architecture
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql-architecture"></a>

**Source technology stack**
+ Teradata data warehouse

**Target technology stack**
+ Amazon Redshift

**Target architecture**

For a high-level architecture for migrating a Teradata database to Amazon Redshift, see the pattern [Migrate a Teradata database to Amazon Redshift using AWS SCT data extraction agents](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-a-teradata-database-to-amazon-redshift-using-aws-sct-data-extraction-agents.html). The migration doesn't automatically convert the Teradata **NORMALIZE** phrase to Amazon Redshift SQL. You can convert this Teradata extension by following the guidelines in this pattern.

## Tools
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql-tools"></a>

**Code**

To illustrate the concept and functionality of **NORMALIZE**, consider the following table definition in Teradata:

```
CREATE TABLE systest.project
     (    emp_id        INTEGER,
          project_name  VARCHAR(20),
          dept_id       INTEGER,
          duration      PERIOD(DATE)
     );
```

Run the following SQL code to insert sample data into the table:

```
BEGIN TRANSACTION;

INSERT INTO systest.project VALUES (10, 'First Phase', 1000,  PERIOD(DATE '2010-01-10', DATE '2010-03-20') );
INSERT INTO systest.project VALUES (10, 'First Phase', 2000,  PERIOD(DATE '2010-03-20', DATE '2010-07-15') );

INSERT INTO systest.project VALUES (10, 'Second Phase', 2000,  PERIOD(DATE '2010-06-15', DATE '2010-08-18') );
INSERT INTO systest.project VALUES (20, 'First Phase', 2000,  PERIOD(DATE '2010-03-10', DATE '2010-07-20') );

INSERT INTO systest.project VALUES (20, 'Second Phase', 1000,  PERIOD(DATE '2020-05-10', DATE '2020-09-20') );

END TRANSACTION;
```

Results:

```
select * from systest.project order by 1,2,3;
 
 *** Query completed. 5 rows found. 4 columns returned.
 *** Total elapsed time was 1 second.
 
     emp_id  project_name              dept_id  duration
-----------  --------------------  -----------  ------------------------
         10  First Phase                  1000  ('10/01/10', '10/03/20')        
         10  First Phase                  2000  ('10/03/20', '10/07/15')
         10  Second Phase                 2000  ('10/06/15', '10/08/18')
         20  First Phase                  2000  ('10/03/10', '10/07/20')
         20  Second Phase                 1000  ('20/05/10', '20/09/20')
```

*Teradata NORMALIZE use case*

Now add the Teradata **NORMALIZE** SQL clause to the **SELECT** statement:

```
SELECT NORMALIZE ON MEETS OR OVERLAPS emp_id, duration 
FROM systest.project 
ORDER BY 1,2;
```

This **NORMALIZE** operation is performed on a single column (`emp_id`). For `emp_id=10`, the three period values in `duration` that meet or overlap coalesce into a single period value, as follows:  

```
     emp_id  duration
-----------  ------------------------
         10  ('10/01/10', '10/08/18')
         20  ('10/03/10', '10/07/20')
         20  ('20/05/10', '20/09/20')
```

The following **SELECT** statement performs a **NORMALIZE** operation on `project_name` and `dept_id`. Note that the **SELECT** list contains only one **PERIOD** column, `duration`.

```
SELECT NORMALIZE project_name, dept_id, duration 
FROM systest.project;
```

Output:

```
project_name              dept_id  duration
--------------------  -----------  ------------------------
First Phase                  1000  ('10/01/10', '10/03/20')
Second Phase                 1000  ('20/05/10', '20/09/20')
First Phase                  2000  ('10/03/10', '10/07/20')
Second Phase                 2000  ('10/06/15', '10/08/18')
```

*Amazon Redshift equivalent SQL*

Amazon Redshift currently doesn’t support the **PERIOD** data type in a table. Instead, you need to split a Teradata **PERIOD** field into two **DATE** columns, `start_date` and `end_date`, as follows:  

```
CREATE TABLE systest.project
     (    emp_id        INTEGER,
          project_name  VARCHAR(20),
          dept_id       INTEGER,
          start_date  DATE,
          end_date    DATE
     );
```

Insert sample data into the table:

```
BEGIN TRANSACTION;
 
INSERT INTO systest.project VALUES (10, 'First Phase', 1000,  DATE '2010-01-10', DATE '2010-03-20' );
INSERT INTO systest.project VALUES (10, 'First Phase', 2000,  DATE '2010-03-20', DATE '2010-07-15');
 
INSERT INTO systest.project VALUES (10, 'Second Phase', 2000,  DATE '2010-06-15', DATE '2010-08-18' );
INSERT INTO systest.project VALUES (20, 'First Phase', 2000,  DATE '2010-03-10', DATE '2010-07-20' );
 
INSERT INTO systest.project VALUES (20, 'Second Phase', 1000,  DATE '2020-05-10', DATE '2020-09-20' );
 
END TRANSACTION;
```

Output:

```
 emp_id | project_name | dept_id | start_date |  end_date
--------+--------------+---------+------------+------------
     10 | First Phase  |    1000 | 2010-01-10 | 2010-03-20
     10 | First Phase  |    2000 | 2010-03-20 | 2010-07-15
     10 | Second Phase |    2000 | 2010-06-15 | 2010-08-18
     20 | First Phase  |    2000 | 2010-03-10 | 2010-07-20
     20 | Second Phase |    1000 | 2020-05-10 | 2020-09-20
(5 rows)
```

To rewrite Teradata’s **NORMALIZE** clause, you can use the [LAG window function](https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LAG.html) in Amazon Redshift. This function returns the values for a row at a given offset above (before) the current row in the partition.

You can use the **LAG** function to flag each row that begins a new period by checking whether the period meets or overlaps the previous period (0 if it does, 1 if it doesn't). A cumulative sum of this flag provides a group identifier that you can use in the outer **GROUP BY** clause to produce the desired result in Amazon Redshift.  

Here’s a sample Amazon Redshift SQL statement that uses **LAG()**:

```
SELECT emp_id, start_date, end_date, 
            (CASE WHEN start_date <= LAG(end_date) OVER (PARTITION BY emp_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM systest.project 
ORDER BY 1,2;
```

Output:

```
 emp_id | start_date |  end_date  | groupstartflag
--------+------------+------------+----------------
     10 | 2010-01-10 | 2010-03-20 |              1
     10 | 2010-03-20 | 2010-07-15 |              0
     10 | 2010-06-15 | 2010-08-18 |              0
     20 | 2010-03-10 | 2010-07-20 |              1
     20 | 2020-05-10 | 2020-09-20 |              1
(5 rows)
```

The following Amazon Redshift SQL statement normalizes only on the `emp_id` column:

```
SELECT T2.emp_id, MIN(T2.start_date) as new_start_date, MAX(T2.end_date) as new_end_date
FROM 
( SELECT T1.*, SUM(GroupStartFlag) OVER (PARTITION BY emp_id ORDER BY start_date ROWS UNBOUNDED PRECEDING) As GroupID
FROM ( SELECT emp_id, start_date, end_date, 
            (CASE WHEN start_date <= LAG(end_date) OVER (PARTITION BY emp_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM systest.project ) T1
) T2
GROUP BY T2.emp_id, T2.GroupID
ORDER BY 1,2;
```

Output:  

```
 emp_id | new_start_date | new_end_date
--------+----------------+------------------------------------
     10 | 2010-01-10     | 2010-08-18
     20 | 2010-03-10     | 2010-07-20
     20 | 2020-05-10     | 2020-09-20
(3 rows)
```
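
As a cross-check, the same meets-or-overlaps grouping can be modeled in plain Python on the sample data. This is an illustrative sketch only (the `normalize` helper is hypothetical, not part of the migration); it mirrors the **LAG** comparison, the running sum of start flags, and the **MIN**/**MAX** aggregation in the Amazon Redshift SQL:

```python
from datetime import date

# Sample rows: (emp_id, start_date, end_date)
rows = [
    (10, date(2010, 1, 10), date(2010, 3, 20)),
    (10, date(2010, 3, 20), date(2010, 7, 15)),
    (10, date(2010, 6, 15), date(2010, 8, 18)),
    (20, date(2010, 3, 10), date(2010, 7, 20)),
    (20, date(2020, 5, 10), date(2020, 9, 20)),
]

def normalize(rows):
    """Coalesce periods that meet or overlap, per emp_id.

    Mirrors the SQL: the LAG comparison decides whether a row starts a
    new group, and each group keeps its earliest start and latest end.
    """
    result = []
    prev_emp, prev_end = None, None
    for emp_id, start, end in sorted(rows):
        # GroupStartFlag logic: 0 (extend) if this period meets or
        # overlaps the previous one for the same emp_id, else 1 (new group)
        if emp_id == prev_emp and start <= prev_end:
            cur = result[-1]
            result[-1] = (emp_id, cur[1], max(cur[2], end))
            prev_end = max(prev_end, end)
        else:
            result.append((emp_id, start, end))
            prev_emp, prev_end = emp_id, end
    return result

for emp_id, s, e in normalize(rows):
    print(emp_id, s, e)
```

Running this on the sample data reproduces the three normalized rows shown in the Amazon Redshift output above.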


The following Amazon Redshift SQL statement normalizes on both the `project_name` and `dept_id` columns:

```
SELECT T2.project_name, T2.dept_id, MIN(T2.start_date) as new_start_date, MAX(T2.end_date) as new_end_date
FROM 
( SELECT T1.*, SUM(GroupStartFlag) OVER (PARTITION BY project_name, dept_id ORDER BY start_date ROWS UNBOUNDED PRECEDING) As GroupID
FROM ( SELECT project_name, dept_id, start_date, end_date, 
            (CASE WHEN start_date <= LAG(end_date) OVER (PARTITION BY project_name, dept_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM systest.project ) T1
) T2
GROUP BY T2.project_name, T2.dept_id, T2.GroupID
ORDER BY 1,2,3;
```

Output:

```
 project_name | dept_id | new_start_date | new_end_date
--------------+---------+----------------+--------------
 First Phase  |    1000 | 2010-01-10     | 2010-03-20
 First Phase  |    2000 | 2010-03-10     | 2010-07-20
 Second Phase |    1000 | 2020-05-10     | 2020-09-20
 Second Phase |    2000 | 2010-06-15     | 2010-08-18
(4 rows)
```

## Epics
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql-epics"></a>

### Convert NORMALIZE to Amazon Redshift SQL
<a name="convert-normalize-to-amazon-redshift-sql"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create your Teradata SQL code. | Use the NORMALIZE phrase according to your needs. | SQL developer | 
| Convert the code to Amazon Redshift SQL. | To convert your code, follow the guidelines in the "Tools" section of this pattern. | SQL developer | 
| Run the code in Amazon Redshift. | Create your table, load data into the table, and run your code in Amazon Redshift. | SQL developer | 

## Related resources
<a name="convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql-resources"></a>

**References**
+ [Teradata NORMALIZE temporal feature](https://docs.teradata.com/reader/2_MC9vCtAJRlKle2Rpb0mA/MIGI5UMEwgERC3Un7tEZ6g) (Teradata documentation)
+ [LAG window function](https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LAG.html) (Amazon Redshift documentation)
+ [Migrate to Amazon Redshift](https://aws.amazon.com/redshift/data-warehouse-migration/) (AWS website)
+ [Migrate a Teradata database to Amazon Redshift using AWS SCT data extraction agents](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-a-teradata-database-to-amazon-redshift-using-aws-sct-data-extraction-agents.html) (AWS Prescriptive Guidance)
+ [Convert the Teradata RESET WHEN feature to Amazon Redshift SQL](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/convert-the-teradata-reset-when-feature-to-amazon-redshift-sql.html) (AWS Prescriptive Guidance)

**Tools**
+ [AWS Schema Conversion Tool (AWS SCT)](https://aws.amazon.com/dms/schema-conversion-tool/)

**Partners**
+ [AWS Migration Competency Partners](https://aws.amazon.com/migration/partner-solutions/#delivery)

# Convert the Teradata RESET WHEN feature to Amazon Redshift SQL
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql"></a>

*Po Hong, Amazon Web Services*

## Summary
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql-summary"></a>

**RESET WHEN** is a Teradata feature used in SQL analytical window functions. It is an extension to the ANSI SQL standard. **RESET WHEN** determines the partition over which an SQL window function operates based on some specified condition. If the condition evaluates to **TRUE**, a new, dynamic sub-partition is created inside the existing window partition. For more information about **RESET WHEN**, see the [Teradata documentation](https://docs.teradata.com/reader/1DcoER_KpnGTfgPinRAFUw/b7wL86OoMTPno6hrSPNdDg).

Amazon Redshift doesn’t support **RESET WHEN** in SQL window functions. To implement this functionality, you have to convert **RESET WHEN** to the native SQL syntax in Amazon Redshift, and use multiple, nested functions. This pattern demonstrates how you can use the Teradata **RESET WHEN** feature and how you can convert it to Amazon Redshift SQL syntax. 

## Prerequisites and limitations
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql-prereqs"></a>

**Prerequisites**
+ Basic knowledge of the Teradata data warehouse and its SQL syntax
+ Good understanding of Amazon Redshift and its SQL syntax

## Architecture
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql-architecture"></a>

**Source technology stack**
+ Teradata data warehouse

**Target technology stack**
+ Amazon Redshift

**Architecture**

For a high-level architecture for migrating a Teradata database to Amazon Redshift, see the pattern [Migrate a Teradata database to Amazon Redshift using AWS SCT data extraction agents](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-a-teradata-database-to-amazon-redshift-using-aws-sct-data-extraction-agents.html). The migration doesn't automatically convert the Teradata **RESET WHEN** phrase to Amazon Redshift SQL. You can convert this Teradata extension by following the guidelines in the next section.

## Tools
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql-tools"></a>

**Code**

To illustrate the concept of **RESET WHEN**, consider the following table definition in Teradata:

```
create table systest.f_account_balance                                
( account_id integer NOT NULL,
  month_id integer,
  balance integer )
unique primary index (account_id, month_id);
```

Run the following SQL code to insert sample data into the table:

```
BEGIN TRANSACTION;
Insert Into systest.f_account_balance values (1,1,60);
Insert Into systest.f_account_balance values (1,2,99);
Insert Into systest.f_account_balance values (1,3,94);
Insert Into systest.f_account_balance values (1,4,90);
Insert Into systest.f_account_balance values (1,5,80);
Insert Into systest.f_account_balance values (1,6,88);
Insert Into systest.f_account_balance values (1,7,90);
Insert Into systest.f_account_balance values (1,8,92);
Insert Into systest.f_account_balance values (1,9,10);
Insert Into systest.f_account_balance values (1,10,60);
Insert Into systest.f_account_balance values (1,11,80);
Insert Into systest.f_account_balance values (1,12,10);
END TRANSACTION;
```

The sample table has the following data:


| account_id | month_id | balance | 
| --- | --- | --- | 
| 1 | 1 | 60 | 
| 1 | 2 | 99 | 
| 1 | 3 | 94 | 
| 1 | 4 | 90 | 
| 1 | 5 | 80 | 
| 1 | 6 | 88 | 
| 1 | 7 | 90 | 
| 1 | 8 | 92 | 
| 1 | 9 | 10 | 
| 1 | 10 | 60 | 
| 1 | 11 | 80 | 
| 1 | 12 | 10 | 

For each account, let’s say that you want to analyze the sequence of consecutive monthly balance increases. When one month’s balance is less than, or equal to, the previous month’s balance, the requirement is to reset the counter to zero and restart.

*Teradata RESET WHEN use case*

To analyze this data, Teradata SQL uses a window function with a nested aggregate and a **RESET WHEN** phrase, as follows:

```
SELECT account_id, month_id, balance,
 ( ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY month_id
RESET WHEN balance <= SUM(balance) over (PARTITION BY account_id ORDER BY month_id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) ) -1 ) as balance_increase
FROM systest.f_account_balance
ORDER BY 1,2;
```

Output:


| account_id | month_id | balance | balance_increase | 
| --- | --- | --- | --- | 
| 1 | 1 | 60 | 0 | 
| 1 | 2 | 99 | 1 | 
| 1 | 3 | 94 | 0 | 
| 1 | 4 | 90 | 0 | 
| 1 | 5 | 80 | 0 | 
| 1 | 6 | 88 | 1 | 
| 1 | 7 | 90 | 2 | 
| 1 | 8 | 92 | 3 | 
| 1 | 9 | 10 | 0 | 
| 1 | 10 | 60 | 1 | 
| 1 | 11 | 80 | 2 | 
| 1 | 12 | 10 | 0 | 

The query is processed as follows in Teradata:

1. The **SUM(balance)** window function retrieves the previous month's balance for a given account.

1. We check to see if a balance in a given month (for a given account) is greater than the balance of the previous month.

1. If the balance increased, we track a cumulative count value. If the **RESET WHEN** condition evaluates to **false**, which means that the balance has increased over successive months, we continue to increase the count.

1. The **ROW_NUMBER()** ordered analytical function calculates the count value. When we reach a month whose balance is less than, or equal to, the balance of the previous month, the **RESET WHEN** condition evaluates to **true**. If so, we start a new partition and **ROW_NUMBER()** restarts the count from 1. We use **ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING** to access the value of the previous row.

1. We subtract 1 to ensure that the count value starts with 0.

*Amazon Redshift equivalent SQL*

Amazon Redshift doesn’t support the **RESET WHEN** phrase in an SQL analytical window function. To produce the same result, you must rewrite the Teradata SQL by using Amazon Redshift native SQL syntax and nested sub-queries, as follows: 

```
SELECT account_id, month_id, balance,
   (ROW_NUMBER() OVER(PARTITION BY account_id, new_dynamic_part ORDER BY month_id) -1) as balance_increase
FROM
( SELECT account_id, month_id, balance, prev_balance,
SUM(dynamic_part) OVER (PARTITION BY account_id ORDER BY month_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) As new_dynamic_part
FROM ( SELECT account_id, month_id, balance,
SUM(balance) over (PARTITION BY account_id ORDER BY month_id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) as prev_balance,
(CASE When balance <= prev_balance Then 1 Else 0 END) as dynamic_part
FROM systest.f_account_balance ) A
) B
ORDER BY 1,2;
```

Because Amazon Redshift doesn’t support nested window functions in the **SELECT** clause of a single SQL statement, you must use two nested sub-queries.
+ In the inner sub-query (alias A), a dynamic partition indicator (**dynamic_part**) is created and populated. **dynamic_part** is set to 1 if one month’s balance is less than or equal to the preceding month’s balance; otherwise, it’s set to 0. 
+ In the next layer (alias B), a **new_dynamic_part** attribute is generated as the result of a **SUM** window function. 
+ Finally, you add **new_dynamic_part** as a new partition attribute (**dynamic partition**) to the existing partition attribute (**account_id**) and apply the same **ROW_NUMBER()** window function as in Teradata (and subtract 1). 

After these changes, Amazon Redshift SQL generates the same output as Teradata.
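
To see why the rewrite is equivalent, the **RESET WHEN** counter can be modeled in plain Python on the sample balances. The `balance_increase` helper is hypothetical and only illustrates the reset-and-count behavior that the dynamic sub-partitions implement:

```python
# Sample data from the pattern: (account_id, month_id, balance)
rows = [(1, m, b) for m, b in enumerate(
    [60, 99, 94, 90, 80, 88, 90, 92, 10, 60, 80, 10], start=1)]

def balance_increase(rows):
    """Emulate RESET WHEN: restart a row counter whenever the balance
    does not increase over the previous month within the same account."""
    out = []
    prev_acct, prev_bal, count = None, None, 0
    for acct, month, bal in sorted(rows):
        # First row of an account short-circuits before comparing balances
        if acct != prev_acct or bal <= prev_bal:
            count = 0   # RESET WHEN condition is true: new sub-partition
        else:
            count += 1  # balance increased: keep counting
        out.append((acct, month, bal, count))
        prev_acct, prev_bal = acct, bal
    return out

for row in balance_increase(rows):
    print(row)
```

The counter column it produces matches the `balance_increase` values in the Teradata output table above.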

## Epics
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql-epics"></a>

### Convert RESET WHEN to Amazon Redshift SQL
<a name="convert-reset-when-to-amazon-redshift-sql"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create your Teradata window function. | Use nested aggregates and the RESET WHEN phrase according to your needs. | SQL developer | 
| Convert the code to Amazon Redshift SQL. | To convert your code, follow the guidelines in the "Tools" section of this pattern. | SQL developer | 
| Run the code in Amazon Redshift. | Create your table, load data into the table, and run your code in Amazon Redshift. | SQL developer | 

## Related resources
<a name="convert-the-teradata-reset-when-feature-to-amazon-redshift-sql-resources"></a>

**References**
+ [RESET WHEN Phrase](https://docs.teradata.com/reader/1DcoER_KpnGTfgPinRAFUw/b7wL86OoMTPno6hrSPNdDg) (Teradata documentation)
+ [RESET WHEN explanation](https://stackoverflow.com/questions/53344536/teradata-reset-when-partition-by-order-by) (Stack Overflow)
+ [Migrate to Amazon Redshift](https://aws.amazon.com/redshift/data-warehouse-migration/) (AWS website)
+ [Migrate a Teradata database to Amazon Redshift using AWS SCT data extraction agents](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-a-teradata-database-to-amazon-redshift-using-aws-sct-data-extraction-agents.html) (AWS Prescriptive Guidance)
+ [Convert the Teradata NORMALIZE temporal feature to Amazon Redshift SQL](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/convert-the-teradata-normalize-temporal-feature-to-amazon-redshift-sql.html) (AWS Prescriptive Guidance)

**Tools**
+ [AWS Schema Conversion Tool (AWS SCT)](https://aws.amazon.com/dms/schema-conversion-tool/)

**Partners**
+ [AWS Migration Competency Partners](https://aws.amazon.com/migration/partner-solutions/#delivery)

# Deploy and manage a serverless data lake on the AWS Cloud by using infrastructure as code
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code"></a>

*Kirankumar Chandrashekar and Abdel Jaidi, Amazon Web Services*

## Summary
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code-summary"></a>

This pattern describes how to use [serverless computing](https://aws.amazon.com/serverless/) and [infrastructure as code](https://docs.aws.amazon.com/whitepapers/latest/introduction-devops-aws/infrastructure-as-code.html) (IaC) to implement and administer a data lake on the Amazon Web Services (AWS) Cloud. This pattern is based on the [serverless data lake framework (SDLF)](https://sdlf.workshop.aws/) workshop developed by AWS.

SDLF is a collection of reusable resources that accelerate the delivery of enterprise data lakes on the AWS Cloud and help with faster deployment to production. It is used to implement the foundational structure of a data lake by following best practices.

SDLF implements a continuous integration / continuous deployment (CI/CD) process throughout the code and infrastructure deployment by using AWS services such as AWS CodePipeline, AWS CodeBuild, and AWS CodeCommit.

This pattern uses multiple AWS serverless services to simplify data lake management. These include Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB for storage, AWS Lambda and AWS Glue for computing, and Amazon CloudWatch Events, Amazon Simple Queue Service (Amazon SQS), and AWS Step Functions for orchestration.

AWS CloudFormation and AWS code services act as the IaC layer to provide reproducible and fast deployments with easy operations and administration.

## Prerequisites and limitations
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code-prereqs"></a>

**Prerequisites**
+ An active AWS account.
+ [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html), installed and configured. 
+ A Git client, installed and configured.
+ The [SDLF workshop](https://sdlf.workshop.aws/), open in a web browser window and ready to use. 

## Architecture
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code-architecture"></a>

![\[Implementing and administering a data lake on the AWS Cloud.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/f4fc3ad2-1c4f-45ea-bc86-2db13105a173/images/e0cfff30-f0f8-4fc1-8e84-a152ef615c58.png)


 The architecture diagram illustrates an event-driven process with the following steps. 

1. After a file is added to the raw data S3 bucket, an Amazon S3 event notification is placed in an SQS queue. Each notification is delivered as a JSON file, which contains metadata such as the S3 bucket name, object key, or timestamp.

1. This notification is consumed by a Lambda function that routes the event to the correct extraction, transformation, and loading (ETL) process based on the metadata. The Lambda function can also use contextual configurations stored in an Amazon DynamoDB table. This step enables decoupling and scaling to multiple applications in the data lake.

1. The event is routed to the first Lambda function in the ETL process, which transforms and moves data from the raw data area to the staging area for the data lake. The first step is to update the comprehensive catalog. This is a DynamoDB table that contains all the file metadata of the data lake. Each row in this table holds operational metadata about a single object stored in Amazon S3. A synchronous call is made to a Lambda function that performs a light transformation, which is a computationally inexpensive operation (such as converting a file from one format to another), on the S3 object. Because a new object has been added to the staging S3 bucket, the comprehensive catalog is updated and a message is sent to the SQS queue waiting for the next phase in the ETL.

1. A CloudWatch Events rule triggers a Lambda function every 5 minutes. This function checks whether messages were delivered to the SQS queue from the previous ETL phase. If a message was delivered, the Lambda function starts the second phase of the ETL process in [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/sample-start-workflow.html).

1. A heavy transformation is then applied on a batch of files. This heavy transformation is a computationally expensive operation, such as a synchronous call to an AWS Glue job, AWS Fargate task, Amazon EMR step, or Amazon SageMaker notebook. Table metadata is extracted from the output files by using an AWS Glue crawler, which updates the AWS Glue catalog. File metadata is also added to the comprehensive catalog table in DynamoDB. Finally, a data quality step leveraging [Deequ](https://github.com/awslabs/deequ) is also run.
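
As an illustration of step 2, a routing Lambda function might parse the S3 event notification it receives from the SQS queue as follows. This sketch assumes the standard Amazon S3 event message structure; the bucket name and object key are placeholders, and the actual SDLF code may differ:

```python
import json

def parse_s3_event(sqs_body: str):
    """Extract (bucket, key) pairs from an S3 event notification
    delivered through SQS, so the router can choose an ETL process.
    Field names follow the standard Amazon S3 event message format."""
    event = json.loads(sqs_body)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

# A minimal notification body as it might appear in the queue
# (bucket and key are hypothetical examples)
sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {"bucket": {"name": "raw-data-bucket"},
               "object": {"key": "sales/2021/01/file.csv"}}
    }]
})
print(parse_s3_event(sample))
```

The routing function would then look up the bucket and key prefix in its DynamoDB configuration table to dispatch the event to the correct pipeline.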

**Technology stack**
+ Amazon CloudWatch Events
+ AWS CloudFormation
+ AWS CodePipeline
+ AWS CodeBuild
+ AWS CodeCommit
+ Amazon DynamoDB
+ AWS Glue
+ AWS Lambda
+ Amazon S3
+ Amazon SQS
+ AWS Step Functions

## Tools
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code-tools"></a>
+ [Amazon CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) – CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources.
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) – CloudFormation helps create and provision AWS infrastructure deployments predictably and repeatedly.
+ [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) – CodeBuild is a fully managed build service that compiles your source code, runs unit tests, and produces artifacts that are ready to deploy.
+ [AWS CodeCommit](https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html) – CodeCommit is a version control service hosted by AWS that you can use to privately store and manage assets (such as source code and binary files).
+ [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) – CodePipeline is a continuous delivery service that you can use to model, visualize, and automate the steps required to release your software changes continuously.
+ [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html) – DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with scalability. 
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) – AWS Glue is a fully managed ETL service that makes it easier to prepare and load data for analytics.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) – Lambda supports running code without provisioning or managing servers. Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second. 
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/gsg/GetStartedWithS3.html) – Amazon Simple Storage Service (Amazon S3) is a highly scalable object storage service. Amazon S3 can be used for a wide range of storage solutions, including websites, mobile applications, backups, and data lakes.
+ [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html) – AWS Step Functions is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications.
+ [Amazon SQS](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) – Amazon Simple Queue Service (Amazon SQS) is a fully managed message queuing service that helps you decouple and scale microservices, distributed systems, and serverless applications.
+ [Deequ](https://github.com/awslabs/deequ) – Deequ is a tool that helps you compute data quality metrics for large datasets, define and verify data quality constraints, and stay informed about changes in the data distribution.

**Code repository**

The source code and resources for the SDLF are available in the [AWS Labs GitHub repository](https://github.com/awslabs/aws-serverless-data-lake-framework/).

## Epics
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code-epics"></a>

### Set up the CI/CD pipeline to provision IaC
<a name="set-up-the-ci-cd-pipeline-to-provision-iac"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Set up the CI/CD pipeline to manage IaC for the data lake. | Sign in to the AWS Management Console and follow the steps from the [Initial setup](https://sdlf.workshop.aws/en/10-deployment/100-setup.html) section of the SDLF workshop. This creates the initial CI/CD resources, such as CodeCommit repositories, CodeBuild environments, and CodePipeline pipelines that provision and manage IaC for the data lake. | DevOps engineer | 

### Version-control the IaC
<a name="version-control-the-iac"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Clone the CodeCommit repository on your local machine. | Follow the steps from the [Deploying the foundations](https://sdlf.workshop.aws/en/10-deployment/200-foundations.html) section of the SDLF workshop. This helps you clone the Git repository that hosts IaC into your local environment. For more information, see [Connecting to CodeCommit repositories](https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-connect.html) from the CodeCommit documentation. | DevOps engineer | 
| Modify the CloudFormation templates. | Use your local workstation and a code editor to modify the CloudFormation templates according to your use cases or requirements. Commit them to the locally cloned Git repository. For more information, see [Working with AWS CloudFormation templates](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-guide.html) from the AWS CloudFormation documentation. | DevOps engineer | 
| Push the changes to the CodeCommit repository.  | Your infrastructure code is now under version control, and modifications to your code base are tracked. When you push a change to the CodeCommit repository, CodePipeline automatically delivers it to CodeBuild, which applies it to your infrastructure. If you use the AWS SAM CLI in CodeBuild, run the `sam package` and `sam deploy` commands. If you use the AWS CLI, run the `aws cloudformation package` and `aws cloudformation deploy` commands. | DevOps engineer | 

## Related resources
<a name="deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code-resources"></a>

**Set up the CI/CD pipeline to provision IaC**
+ [SDLF workshop – Initial setup](https://sdlf.workshop.aws/en/10-deployment/100-setup.html)

**Version-control the IaC**
+ [SDLF workshop – Deploying the foundations](https://sdlf.workshop.aws/en/10-deployment/200-foundations.html)
+ [Connecting to CodeCommit repositories](https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-connect.html)
+ [Working with AWS CloudFormation templates](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-guide.html)

**Other resources**
+ [AWS serverless data analytics pipeline reference architecture](https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/)
+ [SDLF documentation](https://sdlf.readthedocs.io/en/latest/)

# Enforce tagging of Amazon EMR clusters at launch
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch"></a>

*Priyanka Chaudhary, Amazon Web Services*

## Summary
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch-summary"></a>

This pattern provides a security control that ensures that Amazon EMR clusters are tagged when they are created. 

Amazon EMR is an Amazon Web Services (AWS) service for processing and analyzing vast amounts of data. Amazon EMR offers an expandable, low-configuration service as an easier alternative to running in-house cluster computing. You can use tagging to categorize AWS resources in different ways, such as by purpose, owner, or environment. For example, you can tag your Amazon EMR clusters by assigning custom metadata to each cluster. A tag consists of a key and value that you define. We recommend that you create a consistent set of tags to meet your organization's requirements. When you add a tag to an Amazon EMR cluster, the tag is also propagated to each active Amazon Elastic Compute Cloud (Amazon EC2) instance that is associated with the cluster. Similarly, when you remove a tag from an Amazon EMR cluster, that tag is removed from each associated, active EC2 instance as well.

The detective control monitors API calls and initiates an Amazon CloudWatch Events event for the [RunJobFlow](https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html), [AddTags](https://docs.aws.amazon.com/emr/latest/APIReference/API_AddTags.html), [RemoveTags](https://docs.aws.amazon.com/emr/latest/APIReference/API_RemoveTags.html), and [CreateTags](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateTags.html) APIs. The event calls AWS Lambda, which runs a Python script. The Python function gets the Amazon EMR cluster ID from the JSON input for the event and performs the following actions:
+ Checks whether the Amazon EMR cluster is configured with the tag names that you specify.
+ If a required tag is missing, sends an Amazon Simple Notification Service (Amazon SNS) notification to the user with the relevant information: the Amazon EMR cluster name, violation details, AWS Region, AWS account, and the Amazon Resource Name (ARN) of the Lambda function that the notification is sourced from.
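
The core of that check is a set comparison between the cluster's tags and the mandatory tag list. The following Python sketch illustrates the logic; function and variable names are illustrative assumptions, not the attached Lambda code:

```python
def find_missing_tags(cluster_tags, required_keys):
    """Return the mandatory tag keys that are absent from a cluster's tags.

    cluster_tags: list of {"Key": ..., "Value": ...} dicts, in the shape
    the EMR DescribeCluster API returns; required_keys: mandatory tag keys.
    """
    present = {tag["Key"] for tag in cluster_tags}
    return sorted(key for key in required_keys if key not in present)

# In the deployed function, cluster_tags would come from a boto3 call such as
# boto3.client("emr").describe_cluster(ClusterId=cluster_id)["Cluster"]["Tags"],
# and a non-empty result would trigger the SNS notification described above.
```

A non-empty return value corresponds to a violation and carries the exact keys to report in the notification.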

## Prerequisites and limitations
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ An Amazon Simple Storage Service (Amazon S3) bucket to upload the provided Lambda code. Or, you can create an S3 bucket for this purpose, as described in the *Epics* section.
+ An active email address where you would like to receive violation notifications.
+ A list of mandatory tags you want to check for.

**Limitations**
+ This security control is regional. You must deploy it in each AWS Region that you want to monitor.

**Product versions**
+ Amazon EMR release 4.8.0 and later

## Architecture
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch-architecture"></a>

**Workflow architecture**

![\[Cluster launch, monitoring using APIs, event generation, Lambda function call, notification sent.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/1a4fc0f8-b0c9-4391-9c79-9eb3898d6ecb/images/0d95c414-69d1-4f29-a9e7-09f202e27014.png)


**Automation and scale**
+ If you are using [AWS Organizations](https://aws.amazon.com/organizations/), you can use [AWS CloudFormation StackSets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/what-is-cfnstacksets.html) to deploy this template in multiple accounts that you want to monitor.

## Tools
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch-tools"></a>

**AWS services**
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) – AWS CloudFormation helps you model and set up your AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle. You can use a template to describe your resources and their dependencies, and launch and configure them together as a stack, instead of managing resources individually. You can manage and provision stacks across multiple AWS accounts and AWS Regions.
+ [Amazon CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) – Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources.
+ [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) – Amazon EMR is a web service that simplifies running big data frameworks and processing vast amounts of data efficiently.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) – AWS Lambda is a compute service that supports running code without provisioning or managing servers. Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second. 
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) – Amazon Simple Storage Service (Amazon S3) is an object storage service. You can use Amazon S3 to store and retrieve any amount of data at any time, from anywhere on the web.
+ [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) – Amazon Simple Notification Service (Amazon SNS) coordinates and manages the delivery or sending of messages between publishers and clients, including web servers and email addresses. Subscribers receive all messages published to the topics to which they subscribe, and all subscribers to a topic receive the same messages.

**Code**

This pattern includes the following attachments:
+ `EMRTagValidation.zip` – The Lambda code for the security control.
+ `EMRTagValidation.yml` – The CloudFormation template that sets up the event and Lambda function.

## Epics
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch-epics"></a>

### Set up the S3 bucket
<a name="set-up-the-s3-bucket"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Define the S3 bucket. | On the [Amazon S3 console](https://console.aws.amazon.com/s3/), choose or create an S3 bucket to host the Lambda code .zip file. This S3 bucket must be in the same AWS Region as the Amazon EMR cluster you want to monitor. An Amazon S3 bucket name is globally unique, and the namespace is shared by all AWS accounts. The S3 bucket name cannot include leading slashes. | Cloud architect | 
| Upload the Lambda code. | Upload the Lambda code .zip file provided in the *Attachments* section to the S3 bucket. | Cloud architect | 

### Deploy the AWS CloudFormation template
<a name="deploy-the-aws-cloudformation-template"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Launch the AWS CloudFormation template. | Open the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation/) in the same AWS Region as your S3 bucket and deploy the template. For more information about deploying AWS CloudFormation templates, see [Creating a stack on the AWS CloudFormation console](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html) in the CloudFormation documentation. | Cloud architect | 
| Complete the parameters in the template. | When you launch the template, you'll be prompted for the following information:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/enforce-tagging-of-amazon-emr-clusters-at-launch.html) | Cloud architect | 

### Confirm the subscription
<a name="confirm-the-subscription"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Confirm the subscription. | When the CloudFormation template deploys successfully, it sends a subscription email to the email address you provided. You must confirm this email subscription to start receiving violation notifications. | Cloud architect | 

## Related resources
<a name="enforce-tagging-of-amazon-emr-clusters-at-launch-resources"></a>
+ [AWS Lambda developer guide](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html)
+ [Tagging clusters in Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags.html)

## Attachments
<a name="attachments-1a4fc0f8-b0c9-4391-9c79-9eb3898d6ecb"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/1a4fc0f8-b0c9-4391-9c79-9eb3898d6ecb/attachments/attachment.zip)

# Ensure Amazon EMR logging to Amazon S3 is enabled at launch
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch"></a>

*Priyanka Chaudhary, Amazon Web Services*

## Summary
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch-summary"></a>

This pattern provides a security control that monitors logging configuration for Amazon EMR clusters running on Amazon Web Services (AWS).

Amazon EMR is an AWS tool for big data processing and analysis. Amazon EMR offers an expandable, low-configuration service as an alternative to running in-house cluster computing. Amazon EMR provides two types of clusters:
+ **Transient Amazon EMR clusters**: Transient Amazon EMR clusters automatically shut down and stop incurring costs when processing is finished.
+ **Persistent Amazon EMR clusters**: Persistent Amazon EMR clusters continue to run after the data processing job is complete.

Amazon EMR and Hadoop both produce log files that report status on the cluster. By default, these are written to the primary node in the **/mnt/var/log/** directory. Depending on how you configure the cluster when you launch it, you can also save these logs to Amazon Simple Storage Service (Amazon S3) and view them through the graphical debugging tool. Note that Amazon S3 logging can be specified only when the cluster is launched. With this configuration, logs are sent from the primary node to the Amazon S3 location every 5 minutes. For transient clusters, Amazon S3 logging is especially important because the clusters disappear when processing is complete, and these log files can be used to debug any failed jobs.
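
Because S3 logging can be specified only at launch, the `LogUri` parameter must appear in the RunJobFlow request itself. The sketch below shows where it fits in the launch parameters; the cluster name, bucket name, and release label are placeholder assumptions, not values from this pattern:

```python
def build_launch_params(log_bucket):
    """Assemble minimal RunJobFlow parameters with S3 logging enabled.

    LogUri can be supplied only here, at launch; it cannot be added to a
    running cluster later.
    """
    return {
        "Name": "example-cluster",                 # placeholder name
        "ReleaseLabel": "emr-6.15.0",              # any supported release
        "LogUri": f"s3://{log_bucket}/emr-logs/",  # enables S3 log delivery
        "Instances": {
            "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster
        },
    }

# The cluster would then be launched with, for example:
# boto3.client("emr").run_job_flow(**build_launch_params("my-log-bucket"))
```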

The pattern uses an AWS CloudFormation template to deploy a security control that monitors API calls and initiates an Amazon CloudWatch Events event for the RunJobFlow API. The event invokes AWS Lambda, which runs a Python script. The Lambda function retrieves the EMR cluster ID from the event JSON input and checks for an Amazon S3 log URI. If an Amazon S3 URI is not found, the Lambda function sends an Amazon Simple Notification Service (Amazon SNS) notification detailing the EMR cluster name, violation details, AWS Region, AWS account, and the Lambda Amazon Resource Name (ARN) that the notification is sourced from.
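
The violation check reduces to reading two fields from the CloudTrail record that the CloudWatch Events event wraps. A minimal sketch, assuming the usual CloudTrail field casing (the attached code may differ in its details):

```python
def extract_cluster_and_log_uri(event):
    """Pull the new cluster ID and the requested S3 log URI from a
    CloudWatch Events event that wraps a CloudTrail RunJobFlow record.

    A log URI of None indicates an Amazon S3 logging violation.
    """
    detail = event.get("detail", {})
    cluster_id = detail.get("responseElements", {}).get("jobFlowId")
    log_uri = detail.get("requestParameters", {}).get("logUri")
    return cluster_id, log_uri
```

When `log_uri` is `None`, the function would publish the SNS notification described above; otherwise it exits without action.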

## Prerequisites and limitations
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ An S3 bucket for the Lambda code .zip file
+ An email address where you want to receive the violation notification

**Limitations**
+ This detective control is regional and must be deployed in the AWS Regions you intend to monitor.

**Product versions**
+ Amazon EMR release 4.8.0 and later

## Architecture
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch-architecture"></a>

**Target technology stack**
+ Amazon CloudWatch Events event
+ Amazon EMR
+ Lambda function
+ S3 bucket
+ Amazon SNS

**Target architecture**

![\[Workflow for ensuring that Amazon EMR logging to Amazon S3 is enabled at launch.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/079af32e-0418-4bb2-bc20-c67ea5ac3b6e/images/3ba125cb-018a-4ace-838f-8c20992ecc20.png)


**Automation and scale**
+ If you are using AWS Organizations, you can use [AWS CloudFormation StackSets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/what-is-cfnstacksets.html) to deploy this template in multiple accounts that you want to monitor.

## Tools
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch-tools"></a>

**Tools**
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) – AWS CloudFormation helps you model and set up AWS resources using infrastructure as code.
+ [Amazon CloudWatch Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html) – Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources.
+ [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) – Amazon EMR is a managed cluster platform that simplifies running big data frameworks.
+ [AWS Lambda](https://aws.amazon.com/lambda/) – AWS Lambda supports running code without provisioning or managing servers. Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second.
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html) – Amazon Simple Storage Service (Amazon S3) is an object storage service that you can use to store and retrieve any amount of data from anywhere on the web.
+ [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) – Amazon SNS is a web service that coordinates and manages the delivery or sending of messages between publishers and clients, including web servers and email addresses.

**Code**
+ A .zip file of the project is available as an attachment.

## Epics
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch-epics"></a>

### Define the S3 bucket
<a name="define-the-s3-bucket"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Define the S3 bucket. | To host the Lambda code .zip file, choose or create an S3 bucket with a unique name that does not contain leading slashes. An S3 bucket name is globally unique, and the namespace is shared by all AWS accounts. Your S3 bucket needs to be in the same AWS Region as the Amazon EMR cluster that is being evaluated. | Cloud Architect | 

### Upload the Lambda code to the S3 bucket
<a name="upload-the-lambda-code-to-the-s3-bucket"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Upload the Lambda code to the S3 bucket. | Upload the Lambda code .zip file that's provided in the "Attachments" section to the S3 bucket. The S3 bucket must be in the same Region as the Amazon EMR cluster that is being evaluated. | Cloud Architect | 

### Deploy the AWS CloudFormation template
<a name="deploy-the-aws-cloudformation-template"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Deploy the AWS CloudFormation template. | On the AWS CloudFormation console, in the same Region as your S3 bucket, deploy the AWS CloudFormation template that's provided as an attachment to this pattern. In the next epic, provide the values for the parameters. For more information about deploying AWS CloudFormation templates, see the "Related resources" section. | Cloud Architect | 

### Complete the parameters in the AWS CloudFormation template
<a name="complete-the-parameters-in-the-aws-cloudformation-template"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Name the S3 bucket. | Enter the name of the S3 bucket that you created in the first epic. | Cloud Architect | 
| Provide the Amazon S3 key. | Provide the location of the Lambda code .zip file in your S3 bucket, without leading slashes (for example, `<directory>/<file-name>.zip`). | Cloud Architect | 
| Provide an email address. | Provide an active email address to receive Amazon SNS notifications. | Cloud Architect | 
| Define the logging level. | Define the logging level and frequency for your Lambda function. "Info" designates detailed informational messages on the application’s progress. "Error" designates error events that could still allow the application to continue running. "Warning" designates potentially harmful situations. | Cloud Architect | 

### Confirm the subscription
<a name="confirm-the-subscription"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Confirm the subscription. | When the template successfully deploys, it sends a subscription email message to the email address provided. You must confirm this email subscription to receive violation notifications. | Cloud Architect | 

## Related resources
<a name="ensure-amazon-emr-logging-to-amazon-s3-is-enabled-at-launch-resources"></a>
+ [AWS Lambda](https://aws.amazon.com/lambda/)
+ [Amazon EMR logging](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html)
+ [Deploying AWS CloudFormation templates](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html)

## Attachments
<a name="attachments-079af32e-0418-4bb2-bc20-c67ea5ac3b6e"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/079af32e-0418-4bb2-bc20-c67ea5ac3b6e/attachments/attachment.zip)

# Generate test data using an AWS Glue job and Python
<a name="generate-test-data-using-an-aws-glue-job-and-python"></a>

*Moinul Al-Mamun, Amazon Web Services*

## Summary
<a name="generate-test-data-using-an-aws-glue-job-and-python-summary"></a>

This pattern shows you how to quickly and easily generate millions of sample files concurrently by creating an AWS Glue job written in Python. The sample files are stored in an Amazon Simple Storage Service (Amazon S3) bucket. The ability to quickly generate a large number of sample files is important for testing or evaluating services in the AWS Cloud. For example, you can test the performance of AWS Glue Studio or AWS Glue DataBrew jobs by performing data analysis on millions of small files in an Amazon S3 prefix.

Although you can use other AWS services to generate sample datasets, we recommend that you use AWS Glue. You don’t need to manage any infrastructure because AWS Glue is a serverless data processing service. You can just bring your code and run it in an AWS Glue cluster. Additionally, AWS Glue provisions, configures, and scales the resources required to run your jobs. You pay only for the resources that your jobs use while running.

## Prerequisites and limitations
<a name="generate-test-data-using-an-aws-glue-job-and-python-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ AWS Command Line Interface (AWS CLI), [installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) to work with the AWS account

**Product versions**
+ Python 3.9
+ AWS CLI version 2

**Limitations**

The maximum number of AWS Glue jobs per trigger is 50. For more information, see [AWS Glue endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html).

## Architecture
<a name="generate-test-data-using-an-aws-glue-job-and-python-architecture"></a>

The following diagram depicts an example architecture centered around an AWS Glue job that writes its output (that is, sample files) to an S3 bucket.

![\[Workflow shows AWS CLI initiates AWS Glue job that writes output to S3 bucket.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/f35943e8-3b2b-410e-a3f0-05e1ebd357d0/images/452ccbda-71f2-42b8-976d-bcc968bb1dab.png)


The diagram includes the following workflow:

1. You use the AWS CLI, AWS Management Console, or an API to initiate the AWS Glue job. The AWS CLI or API enables you to automate the parallelization of the invoked job and reduce the runtime for generating sample files.

1. The AWS Glue job generates file content randomly, converts the content into CSV format, and then stores the content as an Amazon S3 object under a common prefix. Each file is less than a kilobyte. The AWS Glue job accepts two user-defined job parameters: `START_RANGE` and `END_RANGE`. You can use these parameters to set file names and the number of files generated in Amazon S3 by each job run. You can run multiple instances of this job in parallel (for example, 100 instances).
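
The per-file generation step above can be sketched in plain Python. The following is a minimal local approximation under stated assumptions (the CSV shape and names are illustrative); a real Glue Python shell job would read `START_RANGE` and `END_RANGE` with `awsglue.utils.getResolvedOptions` and upload each payload with `boto3.client("s3").put_object`:

```python
import csv
import io
import random
import string

def make_csv_payload(rows=5, columns=4):
    """Build one sub-kilobyte CSV payload of random 8-character strings."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for _ in range(rows):
        writer.writerow(
            "".join(random.choices(string.ascii_lowercase, k=8))
            for _ in range(columns)
        )
    return buf.getvalue()

def object_keys(start_range, end_range, prefix="small-files/"):
    """S3 object keys for one job run, driven by the two job parameters."""
    return [f"{prefix}file_{i}.csv" for i in range(start_range, end_range)]
```

Running many instances of such a job with disjoint `START_RANGE`/`END_RANGE` pairs is what produces the parallelism described above.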

## Tools
<a name="generate-test-data-using-an-aws-glue-job-and-python-tools"></a>
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) is an open-source tool that helps you interact with AWS services through commands in your command-line shell.
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) is a fully managed extract, transform, and load (ETL) service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.
+ [AWS Identity and Access Management (IAM)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.

## Best practices
<a name="generate-test-data-using-an-aws-glue-job-and-python-best-practices"></a>

Consider the following AWS Glue best practices as you implement this pattern:
+ **Use the right AWS Glue worker type to reduce cost.** We recommend that you understand the different properties of worker types, and then choose the right worker type for your workload based on CPU and memory requirements. For this pattern, we recommend that you use a Python shell job as your job type to minimize DPU and reduce cost. For more information, see [Adding jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job.html) in the AWS Glue Developer Guide.
+ **Use the right concurrency limit to scale your job.** We recommend that you base the maximum concurrency of your AWS Glue job on your time requirement and required number of files.
+ **Start generating a small number of files at first.** To reduce cost and save time when you build your AWS Glue jobs, start with a small number of files (such as 1,000). This can make troubleshooting easier. If generating a small number of files is successful, then you can scale to a larger number of files.
+ **Run locally first.** To reduce cost and save time when you build your AWS Glue jobs, start the development locally and test your code. For instructions on setting up a Docker container that can help you write AWS Glue extract, transform, and load (ETL) jobs both in a shell and in an integrated development environment (IDE), see the [Developing AWS Glue ETL jobs locally using a container](https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/) post on the AWS Big Data Blog.

For more AWS Glue best practices, see [Best practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/best-practices.html) in the AWS Glue documentation.

## Epics
<a name="generate-test-data-using-an-aws-glue-job-and-python-epics"></a>

### Create a destination S3 bucket and IAM role
<a name="create-a-destination-s3-bucket-and-iam-role"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket for storing the files. | Create an [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) and a [prefix](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html) within it. This pattern uses the `s3://{your-s3-bucket-name}/small-files/` location for demonstration purposes. | App developer | 
| Create and configure an IAM role. | You must create an IAM role that your AWS Glue job can use to write to your S3 bucket.[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/generate-test-data-using-an-aws-glue-job-and-python.html) | App developer | 

### Create and configure an AWS Glue job to handle concurrent runs
<a name="create-and-configure-an-aws-glue-job-to-handle-concurrent-runs"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an AWS Glue job. | You must create an AWS Glue job that generates your content and stores it in an S3 bucket.Create an [AWS Glue job](https://docs.aws.amazon.com/glue/latest/dg/console-jobs.html), and then configure your job by completing the following steps:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/generate-test-data-using-an-aws-glue-job-and-python.html) | App developer | 
| Update the job code. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/generate-test-data-using-an-aws-glue-job-and-python.html) | App developer | 

### Run the AWS Glue job from the command line or console
<a name="run-the-aws-glue-job-from-the-command-line-or-console"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Run the AWS Glue job from the command line. | To run your AWS Glue job from the AWS CLI, run the following command using your values:<pre>cmd:~$ aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"0","--END_RANGE":"1000000"}'<br />cmd:~$ aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"1000000","--END_RANGE":"2000000"}'</pre>For instructions on running the AWS Glue job from the AWS Management Console, see the *Run the AWS Glue job in the AWS Management Console* story in this pattern. We recommend using the AWS CLI to run AWS Glue jobs if you want to start multiple runs at a time with different parameters, as shown in the example above. To generate all AWS CLI commands that are required to generate a defined number of files using a certain parallelization factor, run the following bash code (using your values):<pre># define parameters<br />NUMBER_OF_FILES=10000000;<br />PARALLELIZATION=50; <br /> <br /># initialize<br />_SB=0;<br />      <br /># generate commands<br />for i in $(seq 1 $PARALLELIZATION); <br />do <br />      echo aws glue start-job-run --job-name create_small_files --arguments "'"'{"--START_RANGE":"'$(((NUMBER_OF_FILES/PARALLELIZATION) * (i-1) + _SB))'","--END_RANGE":"'$(((NUMBER_OF_FILES/PARALLELIZATION) * (i)))'"}'"'";<br />      _SB=1; <br />done</pre>If you use the script above, consider the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/generate-test-data-using-an-aws-glue-job-and-python.html) To see an example of output from the above script, see *Shell script output* in the *Additional information* section of this pattern. | App developer | 
| Run the AWS Glue job in the AWS Management Console. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/generate-test-data-using-an-aws-glue-job-and-python.html) | App developer | 
| Check the status of your AWS Glue job. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/generate-test-data-using-an-aws-glue-job-and-python.html) | App developer | 

## Related resources
<a name="generate-test-data-using-an-aws-glue-job-and-python-resources"></a>

**References**
+ [Registry of Open Data on AWS](https://registry.opendata.aws/)
+ [Data sets for analytics](https://aws.amazon.com/marketplace/solutions/data-analytics/data-sets)
+ [Open Data on AWS](https://aws.amazon.com/opendata/)
+ [Adding jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/add-job.html)
+ [Getting started with AWS Glue](https://aws.amazon.com/glue/getting-started/)

**Guides and patterns**
+ [AWS Glue best practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/best-practices.html)
+ [Load testing applications](https://docs.aws.amazon.com/prescriptive-guidance/latest/load-testing/welcome.html)

## Additional information
<a name="generate-test-data-using-an-aws-glue-job-and-python-additional"></a>

**Benchmarking test**

This pattern was used to generate 10 million files using different parallelization parameters as part of a benchmarking test. The following table shows the output of the test:


| Parallelization | Number of files generated by a job run | Job duration | Speed | 
| --- | --- | --- | --- |
| 10 | 1,000,000 | 6 hours, 40 minutes | Very slow | 
| 50 | 200,000 | 80 minutes | Moderate | 
| 100 | 100,000 | 40 minutes | Fast | 

If you want to make the process faster, you can configure more concurrent runs in your job configuration. You can easily adjust the job configuration based on your requirements, but keep in mind that there is an AWS Glue service quota limit. For more information, see [AWS Glue endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html).

**Shell script output**

The following example shows the output of the shell script from the *Run the AWS Glue job from the command line* story in this pattern.

```
user@MUC-1234567890 MINGW64 ~
  $ # define parameters
  NUMBER_OF_FILES=10000000;
  PARALLELIZATION=50;
  # initialize
  _SB=0;
   
  # generate commands
  for i in $(seq 1 $PARALLELIZATION);
   do
         echo aws glue start-job-run --job-name create_small_files --arguments "'"'{"--START_RANGE":"'$(((NUMBER_OF_FILES/PARALLELIZATION) * (i-1) + _SB))'","--END_RANGE":"'$(((NUMBER_OF_FILES/PARALLELIZATION) * (i)))'"}'"'";
         _SB=1;
   done
   
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"0","--END_RANGE":"200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"200001","--END_RANGE":"400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"400001","--END_RANGE":"600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"600001","--END_RANGE":"800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"800001","--END_RANGE":"1000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"1000001","--END_RANGE":"1200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"1200001","--END_RANGE":"1400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"1400001","--END_RANGE":"1600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"1600001","--END_RANGE":"1800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"1800001","--END_RANGE":"2000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"2000001","--END_RANGE":"2200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"2200001","--END_RANGE":"2400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"2400001","--END_RANGE":"2600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"2600001","--END_RANGE":"2800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"2800001","--END_RANGE":"3000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"3000001","--END_RANGE":"3200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"3200001","--END_RANGE":"3400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"3400001","--END_RANGE":"3600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"3600001","--END_RANGE":"3800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"3800001","--END_RANGE":"4000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"4000001","--END_RANGE":"4200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"4200001","--END_RANGE":"4400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"4400001","--END_RANGE":"4600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"4600001","--END_RANGE":"4800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"4800001","--END_RANGE":"5000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"5000001","--END_RANGE":"5200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"5200001","--END_RANGE":"5400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"5400001","--END_RANGE":"5600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"5600001","--END_RANGE":"5800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"5800001","--END_RANGE":"6000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"6000001","--END_RANGE":"6200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"6200001","--END_RANGE":"6400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"6400001","--END_RANGE":"6600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"6600001","--END_RANGE":"6800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"6800001","--END_RANGE":"7000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"7000001","--END_RANGE":"7200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"7200001","--END_RANGE":"7400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"7400001","--END_RANGE":"7600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"7600001","--END_RANGE":"7800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"7800001","--END_RANGE":"8000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"8000001","--END_RANGE":"8200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"8200001","--END_RANGE":"8400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"8400001","--END_RANGE":"8600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"8600001","--END_RANGE":"8800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"8800001","--END_RANGE":"9000000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"9000001","--END_RANGE":"9200000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"9200001","--END_RANGE":"9400000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"9400001","--END_RANGE":"9600000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"9600001","--END_RANGE":"9800000"}'
  aws glue start-job-run --job-name create_small_files --arguments '{"--START_RANGE":"9800001","--END_RANGE":"10000000"}'
  
  user@MUC-1234567890 MINGW64 ~
```

**FAQ**

*How many concurrent runs or parallel jobs should I use?*

The number of concurrent runs and parallel jobs depends on your time requirements and your desired number of test files. We recommend that you check the size of the files that you’re creating. First, check how much time an AWS Glue job takes to generate your desired number of files. Then, use the right number of concurrent runs to meet your goals. For example, if generating 100,000 files takes 40 minutes but your target time is 30 minutes, then you must increase the concurrency setting for your AWS Glue job.
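
As a rough back-of-envelope estimate, and assuming throughput scales linearly with concurrency, the required number of concurrent runs can be computed like this:

```python
import math

def required_concurrency(files_per_run: int, minutes_per_run: float,
                         target_files: int, target_minutes: float) -> int:
    """Concurrent runs needed, assuming throughput scales linearly with concurrency."""
    per_run_rate = files_per_run / minutes_per_run   # files per minute for one run
    required_rate = target_files / target_minutes    # files per minute needed overall
    return math.ceil(required_rate / per_run_rate)

# One run generates 100,000 files in 40 minutes, but the target is 30 minutes:
# required_concurrency(100_000, 40, 100_000, 30) -> 2
```

Remember to check the result against the AWS Glue concurrency quota before applying it to your job configuration.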

*What type of content can I create using this pattern?*

You can create any type of content, such as text files that use different delimiters or formats (for example, pipe-delimited, JSON, or CSV). This pattern uses Boto3 to write to a file and then saves the file in an S3 bucket.
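
As a minimal sketch of that approach (the bucket and key names are illustrative placeholders, not values from this pattern):

```python
import csv
import io

def render_rows(rows: list[dict], delimiter: str = "|") -> str:
    """Render a list of records as delimited text; the header row comes first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=rows[0].keys(), delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def upload_test_file(bucket: str, key: str, rows: list[dict], delimiter: str = "|") -> None:
    """Write the rendered content to an object in the S3 bucket."""
    import boto3  # imported here so render_rows stays dependency-free
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=render_rows(rows, delimiter).encode("utf-8")
    )

# Example (requires AWS credentials and an existing bucket):
# upload_test_file("amzn-s3-demo-bucket", "test-data/file_0.txt", [{"id": 1, "name": "a"}])
```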

*What level of IAM permission do I need in the S3 bucket?*

You must have an identity-based policy that allows `Write` access to objects in your S3 bucket. For more information, see [Amazon S3: Allows read and write access to objects in an S3 bucket](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html) in the Amazon S3 documentation.

# Cost-effectively ingest IoT data directly into Amazon S3 using AWS IoT Greengrass
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass"></a>

*Sebastian Viviani and Rizwan Syed, Amazon Web Services*

## Summary
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-summary"></a>

This pattern shows you how to cost-effectively ingest Internet of Things (IoT) data directly into an Amazon Simple Storage Service (Amazon S3) bucket by using an AWS IoT Greengrass Version 2 device. The device runs a custom component that reads the IoT data and saves the data in persistent storage (that is, a local disk or volume). Then, the device compresses the IoT data into an Apache Parquet file and uploads the data periodically to an S3 bucket.

The amount and speed of IoT data that you ingest is limited only by your edge hardware capabilities and network bandwidth. You can use Amazon Athena to cost-effectively analyze your ingested data. Athena supports compressed Apache Parquet files and data visualization by using [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html).

## Prerequisites and limitations
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ An [edge gateway](https://docs.aws.amazon.com/greengrass/v1/developerguide/quick-start.html) that runs on [AWS IoT Greengrass Version 2](https://docs.aws.amazon.com/greengrass/v2/developerguide/greengrass-v2-whats-new.html) and collects data from sensors (The data sources and the data collection process are beyond the scope of this pattern, but you can use nearly any type of sensor data. This pattern uses a local [MQTT](https://mqtt.org/) broker with sensors or gateways that publish data locally.)
+ AWS IoT Greengrass [component](https://docs.aws.amazon.com/greengrass/v2/developerguide/develop-greengrass-components.html), [roles](https://docs.aws.amazon.com/greengrass/v1/developerguide/service-role.html), and [SDK dependencies](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#installation)
+ A [stream manager component](https://docs.aws.amazon.com/greengrass/v2/developerguide/stream-manager-component.html) to upload the data to the S3 bucket
+ [AWS SDK for Java](https://aws.amazon.com/sdk-for-java/), [AWS SDK for JavaScript](https://aws.amazon.com/sdk-for-javascript/), or [AWS SDK for Python (Boto3)](https://docs.aws.amazon.com/pythonsdk/) to run the APIs

**Limitations**
+ The data in this pattern isn’t uploaded to the S3 bucket in real time. There is a configurable delay period: data is buffered temporarily on the edge device and then uploaded when the period expires.
+ The SDK is available only in Java, Node.js, and Python.

## Architecture
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-architecture"></a>

**Target technology stack**
+ Amazon S3
+ AWS IoT Greengrass
+ MQTT broker
+ Stream manager component

**Target architecture**

The following diagram shows an architecture designed to ingest IoT sensor data and store that data in an S3 bucket.

![\[Architecture diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/b9032ae2-fffb-4750-b161-09810e19d878/images/8c28e639-5dcf-4950-b4a6-8015ec1a2894.png)


The diagram shows the following workflow:

1. Updates from multiple sensors (for example, temperature and valve sensors) are published to a local MQTT broker.

1. The Parquet file compressor, which subscribes to the topics for these sensor updates, receives the updates.

1. The Parquet file compressor stores the updates locally.

1. After the period elapses, the stored files are compressed into Parquet files and passed to the stream manager for upload to the specified S3 bucket.

1. The stream manager uploads the Parquet files to the S3 bucket.

**Note**  
The stream manager (`StreamManager`) is a managed component. For examples of how to export data to Amazon S3, see [Stream manager](https://docs.aws.amazon.com/greengrass/v2/developerguide/stream-manager-component.html) in the AWS IoT Greengrass documentation. You can use a local MQTT broker as a component or another broker like [Eclipse Mosquitto](https://mosquitto.org/).

## Tools
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-tools"></a>

**AWS tools**
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [AWS IoT Greengrass](https://docs.aws.amazon.com/greengrass/v2/developerguide/what-is-iot-greengrass.html) is an open source IoT edge runtime and cloud service that helps you build, deploy, and manage IoT applications on your devices.

**Other tools**
+ [Apache Parquet](https://parquet.apache.org/) is an open-source column-oriented data file format designed for storage and retrieval.
+ [MQTT](https://docs.aws.amazon.com/iot/latest/developerguide/mqtt.html) (Message Queuing Telemetry Transport) is a lightweight messaging protocol that's designed for constrained devices.

## Best practices
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-best-practices"></a>

**Use the right partition format for uploaded data**

There are no specific requirements for the root prefix names in the S3 bucket (for example, `"myAwesomeDataSet/"` or `"dataFromSource"`), but we recommend that you use a meaningful partition and prefix so that it's easy to understand the purpose of the dataset.

We also recommend that you use the right partitioning in Amazon S3 so that queries run optimally on the dataset. In the following example, the data is partitioned in Hive format so that the amount of data scanned by each Athena query is minimized. This improves performance and reduces cost.

`s3://<ingestionBucket>/<rootPrefix>/year=YY/month=MM/day=DD/HHMM_<suffix>.parquet`
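
A key following that layout can be generated from a timestamp with a few lines of Python (the prefix and suffix below are illustrative placeholders; note that the pattern uses a two-digit year):

```python
from datetime import datetime, timezone

def partition_key(root_prefix: str, ts: datetime, suffix: str) -> str:
    """Build a Hive-style S3 key: <rootPrefix>/year=YY/month=MM/day=DD/HHMM_<suffix>.parquet"""
    return f"{root_prefix}/year={ts:%y}/month={ts:%m}/day={ts:%d}/{ts:%H%M}_{suffix}.parquet"

# Example:
# partition_key("myAwesomeDataSet", datetime.now(timezone.utc), "sensors")
```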

## Epics
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-epics"></a>

### Set up your environment
<a name="set-up-your-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an S3 bucket. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | App developer | 
| Add IAM permissions to the S3 bucket. | To grant users write access to the S3 bucket and prefix that you created earlier, add the following IAM policy to your AWS IoT Greengrass role:<pre>{<br />    "Version": "2012-10-17",<br />    "Statement": [<br />        {<br />            "Sid": "S3DataUpload",<br />            "Effect": "Allow",<br />            "Action": [<br />                "s3:List*",<br />                "s3:Put*"<br />            ],<br />            "Resource": [<br />                "arn:aws:s3:::<ingestionBucket>",<br />                "arn:aws:s3:::<ingestionBucket>/<prefix>/*"<br />            ]<br />        }<br />    ]<br />}</pre>For more information, see [Creating an IAM policy to access Amazon S3 resources](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.Authorizing.IAM.S3CreatePolicy.html) in the Aurora documentation. Next, update the resource policy (if needed) for the S3 bucket to allow write access with the correct AWS [principals](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-bucket-user-policy-specifying-principal-intro.html). | App developer | 

### Build and deploy the AWS IoT Greengrass component
<a name="build-and-deploy-the-aws-iot-greengrass-component"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Update the recipe of the component. | [Update the component configuration](https://docs.aws.amazon.com/greengrass/v2/developerguide/update-component-configurations.html) when you [create a deployment](https://docs.aws.amazon.com/greengrass/v2/developerguide/create-deployments.html) based on the following example:<pre>{<br />  "region": "<region>",<br />  "parquet_period": <period>,<br />  "s3_bucket": "<s3Bucket>",<br />  "s3_key_prefix": "<s3prefix>"<br />}</pre>Replace `<region>` with your AWS Region, `<period>` with your periodic interval, `<s3Bucket>` with your S3 bucket, and `<s3prefix>` with your prefix. | App developer | 
| Create the component. | Do one of the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | App developer | 
| Update the MQTT client. | The sample code doesn't use authentication because the component connects locally to the broker. If your scenario differs, update the MQTT client section as needed. Additionally, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | App developer | 

### Add the component to the AWS IoT Greengrass Version 2 core device
<a name="add-the-component-to-the-aws-iot-greengrass-version-2-core-device"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Update the deployment of the core device. | If the deployment of the AWS IoT Greengrass Version 2 core device already exists, [revise the deployment](https://docs.aws.amazon.com/greengrass/v2/developerguide/revise-deployments.html). If the deployment doesn't exist, [create a new deployment](https://docs.aws.amazon.com/greengrass/v2/developerguide/create-deployments.html). To give the component the correct name, [update the log manager configuration](https://docs.aws.amazon.com/greengrass/v2/developerguide/log-manager-component.html) for the new component (if needed) based on the following:<pre>{<br />  "logsUploaderConfiguration": {<br />    "systemLogsConfiguration": {<br />    ...<br />    },<br />    "componentLogsConfigurationMap": {<br />      "<com.iot.ingest.parquet>": {<br />        "minimumLogLevel": "INFO",<br />        "diskSpaceLimit": "20",<br />        "diskSpaceLimitUnit": "MB",<br />        "deleteLogFileAfterCloudUpload": "false"<br />      }<br />      ...<br />    }<br />  },<br />  "periodicUploadIntervalSec": "300"<br />}</pre>Finally, complete the revision of the deployment for your AWS IoT Greengrass core device. | App developer | 

### Verify data ingestion into the S3 bucket
<a name="verify-data-ingestion-into-the-s3-bucket"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Check the logs for the AWS IoT Greengrass volume. | Check for the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | App developer | 
| Check the S3 bucket. | Verify that the data is being uploaded to the S3 bucket. You can see new files being uploaded every period. You can also verify that the data was uploaded by querying the data, as described in the next section. | App developer | 

### Set up querying from Athena
<a name="set-up-querying-from-athena"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a database and table. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | App developer | 
| Grant Athena access to the data. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | App developer | 
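
Once the table exists, a query can be started programmatically with Boto3. In the following sketch, the table, database, and results-bucket names are illustrative placeholders; the `WHERE` clause restricts the scan to one day's Hive partition, as recommended in the *Best practices* section.

```python
def pruned_query(table: str, year: str, month: str, day: str, limit: int = 10) -> str:
    """Build a query that scans only one day's partition of the ingested data."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE year='{year}' AND month='{month}' AND day='{day}' LIMIT {limit}"
    )

def run_athena_query(query: str, database: str, output_location: str) -> str:
    """Start an Athena query and return its execution ID."""
    import boto3  # imported here so pruned_query stays dependency-free
    response = boto3.client("athena").start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

# Example (requires AWS credentials, the Athena database, and a results bucket):
# run_athena_query(pruned_query("sensor_data", "24", "03", "05"),
#                  "iot_ingestion_db", "s3://amzn-s3-demo-bucket/athena-results/")
```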

## Troubleshooting
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| MQTT client fails to connect | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | 
| MQTT client fails to subscribe | Validate the permissions on the MQTT broker. If you have an MQTT broker from AWS, see [MQTT 3.1.1 broker (Moquette)](https://docs.aws.amazon.com/greengrass/v2/developerguide/mqtt-broker-moquette-component.html) and [MQTT 5 broker (EMQX)](https://docs.aws.amazon.com/greengrass/v2/developerguide/mqtt-broker-emqx-component.html). | 
| Parquet files don't get created | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | 
| Objects are not uploaded to the S3 bucket | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass.html) | 

## Related resources
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-resources"></a>
+ [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (Pandas documentation)
+ [Apache Parquet Documentation](https://parquet.apache.org/docs/) (Parquet documentation)
+ [Develop AWS IoT Greengrass components](https://docs.aws.amazon.com/greengrass/v2/developerguide/develop-greengrass-components.html) (AWS IoT Greengrass Developer Guide, Version 2)
+ [Deploy AWS IoT Greengrass components to devices](https://docs.aws.amazon.com/greengrass/v2/developerguide/manage-deployments.html) (AWS IoT Greengrass Developer Guide, Version 2)
+ [Interact with local IoT devices](https://docs.aws.amazon.com/greengrass/v2/developerguide/interact-with-local-iot-devices.html) (AWS IoT Greengrass Developer Guide, Version 2)
+ [MQTT 3.1.1 broker (Moquette)](https://docs.aws.amazon.com/greengrass/v2/developerguide/mqtt-broker-moquette-component.html) (AWS IoT Greengrass Developer Guide, Version 2)
+ [MQTT 5 broker (EMQX)](https://docs.aws.amazon.com/greengrass/v2/developerguide/mqtt-broker-emqx-component.html) (AWS IoT Greengrass Developer Guide, Version 2)

## Additional information
<a name="cost-effectively-ingest-iot-data-directly-into-amazon-s3-using-aws-iot-greengrass-additional"></a>

**Cost analysis**

The following cost analysis scenario demonstrates how the data ingestion approach covered in this pattern can impact data ingestion costs in the AWS Cloud. The pricing examples in this scenario are based on prices at the time of publication. Prices are subject to change. Additionally, your costs may vary depending on your AWS Region, AWS service quotas, and other factors related to your cloud environment.

*Input signal set*

This analysis uses the following set of input signals as the basis for comparing IoT ingestion costs with other available alternatives.


| Number of signals | Frequency | Data per signal | 
| --- | --- | --- |
| 125 | 25 Hz | 8 bytes | 

In this scenario, the system receives 125 signals. Each signal is 8 bytes and occurs every 40 milliseconds (25 Hz). These signals could come individually or grouped in a common payload. You have the option to split and pack these signals based on your needs. You can also determine the latency. Latency consists of the time period for receiving, accumulating, and ingesting the data.

For comparison purposes, the ingestion operation for this scenario is based in the `us-east-1` AWS Region. The cost comparison applies to AWS services only. Other costs, like hardware or connectivity, are not factored into the analysis.

*Cost comparisons*

The following table shows the monthly cost in US dollars (USD) for each ingestion method.


| Method | Monthly cost | 
| --- | --- |
| AWS IoT SiteWise* | 331.77 USD | 
| AWS IoT SiteWise Edge with data processing pack (keeping all data at the edge) | 200 USD | 
| AWS IoT Core and Amazon S3 rules for accessing raw data | 84.54 USD | 
| Parquet file compression at the edge and uploading to Amazon S3 | 0.5 USD | 

*Data must be downsampled to comply with service quotas. This means there is some data loss with this method.

*Alternative methods*

This section shows the equivalent costs for the following alternative methods:
+ **AWS IoT SiteWise** – Each signal must be uploaded in an individual message. Therefore, the total number of messages per month is 125×25×3600×24×30, or 8.1 billion messages per month. However, AWS IoT SiteWise can handle only 10 data points per second per property. Assuming the data is downsampled to 10 Hz, the number of messages per month is reduced to 125×10×3600×24×30, or 3.24 billion. If you use the publisher component that packs measurements in groups of 10 (at 1 USD per million messages), then you get a monthly cost of 324 USD. Assuming that each message is 8 bytes (1 KB/125), that’s 25.92 GB of data storage, which adds a monthly cost of 7.77 USD. The total cost for the first month is 331.77 USD and increases by 7.77 USD every month.
+ **AWS IoT SiteWise Edge with data processing pack, including all models and signals fully processed at the edge (that is, no cloud ingestion)** – You can use the data processing pack as an alternative to reduce costs and to configure all the models that get calculated at the edge. This can work just for storage and visualization, even if no real calculation is performed. In this case, it’s necessary to use powerful hardware for the edge gateway. There is a fixed cost of 200 USD per month.
+ **Direct ingestion to AWS IoT Core by MQTT and an IoT rule to store the raw data in Amazon S3** – Assuming all the signals are published in a common payload, the total number of messages published to AWS IoT Core is 25×3600×24×30, or 64.8 million per month. At 1 USD per million messages, that’s a monthly cost of 64.8 USD. At 0.15 USD per million rule activations and with one rule per message, that adds a monthly cost of 19.44 USD. At a cost of 0.023 USD per GB of storage in Amazon S3, that adds another 1.5 USD for the first month (increasing every month to reflect the new data). The total cost for the first month is 84.54 USD and increases by 1.5 USD every month.
+ **Compressing data at the edge in a Parquet file and uploading to Amazon S3 (proposed method)** – The compression ratio depends on the type of data. With the same industrial data tested for MQTT, the total output data for a full month is 1.2 GB, which costs 0.03 USD per month. Compression ratios (using random data) described in other benchmarks are on the order of 66 percent (closer to a worst-case scenario). In that case, the total data is 21 GB and costs 0.5 USD per month.
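
As a sanity check, the message counts and message costs quoted above can be reproduced in a few lines of arithmetic. The per-million prices are the at-publication figures from this section, not current pricing.

```python
def monthly_messages(signals: int, hz: float, seconds_per_month: int = 3600 * 24 * 30) -> int:
    """Total messages (or data points) ingested per 30-day month."""
    return int(signals * hz * seconds_per_month)

# AWS IoT SiteWise: downsampled to 10 Hz, measurements packed in groups of 10,
# at 1 USD per million messages
sitewise_points = monthly_messages(125, 10)                       # 3.24 billion data points
sitewise_message_cost = (sitewise_points / 10) / 1_000_000 * 1.0  # 324.0 USD per month

# AWS IoT Core: all 125 signals in one common payload at 25 Hz
iot_core_messages = monthly_messages(1, 25)                       # 64.8 million messages
iot_core_message_cost = iot_core_messages / 1_000_000 * 1.0       # 64.8 USD per month
```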

**Parquet file generator**

The following code example shows the structure of a Parquet file generator that's written in Python. The code example is for illustration purposes only and won’t work if pasted into your environment.

```
import logging
import queue
import time

import paho.mqtt.client as mqtt
import pandas as pd

# Stream manager classes come from the AWS IoT Greengrass stream manager SDK for Python
from stream_manager import (
    MessageStreamDefinition,
    S3ExportTaskDefinition,
    StrategyOnFull,
    StreamManagerClient,
)
from stream_manager.util import Util

logger = logging.getLogger(__name__)

# Queue for decoupling the MQTT thread from the main loop
messageQueue = queue.Queue()
client = mqtt.Client()
streammanager = StreamManagerClient()

filename = "tempfile.parquet"
streamname = "mystream"
destination_bucket = "amzn-s3-demo-bucket"
key_name = "mykey"
period = 60          # seconds between Parquet uploads
mqtt_timeout = 10    # seconds to wait for an MQTT message

def feederListener(topic, message):
    payload = {
        "topic": topic,
        "payload": message,
    }
    messageQueue.put_nowait(payload)

def on_connect(client_instance, userdata, flags, rc):
    client.subscribe("#", qos=0)

def on_message(client, userdata, message):
    feederListener(topic=str(message.topic), message=str(message.payload.decode("utf-8")))

client.on_connect = on_connect
client.on_message = on_message
streammanager.create_message_stream(
    MessageStreamDefinition(name=streamname, strategy_on_full=StrategyOnFull.OverwriteOldestData)
)

accumulator = []
nextUploadTimestamp = time.time() + period

while True:
    try:
        message = messageQueue.get(timeout=mqtt_timeout)
        accumulator.append(message)
    except queue.Empty:
        logger.warning("MQTT message reception timed out")

    if time.time() >= nextUploadTimestamp:
        # Compress the accumulated updates into a Parquet file ...
        df = pd.DataFrame(accumulator)
        df.to_parquet(filename)
        # ... and hand the file to the stream manager for upload to the S3 bucket
        s3_export_task_definition = S3ExportTaskDefinition(
            input_url=filename, bucket=destination_bucket, key=key_name
        )
        streammanager.append_message(
            streamname, Util.validate_and_serialize_to_json_bytes(s3_export_task_definition)
        )
        accumulator = []
        nextUploadTimestamp += period
```

# Launch a Spark job in a transient EMR cluster using a Lambda function
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function"></a>

*Dhrubajyoti Mukherjee, Amazon Web Services*

## Summary
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-summary"></a>

This pattern uses the Amazon EMR RunJobFlow API action to launch a transient cluster to run a Spark job from a Lambda function. A transient EMR cluster is designed to terminate as soon as the job is complete or if any error occurs. A transient cluster provides cost savings because it runs only during the computation time, and it provides scalability and flexibility in a cloud environment.

The transient EMR cluster is launched using the Boto3 API and the Python programming language in a Lambda function. The Lambda function, which is written in Python, provides the added flexibility of initiating the cluster when it is needed.

To demonstrate a sample batch computation and output, this pattern will launch a Spark job in an EMR cluster from a Lambda function and run a batch computation against the example sales data of a fictional company. The output of the Spark job will be a comma-separated values (CSV) file in Amazon Simple Storage Service (Amazon S3). The input data file, Spark .jar file, a code snippet, and an AWS CloudFormation template for a virtual private cloud (VPC) and AWS Identity and Access Management (IAM) roles to run the computation are provided as an attachment.
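
A minimal Lambda handler for the RunJobFlow call described above might look like the following sketch. The cluster name, instance types, and the environment-variable names for the roles and subnet are illustrative placeholders, not values from this pattern's attachment.

```python
import os

def build_spark_step(jar_s3_path: str, step_args: list[str]) -> dict:
    """An EMR step that submits the Spark application through command-runner.jar."""
    return {
        "Name": "spark-batch",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", jar_s3_path, *step_args],
        },
    }

def lambda_handler(event, context):
    """Launch a transient EMR cluster that runs one Spark step and then terminates."""
    import boto3  # imported here so build_spark_step stays dependency-free
    response = boto3.client("emr").run_job_flow(
        Name="transient-spark-cluster",
        ReleaseLabel="emr-6.0.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2SubnetId": os.environ["SUBNET_ID"],
            # A transient cluster terminates as soon as its steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[build_spark_step(event["jar_s3_path"],
                                [event["input_s3_path"], event["output_s3_path"]])],
        JobFlowRole=os.environ["EMR_EC2_ROLE"],
        ServiceRole=os.environ["EMR_SERVICE_ROLE"],
    )
    return {"ClusterId": response["JobFlowId"]}
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` and `ActionOnFailure` to `TERMINATE_CLUSTER` is what makes the cluster transient: it shuts down when the step completes or fails.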

## Prerequisites and limitations
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-prereqs"></a>

**Prerequisites**
+ An active AWS account

**Limitations**
+ Only one Spark job can be initiated from the code at a time. 

**Product versions**
+ Tested on Amazon EMR 6.0.0

## Architecture
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-architecture"></a>

**Target technology stack**
+ Amazon EMR 
+ AWS Lambda
+ Amazon S3
+ Apache Spark

**Target architecture**

![\[Lambda to Amazon EMR and Spark to Amazon S3\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/6c999fa7-9550-4929-a5c1-60394142175d/images/eb4fbb3f-2114-44d2-b9da-3fdcb9ca456e.png)


**Automation and scale**

To automate the Spark-EMR batch computation, you can use any of the following options.
+ Implement an Amazon EventBridge rule that can initiate the Lambda function on a cron schedule. For more information, see [Tutorial: Schedule AWS Lambda functions using EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html).
+ Configure [Amazon S3 event notifications](https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html) to initiate the Lambda function on file arrival.
+ Pass the input parameters to the AWS Lambda function through the event body and Lambda environment variables. 
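For example, a handler can resolve its input object from the Amazon S3 event notification payload when triggered by a file arrival, and fall back to environment variables when invoked on a schedule. The following is a minimal sketch; the environment variable names `INPUT_BUCKET` and `INPUT_KEY` are illustrative, not part of this pattern:

```python
import os

def resolve_input_location(event):
    """Return (bucket, key) for the input file.

    Uses the Amazon S3 event notification payload when present;
    otherwise falls back to Lambda environment variables
    (INPUT_BUCKET and INPUT_KEY are illustrative names).
    """
    records = event.get("Records", [])
    if records:
        s3 = records[0]["s3"]
        return s3["bucket"]["name"], s3["object"]["key"]
    return os.environ["INPUT_BUCKET"], os.environ["INPUT_KEY"]
```

A scheduled EventBridge invocation passes an event without `Records`, so the function falls through to the environment variables.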

## Tools
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-tools"></a>

**AWS services**
+ [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) is a managed cluster platform that simplifies running big data frameworks on AWS to process and analyze large amounts of data.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

**Other tools**
+ [Apache Spark](https://spark.apache.org/docs/latest/) is a multiple-language analytics engine for large-scale data processing.

## Epics
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-epics"></a>

### Create the Amazon EMR and Lambda IAM roles and the VPC
<a name="create-the-amazon-emr-and-lambda-iam-roles-and-the-vpc"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create the IAM roles and the VPC. | If you already have the AWS Lambda and Amazon EMR IAM roles and a VPC, you can skip this step. To run the code, both the EMR cluster and the Lambda function require IAM roles. The EMR cluster also requires a VPC with a public subnet or a private subnet with a NAT gateway. To automatically create all the IAM roles and a VPC, deploy the attached AWS CloudFormation template as is, or you can create the roles and the VPC manually as specified in the *Additional information* section. | Cloud architect | 
| Note the AWS CloudFormation template output keys. | After the CloudFormation template has successfully deployed, navigate to the **Outputs** tab in the AWS CloudFormation console. Note the five output keys: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function.html)You will use the values from these keys when you create the Lambda function. | Cloud architect | 

### Upload the Spark .jar file
<a name="upload-the-spark-jar-file"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Upload the Spark .jar file. | Upload the Spark .jar file to the S3 bucket that the AWS CloudFormation stack created. The bucket name is the value of the `S3Bucket` output key. | General AWS | 

### Create the Lambda function to launch the EMR cluster
<a name="create-the-lambda-function-to-launch-the-emr-cluster"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a Lambda function. | On the Lambda console, create a Python 3.9 Lambda function with an execution role. The execution role policy must allow Lambda to launch an EMR cluster. (See the attached AWS CloudFormation template.) | Data engineer, Cloud engineer | 
| Copy and paste the code. | Replace the code in the `lambda_function.py` file with the code from the *Additional information* section of this pattern. | Data engineer, Cloud engineer | 
| Change the parameters in the code. | Follow the comments in the code to change the parameter values to match your AWS account. | Data engineer, Cloud engineer | 
| Launch the function to initiate the cluster. | Launch the function to initiate the creation of a transient EMR cluster with the Spark .jar file provided. It will run the Spark job and terminate automatically when the job is complete. | Data engineer, Cloud engineer | 
| Check the EMR cluster status. | After the EMR cluster is initiated, it appears in the Amazon EMR console under the **Clusters** tab. Any errors that occur while launching the cluster or running the job can then be checked there. | Data engineer, Cloud engineer | 
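Instead of watching the console, you can query the cluster state programmatically through the EMR `describe_cluster` API. This is a hedged sketch; the cluster ID comes from the `run_job_flow` response, and the Region default is a placeholder:

```python
def cluster_state(response):
    """Extract the state (for example STARTING, RUNNING, or TERMINATED)
    from an EMR describe_cluster response."""
    return response["Cluster"]["Status"]["State"]

def check_cluster(cluster_id, region="us-east-1"):
    """Query Amazon EMR for the current state of a cluster.
    Requires AWS credentials; the region argument is a placeholder."""
    import boto3  # imported here so the helper above stays SDK-free
    emr = boto3.client("emr", region_name=region)
    return cluster_state(emr.describe_cluster(ClusterId=cluster_id))
```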

### Set up and run the sample demo
<a name="set-up-and-run-the-sample-demo"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Upload the Spark .jar file.  | Download the Spark .jar file from the *Attachments* section and upload it to the S3 bucket. | Data engineer, Cloud engineer | 
| Upload the input dataset. | Upload the attached `fake_sales_data.csv` file to the S3 bucket. | Data engineer, Cloud engineer | 
| Paste the Lambda code and change the parameters. | Copy the code from the *Additional information* section of this pattern, and paste it into a Lambda function, replacing the code in the `lambda_function.py` file. Change the parameter values to match your account. | Data engineer, Cloud engineer | 
| Launch the function and verify the output. | After the Lambda function initiates the cluster with the provided Spark job, it generates a .csv file in the S3 bucket. | Data engineer, Cloud engineer | 

## Related resources
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-resources"></a>
+ [Building Spark](https://spark.apache.org/docs/latest/building-spark.html)
+ [Apache Spark and Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html)
+ [Boto3 `run_job_flow` documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow)
+ [Apache Spark information and documentation](https://spark.apache.org/)

## Additional information
<a name="launch-a-spark-job-in-a-transient-emr-cluster-using-a-lambda-function-additional"></a>

**Code**

```
"""
Copy paste the following code in your Lambda function. Make sure to change the following key parameters for the API as per your account

-Name (Name of Spark cluster)
-LogUri (S3 bucket to store EMR logs)
-Ec2SubnetId (The subnet to launch the cluster into)
-JobFlowRole (Service role for EC2)
-ServiceRole (Service role for Amazon EMR)

The following parameters are additional parameters for the Spark job itself. Change the bucket name and prefix for the Spark job (located at the bottom).

-s3://your-bucket-name/prefix/lambda-emr/SparkProfitCalc.jar (Spark jar file)
-s3://your-bucket-name/prefix/fake_sales_data.csv (Input data file in S3)
-s3://your-bucket-name/prefix/outputs/report_1/ (Output location in S3)
"""
import boto3

client = boto3.client('emr')


def lambda_handler(event, context):
    response = client.run_job_flow(
        Name='spark_job_cluster',
        LogUri='s3://your-bucket-name/prefix/logs',
        ReleaseLabel='emr-6.0.0',
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.large',
            'InstanceCount': 1,
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False,
            'Ec2SubnetId': 'subnet-XXXXXXXXXXXXXX'
        },
        Applications=[{'Name': 'Spark'}],
        Configurations=[
            {'Classification': 'spark-hive-site',
             'Properties': {
                 'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
             }
        ],
        VisibleToAllUsers=True,
        JobFlowRole='EMRLambda-EMREC2InstanceProfile-XXXXXXXXX',
        ServiceRole='EMRLambda-EMRRole-XXXXXXXXX',
        Steps=[
            {
                'Name': 'flow-log-analysis',
                'ActionOnFailure': 'TERMINATE_CLUSTER',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': [
                        'spark-submit',
                        '--deploy-mode', 'cluster',
                        '--executor-memory', '6G',
                        '--num-executors', '1',
                        '--executor-cores', '2',
                        '--class', 'com.aws.emr.ProfitCalc',
                        's3://your-bucket-name/prefix/lambda-emr/SparkProfitCalc.jar',
                        's3://your-bucket-name/prefix/fake_sales_data.csv',
                        's3://your-bucket-name/prefix/outputs/report_1/'
                    ]
                }
            }
        ]
    )
    # Return the cluster ID so the invoker can track the job flow.
    return response['JobFlowId']
```

**IAM roles and VPC creation**

To launch the EMR cluster in a Lambda function, a VPC and IAM roles are needed. You can set up the VPC and IAM roles by using the AWS CloudFormation template in the Attachments section of this pattern, or you can manually create them by using the following links. 

The following IAM roles are required to run Lambda and Amazon EMR. 

*Lambda execution role*

A Lambda function's [execution role](https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html) grants it permission to access AWS services and resources.

*Service role for Amazon EMR*

The [Amazon EMR role](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html) defines the allowable actions for Amazon EMR when provisioning resources and performing service-level tasks that are not performed in the context of an Amazon Elastic Compute Cloud (Amazon EC2) instance running within a cluster. For example, the service role is used to provision EC2 instances when a cluster launches.

*Service role for EC2 instances*

The [service role for cluster EC2 instances](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html) (also called the EC2 instance profile for Amazon EMR) is a special type of service role that is assigned to every EC2 instance in an Amazon EMR cluster when the instance launches. Application processes that run on top of Apache Hadoop assume this role for permissions to interact with other AWS services.
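If you create these roles manually instead of through the attached template, each role needs a trust policy that allows the corresponding service to assume it. The following is a minimal sketch of building such a policy document; `lambda.amazonaws.com` and `elasticmapreduce.amazonaws.com` are the standard service principals for Lambda and Amazon EMR:

```python
import json

def trust_policy(service_principal):
    """Build a trust policy that allows an AWS service to assume the role,
    for example 'lambda.amazonaws.com' or 'elasticmapreduce.amazonaws.com'."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# Pass json.dumps(trust_policy("elasticmapreduce.amazonaws.com")) as the
# AssumeRolePolicyDocument argument of the boto3 iam.create_role call.
```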

*VPC and subnet creation*

You can [create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#create-vpc-and-other-resources) from the VPC console. 

## Attachments
<a name="attachments-6c999fa7-9550-4929-a5c1-60394142175d"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/6c999fa7-9550-4929-a5c1-60394142175d/attachments/attachment.zip)

# Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue"></a>

*Nikolai Kolesnikov, Karthiga Priya Chandran, and Samir Patel, Amazon Web Services*

## Summary
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-summary"></a>

This pattern shows you how to migrate your existing Apache Cassandra workloads to Amazon Keyspaces (for Apache Cassandra) by using CQLReplicator on AWS Glue. You can use CQLReplicator on AWS Glue to reduce the replication lag of your migration to a matter of minutes. You also learn how to use an Amazon Simple Storage Service (Amazon S3) bucket to store data required for the migration, including [Apache Parquet](https://parquet.apache.org/) files, configuration files, and scripts. This pattern assumes that your Cassandra workloads are hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances in a virtual private cloud (VPC).

## Prerequisites and limitations
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-prereqs"></a>

**Prerequisites**
+ A Cassandra cluster with a source table
+ A target table in Amazon Keyspaces to replicate the workload
+ An S3 bucket to store intermediate Parquet files that contain incremental data changes
+ An S3 bucket to store job configuration files and scripts

**Limitations**
+ CQLReplicator on AWS Glue requires some time to provision Data Processing Units (DPUs) for the Cassandra workloads. Expect the replication lag between the Cassandra cluster and the target keyspace and table in Amazon Keyspaces to last only a matter of minutes.

## Architecture
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-architecture"></a>

**Source technology stack**
+ Apache Cassandra
+ DataStax Server
+ ScyllaDB

**Target technology stack**
+ Amazon Keyspaces

**Migration architecture**

The following diagram shows an example architecture where a Cassandra cluster is hosted on EC2 instances and spread across three Availability Zones. The Cassandra nodes are hosted in private subnets.

![\[Custom service role, Amazon Keyspaces, and Amazon S3, with AWS Glue connecting to the nodes VPC.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/e08048da-8996-4f2c-b8ed-da49fe9e693b/images/76256ab3-a1e6-4c9e-9c40-dc78f51edf0f.png)


The diagram shows the following workflow:

1. A custom service role provides access to Amazon Keyspaces and the S3 bucket.

1. An AWS Glue job reads the job configuration and scripts in the S3 bucket.

1. The AWS Glue job connects through port 9042 to read data from the Cassandra cluster.

1. The AWS Glue job connects through port 9142 to write data to Amazon Keyspaces.

## Tools
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-tools"></a>

**AWS services and tools**
+ [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) is an open-source tool that helps you interact with AWS services through commands in your command-line shell.
+ [AWS CloudShell](https://docs.aws.amazon.com/cloudshell/latest/userguide/welcome.html) is a browser-based shell that you can use to manage AWS services by using the AWS Command Line Interface (AWS CLI) and a range of preinstalled development tools.
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) is a fully managed ETL service that helps you reliably categorize, clean, enrich, and move data between data stores and data streams.
+ [Amazon Keyspaces (for Apache Cassandra)](https://docs.aws.amazon.com/keyspaces/latest/devguide/what-is-keyspaces.html) is a managed database service that helps you migrate, run, and scale your Cassandra workloads in the AWS Cloud.

**Code**

The code for this pattern is available in the GitHub [CQLReplicator](https://github.com/aws-samples/cql-replicator/tree/main/glue) repository.

## Best practices
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-best-practices"></a>
+ To determine the necessary AWS Glue resources for the migration, estimate the number of rows in the source Cassandra table. For example, 250 K rows per 0.25 DPU (2 vCPUs, 4 GB of memory) with 84 GB disk.
+ Pre-warm Amazon Keyspaces tables before running CQLReplicator. For example, eight CQLReplicator tiles (AWS Glue jobs) can write up to 22 K WCUs per second, so the target should be pre-warmed up to 25-30 K WCUs per second.
+ To enable communication between AWS Glue components, use a self-referencing inbound rule for all TCP ports in your security group.
+ Use the incremental traffic strategy to distribute the migration workload over time.
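The self-referencing rule mentioned above can be expressed through the EC2 API as an ingress permission whose source is the security group itself. A sketch using boto3; the group ID and Region are placeholders:

```python
def self_referencing_tcp_rule(group_id):
    """Ingress permission allowing all TCP traffic from the security
    group itself, which AWS Glue components need to communicate."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }

def apply_rule(group_id, region="us-west-2"):
    """Attach the rule to the group (requires AWS credentials)."""
    import boto3  # imported here so the rule builder stays SDK-free
    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=[self_referencing_tcp_rule(group_id)]
    )
```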

## Epics
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-epics"></a>

### Deploy CQLReplicator
<a name="deploy-cqlreplicator"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a target keyspace and table.  | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue.html) | App owner, AWS administrator, DBA, App developer | 
| Configure the Cassandra driver to connect to Cassandra. | Use the following configuration script:<pre>datastax-java-driver {<br />  basic.request.consistency = "LOCAL_QUORUM"<br />  basic.contact-points = ["127.0.0.1:9042"]<br />  advanced.reconnect-on-init = true<br />  basic.load-balancing-policy {<br />    local-datacenter = "datacenter1"<br />  }<br />  advanced.auth-provider = {<br />    class = PlainTextAuthProvider<br />    username = "user-at-sample"<br />    password = "S@MPLE=PASSWORD="<br />  }<br />}</pre>The preceding script uses the Spark Cassandra Connector. For more information, see the reference configuration for [Cassandra](https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/configuration/reference/). | DBA | 
| Configure the Cassandra driver to connect to Amazon Keyspaces. | Use the following configuration script:<pre>datastax-java-driver {<br />  basic {<br />    load-balancing-policy {<br />      local-datacenter = us-west-2<br />    }<br />    contact-points = [<br />      "cassandra.us-west-2.amazonaws.com:9142"<br />    ]<br />    request {<br />      page-size = 2500<br />      timeout = 360 seconds<br />      consistency = LOCAL_QUORUM<br />    }<br />  }<br />  advanced {<br />    control-connection {<br />      timeout = 360 seconds<br />    }<br />    session-leak.threshold = 6<br />    connection {<br />      connect-timeout = 360 seconds<br />      init-query-timeout = 360 seconds<br />      warn-on-init-error = false<br />    }<br />    auth-provider = {<br />      class = software.aws.mcs.auth.SigV4AuthProvider<br />      aws-region = us-west-2<br />    }<br />    ssl-engine-factory {<br />      class = DefaultSslEngineFactory<br />    }<br />  }<br />}</pre>The preceding script uses the Spark Cassandra Connector. For more information, see the reference configuration for [Cassandra](https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/configuration/reference/). | DBA | 
| Create an IAM role for the AWS Glue job. | Create a new AWS service role named `glue-cassandra-migration` with AWS Glue as a trusted entity. The `glue-cassandra-migration` role should provide read and write access to the S3 bucket and Amazon Keyspaces. The S3 bucket contains the .jar files, configuration files for Amazon Keyspaces and Cassandra, and the intermediate Parquet files. For example, attach the `AWSGlueServiceRole`, `AmazonS3FullAccess`, and `AmazonKeyspacesFullAccess` managed policies to the role. | AWS DevOps | 
| Download CQLReplicator in AWS CloudShell. | Download the project to your home folder by running the following command:<pre>git clone https://github.com/aws-samples/cql-replicator.git<br />cd cql-replicator/glue<br /># AWS CloudShell only: install bc, an arbitrary-precision calculator language that the scripts use<br />sudo yum install bc -y</pre> |  | 
| Modify the reference configuration files. | Copy `CassandraConnector.conf` and `KeyspacesConnector.conf` to the `../glue/conf` directory in the project folder. | AWS DevOps | 
| Initiate the migration process. | The following command initializes the CQLReplicator environment. Initialization involves copying .jar artifacts and creating an AWS Glue connector, an S3 bucket, an AWS Glue job, the `migration` keyspace, and the `ledger` table:<pre>cd cql-replicator/glue/bin<br />./cqlreplicator --state init --sg '"sg-1","sg-2"' \ <br />                --subnet "subnet-XXXXXXXXXXXX" \ <br />                --az us-west-2a --region us-west-2 \ <br />                --glue-iam-role glue-cassandra-migration \ <br />                --landing-zone s3://cql-replicator-1234567890-us-west-2<br /></pre>The script includes the following parameters:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue.html) | AWS DevOps | 
| Validate the deployment. | After you run the previous command, the AWS account should contain the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue.html) | AWS DevOps | 

### Run CQLReplicator
<a name="run-cqlreplicator"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Start the migration process. | To operate CQLReplicator on AWS Glue, you need to use the `--state run` command, followed by a series of parameters. The precise configuration of these parameters is primarily determined by your unique migration requirements. For example, these settings might vary if you choose to replicate time to live (TTL) values and updates, or you offload objects exceeding 1 MB to Amazon S3.To replicate the workload from the Cassandra cluster to Amazon Keyspaces, run the following command: <pre>./cqlreplicator --state run --tiles 8  \<br />                --landing-zone s3://cql-replicator-1234567890-us-west-2 \ <br />                --region us-west-2 \                              <br />                --src-keyspace source_keyspace \ <br />                --src-table source_table \  <br />                --trg-keyspace target_keyspace \<br />                --writetime-column column_name \<br />                --trg-table target_table --inc-traffic</pre>Your source keyspace and table are `source_keyspace.source_table` in the Cassandra cluster. Your target keyspace and table are `target_keyspace.target_table` in Amazon Keyspaces. The parameter `--inc-traffic` helps prevent incremental traffic from overloading the Cassandra cluster and Amazon Keyspaces with a high number of requests.To replicate updates, add `--writetime-column regular_column_name` to your command line. The regular column is used as the source of the write timestamp. | AWS DevOps | 

### Monitor the migration process
<a name="monitor-the-migration-process"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Validate migrated Cassandra rows during the historical migration phase. | To obtain the number of rows replicated during the backfilling phase, run the following command:<pre>./cqlreplicator --state stats \<br />                --landing-zone s3://cql-replicator-1234567890-us-west-2 \  <br />                --src-keyspace source_keyspace --src-table source_table --region us-west-2</pre> | AWS DevOps | 

### Stop the migration process
<a name="stop-the-migration-process"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Use the `cqlreplicator` command or the AWS Glue console. | To stop the migration process gracefully, run the following command:<pre>./cqlreplicator --state request-stop --tiles 8 \                         <br />                --landing-zone s3://cql-replicator-1234567890-us-west-2 \     <br />                --region us-west-2 \                     <br />                --src-keyspace source_keyspace --src-table source_table</pre>To stop the migration process immediately, use the AWS Glue console. | AWS DevOps | 

### Clean up
<a name="clean-up"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Delete the deployed resources. | The following command will delete the AWS Glue job, connector, S3 bucket, and Keyspaces table `ledger`:<pre>./cqlreplicator --state cleanup --landing-zone s3://cql-replicator-1234567890-us-west-2</pre> | AWS DevOps | 

## Troubleshooting
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| AWS Glue jobs failed and returned an Out of Memory (OOM) error. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue.html) | 

## Related resources
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-resources"></a>
+ [CQLReplicator with AWS Glue README.MD](https://github.com/aws-samples/cql-replicator/blob/main/glue/README.MD)
+ [AWS Glue documentation](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html)
+ [Amazon Keyspaces documentation](https://docs.aws.amazon.com/keyspaces/latest/devguide/what-is-keyspaces.html)
+ [Apache Cassandra](https://cassandra.apache.org/_/index.html)

## Additional information
<a name="migrate-apache-cassandra-workloads-to-amazon-keyspaces-by-using-aws-glue-additional"></a>

**Migration considerations**

You can use AWS Glue to migrate your Cassandra workload to Amazon Keyspaces, while keeping your Cassandra source databases completely functional during the migration process. After the replication is complete, you can choose to cut over your applications to Amazon Keyspaces with minimal replication lag (a matter of minutes) between the Cassandra cluster and Amazon Keyspaces. To maintain data consistency, you can also use a similar pipeline to replicate the data back to the Cassandra cluster from Amazon Keyspaces.

**Write unit calculations**

As an example, consider that you intend to write 250,000,000 rows with a row size of 1 KiB during one hour. The total number of Amazon Keyspaces write capacity units (WCUs) that you require is based on this calculation:

`(number of rows / (60 min * 60 s)) * 1 WCU per row = (250,000,000 / 3,600 s) * 1 WCU = 69,444 WCUs required`

69,444 WCUs per second is the rate for 1 hour, but you could add some cushion for overhead. For example, `69,444 * 1.10 = 76,388 WCUs` includes 10 percent overhead.
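The calculation can be captured in a small helper. This sketch truncates fractional WCUs the way the figures above do; rounding up would be the safer choice when you provision real capacity:

```python
def required_wcus(rows, hours, overhead=0.10):
    """WCUs needed to load `rows` rows of up to 1 KiB each in `hours`,
    and the same figure padded with a safety overhead (10 percent here).
    Truncates to whole units to match the worked example; round up
    instead when provisioning real capacity."""
    base = int(rows / (hours * 60 * 60))  # 1 WCU writes one 1-KiB row per second
    return base, int(base * (1 + overhead))

# For example, 250,000,000 rows in one hour works out to 69,444 WCUs,
# or 76,388 WCUs with 10 percent overhead.
```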

**Create a keyspace by using CQL**

To create a keyspace by using CQL, run the following commands:

```
CREATE KEYSPACE target_keyspace WITH replication = {'class': 'SingleRegionStrategy'};

CREATE TABLE target_keyspace.target_table (
    userid uuid,
    level text,
    gameid int,
    description text,
    nickname text,
    zip text,
    email text,
    updatetime text,
    PRIMARY KEY (userid, level, gameid)
) WITH default_time_to_live = 0
  AND CUSTOM_PROPERTIES = {'capacity_mode':{'throughput_mode':'PROVISIONED', 'write_capacity_units':76388, 'read_capacity_units':3612}}
  AND CLUSTERING ORDER BY (level ASC, gameid ASC);
```

# Migrate Oracle Business Intelligence 12c to the AWS Cloud from on-premises servers
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers"></a>

*Lanre (Lan-Ray) Showunmi and Patrick Huang, Amazon Web Services*

## Summary
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-summary"></a>

This pattern shows how to migrate [Oracle Business Intelligence Enterprise Edition 12c](https://www.oracle.com/business-analytics/business-intelligence/technologies/bi-enterprise-edition.html) from on-premises servers to the AWS Cloud by using AWS CloudFormation. It also describes how you can use other AWS services to implement Oracle BI 12c components that deliver high availability, security, flexibility, and the ability to dynamically scale.

For a list of best practices related to migrating Oracle BI 12c to the AWS Cloud, see the **Additional information** section of this pattern.

**Note**  
It’s a best practice to run multiple test migrations before transferring your existing Oracle BI 12c data to the cloud. These tests help you fine tune your migration approach, identify and fix potential issues, and estimate downtime requirements more accurately.

## Prerequisites and limitations
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ Secure network connectivity between your on-premises servers and AWS through either [AWS Virtual Private Network (AWS VPN)](https://aws.amazon.com/vpn/) services or [AWS Direct Connect](https://docs.aws.amazon.com/directconnect/latest/UserGuide/Welcome.html)
+ Software licenses for your Oracle operating system, Oracle BI 12c, Oracle Database, Oracle WebLogic Server, and Oracle HTTP Server

**Limitations**

For information about storage size limits, see the [Amazon Relational Database Service (Amazon RDS) for Oracle](https://aws.amazon.com/rds/oracle/features/) documentation.

**Product versions**
+ Oracle Business Intelligence Enterprise Edition 12c
+ Oracle WebLogic Server 12c
+ Oracle HTTP Server 12c
+ Oracle Database 12c (or newer)
+ Oracle Java SE 8

## Architecture
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-architecture"></a>

The following diagram shows an example architecture for running Oracle BI 12c components in the AWS Cloud:

![\[Example architecture for running Oracle BI 12c components in the AWS Cloud.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/8bb72df1-7546-4208-bc70-5789767e3600/images/aae8f8f3-8125-4868-a8e5-eac1cc42812f.png)


 

This diagram shows the following architecture:

1. Amazon Route 53 provides Domain Name System (DNS) configuration.

1. Elastic Load Balancing (ELB) distributes network traffic to improve the scalability and availability of the Oracle BI 12c components across multiple Availability Zones.

1. Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling groups host the Oracle HTTP Servers, WebLogic Administration Server, and managed BI servers across multiple Availability Zones.

1. An Amazon Relational Database Service (Amazon RDS) for Oracle database stores BI Server metadata across multiple Availability Zones.

1. Amazon Elastic File System (Amazon EFS) is mounted to every Oracle BI 12c component for shared file storage.

**Technology stack**
+ Amazon Elastic Block Store (Amazon EBS)
+ Amazon Elastic Compute Cloud (Amazon EC2)
+ Amazon Elastic File System (Amazon EFS)
+ Amazon RDS for Oracle
+ AWS Certificate Manager (ACM)
+ Elastic Load Balancing (ELB)
+ Oracle BI 12c
+ Oracle WebLogic Server 12c
+ Oracle HTTP Server (OHS)

## Tools
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-tools"></a>
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) helps you set up AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle across AWS accounts and Regions.
+ [AWS Certificate Manager (ACM)](https://docs.aws.amazon.com/acm/latest/userguide/acm-overview.html) helps you create, store, and renew public and private SSL/TLS X.509 certificates and keys that protect your AWS websites and applications.
+ [AWS Database Migration Service (AWS DMS)](https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html) helps you migrate data stores into the AWS Cloud or between combinations of cloud and on-premises setups.
+ [Amazon Elastic Compute Cloud (Amazon EC2)](https://docs.aws.amazon.com/ec2/?id=docs_gateway) provides scalable computing capacity in the AWS Cloud. You can launch as many virtual servers as you need, and quickly scale them up or down.
+ [Amazon EC2 Auto Scaling](https://aws.amazon.com/ec2/autoscaling/) helps you maintain application availability and allows you to automatically add or remove Amazon EC2 instances according to conditions you define. 
+ [Amazon Elastic File System (Amazon EFS)](https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html) helps you create and configure shared file systems in the AWS Cloud.
+ [Elastic Load Balancing](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html) distributes incoming application or network traffic across multiple targets. For example, you can distribute traffic across Amazon Elastic Compute Cloud (Amazon EC2) instances, containers, and IP addresses in one or more Availability Zones.
+ [Amazon Relational Database Service (Amazon RDS)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html) helps you set up, operate, and scale a relational database in the AWS Cloud.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [Amazon Virtual Private Cloud (Amazon VPC)](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) helps you launch AWS resources into a virtual network that you’ve defined. This virtual network resembles a traditional network that you’d operate in your own data center, with the benefits of using the scalable infrastructure of AWS.
+ [Oracle Data Pump](https://docs.oracle.com/cd/B19306_01/server.102/b14215/dp_overview.htm) helps you move data and metadata from one database to another at high speeds.
+ [Oracle Fusion Middleware](https://www.oracle.com/middleware/technologies/business-intelligence-v12214-downloads.html) is a suite of application development tools and integration solutions for identity management, collaboration, and business intelligence reporting.
+ [Oracle GoldenGate](https://www.oracle.com/integration/goldengate/#:~:text=OCI%20GoldenGate%20is%20a%20real,in%20the%20Oracle%20Cloud%20Infrastructure.) helps you design, run, orchestrate, and monitor your data replication and stream data processing solutions in the Oracle Cloud Infrastructure.
+ [Oracle WebLogic Scripting Tool (WLST)](https://docs.oracle.com/middleware/12213/cross/wlsttasks.htm) provides a command line interface that helps you horizontally scale out your WebLogic clusters.

## Epics
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-epics"></a>

### Assess the source environment
<a name="assess-the-source-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Gather software inventory information. | Identify versions and patch levels for each of your source technology stack’s software components, including the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 
| Gather compute and storage inventory information. | In your source environment, review current and historical utilization metrics for the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html)Make sure that you account for historical spikes in usage. | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator, System Administrator | 
| Gather information about the source environment’s architecture and its requirements. | Obtain a full understanding of your source environment’s architecture and its requirements, including knowledge of the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 
| Identify Java Database Connectivity (JDBC) data sources. | Gather information about your source environment’s JDBC data sources and drivers for each database engine that it uses. | Migration Architect, Application Owner, Oracle BI Administrator, Database Engineer or Administrator | 
| Gather information about environment-specific settings. | Collect information about settings and configurations that are specific to your source environment, including the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 
| Identify any dependencies on other applications. | Collect information about integrations in your source environment that create dependencies with other applications.Make sure that you identify any Lightweight Directory Access Protocol (LDAP) integrations and other networking requirements. | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 

### Design your target environment
<a name="design-your-target-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a high-level design document. | Create a target architecture design document. Make sure that you use the information that you collected when assessing your source environment to inform the design document. | Solutions Architect, Application Architect, Database Engineer, Migration Architect | 
| Obtain approval for the design document. | Review the design document with stakeholders and obtain the required approvals. | Application or Service Owner, Solutions Architect, Application Architect | 

### Deploy the infrastructure
<a name="deploy-the-infrastructure"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Prepare the infrastructure code in CloudFormation. | Create CloudFormation templates to provision your Oracle BI 12c infrastructure in the AWS Cloud.For more information, see [Working with AWS CloudFormation templates](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-guide.html) in the *AWS CloudFormation User Guide*.It’s a best practice to create modular CloudFormation templates for each Oracle BI 12c tier, rather than one large template for all of your resources. For more information about CloudFormation best practices, see [8 best practices when automating your deployments with AWS CloudFormation](https://aws.amazon.com/blogs/infrastructure-and-automation/best-practices-automating-deployments-with-aws-cloudformation/) on the AWS Blog. | Cloud Infrastructure Architect, Solutions Architect, Application Architect | 
| Download the required software. | Download the following software along with the required versions and patches from the [Oracle website](https://www.oracle.com/):[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Database Engineer, Application Architect | 
| Prepare the installation scripts. | Create software installation scripts that run a silent install. These scripts simplify the deployment automation.For more information, see [OBIEE 12c: How to Perform Silent Installation?](https://support.oracle.com/knowledge/Enterprise%20Performance%20Management%20and%20Business%20Intelligence/2267490_1.html) on the Oracle Support site. You need an Oracle Support account to view the documentation. | Migration Architect, Database Engineer, Application Architect | 
| Create an Amazon EBS-backed Linux AMI for your web and application tiers. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Database Engineer, Application Architect | 
| Launch your AWS infrastructure by using CloudFormation. | Deploy your Oracle BI 12c web and application tiers in modules by using the CloudFormation templates that you created.For instructions, see [Getting started with AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/GettingStarted.html) in the *AWS CloudFormation User Guide*. | Cloud Infrastructure Architect, Solutions Architect, Application Architect | 
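
For teams that script the launch step, the deployment can also be driven through the AWS SDK. The following is a minimal sketch that creates one tier's stack with boto3; the stack name, template file, parameter key, and tag values are illustrative placeholders rather than part of this pattern.

```python
# Sketch: launch one modular Oracle BI 12c tier as a CloudFormation stack.
# Stack name, template path, and parameter values below are placeholders --
# substitute the templates that you authored for your environment.

def build_stack_request(stack_name, template_body, key_pair):
    """Assemble the create_stack parameters for one BI tier."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": "KeyPairName", "ParameterValue": key_pair},
        ],
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
        "Tags": [{"Key": "app", "Value": "oracle-bi-12c"}],
    }

def deploy(stack_name, template_file, key_pair):
    """Create the stack and block until it finishes (needs AWS credentials)."""
    import boto3
    cfn = boto3.client("cloudformation")
    with open(template_file) as f:
        request = build_stack_request(stack_name, f.read(), key_pair)
    cfn.create_stack(**request)
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
```

Calling `deploy(...)` requires AWS credentials with permissions for CloudFormation and the services that the template provisions.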

### Migrate Oracle BI 12c to AWS by using a fresh installation
<a name="migrate-oracle-bi-12c-to-aws-by-using-a-fresh-installation"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Stage the required software. | Stage the required software in a location that is accessible to the Amazon EC2 instances. For example, you could stage the software in Amazon S3 or another Amazon EC2 instance that would be accessible to your web and application servers. | Migration Architect, Oracle BI Architect, Cloud Infrastructure Architect, Solutions Architect, Application Architect | 
| Prepare your repository database for Oracle BI 12c installation. | Create Oracle BI 12c schemas by running the [Oracle Repository Creation Utility (RCU)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Oracle.Resources.RCU.html#Oracle.Resources.RCU.Installing) against a new [Amazon RDS for Oracle](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Oracle.html) database instance. | Cloud Infrastructure Architect, Solutions Architect, Application Architect, Migration Architect, Oracle BI Architect | 
| Install Oracle Fusion Middleware 12c and Oracle BI 12c. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Oracle BI Architect | 
| Configure your Oracle WebLogic Server domain for Oracle BI 12c. | Configure your Oracle BI 12c domain as a non-clustered deployment. For more information, see [Configuring the BI Domain](https://docs.oracle.com/middleware/bi12214/lcm/BIEDG/GUID-FD80C4C4-A2E9-4190-8B6A-7AD659D8FBE2.htm#BIEDG-GUID-D4FC2F0A-88B0-453C-BF14-17D365018252) in the *Oracle Fusion Middleware Enterprise Deployment Guide for Oracle Business Intelligence*. | Migration Architect, Oracle BI Architect | 
| Perform a horizontal scale-out of Oracle BI 12c. | Horizontally scale out the single node to the desired number of nodes. For more information, see [Scaling out Oracle Business Intelligence](https://docs.oracle.com/middleware/bi12214/lcm/BIEDG/GUID-B7BEA45C-A8C0-4591-B748-FC35C587DAE0.htm#BIEDG-GUID-B7BEA45C-A8C0-4591-B748-FC35C587DAE0) in the *Oracle Fusion Middleware Enterprise Deployment Guide for Oracle Business Intelligence*. | Migration Architect, Oracle BI Architect | 
| Install Oracle HTTP Server 12c. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Oracle BI Architect | 
| Configure load balancers for SSL termination. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Cloud Infrastructure Architect, Migration Architect | 
| Migrate business intelligence metadata artifacts to AWS. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Oracle BI Architect | 
| Perform post-migration tasks. | After importing the BAR files, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Migration Architect, Oracle BI Architect | 

### Test the new environment
<a name="test-the-new-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Test the new Oracle BI 12c environment. | Conduct end-to-end testing on the new Oracle BI 12c environment. Use automation as much as possible.Examples of testing activities include the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html)Conduct additional testing and validation as required. | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 

### Cut over to the new environment
<a name="cut-over-to-the-new-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Disconnect traffic to the on-premises Oracle BI 12c environment. | At the appointed cutover window, stop all traffic to the on-premises Oracle BI 12c environment. | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 
| Resynchronize the new Oracle BI 12c repository database with the source database. | Resynchronize the Amazon RDS for Oracle BI 12c repository database with the on-premises database.To synchronize the databases, you can use either an [Oracle Data Pump refresh](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Oracle.Procedural.Importing.DataPump.html) or [AWS DMS change data capture (CDC)](https://aws.amazon.com/blogs/database/migrating-oracle-databases-with-near-zero-downtime-using-aws-dms/). | Oracle BI Administrator, Database Engineer/Administrator | 
| Switch your Oracle BI 12c URLs to point to the new AWS environment. | Update the Oracle BI 12c URLs on your internal DNS servers so that they point to the new AWS installation. | Migration Architect, Solutions Architect, Application Owner, Oracle BI Administrator | 
| Monitor the new environment. | Monitor the new Oracle BI 12c environment by using any of the following tools:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers.html) | Oracle BI Administrator, Database Engineer/Administrator, Application Administrator | 
| Get sign-off on the project. | Review the testing results with stakeholders and obtain the required approvals to wrap up the migration. | Application Owner, Service Owner, Cloud Infrastructure Architect, Migration Architect, Oracle BI Architect | 

## Related resources
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-resources"></a>
+ [Using the Oracle Repository Creation Utility on RDS for Oracle](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Oracle.Resources.RCU.html) (*Amazon RDS User Guide*)
+ [Oracle on Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Oracle.html) (*Amazon RDS User Guide*)
+ [Oracle WebLogic Server 12c on AWS](https://d1.awsstatic.com/whitepapers/Oracle-WebLogic-12c-on-AWS.pdf) (AWS whitepaper)
+ [Deploying Oracle Business Intelligence for high availability](https://docs.oracle.com/middleware/1221/biee/BIESG/highavail.htm#BIESG1584) (Oracle Help Center)
+ [Oracle Business Intelligence Application Archive (BAR) Files](https://docs.oracle.com/middleware/bi12214/biee/BIESG/GUID-7FCD90A3-E005-49BF-902F-30FBF9B41B07.htm#BIESG2889) (Oracle Help Center)
+ [How to migrate OBI 12c between environments](https://support.oracle.com/knowledge/Enterprise%20Performance%20Management%20and%20Business%20Intelligence/2203360_1.html) (Oracle Support) 

## Additional information
<a name="migrate-oracle-business-intelligence-12c-to-the-aws-cloud-from-on-premises-servers-additional"></a>

The following is a list of best practices related to migrating Oracle BI 12c to the AWS Cloud.

**Repository databases**

It’s a best practice to host Oracle BI 12c database schemas on an Amazon RDS for Oracle instance. This instance type provides cost-efficient and resizable capacity while automating administration tasks, such as hardware provisioning, database setup, patching, and backups.

For more information, see [Using the Oracle Repository Creation Utility on RDS for Oracle](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Oracle.Resources.RCU.html) in the *Amazon RDS User Guide*.
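
As an illustration of this best practice, the following sketch assembles `create_db_instance` parameters for a Multi-AZ RDS for Oracle instance with boto3. The identifier, instance class, storage figures, and engine edition are assumptions to adjust against your sizing assessment, not values prescribed by this pattern.

```python
# Sketch: provision a Multi-AZ RDS for Oracle instance to host the
# Oracle BI 12c repository schemas that RCU creates. Sizes, the instance
# class, and the engine edition below are illustrative placeholders.

def build_repository_db_params(instance_id, master_user, master_password):
    """Assemble create_db_instance parameters for the repository database."""
    return {
        "DBInstanceIdentifier": instance_id,
        "Engine": "oracle-ee",
        "DBInstanceClass": "db.m5.xlarge",
        "AllocatedStorage": 200,
        "StorageType": "gp3",
        "MultiAZ": True,           # synchronous standby in a second AZ
        "StorageEncrypted": True,  # encryption at rest via AWS KMS
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
    }

def create_repository_db(params):
    """Provision the instance (needs AWS credentials and RDS permissions)."""
    import boto3
    rds = boto3.client("rds")
    return rds.create_db_instance(**params)
```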

**Web and application tiers**

[Memory optimized Amazon EC2 instances](https://aws.amazon.com/ec2/instance-types/) are often well suited for Oracle BI 12c servers. Whatever instance type you choose, make sure that the instances that you provision meet your system’s memory usage requirements. Also, make sure that you [configure a sufficient WebLogic Java Virtual Machine (JVM) heap size](https://docs.oracle.com/cd/E49933_01/server.770/es_install/src/tins_postinstall_jvm_heap.html#:~:text=The%20default%20JVM%20heap%20size%20for%20WebLogic%20is%203GB.,file%20for%20Linux%20or%20setDomainEnv.) based on your Amazon EC2 instance’s available memory.

**Local storage**

I/O plays an important part in the overall performance of your Oracle BI 12c application. Amazon Elastic Block Store (Amazon EBS) offers different storage classes that are optimized for different workload patterns. Make sure that you choose an Amazon EBS volume type that fits your use case.

For more information about EBS volume types, see [Amazon EBS features](https://aws.amazon.com/ebs/features/) in the Amazon EBS documentation.

**Shared storage**

A clustered Oracle BI 12c domain requires shared storage for the following resources:
+ Configuration files
+ Oracle BI 12c singleton data directory (SDD)
+ Oracle global cache
+ Oracle BI Scheduler scripts
+ Oracle WebLogic Server binaries

You can meet this shared storage requirement by using [Amazon EFS](https://aws.amazon.com/efs/), which provides a scalable, fully managed elastic Network File System (NFS) file system.

*Fine-tuning shared storage performance*

Amazon EFS has two [throughput modes](https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes): **Provisioned** and **Bursting**. The service also has two [performance modes](https://docs.aws.amazon.com/efs/latest/ug/performance.html#performancemodes): **General Purpose** and **Max I/O**.

To fine-tune performance, start by testing your workloads in the **General Purpose** performance mode and **Provisioned** throughput mode. These tests will help you determine whether those baseline modes are sufficient to meet your desired service levels.

For more information, see [Amazon EFS performance](https://docs.aws.amazon.com/efs/latest/ug/performance.html) in the *Amazon EFS User Guide*.
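
The baseline configuration described above can be created through the EFS API. The following boto3 sketch sets the two modes explicitly; the creation token, throughput figure, and tag value are placeholders to tune per workload.

```python
# Sketch: create an Amazon EFS file system in General Purpose performance
# mode with Provisioned throughput as a starting point for testing.
# The throughput figure and tag value below are placeholders.

def build_efs_params(token, provisioned_mibps=128):
    """Assemble create_file_system parameters for the shared BI storage."""
    return {
        "CreationToken": token,  # idempotency token for the create call
        "PerformanceMode": "generalPurpose",
        "ThroughputMode": "provisioned",
        "ProvisionedThroughputInMibps": provisioned_mibps,
        "Encrypted": True,
        "Tags": [{"Key": "Name", "Value": "obi-shared-storage"}],
    }

def create_file_system(params):
    """Create the file system (needs AWS credentials and EFS permissions)."""
    import boto3
    efs = boto3.client("efs")
    return efs.create_file_system(**params)
```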

**Availability and disaster recovery**

It’s a best practice to deploy Oracle BI 12c components across multiple Availability Zones to protect those resources in the event of an Availability Zone failure. The following is a list of availability and disaster recovery best practices for specific Oracle BI 12c resources hosted in the AWS Cloud:
+ **Oracle BI 12c repository databases**: Deploy a Multi-AZ Amazon RDS database instance for your Oracle BI 12c repository database. In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. Running an Oracle BI 12c repository database instance across Availability Zones can enhance availability during planned system maintenance and help protect your databases against instance and Availability Zone failures.
+ **Oracle BI 12c Managed Servers**: To achieve fault tolerance, it’s a best practice to deploy Oracle BI 12c system components on Managed Servers in an Amazon EC2 Auto Scaling Group configured to span multiple Availability Zones. Auto Scaling replaces faulty instances based on [Amazon EC2 health checks](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-health-checks.html). In the event of an Availability Zone failure, Oracle HTTP Servers continue to direct traffic to Managed Servers in the functioning Availability Zone. Then, Auto Scaling launches instances to keep up with your host count requirements. Activating HTTP session state replication is recommended to help make sure that there’s a smooth failover of the existing sessions to the functioning Managed Servers.
+ **Oracle BI 12c Administration Servers:** To make sure that your Administration Server has high availability, host it in an Amazon EC2 Auto Scaling group configured to span multiple Availability Zones. Then, set the minimum and maximum size of the group to **1**. If an Availability Zone failure occurs, Amazon EC2 Auto Scaling starts up a replacement Administration Server in an alternate Availability Zone. To recover any failed underlying hosts within the same Availability Zone, you can activate [Amazon EC2 Auto Recovery](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html).
+ **Oracle Web Tier servers:** It’s a best practice to associate your Oracle HTTP Server with your Oracle WebLogic Server domain. For high availability, deploy your Oracle HTTP Server in an Amazon EC2 Auto Scaling group configured to span multiple Availability Zones. Then, place the server behind an Elastic Load Balancing load balancer. To provide additional protection against host failure, you can activate Amazon EC2 Auto Recovery.

**Scalability**

The elasticity of the AWS Cloud helps you scale applications either horizontally or vertically in response to workload requirements.

*Vertical scaling*

To vertically scale your application, you can change the size and type of the Amazon EC2 instances that are running your Oracle BI 12c components. You don’t need to over-provision instances at the start of your deployment and incur unnecessary cost. 
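
A resize of this kind can be scripted with boto3, as in the sketch below. The instance must be stopped before its type can change; the target type is an illustrative memory optimized choice, and the family check is a rough helper introduced here, not part of any AWS API.

```python
# Sketch: vertically scale one Oracle BI component by changing its EC2
# instance type in place. The default target type is a placeholder.

def is_memory_optimized(instance_type):
    """Rough check that a type belongs to a memory optimized family (r, x, z)."""
    family = instance_type.split(".")[0]
    return family.startswith(("r", "x", "z"))

def resize_instance(instance_id, instance_type="r5.2xlarge"):
    """Stop the instance, change its type, and restart it (needs AWS credentials)."""
    import boto3
    ec2 = boto3.client("ec2")
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": instance_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
```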

*Horizontal scaling*

Amazon EC2 Auto Scaling helps you horizontally scale your application by automatically adding or removing Managed Servers based on workload requirements.

**Note**  
Horizontal scaling with Amazon EC2 Auto Scaling requires scripting skills and thorough testing to implement.
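
As one small example of that scripting, the sketch below adjusts the desired capacity of the Managed Server Auto Scaling group with boto3. The group name and the min/max bounds are placeholders; `clamp_capacity` is a helper introduced here to keep the request inside the group's configured limits.

```python
# Sketch: scale the Managed Server fleet by setting the Auto Scaling
# group's desired capacity. Group name and bounds are placeholders.

def clamp_capacity(desired, minimum, maximum):
    """Keep the requested capacity inside the group's configured bounds."""
    return max(minimum, min(desired, maximum))

def scale_managed_servers(group_name, desired, minimum=2, maximum=8):
    """Set the group's desired capacity (needs AWS credentials)."""
    import boto3
    asg = boto3.client("autoscaling")
    asg.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=clamp_capacity(desired, minimum, maximum),
        HonorCooldown=True,  # respect the group's scaling cooldown
    )
```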

**Backup and recovery**

The following is a list of backup and recovery best practices for specific Oracle BI 12c resources hosted in the AWS Cloud:
+ **Oracle Business Intelligence metadata repositories:** Amazon RDS automatically creates and saves backups of your database instances. These backups are retained for a period of time that you specify. Make sure that you configure your Amazon RDS backup duration and retention settings based on your data protection requirements. For more information, see [Amazon RDS backup and restore](https://aws.amazon.com/rds/features/backup/).
+ **Managed Servers, Administration Servers, and Web Tier servers:** Make sure that you configure [Amazon EBS snapshots](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html) based on your data protection and retention requirements.
+ **Shared storage:** You can manage backup and recovery for files stored in Amazon EFS by using [AWS Backup](https://docs.aws.amazon.com/efs/latest/ug/awsbackup.html). The AWS Backup service can also be deployed to centrally manage backup and recovery of other services, including Amazon EC2, Amazon EBS, and Amazon RDS. For more information, see [What is AWS Backup?](https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html) in the *AWS Backup Developer Guide*.
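
The server snapshot practice above can be sketched with boto3 as follows. The snapshotting here is one-off and untagged; the description format is an assumption, and retention is assumed to be handled separately (for example, by AWS Backup plans or Amazon Data Lifecycle Manager).

```python
# Sketch: snapshot every EBS volume attached to one BI server instance.
# The instance ID and description format below are placeholders.

def snapshot_description(instance_id, volume_id):
    """Build a human-readable description for one snapshot."""
    return f"obi-backup {instance_id} {volume_id}"

def snapshot_instance_volumes(instance_id):
    """Create a snapshot of each attached volume (needs AWS credentials)."""
    import boto3
    ec2 = boto3.client("ec2")
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]
    for vol in volumes:
        ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=snapshot_description(instance_id, vol["VolumeId"]),
        )
```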

**Security and compliance**

The following is a list of security best practices and AWS services that can help you protect your Oracle BI 12c applications in the AWS Cloud:
+ **Encryption at rest:** Amazon RDS, Amazon EFS, and Amazon EBS all support industry standard encryption algorithms. You can use [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) to create and manage cryptographic keys and control their use across AWS services and in your applications. You can also configure [Oracle Transparent Data Encryption (TDE)](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.Oracle.Options.AdvSecurity.html) on the Amazon RDS for Oracle database instance that’s hosting your Oracle BI 12c repository database.
+ **Encryption in transit:** It’s a best practice to activate either SSL or TLS protocols to protect data in transit between the various layers of your Oracle BI 12c installation. You can use [AWS Certificate Manager (ACM)](https://aws.amazon.com/certificate-manager/) to provision, manage, and deploy public and private SSL and TLS certificates for your Oracle BI 12c resources.
+ **Network security**: Make sure that you deploy your Oracle BI 12c resources in an Amazon VPC that has the appropriate access controls configured for your use case. Configure your security groups to filter inbound and outbound traffic from the Amazon EC2 instances that are running your installation. Also, make sure that you configure [Network Access Control Lists (NACLs)](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html) that allow or deny traffic based on defined rules.
+ **Monitoring and logging**: You can use [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) to track API calls to your AWS infrastructure, including your Oracle BI 12c resources. This functionality is useful when tracking changes to infrastructure or when conducting a security analysis. You can also use [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) to view operational data that can provide you with actionable insight into the performance and health of your Oracle BI 12c application. You can configure alarms and take automated actions based on those alarms, too. Amazon RDS provides additional monitoring tools, including [Enhanced Monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html) and [Performance Insights](https://aws.amazon.com/rds/performance-insights/).
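
To make the CloudWatch point concrete, the following sketch defines a CPU utilization alarm for one BI server instance. The alarm name, threshold, evaluation windows, and SNS topic are placeholder assumptions to adapt to your monitoring standards.

```python
# Sketch: a CloudWatch alarm on CPU utilization for one BI server
# instance. Thresholds, periods, and the topic ARN are placeholders.

def build_cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Assemble put_metric_alarm parameters for a high-CPU alarm."""
    return {
        "AlarmName": f"obi-cpu-high-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # 5-minute datapoints
        "EvaluationPeriods": 3,   # alarm after 15 minutes above threshold
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }

def put_alarm(alarm):
    """Create or update the alarm (needs AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```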

# Migrate an ELK Stack to Elastic Cloud on AWS
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws"></a>

*Battulga Purevragchaa and Antony Prasad Thevaraj, Amazon Web Services*

*uday reddy, None*

## Summary
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-summary"></a>

[Elastic](https://www.elastic.co/) has provided services for many years, with their users and customers typically managing Elastic themselves on premises. [Elastic Cloud](https://www.elastic.co/cloud), the managed [Elasticsearch service](https://www.elastic.co/elasticsearch/service), provides a way to consume the Elastic Stack (ELK Stack) and solutions for [enterprise search](https://www.elastic.co/enterprise-search), [observability](https://www.elastic.co/observability), and [security](https://www.elastic.co/security). You can access Elastic solutions with apps such as Logs, Metrics, APM (application performance monitoring), and SIEM (security information and event management). You can also use integrated features such as machine learning, index lifecycle management, and Kibana Lens (for drag-and-drop visualizations).

When you move from self-managed Elasticsearch to Elastic Cloud, the Elasticsearch service takes care of the following:
+ Provisioning and managing the underlying infrastructure
+ Creating and managing Elasticsearch clusters
+ Scaling clusters up and down
+ Upgrades, patching, and taking snapshots

This gives you more time to focus on solving other challenges.

This pattern defines how to migrate on-premises Elasticsearch 7.13 to Elasticsearch on Elastic Cloud on Amazon Web Services (AWS). Other versions might require slight modifications to the processes described in this pattern. For more information, contact your Elastic representative.

## Prerequisites and limitations
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-prereqs"></a>

**Prerequisites**
+ An active [AWS account](https://aws.amazon.com/account/) with access to [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) for snapshots
+ A secure, sufficiently high-bandwidth [private link](https://docs.aws.amazon.com/vpc/latest/userguide/vpn-connections.html) for copying snapshot data files to Amazon S3
+ [Amazon S3 Transfer Acceleration](https://aws.amazon.com/s3/transfer-acceleration/)
+ [Elastic Snapshot policies](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/getting-started-snapshot-lifecycle-management.html) to ensure that data ingestion is archived regularly, either to a sufficiently large local data store or to remote storage (Amazon S3)

Before initiating your migration, you must understand the size of your on-premises snapshots and the [lifecycle policies](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html) for the accompanying indexes. For more information, [contact Elastic](https://www.elastic.co/contact).
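
Snapshot-based migration starts by pointing the source cluster at an S3 repository. The following stdlib-only sketch registers one through the Elasticsearch `_snapshot` REST API (Elasticsearch 7.x with the `repository-s3` plugin installed); the endpoint URL, repository name, bucket, and base path are placeholders.

```python
# Sketch: register an S3 snapshot repository on the source Elasticsearch
# cluster. Endpoint, repository name, and bucket names are placeholders.

import json
from urllib import request

def build_repository_body(bucket, base_path="elk-migration"):
    """Build the JSON body for PUT _snapshot/<repo> with an S3 repository."""
    return {
        "type": "s3",
        "settings": {"bucket": bucket, "base_path": base_path},
    }

def register_repository(es_url, repo_name, bucket):
    """PUT _snapshot/<repo_name> against a reachable cluster."""
    body = json.dumps(build_repository_body(bucket)).encode()
    req = request.Request(
        f"{es_url}/_snapshot/{repo_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)  # raises on HTTP errors
```

A production setup would also pass authentication headers and the S3 client settings that your plugin configuration requires.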

**Roles and skills**

The migration process also requires the roles and expertise described in the following table.


| Role | Expertise | Responsibilities | 
| --- | --- | --- |
| App support | Familiarity with Elastic Cloud and Elastic on premises | All Elastic related tasks | 
| Systems administrator or DBA | In-depth knowledge of the on-premises Elastic environment and its configuration | The ability to provision storage, install and use the AWS Command Line Interface (AWS CLI), and identify all data sources feeding Elastic on premises | 
| Network administrator | Knowledge of on-premises to AWS network connectivity, security, and performance | Establishment of network links from on premises to Amazon S3, with an understanding of connectivity bandwidth | 

**Limitations**
+ Elasticsearch on Elastic Cloud is available only in [supported AWS Regions (September 2021)](https://www.elastic.co/guide/en/cloud/current/ec-regions-templates-instances.html#ec-aws_regions).

**Product versions**
+ Elasticsearch 7.13

## Architecture
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-architecture"></a>

**Source technology stack**

On-premises Elasticsearch 7.13 or later:
+ Cluster snapshots
+ Index snapshots
+ [Beats](https://www.elastic.co/beats/) configuration

**Source technology architecture**

The following diagram shows a typical on-premises architecture with different ingestion methods, node types, and Kibana. The different node types reflect the Elasticsearch cluster, authentication, and visualization roles.

![\[Eight-step process including Beats, Logstash, Elasticsearch, and Kibana.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/23d1b437-22ff-420e-80ac-834f1116d234/images/937c4d22-429f-4673-86df-ae491d68389c.png)


1. Ingestion from Beats to Logstash

1. Ingestion from Beats to Apache Kafka messaging queue

1. Ingestion from Filebeat to Logstash

1. Ingestion from Apache Kafka messaging queue to Logstash

1. Ingestion from Logstash to an Elasticsearch cluster

1. Elasticsearch cluster

1. Authentication and notification node

1. Kibana and blob nodes

**Target technology stack**

Elastic Cloud is deployed to your software as a service (SaaS) account in multiple AWS Regions with cross-cluster replication.
+ Cluster snapshots
+ Index snapshots
+ Beats configurations
+ Elastic Cloud
+ Network Load Balancer
+ Amazon Route 53
+ Amazon S3

**Target architecture**

![\[Route 53 endpoints route traffic to Multi-AZ environments in two different Regions.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/23d1b437-22ff-420e-80ac-834f1116d234/images/16cbac49-0adb-4469-b546-ae4b1ca35357.png)


 

The managed Elastic Cloud infrastructure is:
+ Highly available, because it runs in multiple [Availability Zones](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/) and multiple AWS Regions
+ Region failure tolerant, because data (indexes and snapshots) is replicated by using Elastic Cloud [cross-cluster replication (CCR)](https://www.elastic.co/guide/en/elasticsearch/reference/7.14/xpack-ccr.html)
+ Archival, because snapshots are archived in [Amazon S3](https://aws.amazon.com/s3/?p=pm&c=s3&z=4)
+ Network partition tolerant, through a combination of [Network Load Balancers](https://aws.amazon.com/elasticloadbalancing/network-load-balancer/) and [Route 53](https://aws.amazon.com/route53/)
+ Able to ingest data originating from (but not limited to) [Elastic APM](https://www.elastic.co/apm/), [Beats](https://www.elastic.co/beats/), and [Logstash](https://www.elastic.co/guide/en/logstash/current/index.html)

**High-level migration steps**

Elastic has developed its own prescriptive methodology for migrating on-premises Elastic clusters to Elastic Cloud. The Elastic methodology is directly aligned with, and complementary to, AWS migration guidance and best practices, including the [Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&wa-lens-whitepapers.sort-order=desc) and the [AWS Migration Acceleration Program](https://aws.amazon.com/migration-acceleration-program/) (MAP). Typically, the three AWS migration phases are the following:
+ Assess
+ Mobilize
+ Migrate and modernize

Elastic follows similar migration phases with complementary terminology:
+ Initiate
+ Plan
+ Implement
+ Deliver
+ Close

Elastic uses the Elastic Implementation Methodology to facilitate the delivery of project outcomes. The methodology is inclusive by design, ensuring that Elastic, consulting, and customer teams work together with clarity to jointly deliver the intended outcomes.

The Elastic methodology combines traditional waterfall phasing with Scrum within the implementation phase. Configurations of technical requirements are delivered iteratively in a collaborative manner while minimizing risk.

![\[Diagram showing the five stages of the Elastic Implementation Methodology.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/23d1b437-22ff-420e-80ac-834f1116d234/images/b041c61d-980e-49a0-a721-791c20edde64.png)


 

## Tools
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-tools"></a>

**AWS services**
+ [Amazon Route 53](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html) – Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service. You can use Route 53 to perform three main functions in any combination: domain registration, DNS routing, and health checking.
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) – Amazon Simple Storage Service (Amazon S3) is an object storage service. You can use Amazon S3 to store and retrieve any amount of data at any time, from anywhere on the web. This pattern uses an S3 bucket and [Amazon S3 Transfer Acceleration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration-examples.html).
+ [Elastic Load Balancing](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html) – Elastic Load Balancing automatically distributes your incoming traffic across multiple targets, such as EC2 instances, containers, and IP addresses, in one or more Availability Zones.

**Other tools**
+ [Beats](https://www.elastic.co/beats/) – Beats ships data to Logstash or Elasticsearch.
+ [Elastic Cloud](https://www.elastic.co/cloud/) – Elastic Cloud is a managed service for hosting Elasticsearch.
+ [Elasticsearch](https://www.elastic.co/elasticsearch/) – Elasticsearch is a search and analytics engine that uses the Elastic Stack to centrally store your data for search and analytics that scale. This pattern also uses snapshot creation and cross-cluster replication.
+ [Logstash](https://www.elastic.co/logstash/) – Logstash is a server-side data processing pipeline that ingests data from multiple sources, transforms it, and then sends it to your data storage.

## Epics
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-epics"></a>

### Prepare the migration
<a name="prepare-the-migration"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Identify servers running the on-premises Elastic solution. | Confirm that Elastic migration is supported. | App owner | 
| Understand the on-premises server configuration. | To understand the server configuration needed to drive workloads successfully on premises, find the server hardware footprint, network configuration, and storage characteristics that are currently in use. | App support | 
| Gather user and app account information. | Identify the user names and app names that are used by the on-premises Elastic environment. | Systems administrator, App support | 
| Document Beats and data shipper configuration. | To document the configurations, look at existing data sources and sinks. For more information, see the [Elastic documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html). | App support | 
| Determine the velocity and volume of data. | Establish a baseline for how much data the cluster is handling. | Systems administrator, App support | 
| Document RPO and RTO scenarios. | Document recovery point objective (RPO) and recovery time objective (RTO) scenarios in terms of outages and service level agreements (SLAs). | App owner, Systems administrator, App support | 
| Determine the optimal snapshot lifecycle settings. | Define how often data needs to be secured by using Elastic snapshots *during and after* the migration. | App owner, Systems administrator, App support | 
| Define post-migration performance expectations. | Generate metrics on current and expected screen refresh, query runtimes, and user interface behaviors. | Systems administrator, App support | 
| Document internet access transport, bandwidth, and availability requirements. | Ascertain speed, latency, and resiliency of internet connections for copying snapshots to Amazon S3. | Network administrator | 
| Document current costs of on-premises runtime for Elastic. | Ensure that the sizing of the AWS targeted environment is designed to be both high performing and cost effective. | DBA, Systems administrator, App support | 
| Identify the authentication and authorization needs. | The Elastic Stack security features provide built-in realms such as Lightweight Directory Access Protocol (LDAP), Security Assertion Markup Language (SAML), and OpenID Connect (OIDC). | DBA, Systems administrator, App support | 
| Understand the specific regulatory requirements based on the geographic location. | Ensure that data is exported and encrypted according to your requirements and to any relevant national requirements. | DBA, Systems administrator, App support | 

### Implement the migration
<a name="implement-the-migration"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Prepare the staging area on Amazon S3.  | To receive snapshots on Amazon S3, [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) and a temporary AWS Identity and Access Management (IAM) role with full access to your newly created bucket. For more information, see [Creating a role to delegate permissions to an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html). Use the AWS Security Token Service to [request temporary security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html). Keep the access key ID, secret access key, and session token secured. Enable [Amazon S3 Transfer Acceleration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration-examples.html) on the bucket. | AWS administrator | 
| Install the AWS CLI and the Amazon S3 plugin on premises. | On each Elasticsearch node, run the following command.<pre>sudo bin/elasticsearch-plugin install repository-s3</pre>Then restart the node. | AWS administrator | 
| Configure Amazon S3 client access. | Add the keys created previously by running the following commands.<pre>elasticsearch-keystore add s3.client.default.access_key</pre><pre>elasticsearch-keystore add s3.client.default.secret_key</pre><pre>elasticsearch-keystore add s3.client.default.session_token</pre> | AWS administrator | 
| Register a snapshot repository for Elastic data. | Use the [Kibana Dev Tools](https://www.elastic.co/guide/en/kibana/current/console-kibana.html) to tell the on-premises cluster which remote S3 bucket to write to. | AWS administrator | 
| Configure the snapshot policy. | To configure snapshot lifecycle management, on the Kibana **Policies** tab, choose **SLM policy**, and define which times, data streams, or indexes should be included, and what names to use. Configure a policy that takes frequent snapshots. Snapshots are incremental and make efficient use of storage. Match your readiness assessment decision. A policy can also specify a [retention policy](https://www.elastic.co/guide/en/elasticsearch/reference/current/slm-retention.html) and automatically delete snapshots when they are no longer needed. | App support | 
| Verify that snapshots work. | In Kibana Dev Tools, run the following command.<pre>GET _snapshot/<your_repo_name>/_all</pre> | AWS administrator, App support | 
| Deploy a new cluster on Elastic Cloud.  | [Log in to Elastic](https://cloud.elastic.co/login?redirectTo=%2Fhome), and choose an observability, search, or security cluster, based on your business findings in the readiness assessment. | AWS administrator, App support | 
| Set up cluster key store access. | The new cluster needs access to the S3 bucket that will store the snapshots. On the Elasticsearch Service Console, choose **Security**, and enter the access and secret IAM keys that you created earlier. | AWS administrator | 
| Configure the Elastic Cloud hosted cluster to access Amazon S3. | Set up new cluster access to the previously created snapshot repository in Amazon S3. Using Kibana, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-an-elk-stack-to-elastic-cloud-on-aws.html) | AWS administrator, App support | 
| Verify the new Amazon S3 repository. | Ensure that you can access your new repository hosted in the Elastic Cloud cluster. | AWS administrator | 
| Initialize the Elasticsearch service cluster. | On the Elasticsearch Service Console, initialize the Elasticsearch service cluster from the S3 snapshot. Run the following commands as POST requests.<pre>*/_close?expand_wildcards=all</pre><pre>/_snapshot/<your-repo-name>/<your-snapshot-name>/_restore</pre><pre>*/_open?expand_wildcards=all</pre> | App support | 
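
The repository registration (on the source cluster) and snapshot verification tasks above can be sketched as Kibana Dev Tools requests. The repository and bucket names here are placeholders, not values prescribed by this pattern:

```
PUT _snapshot/onprem-migration-repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-elastic-migration-bucket",
    "client": "default"
  }
}

GET _snapshot/onprem-migration-repo/_all
```

The `client` setting refers to the `s3.client.default.*` credentials added to the Elasticsearch keystore in the earlier task.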

### Complete the migration
<a name="complete-the-migration"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Verify that the snapshot restore was successful. | Using Kibana Dev Tools, run the following command.<pre>GET _cat/indices</pre> | App support | 
| Redeploy ingestion services. | Connect the endpoints for Beats and Logstash to the new Elasticsearch service endpoint. | App support | 

### Test the cluster environment and clean up
<a name="test-the-cluster-environment-and-clean-up"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Validate the cluster environment. | After the on-premises Elastic cluster environment is migrated to AWS, you can connect to it and use your own user acceptance testing (UAT) tools to validate the new environment. | App support | 
| Clean up the resources. | After you validate that the cluster migrated successfully, remove the S3 bucket and the IAM role used for the migration. | AWS administrator | 

## Related resources
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-resources"></a>

**Elastic references**
+ [Elastic Cloud](https://www.elastic.co/cloud/)
+ [Managed Elasticsearch and Kibana on AWS](https://www.elastic.co/elasticsearch/service)
+ [Elastic enterprise search](https://www.elastic.co/enterprise-search)
+ [Elastic integrations](https://www.elastic.co/integrations?search=amazon)
+ [Elastic observability](https://www.elastic.co/observability)
+ [Elastic security](https://www.elastic.co/security)
+ [Beats](https://www.elastic.co/beats/)
+ [Elastic APM](https://www.elastic.co/apm/)
+ [Migrate to index lifecycle management](https://www.elastic.co/guide/en/cloud-enterprise/current/ece-migrate-index-management.html)
+ [Elastic subscriptions](https://www.elastic.co/subscriptions)
+ [Contact Elastic](https://www.elastic.co/contact)

*Elastic blog posts*
+ [How to migrate from self-managed Elasticsearch to Elastic Cloud on AWS](https://www.elastic.co/blog/how-to-migrate-from-self-managed-elasticsearch-to-elastic-cloud-on-aws) (blog post)
+ [Migrating to Elastic Cloud](https://www.elastic.co/blog/migrating-to-elastic-cloud) (blog post)

*Elastic documentation*
+ [Tutorial: Automate backups with SLM](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/getting-started-snapshot-lifecycle-management.html)
+ [ILM: Manage the index lifecycle](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html)
+ [Logstash](https://www.elastic.co/guide/en/logstash/current/index.html)
+ [Cross-cluster replication (CCR)](https://www.elastic.co/guide/en/elasticsearch/reference/7.14/xpack-ccr.html)
+ [Ingest pipelines](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html)
+ [Run Elasticsearch API requests](https://www.elastic.co/guide/en/kibana/current/console-kibana.html)
+ [Snapshot retention](https://www.elastic.co/guide/en/elasticsearch/reference/current/slm-retention.html)

*Elastic video and webinar*
+ [Elastic cloud migration](https://www.youtube.com/watch?v=WbPJi-APZ_Q)
+ [Elastic Cloud: Why are customers migrating](https://www.elastic.co/webinars/elastic-cloud-why-customers-are-migrating-now) (webinar)

**AWS references**
+ [Elastic Cloud on AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=d8f59038-c24c-4a9d-a66d-6711d35d7305)
+ [AWS Command Line Interface](https://docs.aws.amazon.com/AmazonS3/latest/userguide/setup-aws-cli.html)
+ [AWS Direct Connect](https://aws.amazon.com/directconnect/)
+ [AWS Migration Acceleration Program](https://aws.amazon.com/migration-acceleration-program/)
+ [Network Load Balancers](https://aws.amazon.com/elasticloadbalancing/network-load-balancer/)
+ [Regions and Availability Zones](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/)
+ [Amazon Route 53](https://aws.amazon.com/route53/)
+ [Amazon Simple Storage Service](https://aws.amazon.com/s3/)
+ [Amazon S3 Transfer Acceleration](https://aws.amazon.com/s3/transfer-acceleration/)
+ [VPN connections](https://docs.aws.amazon.com/vpc/latest/userguide/vpn-connections.html)
+ [Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&wa-lens-whitepapers.sort-order=desc)

## Additional information
<a name="migrate-an-elk-stack-to-elastic-cloud-on-aws-additional"></a>

If you're planning to migrate complex workloads, engage [Elastic Consulting Services](https://www.elastic.co/consulting/engage). If you have basic questions related to configurations and services, contact the [Elastic Support](mailto:support@elastic.co) team.

# Migrate data to the AWS Cloud by using Starburst
<a name="migrate-data-to-the-aws-cloud-by-using-starburst"></a>

*Antony Prasad Thevaraj and Suresh Veeragoni, Amazon Web Services*

*Shaun Van Staden*

## Summary
<a name="migrate-data-to-the-aws-cloud-by-using-starburst-summary"></a>

Starburst helps accelerate your data migration journey to Amazon Web Services (AWS) by providing an enterprise query engine that brings existing data sources together in a single access point. You can run analytics across multiple data sources to get valuable insights, before finalizing any migration plans. Without disrupting business-as-usual analytics, you can migrate the data by using the Starburst engine or a dedicated extract, transform, and load (ETL) application.

## Prerequisites and limitations
<a name="migrate-data-to-the-aws-cloud-by-using-starburst-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ A virtual private cloud (VPC)
+ An Amazon Elastic Kubernetes Service (Amazon EKS) cluster
+ An Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling group
+ A list of current system workloads that need to be migrated
+ Network connectivity from AWS to your on-premises environment

## Architecture
<a name="migrate-data-to-the-aws-cloud-by-using-starburst-architecture"></a>

**Reference architecture**

The following high-level architecture diagram shows the typical deployment of Starburst Enterprise in the AWS Cloud:

1. The Starburst Enterprise cluster runs inside your AWS account.

1. A user authenticates by using Lightweight Directory Access Protocol (LDAP) or Open Authorization (OAuth) and interacts directly with the Starburst cluster.

1. Starburst can connect to several AWS data sources, such as AWS Glue, Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and Amazon Redshift. Starburst provides federated query capabilities across data sources in the AWS Cloud, on premises, or in other cloud environments.

1. You launch Starburst Enterprise in an Amazon EKS cluster by using Helm charts.

1. Starburst Enterprise uses Amazon EC2 Auto Scaling groups and Amazon EC2 Spot Instances to optimize infrastructure.

1. Starburst Enterprise connects directly to your existing on-premises data sources to read data in real time. In addition, if you have an existing Starburst Enterprise deployment in this environment, you can connect your new Starburst cluster in the AWS Cloud directly to that existing cluster.

![\[High-level architecture diagram of Starburst Enterprise deployment in the AWS Cloud\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/12ae0463-9029-4a32-9d7b-70cdb5406fb0/images/e9975d3a-f75b-41a2-8c08-5b82539adf8c.png)


Please note the following:
+ Starburst is not a data virtualization platform. It is a SQL-based massively parallel processing (MPP) query engine that forms the basis of an overall data mesh strategy for analytics.
+ When Starburst is deployed as part of a migration, it has direct connectivity to the existing on-premises infrastructure.
+ Starburst provides several built-in enterprise and open-source connectors that facilitate connectivity to a variety of legacy systems. For a full list of connectors and their capabilities, see [Connectors](https://docs.starburst.io/latest/connector.html) in the *Starburst Enterprise user guide*.
+ Starburst can query data in real time from on-premises data sources. This prevents interruptions of regular business operations while data is being migrated.
+ If you are migrating from an existing on-premises Starburst Enterprise deployment, you can use a special connector, *Starburst Stargate*, to connect your Starburst Enterprise cluster in AWS directly to your on-premises cluster. This provides additional performance benefits when business users and data analysts are federating queries from the AWS Cloud to your on-premises environment.

**High-level process overview**

You can accelerate data migration projects by using Starburst because Starburst enables insights across all of your data, prior to migrating it. The following image shows the typical process for migrating data by using Starburst.

![\[Process flow for migrating data to the AWS Cloud by using Starburst\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/12ae0463-9029-4a32-9d7b-70cdb5406fb0/images/e79b0084-7275-4824-8854-646236ec23dc.png)


**Roles**

The following roles are typically required to complete a migration using Starburst:
+ **Cloud administrator** – Responsible for making cloud resources available to run the Starburst Enterprise application
+ **Starburst administrator** – Responsible for installing, configuring, managing, and supporting the Starburst application
+ **Data engineer** – Responsible for:
  + Migrating the legacy data to the cloud
  + Building semantic views to support analytics
+ **Solution or system owner** – Responsible for the overall solution implementation

## Tools
<a name="migrate-data-to-the-aws-cloud-by-using-starburst-tools"></a>

**AWS services**
+ [Amazon Elastic Compute Cloud (Amazon EC2)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html) provides scalable computing capacity in the AWS Cloud. You can launch as many virtual servers as you need and quickly scale them up or down.
+ [Amazon Elastic Kubernetes Service (Amazon EKS)](https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html) helps you run Kubernetes on AWS without needing to install or maintain your own Kubernetes control plane or nodes.

**Other tools**
+ [Helm](https://helm.sh/docs/) – Helm is a package manager for Kubernetes that helps you install and manage applications on your Kubernetes cluster.
+ [Starburst Enterprise](https://docs.starburst.io/latest/index.html) – Starburst Enterprise is a SQL-based massively parallel processing (MPP) query engine that forms the basis of an overall data mesh strategy for analytics.
+ [Starburst Stargate](https://docs.starburst.io/latest/connector/starburst-stargate.html) – Starburst Stargate links catalogs and data sources in one Starburst Enterprise environment, such as a cluster in an on-premises data center, to the catalogs and data sources in another Starburst Enterprise environment, such as a cluster in the AWS Cloud.

## Epics
<a name="migrate-data-to-the-aws-cloud-by-using-starburst-epics"></a>

### Assess the data
<a name="assess-the-data"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Identify and prioritize your data. | Identify the data you want to move. Large, on-premises legacy systems can include core data that you want to migrate alongside data that you don’t want to move or can’t be moved because of compliance reasons. Starting with a data inventory helps you prioritize which data you should target first. For more information, see [Get started with automated portfolio discovery](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/get-started-with-automated-portfolio-discovery.html). | Data engineer, DBA | 
| Explore, inventory, and back up your data. | Validate the quality, quantity, and relevance of the data for your use case. Back up or create a snapshot of the data as needed, and finalize the target environment for the data. | Data engineer, DBA | 

### Set up the Starburst Enterprise environment
<a name="set-up-the-starburst-enterprise-environment"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Configure Starburst Enterprise in the AWS Cloud. | While data is being catalogued, set up Starburst Enterprise in a managed Amazon EKS cluster. For more information, see [Deploying with Kubernetes](https://docs.starburst.io/latest/k8s.html) in the *Starburst Enterprise reference documentation*. This allows business-as-usual analytics while data migration is in process. | AWS administrator, App developer | 
| Connect Starburst to the data sources. | After you have identified the data and set up Starburst Enterprise, connect Starburst to the data sources. Starburst reads data directly from the data source as a SQL query. For more information, see the [Starburst Enterprise reference documentation](https://docs.starburst.io/latest/overview.html). | AWS administrator, App developer | 
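
As an illustration of the connection step, Starburst describes each data source with a catalog properties file. The following is a hedged sketch for a hypothetical on-premises PostgreSQL source; the catalog name, host, database, and credential handling are placeholders, and in a Helm-based Amazon EKS deployment these properties are typically supplied through the chart's catalog values rather than as a file on disk:

```properties
# etc/catalog/onprem_sales.properties (hypothetical catalog name)
connector.name=postgresql
connection-url=jdbc:postgresql://onprem-db.example.com:5432/sales
connection-user=starburst_reader
connection-password=${ENV:ONPREM_DB_PASSWORD}
```

Each catalog becomes queryable through Starburst SQL, so a single query can join this on-premises source with data already in AWS.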

### Migrate the data
<a name="migrate-the-data"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Build and run the ETL pipelines. | Begin the data migration process. This activity can occur at the same time as business-as-usual analytics. For the migration, you can use a third-party product or Starburst. Starburst has the capability to both read and write data across different sources. For more information, see the [Starburst Enterprise reference documentation](https://docs.starburst.io/latest/overview.html). | Data engineer | 
| Validate the data. | After the data has been migrated, validate the data to ensure all required data has been moved and is intact. | Data engineer, DevOps engineer | 

### Cut over
<a name="cut-over"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Cut over the data. | After data migration and validation is complete, you can cut over the data. This involves changing the data connection links in Starburst. Instead of pointing at the on-premises sources, you point to the new cloud sources and update the semantic views. For more information, see [Connectors](https://docs.starburst.io/latest/connector.html) in the *Starburst Enterprise reference documentation*. | Data engineer, Cutover lead | 
| Roll out to users. | Data consumers begin working off the migrated data sources. This process is invisible to the analytics end users. | Cutover lead, Data engineer | 

## Related resources
<a name="migrate-data-to-the-aws-cloud-by-using-starburst-resources"></a>

**AWS Marketplace**
+ [Starburst Galaxy](https://aws.amazon.com/marketplace/pp/prodview-yugyv4gbgn5ss?ref_=srh_res_product_title)
+ [Starburst Enterprise](https://aws.amazon.com/marketplace/pp/prodview-pwnl3c6p2jycg?ref_=srh_res_product_title)
+ [Starburst Data JumpStart](https://aws.amazon.com/marketplace/pp/prodview-ou6drb23gch2u?ref_=srh_res_product_title)
+ [Starburst Enterprise with Graviton](https://aws.amazon.com/marketplace/pp/prodview-2bpppdqlesn6w?ref_=srh_res_product_title)

**Starburst documentation**
+ [Starburst Enterprise user guide](https://docs.starburst.io/index.html)
+ [Starburst Enterprise reference documentation](https://docs.starburst.io/latest/index.html)

**Other AWS documentation**
+ [Get started with automated portfolio discovery](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/get-started-with-automated-portfolio-discovery.html) (AWS Prescriptive Guidance)
+ [Optimizing Cloud Infrastructure Cost and Performance with Starburst on AWS](https://aws.amazon.com/blogs/architecture/optimizing-cloud-infrastructure-cost-and-performance-with-starburst-on-aws/) (AWS blog post)

# Optimize the ETL ingestion of input file size on AWS
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws"></a>

*Apoorva Patrikar, Amazon Web Services*

## Summary
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-summary"></a>

This pattern shows you how to optimize the ingestion step of the extract, transform, and load (ETL) process for big data and Apache Spark workloads on AWS Glue by optimizing file size before processing your data. Use this pattern to prevent or resolve the *small files problem*. That is, the situation in which a large number of small files slows down data processing because of the per-file overhead rather than the aggregate size of the data. For example, hundreds of files that are only a few hundred kilobytes each can significantly slow down data processing speeds for your AWS Glue jobs. This is because AWS Glue must perform internal list functions on Amazon Simple Storage Service (Amazon S3), and YARN (Yet Another Resource Negotiator) must store a large amount of metadata. To improve data processing speeds, you can use grouping to enable your ETL tasks to read a group of input files into a single in-memory partition. The partition automatically groups smaller files together. Alternatively, you can use custom code to add batch logic to your existing files.

## Prerequisites and limitations
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ One or more AWS Glue [jobs](https://docs.aws.amazon.com/glue/latest/dg/author-job.html)
+ One or more big data or [Apache Spark](https://spark.apache.org/) workloads
+ An [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html)

## Architecture
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-architecture"></a>

The following diagram shows how data in different formats is processed by an AWS Glue job and then stored in an S3 bucket to get visibility into performance.

![\[Data in different formats is processed by an AWS Glue job and then stored in an S3 bucket.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/88ee332f-2f12-4d88-b491-e78bbb173850/images/9a97b54a-3f67-4a48-94d2-5807373ace25.png)


The diagram shows the following workflow:

1. An AWS Glue job converts small files in CSV, JSON, and Parquet format to dynamic frames.
**Note**  
The size of the input file has the most significant impact on the performance of the AWS Glue job.

1. The AWS Glue job performs internal list functions in an S3 bucket.

## Tools
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-tools"></a>
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) is a fully managed ETL service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

## Epics
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-epics"></a>

### Use grouping to optimize ETL ingestion during reading
<a name="use-grouping-to-optimize-etl-ingestion-during-reading"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Specify the group size. | If you have more than 50,000 files, grouping is done by default. However, you can also use grouping for fewer than 50,000 files by specifying the group size in the `connectionOptions` parameter. The `connectionOptions` parameter is in the `create_dynamic_frame.from_options` method. | Data engineer | 
| Write the grouping code. | Use the `create_dynamic_frame` method to create a dynamic frame. For example:<pre>S3bucket_node1 = glueContext.create_dynamic_frame.from_options(<br />    format_options={"multiline": False},<br />    connection_type="s3",<br />    format="json",<br />    connection_options={<br />        "paths": ["s3://bucket/prefix/file.json"],<br />        "recurse": True,<br />        "groupFiles": 'inPartition', <br />        "groupSize": 1048576<br />    },<br />    transformation_ctx="S3bucket_node1",<br />)</pre>Use `groupFiles` to group files in an Amazon S3 partition group. Use `groupSize` to set the target size of the group to be read in memory. Specify `groupSize` in bytes (1048576 = 1 MB). | Data engineer | 
| Add the code to the workflow. | Add the grouping code to your job [workflow](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html) in AWS Glue. | Data engineer | 

### Use custom logic to optimize ETL ingestion
<a name="use-custom-logic-to-optimize-etl-ingestion"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Choose the language and processing platform. | Choose the scripting language and processing platform tailored to your use case. | Cloud architect | 
| Write the code. | Write the custom logic to batch your files together. | Cloud architect | 
| Add the code to the workflow. | Add the code to your job [workflow](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html) in AWS Glue. This enables your custom logic to be applied every time the job is run. | Data engineer | 
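
The pattern leaves the batching logic itself up to you. As an illustrative sketch only (the function name and threshold are assumptions, not part of the pattern's sample code), the following Python function groups a list of S3 object keys into batches whose cumulative size stays at or under a target, which a downstream job can then process batch by batch:

```python
def batch_files_by_size(files, target_batch_bytes=128 * 1024 * 1024):
    """Group (key, size) pairs into batches whose total size stays at or
    under target_batch_bytes. A file larger than the target gets its own
    batch rather than being split."""
    batches = []
    current, current_size = [], 0
    for key, size in files:
        if current and current_size + size > target_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Example: batch three small JSON files against a 1 MB target.
files = [("a.json", 400_000), ("b.json", 500_000), ("c.json", 300_000)]
print(batch_files_by_size(files, target_batch_bytes=1_000_000))
# → [['a.json', 'b.json'], ['c.json']]
```

In practice you would feed this function from an S3 listing (for example, the keys and sizes returned by `list_objects_v2`) and hand each batch to your processing job.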

### Repartition when writing data after transformation
<a name="repartition-when-writing-data-after-transformation"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Analyze consumption patterns. | Find out how downstream applications will use the data you write. For example, if they query the data daily but you partition it only by Region, or if the output files are very small (such as 2.5 KB per file), the layout is not optimal for consumption. | DBA | 
| Repartition data before writing. | Repartition based on the joins or queries performed during processing (based on processing logic) and after processing (based on consumption). For example, repartition into a target number of partitions, such as `.repartition(100)`, or repartition based on columns, such as `.repartition("column_name")`. | Data engineer | 
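
Spark's `.repartition(n)` takes a partition *count*. A common rule of thumb for choosing that count, sketched below with illustrative names and a 128 MB default, derives it from the total data size and a target output file size:

```python
import math

def target_partition_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Estimate how many partitions to request so each output file lands
    near the target size (128 MB by default). Always returns at least 1."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 10 GB of data with a 128 MB target yields 80 partitions:
print(target_partition_count(10 * 1024**3))  # → 80
```

You would then pass the result to `.repartition(n)` before writing. This is a starting point, not a definitive rule; the right size still depends on your cluster configuration and consumption pattern.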

## Related resources
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-resources"></a>
+ [Reading input files in larger groups](https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html)
+ [Monitoring AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/monitor-glue.html)
+ [Monitoring AWS Glue using Amazon CloudWatch metrics](https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)
+ [Job monitoring and debugging](https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html)
+ [Getting started with serverless ETL on AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/welcome.html)

## Additional information
<a name="optimize-the-etl-ingestion-of-input-file-size-on-aws-additional"></a>

**Determining file size**

There is no straightforward way to determine if a file size is too big or too small. The impact of file size on processing performance depends on the configuration of your cluster. In core Hadoop, we recommend that you use files that are 128 MB or 256 MB to make the most of the block size.

For most text file workloads on AWS Glue, we recommend a file size between 100 MB and 1 GB for a 5-10 DPU cluster. To determine the best size for your input files, monitor the preprocessing section of your AWS Glue job, and then check the CPU utilization and memory utilization of the job.
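
To complement that monitoring, you can inspect the input object sizes directly. The following sketch (the thresholds come from the 100 MB-1 GB guidance above; the helper itself is illustrative, not part of the pattern) buckets a list of file sizes so you can see at a glance whether your inputs skew too small:

```python
def summarize_sizes(sizes_bytes, low=100 * 1024**2, high=1024**3):
    """Count how many input files fall below, inside, or above the
    recommended 100 MB-1 GB range for AWS Glue text workloads."""
    summary = {"too_small": 0, "in_range": 0, "too_large": 0}
    for size in sizes_bytes:
        if size < low:
            summary["too_small"] += 1
        elif size > high:
            summary["too_large"] += 1
        else:
            summary["in_range"] += 1
    return summary

# A 5 MB file, a 200 MB file, and a 2 GB file:
print(summarize_sizes([5 * 1024**2, 200 * 1024**2, 2 * 1024**3]))
# → {'too_small': 1, 'in_range': 1, 'too_large': 1}
```

A high `too_small` count suggests grouping or merging files before processing, as discussed in the next section.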

**Additional considerations**

If performance in the early ETL stages is a bottleneck, consider grouping or merging the data files before processing. If you have complete control over the file generation process, it can be even more efficient to aggregate data points on the source system itself before the raw data is sent to AWS.

# Orchestrate an ETL pipeline with validation, transformation, and partitioning using AWS Step Functions
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions"></a>

*Sandip Gangapadhyay, Amazon Web Services*

## Summary
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-summary"></a>

This pattern describes how to build a serverless extract, transform, and load (ETL) pipeline to validate, transform, compress, and partition a large CSV dataset for performance and cost optimization. The pipeline is orchestrated by AWS Step Functions and includes error handling, automated retry, and user notification features.

When a CSV file is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket source folder, the ETL pipeline starts to run. The pipeline validates the content and the schema of the source CSV file, transforms the CSV file to a compressed Apache Parquet format, partitions the dataset by year, month, and day, and stores it in a separate folder for analytics tools to process.

The code that automates this pattern is available on GitHub, in the [ETL Pipeline with AWS Step Functions](https://github.com/aws-samples/aws-step-functions-etl-pipeline-pattern) repository.

## Prerequisites and limitations
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-prereqs"></a>

**Prerequisites**
+ An active AWS account.
+ AWS Command Line Interface (AWS CLI) installed and configured with your AWS account, so that you can create AWS resources by deploying an AWS CloudFormation stack. We recommend using AWS CLI version 2. For instructions, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) in the AWS CLI documentation. For configuration instructions, see [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) in the AWS CLI documentation.
+ An Amazon S3 bucket.
+ A CSV dataset with the correct schema. (The [code repository](https://github.com/aws-samples/aws-step-functions-etl-pipeline-pattern/) included with this pattern provides a sample CSV file with the correct schema and data type that you can use.)
+ A web browser that supports the AWS Management Console. (See the [list of supported browsers](https://aws.amazon.com/premiumsupport/knowledge-center/browsers-management-console/).)
+ AWS Glue console access.
+ AWS Step Functions console access.

**Limitations**
+ In AWS Step Functions, the maximum limit for keeping history logs is 90 days. For more information, see [Step Functions service quotas](https://docs.aws.amazon.com/step-functions/latest/dg/service-quotas.html) in the AWS Step Functions documentation.

**Product versions**
+ Python 3.13 for AWS Lambda
+ AWS Glue version 4.0

## Architecture
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-architecture"></a>

![\[ETL process from S3 source bucket through Step Functions, AWS Glue, and Amazon SNS in 10 steps.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/8eb792b0-d35b-4a63-ba2d-71eb8d2c1203/images/054c8e9d-76a2-4334-b21a-cbe76af45923.png)



The workflow illustrated in the diagram consists of these high-level steps:

1. The user uploads a CSV file into the source folder in Amazon S3.

1. An Amazon S3 notification event initiates an AWS Lambda function that starts the AWS Step Functions state machine.

1. The Lambda function validates the schema and data type of the raw CSV file.

1. Depending on the validation results:

   1. If validation of the source file succeeds, the file moves to the stage folder for further processing.

   1. If validation fails, the file moves to the error folder, and an error notification is sent through Amazon Simple Notification Service (Amazon SNS).

1. An AWS Glue crawler creates the schema of the raw file from the stage folder in Amazon S3.

1. An AWS Glue job transforms, compresses, and partitions the raw file into Parquet format.

1. The AWS Glue job also moves the file to the transform folder in Amazon S3.

1. The AWS Glue crawler creates the schema from the transformed file. The resulting schema can be used by any analytics job. You can also use Amazon Athena to run ad-hoc queries.

1. If the pipeline completes without errors, the schema file is moved to the archive folder. If any errors are encountered, the file is moved to the error folder instead.

1. Amazon SNS sends a notification that indicates success or failure based on the pipeline completion status.

All the AWS resources used in this pattern are serverless. There are no servers to manage.
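
As a sketch of step 2, a Lambda function triggered by the S3 notification might start the state machine as follows. This is illustrative only, not the repository's actual `start_step_function.py`; the client parameter is injectable so the handler can be exercised without AWS credentials, and the state machine ARN would normally come from an environment variable:

```python
import json

def lambda_handler(event, context, sfn_client=None, state_machine_arn=None):
    """Start the Step Functions state machine once per uploaded S3 object.
    In real use, sfn_client defaults to boto3's Step Functions client."""
    if sfn_client is None:
        import boto3
        sfn_client = boto3.client("stepfunctions")
    started = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        payload = {"bucket": s3["bucket"]["name"], "key": s3["object"]["key"]}
        resp = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        started.append(resp["executionArn"])
    return {"startedExecutions": started}
```

Passing the bucket and key as the execution input lets the downstream validation and move steps locate the file.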

## Tools
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-tools"></a>

**AWS services**
+ [AWS Glue](https://aws.amazon.com/glue/) – AWS Glue is a fully managed ETL service that makes it easy for customers to prepare and load their data for analytics.
+ [AWS Step Functions](https://aws.amazon.com/step-functions/) – AWS Step Functions is a serverless orchestration service that lets you combine AWS Lambda functions and other AWS services to build business-critical applications. Through the AWS Step Functions graphical console, you see your application’s workflow as a series of event-driven steps.
+ [Amazon S3](https://aws.amazon.com/s3/) – Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
+ [Amazon SNS](https://aws.amazon.com/sns/) – Amazon Simple Notification Service (Amazon SNS) is a highly available, durable, secure, fully managed pub/sub messaging service that enables you to decouple microservices, distributed systems, and serverless applications.
+ [AWS Lambda](https://aws.amazon.com/lambda/) – AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second.

**Code**

The code for this pattern is available on GitHub, in the [ETL Pipeline with AWS Step Functions](https://github.com/aws-samples/aws-step-functions-etl-pipeline-pattern) repository. The code repository contains the following files and folders:
+ `template.yml` – AWS CloudFormation template for creating the ETL pipeline with AWS Step Functions.
+ `parameter.json` – Contains all parameters and parameter values. You update this file to change parameter values, as described in the *Epics* section.
+ `myLayer/python` folder – Contains Python packages needed to create the required AWS Lambda layer for this project.
+ `lambda` folder – Contains the following Lambda functions:
  + `move_file.py` – Moves the source dataset to the archive, transform, or error folder.
  + `check_crawler.py` – Checks the status of the AWS Glue crawler as many times as configured by the `RETRYLIMIT` environment variable before it sends a failure message.
  + `start_crawler.py` – Starts the AWS Glue crawler.
  + `start_step_function.py` – Starts AWS Step Functions.
  + `start_codebuild.py` – Starts the AWS CodeBuild project.
  + `validation.py` – Validates the input raw dataset.
  + `s3object.py` – Creates the required directory structure inside the Amazon S3 bucket.
  + `notification.py` – Sends success or error notifications at the end of the pipeline.

To use the sample code, follow the instructions in the *Epics* section.
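
To illustrate the kind of check `validation.py` performs (this sketch is not the repository's implementation; the header and column-count checks are assumptions), a minimal CSV validator might look like this:

```python
import csv
import io

def validate_csv(text, expected_header):
    """Return (ok, reason). Checks that the header matches and that
    every data row has the expected number of columns."""
    reader = csv.reader(io.StringIO(text))
    try:
        header = next(reader)
    except StopIteration:
        return False, "empty file"
    if header != expected_header:
        return False, f"unexpected header: {header}"
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected_header):
            return False, f"row {line_no} has {len(row)} columns"
    return True, "ok"

print(validate_csv("id,name\n1,alice\n2,bob\n", ["id", "name"]))
# → (True, 'ok')
```

In the pipeline, a `False` result would route the file to the error folder and trigger the Amazon SNS failure notification.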

## Epics
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-epics"></a>

### Prepare the source files
<a name="prepare-the-source-files"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Clone the sample code repository. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) | Developer | 
| Update parameter values. | In your local copy of the repository, edit the `parameter.json` file and update the default parameter values as follows:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) | Developer | 
| Upload the source code to the Amazon S3 bucket. | Before you deploy the AWS CloudFormation template that automates the ETL pipeline, you must package the source files for the template and upload them to an Amazon S3 bucket. To do this, run the following AWS CLI command with your preconfigured profile:<pre>aws cloudformation package --template-file template.yml --s3-bucket <bucket_name> --output-template-file packaged.template --profile <profile_name></pre>where:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) | Developer | 

### Create the stack
<a name="create-the-stack"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Deploy the CloudFormation template. | To deploy the AWS CloudFormation template, run the following AWS CLI command:<pre>aws cloudformation deploy --stack-name <stack_name> --template-file packaged.template --parameter-overrides file://parameter.json --capabilities CAPABILITY_IAM --profile <profile_name></pre>where:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) | Developer | 
| Check progress. | On the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation/), check the progress of stack development. When the status is `CREATE_COMPLETE`, the stack has been deployed successfully. | Developer | 
| Note the AWS Glue database name. | The **Outputs** tab for the stack displays the name of the AWS Glue database. The key name is `GlueDBOutput`. | Developer | 

### Test the pipeline
<a name="test-the-pipeline"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Start the ETL pipeline. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) | Developer | 
| Check for the partitioned dataset. | When the ETL pipeline completes, verify that the partitioned dataset is available in the Amazon S3 transform folder (`transform`, or the folder name you set in the `parameter.json` file). | Developer | 
| Check for the partitioned AWS Glue database. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions.html) | Developer | 
| Run queries. | (Optional) Use Amazon Athena to run ad-hoc queries on the partitioned and transformed database. For instructions, see [Run SQL queries in Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/querying-athena-tables.html) in the AWS documentation. | Database analyst | 

## Troubleshooting
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| AWS Identity and Access Management (IAM) permissions for the AWS Glue job and crawler | If you further customize the AWS Glue job or the crawler, be sure to grant the appropriate IAM permissions in the IAM role used by the AWS Glue job, or provide data permissions to AWS Lake Formation. For more information, see the [AWS documentation](https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation.html). | 

## Related resources
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-resources"></a>

**AWS service documentation**
+ [AWS Step Functions](https://docs.aws.amazon.com/step-functions/)
+ [AWS Glue](https://docs.aws.amazon.com/glue/)
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/)
+ [Amazon S3](https://docs.aws.amazon.com/s3/)
+ [Amazon SNS](https://docs.aws.amazon.com/sns/)

## Additional information
<a name="orchestrate-an-etl-pipeline-with-validation-transformation-and-partitioning-using-aws-step-functions-additional"></a>

The following diagram shows the AWS Step Functions workflow for a successful ETL pipeline, from the AWS Step Functions **Inspector** panel.

![\[Step Functions workflow for validating the input .csv, crawling data, and running the AWS Glue job.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/8eb792b0-d35b-4a63-ba2d-71eb8d2c1203/images/bd77de7b-4d04-44bb-95d2-3ec4599b3770.png)


The following diagram shows the AWS Step Functions workflow for an ETL pipeline that fails because of an input validation error, from the Step Functions **Inspector** panel.

![\[Step Functions workflow with failure so the file moves to the error folder.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/8eb792b0-d35b-4a63-ba2d-71eb8d2c1203/images/5fd7dd79-ba4c-4c20-b1f0-ad4b5f22bdfc.png)



# Perform advanced analytics using Amazon Redshift ML
<a name="perform-advanced-analytics-using-amazon-redshift-ml"></a>

*Po Hong and Chyanna Antonio, Amazon Web Services*

## Summary
<a name="perform-advanced-analytics-using-amazon-redshift-ml-summary"></a>

On the Amazon Web Services (AWS) Cloud, you can use Amazon Redshift machine learning (Amazon Redshift ML) to perform ML analytics on data stored in either an Amazon Redshift cluster or on Amazon Simple Storage Service (Amazon S3). Amazon Redshift ML supports supervised learning, which is typically used for advanced analytics. Use cases for Amazon Redshift ML include revenue forecasting, credit card fraud detection, and customer lifetime value (CLV) or customer churn predictions.

Amazon Redshift ML makes it easy for database users to create, train, and deploy ML models by using standard SQL commands. Amazon Redshift ML uses Amazon SageMaker Autopilot to automatically train and tune the best ML models for classification or regression based on your data, while you retain control and visibility.

All interactions between Amazon Redshift, Amazon S3, and Amazon SageMaker are abstracted away and automated. After the ML model is trained and deployed, it becomes available as a [user-defined function](https://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html) (UDF) in Amazon Redshift and can be used in SQL queries.  

This pattern complements the [Create, train, and deploy ML models in Amazon Redshift using SQL with Amazon Redshift ML](https://aws.amazon.com/blogs/big-data/create-train-and-deploy-machine-learning-models-in-amazon-redshift-using-sql-with-amazon-redshift-ml/) from the AWS Blog, and the [Build, train, and deploy an ML model with Amazon SageMaker](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/) tutorial from the [Getting Started Resource Center](https://aws.amazon.com/getting-started/).

## Prerequisites and limitations
<a name="perform-advanced-analytics-using-amazon-redshift-ml-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ Existing data in an Amazon Redshift table

**Skills**
+ Familiarity with terms and concepts used by Amazon Redshift ML, including *machine learning*, *training*, and *prediction*. For more information about this, see [Training ML models](https://docs.aws.amazon.com/machine-learning/latest/dg/training-ml-models.html) in the Amazon Machine Learning (Amazon ML) documentation.
+ Experience with Amazon Redshift user setup, access management, and standard SQL syntax. For more information about this, see [Getting started with Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html) in the Amazon Redshift documentation.
+ Knowledge and experience with Amazon S3 and AWS Identity and Access Management (IAM). 
+ Experience running commands in AWS Command Line Interface (AWS CLI) is also beneficial but not required. 

**Limitations**
+ The Amazon Redshift cluster and S3 bucket must be located in the same AWS Region.
+ This pattern’s approach only supports supervised learning models such as regression, binary classification, and multiclass classification. 

## Architecture
<a name="perform-advanced-analytics-using-amazon-redshift-ml-architecture"></a>

![\[Workflow shows how Amazon Redshift ML works with SageMaker to build, train, and deploy an ML model.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/3b6a5f49-c647-45f4-940e-b623f989326a/images/36cc39fe-c1f3-4d68-b89e-e636156001a6.png)


The following steps explain how Amazon Redshift ML works with SageMaker to build, train, and deploy an ML model: 

1. Amazon Redshift exports training data to an S3 bucket.

1. SageMaker Autopilot automatically preprocesses the training data.

1. After the `CREATE MODEL` statement is invoked, Amazon Redshift ML uses SageMaker for training.

1. SageMaker Autopilot searches for and recommends the ML algorithm and optimal hyper-parameters that optimize the evaluation metrics.

1. Amazon Redshift ML registers the output ML model as a SQL function in the Amazon Redshift cluster.

1. The ML model's function can be used in a SQL statement. 

**Technology stack**
+ Amazon Redshift
+ SageMaker
+ Amazon S3

## Tools
<a name="perform-advanced-analytics-using-amazon-redshift-ml-tools"></a>
+ [Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/welcome.html) – Amazon Redshift is an enterprise-level, petabyte scale, fully managed data warehousing service.
+ [Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning.html) – Amazon Redshift machine learning (Amazon Redshift ML) is a robust, cloud-based service that makes it easy for analysts and data scientists of all skill levels to use ML technology.
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) – Amazon Simple Storage Service (Amazon S3) is storage for the internet. 
+ [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) – SageMaker is a fully managed ML service. 
+ [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html) – SageMaker Autopilot is a feature-set that automates key tasks of an automatic machine learning (AutoML) process.

 

**Code**

You can create a supervised ML model in Amazon Redshift by using the following code:

```
CREATE MODEL customer_churn_auto_model
FROM (SELECT state,
             account_length,
             area_code,
             total_charge/account_length AS average_daily_spend,
             cust_serv_calls/account_length AS average_daily_cases,
             churn
      FROM customer_activity
      WHERE record_date < '2020-01-01'
     )
TARGET churn
FUNCTION ml_fn_customer_churn_auto
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/Redshift-ML'
SETTINGS (
  S3_BUCKET 'your-bucket'
);
```

**Note**  
The `SELECT` statement can refer to Amazon Redshift regular tables, Amazon Redshift Spectrum external tables, or both.
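
After training completes, the function named in the `FUNCTION` clause can be called like any SQL function. As a hedged illustration (the column list must match the columns the model was trained on; the helper below only renders the query string and is not part of Amazon Redshift ML), you could build a prediction query in Python:

```python
def prediction_query(function_name, feature_columns, source_table):
    """Render a SQL query that calls the inference function generated by
    CREATE MODEL. The output alias 'predicted_churn' is illustrative."""
    cols = ", ".join(feature_columns)
    return (
        f"SELECT {cols}, {function_name}({cols}) AS predicted_churn "
        f"FROM {source_table};"
    )

print(prediction_query(
    "ml_fn_customer_churn_auto",
    ["state", "account_length", "area_code"],
    "customer_activity",
))
```

The resulting SQL can be run in any Amazon Redshift client, exactly like the Prediction examples in the Amazon Redshift documentation.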

## Epics
<a name="perform-advanced-analytics-using-amazon-redshift-ml-epics"></a>

### Prepare a training and test dataset
<a name="prepare-a-training-and-test-dataset"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Prepare a training and test dataset. | Sign in to the AWS Management Console and open the Amazon SageMaker console. Follow the instructions from the [Build, train, and deploy a machine learning model](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/) tutorial to create a .csv or Apache Parquet file that has a label column (*supervised training*) and no header. We recommend that you shuffle and split the raw dataset into a training set for the model’s training (70 percent) and a test set for the model’s performance evaluation (30 percent). | Data scientist | 
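
The recommended 70/30 shuffle-and-split step can be sketched in plain Python (the fraction, seed, and function name are illustrative):

```python
import random

def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle rows deterministically and split them into a training set
    and a test set according to train_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(100)))
print(len(train), len(test))  # → 70 30
```

Fixing the seed keeps the split reproducible, which helps when you compare model runs against the same held-out test set.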

### Prepare and configure the technology stack
<a name="prepare-and-configure-the-technology-stack"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create and configure an Amazon Redshift cluster. | On the Amazon Redshift console, create a cluster according to your requirements. For more information about this, see [Create a cluster](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data-launch-cluster.html) in the Amazon Redshift documentation.  Amazon Redshift clusters must be created with the `SQL_PREVIEW` maintenance track. For more information about preview tracks, see [Choosing cluster maintenance tracks](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-mgmt-maintenance-tracks) in the Amazon Redshift documentation. | DBA, Cloud architect | 
| Create an S3 bucket to store training data and model artifacts. | On the Amazon S3 console, create an S3 bucket for the training and test data. For more information about creating an S3 bucket, see [Create an S3 bucket](https://docs.aws.amazon.com/quickstarts/latest/s3backup/step-1-create-bucket.html) from AWS Quick Starts. Make sure that your Amazon Redshift cluster and S3 bucket are in the same Region.  | DBA, Cloud architect | 
| Create and attach an IAM policy to the Amazon Redshift cluster. | Create an IAM policy to allow the Amazon Redshift cluster to access SageMaker and Amazon S3. For instructions and steps, see [Cluster setup for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cluster-setup.html) in the Amazon Redshift documentation. | DBA, Cloud architect | 
| Allow Amazon Redshift users and groups to access schemas and tables. | Grant permissions to allow users and groups in Amazon Redshift to access internal and external schemas and tables. For steps and instructions, see [Managing permissions and ownership](https://docs.aws.amazon.com/redshift/latest/dg/permissions-ownership.html) in the Amazon Redshift documentation. | DBA | 

### Create and train the ML model in Amazon Redshift
<a name="create-and-train-the-ml-model-in-amazon-redshift"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create and train the ML model in Amazon Redshift. | Create and train your ML model in Amazon Redshift ML. For more information, see the `CREATE MODEL` statement in the Amazon Redshift documentation. | Developer, Data scientist | 

### Perform batch inference and prediction in Amazon Redshift
<a name="perform-batch-inference-and-prediction-in-amazon-redshift"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Perform inference using the generated ML model function. | For more information about performing inference by using the generated ML model function, see [Prediction](https://docs.aws.amazon.com/redshift/latest/dg/prediction.html) in the Amazon Redshift documentation. | Data scientist, Business intelligence user | 

## Related resources
<a name="perform-advanced-analytics-using-amazon-redshift-ml-resources"></a>

**Prepare a training and test dataset**
+ [Building, training, and deploying a machine learning model with Amazon SageMaker](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/)

 

**Prepare and configure the technology stack**
+ [Creating an Amazon Redshift cluster](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data-launch-cluster.html)
+ [Choosing Amazon Redshift cluster maintenance tracks](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-mgmt-maintenance-tracks)
+ [Creating an S3 bucket](https://docs.aws.amazon.com/quickstarts/latest/s3backup/step-1-create-bucket.html)
+ [Setting up an Amazon Redshift cluster for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cluster-setup.html)
+ [Managing permissions and ownership in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/permissions-ownership.html)

 

**Create and train the ML model in Amazon Redshift**
+ [CREATE MODEL statement in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html#r_simple_create_model) 

 

**Perform batch inference and prediction in Amazon Redshift**
+ [Prediction in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/prediction.html) 

 

**Other resources**
+ [Getting started with Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/geting-started-machine-learning.html)
+ [Creating, training, and deploying ML models in Amazon Redshift using SQL with Amazon Redshift ML](https://aws.amazon.com/blogs/big-data/create-train-and-deploy-machine-learning-models-in-amazon-redshift-using-sql-with-amazon-redshift-ml/)
+ [Amazon Redshift partners](https://aws.amazon.com/redshift/partners/?partner-solutions-cards.sort-by=item.additionalFields.partnerNameLower&partner-solutions-cards.sort-order=asc)
+ [AWS machine learning competency partners](https://aws.amazon.com/machine-learning/partner-solutions/?partner-solutions-cards.sort-by=item.additionalFields.partnerNameLower&partner-solutions-cards.sort-order=asc)

# Query Amazon DynamoDB tables with SQL by using Amazon Athena
<a name="query-amazon-dynamodb-tables-sql-amazon-athena"></a>

*Gavin Perrie, Ajit Ambike, and Brad Yates, Amazon Web Services*

## Summary
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-summary"></a>

If your data includes sources other than Amazon Simple Storage Service (Amazon S3), you can use federated queries to access those relational, non-relational, object, or custom data sources. This pattern shows how to configure federated query access from Amazon Athena to Amazon DynamoDB by using a data source connector.

Using this pattern you can do the following:
+ Query DynamoDB tables by using SQL.
+ Run federated SQL queries in Athena and join DynamoDB tables with other supported data sources.

## Prerequisites and limitations
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-prereqs"></a>

**Prerequisites**
+ A DynamoDB table.
+ An Athena workgroup set to use Athena engine version 2. For instructions, see the [Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/engine-versions-changing.html).
+ An S3 bucket where the `AthenaDynamoDBConnector` AWS Lambda function can spill the data. The S3 bucket and the Lambda function must be in the same AWS Region.

If this is your first time accessing Athena, you will need an additional S3 bucket to use as the query result location. For instructions, see the [Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/querying.html#query-results-specify-location-console).

**Limitations**
+ Write operations such as [INSERT INTO](https://docs.aws.amazon.com/athena/latest/ug/insert-into.html) are not supported.

**Product versions**
+ [Athena Query Federation releases on GitHub](https://github.com/awslabs/aws-athena-query-federation/releases)

## Architecture
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-architecture"></a>

**Target architecture**

The following diagram shows the connection flow after the pattern is established. The user submits a query to Amazon Athena. Athena passes the query and target to the DynamoDB data source connector Lambda function, which retrieves and returns the data to Athena. If large amounts of data are returned, the Lambda function spills the temporary results to the spill bucket before the complete dataset is packaged and returned to Athena.

![\[Workflow from users to Athena to Lambda, which connects to the S3 bucket and to the DynamoDB table.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/9d1ee529-36f3-4c7e-a2e7-05b76e09a3d8/images/e50a00cf-a55a-4def-81d5-4d2755dc9a97.png)
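The spill behavior can be sketched in plain Python. This is an illustrative model only, not the connector's actual implementation (the real connector spills to the S3 bucket you configure): rows accumulate in memory up to a threshold, overflow rows are written to the spill location, and the reader drains memory first and then the spilled data. The class name and threshold below are hypothetical.

```python
import json
import tempfile
from pathlib import Path

class SpillingResultBuffer:
    """Accumulate result rows in memory; spill overflow to disk.

    Illustrative sketch only: the real Athena connector spills to
    an S3 bucket, not to the local filesystem.
    """

    def __init__(self, spill_dir, max_rows_in_memory=3):
        self.spill_dir = Path(spill_dir)
        self.max_rows_in_memory = max_rows_in_memory
        self.memory_rows = []
        self.spill_files = []

    def add(self, row):
        if len(self.memory_rows) < self.max_rows_in_memory:
            self.memory_rows.append(row)
        else:
            # Spill the overflow row to the "bucket" (a directory here).
            path = self.spill_dir / f"spill-{len(self.spill_files)}.json"
            path.write_text(json.dumps(row))
            self.spill_files.append(path)

    def read_all(self):
        # Drain in-memory rows first, then the spilled rows.
        yield from self.memory_rows
        for path in self.spill_files:
            yield json.loads(path.read_text())

# Usage: two rows stay in memory, the remaining three are spilled.
buf = SpillingResultBuffer(tempfile.mkdtemp(), max_rows_in_memory=2)
for i in range(5):
    buf.add({"id": i})
rows = list(buf.read_all())
```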


## Tools
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-tools"></a>

**AWS services**
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon Simple Storage Service (Amazon S3) by using standard SQL. This pattern uses [Amazon Athena DynamoDB Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb), a tool built using the Amazon Athena Query Federation SDK and installed as an AWS Lambda application through the AWS Serverless Application Repository.
+ [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html) is a fully managed NoSQL database service that provides fast, predictable, and scalable performance.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

**Code repository**

The code for this pattern is available in the GitHub [Athena Query Federation](https://github.com/awslabs/aws-athena-query-federation/releases) repository.

## Epics
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-epics"></a>

### Set up and test the DynamoDB data source connector
<a name="set-up-and-test-the-ddb-data-source-connector"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Deploy the AthenaDynamoDBConnector application. | To deploy AthenaDynamoDBConnector, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/query-amazon-dynamodb-tables-sql-amazon-athena.html) | AWS DevOps | 
| Create a data source for Athena. | To create the data source, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/query-amazon-dynamodb-tables-sql-amazon-athena.html) | AWS DevOps | 
| Use Athena to query the DynamoDB table. | To query the DynamoDB table, do the following:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/query-amazon-dynamodb-tables-sql-amazon-athena.html) | App developer | 

## Troubleshooting
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-troubleshooting"></a>


| Issue | Solution | 
| --- | --- | 
| Query fails with `GENERIC_INTERNAL_ERROR: The bucket is in this region: <region>`. | Make sure that the Athena spill bucket and Lambda function are created in the same AWS Region. | 
| The newly created data source isn't visible on the Athena console. | Athena data catalogs are regional. Ensure the `AthenaDynamoDBConnector` has been deployed in the Region where you are trying to use Athena. | 
| You're unable to run the query against the newly created data source. | Check that the **Query Result Location** has been set. | 

## Related resources
<a name="query-amazon-dynamodb-tables-sql-amazon-athena-resources"></a>
+ [Amazon Athena DynamoDB Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb)
+ [Amazon Athena Federated Query](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html)

# Access, query, and join Amazon DynamoDB tables using Athena
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena"></a>

*Moinul Al-Mamun, Amazon Web Services*

## Summary
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-summary"></a>

This pattern shows you how to set up a connection between Amazon Athena and Amazon DynamoDB by using the Amazon Athena DynamoDB connector. The connector uses an AWS Lambda function to query the data in DynamoDB. You don’t need to write any code to set up the connection. After the connection is established, you can quickly access and analyze DynamoDB tables by using [Athena Federated Query](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html) to run SQL commands from Athena. You can also join one or more DynamoDB tables to each other or to other data sources, such as Amazon Redshift or Amazon Aurora.

## Prerequisites and limitations
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-prereqs"></a>

**Prerequisites**
+ An active AWS account with permissions to manage DynamoDB tables, Athena Data sources, Lambda, and AWS Identity and Access Management (IAM) roles
+ An Amazon Simple Storage Service (Amazon S3) bucket where Athena can store query results
+ An S3 bucket where the Athena DynamoDB Connector can save the data in the short term
+ An AWS Region that supports [Athena engine version 2](https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference-0002.html)
+ IAM permissions to access Athena and the required S3 buckets
+ [Amazon Athena DynamoDB Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb), installed

**Limitations**

There is a cost for querying DynamoDB tables. Table sizes exceeding a few gigabytes (GBs) can incur a high cost. We recommend that you consider cost before performing any full table SCAN operation. For more information, see [Amazon DynamoDB pricing](https://aws.amazon.com/dynamodb/pricing/). To reduce costs and achieve high performance, we recommend that you always use LIMIT in your query (for example, `SELECT * FROM table1 LIMIT 10`). Also, before you perform a JOIN or GROUP BY query in a production environment, consider the size of your tables. If your tables are too large, consider alternative options such as [migrating the table to Amazon S3](https://aws.amazon.com/blogs/database/simplify-amazon-dynamodb-data-extraction-and-analysis-by-using-aws-glue-and-amazon-athena/).

## Architecture
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-architecture"></a>

The following diagram shows how a user can run a SQL query on a DynamoDB table from Athena.

![\[Workflow for connecting Athena and DynamoDB to run a SQL query.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/e6ff94af-d208-40c7-94e4-af257755a603/images/bc8e0132-b578-463b-bf55-3c39ce359c17.png)


The diagram shows the following workflow:

1. To query a DynamoDB table, a user runs a SQL query from Athena.

1. Athena initiates a Lambda function.

1. The Lambda function queries the requested data in the DynamoDB table.

1. DynamoDB returns the requested data to the Lambda function. Then, the function transfers the query results to the user through Athena.

1. The Lambda function stores data in the S3 bucket.

**Technology stack**
+ Amazon Athena
+ Amazon DynamoDB
+ Amazon S3
+ AWS Lambda

## Tools
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-tools"></a>
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL.
+ [Amazon Athena DynamoDB Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb) is an AWS tool that enables Athena to connect with DynamoDB and access your tables by using SQL queries.
+ [Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html) is a fully managed NoSQL database service that provides fast, predictable, and scalable performance.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.

## Epics
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-epics"></a>

### Create sample DynamoDB tables
<a name="create-sample-dynamodb-tables"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create the first sample table.  | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) | Developer | 
| Insert sample data into the first table. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html)<pre>{<br />  "PK1": "1234",<br />  "SK1": "info",<br />  "Salary": "5000"<br /> }</pre><pre>{<br />  "PK1": "1235",<br />  "SK1": "info",<br />  "Salary": "5200"<br /> }</pre> | Developer | 
| Create the second sample table.  | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) | Developer | 
| Insert sample data into the second table. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html)<pre>{<br />  "PK2": "1234",<br /> "SK2": "bonus",<br /> "Bonus": "500"<br />}</pre><pre>{<br />  "PK2": "1235",<br /> "SK2": "bonus",<br /> "Bonus": "1000"<br />}</pre> | Developer | 

### Create a data source in Athena for DynamoDB
<a name="create-a-data-source-in-athena-for-dynamodb"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Set up the data source connector. | Create a data source for DynamoDB, and then create a Lambda function to connect to that data source.[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) | Developer | 
| Verify that the Lambda function can access the S3 spill bucket. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) If you experience errors, see the *Additional information* section in this pattern for guidance. | Developer | 

### Access DynamoDB tables from Athena
<a name="access-dynamodb-tables-from-athena"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Query the DynamoDB tables. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) | Developer | 
| Join the two DynamoDB tables. | DynamoDB is a NoSQL data store and doesn’t support the SQL join operation. Consequently, you use Athena to join the two DynamoDB tables:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html)<pre>SELECT pk1, salary, bonus FROM dydbtable1 t1<br /> JOIN dydbtable2 t2 ON t1.pk1 = t2.pk2;</pre> | Developer | 
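Using the sample items inserted in the first epic, the result of the JOIN query can be illustrated in plain Python. The helper below is hypothetical and exists only to make the join semantics concrete; Athena performs the equivalent work for you.

```python
# Sample items from the two DynamoDB tables created earlier.
table1 = [
    {"PK1": "1234", "SK1": "info", "Salary": "5000"},
    {"PK1": "1235", "SK1": "info", "Salary": "5200"},
]
table2 = [
    {"PK2": "1234", "SK2": "bonus", "Bonus": "500"},
    {"PK2": "1235", "SK2": "bonus", "Bonus": "1000"},
]

def join_on_pk(left, right):
    """Inner join on t1.PK1 = t2.PK2, mirroring the Athena query."""
    bonus_by_pk = {item["PK2"]: item for item in right}
    return [
        {
            "pk1": row["PK1"],
            "salary": row["Salary"],
            "bonus": bonus_by_pk[row["PK1"]]["Bonus"],
        }
        for row in left
        if row["PK1"] in bonus_by_pk
    ]
```

Each employee record is matched with its bonus record by partition key, which is exactly what the `ON t1.pk1 = t2.pk2` condition expresses.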

## Related resources
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-resources"></a>
+ [Amazon Athena DynamoDB Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb) (AWS Labs)
+ [Query any data source with Amazon Athena’s new federated query](https://aws.amazon.com/blogs/big-data/query-any-data-source-with-amazon-athenas-new-federated-query/) (AWS Big Data Blog)
+ [Athena engine version reference](https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html) (Athena User Guide)
+ [Simplify Amazon DynamoDB data extraction and analysis by using AWS Glue and Amazon Athena](https://aws.amazon.com/blogs/database/simplify-amazon-dynamodb-data-extraction-and-analysis-by-using-aws-glue-and-amazon-athena/) (AWS Database Blog)

## Additional information
<a name="access-query-and-join-amazon-dynamodb-tables-using-athena-additional"></a>

If you run a query in Athena with `spill_bucket` in the `{bucket_name}/folder_name/` format, then you can receive the following error message:

```
"GENERIC_USER_ERROR: Encountered an exception[java.lang.RuntimeException] from your LambdaFunction[arn:aws:lambda:us-east-1:xxxxxx:function:testdynamodb] executed in context[retrieving meta-data] with message[You do NOT own the spill bucket with the name: s3://amzn-s3-demo-bucket/athena_dynamodb_spill_data/]
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum  or contact customer support with Query Id: [query-id]"
```

To resolve this error, update the Lambda function’s environment variable `spill_bucket` to `{bucket_name_only}`, and then update the following Lambda IAM policy for bucket write access:

```
{
    "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetLifecycleConfiguration",
        "s3:PutLifecycleConfiguration",
        "s3:DeleteObject"
    ],
    "Resource": [
        "arn:aws:s3:::spill_bucket",
        "arn:aws:s3:::spill_bucket/*"
    ],
    "Effect": "Allow"
}
```

Alternatively, you can remove the Athena data source connector that you created earlier, and recreate it by using only `{bucket_name}` for `spill_bucket`.
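The fix can be sketched as a small helper that separates the bucket name from any folder path. The function below is hypothetical (it is not part of the connector); the connector itself reads the bucket name from `spill_bucket` and typically takes a key prefix through a separate `spill_prefix` setting.

```python
def normalize_spill_bucket(value):
    """Split a spill_bucket value such as 'bucket/folder/' into the
    bucket name and an optional folder prefix.

    The connector expects the bucket name alone in spill_bucket;
    a folder path causes the "You do NOT own the spill bucket" error.
    (Illustrative helper, not part of the connector. Requires
    Python 3.9+ for str.removeprefix.)
    """
    value = value.removeprefix("s3://").strip("/")
    bucket, _, prefix = value.partition("/")
    return bucket, prefix or None
```

For example, `amzn-s3-demo-bucket/athena_dynamodb_spill_data/` splits into the bucket name to keep in `spill_bucket` and the folder path to move elsewhere.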

# Set up language-specific sorting for Amazon Redshift query results using a scalar Python UDF
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf"></a>

*Ethan Stark, Amazon Web Services*

## Summary
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf-summary"></a>

This pattern provides steps and sample code for using a scalar Python UDF (user-defined function) to set up case-insensitive linguistic sorting for Amazon Redshift query results. It’s necessary to use a scalar Python UDF because Amazon Redshift returns results based on binary UTF-8 ordering and doesn’t support language-specific sorting. A Python UDF is non-SQL processing code, based on a Python program, that runs in a data warehouse. You can run Python UDF code with a SQL statement in a single query. For more information, see the [Introduction to Python UDFs in Amazon Redshift](https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/) AWS Big Data Blog post.

The sample data in this pattern is based on the Turkish alphabet for demonstration purposes. The scalar Python UDF in this pattern is built to make the default query results of Amazon Redshift conform to the linguistic ordering of characters in the Turkish language. For more information, see *Turkish language example* in the *Additional information* section of this pattern. You can modify the scalar Python UDF in this pattern for other languages.

## Prerequisites and limitations
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf-prereqs"></a>

**Prerequisites**
+ [Amazon Redshift cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html) with a database, schema, and tables
+ [Amazon Redshift user](https://docs.aws.amazon.com/redshift/latest/dg/r_Users.html) with `CREATE TABLE` and `CREATE FUNCTION` permissions
+ [Python 2.7](https://www.python.org/download/releases/2.7/) or later

**Limitations**

The linguistic sorting used by the queries in this pattern is case insensitive.

## Tools
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf-tools"></a>

**AWS services**
+ [Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html) is a managed petabyte-scale data warehouse service in the AWS Cloud. Amazon Redshift is integrated with your data lake, which enables you to use your data to acquire new insights for your business and customers.

**Other tools**
+ [Python user-defined functions (UDFs)](https://docs.aws.amazon.com/redshift/latest/dg/udf-creating-a-scalar-udf.html) are functions that you can write in Python and then call in SQL statements.

## Epics
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf-epics"></a>

### Develop code to sort query results in linguistic order
<a name="develop-code-to-sort-query-results-in-linguistic-order"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a table for your sample data. | To create a table in Amazon Redshift and insert your sample data into the table, use the following SQL statements:<pre>CREATE TABLE my_table (first_name varchar(30));<br /><br />INSERT INTO my_table (first_name)<br />VALUES<br />    ('ali'),<br />    ('Ali'),<br />    ('ırmak'),<br />    ('IRMAK'),<br />    ('irem'),<br />    ('İREM'),<br />    ('oğuz'),<br />    ('OĞUZ'),<br />    ('ömer'),<br />    ('ÖMER'),<br />    ('sedat'),<br />    ('SEDAT'),<br />    ('şule'),<br />    ('ŞULE');</pre>The first names in the sample data include special characters from the Turkish alphabet. For more information about Turkish language considerations for this example, see *Turkish language example* in the *Additional information* section of this pattern. | Data engineer | 
| Check the default sorting of the sample data. | To see the default sorting of your sample data in Amazon Redshift, run the following query:<pre>SELECT first_name FROM my_table ORDER BY first_name;</pre>The query returns the list of first names from the table that you created earlier:<pre>first_name<br />---------------<br />Ali<br />IRMAK<br />OĞUZ<br />SEDAT<br />ali<br />irem<br />oğuz<br />sedat<br />ÖMER<br />ömer<br />İREM<br />ırmak<br />ŞULE<br />şule</pre>The query results aren’t in the correct order because the default binary UTF-8 ordering doesn’t accommodate the linguistic ordering of the Turkish special characters. | Data engineer | 
| Create a scalar Python UDF. | To create a scalar Python UDF, use the following SQL code: <pre>CREATE OR REPLACE FUNCTION collate_sort (value varchar) <br />RETURNS varchar <br />IMMUTABLE <br />AS <br />$$    <br />    def sort_str(val):<br />        import string<br />        <br />        dictionary = {<br />            'I': 'ı', <br />            'ı': 'h~', <br />            'İ': 'i', <br />            'Ş': 's~', <br />            'ş': 's~', <br />            'Ğ': 'g~', <br />            'ğ': 'g~', <br />            'Ü': 'u~', <br />            'ü': 'u~', <br />            'Ö': 'o~', <br />            'ö': 'o~', <br />            'Ç': 'c~', <br />            'ç': 'c~'<br />        }<br />        <br />        for key, value in dictionary.items():<br />            val = val.replace(key, value)<br />        <br />        return val.lower()<br /> <br />    return sort_str(value)<br />    <br />$$ LANGUAGE plpythonu;</pre> | Data engineer | 
| Query the sample data. | To query the sample data by using the Python UDF, run the following SQL query:<pre>SELECT first_name FROM my_table ORDER BY collate_sort(first_name);</pre>The query now returns the sample data in Turkish linguistic order:<pre>first_name<br />---------------<br />ali<br />Ali<br />ırmak<br />IRMAK<br />irem<br />İREM<br />oğuz<br />OĞUZ<br />ömer<br />ÖMER<br />sedat<br />SEDAT<br />şule<br />ŞULE</pre> | Data engineer | 

## Related resources
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf-resources"></a>
+ [ORDER BY clause](https://docs.aws.amazon.com/redshift/latest/dg/r_ORDER_BY_clause.html) (Amazon Redshift documentation)
+ [Creating a scalar Python UDF](https://docs.aws.amazon.com/redshift/latest/dg/udf-creating-a-scalar-udf.html) (Amazon Redshift documentation)

## Additional information
<a name="set-up-language-specific-sorting-for-amazon-redshift-query-results-using-a-scalar-python-udf-additional"></a>

**Turkish language example**

Amazon Redshift returns query results based on binary UTF-8 sort ordering, not language-specific sort ordering. This means that if you query an Amazon Redshift table containing Turkish characters, then the query results aren’t sorted according to the linguistic ordering of the Turkish language. The Turkish language contains six special characters (ç, ı, ğ, ö, ş, and ü) that don’t appear in the Latin alphabet. These special characters are placed at the end of a sorted result set based on binary UTF-8 ordering, as the following table shows.


| Binary UTF-8 ordering | Turkish linguistic ordering | 
| --- |--- |
| a | a | 
| b | b | 
| c | c | 
| d | **ç\*** | 
| e | d | 
| f | e | 
| g | f | 
| h | g | 
| i | **ğ\*** | 
| j | h | 
| k | **ı\*** | 
| l | i | 
| m | j | 
| n | k | 
| o | l | 
| p | m | 
| r | n | 
| s | o | 
| t | **ö\*** | 
| u | p | 
| v | r | 
| y | s | 
| z | **ş\*** | 
| **ç\*** | t | 
| **ğ\*** | u | 
| **ı\*** | **ü\*** | 
| **ö\*** | v | 
| **ş\*** | y | 
| **ü\*** | z | 

**Note**  
The asterisk (\*) indicates a special character in the Turkish language.
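You can reproduce the binary ordering outside Amazon Redshift. Python compares strings by Unicode code point, which for valid UTF-8 text matches byte-wise (binary) ordering, so **ç** lands after **z** instead of between **c** and **d**:

```python
# Default (binary UTF-8 / code point) ordering puts the Turkish
# special character ç after z, not between c and d.
names = ["d", "ç", "c", "z"]
binary_order = sorted(names)
print(binary_order)  # ['c', 'd', 'z', 'ç']
```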

As the table above illustrates, special character **ç** is between **c** and **d** in Turkish linguistic ordering, but appears after **z** in binary UTF-8 ordering. The scalar Python UDF in this pattern uses the following character replacement dictionary to replace the Turkish special characters with corresponding Latin-equivalent characters.


| Turkish special character | Latin-equivalent character | 
| --- |--- |
| ç | c~ | 
| ı | h~ | 
| ğ | g~ | 
| ö | o~ | 
| ş | s~ | 
| ü | u~ | 

**Note**  
A tilde (~) character is appended to the end of each Latin character that replaces a corresponding Turkish special character.
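The UDF’s replacement logic can be exercised outside Amazon Redshift with an equivalent plain-Python sort key. Note that the replacements are order sensitive: `'I'` must be mapped to `'ı'` before `'ı'` is mapped to `'h~'`. Python 3 dictionaries preserve insertion order, which makes this reliable; in the Python 2.7 runtime that `plpythonu` uses, dictionary order is not guaranteed, so an ordered sequence of pairs would be safer there.

```python
# Plain-Python version of the collate_sort UDF's Turkish sort key.
# Relies on dict insertion order (Python 3.7+): 'I' -> 'ı' runs
# before 'ı' -> 'h~', so an uppercase I also ends up as 'h~'.
TURKISH = {
    'I': 'ı', 'ı': 'h~', 'İ': 'i',
    'Ş': 's~', 'ş': 's~',
    'Ğ': 'g~', 'ğ': 'g~',
    'Ü': 'u~', 'ü': 'u~',
    'Ö': 'o~', 'ö': 'o~',
    'Ç': 'c~', 'ç': 'c~',
}

def collate_sort(val):
    for key, replacement in TURKISH.items():
        val = val.replace(key, replacement)
    return val.lower()

names = ["irem", "ırmak", "oğuz", "Ali", "şule", "İREM"]
print(sorted(names, key=collate_sort))
# ['Ali', 'ırmak', 'irem', 'İREM', 'oğuz', 'şule']
```

The replaced-and-lowercased keys (`ali`, `h~rmak`, `irem`, `og~uz`, `s~ule`) sort in plain Latin order, which reproduces the Turkish linguistic ordering case-insensitively.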

**Modify a scalar Python UDF function**

To modify the scalar Python UDF from this pattern so that the function accepts a locale parameter and supports multiple character-replacement dictionaries, use the following SQL code:

```
CREATE OR REPLACE FUNCTION collate_sort (value varchar, locale varchar) 
RETURNS varchar 
IMMUTABLE 
AS 
$$    
    def sort_str(val):
        import string
        # Turkish Dictionary
        if locale == 'tr-TR':
            dictionary = {
                'I': 'ı',
                'ı': 'h~',
                'İ': 'i',
                'Ş': 's~',
                'ş': 's~',
                'Ğ': 'g~',
                'ğ': 'g~',
                'Ü': 'u~',
                'ü': 'u~',
                'Ö': 'o~',
                'ö': 'o~',
                'Ç': 'c~',
                'ç': 'c~'
            }
        # German Dictionary
        if locale == 'de-DE':
            dictionary = {
                ....
                ....
            }

        for key, value in dictionary.items():
            val = val.replace(key, value)

        return val.lower()

    return sort_str(value)
    
$$ LANGUAGE plpythonu;
```

The following example code shows how to query the modified Python UDF:

```
SELECT first_name FROM my_table ORDER BY collate_sort(first_name, 'tr-TR');
```

# Subscribe a Lambda function to event notifications from S3 buckets in different AWS Regions
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions"></a>

*Suresh Konathala, Amazon Web Services*

*Andrew Preston and Arindom Sarkar*

## Summary
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions-summary"></a>

[Amazon Simple Storage Service (Amazon S3) Event Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html) publishes notifications for certain events in your S3 bucket (for example, object created events, object removal events, or restore object events). You can use an AWS Lambda function to process these notifications according to your application’s requirements. However, the Lambda function can’t directly subscribe to notifications from S3 buckets that are hosted in different AWS Regions.

This pattern’s approach deploys a [fanout scenario](https://docs.aws.amazon.com/sns/latest/dg/sns-common-scenarios.html) to process Amazon S3 notifications from cross-Region S3 buckets by using an Amazon Simple Notification Service (Amazon SNS) topic for each Region. These Regional SNS topics send the Amazon S3 event notifications to an Amazon Simple Queue Service (Amazon SQS) queue in a central Region that also contains your Lambda function. The Lambda function subscribes to this SQS queue and processes the event notifications according to your organization’s requirements.

## Prerequisites and limitations
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions-prereqs"></a>

**Prerequisites**
+ An active AWS account.
+ Existing S3 buckets in multiple Regions, including a central Region to host the Amazon SQS queue and Lambda function.
+ AWS Command Line Interface (AWS CLI), installed and configured. For more information about this, see [Installing, updating, and uninstalling the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) in the AWS CLI documentation.
+ Familiarity with the fanout scenario in Amazon SNS. For more information about this, see [Common Amazon SNS scenarios](https://docs.aws.amazon.com/sns/latest/dg/sns-common-scenarios.html) in the Amazon SNS documentation.

## Architecture
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions-architecture"></a>

The following diagram shows the architecture for this pattern’s approach. 

![\[Workflow uses Amazon SNS, S3, and SQS to process event notifications from cross-Region S3 buckets.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/cf6c1804-8c41-46f1-9f17-ff361708c595/images/760cf4c0-0cb3-48d1-92ae-1cf0fa8ae076.png)


The diagram shows the following workflow:

1. Amazon S3 sends event notifications about S3 buckets (for example, object created, object removed, or object restored) to an SNS topic in the same Region.

1. The SNS topic publishes the event to an SQS queue in the central Region.

1. The SQS queue is configured as the event source for your Lambda function and buffers the event messages for the Lambda function. 

1. The Lambda function polls the SQS queue for messages and processes the Amazon S3 event notifications according to your application’s requirements.
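The Lambda function’s parsing step can be sketched as follows. The sketch assumes the default SNS-to-SQS delivery format, in which each SQS record body is an SNS envelope whose `Message` field contains the original Amazon S3 event JSON (that is, raw message delivery is disabled on the subscription). The handler name and the fields it extracts are illustrative.

```python
import json

def handler(event, context=None):
    """Process S3 event notifications delivered through SNS to SQS.

    Each SQS record body is an SNS envelope; its 'Message' field
    holds the original Amazon S3 event JSON. (Illustrative sketch.)
    """
    processed = []
    for record in event["Records"]:
        sns_envelope = json.loads(record["body"])
        s3_event = json.loads(sns_envelope["Message"])
        for s3_record in s3_event.get("Records", []):
            # Extract the details your application needs.
            processed.append({
                "event": s3_record["eventName"],
                "bucket": s3_record["s3"]["bucket"]["name"],
                "key": s3_record["s3"]["object"]["key"],
                "region": s3_record["awsRegion"],
            })
    return processed
```

Because the S3 event carries its own `awsRegion`, a single central function can tell which Region each notification originated in.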

 

**Technology stack**
+ Lambda
+ Amazon SNS
+ Amazon SQS
+ Amazon S3

## Tools
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions-tools"></a>
+ [AWS CLI ](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html)– The AWS Command Line Interface (AWS CLI) is an open-source tool for interacting with AWS services through commands in your command-line shell. With minimal configuration, you can run AWS CLI commands that implement functionality equivalent to that provided by the browser-based AWS Management Console from a command prompt.
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) – AWS CloudFormation helps you model and set up your AWS resources, provision them quickly and consistently, and manage them throughout their lifecycle. You can use a template to describe your resources and their dependencies, and launch and configure them together as a stack, instead of managing resources individually. You can manage and provision stacks across multiple AWS accounts and AWS Regions.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) – AWS Lambda is a compute service that supports running code without provisioning or managing servers. Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second. You pay only for the compute time that you consume—there is no charge when your code is not running.
+ [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) – Amazon Simple Notification Service (Amazon SNS) coordinates and manages the delivery or sending of messages between publishers and clients, including web servers and email addresses. Subscribers receive all messages published to the topics to which they subscribe, and all subscribers to a topic receive the same messages.
+ [Amazon SQS ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html)– Amazon Simple Queue Service (Amazon SQS) offers a secure, durable, and available hosted queue that lets you integrate and decouple distributed software systems and components. Amazon SQS supports both standard and FIFO queues.

## Epics
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions-epics"></a>

### Create the SQS queue and Lambda function in your central Region
<a name="create-the-sqs-queue-and-lambda-function-in-your-central-region"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create an SQS queue with a Lambda trigger. | Sign in to the AWS Management Console and use the instructions from the tutorial [Using Lambda with Amazon SQS](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs-example.html) in the AWS Lambda documentation to create the following resources in your central Region:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions.html)Make sure that you configure the SQS queue as the event source for your Lambda function. | AWS DevOps, Cloud architect | 

### Create an SNS topic and set up event notifications for the S3 buckets in each required Region
<a name="create-an-sns-topic-and-set-up-event-notifications-for-the-s3-buckets-in-each-required-region"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
|  Create an SNS topic to receive Amazon S3 event notifications. | Create an SNS topic in a Region that you want to receive Amazon S3 event notifications from. For more information about this, see [Creating an SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html) in the Amazon SNS documentation. Make sure that you record your SNS topic’s Amazon Resource Name (ARN).  | AWS DevOps, Cloud architect | 
| Subscribe the SNS topic to the central SQS queue. | Subscribe your SNS topic to the SQS queue hosted by your central Region. For more information about this, see [Subscribing to an SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html) in the Amazon SNS documentation. | AWS DevOps, Cloud architect | 
| Update the SNS topic's access policy.  | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions.html)<pre>{<br />  "Version": "2012-10-17",<br />  "Statement": [<br />    {<br />      "Sid": "0",<br />      "Effect": "Allow",<br />      "Principal": {<br />        "Service": "s3.amazonaws.com"<br />      },<br />      "Action": "sns:Publish",<br />      "Resource": "arn:aws:sns:us-west-2::s3Events-SNSTopic-us-west-2"<br />    }<br />  ]<br />}</pre> | AWS DevOps, Cloud architect | 
| Set up notifications for each S3 bucket in the Region. | Set up event notifications for each S3 bucket in the Region. For more information about this, see [Enabling and configuring event notifications using the Amazon S3 console](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html) in the Amazon S3 documentation. In the **Destination** section, choose **SNS topic** and specify the ARN of the SNS topic that you created earlier. | AWS DevOps, Cloud architect | 
| Repeat this epic for all required Regions. | Repeat the tasks in this epic for each Region that you want to receive Amazon S3 event notifications from, including your central Region. | AWS DevOps, Cloud architect | 
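If you prefer to script the bucket-notification step, the following sketch builds the notification configuration structure that the Amazon S3 `PutBucketNotificationConfiguration` API expects. The topic ARN and account ID are placeholders.

```
import json

# Placeholder ARN for the SNS topic created earlier in this epic.
topic_arn = "arn:aws:sns:us-west-2:111122223333:s3Events-SNSTopic-us-west-2"

# Structure expected by s3.put_bucket_notification_configuration(...).
notification_config = {
    "TopicConfigurations": [
        {
            "TopicArn": topic_arn,
            # Send a notification for every object-created event.
            "Events": ["s3:ObjectCreated:*"],
        }
    ]
}

print(json.dumps(notification_config, indent=2))
```

With boto3, you would pass this dictionary as the `NotificationConfiguration` argument to `put_bucket_notification_configuration`; the call succeeds only if the SNS topic's access policy already allows Amazon S3 to publish to the topic.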

## Related resources
<a name="subscribe-a-lambda-function-to-event-notifications-from-s3-buckets-in-different-aws-regions-resources"></a>
+ [Configuring an access policy](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-add-permissions.html) (Amazon SQS documentation)
+ [Configuring an SQS queue as an event source](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-eventsource) (AWS Lambda documentation)
+ [Configuring an SQS queue to initiate a Lambda function](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-configure-lambda-function-trigger.html) (Amazon SQS documentation)
+ [AWS::Lambda::Function resource](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html) (AWS CloudFormation documentation)

# Three AWS Glue ETL job types for converting data to Apache Parquet
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet"></a>

*Adnan Alvee, Nith Govindasivan, and Karthikeyan Ramachandran, Amazon Web Services*

## Summary
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-summary"></a>

On the Amazon Web Services (AWS) Cloud, AWS Glue is a fully managed extract, transform, and load (ETL) service. AWS Glue makes it cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.

This pattern describes the different job types in AWS Glue and uses three sample scripts to demonstrate how to author ETL jobs.

You can use AWS Glue to write ETL jobs in a Python shell environment. You can also create both batch and streaming ETL jobs by using Python (PySpark) or Scala in a managed Apache Spark environment. To get you started with authoring ETL jobs, this pattern focuses on batch ETL jobs that use Python shell, PySpark, and Scala. Python shell jobs are meant for workloads that require less compute power. The managed Apache Spark environment is meant for workloads that require high compute power.

Apache Parquet is built to support efficient compression and encoding schemes. It can speed up your analytics workloads because it stores data in a columnar fashion. Converting data to Parquet can save you storage space, cost, and time over the long run. To learn more about Parquet, see the blog post [Apache Parquet: How to be a hero with the open-source columnar data format](https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04).

## Prerequisites and limitations
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-prereqs"></a>

**Prerequisites**
+ AWS Identity and Access Management (IAM) role (If you don’t have a role, see the [Additional information](#three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-additional) section.)

## Architecture
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-architecture"></a>

**Target technology stack**
+ AWS Glue
+ Amazon Simple Storage Service (Amazon S3)
+ Apache Parquet

**Automation and scale**
+ [AWS Glue workflows](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html) support full automation of an ETL pipeline.
+ You can change the number of data processing units (DPUs), or worker types, to scale horizontally and vertically.

## Tools
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-tools"></a>

**AWS services**
+ [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.
+ [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html) is a fully managed ETL service for categorizing, cleaning, enriching, and moving your data between various data stores and data streams.

**Other tools**
+ [Apache Parquet](https://parquet.apache.org/) is an open-source column-oriented data file format designed for storage and retrieval.

**Configuration**

Use the following settings for configuring the compute power of AWS Glue ETL. To reduce costs, use the minimal settings when you run the workload that is provided in this pattern. 
+ **Python shell** – You can use 1 DPU to utilize 16 GB of memory or 0.0625 DPU to utilize 1 GB of memory. This pattern uses 0.0625 DPU, which is the default in the AWS Glue console.
+ **Python or Scala for Spark** – If you choose the Spark-related job types in the console, AWS Glue by default uses 10 workers and the G.1X worker type. This pattern uses two workers, which is the minimum number allowed, with the standard worker type, which is sufficient and cost effective.

The following table displays the different AWS Glue worker types for the Apache Spark environment. Because a Python shell job does not use the Apache Spark environment to run Python, it is not included in the table.


|  | Standard | G.1X | G.2X | 
| --- | --- | --- | --- |
| vCPU | 4 | 4 | 8 | 
| Memory | 16 GB | 16 GB | 32 GB | 
| Disk space | 50 GB | 64 GB | 128 GB | 
| Executors per worker | 2 | 1 | 1 | 
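For quick capacity planning, the values in the preceding table can be captured in code. This is a convenience sketch; the numbers are copied from the table, not queried from an AWS API.

```
# AWS Glue worker types for the Apache Spark environment (from the table above).
WORKER_TYPES = {
    "Standard": {"vcpu": 4, "memory_gb": 16, "disk_gb": 50, "executors_per_worker": 2},
    "G.1X": {"vcpu": 4, "memory_gb": 16, "disk_gb": 64, "executors_per_worker": 1},
    "G.2X": {"vcpu": 8, "memory_gb": 32, "disk_gb": 128, "executors_per_worker": 1},
}

def total_executors(worker_type: str, num_workers: int) -> int:
    """Total Spark executors for a given worker type and worker count."""
    return WORKER_TYPES[worker_type]["executors_per_worker"] * num_workers

# The minimal configuration used in this pattern: two Standard workers.
print(total_executors("Standard", 2))
```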

**Code**

For the code that is used in this pattern, including the IAM role and parameter configuration, see the [Additional information](#three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-additional) section.

## Epics
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-epics"></a>

### Upload the data
<a name="upload-the-data"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Upload the data into a new or existing S3 bucket. | Create or use an existing S3 bucket in your account. Upload the sample\_data.csv file from the [Attachments](#attachments-8c926709-8fa4-417f-9aaf-bcc8113d018f) section, and note the S3 bucket and prefix location. | General AWS | 

### Create and run the AWS Glue job
<a name="create-and-run-the-aws-glue-job"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create the AWS Glue job. | Under the ETL section of the AWS Glue console, add an AWS Glue job. Select the appropriate job type, AWS Glue version, and the corresponding DPU/Worker type and number of workers. For details, see the *Configuration* section. | Developer, cloud or data | 
| Change the input and output locations. | Copy the code corresponding to your AWS Glue job, and change the input and output location that you noted in the **Upload the data** epic. | Developer, cloud or data | 
| Configure the parameters. | You can use the snippets provided in the [Additional information](#three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-additional) section to set parameters for your ETL job. AWS Glue uses four argument names internally: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet.html) The `--JOB_NAME` parameter must be explicitly entered on the AWS Glue console. Choose **Jobs**, **Edit Job**, **Security configuration, script libraries, and job parameters (optional)**. Enter `--JOB_NAME` as the key and provide a value. You can also use the AWS Command Line Interface (AWS CLI) or the AWS Glue API to set this parameter. The `--JOB_NAME` parameter is used by Spark and is not needed in a Python shell environment job. You must add `--` before every parameter name; otherwise, the code will not work. For example, for the code snippets, the location parameters must be passed as `--input_loc` and `--output_loc`. | Developer, cloud or data | 
| Run the ETL job. | Run your job and check the output. Note how much space was reduced from the original file. | Developer, cloud or data | 

## Related resources
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-resources"></a>

**References**
+ [Apache Spark](https://spark.apache.org/)
+ [AWS Glue: How it works](https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html)
+ [AWS Glue pricing](https://aws.amazon.com/glue/pricing/)

**Tutorials and videos**
+ [What is AWS Glue?](https://www.youtube.com/watch?v=qgWMfNSN9f4)

## Additional information
<a name="three-aws-glue-etl-job-types-for-converting-data-to-apache-parquet-additional"></a>

**IAM role**

When you create the AWS Glue jobs, you can use either an existing IAM role that has the permissions shown in the following code snippet or a new role.

To create a new role, use the following YAML code.

```
# (c) 2022 Amazon Web Services, Inc. or its affiliates. All Rights Reserved. This AWS Content is provided subject to the terms of the AWS Customer
# Agreement available at https://aws.amazon.com/agreement/ or other written agreement between Customer and Amazon Web Services, Inc.

AWSTemplateFormatVersion: "2010-09-09"

Description: This template sets up an IAM role for the AWS Glue service.

Resources:
  rGlueRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Policies:
        - PolicyName: !Sub "${AWS::StackName}-s3-limited-read-write-inline-policy"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "s3:PutObject"
                  - "s3:GetObject"
                Resource: "arn:aws:s3:::*/*"
      Tags:
        - Key: "Name"
          Value: !Sub "${AWS::StackName}"

Outputs:
  oGlueRoleName:
    Description: AWS Glue IAM role
    Value:
      Ref: rGlueRole
    Export:
      Name: !Join [ ":", [ !Ref "AWS::StackName", rGlueRole ] ]
```

**AWS Glue Python Shell**

The Python code uses the Pandas and PyArrow libraries to convert data to Parquet. The Pandas library is already available in the Python shell environment. The PyArrow library is downloaded each time the job runs, which is acceptable for a one-time run. Alternatively, you can package PyArrow as a wheel file and provide the file as a library package. For more information about packaging wheel files, see [Providing your own Python library](https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html).

*AWS Glue Python shell parameters*

```
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["input_loc", "output_loc"])
```
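`getResolvedOptions` is available only inside the AWS Glue environment. For local testing, you can approximate its behavior for simple string parameters with `argparse`; this stand-in is illustrative and is not the real implementation.

```
import argparse

def resolve_options_locally(argv, option_names):
    """Approximate awsglue.utils.getResolvedOptions for string options.

    Each option is passed on the command line as --<name> <value>,
    which is why every parameter name must be prefixed with --.
    """
    parser = argparse.ArgumentParser()
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    parsed, _ = parser.parse_known_args(argv)
    return vars(parsed)

args = resolve_options_locally(
    ["--input_loc", "s3://my-bucket/prefix/sample_data.csv",
     "--output_loc", "s3://my-bucket/prefix/"],
    ["input_loc", "output_loc"],
)
print(args["input_loc"])
```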

*AWS Glue Python shell code*

```
from io import BytesIO
import pandas as pd
import boto3
import os
import io
import site
from importlib import reload
from setuptools.command import easy_install
install_path = os.environ['GLUE_INSTALLATION']
easy_install.main( ["--install-dir", install_path, "pyarrow"] )
reload(site)
import pyarrow


input_loc = "bucket-name/prefix/sample_data.csv"
output_loc = "bucket-name/prefix/"


input_bucket = input_loc.split('/', 1)[0]
object_key = input_loc.split('/', 1)[1]


output_loc_bucket = output_loc.split('/', 1)[0]
output_loc_prefix = output_loc.split('/', 1)[1] 


s3 = boto3.client('s3')
obj = s3.get_object(Bucket=input_bucket, Key=object_key)
df = pd.read_csv(io.BytesIO(obj['Body'].read()))


parquet_buffer = BytesIO()
s3_resource = boto3.resource('s3')
df.to_parquet(parquet_buffer, index=False) 
s3_resource.Object(output_loc_bucket, output_loc_prefix +  'data' + '.parquet').put(Body=parquet_buffer.getvalue())
```
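Note that boto3 expects a bare bucket name and object key, so the `split('/', 1)` calls above assume that the locations are passed without the `s3://` scheme. A small illustrative helper that accepts both forms:

```
def split_s3_location(location: str):
    """Split 'bucket/key...' or 's3://bucket/key...' into (bucket, key)."""
    if location.startswith("s3://"):
        location = location[len("s3://"):]
    bucket, _, key = location.partition("/")
    return bucket, key

print(split_s3_location("s3://bucket-name/prefix/sample_data.csv"))
```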

**AWS Glue Spark job with Python**

To use an AWS Glue Spark job type with Python, choose **Spark** as the job type. Choose **Spark 3.1, Python 3 with improved job startup time (Glue Version 3.0)** as the AWS Glue version.

*AWS Glue Python parameters*

```
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_loc", "output_loc"])
```

*AWS Glue Spark job with Python code*

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from awsglue.job import Job


sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

input_loc = "s3://bucket-name/prefix/sample_data.csv"
output_loc = "s3://bucket-name/prefix/"

inputDyf = glueContext.create_dynamic_frame_from_options(\
    connection_type = "s3", \
    connection_options = { 
        "paths": [input_loc]}, \
    format = "csv",
    format_options={
        "withHeader": True,
        "separator": ","
    })


outputDF = glueContext.write_dynamic_frame.from_options(\
    frame = inputDyf, \
    connection_type = "s3", \
    connection_options = {"path": output_loc \
        }, format = "parquet")
```

For a large number of compressed big files (for example, 1,000 files that are each about 3 MB), use the `compressionType` parameter with the `recurse` parameter to read all the files that are available within the prefix, as shown in the following code.

```
input_loc = "bucket-name/prefix/"
output_loc = "bucket-name/prefix/"

inputDyf = glueContext.create_dynamic_frame_from_options(
                    connection_type = "s3", 
                    connection_options = {"paths": [input_loc], 
                                            "compressionType":"gzip","recurse" :"True",
                                            },
                    format = "csv",
                    format_options={"withHeader": True,"separator": ","}
                    )
```

For a large number of compressed small files (for example, 1,000 files that are each about 133 KB), use the `groupFiles` parameter, along with both the `compressionType` and the `recurse` parameters. The `groupFiles` parameter groups small files into multiple big files, and the `groupSize` parameter controls the grouping to the specified size in bytes (for example, 1 MB). The following code snippet provides an example of using these parameters within the code.

```
input_loc = "bucket-name/prefix/"
output_loc = "bucket-name/prefix/"

inputDyf = glueContext.create_dynamic_frame_from_options(
                    connection_type = "s3", 
                    connection_options = {"paths": [input_loc], 
                                            "compressionType":"gzip","recurse" :"True",
                                             "groupFiles" :"inPartition",  "groupSize" :"1048576",
                                            },
                    format = "csv",
                    format_options={"withHeader": True,"separator": ","}
                    )
```

Without any change in the worker nodes, these settings enable the AWS Glue job to read multiple files (large or small, with or without compression) and write them to the target in Parquet format.
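The `groupSize` option is specified in bytes, as a string. A tiny convenience helper (a sketch, not part of the AWS Glue API) avoids hard-coding values such as `"1048576"`:

```
def group_size_bytes(mb: int) -> str:
    """Return the groupSize connection option value (bytes, as a string)."""
    return str(mb * 1024 * 1024)

# 1 MB, as used in the snippet above.
print(group_size_bytes(1))
```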

**AWS Glue Spark job with Scala**

To use an AWS Glue Spark job type with Scala, choose **Spark** as the job type and **Language** as **Scala**. Choose **Spark 3.1, Scala 2 with improved job startup time (Glue Version 3.0)** as the AWS Glue version. To save on storage space, the following AWS Glue with Scala sample also uses the `applyMapping` feature to convert data types.

*AWS Glue Scala parameters*

```
import com.amazonaws.services.glue.util.GlueArgParser

val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "inputLoc", "outputLoc").toArray)
```

*AWS Glue Spark job with Scala code*

```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._


object GlueScalaApp {
  def main(sysArgs: Array[String]) {
    
    @transient val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)

    val inputLoc = "s3://bucket-name/prefix/sample_data.csv"
    val outputLoc = "s3://bucket-name/prefix/"

    val readCSV = glueContext.getSource("csv", JsonOptions(Map("paths" -> Set(inputLoc)))).getDynamicFrame()

    val applyMapping = readCSV.applyMapping(mappings = Seq(("_c0", "string", "date", "string"), ("_c1", "string", "sales", "long"),
    ("_c2", "string", "profit", "double")), caseSensitive = false)

    val formatPartition = applyMapping.toDF().coalesce(1)

    val dynamicFrame = DynamicFrame(formatPartition, glueContext)

    val dataSink = glueContext.getSinkWithFormat(
        connectionType = "s3", 
        options = JsonOptions(Map("path" -> outputLoc )),
        transformationContext = "dataSink", format = "parquet").writeDynamicFrame(dynamicFrame)
  }
}
```

## Attachments
<a name="attachments-8c926709-8fa4-417f-9aaf-bcc8113d018f"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/8c926709-8fa4-417f-9aaf-bcc8113d018f/attachments/attachment.zip)

# Visualize Amazon Redshift audit logs using Amazon Athena and Amazon QuickSight
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight"></a>

*Sanket Sirsikar and Gopal Krishna Bhatia, Amazon Web Services*

## Summary
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight-summary"></a>

Security is an integral part of database operations on the Amazon Web Services (AWS) Cloud. Your organization should ensure that it monitors database user activities and connections to detect potential security incidents and risks. This pattern helps you monitor your databases for security and troubleshooting purposes, which is a process known as database auditing.

 This pattern provides a SQL script that automates the creation of an Amazon Athena table and views for a reporting dashboard in Amazon QuickSight that helps you audit Amazon Redshift logs. This ensures that users responsible for monitoring database activities have convenient access to data security features.  

## Prerequisites and limitations
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight-prereqs"></a>

**Prerequisites**
+ An active AWS account.
+ An existing Amazon Redshift cluster. For more information about this, see [Create an Amazon Redshift cluster](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data-launch-cluster.html) in the Amazon Redshift documentation.
+ Access to an existing Athena workgroup. For more information, see [How workgroups work](https://docs.aws.amazon.com/athena/latest/ug/user-created-workgroups.html) in the Amazon Athena documentation. 
+ An existing Amazon Simple Storage Service (Amazon S3) source bucket with the required AWS Identity and Access Management (IAM) permissions. For more information, see [Bucket permissions for Amazon Redshift audit logging](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html#db-auditing-bucket-permissions) from [Database audit logging](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html) in the Amazon Redshift documentation.

## Architecture
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight-architecture"></a>

![\[Data flow diagram showing Amazon Redshift, logs, S3 bucket, Amazon Athena, and Amazon QuickSight.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/90e69009-001e-4ced-bef0-3c361f93ae87/images/9fde7f01-17ab-4207-8a59-a12daf85a382.png)

**Technology stack**
+ Athena
+ Amazon Redshift 
+ Amazon S3 
+ QuickSight

## Tools
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight-tools"></a>
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) – Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
+ [Amazon QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html) – QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service. 
+ [Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html) – Amazon Redshift is an enterprise-level, petabyte scale, fully managed data warehousing service. 
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) – Amazon Simple Storage Service (Amazon S3) is storage for the internet.

## Epics
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight-epics"></a>

### Configure the Amazon Redshift cluster
<a name="configure-the-amazon-redshift-cluster"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
|  Enable audit logging for the Amazon Redshift cluster. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight.html) | DBA, Data engineer | 
| Enable logging in the Amazon Redshift cluster parameter group. | You can enable auditing of connection logs, user logs, and user activity logs at the same time by using the AWS Management Console, the Amazon Redshift API, or the AWS Command Line Interface (AWS CLI). For auditing of user activity logs, you must also enable the `enable_user_activity_logging` database parameter. If you enable only the audit logging feature but not the associated parameter, the database audit logs the logging information for the connection and user logs but not for the user activity logs. The `enable_user_activity_logging` parameter is not enabled by default, but you can enable it by changing it from `false` to `true`. To do this, create a new cluster parameter group with the `enable_user_activity_logging` parameter enabled and attach it to your Amazon Redshift cluster. For more information, see [Modifying a cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html#modify-cluster), [Amazon Redshift parameter groups](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-parameter-groups.html), and [Configuring auditing using the console](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing-console.html) in the Amazon Redshift documentation. | DBA, Data engineer | 
| Configure S3 bucket permissions for Amazon Redshift cluster logging. | When you enable logging, Amazon Redshift collects logging information and uploads it to log files stored in an S3 bucket. You can use an existing S3 bucket or create a new bucket. Make sure that Amazon Redshift has the required IAM permissions to access the S3 bucket. For more information about this, see [Bucket permissions for Amazon Redshift audit logging](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html#db-auditing-bucket-permissions) from [Database audit logging](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html) in the Amazon Redshift documentation. | DBA, Data engineer | 

### Create the Athena table and views
<a name="create-the-athena-table-and-views"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create the Athena table and views to query Amazon Redshift audit log data from the S3 bucket. | Open the Amazon Athena console and use the data definition language (DDL) query from the `AuditLogging.sql` SQL script (attached) to create the table and views for user activity logs, user logs, and connection logs. For more information and instructions, see the [Create tables and run queries](https://athena-in-action.workshop.aws/30-basics/301-create-tables.html) tutorial from the Amazon Athena Workshop. | Data engineer | 

### Set up log monitoring in the QuickSight dashboard
<a name="set-up-log-monitoring-in-the-quicksight-dashboard"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Create a QuickSight dashboard using Athena as the data source. | Open the Amazon QuickSight console and create a QuickSight dashboard by following the instructions in the [Visualize with QuickSight using Athena](https://athena-in-action.workshop.aws/30-basics/307-quicksight.html) tutorial from the Amazon Athena Workshop. | DBA, Data engineer | 

## Related resources
<a name="visualize-amazon-redshift-audit-logs-using-amazon-athena-and-amazon-quicksight-resources"></a>
+ [Create tables and run queries in Athena](https://athena-in-action.workshop.aws/30-basics/301-create-tables.html)
+ [Visualize with QuickSight using Athena](https://athena-in-action.workshop.aws/30-basics/307-quicksight.html)

## Attachments
<a name="attachments-90e69009-001e-4ced-bef0-3c361f93ae87"></a>

To access additional content that is associated with this document, unzip the following file: [attachment.zip](samples/p-attach/90e69009-001e-4ced-bef0-3c361f93ae87/attachments/attachment.zip)

# Visualize IAM credential reports for all AWS accounts using Amazon Quick Sight
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight"></a>

*Parag Nagwekar and Arun Chandapillai, Amazon Web Services*

## Summary
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight-summary"></a>


| Warning: IAM users have long-term credentials, which presents a security risk. To help mitigate this risk, we recommend that you provide these users with only the permissions they require to perform the task and that you remove these users when they are no longer needed. | 
| --- |

You can use AWS Identity and Access Management (IAM) credential reports to help you meet the security, auditing, and compliance requirements of your organization. [Credential reports](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_getting-report.html) provide a list of all the users in your AWS accounts and show the status of their credentials, such as passwords, access keys, and multi-factor authentication (MFA) devices. You can use credential reports for multiple AWS accounts managed by [AWS Organizations](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/core-concepts.html).

This pattern includes steps and code to help you create and share IAM credential reports for all the AWS accounts in your organization by using Amazon Quick Sight dashboards. You can share the dashboards with stakeholders in your organization. The reports can help your organization achieve the following targeted business outcomes:
+ Identify security incidents related to IAM users
+ Track real-time migration of IAM users to single sign-on (SSO) authentication
+ Track AWS Regions accessed by IAM users
+ Stay compliant
+ Share information with other stakeholders

## Prerequisites and limitations
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight-prereqs"></a>

**Prerequisites**
+ An active AWS account
+ An [organization](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_tutorials_basic.html) with member accounts
+ An [IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html) with permissions to access accounts in Organizations
+ AWS Command Line Interface (AWS CLI) version 2, [installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions) and [configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
+ A [subscription](https://docs.aws.amazon.com/quicksight/latest/user/signing-up.html) to [Amazon Quick Enterprise edition](https://docs.aws.amazon.com/quicksight/latest/user/editions.html)

## Architecture
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight-architecture"></a>

**Technology stack**
+ Amazon Athena
+ Amazon EventBridge
+ Amazon Quick Sight
+ Amazon Simple Storage Service (Amazon S3)
+ AWS Glue
+ AWS Identity and Access Management (IAM)
+ AWS Lambda
+ AWS Organizations

**Target architecture**

The following diagram shows an architecture for setting up a workflow that captures IAM credential report data from multiple AWS accounts.

![\[The following screenshot illustrates the architecture diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/images/pattern-img/8724ff28-40f6-4c43-9c65-fbd18bbbfd0f/images/e780916a-4ab7-4fdc-8ecc-c837c7d90d13.png)


1. EventBridge invokes a Lambda function daily.

1. The Lambda function assumes an IAM role in every AWS account across the organization. Then, the function creates the IAM credential report and stores the report data in a centralized S3 bucket. Make sure that you enable encryption and block public access on the S3 bucket.

1. An AWS Glue crawler crawls the S3 bucket daily and updates the Athena table accordingly.

1. Quick Sight imports and analyzes the data from the credential reports and builds a dashboard that stakeholders can view and share.
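The Lambda logic in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the pattern's repository: the member-account role name `OrgCredReportRole`, the bucket name, and the S3 key layout (`<account-id>/<date>/credential_report.csv`) are placeholders chosen so that the AWS Glue crawler can surface account and date as partition columns.

```python
import datetime


def report_key(account_id: str, day: datetime.date) -> str:
    # Partition by account and date so the AWS Glue crawler exposes them
    # as partition columns (for example, partition_0 and partition_1).
    return f"{account_id}/{day.isoformat()}/credential_report.csv"


def lambda_handler(event, context):
    import time
    import boto3  # available in the Lambda runtime

    bucket = "central-credential-reports"  # assumed central S3 bucket name
    role_name = "OrgCredReportRole"        # assumed member-account role name
    sts = boto3.client("sts")
    org = boto3.client("organizations")
    s3 = boto3.client("s3")
    today = datetime.date.today()

    for page in org.get_paginator("list_accounts").paginate():
        for account in page["Accounts"]:
            if account["Status"] != "ACTIVE":
                continue
            creds = sts.assume_role(
                RoleArn=f"arn:aws:iam::{account['Id']}:role/{role_name}",
                RoleSessionName="credential-report",
            )["Credentials"]
            iam = boto3.client(
                "iam",
                aws_access_key_id=creds["AccessKeyId"],
                aws_secret_access_key=creds["SecretAccessKey"],
                aws_session_token=creds["SessionToken"],
            )
            # Report generation is asynchronous; poll until it completes.
            while iam.generate_credential_report()["State"] != "COMPLETE":
                time.sleep(2)
            report = iam.get_credential_report()["Content"]  # CSV bytes
            s3.put_object(
                Bucket=bucket,
                Key=report_key(account["Id"], today),
                Body=report,
            )
```

The handler imports `boto3` lazily because it is preinstalled in the Lambda runtime; the key layout produced by `report_key` is what the crawler later maps to the `partition_0` and `partition_1` fields mentioned in the Additional information section.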

## Tools
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight-tools"></a>

**AWS services**
+ [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/what-is.html) is an interactive query service that makes it easy to analyze data in Amazon S3 by using standard SQL.
+ [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) is a serverless event bus service that helps you connect your applications with real-time data from a variety of sources, such as Lambda functions, HTTP invocation endpoints using API destinations, or event buses in other AWS accounts.
+ [Amazon Quick Sight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html) is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard. Quick Sight is a core component within Amazon Quick, providing interactive data visualization, SPICE in-memory analytics, embedded analytics, and dashboard sharing.
+ [AWS Identity and Access Management (IAM)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.
+ [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.

**Code**

The code for this pattern is available in the GitHub [getiamcredsreport-allaccounts-org](https://github.com/aws-samples/getiamcredsreport-allaccounts-org) repository. You can use the code from this repository to create IAM credential reports across AWS accounts in Organizations and store them in a central location.

## Epics
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight-epics"></a>

### Set up the infrastructure
<a name="set-up-the-infrastructure"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Set up Amazon Quick Enterprise edition. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight.html) | AWS administrator, AWS DevOps, Cloud administrator, Cloud architect | 
| Integrate Amazon Quick Sight with Amazon S3 and Athena. | You must [authorize](https://docs.aws.amazon.com/quicksight/latest/user/troubleshoot-connect-to-datasources.html) Quick Sight to use Amazon S3 and Athena before you deploy the AWS CloudFormation stack. | AWS administrator, AWS DevOps, Cloud administrator, Cloud architect | 

### Deploy the infrastructure
<a name="deploy-the-infrastructure"></a>


| Task | Description | Skills required | 
| --- | --- | --- | 
| Clone the GitHub repository. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight.html) | AWS administrator | 
| Deploy the infrastructure. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight.html) | AWS administrator | 
| Create an IAM permission policy. | [Create an IAM policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html) for every AWS account across your organization with the following permissions:<pre>{<br />  "Version": "2012-10-17",<br />  "Statement": [<br />    {<br />      "Effect": "Allow",<br />      "Action": [<br />        "iam:GenerateCredentialReport",<br />        "iam:GetCredentialReport"<br />      ],<br />      "Resource": "*"<br />    }<br />  ]<br />}</pre> | AWS DevOps, Cloud administrator, Cloud architect, Data engineer | 
| Create an IAM role with a trust policy. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight.html)<pre>{<br />  "Version": "2012-10-17",<br />  "Statement": [<br />    {<br />      "Effect": "Allow",<br />      "Principal": {<br />        "AWS": [<br />          "arn:aws:iam::<MasterAccountID>:role/<LambdaRole>"<br />        ]<br />      },<br />      "Action": "sts:AssumeRole"<br />    }<br />  ]<br />}</pre>Replace `arn:aws:iam::<MasterAccountID>:role/<LambdaRole>` with the ARN of the Lambda role that you noted previously. Organizations typically use automation to create IAM roles for their AWS accounts. We recommend that you use this automation, if available. Alternatively, you can use the `CreateRoleforOrg.py` script from the code repository. The script requires an existing administrative role or any other IAM role that has permission to create an IAM policy and role in every AWS account. | Cloud administrator, Cloud architect, AWS administrator | 
| Configure Amazon Quick Sight to visualize the data. | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight.html) | AWS DevOps, Cloud administrator, Cloud architect, Data engineer | 
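The policy and role creation described in the table above can be scripted with boto3, similar in spirit to the repository's `CreateRoleforOrg.py` script. This is a sketch, not that script: the `OrgCredReportRole` and `OrgCredReportPolicy` names are assumed placeholders, and `create_member_role` must run with credentials that can create IAM resources in the target member account.

```python
import json
from typing import Any


def trust_policy(master_account_id: str, lambda_role: str) -> dict[str, Any]:
    """Trust policy that lets the central Lambda role assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "AWS": [f"arn:aws:iam::{master_account_id}:role/{lambda_role}"]
            },
            "Action": "sts:AssumeRole",
        }],
    }


# Permissions from the "Create an IAM permission policy" task above.
REPORT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["iam:GenerateCredentialReport", "iam:GetCredentialReport"],
        "Resource": "*",
    }],
}


def create_member_role(master_account_id: str, lambda_role: str) -> None:
    import boto3  # run with credentials for the member account

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName="OrgCredReportRole",  # assumed name; match the Lambda config
        AssumeRolePolicyDocument=json.dumps(
            trust_policy(master_account_id, lambda_role)),
    )
    policy = iam.create_policy(
        PolicyName="OrgCredReportPolicy",
        PolicyDocument=json.dumps(REPORT_POLICY),
    )
    iam.attach_role_policy(
        RoleName=role["Role"]["RoleName"],
        PolicyArn=policy["Policy"]["Arn"],
    )
```

Keeping the trust policy in a pure function makes it easy to reuse the same document when you loop over all member accounts, whatever automation you use to obtain credentials for each one.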

## Additional information
<a name="visualize-iam-credential-reports-for-all-aws-accounts-using-amazon-quicksight-additional"></a>

**Additional considerations**

Consider the following:
+ After you use CloudFormation to deploy the infrastructure, you can wait for Lambda and AWS Glue to run on their schedules, at which point the reports appear in Amazon S3 and become available for analysis in Athena. Alternatively, you can invoke the Lambda function manually to generate the reports in Amazon S3, and then run the AWS Glue crawler so that the Athena table is created from the data.
+ Quick Sight is a powerful tool for analyzing and visualizing data based on your business requirements. You can use [parameters](https://docs.aws.amazon.com/quicksight/latest/user/parameters-in-quicksight.html) in Quick Sight to control widget data based on data fields that you choose. For example, in a Quick Sight analysis you can create Account, Date, and User parameters from dataset fields such as `partition_0`, `partition_1`, and `user`, and then add controls for those parameters.
+ To build your own Quick Sight dashboards, see [Quick Workshops](https://catalog.workshops.aws/quicksight/en-US) from the AWS Workshop Studio website.
+ To see sample Quick Sight dashboards, see the GitHub [getiamcredsreport-allaccounts-org](https://github.com/aws-samples/getiamcredsreport-allaccounts-org) code repository.
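The first consideration above, running the pipeline on demand instead of waiting for the schedules, can be sketched with boto3. The function and crawler names are placeholders; substitute the names created by the CloudFormation stack.

```python
import time
from typing import Callable


def wait_for_crawler(get_state: Callable[[], str],
                     timeout_s: float = 600, poll_s: float = 10) -> str:
    """Poll until the crawler leaves the RUNNING/STOPPING states."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state not in ("RUNNING", "STOPPING"):
            return state  # "READY" once the crawl has finished
        if time.monotonic() > deadline:
            raise TimeoutError("crawler did not finish in time")
        time.sleep(poll_s)


def run_pipeline_once(function_name: str, crawler_name: str) -> None:
    import boto3

    # 1. Generate the reports in Amazon S3 by invoking the Lambda function.
    boto3.client("lambda").invoke(FunctionName=function_name)

    # 2. Crawl the bucket so that the Athena table reflects the new data.
    glue = boto3.client("glue")
    glue.start_crawler(Name=crawler_name)
    wait_for_crawler(
        lambda: glue.get_crawler(Name=crawler_name)["Crawler"]["State"])
```

Injecting the state lookup as a callable keeps `wait_for_crawler` testable without AWS access, and the same polling loop works for any long-running Glue operation.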

**Targeted business outcomes**

You can use this pattern to achieve the following targeted business outcomes:
+ **Identify security incidents related to IAM users** – Investigate every user across every AWS account in your organization by using a single pane of glass. You can track trends in the AWS Regions that an IAM user most recently accessed and the services they used.
+ **Track real-time migration of IAM users to SSO authentication** – By using SSO, users can sign in once with a single credential and access multiple AWS accounts and applications. If you’re planning to migrate your IAM users to SSO, this pattern can help you transition to SSO and track all IAM user credential usage (such as access to the AWS Management Console or usage of access keys) across all AWS accounts.
+ **Track AWS Regions accessed by IAM users** – You can control IAM user access to Regions for various purposes, such as data sovereignty and cost control. You can also track use of Regions by any IAM user.
+ **Stay compliant** – By following the principle of least privilege, you can grant only the specific IAM permissions that are required to perform a specific task. Also, you can track access to AWS services, the AWS Management Console, and long-term credentials usage.
+ **Share information with other stakeholders** – You can share curated dashboards with other stakeholders, without granting them access to IAM credential reports or AWS accounts.

# More patterns
<a name="analytics-more-patterns-pattern-list"></a>

**Topics**
+ [Authenticate existing React application users by using Amazon Cognito and AWS Amplify UI](authenticate-react-app-users-cognito-amplify-ui.md)
+ [Automatically extract content from PDF files using Amazon Textract](automatically-extract-content-from-pdf-files-using-amazon-textract.md)
+ [Build a cold start forecasting model by using DeepAR for time series in Amazon SageMaker AI Studio Lab](build-a-cold-start-forecasting-model-by-using-deepar.md)
+ [Create detailed cost and usage reports for Amazon EMR clusters by using AWS Cost Explorer](create-detailed-cost-and-usage-reports-for-amazon-emr-clusters-by-using-aws-cost-explorer.md)
+ [Create detailed cost and usage reports for Amazon RDS and Amazon Aurora](create-detailed-cost-and-usage-reports-for-amazon-rds-and-amazon-aurora.md)
+ [Create detailed cost and usage reports for AWS Glue jobs by using AWS Cost Explorer](create-detailed-cost-and-usage-reports-for-aws-glue-jobs-by-using-aws-cost-explorer.md)
+ [Deliver DynamoDB records to Amazon S3 using Kinesis Data Streams and Firehose with AWS CDK](deliver-dynamodb-records-to-amazon-s3-using-kinesis-data-streams-and-amazon-data-firehose-with-aws-cdk.md)
+ [Deploy an AWS Glue job with an AWS CodePipeline CI/CD pipeline](deploy-an-aws-glue-job-with-an-aws-codepipeline-ci-cd-pipeline.md)
+ [Embed Amazon Quick Sight visual components into web applications by using Amazon Cognito and IaC automation](embed-quick-sight-visual-components-into-web-apps-cognito-iac.md)
+ [Estimate storage costs for an Amazon DynamoDB table](estimate-storage-costs-for-an-amazon-dynamodb-table.md)
+ [Execute Amazon Redshift SQL queries by using Terraform](execute-redshift-sql-queries-using-terraform.md)
+ [Extract and query AWS IoT SiteWise metadata attributes in a data lake](extract-and-query-aws-iot-sitewise-metadata-attributes-in-a-data-lake.md)
+ [Generate data insights by using AWS Mainframe Modernization and Amazon Q in Quick Sight](generate-data-insights-by-using-aws-mainframe-modernization-and-amazon-q-in-quicksight.md)
+ [Generate Db2 z/OS data insights by using AWS Mainframe Modernization and Amazon Q in Quick Sight](generate-db2-zos-data-insights-aws-mainframe-modernization-amazon-q-in-quicksight.md)
+ [Give SageMaker notebook instances temporary access to a CodeCommit repository in another AWS account](give-sagemaker-notebook-instances-temporary-access-to-a-codecommit-repository-in-another-aws-account.md)
+ [Identify and alert when Amazon Data Firehose resources are not encrypted with an AWS KMS key](identify-and-alert-when-amazon-data-firehose-resources-are-not-encrypted-with-an-aws-kms-key.md)
+ [Import the psycopg2 library to AWS Lambda to interact with your PostgreSQL database](import-psycopg2-library-lambda.md)
+ [Ingest and analyze AWS security logs in Microsoft Sentinel](ingest-analyze-aws-security-logs-sentinel.md)
+ [Migrate Amazon RDS for Oracle to Amazon RDS for PostgreSQL in SSL mode by using AWS DMS](migrate-amazon-rds-for-oracle-to-amazon-rds-for-postgresql-in-ssl-mode-by-using-aws-dms.md)
+ [Migrate an Oracle database to Amazon RDS for Oracle by using Oracle GoldenGate flat file adapters](migrate-an-oracle-database-to-amazon-rds-for-oracle-by-using-oracle-goldengate-flat-file-adapters.md)
+ [Migrate an Oracle Database to Amazon Redshift using AWS DMS and AWS SCT](migrate-an-oracle-database-to-amazon-redshift-using-aws-dms-and-aws-sct.md)
+ [Migrate data from an on-premises Hadoop environment to Amazon S3 using DistCp with AWS PrivateLink for Amazon S3](migrate-data-from-an-on-premises-hadoop-environment-to-amazon-s3-using-distcp-with-aws-privatelink-for-amazon-s3.md)
+ [Migrate from Couchbase Server to Couchbase Capella on AWS](migrate-from-couchbase-server-to-couchbase-capella-on-aws.md)
+ [Migrate on-premises Cloudera workloads to Cloudera Data Platform on AWS](migrate-on-premises-cloudera-workloads-to-cloudera-data-platform-on-aws.md)
+ [Manage Multi-AZ failover for EMR clusters by using Application Recovery Controller](multi-az-failover-spark-emr-clusters-arc.md)
+ [Automate blue/green deployments of Amazon Aurora global databases by using IaC principles](p-automate-blue-green-deployments-aurora-global-databases-iac.md)
+ [Provision AWS Service Catalog products based on AWS CloudFormation templates by using GitHub Actions](provision-aws-service-catalog-products-using-github-actions.md)
+ [Run unit tests for Python ETL jobs in AWS Glue using the pytest framework](run-unit-tests-for-python-etl-jobs-in-aws-glue-using-the-pytest-framework.md)
+ [Set up a Grafana monitoring dashboard for AWS ParallelCluster](set-up-a-grafana-monitoring-dashboard-for-aws-parallelcluster.md)
+ [Unload data from an Amazon Redshift cluster across accounts to Amazon S3](unload-data-from-amazon-redshift-cross-accounts-to-amazon-s3.md)
+ [Visualize AI/ML model results using Flask and AWS Elastic Beanstalk](visualize-ai-ml-model-results-using-flask-and-aws-elastic-beanstalk.md)