Aggregate data in Amazon DynamoDB for ML forecasting in Athena - AWS Prescriptive Guidance

Aggregate data in Amazon DynamoDB for ML forecasting in Athena

Created by Sachin Doshi (AWS) and Peter Molnar (AWS)

Code repository: Use ML predictions over Amazon DynamoDB data with Amazon Athena ML

Environment: Production

Technologies: Machine learning & AI; Databases; Serverless

Workload: Open-source

AWS services: Amazon Athena; Amazon DynamoDB; AWS Lambda; Amazon SageMaker; Amazon QuickSight

Summary

This pattern shows you how to build complex aggregations of Internet of Things (IoT) data in an Amazon DynamoDB table by using Amazon Athena. You also learn how to enrich the data with machine learning (ML) inference by using Amazon SageMaker and how to query geospatial data by using Athena. You can use this pattern as the basis for creating an ML forecasting solution that meets your organization's requirements.

For demonstration purposes, this pattern uses an example scenario of a business that's operating a scooter rideshare and wants to predict the optimal number of scooters that must be deployed for customers in different urban neighborhoods. The business uses a pre-trained ML model that predicts customer demand for the next hour based on the past four hours. The scenario uses a public dataset from the Office of Civic Innovation & Technology for the Louisville Metro government. The resources for this scenario are available in a GitHub repository.

Prerequisites and limitations

  • An active AWS account

  • Permissions to create an AWS CloudFormation stack with AWS Identity and Access Management (IAM) roles for the following:

    • Amazon Simple Storage Service (Amazon S3) bucket

    • Athena

    • DynamoDB

    • SageMaker

    • AWS Lambda

Architecture

Technology stack

  • Amazon QuickSight

  • Amazon S3

  • Athena

  • DynamoDB

  • Lambda

  • SageMaker

Target architecture

The following diagram shows an architecture for building complex aggregations of data in DynamoDB by using the querying capabilities of Athena, a Lambda function, Amazon S3 storage, a SageMaker endpoint, and a QuickSight dashboard.

Architecture for building complex aggregations of data in DynamoDB.

The diagram shows the following workflow:

  1. A DynamoDB table ingests IoT data that's transmitted from a fleet of scooters.

  2. A Lambda function loads the DynamoDB table with the ingested data.

  3. An Athena query creates a new DynamoDB table for the geospatial data that represents the urban neighborhoods.

  4. The query location is saved in an S3 bucket.

  5. An Athena function queries the ML inference from the SageMaker endpoint that hosts the pre-trained ML model.

  6. Athena queries data directly from the DynamoDB tables and aggregates the data for analysis.

  7. A user views the output of the analyzed data in a QuickSight dashboard.

Tools

AWS services

  • Amazon Athena is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL.

  • Amazon DynamoDB is a fully managed NoSQL database service that provides fast, predictable, and scalable performance.

  • Amazon SageMaker is a managed ML service that helps you build and train ML models and then deploy them into a production-ready hosted environment.

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

  • Amazon QuickSight is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard.

  • AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.

Code repository

The code for this pattern is available in the GitHub Use ML predictions over Amazon DynamoDB data with Amazon Athena ML repository. You can use the CloudFormation template from the repository to create the following resources used in the example scenario:

  • A DynamoDB table

  • A Lambda function to load the table with relevant data

  • A SageMaker endpoint for inference requests, with the pre-trained XGBoost model that's stored in Amazon S3

  • An Athena workgroup named V2EngineWorkGroup

  • Named Athena queries to look up the geospatial shapefiles and predict scooter demand

  • A prebuilt Amazon Athena DynamoDB connector that enables Athena to communicate with DynamoDB and uses AWS Serverless Application Model (AWS SAM) to build the application in reference to the DynamoDB connector

Epics

TaskDescriptionSkills required

Download the dataset and resources.

  1. Download a public dataset of dockless vehicle rentals. For demo purposes, this data is prepopulated in DynamoDB as part of the use case, but in a production environment you send this data to DynamoDB through various mechanisms such as IoT devices or Amazon Kinesis consumers. These mechanisms use Lambda to insert data into DynamoDB.

  2. Download the GIS shapefiles that represent the boundaries of historical and cultural neighborhoods within the city of Louisville, KY. The public dataset is provided by the Louisville and Jefferson County, KY Information Consortium. The original shapefiles are already converted into a text file that you can query with Athena, but you can find the Python code for transforming shapefiles in the Jupyter notebook at Geo-Spatial processing of GIS shapefiles with Amazon Athena in GitHub.

  3. Download the pretrained Python code that trains the ML model for hourly forecasts by using SageMaker and Athena.

  4. Get the SQL query in Athena that brings everything together for live predictions from the data stored in DynamoDB.

  5. (Optionally) Use QuickSight to visualize geospatial data over a map of Louisville, Kentucky.

App developer, Data scientist
TaskDescriptionSkills required

Create a CloudFormation stack.

  1. Download the CloudFormation template from the GitHub repository.

  2. Sign in to the AWS Management Console, and then choose us-east-1. Note: The ML model is stored in the Amazon Elastic Container Registry (Amazon ECR) for the us-east-1 AWS Region, but the pattern is Region agnostic. You can replicate the pattern in any Region where the AWS services used in this pattern are supported.

  3. Open the CloudFormation console, and then choose Stacks on the navigation pane.

  4. Choose Create stack, and then choose With existing resources (import resources).

  5. On the Identify resources page, choose Next.

  6. In the Specify template section, for Template source, select Upload a template file.

  7. Choose File, and then choose the CloudFormation template that you downloaded earlier.

  8. Choose Next, accept the default parameter values, and choose Next to step through the rest of the setup wizard.

  9. Select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box.

  10. Choose Create stack.

Note: It can take 15–20 minutes for the CloudFormation stack to create these resources.

AWS DevOps

Verify the CloudFormation deployment.

To verify that the sample data from the CloudFormation template is loaded into DynamoDB, do the following:

  1. Open the DynamoDB console, and then choose Tables from the navigation pane.

  2. In the Tables section, check for the DynamoDBTableDocklessVehicles table.

  3. After resource creation is complete, open the Athena console, and then choose Workgroups from the navigation pane.

  4. Choose the V2EngineWorkGroup workgroup, and then choose Switch workgroup.

  5. If you get a prompt to save the query result location, choose an Amazon S3 location where you have write permissions.

  6. Choose Save.

  7. In the navigation pane, choose Query editor, and then select the athena-ml-db-<your-AWS-account-number> database.

App developer
TaskDescriptionSkills required

Create an Athena table with geospatial data.

To load the geolocation files into Athena, do the following:

  1. Open the Athena console, and then choose Query editor from the navigation pane.

  2. Choose the Saved queries tab.

  3. Search for and select Q1: Neighborhoods.

  4. To return to the query editor, choose the Editor tab.

  5. Choose Run. This creates a table named louisville_ky_neighborhoods in your database. Make sure the table is created in the athena-ml-db-<your-AWS-account-number> database.

The query creates a new table for the geospatial data that represents the urban neighborhoods. The data table is created from GIS shapefiles. The CREATE EXTERNAL TABLE statement defines the schema of the table and the location and format of the underlying data file.

For the Python code to process shapefiles and produce this table, see Geo-Spatial processing of GIS shapefiles with Amazon Athena in AWS Samples. For detailed SQL code, see create_neighborhood_table.sql on GitHub.

Data engineer
TaskDescriptionSkills required

Declare a function in Athena to query SageMaker.

  1. Open the Athena console, choose Query editor from the navigation pane, and then choose the Editor tab.

  2. Copy and paste the following SQL statement into the query editor.

    USING EXTERNAL FUNCTION predict_demand ( location_id BIGINT, hr BIGINT , dow BIGINT, n_pickup_1 BIGINT, n_pickup_2 BIGINT, n_pickup_3 BIGINT, n_pickup_4 BIGINT, n_dropoff_1 BIGINT, n_dropoff_2 BIGINT, n_dropoff_3 BIGINT, n_dropoff_4 BIGINT ) RETURNS DOUBLE SAGEMAKER '<Your SageMaker endpoint>'

    The first part of the SQL statement declares the external function to query ML inferences from the SageMaker endpoint that hosts the pre-trained model.

  3. Define the order and type of the input parameters and the type of the return values.

  4. Choose Run.

Data scientist, Data engineer

Predict demand for scooters by neighborhood from the aggregated DynamoDB data.

Now you can use Athena to query transactional data directly from DynamoDB, and then aggregate the data for analysis and forecasting. This isn’t easily achieved by directly querying a DynamoDB NoSQL database.

  1. Open the Athena console, and then choose the Query editor from the navigation pane.

  2. Choose the Saved queries tab.

  3. Search for and select Q2: DynamoDBAthenaMLScooterPredict.

  4. To return to the query editor, choose the Editor tab.

  5. Choose Run.

The SQL statement does the following:

  • Uses an Athena Federated Query to query the DynamoDB table with the raw trip data

  • Places geographic coordinates into neighborhoods by using the geospatial functions of Athena

  • Enriches data with ML inference by using SageMaker

For information about using SQL to aggregate DynamoDB data and SageMaker inference data in Athena, see athena_long.sql in GitHub.

App developer, Data scientist

Verify the output.

The output table includes the neighborhood, longitude, and latitude of the centroid of the neighborhood. It also includes the number of vehicles that are predicted for the next hour.

The query produces the predictions for a selected point in time. You can make predictions for any other time by changing the expression TIMESTAMP '2019-09-07 15:00' everywhere in the statement.

If you have a real-time data feed in your DynamoDB table, change the timestamp to NOW().

App developer, Data scientist
TaskDescriptionSkills required

Delete resources.

  1. Open the Athena console and empty the bucket that you created as part of the CloudFormation stack.

  2. Open the CloudFormation console, and then delete the stack named bdb-1462-athena-dynamodb-ml-stack.

  3. Open the Amazon CloudWatch console, and then delete the log group named /aws/sagemaker/Endpoints/Sg-athena-ml-dynamodb-model-endpoint.

App developer, AWS DevOps

Related resources