Process data in an Amazon S3 bucket with Distributed Map - AWS Step Functions

Process data in an Amazon S3 bucket with Distributed Map

This sample project demonstrates how you can use the Distributed Map state to process large-scale data, for example, analyze historical weather data and identify the weather station that has the highest average temperature on the planet each month. The weather data is recorded in over 12,000 CSV files, which in turn are stored in an Amazon S3 bucket.

This sample project includes two Distributed Map states named Distributed S3 copy NOA Data and ProcessNOAAData. Distributed S3 copy NOA Data iterates over the CSV files in a public Amazon S3 bucket named noaa-gsod-pds and copies them to an Amazon S3 bucket in your AWS account. ProcessNOAAData iterates over the copied files and includes a Lambda function that performs the temperature analysis.

The sample project first checks the contents of the Amazon S3 bucket with a call to the ListObjectsV2 API action. Based on the number of keys returned in response to this call, the sample project takes one of the following decisions:

  • If the key count is more than or equal to 1, the project transitions to the ProcessNOAAData state. This Distributed Map state includes a Lambda function named TemperatureFunction that finds the weather station that had the highest average temperature each month. This function returns a dictionary with year-month as the key and a dictionary that contains information about the weather station as the value.

  • If the returned key count doesn't exceed 1, the Distributed S3 copy NOA Data state lists all objects from the public bucket noaa-gsod-pds and iteratively copies the individual objects to another bucket in your account in batches of 100. An Inline Map performs the iterative copying of the objects.

    After all objects are copied, the project transitions to the ProcessNOAAData state for processing the weather data.

The sample project finally transitions to a reducer Lambda function that performs a final aggregation of the results returned by the TemperatureFunction function and writes the results to an Amazon DynamoDB table.

With Distributed Map, you can run up to 10,000 parallel child workflow executions at a time. In this sample project, the maximum concurrency of ProcessNOAAData Distributed Map is set at 3000 that limits it to 3000 parallel child workflow executions.

This sample project creates the state machine, the supporting AWS resources, and configures the related IAM permissions. Explore this sample project to learn about using the Distributed Map for orchestrating large-scale, parallel workloads, or use it as a starting point for your own projects.

Important

This sample project is only available in the US East (N. Virginia) Region.

Step 1: Create the state machine

  1. Open the Step Functions console and choose Create state machine.

  2. Find and choose the starter template you want to work with. Choose Next to continue.

  3. Choose Run a demo to create a read-only and ready-to-deploy workflow, or choose Build on it to create an editable state machine definition that you can build on and later deploy.

  4. Choose Use template to continue with your selection.

Next steps depend on your previous choice:

  1. Run a demo – You can review the state machine before you create a read-only project with resources deployed by AWS CloudFormation to your AWS account.

    You can view the state machine definition, and when you are ready, choose Deploy and run to deploy the project and create the resources.

    Deploying can take up to 10 minutes to create resources and permissions. You can use the Stack ID link to monitor progress in AWS CloudFormation.

    After deploy completes, you should see your new state machine in the console.

  2. Build on it – You can review and edit the workflow definition. You might need to set values for placeholders in the sample project before attemping to run your custom workflow.

Note

Standard charges might apply for services deployed to your account.

Step 2: Run the state machine

  1. On the State machines page, choose your sample project.

  2. On the sample project page, choose Start execution.

  3. In the Start execution dialog box, do the following:

    1. (Optional) Enter a custom execution name to override the generated default.

      Non-ASCII names and logging

      Step Functions accepts names for state machines, executions, activities, and labels that contain non-ASCII characters. Because such characters will not work with Amazon CloudWatch, we recommend using only ASCII characters so you can track metrics in CloudWatch.

    2. (Optional) In the Input box, enter input values as JSON. You can skip this step if you are running a demo.

    3. Choose Start execution.

    The Step Functions console will direct you to an Execution Details page where you can choose states in the Graph view to explore related information in the Step details pane.

Congratulations!

You should now have either a running demo or a state machine definition that you can customize.