Example use case for UNLOAD from Timestream for LiveAnalytics
Assume you are monitoring user session metrics, traffic sources, and product purchases of your e-commerce website. You are using Timestream for LiveAnalytics to derive real-time insights into user behavior and product sales, and to perform marketing analytics on the traffic channels (organic search, social media, direct traffic, paid campaigns, and others) that drive customers to the website.
Topics
- Exporting the data without any partitions
- Partitioning data by channel
- Partitioning data by event
- Partitioning data by both channel and event
- Manifest and metadata files
- Using Glue crawlers to build Glue Data Catalog
Exporting the data without any partitions
You want to export the last two days of your data in CSV format.
UNLOAD (
  SELECT user_id, ip_address, event, session_id, measure_name, time, query, quantity, product_id, channel
  FROM sample_clickstream.sample_shopping
  WHERE time BETWEEN ago(2d) AND now()
)
TO 's3://<bucket_name>/withoutpartition'
WITH (format='CSV', compression='GZIP')
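If you prefer to run the UNLOAD statement programmatically rather than from the console, the following minimal sketch submits it through the Timestream for LiveAnalytics Query API with boto3. The bucket name and the region are placeholders you would substitute.

import boto3

# Placeholder destination; substitute your own bucket.
UNLOAD_QUERY = """
UNLOAD (
  SELECT user_id, ip_address, event, session_id, measure_name, time,
         query, quantity, product_id, channel
  FROM sample_clickstream.sample_shopping
  WHERE time BETWEEN ago(2d) AND now()
)
TO 's3://<bucket_name>/withoutpartition'
WITH (format='CSV', compression='GZIP')
"""

# Placeholder region; use the region that hosts your Timestream database.
query_client = boto3.client("timestream-query", region_name="us-east-1")

# UNLOAD statements are submitted through the same Query API as SELECT statements;
# paginate through the response pages until the statement finishes.
paginator = query_client.get_paginator("query")
for page in paginator.paginate(QueryString=UNLOAD_QUERY):
    pass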
Partitioning data by channel
You want to export the last two days of data in CSV format but would like to have the data from each traffic channel in a separate folder. To do this, you need to partition the data using the channel column as shown in the following.
UNLOAD (
  SELECT user_id, ip_address, event, session_id, measure_name, time, query, quantity, product_id, channel
  FROM sample_clickstream.sample_shopping
  WHERE time BETWEEN ago(2d) AND now()
)
TO 's3://<bucket_name>/partitionbychannel/'
WITH (partitioned_by = ARRAY['channel'], format='CSV', compression='GZIP')
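To confirm that the export produced one folder per channel, you can list the sub-folders under the destination with boto3. This is a minimal sketch; the bucket name is a placeholder, and depending on where the export places its result files you may need to point the prefix one level deeper (for example, at the result folder the export creates).

import boto3

s3 = boto3.client("s3")

# Placeholder bucket; the prefix matches the UNLOAD destination used above.
response = s3.list_objects_v2(
    Bucket="<bucket_name>",
    Prefix="partitionbychannel/",
    Delimiter="/",
)

# With Delimiter="/", sub-folders (including the per-channel partition folders)
# appear as common prefixes.
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])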
Partitioning data by event
You want to export the last two days of data in CSV format but would like to have the data for each event in a separate folder. To do this, you need to partition the data using the event column as shown in the following.
UNLOAD (
  SELECT user_id, ip_address, channel, session_id, measure_name, time, query, quantity, product_id, event
  FROM sample_clickstream.sample_shopping
  WHERE time BETWEEN ago(2d) AND now()
)
TO 's3://<bucket_name>/partitionbyevent/'
WITH (partitioned_by = ARRAY['event'], format='CSV', compression='GZIP')
Partitioning data by both channel and event
You want to export the last two days of data in CSV format but would like to have the data for each channel and, within each channel, store each event in a separate folder. To do this, you need to partition the data using both the channel and event columns as shown in the following. Note that in these examples the partition columns appear last in the SELECT list, in the same order as in the partitioned_by clause.
UNLOAD (
  SELECT user_id, ip_address, session_id, measure_name, time, query, quantity, product_id, channel, event
  FROM sample_clickstream.sample_shopping
  WHERE time BETWEEN ago(2d) AND now()
)
TO 's3://<bucket_name>/partitionbychannelevent/'
WITH (partitioned_by = ARRAY['channel','event'], format='CSV', compression='GZIP')
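The exported data objects are GZIP-compressed CSV files. The following minimal sketch downloads one of them and prints its first few rows; the bucket name and the object key are placeholders, and in practice you would take the key from the manifest file described in the next section.

import csv
import gzip
import io

import boto3

s3 = boto3.client("s3")

# Placeholders; substitute your bucket and an object key from your export's manifest.
obj = s3.get_object(Bucket="<bucket_name>", Key="partitionbychannelevent/<exported_file>.gz")

# The body is a GZIP-compressed CSV file; decompress it and print the first few rows.
with gzip.open(io.BytesIO(obj["Body"].read()), mode="rt", newline="") as csv_file:
    for i, row in enumerate(csv.reader(csv_file)):
        print(row)
        if i >= 4:
            break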
Manifest and metadata files
Manifest file
The manifest file provides information about the list of files that are exported with the UNLOAD execution. The manifest file is available in the provided S3 bucket with the file name S3://bucket_name/<queryid>_<UUID>_manifest.json.
The manifest file contains the URL of each file in the results folder, the number of records and size of the respective files, and the query metadata (the total bytes and total rows exported to S3 for the query). Following is an example of a manifest file.
{ "result_files": [ { "url":"s3://my_timestream_unloads/ec2_metrics/AEDAGANLHLBH4OLISD3CVOZZRWPX5GV2XCXRBKCVD554N6GWPWWXBP7LSG74V2Q_1448466917_szCL4YgVYzGXj2lS.gz", "file_metadata": { "content_length_in_bytes": 32295, "row_count": 10 } }, { "url":"s3://my_timestream_unloads/ec2_metrics/AEDAGANLHLBH4OLISD3CVOZZRWPX5GV2XCXRBKCVD554N6GWPWWXBP7LSG74V2Q_1448466917_szCL4YgVYzGXj2lS.gz", "file_metadata": { "content_length_in_bytes": 62295, "row_count": 20 } }, ], "query_metadata": { "content_length_in_bytes": 94590, "total_row_count": 30, "result_format": "CSV", "result_version": "Amazon Timestream version 1.0.0" }, "author": { "name": "Amazon Timestream", "manifest_file_version": "1.0" } }
Metadata
The metadata file provides additional information about the data set such as column name, column type, and schema. The metadata file is available in the provided S3 bucket with a file name: S3://bucket_name/<queryid>_<UUID>_metadata.json
Following is an example of a metadata file.
{ "ColumnInfo": [ { "Name": "hostname", "Type": { "ScalarType": "VARCHAR" } }, { "Name": "region", "Type": { "ScalarType": "VARCHAR" } }, { "Name": "measure_name", "Type": { "ScalarType": "VARCHAR" } }, { "Name": "cpu_utilization", "Type": { "TimeSeriesMeasureValueColumnInfo": { "Type": { "ScalarType": "DOUBLE" } } } } ], "Author": { "Name": "Amazon Timestream", "MetadataFileVersion": "1.0" } }
The column information shared in the metadata file has the same structure as the ColumnInfo sent in the Query API response for SELECT queries.
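The metadata file is a convenient place to look up column names and types programmatically. The following minimal sketch reads it and prints one line per column; the bucket and the metadata key are placeholders you would substitute.

import json

import boto3

s3 = boto3.client("s3")

# Placeholders; substitute your bucket and the <queryid>_<UUID>_metadata.json key of your export.
metadata_obj = s3.get_object(Bucket="my_timestream_unloads", Key="<queryid>_<UUID>_metadata.json")
metadata = json.loads(metadata_obj["Body"].read())

for column in metadata["ColumnInfo"]:
    # Scalar columns carry a ScalarType; other column kinds (such as time-series
    # measure values) nest their type information one level deeper.
    column_type = column["Type"].get("ScalarType") or column["Type"]
    print(column["Name"], column_type)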
Using Glue crawlers to build Glue Data Catalog
- Log in to your account with Admin credentials for the following steps.
- Create a crawler for the Glue database using the guidelines provided here. Note that the S3 folder to be provided in the data source should be the UNLOAD result folder, such as s3://my_timestream_unloads/results (a boto3 sketch of this step follows the list).
- Run the crawler following the guidelines here.
- View the Glue table:
  - Go to AWS Glue → Tables.
  - You will see a new table created with the table prefix provided while creating the crawler.
  - You can see the schema and partition information in the table details view.
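If you prefer to script the crawler setup instead of using the console, the following minimal sketch creates and starts a crawler over the UNLOAD result folder with boto3. The crawler name, IAM role, database name, table prefix, and S3 path are placeholders you would substitute.

import boto3

glue = boto3.client("glue")

# Placeholders; substitute your own role ARN, database, table prefix, and UNLOAD result folder.
glue.create_crawler(
    Name="timestream_unload_crawler",
    Role="arn:aws:iam::<account_id>:role/<glue_crawler_role>",
    DatabaseName="timestream_unload_db",
    TablePrefix="unload_",
    Targets={"S3Targets": [{"Path": "s3://my_timestream_unloads/results/"}]},
)

# Start the crawl; the crawler populates the Glue Data Catalog with the table,
# its schema, and any partition folders it discovers under the path.
glue.start_crawler(Name="timestream_unload_crawler")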
The following are other AWS services and open-source projects that use the AWS Glue Data Catalog:
- Amazon Athena – For more information, see Understanding tables, databases, and data catalogs in the Amazon Athena User Guide.
- Amazon Redshift Spectrum – For more information, see Querying external data using Amazon Redshift Spectrum in the Amazon Redshift Database Developer Guide.
- Amazon EMR – For more information, see Use resource-based policies for Amazon EMR access to AWS Glue Data Catalog in the Amazon EMR Management Guide.
- AWS Glue Data Catalog client for Apache Hive metastore – For more information about this GitHub project, see AWS Glue Data Catalog Client for Apache Hive Metastore.