Formatting your data - Amazon Lookout for Equipment

Formatting your data

You've set up your account and created your project. Soon, you'll organize your data to help Lookout for Equipment determine an appropriate schema. But first, you must ensure that your data is formatted properly.

To monitor your equipment, you must provide Amazon Lookout for Equipment with time-series data from the sensors on your equipment. The data that you're providing to Lookout for Equipment is a series of numerical measurements from the sensors. You provide this data from either a data historian or Amazon Simple Storage Service (Amazon S3). A data historian is a software program that records and retrieves sensor data from your equipment.

To provide Amazon Lookout for Equipment with time-series data from the sensors, you must use properly formatted .csv files to create a dataset. Creating a dataset aggregates the data in a format that is suitable for analysis. You create a dataset for a single piece of equipment, or asset. You train an ML model on the dataset that you create. You then use that model to monitor your asset. You don't have to use the data from every sensor to train a model; you can train a model on data from a subset of the sensors in the dataset.

You can store the data for your asset in one of the following ways:

  • Storing all of the sensor data in one .csv file (recommended)

  • Using one .csv file for each sensor

Each .csv file must have at least two columns. The first column of the file is a timestamp that indicates the date and time. You must have at least one additional column containing the data from a sensor. Each subsequent column can have data from a different sensor.

To store the data for your asset in one .csv file, you arrange the data in the following format.

Timestamp        Sensor 1    Sensor 2
2020/1/1 0:00    2           12
2020/1/1 0:05    3           11
2020/1/1 0:10    5           10
2020/1/1 0:15    3           9
2020/1/1 0:20    4           12

The following example shows the information from the preceding table as a .csv file.

Timestamp,Sensor 1,Sensor 2
2020/1/1 0:00,2,12
2020/1/1 0:05,3,11
2020/1/1 0:10,5,10
2020/1/1 0:15,3,9
2020/1/1 0:20,4,12
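
As a minimal sketch, the single-file layout can be produced with Python's standard csv module. The function name write_asset_csv and the sample readings are illustrative and not part of Lookout for Equipment. The sketch writes timestamps in the yyyy-MM-dd HH:mm:ss form, which is one of the formats listed later on this page.

```python
import csv
import io
from datetime import datetime, timedelta


def write_asset_csv(out, sensor_names, rows):
    """Write a header row, then one row per timestamp with one value per sensor."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Timestamp", *sensor_names])
    for timestamp, values in rows:
        # Zero-padded yyyy-MM-dd HH:mm:ss timestamp in the first column.
        writer.writerow([timestamp.strftime("%Y-%m-%d %H:%M:%S"), *values])


# Sample readings taken at five-minute intervals (same values as the table above).
start = datetime(2020, 1, 1)
readings = [(2, 12), (3, 11), (5, 10), (3, 9), (4, 12)]
rows = [(start + timedelta(minutes=5 * i), r) for i, r in enumerate(readings)]

buffer = io.StringIO()
write_asset_csv(buffer, ["Sensor 1", "Sensor 2"], rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="") as the first argument.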

You can choose your column names. We recommend using "Timestamp" as the name for the column with the time-series data. For the names of the columns with data from your sensors, we recommend using names that distinguish one sensor from another.

If you are storing the data from each sensor in its own .csv file, use the following table to see how to format the data.

Timestamp        Sensor 3
2020/1/1 0:00    34
2020/1/1 0:05    33
2020/1/1 0:10    35
2020/1/1 0:15    33
2020/1/1 0:20    34

The following example shows the information from the preceding table as a .csv file.

Timestamp,Sensor 3
2020/1/1 0:00,34
2020/1/1 0:05,33
2020/1/1 0:10,35
2020/1/1 0:15,33
2020/1/1 0:20,34
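
If your data starts out in a single combined file, one possible way to get the one-file-per-sensor layout is to split it column by column. The following sketch uses only the Python standard library; split_by_sensor is a hypothetical helper name, and the sketch assumes every sensor column has a unique name.

```python
import csv
import io


def split_by_sensor(combined_csv_text):
    """Return {sensor_name: csv_text}, with one two-column CSV per sensor."""
    reader = csv.reader(io.StringIO(combined_csv_text))
    header = next(reader)
    timestamp_name, sensor_names = header[0], header[1:]

    buffers = {name: io.StringIO() for name in sensor_names}
    writers = {name: csv.writer(buf, lineterminator="\n") for name, buf in buffers.items()}
    for name, writer in writers.items():
        writer.writerow([timestamp_name, name])

    for row in reader:
        if not row:  # skip blank lines
            continue
        # Each output file keeps the timestamp plus that sensor's value.
        for name, value in zip(sensor_names, row[1:]):
            writers[name].writerow([row[0], value])

    return {name: buf.getvalue() for name, buf in buffers.items()}


combined = (
    "Timestamp,Sensor 1,Sensor 2\n"
    "2020-01-01 00:00:00,2,12\n"
    "2020-01-01 00:05:00,3,11\n"
)
per_sensor = split_by_sensor(combined)
print(per_sensor["Sensor 2"])
```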

We recommend using "Timestamp" as the name for the column with the time-series data. For the column with data from the sensor, we recommend using a name that distinguishes it from other sensors.

Your sensor data must use a double (numeric) data type. You can train your model only on numeric data.
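
Before creating a dataset, it can help to scan a file for sensor values that are not numeric. The sketch below is one way to do that with the standard library; find_non_numeric is a hypothetical helper. It treats a cell as numeric if Python's float() can parse it, which is an assumption: float() also accepts strings such as "nan" and "inf", and this page does not say how Lookout for Equipment handles those or empty cells.

```python
import csv
import io


def find_non_numeric(csv_text):
    """Return (line_number, column_name, value) for each sensor cell float() rejects."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        # Skip the first column, which holds the timestamp.
        for name, value in zip(header[1:], row[1:]):
            try:
                float(value)
            except ValueError:
                problems.append((line_number, name, value))
    return problems


sample = (
    "Timestamp,Sensor 1,Sensor 2\n"
    "2020-01-01 00:00:00,2,12\n"
    "2020-01-01 00:05:00,OFF,11\n"
    "2020-01-01 00:10:00,5.5,\n"
)
problems = find_non_numeric(sample)
print(problems)
```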

When you are preparing your data, you should keep the following information in mind:

Category                                                 Limit
Minimum date range                                       14 days
Maximum sensors per dataset                              3000
Maximum sensors per model                                300
Maximum length of a sensor name                          200 characters
Maximum size of each .csv file                           5 GB
Maximum historical dataset size (combined .csv files)    50 GB
Maximum files per historical dataset                     1,000
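
Several of these limits can be checked locally before you upload anything. The following is a rough sketch, not an official validator: check_dataset_limits and its arguments are hypothetical, and it assumes that 1 GB means 10**9 bytes (the stricter reading), because this page does not define the unit. It does not check the date range or the per-model sensor limit.

```python
GB = 10 ** 9  # assumption: decimal gigabytes; the page does not define the unit

MAX_SENSORS_PER_DATASET = 3000
MAX_SENSOR_NAME_LENGTH = 200
MAX_FILE_BYTES = 5 * GB
MAX_DATASET_BYTES = 50 * GB
MAX_FILES = 1000


def check_dataset_limits(sensor_names, file_sizes):
    """Return a list of limit violations; an empty list means none were found.

    sensor_names: every sensor column name in the dataset.
    file_sizes: mapping of .csv file name to its size in bytes.
    """
    problems = []
    if len(sensor_names) > MAX_SENSORS_PER_DATASET:
        problems.append(f"{len(sensor_names)} sensors exceeds {MAX_SENSORS_PER_DATASET}")
    for name in sensor_names:
        if len(name) > MAX_SENSOR_NAME_LENGTH:
            problems.append(f"sensor name of {len(name)} characters exceeds {MAX_SENSOR_NAME_LENGTH}")
    if len(file_sizes) > MAX_FILES:
        problems.append(f"{len(file_sizes)} files exceeds {MAX_FILES}")
    for file_name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{file_name} is larger than 5 GB")
    if sum(file_sizes.values()) > MAX_DATASET_BYTES:
        problems.append("combined .csv files are larger than 50 GB")
    return problems


print(check_dataset_limits(["Sensor_1", "Sensor_2"], {"asset.csv": 1024}))
```

On a local disk, os.path.getsize(path) is one way to build the file_sizes mapping.
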
  • You can use the following delimiters for the data in the timestamp column: underscore (_), hyphen (-), and space

  • The timestamp column can use the following formats:

    • yyyy-MM-dd-HH-mm-ss

    • yyyy-MM-dd'T'HH:mm:ss

    • yyyy-MM-dd HH:mm:ss

    • yyyy-MM-dd-HH:mm:ss

    • yyyy-MM-dd'T'HH:mm

    • yyyy-MM-dd HH:mm

    • yyyy-MM-dd-HH:mm

    • yyyy/MM/dd'T'HH:mm:ss

    • yyyyMMdd'T'HH:mm

    • yyyyMMdd HH:mm

    • yyyyMMddHHmm

    • yyyy/MM/dd HH:mm:ss

    • yyyyMMdd'T'HH:mm:ss

    • yyyyMMdd HH:mm:ss

    • yyyyMMddHHmmss

    • yyyy/MM/dd'T'HH:mm

    • yyyy/MM/dd HH:mm

    • yyyy MM dd'T'HH:mm:ss

    • yyyy MM dd HH:mm:ss

    • yyyy MM dd'T'HH:mm

    • yyyy MM dd HH:mm

  • The valid characters that you can use in the column names of the dataset are A-Z, a-z, 0-9, period (.), underscore (_), and hyphen (-)

To learn more about the formats listed above, see the ISO 8601 standard.
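
The timestamp formats and the column-name rule listed above can be checked locally with a short script. The sketch below translates the listed patterns into Python strptime format strings and tests column names with a regular expression. The helper names are hypothetical. Note that strptime is more lenient than the patterns suggest (for example, it accepts unpadded values such as 2020/1/1 0:00), so passing this check does not prove that Lookout for Equipment will accept a value.

```python
import re
from datetime import datetime

# Patterns copied from the list above.
LISTED_PATTERNS = [
    "yyyy-MM-dd-HH-mm-ss", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd HH:mm:ss",
    "yyyy-MM-dd-HH:mm:ss", "yyyy-MM-dd'T'HH:mm", "yyyy-MM-dd HH:mm",
    "yyyy-MM-dd-HH:mm", "yyyy/MM/dd'T'HH:mm:ss", "yyyyMMdd'T'HH:mm",
    "yyyyMMdd HH:mm", "yyyyMMddHHmm", "yyyy/MM/dd HH:mm:ss",
    "yyyyMMdd'T'HH:mm:ss", "yyyyMMdd HH:mm:ss", "yyyyMMddHHmmss",
    "yyyy/MM/dd'T'HH:mm", "yyyy/MM/dd HH:mm", "yyyy MM dd'T'HH:mm:ss",
    "yyyy MM dd HH:mm:ss", "yyyy MM dd'T'HH:mm", "yyyy MM dd HH:mm",
]


def to_strptime(pattern):
    """Translate a listed pattern into a Python strptime format string."""
    result = pattern.replace("'T'", "T")
    for token, directive in (("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"),
                             ("HH", "%H"), ("mm", "%M"), ("ss", "%S")):
        result = result.replace(token, directive)
    return result


def parse_timestamp(text):
    """Return a datetime if text matches any listed pattern, otherwise None."""
    for pattern in LISTED_PATTERNS:
        try:
            return datetime.strptime(text, to_strptime(pattern))
        except ValueError:
            continue
    return None


# Letters, digits, period, underscore, and hyphen only.
VALID_COLUMN_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_column_name(name):
    """Return True if name uses only the characters listed as valid."""
    return VALID_COLUMN_NAME.fullmatch(name) is not None


print(parse_timestamp("2020-01-01T00:05:00"))
print(is_valid_column_name("Sensor_1"), is_valid_column_name("Sensor 1"))
```

Under this reading of the rule, a name that contains a space, such as the "Sensor 1" used in the sample tables, does not match, so a name like Sensor_1 is the safer choice.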

Now that your data is formatted properly, it's time to organize your files.

Understanding the minimum date range

The minimum date range for a dataset in Lookout for Equipment is 14 days. However, there are situations in which you should include more than 14 days' worth of data.

The dataset that you submit should cover a period of time during which your asset functioned in all of its normal operating modes. This is necessary for Lookout for Equipment to recognize the difference between normal operation and anomalies.

If your dataset does not include examples of all of your asset's normal operating modes, then Lookout for Equipment may find more false positives. In other words, it may identify some of your operating modes, with which it is not familiar, as anomalies.

In such cases, you can help Lookout for Equipment accurately identify anomalies by labeling your data. For more information, see Understanding labeling.
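
As a quick local check of the 14-day minimum, you can compare the earliest and latest timestamps in your data. This is a sketch under an assumption: this page does not say exactly how Lookout for Equipment measures the date range, so the function below simply uses the span between the first and last reading. The name covers_minimum_range is hypothetical. Meeting the minimum does not mean that the data covers all of the asset's normal operating modes.

```python
from datetime import datetime, timedelta

MIN_RANGE = timedelta(days=14)


def covers_minimum_range(timestamps, minimum=MIN_RANGE):
    """Return True if the span from earliest to latest timestamp is at least `minimum`."""
    if not timestamps:
        return False
    return max(timestamps) - min(timestamps) >= minimum


start = datetime(2020, 1, 1)
ten_days = [start + timedelta(days=d) for d in range(11)]       # spans 10 days
fourteen_days = [start + timedelta(days=d) for d in range(15)]  # spans 14 days
print(covers_minimum_range(ten_days), covers_minimum_range(fourteen_days))
```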