ML lifecycle phase - Data processing

In ML workloads, the data (inputs and their corresponding desired outputs) serves several important functions, including:

Defining the goal of the system: the input and output pairs specify the output representation and how each output relates to each input.
Training the algorithm that associates inputs to outputs.
Measuring how the model's performance holds up as the data distribution changes (data drift).
Building a baseline dataset to capture data drift.
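
As an illustration of the last two points, the following is a minimal sketch of comparing incoming data against a baseline dataset. It uses the Population Stability Index (PSI), which is one common drift measure chosen here for illustration and is not prescribed by this whitepaper. The function names, the bin count, and the sample values are all hypothetical.

```python
import math
from bisect import bisect_right


def make_bin_edges(baseline, n_bins=4):
    """Return equal-width interior bin edges derived from the baseline only."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def bin_fractions(values, edges, eps=1e-6):
    """Return the fraction of values in each bin, floored at eps to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def population_stability_index(baseline, current, n_bins=4):
    """PSI = sum((current_i - baseline_i) * ln(current_i / baseline_i)) over bins."""
    edges = make_bin_edges(baseline, n_bins)
    p = bin_fractions(baseline, edges)
    q = bin_fractions(current, edges)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values: the baseline captured at training time,
# and a later sample of the same feature from production traffic.
baseline_values = [1, 2, 3, 4, 5, 6, 7, 8]
current_values = [5, 6, 7, 8, 9, 10, 11, 12]

psi = population_stability_index(baseline_values, current_values)
print(f"PSI = {psi:.3f}")
```

A PSI of zero means the binned distributions match. Any alert threshold is a judgment call for the workload. A value around 0.2 is a frequently quoted rule of thumb, not a requirement from this document.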

As shown in Figure 7, data processing consists of data collection and data preparation. Data preparation includes data preprocessing and feature engineering. It relies mainly on data wrangling for interactive data analysis and on data visualization for exploratory data analysis (EDA). EDA focuses on understanding the data, running sanity checks, and validating data quality.
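
The sanity checks and data quality validation described above can be sketched in a few lines. This is an illustrative example only: the `quality_report` function, the column names, and the expected ranges are assumptions made for the sketch, not part of the whitepaper.

```python
def quality_report(rows, expected_ranges):
    """Count missing and out-of-range values for each checked column.

    rows: list of dicts, one per record (None marks a missing value).
    expected_ranges: dict mapping column name to an inclusive (low, high) range.
    """
    report = {}
    for name, (low, high) in expected_ranges.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        report[name] = {
            "missing": len(values) - len(present),
            "out_of_range": sum(1 for v in present if not low <= v <= high),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


# Hypothetical records and plausibility ranges.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 210, "income": 48000},
]
ranges = {"age": (0, 120), "income": (0, 1_000_000)}

for column, stats in quality_report(records, ranges).items():
    print(column, stats)
```

A report like this is a starting point for EDA. Columns with many missing or implausible values are candidates for closer inspection before feature engineering.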

The same sequence of data processing steps that is applied to the training data must also be applied to inference requests. Otherwise, the model receives inputs in a different form from the one it was trained on.
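
One way to keep training and inference consistent is to learn the preprocessing parameters once, persist them, and reuse the same transform function in both places. The sketch below assumes simple standardization as the only processing step. The function names, feature names, and values are hypothetical.

```python
import json
import statistics


def fit_preprocessor(rows, feature_names):
    """Learn per-feature mean and standard deviation from training rows only."""
    params = {}
    for name in feature_names:
        values = [row[name] for row in rows]
        std = statistics.pstdev(values)
        # Fall back to 1.0 for constant features to avoid division by zero.
        params[name] = {"mean": statistics.fmean(values), "std": std or 1.0}
    return params


def transform(row, params):
    """Standardize one record using previously fitted parameters."""
    return [(row[name] - p["mean"]) / p["std"] for name, p in params.items()]


# Training time: fit on training data, transform it, and persist the parameters.
train_rows = [
    {"age": 20, "income": 30000},
    {"age": 40, "income": 50000},
    {"age": 60, "income": 70000},
]
params = fit_preprocessor(train_rows, ["age", "income"])
train_features = [transform(row, params) for row in train_rows]
saved = json.dumps(params)  # In practice this would be stored with the model.

# Inference time: load the saved parameters and apply the identical transform.
loaded = json.loads(saved)
request = {"age": 50, "income": 45000}
print(transform(request, loaded))
```

The key design choice is that the inference path never recomputes statistics from incoming requests. It only applies parameters fitted on the training data.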

Figure 7 shows the key components of data processing. Each component is expanded in later figures in this whitepaper.

Figure 7: Data processing components

Best practices
