MLOE-10: Profile data to improve quality
Profile data to understand its characteristics, such as distribution, descriptive statistics, data types, and data patterns. Review source data for content and quality. Filter out or correct any data that does not pass these reviews. Doing so improves the quality of the data used for training and inference.
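As a rough illustration of the profile-then-filter idea described above, the following minimal sketch uses only the Python standard library. The function names (`profile_column`, `filter_rows`), the sample records, and the age rule are hypothetical and are not part of any AWS service; in a notebook you would more likely use a data frame library for the same purpose.

```python
from collections import Counter
from statistics import mean, stdev


def profile_column(values):
    """Summarize one column: counts, observed types, and basic numeric stats."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }
    # Descriptive statistics only make sense for numeric values (bool excluded).
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) >= 2:
        summary.update(min=min(numeric), max=max(numeric),
                       mean=mean(numeric), stdev=stdev(numeric))
    return summary


def filter_rows(rows, rules):
    """Keep rows that pass every rule; rules maps column name -> predicate."""
    return [row for row in rows
            if all(check(row.get(col)) for col, check in rules.items())]


# Hypothetical sample records, for illustration only.
rows = [
    {"age": 34, "country": "DE"},
    {"age": 51, "country": "US"},
    {"age": None, "country": "US"},
    {"age": -2, "country": "FR"},
]

age_profile = profile_column([r["age"] for r in rows])
# Assumed quality rule: age must be present and between 0 and 120.
clean = filter_rows(rows, {"age": lambda v: v is not None and 0 <= v <= 120})
```

Here the profile surfaces one missing value and a negative minimum for `age`, and the filter step drops both offending rows. Whether to drop or correct such rows depends on the dataset and is a judgment call the sketch does not make for you.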
Implementation plan
-
Use the built-in data preparation capability of Amazon SageMaker AI Studio Notebook - This allows you to visually review data characteristics and remediate data-quality problems directly in your notebook environment. When you display a data frame (that is, a tabular representation of data) in your notebook, Amazon SageMaker AI Studio Notebook automatically generates charts to help users identify data-quality issues and suggests data transformations to help fix common problems. After you select a data transformation, Amazon SageMaker AI Studio Notebook generates the corresponding code within the notebook so that it can be repeatedly applied every time the notebook is run.
-
Use Amazon SageMaker AI Data Wrangler - Import, prepare, transform, visualize, and analyze data with SageMaker AI Data Wrangler. You can integrate Data Wrangler into your ML workflows to simplify and streamline data pre-processing and feature engineering with little to no coding. You can also add your own Python scripts and transformations to customize your data preparation workflow. Import data from Amazon S3, Amazon Redshift, or other data sources, and then query the data using Amazon Athena. Use Data Wrangler to create sophisticated machine learning data preparation workflows with built-in and custom data transformations and analysis features. These analysis features include target leakage detection and quick modeling.
-
Create an automatic data profile and a reporting system - Use an AWS Glue crawler to crawl the data sources and create a data schema. Use AWS Glue Data Catalog to list all the tables and schemas. Use Amazon Athena for serverless SQL querying to continuously profile the data, and then use Amazon QuickSight dashboards to visualize the results.
-
Create a baseline dataset with SageMaker AI Model Monitor - The training dataset used to train the model is usually a good baseline dataset. The schema of the inference dataset should match the schema of the training dataset exactly, including the number and order of the features.
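The requirement that the baseline and inference schemas agree in both the number and the order of features can be checked before any monitoring job runs. The sketch below is a plain-Python pre-check, not the SageMaker AI Model Monitor API. The function name `schema_mismatches` and the column names are assumptions made for illustration.

```python
def schema_mismatches(baseline_columns, inference_columns):
    """Return a list of human-readable differences between two column lists.

    An empty list means the schemas match in both count and order.
    """
    problems = []
    if len(baseline_columns) != len(inference_columns):
        problems.append(
            f"feature count differs: {len(baseline_columns)} "
            f"vs {len(inference_columns)}"
        )
    # Compare position by position so that reordered features are reported.
    pairs = zip(baseline_columns, inference_columns)
    for position, (expected, actual) in enumerate(pairs):
        if expected != actual:
            problems.append(
                f"position {position}: expected {expected!r}, got {actual!r}"
            )
    return problems


# Hypothetical feature lists, for illustration only.
baseline = ["age", "income", "tenure_months"]
same_order = ["age", "income", "tenure_months"]
reordered = ["income", "age", "tenure_months"]
missing_one = ["age", "income"]

ok = schema_mismatches(baseline, same_order)
swapped = schema_mismatches(baseline, reordered)
short = schema_mismatches(baseline, missing_one)
```

A reordered feature list is reported even though it contains the same names, because a model that consumes features by position would silently read the wrong values.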