Data quality - Amazon SageMaker

Data quality

Data quality monitoring automatically monitors machine learning (ML) models in production and notifies you when data quality issues arise. ML models in production have to make predictions on real-world data that, unlike most training datasets, is not carefully curated. If the statistical properties of the data that your model receives in production drift away from those of the baseline data it was trained on, the model's predictions begin to lose accuracy. Amazon SageMaker Model Monitor uses rules to detect data drift and alerts you when it happens. To monitor data quality, follow these steps:

  • Enable data capture. This captures inference input and output from a real-time inference endpoint or batch transform job and stores the data in Amazon S3. For more information, see Data capture.

  • Create a baseline. In this step, you run a baseline job that analyzes an input dataset that you provide. The baseline job computes schema constraints and statistics for each feature using Deequ, an open source library built on Apache Spark that measures data quality in large datasets. For more information, see Create a Baseline.

  • Define and schedule data quality monitoring jobs. For specific information and code samples of data quality monitoring jobs, see Schedule data quality monitoring jobs. For general information about monitoring jobs, see Schedule monitoring jobs.

    • Optionally use preprocessing and postprocessing scripts to transform the data coming out of your data quality analysis. For more information, see Preprocessing and Postprocessing.

  • View data quality metrics. For more information, see Schema for Statistics (statistics.json file).

  • Integrate data quality monitoring with Amazon CloudWatch. For more information, see CloudWatch Metrics.

  • Interpret the results of a monitoring job. For more information, see Interpret results.

  • If you are using a real-time endpoint, use SageMaker Studio to enable data quality monitoring and visualize results. For more information, see Visualize results for real-time endpoints in Amazon SageMaker Studio.
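
As a rough illustration of the baseline-then-compare idea behind the steps above, the following Python sketch computes simple per-feature statistics from a baseline dataset and then flags features in captured data that stray from them. It is not how Model Monitor is implemented (the baseline job uses Deequ on Apache Spark). The function names, the `max_shift` and `min_completeness_ratio` thresholds, the check names, and the sample records are all assumptions made for this example.

```python
from statistics import mean, pstdev


def compute_baseline(rows, features):
    """Compute mean, standard deviation, and completeness for each feature."""
    stats = {}
    for name in features:
        values = [r[name] for r in rows if r.get(name) is not None]
        stats[name] = {
            "mean": mean(values),
            "std": pstdev(values),
            "completeness": len(values) / len(rows),
        }
    return stats


def find_violations(stats, rows, max_shift=2.0, min_completeness_ratio=0.9):
    """Return (feature, check) pairs where captured rows deviate from the baseline.

    max_shift and min_completeness_ratio are hypothetical thresholds chosen
    for illustration only.
    """
    violations = []
    for name, base in stats.items():
        values = [r[name] for r in rows if r.get(name) is not None]
        completeness = len(values) / len(rows)
        # Flag features that have many more missing values than the baseline.
        if completeness < base["completeness"] * min_completeness_ratio:
            violations.append((name, "completeness_check"))
        # Flag features whose mean moved by more than max_shift baseline std devs.
        if values and base["std"] > 0:
            shift = abs(mean(values) - base["mean"]) / base["std"]
            if shift > max_shift:
                violations.append((name, "mean_shift_check"))
    return violations


# Hypothetical baseline (training-like) records and captured inference records.
baseline_rows = [
    {"age": 30, "income": 50.0},
    {"age": 40, "income": 60.0},
    {"age": 50, "income": 70.0},
    {"age": 60, "income": 80.0},
]
captured_rows = [
    {"age": 31, "income": 150.0},
    {"age": 58, "income": None},
    {"age": 47, "income": None},
    {"age": 44, "income": 160.0},
]

baseline_stats = compute_baseline(baseline_rows, ["age", "income"])
violations = find_violations(baseline_stats, captured_rows)
print(violations)
```

In this sketch, `age` stays close to its baseline while `income` is reported for both missing values and a shifted mean. A real monitoring job reads the captured data from Amazon S3 and writes its findings to a violations report instead of returning a list.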

Note

Model Monitor computes model metrics and statistics on tabular data only. An image classification model that takes images as input and outputs a label based on each image can still be monitored, but Model Monitor calculates metrics and statistics only for the output (the label), not for the image input.
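
To make the note concrete, here is a minimal Python sketch of monitoring only the tabular output of an image classifier: it compares the frequency of predicted labels in a baseline set against a captured set and ignores the image input entirely. The L-infinity distance, the `DRIFT_THRESHOLD` value, and the sample labels are assumptions chosen for illustration, not a description of the metric Model Monitor applies.

```python
from collections import Counter


def label_frequencies(labels):
    """Return the relative frequency of each predicted label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def linf_distance(p, q):
    """Largest absolute difference in frequency across all labels seen."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical threshold; a real deployment would tune this value.
DRIFT_THRESHOLD = 0.1

# Hypothetical predicted labels: only the model output is examined.
baseline_labels = ["cat"] * 50 + ["dog"] * 50
captured_labels = ["cat"] * 20 + ["dog"] * 70 + ["bird"] * 10

distance = linf_distance(
    label_frequencies(baseline_labels),
    label_frequencies(captured_labels),
)
drifted = distance > DRIFT_THRESHOLD
print(f"distance={distance:.2f} drifted={drifted}")
```

Because the labels form a single categorical column, they can be treated as tabular data even though the images that produced them cannot.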