DQDL rule type reference
This section provides a reference for each rule type that AWS Glue Data Quality supports.
Note
DQDL doesn't currently support nested or list-type column data.
Bracketed values in the table below are replaced with the information provided in the rule arguments.
Rules typically require an additional argument for an expression.
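The placeholder convention can be illustrated with a small sketch. It assumes a hypothetical helper, `resolve_metric`, and hypothetical column names; it is not part of AWS Glue and only mimics how a bracketed value in a metric name stands in for a rule argument. How AWS Glue renders a multi-column placeholder such as `[Column1,Column2]` is an assumption here.

```python
import re


def resolve_metric(template: str, **args: str) -> str:
    """Replace each [Placeholder] in a metric template with a rule argument.

    Hypothetical helper for illustration only; not an AWS Glue API.
    """
    def substitute(match: re.Match) -> str:
        # A placeholder may list several names, e.g. [Column1,Column2].
        names = [name.strip() for name in match.group(1).split(",")]
        # Assumption: multi-column values are joined with a comma.
        return ",".join(args[name] for name in names)

    return re.sub(r"\[([^\]]+)\]", substitute, template)


# "order_id" is a made-up column name.
print(resolve_metric("Column.[Column].Completeness", Column="order_id"))
# Templates without placeholders pass through unchanged.
print(resolve_metric("Dataset.*.RowCount"))
```

Under these assumptions, a `Completeness` rule on a column named `order_id` would report its metric under the name `Column.order_id.Completeness`.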
Rule type | Description | Arguments | Reported metrics | Supported as rule? | Supported as analyzer? | Returns row-level results? | Dynamic rule support? | Generates observations? | Supports where clause syntax? |
---|---|---|---|---|---|---|---|---|---|
AggregateMatch | Checks if two datasets match by comparing summary metrics like total sales amount. Useful for financial institutions to compare if all data is ingested from source systems. | One or more aggregations | When first and second aggregation column names match: When first and second aggregation column names differ: | Yes | No | No | No | No | No |
AllStatistics | Standalone analyzer to gather multiple metrics for the provided column in a dataset. | A single column name | For columns of all types: Additional metrics for string-valued columns: Additional metrics for numeric-valued columns: | No | Yes | No | No | No | No |
ColumnCorrelation | Checks how well two columns are correlated. | Exactly two column names | Multicolumn.[Column1,Column2].ColumnCorrelation | Yes | Yes | No | Yes | No | Yes |
ColumnCount | Checks if any columns are dropped. | None | Dataset.*.ColumnCount | Yes | Yes | No | Yes | Yes | No |
ColumnDataType | Checks if a column is compliant with a datatype. | Exactly one column name | Column.[Column].ColumnDataType.Compliance | Yes | No | No | Yes, in row-level threshold expression | No | Yes |
ColumnExists | Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure that certain columns are made available. | Exactly one column name | N/A | Yes | No | No | No | No | No |
ColumnLength | Checks if length of data is consistent. | Exactly one column name | Additional metric when row-level threshold provided: | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing Minimum and Maximum length | Yes |
ColumnNamesMatchPattern | Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency. | A regex for column names | Dataset.*.ColumnNamesPatternMatchRatio | Yes | No | No | No | No | No |
ColumnValues | Checks if data is consistent per defined values. This rule supports regular expressions. | Exactly one column name | Additional metric when row-level threshold provided: | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing Minimum and Maximum values | Yes |
Completeness | Checks for any blank or NULL values in data. | Exactly one column name | | Yes | Yes | Yes | Yes | Yes | Yes |
CustomSql | Customers can implement almost any type of data quality check in SQL. | A SQL statement; (Optional) a row-level threshold | Additional metric when row-level threshold provided: | Yes | No | Yes, when row-level threshold provided | Yes | No | No |
DataFreshness | Checks if data is fresh. | Exactly one column name | Column.[Column].DataFreshness.Compliance | Yes | No | Yes | No | No | Yes |
DatasetMatch | Compares two datasets and identifies if they are in sync. | Name of a reference dataset; a column mapping; (Optional) columns to check for matches | Dataset.[ReferenceDatasetAlias].DatasetMatch | Yes | No | Yes | Yes | No | No |
DistinctValuesCount | Checks for duplicate values. | Exactly one column name | Column.[Column].DistinctValuesCount | Yes | Yes | Yes | Yes | Yes | Yes |
DetectAnomalies | Checks for anomalies in another rule type's reported metrics. | A rule type | Metric(s) reported by the rule type argument | Yes | No | No | No | No | No |
Entropy | Checks for entropy of the data. | Exactly one column name | Column.[Column].Entropy | Yes | Yes | No | Yes | No | Yes |
IsComplete | Checks if 100% of the data is complete. | Exactly one column name | Column.[Column].Completeness | Yes | No | Yes | No | No | Yes |
IsPrimaryKey | Checks if a column is a primary key (not NULL and unique). | Exactly one column name | For single column: For multiple columns: | Yes | No | Yes | No | No | Yes |
IsUnique | Checks if 100% of the data is unique. | Exactly one column name | Column.[Column].Uniqueness | Yes | No | Yes | No | No | Yes |
Mean | Checks if the mean matches the set threshold. | Exactly one column name | Column.[Column].Mean | Yes | Yes | Yes | Yes | No | Yes |
ReferentialIntegrity | Checks if two datasets have referential integrity. | One or more column names from dataset; one or more column names from reference dataset | Column.[ReferenceDatasetAlias].ReferentialIntegrity | Yes | No | Yes | Yes | No | No |
RowCount | Checks if record counts match a threshold. | None | Dataset.*.RowCount | Yes | Yes | No | Yes | Yes | Yes |
RowCountMatch | Checks if record counts between two datasets match. | Reference dataset alias | Dataset.[ReferenceDatasetAlias].RowCountMatch | Yes | No | No | Yes | No | No |
StandardDeviation | Checks if standard deviation matches the threshold. | Exactly one column name | Column.[Column].StandardDeviation | Yes | Yes | Yes | Yes | No | Yes |
SchemaMatch | Checks if the schemas of two datasets match. | Reference dataset alias | Dataset.[ReferenceDatasetAlias].SchemaMatch | Yes | No | No | Yes | No | No |
Sum | Checks if sum matches a set threshold. | Exactly one column name | Column.[Column].Sum | Yes | Yes | No | Yes | No | Yes |
Uniqueness | Checks if uniqueness of dataset matches threshold. | Exactly one column name | Column.[Column].Uniqueness | Yes | Yes | Yes | Yes | No | Yes |
UniqueValueRatio | Checks if the unique value ratio matches threshold. | Exactly one column name | Column.[Column].UniqueValueRatio | Yes | Yes | Yes | Yes | No | Yes |
FileFreshness | Checks if files in Amazon S3 are fresh. | File or folder path and a threshold. | | Yes | No | No | No | No | No |
FileMatch | Checks if the contents of a file match a checksum or another file. This rule uses checksums to validate that two files are the same. | Source file or folder path and target file or folder path. | No statistics are generated. | Yes | No | No | No | No | No |
FileSize | Checks if the size of a file matches a specified condition. | File or folder path and threshold. | | Yes | No | No | No | No | No |
FileUniqueness | Checks if files are unique using checksums. | File or folder path and threshold. | | Yes | No | No | No | No | No |
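To show how the arguments listed in the table appear in practice, here is a minimal sketch that assembles a DQDL ruleset as a string. The helper `build_ruleset` and the column names (`order_id`, `email`, `amount`) are hypothetical and chosen only for illustration; the rule types and their argument counts follow the table above.

```python
def build_ruleset(rules: list[str]) -> str:
    """Join individual DQDL rules into a `Rules = [...]` block.

    Hypothetical helper for illustration; it does not validate the rules.
    """
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Column names below are made up for this example.
rules = [
    "RowCount > 0",                      # no arguments, only an expression
    'IsComplete "order_id"',             # exactly one column name
    'Completeness "email" > 0.95',       # one column name plus an expression
    'Mean "amount" between 10 and 500',  # one column name plus an expression
]

ruleset = build_ruleset(rules)
print(ruleset)
```

The resulting string is plain DQDL text. Rule types that the table lists with "None" under Arguments, such as `RowCount`, take only an expression, while column-level rule types name the column in double quotes first.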
Topics
- AggregateMatch
- ColumnCorrelation
- ColumnCount
- ColumnDataType
- ColumnExists
- ColumnLength
- ColumnNamesMatchPattern
- ColumnValues
- Completeness
- CustomSQL
- DataFreshness
- DatasetMatch
- DistinctValuesCount
- Entropy
- IsComplete
- IsPrimaryKey
- IsUnique
- Mean
- ReferentialIntegrity
- RowCount
- RowCountMatch
- StandardDeviation
- Sum
- SchemaMatch
- Uniqueness
- UniqueValueRatio
- DetectAnomalies
- FileFreshness
- FileMatch
- FileUniqueness
- FileSize