AggregateMatch
Checks the ratio of two column aggregations against a given expression. This ruletype works on multiple datasets. The two column aggregations are evaluated and a ratio is produced by dividing the result of the first column aggregation with the result of the second column aggregation. The ratio is checked against the provided expression to produce a boolean response.
Syntax
Column aggregation
AggregateMatch
<AGG_OPERATION>
(<OPTIONAL_REFERENCE_ALIAS>
.<COL_NAME>
)
AGG_OPERATION – The operation to use for the aggregation. Currently,
sum
andavg
are supported.Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
OPTIONAL_REFERENCE_ALIAS – This parameter needs to be provided if the column is from a reference dataset and not the primary dataset. If you are using this rule in the AWS Glue Data Catalog, your reference alias must follow the format "<database_name>.<table_name>.<column_name>
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
COL_NAME – The name of the column to aggregate.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
Example: Average
"avg(rating)"
Example: Sum
"sum(amount)"
Example: Average of column in reference dataset
"avg(reference.rating)"
Rule
AggregateMatch
<AGG_EXP_1>
<AGG_EXP_2>
<EXPRESSION>
AGG_EXP_1 – The first column aggregation.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
AGG_EXP_2 – The second column aggregation.
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
Supported column types: Byte, Decimal, Double, Float, Integer, Long, Short
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Aggregate Match using sum
The following example rule checks whether the sum of the values in the amount
column is exactly equal to the sum of the values in the total_amount
column.
AggregateMatch "sum(amount)" "sum(total_amount)" = 1.0
Example: Aggregate Match using average
The following example rule checks whether the average of the values in the ratings
column is equal to at least 90% of the average of the values in the ratings
column in the reference
dataset. The reference dataset is provided as an additional data source in the ETL or Data Catalog experience.
In AWS Glue ETL, you can use:
AggregateMatch "avg(ratings)" "avg(reference.ratings)" >= 0.9
In the AWS Glue Data Catalog, you can use:
AggregateMatch "avg(ratings)" "avg(database_name.tablename.ratings)" >= 0.9
Null behavior
The AggregateMatch
rule will ignore rows with NULL values in the calculation of the
aggregation methods (sum/mean). For example:
+---+-----------+ |id |units | +---+-----------+ |100|0 | |101|null | |102|20 | |103|null | |104|40 | +---+-----------+
The mean of column units
will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation.