CustomSQL
This rule type has been extended to support two use cases:
Run a custom SQL statement against a dataset and checks the return value against a given expression.
Run a custom SQL statement where you specify a column name in your SELECT statement against which you compare with some condition to get row-level results.
Syntax
CustomSql
<SQL_STATEMENT>
<EXPRESSION>
SQL_STATEMENT – A SQL statement that returns a single numeric value, surrounded by double quotes.
EXPRESSION – An expression to run against the rule type response in order to produce a Boolean value. For more information, see Expressions.
Example: Custom SQL to retrieve an overall rule outcome
This example rule uses a SQL statement to retrieve the record count for a data set. The rule then checks that the record count is between 10 and 20.
CustomSql "select count(*) from primary" between 10 and 20
Example: Custom SQL to retrieve row-level results
This example rule uses a SQL statement wherein you specify a column name in your SELECT statement against which you compare with some condition to get row level results. A threshold condition expression defines a threshold of how many records should fail for the entire rule to fail. Note that a rule may not contain both a condition and keyword together.
CustomSql "select Name from primary where Age > 18"
or
CustomSql "select Name from primary where Age > 18" with threshold > 3
Important
The primary
alias stands in for the name of the data set that you want to
evaluate. When you work with visual ETL jobs on the console, primary
always
represents the DynamicFrame
being passed to the
EvaluateDataQuality.apply()
transform. When you use the AWS Glue Data Catalog to run
data quality tasks against a table, primary
represents the table.
If you are in AWS Glue Data Catalog, you can also use the actual table names:
CustomSql "select count(*) from database.table" between 10 and 20
You can also join multiple tables to compare different data elements:
CustomSql "select count(*) from database.table inner join database.table2 on id1 = id2" between 10 and 20
In AWS Glue ETL, CustomSQL can identify records that failed the data quality checks. For this to work, you will need to return records that are part of the primary table that you are evaluating data quality. Records that are returned as part of the query are considered successful and records that are not returned are considered failed.
The following rule will ensure that records with age < 100 are identified as successful and records that are above are marked as failed.
CustomSql "select id from primary where age < 100"
This CustomSQL rule will pass when 50% of the records have age > 10 and will also identify records that failed. The records returned by this CustomSQL will be considered passed while the ones not returned will be considered failed.
CustomSQL "select ID, CustomerID from primary where age > 10" with threshold > 0.5
Note: CustomSQL rule will fail if you return records that are not available in the dataset.