Aggregation query structure and syntax Aggregation analysis rule - query controls Aggregation analysis rule - query results controls Aggregation analysis rule structure Aggregation analysis rule - example Troubleshooting aggregation analysis rule issues

Aggregation analysis rule

Note

Applies to: AWS Clean Rooms SQL analytics engine

In AWS Clean Rooms, an aggregation analysis rule generates aggregate statistics using COUNT, SUM, and/or AVG functions along optional dimensions. When the aggregation analysis rule is added to a configured table, it enables the member who can query to run queries on the configured table.

The aggregation analysis rule supports uses cases such as campaign planning, media reach, frequency measurement, and attribution.

The supported query structure and syntax are defined in Aggregation query structure and syntax.

The parameters of the analysis rule, defined in Aggregation analysis rule - query controls, include query controls and query results controls. Its query controls include the ability to require that a configured table is joined to at least one configured table owned by the member who can query, either directly or transitively. This requirement allows you to ensure that the query is run on the intersection (INNER JOIN) of your table and theirs.

Aggregation query structure and syntax

Queries on tables that have an aggregation analysis rule must adhere to the following syntax.


--select_aggregate_function_expression
SELECT 
aggregation_function(column_name) [[AS] column_alias ] [, ...]

 --select_grouping_column_expression                        
  [, {column_name|scalar_function(arguments)} [[AS] column_alias ]][, ...]   

--table_expression
FROM table_name [[AS] table_alias ]
  [[INNER] JOIN table_name [[AS] table_alias] ON join_condition] [...]

--where_expression
[WHERE where_condition]          

--group_by_expression                          
[GROUP BY {column_name|scalar_function(arguments)}, ...]]                  

--having_expression
[HAVING having_condition]                               

--order_by_expression    
[ORDER BY {column_name|scalar_function(arguments)} [{ASC|DESC}]] [,...]]

The following table explains each expression listed in the preceding syntax.

Expression	Definition	Examples
`select_aggregate_function_expression`	A comma-separated list containing the following expressions: `select_aggregation_function_expression` `select_aggregate_expression` Note There must be at least one `select_aggregation_function_expression` in the `select_aggregate_expression`.	`SELECT SUM(PRICE), user_segment`
`select_aggregation_function_expression`	One or more supported aggregation functions applied to one or more columns. Only columns are allowed as arguments of aggregation functions. Note There must be at least one `select_aggregation_function_expression` in the `select_aggregate_expression`.	`AVG(PRICE)` `COUNT(DISTINCT user_id)`
`select_grouping_column_expression`	An expression that can contain any expression using the following: Table column names Supported scalar functions String literals Numerical literals Note `select_aggregate_expression` can alias columns with or without the `AS` parameter. For more information, see the AWS Clean Rooms SQL Reference.	`TRUNC(timestampColumn)` `UPPER(campaignName)`
`table_expression`	A table, or join of tables, connecting join conditional expressions with `join_condition`. `join_condition` returns a Boolean. The `table_expression` supports: A specific JOIN type (INNER JOIN) The equality comparison condition within a `join_condition` (`=`) Logical operators (`AND`, `OR`).	`FROM consumer_table INNER JOIN provider_table ON consumer_table.identifier1 = provider_table.identifier1 AND consumer_table.identifier2 = provider_table.identifier2`
`where_expression`	A conditional expression that returns a Boolean. It may be comprised of the following: Table column names Supported scalar functions Mathematical operators String literals Numerical literals Supported comparison conditions are (`=, >, <, <=, >=, <>, !=, NOT, IN, NOT IN, LIKE, IS NULL, IS NOT NULL`). Supported logical operators are (`AND, OR`). The `where_expression` is optional.	`WHERE where_condition` `WHERE price > 100` `WHERE TRUNC(timestampColumn) = '1/1/2022'` `WHERE timestampColumn = timestampColumn2 - 14`
`group_by_expression`	A comma-separated list of expressions that match the requirements for the `select_grouping_column_expression`.	`GROUP BY TRUNC(timestampColumn), UPPER(campaignName), segment`
`having_expression`	A conditional expression that returns an Boolean. They have a supported aggregation function applied to a single column (for example, `SUM(price)`) and are compared to a numerical literal. Supported conditions are (`=, >, <, <=, >=, <>, !=`). Supported logical operators are (`AND, OR`). The `having_expression` is optional.	`HAVING SUM(SALES) > 500`
`order_by_expression`	A comma-separated list of expressions that is compatible with the same requirements defined in `select_aggregate_expression` defined earlier. The `order_by_expression` is optional. Note `order_by_expression` permits `ASC` and `DESC` parameters. For more information, see ASC DESC parameters in the AWS Clean Rooms SQL Reference.	`ORDER BY SUM(SALES), UPPER(campaignName)`

For aggregation query structure and syntax, be aware of the following:

SQL commands other than SELECT are not supported.
Sub-queries and common table expressions (for example, WITH) are not supported.
Operators that combine multiple queries (for example, UNION) are not supported.
TOP, LIMIT, and OFFSET parameters are not supported.

Aggregation analysis rule - query controls

With aggregation query controls, you can control how the columns in your table are used to query the table. For example, you can control which column is used for joining, which column can be counted, or which column can be used in WHERE statements.

The following sections explain each control.

Aggregation controls

By using aggregation controls, you can define which aggregation functions to allow, and what columns they must to be applied to. Aggregation functions can be used in the SELECT, HAVING, and ORDER BY expressions.

Control Definition Usage

Control	Definition	Usage
`aggregateColumns`	Columns of configured table columns you allow for use within aggregation functions.	`aggregateColumns` can be used inside an aggregation function in the SELECT, HAVING, and ORDER BY expressions. Some `aggregateColumns` can also be categorized as a `joinColumn` (defined later). Given `aggregateColumn` can't also be categorized as a `dimensionColumn` (defined later).
`function`	The COUNT, SUM, and AVG functions you allow for use on top of the `aggregateColumns`.	`function` can be applied to an `aggregateColumns` that is associated to it.

aggregateColumns

Columns of configured table columns you allow for use within aggregation functions.

aggregateColumns can be used inside an aggregation function in the SELECT, HAVING, and ORDER BY expressions.

Some aggregateColumns can also be categorized as a joinColumn (defined later).

Given aggregateColumn can't also be categorized as a dimensionColumn (defined later).

function

The COUNT, SUM, and AVG functions you allow for use on top of the aggregateColumns.

function can be applied to an aggregateColumns that is associated to it.

Join controls

A JOIN clause is used to combine rows from two or more tables, based on a related column between them.

You can use Join controls to control how your table can be joined to other tables in the table_expression. AWS Clean Rooms only supports INNER JOIN. INNER JOIN statements can only use columns that have been explicitly categorized as a joinColumn in your analysis rule, subject to the controls that you define.

The INNER JOIN must operate on a joinColumn from your configured table and a joinColumn from another configured table in the collaboration. You decide which columns from your table can be used as joinColumn.

Each match condition within the ON clause is required to use the equality comparison condition (=) between two columns.

Multiple match conditions within an ON clauses can be:

Combined using the AND logical operator
Separated using the OR logical operator

Note

All JOIN match conditions must match one row from each side of the JOIN. All conditionals connected by an OR or an AND logical operator must adhere to this requirement as well.

The following is an example of a query with an AND logical operator.


SELECT some_col, other_col 
FROM table1 
    JOIN table2 
    ON table1.id = table2.id AND table1.name = table2.name

The following is an example of a query with an OR logical operator.


SELECT some_col, other_col 
FROM table1 
    JOIN table2 
    ON table1.id = table2.id OR table1.name = table2.name

Control Definition Usage

Control	Definition	Usage
`joinColumns`	The columns (if any) that you want to allow the member who can query to use in the INNER JOIN statement.	A specific `joinColumn` can also be categorized as a `aggregateColumn` (see Aggregation controls). The same column can't be used both as `joinColumn` and `dimensionColumns` (see later). Unless it has also been categorized as an `aggregateColumn`, a `joinColumn` can't be used in any other parts of the query other than the INNER JOIN.
`joinRequired`	Control whether you require an INNER JOIN with a configured table from the member who can query.	If you enable this parameter, an INNER JOIN is required. If you don't enable this parameter, an INNER JOIN is optional. Assuming you enable this parameter, the member who can query is required to include a table they own in the INNER JOIN. They must JOIN your table with theirs, either directly or transitively (that is, join their table to another table, which itself is joined to your table).

joinColumns

The columns (if any) that you want to allow the member who can query to use in the INNER JOIN statement.

A specific joinColumn can also be categorized as a aggregateColumn (see Aggregation controls).

The same column can't be used both as joinColumn and dimensionColumns (see later).

Unless it has also been categorized as an aggregateColumn, a joinColumn can't be used in any other parts of the query other than the INNER JOIN.

joinRequired

Control whether you require an INNER JOIN with a configured table from the member who can query.

If you enable this parameter, an INNER JOIN is required. If you don't enable this parameter, an INNER JOIN is optional.

Assuming you enable this parameter, the member who can query is required to include a table they own in the INNER JOIN. They must JOIN your table with theirs, either directly or transitively (that is, join their table to another table, which itself is joined to your table).

Following is an example of transitivity.


ON 
my_table.identifer = third_party_table.identifier
....
ON
third_party_table.identifier = member_who_can_query_table.id

Note

The member who can query can also use the joinRequired parameter. In that case, the query must join their table with at least one other table.

Dimension controls

Dimension controls control the column along which the aggregation columns can be filtered, grouped, or aggregated.

Control Definition Usage

Control	Definition	Usage
`dimensionColumns`	The columns (if any) that you allow the member who can query to use in SELECT, WHERE, GROUP BY, and ORDER BY.	A `dimensionColumn` can be used in SELECT (`select_grouping_column_expression`), WHERE, GROUP BY, and ORDER BY. The same column can't be both a `dimensionColumn`, a `joinColumn`, and/or an `aggregateColumn`.

dimensionColumns

The columns (if any) that you allow the member who can query to use in SELECT, WHERE, GROUP BY, and ORDER BY.

A dimensionColumn can be used in SELECT (select_grouping_column_expression), WHERE, GROUP BY, and ORDER BY.

The same column can't be both a dimensionColumn, a joinColumn, and/or an aggregateColumn.

Scalar functions

Scalar functions control which scalar functions can be used on dimension columns.

Control Definition Usage

Control	Definition	Usage
`scalarFunctions`	The scalar functions that can be used on `dimensionColumns` in the query.	Specifies the scalar functions (if any) that you allow (for example, CAST) to be applied on `dimensionColumns`. Scalar functions can't be used on top of other functions or within other functions. Arguments of scalar functions can be columns, string literals, or numeric literals.

scalarFunctions

The scalar functions that can be used on dimensionColumns in the query.

Specifies the scalar functions (if any) that you allow (for example, CAST) to be applied on dimensionColumns.

Scalar functions can't be used on top of other functions or within other functions. Arguments of scalar functions can be columns, string literals, or numeric literals.

The following scalar functions are supported:

Math functions – ABS, CEILING, FLOOR, LOG, LN, ROUND, SQRT
Data type formatting functions – CAST, CONVERT, TO_CHAR, TO_DATE, TO_NUMBER, TO_TIMESTAMP
String functions – LOWER, UPPER, TRIM, RTRIM, SUBSTRING
- For RTRIM, custom character sets to trim aren't allowed.
Conditional expressions – COALESCE
Date functions – EXTRACT, GETDATE, CURRENT_DATE, DATEADD
Other functions – TRUNC

For more details, see the AWS Clean Rooms SQL Reference.

Aggregation analysis rule - query results controls

With the aggregation query results controls, you can control which results are returned by specifying one or more conditions that each output row must meet for it to be returned. AWS Clean Rooms supports aggregation constraints in the form of COUNT (DISTINCT column) >= X. This form requires that each row aggregates at least X distinct values of a choice from your configured table (for example, a minimum number of distinct user_id values). This minimum threshold is automatically enforced, even if the submitted query itself does not use the specified column. They are enforced collectively across each configured table in the query from the configured tables from each member in the collaboration.

Each configured table must have at least one aggregation constraint in their analysis rule. Configured table owners can add multiple columnName and associated minimum and they are enforced collectively.

Aggregation constraints

Aggregation constraints control which rows in the query results are returned. To be returned, a row must meet the specified minimum number of distinct values in each column specified in the aggregation constraint. This requirement applies even if the column isn't explicitly mentioned in the query or in other parts of the analysis rule.

Control	Definition	Usage
`columnName`	The `aggregateColumn` that is used in the condition that each output row must meet.	Can be any column in the configured table.
`minimum`	The minimum number of distinct values for the associated `aggregateColumn` that the output row must have (for example, COUNT DISTINCT) for it to be returned in the query results.	The `minimum` must be at least value of 2.

Aggregation analysis rule structure

The following example shows a predefined structure for an aggregation analysis rule.

In the following example, MyTable refers to your data table. You can replace each user input placeholder with your own information.


{
  "aggregateColumns": [
    {
      "columnNames": [MyTable column names], "function": [Allowed Agg Functions]
    },
  ],
  "joinRequired": ["QUERY_RUNNER"],  
  "joinColumns": [MyTable column names],
  "dimensionColumns": [MyTable column names],
  "scalarFunctions": [Allowed Scalar functions],
  "outputConstraints": [
    {
      "columnName": [MyTable column names], "minimum": [Numeric value] 
    },
  ]
}

Aggregation analysis rule - example

The following example demonstrates how two companies can collaborate in AWS Clean Rooms using aggregation analysis.

Company A has customer and sales data. Company A is interested in understanding product return activity. Company B is one of Company A's retailers and has returns data. Company B also has segment attributes on customers that are useful to Company A (for example, purchased related products, uses customer service from the retailer). Company B doesn't want to provide row-level customer return data and attribute information. Company B only wants to enable a set of queries for Company A to obtain aggregate statistics on overlapping customers at a minimum aggregation threshold.

Company A and Company B decide to collaborate so that Company A can understand product return activity and deliver better products at Company B and other channels.

To create the collaboration and run an aggregation analysis, the companies do the following:

Company A creates a collaboration and creates a membership. The collaboration has Company B as another member in the collaboration. Company A enables query logging in the collaboration, and it enables query logging in their account.
Company B creates a membership in the collaboration. It enables query logging in its account.
Company A creates a sales configured table.
Company A adds the following aggregation analysis rule to the sales configured table.
```
{
  "aggregateColumns": [
    {
      "columnNames": [
        "identifier"
      ],
      "function": "COUNT_DISTINCT"
    },
    {
      "columnNames": [
        "purchases"
      ],
      "function": "AVG"
    },
    {
      "columnNames": [
        "purchases"
      ],
      "function": "SUM"
    }
  ],
  "joinColumns": [
    "hashedemail"
  ],
  "dimensionColumns": [
    "demoseg",
    "purchasedate",
    "productline"
  ],
  "scalarFunctions": [
    "CAST",
    "COALESCE",
    "TRUNC"
  ],
  "outputConstraints": [
    {
      "columnName": "hashedemail",
      "minimum": 2,
      "type": "COUNT_DISTINCT"
    },
  ]
}
```
aggregateColumns – Company A wants to count the number of unique customers in the overlap between sales data and returns data. Company A also wants to sum the number of purchases made to compare to number of returns.

joinColumns – Company A wants to use identifier to match customers from sales data to customers from returns data. This will help Company A match returns to the right purchases. It also helps Company A segment overlapping customers.

dimensionColumns – Company A uses dimensionColumns to filter by the specific product, compare purchases and returns over a certain period of time, ensure the return date is after the product date, and help segment overlapping customers.

scalarFunctions – Company A selects CAST scalar function to help update data type formats if needed based on the configured table Company A associates to the collaboration. It also adds scalar functions to help formatting columns if needed.

outputConstraints – Company A sets minimum output constraints. It doesn't need to constrain the results because the analyst is allowed to see row-level data from their sales table

Note
Company A doesn't include joinRequired in the analysis rule. It provides flexibility for their analyst to query the sales table alone.
Company B creates a returns configured table.
Company B adds the following aggregation analysis rule to the returns configured table.
```
{
  "aggregateColumns": [
    {
      "columnNames": [
        "identifier"
      ],
      "function": "COUNT_DISTINCT"
    },
    {
      "columnNames": [
        "returns"
      ],
      "function": "AVG"
    },
    {
      "columnNames": [
        "returns"
      ],
      "function": "SUM"
    }
  ],
  "joinColumns": [
    "hashedemail"
  ],
  "joinRequired": [
    "QUERY_RUNNER"
  ],
  "dimensionColumns": [
    "state",
    "popularpurchases",
    "customerserviceuser",
    "productline",
    "returndate"
  ],
  "scalarFunctions": [
    "CAST",
    "LOWER",
    "UPPER",
    "TRUNC"
  ],
  "outputConstraints": [
    {
      "columnName": "hashedemail",
      "minimum": 100,
      "type": "COUNT_DISTINCT"
    },
    {
      "columnName": "producttype",
      "minimum": 2,
      "type": "COUNT_DISTINCT"
    }
  ]
}
```
aggregateColumns – Company B enables Company A to sum returns to compare to the number of purchases. They have at least one aggregate column because they are enabling an aggregate query.

joinColumns – Company B enables Company A to join on identifier to match customers from return data to customers from sales data. identifier data is particularly sensitive and having it as a joinColumn ensures that the data will never be outputted in a query.

joinRequired – Company B requires queries on the return data to be overlapped with the sales data. They don’t want to enable Company A to query all individuals in their dataset. They also agreed on that restriction in their collaboration agreement.

dimensionColumns – Company B enables Company A to filter and group by state, popularpurchases, and customerserviceuser which are unique attributes that could help make the analysis for Company A. Company B enables Company A to use returndate to filter output on returndate that occurs after purchasedate. With this filtering, the output is more accurate for evaluating the impact of the product change.

scalarFunctions – Company B enables the following:
- TRUNC for dates
- LOWER and UPPER in case the producttype is entered in a different format in their data
- CAST if Company A needs to convert data types in sales to be the same as data types in returns
Company A doesn't enable other scalar functions because they don’t believe they are required for queries.

outputConstraints – Company B sets minimum output constraints on hashedemail to help reduce the ability to re-identify customers. It also adds minimum output constraint on producttype to reduce the ability to re-identify specific products that were returned. Certain product types could be more dominant based on dimensions of the output (for example, state). Their output constraints will always be enforced regardless of output constraints added by Company A to their data.
Company A creates a sales table association to collaboration.
Company B creates a returns table association to collaboration.

Company A runs queries, such as the following example, to better understand the quantity of returns in Company B as compared to total purchases by location in 2022.


SELECT
  companyB.state,
  SUM(companyB.returns),
  COUNT(DISTINCT companyA.hashedemail)
FROM
  sales companyA
  INNER JOIN returns companyB ON companyA.identifier = companyB.identifier
WHERE
  companyA.purchasedate BETWEEN '2022-01-01' AND '2022-12-31' AND
  TRUNC(companyB.returndate) > companyA.purchasedate
GROUP BY
  companyB.state;

Company A and Company B review query logs. Company B verifies that the query aligns with what was agreed upon in the collaboration agreement.

Troubleshooting aggregation analysis rule issues

Use the information here to help you diagnose and fix common issues when you work with aggregation analysis rules.

Issues

My query didn’t return any results

My query didn’t return any results

This can happen when there are no matching results or when the matching results don’t meet one or more minimum aggregation thresholds.

For more information about minimum aggregation thresholds, see Aggregation analysis rule - example.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Analysis rules

List analysis rule