Optimizing correlated subqueries in Babelfish - Amazon Aurora

Optimizing correlated subqueries in Babelfish

A correlated subquery references table columns from the outer query. It is evaluated once for every row returned by the outer query. In the following example, the subquery references a column from table t1. This table is not included in the subquery’s FROM clause, but it is referenced in the outer query’s FROM clause. If table t1 has 1 million rows, the subquery needs to be evaluated 1 million times.

SELECT col1, col2 FROM t1 WHERE col1 < (SELECT sum(col1) FROM t2 WHERE t1.col2 = t2.col2);

Improving Babelfish query performance using subquery transformation

Babelfish can accelerate correlated subqueries by transforming them into equivalent outer joins. This optimization applies to the following two types of correlated subqueries:

  • Subqueries that return a single aggregate value, and appear in the SELECT list. For more information, see SELECT clause in Microsoft Transact-SQL documentation.

    SELECT ( SELECT avg(a) FROM inner_sb_t2 WHERE inner_sb_t2.a = outer_sb_t1.a) FROM outer_sb_t1;
  • Subqueries that return a single aggregate value and appear in a WHERE clause.

    SELECT * FROM outer_sb_t1 WHERE ( SELECT avg(a) FROM inner_sb_t2 WHERE inner_sb_t2.a = outer_sb_t1.a) > 1.0;

Enabling transformation in the subquery

To enable the transformation of correlated subqueries into equivalent outer joins, set the apg_enable_correlated_scalar_transform parameter to ON. This parameter is available in Babelfish 4.2.0 and later versions. The default value of this parameter is OFF.

You can modify the cluster or instance parameter group to set the parameters. To learn more, see Parameter groups for Amazon Aurora.

Alternatively, you can configure the setting for just the current session by calling the function set_config. For example, run the following command to turn on subquery cache in Babelfish. To learn more, see Configuration Settings Functions.

1> SELECT set_config('apg_enable_subquery_cache', 'on', false) 2> GO

Verifying the transformation

Use the EXPLAIN command to verify if the correlated subquery has been transformed into an outer join in the query plan. For more information, see Using explain plan to improve Babelfish query performance.

When the transformation is enabled, the applicable correlated subquery part will be transformed into outer join. For example:

1> select set_config('apg_enable_correlated_scalar_transform', 'true', false); 2> GO 1> set BABELFISH_STATISTICS PROFILE on 2> GO 1> select customer_name, ( select max(o.cost) from correlated_orders o 2> where o.customer_id = c.customer_id and o.total_amount > 10 and o.customer_id != 10) AS max_order_amount 3> from correlated_customers c order by customer_name ; 4> GO QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Query Text: select customer_name, ( select max(o.cost) from correlated_orders o where o.customer_id = c.customer_id and o.total_amount > 10 and o.customer_id != 10 ) AS max_order_amount from correlated_customers c order by customer_name Sort (cost=88.23..90.18 rows=780 width=40) Sort Key: c.customer_name NULLS FIRST -> Hash Left Join (cost=30.90..50.76 rows=780 width=40) Hash Cond: (c.customer_id = o.customer_id) -> Seq Scan on correlated_customers c (cost=0.00..17.80 rows=780 width=36) -> Hash (cost=28.73..28.73 rows=174 width=12) -> HashAggregate (cost=26.99..28.73 rows=174 width=12) Group Key: o.customer_id -> Seq Scan on correlated_orders o (cost=0.00..25.30 rows=338 width=12) Filter: (((total_amount)::numeric > 10) AND (customer_id <> 10)) ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The same query is not transformed when the GUC parameter is turned OFF. The plan will not have outer join but subplan instead.

1> select set_config('apg_enable_correlated_scalar_transform', 'false', false); 2> GO 1> select customer_name, ( select max(o.cost) 2> from correlated_orders o 3> where o.customer_id = c.customer_id and o.total_amount > 10 and o.customer_id != 10) AS max_order_amount 4> from correlated_customers c order by customer_name ; 5> GO QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Sort (cost=21789.97..21791.92 rows=780 width=40) Sort Key: c.customer_name NULLS FIRST -> Seq Scan on correlated_customers c (cost=0.00..21752.50 rows=780 width=40) SubPlan 1 -> Aggregate (cost=27.86..27.87 rows=1 width=8) -> Seq Scan on correlated_orders o (cost=0.00..27.85 rows=2 width=8) Filter: (((total_amount)::numeric > 10) AND (customer_id <> 10) AND (customer_id = c.customer_id))

Limitations

  • The subquery must be in the select_list or in one of the conditions in the where clause. otherwise, it won’t be transformed.

  • The subquery must return an aggregate function. User-defined aggregate functions aren't supported for transformation.

  • A subquery whose return expression isn't a simple aggregate function won't be transformed.

  • The correlated condition in subquery WHERE clauses should be a simple column reference. Otherwise, it won’t be transformed.

  • The correlated condition in subquery where clauses must be a plain equality predicate.

  • A correlated subquery containing a TOP clause can’t be transformed.

  • The subquery can't contain either a HAVING or a GROUP BY clause.

  • The where clause in the subquery may contain one or more predicates combined with AND. If the WHERE clause contains an OR clause, it can't be transformed.

Using subquery cache to improve Babelfish query performance

Starting with version 4.2.0, Babelfish supports subquery cache to store the results of correlated subqueries. This feature skips repeated correlated subquery executions when subquery results are already in the cache.

Understanding subquery cache

PostgreSQL’s Memoize node is the key part of subquery cache. The Memoize node maintains a hash table in local cache to map from input parameter values to query result rows. The memory limit for the hash table is the product of work_mem and hash_mem_multiplier. To learn more, see Resource Consumption.

During query execution, subquery cache uses Cache Hit Rate (CHR) to estimate whether the cache is improving query performance and to decide at query runtime whether to continue using the cache. CHR is the ratio of the number of cache hits to the total number of requests. For example, if a correlated subquery needs to be executed 100 times, and 70 of those execution results can be retrieved from the cache, the CHR is 0.7.

For every apg_subquery_cache_check_interval number of cache misses, the benefit of subquery cache is evaluated by checking whether the CHR is larger than apg_subquery_cache_hit_rate_threshold. If not, the cache will be deleted from memory, and the query execution will return to the original, uncached subquery re-execution.

Parameters that control subquery cache behavior

The following table lists the parameters that control the behavior of the subquery cache.

Parameter

Description

Default

Allowed

apg_enable_subquery_cache

Enables the use of cache for correlated scalar subqueries.

OFF

ON, OFF

apg_subquery_cache_check_interval

Sets the frequency, in number of cache misses, to evaluate subquery cache hit rate.

500

0–2147483647

apg_subquery_cache_hit_rate_threshold

Sets the threshold for subquery cache hit rate.

0.3

0.0–1.0
Note
  • Larger values of apg_subquery_cache_check_interval may improve the accuracy of the CHR-based cache benefit estimation, but will increase the cache overhead, since CHR won’t get evaluated until the cache table has apg_subquery_cache_check_interval rows.

  • Larger values of apg_subquery_cache_hit_rate_threshold bias towards abandoning subquery cache and returning back to the original, uncached subquery re-execution.

You can modify the cluster or instance parameter group to set the parameters. To learn more, see Working with parameter groups.

Alternatively, you can configure the setting for just the current session by calling the function set_config. For example, run the following command to turn on subquery cache in Babelfish. To learn more, see Configuration Settings Functions.

1> SELECT set_config('apg_enable_subquery_cache', 'on', false) 2> GO

Turning on subquery cache in Babelfish

Currently, subquery cache is OFF by default. As mentioned above, you can turn it on by modifying your parameter group. When apg_enable_subquery_cache is ON, Babelfish applies subquery cache to save subquery results. The query plan will then have a Memoize node under SubPlan.

For example, the following command sequence shows the estimated query execution plan of a simple correlated subquery without subquery cache. To learn more, see Using explain plan to improve Babelfish query performance.

1> CREATE TABLE outer_table (outer_col1 INT, outer_col2 INT) 2> CREATE TABLE inner_table (inner_col1 INT, inner_col2 INT) 3> GO 1> EXEC sp_babelfish_configure 'babelfishpg_tsql.explain_costs', 'off' 2> GO 1> SET BABELFISH_SHOWPLAN_ALL ON 2> GO 1> SELECT outer_col1, ( 2> SELECT inner_col1 3> FROM inner_table 4> WHERE inner_col2 = outer_col2 5> ) FROM outer_table 6> GO QUERY PLAN ------------------------------------------------------------ Query Text: SELECT outer_col1, ( SELECT inner_col1 FROM inner_table WHERE inner_col2 = outer_col2 ) FROM outer_table Seq Scan on outer_table SubPlan 1 -> Seq Scan on inner_table Filter: (inner_col2 = outer_table.outer_col2)
1> SET BABELFISH_SHOWPLAN_ALL OFF 2> GO 1> EXEC sp_babelfish_configure 'babelfishpg_tsql.explain_costs', 'on' 2> GO

After turning on apg_enable_subquery_cache, the query plan will contain a Memoize node under the SubPlan node, indicating that the subquery is planning to use cache.

Seq Scan on outer_table SubPlan 1 -> Memoize Cache Key: outer_table.outer_col2 Cache Mode: logical -> Seq Scan on inner_table Filter: (inner_col2 = outer_table.outer_col2)

The actual query execution plan contains more details of the subquery cache, including cache hits and cache misses. The following output shows the actual query execution plan of the above example query after inserting some values to the tables.

Seq Scan on outer_table (actual rows=10 loops=1) SubPlan 1 -> Memoize (actual rows=1 loops=10) Cache Key: outer_table.outer_col2 Cache Mode: logical Hits: 4 Misses: 6 Evictions: 0 Overflows: 0 Memory Usage: 1kB -> Seq Scan on inner_table (actual rows=1 loops=6) Filter: (inner_col2 = outer_table.outer_col2) Rows Removed by Filter: 4

The total cache hit number is 4, and the total cache miss number is 6. If the total number of hits and misses is less than the number of loops in the Memoize node, it means that the CHR evaluation did not pass and the cache was cleaned up and abandoned at some point. The subquery execution then returned back to the original uncached re-execution.

Limitations

Subquery cache does not support certain patterns of correlated subqueries. Those types of queries will be run without cache, even if subquery cache is turned on:

  • IN/EXISTS/ANY/ALL correlated subqueries

  • Correlated subqueries containing nondeterministic functions.

  • Correlated subqueries that reference an outer table column of BIT, VARBINARY or BINARY datatype.