subquery command - Amazon OpenSearch Service

subquery command

Note

To see which AWS data source integrations support this PPL command, see Commands.

Use the subquery command to perform complex, nested queries within your Piped Processing Language (PPL) statements.

source=logs | where field in [ subquery source=events | where condition | fields field ]

In this example, the primary search (source=logs) is filtered by results from the subquery (source=events).

The subquery command supports multiple levels of nesting for complex data analysis.

Nested Subquery Example

source=logs | where id in [ subquery source=users | where user in [ subquery source=actions | where action="login" | fields user] | fields uid ]
InSubquery Usage
  • source = outer | where a in [ source = inner | fields b ]

  • source = outer | where (a) in [ source = inner | fields b ]

  • source = outer | where (a,b,c) in [ source = inner | fields d,e,f ]

  • source = outer | where a not in [ source = inner | fields b ]

  • source = outer | where (a) not in [ source = inner | fields b ]

  • source = outer | where (a,b,c) not in [ source = inner | fields d,e,f ]

  • source = outer a in [ source = inner | fields b ] (search filtering with subquery)

  • source = outer a not in [ source = inner | fields b ] (search filtering with subquery)

  • source = outer | where a in [ source = inner1 | where b not in [ source = inner2 | fields c ] | fields b ] (nested)

  • source = table1 | inner join left = l right = r on l.a = r.a AND r.a in [ source = inner | fields d ] | fields l.a, r.a, b, c (as join filter)

SQL Migration Examples with IN-Subquery PPL

TPC-H Q4 (in-subquery with aggregation)

select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '1993-07-01' and o_orderdate < date '1993-07-01' + interval '3' month and o_orderkey in ( select l_orderkey from lineitem where l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority

Rewritten by PPL InSubquery query:

source = orders | where o_orderdate >= "1993-07-01" and o_orderdate < "1993-10-01" and o_orderkey IN [ source = lineitem | where l_commitdate < l_receiptdate | fields l_orderkey ] | stats count(1) as order_count by o_orderpriority | sort o_orderpriority | fields o_orderpriority, order_count

TPC-H Q20 (nested in-subquery)

select s_name, s_address from supplier, nation where s_suppkey in ( select ps_suppkey from partsupp where ps_partkey in ( select p_partkey from part where p_name like 'forest%' ) ) and s_nationkey = n_nationkey and n_name = 'CANADA' order by s_name

Rewritten by PPL InSubquery query:

source = supplier | where s_suppkey IN [ source = partsupp | where ps_partkey IN [ source = part | where like(p_name, "forest%") | fields p_partkey ] | fields ps_suppkey ] | inner join left=l right=r on s_nationkey = n_nationkey and n_name = 'CANADA' nation | sort s_name
ExistsSubquery usage

Assumptions: a, b are fields of table outer, c, d are fields of table inner, e, f are fields of table inner2.

  • source = outer | where exists [ source = inner | where a = c ]

  • source = outer | where not exists [ source = inner | where a = c ]

  • source = outer | where exists [ source = inner | where a = c and b = d ]

  • source = outer | where not exists [ source = inner | where a = c and b = d ]

  • source = outer exists [ source = inner | where a = c ] (search filtering with subquery)

  • source = outer not exists [ source = inner | where a = c ] (search filtering with subquery)

  • source = table as t1 exists [ source = table as t2 | where t1.a = t2.a ] (table alias is useful in exists subquery)

  • source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ] (nested)

  • source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ] (nested)

  • source = outer | where exists [ source = inner | where c > 10 ] (uncorrelated exists)

  • source = outer | where not exists [ source = inner | where c > 10 ] (uncorrelated exists)

  • source = outer | where exists [ source = inner ] | eval l = "nonEmpty" | fields l (special uncorrelated exists)

ScalarSubquery usage

Assumptions: a, b are fields of table outer, c, d are fields of table inner, e, f are fields of table nested

Uncorrelated scalar subquery

In Select:

  • source = outer | eval m = [ source = inner | stats max(c) ] | fields m, a

  • source = outer | eval m = [ source = inner | stats max(c) ] + b | fields m, a

In Where:

  • source = outer | where a > [ source = inner | stats min(c) ] | fields a

In Search filter:

  • source = outer a > [ source = inner | stats min(c) ] | fields a

Correlated scalar subquery

In Select:

  • source = outer | eval m = [ source = inner | where outer.b = inner.d | stats max(c) ] | fields m, a

  • source = outer | eval m = [ source = inner | where b = d | stats max(c) ] | fields m, a

  • source = outer | eval m = [ source = inner | where outer.b > inner.d | stats max(c) ] | fields m, a

In Where:

  • source = outer | where a = [ source = inner | where outer.b = inner.d | stats max(c) ]

  • source = outer | where a = [ source = inner | where b = d | stats max(c) ]

  • source = outer | where [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a

In Search filter:

  • source = outer a = [ source = inner | where b = d | stats max(c) ]

  • source = outer [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a

Nested scalar subquery

  • source = outer | where a = [ source = inner | stats max(c) | sort c ] OR b = [ source = inner | where c = 1 | stats min(d) | sort d ]

  • source = outer | where a = [ source = inner | where c = [ source = nested | stats max(e) by f | sort f ] | stats max(d) by c | sort c | head 1 ]

(Relation) Subquery

InSubquery, ExistsSubquery and ScalarSubquery are all subquery expressions. But RelationSubquery is not a subquery expression, it is a subquery plan which is common used in Join or From clause.

  • source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] (subquery in join right side)

  • source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1

Additional Context

InSubquery, ExistsSubquery, and ScalarSubquery are subquery expressions commonly used in where clauses and search filters.

Where command:

| where <boolean expression> | ...

Search filter:

search source=* <boolean expression> | ...

A subquery expression could be used in a boolean expression:

| where orders.order_id in [ source=returns | where return_reason="damaged" | field order_id ]

The orders.order_id in [ source=... ] is a <boolean expression>.

In general, we name this kind of subquery clause the InSubquery expression. It is a <boolean expression>.

Subquery with different join types

Example using a ScalarSubquery:

source=employees | join source=sales on employees.employee_id = sales.employee_id | where sales.sale_amount > [ source=targets | where target_met="true" | fields target_value ]

Unlike InSubquery, ExistsSubquery, and ScalarSubquery, a RelationSubquery is not a subquery expression. Instead, it's a subquery plan.

SEARCH source=customer | FIELDS c_custkey | LEFT OUTER JOIN left = c, right = o ON c.c_custkey = o.o_custkey [ SEARCH source=orders | WHERE o_comment NOT LIKE '%unusual%packages%' | FIELDS o_orderkey, o_custkey ] | STATS ...