subquery command
Note
To see which AWS data source integrations support this PPL command, see Commands.
Use the subquery
command to perform complex, nested queries
within your Piped Processing Language (PPL) statements.
source=logs | where field in [ subquery source=events | where condition | fields field ]
In this example, the primary search (source=logs
) is filtered by results from the subquery (source=events
).
The subquery command supports multiple levels of nesting for complex data analysis.
Nested Subquery Example
source=logs | where id in [ subquery source=users | where user in [ subquery source=actions | where action="login" | fields user] | fields uid ]
InSubquery Usage
source = outer | where a in [ source = inner | fields b ]
source = outer | where (a) in [ source = inner | fields b ]
source = outer | where (a,b,c) in [ source = inner | fields d,e,f ]
source = outer | where a not in [ source = inner | fields b ]
source = outer | where (a) not in [ source = inner | fields b ]
source = outer | where (a,b,c) not in [ source = inner | fields d,e,f ]
source = outer a in [ source = inner | fields b ]
(search filtering with subquery)source = outer a not in [ source = inner | fields b ]
(search filtering with subquery)source = outer | where a in [ source = inner1 | where b not in [ source = inner2 | fields c ] | fields b ]
(nested)source = table1 | inner join left = l right = r on l.a = r.a AND r.a in [ source = inner | fields d ] | fields l.a, r.a, b, c
(as join filter)
SQL Migration Examples with IN-Subquery PPL
TPC-H Q4 (in-subquery with aggregation)
select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '1993-07-01' and o_orderdate < date '1993-07-01' + interval '3' month and o_orderkey in ( select l_orderkey from lineitem where l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority
Rewritten by PPL InSubquery query:
source = orders | where o_orderdate >= "1993-07-01" and o_orderdate < "1993-10-01" and o_orderkey IN [ source = lineitem | where l_commitdate < l_receiptdate | fields l_orderkey ] | stats count(1) as order_count by o_orderpriority | sort o_orderpriority | fields o_orderpriority, order_count
TPC-H Q20 (nested in-subquery)
select s_name, s_address from supplier, nation where s_suppkey in ( select ps_suppkey from partsupp where ps_partkey in ( select p_partkey from part where p_name like 'forest%' ) ) and s_nationkey = n_nationkey and n_name = 'CANADA' order by s_name
Rewritten by PPL InSubquery query:
source = supplier | where s_suppkey IN [ source = partsupp | where ps_partkey IN [ source = part | where like(p_name, "forest%") | fields p_partkey ] | fields ps_suppkey ] | inner join left=l right=r on s_nationkey = n_nationkey and n_name = 'CANADA' nation | sort s_name
ExistsSubquery usage
Assumptions: a
, b
are fields of table outer,
c
, d
are fields of table inner,
e
, f
are fields of table inner2.
source = outer | where exists [ source = inner | where a = c ]
source = outer | where not exists [ source = inner | where a = c ]
source = outer | where exists [ source = inner | where a = c and b = d ]
source = outer | where not exists [ source = inner | where a = c and b = d ]
source = outer exists [ source = inner | where a = c ]
(search filtering with subquery)source = outer not exists [ source = inner | where a = c ]
(search filtering with subquery)source = table as t1 exists [ source = table as t2 | where t1.a = t2.a ]
(table alias is useful in exists subquery)source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]
(nested)source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]
(nested)source = outer | where exists [ source = inner | where c > 10 ]
(uncorrelated exists)source = outer | where not exists [ source = inner | where c > 10 ]
(uncorrelated exists)source = outer | where exists [ source = inner ] | eval l = "nonEmpty" | fields l
(special uncorrelated exists)
ScalarSubquery usage
Assumptions: a
, b
are fields of table outer, c
, d
are fields of table inner, e
, f
are fields of table nested
Uncorrelated scalar subquery
In Select:
source = outer | eval m = [ source = inner | stats max(c) ] | fields m, a
source = outer | eval m = [ source = inner | stats max(c) ] + b | fields m, a
In Where:
source = outer | where a > [ source = inner | stats min(c) ] | fields a
In Search filter:
source = outer a > [ source = inner | stats min(c) ] | fields a
Correlated scalar subquery
In Select:
source = outer | eval m = [ source = inner | where outer.b = inner.d | stats max(c) ] | fields m, a
source = outer | eval m = [ source = inner | where b = d | stats max(c) ] | fields m, a
source = outer | eval m = [ source = inner | where outer.b > inner.d | stats max(c) ] | fields m, a
In Where:
source = outer | where a = [ source = inner | where outer.b = inner.d | stats max(c) ]
source = outer | where a = [ source = inner | where b = d | stats max(c) ]
source = outer | where [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a
In Search filter:
source = outer a = [ source = inner | where b = d | stats max(c) ]
source = outer [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a
Nested scalar subquery
-
source = outer | where a = [ source = inner | stats max(c) | sort c ] OR b = [ source = inner | where c = 1 | stats min(d) | sort d ]
-
source = outer | where a = [ source = inner | where c = [ source = nested | stats max(e) by f | sort f ] | stats max(d) by c | sort c | head 1 ]
(Relation) Subquery
InSubquery
, ExistsSubquery
and ScalarSubquery
are all subquery expressions. But RelationSubquery
is not a subquery expression, it is a subquery plan which is common used in Join or From clause.
source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ]
(subquery in join right side)source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1
Additional Context
InSubquery
, ExistsSubquery
, and
ScalarSubquery
are subquery expressions commonly used in
where
clauses and search filters.
Where command:
| where <boolean expression> | ...
Search filter:
search source=* <boolean expression> | ...
A subquery expression could be used in a boolean expression:
| where orders.order_id in [ source=returns | where return_reason="damaged" | field order_id ]
The orders.order_id in [ source=... ]
is a <boolean expression>
.
In general, we name this kind of subquery clause the InSubquery
expression. It is a <boolean expression>
.
Subquery with different join types
Example using a ScalarSubquery
:
source=employees | join source=sales on employees.employee_id = sales.employee_id | where sales.sale_amount > [ source=targets | where target_met="true" | fields target_value ]
Unlike InSubquery, ExistsSubquery, and ScalarSubquery, a RelationSubquery is not a subquery expression. Instead, it's a subquery plan.
SEARCH source=customer | FIELDS c_custkey | LEFT OUTER JOIN left = c, right = o ON c.c_custkey = o.o_custkey [ SEARCH source=orders | WHERE o_comment NOT LIKE '%unusual%packages%' | FIELDS o_orderkey, o_custkey ] | STATS ...