join command
Note
To see which AWS data source integrations support this PPL command, see Commands.
The join command allows you to combine data from multiple sources based on common fields, enabling you to perform complex analyses and gain deeper insights from your distributed datasets
Schema
There are least two indices, otel-v1-apm-span-*
(large) and
otel-v1-apm-service-map
(small).
Relevant fields from indices:
otel-v1-apm-span-*
-
traceId - A unique identifier for a trace. All spans from the same trace share the same traceId.
-
spanId - A unique identifier for a span within a trace, assigned when the span is created.
-
parentSpanId - The spanId of this span's parent span. If this is a root span, then this field must be empty.
-
durationInNanos - The difference in nanoseconds between startTime and endTime. (this is
latency
in UI) -
serviceName - The resource from which the span originates.
-
traceGroup - The name of the trace's root span.
otel-v1-apm-service-map
-
serviceName - The name of the service that emitted the span.
-
destination.domain - The serviceName of the service being called by this client.
-
destination.resource - The span name (API, operation, and so on) being called by this client.
-
target.domain - The serviceName of the service being called by a client.
-
target.resource - The span name (API, operation, and so on) being called by a client.
-
traceGroupName - The top-level span name that started the request chain.
Requirement
Support join to calculate the following:
For each service, join span index on service map index to calculate metrics under different type of filters.
This sample query calculates latency when filtered by trace group
client_cancel_order
for the order
service.
SELECT avg(durationInNanos) FROM `otel-v1-apm-span-000001` t1 WHERE t1.serviceName = `order` AND ((t1.name in (SELECT target.resource FROM `otel-v1-apm-service-map` WHERE serviceName = `order` AND traceGroupName = `client_cancel_order`) AND t1.parentSpanId != NULL) OR (t1.parentSpanId = NULL AND t1.name = `client_cancel_order`)) AND t1.traceId in (SELECT traceId FROM `otel-v1-apm-span-000001` WHERE serviceName = `order`)
Migrate to PPL
Syntax of the join command
SEARCH source=<left-table> | <other piped command> | [joinType] JOIN [leftAlias] ON joinCriteria <right-table> | <other piped command>
Rewriting
SEARCH source=otel-v1-apm-span-000001 | WHERE serviceName = 'order' | JOIN left=t1 right=t2 ON t1.traceId = t2.traceId AND t2.serviceName = 'order' otel-v1-apm-span-000001 -- self inner join | EVAL s_name = t1.name -- rename to avoid ambiguous | EVAL s_parentSpanId = t1.parentSpanId -- RENAME command would be better when it is supported | EVAL s_durationInNanos = t1.durationInNanos | FIELDS s_name, s_parentSpanId, s_durationInNanos -- reduce colunms in join | LEFT JOIN left=s1 right=t3 ON s_name = t3.target.resource AND t3.serviceName = 'order' AND t3.traceGroupName = 'client_cancel_order' otel-v1-apm-service-map | WHERE (s_parentSpanId IS NOT NULL OR (s_parentSpanId IS NULL AND s_name = 'client_cancel_order')) | STATS avg(s_durationInNanos) -- no need to add alias if there is no ambiguous
joinType
-
Syntax:
INNER | LEFT OUTER | CROSS
-
Optional
-
The type of join to perform. The default is
INNER
if not specified.
leftAlias
-
Syntax:
left = <leftAlias>
-
Optional
-
The subquery alias to use with the left join side, to avoid ambiguous naming.
joinCriteria
-
Syntax:
<expression>
-
Required
-
The syntax starts with
ON
. It could be any comparison expression. Generally, the join criteria looks like<leftAlias>.<leftField>=<rightAlias>.<rightField>
.For example:
l.id = r.id
. If the join criteria contains multiple conditions, you can specifyAND
andOR
operator between each comparison expression. For example,l.id = r.id AND l.email = r.email AND (r.age > 65 OR r.age < 18)
.
More examples
Migration from SQL query (TPC-H Q13):
SELECT c_count, COUNT(*) AS custdist FROM ( SELECT c_custkey, COUNT(o_orderkey) c_count FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey ) AS c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC;
Rewritten by PPL join query:
SEARCH source=customer | FIELDS c_custkey | LEFT OUTER JOIN ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%' orders | STATS count(o_orderkey) AS c_count BY c_custkey | STATS count() AS custdist BY c_count | SORT - custdist, - c_count
Limitation: sub searches are unsupported in join right side.
If sub searches are supported, you can rewrite the above PPL query as follows:
SEARCH source=customer | FIELDS c_custkey | LEFT OUTER JOIN ON c_custkey = o_custkey [ SEARCH source=orders | WHERE o_comment NOT LIKE '%unusual%packages%' | FIELDS o_orderkey, o_custkey ] | STATS count(o_orderkey) AS c_count BY c_custkey | STATS count() AS custdist BY c_count | SORT - custdist, - c_count