dedup command
Note
To see which AWS data source integrations support this PPL command, see Commands.
Use the dedup
command to remove identical documents from your
search results based on specified fields.
Syntax
Use the following syntax:
dedup [int] <field-list> [keepempty=<bool>] [consecutive=<bool>]
int
-
Optional.
-
The
dedup
command retains multiple events for each combination when you specify <int>. The number for <int> must be greater than 0. If you don't specify a number, only the first occurring event is kept. All other duplicates are removed from the results. -
Default: 1
keepempty
-
Optional.
-
If true, keeps documents where any field in the field-list has a NULL value or is MISSING.
-
Default: false
consecutive
-
Optional.
-
If true, removes only events with consecutive duplicate combinations of values.
-
Default: false
field-list
-
Mandatory.
-
A comma-delimited list of fields. At least one field is required.
Example 1: Dedup by one field
This example shows how to dedup documents using the gender field.
PPL query:
os> source=accounts | dedup gender | fields account_number, gender; fetched rows / total rows = 2/2 +------------------+----------+ | account_number | gender | |------------------+----------| | 1 | M | | 13 | F | +------------------+----------+
Example 2: Keep 2 duplicates documents
The example shows how to dedup documents with the gender field, keeping two duplicates.
PPL query:
os> source=accounts | dedup 2 gender | fields account_number, gender; fetched rows / total rows = 3/3 +------------------+----------+ | account_number | gender | |------------------+----------| | 1 | M | | 6 | M | | 13 | F | +------------------+----------+
Example 3: Keep or ignore the empty field by default
The example shows how to dedup the document by keeping the null value field.
PPL query:
os> source=accounts | dedup email keepempty=true | fields account_number, email; fetched rows / total rows = 4/4 +------------------+-----------------------+ | account_number | email | +------------------+-----------------------+ | 1 | john_doe@example.com | | 6 | jane_doe@example.com | | 13 | null | | 18 | juan_li@example.com | +------------------+-----------------------+
The example shows how to dedup the document by ignoring the empty value field.
PPL query:
os> source=accounts | dedup email | fields account_number, email; fetched rows / total rows = 3/3 +------------------+-----------------------+ | account_number | email | +------------------+-----------------------+ | 1 | john_doe@example.com | | 6 | jane_doe@example.com | | 18 | juan_li@example.com | +------------------+-----------------------+
Example 4: Dedup in consecutive documents
The example shows how to dedup in consecutive documents.
PPL query:
os> source=accounts | dedup gender consecutive=true | fields account_number, gender; fetched rows / total rows = 3/3 +------------------+----------+ | account_number | gender | +------------------+----------+ | 1 | M | | 13 | F | | 18 | M | +------------------+----------+
Additional examples
source = table | dedup a | fields a,b,c
source = table | dedup a,b | fields a,b,c
source = table | dedup a keepempty=true | fields a,b,c
source = table | dedup a,b keepempty=true | fields a,b,c
source = table | dedup 1 a | fields a,b,c
source = table | dedup 1 a,b | fields a,b,c
source = table | dedup 1 a keepempty=true | fields a,b,c
source = table | dedup 1 a,b keepempty=true | fields a,b,c
source = table | dedup 2 a | fields a,b,c
source = table | dedup 2 a,b | fields a,b,c
source = table | dedup 2 a keepempty=true | fields a,b,c
source = table | dedup 2 a,b keepempty=true | fields a,b,c
source = table | dedup 1 a consecutive=true| fields a,b,c
(consecutive deduplication is unsupported)
Limitation
-
For
| dedup 2 a, b keepempty=false
DataFrameDropColumns('_row_number_) +- Filter ('_row_number_ <= 2) // allowed duplication = 2 +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST] +- Filter (isnotnull('a) AND isnotnull('b)) // keepempty=false +- Project +- UnresolvedRelation
-
For
| dedup 2 a, b keepempty=true
Union :- DataFrameDropColumns('_row_number_) : +- Filter ('_row_number_ <= 2) : +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST] : +- Filter (isnotnull('a) AND isnotnull('b)) : +- Project : +- UnresolvedRelation +- Filter (isnull('a) OR isnull('b)) +- Project +- UnresolvedRelation