dedup command - Amazon OpenSearch Service

dedup command

Note

To see which AWS data source integrations support this PPL command, see Commands.

Use the dedup command to remove identical documents from your search results based on specified fields.

Syntax

Use the following syntax:

dedup [int] <field-list> [keepempty=<bool>] [consecutive=<bool>]
int
  • Optional.

  • The dedup command retains multiple events for each combination when you specify <int>. The number for <int> must be greater than 0. If you don't specify a number, only the first occurring event is kept. All other duplicates are removed from the results.

  • Default: 1

keepempty
  • Optional.

  • If true, keeps documents where any field in the field-list has a NULL value or is MISSING.

  • Default: false

consecutive
  • Optional.

  • If true, removes only events with consecutive duplicate combinations of values.

  • Default: false

field-list
  • Mandatory.

  • A comma-delimited list of fields. At least one field is required.

Example 1: Dedup by one field

This example shows how to dedup documents using the gender field.

PPL query:

os> source=accounts | dedup gender | fields account_number, gender; fetched rows / total rows = 2/2 +------------------+----------+ | account_number | gender | |------------------+----------| | 1 | M | | 13 | F | +------------------+----------+
Example 2: Keep 2 duplicates documents

The example shows how to dedup documents with the gender field, keeping two duplicates.

PPL query:

os> source=accounts | dedup 2 gender | fields account_number, gender; fetched rows / total rows = 3/3 +------------------+----------+ | account_number | gender | |------------------+----------| | 1 | M | | 6 | M | | 13 | F | +------------------+----------+
Example 3: Keep or ignore the empty field by default

The example shows how to dedup the document by keeping the null value field.

PPL query:

os> source=accounts | dedup email keepempty=true | fields account_number, email; fetched rows / total rows = 4/4 +------------------+-----------------------+ | account_number | email | +------------------+-----------------------+ | 1 | john_doe@example.com | | 6 | jane_doe@example.com | | 13 | null | | 18 | juan_li@example.com | +------------------+-----------------------+

The example shows how to dedup the document by ignoring the empty value field.

PPL query:

os> source=accounts | dedup email | fields account_number, email; fetched rows / total rows = 3/3 +------------------+-----------------------+ | account_number | email | +------------------+-----------------------+ | 1 | john_doe@example.com | | 6 | jane_doe@example.com | | 18 | juan_li@example.com | +------------------+-----------------------+
Example 4: Dedup in consecutive documents

The example shows how to dedup in consecutive documents.

PPL query:

os> source=accounts | dedup gender consecutive=true | fields account_number, gender; fetched rows / total rows = 3/3 +------------------+----------+ | account_number | gender | +------------------+----------+ | 1 | M | | 13 | F | | 18 | M | +------------------+----------+
Additional examples
  • source = table | dedup a | fields a,b,c

  • source = table | dedup a,b | fields a,b,c

  • source = table | dedup a keepempty=true | fields a,b,c

  • source = table | dedup a,b keepempty=true | fields a,b,c

  • source = table | dedup 1 a | fields a,b,c

  • source = table | dedup 1 a,b | fields a,b,c

  • source = table | dedup 1 a keepempty=true | fields a,b,c

  • source = table | dedup 1 a,b keepempty=true | fields a,b,c

  • source = table | dedup 2 a | fields a,b,c

  • source = table | dedup 2 a,b | fields a,b,c

  • source = table | dedup 2 a keepempty=true | fields a,b,c

  • source = table | dedup 2 a,b keepempty=true | fields a,b,c

  • source = table | dedup 1 a consecutive=true| fields a,b,c (consecutive deduplication is unsupported)

Limitation
  • For | dedup 2 a, b keepempty=false

    DataFrameDropColumns('_row_number_) +- Filter ('_row_number_ <= 2) // allowed duplication = 2 +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST] +- Filter (isnotnull('a) AND isnotnull('b)) // keepempty=false +- Project +- UnresolvedRelation
  • For | dedup 2 a, b keepempty=true

    Union :- DataFrameDropColumns('_row_number_) : +- Filter ('_row_number_ <= 2) : +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST] : +- Filter (isnotnull('a) AND isnotnull('b)) : +- Project : +- UnresolvedRelation +- Filter (isnull('a) OR isnull('b)) +- Project +- UnresolvedRelation