Select your cookie preferences

We use essential cookies and similar tools that are necessary to provide our site and services. We use performance cookies to collect anonymous statistics, so we can understand how customers use our site and make improvements. Essential cookies cannot be deactivated, but you can choose “Customize” or “Decline” to decline performance cookies.

If you agree, AWS and approved third parties will also use cookies to provide useful site features, remember your preferences, and display relevant content, including relevant advertising. To accept or decline all non-essential cookies, choose “Accept” or “Decline.” To make more detailed choices, choose “Customize.”

FlagDuplicateRows class

Focus mode
FlagDuplicateRows class - AWS Glue

The FlagDuplicateRows transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

Example

from pyspark.context import SparkContext from pyspark.sql import SparkSession from awsgluedi.transforms import * sc = SparkContext() spark = SparkSession(sc) input_df = spark.createDataFrame( [ (105.111, 13.12), (13.12, 13.12), (None, 13.12), (13.12, 13.12), (None, 13.12), ], ["source_column_1", "source_column_2"], ) try: df_output = data_quality.FlagDuplicateRows.apply( data_frame=input_df, spark_context=sc, target_column="flag_row", true_string="True", false_string="False", target_index=1 ) except: print("Unexpected Error happened ") raise

Output

The output will be a PySpark DataFrame with an additional column flag_row that indicates whether a row is a duplicate or not, based on the source_column_1 column. The resulting `df_output` DataFrame will contain the following rows:

``` +---------------+---------------+--------+ |source_column_1|source_column_2|flag_row| +---------------+---------------+--------+ | 105.111| 13.12| False| | 13.12| 13.12| True| | null| 13.12| True| | 13.12| 13.12| True| | null| 13.12| True| +---------------+---------------+--------+ ```

The flag_row column indicates whether a row is a duplicate or not. The `true_string` is set to "True", and the `false_string` is set to "False". The `target_index` is set to 1, which means that the flag_row column will be inserted at the second position (index 1) in the output DataFrame.

Methods

__call__(spark_context, data_frame, target_column, true_string=DEFAULT_TRUE_STRING, false_string=DEFAULT_FALSE_STRING, target_index=None)

The FlagDuplicateRows transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

  • true_string – Value to be inserted if the row matches an earlier row.

  • false_string – Value to be inserted if the row is unique.

  • target_column – Name of the new column that is inserted in the dataset.

apply(cls, *args, **kwargs)

Inherited from GlueTransform apply.

name(cls)

Inherited from GlueTransform name.

describeArgs(cls)

Inherited from GlueTransform describeArgs.

describeReturn(cls)

Inherited from GlueTransform describeReturn.

describeTransform(cls)

Inherited from GlueTransform describeTransform.

describeErrors(cls)

Inherited from GlueTransform describeErrors.

describe(cls)

Inherited from GlueTransform describe.

PrivacySite termsCookie preferences
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.