FlagDuplicateRows class

The FlagDuplicateRows transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.
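The matching rule described above can be illustrated with a minimal pure-Python sketch. It is not the transform's implementation: the helper name `flag_duplicate_rows` and the sample rows are hypothetical, and it only assumes the behavior stated above, where a row is flagged when an identical row appeared earlier and the first occurrence is left unflagged.

```python
from typing import Hashable, List, Sequence, Tuple


def flag_duplicate_rows(
    rows: Sequence[Tuple[Hashable, ...]],
    true_string: str,
    false_string: str,
) -> List[str]:
    """Return one flag per row (hypothetical helper, not part of awsgluedi)."""
    seen = set()  # rows encountered so far
    flags = []
    for row in rows:
        if row in seen:
            # An identical row appeared earlier, so this one is a duplicate.
            flags.append(true_string)
        else:
            # First occurrence: remember it and leave it unflagged.
            seen.add(row)
            flags.append(false_string)
    return flags


rows = [("a", 1), ("b", 2), ("a", 1), ("a", 2), ("b", 2)]
flags = flag_duplicate_rows(rows, true_string="True", false_string="False")
print(flags)  # ['False', 'False', 'True', 'False', 'True']
```

Note that `("a", 2)` is not flagged: it shares a value with earlier rows in each column, but no earlier row matches it in every column.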

Example


```python
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# Sample data: rows 2 and 4 are identical, as are rows 3 and 5.
input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    # Add a flag_row column holding "True" or "False" for each row.
    df_output = data_quality.FlagDuplicateRows.apply(
        data_frame=input_df,
        spark_context=sc,
        target_column="flag_row",
        true_string="True",
        false_string="False",
        target_index=1,
    )
except Exception as e:
    # Report the failure, then re-raise so the job does not continue silently.
    print(f"Unexpected error: {e}")
    raise
```

Output

The output is a PySpark DataFrame with an additional column, flag_row, that indicates whether each row is a duplicate. The resulting `df_output` DataFrame contains the following rows:


```
+---------------+---------------+--------+
|source_column_1|source_column_2|flag_row|
+---------------+---------------+--------+
|        105.111|          13.12|   False|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
+---------------+---------------+--------+
```

The flag_row column indicates whether a row is a duplicate or not. The `true_string` is set to "True", and the `false_string` is set to "False". The `target_index` is set to 1, which means that the flag_row column will be inserted at the second position (index 1) in the output DataFrame.
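The zero-based positioning described for `target_index` can be sketched in plain Python. The helper `place_target_column` is hypothetical and only models the column order. Appending the column when `target_index` is `None` is an assumption, not documented behavior.

```python
from typing import List, Optional


def place_target_column(
    columns: List[str],
    target_column: str,
    target_index: Optional[int] = None,
) -> List[str]:
    """Return the output column order (hypothetical helper, not part of awsgluedi)."""
    ordered = list(columns)  # copy so the caller's list is not modified
    if target_index is None:
        # Assumption: with no index given, the new column goes last.
        ordered.append(target_column)
    else:
        # Zero-based index: 1 places the column in the second position.
        ordered.insert(target_index, target_column)
    return ordered


order = place_target_column(["source_column_1", "source_column_2"], "flag_row", 1)
print(order)  # ['source_column_1', 'flag_row', 'source_column_2']
```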

__call__(spark_context, data_frame, target_column, true_string=DEFAULT_TRUE_STRING, false_string=DEFAULT_FALSE_STRING, target_index=None)

true_string – Value to be inserted if the row matches an earlier row.
false_string – Value to be inserted if the row is unique.
target_column – Name of the new column that is inserted in the dataset.

apply(cls, *args, **kwargs)

Inherited from GlueTransform apply.

name(cls)

Inherited from GlueTransform name.

describeArgs(cls)

Inherited from GlueTransform describeArgs.

describeReturn(cls)

Inherited from GlueTransform describeReturn.

describeTransform(cls)

Inherited from GlueTransform describeTransform.

describeErrors(cls)

Inherited from GlueTransform describeErrors.

describe(cls)

Inherited from GlueTransform describe.
