Building a profile job configuration programmatically in AWS Glue DataBrew - AWS Glue DataBrew

Building a profile job configuration programmatically in AWS Glue DataBrew

In this section, you can find descriptions of profile job steps and functions that you can use programmatically. You can use them either from the AWS Command Line Interface (AWS CLI) or by using one of the AWS SDKs.

In a profile job, you can customize a configuration to control how DataBrew evaluates your dataset. You can apply the configuration to a dataset or apply it to particular columns. You can build the configuration when creating a profile job, and then update it anytime.

A profile configuration structure includes four parts:

Following is an example.

{ "ProfileColumns": [ { "Name": "example" }, { "Regex": "example.*" } ], "DatasetStatisticsConfiguration": { "IncludedStatistics": [ "CORRELATION" ], "Overrides": [ { "Statistic": "CORRELATION", "Parameters": { "columnSelectors": "[{\"name\":\"example\"}, {\"regex\":\"example.*\"}]" } } ] }, "ColumnStatisticsConfigurations": [ { "Selectors": [ { "Name": "example" } ], "Statistics": { "IncludedStatistics": [ "CORRELATION", "DUPLICATE_ROWS_COUNT" ], "Overrides": [ { "Statistic": "VALUE_DISTRIBUTION", "Parameters": { "binNumber": "10" } } ] } } ] }

ProfileColumns section

In the ProfileColumns section of your structure, set the columns from your dataset that you want to evaluate in your profile job. ProfileColumns is a list of column selectors (Selectors). You can specify either a column name or a regular expression in a column selector. An example follows.

"ProfileColumns": [{"Name": "example"}, {"Regex": "example.*"}]

When ProfileColumns is specified, only columns whose names match a name or regular expression in ProfileColumns are included in the profile job. If the profile job doesn't support a selected column's data type, DataBrew skips the selected column during the job run.

If ProfileColumns is undefined, the profile job evaluates all supported columns. Supported columns are columns containing data of a supported data type: ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, String, or Boolean.

DatasetStatisticsConfiguration section

In the DatasetStatisticsConfiguration section of your structure, you can build a configuration for intercolumn evaluations. The configuration includes IncludedStatistics and Overrides. An example follows.

"DatasetStatisticsConfiguration": { "IncludedStatistics": ["CORRELATION"], "Overrides": [ { "Statistic": "CORRELATION", "Parameters": { "columnSelectors": "[{\"name\":\"example\"}, {\"regex\":\"example.*\"}]" } } ] }

You can select evaluations that you want to have by adding evaluation names to IncludedStatistics. An example follows.

"IncludedStatistics": ["CORRELATION", "DUPLICATE_ROWS_COUNT"]

When you specify IncludedStatistics, only evaluations in the list are included in the profile job. If IncludedStatistics is undefined, the profile job runs all supported evaluations with default settings. You can exclude all evaluations by adding NONE to IncludedStatistics. An example follows.

"IncludedStatistics": ["NONE"]

Configurable statistics at the dataset level

In the DatasetStatisticsConfiguration section of your structure, a profile job supports the evaluations shown in the table following.

Statistic name Description Supported data types Default status Attributes of profile result Type of profile result

DUPLICATE_ROWS_COUNT

Count of duplicate rows in the dataset

all

Enable

duplicateRowsCount

Int

CORRELATION

Pearson Correlation Coefficient between two columns

number

Enable

correlations (in each selected column)

Object

In IncludedStatistics, you can override each evaluation's default settings by adding an override. Each override includes the name of a particular evaluation and a parameter map.

In DatasetStatisticsConfiguration, a profile job supports the CORRELATION override. This override calculates the Pearson Correlation Coefficient between two columns from a list of selected columns. The default setting is selecting the first 10 numeric columns. You can specify either a number of columns or a list of column selectors to override the default setting.

CORRELATION takes these parameters:

  • columnNumber – The number of numeric columns. The profile job selects the first n columns from the dataset. This value should be greater than 1. Use "ALL" to select all numeric columns.

  • columnSelectors: – List of column selectors. Each selector can have either a column name or a regular expression.

An example follows.

{ "Statistic": "CORRELATION", "Parameters": { "columnSelectors": "[{\"name\":\"example\"}, {\"regex\":\"example.*\"}]" } }

ColumnStatisticsConfigurations section

In the ColumnStatisticsConfigurations section of your structure, you can build configurations for particular columns. ColumnStatisticsConfigurations is a list of ColumnStatisticsConfiguration settings. In ColumnStatisticsConfiguration, there are Selectors, a list of column selectors, and Statistics for the configuration of statistics. An example follows.

{ "Selectors": [{"Name": "example"} ], "Statistics": { "IncludedStatistics": ["CORRELATION", "DUPLICATE_ROWS_COUNT"] "Overrides": [ { "Statistic": "VALUE_DISTRIBUTION", "Parameters": { "binNumber": "10" } } ] } }

Selectors is a list of column selectors. As with ProfileColumns, you can specify either a column name or a regular expression in each column selector. When you specify Selectors, the column configuration is applied to columns that match any column selector in Selectors. Otherwise, the configuration is applied to all supported columns.

In Statistics, you can override settings of selected columns. As with DatasetStatisticsConfiguration, Statistics has IncludedStatistics and Overrides.

To select the evaluations that you want, add evaluation names to IncludedStatistics.

"IncludedStatistics": ["CORRELATION", "DUPLICATE_ROWS_COUNT"]

When you specify IncludedStatistics, only evaluations in the list are included in the profile job. Otherwise, the profile job runs all supported evaluations with default settings.

You can exclude all evaluations by adding NONE to IncludedStatistics.

"IncludedStatistics": ["NONE"]

In some cases, there might be multiple configurations in ColumnStatisticsConfigurations that have different IncludedStatistics that you can apply to the same column. In these cases, the profile job picks the last configuration in ColumnStatisticsConfigurations and applies its IncludedStatistics to the selected column. A new configuration overrides older configurations.

Configurable statistics at the column level

In ColumnStatisticsConfigurations, a profile job supports the evaluations shown in the table following.

A supported data type of number in this table means that the attribute's data type is one of the following: ByteType, ShortType, IntegerType, LongType, FloatType, or DoubleType.

Statistic name Description Supported data types Default status Attributes of profile result Type of profile result

Name of the column.

all

name

string

Data type of the column.

all

type

string

DISTINCT_VALUES_COUNT

Number of distinct values. A distinct value is value that appears at least once.

number/boolean/string

Enabled

distinctValuesCount

Int

ENTROPY

Entropy (information theory).

number/boolean/string

Enabled

entropy

Double

INTER_QUARTILE_RANGE

Range between the 25th percent and 75th percent of numbers.

number

Enabled

interquartileRange

Double

KURTOSIS

Kurtosis of the column.

number

Enabled

kurtosis

Double

MAX

Maximum value in the column.

number/string length

Enabled

max

Int/Double

MAXIMUM_VALUES

List of the maximum values in the column and their counts.

number

Enabled

maximumValues

List

MEAN

Mean value of values in the column.

number/string length

Enabled

mean

Double

MEDIAN

Median of values in the column.

number/string length

Enabled

median

Double

MEDIAN_ABSOLUTE_DEVIATION

The median of the absolute differences between each data point and the median of a numeric column.

number

Enabled

medianAbsoluteDeviation

Double

MIN

Minimum value in the column.

number/string length

Enabled

min

Int/Double

MINIMUM_VALUES

List of the minimum values in the column and their counts.

number

Enabled

minimumValues

List

MISSING_VALUES_COUNT

Number of missing values in the column. Null and empty strings are considered as missing.

all

Enabled

missingValuesCount

Int

MODE

The most frequently occurring value in the column. If several values appear that often, the mode is one of those values.

number/string length

Enabled

mode

Int/Double

MOST_COMMON_VALUES

List of the most common values in the column.

number/boolean/string

Enabled

mostCommonValues

List

OUTLIER_DETECTION

Detect outliers in the column by Z_score algorithm. Count the number of outliers and extract a list of samples from detected outliers.

number/string length

Enabled

zScoreOutliersCount, zScoreOutliersSample

Int/List

PERCENTILES

Percentile values of numeric column (5%, 25%, 75%, 95%).

number

Enabled

percentile5, percentile25, percentile75, percentile95

Double

RANGE

Range of values in the column.

number

Enabled

range

Int/Double

SKEWNESS

Skewness of values in the column.

number

Enabled

skewness

Double

STANDARD_DEVIATION

Unbiased sample standard deviation of values in the column.

number/string length

Enabled

standardDeviation

Double

SUM

Sum of values in the column.

number

Enabled

sum

Int/Double

UNIQUE_VALUES_COUNT

Number of unique values. A unique value means that the value appears only once.

number/boolean/string

Enabled

uniqueValuesCount

Int

VALUE_DISTRIBUTION

Measure of the distribution of values in the column by range.

number/string length

Enabled

valueDistribution

List

VARIANCE

Variance of values in the column.

number

Enabled

variance

Double

Z_SCORE_DISTRIBUTION

Measure of the distribution of data points' z-score values by range.

number

Enabled

zScoreDistribution

List

ZEROS_COUNT

Number of zeroes (0s) in the column.

number

Enabled

zerosCount

Int

In IncludedStatistics, you can override each evaluation's default parameters by adding an override. Each override includes the name of a particular evaluation and a parameter map.

Parameters for ColumnStatisticsConfigurations columns

In ColumnStatisticsConfigurations, a profile job supports the following parameters.

In some cases, there might be multiple configurations in ColumnStatisticsConfigurations that have different IncludedStatistics that you can apply to the same column. In these cases, the profile job picks the last configuration in ColumnStatisticsConfigurations and applies its IncludedStatistics to the selected column. A new configuration overrides older configurations.

MAXIMUM_VALUES

Lists the maximum values in the numeric column and their counts. The default list size is 5. You can override the list size by specifying a value for sampleSize.

Settings

sampleSize – The size of list that includes the maximum number and count of values in the numeric column. This value should be greater than 0. Use "ALL" to list all values.

Example

{ "Statistic": "MAXIMUM_VALUES", "Parameters": { "sampleSize": "5" } }

MINIMUM_VALUES

Lists the minimum values in the numeric column and their counts. The default list size is 5. You can override the list size by specifying a value for sampleSize.

Settings

sampleSize – The size of list that includes the maximum number and count of values in the numeric column. This value should be greater than 0. Use "ALL" to list all values.

Example

{ "Statistic": "MINIMUM_VALUES", "Parameters": { "sampleSize": "5" } }

MOST_COMMON_VALUES

Lists the most common values in the column and their counts. The default list size is 50. You can override the list size by specifying a value for sampleSize.

Settings

sampleSize – The size of list that includes the maximum number and count of values in the numeric column. This value should be greater than 0. Use "ALL" to list all values.

Example

{ "Statistic": "MOST_COMMON_VALUES", "Parameters": { "sampleSize": "50" } }

OUTLIER_DETECTION

Detects outliers in the numeric column or string column (based on string length) by Z_score algorithm.

Your profile job counts the number of outliers and generates a sample list of outliers and their z-scores. The sample list is ordered by the z-score's absolute value. The default list size is 50.

The Z_Score algorithm identifies a value as an outlier when it deviates from the mean by more than the standard deviation threshold. The default outlier threshold is 3.

You can provide one more threshold, a mild threshold, to get more information. Your mild threshold should be less than your threshold. This feature is turned off by default. When a mild threshold is specified, your profile job returns one more count, zScoreMildOutliersCount. Also, zScoreOutliersSample can include a sample of mild threshold outliers in this case.

Settings

  • threshold – The threshold value to use when detecting outliers. This value should be greater or equal to 0.

  • mildThreshold – The mild threshold value to use when detecting outliers. This value should be greater or equal to 0 and less than threshold.

  • sampleSize – The size of list that includes outliers in the column. Use "ALL" to list all values.

Example

{ "Statistic": "OUTLIER_DETECTION", "Parameters": { "threshold": "5", "mildThreshold": "3.5", "sampleSize": "20" } }

VALUE_DISTRIBUTION

Measures the distribution of values in the column by the values' ranges. A profile job groups values from a numeric column or string column (based on string length) into bins by numeric ranges, and generates a list of bins. Bins are consecutive, and the upper bound for a bucket is the lower bound for the next bucket.

Settings

binNumber – Number of bins. This value should be greater than 0.

Example

{ "Statistic": "VALUE_DISTRIBUTION", "Parameters": { "binNumber": "5" } }

Z_SCORE_DISTRIBUTION

Measures the distribution of values’ z-scores in numeric column. A profile job groups z-scores of values into bins by numeric ranges, and generates a list of bins. Bins are consecutive, and the upper bound for a bucket is the lower bound for the next bucket.

Settings

binNumber – Number of bins. This value should be greater than 0.

Example

{ "Statistic": "Z_SCORE_DISTRIBUTION", "Parameters": { "binNumber": "5" } }

EntityDetectorConfiguration section for configuring PII

In the EntityDetectorConfiguration section of your structure, you can configure the entity types in your dataset that you want DataBrew to detect as personally identifiable information (PII) for a profile job.

EntityTypes

You configure the entity types you want DataBrew to detect as PII for your profile job. When EntityDetectorConfiguration is undefined, entity detection is disabled. The following entity types can be detected in your dataset:

  • USA_SSN

  • EMAIL

  • USA_ITIN

  • USA_PASSPORT_NUMBER

  • PHONE_NUMBER

  • USA_DRIVING_LICENSE

  • BANK_ACCOUNT

  • CREDIT_CARD

  • IP_ADDRESS

  • MAC_ADDRESS

  • USA_DEA_NUMBER

  • USA_HCPCS_CODE

  • USA_NATIONAL_PROVIDER_IDENTIFIER

  • USA_NATIONAL_DRUG_CODE

  • USA_HEALTH_INSURANCE_CLAIM_NUMBER

  • USA_MEDICARE_BENEFICIARY_IDENTIFIER

  • USA_CPT_CODE

  • PERSON_NAME

  • DATE

The entity type group USA_ALL is also supported, and includes all of the above entity types except PERSON_NAME and DATE.

The type of EntityTypes is an array of strings.

AllowedStatistics

Configure the statistics that are allowed to be run on columns that contain detected entities. If AllowedStatistics is undefined, no statistics will be computed on columns that contain detected entities. See Configurable statistics at the column level for a list of valid values for the AllowedStatistics parameter.

The type of AllowedStatistics is an array of AllowedStatistics objects.