

# Record matching with AWS Lake Formation FindMatches
<a name="machine-learning"></a>

**Note**  
Record matching is currently unavailable in the following Regions in the AWS Glue console: Middle East (UAE), Europe (Spain), Asia Pacific (Jakarta), and Europe (Zurich).

AWS Lake Formation provides machine learning capabilities to create custom transforms to cleanse your data. There is currently one available transform, named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records have no common unique identifier and no fields match exactly. You don't need to write any code or know how machine learning works to use it. FindMatches can be useful in many different problems, such as:
+ **Matching customers**: Linking customer records across different customer databases, even when many customer fields do not match exactly across the databases (for example, different name spellings, address differences, or missing or inaccurate data).
+ **Matching products**: Matching products in your catalog against other product sources, such as product catalog against a competitor's catalog, where entries are structured differently.
+ **Improving fraud detection**: Identifying duplicate customer accounts, determining when a newly created account is (or might be) a match for a previously known fraudulent user.
+ **Other matching problems**: Match addresses, movies, parts lists, and so on. In general, if a human being could look at your database rows and determine that they were a match, there is a very good chance that the FindMatches transform can help you.

You can create these transforms when you create a job. The transform that you create is based on a source data store schema and example data from the source dataset that you label (we call this process "teaching" a transform). The records that you label must be present in the source dataset. During this process, AWS Glue generates a file that you label and upload back, and the transform learns from your labels. After you teach your transform, you can call it from your Spark-based AWS Glue job (PySpark or Scala Spark) and use it in other scripts with a compatible source data store.

After the transform is created, it is stored in AWS Glue. On the AWS Glue console, you can manage the transforms that you create. In the navigation pane, under **Data Integration and ETL**, choose **Data classification tools > Record Matching** to edit and continue to teach your machine learning transform. For more information about managing transforms on the console, see [Working with machine learning transforms](console-machine-learning-transforms.md).

**Note**  
AWS Glue version 2.0 FindMatches jobs use the Amazon S3 bucket `aws-glue-temp-<accountID>-<region>` to store temporary files while the transform is processing data. You can delete this data after the run has completed, either manually or by setting an Amazon S3 Lifecycle rule.

## Types of machine learning transforms
<a name="machine-learning-transforms"></a>

You can create machine learning transforms to cleanse your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a *DynamicFrame*, which is an extension to an Apache Spark SQL `DataFrame`. The `DynamicFrame` contains your data, and you reference its schema to process your data.

The following types of machine learning transforms are available:

*Find matches*  
Finds duplicate records in the source data. You teach this machine learning transform by labeling example datasets to indicate which rows match. The more labeled example data you teach it with, the better the transform learns which rows should match. Depending on how you configure the transform, the output is one of the following:  
+ A copy of the input table plus a `match_id` column filled in with values that indicate matching sets of records. The `match_id` column is an arbitrary identifier. Any records that have the same `match_id` have been identified as matching each other. Records with different `match_id` values do not match.
+ A copy of the input table with duplicate rows removed. If multiple duplicates are found, then the record with the lowest primary key is kept.

*Find incremental matches*  
The Find matches transform can also be configured to find matches across existing and incremental frames, returning as output a column that contains a unique ID per match group.  
For more information, see [Finding incremental matches](machine-learning-incremental-matches.md).
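As a sketch of the first Find matches output mode, the `match_id` column can be consumed downstream by grouping on it. The following plain-Python example uses hypothetical records (the values and the dedup-by-lowest-primary-key rule mirror the behavior described above, but the data is illustrative only):

```python
from collections import defaultdict

# Hypothetical FindMatches output rows: (primary_key, match_id, name).
rows = [
    (101, "m1", "John Doe"),
    (205, "m1", "Jon Doe"),
    (150, "m2", "Jane Smith"),
]

# Group records that were identified as the same entity.
groups = defaultdict(list)
for pk, match_id, name in rows:
    groups[match_id].append((pk, name))

# Deduplicate: keep the record with the lowest primary key per match group,
# mirroring the "duplicate rows removed" output mode described above.
deduped = {mid: min(members)[0] for mid, members in groups.items()}
print(deduped)  # {'m1': 101, 'm2': 150}
```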

### Using the FindMatches transform
<a name="machine-learning-find-matches"></a>

You can use the `FindMatches` transform to find duplicate records in the source data. A labeling file is generated or provided to help teach the transform.

**Note**  
Currently, `FindMatches` transforms that use a custom encryption key aren't supported in the following Regions:  
Asia Pacific (Osaka) - `ap-northeast-3`

To get started with the FindMatches transform, follow the steps below. For a more advanced and detailed example, see the AWS Big Data Blog post [Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view](https://aws.amazon.com/blogs/big-data/harmonize-data-using-aws-glue-and-aws-lake-formation-findmatches-ml-to-build-a-customer-360-view/).

#### Getting started using the Find Matches transform
<a name="machine-learning-find-mathes-workflow"></a>

Follow these steps to get started with the `FindMatches` transform:

1. Create a table in the AWS Glue Data Catalog for the source data that is to be cleaned. For information about how to create a crawler, see [Working with Crawlers on the AWS Glue Console](https://docs.aws.amazon.com/glue/latest/dg/console-crawlers.html).

   If your source data is a text-based file such as a comma-separated values (CSV) file, consider the following: 
   + Keep your input record CSV file and labeling file in separate folders. Otherwise, the AWS Glue crawler might consider them as multiple parts of the same table and create tables in the Data Catalog incorrectly. 
   + Unless your CSV file includes only ASCII characters, ensure that the CSV files use UTF-8 without BOM (byte order mark) encoding. Microsoft Excel often adds a BOM at the beginning of UTF-8 CSV files. To remove it, open the CSV file in a text editor and resave the file as **UTF-8 without BOM**. 

1. On the AWS Glue console, create a job, and choose the **Find matches** transform type.
**Important**  
The data source table that you choose for the job can't have more than 100 columns.

1. Tell AWS Glue to generate a labeling file by choosing **Generate labeling file**. AWS Glue takes the first pass at grouping similar records for each `labeling_set_id` so that you can review those groupings. You label matches in the `label` column.
   + If you already have a labeling file, that is, an example of records that indicate matching rows, upload the file to Amazon Simple Storage Service (Amazon S3). For information about the format of the labeling file, see [Labeling file format](#machine-learning-labeling-file). Proceed to step 4.

1. Download the labeling file and label the file as described in the [Labeling](#machine-learning-labeling) section.

1. Upload the corrected labeled file. AWS Glue runs tasks to teach the transform how to find matches.

   On the **Machine learning transforms** list page, choose the **History** tab. This page indicates when AWS Glue performs the following tasks:
   + **Import labels**
   + **Export labels**
   + **Generate labels**
   + **Estimate quality**

1. To create a better transform, you can iteratively download, label, and upload the labeled file. In the initial runs, many more records might be mismatched, but AWS Glue learns as you continue to teach it with verified labeling files.

1. Evaluate and tune your transform by evaluating performance and results of finding matches. For more information, see [Tuning machine learning transforms in AWS Glue](add-job-machine-learning-transform-tuning.md).

#### Labeling
<a name="machine-learning-labeling"></a>

When `FindMatches` generates a labeling file, records are selected from your source table. Based on previous training, `FindMatches` identifies the most valuable records to learn from.

The act of *labeling* is editing a labeling file (we suggest using a spreadsheet such as Microsoft Excel) and adding identifiers, or labels, to the `label` column that identify matching and nonmatching records. It is important to have a clear and consistent definition of a match in your source data. `FindMatches` learns from the records that you designate as matches (or not) and uses your decisions to learn how to find duplicate records.

A labeling file generated by `FindMatches` contains approximately 100 records. These 100 records are typically divided into 10 *labeling sets*, where each labeling set is identified by a unique `labeling_set_id` generated by `FindMatches`. Each labeling set should be viewed as a separate labeling task, independent of the other labeling sets. Your task is to identify matching and nonmatching records within each labeling set.

##### Tips for editing labeling files in a spreadsheet
<a name="machine-learning-labeling-tips"></a>

When editing the labeling file in a spreadsheet application, consider the following:
+ The file might not open with column fields fully expanded. You might need to expand the `labeling_set_id` and `label` columns to see content in those cells.
+ If the primary key column is a number, such as a `long` data type, the spreadsheet might interpret it as a number and change the value. This key value must be treated as text. To correct this problem, format all the cells in the primary key column as **Text data**.

#### Labeling file format
<a name="machine-learning-labeling-file"></a>

The labeling file that is generated by AWS Glue to teach your `FindMatches` transform uses the following format. If you generate your own file for AWS Glue, it must follow this format as well:
+ It is a comma-separated values (CSV) file. 
+ It must be encoded in `UTF-8`. Be aware that if you edit the file using Microsoft Windows, it might be saved with `cp1252` encoding instead.
+ It must be in an Amazon S3 location to pass it to AWS Glue.
+ Use a moderate number of rows for each labeling task. 10–20 rows per task are recommended, although 2–30 rows per task are acceptable. Tasks larger than 50 rows are not recommended and may cause poor results or system failure.
+ If you have already-labeled data consisting of pairs of records labeled as a "match" or a "no match", you can use it directly. These labeled pairs can be represented as labeling sets of size 2: label both records with, for instance, the letter "A" if they match, but label one "A" and one "B" if they do not match.
**Note**  
 Because it has additional columns, the labeling file has a different schema from a file that contains your source data. Place the labeling file in a different folder from any transform input CSV file so that the AWS Glue crawler does not consider it when it creates tables in the Data Catalog. Otherwise, the tables created by the AWS Glue crawler might not correctly represent your data. 
+ The first two columns (`labeling_set_id`, `label`) are required by AWS Glue. The remaining columns must match the schema of the data that is to be processed.
+ For each `labeling_set_id`, you identify all matching records by using the same label. A label is a unique string placed in the `label` column. We recommend using labels that contain simple characters, such as A, B, C, and so on. Labels are case sensitive and are entered in the `label` column.
+ Rows that contain the same `labeling_set_id` and the same label are understood to be labeled as a match.
+ Rows that contain the same `labeling_set_id` and a different label are understood to be labeled as *not* a match.
+ Rows that contain a different `labeling_set_id` are understood to be conveying no information for or against matching.

  The following is an example of labeling the data:    
<a name="table-labeling-data"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/machine-learning.html)
+ In the preceding example, we identify John/Johnny/Jon Doe as being a match and we teach the system that these records do not match Jane Smith. Separately, we teach the system that Richard and Rich Jones are the same person, but that these records do not match Sarah Jones/Jones-Walker and Richie Jones Jr.
+ As you can see, the scope of the labels is limited to the `labeling_set_id`. So labels do not cross `labeling_set_id` boundaries. For example, a label "A" in `labeling_set_id` 1 does not have any relation to label "A" in `labeling_set_id` 2.
+ If a record does not have any matches within a labeling set, then assign it a unique label. For instance, Jane Smith does not match any record in labeling set ABC123, so it is the only record in that labeling set with the label of B.
+ The labeling set "GHI678" shows that a labeling set can consist of just two records which are given the same label to show that they match. Similarly, "XYZABC" shows two records given different labels to show that they do not match.
+ Note that sometimes a labeling set may contain no matches (that is, you give every record in the labeling set a different label) or a labeling set might all be "the same" (you gave them all the same label). This is okay as long as your labeling sets collectively contain examples of records that are and are not "the same" by your criteria.
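To make the file format concrete, here is a minimal, hypothetical labeling file and a plain-Python check of the matching rules above. The first two columns are the ones AWS Glue requires; the remaining columns (`primary_key`, `name`) stand in for your source schema, and all records and labels are illustrative only:

```python
import csv
import io

# Hypothetical labeling file content, following the format described above.
labeling_csv = """\
labeling_set_id,label,primary_key,name
ABC123,A,101,John Doe
ABC123,A,102,Johnny Doe
ABC123,B,103,Jane Smith
GHI678,A,201,Rich Jones
GHI678,A,202,Richard Jones
"""

rows = list(csv.DictReader(io.StringIO(labeling_csv)))

def are_labeled_match(r1, r2):
    """Same labeling set and same label means the pair is labeled as a match."""
    return (r1["labeling_set_id"] == r2["labeling_set_id"]
            and r1["label"] == r2["label"])

print(are_labeled_match(rows[0], rows[1]))  # True  (same set, same label)
print(are_labeled_match(rows[0], rows[2]))  # False (same set, different label)
```

Note that rows from different labeling sets (for example, `ABC123` and `GHI678`) convey no information for or against matching, regardless of their labels.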

**Important**  
Confirm that the IAM role that you pass to AWS Glue has access to the Amazon S3 bucket that contains the labeling file. By convention, AWS Glue policies grant permission to Amazon S3 buckets or folders whose names are prefixed with **aws-glue-**. If your labeling files are in a different location, add permission to that location in the IAM role.

# Tuning machine learning transforms in AWS Glue
<a name="add-job-machine-learning-transform-tuning"></a>

You can tune your machine learning transforms in AWS Glue to improve the results of your data-cleansing jobs to meet your objectives. To improve your transform, you can teach it by generating a labeling set, adding labels, and then repeating these steps several times until you get your desired results. You can also tune by changing some machine learning parameters. 

For more information about machine learning transforms, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md).

**Topics**
+ [Machine learning measurements](machine-learning-terminology.md)
+ [Deciding between precision and recall](machine-learning-precision-recall-tradeoff.md)
+ [Deciding between accuracy and cost](machine-learning-accuracy-cost-tradeoff.md)
+ [Estimating the quality of matches using match confidence scores](match-scoring.md)
+ [Teaching the Find Matches transform](machine-learning-teaching.md)

# Machine learning measurements
<a name="machine-learning-terminology"></a>

To understand the measurements that are used to tune your machine learning transform, you should be familiar with the following terminology:

**True positive (TP)**  
A match in the data that the transform correctly found, sometimes called a *hit*.

**True negative (TN)**  
A nonmatch in the data that the transform correctly rejected.

**False positive (FP)**  
A nonmatch in the data that the transform incorrectly classified as a match, sometimes called a *false alarm*.

**False negative (FN)**  
A match in the data that the transform didn't find, sometimes called a *miss*.

For more information about the terminology that is used in machine learning, see [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) in Wikipedia.

To tune your machine learning transforms, you can change the value of the following measurements in the **Advanced properties** of the transform.
+ **Precision** measures how well the transform finds true positives among the total number of records that it identifies as positive (true positives and false positives). For more information, see [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.
+ **Recall** measures how well the transform finds true positives from the total records in the source data. For more information, see [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.
+ **Accuracy** measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost, but it also results in increased recall. For more information, see [Accuracy and precision](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_information_systems) in Wikipedia.
+ **Cost** measures how many compute resources (and thus money) are consumed to run the transform.
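Precision, recall, and accuracy derive from the four counts above in the standard way. A quick Python sketch with hypothetical counts (the numbers are illustrative, not from any real quality estimate):

```python
def precision(tp, fp):
    # Of everything the transform called a match, how much really was.
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all real matches in the data, how many the transform found.
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Fraction of all decisions (match and nonmatch) that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts from an Estimate quality run:
tp, tn, fp, fn = 80, 100, 10, 20
print(round(precision(tp, fp), 3))         # 0.889
print(round(recall(tp, fn), 3))            # 0.8
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.857
```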

# Deciding between precision and recall
<a name="machine-learning-precision-recall-tradeoff"></a>

Each `FindMatches` transform contains a `precision-recall` parameter. You use this parameter to specify one of the following:
+ If you are more concerned about the transform falsely reporting that two records match when they actually don't match, then you should emphasize *precision*. 
+ If you are more concerned about the transform failing to detect records that really do match, then you should emphasize *recall*.

You can make this trade-off on the AWS Glue console or by using the AWS Glue machine learning API operations.

**When to favor precision**  
Favor precision if you are more concerned about the risk that `FindMatches` results in a pair of records matching when they don't actually match. To favor precision, choose a *higher* precision-recall trade-off value. With a higher value, the `FindMatches` transform requires more evidence to decide that a pair of records should be matched. The transform is tuned to bias toward saying that records do not match.

For example, suppose that you're using `FindMatches` to detect duplicate items in a video catalog, and you provide a higher precision-recall value to the transform. If your transform incorrectly detects that *Star Wars: A New Hope* is the same as *Star Wars: The Empire Strikes Back*, a customer who wants *A New Hope* might be shown *The Empire Strikes Back*. This would be a poor customer experience. 

However, if the transform fails to detect that *Star Wars: A New Hope* and *Star Wars: Episode IV—A New Hope* are the same item, the customer might be confused at first but might eventually recognize them as the same. It would be a mistake, but not as bad as the previous scenario.

**When to favor recall**  
Favor recall if you are more concerned about the risk that the `FindMatches` transform results might fail to detect a pair of records that actually do match. To favor recall, choose a *lower* precision-recall trade-off value. With a lower value, the `FindMatches` transform requires less evidence to decide that a pair of records should be matched. The transform is tuned to bias toward saying that records do match.

For example, this might be a priority for a security organization. Suppose that you are matching customers against a list of known defrauders, and it is important to determine whether a customer is a defrauder. You are using `FindMatches` to match the defrauder list against the customer list. Every time `FindMatches` detects a match between the two lists, a human auditor is assigned to verify that the person is, in fact, a defrauder. Your organization might prefer to choose recall over precision. In other words, you would rather have the auditors manually review and reject some cases when the customer is not a defrauder than fail to identify that a customer is, in fact, on the defrauder list.

**How to favor both precision and recall**  
The best way to improve both precision and recall is to label more data. As you label more data, the overall accuracy of the `FindMatches` transform improves, thus improving both precision and recall. Nevertheless, even with the most accurate transform, there is always a gray area where you need to experiment with favoring precision or recall, or choose a value in the middle. 

# Deciding between accuracy and cost
<a name="machine-learning-accuracy-cost-tradeoff"></a>

Each `FindMatches` transform contains an `accuracy-cost` parameter. You can use this parameter to specify one of the following:
+ If you are more concerned with the transform accurately reporting that two records match, then you should emphasize *accuracy*.
+ If you are more concerned about the cost or speed of running the transform, then you should emphasize *lower cost*.

You can make this trade-off on the AWS Glue console or by using the AWS Glue machine learning API operations.

**When to favor accuracy**  
Favor accuracy if you are more concerned about the risk that the `FindMatches` results won't contain matches. To favor accuracy, choose a *higher* accuracy-cost trade-off value. With a higher value, the `FindMatches` transform requires more time to do a more thorough search for correctly matching records. Note that this parameter doesn't make it less likely to falsely call a nonmatching record pair a match. The transform is tuned to bias toward spending more time finding matches.

**When to favor cost**  
Favor cost if you are more concerned about the cost of running the `FindMatches` transform and less about how many matches are found. To favor cost, choose a *lower* accuracy-cost trade-off value. With a lower value, the `FindMatches` transform requires fewer resources to run. The transform is tuned to bias toward finding fewer matches. If the results are acceptable when favoring lower cost, use this setting.

**How to favor both accuracy and lower cost**  
It takes more machine time to examine more pairs of records to determine whether they might be matches. If you want to reduce cost without reducing quality, here are some steps you can take: 
+ Eliminate records in your data source that you aren't concerned about matching.
+ Eliminate columns from your data source that you are sure aren't useful for making a match/no-match decision. A good way of deciding this is to eliminate columns that you don't think affect your own decision about whether a set of records is “the same.”
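The second suggestion, dropping columns that don't inform the match decision, can be done as a simple preprocessing step before the transform runs. A plain-Python sketch with hypothetical data (in a real job you would select columns on the DynamicFrame or DataFrame instead):

```python
import csv
import io

# Hypothetical source data; "notes" and "internal_flag" don't help decide
# whether two records describe the same entity, so we drop them.
source_csv = """\
id,name,address,notes,internal_flag
1,John Doe,12 Oak St,hello,x
2,Jon Doe,12 Oak Street,world,y
"""

# Keep only the columns a human would use to decide "same or not".
useful = ["id", "name", "address"]
reader = csv.DictReader(io.StringIO(source_csv))
pruned = [{k: row[k] for k in useful} for row in reader]
print(pruned[0])  # {'id': '1', 'name': 'John Doe', 'address': '12 Oak St'}
```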

# Estimating the quality of matches using match confidence scores
<a name="match-scoring"></a>

Match confidence scores provide an estimate of the quality of the matches found by FindMatches, helping you distinguish matched records in which the machine learning model is highly confident from those it considers uncertain or unlikely. A match confidence score is between 0 and 1, where a higher score means higher similarity. Examining match confidence scores lets you distinguish between clusters of matches in which the system is highly confident (which you might decide to merge), clusters about which the system is uncertain (which you might decide to have reviewed by a human), and clusters that the system deems unlikely (which you might decide to reject).

You might want to adjust your training data in situations where you see a high match confidence score but determine that the records don't match, or where you see a low score but determine that the records do, in fact, match.

Confidence scores are particularly useful for large industrial datasets, where it is infeasible to review every FindMatches decision.

Match confidence scores are available in AWS Glue version 2.0 or later.

## Generating match confidence scores
<a name="specifying-match-scoring"></a>

You can generate match confidence scores by setting the Boolean parameter `computeMatchConfidenceScores` to `True` when calling the `FindMatches` or `FindIncrementalMatches` API.

AWS Glue adds a new column, `match_confidence_score`, to the output.
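One way to consume the new column is to bucket each match group into merge/review/reject tiers, as the preceding section suggests. A plain-Python sketch follows; the threshold values are illustrative assumptions, not AWS recommendations:

```python
def triage(score, merge_at=0.9, review_at=0.6):
    """Bucket a match group by its match_confidence_score.

    Thresholds are assumptions for illustration: merge highly confident
    groups, send uncertain ones to human review, reject the rest.
    """
    if score >= merge_at:
        return "merge"
    if score >= review_at:
        return "review"
    return "reject"

# Scores per match_id; the first three come from the examples below,
# the last one is hypothetical.
scores = {
    85899345947: 0.982,
    85899345928: 0.831,
    85899345930: 0.697,
    85899345999: 0.410,
}
print({mid: triage(s) for mid, s in scores.items()})
```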

## Match scoring examples
<a name="match-scoring-examples"></a>

For example, consider the following matched records:

**Score >= 0.9**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

3281355037663    85899345947   0.9823658302132061
1546188247619    85899345947   0.9823658302132061
```

Details:

![\[Details of the matched records with a score of at least 0.9.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score1.png)


From this example, we can see that two records are very similar and share `display_position`, `primary_name`, and `street name`. 

**Score >= 0.8 and score < 0.9**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

309237680432     85899345928   0.8309852373674638
3590592666790    85899345928   0.8309852373674638
343597390617     85899345928   0.8309852373674638
249108124906     85899345928   0.8309852373674638
463856477937     85899345928   0.8309852373674638
```

Details:

![\[Details of the matched records with a score between 0.8 and 0.9.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score2.png)


From this example, we can see that these records share the same `primary_name` and `country`.

**Score >= 0.6 and score < 0.7**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

2164663519676    85899345930   0.6971099896480333
 317827595278    85899345930   0.6971099896480333
 472446424341    85899345930   0.6971099896480333
3118146262932    85899345930   0.6971099896480333
 214748380804    85899345930   0.6971099896480333
```

Details:

![\[Details of the matched records with a score between 0.6 and 0.7.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score3.png)


From this example, we can see that these records share only the same `primary_name`.

For more information, see:
+ [Step 5: Add and run a job with your machine learning transform](machine-learning-transform-tutorial.md#ml-transform-tutorial-add-job)
+ PySpark: [FindMatches class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ PySpark: [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ Scala: [FindMatches class](glue-etl-scala-apis-glue-ml-findmatches.md)
+ Scala: [FindIncrementalMatches class](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)

# Teaching the Find Matches transform
<a name="machine-learning-teaching"></a>

Each `FindMatches` transform must be taught what should be considered a match and what should not be considered a match. You teach your transform by adding labels to a file and uploading your choices to AWS Glue. 

You can orchestrate this labeling on the AWS Glue console or by using the AWS Glue machine learning API operations.

**How many times should I add labels? How many labels do I need?**  
The answers to these questions are mostly up to you. You must evaluate whether `FindMatches` is delivering the level of accuracy that you need and whether you think the extra labeling effort is worth it for you. The best way to decide this is to look at the “Precision,” “Recall,” and “Area under the precision recall curve” metrics that you can generate when you choose **Estimate quality** on the AWS Glue console. After you label more sets of tasks, rerun these metrics and verify whether they have improved. If, after labeling a few sets of tasks, you don't see improvement on the metric that you are focusing on, the transform quality might have reached a plateau.

**Why are both true positive and true negative labels needed?**  
The `FindMatches` transform needs both positive and negative examples to learn what you think is a match. If you are labeling `FindMatches`-generated training data (for example, using the **I do not have labels** option), `FindMatches` tries to generate a set of “label set ids” for you. Within each task, you give the same “label” to some records and different “labels” to other records. In other words, the tasks generally are not either all the same or all different (but it's okay if a particular task is all “the same” or all “not the same”).

If you are teaching your `FindMatches` transform using the **Upload labels from S3** option, try to include both examples of matching and nonmatching records. It's acceptable to have only one type. These labels help you build a more accurate `FindMatches` transform, but you still need to label some records that you generate using the **Generate labeling file** option.

**How can I enforce that the transform matches exactly as I taught it?**  
The `FindMatches` transform learns from the labels that you provide, so it might generate record pairs that don't respect the provided labels. To force the `FindMatches` transform to respect your labels, select **EnforceProvidedLabels** in **FindMatchesParameter**.

**What techniques can you use when an ML transform identifies items as matches that are not true matches?**  
You can use the following techniques:
+ Increase the `precisionRecallTradeoff` to a higher value. This eventually results in finding fewer matches, but it should also break up your big cluster when it reaches a high enough value. 
+ Take the output rows corresponding to the incorrect results and reformat them as a labeling set (removing the `match_id` column and adding `labeling_set_id` and `label` columns). If necessary, break them up (subdivide) into multiple labeling sets to ensure that the labeler can keep each labeling set in mind while assigning labels. Then, correctly label the matching sets, upload the label file, and append it to your existing labels. This might teach your transform enough about what it is looking for to understand the pattern. 
+ (Advanced) Finally, look at that data to see if there is a pattern that you can detect that the system is not noticing. Preprocess that data using standard AWS Glue functions to *normalize* the data. Highlight what you want the algorithm to learn from by separating data that you know to be differently important into their own columns. Or construct combined columns from columns whose data you know to be related. 
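The second technique above, reformatting incorrect output rows into a labeling set, can be sketched in plain Python. The field names, records, and the `FIX001` set ID are hypothetical:

```python
# Hypothetical FindMatches output rows that a reviewer flagged as a wrong match.
bad_matches = [
    {"match_id": 7, "id": 1, "name": "ACME Corp"},
    {"match_id": 7, "id": 2, "name": "ACME Credit Union"},
]

# Reformat into labeling-set rows: drop match_id, add labeling_set_id and
# distinct labels so the pair is explicitly taught as "not a match".
labeling_rows = []
for label, rec in zip("AB", bad_matches):
    row = {k: v for k, v in rec.items() if k != "match_id"}
    labeling_rows.append({"labeling_set_id": "FIX001", "label": label, **row})

print(labeling_rows[0])
# {'labeling_set_id': 'FIX001', 'label': 'A', 'id': 1, 'name': 'ACME Corp'}
```

After writing rows like these to a CSV file in Amazon S3, you would upload them with your existing labels, as described in the labeling sections above.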

# Working with machine learning transforms
<a name="console-machine-learning-transforms"></a>

You can use AWS Glue to create custom machine learning transforms that can be used to cleanse your data. You can use these transforms when you create a job on the AWS Glue console. 

For information about how to create a machine learning transform, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md).

**Topics**
+ [Transform properties](#console-machine-learning-properties)
+ [Adding and editing machine learning transforms](#console-machine-learning-transforms-actions)
+ [Viewing transform details](#console-machine-learning-transforms-details)
+ [Teach transforms using labels](#console-machine-learning-transforms-teaching-transforms)

## Transform properties
<a name="console-machine-learning-properties"></a>

To view an existing machine learning transform, sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). In the navigation pane, under **Data Integration and ETL**, choose **Data classification tools > Record Matching**.

Each transform has the following properties:

**Transform name**  
The unique name you gave the transform when you created it.

**ID**  
A unique identifier of the transform. 

**Label count**  
The number of labels in the labeling file that was provided to help teach the transform. 

**Status**  
Indicates whether the transform is **Ready** or **Needs training**. To run a machine learning transform successfully in a job, it must be **Ready**. 

**Created**  
The date the transform was created.

**Modified**  
The date the transform was last updated.

**Description**  
The description supplied for the transform, if one was provided.

**AWS Glue version**  
The version of AWS Glue used.

**Run ID**  
A unique identifier for this run of the task.

**Task type**  
The type of machine learning transform; for example, **Find matching records**.

**Status**  
Indicates the status of the task run. Possible statuses include:  
+ Starting
+ Running
+ Stopping
+ Stopped
+ Succeeded
+ Failed
+ Timeout

**Error**  
If the status is Failed, an error message is displayed describing the reason for the failure.

## Adding and editing machine learning transforms
<a name="console-machine-learning-transforms-actions"></a>

 You can view, delete, set up and teach, or tune a transform on the AWS Glue console. Select the check box next to the transform in the list, choose **Action**, and then choose the action that you want to take. 

### Creating a new ML transform
<a name="w2aac37c11c24c23c11b5"></a>

 To add a new machine learning transform, choose **Create transform**. Follow the instructions in the **Add job** wizard. For more information, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md). 

#### Step 1. Set transform properties.
<a name="w2aac37c11c24c23c11b5b7"></a>

1. Enter the name and description (optional).

1. Optionally, set security configuration. See [Using data encryption with machine learning transforms](#ml_transform_sec_config). 

1. Optionally, set Task execution settings. Task execution settings allow you to customize how the task is run. Select the Worker type, number of workers, task timeout (in minutes), the number of retries, and the AWS Glue version.

1. Optionally, set Tags. Tags are labels that you can assign to an AWS resource. Each tag consists of a key and an optional value. Tags can be used to search and filter your resource or track your AWS costs.

#### Step 2. Choose table and primary key.
<a name="w2aac37c11c24c23c11b5b9"></a>

1. Choose the AWS Glue Data Catalog database and table.

1. Choose a primary key from the selected table. The primary key column typically contains a unique identifier for every record in the data source. 
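Before picking a column, you can verify that it really is unique for every record. A minimal sketch, with an invented sample standing in for your source data:

```python
# Hedged sketch: confirm the candidate primary key column is unique for every
# record. The "id" column and sample rows are illustrative only.
import csv
import io

sample = """id,title
1,An Efficient Join Algorithm
2,Query Optimization Basics
3,An efficient join algorithm
"""

ids = [row["id"] for row in csv.DictReader(io.StringIO(sample))]
is_unique = len(ids) == len(set(ids))  # True here: "id" can serve as the key
```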

#### Step 3. Select tuning options.
<a name="w2aac37c11c24c23c11b5c11"></a>

1.  For **Recall vs. precision**, choose the tuning value to tune the transform to favor recall or precision. By default, **Balanced** is selected, but you can choose to favor recall or favor precision, or choose **Custom** and enter a value between 0.0 and 1.0 (inclusive). 

1.  For **Lower cost vs. accuracy**, choose the tuning value to favor lower cost or accuracy, or choose **Custom** and enter a value between 0.0 and 1.0 (inclusive). 

1.  For **Match enforcement**, choose **Force output to match labels** if you want to teach the ML transform by forcing the output to match the labels used. 
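To my understanding, these console options map to numeric fields of the `FindMatchesParameters` structure in the `CreateMLTransform` API. The sketch below builds and validates such a request fragment; the 0.5 values are illustrative, and treating 0.5 as **Balanced** is an assumption.

```python
# Hedged sketch: the console tuning options expressed as FindMatchesParameters
# fields (CreateMLTransform API). Values are illustrative; 0.5 standing for
# "Balanced" is an assumption.
def validate_tradeoff(value: float) -> float:
    """Custom tradeoff values must be between 0.0 and 1.0, inclusive."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"tradeoff must be in [0.0, 1.0], got {value}")
    return value

find_matches_parameters = {
    "PrimaryKeyColumnName": "id",                       # chosen in Step 2
    "PrecisionRecallTradeoff": validate_tradeoff(0.5),  # Recall vs. precision
    "AccuracyCostTradeoff": validate_tradeoff(0.5),     # Lower cost vs. accuracy
    "EnforceProvidedLabels": True,                      # Match enforcement
}
```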

#### Step 4. Review and create.
<a name="w2aac37c11c24c23c11b5c13"></a>

1.  Review the options for steps 1 – 3. 

1.  Choose **Edit** for any step that needs to be modified. Choose **Create transform** to complete the create transform wizard. 

### Using data encryption with machine learning transforms
<a name="ml_transform_sec_config"></a>

When adding a machine learning transform to AWS Glue, you can optionally specify a security configuration that is associated with the data source or data target. If the Amazon S3 bucket used to store the data is encrypted with a security configuration, specify the same security configuration when creating the transform.

You can also choose to use server-side encryption with AWS KMS (SSE-KMS) to encrypt the model and labels to prevent unauthorized persons from inspecting them. If you choose this option, you're prompted to choose the AWS KMS key by name, or you can choose **Enter a key ARN**. If you choose to enter the ARN for the KMS key, a second field appears where you can enter the KMS key ARN.
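To my understanding, this console option corresponds to the `TransformEncryption` structure of the `CreateMLTransform` API. The following is a hedged sketch of that request fragment; the key ARN and security configuration name are placeholders.

```python
# Hedged sketch: request fragment for CreateMLTransform that enables SSE-KMS
# encryption of the model and labels. The key ARN and security configuration
# name are placeholders, not real resources.
transform_encryption = {
    "MlUserDataEncryption": {
        "MlUserDataEncryptionMode": "SSE-KMS",  # or "DISABLED"
        "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
    # Optionally reuse the security configuration associated with the data:
    "TaskRunSecurityConfigurationName": "my-security-configuration",
}
```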

**Note**  
Currently, ML transforms that use a custom encryption key aren't supported in the following Regions:  
Asia Pacific (Osaka) - `ap-northeast-3`

## Viewing transform details
<a name="console-machine-learning-transforms-details"></a>

### Viewing transform properties
<a name="console-machine-learning-transforms-details"></a>

The **Transform properties** page includes attributes of your transform. It shows you the details about the transform definition, including the following:
+ **Transform name** shows the name of the transform.
+ **Type** lists the type of the transform.
+ **Status** displays whether the transform is ready to be used in a script or job.
+ **Force output to match labels** displays whether the transform forces the output to match the labels provided by the user.
+ **Spark version** is related to the AWS Glue version you chose in the **Task run properties** when adding the transform. AWS Glue 1.0 with Spark 2.4 is recommended for most customers. For more information, see [AWS Glue Versions](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html#release-notes-versions).

### History, Estimate quality and Tags tabs
<a name="w2aac37c11c24c23c13b5"></a>

 Transform details include the information that you defined when you created the transform. To view the details of a transform, select the transform in the **Machine learning transforms** list, and review the information on the following tabs: 
+ History
+ Estimate quality
+ Tags

#### History
<a name="console-machine-learning-transforms-history"></a>

The **History** tab shows your transform task run history. Several types of tasks are run to teach a transform. For each task, the run metrics include the following:
+ **Run ID** is an identifier created by AWS Glue for each run of this task.
+ **Task type** shows the type of task run.
+ **Status** shows the success of each task listed with the most recent run at the top.
+ **Error** shows the details of an error message if the run was not successful.
+ **Start time** shows the date and time (local time) that the task started.
+ **End time** shows the date and time (local time) that the task ended.
+ **Logs** links to the logs written to `stdout` for this job run.

  The **Logs** link takes you to Amazon CloudWatch Logs. There you can view the details about the tables that were created in the AWS Glue Data Catalog and any errors that were encountered. You can manage your log retention period on the CloudWatch console. The default log retention is `Never Expire`. For more information about how to change the retention period, see [Change Log Data Retention in CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#SettingLogRetention) in the *Amazon CloudWatch Logs User Guide*.
+ **Label file** shows a link to Amazon S3 for a generated labeling file.

#### Estimate quality
<a name="console-machine-learning-transforms-metrics"></a>

 The **Estimate quality** tab shows the metrics that you use to measure the quality of the transform. Estimates are calculated by comparing the transform match predictions using a subset of your labeled data against the labels you have provided. These estimates are approximate. You can invoke an **Estimate quality** task run from this tab.

The **Estimate quality** tab shows the metrics from the last **Estimate quality** run including the following properties:
+ **Area under the Precision-Recall curve** is a single number estimating the upper bound of the overall quality of the transform. It is independent of the choice made for the precision-recall parameter. Higher values indicate that you have a more attractive precision-recall tradeoff. 
+ **Precision** estimates how often the transform is correct when it predicts a match.
+ **Recall upper limit** estimates how often the transform predicts a match when the pair of records is an actual match.
+ **F1** estimates the transform's accuracy between 0 and 1, where 1 is the best accuracy. For more information, see [F1 score](https://en.wikipedia.org/wiki/F1_score) in Wikipedia.
+ The **Column importance** table shows the column names and importance score for each column. Column importance helps you understand how columns contribute to your model by identifying which columns in your records are used the most to do the matching. This data may prompt you to add to or change your label set to raise or lower the importance of columns.

  The Importance column provides a numerical score for each column, as a decimal not greater than 1.0.
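The relationship among these metrics can be illustrated with a small computation over held-out labeled pairs. This is purely illustrative synthetic data; the transform's internal evaluation is more involved.

```python
# Hedged illustration: precision, recall, and F1 computed from booleans that
# say, for each held-out labeled record pair, whether a match was predicted
# and whether the pair actually is a match. Data below is synthetic.
def precision_recall_f1(predictions, actuals):
    """predictions/actuals: parallel lists of booleans, True = 'pair matches'."""
    tp = sum(p and a for p, a in zip(predictions, actuals))
    fp = sum(p and not a for p, a in zip(predictions, actuals))
    fn = sum(a and not p for p, a in zip(predictions, actuals))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 predicted matches, 3 of them correct; 5 actual matches in total.
preds   = [True, True, True, True, False, False, False]
actuals = [True, True, True, False, True, True, False]
p, r, f = precision_recall_f1(preds, actuals)  # p = 0.75, r = 0.6
```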

For information about understanding quality estimates versus true quality, see [Quality estimates versus end-to-end (true) quality](#console-machine-learning-quality-estimates-true-quality).

For more information about tuning your transform, see [Tuning machine learning transforms in AWS Glue](add-job-machine-learning-transform-tuning.md).

#### Quality estimates versus end-to-end (true) quality
<a name="console-machine-learning-quality-estimates-true-quality"></a>

AWS Glue estimates the quality of your transform by presenting the internal machine-learned model with a number of pairs of records that you provided matching labels for but that the model has not seen before. These quality estimates are a function of the quality of the machine-learned model (which is influenced by the number of records that you label to “teach” the transform). The end-to-end, or *true* recall (which is not automatically calculated by the `ML transform`) is also influenced by the `ML transform` filtering mechanism that proposes a wide variety of possible matches to the machine-learned model. 

You can tune this filtering method primarily by specifying the **Lower Cost-Accuracy** tuning value. As the tuning value moves closer to favoring **Accuracy**, the system does a more thorough and expensive search for pairs of records that might be matches. More pairs of records are fed to your machine-learned model, and your `ML transform`'s end-to-end or true recall gets closer to the estimated recall metric. Consequently, changes in the end-to-end quality of your matches that result from changes in the cost/accuracy tradeoff are typically not reflected in the quality estimate.

#### Tags
<a name="w2aac37c11c24c23c13b5c13"></a>

 Tags are labels that you can assign to an AWS resource. Each tag consists of a key and an optional value. Tags can be used to search and filter your resource or track your AWS costs. 

## Teach transforms using labels
<a name="console-machine-learning-transforms-teaching-transforms"></a>

 You can teach your ML transform using labels (examples) by choosing **Teach transform** from the ML transform details page. When you teach your machine learning algorithm by providing examples (called labels), you can choose existing labels to use, or create a labeling file. 

![\[The screenshot shows a wizard screen for Teach the transform using labels.\]](http://docs.aws.amazon.com/glue/latest/dg/images/machine-learning-teach-transform.png)

+  **Labeling** – If you have labels, choose **I have labels**. If you do not have labels, you can still continue with the next step in generating a labeling file. 
+  **Generate labeling file** – AWS Glue extracts records from your source data and suggests potential matching records. You choose the Amazon S3 bucket to store the generated labeling file. Choose **Generate labeling file** to start the process. When it's done, choose **Download labeling file**. The downloaded file has a column for labels, which you fill in. 
+  **Upload labels from Amazon S3** – Choose the completed labeling file from the Amazon S3 bucket where the label file is stored. Then, choose to either append the labels to your existing labels or to overwrite your existing labels. Choose **Upload labeling file from Amazon S3**. 

# Tutorial: Creating a machine learning transform with AWS Glue
<a name="machine-learning-transform-tutorial"></a>

This tutorial guides you through the actions to create and manage a machine learning (ML) transform using AWS Glue. Before using this tutorial, you should be familiar with using the AWS Glue console to add crawlers and jobs and edit scripts. You should also be familiar with finding and downloading files on the Amazon Simple Storage Service (Amazon S3) console.

In this example, you create a `FindMatches` transform to find matching records, teach it how to identify matching and nonmatching records, and use it in an AWS Glue job. The AWS Glue job writes a new Amazon S3 file with an additional column named `match_id`. 

The source data used by this tutorial is a file named `dblp_acm_records.csv`. This file is a modified version of academic publications (DBLP and ACM) available from the original [DBLP ACM dataset](https://doi.org/10.3886/E100843V2). The `dblp_acm_records.csv` file is a comma-separated values (CSV) file in UTF-8 format with no byte-order mark (BOM). 

A second file, `dblp_acm_labels.csv`, is an example labeling file that contains both matching and nonmatching records used to teach the transform as part of the tutorial. 

**Topics**
+ [Step 1: Crawl the source data](#ml-transform-tutorial-crawler)
+ [Step 2: Add a machine learning transform](#ml-transform-tutorial-create)
+ [Step 3: Teach your machine learning transform](#ml-transform-tutorial-teach)
+ [Step 4: Estimate the quality of your machine learning transform](#ml-transform-tutorial-estimate-quality)
+ [Step 5: Add and run a job with your machine learning transform](#ml-transform-tutorial-add-job)
+ [Step 6: Verify output data from Amazon S3](#ml-transform-tutorial-data-output)

## Step 1: Crawl the source data
<a name="ml-transform-tutorial-crawler"></a>

First, crawl the source Amazon S3 CSV file to create a corresponding metadata table in the Data Catalog.

**Important**  
To direct the crawler to create a table for only the CSV file, store the CSV source data in a different Amazon S3 folder from other files.

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Crawlers**, **Add crawler**. 

1. Follow the wizard to create and run a crawler named `demo-crawl-dblp-acm` with output to database `demo-db-dblp-acm`. When running the wizard, create the database `demo-db-dblp-acm` if it doesn't already exist. Choose an Amazon S3 include path to sample data in the current AWS Region. For example, for `us-east-1`, the Amazon S3 include path to the source file is `s3://ml-transforms-public-datasets-us-east-1/dblp-acm/records/dblp_acm_records.csv`. 

   If successful, the crawler creates the table `dblp_acm_records_csv` with the following columns: id, title, authors, venue, year, and source.

## Step 2: Add a machine learning transform
<a name="ml-transform-tutorial-create"></a>

Next, add a machine learning transform that is based on the schema of your data source table created by the crawler named `demo-crawl-dblp-acm`.

1. On the AWS Glue console, in the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**, then **Add transform**. Follow the wizard to create a `Find matches` transform with the following properties. 

   1. For **Transform name**, enter **demo-xform-dblp-acm**. This is the name of the transform that is used to find matches in the source data.

   1. For **IAM role**, choose an IAM role that has permission to the Amazon S3 source data, labeling file, and AWS Glue API operations. For more information, see [Create an IAM Role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) in the *AWS Glue Developer Guide*.

   1. For **Data source**, choose the table named **dblp\_acm\_records\_csv** in database **demo-db-dblp-acm**.

   1. For **Primary key**, choose the primary key column for the table, **id**.

1. In the wizard, choose **Finish** and return to the **ML transforms** list.

## Step 3: Teach your machine learning transform
<a name="ml-transform-tutorial-teach"></a>

Next, you teach your machine learning transform using the tutorial sample labeling file.

You can't use a machine learning transform in an extract, transform, and load (ETL) job until its status is **Ready for use**. To get your transform ready, you must teach it how to identify matching and nonmatching records by providing examples of matching and nonmatching records. To teach your transform, you can **Generate a label file**, add labels, and then **Upload label file**. In this tutorial, you can use the example labeling file named `dblp_acm_labels.csv`. For more information about the labeling process, see [Labeling](machine-learning.md#machine-learning-labeling).

1. On the AWS Glue console, in the navigation pane, choose **Record Matching**.

1. Choose the `demo-xform-dblp-acm` transform, and then choose **Action**, **Teach**. Follow the wizard to teach your `Find matches` transform. 

1. On the transform properties page, choose **I have labels**. Choose an Amazon S3 path to the sample labeling file in the current AWS Region. For example, for `us-east-1`, upload the provided labeling file from the Amazon S3 path `s3://ml-transforms-public-datasets-us-east-1/dblp-acm/labels/dblp_acm_labels.csv` with the option to **overwrite** existing labels. The labeling file must be located in Amazon S3 in the same Region as the AWS Glue console.

   When you upload a labeling file, a task is started in AWS Glue to add or overwrite the labels used to teach the transform how to process the data source.

1. On the final page of the wizard, choose **Finish**, and return to the **ML transforms** list.

## Step 4: Estimate the quality of your machine learning transform
<a name="ml-transform-tutorial-estimate-quality"></a>

Next, you can estimate the quality of your machine learning transform. The quality depends on how much labeling you have done. For more information about estimating quality, see [Estimate quality](console-machine-learning-transforms.md#console-machine-learning-transforms-metrics).

1. On the AWS Glue console, in the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**. 

1. Choose the `demo-xform-dblp-acm` transform, and choose the **Estimate quality** tab. This tab displays the current quality estimates, if available, for the transform. 

1. Choose **Estimate quality** to start a task to estimate the quality of the transform. The accuracy of the quality estimate is based on the labeling of the source data.

1. Navigate to the **History** tab. In this pane, task runs are listed for the transform, including the **Estimating quality** task. For more details about the run, choose **Logs**. Check that the run status is **Succeeded** when it finishes.

## Step 5: Add and run a job with your machine learning transform
<a name="ml-transform-tutorial-add-job"></a>

In this step, you use your machine learning transform to add and run a job in AWS Glue. When the transform `demo-xform-dblp-acm` is **Ready for use**, you can use it in an ETL job.

1. On the AWS Glue console, in the navigation pane, choose **Jobs**.

1. Choose **Add job**, and follow the steps in the wizard to create an ETL Spark job with a generated script. Choose the following property values for your transform:

   1. For **Name**, choose the example job in this tutorial, **demo-etl-dblp-acm**.

   1. For **IAM role**, choose an IAM role with permission to the Amazon S3 source data, labeling file, and AWS Glue API operations. For more information, see [Create an IAM Role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) in the *AWS Glue Developer Guide*.

   1. For **ETL language**, choose **Scala**. This is the programming language in the ETL script.

   1. For **Script file name**, choose **demo-etl-dblp-acm**. This is the file name of the Scala script (same as the job name).

   1. For **Data source**, choose **dblp\_acm\_records\_csv**. The data source you choose must match the machine learning transform data source schema.

   1. For **Transform type**, choose **Find matching records** to create a job using a machine learning transform.

   1. Clear **Remove duplicate records**. You don't want to remove duplicate records because the output records written have an additional `match_id` field added. 

   1. For **Transform**, choose **demo-xform-dblp-acm**, the machine learning transform used by the job.

   1. For **Create tables in your data target**, choose to create tables with the following properties:
      + **Data store type** — **Amazon S3**
      + **Format** — **CSV**
      + **Compression type** — **None**
      + **Target path** — The Amazon S3 path where the output of the job is written (in the current console AWS Region)

1. Choose **Save job and edit script** to display the script editor page.

1. Edit the script to add a statement that causes the job output written to the **Target path** to be a single partition file. Add this statement immediately following the statement that runs the `FindMatches` transform. The statement is similar to the following.

   ```
   val single_partition = findmatches1.repartition(1) 
   ```

   You must modify the `.writeDynamicFrame(findmatches1)` statement to write the output as `.writeDynamicFrame(single_partition)`. 

1. After you edit the script, choose **Save**. The modified script looks similar to the following code, but customized for your environment.

   ```
   import com.amazonaws.services.glue.GlueContext
   import com.amazonaws.services.glue.errors.CallSite
   import com.amazonaws.services.glue.ml.FindMatches
   import com.amazonaws.services.glue.util.GlueArgParser
   import com.amazonaws.services.glue.util.Job
   import com.amazonaws.services.glue.util.JsonOptions
   import org.apache.spark.SparkContext
   import scala.collection.JavaConverters._
   
   object GlueApp {
     def main(sysArgs: Array[String]) {
       val spark: SparkContext = new SparkContext()
       val glueContext: GlueContext = new GlueContext(spark)
       // @params: [JOB_NAME]
       val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
       Job.init(args("JOB_NAME"), glueContext, args.asJava)
       // @type: DataSource
       // @args: [database = "demo-db-dblp-acm", table_name = "dblp_acm_records_csv", transformation_ctx = "datasource0"]
       // @return: datasource0
       // @inputs: []
       val datasource0 = glueContext.getCatalogSource(database = "demo-db-dblp-acm", tableName = "dblp_acm_records_csv", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
       // @type: FindMatches
       // @args: [transformId = "tfm-123456789012", emitFusion = false, survivorComparisonField = "<primary_id>", transformation_ctx = "findmatches1"]
       // @return: findmatches1
       // @inputs: [frame = datasource0]
       val findmatches1 = FindMatches.apply(frame = datasource0, transformId = "tfm-123456789012", transformationContext = "findmatches1", computeMatchConfidenceScores = true)
     
     
       // Repartition the previous DynamicFrame into a single partition. 
       val single_partition = findmatches1.repartition(1)    
    
       
       // @type: DataSink
       // @args: [connection_type = "s3", connection_options = {"path": "s3://aws-glue-ml-transforms-data/sal"}, format = "csv", transformation_ctx = "datasink2"]
       // @return: datasink2
       // @inputs: [frame = findmatches1]
       val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://aws-glue-ml-transforms-data/sal"}"""), transformationContext = "datasink2", format = "csv").writeDynamicFrame(single_partition)
       Job.commit()
     }
   }
   ```

1. Choose **Run job** to start the job run. Check the status of the job in the jobs list. When the job finishes, in the **ML transform**, **History** tab, there is a new **Run ID** row added of type **ETL job**.

1. Navigate to the **Jobs**, **History** tab. In this pane, job runs are listed. For more details about the run, choose **Logs**. Check that the run status is **Succeeded** when it finishes.

## Step 6: Verify output data from Amazon S3
<a name="ml-transform-tutorial-data-output"></a>

In this step, you check the output of the job run in the Amazon S3 bucket that you chose when you added the job. You can download the output file to your local machine and verify that matching records were identified.

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Download the target output file of the job `demo-etl-dblp-acm`. Open the file in a spreadsheet application (you might need to add a file extension `.csv` for the file to properly open).

   The following image shows an excerpt of the output in Microsoft Excel.  
![\[Excel spreadsheet showing the output of the transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/demo_output_dblp_acm.png)

   The data source and target file both have 4,911 records. However, the `Find matches` transform adds another column named `match_id` to identify matching records in the output. Rows with the same `match_id` are considered matching records. The `match_confidence_score` is a number between 0 and 1 that provides an estimate of the quality of matches found by `Find matches`.

1. Sort the output file by `match_id` to easily see which records are matches. Compare the values in the other columns to see if you agree with the results of the `Find matches` transform. If you don't, you can continue to teach the transform by adding more labels. 

   You can also sort the file by another field, such as `title`, to see if records with similar titles have the same `match_id`. 
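A quick way to do this review programmatically: the hedged sketch below sorts a synthetic output sample by `match_id` and collects the clusters. Column names follow the tutorial output; the rows are invented.

```python
# Hedged sketch: after downloading the output CSV, group rows by match_id to
# review the clusters the transform found. Rows below are synthetic.
import csv
import io
from itertools import groupby

output_csv = """id,title,match_id,match_confidence_score
1,An Efficient Join Algorithm,m1,0.97
3,Query Optimization Basics,m2,0.88
2,An efficient join algorithm,m1,0.95
"""

rows = sorted(csv.DictReader(io.StringIO(output_csv)),
              key=lambda r: r["match_id"])
clusters = {mid: [r["id"] for r in grp]
            for mid, grp in groupby(rows, key=lambda r: r["match_id"])}
# clusters -> {"m1": ["1", "2"], "m2": ["3"]}
```

Rows sharing a `match_id` key (here, records 1 and 2) are the ones the transform considers matching records.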

# Finding incremental matches
<a name="machine-learning-incremental-matches"></a>

The Find matches feature allows you to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly. The initial release of the Find matches transform identified matching records within a single dataset. When you add new data to the dataset, you had to merge it with the existing clean dataset and rerun matching against the complete merged dataset.

The incremental matching feature makes it simpler to match incremental records against existing matched datasets. Suppose that you want to match prospects data with existing customer datasets. The incremental match capability provides you the flexibility to match hundreds of thousands of new prospects with an existing database of prospects and customers by merging the results into a single database or table. By matching only between the new and existing datasets, the find incremental matches optimization reduces computation time, which also reduces cost.
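A back-of-the-envelope pair count illustrates why. Neither mode actually compares every pair (the filtering mechanism prunes candidates first), but restricting candidates to new-versus-existing pairs shrinks the search space substantially:

```python
# Hedged illustration: candidate-pair counts for re-matching a fully merged
# dataset versus matching only the new records against the existing ones.
# These are raw combinatorial counts, not what FindMatches actually computes.
def full_rematch_pairs(existing: int, new: int) -> int:
    n = existing + new
    return n * (n - 1) // 2          # all pairs in the merged dataset

def incremental_pairs(existing: int, new: int) -> int:
    # new records against existing records, plus pairs among the new records
    return existing * new + new * (new - 1) // 2

# 1,000,000 existing records, 100,000 new prospects
full = full_rematch_pairs(1_000_000, 100_000)
inc = incremental_pairs(1_000_000, 100_000)
# inc is a small fraction of full, which is where the cost savings come from
```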

The usage of incremental matching is similar to Find matches as described in [Tutorial: Creating a machine learning transform with AWS Glue](machine-learning-transform-tutorial.md). This topic identifies only the differences with incremental matching.

For more information, see the blog post on [Incremental data matching](https://aws.amazon.com/blogs/big-data/incremental-data-matching-using-aws-lake-formation/).

## Running an incremental matching job
<a name="machine-learning-incremental-matches-add"></a>

For the following procedure, suppose the following: 
+ You have crawled the existing dataset into the table *first\_records*. The *first\_records* dataset must be a matched dataset, or the output of a matching job.
+ You have created and trained a Find matches transform with AWS Glue version 2.0. This is the only version of AWS Glue that supports incremental matches.
+ The ETL language is Scala. Note that Python is also supported.
+ The model already generated is called `demo-xform`.

1. Crawl the incremental dataset to the table *second\_records*.

1. On the AWS Glue console, in the navigation pane, choose **Jobs**.

1. Choose **Add job**, and follow the steps in the wizard to create an ETL Spark job with a generated script. Choose the following property values for your transform:

   1. For **Name**, choose **demo-etl**.

   1. For **IAM role**, choose an IAM role with permission to the Amazon S3 source data, labeling file, and [AWS Glue API operations](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html).

   1. For **ETL language**, choose **Scala**.

   1. For **Script file name**, choose **demo-etl**. This is the file name of the Scala script.

   1. For **Data source**, choose **first\_records**. The data source you choose must match the machine learning transform data source schema.

   1. For **Transform type**, choose **Find matching records** to create a job using a machine learning transform.

   1. Select the incremental matching option, and for **Data Source** select the table named **second\_records**.

   1. For **Transform**, choose **demo-xform**, the machine learning transform used by the job.

   1. Choose **Create tables in your data target** or **Use tables in the data catalog and update your data target**.

1. Choose **Save job and edit script** to display the script editor page.

1. Choose **Run job** to start the job run.

# Using FindMatches in a visual job
<a name="find-matches-visual-job"></a>

 To use the **FindMatches** transform in AWS Glue Studio, you can use the **Custom Transform** node that invokes the FindMatches API. For more information on how to use a custom transform, see [Creating a custom transformation](https://docs.aws.amazon.com/glue/latest/ug/transforms-custom.html). 

**Note**  
 Currently, the FindMatches API only works with `Glue 2.0`. In order to run a job with the Custom transform that invokes the FindMatches API, ensure the AWS Glue version is `Glue 2.0` in the **Job details** tab. If the version of AWS Glue is not `Glue 2.0`, the job will fail at runtime with the following error message: “cannot import name 'FindMatches' from 'awsglueml.transforms'”. 
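In a PySpark job script, one hedged way to surface this problem earlier is to guard the import; the `awsglueml` module is only available on an AWS Glue job runtime, so this sketch simply reports whether the API is present.

```python
# Hedged sketch: guard the FindMatches import so a job running on a Glue
# version without it produces a clearer failure mode. The awsglueml module
# is only available on an AWS Glue job runtime.
try:
    from awsglueml.transforms import FindMatches  # provided by Glue 2.0
except ImportError:
    FindMatches = None

def find_matches_available() -> bool:
    """True when this runtime provides the FindMatches API."""
    return FindMatches is not None
```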

## Prerequisites
<a name="adding-find-matches-to-a-visual-job-prerequisites"></a>
+  In order to use the **Find Matches** transform, open the AWS Glue Studio console at [https://console.aws.amazon.com/gluestudio/](https://console.aws.amazon.com/gluestudio/). 
+  Create a machine learning transform. When created, a transformId is generated. You will need this ID for the steps below. For more information on how to create a machine learning transform, see [Adding and editing machine learning transforms](https://docs.aws.amazon.com/glue/latest/dg/console-machine-learning-transforms.html#console-machine-learning-transforms-actions). 

## Adding a FindMatches transform
<a name="adding-find-matches-to-a-visual-job"></a>

**To add a FindMatches transform:**

1.  In the AWS Glue Studio job editor, open the Resource panel by clicking on the cross symbol in the upper left-hand corner of the visual job graph and choose a Data source by choosing the **Data tab**. This is the data source you want to check for matches.   
![\[The screenshot shows a cross symbol inside a circle. When you click on this in the visual job editor, the resource panel opens.\]](http://docs.aws.amazon.com/glue/latest/dg/images/resource-panel-blank-canvas.png)

1.  Choose the data source node, then open the Resource panel by clicking on the cross symbol in the upper left-hand corner of the visual job graph and search for 'custom transform'. Choose the **Custom Transform** node to add it to the graph. The **Custom Transform** node is linked to the data source node. If it is not, you can click the **Custom Transform** node, choose the **Node properties** tab, and then under **Node parents**, choose the data source. 

1.  Click the **Custom Transform** node in the visual graph, then choose the **Node properties** tab and name the custom transform. It is recommended that you rename the transform so that the transform name is easily identifiable in the visual graph. 

1.  Choose the **Transform** tab, where you can edit the code block. This is where the code to invoke the FindMatches API can be added.   
![\[The screenshot shows the code block in the Transform tab when the Custom Transform node is selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/custom-transform-code-block.png)

    The code block contains pre-populated code to get you started. Overwrite the pre-populated code with the template below, replacing the **transformId** placeholder with your own transform ID. 

   ```
   def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
       from awsglue.dynamicframe import DynamicFrameCollection
       from awsglueml.transforms import FindMatches

       # Select the single DynamicFrame from the incoming collection
       dynf = dfc.select(list(dfc.keys())[0])
       # Apply FindMatches; replace <your id> with your transformId
       findmatches = FindMatches.apply(frame=dynf, transformId="<your id>")
       return DynamicFrameCollection({"FindMatches": findmatches}, glueContext)
   ```

1.  Choose the **Custom Transform** node in the visual graph, then open the Resource panel by choosing the cross symbol in the upper left-hand corner of the visual job graph, search for 'Select From Collection', and add that node to the graph. There is no need to change the default selection because there is only one DynamicFrame in the collection. 

1.  You can continue adding transformations or store the result, which is now enriched with the additional FindMatches columns. If you want to reference those new columns in downstream transforms, you need to add them to the transform output schema. The easiest way to do that is to choose the **Data preview** tab and then, on the schema tab, choose **Use data preview schema**. 

1.  To customize FindMatches, you can pass additional parameters to the `apply` method. See [FindMatches class](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-findmatches.html). 
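FindMatches enriches each record with a `match_id` column: records that share a `match_id` are predicted to refer to the same entity. As a conceptual illustration only (pure Python on assumed sample rows, not the Glue API), downstream deduplication could collapse each match group to a single record:

```python
# Conceptual sketch: collapse records that FindMatches assigned the same match_id.
# The sample rows and the "keep the first record per group" policy are assumptions.

def dedupe_by_match_id(rows):
    """Keep one representative row per match_id, preserving first-seen order."""
    seen = {}
    for row in rows:
        seen.setdefault(row["match_id"], row)
    return list(seen.values())

enriched = [
    {"match_id": 0, "name": "Jon Smith",  "city": "Seattle"},
    {"match_id": 0, "name": "John Smith", "city": "Seattle"},  # same entity as above
    {"match_id": 1, "name": "Ana Lopez",  "city": "Austin"},
]
deduped = dedupe_by_match_id(enriched)
```

In a real job this grouping would run on the enriched DynamicFrame (for example via Spark aggregation) rather than on Python lists.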

## Adding an incremental FindMatches transform
<a name="find-matches-incrementally-visual-job"></a>

 For incremental matches, the process is the same as in **Adding a FindMatches transform**, with the following differences: 
+  Instead of one parent node for the custom transform, you need two parent nodes. 
+  The first parent node should be the existing dataset. 
+  The second parent node should be the incremental dataset. 

   Replace `<your id>` in the template code block with your transformId: 

  ```
  def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
      from awsglue.dynamicframe import DynamicFrameCollection
      from awsglueml.transforms import FindIncrementalMatches

      # The first parent node is the existing dataset; the second is the incremental dataset
      dfs = list(dfc.values())
      dynf = dfs[0]
      inc_dynf = dfs[1]
      # Apply incremental matching; replace <your id> with your transformId
      findmatches = FindIncrementalMatches.apply(existingFrame=dynf, incrementalFrame=inc_dynf,
                                                 transformId="<your id>")
      return DynamicFrameCollection({"FindMatches": findmatches}, glueContext)
  ```
+  For optional parameters, see [FindIncrementalMatches class](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.html). 
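Conceptually, incremental matching avoids re-matching the full dataset: only the new records are compared against the existing, already-matched frame. The sketch below is a pure-Python illustration of that workflow's shape, with a hypothetical exact normalized-name key standing in for the trained ML model's predictions:

```python
# Conceptual sketch only: FindIncrementalMatches uses a trained ML model;
# here an exact lowercase-name match stands in for the model's predictions.

def incremental_match(existing, incremental):
    """Assign each incremental record the match_id of an existing record it
    matches, or a fresh match_id if it matches nothing."""
    key_to_id = {r["name"].lower(): r["match_id"] for r in existing}
    next_id = max((r["match_id"] for r in existing), default=-1) + 1
    out = []
    for rec in incremental:
        key = rec["name"].lower()
        if key in key_to_id:
            match_id = key_to_id[key]  # joins an existing match group
        else:
            match_id = next_id         # starts a new match group
            key_to_id[key] = match_id
            next_id += 1
        out.append({**rec, "match_id": match_id})
    return out

existing = [{"name": "Ana Lopez", "match_id": 0}, {"name": "Jon Smith", "match_id": 1}]
new_rows = [{"name": "ana lopez"}, {"name": "Mei Chen"}]
matched = incremental_match(existing, new_rows)
```

The real transform handles fuzzy, multi-field matching that no exact-key lookup can replicate; this sketch only shows why the existing and incremental frames are passed as two separate parent nodes.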