

# Training custom entity recognizer models
<a name="training-recognizers"></a>

A custom entity recognizer identifies only the entity types that you include when you train the model. It does not automatically include the preset entity types. If you also want to identify the preset entity types, such as LOCATION, DATE, or PERSON, you need to provide additional training data for those entities.

When you create a custom entity recognizer using annotated PDF files, you can use the recognizer with a variety of input file formats: plaintext, image files (JPG, PNG, TIFF), PDF files, and Word documents, with no pre-processing or document flattening required. Amazon Comprehend doesn't support annotation of image files or Word documents.

**Note**  
A custom entity recognizer using annotated PDF files supports English documents only.

After you create a custom entity recognizer, you can monitor the progress of the request using the [DescribeEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeEntityRecognizer.html) operation. Once the `Status` field is `TRAINED`, the recognizer model is ready to use for custom entity recognition.

**Topics**
+ [Train custom recognizers (console)](realtime-analysis-cer.md)
+ [Train custom entity recognizers (API)](train-cer-model.md)
+ [Custom entity recognizer metrics](cer-metrics.md)

# Train custom recognizers (console)
<a name="realtime-analysis-cer"></a>

You can create custom entity recognizers using the Amazon Comprehend console. This section shows you how to create and train a custom entity recognizer.

**Topics**
+ [Creating a custom entity recognizer using the console - CSV format](#console-CER)
+ [Creating a custom entity recognizer using the console - augmented manifest](#getting-started-CER-PDF)

## Creating a custom entity recognizer using the console - CSV format
<a name="console-CER"></a>

To create the custom entity recognizer, first provide a dataset to train your model. With this dataset, include one of the following: a set of annotated documents, or a list of entities and their type labels along with a set of documents containing those entities. For more information, see [Custom entity recognition](custom-entity-recognition.md).
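
To make the two dataset styles concrete, here is a minimal sketch of what each CSV might contain. The column headers shown (`Text`/`Type` for an entity list; `File`/`Line`/`Begin Offset`/`End Offset`/`Type` for annotations) are an assumption based on the formats described in [Custom entity recognition](custom-entity-recognition.md); confirm them there before training:

```python
import csv
import io

# Sketch of the two CSV training inputs. The column headers here are an
# assumption based on the custom entity recognition guide; confirm them
# there before training.

# 1) Entity list: each row pairs an entity string with its type label.
entity_list = io.StringIO()
writer = csv.writer(entity_list)
writer.writerow(["Text", "Type"])
writer.writerow(["John Washington", "PERSON"])
writer.writerow(["San Francisco", "LOCATION"])

# 2) Annotations: each row locates one occurrence of an entity by document
#    name, line number, and character offsets within that line.
annotations = io.StringIO()
writer = csv.writer(annotations)
writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
writer.writerow(["documents.txt", 0, 0, 15, "PERSON"])

print(entity_list.getvalue())
```

Either CSV, uploaded to Amazon S3 alongside the training documents, is what you reference in the **Training type** step below.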

**To train a custom entity recognizer with a CSV file**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Customization** and then choose **Custom entity recognition**.

1. Choose **Create new model**.

1. Give the recognizer a name. The name must be unique within the Region and account.

1. Select the language. 

1. Under **Custom entity type**, enter a custom label that you want the recognizer to find in the dataset. 

   The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore. 

1. Choose **Add type**.

1. If you want to add an additional entity type, enter it, and then choose **Add type**. If you want to remove one of the entity types you've added, choose **Remove type** and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed. 

1. To encrypt your training job, choose **Recognizer encryption** and then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, for **KMS key ID** choose the key ID.
   + If you are using a key associated with a different account, for **KMS key ARN**, enter the ARN of the key.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Data specifications**, choose the format of your training documents:
   + **CSV file** — A CSV file that supplements your training documents. The CSV file contains information about the custom entities that your trained model will detect. The required format of the file depends on whether you are providing annotations or an entity list.
   + **Augmented manifest** — A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. You can provide up to 5 augmented manifest files.

   For more information about available formats, and for examples, see [Training custom entity recognizer models](training-recognizers.md).

1. Under **Training type**, choose the training type to use:
   + **Using annotations and training docs**
   + **Using entity list and training docs**

    If choosing annotations, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose **Browse S3**.

    If choosing entity list, enter the URL of the entity list in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the entity list is located and choose **Browse S3**.

1. Enter the URL of an input dataset containing the training documents in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the training documents are located and choose **Select folder**.

1. Under **Test dataset**, select how you want to evaluate the performance of your trained model. You can do this for both the annotations and entity list training types.
   + **Autosplit**: Automatically selects 10% of your provided training data to use as testing data.
   + (Optional) **Customer provided**: Specify exactly which test data you want to use.

1. If you select Customer provided test dataset, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose **Select folder**.

1. In the **Choose an IAM role** section, either select an existing IAM role or create a new one.
   + **Choose an existing IAM role** – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
   + **Create a new IAM role** – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets. 
**Note**  
If the input documents are encrypted, the IAM role used must have `kms:Decrypt` permission. For more information, see [Permissions required to use KMS encryption](security_iam_id-based-policy-examples.md#auth-kms-permissions).

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your custom entity recognition job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

1. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under **Tags**. Choose **Add tag**. To remove this pair before creating the recognizer, choose **Remove tag**.

1. Choose **Train**.

The new recognizer then appears in the list, showing its status. It first shows as `Submitted`, then `Training` while the recognizer processes the training documents, `Trained` when the recognizer is ready to use, and `In error` if the recognizer has an error. You can choose a job to get more information about the recognizer, including any error messages.

## Creating a custom entity recognizer using the console - augmented manifest
<a name="getting-started-CER-PDF"></a>

**To train a custom entity recognizer with a plaintext, PDF, or Word document**

1. Sign in to the AWS Management Console and open the [Amazon Comprehend console](https://console.aws.amazon.com/comprehend/home?region=us-east-1#api-explorer:).

1. From the left menu, choose **Customization** and then choose **Custom entity recognition**.

1. Choose **Train recognizer**.

1. Give the recognizer a name. The name must be unique within the Region and account.

1. Select the language. Note: If you train with PDF or Word documents, English is the only supported language. 

1. Under **Custom entity type**, enter a custom label that you want the recognizer to find in the dataset. 

   The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore.

1. Choose **Add type**.

1. If you want to add an additional entity type, enter it, and then choose **Add type**. If you want to remove one of the entity types you've added, choose **Remove type** and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed. 

1. To encrypt your training job, choose **Recognizer encryption** and then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, for **KMS key ID** choose the key ID.
   + If you are using a key associated with a different account, for **KMS key ARN**, enter the ARN of the key.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Training data**, choose **Augmented manifest** as your data format:
   + **Augmented manifest** — A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line in the file is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. If you are using PDF documents for training data, you must select **Augmented manifest**. You can provide up to 5 augmented manifest files, and for each file you can name up to 5 attributes to use as training data.

   For more information about available formats, and for examples, see [Training custom entity recognizer models](training-recognizers.md).

1. Select the training model type. 

   If you selected **Plaintext documents**, under **Input location**, enter the Amazon S3 URL of the Amazon SageMaker Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest files are located and choose **Select folder**.

1. Under **Attribute name**, enter the name of the attribute that contains your annotations. If the file contains annotations from multiple chained labeling jobs, add an attribute for each job. In this case, each attribute contains the set of annotations from a labeling job. Note: You can provide up to 5 attribute names for each file.

1. Select **Add**.

1. If you selected **PDF, Word documents**, under **Input location**, enter the Amazon S3 URL of the Amazon SageMaker Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest files are located and choose **Select folder**.

1. Enter the S3 prefix for your **Annotation** data files. These are the PDF documents that you labeled.

1. Enter the S3 prefix for your **Source** documents. These are the original PDF documents (data objects) that you provided to Ground Truth for your labeling job.

   

1. Enter the attribute names that contain your annotations. Note: You can provide up to 5 attribute names for each file. Any attributes in your file that you don't specify are ignored. 

1. In the IAM role section, either select an existing IAM role or create a new one.
   + **Choose an existing IAM role** – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
   + **Create a new IAM role** – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets. 
**Note**  
If the input documents are encrypted, the IAM role used must have `kms:Decrypt` permission. For more information, see [Permissions required to use KMS encryption](security_iam_id-based-policy-examples.md#auth-kms-permissions).

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your custom entity recognition job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

1. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under **Tags**. Choose **Add tag**. To remove this pair before creating the recognizer, choose **Remove tag**.

1. Choose **Train**.

The new recognizer then appears in the list, showing its status. It first shows as `Submitted`, then `Training` while the recognizer processes the training documents, `Trained` when the recognizer is ready to use, and `In error` if the recognizer has an error. You can choose a job to get more information about the recognizer, including any error messages.
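
The augmented manifest selected in this procedure is a JSON Lines file, with one complete JSON object per line. The following sketch builds and reads a hypothetical plaintext-style line; the attribute name `my-labeling-job` and the exact field names are assumptions, since the real schema is produced by your SageMaker Ground Truth labeling job:

```python
import json

# A hypothetical augmented-manifest line for a plaintext labeling job.
# "my-labeling-job" is a placeholder attribute name; treat the field names
# as assumptions and confirm against your Ground Truth output.
line = json.dumps({
    "source": "John Washington lives in Seattle.",
    "my-labeling-job": {
        "annotations": {
            "entities": [
                {"label": "PERSON", "startOffset": 0, "endOffset": 15}
            ]
        }
    }
})

# Each line can be parsed independently: the document text is in "source",
# and each entity annotation carries a label plus character offsets.
record = json.loads(line)
for entity in record["my-labeling-job"]["annotations"]["entities"]:
    span = record["source"][entity["startOffset"]:entity["endOffset"]]
    print(entity["label"], span)
```

The attribute name ("my-labeling-job" here) is what you enter under **Attribute name** in the procedure above.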

# Train custom entity recognizers (API)
<a name="train-cer-model"></a>

To create and train a custom entity recognition model, use the Amazon Comprehend [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) API operation.

**Topics**
+ [Training custom entity recognizers using the AWS Command Line Interface](#get-started-api-cer-cli)
+ [Training custom entity recognizers using the AWS SDK for Java](#get-started-api-cer-java)
+ [Training custom entity recognizers using Python (Boto3)](#cer-python)

## Training custom entity recognizers using the AWS Command Line Interface
<a name="get-started-api-cer-cli"></a>

The following examples demonstrate using the `CreateEntityRecognizer` operation and other associated APIs with the AWS CLI. 

The examples are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

Create a custom entity recognizer using the `create-entity-recognizer` CLI command. For information about the input-data-config parameter, see [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) in the *Amazon Comprehend API Reference*.

```
aws comprehend create-entity-recognizer \
     --language-code en \
     --recognizer-name test-6 \
     --data-access-role-arn "arn:aws:iam::account number:role/service-role/AmazonComprehendServiceRole-role" \
     --input-data-config "EntityTypes=[{Type=PERSON}],Documents={S3Uri=s3://Bucket Name/Bucket Path/documents},
                Annotations={S3Uri=s3://Bucket Name/Bucket Path/annotations}" \
     --region region
```

List all entity recognizers in a Region using the `list-entity-recognizers` CLI command.

```
aws comprehend list-entity-recognizers \
     --region region
```

Check the status of a custom entity recognizer using the `describe-entity-recognizer` CLI command.

```
aws comprehend describe-entity-recognizer \
     --entity-recognizer-arn arn:aws:comprehend:region:account number:entity-recognizer/test-6 \
     --region region
```

## Training custom entity recognizers using the AWS SDK for Java
<a name="get-started-api-cer-java"></a>

This example creates a custom entity recognizer and trains the model, using Java.

For Amazon Comprehend examples that use Java, see [Amazon Comprehend Java examples](https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/javav2/example_code/comprehend).

## Training custom entity recognizers using Python (Boto3)
<a name="cer-python"></a>

Instantiate Boto3 SDK: 

```
import boto3
import uuid
comprehend = boto3.client("comprehend", region_name="region")
```

Create entity recognizer: 

```
response = comprehend.create_entity_recognizer(
    RecognizerName="Recognizer-Name-Goes-Here-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "EntityTypes": [
            {
                "Type": "ENTITY_TYPE"
            }
        ],
        "Documents": {
            "S3Uri": "s3://Bucket Name/Bucket Path/documents"
        },
        "Annotations": {
            "S3Uri": "s3://Bucket Name/Bucket Path/annotations"
        }
    }
)
recognizer_arn = response["EntityRecognizerArn"]
```

List all recognizers: 

```
response = comprehend.list_entity_recognizers()
```

Wait for recognizer to reach TRAINED status: 

```
import sys
import time

while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )

    status = response["EntityRecognizerProperties"]["Status"]
    if "IN_ERROR" == status:
        sys.exit(1)
    if "TRAINED" == status:
        break

    time.sleep(10)
```

# Custom entity recognizer metrics
<a name="cer-metrics"></a>

Amazon Comprehend provides you with metrics to help you estimate how well an entity recognizer should work for your job. They are based on training the recognizer model, and so while they accurately represent the performance of the model during training, they are only an approximation of the API performance during entity discovery. 

Metrics are returned any time metadata from a trained entity recognizer is returned. 

Amazon Comprehend supports training a model on up to 25 entities at a time. When metrics are returned from a trained entity recognizer, scores are computed against both the recognizer as a whole (global metrics) and for each individual entity (entity metrics).

Three metrics are available, both as global and entity metrics: 
+ **Precision**

  This indicates the fraction of entities produced by the system that are correctly identified and correctly labeled. This shows how many times the model's entity identification is truly a good identification. It is a percentage of the total number of identifications. 

  In other words, precision is based on *true positives (tp)* and *false positives (fp)*, and it is calculated as *precision = tp / (tp + fp)*.

  For example, if a model predicts that two examples of an entity are present in a document, where there's actually only one, the result is one true positive and one false positive. In this case, *precision = 1 / (1 + 1)*. The precision is 50%, as one entity is correct out of the two identified by the model. 

  
+  **Recall**

  This indicates the fraction of entities present in the documents that are correctly identified and labeled by the system. Mathematically, this is defined in terms of the total number of correct identifications, *true positives (tp)*, and missed identifications, *false negatives (fn)*. 

   It is calculated as *recall = tp / (tp + fn)*. For example, if a model correctly identifies one entity, but misses two other instances where that entity is present, the result is one true positive and two false negatives. In this case, *recall = 1 / (1 + 2)*. The recall is 33.33%, as one entity is correct out of a possible three examples.

  
+ **F1 score** 

  This is a combination of the Precision and Recall metrics, which measures the overall accuracy of the model for custom entity recognition. The F1 score is the harmonic mean of the Precision and Recall metrics: *F1 = 2 \* Precision \* Recall / (Precision + Recall)*.
**Note**  
Intuitively, the harmonic mean penalizes the extremes more than the simple average or other means (example: `precision` = 0, `recall` = 1 could be achieved trivially by predicting all possible spans. Here, the simple average would be 0.5, but `F1` would penalize it as 0). 

  In the examples above, `precision` = 50% and `recall` = 33.33%, therefore `F1` = 2 \* 0.5 \* 0.3333 / (0.5 + 0.3333). The F1 score is 0.4, or 40%.
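
The three formulas can be reproduced with a few lines of Python; this sketch uses the worked numbers from the examples above (tp = 1, fp = 1, fn = 2):

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of predicted entities that are correct.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of actual entities that the model found.
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Worked numbers from the examples above: tp = 1, fp = 1, fn = 2.
p = precision(1, 1)              # 0.5
r = recall(1, 2)                 # 0.3333...
print(round(f1_score(p, r), 2))  # 0.4
```

Note how the harmonic mean (0.4) sits below the simple average of 0.5 and 0.3333, reflecting the penalty on the lower of the two scores.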

  

**Global and individual entity metrics**

The relationship between global and individual entity metrics can be seen when analyzing the following sentence for entities that are either a *place* or a *person*:

```
John Washington and his friend Smith live in San Francisco, work in San Diego, and own 
    a house in Seattle.
```

In our example, the model makes the following predictions.

```
John Washington = Person
Smith = Place
San Francisco = Place
San Diego = Place
Seattle = Person
```

However, the predictions should have been the following.

```
John Washington = Person
Smith = Person  
San Francisco = Place
San Diego = Place
Seattle = Place
```

The individual entity metrics for this would be:

```
entity:  Person
  True positive (TP) = 1 (because John Washington is correctly predicted to be a 
    Person).
  False positive (FP) = 1 (because Seattle is incorrectly predicted to be a Person, 
    but is actually a Place).
  False negative (FN) = 1 (because Smith is incorrectly predicted to be a Place, but 
    is actually a Person).
  Precision = 1 / (1 + 1) = 0.5 or 50%
  Recall = 1 / (1+1) = 0.5 or 50%
  F1 Score = 2 * 0.5 * 0.5 / (0.5 + 0.5) = 0.5 or 50%
  
entity:  Place
  TP = 2 (because San Francisco and San Diego are each correctly predicted to be a 
    Place).
  FP = 1 (because Smith is incorrectly predicted to be a Place, but is actually a 
    Person).
  FN = 1 (because Seattle is incorrectly predicted to be a Person, but is actually a 
    Place).
  Precision = 2 / (2+1) = 0.6667 or 66.67%
  Recall = 2 / (2+1) = 0.6667 or 66.67%
  F1 Score = 2 * 0.6667 * 0.6667 / (0.6667 + 0.6667) = 0.6667 or  66.67%
```

The global metrics for this would be:

```
Global:
  TP = 3 (because John Washington, San Francisco and San Diego are predicted correctly. 
    This is also the sum of all individual entity TP).
  FP = 2 (because Seattle is predicted as Person and Smith is predicted as Place. This 
    is the sum of all individual entity FP).
  FN = 2 (because Seattle is predicted as Person and Smith is predicted as Place. This 
    is the sum of all individual FN).
  Global Precision = 3 / (3+2) = 0.6 or 60%  
    (Global Precision = Global TP / (Global TP + Global FP))
  Global Recall = 3 / (3+2) = 0.6 or 60% 
    (Global Recall = Global TP / (Global TP + Global FN))
  Global F1Score = 2 * 0.6 * 0.6 / (0.6 + 0.6) = 0.6 or 60% 
    (Global F1Score = 2 * Global Precision *  Global Recall / (Global Precision + 
    Global Recall))
```
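
As a cross-check, the counts above can be reproduced programmatically. This sketch tallies per-entity and global metrics from the example's gold labels and predictions; it simplifies by keying on entity text and assuming every span receives exactly one predicted label:

```python
from collections import Counter

# Gold labels and model predictions from the example sentence.
gold = {"John Washington": "Person", "Smith": "Person",
        "San Francisco": "Place", "San Diego": "Place", "Seattle": "Place"}
predicted = {"John Washington": "Person", "Smith": "Place",
             "San Francisco": "Place", "San Diego": "Place", "Seattle": "Person"}

tp, fp, fn = Counter(), Counter(), Counter()
for span, gold_label in gold.items():
    pred_label = predicted.get(span)
    if pred_label == gold_label:
        tp[gold_label] += 1
    else:
        fn[gold_label] += 1       # the true entity was missed
        if pred_label is not None:
            fp[pred_label] += 1   # a wrong label was produced

# Global counts are the sums of the per-entity counts.
g_tp, g_fp, g_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
global_precision = g_tp / (g_tp + g_fp)   # 3 / 5 = 0.6
global_recall = g_tp / (g_tp + g_fn)      # 3 / 5 = 0.6
print(global_precision, global_recall)
```

Running this reproduces the per-entity counts (Person: TP 1, FP 1, FN 1; Place: TP 2, FP 1, FN 1) and the global precision and recall of 60%.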



## Improving custom entity recognizer performance
<a name="cer-performance"></a>

These metrics provide insight into how accurately the trained model will perform when you use it to identify entities. Here are a few options you can use to improve your metrics if they are lower than your expectations:

1. Depending on whether you use [Annotations](cer-annotation.md) or [Entity lists (plaintext only)](cer-entity-list.md), make sure to follow the guidelines in the respective documentation to improve data quality. If you observe better metrics after improving your data and re-training the model, you can keep iterating and improving data quality to achieve better model performance.

1. If you are using an Entity List, consider using Annotations instead. Manual annotations can often improve your results.

1. If you are sure there is no data quality issue, and yet the metrics remain unreasonably low, submit a support request.