

# Custom entity recognition
<a name="custom-entity-recognition"></a>

Custom entity recognition extends the capability of Amazon Comprehend by helping you identify new entity types that are not in the preset [generic entity types](https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html). This means that you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs.

Building an accurate custom entity recognizer on your own can be a complex process, requiring preparation of large sets of manually annotated training documents and the selection of the right algorithms and parameters for model training. Amazon Comprehend helps to reduce the complexity by providing automatic annotation and model development to create a custom entity recognition model.

Creating a custom entity recognition model is a more effective approach than using string matching or regular expressions to extract entities from documents. For example, to extract ENGINEER names in a document, it is difficult to enumerate all possible names. Additionally, without context, it is challenging to distinguish between ENGINEER names and ANALYST names. A custom entity recognition model can learn the context where those names are likely to appear. Additionally, string matching will not detect entities that have typos or follow new naming conventions, while this is possible using a custom model. 

You have two options for creating a custom model: 

1. Annotations – provide a data set containing annotated entities for model training. 

1. Entity lists (plaintext only) – provide a list of entities with their type label (such as `PRODUCT_CODES`) and a set of unannotated documents containing those entities for model training.

When you create a custom entity recognizer using annotated PDF files, you can use that recognizer with a variety of input file formats: plaintext, image files (JPG, PNG, TIFF), PDF files, and Word documents, with no pre-processing or document flattening required. Amazon Comprehend doesn't support annotation of image files or Word documents.

**Note**  
A custom entity recognizer using annotated PDF files supports English documents only.

You can train a model on up to 25 custom entities at once. For more details, see the [Guidelines and quotas page](https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html).

After your model is trained, you can use the model for real-time entity detection and in entity detection jobs. 
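For real-time analysis with a custom model, the `DetectEntities` API accepts the endpoint of your trained recognizer and returns detected entities with confidence scores. The following sketch filters a response by confidence; the response dict here is an illustrative placeholder, not output from a real call:

```python
# Illustrative only: the response shape matches the DetectEntities API for a
# custom endpoint, but the entities and scores below are made-up placeholders.
# A real call would look like:
#   import boto3
#   comprehend = boto3.client("comprehend")
#   response = comprehend.detect_entities(
#       Text="Jo Brown is an engineer...",
#       EndpointArn="arn:aws:comprehend:region:account:entity-recognizer-endpoint/name")

sample_response = {
    "Entities": [
        {"Score": 0.97, "Type": "ENGINEER", "Text": "Jo Brown",
         "BeginOffset": 0, "EndOffset": 8},
        {"Score": 0.41, "Type": "MANAGER", "Text": "Jane Smith",
         "BeginOffset": 25, "EndOffset": 35},
    ]
}

def confident_entities(response, threshold=0.9):
    """Keep only detections at or above the confidence threshold."""
    return [e for e in response["Entities"] if e["Score"] >= threshold]

print(confident_entities(sample_response))
```

Filtering on `Score` this way is a common post-processing step when low-confidence detections would create noise downstream.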

**Topics**
+ [Preparing entity recognizer training data](prep-training-data-cer.md)
+ [Training custom entity recognizer models](training-recognizers.md)
+ [Running real-time custom recognizer analysis](running-cer-sync.md)
+ [Running analysis jobs for custom entity recognition](detecting-cer.md)

# Preparing entity recognizer training data
<a name="prep-training-data-cer"></a>

To train a successful custom entity recognition model, it's important to supply the model trainer with high quality data as input. Without good data, the model won't learn how to correctly identify entities. 

You can choose one of two ways to provide data to Amazon Comprehend in order to train a custom entity recognition model:
+ **Entity list** – Lists the specific entities so Amazon Comprehend can train to identify your custom entities. Note: Entity lists can only be used for plaintext documents. 
+ **Annotations** – Provides the location of your entities in a number of documents so Amazon Comprehend can train on both the entity and its context. To create a model for analyzing image files, PDFs, or Word documents, you must train your recognizer using PDF annotations. 

In both cases, Amazon Comprehend learns about the kind of documents and the context where the entities occur and builds a recognizer that can generalize to detect the new entities when you analyze documents.

When you create a custom model (or train a new version), you can provide a test dataset. If you do not provide test data, Amazon Comprehend reserves 10% of the input documents to test the model. Amazon Comprehend trains the model with the remaining documents.

If you provide a test dataset for your annotations training set, the test data must include at least one annotation for each of the entity types specified in the creation request. 
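To make the two training modes concrete, the following sketch shows the shape of the `InputDataConfig` parameter for the `CreateEntityRecognizer` API in each mode. The bucket names and prefixes are hypothetical, and the `TestS3Uri` entries are the optional test dataset; omit them and Amazon Comprehend reserves 10% of the input documents for testing:

```python
# Sketch of the two InputDataConfig shapes for CreateEntityRecognizer.
# All S3 paths are hypothetical placeholders.

# Annotations mode: documents plus an annotations file, with optional test data.
annotations_config = {
    "EntityTypes": [{"Type": "ENGINEER"}, {"Type": "MANAGER"}],
    "Documents": {"S3Uri": "s3://amzn-s3-demo-bucket/train/docs/",
                  "TestS3Uri": "s3://amzn-s3-demo-bucket/test/docs/"},
    "Annotations": {"S3Uri": "s3://amzn-s3-demo-bucket/train/annotations.csv",
                    "TestS3Uri": "s3://amzn-s3-demo-bucket/test/annotations.csv"},
}

# Entity list mode: unannotated documents plus the entity list CSV.
entity_list_config = {
    "EntityTypes": [{"Type": "ENGINEER"}, {"Type": "MANAGER"}],
    "Documents": {"S3Uri": "s3://amzn-s3-demo-bucket/train/docs/"},
    "EntityList": {"S3Uri": "s3://amzn-s3-demo-bucket/entitylist.csv"},
}

# A real request would pass one of these configs, for example:
#   import boto3
#   boto3.client("comprehend").create_entity_recognizer(
#       RecognizerName="my-recognizer", LanguageCode="en",
#       DataAccessRoleArn="arn:aws:iam::123456789012:role/my-role",
#       InputDataConfig=annotations_config)
```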

**Topics**
+ [When to use annotations vs entity lists](#prep-training-data-comp)
+ [Entity lists (plaintext only)](cer-entity-list.md)
+ [Annotations](cer-annotation.md)

## When to use annotations vs entity lists
<a name="prep-training-data-comp"></a>

 Creating annotations takes more work than creating an entity list, but the resulting model can be significantly more accurate. Using an entity list is quicker and less work-intensive, but the results are less refined and less accurate. This is because the annotations provide more context for Amazon Comprehend to use when training the model. Without that context, Amazon Comprehend will have a higher number of false positives when trying to identify the entities. 

There are scenarios when it makes more business sense to avoid the higher expense and workload of using annotations. For example, the name John Johnson is significant to your search, but whether it's the exact individual isn't relevant. Or the metrics when using the entity list are good enough to provide you with the recognizer results that you need. In such instances, using an entity list instead can be the more effective choice. 

We recommend using the annotations mode in the following cases:
+ If you plan to run inferences for image files, PDFs, or Word documents. In this scenario, you train a model using annotated PDF files and use the model to run inference jobs for image files, PDFs, and Word documents. 
+ When the meaning of the entities could be ambiguous and context-dependent. For example, the term *Amazon* could either refer to the river in Brazil, or the online retailer Amazon.com. When you build a custom entity recognizer to identify business entities such as *Amazon*, you should use annotations instead of an entity list because this method is better able to use context to find entities.
+ When you are comfortable setting up a process to acquire annotations, which can require some effort.

We recommend using an entity list in the following cases:
+ When you already have a list of entities or when it is relatively easy to compose a comprehensive list of entities. If you use an entity list, it should be complete, or at least cover the majority of valid entities that might appear in the documents you provide for training. 
+ For first-time users, it is generally recommended to use an entity list because this requires a smaller effort than constructing annotations. However, it is important to note that the trained model might not be as accurate as if you used annotations.

# Entity lists (plaintext only)
<a name="cer-entity-list"></a>

To train a model using an entity list, you provide two pieces of information: a list of the entity names with their corresponding custom entity types and a collection of unannotated documents in which you expect your entities to appear. 

When you provide an entity list, Amazon Comprehend uses an intelligent algorithm to detect occurrences of the entities in the documents to serve as the basis for training the custom entity recognizer model.

For entity lists, provide at least 25 entity matches per entity type in the entity list.

An entity list for custom entity recognition must be a comma-separated values (CSV) file with the following columns:
+ **Text**—The text of an entry example, exactly as it appears in the accompanying document corpus.
+ **Type**—The customer-defined entity type. Entity types must be an uppercase, underscore-separated string, such as MANAGER or SENIOR_MANAGER. Up to 25 entity types can be trained per model. 

The file `documents.txt` contains four lines:

```
Jo Brown is an engineer in the high tech industry.
John Doe has been an engineer for 14 years.
Emilio Johnson is a judge on the Washington Supreme Court.
Our latest new employee, Jane Smith, has been a manager in the industry for 4 years.
```

The CSV file with the list of entities has the following lines: 

```
Text, Type
Jo Brown, ENGINEER
John Doe, ENGINEER
Jane Smith, MANAGER
```

**Note**  
The entity list contains no entry for Emilio Johnson because that name does not correspond to the ENGINEER or MANAGER entity type. 

**Creating your data files**

It is important to put your entity list in a properly configured CSV file to minimize the chance of problems with your entity list file. To manually configure your CSV file, the following must be true:
+ UTF-8 encoding must be explicitly specified, even if it is the default in most cases.
+ It must include the column names: `Type` and `Text`.

We highly recommend that you generate CSV input files programmatically to avoid potential issues.

The following example uses Python to generate a CSV for the entity list shown above:

```
import csv

# Write the entity list in the expected format: a Text,Type header row
# followed by one entity per line, UTF-8 encoded.
with open("./entitylist/entitylist.csv", "w", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Text", "Type"])
    csv_writer.writerow(["Jo Brown", "ENGINEER"])
    csv_writer.writerow(["John Doe", "ENGINEER"])
    csv_writer.writerow(["Jane Smith", "MANAGER"])
```
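As a quick sanity check before training, you can count how many times each entity list entry actually matches the document corpus. This is a sketch using the example data above inlined; in practice, you would read `documents.txt` and `entitylist.csv` from disk:

```python
# Count exact matches of each entity list entry in the training corpus.
# The documents and entity list below are the examples from this page,
# inlined so the sketch is self-contained.
documents = [
    "Jo Brown is an engineer in the high tech industry.",
    "John Doe has been an engineer for 14 years.",
    "Emilio Johnson is a judge on the Washington Supreme Court.",
    "Our latest new employee, Jane Smith, has been a manager in the industry for 4 years.",
]
entity_list = [("Jo Brown", "ENGINEER"), ("John Doe", "ENGINEER"),
               ("Jane Smith", "MANAGER")]

match_counts = {
    text: sum(line.count(text) for line in documents) for text, _ in entity_list
}
print(match_counts)  # every entry should match at least once
```

An entry with zero matches contributes nothing to training, so counts like these help you catch typos or casing mismatches between your list and your corpus.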

## Best practices
<a name="entitylist-bestresults"></a>

There are a number of things to consider to get the best result when using an entity list, including:
+ The order of the entities in your list has no effect on model training.
+ Use entity list items that cover 80%-100% of positive entity examples mentioned in the unannotated corpus of documents.
+ Avoid entity examples that match non-entities in the document corpus by removing common words and phrases. Even a handful of incorrect matches can significantly affect the accuracy of your resulting model. For example, a word like *the* in the entity list will result in a high number of matches which are unlikely to be the entities you are looking for and thus will significantly affect your accuracy. 
+ Input data should not contain duplicates. Duplicate samples might result in test set contamination and negatively affect the training process, model metrics, and model behavior.
+ Provide documents that resemble real use cases as closely as possible. Don't use toy data or synthesized data for production systems. The input data should be as diverse as possible to avoid overfitting and help the underlying model generalize better on real examples.
+ The entity list is case sensitive, and regular expressions are not currently supported. However, the trained model can often still recognize entities even if they do not match exactly to the casing provided in the entity list.
+ If you have an entity that is a substring of another entity (such as “Smith” and “Jane Smith”), provide both in the entity list.

For additional suggestions, see [Improving custom entity recognizer performance](cer-metrics.md#cer-performance).

# Annotations
<a name="cer-annotation"></a>

Annotations label entities in context by associating your custom entity types with the locations where they occur in your training documents.

By submitting annotations along with your documents, you can increase the accuracy of the model. With annotations, you're not simply providing the location of the entity you're looking for; you're also providing more accurate context for the custom entity you're seeking.

For instance, if you're searching for the name John Johnson with the entity type JUDGE, providing your annotations might help the model learn that the person you want to find is a judge. By using the context, Amazon Comprehend can avoid finding people named John Johnson who are attorneys or witnesses. Without annotations, Amazon Comprehend creates its own version of an annotation, but it won't be as effective at including only judges. Providing your own annotations can help you achieve better results and generate models that are better able to leverage context when extracting custom entities.

**Topics**
+ [Minimum number of annotations](#prep-training-data-ann)
+ [Annotation best practices](#cer-annotation-best-practices)
+ [Plain-text annotation files](cer-annotation-csv.md)
+ [PDF annotation files](cer-annotation-manifest.md)
+ [Annotating PDF files](cer-annotation-pdf.md)

## Minimum number of annotations
<a name="prep-training-data-ann"></a>

The minimum number of input documents and annotations required to train a model depends on the type of annotations. 

**PDF annotations**  
To create a model for analyzing image files, PDFs, or Word documents, train your recognizer using PDF annotations. For PDF annotations, provide at least 250 input documents and at least 100 annotations per entity.  
If you provide a test dataset, the test data must include at least one annotation for each of the entity types specified in the creation request. 

**Plain-text annotations**  
To create a model for analyzing text documents, you can train your recognizer using plain-text annotations.   
For plain-text annotations, provide at least three annotated input documents and at least 25 annotations per entity. If you provide fewer than 50 annotations in total, Amazon Comprehend reserves more than 10% of the input documents to test the model (unless you provided a test dataset in the training request). Note that the minimum document corpus size is 5 KB.  
If your input contains only a few training documents, you may encounter an error that the training input data contains too few documents that mention one of the entities. Submit the job again with additional documents that mention the entity.  
If you provide a test dataset, the test data must include at least one annotation for each of the entity types specified in the creation request.  
For an example of how to benchmark a model with a small dataset, see [Amazon Comprehend announces lower annotation limits for custom entity recognition](https://aws.amazon.com/blogs/machine-learning/amazon-comprehend-announces-lower-annotation-limits-for-custom-entity-recognition/) on the AWS blog site.
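Checking your annotation counts against these minimums before training can save a failed job. The following sketch flags entity types that fall below the plain-text minimum; the tallies here are hypothetical, and in practice you would collect the `Type` column from your annotations CSV:

```python
from collections import Counter

MIN_PLAINTEXT_ANNOTATIONS = 25  # per-entity minimum for plain-text annotations

def types_below_minimum(annotation_types, minimum=MIN_PLAINTEXT_ANNOTATIONS):
    """Return entity types that have fewer annotations than the minimum."""
    counts = Counter(annotation_types)
    return sorted(t for t, n in counts.items() if n < minimum)

# Hypothetical tally of the 'Type' column from an annotations CSV:
sample = ["ENGINEER"] * 30 + ["MANAGER"] * 12
print(types_below_minimum(sample))  # MANAGER needs more annotations
```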

## Annotation best practices
<a name="cer-annotation-best-practices"></a>

There are a number of things to consider to get the best result when using annotations, including: 
+ Annotate your data with care and verify that you annotate every mention of the entity. Imprecise annotations can lead to poor results.
+ Input data should not contain duplicates, like a duplicate of a PDF you are going to annotate. Presence of a duplicate sample might result in test set contamination and could negatively affect the training process, model metrics, and model behavior.
+ Make sure that all of your documents are annotated, and that the documents without annotations are due to lack of legitimate entities, not due to negligence. For example, if you have a document that says "J Doe has been an engineer for 14 years", you should also provide an annotation for "J Doe" as well as "John Doe". Failing to do so confuses the model and can result in the model not recognizing "J Doe" as ENGINEER. This should be consistent within the same document and across documents.
+ In general, more annotations lead to better results.
+ You can train a model with the [minimum number](guidelines-and-limits.md#limits-custom-entity-recognition) of documents and annotations, but adding data usually improves the model. We recommend increasing the volume of annotated data by 10% to increase the accuracy of the model. You can run inference on a test dataset which remains unchanged and can be tested by different model versions. You can then compare the metrics for successive model versions.
+ Provide documents that resemble real use cases as closely as possible. Synthesized data with repetitive patterns should be avoided. The input data should be as diverse as possible to avoid overfitting and help the underlying model better generalize on real examples.
+ It is important that documents should be diverse in terms of word count. For example, if all documents in the training data are short, the resulting model may have difficulty predicting entities in longer documents.
+ Try and give the same data distribution for training as you expect to be using when you're actually detecting your custom entities (inference time). For example, at inference time, if you expect to be sending us documents that have no entities in them, this should also be part of your training document set.

For additional suggestions, see [Improving custom entity recognizer performance](https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html#cer-performance).

# Plain-text annotation files
<a name="cer-annotation-csv"></a>

For plain-text annotations, you create a comma-separated value (CSV) file that contains a list of annotations. The CSV file must contain the following columns if your training file input format is **one document per line**.


+ **File**—The name of the file containing the document. For example, if one of the document files is located at `s3://my-S3-bucket/test-files/documents.txt`, the value in the `File` column is `documents.txt`. You must include the file extension (in this case `.txt`) as part of the file name.
+ **Line**—The line number containing the entity. Omit this column if your input format is one document per file.
+ **Begin offset**—The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0.
+ **End offset**—The character offset in the input text that shows where the entity ends.
+ **Type**—The customer-defined entity type. Entity types must be an uppercase, underscore-separated string. We recommend using descriptive entity types such as `MANAGER`, `SENIOR_MANAGER`, or `PRODUCT_CODE`. Up to 25 entity types can be trained per model.

If your training file input format is **one document per file**, you omit the line number column and the **Begin offset** and **End offset** values are the offsets of the entity from the start of the document.

The following example is for one document per line. The file `documents.txt` contains four lines (rows 0, 1, 2, and 3):

```
Diego Ramirez is an engineer in the high tech industry.
Emilio Johnson has been an engineer for 14 years.
J Doe is a judge on the Washington Supreme Court.
Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.
```

The CSV file with the list of annotations is as follows: 

```
File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 13, ENGINEER
documents.txt, 1, 0, 14, ENGINEER
documents.txt, 3, 25, 38, MANAGER
```

**Note**  
In the annotations file, the line number containing the entity starts with line 0. In this example, the CSV file contains no entry for line 2 because there is no entity in line 2 of `documents.txt`.

**Creating your data files**

It's important to put your annotations in a properly configured CSV file to reduce the risk of errors. To manually configure your CSV file, the following must be true:
+ UTF-8 encoding must be explicitly specified, even if it is the default in most cases.
+ The first line contains the column headers: `File`, `Line` (optional), `Begin Offset`, `End Offset`, `Type`.

We highly recommend that you generate the CSV input files programmatically to avoid potential issues.

The following example uses Python to generate a CSV for the annotations shown earlier:

```
import csv

# Write one row per annotation; the offsets match the documents.txt
# example shown earlier.
with open("./annotations/annotations.csv", "w", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    csv_writer.writerow(["documents.txt", 0, 0, 13, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 1, 0, 14, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 3, 25, 38, "MANAGER"])
```
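Because a single off-by-one offset silently mislabels an entity, it's worth verifying that each offset pair slices out exactly the entity text. A small sketch, with the example `documents.txt` contents inlined:

```python
# Sanity-check annotation offsets against the training documents.
# Each Begin/End offset pair should slice out exactly the entity mention.
documents = [
    "Diego Ramirez is an engineer in the high tech industry.",
    "Emilio Johnson has been an engineer for 14 years.",
    "J Doe is a judge on the Washington Supreme Court.",
    "Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.",
]
annotations = [
    ("documents.txt", 0, 0, 13, "ENGINEER"),
    ("documents.txt", 1, 0, 14, "ENGINEER"),
    ("documents.txt", 3, 25, 38, "MANAGER"),
]

for _file, line, begin, end, _type in annotations:
    span = documents[line][begin:end]
    # A span with leading/trailing whitespace usually signals a bad offset.
    assert span == span.strip(), f"offsets include whitespace: {span!r}"
    print(line, repr(span))
```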

# PDF annotation files
<a name="cer-annotation-manifest"></a>

For PDF annotations, you use SageMaker AI Ground Truth to create a labeled dataset in an augmented manifest file. Ground Truth is a data labeling service that helps you (or a workforce that you employ) to build training datasets for machine learning models. Amazon Comprehend accepts augmented manifest files as training data for custom models. You can provide these files when you create a custom entity recognizer by using the Amazon Comprehend console or the [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) API action. 

You can use the Ground Truth built-in task type, Named Entity Recognition, to create a labeling job to have workers identify entities in text. To learn more, see [Named Entity Recognition](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-named-entity-recg.html#sms-creating-ner-console) in the *Amazon SageMaker AI Developer Guide*. To learn more about Amazon SageMaker Ground Truth, see [Use Amazon SageMaker AI Ground Truth to Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html).

**Note**  
Using Ground Truth, you can define overlapping labels (text that you associate with more than one label). However, Amazon Comprehend entity recognition does not support overlapping labels.

Augmented manifest files are in JSON lines format. In these files, each line is a complete JSON object that contains a training document and its associated labels. The following example is an augmented manifest file that trains an entity recognizer to detect the professions of individuals who are mentioned in the text:

```
{"source":"Diego Ramirez is an engineer in the high tech industry.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":13,"startOffset":0,"label":"ENGINEER"}],"labels":[{"label":"ENGINEER"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.92}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.175903","human-annotated":"yes"}}
{"source":"J Doe is a judge on the Washington Supreme Court.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":5,"startOffset":0,"label":"JUDGE"}],"labels":[{"label":"JUDGE"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.72}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.174910","human-annotated":"yes"}}
{"source":"Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":38,"startOffset":26,"label":"MANAGER"}],"labels":[{"label":"MANAGER"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.91}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.174035","human-annotated":"yes"}}
```

Each line in this JSON lines file is a complete JSON object, where the attributes include the document text, the annotations, and other metadata from Ground Truth. The following example is a single JSON object in the augmented manifest file, but it's formatted for readability: 

```
{
  "source": "Diego Ramirez is an engineer in the high tech industry.",
  "NamedEntityRecognitionDemo": {
    "annotations": {
      "entities": [
        {
          "endOffset": 13,
          "startOffset": 0,
          "label": "ENGINEER"
        }
      ],
      "labels": [
        {
          "label": "ENGINEER"
        }
      ]
    }
  },
  "NamedEntityRecognitionDemo-metadata": {
    "entities": [
      {
        "confidence": 0.92
      }
    ],
    "job-name": "labeling-job/namedentityrecognitiondemo",
    "type": "groundtruth/text-span",
    "creation-date": "2020-05-14T21:45:27.175903",
    "human-annotated": "yes"
  }
}
```

In this example, the `source` attribute provides the text of the training document, and the `NamedEntityRecognitionDemo` attribute provides the annotations for the entities in the text. The name of the `NamedEntityRecognitionDemo` attribute is arbitrary, and you provide a name of your choice when you define the labeling job in Ground Truth.

In this example, the `NamedEntityRecognitionDemo` attribute is the *label attribute name*, which is the attribute that provides the labels that a Ground Truth worker assigns to the training data. When you provide your training data to Amazon Comprehend, you must specify one or more label attribute names. The number of attribute names that you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.

If your file is the output of a single labeling job, specify the single label attribute name that was used when the job was created in Ground Truth. 

If your file is the output of a chained labeling job, specify the label attribute name for one or more jobs in the chain. Each label attribute name provides the annotations from an individual job. You can specify up to 5 of these attributes for augmented manifest files that are produced by chained labeling jobs. 

In an augmented manifest file, the label attribute name typically follows the `source` key. If the file is the output of a chained job, there will be multiple label attribute names. When you provide your training data to Amazon Comprehend, provide only those attributes that contain annotations that are relevant for your model. Do not specify the attributes that end with "-metadata".
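The rule above can be sketched in a few lines: the candidate label attribute names in a manifest line are the top-level keys other than `source` that don't end with `-metadata`. The manifest line here is a shortened version of the earlier example:

```python
import json

# One line from an augmented manifest (annotations truncated for brevity).
line = json.dumps({
    "source": "Diego Ramirez is an engineer in the high tech industry.",
    "NamedEntityRecognitionDemo": {"annotations": {"entities": [
        {"endOffset": 13, "startOffset": 0, "label": "ENGINEER"}]}},
    "NamedEntityRecognitionDemo-metadata": {"human-annotated": "yes"},
})

def label_attribute_names(manifest_line):
    """Candidate label attribute names: every top-level key except
    'source' and the '-metadata' companion attributes."""
    record = json.loads(manifest_line)
    return [k for k in record
            if k != "source" and not k.endswith("-metadata")]

print(label_attribute_names(line))  # ['NamedEntityRecognitionDemo']
```

For a chained labeling job, this list would contain one name per job in the chain, from which you would select the ones relevant to your model (up to 5).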

For more information about chained labeling jobs, and for examples of the output that they produce, see [Chaining Labeling Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-reusing-data.html) in the Amazon SageMaker AI Developer Guide.

# Annotating PDF files
<a name="cer-annotation-pdf"></a>

Before you can annotate your training PDFs in SageMaker AI Ground Truth, complete the following prerequisites:
+ Install Python 3.8.x
+ Install [jq](https://stedolan.github.io/jq/download/)
+ Install the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html)

  If you're using the us-east-1 Region, you can skip installing the AWS CLI because it's already installed with your Python environment. In this case, you create a virtual environment to use Python 3.8 in AWS Cloud9.
+ Configure your [AWS credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
+ Create a private [SageMaker AI Ground Truth workforce](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private-use-cognito.html) to support annotation

  Make sure to record the workteam name you choose in your new private workforce, as you use it during installation.

**Topics**
+ [Setting up your environment](#cer-annotation-pdf-set-up)
+ [Uploading a PDF to an S3 bucket](#cer-annotation-pdf-upload)
+ [Creating an annotation job](#cer-annotation-pdf-job)
+ [Annotating with SageMaker AI Ground Truth](#w2aac35c23c21c19c15)

## Setting up your environment
<a name="cer-annotation-pdf-set-up"></a>

1. If using Windows, install [Cygwin](https://cygwin.com/install.html); if using Linux or Mac, skip this step.

1. Download the [annotation artifacts](http://github.com/aws-samples/amazon-comprehend-semi-structured-documents-annotation-tools) from GitHub. Unzip the file.

1. From your terminal window, navigate to the unzipped folder (**amazon-comprehend-semi-structured-documents-annotation-tools-main**). 

1. This folder includes a choice of `Makefiles` that you can run to install dependencies, set up a Python virtualenv, and deploy the required resources. Review the **readme** file to make your choice.

1. The recommended option uses a single command to install all dependencies into a virtualenv, build the CloudFormation stack from the template, and deploy the stack to your AWS account with interactive guidance. Run the following command:

   `make ready-and-deploy-guided`

   This command presents a set of configuration options. Be sure your AWS Region is correct. For all other fields, you can either accept the default values or fill in custom values. If you modify the CloudFormation stack name, write it down as you need it in the next steps.  
![\[Terminal session showing CloudFormation configuration options.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/deploy_guided_anno.png)

   The CloudFormation stack creates and manages the [AWS Lambda](https://aws.amazon.com/lambda/) functions, [AWS IAM](https://aws.amazon.com/iam/) roles, and [Amazon S3](https://aws.amazon.com/s3/) buckets required for the annotation tool.

   You can review each of these resources in the stack details page in the CloudFormation console.

1. The command prompts you to start the deployment. CloudFormation creates all the resources in the specified Region.  
![\[Terminal session showing the deployed CloudFormation configuration.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/deploy_guided_anno_2.png)

   When the CloudFormation stack status transitions to CREATE_COMPLETE, the resources are ready to use.

## Uploading a PDF to an S3 bucket
<a name="cer-annotation-pdf-upload"></a>

In the [Setting up](#cer-annotation-pdf-set-up) section, you deployed a CloudFormation stack that creates an S3 bucket named **comprehend-semi-structured-documents-${AWS::Region}-${AWS::AccountId}**. You now upload your source PDF documents into this bucket.

**Note**  
This bucket contains the data required for your labeling job. The Lambda Execution Role policy grants permission for the Lambda function to access this bucket.  
You can find the S3 bucket name in the **CloudFormation Stack details** using the '**SemiStructuredDocumentsS3Bucket**' key.

1. Create a new folder in the S3 bucket. Name this new folder '**src**'. 

1. Add your PDF source files to your '**src**' folder. In a later step, you annotate these files to train your recognizer.

1. (Optional) Here's an AWS CLI example you can use to upload your source documents from a local directory into an S3 bucket:

   `aws s3 cp --recursive local-path-to-your-source-docs s3://deploy-guided/src/`

   Or, with your Region and Account ID:

   `aws s3 cp --recursive local-path-to-your-source-docs s3://deploy-guided-Region-AccountID/src/`

1. You now have a private SageMaker AI Ground Truth workforce and have uploaded your source files to the S3 bucket, **deploy-guided/src/**; you're ready to start annotating.

## Creating an annotation job
<a name="cer-annotation-pdf-job"></a>

The **comprehend-ssie-annotation-tool-cli.py** script in the `bin` directory is a simple wrapper command that streamlines the creation of a SageMaker AI Ground Truth labeling job. The Python script reads the source documents from your S3 bucket and creates a corresponding single-page manifest file with one source document per line. The script then creates a labeling job, which requires the manifest file as an input. 

The Python script uses the S3 bucket and CloudFormation stack that you configured in the [Setting up](#cer-annotation-pdf-set-up) section. Required input parameters for the script include:
+ **input-s3-path**: The S3 URI of the source documents you uploaded to your S3 bucket. For example: `s3://deploy-guided/src/`. You can also add your Region and Account ID to this path. For example: `s3://deploy-guided-Region-AccountID/src/`.
+ **cfn-name**: The CloudFormation stack name. If you used the default value for the stack name, your cfn-name is **sam-app**.
+ **work-team-name**: The workforce name you created when you built out the private workforce in SageMaker AI Ground Truth.
+ **job-name-prefix**: The prefix for the SageMaker AI Ground Truth labeling job. Note that there is a 29-character limit for this field. A timestamp is appended to this value. For example: `my-job-name-20210902T232116`.
+ **entity-types**: The entities you want to use during your labeling job, separated by commas. This list must include all entities that you want to annotate in your training dataset. The Ground Truth labeling job displays only these entities for annotators to label content in the PDF documents. 

To view additional arguments the script supports, use the `-h` option to display the help content.
+ Run the following script with the input parameters as described in the previous list.

  ```
  python bin/comprehend-ssie-annotation-tool-cli.py \
  --input-s3-path s3://deploy-guided-Region-AccountID/src/ \
  --cfn-name sam-app \
  --work-team-name my-work-team-name \
  --region us-east-1 \
  --job-name-prefix my-job-name-20210902T232116 \
  --entity-types "EntityA, EntityB, EntityC" \
  --annotator-metadata "key=info,value=sample,key=Due Date,value=12/12/2021"
  ```

  The script produces the following output:

  ```
  Downloaded files to temp local directory /tmp/a1dc0c47-0f8c-42eb-9033-74a988ccc5aa
  Deleted downloaded temp files from /tmp/a1dc0c47-0f8c-42eb-9033-74a988ccc5aa
  Uploaded input manifest file to s3://comprehend-semi-structured-documents-us-west-2-123456789012/input-manifest/my-job-name-20220203-labeling-job-20220203T183118.manifest
  Uploaded schema file to s3://comprehend-semi-structured-documents-us-west-2-123456789012/comprehend-semi-structured-docs-ui-template/my-job-name-20220203-labeling-job-20220203T183118/ui-template/schema.json
  Uploaded template UI to s3://comprehend-semi-structured-documents-us-west-2-123456789012/comprehend-semi-structured-docs-ui-template/my-job-name-20220203-labeling-job-20220203T183118/ui-template/template-2021-04-15.liquid
  Sagemaker GroundTruth Labeling Job submitted: arn:aws:sagemaker:us-west-2:123456789012:labeling-job/my-job-name-20220203-labeling-job-20220203t183118
  ```

## Annotating with SageMaker AI Ground Truth
<a name="w2aac35c23c21c19c15"></a>

Now that you have configured the required resources and created a labeling job, you can log in to the labeling portal and annotate your PDFs.

1. Log in to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker) using either Chrome or Firefox web browsers.

1. Select **Labeling workforces** and choose **Private**.

1. Under **Private workforce summary**, select the labeling portal sign-in URL that you created with your private workforce. Sign in with the appropriate credentials.

   If you don't see any jobs listed, don't worry—it can take a while to update, depending on the number of files you uploaded for annotation.

1. Select your task and, in the top right corner, choose **Start working** to open the annotation screen.

   You'll see one of your documents open in the annotation screen and, above it, the entity types you provided during set up. To the right of your entity types, there is an arrow you can use to navigate through your documents.  
![\[The Amazon Comprehend annotation screen.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/annotation_demo1.png)

   Annotate the open document. You can also remove, undo, or auto tag your annotations on each document; these options are available in the right panel of the annotation tool.  
![\[Available options in the Amazon Comprehend annotation right panel.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/data_annotation.png)

   To use auto tag, annotate an instance of one of your entities; all other instances of that specific word are then automatically annotated with that entity type.

   Once you've finished, select **Submit** on the bottom right, then use the navigation arrows to move to the next document. Repeat this until you've annotated all your PDFs.

After you annotate all the training documents, you can find the annotations in JSON format in the Amazon S3 bucket at this location:

```
/output/your labeling job name/annotations/
```

The output folder also contains an output manifest file, which lists all the annotations within your training documents. You can find your output manifest file at the following location.

```
/output/your labeling job name/manifests/
```
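
The output manifest is a JSON Lines file, with one record per training document. The exact annotation layout varies by job, so the record below is an illustrative sketch (the `my-job-name` attribute name and entity fields are assumptions), but the parsing pattern is the same:

```python
import json

# Illustrative output-manifest record; the label attribute name and the
# annotation layout here are assumptions for the sketch, not a specification.
sample_line = json.dumps({
    "source-ref": "s3://deploy-guided/src/doc-1.pdf",
    "my-job-name": {"annotations": {"entities": [{"label": "EntityA"}]}},
})

# Each line of the manifest parses into one complete JSON object
record = json.loads(sample_line)
labels = [e["label"] for e in record["my-job-name"]["annotations"]["entities"]]
```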

# Training custom entity recognizer models
<a name="training-recognizers"></a>

A custom entity recognizer identifies only the entity types that you include when you train the model. It does not automatically include the preset entity types. If you want to also identify the preset entity types, such as LOCATION, DATE, or PERSON, you need to provide additional training data for those entities.

When you create a custom entity recognizer using annotated PDF files, you can use the recognizer with a variety of input file formats: plaintext, image files (JPG, PNG, TIFF), PDF files, and Word documents, with no pre-processing or doc flattening required. Amazon Comprehend doesn't support annotation of image files or Word documents.

**Note**  
A custom entity recognizer using annotated PDF files supports English documents only.

After you create a custom entity recognizer, you can monitor the progress of the request using the [DescribeEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeEntityRecognizer.html) operation. Once the `Status` field is `TRAINED`, the recognizer model is ready to use for custom entity recognition.

**Topics**
+ [Train custom recognizers (console)](realtime-analysis-cer.md)
+ [Train custom entity recognizers (API)](train-cer-model.md)
+ [Custom entity recognizer metrics](cer-metrics.md)

# Train custom recognizers (console)
<a name="realtime-analysis-cer"></a>

You can create custom entity recognizers using the Amazon Comprehend console. This section shows you how to create and train a custom entity recognizer.


## Creating a custom entity recognizer using the console - CSV format
<a name="console-CER"></a>

To create the custom entity recognizer, first provide a dataset to train your model. With this dataset, include one of the following: a set of annotated documents or a list of entities and their type label, along with a set of documents containing those entities. For more information, see [Custom entity recognition](custom-entity-recognition.md).

**To train a custom entity recognizer with a CSV file**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Customization** and then choose **Custom entity recognition**.

1. Choose **Create new model**.

1. Give the recognizer a name. The name must be unique within the Region and account.

1. Select the language. 

1. Under **Custom entity type**, enter a custom label that you want the recognizer to find in the dataset. 

   The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore. 

1. Choose **Add type**.

1. If you want to add an additional entity type, enter it, and then choose **Add type**. If you want to remove one of the entity types you've added, choose **Remove type** and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed. 

1. To encrypt your training job, choose **Recognizer encryption** and then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, for **KMS key ID** choose the key ID.
   + If you are using a key associated with a different account, for **KMS key ARN** enter the ARN for the key ID.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Data specifications**, choose the format of your training documents:
   + **CSV file** — A CSV file that supplements your training documents. The CSV file contains information about the custom entities that your trained model will detect. The required format of the file depends on whether you are providing annotations or an entity list.
   + **Augmented manifest** — A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. You can provide up to 5 augmented manifest files.

   For more information about available formats, and for examples, see [Training custom entity recognizer models](training-recognizers.md).

1. Under **Training type**, choose the training type to use:
   + **Using annotations and training docs**
   + **Using entity list and training docs**

    If you choose annotations, enter the URL of the annotations file in Amazon S3. You can also choose **Browse S3** to navigate to the bucket or folder in Amazon S3 where the annotation files are located.

    If you choose entity list, enter the URL of the entity list in Amazon S3. You can also choose **Browse S3** to navigate to the bucket or folder in Amazon S3 where the entity list is located.

1. Enter the URL of an input dataset containing the training documents in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the training documents are located and choose **Select folder**.

1. Under **Test dataset**, select how you want to evaluate the performance of your trained model. You can use either option with both annotations and entity list training types.
   + **Autosplit**: Automatically selects 10% of your provided training data to use as test data.
   + **Customer provided**: You specify exactly which test data to use.

1. If you select Customer provided test dataset, enter the URL of the annotations file in Amazon S3. You can also navigate to the bucket or folder in Amazon S3 where the annotation files are located and choose **Select folder**.

1. In the **Choose an IAM role** section, either select an existing IAM role or create a new one.
   + **Choose an existing IAM role** – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
   + **Create a new IAM role** – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets. 
**Note**  
If the input documents are encrypted, the IAM role used must have `kms:Decrypt` permission. For more information, see [Permissions required to use KMS encryption](security_iam_id-based-policy-examples.md#auth-kms-permissions).

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your custom entity recognition job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

1. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under **Tags**. Choose **Add tag**. To remove this pair before creating the recognizer, choose **Remove tag**.

1. Choose **Train**.

The new recognizer then appears in the list, showing its status. The status first shows as `Submitted`. It then shows `Training` for a recognizer that is processing training documents, `Trained` for a recognizer that is ready to use, and `In error` for a recognizer that has an error. You can choose a job to get more information about the recognizer, including any error messages.

## Creating a custom entity recognizer using the console - augmented manifest
<a name="getting-started-CER-PDF"></a>

**To train a custom entity recognizer with a plaintext, PDF, or word document**

1. Sign in to the AWS Management Console and open the [Amazon Comprehend console](https://console.aws.amazon.com/comprehend/home?region=us-east-1#api-explorer:).

1. From the left menu, choose **Customization** and then choose **Custom entity recognition**.

1. Choose **Train recognizer**.

1. Give the recognizer a name. The name must be unique within the Region and account.

1. Select the language. Note: If you're training with PDF or Word documents, English is the only supported language. 

1. Under **Custom entity type**, enter a custom label that you want the recognizer to find in the dataset. 

   The entity type must be uppercase, and if it consists of more than one word, separate the words with an underscore.

1. Choose **Add type**.

1. If you want to add an additional entity type, enter it, and then choose **Add type**. If you want to remove one of the entity types you've added, choose **Remove type** and then choose the entity type to remove from the list. A maximum of 25 entity types can be listed. 

1. To encrypt your training job, choose **Recognizer encryption** and then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, for **KMS key ID** choose the key ID.
   + If you are using a key associated with a different account, for **KMS key ARN** enter the ARN for the key ID.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Training data**, choose **Augmented manifest** as your data format:
   + **Augmented manifest** — A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line in the file is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. If you are using PDF documents for training data, you must select **Augmented manifest**. You can provide up to 5 augmented manifest files. For each file, you can name up to 5 attributes to use as training data.

   For more information about available formats, and for examples, see [Training custom entity recognizer models](training-recognizers.md).

1. Select the training model type. 

   If you selected **Plaintext documents**, under **Input location**, enter the Amazon S3 URL of the Amazon SageMaker AI Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest(s) is located and choose **Select folder**.

1. Under **Attribute name**, enter the name of the attribute that contains your annotations. If the file contains annotations from multiple chained labeling jobs, add an attribute for each job. In this case, each attribute contains the set of annotations from a labeling job. Note: You can provide up to 5 attribute names for each file.

1. Select **Add**.

1. If you selected **PDF, Word documents**, under **Input location**, enter the Amazon S3 URL of the Amazon SageMaker AI Ground Truth augmented manifest file. You can also navigate to the bucket or folder in Amazon S3 where the augmented manifest(s) is located and choose **Select folder**.

1. Enter the S3 prefix for your **Annotation** data files. These are the PDF documents that you labeled.

1. Enter the S3 prefix for your **Source** documents. These are the original PDF documents (data objects) that you provided to Ground Truth for your labeling job.

   

1. Enter the attribute names that contain your annotations. Note: You can provide up to 5 attribute names for each file. Any attributes in your file that you don't specify are ignored. 

1. In the IAM role section, either select an existing IAM role or create a new one.
   + **Choose an existing IAM role** – Select this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.
   + **Create a new IAM role** – Select this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets. 
**Note**  
If the input documents are encrypted, the IAM role used must have `kms:Decrypt` permission. For more information, see [Permissions required to use KMS encryption](security_iam_id-based-policy-examples.md#auth-kms-permissions).

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your custom entity recognition job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

1. (Optional) To add a tag to the custom entity recognizer, enter a key-value pair under **Tags**. Choose **Add tag**. To remove this pair before creating the recognizer, choose **Remove tag**.

1. Choose **Train**.

The new recognizer then appears in the list, showing its status. The status first shows as `Submitted`. It then shows `Training` for a recognizer that is processing training documents, `Trained` for a recognizer that is ready to use, and `In error` for a recognizer that has an error. You can choose a job to get more information about the recognizer, including any error messages.

# Train custom entity recognizers (API)
<a name="train-cer-model"></a>

To create and train a custom entity recognition model, use the Amazon Comprehend [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) API operation.

**Topics**
+ [Training custom entity recognizers using the AWS Command Line Interface](#get-started-api-cer-cli)
+ [Training custom entity recognizers using the AWS SDK for Java](#get-started-api-cer-java)
+ [Training custom entity recognizers using Python (Boto3)](#cer-python)

## Training custom entity recognizers using the AWS Command Line Interface
<a name="get-started-api-cer-cli"></a>

The following examples demonstrate using the `CreateEntityRecognizer` operation and other associated APIs with the AWS CLI. 

The examples are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

Create a custom entity recognizer using the `create-entity-recognizer` CLI command. For information about the input-data-config parameter, see [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) in the *Amazon Comprehend API Reference*.

```
aws comprehend create-entity-recognizer \
     --language-code en \
     --recognizer-name test-6 \
     --data-access-role-arn "arn:aws:iam::account number:role/service-role/AmazonComprehendServiceRole-role" \
     --input-data-config "EntityTypes=[{Type=PERSON}],Documents={S3Uri=s3://Bucket Name/Bucket Path/documents},
                Annotations={S3Uri=s3://Bucket Name/Bucket Path/annotations}" \
     --region region
```

List all entity recognizers in a Region using the `list-entity-recognizers` CLI command.

```
aws comprehend list-entity-recognizers \
     --region region
```

Check the status of a custom entity recognizer using the `describe-entity-recognizer` CLI command.

```
aws comprehend describe-entity-recognizer \
     --entity-recognizer-arn arn:aws:comprehend:region:account number:entity-recognizer/test-6 \
     --region region
```

## Training custom entity recognizers using the AWS SDK for Java
<a name="get-started-api-cer-java"></a>

This example creates a custom entity recognizer and trains the model using Java.

For Amazon Comprehend examples that use Java, see [Amazon Comprehend Java examples](https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/javav2/example_code/comprehend).

## Training custom entity recognizers using Python (Boto3)
<a name="cer-python"></a>

Instantiate Boto3 SDK: 

```
import boto3
import uuid
comprehend = boto3.client("comprehend", region_name="region")
```

Create entity recognizer: 

```
response = comprehend.create_entity_recognizer(
    RecognizerName="Recognizer-Name-Goes-Here-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "EntityTypes": [
            {
                "Type": "ENTITY_TYPE"
            }
        ],
        "Documents": {
            "S3Uri": "s3://Bucket Name/Bucket Path/documents"
        },
        "Annotations": {
            "S3Uri": "s3://Bucket Name/Bucket Path/annotations"
        }
    }
)
recognizer_arn = response["EntityRecognizerArn"]
```

List all recognizers: 

```
response = comprehend.list_entity_recognizers()
```

Wait for recognizer to reach TRAINED status: 

```
import sys
import time

while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )

    status = response["EntityRecognizerProperties"]["Status"]
    if "IN_ERROR" == status:
        sys.exit(1)
    if "TRAINED" == status:
        break

    time.sleep(10)
```

# Custom entity recognizer metrics
<a name="cer-metrics"></a>

Amazon Comprehend provides you with metrics to help you estimate how well an entity recognizer should work for your job. They are based on training the recognizer model, and so while they accurately represent the performance of the model during training, they are only an approximation of the API performance during entity discovery. 

Metrics are returned any time metadata from a trained entity recognizer is returned. 

Amazon Comprehend supports training a model on up to 25 entities at a time. When metrics are returned from a trained entity recognizer, scores are computed against both the recognizer as a whole (global metrics) and for each individual entity (entity metrics).

Three metrics are available, both as global and entity metrics: 
+ **Precision**

  This indicates the fraction of entities produced by the system that are correctly identified and correctly labeled, as a percentage of the total number of identifications the model makes. It shows how often the model's entity identification is truly a good identification. 

  In other words, precision is based on *true positives (tp)* and *false positives (fp)*, and it is calculated as *precision = tp / (tp + fp)*.

  For example, if a model predicts that two examples of an entity are present in a document, where there's actually only one, the result is one true positive and one false positive. In this case, *precision = 1 / (1 + 1)*. The precision is 50%, as one entity is correct out of the two identified by the model. 

  
+  **Recall**

  This indicates the fraction of entities present in the documents that are correctly identified and labeled by the system. Mathematically, this is defined in terms of the total number of correct identifications, *true positives (tp)*, and missed identifications, *false negatives (fn)*. 

  It is calculated as *recall = tp / (tp + fn)*. For example, if a model correctly identifies one entity, but misses two other instances where that entity is present, the result is one true positive and two false negatives. In this case, *recall = 1 / (1 + 2)*. The recall is 33.33%, as one entity is correct out of a possible three examples.

  
+ **F1 score** 

  This is a combination of the Precision and Recall metrics, which measures the overall accuracy of the model for custom entity recognition. The F1 score is the harmonic mean of the Precision and Recall metrics: *F1 = 2 \* Precision \* Recall / (Precision + Recall)*.
**Note**  
Intuitively, the harmonic mean penalizes the extremes more than the simple average or other means (example: `precision` = 0, `recall` = 1 could be achieved trivially by predicting all possible spans. Here, the simple average would be 0.5, but `F1` would penalize it as 0). 

  In the examples above, `precision` = 50% and `recall` = 33.33%, therefore `F1` = 2 \* 0.5 \* 0.3333 / (0.5 + 0.3333). The F1 score is 0.4, or 40%.
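
As a quick arithmetic check, the harmonic mean above can be computed with a few lines of Python (a plain helper function, not part of any AWS SDK):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Numbers from the worked example: precision = 50%, recall = 33.33%
print(round(f1_score(0.5, 1 / 3), 2))  # 0.4
```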

  

**Global and individual entity metrics**

The relationship between global and individual entity metrics can be seen when analyzing the following sentence for entities that are either a *place* or a *person*.

```
John Washington and his friend Smith live in San Francisco, work in San Diego, and own 
    a house in Seattle.
```

In our example, the model makes the following predictions.

```
John Washington = Person
Smith = Place
San Francisco = Place
San Diego = Place
Seattle = Person
```

However, the predictions should have been the following.

```
John Washington = Person
Smith = Person  
San Francisco = Place
San Diego = Place
Seattle = Place
```

The individual entity metrics for this would be:

```
entity:  Person
  True positive (TP) = 1 (because John Washington is correctly predicted to be a 
    Person).
  False positive (FP) = 1 (because Seattle is incorrectly predicted to be a Person, 
    but is actually a Place).
  False negative (FN) = 1 (because Smith is incorrectly predicted to be a Place, but 
    is actually a Person).
  Precision = 1 / (1 + 1) = 0.5 or 50%
  Recall = 1 / (1+1) = 0.5 or 50%
  F1 Score = 2 * 0.5 * 0.5 / (0.5 + 0.5) = 0.5 or 50%
  
entity:  Place
  TP = 2 (because San Francisco and San Diego are each correctly predicted to be a 
    Place).
  FP = 1 (because Smith is incorrectly predicted to be a Place, but is actually a 
    Person).
  FN = 1 (because Seattle is incorrectly predicted to be a Person, but is actually a 
    Place).
  Precision = 2 / (2+1) = 0.6667 or 66.67%
  Recall = 2 / (2+1) = 0.6667 or 66.67%
  F1 Score = 2 * 0.6667 * 0.6667 / (0.6667 + 0.6667) = 0.6667 or  66.67%
```

The global metrics for this would be:

```
Global:
  TP = 3 (because John Washington, San Francisco and San Diego are predicted correctly. 
    This is also the sum of all individual entity TP).
  FP = 2 (because Seattle is predicted as Person and Smith is predicted as Place. This 
    is the sum of all individual entity FP).
  FN = 2 (because Seattle is predicted as Person and Smith is predicted as Place. This 
    is the sum of all individual FN).
  Global Precision = 3 / (3+2) = 0.6 or 60%  
    (Global Precision = Global TP / (Global TP + Global FP))
  Global Recall = 3 / (3+2) = 0.6 or 60% 
    (Global Recall = Global TP / (Global TP + Global FN))
  Global F1Score = 2 * 0.6 * 0.6 / (0.6 + 0.6) = 0.6 or 60% 
    (Global F1Score = 2 * Global Precision *  Global Recall / (Global Precision + 
    Global Recall))
```
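
The aggregation above (global counts are the sums of the per-entity counts) can be checked with a short Python sketch:

```python
# Per-entity counts from the worked example above
counts = {
    "Person": {"tp": 1, "fp": 1, "fn": 1},
    "Place":  {"tp": 2, "fp": 1, "fn": 1},
}

# Global counts are the sums of the per-entity counts
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())

precision = tp / (tp + fp)                          # 3 / 5 = 0.6
recall = tp / (tp + fn)                             # 3 / 5 = 0.6
f1 = 2 * precision * recall / (precision + recall)  # 0.6
```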



## Improving custom entity recognizer performance
<a name="cer-performance"></a>

These metrics provide insight into how accurately the trained model will perform when you use it to identify entities. Here are a few options you can use to improve your metrics if they are lower than your expectations:

1. Depending on whether you use [Annotations](cer-annotation.md) or [Entity lists (plaintext only)](cer-entity-list.md), make sure to follow the guidelines in the respective documentation to improve data quality. If you observe better metrics after improving your data and re-training the model, you can keep iterating and improving data quality to achieve better model performance.

1. If you are using an Entity List, consider using Annotations instead. Manual annotations can often improve your results.

1. If you are sure there is no data quality issue and the metrics are still unreasonably low, submit a support request.

# Running real-time custom recognizer analysis
<a name="running-cer-sync"></a>

Real-time analysis is useful for applications that process small documents as they arrive. For example, you can detect custom entities in social media posts, support tickets, or customer reviews. 

**Before you begin**  
You need a custom entity recognition model (also known as a recognizer) before you can detect custom entities. For more information about these models, see [Training custom entity recognizer models](training-recognizers.md). 

A recognizer that is trained with plain-text annotations supports entity detection for plain-text documents only. A recognizer that is trained with PDF document annotations supports entity detection for plain-text documents, images, PDF files, and Word documents. For information about the input files, see [Inputs for real-time custom analysis](idp-inputs-sync.md).

If you plan to analyze image files or scanned PDF documents, your IAM policy must grant permissions to use two Amazon Textract API methods (DetectDocumentText and AnalyzeDocument). Amazon Comprehend invokes these methods during text extraction. For an example policy, see [Permissions required to perform document analysis actions](security_iam_id-based-policy-examples.md#security-iam-based-policy-perform-cmp-actions).
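
A minimal policy statement granting the two Amazon Textract actions named above might look like the following sketch. The `Resource` scoping and any additional statements (for example, Amazon S3 access) depend on your setup; see the linked example policy for the authoritative version.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    }
  ]
}
```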

**Topics**
+ [Real-time analysis for custom entity recognition (console)](detecting-cer-real-time.md)
+ [Real-time analysis for custom entity recognition (API)](detecting-cer-real-time-api.md)
+ [Outputs for real-time analysis](outputs-cer-sync.md)

# Real-time analysis for custom entity recognition (console)
<a name="detecting-cer-real-time"></a>

You can use the Amazon Comprehend console to run real-time analysis with a custom model. First, you create an endpoint to run the real-time analysis. After you create the endpoint, you run the real-time analysis.

For information about provisioning endpoint throughput, and the associated costs, see [Using Amazon Comprehend endpoints](using-endpoints.md).

**Topics**
+ [Creating an endpoint for custom entity detection](#detecting-cer-real-time-create-endpoint)
+ [Running real-time custom entity detection](#detecting-cer-real-time-run)

## Creating an endpoint for custom entity detection
<a name="detecting-cer-real-time-create-endpoint"></a>

**To create an endpoint (console)**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Endpoints** and choose the **Create endpoint** button. A **Create endpoint** screen opens.

1. Give the endpoint a name. The name must be unique within the current Region and account.

1. Choose a custom model that you want to attach the new endpoint to. From the dropdown, you can search by model name.
**Note**  
You must create a model before you can attach an endpoint to it. If you don't have a model yet, see [Training custom entity recognizer models](training-recognizers.md). 

1. (Optional) To add a tag to the endpoint, enter a key-value pair under **Tags** and choose **Add tag**. To remove this pair before creating the endpoint, choose **Remove tag**.

1. Enter the number of inference units (IUs) to assign to the endpoint. Each unit represents a throughput of 100 characters per second for up to two documents per second. For more information about endpoint throughput, see [Using Amazon Comprehend endpoints](using-endpoints.md).

1. (Optional) If you are creating a new endpoint, you have the option to use the IU estimator. The estimator can help you determine the number of IUs to request. The number of inference units depends on the throughput or the number of characters that you want to analyze per second.

1. From the **Purchase summary**, review your estimated hourly, daily, and monthly endpoint cost. 

1. Select the check box if you understand that your account accrues charges for the endpoint from the time it starts until you delete it.

1. Choose **Create endpoint**.
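The inference-unit arithmetic described in the steps above can be sketched in a few lines of Python. The helper name is illustrative, not part of any SDK; each IU provides 100 characters per second of throughput.

```
import math

def inference_units_needed(chars_per_second):
    """Each inference unit (IU) provides a throughput of
    100 characters per second."""
    return max(1, math.ceil(chars_per_second / 100))

# For example, analyzing 250 characters per second requires 3 IUs.
print(inference_units_needed(250))
```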

## Running real-time custom entity detection
<a name="detecting-cer-real-time-run"></a>

After you create an endpoint for your custom entity recognizer model, you can run real-time analysis to detect entities in individual documents.

Complete the following steps to detect custom entities in your text by using the Amazon Comprehend console.

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Real-time analysis**.

1. In the **Input text** section, for **Analysis type**, choose **Custom**. 

1. For **Select endpoint**, choose the endpoint that is associated with the entity-detection model that you want to use.

1. To specify the input data for analysis, you can input text or upload a file.
   + To enter text:

     1. Choose **Input text**.

     1. Enter the text that you want to analyze. 
   + To upload a file:

     1. Choose **Upload file** and enter the filename to upload.

     1. (Optional) Under **Advanced read actions**, you can override the default actions for text extraction. For details, see [Setting text extraction options](idp-set-textract-options.md).

1. Choose **Analyze**. The console displays the output of the analysis, along with a confidence assessment. 

# Real-time analysis for custom entity recognition (API)
<a name="detecting-cer-real-time-api"></a>

You can use the Amazon Comprehend API to run real-time analysis with a custom model. First, you create an endpoint to run the real-time analysis. After you create the endpoint, you run the real-time analysis.

For information about provisioning endpoint throughput, and the associated costs, see [Using Amazon Comprehend endpoints](using-endpoints.md).

**Topics**
+ [Creating an endpoint for custom entity detection](#detecting-cer-real-time-create-endpoint-api)
+ [Running real-time custom entity detection](#detecting-cer-real-time-run)

## Creating an endpoint for custom entity detection
<a name="detecting-cer-real-time-create-endpoint-api"></a>

For information about the costs associated with endpoints, see [Using Amazon Comprehend endpoints](using-endpoints.md).

### Creating an Endpoint with the AWS CLI
<a name="detecting-cer-real-time-create-endpoint-examples"></a>

To create an endpoint by using the AWS CLI, use the `create-endpoint` command:

```
$ aws comprehend create-endpoint \
> --desired-inference-units number of inference units \
> --endpoint-name endpoint name \
> --model-arn arn:aws:comprehend:region:account-id:model/example \
> --tags Key=Key,Value=Value
```

If your command succeeds, Amazon Comprehend responds with the endpoint ARN:

```
{
   "EndpointArn": "Arn"
}
```

For more information about this command, its parameter arguments, and its output, see [https://docs.aws.amazon.com/cli/latest/reference/comprehend/create-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/comprehend/create-endpoint.html) in the AWS CLI Command Reference.
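If you use the AWS SDK for Python (Boto3) rather than the AWS CLI, the same request can be assembled as a parameter dictionary and passed to the `create_endpoint` method. This is a minimal sketch; the endpoint name and model ARN below are placeholders.

```
def build_create_endpoint_params(endpoint_name, model_arn, inference_units=1):
    """Assemble the request parameters for the Amazon Comprehend
    create_endpoint operation."""
    return {
        "EndpointName": endpoint_name,
        "ModelArn": model_arn,
        "DesiredInferenceUnits": inference_units,
    }

params = build_create_endpoint_params(
    "my-entity-endpoint",  # hypothetical endpoint name
    "arn:aws:comprehend:us-east-1:111122223333:entity-recognizer/example",
)

# With valid AWS credentials, you would then call:
# import boto3
# comprehend = boto3.client("comprehend", region_name="us-east-1")
# response = comprehend.create_endpoint(**params)
# endpoint_arn = response["EndpointArn"]
```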

## Running real-time custom entity detection
<a name="detecting-cer-real-time-run"></a>

After you create an endpoint for your custom entity recognizer model, you use the endpoint to run the [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) API operation. You can provide text input using the `text` or `bytes` parameter. For all other input types, use the `bytes` parameter.

For image files and PDF files, you can use the `DocumentReaderConfig` parameter to override the default text extraction actions. For details, see [Setting text extraction options](idp-set-textract-options.md).

### Detecting entities in text using the AWS CLI
<a name="detecting-cer-real-time-run-cli1"></a>

To detect custom entities in text, run the `detect-entities` command with the input text in the `text` parameter.

**Example : Use the CLI to detect entities in input text**  

```
$ aws comprehend detect-entities \
> --endpoint-arn arn \
> --language-code en \
> --text  "Andy Jassy is the CEO of Amazon."
```
If your command succeeds, Amazon Comprehend responds with the analysis. For each entity that Amazon Comprehend detects, it provides the entity type, text, location, and confidence score.
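The analysis response has the same shape whether you use the CLI or an SDK. The following sketch filters detections by confidence score in Python; the response dictionary and the custom entity type `EXECUTIVE` are illustrative stand-ins, not output from a real model.

```
# A response in the shape that detect-entities returns (values are illustrative).
response = {
    "Entities": [
        {"Score": 0.97, "Type": "EXECUTIVE", "Text": "Andy Jassy",
         "BeginOffset": 0, "EndOffset": 10},
    ]
}

# Keep only confident detections and summarize them.
confident = [
    (e["Type"], e["Text"], round(e["Score"], 2))
    for e in response["Entities"]
    if e["Score"] >= 0.5
]
print(confident)
```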

### Detecting entities in semi-structured documents using the AWS CLI
<a name="detecting-cer-real-time-run-cli2"></a>

To detect custom entities in a PDF file, Word document, or image file, run the `detect-entities` command with the input file in the `bytes` parameter.

**Example : Use the CLI to detect entities in an image file**  
This example shows how to pass in the image file using the `fileb` option to base64 encode the image bytes. For more information, see [Binary large objects](https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-parameters-types.html#parameter-type-blob) in the AWS Command Line Interface User Guide.   
This example also passes in a JSON file named `config.json` to set the text extraction options.  

```
$ aws comprehend detect-entities \
> --endpoint-arn arn \
> --language-code en \
> --bytes fileb://image1.jpg   \
> --document-reader-config file://config.json
```
The **config.json** file contains the following content.  

```
 {
    "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
    "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT"    
 }
```

For more information about the command syntax, see [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) in the *Amazon Comprehend API Reference*.
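In the SDK for Python, the equivalent request supplies the document through the `Bytes` parameter (raw bytes; Boto3 handles the base64 encoding) and the extraction override through `DocumentReaderConfig`. A sketch with placeholder values:

```
def build_detect_entities_image_params(endpoint_arn, image_bytes):
    """Assemble detect_entities parameters for an image input with
    a text-extraction override."""
    return {
        "EndpointArn": endpoint_arn,
        "LanguageCode": "en",
        "Bytes": image_bytes,
        "DocumentReaderConfig": {
            "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
            "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
        },
    }

params = build_detect_entities_image_params("endpoint arn", b"...")

# With a real file and AWS credentials:
# with open("image1.jpg", "rb") as f:
#     params = build_detect_entities_image_params("endpoint arn", f.read())
# import boto3
# response = boto3.client("comprehend").detect_entities(**params)
```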

# Outputs for real-time analysis
<a name="outputs-cer-sync"></a>

## Outputs for text inputs
<a name="outputs-cer-sync-text"></a>

If you input text using the `Text` parameter, the output consists of an array of entities that the analysis detected. The following example shows an analysis that detected two JUDGE entities.

```
{
        "Entities":
        [
            {
                "BeginOffset": 0,
                "EndOffset": 12,
                "Score": 0.9763959646224976,
                "Text": "John Johnson",
                "Type": "JUDGE"
            },
            {
                "BeginOffset": 11,
                "EndOffset": 25,
                "Score": 0.9615424871444702,
                "Text": "Thomas Kincaid",
                "Type": "JUDGE"
            }
        ]
    }
```

## Outputs for semi-structured inputs
<a name="outputs-cer-sync-other"></a>

For a semi-structured input document or a text file, the output can include the following additional fields:
+ DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the `Bytes` parameter.
+ DocumentType – The document type for each page in the input document. This field is present in the response for a request that included the `Bytes` parameter.
+ Blocks – Information about each block of text in the input document. Blocks are nested. A page block contains a block for each line of text, which contains a block for each word. This field is present in the response for a request that included the `Bytes` parameter.
+ BlockReferences – A reference to each block for this entity. This field is present in the response for a request that included the `Bytes` parameter. The field is not present for text files.
+ Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.

For descriptions of these output fields, see [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) in the *Amazon Comprehend API Reference*. For more information about the layout elements, see [Amazon Textract analysis objects](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html) in the Amazon Textract Developer Guide.
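These fields can be tied together in code: index the `Blocks` list by `Id`, then follow each entity's `BlockReferences` to locate the entity on the page. A sketch, using a trimmed, illustrative response in the same shape:

```
# Trimmed, illustrative response in the shape described above.
response = {
    "Entities": [{
        "Text": "September 4,",
        "Type": "DATE-TIME",
        "BlockReferences": [
            {"BlockId": "block-1", "BeginOffset": 0, "EndOffset": 12}
        ],
    }],
    "Blocks": [{
        "Id": "block-1",
        "BlockType": "LINE",
        "Page": 1,
        "Geometry": {"BoundingBox": {"Top": 0.07, "Left": 0.12,
                                     "Height": 0.02, "Width": 0.12}},
    }],
}

# Index blocks by Id, then resolve each entity to the page it appears on.
blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
for entity in response["Entities"]:
    for ref in entity["BlockReferences"]:
        block = blocks_by_id[ref["BlockId"]]
        print(entity["Type"], entity["Text"], "on page", block["Page"])
```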

The following example shows the output for a one-page scanned PDF input document.

```
{
    "Entities": [{
        "Score": 0.9984670877456665,
        "Type": "DATE-TIME",
        "Text": "September 4,",
        "BlockReferences": [{
            "BlockId": "42dcaaee-c484-4b5d-9e3f-ae0be928b3e1",
            "BeginOffset": 0,
            "EndOffset": 12,
            "ChildBlocks": [{
                    "ChildBlockId": "6e9cbb43-f8be-4da0-9a4b-ff9a6c350a14",
                    "BeginOffset": 0,
                    "EndOffset": 9
                },
                {
                    "ChildBlockId": "599e0d53-ae9f-491b-a762-459b22c79ff5",
                    "BeginOffset": 0,
                    "EndOffset": 2
                },
                {
                    "ChildBlockId": "599e0d53-ae9f-491b-a762-459b22c79ff5",
                    "BeginOffset": 0,
                    "EndOffset": 2
                }
            ]
        }]
    }],
    "DocumentMetadata": {
        "Pages": 1,
        "ExtractedCharacters": [{
            "Page": 1,
            "Count": 609
        }]
    },
    "DocumentType": [{
        "Page": 1,
        "Type": "SCANNED_PDF"
    }],
    "Blocks": [{
        "Id": "ee82edf3-28de-4d63-8883-40e2e4938ccb",
        "BlockType": "LINE",
        "Text": "Your Band",
        "Page": 1,
        "Geometry": {
            "BoundingBox": {
                "Height": 0.024125460535287857,
                "Left": 0.11745482683181763,
                "Top": 0.06821706146001816,
                "Width": 0.12074867635965347
            },
            "Polygon": [{
                    "X": 0.11745482683181763,
                    "Y": 0.06821706146001816
                },
                {
                    "X": 0.2382034957408905,
                    "Y": 0.06821706146001816
                },
                {
                    "X": 0.2382034957408905,
                    "Y": 0.09234252572059631
                },
                {
                    "X": 0.11745482683181763,
                    "Y": 0.09234252572059631
                }
            ]
        },
        "Relationships": [{
            "Ids": [
                "b105c561-c8d9-485a-a728-7a5b1a308935",
                "60ecb119-3173-4de2-8c5d-de182a5f86a5"
            ],
            "Type": "CHILD"
        }]
    }]
}
```

The following example shows the output for analysis of a native PDF document.

**Example output from a custom entity recognition analysis of a PDF document**  

```
{
        "Blocks":
        [
            {
                "BlockType": "LINE",
                "Geometry":
                {
                    "BoundingBox":
                    {
                        "Height": 0.012575757575757575,
                        "Left": 0.0,
                        "Top": 0.0015063131313131314,
                        "Width": 0.02262091503267974
                    },
                    "Polygon":
                    [
                        {
                            "X": 0.0,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.014082070707070706
                        },
                        {
                            "X": 0.0,
                            "Y": 0.014082070707070706
                        }
                    ]
                },
                "Id": "4330efed-6334-4fc4-ba48-e050afa95c8d",
                "Page": 1,
                "Relationships":
                [
                    {
                        "Ids":
                        [
                            "f343ce48-583d-4abe-b84b-a232e266450f"
                        ],
                        "Type": "CHILD"
                    }
                ],
                "Text": "S-3"
            },
            {
                "BlockType": "WORD",
                "Geometry":
                {
                    "BoundingBox":
                    {
                        "Height": 0.012575757575757575,
                        "Left": 0.0,
                        "Top": 0.0015063131313131314,
                        "Width": 0.02262091503267974
                    },
                    "Polygon":
                    [
                        {
                            "X": 0.0,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.014082070707070706
                        },
                        {
                            "X": 0.0,
                            "Y": 0.014082070707070706
                        }
                    ]
                },
                "Id": "f343ce48-583d-4abe-b84b-a232e266450f",
                "Page": 1,
                "Relationships":
                [],
                "Text": "S-3"
            }
        ],
        "DocumentMetadata":
        {
            "PageNumber": 1,
            "Pages": 1
        },
        "DocumentType": "NativePDF",
        "Entities":
        [
            {
                "BlockReferences":
                [
                    {
                        "BeginOffset": 25,
                        "BlockId": "4330efed-6334-4fc4-ba48-e050afa95c8d",
                        "ChildBlocks":
                        [
                            {
                                "BeginOffset": 1,
                                "ChildBlockId": "cbba5534-ac69-4bc4-beef-306c659f70a6",
                                "EndOffset": 6
                            }
                        ],
                        "EndOffset": 30
                    }
                ],
                "Score": 0.9998825926329088,
                "Text": "0.001",
                "Type": "OFFERING_PRICE"
            },
            {
                "BlockReferences":
                [
                    {
                        "BeginOffset": 41,
                        "BlockId": "f343ce48-583d-4abe-b84b-a232e266450f",
                        "ChildBlocks":
                        [
                            {
                                "BeginOffset": 0,
                                "ChildBlockId": "292a2e26-21f0-401b-a2bf-03aa4c47f787",
                                "EndOffset": 9
                            }
                        ],
                        "EndOffset": 50
                    }
                ],
                "Score": 0.9809727537330395,
                "Text": "6,097,560",
                "Type": "OFFERED_SHARES"
            }
        ],
        "File": "example.pdf",
        "Version": "2021-04-30"
    }
```

# Running analysis jobs for custom entity recognition
<a name="detecting-cer"></a>

You can run an asynchronous analysis job to detect custom entities in a set of one or more documents.

**Before you begin**  
You need a custom entity recognition model (also known as a recognizer) before you can detect custom entities. For more information about these models, see [Training custom entity recognizer models](training-recognizers.md). 

A recognizer that is trained with plain-text annotations supports entity detection for plain-text documents only. A recognizer that is trained with PDF document annotations supports entity detection for plain-text documents, images, PDF files, and Word documents. For files other than text files, Amazon Comprehend performs text extraction before running the analysis. For information about the input files, see [Inputs for asynchronous custom analysis](idp-inputs-async.md).

If you plan to analyze image files or scanned PDF documents, your IAM policy must grant permissions to use two Amazon Textract API methods (DetectDocumentText and AnalyzeDocument). Amazon Comprehend invokes these methods during text extraction. For an example policy, see [Permissions required to perform document analysis actions](security_iam_id-based-policy-examples.md#security-iam-based-policy-perform-cmp-actions).
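A minimal IAM policy statement granting those two Amazon Textract actions might look like the following sketch; your role also needs the Amazon Comprehend and Amazon S3 permissions shown in the linked example policy.

```
{
    "Effect": "Allow",
    "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
    ],
    "Resource": "*"
}
```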

To run an async analysis job, you perform the following overall steps:

1. Store the documents in an Amazon S3 bucket.

1. Use the API or console to start the analysis job.

1. Monitor the progress of the analysis job.

1. After the job runs to completion, retrieve the results of the analysis from the S3 bucket that you specified when you started the job.
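The monitoring step above can be sketched with the SDK for Python. The Comprehend client is passed in so the polling logic stays plain Python; the job ID is a placeholder that comes from the start request.

```
import time

def wait_for_entities_job(comprehend, job_id, poll_seconds=60):
    """Poll describe_entities_detection_job until the job leaves the
    SUBMITTED/IN_PROGRESS states, then return the final status."""
    while True:
        response = comprehend.describe_entities_detection_job(JobId=job_id)
        status = response["EntitiesDetectionJobProperties"]["JobStatus"]
        if status not in ("SUBMITTED", "IN_PROGRESS"):
            return status
        time.sleep(poll_seconds)

# With AWS credentials:
# import boto3
# comprehend = boto3.client("comprehend")
# final_status = wait_for_entities_job(comprehend, "job id from the start request")
```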

**Topics**
+ [Starting a custom entity detection job (console)](detecting-cer-async-console.md)
+ [Starting a custom entity detection job (API)](detecting-cer-async-api.md)
+ [Outputs for asynchronous analysis jobs](outputs-cer-async.md)

# Starting a custom entity detection job (console)
<a name="detecting-cer-async-console"></a>

You can use the console to start and monitor an async analysis job for custom entity recognition.

**To start an async analysis job**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Analysis jobs** and then choose **Create job**.

1. Give the analysis job a name. The name must be unique within your account and the current Region.

1. Under **Analysis type**, choose **Custom entity recognition**.

1. From **Recognizer model**, choose the custom entity recognizer to use.

1. From **Version**, choose the recognizer version to use.

1. (Optional) If you choose to encrypt the data that Amazon Comprehend uses while processing your job, choose **Job encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key ID under **KMS key ARN**.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [Key management service (KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Input data**, enter the location of the Amazon S3 bucket that contains your input documents or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the analysis job must have reading permissions for the S3 bucket.

1. (Optional) For **Input format**, you can choose the format of the input documents. The format can be one document per file, or one document per line in a single file. One document per line applies only to text documents. 

1. (Optional) For **Document read mode**, you can override the default text extraction actions. For more information, see [Setting text extraction options](idp-set-textract-options.md). 

1. Under **Output data**, enter the location of the Amazon S3 bucket where Amazon Comprehend should write the job's output data or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the analysis job must have write permissions for the S3 bucket.

1. (Optional) If you choose to encrypt the output results from your job, choose **Encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key alias or ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key alias or ID under **KMS key ID**.

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your analysis job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC that accesses the output bucket.

1. Choose **Create job** to create the entity recognition job.

# Starting a custom entity detection job (API)
<a name="detecting-cer-async-api"></a>

You can use the API to start and monitor an async analysis job for custom entity recognition.

To start a custom entity detection job with the [StartEntitiesDetectionJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartEntitiesDetectionJob.html) operation, you provide the EntityRecognizerArn, which is the Amazon Resource Name (ARN) of the trained model. You can find this ARN in the response to the [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) operation. 

**Topics**
+ [Detecting custom entities using the AWS Command Line Interface](#detecting-cer-async-api-cli)
+ [Detecting custom entities using the AWS SDK for Java](#detecting-cer-async-api-java)
+ [Detecting custom entities using the AWS SDK for Python (Boto3)](#detecting-cer-async-api-python)
+ [Overriding API actions for PDF files](#detecting-cer-api-pdf)

## Detecting custom entities using the AWS Command Line Interface
<a name="detecting-cer-async-api-cli"></a>

Use the following example for Unix, Linux, and macOS environments. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^). To detect custom entities in a document set, use the following request syntax:

```
aws comprehend start-entities-detection-job \
     --entity-recognizer-arn "arn:aws:comprehend:region:account number:entity-recognizer/test-6" \
     --job-name infer-1 \
     --data-access-role-arn "arn:aws:iam::account number:role/service-role/AmazonComprehendServiceRole-role" \
     --language-code en \
     --input-data-config "S3Uri=s3://Bucket Name/Bucket Path" \
     --output-data-config "S3Uri=s3://Bucket Name/Bucket Path/" \
     --region region
```

If your request succeeds, Amazon Comprehend responds with the `JobId` and `JobStatus`, and writes the job output to the S3 bucket that you specified in the request.

## Detecting custom entities using the AWS SDK for Java
<a name="detecting-cer-async-api-java"></a>

For Amazon Comprehend examples that use Java, see [Amazon Comprehend Java examples](https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/javav2/example_code/comprehend).

## Detecting custom entities using the AWS SDK for Python (Boto3)
<a name="detecting-cer-async-api-python"></a>

This example creates a custom entity recognizer, trains the model, and then uses the model in an entities detection job, using the AWS SDK for Python (Boto3).

Instantiate the SDK for Python. The `sys` and `time` modules are used later to monitor the training status.

```
import boto3
import sys
import time
import uuid
comprehend = boto3.client("comprehend", region_name="region")
```

Create an entity recognizer: 

```
response = comprehend.create_entity_recognizer(
    RecognizerName="Recognizer-Name-Goes-Here-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "EntityTypes": [
            {
                "Type": "ENTITY_TYPE"
            }
        ],
        "Documents": {
            "S3Uri": "s3://Bucket Name/Bucket Path/documents"
        },
        "Annotations": {
            "S3Uri": "s3://Bucket Name/Bucket Path/annotations"
        }
    }
)
recognizer_arn = response["EntityRecognizerArn"]
```

List all recognizers: 

```
response = comprehend.list_entity_recognizers()
```

Wait for the entity recognizer to reach TRAINED status: 

```
while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )

    status = response["EntityRecognizerProperties"]["Status"]
    if "IN_ERROR" == status:
        sys.exit(1)
    if "TRAINED" == status:
        break

    time.sleep(10)
```

Start a custom entities detection job: 

```
response = comprehend.start_entities_detection_job(
    EntityRecognizerArn=recognizer_arn,
    JobName="Detection-Job-Name-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "InputFormat": "ONE_DOC_PER_LINE",
        "S3Uri": "s3://Bucket Name/Bucket Path/documents"
    },
    OutputDataConfig={
        "S3Uri": "s3://Bucket Name/Bucket Path/output"
    }
)
```

## Overriding API actions for PDF files
<a name="detecting-cer-api-pdf"></a>

For image files and PDF files, you can override the default extraction actions using the `DocumentReaderConfig` parameter in `InputDataConfig`.

The following example defines a JSON file named `myInputDataConfig.json` to set the `InputDataConfig` values. It sets `DocumentReaderConfig` to use the Amazon Textract `DetectDocumentText` API for all PDF files.

**Example**  

```
"InputDataConfig": {
  "S3Uri": "s3://Bucket Name/Bucket Path",
  "InputFormat": "ONE_DOC_PER_FILE",
  "DocumentReaderConfig": {
      "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
      "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION"
  }
}
```

In the `StartEntitiesDetectionJob` operation, specify the `myInputDataConfig.json` file as the `InputDataConfig` parameter:

```
  --input-data-config file://myInputDataConfig.json  
```

For more information about the `DocumentReaderConfig` parameters, see [Setting text extraction options](idp-set-textract-options.md).

# Outputs for asynchronous analysis jobs
<a name="outputs-cer-async"></a>

After an analysis job completes, it stores the results in the S3 bucket that you specified in the request.

## Outputs for text inputs
<a name="outputs-cer-async-text"></a>

For text input files, the output consists of a list of entities for each input document.

The following example shows the output for two documents from an input file named `50_docs`, using the one-document-per-line format.

```
{
        "File": "50_docs",
        "Line": 0,
        "Entities":
        [
            {
                "BeginOffset": 0,
                "EndOffset": 12,
                "Score": 0.9763959646224976,
                "Text": "John Johnson",
                "Type": "JUDGE"
            }
        ]
    }
    {
        "File": "50_docs",
        "Line": 1,
        "Entities":
        [
            {
                "BeginOffset": 11,
                "EndOffset": 25,
                "Score": 0.9615424871444702,
                "Text": "Thomas Kincaid",
                "Type": "JUDGE"
            }
        ]
    }
```
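In the output file itself, each document's result is a single JSON object on its own line (the example above is pretty-printed for readability). A sketch of parsing such a file after you download it from the S3 output location:

```
import json

def parse_entities_output(lines):
    """Parse async entity-detection output: one JSON object per line,
    each describing the entities found in one input document."""
    results = []
    for line in lines:
        line = line.strip()
        if line:
            results.append(json.loads(line))
    return results

# Illustrative single-line record in the shape shown above.
sample = [
    '{"File": "50_docs", "Line": 0, "Entities": '
    '[{"Type": "JUDGE", "Text": "John Johnson", "Score": 0.97, '
    '"BeginOffset": 0, "EndOffset": 12}]}',
]
for doc in parse_entities_output(sample):
    for entity in doc["Entities"]:
        print(doc["File"], doc["Line"], entity["Type"], entity["Text"])
```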

## Outputs for semi-structured inputs
<a name="outputs-cer-async-other"></a>

For semi-structured input documents, the output can include the following additional fields:
+ DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the `Bytes` parameter.
+ DocumentType – The document type for each page in the input document. This field is present in the response for a request that included the `Bytes` parameter.
+ Blocks – Information about each block of text in the input document. Blocks can nest within a block. A page block contains a block for each line of text, which contains a block for each word. This field is present in the response for a request that included the `Bytes` parameter.
+ BlockReferences – A reference to each block for this entity. This field is present in the response for a request that included the `Bytes` parameter. The field isn't present for text files.
+ Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.

For more details about these output fields, see [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) in the *Amazon Comprehend API Reference*.

The following example shows the output for a one-page native PDF input document.

**Example output from a custom entity recognition analysis of a PDF document**  

```
{
        "Blocks":
        [
            {
                "BlockType": "LINE",
                "Geometry":
                {
                    "BoundingBox":
                    {
                        "Height": 0.012575757575757575,
                        "Left": 0.0,
                        "Top": 0.0015063131313131314,
                        "Width": 0.02262091503267974
                    },
                    "Polygon":
                    [
                        {
                            "X": 0.0,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.014082070707070706
                        },
                        {
                            "X": 0.0,
                            "Y": 0.014082070707070706
                        }
                    ]
                },
                "Id": "4330efed-6334-4fc4-ba48-e050afa95c8d",
                "Page": 1,
                "Relationships":
                [
                    {
                        "Ids":
                        [
                            "f343ce48-583d-4abe-b84b-a232e266450f"
                        ],
                        "Type": "CHILD"
                    }
                ],
                "Text": "S-3"
            },
            {
                "BlockType": "WORD",
                "Geometry":
                {
                    "BoundingBox":
                    {
                        "Height": 0.012575757575757575,
                        "Left": 0.0,
                        "Top": 0.0015063131313131314,
                        "Width": 0.02262091503267974
                    },
                    "Polygon":
                    [
                        {
                            "X": 0.0,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.014082070707070706
                        },
                        {
                            "X": 0.0,
                            "Y": 0.014082070707070706
                        }
                    ]
                },
                "Id": "f343ce48-583d-4abe-b84b-a232e266450f",
                "Page": 1,
                "Relationships":
                [],
                "Text": "S-3"
            }
        ],
        "DocumentMetadata":
        {
            "PageNumber": 1,
            "Pages": 1
        },
        "DocumentType": "NativePDF",
        "Entities":
        [
            {
                "BlockReferences":
                [
                    {
                        "BeginOffset": 25,
                        "BlockId": "4330efed-6334-4fc4-ba48-e050afa95c8d",
                        "ChildBlocks":
                        [
                            {
                                "BeginOffset": 1,
                                "ChildBlockId": "cbba5534-ac69-4bc4-beef-306c659f70a6",
                                "EndOffset": 6
                            }
                        ],
                        "EndOffset": 30
                    }
                ],
                "Score": 0.9998825926329088,
                "Text": "0.001",
                "Type": "OFFERING_PRICE"
            },
            {
                "BlockReferences":
                [
                    {
                        "BeginOffset": 41,
                        "BlockId": "f343ce48-583d-4abe-b84b-a232e266450f",
                        "ChildBlocks":
                        [
                            {
                                "BeginOffset": 0,
                                "ChildBlockId": "292a2e26-21f0-401b-a2bf-03aa4c47f787",
                                "EndOffset": 9
                            }
                        ],
                        "EndOffset": 50
                    }
                ],
                "Score": 0.9809727537330395,
                "Text": "6,097,560",
                "Type": "OFFERED_SHARES"
            }
        ],
        "File": "example.pdf",
        "Version": "2021-04-30"
    }
```
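Each entity in the output links back to the layout information through its `BlockReferences`: the `BlockId` identifies a block in the `Blocks` array, whose `Geometry.BoundingBox` gives the entity's position on the page. The sketch below shows one way to join the two sections; the `output` dictionary is a trimmed stand-in for the JSON above, and the helper name `entity_locations` is illustrative, not part of any Amazon Comprehend SDK.

```python
# Sketch: pair each detected entity with the bounding box of the block
# it references, using the field names from the output example above.
# "output" is a minimal stand-in for the full JSON document.

output = {
    "Blocks": [
        {
            "BlockType": "WORD",
            "Id": "f343ce48-583d-4abe-b84b-a232e266450f",
            "Geometry": {
                "BoundingBox": {
                    "Height": 0.012575757575757575,
                    "Left": 0.0,
                    "Top": 0.0015063131313131314,
                    "Width": 0.02262091503267974,
                }
            },
            "Text": "S-3",
        }
    ],
    "Entities": [
        {
            "BlockReferences": [
                {
                    "BlockId": "f343ce48-583d-4abe-b84b-a232e266450f",
                    "BeginOffset": 41,
                    "EndOffset": 50,
                }
            ],
            "Score": 0.9809727537330395,
            "Text": "6,097,560",
            "Type": "OFFERED_SHARES",
        }
    ],
}

# Index blocks by Id so each BlockReference can be resolved in O(1).
blocks_by_id = {block["Id"]: block for block in output["Blocks"]}


def entity_locations(doc):
    """Yield (entity type, entity text, bounding box) for each entity."""
    for entity in doc["Entities"]:
        for ref in entity["BlockReferences"]:
            box = blocks_by_id[ref["BlockId"]]["Geometry"]["BoundingBox"]
            yield entity["Type"], entity["Text"], box


for etype, text, box in entity_locations(output):
    print(f"{etype}: {text!r} at top={box['Top']:.4f}, left={box['Left']:.4f}")
```

For a real output file you would load the JSON with `json.load` and apply the same join; multi-word entities carry one `BlockReference` per block they span, so a single entity can yield several bounding boxes.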