

# Preparing entity recognizer training data
<a name="prep-training-data-cer"></a>

To train a successful custom entity recognition model, it's important to supply the model trainer with high-quality data as input. Without good data, the model won't learn how to correctly identify entities.

You can choose one of two ways to provide data to Amazon Comprehend in order to train a custom entity recognition model:
+ **Entity list** – Lists the specific entities so Amazon Comprehend can train to identify your custom entities. Note: Entity lists can only be used for plaintext documents. 
+ **Annotations** – Provides the location of your entities in a number of documents so Amazon Comprehend can train on both the entity and its context. To create a model for analyzing image files, PDFs, or Word documents, you must train your recognizer using PDF annotations. 

In both cases, Amazon Comprehend learns about the kind of documents and the context where the entities occur and builds a recognizer that can generalize to detect the new entities when you analyze documents.

When you create a custom model (or train a new version), you can provide a test dataset. If you do not provide test data, Amazon Comprehend reserves 10% of the input documents to test the model. Amazon Comprehend trains the model with the remaining documents.

If you provide a test dataset for your annotations training set, the test data must include at least one annotation for each of the entity types specified in the creation request. 

**Topics**
+ [When to use annotations vs entity lists](#prep-training-data-comp)
+ [Entity lists (plaintext only)](cer-entity-list.md)
+ [Annotations](cer-annotation.md)

## When to use annotations vs entity lists
<a name="prep-training-data-comp"></a>

Creating annotations takes more work than creating an entity list, but the resulting model can be significantly more accurate. Using an entity list is quicker and less labor-intensive, but the results are less refined and less accurate. The difference arises because annotations provide context for Amazon Comprehend to use when training the model. Without that context, Amazon Comprehend produces a higher number of false positives when trying to identify the entities.

There are scenarios where it makes more business sense to avoid the higher expense and workload of annotations. For example, perhaps the name John Johnson is significant to your search, but it isn't relevant whether a match refers to one exact individual. Or perhaps the metrics you get with an entity list are good enough to provide the recognizer results that you need. In such cases, using an entity list can be the more effective choice.

We recommend using the annotations mode in the following cases:
+ If you plan to run inferences for image files, PDFs, or Word documents. In this scenario, you train a model using annotated PDF files and use the model to run inference jobs for image files, PDFs, and Word documents. 
+ When the meaning of the entities could be ambiguous and context-dependent. For example, the term *Amazon* could either refer to the river in Brazil, or the online retailer Amazon.com. When you build a custom entity recognizer to identify business entities such as *Amazon*, you should use annotations instead of an entity list because this method is better able to use context to find entities.
+ When you are comfortable setting up a process to acquire annotations, which can require some effort.

We recommend using an entity list in the following cases:
+ When you already have a list of entities, or when it is relatively easy to compose a comprehensive list of entities. If you use an entity list, the list should be complete, or at least cover the majority of valid entities that might appear in the documents you provide for training. 
+ For first-time users, we generally recommend using an entity list, because it requires less effort than constructing annotations. Note, however, that the trained model might not be as accurate as it would be if you used annotations.

# Entity lists (plaintext only)
<a name="cer-entity-list"></a>

To train a model using an entity list, you provide two pieces of information: a list of the entity names with their corresponding custom entity types and a collection of unannotated documents in which you expect your entities to appear. 

When you provide an entity list, Amazon Comprehend uses an intelligent algorithm to detect occurrences of the listed entities in the documents, and these occurrences serve as the basis for training the custom entity recognizer model.

For entity lists, provide at least 25 entity matches per entity type in the entity list.

You provide the entity list as a comma-separated values (CSV) file with the following columns:
+ **Text** – The text of an entity example, exactly as it appears in the accompanying document corpus.
+ **Type** – The customer-defined entity type. Entity types must be an uppercase, underscore-separated string, such as MANAGER or SENIOR_MANAGER. Up to 25 entity types can be trained per model. 

The file `documents.txt` contains four lines:

```
Jo Brown is an engineer in the high tech industry.
John Doe has been an engineer for 14 years.
Emilio Johnson is a judge on the Washington Supreme Court.
Our latest new employee, Jane Smith, has been a manager in the industry for 4 years.
```

The CSV file with the list of entities has the following lines: 

```
Text, Type
Jo Brown, ENGINEER
John Doe, ENGINEER
Jane Smith, MANAGER
```

**Note**  
In the entity list, there is no entry for Emilio Johnson because that name is not an example of the ENGINEER or MANAGER entity types. 

**Creating your data files**

It is important that your entity list be a properly configured CSV file, to minimize the chance of problems with your entity list file. To configure the CSV file manually, the following must be true:
+ UTF-8 encoding must be explicitly specified, even if it's used as a default in most cases.
+ The file must include the column names: `Text` and `Type`.

We highly recommend that you generate CSV input files programmatically, to avoid potential issues.

The following example uses Python to generate a CSV file for the entity list shown above:

```
import csv 
with open("./entitylist/entitylist.csv", "w", encoding="utf-8", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Text", "Type"])
    csv_writer.writerow(["Jo Brown", "ENGINEER"])
    csv_writer.writerow(["John Doe", "ENGINEER"])
    csv_writer.writerow(["Jane Smith", "MANAGER"])
```

## Best practices
<a name="entitylist-bestresults"></a>

There are a number of things to consider to get the best result when using an entity list, including:
+ The order of the entities in your list has no effect on model training.
+ Use entity list items that cover 80%-100% of positive entity examples mentioned in the unannotated corpus of documents.
+ Avoid entity examples that match non-entities in the document corpus by removing common words and phrases. Even a handful of incorrect matches can significantly affect the accuracy of your resulting model. For example, a word like *the* in the entity list will result in a high number of matches which are unlikely to be the entities you are looking for and thus will significantly affect your accuracy. 
+ Input data should not contain duplicates. The presence of duplicate samples might result in test set contamination, and can therefore negatively affect the training process, model metrics, and model behavior.
+ Provide documents that resemble real use cases as closely as possible. Don't use toy data or synthesized data for production systems. The input data should be as diverse as possible, to avoid overfitting and to help the underlying model generalize better on real examples.
+ The entity list is case sensitive, and regular expressions are not currently supported. However, the trained model can often still recognize entities even if they do not match exactly to the casing provided in the entity list.
+ If you have an entity that is a substring of another entity (such as “Smith” and “Jane Smith”), provide both in the entity list.
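Several of these checks are easy to automate before you submit a training job. The following sketch flags duplicate rows and entity list entries that never match the corpus verbatim. The file paths and the `audit_entity_list` helper are illustrative, not part of the Amazon Comprehend API:

```
import csv
from collections import Counter

def audit_entity_list(entity_csv_path, documents_path):
    """Report duplicate rows in the entity list, plus entries that never
    appear verbatim in the document corpus (and so can't be matched)."""
    with open(entity_csv_path, encoding="utf-8") as f:
        rows = [(r["Text"], r["Type"]) for r in csv.DictReader(f, skipinitialspace=True)]
    with open(documents_path, encoding="utf-8") as f:
        corpus = f.read()

    duplicates = [row for row, n in Counter(rows).items() if n > 1]
    unmatched = [text for text, _ in rows if text not in corpus]
    return {"duplicates": duplicates, "unmatched": unmatched}
```

Because the entity list is case sensitive, a case-insensitive variant of the `unmatched` check can also help you spot casing mismatches between your list and your corpus.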

For additional suggestions, see [Improving custom entity recognizer performance](cer-metrics.md#cer-performance).

# Annotations
<a name="cer-annotation"></a>

Annotations label entities in context by associating your custom entity types with the locations where they occur in your training documents.

By submitting annotations along with your documents, you can increase the accuracy of the model. With annotations, you're not simply providing the location of the entity you're looking for; you're also providing more accurate context for the custom entity you're seeking.

For instance, suppose you're searching for the name John Johnson with the entity type JUDGE. Providing your annotations might help the model learn that the person you want to find is a judge. If it can use this context, Amazon Comprehend won't find people named John Johnson who are attorneys or witnesses. Without your annotations, Amazon Comprehend creates its own version of an annotation, but it is less effective at including only judges. Providing your own annotations can help you achieve better results and generate models that are better able to leverage context when extracting custom entities.

**Topics**
+ [Minimum number of annotations](#prep-training-data-ann)
+ [Annotation best practices](#cer-annotation-best-practices)
+ [Plain-text annotation files](cer-annotation-csv.md)
+ [PDF annotation files](cer-annotation-manifest.md)
+ [Annotating PDF files](cer-annotation-pdf.md)

## Minimum number of annotations
<a name="prep-training-data-ann"></a>

The minimum number of input documents and annotations required to train a model depends on the type of annotations. 

**PDF annotations**  
To create a model for analyzing image files, PDFs, or Word documents, train your recognizer using PDF annotations. For PDF annotations, provide at least 250 input documents and at least 100 annotations per entity.  
If you provide a test dataset, the test data must include at least one annotation for each of the entity types specified in the creation request. 

**Plain-text annotations**  
To create a model for analyzing text documents, you can train your recognizer using plain-text annotations.   
For plain-text annotations, provide at least three annotated input documents and at least 25 annotations per entity. If you provide fewer than 50 annotations in total, Amazon Comprehend reserves more than 10% of the input documents to test the model (unless you provided a test dataset in the training request). Note that the minimum document corpus size is 5 KB.  
If your input contains only a few training documents, you may encounter an error that the training input data contains too few documents that mention one of the entities. Submit the job again with additional documents that mention the entity.  
If you provide a test dataset, the test data must include at least one annotation for each of the entity types specified in the creation request.  
For an example of how to benchmark a model with a small dataset, see [Amazon Comprehend announces lower annotation limits for custom entity recognition](https://aws.amazon.com/blogs/machine-learning/amazon-comprehend-announces-lower-annotation-limits-for-custom-entity-recognition/) on the AWS blog site.
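You can verify these per-entity minimums before you submit a training job. The following sketch counts plain-text annotations per entity type in an annotations CSV; the file path and the `annotations_below_minimum` helper are illustrative, not part of any AWS SDK:

```
import csv
from collections import Counter

MIN_PLAINTEXT_ANNOTATIONS = 25  # per-entity minimum from this section

def annotations_below_minimum(annotations_csv_path):
    """Count plain-text annotations per entity type and return the types
    that fall below the documented 25-per-entity minimum."""
    with open(annotations_csv_path, encoding="utf-8") as f:
        counts = Counter(row["Type"] for row in csv.DictReader(f, skipinitialspace=True))
    return {etype: n for etype, n in counts.items() if n < MIN_PLAINTEXT_ANNOTATIONS}
```

For PDF annotations, the same idea applies with a minimum of 100 annotations per entity.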

## Annotation best practices
<a name="cer-annotation-best-practices"></a>

There are a number of things to consider to get the best result when using annotations, including: 
+ Annotate your data with care and verify that you annotate every mention of the entity. Imprecise annotations can lead to poor results.
+ Input data should not contain duplicates, like a duplicate of a PDF you are going to annotate. Presence of a duplicate sample might result in test set contamination and could negatively affect the training process, model metrics, and model behavior.
+ Make sure that all of your documents are annotated, and that any documents without annotations lack them because they contain no legitimate entities, not because of negligence. For example, if one of your documents says "J Doe has been an engineer for 14 years", provide an annotation for "J Doe" as well as for "John Doe". Failing to do so confuses the model and can result in the model not recognizing "J Doe" as an ENGINEER. Annotations should be consistent within the same document and across documents.
+ In general, more annotations lead to better results.
+ You can train a model with the [minimum number](guidelines-and-limits.md#limits-custom-entity-recognition) of documents and annotations, but adding data usually improves the model. We recommend increasing the volume of annotated data by 10% to increase the accuracy of the model. You can run inference on a test dataset that remains unchanged across model versions, and then compare the metrics for successive model versions.
+ Provide documents that resemble real use cases as closely as possible. Synthesized data with repetitive patterns should be avoided. The input data should be as diverse as possible to avoid overfitting and help the underlying model better generalize on real examples.
+ It is important that documents be diverse in terms of word count. For example, if all the documents in the training data are short, the resulting model may have difficulty predicting entities in longer documents.
+ Give your training data the same distribution that you expect to see at inference time, when you're actually detecting your custom entities. For example, if at inference time you expect to send documents that contain no entities, documents without entities should also be part of your training document set.
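The every-mention rule is especially easy to violate by accident. As a minimal sketch (the `find_unannotated_mentions` helper and its tuple format are illustrative, not part of any AWS tooling), you can scan for occurrences of annotated entity text that were left unannotated:

```
def find_unannotated_mentions(lines, annotations):
    """Given a document as a list of lines and annotations as
    (line, begin, end, type) tuples, flag other occurrences of any
    annotated entity text that have no matching annotation."""
    annotated_spans = {(ln, b, e) for ln, b, e, _ in annotations}
    surface_forms = {lines[ln][b:e] for ln, b, e, _ in annotations}
    missing = []
    for ln, line in enumerate(lines):
        for form in sorted(surface_forms):
            start = line.find(form)
            while start != -1:
                if (ln, start, start + len(form)) not in annotated_spans:
                    missing.append((ln, start, form))
                start = line.find(form, start + 1)
    return missing
```

Any hit this sketch reports is either a mention you should annotate or a coincidental string match you can ignore after review.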

For additional suggestions, see [Improving custom entity recognizer performance](https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html#cer-performance).

# Plain-text annotation files
<a name="cer-annotation-csv"></a>

For plain-text annotations, you create a comma-separated value (CSV) file that contains a list of annotations. The CSV file must contain the following columns if your training file input format is **one document per line**.


+ **File** – The name of the file containing the document. For example, if one of the document files is located at `s3://my-S3-bucket/test-files/documents.txt`, the value in the `File` column is `documents.txt`. You must include the file extension (in this case, `.txt`) as part of the file name.
+ **Line** – The line number containing the entity. Omit this column if your input format is one document per file.
+ **Begin Offset** – The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0.
+ **End Offset** – The character offset in the input text that shows where the entity ends.
+ **Type** – The customer-defined entity type. Entity types must be an uppercase, underscore-separated string. We recommend using descriptive entity types such as `MANAGER`, `SENIOR_MANAGER`, or `PRODUCT_CODE`. Up to 25 entity types can be trained per model.

If your training file input format is **one document per file**, omit the **Line** column, and the **Begin Offset** and **End Offset** values are the offsets of the entity from the start of the document.

The following example is for one document per line. The file `documents.txt` contains four lines (rows 0, 1, 2, and 3):

```
Diego Ramirez is an engineer in the high tech industry.
Emilio Johnson has been an engineer for 14 years.
J Doe is a judge on the Washington Supreme Court.
Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.
```

The CSV file with the list of annotations is as follows: 

```
File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 13, ENGINEER
documents.txt, 1, 0, 14, ENGINEER
documents.txt, 3, 25, 38, MANAGER
```

**Note**  
In the annotations file, line numbering starts at 0, so the first line of the document is line 0. In this example, the CSV file contains no entry for line 2 because there is no entity in line 2 of `documents.txt`.
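Off-by-one offsets are a common source of bad training data. The following sketch extracts the text that each annotation row actually selects, so you can verify every span before training. The file paths and the `check_offsets` helper are illustrative, not part of the Amazon Comprehend API:

```
import csv

def check_offsets(documents_path, annotations_csv_path):
    """Return the (Type, text) pair selected by each Begin/End offset so
    you can verify that every annotation covers the intended entity."""
    with open(documents_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    pairs = []
    with open(annotations_csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, skipinitialspace=True):
            line = lines[int(row["Line"])]
            pairs.append((row["Type"], line[int(row["Begin Offset"]):int(row["End Offset"])]))
    return pairs
```

Running this against the example above should yield `("ENGINEER", "Diego Ramirez")`, `("ENGINEER", "Emilio Johnson")`, and `("MANAGER", "Mateo Jackson")`; anything else indicates an offset error.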

**Creating your data files**

It's important to put your annotations in a properly configured CSV file to reduce the risk of errors. To manually configure your CSV file, the following must be true:
+ UTF-8 encoding must be explicitly specified, even if it's used as a default in most cases.
+ The first line contains the column headers: `File`, `Line` (optional), `Begin Offset`, `End Offset`, `Type`.

We highly recommend that you generate the CSV input files programmatically, to avoid potential issues.

The following example uses Python to generate a CSV for the annotations shown earlier:

```
import csv 
with open("./annotations/annotations.csv", "w", encoding="utf-8", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    csv_writer.writerow(["documents.txt", 0, 0, 13, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 1, 0, 14, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 3, 25, 38, "MANAGER"])
```

# PDF annotation files
<a name="cer-annotation-manifest"></a>

For PDF annotations, you use SageMaker AI Ground Truth to create a labeled dataset in an augmented manifest file. Ground Truth is a data labeling service that helps you (or a workforce that you employ) to build training datasets for machine learning models. Amazon Comprehend accepts augmented manifest files as training data for custom models. You can provide these files when you create a custom entity recognizer by using the Amazon Comprehend console or the [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) API action. 

You can use the Ground Truth built-in task type, Named Entity Recognition, to create a labeling job to have workers identify entities in text. To learn more, see [Named Entity Recognition](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-named-entity-recg.html#sms-creating-ner-console) in the *Amazon SageMaker AI Developer Guide*. To learn more about Amazon SageMaker Ground Truth, see [Use Amazon SageMaker AI Ground Truth to Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html).

**Note**  
Using Ground Truth, you can define overlapping labels (text that you associate with more than one label). However, Amazon Comprehend entity recognition does not support overlapping labels.

Augmented manifest files are in JSON lines format. In these files, each line is a complete JSON object that contains a training document and its associated labels. The following example is an augmented manifest file that trains an entity recognizer to detect the professions of individuals who are mentioned in the text:

```
{"source":"Diego Ramirez is an engineer in the high tech industry.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":13,"startOffset":0,"label":"ENGINEER"}],"labels":[{"label":"ENGINEER"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.92}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.175903","human-annotated":"yes"}}
{"source":"J Doe is a judge on the Washington Supreme Court.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":5,"startOffset":0,"label":"JUDGE"}],"labels":[{"label":"JUDGE"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.72}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.174910","human-annotated":"yes"}}
{"source":"Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":38,"startOffset":26,"label":"MANAGER"}],"labels":[{"label":"MANAGER"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.91}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.174035","human-annotated":"yes"}}
```
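To inspect what a manifest line actually labels, you can parse it with a few lines of Python. This is a sketch; `parse_manifest_line` is not part of any AWS SDK, and the default `label_attribute` only matches the example job above:

```
import json

def parse_manifest_line(line, label_attribute="NamedEntityRecognitionDemo"):
    """Extract the source text and its labeled spans from one line of an
    augmented manifest. `label_attribute` must match the label attribute
    name chosen when the Ground Truth labeling job was created."""
    record = json.loads(line)
    text = record["source"]
    spans = [(e["label"], text[e["startOffset"]:e["endOffset"]])
             for e in record[label_attribute]["annotations"]["entities"]]
    return text, spans
```

For the first line of the example manifest, this returns the source sentence and the single span `("ENGINEER", "Diego Ramirez")`.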

Each line in this JSON lines file is a complete JSON object, where the attributes include the document text, the annotations, and other metadata from Ground Truth. The following example is a single JSON object in the augmented manifest file, but it's formatted for readability: 

```
{
  "source": "Diego Ramirez is an engineer in the high tech industry.",
  "NamedEntityRecognitionDemo": {
    "annotations": {
      "entities": [
        {
          "endOffset": 13,
          "startOffset": 0,
          "label": "ENGINEER"
        }
      ],
      "labels": [
        {
          "label": "ENGINEER"
        }
      ]
    }
  },
  "NamedEntityRecognitionDemo-metadata": {
    "entities": [
      {
        "confidence": 0.92
      }
    ],
    "job-name": "labeling-job/namedentityrecognitiondemo",
    "type": "groundtruth/text-span",
    "creation-date": "2020-05-14T21:45:27.175903",
    "human-annotated": "yes"
  }
}
```

In this example, the `source` attribute provides the text of the training document, and the `NamedEntityRecognitionDemo` attribute provides the annotations for the entities in the text. The name of the `NamedEntityRecognitionDemo` attribute is arbitrary, and you provide a name of your choice when you define the labeling job in Ground Truth.

In this example, the `NamedEntityRecognitionDemo` attribute is the *label attribute name*, which is the attribute that provides the labels that a Ground Truth worker assigns to the training data. When you provide your training data to Amazon Comprehend, you must specify one or more label attribute names. The number of attribute names that you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.

If your file is the output of a single labeling job, specify the single label attribute name that was used when the job was created in Ground Truth. 

If your file is the output of a chained labeling job, specify the label attribute name for one or more jobs in the chain. Each label attribute name provides the annotations from an individual job. You can specify up to 5 of these attributes for augmented manifest files that are produced by chained labeling jobs. 

In an augmented manifest file, the label attribute name typically follows the `source` key. If the file is the output of a chained job, there will be multiple label attribute names. When you provide your training data to Amazon Comprehend, provide only those attributes that contain annotations that are relevant for your model. Do not specify the attributes that end with "-metadata".
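As a quick sanity check (a sketch, not part of any AWS SDK), you can list the candidate label attribute names in a manifest line by excluding `source` and the "-metadata" keys:

```
import json

def label_attribute_names(manifest_line):
    """List the candidate label attribute names in one manifest line:
    every top-level key except "source" and the "-metadata" keys,
    which must not be passed to Amazon Comprehend."""
    record = json.loads(manifest_line)
    return [key for key in record
            if key != "source" and not key.endswith("-metadata")]
```

For a chained job, review the resulting names and pass only those whose annotations are relevant for your model.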

For more information about chained labeling jobs, and for examples of the output that they produce, see [Chaining Labeling Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-reusing-data.html) in the *Amazon SageMaker AI Developer Guide*.

# Annotating PDF files
<a name="cer-annotation-pdf"></a>

Before you can annotate your training PDFs in SageMaker AI Ground Truth, complete the following prerequisites:
+ Install Python 3.8.x
+ Install [jq](https://stedolan.github.io/jq/download/)
+ Install the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html)

  If you're using the us-east-1 Region, you can skip installing the AWS CLI because it's already installed with your Python environment. In this case, you create a virtual environment to use Python 3.8 in AWS Cloud9.
+ Configure your [AWS credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
+ Create a private [SageMaker AI Ground Truth workforce](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private-use-cognito.html) to support annotation

  Make sure to record the workteam name you choose in your new private workforce, as you use it during installation.

**Topics**
+ [Setting up your environment](#cer-annotation-pdf-set-up)
+ [Uploading a PDF to an S3 bucket](#cer-annotation-pdf-upload)
+ [Creating an annotation job](#cer-annotation-pdf-job)
+ [Annotating with SageMaker AI Ground Truth](#w2aac35c23c21c19c15)

## Setting up your environment
<a name="cer-annotation-pdf-set-up"></a>

1. If using Windows, install [Cygwin](https://cygwin.com/install.html); if using Linux or Mac, skip this step.

1. Download the [annotation artifacts](http://github.com/aws-samples/amazon-comprehend-semi-structured-documents-annotation-tools) from GitHub. Unzip the file.

1. From your terminal window, navigate to the unzipped folder (**amazon-comprehend-semi-structured-documents-annotation-tools-main**). 

1. This folder includes a choice of `Makefiles` that you can run to install dependencies, set up a Python virtualenv, and deploy the required resources. Review the **readme** file to make your choice.

1. The recommended option uses a single command to install all dependencies into a virtualenv, build the CloudFormation stack from the template, and deploy the stack to your AWS account with interactive guidance. Run the following command:

   `make ready-and-deploy-guided`

   This command presents a set of configuration options. Be sure your AWS Region is correct. For all other fields, you can either accept the default values or fill in custom values. If you modify the CloudFormation stack name, write it down as you need it in the next steps.  
![\[Terminal session showing CloudFormation configuration options.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/deploy_guided_anno.png)

   The CloudFormation stack creates and manages the [AWS Lambda](https://aws.amazon.com/lambda/) functions, [AWS IAM](https://aws.amazon.com/iam/) roles, and [Amazon S3](https://aws.amazon.com/s3/) buckets required for the annotation tool.

   You can review each of these resources in the stack details page in the CloudFormation console.

1. The command prompts you to start the deployment. CloudFormation creates all the resources in the specified Region.  
![\[Terminal session showing the deployed CloudFormation configuration.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/deploy_guided_anno_2.png)

   When the CloudFormation stack status transitions to CREATE_COMPLETE, the resources are ready to use.

## Uploading a PDF to an S3 bucket
<a name="cer-annotation-pdf-upload"></a>

In the [Setting up](#cer-annotation-pdf-set-up) section, you deployed a CloudFormation stack that creates an S3 bucket named **comprehend-semi-structured-documents-${AWS::Region}-${AWS::AccountId}**. You now upload your source PDF documents into this bucket.

**Note**  
This bucket contains the data required for your labeling job. The Lambda Execution Role policy grants permission for the Lambda function to access this bucket.  
You can find the S3 bucket name in the **CloudFormation Stack details** using the '**SemiStructuredDocumentsS3Bucket**' key.

1. Create a new folder in the S3 bucket. Name this new folder '**src**'. 

1. Add your PDF source files to your '**src**' folder. In a later step, you annotate these files to train your recognizer.

1. (Optional) Here's an AWS CLI example you can use to upload your source documents from a local directory into an S3 bucket:

   `aws s3 cp --recursive local-path-to-your-source-docs s3://deploy-guided/src/`

   Or, with your Region and Account ID:

   `aws s3 cp --recursive local-path-to-your-source-docs s3://deploy-guided-Region-AccountID/src/`

1. You now have a private SageMaker AI Ground Truth workforce and have uploaded your source files to the S3 bucket, **deploy-guided/src/**; you're ready to start annotating.

## Creating an annotation job
<a name="cer-annotation-pdf-job"></a>

The **comprehend-ssie-annotation-tool-cli.py** script in the `bin` directory is a simple wrapper command that streamlines the creation of a SageMaker AI Ground Truth labeling job. The Python script reads the source documents from your S3 bucket and creates a corresponding single-page manifest file with one source document per line. The script then creates a labeling job, which requires the manifest file as an input. 

The Python script uses the S3 bucket and CloudFormation stack that you configured in the [Setting up](#cer-annotation-pdf-set-up) section. Required input parameters for the script include:
+ **input-s3-path**: The S3 URI of the source documents you uploaded to your S3 bucket. For example: `s3://deploy-guided/src/`. You can also add your Region and account ID to this path. For example: `s3://deploy-guided-Region-AccountID/src/`.
+ **cfn-name**: The CloudFormation stack name. If you used the default value for the stack name, your cfn-name is **sam-app**.
+ **work-team-name**: The workforce name you created when you built out the private workforce in SageMaker AI Ground Truth.
+ **job-name-prefix**: The prefix for the SageMaker AI Ground Truth labeling job. Note that there is a 29-character limit for this field. A timestamp is appended to this value. For example: `my-job-name-20210902T232116`.
+ **entity-types**: The entities you want to use during your labeling job, separated by commas. This list must include all entities that you want to annotate in your training dataset. The Ground Truth labeling job displays only these entities for annotators to label content in the PDF documents. 

To view additional arguments the script supports, use the `-h` option to display the help content.
+ Run the following script with the input parameters as described in the previous list.

  ```
  python bin/comprehend-ssie-annotation-tool-cli.py \
  --input-s3-path s3://deploy-guided-Region-AccountID/src/ \
  --cfn-name sam-app \
  --work-team-name my-work-team-name \
  --region us-east-1 \
  --job-name-prefix my-job-name-20210902T232116 \
  --entity-types "EntityA, EntityB, EntityC" \
  --annotator-metadata "key=info,value=sample,key=Due Date,value=12/12/2021"
  ```

  The script produces the following output:

  ```
  Downloaded files to temp local directory /tmp/a1dc0c47-0f8c-42eb-9033-74a988ccc5aa
  Deleted downloaded temp files from /tmp/a1dc0c47-0f8c-42eb-9033-74a988ccc5aa
  Uploaded input manifest file to s3://comprehend-semi-structured-documents-us-west-2-123456789012/input-manifest/my-job-name-20220203-labeling-job-20220203T183118.manifest
  Uploaded schema file to s3://comprehend-semi-structured-documents-us-west-2-123456789012/comprehend-semi-structured-docs-ui-template/my-job-name-20220203-labeling-job-20220203T183118/ui-template/schema.json
  Uploaded template UI to s3://comprehend-semi-structured-documents-us-west-2-123456789012/comprehend-semi-structured-docs-ui-template/my-job-name-20220203-labeling-job-20220203T183118/ui-template/template-2021-04-15.liquid
  Sagemaker GroundTruth Labeling Job submitted: arn:aws:sagemaker:us-west-2:123456789012:labeling-job/my-job-name-20220203-labeling-job-20220203t183118
  (amazon-comprehend-semi-structured-documents-annotation-tools-main) user@3c063014d632 amazon-comprehend-semi-structured-documents-annotation-tools-main %
  ```

## Annotating with SageMaker AI Ground Truth
<a name="w2aac35c23c21c19c15"></a>

Now that you have configured the required resources and created a labeling job, you can log in to the labeling portal and annotate your PDFs.

1. Log in to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker) using either Chrome or Firefox web browsers.

1. Select **Labeling workforces** and choose **Private**.

1. Under **Private workforce summary**, select the labeling portal sign-in URL that you created with your private workforce. Sign in with the appropriate credentials.

   If you don't see any jobs listed, don't worry—it can take a while to update, depending on the number of files you uploaded for annotation.

1. Select your task and, in the top right corner, choose **Start working** to open the annotation screen.

   You'll see one of your documents open in the annotation screen and, above it, the entity types you provided during set up. To the right of your entity types, there is an arrow you can use to navigate through your documents.  
![\[The Amazon Comprehend annotation screen.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/annotation_demo1.png)

   Annotate the open document. You can also remove, undo, or auto tag your annotations on each document; these options are available in the right panel of the annotation tool.  
![\[Available options in the Amazon Comprehend annotation right panel.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/data_annotation.png)

   To use auto tag, annotate an instance of one of your entities; all other instances of that specific word are then automatically annotated with that entity type.

   Once you've finished, select **Submit** on the bottom right, then use the navigation arrows to move to the next document. Repeat this until you've annotated all your PDFs.

After you annotate all the training documents, you can find the annotations in JSON format in the Amazon S3 bucket at this location:

```
/output/your labeling job name/annotations/
```

The output folder also contains an output manifest file, which lists all the annotations within your training documents. You can find your output manifest file at the following location:

```
/output/your labeling job name/manifests/
```
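Before you train on the labeling job's output, you can tally annotations per entity type to check the per-entity minimums. The sketch below is illustrative: the `count_labels` helper is not part of any AWS SDK, and it assumes manifest lines that store annotations inline in the text-span structure shown in the PDF annotation files section:

```
import json
from collections import Counter

def count_labels(manifest_lines, label_attribute):
    """Tally annotations per entity type across manifest lines, assuming
    the inline text-span structure; useful for checking per-entity
    minimums before training."""
    counts = Counter()
    for line in manifest_lines:
        entities = (json.loads(line)
                    .get(label_attribute, {})
                    .get("annotations", {})
                    .get("entities", []))
        for entity in entities:
            counts[entity["label"]] += 1
    return counts
```

If any type falls short of the documented minimum, create additional labeling work before training the recognizer.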