

# Custom classification
<a name="how-document-classification"></a>

Use *custom classification* to organize your documents into categories (classes) that you define. Custom classification is a two-step process. First, you train a custom classification model (also called a classifier) to recognize the classes that are of interest to you. Then you use your model to classify any number of document sets.

For example, you can categorize the content of support requests so that you can route the request to the proper support team. Or you can categorize emails received from customers to provide guidance based on the type of customer request. You can combine Amazon Comprehend with Amazon Transcribe to convert speech to text and then classify the requests coming from support phone calls.

You can run custom classification on a single document synchronously (in real time) or start an asynchronous job to classify a set of documents. You can have multiple custom classifiers in your account, each trained using different data. Custom classification supports a variety of input document types, such as plain text, PDF, Word, and images.

When you submit a classification job, you choose the classifier model to use, based on the type of documents that you need to analyze. For example, to analyze plain-text documents, you achieve the most accurate results by using a model that you trained with plain-text documents. To analyze semi-structured documents (such as PDF, Word, images, Amazon Textract output, or scanned files), you achieve the most accurate results by using a model that you trained with native documents.

**Topics**
+ [Preparing classifier training data](prep-classifier-data.md)
+ [Training classification models](training-classifier-model.md)
+ [Running real-time analysis](running-class-sync.md)
+ [Running asynchronous jobs](running-classifiers.md)

# Preparing classifier training data
<a name="prep-classifier-data"></a>

For custom classification, you train the model in either multi-class mode or multi-label mode. Multi-class mode associates a single class with each document. Multi-label mode associates one or more classes with each document. The input file formats are different for each mode, so choose the mode to use before you create the training data. 

**Note**  
The Amazon Comprehend console refers to multi-class mode as single-label mode.

Custom classification supports models that you train with plain-text documents and models that you train with native documents (such as PDF, Word, or images). For more information about classifier models and their supported document types, see [Training classification models](training-classifier-model.md).

To prepare data to train a custom classifier model: 

1. Identify the classes that you want this classifier to analyze. Decide which mode to use (multi-class or multi-label).

1. Decide on the classifier model type, based on whether the model is for analyzing plain-text documents or semi-structured documents. 

1. Gather examples of documents for each of the classes. For minimum training requirements, see [General quotas for document classification](guidelines-and-limits.md#limits-class-general).

1. For a plain-text model, choose the training file format to use (CSV file or augmented manifest file). To train a native document model, you always use a CSV file. 

**Topics**
+ [Classifier training file formats](prep-class-data-format.md)
+ [Multi-class mode](prep-classifier-data-multi-class.md)
+ [Multi-label mode](prep-classifier-data-multi-label.md)

# Classifier training file formats
<a name="prep-class-data-format"></a>

For a plain-text model, you can provide classifier training data as a CSV file or as an augmented manifest file that you create using SageMaker AI Ground Truth. The CSV file or augmented manifest file includes the text for each training document, and its associated labels.

For a native document model, you provide classifier training data as a CSV file. The CSV file includes the file name for each training document, and its associated labels. You include the training documents in the Amazon S3 input folder for the training job.

## CSV files
<a name="prep-data-csv"></a>

You provide labeled training data as UTF-8 encoded text in a CSV file. Don't include a header row. Adding a header row in your file may cause runtime errors.

For each row in the CSV file, the first column contains one or more class labels. A class label can be any valid UTF-8 string. We recommend using clear class names that don't overlap in meaning. The name can include white space, and can consist of multiple words connected by underscores or hyphens.

Do not leave any space characters before or after the commas that separate the values in a row. 

The exact content of the CSV file depends on the classifier mode and the type of training data. For details, see the sections on [Multi-class mode](prep-classifier-data-multi-class.md) and [Multi-label mode](prep-classifier-data-multi-label.md).
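For example, Python's built-in `csv` module produces rows in this shape: it writes no header row unless you add one, and it quotes any document text that contains commas. The following is a minimal sketch; the labels and text are illustrative, not part of any Amazon Comprehend API:

```python
import csv
import io

# Two labeled training examples (multi-class: one label per document).
rows = [
    ("SPAM", "Paulo, your $1000 award is waiting for you!"),
    ("NOT_SPAM", "The quarterly report is attached."),
]

# csv.writer emits no header row and quotes text that contains commas,
# so the label column stays cleanly separated from the document text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for label, text in rows:
    writer.writerow([label, text])

csv_content = buffer.getvalue()
print(csv_content)
```

Writing the file this way also avoids stray spaces around the separating commas.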

## Augmented manifest file
<a name="prep-data-annotations"></a>

An augmented manifest file is a labeled dataset that you create using SageMaker AI Ground Truth. Ground Truth is a data labeling service that helps you—or a workforce that you employ—to build training datasets for machine learning models. 

For more information about Ground Truth and the output that it produces, see [Use SageMaker AI Ground Truth to Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) in the *Amazon SageMaker AI Developer Guide*.

Augmented manifest files are in JSON lines format. In these files, each line is a complete JSON object that contains a training document and its associated labels. The exact content of each line depends on the classifier mode. For details, see the sections on [Multi-class mode](prep-classifier-data-multi-class.md) and [Multi-label mode](prep-classifier-data-multi-label.md).

When you provide your training data to Amazon Comprehend, you specify one or more label attribute names. How many attribute names you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.

If your file is the output of a single labeling job, specify the single label attribute name from the Ground Truth job. 

If your file is the output of a chained labeling job, specify the label attribute name for one or more jobs in the chain. Each label attribute name provides the annotations from an individual job. You can specify up to 5 of these attributes for augmented manifest files from chained labeling jobs. 

For more information about chained labeling jobs, and for examples of the output that they produce, see [Chaining Labeling Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-reusing-data.html) in the Amazon SageMaker AI Developer Guide.
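As an illustration of how a label attribute name relates to the content of each JSON line, the following sketch parses one manifest line and finds the attributes that have a matching `-metadata` key. The line content is adapted from the examples later in this section:

```python
import json

# One line of an augmented manifest file (content adapted from the
# multi-class examples in this chapter).
line = ('{"source": "Document 1 text", "MultiClassJob": 0, '
        '"MultiClassJob-metadata": {"class-name": "not_spam", '
        '"human-annotated": "yes", "type": "groundtruth/text-classification"}}')

record = json.loads(line)

# A label attribute is any top-level key that has a sibling
# "<name>-metadata" key holding the Ground Truth job metadata.
label_attributes = [key for key in record
                    if not key.endswith("-metadata")
                    and f"{key}-metadata" in record]
print(label_attributes)
```

For a chained labeling job, this list would contain one entry per job in the chain, and each entry is a label attribute name that you can pass to Amazon Comprehend.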

# Multi-class mode
<a name="prep-classifier-data-multi-class"></a>

In multi-class mode, classification assigns one class for each document. The individual classes are mutually exclusive. For example, you can classify a movie as comedy or science fiction, but not both. 

**Note**  
The Amazon Comprehend console refers to multi-class mode as single-label mode.

**Topics**
+ [Plain-text models](#prep-multi-class-plaintext)
+ [Native document models](#prep-multi-class-structured)

## Plain-text models
<a name="prep-multi-class-plaintext"></a>

To train a plain-text model, you can provide labeled training data as a CSV file or as an augmented manifest file from SageMaker AI Ground Truth.

### CSV file
<a name="prep-multi-class-plaintext-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a two-column CSV file. For each row, the first column contains the class label value. The second column contains an example text document for that class. Each row must end with `\n` or `\r\n` characters.

The following example shows a CSV file containing three documents.

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS,Text of document 3
```

The following example shows one row of a CSV file that trains a custom classifier to detect whether an email message is spam:

```
SPAM,"Paulo, your $1000 award is waiting for you! Claim it while you still can at http://example.com."
```

### Augmented manifest file
<a name="prep-multi-class-plaintext-manifest"></a>

For general information about using augmented manifest files for training classifiers, see [Augmented manifest file](prep-class-data-format.md#prep-data-annotations).

For plain-text documents, each line of the augmented manifest file is a complete JSON object that contains a training document, a single class name, and other metadata from Ground Truth. The following example is an augmented manifest file for training a custom classifier to recognize spam email messages:

```
{"source":"Document 1 text", "MultiClassJob":0, "MultiClassJob-metadata":{"confidence":0.62, "job-name":"labeling-job/multiclassjob", "class-name":"not_spam", "human-annotated":"yes", "creation-date":"2020-05-21T17:36:45.814354", "type":"groundtruth/text-classification"}}
{"source":"Document 2 text", "MultiClassJob":1, "MultiClassJob-metadata":{"confidence":0.81, "job-name":"labeling-job/multiclassjob", "class-name":"spam", "human-annotated":"yes", "creation-date":"2020-05-21T17:37:51.970530", "type":"groundtruth/text-classification"}}
{"source":"Document 3 text", "MultiClassJob":1, "MultiClassJob-metadata":{"confidence":0.81, "job-name":"labeling-job/multiclassjob", "class-name":"spam", "human-annotated":"yes", "creation-date":"2020-05-21T17:37:51.970566", "type":"groundtruth/text-classification"}}
```

 The following example shows one JSON object from the augmented manifest file, formatted for readability: 

```
{
   "source": "Paulo, your $1000 award is waiting for you! Claim it while you still can at http://example.com.",
   "MultiClassJob": 1,
   "MultiClassJob-metadata": {
       "confidence": 0.98,
       "job-name": "labeling-job/multiclassjob",
       "class-name": "spam",
       "human-annotated": "yes",
       "creation-date": "2020-05-21T17:36:45.814354",
       "type": "groundtruth/text-classification"
   }
}
```

In this example, the `source` attribute provides the text of the training document, and the `MultiClassJob` attribute assigns the index of a class from a classification list. The `job-name` attribute is the name that you defined for the labeling job in Ground Truth. 

 When you start the classifier training job in Amazon Comprehend, you specify the same labeling job name. 

## Native document models
<a name="prep-multi-class-structured"></a>

A native document model is a model that you train with native documents (such as PDF, DOCX, and images). You provide the training data as a CSV file.

### CSV file
<a name="prep-multi-class-structured-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a three-column CSV file. For each row, the first column contains the class label value. The second column contains the file name of an example document for this class. The third column contains the page number. The page number is optional if the example document is an image.

The following example shows a CSV file that references three input documents. 

```
CLASS,input-doc-1.pdf,3
CLASS,input-doc-2.docx,1
CLASS,input-doc-3.png
```

The following example shows one row of a CSV file that trains a custom classifier to detect whether an email message is spam. Page 2 of the PDF file contains the spam example. 

```
SPAM,email-content-3.pdf,2
```
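The row rules above can be sketched as a small helper that builds one row and enforces the page-number requirement for multi-page formats. The helper name and file names are hypothetical, not part of the Amazon Comprehend API:

```python
def training_row(label, filename, page=None):
    """Build one CSV row for a native document model.

    A page number is required for multi-page formats such as PDF and
    Word documents, and is optional when the example is an image.
    """
    multipage = filename.lower().endswith((".pdf", ".docx"))
    if multipage and page is None:
        raise ValueError(f"{filename}: page number required for PDF/Word files")
    fields = [label, filename] if page is None else [label, filename, str(page)]
    return ",".join(fields)

print(training_row("SPAM", "email-content-3.pdf", 2))  # SPAM,email-content-3.pdf,2
print(training_row("SPAM", "email-content-4.png"))     # SPAM,email-content-4.png
```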

# Multi-label mode
<a name="prep-classifier-data-multi-label"></a>

In multi-label mode, individual classes represent different categories that aren't mutually exclusive. Multi-label classification assigns one or more classes to each document. For example, you can classify one movie as Documentary, and another movie as Science fiction, Action, and Comedy. 

For training, multi-label mode supports up to 1 million examples containing up to 100 unique classes.

**Topics**
+ [Plain-text models](#prep-multi-label-plaintext)
+ [Native document models](#prep-multi-label-structured)

## Plain-text models
<a name="prep-multi-label-plaintext"></a>

To train a plain-text model, you can provide labeled training data as a CSV file or as an augmented manifest file from SageMaker AI Ground Truth.

### CSV file
<a name="prep-multi-label-plaintext-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a two-column CSV file. For each row, the first column contains the class label values, and the second column contains an example text document for these classes. To enter more than one class in the first column, use a delimiter (such as a pipe `|`) between each class.

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS|CLASS|CLASS,Text of document 3
```

The following example shows one row of a CSV file that trains a custom classifier to detect genres in movie abstracts:

```
COMEDY|MYSTERY|SCIENCE_FICTION|TEEN,"A band of misfit teens become unlikely detectives when they discover troubling clues about their high school English teacher. Could the strange Mrs. Doe be an alien from outer space?"
```

The default delimiter between class names is a pipe (`|`). However, you can use a different character as a delimiter. The delimiter must be distinct from all characters in your class names. For example, if your classes are CLASS\_1, CLASS\_2, and CLASS\_3, the underscore (**\_**) is part of the class name. So don't use an underscore as the delimiter for separating class names.
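One way to catch a bad delimiter choice before training is a quick check that the delimiter appears in none of the class names. This validation helper is a hypothetical sketch, not part of Amazon Comprehend:

```python
def valid_delimiter(delimiter, class_names):
    """Return True if the delimiter appears in none of the class names."""
    return all(delimiter not in name for name in class_names)

classes = ["CLASS_1", "CLASS_2", "CLASS_3"]
print(valid_delimiter("|", classes))   # True: pipe is safe here
print(valid_delimiter("_", classes))   # False: underscore is part of each name
```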

### Augmented manifest file
<a name="prep-multi-label-plaintext-manifest"></a>

For general information about using augmented manifest files for training classifiers, see [Augmented manifest file](prep-class-data-format.md#prep-data-annotations).

For plain-text documents, each line of the augmented manifest file is a complete JSON object. It contains a training document, class names, and other metadata from Ground Truth. The following example is an augmented manifest file for training a custom classifier to detect genres in movie abstracts:

```
{"source":"Document 1 text", "MultiLabelJob":[0,4], "MultiLabelJob-metadata":{"job-name":"labeling-job/multilabeljob", "class-map":{"0":"action", "4":"drama"}, "human-annotated":"yes", "creation-date":"2020-05-21T19:02:21.521882", "confidence-map":{"0":0.66}, "type":"groundtruth/text-classification-multilabel"}}
{"source":"Document 2 text", "MultiLabelJob":[3,6], "MultiLabelJob-metadata":{"job-name":"labeling-job/multilabeljob", "class-map":{"3":"comedy", "6":"horror"}, "human-annotated":"yes", "creation-date":"2020-05-21T19:00:01.291202", "confidence-map":{"3":0.61,"6":0.61}, "type":"groundtruth/text-classification-multilabel"}}
{"source":"Document 3 text", "MultiLabelJob":[1], "MultiLabelJob-metadata":{"job-name":"labeling-job/multilabeljob", "class-map":{"1":"action"}, "human-annotated":"yes", "creation-date":"2020-05-21T18:58:51.662050", "confidence-map":{"1":0.68}, "type":"groundtruth/text-classification-multilabel"}}
```

 The following example shows one JSON object from the augmented manifest file, formatted for readability: 

```
{
      "source": "A band of misfit teens become unlikely detectives when they discover troubling clues about their high school English teacher. Could the strange Mrs. Doe be an alien from outer space?",
      "MultiLabelJob": [
          3,
          8,
          10,
          11
      ],
      "MultiLabelJob-metadata": {
          "job-name": "labeling-job/multilabeljob",
          "class-map": {
              "3": "comedy",
              "8": "mystery",
              "10": "science_fiction",
              "11": "teen"
          },
          "human-annotated": "yes",
          "creation-date": "2020-05-21T19:00:01.291202",
          "confidence-map": {
              "3": 0.95,
              "8": 0.77,
              "10": 0.83,
              "11": 0.92
          },
          "type": "groundtruth/text-classification-multilabel"
      }
  }
```

In this example, the `source` attribute provides the text of the training document, and the `MultiLabelJob` attribute assigns the indexes of several classes from a classification list. The `job-name` attribute in the `MultiLabelJob-metadata` object is the name that you defined for the labeling job in Ground Truth. 

## Native document models
<a name="prep-multi-label-structured"></a>

A native document model is a model that you train with native documents (such as PDF, DOCX, and image files). You provide labeled training data as a CSV file.

### CSV file
<a name="prep-multi-label-structured-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a three-column CSV file. For each row, the first column contains the class label values. The second column contains the file name of an example document for these classes. The third column contains the page number. The page number is optional if the example document is an image.

To enter more than one class in the first column, use a delimiter (such as a pipe `|`) between each class.

```
CLASS,input-doc-1.pdf,3
CLASS,input-doc-2.docx,1
CLASS|CLASS|CLASS,input-doc-3.png,2
```

The following example shows one row of a CSV file that trains a custom classifier to detect genres in movie abstracts. Page 2 of the PDF file contains the example of a comedy/teen movie.

```
COMEDY|TEEN,movie-summary-1.pdf,2
```

The default delimiter between class names is a pipe (`|`). However, you can use a different character as a delimiter. The delimiter must be distinct from all characters in your class names. For example, if your classes are CLASS\_1, CLASS\_2, and CLASS\_3, the underscore (**\_**) is part of the class name. So don't use an underscore as the delimiter for separating class names.

# Training classification models
<a name="training-classifier-model"></a>

To train a model for custom classification, you define the categories and provide example documents to train the custom model. You train the model in either multi-class or multi-label mode. Multi-class mode associates a single class with each document. Multi-label mode associates one or more classes with each document.

Custom classification supports two types of classifier models: plain-text models and native document models. A plain-text model classifies documents based on their text content alone. A native document model classifies documents based on text content plus additional signals, such as the layout of the document. You train a native document model with native documents so that the model learns the layout information. 

Plain-text models have the following characteristics: 
+ You train the model using UTF-8 encoded text documents. 
+ You can train the model using documents in one of the following languages: English, Spanish, German, Italian, French, or Portuguese. 
+ The training documents for a given classifier must all use the same language. 
+ Training documents are plain text, so there are no additional charges for text extraction. 

Native document models have the following characteristics: 
+ You train the model using semi-structured documents, which include the following document types:
  + Digital and scanned PDF documents.
  + Word documents (DOCX).
  + Images: JPG files, PNG files, and single-page TIFF files.
  + Textract API output JSON files.
+ You train the model using English documents. 
+ If your training documents include scanned document files, you incur additional charges for text extraction. See the [Amazon Comprehend Pricing](https://aws.amazon.com/comprehend/pricing) page for details. 

You can classify any of the supported document types using either type of model. However, for the most accurate results, we recommend using a plain-text model to classify plain-text documents and a native document model to classify semi-structured documents.

**Topics**
+ [Train custom classifiers (console)](create-custom-classifier-console.md)
+ [Train custom classifiers (API)](train-custom-classifier-api.md)
+ [Test the training data](testing-the-model.md)
+ [Classifier training output](train-classifier-output.md)
+ [Custom classifier metrics](cer-doc-class.md)

# Train custom classifiers (console)
<a name="create-custom-classifier-console"></a>

You can create and train a custom classifier using the console, and then use the custom classifier to analyze your documents.

To train a custom classifier, you need a set of training documents. You label these documents with the categories that you want the document classifier to recognize. For information about preparing your training documents, see [Preparing classifier training data](prep-classifier-data.md).



**To create and train a document classifier model**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Customization** and then choose **Custom Classification**.

1. Choose **Create new model**.

1. Under **Model settings**, enter a model name for the classifier. The name must be unique within your account and current Region.

   (Optional) Enter a version name. The name must be unique within your account and current Region.

1. Select the language of the training documents. To see the languages that classifiers support, see [Training classification models](training-classifier-model.md). 

1. (Optional) If you want to encrypt the data in the storage volume while Amazon Comprehend processes your training job, choose **Classifier encryption**. Then choose whether to use a KMS key associated with your current account, or one from another account.
   + If you are using a key associated with the current account, choose the key ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key ID under **KMS key ARN**.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Data specifications**, choose the **Training model type** to use.
   + **Plain text documents:** Choose this option to create a plain text model. Train the model using plain text documents.
   + **Native documents:** Choose this option to create a native document model. Train the model using native documents (PDF, Word, images). 

1. Choose the **Data format** of your training data. For information about the data formats, see [Classifier training file formats](prep-class-data-format.md).
   + **CSV file:** Choose this option if your training data uses the CSV file format.
   + **Augmented manifest:** Choose this option if you used Ground Truth to create augmented manifest files for your training data. This format is available if you chose **Plain text documents** as the training model type.

1. Choose the **Classifier mode** to use.
   + **Single-label mode:** Choose this mode if the categories you're assigning to documents are mutually exclusive and you're training your classifier to assign one label to each document. In the Amazon Comprehend API, single-label mode is known as multi-class mode.
   + **Multi-label mode:** Choose this mode if multiple categories can be applied to a document at the same time, and you are training your classifier to assign one or more labels to each document. 

1. If you choose **Multi-label mode**, you can select the **Delimiter for labels**. Use this delimiter character to separate labels when there are multiple classes for a training document. The default delimiter is the pipe character.

1. (Optional) If you chose **Augmented manifest** as the data format, you can input up to five augmented manifest files. Each augmented manifest file contains either a training dataset or a test dataset. You must provide at least one training dataset. Test datasets are optional. Use the following steps to configure the augmented manifest files:

   1. Under **Training and test dataset**, expand the **Input location** panel.

   1. In **Dataset type**, choose **Training data** or **Test data**.

   1. For the **SageMaker AI Ground Truth augmented manifest file S3 location**, enter the location of the Amazon S3 bucket that contains the manifest file or navigate to it by choosing **Browse S3**. The IAM role that you're using for access permissions for the training job must have read permissions for the S3 bucket. 

   1. For the **Attribute names**, enter the name of the attribute that contains your annotations. If the file contains annotations from multiple chained labeling jobs, add an attribute for each job.

   1. To add another input location, choose **Add input location** and then configure the next location.

1. (Optional) If you chose **CSV file** as the data format, use the following steps to configure the training dataset and optional test dataset:

   1. Under **Training dataset**, enter the location of the Amazon S3 bucket that contains your training data CSV file or navigate to it by choosing **Browse S3**. The IAM role that you're using for access permissions for the training job must have read permissions for the S3 bucket. 

      (Optional) If you chose **Native documents** as the training model type, you also provide the URL of the Amazon S3 folder that contains the training example files.

   1. Under **Test dataset**, select whether you are providing extra data for Amazon Comprehend to test the trained model.
      + **Autosplit**: Autosplit automatically selects 10% of your training data to reserve for use as testing data.
      + (Optional) **Customer provided**: Enter the URL of the test data CSV file in Amazon S3. You can also navigate to its location in Amazon S3 and choose **Select folder**.

        (Optional) If you chose **Native documents** as the training model type, you also provide the URL of the Amazon S3 folder that contains the test files.

1. (Optional) For **Document read mode**, you can override the default text extraction actions. This option isn't required for plain-text models, as it applies to text extraction for scanned documents. For more information, see [Setting text extraction options](idp-set-textract-options.md). 

1. (Optional for plain-text models) For **Output data**, enter the location of an Amazon S3 bucket to save training output data, such as the confusion matrix. For more information, see [Confusion matrix](train-classifier-output.md#conf-matrix).

   (Optional) If you choose to encrypt the output result from your training job, choose **Encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key alias for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key alias or ID under **KMS key ID**.

1. For **IAM role**, choose **Choose an existing IAM role**, and then choose an existing IAM role that has read permissions for the S3 bucket that contains your training documents. To be valid, the role must have a trust policy that specifies the `comprehend.amazonaws.com` service principal.

   If you don't already have an IAM role with these permissions, choose **Create an IAM role** to make one. Choose the access permissions to grant this role, and then choose a name suffix to distinguish the role from IAM roles in your account.
**Note**  
For encrypted input documents, the IAM role used must also have `kms:Decrypt` permission. For more information, see [Permissions required to use KMS encryption](security_iam_id-based-policy-examples.md#auth-kms-permissions).

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the dropdown list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your classification job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC that accesses the input documents and the output bucket.

1. (Optional) To add a tag to the custom classifier, enter a key-value pair under **Tags**. Choose **Add tag**. To remove this pair before creating the classifier, choose **Remove tag**. For more information, see [Tagging your resources](tagging.md).

1. Choose **Create**.

The console displays the **Classifiers** page. The new classifier appears in the table, showing `Submitted` as its status. When the classifier starts processing the training documents, the status changes to `Training`. When a classifier is ready to use, the status changes to `Trained` or `Trained with warnings`. If the status is `Trained with warnings`, review the skipped files folder in the [Classifier training output](train-classifier-output.md).

If Amazon Comprehend encountered errors during creation or training, the status changes to `In error`. You can choose a classifier job in the table to get more information about the classifier, including any error messages.

![\[The custom classifier list.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/class-list.png)


# Train custom classifiers (API)
<a name="train-custom-classifier-api"></a>

To create and train a custom classifier, use the [CreateDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateDocumentClassifier.html) operation.

You can monitor the progress of the request using the [DescribeDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassifier.html) operation. After the `Status` field transitions to `TRAINED`, you can use the classifier to classify documents. If the status is `TRAINED_WITH_WARNINGS`, review the skipped files folder in the [Classifier training output](train-classifier-output.md) from the `CreateDocumentClassifier` operation.
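A polling loop for this workflow might look like the following sketch, which uses the AWS SDK for Python (Boto3) and assumes that credentials and a default Region are configured. The terminal status set and the `wait_for_classifier` helper name are illustrative assumptions, not part of the Amazon Comprehend API:

```python
import time

# Statuses after which training will not progress further (a subset of
# the classifier statuses; assumed terminal for this sketch).
TERMINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNINGS", "IN_ERROR", "STOPPED"}

def wait_for_classifier(classifier_arn, poll_seconds=60):
    """Poll DescribeDocumentClassifier until training reaches a terminal status."""
    import boto3  # imported here so the sketch is usable without AWS set up
    comprehend = boto3.client("comprehend")
    while True:
        properties = comprehend.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )["DocumentClassifierProperties"]
        if properties["Status"] in TERMINAL_STATUSES:
            return properties["Status"]
        time.sleep(poll_seconds)
```

Calling `wait_for_classifier(arn)` returns the final status, after which you can decide whether to use the model or inspect the training output.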

**Topics**
+ [Training custom classification using the AWS Command Line Interface](#get-started-api-customclass-cli)
+ [Using the AWS SDK for Java or SDK for Python](#get-started-api-customclass-java)

## Training custom classification using the AWS Command Line Interface
<a name="get-started-api-customclass-cli"></a>

The following examples show how to use the `CreateDocumentClassifier` operation, the `DescribeDocumentClassifier` operation, and other custom classifier APIs with the AWS CLI. 

The examples are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (`\`) Unix continuation character at the end of each line with a caret (^).

Create a plain-text custom classifier using the `create-document-classifier` operation.

```
aws comprehend create-document-classifier \
     --region region \
     --document-classifier-name testDelete \
     --language-code en \
     --input-data-config S3Uri=s3://S3Bucket/docclass/file name \
     --data-access-role-arn arn:aws:iam::account number:role/testFlywheelDataAccess
```

To create a native custom classifier, provide the following additional parameters in the `create-document-classifier` request.

1. DocumentType: Set the value to `SEMI_STRUCTURED_DOCUMENT`.

1. Documents: The S3 location of the training documents (and, optionally, the test documents).

1. OutputDataConfig: The S3 location for the output documents (and an optional KMS key).

1. DocumentReaderConfig: Optional field for text extraction settings.

```
aws comprehend create-document-classifier \
     --region region \
     --document-classifier-name testDelete \
     --language-code en \
     --input-data-config \
          "S3Uri=s3://S3Bucket/docclass/file name,DocumentType=SEMI_STRUCTURED_DOCUMENT,Documents={S3Uri=s3://S3Bucket/docclass/documents}" \
     --output-data-config S3Uri=s3://S3Bucket/docclass/file name \
     --data-access-role-arn arn:aws:iam::account number:role/testFlywheelDataAccess
```

Get information on a custom classifier with the document classifier ARN using the `DescribeDocumentClassifier` operation.

```
aws comprehend describe-document-classifier \
     --region region \
     --document-classifier-arn arn:aws:comprehend:region:account number:document-classifier/file name
```

Delete a custom classifier using the `DeleteDocumentClassifier` operation.

```
aws comprehend delete-document-classifier \
     --region region \
     --document-classifier-arn arn:aws:comprehend:region:account number:document-classifier/testDelete
```

List all custom classifiers in the account using the `ListDocumentClassifiers` operation.

```
aws comprehend list-document-classifiers \
     --region region
```

## Using the AWS SDK for Java or SDK for Python
<a name="get-started-api-customclass-java"></a>

For SDK examples of how to create and train a custom classifier, see [Use `CreateDocumentClassifier` with an AWS SDK or CLI](example_comprehend_CreateDocumentClassifier_section.md).

# Test the training data
<a name="testing-the-model"></a>

After training the model, Amazon Comprehend tests the custom classifier model. If you don't provide a test dataset, Amazon Comprehend trains the model with 90 percent of the training data. It reserves 10 percent of the training data to use for testing. If you do provide a test dataset, the test data must include at least one example for each unique label in the training dataset. 

Testing the model provides you with metrics that you can use to estimate the accuracy of the model. The console displays the metrics in the **Classifier performance** section of the **Classifier details** page in the console. They're also returned in the `Metrics` fields returned by the [DescribeDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassifier.html) operation.

In the following example training data, there are five labels: DOCUMENTARY, DOCUMENTARY, SCIENCE_FICTION, DOCUMENTARY, ROMANTIC_COMEDY. There are three unique classes: DOCUMENTARY, SCIENCE_FICTION, ROMANTIC_COMEDY.


| Column 1 | Column 2 | 
| --- | --- | 
| DOCUMENTARY | document text 1 | 
| DOCUMENTARY | document text 2 | 
| SCIENCE_FICTION | document text 3 | 
| DOCUMENTARY | document text 4 | 
| ROMANTIC_COMEDY | document text 5 | 

For auto split (where Amazon Comprehend reserves 10 percent of the training data to use for testing), if the training data contains limited examples of a specific label, the test dataset may contain zero examples of that label. For instance, if the training dataset contains 1000 instances of the DOCUMENTARY class, 900 instances of SCIENCE_FICTION, and a single instance of the ROMANTIC_COMEDY class, the test dataset might contain 100 DOCUMENTARY and 90 SCIENCE_FICTION instances, but no ROMANTIC_COMEDY instances, because only a single example is available.

After you finish training your model, the training metrics provide information that you can use to decide if the model is sufficiently accurate for your needs. 

# Classifier training output
<a name="train-classifier-output"></a>

After Amazon Comprehend completes the custom classifier model training, it creates output files in the Amazon S3 output location that you specified in the [CreateDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateDocumentClassifier.html) API request or the equivalent console request.

Amazon Comprehend creates a confusion matrix when you train a plain-text model or a native document model. It can create additional output files when you train a native document model.

**Topics**
+ [

## Confusion matrix
](#conf-matrix)
+ [

## Additional outputs for native document models
](#train-class-output-native)

## Confusion matrix
<a name="conf-matrix"></a>

When you train a custom classifier model, Amazon Comprehend creates a confusion matrix that provides metrics on how well the model performed in training. The matrix compares the labels that the model predicted with the actual document labels. Amazon Comprehend uses a portion of the training data to create the confusion matrix.

A confusion matrix provides an indication of which classes could use more data to improve model performance. A class with a high fraction of correct predictions has the highest number of results along the diagonal of the matrix. If the number on the diagonal is a lower number, the class has a lower fraction of correct predictions. You can add more training examples for this class and train the model again. For example, if 40 percent of label A samples get classified as label D, adding more samples for label A and label D enhances the classifier's performance.

After Amazon Comprehend creates the classifier model, the confusion matrix is available in the `confusion_matrix.json` file in the S3 output location. 

The format of the confusion matrix varies, depending on whether you trained your classifier using multi-class mode or multi-label mode.

**Topics**
+ [

### Confusion matrix for multi-class mode
](#m-c-matrix)
+ [

### Confusion matrix for multi-label mode
](#m-l-matrix)

### Confusion matrix for multi-class mode
<a name="m-c-matrix"></a>

In multi-class mode, the individual classes are mutually exclusive, so classification assigns one label to each document. For example, an animal can be a dog or a cat, but not both at the same time.

Consider the following example of a confusion matrix for a multi-class trained classifier:

```
  A B X Y <-(predicted label)
A 1 2 0 4
B 0 3 0 1
X 0 0 1 0
Y 1 1 1 1
^
|
(actual label)
```

In this case, the model predicted the following:
+ One "A" label was accurately predicted, two "A" labels were incorrectly predicted as "B" labels, and four "A" labels were incorrectly predicted as "Y" labels.
+ Three "B" labels were accurately predicted, and one "B" label was incorrectly predicted as a "Y" label.
+ One "X" was accurately predicted.
+ One "Y" label was accurately predicted, one was incorrectly predicted as an "A" label, one was incorrectly predicted as a "B" label, and one was incorrectly predicted as an "X" label.

The diagonal line in the matrix (A:A, B:B, X:X, and Y:Y) shows the accurate predictions. The prediction errors are the values outside of the diagonal. In this case, the matrix shows the following prediction error rates: 
+ A labels: 86%
+ B labels: 25%
+ X labels: 0%
+ Y labels: 75%

The classifier returns the confusion matrix as a file in JSON format. The following JSON file represents the matrix for the previous example.

```
{
 "type": "multi_class",
 "confusion_matrix": [
 [1, 2, 0, 4],
 [0, 3, 0, 1],
 [0, 0, 1, 0],
 [1, 1, 1, 1]],
 "labels": ["A", "B", "X", "Y"],
 "all_labels": ["A", "B", "X", "Y"]
}
```
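As an illustration, the following Python sketch parses a matrix in this format and reproduces the per-class error rates listed earlier (86, 25, 0, and 75 percent):

```python
import json

# The multi-class confusion matrix from the previous example, as stored
# in confusion_matrix.json.
doc = json.loads("""
{
 "type": "multi_class",
 "confusion_matrix": [[1, 2, 0, 4], [0, 3, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1]],
 "labels": ["A", "B", "X", "Y"]
}
""")

matrix = doc["confusion_matrix"]

# Row i holds the documents whose actual label is labels[i]; the diagonal
# entry matrix[i][i] counts the accurate predictions for that label.
error_rates = {}
for i, label in enumerate(doc["labels"]):
    total = sum(matrix[i])
    error_rates[label] = round(100 * (total - matrix[i][i]) / total)

print(error_rates)  # {'A': 86, 'B': 25, 'X': 0, 'Y': 75}
```

A class with a large off-diagonal row sum relative to its diagonal entry is a good candidate for additional training examples.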

### Confusion matrix for multi-label mode
<a name="m-l-matrix"></a>

In multi-label mode, classification can assign one or more classes to a document. Consider the following example of a confusion matrix for a multi-label trained classifier.

In this example, there are three possible labels: `Comedy`, `Action`, and `Drama`. The multi-label confusion matrix creates one 2x2 matrix for each label.

```
Comedy                   Action                   Drama 
     No Yes                   No Yes                   No Yes   <-(predicted label)                                      
 No  2   1                No  1   1                No  3   0                                                         
Yes  0   2               Yes  2   1               Yes  1   1   
 ^                        ^                        ^
 |                        |                        |
 |-----------(was this label actually used)--------|
```

In this case, the model returned the following for the `Comedy` label:
+ Two instances where a `Comedy` label was accurately predicted to be present. True positive (TP). 
+ Two instances where a `Comedy` label was accurately predicted to be absent. True negative (TN).
+ One instance where a `Comedy` label was incorrectly predicted to be present. False positive (FP).
+ Zero instances where a `Comedy` label was incorrectly predicted to be absent. False negative (FN).

As with a multi-class confusion matrix, the diagonal line in each matrix shows the accurate predictions.

In this case, the model accurately predicted `Comedy` labels 80% of the time (TP plus TN) and incorrectly predicted them 20% of the time (FP plus FN).



The classifier returns the confusion matrix as a file in JSON format. The following JSON file represents the matrix for the previous example.

```
{
"type": "multi_label",
"confusion_matrix": [
 [[2, 1],        
 [0, 2]],
 [[1, 1],        
 [2, 1]],      
 [[3, 0],        
 [1, 1]]
], 
"labels": ["Comedy", "Action", "Drama"],
"all_labels": ["Comedy", "Action", "Drama"]
}
```
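A short Python sketch shows how to read the per-label 2x2 matrices from this format and recover the TP/TN/FP/FN counts and per-label accuracy:

```python
import json

# The multi-label confusion matrix from the previous example, in the
# format of confusion_matrix.json.
doc = json.loads("""
{
 "type": "multi_label",
 "confusion_matrix": [[[2, 1], [0, 2]], [[1, 1], [2, 1]], [[3, 0], [1, 1]]],
 "labels": ["Comedy", "Action", "Drama"]
}
""")

# Each label has a 2x2 matrix: rows are actual (No, Yes),
# columns are predicted (No, Yes).
results = {}
for label, ((tn, fp), (fn, tp)) in zip(doc["labels"], doc["confusion_matrix"]):
    results[label] = {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

print(results["Comedy"]["accuracy"])  # 0.8
```

For the `Comedy` matrix, the diagonal entries (TN and TP) account for four of the five predictions, which matches the 80 percent figure above.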

## Additional outputs for native document models
<a name="train-class-output-native"></a>

Amazon Comprehend can create additional output files when you train a native document model.

### Amazon Textract output
<a name="textract-output"></a>

If Amazon Comprehend invoked the Amazon Textract APIs to extract text for any of the training documents, it saves the Amazon Textract output files in the S3 output location. It uses the following directory structure:
+ **Training documents:** 

  `amazon-textract-output/train/<file_name>/<page_num>/textract_output.json` 
+ **Test documents:** 

  `amazon-textract-output/test/<file_name>/<page_num>/textract_output.json`

Amazon Comprehend populates the test folder if you provided test documents in the API request.

### Document annotation failures
<a name="failed-files-output"></a>

Amazon Comprehend creates the following files in the Amazon S3 output location (in the **skipped_documents/** folder) if there are any failed annotations:
+ failed_annotations_train.jsonl

  This file exists if any annotations failed in the training data.
+ failed_annotations_test.jsonl

  This file exists if the request included test data and any annotations failed in the test data.

The failed annotation files are JSONL files with the following format:

```
{"File": "String", "Page": Number, "ErrorCode": "...", "ErrorMessage": "..."}
{"File": "String", "Page": Number, "ErrorCode": "...", "ErrorMessage": "..."}
```
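Because each line of a JSONL file is a self-contained JSON object, you can summarize the failures with a few lines of Python. The file names and page numbers below are hypothetical; only the field names come from the format above:

```python
import json

# Hypothetical contents of failed_annotations_train.jsonl.
jsonl = (
    '{"File": "doc1.pdf", "Page": 2, "ErrorCode": "...", "ErrorMessage": "..."}\n'
    '{"File": "doc2.pdf", "Page": 1, "ErrorCode": "...", "ErrorMessage": "..."}\n'
)

# Parse one JSON object per non-empty line.
failures = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
files_with_failures = sorted({f["File"] for f in failures})
print(files_with_failures)  # ['doc1.pdf', 'doc2.pdf']
```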

# Custom classifier metrics
<a name="cer-doc-class"></a>

Amazon Comprehend provides metrics to help you estimate how well a custom classifier performs. Amazon Comprehend calculates the metrics using the test data from the classifier training job. The metrics accurately represent the performance of the model during training, so they approximate the model performance for classification of similar data. 

Use API operations such as [DescribeDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassifier.html) to retrieve the metrics for a custom classifier.

**Note**  
Refer to [Metrics: Precision, recall, and FScore](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) for an understanding of the underlying Precision, Recall, and F1 score metrics. These metrics are defined at a class level. Amazon Comprehend uses **macro** averaging to combine them into the test-set Precision, Recall, and F1 score, as discussed in the following sections.

**Topics**
+ [

## Metrics
](#cer-doc-class-metrics)
+ [

## Improving your custom classifier's performance
](#improving-metrics-doc)

## Metrics
<a name="cer-doc-class-metrics"></a>

Amazon Comprehend supports the following metrics: 

**Topics**
+ [

### Accuracy
](#class-accuracy-metric)
+ [

### Precision (macro precision)
](#class-macroprecision-metric)
+ [

### Recall (macro recall)
](#class-macrorecall-metric)
+ [

### F1 score (macro F1 score)
](#class-macrof1score-metric)
+ [

### Hamming loss
](#class-hammingloss-metric)
+ [

### Micro precision
](#class-microprecision-metric)
+ [

### Micro recall
](#class-microrecall-metric)
+ [

### Micro F1 score
](#class-microf1score-metric)

To view the metrics for a classifier, open the **Classifier details** page in the console.

![\[Custom Classifier Metrics\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/classifierperformance.png)


### Accuracy
<a name="class-accuracy-metric"></a>

Accuracy indicates the percentage of labels from the test data that the model predicted accurately. To compute accuracy, divide the number of accurately predicted labels in the test documents by the total number of labels in the test documents.

For example:


| Actual label | Predicted label | Accurate/Incorrect | 
| --- | --- | --- | 
|  1  |  1  |  Accurate  | 
|  0  |  1  |  Incorrect  | 
|  2  |  3  |  Incorrect  | 
|  3  |  3  |  Accurate  | 
|  2  |  2  |  Accurate  | 
|  1  |  1  |  Accurate  | 
|  3  |  3  | Accurate | 

The accuracy is the number of accurate predictions divided by the total number of test samples: 5/7 = 0.714, or 71.4%.
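The same calculation in Python, using the actual and predicted labels from the table:

```python
# Actual and predicted labels from the seven test samples above.
actual = [1, 0, 2, 3, 2, 1, 3]
predicted = [1, 1, 3, 3, 2, 1, 3]

# Accuracy: accurate predictions divided by total test samples.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(correct, round(accuracy, 3))  # 5 0.714
```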

### Precision (macro precision)
<a name="class-macroprecision-metric"></a>

Precision is a measure of the usefulness of the classifier results in the test data. It's defined as the number of documents accurately classified, divided by the total number of classifications for the class. High precision means that the classifier returned significantly more relevant results than irrelevant ones. 

The `Precision` metric is also known as *Macro Precision*. 

The following example shows precision results for a test set.


| Label | Sample size | Label precision | 
| --- | --- | --- | 
|  Label_1  |  400  |  0.75  | 
|  Label_2  |  300  |  0.80  | 
|  Label_3  |  30000  |  0.90  | 
|  Label_4  |  20  |  0.50  | 
|  Label_5  |  10  |  0.40  | 

The Precision (Macro Precision) metric for the model is therefore:

```
Macro Precision = (0.75 + 0.80 + 0.90 + 0.50 + 0.40)/5 = 0.67
```

### Recall (macro recall)
<a name="class-macrorecall-metric"></a>

Recall indicates the percentage of correct categories in your text that the model can predict. This metric comes from averaging the recall scores of all available labels. Recall is a measure of how complete the classifier results are for the test data. 

High recall means that the classifier returned most of the relevant results. 

The `Recall` metric is also known as *Macro Recall*.

The following example shows recall results for a test set.


| Label | Sample size | Label recall | 
| --- | --- | --- | 
|  Label_1  |  400  |  0.70  | 
|  Label_2  |  300  |  0.70  | 
|  Label_3  |  30000  |  0.98  | 
|  Label_4  |  20  |  0.80  | 
|  Label_5  |  10  |  0.10  | 

The Recall (Macro Recall) metric for the model is therefore:

```
Macro Recall = (0.70 + 0.70 + 0.98 + 0.80 + 0.10)/5 = 0.656
```

### F1 score (macro F1 score)
<a name="class-macrof1score-metric"></a>

The F1 score is derived from the `Precision` and `Recall` values. It measures the overall accuracy of the classifier. The highest score is 1, and the lowest score is 0. 

Amazon Comprehend calculates the *Macro F1 Score*. It's the unweighted average of the label F1 scores. Using the following test set as an example:


| Label | Sample size | Label F1 score | 
| --- | --- | --- | 
|  Label_1  |  400  |  0.724  | 
|  Label_2  |  300  |  0.824  | 
|  Label_3  |  30000  |  0.94  | 
|  Label_4  |  20  |  0.62  | 
|  Label_5  |  10  |  0.16  | 

The F1 Score (Macro F1 Score) for the model is calculated as follows:

```
Macro F1 Score = (0.724 + 0.824 + 0.94 + 0.62 + 0.16)/5 = 0.6536
```
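All three macro metrics reduce to the same operation: an unweighted mean of the per-label scores. This sketch reproduces the three results above from the per-label values in the tables:

```python
# Per-label scores from the three tables above (Label_1 through Label_5).
precision = [0.75, 0.80, 0.90, 0.50, 0.40]
recall = [0.70, 0.70, 0.98, 0.80, 0.10]
f1 = [0.724, 0.824, 0.94, 0.62, 0.16]

# Macro averaging is an unweighted mean: every label counts equally,
# regardless of its sample size.
def macro(scores):
    return sum(scores) / len(scores)

print(round(macro(precision), 4))  # 0.67
print(round(macro(recall), 4))     # 0.656
print(round(macro(f1), 4))         # 0.6536
```

Note that because every label counts equally, a rare label with a poor score (such as Label_5) pulls the macro average down as much as a frequent one would.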

### Hamming loss
<a name="class-hammingloss-metric"></a>

The fraction of labels that are incorrectly predicted. Also seen as the fraction of incorrect labels compared to the total number of labels. Scores closer to zero are better.
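For multi-label output, Hamming loss can be computed as the fraction of individual label assignments that differ between the predicted and actual label sets. A sketch with hypothetical data:

```python
# Hypothetical multi-label results: each row is a document, each column a
# label (1 means the label applies to the document).
actual = [[1, 0, 1], [0, 1, 0]]
predicted = [[1, 1, 1], [0, 0, 0]]

# Count label assignments that differ, over all documents and labels.
total_labels = sum(len(row) for row in actual)
wrong = sum(a != p
            for row_a, row_p in zip(actual, predicted)
            for a, p in zip(row_a, row_p))

hamming_loss = wrong / total_labels
print(round(hamming_loss, 4))  # 0.3333
```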

### Micro precision
<a name="class-microprecision-metric"></a>

Similar to the precision metric, except that micro precision pools the true positive and false positive counts across all classes, then computes a single precision score from the totals.

### Micro recall
<a name="class-microrecall-metric"></a>

Similar to the recall metric, except that micro recall pools the true positive and false negative counts across all classes, then computes a single recall score from the totals.

### Micro F1 score
<a name="class-microf1score-metric"></a>

The Micro F1 score is the harmonic mean of the Micro Precision and Micro Recall metrics.
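In other words, micro averaging pools the raw counts across all classes before computing each ratio, so frequent classes carry more weight than in macro averaging. A small sketch with hypothetical counts:

```python
# Hypothetical per-class counts of true positives (TP), false positives (FP),
# and false negatives (FN) on a test set.
counts = {
    "Label_1": {"TP": 90, "FP": 30, "FN": 10},
    "Label_2": {"TP": 10, "FP": 10, "FN": 40},
}

# Micro averaging sums the counts across classes first.
tp = sum(c["TP"] for c in counts.values())  # 100
fp = sum(c["FP"] for c in counts.values())  # 40
fn = sum(c["FN"] for c in counts.values())  # 50

micro_precision = tp / (tp + fp)  # 100/140
micro_recall = tp / (tp + fn)     # 100/150
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(round(micro_precision, 3))  # 0.714
print(round(micro_recall, 3))     # 0.667
print(round(micro_f1, 3))         # 0.69
```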

## Improving your custom classifier's performance
<a name="improving-metrics-doc"></a>

The metrics provide an insight into how your custom classifier performs during a classification job. If the metrics are low, the classification model might not be effective for your use case. You have several options to improve your classifier performance:

1. In your training data, provide concrete examples that define clear separation of the categories. For example, provide documents that use unique words/sentences to represent the category. 

1. Add more data for under-represented labels in your training data.

1. Try to reduce skew in the categories. If the largest label in your data has more than 10 times the documents in the smallest label, try increasing the number of documents for the smallest label. Make sure to reduce the skew ratio to at most 10:1 between highly represented and least represented classes. You can also try removing input documents from the highly represented classes.

# Running real-time analysis
<a name="running-class-sync"></a>

After you train a custom classifier, you can classify documents using real-time analysis. Real-time analysis takes a single document as input and returns the results synchronously. Custom classification accepts a variety of document types as inputs for real-time analysis. For details, see [Inputs for real-time custom analysis](idp-inputs-sync.md).

If you plan to analyze image files or scanned PDF documents, your IAM policy must grant permissions to use two Amazon Textract API methods (DetectDocumentText and AnalyzeDocument). Amazon Comprehend invokes these methods during text extraction. For an example policy, see [Permissions required to perform document analysis actions](security_iam_id-based-policy-examples.md#security-iam-based-policy-perform-cmp-actions).

You must create an endpoint to run real-time analysis using a custom classification model. 

**Topics**
+ [

# Real-time analysis for custom classification (console)
](custom-sync.md)
+ [

# Real-time analysis for custom classification (API)
](class-sync-api.md)
+ [

# Outputs for real-time analysis
](outputs-class-sync.md)

# Real-time analysis for custom classification (console)
<a name="custom-sync"></a>

You can use the Amazon Comprehend console to run real-time analysis using a custom classification model.

You create an endpoint to run the real-time analysis. An endpoint includes managed resources that make your custom model available for real-time inference.

For information about provisioning endpoint throughput, and the associated costs, see [Using Amazon Comprehend endpoints](using-endpoints.md).

**Topics**
+ [

## Creating an endpoint for custom classification
](#create-endpoint)
+ [

## Running real-time custom classification
](#cc-real-time-analysis)

## Creating an endpoint for custom classification
<a name="create-endpoint"></a>

**To create an endpoint (console)**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Endpoints** and choose the **Create endpoint** button. A **Create endpoint** screen opens.

1. Give the endpoint a name. The name must be unique within the current Region and account.

1. Choose a custom model that you want to attach the new endpoint to. From the dropdown, you can search by model name.
**Note**  
You must create a model before you can attach an endpoint to it. If you don't have a model yet, see [Training classification models](training-classifier-model.md).

1. (Optional) To add a tag to the endpoint, enter a key-value pair under **Tags** and choose **Add tag**. To remove this pair before creating the endpoint, choose **Remove tag**.

1. Enter the number of inference units (IUs) to assign to the endpoint. Each unit represents a throughput of 100 characters per second for up to two documents per second. For information about endpoint throughput, see [Using Amazon Comprehend endpoints](using-endpoints.md). 

1. (Optional) If you are creating a new endpoint, you have the option to use the IU estimator. Depending on the throughput, or the number of characters you want to analyze per second, it can be hard to know how many inference units you need. This optional step can help you determine the number of IUs to request. 

1. From the **Purchase summary**, review your estimated hourly, daily, and monthly endpoint cost. 

1. Select the check box if you understand that your account incurs charges for the endpoint from the time it starts until you delete it.

1. Choose **Create endpoint**.

## Running real-time custom classification
<a name="cc-real-time-analysis"></a>

Once you've created an endpoint, you can run real-time analysis using your custom model. There are two ways to run real-time analysis from the console. You can input text or upload a file, as shown in the following. 

**To run real-time analysis using a custom model (console)**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Real-time analysis**.

1. Under **Input type**, choose **Custom** for **Analysis type**. 

1. Under **Custom model type**, choose **Custom classification**. 

1. For **Endpoint**, choose the endpoint that you want to use. This endpoint links to a specific custom model. 

1. To specify the input data for analysis, you can input text or upload a file.
   + To enter text:

     1. Choose **Input text**.

     1. Enter the text that you want to analyze. 
   + To upload a file:

     1. Choose **Upload file** and enter the file name to upload.

     1. (Optional) Under **Advanced read actions**, you can override the default actions for text extraction. For details, see [Setting text extraction options](idp-set-textract-options.md).

   For best results, match the type of input to the classifier model type. The console displays a warning if you submit a native document to a plain-text model, or plain text to a native document model. For more information, see [Training classification models](training-classifier-model.md).

1. Choose **Analyze**. Amazon Comprehend analyzes the input data using your custom model. Amazon Comprehend displays the discovered classes, along with a confidence assessment for each class. 

# Real-time analysis for custom classification (API)
<a name="class-sync-api"></a>

You can use the Amazon Comprehend API to run real-time classification with a custom model. First, you create an endpoint to run the real-time analysis. After you create the endpoint, you run the real-time classification.

The examples in this section use command formats for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

For information about provisioning endpoint throughput, and the associated costs, see [Using Amazon Comprehend endpoints](using-endpoints.md).

**Topics**
+ [

## Creating an endpoint for custom classification
](#create-endpoint-api)
+ [

## Running real-time custom classification
](#cc-real-time-analysis-api)

## Creating an endpoint for custom classification
<a name="create-endpoint-api"></a>

The following example shows the [CreateEndpoint](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEndpoint.html) API operation using the AWS CLI. 

```
aws comprehend create-endpoint \
    --desired-inference-units number of inference units \
    --endpoint-name endpoint name \
    --model-arn arn:aws:comprehend:region:account-id:model/example \
    --tags Key=My1stTag,Value=Value1
```

Amazon Comprehend responds with the following:

```
{
   "EndpointArn": "Arn"
}
```

## Running real-time custom classification
<a name="cc-real-time-analysis-api"></a>

After you create an endpoint for your custom classification model, you use the endpoint to run the [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) API operation. You can provide text input using the `text` or `bytes` parameter. Enter the other input types using the `bytes` parameter.

For image files and PDF files, you can use the `DocumentReaderConfig` parameter to override the default text extraction actions. For details, see [Setting text extraction options](idp-set-textract-options.md).

For best results, match the type of input to the classifier model type. The API response includes a warning if you submit a native document to a plain-text model, or a plain-text file to a native document model. For more information, see [Training classification models](training-classifier-model.md).

### Using the AWS Command Line Interface
<a name="cc-real-time-analysis-api-cli"></a>

The following examples demonstrate how to use the `classify-document` CLI command.

#### Classify text using the AWS CLI
<a name="cc-real-time-analysis-api-run-cli1"></a>

The following example runs real-time classification on a block of text.

```
aws comprehend classify-document \
     --endpoint-arn arn:aws:comprehend:region:account-id:endpoint/endpoint name \
     --text 'From the Tuesday, April 16th, 1912 edition of The Guardian newspaper: The maiden voyage of the White Star liner Titanic, 
     the largest ship ever launched ended in disaster. The Titanic started her trip from Southampton for New York on Wednesday. Late 
     on Sunday night she struck an iceberg off the Grand Banks of Newfoundland. By wireless telegraphy she sent out signals of distress, 
     and several liners were near enough to catch and respond to the call.'
```

Amazon Comprehend responds with the following:

```
{
    "Classes": [ 
       { 
          "Name": "string",
          "Score": 0.9793661236763
       }
    ]
 }
```

#### Classify a semi-structured document using the AWS CLI
<a name="cc-real-time-analysis-api-run-cli2"></a>

To analyze custom classification for a PDF, Word, or image file, run the `classify-document` command with the input file in the `bytes` parameter.

The following example uses an image as the input file. It uses the `fileb` option to base-64 encode the image file bytes. For more information, see [Binary large objects](https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-parameters-types.html#parameter-type-blob) in the AWS Command Line Interface User Guide. 

This example also passes in a JSON file named `config.json` to set the text extraction options.

```
$ aws comprehend classify-document \
> --endpoint-arn arn \
> --language-code en \
> --bytes fileb://image1.jpg   \
> --document-reader-config file://config.json
```

The **config.json** file contains the following content.

```
 {
    "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
    "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT"    
 }
```

Amazon Comprehend responds with the following:

```
{
    "Classes": [ 
       { 
          "Name": "string",
          "Score": 0.9793661236763
       }
    ]
 }
```

For more information, see [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) in the *Amazon Comprehend API Reference*.

# Outputs for real-time analysis
<a name="outputs-class-sync"></a>

## Outputs for text inputs
<a name="outputs-class-sync-text"></a>

For text inputs, the output includes the list of classes or labels identified by the classifier analysis. The following example shows a list with two classes.

```
"Classes": [
  {
     "Name": "abc",
     "Score": 0.2757999897003174,
     "Page": 1
  },
  {
    "Name": "xyz",
    "Score": 0.2721000015735626,
    "Page": 1
  }
]
```

## Outputs for semi-structured inputs
<a name="outputs-class-sync-other"></a>

For a semi-structured input document, or a text file, the output can include the following additional fields:
+ DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the `Bytes` parameter.
+ DocumentType – The document type for each page in the input document. This field is present in the response if the request included the `Bytes` parameter.
+ Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.
+ Warnings – Warnings detected while processing the input document. The response includes a warning if there is a mismatch between the input document type and the model type associated with the endpoint that you specified. The field is empty if the system generated no warnings.

For more details about these output fields, see [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) in the *Amazon Comprehend API Reference*.

The following example shows the output for a one-page native PDF input document.

```
{
  "Classes": [
      {
          "Name": "123",
          "Score": 0.39570000767707825,
          "Page": 1
      },
      {
          "Name": "abc",
          "Score": 0.2757999897003174,
          "Page": 1
      },
      {
          "Name": "xyz",
          "Score": 0.2721000015735626,
          "Page": 1
      }
  ],
  "DocumentMetadata": {
      "Pages": 1,
      "ExtractedCharacters": [
          {
              "Page": 1,
              "Count": 2013
          }
      ]
  },
  "DocumentType": [
      {
          "Page": 1,
          "Type": "NATIVE_PDF"
      }
  ]
}
```
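A response in this shape is plain JSON, so picking out the top-scoring class and the extraction metadata is straightforward. This sketch reuses the example response above:

```python
import json

# The example ClassifyDocument response above, for a one-page native PDF.
response = json.loads("""
{
  "Classes": [
      {"Name": "123", "Score": 0.39570000767707825, "Page": 1},
      {"Name": "abc", "Score": 0.2757999897003174, "Page": 1},
      {"Name": "xyz", "Score": 0.2721000015735626, "Page": 1}
  ],
  "DocumentMetadata": {
      "Pages": 1,
      "ExtractedCharacters": [{"Page": 1, "Count": 2013}]
  },
  "DocumentType": [{"Page": 1, "Type": "NATIVE_PDF"}]
}
""")

# The class with the highest confidence score.
top = max(response["Classes"], key=lambda c: c["Score"])
print(top["Name"])  # 123

# Characters extracted per page, from the document metadata.
chars = {p["Page"]: p["Count"]
         for p in response["DocumentMetadata"]["ExtractedCharacters"]}
print(chars)  # {1: 2013}
```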

# Running asynchronous jobs
<a name="running-classifiers"></a>

After you train a custom classifier, you can use asynchronous jobs to analyze large documents or multiple documents in one batch.

Custom classification accepts a variety of input document types. For details, see [Inputs for asynchronous custom analysis](idp-inputs-async.md).

If you plan to analyze image files or scanned PDF documents, your IAM policy must grant permissions to use two Amazon Textract API methods (DetectDocumentText and AnalyzeDocument). Amazon Comprehend invokes these methods during text extraction. For an example policy, see [Permissions required to perform document analysis actions](security_iam_id-based-policy-examples.md#security-iam-based-policy-perform-cmp-actions).

For classification of semi-structured documents (image, PDF, or Docx files) using a plain-text model, use the `one document per file` input format. Also, include the `DocumentReaderConfig` parameter in your [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) request.

**Topics**
+ [File formats for async analysis](class-inputs-async.md)
+ [Analysis jobs for custom classification (console)](analysis-jobs-custom-classifier.md)
+ [Analysis jobs for custom classification (API)](analysis-jobs-custom-class-api.md)
+ [Outputs for asynchronous analysis jobs](outputs-class-async.md)

# File formats for async analysis
<a name="class-inputs-async"></a>

When you run async analysis with your model, you have a choice of formats for input documents: `One document per line` or `one document per file`. The format you use depends on the type of documents you want to analyze, as described in the following table.


| Description | Format | 
| --- | --- | 
| The input contains multiple files. Each file contains one input document. This format is best for collections of large documents, such as newspaper articles or scientific papers. Also, use this format for semi-structured documents (image, PDF, or Docx files) using a native document classifier. | One document per file | 
|  The input is one or more files. Each line in the file is a separate input document. This format is best for short documents, such as text messages or social media posts.  | One document per line | 

**One document per file**

With the `one document per file` format, each file represents one input document. 

**One document per line**

With the `One document per line` format, each document is placed on a separate line and no header is used. The label is not included on each line (since you don't yet know the label for the document). Each line of the file (the end of the individual document) must end with a line feed (LF, `\n`), a carriage return (CR, `\r`), or both (CRLF, `\r\n`). Don't use the UTF-8 line separator (U+2028) to end a line.

The following example shows the format of the input file.

```
Text of document 1 \n
Text of document 2 \n
Text of document 3 \n
Text of document 4 \n
```

For either format, use UTF-8 encoding for text files. After you prepare the files, place them in the S3 bucket that you're using for input data.
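As a sketch of that preparation step, the following writes a `One document per line` input file in UTF-8 with LF line endings. The file name and document texts are placeholder values; uploading to S3 is a separate step not shown here.

```python
import os
import tempfile

documents = [
    "Text of document 1",
    "Text of document 2",
    "Text of document 3",
]

# Write one document per line, UTF-8 encoded, ending each line with LF.
path = os.path.join(tempfile.mkdtemp(), "input_docs.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    for doc in documents:
        # Replace any embedded newlines so each document stays on one line.
        f.write(doc.replace("\n", " ") + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # 3
```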

When you start a classification job, you specify the Amazon S3 location of your input data. The URI must be in the same Region as the API endpoint that you are calling. The URI can point to a single file (as when using the `one document per line` format), or it can be a prefix for a collection of data files. 

For example, if you use the URI `S3://bucketName/prefix` and the prefix matches a single file, Amazon Comprehend uses that file as input. If more than one file begins with the prefix, Amazon Comprehend uses all of them as input. 

Grant Amazon Comprehend access to the S3 bucket that contains your document collection and output files. For more information, see [Role-based permissions required for asynchronous operations](security_iam_id-based-policy-examples.md#auth-role-permissions).

# Analysis jobs for custom classification (console)
<a name="analysis-jobs-custom-classifier"></a>

After you create and train a custom document classifier, you can use the console to run custom classification jobs with the model.

**To create a custom classification job (console)**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/).

1. From the left menu, choose **Analysis jobs** and then choose **Create job**.

1. Give the classification job a name. The name must be unique to your account and current Region.

1. Under **Analysis type**, choose **Custom classification**.

1. From **Select classifier**, choose the custom classifier to use.

1. (Optional) If you choose to encrypt the data that Amazon Comprehend uses while processing your job, choose **Job encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key ID under **KMS key ARN**.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [Key management service (KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Input data**, enter the location of the Amazon S3 bucket that contains your input documents or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the classification job must have reading permissions for the S3 bucket.

   To achieve the highest accuracy, match the input document type to the classifier model type. The classification job returns a warning if you submit native documents to a plain-text model, or plain-text documents to a native document model. For more information, see [Training classification models](training-classifier-model.md).

1. (Optional) For **Input format**, you can choose the format of the input documents. The format can be one document per file, or one document per line in a single file. One document per line applies only to text documents. 

1. (Optional) For **Document read mode**, you can override the default text extraction actions. For more information, see [Setting text extraction options](idp-set-textract-options.md). 

1. Under **Output data**, enter the location of the Amazon S3 bucket where Amazon Comprehend should write the job's output data or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the classification job must have write permissions for the S3 bucket.

1. (Optional) If you choose to encrypt the output result from your job, choose **Encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key alias or ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key alias or ID under **KMS key ID**.

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your classification job, the `DataAccessRole` used for the Create and Start operations must grant permissions to the VPC that accesses the output bucket.

1. Choose **Create job** to create the document classification job.

# Analysis jobs for custom classification (API)
<a name="analysis-jobs-custom-class-api"></a>

After you [create and train](train-custom-classifier-api.md) a custom document classifier, you can use the classifier to run analysis jobs.

Use the [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) operation to start classifying unlabeled documents. You specify the S3 bucket that contains the input documents, the S3 bucket for the output documents, and the classifier to use.

To achieve the highest accuracy, match the input document type to the classifier model type. The classification job returns a warning if you submit native documents to a plain-text model, or plain-text documents to a native document model. For more information, see [Training classification models](training-classifier-model.md).

 [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) is asynchronous. After you start the job, use the [DescribeDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassificationJob.html) operation to monitor its progress. When the `JobStatus` field in the response shows `COMPLETED`, you can access the output in the location that you specified.
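The polling loop itself is independent of any particular SDK. The following is a minimal sketch in which the `describe` callable stands in for a real `DescribeDocumentClassificationJob` call (for example, a bound boto3 method); the response field names follow the API reference, and the poll and timeout values are illustrative.

```python
import time

def wait_for_job(describe, job_id, poll_seconds=60, timeout_seconds=3600):
    """Poll a describe callable until the job reaches a terminal status.

    `describe(job_id)` must return a dict shaped like the
    DescribeDocumentClassificationJob response.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        props = describe(job_id)["DocumentClassificationJobProperties"]
        status = props["JobStatus"]
        if status in ("COMPLETED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout_seconds}s")
```

With boto3, `describe` might be a small wrapper around `client("comprehend").describe_document_classification_job(JobId=job_id)`.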

**Topics**
+ [Using the AWS Command Line Interface](#get-started-api-customclass-cli)
+ [Using the AWS SDK for Java or SDK for Python](#get-started-api-customclass-java)

## Using the AWS Command Line Interface
<a name="get-started-api-customclass-cli"></a>

The following examples demonstrate the `StartDocumentClassificationJob` operation and other custom classification APIs with the AWS CLI. 

The following examples use the command format for Unix, Linux, and macOS. For Windows, replace the backslash (`\`) Unix continuation character at the end of each line with a caret (^).

Run a custom classification job using the `StartDocumentClassificationJob` operation.

```
aws comprehend start-document-classification-job \
     --region region \
     --document-classifier-arn arn:aws:comprehend:region:account number:document-classifier/testDelete \
     --input-data-config S3Uri=s3://S3Bucket/docclass/file name,InputFormat=ONE_DOC_PER_LINE \
     --output-data-config S3Uri=s3://S3Bucket/output \
     --data-access-role-arn arn:aws:iam::account number:role/resource name
```

Get information about a classification job using the `DescribeDocumentClassificationJob` operation with the job ID.

```
aws comprehend describe-document-classification-job \
     --region region \
     --job-id job id
```

List all custom classification jobs in your account using the `ListDocumentClassificationJobs` operation.

```
aws comprehend list-document-classification-jobs \
     --region region
```

## Using the AWS SDK for Java or SDK for Python
<a name="get-started-api-customclass-java"></a>

For SDK examples of how to start a custom classifier job, see [Use `StartDocumentClassificationJob` with an AWS SDK or CLI](example_comprehend_StartDocumentClassificationJob_section.md).

# Outputs for asynchronous analysis jobs
<a name="outputs-class-async"></a>

After an analysis job completes, it stores the results in the S3 bucket that you specified in the request.

## Outputs for text inputs
<a name="outputs-class-async-text"></a>

For either format of text input documents (multi-class or multi-label), the job output consists of a single file named `output.tar.gz`. It's a compressed archive file that contains a text file with the output. 

**Multi-class output**

When you use a classifier trained in multi-class mode, your results show `classes`. These classes are the categories that you defined when you trained your classifier.

For more details about these output fields, see [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) in the *Amazon Comprehend API Reference*.

The following examples use this set of mutually exclusive classes.

```
DOCUMENTARY
SCIENCE_FICTION
ROMANTIC_COMEDY
SERIOUS_DRAMA
OTHER
```

If your input data format is one document per line, the output file contains one line for each line in the input. Each line includes the file name, the zero-based line number of the input line, and the class or classes found in the document, each with the confidence score that Amazon Comprehend assigned to that classification.

For example:

```
{"File": "file1.txt", "Line": "0", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "1", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Science_Fiction", "Score": 0.0381}, {"Name": "Science_Fiction", "Score": 0.0372}]}
{"File": "file2.txt", "Line": "2", "Classes": [{"Name": "Documentary", "Score": 0.1}, {"Name": "Documentary", "Score": 0.0381}, {"Name": "Documentary", "Score": 0.0372}]}
{"File": "file2.txt", "Line": "3", "Classes": [{"Name": "Serious_Drama", "Score": 0.3141}, {"Name": "Other", "Score": 0.0381}, {"Name": "Other", "Score": 0.0372}]}
```
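Because each output line is a standalone JSON object, post-processing is straightforward. The following sketch parses lines shaped like the example above and keeps only the highest-scoring class per input line; the sample records are abbreviated from the example.

```python
import json

# Abbreviated multi-class output lines in the documented format.
output_lines = [
    '{"File": "file1.txt", "Line": "0", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}]}',
    '{"File": "file1.txt", "Line": "1", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Other", "Score": 0.0381}]}',
]

def top_classes(lines):
    """Map (file, line) -> highest-scoring class name from multi-class output."""
    results = {}
    for raw in lines:
        record = json.loads(raw)
        best = max(record["Classes"], key=lambda c: c["Score"])
        results[(record["File"], record["Line"])] = best["Name"]
    return results

print(top_classes(output_lines))
# {('file1.txt', '0'): 'Documentary', ('file1.txt', '1'): 'Science_Fiction'}
```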

If your input data format is one document per file, the output file contains one line for each document. Each line includes the file name and the class or classes found in the document, each with the confidence score that Amazon Comprehend assigned to that classification.

For example:

```
{"File": "file0.txt", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]}
{"File": "file1.txt", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Science_Fiction", "Score": 0.0381}, {"Name": "Science_Fiction", "Score": 0.0372}]}
{"File": "file2.txt", "Classes": [{"Name": "Documentary", "Score": 0.1}, {"Name": "Documentary", "Score": 0.0381}, {"Name": "Domentary", "Score": 0.0372}]}
{"File": "file3.txt", "Classes": [{"Name": "Serious_Drama", "Score": 0.3141}, {"Name": "Other", "Score": 0.0381}, {"Name": "Other", "Score": 0.0372}]}
```

**Multi-label output**

When you use a classifier trained in multi-label mode, your results show `labels`. These labels are the categories that you defined when you trained your classifier.

The following examples use these unique labels.

```
SCIENCE_FICTION
ACTION
DRAMA
COMEDY
ROMANCE
```

If your input data format is one document per line, the output file contains one line for each line in the input. Each line includes the file name, the zero-based line number of the input line, and the label or labels found in the document, each with the confidence score that Amazon Comprehend assigned to that classification.

For example:

```
{"File": "file1.txt", "Line": "0", "Labels": [{"Name": "Action", "Score": 0.8642}, {"Name": "Drama", "Score": 0.650}, {"Name": "Science Fiction", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "1", "Labels": [{"Name": "Comedy", "Score": 0.5}, {"Name": "Action", "Score": 0.0381}, {"Name": "Drama", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "2", "Labels": [{"Name": "Action", "Score": 0.9934}, {"Name": "Drama", "Score": 0.0381}, {"Name": "Action", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "3", "Labels": [{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, {"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}
```
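In multi-label mode, more than one label can apply to a document, so a common post-processing step is to keep every label whose confidence meets a threshold. The following is a sketch over a record shaped like the output above; the threshold value is illustrative, not a recommendation.

```python
import json

# One multi-label output record in the documented format.
record = json.loads(
    '{"File": "file1.txt", "Line": "3", "Labels": ['
    '{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, '
    '{"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}'
)

def labels_above(record, threshold=0.75):
    """Return the names of all labels whose confidence meets the threshold."""
    return [label["Name"] for label in record["Labels"] if label["Score"] >= threshold]

print(labels_above(record))  # ['Romance', 'Comedy', 'Drama']
```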

If your input data format is one document per file, the output file contains one line for each document. Each line includes the file name and the label or labels found in the document, each with the confidence score that Amazon Comprehend assigned to that classification.

For example:

```
{"File": "file0.txt", "Labels": [{"Name": "Action", "Score": 0.8642}, {"Name": "Drama", "Score": 0.650}, {"Name": "Science Fiction", "Score": 0.0372}]}
{"File": "file1.txt", "Labels": [{"Name": "Comedy", "Score": 0.5}, {"Name": "Action", "Score": 0.0381}, {"Name": "Drama", "Score": 0.0372}]}
{"File": "file2.txt", "Labels": [{"Name": "Action", "Score": 0.9934}, {"Name": "Drama", "Score": 0.0381}, {"Name": "Action", "Score": 0.0372}]}
{"File": "file3.txt”, "Labels": [{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, {"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}
```

## Outputs for semi-structured input documents
<a name="outputs-class-async-other"></a>

For semi-structured input documents, the output can include the following additional fields:
+ DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the `Bytes` parameter.
+ DocumentType – The document type for each page in the input document. This field is present in the response if the request included the `Bytes` parameter.
+ Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.

For more details about these output fields, see [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) in the *Amazon Comprehend API Reference*.

The following example shows output for a two-page scanned PDF file.

```
[{ #First page output
    "Classes": [
        {
            "Name": "__label__2 ",
            "Score": 0.9993996620178223
        },
        {
            "Name": "__label__3 ",
            "Score": 0.0004330444789957255
        }
    ],
    "DocumentMetadata": {
        "PageNumber": 1,
        "Pages": 2
    },
    "DocumentType": "ScannedPDF",
    "File": "file.pdf",
    "Version": "VERSION_NUMBER"
},
#Second page output
{
    "Classes": [
        {
            "Name": "__label__2 ",
            "Score": 0.9993996620178223
        },
        {
            "Name": "__label__3 ",
            "Score": 0.0004330444789957255
        }
    ],
    "DocumentMetadata": {
        "PageNumber": 2,
        "Pages": 2
    },
    "DocumentType": "ScannedPDF",
    "File": "file.pdf",
    "Version": "VERSION_NUMBER" 
}]
```
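Each page of a semi-structured document produces its own entry, so collecting a per-page summary is a matter of iterating over the list. The following sketch works over a list shaped like the example above (abbreviated here); the `.strip()` call accounts for the trailing space in the example label names.

```python
# Abbreviated two-page output in the documented per-page format.
pages = [
    {"Classes": [{"Name": "__label__2 ", "Score": 0.9994}, {"Name": "__label__3 ", "Score": 0.0004}],
     "DocumentMetadata": {"PageNumber": 1, "Pages": 2}},
    {"Classes": [{"Name": "__label__2 ", "Score": 0.9994}, {"Name": "__label__3 ", "Score": 0.0004}],
     "DocumentMetadata": {"PageNumber": 2, "Pages": 2}},
]

def top_class_by_page(pages):
    """Map page number -> (class name, score) using the highest-scoring class."""
    result = {}
    for page in pages:
        best = max(page["Classes"], key=lambda c: c["Score"])
        page_number = page["DocumentMetadata"]["PageNumber"]
        result[page_number] = (best["Name"].strip(), best["Score"])
    return result

print(top_class_by_page(pages))
# {1: ('__label__2', 0.9994), 2: ('__label__2', 0.9994)}
```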