

# Running asynchronous jobs
<a name="running-classifiers"></a>

After you train a custom classifier, you can use asynchronous jobs to analyze large documents or multiple documents in one batch.

Custom classification accepts a variety of input document types. For details, see [Inputs for asynchronous custom analysis](idp-inputs-async.md).

If you plan to analyze image files or scanned PDF documents, your IAM policy must grant permissions to use two Amazon Textract API methods (DetectDocumentText and AnalyzeDocument). Amazon Comprehend invokes these methods during text extraction. For an example policy, see [Permissions required to perform document analysis actions](security_iam_id-based-policy-examples.md#security-iam-based-policy-perform-cmp-actions).

For classification of semi-structured documents (image, PDF, or Docx files) using a plain-text model, use the `one document per file` input format. Also, include the `DocumentReaderConfig` parameter in your [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) request.
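As an illustration, the `DocumentReaderConfig` and `InputDataConfig` portions of such a request might be assembled as follows. This is a minimal sketch: the field names come from the Amazon Comprehend API reference, while the S3 URI is a placeholder.

```python
# Sketch of the DocumentReaderConfig parameter for a
# StartDocumentClassificationJob request. The S3 URI is a placeholder.
document_reader_config = {
    # Use the Amazon Textract DetectDocumentText API for text extraction.
    "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
    # Force the chosen read action instead of the service default.
    "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
}

input_data_config = {
    "S3Uri": "s3://amzn-s3-demo-bucket/docclass/input/",
    "InputFormat": "ONE_DOC_PER_FILE",
    "DocumentReaderConfig": document_reader_config,
}
```

You would pass `input_data_config` as the `InputDataConfig` parameter of the request.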

**Topics**
+ [File formats for async analysis](class-inputs-async.md)
+ [Analysis jobs for custom classification (console)](analysis-jobs-custom-classifier.md)
+ [Analysis jobs for custom classification (API)](analysis-jobs-custom-class-api.md)
+ [Outputs for asynchronous analysis jobs](outputs-class-async.md)

# File formats for async analysis
<a name="class-inputs-async"></a>

When you run async analysis with your model, you have a choice of formats for input documents: `One document per line` or `one document per file`. The format you use depends on the type of documents you want to analyze, as described in the following table.


| Description | Format | 
| --- | --- | 
| The input contains multiple files. Each file contains one input document. This format is best for collections of large documents, such as newspaper articles or scientific papers. Also, use this format for semi-structured documents (image, PDF, or Docx files) using a native document classifier. | One document per file | 
|  The input is one or more files. Each line in the file is a separate input document. This format is best for short documents, such as text messages or social media posts.  | One document per line | 

**One document per file**

With the `one document per file` format, each file represents one input document.

**One document per line**

With the `One document per line` format, each document is placed on a separate line and no header is used. The label is not included on each line (since you don't yet know the label for the document). Each line of the file (the end of the individual document) must end with a line feed (LF, \n), a carriage return (CR, \r), or both (CRLF, \r\n). Don't use the UTF-8 line separator (U+2028) to end a line.

The following example shows the format of the input file.

```
Text of document 1 \n
Text of document 2 \n
Text of document 3 \n
Text of document 4 \n
```

For either format, use UTF-8 encoding for text files. After you prepare the files, place them in the S3 bucket that you're using for input data.
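As an illustration, the following sketch writes a set of short documents in the `One document per line` format, collapsing any internal line breaks so that each document occupies exactly one UTF-8, LF-terminated line. The document texts and file name are hypothetical.

```python
documents = [
    "First short document.",
    "Second document\nwith an internal line break.",
    "Third short document.",
]

def write_one_doc_per_line(docs, path):
    """Write each document on its own line, UTF-8 encoded and LF-terminated."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for doc in docs:
            # Collapse internal line breaks so each document stays on one line.
            f.write(doc.replace("\r", " ").replace("\n", " ") + "\n")

write_one_doc_per_line(documents, "input.txt")
```

After writing the file, upload it to the S3 bucket you plan to use for input data.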

When you start a classification job, you specify the Amazon S3 location of your input data. The URI must be in the same Region as the API endpoint that you are calling. The URI can point to a single file (as when using the `One document per line` format), or it can be the prefix for a collection of data files.

For example, if you use the URI `S3://bucketName/prefix` and the prefix is a single file, Amazon Comprehend uses that file as input. If more than one file begins with the prefix, Amazon Comprehend uses all of them as input.

Grant Amazon Comprehend access to the S3 bucket that contains your document collection and output files. For more information, see [Role-based permissions required for asynchronous operations](security_iam_id-based-policy-examples.md#auth-role-permissions).

# Analysis jobs for custom classification (console)
<a name="analysis-jobs-custom-classifier"></a>

After you create and train a custom document classifier, you can use the console to run custom classification jobs with the model.

**To create a custom classification job (console)**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Analysis jobs** and then choose **Create job**.

1. Give the classification job a name. The name must be unique to your account and current Region.

1. Under **Analysis type**, choose **Custom classification**.

1. From **Select classifier**, choose the custom classifier to use.

1. (Optional) If you choose to encrypt the data that Amazon Comprehend uses while processing your job, choose **Job encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key ID under **KMS key ARN**.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [Key management service (KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Input data**, enter the location of the Amazon S3 bucket that contains your input documents or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the classification job must have reading permissions for the S3 bucket.

   To achieve the highest accuracy, match the type of the input documents to the classifier model type. The classifier job returns a warning if you submit native documents to a plain-text model, or plain-text documents to a native document model. For more information, see [Training classification models](training-classifier-model.md).

1. (Optional) For **Input format**, you can choose the format of the input documents. The format can be one document per file, or one document per line in a single file. One document per line applies only to text documents. 

1. (Optional) For **Document read mode**, you can override the default text extraction actions. For more information, see [Setting text extraction options](idp-set-textract-options.md). 

1. Under **Output data**, enter the location of the Amazon S3 bucket where Amazon Comprehend should write the job's output data or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the classification job must have write permissions for the S3 bucket.

1. (Optional) If you choose to encrypt the output result from your job, choose **Encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key alias or ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key alias or ID under **KMS key ID**.

1. (Optional) To launch the resources for this job into a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list.

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your classification job, the `DataAccessRole` used for the Create and Start operations must grant permissions to the VPC that accesses the output bucket.

1. Choose **Create job** to create the document classification job.

# Analysis jobs for custom classification (API)
<a name="analysis-jobs-custom-class-api"></a>

After you [create and train](train-custom-classifier-api.md) a custom document classifier, you can use the classifier to run analysis jobs.

Use the [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) operation to start classifying unlabeled documents. You specify the S3 bucket that contains the input documents, the S3 bucket for the output documents, and the classifier to use.

To achieve the highest accuracy, match the type of the input documents to the classifier model type. The classifier job returns a warning if you submit native documents to a plain-text model, or plain-text documents to a native document model. For more information, see [Training classification models](training-classifier-model.md).

 [StartDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartDocumentClassificationJob.html) is asynchronous. Once you have started the job, use the [DescribeDocumentClassificationJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassificationJob.html) operation to monitor its progress. When the `Status` field in the response shows `COMPLETED`, you can access the output in the location that you specified.

**Topics**
+ [Using the AWS Command Line Interface](#get-started-api-customclass-cli)
+ [Using the AWS SDK for Java or SDK for Python](#get-started-api-customclass-java)

## Using the AWS Command Line Interface
<a name="get-started-api-customclass-cli"></a>

The following examples demonstrate the `StartDocumentClassificationJob` operation and other custom classifier APIs with the AWS CLI.

The following examples use the command format for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

Run a custom classification job using the `StartDocumentClassificationJob` operation.

```
aws comprehend start-document-classification-job \
     --region region \
     --document-classifier-arn arn:aws:comprehend:region:account number:document-classifier/testDelete \
     --input-data-config S3Uri=s3://S3Bucket/docclass/file name,InputFormat=ONE_DOC_PER_LINE \
     --output-data-config S3Uri=s3://S3Bucket/output \
     --data-access-role-arn arn:aws:iam::account number:role/resource name
```

Get information on a custom classifier with the job id using the `DescribeDocumentClassificationJob` operation.

```
aws comprehend describe-document-classification-job \
     --region region \
     --job-id job id
```

List all custom classification jobs in your account using the `ListDocumentClassificationJobs` operation.

```
aws comprehend list-document-classification-jobs \
     --region region
```

## Using the AWS SDK for Java or SDK for Python
<a name="get-started-api-customclass-java"></a>

For SDK examples of how to start a custom classifier job, see [Use `StartDocumentClassificationJob` with an AWS SDK or CLI](example_comprehend_StartDocumentClassificationJob_section.md).
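As a minimal sketch with the SDK for Python (Boto3), the following builds a `StartDocumentClassificationJob` request and polls `DescribeDocumentClassificationJob` until the job reaches a terminal status. The ARNs, S3 URIs, and Region are placeholders that you would replace with your own values.

```python
import time

def build_request(classifier_arn, input_uri, output_uri, role_arn):
    """Assemble the StartDocumentClassificationJob parameters."""
    return {
        "DocumentClassifierArn": classifier_arn,
        "InputDataConfig": {"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
    }

def run_job(params, region="us-east-1", poll_seconds=60):
    """Start the classification job and wait for a terminal status."""
    import boto3  # imported here so the request builder stays dependency-free
    comprehend = boto3.client("comprehend", region_name=region)
    job_id = comprehend.start_document_classification_job(**params)["JobId"]
    while True:
        props = comprehend.describe_document_classification_job(JobId=job_id)
        status = props["DocumentClassificationJobProperties"]["JobStatus"]
        if status in ("COMPLETED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_seconds)
```

When the returned status is `COMPLETED`, the output archive is available at the `OutputDataConfig` location.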

# Outputs for asynchronous analysis jobs
<a name="outputs-class-async"></a>

After an analysis job completes, it stores the results in the S3 bucket that you specified in the request.

## Outputs for text inputs
<a name="outputs-class-async-text"></a>

For text input documents, in either classifier mode (multi-class or multi-label), the job output consists of a single file named `output.tar.gz`. It's a compressed archive file that contains a text file with the output.
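To read the results, extract the archive and process the text file line by line. The following sketch uses the standard-library `tarfile` module to yield each non-empty line; the archive path is a placeholder.

```python
import tarfile

def read_output_lines(archive_path):
    """Yield each non-empty line from every file inside output.tar.gz."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            extracted = tar.extractfile(member)
            for raw in extracted:
                line = raw.decode("utf-8").strip()
                if line:
                    yield line
```

Each yielded line is one JSON object describing a classified document.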

**Multi-class output**

When you use a classifier trained in multi-class mode, your results show `classes`. These are the classes that formed the set of categories when you trained your classifier.

For more details about these output fields, see [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) in the *Amazon Comprehend API Reference*.

The following examples use this set of mutually exclusive classes.

```
DOCUMENTARY
SCIENCE_FICTION
ROMANTIC_COMEDY
SERIOUS_DRAMA
OTHER
```

If your input data format is one document per line, the output file contains one line for each line in the input. Each line includes the file name, the zero-based line number of the input line, and the class or classes found in the document. It ends with the confidence that Amazon Comprehend has that the individual instance was correctly classified.

For example:

```
{"File": "file1.txt", "Line": "0", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "1", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Other", "Score": 0.0381}, {"Name": "Romantic_Comedy", "Score": 0.0372}]}
{"File": "file2.txt", "Line": "2", "Classes": [{"Name": "Documentary", "Score": 0.1}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]}
{"File": "file2.txt", "Line": "3", "Classes": [{"Name": "Serious_Drama", "Score": 0.3141}, {"Name": "Other", "Score": 0.0381}, {"Name": "Documentary", "Score": 0.0372}]}
```

If your input data format is one document per file, the output file contains one line for each document. Each line has the name of the file and the class or classes found in the document. It ends with the confidence that Amazon Comprehend classified the individual instance accurately.

For example:

```
{"File": "file0.txt", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]}
{"File": "file1.txt", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Other", "Score": 0.0381}, {"Name": "Romantic_Comedy", "Score": 0.0372}]}
{"File": "file2.txt", "Classes": [{"Name": "Documentary", "Score": 0.1}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]}
{"File": "file3.txt", "Classes": [{"Name": "Serious_Drama", "Score": 0.3141}, {"Name": "Other", "Score": 0.0381}, {"Name": "Documentary", "Score": 0.0372}]}
```
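Because each output line is a JSON object, selecting the top class per document is straightforward. The following sketch parses lines shaped like the examples above and returns the highest-scoring class for each record.

```python
import json

def top_class_per_document(output_lines):
    """Map each (file, line) pair to its highest-scoring class."""
    results = {}
    for line in output_lines:
        record = json.loads(line)
        # "Line" is absent in the one-document-per-file format.
        key = (record["File"], record.get("Line"))
        best = max(record["Classes"], key=lambda c: c["Score"])
        results[key] = (best["Name"], best["Score"])
    return results
```

For multi-class results, the top class is usually the one you act on, since the classes are mutually exclusive.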

**Multi-label output**

When you use a classifier trained in multi-label mode, your results show `labels`. These are the labels that formed the set of categories when you trained your classifier.

The following examples use these unique labels.

```
SCIENCE_FICTION
ACTION
DRAMA
COMEDY
ROMANCE
```

If your input data format is one document per line, the output file contains one line for each line in the input. Each line includes the file name, the zero-based line number of the input line, and the class or classes found in the document. It ends with the confidence that Amazon Comprehend has that the individual instance was correctly classified.

For example:

```
{"File": "file1.txt", "Line": "0", "Labels": [{"Name": "Action", "Score": 0.8642}, {"Name": "Drama", "Score": 0.650}, {"Name": "Science_Fiction", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "1", "Labels": [{"Name": "Comedy", "Score": 0.5}, {"Name": "Action", "Score": 0.0381}, {"Name": "Drama", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "2", "Labels": [{"Name": "Action", "Score": 0.9934}, {"Name": "Drama", "Score": 0.0381}, {"Name": "Science_Fiction", "Score": 0.0372}]}
{"File": "file1.txt", "Line": "3", "Labels": [{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, {"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}
```

If your input data format is one document per file, the output file contains one line for each document. Each line has the name of the file and the class or classes found in the document. It ends with the confidence that Amazon Comprehend classified the individual instance accurately.

For example:

```
{"File": "file0.txt", "Labels": [{"Name": "Action", "Score": 0.8642}, {"Name": "Drama", "Score": 0.650}, {"Name": "Science_Fiction", "Score": 0.0372}]}
{"File": "file1.txt", "Labels": [{"Name": "Comedy", "Score": 0.5}, {"Name": "Action", "Score": 0.0381}, {"Name": "Drama", "Score": 0.0372}]}
{"File": "file2.txt", "Labels": [{"Name": "Action", "Score": 0.9934}, {"Name": "Drama", "Score": 0.0381}, {"Name": "Science_Fiction", "Score": 0.0372}]}
{"File": "file3.txt", "Labels": [{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, {"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}
```
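Because multi-label results can assign several labels to one document, a common pattern is to keep only the labels whose confidence exceeds a threshold. The following sketch illustrates this; the 0.5 cutoff is an arbitrary assumption for the example, not a service default.

```python
import json

def labels_above_threshold(output_lines, threshold=0.5):
    """Return, per file, the labels whose scores exceed the threshold."""
    selected = {}
    for line in output_lines:
        record = json.loads(line)
        selected[record["File"]] = [
            label["Name"] for label in record["Labels"]
            if label["Score"] > threshold
        ]
    return selected
```

Choose a threshold that balances precision and recall for your use case; you can tune it against a labeled validation set.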

## Outputs for semi-structured input documents
<a name="outputs-class-async-other"></a>

For semi-structured input documents, the output can include the following additional fields:
+ DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the `Bytes` parameter.
+ DocumentType – The document type of each page in the input document. This field is present in the response if the request included the `Bytes` parameter.
+ Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.

For more details about these output fields, see [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) in the *Amazon Comprehend API Reference*.

The following example shows output for a two-page scanned PDF file.

```
[{ #First page output
    "Classes": [
        {
            "Name": "__label__2 ",
            "Score": 0.9993996620178223
        },
        {
            "Name": "__label__3 ",
            "Score": 0.0004330444789957255
        }
    ],
    "DocumentMetadata": {
        "PageNumber": 1,
        "Pages": 2
    },
    "DocumentType": "ScannedPDF",
    "File": "file.pdf",
    "Version": "VERSION_NUMBER"
},
#Second page output
{
    "Classes": [
        {
            "Name": "__label__2 ",
            "Score": 0.9993996620178223
        },
        {
            "Name": "__label__3 ",
            "Score": 0.0004330444789957255
        }
    ],
    "DocumentMetadata": {
        "PageNumber": 2,
        "Pages": 2
    },
    "DocumentType": "ScannedPDF",
    "File": "file.pdf",
    "Version": "VERSION_NUMBER" 
}]
```
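Because semi-structured output produces one entry per page, you may want to aggregate page-level classes into a single document-level result. The following sketch averages each class's score across the pages of one document; averaging is an illustrative choice for the example, not part of the service output.

```python
from collections import defaultdict

def average_scores_across_pages(page_results):
    """Average each class's score over all pages of one document."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for page in page_results:
        for cls in page["Classes"]:
            # Page output can carry trailing spaces in class names, so strip them.
            name = cls["Name"].strip()
            totals[name] += cls["Score"]
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}
```

Depending on your application, taking the maximum score per class across pages can be a reasonable alternative to averaging.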