

# Document processing
<a name="idp"></a>

Amazon Comprehend supports one-step document processing for custom classification and custom entity recognition. For example, you can input a mix of plain text documents and semi-structured documents (such as PDF documents, Microsoft Word documents, and images) to a custom analysis job.

For input files that require text extraction, Amazon Comprehend automatically performs the text extraction before running the analysis. To extract the text content, Amazon Comprehend uses an internal parser for native semi-structured documents and uses Amazon Textract APIs for images and scanned documents. 

Amazon Comprehend document processing is available in each of the Amazon Comprehend [Supported Regions](guidelines-and-limits.md#limits-regions), with two exceptions: Asia Pacific (Tokyo) and AWS GovCloud (US-West) support only plain-text models for custom classification.

The following topics provide details about the input document types that Amazon Comprehend supports for custom analysis.

**Topics**
+ [Inputs for real-time custom analysis](idp-inputs-sync.md)
+ [Inputs for asynchronous custom analysis](idp-inputs-async.md)
+ [Setting text extraction options](idp-set-textract-options.md)
+ [Best practices for images](idp-images-bp.md)

# Inputs for real-time custom analysis
<a name="idp-inputs-sync"></a>

Real-time analysis using custom models takes a single document as input. The following topics describe the input document types that you can use.

**Topics**
+ [Plain text documents](#idp-inputs-sync-text)
+ [Semi-structured documents](#idp-inputs-sync-semi)
+ [Image files and scanned PDF files](#idp-inputs-sync-ocr)
+ [Amazon Textract output](#idp-inputs-sync-textract)
+ [Maximum document sizes for real-time analysis](#idp-inputs-sync-sizes)
+ [Errors in semi-structured documents](#idp-inputs-sync-err)

## Plain text documents
<a name="idp-inputs-sync-text"></a>

Provide the input document as UTF-8-formatted text. 

## Semi-structured documents
<a name="idp-inputs-sync-semi"></a>

Semi-structured documents include native PDF documents and Word documents. 

By default, real-time custom analysis uses the Amazon Comprehend parser to extract the text from Word files and digital PDF files. For PDF files, you can override this default and use Amazon Textract to extract the text. See [Setting text extraction options](idp-set-textract-options.md).

## Image files and scanned PDF files
<a name="idp-inputs-sync-ocr"></a>

Supported image types include JPEG, PNG, and TIFF.

By default, custom entity recognition uses the Amazon Textract `DetectDocumentText` API operation to extract the text from image files and scanned PDF files. You can override this default to use the `AnalyzeDocument` API operation instead. See [Setting text extraction options](idp-set-textract-options.md).

## Amazon Textract output
<a name="idp-inputs-sync-textract"></a>

You can provide the JSON output from the Amazon Textract `DetectDocumentText` API or `AnalyzeDocument` API as input to the real-time API operations for custom classification and custom entity recognition. Amazon Comprehend supports this input type for the real-time API operations, but not for the console.

## Maximum document sizes for real-time analysis
<a name="idp-inputs-sync-sizes"></a>

For all input document types, the input file can contain at most one page and no more than 10,000 characters.

The following table shows the maximum file sizes for input documents. 


| File type | Maximum size (API) | Maximum size (console) | 
| --- | --- | --- | 
| UTF-8 text documents | 10 KB | 10 KB | 
| PDF documents | 10 MB | 5 MB | 
| Word documents | 10 MB | 1 MB | 
| Image files | 10 MB | 5 MB | 
| Textract output files | 1 MB | n/a | 
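
As an illustration, the limits in the table above can be encoded as a small pre-flight check that runs before a request is sent. This is a minimal sketch, not part of any AWS SDK: the function name `fits_realtime_limit`, the type keys, and the reading of 1 KB as 1,024 bytes are assumptions made for the example. The one-page and 10,000-character limits are not checked here.

```python
# Maximum real-time input sizes in bytes, copied from the table above.
# Assumption: 1 KB = 1024 bytes and 1 MB = 1024 KB; the table does not say.
KB = 1024
MB = 1024 * KB

# None means the input type is not available through that channel.
REALTIME_MAX_BYTES = {
    "text":     {"api": 10 * KB, "console": 10 * KB},
    "pdf":      {"api": 10 * MB, "console": 5 * MB},
    "word":     {"api": 10 * MB, "console": 1 * MB},
    "image":    {"api": 10 * MB, "console": 5 * MB},
    "textract": {"api": 1 * MB,  "console": None},
}


def fits_realtime_limit(file_type: str, size_bytes: int, channel: str = "api") -> bool:
    """Return True if a file of this type and size is within the table's limit."""
    limit = REALTIME_MAX_BYTES[file_type][channel]
    if limit is None:
        # For example, Textract output is not accepted by the console.
        return False
    return 0 < size_bytes <= limit
```

For example, `fits_realtime_limit("pdf", 6 * MB)` passes for the API, while the same file fails the check with `channel="console"`.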

## Errors in semi-structured documents
<a name="idp-inputs-sync-err"></a>

 The [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) or [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) API operation can encounter document-level or page-level errors while extracting text from a semi-structured document or an image file.

### Page-level errors
<a name="idp-inputs-sync-page-err"></a>

 If the [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) or [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) API operation encounters errors while processing a page in the input document, the API response includes an entry in the [Errors list](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ErrorsListItem.html) for each error.

The `ErrorCode` in the error list entry contains one of the following values:
+ `TEXTRACT_BAD_PAGE` – Amazon Textract cannot read the page. For more information about page limits in Amazon Textract, see [Page Quotas in Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/limits-document.html).
+ `TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED` – The number of requests exceeded your throughput limit. For more information about throughput quotas in Amazon Textract, see [Default quotas in Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/limits-quotas-explained.html).
+ `PAGE_CHARACTERS_EXCEEDED` – Too many text characters on the page (10,000 characters maximum).
+ `PAGE_SIZE_EXCEEDED` – The maximum page size is 10 MB.
+ `INTERNAL_SERVER_ERROR` – The request encountered a service issue. Try the API request again.
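
One way to act on these codes is to separate the errors that may clear on a retry from those that require a change to the input. The sketch below is illustrative only. It assumes the response is a dictionary whose `Errors` entries carry `Page`, `ErrorCode`, and `ErrorMessage` keys, and the helper name `split_page_errors` is hypothetical. Treating the throughput error as retryable is also an assumption; the list above recommends a retry only for `INTERNAL_SERVER_ERROR`.

```python
from typing import Dict, List

# Assumption: these two codes are worth retrying (after a pause for the
# throughput error). All other codes point at a problem with the page itself.
RETRYABLE_CODES = {
    "TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED",
    "INTERNAL_SERVER_ERROR",
}


def split_page_errors(response: dict) -> Dict[str, List[dict]]:
    """Group the entries of a response's Errors list by suggested follow-up."""
    result: Dict[str, List[dict]] = {"retryable": [], "fix_input": []}
    for entry in response.get("Errors", []):
        if entry.get("ErrorCode") in RETRYABLE_CODES:
            result["retryable"].append(entry)
        else:
            result["fix_input"].append(entry)
    return result
```

A caller might resubmit the request when only `retryable` entries are present, and report the `Page` and `ErrorMessage` of each `fix_input` entry otherwise.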

### Document-level errors
<a name="idp-inputs-sync-doc-err"></a>

If the [ClassifyDocument](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_ClassifyDocument.html) or [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) API operation detects a document-level error in your input document, the API returns an `InvalidRequestException` error response. 

In the error response, the **Reason** field contains the value `INVALID_DOCUMENT`. 

The **Detail** field contains one of the following values:
+ `DOCUMENT_SIZE_EXCEEDED` – Document size is too large. Check the size of your file and resubmit the request.
+ `UNSUPPORTED_DOC_TYPE` – Document type is not supported. Check the file type and resubmit the request.
+ `PAGE_LIMIT_EXCEEDED` – Too many pages in the document. Check the number of pages in your file and resubmit the request.
+ `TEXTRACT_ACCESS_DENIED_EXCEPTION` – Access denied to Amazon Textract. Verify that your account has permission to use the Amazon Textract [DetectDocumentText](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html) and [AnalyzeDocument](https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html) API operations and resubmit the request.
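
The mapping from detail value to corrective action can be captured in a lookup table, as in the following sketch. The function name `describe_document_error` is hypothetical. How the **Reason** and **Detail** values are read from the exception depends on the SDK in use, so the sketch takes them as plain strings and leaves that step to the caller.

```python
# Corrective actions, paraphrased from the list above.
DETAIL_ACTIONS = {
    "DOCUMENT_SIZE_EXCEEDED": "Check the size of your file and resubmit the request.",
    "UNSUPPORTED_DOC_TYPE": "Check the file type and resubmit the request.",
    "PAGE_LIMIT_EXCEEDED": "Check the number of pages in your file and resubmit the request.",
    "TEXTRACT_ACCESS_DENIED_EXCEPTION": (
        "Verify that your account has permission to use the Amazon Textract "
        "DetectDocumentText and AnalyzeDocument API operations, then resubmit."
    ),
}


def describe_document_error(reason: str, detail: str) -> str:
    """Return a suggested action for a document-level error."""
    if reason != "INVALID_DOCUMENT":
        # Other reasons are outside the scope of this section.
        return "Not a document-level error; see the exception message."
    return DETAIL_ACTIONS.get(detail, f"Unrecognized detail value: {detail}")
```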

# Inputs for asynchronous custom analysis
<a name="idp-inputs-async"></a>

You can input multiple documents to a custom async analysis job. The following topics describe the input document types that you can use. The maximum file size varies depending on the type of input document. 

**Topics**
+ [Plain text documents](#idp-inputs-async-text)
+ [Semi-structured documents](#idp-inputs-async-semi)
+ [Image files and scanned PDF files](#idp-inputs-async-ocr)
+ [Amazon Textract output JSON files](#idp-inputs-async-textract)

## Plain text documents
<a name="idp-inputs-async-text"></a>

Provide all plain-text input documents as UTF-8-formatted text. The following table lists the maximum file sizes and other guidelines. 

**Note**  
These limits apply when **all** the input files are plain text.


| Description | Quota/Guideline | 
| --- | --- | 
| Maximum file size for one document per file format (Custom classification) | 1 byte–10 MB | 
| Document size (Custom entity recognition) | 1 byte–1 MB | 
| Maximum number of files, one document per file | 1,000,000 | 
| Maximum number of lines, one document per line (for all files in request) | 1,000,000 | 
| Document corpus size (all docs in plaintext combined) | 1 byte–5 GB | 

## Semi-structured documents
<a name="idp-inputs-async-semi"></a>

Semi-structured documents include native PDF documents and Word documents. 

The following table lists the maximum file sizes and other guidelines.


| Description | Quota/Guideline | 
| --- | --- | 
| Document size (PDF) | 1 byte–50 MB | 
| Document size (Docx) | 1 byte–5 MB | 
| Maximum number of files | 500 | 
| Maximum number of pages for a PDF or Docx file | 100 | 
| Document corpus size after text extraction (plaintext, all files combined) | 1 byte–5 GB | 
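
The per-file and per-job limits in the table above can be checked before a job is submitted. The following is a rough sketch under stated assumptions: the `InputFile` record and the `find_violations` helper are hypothetical, page counts are assumed to be known in advance, and 1 MB is read as 1,024 × 1,024 bytes. The corpus size after text extraction is not checked, because it is only known once extraction has run.

```python
import os
from typing import List, NamedTuple

MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILES = 500
MAX_PAGES = 100
MAX_BYTES = {".pdf": 50 * MB, ".docx": 5 * MB}


class InputFile(NamedTuple):
    name: str
    size_bytes: int
    pages: int


def find_violations(files: List[InputFile]) -> List[str]:
    """Return one message per limit that the batch or a file exceeds."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files exceeds the maximum of {MAX_FILES}")
    for f in files:
        ext = os.path.splitext(f.name)[1].lower()
        limit = MAX_BYTES.get(ext)
        if limit is None:
            problems.append(f"{f.name}: not a PDF or Docx file")
            continue
        if not 1 <= f.size_bytes <= limit:
            problems.append(f"{f.name}: size must be between 1 and {limit} bytes")
        if f.pages > MAX_PAGES:
            problems.append(f"{f.name}: {f.pages} pages exceeds the maximum of {MAX_PAGES}")
    return problems
```

An empty list means that none of the checked limits is exceeded.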

By default, custom analysis uses the Amazon Comprehend parser to extract the text from Word files and digital PDF files. For PDF files, you can override this default and use Amazon Textract to extract the text. See [Setting text extraction options](idp-set-textract-options.md).

## Image files and scanned PDF files
<a name="idp-inputs-async-ocr"></a>

Custom analysis supports JPEG, PNG, and TIFF images.

The following table lists the maximum file sizes for images. Scanned PDF files are subject to the same maximum sizes as native PDF files.


| Description | Quota/Guideline | 
| --- | --- | 
| Image size (JPG or PNG) | 1 byte–10 MB | 
| Image size (TIFF) | 1 byte–10 MB. Maximum one page. | 

For additional information about images, see [Best practices for images](idp-images-bp.md).

By default, Amazon Comprehend uses the Amazon Textract `DetectDocumentText` API operation to extract the text from image files and scanned PDF files. You can override this default to use the `AnalyzeDocument` API operation instead. See [Setting text extraction options](idp-set-textract-options.md).

## Amazon Textract output JSON files
<a name="idp-inputs-async-textract"></a>

For custom entity recognition, but not custom classification, you can provide the output file from the Amazon Textract `AnalyzeDocument` API operation as input to analysis jobs. 

# Setting text extraction options
<a name="idp-set-textract-options"></a>

By default, Amazon Comprehend performs the following actions to extract text from a file, based on the input file type: 
+ **Word files** – The Amazon Comprehend parser extracts the text. 
+ **Digital PDF files** – The Amazon Comprehend parser extracts the text. 
+ **Image files and scanned PDF files** – Amazon Comprehend uses the Amazon Textract `DetectDocumentText` API to extract the text. 

For image files and PDF files, you can use the `DocumentReaderConfig` parameter to override these default extraction actions. This parameter is available when you use the Amazon Comprehend console or API for real-time or asynchronous custom analysis.

The `DocumentReaderConfig` parameter contains three fields:
+ **DocumentReadMode** – Set to `SERVICE_DEFAULT` for Amazon Comprehend to perform the default actions. 

  Set to `FORCE_DOCUMENT_READ_ACTION` to use Amazon Textract to parse digital PDF files.
+ **DocumentReadAction** – Sets the Amazon Textract API (DetectDocumentText or AnalyzeDocument) to use when Amazon Comprehend uses Amazon Textract for text extraction.
+ **FeatureTypes** – If you set **DocumentReadAction** to use the AnalyzeDocument API operation, you can add one or both of the `FeatureTypes` (TABLES, FORMS). These features provide additional information about the tables and forms in the document. For more information about these features, see [Amazon Textract Document Analysis Response Objects](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html).

The following examples show how to configure `DocumentReaderConfig` for specific use cases:

1. Use Amazon Textract for all PDF files. 

   1. **DocumentReadMode** – Set to `FORCE_DOCUMENT_READ_ACTION`.

   1. **DocumentReadAction** – Set to `TEXTRACT_DETECT_DOCUMENT_TEXT`.

   1. **FeatureTypes** – Not required.

1. Use Amazon Textract `AnalyzeDocument` API for all PDF and image files. 

   1. **DocumentReadMode** – Set to `FORCE_DOCUMENT_READ_ACTION`.

   1. **DocumentReadAction** – Set to `TEXTRACT_ANALYZE_DOCUMENT`.

   1. **FeatureTypes** – Set to `TABLES`, `FORMS` or both features.

1. Use Amazon Textract `AnalyzeDocument` API for scanned PDF files and all image files. 

   1. **DocumentReadMode** – Set to `SERVICE_DEFAULT`.

   1. **DocumentReadAction** – Set to `TEXTRACT_ANALYZE_DOCUMENT`.

   1. **FeatureTypes** – Set to `TABLES`, `FORMS` or both features.

For more information about the Amazon Textract options, see [DocumentReaderConfig](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DocumentReaderConfig.html).
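
The three use cases above can be expressed as plain dictionaries that follow the `DocumentReaderConfig` field names. The sketch below is illustrative: the helper `build_reader_config` is hypothetical, and rejecting `FeatureTypes` when the action is not `TEXTRACT_ANALYZE_DOCUMENT` is a design choice made for this example rather than documented service behavior.

```python
from typing import Iterable, Optional

READ_MODES = {"SERVICE_DEFAULT", "FORCE_DOCUMENT_READ_ACTION"}
READ_ACTIONS = {"TEXTRACT_DETECT_DOCUMENT_TEXT", "TEXTRACT_ANALYZE_DOCUMENT"}
FEATURE_TYPES = {"TABLES", "FORMS"}


def build_reader_config(
    read_mode: str,
    read_action: str,
    feature_types: Optional[Iterable[str]] = None,
) -> dict:
    """Build a DocumentReaderConfig-shaped dict, validating the field values."""
    if read_mode not in READ_MODES:
        raise ValueError(f"Unknown DocumentReadMode: {read_mode}")
    if read_action not in READ_ACTIONS:
        raise ValueError(f"Unknown DocumentReadAction: {read_action}")
    config = {"DocumentReadMode": read_mode, "DocumentReadAction": read_action}
    features = sorted(set(feature_types or []))
    if features:
        unknown = set(features) - FEATURE_TYPES
        if unknown:
            raise ValueError(f"Unknown FeatureTypes: {sorted(unknown)}")
        if read_action != "TEXTRACT_ANALYZE_DOCUMENT":
            # Assumption: treat this as a caller mistake instead of ignoring it.
            raise ValueError("FeatureTypes apply only to TEXTRACT_ANALYZE_DOCUMENT")
        config["FeatureTypes"] = features
    return config


# Use case 1: Amazon Textract for all PDF files.
ALL_PDFS = build_reader_config(
    "FORCE_DOCUMENT_READ_ACTION", "TEXTRACT_DETECT_DOCUMENT_TEXT"
)
# Use case 2: AnalyzeDocument for all PDF and image files.
ANALYZE_ALL = build_reader_config(
    "FORCE_DOCUMENT_READ_ACTION", "TEXTRACT_ANALYZE_DOCUMENT", ["TABLES", "FORMS"]
)
# Use case 3: AnalyzeDocument for scanned PDF files and all image files.
ANALYZE_SCANNED = build_reader_config(
    "SERVICE_DEFAULT", "TEXTRACT_ANALYZE_DOCUMENT", ["TABLES"]
)
```

The resulting dictionary would then be passed as the `DocumentReaderConfig` argument of a real-time or asynchronous custom analysis request. Check the exact request shape against the API reference for the operation you call.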

# Best practices for images
<a name="idp-images-bp"></a>

When you use image files for custom classification or custom entity recognition, use the following guidelines to achieve the best results:
+ Provide a high quality image, ideally at least 150 DPI.
+ If the image file uses one of the supported formats (TIFF, JPEG, or PNG), don't convert or downsample the file before uploading it to Amazon S3.

For the best results when extracting text from tables in documents, make sure that the tables meet the following conditions:
+ Tables in your document are visually separated from surrounding elements on the page. For example, the table isn't overlaid onto an image or complex pattern.
+ Text within the table is upright. For example, the text isn't rotated relative to other text on the page.

When extracting text from tables, you might see inconsistent results for the following cases: 
+ Merged table cells span multiple columns.
+ Tables have cells, rows, or columns that are different from other parts of the same table.