Inputs for asynchronous custom analysis - Amazon Comprehend

Inputs for asynchronous custom analysis

You can provide multiple documents as input to a custom asynchronous analysis job. The following topics describe the input document types that you can use. The maximum file size depends on the type of input document.

Plain text documents

Provide all plain-text input documents as UTF-8-formatted text. The following table lists the maximum file sizes and other guidelines.

Note: These limits apply when all the input files are plain text.

Description | Quota/Guideline
Maximum file size for one document per file format (Custom classification) | 1 byte–10 MB
Document size (Custom entity recognition) | 1 byte–1 MB
Maximum number of files, one document per file | 1,000,000
Maximum number of lines, one document per line (for all files in request) | 1,000,000
Document corpus size (all docs in plaintext combined) | 1 byte–5 GB
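
The per-document limits above can be checked locally before you upload input files. The following is a minimal sketch, not part of the Amazon Comprehend API: the function name, the task labels, and the choice to treat 1 MB as 2**20 bytes are all assumptions made for illustration.

```python
from typing import List

MB = 1024 * 1024  # assumption: the quota's "MB" is treated as 2**20 bytes

# Per-document limits copied from the plain-text table.
MAX_CLASSIFICATION_DOC_BYTES = 10 * MB
MAX_ENTITY_DOC_BYTES = 1 * MB


def check_plain_text_document(data: bytes, task: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed.

    `task` is a hypothetical label: "classification" or "entity_recognition".
    """
    limits = {
        "classification": MAX_CLASSIFICATION_DOC_BYTES,
        "entity_recognition": MAX_ENTITY_DOC_BYTES,
    }
    if task not in limits:
        raise ValueError(f"unknown task: {task!r}")

    problems = []
    if len(data) < 1:
        problems.append("document is empty (minimum size is 1 byte)")
    if len(data) > limits[task]:
        problems.append(f"document exceeds {limits[task]} bytes")
    try:
        # Plain-text input must be UTF-8, so a failed decode is a problem.
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("document is not valid UTF-8")
    return problems
```

This sketch covers only the size and encoding of a single document. The limits on the number of files, the number of lines, and the total corpus size would need a separate pass over the whole request.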

Semi-structured documents

Semi-structured documents include native PDF documents and Word documents.

The following table lists the maximum file sizes and other guidelines.

Description | Quota/Guideline
Document size (PDF) | 1 byte–50 MB
Document size (Docx) | 1 byte–5 MB
Maximum number of files | 500
Maximum number of pages for a PDF or Docx file | 100
Document corpus size after text extraction (plaintext, all files combined) | 1 byte–5 GB
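
As an illustration of how the file-count and file-size limits above combine, the following sketch checks a batch of files described by name and size in bytes. The function name and the input shape are hypothetical, and 1 MB is again assumed to mean 2**20 bytes. The page-count limit and the corpus size after text extraction are not checked, because both require opening or parsing the files.

```python
import os
from typing import Dict, List

MB = 1024 * 1024  # assumption: the quota's "MB" is treated as 2**20 bytes

# Limits copied from the semi-structured documents table.
MAX_BYTES_BY_EXTENSION = {".pdf": 50 * MB, ".docx": 5 * MB}
MAX_FILES = 500


def check_semi_structured_batch(sizes_by_name: Dict[str, int]) -> List[str]:
    """Return a list of problems for a batch given as {file name: size in bytes}."""
    problems = []
    if len(sizes_by_name) > MAX_FILES:
        problems.append(
            f"{len(sizes_by_name)} files exceeds the limit of {MAX_FILES}"
        )
    for name, size in sizes_by_name.items():
        # Match on the extension, ignoring case (".PDF" and ".pdf" are the same).
        extension = os.path.splitext(name)[1].lower()
        limit = MAX_BYTES_BY_EXTENSION.get(extension)
        if limit is None:
            problems.append(f"{name}: not a PDF or Word (.docx) file")
        elif not 1 <= size <= limit:
            problems.append(f"{name}: size of {size} bytes is outside the allowed range")
    return problems
```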

By default, custom analysis uses the Amazon Comprehend parser to extract the text from Word files and native (digital) PDF files. For PDF files, you can override this default and use Amazon Textract to extract the text instead. See Setting text extraction options.

Image files and scanned PDF files

Custom analysis supports JPEG, PNG, and TIFF images.

The following table lists the maximum file sizes for images. Scanned PDF files are subject to the same maximum sizes as native PDF files.

Description | Quota/Guideline
Image size (JPG or PNG) | 1 byte–10 MB
Image size (TIFF) | 1 byte–10 MB. Maximum one page.

For additional information about images, see Best practices for images.

By default, Amazon Comprehend uses the Amazon Textract DetectDocumentText API operation to extract the text from image files and scanned PDF files. You can override this default to use the AnalyzeDocument API operation instead. See Setting text extraction options.
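
The text extraction overrides described in this topic are expressed through the DocumentReaderConfig structure inside a job's InputDataConfig. The following sketch only builds that dictionary and makes no service call. The helper name and its parameters are hypothetical. The field names and enum values reflect the Amazon Comprehend API reference as understood at the time of writing, so verify them against the current reference before relying on them.

```python
from typing import Any, Dict, List, Optional


def build_input_data_config(
    s3_uri: str,
    use_analyze_document: bool = False,
    force_textract: bool = False,
    feature_types: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build an InputDataConfig dictionary with text extraction options.

    use_analyze_document: select the Textract AnalyzeDocument operation
        instead of the default DetectDocumentText operation.
    force_textract: apply the chosen Textract operation to every input file,
        including native PDF files that the default parser would handle.
    """
    reader_config: Dict[str, Any] = {
        "DocumentReadAction": (
            "TEXTRACT_ANALYZE_DOCUMENT"
            if use_analyze_document
            else "TEXTRACT_DETECT_DOCUMENT_TEXT"
        ),
        "DocumentReadMode": (
            "FORCE_DOCUMENT_READ_ACTION" if force_textract else "SERVICE_DEFAULT"
        ),
    }
    if use_analyze_document:
        # Assumption: AnalyzeDocument needs at least one feature type,
        # so default to both when the caller does not choose.
        reader_config["FeatureTypes"] = list(feature_types or ["TABLES", "FORMS"])

    return {
        "S3Uri": s3_uri,
        "InputFormat": "ONE_DOC_PER_FILE",
        "DocumentReaderConfig": reader_config,
    }
```

A dictionary built this way could be passed as the InputDataConfig argument when starting a job, for example through the boto3 Comprehend client's start_document_classification_job method. The S3 URI you pass in is your own input location.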

Amazon Textract output JSON files

For custom entity recognition, but not custom classification, you can provide the output file from the Amazon Textract AnalyzeDocument API operation as input to analysis jobs.