Setting text extraction options - Amazon Comprehend

Setting text extraction options

By default, Amazon Comprehend performs the following actions to extract text from a file, based on the input file type:

  • Word files – The Amazon Comprehend parser extracts the text.

  • Digital PDF files – The Amazon Comprehend parser extracts the text.

  • Image files and scanned PDF files – Amazon Comprehend uses the Amazon Textract DetectDocumentText API to extract the text.
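The default behavior above can be summarized as a small lookup table. This is only an illustrative sketch: the `FileKind` labels, the extractor strings, and the `default_extractor` helper are hypothetical names invented for this example and are not part of the Amazon Comprehend API.

```python
from enum import Enum


class FileKind(Enum):
    """Hypothetical labels for the input file types listed above."""
    WORD = "word"
    DIGITAL_PDF = "digital_pdf"
    SCANNED_PDF = "scanned_pdf"
    IMAGE = "image"


# Default extraction path per input type, mirroring the list above.
# The string values are illustrative labels, not API constants.
DEFAULT_EXTRACTOR = {
    FileKind.WORD: "comprehend_parser",
    FileKind.DIGITAL_PDF: "comprehend_parser",
    FileKind.SCANNED_PDF: "textract_detect_document_text",
    FileKind.IMAGE: "textract_detect_document_text",
}


def default_extractor(kind: FileKind) -> str:
    """Return the extraction path used when no override is configured."""
    return DEFAULT_EXTRACTOR[kind]
```

For example, `default_extractor(FileKind.SCANNED_PDF)` returns the Textract label, while `default_extractor(FileKind.DIGITAL_PDF)` returns the parser label.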

For image files and PDF files, you can use the DocumentReaderConfig parameter to override these default extraction actions. This parameter is available when you use the Amazon Comprehend console or API for real-time or asynchronous custom analysis.
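As a rough sketch of where the parameter sits in each kind of request, the helpers below build keyword arguments for a real-time call and for an asynchronous job. The helper names (`realtime_request`, `async_request`) and all argument values are hypothetical; the dictionary keys follow the ClassifyDocument and StartDocumentClassificationJob request shapes as understood here, and should be checked against the API reference before use. The individual DocumentReaderConfig fields are described below.

```python
def realtime_request(endpoint_arn: str, document_bytes: bytes,
                     reader_config: dict) -> dict:
    """Build keyword arguments for a real-time ClassifyDocument request.

    For real-time analysis, DocumentReaderConfig is a top-level parameter
    alongside the document bytes.
    """
    return {
        "EndpointArn": endpoint_arn,
        "Bytes": document_bytes,
        "DocumentReaderConfig": reader_config,
    }


def async_request(classifier_arn: str, role_arn: str, input_s3_uri: str,
                  output_s3_uri: str, reader_config: dict) -> dict:
    """Build keyword arguments for an asynchronous classification job.

    For asynchronous analysis, DocumentReaderConfig is nested inside
    InputDataConfig.
    """
    return {
        "DocumentClassifierArn": classifier_arn,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_FILE",
            "DocumentReaderConfig": reader_config,
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }
```

With boto3 installed and credentials configured, these dictionaries could be passed as `boto3.client("comprehend").classify_document(**kwargs)` or `start_document_classification_job(**kwargs)`. Nothing in the sketch itself contacts AWS.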

The DocumentReaderConfig parameter contains three fields:

  • DocumentReadMode – Set to SERVICE_DEFAULT for Amazon Comprehend to perform the default actions.

    Set to FORCE_DOCUMENT_READ_ACTION to use Amazon Textract to parse digital PDF files instead of the Amazon Comprehend parser.

  • DocumentReadAction – Sets the Amazon Textract API to use when Amazon Comprehend uses Amazon Textract for text extraction. Set to TEXTRACT_DETECT_DOCUMENT_TEXT for the DetectDocumentText API or TEXTRACT_ANALYZE_DOCUMENT for the AnalyzeDocument API.

  • FeatureTypes – If you set DocumentReadAction to use the AnalyzeDocument API operation, you can add one or both of the FeatureTypes (TABLES, FORMS). These features provide additional information about the tables and forms in the document. For more information about these features, see Amazon Textract Document Analysis Response Objects.
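The three fields can be assembled with a small helper that checks the values before a request is sent. This is a minimal sketch, not an official client: `build_reader_config` is a hypothetical name, and rejecting FeatureTypes for any action other than TEXTRACT_ANALYZE_DOCUMENT is a local design choice based on the description above rather than documented service behavior.

```python
from typing import Dict, Iterable, List, Optional, Union

READ_MODES = {"SERVICE_DEFAULT", "FORCE_DOCUMENT_READ_ACTION"}
READ_ACTIONS = {"TEXTRACT_DETECT_DOCUMENT_TEXT", "TEXTRACT_ANALYZE_DOCUMENT"}
FEATURE_TYPES = {"TABLES", "FORMS"}

ReaderConfig = Dict[str, Union[str, List[str]]]


def build_reader_config(
    read_action: str,
    read_mode: str = "SERVICE_DEFAULT",
    feature_types: Optional[Iterable[str]] = None,
) -> ReaderConfig:
    """Build a DocumentReaderConfig dictionary from the three fields."""
    if read_mode not in READ_MODES:
        raise ValueError(f"Unknown DocumentReadMode: {read_mode!r}")
    if read_action not in READ_ACTIONS:
        raise ValueError(f"Unknown DocumentReadAction: {read_action!r}")

    config: ReaderConfig = {
        "DocumentReadMode": read_mode,
        "DocumentReadAction": read_action,
    }

    features = list(feature_types or [])
    if features:
        # Assumption: feature types only make sense with AnalyzeDocument.
        if read_action != "TEXTRACT_ANALYZE_DOCUMENT":
            raise ValueError(
                "FeatureTypes applies only to TEXTRACT_ANALYZE_DOCUMENT"
            )
        unknown = set(features) - FEATURE_TYPES
        if unknown:
            raise ValueError(f"Unknown FeatureTypes: {sorted(unknown)}")
        config["FeatureTypes"] = features

    return config
```

Calling `build_reader_config("TEXTRACT_ANALYZE_DOCUMENT", feature_types=["TABLES", "FORMS"])` yields a dictionary with all three keys, while passing feature types together with TEXTRACT_DETECT_DOCUMENT_TEXT raises a `ValueError`.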

The following examples show how to configure DocumentReaderConfig for specific use cases:

  1. Use the Amazon Textract DetectDocumentText API for all PDF files.

    1. DocumentReadMode – Set to FORCE_DOCUMENT_READ_ACTION.

    2. DocumentReadAction – Set to TEXTRACT_DETECT_DOCUMENT_TEXT.

    3. FeatureTypes – Not required.

  2. Use the Amazon Textract AnalyzeDocument API for all PDF and image files.

    1. DocumentReadMode – Set to FORCE_DOCUMENT_READ_ACTION.

    2. DocumentReadAction – Set to TEXTRACT_ANALYZE_DOCUMENT.

    3. FeatureTypes – Set to TABLES, FORMS, or both.

  3. Use the Amazon Textract AnalyzeDocument API for scanned PDF files and all image files.

    1. DocumentReadMode – Set to SERVICE_DEFAULT.

    2. DocumentReadAction – Set to TEXTRACT_ANALYZE_DOCUMENT.

    3. FeatureTypes – Set to TABLES, FORMS, or both.
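The three use cases above can be written out as plain dictionaries in the shape the DocumentReaderConfig parameter takes. The `USE_CASES` name and its keys are hypothetical labels for this sketch, and the choice of both TABLES and FORMS in the last two entries is arbitrary; either value alone would also match the descriptions above.

```python
import json

# One DocumentReaderConfig per use case described in the list above.
USE_CASES = {
    # 1. Amazon Textract for all PDF files.
    "textract_for_all_pdfs": {
        "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
        "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
        # FeatureTypes is not required for this case.
    },
    # 2. AnalyzeDocument for all PDF and image files.
    "analyze_all_pdfs_and_images": {
        "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
        "DocumentReadAction": "TEXTRACT_ANALYZE_DOCUMENT",
        "FeatureTypes": ["TABLES", "FORMS"],
    },
    # 3. AnalyzeDocument for scanned PDF files and all image files;
    #    digital PDF files keep the default Amazon Comprehend parser.
    "analyze_scanned_pdfs_and_images": {
        "DocumentReadMode": "SERVICE_DEFAULT",
        "DocumentReadAction": "TEXTRACT_ANALYZE_DOCUMENT",
        "FeatureTypes": ["TABLES", "FORMS"],
    },
}


def as_json(use_case: str) -> str:
    """Serialize one configuration, for example for a CLI or console input."""
    return json.dumps(USE_CASES[use_case], indent=2)
```

The only difference between the second and third configurations is DocumentReadMode, which controls whether digital PDF files are also routed through Amazon Textract.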

For more information about the Amazon Textract options, see DocumentReaderConfig.