Documents
Standard output for documents lets you set the response granularity you're interested in, as well as the output format and text format of the results. Below are some of the outputs you can enable.
Response Granularity
Response granularity determines what kind of response you want to receive from document text extraction. Each level of granularity gives you a more finely separated response, with page level providing all of the text on a page together and word level providing each word as a separate response. The available granularity levels are:
- Page level granularity – Enabled by default. Provides each page of the document in the text output format of your choice.
- Element level granularity (Layout) – Enabled by default. Provides the text of the document in the output format of your choice, separated into distinct elements such as figures, tables, and paragraphs. Elements are returned in logical reading order based on the structure of the document.
- Word level granularity – Provides information about individual words without using broader context analysis, giving you each word and its location on the page.
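If you configure standard output through the API rather than the console, granularity is set on the project's standard output configuration. Below is a minimal boto3 sketch, assuming the field names follow the BDA standard output schema; verify them against the current API reference.

```python
import boto3

# Sketch: create a BDA project that requests page, element, and word
# granularity for documents. Field names should be verified against the
# current bedrock-data-automation API reference.
bda = boto3.client("bedrock-data-automation")

response = bda.create_data_automation_project(
    projectName="doc-extraction-example",  # placeholder name
    standardOutputConfiguration={
        "document": {
            "extraction": {
                # PAGE and ELEMENT are on by default; WORD is opt-in.
                "granularity": {"types": ["PAGE", "ELEMENT", "WORD"]},
            }
        }
    },
)
print(response["projectArn"])
```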
Output Settings
Output settings determine the way your downloaded results will be structured. The options for output settings are:
- JSON – The default output structure for document analysis. Provides a JSON output file with the information from your configuration settings.
- JSON+files – Generates both a JSON output and files that correspond to the different outputs. For example, this setting gives you a text file for the overall text extraction, a markdown file for the text with structural markdown, and CSV files for each table found in the text.
Text Format
Text format determines the formats in which extracted text is provided by the various extraction operations. You can select any number of the following options for your text format.
- Plaintext – Provides a text-only output with no formatting or other markdown elements.
- Text with markdown – The default text format for standard output. Provides text with markdown elements integrated.
- Text with HTML – Provides text with HTML elements integrated in the response.
- CSV – Provides a CSV-structured output for tables within the document. This format only produces a response for tables, not for other elements of the document.
Bounding Boxes and Generative Fields
For documents, there are two response options that change their output based on the selected granularity: Bounding Boxes and Generative Fields. Selecting Bounding Boxes provides a visual outline of the element or word you click on in the console response dropdown, which makes it easier to track down particular elements of your response. Bounding boxes are returned in your JSON as the coordinates of the four corners of the box.
When you select Generative Fields, BDA generates a summary of the document in both a 10-word and a 250-word version. If you also select element-level granularity, BDA generates a descriptive caption for each figure detected in the document. Figures include charts, graphs, and images.
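Both options are toggled in the same configuration block as the settings above. A sketch, again assuming the field names from the standard output schema:

```python
# Sketch: enable bounding boxes and generative fields for documents.
# Generative fields produce the 10-word and 250-word summaries, plus
# figure captions when element granularity is also enabled.
response_options = {
    "document": {
        "extraction": {"boundingBox": {"state": "ENABLED"}},
        "generativeField": {"state": "ENABLED"},
    }
}
```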
Bedrock Data Automation document response
This section focuses on the different response objects you receive from running the InvokeDataAutomationAsync API operation on a document file. Below we'll break down each section of the response object and then see a full, populated response for an example document. First, though, here is how such a response is typically produced.
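The following is a boto3 sketch with placeholder ARNs, bucket names, and paths; parameter names should be checked against the current bedrock-data-automation-runtime API reference.

```python
import boto3

# Sketch: start an async BDA job on a document in S3 and check its status.
# All ARNs and S3 URIs below are placeholders.
runtime = boto3.client("bedrock-data-automation-runtime")

job = runtime.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://bucket/prefix/document.pdf"},
    outputConfiguration={"s3Uri": "s3://bucket/output-prefix/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": "arn:aws:bedrock:us-east-1:111122223333:data-automation-project/my-project",
    },
    dataAutomationProfileArn="arn:aws:bedrock:us-east-1:111122223333:data-automation-profile/us.data-automation-v1",
)

status = runtime.get_data_automation_status(invocationArn=job["invocationArn"])
print(status["status"])  # e.g. InProgress / Success
```

The first section of the response we'll look at is metadata.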
"metadata":{ "logical_subdocument_id":"XXXX-XXXX-XXXX-XXXX", "semantic_modality":"DOCUMENT", "s3_bucket":"bucket", "s3_prefix":"prefix" },
The first section above provides an overview of the metadata associated with the document. Along with the S3 information, this section also informs you which modality was selected for your response.
"document":{ "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "description":"document text", "summary":"summary text", "statistics":{ "element_count":5, "table_count":1, "figure_count":1, "word_count":1000, "line_count":32 } },
The above section provides document level granularity information. The description and summary fields are the generative fields based on the document. The representation section provides the actual content of the document in various formatting styles. Finally, statistics contains information about the content of the document, such as how many semantic elements, figures, words, and lines there are.
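Once the job output is downloaded from S3, these document-level fields can be read directly. A minimal sketch, assuming the response has been saved locally as result.json (a hypothetical filename):

```python
import json

# Sketch: read the document-level block from a downloaded result file.
# Field names follow the sample response shown above.
with open("result.json") as f:
    result = json.load(f)

doc = result["document"]
print(doc["summary"])                           # generated summary
print(doc["statistics"]["element_count"])       # number of semantic elements
print(doc["representation"]["markdown"][:200])  # markdown rendering of the text
```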
Next come the element-level entities. The first example below is a text entity.
{ "id":"entity_id", "type":"TEXT", "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "reading_order":2, "page_indices":[ 0 ], "locations":[ { "page_index":0, "bounding_box":{ "left":0.0, "top":0.0, "width":0.05, "height":0.5 } } ], "sub_type":"TITLE/SECTION_TITLE/HEADER/FOOTER/PARAGRAPH/LIST/PAGE_NUMBER" },
This is the entity used for text within a document, indicated by the type line in the response. Again, representation shows the text in different formats. reading_order shows when a reader would logically see the text. This is a semantic ordering based on associated keys and values; for example, it associates the title of a paragraph with its respective paragraph in reading order. page_indices tells you which pages the text appears on. Next is the location information, including a bounding box for the text if bounding boxes were enabled in the response. Finally, we have the entity sub_type, which provides more detailed information on what kind of text was detected. For a complete list of subtypes, see the API Reference.
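The reading order and subtype fields make it easy to pull out specific kinds of text. A sketch, assuming the entities shown above are collected in a top-level elements array of the result JSON:

```python
import json

# Sketch: list section titles in logical reading order. The top-level
# "elements" key is an assumption about the result layout.
with open("result.json") as f:
    result = json.load(f)

titles = sorted(
    (e for e in result.get("elements", [])
     if e["type"] == "TEXT" and e.get("sub_type") == "SECTION_TITLE"),
    key=lambda e: e["reading_order"],
)
for t in titles:
    print(t["reading_order"], t["representation"]["text"])
```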
{ "id":"entity_id", "type":"TABLE", "representation":{ "html":"table.../table", "markdown":"| header | ...", "text":"header \t header", "csv":"header, header, header\n..." }, "csv_s3_uri":"s3://", "headers":[ "date", "amount", "description", "total" ], "reading_order":3, "title":"Title of the table", "footers":[ "the footers of the table" ], "crop_images":[ "s3://bucket/prefix.png", "s3://bucket/prefix.png" ], "page_indices":[ 0, 1 ], "locations":[ { "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } }, { "page_index":1, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } ] },
This is the information for a table entity. In addition to the location information, the different text formats, and the reading order, table entities specifically return CSV information and cropped images of the table in S3. The CSV information includes the table's headers, footers, and title. The images are written to the S3 bucket and prefix set in the InvokeDataAutomationAsync request.
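The csv_s3_uri points at a plain CSV object, so it can be fetched and parsed directly. A sketch (the URI below is a placeholder):

```python
import csv
import io

import boto3

# Sketch: download a table's CSV rendering from the csv_s3_uri field
# and parse it. The bucket/key split assumes a standard s3:// URI.
def read_table(csv_s3_uri: str) -> list[list[str]]:
    bucket, key = csv_s3_uri.removeprefix("s3://").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return list(csv.reader(io.StringIO(body.read().decode("utf-8"))))

rows = read_table("s3://bucket/prefix/table.csv")  # placeholder URI
print(rows[0])  # header row, e.g. ['date', 'amount', 'description', 'total']
```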
{ "id":"entity_id", "type":"FIGURE", "summary":"", "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "crop_images":[ "s3://bucket/prefix.png", "s3://bucket/prefix.png" ], "locations":[ { "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } ], "sub_type":"CHART", "title":"figure title", "rai_flag":"APPROVED/REDACTED/REJECTED", "reading_order":1, "page_indices":[ 0 ] } ,
This is the entity used for figures in a document, such as graphs and charts. Similar to tables, these figures are cropped, and the images are sent to the S3 bucket set in your prefix. Additionally, you'll receive a sub_type and a title in the response, giving you the figure's title text and an indication of what kind of figure it is.
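Because figures carry an rai_flag, a downstream consumer will typically filter on it before using the cropped images. A sketch under the same result-layout assumptions as above:

```python
import json

# Sketch: collect crop images only for figures whose rai_flag is APPROVED.
# The top-level "elements" key is an assumption about the result layout.
with open("result.json") as f:
    result = json.load(f)

for entity in result.get("elements", []):
    if entity["type"] == "FIGURE" and entity.get("rai_flag") == "APPROVED":
        print(entity.get("title"), entity.get("sub_type"), entity.get("crop_images"))
```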
"pages":[ { "id":"page_id", "page_index":0, "detected_page_number":1, "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "statistics":{ "element_count":5, "table_count":1, "figure_count":1, "word_count":1000, "line_count":32 }, "asset_metadata":{ "rectified_image":"s3://bucket/prefix.png", "rectified_image_width_pixels":1700, "rectified_image_height_pixels":2200 } } ],
The last of the entities extracted through standard output is pages. Pages are the same as text entities, but additionally contain page numbers: page_index is the zero-based position of the page within the file, while detected_page_number is the page number detected on the page itself. Pages also include asset_metadata, which points to a rectified image of the page and gives its pixel dimensions.
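The distinction between the zero-based page_index and the number printed on the page matters when citing locations back to users. A small sketch:

```python
import json

# Sketch: map each zero-based page_index to the page number detected on
# the page itself, using the pages array from the sample response.
with open("result.json") as f:
    result = json.load(f)

page_numbers = {p["page_index"]: p["detected_page_number"] for p in result["pages"]}
print(page_numbers)  # e.g. {0: 1}
```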
"text_lines":[ { "id":"line_id", "text":"line text", "reading_order":1, "page_index":0, "locations":{ "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } } ],
"text_words":[ { "id":"word_id", "text":"word text", "line_id":"line_id", "reading_order":1, "page_index":0, "locations":{ "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } } ]
These final two elements cover individual portions of text. Word level granularity returns a response for each word, while the default output reports only lines of text.
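Since each word carries the line_id of the line it belongs to, word-level output can be regrouped into lines. A sketch using the fields from the sample response:

```python
import json
from collections import defaultdict

# Sketch: rebuild lines from word-level output by grouping on line_id
# and sorting each group by reading_order.
with open("result.json") as f:
    result = json.load(f)

lines = defaultdict(list)
for word in result.get("text_words", []):
    lines[word["line_id"]].append(word)

for line_id, words in lines.items():
    words.sort(key=lambda w: w["reading_order"])
    print(line_id, " ".join(w["text"] for w in words))
```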
BDA Document Processing Restrictions
BDA supports documents in PDF, JPEG, and PNG file formats. Documents must be smaller than 200 MB to be processed through the console, or 500 MB when processed through the API. A single document cannot exceed 20 pages, although with document splitting enabled, files with up to 1,500 pages may be submitted.
| Limit | Description |
|---|---|
| PDF Specific Limits | The maximum height and width is 40 inches and 2880 points. PDFs cannot be password protected. PDFs can contain JPEG 2000 formatted images. |
| Document Rotation and Image Size | BDA supports all in-plane document rotations, for example a 45-degree in-plane rotation. BDA supports images with a resolution less than or equal to 10000 pixels on all sides. |
| Text Alignment | Text can be aligned horizontally within the document. Horizontally arrayed text can be read regardless of the degree of rotation of a document. BDA does not support vertical text (text written vertically, as is common in languages like Japanese and Chinese) within the document. |
| Character Size | The minimum height for text to be detected is 15 pixels. At 150 DPI, this is equivalent to 8-point font. |
| Character Type | BDA supports both handwritten and printed character recognition. |
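These limits are easy to check before submitting a job. A hedged pre-flight sketch that mirrors the thresholds above; page counting would need a PDF library of your choice (for example pypdf) and is omitted here:

```python
import os

# Sketch: validate a file against the documented BDA limits before
# invoking the API. Thresholds mirror the restrictions table above.
MAX_API_BYTES = 500 * 1024 * 1024          # 500 MB via the API
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png"}

def check_document(path: str) -> None:
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_API_BYTES:
        raise ValueError("file exceeds the 500 MB API limit")

check_document("document.pdf")  # placeholder path
```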