Documents
Standard output for documents lets you set the response granularity you're interested in, as well as the output format and text format of the results. Below are some of the outputs you can enable.
Response Granularity
Response granularity determines what kind of response you want to receive from document text extraction. Each level of granularity gives you a more finely separated response, with page level providing all of the text on a page together and word level providing each word as a separate response. The available granularity levels are:
- Page level granularity – Enabled by default. Provides each page of the document in the text output format of your choice.
- Element level granularity (Layout) – Enabled by default. Provides the text of the document in the output format of your choice, separated into distinct elements such as figures, tables, and paragraphs. Elements are returned in logical reading order based on the structure of the document.
- Word level granularity – Provides information about individual words without using broader context analysis, giving you each word and its location on the page.
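If you configure standard output through the API rather than the console, granularity is set on the project's standard output configuration. Below is a minimal boto3 sketch, assuming the field names follow the BDA standard output schema; verify them against the current API reference.

```python
import boto3

# Sketch: create a BDA project that requests page, element, and word
# granularity for documents. Field names should be verified against the
# current bedrock-data-automation API reference.
bda = boto3.client("bedrock-data-automation")

response = bda.create_data_automation_project(
    projectName="doc-extraction-example",  # placeholder name
    standardOutputConfiguration={
        "document": {
            "extraction": {
                # PAGE and ELEMENT are on by default; WORD is opt-in.
                "granularity": {"types": ["PAGE", "ELEMENT", "WORD"]},
            }
        }
    },
)
print(response["projectArn"])
```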
Output Settings
Output settings determine the way your downloaded results will be structured. The options for output settings are:
- JSON – The default output structure for document analysis. Provides a JSON output file with the information from your configuration settings.
- JSON+files – Generates both a JSON output and files that correspond to the different outputs. For example, this setting gives you a text file for the overall text extraction, a markdown file for the text with structural markdown, and CSV files for each table found in the text.
Text Format
Text format determines the formats in which extracted text is provided by the various extraction operations. You can select any number of the following options for your text format.
- Plaintext – Provides a text-only output with no formatting or other markdown elements.
- Text with markdown – The default text format for standard output. Provides text with markdown elements integrated.
- Text with HTML – Provides text with HTML elements integrated in the response.
- CSV – Provides a CSV-structured output for tables within the document. This format only produces a response for tables, not for other elements of the document.
Bounding Boxes and Generative Fields
For documents, there are two response options that change their output based on the selected granularity: Bounding Boxes and Generative Fields. Selecting Bounding Boxes provides a visual outline of the element or word you click on in the console response dropdown, which makes it easier to track down particular elements of your response. Bounding boxes are returned in your JSON as the coordinates of the four corners of the box.
When you select Generative Fields, BDA generates a summary of the document in both a 10-word and a 250-word version. If you also select element-level granularity, BDA generates a descriptive caption for each figure detected in the document. Figures include charts, graphs, and images.
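Both options are toggled in the same configuration block as the settings above. A sketch, again assuming the field names from the standard output schema:

```python
# Sketch: enable bounding boxes and generative fields for documents.
# Generative fields produce the 10-word and 250-word summaries, plus
# figure captions when element granularity is also enabled.
response_options = {
    "document": {
        "extraction": {"boundingBox": {"state": "ENABLED"}},
        "generativeField": {"state": "ENABLED"},
    }
}
```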
Bedrock Data Automation document response
This section focuses on the different response objects you receive from running the InvokeDataAutomationAsync API operation on a document file. Below we'll break down each section of the response object and then see a full, populated response for an example document. First, though, here is how such a response is typically produced.
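The following is a boto3 sketch with placeholder ARNs, bucket names, and paths; parameter names should be checked against the current bedrock-data-automation-runtime API reference.

```python
import boto3

# Sketch: start an async BDA job on a document in S3 and check its status.
# All ARNs and S3 URIs below are placeholders.
runtime = boto3.client("bedrock-data-automation-runtime")

job = runtime.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://bucket/prefix/document.pdf"},
    outputConfiguration={"s3Uri": "s3://bucket/output-prefix/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": "arn:aws:bedrock:us-east-1:111122223333:data-automation-project/my-project",
    },
    dataAutomationProfileArn="arn:aws:bedrock:us-east-1:111122223333:data-automation-profile/us.data-automation-v1",
)

status = runtime.get_data_automation_status(invocationArn=job["invocationArn"])
print(status["status"])  # e.g. InProgress / Success
```

The first section of the response we'll look at is metadata.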
"metadata":{ "logical_subdocument_id":"XXXX-XXXX-XXXX-XXXX", "semantic_modality":"DOCUMENT", "s3_bucket":"bucket", "s3_prefix":"prefix" },
The first section above provides an overview of the metadata associated with the document. Along with the S3 information, this section also informs you which modality was selected for your response.
"document":{ "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "description":"document text", "summary":"summary text", "statistics":{ "element_count":5, "table_count":1, "figure_count":1, "word_count":1000, "line_count":32 } },
The above section provides document level granularity information. The description and summary fields are the generative fields based on the document. The representation section provides the actual content of the document in various formatting styles. Finally, statistics contains information about the content of the document, such as how many semantic elements, figures, words, and lines there are.
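Once the job output is downloaded from S3, these document-level fields can be read directly. A minimal sketch, assuming the response has been saved locally as result.json (a hypothetical filename):

```python
import json

# Sketch: read the document-level block from a downloaded result file.
# Field names follow the sample response shown above.
with open("result.json") as f:
    result = json.load(f)

doc = result["document"]
print(doc["summary"])                           # generated summary
print(doc["statistics"]["element_count"])       # number of semantic elements
print(doc["representation"]["markdown"][:200])  # markdown rendering of the text
```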
Next come the element-level entities. The first example below is a text entity.
{ "id":"entity_id", "type":"TEXT", "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "reading_order":2, "page_indices":[ 0 ], "locations":[ { "page_index":0, "bounding_box":{ "left":0.0, "top":0.0, "width":0.05, "height":0.5 } } ], "sub_type":"TITLE/SECTION_TITLE/HEADER/FOOTER/PARAGRAPH/LIST/PAGE_NUMBER" },
This is the entity used for text within a document, indicated by the type line in the response. Again, representation shows the text in different formats. reading_order shows when a reader would logically see the text. This is a semantic ordering based on associated keys and values; for example, it associates the title of a paragraph with its respective paragraph in reading order. page_indices tells you which pages the text appears on. Next is the location information, including a bounding box for the text if bounding boxes were enabled in the response. Finally, we have the entity sub_type, which provides more detailed information on what kind of text was detected. For a complete list of subtypes, see the API Reference.
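The reading order and subtype fields make it easy to pull out specific kinds of text. A sketch, assuming the entities shown above are collected in a top-level elements array of the result JSON:

```python
import json

# Sketch: list section titles in logical reading order. The top-level
# "elements" key is an assumption about the result layout.
with open("result.json") as f:
    result = json.load(f)

titles = sorted(
    (e for e in result.get("elements", [])
     if e["type"] == "TEXT" and e.get("sub_type") == "SECTION_TITLE"),
    key=lambda e: e["reading_order"],
)
for t in titles:
    print(t["reading_order"], t["representation"]["text"])
```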
{ "id":"entity_id", "type":"TABLE", "representation":{ "html":"table.../table", "markdown":"| header | ...", "text":"header \t header", "csv":"header, header, header\n..." }, "csv_s3_uri":"s3://", "headers":[ "date", "amount", "description", "total" ], "reading_order":3, "title":"Title of the table", "footers":[ "the footers of the table" ], "crop_images":[ "s3://bucket/prefix.png", "s3://bucket/prefix.png" ], "page_indices":[ 0, 1 ], "locations":[ { "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } }, { "page_index":1, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } ] },
This is the information for a table entity. In addition to the location information, the different text formats, and the reading order, table entities specifically return CSV information and cropped images of the table in S3. The CSV information includes the table's headers, footers, and title. The images are written to the S3 bucket and prefix set in the InvokeDataAutomationAsync request.
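The csv_s3_uri points at a plain CSV object, so it can be fetched and parsed directly. A sketch (the URI below is a placeholder):

```python
import csv
import io

import boto3

# Sketch: download a table's CSV rendering from the csv_s3_uri field
# and parse it. The bucket/key split assumes a standard s3:// URI.
def read_table(csv_s3_uri: str) -> list[list[str]]:
    bucket, key = csv_s3_uri.removeprefix("s3://").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return list(csv.reader(io.StringIO(body.read().decode("utf-8"))))

rows = read_table("s3://bucket/prefix/table.csv")  # placeholder URI
print(rows[0])  # header row, e.g. ['date', 'amount', 'description', 'total']
```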
{ "id":"entity_id", "type":"FIGURE", "summary":"", "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "crop_images":[ "s3://bucket/prefix.png", "s3://bucket/prefix.png" ], "locations":[ { "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } ], "sub_type":"CHART", "title":"figure title", "rai_flag":"APPROVED/REDACTED/REJECTED", "reading_order":1, "page_indices":[ 0 ] } ,
This is the entity used for figures in a document, such as graphs and charts. Similar to tables, these figures are cropped, and the images are sent to the S3 bucket set in your prefix. Additionally, you'll receive a sub_type and a title in the response, giving you the figure's title text and an indication of what kind of figure it is.
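Because figures carry an rai_flag, a downstream consumer will typically filter on it before using the cropped images. A sketch under the same result-layout assumptions as above:

```python
import json

# Sketch: collect crop images only for figures whose rai_flag is APPROVED.
# The top-level "elements" key is an assumption about the result layout.
with open("result.json") as f:
    result = json.load(f)

for entity in result.get("elements", []):
    if entity["type"] == "FIGURE" and entity.get("rai_flag") == "APPROVED":
        print(entity.get("title"), entity.get("sub_type"), entity.get("crop_images"))
```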
"pages":[ { "id":"page_id", "page_index":0, "detected_page_number":1, "representation":{ "text":"document text", "html":"document title document content", "markdown":"# text" }, "statistics":{ "element_count":5, "table_count":1, "figure_count":1, "word_count":1000, "line_count":32 }, "asset_metadata":{ "rectified_image":"s3://bucket/prefix.png", "rectified_image_width_pixels":1700, "rectified_image_height_pixels":2200 } } ],
The last of the entities extracted through standard output is pages. Pages are the same as text entities, but additionally contain page numbers: page_index is the zero-based position of the page within the file, while detected_page_number is the page number detected on the page itself. Pages also include asset_metadata, which points to a rectified image of the page and gives its pixel dimensions.
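The distinction between the zero-based page_index and the number printed on the page matters when citing locations back to users. A small sketch:

```python
import json

# Sketch: map each zero-based page_index to the page number detected on
# the page itself, using the pages array from the sample response.
with open("result.json") as f:
    result = json.load(f)

page_numbers = {p["page_index"]: p["detected_page_number"] for p in result["pages"]}
print(page_numbers)  # e.g. {0: 1}
```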
"text_lines":[ { "id":"line_id", "text":"line text", "reading_order":1, "page_index":0, "locations":{ "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } } ],
"text_words":[ { "id":"word_id", "text":"word text", "line_id":"line_id", "reading_order":1, "page_index":0, "locations":{ "page_index":0, "bounding_box":{ "left":0, "top":0, "width":1, "height":1 } } } ]
These final two elements cover individual portions of text. Word level granularity returns a response for each word, while the default output reports only lines of text.
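Since each word carries the line_id of the line it belongs to, word-level output can be regrouped into lines. A sketch using the fields from the sample response:

```python
import json
from collections import defaultdict

# Sketch: rebuild lines from word-level output by grouping on line_id
# and sorting each group by reading_order.
with open("result.json") as f:
    result = json.load(f)

lines = defaultdict(list)
for word in result.get("text_words", []):
    lines[word["line_id"]].append(word)

for line_id, words in lines.items():
    words.sort(key=lambda w: w["reading_order"])
    print(line_id, " ".join(w["text"] for w in words))
```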
BDA Document Processing Restrictions
BDA supports documents in PDF, JPEG, and PNG file formats. Documents must be smaller than 200 MB to be processed through the console, or 500 MB when processed through the API. A single document cannot exceed 20 pages, although with document splitting enabled, files with up to 1,500 pages may be submitted.
| Limit | Description |
|---|---|
| PDF Specific Limits | The maximum height and width is 40 inches and 2880 points. PDFs cannot be password protected. PDFs can contain JPEG 2000 formatted images. |
| Document Rotation and Image Size | BDA supports all in-plane document rotations, for example a 45-degree in-plane rotation. BDA supports images with a resolution less than or equal to 10000 pixels on all sides. |
| Text Alignment | Text can be aligned horizontally within the document. Horizontally arrayed text can be read regardless of the degree of rotation of a document. BDA does not support vertical text (text written vertically, as is common in languages like Japanese and Chinese) within the document. |
| Character Size | The minimum height for text to be detected is 15 pixels. At 150 DPI, this is equivalent to 8-point font. |
| Character Type | BDA supports both handwritten and printed character recognition. |
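These limits are easy to check before submitting a job. A hedged pre-flight sketch that mirrors the thresholds above; page counting would need a PDF library of your choice (for example pypdf) and is omitted here:

```python
import os

# Sketch: validate a file against the documented BDA limits before
# invoking the API. Thresholds mirror the restrictions table above.
MAX_API_BYTES = 500 * 1024 * 1024          # 500 MB via the API
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png"}

def check_document(path: str) -> None:
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_API_BYTES:
        raise ValueError("file exceeds the 500 MB API limit")

check_document("document.pdf")  # placeholder path
```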