Tables
Use Amazon Textract to extract tables in a document and extract cells, merged cells, column headers, titles, section titles, footers, table type (structured or semistructured), and summary cells within a table.
Detected tables are returned as Block objects in the responses from AnalyzeDocument and GetDocumentAnalysis. You can use the FeatureTypes input parameter to retrieve information about key-value pairs, tables, or both. For tables only, use the value TABLES. For an example, see Exporting Tables into a CSV File. For general information about how a document is represented by Block objects, see Text Detection and Document Analysis Response Objects.
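As a minimal sketch of the request described above, the following Python helper calls AnalyzeDocument with FeatureTypes set to TABLES and keeps only the TABLE blocks. It assumes a client object that exposes the boto3 Textract method analyze_document. The helper name analyze_tables and the file name in the comment are hypothetical and are not part of the Textract API.

```python
from typing import Any, Dict, List


def analyze_tables(client: Any, document_bytes: bytes) -> List[Dict[str, Any]]:
    """Call AnalyzeDocument for tables only and return the TABLE blocks."""
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        # Use ["TABLES", "FORMS"] to also retrieve key-value pairs.
        FeatureTypes=["TABLES"],
    )
    return [block for block in response["Blocks"] if block["BlockType"] == "TABLE"]


# Typical use (requires the boto3 package and AWS credentials):
#   import boto3
#   client = boto3.client("textract")
#   with open("statement.png", "rb") as f:  # hypothetical file name
#       tables = analyze_tables(client, f.read())
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub client before running it against the service.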
The following is an example of a table that could be detected by Amazon Textract.
The following diagram shows how a single cell in a table is represented by Block objects.
A cell contains WORD blocks for detected words and, where applicable, TABLE_TITLE blocks for table titles, TABLE_FOOTER blocks for table footers, and SELECTION_ELEMENT blocks for selection elements such as check boxes.
The following is part of the JSON for the preceding table. The PAGE block object has a list of CHILD block IDs for the TABLE block and each LINE of text that's detected.
{
    "BlockType": "PAGE",
    "Geometry": {
        "BoundingBox": {
            "Width": 1.0,
            "Height": 1.0,
            "Left": 0.0,
            "Top": 0.0
        }
    },
    "Id": "8a5d3f57-97bc-4a05-b028-f72617877626",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "7499ac64-3fa9-46fd-8e3f-581ec9c316eb",
                "87ed4709-66f2-4b3d-abda-52c92a111474",
                "27a87eb3-bd21-475e-80fe-3f8e16958dcf",
                "d89894ea-2f37-4667-94b6-d90def01c5c1",
                "9f9d6383-ed6d-4bd0-ba8c-71fc3eec704e",
                "cdc74e1a-c568-439b-9eef-7bd54e060f18",
                "1b64f24c-5e84-4c7e-851a-cb1f5258a53c",
                "84a84878-04b4-4608-81b6-38117ead1629",
                ...
                "8cef603b-932e-452b-adc4-15f8e02ad1fe",
                "a3f97508-0d6b-4ae0-aa04-76078f9fe11a",
                "dd1f23c6-dfad-447b-8105-29ba136bd3a4",
                "46138f38-5b77-41a9-b068-f8394587122f",
                "a5e5247c-2637-4fa8-a271-ab46399cd77c",
                "63d7b889-71e3-422a-8cb7-2103ba0aa276",
                "033e5c86-371a-46fb-bbea-eb7f6b0cd092",
                "559b1354-ef94-4cb9-8e03-9eca83c6dba4",
                "55edc4fa-052f-40f9-9edd-739b100e6f75"
            ]
        }
    ]
},
To learn more about the table, access the TABLE block object. The TABLE block includes four types of relationships: CHILD, MERGED_CELL, TABLE_TITLE, and TABLE_FOOTER. For relationship type CHILD, each child ID represents a single cell within the table. A merged cell is broken down into the individual cells that it spans, so each of those cells appears in the CHILD list. The TABLE_TITLE and TABLE_FOOTER relationship types contain the block IDs of the corresponding TABLE_TITLE and TABLE_FOOTER blocks, where information about the title and footer is stored. The TABLE block also has an EntityTypes field with a value of either STRUCTURED_TABLE or SEMI_STRUCTURED_TABLE that identifies the type of table.
The following JSON shows that the preceding table has 65 cells (13 rows by 5 columns), which are listed in the CHILD relationship Ids array. For relationship type MERGED_CELL, each ID represents a single merged cell within the table. The following JSON shows that the table has 9 merged cells, which are listed in the MERGED_CELL relationship Ids array. The two additional relationship types, TABLE_TITLE and TABLE_FOOTER, list the IDs of the respective title and footer blocks. The EntityTypes field in the following JSON also shows that the table is structured.
{
    "BlockType": "TABLE",
    "Confidence": 99.8046875,
    "Geometry": {...},
    "Id": "55edc4fa-052f-40f9-9edd-739b100e6f75",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "c1c03d64-d365-4906-af7a-a852f1acc040",
                "8b415996-6b05-4183-a959-d27d12ccef79",
                "48b0e972-7dba-4db7-896e-ca7066e8c761",
                "69948207-47d8-4825-8929-1d7abb650a88",
                "b9ac9f14-8899-43b3-8572-0e997180e0a4",
                "6f06c024-0b36-4acd-b61f-4467203234dd",
                "c8a88487-dbc7-4662-a69b-21103049b61d",
                ...
                "2b41c8e1-f754-4b37-91b6-a97cdc413f91",
                "365a1bab-0c18-4cd8-a465-6f7bc7e25e60",
                "f08af959-cfac-4ad6-a63f-2771c7a8ff62",
                "e4f6fbfd-c7d8-4f64-9102-733d4806850f",
                "68c0b8ff-fd35-41ce-ba76-de08c26084d7",
                "44e80372-aa70-4a36-9aac-3a93aaa91bb1"
            ]
        },
        {
            "Type": "MERGED_CELL",
            "Ids": [
                "a27a3ecc-afd0-4f7c-9db2-6f8e6d31c605",
                "6c02cf21-40de-4480-b755-e94462ac4884",
                "6faad856-8d37-4751-b741-c4ad8d5dcbe3",
                "d777d6e2-7430-4c6e-a261-03ec5a612c8c",
                "f0f5a9fb-5bfa-4c80-8f41-1d4fad674b09",
                "83c7af02-8128-4479-89c9-962544ad4048",
                "b2b5126c-409f-4b67-9adf-e3e12f60bf86",
                "87d7f688-3d38-4198-b491-433af0da4d8b",
                "1c2436e2-a1fc-4b2a-9e73-cc8a1ca67568"
            ]
        },
        {
            "Type": "TABLE_TITLE",
            "Ids": [
                "cde34920-0131-4e68-a3ec-82922269afd4"
            ]
        },
        {
            "Type": "TABLE_FOOTER",
            "Ids": [
                "11dfd98c-6140-49e8-a544-e220d76bdd2f",
                "ad1b9c81-3b53-4fc7-a533-dabb3d29b0b1"
            ]
        }
    ],
    "EntityTypes": [
        "STRUCTURED_TABLE"
    ]
},
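The relationship counts described above can be read straight from a TABLE block. The following Python sketch groups relationship IDs by type and reports the table type. The helper names relationship_ids and summarize_table are illustrative, and the sample block uses shortened, made-up IDs rather than the IDs from the preceding response.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def relationship_ids(block: Block, rel_type: str) -> List[str]:
    """Return every ID listed under relationships of the given type."""
    ids: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def summarize_table(table: Block) -> Dict[str, Any]:
    """Count cells, merged cells, titles, and footers for one TABLE block."""
    return {
        "cells": len(relationship_ids(table, "CHILD")),
        "merged_cells": len(relationship_ids(table, "MERGED_CELL")),
        "titles": len(relationship_ids(table, "TABLE_TITLE")),
        "footers": len(relationship_ids(table, "TABLE_FOOTER")),
        # EntityTypes may be absent, so fall back to an empty list.
        "entity_types": table.get("EntityTypes", []),
    }


# Made-up sample block with shortened IDs, for illustration only.
sample_table = {
    "BlockType": "TABLE",
    "Id": "table-1",
    "Relationships": [
        {"Type": "CHILD", "Ids": ["cell-1", "cell-2", "cell-3", "cell-4"]},
        {"Type": "MERGED_CELL", "Ids": ["merged-1"]},
        {"Type": "TABLE_TITLE", "Ids": ["title-1"]},
        {"Type": "TABLE_FOOTER", "Ids": ["footer-1", "footer-2"]},
    ],
    "EntityTypes": ["STRUCTURED_TABLE"],
}

summary = summarize_table(sample_table)
print(summary)
```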
The block type for each table cell is CELL. A CELL block always has a row span of 1 and a column span of 1. The block object for each cell includes information about the cell's location relative to other cells in the table. It also includes geometry information for the location of the cell on the document. In addition, cell blocks can have EntityTypes values that identify them as a particular type of cell, including TABLE_TITLE, TABLE_FOOTER, TABLE_SECTION_TITLE, COLUMN_HEADER, and TABLE_SUMMARY. For example, in the preceding table, the cell that contains the word “Date” is a column header, as shown in the following example.
{
    "BlockType": "CELL",
    "Confidence": 81.8359375,
    "RowIndex": 2,
    "ColumnIndex": 1,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "6f06c024-0b36-4acd-b61f-4467203234dd",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "c49f55d5-a7e4-41d5-9c29-d8244f56181c"
            ]
        }
    ],
    "EntityTypes": [
        "COLUMN_HEADER"
    ]
},
The cell that contains the word “Deposit” is not a title, footer, column header, section title, or summary cell. This is shown by the absence of the EntityTypes field.
{
    "BlockType": "CELL",
    "Confidence": 86.181640625,
    "RowIndex": 7,
    "ColumnIndex": 2,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "7af5160b-bd60-45f5-a12c-bf376e9d742c",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "bb9bcaed-5998-44a6-9076-aa1ecc82fbc6"
            ]
        }
    ]
},
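Because each CELL block carries a RowIndex, a ColumnIndex, and an optional EntityTypes field, a table can be rebuilt as a grid of strings. The following Python sketch shows one way to do that, under these assumptions: the blocks are indexed by Id, the indexes are 1-based as in the examples above, and a missing EntityTypes field means an ordinary cell. The helper names and the toy blocks are hypothetical.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def child_ids(block: Block) -> List[str]:
    """Collect the IDs from every CHILD relationship on a block."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def cell_text(cell: Block, by_id: Dict[str, Block]) -> str:
    """Join the WORD children of a cell; render check boxes as [x] or [ ]."""
    parts: List[str] = []
    for block_id in child_ids(cell):
        child = by_id.get(block_id, {})
        if child.get("BlockType") == "WORD":
            parts.append(child["Text"])
        elif child.get("BlockType") == "SELECTION_ELEMENT":
            parts.append("[x]" if child.get("SelectionStatus") == "SELECTED" else "[ ]")
    return " ".join(parts)


def is_column_header(cell: Block) -> bool:
    """A cell is a column header only if EntityTypes is present and says so."""
    return "COLUMN_HEADER" in cell.get("EntityTypes", [])


def table_grid(table: Block, by_id: Dict[str, Block]) -> List[List[str]]:
    """Place each cell's text at its (RowIndex, ColumnIndex) position."""
    cells = [by_id[i] for i in child_ids(table) if by_id.get(i, {}).get("BlockType") == "CELL"]
    if not cells:
        return []
    rows = max(c["RowIndex"] for c in cells)
    cols = max(c["ColumnIndex"] for c in cells)
    grid = [[""] * cols for _ in range(rows)]
    for c in cells:
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, by_id)
    return grid


# Toy 2x2 table with made-up IDs and values, for illustration only.
blocks = [
    {"BlockType": "TABLE", "Id": "t", "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}], "EntityTypes": ["COLUMN_HEADER"]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}], "EntityTypes": ["COLUMN_HEADER"]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Date"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Description"},
    {"BlockType": "WORD", "Id": "w3", "Text": "01/03"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Deposit"},
]
by_id = {b["Id"]: b for b in blocks}

grid = table_grid(by_id["t"], by_id)
headers = [cell_text(by_id[i], by_id) for i in child_ids(by_id["t"]) if is_column_header(by_id[i])]
print(grid)
print(headers)
```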
All the merged cells are listed under "Type": "MERGED_CELL" in the TABLE block. In the preceding example table, there are nine merged cells.
{
    "Type": "MERGED_CELL",
    "Ids": [
        "a27a3ecc-afd0-4f7c-9db2-6f8e6d31c605",
        "6c02cf21-40de-4480-b755-e94462ac4884",
        "6faad856-8d37-4751-b741-c4ad8d5dcbe3",
        "d777d6e2-7430-4c6e-a261-03ec5a612c8c",
        "f0f5a9fb-5bfa-4c80-8f41-1d4fad674b09",
        "83c7af02-8128-4479-89c9-962544ad4048",
        "b2b5126c-409f-4b67-9adf-e3e12f60bf86",
        "87d7f688-3d38-4198-b491-433af0da4d8b",
        "1c2436e2-a1fc-4b2a-9e73-cc8a1ca67568"
    ]
},
To find specific details associated with each merged cell, look up the block with "BlockType": "MERGED_CELL" and the matching ID. For the merged cell “Balance Sheet”, which is also a title cell, the ID is "a27a3ecc-afd0-4f7c-9db2-6f8e6d31c605". Five cells make up this merged cell, as shown by the ColumnSpan of 5. To find the text within the merged cell, follow the IDs in its CHILD Ids array to the "BlockType": "CELL" blocks, and from those to the "BlockType": "WORD" blocks.
{
    "BlockType": "MERGED_CELL",
    "Confidence": 77.44140625,
    "RowIndex": 1,
    "ColumnIndex": 1,
    "RowSpan": 1,
    "ColumnSpan": 5,
    "Geometry": {...},
    "Id": "a27a3ecc-afd0-4f7c-9db2-6f8e6d31c605",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "c1c03d64-d365-4906-af7a-a852f1acc040",
                "8b415996-6b05-4183-a959-d27d12ccef79",
                "48b0e972-7dba-4db7-896e-ca7066e8c761",
                "69948207-47d8-4825-8929-1d7abb650a88",
                "b9ac9f14-8899-43b3-8572-0e997180e0a4"
            ]
        }
    ],
    "EntityTypes": [
        "TABLE_TITLE"
    ]
},
At the cell level, there are 5 cells for the merged cell “Balance Sheet”. Each cell has an EntityTypes value of TABLE_TITLE because the title was identified in the merged cell. The cell with an Id of 48b0e972-7dba-4db7-896e-ca7066e8c761 contains two CHILD relationship IDs that correspond to the WORD blocks that make up this merged title cell.
{
    "BlockType": "CELL",
    "Confidence": 77.44140625,
    "RowIndex": 1,
    "ColumnIndex": 1,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "c1c03d64-d365-4906-af7a-a852f1acc040",
    "EntityTypes": [
        "TABLE_TITLE"
    ]
},
{
    "BlockType": "CELL",
    "Confidence": 77.44140625,
    "RowIndex": 1,
    "ColumnIndex": 2,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "8b415996-6b05-4183-a959-d27d12ccef79",
    "EntityTypes": [
        "TABLE_TITLE"
    ]
},
{
    "BlockType": "CELL",
    "Confidence": 77.44140625,
    "RowIndex": 1,
    "ColumnIndex": 3,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "48b0e972-7dba-4db7-896e-ca7066e8c761",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "998394ef-c6cf-491b-9bac-ec470c638ecd",
                "1c875a06-f8e5-4df7-8f6a-583c47cbd9fe"
            ]
        }
    ],
    "EntityTypes": [
        "TABLE_TITLE"
    ]
},
{
    "BlockType": "CELL",
    "Confidence": 77.44140625,
    "RowIndex": 1,
    "ColumnIndex": 4,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "69948207-47d8-4825-8929-1d7abb650a88",
    "EntityTypes": [
        "TABLE_TITLE"
    ]
},
{
    "BlockType": "CELL",
    "Confidence": 77.44140625,
    "RowIndex": 1,
    "ColumnIndex": 5,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "Geometry": {...},
    "Id": "b9ac9f14-8899-43b3-8572-0e997180e0a4",
    "EntityTypes": [
        "TABLE_TITLE"
    ]
},
At the word level, there are two words, “Balance” and “Sheet”. Because the cells in columns 1, 2, 4, and 5 are blank, no words are associated with them. This is also shown in the previous JSON output, where only the third cell contains child IDs.
{
    "BlockType": "WORD",
    "Confidence": 99.95711517333984,
    "Text": "Balance",
    "TextType": "PRINTED",
    "Geometry": {...},
    "Id": "998394ef-c6cf-491b-9bac-ec470c638ecd"
},
{
    "BlockType": "WORD",
    "Confidence": 99.87372589111328,
    "Text": "Sheet",
    "TextType": "PRINTED",
    "Geometry": {...},
    "Id": "1c875a06-f8e5-4df7-8f6a-583c47cbd9fe"
},
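The walk from a MERGED_CELL block to its CELL blocks and then to the WORD blocks can be sketched in Python as follows. The sample data reuses the IDs and indexes from the “Balance Sheet” example, trimmed to the fields that the traversal needs. The helper names child_ids, merged_cell_text, and covered_columns are illustrative and are not part of the Textract API.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def child_ids(block: Block) -> List[str]:
    """Collect the IDs from every CHILD relationship on a block."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def merged_cell_text(merged: Block, by_id: Dict[str, Block]) -> str:
    """Follow MERGED_CELL -> CELL -> WORD and join the words in order."""
    words: List[str] = []
    for cell_id in child_ids(merged):
        # Blank cells have no Relationships, so they contribute nothing.
        for word_id in child_ids(by_id.get(cell_id, {})):
            word = by_id.get(word_id, {})
            if word.get("BlockType") == "WORD":
                words.append(word["Text"])
    return " ".join(words)


def covered_columns(merged: Block) -> List[int]:
    """List the 1-based column indexes that the merged cell spans."""
    start = merged["ColumnIndex"]
    return list(range(start, start + merged["ColumnSpan"]))


cell_ids = [
    "c1c03d64-d365-4906-af7a-a852f1acc040",
    "8b415996-6b05-4183-a959-d27d12ccef79",
    "48b0e972-7dba-4db7-896e-ca7066e8c761",
    "69948207-47d8-4825-8929-1d7abb650a88",
    "b9ac9f14-8899-43b3-8572-0e997180e0a4",
]
word_ids = ["998394ef-c6cf-491b-9bac-ec470c638ecd", "1c875a06-f8e5-4df7-8f6a-583c47cbd9fe"]

merged = {
    "BlockType": "MERGED_CELL",
    "RowIndex": 1, "ColumnIndex": 1, "RowSpan": 1, "ColumnSpan": 5,
    "Id": "a27a3ecc-afd0-4f7c-9db2-6f8e6d31c605",
    "Relationships": [{"Type": "CHILD", "Ids": cell_ids}],
}
blocks = [merged]
for column, cell_id in enumerate(cell_ids, start=1):
    cell = {"BlockType": "CELL", "Id": cell_id, "RowIndex": 1, "ColumnIndex": column}
    if column == 3:  # only the third cell holds the title words
        cell["Relationships"] = [{"Type": "CHILD", "Ids": word_ids}]
    blocks.append(cell)
blocks.append({"BlockType": "WORD", "Id": word_ids[0], "Text": "Balance"})
blocks.append({"BlockType": "WORD", "Id": word_ids[1], "Text": "Sheet"})
by_id = {b["Id"]: b for b in blocks}

title_text = merged_cell_text(merged, by_id)
columns = covered_columns(merged)
print(title_text, columns)
```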
The TABLE_TITLE and TABLE_FOOTER block types contain information about title and footer cells, including CHILD relationships that point to the WORD blocks that make up the title or footer. This is shown in the following JSON response.
In this example, the title is an in-table title, meaning it is found within the structure of the table itself, as opposed to outside of the table as a floating title. This means that the title also has a CELL block type that contains the child IDs of the word blocks that make up the title. See the previous JSON output for the five cell blocks that comprise the merged title cell, which includes the title cell block with the child IDs of the word blocks. The footer cells for this table would also be represented by cell blocks for each footer.
{
    "BlockType": "TABLE_TITLE",
    "Confidence": 97.802734375,
    "Geometry": {...},
    "Id": "cde34920-0131-4e68-a3ec-82922269afd4",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "998394ef-c6cf-491b-9bac-ec470c638ecd",
                "1c875a06-f8e5-4df7-8f6a-583c47cbd9fe"
            ]
        }
    ]
},
{
    "BlockType": "TABLE_FOOTER",
    "Confidence": 88.0859375,
    "Geometry": {...},
    "Id": "11dfd98c-6140-49e8-a544-e220d76bdd2f",
    "Relationships": [
        {
            "Type": "CHILD",
            "Ids": [
                "77a70b2d-c137-4161-8d9c-65170266e5ff",
                "d413ef1f-fa1b-44cb-87ed-809494fc87d8",
                "19616f50-1a34-431f-94bf-7e575106cd85",
                "35063ea4-a3c7-4e19-9d32-10eca92807b8",
                "48de1523-7776-49ef-96d9-fc19bcde89c5"
            ]
        }
    ]
},
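As a rough sketch, the title and footer text for a table can be collected by following the TABLE_TITLE and TABLE_FOOTER relationships on the TABLE block and then the CHILD relationships on each title or footer block. The helper names below are hypothetical. The sample uses shortened, made-up IDs, and the footer words are invented because the footer text does not appear in the preceding response.

```python
from typing import Any, Dict, List, Tuple

Block = Dict[str, Any]


def relationship_ids(block: Block, rel_type: str) -> List[str]:
    """Return every ID listed under relationships of the given type."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == rel_type
        for block_id in rel["Ids"]
    ]


def words_text(block: Block, by_id: Dict[str, Block]) -> str:
    """Join the text of the WORD blocks that a title or footer points to."""
    words = [by_id.get(i, {}) for i in relationship_ids(block, "CHILD")]
    return " ".join(w["Text"] for w in words if w.get("BlockType") == "WORD")


def titles_and_footers(table: Block, by_id: Dict[str, Block]) -> Tuple[List[str], List[str]]:
    """Resolve each TABLE_TITLE and TABLE_FOOTER block to its text."""
    titles = [words_text(by_id[i], by_id) for i in relationship_ids(table, "TABLE_TITLE") if i in by_id]
    footers = [words_text(by_id[i], by_id) for i in relationship_ids(table, "TABLE_FOOTER") if i in by_id]
    return titles, footers


# Made-up sample blocks; the footer words are invented for illustration.
blocks = [
    {"BlockType": "TABLE", "Id": "table-1", "Relationships": [
        {"Type": "TABLE_TITLE", "Ids": ["title-1"]},
        {"Type": "TABLE_FOOTER", "Ids": ["footer-1"]},
    ]},
    {"BlockType": "TABLE_TITLE", "Id": "title-1",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "TABLE_FOOTER", "Id": "footer-1",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Balance"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Sheet"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Ending"},
    {"BlockType": "WORD", "Id": "w4", "Text": "balance"},
]
by_id = {b["Id"]: b for b in blocks}

titles, footers = titles_and_footers(by_id["table-1"], by_id)
print(titles, footers)
```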