Document types or formats
Amazon Kendra supports popular document types or formats such as PDF, HTML, Word, PowerPoint, and more. An index can contain multiple document formats.
Amazon Kendra extracts the content inside documents to make them searchable. Documents are parsed to optimize search on the extracted text and on any tabular content (HTML tables) within them. This means structuring the documents into fields or attributes that are used for search. Document metadata, such as the last modified date, can also serve as useful search fields.
Documents can be organized into rows and columns: each document is a row, and each document field or attribute, such as the title or body content, is a column. For example, if you use a database as your data source, the data should be structured into rows and columns.
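As a rough illustration of the row-and-column layout, the sketch below builds a small in-memory SQLite table and reads each row back as one document. The table name `articles` and its columns are hypothetical, chosen only for this example. They are not a schema that Amazon Kendra requires.

```python
import sqlite3

# Hypothetical table: one row per document, one column per field/attribute.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name
conn.execute(
    "CREATE TABLE articles (id TEXT, title TEXT, body TEXT, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        ("doc-1", "Getting started", "How to create an index.", "2023-01-15"),
        ("doc-2", "Adding documents", "How to add documents.", "2023-02-01"),
    ],
)

# Each row becomes one document; each column becomes a searchable field.
documents = [
    dict(row)
    for row in conn.execute(
        "SELECT id, title, body, last_modified FROM articles ORDER BY id"
    )
]

for doc in documents:
    print(doc["id"], "|", doc["title"], "|", doc["last_modified"])
```

Here `title` and `body` hold the searchable content, while `last_modified` plays the role of document metadata.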
You can add documents to your index through the following ways:
If you want to add a FAQ file, you use the CreateFaq API to add the file stored in an Amazon S3 bucket. You can choose between a basic CSV format, a CSV format that includes custom fields/attributes in a header, and a JSON format that includes custom fields. The default format is basic CSV.
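The following is a minimal sketch of calling CreateFaq through the AWS SDK for Python (boto3), assuming boto3 is installed and credentials are configured. The helper names `build_create_faq_request` and `create_faq`, the default Region, and all identifiers in the usage example are placeholders for illustration. The `FileFormat` values correspond to the three formats described above.

```python
# The three FAQ file formats: basic CSV, CSV with a header of custom
# fields/attributes, and JSON with custom fields.
ALLOWED_FAQ_FORMATS = {"CSV", "CSV_WITH_HEADER", "JSON"}


def build_create_faq_request(index_id, name, bucket, key, role_arn,
                             file_format="CSV"):
    """Build the CreateFaq parameters (hypothetical helper). Basic CSV is the default."""
    if file_format not in ALLOWED_FAQ_FORMATS:
        raise ValueError(f"Unsupported FAQ file format: {file_format}")
    return {
        "IndexId": index_id,
        "Name": name,
        "S3Path": {"Bucket": bucket, "Key": key},  # FAQ file stored in Amazon S3
        "RoleArn": role_arn,  # IAM role that allows access to the S3 object
        "FileFormat": file_format,
    }


def create_faq(request, region_name="us-east-1"):
    """Send the request to Amazon Kendra. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("kendra", region_name=region_name)
    return client.create_faq(**request)


# Placeholder values only; replace them with your own resources.
request = build_create_faq_request(
    index_id="your-index-id",
    name="product-faq",
    bucket="your-bucket",
    key="faqs/product-faq.csv",
    role_arn="arn:aws:iam::111122223333:role/your-kendra-faq-role",
    file_format="CSV_WITH_HEADER",
)
print(request)
```

Passing `request` to `create_faq` would make the actual API call. Keeping the parameter construction separate makes it easy to check the chosen file format before anything is sent.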
The following provides information on each supported document format and how Amazon Kendra treats each format when indexing documents.
Document format | Treated as | How document is treated | Original structure |
---|---|---|---|
Portable Document Format (PDF) | HTML | Converted to HTML, then content is extracted. | Unstructured |
HyperText Markup Language (HTML) | HTML | HTML tags are filtered out to extract content. Content must be between the main HTML start and closing tags (<HTML>content</HTML>). | Semi-structured |
Extensible Markup Language (XML) | XML | XML tags are filtered out to extract content. | Semi-structured |
Extensible Stylesheet Language Transformations (XSLT) | XSLT | Tags are filtered out to extract content. | Semi-structured |
Markdown (MD) | Plain text | Content is extracted with Markdown syntax included. | Semi-structured |
Comma Separated Values (CSV) | CSV | Content extracted from each cell, with a single file treated as a single document result. | Structured for FAQ files, otherwise semi-structured |
Microsoft Excel (XLS and XLSX) | XLS and XLSX | Content extracted from each cell, with a single file treated as a single document result. | Semi-structured |
JavaScript Object Notation (JSON) | Plain text | Content is extracted with JSON syntax included. | Semi-structured |
Rich Text Format (RTF) | RTF | RTF syntax is filtered out to extract content. | Semi-structured |
Microsoft PowerPoint (PPT) | PPT, PPTX | Only text content is extracted from PowerPoint slides for search. Images and other content are not extracted. | Unstructured |
Microsoft Word | DOC, DOCX | Only text content is extracted from Word pages for search. Images and other content are not extracted. | Unstructured |
Plain text (TXT) | TXT | All text in the text document is extracted. | Unstructured |
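
To make the idea of filtering out tags more concrete, here is a minimal sketch using Python's standard `html.parser` module. It only illustrates the general technique of keeping text and dropping markup. It is not Amazon Kendra's parser, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags (illustrative only)."""

    SKIPPED_TAGS = {"script", "style"}  # their contents are not readable text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(extract_text(sample))  # prints "Title Hello world"
```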