Document types or formats - Amazon Kendra

Document types or formats

Amazon Kendra supports popular document types or formats such as PDF, HTML, Word, PowerPoint, and more. An index can contain multiple document formats.

Amazon Kendra extracts the content of documents to make them searchable. Documents are parsed to optimize search on the extracted text and on any tabular content (HTML tables) they contain. This means structuring each document into fields, or attributes, that are used for search. Document metadata, such as the last modified date, can also serve as useful search fields.

Documents can be organized into rows and columns, where each document is a row and each document field or attribute, such as the title or body content, is a column. For example, if you use a database as your data source, the data should be structured into rows and columns.
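The row-and-column model can be illustrated with a minimal sketch. The sample data, the column names (id, title, body, last_modified), and the helper rows_to_documents are hypothetical and exist only for illustration; they are not part of Amazon Kendra.

```python
import csv
import io

# Hypothetical sample data: each row is one document, each column one field.
SAMPLE_ROWS = """id,title,body,last_modified
doc-1,Onboarding guide,How to set up your laptop.,2023-04-01
doc-2,Expense policy,How to file an expense report.,2023-05-15
"""


def rows_to_documents(text):
    """Turn tabular rows into one dict per document, keyed by field name."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


documents = rows_to_documents(SAMPLE_ROWS)
for doc in documents:
    # Each document exposes the same fields, which is what makes them searchable as attributes.
    print(doc["id"], "|", doc["title"], "|", doc["last_modified"])
```

Here the title and body columns play the role of content fields, while last_modified is an example of a metadata field.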

You can add documents to your index through the following ways:

To add a FAQ file, use the CreateFaq API to add a file stored in an Amazon S3 bucket. You can choose between a basic CSV format, a CSV format that includes custom fields/attributes in a header, and a JSON format that includes custom fields. The default format is basic CSV.
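As a rough sketch, a CreateFaq request could be assembled with the AWS SDK for Python (boto3) as shown below. The helper names build_create_faq_request and create_faq are hypothetical, and all argument values (index ID, bucket, key, role ARN) are placeholders you would supply. The FileFormat values are assumed to correspond to the three formats described above; check them against the CreateFaq API reference before relying on them.

```python
# FileFormat values assumed to map to: basic CSV, CSV with a header of
# custom fields/attributes, and JSON with custom fields.
FAQ_FILE_FORMATS = {"CSV", "CSV_WITH_HEADER", "JSON"}


def build_create_faq_request(index_id, name, bucket, key, role_arn, file_format="CSV"):
    """Build the keyword arguments for a CreateFaq call (basic CSV by default)."""
    if file_format not in FAQ_FILE_FORMATS:
        raise ValueError(f"Unsupported FAQ file format: {file_format!r}")
    return {
        "IndexId": index_id,
        "Name": name,
        "S3Path": {"Bucket": bucket, "Key": key},
        "RoleArn": role_arn,
        "FileFormat": file_format,
    }


def create_faq(request):
    """Send the request to Amazon Kendra; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("kendra")
    return client.create_faq(**request)
```

Keeping the request builder separate from the network call makes it possible to validate the chosen file format before anything is sent to the service.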

The following table describes each supported document format and how Amazon Kendra treats it when indexing documents.

| Document format | Treated as | How document is treated | Original structure |
| --- | --- | --- | --- |
| Portable Document Format (PDF) | HTML | Converted to HTML, then content is extracted. | Unstructured |
| HyperText Markup Language (HTML) | HTML | HTML tags are filtered out to extract content. Content must be between the main HTML start and closing tags (<HTML>content</HTML>). | Semi-structured |
| Extensible Markup Language (XML) | XML | XML tags are filtered out to extract content. | Semi-structured |
| Extensible Stylesheet Language Transformation (XSLT) | XSLT | Tags are filtered out to extract content. | Semi-structured |
| Markdown (MD) | Plain text | Content is extracted with Markdown syntax included. | Semi-structured |
| Comma Separated Values (CSV) | CSV | Content is extracted from each cell, with a single file treated as a single document result. | Structured for FAQ files, otherwise semi-structured |
| Microsoft Excel (XLS and XLSX) | XLS and XLSX | Content is extracted from each cell, with a single file treated as a single document result. | Semi-structured |
| JavaScript Object Notation (JSON) | Plain text | Content is extracted with JSON syntax included. | Semi-structured |
| Rich Text Format (RTF) | RTF | RTF syntax is filtered out to extract content. | Semi-structured |
| Microsoft PowerPoint (PPT) | PPT, PPTX | Only text content is extracted from PowerPoint slides for search. Images and other content are not extracted. | Unstructured |
| Microsoft Word | DOC, DOCX | Only text content is extracted from Word pages for search. Images and other content are not extracted. | Unstructured |
| Plain text (TXT) | TXT | All text in the text document is extracted. | Unstructured |
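When uploading documents programmatically, it can help to map file extensions to a content type before indexing. The sketch below is illustrative only: the helper content_type_for is hypothetical, and the content type strings are assumed to match the ContentType values of the BatchPutDocument API as best recalled, so verify them against the API reference.

```python
from pathlib import Path

# Assumed mapping from file extension to a BatchPutDocument ContentType value.
# The extensions follow the formats in the table above; the values are an
# assumption to verify against the API reference.
CONTENT_TYPE_BY_EXTENSION = {
    ".pdf": "PDF",
    ".html": "HTML",
    ".xml": "XML",
    ".xslt": "XSLT",
    ".md": "MD",
    ".csv": "CSV",
    ".xls": "MS_EXCEL",
    ".xlsx": "MS_EXCEL",
    ".json": "JSON",
    ".rtf": "RTF",
    ".ppt": "PPT",
    ".pptx": "PPT",
    ".doc": "MS_WORD",
    ".docx": "MS_WORD",
    ".txt": "PLAIN_TEXT",
}


def content_type_for(filename):
    """Return the assumed content type for a file name, or None if unsupported."""
    extension = Path(filename).suffix.lower()
    return CONTENT_TYPE_BY_EXTENSION.get(extension)


for name in ["Report.PDF", "notes.docx", "archive.zip"]:
    print(name, "->", content_type_for(name))
```

Returning None for unknown extensions lets the caller decide whether to skip the file or report it, rather than guessing a format.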