Chunking and parsing with knowledge bases - Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio is in preview release and is subject to change.

Chunking and parsing with knowledge bases

Chunking and parsing are preprocessing techniques used to prepare and organize textual data for efficient storage, retrieval, and utilization by a model.

Chunking

When ingesting your data, Amazon Bedrock first splits your documents or content into manageable chunks for efficient data retrieval. The chunks are then converted to embeddings and written to a vector index (vector representation of the data), while maintaining a mapping to the original document. The vector embeddings allow the texts to be quantitatively compared.
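As an illustration of what comparing embeddings quantitatively can mean, the sketch below computes cosine similarity between two vectors. Cosine similarity is a common choice for this kind of comparison, but the source does not say which metric Amazon Bedrock uses, so treat the metric, the function name, and the toy three-dimensional vectors as assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real embedding models produce far more dimensions.
chunk_about_pricing = [0.9, 0.1, 0.0]
chunk_about_quotas = [0.1, 0.9, 0.0]
query_about_cost = [0.8, 0.2, 0.0]

pricing_score = cosine_similarity(query_about_cost, chunk_about_pricing)
quotas_score = cosine_similarity(query_about_cost, chunk_about_quotas)
```

In this toy data the query vector points in nearly the same direction as the pricing chunk, so `pricing_score` comes out higher than `quotas_score`.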

Amazon Bedrock supports different approaches to chunking. Amazon Bedrock IDE supports default chunking, which splits content into text chunks of approximately 300 tokens. The chunking process honors sentence boundaries, so complete sentences are preserved within each chunk.
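The idea of sentence-aware chunking can be sketched in a few lines of Python. This is not the Amazon Bedrock implementation: the regular-expression sentence splitter, the whitespace-based token count, and the names `split_sentences`, `count_tokens`, and `chunk_text` are all assumptions chosen to keep the example self-contained. A real tokenizer would count tokens differently.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def count_tokens(sentence):
    """Assumption: approximate tokens by whitespace-separated words."""
    return len(sentence.split())


def chunk_text(text, max_tokens=300):
    """Group whole sentences into chunks of roughly max_tokens tokens.

    A sentence is never split, so a single sentence longer than
    max_tokens becomes its own oversized chunk.
    """
    chunks = []
    current = []
    current_tokens = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current = []
            current_tokens = 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
small_chunks = chunk_text(sample, max_tokens=6)
```

With the small budget of 6 tokens used here, the first two sentences fit together in one chunk and the third starts a new one, which is why the chunk size is only approximate.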

You can set the maximum number of source chunks to retrieve from the vector store. For more information, see Add a Knowledge Base component to a chat agent app.
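The effect of a maximum on the number of source chunks can be pictured as a top-k search over the stored embeddings. The following is a minimal in-memory sketch, assuming cosine similarity as the ranking metric and a hypothetical index of dictionaries with `text`, `source`, and `embedding` keys. The function name `retrieve_top_chunks` and the `max_chunks` parameter are illustrative and do not correspond to any Amazon Bedrock setting or API.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_chunks(query_embedding, index, max_chunks=5):
    """Return at most max_chunks entries, most similar to the query first."""
    if max_chunks < 1:
        raise ValueError("max_chunks must be at least 1")
    scored = (
        (cosine_similarity(query_embedding, entry["embedding"]), entry)
        for entry in index
    )
    best = heapq.nlargest(max_chunks, scored, key=lambda pair: pair[0])
    return [entry for _, entry in best]


# Hypothetical index; each entry keeps a mapping back to its source document.
toy_index = [
    {"text": "a", "source": "doc1.pdf", "embedding": [1.0, 0.0]},
    {"text": "b", "source": "doc2.pdf", "embedding": [0.0, 1.0]},
    {"text": "c", "source": "doc1.pdf", "embedding": [0.9, 0.1]},
]
top_two = retrieve_top_chunks([1.0, 0.0], toy_index, max_chunks=2)
```

Raising the limit gives the model more context to draw on, and lowering it keeps the prompt shorter. Because each entry carries its `source`, the retrieved chunks can still be traced back to the original document.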

Parsing

Parsing involves analyzing the structure of information to understand its components and their relationships. With Amazon Bedrock IDE, you can use advanced parsing techniques to parse non-textual information from documents. This feature allows you to select a foundation model for parsing complex data, such as tables and charts. Additionally, you can tailor parsing to your specific needs by overwriting the default prompts for data extraction, which helps it perform well across a diverse set of use cases. Currently, Anthropic Claude 3 Sonnet and Anthropic Claude 3 Haiku are supported.

Amazon Bedrock IDE only supports advanced parsing with PDF format files. If your files aren't in PDF format, you must convert them to PDF format before you can apply advanced parsing.

Note

There are additional costs to using advanced parsing. This is due to its use of a foundation model. The cost depends on the amount of data you have. See Amazon Bedrock pricing for more information on the cost of foundation models.

There are limits on the types of files and the total amount of data that can be parsed using advanced parsing. For information on the file types for parsing, see Document formats. For information on the total data that can be parsed using advanced parsing, see Quotas.

For more information, see How content chunking and parsing works for knowledge bases.

To create a Knowledge Base that uses an embeddings model, vector store, and advanced parsing, see Create a Knowledge Base component.