How content chunking and parsing works for knowledge bases
Amazon Bedrock first splits your documents or content into manageable chunks for efficient data retrieval. The chunks are then converted to embeddings and written to a vector index (vector representation of the data), while maintaining a mapping to the original document. The vector embeddings allow the texts to be mathematically compared.
Topics
- Standard chunking
- Hierarchical chunking
- Semantic chunking
- Advanced parsing options
- Metadata selection for CSVs
- Custom transformation
Standard chunking
Amazon Bedrock supports the following standard approaches to chunking:
- Fixed-size chunking: You can configure the desired chunk size by specifying the number of tokens per chunk and an overlap percentage, providing flexibility to align with your specific requirements. You set the maximum number of tokens that a chunk must not exceed and the overlap percentage between consecutive chunks.
- Default chunking: Splits content into text chunks of approximately 300 tokens. The chunking process honors sentence boundaries, ensuring that complete sentences are preserved within each chunk.
You can also choose no chunking for your documents. Each document is then treated as a single text chunk. You might want to pre-process your documents by splitting them into separate files before choosing no chunking as your chunking strategy. If you choose no chunking for your documents, you cannot view page numbers in citations or filter by the x-amz-bedrock-kb-document-page-number metadata field. This field is automatically generated only for PDF files and only if you use Amazon OpenSearch Serverless as your vector store.
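To make these options concrete, the following is a minimal sketch, not taken from the official examples, of how the standard chunking options might be expressed as Python dictionaries for the vectorIngestionConfiguration request field. The token and overlap values are illustrative placeholders.

# Minimal sketch: standard chunking configurations as Python dicts.
# The numeric values are illustrative; FIXED_SIZE and NONE are the documented strategy names.

fixed_size_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 300,        # maximum tokens per chunk
            "overlapPercentage": 20  # overlap between consecutive chunks
        }
    }
}

# "No chunking": each (pre-split) document becomes a single chunk.
no_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "NONE"
    }
}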
The following is an example of configuring fixed-size chunking:
Console
- Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.
- From the left navigation pane, select Knowledge bases.
- In the Knowledge bases section, select Create knowledge base.
- Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.
- Choose a supported data source and provide the connection configuration details.
- For the chunking and parsing configurations, first choose the custom option and then choose fixed-size chunking as your chunking strategy.
- Enter the maximum number of tokens for a chunk and the overlap percentage between consecutive chunks.
- Continue the steps to complete creating your knowledge base.
API
{ ... "vectorIngestionConfiguration": { "chunkingConfiguration": { "chunkingStrategy": "FIXED_SIZE", "fixedSizeChunkingConfiguration": { "maxTokens": "100", "overlapPercentage": "10" } } } }
Hierarchical chunking
Hierarchical chunking involves organizing information into nested structures of child and parent chunks. When creating a data source, you can define the parent chunk size, the child chunk size, and the number of overlap tokens between chunks. During retrieval, the system initially retrieves child chunks, but replaces them with broader parent chunks to provide the model with more comprehensive context.
Small text embeddings are more precise, but retrieval aims for comprehensive context. A hierarchical chunking system balances these needs by replacing retrieved child chunks with their parent chunks when appropriate.
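The following is a simplified, illustrative sketch of the parent-replacement idea; it is not Bedrock's internal implementation. Child chunks are ranked by relevance, and each retrieved child is swapped for its parent chunk so the model sees the surrounding context.

# Illustrative sketch of hierarchical retrieval, not Bedrock's implementation.
# Assumes child_chunks is a list of dicts with "text", "score", and "parent_id",
# and parents maps parent_id -> parent chunk text.

def retrieve_with_parent_replacement(child_chunks, parents, top_k=5):
    # Rank child chunks by similarity score (small chunks embed precisely).
    top_children = sorted(child_chunks, key=lambda c: c["score"], reverse=True)[:top_k]

    # Replace each child with its broader parent chunk, de-duplicating parents.
    seen, results = set(), []
    for child in top_children:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results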
For hierarchical chunking, Amazon Bedrock knowledge bases supports specifying two levels of chunking depth:
- Parent: You set the maximum parent chunk token size.
- Child: You set the maximum child chunk token size.
You also set the number of overlap tokens between chunks. This is the absolute number of overlap tokens between consecutive parent chunks and consecutive child chunks.
The following is an example of configuring hierarchical chunking:
Console
- Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.
- From the left navigation pane, select Knowledge bases.
- In the Knowledge bases section, select Create knowledge base.
- Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.
- Choose a supported data source and provide the connection configuration details.
- For the chunking and parsing configurations, first choose the custom option and then choose hierarchical chunking as your chunking strategy.
- Enter the maximum parent chunk token size.
- Enter the maximum child chunk token size.
- Enter the number of overlap tokens between chunks. This is the absolute number of overlap tokens between consecutive parent chunks and consecutive child chunks.
- Continue the steps to complete creating your knowledge base.
API
{ ... "vectorIngestionConfiguration": { "chunkingConfiguration": { "chunkingStrategy": "HIERARCHICAL", "hierarchicalChunkingConfiguration": { // Hierarchical chunking "levelConfigurations": [{ "maxTokens": 1500 // Parent max tokens }, { "maxTokens": 300 // Child max tokens }], "overlapTokens": 60 } } } }
Note
The recommended default values are:
- 1,500 max tokens per parent chunk
- 300 max tokens per child chunk
- 60 overlap tokens between consecutive parent chunks and consecutive child chunks
For more information on the accepted values for max tokens per parent and child chunk, and the overlap tokens, see HierarchicalChunkingConfiguration.
Semantic chunking
Semantic chunking is a natural language processing technique that divides text into meaningful chunks to enhance understanding and information retrieval. It aims to improve retrieval accuracy by focusing on the semantic content rather than just syntactic structure. By doing so, it may facilitate more precise extraction and manipulation of relevant information.
When configuring semantic chunking, you can specify the following hyperparameters:
- Maximum tokens: The maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.
- Buffer size: For a given sentence, the buffer size defines the number of surrounding sentences to be added for embeddings creation. For example, a buffer size of 1 results in 3 sentences (the current, previous, and next sentences) being combined and embedded. This parameter influences how much text is examined together to determine the boundaries of each chunk, affecting the granularity and coherence of the resulting chunks. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.
- Breakpoint percentile threshold: The percentile threshold of sentence distance or dissimilarity at which to draw breakpoints between sentences. A higher threshold requires sentences to be more distinguishable in order to be split into different chunks, and results in fewer chunks and a typically larger average chunk size.
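The following is a rough, illustrative sketch of how a breakpoint percentile threshold can be applied; it does not reproduce Bedrock's exact algorithm or embedding model, and the embed function is a placeholder for an embedding call. Sentence groups are embedded, distances between consecutive groups are computed, and boundaries are drawn where the distance exceeds the chosen percentile.

import numpy as np

# Illustrative sketch of percentile-based semantic breakpoints.
# embed() is a placeholder that must return a vector for a text string.

def semantic_breakpoints(sentences, buffer_size=0, percentile=95, embed=None):
    # Build buffered groups: each sentence plus its surrounding sentences.
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = np.array([embed(g) for g in groups])

    # Cosine distance between consecutive groups.
    a, b = vectors[:-1], vectors[1:]
    sims = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    distances = 1 - sims

    # Split wherever the distance exceeds the chosen percentile.
    threshold = np.percentile(distances, percentile)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]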
Note
There are additional costs to using semantic chunking due to its use of a foundation model. The cost depends on the amount of data you have. See Amazon Bedrock pricing for more information on the cost of foundation models.
The following is an example of configuring semantic chunking:
Console
- Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.
- From the left navigation pane, select Knowledge bases.
- In the Knowledge bases section, select Create knowledge base.
- Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.
- Choose a supported data source and provide the connection configuration details.
- For the chunking and parsing configurations, first choose the custom option and then choose semantic chunking as your chunking strategy.
- Enter the buffer size, which is the number of sentences surrounding the target sentence to group together. For example, a buffer size of 1 groups the previous, target, and next sentences.
- Enter the maximum token size for a text chunk.
- Set the breakpoint percentile threshold between sentence groups. A higher threshold requires sentences to be more distinguishable in order to be split into different chunks, resulting in fewer chunks and a typically larger average chunk size.
- Continue the steps to complete creating your knowledge base.
API
{ ... "vectorIngestionConfiguration": { "chunkingConfiguration": { "chunkingStrategy": "SEMANTIC", "semanticChunkingConfiguration": { // Semantic chunking "maxTokens": 300, "bufferSize": 0, "breakpointPercentileThreshold": 95 } } } }
Note
The recommended default values are:
- 300 max tokens per chunk
- 0 buffer size
- 95% breakpoint percentile threshold
For more information on the accepted values for max tokens per chunk, buffer size, and breakpoint percentile threshold, see SemanticChunkingConfiguration.
Advanced parsing options
You can use advanced parsing techniques for parsing non-textual information from documents. This feature allows you to select a foundation model or inference profile for parsing of complex data, such as tables and charts. Additionally, you can tailor this to your specific needs by overwriting the default prompts for data extraction, ensuring optimal performance across a diverse set of use cases. Currently, Anthropic Claude 3 Sonnet and Anthropic Claude 3 Haiku are supported.
Note
There are additional costs to using advanced parsing due to its use of a foundation model. The cost depends on the amount of data you have. See Amazon Bedrock pricing for more information.
There are limits for the types of files and total data that can be parsed using advanced parsing. For information on the file types for advanced parsing, see Document formats. For information on the total data that can be parsed using advanced parsing, see Quotas.
The following is an example of configuring a foundation model to aid in advanced parsing:
Console
- Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.
- From the left navigation pane, select Knowledge bases.
- In the Knowledge bases section, select Create knowledge base.
- Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.
- Choose a supported data source and provide the connection configuration details.
- For the chunking and parsing configurations, first choose the custom option, then enable Foundation model and select your preferred foundation model. You can also optionally overwrite the Instructions for the parser to suit your specific needs.
- Continue the steps to complete creating your knowledge base.
API
{ ... "vectorIngestionConfiguration": { "chunkingConfiguration": { ... }, "parsingConfiguration": { // Parse tabular data within docs "parsingStrategy": "BEDROCK_FOUNDATION_MODEL", "bedrockFoundationModelConfiguration": { "parsingPrompt": { "parsingPromptText": "string" }, "modelArn": "string" } } } }
Metadata selection for CSVs
When ingesting CSV (comma-separated values) files, you can have the knowledge base treat certain columns as content fields and others as metadata fields. Instead of potentially having hundreds or thousands of content/metadata file pairs, you can have a single CSV file and a corresponding metadata.json file that gives the knowledge base hints on how to treat each column inside your CSV.
There are limits for document metadata fields/attributes per chunk. See Quotas for knowledge bases.
Before ingesting a CSV file, make sure:
- Your CSV is in RFC4180 format and is UTF-8 encoded.
- The first row of your CSV includes header information.
- The metadata fields provided in your metadata.json are present as columns in your CSV.
- You provide a fileName.csv.metadata.json file with the following format:
{ "metadataAttributes": { "${attribute1}": "${value1}", "${attribute2}": "${value2}", ... }, "documentStructureConfiguration": { "type": "RECORD_BASED_STRUCTURE_METADATA", "recordBasedStructureMetadata": { "contentFields": [ { "fieldName": "string" } ], "metadataFieldsSpecification": { "fieldsToInclude": [ { "fieldName": "string" } ], "fieldsToExclude": [ { "fieldName": "string" } ] } } } }
The CSV file is parsed one row at a time, and the chunking strategy and vector embedding are applied to the content field. Amazon Bedrock knowledge bases currently supports one content field. The content field is split into chunks, and the metadata fields (columns) that are associated with each chunk are treated as string values.
For example, say there's a CSV with a 'Description' column and a 'Creation_Date' column. The description field is the content field and the creation date is an associated metadata field. The description text is split into chunks and converted into vector embeddings for each row in the CSV. The creation date value is treated as a string representation of the date and is associated with each chunk of the description.
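Continuing that example, a hypothetical products.csv could be paired with a products.csv.metadata.json sidecar such as the one written by the sketch below. The file name, the source attribute, and the column names are illustrative only.

import json

# Illustrative sidecar for a hypothetical products.csv whose "Description"
# column is the content field and "Creation_Date" is a metadata column.
sidecar = {
    "metadataAttributes": {
        "source": "product-catalog"          # illustrative file-level attribute
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [{"fieldName": "Description"}],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [{"fieldName": "Creation_Date"}],
                "fieldsToExclude": []
            }
        }
    }
}

with open("products.csv.metadata.json", "w", encoding="utf-8") as f:
    json.dump(sidecar, f, indent=2)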
If no inclusion/exclusion fields are provided, all columns are treated as metadata columns, except the content column.
If only inclusion fields are provided, only the provided columns are treated as metadata.
If only exclusion fields are provided, all columns except the excluded columns are treated as metadata.
If you provide the same fieldName in both fieldsToInclude and fieldsToExclude, Amazon Bedrock throws a validation exception. Any conflict between inclusion and exclusion results in a failure.
Blank rows found inside a CSV file are skipped.
Custom transformation
You have the ability to define a custom transformation Lambda function to inject your own logic into the knowledge base ingestion process.
You may have specific chunking logic that isn't natively supported by Amazon Bedrock knowledge bases. In that case, use the no chunking strategy option while specifying a Lambda function that contains your chunking logic. Additionally, you need to specify an Amazon S3 bucket for the knowledge base to write files to be chunked by your Lambda function.
After chunking, your Lambda function writes the chunked files back to the same bucket and returns references for the knowledge base for further processing. You can optionally provide your own AWS KMS key for encryption of the files being stored in your S3 bucket.
Alternatively, you may want to specify chunk-level metadata while having the knowledge base apply one of the natively supported chunking strategies. In this case, select one of the pre-defined chunking strategies (for example, default or fixed-size chunking) while providing a reference to your Lambda function and S3 bucket. The knowledge base then stores parsed and pre-chunked files in the pre-defined S3 bucket before calling your Lambda function to add chunk-level metadata.
After adding chunk-level metadata, your Lambda function writes the chunked files back to the same bucket and returns references for the knowledge base for further processing. Note that chunk-level metadata takes precedence over and overwrites file-level metadata in case of any collisions.
For an example of using a Python Lambda function for custom chunking, see Custom chunking using Lambda function.
For the API and file contracts, refer to the structures below:
API contract when adding a custom transformation using Lambda function
{ ... "vectorIngestionConfiguration": { "customTransformationConfiguration": { // Custom transformation "intermediateStorage": { "s3Location": { // the location where input/output of the Lambda is expected "uri": "string" } }, "transformations": [{ "transformationFunction": { "transformationLambdaConfiguration": { "lambdaArn": "string" } }, "stepToApply": "string" // enum of POST_CHUNKING }] }, "chunkingConfiguration": { "chunkingStrategy": "string", "fixedSizeChunkingConfiguration": { "maxTokens": "number", "overlapPercentage": "number" } ... } } }
Custom Lambda transformation input format
{ "version": "1.0", "knowledgeBaseId": "string", "dataSourceId": "string", "ingestionJobId": "string", "bucketName": "string", "priorTask": "string", "inputFiles": [{ "originalFileLocation": { "type": "S3", "s3_location": { "uri": "string" } }, "fileMetadata": { "key1": "value1", "key2": "value2" }, "contentBatches": [{ "key":"string" }] }] }
Custom Lambda transformation output format
{ "outputFiles": [{ "originalFileLocation": { "type": "S3", "s3_location": { "uri": "string" } }, "fileMetadata": { "key1": "value1", "key2": "value2" }, "contentBatches": [{ "key": "string" }] }] }
File format for objects referenced in fileContents
{ "fileContents": [{ "contentBody": "...", "contentType": "string", // enum of TEXT, PDF, ... "contentMetadata": { "key1": "value1", "key2": "value2" } } ... ] }