Customize ingestion for a data source
You can customize vector ingestion when connecting a data source in the AWS Management Console or by modifying the value of the vectorIngestionConfiguration field when sending a CreateDataSource request.
The following topics describe how to include configurations for customizing ingestion when connecting to a data source:
Choose the tool to use for parsing
You can customize how the documents in your data are parsed. To learn about options for parsing data in Amazon Bedrock Knowledge Bases, see Parsing options for your data source.
In the AWS Management Console, you choose the parsing strategy when connecting to a data source. With the Amazon Bedrock API, you include a ParsingConfiguration in the parsingConfiguration field of the VectorIngestionConfiguration.
Note
If you omit this configuration, Amazon Bedrock Knowledge Bases uses the Amazon Bedrock default parser.
Follow the instructions for the parsing strategy that you want to use:
To use the default parser, don't include a parsingConfiguration field within the VectorIngestionConfiguration.
To use the Amazon Bedrock Data Automation parser, specify BEDROCK_DATA_AUTOMATION in the parsingStrategy field of the ParsingConfiguration and include a BedrockDataAutomationConfiguration in the bedrockDataAutomationConfiguration field, as in the following format:
{ "parsingStrategy": "BEDROCK_DATA_AUTOMATION", "bedrockDataAutomationConfiguration": { "parsingModality": "string" } }
To use a foundation model as a parser, specify BEDROCK_FOUNDATION_MODEL in the parsingStrategy field of the ParsingConfiguration and include a BedrockFoundationModelConfiguration in the bedrockFoundationModelConfiguration field, as in the following format:
{ "parsingStrategy": "BEDROCK_FOUNDATION_MODEL", "bedrockFoundationModelConfiguration": { "modelArn": "string", "parsingModality": "string", "parsingPrompt": { "parsingPromptText": "string" } } }
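As an illustration of the format above, the following Python sketch builds a parsingConfiguration for the foundation model strategy as a plain dictionary. The helper name, the model ARN, and the prompt text are hypothetical placeholders, and treating parsingModality and parsingPrompt as optional is an assumption of this sketch, not something stated above.

```python
import json


def foundation_model_parsing_config(model_arn, prompt_text=None, modality=None):
    """Build a parsingConfiguration dict for the BEDROCK_FOUNDATION_MODEL strategy.

    Hypothetical helper: prompt_text and modality are treated as optional,
    which is an assumption of this sketch.
    """
    fm_config = {"modelArn": model_arn}
    if modality is not None:
        fm_config["parsingModality"] = modality
    if prompt_text is not None:
        fm_config["parsingPrompt"] = {"parsingPromptText": prompt_text}
    return {
        "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
        "bedrockFoundationModelConfiguration": fm_config,
    }


# Placeholder ARN and prompt; substitute the values for your own account.
parsing_config = foundation_model_parsing_config(
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/example-model-id",
    prompt_text="Transcribe the text content of this page.",
)

# The parsing configuration sits in the parsingConfiguration field
# of the vector ingestion configuration.
vector_ingestion_config = {"parsingConfiguration": parsing_config}
print(json.dumps(vector_ingestion_config, indent=2))
```

Leaving the parsingConfiguration key out of the outer dictionary entirely corresponds to using the default parser.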
Choose a chunking strategy
You can customize how the documents in your data are chunked for storage and retrieval. To learn about options for chunking data in Amazon Bedrock Knowledge Bases, see How content chunking works for knowledge bases.
Warning
You can't change the chunking strategy after connecting to the data source.
In the AWS Management Console, you choose the chunking strategy when connecting to a data source. With the Amazon Bedrock API, you include a ChunkingConfiguration in the chunkingConfiguration field of the VectorIngestionConfiguration.
Note
If you omit this configuration, Amazon Bedrock splits your content into chunks of approximately 300 tokens, while preserving sentence boundaries.
Follow the instructions for the chunking strategy that you want to use:
To treat each document in your data source as a single source chunk, specify NONE in the chunkingStrategy field of the ChunkingConfiguration, as in the following format:
{ "chunkingStrategy": "NONE" }
To divide each document in your data source into chunks of approximately the same size, specify FIXED_SIZE in the chunkingStrategy field of the ChunkingConfiguration and include a FixedSizeChunkingConfiguration in the fixedSizeChunkingConfiguration field, as in the following format:
{ "chunkingStrategy": "FIXED_SIZE", "fixedSizeChunkingConfiguration": { "maxTokens": number, "overlapPercentage": number } }
To divide each document in your data source into two levels, where the second layer contains smaller chunks derived from the first layer, specify HIERARCHICAL in the chunkingStrategy field of the ChunkingConfiguration and include the hierarchicalChunkingConfiguration field, as in the following format:
{ "chunkingStrategy": "HIERARCHICAL", "hierarchicalChunkingConfiguration": { "levelConfigurations": [{ "maxTokens": number }], "overlapTokens": number } }
To divide each document in your data source into chunks that prioritize semantic meaning over syntactic structure, specify SEMANTIC in the chunkingStrategy field of the ChunkingConfiguration and include the semanticChunkingConfiguration field, as in the following format:
{ "chunkingStrategy": "SEMANTIC", "semanticChunkingConfiguration": { "breakpointPercentileThreshold": number, "bufferSize": number, "maxTokens": number } }
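The chunking formats above can be sketched in Python as small builder functions that return plain dictionaries. The function names and all numeric values are illustrative assumptions, not recommended settings. The check that child chunks are smaller than parent chunks reflects the description of hierarchical chunking above; it is a client-side convenience in this sketch, not a documented validation rule.

```python
import json


def fixed_size_chunking(max_tokens, overlap_percentage):
    """Build a chunkingConfiguration dict for the FIXED_SIZE strategy."""
    return {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": max_tokens,
            "overlapPercentage": overlap_percentage,
        },
    }


def hierarchical_chunking(parent_max_tokens, child_max_tokens, overlap_tokens):
    """Build a chunkingConfiguration dict for the HIERARCHICAL strategy.

    Two level configurations are emitted: the first (parent) layer,
    then the second (child) layer of smaller chunks.
    """
    if child_max_tokens >= parent_max_tokens:
        raise ValueError("child chunks should be smaller than parent chunks")
    return {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": parent_max_tokens},
                {"maxTokens": child_max_tokens},
            ],
            "overlapTokens": overlap_tokens,
        },
    }


def semantic_chunking(breakpoint_percentile_threshold, buffer_size, max_tokens):
    """Build a chunkingConfiguration dict for the SEMANTIC strategy."""
    return {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "breakpointPercentileThreshold": breakpoint_percentile_threshold,
            "bufferSize": buffer_size,
            "maxTokens": max_tokens,
        },
    }


# Illustrative values only; choose sizes that suit your documents.
chunking_config = hierarchical_chunking(
    parent_max_tokens=1500, child_max_tokens=300, overlap_tokens=60
)
print(json.dumps({"chunkingConfiguration": chunking_config}, indent=2))
```

Because the chunking strategy can't be changed after connecting to the data source, building and reviewing the dictionary before sending the request may help catch mistakes early.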
Use a Lambda function during ingestion
You can post-process how the source chunks from your data are written to the vector store with a Lambda function in the following ways:
- Include chunking logic to provide a custom chunking strategy.
- Include logic to specify chunk-level metadata.
To learn about writing a custom Lambda function for ingestion, see Use a custom transformation Lambda function to define how your data is ingested.
In the AWS Management Console, you choose the Lambda function when connecting to a data source. With the Amazon Bedrock API, you include a CustomTransformationConfiguration in the customTransformationConfiguration field of the VectorIngestionConfiguration and specify the ARN of the Lambda function, as in the following format:
{ "transformations": [{ "transformationFunction": { "transformationLambdaConfiguration": { "lambdaArn": "string" } }, "stepToApply": "POST_CHUNKING" }], "intermediateStorage": { "s3Location": { "uri": "string" } } }
You also specify the S3 location in which to store the output after applying the Lambda function.
You can include the chunkingConfiguration field to apply the Lambda function after applying one of the chunking options that Amazon Bedrock offers.
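To show how the Lambda configuration combines with a chunking option, the following Python sketch assembles a vector ingestion configuration as a plain dictionary. The helper name, the Lambda ARN, and the S3 URI are hypothetical placeholders. Pairing the NONE chunking strategy with a Lambda function that does its own chunking is one possible arrangement, chosen here only for illustration.

```python
import json


def custom_transformation_config(lambda_arn, s3_uri):
    """Build a customTransformationConfiguration dict.

    lambda_arn identifies the transformation Lambda function; s3_uri is the
    S3 location where output is stored after the function is applied.
    """
    return {
        "transformations": [
            {
                "transformationFunction": {
                    "transformationLambdaConfiguration": {"lambdaArn": lambda_arn}
                },
                "stepToApply": "POST_CHUNKING",
            }
        ],
        "intermediateStorage": {"s3Location": {"uri": s3_uri}},
    }


# Placeholder ARN and bucket; substitute the values for your own account.
vector_ingestion_config = {
    # Keep each document as one source chunk and let the Lambda function
    # apply the custom chunking logic afterwards.
    "chunkingConfiguration": {"chunkingStrategy": "NONE"},
    "customTransformationConfiguration": custom_transformation_config(
        lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-transform",
        s3_uri="s3://amzn-s3-demo-bucket/intermediate/",
    ),
}
print(json.dumps(vector_ingestion_config, indent=2))
```

A dictionary of this shape would be supplied as the value of the vectorIngestionConfiguration field in a CreateDataSource request.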