Connect to Amazon S3 for your Amazon Bedrock knowledge base
Amazon S3 is an object storage service that stores data as objects within buckets.
You can connect to your Amazon S3 bucket for your Amazon Bedrock knowledge base by using either
the AWS Management Console for Amazon Bedrock
or the CreateDataSource
API (see Amazon Bedrock supported SDKs and AWS CLI).
You can upload a small batch of files to an Amazon S3 bucket using the Amazon S3 console or API.
Alternatively, you can use AWS DataSync to upload multiple files to Amazon S3 continuously, or to transfer files on a schedule from on-premises, edge, other cloud, or AWS storage.
Currently, only general purpose S3 buckets are supported.
There are limits on the number of files that can be crawled and on the maximum size of each file. See Quotas for knowledge bases.
Supported features
- Document metadata fields
- Inclusion/exclusion content filters
- Incremental content syncs for added, updated, and deleted content
Prerequisites
In Amazon S3, make sure you:
- Note the Amazon S3 bucket URI, Amazon Resource Name (ARN), and the AWS account ID of the bucket owner. You can find the URI and ARN in the properties section of the Amazon S3 console. Your bucket must be in the same Region as your Amazon Bedrock knowledge base, and you must have permission to access the bucket.
In your AWS account, make sure you:
- Include the necessary permissions to connect to your data source in the AWS Identity and Access Management (IAM) role/permissions policy for your knowledge base. For information on the permissions this data source requires in your knowledge base IAM role, see Permissions to access data sources.
If you use the console, the IAM role with all the required permissions can be created for you as part of the steps for creating a knowledge base. After you have configured your data source and other configurations, the IAM role with all the required permissions is applied to your specific knowledge base.
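As a rough sketch (not the authoritative policy; the bucket name and account ID are placeholders), the S3-related statements in the knowledge base role typically allow listing the bucket and reading its objects, scoped to the bucket owner's account:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3ListBucketStatement",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::bucket-name"],
            "Condition": {"StringEquals": {"aws:ResourceAccount": ["000000000000"]}}
        },
        {
            "Sid": "S3GetObjectStatement",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::bucket-name/*"],
            "Condition": {"StringEquals": {"aws:ResourceAccount": ["000000000000"]}}
        }
    ]
}
```

See Permissions to access data sources for the exact statements required for your setup.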
Connection configuration
To connect to your Amazon S3 bucket, you must provide the necessary configuration
information so that Amazon Bedrock can access and crawl your data. You must also follow the
Prerequisites.
An example of a configuration for this data source is included in this section.
For more information about inclusion/exclusion filters, document metadata fields,
and incremental syncing, and how these work, see the following:
You can include a separate file that specifies the document metadata fields/attributes for each
file in Amazon S3. For example, the document oscars-coverage_20240310.pdf
contains news articles, which can be categorized by year and genre. For this example,
create and upload to your bucket the following
oscars-coverage_20240310.pdf.metadata.json
file.
{
    "metadataAttributes": {
        "genre": "entertainment",
        "year": 2024
    }
}
The metadata file must use the same name as its associated source document file,
with .metadata.json
appended onto the end of the file name. The metadata file
must be stored in the same folder or location as the source file in your Amazon S3 bucket. The file
must not exceed the limit of 10 KB. For information on the supported attribute/field data types
and the filtering operators you can apply to your metadata fields, see Metadata and filtering.
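The naming and size rules above can be checked locally before you upload. The helper functions below are our own sketch (not part of any AWS SDK) of deriving the metadata key and enforcing the 10 KB limit:

```python
import json

MAX_METADATA_BYTES = 10 * 1024  # a metadata file must not exceed 10 KB

def metadata_key_for(document_key: str) -> str:
    """Derive the metadata file key for a source document: same folder and
    name, with .metadata.json appended onto the end of the file name."""
    return document_key + ".metadata.json"

def build_metadata(attributes: dict) -> bytes:
    """Serialize attributes in the expected shape and enforce the size limit."""
    body = json.dumps({"metadataAttributes": attributes}).encode("utf-8")
    if len(body) > MAX_METADATA_BYTES:
        raise ValueError("metadata file exceeds the 10 KB limit")
    return body

print(metadata_key_for("news/oscars-coverage_20240310.pdf"))
# news/oscars-coverage_20240310.pdf.metadata.json
```

Upload the resulting object to the same folder as the source document, using the derived key.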
You can include or exclude crawling certain content. For example, you can specify an
exclusion prefix/regular expression pattern to skip crawling any file that contains
“private” in the file name. You could also specify an inclusion prefix/regular expression
pattern to include certain content entities or content types. If you specify an inclusion
and exclusion filter and both match a document, the exclusion filter takes
precedence and the document isn’t crawled.
An example of a filter pattern to include only PDF files: ".*\\.pdf"
The data source connector crawls new, modified, and deleted content each time your data
source syncs with your knowledge base. Amazon Bedrock can use your data source’s mechanism
for tracking content changes and crawl content that changed since the last sync. When you sync
your data source with your knowledge base for the first time, all content is crawled by default.
To sync your data source with your knowledge base, use the StartIngestionJob
API or select your knowledge base in the console and select Sync within the
data source overview section.
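With the AWS SDK for Python (Boto3), a sync corresponds to calling start_ingestion_job on the bedrock-agent client. A minimal sketch, with the client passed in so the call shape is visible (the IDs are placeholders):

```python
def sync_data_source(client, knowledge_base_id: str, data_source_id: str) -> str:
    """Start an ingestion job (sync) on a data source and return the job ID.
    `client` is any object exposing start_ingestion_job, such as
    boto3.client("bedrock-agent")."""
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]

# Usage (requires AWS credentials with bedrock:StartIngestionJob):
#   import boto3
#   client = boto3.client("bedrock-agent")
#   sync_data_source(client, "your-knowledge-base-id", "your-data-source-id")
```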
All data that you sync from your data source becomes available to anyone with
bedrock:Retrieve
permissions to retrieve the data. This can also include any
data with controlled data source permissions. For more
information, see Knowledge base permissions.
- Console
The following is an example of a configuration for connecting to Amazon S3 for your Amazon Bedrock knowledge base. You configure your data source as part of the knowledge base creation steps in the console.
- Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.
- From the left navigation pane, select Knowledge bases.
- In the Knowledge bases section, select Create knowledge base.
- Provide the knowledge base details.
- Provide the knowledge base name and an optional description.
- Provide the AWS Identity and Access Management role for the access permissions required to create a knowledge base.
The IAM role with all the required permissions can be created for you as part of the console steps for creating a knowledge base. After you have completed the steps for creating a knowledge base, the IAM role with all the required permissions is applied to your specific knowledge base.
- Create any tags you want to assign to your knowledge base.
Go to the next section to configure your data source.
- Choose Amazon S3 as your data source and provide the connection configuration details.
- Provide the data source name.
- Specify whether your Amazon S3 bucket is in your current AWS account or another AWS account.
- Browse to an existing Amazon S3 bucket location or provide the URI. You can find the URI and ARN in the properties section of the Amazon S3 console. Your bucket must be in the same Region as your Amazon Bedrock knowledge base, and you must have permission to access the bucket.
You can choose to use your own managed AWS KMS key for data encryption.
Check the advanced settings. You can optionally change the default selected settings.
- Set your transient data encryption key and data deletion policy in the advanced settings.
For KMS key settings, you can choose either a custom key or the default provided data encryption key. While converting your data into embeddings, Amazon Bedrock encrypts your transient data with a key that AWS owns and manages by default. You can use your own KMS key instead. For more information, see Encryption of transient data storage during data ingestion.
For data deletion policy settings, you can choose either:
- Delete: Deletes all data from your data source that's converted into vector embeddings upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted, only the data. This flag is ignored if an AWS account is deleted.
- Retain: Retains all data from your data source that's converted into vector embeddings upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted if you delete a knowledge base or data source resource.
Continue configuring your data source.
- Choose either the default or customized chunking and parsing configurations.
- If you choose custom settings, select one of the following chunking options:
- Fixed-size chunking: Content is split into chunks of text of your set approximate token size. You set the maximum number of tokens that a chunk must not exceed and the overlap percentage between consecutive chunks.
- Default chunking: Content is split into chunks of text of up to 300 tokens. If a single document or piece of content contains fewer than 300 tokens, the document is not further split.
- Hierarchical chunking: Content is organized into nested structures of parent-child chunks. You set the maximum parent chunk token size and the maximum child chunk token size. You also set the absolute number of overlap tokens between consecutive parent chunks and consecutive child chunks.
- Semantic chunking: Content is organized into semantically similar text chunks or groups of sentences. You set the maximum number of sentences surrounding the target/current sentence to group together (buffer size). You also set the breakpoint percentile threshold for dividing the text into meaningful chunks. Semantic chunking uses a foundation model. View Amazon Bedrock pricing for information on the cost of foundation models.
- No chunking: Each document is treated as a single text chunk. You might want to pre-process your documents by splitting them into separate files.
You can't change the chunking strategy after you have created the data source.
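To make the fixed-size option concrete, here is an illustrative sketch of chunking with a token limit and an overlap percentage. It splits on whitespace rather than real model tokens, so it only approximates the behavior described above:

```python
def fixed_size_chunks(text: str, max_tokens: int, overlap_percentage: int) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, where consecutive
    chunks share overlap_percentage percent of their tokens. Whitespace-separated
    words stand in for real tokenizer tokens in this sketch."""
    tokens = text.split()
    # Advance by the non-overlapping portion of each chunk.
    step = max(1, max_tokens - (max_tokens * overlap_percentage) // 100)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks

print(fixed_size_chunks("one two three four five six seven eight",
                        max_tokens=4, overlap_percentage=25))
# ['one two three four', 'four five six seven', 'seven eight']
```

Each consecutive pair of chunks above shares one token, i.e. 25% of the 4-token chunk size.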
- You can choose to use Amazon Bedrock's foundation model for parsing documents to parse more than standard text. For example, you can parse tabular data within documents with the structure intact. View Amazon Bedrock pricing for information on the cost of foundation models.
- You can choose to use an AWS Lambda function to customize your chunking strategy and how your document metadata attributes/fields are treated and ingested. Provide the Amazon S3 bucket location for the Lambda function input and output.
Go to the next section to configure your vector store.
- Choose a model for converting your data into vector embeddings.
- Create a vector store to allow Amazon Bedrock to store, update, and manage embeddings. You can quick create a new vector store or select from a supported vector store you have created. If you create a new vector store, an Amazon OpenSearch Serverless vector search collection and index with the required fields is set up for you. If you select from a supported vector store, you must map the vector field names and metadata field names.
Go to the next section to review your knowledge base configurations.
- Check the details of your knowledge base. You can edit any section before creating your knowledge base.
The time it takes to create the knowledge base depends on your specific configurations. When creation completes, the status of the knowledge base changes to indicate that it is ready or available.
Once your knowledge base is ready and available, sync your data source for the first time and whenever you want to keep your content up to date. Select your knowledge base in the console and select Sync within the data source overview section.
- API
The following is an example of a configuration for connecting to Amazon S3 for your Amazon Bedrock knowledge base. You configure your data source using the API with the AWS CLI or a supported SDK, such as Python. After you call CreateKnowledgeBase, call CreateDataSource to create your data source with your connection information in dataSourceConfiguration. Remember to also specify your chunking strategy/approach in vectorIngestionConfiguration and your data deletion policy in dataDeletionPolicy.
AWS Command Line Interface
aws bedrock-agent create-data-source \
    --name "S3 connector" \
    --description "S3 data source connector for Amazon Bedrock to use content in S3" \
    --knowledge-base-id "your-knowledge-base-id" \
    --data-source-configuration file://s3-bedrock-connector-configuration.json \
    --data-deletion-policy "DELETE" \
    --vector-ingestion-configuration '{"chunkingConfiguration":{"chunkingStrategy":"FIXED_SIZE","fixedSizeChunkingConfiguration":{"maxTokens":100,"overlapPercentage":10}}}'
s3-bedrock-connector-configuration.json
{
    "s3Configuration": {
        "bucketArn": "arn:aws:s3:::bucket-name",
        "bucketOwnerAccountId": "000000000000",
        "inclusionPrefixes": [
            "documents/"
        ]
    },
    "type": "S3"
}
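The same request can be sketched with the AWS SDK for Python (Boto3). The helper takes the client as a parameter so the request shape can be exercised without AWS credentials; the names and IDs are placeholders:

```python
def create_s3_data_source(client, knowledge_base_id: str,
                          bucket_arn: str, owner_account_id: str) -> dict:
    """Create an S3 data source on a knowledge base. `client` is any object
    exposing create_data_source, such as boto3.client("bedrock-agent")."""
    return client.create_data_source(
        knowledgeBaseId=knowledge_base_id,
        name="S3 connector",
        description="S3 data source connector for Amazon Bedrock to use content in S3",
        dataSourceConfiguration={
            "type": "S3",
            "s3Configuration": {
                "bucketArn": bucket_arn,
                "bucketOwnerAccountId": owner_account_id,
            },
        },
        dataDeletionPolicy="DELETE",
        vectorIngestionConfiguration={
            "chunkingConfiguration": {
                "chunkingStrategy": "FIXED_SIZE",
                "fixedSizeChunkingConfiguration": {"maxTokens": 100, "overlapPercentage": 10},
            }
        },
    )

# Usage (requires AWS credentials with bedrock:CreateDataSource):
#   import boto3
#   client = boto3.client("bedrock-agent")
#   create_s3_data_source(client, "your-knowledge-base-id",
#                         "arn:aws:s3:::bucket-name", "000000000000")
```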