Connect to Amazon S3 for your knowledge base
Amazon S3 is an object storage service that stores data as objects within buckets.
You can connect to your Amazon S3 bucket for your Amazon Bedrock knowledge base by using either the AWS Management Console for Amazon Bedrock or the CreateDataSource API.
You can upload a small batch of files to an Amazon S3 bucket by using the Amazon S3 console or API. Alternatively, you can use AWS DataSync to upload multiple files to Amazon S3 on an ongoing basis, and to transfer files on a schedule from on-premises, edge, other cloud, or AWS storage.
Currently, only general purpose S3 buckets are supported.
There are limits to the number of files that can be crawled and to the size of each file. See Quotas for knowledge bases.
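For a small batch, a minimal upload sketch with the AWS SDK for Python (boto3) might look like the following. The bucket name and file paths are hypothetical placeholders.

import boto3

# Hypothetical bucket and local files; replace with your own values.
BUCKET = "my-knowledge-base-docs"
files = ["reports/overview.pdf", "reports/q4-summary.pdf"]

s3 = boto3.client("s3")
for path in files:
    # Upload each local file to the same key in the bucket.
    s3.upload_file(path, BUCKET, path)
    print(f"Uploaded s3://{BUCKET}/{path}")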
Supported features
- Document metadata fields
- Inclusion/exclusion content filters
- Incremental content syncs for added, updated, and deleted content
Prerequisites
In Amazon S3, make sure you:
- Note the Amazon S3 bucket URI, Amazon Resource Name (ARN), and the AWS account ID of the bucket owner. You can find the URI and ARN in the properties section of the Amazon S3 console. Your bucket must be in the same AWS Region as your Amazon Bedrock knowledge base, and you must have permission to access the bucket.
In your AWS account, make sure you:
- Include the necessary permissions to connect to your data source in the AWS Identity and Access Management (IAM) role or permissions policy for your knowledge base. For information on the permissions that this data source requires you to add to your knowledge base IAM role, see Permissions to access data sources.
Note
If you use the console, the IAM role with all the required permissions can be created for you as part of the steps for creating a knowledge base. After you have configured your data source and other settings, the IAM role with all the required permissions is applied to your specific knowledge base.
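If you manage the role yourself, the S3 statements typically look like the following sketch, expressed here as a boto3 call. The role, policy, and bucket names are hypothetical, and the authoritative statement list is in Permissions to access data sources.

import json
import boto3

# Hypothetical names; see "Permissions to access data sources" for the
# authoritative policy. This sketch grants list/read access to one bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": "arn:aws:s3:::my-knowledge-base-docs"},
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-knowledge-base-docs/*"},
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="MyKnowledgeBaseRole",     # hypothetical service role
    PolicyName="S3DataSourceAccess",
    PolicyDocument=json.dumps(policy),
)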
Connection configuration
To connect to your Amazon S3 bucket, you must provide the necessary configuration information so that Amazon Bedrock can access and crawl your data. You must also follow the Prerequisites.
An example of a configuration for this data source is included in this section.
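For instance, a minimal connection sketch using the AWS SDK for Python (boto3) could look like the following; the knowledge base ID, data source name, bucket ARN, and prefix are hypothetical placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Hypothetical IDs and ARN; replace with your own values.
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB12345678",
    name="my-s3-data-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::my-knowledge-base-docs",
            # Optional: crawl only objects under these key prefixes.
            "inclusionPrefixes": ["reports/"],
        },
    },
)
print(response["dataSource"]["dataSourceId"])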
For more information about inclusion/exclusion filters, document metadata fields, incremental syncing, and how these work, see the following:
You can include a separate file that specifies the document metadata fields/attributes for each file in your Amazon S3 data source, and whether to include them in the embeddings when indexing the data source into the vector store. For example, you can create a file in the following format, name it example.metadata.json, and upload it to your S3 bucket.
{ "metadataAttributes": { "company": { "value": { "type": "STRING", "stringValue": "BioPharm Innovations" }, "includeForEmbedding": true }, "created_date": { "value": { "type": "NUMBER", "numberValue": 20221205 }, "includeForEmbedding": true }, "author": { "value": { "type": "STRING", "stringValue": "Lisa Thompson" }, "includeForEmbedding": true }, "origin": { "value": { "type": "STRING", "stringValue": "Overview" }, "includeForEmbedding": true } } }
The metadata file must use the same name as its associated source document file, with .metadata.json appended to the end of the file name. The metadata file must be stored in the same folder or location as the source file in your Amazon S3 bucket, and it must not exceed the limit of 10 KB. For information on the supported attribute/field data types and the filtering operators you can apply to your metadata fields, see Metadata and filtering.
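To illustrate the naming convention, this hedged sketch uploads a source document together with its metadata file and checks the 10 KB limit first; the bucket and file names are hypothetical.

import os
import boto3

BUCKET = "my-knowledge-base-docs"             # hypothetical bucket
doc = "reports/overview.pdf"                  # source document
meta = "reports/overview.pdf.metadata.json"   # same name + .metadata.json

# The metadata file must not exceed the 10 KB limit.
assert os.path.getsize(meta) <= 10 * 1024, "metadata file exceeds 10 KB"

s3 = boto3.client("s3")
s3.upload_file(doc, BUCKET, doc)    # document
s3.upload_file(meta, BUCKET, meta)  # metadata file, same folder as the source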
You can include or exclude certain content when crawling. For example, you can specify an exclusion prefix/regular expression pattern to skip crawling any file that contains “private” in the file name. You can also specify an inclusion prefix/regular expression pattern to include certain content entities or content types. If you specify both an inclusion and an exclusion filter and both match a document, the exclusion filter takes precedence and the document isn’t crawled.
An example of a filter pattern to include only PDF files: ".*\\.pdf"
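You can sanity-check pattern behavior locally. The following sketch uses plain Python re (not a Bedrock API) to show the include pattern above together with a hypothetical exclusion pattern and the exclusion-takes-precedence rule.

import re

include = re.compile(r".*\.pdf")      # ".*\\.pdf" with JSON escaping removed
exclude = re.compile(r".*private.*")  # hypothetical: skip "private" files

def is_crawled(name: str) -> bool:
    # Exclusion takes precedence when both filters match.
    return bool(include.fullmatch(name)) and not exclude.fullmatch(name)

print(is_crawled("reports/overview.pdf"))       # True
print(is_crawled("reports/private-notes.pdf"))  # False: excluded
print(is_crawled("reports/overview.docx"))      # False: not included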
The data source connector crawls new, modified, and deleted content each time your data source syncs with your knowledge base. Amazon Bedrock can use your data source’s mechanism for tracking content changes and crawl content that changed since the last sync. When you sync your data source with your knowledge base for the first time, all content is crawled by default.
To sync your data source with your knowledge base, use the StartIngestionJob API or select your knowledge base in the console and select Sync within the data source overview section.
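A minimal sketch of starting a sync with the StartIngestionJob API via boto3, then polling it, follows; the knowledge base and data source IDs are hypothetical placeholders.

import time
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Hypothetical IDs; replace with your own values.
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB12345678",
    dataSourceId="DS12345678",
)["ingestionJob"]

# Poll until the sync reaches a terminal state.
while job["status"] not in ("COMPLETE", "FAILED"):
    time.sleep(10)
    job = bedrock_agent.get_ingestion_job(
        knowledgeBaseId="KB12345678",
        dataSourceId="DS12345678",
        ingestionJobId=job["ingestionJobId"],
    )["ingestionJob"]
print(job["status"])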
Important
All data that you sync from your data source becomes available to anyone with bedrock:Retrieve permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see Knowledge base permissions.