

# Connect a data source to your knowledge base
<a name="data-source-connectors"></a>

After you finish configuring your knowledge base, connect a supported data source to it.

Amazon Bedrock Knowledge Bases supports connecting to unstructured data sources or to structured data stores through a query engine. Select a topic to learn how to connect to that type of data source:

**Multimodal content support**  
Multimodal content (images, audio, and video files) is only supported with Amazon S3 and custom data sources. Other data source types will skip multimodal files during ingestion. For comprehensive guidance on working with multimodal content, see [Build a knowledge base for multimodal content](kb-multimodal.md).

To learn how to connect to a data source using the Amazon Bedrock console, select the topic that corresponds to your data source type at the bottom of this page.

To connect to a data source using the Amazon Bedrock API, send a [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) request with an [Agents for Amazon Bedrock build-time endpoint](https://docs.aws.amazon.com/general/latest/gr/bedrock.html#bra-bt).

The following fields are required:


| Field | Basic description | 
| --- | --- | 
| knowledgeBaseId | The ID of the knowledge base. | 
| name | A name for the data source. | 
| dataSourceConfiguration | Specify the data source service or type in the type field and include the corresponding field. For more details about service-specific configurations, select the topic for the service from the topics at the bottom of this page. | 

The following fields are optional:


| Field | Use case | 
| --- | --- | 
| description | To provide a description for the data source. | 
| vectorIngestionConfiguration | Contains configurations for customizing the ingestion process. For more information, see [Customize ingestion for a data source](kb-data-source-customize-ingestion.md). | 
| dataDeletionPolicy | To specify whether to RETAIN the vector embeddings in the vector store or to DELETE them when the data source is deleted. | 
| serverSideEncryptionConfiguration | To encrypt transient data during data syncing with a customer managed key, specify its ARN in the kmsKeyArn field. | 
| clientToken | To ensure the API request completes only once. For more information, see [Ensuring idempotency](https://docs.aws.amazon.com/ec2/latest/devguide/ec2-api-idempotency.html). | 
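Combining the tables above, a minimal `CreateDataSource` request payload can be sketched as a plain Python dict (the IDs and bucket ARN are placeholders; with boto3, such a dict would be passed to `create_data_source` on the `bedrock-agent` client):

```python
# Sketch of a CreateDataSource request payload.
# Required fields: knowledgeBaseId, name, dataSourceConfiguration.
# Optional fields shown: description and dataDeletionPolicy.
create_data_source_request = {
    "knowledgeBaseId": "your-knowledge-base-id",
    "name": "my-data-source",
    "dataSourceConfiguration": {
        "type": "S3",  # the type field selects the connector
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::bucket-name",
        },
    },
    "description": "Example data source",
    "dataDeletionPolicy": "DELETE",  # or RETAIN
}
```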

Select a topic to learn more about a data source type and how to configure it.

**Topics**
+ [Connect to Amazon S3 for your knowledge base](s3-data-source-connector.md)
+ [Connect to Confluence for your knowledge base](confluence-data-source-connector.md)
+ [Connect to Microsoft SharePoint for your knowledge base](sharepoint-data-source-connector.md)
+ [Connect to Salesforce for your knowledge base](salesforce-data-source-connector.md)
+ [Crawl web pages for your knowledge base](webcrawl-data-source-connector.md)
+ [Connect your knowledge base to a custom data source](custom-data-source-connector.md)

# Connect to Amazon S3 for your knowledge base
<a name="s3-data-source-connector"></a>

Amazon S3 is an object storage service that stores data as objects within buckets. You can connect to your Amazon S3 bucket for your Amazon Bedrock knowledge base by using either the [AWS Management Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home) or the [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) API (see Amazon Bedrock [supported SDKs and AWS CLI](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)).

**Multimodal content support**  
Amazon S3 data sources support multimodal content including images, audio, and video files. For comprehensive guidance on working with multimodal content, see [Build a knowledge base for multimodal content](kb-multimodal.md).

You can upload a small batch of files to an Amazon S3 bucket by using the Amazon S3 console or API. Alternatively, you can use [AWS DataSync](https://docs.aws.amazon.com/datasync/latest/userguide/create-s3-location.html) to transfer files to Amazon S3 continuously or on a schedule from on-premises, edge, other cloud, or AWS storage.

Currently, only general purpose S3 buckets are supported.

There are limits on the number of files, and on the size (MB) of each file, that can be crawled. See [Quotas for knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).

**Topics**
+ [Supported features](#supported-features-s3-connector)
+ [Prerequisites](#prerequisites-s3-connector)
+ [Connection configuration](#configuration-s3-connector)

## Supported features
<a name="supported-features-s3-connector"></a>
+ Document metadata fields
+ Inclusion prefixes
+ Incremental content syncs for added, updated, deleted content

## Prerequisites
<a name="prerequisites-s3-connector"></a>

**In Amazon S3, make sure you**:
+ Note the Amazon S3 bucket URI, Amazon Resource Name (ARN), and the AWS account ID for the owner of the bucket. You can find the URI and ARN in the properties section in the Amazon S3 console. Your bucket must be in the same Region as your Amazon Bedrock knowledge base. You must have permission to access the bucket.

**In your AWS account, make sure you**:
+ Include the necessary permissions to connect to your data source in your AWS Identity and Access Management (IAM) role/permissions policy for your knowledge base. For information on the required permissions for this data source to add to your knowledge base IAM role, see [Permissions to access data sources](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-access-ds).

**Note**  
If you use the console, the IAM role with all the required permissions can be created for you as part of the steps for creating a knowledge base. After you have configured your data source and other settings, the IAM role with all the required permissions is applied to your specific knowledge base.

## Connection configuration
<a name="configuration-s3-connector"></a>

To connect to your Amazon S3 bucket, you must provide the necessary configuration information so that Amazon Bedrock can access and crawl your data. You must also follow the [Prerequisites](#prerequisites-s3-connector).

An example of a configuration for this data source is included in this section.

For more information about document metadata fields, inclusion prefixes, incremental syncing, and how these work, see the following sections:

### Document metadata fields
<a name="ds-s3-metadata-fields"></a>

You can include a separate file that specifies the document metadata fields/attributes for each file in your Amazon S3 data source and whether to include them in the embeddings when indexing the data source into the vector store. For example, you can create a file in the following format, name it *fileName.extension.metadata.json* and upload it to your S3 bucket.

```
{
  "metadataAttributes": {
    "company": {
      "value": {
        "type": "STRING",
        "stringValue": "BioPharm Innovations"
      },
      "includeForEmbedding": true
    },
    "created_date": {
      "value": {
        "type": "NUMBER",
        "numberValue": 20221205
      },
      "includeForEmbedding": true
    },
    "author": {
      "value": {
        "type": "STRING",
        "stringValue": "Lisa Thompson"
      },
      "includeForEmbedding": true
    },
    "origin": {
      "value": {
        "type": "STRING",
        "stringValue": "Overview"
      },
      "includeForEmbedding": true
    }
  }
}
```

The metadata file must use the same name as its associated source document file, with `.metadata.json` appended onto the end of the file name. The metadata file must be stored in the same folder or location as the source file in your Amazon S3 bucket. The file must not exceed the limit of 10 KB. For information on the supported attribute/field data types and the filtering operators you can apply to your metadata fields, see [Metadata and filtering](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html).
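The naming and size rules above can be checked mechanically. The following hypothetical helper (not part of any AWS SDK) derives the metadata file key for a source document and enforces the 10 KB limit:

```python
import json

MAX_METADATA_BYTES = 10 * 1024  # a metadata file must not exceed 10 KB


def metadata_key_for(document_key: str) -> str:
    """Return the S3 key of the metadata file for a source document.

    The metadata file keeps the full source file name, including its
    extension, and appends .metadata.json, in the same folder.
    """
    return document_key + ".metadata.json"


def validate_metadata_size(metadata: dict) -> int:
    """Return the serialized size in bytes; raise if it exceeds the limit."""
    size = len(json.dumps(metadata).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata file is {size} bytes; the limit is {MAX_METADATA_BYTES}")
    return size
```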

### Inclusion prefixes
<a name="ds-s3-inclusion-exclusion"></a>

You can specify an inclusion prefix, which is an Amazon S3 path prefix, so that the data source connector crawls only a specific folder or file in your bucket instead of the entire bucket.
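The effect of an inclusion prefix can be sketched with plain string matching: only objects whose keys start with one of the configured prefixes are crawled (the object keys below are hypothetical):

```python
def included_keys(object_keys, inclusion_prefixes):
    """Keep only the S3 object keys that fall under an inclusion prefix."""
    return [
        key for key in object_keys
        if any(key.startswith(prefix) for prefix in inclusion_prefixes)
    ]


keys = ["documents/report.pdf", "documents/notes.txt", "logs/app.log"]
crawled = included_keys(keys, ["documents/"])  # logs/app.log is skipped
```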

### Incremental syncing
<a name="ds-s3-incremental-sync"></a>

The data source connector crawls new, modified, and deleted content each time your data source syncs with your knowledge base. Amazon Bedrock can use your data source’s mechanism for tracking content changes and crawl content that changed since the last sync. When you sync your data source with your knowledge base for the first time, all content is crawled by default.

To sync your data source with your knowledge base, use the [StartIngestionJob](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_StartIngestionJob.html) API or select your knowledge base in the console and select **Sync** within the data source overview section.
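If you start the sync programmatically, `StartIngestionJob` needs only the knowledge base and data source IDs. A minimal sketch of the parameters (both IDs are placeholders):

```python
# Sketch of StartIngestionJob parameters; both IDs are placeholders.
# With boto3, this would be passed as:
#   boto3.client("bedrock-agent").start_ingestion_job(**start_ingestion_job_params)
start_ingestion_job_params = {
    "knowledgeBaseId": "your-knowledge-base-id",
    "dataSourceId": "your-data-source-id",
}
```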

**Important**  
All data that you sync from your data source becomes available to anyone with `bedrock:Retrieve` permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see [Knowledge base permissions](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html).

------
#### [ Console ]

**To connect an Amazon S3 bucket to your knowledge base**

1. Follow the steps at [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and choose **Amazon S3** as the data source.

1. Provide a name for the data source.

1. Specify whether the Amazon S3 bucket is in your current AWS account or another AWS account. Your bucket must be in the same Region as the knowledge base.

1. (Optional) If the Amazon S3 bucket is encrypted with a KMS key, include the key. For more information, see [Permissions to decrypt your AWS KMS key for your data sources in Amazon S3](encryption-kb.md#encryption-kb-ds).

1. (Optional) In the **Content parsing and chunking** section, you can customize how to parse and chunk your data. Refer to the following resources to learn more about these customizations:
   + For more information about parsing options, see [Parsing options for your data source](kb-advanced-parsing.md).
   + For more information about chunking strategies, see [How content chunking works for knowledge bases](kb-chunking.md).
**Warning**  
You can't change the chunking strategy after connecting to the data source.
   + For more information about how to customize chunking of your data and processing of your metadata with a Lambda function, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md).

1. In the **Advanced settings** section, you can optionally configure the following:
   + **KMS key for transient data storage** – You can encrypt the transient data while converting your data into embeddings with the default AWS managed key or your own KMS key. For more information, see [Encryption of transient data storage during data ingestion](encryption-kb.md#encryption-kb-ingestion).
   + **Data deletion policy** – You can delete the vector embeddings for your data source that are stored in the vector store by default, or choose to retain the vector store data.

1. Continue to choose an embeddings model and vector store. To see the remaining steps, return to [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and continue from the step after connecting your data source.

------
#### [ API ]

The following is an example of a configuration for connecting to Amazon S3 for your Amazon Bedrock knowledge base. You configure your data source using the API with the AWS CLI or supported SDK, such as Python. After you call [CreateKnowledgeBase](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html), you call [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) to create your data source with your connection information in `dataSourceConfiguration`.

To learn about customizations that you can apply to ingestion by including the optional `vectorIngestionConfiguration` field, see [Customize ingestion for a data source](kb-data-source-customize-ingestion.md).

**AWS Command Line Interface**

```
aws bedrock-agent create-data-source \
 --name "S3-connector" \
 --description "S3 data source connector for Amazon Bedrock to use content in S3" \
 --knowledge-base-id "your-knowledge-base-id" \
 --data-source-configuration file://s3-bedrock-connector-configuration.json \
 --data-deletion-policy "DELETE" \
 --vector-ingestion-configuration '{"chunkingConfiguration":{"chunkingStrategy":"FIXED_SIZE","fixedSizeChunkingConfiguration":{"maxTokens":100,"overlapPercentage":10}}}'
                    
s3-bedrock-connector-configuration.json
{
    "s3Configuration": {
        "bucketArn": "arn:aws:s3:::bucket-name",
        "bucketOwnerAccountId": "000000000000",
        "inclusionPrefixes": [
            "documents/"
        ]
    },
    "type": "S3"
}
```

------

# Connect to Confluence for your knowledge base
<a name="confluence-data-source-connector"></a>

Atlassian Confluence is a collaborative work-management tool designed for sharing, storing, and working on project planning, software development, and product management. You can connect to your Confluence instance for your Amazon Bedrock knowledge base by using either the [AWS Management Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home) or the [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) API (see Amazon Bedrock [supported SDKs and AWS CLI](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)).

**Note**  
The Confluence data source connector is in preview release and is subject to change.  
Confluence data sources don't support multimodal data, such as tables, charts, diagrams, or other images.

Amazon Bedrock supports connecting to Confluence Cloud instances. Currently, only Amazon OpenSearch Serverless vector store is available to use with this data source.

There are limits on the number of files, and on the size (MB) of each file, that can be crawled. See [Quotas for knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).

**Topics**
+ [Supported features](#supported-features-confluence-connector)
+ [Prerequisites](#prerequisites-confluence-connector)
+ [Connection configuration](#configuration-confluence-connector)

## Supported features
<a name="supported-features-confluence-connector"></a>
+ Auto detection of main document fields
+ Inclusion/exclusion content filters
+ Incremental content syncs for added, updated, deleted content
+ OAuth 2.0 authentication, authentication with Confluence API token

## Prerequisites
<a name="prerequisites-confluence-connector"></a>

**In Confluence, make sure you**:
+ Take note of your Confluence instance URL. For example, for Confluence Cloud, *https://example.atlassian.net*. The URL for Confluence Cloud must be the base URL, ending with *.atlassian.net*.
+ Configure basic authentication credentials containing a username (email of admin account) and password (Confluence API token) to allow Amazon Bedrock to connect to your Confluence Cloud instance. For information about how to create a Confluence API token, see [Manage API tokens for your Atlassian account](https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/#Create-an-API-token) on the Atlassian website.
+ (Optional) Configure an OAuth 2.0 application with credentials of an app key, app secret, access token, and refresh token. For more information, see [OAuth 2.0 apps](https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/) on the Atlassian website.
+ Enable the required read permissions or scopes for your OAuth 2.0 app to connect to Confluence.

  Confluence API:
  + offline_access
  + read:content:confluence – View detailed contents 
  + read:content-details:confluence – View content details 
  + read:space-details:confluence – View space details
  + read:audit-log:confluence – View audit records 
  + read:page:confluence – View pages 
  + read:attachment:confluence – View and download content attachments 
  + read:blogpost:confluence – View blogposts 
  + read:custom-content:confluence – View custom content 
  + read:comment:confluence – View comments 
  + read:template:confluence – View content templates 
  + read:label:confluence – View labels 
  + read:watcher:confluence – View content watchers 
  + read:relation:confluence – View entity relationships 
  + read:user:confluence – View user details 
  + read:configuration:confluence – View Confluence settings 
  + read:space:confluence – View space details 
  + read:space.property:confluence – View space properties 
  + read:user.property:confluence – View user properties 
  + read:space.setting:confluence – View space settings 
  + read:analytics.content:confluence – View analytics for content
  + read:content.property:confluence – View content properties
  + read:content.metadata:confluence – View content summaries 
  + read:inlinetask:confluence – View tasks 
  + read:task:confluence – View tasks 
  + read:whiteboard:confluence – View whiteboards 
  + read:app-data:confluence – Read app data 
  + read:folder:confluence – View folders
  + read:embed:confluence – View Smart Link data

**In your AWS account, make sure you**:
+ Store your authentication credentials in an [AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) and note the Amazon Resource Name (ARN) of the secret. Follow the **Connection configuration** instructions on this page to include the key-values pairs that must be included in your secret.
+ Include the necessary permissions to connect to your data source in your AWS Identity and Access Management (IAM) role/permissions policy for your knowledge base. For information on the required permissions for this data source to add to your knowledge base IAM role, see [Permissions to access data sources](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-access-ds).

**Note**  
If you use the console, you can go to AWS Secrets Manager to add your secret or use an existing secret as part of the data source configuration step. The IAM role with all the required permissions can be created for you as part of the console steps for creating a knowledge base. After you have configured your data source and other settings, the IAM role with all the required permissions is applied to your specific knowledge base.  
We recommend that you regularly refresh or rotate your credentials and secret. Provide only the necessary access level for your own security. We do not recommend that you reuse credentials and secrets across data sources.

## Connection configuration
<a name="configuration-confluence-connector"></a>

To connect to your Confluence instance, you must provide the necessary configuration information so that Amazon Bedrock can access and crawl your data. You must also follow the [Prerequisites](#prerequisites-confluence-connector).

An example of a configuration for this data source is included in this section.

For more information about auto detection of document fields, inclusion/exclusion filters, incremental syncing, secret authentication credentials, and how these work, see the following sections:

### Auto detection of main document fields
<a name="ds-confluence-document-fields"></a>

The data source connector automatically detects and crawls all of the main metadata fields of your documents or content. For example, the data source connector can crawl the document body equivalent of your documents, the document title, the document creation or modification date, or other core fields that might apply to your documents.

**Important**  
If your content includes sensitive information, Amazon Bedrock could use that sensitive information in its responses.

You can apply filtering operators to metadata fields to help further improve the relevancy of responses. For example, the document attribute "epoch_modification_time" is the number of seconds that have passed since January 1, 1970, as of when the document was last updated. You can filter on the most recent data, where "epoch_modification_time" is *greater than* a certain number. For more information on the filtering operators you can apply to your metadata fields, see [Metadata and filtering](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html).
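For example, a retrieval-time filter that keeps only documents updated after a cutoff might be built like the following sketch (the operator/key/value shape follows the metadata filtering format linked above; the cutoff value is illustrative):

```python
# Sketch of a metadata filter: keep documents whose
# epoch_modification_time is greater than a cutoff.
cutoff_epoch_seconds = 1_700_000_000  # illustrative cutoff

retrieval_filter = {
    "greaterThan": {
        "key": "epoch_modification_time",
        "value": cutoff_epoch_seconds,
    }
}
```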

### Inclusion/exclusion filters
<a name="ds-confluence-inclusion-exclusion"></a>

You can include or exclude crawling certain content. For example, you can specify an exclusion prefix/regular expression pattern to skip crawling any file that contains “private” in the file name. You could also specify an inclusion prefix/regular expression pattern to include certain content entities or content types. If you specify an inclusion and exclusion filter and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

An example of a regular expression pattern to exclude or filter out PDF files that contain "private" in the file name: `.*private.*\.pdf`

You can apply inclusion/exclusion filters on the following content types:
+ `Space`: Unique space key
+ `Page`: Main page title
+ `Blog`: Main blog title
+ `Comment`: Comments that belong to a certain page or blog. Specify *Re: Page/Blog Title*
+ `Attachment`: Attachment file name with its extension
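The precedence rule (exclusion wins when both filters match) can be sketched with Python's `re` module; the patterns below are illustrative:

```python
import re


def is_crawled(name, inclusion_patterns, exclusion_patterns):
    """A document is crawled only if it matches an inclusion pattern
    and no exclusion pattern; exclusion takes precedence."""
    included = any(re.fullmatch(p, name) for p in inclusion_patterns)
    excluded = any(re.fullmatch(p, name) for p in exclusion_patterns)
    return included and not excluded


inclusion = [r".*\.pdf"]           # crawl PDF attachments...
exclusion = [r".*private.*\.pdf"]  # ...except those with "private" in the name
```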

### Incremental syncing
<a name="ds-confluence-incremental-sync"></a>

The data source connector crawls new, modified, and deleted content each time your data source syncs with your knowledge base. Amazon Bedrock can use your data source’s mechanism for tracking content changes and crawl content that changed since the last sync. When you sync your data source with your knowledge base for the first time, all content is crawled by default.

To sync your data source with your knowledge base, use the [StartIngestionJob](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_StartIngestionJob.html) API or select your knowledge base in the console and select **Sync** within the data source overview section.

**Important**  
All data that you sync from your data source becomes available to anyone with `bedrock:Retrieve` permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see [Knowledge base permissions](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html).

### Secret authentication credentials
<a name="ds-confluence-secret-auth-credentials"></a>

(If using basic authentication) Your secret authentication credentials in AWS Secrets Manager should include these key-value pairs:
+ `username`: *admin user email address of Atlassian account*
+ `password`: *Confluence API token*

(If using OAuth 2.0 authentication) Your secret authentication credentials in AWS Secrets Manager should include these key-value pairs:
+ `confluenceAppKey`: *app key*
+ `confluenceAppSecret`: *app secret*
+ `confluenceAccessToken`: *app access token*
+ `confluenceRefreshToken`: *app refresh token*
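Either set of credentials is stored as a JSON secret string in AWS Secrets Manager. A sketch of the OAuth 2.0 variant (all values are placeholders):

```python
import json

# Key names follow the OAuth 2.0 list above; values are placeholders.
confluence_oauth_secret = {
    "confluenceAppKey": "your-app-key",
    "confluenceAppSecret": "your-app-secret",
    "confluenceAccessToken": "your-access-token",
    "confluenceRefreshToken": "your-refresh-token",
}

# This string is what you would store as the secret value, for example with:
#   aws secretsmanager create-secret --name my-secret --secret-string '<this JSON>'
secret_string = json.dumps(confluence_oauth_secret)
```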

**Note**  
The Confluence OAuth 2.0 **access** token has a default expiry time of 60 minutes. If this token expires while your data source is syncing (during a sync job), Amazon Bedrock uses the provided **refresh** token to regenerate the access token. This regeneration refreshes both the access and refresh tokens. To keep the tokens updated from the current sync job to the next, Amazon Bedrock requires write/put permissions for your secret credentials as part of your knowledge base IAM role.

**Note**  
Your secret in AWS Secrets Manager must be in the same Region as your knowledge base.

------
#### [ Console ]

**Connect a Confluence instance to your knowledge base**

1. Follow the steps at [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and choose **Confluence** as the data source.

1. Provide a name and optional description for the data source.

1. Provide your Confluence instance URL. For example, for Confluence Cloud, *https://example.atlassian.net*. The URL for Confluence Cloud must be the base URL, ending with *.atlassian.net*.

1. In the **Advanced settings** section, you can optionally configure the following:
   + **KMS key for transient data storage** – You can encrypt the transient data while converting your data into embeddings with the default AWS managed key or your own KMS key. For more information, see [Encryption of transient data storage during data ingestion](encryption-kb.md#encryption-kb-ingestion).
   + **Data deletion policy** – You can delete the vector embeddings for your data source that are stored in the vector store by default, or choose to retain the vector store data.

1. Provide the authentication information to connect to your Confluence instance:
   + For basic authentication, go to AWS Secrets Manager to add your secret authentication credentials or use an existing Amazon Resource Name (ARN) for the secret you created. Your secret must contain the admin user email address of the Atlassian account as the username and a Confluence API token in place of a password. For information about how to create a Confluence API token, see [Manage API tokens for your Atlassian account](https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/#Create-an-API-token) on the Atlassian website.
   + For OAuth 2.0 authentication, go to AWS Secrets Manager to add your secret authentication credentials or use an existing Amazon Resource Name (ARN) for the secret you created. Your secret must contain the Confluence app key, app secret, access token, and refresh token. For more information, see [OAuth 2.0 apps](https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/) on the Atlassian website.

1. (Optional) In the **Content parsing and chunking** section, you can customize how to parse and chunk your data. Refer to the following resources to learn more about these customizations:
   + For more information about parsing options, see [Parsing options for your data source](kb-advanced-parsing.md).
   + For more information about chunking strategies, see [How content chunking works for knowledge bases](kb-chunking.md).
**Warning**  
You can't change the chunking strategy after connecting to the data source.
   + For more information about how to customize chunking of your data and processing of your metadata with a Lambda function, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md).

1. Choose whether to use filters/regular expression patterns to include or exclude certain content. Otherwise, all standard content is crawled.

1. Continue to choose an embeddings model and vector store. To see the remaining steps, return to [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and continue from the step after connecting your data source.

------
#### [ API ]

The following is an example of a configuration for connecting to Confluence Cloud for your Amazon Bedrock knowledge base. You configure your data source using the API with the AWS CLI or supported SDK, such as Python. After you call [CreateKnowledgeBase](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html), you call [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) to create your data source with your connection information in `dataSourceConfiguration`.

To learn about customizations that you can apply to ingestion by including the optional `vectorIngestionConfiguration` field, see [Customize ingestion for a data source](kb-data-source-customize-ingestion.md).

**AWS Command Line Interface**

```
aws bedrock-agent create-data-source \
 --name "Confluence Cloud/SaaS connector" \
 --description "Confluence Cloud/SaaS data source connector for Amazon Bedrock to use content in Confluence" \
 --knowledge-base-id "your-knowledge-base-id" \
 --data-source-configuration file://confluence-bedrock-connector-configuration.json \
 --data-deletion-policy "DELETE" \
 --vector-ingestion-configuration '{"chunkingConfiguration":{"chunkingStrategy":"FIXED_SIZE","fixedSizeChunkingConfiguration":{"maxTokens":100,"overlapPercentage":10}}}'

confluence-bedrock-connector-configuration.json
{
    "confluenceConfiguration": {
        "sourceConfiguration": {
            "hostUrl": "https://example.atlassian.net",
            "hostType": "SAAS",
            "authType": "OAUTH2_CLIENT_CREDENTIALS",
            "credentialsSecretArn": "arn:aws:secretsmanager:your-region:your-account-id:secret:AmazonBedrock-Confluence"
        },
        "crawlerConfiguration": {
            "filterConfiguration": {
                "type": "PATTERN",
                "patternObjectFilter": {
                    "filters": [
                        {
                            "objectType": "Attachment",
                            "inclusionFilters": [
                                ".*\\.pdf"
                            ],
                            "exclusionFilters": [
                                ".*private.*\\.pdf"
                            ]
                        }
                    ]
                }
            }
        }
    },
    "type": "CONFLUENCE"
}
```

------

# Connect to Microsoft SharePoint for your knowledge base
<a name="sharepoint-data-source-connector"></a>

Microsoft SharePoint is a collaborative web-based service for working on documents, web pages, web sites, lists, and more. You can connect to your SharePoint instance for your Amazon Bedrock knowledge base by using either the [AWS Management Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home) or the [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) API (see Amazon Bedrock [supported SDKs and AWS CLI](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)).

**Note**  
The SharePoint data source connector is in preview release and is subject to change.  
Microsoft SharePoint data sources don't support multimodal data, such as tables, charts, diagrams, or other images.

Amazon Bedrock supports connecting to SharePoint Online instances. Crawling OneNote documents is currently not supported. Currently, only Amazon OpenSearch Serverless vector store is available to use with this data source.

There are limits on the number of files, and on the size (MB) of each file, that can be crawled. See [Quotas for knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).

**Topics**
+ [Supported features](#supported-features-sharepoint-connector)
+ [Prerequisites](#prerequisites-sharepoint-connector)
+ [Connection configuration](#configuration-sharepoint-connector)

## Supported features
<a name="supported-features-sharepoint-connector"></a>
+ Auto detection of main document fields
+ Inclusion/exclusion content filters
+ Incremental content syncs for added, updated, deleted content
+ SharePoint App-Only authentication

## Prerequisites
<a name="prerequisites-sharepoint-connector"></a>

### SharePoint (Online)
<a name="prerequisites-sharepoint-connector-online"></a>

**In your SharePoint (Online), complete the following steps for using SharePoint App-Only authentication:**
+ Take note of your SharePoint Online site URL/URLs. For example, *https://yourdomain.sharepoint.com/sites/mysite*. Your URL must start with *https* and contain *sharepoint.com*. Your site URL must be the actual SharePoint site, not *sharepoint.com/* or *sites/mysite/home.aspx*.
+ Take note of the domain name of your SharePoint Online instance URL/URLs.
+ Copy your Microsoft 365 tenant ID. You can find your tenant ID in the Properties of your Microsoft Entra portal. For details, see [Find your Microsoft 365 tenant ID](https://learn.microsoft.com/en-us/sharepoint/find-your-office-365-tenant-id).
**Note**  
For an example application, see [Register a client application in Microsoft Entra ID](https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application) (formerly known as Azure Active Directory) on the Microsoft Learn website. 
+ Configure SharePoint App-Only credentials.
+ Copy the client ID and client secret value when granting permission to SharePoint App-Only. For more information, see [Granting access using SharePoint App-Only](https://learn.microsoft.com/en-us/sharepoint/dev/solution-guidance/security-apponly-azureacs).
**Note**  
You do not need to set up any API permissions for SharePoint App-Only. However, you must configure app permissions on the SharePoint side. For more information about the required app permissions, see the Microsoft documentation on [Granting access using SharePoint App-Only](https://learn.microsoft.com/en-us/sharepoint/dev/solution-guidance/security-apponly-azureacs).

### AWS account
<a name="prerequisites-sharepoint-connector-account"></a>

**In your AWS account, make sure you**:
+ Store your authentication credentials in an [AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) and note the Amazon Resource Name (ARN) of the secret. Follow the **Connection configuration** instructions on this page to include the key-value pairs that must be included in your secret.
+ Include the necessary permissions to connect to your data source in your AWS Identity and Access Management (IAM) role/permissions policy for your knowledge base. For information on the required permissions for this data source to add to your knowledge base IAM role, see [Permissions to access data sources](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-access-ds).

**Note**  
If you use the console, you can go to AWS Secrets Manager to add your secret or use an existing secret as part of the data source configuration step. The IAM role with all the required permissions can be created for you as part of the console steps for creating a knowledge base. After you have configured your data source and other configurations, the IAM role with all the required permissions is applied to your specific knowledge base.  
We recommend that you regularly refresh or rotate your credentials and secret. Provide only the necessary access level for your own security. We do not recommend that you re-use credentials and secrets across data sources.

## Connection configuration
<a name="configuration-sharepoint-connector"></a>

To connect to your SharePoint instance, you must provide the necessary configuration information so that Amazon Bedrock can access and crawl your data. You must also follow the [Prerequisites](#prerequisites-sharepoint-connector).

An example of a configuration for this data source is included in this section.

For more information about auto detection of document fields, inclusion/exclusion filters, incremental syncing, secret authentication credentials, and how these work, select the following:

### Auto detection of main document fields
<a name="ds-sharepoint-document-fields"></a>

The data source connector automatically detects and crawls all of the main metadata fields of your documents or content. For example, the data source connector can crawl the document body equivalent of your documents, the document title, the document creation or modification date, or other core fields that might apply to your documents.

**Important**  
If your content includes sensitive information, then Amazon Bedrock could respond using sensitive information.

You can apply filtering operators to metadata fields to help you further improve the relevancy of responses. For example, the document metadata field `epoch_modification_time`, the number of seconds that have passed since January 1, 1970, records when the document was last updated. You can filter on the most recent data, where `epoch_modification_time` is *greater than* a certain number. For more information on the filtering operators you can apply to your metadata fields, see [Metadata and filtering](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html).
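For example, a retrieval filter that returns only recently updated documents could use the *greater than* operator on this field. The following is a minimal sketch of such a filter object; the field name and timestamp are illustrative:

```python
import json

# Illustrative retrieval filter: match only documents whose
# epoch_modification_time metadata value is greater than a given number of
# seconds since January 1, 1970.
retrieval_filter = {
    "greaterThan": {
        "key": "epoch_modification_time",
        "value": 1704067200  # January 1, 2024 (UTC)
    }
}

print(json.dumps(retrieval_filter, indent=4))
```

You would pass a filter like this in the retrieval configuration of your query; see the metadata and filtering documentation linked above for the full set of operators.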

### Inclusion/exclusion filters
<a name="ds-sharepoint-inclusion-exclusion"></a>

You can include or exclude crawling certain content. For example, you can specify an exclusion prefix/regular expression pattern to skip crawling any file that contains “private” in the file name. You could also specify an inclusion prefix/regular expression pattern to include certain content entities or content types. If you specify an inclusion and exclusion filter and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

An example of a regular expression pattern to exclude or filter out PDF files that contain "private" in the file name: `.*private.*\.pdf`

You can apply inclusion/exclusion filters on the following content types:
+ `Page`: Main page title
+ `Event`: Event name
+ `File`: File name with its extension for attachments and all document files

Crawling OneNote documents is currently not supported.
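As an illustration of the precedence rule, the following sketch shows how a file name is evaluated against inclusion and exclusion patterns. This mirrors the behavior described above; it is not the connector's actual implementation:

```python
import re

def is_crawled(name, inclusion_patterns, exclusion_patterns):
    """Return True if the name matches an inclusion pattern and no exclusion pattern."""
    included = any(re.fullmatch(p, name) for p in inclusion_patterns)
    excluded = any(re.fullmatch(p, name) for p in exclusion_patterns)
    return included and not excluded  # exclusion takes precedence

inclusion = [r".*\.pdf"]
exclusion = [r".*private.*\.pdf"]

print(is_crawled("report.pdf", inclusion, exclusion))          # True
print(is_crawled("private-report.pdf", inclusion, exclusion))  # False: exclusion wins
print(is_crawled("notes.docx", inclusion, exclusion))          # False: no inclusion match
```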

### Incremental syncing
<a name="ds-sharepoint-incremental-sync"></a>

The data source connector crawls new, modified, and deleted content each time your data source syncs with your knowledge base. Amazon Bedrock can use your data source’s mechanism for tracking content changes and crawl content that changed since the last sync. When you sync your data source with your knowledge base for the first time, all content is crawled by default.
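Conceptually, an incremental sync re-crawls only content whose change timestamp is newer than the last successful sync. The following is a simplified illustration of that bookkeeping, not the connector's actual implementation:

```python
from datetime import datetime, timezone

def changed_since_last_sync(documents, last_sync_time):
    """Select documents added or modified after the last successful sync."""
    return [d for d in documents if d["modified"] > last_sync_time]

last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)
documents = [
    {"id": "doc-1", "modified": datetime(2023, 12, 15, tzinfo=timezone.utc)},
    {"id": "doc-2", "modified": datetime(2024, 2, 10, tzinfo=timezone.utc)},
]

to_crawl = changed_since_last_sync(documents, last_sync)
print([d["id"] for d in to_crawl])  # ['doc-2']
```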

To sync your data source with your knowledge base, use the [StartIngestionJob](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_StartIngestionJob.html) API or select your knowledge base in the console and select **Sync** within the data source overview section.

**Important**  
All data that you sync from your data source becomes available to anyone with `bedrock:Retrieve` permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see [Knowledge base permissions](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html).

### Secret authentication credentials
<a name="ds-sharepoint-secret-auth-credentials"></a>

When using SharePoint App-Only authentication, your secret authentication credentials in AWS Secrets Manager must include these key-value pairs:
+ `clientId`: *client ID associated with your Microsoft Entra SharePoint application*
+ `clientSecret`: *client secret associated with your Microsoft Entra SharePoint application*
+ `sharePointClientId`: *client ID generated when registering your SharePoint app for App-Only authentication*
+ `sharePointClientSecret`: *client secret generated when registering your SharePoint app for App-Only authentication*

**Note**  
Your secret in AWS Secrets Manager must be in the same AWS Region as your knowledge base.
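The secret value stored in AWS Secrets Manager is a JSON document containing those four keys. The following sketch builds the secret string with placeholder values; you would then store it with the Secrets Manager console or the `CreateSecret` API:

```python
import json

# Placeholder values; substitute your actual Microsoft Entra and
# SharePoint App-Only credentials.
secret_value = json.dumps({
    "clientId": "your-entra-client-id",
    "clientSecret": "your-entra-client-secret",
    "sharePointClientId": "your-sharepoint-app-only-client-id",
    "sharePointClientSecret": "your-sharepoint-app-only-client-secret",
})

print(secret_value)
```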

------
#### [ Console ]

**Connect a SharePoint instance to your knowledge base**<a name="connect-sharepoint-console"></a>

1. Follow the steps at [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and choose **SharePoint** as the data source.

1. Provide a name and optional description for the data source.

1. Provide your SharePoint site URL/URLs. For example, for SharePoint Online, *https://yourdomain.sharepoint.com/sites/mysite*. Your URL must start with *https* and contain *sharepoint.com*. Your site URL must be the actual SharePoint site, not *sharepoint.com/* or *sites/mysite/home.aspx*.

1. Provide the domain name of your SharePoint instance.

1. In the **Advanced settings** section, you can optionally configure the following:
   + **KMS key for transient data storage** – You can encrypt the transient data while converting your data into embeddings with the default AWS managed key or your own KMS key. For more information, see [Encryption of transient data storage during data ingestion](encryption-kb.md#encryption-kb-ingestion).
   + **Data deletion policy** – By default, the vector embeddings for your data source are deleted from the vector store when the data source is deleted; you can instead choose to retain the vector store data.

1. Provide the authentication information to connect to your SharePoint instance. For SharePoint App-Only authentication:

   1. Provide the tenant ID. You can find your tenant ID in the Properties of your Microsoft Entra portal.

   1. Go to AWS Secrets Manager to add your secret credentials or use an existing Amazon Resource Name (ARN) for the secret you created. Your secret must contain the SharePoint client ID and the SharePoint client secret generated when you registered the app for App-Only authentication at the tenant level or the site level, and the Entra client ID and Entra client secret generated when you registered the app in Entra.

1. (Optional) In the **Content parsing and chunking** section, you can customize how to parse and chunk your data. Refer to the following resources to learn more about these customizations:
   + For more information about parsing options, see [Parsing options for your data source](kb-advanced-parsing.md).
   + For more information about chunking strategies, see [How content chunking works for knowledge bases](kb-chunking.md).
**Warning**  
You can't change the chunking strategy after connecting to the data source.
   + For more information about how to customize chunking of your data and processing of your metadata with a Lambda function, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md).

1. Choose whether to use filters/regular expression patterns to include or exclude certain content. Otherwise, all standard content is crawled.

1. Continue to choose an embeddings model and vector store. To see the remaining steps, return to [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and continue from the step after connecting your data source.

------
#### [ API ]

The following is an example of a configuration for connecting to SharePoint Online for your Amazon Bedrock knowledge base. You configure your data source using the API with the AWS CLI or supported SDK, such as Python. After you call [CreateKnowledgeBase](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html), you call [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) to create your data source with your connection information in `dataSourceConfiguration`.

To learn about customizations that you can apply to ingestion by including the optional `vectorIngestionConfiguration` field, see [Customize ingestion for a data source](kb-data-source-customize-ingestion.md).

**AWS Command Line Interface**

```
aws bedrock-agent create-data-source \
 --name "SharePoint Online connector" \
 --description "SharePoint Online data source connector for Amazon Bedrock to use content in SharePoint" \
 --knowledge-base-id "your-knowledge-base-id" \
 --data-source-configuration file://sharepoint-bedrock-connector-configuration.json \
 --data-deletion-policy "DELETE"
```

**Contents of `sharepoint-bedrock-connector-configuration.json`**

```
{
    "sharePointConfiguration": {
        "sourceConfiguration": {
            "tenantId": "888d0b57-69f1-4fb8-957f-e1f0bedf64de",
            "hostType": "ONLINE",
            "domain": "yourdomain",
            "siteUrls": [
                "https://yourdomain.sharepoint.com/sites/mysite"
            ],
            "authType": "OAUTH2_SHAREPOINT_APP_ONLY_CLIENT_CREDENTIALS",
            "credentialsSecretArn": "arn:aws:secretsmanager:your-region:your-account-id:secret:AmazonBedrock-SharePoint"
        },
        "crawlerConfiguration": {
            "filterConfiguration": {
                "type": "PATTERN",
                "patternObjectFilter": {
                    "filters": [
                        {
                            "objectType": "File",
                            "inclusionFilters": [
                                ".*\\.pdf"
                            ],
                            "exclusionFilters": [
                                ".*private.*\\.pdf"
                            ]
                        }
                    ]
                }
            }
        }
    },
    "type": "SHAREPOINT"
}
```

------

**Important**  
OAuth 2.0 authentication is not recommended. We recommend that you use SharePoint App-Only authentication instead.

## Using OAuth 2.0
<a name="sharepoint-connector-oauth"></a>

Using OAuth 2.0, you can authenticate and authorize access to SharePoint resources for SharePoint connectors integrated with Knowledge Bases.

### Prerequisites
<a name="sharepoint-connector-oauth-prereq"></a>

**In SharePoint, for OAuth 2.0 authentication, make sure you**:
+ Take note of your SharePoint Online site URL/URLs. For example, *https://yourdomain.sharepoint.com/sites/mysite*. Your URL must start with *https* and contain *sharepoint.com*. Your site URL must be the actual SharePoint site, not *sharepoint.com/* or *sites/mysite/home.aspx*.
+ Take note of the domain name of your SharePoint Online instance URL/URLs.
+ Copy your Microsoft 365 tenant ID. You can find your tenant ID in the Properties of your Microsoft Entra portal or in your OAuth application.

  Take note of the username and password of the admin SharePoint account, and copy the client ID and client secret value when registering an application.
**Note**  
For an example application, see [Register a client application in Microsoft Entra ID](https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application) (formerly known as Azure Active Directory) on the Microsoft Learn website. 
+ Certain read permissions are required to connect to SharePoint when you register an application.
  + SharePoint: AllSites.Read (Delegated) – Read items in all site collections
+ You might need to turn off **Security Defaults** in your Azure portal using an admin user. For more information on managing security default settings in the Azure portal, see [Microsoft documentation on how to enable/disable security defaults](https://learn.microsoft.com/en-us/microsoft-365/business-premium/m365bp-conditional-access?view=o365-worldwide&tabs=secdefaults#security-defaults-1).
+ You might need to turn off multi-factor authentication (MFA) in your SharePoint account, so that Amazon Bedrock is not blocked from crawling your SharePoint content.

To complete the prerequisites, make sure that you've completed the steps in [AWS account](#prerequisites-sharepoint-connector-account).

### Secret authentication credentials
<a name="sharepoint-secret-auth-credentials-oauth"></a>

For connection configuration with OAuth 2.0, you can perform the same steps for the auto detection of the main document fields, inclusion/exclusion filters, and incremental syncing as described in [Connection configuration](#configuration-sharepoint-connector).

**For OAuth 2.0 authentication, your secret authentication credentials in AWS Secrets Manager must include these key-value pairs**.
+ `username`: *SharePoint admin username*
+ `password`: *SharePoint admin password*
+ `clientId`: *OAuth app client ID*
+ `clientSecret`: *OAuth app client secret*

### Connect a SharePoint instance to your knowledge base
<a name="sharepoint-connector-oauth-using"></a>

To connect a SharePoint instance to your knowledge base when using OAuth 2.0:
+ (console) In the console, follow the same steps as described in [Connect a SharePoint instance to your knowledge base](https://docs.aws.amazon.com/bedrock/latest/userguide/sharepoint-data-source-connector.html#connect-sharepoint-console). When you provide the authentication information to connect to your SharePoint instance:
  + Provide the tenant ID. You can find your tenant ID in the Properties of your Microsoft Entra portal.
  + Go to AWS Secrets Manager to add your secret authentication credentials or use an existing Amazon Resource Name (ARN) for the secret you created. Your secret must contain the SharePoint admin username and password, and your registered app client ID and client secret. For an example application, see [Register a client application in Microsoft Entra ID](https://learn.microsoft.com/en-us/azure/healthcare-apis/register-application) (formerly known as Azure Active Directory) on the Microsoft Learn website.
+ (API) The following is an example of using the `CreateDataSource` API to create your data source with your connection information for OAuth 2.0.

  ```
  aws bedrock-agent create-data-source \
   --name "SharePoint Online connector" \
   --description "SharePoint Online data source connector for Amazon Bedrock to use content in SharePoint" \
   --knowledge-base-id "your-knowledge-base-id" \
   --data-source-configuration file://sharepoint-bedrock-connector-configuration.json \
   --data-deletion-policy "DELETE"
  ```

  **Contents of `sharepoint-bedrock-connector-configuration.json`**

  ```
  {
      "sharePointConfiguration": {
          "sourceConfiguration": {
              "tenantId": "888d0b57-69f1-4fb8-957f-e1f0bedf64de",
              "hostType": "ONLINE",
              "domain": "yourdomain",
              "siteUrls": [
                  "https://yourdomain.sharepoint.com/sites/mysite"
              ],
              "authType": "OAUTH2_CLIENT_CREDENTIALS",
              "credentialsSecretArn": "arn:aws:secretsmanager:your-region:your-account-id:secret:AmazonBedrock-SharePoint"
          },
          "crawlerConfiguration": {
              "filterConfiguration": {
                  "type": "PATTERN",
                  "patternObjectFilter": {
                      "filters": [
                          {
                              "objectType": "File",
                              "inclusionFilters": [
                                  ".*\\.pdf"
                              ],
                              "exclusionFilters": [
                                  ".*private.*\\.pdf"
                              ]
                          }
                      ]
                  }
              }
          }
      },
      "type": "SHAREPOINT"
  }
  ```

# Connect to Salesforce for your knowledge base
<a name="salesforce-data-source-connector"></a>

Salesforce is a customer relationship management (CRM) tool for managing support, sales, and marketing teams. You can connect to your Salesforce instance for your Amazon Bedrock knowledge base by using either the [AWS Management Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home) or the [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) API (see Amazon Bedrock [supported SDKs and AWS CLI](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)).

**Note**  
The Salesforce data source connector is in preview release and is subject to change.  
Salesforce data sources don't support multimodal data, such as tables, charts, diagrams, or other images.

Currently, only Amazon OpenSearch Serverless vector store is available to use with this data source.

There are limits on the number of files that can be crawled and the maximum size per file. See [Quotas for knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).

**Topics**
+ [Supported features](#supported-features-salesforce-connector)
+ [Prerequisites](#prerequisites-salesforce-connector)
+ [Connection configuration](#configuration-salesforce-connector)

## Supported features
<a name="supported-features-salesforce-connector"></a>
+ Auto detection of main document fields
+ Inclusion/exclusion content filters
+ Incremental content syncs for added, updated, deleted content
+ OAuth 2.0 authentication

## Prerequisites
<a name="prerequisites-salesforce-connector"></a>

**In Salesforce, make sure you**:
+ Take note of your Salesforce instance URL. For example, *https://company.salesforce.com/*. The instance must be running a Salesforce Connected App.
+ Create a Salesforce Connected App and configure client credentials. Then, for your selected app, copy the consumer key (client ID) and consumer secret (client secret) from the OAuth settings. For more information, see Salesforce documentation on [Create a Connected App](https://help.salesforce.com/s/articleView?id=sf.connected_app_create.htm&type=5) and [Configure a Connected App for the OAuth 2.0 Client Credentials](https://help.salesforce.com/s/articleView?id=sf.connected_app_client_credentials_setup.htm&type=5).
**Note**  
For Salesforce Connected Apps, under Client Credentials Flow, make sure you search and select the user’s name or alias for your client credentials in the “Run As” field.

**In your AWS account, make sure you**:
+ Store your authentication credentials in an [AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) and note the Amazon Resource Name (ARN) of the secret. Follow the **Connection configuration** instructions on this page to include the key-value pairs that must be included in your secret.
+ Include the necessary permissions to connect to your data source in your AWS Identity and Access Management (IAM) role/permissions policy for your knowledge base. For information on the required permissions for this data source to add to your knowledge base IAM role, see [Permissions to access data sources](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html#kb-permissions-access-ds).

**Note**  
If you use the console, you can go to AWS Secrets Manager to add your secret or use an existing secret as part of the data source configuration step. The IAM role with all the required permissions can be created for you as part of the console steps for creating a knowledge base. After you have configured your data source and other configurations, the IAM role with all the required permissions is applied to your specific knowledge base.  
We recommend that you regularly refresh or rotate your credentials and secret. Provide only the necessary access level for your own security. We do not recommend that you re-use credentials and secrets across data sources.

## Connection configuration
<a name="configuration-salesforce-connector"></a>

To connect to your Salesforce instance, you must provide the necessary configuration information so that Amazon Bedrock can access and crawl your data. You must also follow the [Prerequisites](#prerequisites-salesforce-connector).

An example of a configuration for this data source is included in this section.

For more information about auto detection of document fields, inclusion/exclusion filters, incremental syncing, secret authentication credentials, and how these work, select the following:

### Auto detection of main document fields
<a name="ds-salesforce-document-fields"></a>

The data source connector automatically detects and crawls all of the main metadata fields of your documents or content. For example, the data source connector can crawl the document body equivalent of your documents, the document title, the document creation or modification date, or other core fields that might apply to your documents.

**Important**  
If your content includes sensitive information, then Amazon Bedrock could respond using sensitive information.

You can apply filtering operators to metadata fields to help you further improve the relevancy of responses. For example, the document metadata field `epoch_modification_time`, the number of seconds that have passed since January 1, 1970, records when the document was last updated. You can filter on the most recent data, where `epoch_modification_time` is *greater than* a certain number. For more information on the filtering operators you can apply to your metadata fields, see [Metadata and filtering](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-test-config.html).

### Inclusion/exclusion filters
<a name="ds-salesforce-inclusion-exclusion"></a>

You can include or exclude crawling certain content. For example, you can specify an exclusion prefix/regular expression pattern to skip crawling any file that contains “private” in the file name. You could also specify an inclusion prefix/regular expression pattern to include certain content entities or content types. If you specify an inclusion and exclusion filter and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

An example of a regular expression pattern to exclude or filter out campaigns that contain "private" in the campaign name: `.*private.*`

You can apply inclusion/exclusion filters on the following content types:
+ `Account`: Account number/identifier
+ `Attachment`: Attachment file name with its extension
+ `Campaign`: Campaign name and associated identifiers
+ `ContentVersion`: Document version and associated identifiers
+ `Partner`: Partner information fields including associated identifiers
+ `Pricebook2`: Product/price list name
+ `Case`: Customer inquiry/issue number and other information fields including associated identifiers (please note: can contain personal information, which you can choose to exclude or filter out)
+ `Contact`: Customer information fields (please note: can contain personal information, which you can choose to exclude or filter out)
+ `Contract`: Contract name and associated identifiers
+ `Document`: File name with its extension
+ `Idea`: Idea information fields and associated identifiers
+ `Lead`: Potential new customer information fields (please note: can contain personal information, which you can choose to exclude or filter out)
+ `Opportunity`: Pending sale/deal information fields and associated identifiers
+ `Product2`: Product information fields and associated identifiers
+ `Solution`: Solution name for a customer inquiry/issue and associated identifiers
+ `Task`: Task information fields and associated identifiers
+ `FeedItem`: Identifier of the chatter feed post
+ `FeedComment`: Identifier of the chatter feed post that the comments belong to
+ `Knowledge__kav`: Knowledge Article Title
+ `User`: User alias within your organization
+ `CollaborationGroup`: Chatter group name (unique)
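Filters are supplied per content type in the `patternObjectFilter` of your `crawlerConfiguration`. As an illustrative sketch, the following builds a filter configuration that crawls all `Campaign` entries except those with "private" in the name; the patterns shown are examples:

```python
import json

# Illustrative: one filter entry per Salesforce content type to crawl.
filter_configuration = {
    "type": "PATTERN",
    "patternObjectFilter": {
        "filters": [
            {
                "objectType": "Campaign",
                "inclusionFilters": [".*"],
                "exclusionFilters": [".*private.*"],
            }
        ]
    },
}

print(json.dumps(filter_configuration, indent=4))
```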

### Incremental syncing
<a name="ds-salesforce-incremental-sync"></a>

The data source connector crawls new, modified, and deleted content each time your data source syncs with your knowledge base. Amazon Bedrock can use your data source’s mechanism for tracking content changes and crawl content that changed since the last sync. When you sync your data source with your knowledge base for the first time, all content is crawled by default.

To sync your data source with your knowledge base, use the [StartIngestionJob](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_StartIngestionJob.html) API or select your knowledge base in the console and select **Sync** within the data source overview section.

**Important**  
All data that you sync from your data source becomes available to anyone with `bedrock:Retrieve` permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see [Knowledge base permissions](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html).

### Secret authentication credentials
<a name="ds-salesforce-secret-auth-credentials"></a>

(For OAuth 2.0 authentication) Your secret authentication credentials in AWS Secrets Manager should include these key-value pairs:
+ `consumerKey`: *app client ID*
+ `consumerSecret`: *app client secret*
+ `authenticationUrl`: *Salesforce instance URL or the URL to request the authentication token from*

**Note**  
Your secret in AWS Secrets Manager must be in the same AWS Region as your knowledge base.
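These values are used in an OAuth 2.0 client credentials flow against your Salesforce token endpoint. As a rough illustration (not the connector's implementation), a token request body built from the secret looks like the following; the credentials are placeholders:

```python
from urllib.parse import urlencode

# Placeholder credentials; in practice these values come from your
# Secrets Manager secret.
token_request_body = urlencode({
    "grant_type": "client_credentials",
    "client_id": "your-consumer-key",
    "client_secret": "your-consumer-secret",
})

# The request would be POSTed to your authenticationUrl, typically the
# standard Salesforce token path /services/oauth2/token on your instance.
print(token_request_body)
```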

------
#### [ Console ]

**Connect a Salesforce instance to your knowledge base**

1. Follow the steps at [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and choose **Salesforce** as the data source.

1. Provide a name and optional description for the data source.

1. Provide your Salesforce instance URL. For example, *https://company.salesforce.com/*. The instance must be running a Salesforce Connected App.

1. In the **Advanced settings** section, you can optionally configure the following:
   + **KMS key for transient data storage** – You can encrypt the transient data while converting your data into embeddings with the default AWS managed key or your own KMS key. For more information, see [Encryption of transient data storage during data ingestion](encryption-kb.md#encryption-kb-ingestion).
   + **Data deletion policy** – By default, the vector embeddings for your data source are deleted from the vector store when the data source is deleted; you can instead choose to retain the vector store data.

1. Provide the authentication information to connect to your Salesforce instance:

   1. For OAuth 2.0 authentication, go to AWS Secrets Manager to add your secret authentication credentials or use an existing Amazon Resource Name (ARN) for the secret you created. Your secret must contain the Salesforce Connected App consumer key (client ID), consumer secret (client secret), and the Salesforce instance URL or the URL to request the authentication token from. For more information, see Salesforce documentation on [Create a Connected App](https://help.salesforce.com/s/articleView?id=sf.connected_app_create.htm&type=5) and [Configure a Connected App for the OAuth 2.0 Client Credentials](https://help.salesforce.com/s/articleView?id=sf.connected_app_client_credentials_setup.htm&type=5).

1. (Optional) In the **Content parsing and chunking** section, you can customize how to parse and chunk your data. Refer to the following resources to learn more about these customizations:
   + For more information about parsing options, see [Parsing options for your data source](kb-advanced-parsing.md).
   + For more information about chunking strategies, see [How content chunking works for knowledge bases](kb-chunking.md).
**Warning**  
You can't change the chunking strategy after connecting to the data source.
   + For more information about how to customize chunking of your data and processing of your metadata with a Lambda function, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md).

1. Choose whether to use filters/regular expression patterns to include or exclude certain content. Otherwise, all standard content is crawled.

1. Continue to choose an embeddings model and vector store. To see the remaining steps, return to [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and continue from the step after connecting your data source.

------
#### [ API ]

The following is an example of a configuration for connecting to Salesforce for your Amazon Bedrock knowledge base. You configure your data source using the API with the AWS CLI or supported SDK, such as Python. After you call [CreateKnowledgeBase](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html), you call [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) to create your data source with your connection information in `dataSourceConfiguration`.

To learn about customizations that you can apply to ingestion by including the optional `vectorIngestionConfiguration` field, see [Customize ingestion for a data source](kb-data-source-customize-ingestion.md).

**AWS Command Line Interface**

```
aws bedrock-agent create-data-source \
 --name "Salesforce connector" \
 --description "Salesforce data source connector for Amazon Bedrock to use content in Salesforce" \
 --knowledge-base-id "your-knowledge-base-id" \
 --data-source-configuration file://salesforce-bedrock-connector-configuration.json \
 --data-deletion-policy "DELETE" \
 --vector-ingestion-configuration '{"chunkingConfiguration":{"chunkingStrategy":"FIXED_SIZE","fixedSizeChunkingConfiguration":{"maxTokens":100,"overlapPercentage":10}}}'
```

**salesforce-bedrock-connector-configuration.json**

```
{
    "salesforceConfiguration": {
        "sourceConfiguration": {
            "hostUrl": "https://company.salesforce.com/",
            "authType": "OAUTH2_CLIENT_CREDENTIALS",
            "credentialsSecretArn": "arn:aws:secretsmanager:your-region:your-account-id:secret:AmazonBedrock-Salesforce"
        },
        "crawlerConfiguration": {
            "filterConfiguration": {
                "type": "PATTERN",
                "patternObjectFilter": {
                    "filters": [
                        {
                            "objectType": "Campaign",
                            "inclusionFilters": [
                                ".*public.*"
                            ],
                            "exclusionFilters": [
                                ".*private.*"
                            ]
                        }
                    ]
                }
            }
        }
    },
    "type": "SALESFORCE"
}
```
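The same data source can be created with the AWS SDK for Python (Boto3), as in the following sketch. The knowledge base ID, Region, account ID, and secret ARN are placeholders that you would replace with your own values:

```python
# Sketch: create the Salesforce data source shown above with the
# AWS SDK for Python (Boto3). All IDs, the Region, and the secret ARN
# are placeholders, not real resources.

def build_salesforce_configuration():
    """Return the dataSourceConfiguration payload for a Salesforce connector."""
    return {
        "type": "SALESFORCE",
        "salesforceConfiguration": {
            "sourceConfiguration": {
                "hostUrl": "https://company.salesforce.com/",
                "authType": "OAUTH2_CLIENT_CREDENTIALS",
                "credentialsSecretArn": (
                    "arn:aws:secretsmanager:your-region:your-account-id"
                    ":secret:AmazonBedrock-Salesforce"
                ),
            },
            "crawlerConfiguration": {
                "filterConfiguration": {
                    "type": "PATTERN",
                    "patternObjectFilter": {
                        "filters": [
                            {
                                "objectType": "Campaign",
                                "inclusionFilters": [".*public.*"],
                                "exclusionFilters": [".*private.*"],
                            }
                        ]
                    },
                }
            },
        },
    }

def create_salesforce_data_source(knowledge_base_id):
    """Call CreateDataSource through Boto3 (requires AWS credentials)."""
    import boto3  # deferred import: the sketch is readable without boto3 installed

    client = boto3.client("bedrock-agent")
    return client.create_data_source(
        knowledgeBaseId=knowledge_base_id,
        name="Salesforce connector",
        dataSourceConfiguration=build_salesforce_configuration(),
        dataDeletionPolicy="DELETE",
    )
```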

------

# Crawl web pages for your knowledge base
<a name="webcrawl-data-source-connector"></a>

The Web Crawler provided by Amazon Bedrock connects to and crawls URLs that you have selected for use in your Amazon Bedrock knowledge base. You can crawl website pages in accordance with the scope or limits that you set for your selected URLs. You can crawl website pages using either the [AWS Management Console for Amazon Bedrock](https://console.aws.amazon.com/bedrock/home) or the [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) API (see Amazon Bedrock [supported SDKs and AWS CLI](https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html)). Currently, only the Amazon OpenSearch Serverless vector store is available for use with this data source.

**Note**  
The Web Crawler data source connector is in preview release and is subject to change.

When selecting websites to crawl, you must adhere to the [Amazon Acceptable Use Policy](https://aws.amazon.com/aup/) and all other Amazon terms. Remember that you must only use the Web Crawler to index your own web pages, or web pages that you have authorization to crawl, and that you must respect robots.txt configurations.

The Web Crawler respects robots.txt in accordance with [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html).

There are limits on the number of web page content items that can be crawled and on the size (in MB) of each content item. See [Quotas for knowledge bases](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).

**Topics**
+ [Supported features](#supported-features-webcrawl-connector)
+ [Prerequisites](#prerequisites-webcrawl-connector)
+ [Connection configuration](#configuration-webcrawl-connector)

## Supported features
<a name="supported-features-webcrawl-connector"></a>

The Web Crawler connects to and crawls HTML pages starting from the seed URL, traversing all child links under the same top primary domain and path. If any of the HTML pages reference supported documents, the Web Crawler fetches those documents, regardless of whether they are within the same top primary domain. You can modify the crawling behavior by changing the crawling configuration. For more information, see [Connection configuration](#configuration-webcrawl-connector).

The Web Crawler supports the following:
+ Select multiple source URLs to crawl and set the scope of URLs to crawl only the host or also include subdomains.
+ Crawl static web pages that are part of your source URLs.
+ Specify custom User Agent suffix to set rules for your own crawler.
+ Include or exclude certain URLs that match a filter pattern.
+ Respect standard robots.txt directives like 'Allow' and 'Disallow'.
+ Limit the scope of the URLs to crawl and optionally exclude URLs that match a filter pattern.
+ Limit the rate of crawling URLs and the maximum number of pages to crawl.
+ View the status of crawled URLs in Amazon CloudWatch.

## Prerequisites
<a name="prerequisites-webcrawl-connector"></a>

**To use the Web Crawler, make sure that you:**
+ Check that you are authorized to crawl your source URLs.
+ Check that the robots.txt file corresponding to your source URLs doesn't block the URLs from being crawled. The Web Crawler adheres to the standards of robots.txt and defaults to `disallow` if robots.txt is not found for the website. The Web Crawler respects robots.txt in accordance with [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html). You can also specify a custom User-Agent header suffix to set rules for your own crawler. For more information, see the Web Crawler URL access instructions in [Connection configuration](#configuration-webcrawl-connector) on this page.
+ [Enable CloudWatch Logs delivery](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-bases-logging.html) and follow examples of Web Crawler logs to view the status of your data ingestion job for ingesting web content, and if certain URLs cannot be retrieved.

**Note**  
When selecting websites to crawl, you must adhere to the [Amazon Acceptable Use Policy](https://aws.amazon.com/aup/) and all other Amazon terms. Remember that you must only use the Web Crawler to index your own web pages, or web pages that you have authorization to crawl.

## Connection configuration
<a name="configuration-webcrawl-connector"></a>

For more information about sync scope for crawling URLs, inclusion/exclusion filters, URL access, and incremental syncing, see the following sections:

### Sync scope for crawling URLs
<a name="ds-sync-scope"></a>

You can limit the scope of the URLs to crawl based on each page URL's specific relationship to the seed URLs. For faster crawls, you can limit URLs to those with the same host and initial URL path as the seed URL. For broader crawls, you can choose to crawl URLs with the same host or within any subdomain of the seed URL.

You can choose from the following options.
+ Default: Limit crawling to web pages that belong to the same host and with the same initial URL path. For example, with a seed URL of "https://aws.amazon.com/bedrock/" then only this path and web pages that extend from this path will be crawled, like "https://aws.amazon.com/bedrock/agents/". Sibling URLs like "https://aws.amazon.com/ec2/" are not crawled, for example.
+ Host only: Limit crawling to web pages that belong to the same host. For example, with a seed URL of "https://aws.amazon.com/bedrock/", then web pages with "https://aws.amazon.com" will also be crawled, like "https://aws.amazon.com/ec2".
+ Subdomains: Include crawling of any web page that has the same primary domain as the seed URL. For example, with a seed URL of "https://aws.amazon.com/bedrock/" then any web page that contains "amazon.com" (subdomain) will be crawled, like "https://www.amazon.com".

**Note**  
Make sure that you aren't crawling an excessive number of web pages. We don't recommend crawling large websites, such as wikipedia.org, without filters or scope limits, because crawling them can take a very long time.  
[Supported file types](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ds.html) are crawled regardless of scope, as long as there's no exclusion pattern for the file type.

The Web Crawler supports static websites.

You can also limit the rate of crawling URLs to control the throttling of crawling speed. You set the maximum number of URLs crawled per host per minute. In addition, you can also set the maximum number (up to 25,000) of total web pages to crawl. Note that if the total number of web pages from your source URLs exceeds your set maximum, then your data source sync/ingestion job will fail.
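The three scope options can be sketched in Python as follows. This is an illustration of the rules described above, not the Web Crawler's actual matching logic; the URLs are placeholders:

```python
# Illustration of the Default, Host only, and Subdomains scope rules
# described above. This is not the Web Crawler's implementation.
from urllib.parse import urlparse

def in_scope(seed_url, candidate_url, scope="DEFAULT"):
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    if scope == "DEFAULT":
        # Same host, and the candidate path extends the seed path.
        return cand.netloc == seed.netloc and cand.path.startswith(seed.path)
    if scope == "HOST_ONLY":
        # Same host, any path.
        return cand.netloc == seed.netloc
    if scope == "SUBDOMAINS":
        # Same primary domain (naive two-label comparison, for illustration only).
        primary = ".".join(seed.netloc.split(".")[-2:])
        return cand.netloc == primary or cand.netloc.endswith("." + primary)
    raise ValueError(f"unknown scope: {scope}")

seed = "https://example.com/docs/"
print(in_scope(seed, "https://example.com/docs/guide.html"))       # True: extends the seed path
print(in_scope(seed, "https://example.com/blog/"))                 # False: sibling path (Default)
print(in_scope(seed, "https://example.com/blog/", "HOST_ONLY"))    # True: same host
print(in_scope(seed, "https://www.example.com/", "SUBDOMAINS"))    # True: same primary domain
```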

### Inclusion/exclusion filters
<a name="ds-inclusion-exclusion"></a>

You can include or exclude certain URLs in accordance with your scope. [Supported file types](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ds.html) are crawled regardless of scope, as long as there's no exclusion pattern for the file type. If you specify both an inclusion and an exclusion filter and both match a URL, the exclusion filter takes precedence and the web content isn't crawled.

**Important**  
Problematic regular expression pattern filters that lead to [catastrophic backtracking](https://docs.aws.amazon.com/codeguru/detector-library/python/catastrophic-backtracking-regex/) or look-ahead are rejected.

An example of a regular expression filter pattern to exclude URLs that end with ".pdf", such as PDF web page attachments, is `.*\.pdf$`.
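As an illustration of the precedence rule described above, the following sketch evaluates a URL against inclusion and exclusion patterns with Python's `re` module. This is not the crawler's implementation, and the patterns are examples:

```python
# Illustration of filter precedence: if a URL matches both an inclusion
# and an exclusion pattern, the exclusion wins and the URL is skipped.
import re

def passes_filters(url, include_patterns, exclude_patterns):
    if any(re.fullmatch(p, url) for p in exclude_patterns):
        return False  # exclusion takes precedence
    if include_patterns:
        return any(re.fullmatch(p, url) for p in include_patterns)
    return True  # no inclusion filters: everything in scope passes

include = [r"https://www\.example\.com/.*"]
exclude = [r".*\.pdf$"]  # skip URLs that end with .pdf

print(passes_filters("https://www.example.com/guide.html", include, exclude))  # True
print(passes_filters("https://www.example.com/manual.pdf", include, exclude))  # False
```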

### Web Crawler URL access
<a name="ds-webcrawl-identity-crawling"></a>

You can use the Web Crawler to crawl the pages of websites that you are authorized to crawl.

When selecting websites to crawl, you must adhere to the [Amazon Acceptable Use Policy](https://aws.amazon.com/aup/) and all other Amazon terms. Remember that you must only use the Web Crawler to index your own web pages, or web pages that you have authorization to crawl.

The Web Crawler respects robots.txt in accordance with [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html).

You can allow or disallow specific user agents from crawling your source URLs by modifying the robots.txt file of your website, which controls how the Web Crawler crawls those URLs. The crawler first looks for `bedrockbot-UUID` rules and then for generic `bedrockbot` rules in the robots.txt file.

You can also add a User-Agent suffix that can be used to allowlist your crawler in bot protection systems. Note that this suffix should not be added to the `robots.txt` file, to make sure that no one can impersonate the User-Agent string. For example, to allow the Web Crawler to crawl all website content and disallow crawling for any other robots, use the following directive:

```
User-agent: bedrockbot-UUID # Amazon Bedrock Web Crawler
Allow: / # allow access to all pages
User-agent: * # any (other) robot
Disallow: / # disallow access to any pages
```
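You can preview how rules like the directive above are evaluated by using Python's standard `urllib.robotparser` module locally. This is a local sketch for experimentation; the Web Crawler's own behavior is governed by RFC 9309:

```python
# Sketch: evaluate robots.txt rules locally with the standard library.
# The rules below mirror the example directive above.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: bedrockbot-UUID
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("bedrockbot-UUID", "https://www.example.com/page.html"))  # True
print(parser.can_fetch("SomeOtherBot", "https://www.example.com/page.html"))     # False
```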

### Incremental syncing
<a name="ds-incremental-sync"></a>

Each time the Web Crawler runs, it retrieves content for all URLs that are reachable from the source URLs and that match the scope and filters. For incremental syncs after the first sync of all content, Amazon Bedrock updates your knowledge base with new and modified content, and removes old content that is no longer present. Occasionally, the crawler might not be able to tell whether content was removed from the website; in this case, it errs on the side of preserving old content in your knowledge base.

To sync your data source with your knowledge base, use the [StartIngestionJob](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_StartIngestionJob.html) API or select your knowledge base in the console and select **Sync** within the data source overview section.

**Important**  
All data that you sync from your data source becomes available to anyone with `bedrock:Retrieve` permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see [Knowledge base permissions](https://docs.aws.amazon.com/bedrock/latest/userguide/kb-permissions.html).
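To start a sync programmatically, you can call StartIngestionJob through the AWS SDK for Python (Boto3), as in the following sketch. The IDs are placeholders, and the set of terminal job statuses shown is an assumption; check the API reference for the authoritative list:

```python
# Sketch: start a data source sync (ingestion job) with Boto3.
# The knowledge base and data source IDs are placeholders.

# Assumed terminal job statuses; check the API reference for the full list.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}

def is_finished(status):
    """Return True when an ingestion job has reached a terminal status."""
    return status in TERMINAL_STATUSES

def start_sync(knowledge_base_id, data_source_id):
    """Start an ingestion job and return its ID (requires AWS credentials)."""
    import boto3  # deferred import: the sketch is readable without boto3 installed

    client = boto3.client("bedrock-agent")
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]
```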

------
#### [ Console ]

**Connect a Web Crawler data source to your knowledge base**

1. Follow the steps at [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and choose **Web Crawler** as the data source.

1. Provide a name and optional description for the data source.

1. Provide the **Source URLs** that you want to crawl. You can add up to 9 additional URLs by selecting **Add Source URLs**. By providing a source URL, you are confirming that you are authorized to crawl its domain.

1. In the **Advanced settings** section, you can optionally configure the following:
   + **KMS key for transient data storage.** – You can encrypt the transient data while converting your data into embeddings with the default AWS managed key or your own KMS key. For more information, see [Encryption of transient data storage during data ingestion](encryption-kb.md#encryption-kb-ingestion).
   + **Data deletion policy** – You can delete the vector embeddings for your data source that are stored in the vector store by default, or choose to retain the vector store data.

1. (Optional) Provide a user agent suffix for **bedrockbot-UUID** that identifies the crawler or bot when it accesses a web server.

1. Configure the following in the **Sync scope** section:

   1. Select a **Website domain range** for crawling your source URLs:
      + Default: Limit crawling to web pages that belong to the same host and with the same initial URL path. For example, with a seed URL of "https://aws.amazon.com/bedrock/" then only this path and web pages that extend from this path will be crawled, like "https://aws.amazon.com/bedrock/agents/". Sibling URLs like "https://aws.amazon.com/ec2/" are not crawled, for example.
      + Host only: Limit crawling to web pages that belong to the same host. For example, with a seed URL of "https://aws.amazon.com/bedrock/", then web pages with "https://aws.amazon.com" will also be crawled, like "https://aws.amazon.com/ec2".
      + Subdomains: Include crawling of any web page that has the same primary domain as the seed URL. For example, with a seed URL of "https://aws.amazon.com/bedrock/" then any web page that contains "amazon.com" (subdomain) will be crawled, like "https://www.amazon.com".
**Note**  
Make sure that you aren't crawling an excessive number of web pages. We don't recommend crawling large websites, such as wikipedia.org, without filters or scope limits, because crawling them can take a very long time.  
[Supported file types](https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-ds.html) are crawled regardless of scope, as long as there's no exclusion pattern for the file type.

   1. Enter **Maximum throttling of crawling speed**. You can ingest between 1 and 300 URLs per host per minute. A higher crawling speed increases the load on the host but takes less time.

   1. Enter **Maximum pages for data source sync**, between 1 and 25,000, to limit the maximum number of web pages crawled from your source URLs. If the web pages exceed this number, the data source sync fails and no web pages are ingested.

   1. For **URL Regex patterns** (optional), you can add **Include patterns** or **Exclude patterns** by entering the regular expression pattern in the box. You can add up to 25 include and 25 exclude filter patterns by selecting **Add new pattern**. URLs matching the patterns are crawled in accordance with your scope. If a URL matches both an include and an exclude pattern, the exclude pattern takes precedence and the URL isn't crawled.

1. (Optional) In the **Content parsing and chunking** section, you can customize how to parse and chunk your data. Refer to the following resources to learn more about these customizations:
   + For more information about parsing options, see [Parsing options for your data source](kb-advanced-parsing.md).
   + For more information about chunking strategies, see [How content chunking works for knowledge bases](kb-chunking.md).
**Warning**  
You can't change the chunking strategy after connecting to the data source.
   + For more information about how to customize chunking of your data and processing of your metadata with a Lambda function, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md).

1. Continue to choose an embeddings model and vector store. To see the remaining steps, return to [Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases](knowledge-base-create.md) and continue from the step after connecting your data source.

------
#### [ API ]

To connect a knowledge base to a data source using the Web Crawler, send a [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) request with an [Agents for Amazon Bedrock build-time endpoint](https://docs.aws.amazon.com/general/latest/gr/bedrock.html#bra-bt), specify `WEB` in the `type` field of the [DataSourceConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_DataSourceConfiguration.html), and include the `webConfiguration` field. The following is an example of a Web Crawler configuration for your Amazon Bedrock knowledge base.

```
{
    "webConfiguration": {
        "sourceConfiguration": {
            "urlConfiguration": {
                "seedUrls": [{
                    "url": "https://www.examplesite.com"
                }]
            }
        },
        "crawlerConfiguration": {
            "crawlerLimits": {
                "rateLimit": 50,
                "maxPages": 100
            },
            "scope": "HOST_ONLY",
            "inclusionFilters": [
                "https://www\.examplesite\.com/.*\.html"
            ],
            "exclusionFilters": [
                "https://www\.examplesite\.com/contact-us\.html"
            ],
            "userAgent": "CustomUserAgent"
        }
    },
    "type": "WEB"
}
```

To learn about customizations that you can apply to ingestion by including the optional `vectorIngestionConfiguration` field, see [Customize ingestion for a data source](kb-data-source-customize-ingestion.md).
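The same `webConfiguration` payload can be passed through the AWS SDK for Python (Boto3), as in the following sketch. The knowledge base ID and site URLs are placeholders:

```python
# Sketch: create a Web Crawler data source with Boto3.
# The knowledge base ID and site URLs are placeholders.

def build_web_configuration():
    """Return the dataSourceConfiguration payload for the Web Crawler."""
    return {
        "type": "WEB",
        "webConfiguration": {
            "sourceConfiguration": {
                "urlConfiguration": {
                    "seedUrls": [{"url": "https://www.examplesite.com"}]
                }
            },
            "crawlerConfiguration": {
                "crawlerLimits": {"rateLimit": 50, "maxPages": 100},
                "scope": "HOST_ONLY",
                "inclusionFilters": [r"https://www\.examplesite\.com/.*\.html"],
                "exclusionFilters": [r"https://www\.examplesite\.com/contact-us\.html"],
                "userAgent": "CustomUserAgent",
            },
        },
    }

def create_web_data_source(knowledge_base_id):
    """Call CreateDataSource through Boto3 (requires AWS credentials)."""
    import boto3  # deferred import: the sketch is readable without boto3 installed

    client = boto3.client("bedrock-agent")
    return client.create_data_source(
        knowledgeBaseId=knowledge_base_id,
        name="Web Crawler connector",
        dataSourceConfiguration=build_web_configuration(),
    )
```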

------

# Connect your knowledge base to a custom data source
<a name="custom-data-source-connector"></a>

Instead of connecting to a supported data source service, you can connect your knowledge base to a custom data source, which offers the following advantages:
+ Flexibility and control over the data types that you want your knowledge base to have access to.
+ The ability to use the `KnowledgeBaseDocuments` API operations to directly ingest or delete documents without the need to sync changes.
+ The ability to view documents in your data source directly through the Amazon Bedrock console or API.
+ The ability to upload documents into the data source directly in the AWS Management Console or to add them inline.
+ The ability to add metadata directly to each document when adding or updating it in the data source. For more information on how to use metadata for filtering when retrieving information from a data source, see the **Metadata and filtering** tab in [Configure and customize queries and response generation](kb-test-config.md).

**Multimodal content support**  
Custom data sources support multimodal content, including images, audio, and video files up to 10 MB when Base64-encoded. For comprehensive guidance on working with multimodal content, see [Build a knowledge base for multimodal content](kb-multimodal.md).

To connect a knowledge base to a custom data source, send a [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) request with an [Agents for Amazon Bedrock build-time endpoint](https://docs.aws.amazon.com/general/latest/gr/bedrock.html#bra-bt). Specify the `knowledgeBaseId` of the knowledge base to connect to, give a `name` to the data source, and specify the `type` field in the `dataSourceConfiguration` as `CUSTOM`. The following shows a minimal example to create this data source:

```
PUT /knowledgebases/KB12345678/datasources/ HTTP/1.1
Content-type: application/json

{
    "name": "MyCustomDataSource",
    "dataSourceConfiguration": {
        "type": "CUSTOM"
    }
}
```
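The same minimal request can be made with the AWS SDK for Python (Boto3), as in this sketch; the knowledge base ID is a placeholder:

```python
# Sketch: create a custom data source with Boto3.
# The knowledge base ID is a placeholder.

def build_custom_configuration():
    """Return the minimal dataSourceConfiguration for a custom data source."""
    return {"type": "CUSTOM"}

def create_custom_data_source(knowledge_base_id, name="MyCustomDataSource"):
    """Call CreateDataSource through Boto3 (requires AWS credentials)."""
    import boto3  # deferred import: the sketch is readable without boto3 installed

    client = boto3.client("bedrock-agent")
    return client.create_data_source(
        knowledgeBaseId=knowledge_base_id,
        name=name,
        dataSourceConfiguration=build_custom_configuration(),
        dataDeletionPolicy="RETAIN",  # optional: keep embeddings if the data source is deleted
    )
```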

You can include any of the following optional fields to configure the data source:


****  

| Field | Use case | 
| --- | --- | 
| description | To provide a description for the data source. | 
| clientToken | To ensure the API request completes only once. For more information, see [Ensuring idempotency](https://docs.aws.amazon.com/ec2/latest/devguide/ec2-api-idempotency.html). | 
| serverSideEncryptionConfiguration | To specify a custom KMS key for transient data storage while converting your data into embeddings. For more information, see [Encryption of transient data storage during data ingestion](encryption-kb.md#encryption-kb-ingestion) | 
| dataDeletionPolicy | To configure what to do with the vector embeddings for your data source in your vector store if you delete the data source. Specify RETAIN to retain the data in the vector store, or the default option of DELETE to delete it. | 
| vectorIngestionConfiguration | To configure options for ingestion of the data source. For more information, see the fields described following this table. | 

The `vectorIngestionConfiguration` field maps to a [VectorIngestionConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_VectorIngestionConfiguration.html) object containing the following fields:
+ chunkingConfiguration – To configure the strategy to use for chunking the documents in the data source. For more information about chunking strategies, see [How content chunking works for knowledge bases](kb-chunking.md).
+ parsingConfiguration – To configure the strategy to use for parsing the data source. For more information about parsing options, see [Parsing options for your data source](kb-advanced-parsing.md).
+ customTransformationConfiguration – To customize how the data is transformed and to apply a Lambda function for greater customization. For more information about how to customize chunking of your data and processing of your metadata with a Lambda function, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md).
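As an example, a `vectorIngestionConfiguration` with fixed-size chunking might look like the following sketch. The token and overlap values are illustrative, not recommendations:

```python
# Sketch: a vectorIngestionConfiguration payload with fixed-size chunking.
# The maxTokens and overlapPercentage values are illustrative only.

def build_vector_ingestion_configuration(max_tokens=300, overlap_percentage=20):
    """Return a vectorIngestionConfiguration using the FIXED_SIZE chunking strategy."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }

# Pass the result as the vectorIngestionConfiguration argument of CreateDataSource.
config = build_vector_ingestion_configuration()
```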

After setting up your custom data source, you can add documents into it and directly ingest them into the knowledge base. Unlike other data sources, you don't need to sync a custom data source. To learn how to ingest documents directly, see [Ingest changes directly into a knowledge base](kb-direct-ingestion.md).