Crawl web pages for your Amazon Bedrock knowledge base
The Amazon Bedrock provided Web Crawler connects to and crawls URLs you have selected for use in your Amazon Bedrock knowledge base.
You can crawl website pages in accordance with your set scope or limits for your selected URLs. You can crawl
website pages using either the AWS Management Console for Amazon Bedrock
Note
Crawling web URLs as your data source is in preview release and is subject to change.
Web Crawler data sources don't support multimodal data, such as tables, charts, diagrams, or other images.
When selecting websites to crawl, you must adhere to the Amazon Acceptable Use Policy.
The Web Crawler respects robots.txt in accordance with RFC 9309.
Limits apply to the number of web page content items that can be crawled and to the size in MB of each content item. See Quotas for knowledge bases.
Supported features
The Web Crawler connects to and crawls HTML pages starting from the seed URL, traversing all child links under the same top primary domain and path. If any of the HTML pages reference supported documents, the Web Crawler fetches these documents, regardless of whether they are within the same top primary domain. You can modify the crawling behavior by changing the crawling configuration. See Connection configuration.
With the Web Crawler, you can:
-
Select multiple URLs to crawl
-
Respect standard robots.txt directives like 'Allow' and 'Disallow'
-
Limit the scope of the URLs to crawl and optionally exclude URLs that match a filter pattern
-
Limit the rate of crawling URLs
-
View the status of URLs visited while crawling in Amazon CloudWatch
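The features above map onto a crawling configuration. The following is a minimal sketch of how such a configuration might be assembled as a plain dictionary. The helper `build_web_data_source_config` is hypothetical, and the field names (`webConfiguration`, `seedUrls`, `crawlerConfiguration`, `crawlerLimits`, and so on) and scope values are assumptions based on the shape of the Amazon Bedrock data source API, not something stated on this page. Verify them against the API reference before use.

```python
from typing import Any, Optional


def build_web_data_source_config(
    seed_urls: list[str],
    scope: Optional[str] = None,
    inclusion_filters: Optional[list[str]] = None,
    exclusion_filters: Optional[list[str]] = None,
    rate_limit: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble a web crawler data source configuration as a plain dict.

    Field names are assumptions; check them against the API reference.
    """
    crawler: dict[str, Any] = {}
    if scope is not None:
        # Assumed values: "HOST_ONLY" or "SUBDOMAINS"; omit for the default scope.
        crawler["scope"] = scope
    if inclusion_filters:
        crawler["inclusionFilters"] = list(inclusion_filters)
    if exclusion_filters:
        crawler["exclusionFilters"] = list(exclusion_filters)
    if rate_limit is not None:
        # Limits the rate of crawling URLs.
        crawler["crawlerLimits"] = {"rateLimit": rate_limit}
    return {
        "type": "WEB",
        "webConfiguration": {
            "sourceConfiguration": {
                # Multiple seed URLs can be selected.
                "urlConfiguration": {"seedUrls": [{"url": u} for u in seed_urls]}
            },
            "crawlerConfiguration": crawler,
        },
    }


config = build_web_data_source_config(
    ["https://aws.amazon.com/bedrock/"],
    scope="HOST_ONLY",
    exclusion_filters=[r".*\.pdf$"],
    rate_limit=50,
)
```

Building the dictionary separately from any API call keeps the configuration easy to inspect and test before it is submitted.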
Prerequisites
To use the Web Crawler, make sure you:
-
Check that you are authorized to crawl your source URLs.
-
Check that the robots.txt file corresponding to your source URLs doesn't block the URLs from being crawled. The Web Crawler adheres to the standards of robots.txt: disallow by default if robots.txt is not found for the website. The Web Crawler respects robots.txt in accordance with RFC 9309.
-
Check whether your source URL pages are dynamically generated with JavaScript, because crawling dynamically generated content is currently not supported. You can check this by entering the following in your browser: view-source:https://examplesite.com/site/. If the body element contains only a div element and few or no a href elements, then the page is likely generated dynamically. You can also disable JavaScript in your browser, reload the web page, and observe whether the content renders properly and contains links to your web pages of interest.
-
Enable CloudWatch Logs delivery to view the status of your data ingestion job for ingesting web content, and to see whether certain URLs cannot be retrieved.
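The manual check for dynamically generated pages described in the prerequisites can be approximated in code. The following is a rough sketch that parses a page's raw HTML source and flags it when the body holds very few elements and few or no a href links. The class `BodyStats`, the function `looks_dynamically_generated`, and the thresholds are all hypothetical choices for illustration. This is a heuristic, not the check the Web Crawler itself performs.

```python
from html.parser import HTMLParser


class BodyStats(HTMLParser):
    """Counts elements and links found inside the body element."""

    # Tags that carry no rendered content without JavaScript.
    IGNORED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.in_body = False
        self.element_count = 0
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body and tag not in self.IGNORED:
            self.element_count += 1
            if tag == "a" and any(name == "href" for name, _ in attrs):
                self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def looks_dynamically_generated(html: str, max_elements: int = 3, max_links: int = 2) -> bool:
    """Return True if the body is nearly empty of elements and links.

    The thresholds are arbitrary assumptions; tune them for your sites.
    """
    stats = BodyStats()
    stats.feed(html)
    stats.close()
    return stats.element_count <= max_elements and stats.link_count <= max_links


spa_page = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"
static_page = (
    "<html><body><h1>Docs</h1><ul>"
    "<li><a href='/a'>A</a></li><li><a href='/b'>B</a></li><li><a href='/c'>C</a></li>"
    "</ul></body></html>"
)
print(looks_dynamically_generated(spa_page))
print(looks_dynamically_generated(static_page))
```

A page flagged by this sketch is worth confirming by hand with the view-source check or by disabling JavaScript in the browser.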
Note
When selecting websites to crawl, you must adhere to the Amazon Acceptable Use Policy.
Connection configuration
For more information about sync scope for crawling URLs, inclusion/exclusion filters, URL access, incremental syncing, and how these work, see the following:
Sync scope for crawling URLs
You can limit the scope of the URLs to crawl based on each page URL's specific relationship to the seed URLs. For faster crawls, you can limit URLs to those with the same host and initial URL path as the seed URL. For broader crawls, you can choose to crawl URLs with the same host or within any subdomain of the seed URL.
You can choose from the following options.
-
Default: Limit crawling to web pages that belong to the same host and have the same initial URL path. For example, with a seed URL of "https://aws.amazon.com/bedrock/", only this path and web pages that extend from this path are crawled, like "https://aws.amazon.com/bedrock/agents/". Sibling URLs like "https://aws.amazon.com/ec2/" are not crawled.
-
Host only: Limit crawling to web pages that belong to the same host. For example, with a seed URL of "https://aws.amazon.com/bedrock/", other web pages on the host "https://aws.amazon.com" are also crawled, like "https://aws.amazon.com/ec2".
-
Subdomains: Include crawling of any web page that has the same primary domain as the seed URL. For example, with a seed URL of "https://aws.amazon.com/bedrock/", any web page whose host is "amazon.com" or one of its subdomains is crawled, like "https://www.amazon.com".
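The three scope options above can be sketched as a single membership test. The function names `in_scope` and `primary_domain`, and the scope labels "DEFAULT", "HOST_ONLY", and "SUBDOMAINS", are hypothetical names chosen for this illustration. The primary-domain logic is a naive assumption (the last two host labels) that does not handle suffixes such as "co.uk". This is not the Web Crawler's actual implementation.

```python
from urllib.parse import urlsplit


def primary_domain(host: str) -> str:
    """Naive primary domain: the last two labels of the host (assumption)."""
    return ".".join(host.lower().split(".")[-2:])


def in_scope(seed_url: str, candidate_url: str, scope: str = "DEFAULT") -> bool:
    """Decide whether candidate_url falls inside the scope of seed_url."""
    seed, cand = urlsplit(seed_url), urlsplit(candidate_url)
    seed_host = (seed.hostname or "").lower()
    cand_host = (cand.hostname or "").lower()
    if not seed_host or not cand_host:
        return False
    if scope == "SUBDOMAINS":
        # Same primary domain, any subdomain.
        return primary_domain(cand_host) == primary_domain(seed_host)
    if cand_host != seed_host:
        return False
    if scope == "HOST_ONLY":
        # Same host, any path.
        return True
    if scope == "DEFAULT":
        # Same host and same initial URL path; compare on whole path segments.
        seed_path = seed.path if seed.path.endswith("/") else seed.path + "/"
        cand_path = cand.path if cand.path.endswith("/") else cand.path + "/"
        return cand_path.startswith(seed_path)
    raise ValueError(f"unknown scope: {scope}")


seed = "https://aws.amazon.com/bedrock/"
print(in_scope(seed, "https://aws.amazon.com/bedrock/agents/"))
print(in_scope(seed, "https://aws.amazon.com/ec2/"))
print(in_scope(seed, "https://aws.amazon.com/ec2", "HOST_ONLY"))
print(in_scope(seed, "https://www.amazon.com", "SUBDOMAINS"))
```

The example URLs are the ones used in the option descriptions, so the results can be compared directly against the documented behavior.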
Note
Make sure that you are not crawling an excessive number of web pages. It's not recommended to crawl large websites, such as wikipedia.org, without filters or scope limits, because crawling large websites takes a very long time.
Supported file types are crawled regardless of scope, as long as no exclusion pattern matches the file type.
Inclusion/exclusion filters
You can include or exclude certain URLs in accordance with your scope. Supported file types are crawled regardless of scope, as long as no exclusion pattern matches the file type. If you specify an inclusion and an exclusion filter and both match a URL, the exclusion filter takes precedence and the web content isn't crawled.
Important
Problematic regular expression filter patterns, such as those that lead to catastrophic backtracking or that use lookahead, are rejected.
For example, the following regular expression filter pattern excludes URLs that end with ".pdf", such as PDF web page attachments: ".*\.pdf$"
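The precedence rule for filters can be illustrated with a short sketch. The function `should_crawl` is hypothetical, and two behaviors are assumptions not stated on this page: that a pattern must match the whole URL, and that an empty inclusion list means every in-scope URL is included. Only the rule that exclusion takes precedence when both filters match comes from the text above.

```python
import re
from typing import Sequence


def should_crawl(
    url: str,
    inclusion_filters: Sequence[str] = (),
    exclusion_filters: Sequence[str] = (),
) -> bool:
    """Apply inclusion and exclusion regex filters to a URL."""
    # Exclusion takes precedence, so check it first.
    if any(re.fullmatch(pattern, url) for pattern in exclusion_filters):
        return False
    # Assumption: with no inclusion filters, every URL is included.
    if not inclusion_filters:
        return True
    return any(re.fullmatch(pattern, url) for pattern in inclusion_filters)


exclude_pdf = [r".*\.pdf$"]
include_docs = [r".*/docs/.*"]
print(should_crawl("https://example.com/docs/guide.pdf", include_docs, exclude_pdf))
print(should_crawl("https://example.com/docs/guide.html", include_docs, exclude_pdf))
```

In the first call both filters match the URL, so the exclusion filter wins and the URL is skipped.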
URL access
You can use the Web Crawler to crawl the pages of websites that you are authorized to crawl.
When selecting websites to crawl, you must adhere to the Amazon Acceptable Use Policy.
The Web Crawler respects robots.txt in accordance with RFC 9309.
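Before adding a seed URL, you can check locally whether a site's robots.txt would block it. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The helper `allowed_by_robots`, the sample rules, and the "*" user agent are assumptions for illustration. Note that the standard library parser predates RFC 9309 and applies the first matching rule, so its result may differ from the Web Crawler's in edge cases.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content supplied as a string."""
    parser = RobotFileParser()
    # Parse the rules from text rather than fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(sample_rules, "https://example.com/bedrock/"))
print(allowed_by_robots(sample_rules, "https://example.com/private/page.html"))
```

To check a live site, download its robots.txt yourself and pass the text to the same function.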
Incremental syncing
Each time the Web Crawler runs, it retrieves content for all URLs that are reachable from the source URLs and that match the scope and filters. For incremental syncs after the first sync of all content, Amazon Bedrock updates your knowledge base with new and modified content, and removes old content that is no longer present. Occasionally, the crawler may not be able to tell whether content was removed from the website. In this case, it errs on the side of preserving old content in your knowledge base.
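The incremental behavior described above can be pictured as a comparison between two crawl snapshots. The following is a conceptual sketch only, and it is not how Amazon Bedrock implements syncing. The function `plan_incremental_sync`, the use of content hashes, and the `unverified` set (URLs whose removal could not be confirmed) are all assumptions made for illustration.

```python
from typing import AbstractSet, Mapping


def plan_incremental_sync(
    previous: Mapping[str, str],
    current: Mapping[str, str],
    unverified: AbstractSet[str] = frozenset(),
) -> dict[str, list[str]]:
    """Compare two snapshots that map each URL to a content hash."""
    added = sorted(url for url in current if url not in previous)
    modified = sorted(
        url for url in current if url in previous and current[url] != previous[url]
    )
    # Preserve old content when its removal could not be confirmed.
    removed = sorted(
        url for url in previous if url not in current and url not in unverified
    )
    return {"add": added, "update": modified, "remove": removed}


previous_crawl = {"/a": "h1", "/b": "h2", "/c": "h3", "/d": "h4"}
current_crawl = {"/a": "h1", "/b": "h2-changed", "/e": "h5"}
plan = plan_incremental_sync(previous_crawl, current_crawl, unverified={"/d"})
print(plan)
```

Here "/c" is removed because it is confirmed gone, while "/d" is kept because its removal could not be verified.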
To sync your data source with your knowledge base, use the StartIngestionJob API or select your knowledge base in the console and select Sync within the data source overview section.
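The following is a minimal sketch of starting a sync through the StartIngestionJob API with the AWS SDK for Python (boto3), assuming boto3 is installed and credentials are configured. The helper names `start_sync` and `sync_with_boto3` are hypothetical, and the knowledge base and data source IDs are placeholders you must supply. The client is passed in as a parameter so that the call can be exercised without AWS access.

```python
from typing import Any


def start_sync(client: Any, knowledge_base_id: str, data_source_id: str) -> tuple[str, str]:
    """Start an ingestion job and return its ID and initial status."""
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]


def sync_with_boto3(knowledge_base_id: str, data_source_id: str) -> tuple[str, str]:
    """Convenience wrapper that creates a real client; requires AWS credentials."""
    import boto3  # third-party dependency, imported lazily

    client = boto3.client("bedrock-agent")
    return start_sync(client, knowledge_base_id, data_source_id)
```

Calling `sync_with_boto3` with your own knowledge base ID and data source ID starts the job. You can then follow its progress in the console or in the CloudWatch logs mentioned in the prerequisites.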
Important
All data that you sync from your data source becomes available to anyone with bedrock:Retrieve permissions to retrieve the data. This can also include any data with controlled data source permissions. For more information, see Knowledge base permissions.