Use the AWS::QBusiness::DataSource resource to connect a data source to your Amazon Q application. Use the configuration property to provide a JSON or YAML schema with the configuration details specific to your data source connector.
To learn more about AWS CloudFormation, see What is AWS CloudFormation? in the AWS CloudFormation User Guide.
Web Crawler configuration properties
The following table describes the important configuration properties required in the schema.
| Configuration | Description | Type | Required |
|---|---|---|---|
| type | The type of data source. | string. The only allowed values are WEBCRAWLERV2 and WEBCRAWLER. | Yes |
| syncMode | Specify whether Amazon Q should update your index by syncing all documents or only new, modified, and deleted documents. | string. You can choose between the following options: FORCED_FULL_CRAWL and FULL_CRAWL. | Yes |
| connectionConfiguration | Configuration information for the endpoint for the data source. | object. This property has the sub-property repositoryEndpointMetadata. | Yes |
| repositoryEndpointMetadata | The endpoint information for the data source. This is a sub-property of connectionConfiguration. | object. This property has the following sub-properties: siteMapUrls, s3SeedUrl, s3SiteMapUrl, seedUrlConnections, and authentication. | Yes |
| authentication | The authentication type if your websites require the same authentication. This is a sub-property of repositoryEndpointMetadata. | string. Depending on your connection type, the allowed values are NoAuthentication, BasicAuth, NTLM_Kerberos, Form, and SAML. | Yes |
| seedUrlConnections | The list of seed or starting point URLs for the websites that you want to crawl. You can list up to 100 seed URLs. This is a sub-property of repositoryEndpointMetadata. | array. This is an array of objects, each with a seedUrl. | No |
| seedUrl | The seed or starting point URL for the websites that you want to crawl. This is a sub-property of seedUrlConnections. | string. Use the pattern https://.* | Yes |
| s3SeedUrl | The S3 path to the text file that stores the list of seed or starting point URLs. This is a sub-property of repositoryEndpointMetadata. | string. Use the pattern s3:.* | No |
| siteMapUrls | The list of sitemap URLs for the websites that you want to crawl. This is a sub-property of repositoryEndpointMetadata. | array. This is an array of strings. | No |
| s3SiteMapUrl | The S3 path to the sitemap XML files. This is a sub-property of repositoryEndpointMetadata. | string. Use the pattern s3://bucket-name/directory/. You can list up to three sitemap XML files. You can bundle multiple sitemap files into a .zip file and store the .zip file in your Amazon S3 bucket. | No |
| repositoryConfigurations | Configuration information for the content of the data source. For example, configuring specific types of content and field mappings. | object. This property has the following sub-properties: webPage and attachment. | Yes |
| webPage, attachment | A list of objects that map the attributes or field names of your webpages and attachments to Amazon Q index field names. | object. These properties have the sub-property fieldMappings, whose entries have the following sub-properties: indexFieldName, indexFieldType, dataSourceFieldName, and dateFieldFormat. | No |
| indexFieldName | The name of the index field. This is a sub-property of fieldMappings. | string | Yes |
| indexFieldType | The type of the index field. This is a sub-property of fieldMappings. | string. The only allowed values are STRING, DATE, and LONG. | Yes |
| dataSourceFieldName | The field name of the data source. This is a sub-property of fieldMappings. | string | Yes |
| dateFieldFormat | The date format of the data source field. This is a sub-property of fieldMappings. | string. Use the pattern yyyy-MM-dd'T'HH:mm:ss'Z' | No |
| additionalProperties | Additional configuration options for your content in your data source. | object. This property has the sub-properties described in the following rows. | Yes |
| rateLimit | The maximum number of URLs crawled per website host per minute. This is a sub-property of additionalProperties. | string. The default value is 300. | Yes |
| maxFileSize | The maximum size (in MB) of a webpage or attachment to crawl. This is a sub-property of additionalProperties. | string. The default value is 50. | Yes |
| crawlDepth | The number of levels from the seed URL to crawl. This is a sub-property of additionalProperties. | string. The default value is 2. | Yes |
| maxLinksPerUrl | The maximum number of URLs on a webpage to include when crawling a website. This number is per webpage. As a website's webpages are crawled, any URLs that the webpages link to are also crawled. URLs on a webpage are crawled in order of appearance. This is a sub-property of additionalProperties. | string. The default value is 100. | Yes |
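|  | Note: The … |  | Yes |
| crawlSubDomain | Crawl the subdomains of the website domains. For example, if the seed URL is "abc.example.com", then "a.abc.example.com" and "b.abc.example.com" are also crawled. This is a sub-property of additionalProperties. | boolean, or the string true or false. The default value is false. | Yes |
| crawlAllDomain | Crawl the website domains with subdomains and other domains the web pages link to. This is a sub-property of additionalProperties. | boolean, or the string true or false. The default value is false. | Yes |
|  |  |  | Yes |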
| maxFileSizeInMegaBytes | Specify the maximum single file size limit in MBs that Amazon Q will crawl. Amazon Q will crawl only the files within the size limit you define. This is a sub-property of additionalProperties. | string | No |
| inclusionURLCrawlPatterns, inclusionURLIndexPatterns | These are sub-properties of additionalProperties. | array of strings | No |
| exclusionURLCrawlPatterns, exclusionURLIndexPatterns | These are sub-properties of additionalProperties. | array of strings | No |
| inclusionFileIndexPatterns | This is a sub-property of additionalProperties. | array of strings | No |
| exclusionFileIndexPatterns | This is a sub-property of additionalProperties. | array of strings | No |
| proxy | This is a sub-property of additionalProperties. | object. This property has the sub-properties host, port, and secretArn. | No |
| host | This is a sub-property of proxy. For example, the host name of https://a.example.com/page1.html is a.example.com. | string | No |
| port | This is a sub-property of proxy. For example, the port 443 would be "443". | string | No |
| secretArn | This is a sub-property of proxy. | string. The minimum length is 20 and the maximum length is 2,048 characters. | No |
| secretArn | The Amazon Resource Name (ARN) of an AWS Secrets Manager secret that's used if your websites require authentication to access the websites. You store the authentication credentials for the website in the secret that contains JSON key-value pairs. If you use basic or NTLM/Kerberos authentication, enter the username and password. If you use SAML or form authentication, enter the username and password, the XPath for the username field (and the username button if using SAML), the XPaths for the password field and button, and the login page URL. Amazon Q also checks if the endpoint information (seed URLs) included in the secret is the same as the endpoint information specified in your data source endpoint configuration details. | string. The minimum length is 20 and the maximum length is 2,048 characters. | No |
| version | The version of this template that's currently supported. | string. The default value is 1.0.0. | No |
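
The examples later on this page set only the required additionalProperties. As a rough sketch of the optional ones defined in the schema, the fragment below adds attachment crawling, a file size limit, URL and file patterns, and a proxy. Every value here is a hypothetical placeholder, including the proxy host, the secret ARN, and the patterns. Reading the pattern entries as regular expressions is an assumption, since the schema only types them as arrays of strings.

```yaml
# Hypothetical fragment of Configuration.additionalProperties.
# Property names come from the schema on this page; all values are placeholders.
additionalProperties:
  # Properties listed as required by the schema
  rateLimit: "300"
  maxFileSize: "50"
  crawlDepth: "2"
  maxLinksPerUrl: "100"
  crawlSubDomain: "true"
  crawlAllDomain: "false"
  honorRobots: "true"
  # Optional properties (values are assumptions for illustration)
  crawlAttachments: "true"
  maxFileSizeInMegaBytes: "50"
  inclusionURLCrawlPatterns:
    - 'https://example\.com/docs/.*'   # assumed to be a regular expression
  exclusionURLCrawlPatterns:
    - '.*/archive/.*'
  inclusionFileIndexPatterns:
    - '.*\.pdf'
  proxy:
    host: "a.example.com"              # host name only, as in the table above
    port: "443"                        # the schema types port as a string
    secretArn: "arn:aws:secretsmanager:us-west-2:123456789012:secret:my-proxy-secret"
```

The numeric settings are written as quoted strings because the schema declares them with type string.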
Web Crawler JSON schema for using the configuration property with AWS CloudFormation
The following are the Web Crawler JSON schema and an example for the configuration property for AWS CloudFormation.
Web Crawler JSON schema for using the configuration property with AWS CloudFormation
The following is the Web Crawler JSON schema for the configuration property for AWS CloudFormation:
{
  "type": "object",
  "properties": {
    "type": {
      "type": "string",
      "enum": ["WEBCRAWLERV2", "WEBCRAWLER"]
    },
    "syncMode": {
      "type": "string",
      "enum": ["FORCED_FULL_CRAWL", "FULL_CRAWL"]
    },
    "secretArn": {
      "type": "string",
      "minLength": 20,
      "maxLength": 2048
    },
    "connectionConfiguration": {
      "type": "object",
      "properties": {
        "repositoryEndpointMetadata": {
          "type": "object",
          "properties": {
            "siteMapUrls": {
              "type": "array",
              "items": {
                "type": "string",
                "pattern": "https://.*"
              }
            },
            "s3SeedUrl": {
              "type": ["string", "null"],
              "pattern": "s3:.*"
            },
            "s3SiteMapUrl": {
              "type": ["string", "null"],
              "pattern": "s3:.*"
            },
            "seedUrlConnections": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "seedUrl": {
                      "type": "string",
                      "pattern": "https://.*"
                    }
                  },
                  "required": ["seedUrl"]
                }
              ]
            },
            "authentication": {
              "type": "string",
              "enum": [
                "NoAuthentication",
                "BasicAuth",
                "NTLM_Kerberos",
                "Form",
                "SAML"
              ]
            }
          }
        }
      },
      "required": ["repositoryEndpointMetadata"]
    },
    "repositoryConfigurations": {
      "type": "object",
      "properties": {
        "webPage": {
          "type": "object",
          "properties": {
            "fieldMappings": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "indexFieldName": {
                      "type": "string"
                    },
                    "indexFieldType": {
                      "type": "string",
                      "enum": ["STRING", "DATE", "LONG"]
                    },
                    "dataSourceFieldName": {
                      "type": "string"
                    },
                    "dateFieldFormat": {
                      "type": "string",
                      "pattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                    }
                  },
                  "required": [
                    "indexFieldName",
                    "indexFieldType",
                    "dataSourceFieldName"
                  ]
                }
              ]
            }
          },
          "required": ["fieldMappings"]
        },
        "attachment": {
          "type": "object",
          "properties": {
            "fieldMappings": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "indexFieldName": {
                      "type": "string"
                    },
                    "indexFieldType": {
                      "type": "string",
                      "enum": ["STRING", "DATE", "LONG"]
                    },
                    "dataSourceFieldName": {
                      "type": "string"
                    },
                    "dateFieldFormat": {
                      "type": "string",
                      "pattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                    }
                  },
                  "required": [
                    "indexFieldName",
                    "indexFieldType",
                    "dataSourceFieldName"
                  ]
                }
              ]
            }
          },
          "required": ["fieldMappings"]
        }
      }
    },
    "additionalProperties": {
      "type": "object",
      "properties": {
        "rateLimit": {
          "type": "string",
          "default": "300"
        },
        "maxFileSize": {
          "type": "string",
          "default": "50"
        },
        "maxFileSizeInMegaBytes": {
          "type": "string"
        },
        "crawlDepth": {
          "type": "string",
          "default": "2"
        },
        "maxLinksPerUrl": {
          "type": "string",
          "default": "100"
        },
        "crawlSubDomain": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "crawlAllDomain": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "honorRobots": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "crawlAttachments": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "inclusionURLCrawlPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "exclusionURLCrawlPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "inclusionURLIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "exclusionURLIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "inclusionFileIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "exclusionFileIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "proxy": {
          "type": "object",
          "properties": {
            "host": {
              "type": "string"
            },
            "port": {
              "type": "string"
            },
            "secretArn": {
              "type": "string",
              "minLength": 20,
              "maxLength": 2048
            }
          }
        },
        "enableDeletionProtection": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "deletionProtectionThreshold": {
          "type": "string",
          "default": "15"
        }
      },
      "required": [
        "rateLimit",
        "maxFileSize",
        "crawlDepth",
        "crawlSubDomain",
        "crawlAllDomain",
        "maxLinksPerUrl",
        "honorRobots"
      ]
    }
  },
  "version": {
    "type": "string",
    "anyOf": [
      {
        "pattern": "1.0.0"
      }
    ]
  },
  "required": [
    "type",
    "syncMode",
    "connectionConfiguration",
    "repositoryConfigurations",
    "additionalProperties"
  ]
}
Web Crawler JSON schema example for using the configuration property with AWS CloudFormation
The following is a Web Crawler JSON example for the configuration property for AWS CloudFormation:
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "CloudFormation Web Crawler Data Source Template",
  "Resources": {
    "DataSourceWebCrawler": {
      "Type": "AWS::QBusiness::DataSource",
      "Properties": {
        "ApplicationId": "app12345-1234-1234-1234-123456789012",
        "IndexId": "indx1234-1234-1234-1234-123456789012",
        "DisplayName": "MyWebCrawlerDataSource",
        "RoleArn": "arn:aws:iam::123456789012:role/qbusiness-data-source-role",
        "Configuration": {
          "type": "WEBCRAWLERV2",
          "syncMode": "FULL_CRAWL",
          "secretArn": "arn:aws:secretsmanager:us-west-2:123456789012:secret:my-webcrawler-secret",
          "connectionConfiguration": {
            "repositoryEndpointMetadata": {
              "siteMapUrls": ["https://example.com/sitemap.xml"],
              "s3SeedUrl": "s3://bucket/seed-url",
              "s3SiteMapUrl": "s3://bucket/sitemap-url",
              "seedUrlConnections": [
                {
                  "seedUrl": "https://example.com"
                }
              ],
              "authentication": "BasicAuth"
            }
          },
          "repositoryConfigurations": {
            "webPage": {
              "fieldMappings": [
                {
                  "indexFieldName": "title",
                  "indexFieldType": "STRING",
                  "dataSourceFieldName": "page_title",
                  "dateFieldFormat": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                }
              ]
            },
            "attachment": {
              "fieldMappings": [
                {
                  "indexFieldName": "attachment_title",
                  "indexFieldType": "STRING",
                  "dataSourceFieldName": "attachment_name",
                  "dateFieldFormat": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                }
              ]
            }
          },
          "additionalProperties": {
            "rateLimit": "300",
            "maxFileSize": "50",
            "crawlDepth": "2",
            "maxLinksPerUrl": "100",
            "crawlSubDomain": "true",
            "crawlAllDomain": "true",
            "honorRobots": "true"
          }
        }
      }
    }
  }
}
Web Crawler YAML schema for using the configuration property with AWS CloudFormation
The following are the Web Crawler YAML schema and an example for the configuration property for AWS CloudFormation.
Web Crawler YAML schema for using the configuration property with AWS CloudFormation
The following is the Web Crawler YAML schema for the configuration property for AWS CloudFormation:
type: object
properties:
  type:
    type: string
    enum:
      - WEBCRAWLERV2
      - WEBCRAWLER
  syncMode:
    type: string
    enum:
      - FORCED_FULL_CRAWL
      - FULL_CRAWL
  secretArn:
    type: string
    minLength: 20
    maxLength: 2048
  connectionConfiguration:
    type: object
    properties:
      repositoryEndpointMetadata:
        type: object
        properties:
          siteMapUrls:
            type: array
            items:
              type: string
              pattern: https://.*
          s3SeedUrl:
            type:
              - string
              - "null"
            pattern: s3:.*
          s3SiteMapUrl:
            type:
              - string
              - "null"
            pattern: s3:.*
          seedUrlConnections:
            type: array
            items:
              type: object
              properties:
                seedUrl:
                  type: string
                  pattern: https://.*
              required:
                - seedUrl
          authentication:
            type: string
            enum:
              - NoAuthentication
              - BasicAuth
              - NTLM_Kerberos
              - Form
              - SAML
    required:
      - repositoryEndpointMetadata
  repositoryConfigurations:
    type: object
    properties:
      webPage:
        type: object
        properties:
          fieldMappings:
            type: array
            items:
              type: object
              properties:
                indexFieldName:
                  type: string
                indexFieldType:
                  type: string
                  enum:
                    - STRING
                    - DATE
                    - LONG
                dataSourceFieldName:
                  type: string
                dateFieldFormat:
                  type: string
                  pattern: yyyy-MM-dd'T'HH:mm:ss'Z'
              required:
                - indexFieldName
                - indexFieldType
                - dataSourceFieldName
        required:
          - fieldMappings
      attachment:
        type: object
        properties:
          fieldMappings:
            type: array
            items:
              type: object
              properties:
                indexFieldName:
                  type: string
                indexFieldType:
                  type: string
                  enum:
                    - STRING
                    - DATE
                    - LONG
                dataSourceFieldName:
                  type: string
                dateFieldFormat:
                  type: string
                  pattern: yyyy-MM-dd'T'HH:mm:ss'Z'
              required:
                - indexFieldName
                - indexFieldType
                - dataSourceFieldName
        required:
          - fieldMappings
  additionalProperties:
    type: object
    properties:
      rateLimit:
        type: string
        default: "300"
      maxFileSize:
        type: string
        default: "50"
      maxFileSizeInMegaBytes:
        type: string
      crawlDepth:
        type: string
        default: "2"
      maxLinksPerUrl:
        type: string
        default: "100"
      crawlSubDomain:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      crawlAllDomain:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      honorRobots:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      crawlAttachments:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      inclusionURLCrawlPatterns:
        type: array
        items:
          type: string
      exclusionURLCrawlPatterns:
        type: array
        items:
          type: string
      inclusionURLIndexPatterns:
        type: array
        items:
          type: string
      exclusionURLIndexPatterns:
        type: array
        items:
          type: string
      inclusionFileIndexPatterns:
        type: array
        items:
          type: string
      exclusionFileIndexPatterns:
        type: array
        items:
          type: string
      proxy:
        type: object
        properties:
          host:
            type: string
          port:
            type: string
          secretArn:
            type: string
            minLength: 20
            maxLength: 2048
      enableDeletionProtection:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      deletionProtectionThreshold:
        type: string
        default: "15"
    required:
      - rateLimit
      - maxFileSize
      - crawlDepth
      - crawlSubDomain
      - crawlAllDomain
      - maxLinksPerUrl
      - honorRobots
version:
  type: string
  anyOf:
    - pattern: 1.0.0
required:
  - type
  - syncMode
  - connectionConfiguration
  - repositoryConfigurations
  - additionalProperties
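
The schema accepts DATE as an indexFieldType and pairs it with the optional dateFieldFormat property. As a minimal sketch of such a mapping, the fragment below maps one date field. The field names last_updated and page_last_modified are hypothetical, and it is an assumption that your index defines a matching date field.

```yaml
# Hypothetical fieldMappings entry for a date field.
# Property names and the date pattern come from the schema above;
# the two field names are placeholders, not names defined by Amazon Q.
repositoryConfigurations:
  webPage:
    fieldMappings:
      - indexFieldName: last_updated              # placeholder index field
        indexFieldType: DATE
        dataSourceFieldName: page_last_modified   # placeholder source field
        dateFieldFormat: "yyyy-MM-dd'T'HH:mm:ss'Z'"
```

The format string is double-quoted here so that the single quotes around T and Z are kept literally by any YAML parser.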
Web Crawler YAML schema example for using the configuration property with AWS CloudFormation
The following is the Web Crawler YAML example for the Configuration property for AWS CloudFormation:
AWSTemplateFormatVersion: "2010-09-09"
Description: CloudFormation Web Crawler Data Source Template
Resources:
  DataSourceWebCrawler:
    Type: AWS::QBusiness::DataSource
    Properties:
      ApplicationId: app12345-1234-1234-1234-123456789012
      IndexId: indx1234-1234-1234-1234-123456789012
      DisplayName: MyWebCrawlerDataSource
      RoleArn: arn:aws:iam::123456789012:role/qbusiness-data-source-role
      Configuration:
        type: WEBCRAWLERV2
        syncMode: FULL_CRAWL
        secretArn: arn:aws:secretsmanager:us-west-2:123456789012:secret:my-webcrawler-secret
        connectionConfiguration:
          repositoryEndpointMetadata:
            siteMapUrls:
              - https://example.com/sitemap.xml
            s3SeedUrl: s3://bucket/seed-url
            s3SiteMapUrl: s3://bucket/sitemap-url
            seedUrlConnections:
              - seedUrl: https://example.com
            authentication: BasicAuth
        repositoryConfigurations:
          webPage:
            fieldMappings:
              - indexFieldName: title
                indexFieldType: STRING
                dataSourceFieldName: page_title
                dateFieldFormat: yyyy-MM-dd'T'HH:mm:ss'Z'
          attachment:
            fieldMappings:
              - indexFieldName: attachment_title
                indexFieldType: STRING
                dataSourceFieldName: attachment_name
                dateFieldFormat: yyyy-MM-dd'T'HH:mm:ss'Z'
        additionalProperties:
          rateLimit: "300"
          maxFileSize: "50"
          crawlDepth: "2"
          maxLinksPerUrl: "100"
          crawlSubDomain: "true"
          crawlAllDomain: "true"
          honorRobots: "true"
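
The example above uses BasicAuth together with a secretArn. Because the schema does not list secretArn as required and allows NoAuthentication, a public website can plausibly be configured without a secret. The following is a minimal sketch of that variant, not a tested template. The seed URL and field names are the same placeholders used above, and leaving out the attachment mappings and sitemap properties is an assumption based on the schema's required lists.

```yaml
# Hypothetical Configuration block for a public website (no credentials).
# Assumption: secretArn can be omitted when authentication is NoAuthentication,
# because the schema does not list it as required.
Configuration:
  type: WEBCRAWLERV2
  syncMode: FULL_CRAWL
  connectionConfiguration:
    repositoryEndpointMetadata:
      seedUrlConnections:
        - seedUrl: https://example.com    # placeholder seed URL
      authentication: NoAuthentication
  repositoryConfigurations:
    webPage:
      fieldMappings:
        - indexFieldName: title
          indexFieldType: STRING
          dataSourceFieldName: page_title
  additionalProperties:
    # The seven properties the schema lists as required
    rateLimit: "300"
    maxFileSize: "50"
    crawlDepth: "2"
    maxLinksPerUrl: "100"
    crawlSubDomain: "false"
    crawlAllDomain: "false"
    honorRobots: "true"
```

This block would replace the Configuration property in the template above; ApplicationId, IndexId, DisplayName, and RoleArn stay as they are.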