create-data-source

Description

Connects a knowledge base to a data source. You specify the configuration for the specific data source service in the dataSourceConfiguration field.

Warning

You can’t change the chunkingConfiguration after you create the data source connector.

Synopsis

  create-data-source
--knowledge-base-id <value>
[--client-token <value>]
--name <value>
[--description <value>]
--data-source-configuration <value>
[--data-deletion-policy <value>]
[--server-side-encryption-configuration <value>]
[--vector-ingestion-configuration <value>]
[--cli-input-json | --cli-input-yaml]
[--generate-cli-skeleton <value>]
[--debug]
[--endpoint-url <value>]
[--no-verify-ssl]
[--no-paginate]
[--output <value>]
[--query <value>]
[--profile <value>]
[--region <value>]
[--version <value>]
[--color <value>]
[--no-sign-request]
[--ca-bundle <value>]
[--cli-read-timeout <value>]
[--cli-connect-timeout <value>]
[--cli-binary-format <value>]
[--no-cli-pager]
[--cli-auto-prompt]
[--no-cli-auto-prompt]
[--cli-error-format <value>]

Options

--knowledge-base-id (string) [required]

The unique identifier of the knowledge base to which to add the data source.

Constraints:

pattern: [0-9a-zA-Z]{10}

--client-token (string)

A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, Amazon Bedrock ignores the request, but does not return an error. For more information, see Ensuring idempotency.

Constraints:

min: 33

max: 256

pattern: [a-zA-Z0-9](-*[a-zA-Z0-9]){0,256}
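One way to produce a token that fits these constraints is to use a random UUID, whose string form is 36 characters of hexadecimal digits and hyphens. The following Python sketch checks a candidate against the length limits and pattern listed above. The helper names `is_valid_client_token` and `new_client_token` are hypothetical and not part of the AWS CLI, and treating the pattern as a whole-string match is an assumption.

```python
import re
import uuid

# Pattern and length limits copied from the constraints above.
CLIENT_TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,256}")
MIN_LEN, MAX_LEN = 33, 256


def is_valid_client_token(token: str) -> bool:
    """Check a candidate token against the documented length and pattern.

    Assumption: the pattern must match the whole string.
    """
    return (
        MIN_LEN <= len(token) <= MAX_LEN
        and CLIENT_TOKEN_PATTERN.fullmatch(token) is not None
    )


def new_client_token() -> str:
    """Hypothetical helper: return a random UUID string (36 characters)."""
    return str(uuid.uuid4())


token = new_client_token()
print(token, is_valid_client_token(token))
```

Reusing the same token when you retry a request is what lets the service recognize the retry as a duplicate, so generate the token once per logical request rather than once per attempt.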

--name (string) [required]

The name of the data source.

Constraints:

pattern: ([0-9a-zA-Z][_-]?){1,100}
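The name pattern allows letters and digits, each optionally followed by a single underscore or hyphen, so two separators in a row or a leading separator do not match. A minimal Python sketch of a local check follows. The function name `is_valid_name` is hypothetical, and whole-string matching is assumed.

```python
import re

# Pattern copied from the constraint above.
NAME_PATTERN = re.compile(r"([0-9a-zA-Z][_-]?){1,100}")


def is_valid_name(name: str) -> bool:
    """Return True if the whole name matches the documented pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("my-docs_source", "my--docs", "-docs"):
    print(candidate, is_valid_name(candidate))
```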

--description (string)

A description of the data source.

Constraints:

min: 1

max: 200

--data-source-configuration (structure) [required]

The connection configuration for the data source.

type -> (string) [required]

The type of data source.

Possible values:

S3

WEB

CONFLUENCE

SALESFORCE

SHAREPOINT

CUSTOM

REDSHIFT_METADATA

MANAGED_KNOWLEDGE_BASE_CONNECTOR

managedKnowledgeBaseConnectorConfiguration -> (structure)

Contains the configuration for a data source that connects a managed knowledge base to a supported data source connector. Specify this object when the data source type is MANAGED_KNOWLEDGE_BASE_CONNECTOR.

deletionProtectionConfiguration -> (structure)

A safeguard against accidental bulk deletion of indexed content.

deletionProtectionStatus -> (string) [required]

Enable or disable deletion protection for the connector.

Possible values:

ENABLED

DISABLED

deletionProtectionThreshold -> (integer)

The maximum percentage of documents that a sync job can delete from your index. If a sync would delete more than this percentage, the sync skips its delete phase and leaves your indexed documents in place. Not supported for the Custom connector.

Constraints:

min: 0

max: 100
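The threshold rule described above can be illustrated with a short Python sketch. This is not the service's implementation. The function name `skips_delete_phase` is hypothetical, and how the service counts documents or rounds the percentage is an assumption.

```python
def skips_delete_phase(docs_to_delete: int, docs_indexed: int, threshold_percent: int) -> bool:
    """Return True when a sync would delete more than the threshold percentage.

    Assumption: the percentage is docs_to_delete relative to docs_indexed,
    compared without rounding.
    """
    if not 0 <= threshold_percent <= 100:
        raise ValueError("deletionProtectionThreshold must be between 0 and 100")
    if docs_indexed == 0:
        return False  # nothing indexed, so nothing can be deleted
    percent_deleted = 100 * docs_to_delete / docs_indexed
    return percent_deleted > threshold_percent


print(skips_delete_phase(300, 1000, 25))  # 30% is more than 25%
print(skips_delete_phase(250, 1000, 25))  # 25% is not more than 25%
```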

mediaExtractionConfiguration -> (structure)

Configuration for extracting media (images, audio, video) from data source files.

imageExtractionConfiguration -> (structure)

Configuration for image extraction.

imageExtractionStatus -> (string) [required]

Whether image extraction is enabled or disabled.

Possible values:

ENABLED

DISABLED

audioExtractionConfiguration -> (structure)

Configuration for audio extraction.

audioExtractionStatus -> (string) [required]

Whether audio extraction is enabled or disabled.

Possible values:

ENABLED

DISABLED

videoExtractionConfiguration -> (structure)

Configuration for video extraction.

videoExtractionStatus -> (string) [required]

Whether video extraction is enabled or disabled.

Possible values:

ENABLED

DISABLED

connectorParameters -> (document)

Connector-specific parameters. For more information, see Connect a data source.

s3Configuration -> (structure)

The configuration information to connect to Amazon S3 as your data source for self-managed knowledge bases. To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration.

bucketArn -> (string) [required]

The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

Constraints:

min: 1

max: 2048

pattern: arn:aws(|-cn|-us-gov):s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]

inclusionPrefixes -> (list)

A list of S3 prefixes that select the files or content to include. For more information, see Organizing objects using prefixes.

Constraints:

min: 1

max: 1

(string)

Constraints:

min: 1

max: 300

bucketOwnerAccountId -> (string)

The account ID for the owner of the S3 bucket.

Constraints:

min: 12

max: 12

pattern: [0-9]{12}
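As a sketch of how the s3Configuration fields fit together, the following Python builds a value for --data-source-configuration and checks it against the constraints listed above. The function name `s3_data_source_configuration`, the bucket name, and the prefix are placeholders chosen for illustration. Whole-string pattern matching is assumed.

```python
import json
import re

# Patterns copied from the constraints above.
BUCKET_ARN_PATTERN = re.compile(
    r"arn:aws(|-cn|-us-gov):s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]"
)
ACCOUNT_ID_PATTERN = re.compile(r"[0-9]{12}")


def s3_data_source_configuration(bucket_arn, inclusion_prefix=None, bucket_owner_account_id=None):
    """Build a dataSourceConfiguration dict for an S3 data source."""
    if not BUCKET_ARN_PATTERN.fullmatch(bucket_arn):
        raise ValueError(f"bucketArn does not match the documented pattern: {bucket_arn}")
    s3 = {"bucketArn": bucket_arn}
    if inclusion_prefix is not None:
        if not 1 <= len(inclusion_prefix) <= 300:
            raise ValueError("each inclusion prefix must be 1 to 300 characters")
        # The list constraint is min 1, max 1, so it holds a single prefix.
        s3["inclusionPrefixes"] = [inclusion_prefix]
    if bucket_owner_account_id is not None:
        if not ACCOUNT_ID_PATTERN.fullmatch(bucket_owner_account_id):
            raise ValueError("bucketOwnerAccountId must be exactly 12 digits")
        s3["bucketOwnerAccountId"] = bucket_owner_account_id
    return {"type": "S3", "s3Configuration": s3}


# Placeholder bucket and prefix, for illustration only.
config = s3_data_source_configuration("arn:aws:s3:::amzn-s3-demo-bucket", "reports/")
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and passed to --data-source-configuration with the file:// prefix.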

webConfiguration -> (structure)

The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

Note
To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration. The Web Crawler data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The source configuration details for the web data source.

urlConfiguration -> (structure) [required]

The configuration of the seed URL or URLs.

seedUrls -> (list)

One or more seed or starting point URLs.

Constraints:

min: 1

max: 100

(structure)

The seed or starting point URL. You should be authorized to crawl the URL.

url -> (string)

A seed or starting point URL.

Constraints:

pattern: https?://[A-Za-z0-9][^\s]*

crawlerConfiguration -> (structure)

The Web Crawler configuration details for the web data source.

crawlerLimits -> (structure)

The configuration of crawl limits for the web URLs.

rateLimit -> (integer)

The max rate at which pages are crawled, up to 300 per minute per host.

Constraints:

min: 1

max: 300

maxPages -> (integer)

The max number of web pages crawled from your source URLs, up to 25,000 pages. If the web pages exceed this limit, the data source sync will fail and no web pages will be ingested.

Constraints:

min: 1

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000
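The precedence rule for inclusionFilters and exclusionFilters can be sketched in Python as follows. This is an illustration, not the crawler's implementation. The function name `is_crawled` is hypothetical, and two behaviors are assumptions not stated above: that a pattern may match anywhere in the URL, and that a URL is included when no inclusion filters are given.

```python
import re


def is_crawled(url, inclusion_filters=(), exclusion_filters=()):
    """Decide whether a URL passes the filters, with exclusion taking precedence."""
    # Exclusion wins whenever both an inclusion and an exclusion pattern match.
    if any(re.search(pattern, url) for pattern in exclusion_filters):
        return False
    # Assumption: with no inclusion filters, every non-excluded URL is crawled.
    if not inclusion_filters:
        return True
    return any(re.search(pattern, url) for pattern in inclusion_filters)


inclusion = [r"/userguide/"]
exclusion = [r"\.pdf$"]
base = "https://docs.aws.amazon.com/bedrock/latest"

print(is_crawled(f"{base}/userguide/intro.html", inclusion, exclusion))
print(is_crawled(f"{base}/userguide/intro.pdf", inclusion, exclusion))
print(is_crawled(f"{base}/APIReference/welcome.html", inclusion, exclusion))
```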

scope -> (string)

The scope of what is crawled for your URLs.

You can choose to crawl only web pages that belong to the same host or primary domain. For example, only web pages that contain the seed URL “https://docs.aws.amazon.com/bedrock/latest/userguide/” and no other domains. You can choose to include sub domains in addition to the host or primary domain. For example, web pages that contain “aws.amazon.com” can also include sub domain “docs.aws.amazon.com”.

Possible values:

HOST_ONLY

SUBDOMAINS

userAgent -> (string)

The user agent suffix for your web crawler.

Constraints:

min: 15

max: 40

userAgentHeader -> (string)

A string used for identifying the crawler or bot when it accesses a web server. The user agent header value consists of bedrockbot, a UUID, and a user agent suffix for your crawler (if one is provided). By default, it is set to bedrockbot_UUID. You can optionally append a custom suffix to bedrockbot_UUID to allowlist a specific user agent permitted to access your source URLs.

Constraints:

min: 61

max: 86

confluenceConfiguration -> (structure)

The configuration information to connect to Confluence as your data source for self-managed knowledge bases.

Note
To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration. The Confluence data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The endpoint information to connect to your Confluence data source.

hostUrl -> (string) [required]

The Confluence host URL or instance URL.

Constraints:

pattern: https://[A-Za-z0-9][^\s]*

hostType -> (string) [required]

The supported host type, whether online/cloud or server/on-premises.

Possible values:

SAAS

authType -> (string) [required]

The supported authentication type to authenticate and connect to your Confluence instance.

Possible values:

BASIC

OAUTH2_CLIENT_CREDENTIALS

credentialsSecretArn -> (string) [required]

The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration.

Constraints:

pattern: arn:aws(|-cn|-us-gov):secretsmanager:[a-z0-9-]{1,20}:([0-9]{12}|):secret:[a-zA-Z0-9!/_+=.@-]{1,512}

crawlerConfiguration -> (structure)

The configuration of the Confluence content. For example, configuring specific types of Confluence content.

filterConfiguration -> (structure)

The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

type -> (string) [required]

The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

Possible values:

PATTERN

patternObjectFilter -> (structure)

The configuration of filtering certain objects or content types of the data source.

filters -> (list) [required]

The configuration of specific filters applied to your data source content. You can filter out or include certain content.

Constraints:

min: 1

max: 25

(structure)

The specific filters applied to your data source content. You can filter out or include certain content.

objectType -> (string) [required]

The supported object type or content type of the data source.

Constraints:

min: 1

max: 50

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

salesforceConfiguration -> (structure)

The configuration information to connect to Salesforce as your data source.

Note
The Salesforce data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The endpoint information to connect to your Salesforce data source.

hostUrl -> (string) [required]

The Salesforce host URL or instance URL.

Constraints:

pattern: https://[A-Za-z0-9][^\s]*

authType -> (string) [required]

The supported authentication type to authenticate and connect to your Salesforce instance.

Possible values:

OAUTH2_CLIENT_CREDENTIALS

credentialsSecretArn -> (string) [required]

The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration.

Constraints:

pattern: arn:aws(|-cn|-us-gov):secretsmanager:[a-z0-9-]{1,20}:([0-9]{12}|):secret:[a-zA-Z0-9!/_+=.@-]{1,512}

crawlerConfiguration -> (structure)

The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

filterConfiguration -> (structure)

The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

type -> (string) [required]

The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

Possible values:

PATTERN

patternObjectFilter -> (structure)

The configuration of filtering certain objects or content types of the data source.

filters -> (list) [required]

The configuration of specific filters applied to your data source content. You can filter out or include certain content.

Constraints:

min: 1

max: 25

(structure)

The specific filters applied to your data source content. You can filter out or include certain content.

objectType -> (string) [required]

The supported object type or content type of the data source.

Constraints:

min: 1

max: 50

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

sharePointConfiguration -> (structure)

The configuration information to connect to SharePoint as your data source for self-managed knowledge bases.

Note
To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration. The SharePoint data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The endpoint information to connect to your SharePoint data source.

tenantId -> (string)

The identifier of your Microsoft 365 tenant.

Constraints:

min: 36

max: 36

pattern: [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

domain -> (string) [required]

The domain of your SharePoint instance or site URL/URLs.

Constraints:

min: 1

max: 50

siteUrls -> (list) [required]

A list of one or more SharePoint site URLs.

Constraints:

min: 1

max: 100

(string)

Constraints:

pattern: https://[A-Za-z0-9][^\s]*

hostType -> (string) [required]

The supported host type, whether online/cloud or server/on-premises.

Possible values:

ONLINE

authType -> (string) [required]

The supported authentication type to authenticate and connect to your SharePoint site/sites.

Possible values:

OAUTH2_CLIENT_CREDENTIALS

OAUTH2_SHAREPOINT_APP_ONLY_CLIENT_CREDENTIALS

credentialsSecretArn -> (string) [required]

The Amazon Resource Name (ARN) of a Secrets Manager secret that stores your authentication credentials for your SharePoint site or sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration.

Constraints:

pattern: arn:aws(|-cn|-us-gov):secretsmanager:[a-z0-9-]{1,20}:([0-9]{12}|):secret:[a-zA-Z0-9!/_+=.@-]{1,512}

crawlerConfiguration -> (structure)

The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

filterConfiguration -> (structure)

The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

type -> (string) [required]

The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

Possible values:

PATTERN

patternObjectFilter -> (structure)

The configuration of filtering certain objects or content types of the data source.

filters -> (list) [required]

The configuration of specific filters applied to your data source content. You can filter out or include certain content.

Constraints:

min: 1

max: 25

(structure)

The specific filters applied to your data source content. You can filter out or include certain content.

objectType -> (string) [required]

The supported object type or content type of the data source.

Constraints:

min: 1

max: 50

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

JSON Syntax:

{
  "type": "S3"|"WEB"|"CONFLUENCE"|"SALESFORCE"|"SHAREPOINT"|"CUSTOM"|"REDSHIFT_METADATA"|"MANAGED_KNOWLEDGE_BASE_CONNECTOR",
  "managedKnowledgeBaseConnectorConfiguration": {
    "deletionProtectionConfiguration": {
      "deletionProtectionStatus": "ENABLED"|"DISABLED",
      "deletionProtectionThreshold": integer
    },
    "mediaExtractionConfiguration": {
      "imageExtractionConfiguration": {
        "imageExtractionStatus": "ENABLED"|"DISABLED"
      },
      "audioExtractionConfiguration": {
        "audioExtractionStatus": "ENABLED"|"DISABLED"
      },
      "videoExtractionConfiguration": {
        "videoExtractionStatus": "ENABLED"|"DISABLED"
      }
    },
    "connectorParameters": {...}
  },
  "s3Configuration": {
    "bucketArn": "string",
    "inclusionPrefixes": ["string", ...],
    "bucketOwnerAccountId": "string"
  },
  "webConfiguration": {
    "sourceConfiguration": {
      "urlConfiguration": {
        "seedUrls": [
          {
            "url": "string"
          }
          ...
        ]
      }
    },
    "crawlerConfiguration": {
      "crawlerLimits": {
        "rateLimit": integer,
        "maxPages": integer
      },
      "inclusionFilters": ["string", ...],
      "exclusionFilters": ["string", ...],
      "scope": "HOST_ONLY"|"SUBDOMAINS",
      "userAgent": "string",
      "userAgentHeader": "string"
    }
  },
  "confluenceConfiguration": {
    "sourceConfiguration": {
      "hostUrl": "string",
      "hostType": "SAAS",
      "authType": "BASIC"|"OAUTH2_CLIENT_CREDENTIALS",
      "credentialsSecretArn": "string"
    },
    "crawlerConfiguration": {
      "filterConfiguration": {
        "type": "PATTERN",
        "patternObjectFilter": {
          "filters": [
            {
              "objectType": "string",
              "inclusionFilters": ["string", ...],
              "exclusionFilters": ["string", ...]
            }
            ...
          ]
        }
      }
    }
  },
  "salesforceConfiguration": {
    "sourceConfiguration": {
      "hostUrl": "string",
      "authType": "OAUTH2_CLIENT_CREDENTIALS",
      "credentialsSecretArn": "string"
    },
    "crawlerConfiguration": {
      "filterConfiguration": {
        "type": "PATTERN",
        "patternObjectFilter": {
          "filters": [
            {
              "objectType": "string",
              "inclusionFilters": ["string", ...],
              "exclusionFilters": ["string", ...]
            }
            ...
          ]
        }
      }
    }
  },
  "sharePointConfiguration": {
    "sourceConfiguration": {
      "tenantId": "string",
      "domain": "string",
      "siteUrls": ["string", ...],
      "hostType": "ONLINE",
      "authType": "OAUTH2_CLIENT_CREDENTIALS"|"OAUTH2_SHAREPOINT_APP_ONLY_CLIENT_CREDENTIALS",
      "credentialsSecretArn": "string"
    },
    "crawlerConfiguration": {
      "filterConfiguration": {
        "type": "PATTERN",
        "patternObjectFilter": {
          "filters": [
            {
              "objectType": "string",
              "inclusionFilters": ["string", ...],
              "exclusionFilters": ["string", ...]
            }
            ...
          ]
        }
      }
    }
  }
}

--data-deletion-policy (string)

The data deletion policy for the data source.

You can set the data deletion policy to:

DELETE: Deletes all data from your data source that’s converted into vector embeddings upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted, only the data. This flag is ignored if an Amazon Web Services account is deleted.

RETAIN: Retains all data from your data source that’s converted into vector embeddings upon deletion of a knowledge base or data source resource. Note that the vector store itself is not deleted if you delete a knowledge base or data source resource.

Note
For managed knowledge bases, the only supported option is DELETE, which is also the default.

Possible values:

RETAIN

DELETE

--server-side-encryption-configuration (structure)

Contains details about the server-side encryption for the data source.

kmsKeyArn -> (string)

The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

Constraints:

min: 1

max: 2048

pattern: arn:aws(|-cn|-us-gov):kms:[a-zA-Z0-9-]*:[0-9]{12}:key/[a-zA-Z0-9-]{36}

Shorthand Syntax:

kmsKeyArn=string

JSON Syntax:

{
  "kmsKeyArn": "string"
}

--vector-ingestion-configuration (structure)

Contains details about how to ingest the documents in the data source.

chunkingConfiguration -> (structure)

Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

chunkingStrategy -> (string) [required]

A knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE, then you may want to pre-process your files by splitting them up such that each file corresponds to a chunk.

FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration.

HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

Possible values:

FIXED_SIZE

NONE

HIERARCHICAL

SEMANTIC

fixedSizeChunkingConfiguration -> (structure)

Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE, exclude this field.

maxTokens -> (integer) [required]

The maximum number of tokens to include in a chunk.

Constraints:

min: 1

max: 8192

overlapPercentage -> (integer) [required]

The percentage of overlap between adjacent chunks of a data source.

Constraints:

min: 1

max: 99
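A minimal Python sketch of a fixed-size chunking value for --vector-ingestion-configuration, with the range checks from the constraints above applied locally before the request is sent. The function name `fixed_size_chunking` and the numbers 300 and 20 are illustrative choices, not recommended or default settings.

```python
import json


def fixed_size_chunking(max_tokens: int, overlap_percentage: int) -> dict:
    """Build a vectorIngestionConfiguration dict that uses FIXED_SIZE chunking."""
    if not 1 <= max_tokens <= 8192:
        raise ValueError("maxTokens must be between 1 and 8192")
    if not 1 <= overlap_percentage <= 99:
        raise ValueError("overlapPercentage must be between 1 and 99")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }


# Illustrative values only.
ingestion = fixed_size_chunking(max_tokens=300, overlap_percentage=20)
print(json.dumps(ingestion, indent=2))
```

Because the chunkingConfiguration can’t be changed after the data source connector is created, validating the values before the call avoids having to recreate the data source.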

hierarchicalChunkingConfiguration -> (structure)

Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

levelConfigurations -> (list) [required]

Token settings for each layer.

Constraints:

min: 2

max: 2

(structure)

Token settings for a layer in a hierarchical chunking configuration.

maxTokens -> (integer) [required]

The maximum number of tokens that a chunk can contain in this layer.

Constraints:

min: 1

max: 8192

overlapTokens -> (integer) [required]

The number of tokens to repeat across chunks in the same layer.

Constraints:

min: 1

semanticChunkingConfiguration -> (structure)

Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

maxTokens -> (integer) [required]

The maximum number of tokens that a chunk can contain.

Constraints:

min: 1

max: 8192

bufferSize -> (integer) [required]

The buffer size.

Constraints:

min: 0

max: 1

breakpointPercentileThreshold -> (integer) [required]

The dissimilarity threshold for splitting chunks.

Constraints:

min: 50

max: 99

customTransformationConfiguration -> (structure)

A custom document transformer for parsed data source documents.

intermediateStorage -> (structure) [required]

An S3 bucket path for input and output objects.

s3Location -> (structure) [required]

An S3 bucket path.

uri -> (string) [required]

The location’s URI. For example, s3://my-bucket/chunk-processor/.

Constraints:

min: 1

max: 2048

pattern: s3://.{1,128}

transformations -> (list) [required]

A Lambda function that processes documents.

Constraints:

min: 1

max: 1

(structure)

A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING.

transformationFunction -> (structure) [required]

A Lambda function that processes documents.

transformationLambdaConfiguration -> (structure) [required]

The Lambda function.

lambdaArn -> (string) [required]

The function’s ARN identifier.

Constraints:

min: 0

max: 2048

pattern: arn:(aws[a-zA-Z-]*)?:lambda:[a-z]{2}(-gov)?-[a-z]+-\d{1}:\d{12}:function:[a-zA-Z0-9-_\.]+(:(\$LATEST|[a-zA-Z0-9-_]+))?

stepToApply -> (string) [required]

When the service applies the transformation.

Possible values:

POST_CHUNKING

parsingConfiguration -> (structure)

Configurations for a parser to use for parsing documents in your data source. If you exclude this field, the default parser will be used.

parsingStrategy -> (string) [required]

The parsing strategy for the data source. Only SMART_PARSING can be selected for managed knowledge bases. For more information, see Customize ingestion for managed knowledge bases.

Possible values:

BEDROCK_FOUNDATION_MODEL

BEDROCK_DATA_AUTOMATION

SMART_PARSING

bedrockFoundationModelConfiguration -> (structure)

If you specify BEDROCK_FOUNDATION_MODEL as the parsing strategy for ingesting your data source, use this object to modify configurations for using a foundation model to parse documents.

modelArn -> (string) [required]

The ARN of the foundation model to use for parsing.

Constraints:

min: 1

max: 2048

pattern: (arn:aws(-[^:]{1,12})?:(bedrock):[a-z0-9-]{1,20}:([0-9]{12})?:([a-z-]+/)?)?([a-zA-Z0-9.-]{1,63}){0,2}(([:][a-z0-9-]{1,63}){0,2})?(/[a-z0-9]{1,12})?

parsingPrompt -> (structure)

Instructions for interpreting the contents of a document.

parsingPromptText -> (string) [required]

Instructions for interpreting the contents of a document.

Constraints:

min: 1

max: 10000

parsingModality -> (string)

Specifies whether to enable parsing of multimodal data, which can include text, images, or both.

Possible values:

MULTIMODAL

bedrockDataAutomationConfiguration -> (structure)

If you specify BEDROCK_DATA_AUTOMATION as the parsing strategy for ingesting your data source, use this object to modify configurations for using the Amazon Bedrock Data Automation parser.

parsingModality -> (string)

Specifies whether to enable parsing of multimodal data, which can include text, images, or both.

Possible values:

MULTIMODAL

contextEnrichmentConfiguration -> (structure)

The context enrichment configuration used for ingestion of the data into the vector store.

type -> (string) [required]

The method used for context enrichment. It must be Amazon Bedrock foundation models.

Possible values:

BEDROCK_FOUNDATION_MODEL

bedrockFoundationModelConfiguration -> (structure)

The configuration of the Amazon Bedrock foundation model used for context enrichment.

enrichmentStrategyConfiguration -> (structure) [required]

The enrichment strategy used to provide additional context. For example, Neptune GraphRAG uses Amazon Bedrock foundation models to perform chunk entity extraction.

method -> (string) [required]

The method used for the context enrichment strategy.

Possible values:

CHUNK_ENTITY_EXTRACTION

modelArn -> (string) [required]

The Amazon Resource Name (ARN) of the model used to create vector embeddings for the knowledge base.

Constraints:

min: 1

max: 2048

pattern: (arn:aws(-[^:]{1,12})?:(bedrock):[a-z0-9-]{1,20}:([0-9]{12})?:([a-z-]+/)?)?([a-zA-Z0-9.-]{1,63}){0,2}(([:][a-z0-9-]{1,63}){0,2})?(/[a-z0-9]{1,12})?

JSON Syntax:

{
  "chunkingConfiguration": {
    "chunkingStrategy": "FIXED_SIZE"|"NONE"|"HIERARCHICAL"|"SEMANTIC",
    "fixedSizeChunkingConfiguration": {
      "maxTokens": integer,
      "overlapPercentage": integer
    },
    "hierarchicalChunkingConfiguration": {
      "levelConfigurations": [
        {
          "maxTokens": integer
        }
        ...
      ],
      "overlapTokens": integer
    },
    "semanticChunkingConfiguration": {
      "maxTokens": integer,
      "bufferSize": integer,
      "breakpointPercentileThreshold": integer
    }
  },
  "customTransformationConfiguration": {
    "intermediateStorage": {
      "s3Location": {
        "uri": "string"
      }
    },
    "transformations": [
      {
        "transformationFunction": {
          "transformationLambdaConfiguration": {
            "lambdaArn": "string"
          }
        },
        "stepToApply": "POST_CHUNKING"
      }
      ...
    ]
  },
  "parsingConfiguration": {
    "parsingStrategy": "BEDROCK_FOUNDATION_MODEL"|"BEDROCK_DATA_AUTOMATION"|"SMART_PARSING",
    "bedrockFoundationModelConfiguration": {
      "modelArn": "string",
      "parsingPrompt": {
        "parsingPromptText": "string"
      },
      "parsingModality": "MULTIMODAL"
    },
    "bedrockDataAutomationConfiguration": {
      "parsingModality": "MULTIMODAL"
    }
  },
  "contextEnrichmentConfiguration": {
    "type": "BEDROCK_FOUNDATION_MODEL",
    "bedrockFoundationModelConfiguration": {
      "enrichmentStrategyConfiguration": {
        "method": "CHUNK_ENTITY_EXTRACTION"
      },
      "modelArn": "string"
    }
  }
}
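As a concrete instance of the JSON syntax above, the following Python sketch builds a hierarchical chunking configuration with exactly two levels and a shared overlapTokens value. The function name `hierarchical_chunking` and the token counts are illustrative assumptions. Listing the larger layer first follows the description of hierarchical chunking, but the sketch does not enforce it.

```python
import json


def hierarchical_chunking(first_layer_max_tokens, second_layer_max_tokens, overlap_tokens):
    """Build a vectorIngestionConfiguration dict that uses HIERARCHICAL chunking."""
    for value in (first_layer_max_tokens, second_layer_max_tokens):
        if not 1 <= value <= 8192:
            raise ValueError("each level's maxTokens must be between 1 and 8192")
    if overlap_tokens < 1:
        raise ValueError("overlapTokens must be at least 1")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # levelConfigurations takes exactly two entries (min 2, max 2).
                "levelConfigurations": [
                    {"maxTokens": first_layer_max_tokens},
                    {"maxTokens": second_layer_max_tokens},
                ],
                "overlapTokens": overlap_tokens,
            },
        }
    }


# Illustrative values only.
hierarchical = hierarchical_chunking(1500, 300, 60)
print(json.dumps(hierarchical, indent=2))
```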

--cli-input-json | --cli-input-yaml (string) Reads arguments from the JSON string provided. The JSON string follows the format provided by --generate-cli-skeleton. If other arguments are provided on the command line, those values will override the JSON-provided values. It is not possible to pass arbitrary binary values using a JSON-provided value as the string will be taken literally. This may not be specified along with --cli-input-yaml.
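A sketch of preparing an input file for --cli-input-json with Python. The key names are assumed to match the skeleton that --generate-cli-skeleton prints for this command, and the command is assumed to live under the `bedrock-agent` command group. The knowledge base ID, data source name, and bucket ARN are placeholders. The script only writes the file and prints the command. It does not call AWS.

```python
import json
import pathlib
import tempfile

# Placeholder values for illustration; replace them with your own.
request = {
    "knowledgeBaseId": "KB12345678",
    "name": "my-docs-source",
    "dataSourceConfiguration": {
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::amzn-s3-demo-bucket"},
    },
    "dataDeletionPolicy": "RETAIN",
}

# Write the request to a file in the system temporary directory.
input_path = pathlib.Path(tempfile.gettempdir()) / "create-data-source-input.json"
input_path.write_text(json.dumps(request, indent=2))

# Print the command to run; options given on the command line override the file.
print(f"aws bedrock-agent create-data-source --cli-input-json file://{input_path}")
```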

--generate-cli-skeleton (string) Prints a JSON skeleton to standard output without sending an API request. If provided with no value or the value input, prints a sample input JSON that can be used as an argument for --cli-input-json. Similarly, if provided yaml-input it will print a sample input YAML that can be used with --cli-input-yaml. If provided with the value output, it validates the command inputs and returns a sample output JSON for that command. The generated JSON skeleton is not stable between versions of the AWS CLI and there are no backwards compatibility guarantees in the JSON skeleton generated.

Global Options¶

--debug (boolean)

Turn on debug logging.

--endpoint-url (string)

Override command’s default URL with the given URL.

--no-verify-ssl (boolean)

By default, the AWS CLI uses SSL when communicating with AWS services. For each SSL connection, the AWS CLI will verify SSL certificates. This option overrides the default behavior of verifying SSL certificates.

--no-paginate (boolean)

Disable automatic pagination. If automatic pagination is disabled, the AWS CLI will only make one call, for the first page of results.

--output (string)

The formatting style for command output.

json
text
table
yaml
yaml-stream
off

--query (string)

A JMESPath query to use in filtering the response data.

--profile (string)

Use a specific profile from your credential file.

--region (string)

The region to use. Overrides config/env settings.

--version (string)

Display the version of this tool.

--color (string)

Turn on/off color output.

on
off
auto

--no-sign-request (boolean)

Do not sign requests. Credentials will not be loaded if this argument is provided.

--ca-bundle (string)

The CA certificate bundle to use when verifying SSL certificates. Overrides config/env settings.

--cli-read-timeout (int)

The maximum socket read time in seconds. If the value is set to 0, the socket read will be blocking and not timeout. The default value is 60 seconds.

--cli-connect-timeout (int)

The maximum socket connect time in seconds. If the value is set to 0, the socket connect will be blocking and not timeout. The default value is 60 seconds.

--cli-binary-format (string)

The formatting style to be used for binary blobs. The default format is base64. The base64 format expects binary blobs to be provided as a base64 encoded string. The raw-in-base64-out format preserves compatibility with AWS CLI V1 behavior and binary values must be passed literally. When providing contents from a file that map to a binary blob, fileb:// will always be treated as binary and use the file contents directly regardless of the cli-binary-format setting. When using file://, the file contents will need to be properly formatted for the configured cli-binary-format.

base64
raw-in-base64-out

--no-cli-pager (boolean)

Disable cli pager for output.

--cli-auto-prompt (boolean)

Automatically prompt for CLI input parameters.

--no-cli-auto-prompt (boolean)

Disable automatic prompting for CLI input parameters.

--cli-error-format (string)

The formatting style for error output. By default, errors are displayed in enhanced format.

legacy
json
yaml
text
table
enhanced

Output¶

dataSource -> (structure)

Contains details about the data source.

knowledgeBaseId -> (string) [required]

The unique identifier of the knowledge base to which the data source belongs.

Constraints:

pattern: [0-9a-zA-Z]{10}

dataSourceId -> (string) [required]

The unique identifier of the data source.

Constraints:

pattern: [0-9a-zA-Z]{10}

name -> (string) [required]

The name of the data source.

Constraints:

pattern: ([0-9a-zA-Z][_-]?){1,100}

status -> (string) [required]

The status of the data source. The following statuses are possible:

Available – The data source has been created and is ready for ingestion into the knowledge base.

Deleting – The data source is being deleted.

Possible values:

AVAILABLE

DELETING

DELETE_UNSUCCESSFUL

CREATING

UPDATING

FAILED

description -> (string)

The description of the data source.

Constraints:

min: 1

max: 200

dataSourceConfiguration -> (structure) [required]

The connection configuration for the data source.

type -> (string) [required]

The type of data source.

Possible values:

S3

WEB

CONFLUENCE

SALESFORCE

SHAREPOINT

CUSTOM

REDSHIFT_METADATA

MANAGED_KNOWLEDGE_BASE_CONNECTOR

managedKnowledgeBaseConnectorConfiguration -> (structure)

Contains the configuration for a data source that connects a managed knowledge base to a supported data source connector. Specify this object when the data source type is MANAGED_KNOWLEDGE_BASE_CONNECTOR .

deletionProtectionConfiguration -> (structure)

A safeguard against accidental bulk deletion of indexed content.

deletionProtectionStatus -> (string) [required]

Enable or disable deletion protection for the connector.

Possible values:

ENABLED

DISABLED

deletionProtectionThreshold -> (integer)

The threshold is the maximum percentage of documents that a sync job can delete from your index. If a sync would delete more than this percentage, the sync skips its delete phase, leaving your indexed documents in place. Not supported for the Custom connector.

Constraints:

min: 0

max: 100
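The threshold rule described above can be sketched as a small check. This is only an illustration under stated assumptions: the function name, its arguments, and the way the deletion percentage is computed (deleted documents divided by indexed documents) are hypothetical, and the service may compute the percentage differently.

```python
def should_skip_delete_phase(documents_to_delete, indexed_documents, threshold_percent):
    """Return True when a sync would delete more than threshold_percent of the index."""
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    if indexed_documents == 0:
        # Nothing is indexed, so there is nothing to protect.
        return False
    deletion_percent = 100 * documents_to_delete / indexed_documents
    return deletion_percent > threshold_percent


print(should_skip_delete_phase(30, 100, 25))
print(should_skip_delete_phase(25, 100, 25))
```

In this sketch a sync that would delete exactly the threshold percentage still runs its delete phase, because the description says the phase is skipped only when the sync would delete more than that percentage.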

mediaExtractionConfiguration -> (structure)

Configuration for extracting media (images, audio, video) from data source files.

imageExtractionConfiguration -> (structure)

Configuration for image extraction.

imageExtractionStatus -> (string) [required]

Whether image extraction is enabled or disabled.

Possible values:

ENABLED

DISABLED

audioExtractionConfiguration -> (structure)

Configuration for audio extraction.

audioExtractionStatus -> (string) [required]

Whether audio extraction is enabled or disabled.

Possible values:

ENABLED

DISABLED

videoExtractionConfiguration -> (structure)

Configuration for video extraction.

videoExtractionStatus -> (string) [required]

Whether video extraction is enabled or disabled.

Possible values:

ENABLED

DISABLED

connectorParameters -> (document)

Connector-specific parameters. For more information, see Connect a data source .

s3Configuration -> (structure)

The configuration information to connect to Amazon S3 as your data source for self-managed knowledge bases. To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration .

bucketArn -> (string) [required]

The Amazon Resource Name (ARN) of the S3 bucket that contains your data.

Constraints:

min: 1

max: 2048

pattern: arn:aws(|-cn|-us-gov):s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]

inclusionPrefixes -> (list)

A list of S3 prefixes to include certain files or content. For more information, see Organizing objects using prefixes .

Constraints:

min: 1

max: 1

(string)

Constraints:

min: 1

max: 300

bucketOwnerAccountId -> (string)

The account ID for the owner of the S3 bucket.

Constraints:

min: 12

max: 12

pattern: [0-9]{12}
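As a rough client-side pre-check, the constraints listed for s3Configuration can be applied before calling the command. The sketch below assumes the documented patterns must match the whole value; the function name validate_s3_configuration and the sample values are hypothetical.

```python
import re

# Patterns copied from the constraints listed for s3Configuration.
BUCKET_ARN_PATTERN = re.compile(
    r"arn:aws(|-cn|-us-gov):s3:::[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]"
)
ACCOUNT_ID_PATTERN = re.compile(r"[0-9]{12}")


def validate_s3_configuration(config):
    """Return a list of constraint violations found in an s3Configuration dict."""
    errors = []
    if not BUCKET_ARN_PATTERN.fullmatch(config.get("bucketArn", "")):
        errors.append("bucketArn does not match the documented pattern")
    prefixes = config.get("inclusionPrefixes")
    if prefixes is not None:
        if len(prefixes) != 1:
            errors.append("inclusionPrefixes must contain exactly one prefix")
        elif not 1 <= len(prefixes[0]) <= 300:
            errors.append("each prefix must be 1 to 300 characters long")
    owner = config.get("bucketOwnerAccountId")
    if owner is not None and not ACCOUNT_ID_PATTERN.fullmatch(owner):
        errors.append("bucketOwnerAccountId must be 12 digits")
    return errors


print(validate_s3_configuration({"bucketArn": "arn:aws:s3:::amzn-s3-demo-bucket"}))
```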

webConfiguration -> (structure)

The configuration of web URLs to crawl for your data source. You should be authorized to crawl the URLs.

Note
To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration . Web crawler data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The source configuration details for the web data source.

urlConfiguration -> (structure) [required]

The configuration of the URL/URLs.

seedUrls -> (list)

One or more seed or starting point URLs.

Constraints:

min: 1

max: 100

(structure)

The seed or starting point URL. You should be authorized to crawl the URL.

url -> (string)

A seed or starting point URL.

Constraints:

pattern: https?://[A-Za-z0-9][^\s]*

crawlerConfiguration -> (structure)

The Web Crawler configuration details for the web data source.

crawlerLimits -> (structure)

The configuration of crawl limits for the web URLs.

rateLimit -> (integer)

The max rate at which pages are crawled, up to 300 per minute per host.

Constraints:

min: 1

max: 300

maxPages -> (integer)

The max number of web pages crawled from your source URLs, up to 25,000 pages. If the web pages exceed this limit, the data source sync will fail and no web pages will be ingested.

Constraints:

min: 1

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain URLs. If you specify an inclusion and exclusion filter/pattern and both match a URL, the exclusion filter takes precedence and the web content of the URL isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000
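The precedence rule for inclusionFilters and exclusionFilters can be sketched as follows. This is an illustration, not the crawler's implementation: the function name is hypothetical, and two behaviors are assumptions: patterns are tested with a search rather than a full match, and a URL is crawled when no inclusion filter is configured.

```python
import re


def is_url_crawled(url, inclusion_filters=None, exclusion_filters=None):
    """Apply inclusion and exclusion patterns, with exclusion taking precedence."""
    if exclusion_filters and any(re.search(p, url) for p in exclusion_filters):
        # A matching exclusion filter wins even if an inclusion filter also matches.
        return False
    if inclusion_filters:
        return any(re.search(p, url) for p in inclusion_filters)
    return True


include = [r".*/docs/.*"]
exclude = [r".*\.pdf$"]
print(is_url_crawled("https://example.com/docs/guide.html", include, exclude))
print(is_url_crawled("https://example.com/docs/guide.pdf", include, exclude))
```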

scope -> (string)

The scope of what is crawled for your URLs.

You can choose to crawl only web pages that belong to the same host or primary domain. For example, only web pages that contain the seed URL “https://docs.aws.amazon.com/bedrock/latest/userguide/” and no other domains. You can choose to include sub domains in addition to the host or primary domain. For example, web pages that contain “aws.amazon.com” can also include sub domain “docs.aws.amazon.com”.

Possible values:

HOST_ONLY

SUBDOMAINS

userAgent -> (string)

Returns the user agent suffix for your web crawler.

Constraints:

min: 15

max: 40

userAgentHeader -> (string)

A string used for identifying the crawler or bot when it accesses a web server. The user agent header value consists of the bedrockbot , UUID, and a user agent suffix for your crawler (if one is provided). By default, it is set to bedrockbot_UUID . You can optionally append a custom suffix to bedrockbot_UUID to allowlist a specific user agent permitted to access your source URLs.

Constraints:

min: 61

max: 86

confluenceConfiguration -> (structure)

The configuration information to connect to Confluence as your data source for self-managed knowledge bases.

Note
To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration . Confluence data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The endpoint information to connect to your Confluence data source.

hostUrl -> (string) [required]

The Confluence host URL or instance URL.

Constraints:

pattern: https://[A-Za-z0-9][^\s]*

hostType -> (string) [required]

The supported host type, whether online/cloud or server/on-premises.

Possible values:

SAAS

authType -> (string) [required]

The supported authentication type to authenticate and connect to your Confluence instance.

Possible values:

BASIC

OAUTH2_CLIENT_CREDENTIALS

credentialsSecretArn -> (string) [required]

The Amazon Resource Name of a Secrets Manager secret that stores your authentication credentials for your Confluence instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Confluence connection configuration .

Constraints:

pattern: arn:aws(|-cn|-us-gov):secretsmanager:[a-z0-9-]{1,20}:([0-9]{12}|):secret:[a-zA-Z0-9!/_+=.@-]{1,512}

crawlerConfiguration -> (structure)

The configuration of the Confluence content. For example, configuring specific types of Confluence content.

filterConfiguration -> (structure)

The configuration of filtering the Confluence content. For example, configuring regular expression patterns to include or exclude certain content.

type -> (string) [required]

The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

Possible values:

PATTERN

patternObjectFilter -> (structure)

The configuration of filtering certain objects or content types of the data source.

filters -> (list) [required]

The configuration of specific filters applied to your data source content. You can filter out or include certain content.

Constraints:

min: 1

max: 25

(structure)

The specific filters applied to your data source content. You can filter out or include certain content.

objectType -> (string) [required]

The supported object type or content type of the data source.

Constraints:

min: 1

max: 50

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

salesforceConfiguration -> (structure)

The configuration information to connect to Salesforce as your data source.

Note
Salesforce data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The endpoint information to connect to your Salesforce data source.

hostUrl -> (string) [required]

The Salesforce host URL or instance URL.

Constraints:

pattern: https://[A-Za-z0-9][^\s]*

authType -> (string) [required]

The supported authentication type to authenticate and connect to your Salesforce instance.

Possible values:

OAUTH2_CLIENT_CREDENTIALS

credentialsSecretArn -> (string) [required]

The Amazon Resource Name of a Secrets Manager secret that stores your authentication credentials for your Salesforce instance URL. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see Salesforce connection configuration .

Constraints:

pattern: arn:aws(|-cn|-us-gov):secretsmanager:[a-z0-9-]{1,20}:([0-9]{12}|):secret:[a-zA-Z0-9!/_+=.@-]{1,512}

crawlerConfiguration -> (structure)

The configuration of the Salesforce content. For example, configuring specific types of Salesforce content.

filterConfiguration -> (structure)

The configuration of filtering the Salesforce content. For example, configuring regular expression patterns to include or exclude certain content.

type -> (string) [required]

The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

Possible values:

PATTERN

patternObjectFilter -> (structure)

The configuration of filtering certain objects or content types of the data source.

filters -> (list) [required]

The configuration of specific filters applied to your data source content. You can filter out or include certain content.

Constraints:

min: 1

max: 25

(structure)

The specific filters applied to your data source content. You can filter out or include certain content.

objectType -> (string) [required]

The supported object type or content type of the data source.

Constraints:

min: 1

max: 50

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

sharePointConfiguration -> (structure)

The configuration information to connect to SharePoint as your data source for self-managed knowledge bases.

Note
To configure this data source for managed knowledge bases, use managedKnowledgeBaseConnectorConfiguration . SharePoint data source connector for self-managed knowledge bases is in preview release and is subject to change.

sourceConfiguration -> (structure) [required]

The endpoint information to connect to your SharePoint data source.

tenantId -> (string)

The identifier of your Microsoft 365 tenant.

Constraints:

min: 36

max: 36

pattern: [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

domain -> (string) [required]

The domain of your SharePoint instance or site URL/URLs.

Constraints:

min: 1

max: 50

siteUrls -> (list) [required]

A list of one or more SharePoint site URLs.

Constraints:

min: 1

max: 100

(string)

Constraints:

pattern: https://[A-Za-z0-9][^\s]*

hostType -> (string) [required]

The supported host type, whether online/cloud or server/on-premises.

Possible values:

ONLINE

authType -> (string) [required]

The supported authentication type to authenticate and connect to your SharePoint site/sites.

Possible values:

OAUTH2_CLIENT_CREDENTIALS

OAUTH2_SHAREPOINT_APP_ONLY_CLIENT_CREDENTIALS

credentialsSecretArn -> (string) [required]

The Amazon Resource Name of a Secrets Manager secret that stores your authentication credentials for your SharePoint site/sites. For more information on the key-value pairs that must be included in your secret, depending on your authentication type, see SharePoint connection configuration .

Constraints:

pattern: arn:aws(|-cn|-us-gov):secretsmanager:[a-z0-9-]{1,20}:([0-9]{12}|):secret:[a-zA-Z0-9!/_+=.@-]{1,512}

crawlerConfiguration -> (structure)

The configuration of the SharePoint content. For example, configuring specific types of SharePoint content.

filterConfiguration -> (structure)

The configuration of filtering the SharePoint content. For example, configuring regular expression patterns to include or exclude certain content.

type -> (string) [required]

The type of filtering that you want to apply to certain objects or content of the data source. For example, the PATTERN type is regular expression patterns you can apply to filter your content.

Possible values:

PATTERN

patternObjectFilter -> (structure)

The configuration of filtering certain objects or content types of the data source.

filters -> (list) [required]

The configuration of specific filters applied to your data source content. You can filter out or include certain content.

Constraints:

min: 1

max: 25

(structure)

The specific filters applied to your data source content. You can filter out or include certain content.

objectType -> (string) [required]

The supported object type or content type of the data source.

Constraints:

min: 1

max: 50

inclusionFilters -> (list)

A list of one or more inclusion regular expression patterns to include certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

exclusionFilters -> (list)

A list of one or more exclusion regular expression patterns to exclude certain object types that adhere to the pattern. If you specify an inclusion and exclusion filter/pattern and both match a document, the exclusion filter takes precedence and the document isn’t crawled.

Constraints:

min: 1

max: 25

(string)

Constraints:

min: 1

max: 1000

serverSideEncryptionConfiguration -> (structure)

Contains details about the configuration of the server-side encryption.

kmsKeyArn -> (string)

The Amazon Resource Name (ARN) of the KMS key used to encrypt the resource.

Constraints:

min: 1

max: 2048

pattern: arn:aws(|-cn|-us-gov):kms:[a-zA-Z0-9-]*:[0-9]{12}:key/[a-zA-Z0-9-]{36}

vectorIngestionConfiguration -> (structure)

Contains details about how to ingest the documents in the data source.

chunkingConfiguration -> (structure)

Details about how to chunk the documents in the data source. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.

chunkingStrategy -> (string) [required]

Knowledge base can split your source data into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried. You have the following options for chunking your data. If you opt for NONE , then you may want to pre-process your files by splitting them up such that each file corresponds to a chunk.

FIXED_SIZE – Amazon Bedrock splits your source data into chunks of the approximate size that you set in the fixedSizeChunkingConfiguration .

HIERARCHICAL – Split documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

SEMANTIC – Split documents into chunks based on groups of similar content derived with natural language processing.

NONE – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files.

Possible values:

FIXED_SIZE

NONE

HIERARCHICAL

SEMANTIC

fixedSizeChunkingConfiguration -> (structure)

Configurations for when you choose fixed-size chunking. If you set the chunkingStrategy as NONE , exclude this field.

maxTokens -> (integer) [required]

The maximum number of tokens to include in a chunk.

Constraints:

min: 1

max: 8192

overlapPercentage -> (integer) [required]

The percentage of overlap between adjacent chunks of a data source.

Constraints:

min: 1

max: 99
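To make the interaction between maxTokens and overlapPercentage concrete, here is a simplified sliding-window sketch. It is not how Amazon Bedrock tokenizes or splits documents (the service only promises chunks of approximate size); the function name, the pre-tokenized input list, and the rounding of the overlap are all assumptions.

```python
def fixed_size_chunks(tokens, max_tokens, overlap_percentage):
    """Split a token list into windows of max_tokens that overlap by a percentage."""
    if not 1 <= max_tokens <= 8192:
        raise ValueError("max_tokens must be between 1 and 8192")
    if not 1 <= overlap_percentage <= 99:
        raise ValueError("overlap_percentage must be between 1 and 99")
    overlap = max_tokens * overlap_percentage // 100
    step = max(1, max_tokens - overlap)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


print(fixed_size_chunks(list(range(10)), max_tokens=4, overlap_percentage=50))
```

With ten tokens, a window of four, and 50 percent overlap, each chunk in this sketch repeats the last two tokens of the previous chunk.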

hierarchicalChunkingConfiguration -> (structure)

Settings for hierarchical document chunking for a data source. Hierarchical chunking splits documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.

levelConfigurations -> (list) [required]

Token settings for each layer.

Constraints:

min: 2

max: 2

(structure)

Token settings for a layer in a hierarchical chunking configuration.

maxTokens -> (integer) [required]

The maximum number of tokens that a chunk can contain in this layer.

Constraints:

min: 1

max: 8192

overlapTokens -> (integer) [required]

The number of tokens to repeat across chunks in the same layer.

Constraints:

min: 1
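A hierarchical chunkingConfiguration can be assembled and range-checked before it is passed to --vector-ingestion-configuration. The sketch below assumes that overlapTokens sits beside levelConfigurations inside hierarchicalChunkingConfiguration, and that the first level should not be smaller than the second; the function name and the sample token counts are hypothetical.

```python
def hierarchical_chunking(parent_max_tokens, child_max_tokens, overlap_tokens):
    """Build a chunkingConfiguration dict for the HIERARCHICAL strategy."""
    for value in (parent_max_tokens, child_max_tokens):
        if not 1 <= value <= 8192:
            raise ValueError("maxTokens must be between 1 and 8192")
    if overlap_tokens < 1:
        raise ValueError("overlapTokens must be at least 1")
    if child_max_tokens > parent_max_tokens:
        # Assumption: the second layer holds the smaller chunks.
        raise ValueError("the second level should not exceed the first level")
    return {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            # Exactly two levels are allowed (min: 2, max: 2).
            "levelConfigurations": [
                {"maxTokens": parent_max_tokens},
                {"maxTokens": child_max_tokens},
            ],
            "overlapTokens": overlap_tokens,
        },
    }


print(hierarchical_chunking(1500, 300, 60))
```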

semanticChunkingConfiguration -> (structure)

Settings for semantic document chunking for a data source. Semantic chunking splits a document into smaller documents based on groups of similar content derived from the text with natural language processing.

maxTokens -> (integer) [required]

The maximum number of tokens that a chunk can contain.

Constraints:

min: 1

max: 8192

bufferSize -> (integer) [required]

The buffer size.

Constraints:

min: 0

max: 1

breakpointPercentileThreshold -> (integer) [required]

The dissimilarity threshold for splitting chunks.

Constraints:

min: 50

max: 99

customTransformationConfiguration -> (structure)

A custom document transformer for parsed data source documents.

intermediateStorage -> (structure) [required]

An S3 bucket path for input and output objects.

s3Location -> (structure) [required]

An S3 bucket path.

uri -> (string) [required]

The location’s URI. For example, s3://my-bucket/chunk-processor/ .

Constraints:

min: 1

max: 2048

pattern: s3://.{1,128}

transformations -> (list) [required]

A Lambda function that processes documents.

Constraints:

min: 1

max: 1

(structure)

A custom processing step for documents moving through a data source ingestion pipeline. To process documents after they have been converted into chunks, set the step to apply to POST_CHUNKING .

transformationFunction -> (structure) [required]

A Lambda function that processes documents.

transformationLambdaConfiguration -> (structure) [required]

The Lambda function.

lambdaArn -> (string) [required]

The function’s ARN identifier.

Constraints:

min: 0

max: 2048

pattern: arn:(aws[a-zA-Z-]*)?:lambda:[a-z]{2}(-gov)?-[a-z]+-\d{1}:\d{12}:function:[a-zA-Z0-9-_\.]+(:(\$LATEST|[a-zA-Z0-9-_]+))?

stepToApply -> (string) [required]

When the service applies the transformation.

Possible values:

POST_CHUNKING
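The pieces of customTransformationConfiguration fit together as shown in this sketch, which also checks the two documented patterns before building the structure. The function name, the bucket path, and the Lambda ARN are placeholders; treating the patterns as whole-value matches is an assumption.

```python
import re

# Patterns copied from the constraints for uri and lambdaArn.
S3_URI_PATTERN = re.compile(r"s3://.{1,128}")
LAMBDA_ARN_PATTERN = re.compile(
    r"arn:(aws[a-zA-Z-]*)?:lambda:[a-z]{2}(-gov)?-[a-z]+-\d{1}:\d{12}:"
    r"function:[a-zA-Z0-9-_\.]+(:(\$LATEST|[a-zA-Z0-9-_]+))?"
)


def custom_transformation(intermediate_uri, lambda_arn):
    """Build a customTransformationConfiguration with one POST_CHUNKING step."""
    if not S3_URI_PATTERN.fullmatch(intermediate_uri):
        raise ValueError("uri does not match the documented pattern")
    if not LAMBDA_ARN_PATTERN.fullmatch(lambda_arn):
        raise ValueError("lambdaArn does not match the documented pattern")
    return {
        "intermediateStorage": {"s3Location": {"uri": intermediate_uri}},
        # The transformations list accepts exactly one entry (min: 1, max: 1).
        "transformations": [
            {
                "transformationFunction": {
                    "transformationLambdaConfiguration": {"lambdaArn": lambda_arn}
                },
                "stepToApply": "POST_CHUNKING",
            }
        ],
    }


print(custom_transformation(
    "s3://my-bucket/chunk-processor/",
    "arn:aws:lambda:us-east-1:123456789012:function:my-transform",
))
```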

parsingConfiguration -> (structure)

Configurations for a parser to use for parsing documents in your data source. If you exclude this field, the default parser will be used.

parsingStrategy -> (string) [required]

The parsing strategy for the data source. Only SMART_PARSING can be selected for managed knowledge bases. For more information, see Customize ingestion for managed knowledge bases .

Possible values:

BEDROCK_FOUNDATION_MODEL

BEDROCK_DATA_AUTOMATION

SMART_PARSING

bedrockFoundationModelConfiguration -> (structure)

If you specify BEDROCK_FOUNDATION_MODEL as the parsing strategy for ingesting your data source, use this object to modify configurations for using a foundation model to parse documents.

modelArn -> (string) [required]

The ARN of the foundation model to use for parsing.

Constraints:

min: 1

max: 2048

pattern: (arn:aws(-[^:]{1,12})?:(bedrock):[a-z0-9-]{1,20}:([0-9]{12})?:([a-z-]+/)?)?([a-zA-Z0-9.-]{1,63}){0,2}(([:][a-z0-9-]{1,63}){0,2})?(/[a-z0-9]{1,12})?

parsingPrompt -> (structure)

Instructions for interpreting the contents of a document.

parsingPromptText -> (string) [required]

Instructions for interpreting the contents of a document.

Constraints:

min: 1

max: 10000

parsingModality -> (string)

Specifies whether to enable parsing of multimodal data, including text, images, or both.

Possible values:

MULTIMODAL

bedrockDataAutomationConfiguration -> (structure)

If you specify BEDROCK_DATA_AUTOMATION as the parsing strategy for ingesting your data source, use this object to modify configurations for using the Amazon Bedrock Data Automation parser.

parsingModality -> (string)

Specifies whether to enable parsing of multimodal data, including text, images, or both.

Possible values:

MULTIMODAL

contextEnrichmentConfiguration -> (structure)

The context enrichment configuration used for ingestion of the data into the vector store.

type -> (string) [required]

The method used for context enrichment. It must be Amazon Bedrock foundation models.

Possible values:

BEDROCK_FOUNDATION_MODEL

bedrockFoundationModelConfiguration -> (structure)

The configuration of the Amazon Bedrock foundation model used for context enrichment.

enrichmentStrategyConfiguration -> (structure) [required]

The enrichment strategy used to provide additional context. For example, Neptune GraphRAG uses Amazon Bedrock foundation models to perform chunk entity extraction.

method -> (string) [required]

The method used for the context enrichment strategy.

Possible values:

CHUNK_ENTITY_EXTRACTION

modelArn -> (string) [required]

The Amazon Resource Name (ARN) of the model used to create vector embeddings for the knowledge base.

Constraints:

min: 1

max: 2048

pattern: (arn:aws(-[^:]{1,12})?:(bedrock):[a-z0-9-]{1,20}:([0-9]{12})?:([a-z-]+/)?)?([a-zA-Z0-9.-]{1,63}){0,2}(([:][a-z0-9-]{1,63}){0,2})?(/[a-z0-9]{1,12})?

dataDeletionPolicy -> (string)

The data deletion policy for the data source.

Possible values:

RETAIN

DELETE

createdAt -> (timestamp) [required]

The time at which the data source was created.

updatedAt -> (timestamp) [required]

The time at which the data source was last updated.

failureReasons -> (list)

The detailed reasons for the failure to delete a data source.

Constraints:

min: 0

max: 2048

(string)

Constraints:

min: 0

max: 2048
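A script that captures the command's JSON output might read the fields described above as in this sketch. The function name and the sample response values are hypothetical; only the field names and the AVAILABLE status come from the output description.

```python
import json


def summarize_data_source(output_text):
    """Extract the identifier, status, and failure reasons from the command output."""
    data_source = json.loads(output_text)["dataSource"]
    return {
        "dataSourceId": data_source["dataSourceId"],
        "status": data_source["status"],
        # AVAILABLE means the data source is ready for ingestion.
        "ready": data_source["status"] == "AVAILABLE",
        # failureReasons is optional, so default to an empty list.
        "failureReasons": data_source.get("failureReasons", []),
    }


# Hypothetical output for illustration; real output contains more fields.
sample_output = json.dumps({
    "dataSource": {
        "knowledgeBaseId": "ABCDEFGHIJ",
        "dataSourceId": "KLMNOPQRST",
        "name": "my-data-source",
        "status": "AVAILABLE",
    }
})
print(summarize_data_source(sample_output))
```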
