To add custom synonyms to an index, you specify them in a thesaurus file. Synonyms let you include business-specific or specialized terms in Amazon Kendra. Generic English synonyms, such as `leader, head`, are built into Amazon Kendra and should not be included in a thesaurus file, including generic synonyms that use hyphens. Amazon Kendra supports synonyms for all response types, which include `DOCUMENT` response types and `QUESTION_ANSWER` or `ANSWER` response types. Amazon Kendra currently does not support adding synonyms flagged as stop words. This is to be included in a future release.
Amazon Kendra makes correlations between synonyms. For example, using the synonym pair `Dynamo, Amazon DynamoDB`, Amazon Kendra correlates Dynamo with Amazon DynamoDB. The query "What is dynamo?" then returns a document such as "What is Amazon DynamoDB?". With synonyms, Amazon Kendra can more easily pick up the correlation.
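To make the idea of correlation concrete, here is a toy sketch of synonym-based query expansion. It is not how Amazon Kendra is implemented; the function names `build_synonym_index` and `expand_query` are hypothetical, and the matching (lowercasing plus whole-word substitution) is an assumption chosen only for illustration.

```python
import re
from typing import Dict, List, Set


def build_synonym_index(groups: List[List[str]]) -> Dict[str, Set[str]]:
    """Map each lowercased term to the other terms in its synonym group."""
    index: Dict[str, Set[str]] = {}
    for group in groups:
        lowered = [term.lower() for term in group]
        for term in lowered:
            index.setdefault(term, set()).update(t for t in lowered if t != term)
    return index


def expand_query(query: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Return the query plus variants with a matched term swapped for a synonym."""
    normalized = query.lower()
    variants = {normalized}
    for term, synonyms in index.items():
        # Whole-word match so that "dynamo" does not match inside "dynamodb".
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, normalized):
            for synonym in synonyms:
                variants.add(re.sub(pattern, lambda _m, s=synonym: s, normalized))
    return variants


# Synonym pair taken from the example above.
index = build_synonym_index([["Dynamo", "Amazon DynamoDB"]])
expanded = expand_query("What is dynamo?", index)
```

In this sketch, `expanded` contains both the original query and the variant "what is amazon dynamodb?", which is why a document titled "What is Amazon DynamoDB?" becomes an easier match.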
The thesaurus file is a text file stored in an Amazon S3 bucket. See Adding a thesaurus to an index.
The thesaurus file uses the Solr synonym format.
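As a rough illustration of what such a file might contain, the sketch below builds thesaurus text in Python. It assumes the common Solr convention of one rule per line, with equivalent terms separated by commas and `#` starting a comment line. The helper names are hypothetical, and the terms are the examples used on this page.

```python
from typing import Iterable, List


def equivalent_rule(terms: Iterable[str]) -> str:
    """Format terms that are synonyms of one another as 'a, b, c'."""
    cleaned = [t.strip() for t in terms if t.strip()]
    if len(cleaned) < 2:
        raise ValueError("a synonym rule needs at least two terms")
    return ", ".join(cleaned)


def build_thesaurus(rules: List[str]) -> str:
    """Join rules into file text, one rule per line, after a comment header."""
    header = "# Custom synonyms, one rule per line"
    return "\n".join([header, *rules]) + "\n"


thesaurus_text = build_thesaurus([
    equivalent_rule(["Dynamo", "Amazon DynamoDB"]),
    equivalent_rule(["NLP", "Natural Language Processing"]),
    equivalent_rule(["Elastic Compute Cloud", "EC2"]),
])
```

The resulting `thesaurus_text` can be written to a plain text file and uploaded to the Amazon S3 bucket that holds the thesaurus.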
Synonyms can be useful in the following scenarios:
- Specialized terms that are not traditional English language synonyms, such as `NLP, Natural Language Processing`.
- Proper nouns with complex semantic associations. These are nouns that the general public is unlikely to understand, for example, in machine learning, `cost, loss, model performance`.
- Different forms of product names, for example, `Elastic Compute Cloud, EC2`.
- Domain-specific or business-specific terms, such as product names. For example, `Route53, DNS`.
Do not use synonyms in the following scenarios:
- Generic English language synonyms such as `leader, head`. These synonyms are not domain-specific, and using synonyms in these scenarios might have unintended effects.
- Typographical errors such as `teh => the`.
- Morphological variants like the plurals and possessives of nouns, the comparative and superlative forms of adjectives, and the past tense, past participle, and progressive forms of verbs. One example of comparative and superlative adjectives is `good, better, best`.
- Unigram (single word) stop words such as `WHO`. Unigram stop words are not allowed in the thesaurus and are excluded from search. For example, `WHO => World Health Organization` is rejected. However, you can use `W.H.O.` as a synonym term, and you can use stop words as part of a multi-word synonym. For example, `of` is not allowed but `United States of America` is accepted.
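The stop-word rule in the last item can be checked before uploading a thesaurus. The following is a minimal sketch of such a pre-check, not Amazon Kendra's own validation: the stop-word set is a small illustrative assumption (the actual list is not given here), case-insensitive matching is assumed, and the helper names and sample rules are hypothetical.

```python
from typing import List

# Illustrative stop words only; the actual list Amazon Kendra uses is not given here.
ASSUMED_STOP_WORDS = {"a", "an", "and", "of", "the", "who"}


def split_terms(rule: str) -> List[str]:
    """Split one thesaurus rule into its terms, on '=>' and commas."""
    terms: List[str] = []
    for side in rule.split("=>"):
        terms.extend(part.strip() for part in side.split(",") if part.strip())
    return terms


def find_unigram_stop_words(rule: str) -> List[str]:
    """Return the single-word terms in a rule that are (assumed) stop words."""
    return [
        term
        for term in split_terms(rule)
        if len(term.split()) == 1 and term.lower() in ASSUMED_STOP_WORDS
    ]


for rule in [
    "WHO => World Health Organization",
    "W.H.O., World Health Organization",
    "U.S.A., United States of America",
]:
    problems = find_unigram_stop_words(rule)
    status = f"rejected ({', '.join(problems)})" if problems else "accepted"
    print(f"{rule!r}: {status}")
```

Only the first rule is flagged here: `WHO` is a single word in the assumed stop-word set, while `W.H.O.` and the multi-word term containing `of` pass the check.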
Custom synonyms make it easy to improve Amazon Kendra's understanding of your business-specific terminology by expanding your queries to cover your business-specific synonyms. Although synonyms can improve search accuracy, it is important to understand how synonyms affect latency so you can optimize for this.
A general rule for synonyms is: the more terms in your query that are matched and expanded with synonyms, the greater the potential impact on latency. Other factors that affect latency include the average size of the documents indexed, the size of your index, any filtering on search results, and the overall load on your Amazon Kendra index. Queries that don’t match any synonyms are not affected.
A general guideline for how synonyms affect latency:
Use case | Increase in latency* |
---|---|
Typical natural language or keyword queries of 3 to 5 words each; 1 query term expands to 3 synonyms; index of about 500,000 documents (averaging 10.48 KB of extracted text per document) or 30,000 FAQ / question pairs | Less than 15 percent |
*Performance varies based on your specific use of synonyms and configurations on your index. It’s best to test search performance to obtain more accurate benchmarks for your specific use case.
If your thesaurus is large, has a high term expansion ratio, and your latency increase is not within acceptable boundaries, you can try one or both of the following:
- Trim your thesaurus to reduce the expansion ratio (number of synonyms per term).
- Trim the overall coverage of terms (number of lines in your thesaurus).
Alternatively, you can increase the provisioning capacity (virtual storage units) to offset the latency increase.
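Before trimming, it can help to measure the two quantities mentioned above. The sketch below is one possible way to do that and rests on assumptions: a group of n equivalent terms is counted as n - 1 synonyms per term, a `=>` rule is counted as one synonym per right-hand term for each left-hand term, and `#` lines are treated as comments. The function names and the sample text are hypothetical.

```python
from typing import Dict, List


def expansions_per_term(rule: str) -> Dict[str, int]:
    """Count how many synonyms each term in one rule expands to."""
    if "=>" in rule:
        left, right = rule.split("=>", 1)
        sources = [t.strip() for t in left.split(",") if t.strip()]
        targets = [t.strip() for t in right.split(",") if t.strip()]
        return {source: len(targets) for source in sources}
    terms = [t.strip() for t in rule.split(",") if t.strip()]
    return {term: len(terms) - 1 for term in terms}


def thesaurus_stats(text: str) -> Dict[str, float]:
    """Summarize the rule count and the synonyms-per-term figures."""
    rules = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    counts: List[int] = []
    for rule in rules:
        counts.extend(expansions_per_term(rule).values())
    return {
        "rules": len(rules),
        "terms": len(counts),
        "avg_synonyms_per_term": sum(counts) / len(counts) if counts else 0.0,
        "max_synonyms_per_term": max(counts, default=0),
    }


# Sample thesaurus built from the synonym examples on this page.
sample = """\
# sample thesaurus
Dynamo, Amazon DynamoDB
NLP, Natural Language Processing
cost, loss, model performance
"""
stats = thesaurus_stats(sample)
```

For this sample, `stats` reports 3 rules and 7 terms, with at most 2 synonyms per term. Rules with the highest per-term counts are the natural first candidates when reducing the expansion ratio.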