Creating a vocabulary filter - Amazon Transcribe

Creating a vocabulary filter

There are two options for creating a custom vocabulary filter:

  1. Save a list of line-separated words as a plain text file with UTF-8 encoding.

    • You can use this approach with the AWS Management Console, AWS CLI, or AWS SDKs.

    • If using the AWS Management Console, you can provide a local path or an Amazon S3 URI for your custom vocabulary file.

    • If using the AWS CLI or AWS SDKs, you must upload your custom vocabulary file to an Amazon S3 bucket and include the Amazon S3 URI in your request.

  2. Include a list of comma-separated words directly in your API request.

    • You can use this approach with the AWS CLI or AWS SDKs using the Words parameter.

For examples of each method, refer to Creating custom vocabulary filters

Things to note when creating your custom vocabulary filter:

  • Words aren't case sensitive. For example, "curse" and "CURSE" are treated the same.

  • Only exact word matches are filtered. For example, if your filter includes "swear" but your media contains the word "swears" or "swearing", these are not filtered. Only instances of "swear" are filtered. You must therefore include all variations of the words you want filtered.

  • Filters don't apply to words that are contained in other words. For example, if a custom vocabulary filter contains "marine" but not "submarine", "submarine" is not altered in the transcript.

  • Each entry can only contain one word (no spaces).

  • If you save your custom vocabulary filter as a text file, it must be in plain text format with UTF-8 encoding.

  • You can have up to 100 custom vocabulary filters per AWS account and each can be up to 50 Kb in size.

  • You can only use characters that are supported for your language. Refer to your language's character set for details.

Creating custom vocabulary filters

To process a custom vocabulary filter for use with Amazon Transcribe, see the following examples:

Before continuing, save your custom vocabulary filter as a text (*.txt) file. You can optionally upload your file to an Amazon S3 bucket.

  1. Sign in to the AWS Management Console.

  2. In the navigation pane, choose Vocabulary filtering. This opens the Vocabulary filters page where you can view existing custom vocabulary filters or create a new one.

  3. Select Create vocabulary filter.

    Amazon Transcribe console screenshot: the 'vocabulary filters' page.

    This takes you to the Create vocabulary filter page. Enter a name for your new custom vocabulary filter.

    Select the File upload or S3 location option under Vocabulary input source. Then specify the location of your custom vocabulary file.

    Amazon Transcribe console screenshot: the 'create vocabulary filter' page.
  4. Optionally, add tags to your custom vocabulary filter. Once you have all fields completed, select Create vocabulary filter at the bottom of the page. If there are no errors processing your file, this takes you back to the Vocabulary filters page.

    Your custom vocabulary filter is now ready to use.

This example uses the create-vocabulary-filter command to process a word list into a usable custom vocabulary filter. For more information, see CreateVocabularyFilter.

Option 1: You can include your list of words to your request using the words parameter.

aws transcribe create-vocabulary-filter \ --vocabulary-filter-name my-first-vocabulary-filter \ --language-code en-US \ --words profane,offensive,Amazon,Transcribe

Option 2: You can save your list of words as a text file and upload it to an Amazon S3 bucket, then include the file's URI in your request using the vocabulary-filter-file-uri parameter.

aws transcribe create-vocabulary-filter \ --vocabulary-filter-name my-first-vocabulary-filter \ --language-code en-US \ --vocabulary-filter-file-uri s3://DOC-EXAMPLE-BUCKET/my-vocabulary-filters/my-vocabulary-filter.txt

Here's another example using the create-vocabulary-filter command, and a request body that creates your custom vocabulary filter.

aws transcribe create-vocabulary-filter \ --cli-input-json file://filepath/my-first-vocab-filter.json

The file my-first-vocab-filter.json contains the following request body.

Option 1: You can include your list of words to your request using the Words parameter.

{ "VocabularyFilterName": "my-first-vocabulary-filter", "LanguageCode": "en-US", "Words": [ "profane","offensive","Amazon","Transcribe" ] }

Option 2: You can save your list of words as a text file and upload it to an Amazon S3 bucket, then include the file's URI in your request using the VocabularyFilterFileUri parameter.

{ "VocabularyFilterName": "my-first-vocabulary-filter", "LanguageCode": "en-US", "VocabularyFilterFileUri": "s3://DOC-EXAMPLE-BUCKET/my-vocabulary-filters/my-vocabulary-filter.txt" }
Note

If you include VocabularyFilterFileUri in your request, you cannot use Words; you must choose one or the other.

This example uses the AWS SDK for Python (Boto3) to create a custom vocabulary filter using the create_vocabulary_filter method. For more information, see CreateVocabularyFilter.

For additional examples using the AWS SDKs, including feature-specific, scenario, and cross-service examples, refer to the Code examples for Amazon Transcribe using AWS SDKs chapter.

Option 1: You can include your list of words to your request using the Words parameter.

from __future__ import print_function import time import boto3 transcribe = boto3.client('transcribe', 'us-west-2') vocab_name = "my-first-vocabulary-filter" response = transcribe.create_vocabulary_filter( LanguageCode = 'en-US', VocabularyFilterName = vocab_name, Words = [ 'profane','offensive','Amazon','Transcribe' ] )

Option 2: You can save your list of words as a text file and upload it to an Amazon S3 bucket, then include the file's URI in your request using the VocabularyFilterFileUri parameter.

from __future__ import print_function import time import boto3 transcribe = boto3.client('transcribe', 'us-west-2') vocab_name = "my-first-vocabulary-filter" response = transcribe.create_vocabulary_filter( LanguageCode = 'en-US', VocabularyFilterName = vocab_name, VocabularyFilterFileUri = 's3://DOC-EXAMPLE-BUCKET/my-vocabulary-filters/my-vocabulary-filter.txt' )
Note

If you include VocabularyFilterFileUri in your request, you cannot use Words; you must choose one or the other.

Note

If you create a new Amazon S3 bucket for your custom vocabulary filter files, make sure the IAM role making the CreateVocabularyFilter request has permissions to access this bucket. If the role doesn't have the correct permissions, your request fails. You can optionally specify an IAM role within your request by including the DataAccessRoleArn parameter. For more information on IAM roles and policies in Amazon Transcribe, see Amazon Transcribe identity-based policy examples.