Use built-in prompt datasets for automatic model evaluation in Amazon Bedrock
Amazon Bedrock provides multiple built-in prompt datasets that you can use in an automatic model evaluation job. Each built-in dataset is based on an open-source dataset. We have randomly downsampled each open-source dataset to include only 100 prompts.
When you create an automatic model evaluation job and choose a Task type, Amazon Bedrock provides you with a list of recommended metrics. For each metric, Amazon Bedrock also provides recommended built-in datasets. To learn more about available task types, see Model evaluation task types.
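When a job is created programmatically, the pairing of task type, metric, and built-in dataset appears in the job's evaluation configuration. The following is a minimal sketch of what such a configuration might look like; the field names and `Builtin.*` identifiers shown here are assumptions to verify against the Bedrock `CreateEvaluationJob` API reference, not confirmed values:

```python
# Hypothetical sketch of the "automated" evaluation configuration that would be
# passed to the Bedrock CreateEvaluationJob API (e.g. via boto3's
# create_evaluation_job). Field names and "Builtin.*" identifiers are
# assumptions, not verified values.
evaluation_config = {
    "automated": {
        "datasetMetricConfigs": [
            {
                "taskType": "Summarization",              # assumed task-type value
                "dataset": {"name": "Builtin.Gigaword"},  # recommended built-in dataset
                "metricNames": ["Builtin.Accuracy", "Builtin.Toxicity"],
            }
        ]
    }
}

# Structural sanity check: each entry pairs one dataset with at least one metric.
for cfg in evaluation_config["automated"]["datasetMetricConfigs"]:
    assert cfg["dataset"]["name"].startswith("Builtin.")
    assert len(cfg["metricNames"]) >= 1
```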
- Bias in Open-ended Language Generation Dataset (BOLD)

  The Bias in Open-ended Language Generation Dataset (BOLD) is a dataset that evaluates fairness in general text generation, focusing on five domains: profession, gender, race, religious ideologies, and political ideologies. It contains 23,679 different text generation prompts.

- RealToxicityPrompts

  RealToxicityPrompts is a dataset that evaluates toxicity. It attempts to get the model to generate racist, sexist, or otherwise toxic language. This dataset contains 100,000 different text generation prompts.

- T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples (TREX)

  TREX is a dataset consisting of Knowledge Base Triples (KBTs) extracted from Wikipedia. KBTs are a type of data structure used in natural language processing (NLP) and knowledge representation. They consist of a subject, predicate, and object, where the subject and object are linked by a relation. An example of a Knowledge Base Triple (KBT) is "George Washington was the president of the United States". The subject is "George Washington", the predicate is "was the president of", and the object is "the United States".

- WikiText2

  WikiText2 is a Hugging Face dataset that contains prompts used in general text generation.

- Gigaword

  The Gigaword dataset consists of news article headlines. This dataset is used in text summarization tasks.

- BoolQ

  BoolQ is a dataset consisting of yes/no question and answer pairs. Each prompt contains a short passage followed by a question about the passage. This dataset is recommended for use with the question and answer task type.

- Natural Questions

  Natural Questions is a dataset consisting of real user questions submitted to Google Search. This dataset is used in question and answer tasks.

- TriviaQA

  TriviaQA is a dataset that contains over 650K question-answer-evidence triples. This dataset is used in question and answer tasks.

- Women's E-Commerce Clothing Reviews

  Women's E-Commerce Clothing Reviews is a dataset that contains clothing reviews written by customers. This dataset is used in text classification tasks.
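The Knowledge Base Triples described under TREX above are simple enough to sketch in code. The following illustration (not Bedrock's or TREX's internal representation) models a KBT as a (subject, predicate, object) tuple using the George Washington example:

```python
from typing import NamedTuple

class KnowledgeBaseTriple(NamedTuple):
    """A Knowledge Base Triple: a subject and object linked by a predicate."""
    subject: str
    predicate: str
    obj: str  # named "obj" to avoid shadowing Python's built-in "object"

# The example from the TREX description above.
triple = KnowledgeBaseTriple(
    subject="George Washington",
    predicate="was the president of",
    obj="the United States",
)

# Rendering the triple back into a natural-language statement.
sentence = f"{triple.subject} {triple.predicate} {triple.obj}"
```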
In the following table, you can see the list of available datasets grouped by task type. To learn more about how automatic metrics are computed, see Review a model evaluation job.
| Task type | Metric | Built-in datasets | Computed metric |
|---|---|---|---|
| General text generation | Accuracy | TREX | Real world knowledge (RWK) score |
| General text generation | Robustness | BOLD, TREX, WikiText2 | Word error rate |
| General text generation | Toxicity | RealToxicityPrompts, BOLD | Toxicity |
| Text summarization | Accuracy | Gigaword | BERTScore |
| Text summarization | Toxicity | Gigaword | Toxicity |
| Text summarization | Robustness | Gigaword | BERTScore and deltaBERTScore |
| Question and answer | Accuracy | BoolQ, NaturalQuestions, TriviaQA | NLP-F1 |
| Question and answer | Robustness | BoolQ, NaturalQuestions, TriviaQA | F1 and deltaF1 |
| Question and answer | Toxicity | BoolQ, NaturalQuestions, TriviaQA | Toxicity |
| Text classification | Accuracy | Women's E-Commerce Clothing Reviews | Accuracy (binary accuracy from classification_accuracy_score) |
| Text classification | Robustness | Women's E-Commerce Clothing Reviews | classification_accuracy_score and delta_classification_accuracy_score |
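As a rough illustration of how an accuracy metric like NLP-F1 compares a model response against a reference answer, the following token-overlap F1 is one common formulation. This is a sketch of the general technique, not Bedrock's exact computation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens.

    A common formulation in question-answering evaluation; not necessarily the
    exact computation Bedrock performs for its NLP-F1 metric.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a response of "the cat sat" scored against the reference "the cat" has precision 2/3 and recall 1, giving an F1 of 0.8.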
To learn more about the requirements for creating custom prompt datasets, and to see examples, see Use custom prompt dataset for model evaluation in Amazon Bedrock.
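For orientation, a custom prompt dataset is supplied as a JSON Lines file with one record per prompt. The snippet below sketches what such records might look like; the key names (`prompt`, `referenceResponse`, `category`) are assumptions that should be verified against the linked custom prompt dataset documentation:

```python
import json

# Hypothetical custom prompt dataset records in JSON Lines form: each line of
# the file is one standalone JSON object. Key names are assumptions to verify
# against the Bedrock custom prompt dataset documentation.
records = [
    {
        "prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
        "referenceResponse": "A fox jumps over a dog.",
        "category": "Summarization",
    },
]

# Serialize to JSON Lines: one compact JSON object per line.
jsonl = "\n".join(json.dumps(r) for r in records)

# Every line must parse back into an object with at least a "prompt" key.
for line in jsonl.splitlines():
    assert "prompt" in json.loads(line)
```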