General text generation for model evaluation in Amazon Bedrock
General text generation is a task used by applications such as chatbots. The responses a model generates to general questions are influenced by the correctness, relevance, and bias of the text used to train the model.
Important
For general text generation, there is a known system issue that prevents Cohere models from completing the toxicity evaluation successfully.
The following built-in datasets contain prompts that are well-suited for use in general text generation tasks.
- Bias in Open-ended Language Generation Dataset (BOLD)

  The Bias in Open-ended Language Generation Dataset (BOLD) evaluates fairness in general text generation, focusing on five domains: profession, gender, race, religious ideologies, and political ideologies. It contains 23,679 different text generation prompts.

- RealToxicityPrompts

  RealToxicityPrompts is a dataset that evaluates toxicity. Its prompts attempt to get the model to generate racist, sexist, or otherwise toxic language. This dataset contains 100,000 different text generation prompts.

- T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples (TREX)

  TREX is a dataset consisting of Knowledge Base Triples (KBTs) extracted from Wikipedia. KBTs are a type of data structure used in natural language processing (NLP) and knowledge representation. They consist of a subject, predicate, and object, where the subject and object are linked by a relation. For example, in the Knowledge Base Triple "George Washington was the president of the United States", the subject is "George Washington", the predicate is "was the president of", and the object is "the United States". A short sketch of this structure follows the list.

- WikiText2

  WikiText2 is a Hugging Face dataset that contains prompts used in general text generation.
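For illustration only, a KBT like the TREX example above can be represented as a simple (subject, predicate, object) record. The following minimal Python sketch is not part of any Amazon Bedrock API or dataset format; it only shows the structure described in the TREX entry.

```python
from typing import NamedTuple

class KnowledgeBaseTriple(NamedTuple):
    """A (subject, predicate, object) triple, as used in the TREX dataset."""
    subject: str
    predicate: str
    object: str

# The example triple from the TREX description above.
example = KnowledgeBaseTriple(
    subject="George Washington",
    predicate="was the president of",
    object="the United States",
)
```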
The following table summarizes the metrics calculated and the recommended built-in datasets that are available for automatic model evaluation jobs. To specify an available built-in dataset using the AWS CLI or a supported AWS SDK, use the parameter names in the Built-in datasets (API) column; a usage sketch follows the table.
Task type | Metric | Built-in datasets (Console) | Built-in datasets (API) | Computed metric
---|---|---|---|---
General text generation | Accuracy | TREX | Builtin.T-REx | Real world knowledge (RWK) score
General text generation | Robustness | BOLD | Builtin.BOLD | Word error rate
General text generation | Robustness | WikiText2 | Builtin.WikiText2 | Word error rate
General text generation | Robustness | TREX | Builtin.T-REx | Word error rate
General text generation | Toxicity | RealToxicityPrompts | Builtin.RealToxicityPrompts | Toxicity
General text generation | Toxicity | BOLD | Builtin.Bold | Toxicity
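The following Python (boto3) sketch shows how these API dataset names might be passed when starting an automatic evaluation job with the CreateEvaluationJob operation. The job name, role ARN, model identifier, and S3 URI are placeholders, and the exact request shape (for example, whether model inference parameters are required) should be confirmed against the current API reference.

```python
import boto3

# Amazon Bedrock control-plane client (not bedrock-runtime).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Minimal sketch of an automatic evaluation job for general text generation.
# The built-in dataset names come from the "Built-in datasets (API)" column above.
response = bedrock.create_evaluation_job(
    jobName="general-text-generation-eval",                        # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",       # placeholder
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Generation",
                    "dataset": {"name": "Builtin.Bold"},
                    "metricNames": ["Builtin.Toxicity"],
                },
                {
                    "taskType": "Generation",
                    "dataset": {"name": "Builtin.T-REx"},
                    "metricNames": ["Builtin.Accuracy"],
                },
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                # Model-specific inference parameters can also be supplied as a
                # JSON string; check the API reference for your model.
                "bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}  # placeholder
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},   # placeholder
)
print(response["jobArn"])
```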
To learn more about how the computed metric for each built-in dataset is calculated, see Review model evaluation job reports and metrics in Amazon Bedrock.