General text generation for model evaluation in Amazon Bedrock

General text generation is a task performed by applications such as chatbots. A model's responses to general questions are influenced by the correctness, relevance, and bias of the text used to train it.

Important

For general text generation, there is a known system issue that prevents Cohere models from completing the toxicity evaluation successfully.

The following built-in datasets contain prompts that are well-suited for use in general text generation tasks.

Bias in Open-ended Language Generation Dataset (BOLD)

The Bias in Open-ended Language Generation Dataset (BOLD) is a dataset that evaluates fairness in general text generation, focusing on five domains: profession, gender, race, religious ideologies, and political ideologies. It contains 23,679 different text generation prompts.

RealToxicityPrompts

RealToxicityPrompts is a dataset that evaluates toxicity. It attempts to get the model to generate racist, sexist, or otherwise toxic language. This dataset contains 100,000 different text generation prompts.

T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples (TREX)

TREX is a dataset consisting of Knowledge Base Triples (KBTs) extracted from Wikipedia. KBTs are a type of data structure used in natural language processing (NLP) and knowledge representation. Each triple consists of a subject, a predicate, and an object, where the predicate is the relation that links the subject to the object. An example of a KBT is "George Washington was the president of the United States". The subject is "George Washington", the predicate is "was the president of", and the object is "the United States".
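
To make the subject, predicate, and object structure concrete, the following is a minimal Python sketch of a Knowledge Base Triple. The `Triple` class and its `as_sentence` method are hypothetical names chosen for illustration; they are not part of the TREX dataset or of any Amazon Bedrock API.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one Knowledge Base Triple (KBT)."""

    subject: str
    predicate: str
    obj: str  # named "obj" to avoid shadowing Python's built-in "object"

    def as_sentence(self) -> str:
        # Join the three parts back into the natural-language statement.
        return f"{self.subject} {self.predicate} {self.obj}"


# The example triple given in the text above.
triple = Triple(
    subject="George Washington",
    predicate="was the president of",
    obj="the United States",
)
sentence = triple.as_sentence()
print(sentence)
```

Keeping the three fields separate is what lets a knowledge-based check compare only the object (here, "the United States") against what a model generates for a given subject and predicate.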

WikiText2

WikiText2 is a Hugging Face dataset that contains prompts used in general text generation.

The following table summarizes the metrics calculated and the recommended built-in datasets that are available for automatic model evaluation jobs. To specify a built-in dataset using the AWS CLI or a supported AWS SDK, use the parameter names in the Built-in datasets (API) column.

Available built-in datasets for general text generation in Amazon Bedrock
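| Task type | Metric | Built-in datasets (Console) | Built-in datasets (API) | Computed metric |
| --- | --- | --- | --- | --- |
| General text generation | Accuracy | TREX | Builtin.T-REx | Real world knowledge (RWK) score |
| General text generation | Robustness | BOLD | Builtin.BOLD | Word error rate |
| General text generation | Robustness | WikiText2 | Builtin.WikiText2 | Word error rate |
| General text generation | Robustness | TREX | Builtin.T-REx | Word error rate |
| General text generation | Toxicity | RealToxicityPrompts | Builtin.RealToxicityPrompts | Toxicity |
| General text generation | Toxicity | BOLD | Builtin.Bold | Toxicity |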
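
As a rough illustration of how the API names in the table might be used, the following Python sketch looks up the API name for a metric and console dataset name and places it in a request fragment. The dataset names are copied verbatim from the table, including their capitalization. The function name `dataset_metric_config`, the field names `taskType`, `dataset`, and `metricNames`, and the values `Generation` and `Builtin.<Metric>` are assumptions about the shape of an evaluation job request; check them against the Amazon Bedrock API reference before use. The sketch only builds a dictionary and does not call AWS.

```python
# Console name -> API name for each metric, copied from the table.
BUILTIN_DATASETS = {
    "Accuracy": {"TREX": "Builtin.T-REx"},
    "Robustness": {
        "BOLD": "Builtin.BOLD",
        "WikiText2": "Builtin.WikiText2",
        "TREX": "Builtin.T-REx",
    },
    "Toxicity": {
        "RealToxicityPrompts": "Builtin.RealToxicityPrompts",
        "BOLD": "Builtin.Bold",
    },
}


def dataset_metric_config(metric: str, console_name: str) -> dict:
    """Build one dataset/metric entry for a general text generation job.

    Raises KeyError if the dataset is not listed for the given metric.
    """
    api_name = BUILTIN_DATASETS[metric][console_name]
    return {
        "taskType": "Generation",              # assumed task type value
        "dataset": {"name": api_name},         # API name from the table
        "metricNames": [f"Builtin.{metric}"],  # assumed metric name format
    }


accuracy_config = dataset_metric_config("Accuracy", "TREX")
toxicity_config = dataset_metric_config("Toxicity", "BOLD")
```

Keying the lookup by metric first reflects the table: the same console dataset (for example, BOLD) appears under more than one metric, and a dataset that is not listed for a metric fails fast with a `KeyError`.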

To learn more about how the computed metric for each built-in dataset is calculated, see Review model evaluation job reports and metrics in Amazon Bedrock.