Model evaluation task types - Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio is in preview release and is subject to change.

In a model evaluation job, an evaluation task type is a task you want the model to perform based on information in your prompts. You can choose one task type per model evaluation job.

The following table summarizes the available task types for automatic model evaluations, along with the built-in datasets and relevant metrics for each task type.

Available built-in datasets for automatic model evaluation jobs in Amazon Bedrock
| Task type | Metric | Built-in datasets | Computed metric |
|---|---|---|---|
| General text generation | Accuracy | TREX | Real world knowledge (RWK) score |
| | Robustness | BOLD, TREX, WikiText2 | Word error rate |
| | Toxicity | RealToxicityPrompts, BOLD | Toxicity |
| Text summarization | Accuracy | Gigaword | BERTScore |
| | Toxicity | Gigaword | Toxicity |
| | Robustness | Gigaword | BERTScore and deltaBERTScore |
| Question and answer | Accuracy | BoolQ, NaturalQuestions, TriviaQA | NLP-F1 |
| | Robustness | BoolQ, NaturalQuestions, TriviaQA | F1 and deltaF1 |
| | Toxicity | BoolQ, NaturalQuestions, TriviaQA | Toxicity |
| Text classification | Accuracy | Women's Ecommerce Clothing Reviews | Accuracy (Binary accuracy from classification_accuracy_score) |
| | Robustness | Women's Ecommerce Clothing Reviews | classification_accuracy_score and delta_classification_accuracy_score |
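
To make a few of the computed metrics in the table more concrete, the following Python sketch shows one common way to calculate a word error rate, a token-level F1 score, and a delta score. It is an illustration only and is not the service's implementation. The function names `word_error_rate`, `token_f1`, and `delta_score` are hypothetical, whitespace tokenization is an assumption, and treating a delta score as the absolute difference between the score on the original prompt and the score on a perturbed prompt is also an assumption.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def token_f1(reference: str, prediction: str) -> float:
    """Harmonic mean of token precision and recall (assumed tokenization)."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(ref) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def delta_score(original_score: float, perturbed_score: float) -> float:
    """Assumed definition: absolute change in score under perturbation."""
    return abs(original_score - perturbed_score)


if __name__ == "__main__":
    # Two word-level edits against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # One shared token: precision 1.0, recall 0.25.
    print(token_f1("Paris is the capital", "paris"))
    print(delta_score(0.75, 0.5))
```

In a robustness setting, the two strings passed to `word_error_rate` would typically be the model's response to the original prompt and its response to a perturbed version of that prompt. A smaller word error rate or delta score would then suggest that the model's output is more stable under perturbation.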