Amazon SageMaker Unified Studio is in preview release and is subject to change.
Model evaluation task types
In a model evaluation job, an evaluation task type is a task you want the model to perform based on information in your prompts. You can choose one task type per model evaluation job.
The following table summarizes the available task types for automatic model evaluations, along with the built-in datasets and relevant metrics for each task type.
Task type | Metric | Built-in datasets | Computed metric |
---|---|---|---|
General text generation | Accuracy | TREX | Real world knowledge (RWK) score |
General text generation | Robustness | TREX, WikiText2 | Word error rate |
General text generation | Toxicity | BOLD | Toxicity |
Text summarization | Accuracy | Gigaword | BERTScore |
Text summarization | Toxicity | Gigaword | Toxicity |
Text summarization | Robustness | Gigaword | BERTScore and deltaBERTScore |
Question and answer | Accuracy | BoolQ, NaturalQuestions, TriviaQA | NLP-F1 |
Question and answer | Robustness | BoolQ, NaturalQuestions, TriviaQA | F1 and deltaF1 |
Question and answer | Toxicity | BoolQ, NaturalQuestions, TriviaQA | Toxicity |
Text classification | Accuracy | Women's Ecommerce Clothing Reviews | Accuracy (Binary accuracy from classification_accuracy_score) |
Text classification | Robustness | Women's Ecommerce Clothing Reviews | classification_accuracy_score and delta_classification_accuracy_score |
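When planning an evaluation job, it can help to keep the table above in a small lookup structure so that a chosen task type can be checked against its metrics and built-in datasets. The following is a minimal sketch only: `TASK_TYPES`, `MetricInfo`, `metrics_for`, and `datasets_for` are hypothetical names made up for illustration and are not part of any AWS SDK or service API. The entries simply mirror the rows of the table.

```python
from typing import Dict, List, NamedTuple


class MetricInfo(NamedTuple):
    """Built-in datasets and the computed metric for one (task type, metric) pair."""
    datasets: List[str]
    computed_metric: str


# Hypothetical in-memory copy of the table above (assumption: not an AWS API).
QA_DATASETS = ["BoolQ", "NaturalQuestions", "TriviaQA"]
REVIEWS = ["Women's Ecommerce Clothing Reviews"]

TASK_TYPES: Dict[str, Dict[str, MetricInfo]] = {
    "General text generation": {
        "Accuracy": MetricInfo(["TREX"], "Real world knowledge (RWK) score"),
        "Robustness": MetricInfo(["TREX", "WikiText2"], "Word error rate"),
        "Toxicity": MetricInfo(["BOLD"], "Toxicity"),
    },
    "Text summarization": {
        "Accuracy": MetricInfo(["Gigaword"], "BERTScore"),
        "Toxicity": MetricInfo(["Gigaword"], "Toxicity"),
        "Robustness": MetricInfo(["Gigaword"], "BERTScore and deltaBERTScore"),
    },
    "Question and answer": {
        "Accuracy": MetricInfo(QA_DATASETS, "NLP-F1"),
        "Robustness": MetricInfo(QA_DATASETS, "F1 and deltaF1"),
        "Toxicity": MetricInfo(QA_DATASETS, "Toxicity"),
    },
    "Text classification": {
        "Accuracy": MetricInfo(
            REVIEWS, "Accuracy (Binary accuracy from classification_accuracy_score)"
        ),
        "Robustness": MetricInfo(
            REVIEWS,
            "classification_accuracy_score and delta_classification_accuracy_score",
        ),
    },
}


def metrics_for(task_type: str) -> Dict[str, MetricInfo]:
    """Return the metrics listed for a single task type (one task type per job)."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"Unknown task type: {task_type!r}")
    return TASK_TYPES[task_type]


def datasets_for(task_type: str, metric: str) -> List[str]:
    """Return the built-in datasets listed for a task type and metric."""
    metrics = metrics_for(task_type)
    if metric not in metrics:
        raise ValueError(f"{task_type!r} has no metric {metric!r}")
    return list(metrics[metric].datasets)
```

For example, `datasets_for("Question and answer", "Robustness")` returns the three question-and-answer datasets from the table, and an unlisted pair such as `("Text classification", "Toxicity")` raises `ValueError`.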