Choose the best performing model using Amazon Bedrock evaluations
Amazon Bedrock supports model evaluation jobs. The results of a model evaluation job allow you to compare model or inference profile outputs, and then choose the model best suited for your downstream generative AI applications.
Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization.
To evaluate a model's performance in an automatic model evaluation job, you can use either a built-in prompt dataset or your own custom prompt dataset. For model evaluation jobs that use human workers, you must provide your own prompt dataset.
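As an illustration, the following Python sketch assembles a custom prompt dataset in JSON Lines format, where each line holds a `prompt` and, optionally, a `referenceResponse` and `category`. The example prompts, file name, and bucket are placeholder assumptions; confirm the exact record schema for your task type before running a job.

```python
# Sketch: build a custom prompt dataset in the JSON Lines format that
# Amazon Bedrock model evaluation jobs expect. Each line is one JSON record
# with a "prompt" key; "referenceResponse" and "category" are optional.
# All values below are illustrative placeholders.
import json

records = [
    {
        "prompt": "Summarize the following support ticket: ...",
        "referenceResponse": "Customer reports intermittent login failures ...",
        "category": "Summarization",
    },
    {
        "prompt": "Classify the sentiment of this review: ...",
        "referenceResponse": "positive",
        "category": "Classification",
    },
]

with open("custom-prompts.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Upload custom-prompts.jsonl to an S3 bucket that your evaluation job's
# IAM service role can read, then reference its S3 URI in the job configuration.
```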
You can choose to create either an automatic model evaluation job or a model evaluation job that uses a human workforce.
Overview: Automatic model evaluation jobs
Automatic model evaluation jobs allow you to quickly evaluate a model's ability to perform a task. You can either provide your own custom prompt dataset that you've tailored to a specific use case, or you can use an available built-in dataset.
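For a programmatic starting point, the following sketch uses the AWS SDK for Python (Boto3) to start an automatic model evaluation job with the `CreateEvaluationJob` operation. The job name, IAM role ARN, model ID, S3 URIs, and metric names are placeholder assumptions that you would replace with values from your own account and use case.

```python
# Minimal sketch of starting an automatic model evaluation job with Boto3.
# Assumptions: the role ARN, model ID, S3 URIs, and metric names below are
# placeholders; the chosen built-in metrics must be valid for the task type.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="my-automatic-eval-job",
    roleArn="arn:aws:iam::111122223333:role/MyBedrockEvaluationRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "CustomPromptDataset",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/custom-prompts.jsonl"
                        },
                    },
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},
)

print("Started evaluation job:", response["jobArn"])
```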
Overview: Model evaluation jobs that use human workers
Model evaluation jobs that use human workers let you bring human input to the model evaluation process. The workers can be employees of your company or a group of subject-matter experts from your industry.
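A hedged sketch of the corresponding configuration is shown below. It assumes you have already set up a work team and a SageMaker flow definition; the ARNs, task type, rating method, and metric names are illustrative assumptions only.

```python
# Hedged sketch: starting a model evaluation job that uses human workers.
# Assumptions: the flow definition ARN, role ARN, S3 URIs, rating method,
# and metric names below are placeholders for illustration only.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="my-human-eval-job",
    roleArn="arn:aws:iam::111122223333:role/MyBedrockEvaluationRole",
    evaluationConfig={
        "human": {
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/my-flow",
                "instructions": "Rate each response for helpfulness.",
            },
            "customMetrics": [
                {
                    "name": "Helpfulness",
                    "description": "How helpful is the response?",
                    "ratingMethod": "IndividualLikertScale",
                }
            ],
            "datasetMetricConfigs": [
                {
                    "taskType": "Generation",
                    "dataset": {
                        "name": "CustomPromptDataset",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/custom-prompts.jsonl"
                        },
                    },
                    "metricNames": ["Helpfulness"],
                }
            ],
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/human-eval-results/"},
)
```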
This section shows you how to create and manage model evaluation jobs and describes the performance metrics you can use. It also covers the available built-in datasets and how to specify your own dataset. A brief programmatic sketch of the job-management operations follows the topic list below.
Topics
- Create a model evaluation job in Amazon Bedrock
- Stop a model evaluation job in Amazon Bedrock
- List model evaluation jobs in Amazon Bedrock
- Model evaluation task types in Amazon Bedrock
- Use prompt datasets for model evaluation in Amazon Bedrock
- Review model evaluation job reports and metrics in Amazon Bedrock
- Required permissions and IAM service roles for model evaluation in Amazon Bedrock
- Delete a model evaluation job in Amazon Bedrock
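The following sketch illustrates the job-management operations covered in the topics above (listing, inspecting, stopping, and deleting jobs) with the Boto3 `bedrock` client. The job ARN and status filter are placeholder assumptions.

```python
# Hedged sketch of common evaluation job-management calls with Boto3.
# The job ARN below is a placeholder.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# List evaluation jobs, optionally filtering by status.
for job in bedrock.list_evaluation_jobs(statusEquals="InProgress")["jobSummaries"]:
    print(job["jobName"], job["status"])

# Inspect a single job.
job_arn = "arn:aws:bedrock:us-east-1:111122223333:evaluation-job/abc123"
details = bedrock.get_evaluation_job(jobIdentifier=job_arn)
print(details["status"])

# Stop a job that is still in progress.
bedrock.stop_evaluation_job(jobIdentifier=job_arn)

# Delete finished jobs (the operation accepts a batch of job identifiers).
bedrock.batch_delete_evaluation_job(jobIdentifiers=[job_arn])
```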