Choose the best performing model using Amazon Bedrock evaluations

Amazon Bedrock supports model evaluation jobs. The results of a model evaluation job let you compare the outputs of different models or inference profiles, and then choose the model best suited for your downstream generative AI applications.

Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization.

For automatic model evaluation jobs, you can evaluate a model's performance using either built-in prompt datasets or your own custom prompt datasets. For model evaluation jobs that use human workers, you must provide your own dataset.
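When you bring your own prompt dataset, you store it in Amazon S3 as a JSON Lines file, with one prompt record per line. The snippet below is a minimal sketch of how such a file might be assembled in Python. The field names prompt, referenceResponse, and category are illustrative assumptions; which fields are required depends on the task type and evaluation method you choose, so check the dataset format documentation for your use case.

```python
import json

# Hypothetical example records for a question-answering evaluation.
# Which keys are required (prompt, referenceResponse, category, ...)
# depends on the task type and evaluation method you select.
records = [
    {
        "prompt": "What is the capital of France?",
        "referenceResponse": "Paris",
        "category": "Geography",
    },
    {
        "prompt": "Summarize the water cycle in one sentence.",
        "referenceResponse": "Water evaporates, condenses into clouds, and returns as precipitation.",
        "category": "Science",
    },
]

# Write the records as JSON Lines (one JSON object per line).
with open("my-prompt-dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

You would then upload the file to an S3 bucket that the evaluation job's IAM role can read, and reference its s3:// URI when you configure the job.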

You can choose to create either an automatic model evaluation job or a model evaluation job that uses a human workforce.

Overview: Automatic model evaluation jobs

Automatic model evaluation jobs allow you to quickly evaluate a model's ability to perform a task. You can either provide your own custom prompt dataset that you've tailored to a specific use case, or you can use an available built-in dataset.
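As an illustration, the sketch below shows how an automatic model evaluation job might be started programmatically with the boto3 Bedrock client's create_evaluation_job operation. The job name, IAM role ARN, model identifier, metric names, and S3 URIs are placeholders, and the exact shape of the configuration dictionaries is an assumption to verify against the current API reference.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Minimal sketch of an automatic evaluation job; all ARNs, names,
# model identifiers, and S3 URIs below are placeholders.
response = bedrock.create_evaluation_job(
    jobName="my-automatic-eval-job",
    roleArn="arn:aws:iam::111122223333:role/MyBedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    # Point to your own JSONL prompt dataset in S3,
                    # or reference a built-in dataset by name.
                    "dataset": {
                        "name": "my-prompt-dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/my-prompt-dataset.jsonl"
                        },
                    },
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},
)

print("Started evaluation job:", response["jobArn"])
```

The evaluation results are written to the S3 location given in outputDataConfig once the job completes.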

Overview: Model evaluation jobs that use human workers

Model evaluation jobs that use human workers let you bring human input to the model evaluation process. The workers can be employees of your company or a group of subject-matter experts from your industry.

This section shows you how to create and manage model evaluation jobs, and describes the kinds of performance metrics you can use, the available built-in datasets, and how to specify your own dataset.
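Once a job has been created, you can also track it programmatically. The following is a small sketch using the boto3 Bedrock client's list_evaluation_jobs and get_evaluation_job operations; the job ARN is a placeholder, and the specific response fields referenced here (jobSummaries, jobName, status) are assumptions to confirm against the current API reference.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# List recent evaluation jobs and print a short summary of each.
for summary in bedrock.list_evaluation_jobs(maxResults=10)["jobSummaries"]:
    print(summary["jobName"], summary["status"])

# Check on a single job by its ARN (placeholder shown here).
job = bedrock.get_evaluation_job(
    jobIdentifier="arn:aws:bedrock:us-east-1:111122223333:evaluation-job/abcdef123456"
)
print("Status:", job["status"])
```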