With a model evaluation job that uses a judge model, Amazon Bedrock uses an LLM to score another model's responses and provide an explanation of how it scored each prompt and response pair. Scores and explanations are available in the Amazon Bedrock console on the Evaluations page.
This kind of model evaluation requires two different models: a generator model and an evaluator model. You define prompts for the generator model in a dataset, and the evaluator model scores the responses to those prompts based on metrics you select.
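The prompt dataset is a JSON Lines file that you store in Amazon S3. As a minimal sketch, assuming the prompt, referenceResponse, and category dataset fields (the values and file name here are placeholders), you could build it like this:

```python
import json

# Hypothetical dataset records. "prompt" is required; "referenceResponse"
# and "category" are optional fields the judge model can use for scoring.
records = [
    {
        "prompt": "Summarize the store's return policy in one sentence.",
        "referenceResponse": "Items can be returned within 30 days with a receipt.",
        "category": "Summarization",
    },
    {
        "prompt": "Which currencies does the store accept?",
        "referenceResponse": "US dollars and euros.",
        "category": "QuestionAnswering",
    },
]

# Write one JSON object per line (JSON Lines), then upload the file to the
# Amazon S3 bucket that your evaluation job references.
with open("prompts.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```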
The metrics summary card in the console displays a histogram that shows the number of times a response received a certain score, along with score explanations for the first five prompts in your dataset. The full evaluation job report is available in the Amazon S3 bucket you specify when you create the model evaluation job.
When you create the model evaluation job, you can either select an Amazon Bedrock model as the generator model or evaluate a non-Amazon Bedrock model by providing your own inference response data in the prompt dataset. If you provide your own response data, Amazon Bedrock skips the model invocation step and directly evaluates the data you supply.
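For a bring-your-own-inference job, each dataset line also carries the response to be judged. A minimal sketch, assuming the modelResponses field layout used for jobs that supply their own response data (the response text and the model identifier label are placeholders):

```python
import json

# Hypothetical record for a job that evaluates responses you already generated.
# "modelResponses" holds output produced outside Amazon Bedrock, so the
# service skips model invocation and scores this text directly.
record = {
    "prompt": "Summarize the store's return policy in one sentence.",
    "referenceResponse": "Items can be returned within 30 days with a receipt.",
    "modelResponses": [
        {
            "response": "You can return items within 30 days if you keep the receipt.",
            "modelIdentifier": "my-external-model",  # any label you choose
        }
    ],
}

with open("byo-inference-prompts.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```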
To rate the generator model's responses, Amazon Bedrock provides a set of built-in metrics you can select from. Each metric uses a different prompt for the evaluator model. You can also define your own custom metrics for your particular business case. See Use metrics to understand model performance to learn more.
Supported models
Supported evaluator models (built-in metrics)
To create an evaluation job that uses an LLM as a judge with Amazon Bedrock's built-in metrics, you need access to at least one of the judge models in the following list. To learn more about gaining access to models and Region availability, see Access Amazon Bedrock foundation models.
- Mistral Large – mistral.mistral-large-2402-v1:0
- Anthropic Claude 3.5 Sonnet – anthropic.claude-3-5-sonnet-20240620-v1:0
- Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0
- Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0

Cross-Region inference profiles are supported for the listed models. To learn more, see Supported cross-Region inference profiles.
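For example, you can start a judge-based evaluation job programmatically with the CreateEvaluationJob API, naming one of the judge models above as the evaluator. The following boto3 sketch is illustrative only: the IAM role ARN, S3 URIs, generator model, and metric names are placeholders, and you should confirm the exact metric names in the metrics documentation.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="llm-judge-example-job",
    # Placeholder IAM role; replace with a role in your account that can
    # read the prompt dataset and write to the output bucket.
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "example-dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/prompts.jsonl"
                        },
                    },
                    # Example built-in judge metrics; confirm the exact names
                    # in the metrics documentation.
                    "metricNames": ["Builtin.Correctness", "Builtin.Helpfulness"],
                }
            ],
            # Evaluator (judge) model from the supported list above.
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    # Generator model whose responses the judge scores.
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "amazon.titan-text-lite-v1"}}
        ]
    },
    # The full evaluation report is written to this S3 location.
    outputDataConfig={"s3Uri": "s3://amzn-s3-demo-bucket/eval-results/"},
)
print(response["jobArn"])
```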
Supported evaluator models (custom metrics)
To create an evaluation job that uses an LLM as a judge with custom metrics, you need access to at least one of the judge models in the following list.
- Mistral Large 24.02 – mistral.mistral-large-2402-v1:0
- Mistral Large 24.07 – mistral.mistral-large-2407-v1:0
- Anthropic Claude 3.5 Sonnet v1 – anthropic.claude-3-5-sonnet-20240620-v1:0
- Anthropic Claude 3.5 Sonnet v2 – anthropic.claude-3-5-sonnet-20241022-v2:0
- Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0
- Anthropic Claude 3.5 Haiku – anthropic.claude-3-5-haiku-20241022-v1:0
- Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0
- Meta Llama 3.3 70B Instruct – meta.llama3-3-70b-instruct-v1:0
- Amazon Nova Pro – amazon.nova-pro-v1:0

Cross-Region inference profiles are supported for the listed models. To learn more, see Supported cross-Region inference profiles.
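With custom metrics, the automated section of evaluationConfig carries your metric definitions alongside the judge model. The fragment below is a sketch, assuming the customMetricConfig layout in the CreateEvaluationJob API; the metric name, instructions (including the {{prompt}} and {{prediction}} template variables), and rating scale are illustrative placeholders.

```python
# Illustrative "customMetricConfig" fragment for the "automated" section of
# evaluationConfig in a CreateEvaluationJob request. The metric name,
# instructions, and rating scale are placeholders for your own definitions.
custom_metric_config = {
    "customMetrics": [
        {
            "customMetricDefinition": {
                "name": "BrandVoice",
                "instructions": (
                    "Rate how well the response in {{prediction}} matches the "
                    "company's brand voice for the prompt in {{prompt}}."
                ),
                "ratingScale": [
                    {"definition": "On brand", "value": {"floatValue": 1.0}},
                    {"definition": "Off brand", "value": {"floatValue": 0.0}},
                ],
            }
        }
    ],
    # Judge model from the supported list above.
    "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
            {"modelIdentifier": "amazon.nova-pro-v1:0"}
        ]
    },
}
```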
Supported generator models
You can use the following model types in Amazon Bedrock as the generator model in an evaluation job. You can also bring your own inference response data from non-Amazon Bedrock models.
- Foundation models – Amazon Bedrock foundation model information
- Amazon Bedrock Marketplace models – Amazon Bedrock Marketplace
- Customized foundation models – Customize your model to improve its performance for your use case
- Imported foundation models – Import a customized model into Amazon Bedrock
- Prompt routers – Understanding intelligent prompt routing in Amazon Bedrock
- Models for which you have purchased Provisioned Throughput – Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock