Review metrics for knowledge base evaluations that use LLMs (console)

You can review the metrics presented in a report for a knowledge base evaluation job using the Amazon Bedrock console.

Knowledge base evaluations that use Large Language Models (LLMs) compute evaluation metrics that assess how well the knowledge base retrieves information and generates responses.

In your knowledge base evaluation report card, you see the metrics and the metric breakdown graphs relevant to your evaluation type, either retrieval only or retrieval with response generation. Different metrics are relevant to different evaluation types. The computed score for each metric is the average score for the retrieved texts or generated responses across all the user queries in your prompt dataset. Each computed score is a value between 0 and 1; the closer to 1, the more strongly that metric’s characteristic appears in the retrieved texts or responses. The breakdown graph for each metric plots a histogram that counts how many retrieved texts or responses fall within each score range.
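The averaging and binning described above can be illustrated with a short Python sketch. The per-query scores below are hypothetical values, not output from an actual evaluation job, and the 0.1-wide score ranges are an assumption about how the console groups the histogram bins.

import statistics
from collections import Counter

# Hypothetical per-query scores for one metric (for example, Completeness).
# Each value is the judge score for a single query's response, from 0 to 1.
per_query_scores = [0.71, 0.78, 0.83, 0.92, 0.64, 0.75, 0.88, 0.79]

# The report card's computed score is the average across all queries.
average_score = statistics.mean(per_query_scores)

# The breakdown graph bins the same scores into ranges (assumed width 0.1)
# and counts how many responses fall in each range.
def bin_label(score, width=0.1):
    lower = min(int(score / width) * width, 1.0 - width)
    return f"{lower:.1f}-{lower + width:.1f}"

histogram = Counter(bin_label(s) for s in per_query_scores)

print(f"Computed score: {average_score:.2f}")
for score_range, count in sorted(histogram.items()):
    print(f"{score_range}: {count} response(s)")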

For example, suppose you created an evaluation job to evaluate retrieval with response generation. The console report card shows a computed Completeness score of 0.82. Completeness measures how well generated responses address all aspects of users’ questions, and the score is an average across responses to the questions in all prompts in your dataset. The histogram for Completeness shows that most responses (the tallest bar) fall within the 0.7 to 0.8 score range. However, the knowledge base also scored high for Stereotyping, which measures generalized statements in the responses, at 0.94 on average. The knowledge base generates fairly complete responses most of the time, but those responses include a high number of generalized statements about individuals or groups of people.
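As a rough illustration of reading these report card numbers side by side, the following sketch flags metrics where a higher average score indicates an unwanted characteristic. The metric values echo the example above, and the 0.5 flag threshold is an arbitrary assumption for illustration, not a threshold defined by Amazon Bedrock.

# Hypothetical report card scores based on the example above.
report_card = {
    "Completeness": 0.82,
    "Correctness": 0.77,
    "Harmfulness": 0.08,
    "Stereotyping": 0.94,
}

# For quality metrics, higher is better; for these responsible AI metrics,
# a higher score means the unwanted characteristic appears more often.
higher_is_worse = {"Harmfulness", "Stereotyping", "Refusal"}

for metric, score in report_card.items():
    if metric in higher_is_worse and score >= 0.5:  # illustrative threshold
        print(f"Review needed: {metric} averaged {score:.2f} across responses")
    else:
        print(f"{metric}: {score:.2f}")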

Report card for knowledge base evaluations that use LLMs

Follow these steps to open the report card in the Amazon Bedrock console for knowledge base evaluation jobs that use LLMs. Refer to the information below for each metric relevant to the retrieval only and retrieval with response generation evaluation types. A programmatic way to check that a job has finished is sketched after these steps.

  • Sign in to the AWS Management Console and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.

  • Choose Evaluations from the navigation pane, then choose Knowledge base evaluation.

  • Select the name of your knowledge base evaluation job. You will be directed to the report card, which is the main page of the knowledge base evaluation.

    Note

    To open the report card, the status of your knowledge base evaluation must be either ready or available.
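If you want to confirm from code that a job has finished before opening its report card, the following sketch assumes the AWS SDK for Python (boto3) and the Bedrock GetEvaluationJob operation. The job ARN is a placeholder, and the status values returned by the API may be labeled differently from the wording shown in the console.

import boto3

# Placeholder ARN; replace with the ARN of your knowledge base evaluation job.
JOB_ARN = "arn:aws:bedrock:us-east-1:111122223333:evaluation-job/EXAMPLE"

bedrock = boto3.client("bedrock")

# GetEvaluationJob returns the job's current status; the report card in the
# console is available once the job has finished successfully.
response = bedrock.get_evaluation_job(jobIdentifier=JOB_ARN)
status = response["status"]

if status == "Completed":
    print("The evaluation job is finished; open its report card in the console.")
else:
    print(f"The evaluation job is not ready yet (status: {status}).")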

Metrics relevant to retrieval only type evaluations

The following metrics are relevant to evaluating your knowledge base’s ability to retrieve highly relevant information.

Context relevance

This metric is relevant to the quality of the retrieved information. The score is an average score for retrieved text chunks across all prompts in your dataset. Context relevance means the retrieved text chunks are contextually relevant to the questions. The higher the score, the more contextually relevant the information is on average. The lower the score, the less contextually relevant the information is on average.

Context coverage (requires ground truth)

This metric is relevant to the quality of the retrieved information. The score is an average score for retrieved text chunks across all prompts in your dataset. Context coverage means the retrieved text chunks cover all the information provided in the ground truth texts. The higher the score, the more context coverage on average. The lower the score, the less context coverage on average.

Metrics relevant to retrieval with response generation type evaluations

The following metrics are relevant to evaluating your knowledge base’s ability to generate useful, appropriate responses based on the retrieved information.

Correctness

This metric is relevant to the quality of the generated responses. The score is an average score for responses across all prompts in your dataset. Correctness means accurately answering the questions. The higher the score, the more correct the generated responses are on average. The lower the score, the less correct the generated responses are on average.

Completeness

This metric is relevant to the quality of the generated responses. The score is an average score for responses across all prompts in your dataset. Completeness means answering and resolving all aspects of the questions. The higher the score, the more complete the generated responses are on average. The lower the score, the less complete the generated responses are on average.

Helpfulness

This metric is relevant to the quality of the generated responses. The score is an average score for responses across all prompts in your dataset. Helpfulness means generating responses that are holistically useful in answering the questions. The higher the score, the more helpful the generated responses are on average. The lower the score, the less helpful the generated responses are on average.

Logical coherence

This metric is relevant to the quality of the generated responses. The score is an average score for responses across all prompts in your dataset. Logical coherence means responses are free from logical gaps, inconsistencies, or contradictions. The higher the score, the more coherent the generated responses are on average. The lower the score, the less coherent the generated responses are on average.

Faithfulness

This metric is relevant to the quality of the generated responses. The score is an average score for responses across all prompts in your dataset. Faithfulness means avoiding hallucination with respect to the retrieved text chunks. The higher the score, the more faithful the generated responses are on average. The lower the score, the less faithful the generated responses are on average.

Harmfulness

This metric is relevant to the appropriateness of the generated responses. The score is an average score for responses across all prompts in your dataset. Harmfulness means making hateful, insulting, or violent statements. The higher the score, the more harmful the generated responses are on average. The lower the score, the less harmful the generated responses are on average.

Stereotyping

This metric is relevant to the appropriateness of the generated responses. The score is an average score for responses across all prompts in your dataset. Stereotyping means making generalized statements about individuals or groups of people. The higher the score, the more stereotyping in the generated responses on average. The lower the score, the less stereotyping in the generated responses on average. Note that a strong presence of both flattering and derogatory stereotypes will result in a high score.

Refusal

This metric is relevant to the appropriateness of the generated responses. The score is an average score for responses across all prompts in your dataset. Refusal means giving evasive responses to the questions. The higher the score, the more evasive the generated responses are on average. The lower the score, the less evasive the generated responses are on average.