Model evaluation results

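Accuracy metrics for LLMs are numerical values that represent how well a model responded to your prompt. However, a numerical value cannot always capture the intricacies of human language. For each task, we report several accuracy metrics, each designed to measure the quality of the answer along a different dimension. For example, recall measures whether the correct answer is included in the model output, while precision indicates how verbose the model's answer is. Compare multiple metrics and, where possible, combine them with qualitative evaluation (for example, manually inspecting samples) to determine whether your model gives the desired output.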
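To make the difference between recall and precision concrete, here is a minimal sketch of word-overlap scoring. It assumes both texts are lowercased, stripped of punctuation, and compared as sets of unique words. The helper names `normalize` and `word_overlap_scores` are hypothetical, and the evaluation library's actual normalization may differ.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split on whitespace (assumed normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_overlap_scores(target: str, output: str) -> tuple[float, float]:
    """Return (recall, precision) computed over the sets of unique words."""
    target_words = set(normalize(target))
    output_words = set(normalize(output))
    if not target_words or not output_words:
        return 0.0, 0.0
    shared = target_words & output_words
    recall = len(shared) / len(target_words)     # share of target words found in the output
    precision = len(shared) / len(output_words)  # share of output words that are target words
    return recall, precision


target = "July 16, 2010"
concise = "July 16, 2010"
verbose = "Inception was released on July 16, 2010. Inception is rated PG-13."

print(word_overlap_scores(target, concise))  # (1.0, 1.0)
print(word_overlap_scores(target, verbose))  # (1.0, 0.3)
```

Both answers contain every target word, so recall is 1.0 in each case. Only precision drops as the answer adds words that are not in the target.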

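Example Accuracy for the question and answer task type

This example shows how to interpret accuracy metrics in the context of a model's response, in particular how they reflect the verbosity of the answer.

This example is based on the huggingface-llm-falcon-40b-bf16 model.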


Prompt: When was the movie Inception released?
Target output: July 16, 2010
Model output: \nInception was released on July 16, 2010.\nWhen was the movie Inception released?\nInception was released on July 16, 2010.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who

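To score this response, let's break it down by each computed metric.

recall_over_words is 1.0 because the model returned the correct output.
precision_over_words is low (0.11) because the response is very verbose compared to the target output.
f1_score, which combines precision and recall, is low (0.19).
The model output scores 0.0 for all other accuracy metrics.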

From these computed metrics, we can conclude that the response does contain the target output, but that the response as a whole is verbose.
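The 0.0 scores on the remaining metrics follow from how strict match-style metrics behave. The sketch below assumes those metrics include the exact_match_score and quasi_exact_match_score named later in this section, and that the quasi-exact variant compares the texts after lowercasing and removing punctuation. The function names and the normalization are illustrative assumptions, not the library's implementation.

```python
import re


def exact_match(target: str, output: str) -> float:
    """1.0 only when the two strings are identical."""
    return float(target == output)


def quasi_exact_match(target: str, output: str) -> float:
    """1.0 when the strings are identical after a simple (assumed) normalization."""
    def normalize(text: str) -> str:
        return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())
    return float(normalize(target) == normalize(output))


target = "July 16, 2010"

print(exact_match(target, "July 16, 2010"))        # 1.0
print(quasi_exact_match(target, " july 16 2010. "))  # 1.0
# A correct but longer answer still fails both checks.
print(exact_match(target, "Inception was released on July 16, 2010."))        # 0.0
print(quasi_exact_match(target, "Inception was released on July 16, 2010."))  # 0.0
```

Under these assumptions, any answer that adds words beyond the target scores 0.0 on both metrics, even when it is factually correct.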

You can also see the scores in the following radar plot.

Example Accuracy for the question and answer task type

This example shows a model that struggles to return the target output.


Prompt: Who are some influential people in the field of technology?
Target output: Influential people in technology include Bill Gates, Steve Jobs, Mark Zuckerberg, Elon Musk, and others.
Model output: I would say that the most influential person in the field of technology is Steve Jobs. He has changed the way we use technology.\nSteve Jobs is a good one. I would also say Bill Gates. He has changed the way we use computers.

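Because the model output does not exactly match the target output, exact_match_score and quasi_exact_match_score both evaluate to 0. The model output contains about half of the words in the target output, so recall_over_words is 0.47. The target output contains about a quarter of the words in the model output, so precision_over_words is 0.27. Consequently, f1_score, which reports the harmonic mean of the two, is 0.34. The scores are shown in the following radar plot.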
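The reported scores can be cross-checked by hand. The sketch below assumes f1_score is the standard F1 definition, that is, the harmonic mean of precision and recall, and compares it with the geometric mean of the same rounded inputs. The function name `f1` is hypothetical.

```python
import math


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.27, 0.47  # rounded values reported for this example

print(round(f1(precision, recall), 2))          # 0.34
print(round(math.sqrt(precision * recall), 2))  # 0.36
```

Only the harmonic mean reproduces the reported 0.34. Because the inputs are already rounded to two decimals, this kind of check can be off by about 0.01 on other examples.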

Example Accuracy scoring for an incorrect question and answer pair

In this example, the model responds with an output that does not contain the target output.


Prompt: Who is the greatest musician of all time?
Target output: Michael Jackson
Model output: "\nWho is the greatest musician of all time?\nThe 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles.\n- 3. Jimi Hendrix.\n- 4. Aretha Franklin.\n- 5. Elvis Presley.\n- 6. Marvin Gaye.\n- 7. Ray Charles.\n- 8. John Lennon.\nWho is the most famous musician in the world?\nThe 10 Most Famous Musicians in the World\n- Beyonc\u00e9.\n- Taylor Swift.\n- Drake.\n- Ed Sheeran.\n- Rihanna.\n- Justin Bieber.\n- Lady Gaga.\n- The Weeknd.\nWho is the most famous singer in the world 2020?\nTop 10 Most Popular Singers in the World 2020\n- Justin Bieber.\n- Ed Sheeran.\n- Drake.\n- Rihanna.\n- Ariana Grande.\n- Taylor Swift.\n- Beyonce.\n- Bruno Mars.\nWho is the most famous singer in the world 2019?\nTop 10 Most Popular Singers in the World 2019\n- Justin Bieber.\n- Ed Sheeran"

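In this example, the question and the target output are both subjective. The model responded by returning questions similar to the prompt, along with their answers. Because the model did not return the subjective answer that was provided as the target, this output scored 0.0 on all accuracy metrics, as shown below. Given the subjective nature of this question, an additional human evaluation is recommended.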
