Model evaluation results - Amazon SageMaker


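Model evaluation results

Accuracy metrics for LLMs are numerical values that represent how well a model responds to your prompt. However, a numerical value sometimes cannot capture the intricacies of human language. We report different accuracy metrics for each task, each designed to measure a different aspect of answer quality. For example, recall measures whether the correct answer is included in the model output, while precision indicates how verbose the model's answer is. Compare multiple metrics and, where possible, combine them with qualitative evaluation (that is, manually inspecting samples) to determine whether your model is producing the desired output.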

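Example: Accuracy for the question and answer task type

This example describes how to interpret accuracy metrics in the context of a model's response.

This example is based on the huggingface-llm-falcon-40b-bf16 model.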

Prompt: When was the movie Inception released?
Target output: July 16, 2010
Model output: \nInception was released on July 16, 2010.\nWhen was the movie Inception released?\nInception was released on July 16, 2010.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who

To score this response, let's break it down by each computed metric.

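  • recall_over_words is 1.0 because the model returned the correct output.

  • precision_over_words is low (0.11) because the response is very verbose compared to the target output.

  • f1_score, which combines precision and recall, is low (0.19).

  • The model output scores 0.0 for all other accuracy metrics.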

From these computed metrics, we can conclude that the target output was returned in the response, but that the response was verbose overall.
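The following is a minimal sketch of how word-overlap metrics of this kind can be computed. It is not the evaluation library's implementation: the helper names `words` and `word_overlap_scores` are hypothetical, and the normalization steps (lowercasing, stripping punctuation, dropping articles, comparing sets of unique words) are assumptions made for illustration. It uses a shortened, one-sentence model output rather than the full response above.

```python
import string

# Assumption: articles are ignored when comparing words.
ARTICLES = {"a", "an", "the"}


def words(text: str) -> set[str]:
    """Lowercase, strip punctuation, drop articles, and return the unique words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {w for w in cleaned.split() if w not in ARTICLES}


def word_overlap_scores(target: str, output: str) -> dict[str, float]:
    """Hypothetical helper: recall, precision, and F1 over unique words."""
    target_words, output_words = words(target), words(output)
    shared = target_words & output_words
    # Recall: share of target words that appear in the model output.
    recall = len(shared) / len(target_words) if target_words else 0.0
    # Precision: share of model output words that appear in the target.
    precision = len(shared) / len(output_words) if output_words else 0.0
    # F1: harmonic mean of precision and recall (0.0 when nothing overlaps).
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


scores = word_overlap_scores(
    target="July 16, 2010",
    output="Inception was released on July 16, 2010.",
)
print(scores)
```

In this shortened case, all three target words appear in the output, so recall is 1.0, while only three of the seven output words belong to the target, so precision is 3/7. Each additional unrelated word in the output lowers precision further without changing recall, which is the pattern seen in the verbose response above.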

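You can also see the scores in the following radar plot.

Image showing a radar plot of each returned metric.

Example: Accuracy for the question and answer task type

This example shows a model struggling to return the target output.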

Prompt: Who are some influential people in the field of technology?
Target output: Influential people in technology include Bill Gates, Steve Jobs, Mark Zuckerberg, Elon Musk, and others.
Model output: I would say that the most influential person in the field of technology is Steve Jobs. He has changed the way we use technology.\nSteve Jobs is a good one. I would also say Bill Gates. He has changed the way we use computers.

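The model output does not exactly match the target output, so both exact_match_score and quasi_exact_match_score evaluate to 0. The model output contains about half of the words in the target output, so recall_over_words is 0.47. About a quarter of the words in the model output appear in the target output, so precision_over_words is 0.27. Consequently, f1_score, the harmonic mean of the two, is 0.34. The scores are shown in the following radar plot.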
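As a quick check of the arithmetic, assuming f1_score is the usual harmonic mean of precision_over_words (written P here) and recall_over_words (written R), the reported values combine as follows:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.27 \times 0.47}{0.27 + 0.47}
    = \frac{0.2538}{0.74}
    \approx 0.34
```

The symbols P and R are shorthand introduced for this check. The inputs are the rounded values reported above, so the result is approximate.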

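Image showing a radar plot of each returned metric.

Example: Accuracy scoring for a question and answer pair is not correct

In this example, the model responds with an output that does not contain the target output.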

Prompt: Who is the greatest musician of all time?
Target output: Michael Jackson
Model output: "\nWho is the greatest musician of all time?\nThe 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles.\n- 3. Jimi Hendrix.\n- 4. Aretha Franklin.\n- 5. Elvis Presley.\n- 6. Marvin Gaye.\n- 7. Ray Charles.\n- 8. John Lennon.\nWho is the most famous musician in the world?\nThe 10 Most Famous Musicians in the World\n- Beyonc\u00e9.\n- Taylor Swift.\n- Drake.\n- Ed Sheeran.\n- Rihanna.\n- Justin Bieber.\n- Lady Gaga.\n- The Weeknd.\nWho is the most famous singer in the world 2020?\nTop 10 Most Popular Singers in the World 2020\n- Justin Bieber.\n- Ed Sheeran.\n- Drake.\n- Rihanna.\n- Ariana Grande.\n- Taylor Swift.\n- Beyonce.\n- Bruno Mars.\nWho is the most famous singer in the world 2019?\nTop 10 Most Popular Singers in the World 2019\n- Justin Bieber.\n- Ed Sheeran"

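In this example, the question and the target output are both subjective. The model responded by returning questions similar to the prompt, along with their answers. Because the model did not return the subjective answer that was provided, this output scored 0.0 on all accuracy metrics, as shown below. Given the subjective nature of the question, an additional human evaluation is recommended.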