Model evaluation results

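Accuracy metrics for LLMs are numerical values that represent how well a model responded to a prompt. A single number, however, cannot always capture the intricacies of human language. We report several accuracy metrics for each task, each designed to measure a different aspect of answer quality. For example, recall measures whether the correct answer is included in the model output, while precision indicates how verbose the model's answer is. Compare multiple metrics and, where possible, combine them with qualitative evaluation (that is, manually inspecting samples) to determine whether the model is producing the desired output.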

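Example: Accuracy for the question and answer task type

This example shows how to interpret accuracy metrics in the context of a model's response.

This example is based on the huggingface-llm-falcon-40b-bf16 model.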


Prompt: When was the movie Inception released?
Target output: July 16, 2010
Model output: \nInception was released on July 16, 2010.\nWhen was the movie Inception released?\nInception was released on July 16, 2010.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who“

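To score this response, let's break it down by each computed metric.

recall_over_words is 1.0 because the model returned the correct output.
precision_over_words is low (0.11) because the response is very verbose compared to the target output.
f1_score, which combines precision and recall, is low (0.19).
The model output scores 0.0 on all other accuracy metrics.

From these computed metrics we can conclude that the target output was returned in the response, but that the response as a whole was verbose.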
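The relationship between recall, precision, and verbosity can be illustrated with a minimal sketch. It assumes that the word-level metrics compare sets of distinct words after lowercasing and stripping punctuation; the evaluation library's actual normalization may differ. The helper names `normalize_words` and `word_overlap_scores` are hypothetical, and the model output is shortened to its first sentence, so the precision here is higher than the 0.11 reported for the full, repetitive output.

```python
import re


def normalize_words(text: str) -> set:
    """Lowercase, replace punctuation with spaces, return the distinct words."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return set(cleaned.split())


def word_overlap_scores(target: str, output: str) -> dict:
    """Set-based recall, precision, and F1 between target and model output."""
    target_words = normalize_words(target)
    output_words = normalize_words(output)
    shared = target_words & output_words
    # Recall: share of target words that appear in the model output.
    recall = len(shared) / len(target_words) if target_words else 0.0
    # Precision: share of model output words that appear in the target.
    precision = len(shared) / len(output_words) if output_words else 0.0
    # F1: harmonic mean of precision and recall (0.0 when nothing overlaps).
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Assumption: only the first sentence of the model output is used here.
scores = word_overlap_scores(
    target="July 16, 2010",
    output="Inception was released on July 16, 2010.",
)
print(scores)
```

All three target words appear in the output, so recall is 1.0, while only 3 of the 7 distinct output words belong to the target, so precision is about 0.43. Every additional off-target word the model emits lowers precision further without affecting recall.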

You can also see the scores in the following radar plot.

Example: Accuracy for the question and answer task type

This example shows the model struggling to return the target output.


Prompt: Who are some influential people in the field of technology?
Target output: Influential people in technology include Bill Gates, Steve Jobs, Mark Zuckerberg, Elon Musk, and others.
Model output: I would say that the most influential person in the field of technology is Steve Jobs. He has changed the way we use technology.\nSteve Jobs is a good one. I would also say Bill Gates. He has changed the way we use computers.

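The model output does not exactly match the target output, so both exact_match_score and quasi_exact_match_score evaluate to 0. Because the model's output contains about half of the words in the target output, recall_over_words is 0.47. Only about a quarter of the words in the model output appear in the target output, so precision_over_words is 0.27. Consequently, the harmonic mean of the two, reported as f1_score, is 0.34. The scores are shown in the following radar plot.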
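The F1 score is conventionally defined as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The following short sketch checks that the reported precision (0.27) and recall (0.47) reproduce the reported f1_score of 0.34, and prints the geometric mean for comparison. The function name `harmonic_mean` is illustrative and not part of any library.

```python
import math


def harmonic_mean(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for this example.
precision, recall = 0.27, 0.47

f1 = harmonic_mean(precision, recall)
geometric = math.sqrt(precision * recall)

print(f"harmonic mean:  {f1:.2f}")         # harmonic mean:  0.34
print(f"geometric mean: {geometric:.2f}")  # geometric mean: 0.36
```

Only the harmonic mean matches the reported 0.34. Because the harmonic mean is pulled toward the smaller of the two values, a verbose answer with low precision keeps F1 low even when recall is high.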

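Example: Incorrect accuracy scoring for a question and answer pair

In this example, the model responds with an output that does not contain the target output.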


Prompt: Who is the greatest musician of all time?
Target output: Michael Jackson
Model output: "\nWho is the greatest musician of all time?\nThe 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles.\n- 3. Jimi Hendrix.\n- 4. Aretha Franklin.\n- 5. Elvis Presley.\n- 6. Marvin Gaye.\n- 7. Ray Charles.\n- 8. John Lennon.\nWho is the most famous musician in the world?\nThe 10 Most Famous Musicians in the World\n- Beyonc\u00e9.\n- Taylor Swift.\n- Drake.\n- Ed Sheeran.\n- Rihanna.\n- Justin Bieber.\n- Lady Gaga.\n- The Weeknd.\nWho is the most famous singer in the world 2020?\nTop 10 Most Popular Singers in the World 2020\n- Justin Bieber.\n- Ed Sheeran.\n- Drake.\n- Rihanna.\n- Ariana Grande.\n- Taylor Swift.\n- Beyonce.\n- Bruno Mars.\nWho is the most famous singer in the world 2019?\nTop 10 Most Popular Singers in the World 2019\n- Justin Bieber.\n- Ed Sheeran"

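In this example, both the question and the target output are subjective. The model responded by returning questions similar to the prompt, together with their answers. Because the model did not return the subjective answer that was provided, this output scored 0.0 on all accuracy metrics. Given the subjective nature of this question, additional human evaluation is recommended.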
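One way to act on this recommendation is to flag records whose accuracy scores are all zero for manual inspection. The sketch below is illustrative only. It assumes that the quasi-exact match compares strings after lowercasing and removing punctuation and articles, which may not match the evaluation library's exact normalization. All function names, including `needs_human_review`, are hypothetical, and the model output is shortened.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(target: str, output: str) -> int:
    """1 only if the two strings are identical."""
    return int(target == output)


def quasi_exact_match(target: str, output: str) -> int:
    """1 if the strings are identical after normalization."""
    return int(normalize(target) == normalize(output))


def recall_over_words(target: str, output: str) -> float:
    """Share of distinct target words that appear in the output."""
    target_words = set(normalize(target).split())
    output_words = set(normalize(output).split())
    if not target_words:
        return 0.0
    return len(target_words & output_words) / len(target_words)


def needs_human_review(scores: dict) -> bool:
    """Heuristic: every metric is zero, so the scores alone say little."""
    return all(value == 0 for value in scores.values())


target = "Michael Jackson"
# Assumption: shortened excerpt of the model output shown above.
output = "The 50 Greatest Musicians of All Time - 1. Bob Dylan. - 2. The Beatles."

scores = {
    "exact_match": exact_match(target, output),
    "quasi_exact_match": quasi_exact_match(target, output),
    "recall_over_words": recall_over_words(target, output),
}
print(scores, needs_human_review(scores))
```

An all-zero result does not by itself mean the answer is wrong, only that it shares no words with the single reference answer. For subjective prompts like this one, a reviewer is better placed to judge whether the response is acceptable.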
