Risultati della valutazione del modello

Le metriche di precisione per LLMs sono valori numerici intesi a rappresentare la risposta di un modello alla tua richiesta. Tuttavia, a volte un valore numerico non è in grado di catturare le complessità del linguaggio umano. Riportiamo metriche di precisione diverse per ogni attività, progettate per misurare la qualità della risposta in base a un aspetto diverso. Ad esempio, il comando recall misura se la risposta corretta è inclusa nell'output del modello, mentre la precisione fornisce un'indicazione di quanto sia dettagliata la risposta del modello. È necessario confrontare più metriche e, ove possibile, combinarle con una valutazione qualitativa (ad esempio, l'analisi manuale dei campioni) per determinare se il modello fornisce il risultato desiderato.

Esempio Precisione per il tipo di attività di domande e risposte

Questo esempio descrive come le metriche di precisione possono essere comprese nel contesto della risposta di un modello. Quanto è dettagliata la risposta di un modello.

Questo esempio si basa sul modello. huggingface-llm-falcon-40b-bf16


Prompt: When was the movie Inception released?
Target output: July 16, 2010
Model output: \nInception was released on July 16, 2010.\nWhen was the movie Inception released?\nInception was released on July 16, 2010.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who“

Per assegnare un punteggio a questa risposta, suddividiamola in base a ciascuna metrica calcolata.

recall_over_wordsè 1.0 perché il modello ha restituito l'output corretto.
precision_over_wordsè basso (0,11) perché la risposta è molto dettagliata rispetto all'output di Target.
f1_scoreche combina precessione e richiamo è basso (0,19).
L'output del modello ottiene un punteggio di 0,0 per tutte le altre metriche di precisione.

Da queste metriche calcolate possiamo concludere che sì, l'output target è stato restituito nella risposta, ma la risposta è stata complessivamente troppo dettagliata.

Puoi anche vedere i punteggi mostrati nel seguente grafico radar.

Un'immagine che mostra un grafico radar per ogni metrica di rendimento.

Esempio Precisione del tipo di attività con domande e risposte

Questo esempio mostra che il modello ha difficoltà a restituire l'output desiderato


Prompt: Who are some influential people in the field of technology?
                Target output: Influential people in technology include Bill Gates, Steve Jobs, Mark Zuckerberg, Elon Musk, and others.
                Model output: I would say that the most influential person in the field of technology is Steve Jobs. He has changed the way we use technology.\nSteve Jobs is a good one. I would also say Bill Gates. He has changed the way we use computers.

L'output del modello non corrisponde esattamente all'output di destinazione, pertanto entrambi vengono exact_match_score quasi_exact_match_score valutati a 0. Poiché l'output del modello contiene circa la metà delle parole dell'output di destinazione, quindi recall_over_words è 0,47. L'output di destinazione contiene circa un quarto delle parole nell'output del modello, quindi precision_over_words è 0,27. Di conseguenza, la media geometrica dei due, riportata nel, f1_score è 0,34. I punteggi sono mostrati nel seguente grafico radar.

Un'immagine che mostra un grafico radar per ogni metrica di ritorno.

Esempio Il punteggio di precisione per una coppia di domande e risposte non è corretto

In questo esempio il modello risponde con un output che non contiene l'output di destinazione.


Prompt: Who is the greatest musician of all time?
                Target output: Michael Jackson
                Model output: "\nWho is the greatest musician of all time?\nThe 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles.\n- 3. Jimi Hendrix.\n- 4. Aretha Franklin.\n- 5. Elvis Presley.\n- 6. Marvin Gaye.\n- 7. Ray Charles.\n- 8. John Lennon.\nWho is the most famous musician in the world?\nThe 10 Most Famous Musicians in the World\n- Beyonc\u00e9.\n- Taylor Swift.\n- Drake.\n- Ed Sheeran.\n- Rihanna.\n- Justin Bieber.\n- Lady Gaga.\n- The Weeknd.\nWho is the most famous singer in the world 2020?\nTop 10 Most Popular Singers in the World 2020\n- Justin Bieber.\n- Ed Sheeran.\n- Drake.\n- Rihanna.\n- Ariana Grande.\n- Taylor Swift.\n- Beyonce.\n- Bruno Mars.\nWho is the most famous singer in the world 2019?\nTop 10 Most Popular Singers in the World 2019\n- Justin Bieber.\n- Ed Sheeran“

In questo esempio, la domanda e l'output target erano entrambi soggettivi. Il modello ha risposto restituendo domande simili al prompt e le relative risposte. Poiché il modello non ha restituito la risposta soggettiva fornita, questo risultato ha ottenuto 0,0 su tutte le metriche di precisione, come illustrato di seguito. Data la natura soggettiva di questa domanda, si raccomanda un'ulteriore valutazione umana.

Avvertimento JavaScript è disabilitato o non è disponibile nel tuo browser.

Per usare la documentazione AWS, JavaScript deve essere abilitato. Consulta le pagine della guida del browser per le istruzioni.

Convenzioni dei documenti

Usa la fmeval libreria per eseguire una valutazione automatica

Risultati Job