Understand the results of a human evaluation job - Amazon SageMaker

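Understand the results of a human evaluation job

When you create a model evaluation job that uses human workers, you select one or more metric types. When members of the work team evaluate a response in the worker portal, their responses are saved in the humanAnswers JSON object. How those responses are stored depends on the metric type selected when the job was created.

The following sections explain these differences and provide examples.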

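JSON output reference

When a model evaluation job completes, the results are saved in Amazon S3 as a JSON file. The JSON object contains three high-level nodes: humanEvaluationResult, inputRecord, and modelResponses. The humanEvaluationResult key is a high-level node that contains the responses from the work team assigned to the model evaluation job. The inputRecord key is a high-level node that contains the prompts provided to the models when the model evaluation job was created. The modelResponses key is a high-level node that contains the models' responses to those prompts.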
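As a minimal sketch of the structure described above, the following Python snippet parses one output record and splits it into its three high-level nodes. The record shown here is a hypothetical, heavily truncated stand-in (the model names "model-a" and "model-b" are placeholders), and the helper name split_record is an assumption made for illustration, not part of any AWS SDK.

```python
import json

# Hypothetical, truncated output record; a real record contains more keys.
raw = """
{
  "humanEvaluationResult": {
    "humanAnswers": [],
    "modelResponseIdMap": {"0": "model-a", "1": "model-b"}
  },
  "inputRecord": {"prompt": {"text": "Who invented the airplane?"}},
  "modelResponses": [
    {"modelIdentifier": "model-a", "text": "The Wright brothers."}
  ]
}
"""


def split_record(record: dict) -> tuple:
    """Return the three high-level nodes of one output record."""
    return (
        record["humanEvaluationResult"],  # work team responses
        record["inputRecord"],            # prompt given to the models
        record["modelResponses"],         # responses from the models
    )


record = json.loads(raw)
human_result, input_record, model_responses = split_record(record)
print(input_record["prompt"]["text"])
print(len(model_responses))
```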

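The following table summarizes the key-value pairs found in the JSON output of a model evaluation job.

The sections that follow provide more detail about each key-value pair.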

Parameter: flowDefinitionArn
Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name
Description: The ARN of the human review workflow (flow definition) used to create the human loop.

Parameter: humanAnswers
Example: A list of JSON objects specific to the evaluation metrics selected. To learn more, see Key value pairs found under humanAnswers.
Description: A list of JSON objects that contain worker responses.

Parameter: humanLoopName
Example: system-generated-hash
Description: A system-generated 40-character hexadecimal string.

Parameter: inputRecord
Example: "inputRecord": { "prompt": { "text": "Who invented the airplane?" }, "category": "Airplanes", "referenceResponse": { "text": "Orville and Wilbur Wright" }, "responses": [{ "modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane." }] }
Description: A JSON object that contains an entry prompt from the input dataset.

Parameter: modelResponses
Example: "modelResponses": [{ "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/model-id", "text": "the-models-response-to-the-prompt" }]
Description: The individual responses from the models.

Parameter: inputContent
Example: { "additionalDataS3Uri":"s3://user-specified-S3-URI-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json", "evaluationMetrics":[ { "description":"brief-name", "metricName":"metric-name", "metricType":"IndividualLikertScale" } ], "instructions":"example instructions" }
Description: The human loop input content, stored in your Amazon S3 bucket, that is required to start the human loop.

Parameter: modelResponseIdMap
Example: { "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612", "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" }
Description: Describes how each model is represented in answerContent.

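Key value pairs found under humanEvaluationResult

The following key-value pairs are found under humanEvaluationResult in the output of your model evaluation job.

For the key-value pairs associated with humanAnswers, see Key value pairs found under humanAnswers.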

flowDefinitionArn

  • The ARN of the flow definition used to complete the model evaluation job.

  • Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name

humanLoopName

  • A system-generated 40-character hexadecimal string.

inputContent

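  • This key describes the metric types and the instructions you provided to workers in the worker portal.

    • additionalDataS3Uri: The location in Amazon S3 where the worker instructions are saved.

    • instructions: The instructions you provided to workers in the worker portal.

    • evaluationMetrics: The name of each metric and its description. The metricType key value identifies the tool provided to workers to evaluate the models' responses.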

modelResponseIdMap

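  • This key-value pair identifies the full names of the selected models and shows how worker choices map to those models in the humanAnswers key-value pairs.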
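The lookups described above can be sketched in Python as follows. The fragment is copied from the example output later in this topic, with other keys omitted. The helper names describe_metrics and resolve_model are hypothetical and exist only for this illustration.

```python
def describe_metrics(human_evaluation_result: dict) -> dict:
    """Map each metricName to its metricType, as listed under inputContent."""
    metrics = human_evaluation_result["inputContent"]["evaluationMetrics"]
    return {m["metricName"]: m["metricType"] for m in metrics}


def resolve_model(human_evaluation_result: dict, model_response_id: str) -> str:
    """Translate a modelResponseId such as "0" into the full model name."""
    return human_evaluation_result["modelResponseIdMap"][model_response_id]


# Fragment of humanEvaluationResult from the example output (other keys omitted).
human_evaluation_result = {
    "inputContent": {
        "evaluationMetrics": [
            {"metricName": "Fluency", "metricType": "ComparisonChoice"},
            {"metricName": "Correctness", "metricType": "IndividualLikertScale"},
        ]
    },
    "modelResponseIdMap": {
        "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
        "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
    },
}

print(describe_metrics(human_evaluation_result))
print(resolve_model(human_evaluation_result, "1"))
```

Note that the keys of modelResponseIdMap are strings ("0", "1"), so worker results should be looked up with the string form of the ID.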

Key value pairs found under inputRecord

The following entries describe the inputRecord key-value pairs.

prompt

  • The text of the prompt sent to the model.

category

  • An optional category that classifies the prompt. It is visible to workers in the worker portal during the model evaluation.

  • Example: "American cities"

referenceResponse

  • An optional field in the input JSON that specifies the ground truth you want workers to reference during the evaluation.

responses

  • An optional field in the input JSON that contains responses from other models.

An example JSON input record. All inference in responses must come from a single model.

{ "prompt": { "text": "Who invented the airplane?" }, "category": "Airplanes", "referenceResponse": { "text": "Orville and Wilbur Wright" }, "responses": [{ "modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane." }] }
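One way to assemble such an input record programmatically is sketched below. The function build_input_record and its parameter names are hypothetical. The sketch assumes that optional keys are simply left out when no value is supplied, and it enforces the single-model constraint noted for responses. How records are packaged into a dataset file is not covered here.

```python
import json
from typing import Optional


def build_input_record(
    prompt: str,
    category: Optional[str] = None,
    reference_response: Optional[str] = None,
    responses: Optional[list] = None,
) -> dict:
    """Build one input record; optional keys are omitted when not provided."""
    record = {"prompt": {"text": prompt}}
    if category is not None:
        record["category"] = category
    if reference_response is not None:
        record["referenceResponse"] = {"text": reference_response}
    if responses:
        # All inference in "responses" must come from a single model.
        identifiers = {r["modelIdentifier"] for r in responses}
        if len(identifiers) > 1:
            raise ValueError("responses must all come from a single model")
        record["responses"] = responses
    return record


record = build_input_record(
    prompt="Who invented the airplane?",
    category="Airplanes",
    reference_response="Orville and Wilbur Wright",
)
print(json.dumps(record))
```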

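Key value pairs found under modelResponses

An array of key-value pairs that contains the responses from the models and identifies which model provided each response.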

text

  • The model's response to the prompt.

modelIdentifier

  • The name of the model.

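Key value pairs found under humanAnswers

An array of key-value pairs that contains the responses from the models and how the workers evaluated the models.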

acceptanceTime

  • The time when the worker accepted the task in the worker portal.

submissionTime

  • The time when the worker submitted their response.

timeSpentInSeconds

  • How long the worker took to complete the task, in seconds.

workerId

  • The ID of the worker who completed the task.

workerMetadata

  • Metadata about the work team assigned to this model evaluation job.
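In the example output in this topic, timeSpentInSeconds equals the difference between submissionTime and acceptanceTime. The Python sketch below checks that relationship for one worker answer. The timestamp format string is inferred from the example values, and whether the relationship holds for every job is an assumption.

```python
from datetime import datetime

# Format inferred from example values such as "2024-06-07T22:31:57.066Z".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed_seconds(answer: dict) -> float:
    """Seconds between a worker accepting the task and submitting a response."""
    accepted = datetime.strptime(answer["acceptanceTime"], TIMESTAMP_FORMAT)
    submitted = datetime.strptime(answer["submissionTime"], TIMESTAMP_FORMAT)
    return (submitted - accepted).total_seconds()


# Timing fields of the first worker answer in the example output.
answer = {
    "acceptanceTime": "2024-06-07T22:31:57.066Z",
    "submissionTime": "2024-06-07T22:32:19.640Z",
    "timeSpentInSeconds": 22.574,
}

delta = elapsed_seconds(answer)
print(delta, abs(delta - answer["timeSpentInSeconds"]) < 1e-6)
```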

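Format of the answerContent JSON array

The structure of an answer depends on the evaluation metrics selected when the model evaluation job was created. Each worker response or answer is recorded in a new JSON object.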

answerContent

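  • evaluationResults contains the worker's responses.

    • When Choice buttons is selected, each worker's results are stored under "evaluationResults": "comparisonChoice".

      metricName: The name of the metric.

      result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model maps to, see modelResponseIdMap.

    • When Likert scale - comparison is selected, each worker's results are stored under "evaluationResults": "comparisonLikertScale".

      metricName: The name of the metric.

      leftModelResponseId: Indicates which modelResponseIdMap entry was shown on the left side of the worker portal.

      rightModelResponseId: Indicates which modelResponseIdMap entry was shown on the right side of the worker portal.

      result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model maps to, see modelResponseIdMap.

    • When Ordinal rank is selected, each worker's results are stored under "evaluationResults": "comparisonRank".

      metricName: The name of the metric.

      result: An array of JSON objects. For each model (modelResponseIdMap), the worker provides a rank.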

      "result": [{ "modelResponseId": "0", "rank": 1 }, { "modelResponseId": "1", "rank": 1 }]
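    • When Likert scale, evaluation of a single model response is selected, the worker's results are stored under "evaluationResults": "individualLikertScale". This is a JSON array that contains the scores for each metricName specified when the job was created.

      metricName: The name of the metric.

      modelResponseId: The model that was scored. To see which value a model maps to, see modelResponseIdMap.

      result: A key-value pair that indicates the Likert scale value selected by the worker.

    • When Thumbs up/down is selected, the worker's results are stored as a JSON array under "evaluationResults": "thumbsUpDown".

      metricName: The name of the metric.

      result: Either true or false, as it relates to the metricName. When a worker chooses thumbs up, "result" : true.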
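To illustrate how the per-metric structures above might be combined across workers, the following Python sketch averages individualLikertScale scores and tallies comparisonChoice selections. The data is a trimmed copy of the two worker answers in the example output in this topic. The function names and the choice of a simple mean are assumptions made for illustration, not a prescribed way to score a job.

```python
from collections import Counter, defaultdict
from statistics import mean


def mean_likert_scores(human_answers: list) -> dict:
    """Average individualLikertScale results per (metricName, modelResponseId)."""
    scores = defaultdict(list)
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("individualLikertScale", []):
            key = (entry["metricName"], entry["modelResponseId"])
            scores[key].append(entry["result"])
    return {key: mean(values) for key, values in scores.items()}


def count_choices(human_answers: list) -> Counter:
    """Count comparisonChoice selections per (metricName, modelResponseId)."""
    votes = Counter()
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("comparisonChoice", []):
            votes[(entry["metricName"], entry["result"]["modelResponseId"])] += 1
    return votes


def make_answer(choice: str, score_0: int, score_1: int) -> dict:
    """Build a trimmed worker answer holding only two metric types."""
    return {"answerContent": {"evaluationResults": {
        "comparisonChoice": [
            {"metricName": "Fluency", "result": {"modelResponseId": choice}}
        ],
        "individualLikertScale": [
            {"metricName": "Correctness", "modelResponseId": "0", "result": score_0},
            {"metricName": "Correctness", "modelResponseId": "1", "result": score_1},
        ],
    }}}


# Values taken from the two worker answers in the example output.
human_answers = [make_answer("0", 2, 3), make_answer("1", 3, 4)]

likert_means = mean_likert_scores(human_answers)
choice_votes = count_choices(human_answers)
print(likert_means)
print(dict(choice_votes))
```

The modelResponseId values in the resulting keys can be translated to model names through modelResponseIdMap.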

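Example output from a model evaluation job

The following JSON object is an example of model evaluation job output saved in Amazon S3. To learn more about each key-value pair, see JSON output reference.

For clarity, this job contains responses from only two workers. Some key-value pairs might also have been truncated for readability.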

{ "humanEvaluationResult": { "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name", "humanAnswers": [ { "acceptanceTime": "2024-06-07T22:31:57.066Z", "answerContent": { "evaluationResults": { "comparisonChoice": [ { "metricName": "Fluency", "result": { "modelResponseId": "0" } } ], "comparisonLikertScale": [ { "leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1" } ], "comparisonRank": [ { "metricName": "Toxicity", "result": [ { "modelResponseId": "0", "rank": 1 }, { "modelResponseId": "1", "rank": 1 } ] } ], "individualLikertScale": [ { "metricName": "Correctness", "modelResponseId": "0", "result": 2 }, { "metricName": "Correctness", "modelResponseId": "1", "result": 3 }, { "metricName": "Completeness", "modelResponseId": "0", "result": 1 }, { "metricName": "Completeness", "modelResponseId": "1", "result": 4 } ], "thumbsUpDown": [ { "metricName": "Accuracy", "modelResponseId": "0", "result": true }, { "metricName": "Accuracy", "modelResponseId": "1", "result": true } ] } }, "submissionTime": "2024-06-07T22:32:19.640Z", "timeSpentInSeconds": 22.574, "workerId": "ead1ba56c1278175", "workerMetadata": { "identityData": { "identityProviderType": "Cognito", "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4", "sub": "cd2848f5-6105-4f72-b44e-68f9cb79ba07" } } }, { "acceptanceTime": "2024-06-07T22:32:19.721Z", "answerContent": { "evaluationResults": { "comparisonChoice": [ { "metricName": "Fluency", "result": { "modelResponseId": "1" } } ], "comparisonLikertScale": [ { "leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1" } ], "comparisonRank": [ { "metricName": "Toxicity", "result": [ { "modelResponseId": "0", "rank": 2 }, { "modelResponseId": "1", "rank": 1 } ] } ], "individualLikertScale": [ { "metricName": "Correctness", "modelResponseId": "0", "result": 3 }, { "metricName": "Correctness", "modelResponseId": "1", 
"result": 4 }, { "metricName": "Completeness", "modelResponseId": "0", "result": 1 }, { "metricName": "Completeness", "modelResponseId": "1", "result": 5 } ], "thumbsUpDown": [ { "metricName": "Accuracy", "modelResponseId": "0", "result": true }, { "metricName": "Accuracy", "modelResponseId": "1", "result": false } ] } }, "submissionTime": "2024-06-07T22:32:57.918Z", "timeSpentInSeconds": 38.197, "workerId": "bad258db224c3db6", "workerMetadata": { "identityData": { "identityProviderType": "Cognito", "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4", "sub": "84d5194a-3eed-4ecc-926d-4b9e1b724094" } } } ], "humanLoopName": "a757 11d3e75a 8d41f35b9873d 253f5b7bce0256e", "inputContent": { "additionalDataS3Uri": "s3://mgrt-test-us-west-2/test-2-workers-2-model/datasets/custom_dataset/0/task-input-additional-data.json", "instructions": "worker instructions provided by the model evaluation job administrator", "evaluationMetrics": [ { "metricName": "Fluency", "metricType": "ComparisonChoice", "description": "Measures the linguistic quality of a generated text." }, { "metricName": "Coherence", "metricType": "ComparisonLikertScale", "description": "Measures the organization and structure of a generated text." }, { "metricName": "Toxicity", "metricType": "ComparisonRank", "description": "Measures the harmfulness of a generated text." }, { "metricName": "Accuracy", "metricType": "ThumbsUpDown", "description": "Indicates the accuracy of a generated text." }, { "metricName": "Correctness", "metricType": "IndividualLikertScale", "description": "Measures a generated answer's satisfaction in the context of the question." }, { "metricName": "Completeness", "metricType": "IndividualLikertScale", "description": "Measures a generated answer's inclusion of all relevant information." 
} ], "disableRandomization": "true" }, "modelResponseIdMap": { "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612", "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" } }, "inputRecord": { "prompt": { "text": "What is high intensity interval training?" }, "category": "Fitness", "referenceResponse": { "text": "High-Intensity Interval Training (HIIT)" } }, "modelResponses": [ { "text": "High Intensity Interval Training (HIIT) is a form of exercise that alternates between periods of high intensity work and low intensity recovery.HIIT is an excellent way to increase your fitness and improve your health, but it can be difficult to get started.In this article, we will", "modelIdentifier": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612" }, { "text": "High intensity interval training is a form of exercise consisting of short bursts of maximum effort followed by periods of rest. The theory behind HIIT is that it can often be more effective at improving cardiovascular and metabolic health than longer, lower intensity workouts.The work intervals can range in length depending on the specific type of exercise, but are typically between 20 and 90 seconds. The recovery periods are generally longer, lasting between 1 and 5 minutes. This pattern is then repeated for multiple sets.\n\nSince the work intervals are high intensity, they require more effort from your body and therefore result in a greater calorie burn. The body also continues to burn calories at an increased rate after the workout due to an effect called excess post exercise oxygen consumption (EPOC), also know as the afterburn effect.\n\nHIIT is a versatile form of training that can be adapted to different fitness levels and can be performed using a variety of exercises including cycling, running, bodyweight movements, and even swimming. 
It can be done in as little as 20 minutes once or twice a week, making it an efficient option for busy individuals.\n\nWhat are the benefits of high intensity interval training", "modelIdentifier": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" } ] }