Understand the results of your human evaluation job - Amazon SageMaker

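Understand the results of your human evaluation job

When you created a model evaluation job that uses human workers, you selected one or more metric types. When members of the work team evaluate a response in the worker portal, their responses are saved in the humanAnswers JSON object. How those responses are stored depends on the metric type selected when the job was created.

The following sections explain these differences and provide examples.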

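JSON output reference

When a model evaluation job completes, the results are saved in Amazon S3 as a JSON file. The JSON object contains three high-level nodes: humanEvaluationResult, inputRecord, and modelResponses. The humanEvaluationResult key contains the responses from the work team assigned to the model evaluation job. The inputRecord key contains the prompts provided to the model when the model evaluation job was created. The modelResponses key contains the model's responses to the prompts.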
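As a minimal sketch of the structure described above, the following Python snippet parses an output document and checks that the three top-level nodes are present. The `raw_output` string is a hypothetical, heavily trimmed stand-in for a real output file, and `split_output` is an illustrative helper, not part of any AWS SDK.

```python
import json

# Hypothetical, heavily trimmed output document; a real file has many more fields.
raw_output = """
{
  "humanEvaluationResult": {"humanAnswers": []},
  "inputRecord": {"prompt": {"text": "Who invented the airplane?"}},
  "modelResponses": [{"modelIdentifier": "model-a", "text": "The Wright brothers."}]
}
"""

TOP_LEVEL_KEYS = ("humanEvaluationResult", "inputRecord", "modelResponses")


def split_output(document):
    """Return the three top-level nodes, raising KeyError if one is missing."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in document]
    if missing:
        raise KeyError(f"missing top-level keys: {missing}")
    return tuple(document[key] for key in TOP_LEVEL_KEYS)


human_result, input_record, model_responses = split_output(json.loads(raw_output))
print(input_record["prompt"]["text"])
print(len(model_responses))
```

In practice you would first download the result file from your Amazon S3 output location and pass its contents to `json.loads` in the same way.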

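The following table summarizes the key value pairs found in the JSON output of the model evaluation job.

The sections that follow provide more detail about each key value pair.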

| Parameter | Example | Description |
| --- | --- | --- |
| flowDefinitionArn | arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name | The ARN of the human review workflow (flow definition) that created the human loop. |
| humanAnswers | A list of JSON objects specific to the evaluation metrics selected. To learn more, see Key value pairs found under humanAnswers. | A list of JSON objects that contain worker responses. |
| humanLoopName | system-generated-hash | A system-generated 40-character hex string. |
| inputRecord | "inputRecord": { "prompt": { "text": "Who invented the airplane?" }, "category": "Airplanes", "referenceResponse": { "text": "Orville and Wilbur Wright" }, "responses": [{ "modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane." }] } | A JSON object that contains an input prompt from the input dataset. |
| modelResponses | "modelResponses": [{ "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/model-id", "text": "the-models-response-to-the-prompt" }] | The individual responses from the models. |
| inputContent | { "additionalDataS3Uri":"s3://user-specified-S3-URI-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json", "evaluationMetrics":[ { "description":"brief-name", "metricName":"metric-name", "metricType":"IndividualLikertScale" } ], "instructions":"example instructions" } | The human loop input content, in your Amazon S3 bucket, that is required to start the human loop. |
| modelResponseIdMap | { "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612", "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" } | Describes how each model is represented in answerContent. |

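Key value pairs found under humanEvaluationResult

The following key value pairs are found under humanEvaluationResult in the output of your model evaluation job.

For the key value pairs associated with humanAnswers, see Key value pairs found under humanAnswers.

flowDefinitionArn

  • The ARN of the flow definition used to complete the model evaluation job.

  • Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name

humanLoopName

  • A system-generated 40-character hex string.

inputContent

  • This key value describes the metric types and the instructions you provided to workers in the worker portal.

    • additionalDataS3Uri: The location in Amazon S3 where the instructions for workers are saved.

    • instructions: The instructions you provided to workers in the worker portal.

    • evaluationMetrics: The name of each metric and its description. The metricType value is the tool provided to workers to evaluate the models' responses.

modelResponseIdMap

  • This key value pair identifies the full names of the models selected, and how worker choices are mapped to those models in the humanAnswers key value pairs.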
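The following Python sketch shows one way to use these keys: resolving a short response ID to a full model name through modelResponseIdMap, and grouping metric names by metricType. The sample values are copied from the examples in this reference; the helper names `model_name` and `metrics_by_type` are hypothetical.

```python
# Trimmed humanEvaluationResult using values from the examples in this reference.
human_evaluation_result = {
    "inputContent": {
        "evaluationMetrics": [
            {"metricName": "Fluency", "metricType": "ComparisonChoice"},
            {"metricName": "Correctness", "metricType": "IndividualLikertScale"},
            {"metricName": "Completeness", "metricType": "IndividualLikertScale"},
        ]
    },
    "modelResponseIdMap": {
        "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
        "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
    },
}


def model_name(result, model_response_id):
    """Resolve a short response ID such as "0" to the full model name."""
    return result["modelResponseIdMap"][str(model_response_id)]


def metrics_by_type(result):
    """Group metric names by the metricType shown to workers."""
    grouped = {}
    for metric in result["inputContent"]["evaluationMetrics"]:
        grouped.setdefault(metric["metricType"], []).append(metric["metricName"])
    return grouped


print(model_name(human_evaluation_result, "1"))
print(metrics_by_type(human_evaluation_result))
```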

Key value pairs found under inputRecord

The following entries describe the inputRecord key value pairs.

prompt

  • The text of the prompt sent to the model.

category

  • An optional category that classifies the prompt. It is visible to workers in the worker portal during the model evaluation.

  • Example: "American cities"

referenceResponse

  • An optional field from the input JSON that specifies the ground truth you want workers to reference during evaluation.

responses

  • An optional field from the input JSON that contains responses from other models.

An example JSON input record.

{
  "prompt": { "text": "Who invented the airplane?" },
  "category": "Airplanes",
  "referenceResponse": { "text": "Orville and Wilbur Wright" },
  "responses":
    // All inference must come from a single model
    [{
      "modelIdentifier": "meta-textgeneration-llama-codellama-7b",
      "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane."
    }]
}
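To make the optional fields concrete, here is a small Python sketch that assembles an input record. The helper `make_input_record` is hypothetical, and treating only the prompt as required is an assumption drawn from the fact that the other three fields are described as optional. The single-model check mirrors the comment in the example record.

```python
import json


def make_input_record(prompt, category=None, reference=None, responses=None):
    """Build one input record; only the prompt is treated as required (assumption)."""
    record = {"prompt": {"text": prompt}}
    if category is not None:
        record["category"] = category
    if reference is not None:
        record["referenceResponse"] = {"text": reference}
    if responses:
        # The example record notes that all inference must come from a single model.
        identifiers = {response["modelIdentifier"] for response in responses}
        if len(identifiers) != 1:
            raise ValueError("all responses must come from a single model")
        record["responses"] = list(responses)
    return record


record = make_input_record(
    "Who invented the airplane?",
    category="Airplanes",
    reference="Orville and Wilbur Wright",
)
print(json.dumps(record))
```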

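Key value pairs found under modelResponses

An array of key value pairs that contains the responses from the models and identifies which model provided each response.

text

  • The model's response to the prompt.

modelIdentifier

  • The name of the model.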
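The sketch below pairs each response text with the short ID that workers' answers refer to. It assumes that each modelIdentifier value matches a name in modelResponseIdMap, as it does in the example output later on this page; the model names and the `responses_by_id` helper are hypothetical.

```python
# Hypothetical model names and response texts, for illustration only.
model_responses = [
    {"modelIdentifier": "model-a", "text": "first response"},
    {"modelIdentifier": "model-b", "text": "second response"},
]
model_response_id_map = {"0": "model-a", "1": "model-b"}


def responses_by_id(responses, id_map):
    """Map each short response ID to its response text (None if no match)."""
    text_by_model = {item["modelIdentifier"]: item["text"] for item in responses}
    return {rid: text_by_model.get(name) for rid, name in id_map.items()}


print(responses_by_id(model_responses, model_response_id_map))
```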

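Key value pairs found under humanAnswers

An array of key value pairs, one entry per worker response, that records how workers evaluated the models' responses.

acceptanceTime

  • When the worker accepted the task in the worker portal.

submissionTime

  • When the worker submitted their response.

timeSpentInSeconds

  • How long the worker spent completing the task.

workerId

  • The ID of the worker who completed the task.

workerMetadata

  • Metadata about which work team was assigned to this model evaluation job.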
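As an illustration of how the timing fields relate, the following Python sketch parses acceptanceTime and submissionTime and compares their difference with timeSpentInSeconds. The values come from the first worker in the example output on this page. That the two always agree is an observation from that example, not a documented guarantee.

```python
from datetime import datetime

# Timing fields copied from the first worker in the example output.
answer = {
    "acceptanceTime": "2024-06-07T22:31:57.066Z",
    "submissionTime": "2024-06-07T22:32:19.640Z",
    "timeSpentInSeconds": 22.574,
}


def parse_timestamp(value):
    """Parse a timestamp such as 2024-06-07T22:31:57.066Z into an aware datetime."""
    # Swap the trailing Z for an explicit offset so fromisoformat also works
    # on Python versions earlier than 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


elapsed = (
    parse_timestamp(answer["submissionTime"]) - parse_timestamp(answer["acceptanceTime"])
).total_seconds()
print(elapsed, answer["timeSpentInSeconds"])
```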

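Format of the answerContent JSON array

The structure of an answer depends on the evaluation metrics selected when the model evaluation job was created. Each worker response, or answer, is recorded in a new JSON object.

answerContent

  • evaluationResults contains the worker's responses.

    • When Choice buttons is selected, each worker's results are in "evaluationResults": "comparisonChoice".

      metricName: The name of the metric.

      result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which model a value maps to, see modelResponseIdMap.

    • When Likert scale, comparison is selected, each worker's results are in "evaluationResults": "comparisonLikertScale".

      metricName: The name of the metric.

      leftModelResponseId: Indicates which modelResponseIdMap entry was shown on the left side of the worker portal.

      rightModelResponseId: Indicates which modelResponseIdMap entry was shown on the right side of the worker portal.

      result: The value that indicates which model the worker selected, using either 0 or 1. To see which model a value maps to, see modelResponseIdMap.

    • When Ordinal rank is selected, each worker's results are in "evaluationResults": "comparisonRank".

      metricName: The name of the metric.

      result: An array of JSON objects. For each model (modelResponseIdMap), the worker provides a rank.

      "result": [{ "modelResponseId": "0", "rank": 1 }, { "modelResponseId": "1", "rank": 1 }]

    • When Likert scale, evaluation of a single model response is selected, the worker's results are saved in "evaluationResults": "individualLikertScale". This is a JSON array that contains the scores for each metricName specified when the job was created.

      metricName: The name of the metric.

      modelResponseId: The model that is scored. To see which model a value maps to, see modelResponseIdMap.

      result: A key value pair that indicates the Likert scale value the worker selected.

    • When Thumbs up/down is selected, the worker's results are saved as a JSON array in "evaluationResults": "thumbsUpDown".

      metricName: The name of the metric.

      result: Either true or false, as it relates to the metricName. When a worker chooses thumbs up, "result" : true.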
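The per-metric structures above can be aggregated across workers. The following Python sketch, offered as one possible approach, averages individualLikertScale scores and counts thumbs-up votes per metric and model. The sample data is trimmed from the example output on this page, where thumbsUpDown entries also carry a modelResponseId; the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Two workers' answers, trimmed from the example output on this page.
human_answers = [
    {"answerContent": {"evaluationResults": {
        "individualLikertScale": [
            {"metricName": "Correctness", "modelResponseId": "0", "result": 2},
            {"metricName": "Correctness", "modelResponseId": "1", "result": 3},
        ],
        "thumbsUpDown": [
            {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
            {"metricName": "Accuracy", "modelResponseId": "1", "result": True},
        ],
    }}},
    {"answerContent": {"evaluationResults": {
        "individualLikertScale": [
            {"metricName": "Correctness", "modelResponseId": "0", "result": 3},
            {"metricName": "Correctness", "modelResponseId": "1", "result": 4},
        ],
        "thumbsUpDown": [
            {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
            {"metricName": "Accuracy", "modelResponseId": "1", "result": False},
        ],
    }}},
]


def mean_likert(answers):
    """Average Likert score per (metricName, modelResponseId) across workers."""
    scores = defaultdict(list)
    for answer in answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("individualLikertScale", []):
            scores[(entry["metricName"], entry["modelResponseId"])].append(entry["result"])
    return {key: mean(values) for key, values in scores.items()}


def thumbs_up_counts(answers):
    """Count thumbs-up votes per (metricName, modelResponseId)."""
    counts = defaultdict(int)
    for answer in answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("thumbsUpDown", []):
            if entry["result"]:
                counts[(entry["metricName"], entry["modelResponseId"])] += 1
    return dict(counts)


print(mean_likert(human_answers))
print(thumbs_up_counts(human_answers))
```

Using `results.get(..., [])` keeps the helpers usable when a job was created without one of these metric types.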

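Example output from a model evaluation job

The following JSON object is an example of the model evaluation job output that is saved in Amazon S3. To learn more about each key value pair, see the JSON output reference.

For clarity, this job contains responses from only two workers. Some key value pairs may also have been truncated for readability.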

{
  "humanEvaluationResult": {
    "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
    "humanAnswers": [
      {
        "acceptanceTime": "2024-06-07T22:31:57.066Z",
        "answerContent": {
          "evaluationResults": {
            "comparisonChoice": [
              { "metricName": "Fluency", "result": { "modelResponseId": "0" } }
            ],
            "comparisonLikertScale": [
              { "leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1" }
            ],
            "comparisonRank": [
              {
                "metricName": "Toxicity",
                "result": [
                  { "modelResponseId": "0", "rank": 1 },
                  { "modelResponseId": "1", "rank": 1 }
                ]
              }
            ],
            "individualLikertScale": [
              { "metricName": "Correctness", "modelResponseId": "0", "result": 2 },
              { "metricName": "Correctness", "modelResponseId": "1", "result": 3 },
              { "metricName": "Completeness", "modelResponseId": "0", "result": 1 },
              { "metricName": "Completeness", "modelResponseId": "1", "result": 4 }
            ],
            "thumbsUpDown": [
              { "metricName": "Accuracy", "modelResponseId": "0", "result": true },
              { "metricName": "Accuracy", "modelResponseId": "1", "result": true }
            ]
          }
        },
        "submissionTime": "2024-06-07T22:32:19.640Z",
        "timeSpentInSeconds": 22.574,
        "workerId": "ead1ba56c1278175",
        "workerMetadata": {
          "identityData": {
            "identityProviderType": "Cognito",
            "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
            "sub": "cd2848f5-6105-4f72-b44e-68f9cb79ba07"
          }
        }
      },
      {
        "acceptanceTime": "2024-06-07T22:32:19.721Z",
        "answerContent": {
          "evaluationResults": {
            "comparisonChoice": [
              { "metricName": "Fluency", "result": { "modelResponseId": "1" } }
            ],
            "comparisonLikertScale": [
              { "leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1" }
            ],
            "comparisonRank": [
              {
                "metricName": "Toxicity",
                "result": [
                  { "modelResponseId": "0", "rank": 2 },
                  { "modelResponseId": "1", "rank": 1 }
                ]
              }
            ],
            "individualLikertScale": [
              { "metricName": "Correctness", "modelResponseId": "0", "result": 3 },
              { "metricName": "Correctness", "modelResponseId": "1", "result": 4 },
              { "metricName": "Completeness", "modelResponseId": "0", "result": 1 },
              { "metricName": "Completeness", "modelResponseId": "1", "result": 5 }
            ],
            "thumbsUpDown": [
              { "metricName": "Accuracy", "modelResponseId": "0", "result": true },
              { "metricName": "Accuracy", "modelResponseId": "1", "result": false }
            ]
          }
        },
        "submissionTime": "2024-06-07T22:32:57.918Z",
        "timeSpentInSeconds": 38.197,
        "workerId": "bad258db224c3db6",
        "workerMetadata": {
          "identityData": {
            "identityProviderType": "Cognito",
            "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
            "sub": "84d5194a-3eed-4ecc-926d-4b9e1b724094"
          }
        }
      }
    ],
    "humanLoopName": "a75711d3e75a8d41f35b9873d253f5b7bce0256e",
    "inputContent": {
      "additionalDataS3Uri": "s3://mgrt-test-us-west-2/test-2-workers-2-model/datasets/custom_dataset/0/task-input-additional-data.json",
      "instructions": "worker instructions provided by the model evaluation job administrator",
      "evaluationMetrics": [
        { "metricName": "Fluency", "metricType": "ComparisonChoice", "description": "Measures the linguistic quality of a generated text." },
        { "metricName": "Coherence", "metricType": "ComparisonLikertScale", "description": "Measures the organization and structure of a generated text." },
        { "metricName": "Toxicity", "metricType": "ComparisonRank", "description": "Measures the harmfulness of a generated text." },
        { "metricName": "Accuracy", "metricType": "ThumbsUpDown", "description": "Indicates the accuracy of a generated text." },
        { "metricName": "Correctness", "metricType": "IndividualLikertScale", "description": "Measures a generated answer's satisfaction in the context of the question." },
        { "metricName": "Completeness", "metricType": "IndividualLikertScale", "description": "Measures a generated answer's inclusion of all relevant information." }
      ],
      "disableRandomization": "true"
    },
    "modelResponseIdMap": {
      "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
      "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
    }
  },
  "inputRecord": {
    "prompt": { "text": "What is high intensity interval training?" },
    "category": "Fitness",
    "referenceResponse": { "text": "High-Intensity Interval Training (HIIT)" }
  },
  "modelResponses": [
    {
      "text": "High Intensity Interval Training (HIIT) is a form of exercise that alternates between periods of high intensity work and low intensity recovery.HIIT is an excellent way to increase your fitness and improve your health, but it can be difficult to get started.In this article, we will",
      "modelIdentifier": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612"
    },
    {
      "text": "High intensity interval training is a form of exercise consisting of short bursts of maximum effort followed by periods of rest. The theory behind HIIT is that it can often be more effective at improving cardiovascular and metabolic health than longer, lower intensity workouts.The work intervals can range in length depending on the specific type of exercise, but are typically between 20 and 90 seconds. The recovery periods are generally longer, lasting between 1 and 5 minutes. This pattern is then repeated for multiple sets.\n\nSince the work intervals are high intensity, they require more effort from your body and therefore result in a greater calorie burn. The body also continues to burn calories at an increased rate after the workout due to an effect called excess post exercise oxygen consumption (EPOC), also know as the afterburn effect.\n\nHIIT is a versatile form of training that can be adapted to different fitness levels and can be performed using a variety of exercises including cycling, running, bodyweight movements, and even swimming. It can be done in as little as 20 minutes once or twice a week, making it an efficient option for busy individuals.\n\nWhat are the benefits of high intensity interval training",
      "modelIdentifier": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
    }
  ]
}
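To show how the comparison metrics in this example might be summarized, the following Python sketch tallies comparisonChoice selections and averages comparisonRank ranks per model, translating response IDs to model names through modelResponseIdMap. The data is trimmed from the example output above; the helper names are hypothetical, and this is one possible summary rather than a prescribed method.

```python
from collections import Counter, defaultdict
from statistics import mean

# Trimmed from the example output above.
model_response_id_map = {
    "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
    "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
}
human_answers = [
    {"answerContent": {"evaluationResults": {
        "comparisonChoice": [{"metricName": "Fluency", "result": {"modelResponseId": "0"}}],
        "comparisonRank": [{"metricName": "Toxicity", "result": [
            {"modelResponseId": "0", "rank": 1},
            {"modelResponseId": "1", "rank": 1},
        ]}],
    }}},
    {"answerContent": {"evaluationResults": {
        "comparisonChoice": [{"metricName": "Fluency", "result": {"modelResponseId": "1"}}],
        "comparisonRank": [{"metricName": "Toxicity", "result": [
            {"modelResponseId": "0", "rank": 2},
            {"modelResponseId": "1", "rank": 1},
        ]}],
    }}},
]


def choice_counts(answers, id_map):
    """Count how often each model was chosen, per metric."""
    counts = Counter()
    for answer in answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("comparisonChoice", []):
            model = id_map[entry["result"]["modelResponseId"]]
            counts[(entry["metricName"], model)] += 1
    return counts


def mean_ranks(answers, id_map):
    """Average the rank each model received, per metric."""
    ranks = defaultdict(list)
    for answer in answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("comparisonRank", []):
            for item in entry["result"]:
                ranks[(entry["metricName"], id_map[item["modelResponseId"]])].append(item["rank"])
    return {key: mean(values) for key, values in ranks.items()}


print(choice_counts(human_answers, model_response_id_map))
print(mean_ranks(human_answers, model_response_id_map))
```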