Understand the results of a human evaluation job

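When you create a model evaluation job that uses human workers, you select one or more metric types. When members of the work team rate a response in the worker portal, their responses are saved in the humanAnswers JSON object. How those responses are stored depends on the metric type selected when the job was created.

The following sections explain these differences and provide examples.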

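JSON output reference

When a model evaluation job completes, the results are saved in Amazon S3 as a JSON file. The JSON object contains three high-level nodes: humanEvaluationResult, inputRecord, and modelResponses. The humanEvaluationResult key is a high-level node that contains the responses from the work team assigned to the model evaluation job. The inputRecord key is a high-level node that contains the prompts provided to the model (or models) when the model evaluation job was created. The modelResponses key is a high-level node that contains the responses to the prompts from the model (or models).

The following table summarizes the key value pairs found in the JSON output from the model evaluation job.

The sections after the table describe each key value pair in more detail.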
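As a minimal sketch, the following Python checks that one output record contains the three high-level nodes described above. The function name `check_top_level_nodes` and the abbreviated sample record are illustrative assumptions, not part of any AWS SDK. In practice the JSON text would be read from the job's Amazon S3 output location.

```python
import json

# The three high-level nodes named in this section.
EXPECTED_NODES = ("humanEvaluationResult", "inputRecord", "modelResponses")


def check_top_level_nodes(record: dict) -> list:
    """Return the names of any expected high-level nodes missing from record."""
    return [node for node in EXPECTED_NODES if node not in record]


# Hypothetical, heavily abbreviated record used only for illustration.
sample_text = """
{
  "humanEvaluationResult": {"humanAnswers": []},
  "inputRecord": {"prompt": {"text": "Who invented the airplane?"}},
  "modelResponses": []
}
"""

record = json.loads(sample_text)
missing = check_top_level_nodes(record)
print("missing nodes:", missing)
```

A check like this can be run before any further parsing, so that a truncated or unexpected file is reported early.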

Parameter	Example	Description
`flowDefinitionArn`	`arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name`	The ARN of the human review workflow (flow definition) that created the human loop.
`humanAnswers`	A list of JSON objects specific to the evaluation metrics selected. To learn more, see the section on key value pairs found under `humanAnswers`.	A list of JSON objects that contain worker responses.
`humanLoopName`	`system-generated-hash`	A system-generated 40-character hexadecimal string.
`inputRecord`	`"inputRecord": { "prompt": { "text": "Who invented the airplane?" }, "category": "Airplanes", "referenceResponse": { "text": "Orville and Wilbur Wright" }, "responses": [{ "modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane." }] }`	A JSON object that contains an entry prompt from the input dataset.
`modelResponses`	`"modelResponses": [{ "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/model-id", "text": "the-models-response-to-the-prompt" }]`	The individual responses from the models.
`inputContent`	`{ "additionalDataS3Uri":"s3://user-specified-S3-URI-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json", "evaluationMetrics":[ { "description":"brief-name", "metricName":"metric-name", "metricType":"IndividualLikertScale" } ], "instructions":"example instructions" }`	The human loop input content required to start a human loop in your Amazon S3 bucket.
`modelResponseIdMap`	`{ "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612", "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" }`	Describes how each model is represented in `answerContent`.

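Key value pairs found under `humanEvaluationResult`

The following key value pairs are found under humanEvaluationResult in the output of your model evaluation job.

For the key value pairs associated with humanAnswers, see the section on key value pairs found under humanAnswers.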

flowDefinitionArn

The ARN of the flow definition used to complete the model evaluation job.
Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name

humanLoopName

A system-generated 40-character hexadecimal string.

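inputContent

This key value describes the metric types and the instructions that you provided for workers in the worker portal.
- additionalDataS3Uri: The location in Amazon S3 where the instructions for workers are saved.
- instructions: The instructions that you provided for workers in the worker portal.
- evaluationMetrics: The name of the metric and its description. The key value metricType is the tool provided to workers to evaluate the models' responses.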
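Because the metricType of each metric determines where its results are stored, it can help to group metric names by type before reading the answers. The sketch below assumes an inputContent object shaped like the example output in this topic. The helper name `metrics_by_type` is hypothetical, and the descriptions are shortened for illustration.

```python
from collections import defaultdict

# Abbreviated inputContent, shaped like the example output in this topic.
input_content = {
    "instructions": "example instructions",
    "evaluationMetrics": [
        {"metricName": "Fluency", "metricType": "ComparisonChoice",
         "description": "Linguistic quality."},
        {"metricName": "Accuracy", "metricType": "ThumbsUpDown",
         "description": "Accuracy of the text."},
        {"metricName": "Correctness", "metricType": "IndividualLikertScale",
         "description": "Satisfaction in context."},
        {"metricName": "Completeness", "metricType": "IndividualLikertScale",
         "description": "Inclusion of relevant information."},
    ],
}


def metrics_by_type(content: dict) -> dict:
    """Map each metricType to the list of metricName values that use it."""
    grouped = defaultdict(list)
    for metric in content.get("evaluationMetrics", []):
        grouped[metric["metricType"]].append(metric["metricName"])
    return dict(grouped)


grouped = metrics_by_type(input_content)
for metric_type, names in grouped.items():
    print(metric_type, names)
```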

modelResponseIdMap

This key value pair identifies the full names of the selected models and how worker choices are mapped to those models in the humanAnswers key value pairs.
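A short sketch of that lookup, using the map values from the example output in this topic. The helper `resolve_model` is a hypothetical name. It coerces the ID to a string on the assumption that map keys are always strings, as they are in the example.

```python
# modelResponseIdMap values copied from the example output in this topic.
model_response_id_map = {
    "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
    "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
}


def resolve_model(model_response_id, id_map: dict) -> str:
    """Return the full model name for a modelResponseId such as "0"."""
    # Keys are strings in the example output, so integers are coerced.
    return id_map[str(model_response_id)]


# One comparisonChoice entry, shaped like the example output.
choice = {"metricName": "Fluency", "result": {"modelResponseId": "0"}}
chosen = resolve_model(choice["result"]["modelResponseId"], model_response_id_map)
print(chosen)
```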

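Key value pairs found under `inputRecord`

The following entries describe the inputRecord key value pairs.

prompt

The text of the prompt sent to the model.

category

An optional category that classifies the prompt. It is visible to workers in the worker portal during the model evaluation.
Example: "American cities"

referenceResponse

An optional field from the input JSON used to specify the ground truth that workers reference during the evaluation.

responses

An optional field from the input JSON that contains responses from other models.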

An example JSON input record. All inference in responses must come from a single model.


{
    "prompt": {
        "text": "Who invented the airplane?"
    },
    "category": "Airplanes",
    "referenceResponse": {
        "text": "Orville and Wilbur Wright"
    },
    "responses": [{
        "modelIdentifier": "meta-textgeneration-llama-codellama-7b",
        "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane."
    }]
}

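Key value pairs found under `modelResponses`

An array of key value pairs that contains the responses from the models and identifies which model provided each response.

text

The model's response to the prompt.

modelIdentifier

The name of the model.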

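Key value pairs found under `humanAnswers`

An array of key value pairs that contains the responses from the models and how the workers evaluated the models.

acceptanceTime

When the worker accepted the task in the worker portal.

submissionTime

When the worker submitted their response.

timeSpentInSeconds

How long the worker spent completing the task.

workerId

The ID of the worker who completed the task.

workerMetadata

Metadata about the work team assigned to this model evaluation job.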
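The timing fields above can be cross-checked against each other. The sketch below parses the two timestamps from the first worker in the example output and compares their difference with timeSpentInSeconds. That the values agree is an observation about this example record, not a documented guarantee. The helper names are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-06-07T22:31:57.066Z."""
    # The example timestamps end in "Z" (UTC). Older versions of
    # datetime.fromisoformat need an explicit offset instead.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed_seconds(answer: dict) -> float:
    """Seconds between acceptanceTime and submissionTime for one answer."""
    start = parse_timestamp(answer["acceptanceTime"])
    end = parse_timestamp(answer["submissionTime"])
    return (end - start).total_seconds()


# Timing fields of the first worker in the example output.
answer = {
    "acceptanceTime": "2024-06-07T22:31:57.066Z",
    "submissionTime": "2024-06-07T22:32:19.640Z",
    "timeSpentInSeconds": 22.574,
    "workerId": "ead1ba56c1278175",
}

elapsed = elapsed_seconds(answer)
print(elapsed, answer["timeSpentInSeconds"])
```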

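Format of the `answerContent` JSON array

The structure of an answer depends on the evaluation metrics selected when the model evaluation job was created. Each worker response or answer is recorded in a new JSON object.

answerContent

evaluationResults contains the worker's responses.
- When Choice buttons is selected, the results from each worker are in "evaluationResults": "comparisonChoice".
  
  metricName: The name of the metric.
  
  result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model is mapped to, see modelResponseIdMap.
- When Likert scale, comparison is selected, the results from each worker are in "evaluationResults": "comparisonLikertScale".
  
  metricName: The name of the metric.
  
  leftModelResponseId: Indicates which modelResponseIdMap entry was shown on the left side of the worker portal.
  
  rightModelResponseId: Indicates which modelResponseIdMap entry was shown on the right side of the worker portal.
  
  result: Indicates which model the worker selected, using either 0 or 1. To see which value a model is mapped to, see modelResponseIdMap.
- When Ordinal rank is selected, the results from each worker are in "evaluationResults": "comparisonRank".
  
  metricName: The name of the metric.
  
  result: An array of JSON objects. For each model (modelResponseIdMap), the worker provides a rank.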
```
"result": [{
	"modelResponseId": "0",
	"rank": 1
}, {
	"modelResponseId": "1",
	"rank": 1
}]
```
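- When Likert scale, evaluation of a single model response is selected, the results from a worker are saved in "evaluationResults": "individualLikertScale". This is a JSON array that contains the scores for the metricName specified when the job was created.
  
  metricName: The name of the metric.
  
  modelResponseId: The model that is scored. To see which value a model is mapped to, see modelResponseIdMap.
  
  result: A key value pair that indicates the Likert scale value selected by the worker.
- When Thumbs up/down is selected, the results from a worker are saved as a JSON array in "evaluationResults": "thumbsUpDown".
  
  metricName: The name of the metric.
  
  modelResponseId: The model that is scored, as it appears in the example output in this topic.
  
  result: Either true or false as it relates to the metricName. When a worker chooses thumbs up, "result" : true.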
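The result shapes in the list above differ by metric type, so aggregation code has to read each one differently. As a minimal sketch, the following Python tallies comparisonChoice selections and thumbsUpDown approvals across workers. The function name `tally_choices_and_thumbs` and the abbreviated sample, which keeps only the fields needed here, are assumptions made for illustration.

```python
from collections import Counter


def tally_choices_and_thumbs(human_answers: list):
    """Count comparisonChoice selections and thumbs-up votes per model."""
    choice_wins = Counter()
    thumbs_up = Counter()
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        # comparisonChoice: result is an object holding the chosen model ID.
        for entry in results.get("comparisonChoice", []):
            key = (entry["metricName"], entry["result"]["modelResponseId"])
            choice_wins[key] += 1
        # thumbsUpDown: result is a boolean, one entry per model.
        for entry in results.get("thumbsUpDown", []):
            if entry["result"]:
                thumbs_up[(entry["metricName"], entry["modelResponseId"])] += 1
    return choice_wins, thumbs_up


# Two workers, abbreviated from the example output in this topic.
human_answers = [
    {"answerContent": {"evaluationResults": {
        "comparisonChoice": [
            {"metricName": "Fluency", "result": {"modelResponseId": "0"}}],
        "thumbsUpDown": [
            {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
            {"metricName": "Accuracy", "modelResponseId": "1", "result": True}],
    }}},
    {"answerContent": {"evaluationResults": {
        "comparisonChoice": [
            {"metricName": "Fluency", "result": {"modelResponseId": "1"}}],
        "thumbsUpDown": [
            {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
            {"metricName": "Accuracy", "modelResponseId": "1", "result": False}],
    }}},
]

choice_wins, thumbs_up = tally_choices_and_thumbs(human_answers)
print(dict(choice_wins))
print(dict(thumbs_up))
```

Using `results.get(..., [])` keeps the function usable for jobs that selected only some of the metric types.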

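Example output from a model evaluation job

The following JSON object is an example of the model evaluation job output that is saved in Amazon S3. For details about each key value pair, see the JSON output reference section.

For clarity, this job contains responses from only two workers. Some key value pairs may also have been truncated for readability.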


{
    "humanEvaluationResult": {
        "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
        "humanAnswers": [
            {
                "acceptanceTime": "2024-06-07T22:31:57.066Z",
                "answerContent": {
                    "evaluationResults": {
                        "comparisonChoice": [
                            {
                                "metricName": "Fluency",
                                "result": {
                                    "modelResponseId": "0"
                                }
                            }
                        ],
                        "comparisonLikertScale": [
                            {
                                "leftModelResponseId": "0",
                                "metricName": "Coherence",
                                "result": 1,
                                "rightModelResponseId": "1"
                            }
                        ],
                        "comparisonRank": [
                            {
                                "metricName": "Toxicity",
                                "result": [
                                    {
                                        "modelResponseId": "0",
                                        "rank": 1
                                    },
                                    {
                                        "modelResponseId": "1",
                                        "rank": 1
                                    }
                                ]
                            }
                        ],
                        "individualLikertScale": [
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "0",
                                "result": 2
                            },
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "1",
                                "result": 3
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "0",
                                "result": 1
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "1",
                                "result": 4
                            }
                        ],
                        "thumbsUpDown": [
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "0",
                                "result": true
                            },
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "1",
                                "result": true
                            }
                        ]
                    }
                },
                "submissionTime": "2024-06-07T22:32:19.640Z",
                "timeSpentInSeconds": 22.574,
                "workerId": "ead1ba56c1278175",
                "workerMetadata": {
                    "identityData": {
                        "identityProviderType": "Cognito",
                        "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
                        "sub": "cd2848f5-6105-4f72-b44e-68f9cb79ba07"
                    }
                }
            },
            {
                "acceptanceTime": "2024-06-07T22:32:19.721Z",
                "answerContent": {
                    "evaluationResults": {
                        "comparisonChoice": [
                            {
                                "metricName": "Fluency",
                                "result": {
                                    "modelResponseId": "1"
                                }
                            }
                        ],
                        "comparisonLikertScale": [
                            {
                                "leftModelResponseId": "0",
                                "metricName": "Coherence",
                                "result": 1,
                                "rightModelResponseId": "1"
                            }
                        ],
                        "comparisonRank": [
                            {
                                "metricName": "Toxicity",
                                "result": [
                                    {
                                        "modelResponseId": "0",
                                        "rank": 2
                                    },
                                    {
                                        "modelResponseId": "1",
                                        "rank": 1
                                    }
                                ]
                            }
                        ],
                        "individualLikertScale": [
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "0",
                                "result": 3
                            },
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "1",
                                "result": 4
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "0",
                                "result": 1
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "1",
                                "result": 5
                            }
                        ],
                        "thumbsUpDown": [
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "0",
                                "result": true
                            },
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "1",
                                "result": false
                            }
                        ]
                    }
                },
                "submissionTime": "2024-06-07T22:32:57.918Z",
                "timeSpentInSeconds": 38.197,
                "workerId": "bad258db224c3db6",
                "workerMetadata": {
                    "identityData": {
                        "identityProviderType": "Cognito",
                        "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
                        "sub": "84d5194a-3eed-4ecc-926d-4b9e1b724094"
                    }
                }
            }
        ],
        "humanLoopName": "a75711d3e75a8d41f35b9873d253f5b7bce0256e",
        "inputContent": {
            "additionalDataS3Uri": "s3://mgrt-test-us-west-2/test-2-workers-2-model/datasets/custom_dataset/0/task-input-additional-data.json",
            "instructions": "worker instructions provided by the model evaluation job administrator",
            "evaluationMetrics": [
                {
                    "metricName": "Fluency",
                    "metricType": "ComparisonChoice",
                    "description": "Measures the linguistic quality of a generated text."
                },
                {
                    "metricName": "Coherence",
                    "metricType": "ComparisonLikertScale",
                    "description": "Measures the organization and structure of a generated text."
                },
                {
                    "metricName": "Toxicity",
                    "metricType": "ComparisonRank",
                    "description": "Measures the harmfulness of a generated text."
                },
                {
                    "metricName": "Accuracy",
                    "metricType": "ThumbsUpDown",
                    "description": "Indicates the accuracy of a generated text."
                },
                {
                    "metricName": "Correctness",
                    "metricType": "IndividualLikertScale",
                    "description": "Measures a generated answer's satisfaction in the context of the question."
                },
                {
                    "metricName": "Completeness",
                    "metricType": "IndividualLikertScale",
                    "description": "Measures a generated answer's inclusion of all relevant information."
                }
            ],
            "disableRandomization": "true"
        },
        "modelResponseIdMap": {
            "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
            "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
        }
    },
    "inputRecord": {
        "prompt": {
            "text": "What is high intensity interval training?"
        },
        "category": "Fitness",
        "referenceResponse": {
            "text": "High-Intensity Interval Training (HIIT)"
        }
    },
    "modelResponses": [
        {
            "text": "High Intensity Interval Training (HIIT) is a form of exercise that alternates between periods of high intensity work and low intensity recovery.HIIT is an excellent way to increase your fitness and improve your health, but it can be difficult to get started.In this article, we will",
            "modelIdentifier": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612"
        },
        {
            "text": "High intensity interval training is a form of exercise consisting of short bursts of maximum effort followed by periods of rest. The theory behind HIIT is that it can often be more effective at improving cardiovascular and metabolic health than longer, lower intensity workouts.The work intervals can range in length depending on the specific type of exercise, but are typically between 20 and 90 seconds. The recovery periods are generally longer, lasting between 1 and 5 minutes. This pattern is then repeated for multiple sets.\n\nSince the work intervals are high intensity, they require more effort from your body and therefore result in a greater calorie burn. The body also continues to burn calories at an increased rate after the workout due to an effect called excess post exercise oxygen consumption (EPOC), also know as the afterburn effect.\n\nHIIT is a versatile form of training that can be adapted to different fitness levels and can be performed using a variety of exercises including cycling, running, bodyweight movements, and even swimming. It can be done in as little as 20 minutes once or twice a week, making it an efficient option for busy individuals.\n\nWhat are the benefits of high intensity interval training",
            "modelIdentifier": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
        }
    ]
}
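To summarize the individualLikertScale results in output like the example above, the scores can be averaged per metric and model across workers. The sketch below is one possible approach, not an official aggregation method. The function name `mean_likert_scores` is hypothetical, and the sample keeps only the individualLikertScale entries of the two workers.

```python
from collections import defaultdict


def mean_likert_scores(human_answers: list) -> dict:
    """Average individualLikertScale results per (metricName, modelResponseId)."""
    scores = defaultdict(list)
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("individualLikertScale", []):
            key = (entry["metricName"], entry["modelResponseId"])
            scores[key].append(entry["result"])
    return {key: sum(values) / len(values) for key, values in scores.items()}


def likert_entries(rows: list) -> dict:
    """Wrap (metricName, modelResponseId, result) rows as one humanAnswers item."""
    return {"answerContent": {"evaluationResults": {"individualLikertScale": [
        {"metricName": name, "modelResponseId": model_id, "result": score}
        for name, model_id, score in rows
    ]}}}


# Scores copied from the two workers in the example output.
human_answers = [
    likert_entries([("Correctness", "0", 2), ("Correctness", "1", 3),
                    ("Completeness", "0", 1), ("Completeness", "1", 4)]),
    likert_entries([("Correctness", "0", 3), ("Correctness", "1", 4),
                    ("Completeness", "0", 1), ("Completeness", "1", 5)]),
]

averages = mean_likert_scores(human_answers)
for (metric, model_id), value in sorted(averages.items()):
    print(metric, model_id, value)
```

The model IDs in the result can then be translated to full model names through modelResponseIdMap.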
