Understand the results of a human evaluation job

When you created a model evaluation job that uses human workers, you selected one or more metric types. When members of the work team evaluate a response in the worker portal, their responses are saved in the humanAnswers JSON object. How those responses are stored changes based on the metric type selected when the job was created.

The following sections explain these differences and provide examples.

JSON output reference

When a model evaluation job is completed, the results are saved in Amazon S3 as a JSON file. The JSON object contains three high-level nodes: humanEvaluationResult, inputRecord, and modelResponses. The humanEvaluationResult node contains the responses from the work team assigned to the model evaluation job. The inputRecord node contains the prompts provided to the model(s) when the model evaluation job was created. The modelResponses node contains the responses to the prompts from the model(s).
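
As a minimal sketch of reading these three nodes, the following Python parses one output document and returns each high-level node. The helper name `split_top_level` and the inline sample document are illustrative assumptions, not part of the service; in practice you would first download the JSON file from your Amazon S3 output location.

```python
import json

# The three high-level nodes described above.
TOP_LEVEL_KEYS = ("humanEvaluationResult", "inputRecord", "modelResponses")


def split_top_level(raw_json: str) -> tuple[dict, dict, list]:
    """Parse one output document and return its three high-level nodes.

    Hypothetical helper for illustration; raises KeyError if a node is missing.
    """
    doc = json.loads(raw_json)
    missing = [key for key in TOP_LEVEL_KEYS if key not in doc]
    if missing:
        raise KeyError(f"missing top-level nodes: {missing}")
    return doc["humanEvaluationResult"], doc["inputRecord"], doc["modelResponses"]


# Tiny stand-in document; a real output file holds far more detail.
sample = json.dumps({
    "humanEvaluationResult": {"humanAnswers": []},
    "inputRecord": {"prompt": {"text": "Who invented the airplane?"}},
    "modelResponses": [{"modelIdentifier": "model-a", "text": "..."}],
})

evaluation, input_record, responses = split_top_level(sample)
print(input_record["prompt"]["text"])
print(len(responses), "model response(s)")
```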

The following table summarizes the key value pairs found in the JSON output from the model evaluation job.

The sections that follow the table provide more detail about each key value pair.

Parameter	Example	Description
`flowDefinitionArn`	`arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name`	The ARN of the human review workflow (flow definition) that created the human loop.
`humanAnswers`	A list of JSON objects specific to the evaluation metrics selected. To learn more, see Key value pairs found under humanAnswers.	A list of JSON objects that contain workers' responses.
`humanLoopName`	`system-generated-hash`	A system generated 40-character hex string.
`inputRecord`	`"inputRecord": { "prompt": { "text": "Who invented the airplane?" }, "category": "Airplanes", "referenceResponse": { "text": "Orville and Wilbur Wright" }, "responses": [{ "modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane." }] }`	A JSON object that contains one prompt entry from the input dataset.
`modelResponses`	`"modelResponses": [{ "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/model-id", "text": "the-models-response-to-the-prompt" }]`	The individual responses from the models.
`inputContent`	`{ "additionalDataS3Uri":"s3://user-specified-S3-URI-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json", "evaluationMetrics":[ { "description":"brief-name", "metricName":"metric-name", "metricType":"IndividualLikertScale" } ], "instructions":"example instructions" }`	The human loop input content required to start the human loop in your Amazon S3 bucket.
`modelResponseIdMap`	`{ "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612", "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" }`	Describes how each model is represented in the `answerContent`.

Key value pairs found under `humanEvaluationResult`

The following key value pairs are found under humanEvaluationResult in the output of your model evaluation job.

For the key value pairs associated with humanAnswers, see Key value pairs found under humanAnswers.

flowDefinitionArn

The ARN of the flow definition used to complete the model evaluation job.
Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name

humanLoopName

A system generated 40-character hex string.

inputContent

This key value pair describes the metric types and the instructions you provided for workers in the worker portal.
- additionalDataS3Uri: The location in Amazon S3 where the instructions for workers are saved.
- instructions: The instructions you provided to workers in the worker portal.
- evaluationMetrics: The name of the metric and its description. The metricType value identifies the tool provided to workers to evaluate the models' responses.

modelResponseIdMap

This key value pair identifies the full names of the models selected, and how worker choices are mapped to the models in the humanAnswers key value pairs.
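
A short sketch of how such a lookup might work in Python is shown here. The map values are copied from the table above; the helper `resolve_model` and the sample `choice` object are hypothetical and exist only for illustration.

```python
# Values copied from the modelResponseIdMap example in the table above.
model_response_id_map = {
    "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
    "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
}


def resolve_model(id_map: dict, model_response_id) -> str:
    """Return the full model name for an ID used in humanAnswers.

    The map uses string keys, so the ID is converted to str before the lookup.
    """
    return id_map[str(model_response_id)]


# A sample worker choice that points at model "0".
choice = {"metricName": "Fluency", "result": {"modelResponseId": "0"}}
picked = resolve_model(model_response_id_map, choice["result"]["modelResponseId"])
print(picked)
```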

Key value pairs found under `inputRecord`

The following entries describe the inputRecord key value pairs.

prompt

The text of the prompt sent to the model.

category

An optional category that classifies the prompt. Visible to workers in the worker portal during the model evaluation.
Example: "American cities"

referenceResponse

An optional field from the input JSON used to specify the ground truth you want workers to reference during the evaluation.

responses

An optional field from the input JSON that contains responses from other models.

The following is an example JSON input record.


{
  "prompt": {
    "text": "Who invented the airplane?"
  },
  "category": "Airplanes",
  "referenceResponse": {
    "text": "Orville and Wilbur Wright"
  },
  "responses":
    // The same modelIdentifier must be specified for all responses
    [{
      "modelIdentifier": "meta-textgeneration-llama-codellama-7b",
      "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane."
    }]
}
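
The constraints described for an input record can be checked before a job is created. The following Python is a rough sketch under stated assumptions: the helper `validate_input_record` is hypothetical, and treating `prompt.text` as required is an inference from the other fields being described as optional.

```python
def validate_input_record(record: dict) -> list[str]:
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    # Assumption: prompt.text is required, because only the other fields
    # are described as optional.
    if not record.get("prompt", {}).get("text"):
        problems.append("prompt.text is missing or empty")
    # The comment in the example record says every entry in "responses"
    # must use the same modelIdentifier.
    identifiers = {r.get("modelIdentifier") for r in record.get("responses", [])}
    if len(identifiers) > 1:
        problems.append("responses use more than one modelIdentifier")
    return problems


record = {
    "prompt": {"text": "Who invented the airplane?"},
    "category": "Airplanes",
    "referenceResponse": {"text": "Orville and Wilbur Wright"},
    "responses": [
        {"modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "..."}
    ],
}
print(validate_input_record(record))  # no problems for the example record

# A made-up record that mixes two identifiers is reported as a problem.
mixed = dict(record, responses=record["responses"] + [
    {"modelIdentifier": "another-model", "text": "..."}
])
print(validate_input_record(mixed))
```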

Key value pairs found under `modelResponses`

An array of JSON objects. Each object contains a model's response and identifies which model provided it.

text

The model's response to the prompt.

modelIdentifier

The name of the model.

Key value pairs found under `humanAnswers`

An array of JSON objects, one per worker response, that records how workers evaluated the models' responses.

acceptanceTime

When the worker accepted the task in the worker portal.

submissionTime

When the worker submitted their response.

timeSpentInSeconds

How long the worker spent completing the task.

workerId

The ID of the worker who completed the task.

workerMetadata

Metadata about which work team was assigned to this model evaluation job.
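
The timing fields can be cross-checked against each other. The sketch below parses the two timestamps and compares their difference with timeSpentInSeconds, using values from the first worker in the example output on this page. In that example the numbers agree; whether they always agree is an assumption, and the helper `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

# Matches timestamps such as 2024-06-07T22:31:57.066Z (UTC, millisecond precision).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed_seconds(answer: dict) -> float:
    """Seconds between acceptanceTime and submissionTime for one worker answer."""
    accepted = datetime.strptime(answer["acceptanceTime"], TIMESTAMP_FORMAT)
    submitted = datetime.strptime(answer["submissionTime"], TIMESTAMP_FORMAT)
    return (submitted - accepted).total_seconds()


answer = {
    "acceptanceTime": "2024-06-07T22:31:57.066Z",
    "submissionTime": "2024-06-07T22:32:19.640Z",
    "timeSpentInSeconds": 22.574,
    "workerId": "ead1ba56c1278175",
}

computed = elapsed_seconds(answer)
# Compare with a small tolerance rather than exact float equality.
matches = abs(computed - answer["timeSpentInSeconds"]) < 1e-6
print(computed, matches)
```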

Format of the `answerContent` JSON array

The structure of each answer depends on the evaluation metrics selected when the model evaluation job was created. Each worker response or answer is recorded in a new JSON object.

answerContent

evaluationResults contains the worker's responses.
- When Choice buttons is selected, the results from each worker are saved as "evaluationResults": "comparisonChoice".
  
  metricName: The name of the metric
  
  result: The JSON object indicates which model the worker selected using either a 0 or 1. To see which value a model is mapped to, see modelResponseIdMap.
- When Likert scale, comparison is selected, the results from each worker are saved as "evaluationResults": "comparisonLikertScale".
  
  metricName: The name of the metric.
  
  leftModelResponseId: Indicates which model from modelResponseIdMap was shown on the left side of the worker portal.
  
  rightModelResponseId: Indicates which model from modelResponseIdMap was shown on the right side of the worker portal.
  
  result: The JSON object indicates which model the worker selected using either a 0 or 1. To see which value a model is mapped to, see modelResponseIdMap.
- When Ordinal rank is selected, the results from each worker are saved as "evaluationResults": "comparisonRank".
  
  metricName: The name of the metric
  
  result: An array of JSON objects. Workers provide a rank for each model listed in modelResponseIdMap.
```
"result": [{
	"modelResponseId": "0",
	"rank": 1
}, {
	"modelResponseId": "1",
	"rank": 1
}]
```
- When Likert scale, evaluation of a single model response is selected, the results from a worker are saved in "evaluationResults": "individualLikertScale". This is a JSON array containing the scores for each metricName specified when the job was created.
  
  metricName: The name of the metric.
  
  modelResponseId: The model that is scored. To see which value a model is mapped to see, modelResponseIdMap.
  
  result: A key value pair indicating the Likert scale value selected by the worker.
- When Thumbs up/down is selected, the results from a worker are saved as a JSON array "evaluationResults": "thumbsUpDown".
  
  metricName: The name of the metric.
  
  result: Either true or false as it relates to the metricName. When a worker chooses thumbs up, "result" : true.
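
The result shapes listed above can be combined across workers. The following Python is one possible sketch, not a prescribed method: it averages individualLikertScale scores and counts thumbs-up votes per metric and model. The helper names are hypothetical, the sample data is a trimmed copy of values from the example output on this page, and the fallback to an empty list assumes that an unselected metric type may be absent from evaluationResults.

```python
from collections import defaultdict
from statistics import mean


def mean_likert_scores(human_answers: list) -> dict:
    """Average individualLikertScale results per (metricName, modelResponseId)."""
    scores = defaultdict(list)
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        # Assumption: a metric type that was not selected may have no key.
        for item in results.get("individualLikertScale", []):
            scores[(item["metricName"], item["modelResponseId"])].append(item["result"])
    return {key: mean(values) for key, values in scores.items()}


def thumbs_up_counts(human_answers: list) -> dict:
    """Count "result": true votes per (metricName, modelResponseId)."""
    counts = defaultdict(int)
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        for item in results.get("thumbsUpDown", []):
            counts[(item["metricName"], item["modelResponseId"])] += int(item["result"])
    return dict(counts)


def _answer(likert: list, thumbs: list) -> dict:
    """Build a trimmed worker answer holding only the two result types used here."""
    return {"answerContent": {"evaluationResults": {
        "individualLikertScale": [
            {"metricName": "Correctness", "modelResponseId": m, "result": r}
            for m, r in likert
        ],
        "thumbsUpDown": [
            {"metricName": "Accuracy", "modelResponseId": m, "result": r}
            for m, r in thumbs
        ],
    }}}


human_answers = [
    _answer([("0", 2), ("1", 3)], [("0", True), ("1", True)]),
    _answer([("0", 3), ("1", 4)], [("0", True), ("1", False)]),
]

likert_means = mean_likert_scores(human_answers)
up_votes = thumbs_up_counts(human_answers)
print(likert_means)
print(up_votes)
```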

Example output from a model evaluation job

The following JSON object is an example of model evaluation job output that is saved in Amazon S3. To learn more about each key value pair, see the JSON output reference.

For clarity, this job only contains the responses from two workers. Some key value pairs may also have been truncated for readability.


{
    "humanEvaluationResult": {
        "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
        "humanAnswers": [
            {
                "acceptanceTime": "2024-06-07T22:31:57.066Z",
                "answerContent": {
                    "evaluationResults": {
                        "comparisonChoice": [
                            {
                                "metricName": "Fluency",
                                "result": {
                                    "modelResponseId": "0"
                                }
                            }
                        ],
                        "comparisonLikertScale": [
                            {
                                "leftModelResponseId": "0",
                                "metricName": "Coherence",
                                "result": 1,
                                "rightModelResponseId": "1"
                            }
                        ],
                        "comparisonRank": [
                            {
                                "metricName": "Toxicity",
                                "result": [
                                    {
                                        "modelResponseId": "0",
                                        "rank": 1
                                    },
                                    {
                                        "modelResponseId": "1",
                                        "rank": 1
                                    }
                                ]
                            }
                        ],
                        "individualLikertScale": [
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "0",
                                "result": 2
                            },
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "1",
                                "result": 3
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "0",
                                "result": 1
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "1",
                                "result": 4
                            }
                        ],
                        "thumbsUpDown": [
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "0",
                                "result": true
                            },
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "1",
                                "result": true
                            }
                        ]
                    }
                },
                "submissionTime": "2024-06-07T22:32:19.640Z",
                "timeSpentInSeconds": 22.574,
                "workerId": "ead1ba56c1278175",
                "workerMetadata": {
                    "identityData": {
                        "identityProviderType": "Cognito",
                        "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
                        "sub": "cd2848f5-6105-4f72-b44e-68f9cb79ba07"
                    }
                }
            },
            {
                "acceptanceTime": "2024-06-07T22:32:19.721Z",
                "answerContent": {
                    "evaluationResults": {
                        "comparisonChoice": [
                            {
                                "metricName": "Fluency",
                                "result": {
                                    "modelResponseId": "1"
                                }
                            }
                        ],
                        "comparisonLikertScale": [
                            {
                                "leftModelResponseId": "0",
                                "metricName": "Coherence",
                                "result": 1,
                                "rightModelResponseId": "1"
                            }
                        ],
                        "comparisonRank": [
                            {
                                "metricName": "Toxicity",
                                "result": [
                                    {
                                        "modelResponseId": "0",
                                        "rank": 2
                                    },
                                    {
                                        "modelResponseId": "1",
                                        "rank": 1
                                    }
                                ]
                            }
                        ],
                        "individualLikertScale": [
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "0",
                                "result": 3
                            },
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "1",
                                "result": 4
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "0",
                                "result": 1
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "1",
                                "result": 5
                            }
                        ],
                        "thumbsUpDown": [
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "0",
                                "result": true
                            },
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "1",
                                "result": false
                            }
                        ]
                    }
                },
                "submissionTime": "2024-06-07T22:32:57.918Z",
                "timeSpentInSeconds": 38.197,
                "workerId": "bad258db224c3db6",
                "workerMetadata": {
                    "identityData": {
                        "identityProviderType": "Cognito",
                        "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
                        "sub": "84d5194a-3eed-4ecc-926d-4b9e1b724094"
                    }
                }
            }
        ],
        "humanLoopName": "a75711d3e75a8d41f35b9873d253f5b7bce0256e",
        "inputContent": {
            "additionalDataS3Uri": "s3://mgrt-test-us-west-2/test-2-workers-2-model/datasets/custom_dataset/0/task-input-additional-data.json",
            "instructions": "worker instructions provided by the model evaluation job administrator",
            "evaluationMetrics": [
                {
                    "metricName": "Fluency",
                    "metricType": "ComparisonChoice",
                    "description": "Measures the linguistic quality of a generated text."
                },
                {
                    "metricName": "Coherence",
                    "metricType": "ComparisonLikertScale",
                    "description": "Measures the organization and structure of a generated text."
                },
                {
                    "metricName": "Toxicity",
                    "metricType": "ComparisonRank",
                    "description": "Measures the harmfulness of a generated text."
                },
                {
                    "metricName": "Accuracy",
                    "metricType": "ThumbsUpDown",
                    "description": "Indicates the accuracy of a generated text."
                },
                {
                    "metricName": "Correctness",
                    "metricType": "IndividualLikertScale",
                    "description": "Measures a generated answer's satisfaction in the context of the question."
                },
                {
                    "metricName": "Completeness",
                    "metricType": "IndividualLikertScale",
                    "description": "Measures a generated answer's inclusion of all relevant information."
                }
            ],
            "disableRandomization": "true"
        },
        "modelResponseIdMap": {
            "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
            "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
        }
    },
    "inputRecord": {
        "prompt": {
            "text": "What is high intensity interval training?"
        },
        "category": "Fitness",
        "referenceResponse": {
            "text": "High-Intensity Interval Training (HIIT)"
        }
    },
    "modelResponses": [
        {
            "text": "High Intensity Interval Training (HIIT) is a form of exercise that alternates between periods of high intensity work and low intensity recovery.HIIT is an excellent way to increase your fitness and improve your health, but it can be difficult to get started.In this article, we will",
            "modelIdentifier": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612"
        },
        {
            "text": "High intensity interval training is a form of exercise consisting of short bursts of maximum effort followed by periods of rest. The theory behind HIIT is that it can often be more effective at improving cardiovascular and metabolic health than longer, lower intensity workouts.The work intervals can range in length depending on the specific type of exercise, but are typically between 20 and 90 seconds. The recovery periods are generally longer, lasting between 1 and 5 minutes. This pattern is then repeated for multiple sets.\n\nSince the work intervals are high intensity, they require more effort from your body and therefore result in a greater calorie burn. The body also continues to burn calories at an increased rate after the workout due to an effect called excess post exercise oxygen consumption (EPOC), also know as the afterburn effect.\n\nHIIT is a versatile form of training that can be adapted to different fitness levels and can be performed using a variety of exercises including cycling, running, bodyweight movements, and even swimming. It can be done in as little as 20 minutes once or twice a week, making it an efficient option for busy individuals.\n\nWhat are the benefits of high intensity interval training",
            "modelIdentifier": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
        }
    ]
}
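
In the example above, each metricType under inputContent.evaluationMetrics corresponds to a key under evaluationResults that has the same name with its first letter lowercased (for example, ComparisonRank and comparisonRank). The Python sketch below relies on that observation to group metric names by the key where their results appear. Treat the naming rule as an inference from this one example rather than a documented guarantee; the helper `result_key` is hypothetical.

```python
from collections import defaultdict


def result_key(metric_type: str) -> str:
    """Lowercase the first letter: "ComparisonRank" -> "comparisonRank".

    Inferred from the example output only; not a documented rule.
    """
    return metric_type[:1].lower() + metric_type[1:]


# metricName and metricType pairs copied from the example's evaluationMetrics.
evaluation_metrics = [
    {"metricName": "Fluency", "metricType": "ComparisonChoice"},
    {"metricName": "Coherence", "metricType": "ComparisonLikertScale"},
    {"metricName": "Toxicity", "metricType": "ComparisonRank"},
    {"metricName": "Accuracy", "metricType": "ThumbsUpDown"},
    {"metricName": "Correctness", "metricType": "IndividualLikertScale"},
    {"metricName": "Completeness", "metricType": "IndividualLikertScale"},
]

# Group metric names by the evaluationResults key that holds their results.
metrics_by_key = defaultdict(list)
for metric in evaluation_metrics:
    metrics_by_key[result_key(metric["metricType"])].append(metric["metricName"])

for key, names in metrics_by_key.items():
    print(f"{key}: {', '.join(names)}")
```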
