

# Creating a model evaluation job that uses human workers in Amazon Bedrock
<a name="evaluation-human"></a>

This topic provides detailed directions for creating a model evaluation job that uses human workers.

**Topics**
+ [Creating your first model evaluation that uses human workers](model-evaluation-type-human.md)
+ [Create a custom prompt dataset for a model evaluation job that uses human workers](model-evaluation-prompt-datasets-custom-human.md)
+ [Create a human-based model evaluation job](model-evaluation-jobs-management-create-human.md)
+ [List model evaluation jobs that use human workers in Amazon Bedrock](model-evaluation-jobs-management-list-human.md)
+ [Stop a model evaluation job in Amazon Bedrock](model-evaluation-jobs-human-management-stop.md)
+ [Delete a model evaluation job in Amazon Bedrock](model-evaluation-jobs-management-human-delete.md)
+ [Manage a work team for human evaluations of models in Amazon Bedrock](human-worker-evaluations.md)

# Creating your first model evaluation that uses human workers
<a name="model-evaluation-type-human"></a>

A model evaluation job that uses human workers requires access to the following service level resources. Use the linked topics to learn more about getting set up.

**Required service level resources to start a model evaluation job that uses human workers**

1. Model evaluation jobs that use human workers allow you to rate and compare the responses from up to two different foundation models. To start a job, at least one Amazon Bedrock foundation model is required. To learn more about accessing Amazon Bedrock foundation models, see [Access Amazon Bedrock foundation models](model-access.md).

1. To create a model evaluation job using human workers, you need access to the [Amazon Bedrock console](https://console.aws.amazon.com/bedrock/), the AWS Command Line Interface, or a supported AWS SDK. To learn more about the required IAM actions and resources, see [Required console permissions to create a human-based model evaluation job](#more-for-human).

1. When the model evaluation job starts, a service role is used to perform actions on your behalf. To learn more about required IAM actions and the trust policy requirements, see [Service role requirements for automatic model evaluation jobs](automatic-service-roles.md).

1. A prompt dataset is required to start the model evaluation job, and it must be stored in an Amazon S3 bucket. To learn more about the prompt dataset requirements, see [Create a custom prompt dataset for a model evaluation job that uses human workers](model-evaluation-prompt-datasets-custom-human.md).

1. The human evaluators are managed as a work team. You can create a new Amazon Cognito managed work team using the Amazon Bedrock console. To learn more about managing your workforce, see [Manage a work team for human evaluations of models in Amazon Bedrock](human-worker-evaluations.md).

## Required console permissions to create a human-based model evaluation job
<a name="more-for-human"></a>

To create a model evaluation job that uses human workers from the Amazon Bedrock console, you need additional permissions added to your user, group, or role.

The following policy contains the minimum set of IAM actions and resources in Amazon Bedrock, Amazon SageMaker AI, Amazon Cognito, and Amazon S3 that are required to create a human-based model evaluation job using the Amazon Bedrock console.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPassingConsoleCreatedServiceRoles",
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": [
        "arn:aws:iam::111122223333:role/service-role/Amazon-Bedrock-IAM-Role-*"
      ],
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "bedrock.amazonaws.com"
        }
      }
    },
    {
      "Sid": "BedrockConsole",
      "Effect": "Allow",
      "Action": [
        "bedrock:CreateEvaluationJob",
        "bedrock:GetEvaluationJob",
        "bedrock:ListEvaluationJobs",
        "bedrock:StopEvaluationJob",
        "bedrock:GetCustomModel",
        "bedrock:ListCustomModels",
        "bedrock:CreateProvisionedModelThroughput",
        "bedrock:UpdateProvisionedModelThroughput",
        "bedrock:GetProvisionedModelThroughput",
        "bedrock:ListProvisionedModelThroughputs",
        "bedrock:GetImportedModel",
        "bedrock:ListImportedModels",
        "bedrock:ListTagsForResource",
        "bedrock:UntagResource",
        "bedrock:TagResource"
      ],
      "Resource": [
        "arn:aws:bedrock:us-west-2::foundation-model/model-id-of-foundational-model",
        "arn:aws:bedrock:us-west-2:111122223333:inference-profile/*",
        "arn:aws:bedrock:us-west-2:111122223333:provisioned-model/*",
        "arn:aws:bedrock:us-west-2:111122223333:imported-model/*"
      ]
    },
    {
      "Sid": "AllowCognitionActionsForWorkTeamCreations",
      "Effect": "Allow",
      "Action": [
        "cognito-idp:CreateUserPool",
        "cognito-idp:CreateUserPoolClient",
        "cognito-idp:CreateGroup",
        "cognito-idp:AdminCreateUser",
        "cognito-idp:AdminAddUserToGroup",
        "cognito-idp:CreateUserPoolDomain",
        "cognito-idp:UpdateUserPool",
        "cognito-idp:ListUsersInGroup",
        "cognito-idp:ListUsers",
        "cognito-idp:AdminRemoveUserFromGroup"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowModelEvaluationResourceCreation",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateFlowDefinition",
        "sagemaker:CreateWorkforce",
        "sagemaker:CreateWorkteam",
        "sagemaker:DescribeFlowDefinition",
        "sagemaker:DescribeHumanLoop",
        "sagemaker:ListFlowDefinitions",
        "sagemaker:ListHumanLoops",
        "sagemaker:DescribeWorkforce",
        "sagemaker:DescribeWorkteam",
        "sagemaker:ListWorkteams",
        "sagemaker:ListWorkforces",
        "sagemaker:DeleteFlowDefinition",
        "sagemaker:DeleteHumanLoop",
        "sagemaker:RenderUiTemplate",
        "sagemaker:StartHumanLoop",
        "sagemaker:StopHumanLoop"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowConsoleS3AccessForModelEvaluation",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my_output_bucket",
        "arn:aws:s3:::input_datasets/prompts.jsonl"
      ]
    }
  ]
}
```

------

# Create a custom prompt dataset for a model evaluation job that uses human workers
<a name="model-evaluation-prompt-datasets-custom-human"></a>

To create a model evaluation job that uses human workers you must specify a custom prompt dataset. These prompts are then used during inference with the models you select to evaluate. 

If you want to evaluate non-Amazon Bedrock models using responses that you've already generated, include them in the prompt dataset as described in [Perform an evaluation job using your own inference response data](#model-evaluation-prompt-datasets-custom-human-byoir). When you provide your own inference response data, Amazon Bedrock skips the model-invoke step and performs the evaluation job with the data you provide. 

Custom prompt datasets must be stored in Amazon S3, use the JSON Lines format, and use the `.jsonl` file extension. Each line must be a valid JSON object. Your dataset can contain up to 1,000 prompts per evaluation job.

For jobs created using the console, you must update the Cross Origin Resource Sharing (CORS) configuration on the S3 bucket. To learn more about the required CORS permissions, see [Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets](model-evaluation-security-cors.md).
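For illustration only, a typical S3 CORS configuration that permits cross-origin requests has the following shape. Treat the values shown here as placeholders, and use the exact configuration listed in the linked CORS topic:

```
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["Access-Control-Allow-Origin"]
    }
]
```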

## Perform an evaluation job where Amazon Bedrock invokes a model for you
<a name="model-evaluation-prompt-datasets-custom-human-invoke"></a>

To run an evaluation job where Amazon Bedrock invokes the models for you, provide a prompt dataset containing the following key-value pairs:
+ `prompt` – the prompt you want the models to respond to.
+ `referenceResponse` – (optional) a ground truth response that your workers can reference during the evaluation.
+ `category` – (optional) a key that you can use to filter results when reviewing them in the model evaluation report card.

In the worker UI, the values you specify for `prompt` and `referenceResponse` are visible to your human workers.

The following is an example custom dataset that contains six inputs and uses the JSON Lines format.

```
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
{"prompt":"Provide the prompt you want the model to use during inference","category":"(Optional) Specify an optional category","referenceResponse":"(Optional) Specify a ground truth response."}
```

The following example is a single entry expanded for clarity. In your actual prompt dataset each line must be a valid JSON object.

```
{
  "prompt": "What is high intensity interval training?",
  "category": "Fitness",
  "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods."
}
```

## Perform an evaluation job using your own inference response data
<a name="model-evaluation-prompt-datasets-custom-human-byoir"></a>

To run an evaluation job using responses you've already generated, you provide a prompt dataset containing the following key-value pairs:
+ `prompt` – the prompt that your models used to generate the responses.
+ `referenceResponse` – (optional) a ground truth response that your workers can reference during the evaluation.
+ `category` – (optional) a key that you can use to filter results when reviewing them in the model evaluation report card.
+ `modelResponses` – the responses from your own inference that you want to evaluate. You can provide either one or two entries with the following properties in the `modelResponses` list.
  + `response` – a string containing the response from your model inference.
  + `modelIdentifier` – a string identifying the model that generated the responses.

Every line in your prompt dataset must contain the same number of responses (either one or two). Additionally, you must specify the same model identifier or identifiers in each line, and you can use at most two unique values for `modelIdentifier` in a single dataset.

The following is an example custom dataset with six inputs in JSON Lines format.

```
{"prompt":"The prompt you used to generate the model responses","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your first model generated","modelIdentifier":"A string identifying your first model"},{"response":"The response your second model generated","modelIdentifier":"A string identifying your second model"}]}
{"prompt":"The prompt you used to generate the model responses","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your first model generated","modelIdentifier":"A string identifying your first model"},{"response":"The response your second model generated","modelIdentifier":"A string identifying your second model"}]}
{"prompt":"The prompt you used to generate the model responses","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your first model generated","modelIdentifier":"A string identifying your first model"},{"response":"The response your second model generated","modelIdentifier":"A string identifying your second model"}]}
{"prompt":"The prompt you used to generate the model responses","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your first model generated","modelIdentifier":"A string identifying your first model"},{"response":"The response your second model generated","modelIdentifier":"A string identifying your second model"}]}
{"prompt":"The prompt you used to generate the model responses","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your first model generated","modelIdentifier":"A string identifying your first model"},{"response":"The response your second model generated","modelIdentifier":"A string identifying your second model"}]}
{"prompt":"The prompt you used to generate the model responses","referenceResponse":"(Optional) a ground truth response","category":"(Optional) a category for the prompt","modelResponses":[{"response":"The response your first model generated","modelIdentifier":"A string identifying your first model"},{"response":"The response your second model generated","modelIdentifier":"A string identifying your second model"}]}
```

The following example shows a single entry in a prompt dataset expanded for clarity.

```
{
    "prompt": "What is high intensity interval training?",
    "referenceResponse": "High-Intensity Interval Training (HIIT) is a cardiovascular exercise approach that involves short, intense bursts of exercise followed by brief recovery or rest periods.",
    "category": "Fitness",
    "modelResponses": [
        {
            "response": "High intensity interval training (HIIT) is a workout strategy that alternates between short bursts of intense, maximum-effort exercise and brief recovery periods, designed to maximize calorie burn and improve cardiovascular fitness.",
            "modelIdentifier": "Model1"
        },
        {
            "response": "High-intensity interval training (HIIT) is a cardiovascular exercise strategy that alternates short bursts of intense, anaerobic exercise with less intense recovery periods, designed to maximize calorie burn, improve fitness, and boost metabolic rate.",
            "modelIdentifier": "Model2"
        }
    ]
}
```

# Create a human-based model evaluation job
<a name="model-evaluation-jobs-management-create-human"></a>

The following examples show how to create a model evaluation job that uses human workers.

**Console**

**To create a model evaluation job that uses human workers**

1. Open the [Amazon Bedrock console](https://console.aws.amazon.com/bedrock/).

1. In the navigation pane, under **Inference and Assessment**, select **Evaluations**.

1. In the **Model evaluation** pane, under **Human**, choose **Create** and select **Human: Bring your own work team**.

1. On the **Specify job details** page provide the following.

   1. **Evaluation name** — Give the model evaluation job a name that describes the job. This name is shown in your model evaluation job list. The name must be unique in your account in an AWS Region. 

   1. **Description** (Optional) — Provide an optional description.

   1. Choose **Next**.

1. On the **Set up evaluation** page, under **Inference source**, select the source for your model evaluation. You can evaluate the performance of Amazon Bedrock models, or of other models by providing your own inference response data in your prompt dataset. You can select up to two inference sources. For jobs with two sources, you don't have to choose the same type for both sources; you can select one Amazon Bedrock model, and provide your own inference response data for the second source. To evaluate Amazon Bedrock models, do the following:

   1. Under **Select source**, select **Bedrock models**.

   1. Choose **Select model** to choose the model you want to evaluate.

   1. To select a second model, choose **Add model** and repeat the preceding steps.

1. To bring your own inference response data, do the following:

   1. Under **Select source**, select **Bring your own inference responses**.

   1. For **Source Name**, enter a name for the model you used to create the response data. The name you enter must match the `modelIdentifier` parameter in your [prompt dataset](model-evaluation-prompt-datasets-custom-human.md#model-evaluation-prompt-datasets-custom-human-byoir).

   1. To add a second source, choose **Add model** and repeat the preceding steps.

1. For **Task type**, select the type of task you want the model to perform during the model evaluation job. All instructions for the model must be included in the prompts themselves. The task type does not control the model's responses.

1. In the **Datasets** pane, provide the following.

   1. Under **Choose a prompt dataset**, specify the S3 URI of your prompt dataset file or choose **Browse S3** to see available S3 buckets. You can have a maximum of 1000 prompts in a custom prompt dataset.

   1. Under **Evaluation results destination**, specify the S3 URI of the directory where you want the results of your model evaluation job saved, or choose **Browse S3** to see available S3 buckets.

1. (Optional) Under **KMS key - Optional**, provide the ARN of a customer managed key you want to use to encrypt your model evaluation job.

1. In the **Amazon Bedrock IAM role – Permissions** pane, do the following. To learn more about the required permissions for model evaluations, see [Service role requirements for model evaluation jobs](model-evaluation-security-service-roles.md).

   1. To use an existing Amazon Bedrock service role, choose **Use an existing role**. Otherwise, use **Create a new role** to specify the details of your new IAM service role.

   1. In **Service role name**, specify the name of your IAM service role.

   1. When ready, choose **Create role** to create the new IAM service role. 

1. Choose **Next**.

1. Under **Work team**, use the **Select team** dropdown to select an existing team, or create a new team by doing the following:

   1. Under **Team name**, enter a name for your team.

   1. Under **Email addresses**, enter the email addresses of the human workers in your team.

   1. Under **Number of workers per prompt**, select the number of workers who evaluate each prompt. After the responses for each prompt have been reviewed by the number of workers you selected, the prompt and its responses will be taken out of circulation from the work team. The final results report will include all ratings from each worker.
**Important**  
Large language models are known to occasionally hallucinate and produce toxic or offensive content. Your workers may be shown toxic or offensive material during this evaluation. Ensure you take proper steps to train and notify them before they work on the evaluation. They can decline and release tasks or take breaks during the evaluation while accessing the human evaluation tool.

1. Under **Human workflow IAM role - Permissions**, select an existing role, or select **Create a new role**.

1. Choose **Next**.

1. Under **Evaluation instructions**, provide instructions for completing the task. You can preview the evaluation UI that your work team uses to evaluate the responses, including the metrics, rating methods, and your instructions. This preview is based on the configuration you have created for this job.

1. Choose **Next**.

1. Review your configuration and choose **Create** to create the job.
**Note**  
After the job has successfully started, the status changes to **In progress**. When the job has finished, the status changes to **Completed**. While a model evaluation job is still **In progress**, you can choose to stop the job before all the models' responses have been evaluated by your work team. To do so, choose **Stop evaluation** on the model evaluation landing page. This changes the **Status** of the model evaluation job to **Stopping**. After the model evaluation job has successfully stopped, you can delete it.

**API and AWS CLI**

When you create a human-based model evaluation job outside of the Amazon Bedrock console, you must create an Amazon SageMaker AI flow definition ARN.

The flow definition defines a model evaluation job's workflow. It specifies the worker interface and the work team you want assigned to the task, and it connects the workflow to Amazon Bedrock.

For model evaluation jobs started using Amazon Bedrock API operations you *must* create a flow definition ARN using the AWS CLI or a supported AWS SDK. To learn more about how flow definitions work, and creating them programmatically, see [Create a Human Review Workflow (API)](https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-create-flow-definition.html#a2i-create-human-review-api) in the *SageMaker AI Developer Guide*.

In the [CreateFlowDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateFlowDefinition.html) request, you must specify `AWS/Bedrock/Evaluation` as the input to the `AwsManagedHumanLoopRequestSource`. The Amazon Bedrock service role must also have permissions to access the output bucket of the flow definition.

The following is an example request using the AWS CLI. In the request, the `HumanTaskUiArn` is a SageMaker AI owned ARN. In the ARN, you can only modify the AWS Region.

```
aws sagemaker create-flow-definition --cli-input-json '
{
	"FlowDefinitionName": "human-evaluation-task01",
	"HumanLoopRequestSource": {
        "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
    },
    "HumanLoopConfig": {
		"WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
		"HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
		"TaskTitle": "Human review tasks",
		"TaskDescription": "Provide a real good answer",
		"TaskCount": 1,
		"TaskAvailabilityLifetimeInSeconds": 864000,
		"TaskTimeLimitInSeconds": 3600,
		"TaskKeywords": [
		    "foo"
        ]
    },
    "OutputConfig": {
        "S3OutputPath": "s3://amzn-s3-demo-destination-bucket"
    },
    "RoleArn": "arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
}'
```

After creating your flow definition ARN, use the following examples to create a human-based model evaluation job using the AWS CLI or a supported AWS SDK.

------
#### [ AWS CLI ]

The following example command and JSON file shows you how to create a model evaluation job using human workers where you provide your own inference response data. To learn how to specify a prompt dataset for a model evaluation job with human workers, see [Create a custom prompt dataset for a model evaluation job that uses human workers](model-evaluation-prompt-datasets-custom-human.md).

**Example AWS CLI command and JSON file to create an evaluation job using your own inference response data**  

```
aws bedrock create-evaluation-job --cli-input-json file://my_eval_job.json
```

```
{
    "jobName": "model-eval-llama-vs-my-other-model",
    "roleArn": "arn:aws:iam::111122223333:role/service-role/Amazon-Bedrock-IAM-Role-20250218T223671",
    "evaluationConfig": {
        "human": {
            "customMetrics": [
                {
                    "description": "Measures the organization and structure of a generated text.",
                    "name": "Coherence",
                    "ratingMethod": "ThumbsUpDown"
                },
                {
                    "description": "Indicates the accuracy of a generated text.",
                    "name": "Accuracy",
                    "ratingMethod": "ComparisonChoice"
                }
            ],
            "datasetMetricConfigs": [
                {
                    "dataset": {
                        "datasetLocation": {
                            "s3Uri": "s3://amzn-s3-demo-bucket/input/model-eval/fitness-dataset-model-eval-byoir-2-models.jsonl"
                        },
                        "name": "dataset1"
                    },
                    "metricNames": [
                        "Coherence",
                        "Accuracy"
                    ],
                    "taskType": "Generation"
                }
            ],
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/bedrock-fitness-human-byoir",
                "instructions": "<h3>The following are the metrics and their descriptions for this evaluation</h3>\n<p><strong>Coherence</strong>: Measures the organization and structure of a generated text. - <em>Thumbs up/down</em>\n<strong>Accuracy</strong>: Indicates the accuracy of a generated text. - <em>Choice buttons</em></p>\n<h3>Instructions for how to use the evaluation tool</h3>\n<p>The evaluation creator should use this space to write detailed descriptions for every rating method so your evaluators know how to properly rate the responses with the buttons on their screen.</p>\n<h4>For example:</h4>\n<p>If using <strong>Likert scale - individual</strong>, define the 1 and 5 of the 5 point Likert scale for each metric so your evaluators know if 1 or 5 means favorable/acceptable/preferable.\nIf using <strong>Likert scale - comparison</strong>, describe what the evaluator is looking for to determine their preference between two responses.\nIf using <strong>Choice buttons</strong>, describe what is preferred according to your metric and its description.\nIf using <strong>Ordinal ranking</strong>, define what should receive a #1 ranking according to your metric and its description.\nIf using <strong>Thumbs up/down</strong>, define what makes an acceptable response according to your metric and its description.</p>\n<h3>Describing your ground truth responses if applicable to your dataset</h3>\n<p>Describe the purpose of your ground truth responses that will be shown on screen next to each model response. Note that the ground truth responses you provide are not rated/scored by the evaluators - they are meant to be a reference standard for comparison against the model responses.</p>"
            }
        }
    },
    "inferenceConfig": {
        "models": [
            {
                "precomputedInferenceSource": {
                    "inferenceSourceIdentifier": "llama-3-1-80b"
                }
            },
            {
                "precomputedInferenceSource": {
                    "inferenceSourceIdentifier": "my_other_model"
                }
            }
        ]
    },
    "outputDataConfig": {
        "s3Uri": "s3://amzn-s3-demo-bucket/output/"
    }
}
```

------
#### [ SDK for Python ]

The following code example shows you how to create a model evaluation job that uses human workers using the AWS SDK for Python (Boto3).

```
import boto3
client = boto3.client('bedrock')

job_request = client.create_evaluation_job(
    jobName="111122223333-job-01",
    jobDescription="two different task types",
    roleArn="arn:aws:iam::111122223333:role/example-human-eval-api-role",
    inferenceConfig={
        # You must specify an array of models
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier":"arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1",
                    "inferenceParams":"{\"inferenceConfig\":{\"maxTokens\": 512,\"temperature\":0.7,\"topP\":0.9}}"
                }

            },
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": "{\"inferenceConfig\":{\"maxTokens\":512,\"temperature\":1,\"topP\":0.999,\"stopSequences\":[\"stop\"]},\"additionalModelRequestFields\":{\"top_k\": 128}}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri":"s3://job-bucket/outputs/"
    },
    evaluationConfig={
        "human": {
        "humanWorkflowConfig": {
            "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/example-workflow-arn",
            "instructions": "some human eval instruction"
        },
        "customMetrics": [
            {
                "name": "IndividualLikertScale",
                "description": "testing",
                "ratingMethod": "IndividualLikertScale"
            }
        ],
        "datasetMetricConfigs": [
            {
                "taskType": "Summarization",
                "dataset": {
                    "name": "Custom_Dataset1",
                    "datasetLocation": {
                        "s3Uri": "s3://job-bucket/custom-datasets/custom-trex.jsonl"
                    }
                },
                "metricNames": [
                  "IndividualLikertScale"
                ]
            }
        ]
      }

    }
)

print(job_request)
```

------

# List model evaluation jobs that use human workers in Amazon Bedrock
<a name="model-evaluation-jobs-management-list-human"></a>

You can list your current model evaluation jobs that use human workers using the AWS CLI, or a supported AWS SDK. In the Amazon Bedrock console, you can also view a table containing your current model evaluation jobs.

The following examples show you how to find your model evaluation jobs using the AWS Management Console, AWS CLI and SDK for Python.

------
#### [ Amazon Bedrock console ]

1. Open the Amazon Bedrock console: [https://console.aws.amazon.com/bedrock/home](https://console.aws.amazon.com/bedrock/home)

1. In the navigation pane, choose **Model evaluation**.

1. In the **Model Evaluation Jobs** card, you can find a table that lists the model evaluation jobs you have already created.

------
#### [ AWS CLI ]

In the AWS CLI, you can use the `help` command to see which parameters are required and which are optional when using `list-evaluation-jobs`.

```
aws bedrock list-evaluation-jobs help
```

The following is an example of using `list-evaluation-jobs` that specifies a maximum of 5 jobs to return. By default, jobs are returned in descending order by the time they were started.

```
aws bedrock list-evaluation-jobs --max-items 5
```

------
#### [ SDK for Python ]

The following example shows how to use the AWS SDK for Python to list the model evaluation jobs you have previously created.

```
import boto3
client = boto3.client('bedrock')

job_request = client.list_evaluation_jobs(maxResults=20)

print(job_request)
```

------

# Stop a model evaluation job in Amazon Bedrock
<a name="model-evaluation-jobs-human-management-stop"></a>

You can stop a model evaluation job that is currently processing using the AWS Management Console, AWS CLI, or a supported AWS SDK.

The following examples show you how to stop a model evaluation job using the AWS Management Console, AWS CLI, and SDK for Python.

------
#### [ Amazon Bedrock console ]

The following example shows you how to stop a model evaluation job using the AWS Management Console.

1. Open the Amazon Bedrock console: [https://console.aws.amazon.com/bedrock/home](https://console.aws.amazon.com/bedrock/home)

1. In the navigation pane, choose **Model evaluation**.

1. In the **Model Evaluation Jobs** card, you can find a table that lists the model evaluation jobs you have already created.

1. Select the radio button next to your job's name.

1. Then, choose **Stop evaluation**.

------
#### [ SDK for Python ]

The following example shows you how to stop a model evaluation job using the SDK for Python.

```
import boto3

client = boto3.client('bedrock')

response = client.stop_evaluation_job(
    # The ARN of the model evaluation job you want to stop.
    jobIdentifier='arn:aws:bedrock:us-west-2:444455556666:evaluation-job/fxaqujhttcza'
)

print(response)
```

------
#### [ AWS CLI ]

In the AWS CLI, you can use the `help` command to see which parameters are required, and which parameters are optional when using `stop-evaluation-job`.

```
aws bedrock stop-evaluation-job help
```

The following example shows you how to stop a model evaluation job using the AWS CLI.

```
aws bedrock stop-evaluation-job --job-identifier arn:aws:bedrock:us-west-2:444455556666:evaluation-job/fxaqujhttcza
```
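Stopping a job is asynchronous: the call returns while the job is still shutting down, so you may want to poll the job's status until it settles. The following is a minimal polling sketch; the `Stopping`/`Stopped` status values and the idea of wrapping a status call such as `get-evaluation-job` are assumptions for illustration, and the status source here is simulated.

```
import time

def wait_until_stopped(get_status, poll_seconds=0, max_polls=10):
    """Poll a job's status until it leaves the 'Stopping' state."""
    for _ in range(max_polls):
        status = get_status()
        if status != "Stopping":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not stop in time")

# Simulated status sequence standing in for repeated status calls.
seq = iter(["Stopping", "Stopping", "Stopped"])
print(wait_until_stopped(lambda: next(seq)))  # Stopped
```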

------

# Delete a model evaluation job in Amazon Bedrock
<a name="model-evaluation-jobs-management-human-delete"></a>

You can delete a model evaluation job by using the Amazon Bedrock console, or by using the [BatchDeleteEvaluationJob](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_BatchDeleteEvaluationJob.html) operation with the AWS CLI, or a supported AWS SDK. 

Before you can delete a model evaluation job, the status of the job must be `FAILED`, `COMPLETED`, or `STOPPED`. You can get the current status of a job from the Amazon Bedrock console or by calling the [ListEvaluationJobs](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_ListEvaluationJobs.html) operation. For more information, see [List model evaluation jobs that use human workers in Amazon Bedrock](model-evaluation-jobs-management-list-human.md). 
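As an illustrative sketch of that status check, the following filters a list of job summaries down to the jobs whose status allows deletion. The `jobArn` and `status` field names mirror the summaries returned by `ListEvaluationJobs`, but treat the exact shapes here as assumptions.

```
# Statuses that allow a job to be deleted.
DELETABLE = {"Failed", "Completed", "Stopped"}

def deletable_job_arns(job_summaries):
    """Return the ARNs of jobs whose status allows deletion."""
    return [j["jobArn"] for j in job_summaries if j["status"] in DELETABLE]

jobs = [
    {"jobArn": "arn:a", "status": "InProgress"},
    {"jobArn": "arn:b", "status": "Completed"},
    {"jobArn": "arn:c", "status": "Stopped"},
]

print(deletable_job_arns(jobs))  # ['arn:b', 'arn:c']
```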

You can delete up to 25 model evaluation jobs at a time with the console or with the `BatchDeleteEvaluationJob` operation. If you need to delete more jobs, repeat the console procedure or make additional `BatchDeleteEvaluationJob` calls.
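One way to delete more than 25 jobs with the SDK is to split the list of ARNs into batches of 25 and issue one call per batch. The following is a minimal sketch under that assumption; the `delete_all_evaluation_jobs` helper is hypothetical, not part of any AWS SDK.

```
def chunk(items, size=25):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def delete_all_evaluation_jobs(job_arns):
    """Delete any number of jobs, 25 at a time (hypothetical helper)."""
    import boto3  # imported here so the chunk helper stays dependency-free
    client = boto3.client('bedrock')
    for batch in chunk(job_arns):
        client.batch_delete_evaluation_job(jobIdentifiers=batch)

# 60 ARNs would be split into batches of 25, 25, and 10.
batches = chunk(["arn-%d" % i for i in range(60)])
print([len(b) for b in batches])  # [25, 25, 10]
```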

If you delete a model evaluation job with the `BatchDeleteEvaluationJob` operation, you need the Amazon Resource Names (ARNs) of the model evaluation jobs that you want to delete. For information about getting the ARN for a job, see [List model evaluation jobs that use human workers in Amazon Bedrock](model-evaluation-jobs-management-list-human.md). 

When you delete a model evaluation job, all resources in Amazon Bedrock and Amazon SageMaker AI are removed. Any model evaluation job results saved in Amazon S3 buckets are left unchanged. Also, for model evaluation jobs that use human workers, deleting the job does not delete the workforce or work team that you have configured in Amazon Cognito or SageMaker AI.

Use the following sections to see examples of how to delete a model evaluation job.

------
#### [ Amazon Bedrock console ]

Use the following procedure to delete a model evaluation job using the Amazon Bedrock console. To successfully complete this procedure, make sure that your IAM user, group, or role has sufficient permissions to access the console. To learn more, see [Required console permissions to create a human-based model evaluation job](model-evaluation-type-human.md#more-for-human).

**To delete multiple model evaluation jobs**

1. Open the Amazon Bedrock console: [https://console.aws.amazon.com/bedrock/](https://console.aws.amazon.com/bedrock/)

1. In the navigation pane, choose **Model evaluation**.

1. In the **Model Evaluation Jobs** card, use the table to find the model evaluation jobs that you want to delete. Select each job using the checkbox next to its name. You can select up to 25 jobs.

1. Choose **Delete** to delete the model evaluation jobs.

1. If you need to delete more model evaluation jobs, repeat steps 3 and 4.

------
#### [ AWS CLI ]

In the AWS CLI, you can use the `help` command to see which parameters are required, and which parameters are optional when using `batch-delete-evaluation-job`.

```
aws bedrock batch-delete-evaluation-job help
```

The following is an example of using `batch-delete-evaluation-job` that specifies 2 model evaluation jobs to delete. Use the `job-identifiers` parameter to specify a list of ARNs for the model evaluation jobs that you want to delete. You can delete up to 25 model evaluation jobs in a single call to `batch-delete-evaluation-job`. If you need to delete more jobs, make further calls to `batch-delete-evaluation-job`.

```
aws bedrock batch-delete-evaluation-job \
--job-identifiers arn:aws:bedrock:us-east-1:111122223333:evaluation-job/rmqp8zg80rvg arn:aws:bedrock:us-east-1:111122223333:evaluation-job/xmfp9zg204fdk
```

After submitting the request, you receive a response similar to the following.

```
{
    "evaluationJobs": [
        {
            "jobIdentifier": "rmqp8zg80rvg",
            "jobStatus": "Deleting"
        },
        {
            "jobIdentifier": "xmfp9zg204fdk",
            "jobStatus": "Deleting"
        }
    ],
    "errors": []
}
```

------
#### [ SDK for Python ]

The following example shows how to use the AWS SDK for Python to delete model evaluation jobs. Use the `jobIdentifiers` parameter to specify a list of ARNs for the model evaluation jobs that you want to delete. You can delete up to 25 model evaluation jobs in a single call to `BatchDeleteEvaluationJob`. If you need to delete more jobs, make further calls to `BatchDeleteEvaluationJob`.

```
import boto3

client = boto3.client('bedrock')

# Specify the ARNs of up to 25 model evaluation jobs to delete.
response = client.batch_delete_evaluation_job(
    jobIdentifiers=[
        "arn:aws:bedrock:us-east-1:111122223333:evaluation-job/rmqp8zg80rvg",
        "arn:aws:bedrock:us-east-1:111122223333:evaluation-job/xmfp9zg204fdk"
    ]
)

print(response)
```

------

# Manage a work team for human evaluations of models in Amazon Bedrock
<a name="human-worker-evaluations"></a>

For evaluation jobs that use human workers, you need a work team. A work team is a group of workers that *you* choose. These can be employees of your company or a group of subject-matter experts from your industry.

**Worker notifications in Amazon Bedrock**  
When you create an evaluation job in Amazon Bedrock, workers are notified of their assigned job *only* when you first add them to a work team.
If you delete a worker from a work team during evaluation job creation, they lose access to *all* evaluation jobs they have been assigned to.
For any new evaluation jobs that you assign to an existing human worker, you must notify them directly and provide them the URL to the worker portal. Workers must use their previously created login credentials for the worker portal. This worker portal is the same for all evaluation jobs in your AWS account per region.

You can create a human work team or manage an existing one while setting up an evaluation job. Workers belong to a *Private workforce* that is managed by Amazon SageMaker Ground Truth. Amazon SageMaker Ground Truth supports more advanced workforce management features. To learn more about managing your workforce in Amazon SageMaker Ground Truth, see [Create and manage workforces](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management.html).

You can delete workers from a work team while setting up a new evaluation job. Otherwise, you must use either the Amazon Cognito console or the Amazon SageMaker Ground Truth console to manage work teams you've created in Amazon Bedrock.

If your IAM user, group, or role has the required permissions, you can see existing private workforces and work teams that you created in Amazon Cognito, Amazon SageMaker Ground Truth, or Amazon Augmented AI when you are creating an evaluation job that uses human workers.

Amazon Bedrock supports a maximum of 50 workers per work team.

In the email addresses field, you can enter up to 50 email addresses at a time, separated by commas. To add more workers to your evaluation job, use the Amazon Cognito console or the Ground Truth console. Include your own email address so that you are part of the workforce and can see the labeling tasks.
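Because the field expects a comma-separated list with a hard cap of 50 addresses, you may want to validate your list before pasting it in. The following is a quick illustrative helper (not part of any AWS SDK):

```
def parse_worker_emails(raw, limit=50):
    """Split a comma-separated string into a list of addresses,
    enforcing the per-work-team limit."""
    emails = [e.strip() for e in raw.split(",") if e.strip()]
    if len(emails) > limit:
        raise ValueError("at most %d workers per work team" % limit)
    return emails

print(parse_worker_emails("alice@example.com, bob@example.com"))
# ['alice@example.com', 'bob@example.com']
```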

## Create instructions for human workers
<a name="human-worker-evaluation-instructions"></a>

Creating good instructions for your model evaluation jobs improves your workers' accuracy in completing their tasks. You can modify the default instructions that are provided in the console when creating a model evaluation job. The instructions are shown to workers on the UI page where they complete their labeling task.

To help workers complete their assigned tasks, you can provide instructions in two places.

**Provide a good description for each evaluation and rating method**  
The descriptions should provide a succinct explanation of the metrics you selected. Each description should expand on the metric and make clear how you want workers to evaluate the selected rating method.

**Provide your workers overall evaluation instructions**  
These instructions are shown on the same webpage where workers complete a task. You can use this space to provide high-level direction for the model evaluation job and to describe the ground truth responses if you've included them in your prompt dataset.