Human-based model evaluation jobs
The following examples show how to create a model evaluation job that uses human workers.
In the API, you can also include an inference profile in the job by specifying its ARN in the modelIdentifier field.
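For example, the inferenceConfig portion of a CreateEvaluationJob request might reference an inference profile this way; in this sketch the Region, account ID, and profile name are placeholders:

"inferenceConfig": {
  "models": [
    {
      "bedrockModel": {
        "modelIdentifier": "arn:aws:bedrock:us-east-1:111122223333:application-inference-profile/my-profile"
      }
    }
  ]
}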
Console
To create a model evaluation job that uses human workers
- Open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.
- In the navigation pane, choose Model evaluation.
- In the Build an evaluation card, under Human: bring your own team, choose Create human-based evaluation.
- On the Specify job details page, provide the following.
  - Evaluation name — Give the model evaluation job a name that describes the job. This name is shown in your model evaluation job list. The name must be unique in your account in an AWS Region.
  - Description (Optional) — Provide an optional description.
- Then, choose Next.
- On the Set up evaluation page, provide the following.
  - Models – Choose up to two models that you want to use in the model evaluation job. To learn more about available models in Amazon Bedrock, see Access Amazon Bedrock foundation models.
  - (Optional) To change the inference configuration for the selected models, choose update. Changing the inference configuration changes the responses generated by the selected models. To learn more about the available inference parameters, see Inference request parameters and response fields for foundation models.
  - Task type – Choose the type of task you want the model to attempt to perform during the model evaluation job. All instructions for the model must be included in the prompts themselves. The task type does not control the model's responses.
  - Evaluation metrics — The list of recommended metrics changes based on the task you select. For each recommended metric, you must select a Rating method. You can have a maximum of 10 evaluation metrics per model evaluation job.
  - (Optional) Choose Add metric to add a metric. You must define the Metric, Description, and Rating method.
  - In the Datasets card, you must provide the following.
    - Choose a prompt dataset – Specify the S3 URI of your prompt dataset file, or choose Browse S3 to see available S3 buckets. You can have a maximum of 1000 prompts in a custom prompt dataset. For the file format, see the example dataset after this procedure.
    - Evaluation results destination – Specify the S3 URI of the directory where you want the results of your model evaluation job saved, or choose Browse S3 to see available S3 buckets.
  - (Optional) AWS KMS key – Provide the ARN of the customer managed key you want to use to encrypt your model evaluation job.
  - In the Amazon Bedrock IAM role – Permissions card, you must do the following. To learn more about the required permissions for model evaluations, see Service role requirements for model evaluation jobs.
    - To use an existing Amazon Bedrock service role, choose Use an existing role. Otherwise, use Create a new role to specify the details of your new IAM service role.
    - In Service role name, specify the name of your IAM service role.
    - When ready, choose Create role to create the new IAM service role.
- Then, choose Next.
- In the Permissions card, specify the following. To learn more about the required permissions for model evaluations, see Service role requirements for model evaluation jobs.
  - Human workflow IAM role – Specify a SageMaker AI service role that has the required permissions.
In the Work team card, specify the following.
Human worker notification requirements
When you add a new human worker to a model evaluation job, they automatically receive an email inviting them to participate in the model evaluation job. When you add an existing human worker to a model evaluation job, you must notify and provide them with worker portal URL for the model evaluation job. The existing worker will not receive an automated email notification that they are added to the new model evaluation job.
-
Using the Select team dropdown, specify either Create a new work team or the name of an existing work team.
-
(Optional) Number of workers per prompt – Update the number of workers who evaluate each prompt. After the responses for each prompt have been reviewed by the number of workers you selected, the prompt and its responses will be taken out of circulation from the work team. The final results report will include all ratings from each worker.
-
(Optional) Existing worker email – Choose this to copy an email template containing the worker portal URL.
-
(Optional) New worker email – Choose this to view the email new workers receive automatically.
Important
Large language models are known to occasionally hallucinate and produce toxic or offensive content. Your workers may be shown toxic or offensive material during this evaluation. Ensure you take proper steps to train and notify them before they work on the evaluation. They can decline and release tasks or take breaks during the evaluation while accessing the human evaluation tool.
-
-
Then, choose Next.
- On the Provide instruction page, use the text editor to provide instructions for completing the task. You can preview the evaluation UI that your work team uses to evaluate the responses, including the metrics, rating methods, and your instructions. This preview is based on the configuration you have created for this job.
- Then, choose Next.
- On the Review and create page, you can view a summary of the options you've selected in the previous steps.
- To start your model evaluation job, choose Create.
Note
Once the job has successfully started, the status changes to In progress. When the job has finished, the status changes to Completed. While a model evaluation job is still In progress, you can choose to stop the job before all the models' responses have been evaluated by your work team. To do so, choose Stop evaluation on the model evaluation landing page. This changes the Status of the model evaluation job to Stopping. Once the model evaluation job has successfully stopped, you can delete the model evaluation job.
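As referenced in the procedure above, a custom prompt dataset is a JSON Lines file in which each line is a JSON object with a prompt key; the referenceResponse and category keys are optional. A minimal sketch, with illustrative contents:

{"prompt": "What is the capital of France?", "referenceResponse": "Paris", "category": "Geography"}
{"prompt": "Summarize the plot of Hamlet in two sentences.", "category": "Summarization"}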
API and AWS CLI
When you create a human-based model evaluation job outside of the Amazon Bedrock console, you need to create an Amazon SageMaker AI flow definition ARN.
The flow definition is where a model evaluation job's workflow is defined. The flow definition defines the worker interface and the work team you want assigned to the task, and connects the workflow to Amazon Bedrock.
For model evaluation jobs started using Amazon Bedrock API operations, you must create a flow definition ARN using the AWS CLI or a supported AWS SDK. To learn more about how flow definitions work, and how to create them programmatically, see Create a Human Review Workflow (API) in the SageMaker AI Developer Guide.
In the CreateFlowDefinition request, you must specify AWS/Bedrock/Evaluation as the input to the AwsManagedHumanLoopRequestSource. The Amazon Bedrock service role must also have permissions to access the output bucket of the flow definition.
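For example, a minimal sketch of an IAM policy statement granting the service role access to the flow definition's output bucket; your-output-bucket is a placeholder:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-output-bucket",
        "arn:aws:s3:::your-output-bucket/*"
      ]
    }
  ]
}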
The following is an example request using the AWS CLI. In the request, the HumanTaskUiArn is a SageMaker AI owned ARN. In the ARN, you can only modify the AWS Region.
aws sagemaker create-flow-definition --cli-input-json '{
  "FlowDefinitionName": "human-evaluation-task01",
  "HumanLoopRequestSource": {
    "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
  },
  "HumanLoopConfig": {
    "WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
    "HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
    "TaskTitle": "Human review tasks",
    "TaskDescription": "Provide a real good answer",
    "TaskCount": 1,
    "TaskAvailabilityLifetimeInSeconds": 864000,
    "TaskTimeLimitInSeconds": 3600,
    "TaskKeywords": [ "foo" ]
  },
  "OutputConfig": {
    "S3OutputPath": "s3://your-output-bucket"
  },
  "RoleArn": "arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
}'
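The call returns the new flow definition's ARN. Before you reference it in a model evaluation job, you can confirm that the flow definition is active, for example:

aws sagemaker describe-flow-definition \
    --flow-definition-name human-evaluation-task01

The response includes the FlowDefinitionArn to use in your model evaluation job and a FlowDefinitionStatus, which should be Active.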
After creating your flow definition ARN, use the following examples to create a human-based model evaluation job using the AWS CLI or a supported AWS SDK.
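For example, a minimal AWS CLI sketch of a CreateEvaluationJob request; the job name, metric, dataset location, model identifier, and ARNs are placeholders, and you should verify the field names against the current API reference:

aws bedrock create-evaluation-job --cli-input-json '{
  "jobName": "my-human-evaluation-job",
  "roleArn": "arn:aws:iam::111122223333:role/BedrockEvaluationRole",
  "evaluationConfig": {
    "human": {
      "humanWorkflowConfig": {
        "flowDefinitionArn": "arn:aws:sagemaker:AWS Region:111122223333:flow-definition/human-evaluation-task01",
        "instructions": "Rate each response for helpfulness."
      },
      "customMetrics": [
        {
          "name": "Helpfulness",
          "description": "How helpful is the response?",
          "ratingMethod": "ThumbsUpDown"
        }
      ],
      "datasetMetricConfigs": [
        {
          "taskType": "Generation",
          "dataset": {
            "name": "my-prompt-dataset",
            "datasetLocation": { "s3Uri": "s3://your-input-bucket/prompts.jsonl" }
          },
          "metricNames": [ "Helpfulness" ]
        }
      ]
    }
  },
  "inferenceConfig": {
    "models": [
      { "bedrockModel": { "modelIdentifier": "anthropic.claude-v2" } }
    ]
  },
  "outputDataConfig": { "s3Uri": "s3://your-output-bucket/results/" }
}'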