Create a model evaluation job that uses a judge model in Amazon Bedrock

You can create a model evaluation job using the AWS Management Console, AWS CLI, or a supported AWS SDK. Use this topic to learn how to create a model evaluation job that uses a model as a judge.

This type of job requires two different models: a generator model and an evaluator (judge) model. Both models must be available in the same AWS Region, and you must have access to them in your AWS account. To learn more about accessing models, see Access Amazon Bedrock foundation models.

Supported generator models – the model that performs inference based on your prompts
Supported evaluator models – the model that reviews and scores the output of your generator model
  • You need access to at least one of the following Amazon Bedrock foundation models. These are the available judge models. To learn more about gaining access to models and Region availability, see Access Amazon Bedrock foundation models. A quick way to check which models are offered in your Region with the SDK is sketched after this list.

    • Mistral Large – mistral.mistral-large-2402-v1:0

    • Anthropic Claude 3.5 Sonnet – anthropic.claude-3-5-sonnet-20240620-v1:0

    • Anthropic Claude 3.5 Haiku – anthropic.claude-3-5-haiku-20241022-v1:0

    • Anthropic Claude 3 Haiku – anthropic.claude-3-haiku-20240307-v1:0

    • Meta Llama 3.1 70B Instruct – meta.llama3-1-70b-instruct-v1:0
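
Before you create the job, you can confirm that both the generator and the evaluator model identifiers are offered in your chosen AWS Region. The following is a minimal sketch using the SDK for Python (Boto3); the Region and the two model identifiers are example values, and access itself is still granted as described in Access Amazon Bedrock foundation models.

import boto3

# Control-plane client; both models must be offered and accessible in this Region.
bedrock = boto3.client('bedrock', region_name='us-east-1')  # example Region

generator_id = "anthropic.claude-v2"                      # example generator model
evaluator_id = "anthropic.claude-3-haiku-20240307-v1:0"   # example judge model

offered = {m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]}
for model_id in (generator_id, evaluator_id):
    print(model_id, "is offered in this Region" if model_id in offered else "is NOT offered in this Region")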

Use the following tabs to learn how to make a CreateEvaluationJob request. Use inferenceConfig to specify the generator model you want to use in the model evaluation job. Use evaluatorModelConfig to specify the supported evaluator model.

AWS CLI

The following example shows how to make a create-evaluation-job request using the AWS CLI. Use the table found in model metrics for model as a judge to see how to correctly format the metricNames. Ensure that evaluationContext is specified as Model.

aws bedrock create-evaluation-job \
    --cli-input-json '{
    "jobName": "model-eval-llmaj",
    "roleArn": "arn:aws:iam::111122223333:role/Amazon-Bedrock-ModelAsAJudgeTest",
    "evaluationContext": "Model",
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "text_dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://bedrock-model-as-a-judge-test-1/input_datasets/text_dataset_input.jsonl"
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness"
                    ]
                }
            ],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {
                        "modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"
                    }
                ]
            }
        }
    },
    "inferenceConfig": {
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": "{}"
                }
            }
        ]
    },
    "outputDataConfig": {
        "s3Uri": "s3://bedrock-model-as-a-judge-test-1/output_data/"
    }
}'
SDK for Python
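
The following is a minimal Boto3 sketch of the same judge-model request shown in the AWS CLI tab. The request fields mirror that CLI example, and the job name, role ARN, S3 locations, and model identifiers are placeholders that you replace with your own values.

import boto3

client = boto3.client('bedrock')

# Request fields mirror the AWS CLI example above; replace the placeholder values.
job_request = client.create_evaluation_job(
    jobName="model-eval-llmaj",
    roleArn="arn:aws:iam::111122223333:role/Amazon-Bedrock-ModelAsAJudgeTest",
    evaluationContext="Model",  # mirrors the evaluationContext field in the CLI request
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "text_dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://bedrock-model-as-a-judge-test-1/input_datasets/text_dataset_input.jsonl"
                        }
                    },
                    "metricNames": ["Builtin.Correctness", "Builtin.Completeness"]
                }
            ],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}
                ]
            }
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": "{}"
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://bedrock-model-as-a-judge-test-1/output_data/"}
)
print(job_request)

The rest of this section describes human-based model evaluation jobs, which additionally require an Amazon SageMaker AI flow definition.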

When you create a human-based model evaluation job outside of the Amazon Bedrock console, you need to create an Amazon SageMaker AI flow definition ARN.

The flow definition is where a model evaluation job's workflow is defined. It defines the worker interface and the work team you want assigned to the task, and connects the workflow to Amazon Bedrock.

For model evaluation jobs started using Amazon Bedrock API operations, you must create a flow definition ARN using the AWS CLI or a supported AWS SDK. To learn more about how flow definitions work and how to create them programmatically, see Create a Human Review Workflow (API) in the SageMaker AI Developer Guide.

In the CreateFlowDefinition request, you must specify AWS/Bedrock/Evaluation as the input to AwsManagedHumanLoopRequestSource. The Amazon Bedrock service role must also have permissions to access the output bucket of the flow definition.

The following is an example request using the AWS CLI. In the request, the HumanTaskUiArn is an ARN owned by SageMaker AI. In the ARN, you can only modify the AWS Region.

aws sagemaker create-flow-definition --cli-input-json '{
    "FlowDefinitionName": "human-evaluation-task01",
    "HumanLoopRequestSource": {
        "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
    },
    "HumanLoopConfig": {
        "WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
        "HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
        "TaskTitle": "Human review tasks",
        "TaskDescription": "Provide a real good answer",
        "TaskCount": 1,
        "TaskAvailabilityLifetimeInSeconds": 864000,
        "TaskTimeLimitInSeconds": 3600,
        "TaskKeywords": [ "foo" ]
    },
    "OutputConfig": {
        "S3OutputPath": "s3://your-output-bucket"
    },
    "RoleArn": "arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
}'
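
If you use the SDK for Python (Boto3) instead of the AWS CLI, the equivalent operation is create_flow_definition on the SageMaker AI client. The following sketch reuses the placeholder values from the CLI request above; replace AWS Region and the account IDs with your own values.

import boto3

sagemaker = boto3.client('sagemaker')

# Placeholder values copied from the CLI example above; replace "AWS Region" and the account IDs.
response = sagemaker.create_flow_definition(
    FlowDefinitionName="human-evaluation-task01",
    HumanLoopRequestSource={
        "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
    },
    HumanLoopConfig={
        "WorkteamArn": "arn:aws:sagemaker:AWS Region:111122223333:workteam/private-crowd/my-workteam",
        # The task UI ARN is owned by SageMaker AI; you can only modify the AWS Region.
        "HumanTaskUiArn": "arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation",
        "TaskTitle": "Human review tasks",
        "TaskDescription": "Provide a real good answer",
        "TaskCount": 1,
        "TaskAvailabilityLifetimeInSeconds": 864000,
        "TaskTimeLimitInSeconds": 3600,
        "TaskKeywords": ["foo"]
    },
    OutputConfig={
        "S3OutputPath": "s3://your-output-bucket"
    },
    RoleArn="arn:aws:iam::111122223333:role/SageMakerCustomerRoleArn"
)

# The returned flow definition ARN is what you pass as flowDefinitionArn when you create the evaluation job.
print(response["FlowDefinitionArn"])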

The following code example shows you how to create a model evaluation job that uses human workers with the SDK for Python (Boto3).

import boto3

client = boto3.client('bedrock')

job_request = client.create_evaluation_job(
    jobName="111122223333-job-01",
    jobDescription="two different task types",
    roleArn="arn:aws:iam::111122223333:role/example-human-eval-api-role",
    inferenceConfig={
        # You must specify an array of models
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-text-lite-v1",
                    "inferenceParams": "{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"
                }
            },
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": "{\"temperature\":\"0.25\",\"top_p\":\"0.25\",\"max_tokens_to_sample\":\"256\",\"top_k\":\"1\"}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": "s3://job-bucket/outputs/"
    },
    evaluationConfig={
        "human": {
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/example-workflow-arn",
                "instructions": "some human eval instruction"
            },
            "customMetrics": [
                {
                    "name": "IndividualLikertScale",
                    "description": "testing",
                    "ratingMethod": "IndividualLikertScale"
                }
            ],
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "Custom_Dataset1",
                        "datasetLocation": {
                            "s3Uri": "s3://job-bucket/custom-datasets/custom-trex.jsonl"
                        }
                    },
                    "metricNames": ["IndividualLikertScale"]
                }
            ]
        }
    }
)
print(job_request)
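
The create_evaluation_job response includes the new job's ARN. As a follow-up, you can pass that ARN to get_evaluation_job to check on the job; the following sketch continues from the example above.

# Continues from the example above; job_request is the create_evaluation_job response.
job_arn = job_request["jobArn"]

# Retrieve the current status of the job (for example, InProgress, Completed, or Failed).
status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
print(f"{job_arn}: {status}")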