The following example shows how to make a create-evaluation-job request using the AWS CLI.
Use the table found in model metrics for model as judge to see how to correctly format the metricNames.
Ensure that evaluationContext is specified as Model.
aws bedrock create-evaluation-job
"jobName": "model-eval-llmaj"
"roleArn": "arn:aws:iam::111122223333
"evaluationContext": "Model"
"evaluationConfig": {
"automated": {
"datasetMetricConfigs": [
"taskType": "General",
"dataset": {
"name": "text_dataset",
"datasetLocation": {
"s3Uri": "s3://bedrock-model-as-a-judge-test-1/input_datasets/text_dataset_input.jsonl"
"metricNames": [
"evaluatorModelConfig": {
"bedrockEvaluatorModels": [
"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"
"inferenceConfig": {
"models": [
"bedrockModel": {
"modelIdentifier": "anthropic.claude-v2",
"inferenceParams": "{}"
"outputDataConfig": {
"s3Uri": "s3://bedrock-model-as-a-judge-test-1/output_data/"
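The request body above can also be assembled programmatically before it is sent. The following is a minimal sketch in Python, assuming the field names shown in the CLI example; the nesting is inferred from that example and should be checked against the API reference. The helper build_judge_eval_request is hypothetical, and the role ARN and metric name in the usage lines are placeholders, not values from this guide. The sketch only builds and prints the request and makes no call to AWS.

```python
import json


def build_judge_eval_request(job_name, role_arn, dataset_name, dataset_s3_uri,
                             metric_names, evaluator_model_id, model_id,
                             output_s3_uri, inference_params=None):
    """Return a dict that mirrors the create-evaluation-job fields shown above.

    Hypothetical helper: the nesting is inferred from the CLI example.
    """
    if not metric_names:
        raise ValueError("metric_names must contain at least one metric")
    for uri in (dataset_s3_uri, output_s3_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an S3 URI, got {uri!r}")
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationContext": "Model",
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {
                        "name": dataset_name,
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    "metricNames": list(metric_names),
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [
                        {"modelIdentifier": evaluator_model_id}
                    ]
                },
            }
        },
        "inferenceConfig": {
            "models": [{
                "bedrockModel": {
                    "modelIdentifier": model_id,
                    # inferenceParams is a JSON document encoded as a string,
                    # so an empty parameter set becomes "{}".
                    "inferenceParams": json.dumps(inference_params or {}),
                }
            }]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


request = build_judge_eval_request(
    job_name="model-eval-llmaj",
    # Placeholder role ARN (assumption, not from this guide).
    role_arn="arn:aws:iam::111122223333:role/ExampleEvaluationRole",
    dataset_name="text_dataset",
    dataset_s3_uri="s3://bedrock-model-as-a-judge-test-1/input_datasets/text_dataset_input.jsonl",
    # Placeholder metric name; take the real names from the metrics table.
    metric_names=["Builtin.Correctness"],
    evaluator_model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_id="anthropic.claude-v2",
    output_s3_uri="s3://bedrock-model-as-a-judge-test-1/output_data/",
)
print(json.dumps(request, indent=2))
```

Building the request in one function keeps the string-encoded inferenceParams and the S3 URI checks in a single place, which makes a malformed request easier to catch before submission.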
When you create a human-based model evaluation job outside of the Amazon Bedrock
console, you need to create an Amazon SageMaker AI flow definition and provide its ARN.
The flow definition is where a model evaluation job's workflow is defined.
It specifies the worker interface, the work team you want assigned to the task,
and the connection to Amazon Bedrock.
For model evaluation jobs started using Amazon Bedrock API operations, you must
create a flow definition using the AWS CLI or a supported AWS SDK. To
learn more about how flow definitions work and how to create them programmatically,
see Create a Human Review Workflow (API) in the SageMaker AI Developer Guide.
In the CreateFlowDefinition request, you must specify AWS/Bedrock/Evaluation
as the value of AwsManagedHumanLoopRequestSource.
The Amazon Bedrock service role must also have permissions to access the output bucket of the flow definition.
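Both requirements can be checked locally before a request is sent. The following is a minimal sketch, assuming the request is held as a Python dict that uses the CreateFlowDefinition field names; check_flow_definition_request is a hypothetical helper and the bucket name in the usage lines is a placeholder. It confirms the request source and returns the output bucket name, which is the bucket the Amazon Bedrock service role needs to be able to access. It does not inspect any IAM policy.

```python
from urllib.parse import urlparse

BEDROCK_REQUEST_SOURCE = "AWS/Bedrock/Evaluation"


def check_flow_definition_request(request):
    """Validate the request source and return the output bucket name.

    Hypothetical helper: it checks the request body only, not IAM permissions.
    """
    source = request.get("HumanLoopRequestSource", {}).get(
        "AwsManagedHumanLoopRequestSource")
    if source != BEDROCK_REQUEST_SOURCE:
        raise ValueError(
            f"AwsManagedHumanLoopRequestSource must be "
            f"{BEDROCK_REQUEST_SOURCE!r}, got {source!r}")

    parsed = urlparse(request["OutputConfig"]["S3OutputPath"])
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("S3OutputPath must be an s3:// URI with a bucket name")
    return parsed.netloc


example_request = {
    "HumanLoopRequestSource": {
        "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
    },
    # Placeholder bucket and prefix (assumption, not from this guide).
    "OutputConfig": {"S3OutputPath": "s3://example-output-bucket/flow-output/"},
}
bucket = check_flow_definition_request(example_request)
print(f"Service role needs access to bucket: {bucket}")
```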
The following is an example request using the AWS CLI. In the request, the HumanTaskUiArn
is an ARN owned by SageMaker AI. In the ARN, you can only modify the AWS Region.
aws sagemaker create-flow-definition --cli-input-json '
"FlowDefinitionName": "human-evaluation-task01
"HumanLoopRequestSource": {
"AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation
"HumanLoopConfig": {
"WorkteamArn": "arn:aws:sagemaker:AWS Region
## The Task UI ARN is provided by the service team. You can only modify the AWS Region.
"HumanTaskUiArn":"arn:aws:sagemaker:AWS Region:394669845002:human-task-ui/Evaluation"
"TaskTitle": "Human review tasks",
"TaskDescription": "Provide a real good answer",
"TaskCount": 1,
"TaskAvailabilityLifetimeInSeconds": 864000,
"TaskTimeLimitInSeconds": 3600,
"TaskKeywords": [
"OutputConfig": {
"S3OutputPath": "s3://your-output-bucket
"RoleArn": "arn:aws:iam::111122223333
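Because only the AWS Region in the HumanTaskUiArn can change, the ARN can be derived from the Region instead of being edited by hand. The following is a minimal sketch in Python that builds the same request as a dict; human_task_ui_arn and build_flow_definition_request are hypothetical helpers, and the work team ARN, role ARN, and bucket in the usage lines are placeholders. TaskKeywords is left out because its values are not shown above. The sketch makes no call to AWS; the resulting dict is intended for the SageMaker AI client's create_flow_definition operation.

```python
import json

# Account ID and resource name are taken from the HumanTaskUiArn shown above;
# only the Region segment varies.
TASK_UI_ARN_TEMPLATE = (
    "arn:aws:sagemaker:{region}:394669845002:human-task-ui/Evaluation"
)


def human_task_ui_arn(region):
    """Return the SageMaker AI owned task UI ARN for the given Region."""
    return TASK_UI_ARN_TEMPLATE.format(region=region)


def build_flow_definition_request(name, region, workteam_arn,
                                  output_s3_path, role_arn):
    """Return a dict mirroring the create-flow-definition JSON shown above."""
    return {
        "FlowDefinitionName": name,
        "HumanLoopRequestSource": {
            "AwsManagedHumanLoopRequestSource": "AWS/Bedrock/Evaluation"
        },
        "HumanLoopConfig": {
            "WorkteamArn": workteam_arn,
            "HumanTaskUiArn": human_task_ui_arn(region),
            "TaskTitle": "Human review tasks",
            "TaskDescription": "Provide a real good answer",
            "TaskCount": 1,
            "TaskAvailabilityLifetimeInSeconds": 864000,
            "TaskTimeLimitInSeconds": 3600,
        },
        "OutputConfig": {"S3OutputPath": output_s3_path},
        "RoleArn": role_arn,
    }


flow_request = build_flow_definition_request(
    name="human-evaluation-task01",
    region="us-west-2",
    # Placeholder ARNs and bucket (assumptions, not from this guide).
    workteam_arn="arn:aws:sagemaker:us-west-2:111122223333:workteam/private-crowd/example-team",
    output_s3_path="s3://example-output-bucket/flow-output/",
    role_arn="arn:aws:iam::111122223333:role/ExampleFlowDefinitionRole",
)
print(json.dumps(flow_request, indent=2))
```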
The following code example shows you how to create a model evaluation job that uses human workers using the SDK for Python (Boto3).
import boto3
client = boto3.client('bedrock')
job_request = client.create_evaluation_job(
jobDescription="two different task types",
## You must specify an array of models
"models": [
"bedrockModel": {
"inferenceParams":"{\"temperature\":\"0.0\", \"topP\":\"1\", \"maxTokenCount\":\"512\"}"
"bedrockModel": {
"modelIdentifier": "anthropic.claude-v2",
"inferenceParams": "{\"temperature\":\"0.25\",\"top_p\":\"0.25\",\"max_tokens_to_sample\":\"256\",\"top_k\":\"1\"}"
"human": {
"humanWorkflowConfig": {
"flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333
"instructions": "some human eval instruction"
"customMetrics": [
"name": "IndividualLikertScale",
"description": "testing",
"ratingMethod": "IndividualLikertScale"
"datasetMetricConfigs": [
"taskType": "Summarization",
"dataset": {
"name": "Custom_Dataset1",
"datasetLocation": {
"s3Uri": "s3://job-bucket
"metricNames": [