
Start batch evaluation

Start a batch evaluation to run evaluators against multiple agent sessions. The service discovers sessions from CloudWatch Logs, runs each evaluator against each session, and produces aggregate results.

Code samples

Example

AgentCore CLI

The CLI resolves serviceNames and logGroupNames automatically from the project configuration when you use --runtime:


agentcore run batch-evaluation \
  --runtime MyAgent \
  --evaluator Builtin.GoalSuccessRate Builtin.Helpfulness Builtin.Faithfulness

With optional flags:


# Custom name and lookback window
agentcore run batch-evaluation \
  --runtime MyAgent \
  --evaluator Builtin.GoalSuccessRate \
  --name "baseline_eval" \
  --lookback-days 1

# Specific sessions
agentcore run batch-evaluation \
  --runtime MyAgent \
  --evaluator Builtin.GoalSuccessRate \
  --session-ids session-abc123 session-def456

# With ground truth
agentcore run batch-evaluation \
  --runtime MyAgent \
  --evaluator Builtin.GoalSuccessRate Builtin.Correctness \
  --ground-truth ground-truth.json

The CLI polls until the job reaches a terminal state (COMPLETED, FAILED, or STOPPED), displays per-evaluator average scores, and saves results to .cli/eval-job-results/.

AWS SDK (boto3)


import boto3
import uuid
import time
import json

client = boto3.client("bedrock-agentcore", region_name="us-west-2")

# All sessions in the log group
response = client.start_batch_evaluation(
    batchEvaluationName=f"baseline_eval_{uuid.uuid4().hex[:8]}",
    evaluators=[
        {"evaluatorId": "Builtin.GoalSuccessRate"},
        {"evaluatorId": "Builtin.Helpfulness"},
        {"evaluatorId": "Builtin.Faithfulness"},
    ],
    dataSourceConfig={
        "cloudWatchLogs": {
            "serviceNames": ["MyAgent.DEFAULT"],
            "logGroupNames": ["/aws/bedrock-agentcore/runtimes/MyAgent-abc123-DEFAULT"],
        }
    },
    clientToken=str(uuid.uuid4()),
)

batch_eval_id = response["batchEvaluationId"]
print(f"Started: {batch_eval_id}")

# Poll until complete
while True:
    result = client.get_batch_evaluation(batchEvaluationId=batch_eval_id)
    status = result["status"]
    print(f"Status: {status}")

    if status in ("COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "STOPPED"):
        break
    time.sleep(30)

print(json.dumps(result, indent=4, default=str))

With session ID filtering:


# Reuses `client` and `uuid` from the first example
response = client.start_batch_evaluation(
    batchEvaluationName=f"targeted_eval_{uuid.uuid4().hex[:8]}",
    evaluators=[
        {"evaluatorId": "Builtin.GoalSuccessRate"},
    ],
    dataSourceConfig={
        "cloudWatchLogs": {
            "serviceNames": ["MyAgent.DEFAULT"],
            "logGroupNames": ["/aws/bedrock-agentcore/runtimes/MyAgent-abc123-DEFAULT"],
            "filterConfig": {
                "sessionIds": ["session-001", "session-002", "session-003"]
            },
        }
    },
    clientToken=str(uuid.uuid4()),
)

With time range filtering:


# Reuses `client` and `uuid` from the first example
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
response = client.start_batch_evaluation(
    batchEvaluationName=f"weekly_eval_{uuid.uuid4().hex[:8]}",
    evaluators=[
        {"evaluatorId": "Builtin.GoalSuccessRate"},
    ],
    dataSourceConfig={
        "cloudWatchLogs": {
            "serviceNames": ["MyAgent.DEFAULT"],
            "logGroupNames": ["/aws/bedrock-agentcore/runtimes/MyAgent-abc123-DEFAULT"],
            "filterConfig": {
                "timeRange": {
                    "startTime": (now - timedelta(days=7)).isoformat(),
                    "endTime": now.isoformat(),
                }
            },
        }
    },
    clientToken=str(uuid.uuid4()),
)

Request parameters

Parameter	Type	Required	Description
`batchEvaluationName`	String	Yes	A name for the batch evaluation job. Must start with a letter and contain only alphanumeric characters and underscores. Maximum 48 characters.
`dataSourceConfig`	Object	Yes	Where to find agent sessions. Specify a `cloudWatchLogs` source with the log groups and service name for your agent. See Session source below.
`evaluators`	List	Yes	List of evaluators. Each entry has an `evaluatorId` field (for example, `Builtin.GoalSuccessRate`). Maximum 10 evaluators.
`evaluationMetadata`	Object	No	Contains `sessionMetadata`, a list of per-session ground truth and metadata. Maximum 500 entries.
`clientToken`	String	No	Idempotency token. If you retry a request with the same client token, the service returns the existing job instead of creating a new one.
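
The constraints in this table can be checked locally before calling the API. The following is a minimal sketch, not part of the service or SDK: `check_request` is a hypothetical helper, the regular expression is one reading of the naming rule above, and the minimum of one evaluator is an assumption.

```python
import re

# Assumed regex for the documented rule: a leading letter, then letters,
# digits, or underscores, with at most 48 characters in total.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,47}")
MAX_EVALUATORS = 10
MAX_SESSION_METADATA = 500


def check_request(name, evaluator_ids, session_metadata=None):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"invalid batchEvaluationName: {name!r}")
    # Assumption: at least one evaluator is needed; the documented maximum is 10.
    if not 1 <= len(evaluator_ids) <= MAX_EVALUATORS:
        problems.append(f"expected 1 to {MAX_EVALUATORS} evaluators, got {len(evaluator_ids)}")
    if session_metadata is not None and len(session_metadata) > MAX_SESSION_METADATA:
        problems.append(f"sessionMetadata exceeds {MAX_SESSION_METADATA} entries")
    return problems
```

For example, `check_request("weekly-eval", ["Builtin.GoalSuccessRate"])` reports the name as invalid because it contains a hyphen. The service remains the authority on validation; a local check only shortens the feedback loop.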

Session source

The dataSourceConfig parameter specifies the CloudWatch Logs location where the service discovers agent sessions.

Required fields

Field	Type	Description
`cloudWatchLogs.serviceNames`	List of strings (exactly 1)	The service name that identifies your agent’s traces in CloudWatch. Convention: `{RuntimeName}.DEFAULT`.
`cloudWatchLogs.logGroupNames`	List of strings (1–5)	CloudWatch log group names where agent telemetry is stored. Convention: `/aws/bedrock-agentcore/runtimes/{agentId}-DEFAULT`.
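
If your agent follows the naming conventions in this table, the required fields can be derived from the runtime name and agent ID. The sketch below assumes those conventions hold for your deployment; `cloudwatch_source` and its `extra_log_groups` parameter are hypothetical names introduced for illustration.

```python
def cloudwatch_source(runtime_name, agent_id, extra_log_groups=()):
    """Build a dataSourceConfig dict from the documented naming conventions."""
    log_groups = [
        f"/aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT",
        *extra_log_groups,
    ]
    # The table above allows between 1 and 5 log group names.
    if len(log_groups) > 5:
        raise ValueError("logGroupNames accepts at most 5 entries")
    return {
        "cloudWatchLogs": {
            # Exactly one service name is expected.
            "serviceNames": [f"{runtime_name}.DEFAULT"],
            "logGroupNames": log_groups,
        }
    }
```

Calling `cloudwatch_source("MyAgent", "MyAgent-abc123")` produces the same `dataSourceConfig` value used in the boto3 samples on this page. If your log groups or service name differ from the convention, pass the actual values instead.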

Optional fields

Field	Type	Description
`cloudWatchLogs.filterConfig.sessionIds`	List of strings	Evaluate only these session IDs. When omitted, the service discovers all sessions in the specified log groups.
`cloudWatchLogs.filterConfig.timeRange.startTime`	ISO 8601 datetime	Evaluate only sessions created after this time.
`cloudWatchLogs.filterConfig.timeRange.endTime`	ISO 8601 datetime	Evaluate only sessions created before this time.
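
A lookback window, similar in spirit to the CLI's `--lookback-days` flag, can be expressed as a `timeRange` filter. This is a minimal sketch: `lookback_filter` is a hypothetical helper, and it assumes the ISO 8601 strings produced by `datetime.isoformat()` are accepted, as in the time range sample on this page.

```python
from datetime import datetime, timedelta, timezone


def lookback_filter(days, now=None):
    """Build a filterConfig dict covering the last `days` days."""
    end = now if now is not None else datetime.now(timezone.utc)
    # Require an explicit time zone so the ISO 8601 strings are unambiguous.
    if end.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    start = end - timedelta(days=days)
    return {
        "timeRange": {
            "startTime": start.isoformat(),
            "endTime": end.isoformat(),
        }
    }
```

The result is placed under `dataSourceConfig["cloudWatchLogs"]["filterConfig"]`. The `now` parameter exists only to make the window reproducible in tests.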

Response

Field	Type	Description
`batchEvaluationId`	String	Unique identifier for the batch evaluation.
`batchEvaluationArn`	String	ARN of the batch evaluation.
`batchEvaluationName`	String	The name you specified.
`status`	String	Initial status. One of: `PENDING`, `IN_PROGRESS`.
`evaluators`	List	The evaluators used.
`createdAt`	Timestamp	When the job was created.
`outputConfig`	Object	CloudWatch Logs destination for per-session results.

Errors

Error	HTTP status	Description
`ValidationException`	400	Invalid request parameters. Check field constraints and required fields.
`AccessDeniedException`	403	Insufficient permissions. Verify IAM policies.
`ConflictException`	409	A batch evaluation with the same client token already exists with different parameters.
`ThrottlingException`	429	Request rate exceeded. Retry with exponential backoff.
`InternalServerException`	500	Service-side error. Retry the request.
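
The retry guidance above can be sketched as a small wrapper. This is an illustration under stated assumptions rather than the SDK's own retry logic: `call_with_backoff` and `error_code` are hypothetical helpers, the attempt count and delays are arbitrary defaults, and the error code is read from the `response["Error"]["Code"]` field that botocore's `ClientError` exposes.

```python
import random
import time

# Errors that the table above suggests retrying.
RETRYABLE_CODES = {"ThrottlingException", "InternalServerException"}


def error_code(exc):
    """Read the service error code from a ClientError-style exception, if any."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            is_last_attempt = attempt == max_attempts - 1
            if error_code(exc) not in RETRYABLE_CODES or is_last_attempt:
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

When wrapping `start_batch_evaluation`, generate the `clientToken` once outside the wrapper and reuse it in every attempt, so a retry after a lost response returns the existing job instead of creating a duplicate. Note that boto3 clients also apply their own default retries, so a wrapper like this adds a layer on top rather than replacing them.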
