Dataset evaluation
Dataset evaluations let you run your agent against a set of scenarios and automatically evaluate the results. Instead of manually invoking your agent, collecting spans, and calling the Evaluate API, a dataset runner orchestrates the entire lifecycle in a single call: invoke the agent, wait for telemetry ingestion, and evaluate.
This is useful for regression testing, benchmark datasets, CI/CD pipelines, baseline measurement, and pre/post comparison after configuration changes.
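The lifecycle described above can be pictured as three phases run in order. The sketch below is not the AgentCore SDK API: `run_dataset`, `ScenarioResult`, and the three callables passed in are hypothetical stand-ins, and waiting once for the whole dataset (rather than per scenario) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ScenarioResult:
    scenario_id: str
    score: float


def run_dataset(
    scenarios: Iterable[dict],
    invoke_agent: Callable[[dict], str],
    wait_for_ingestion: Callable[[], None],
    evaluate: Callable[[str], float],
) -> list[ScenarioResult]:
    """Hypothetical runner showing the invoke, wait, evaluate ordering."""
    # Phase 1: invoke the agent once per scenario and remember its session.
    sessions = {s["id"]: invoke_agent(s) for s in scenarios}
    # Phase 2: wait for telemetry ingestion (assumed here to happen once).
    wait_for_ingestion()
    # Phase 3: evaluate each session and collect one result per scenario.
    return [ScenarioResult(sid, evaluate(sess)) for sid, sess in sessions.items()]


# Stub callables so the sketch runs without any AWS access.
demo = run_dataset(
    scenarios=[{"id": "s1", "input": "hello"}, {"id": "s2", "input": "bye"}],
    invoke_agent=lambda s: f"session-{s['id']}",
    wait_for_ingestion=lambda: None,
    evaluate=lambda session: 1.0 if session.startswith("session-") else 0.0,
)
print(demo)
```

Passing the phases in as callables keeps the ordering visible and lets the same loop be exercised with stubs in a unit test.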
The AgentCore SDK provides two dataset runners that share the same dataset schema and ground truth format but differ in where evaluation happens:
- On-demand dataset runner (`OnDemandEvaluationDatasetRunner`): Collects spans and calls the Evaluate API client-side. Best for dev-time iteration and small datasets.
- Batch dataset runner (`BatchEvaluationRunner`): Delegates span collection and evaluation to the service via the batch evaluation API. Best for large datasets and production baselines.
Choosing a runner
| Aspect | On-demand runner | Batch runner |
|---|---|---|
| Span collection | SDK-side via | Server-side; service reads from CloudWatch directly |
| Evaluate API calls | SDK calls | SDK calls |
| Execution model | Synchronous three-phase pipeline (invoke, wait, evaluate) | Asynchronous four-phase pipeline (invoke, wait, submit, poll) |
| Results | Structured | Aggregate |
| Best for | Dev-time iteration, CI/CD, small datasets, when you need per-scenario detail immediately | Baseline measurement, large datasets, pre/post comparison, when aggregate scores are sufficient |
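The "submit, poll" half of the asynchronous pipeline in the table can be illustrated with a generic polling loop. This is a minimal sketch under stated assumptions, not the batch runner's implementation: `poll_until_done`, the `get_status` callable, and the status strings `IN_PROGRESS`, `COMPLETED`, and `FAILED` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    terminal: frozenset = frozenset({"COMPLETED", "FAILED"}),  # assumed names
) -> str:
    """Call get_status repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        time.sleep(interval_s)


# Stub job that reports IN_PROGRESS twice before completing.
_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final_status = poll_until_done(lambda: next(_statuses), interval_s=0.0)
print(final_status)
```

A synchronous runner has no equivalent loop, because its evaluate step returns results directly to the caller.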
Prerequisites
Both runners require:
- Python 3.10+
- An agent deployed on AgentCore Runtime with observability enabled, or an agent built with a supported framework configured with AgentCore Observability. Supported frameworks:
  - Strands Agents
  - LangGraph with `opentelemetry-instrumentation-langchain` or `openinference-instrumentation-langchain`
- Transaction Search enabled in CloudWatch; see Enable Transaction Search
- The AgentCore SDK installed: `pip install bedrock-agentcore`
- AWS credentials configured with permissions for `bedrock-agentcore`, `bedrock-agentcore-control`, and `logs` (CloudWatch)
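The two local prerequisites (Python version and installed SDK) can be checked before a run with the standard library alone. The helper below is an illustrative sketch: `check_prerequisites` is a hypothetical name, it makes no AWS calls, and it does not verify credentials, Transaction Search, or agent observability.

```python
import sys
from importlib import metadata


def check_prerequisites(min_python=(3, 10), dist_name="bedrock-agentcore"):
    """Report the Python version check and the installed SDK version, if any."""
    try:
        # Looks up the pip distribution name, not the import name.
        sdk_version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        sdk_version = None  # SDK is not installed in this environment
    return {
        "python_ok": sys.version_info >= min_python,
        "sdk_version": sdk_version,
    }


report = check_prerequisites()
print(report)
```

A `sdk_version` of `None` means `pip install bedrock-agentcore` has not been run in the active environment.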