Getting started with batch evaluation
This walkthrough takes you from a deployed agent to batch evaluation results using an Acme Store customer support agent. You will create the agent, deploy it, generate sample sessions, run a batch evaluation, and read the results.
Before you begin
Make sure you have:
- The AgentCore CLI installed (`agentcore --version`)
- AWS credentials with permissions for `bedrock-agentcore` and `logs`
- Transaction Search enabled in CloudWatch
- Python 3.10+ (for boto3 examples)
For full details, see Prerequisites.
The following constants are used in the boto3 examples. Replace them with your own values after deploying the agent:
```python
REGION = "us-west-2"
AGENT_ARN = "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/AcmeSupport-abc123"
SERVICE_NAME = "AcmeSupport-abc123.DEFAULT"
LOG_GROUP = "/aws/bedrock-agentcore/runtimes/AcmeSupport-abc123-DEFAULT"
```
Step 1: Create and deploy the sample agent
Create an AgentCore project and replace the default agent code with the Acme Store customer support agent. This agent has five tools for handling orders, returns, shipping, discounts, and escalations.
Create the project
```shell
agentcore create --name AcmeSupport --framework Strands --model-provider Bedrock --memory none
cd AcmeSupport
```
Replace the agent code
Open `app/AcmeSupport/main.py` and replace its contents with the following:
"""Acme Store customer support agent.""" from strands import Agent, tool from strands.models.bedrock import BedrockModel from bedrock_agentcore.runtime import BedrockAgentCoreApp app = BedrockAgentCoreApp() MODEL_ID = "global.anthropic.claude-sonnet-4-6" SYSTEM_PROMPT = ( "You are a helpful customer support assistant for Acme Store. " "Help customers with their orders, returns, and shipping questions." ) @tool def lookup_order(order_id: str) -> str: """Look up an order by ID and return its status, item, and delivery details.""" orders = { "ORD-1001": { "status": "delivered", "item": "Blue T-Shirt (L)", "delivered": "2026-03-28", "total": "$29.99", }, "ORD-1002": { "status": "in_transit", "item": "Running Shoes (10)", "shipped": "2026-03-30", "est_delivery": "2026-04-05", "total": "$89.99", }, "ORD-1003": { "status": "delayed", "item": "Wireless Headphones", "shipped": "2026-03-25", "est_delivery": "2026-03-29", "days_late": 5, "total": "$59.99", }, "ORD-1004": { "status": "processing", "item": "Yoga Mat", "ordered": "2026-04-02", "total": "$34.99", }, "ORD-1005": { "status": "delivered", "item": "Coffee Maker", "delivered": "2026-03-20", "total": "$149.99", }, } return str(orders.get(order_id, {"error": f"Order {order_id} not found"})) @tool def initiate_return(order_id: str, reason: str) -> str: """Initiate a return for an order. Sends a return label to the customer.""" return ( f"Return initiated for {order_id}. Reason: {reason}. " "Return label sent to customer email. Please ship within 14 days." ) @tool def check_shipping_status(order_id: str) -> str: """Check detailed shipping status including carrier location and delays.""" statuses = { "ORD-1002": ( "Package is with carrier, currently in Portland OR. " "On schedule for April 5." ), "ORD-1003": ( "Package delayed at distribution center in Memphis TN. " "Original delivery was March 29. Now 5 days late. " "Acme Store policy: orders delayed 3+ days qualify for 15% discount." 
), } return statuses.get(order_id, f"No active shipment found for {order_id}.") @tool def apply_discount(order_id: str, discount_percent: int, reason: str) -> str: """Apply a percentage discount to an order and issue a refund.""" return ( f"Applied {discount_percent}% discount to {order_id}. " f"Reason: {reason}. Refund will appear in 3-5 business days." ) @tool def escalate_to_human(reason: str) -> str: """Escalate the conversation to a human support agent.""" return ( f"Escalated to human agent. Reason: {reason}. " "Estimated wait time: 3 minutes." ) agent = Agent( model=BedrockModel(model_id=MODEL_ID), tools=[lookup_order, initiate_return, check_shipping_status, apply_discount, escalate_to_human], system_prompt=SYSTEM_PROMPT, ) @app.entrypoint def invoke(payload, context): result = agent(payload.get("prompt", "Hello")) return {"response": str(result)} if __name__ == "__main__": app.run()
Deploy and verify
```shell
agentcore deploy
```
After deployment, verify the agent is running:
```shell
agentcore invoke --prompt "What's the status of order ORD-1001?"
```
You should see a response with the order details. Note the runtime ARN, service name, and log group from `agentcore status --json` — you will need these for the boto3 examples.
Note
If you already have an agent deployed on AgentCore Runtime with observability enabled, skip this step and use your own agent for the rest of the walkthrough.
Step 2: Generate sample sessions
Invoke the agent with varied prompts to create sessions for evaluation. These prompts cover different scenarios: order lookups, returns, shipping delays, discount requests, and multi-tool interactions.
Example
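The sketch below generates one session per prompt using the `InvokeAgentRuntime` data-plane operation (`invoke_agent_runtime` in boto3). The prompts and the `new_session_id` helper are illustrative, and the response-body handling may vary by SDK version and content type; you can equally well repeat the `agentcore invoke` command from Step 1 with each prompt.

```python
"""Generate varied sample sessions for the Acme Store agent (sketch)."""
import json
import uuid

REGION = "us-west-2"
AGENT_ARN = "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/AcmeSupport-abc123"

# Prompts covering order lookups, returns, shipping delays, discount
# requests, and multi-tool interactions.
PROMPTS = [
    "What's the status of order ORD-1002?",
    "I want to return the Coffee Maker from ORD-1005. It stopped working.",
    "My headphones from ORD-1003 still haven't arrived. What's going on?",
    "Order ORD-1003 is late. Am I owed anything for the delay?",
    "Look up ORD-1001 and tell me whether I can still return it.",
]


def new_session_id() -> str:
    """Build a unique session ID (runtime session IDs must be 33+ characters)."""
    return f"eval-walkthrough-{uuid.uuid4()}"


def main() -> None:
    import boto3  # deferred import so the file can be read without boto3 installed

    client = boto3.client("bedrock-agentcore", region_name=REGION)
    for prompt in PROMPTS:
        response = client.invoke_agent_runtime(
            agentRuntimeArn=AGENT_ARN,
            runtimeSessionId=new_session_id(),  # a fresh session per prompt
            payload=json.dumps({"prompt": prompt}),
        )
        # Simplified response handling; the body shape depends on the
        # agent's content type and your SDK version.
        body = response.get("response")
        print(body.read().decode() if hasattr(body, "read") else body)


# main()  # requires AWS credentials and your own AGENT_ARN
```

Each invocation uses a fresh session ID so that batch evaluation discovers five distinct sessions rather than one long conversation.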
Wait 2–3 minutes after the last invocation for CloudWatch to ingest the telemetry before proceeding.
Step 3: Run batch evaluation
Start a batch evaluation to score all recent sessions. The service discovers sessions from CloudWatch Logs, runs each evaluator against each session, and returns aggregate results.
Example
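A minimal sketch of starting the evaluation with boto3. The operation name `start_batch_evaluation` and its request shape are hypothetical, shown only to illustrate the flow; substitute the actual batch-evaluation operations from the bedrock-agentcore API reference. The `outputConfig.cloudWatchConfig` result shape matches what Step 4 reads.

```python
"""Start a batch evaluation over recently ingested sessions (sketch)."""
import time

REGION = "us-west-2"
SERVICE_NAME = "AcmeSupport-abc123.DEFAULT"
LOG_GROUP = "/aws/bedrock-agentcore/runtimes/AcmeSupport-abc123-DEFAULT"


def session_window(hours: int = 1) -> tuple[int, int]:
    """Epoch-millisecond time range used to discover recent sessions."""
    end_ms = int(time.time() * 1000)
    return end_ms - hours * 3_600_000, end_ms


def main() -> None:
    import boto3

    client = boto3.client("bedrock-agentcore", region_name=REGION)
    start_ms, end_ms = session_window()

    # HYPOTHETICAL operation and parameters, shown for shape only; use the
    # batch-evaluation operations from the bedrock-agentcore API reference.
    result = client.start_batch_evaluation(
        serviceName=SERVICE_NAME,
        sessionSource={
            "logGroupName": LOG_GROUP,
            "startTime": start_ms,
            "endTime": end_ms,
        },
    )
    # The CloudWatch output location is read in Step 4.
    print(result["outputConfig"]["cloudWatchConfig"])


# main()  # requires AWS credentials and the real API operation names
```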
Step 4: Read per-session detail
The aggregate scores tell you the overall picture. To see per-turn, per-evaluator scores for individual sessions, read the evaluation events from CloudWatch Logs.
```python
import json

import boto3

# Get the output location from the batch evaluation result
output = result["outputConfig"]["cloudWatchConfig"]
log_group = output["logGroupName"]
log_stream = output["logStreamName"]

# Read the events
logs_client = boto3.client("logs", region_name=REGION)
response = logs_client.get_log_events(
    logGroupName=log_group,
    logStreamName=log_stream,
)

for event in response["events"]:
    event_attrs = json.loads(event["message"]).get("attributes", {})
    print(f"Score: {event_attrs.get('gen_ai.evaluation.score.value')}")
    print(f"Label: {event_attrs.get('gen_ai.evaluation.score.label')}")
    print(f"Explanation: {event_attrs.get('gen_ai.evaluation.explanation', '')[:200]}")
    print()
```
Next steps
- Filter sessions — Evaluate specific sessions by ID or time range. See Start a batch evaluation.
- Run against a dataset — Invoke your agent against predefined scenarios and evaluate the results automatically. See Dataset evaluation.
- Compare runs — Run batch evaluation before and after a change and compare scores. See Understanding results and output.