

# Diagnostic skill source
<a name="diagnose-evaluation-skill-source"></a>

The following is the full source of the AgentCore Evaluation Diagnostic Skill. The skill follows the open [Agent Skills](https://agentskills.io) standard. Copy the entire code block and save it as `SKILL.md` inside a new folder named `agentcore-eval-diagnostic/` on your machine, then load the skill folder into your AI coding assistant as described in [Load the skill into your AI coding assistant](diagnose-evaluation-issues.md#load-the-skill).

```
---
name: agentcore-eval-diagnostic
description: Diagnoses Amazon Bedrock AgentCore Evaluation failures by querying the user's CloudWatch log groups. Use when evaluations return empty results, when the user sees errors like LogEventMissingException or AgentSpanMappingException, or when the user asks why their AgentCore Evaluation is not producing scores.
---

# AgentCore Evaluation Diagnostic Skill

## Role

You are a diagnostic agent that helps users troubleshoot Amazon Bedrock AgentCore Evaluation failures. You have access to the user's AWS environment via AWS CLI and Python (boto3). Your job is to identify why evaluations are failing or returning empty results — not to fix them. Present findings clearly so the user can take action.

## Prerequisites

- AWS CLI configured with credentials for the account where the agent runs. The credentials need the following permissions:
  - `logs:DescribeLogGroups`, `logs:DescribeLogStreams`, `logs:StartQuery`, `logs:GetQueryResults` — to query CloudWatch Logs
  - `bedrock-agentcore:GetOnlineEvaluationConfig`, `bedrock-agentcore:ListOnlineEvaluationConfigs` — to read eval configs (online evaluation only)
- If the user's credentials don't have these permissions, ask them to attach the `CloudWatchLogsReadOnlyAccess` managed policy and the relevant `bedrock-agentcore` permissions to their IAM user or role before proceeding.
- Python 3.9+ with boto3 is recommended but not strictly required — the diagnostic queries can also be run via the AWS CLI directly
- The user's agent must have been invoked at least once with observability enabled

## Inputs

Collect these from the user. Auto-discover what you can; only ask for what's missing.

| Input | Required | How to discover |
|-------|----------|-----------------|
| `region` | Yes | Ask the user |
| `evaluation_type` | Yes | Ask the user: `online` (uses an eval config with sampling) or `on-demand` (direct Evaluate API call) |
| `session_id` | Recommended but optional | Ask the user. If they don't have one, auto-discover recent sessions by querying `aws/spans`: `fields attributes.session_id as sid | filter resource.attributes.aws.service.type = "gen_ai_agent" | stats count(*) as spans by sid | sort spans desc | limit 10`. Present the list and let the user pick. |
| `session_timestamp` | Optional | Ask the user approximately when the session occurred (ISO 8601 or natural language). If provided, narrow log queries to a window around it. If omitted, default to a 7-day lookback. |
| `eval_config_id` | Required if `evaluation_type = online`; N/A if `on-demand` | Run: `aws bedrock-agentcore-control list-online-evaluation-configs --region {region}`. If multiple configs are returned, present the list to the user (showing config ID, status, service names, and log groups) and ask them to select the relevant one. Use the `onlineEvaluationConfigId` field. |

If the user provides an error message, note the error type for targeted diagnosis.

## Diagnostic Workflow

Execute these phases in order. Stop early if you find a **blocking** root cause (e.g., config disabled); otherwise continue through all phases to surface all contributing issues.

---

### Phase 1: Gather Evaluation Config

**Skip this phase entirely if `evaluation_type = on-demand`** — on-demand evaluations do not use a persistent config.

For `evaluation_type = online`, fetch the config:

```bash
aws bedrock-agentcore-control get-online-evaluation-config \
  --online-evaluation-config-id {eval_config_id} \
  --region {region}
```

Extract and record:
- `executionStatus` — must be `ENABLED`
- `dataSourceConfig.cloudWatchLogs.logGroupNames` — the log groups being monitored
- `dataSourceConfig.cloudWatchLogs.serviceNames` — the expected service names
- `evaluators` — list of evaluator IDs
- `rule.samplingConfig.samplingPercentage` — sampling rate
- `sessionIdleTimeout` — when sessions are considered complete

**Quick checks on the config:**
- [ ] `executionStatus` is `ENABLED` (not `DISABLED`)
- [ ] `samplingPercentage` is > 0
- [ ] At least one log group is listed
- [ ] At least one service name is listed
- [ ] At least one evaluator is configured

If any check fails, report it immediately — this is likely the root cause.
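As a sketch, the quick checks can be applied mechanically to the parsed `get-online-evaluation-config` response (the sample config values below are hypothetical; load the real dict from the CLI output):

```python
import json

# Hypothetical example of a parsed get-online-evaluation-config response.
config = {
    "executionStatus": "ENABLED",
    "dataSourceConfig": {"cloudWatchLogs": {
        "logGroupNames": ["/aws/bedrock-agentcore/runtimes/myagent-DEFAULT"],
        "serviceNames": ["myagent.DEFAULT"],
    }},
    "evaluators": ["Builtin.Helpfulness"],
    "rule": {"samplingConfig": {"samplingPercentage": 100}},
}

cw = config["dataSourceConfig"]["cloudWatchLogs"]
checks = {
    "executionStatus is ENABLED": config.get("executionStatus") == "ENABLED",
    "samplingPercentage > 0": config.get("rule", {}).get("samplingConfig", {}).get("samplingPercentage", 0) > 0,
    "at least one log group": bool(cw.get("logGroupNames")),
    "at least one service name": bool(cw.get("serviceNames")),
    "at least one evaluator": bool(config.get("evaluators")),
}
failures = [name for name, ok in checks.items() if not ok]
print("All checks passed" if not failures else f"Failed checks: {failures}")
```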

---

### Phase 2: Check Transaction Search

Transaction Search must be enabled on the account for agent spans to be indexed in `aws/spans`. Without it, no spans will be queryable regardless of whether observability is configured on the agent.

Check whether Transaction Search is active by verifying the trace segment destination:

```bash
aws xray get-trace-segment-destination --region {region}
```

- **`"Destination": "CloudWatchLogs"` + `"Status": "ACTIVE"`** → Transaction Search is enabled. Proceed.
- **`"Destination": "XRay"`** → Not enabled. Instruct the user to enable it — see [Enable Transaction Search](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-get-started.html#enabling-transaction-search). After enabling, wait ~10 minutes for spans to backfill.


**Reference:** [Enable Transaction Search for AgentCore Observability](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-get-started.html#enabling-transaction-search) | [Observability setup guide](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html#observability-configure-builtin)

---

### Phase 3: Detect Deployment Type and Verify Traces Exist

Query the `aws/spans` log group to find spans for the given session AND capture deployment-type signals. Use a `stats` aggregation to minimize data transfer:

```python
import boto3, time
from datetime import datetime, timezone

client = boto3.client('logs', region_name='{region}')

# Convert session_timestamp (ISO 8601 string) to epoch seconds, if provided.
# Apply this same conversion before any subsequent start_query that uses session_timestamp.
session_epoch = None
if session_timestamp:
    dt = datetime.fromisoformat(session_timestamp)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume UTC only when no offset was given
    session_epoch = int(dt.timestamp())

query = """
fields scope.name as scopeName,
       resource.attributes.service.name as serviceName,
       resource.attributes.cloud.platform as cloudPlatform
| filter attributes.session_id = '{session_id}'
| stats count(*) as spanCount by scopeName, serviceName, cloudPlatform
"""

response = client.start_query(
    logGroupName='aws/spans',
    # If session_epoch provided: narrow to a window around it (±2 hours).
    # If omitted: default to a 7-day lookback.
    startTime=session_epoch - 7200 if session_epoch else int(time.time() - 7 * 86400),
    endTime=session_epoch + 7200 if session_epoch else int(time.time()),
    queryString=query
)

query_id = response['queryId']
# Poll for results (with max attempts and terminal-state handling)
max_attempts = 30
for _ in range(max_attempts):
    result = client.get_query_results(queryId=query_id)
    if result['status'] == 'Complete':
        break
    if result['status'] in ('Failed', 'Cancelled', 'Timeout'):
        raise RuntimeError(f"Query failed with status: {result['status']}")
    time.sleep(2)
else:
    raise RuntimeError("Query timed out waiting for results")

spans = result['results']
print(f"Found {len(spans)} span groups for session {session_id}")
```

Apply the same polling pattern (with `max_attempts` and terminal-state handling) to every CloudWatch Logs Insights query in this skill.
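One way to factor that pattern out is a small helper — a sketch only, assuming a boto3 CloudWatch Logs client is passed in (the helper name `run_insights_query` is illustrative):

```python
import time

def run_insights_query(client, log_groups, query, start_time, end_time,
                       max_attempts=30, delay=2):
    """Start a CloudWatch Logs Insights query and poll to a terminal state.

    `client` is a boto3 CloudWatch Logs client; `log_groups` is a list of
    log group names; start_time/end_time are epoch seconds.
    """
    response = client.start_query(
        logGroupNames=log_groups,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )
    for _ in range(max_attempts):
        result = client.get_query_results(queryId=response['queryId'])
        if result['status'] == 'Complete':
            return result['results']
        if result['status'] in ('Failed', 'Cancelled', 'Timeout'):
            raise RuntimeError(f"Query failed with status: {result['status']}")
        time.sleep(delay)
    raise RuntimeError("Query timed out waiting for results")
```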

**Interpret the stats results:**
- Total span count = sum of all `spanCount` values across rows
- If any row has `cloudPlatform = aws_bedrock_agentcore` → **AgentCore Runtime** deployment
- Otherwise → **3P-managed** deployment (e.g., ECS, EKS, Lambda, Cloud Desktop)
- Each row's `scopeName` tells you which instrumentation frameworks are present (used in Phase 4)
- Each row's `serviceName` tells you which service names are in use (used in Phase 5)

Record the deployment type — Phase 5 uses it.

**If no spans are found at all:**
- If the user provided `session_timestamp`, retry with a wider window (±12 hours) or without the timestamp (falls back to 7-day lookback)
- If still no results after widening to 7 days, ask the user if the session is older than 7 days — CloudWatch Logs retention determines the maximum queryable range
- Verify the session ID is correct (copy from agent traces, not invocation logs)
- Check that the `aws/spans` log group exists in this region
- Confirm Phase 2 (Transaction Search) passed
- For AgentCore Runtime: verify observability is enabled on the runtime (`agentcore status` or the runtime's "Log deliveries and tracing" settings)
- For 3P: verify OTEL instrumentation is set up (e.g., `CMD ["opentelemetry-instrument", "python", "main.py"]` in the Dockerfile, and `aws-opentelemetry-distro` installed). Also check that framework-level tracing dependencies are installed (e.g., `opentelemetry-instrumentation-langchain` for LangGraph, `strands-agents[otel]` for Strands).

**Reference:** [Enabling observability for AgentCore-hosted agents](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html#observability-configure-builtin) | [Enabling observability for agents hosted outside of AgentCore](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html#observability-configure-3p)

Record from the stats results: total span count, deployment type, scope names present, service names present.

---

### Phase 4: Validate Scope Support

For each span found, check if the `scope.name` is one of the supported values:

**Supported scopes:**
- [`strands.telemetry.tracer`](https://strandsagents.com/latest/documentation/docs/user-guide/observability-evaluation/observability/) — emitted by [Strands Agents](https://strandsagents.com/) when OTEL is enabled
- [`opentelemetry.instrumentation.langchain`](https://opentelemetry.io/docs/zero-code/python/instrumentations/#libraries) — provided by the [`opentelemetry-instrumentation-langchain`](https://pypi.org/project/opentelemetry-instrumentation-langchain/) package for LangChain / LangGraph agents
- [`openinference.instrumentation.langchain`](https://github.com/Arize-ai/openinference/tree/main/python/instrumentation/openinference-instrumentation-langchain) — provided by the [`openinference-instrumentation-langchain`](https://pypi.org/project/openinference-instrumentation-langchain/) package for LangChain / LangGraph agents

For the canonical list of supported scopes and span format expectations, see [AgentCore Evaluation — Understanding input spans](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/understanding-input-spans.html#agent-span-supported-scopes).

Spans with unsupported scopes will be ignored by the evaluation service. If ALL spans have unsupported scopes, evaluations will return empty results.

Report any spans with unsupported scopes and note the framework the user is using. Common causes:
- **Framework-level tracing not enabled** — for LangChain/LangGraph, the user may have OTEL auto-instrumentation set up (so they see `opentelemetry.instrumentation.starlette`, `opentelemetry.instrumentation.httpx`, etc.) but hasn't installed the framework-specific instrumentation package needed to emit evaluation-compatible spans
- **Unsupported framework** — Claude Agent SDK, custom instrumentation, or other frameworks that emit spans under different scopes are not currently supported for evaluation

**Verify framework-level tracing dependencies** based on the agent's framework:

| Framework | Required package(s) | How to verify |
|-----------|---------------------|---------------|
| Strands Agents | `strands-agents[otel]` (installs OTEL support) | Check `requirements.txt` or `pyproject.toml` for `strands-agents[otel]` or both `strands-agents` and `aws-opentelemetry-distro` |
| LangChain / LangGraph | `opentelemetry-instrumentation-langchain` OR `openinference-instrumentation-langchain` | Run `pip show opentelemetry-instrumentation-langchain` or `pip show openinference-instrumentation-langchain` in the agent's environment |

If the package is not installed, instruct the user to add it to their agent's dependencies and redeploy. New sessions after the redeploy should emit spans with a supported scope.

---

### Phase 5: Validate Service Name Consistency

The check here differs based on deployment type detected in Phase 3.

**For AgentCore Runtime deployments:** AgentCore automatically sets `service.name` to `{runtime_name}.{endpoint_name}` — users cannot easily misconfigure it. If you see a mismatch here, it likely means the eval config's `serviceNames` was set manually and doesn't match what AgentCore generated.

**For 3P-managed deployments:** The user controls `service.name` via the `OTEL_SERVICE_NAME` environment variable. Mismatches are common if the user changed the env var without updating the eval config, or vice versa.

Compare the `service.name` from the spans against:
1. The `serviceNames` in the evaluation config (if `evaluation_type = online`)
2. Consistency across all spans in the session

```python
# Run this against the aws/spans log group using the same start_query pattern from Phase 3
query = """
fields resource.attributes.service.name as serviceName
| filter attributes.session_id = '{session_id}'
| stats count(*) by serviceName
"""
```

Common issues:
- Service name in spans is `MyAgent_prod` but eval config expects `MyAgent_prod.MyEndpoint`
- Service name differs between `aws/spans` and the runtime log group
- Multiple service names across spans in a multi-agent setup

If there's a mismatch, report it — this is one of the most common root causes.
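The comparison itself is a simple set difference — a sketch, with hypothetical sample values standing in for the stats results and the Phase 1 config:

```python
# Hypothetical examples: populate these from the stats query results and the
# eval config's dataSourceConfig.cloudWatchLogs.serviceNames, respectively.
span_service_names = {"MyAgent_prod"}
config_service_names = {"MyAgent_prod.MyEndpoint"}

# Service names emitted by spans but absent from the eval config — these
# sessions will never be picked up by online evaluation.
unmatched = span_service_names - config_service_names
if unmatched:
    print(f"Spans emit service names not in the eval config: {sorted(unmatched)}")
if len(span_service_names) > 1:
    print("Multiple service names in one session — possible multi-agent setup")
```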

---

### Phase 6: Find Matching Events in Runtime Logs

For each span with a supported scope, check if a corresponding event exists in the runtime log group. Events are stored in a separate log group from spans.

First, identify the runtime log group(s). The approach depends on the deployment type detected in Phase 3:

**For AgentCore Runtime agents:** Log groups follow the pattern `/aws/bedrock-agentcore/runtimes/{agent_id}-{endpoint_name}`. If you have the eval config, use the log groups from `dataSourceConfig.cloudWatchLogs.logGroupNames`. Otherwise, discover them:

```bash
aws logs describe-log-groups \
  --log-group-name-prefix "/aws/bedrock-agentcore/runtimes/" \
  --region {region}
```

**For 3P-managed agents (ECS, EKS, Lambda, etc.):** The user controls which log group(s) their OTEL exporter writes to. These can be any custom log group name. Ask the user for their log group name(s), or check the eval config's `dataSourceConfig.cloudWatchLogs.logGroupNames` if available. If the user doesn't know, ask them to check their OTEL exporter configuration (e.g., `OTEL_EXPORTER_OTLP_ENDPOINT` or the CloudWatch Logs exporter config).

Then run a **single query across all runtime log groups**, filtered by `session_id`, and match events to spans in memory. This avoids hitting the CloudWatch concurrent-query limit when a session has many spans.

```python
# Get runtime log groups from eval config (online) or from describe-log-groups above (on-demand)
runtime_log_groups = [...]  # list of log group names

query = """
fields @timestamp, spanId, traceId, scope.name as scopeName,
       body,
       body.input.messages as inputMessages,
       body.output.messages as outputMessages,
       attributes.event.name as eventName
| filter attributes.session_id = '{session_id}'
| filter scope.name in [
    'strands.telemetry.tracer',
    'opentelemetry.instrumentation.langchain',
    'openinference.instrumentation.langchain'
  ]
| limit 200
"""

response = client.start_query(
    logGroupNames=runtime_log_groups,   # pass the full list — start_query accepts multiple log groups
    startTime=session_epoch - 7200 if session_epoch else int(time.time() - 7 * 86400),
    endTime=session_epoch + 7200 if session_epoch else int(time.time()),
    queryString=query
)
# Poll using the same pattern from Phase 3
```

Then, match the returned events to the supported-scope spans from Phase 3 by `spanId` + `traceId` in memory. Any span without a matching event indicates a potential `LogEventMissingException`.
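The in-memory match is a set lookup on `(spanId, traceId)` — sketched here with hypothetical sample rows standing in for the Phase 3 and Phase 6 query results:

```python
# Hypothetical examples: build these lists from the Phase 3 span results and
# the Phase 6 runtime-log event results.
spans = [
    {"spanId": "a1", "traceId": "t1", "name": "invoke_agent"},
    {"spanId": "b2", "traceId": "t1", "name": "execute_tool"},
]
events = [{"spanId": "a1", "traceId": "t1"}]

# Any span without a matching event is a potential LogEventMissingException.
event_keys = {(e["spanId"], e["traceId"]) for e in events}
missing = [s for s in spans if (s["spanId"], s["traceId"]) not in event_keys]
for s in missing:
    print(f"Span {s['spanId']} ({s['name']}) has no matching log event")
```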

**If events are missing for some spans, this is the `LogEventMissingException` root cause.**

Possible reasons for missing events:
- CloudWatch ingestion delay (events arrive after the evaluation runs)
- The agent runtime log group is not included in the eval config's `logGroupNames`
- The `service.name` in the event's resource attributes doesn't match the eval config
- For self-managed agents (not on AgentCore Runtime): the OTEL exporter may not be configured to emit log events

---

### Phase 7: Validate Event Body Structure

For events that DO exist, verify they have the expected structure. Phase 6 already fetched events across all runtime log groups — reuse those results. If running this phase independently, run the same single-query-across-all-runtime-log-groups pattern (pass `logGroupNames=runtime_log_groups` to `start_query`, NOT `aws/spans`):

```python
query = """
fields @timestamp, spanId, body, scope.name as scopeName,
       attributes.event.name as eventName
| filter attributes.session_id = '{session_id}'
| filter scope.name in [
    'strands.telemetry.tracer',
    'opentelemetry.instrumentation.langchain',
    'openinference.instrumentation.langchain'
  ]
| limit 200
"""
```

Check each event:
- [ ] `body` field exists and is not empty
- [ ] For agent invocation events (`invoke_agent`): `body.input.messages` and `body.output.messages` exist
- [ ] For tool execution events (`execute_tool`): `body` contains tool input/output data
- [ ] `attributes.event.name` is present
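These checks can be sketched as a small validator — the field access below is illustrative (adapt it to the row shape your query returns), and the sample event is hypothetical:

```python
def check_event_body(event):
    """Return a list of structural problems found in one parsed event dict."""
    problems = []
    body = event.get("body") or {}
    attrs = event.get("attributes", {})
    if not body:
        problems.append("body missing or empty")
    if "event.name" not in attrs:
        problems.append("attributes.event.name missing")
    if attrs.get("event.name") == "invoke_agent":
        if not body.get("input", {}).get("messages"):
            problems.append("body.input.messages missing")
        if not body.get("output", {}).get("messages"):
            problems.append("body.output.messages missing")
    return problems

# Hypothetical, trimmed example of a well-formed invoke_agent event.
event = {
    "attributes": {"event.name": "invoke_agent"},
    "body": {"input": {"messages": [{"content": "hi"}]},
             "output": {"messages": [{"content": "hello"}]}},
}
print(check_event_body(event))  # prints [] for a well-formed event
```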

If the body is malformed, empty, or uses an unexpected structure, this causes:
- `SpanEventParsingException` — event body can't be parsed
- `AgentSpanMappingException` — can't extract user query from agent span
- `ToolSpanMappingException` — can't extract tool output from tool span

**Reference:** [Understanding input spans — Event structure](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/understanding-input-spans.html)

---

### Phase 8: Check for Specific Error Patterns

If the user reported a specific error, run targeted checks:

#### 8a. LogEventMissingException
"Session span data is incomplete. Span with id: X and name: Y is missing a corresponding log event."

- Extract the `spanId` from the error message
- Search for that span in `aws/spans` — confirm it exists
- Search for a matching event (same `spanId` + `traceId`) in ALL runtime log groups
- If the event is missing: check if the agent's runtime log group is in the eval config
- If the event exists but in a different log group: the eval config is missing that log group
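As a sketch, the two cross-checks above map to these Logs Insights query strings (run each with the same start-and-poll pattern from Phase 3; `{span_id}` comes from the error message):

```python
# 1. Run against aws/spans to confirm the span exists.
span_query = """
fields @timestamp, spanId, traceId, scope.name as scopeName
| filter spanId = '{span_id}'
"""

# 2. Run against ALL runtime log groups to look for the matching event.
event_query = """
fields @timestamp, spanId, traceId, attributes.event.name as eventName
| filter spanId = '{span_id}'
"""
```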

#### 8b. AgentSpanMappingException
"Failed to parse user_query from agent-span with spanId: X and scope: Y"

- This means the evaluation service found the span and event but couldn't extract the user's input
- Check the `scope.name` — if it's not one of the three supported scopes, this is expected
- If the scope IS supported, examine the event body structure for that span
- Common cause: using a framework (Claude Agent SDK, custom instrumentation) that stores input differently than expected

#### 8c. ToolSpanMappingException
"Failed to parse tool_output from tool-span with spanId: X and scope: Y"

- Similar to AgentSpanMappingException but for tool calls
- Check the tool span's event body — the tool output may be in an unexpected format
- Common when tool output is returned in a nested JSON wrapper instead of the expected input/output message format

#### 8d. SpanEventParsingException
"Failed to parse event body for span with id: X"

- The event exists but its body can't be parsed
- Fetch the raw event and examine the body structure
- Common cause: streaming API responses where the body contains a generator object repr instead of actual content

#### 8e. Gateway Timeout (504)
"Evaluation API error: Gateway Timeout (RequestId: N/A, Code: 504)"

- Check the number of spans in the session — very large sessions can timeout
- Check if the session has complex multi-agent traces
- Try evaluating with a simpler session to confirm the service is responsive

#### 8f. Empty Results (no error)
Evaluation returns successfully but with 0 results or null values.

Baseline checks (apply to both types):
- Verify `service.name` matches between spans and eval config (Phase 5)
- Verify spans have a supported `scope.name` (Phase 4)

**For `evaluation_type = on-demand`:**
- Verify the session has `invoke_agent` spans (sessions without agent spans are skipped)
- Wait 5 minutes for CloudWatch ingestion to complete, then retry the on-demand `Evaluate` call against the same session
- If empty results persist after the 5-minute wait AND all other phases pass, this indicates an internal processing issue. Recommend the user file an AWS Support case with the RequestId from the API response.

**For `evaluation_type = online`:**
- Verify the eval config `executionStatus` is `ENABLED` (Phase 1)
- Verify `samplingPercentage` > 0 (Phase 1)
- Wait for the session idle timeout + ~5 minutes of processing time
- If the eval config is `ENABLED` and running but the **output log group has no entries** (`/aws/bedrock-agentcore/evaluations/results/{eval_config_id}`), trace the session through the pipeline via Phases 4–7
- If the **output log group has entries but they are all errors** (presence of `error.type` attribute in the result events), query the log group to classify the errors:

```python
# Run against /aws/bedrock-agentcore/evaluations/results/{eval_config_id}
query = """
fields @timestamp, attributes.gen_ai_evaluation_name as evaluator,
       attributes.session_id as sid, attributes.error.type as errorType,
       attributes.error.message as errorMessage
| filter ispresent(attributes.error.type)
| stats count(*) by evaluator, errorType
"""
```

Use the error types surfaced to jump to the corresponding sub-phase (e.g., `LogEventMissingException` → 8a, `AgentSpanMappingException` → 8b).

- On-demand results are returned inline in the API response — they do NOT appear in the CloudWatch console's evaluation dashboard
- Online results appear in the CloudWatch GenAI Observability console under the matching service name

#### 8g. ValidationException on evaluatorId
"Value at 'evaluatorId' failed to satisfy constraint"

- For built-in evaluators: use the ID format `Builtin.EvaluatorName` (e.g., `Builtin.Helpfulness`)
- For custom evaluators: use the evaluator ID (NOT the full ARN)

#### 8h. InternalServerException
"The server encountered an internal error while processing the request"

- This is a service-side error — not a user configuration issue
- Note the RequestId from the error
- Recommend the user retry, and if it persists, file an AWS Support case with the RequestId

---

### Phase 9: Multi-Agent Specific Checks

**Skip this phase if `evaluation_type = on-demand`** — on-demand evaluations do not have a persistent eval config to validate against. If an on-demand evaluation involves multiple agents, confirm that the correct `serviceName` is passed in the `Evaluate` API request parameters.

Reuse the `serviceName` breakdown from Phase 5 (do not re-run the query). If the Phase 5 results show multiple distinct `serviceName` values, the session involves multiple agents — proceed with the checks below.

Check:
- [ ] ALL agent runtime log groups are included in the eval config's `logGroupNames`
- [ ] The `serviceNames` filter in the eval config targets the correct agent
- [ ] Trace-level evaluators evaluate the LAST agent response in the trace — if you want to evaluate a specific sub-agent, you may need separate eval configs

**Reference:** [Create online evaluation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/create-online-evaluations.html)

---

## Output Format

You are reporting back to the user — so write the report as a summary they can act on, not as a raw data dump.

### What to include

- **Diagnosis summary** — plain-language explanation of what went wrong
- **Findings table** — which diagnostic checks passed/failed
- **Configuration-level details** the user controls and may need to change — e.g., eval config `executionStatus`, `serviceNames`, `logGroupNames`, expected vs. actual `service.name` values, OTEL dependency package names
- **Counts and identifiers** needed to verify the fix — number of spans found, number of sessions affected, span IDs (short-form) if the user asks for them
- **Action items** — specific changes the user should make in their own account (config update, dependency install, redeploy, etc.)

### What to omit or redact

Do not include agent-implementation details in the report. The user knows their own agent; the report should focus on the *evaluation pipeline*, not the agent's internals.

- Do not paste raw `body.input.messages` or `body.output.messages` content (user prompts, model responses, tool inputs/outputs) into the report unless the user explicitly confirms the data is not sensitive. Summarize in one sentence if needed (e.g., "the agent span's event body is present and parses correctly").
- Do not include agent source code, system prompts, tool descriptions, or other agent implementation details that are not relevant to evaluation configuration.
- Do not include full span JSON or event payloads. Reference spans by ID only.
- Account IDs, ARNs, service names, and log group names are **not sensitive** — include them in the report as-is. The user needs these to take action.

### Collecting spans for a support ticket

If the user needs to escalate to AWS Support and wants to share span data from a **test session** (not production), guide them to export the relevant spans:

```bash
aws logs start-query \
  --log-group-name "aws/spans" \
  --start-time {start_epoch} \
  --end-time {end_epoch} \
  --query-string "fields @timestamp, @message | filter attributes.session_id = '{session_id}' | sort @timestamp asc | limit 200" \
  --region {region}
```
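`start-query` only returns a `queryId`; once the query completes, fetch the actual rows to export:

```bash
aws logs get-query-results \
  --query-id {query_id} \
  --region {region}
```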

Remind the user to review the exported data before sharing and remove any sensitive content (e.g., PII in user prompts or model responses).

### Report template

```
## AgentCore Evaluation Diagnostic Report

**Session:** {session_id}
**Region:** {region}
**Evaluation Type:** {online | on-demand}
**Eval Config:** {eval_config_id or "N/A (on-demand)"}
**Timestamp:** {current_time}

### Summary
{One- or two-sentence summary of what went wrong and what the user should do}

### Findings

| # | Check | Status | Details |
|---|-------|--------|---------|
| 1 | Eval config status | ✅/❌/N/A | {details, or N/A for on-demand} |
| 2 | Transaction Search enabled | ✅/❌ | {details} |
| 3 | Traces present / deployment type | ✅/❌ | {count} spans found; deployment: {AgentCore Runtime / 3P} |
| 4 | Scope support | ✅/❌ | {details} |
| 5 | Service name match | ✅/❌ | {details} |
| 6 | Events found | ✅/❌ | {matched}/{total} spans have matching events |
| 7 | Event body structure | ✅/❌ | {details} |
| 8 | Error-specific check | ✅/❌ | {details} |
| 9 | Multi-agent config | ✅/❌/N/A | {details; N/A if single-agent or on-demand} |

### Identified Issues
1. **{Issue title}** — {Explanation of what is configured vs. expected, no agent internals}
2. **{Issue title}** — {Explanation}

### Recommended Actions
1. {Specific change the user should make, e.g., "Update the eval config's serviceNames from 'X' to 'Y' to match the actual service.name emitted by the agent."}
2. {Next action}

### Affected Spans (reference only)
| spanId | scope | has_event | issue |
|--------|-------|-----------|-------|
| ... | ... | yes/no | e.g., "Missing matching log event" |

### References
- [Understanding input spans](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/understanding-input-spans.html)
- [Create online evaluation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/create-online-evaluations.html)
- [Getting started with on-demand evaluation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/getting-started-on-demand.html)
- [Observability setup](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability-configure.html)
- [Evaluation prerequisites](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations-prerequisites.html)
- [Service quotas](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/bedrock-agentcore-limits.html)
```

## Important Notes

- All diagnostic queries run in the user's account against their log groups
- If you cannot determine the root cause, recommend the user file an AWS Support case with the diagnostic report attached
```