get_batch_evaluation

BedrockAgentCore.Client.get_batch_evaluation(**kwargs)

Retrieves detailed information about a batch evaluation, including its status, configuration, results, and any error details.

Request Syntax

response = client.get_batch_evaluation(
    batchEvaluationId='string'
)

Parameters:

batchEvaluationId (string) –

[REQUIRED]

The unique identifier of the batch evaluation to retrieve.
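As a minimal sketch of the call described above, the helper below fetches one batch evaluation and returns a few of its top-level fields. The helper names `make_client` and `describe_batch_evaluation` are hypothetical, and the service name `"bedrock-agentcore"` passed to `boto3.client` is an assumption about how this client is created; configured AWS credentials are also assumed.

```python
def make_client(region_name=None):
    """Create the client (assumes boto3 is installed and credentials are configured)."""
    # Imported lazily so the helper below can be exercised without AWS access.
    import boto3

    # Assumption: this client class is exposed under the "bedrock-agentcore" service name.
    return boto3.client("bedrock-agentcore", region_name=region_name)


def describe_batch_evaluation(client, batch_evaluation_id):
    """Fetch one batch evaluation and return a few top-level fields."""
    response = client.get_batch_evaluation(batchEvaluationId=batch_evaluation_id)
    # .get() is used defensively: the reference does not state which
    # response fields are always present.
    return {
        "id": response.get("batchEvaluationId"),
        "name": response.get("batchEvaluationName"),
        "status": response.get("status"),
        "createdAt": response.get("createdAt"),
    }
```

A typical use would be `describe_batch_evaluation(make_client(), "<your-batch-evaluation-id>")`, where the identifier comes from an earlier request that created or listed the batch evaluation.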

Return type:

dict

Returns:

Response Syntax

{
    'batchEvaluationId': 'string',
    'batchEvaluationArn': 'string',
    'batchEvaluationName': 'string',
    'status': 'PENDING'|'IN_PROGRESS'|'COMPLETED'|'COMPLETED_WITH_ERRORS'|'FAILED'|'STOPPING'|'STOPPED'|'DELETING',
    'createdAt': datetime(2015, 1, 1),
    'evaluators': [
        {
            'evaluatorId': 'string'
        },
    ],
    'insights': [
        {
            'insightId': 'string'
        },
    ],
    'dataSourceConfig': {
        'cloudWatchLogs': {
            'serviceNames': [
                'string',
            ],
            'logGroupNames': [
                'string',
            ],
            'filterConfig': {
                'sessionIds': [
                    'string',
                ],
                'timeRange': {
                    'startTime': datetime(2015, 1, 1),
                    'endTime': datetime(2015, 1, 1)
                }
            }
        },
        'onlineEvaluationConfigSource': {
            'onlineEvaluationConfigArn': 'string',
            'timeRange': {
                'startTime': datetime(2015, 1, 1),
                'endTime': datetime(2015, 1, 1)
            }
        }
    },
    'outputConfig': {
        'cloudWatchConfig': {
            'logGroupName': 'string',
            'logStreamName': 'string'
        }
    },
    'evaluationResults': {
        'numberOfSessionsCompleted': 123,
        'numberOfSessionsInProgress': 123,
        'numberOfSessionsFailed': 123,
        'totalNumberOfSessions': 123,
        'numberOfSessionsIgnored': 123,
        'evaluatorSummaries': [
            {
                'evaluatorId': 'string',
                'statistics': {
                    'averageScore': 123.0
                },
                'totalEvaluated': 123,
                'totalFailed': 123
            },
        ]
    },
    'failureAnalysisResult': {
        'failures': [
            {
                'clusterId': 123,
                'name': 'string',
                'description': 'string',
                'affectedSessionCount': 123,
                'subCategories': [
                    {
                        'clusterId': 123,
                        'name': 'string',
                        'description': 'string',
                        'affectedSessionCount': 123,
                        'rootCauses': [
                            {
                                'clusterId': 123,
                                'name': 'string',
                                'rootCause': 'string',
                                'recommendation': 'string',
                                'affectedSessionCount': 123,
                                'affectedSessions': [
                                    {
                                        'sessionId': 'string',
                                        'explanation': 'string',
                                        'fixType': 'string',
                                        'recommendation': 'string',
                                        'failureSpans': [
                                            {
                                                'spanId': 'string',
                                                'traceId': 'string',
                                                'signals': [
                                                    {
                                                        'category': 'execution-error-category-authentication'|'execution-error-category-resource-not-found'|'execution-error-category-service-errors'|'execution-error-category-rate-limiting'|'execution-error-category-formatting'|'execution-error-category-timeout'|'execution-error-category-resource-exhaustion'|'execution-error-category-environment'|'execution-error-category-tool-schema'|'task-instruction-category-non-compliance'|'task-instruction-category-problem-id'|'incorrect-actions-category-tool-selection'|'incorrect-actions-category-poor-information-retrieval'|'incorrect-actions-category-clarification'|'incorrect-actions-category-inappropriate-info-request'|'context-handling-error-category-context-handling-failures'|'hallucination-category-hall-capabilities'|'hallucination-category-hall-misunderstand'|'hallucination-category-hall-usage'|'hallucination-category-hall-history'|'hallucination-category-hall-params'|'hallucination-category-fabricate-tool-outputs'|'repetitive-behavior-category-repetition-tool'|'repetitive-behavior-category-repetition-info'|'repetitive-behavior-category-step-repetition'|'orchestration-related-errors-category-reasoning-mismatch'|'orchestration-related-errors-category-goal-deviation'|'orchestration-related-errors-category-premature-termination'|'orchestration-related-errors-category-unaware-termination'|'llm-output-category-nonsensical'|'configuration-mismatch-category-tool-definition'|'coding-use-case-specific-failure-types-category-edge-case-oversights'|'coding-use-case-specific-failure-types-category-dependency-issues'|'other',
                                                        'evidence': 'string',
                                                        'confidence': 123.0
                                                    },
                                                ]
                                            },
                                        ]
                                    },
                                ]
                            },
                        ]
                    },
                ]
            },
        ]
    },
    'userIntentResult': {
        'userIntents': [
            {
                'clusterId': 123,
                'name': 'string',
                'description': 'string',
                'affectedSessionCount': 123,
                'affectedSessions': [
                    {
                        'sessionId': 'string',
                        'userMessages': [
                            'string',
                        ]
                    },
                ]
            },
        ]
    },
    'executionSummaryResult': {
        'executionSummaries': [
            {
                'clusterId': 123,
                'name': 'string',
                'description': 'string',
                'affectedSessionCount': 123,
                'affectedSessions': [
                    {
                        'sessionId': 'string',
                        'approachTaken': 'string',
                        'finalOutcome': 'string'
                    },
                ]
            },
        ]
    },
    'errorDetails': [
        'string',
    ],
    'description': 'string',
    'updatedAt': datetime(2015, 1, 1),
    'kmsKeyArn': 'string'
}
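Because `status` moves through several values, a caller will often poll this operation until the batch evaluation stops changing. The sketch below shows one way to do that. The function name `wait_for_batch_evaluation`, the polling interval, and the set of statuses treated as terminal are all assumptions; the reference lists the possible status values but does not say which ones are final.

```python
import time

# Assumption: these values are treated as terminal for polling purposes.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "STOPPED"}


def wait_for_batch_evaluation(client, batch_evaluation_id,
                              poll_seconds=30.0, max_attempts=60,
                              sleep=time.sleep):
    """Poll get_batch_evaluation until the status is terminal, then return the response."""
    for attempt in range(max_attempts):
        response = client.get_batch_evaluation(batchEvaluationId=batch_evaluation_id)
        if response.get("status") in TERMINAL_STATUSES:
            return response
        # Do not sleep after the final attempt; fall through to the timeout.
        if attempt < max_attempts - 1:
            sleep(poll_seconds)
    raise TimeoutError(
        f"batch evaluation {batch_evaluation_id} did not reach a terminal "
        f"status after {max_attempts} attempts"
    )
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default `time.sleep` applies.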

Response Structure

(dict) –
- batchEvaluationId (string) –
  
  The unique identifier of the batch evaluation.
- batchEvaluationArn (string) –
  
  The Amazon Resource Name (ARN) of the batch evaluation.
- batchEvaluationName (string) –
  
  The name of the batch evaluation.
- status (string) –
  
  The current status of the batch evaluation.
- createdAt (datetime) –
  
  The timestamp when the batch evaluation was created.
- evaluators (list) –
  
  The list of evaluators applied during the batch evaluation.
  - (dict) –
    
    An evaluator to run against sessions during batch evaluation.
    - evaluatorId (string) –
      
      The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
- insights (list) –
  
  The list of insight analyses applied during the batch evaluation.
  - (dict) –
    
    A reference to an insight analysis to run against sessions during batch evaluation. Insights provide deeper analysis beyond individual evaluator scores, including failure detection, user intent clustering, and execution summarization.
    - insightId (string) –
      
      The unique identifier of the insight to run.
- dataSourceConfig (dict) –
  
  The data source configuration specifying where agent traces are pulled from.
  Note
  This is a Tagged Union structure. Only one of the following top level keys will be set: cloudWatchLogs, onlineEvaluationConfigSource. If a client receives an unknown member it will set SDK_UNKNOWN_MEMBER as the top level key, which maps to the name or tag of the unknown member. The structure of SDK_UNKNOWN_MEMBER is as follows:
  'SDK_UNKNOWN_MEMBER': {'name': 'UnknownMemberName'}
  - cloudWatchLogs (dict) –
    
    Configuration for pulling agent session traces from CloudWatch Logs.
    - serviceNames (list) –
      
      The list of agent service names to filter traces within the specified log groups.
      - (string) –
    - logGroupNames (list) –
      
      The list of CloudWatch log group names to read agent traces from. Maximum of 5 log groups.
      - (string) –
    - filterConfig (dict) –
      
      Optional filter configuration to narrow down which sessions to evaluate.
      - sessionIds (list) –
        
        A list of specific session IDs to evaluate. If specified, only these sessions are included in the evaluation.
        - (string) –
      - timeRange (dict) –
        
        The time range filter for selecting sessions to evaluate.
        - startTime (datetime) –
          
          The start time of the time range. Only sessions with activity at or after this timestamp are included.
        - endTime (datetime) –
          
          The end time of the time range. Only sessions with activity before this timestamp are included.
  - onlineEvaluationConfigSource (dict) –
    
    References an existing OnlineEvaluationConfig as the session source.
    - onlineEvaluationConfigArn (string) –
      
      The Amazon Resource Name (ARN) of the online evaluation configuration to use as the session source.
    - timeRange (dict) –
      
      Optional time range filter that narrows down which sessions from the online evaluation configuration are included.
      - startTime (datetime) –
        
        The start time of the time range. Only sessions with activity at or after this timestamp are included.
      - endTime (datetime) –
        
        The end time of the time range. Only sessions with activity before this timestamp are included.
- outputConfig (dict) –
  
  The output configuration specifying where evaluation results are written.
  Note
  This is a Tagged Union structure. Only one of the following top level keys will be set: cloudWatchConfig. If a client receives an unknown member it will set SDK_UNKNOWN_MEMBER as the top level key, which maps to the name or tag of the unknown member. The structure of SDK_UNKNOWN_MEMBER is as follows:
  'SDK_UNKNOWN_MEMBER': {'name': 'UnknownMemberName'}
  - cloudWatchConfig (dict) –
    
    The CloudWatch Logs configuration for writing evaluation results.
    - logGroupName (string) –
      
      The name of the CloudWatch log group where evaluation results will be written.
    - logStreamName (string) –
      
      The name of the CloudWatch log stream where evaluation results will be written.
- evaluationResults (dict) –
  
  The aggregated evaluation results, including session completion counts and evaluator score summaries.
  - numberOfSessionsCompleted (integer) –
    
    The number of sessions that have been successfully evaluated.
  - numberOfSessionsInProgress (integer) –
    
    The number of sessions currently being evaluated.
  - numberOfSessionsFailed (integer) –
    
    The number of sessions that failed evaluation.
  - totalNumberOfSessions (integer) –
    
    The total number of sessions included in the batch evaluation.
  - numberOfSessionsIgnored (integer) –
    
    The number of sessions that were ignored during evaluation.
  - evaluatorSummaries (list) –
    
    A list of per-evaluator summary statistics.
    - (dict) –
      
      Summary statistics for a single evaluator within a batch evaluation.
      - evaluatorId (string) –
        
        The unique identifier of the evaluator.
      - statistics (dict) –
        
        The aggregated statistics for this evaluator.
        - averageScore (float) –
          
          The average score across all evaluated sessions for this evaluator.
      - totalEvaluated (integer) –
        
        The total number of sessions evaluated by this evaluator.
      - totalFailed (integer) –
        
        The total number of sessions that failed evaluation by this evaluator.
- failureAnalysisResult (dict) –
  
  The failure analysis results from insights, containing categorized failure clusters with root causes and recommendations.
  - failures (list) –
    
    The list of failure category clusters identified across analyzed sessions.
    - (dict) –
      
      A top-level failure category identified by clustering similar failure patterns across sessions.
      - clusterId (integer) –
        
        The unique identifier of the failure category cluster.
      - name (string) –
        
        The name of the failure category.
      - description (string) –
        
        A description of the failure category pattern.
      - affectedSessionCount (integer) –
        
        The number of sessions affected by this failure category.
      - subCategories (list) –
        
        The list of failure subcategories within this category.
        - (dict) –
          
          A subcategory of failures within a top-level failure category.
          - clusterId (integer) –
            
            The unique identifier of the failure subcategory cluster.
          - name (string) –
            
            The name of the failure subcategory.
          - description (string) –
            
            A description of the failure subcategory pattern.
          - affectedSessionCount (integer) –
            
            The number of sessions affected by this failure subcategory.
          - rootCauses (list) –
            
            The list of root cause clusters identified within this subcategory.
            - (dict) –
              
              A cluster of similar root causes identified within a failure subcategory.
              - clusterId (integer) –
                
                The unique identifier of the root cause cluster.
              - name (string) –
                
                The name of the root cause cluster.
              - rootCause (string) –
                
                The root cause explanation for this cluster of failures.
              - recommendation (string) –
                
                The recommended fix for this root cause.
              - affectedSessionCount (integer) –
                
                The number of sessions affected by this root cause.
              - affectedSessions (list) –
                
                The list of sessions affected by this root cause.
                - (dict) –
                  
                  A session affected by a detected failure pattern, including root cause details.
                  - sessionId (string) –
                    
                    The unique identifier of the affected session.
                  - explanation (string) –
                    
                    An explanation of how the failure manifested in this session.
                  - fixType (string) –
                    
                    The type of fix recommended for this failure.
                  - recommendation (string) –
                    
                    The specific fix recommendation for this session.
                  - failureSpans (list) –
                    
                    The list of spans where failures were detected in this session.
                    - (dict) –
                      
                      Details about a specific span where a failure was detected.
                      - spanId (string) –
                        
                        The unique identifier of the span where the failure occurred.
                      - traceId (string) –
                        
                        The trace identifier associated with the failure span.
                      - signals (list) –
                        
                        The failure signals detected in this span.
                        - (dict) –
                          
                          A signal indicating a detected failure within a span.
                          - category (string) –
                            
                            The failure category classification for this signal.
                          - evidence (string) –
                            
                            The evidence supporting the failure detection.
                          - confidence (float) –
                            
                            The confidence score of the failure detection.
- userIntentResult (dict) –
  
  The user intent clustering results from insights, containing grouped user intents across evaluated sessions.
  - userIntents (list) –
    
    The list of user intent clusters identified across analyzed sessions.
    - (dict) –
      
      A cluster of similar user intents identified across sessions.
      - clusterId (integer) –
        
        The unique identifier of the user intent cluster.
      - name (string) –
        
        The name of the user intent cluster.
      - description (string) –
        
        A description of the user intent pattern.
      - affectedSessionCount (integer) –
        
        The number of sessions with this user intent.
      - affectedSessions (list) –
        
        The list of sessions with this user intent.
        - (dict) –
          
          A session associated with a user intent cluster.
          - sessionId (string) –
            
            The unique identifier of the session.
          - userMessages (list) –
            
            The user messages from this session that contributed to the intent cluster.
            - (string) –
- executionSummaryResult (dict) –
  
  The execution summary clustering results from insights, containing grouped execution patterns across evaluated sessions.
  - executionSummaries (list) –
    
    The list of execution summary clusters identified across analyzed sessions.
    - (dict) –
      
      A cluster of similar execution patterns identified across sessions.
      - clusterId (integer) –
        
        The unique identifier of the execution summary cluster.
      - name (string) –
        
        The name of the execution pattern cluster.
      - description (string) –
        
        A description of the execution pattern.
      - affectedSessionCount (integer) –
        
        The number of sessions with this execution pattern.
      - affectedSessions (list) –
        
        The list of sessions with this execution pattern.
        - (dict) –
          
          A session associated with an execution summary cluster.
          - sessionId (string) –
            
            The unique identifier of the session.
          - approachTaken (string) –
            
            The approach taken by the agent during this session.
          - finalOutcome (string) –
            
            The final outcome of the session.
- errorDetails (list) –
  
  The error details if the batch evaluation encountered failures.
  - (string) –
- description (string) –
  
  The description of the batch evaluation.
- updatedAt (datetime) –
  
  The timestamp when the batch evaluation was last updated.
- kmsKeyArn (string) –
  
  The ARN of the KMS key used to encrypt evaluation data.
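
The response structure above is deeply nested, so a few small helpers can make it easier to consume. The sketch below shows three: reading the single populated member of a tagged union such as `dataSourceConfig` or `outputConfig`, collecting `averageScore` per evaluator, and flattening `failureAnalysisResult` into one record per signal. The function names are hypothetical, and the code assumes any field may be absent, since the reference does not state which fields are always returned.

```python
def union_member(union):
    """Return (member_name, value) for a tagged-union dict, or (None, None) if empty."""
    if not union:
        return None, None
    # Only one top-level key is set. An unrecognized member arrives under
    # the 'SDK_UNKNOWN_MEMBER' key, which this helper returns like any other.
    name, value = next(iter(union.items()))
    return name, value


def average_scores(response):
    """Map evaluatorId -> averageScore from evaluationResults.evaluatorSummaries."""
    results = response.get("evaluationResults") or {}
    return {
        summary["evaluatorId"]: (summary.get("statistics") or {}).get("averageScore")
        for summary in results.get("evaluatorSummaries", [])
    }


def iter_failure_signals(response):
    """Yield one flat dict per signal found in failureAnalysisResult."""
    analysis = response.get("failureAnalysisResult") or {}
    for failure in analysis.get("failures", []):
        for sub in failure.get("subCategories", []):
            for cause in sub.get("rootCauses", []):
                for session in cause.get("affectedSessions", []):
                    for span in session.get("failureSpans", []):
                        for signal in span.get("signals", []):
                            yield {
                                "failure": failure.get("name"),
                                "subCategory": sub.get("name"),
                                "rootCause": cause.get("name"),
                                "sessionId": session.get("sessionId"),
                                "spanId": span.get("spanId"),
                                "traceId": span.get("traceId"),
                                "category": signal.get("category"),
                                "confidence": signal.get("confidence"),
                            }
```

Flattening to per-signal records is one design choice among several; it suits filtering by `category` or `confidence`, while the cluster-level `affectedSessionCount` values remain available on the original response.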

Exceptions

BedrockAgentCore.Client.exceptions.UnauthorizedException
BedrockAgentCore.Client.exceptions.ValidationException
BedrockAgentCore.Client.exceptions.AccessDeniedException
BedrockAgentCore.Client.exceptions.ThrottlingException
BedrockAgentCore.Client.exceptions.ResourceNotFoundException
BedrockAgentCore.Client.exceptions.InternalServerException
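
The exceptions listed above are reachable through `client.exceptions`, so they can be caught by name. The following sketch returns `None` when the batch evaluation does not exist and retries a throttled call with exponential backoff. The function name `get_batch_evaluation_or_none` and the retry settings are hypothetical choices for illustration, not service recommendations, and botocore's own built-in retry behavior may already cover throttling in many configurations.

```python
import time


def get_batch_evaluation_or_none(client, batch_evaluation_id,
                                 max_attempts=3, base_delay=1.0,
                                 sleep=time.sleep):
    """Return the response, or None if the batch evaluation does not exist."""
    for attempt in range(max_attempts):
        try:
            return client.get_batch_evaluation(batchEvaluationId=batch_evaluation_id)
        except client.exceptions.ResourceNotFoundException:
            # Treat a missing resource as an ordinary "not found" result.
            return None
        except client.exceptions.ThrottlingException:
            # Re-raise once the retry budget is spent.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Other exceptions, such as `ValidationException` or `AccessDeniedException`, are deliberately left to propagate, since retrying them without changing the request or permissions is unlikely to help.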