start_batch_evaluation

BedrockAgentCore.Client.start_batch_evaluation(**kwargs)

Starts a batch evaluation job that evaluates agent performance across multiple sessions. Batch evaluations pull agent traces from CloudWatch Logs or an existing online evaluation configuration and run specified evaluators and insights against them.

Request Syntax

response = client.start_batch_evaluation(
    batchEvaluationName='string',
    evaluators=[
        {
            'evaluatorId': 'string'
        },
    ],
    insights=[
        {
            'insightId': 'string'
        },
    ],
    dataSourceConfig={
        'cloudWatchLogs': {
            'serviceNames': [
                'string',
            ],
            'logGroupNames': [
                'string',
            ],
            'filterConfig': {
                'sessionIds': [
                    'string',
                ],
                'timeRange': {
                    'startTime': datetime(2015, 1, 1),
                    'endTime': datetime(2015, 1, 1)
                }
            }
        },
        'onlineEvaluationConfigSource': {
            'onlineEvaluationConfigArn': 'string',
            'timeRange': {
                'startTime': datetime(2015, 1, 1),
                'endTime': datetime(2015, 1, 1)
            }
        }
    },
    clientToken='string',
    evaluationMetadata={
        'sessionMetadata': [
            {
                'sessionId': 'string',
                'testScenarioId': 'string',
                'groundTruth': {
                    'inline': {
                        'assertions': [
                            {
                                'text': 'string'
                            },
                        ],
                        'expectedTrajectory': {
                            'toolNames': [
                                'string',
                            ]
                        },
                        'turns': [
                            {
                                'input': {
                                    'prompt': 'string'
                                },
                                'expectedResponse': {
                                    'text': 'string'
                                }
                            },
                        ]
                    }
                },
                'metadata': {
                    'string': 'string'
                }
            },
        ]
    },
    tags={
        'string': 'string'
    },
    kmsKeyArn='string',
    description='string'
)
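
As a rough illustration of the syntax above, the following sketch builds a request with only the two required parameters (batchEvaluationName and dataSourceConfig) and submits it through a client object. The helper names, the example log group and service names, and the "bedrock-agentcore" service name in the comment are assumptions for illustration, not part of this reference.

```python
def minimal_request(name, log_group_name, service_name):
    """Build kwargs containing only the two required parameters."""
    return {
        "batchEvaluationName": name,
        "dataSourceConfig": {
            # Tagged union: only the cloudWatchLogs member is set here.
            "cloudWatchLogs": {
                "serviceNames": [service_name],
                "logGroupNames": [log_group_name],
            }
        },
    }


def start(client, request):
    """Submit the request and return the fields needed to track the job."""
    response = client.start_batch_evaluation(**request)
    return response["batchEvaluationId"], response["status"]


# Hypothetical values; replace them with names from your own account.
request = minimal_request(
    "nightly-regression", "/example/agent/traces", "example-agent"
)

# With real credentials (service name assumed to be "bedrock-agentcore"):
#   import boto3
#   client = boto3.client("bedrock-agentcore")
#   evaluation_id, status = start(client, request)
```

Keeping request construction separate from the call makes it possible to inspect or log the kwargs before anything is sent.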

Parameters:

batchEvaluationName (string) –
[REQUIRED]

The name of the batch evaluation. Must be unique within your account.
evaluators (list) –
The list of evaluators to apply during the batch evaluation. Can include both built-in evaluators and custom evaluators. Maximum of 10 evaluators.
- (dict) –
  
  An evaluator to run against sessions during batch evaluation.
  - evaluatorId (string) – [REQUIRED]
    
    The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
insights (list) –
The list of insight analyses to run against sessions during the batch evaluation. Maximum of 10 insights.
- (dict) –
  
  A reference to an insight analysis to run against sessions during batch evaluation. Insights provide deeper analysis beyond individual evaluator scores, including failure detection, user intent clustering, and execution summarization.
  - insightId (string) – [REQUIRED]
    
    The unique identifier of the insight to run.
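
A minimal sketch of building the evaluators and insights lists while enforcing the documented maximum of 10 entries on the client side. The helper name build_refs and the IDs other than Builtin.Helpfulness are hypothetical.

```python
MAX_REFS = 10  # documented maximum for both evaluators and insights


def build_refs(ids, key):
    """Turn a sequence of IDs into the [{key: id}, ...] shape shown above."""
    ids = list(ids)
    if len(ids) > MAX_REFS:
        raise ValueError(f"at most {MAX_REFS} entries allowed, got {len(ids)}")
    return [{key: item} for item in ids]


# "my-custom-evaluator" and "example-insight-id" are placeholder IDs.
evaluators = build_refs(["Builtin.Helpfulness", "my-custom-evaluator"], "evaluatorId")
insights = build_refs(["example-insight-id"], "insightId")
```

Checking the limit locally only saves a round trip; the service remains the authority on what it accepts.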
dataSourceConfig (dict) –
[REQUIRED]

The data source configuration that specifies where to pull agent session traces from for evaluation.

Note
This is a Tagged Union structure. Only one of the following top level keys can be set: cloudWatchLogs, onlineEvaluationConfigSource.
- cloudWatchLogs (dict) –
  
  Configuration for pulling agent session traces from CloudWatch Logs.
  - serviceNames (list) – [REQUIRED]
    
    The list of agent service names to filter traces within the specified log groups.
    - (string) –
  - logGroupNames (list) – [REQUIRED]
    
    The list of CloudWatch log group names to read agent traces from. Maximum of 5 log groups.
    - (string) –
  - filterConfig (dict) –
    
    Optional filter configuration to narrow down which sessions to evaluate.
    - sessionIds (list) –
      
      A list of specific session IDs to evaluate. If specified, only these sessions are included in the evaluation.
      - (string) –
    - timeRange (dict) –
      
      The time range filter for selecting sessions to evaluate.
      - startTime (datetime) –
        
        The start time of the time range. Only sessions with activity at or after this timestamp are included.
      - endTime (datetime) –
        
        The end time of the time range. Only sessions with activity before this timestamp are included.
- onlineEvaluationConfigSource (dict) –
  
  References an existing online evaluation configuration (OnlineEvaluationConfig) as the session source.
  - onlineEvaluationConfigArn (string) – [REQUIRED]
    
    The Amazon Resource Name (ARN) of the online evaluation configuration to use as the session source.
  - timeRange (dict) –
    
    Optional time range filter that narrows down which sessions from the online evaluation configuration to include.
    - startTime (datetime) –
      
      The start time of the time range. Only sessions with activity at or after this timestamp are included.
    - endTime (datetime) –
      
      The end time of the time range. Only sessions with activity before this timestamp are included.
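
Because dataSourceConfig is a tagged union, exactly one of its two members should be present. The sketch below uses one builder per member so the two cannot be mixed by accident. The function names and example values are hypothetical, and the comments on the time bounds restate the parameter descriptions above (startTime is "at or after", endTime is "before").

```python
from datetime import datetime, timezone

MAX_LOG_GROUPS = 5  # documented maximum for logGroupNames


def _time_range(start_time, end_time):
    """Build a timeRange dict, leaving out bounds that were not given."""
    time_range = {}
    if start_time is not None:
        time_range["startTime"] = start_time  # sessions at or after this
    if end_time is not None:
        time_range["endTime"] = end_time  # sessions before this
    return time_range


def cloudwatch_source(service_names, log_group_names, session_ids=None,
                      start_time=None, end_time=None):
    """Build a dataSourceConfig that sets only the cloudWatchLogs member."""
    log_group_names = list(log_group_names)
    if len(log_group_names) > MAX_LOG_GROUPS:
        raise ValueError(f"at most {MAX_LOG_GROUPS} log groups are allowed")
    source = {
        "serviceNames": list(service_names),
        "logGroupNames": log_group_names,
    }
    filter_config = {}
    if session_ids:
        filter_config["sessionIds"] = list(session_ids)
    time_range = _time_range(start_time, end_time)
    if time_range:
        filter_config["timeRange"] = time_range
    if filter_config:
        source["filterConfig"] = filter_config
    return {"cloudWatchLogs": source}


def online_config_source(config_arn, start_time=None, end_time=None):
    """Build a dataSourceConfig that sets only onlineEvaluationConfigSource."""
    source = {"onlineEvaluationConfigArn": config_arn}
    time_range = _time_range(start_time, end_time)
    if time_range:
        source["timeRange"] = time_range
    return {"onlineEvaluationConfigSource": source}


# Hypothetical example: evaluate one day of traces from a single log group.
config = cloudwatch_source(
    ["example-agent"],
    ["/example/agent/traces"],
    start_time=datetime(2025, 1, 1, tzinfo=timezone.utc),
    end_time=datetime(2025, 1, 2, tzinfo=timezone.utc),
)
```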
clientToken (string) –
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, the service ignores the request, but does not return an error.

This field is autopopulated if not provided.
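
To benefit from the idempotency described above when retrying at the application level, the same token has to be sent on every attempt. A minimal sketch, assuming a UUID string is an acceptable token format (this page does not state the format constraints); the helper name is hypothetical.

```python
import uuid


def with_client_token(request, token=None):
    """Return a copy of the request kwargs that carries a clientToken.

    An explicit token wins, then a token already in the request,
    then a freshly generated UUID.
    """
    out = dict(request)
    out["clientToken"] = token or out.get("clientToken") or str(uuid.uuid4())
    return out


# Generate the token once, then reuse the same kwargs for every attempt.
request = with_client_token({"batchEvaluationName": "nightly-regression"})
retry_request = with_client_token(request)  # keeps the existing token
```

If the token is left out and autopopulated instead, each separate call from your own code gets a new token, so a manual retry would be treated as a new request.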
evaluationMetadata (dict) –
Optional metadata for the evaluation, including session-specific ground truth data and test scenario identifiers.

Note
This is a Tagged Union structure. Only one of the following top level keys can be set: sessionMetadata.
- sessionMetadata (list) –
  
  A list of session metadata entries containing ground truth data and test scenario identifiers for specific sessions.
  - (dict) –
    
    Metadata for a specific session in a batch evaluation, including ground truth data and test scenario identifiers.
    - sessionId (string) – [REQUIRED]
      
      The unique identifier of the session this metadata applies to.
    - testScenarioId (string) –
      
      An optional test scenario identifier for categorizing and tracking evaluation results.
    - groundTruth (dict) –
      
      The ground truth data for this session, including expected responses and assertions.
      
      Note
      This is a Tagged Union structure. Only one of the following top level keys can be set: inline.
      - inline (dict) –
        
        Inline ground truth data provided directly in the request.
        
        - assertions (list) –
          
          The list of assertions to use for evaluation. Each entry is an evaluation content block (EvaluationContentList).
          - (dict) –
            
            A content block for ground truth data in evaluation reference inputs. Supports text content for expected responses and assertions.
            
            Note
            This is a Tagged Union structure. Only one of the following top level keys can be set: text.
            - text (string) –
              
              The text content of the ground truth data. Used for expected response text and assertion statements.
        - expectedTrajectory (dict) –
          
          The expected tool call sequence for trajectory evaluation.
          - toolNames (list) –
            
            The list of tool names representing the expected tool call sequence.
            - (string) –
        - turns (list) –
          
          A list of per-turn ground truth data, each containing an input prompt and expected response.
          - (dict) –
            
            Ground truth data for a single conversation turn.
            - input (dict) –
              
              The input for this conversation turn.
              
              Note
              This is a Tagged Union structure. Only one of the following top level keys can be set: prompt.
              - prompt (string) –
                
                The text prompt for this conversation turn.
            - expectedResponse (dict) –
              
              The expected response for this conversation turn.
              
              Note
              This is a Tagged Union structure. Only one of the following top level keys can be set: text.
              - text (string) –
                
                The text content of the ground truth data. Used for expected response text and assertion statements.
    - metadata (dict) –
      
      Additional key-value metadata associated with this session.
      - (string) –
        - (string) –
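
The nesting of evaluationMetadata is easy to get wrong by hand, so a small builder can help. The sketch below assembles one sessionMetadata entry with inline ground truth; the function name, session ID, tool names, and prompts are all hypothetical.

```python
def session_entry(session_id, assertions=(), tool_names=(), turns=(),
                  test_scenario_id=None, metadata=None):
    """Build one sessionMetadata entry with optional inline ground truth.

    `turns` is a sequence of (prompt, expected_response_text) pairs.
    """
    inline = {}
    if assertions:
        inline["assertions"] = [{"text": text} for text in assertions]
    if tool_names:
        inline["expectedTrajectory"] = {"toolNames": list(tool_names)}
    if turns:
        inline["turns"] = [
            {"input": {"prompt": prompt}, "expectedResponse": {"text": expected}}
            for prompt, expected in turns
        ]
    entry = {"sessionId": session_id}
    if test_scenario_id is not None:
        entry["testScenarioId"] = test_scenario_id
    if inline:
        entry["groundTruth"] = {"inline": inline}
    if metadata:
        entry["metadata"] = dict(metadata)
    return entry


# Hypothetical session with one assertion, an expected tool order, and one turn.
evaluation_metadata = {
    "sessionMetadata": [
        session_entry(
            "session-001",
            assertions=["The agent confirms the booking date."],
            tool_names=["search_flights", "book_flight"],
            turns=[("Book a flight to Oslo", "Your flight to Oslo is booked.")],
            test_scenario_id="booking-happy-path",
        )
    ]
}
```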
tags (dict) –
A map of tag keys and values to associate with the batch evaluation.
- (string) –
  - (string) –
kmsKeyArn (string) – The ARN of the KMS key used to encrypt evaluation data. If provided, customer data is encrypted at rest with the specified key.
description (string) – The description of the batch evaluation.

Return type:

dict

Returns:

Response Syntax

{
    'batchEvaluationId': 'string',
    'batchEvaluationArn': 'string',
    'batchEvaluationName': 'string',
    'evaluators': [
        {
            'evaluatorId': 'string'
        },
    ],
    'insights': [
        {
            'insightId': 'string'
        },
    ],
    'status': 'PENDING'|'IN_PROGRESS'|'COMPLETED'|'COMPLETED_WITH_ERRORS'|'FAILED'|'STOPPING'|'STOPPED'|'DELETING',
    'createdAt': datetime(2015, 1, 1),
    'outputConfig': {
        'cloudWatchConfig': {
            'logGroupName': 'string',
            'logStreamName': 'string'
        }
    },
    'tags': {
        'string': 'string'
    },
    'kmsKeyArn': 'string',
    'description': 'string'
}

Response Structure

(dict) –
- batchEvaluationId (string) –
  
  The unique identifier of the created batch evaluation.
- batchEvaluationArn (string) –
  
  The Amazon Resource Name (ARN) of the created batch evaluation.
- batchEvaluationName (string) –
  
  The name of the batch evaluation.
- evaluators (list) –
  
  The list of evaluators applied during the batch evaluation.
  - (dict) –
    
    An evaluator to run against sessions during batch evaluation.
    - evaluatorId (string) –
      
      The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
- insights (list) –
  
  The list of insight analyses applied during the batch evaluation.
  - (dict) –
    
    A reference to an insight analysis to run against sessions during batch evaluation. Insights provide deeper analysis beyond individual evaluator scores, including failure detection, user intent clustering, and execution summarization.
    - insightId (string) –
      
      The unique identifier of the insight to run.
- status (string) –
  
  The status of the batch evaluation.
- createdAt (datetime) –
  
  The timestamp when the batch evaluation was created.
- outputConfig (dict) –
  
  The output configuration specifying where evaluation results are written.
  Note
  This is a Tagged Union structure. Only one of the following top level keys will be set: cloudWatchConfig. If a client receives an unknown member it will set SDK_UNKNOWN_MEMBER as the top level key, which maps to the name or tag of the unknown member. The structure of SDK_UNKNOWN_MEMBER is as follows:
  'SDK_UNKNOWN_MEMBER': {'name': 'UnknownMemberName'}
  - cloudWatchConfig (dict) –
    
    The CloudWatch Logs configuration for writing evaluation results.
    - logGroupName (string) –
      
      The name of the CloudWatch log group where evaluation results will be written.
    - logStreamName (string) –
      
      The name of the CloudWatch log stream where evaluation results will be written.
- tags (dict) –
  
  The tags associated with the batch evaluation.
  - (string) –
    - (string) –
- kmsKeyArn (string) –
  
  The ARN of the KMS key used to encrypt evaluation data.
- description (string) –
  
  The description of the batch evaluation.
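
A sketch of reading the response: it extracts the job status and the output location, and handles the SDK_UNKNOWN_MEMBER case described for outputConfig. The set of terminal statuses is an assumption; this page lists the status values but does not say which ones are final. The helper name and sample values are hypothetical.

```python
# Assumption: these statuses are final. The reference lists the status
# values but does not state which of them are terminal.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "STOPPED"}


def summarize(response):
    """Extract the job state and output location from a response dict."""
    output = response.get("outputConfig", {})
    destination = None
    unknown_member = None
    if "cloudWatchConfig" in output:
        cw = output["cloudWatchConfig"]
        destination = (cw["logGroupName"], cw["logStreamName"])
    elif "SDK_UNKNOWN_MEMBER" in output:
        # The SDK saw a union member it does not know about.
        unknown_member = output["SDK_UNKNOWN_MEMBER"]["name"]
    return {
        "id": response["batchEvaluationId"],
        "status": response["status"],
        "finished": response["status"] in TERMINAL_STATUSES,
        "destination": destination,
        "unknown_member": unknown_member,
    }


# Hypothetical response, shaped like the Response Syntax above.
sample = {
    "batchEvaluationId": "example-id",
    "status": "PENDING",
    "outputConfig": {
        "cloudWatchConfig": {
            "logGroupName": "/example/evaluations",
            "logStreamName": "example-stream",
        }
    },
}
summary = summarize(sample)
```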

Exceptions

BedrockAgentCore.Client.exceptions.ServiceQuotaExceededException
BedrockAgentCore.Client.exceptions.UnauthorizedException
BedrockAgentCore.Client.exceptions.ValidationException
BedrockAgentCore.Client.exceptions.AccessDeniedException
BedrockAgentCore.Client.exceptions.ConflictException
BedrockAgentCore.Client.exceptions.ThrottlingException
BedrockAgentCore.Client.exceptions.InternalServerException
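
One possible way to handle the exceptions above is to retry only the errors that are likely transient. The sketch below reads the error code from the exception's response attribute, which is how botocore's ClientError exposes it. Treating ThrottlingException and InternalServerException as retryable is an assumption, as is the expectation that the error code matches the exception name. botocore also applies its own retries, so a wrapper like this is only for additional application-level handling.

```python
import time

# Assumption: only these two errors are worth retrying.
RETRYABLE_CODES = {"ThrottlingException", "InternalServerException"}


def error_code(exc):
    """Return the service error code carried by a ClientError-like exception."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_retries(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying retryable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if error_code(exc) not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a real client (request kwargs built elsewhere):
#   response = call_with_retries(
#       lambda: client.start_batch_evaluation(**request)
#   )
```

When retrying this way, send the same clientToken on every attempt so that a request which succeeded on the server but failed in transit is not started twice.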