start_batch_evaluation
- BedrockAgentCore.Client.start_batch_evaluation(**kwargs)
Starts a batch evaluation job that evaluates agent performance across multiple sessions. Batch evaluations pull agent traces from CloudWatch Logs or an existing online evaluation configuration and run specified evaluators and insights against them.
See also: AWS API Documentation
Request Syntax
response = client.start_batch_evaluation(
    batchEvaluationName='string',
    evaluators=[
        {
            'evaluatorId': 'string'
        },
    ],
    dataSourceConfig={
        'cloudWatchLogs': {
            'serviceNames': [
                'string',
            ],
            'logGroupNames': [
                'string',
            ],
            'filterConfig': {
                'sessionIds': [
                    'string',
                ],
                'timeRange': {
                    'startTime': datetime(2015, 1, 1),
                    'endTime': datetime(2015, 1, 1)
                }
            }
        }
    },
    clientToken='string',
    evaluationMetadata={
        'sessionMetadata': [
            {
                'sessionId': 'string',
                'testScenarioId': 'string',
                'groundTruth': {
                    'inline': {
                        'assertions': [
                            {
                                'text': 'string'
                            },
                        ],
                        'expectedTrajectory': {
                            'toolNames': [
                                'string',
                            ]
                        },
                        'turns': [
                            {
                                'input': {
                                    'prompt': 'string'
                                },
                                'expectedResponse': {
                                    'text': 'string'
                                }
                            },
                        ]
                    }
                },
                'metadata': {
                    'string': 'string'
                }
            },
        ]
    },
    description='string'
)
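The request shape above can be assembled with a small helper. The following is a minimal sketch, not part of the SDK: the function name `build_batch_evaluation_request`, the sample names, and the log group are hypothetical. The size limits it checks are the ones stated in the parameter descriptions on this page, and the start-before-end check is an assumption about what a sensible time range looks like. The client name in the final comment is also an assumption.

```python
import uuid
from datetime import datetime, timedelta, timezone

MAX_EVALUATORS = 10  # limit stated in the evaluators parameter description
MAX_LOG_GROUPS = 5   # limit stated in the logGroupNames parameter description


def build_batch_evaluation_request(name, evaluator_ids, service_names,
                                   log_group_names, start_time, end_time,
                                   client_token=None):
    """Return keyword arguments for start_batch_evaluation (hypothetical helper)."""
    if len(evaluator_ids) > MAX_EVALUATORS:
        raise ValueError(f"at most {MAX_EVALUATORS} evaluators are allowed")
    if len(log_group_names) > MAX_LOG_GROUPS:
        raise ValueError(f"at most {MAX_LOG_GROUPS} log groups are allowed")
    # Assumption: an empty or inverted time range is a caller mistake.
    if start_time >= end_time:
        raise ValueError("start_time must be earlier than end_time")
    return {
        "batchEvaluationName": name,
        "evaluators": [{"evaluatorId": e} for e in evaluator_ids],
        "dataSourceConfig": {
            "cloudWatchLogs": {
                "serviceNames": list(service_names),
                "logGroupNames": list(log_group_names),
                "filterConfig": {
                    "timeRange": {"startTime": start_time, "endTime": end_time}
                },
            }
        },
        # Reuse the same token when retrying the same logical request.
        "clientToken": client_token or str(uuid.uuid4()),
    }


end = datetime(2025, 1, 2, tzinfo=timezone.utc)
request = build_batch_evaluation_request(
    name="nightly-eval",                      # hypothetical name
    evaluator_ids=["Builtin.Helpfulness"],
    service_names=["my-agent"],               # hypothetical service name
    log_group_names=["/example/agent-traces"],  # hypothetical log group
    start_time=end - timedelta(days=1),
    end_time=end,
)
# Assumed usage with a real client:
# response = boto3.client("bedrock-agentcore").start_batch_evaluation(**request)
```

Building the dictionary separately from the call keeps the limit checks testable without AWS credentials.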
- Parameters:
batchEvaluationName (string) –
[REQUIRED]
The name of the batch evaluation. Must be unique within your account.
evaluators (list) –
The list of evaluators to apply during the batch evaluation. Can include both built-in evaluators and custom evaluators. Maximum of 10 evaluators.
(dict) –
An evaluator to run against sessions.
evaluatorId (string) – [REQUIRED]
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
dataSourceConfig (dict) –
[REQUIRED]
The data source configuration that specifies where to pull agent session traces from for evaluation.
Note
This is a Tagged Union structure. Only one of the following top level keys can be set: cloudWatchLogs.
cloudWatchLogs (dict) –
Pulls session spans from CloudWatch Logs.
serviceNames (list) – [REQUIRED]
The list of agent service names to filter traces within the specified log groups.
(string) –
logGroupNames (list) – [REQUIRED]
The list of CloudWatch log group names to read agent traces from. Maximum of 5 log groups.
(string) –
filterConfig (dict) –
Optional filter configuration to narrow down which sessions to evaluate.
sessionIds (list) –
A list of specific session IDs to evaluate. If specified, only these sessions are included in the evaluation.
(string) –
timeRange (dict) –
The time range filter for selecting sessions to evaluate.
startTime (datetime) –
The start time of the time range. Only sessions with activity at or after this timestamp are included.
endTime (datetime) –
The end time of the time range. Only sessions with activity before this timestamp are included.
clientToken (string) –
A unique, case-sensitive identifier to ensure that the API request completes no more than one time. If this token matches a previous request, the service ignores the request, but does not return an error.
This field is autopopulated if not provided.
evaluationMetadata (dict) –
Optional metadata for the evaluation, including session-specific ground truth data and test scenario identifiers.
Note
This is a Tagged Union structure. Only one of the following top level keys can be set: sessionMetadata.
sessionMetadata (list) –
A list of session metadata entries containing ground truth data and test scenario identifiers for specific sessions.
(dict) –
Metadata for a specific session in a batch evaluation, including ground truth data and test scenario identifiers.
sessionId (string) – [REQUIRED]
The unique identifier of the session this metadata applies to.
testScenarioId (string) –
An optional test scenario identifier for categorizing and tracking evaluation results.
groundTruth (dict) –
The ground truth data for this session, including expected responses and assertions.
Note
This is a Tagged Union structure. Only one of the following top level keys can be set: inline.
inline (dict) –
Ground truth data provided inline in the request.
assertions (list) –
The assertions to check during evaluation. Reuses the common EvaluationContentList model.
(dict) –
A content block for ground truth data in evaluation reference inputs. Supports text content for expected responses and assertions.
Note
This is a Tagged Union structure. Only one of the following top level keys can be set: text.
text (string) –
The text content of the ground truth data. Used for expected response text and assertion statements.
expectedTrajectory (dict) –
The expected trajectory for evaluation. Reuses the common EvaluationExpectedTrajectory model.
toolNames (list) –
The list of tool names representing the expected tool call sequence.
(string) –
turns (list) –
A list of per-turn ground truth data, each containing an input prompt and expected response.
(dict) –
Ground truth data for a single conversation turn.
input (dict) –
The input for this conversation turn.
Note
This is a Tagged Union structure. Only one of the following top level keys can be set: prompt.
prompt (string) –
The text prompt for this conversation turn.
expectedResponse (dict) –
The expected response for this conversation turn.
Note
This is a Tagged Union structure. Only one of the following top level keys can be set: text.
text (string) –
The text content of the ground truth data. Used for expected response text and assertion statements.
metadata (dict) –
Additional key-value metadata associated with this session.
(string) –
(string) –
description (string) – The description of the batch evaluation.
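The nested evaluationMetadata structure is easier to get right when built programmatically. The following is a minimal sketch under stated assumptions: `build_session_metadata` is a hypothetical helper, not an SDK function, and the session ID, prompts, tool name, and scenario ID are illustrative values only. It fills in just the keys that are supplied, following the key names documented above.

```python
def build_session_metadata(session_id, turns=(), assertions=(), tool_names=(),
                           test_scenario_id=None, metadata=None):
    """Build one sessionMetadata entry (hypothetical helper).

    turns is an iterable of (prompt, expected_response_text) pairs.
    """
    inline = {}
    if assertions:
        inline["assertions"] = [{"text": a} for a in assertions]
    if tool_names:
        inline["expectedTrajectory"] = {"toolNames": list(tool_names)}
    if turns:
        inline["turns"] = [
            {"input": {"prompt": prompt}, "expectedResponse": {"text": expected}}
            for prompt, expected in turns
        ]

    entry = {"sessionId": session_id}  # the only required key
    if test_scenario_id is not None:
        entry["testScenarioId"] = test_scenario_id
    if inline:
        entry["groundTruth"] = {"inline": inline}
    if metadata:
        entry["metadata"] = dict(metadata)
    return entry


# Illustrative values; pass the result as evaluationMetadata=...
evaluation_metadata = {
    "sessionMetadata": [
        build_session_metadata(
            "session-123",
            turns=[("What is the refund policy?",
                    "Refunds are available within 30 days.")],
            assertions=["The agent cites the refund window."],
            tool_names=["lookup_policy"],
            test_scenario_id="refund-basic",
        )
    ]
}
```

Omitting empty keys rather than sending empty lists is a design choice of this sketch; whether the service accepts empty lists is not stated on this page.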
- Return type:
dict
- Returns:
Response Syntax
{
    'batchEvaluationId': 'string',
    'batchEvaluationArn': 'string',
    'batchEvaluationName': 'string',
    'evaluators': [
        {
            'evaluatorId': 'string'
        },
    ],
    'status': 'PENDING'|'IN_PROGRESS'|'COMPLETED'|'COMPLETED_WITH_ERRORS'|'FAILED'|'STOPPING'|'STOPPED'|'DELETING',
    'createdAt': datetime(2015, 1, 1),
    'outputConfig': {
        'cloudWatchConfig': {
            'logGroupName': 'string',
            'logStreamName': 'string'
        }
    },
    'description': 'string'
}
Response Structure
(dict) –
batchEvaluationId (string) –
The unique identifier of the created batch evaluation.
batchEvaluationArn (string) –
The Amazon Resource Name (ARN) of the created batch evaluation.
batchEvaluationName (string) –
The name of the batch evaluation.
evaluators (list) –
The list of evaluators applied during the batch evaluation.
(dict) –
An evaluator to run against sessions.
evaluatorId (string) –
The unique identifier of the evaluator. Can reference built-in evaluators (e.g., Builtin.Helpfulness) or custom evaluators.
status (string) –
The status of the batch evaluation.
createdAt (datetime) –
The timestamp when the batch evaluation was created.
outputConfig (dict) –
The output configuration specifying where evaluation results are written.
Note
This is a Tagged Union structure. Only one of the following top level keys will be set: cloudWatchConfig. If a client receives an unknown member it will set SDK_UNKNOWN_MEMBER as the top level key, which maps to the name or tag of the unknown member. The structure of SDK_UNKNOWN_MEMBER is as follows:
'SDK_UNKNOWN_MEMBER': {'name': 'UnknownMemberName'}
cloudWatchConfig (dict) –
The CloudWatch Logs configuration for writing evaluation results.
logGroupName (string) –
The name of the CloudWatch log group where evaluation results will be written.
logStreamName (string) –
The name of the CloudWatch log stream where evaluation results will be written.
description (string) –
The description of the batch evaluation.
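Because outputConfig is a tagged union, code that reads the response should tolerate a member it does not recognize. The following is a minimal sketch: `output_location` is a hypothetical helper, and the sample response values are invented for illustration rather than captured from the service.

```python
def output_location(response):
    """Return (log_group, log_stream) from a response, or None if unavailable.

    Hypothetical helper. Returns None when outputConfig is missing or when
    the SDK reports a union member it does not know (SDK_UNKNOWN_MEMBER).
    """
    output_config = response.get("outputConfig") or {}
    if "cloudWatchConfig" in output_config:
        config = output_config["cloudWatchConfig"]
        return config.get("logGroupName"), config.get("logStreamName")
    return None


# Invented sample shaped like the Response Syntax above.
sample_response = {
    "batchEvaluationId": "example-id",
    "batchEvaluationName": "nightly-eval",
    "status": "PENDING",
    "outputConfig": {
        "cloudWatchConfig": {
            "logGroupName": "/example/eval-results",
            "logStreamName": "stream-1",
        }
    },
}
location = output_location(sample_response)
```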
Exceptions
BedrockAgentCore.Client.exceptions.ServiceQuotaExceededException
BedrockAgentCore.Client.exceptions.UnauthorizedException
BedrockAgentCore.Client.exceptions.ValidationException
BedrockAgentCore.Client.exceptions.AccessDeniedException
BedrockAgentCore.Client.exceptions.ConflictException
BedrockAgentCore.Client.exceptions.ThrottlingException
BedrockAgentCore.Client.exceptions.InternalServerException
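The modeled exceptions listed above are reachable through `client.exceptions` on a boto3 client. The following is a minimal sketch of catching them around the call: `start_with_retry` is a hypothetical wrapper, and treating only ThrottlingException and InternalServerException as retryable is an assumption, not documented service behavior. botocore also applies its own retry policy, so a wrapper like this may be unnecessary in practice.

```python
import time


def start_with_retry(client, request, max_attempts=3, base_delay=1.0,
                     sleep=time.sleep):
    """Call start_batch_evaluation, retrying on assumed-transient errors.

    Hypothetical wrapper. Keep the same clientToken in `request` across
    attempts so a retried call is treated as the same request.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exc = client.exceptions
    for attempt in range(1, max_attempts + 1):
        try:
            return client.start_batch_evaluation(**request)
        except (exc.ThrottlingException, exc.InternalServerException):
            # Assumption: these two are transient. Everything else
            # (validation, access, conflict, quota) propagates unchanged.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

The `sleep` parameter exists so the backoff can be replaced in tests; callers would normally leave it at its default.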