Amazon SageMaker Unified Studio is in preview release and is subject to change.

Create a model evaluation job in Amazon Bedrock

When you create a model evaluation job, you specify the model, task type, and prompt dataset that you want the job to use. You also specify the metrics that you want the job to compute.

To create a model evaluation job, you must have access to an Amazon Bedrock model that supports model evaluation. For more information, see Model support by feature in the Amazon Bedrock User Guide. If you don't have access to a suitable model, contact your administrator.

Model evaluation supports task types that assess different aspects of the model's performance. For information about the available task types, see Model evaluation task types.

To perform a model evaluation for a task type, Amazon Bedrock IDE needs an input dataset that contains prompts. The job uses the dataset for inference during the evaluation. You can use a built-in dataset that Amazon Bedrock IDE supplies, or you can supply your own custom prompt dataset. To create a custom prompt dataset, see custom prompt. When you supply your own dataset, Amazon Bedrock IDE uploads it to an Amazon S3 bucket that it manages. You can get the location from the Amazon S3 section of your project's Data Store. You can also use a custom dataset that you previously uploaded to the Data Store.
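
For example, a custom prompt dataset is a JSON Lines file in which each line is one prompt record. The following minimal sketch writes such a file that you could then upload. The field names (prompt, referenceResponse, category) and the file name are assumptions for illustration, so confirm the exact schema in the custom prompt dataset documentation before you use it.

```python
import json

# Hypothetical example records. The field names are assumptions; confirm
# the exact schema in the custom prompt dataset documentation.
records = [
    {
        "prompt": "Summarize the key benefits of serverless architectures.",
        "referenceResponse": "Serverless architectures reduce operational overhead...",
        "category": "summarization",
    },
    {
        "prompt": "What is the capital of France?",
        "referenceResponse": "Paris",
        "category": "qa",
    },
]

# Write one JSON object per line (JSON Lines format).
with open("custom-prompt-dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```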

You can choose from the following metrics that you want the model evaluation job to compute.

  • Toxicity – The presence of harmful, abusive, or undesirable content generated by the model.

  • Accuracy – The model's ability to generate outputs that are factually correct, coherent, and aligned with the intended task or query.

  • Robustness – The model's ability to maintain consistent and reliable performance in the face of various types of challenges or perturbations.

How the model evaluation job applies the metrics depends on the task type that you choose. For more information, see Review a model evaluation job.

You can tag model evaluation jobs for purposes such as tracking costs. Amazon Bedrock IDE automatically prepends the tags that you add with ProjectUserTag. To view the tags that you add, use the tag editor in the AWS Resource Groups console. For more information, see What is Tag Editor? in the AWS Resource Management Documentation.
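
If you want to check the stored tags programmatically instead of in the Tag Editor console, a sketch along the following lines uses the Resource Groups Tagging API through boto3. The tag key and value are hypothetical, and the exact form that Amazon Bedrock IDE gives the prepended ProjectUserTag key is an assumption here, so adjust the filter to match what you see in Tag Editor.

```python
import boto3

# List resources that carry a given tag by using the Resource Groups Tagging API.
# The key and value below are hypothetical examples; because Amazon Bedrock IDE
# prepends the tags that you add with ProjectUserTag, the stored key can differ
# from the key that you typed, so check Tag Editor for the exact key first.
client = boto3.client("resourcegroupstaggingapi")

paginator = client.get_paginator("get_resources")
for page in paginator.paginate(
    TagFilters=[{"Key": "ProjectUserTag:cost-center", "Values": ["genai-evaluation"]}]
):
    for mapping in page["ResourceTagMappingList"]:
        print(mapping["ResourceARN"], mapping["Tags"])
```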

You can set the inference parameters for the model evaluation job. You can change the Max tokens, Temperature, and Top P inference parameters. Models might support other parameters that you can change. For more information, see Inference request parameters and response fields for foundation models in the Amazon Bedrock User Guide.
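
As a point of reference, the same three parameters appear as maxTokens, temperature, and topP in the Amazon Bedrock Runtime Converse API. The following sketch is only meant to show what each parameter controls when a model is invoked directly; the model ID is a placeholder and the values are not recommendations.

```python
import boto3

# Invoke a model directly to see where the three inference parameters fit.
# The model ID is a placeholder; use a model that you have access to.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize the benefits of model evaluation."}]}],
    inferenceConfig={
        "maxTokens": 512,    # Max tokens: upper bound on the length of the generated output
        "temperature": 0.5,  # Temperature: higher values make the output more random
        "topP": 0.9,         # Top P: restricts sampling to the most probable tokens
    },
)

print(response["output"]["message"]["content"][0]["text"])
```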

To create an automatic model evaluation job
  1. Navigate to the Amazon SageMaker Unified Studio landing page by using the URL from your administrator.

  2. Access Amazon SageMaker Unified Studio using your IAM or single sign-on (SSO) credentials. For more information, see Access Amazon SageMaker Unified Studio.

  3. If you want to use a new project, do the following:

    1. Choose the current project at the top of the page. If a project isn't already open, choose Select a project.

    2. Select Create project.

    3. Follow the instructions at Create a new project. For the Project profile in step 1, choose Generative AI application development.

  4. If the project that you want to use isn't already open, do the following:

    1. Choose the current project at the top of the page. If a project isn't already open, choose Select a project.

    2. Select Browse all projects.

    3. In Projects, select the project that you want to use.

  5. At the top of the page, select Build.

  6. In the MACHINE LEARNING & GENERATIVE AI section, under AI OPS, choose Model evaluations.

  7. Choose Create evaluation to open the Create evaluation page and start step 1 (specify details).

  8. For Evaluation job name, enter a name for the evaluation job. This name is shown in your model evaluation job list.

  9. (Optional) For Description, enter a description.

  10. (Optional) For Tags, add the tags that you want to attach to the model evaluation job.

  11. Choose Next to start step 2 (set up evaluation).

  12. In Model selector, choose the Model provider and then choose the Model.

  13. (Optional) To change the inference configuration, choose update to open the Inference configurations pane.

  14. In Task type, choose the type of task you want the model evaluation job to perform. For information about the available task types, see Model evaluation task types.

  15. For the task type, choose the metrics that you want the evaluation job to collect. For information about available metrics, see Review a model evaluation job.

  16. For each metric, in Choose an evaluation dataset, select the dataset that you want to use:

    • To use a built-in dataset, choose Built-in datasets and choose the metrics that you want to use.

    • To upload a custom dataset, choose Upload a dataset to S3 and upload the dataset file.

    • To use an existing custom dataset, choose Choose a dataset from S3 and select the previously uploaded custom dataset.

  17. Choose Next to start step 3 (review and submit).

  18. Check that the evaluation job details are correct.

  19. Choose Submit to start the model evaluation job.

  20. Wait until the model evaluation job finishes. The job is complete when its status is Success on the model evaluations page.

  21. Next step: Review the results of the model evaluation job.

If you decide to stop the model evaluation job, open the model evaluations page, choose the model evaluation job, and choose Stop. To delete the evaluation job, choose Delete.