

# Evaluating your trained model
<a name="nova-hp-evaluate"></a>

An evaluation recipe is a YAML configuration file that defines how your Amazon Nova model evaluation job is executed. With this recipe, you can assess the performance of a base or trained model against common benchmarks or your own custom datasets. Metrics can be stored in Amazon S3 or TensorBoard. The evaluation provides quantitative metrics that help you assess model performance across various tasks to determine if further customization is needed.

Model evaluation is an offline process: models are tested against fixed benchmarks with predefined answers rather than against live user interactions. To evaluate a model in real time, deploy it to Amazon Bedrock and call the Amazon Bedrock runtime APIs.

**Important**  
The evaluation container only supports checkpoints produced by the same training platform. Checkpoints created with SageMaker HyperPod can only be evaluated using the SageMaker HyperPod evaluation workflow, and checkpoints created with SageMaker training jobs can only be evaluated using the SageMaker training jobs evaluation workflow. Attempting to evaluate a checkpoint from a different platform will result in failure.

**Topics**
+ [Available benchmark tasks](customize-fine-tune-evaluate-available-tasks.md)
+ [Understanding the recipe parameters](customize-fine-tune-evaluate-understand-modify.md)
+ [Evaluation recipe examples](customize-fine-tune-evaluate-recipe-examples.md)
+ [Starting an evaluation job](customize-fine-tune-evaluate-start-job.md)
+ [Accessing and analyzing evaluation results](customize-fine-tune-evaluate-access-results.md)
+ [RFT evaluation](nova-hp-evaluate-rft.md)

# Available benchmark tasks
<a name="customize-fine-tune-evaluate-available-tasks"></a>

A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code packages, see [sample-Nova-lighteval-custom-task](https://github.com/aws-samples/sample-Nova-lighteval-custom-task/).

The following industry-standard benchmarks are supported. You can specify them in the `eval_task` parameter:


| Benchmark | Modality | Description | Metrics | Strategy | Subtask Available | 
| --- |--- |--- |--- |--- |--- |
| mmlu | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs\_cot | Yes | 
| mmlu\_pro | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs\_cot | No | 
| bbh | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | zs\_cot | Yes | 
| gpqa | Text | General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities. | accuracy | zs\_cot | No | 
| math | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact\_match | zs\_cot | Yes | 
| strong\_reject | Text | Quality-Control Task – Tests the model’s ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| IFEval | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 
| gen\_qa | Text | Custom Dataset Evaluation – Lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. | all | gen\_qa | No | 
| llm\_judge | Text | LLM-as-a-Judge Preference Comparison – Uses an Amazon Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A. | all | judge | No | 
| humaneval | Text | HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models. | pass@1 | zs | No | 
| mm\_llm\_judge | Multi-modal (image) | Behaves the same as the text-based `llm_judge` above; the only difference is that it supports image inference. | all | judge | No | 
| rubric\_llm\_judge | Text | Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Amazon Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions. | all | judge | No | 
| aime\_2024 | Text | AIME 2024 – American Invitational Mathematics Examination problems testing advanced mathematical reasoning and problem-solving. | exact\_match | zs\_cot | No | 
| calendar\_scheduling | Text | Natural Plan – Calendar Scheduling task testing planning abilities for scheduling meetings across multiple days and people. | exact\_match | fs | No | 

The following `mmlu` subtasks are available:

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

The following `bbh` subtasks are available:

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

The following `math` subtasks are available:

```
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]
```

# Understanding the recipe parameters
<a name="customize-fine-tune-evaluate-understand-modify"></a>

**Run configuration**  
The following is a general run configuration and an explanation of the parameters involved.

```
run:
  name: eval_job_name
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: nova-micro/prod
  replicas: 1
  data_s3_path: ""
  output_s3_path: s3://output_path
  mlflow_tracking_uri: ""
  mlflow_experiment_name: ""
  mlflow_run_name: ""
```
+ `name`: (Required) A descriptive name for your evaluation job. This helps identify your job in the AWS console.
+ `model_type`: (Required) Specifies the Amazon Nova model variant to use. Do not manually modify this field. Options include:
  + `amazon.nova-micro-v1:0:128k`
  + `amazon.nova-lite-v1:0:300k`
  + `amazon.nova-pro-v1:0:300k`
  + `amazon.nova-2-lite-v1:0:256k`
+ `model_name_or_path`: (Required) The path to the base model or S3 path for the post-trained checkpoint. Options include:
  + `nova-micro/prod`
  + `nova-lite/prod`
  + `nova-pro/prod`
  + `nova-lite-2/prod`
  + (S3 path for the post-trained checkpoint) `s3://<escrow bucket>/<job id>/outputs/checkpoints`
+ `replicas`: (Required) The number of compute instances to use. You must set this value to 1 because multi-node evaluation is not supported.
+ `data_s3_path`: (Required) The S3 path to the input dataset. Leave this parameter empty unless you are using the *bring your own dataset* or *LLM as a judge* recipe.
+ `output_s3_path`: (Required) The S3 path to store output evaluation artifacts. Note that the output S3 bucket must be created by the same account that is creating the job.
+ `mlflow_tracking_uri`: (Optional) MLflow tracking server ARN used to track MLflow runs and experiments. Ensure that the SageMaker AI execution role has permission to access the tracking server.
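As a sanity check before submitting a job, the run-configuration rules above (required fields, `replicas: 1`, S3 output path) can be expressed as a small validator. This is an illustrative, hypothetical helper, not part of any AWS SDK:

```python
# Hypothetical helper: check the "run" section of an evaluation recipe.
# Field names come from the documentation above; the function itself
# is for illustration only.
REQUIRED_RUN_FIELDS = ["name", "model_type", "model_name_or_path",
                       "replicas", "data_s3_path", "output_s3_path"]

def validate_run_config(run: dict) -> list:
    """Return a list of problems found in a run configuration."""
    problems = [f"missing required field: {f}"
                for f in REQUIRED_RUN_FIELDS if f not in run]
    if run.get("replicas") != 1:
        problems.append("replicas must be 1 (multi-node is not supported)")
    out = run.get("output_s3_path", "")
    if out and not out.startswith("s3://"):
        problems.append("output_s3_path must be an s3:// URI")
    return problems

run_cfg = {
    "name": "eval_job_name",
    "model_type": "amazon.nova-micro-v1:0:128k",
    "model_name_or_path": "nova-micro/prod",
    "replicas": 1,
    "data_s3_path": "",
    "output_s3_path": "s3://output_path",
}
print(validate_run_config(run_cfg))  # an empty list means the config passes
```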

**Evaluation configuration**  
The following is a model evaluation configuration and an explanation of the parameters involved.

```
evaluation:
  task: mmlu
  strategy: zs_cot
  subtask: high_school_mathematics
  metric: accuracy
```
+ `task`: (Required) Specifies the evaluation benchmark or task to use.

  Supported task list:
  + mmlu
  + mmlu\_pro
  + bbh
  + gpqa
  + math
  + strong\_reject
  + gen\_qa
  + ifeval
  + llm\_judge
  + humaneval
  + mm\_llm\_judge
  + rubric\_llm\_judge
  + aime\_2024
  + calendar\_scheduling
+ `strategy`: (Required) Defines the evaluation approach:
  + zs\_cot: Zero-shot Chain-of-Thought – An approach to prompting large language models that encourages step-by-step reasoning without requiring explicit examples.
  + zs: Zero-shot – An approach to solving a problem without any prior training examples.
  + gen\_qa: A strategy specific to bring your own dataset recipes.
  + judge: A strategy specific to Amazon Nova LLM as a Judge and mm\_llm\_judge.
+ `subtask`: (Optional and Removable) Specifies a specific subtask for certain evaluation tasks. Remove this from your recipe if your task does not have any subtasks.
+ `metric`: (Required) The evaluation metric to use.
  + accuracy: Percentage of correct answers.
  + exact\_match: (For the `math` benchmark) Returns the rate at which the predicted strings exactly match their references.
  + deflection: (For the `strong_reject` benchmark) Returns the relative deflection to the base model and the difference in significance metrics.
  + pass@1: (For the `humaneval` benchmark) Measures the percentage of problems for which the model's first generated code sample passes the unit tests.
  + `all`: Returns the following metrics:
    + For the `gen_qa` (bring your own dataset) benchmark, returns the following metrics:
      + `rouge1`: Measures the overlap of unigrams (single words) between generated and reference text.
      + `rouge2`: Measures the overlap of bigrams (two consecutive words) between generated and reference text.
      + `rougeL`: Measures the longest common subsequence between texts, allowing for gaps in the matching.
      + `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
      + `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
      + `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
      + `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
      + `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.
    + For the `llm_judge` and `mm_llm_judge` (bring your own dataset) benchmarks, returns the following metrics:
      + `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
      + `a_scores_stderr`: Standard error of `response_A scores` across pairwise judgements.
      + `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
      + `b_scores_stderr`: Standard error of `response_B scores` across pairwise judgements.
      + `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
      + `ties_stderr`: Standard error of ties across pairwise judgements.
      + `inference_error`: Count of judgements that could not be properly evaluated.
      + `inference_error_stderr`: Standard error of inference errors across judgements.
      + `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
      + `score_stderr`: Standard error of the aggregate score across pairwise judgements.
      + `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using the Bradley-Terry probability.
      + `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
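To make the `winrate` metric concrete, here is a minimal sketch of a pairwise win-rate estimate, under the simplifying assumption that ties split evenly between the two responses. The evaluation container's exact Bradley-Terry estimator may differ:

```python
def bradley_terry_winrate(b_wins: int, a_wins: int, ties: int = 0) -> float:
    """Estimate P(response_B preferred over response_A) from pairwise counts.
    Ties are split evenly between A and B -- a simplifying assumption;
    the actual estimator used by the evaluation container may differ."""
    total = a_wins + b_wins + ties
    if total == 0:
        raise ValueError("no judgements")
    return (b_wins + ties / 2) / total

print(bradley_terry_winrate(b_wins=60, a_wins=30, ties=10))  # 0.65
```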

**Inference configuration**  
The following is an inference configuration and an explanation of the parameters involved. All parameters are optional.

```
inference:
  max_new_tokens: 200
  top_k: -1
  top_p: 1.0
  temperature: 0
  top_logprobs: 10
  reasoning_effort: null  # options: low/high to enable reasoning or null to disable reasoning
```
+ `max_new_tokens`: The maximum number of tokens to generate. This must be an integer.
+ `top_k`: The number of highest probability tokens to consider. This must be an integer.
+ `top_p`: The cumulative probability threshold for token sampling. This must be a float between 0.0 and 1.0, inclusive.
+ `temperature`: Randomness in token selection. Larger values introduce more randomness. Use 0 to make the results deterministic. This value must be a float with a minimum value of 0.
+ `top_logprobs`: The number of top logprobs to be returned in the inference response. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and log probabilities of each output token returned in the message content.
+ `reasoning_effort`: Controls the reasoning behavior for reasoning-capable models. Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`). Available options are `null` (the default if not set; disables reasoning), `low`, or `high`.

Note that for `humaneval`, we recommend the following inference configuration:

```
inference:
  top_k: 1
  max_new_tokens: 1600
  temperature: 0.0
```

**MLflow configuration**  
The following is an MLflow configuration and an explanation of the parameters involved. All parameters are optional.

```
run:
  mlflow_tracking_uri: ""
  mlflow_experiment_name: ""
  mlflow_run_name: ""
```
+ `mlflow_tracking_uri`: (Optional) The location of the MLflow tracking server (only needed on SageMaker HyperPod).
+ `mlflow_experiment_name`: (Optional) Name of the experiment used to group related ML runs together.
+ `mlflow_run_name`: (Optional) Custom name for a specific training run within an experiment.

# Evaluation recipe examples
<a name="customize-fine-tune-evaluate-recipe-examples"></a>

Amazon Nova provides several types of evaluation recipes, which are available in the SageMaker HyperPod recipes GitHub repository.

## General text benchmark recipes
<a name="nova-model-hp-evaluation-config-example-text"></a>

These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format `xxx_general_text_benchmark_eval.yaml`.

## Bring your own dataset benchmark recipes
<a name="nova-model-hp-evaluation-config-byo"></a>

These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. They are provided in the format `xxx_bring_your_own_dataset_eval.yaml`. 

The following are the bring your own dataset requirements:
+ File format requirements
  + You must include a single `gen_qa.jsonl` file containing evaluation examples.
  + Your dataset must be uploaded to an S3 location where SageMaker training job can access it.
  + The file must follow the required schema format for a general Q&A dataset.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `query`: (Required) String containing the question or instruction that needs an answer
  + `response`: (Required) String containing the expected model output
  + `system`: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query
  + `metadata`: (Optional) String containing metadata associated with the entry for tagging purposes.

Here is a bring your own dataset example entry:

```
{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
```

To use your custom dataset, set the following required fields in your evaluation recipe. Do not change these values:

```
evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Context length limit: For each sample in the dataset, the context length (including system and query prompts) should be less than 3.5k.
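Before uploading, the schema and limitations above can be checked locally. The following hypothetical validator treats the 3.5k context limit as a character count over the system and query fields, which is an assumption made for illustration:

```python
import json

def validate_gen_qa_line(line: str, max_context_chars: int = 3500) -> dict:
    """Check one gen_qa.jsonl record against the schema above.
    Interpreting the 3.5k context limit as a character count over
    system + query is an assumption for illustration."""
    record = json.loads(line)
    if "query" not in record or "response" not in record:
        raise ValueError("each record needs 'query' and 'response'")
    unknown = set(record) - {"query", "response", "system", "metadata"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    context = record.get("system", "") + record["query"]
    if len(context) >= max_context_chars:
        raise ValueError("system + query exceeds the context limit")
    return record

rec = validate_gen_qa_line(
    '{"system": "Answer briefly: ", "query": "2 + 2?", "response": "4"}')
print(rec["response"])  # 4
```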

## Nova LLM as a Judge benchmark recipes
<a name="nova-model-evaluation-config-llm-judge"></a>

Amazon Nova LLM as a Judge is a model evaluation feature that enables customers to compare the quality of responses from one model to a baseline model response on a custom dataset. It takes in a dataset with prompts, baseline responses, and challenger responses, and uses an Amazon Nova Judge model to provide a winrate metric based on [Bradley-Terry probability](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) with pairwise comparisons.

The recipes are provided in the format `xxx_llm_judge_eval.yaml`. 

The following are the LLM as a Judge requirements:
+ File format requirements
  + Include a single `llm_judge.jsonl` file containing evaluation examples. The file name must be `llm_judge.jsonl`.
  + Your dataset must be uploaded to an S3 location that the [SageMaker HyperPod RIG](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html) can access.
  + The file must follow the required schema format for the `llm_judge.jsonl` dataset.
  + The input dataset should ensure all records are under 12k context length.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `prompt`: (Required) A string containing the prompt for the generated response.
  + `response_A`: A string containing the baseline response.
  + `response_B`: A string containing the alternative response to be compared with the baseline response.

Here is an LLM as a Judge example entry:

```
{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
```

To use your custom dataset, set the following required fields in your evaluation recipe. Don't change these values:

```
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Amazon Nova Judge models are the same across all model family specifications (that is, Lite, Micro, and Pro).
+ Custom judge models are not supported at this time.
+ Context length limit: For each sample in the dataset, the context length (including system and query prompts) should be less than 7k.

## Nova LLM as a Judge for multi-modal (image) benchmark recipes
<a name="nova-model-hp-evaluation-mm-llm-judge"></a>

Nova LLM Judge for multi-modal (image), short for Amazon Nova MM-LLM Judge, is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, then uses an Amazon Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_mm_llm_judge_eval.yaml`.

**Nova LLM dataset requirements**

File format: 
+ Include a single `mm_llm_judge.jsonl` file containing evaluation examples. The file name must be exactly `mm_llm_judge.jsonl`.
+ You must upload your dataset to an S3 location where SageMaker training jobs can access it.
+ The file must follow the required schema format for the `mm_llm_judge` dataset.
+ The input dataset should ensure all records are under 12k context length, excluding the `images` attribute.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `images`: Array containing a list of objects with data attributes (values are Base64-encoded image strings).

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.

```
{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat."
}
{
  "prompt": "how many animals are in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one animal and the second has one animal"
}
```
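The `data` attributes in the example follow the `data:<mime>;Base64,<payload>` pattern. Here is a minimal sketch of producing such a string from raw image bytes; the helper name is ours, not an AWS API:

```python
import base64

def to_image_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as the data-URI string expected in the
    'data' attribute of an mm_llm_judge.jsonl record."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};Base64,{encoded}"

# In practice you would read the bytes from a file:
#   with open("photo.jpg", "rb") as f:
#       uri = to_image_data_uri(f.read())
# JPEG files begin with bytes FF D8 FF, which Base64-encode to "/9j/":
print(to_image_data_uri(b"\xff\xd8\xff"))  # data:image/jpeg;Base64,/9j/
```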

To use your custom dataset, modify your evaluation recipe with the following required fields, don't change any of the content:

```
evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova MM Judge models only support image references.
+ Nova MM Judge models are the same across Amazon Nova Lite specifications.
+ Custom judge models are not currently supported.
+ Amazon S3 image URI is not supported.
+ The input dataset should ensure all records are under 12k context length, excluding the `images` attribute.

## Rubric Based Judge
<a name="nova-hp-evaluate-rubric-judge"></a>

Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Amazon Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:
+ **Dynamic criteria generation**: Automatically creates relevant evaluation dimensions based on the input prompt
+ **Weighted scoring**: Assigns importance weights to each criterion to reflect their relative significance
+ **Granular assessment**: Provides detailed scores on a binary (true/false) or scale (1-5) basis for each criterion
+ **Quality metrics**: Calculates continuous quality scores (0-1 scale) that quantify the magnitude of differences between responses

Example criterion generated by the model:

```
price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3
```

The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.
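As an illustration of the weighted scoring described above, the following sketch combines per-criterion scores (each already normalized to a 0-1 scale) into a single quality score using the criterion weights. Rubric Judge's exact aggregation is not published, so treat this only as a conceptual sketch:

```python
def weighted_quality_score(criteria: dict) -> float:
    """Combine per-criterion scores into one 0-1 quality score.
    Each criterion has a 'weight' and a 'score' already normalized to
    0-1 (e.g. a 1-5 scale score s mapped via (s - 1) / 4, or a binary
    0/1). This mirrors the weighted-scoring idea above; the actual
    aggregation used by Rubric Judge may differ."""
    total_weight = sum(c["weight"] for c in criteria.values())
    return sum(c["weight"] * c["score"] for c in criteria.values()) / total_weight

# Hypothetical criteria, echoing the price_validation example above:
criteria = {
    "price_validation": {"weight": 0.3, "score": 1.0},   # binary: passed
    "error_handling":   {"weight": 0.7, "score": 0.75},  # scale 4/5 -> 0.75
}
print(weighted_quality_score(criteria))
```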

**Topics**
+ [Recipe configuration](#nova-hp-evaluate-rubric-judge-recipe)
+ [Input dataset format](#nova-hp-evaluate-rubric-judge-input)
+ [Evaluation output](#nova-hp-evaluate-rubric-judge-output)
+ [Reasoning model support](#nova-hp-evaluate-rubric-judge-reasoning)

### Recipe configuration
<a name="nova-hp-evaluate-rubric-judge-recipe"></a>

**Rubric Judge recipe**  
Enable Rubric Judge by setting `task: rubric_llm_judge` in your recipe:

```
run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Original LLM as a Judge recipe (for comparison)**  
The original judge model uses `task: llm_judge`:

```
run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

### Input dataset format
<a name="nova-hp-evaluate-rubric-judge-input"></a>

The input dataset format is identical to the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/):

**Required fields:**
+ `prompt`: String containing the input prompt and instructions
+ `response_A`: String containing the baseline model output
+ `response_B`: String containing the customized model output

**Example dataset (JSONL format):**

```
{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}
```

**Format requirements:**
+ Each entry must be a single-line JSON object
+ Separate entries with newlines
+ Follow the exact field naming as shown in examples
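
Before launching a job, a quick local check can catch malformed lines or missing fields. This is an illustrative sketch (the `validate_judge_dataset` helper is not part of any AWS tooling):

```python
import json

REQUIRED_FIELDS = {"prompt", "response_A", "response_B"}

def validate_judge_dataset(path):
    """Raise ValueError if any non-empty line is missing a required field."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)  # fails loudly on malformed JSON
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
    return True
```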

### Evaluation output
<a name="nova-hp-evaluate-rubric-judge-output"></a>

**Output structure**  
Rubric Judge produces enhanced evaluation metrics compared to the original judge model:

```
{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}
```

**New metrics in Rubric Judge**  
The following six metrics are unique to Rubric Judge and provide granular quality assessment:


| Metric | Description | 
| --- |--- |
| `weighted_score_A` | Average normalized quality score for `response_A` across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| `weighted_score_A_stderr` | Standard error of the mean for `weighted_score_A`, indicating statistical uncertainty | 
| `weighted_score_B` | Average normalized quality score for `response_B` across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| `weighted_score_B_stderr` | Standard error of the mean for `weighted_score_B`, indicating statistical uncertainty | 
| `score_margin` | Difference between weighted scores (calculated as `weighted_score_A - weighted_score_B`). Range: -1.0 to 1.0. Positive = `response_A` is better; negative = `response_B` is better; near zero = similar quality | 
| `score_margin_stderr` | Standard error of the mean for `score_margin`, indicating uncertainty in the quality difference measurement | 

**Understanding weighted score metrics**  
**Purpose**: Weighted scores provide continuous quality measurements that complement binary preference verdicts, enabling deeper insights into model performance.

**Key differences from original judge**:
+ **Original judge**: Only outputs discrete preferences (A>B, B>A, A=B)
+ **Rubric Judge**: Outputs both preferences AND continuous quality scores (0-1 scale) based on custom criteria

**Interpreting `score_margin`**:
+ `score_margin = -0.128`: `response_B` scored 12.8 percentage points higher than `response_A`
+ `|score_margin| < 0.1`: Narrow quality difference (close decision)
+ `|score_margin| > 0.2`: Clear quality difference (confident decision)

**Use cases**:
+ **Model improvement**: Identify specific areas where your model underperforms
+ **Quality quantification**: Measure the magnitude of performance gaps, not just win/loss ratios
+ **Confidence assessment**: Distinguish between close decisions and clear quality differences

**Important**  
Final verdicts are still based on the judge model's explicit preference labels to preserve holistic reasoning and ensure proper position bias mitigation through forward/backward evaluation. Weighted scores serve as observability tools, not as replacements for the primary verdict.

**Calculation methodology**  
Weighted scores are computed through the following process:
+ **Extract criterion data**: Parse the judge's YAML output to extract criterion scores and weights
+ **Normalize scores**:
  + Scale-type criteria (1-5): Normalize to 0-1 by calculating `(score - 1) / 4`
  + Binary criteria (true/false): Convert to 1.0/0.0
+ **Apply weights**: Multiply each normalized score by its criterion weight
+ **Aggregate**: Sum all weighted scores for each response
+ **Calculate margin**: Compute `score_margin = weighted_score_A - weighted_score_B`

**Example**: If `response_A` has a weighted sum of 0.65 and `response_B` has 0.78, the `score_margin` would be -0.13, indicating `response_B` is 13 percentage points higher in quality across all weighted criteria.
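
The calculation above can be expressed as a short Python sketch. This is illustrative only, not the evaluation container's implementation, and the `criteria` field names (`score`, `weight`, `kind`) are assumed for the example:

```python
def normalize(score, kind):
    """Normalize one criterion score to the 0-1 range."""
    if kind == "scale":                  # 1-5 scale criteria: (score - 1) / 4
        return (score - 1) / 4
    return 1.0 if score else 0.0         # binary criteria: true/false -> 1.0/0.0

def weighted_score(criteria):
    """Sum the weight-adjusted normalized scores for one response."""
    return sum(c["weight"] * normalize(c["score"], c["kind"]) for c in criteria)
```

The margin then follows directly, e.g. `score_margin = weighted_score(criteria_a) - weighted_score(criteria_b)`.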

### Reasoning model support
<a name="nova-hp-evaluate-rubric-judge-reasoning"></a>

Reasoning model support enables evaluation with reasoning-capable Amazon Nova models that perform explicit internal reasoning before generating final responses. This feature uses API-level control via the `reasoning_effort` parameter to dynamically enable or disable reasoning functionality, potentially improving response quality for complex analytical tasks.

**Supported models**:
+ amazon.nova-2-lite-v1:0:256k

**Recipe configuration**  
Enable reasoning by adding the `reasoning_effort` parameter to the `inference` section of your recipe:

```
run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training jobs; optional for SageMaker HyperPod jobs
  output_s3_path: ""                                     # [MODIFIABLE] Output path for SageMaker HyperPod jobs (not compatible with SageMaker Training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: generate                                     # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/medium/high or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Using the `reasoning_effort` parameter**  
The `reasoning_effort` parameter controls the reasoning behavior for reasoning-capable models.

**Prerequisites**:
+ **Model compatibility**: Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`)
+ **Error handling**: Using `reasoning_effort` with unsupported models will fail with `ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."`

**Available options**:


| Option | Behavior | Token Limit | Use Case | 
| --- |--- |--- |--- |
| null (default) | Disables reasoning mode | N/A | Standard evaluation without reasoning overhead | 
| low | Enables reasoning with constraints | 4,000 tokens for internal reasoning | Scenarios requiring concise reasoning; optimizes for speed and cost | 
| high | Enables reasoning without constraints | No token limit on internal reasoning | Complex problems requiring extensive analysis and step-by-step reasoning | 

**When to enable reasoning**  
**Use reasoning mode (`low`, `medium`, or `high`) for**:
+ Complex problem-solving tasks (mathematics, logic puzzles, coding)
+ Multi-step analytical questions requiring intermediate reasoning
+ Tasks where detailed explanations or step-by-step thinking improve accuracy
+ Scenarios where response quality is prioritized over speed

**Use non-reasoning mode (omit parameter) for**:
+ Simple Q&A or factual queries
+ Creative writing tasks
+ When faster response times are critical
+ Performance benchmarking where reasoning overhead should be excluded
+ Cost optimization when reasoning doesn't improve task performance

**Troubleshooting**  
**Error: "Reasoning mode is enabled but model does not support reasoning"**

**Cause**: The `reasoning_effort` parameter is set to a non-null value, but the specified `model_type` doesn't support reasoning.

**Resolution**:
+ Verify your model type is `amazon.nova-2-lite-v1:0:256k`
+ If using a different model, either switch to a reasoning-capable model or remove the `reasoning_effort` parameter from your recipe

# Starting an evaluation job
<a name="customize-fine-tune-evaluate-start-job"></a>

The following shows how to install the prerequisites, connect to your SageMaker HyperPod cluster, and submit and manage an evaluation job:

```
# Install Dependencies (Helm - https://helm.sh/docs/intro/install/)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

# Install the SageMaker HyperPod CLI
git clone --recurse-submodules https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
git checkout release_v2
pip install .

# Verify the installation
hyperpod --help

# Connect to a SageMaker HyperPod Cluster
hyperpod connect-cluster --cluster-name cluster-name


# Submit the Job using the recipe for eval
# Namespace by default should be kubeflow
hyperpod start-job [--namespace namespace] --recipe evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval --override-parameters \
'{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest",
    "recipes.run.name": "custom-run-name",
    "recipes.run.model_type": "model-type",
    "recipes.run.model_name_or_path": "model name or fine-tuned checkpoint S3 URI",
    "recipes.run.data_s3_path": "input data S3 path (gen_qa and llm_judge only; must be the full path, including the filename)"
}'

# List jobs
hyperpod list-jobs [--namespace namespace] [--all-namespaces]

# Getting Job details
hyperpod get-job --job-name job-name [--namespace namespace] [--verbose]

# Listing Pods
hyperpod list-pods --job-name job-name --namespace namespace

# Cancel Job
hyperpod cancel-job --job-name job-name [--namespace namespace]
```

You can also view the job status through the Amazon EKS console.

# Accessing and analyzing evaluation results
<a name="customize-fine-tune-evaluate-access-results"></a>

After your evaluation job completes successfully, you can access and analyze the results using the information in this section. Based on the `output_s3_path` (such as `s3://output_path/`) defined in the recipe, the output structure is the following:

```
job_name/
├── eval-result/
│    └── results_[timestamp].json
│    └── inference_output.jsonl (only present for gen_qa)
│    └── details/
│        └── model/
│            └── execution-date-time/
│                └──details_task_name_#_datetime.parquet
└── tensorboard-results/
    └── eval/
        └── events.out.tfevents.[timestamp]
```

Metrics results are stored in the specified S3 output location: `s3://output_path/job_name/eval-result/results_[timestamp].json`.

TensorBoard results are stored in the S3 path `s3://output_path/job_name/tensorboard-results/eval/events.out.tfevents.[timestamp]`.

All inference outputs, except for `llm_judge` and `strong_reject`, are stored in the S3 path `s3://output_path/job_name/eval-result/details/model/execution-date-time/details_task_name_#_datetime.parquet`.

For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:
+ `prompt` - The final prompt submitted to the model
+ `inference` - The raw inference output from the model
+ `gold` - The target response from the input dataset
+ `metadata` - The metadata string from the input dataset, if provided
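
A minimal sketch for loading these records and computing a naive exact-match rate against the gold answers (the helper names are illustrative, not part of the evaluation tooling):

```python
import json

def load_gen_qa_outputs(path):
    """Load inference_output.jsonl records as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def exact_match_rate(records):
    """Fraction of records whose raw inference exactly equals the gold answer."""
    if not records:
        return 0.0
    return sum(r["inference"].strip() == r["gold"].strip() for r in records) / len(records)
```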

To visualize your evaluation metrics in TensorBoard, complete the following steps:

1. Navigate to SageMaker AI TensorBoard.

1. Select **S3 folders**.

1. Add your S3 folder path, for example `s3://output_path/job_name/tensorboard-results/eval`.

1. Wait for synchronization to complete.

Time series, scalar, and text visualizations are then available.

We recommend the following best practices:
+ Keep your output paths organized by model and benchmark type.
+ Maintain consistent naming conventions for easy tracking.
+ Save extracted results in a secure location.
+ Monitor TensorBoard sync status for successful data loading.

You can find SageMaker HyperPod job error logs in the CloudWatch log group `/aws/sagemaker/Clusters/cluster-id`.

## Log Probability Output Format
<a name="nova-hp-access-results-logprobs"></a>

When `top_logprobs` is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

```
{
  "Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
  "Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}
```

Each token entry contains:
+ `logprob_value`: The log probability value for the token
+ `decoded_value`: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while `decoded_value` provides a readable interpretation.
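
Reading the parquet files themselves typically requires pandas or pyarrow; once a position's dictionary is in hand, it can be inspected with plain Python. A minimal sketch using the example structure above (the `top_token` helper is illustrative):

```python
def top_token(position_logprobs):
    """Return (decoded_value, logprob_value) of the most likely candidate
    token at one position, given the structure shown above."""
    _, entry = max(position_logprobs.items(),
                   key=lambda kv: kv[1]["logprob_value"])
    return entry["decoded_value"], entry["logprob_value"]
```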

# RFT evaluation
<a name="nova-hp-evaluate-rft"></a>

**Note**  
Evaluation via remote reward functions in your own AWS environment is only available if you are an Amazon Nova Forge customer.

**Important**  
The `rl_env` configuration field is used exclusively for evaluation, not for training. During training, you configure reward functions using `reward_lambda_arn` (single-turn) or BYOO infrastructure with `rollout.delegate: true` (multi-turn).

**What is RFT Evaluation?**  
RFT Evaluation allows you to assess your model's performance using custom reward functions before, during, or after reinforcement learning training. Unlike standard evaluations that use pre-defined metrics, RFT Evaluation lets you define your own success criteria through a Lambda function that scores model outputs based on your specific requirements.

**Why Evaluate with RFT?**  
Evaluation is crucial to determine whether the RL fine-tuning process has:
+ Improved model alignment with your specific use case and human values
+ Maintained or improved model capabilities on key tasks
+ Avoided unintended side effects such as reduced factuality, increased verbosity, or degraded performance on other tasks
+ Met your custom success criteria as defined by your reward function

**When to Use RFT Evaluation**  
Use RFT Evaluation in these scenarios:
+ Before RFT Training: Establish baseline metrics on your evaluation dataset
+ During RFT Training: Monitor training progress with intermediate checkpoints
+ After RFT Training: Validate that the final model meets your requirements
+ Comparing Models: Evaluate multiple model versions using consistent reward criteria

**Note**  
Use RFT Evaluation when you need custom, domain-specific metrics. For general-purpose evaluation (accuracy, perplexity, BLEU), use standard evaluation methods.

**Topics**
+ [Data format requirements](#nova-hp-evaluate-rft-data-format)
+ [Preparing your evaluation recipe](#nova-hp-evaluate-rft-recipe)
+ [Preset reward functions](#nova-hp-evaluate-rft-preset)
+ [Creating your reward function](#nova-hp-evaluate-rft-create-function)
+ [IAM permissions](#nova-hp-evaluate-rft-iam)
+ [Executing the evaluation job](#nova-hp-evaluate-rft-execution)
+ [Understanding evaluation results](#nova-hp-evaluate-rft-results)

## Data format requirements
<a name="nova-hp-evaluate-rft-data-format"></a>

**Input data structure**  
RFT evaluation input data must follow the OpenAI Reinforcement Fine-Tuning format. Each example is a JSON object containing:
+ `messages`: Array of conversational turns with `system` and `user` roles
+ Optional additional metadata fields, such as `reference_answer`

**Data format example**  
The following example shows the required format:

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Solve for x. Return only JSON like {\"x\": <number>}. Equation: 2x + 5 = 13"
        }
      ]
    }
  ],
  "reference_answer": {
    "x": 4
  }
}
```

**Current limitations**  
The following limitations apply to RFT evaluation:
+ Text only: No multimodal inputs (images, audio, video) are supported
+ Single-turn conversations: Only supports single user message (no multi-turn dialogues)
+ JSON format: Input data must be in JSONL format (one JSON object per line)
+ Model outputs: Evaluation is performed on generated completions from the specified model
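
Given these limitations, a small pre-flight validator can reject unsupported examples before you upload the dataset. A sketch assuming the format above (`validate_rft_example` is illustrative, not AWS tooling):

```python
def validate_rft_example(record):
    """Check one RFT example against the format and limitations above."""
    messages = record.get("messages")
    assert isinstance(messages, list) and messages, "messages array is required"
    user_turns = [m for m in messages if m.get("role") == "user"]
    assert len(user_turns) == 1, "single-turn only: exactly one user message"
    for message in messages:
        for part in message.get("content", []):
            assert part.get("type") == "text", "text only: no multimodal parts"
    return True
```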

## Preparing your evaluation recipe
<a name="nova-hp-evaluate-rft-recipe"></a>

**Sample recipe configuration**  
The following example shows a complete RFT evaluation recipe:

```
run:
  name: nova-lite-rft-eval-job
  model_type: amazon.nova-lite-v1:0:300k
  model_name_or_path: s3://escrow_bucket/model_location    # [MODIFIABLE] S3 path to your model or model identifier
  replicas: 1                                             # [MODIFIABLE] For SageMaker Training jobs only; fixed for SageMaker HyperPod jobs
  data_s3_path: ""                                        # [REQUIRED FOR HYPERPOD] Leave empty for SageMaker Training jobs
  output_s3_path: ""                                      # [REQUIRED] Output artifact S3 path for evaluation results

evaluation:
  task: rft_eval                                          # [FIXED] Do not modify
  strategy: rft_eval                                      # [FIXED] Do not modify
  metric: all                                             # [FIXED] Do not modify

# Inference Configuration
inference:
  max_new_tokens: 8196                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                               # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                              # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                          # [MODIFIABLE] Sampling temperature (0 = deterministic)
  top_logprobs: 0

# Evaluation Environment Configuration (NOT used in training)
rl_env:
  reward_lambda_arn: arn:aws:lambda:<region>:<account_id>:function:<reward-function-name>
```

## Preset reward functions
<a name="nova-hp-evaluate-rft-preset"></a>

Two preset reward functions (`prime_code` and `prime_math`) are available as a Lambda layer for easy integration with your RFT Lambda functions.

**Overview**  
These preset functions provide out-of-the-box evaluation capabilities for:
+ **prime_code**: Code generation and correctness evaluation
+ **prime_math**: Mathematical reasoning and problem-solving evaluation

**Quick setup**  
To use preset reward functions:

1. Download the Lambda layer from the [nova-custom-eval-sdk releases](https://github.com/aws/nova-custom-eval-sdk/releases)

1. Publish Lambda layer using AWS CLI:

   ```
   aws lambda publish-layer-version \
       --layer-name preset-function-layer \
       --description "Preset reward function layer with dependencies" \
       --zip-file fileb://universal_reward_layer.zip \
       --compatible-runtimes python3.9 python3.10 python3.11 python3.12 \
       --compatible-architectures x86_64 arm64
   ```

1. Add the layer to your Lambda function in the AWS Console (select the `preset-function-layer` custom layer, and also add `AWSSDKPandas-Python312` for the NumPy dependency)

1. Import and use in your Lambda code:

   ```
   from prime_code import compute_score  # For code evaluation
   from prime_math import compute_score  # For math evaluation
   ```

**prime_code function**  
**Purpose**: Evaluates Python code generation tasks by executing code against test cases and measuring correctness.

**Example input dataset format from evaluation**:

```
{"messages":[{"role":"user","content":"Write a function that returns the sum of two numbers."}],"reference_answer":{"inputs":["3\n5","10\n-2","0\n0"],"outputs":["8","8","0"]}}
{"messages":[{"role":"user","content":"Write a function to check if a number is even."}],"reference_answer":{"inputs":["4","7","0","-2"],"outputs":["True","False","True","True"]}}
```

**Key features**:
+ Automatic code extraction from markdown code blocks
+ Function detection and call-based testing
+ Test case execution with timeout protection
+ Syntax validation and compilation checks
+ Detailed error reporting with tracebacks

**prime_math function**  
**Purpose**: Evaluates mathematical reasoning and problem-solving capabilities with symbolic math support.

**Input format**:

```
{"messages":[{"role":"user","content":"What is the derivative of x^2 + 3x?"}],"reference_answer":"2*x + 3"}
```

**Key features**:
+ Symbolic math evaluation using SymPy
+ Multiple answer formats (LaTeX, plain text, symbolic)
+ Mathematical equivalence checking
+ Expression normalization and simplification

**Best practices**  
Follow these best practices when using preset reward functions:
+ Use proper data types in test cases (integers vs strings, booleans vs "True")
+ Provide clear function signatures in code problems
+ Include edge cases in test inputs (zero, negative numbers, empty inputs)
+ Format math expressions consistently in reference answers
+ Test your reward function with sample data before deployment

## Creating your reward function
<a name="nova-hp-evaluate-rft-create-function"></a>

**Lambda ARN**  
Your Lambda function name must contain `SageMaker`, `sagemaker`, or `Sagemaker`, so that the ARN matches the following pattern:

```
"arn:aws:lambda:*:*:function:*SageMaker*"
```

If the Lambda function name does not follow this naming scheme, the job fails with the following error:

```
[ERROR] Unexpected error: lambda_arn must contain one of: ['SageMaker', 'sagemaker', 'Sagemaker'] when running on SMHP platform (Key: lambda_arn)
```

**Lambda function structure**  
Your Lambda function receives batches of model outputs and returns reward scores. Below is a sample implementation:

```
from typing import List, Any
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class MetricResult:
    """Individual metric result."""
    name: str
    value: float
    type: str


@dataclass
class RewardOutput:
    """Reward service output."""
    id: str
    aggregate_reward_score: float
    metrics_list: List[MetricResult]


def lambda_handler(event, context):
    """ Main lambda handler """
    return lambda_grader(event)


def lambda_grader(samples: list[dict]) -> list[dict]:
    """ Core grader function """
    scores: List[RewardOutput] = []

    for sample in samples:
        print("Sample: ", json.dumps(sample, indent=2))

        # Extract components
        idx = sample.get("id", "no id")
        if not idx or idx == "no id":
            print(f"ID is None/empty for sample: {sample}")

        ground_truth = sample.get("reference_answer")

        if "messages" not in sample:
            print(f"Messages is None/empty for id: {idx}")
            continue

        if ground_truth is None:
            print(f"No answer found in ground truth for id: {idx}")
            continue

        # Get model's response (last turn is assistant turn)
        last_message = sample["messages"][-1]

        if last_message["role"] != "nova_assistant":
            print(f"Last message is not from assistant for id: {idx}")
            continue

        if "content" not in last_message:
            print(f"Completion text is empty for id: {idx}")
            continue

        # Content is an array of {"type": "text", "text": ...} parts
        model_text = " ".join(
            part["text"] for part in last_message["content"]
            if part.get("type") == "text"
        )

        # --- Actual scoring logic (lexical overlap) ---
        ground_truth_text = _extract_ground_truth_text(ground_truth)

        # Calculate main score and individual metrics
        overlap_score = _lexical_overlap_score(model_text, ground_truth_text)

        # Create two separate metrics as in the first implementation
        accuracy_score = overlap_score  # Use overlap as accuracy
        fluency_score = _calculate_fluency(model_text)  # New function for fluency

        # Create individual metrics
        metrics_list = [
            MetricResult(name="accuracy", value=accuracy_score, type="Metric"),
            MetricResult(name="fluency", value=fluency_score, type="Reward")
        ]

        ro = RewardOutput(
            id=idx,
            aggregate_reward_score=overlap_score,
            metrics_list=metrics_list
        )

        print(f"Response for id: {idx} is {ro}")
        scores.append(ro)

    # Convert to dict format
    result = []
    for score in scores:
        result.append({
            "id": score.id,
            "aggregate_reward_score": score.aggregate_reward_score,
            "metrics_list": [asdict(metric) for metric in score.metrics_list]
        })

    return result


def _extract_ground_truth_text(ground_truth: Any) -> str:
    """
    Turn the `ground_truth` field into a plain string.
    """
    if isinstance(ground_truth, str):
        return ground_truth

    if isinstance(ground_truth, dict):
        # Common patterns: { "explanation": "...", "answer": "..." }
        if "explanation" in ground_truth and isinstance(ground_truth["explanation"], str):
            return ground_truth["explanation"]
        if "answer" in ground_truth and isinstance(ground_truth["answer"], str):
            return ground_truth["answer"]
        # Fallback: stringify the whole dict
        return json.dumps(ground_truth, ensure_ascii=False)

    # Fallback: stringify anything else
    return str(ground_truth)


def _tokenize(text: str) -> List[str]:
    # Very simple tokenizer: lowercase + alphanumeric word chunks
    return re.findall(r"\w+", text.lower())


def _lexical_overlap_score(model_text: str, ground_truth_text: str) -> float:
    """
    Simple lexical overlap score in [0, 1]:
      score = |tokens(model) ∩ tokens(gt)| / |tokens(gt)|
    """
    gt_tokens = _tokenize(ground_truth_text)
    model_tokens = _tokenize(model_text)

    if not gt_tokens:
        return 0.0

    gt_set = set(gt_tokens)
    model_set = set(model_tokens)
    common = gt_set & model_set

    return len(common) / len(gt_set)


def _calculate_fluency(text: str) -> float:
    """
    Calculate a simple fluency score based on:
    - Average word length
    - Text length
    - Sentence structure

    Returns a score between 0 and 1.
    """
    # Simple implementation - could be enhanced with more sophisticated NLP
    words = _tokenize(text)

    if not words:
        return 0.0

    # Average word length normalized to [0,1] range
    # Assumption: average English word is ~5 chars, so normalize around that
    avg_word_len = sum(len(word) for word in words) / len(words)
    word_len_score = min(avg_word_len / 10, 1.0)

    # Text length score - favor reasonable length responses
    ideal_length = 100  # words
    length_score = min(len(words) / ideal_length, 1.0)

    # Simple sentence structure check (periods, question marks, etc.)
    sentence_count = len(re.findall(r'[.!?]+', text)) + 1
    sentence_ratio = min(sentence_count / (len(words) / 15), 1.0)

    # Combine scores
    fluency_score = (word_len_score + length_score + sentence_ratio) / 3

    return fluency_score
```

**Lambda request format**  
Your Lambda function receives data in this format:

```
[
  {
    "id": "sample-001",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Do you have a dedicated security team?"
          }
        ]
      },
      {
        "role": "nova_assistant",
        "content": [
          {
            "type": "text",
            "text": "As an AI developed by Company, I don't have a dedicated security team in the traditional sense. However, the development and deployment of AI systems like me involve extensive security measures, including data encryption, user privacy protection, and other safeguards to ensure safe and responsible use."
          }
        ]
      }
    ],
    "reference_answer": {
      "compliant": "No",
      "explanation": "As an AI developed by Company, I do not have a traditional security team. However, the deployment involves stringent safety measures, such as encryption and privacy safeguards."
    }
  }
]
```

**Note**  
The message structure includes the nested `content` array, matching the input data format. The last message with role `nova_assistant` contains the model's generated response.

**Lambda response format**  
Your Lambda function must return data in this format:

```
[
  {
    "id": "sample-001",
    "aggregate_reward_score": 0.75,
    "metrics_list": [
      {
        "name": "accuracy",
        "value": 0.85,
        "type": "Metric"
      },
      {
        "name": "fluency",
        "value": 0.90,
        "type": "Reward"
      }
    ]
  }
]
```

**Response fields**:
+ `id`: Must match the input sample ID
+ `aggregate_reward_score`: Overall score (typically 0.0 to 1.0)
+ `metrics_list`: Array of individual metrics with:
  + `name`: Metric identifier (e.g., "accuracy", "fluency")
  + `value`: Metric score (typically 0.0 to 1.0)
  + `type`: Either "Metric" (for reporting) or "Reward" (used in training)
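
A quick schema check for your Lambda's return value can save a failed evaluation run. A minimal sketch against the response format above (the `validate_reward_output` helper is illustrative):

```python
def validate_reward_output(record):
    """Check one reward-output record against the response format above."""
    assert isinstance(record["id"], str), "id must be a string"
    assert isinstance(record["aggregate_reward_score"], (int, float))
    for metric in record["metrics_list"]:
        assert isinstance(metric["name"], str)
        assert isinstance(metric["value"], (int, float))
        assert metric["type"] in ("Metric", "Reward")
    return True
```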

## IAM permissions
<a name="nova-hp-evaluate-rft-iam"></a>

**Required permissions**  
Your SageMaker AI execution role must have permissions to invoke your Lambda function. Add this policy to your SageMaker AI execution role:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": "arn:aws:lambda:region:account-id:function:function-name"
    }
  ]
}
```

**Lambda execution role**  
Your Lambda function's execution role needs basic Lambda execution permissions:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

**Additional permissions**: If your Lambda function accesses other AWS services (e.g., Amazon S3 for reference data, DynamoDB for logging), add those permissions to the Lambda execution role.

## Executing the evaluation job
<a name="nova-hp-evaluate-rft-execution"></a>

1. **Prepare your data**
   + Format your evaluation data according to the data format requirements
   + Upload your JSONL file to Amazon S3: `s3://your-bucket/eval-data/eval_data.jsonl`

1. **Configure your recipe**

   Update the sample recipe with your configuration:
   + Set `model_name_or_path` to your model location
   + Set `rl_env.reward_lambda_arn` to your reward function ARN
   + Set `output_s3_path` to your desired output location
   + Adjust `inference` parameters as needed

   Save the recipe as `rft_eval_recipe.yaml`

1. **Run the evaluation**

   Execute the evaluation job using the provided notebook: [Nova model evaluation notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html#nova-model-evaluation-notebook)

1. **Monitor progress**

   Monitor your evaluation job through:
   + SageMaker AI Console: Check job status and logs
   + CloudWatch Logs: View detailed execution logs
   + Lambda Logs: Debug reward function issues

## Understanding evaluation results
<a name="nova-hp-evaluate-rft-results"></a>

**Output format**  
The evaluation job outputs results to your specified Amazon S3 location in JSONL format. Each line contains the evaluation results for one sample:

```
{
  "id": "sample-001",
  "aggregate_reward_score": 0.75,
  "metrics_list": [
    {
      "name": "accuracy",
      "value": 0.85,
      "type": "Metric"
    },
    {
      "name": "fluency",
      "value": 0.90,
      "type": "Reward"
    }
  ]
}
```

**Note**  
The RFT Evaluation Job Output is identical to the Lambda Response format. The evaluation service passes through your Lambda function's response without modification, ensuring consistency between your reward calculations and the final results.

**Interpreting results**  
**Aggregate Reward Score**:
+ Range: Typically 0.0 (worst) to 1.0 (best), but depends on your implementation
+ Purpose: Single number summarizing overall performance
+ Usage: Compare models, track improvement over training

**Individual Metrics**:
+ Metric Type: Informational metrics for analysis
+ Reward Type: Metrics used during RFT training
+ Interpretation: Higher values generally indicate better performance (unless you design inverse metrics)
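
Since each line of the output is one sample in the Lambda response format, a short script can average the aggregate score and each named metric across the run. A sketch (the `summarize_results` helper is illustrative):

```python
import json
from collections import defaultdict

def summarize_results(path):
    """Average the aggregate score and each named metric across all samples."""
    totals, counts = defaultdict(float), defaultdict(int)
    n, agg_total = 0, 0.0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            n += 1
            agg_total += record["aggregate_reward_score"]
            for metric in record["metrics_list"]:
                totals[metric["name"]] += metric["value"]
                counts[metric["name"]] += 1
    summary = {name: totals[name] / counts[name] for name in totals}
    summary["aggregate_reward_score"] = agg_total / n if n else 0.0
    return summary
```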

**Performance benchmarks**  
What constitutes "good" performance depends on your use case:


| Score Range | Interpretation | Action | 
| --- |--- |--- |
| 0.8 - 1.0 | Excellent | Model ready for deployment | 
| 0.6 - 0.8 | Good | Minor improvements may be beneficial | 
| 0.4 - 0.6 | Fair | Significant improvement needed | 
| 0.0 - 0.4 | Poor | Review training data and reward function | 

**Important**  
These are general guidelines. Define your own thresholds based on business requirements, baseline model performance, domain-specific constraints, and cost-benefit analysis of further training.