

# Amazon Nova customization on SageMaker HyperPod
<a name="nova-hp"></a>

You can customize Amazon Nova models, including the enhanced Nova 2.0 models, using [Amazon Nova recipes](nova-model-recipes.md) and train them on SageMaker HyperPod. A recipe is a YAML configuration file that tells SageMaker AI how to run your model customization job. SageMaker HyperPod supports two customization paths: Forge and non-Forge.

SageMaker HyperPod offers high-performance computing with optimized GPU instances and Amazon FSx for Lustre storage, robust monitoring through integration with tools like TensorBoard, flexible checkpoint management for iterative improvement, seamless deployment to Amazon Bedrock for inference, and efficient, scalable multi-node distributed training. Together, these capabilities provide a secure, performant, and flexible environment for tailoring Nova models to your specific business requirements.

Amazon Nova customization on SageMaker HyperPod stores model artifacts including model checkpoints in a service-managed Amazon S3 bucket. Artifacts in the service-managed bucket are encrypted with SageMaker AI-managed AWS KMS keys. Service-managed Amazon S3 buckets don't currently support data encryption using customer-managed KMS keys. You can use this checkpoint location for evaluation jobs or Amazon Bedrock inference.

Standard pricing applies for compute instances, Amazon S3 storage, and FSx for Lustre. For pricing details, see [SageMaker HyperPod pricing](https://aws.amazon.com/sagemaker-ai/pricing/), [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/), and [FSx for Lustre pricing](https://aws.amazon.com/fsx/lustre/pricing/).

## Compute requirements for Amazon Nova 1 models
<a name="nova-hp-compute-1"></a>

The following tables summarize the compute requirements for SageMaker HyperPod and SageMaker AI training jobs for Nova 1.0 models.


**Pre-training**  

| Model | Sequence length | Nodes | Instance | Accelerator | 
| --- |--- |--- |--- |--- |
| Amazon Nova Micro | 8,192 | 8 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova Lite | 8,192 | 16 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova Pro | 8,192 | 12 | ml.p5.48xlarge | GPU H100 | 


**Direct preference optimization (DPO)**  

| Technique | Sequence length | Number of nodes | Instance | Accelerator | 
| --- |--- |--- |--- |--- |
| Direct Preference Optimization (Full) | 32,768 | 2, 4, or 6 | ml.p5.48xlarge | GPU H100 | 
| Direct Preference Optimization (LoRA) | 32,768 | 2, 4, or 6 | ml.p5.48xlarge | GPU H100 | 


**Fine-tuning**  

| Model | Technique | Sequence length | Number of nodes | Instance | Accelerator | 
| --- |--- |--- |--- |--- |--- |
| Amazon Nova 1 Micro |  Supervised Fine-Tuning (LoRA)  | 65,536 | 2 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova 1 Micro |  Supervised Fine-Tuning (Full)  | 65,536 | 2 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova 1 Lite |  Supervised Fine-Tuning (LoRA)  | 32,768 | 4 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova 1 Lite |  Supervised Fine-Tuning (Full)  | 65,536 | 4 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova 1 Pro |  Supervised Fine-Tuning (LoRA)  | 65,536 | 6 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova 1 Pro |  Supervised Fine-Tuning (Full)  | 65,536 | 6 | ml.p5.48xlarge | GPU H100 | 


**Distillation**  

| Model | Nodes | Instance | 
| --- |--- |--- |
| Model Distillation for Post-Training | 1 | ml.r5.24xlarge | 


**Evaluation**  

| Model | Sequence length | Nodes | Instance | Accelerator | 
| --- |--- |--- |--- |--- |
| General Text Benchmark Recipe | 8,192 | 1 | ml.p5.48xlarge | GPU H100 | 
| Bring your own dataset (gen_qa) benchmark Recipe | 8,192 | 1 | ml.p5.48xlarge | GPU H100 | 
| Amazon Nova LLM as a Judge Recipe | 8,192 | 1 | ml.p5.48xlarge | GPU H100 | 
| Standard Text Benchmarks | 8,192 | 1 | ml.p5.48xlarge | GPU H100 | 
| Custom Dataset Evaluation | 8,192 | 1 | ml.p5.48xlarge | GPU H100 | 
| Multi-Modal Benchmarks | 8,192 | 1 | ml.p5.48xlarge | GPU H100 | 


**Proximal policy optimization**  

| Model | Critic Model Instance Count | Reward Model Instance Count | Anchor Model Instance Count | Actor Train | Actor Generation | Number of Instances | Total Hours Per Run | P5 Hours | Instance Type | 
| --- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
| Amazon Nova Micro | 1 | 1 | 1 | 2 | 2 | 7 | 8 | 56 | ml.p5.48xlarge | 
| Amazon Nova Lite | 1 | 1 | 1 | 2 | 2 | 7 | 16 | 112 | ml.p5.48xlarge | 
| Amazon Nova Pro | 1 | 1 | 1 | 6 | 2 | 11 | 26 | 260 | ml.p5.48xlarge | 

**Topics**
+ [Compute requirements for Amazon Nova 1 models](#nova-hp-compute-1)
+ [Nova Forge SDK](nova-hp-forge-sdk.md)
+ [Amazon SageMaker HyperPod Essential Commands Guide](nova-hp-essential-commands-guide.md)
+ [Creating a SageMaker HyperPod EKS cluster with restricted instance group (RIG)](nova-hp-cluster.md)
+ [Nova Forge access and setup](nova-forge-hp-access.md)
+ [Training for Amazon Nova models](nova-hp-training.md)
+ [Fine-tuning Amazon Nova models on SageMaker HyperPod](nova-hp-fine-tune.md)
+ [Evaluating your trained model](nova-hp-evaluate.md)
+ [Monitoring HyperPod jobs with MLflow](nova-hp-mlflow.md)

# Nova Forge SDK
<a name="nova-hp-forge-sdk"></a>

The Nova Forge SDK is a comprehensive Python SDK that provides a unified, programmatic interface for the complete Amazon Nova model customization lifecycle. The SDK simplifies model customization by offering a single, consistent API for training, evaluation, monitoring, deployment, and inference across Amazon SageMaker and Amazon Bedrock platforms.

For more information, see [Nova Forge SDK](nova-forge-sdk.md).

# Amazon SageMaker HyperPod Essential Commands Guide
<a name="nova-hp-essential-commands-guide"></a>

Amazon SageMaker HyperPod provides extensive command-line functionality for managing training workflows. This guide covers essential commands for common operations, from connecting to your cluster to monitoring job progress.

**Prerequisites**  
Before using these commands, ensure you have completed the following setup:
+ SageMaker HyperPod cluster with RIG created (typically in us-east-1)
+ Output Amazon S3 bucket created for training artifacts
+ IAM roles configured with appropriate permissions
+ Training data uploaded in correct JSONL format
+ FSx for Lustre sync completed (verify in cluster logs on first job)

**Topics**
+ [Installing Recipe CLI](#nova-hp-essential-commands-guide-install)
+ [Connecting to your cluster](#nova-hp-essential-commands-guide-connect)
+ [Starting a training job](#nova-hp-essential-commands-guide-start-job)
+ [Checking job status](#nova-hp-essential-commands-guide-status)
+ [Monitoring job logs](#nova-hp-essential-commands-guide-logs)
+ [Listing active jobs](#nova-hp-essential-commands-guide-list-jobs)
+ [Canceling a job](#nova-hp-essential-commands-guide-cancel-job)
+ [Running an evaluation job](#nova-hp-essential-commands-guide-evaluation)
+ [Common issues](#nova-hp-essential-commands-guide-troubleshooting)

## Installing Recipe CLI
<a name="nova-hp-essential-commands-guide-install"></a>

Navigate to the root of your recipe repository before running the installation command.

**Use the HyperPod recipes repository for non-Forge customization techniques. For Forge-based customization, refer to the Forge-specific recipe repository.**  
Run the following commands to install the SageMaker HyperPod CLI:

**Note**  
Make sure you aren't in an active conda, Anaconda, or Miniconda environment or another virtual environment.  
If you are, exit the environment using:  
`conda deactivate` for conda, Anaconda, or Miniconda environments
`deactivate` for Python virtual environments

If you are using a non-Forge customization technique, download the sagemaker-hyperpod-recipes repository as shown below:

```
git clone -b release_v2 https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
pip install -e .
cd ..
root_dir=$(pwd)
export PYTHONPATH=${root_dir}/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

If you are a **Forge subscriber**, download the recipes using the following process.

```
mkdir NovaForgeHyperpodCLI
cd NovaForgeHyperpodCLI
aws s3 cp s3://nova-forge-c7363-206080352451-us-east-1/v1/ ./ --recursive
pip install -e .

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh
```

**Tip**  
To use a [new virtual environment](https://docs.python.org/3/library/venv.html) before running `pip install -e .`, run:  
`python -m venv nova_forge`
`source nova_forge/bin/activate`
Your command line will now display `(nova_forge)` at the beginning of your prompt.  
This ensures there are no competing dependencies when using the CLI.

**Purpose**: Why run `pip install -e .`?

This command installs the SageMaker HyperPod CLI in editable mode, allowing you to use updated recipes without reinstalling each time. It also enables you to add new recipes that the CLI can automatically pick up.

## Connecting to your cluster
<a name="nova-hp-essential-commands-guide-connect"></a>

Connect the SageMaker HyperPod CLI to your cluster before running any jobs:

```
export AWS_REGION=us-east-1 && hyperpod connect-cluster --cluster-name <your-cluster-name> --region us-east-1
```

**Important**  
This command creates a context file (`/tmp/hyperpod_context.json`) that subsequent commands require. If you see an error about this file not found, re-run the connect command.

**Pro tip**: You can further configure your cluster to always use the `kubeflow` namespace by adding the `--namespace kubeflow` argument to your command as follows:

```
export AWS_REGION=us-east-1 && \
hyperpod connect-cluster \
--cluster-name <your-cluster-name> \
--region us-east-1 \
--namespace kubeflow
```

This saves you the effort of adding `-n kubeflow` to every command when interacting with your jobs.

## Starting a training job
<a name="nova-hp-essential-commands-guide-start-job"></a>

**Note**  
If running PPO/RFT jobs, ensure you add label selector settings to `src/hyperpod_cli/sagemaker_hyperpod_recipes/recipes_collection/cluster/k8s.yaml` so that all pods are scheduled on the same restricted instance group.  

```
label_selector:
  required:
    sagemaker.amazonaws.com/instance-group-name:
      - <rig_group>
```

Launch a training job using a recipe with optional parameter overrides:

```
hyperpod start-job -n kubeflow \
  --recipe fine-tuning/nova/nova_1_0/nova_micro/SFT/nova_micro_1_0_p5_p4d_gpu_lora_sft \
  --override-parameters '{
    "instance_type": "ml.p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-HP-SFT-latest"
  }'

**Expected output**:

```
Final command: python3 <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/main.py recipes=fine-tuning/nova/nova_micro_p5_gpu_sft cluster_type=k8s cluster=k8s base_results_dir=/local/home/<username>/results cluster.pullPolicy="IfNotPresent" cluster.restartPolicy="OnFailure" cluster.namespace="kubeflow" container="708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:HP-SFT-DATAMIX-latest"

Prepared output directory at /local/home/<username>/results/<job-name>/k8s_templates
Found credentials in shared credentials file: ~/.aws/credentials
Helm script created at /local/home/<username>/results/<job-name>/<job-name>_launch.sh
Running Helm script: /local/home/<username>/results/<job-name>/<job-name>_launch.sh

NAME: <job-name>
LAST DEPLOYED: Mon Sep 15 20:56:50 2025
NAMESPACE: kubeflow
STATUS: deployed
REVISION: 1
TEST SUITE: None
Launcher successfully generated: <path_to_your_installation>/NovaForgeHyperpodCLI/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nova/k8s_templates/SFT

{
 "Console URL": "https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/cluster-management/<your-cluster-name>"
}
```

## Checking job status
<a name="nova-hp-essential-commands-guide-status"></a>

Monitor your running jobs using kubectl:

```
kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep <your-job-name>)
```

**Understanding pod statuses**  
The following table explains common pod statuses:


| Status | Description | 
| --- |--- |
| `Pending` | Pod accepted but not yet scheduled onto a node, or waiting for container images to be pulled | 
| `Running` | Pod bound to a node with at least one container running or starting | 
| `Succeeded` | All containers completed successfully and won't restart | 
| `Failed` | All containers terminated with at least one ending in failure | 
| `Unknown` | Pod state cannot be determined (usually due to node communication issues) | 
| `CrashLoopBackOff` | Container repeatedly failing; Kubernetes backing off from restart attempts | 
| `ImagePullBackOff` / `ErrImagePull` | Unable to pull container image from registry | 
| `OOMKilled` | Container terminated for exceeding memory limits | 
| `Completed` | Job or Pod finished successfully (batch job completion) | 

**Tip**  
Use the `-w` flag to watch pod status updates in real-time. Press `Ctrl+C` to stop watching.

## Monitoring job logs
<a name="nova-hp-essential-commands-guide-logs"></a>

You can view your logs in one of three ways:

**Using CloudWatch**  
Your logs are available under CloudWatch in the AWS account that contains the HyperPod cluster. To view them in your browser, navigate to the CloudWatch console in your account and search for your cluster name. For example, if your cluster were called `my-hyperpod-rig`, the log group would have the prefix:
+ **Log group**: `/aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}`
+ Once you're in the log group, you can find your specific log stream using the node instance ID, such as `hyperpod-i-00b3d8a1bf25714e4`.
  + Here, `i-00b3d8a1bf25714e4` is the instance ID of the HyperPod node where your training job is running. Recall that the output of the earlier command `kubectl get pods -o wide -w -n kubeflow | (head -n1 ; grep my-cpt-run)` included a column called **NODE**.
  + In this case, the "master" node run was on `hyperpod-i-00b3d8a1bf25714e4`, so use that string to select the log stream to view. Select the one named `SagemakerHyperPodTrainingJob/rig-group/[NODE]`.

**Using CloudWatch Insights**  
If you have your job name handy and don't want to go through all the steps above, you can query all logs under `/aws/sagemaker/Clusters/my-hyperpod-rig/{UUID}` to find the individual log stream.

CPT:

```
fields @timestamp, @message, @logStream, @log
| filter @message like /(?i)Starting CPT Job/
| sort @timestamp desc
| limit 100
```

For job completion, replace `Starting CPT Job` with `CPT Job completed`.

Then you can click through the results and pick the one that says "Epoch 0" since that will be your master node.

**Using the AWS CLI**  
You may choose to tail your logs using the AWS CLI. Before doing so, check your AWS CLI version with `aws --version`, then use the command that matches your major version.

**For AWS CLI v1**:

```
aws logs get-log-events \
--log-group-name /aws/sagemaker/YourLogGroupName \
--log-stream-name YourLogStream \
--start-from-head | jq -r '.events[].message'
```

**For AWS CLI v2**:

```
aws logs tail /aws/sagemaker/YourLogGroupName \
  --log-stream-names YourLogStream \
  --since 10m \
  --follow

## Listing active jobs
<a name="nova-hp-essential-commands-guide-list-jobs"></a>

View all jobs running in your cluster:

```
hyperpod list-jobs -n kubeflow
```

**Example output**:

```
{
  "jobs": [
    {
      "Name": "test-run-nhgza",
      "Namespace": "kubeflow",
      "CreationTime": "2025-10-29T16:50:57Z",
      "State": "Running"
    }
  ]
}
```

## Canceling a job
<a name="nova-hp-essential-commands-guide-cancel-job"></a>

Stop a running job at any time:

```
hyperpod cancel-job --job-name <job-name> -n kubeflow
```

**Finding your job name**  
**Option 1: From your recipe**

The job name is specified in your recipe's `run` block:

```
run:
  name: "my-test-run"                        # This is your job name
  model_type: "amazon.nova-micro-v1:0:128k"
  ...
```

**Option 2: From list-jobs command**

Use `hyperpod list-jobs -n kubeflow` and copy the `Name` field from the output.

## Running an evaluation job
<a name="nova-hp-essential-commands-guide-evaluation"></a>

Evaluate a trained model or base model using an evaluation recipe.

**Prerequisites**  
Before running evaluation jobs, ensure you have:
+ Checkpoint Amazon S3 URI from your training job's `manifest.json` file (for trained models)
+ Evaluation dataset uploaded to Amazon S3 in the correct format
+ Output Amazon S3 path for evaluation results

**Command**  
Run the following command to start an evaluation job:

```
hyperpod start-job -n kubeflow \
  --recipe evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_bring_your_own_dataset_eval \
  --override-parameters '{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-latest",
    "recipes.run.name": "<your-eval-job-name>",
    "recipes.run.model_name_or_path": "<checkpoint-s3-uri>",
    "recipes.run.output_s3_path": "s3://<your-bucket>/eval-results/",
    "recipes.run.data_s3_path": "s3://<your-bucket>/eval-data.jsonl"
  }'
```

**Parameter descriptions**:
+ `recipes.run.name`: Unique name for your evaluation job
+ `recipes.run.model_name_or_path`: Amazon S3 URI from `manifest.json` or base model path (e.g., `nova-micro/prod`)
+ `recipes.run.output_s3_path`: Amazon S3 location for evaluation results
+ `recipes.run.data_s3_path`: Amazon S3 location of your evaluation dataset

**Tips**:
+ **Model-specific recipes**: Each model size (micro, lite, pro) has its own evaluation recipe
+ **Base model evaluation**: Use base model paths (e.g., `nova-micro/prod`) instead of checkpoint URIs to evaluate base models

**Evaluation data format**  
**Input format (JSONL)**:

```
{
  "metadata": "{key:4, category:'apple'}",
  "system": "arithmetic-patterns, please answer the following with no other words: ",
  "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
  "response": "32"
}
```

**Output format**:

```
{
  "prompt": "[{'role': 'system', 'content': 'arithmetic-patterns, please answer the following with no other words: '}, {'role': 'user', 'content': 'What is the next number in this series? 1, 2, 4, 8, 16, ?'}]",
  "inference": "['32']",
  "gold": "32",
  "metadata": "{key:4, category:'apple'}"
}
```

**Field descriptions**:
+ `prompt`: Formatted input sent to the model
+ `inference`: Model's generated response
+ `gold`: Expected correct answer from input dataset
+ `metadata`: Optional metadata passed through from input
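To illustrate how you might consume these results, here is a minimal sketch that computes exact-match accuracy over an evaluation output file. The field names follow the output format shown above; the function name, the sample records, and the exact-match scoring rule are illustrative assumptions, not part of any AWS tooling:

```python
import ast
import json

def exact_match_accuracy(lines):
    """Score evaluation output records by exact match against the gold answer."""
    correct = 0
    for line in lines:
        record = json.loads(line)
        # "inference" is a stringified list, e.g. "['32']", so parse it safely
        predictions = ast.literal_eval(record["inference"])
        if record["gold"] in predictions:
            correct += 1
    return correct / len(lines)

results = [
    '{"prompt": "...", "inference": "[\'32\']", "gold": "32", "metadata": "{}"}',
    '{"prompt": "...", "inference": "[\'31\']", "gold": "32", "metadata": "{}"}',
]
print(exact_match_accuracy(results))  # 0.5
```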

## Common issues
<a name="nova-hp-essential-commands-guide-troubleshooting"></a>
+ If you see `ModuleNotFoundError: No module named 'nemo_launcher'`, you might have to add `nemo_launcher` to your Python path based on where `hyperpod_cli` is installed. Sample command:

  ```
  export PYTHONPATH=<path_to_hyperpod_cli>/sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:$PYTHONPATH
  ```
+ `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/hyperpod_current_context.json'` indicates that you haven't run the `hyperpod connect-cluster` command.
+ If you don't see your job scheduled, double-check that the output of your SageMaker HyperPod CLI includes the section with job names and other metadata. If not, reinstall Helm by running:

  ```
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
  rm -f ./get_helm.sh
  ```

# Creating a SageMaker HyperPod EKS cluster with restricted instance group (RIG)
<a name="nova-hp-cluster"></a>

To customize a model on SageMaker HyperPod, the necessary infrastructure must be set up. For details on setting up a SageMaker HyperPod EKS cluster with a restricted instance group (RIG), visit the [workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US), which provides a detailed walkthrough of the setup process.

# Nova Forge access and setup
<a name="nova-forge-hp-access"></a>

To set up Amazon Nova Forge for use with your jobs, you need to:
+ Subscribe to Amazon Nova Forge
+ Set up a cluster

**Topics**
+ [Subscribe to Amazon Nova Forge](nova-forge-subscribing.md)
+ [Set up infrastructure](nova-forge-hyperpod-setup.md)
+ [Responsible AI](nova-forge-responsible-ai.md)

# Subscribe to Amazon Nova Forge
<a name="nova-forge-subscribing"></a>

To access Amazon Nova Forge features, complete the following steps:

1. Verify administrator access to the AWS account.

1. Navigate to the SageMaker AI console and request access to Amazon Nova Forge.

1. Wait for the Amazon Nova team to email a confirmation after the subscription request is approved.

1. Tag your execution role with the `forge-subscription` tag. This tag is required for accessing Amazon Nova Forge features and checkpoints. Add the following tag to your execution role:
   + Key: `forge-subscription`
   + Value: `true`

**Note**  
Standard Amazon Nova features remain available without a Forge subscription. Amazon Nova Forge is designed for building custom frontier models with control and flexibility across all model training phases.

# Set up infrastructure
<a name="nova-forge-hyperpod-setup"></a>

Once your Amazon Nova Forge subscription is approved, set up the necessary infrastructure to use Forge-enabled features. For detailed instructions on creating an EKS cluster with a restricted instance group (RIG), follow the [workshop instructions](https://catalog.us-east-1.prod.workshops.aws/workshops/dcac6f7a-3c61-4978-8344-7535526bf743/en-US).

# Responsible AI
<a name="nova-forge-responsible-ai"></a>

**Content moderation settings**: Amazon Nova Forge customers have access to Customizable Content Moderation Settings (CCMS) for Amazon Nova Lite 1.0 and Pro 1.0 models. CCMS allows you to adjust content moderation controls to align with your specific business requirements while maintaining essential responsible AI safeguards. To determine if your business use case qualifies for CCMS, contact your Amazon Web Services account manager.

Amazon Nova Forge provides a Responsible AI Toolkit that includes training data, evaluation benchmarks, and runtime controls to help you align your models with Amazon Nova's responsible AI guidelines.

**Training data**: The "RAI" category in data mixing contains cases and scenarios emphasizing responsible AI principles, safety considerations, and responsible technology deployment. Use these to align your models responsibly during continued pre-training.

**Evaluations**: Benchmark tasks are available to test your model's ability to detect and reject inappropriate, harmful, or incorrect content. Use these evaluations to measure the difference between base model performance and your custom model performance.

**Runtime controls**: By default, Amazon Nova's runtime controls moderate model responses during inference. To modify these controls for your specific business case, request Customizable Content Moderation Settings (CCMS) by contacting your Amazon Web Services account manager.

## Shared Responsibility for Safety
<a name="shared-responsibility"></a>

Safety is a shared responsibility between Amazon Web Services and our customers. Changing the base model or using continued pre-training to improve performance on a specific use case can impact safety, fairness, and other properties of the new model.

We use a robust adaptation method to minimize changes to the safety, fairness, and other protections built into our base models while minimizing impact on model performance for tasks the model was not customized for.

You are responsible for:
+ End-to-end testing of your applications on datasets representative of your use cases
+ Deciding whether test results meet your specific expectations of safety, fairness, and other properties, as well as overall effectiveness

For more information, see the Amazon Web Services Responsible Use of AI Guide, Amazon Web Services Responsible AI Policy, AWS Acceptable Use Policy, and AWS Service Terms for the services you plan to use.

## Customizable Content Moderation Settings (CCMS)
<a name="ccms"></a>

CCMS allows you to adjust controls relevant to your business requirements while maintaining essential, non-configurable controls that ensure responsible use of AI.

These settings allow content generation through three available configurations:
+ Security only
+ Safety, sensitive content, and fairness combined
+ All categories combined

The four content moderation categories are:

1. **Safety** – Covers dangerous activities, weapons, and controlled substances

1. **Sensitive content** – Includes profanity, nudity, and bullying

1. **Fairness** – Addresses bias and cultural considerations

1. **Security** – Involves cybercrime, malware, and malicious content

Regardless of your CCMS configuration, Amazon Nova enforces essential, non-configurable controls to ensure responsible use of AI, such as controls to prevent harm to children and preserve privacy.

### Recommendations for Using CCMS
<a name="ccms-recommendations"></a>

When using CCMS, we recommend using Continued Pre-Training (CPT) and starting from a pre-RAI alignment checkpoint (PRE-TRAINING-Early, PRE-TRAINING-Mid, or PRE-TRAINING-Final) rather than the GA/FINAL checkpoint. These checkpoints have not undergone safety training or been steered toward specific RAI behaviors, allowing you to customize them more efficiently to your content moderation requirements.

**Tip**: When using CCMS with data mixing, consider adjusting the "rai" category percentage in your `nova_data` configuration to align with your specific content moderation requirements.

### Availability
<a name="ccms-availability"></a>

CCMS is currently available for approved customers using:
+ Nova Lite 1.0 and Pro 1.0 models
+ Amazon Bedrock On-Demand inference
+ The us-east-1 (N. Virginia) region

To enable CCMS for your Forge models, contact your Amazon Web Services account manager.

# Training for Amazon Nova models
<a name="nova-hp-training"></a>

Training Amazon Nova models on SageMaker HyperPod supports multiple techniques including Continued Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Fine-Tuning (RFT). Each technique serves different customization needs and can be applied to different Amazon Nova model versions.

**Topics**
+ [Continued pre-training (CPT)](nova-cpt.md)

# Continued pre-training (CPT)
<a name="nova-cpt"></a>

Continued pre-training (CPT) is a training technique that extends the pre-training phase of a foundation model by exposing it to additional unlabeled text from specific domains or corpora. Unlike supervised fine-tuning, which requires labeled input-output pairs, CPT trains on raw documents to help the model acquire deeper knowledge of new domains, learn domain-specific terminology and writing patterns, and adapt to particular content types or subject areas.

This approach is particularly valuable when you have large volumes (tens of billions of tokens) of domain-specific text data, such as legal documents, medical literature, technical documentation, or proprietary business content, and you want the model to develop native fluency in that domain. Generally, after the CPT stage, the model needs to undergo additional instruction tuning stages to enable the model to use the newly acquired knowledge and complete useful tasks.

**Supported models**  
CPT is available for the following Amazon Nova models:
+ Nova 1.0 (Micro, Lite, Pro)
+ Nova 2.0 (Lite)

Choose Nova 1.0 when the following applies:
+ Your use case requires standard language understanding without advanced reasoning.
+ You want to optimize for lower training and inference costs.
+ Your focus is on teaching the model domain-specific knowledge and behaviors rather than complex reasoning tasks.
+ You have already validated performance on Nova 1.0 and don't need additional capabilities.

**Note**  
The larger model is not always better. Consider the cost-performance tradeoff and your specific business requirements when selecting between Nova 1.0 and Nova 2.0 models.

# CPT on Nova 1.0
<a name="nova-cpt-1"></a>

You should use CPT in the following scenarios:
+ You have large-scale, unlabeled data that's specific to a domain (for example medicine or finance).
+ You want the model to retain general language capabilities while improving on domain-specific content.
+ You want to improve zero-shot and few-shot performance in specialized areas without performing extensive, task-specific fine-tuning.

**Data format requirements**  
We recommend adhering to the following dataset characteristics when performing CPT:
+ **Diversity**: Your data should cover a broad range of expressions within the target domain to avoid over-fitting.
+ **Representation**: Your data should reflect the distribution that the model will face during inference.
+ **Cleanliness**: Noise and redundancy in your data can degrade performance. Deduplication and text normalization improve model training.
+ **Scale**: Larger datasets help, but beyond a certain threshold (such as running multiple epochs on limited data), over-fitting risks increase.
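As a minimal illustration of the cleanliness point, the following sketch normalizes whitespace and drops exact duplicates. The normalization rule and dedup-by-hash approach are illustrative assumptions, not a prescribed preprocessing pipeline:

```python
import hashlib

def clean_corpus(texts):
    """Normalize whitespace and drop empty documents and exact duplicates."""
    seen = set()
    cleaned = []
    for text in texts:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if normalized and digest not in seen:
            seen.add(digest)
            cleaned.append(normalized)
    return cleaned

docs = ["AWS stands  for Amazon Web Services", "AWS stands for Amazon Web Services", ""]
print(clean_corpus(docs))  # ['AWS stands for Amazon Web Services']
```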

Training and validation datasets must be JSONL files following the format shown below, where each line contains a JSON object with a single `text` field. Here is an example:

```
{"text": "AWS stands for Amazon Web Services"}
{"text": "Amazon SageMaker is a fully managed machine learning service"}
{"text": "Amazon Bedrock is a fully managed service for foundation models"}
```

Text entries should contain naturally flowing, high-quality content that represents your target domain.

**Dataset validation**  
To validate your dataset before submitting your CPT job, check for the following conditions:
+ Each line must contain a valid JSON object.
+ Each object has a "text" field that contains string data.
+ No fields other than "text" are present.
+ The file has a `.jsonl` extension.
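The checks above can be sketched as a small validation function. This is a minimal illustration; the function name and error messages are assumptions, not part of any AWS tooling:

```python
import json

def validate_cpt_line(line):
    """Return None if the line is a valid CPT record, else an error message."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return "line is not valid JSON"
    if not isinstance(obj, dict) or set(obj) != {"text"}:
        return 'object must contain exactly one "text" field'
    if not isinstance(obj["text"], str):
        return '"text" field must contain string data'
    return None

print(validate_cpt_line('{"text": "AWS stands for Amazon Web Services"}'))  # None
print(validate_cpt_line('{"text": 42}'))
```

Run the function on every line of your `.jsonl` file before submitting the CPT job.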

**Training times**  
The amount of time spent training depends heavily on the size of the dataset, the number of instances used, and the model being trained. Training times are expected to scale linearly with dataset size. The following table provides example training times for various models.


| Model Type | GBS | Number of Samples in Dataset | Number of P5 Instances | `max_length` value | Approximate training time in hours | 
| --- |--- |--- |--- |--- |--- |
| Amazon Nova Micro | 256 | 100,000 | 8 | 8,192 | 4 | 
| Amazon Nova Lite | 256 | 100,000 | 16 | 8,192 | 4 | 
| Amazon Nova Pro | 256 | 100,000 | 24 | 8,192 | 10 | 
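Because training time is expected to scale roughly linearly with dataset size, you can sketch a back-of-envelope estimate from the table's baselines. The function below is an illustration under that linearity assumption, not an official sizing tool:

```python
# Approximate hours from the table above, at 100,000 samples with the listed instance counts.
BASELINE_HOURS = {"nova-micro": 4, "nova-lite": 4, "nova-pro": 10}

def estimate_training_hours(model, num_samples, baseline_samples=100_000):
    """Linearly scale the tabulated baseline hours to the actual dataset size."""
    return BASELINE_HOURS[model] * num_samples / baseline_samples

print(estimate_training_hours("nova-micro", 200_000))  # 8.0
```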


The Amazon Nova parameters that are available for tuning with CPT include:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Amazon Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model to use for your training. The available options are `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://customer-escrow-bucket-unique_id/training_run_name`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model you choose. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `data_s3_path`: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
  + `validation_data_s3_path`: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same account and region as the cluster. All of the S3 locations provided must be in the same account and Region.
  + `output_s3_path`: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations provided must be in the same AWS account and AWS Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 8,192 tokens for CPT.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `global_batch_size`: The total number of training samples processed together in one forward or backward pass across all devices and workers.

    This value is the product of the per-device batch size and the number of devices. It affects the stability of training and throughput. We recommend that you start with a batch size that fits comfortably within your memory and scale up from there. For domain-specific data, larger batches might over-smooth gradients.
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data to prevent over-fitting.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value (up to approximately 0.2) to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6-1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01-0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than learning rate.
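The learning-rate parameters above interact as a warmup ramp followed by a decay toward `min_lr`. The following sketch illustrates the parameter roles under an assumed linear-warmup, cosine-decay schedule (the service's internal schedule implementation is not exposed; this is illustrative only):

```python
import math

def lr_at_step(step: int, lr: float = 1e-5, min_lr: float = 1e-6,
               warmup_steps: int = 10, max_steps: int = 1000) -> float:
    """Learning rate under linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly from lr / warmup_steps to lr
        return lr * (step + 1) / warmup_steps
    # Decay from lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))
```

With the defaults, the rate reaches `lr` at the end of warmup and approaches `min_lr` by the final step, which is why `min_lr` must be less than `lr`.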

**CPT recipe**  
The following is a recipe for CPT.

```
## Run config
run:
  name: "my-cpt-run"             # A descriptive name for your training job
  model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification, do not change
  model_name_or_path: "nova-lite/prod"      # Base model path, do not change
  replicas: 4                     # Number of compute instances for training, allowed values are 4, 8, 16
  data_s3_path: [S3_PATH_TO_TRAIN_DATASET]
  validation_data_s3_path: (OPTIONAL)[S3_PATH_TO_VALIDATION_DATASET]
  output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
  max_length: 8192               # Maximum context window size (tokens).
  global_batch_size: 256           # Global batch size, allowed values are 32, 64, 128, 256.

  trainer:
      max_epochs: 2                # Number of training epochs

  model:
      hidden_dropout: 0.0          # Dropout for hidden states, must be between 0.0 and 1.0
      attention_dropout: 0.0       # Dropout for attention weights, must be between 0.0 and 1.0
      ffn_dropout: 0.0             # Dropout for feed-forward networks, must be between 0.0 and 1.0

      optim:
        lr: 1e-5                 # Learning rate
        name: distributed_fused_adam  # Optimizer algorithm, do not change
        adam_w_mode: true        # Enable AdamW mode
        eps: 1e-06               # Epsilon for numerical stability
        weight_decay: 0.0        # L2 regularization strength, must be between 0.0 and 1.0
        betas:                   # Adam optimizer betas, must be between 0.0 and 1.0
          - 0.9
          - 0.999
        sched:
          warmup_steps: 10     # Learning rate warmup steps
          constant_steps: 0    # Steps at constant learning rate
          min_lr: 1e-6         # Minimum learning rate, must be lower than lr
```

**Limitations**  
CPT has the following limitations:
+ Multimodal datasets aren't supported.
+ Intermediate checkpoints aren't saved for evaluation and you can't resume from an intermediate checkpoint. Only the last checkpoint is saved.

# Fine-tuning Amazon Nova models on SageMaker HyperPod
<a name="nova-hp-fine-tune"></a>

The following techniques show you how to fine-tune Amazon Nova models on SageMaker HyperPod.

**Topics**
+ [Supervised fine-tuning (SFT)](nova-fine-tune.md)
+ [Direct preference optimization (DPO)](nova-dpo.md)
+ [Proximal policy optimization (PPO)](nova-ppo.md)

# Supervised fine-tuning (SFT)
<a name="nova-fine-tune"></a>

The SFT training process consists of two main stages:
+ **Data Preparation**: Follow established guidelines to create, clean, or reformat datasets into the required structure. Ensure that inputs, outputs, and auxiliary information (such as reasoning traces or metadata) are properly aligned and formatted.
+ **Training Configuration**: Define how the model will be trained. This configuration is written in a YAML recipe file that includes:
  + Data source paths (training and validation datasets)
  + Key hyperparameters (epochs, learning rate, batch size)
  + Optional components (distributed training parameters, etc.)

## Nova Model Comparison and Selection
<a name="nova-model-comparison"></a>

Amazon Nova 2.0 is a model trained on a larger and more diverse dataset than Amazon Nova 1.0. Key improvements include:
+ **Enhanced reasoning abilities** with explicit reasoning mode support
+ **Broader multilingual performance** across additional languages
+ **Improved performance on complex tasks** including coding and tool use
+ **Extended context handling** with better accuracy and stability at longer context lengths

## When to Use Nova 1.0 vs. Nova 2.0
<a name="nova-model-selection"></a>

Choose Amazon Nova 1.0 when:
+ The use case requires standard language understanding without advanced reasoning
+ Performance has already been validated on Amazon Nova 1.0 and additional capabilities are not needed

# SFT on Nova 1.0
<a name="nova-sft-1"></a>

Supervised fine-tuning (SFT) is the process of providing a collection of prompt-response pairs to a foundation model to improve the performance of a pre-trained foundation model on a specific task. The labeled examples are formatted as prompt-response pairs and phrased as instructions. This fine-tuning process modifies the weights of the model.

You should use SFT when you have domain-specific data that requires providing specific prompt-response pairs for optimal results.

Note that your training and validation input datasets must reside in customer-owned buckets, not in escrow or service-managed S3 buckets.

## Data requirements
<a name="nova-sft-1-data-requirements"></a>

For full-rank SFT and low-rank adapter (LoRA) SFT, the data should follow the [Amazon Bedrock Converse operation format](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html). For examples and constraints of this format, see [Preparing data for fine-tuning Understanding models](https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html).

To validate your dataset format before submission, we recommend using [the validation script from the Amazon Bedrock samples repository](https://github.com/aws-samples/amazon-nova-samples/blob/main/customization/bedrock-finetuning/understanding/dataset_validation/nova_ft_dataset_validator.py). This validation tool helps ensure that your JSONL files adhere to the required format specifications and helps identify potential issues before you submit your fine-tuning job.

The Amazon Nova parameters that are available for tuning with SFT are as follows:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Amazon Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model to use for your training. Select the model to use from `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model chosen. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `data_s3_path`: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations within the provided S3 path must be in the same account and Region.
  + `validation_data_s3_path`: (Optional) The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster. All of the S3 locations within the provided S3 path must be in the same account and Region.
  + `output_s3_path`: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations within the provided S3 path must be in the same account and Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 65,536 tokens for SFT.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value (up to approximately 0.2) to reduce over-fitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. Valid values are between 1e-6-1e-3, inclusive. We recommend values between 1e-6-1e-4 for good performance.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01-0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than learning rate.

## Quick start with a full-rank SFT recipe
<a name="nova-sft-1-quick-start"></a>

The following is a recipe for full-rank SFT that's intended for you to quickly start an SFT job on a SageMaker HyperPod cluster. This recipe also assumes that you have connected to your SageMaker HyperPod cluster using the correct AWS credentials.

```
run:
  name: "my-sft-micro-job" # gets appended with a unique ID for HP jobs
  model_type: "amazon.nova-micro-v1:0:128k"
  model_name_or_path: "nova-micro/prod"
  replicas: 2
  data_s3_path: s3://<your-s3-bucket>/input.jsonl            # Replace with your S3 bucket name
  validation_data_s3_path: s3://<your-s3-bucket>/input.jsonl # Optional
  output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## training specific configs
training_config:
  max_length: 32768
  save_steps: 100000
  replicas: ${recipes.run.replicas}
  micro_batch_size: 1
  task_type: sft
  global_batch_size: 64
  weights_only: True
  allow_percentage_invalid_samples: 10

  exp_manager:
    exp_dir: null
    create_wandb_logger: False
    create_tensorboard_logger: True
    wandb_logger_kwargs:
      project: null
      name: null
    checkpoint_callback_params:
      monitor: step
      save_top_k: 10
      mode: max
      every_n_train_steps: ${recipes.training_config.save_steps}
      save_last: True
    create_early_stopping_callback: True
    early_stopping_callback_params:
      min_delta: 0.001
      mode: min
      monitor: "val_loss"
      patience: 2

  trainer:
    log_every_n_steps: 1
    max_epochs: -1
    max_steps: 16
    limit_test_batches: 0
    gradient_clip_val: 1.0
    num_nodes: ${recipes.training_config.replicas}

  model:
    hidden_dropout: 0.0 # Dropout probability for hidden state transformer.
    attention_dropout: 0.0 # Dropout probability in the attention layer.
    ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
    sequence_parallel: True
    optim:
      lr: 1e-5
      name: distributed_fused_adam
      bucket_cap_mb: 10
      contiguous_grad_buffer: False
      overlap_param_sync: False
      contiguous_param_buffer: False
      overlap_grad_sync: False
      adam_w_mode: true
      eps: 1e-06
      weight_decay: 0.0
      betas:
        - 0.9
        - 0.999
      sched:
        name: CosineAnnealing
        warmup_steps: 10
        constant_steps: 0
        min_lr: 1e-6

    mm_cfg:
      llm:
        freeze: false
      image_projector:
        freeze: true
        require_newline: true
      video_projector:
        freeze: true
        require_newline: false

    peft:
      peft_scheme: null

    training_validation:
      loader:
        args:
          data_loader_workers: 1
          prefetch_factor: 2
      collator:
        args:
          force_image_at_turn_beginning: false
```
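Before submitting a job, a quick pre-flight check of the recipe's `run` section against the allowed replica counts can catch misconfigurations early. The following is a hypothetical helper (not part of the HyperPod CLI) that checks a run configuration loaded as a Python dictionary:

```python
# Allowed replica counts from the parameter reference above
ALLOWED_REPLICAS = {
    "amazon.nova-micro-v1:0:128k": {2, 4, 8},
    "amazon.nova-lite-v1:0:300k": {4, 8, 16, 32},
    "amazon.nova-pro-v1:0:300k": {6, 12, 24},
}

def check_run_config(run: dict) -> list[str]:
    """Return a list of problems found in the recipe's 'run' section."""
    problems = []
    model = run.get("model_type")
    if model not in ALLOWED_REPLICAS:
        problems.append(f"unknown model_type: {model!r}")
    elif run.get("replicas") not in ALLOWED_REPLICAS[model]:
        problems.append(f"replicas must be one of {sorted(ALLOWED_REPLICAS[model])}")
    for key in ("name", "data_s3_path", "output_s3_path"):
        if not run.get(key):
            problems.append(f"missing required field: {key}")
    return problems
```

An empty return value means the checked fields are consistent with the documented constraints; anything else names the field to fix before you start the job.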

## Sample full-rank recipe
<a name="nova-sft-1-sample-recipe"></a>

The following is a sample full-rank recipe for SFT with all components properly configured.

```
## Run config
run:
    name: "my-sft-run"              # A descriptive name for your training job
    model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification
    model_name_or_path: "nova-lite/prod"      # Base model path
    replicas: 4                     # Number of compute instances for training
    data_s3_path: s3://<your-s3-bucket>/input.jsonl            # Replace with your S3 bucket name
    validation_data_s3_path: s3://<your-s3-bucket>/input.jsonl # Optional
    output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
    max_length: 32768               # Maximum context window size (tokens)

    trainer:
        max_epochs: 2               # Number of training epochs

    model:
        hidden_dropout: 0.0          # Dropout for hidden states
        attention_dropout: 0.0       # Dropout for attention weights
        ffn_dropout: 0.0             # Dropout for feed-forward networks

        optim:
            lr: 1e-5                 # Learning rate
            name: distributed_fused_adam  # Optimizer algorithm
            adam_w_mode: true        # Enable AdamW mode
            eps: 1e-06               # Epsilon for numerical stability
            weight_decay: 0.0        # L2 regularization strength
            betas:                   # Adam optimizer betas
                - 0.9
                - 0.999
            sched:
                warmup_steps: 10     # Learning rate warmup steps
                constant_steps: 0    # Steps at constant learning rate
                min_lr: 1e-6         # Minimum learning rate

        peft:
            peft_scheme: null        # Set to null for full-parameter fine-tuning
```

## Limitations
<a name="nova-sft-1-limitations"></a>

Publishing metrics to Weights & Biases is not supported.

To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

## Parameter-efficient fine-tuning (PEFT)
<a name="nova-fine-tune-peft"></a>

Parameter-efficient fine-tuning (PEFT) involves retraining a small number of additional weights to adapt a foundation model to new tasks or domains. Specifically, low-rank adapter (LoRA) PEFT efficiently fine-tunes foundation models by introducing low-rank trainable weight matrices into specific model layers, reducing the number of trainable parameters while maintaining model quality.

A LoRA PEFT adapter augments the base foundation model by incorporating lightweight adapter layers that modify the model’s weights during inference while keeping the original model parameters intact. This approach is also considered one of the most cost-effective fine-tuning techniques. For more information, see [Fine-tune models with adapter inference components](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-adapt.html).

You should use LoRA PEFT in the following scenarios:
+ You want to start with a fast training procedure.
+ The base model's performance is already satisfactory. In this case, the goal of LoRA PEFT is to enhance its capabilities across multiple related tasks, such as text summarization or language translation. LoRA PEFT's regularization properties help prevent overfitting and mitigate the risks of the model "forgetting" the source domain. This ensures the model remains versatile and adaptable to various applications.
+ You want to perform instruction fine-tuning scenarios with relatively small datasets. LoRA PEFT performs better with smaller, task-specific datasets than broader, larger datasets.
+ You have large, labeled datasets that exceed the Amazon Bedrock customization data limits. In this case, you can use LoRA PEFT on SageMaker AI to generate better results.
+ If you have already achieved promising results through Amazon Bedrock fine-tuning, LoRA PEFT in SageMaker AI can help further optimize the model hyperparameters.

The Amazon Nova parameters that are available for tuning with LoRA PEFT include:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model to use for your training. Select the model to use. The available options are `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model you use. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `output_s3_path`: The S3 location where the manifest and TensorBoard logs are stored. All of the S3 locations within the provided S3 path must be in the same account and Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 65,536 tokens for LoRA PEFT.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset. You can set either `max_steps` or `max_epochs`, but we do not recommend setting both. The maximum value is 5.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value (up to approximately 0.2) to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6-1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01-0.1) increase regularization.
  + `warmup_steps`: The number of steps to gradually increase learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but must be less than learning rate.
+ **LoRA configuration parameters**
  + `peft_scheme`: Set to `lora` to enable low-rank adaptation. 
  + `alpha`: The scaling factor for LoRA weights. This is typically set to the same value as `adapter_dim`.
  + `adapter_dropout`: The regularization parameter for LoRA.

### PEFT recipe
<a name="nova-sft-1-peft-recipe"></a>

The following is a recipe for LoRA PEFT.

```
## Run config
run:
    name: "my-lora-run"             # A descriptive name for your training job
    model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification
    model_name_or_path: "nova-lite/prod"      # Base model path
    replicas: 4                     # Number of compute instances for training
    output_s3_path: [S3_PATH_TO_STORE_MANIFEST]

## Training specific configs
training_config:
    max_length: 32768               # Maximum context window size (tokens)

    trainer:
        max_epochs: 2               # Number of training epochs

    model:
        hidden_dropout: 0.0          # Dropout for hidden states
        attention_dropout: 0.0       # Dropout for attention weights
        ffn_dropout: 0.0             # Dropout for feed-forward networks

        optim:
            lr: 1e-5                 # Learning rate
            name: distributed_fused_adam  # Optimizer algorithm
            adam_w_mode: true        # Enable AdamW mode
            eps: 1e-06               # Epsilon for numerical stability
            weight_decay: 0.0        # L2 regularization strength
            betas:                   # Adam optimizer betas
                - 0.9
                - 0.999
            sched:
                warmup_steps: 10     # Learning rate warmup steps
                constant_steps: 0    # Steps at constant learning rate
                min_lr: 1e-6         # Minimum learning rate

        peft:
            peft_scheme: "lora"      # Enable LoRA for parameter-efficient fine-tuning
            lora_tuning:
                loraplus_lr_ratio: 8.0  # LoRA+ learning rate scaling factor
                alpha: 32            # Scaling factor for LoRA weights
                adapter_dropout: 0.01  # Regularization for LoRA parameters
```
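Conceptually, LoRA leaves the base weight matrix frozen and adds a trainable low-rank product scaled by `alpha`. The following NumPy sketch illustrates how `alpha` and the adapter rank interact (illustrative of the parameter roles only; the service's internal implementation may differ):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=32):
    """y = x @ (W + (alpha / r) * A @ B), with W frozen and only A, B trained.

    r (the adapter rank, corresponding to `adapter_dim`) is the inner
    dimension of A and B; alpha / r scales the low-rank update.
    """
    r = A.shape[1]
    return x @ (W + (alpha / r) * (A @ B))

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))            # frozen base weights
A = rng.normal(size=(d, r)) * 0.01     # trainable down-projection
B = np.zeros((r, d))                   # trainable up-projection, zero-initialized
x = rng.normal(size=(1, d))

# With B zero-initialized, the adapter starts as a no-op on the base model:
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

Only `A` and `B` (a small fraction of the parameters of `W`) would be updated during training, which is why LoRA PEFT is faster and cheaper than full-rank fine-tuning.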

### Troubleshooting
<a name="nova-sft-1-troubleshooting"></a>

Use the following information to help resolve issues that you might encounter:
+ The input datasets for both training and validation must reside in customer-owned buckets, not in escrow or service-managed S3 buckets.
+ If you receive a Region not found error from the AWS CLI, resubmit the job with the AWS Region prepended to the `start-job` command. For example: `AWS_REGION=us-east-1 hyperpod start-job ...Job Parameters`.
+ To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

# Direct preference optimization (DPO)
<a name="nova-dpo"></a>

Direct preference optimization (DPO) is an efficient fine-tuning method for foundation models that uses paired comparison data to align model outputs with human preferences. This approach enables direct optimization of model behavior based on human feedback about which responses are more desirable.

Both full-rank DPO and low-rank adapter (LoRA) DPO are available.

**Data format requirements**  
For both full-rank and LoRA DPO, the training data format requirements are similar to SFT. However, for DPO, the final turn needs to have preference pairs. Here is an example of the DPO data format:

```
// N-1 turns same as SFT format
{
    "role": "assistant",
    "candidates": [
        {
            "content": [
                {
                    "text": "..."
                } // content list can contain multiple 'text' objects
            ],
            "preferenceLabel": "preferred"
        },
        {
            "content": [
                {
                    "text": "..."
                } // content list can contain multiple 'text' objects
            ],
            "preferenceLabel": "non-preferred"
        }
    ]
}
```

Here is another complete DPO text sample:

```
{
    "system": [
        {
            "text": "..."
        }
    ],
    "messages":[
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "assistant",
            "candidates": [
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "preferred"
                },
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "non-preferred"
                }
            ]
        }
    ]
}
```

Here is a complete DPO image sample:

```
{
    "system": [
        {
            "text": "..."
        }
    ],
    "messages":[
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                },
                {
                    "text": "..."
                },
                {
                    "image": {
                        "format": "jpeg",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-image.jpg",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                } // "content" can have multiple "text" and "image" objects.
                 // max image count is 10
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "text": "..."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "text": "..."
                },
                {
                    "text": "..."
                },
                {
                    "image": {
                        "format": "jpeg",
                        "source": {
                            "s3Location": {
                                "uri": "s3://your-bucket/your-path/your-image.jpg",
                                "bucketOwner": "your-aws-account-id"
                            }
                        }
                    }
                } // "content" can have multiple "text" and "image" objects.
                 // max image count is 10
            ]
        },
        {
            "role": "assistant",
            "candidates": [
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "preferred"
                },
                {
                    "content": [
                        {
                            "text": "..."
                        }
                    ],
                    "preferenceLabel": "non-preferred"
                }
            ]
        }
    ]
}
```
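A record's final turn can be checked for a well-formed preference pair with a short helper (a hypothetical illustration based on the format above, not an official validator):

```python
def valid_dpo_final_turn(message: dict) -> bool:
    """Check that an assistant turn holds exactly one 'preferred' and one
    'non-preferred' candidate, each with non-empty text content."""
    if message.get("role") != "assistant":
        return False
    candidates = message.get("candidates", [])
    labels = [c.get("preferenceLabel") for c in candidates]
    if sorted(labels) != ["non-preferred", "preferred"]:
        return False
    # Every candidate needs at least one content part with non-empty text
    return all(
        any(part.get("text") for part in c.get("content", []))
        for c in candidates
    )
```

Applying this to the last element of each record's `messages` list catches missing labels or empty candidates before the job is submitted.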

Other constraints on the input datasets apply. For more information, see [Dataset constraints](https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html#custom-fine-tune-constraints). We recommend that you include a minimum of 1,000 preference pairs for effective training. High-quality preference data leads to better results.

We recommend using DPO in the following scenarios:
+ Optimizing for subjective outputs that require alignment with specific human preferences.
+ Adjusting the model’s tone, style, or content characteristics to match desired response patterns.
+ Making targeted improvements to an existing model based on user feedback and error analysis.
+ Maintaining consistent output quality across different use cases.
+ Implementing safety guardrails through preferred response patterns.
+ Training with reward-free reinforcement learning.
+ Using only preference data instead of graded or labeled data.
+ Improving the model in nuanced alignment tasks, such as helpfulness, harmlessness, or honesty.

## Full-rank DPO
<a name="customize-fine-tune-hyperpod-dpo-fr"></a>

The Amazon Nova parameters that are available for full-rank DPO are as follows:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model. Select the model to use from `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model chosen. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
  + `data_s3_path`: The S3 location of the training dataset, which is a JSONL file. This file must reside in the same account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
  + `validation_data_s3_path`: The S3 location of the validation dataset, which is a JSONL file. This file must reside in the same AWS account and Region as the cluster. All of the S3 locations provided must be in the same account and Region.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32,768 tokens for DPO.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value by approximately 0.0-0.2 to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
  + `warmup_steps`: The number of steps over which to gradually increase the learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but it must be less than the learning rate.
+ **DPO configuration**
  + `beta`: Determines how closely the model should fit the training data versus the original model. Valid values are between 0.001 and 0.5, inclusive.

    Specify larger values (for example, 0.5) to preserve more of the reference model behavior while more slowly learning new preferences. Specify smaller values (for example, 0.01-0.05) to more quickly learn new preferences at the risk of diverging from the reference model behavior.
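The trade-off that `beta` controls comes from the standard DPO objective in the literature. The following scalar sketch is illustrative only (it is not the service's implementation); it shows how `beta` scales the margin between the policy's and the frozen reference model's log-probabilities for a preference pair:

```python
import math

def dpo_loss(beta, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Standard DPO loss for a single preference pair (scalar sketch).

    beta scales the log-probability margin between the policy being
    trained and the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)): low when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# For the same preference margin, a small beta leaves the loss near log(2),
# so the policy must move further from the reference model to reduce it.
print(dpo_loss(0.5, -1.0, -3.0, -2.0, -2.5))
print(dpo_loss(0.01, -1.0, -3.0, -2.0, -2.5))
```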

**Full-rank DPO recipe**  
The following is a full-rank recipe for DPO.

```
## Run config
run:
  name: "my-dpo-micro-job"             # A descriptive name for your training job
  model_type: "amazon.nova-micro-v1:0:128k"  # Model variant specification, do not change
  model_name_or_path: "nova-micro/prod"      # Base model path, do not change
  replicas: 2                     # Number of compute instances for training, allowed values are 2, 4, 8
  data_s3_path: s3://<replace with your S3 bucket name>/input.jsonl
  validation_data_s3_path: s3://<your S3 bucket name>/input.jsonl  # Optional
  output_s3_path: <S3 path to store the manifest>

## Training specific configs
training_config:
  max_length: 32768               # Maximum context window size (tokens).
  global_batch_size: 64           # Global batch size, allowed values are 16, 32, 64.

  trainer:
    max_epochs: 2                # Number of training epochs

  model:
    hidden_dropout: 0.0          # Dropout for hidden states, must be between 0.0 and 1.0
    attention_dropout: 0.0       # Dropout for attention weights, must be between 0.0 and 1.0
    ffn_dropout: 0.0             # Dropout for feed-forward networks, must be between 0.0 and 1.0

    optim:
      lr: 1e-5                 # Learning rate
      name: distributed_fused_adam  # Optimizer algorithm, do not change
      adam_w_mode: true        # Enable AdamW mode
      eps: 1e-06               # Epsilon for numerical stability
      weight_decay: 0.0        # L2 regularization strength, must be between 0.0 and 1.0
      betas:                   # Adam optimizer betas, must be between 0.0 and 1.0
        - 0.9
        - 0.999
      sched:
        warmup_steps: 10     # Learning rate warmup steps
        constant_steps: 0    # Steps at constant learning rate
        min_lr: 1e-6         # Minimum learning rate, must be lower than lr

    dpo_cfg:
        beta: 0.1               # Strength of preference enforcement. Limits: [0.001, 0.5]

    peft:
        peft_scheme: null        # Disable LoRA and trigger full-rank fine-tuning
```
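Several of the limits above are easy to violate by accident. The following hypothetical pre-flight check (plain Python over a dict that mirrors the recipe structure; the service performs its own validation) illustrates the documented constraints for a Nova Micro full-rank DPO job:

```python
def validate_dpo_recipe(cfg):
    """Check a few of the documented full-rank DPO limits.

    cfg is a plain dict mirroring the recipe structure; this is an
    illustrative client-side check, not the service's validator."""
    errors = []
    if cfg["run"]["replicas"] not in (2, 4, 8):  # Nova Micro values
        errors.append("replicas must be 2, 4, or 8 for Nova Micro")
    tc = cfg["training_config"]
    if not (1024 <= tc["max_length"] <= 32768):
        errors.append("max_length must be at most 32,768 tokens for DPO")
    optim = tc["model"]["optim"]
    if not (0 < optim["lr"] <= 1):
        errors.append("lr must be between 0 and 1")
    if optim["sched"]["min_lr"] >= optim["lr"]:
        errors.append("min_lr must be less than lr")
    if not (0.001 <= tc["model"]["dpo_cfg"]["beta"] <= 0.5):
        errors.append("beta must be between 0.001 and 0.5")
    return errors

recipe = {
    "run": {"replicas": 2},
    "training_config": {
        "max_length": 32768,
        "model": {
            "optim": {"lr": 1e-5, "sched": {"min_lr": 1e-6}},
            "dpo_cfg": {"beta": 0.1},
        },
    },
}
print(validate_dpo_recipe(recipe))  # [] when all limits are satisfied
```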

## Low-rank adapter DPO
<a name="customize-fine-tune-hyperpod-dpo-lora"></a>

The Amazon Nova parameters that are available for low-rank adapter DPO are as follows:
+ **Run configuration**
  + `name`: A descriptive name for your training job. This helps identify your job in the AWS Management Console.
  + `model_type`: The Nova model variant to use. The available options are `amazon.nova-micro-v1:0:128k`, `amazon.nova-lite-v1:0:300k`, or `amazon.nova-pro-v1:0:300k`.
  + `model_name_or_path`: The path to the base model. Select the model to use from `nova-micro/prod`, `nova-lite/prod`, `nova-pro/prod`, or the S3 path for the post-training checkpoint (`s3://<escrow bucket>/<job id>/outputs/checkpoints`).
  + `replicas`: The number of compute instances to use for distributed training. Available values vary based on the model chosen. Amazon Nova Micro supports 2, 4, or 8 replicas. Amazon Nova Lite supports 4, 8, 16, or 32 replicas. Amazon Nova Pro supports 6, 12, or 24 replicas.
+ **Training configuration**
  + `max_length`: The maximum sequence length in tokens. This determines the context window size for training. The maximum supported value is 32,768 tokens for DPO.

    Longer sequences improve training efficiency at the cost of increased memory requirements. We recommend that you match the `max_length` parameter to your data distribution.
+ **Trainer settings**
  + `max_epochs`: The number of complete passes through your training dataset.

    In general, larger datasets require fewer epochs to converge, while smaller datasets require more epochs to converge. We recommend that you adjust the number of epochs based on the size of your data.
+ **Model settings**
  + `hidden_dropout`: The probability of dropping hidden state outputs. Increase this value by approximately 0.0-0.2 to reduce overfitting on smaller datasets. Valid values are between 0-1, inclusive.
  + `attention_dropout`: The probability of dropping attention weights. This parameter can help with generalization. Valid values are between 0-1, inclusive.
  + `ffn_dropout`: The probability of dropping feed-forward network outputs. Valid values are between 0-1, inclusive.
+ **Optimizer configuration**
  + `lr`: The learning rate, which controls the step size during optimization. We recommend values between 1e-6 and 1e-4 for good performance. Valid values are between 0-1, inclusive.
  + `name`: The optimizer algorithm. Currently, only `distributed_fused_adam` is supported.
  + `weight_decay`: The L2 regularization strength. Higher values (between 0.01 and 0.1) increase regularization.
  + `warmup_steps`: The number of steps over which to gradually increase the learning rate. This improves training stability. Valid values are between 1-20, inclusive.
  + `min_lr`: The minimum learning rate at the end of decay. Valid values are between 0-1, inclusive, but it must be less than the learning rate.
+ **DPO configuration**
  + `beta`: Determines how closely the model should fit the training data versus the original model. Valid values are between 0.001 and 0.5, inclusive.

    Specify larger values (for example, 0.5) to preserve more of the reference model behavior while more slowly learning new preferences. Specify smaller values (for example, 0.01-0.05) to more quickly learn new preferences at the risk of diverging from the reference model behavior.
+ **LoRA configuration parameters**
  + `peft_scheme`: Set to `lora` to enable Low-Rank Adaptation, which generates a more efficient, smaller output model. These LoRA-specific properties are also available:
    + `alpha`: The scaling factor for LoRA weights. This is typically set to the same value as `adapter_dim`.
    + `adapter_dropout`: The regularization parameter for the LoRA parameters.
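The relationship between `alpha` and the adapter dimension follows the standard LoRA formulation from the literature: the low-rank update is scaled by `alpha / adapter_dim` before being added to the base weights. A scalar sketch (illustrative, not the training stack's code):

```python
def lora_delta(base_weight, lora_a, lora_b, alpha, adapter_dim):
    """Effective weight after applying a LoRA adapter (scalar sketch).

    The low-rank product is scaled by alpha / adapter_dim, which is why
    alpha is typically set close to the adapter dimension."""
    scaling = alpha / adapter_dim
    return base_weight + scaling * (lora_b * lora_a)

# With alpha == adapter_dim, the adapter contributes at unit scale.
print(lora_delta(1.0, 0.5, 0.2, alpha=64, adapter_dim=64))
```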

**LoRA DPO recipe**  
The following is a recipe for LoRA DPO.

```
## Run config
run:
    name: "my-lora-run"             # A descriptive name for your training job
    model_type: "amazon.nova-lite-v1:0:300k"  # Model variant specification, do not change
    model_name_or_path: "nova-lite/prod"      # Base model path, do not change
    replicas: 4                     # Number of compute instances for training. All supported values: {4, 8, 16}
    data_s3_path: s3://<replace with your S3 bucket name>/input.jsonl
    validation_data_s3_path: s3://<your S3 bucket name>/input.jsonl  # Optional
    output_s3_path: <S3 path to store the manifest>

## Training specific configs
training_config:
    max_length: 16384               # Maximum context window size (tokens). Should be between [1024, 32768] and multiple of 1024.
                                    # Note: Image datasets for DPO are limited to 20k samples and a 16384 max_length
    global_batch_size: 64           # Total samples per step. Limits: {16, 32, 64, 128, 256}

    trainer:
        max_epochs: 2               # Number of training epochs

    model:
        hidden_dropout: 0.0          # Dropout for hidden states. Limits: [0.0, 1.0]
        attention_dropout: 0.0       # Dropout for attention weights. Limits: [0.0, 1.0]
        ffn_dropout: 0.0             # Dropout for feed-forward networks. Limits: [0.0, 1.0]

        optim:
            lr: 1e-5                 # Learning rate
            name: distributed_fused_adam  # Optimizer algorithm, do not change
            adam_w_mode: true        # Enable AdamW mode
            eps: 1e-08               # Epsilon for numerical stability
            weight_decay: 0.01       # L2 regularization strength
            betas:                   # Adam optimizer betas. Limits: [0.0, 1.0]
                - 0.9
                - 0.999
            sched:
                warmup_steps: 10     # Learning rate warmup steps
                constant_steps: 0    # Steps at constant learning rate
                min_lr: 1e-6         # Minimum learning rate

        dpo_cfg:
            beta: 0.01               # Strength of preference enforcement. Limits: [0.001, 0.5]

        peft:
            peft_scheme: "lora"      # Enable LoRA for parameter-efficient fine-tuning
            lora_tuning:
                loraplus_lr_ratio: 20.0  # LoRA+ learning rate scaling factor. Limits: [0.0, 100.0]
                alpha: 64            # Scaling factor for LoRA weights. [32, 64, 96, 128, 160, 192]
                adapter_dropout: 0.01  # Regularization for LoRA parameters. Limits: [0.0, 1.0]
```

**Limitations**  
DPO has the following limitations:
+ Intermediate checkpoints are not saved for evaluation and you can't resume from an intermediate checkpoint. Only the last checkpoint is saved.
+ To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

# Proximal policy optimization (PPO)
<a name="nova-ppo"></a>

Proximal policy optimization (PPO) is the process of using several machine learning models to train and score a model. The following models are part of the PPO process:
+ **Actor train or policy model**: A supervised fine-tuning (SFT) model that gets fine-tuned and updated every epoch. The updates are made by sampling prompts, generating completions, and updating weights using a clipped-surrogate objective. This limits the per-token log-probability change so that each policy step is *proximal* to the previous one, preserving training stability.
+ **Actor generation model**: A model that generates prompt completions or responses to be judged by the reward model and critic model. The weights of this model are updated from the actor train or policy model each epoch.
+ **Reward model**: A model with frozen weights that's used to score the actor generation model.
+ **Critic model**: A model with unfrozen weights that's used to score the actor generation model. This score is often viewed as an estimate of the total reward the actor receives when generating the remaining tokens.
+ **Anchor model**: An SFT model with frozen weights that is used to calculate the KL divergence between the actor train model and the base model. The anchor model ensures that the updates to the actor model are not too drastic compared to the base model. Drastic changes can lead to instability or performance degradation.

The training data must be in JSONL format where each line contains a single JSON object that represents a training example. Here is an example:

```
{
    "turns": ["string", "string", ...], // Required
    "turns_to_mask": [integer, integer, ...], // Required
    "reward_category": "string", // Required
    "meta_data": {} // Optional
}
```
+ `turns` is an array of conversation strings that represents the dialog sequence. The array contains system prompts, user messages, and bot responses. User messages typically end with "Bot: " to indicate where the model output begins. For example, `["System prompt", "User: Question Bot:", "Bot response"]`.
+ `turns_to_mask` is an array of 0-based indices that identify which turns should not receive gradient updates. The masked turns are typically system prompts and user turns. For example, `[0, 1, 3]` masks the system prompt (index 0) and the user messages (indices 1 and 3).
+ `reward_category` is a string that identifies what aspects of model performance to evaluate. It's used to select the appropriate reward model category during training. The supported reward categories are `default`, `math`, `coding`, `if`, `rag`, and `rai`.
+ `meta_data` is an optional object that contains additional contextual or ground-truth information. This can include identifiers, source information, or conversation context. The structure is flexible based on your dataset needs.

Here is an example record:

```
{
    "turns": ["You are a helpful AI assistant.",
        "User: What is ML? Bot:",
        "Machine learning is...", "User: Examples? Bot:",
        "Email spam filtering is..."
    ],
    "turns_to_mask": [0, 1, 3],
    "reward_category": "default",
    "meta_data": {
        "messages": [{
                "role": "system",
                "content": "You are a helpful AI assistant."
            },
            {
                "role": "user",
                "content": "What is ML?"
            },
            {
                "role": "assistant",
                "content": "Machine learning is..."
            },
            {
                "role": "user",
                "content": "Examples?"
            },
            {
                "role": "assistant",
                "content": "Email spam filtering is..."
            }
        ]
    }
}
```
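Before launching a job, it can be worth validating each JSONL line against the schema above. This is a hypothetical client-side check (the service applies its own validation):

```python
import json

ALLOWED_CATEGORIES = {"default", "math", "coding", "if", "rag", "rai"}

def check_ppo_record(line):
    """Validate one JSONL line against the PPO data format described above.

    Returns a list of problems; an empty list means the record looks
    well-formed. Illustrative only, not the service's validator."""
    problems = []
    record = json.loads(line)
    turns = record.get("turns")
    if not isinstance(turns, list) or not all(isinstance(t, str) for t in turns):
        problems.append("turns must be an array of strings")
    mask = record.get("turns_to_mask")
    if not isinstance(mask, list) or any(
        not isinstance(i, int) or i < 0 or i >= len(turns or []) for i in mask
    ):
        problems.append("turns_to_mask must hold 0-based indices into turns")
    if record.get("reward_category") not in ALLOWED_CATEGORIES:
        problems.append("reward_category must be one of: " + ", ".join(sorted(ALLOWED_CATEGORIES)))
    return problems

line = '{"turns": ["You are helpful.", "User: Hi Bot:", "Hello!"], "turns_to_mask": [0, 1], "reward_category": "default"}'
print(check_ppo_record(line))  # [] for a well-formed record
```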

The reward modeling framework implements multi-dimensional optimization across distinct categorical objectives to facilitate robust model convergence. The reward category should be selected based on the task that the model must be optimized for. 

We recommend the following guidelines for selecting the right framework for your tasks:
+ `default`: A general purpose optimizer for standard conversational tasks and basic interactions. Used for general conversations and discussions, basic writing tasks, simple question answering, and non-specialized knowledge queries. 

  Here is an example:

  ```
  {
      "turns": ["Write a summary of climate change"],
      "turns_to_mask": [0],
      "reward_category": "default"
  }
  ```
+ `math`: A specialized optimizer for mathematical computations and numerical reasoning tasks. Used for mathematical problem-solving, arithmetic calculations, algebraic equations, geometric problems, and statistical analysis.

  Here is an example:

  ```
  {
      "turns": ["Calculate the derivative of x²"],
      "turns_to_mask": [0],
      "reward_category": "math"
  }
  ```
+ `coding`: A dedicated category for programming and software development-related queries. Used for code implementation, debugging assistance, algorithm design, technical documentation, and system architecture questions.

  Here is an example:

  ```
  {
      "turns": ["Write a function to check if a string is palindrome"],
      "turns_to_mask": [0],
      "reward_category": "coding"
  }
  ```
+ `if`: A category for tasks that require precise procedural execution and step-by-step guidance. Used for multi-step procedures, sequential instructions, complex task decomposition, and process documentation.

  Here is an example:

  ```
  {
      "turns": ["Provide steps to deploy a web application"],
      "turns_to_mask": [0],
      "reward_category": "if"
  }
  ```
+ `rag`: A reward category for tasks that require answering queries based specifically on retrieved contextual information. Used when responses must be derived directly from provided reference materials, synthesizing factual content without going beyond the scope of the retrieved information, so that answers are grounded in the supplied context rather than general knowledge.

  Here is an example:

  ```
  {
      "turns": ["The Synthesis Report integrates findings from all six IPCC assessment cycles, revealing that global surface temperature has increased 1.1°C from 1850-1900 to 2011-2020, with human activities unequivocally identified as the cause of this warming. Alarmingly, current policies put the world on track for 3.2°C warming by 2100. The document identifies 5 key climate system \"tipping points\" approaching and emphasizes that greenhouse gas emissions must decline 43% by 2030 (compared to 2019 levels) to limit warming to 1.5°C. Climate-related risks will escalate with every increment of warming, with loss and damage disproportionately affecting vulnerable populations. Despite some progress, climate adaptation remains uneven with significant gaps, and financial flows continue to fall below levels needed for mitigation goals.",
          "What were the key findings of the latest IPCC climate report?"],
      "turns_to_mask": [0, 1],
      "reward_category": "rag"
  }
  ```
+ `rai`: A reward category for tasks that require applying responsible AI principles such as fairness, transparency, and ethics. Used for evaluating potential biases in AI systems, ensuring privacy considerations, addressing ethical dilemmas, and promoting inclusive design principles.

  Here is an example:

  ```
  {
      "turns": ["Identify potential bias concerns when developing a loan approval algorithm and suggest mitigation strategies"],
      "turns_to_mask": [0],
      "reward_category": "rai"
  }
  ```

**Masking turns**  
In training datasets, the `turns_to_mask` parameter is crucial for controlling which conversation turns receive gradient updates during training. This array of indices determines which parts of the dialogue the model should learn to generate versus which parts should be treated as context only. Proper masking ensures the model learns appropriate response patterns while avoiding training on system prompts or user inputs that could degrade performance.

We recommend the following guidance for masking:
+ **Always mask index 0** - System prompts should never receive gradient updates.
+ **Always mask user turns** - Prevent the model from learning to generate user inputs.
+ **Pattern consistency** - Use identical masking patterns for similar conversation structures, such as `[0, 1, 3, 5]` for multi-turn dialogues.
+ **Selective training** - Mask early bot responses to focus training on improved final responses.
+ **Chain-of-thought preservation** - Only mask system and user turns when training on reasoning sequences.
+ **Quality filtering** - Mask low-quality assistant responses to prevent performance degradation.
+ **Context optimization** - Ensure masked turns don't remove essential context needed for subsequent responses.

The key to effective masking is monitoring training metrics and validation performance to identify whether your masking strategy preserves necessary context while focusing gradient updates on the desired model outputs.
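When your source data is in a role-tagged message format (as in the `meta_data.messages` example above), the first two rules can be applied mechanically. A minimal hypothetical helper:

```python
def derive_mask(roles):
    """Return the 0-based indices of turns that should NOT receive
    gradient updates: system prompts and user turns.

    A sketch of the masking guidance above, not a service API."""
    return [i for i, role in enumerate(roles) if role in ("system", "user")]

roles = ["system", "user", "assistant", "user", "assistant"]
print(derive_mask(roles))  # [0, 1, 3]
```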

**Enable KL-divergence loss**  
To enable KL-divergence loss, the anchor server must be enabled so that it can compute the divergence of the current policy from the original distribution. The KL loss type must be specified, and the coefficients must be set to a value other than zero. Higher coefficient values keep the model close to the original policy, which results in smaller changes to general performance. Lower coefficient values allow larger deviations from the original policy, which improves the target metrics but can degrade general performance.

```
ppo_anchor:
  max_length: 8192
  trainer:
    num_nodes: ${recipes.run.cm_replicas}
  model:
    global_batch_size: 32

ppo_actor_train:
  model:
    ######## Use KL in actor loss ########
    kl_loss_type: low_var_kl
    kl_loss_coeff: 0.1

    ######## Use KL in reward model ######
    kl_reward_penalty_coeff: 0.1
```
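For reference, the `low_var_kl` loss type corresponds to the low-variance KL estimator commonly used in RLHF implementations (sometimes called the k3 estimator). A scalar sketch under that assumption (the actual training stack may differ):

```python
import math

def low_var_kl(logp_policy, logp_anchor):
    """Per-token low-variance KL estimate (the 'k3' estimator).

    Unlike the naive sampled log-ratio, this estimate is non-negative
    for any single sample, which reduces gradient variance."""
    log_ratio = logp_anchor - logp_policy
    return (math.exp(log_ratio) - 1.0) - log_ratio

print(low_var_kl(-1.0, -1.0))  # 0.0 when the policy matches the anchor
print(low_var_kl(-1.2, -1.0))  # small positive value for a small deviation
```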

**Learning rate**  
The learning rate for the critic and policy models can be adjusted, with 3e-6 being the default balanced choice. Higher learning rates typically lead to training instabilities, which can be identified through KL divergence spikes and erratic policy behavior. Lower learning rates may cause convergence issues and slow learning, indicated by stagnant rewards and minimal policy updates. Regular monitoring of KL divergence, reward score, and value loss helps in determining whether to adjust the learning rate during training.

```
ppo_critic:
  model:
    optim:
      lr: 3e-6

ppo_actor_train:
  model:
    optim:
      lr: 3e-06
```

**Global batch size**  
Global batch size significantly impacts PPO performance in Amazon Nova, with larger batches generally improving training stability and gradient estimation while enabling more efficient parallel processing. However, very large batch sizes can lead to diminishing returns and may be constrained by available memory, requiring careful balance with learning rate and other hyperparameters.

```
ppo_actor_train:
  model:
    global_batch_size: 160
```

The Amazon Nova parameters that are available for tuning with PPO include:
+ **Run configuration**
  + `actor_train_replicas`: The number of compute instances to be used for the actor train model. Available values vary based on the model chosen. Amazon Nova Micro supports 1 or 2 replicas. Amazon Nova Lite supports 1, 2, or 4 replicas. Amazon Nova Pro supports 3, 6, or 12 replicas.
  + `rm_replicas`: The number of compute instances used for the reward model. We recommend that you use one replica for any model size.
  + `cm_replicas`: The number of compute instances used for the critic model. We recommend that you use one replica for any model size.
  + `actor_generation_replicas`: The number of compute instances used for the actor generation. Available values vary based on the model chosen. Amazon Nova Micro supports 1 replica. Amazon Nova Lite supports 1 or 2 replicas. Amazon Nova Pro supports 1 or 2 replicas.
  + `am_replicas`: The number of compute instances used for the anchor model. We recommend that you use one replica for any model size.
+ **Actor train configuration (policy config)**
  + `max_steps`: The maximum number of steps to fine-tune or train the actor train model. Here, one step is defined as a rollout, followed by training the actor train model with `global_batch_size` number of samples. One epoch consists of `global_batch_size * trajectory_buffer_scale` samples.

    The value chosen here will vary based on your use case and dataset complexity. We recommend starting with 65 epochs or 520 steps, which is the number of epochs multiplied by the value of the `trajectory_buffer_scale`. However, some tasks require a longer PPO training time to achieve the same performance.

    For PPO, the training metrics, such as saturating reward model score and average action length from the [ml-flow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server.html) console, can help in identifying the optimal points for evaluation.
  + `actor_model_max_length`: The maximum length of the input data that is sent to the actor generation component to generate completions.
  + `reward_model_max_length`: The maximum length of the input data that is sent to the reward server to score completions.
  + `trajectory_buffer_scale`: This buffer represents the number of rollouts generated using the old actor train (policy) model before updating the weights and generating the new rollouts. The supported values are 1, 2, 4, 8, and 16.

    If `trajectory_buffer_scale` is 1, then the training is on-policy. That means the rollouts are generated with the most updated model weights, but throughput suffers. If it's 16, then the model is slightly off-policy, but throughput is higher. We recommend starting with 8 for each model.
  + `kl_reward_penalty_coeff`: This is the KL divergence term that ensures updates are not too drastic and the policy does not drift from the base or SFT model.
  + `kl_loss_coeff`: This value controls how much the KL divergence penalty influences the overall training objective in PPO.
  + `kl_loss_type`: This value specifies how to compute the divergence between current and reference policy distributions. The `kl_loss_types` available are `kl` (Standard KL divergence), `mse` (Mean squared error), `abs` (Absolute difference between log probabilities), and `low_var_kl` (low-variance KL approximation).
  + `model.clip_ratio`: The actor clip ratio (ε) in PPO is a hyperparameter that limits how much the policy can change during each update.
  + `model.optim.lr`: The learning rate used for surrogate model loss training in the actor model. 
  + `model.lam`: Part of the advantage estimation process. Higher λ gives more weight to longer-term rewards but with higher variance, while a lower λ focuses more on immediate rewards with lower variance but more bias.
  + `model.ent_coeff`: Entropy loss in PPO encourages exploration by penalizing the policy when it becomes too deterministic (that is, always picking the same actions with high confidence).
+ **Reward model configuration**
  + `global_batch_size`: The batch size for scoring the completions using the reward model. If `ppo_actor_train.model.global_batch_size` is greater than `ppo_reward.model.global_batch_size`, they are processed in multiple batches. Note that `ppo_actor_train.model.global_batch_size % ppo_reward.model.global_batch_size` must equal 0.
  + `max_length`: The maximum context length of the reward model. This should be the same as `ppo_actor_train.model.max_length`.
+ **Critic model configuration**
  + `global_batch_size`: The batch size of the critic model value. The critic model will provide value estimates for each token in the responses provided by the actor model. The batch size is used for both inference and training.

    Note that `ppo_actor_train.model.global_batch_size % ppo_critic.model.global_batch_size` must equal 0 and `ppo_actor_train.model.global_batch_size * ppo_actor_train.model.trajectory_buffer_scale % ppo_critic.model.global_batch_size == 0`.
  + `max_length`: The maximum context length of the critic model. This should be the same as `ppo_actor_train.model.max_length`.
  + `model.optim.lr`: The learning rate used for surrogate model loss training in the actor model.
+ **Anchor model configuration**
  + `global_batch_size`: The batch size for generating the logp of the frozen SFT or anchor model. Note that `ppo_actor_train.model.global_batch_size % ppo_anchor.model.global_batch_size` must equal 0.
  + `max_length`: The maximum context length of the anchor model. This should be the same as `ppo_actor_train.model.max_length`.
+ **Actor generation model configuration**
  + `actor_model_max_length`: The maximum context length of the actor model generation component. This should be the same as `ppo_actor_train.model.max_length`.
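The divisibility constraints spread across the component configurations above are easy to get wrong. This hypothetical helper checks them together (the example values match the sample recipe that follows):

```python
def check_ppo_batch_sizes(actor_gbs, reward_gbs, critic_gbs, anchor_gbs,
                          trajectory_buffer_scale):
    """Verify the documented batch-size divisibility rules between PPO components."""
    problems = []
    if actor_gbs % reward_gbs != 0:
        problems.append("actor global_batch_size must be divisible by the reward model's")
    if actor_gbs % critic_gbs != 0 or (actor_gbs * trajectory_buffer_scale) % critic_gbs != 0:
        problems.append("actor global_batch_size (and its product with "
                        "trajectory_buffer_scale) must be divisible by the critic's")
    if actor_gbs % anchor_gbs != 0:
        problems.append("actor global_batch_size must be divisible by the anchor model's")
    return problems

# 160 is divisible by 16 everywhere, so no problems are reported.
print(check_ppo_batch_sizes(160, 16, 16, 16, trajectory_buffer_scale=8))  # []
```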

**PPO recipe**  
The following is a recipe for PPO.

```
## Run config
run:
  name: ndry-ppo-pro
  model_type: amazon.nova-pro-v1:0:300k
  model_name_or_path: nova-pro/prod
  data_s3_path: s3://testing/train.jsonl # Your training data S3 path

  actor_train_replicas: 6 # Actor train model replicas
  rm_replicas: 1 # Reward model replicas
  cm_replicas: 1 # Critic model replicas
  actor_generation_replicas: 2 # Actor generation model replicas
  am_replicas: 1 # Anchor model replicas

## Training config for each PPO component
ppo_reward:
  max_length: 8192 # model architecture max length
  trainer:
    num_nodes: ${recipes.run.rm_replicas}
  model:
    global_batch_size: 16

ppo_critic:
  max_length: 8192
  trainer:
    num_nodes: ${recipes.run.cm_replicas}
  model:
    global_batch_size: 16
    optim:
      lr: 3e-6
      name: distributed_fused_adam
      adam_w_mode: true
      eps: 1e-06
      weight_decay: 0.0
      betas:
        - 0.9
        - 0.999

ppo_anchor:
  max_length: 8192
  trainer:
    num_nodes: ${recipes.run.am_replicas}
  model:
    global_batch_size: 16

ppo_actor_generation:
  actor_model_max_length: 8192
  trainer:
    num_nodes: ${recipes.run.actor_generation_replicas}

ppo_actor_train:
  max_length: 8192
  max_steps: 520 # Stopping criterion: desired epoch count * trajectory_buffer_scale
  actor_model_max_length: 8192 # truncate input data to max length
  reward_model_max_length: 8192 # truncate input data to max length
  trajectory_buffer_scale: 8
  trainer:
    num_nodes: ${recipes.run.actor_train_replicas}
  model:
    global_batch_size: 160
    ent_coeff: 0
    clip_ratio: 0.2
    lam: 1
    kl_loss_coeff: 0.0
    kl_loss_type: low_var_kl
    kl_reward_penalty_coeff: 0.0
    hidden_dropout: 0.0 # Dropout probability for hidden state transformer.
    attention_dropout: 0.0 # Dropout probability in the attention layer.
    ffn_dropout: 0.0 # Dropout probability in the feed-forward layer.
    optim:
      lr: 3e-06
      name: distributed_fused_adam # Optimizer algorithm; only distributed_fused_adam is supported
      adam_w_mode: true
      eps: 1e-08
      weight_decay: 0.0
      betas:
        - 0.9
        - 0.999
```

**Limitations**  
PPO has the following limitations:
+ Intermediate checkpoints are not saved for evaluation and you can't resume from an intermediate checkpoint. Only the last checkpoint is saved.
+ Multimodal datasets aren't supported.
+ Training jobs aren't automatically stopped. You have to stop the job using the SageMaker HyperPod CLI.
+ Critic training metrics are not supported on TensorBoard.
+ To adjust the hyperparameters, follow the guidance in [Selecting hyperparameters](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-hyperparameters.html).

# Evaluating your trained model
<a name="nova-hp-evaluate"></a>

An evaluation recipe is a YAML configuration file that defines how your Amazon Nova model evaluation job is executed. With this recipe, you can assess the performance of a base or trained model against common benchmarks or your own custom datasets. Metrics can be stored in Amazon S3 or TensorBoard. The evaluation provides quantitative metrics that help you assess model performance across various tasks to determine if further customization is needed.

Model evaluation is an offline process, where models are tested against fixed benchmarks with predefined answers. They are not assessed in real-time or against live user interactions. For real-time evaluations, you can evaluate the model after it is deployed to Amazon Bedrock by calling the Amazon Bedrock runtime APIs.

**Important**  
The evaluation container only supports checkpoints produced by the same training platform. Checkpoints created with SageMaker HyperPod can only be evaluated using the SageMaker HyperPod evaluation workflow, and checkpoints created with SageMaker training jobs can only be evaluated using the SageMaker training jobs evaluation workflow. Attempting to evaluate a checkpoint from a different platform will result in failure.

**Topics**
+ [Available benchmark tasks](customize-fine-tune-evaluate-available-tasks.md)
+ [Understanding the recipe parameters](customize-fine-tune-evaluate-understand-modify.md)
+ [Evaluation recipe examples](customize-fine-tune-evaluate-recipe-examples.md)
+ [Starting an evaluation job](customize-fine-tune-evaluate-start-job.md)
+ [Accessing and analyzing evaluation results](customize-fine-tune-evaluate-access-results.md)

# Available benchmark tasks
<a name="customize-fine-tune-evaluate-available-tasks"></a>

A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker AI model evaluation feature for Amazon Nova. To access the code packages, see [sample-Nova-lighteval-custom-task](https://github.com/aws-samples/sample-Nova-lighteval-custom-task/).

Here is a list of the supported, available industry standard benchmarks. You can specify the following benchmarks in the `eval_task` parameter:


| Benchmark | Modality | Description | Metrics | Strategy | Subtask Available | 
| --- |--- |--- |--- |--- |--- |
| mmlu | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs\_cot | Yes | 
| mmlu\_pro | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs\_cot | No | 
| bbh | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | zs\_cot | Yes | 
| gpqa | Text | General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities. | accuracy | zs\_cot | No | 
| math | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact\_match | zs\_cot | Yes | 
| strong\_reject | Text | Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| IFEval | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 
| gen\_qa | Text | Custom Dataset Evaluation – Lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. | all | gen\_qa | No | 
| llm\_judge | Text | LLM-as-a-Judge Preference Comparison – Uses a Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A. | all | judge | No | 
| humaneval | Text | HumanEval – A benchmark dataset designed to evaluate the code generation capabilities of large language models. | pass@1 | zs | No | 
| mm\_llm\_judge | Multi-modal (image) | This benchmark behaves the same as the text-based `llm_judge` above. The only difference is that it supports image inference. | all | judge | No | 
| rubric\_llm\_judge | Text | Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts, Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions. | all | judge | No | 
| aime\_2024 | Text | AIME 2024 – American Invitational Mathematics Examination problems testing advanced mathematical reasoning and problem-solving. | exact\_match | zs\_cot | No | 
| calendar\_scheduling | Text | Natural Plan – Calendar Scheduling task testing planning abilities for scheduling meetings across multiple days and people. | exact\_match | fs | No | 

The following `mmlu` subtasks are available:

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

The following `bbh` subtasks are available:

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

The following `math` subtasks are available:

```
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]
```

# Understanding the recipe parameters
<a name="customize-fine-tune-evaluate-understand-modify"></a>

**Run configuration**  
The following is a general run configuration and an explanation of the parameters involved.

```
run:
  name: eval_job_name
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: nova-micro/prod
  replicas: 1
  data_s3_path: ""
  output_s3_path: s3://output_path
  mlflow_tracking_uri: ""
  mlflow_experiment_name : ""
  mlflow_run_name : ""
```
+ `name`: (Required) A descriptive name for your evaluation job. This helps identify your job in the AWS console.
+ `model_type`: (Required) Specifies the Amazon Nova model variant to use. Do not manually modify this field. Options include:
  + `amazon.nova-micro-v1:0:128k`
  + `amazon.nova-lite-v1:0:300k`
  + `amazon.nova-pro-v1:0:300k`
  + `amazon.nova-2-lite-v1:0:256k`
+ `model_name_or_path`: (Required) The path to the base model or S3 path for the post-trained checkpoint. Options include:
  + `nova-micro/prod`
  + `nova-lite/prod`
  + `nova-pro/prod`
  + `nova-lite-2/prod`
  + (S3 path for the post-trained checkpoint) `s3://<escrow bucket>/<job id>/outputs/checkpoints`
+ `replicas`: (Required) The number of compute instances to use for distributed training. You must set this value to 1 because multi-node is not supported.
+ `data_s3_path`: (Required) The S3 path to the input dataset. Leave this parameter empty unless you are using the *bring your own dataset* or *LLM as a judge* recipe.
+ `output_s3_path`: (Required) The S3 path to store output evaluation artifacts. Note that the output S3 bucket must be created by the same account that is creating the job.
+ `mlflow_tracking_uri`: (Optional) The MLflow tracking server ARN used for tracking MLflow runs and experiments. Ensure that the SageMaker AI execution role has permission to access the tracking server.

**Evaluation configuration**  
The following is a model evaluation configuration and an explanation of the parameters involved.

```
evaluation:
  task: mmlu
  strategy: zs_cot
  subtask: mathematics
  metric: accuracy
```
+ `task`: (Required) Specifies the evaluation benchmark or task to use.

  Supported task list:
  + mmlu
  + mmlu\_pro
  + bbh
  + gpqa
  + math
  + strong\_reject
  + gen\_qa
  + ifeval
  + llm\_judge
  + humaneval
  + mm\_llm\_judge
  + rubric\_llm\_judge
  + aime\_2024
  + calendar\_scheduling
+ `strategy`: (Required) Defines the evaluation approach:
  + zs\_cot: Zero-shot Chain-of-Thought - An approach to prompting large language models that encourages step-by-step reasoning without requiring explicit examples.
  + zs: Zero-shot - An approach to solving a problem without any prior training examples.
  + gen\_qa: A strategy specific to bring your own dataset recipes.
  + judge: A strategy specific to Amazon Nova LLM as a Judge and mm\_llm\_judge.
+ `subtask`: (Optional) Specifies a subtask for certain evaluation tasks. Remove this field from your recipe if your task does not have any subtasks.
+ `metric`: (Required) The evaluation metric to use.
  + accuracy: Percentage of correct answers
  + exact\_match: (For `math` benchmark) Returns the rate at which the input predicted strings exactly match their references.
  + deflection: (For `strong_reject` benchmark) Returns the relative deflection to the base model and the difference in significance metrics.
  + pass@1: (For `humaneval` benchmark) Measures the percentage of cases where the model's highest-confidence prediction matches the correct answer.
  + `all`: Returns the following metrics:
    + For the `gen_qa` bring your own dataset benchmark, the following metrics are returned:
      + `rouge1`: Measures the overlap of unigrams (single words) between generated and reference text.
      + `rouge2`: Measures the overlap of bigrams (two consecutive words) between generated and reference text.
      + `rougeL`: Measures the longest common subsequence between texts, allowing for gaps in the matching.
      + `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
      + `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
      + `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
      + `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
      + `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.
    + For the `llm_judge` and `mm_llm_judge` bring your own dataset benchmarks, the following metrics are returned:
      + `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
      + `a_scores_stderr`: Standard error of `response_A scores` across pairwise judgements.
      + `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
      + `b_scores_stderr`: Standard error of `response_B scores` across pairwise judgements.
      + `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
      + `ties_stderr`: Standard error of ties across pairwise judgements.
      + `inference_error`: Count of judgements that could not be properly evaluated.
      + `inference_error_stderr`: Standard error of inference errors across judgements.
      + `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
      + `score_stderr`: Standard error of the aggregate score across pairwise judgements.
      + `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using Bradley-Terry probability.
      + `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
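
The lenient text metrics above can be illustrated with a small sketch. This is not the service's implementation; the normalization rules used here (lowercasing, stripping punctuation, collapsing whitespace) are assumptions for demonstration only.

```python
# Illustrative sketch of quasi_exact_match and f1_score_quasi.
# Normalization rules are assumed, not the service's exact behavior.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def quasi_exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def f1_score_quasi(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall on normalized text."""
    pred_tokens = Counter(normalize(prediction).split())
    ref_tokens = Counter(normalize(reference).split())
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(quasi_exact_match("The answer is 42.", "the answer is 42"))  # 1
```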

**Inference configuration**  
The following is an inference configuration and an explanation of the parameters involved. All parameters are optional.

```
inference:
  max_new_tokens: 200
  top_k: -1
  top_p: 1.0
  temperature: 0
  top_logprobs: 10
  reasoning_effort: null  # options: low/high to enable reasoning or null to disable reasoning
```
+ `max_new_tokens`: The maximum number of tokens to generate. This must be an integer.
+ `top_k`: The number of highest probability tokens to consider. This must be an integer.
+ `top_p`: The cumulative probability threshold for token sampling. This must be a float between 0.0 and 1.0, inclusive.
+ `temperature`: Randomness in token selection. Larger values introduce more randomness. Use 0 to make the results deterministic. This value must be a float with a minimum value of 0.
+ `top_logprobs`: The number of top logprobs to be returned in the inference response. This value must be an integer from 0 to 20. Logprobs contain the considered output tokens and log probabilities of each output token returned in the message content.
+ `reasoning_effort`: Controls the reasoning behavior for reasoning-capable models. Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`). Available options are `null` (the default if not set; disables reasoning), `low`, or `high`.

Note that for `humaneval`, we recommend the following inference configuration:

```
inference:
  top_k: 1
  max_new_tokens: 1600
  temperature: 0.0
```

**MLFlow configuration**  
The following is an MLFlow configuration and an explanation of the parameters involved. All parameters are optional.

```
run:
  mlflow_tracking_uri: ""
  mlflow_experiment_name: ""
  mlflow_run_name: ""
```
+ `mlflow_tracking_uri`: (Optional) The location of the MLflow tracking server (only needed on SageMaker HyperPod)
+ `mlflow_experiment_name`: (Optional) Name of the experiment to group related ML runs together
+ `mlflow_run_name`: (Optional) Custom name for a specific training run within an experiment

# Evaluation recipe examples
<a name="customize-fine-tune-evaluate-recipe-examples"></a>

Amazon Nova provides the following types of evaluation recipes, which are available in the SageMaker HyperPod recipes GitHub repository.

## General text benchmark recipes
<a name="nova-model-hp-evaluation-config-example-text"></a>

These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format `xxx_general_text_benchmark_eval.yaml`.

## Bring your own dataset benchmark recipes
<a name="nova-model-hp-evaluation-config-byo"></a>

These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. They are provided in the format `xxx_bring_your_own_dataset_eval.yaml`. 

The following are the bring your own dataset requirements:
+ File format requirements
  + You must include a single `gen_qa.jsonl` file containing evaluation examples.
  + Your dataset must be uploaded to an S3 location where SageMaker training job can access it.
  + The file must follow the required schema format for a general Q&A dataset.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `query`: (Required) String containing the question or instruction that needs an answer
  + `response`: (Required) String containing the expected model output
  + `system`: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query
  + `metadata`: (Optional) String containing metadata associated with the entry for tagging purposes.

Here is a bring your own dataset example entry:

```
{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
```

To use your custom dataset, set the following required fields in your evaluation recipe, and do not change any of their values:

```
evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Context length limit: For each sample in the dataset, the context length (including system and query prompts) should be less than 3.5k.
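
A pre-flight check for these requirements can be sketched as follows. Note that measuring the context limit in characters is an assumption for illustration; verify how the service actually counts context length for your data.

```python
# Minimal sketch of a gen_qa.jsonl schema check. The character-based
# context limit is an assumption, not the service's documented behavior.
import json

REQUIRED = {"query", "response"}
OPTIONAL = {"system", "metadata"}

def validate_gen_qa(lines, max_context_chars=3500):
    """Raise ValueError on the first record that violates the schema."""
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)  # each line must be a standalone JSON object
        missing = REQUIRED - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        unknown = record.keys() - REQUIRED - OPTIONAL
        if unknown:
            raise ValueError(f"line {lineno}: unknown fields {unknown}")
        context = record.get("system", "") + record["query"]
        if len(context) >= max_context_chars:
            raise ValueError(f"line {lineno}: context exceeds limit")

# Usage: with open("gen_qa.jsonl") as f: validate_gen_qa(f)
```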

## Nova LLM as a Judge benchmark recipes
<a name="nova-model-evaluation-config-llm-judge"></a>

Amazon Nova LLM as a Judge is a model evaluation feature that enables customers to compare the quality of responses from one model to a baseline model response on a custom dataset. It takes in a dataset with prompts, baseline responses, and challenger responses, and uses a Nova Judge model to provide a winrate metric based on [Bradley-Terry probability](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) with pairwise comparisons.
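
As a rough intuition for the winrate metric, the two-player Bradley-Terry estimate can be sketched as below. The service's exact estimator, including its handling of ties and inference errors, is not documented here; this closed form is an assumption for illustration only.

```python
# Hypothetical sketch of a Bradley-Terry win probability from pairwise
# judge verdicts. With only two players, the estimate reduces to the
# fraction of comparisons won by B, counting ties as half a win each.
def bradley_terry_winrate(a_wins: int, b_wins: int, ties: int = 0) -> float:
    """Estimate P(response_B preferred over response_A)."""
    total = a_wins + b_wins + ties
    if total == 0:
        raise ValueError("no judgements to score")
    return (b_wins + 0.5 * ties) / total

# Example: B wins 60 of 100 pairwise judgements, with 10 ties.
print(round(bradley_terry_winrate(a_wins=30, b_wins=60, ties=10), 2))  # 0.65
```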

The recipes are provided in the format `xxx_llm_judge_eval.yaml`. 

The following are the LLM as a Judge requirements:
+ File format requirements
  + Include a single file named `llm_judge.jsonl` containing the evaluation examples.
  + Your dataset must be uploaded to an S3 location that [SageMaker AI SageMaker HyperPod RIG](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-cluster.html) can access.
  + The file must follow the required schema format for the `llm_judge.jsonl` dataset.
  + The input dataset should ensure all records are under 12k context length.
+ Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
  + `prompt`: (Required) A string containing the prompt for the generated response.
  + `response_A`: A string containing the baseline response.
  + `response_B`: A string containing the alternative response to be compared with the baseline response.

Here is an LLM as a Judge example entry:

```
{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
```

To use your custom dataset, set the following required fields in your evaluation recipe, and don't change any of their values:

```
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```

The following limitations apply:
+ Only one JSONL file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Amazon Nova Judge models are the same across all model family specifications (that is, Lite, Micro, and Pro).
+ Custom judge models are not supported at this time.
+ Context length limit: For each sample in the dataset, the context length (including system and query prompts) should be less than 7k.

## Nova LLM as a Judge for multi-modal (image) benchmark recipes
<a name="nova-model-hp-evaluation-mm-llm-judge"></a>

Nova LLM Judge for multi-modal (image), short for Nova MM LLM Judge, is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, then uses a Nova Judge model to provide a win rate metric based on [Bradley-Terry](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) probability through pairwise comparisons. Recipe format: `xxx_mm_llm_judge_eval.yaml`.

**Nova LLM dataset requirements**

File format: 
+ Include a single file named `mm_llm_judge.jsonl` containing the evaluation examples.
+ You must upload your dataset to an S3 location where SageMaker Training Jobs can access it.
+ The file must follow the required schema format for the `mm_llm_judge` dataset.
+ The input dataset should ensure all records are under 12k context length, excluding the images attribute.

Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.
+ Required fields. 

  `prompt`: String containing the prompt for the generated response.

  `images`: Array containing a list of objects with data attributes (values are Base64-encoded image strings).

  `response_A`: String containing the baseline response.

  `response_B`: String containing the alternative response to be compared with the baseline response.

Example entry

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.

```
{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat.",
}
{
  "prompt": "how many animals are in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one animal and the second has one animal"
}
```
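
A record like the ones above could be assembled as follows. This is an illustrative sketch; the `make_record` helper is hypothetical, and the placeholder bytes stand in for real JPEG data.

```python
# Sketch of building a single-line mm_llm_judge JSONL record with a
# Base64-encoded image, matching the example entries above.
import base64
import json

def make_record(prompt: str, image_bytes: bytes,
                response_a: str, response_b: str) -> str:
    """Return one JSONL line for the mm_llm_judge dataset (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    record = {
        "prompt": prompt,
        "images": [{"data": f"data:image/jpeg;Base64,{encoded}"}],
        "response_A": response_a,
        "response_B": response_b,
    }
    return json.dumps(record)  # json.dumps keeps the record on a single line

# Placeholder bytes (the JPEG magic number) stand in for a real image file.
line = make_record("what is in the image?", b"\xff\xd8\xff", "a dog.", "a cat.")
```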

To use your custom dataset, set the following required fields in your evaluation recipe, and don't change any of their values:

```
evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all
```

**Limitations**
+ Only one `.jsonl` file is allowed per evaluation.
+ The file must strictly follow the defined schema.
+ Nova MM Judge models only support image references.
+ Nova MM Judge models are the same across Amazon Nova Lite specifications.
+ Custom judge models are not currently supported.
+ Amazon S3 image URI is not supported.
+ The input dataset should ensure all records are under 12k context length, excluding the images attribute.

## Rubric Based Judge
<a name="nova-hp-evaluate-rubric-judge"></a>

Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/) that only provides preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:
+ **Dynamic criteria generation**: Automatically creates relevant evaluation dimensions based on the input prompt
+ **Weighted scoring**: Assigns importance weights to each criterion to reflect their relative significance
+ **Granular assessment**: Provides detailed scores on a binary (true/false) or scale (1-5) basis for each criterion
+ **Quality metrics**: Calculates continuous quality scores (0-1 scale) that quantify the magnitude of differences between responses

Example criterion generated by the model:

```
price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3
```

The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.
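
The weighted, normalized scoring described above can be sketched with simple arithmetic. The criterion names, weights, and the assumption that scale scores run 1-5 (normalized to 0-1, with binary criteria mapped to 0 or 1) are illustrative, not the service's documented formula.

```python
# Illustrative weighted-score arithmetic for rubric-based judging.
# Criterion structure and normalization rules are assumptions.
def weighted_score(criteria: dict) -> float:
    """criteria maps name -> {"type", "weight", "score"}; returns a 0-1 score."""
    total_weight = sum(c["weight"] for c in criteria.values())
    weighted = 0.0
    for c in criteria.values():
        if c["type"] == "binary":
            normalized = 1.0 if c["score"] else 0.0
        else:  # "scale": map a 1-5 score onto 0-1
            normalized = (c["score"] - 1) / 4
        weighted += c["weight"] * normalized
    return weighted / total_weight

# Hypothetical criteria for one response, echoing the example above.
criteria = {
    "price_validation": {"type": "scale", "weight": 0.3, "score": 5},
    "error_handling": {"type": "binary", "weight": 0.7, "score": True},
}
print(weighted_score(criteria))  # 1.0
```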

**Topics**
+ [Recipe configuration](#nova-hp-evaluate-rubric-judge-recipe)
+ [Input dataset format](#nova-hp-evaluate-rubric-judge-input)
+ [Evaluation output](#nova-hp-evaluate-rubric-judge-output)
+ [Reasoning model support](#nova-hp-evaluate-rubric-judge-reasoning)

### Recipe configuration
<a name="nova-hp-evaluate-rubric-judge-recipe"></a>

**Rubric Judge recipe**  
Enable Rubric Judge by setting `task: rubric_llm_judge` in your recipe:

```
run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Original LLM as a Judge recipe (for comparison)**  
The original judge model uses `task: llm_judge`:

```
run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

### Input dataset format
<a name="nova-hp-evaluate-rubric-judge-input"></a>

The input dataset format is identical to the [original judge model](https://aws.amazon.com/blogs/machine-learning/evaluating-generative-ai-models-with-amazon-nova-llm-as-a-judge-on-amazon-sagemaker-ai/):

**Required fields:**
+ `prompt`: String containing the input prompt and instructions
+ `response_A`: String containing the baseline model output
+ `response_B`: String containing the customized model output

**Example dataset (JSONL format):**

```
{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}
```

**Format requirements:**
+ Each entry must be a single-line JSON object
+ Separate entries with newlines
+ Follow the exact field naming as shown in examples

### Evaluation output
<a name="nova-hp-evaluate-rubric-judge-output"></a>

**Output structure**  
Rubric Judge produces enhanced evaluation metrics compared to the original judge model:

```
{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}
```
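Once downloaded, a results file with this structure can be inspected with a few lines of Python; a minimal sketch (the filename in the usage comment is a placeholder, and the task key matches the structure shown above):

```python
import json

def load_rubric_metrics(path, task="custom|rubric_llm_judge_judge|0"):
    """Return the metrics dict for the given task from a results_[timestamp].json file."""
    with open(path) as f:
        return json.load(f)["results"][task]

# Example usage (the filename is a placeholder for your downloaded results file):
# metrics = load_rubric_metrics("results_2025-01-01T00-00-00.json")
# print(f"winrate={metrics['winrate']:.3f}  score_margin={metrics['score_margin']:+.3f}")
```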

**New metrics in Rubric Judge**  
The following six metrics are unique to Rubric Judge and provide granular quality assessment:


| Metric | Description | 
| --- |--- |
| weighted\_score\_A | Average normalized quality score for response A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| weighted\_score\_A\_stderr | Standard error of the mean for weighted\_score\_A, indicating statistical uncertainty | 
| weighted\_score\_B | Average normalized quality score for response B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality) | 
| weighted\_score\_B\_stderr | Standard error of the mean for weighted\_score\_B, indicating statistical uncertainty | 
| score\_margin | Difference between weighted scores (calculated as weighted\_score\_A - weighted\_score\_B). Range: -1.0 to 1.0. Positive = response A is better; negative = response B is better; near zero = similar quality | 
| score\_margin\_stderr | Standard error of the mean for score\_margin, indicating uncertainty in the quality difference measurement | 

**Understanding weighted score metrics**  
**Purpose**: Weighted scores provide continuous quality measurements that complement binary preference verdicts, enabling deeper insights into model performance.

**Key differences from original judge**:
+ **Original judge**: Only outputs discrete preferences (A>B, B>A, A=B)
+ **Rubric Judge**: Outputs both preferences AND continuous quality scores (0-1 scale) based on custom criteria

**Interpreting `score_margin`**:
+ `score_margin = -0.128`: Response B scored 12.8 percentage points higher than response A
+ `|score_margin| < 0.1`: Narrow quality difference (close decision)
+ `|score_margin| > 0.2`: Clear quality difference (confident decision)
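The interpretation guidance above can be expressed as a small helper function; the following sketch simply restates the listed thresholds (they are reading conventions, not values returned by the service):

```python
def interpret_margin(score_margin):
    """Classify a score_margin value per the interpretation guidance above."""
    better = "A" if score_margin > 0 else "B" if score_margin < 0 else "tie"
    gap = abs(score_margin)
    if gap < 0.1:
        strength = "narrow quality difference (close decision)"
    elif gap > 0.2:
        strength = "clear quality difference (confident decision)"
    else:
        strength = "moderate quality difference"
    return better, strength

# Example: score_margin = -0.128 means response B is ahead by 12.8 points.
print(interpret_margin(-0.128))
```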

**Use cases**:
+ **Model improvement**: Identify specific areas where your model underperforms
+ **Quality quantification**: Measure the magnitude of performance gaps, not just win/loss ratios
+ **Confidence assessment**: Distinguish between close decisions and clear quality differences

**Important**  
Final verdicts are still based on the judge model's explicit preference labels to preserve holistic reasoning and ensure proper position bias mitigation through forward/backward evaluation. Weighted scores serve as observability tools, not as replacements for the primary verdict.

**Calculation methodology**  
Weighted scores are computed through the following process:
+ **Extract criterion data**: Parse the judge's YAML output to extract criterion scores and weights
+ **Normalize scores**:
  + Scale-type criteria (1-5): Normalize to 0-1 by calculating `(score - 1) / 4`
  + Binary criteria (true/false): Convert to 1.0/0.0
+ **Apply weights**: Multiply each normalized score by its criterion weight
+ **Aggregate**: Sum all weighted scores for each response
+ **Calculate margin**: Compute `score_margin = weighted_score_A - weighted_score_B`

**Example**: If response A has a weighted sum of 0.65 and response B has 0.78, the `score_margin` would be -0.13, indicating response B is 13 percentage points higher in quality across all weighted criteria.
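The normalization and aggregation steps above can be sketched in a few lines of Python; the criteria names, scores, and weights below are hypothetical examples, not actual judge output:

```python
# Sketch of the calculation methodology described above.

def normalize(score, kind):
    """Normalize a criterion score to the 0-1 range."""
    if kind == "scale":                   # 1-5 scale criteria
        return (score - 1) / 4
    return 1.0 if score else 0.0          # binary (true/false) criteria

def weighted_score(criteria):
    """Sum each criterion's weight multiplied by its normalized score."""
    return sum(c["weight"] * normalize(c["score"], c["kind"]) for c in criteria)

criteria_a = [
    {"kind": "scale", "score": 4, "weight": 0.6},       # e.g., accuracy: 4/5
    {"kind": "binary", "score": False, "weight": 0.4},  # e.g., cites sources: no
]
criteria_b = [
    {"kind": "scale", "score": 5, "weight": 0.6},
    {"kind": "binary", "score": True, "weight": 0.4},
]

score_a = weighted_score(criteria_a)   # 0.6 * 0.75 + 0.4 * 0.0 = 0.45
score_b = weighted_score(criteria_b)   # 0.6 * 1.00 + 0.4 * 1.0 = 1.00
margin = score_a - score_b             # -0.55: response B is clearly better
```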

### Reasoning model support
<a name="nova-hp-evaluate-rubric-judge-reasoning"></a>

Reasoning model support enables evaluation with reasoning-capable Nova models that perform explicit internal reasoning before generating final responses. This feature uses API-level control via the `reasoning_effort` parameter to dynamically enable or disable reasoning functionality, potentially improving response quality for complex analytical tasks.

**Supported models**:
+ amazon.nova-2-lite-v1:0:256k

**Recipe configuration**  
Enable reasoning by adding the `reasoning_effort` parameter to the `inference` section of your recipe:

```
run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for HyperPod job
  output_s3_path: ""                                     # [MODIFIABLE] Output path for HyperPod job (not compatible with SageMaker Training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: generate                                     # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/medium/high or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)
```

**Using the `reasoning_effort` parameter**  
The `reasoning_effort` parameter controls the reasoning behavior for reasoning-capable models.

**Prerequisites**:
+ **Model compatibility**: Set `reasoning_effort` only when `model_type` specifies a reasoning-capable model (currently `amazon.nova-2-lite-v1:0:256k`)
+ **Error handling**: Using `reasoning_effort` with unsupported models will fail with `ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."`
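This compatibility rule can be reproduced client-side before submitting a job; the following is a minimal sketch (it uses a plain `ValueError` in place of the service's `ConfigValidationError`, and the supported-model set reflects the model named above):

```python
# Models that currently support reasoning mode (per the documentation above).
REASONING_CAPABLE = {"amazon.nova-2-lite-v1:0:256k"}

def check_reasoning_config(model_type, reasoning_effort):
    """Raise early if reasoning_effort is set for a non-reasoning model."""
    if reasoning_effort is not None and model_type not in REASONING_CAPABLE:
        raise ValueError(
            f"Reasoning mode is enabled but model '{model_type}' does not "
            "support reasoning. Please use a reasoning-capable model or "
            "disable reasoning mode."
        )

check_reasoning_config("amazon.nova-2-lite-v1:0:256k", "high")  # passes
# check_reasoning_config("unsupported-model", "high")           # raises ValueError
```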

**Available options**:


| Option | Behavior | Token Limit | Use Case | 
| --- |--- |--- |--- |
| null (default) | Disables reasoning mode | N/A | Standard evaluation without reasoning overhead | 
| low | Enables reasoning with constraints | 4,000 tokens for internal reasoning | Scenarios requiring concise reasoning; optimizes for speed and cost | 
| high | Enables reasoning without constraints | No token limit on internal reasoning | Complex problems requiring extensive analysis and step-by-step reasoning | 

**When to enable reasoning**  
**Use reasoning mode (`low`, `medium`, or `high`) for**:
+ Complex problem-solving tasks (mathematics, logic puzzles, coding)
+ Multi-step analytical questions requiring intermediate reasoning
+ Tasks where detailed explanations or step-by-step thinking improve accuracy
+ Scenarios where response quality is prioritized over speed

**Use non-reasoning mode (omit parameter) for**:
+ Simple Q&A or factual queries
+ Creative writing tasks
+ When faster response times are critical
+ Performance benchmarking where reasoning overhead should be excluded
+ Cost optimization when reasoning doesn't improve task performance

**Troubleshooting**  
**Error: "Reasoning mode is enabled but model does not support reasoning"**

**Cause**: The `reasoning_effort` parameter is set to a non-null value, but the specified `model_type` doesn't support reasoning.

**Resolution**:
+ Verify your model type is `amazon.nova-2-lite-v1:0:256k`
+ If using a different model, either switch to a reasoning-capable model or remove the `reasoning_effort` parameter from your recipe

# Starting an evaluation job
<a name="customize-fine-tune-evaluate-start-job"></a>

The following shows how to install the prerequisites and the SageMaker HyperPod CLI, connect to a cluster, and submit an evaluation job with a suggested instance type and model type configuration:

```
# Install Dependencies (Helm - https://helm.sh/docs/intro/install/)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
rm -f ./get_helm.sh

# Install the SageMaker HyperPod CLI
git clone --recurse-submodules https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli
git checkout release_v2
pip install .

# Verify the installation
hyperpod --help

# Connect to a SageMaker HyperPod Cluster
hyperpod connect-cluster --cluster-name cluster-name


# Submit the Job using the recipe for eval
# Namespace by default should be kubeflow
hyperpod start-job [--namespace namespace] --recipe evaluation/nova/nova_micro_p5_48xl_general_text_benchmark_eval --override-parameters \
'{
    "instance_type": "p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest",
    "recipes.run.name": "custom-run-name",
    "recipes.run.model_type": "model-type",
    "recipes.run.model_name_or_path": "model name or fine-tuned checkpoint S3 URI",
    "recipes.run.data_s3_path": "input data S3 path (gen_qa and llm_judge only; must be the full S3 path including the filename)"
}'

# List jobs
hyperpod list-jobs [--namespace namespace] [--all-namespaces]

# Getting Job details
hyperpod get-job --job-name job-name [--namespace namespace] [--verbose]

# Listing Pods
hyperpod list-pods --job-name job-name --namespace namespace

# Cancel Job
hyperpod cancel-job --job-name job-name [--namespace namespace]
```

You can also view the job status through the Amazon EKS console.

# Accessing and analyzing evaluation results
<a name="customize-fine-tune-evaluate-access-results"></a>

After your evaluation job completes successfully, you can access and analyze the results using the information in this section. Based on the `output_s3_path` (such as `s3://output_path/`) defined in the recipe, the output structure is the following:

```
job_name/
├── eval-result/
│   ├── results_[timestamp].json
│   ├── inference_output.jsonl (only present for gen_qa)
│   └── details/
│       └── model/
│           └── execution-date-time/
│               └── details_task_name_#_datetime.parquet
└── tensorboard-results/
    └── eval/
        └── events.out.tfevents.[timestamp]
```

Metrics results are stored in the specified S3 output location `s3://output_path/job_name/eval-result/results_[timestamp].json`.

TensorBoard results are stored in the S3 path `s3://output_path/job_name/tensorboard-results/eval/events.out.tfevents.[timestamp]`.

All inference outputs, except for `llm_judge` and `strong_reject`, are stored in the S3 path `s3://output_path/job_name/eval-result/details/model/execution-date-time/details_task_name_#_datetime.parquet`.

For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:
+ `prompt` - The final prompt submitted to the model
+ `inference` - The raw inference output from the model
+ `gold` - The target response from the input dataset
+ `metadata` - The metadata string from the input dataset, if provided
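A quick way to spot-check these fields in a downloaded inference output file; a minimal sketch (the filename in the usage comment is a placeholder):

```python
import json

def load_genqa_outputs(path):
    """Yield (prompt, inference, gold) tuples from a gen_qa inference_output.jsonl."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield rec["prompt"], rec["inference"], rec["gold"]

# Example usage (path is a placeholder for your downloaded file):
# for prompt, inference, gold in load_genqa_outputs("inference_output.jsonl"):
#     print(prompt[:60], "->", inference[:60])
```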

To visualize your evaluation metrics in TensorBoard, complete the following steps:

1. Navigate to SageMaker AI TensorBoard.

1. Select **S3 folders**.

1. Add your S3 folder path, for example `s3://output_path/job_name/tensorboard-results/eval`.

1. Wait for synchronization to complete.

Time series, scalar, and text visualizations are available.

We recommend the following best practices:
+ Keep your output paths organized by model and benchmark type.
+ Maintain consistent naming conventions for easy tracking.
+ Save extracted results in a secure location.
+ Monitor TensorBoard sync status for successful data loading.

You can find SageMaker HyperPod job error logs in the CloudWatch log group `/aws/sagemaker/Clusters/cluster-id`.

## Log probability output format
<a name="nova-hp-access-results-logprobs"></a>

When `top_logprobs` is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

```
{
"Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
"Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}
```

Each token entry contains:
+ `logprob_value`: The log probability value for the token
+ `decoded_value`: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while `decoded_value` provides a readable interpretation.
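Because the values are natural-log probabilities, they can be converted back to probabilities with `math.exp`; a short sketch using the example structure above:

```python
import math

# Example token entry in the structure shown above.
top_logprobs = {
    "Ġthe": {"logprob_value": -2.345, "decoded_value": " the"},
    "Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
}

# Convert each log probability back to a probability, keyed by decoded value.
probs = {
    entry["decoded_value"]: math.exp(entry["logprob_value"])
    for entry in top_logprobs.values()
}

# " the" is far more likely than " int" at this position.
assert probs[" the"] > probs[" int"]
```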

# Monitoring HyperPod jobs with MLflow
<a name="nova-hp-mlflow"></a>

You can use MLflow to track and monitor your training jobs on SageMaker HyperPod. Follow these steps to set up MLflow and connect it to your training recipes.

***Create the MLflow App***

Example AWS CLI command

```
aws sagemaker-mlflow create-mlflow-app \
    --name <app-name> \
    --artifact-store-uri <s3-bucket-name> \
    --role-arn <role-arn> \
    --region <region-name>
```

Example output

```
{
    "Arn": "arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ"
}
```

***Generate pre-signed URL***

Example AWS CLI command

```
aws sagemaker-mlflow create-presigned-mlflow-app-url \
    --arn <app-arn> \
    --region <region-name> \
    --output text
```

Example output

```
https://app-LGZEOZ2UY4NZ.mlflow.sagemaker.us-east-1.app.aws/auth?authToken=eyJhbGciOiJIUzI1NiJ9.eyJhdXRoVG9rZW5JZCI6IkxETVBPUyIsImZhc0NyZWRlbnRpYWxzIjoiQWdWNGhDM1VvZ0VYSUVsT2lZOVlLNmxjRHVxWm1BMnNhZ3JDWEd3aFpOSmdXbzBBWHdBQkFCVmhkM010WTNKNWNIUnZMWEIxWW14cFl5MXJaWGtBUkVFd09IQmtVbU5IUzJJMU1VTnVaVEl3UVhkUE5uVm9Ra2xHTkZsNVRqTTNXRVJuTTNsalowNHhRVFZvZERneVdrMWlkRlZXVWpGTWMyWlRUV1JQWmpSS2R6MDlBQUVBQjJGM2N5MXJiWE1BUzJGeWJqcGhkM002YTIxek9uVnpMV1ZoYzNRdE1Ub3pNVFF4TkRZek1EWTBPREk2YTJWNUx6Y3dOMkpoTmpjeExUUXpZamd0TkRFeU5DMWhaVFUzTFRrMFlqTXdZbUptT1RJNU13QzRBUUlCQUhnQjRVMDBTK3ErVE51d1gydlFlaGtxQnVneWQ3YnNrb0pWdWQ2NmZjVENVd0ZzRTV4VHRGVllHUXdxUWZoeXE2RkJBQUFBZmpCOEJna3Foa2lHOXcwQkJ3YWdiekJ0QWdFQU1HZ0dDU3FHU0liM0RRRUhBVEFlQmdsZ2hrZ0JaUU1FQVM0d0VRUU1yOEh4MXhwczFBbmEzL1JKQWdFUWdEdTI0K1M5c2VOUUNFV0hJRXJwdmYxa25MZTJteitlT29pTEZYNTJaeHZsY3AyZHFQL09tY3RJajFqTWFuRjMxZkJyY004MmpTWFVmUHRhTWdJQUFCQUE3L1pGT05DRi8rWnVPOVlCVnhoaVppSEFSLy8zR1I0TmR3QWVxcDdneHNkd2lwTDJsVWdhU3ZGNVRCbW9uMUJnLy8vLy93QUFBQUVBQUFBQUFBQUFBQUFBQUFFQUFBUTdBMHN6dUhGbEs1NHdZbmZmWEFlYkhlNmN5OWpYOGV3T2x1NWhzUWhGWFllRXNVaENaQlBXdlQrVWp5WFY0ZHZRNE8xVDJmNGdTRUFOMmtGSUx0YitQa0tmM0ZUQkJxUFNUQWZ3S1oyeHN6a1lDZXdwRlNpalFVTGtxemhXbXBVcmVDakJCOHNGT3hQL2hjK0JQalY3bUhOL29qcnVOejFhUHhjNSt6bHFuak9CMHljYy8zL2JuSHA3NVFjRE8xd2NMbFJBdU5KZ2RMNUJMOWw1YVVPM0FFMlhBYVF3YWY1bkpwTmZidHowWUtGaWZHMm94SDJSNUxWSjNkbG40aGVRbVk4OTZhdXdsellQV253N2lTTDkvTWNidDAzdVZGN0JpUnRwYmZMN09JQm8wZlpYSS9wK1pUNWVUS2wzM2tQajBIU3F6NisvamliY0FXMWV4VTE4N1QwNHpicTNRcFhYMkhqcDEvQnFnMVdabkZoaEwrekZIaUV0Qjd4U1RaZkZsS2xRUUhNK0ZkTDNkOHIyRWhCMjFya2FBUElIQVBFUk5Pd1lnNmFzM2pVaFRwZWtuZVhxSDl3QzAyWU15R0djaTVzUEx6ejh3ZTExZVduanVTai9DZVJpZFQ1akNRcjdGMUdKWjBVREZFbnpNakFuL3Y3ajA5c2FMczZnemlCc2FLQXZZOWpib0JEYkdKdGZ0N2JjVjl4eUp4amptaW56TGtoVG5pV2dxV3g5MFZPUHlWNWpGZVk1QTFrMmw3bDArUjZRTFNleHg4d1FrK0FqVGJuLzFsczNHUTBndUtESmZKTWVGUVczVEVrdkp5VlpjOC9xUlpIODhybEpKOW1FSVdOd1BMU21yY1l6TmZwVTlVOGdoUDBPUWZvQ3FvcW1WaUhEYldaT294bGpmb295cS8yTDFKNGM3NTJUaVpFd1hnaG9haFBYdGFjRnA2NTVUYjY5eGxTN25FaXZjTTlzUjdTT3RE
MEMrVHIyd0cxNEJ3Zm9NZTdKOFhQeVRtcmQ0QmNKOEdOYnVZTHNRNU9DcFlsV3pVNCtEcStEWUI4WHk1UWFzaDF0dzJ6dGVjVVQyc0hsZmwzUVlrQ0d3Z1hWam5Ia2hKVitFRDIrR3Fpc3BkYjRSTC83RytCRzRHTWNaUE02Q3VtTFJkMnZLbnozN3dUWkxwNzdZNTdMQlJySm9Tak9idWdNUWdhOElLNnpWL2VtcFlSbXJsVjZ5VjZ6S1h5aXFKWFk3TTBXd3dSRzd5Q0xYUFRtTGt3WGE5cXF4NkcvZDY1RS83V3RWMVUrNFIxMlZIUmVUMVJmeWw2SnBmL2FXWFVCbFQ2ampUR0M5TU1uTk5OVTQwZHRCUTArZ001S1d2WGhvMmdmbnhVcU1OdnFHblRFTWdZMG5ZL1FaM0RWNFozWUNqdkFOVWVsS1NCdkxFbnY4SEx0WU9uajIrTkRValZOV1h5T1c4WFowMFFWeXU0ZU5LaUpLQ1hJbnI1N3RrWHE3WXl3b0lZV0hKeHQwWis2MFNQMjBZZktYYlhHK1luZ3F6NjFqMkhIM1RQUmt6dW5rMkxLbzFnK1ZDZnhVWFByeFFmNUVyTm9aT2RFUHhjaklKZ1FxRzJ2eWJjbFRNZ0M5ZXc1QURVcE9KL1RrNCt2dkhJMDNjM1g0UXcrT3lmZHFUUzJWb3N4Y0hJdG5iSkZmdXliZi9lRlZWRlM2L3lURkRRckhtQ1RZYlB3VXlRNWZpR20zWkRhNDBQUTY1RGJSKzZSbzl0S3c0eWFlaXdDVzYwZzFiNkNjNUhnQm5GclMyYytFbkNEUFcrVXRXTEF1azlISXZ6QnR3MytuMjdRb1cvSWZmamJucjVCSXk3MDZRTVR4SzhuMHQ3WUZuMTBGTjVEWHZiZzBvTnZuUFFVYld1TjhFbE11NUdpenZxamJmeVZRWXdBSERCcDkzTENsUUJuTUdVQ01GWkNHUGRPazJ2ZzJoUmtxcWQ3SmtDaEpiTmszSVlyanBPL0h2Z2NZQ2RjK2daM3lGRjMyTllBMVRYN1FXUkJYZ0l4QU5xU21ZTHMyeU9uekRFenBtMUtnL0tvYmNqRTJvSDJkZHcxNnFqT0hRSkhkVWRhVzlZL0NQYTRTbWxpN2pPbGdRPT0iLCJjaXBoZXJUZXh0IjoiQVFJQkFIZ0I0VTAwUytxK1ROdXdYMnZRZWhrcUJ1Z3lkN2Jza29KVnVkNjZmY1RDVXdHeDExRlBFUG5xU1ZFbE5YVUNrQnRBQUFBQW9qQ0Jud1lKS29aSWh2Y05BUWNHb0lHUk1JR09BZ0VBTUlHSUJna3Foa2lHOXcwQkJ3RXdIZ1lKWUlaSUFXVURCQUV1TUJFRURHemdQNnJFSWNEb2dWSTl1d0lCRUlCYitXekkvbVpuZkdkTnNYV0VCM3Y4NDF1SVJUNjBLcmt2OTY2Q1JCYmdsdXo1N1lMTnZUTkk4MEdkVXdpYVA5NlZwK0VhL3R6aGgxbTl5dzhjcWdCYU1pOVQrTVQxdzdmZW5xaXFpUnRRMmhvN0tlS2NkMmNmK3YvOHVnPT0iLCJzdWIiOiJhcm46YXdzOnNhZ2VtYWtlcjp1cy1lYXN0LTE6MDYwNzk1OTE1MzUzOm1sZmxvdy1hcHAvYXBwLUxHWkVPWjJVWTROWiIsImlhdCI6MTc2NDM2NDYxNSwiZXhwIjoxNzY0MzY0OTE1fQ.HNvZOfqft4m7pUS52MlDwoi1BA8Vsj3cOfa_CvlT4uw
```

***Open the pre-signed URL and view the app***

Open the pre-signed URL generated in the previous step in your browser to access the MLflow app.

The MLflow app appears:

![\[Example nova image.\]](http://docs.aws.amazon.com/nova/latest/userguide/images/screenshot-nova-model-1.png)


***Pass the MLflow app ARN to the `run` block of your SageMaker HyperPod recipe***

Recipe

```
run:
    mlflow_tracking_uri: arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/app-LGZEOZ2UY4NZ
```

The tracked job appears in the MLflow app:

![\[Example nova image.\]](http://docs.aws.amazon.com/nova/latest/userguide/images/screenshot-nova-model-2.png)
