

# Iterative Training
<a name="nova-iterative-training"></a>

## Overview
<a name="nova-iterative-overview"></a>

Iterative training is the process of repeatedly fine-tuning a model through multiple training cycles, possibly using a different training method in each cycle: train, evaluate, analyze errors, then adjust data, objectives, or hyperparameters, with each round starting from the previous checkpoint. This approach allows you to systematically target model failure modes, incorporate curated examples that address specific weaknesses, and adapt to changing requirements over time.

**Benefits over single-pass training:**
+ **Targeted improvement**: Address specific failure patterns discovered through evaluation
+ **Adaptive refinement**: Respond to distribution shifts or evolving product requirements
+ **Risk mitigation**: Validate improvements incrementally rather than committing to a single long training run
+ **Data efficiency**: Focus data collection efforts on areas where the model underperforms
+ **Curriculum training**: Multiple rounds of training with progressively higher-quality data

## How it works
<a name="nova-iterative-how-it-works"></a>

### Checkpoint location and access
<a name="nova-iterative-checkpoint-location"></a>

After each training job completes, a manifest file is generated in the output location specified by the `output_path` parameter in your training configuration.

**To access your checkpoint**
+ Navigate to your specified `output_path` in S3
+ Download and extract the `output.tar.gz` file
+ Open the `manifest.json` file inside
+ Locate the `checkpoint_s3_bucket` parameter, which contains the S3 URI of your trained model

**Example manifest.json structure**

```
{  
  "checkpoint_s3_bucket": "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>/stepID",  
  ...  
}
```
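The extraction steps above can be sketched in Python. The `output.tar.gz` archive name, `manifest.json` file, and `checkpoint_s3_bucket` key come from this page; the function names and the assumption that `manifest.json` sits at the archive root are illustrative:

```python
import json
import tarfile

def checkpoint_from_manifest(manifest: dict) -> str:
    """Return the escrow-bucket S3 URI recorded in a training manifest."""
    try:
        return manifest["checkpoint_s3_bucket"]
    except KeyError:
        raise ValueError("manifest has no 'checkpoint_s3_bucket' entry") from None

def checkpoint_from_archive(archive_path: str) -> str:
    """Read manifest.json out of a downloaded output.tar.gz and return the checkpoint URI."""
    with tarfile.open(archive_path, "r:gz") as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith("manifest.json"))
        with tar.extractfile(member) as f:
            return checkpoint_from_manifest(json.load(f))
```

Download `output.tar.gz` from your configured `output_path` first (for example, with `aws s3 cp`), then call `checkpoint_from_archive("output.tar.gz")`.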

### Understanding escrow buckets
<a name="nova-iterative-escrow-buckets"></a>

Since Amazon Nova weights are proprietary, trained model checkpoints are stored in **escrow S3 buckets** within AWS-managed accounts rather than being copied to your account. These escrow buckets:
+ Contain your customized model weights securely
+ Can be referenced by other AWS services (Inference, Evaluation, and subsequent training jobs)
+ Are accessible only to your AWS account via IAM permissions
+ Incur standard S3 storage charges in your account (see Cost considerations)

You can use the escrow bucket path as the `model_name_or_path` in your next training run to continue iterative training.

### Using checkpoints for iterative training
<a name="nova-iterative-using-checkpoints"></a>

Configure your next training job to use the previous checkpoint as the base model:

```
run:  
  name: "my-iterative-training-job"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<previous-job-name>"  
  data_s3_path: s3://<bucket>/<data-file>.jsonl  
  replicas: 4
```

## When to use iterative training
<a name="nova-iterative-when-to-use"></a>

### Ideal use cases
<a name="nova-iterative-ideal-use-cases"></a>

Use iterative training when you have:
+ **Feedback loops** – Ability to collect real-world failure cases and systematically address them
+ **Dynamic environments** – Evolving documentation, APIs, or support topics requiring periodic model updates
+ **Robust evaluation** – Strong benchmarks and evaluation frameworks (see examples below) to measure improvements confidently
+ **ML operations capability** – Resources to manage multiple training cycles and version control

**Examples of robust evaluation frameworks**
+ Automated benchmark suites with pass/fail thresholds
+ Human evaluation protocols with inter-rater reliability metrics
+ Red-team testing scenarios covering edge cases and adversarial inputs
+ A/B testing infrastructure to measure production impact
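As a minimal sketch of the first item, an automated pass/fail gate might compare exact-match accuracy against a fixed threshold. The function names and the 0.85 default threshold are illustrative assumptions:

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

def passes_benchmark(predictions, references, threshold=0.85):
    """Pass/fail gate: True when exact-match accuracy meets the threshold."""
    return exact_match_rate(predictions, references) >= threshold
```

In practice you would run one gate per benchmark suite and block promotion of a checkpoint that fails any of them.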

### Common patterns
<a name="nova-iterative-common-patterns"></a>

**SFT → RFT Pipeline**: A frequently used iterative pattern involves:
+ **SFT first** – Teach the model how to solve problems through demonstration examples
+ **RFT second** – Optimize performance across the broader problem space using reward signals

This sequence is essential when models perform poorly at the outset: running RFT on a model with near-zero accuracy will not improve performance unless SFT first establishes basic problem-solving capabilities.

### When not to use iterative training
<a name="nova-iterative-when-not-to-use"></a>

Avoid iterative training for:
+ **Stable, well-defined tasks** – Stationary data with consistent requirements already achieving near-ceiling performance
+ **Simple classification problems** – Narrow tasks where single-pass training suffices
+ **Resource constraints** – Lacking dedicated ML operations capabilities to manage multiple training cycles
+ **Marginal gains** – When overhead doesn't justify minimal performance improvements

## Example workflow: SFT → RFT
<a name="nova-iterative-example-workflow"></a>

This example demonstrates a common iterative training pattern for reasoning models.

### Step 1: Initial SFT training
<a name="nova-iterative-step1"></a>

Configure and launch your SFT training job with your dataset:

```
run:  
  name: "initial-sft-training"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "nova-lite-2/prod"  
  data_s3_path: s3://<bucket>/sft-training-data.jsonl  
  validation_data_s3_path: s3://<bucket>/sft-validation-data.jsonl
```

**Rationale**: SFT provides additional demonstrations that shape model outputs into your desired format and voice, establishing foundational capabilities.

**After training completes**
+ Note the `output_path` configured in your training job
+ Download `output.tar.gz` from that location
+ Extract and locate `manifest.json`
+ Copy the `checkpoint_s3_bucket` value

### Step 2: RFT training on SFT checkpoint
<a name="nova-iterative-step2"></a>

Create a new RFT training job using the SFT checkpoint:

```
run:  
  name: "rft-on-sft-checkpoint"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<initial-sft-training>"  
  data_s3_path: s3://<bucket>/rft-training-data.jsonl  
  reward_lambda_arn: <your-reward-function-arn>
```

**Rationale**: RFT training builds on the SFT foundation, allowing the model to develop more complex reasoning patterns optimized by your reward function.

### Step 3: Evaluate and iterate
<a name="nova-iterative-step3"></a>

Run evaluation on the RFT checkpoint to assess performance:

```
run:  
  name: "evaluate-rft-checkpoint"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<rft-on-sft-checkpoint>"  
  data_s3_path: s3://<bucket>/evaluation-data.jsonl
```

If target metrics are not satisfied, continue iterating with adjusted data or hyperparameters.

**Important**  
The training technique (LoRA vs. Full Rank) must remain consistent across all iterations:
+ If you use SFT with **LoRA**, you must use RFT with **LoRA**
+ If you use SFT with **Full Rank**, you must use RFT with **Full Rank**
+ You cannot switch between LoRA and Full Rank mid-pipeline

**Important**  
If a KMS key is used for encryption in the Amazon-owned output S3 bucket, that same KMS key must be used for all future iterations.

## Monitoring progress across iterations
<a name="nova-iterative-monitoring"></a>

You can track metrics via MLflow by [setting up MLflow](nova-model-monitor.md#nova-customization-mlflow-setup) for your jobs.

### Create an MLflow app
<a name="nova-iterative-create-mlflow"></a>

**Using Studio UI**: If you create a training job through the Studio UI, a default MLflow app is created automatically and selected by default under Advanced Options.

**Using CLI**: If you use the CLI, you must create an MLflow app and pass it as an input to the training job API request.

```
mlflow_app_name="<enter your MLflow app name>"  
role_arn="<enter your role ARN>"   
bucket_name="<enter your bucket name>"   
region="<enter your region>"  
  
mlflow_app_arn=$(aws sagemaker create-mlflow-app \  
  --name $mlflow_app_name \  
  --artifact-store-uri "s3://$bucket_name" \  
  --role-arn $role_arn \  
  --region $region)
```

### Access the MLflow app
<a name="nova-iterative-access-mlflow"></a>

**Using CLI**: Create a presigned URL to access the MLflow app UI:

```
aws sagemaker create-presigned-mlflow-app-url \  
  --arn $mlflow_app_arn \  
  --region $region \  
  --output text
```

**Using Studio UI**: The Studio UI displays key metrics stored in MLflow and provides a link to the MLflow app UI.

### Key metrics to track
<a name="nova-iterative-key-metrics"></a>

Monitor these metrics across iterations to assess improvement and track the job progress:

**For SFT**
+ Training loss curves
+ Number of samples consumed and time to process samples
+ Performance accuracy on held-out test sets
+ Format compliance (for example, valid JSON output rate)
+ Perplexity on domain-specific evaluation data

**For RFT**
+ Average reward scores over training
+ Reward distribution (percentage of high-reward responses)
+ Validation reward trends (watch for overfitting)
+ Task-specific success rates (for example, code execution pass rate, math problem accuracy)

**General**
+ Benchmark performance deltas between iterations
+ Human evaluation scores on representative samples
+ Production metrics (if deploying iteratively)

### Determining when to stop
<a name="nova-iterative-when-to-stop"></a>

Stop iterating when:
+ **Performance plateaus** – Additional training no longer meaningfully improves target metrics; before stopping entirely, consider switching techniques (for example, SFT → RFT → SFT) to break through the ceiling
+ **Target metrics achieved** – Your success criteria are met
+ **Regression detected** – New iterations degrade performance (see rollback procedures below)
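The stopping criteria above can be sketched as a simple check over per-iteration evaluation scores; the function name and the `min_delta` default are illustrative assumptions:

```python
def should_stop(metric_history, target=None, min_delta=0.005):
    """Decide whether to stop iterating based on per-iteration eval scores.

    metric_history is a list of one score per completed iteration, oldest first.
    Returns (stop, reason).
    """
    latest = metric_history[-1]
    if target is not None and latest >= target:
        return True, "target metric achieved"
    if len(metric_history) >= 2 and latest < metric_history[-2]:
        return True, "regression detected: roll back to the previous checkpoint"
    if len(metric_history) >= 3 and (latest - metric_history[-3]) < min_delta:
        return True, "performance plateau: consider switching technique"
    return False, "continue iterating"
```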

For detailed evaluation procedures, refer to the **Evaluation** section.

## Best practices
<a name="nova-iterative-best-practices"></a>

### Start small and scale gradually
<a name="nova-iterative-start-small"></a>

Begin with minimal datasets and single training epochs to validate your approach before scaling up. This builds confidence and helps identify issues early.

### Establish clear success metrics
<a name="nova-iterative-success-metrics"></a>

Define quantitative and qualitative indicators before starting:

**Example success metrics by use case**
+ **Question answering** – Exact match accuracy, F1 score, human preference ratings
+ **Code generation** – Unit test pass rate, compilation success, execution time
+ **Reasoning tasks** – Step accuracy, final answer correctness, reward scores
+ **Content generation** – Coherence scores, factual accuracy, style adherence

### Implement automated evaluation
<a name="nova-iterative-automated-eval"></a>

Set up automated evaluation pipelines to track performance after each round, enabling rapid iteration and objective comparison.

### Maintain rigorous version control
<a name="nova-iterative-version-control"></a>

Document for each iteration:
+ Dataset versions and modifications
+ Model checkpoint locations
+ Hyperparameter changes
+ Performance metrics and deltas
+ Qualitative observations

This builds institutional knowledge and enables debugging.
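One lightweight way to capture these records is an append-only JSON-lines experiment log; a sketch, with the function name and all field names as assumptions:

```python
import datetime
import json

def log_iteration(log_path, *, name, checkpoint, dataset_version,
                  hyperparameters, metrics, notes=""):
    """Append one iteration's record to a JSON-lines experiment log."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "name": name,
        "checkpoint": checkpoint,           # escrow-bucket S3 URI from manifest.json
        "dataset_version": dataset_version,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
        "notes": notes,                     # qualitative observations
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A plain file like this is enough to diff hyperparameters and metrics between iterations, and it complements (rather than replaces) MLflow tracking.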

### Focus on data quality over quantity
<a name="nova-iterative-data-quality"></a>

Analyze failure cases from previous rounds and add targeted, high-quality examples rather than simply increasing dataset size.

### Plan iteration budget
<a name="nova-iterative-iteration-budget"></a>

Plan for **3-5 iterations** as a typical range:
+ **1-2 iterations** – Often sufficient for simple improvements or final polishing
+ **3-5 iterations** – Appropriate for complex tasks requiring multiple refinement cycles
+ **5+ iterations** – May indicate diminishing returns or need for different approaches

Adjust based on computational budget and performance improvement rates.

### Implement rollback capabilities
<a name="nova-iterative-rollback"></a>

If an iteration introduces regressions:
+ **Identify the regression** – Compare evaluation metrics across checkpoints
+ **Return to previous checkpoint** – Use the earlier checkpoint's S3 path as your `model_name_or_path`
+ **Adjust training approach** – Modify data, hyperparameters, or technique before retrying
+ **Document the failure** – Record what caused regression to avoid repeating

**Example rollback**

```
run:  
  name: "rollback-to-iteration-2"  
  model_type: amazon.nova-2-lite-v1:0:256k  
  # Use iteration 2 checkpoint instead of failed iteration 3  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<iteration-2-job-name>"
```

## Cost considerations
<a name="nova-iterative-cost"></a>

### Checkpoint storage
<a name="nova-iterative-checkpoint-storage"></a>
+ **Location** – Checkpoints stored in escrow buckets incur standard S3 storage charges billed to your AWS account
+ **Retention** – Checkpoints are retained indefinitely unless explicitly deleted
+ **Management** – Implement lifecycle policies to archive or delete old checkpoints you no longer need

**Cost optimization tips**
+ Delete intermediate checkpoints after validating newer iterations
+ Archive checkpoints to S3 Glacier for long-term retention at lower cost
+ Set retention policies based on your compliance and experimentation needs
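To reason about the retention trade-off, a back-of-the-envelope estimate of monthly checkpoint storage cost can help. The per-GB rates and checkpoint sizes below are placeholder assumptions; check current S3 pricing for your Region:

```python
def monthly_storage_cost(sizes_gb, price_per_gb):
    """Estimated monthly S3 storage cost in USD for retained checkpoints."""
    return sum(sizes_gb) * price_per_gb

# Placeholder per-GB-month rates; look up actual S3 pricing for your Region.
STANDARD = 0.023
GLACIER_DEEP_ARCHIVE = 0.00099

checkpoints = [120.0, 120.0, 120.0]  # three retained checkpoints, assumed 120 GB each
keep_all = monthly_storage_cost(checkpoints, STANDARD)
archive_old = (monthly_storage_cost(checkpoints[:1], STANDARD)
               + monthly_storage_cost(checkpoints[1:], GLACIER_DEEP_ARCHIVE))
```

Under these assumed numbers, archiving all but the latest checkpoint cuts the monthly bill by a factor of roughly three, which is why the lifecycle tips above pay off quickly as iterations accumulate.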

## Limitations
<a name="nova-iterative-limitations"></a>

### Model family consistency
<a name="nova-iterative-model-consistency"></a>

When iteratively training, you must use the **same model type** throughout all iterations.

**Initial training**

```
run:  
  model_type: amazon.nova-2-lite-v1:0:256k  
  model_name_or_path: "nova-lite-2/prod"
```

**Subsequent iterations must use the same model type**

```
run:  
  model_type: amazon.nova-2-lite-v1:0:256k  # Must match original  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

### Training technique consistency
<a name="nova-iterative-technique-consistency"></a>

The training technique must remain consistent across iterations:
+ **LoRA-trained models** can only be iteratively trained with LoRA
+ **Full-Rank-trained models** can only be iteratively trained with Full-Rank

**How LoRA adapters work in iterative training**
+ Each LoRA training iteration produces new adapter weights
+ New adapters replace (not stack with) previous adapters
+ The base model remains frozen; only adapters are updated

### Technique compatibility matrix
<a name="nova-iterative-compatibility-matrix"></a>


| Initial training | Can iterate with | 
| --- | --- | 
| SFT (Full-Rank) | SFT (Full-Rank), RFT (Full-Rank) | 
| SFT (LoRA) | SFT (LoRA), RFT (LoRA) | 
| RFT (Full-Rank) | RFT (Full-Rank) | 
| RFT (LoRA) | RFT (LoRA) | 

**Verifying compatibility before starting a job**
+ Check your previous training recipe to identify the model type and training technique (LoRA vs. Full-Rank)
+ Ensure your new recipe matches both the model type and technique
+ Review the manifest.json to confirm the checkpoint path is correct
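The compatibility matrix above can be encoded as a preflight check to run before launching a job; the dictionary layout and function name are illustrative assumptions:

```python
# (method, technique) pairs that may continue from a given initial training,
# mirroring the technique compatibility matrix.
COMPATIBLE = {
    ("SFT", "Full-Rank"): {("SFT", "Full-Rank"), ("RFT", "Full-Rank")},
    ("SFT", "LoRA"):      {("SFT", "LoRA"), ("RFT", "LoRA")},
    ("RFT", "Full-Rank"): {("RFT", "Full-Rank")},
    ("RFT", "LoRA"):      {("RFT", "LoRA")},
}

def can_iterate(previous, proposed):
    """True when the proposed (method, technique) pair can continue from the previous one."""
    return proposed in COMPATIBLE.get(previous, set())
```

Running such a check against your previous and new recipes before submission avoids a failed job due to a technique mismatch.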

## Troubleshooting
<a name="nova-iterative-troubleshooting"></a>

### Error: "Incompatible model training techniques detected"
<a name="nova-iterative-error-incompatible"></a>

**Cause**: The training technique (LoRA vs. Full-Rank) doesn't match the checkpoint's technique.

**Resolution**: Ensure your recipe uses the same training technique as the original model:
+ If the checkpoint was trained with LoRA, use LoRA in your new recipe
+ If the checkpoint was trained with Full-Rank, use Full-Rank in your new recipe

### Error: "Base model for the job extracted from model_name_or_path does not match model_type"
<a name="nova-iterative-error-base-model"></a>

**Cause**: The model type specified in `model_type` doesn't match the actual model in the checkpoint.

**Resolution**: Verify that:
+ The `model_type` in your recipe matches the original model type
+ The checkpoint S3 path in `model_name_or_path` is correct
+ You're using the path from the correct manifest.json file

**Example of correct configuration**

```
run:  
  model_type: amazon.nova-2-lite-v1:0:256k  # Must match checkpoint's model  
  model_name_or_path: "s3://customer-escrow-<account-number>-smtj-<unique-identifier>/<job-name>"
```

### Error: "Model configuration not found"
<a name="nova-iterative-error-config"></a>

**Cause**: The S3 path in `model_name_or_path` is invalid or inaccessible.

**Resolution**:
+ Verify the S3 path is correctly copied from the manifest.json file
+ Ensure your IAM role has permissions to access the escrow bucket
+ Confirm the previous training job completed successfully
+ Check for typos in the path
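A quick local sanity check on the copied path can catch malformed URIs before you launch a job; a sketch, with the function name as an assumption:

```python
def parse_s3_uri(uri):
    """Split an S3 URI into (bucket, key), raising on obviously malformed input."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key
```

The resulting bucket and key can then be passed to, for example, boto3's `list_objects_v2` to confirm the checkpoint prefix exists and that your IAM role can read it.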

### Performance regression after iteration
<a name="nova-iterative-error-regression"></a>

**Symptoms**: Evaluation metrics decline after a new training iteration.

**Resolution**:
+ **Rollback** – Use the previous checkpoint as your base model
+ **Analyze** – Review training logs and data quality for the failed iteration
+ **Adjust** – Modify hyperparameters (reduce learning rate), improve data quality, or reduce training epochs
+ **Retry** – Launch a new iteration with adjustments