Data mixing on SageMaker Training Jobs
With Amazon Nova Forge data mixing, you can combine your custom training data with Amazon Nova's proprietary training data during supervised fine-tuning (SFT). This helps preserve the model's general capabilities while specializing it to your target domain.
Note
Data mixing on SageMaker Training Jobs (serverless) is currently supported for Nova 2 Lite text-only SFT, with both LoRA and full-rank fine-tuning.
Prerequisites
-
Amazon Nova Forge subscription. Contact your AWS point of contact for access.
-
An IAM execution role with
AmazonSageMakerFullAccessand permissions to access Amazon Nova Forge Amazon S3 buckets and your training data bucket. The execution role must have Amazon S3 read access to the Amazon Nova Forge subscription buckets so that the service can validate your subscription status. For the required IAM policy, see IAM policy requirements for Amazon Nova recipes. -
Training data uploaded to Amazon S3 in Converse API format. For data format details, see Preparing training data for Nova 2.0.
-
A model package group ARN in your account.
-
A base model ARN from the SageMaker AI Hub.
-
(Optional) The
amzn-nova-forgePython SDK installed. For installation instructions, see Installation.
Hyperparameters
When using the API or CLI directly, all hyperparameter values must be passed as
strings. The SDKs handle type conversion internally. Data mixing is enabled by including
the customer_data_percent hyperparameter.
Data mixing parameter
| Parameter | Type | Description |
|---|---|---|
customer_data_percent |
String (0–100) | Percentage of the overall training mix drawn from your data. The remainder comes from Amazon Nova's curated training data. |
Setting customer_data_percent to "50" means 50% of
training samples come from your JSONL file and 50% from Amazon Nova. Set to
"100" to disable mixing (customer data only). Set to
"0" for Amazon Nova data only.
You can also control the distribution of Amazon Nova data across categories by
passing individual nova_<category>_percent hyperparameters.
When provided, the nova_*_percent values must sum to 100.
If you omit them, the default distribution is used.
Important
When you customize category percentages, you must specify values for
all 23 categories and they must sum to 100.
Set unused categories to "0" explicitly.
Example: default category distribution
{ "customer_data_percent": "50", "nova_agents_percent": "1", "nova_baseline_percent": "10", "nova_chat_percent": "0.5", "nova_code_percent": "10", "nova_factuality_percent": "0.1", "nova_identity_percent": "1", "nova_long-context_percent": "1", "nova_math_percent": "2", "nova_rai_percent": "1", "nova_instruction-following_percent": "13", "nova_stem_percent": "0.5", "nova_planning_percent": "10", "nova_reasoning-chat_percent": "0.5", "nova_reasoning-code_percent": "0.5", "nova_reasoning-factuality_percent": "0.5", "nova_reasoning-instruction-following_percent": "45", "nova_reasoning-math_percent": "0.5", "nova_reasoning-planning_percent": "0.5", "nova_reasoning-rag_percent": "0.4", "nova_reasoning-rai_percent": "0.5", "nova_reasoning-stem_percent": "0.4", "nova_rag_percent": "1", "nova_translation_percent": "0.1", "max_steps": "100", "global_batch_size": "32" }
The following table lists all available category hyperparameters.
| Hyperparameter | Description |
|---|---|
nova_agents_percent |
Agentic reasoning and task completion |
nova_baseline_percent |
General language comprehension |
nova_chat_percent |
Conversational fluency |
nova_code_percent |
Code generation and understanding |
nova_factuality_percent |
Factual accuracy and verification |
nova_identity_percent |
Consistent identity and persona |
nova_instruction-following_percent |
Instruction following |
nova_long-context_percent |
Long-context comprehension |
nova_math_percent |
Mathematics |
nova_planning_percent |
Planning and task decomposition |
nova_rag_percent |
Retrieval-augmented generation |
nova_rai_percent |
Responsible AI alignment |
nova_stem_percent |
STEM |
nova_translation_percent |
Multilingual comprehension and fluency |
nova_reasoning-chat_percent |
Conversational reasoning |
nova_reasoning-code_percent |
Code reasoning |
nova_reasoning-factuality_percent |
Factual reasoning and verification |
nova_reasoning-instruction-following_percent |
Reasoning for complex instruction following |
nova_reasoning-math_percent |
Mathematical reasoning |
nova_reasoning-planning_percent |
Reasoning for planning and strategy |
nova_reasoning-rag_percent |
Reasoning with retrieved context |
nova_reasoning-rai_percent |
Responsible AI reasoning |
nova_reasoning-stem_percent |
STEM reasoning |
Note
If you provide any nova_*_percent hyperparameters, you must
specify all 23 categories with values that sum to 100. If you don't provide any
nova_*_percent hyperparameters, the default distribution is
used.
Training parameters
The following parameters apply to both LoRA and full-rank fine-tuning unless noted.
| Parameter | Type | Default | Description |
|---|---|---|---|
max_steps |
Integer | 10 (LoRA) / 100 (full-rank) |
Number of training steps. |
global_batch_size |
Integer | 32 |
Batch size. Options: 32, 64, 128, 256, 512, 1024. |
learning_rate |
Float | 1e-05 |
Learning rate. |
warmup_steps |
Integer | 15 |
Learning rate warmup steps. |
min_lr |
Float | 1e-06 |
Minimum learning rate for the scheduler. |
weight_decay |
Float | 0.0 |
L2 regularization strength. |
save_steps |
Integer | 10 (LoRA) / 100 (full-rank) |
How often to save checkpoints in training steps. Must be an even number. |
max_context_length |
Integer | 32768 |
Maximum sequence length in tokens. |
reasoning_enabled |
Boolean | true |
Enable reasoning mode. Set true if your data
contains reasoningContent fields,
false otherwise. |
validation_data_s3_path |
String | – | Optional Amazon S3 path to a validation JSONL file. When provided,
validation loss is computed at val_check_interval
steps. |
val_check_interval |
Integer | – | Run validation every N training steps.
Only applies when validation_data_s3_path is provided. |
fine_tuned_model |
Float | 1.0 |
Weight of the fine-tuned checkpoint in model merge
Set to 1.0 to disable model merging. |
LoRA parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
alpha |
Integer | 64 |
LoRA alpha scaling factor. Options: 32, 64, 96, 128, 160, 192. |
learning_rate_ratio |
Float | 64.0 |
LoRA+ learning rate scaling factor |
Using the Amazon Nova Forge SDK
You can also submit serverless SFT jobs with data mixing using the
amzn-nova-forge Python SDK. For installation instructions, see
Installing the Amazon Nova Forge SDK.
from amzn_nova_forge.trainer import ForgeTrainer from amzn_nova_forge.manager import SMTJServerlessRuntimeManager from amzn_nova_forge.core import ForgeConfig from amzn_nova_forge.core.enums import Model, TrainingMethod infra = SMTJServerlessRuntimeManager( model_package_group_name="your-model-package-group", execution_role="arn:aws:iam::123456789012:role/YourRole", ) trainer = ForgeTrainer( model=Model.NOVA_LITE_2, method=TrainingMethod.SFT_LORA, infra=infra, training_data_s3_path="s3://your-bucket/data/train.jsonl", data_mixing_enabled=True, config=ForgeConfig(output_s3_path="s3://your-bucket/output/"), ) # Set data mixing percentages — all 23 nova categories must be specified and sum to 100 trainer.data_mixing.set_config({ "customer_data_percent": 50, "nova_agents_percent": 60, "nova_chat_percent": 40, "nova_baseline_percent": 0, "nova_code_percent": 0, "nova_factuality_percent": 0, "nova_identity_percent": 0, "nova_long-context_percent": 0, "nova_math_percent": 0, "nova_rai_percent": 0, "nova_instruction-following_percent": 0, "nova_stem_percent": 0, "nova_planning_percent": 0, "nova_reasoning-chat_percent": 0, "nova_reasoning-code_percent": 0, "nova_reasoning-factuality_percent": 0, "nova_reasoning-instruction-following_percent": 0, "nova_reasoning-math_percent": 0, "nova_reasoning-planning_percent": 0, "nova_reasoning-rag_percent": 0, "nova_reasoning-rai_percent": 0, "nova_reasoning-stem_percent": 0, "nova_rag_percent": 0, "nova_translation_percent": 0, }) # Verify the config before training current_config = trainer.data_mixing.get_config() print(f"Data mixing config: {current_config}") result = trainer.train(job_name="sft-datamix-serverless") print(f"Job ID: {result.job_id}")
For full-rank fine-tuning, use TrainingMethod.SFT_FULL instead
of TrainingMethod.SFT_LORA.
Using the Python SDK (SageMaker)
You can submit serverless SFT jobs with data mixing using the SageMaker AI Python SDK
SFTTrainer class.
from sagemaker.train.sft_trainer import SFTTrainer from sagemaker.train.common import TrainingType from sagemaker.core.helper.session_helper import Session sagemaker_session = Session() trainer = SFTTrainer( model="nova-textgeneration-lite-v2", training_type=TrainingType.FULL, model_package_group="my-custom-models", training_dataset="s3://my-bucket/data/train.jsonl", s3_output_path="s3://my-bucket/output/", sagemaker_session=sagemaker_session, role="arn:aws:iam::123456789012:role/SageMakerExecutionRole", ) # Set standard training hyperparameters trainer.hyperparameters.max_steps = 100 trainer.hyperparameters.learning_rate = 5e-6 # Set data mixing percentages — nova percents must sum to 100 trainer.hyperparameters.customer_data_percent = 70 trainer.hyperparameters.nova_code_percent = 30 trainer.hyperparameters.nova_math_percent = 20 trainer.hyperparameters.nova_planning_percent = 10 setattr(trainer.hyperparameters, 'nova_instruction-following_percent', 10) setattr(trainer.hyperparameters, 'nova_reasoning-instruction-following_percent', 20) setattr(trainer.hyperparameters, 'nova_reasoning-math_percent', 10) # Zero out unused categories trainer.hyperparameters.nova_agents_percent = 0 trainer.hyperparameters.nova_baseline_percent = 0 trainer.hyperparameters.nova_chat_percent = 0 trainer.hyperparameters.nova_factuality_percent = 0 trainer.hyperparameters.nova_identity_percent = 0 trainer.hyperparameters.nova_stem_percent = 0 trainer.hyperparameters.nova_rai_percent = 0 trainer.hyperparameters.nova_rag_percent = 0 trainer.hyperparameters.nova_translation_percent = 0 for category in ['nova_long-context_percent', 'nova_reasoning-chat_percent', 'nova_reasoning-code_percent', 'nova_reasoning-factuality_percent', 'nova_reasoning-planning_percent', 'nova_reasoning-rag_percent', 'nova_reasoning-rai_percent', 'nova_reasoning-stem_percent']: setattr(trainer.hyperparameters, category, 0) # Start training training_job = trainer.train(wait=True) print(f"Training job: {training_job.training_job_name}")
Note
Some parameter names contain hyphens (for example,
nova_instruction-following_percent). Use
setattr() to set these. Parameters without hyphens can be set
directly (for example, trainer.hyperparameters.nova_code_percent = 30).
For LoRA fine-tuning, use TrainingType.LORA instead of
TrainingType.FULL.
Using the AWS CLI
The following command structure shows how to submit a serverless SFT job with data mixing using the AWS CLI.
aws sagemaker create-training-job \ --region <region> \ --training-job-name <job-name> \ --role-arn <execution-role-arn> \ --hyper-parameters '<hyperparameters-json>' \ --input-data-config '<input-config-json>' \ --output-data-config '<output-config-json>' \ --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \ --serverless-job-config '<serverless-config-json>' \ --model-package-config '<model-package-config-json>'
Serverless job configuration
Pass the following JSON as the value of --serverless-job-config.
This tells Amazon Nova Forge which model to fine-tune and which training method to use.
{ "BaseModelArn": "<base-model-arn>", "AcceptEula": true, "JobType": "FineTuning", "CustomizationTechnique": "SFT", "Peft": "LORA" }
| Field | Description |
|---|---|
BaseModelArn |
ARN of the Nova 2 Lite model in SageMaker AI Hub. |
AcceptEula |
Must be true to accept the model EULA. |
JobType |
FineTuning |
CustomizationTechnique |
SFT |
Peft |
LORA for LoRA fine-tuning. Omit this field
entirely for full-rank fine-tuning. |
Model package configuration
Pass the following JSON as the value of --model-package-config.
The trained model checkpoint is registered as a model package version in the
specified group.
{ "ModelPackageGroupArn": "arn:aws:sagemaker:<region>:<account>:model-package-group/<group-name>" }
Input data configuration
Note
Data mixing on SageMaker Training Jobs (serverless) only supports text-only converse format. Multimodal converse manifests (containing images or videos) are not supported.
Pass the following JSON as the value of --input-data-config.
[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "Converse", "S3Uri": "s3://<bucket>/<path>/training-data.jsonl", "S3DataDistributionType": "FullyReplicated" } }, "CompressionType": "None", "RecordWrapperType": "None" }]
S3DataType must be Converse for Amazon Nova SFT data.
The training data uses the same Converse API JSONL format as standard Nova 2 Lite
SFT. Set CompressionType to None because the training
data is plain JSONL and not compressed. Set RecordWrapperType to
None because the data is not wrapped in RecordIO format.
The following complete example submits a LoRA SFT job with 50% data mixing.
aws sagemaker create-training-job \ --region us-east-1 \ --training-job-name "my-sft-datamix-lora" \ --role-arn "arn:aws:iam::<account>:role/<execution-role>" \ --hyper-parameters '{ "customer_data_percent": "50", "max_steps": "100", "global_batch_size": "32", "learning_rate": "1e-05", "warmup_steps": "15", "min_lr": "1e-06", "weight_decay": "0.0", "save_steps": "10", "reasoning_enabled": "true", "alpha": "64" }' \ --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "Converse", "S3Uri": "s3://<bucket>/training-data.jsonl", "S3DataDistributionType": "FullyReplicated" } }, "CompressionType": "None", "RecordWrapperType": "None" }]' \ --output-data-config '{ "S3OutputPath": "s3://<bucket>/output", "CompressionType": "NONE" }' \ --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \ --serverless-job-config '{ "BaseModelArn": "<base-model-arn>", "AcceptEula": true, "JobType": "FineTuning", "CustomizationTechnique": "SFT", "Peft": "LORA" }' \ --model-package-config '{ "ModelPackageGroupArn": "arn:aws:sagemaker:us-east-1:<account>:model-package-group/<group-name>" }' \ --output text --query 'TrainingJobArn'
For full-rank fine-tuning, omit "Peft": "LORA" from
--serverless-job-config and remove "alpha" from
hyperparameters.
Check job status
aws sagemaker describe-training-job \ --region us-east-1 \ --training-job-name "my-sft-datamix-lora" \ --query '{Status:TrainingJobStatus,Secondary:SecondaryStatus,Reason:FailureReason}' \ --output json
Best practices
-
Start with 50% customer data as a balanced starting point. Higher percentages increase domain specialization but reduce general capability preservation.
-
Include reasoning-instruction-following when using data mixing to maintain strong performance across general tasks.
-
Use default learning rates – 1e-5 for LoRA, 5e-6 for full-rank. Adjust only based on validation metrics.
Limitations
-
Text-only SFT – Multimodal data mixing is not supported on SageMaker Training Jobs.
-
Nova 2 Lite only – Other model sizes are not supported for data mixing on SageMaker Training Jobs.
-
Category-level control – When customizing category percentages, you must specify all 23 categories with values that sum to 100. Set unused categories to
"0"explicitly. -
No replicas parameter –
ResourceConfigis not supported on serverless. Replicas are not configurable.