Prepare the datasets
Before you can begin a model customization job, you must prepare, at minimum, a training dataset. Whether a validation dataset is supported, and the required format of your training and validation datasets, depend on the customization method and on the model that you use.
Model support for fine-tuning and continued pre-training data formats

The following table shows the fine-tuning and continued pre-training data formats that each model supports:
| Model name | Fine-tuning: Text-to-text | Fine-tuning: Text-to-image & Image-to-embeddings | Fine-tuning: Text+Image-to-Text & Text+Video-to-Text | Continued Pre-training: Text-to-text | Fine-tuning: Single-turn messaging | Fine-tuning: Multi-turn messaging |
|---|---|---|---|---|---|---|
| Amazon Nova Pro | Yes | Yes | Yes | No | Yes | Yes |
| Amazon Nova Lite | Yes | Yes | Yes | No | Yes | Yes |
| Amazon Nova Micro | Yes | No | No | No | Yes | Yes |
| Amazon Titan Text G1 - Express | Yes | No | No | Yes | No | No |
| Amazon Titan Text G1 - Lite | Yes | No | No | Yes | No | No |
| Amazon Titan Text Premier | Yes | No | No | No | No | No |
| Amazon Titan Image Generator G1 V1 | Yes | Yes | No | No | No | No |
| Amazon Titan Multimodal Embeddings G1 | Yes | Yes | No | No | No | No |
| Anthropic Claude 3 Haiku | No | No | No | No | Yes | Yes |
| Cohere Command | Yes | No | No | No | No | No |
| Cohere Command Light | Yes | No | No | No | No | No |
| Meta Llama 2 13B | Yes | No | No | No | No | No |
| Meta Llama 2 70B | Yes | No | No | No | No | No |
To see the default quotas that apply for training and validation datasets used for customizing different models, see the Sum of training and validation records quotas in Amazon Bedrock endpoints and quotas in the AWS General Reference.
Prepare training and validation datasets for your custom model
To prepare training and validation datasets for your custom model, you create .jsonl files in which each line is a JSON object corresponding to a record. The files that you create must conform to the format for the customization method and model that you choose, and the records in them must conform to size requirements.

The format depends on the customization method and on the input and output modalities of the model. Choose the tab for your preferred method, and then follow the steps:
- Fine-tuning: Text-to-text

  For text-to-text models, prepare a training dataset and an optional validation dataset. Each JSON object is a sample containing both a `prompt` field and a `completion` field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows:

  ```json
  {"prompt": "<prompt1>", "completion": "<expected generated text>"}
  {"prompt": "<prompt2>", "completion": "<expected generated text>"}
  {"prompt": "<prompt3>", "completion": "<expected generated text>"}
  ```

  The following is an example item for a question-answer task:

  ```json
  {"prompt": "what is AWS", "completion": "it's Amazon Web Services"}
  ```
- Fine-tuning: Text-to-image & Image-to-embeddings

  For text-to-image or image-to-embeddings models, prepare a training dataset. Validation datasets are not supported. Each JSON object is a sample containing an `image-ref` field, the Amazon S3 URI for an image, and a `caption` field that could serve as a prompt for the image. The images must be in JPEG or PNG format.

  ```json
  {"image-ref": "s3://bucket/path/to/image001.png", "caption": "<prompt text>"}
  {"image-ref": "s3://bucket/path/to/image002.png", "caption": "<prompt text>"}
  {"image-ref": "s3://bucket/path/to/image003.png", "caption": "<prompt text>"}
  ```

  The following is an example item:

  ```json
  {"image-ref": "s3://amzn-s3-demo-bucket/my-pets/cat.png", "caption": "an orange cat with white spots"}
  ```

  To allow Amazon Bedrock access to the image files, add an IAM policy similar to the one in Permissions to access training and validation files and to write output files in S3 to the Amazon Bedrock model customization service role that you set up or that was automatically set up for you in the console. The Amazon S3 paths you provide in the training dataset must be in folders that you specify in the policy.
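
  Once the images are uploaded to Amazon S3, a small script can assemble the records. The following is a minimal sketch; the bucket name, prefix, and caption mapping are placeholders, and the images are assumed to already exist at the resulting S3 URIs:

  ```python
  import json
  from pathlib import Path

  # Hypothetical bucket, prefix, and captions; replace with your own values.
  BUCKET = "amzn-s3-demo-bucket"
  PREFIX = "my-pets"
  captions = {"cat.png": "an orange cat with white spots"}

  with open("train.jsonl", "w", encoding="utf-8") as f:
      for name, caption in captions.items():
          # The dataset accepts JPEG or PNG images only.
          if Path(name).suffix.lower() not in {".png", ".jpg", ".jpeg"}:
              raise ValueError(f"unsupported image format: {name}")
          record = {"image-ref": f"s3://{BUCKET}/{PREFIX}/{name}", "caption": caption}
          f.write(json.dumps(record, ensure_ascii=False) + "\n")
  ```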
- Continued Pre-training: Text-to-text

  To carry out Continued Pre-training on a text-to-text model, prepare a training dataset and an optional validation dataset. Because Continued Pre-training involves unlabeled data, each JSON line is a sample containing only an `input` field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows:

  ```json
  {"input": "<input text>"}
  {"input": "<input text>"}
  {"input": "<input text>"}
  ```

  The following is an example item that could be in the training data:

  ```json
  {"input": "AWS stands for Amazon Web Services"}
  ```
- Fine-tuning: Single-turn messaging

  To fine-tune a text-to-text model using the single-turn messaging format, prepare a training dataset and an optional validation dataset. Both data files must be in JSONL format. Each line specifies a complete data sample as a JSON object, and each data sample must be formatted on a single line (remove all '\n' characters within each sample). One line containing multiple data samples, or a data sample split over multiple lines, won't work.

  Fields

  - `system` (optional): A string containing a system message that sets the context for the conversation.
  - `messages`: An array of message objects, each containing:
    - `role`: The sender of the message, either `user` or `assistant`.
    - `content`: The text content of the message.

  Rules

  - The `messages` array must contain 2 messages.
  - The first message must have a `role` of `user`.
  - The last message must have a `role` of `assistant`.

  ```json
  {"system": "<system message>","messages":[{"role": "user", "content": "<user query>"},{"role": "assistant", "content": "<expected generated text>"}]}
  ```

  Example

  ```json
  {"system": "You are a helpful assistant.","messages":[{"role": "user", "content": "what is AWS"},{"role": "assistant", "content": "it's Amazon Web Services."}]}
  ```
- Fine-tuning: Multi-turn messaging

  To fine-tune a text-to-text model using the multi-turn messaging format, prepare a training dataset and an optional validation dataset. Both data files must be in JSONL format. Each line specifies a complete data sample as a JSON object, and each data sample must be formatted on a single line (remove all '\n' characters within each sample). One line containing multiple data samples, or a data sample split over multiple lines, won't work.

  Fields

  - `system` (optional): A string containing a system message that sets the context for the conversation.
  - `messages`: An array of message objects, each containing:
    - `role`: The sender of the message, either `user` or `assistant`.
    - `content`: The text content of the message.

  Rules

  - The `messages` array must contain at least 2 messages.
  - The first message must have a `role` of `user`.
  - The last message must have a `role` of `assistant`.
  - Messages must alternate between `user` and `assistant` roles.

  ```json
  {"system": "<system message>","messages":[{"role": "user", "content": "<user query 1>"},{"role": "assistant", "content": "<expected generated text 1>"},{"role": "user", "content": "<user query 2>"},{"role": "assistant", "content": "<expected generated text 2>"}]}
  ```

  Example

  ```json
  {"system": "system message","messages":[{"role": "user", "content": "Hello there."},{"role": "assistant", "content": "Hi, how can I help you?"},{"role": "user", "content": "what are LLMs?"},{"role": "assistant", "content": "LLM means large language model."}]}
  ```
- Distillation

  To prepare training and validation datasets for a model distillation job, see Prerequisites for Amazon Bedrock Model Distillation.
Select a tab to see the requirements for training and validation datasets for a model:
- Amazon Nova

  | Model | Minimum Samples | Maximum Samples | Context Length |
  |---|---|---|---|
  | Amazon Nova Micro | 100 | 20k | 32k |
  | Amazon Nova Lite | 8 | 20k (10k for document) | 32k |
  | Amazon Nova Pro | 100 | 10k | 32k |

  Image and video constraints

  | Constraint | Value |
  |---|---|
  | Maximum image file size | 10 MB |
  | Maximum videos | 1 per sample |
  | Maximum video length or duration | 90 seconds |
  | Maximum video file size | 50 MB |
  | Supported image formats | PNG, JPEG, GIF, WEBP |
  | Supported video formats | MOV, MKV, MP4, WEBM |
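
  If your Amazon Nova dataset references images or videos, a local pre-check against the limits above can catch problems before you submit a job. The following is a minimal sketch with a placeholder path; it treats 1 MB as 1024 × 1024 bytes (an assumption) and does not check video duration, which would require a media library:

  ```python
  from pathlib import Path

  # Limits copied from the Amazon Nova constraints table above.
  IMAGE_FORMATS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
  VIDEO_FORMATS = {".mov", ".mkv", ".mp4", ".webm"}
  MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MB
  MAX_VIDEO_BYTES = 50 * 1024 * 1024  # 50 MB

  def check_media(path):
      p = Path(path)
      size, ext = p.stat().st_size, p.suffix.lower()
      if ext in IMAGE_FORMATS:
          assert size <= MAX_IMAGE_BYTES, f"{path}: image exceeds 10 MB"
      elif ext in VIDEO_FORMATS:
          assert size <= MAX_VIDEO_BYTES, f"{path}: video exceeds 50 MB"
      else:
          raise ValueError(f"{path}: unsupported file format")

  check_media("my-pets/cat.png")  # placeholder path
  ```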
- Amazon Titan Text Premier

  | Description | Maximum (Fine-tuning) |
  |---|---|
  | Sum of input and output tokens when batch size is 1 | 4,096 |
  | Sum of input and output tokens when batch size is 2, 3, or 4 | N/A |
  | Character quota per sample in dataset | Token quota x 6 |
  | Training dataset file size | 1 GB |
  | Validation dataset file size | 100 MB |

  For example, with a 4,096-token quota, the character quota per sample is 4,096 x 6 = 24,576 characters.
- Amazon Titan Text G1 - Express

  | Description | Maximum (Continued Pre-training) | Maximum (Fine-tuning) |
  |---|---|---|
  | Sum of input and output tokens when batch size is 1 | 4,096 | 4,096 |
  | Sum of input and output tokens when batch size is 2, 3, or 4 | 2,048 | 2,048 |
  | Character quota per sample in dataset | Token quota x 6 | Token quota x 6 |
  | Training dataset file size | 10 GB | 1 GB |
  | Validation dataset file size | 100 MB | 100 MB |
- Amazon Titan Text G1 - Lite

  | Description | Maximum (Continued Pre-training) | Maximum (Fine-tuning) |
  |---|---|---|
  | Sum of input and output tokens when batch size is 1 or 2 | 4,096 | 4,096 |
  | Sum of input and output tokens when batch size is 3, 4, 5, or 6 | 2,048 | 2,048 |
  | Character quota per sample in dataset | Token quota x 6 | Token quota x 6 |
  | Training dataset file size | 10 GB | 1 GB |
  | Validation dataset file size | 100 MB | 100 MB |
- Amazon Titan Image Generator G1 V1

  | Description | Minimum (Fine-tuning) | Maximum (Fine-tuning) |
  |---|---|---|
  | Text prompt length in training sample, in characters | 3 | 1,024 |
  | Records in a training dataset | 5 | 10,000 |
  | Input image size | 0 | 50 MB |
  | Input image height in pixels | 512 | 4,096 |
  | Input image width in pixels | 512 | 4,096 |
  | Input image total pixels | 0 | 12,582,912 |
  | Input image aspect ratio | 1:4 | 4:1 |
- Amazon Titan Multimodal Embeddings G1

  | Description | Minimum (Fine-tuning) | Maximum (Fine-tuning) |
  |---|---|---|
  | Text prompt length in training sample, in characters | 0 | 2,560 |
  | Records in a training dataset | 1,000 | 500,000 |
  | Input image size | 0 | 5 MB |
  | Input image height in pixels | 128 | 4,096 |
  | Input image width in pixels | 128 | 4,096 |
  | Input image total pixels | 0 | 12,528,912 |
  | Input image aspect ratio | 1:4 | 4:1 |
- Cohere Command

  | Description | Maximum (Fine-tuning) |
  |---|---|
  | Input tokens | 4,096 |
  | Output tokens | 2,048 |
  | Character quota per sample in dataset | Token quota x 6 |
  | Records in a training dataset | 10,000 |
  | Records in a validation dataset | 1,000 |
- Meta Llama 2

  | Description | Maximum (Fine-tuning) |
  |---|---|
  | Input tokens | 4,096 |
  | Output tokens | 2,048 |
  | Character quota per sample in dataset | Token quota x 6 |
- Meta Llama 3.1

  | Description | Maximum (Fine-tuning) |
  |---|---|
  | Input tokens | 16,000 |
  | Output tokens | 16,000 |
  | Character quota per sample in dataset | Token quota x 6 |
For Amazon Nova data preparation guidelines, see Guidelines for preparing your data for Amazon Nova.