Prepare the datasets

Before you can begin a model customization job, you must at minimum prepare a training dataset. Whether a validation dataset is supported, and the format of your training and validation datasets, depend on the following factors.

  • The type of customization job (fine-tuning or Continued Pre-training).

  • The input and output modalities of the data.

Model support for fine-tuning and continued pre-training data format

The following table shows the fine-tuning and continued pre-training data formats supported for each model:

Model name | Fine-tuning: Text-to-text | Fine-tuning: Text-to-image & Image-to-embeddings | Continued Pre-training: Text-to-text | Fine-tuning: Single-turn messaging | Fine-tuning: Multi-turn messaging
Amazon Titan Text G1 - Express | Yes | No | Yes | No | No
Amazon Titan Text G1 - Lite | Yes | No | Yes | No | No
Amazon Titan Text Premier | Yes | No | No | No | No
Amazon Titan Image Generator G1 V1 | Yes | Yes | No | No | No
Amazon Titan Multimodal Embeddings G1 | Yes | Yes | No | No | No
Anthropic Claude 3 Haiku | No | No | No | Yes | Yes
Cohere Command | Yes | No | No | No | No
Cohere Command Light | Yes | No | No | No | No
Meta Llama 2 13B | Yes | No | No | No | No
Meta Llama 2 70B | Yes | No | No | No | No

To see the default quotas that apply for training and validation datasets used for customizing different models, see the Sum of training and validation records quotas in Amazon Bedrock endpoints and quotas in the AWS General Reference.

Prepare training and validation datasets for your custom model

To prepare training and validation datasets for your custom model, you create .jsonl files in which each line is a JSON object corresponding to a record. The files you create must conform to the format for the customization method and model that you choose, and the records in them must conform to size requirements.

The format depends on the customization method and the input and output modality of the model. Find the section for your method, and then follow the steps:

Fine-tuning: Text-to-text

For text-to-text models, prepare a training dataset and an optional validation dataset. Each JSON object is a sample containing both a prompt and a completion field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows.

{"prompt": "<prompt1>", "completion": "<expected generated text>"} {"prompt": "<prompt2>", "completion": "<expected generated text>"} {"prompt": "<prompt3>", "completion": "<expected generated text>"}

The following is an example item for a question-answer task:

{"prompt": "what is AWS", "completion": "it's Amazon Web Services"}
Fine-tuning: Text-to-image & Image-to-embeddings

For text-to-image or image-to-embedding models, prepare a training dataset. Validation datasets are not supported. Each JSON object is a sample containing an image-ref field, which is the Amazon S3 URI of an image, and a caption field, which serves as a prompt for the image.

The images must be in JPEG or PNG format.

{"image-ref": "s3://bucket/path/to/image001.png", "caption": "<prompt text>"} {"image-ref": "s3://bucket/path/to/image002.png", "caption": "<prompt text>"} {"image-ref": "s3://bucket/path/to/image003.png", "caption": "<prompt text>"}

The following is an example item:

{"image-ref": "s3://amzn-s3-demo-bucket/my-pets/cat.png", "caption": "an orange cat with white spots"}

To allow Amazon Bedrock access to the image files, add an IAM policy similar to the one in Permissions to access training and validation files and to write output files in S3 to the Amazon Bedrock model customization service role that you set up or that was automatically set up for you in the console. The Amazon S3 paths you provide in the training dataset must be in folders that you specify in the policy.
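
As a sketch of how such a file might be assembled, the following Python example builds image-ref and caption records and rejects files that are not JPEG or PNG. The bucket name, folder, and captions are assumptions for the example.

import json

# Assumed bucket name, folder, and captions for illustration.
BUCKET = "amzn-s3-demo-bucket"
FOLDER = "my-pets"
captions = {
    "cat.png": "an orange cat with white spots",
    "dog.jpeg": "a brown dog running on a beach",
}

with open("train.jsonl", "w") as f:
    for filename, caption in captions.items():
        # The images must be in JPEG or PNG format.
        if not filename.lower().endswith((".jpeg", ".jpg", ".png")):
            raise ValueError(f"{filename} is not a JPEG or PNG image")
        record = {"image-ref": f"s3://{BUCKET}/{FOLDER}/{filename}",
                  "caption": caption}
        f.write(json.dumps(record) + "\n")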

Continued Pre-training: Text-to-text

To carry out Continued Pre-training on a text-to-text model, prepare a training dataset and an optional validation dataset. Because Continued Pre-training involves unlabeled data, each JSON line is a sample containing only an input field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows.

{"input": "<input text>"} {"input": "<input text>"} {"input": "<input text>"}

The following is an example item that could be in the training data.

{"input": "AWS stands for Amazon Web Services"}
Fine-tuning: Single-turn messaging

To fine-tune a text-to-text model using the single-turn messaging format, prepare a training dataset and an optional validation dataset. Both data files must be in JSONL format. Each line specifies a complete data sample as a JSON object, and each data sample must occupy exactly one line (remove all newline characters within each sample). Placing multiple data samples on one line, or splitting a data sample across multiple lines, doesn't work.

Fields

  • system (optional): A string containing a system message that sets the context for the conversation.

  • messages: An array of message objects, each containing:

    • role: Either user or assistant

    • content: The text content of the message

Rules

  • The messages array must contain exactly 2 messages

  • The first message must have a role of user

  • The last message must have a role of assistant

{"system": "<system message>","messages":[{"role": "user", "content": "<user query>"},{"role": "assistant", "content": "<expected generated text>"}]}

Example

{"system": "You are an helpful assistant.","messages":[{"role": "user", "content": "what is AWS"},{"role": "assistant", "content": "it's Amazon Web Services."}]}
Fine-tuning: Multi-turn messaging

To fine-tune a text-to-text model using the multi-turn messaging format, prepare a training dataset and an optional validation dataset. Both data files must be in JSONL format. Each line specifies a complete data sample as a JSON object, and each data sample must occupy exactly one line (remove all newline characters within each sample). Placing multiple data samples on one line, or splitting a data sample across multiple lines, doesn't work.

Fields

  • system (optional): A string containing a system message that sets the context for the conversation.

  • messages: An array of message objects, each containing:

    • role: Either user or assistant

    • content: The text content of the message

Rules

  • The messages array must contain at least 2 messages

  • The first message must have a role of user

  • The last message must have a role of assistant

  • Messages must alternate between user and assistant roles.

{"system": "<system message>","messages":[{"role": "user", "content": "<user query 1>"},{"role": "assistant", "content": "<expected generated text 1>"}, {"role": "user", "content": "<user query 2>"},{"role": "assistant", "content": "<expected generated text 2>"}]}

Example

{"system": "system message","messages":[{"role": "user", "content": "Hello there."},{"role": "assistant", "content": "Hi, how can I help you?"},{"role": "user", "content": "what are LLMs?"},{"role": "assistant", "content": "LLM means large language model."},]}
Distillation

To prepare training and validation datasets for a model distillation job, see Prerequisites for Amazon Bedrock Model Distillation.

The following tables show the requirements for training and validation datasets for each model:

Amazon Titan Text Premier
Description | Maximum (Fine-tuning)
Sum of input and output tokens when batch size is 1 | 4,096
Sum of input and output tokens when batch size is 2, 3, or 4 | N/A
Character quota per sample in dataset | Token quota x 6
Training dataset file size | 1 GB
Validation dataset file size | 100 MB
Amazon Titan Text G1 - Express
Description | Maximum (Continued Pre-training) | Maximum (Fine-tuning)
Sum of input and output tokens when batch size is 1 | 4,096 | 4,096
Sum of input and output tokens when batch size is 2, 3, or 4 | 2,048 | 2,048
Character quota per sample in dataset | Token quota x 6 | Token quota x 6
Training dataset file size | 10 GB | 1 GB
Validation dataset file size | 100 MB | 100 MB
Amazon Titan Text G1 - Lite
Description | Maximum (Continued Pre-training) | Maximum (Fine-tuning)
Sum of input and output tokens when batch size is 1 or 2 | 4,096 | 4,096
Sum of input and output tokens when batch size is 3, 4, 5, or 6 | 2,048 | 2,048
Character quota per sample in dataset | Token quota x 6 | Token quota x 6
Training dataset file size | 10 GB | 1 GB
Validation dataset file size | 100 MB | 100 MB
Amazon Titan Image Generator G1 V1
Description | Minimum (Fine-tuning) | Maximum (Fine-tuning)
Text prompt length in training sample, in characters | 3 | 1,024
Records in a training dataset | 5 | 10,000
Input image size | 0 | 50 MB
Input image height in pixels | 512 | 4,096
Input image width in pixels | 512 | 4,096
Input image total pixels | 0 | 12,582,912
Input image aspect ratio | 1:4 | 4:1
Amazon Titan Multimodal Embeddings G1
Description | Minimum (Fine-tuning) | Maximum (Fine-tuning)
Text prompt length in training sample, in characters | 0 | 2,560
Records in a training dataset | 1,000 | 500,000
Input image size | 0 | 5 MB
Input image height in pixels | 128 | 4,096
Input image width in pixels | 128 | 4,096
Input image total pixels | 0 | 12,528,912
Input image aspect ratio | 1:4 | 4:1
Cohere Command
Description | Maximum (Fine-tuning)
Input tokens | 4,096
Output tokens | 2,048
Character quota per sample in dataset | Token quota x 6
Records in a training dataset | 10,000
Records in a validation dataset | 1,000
Meta Llama 2
Description | Maximum (Fine-tuning)
Input tokens | 4,096
Output tokens | 2,048
Character quota per sample in dataset | Token quota x 6
Meta Llama 3.1
Description | Maximum (Fine-tuning)
Input tokens | 16,000
Output tokens | 16,000
Character quota per sample in dataset | Token quota x 6
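
Before you submit a customization job, it can help to check your files against these quotas. The following Python sketch compares file sizes with the Amazon Titan Text Premier limits; the file names and the choice of model are assumptions for the example.

import os

# Example limits from the Amazon Titan Text Premier fine-tuning row;
# substitute your model's values from the tables above. The file names
# are assumptions for the example.
limits = {
    "train.jsonl": 1 * 1024**3,          # training dataset file size: 1 GB
    "validation.jsonl": 100 * 1024**2,   # validation dataset file size: 100 MB
}

for path, limit in limits.items():
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes, over the {limit}-byte limit")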