Before you can begin a model customization job, you must, at minimum, prepare a training dataset. Whether a validation dataset is supported, and the format of your training and validation datasets, depend on the following factors:
- The type of customization job (fine-tuning or Continued Pre-training).
- The input and output modalities of the data.
Model support for fine-tuning and continued pre-training data format
The following table shows details of the fine-tuning and continued pre-training data format supported for each respective model:
Model name | Fine-tuning: Text-to-text | Fine-tuning: Text-to-image & Image-to-embeddings | Continued Pre-training: Text-to-text | Fine-tuning: Single-turn messaging | Fine-tuning: Multi-turn messaging |
---|---|---|---|---|---|
Amazon Titan Text G1 - Express | Yes | No | Yes | No | No |
Amazon Titan Text G1 - Lite | Yes | No | Yes | No | No |
Amazon Titan Text Premier | Yes | No | No | No | No |
Amazon Titan Image Generator G1 V1 | Yes | Yes | No | No | No |
Amazon Titan Multimodal Embeddings G1 | Yes | Yes | No | No | No |
Anthropic Claude 3 Haiku | No | No | No | Yes | Yes |
Cohere Command | Yes | No | No | No | No |
Cohere Command Light | Yes | No | No | No | No |
Meta Llama 2 13B | Yes | No | No | No | No |
Meta Llama 2 70B | Yes | No | No | No | No |
To see the default quotas that apply to training and validation datasets used for customizing different models, see the Sum of training and validation records quotas in Amazon Bedrock endpoints and quotas in the AWS General Reference.
Prepare training and validation datasets for your custom model
To prepare training and validation datasets for your custom model, you create .jsonl files in which each line is a JSON object corresponding to a record. The files you create must conform to the format for the customization method and model that you choose, and the records in them must conform to size requirements.
The format depends on the customization method and the input and output modality of the model. Choose the tab for your preferred method, and then follow the steps:
For text-to-text models, prepare a training and optional validation dataset. Each JSON object is a sample containing both a `prompt` and a `completion` field. Use 6 characters per token as an approximation for the number of tokens. The format is as follows:
{"prompt": "<prompt1>", "completion": "<expected generated text>"}
{"prompt": "<prompt2>", "completion": "<expected generated text>"}
{"prompt": "<prompt3>", "completion": "<expected generated text>"}
The following is an example item for a question-answer task:
```json
{"prompt": "what is AWS", "completion": "it's Amazon Web Services"}
```
Select a tab to see the requirements for training and validation datasets for a model:
Description | Maximum (Fine-tuning) |
---|---|
Sum of input and output tokens when batch size is 1 | 4,096 |
Sum of input and output tokens when batch size is 2, 3, or 4 | N/A |
Character quota per sample in dataset | Token quota × 6 |
Training dataset file size | 1 GB |
Validation dataset file size | 100 MB |
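As a sketch of how these limits translate into a pre-flight check, the following Python script validates a candidate dataset against the quotas above. The 4,096-token limit and the 6-characters-per-token approximation give a per-sample budget of 4,096 × 6 = 24,576 characters; the file-size limits and field names come from the table and format above, and the file paths are placeholders.

```python
import json
import os

MAX_TOKENS = 4096                      # sum of input and output tokens at batch size 1
CHARS_PER_TOKEN = 6                    # approximation from the documentation
MAX_CHARS = MAX_TOKENS * CHARS_PER_TOKEN   # 24,576 characters per sample
MAX_TRAIN_BYTES = 1 * 1024**3          # 1 GB training dataset file size
MAX_VALIDATION_BYTES = 100 * 1024**2   # 100 MB validation dataset file size

def check_dataset(path, max_bytes):
    """Report a dataset file or record that exceeds the documented quotas."""
    if os.path.getsize(path) > max_bytes:
        print(f"{path}: file exceeds the {max_bytes}-byte limit")
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # each line must be a valid JSON object
            chars = len(record["prompt"]) + len(record["completion"])
            if chars > MAX_CHARS:
                print(f"{path} line {line_no}: {chars} characters exceeds {MAX_CHARS}")

# Placeholder file names; point these at your own datasets.
check_dataset("train.jsonl", MAX_TRAIN_BYTES)
check_dataset("validation.jsonl", MAX_VALIDATION_BYTES)
```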