Dataset file types and input data format
Instruction-based fine-tuning uses labeled datasets to improve the performance of pre-trained LLMs on specific natural language processing (NLP) tasks. The labeled examples are formatted as prompt-response pairs and phrased as instructions.
To learn about the supported dataset file types, see Supported dataset file types.
To learn about input data format, see Input data format for instruction-based fine-tuning.
Supported dataset file types
Autopilot supports instruction-based fine-tuning datasets formatted as CSV files (default) or as Parquet files.
- CSV (comma-separated values) is a row-based file format that stores data in human-readable plain text. Its broad application support makes it a popular choice for data exchange.
- Parquet is a binary, column-based file format that stores and processes data more efficiently than human-readable formats such as CSV, making it a better option for big data workloads.
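As a concrete illustration, the following Python sketch (using pandas; the file names and example rows are hypothetical) writes the same prompt-response pairs to both supported formats. The two-column schema it uses is described in Input data format for instruction-based fine-tuning below.

```python
# A minimal sketch of writing instruction-tuning data to both supported
# formats with pandas. File names and rows are illustrative; to_parquet
# requires the pyarrow (or fastparquet) package to be installed.
import pandas as pd

records = [
    {"input": "Summarize the following article: ...", "output": "The article explains ..."},
    {"input": "Translate to French: Good morning.", "output": "Bonjour."},
]
df = pd.DataFrame(records, columns=["input", "output"])

df.to_csv("train.csv", index=False)          # CSV: row-based, human readable
df.to_parquet("train.parquet", index=False)  # Parquet: binary, column-based
```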
Note
The dataset may consist of multiple files, each of which must adhere to a specific template. For information on how to format your input data, see Input data format for instruction-based fine-tuning.
Input data format for instruction-based fine-tuning
Each file in the dataset must adhere to the following format:
- The dataset must contain exactly two comma-separated and named columns, `input` and `output`. Autopilot does not allow any additional columns.
- The `input` column contains the prompts, and the corresponding `output` column contains the expected answer. Both `input` and `output` values are in string format.
The following example illustrates the input data format for instruction-based fine-tuning in Autopilot.
input,output "<prompt text>","<expected generated text>"
Note
We recommend using datasets with a minimum of 1,000 rows to ensure optimal learning and performance of the model.
Additionally, Autopilot sets a maximum limit on the number of rows in the dataset and on the context length, based on the type of model being used.
- The limits on the number of rows apply to the cumulative count of rows across all files within the dataset. If two channel types are defined (one for training and one for validation), the limit applies to the total number of rows across both channels. When the number of rows exceeds the threshold, the job fails with a validation error.
- When the length of the input or output of a row in the dataset exceeds the context length limit of the language model, it is automatically truncated. If more than 60% of the rows in the dataset are truncated, whether in their input or output, Autopilot fails the job with a validation error. The sketch after this list shows a rough pre-flight check for both conditions.
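The following sketch approximates both checks before job submission. The whitespace split is a crude stand-in for the model's tokenizer, which is model-specific, so treat the truncation estimate as approximate; the directory path and limit values (taken from the Llama2-7B row of the table below) are illustrative.

```python
# A rough pre-flight check against the row and context length limits.
import glob
import pandas as pd

ROW_LIMIT = 10_000     # e.g. Llama2-7B, from the table below
CONTEXT_LIMIT = 2_048  # tokens, for the same model

# Row limits are cumulative across all files in the dataset.
frames = [pd.read_csv(path, dtype=str) for path in glob.glob("dataset/*.csv")]
df = pd.concat(frames, ignore_index=True)

assert len(df) <= ROW_LIMIT, f"{len(df)} rows exceeds the {ROW_LIMIT}-row limit"

# Approximate token counts by splitting on whitespace (not the real tokenizer).
too_long = (
    (df["input"].str.split().str.len() > CONTEXT_LIMIT)
    | (df["output"].str.split().str.len() > CONTEXT_LIMIT)
)
truncated_fraction = too_long.mean()
print(f"~{truncated_fraction:.0%} of rows would be truncated")
assert truncated_fraction <= 0.60, "More than 60% of rows would be truncated"
```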
The following table presents those limits for each model.
| JumpStart Model ID | BaseModelName in API request | Row Limit | Context Length Limit |
|---|---|---|---|
| huggingface-textgeneration-dolly-v2-3b-bf16 | Dolly3B | 10,000 rows | 1024 tokens |
| huggingface-textgeneration-dolly-v2-7b-bf16 | Dolly7B | 10,000 rows | 1024 tokens |
| huggingface-textgeneration-dolly-v2-12b-bf16 | Dolly12B | 10,000 rows | 1024 tokens |
| huggingface-llm-falcon-7b-bf16 | Falcon7B | 1,000 rows | 1024 tokens |
| huggingface-llm-falcon-7b-instruct-bf16 | Falcon7BInstruct | 1,000 rows | 1024 tokens |
| huggingface-llm-falcon-40b-bf16 | Falcon40B | 10,000 rows | 1024 tokens |
| huggingface-llm-falcon-40b-instruct-bf16 | Falcon40BInstruct | 10,000 rows | 1024 tokens |
| huggingface-text2text-flan-t5-large | FlanT5L | 10,000 rows | 1024 tokens |
| huggingface-text2text-flan-t5-xl | FlanT5XL | 10,000 rows | 1024 tokens |
| huggingface-text2text-flan-t5-xxl | FlanT5XXL | 10,000 rows | 1024 tokens |
| meta-textgeneration-llama-2-7b | Llama2-7B | 10,000 rows | 2048 tokens |
| meta-textgeneration-llama-2-7b-f | Llama2-7BChat | 10,000 rows | 2048 tokens |
| meta-textgeneration-llama-2-13b | Llama2-13B | 7,000 rows | 2048 tokens |
| meta-textgeneration-llama-2-13b-f | Llama2-13BChat | 7,000 rows | 2048 tokens |
| huggingface-llm-mistral-7b | Mistral7B | 10,000 rows | 2048 tokens |
| huggingface-llm-mistral-7b-instruct | Mistral7BInstruct | 10,000 rows | 2048 tokens |
| huggingface-textgeneration1-mpt-7b-bf16 | MPT7B | 10,000 rows | 1024 tokens |
| huggingface-textgeneration1-mpt-7b-instruct-bf16 | MPT7BInstruct | 10,000 rows | 1024 tokens |
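For reference, the sketch below shows where the BaseModelName values above are used when creating a fine-tuning job through the boto3 create_auto_ml_job_v2 API. The job name, S3 URIs, and role ARN are placeholders, and exact parameter support may vary by SDK version.

```python
# A hedged sketch of launching an Autopilot fine-tuning job with boto3.
# Bucket names, job name, and role ARN are placeholders.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_auto_ml_job_v2(
    AutoMLJobName="my-fine-tuning-job",                # hypothetical name
    AutoMLJobInputDataConfig=[
        {
            "ChannelType": "training",
            "ContentType": "text/csv;header=present",  # CSV with a header row
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://your-bucket/fine-tuning/train/",  # placeholder
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://your-bucket/fine-tuning/output/"},
    AutoMLProblemTypeConfig={
        "TextGenerationJobConfig": {
            "BaseModelName": "Falcon7BInstruct"  # value from the table above
        }
    },
    RoleArn="arn:aws:iam::111122223333:role/YourSageMakerRole",  # placeholder
)
```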