Create an AutoML job to fine-tune text generation models using the API
Large language models (LLMs) excel in multiple generative tasks, including text generation, summarization, completion, question answering, and more. Their performance can be attributed to their significant size and extensive training on diverse datasets and various tasks. However, specific domains, such as healthcare and financial services, may require customized fine-tuning to adapt to unique data and use cases. By tailoring their training to their particular domain, LLMs can improve their performance and provide more accurate outputs for targeted applications.
Autopilot offers the capability to fine-tune a selection of pre-trained generative text models. In particular, Autopilot supports the instruction-based fine-tuning of a selection of general-purpose large language models (LLMs) powered by JumpStart.
Note
The text generation models that support fine-tuning in Autopilot are currently accessible exclusively in Regions supported by SageMaker Canvas. See the documentation of SageMaker Canvas for the full list of its supported Regions.
Fine-tuning a pre-trained model requires a specific dataset of clear instructions that guide the model on how to generate output or behave for that task. The model learns from the dataset, adjusting its parameters to conform to the provided instructions. Instruction-based fine-tuning involves using labeled examples formatted as prompt-response pairs and phrased as instructions. For more information about fine-tuning, see Fine-tune a foundation model.
The following guidelines outline the process of creating an Amazon SageMaker Autopilot experiment to fine-tune text generation LLMs using the SageMaker AI API Reference.
Note
Tasks such as text and image classification, time-series forecasting, and fine-tuning of large language models are exclusively available through version 2 of the AutoML REST API. If your language of choice is Python, you can refer to the AWS SDK for Python (Boto3).
Users who prefer the convenience of a user interface can use Amazon SageMaker Canvas to access pre-trained models and generative AI foundation models, or create custom models tailored to specific text classification, image classification, or forecasting needs, or to generative AI.
To create an Autopilot experiment programmatically for fine-tuning an LLM, you can call the CreateAutoMLJobV2 API in any language supported by Amazon SageMaker Autopilot or the AWS CLI. For information about how this API action translates into a function in the language of your choice, see the See Also section of CreateAutoMLJobV2 and choose an SDK. As an example, for Python users, see the full request syntax of create_auto_ml_job_v2 in the AWS SDK for Python (Boto3).
Note
Autopilot fine-tunes large language models without requiring multiple candidates to be trained and evaluated. Instead, using your dataset, Autopilot directly fine-tunes your target model to enhance a default objective metric, the cross-entropy loss. Fine-tuning language models in Autopilot does not require setting the AutoMLJobObjective field.
Once your LLM is fine-tuned, you can evaluate its performance by accessing the various ROUGE scores returned in the BestCandidate of a DescribeAutoMLJobV2 API call. The model also provides information about its
training and validation loss as well as perplexity. For a comprehensive list of metrics for
evaluating the quality of the text generated by the fine-tuned models, see Metrics for fine-tuning large language
models in Autopilot.
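As an illustration, the following is a minimal sketch of retrieving those metrics with the AWS SDK for Python (Boto3). The job name is hypothetical, and the response keys shown are an assumption based on the DescribeAutoMLJobV2 response structure.

import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

# Describe the completed fine-tuning job and print its candidate metrics,
# such as ROUGE scores, training and validation loss, and perplexity.
response = sagemaker.describe_auto_ml_job_v2(
    AutoMLJobName="my-llm-finetuning-job"  # hypothetical job name
)
best_candidate = response["BestCandidate"]
for metric in best_candidate["CandidateProperties"]["CandidateMetrics"]:
    print(metric["MetricName"], metric["Value"])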
Prerequisites
Before using Autopilot to create a fine-tuning experiment in SageMaker AI, make sure to take the following steps:
- (Optional) Choose the pre-trained model you want to fine-tune. For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see Supported large language models for fine-tuning. The selection of a model is not mandatory; if no model is specified, Autopilot automatically defaults to the Falcon7BInstruct model.
- Create a dataset of instructions. See Dataset file types and input data format to learn about the format requirements for your instruction-based dataset.
- Place your dataset in an Amazon S3 bucket.
- Grant full access to the Amazon S3 bucket containing your input data for the SageMaker AI execution role used to run your experiment.
  - For information on retrieving your SageMaker AI execution role, see Get your execution role.
  - For information on granting your SageMaker AI execution role permissions to access one or more specific buckets in Amazon S3, see Add Additional Amazon S3 Permissions to a SageMaker AI Execution Role in Create execution role.
- Additionally, you should provide your execution role with the necessary permissions to access the default storage Amazon S3 bucket used by JumpStart. This access is required for storing and retrieving pre-trained model artifacts in JumpStart. To grant access to this Amazon S3 bucket, you must create a new inline custom policy on your execution role.
JumpStart's bucket names follow a predetermined pattern that depends on the AWS Region, so you must adjust the name of the bucket accordingly. Here's an example policy that you can use in your JSON editor when configuring AutoML fine-tuning jobs in us-west-2:

{
    "Sid": "Statement1",
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::jumpstart-cache-prod-us-west-2",
        "arn:aws:s3:::jumpstart-cache-prod-us-west-2/*"
    ]
}
Once this is done, you can use the ARN of this execution role in Autopilot API requests.
Required parameters
When calling CreateAutoMLJobV2 to create an Autopilot experiment for LLM fine-tuning, you must provide the following values:
- An AutoMLJobName to specify the name of your job. The name should be of type string, and should have a minimum length of 1 character and a maximum length of 32.
- At least one AutoMLJobChannel of the training type within the AutoMLJobInputDataConfig. This channel specifies the name of the Amazon S3 bucket where your fine-tuning dataset is located. You have the option to define a validation channel. If no validation channel is provided, and a ValidationFraction is configured in the AutoMLDataSplitConfig, this fraction is utilized to randomly divide the training dataset into training and validation sets. Additionally, you can specify the type of content (CSV or Parquet files) for the dataset.
- An AutoMLProblemTypeConfig of type TextGenerationJobConfig to configure the settings of your training job. In particular, you can specify the name of the base model to fine-tune in the BaseModelName field. For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see Supported large language models for fine-tuning.
- An OutputDataConfig to specify the Amazon S3 output path to store the artifacts of your AutoML job.
- A RoleArn to specify the ARN of the role used to access your data.
The following is an example of the full request format used when making an API call to CreateAutoMLJobV2 for fine-tuning a Falcon7BInstruct model.
{ "AutoMLJobName": "<job_name>", "AutoMLJobInputDataConfig": [ { "ChannelType": "training", "CompressionType": "None", "ContentType": "text/csv", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<bucket_name>/<input_data>.csv" } } } ], "OutputDataConfig": { "S3OutputPath": "s3://<bucket_name>/output", "KmsKeyId": "arn:aws:kms:<region>:<account_id>:key/<key_value>" }, "RoleArn":"arn:aws:iam::<account_id>:role/<sagemaker_execution_role_name>", "AutoMLProblemTypeConfig": { "TextGenerationJobConfig": { "BaseModelName": "Falcon7BInstruct" } } }
All other parameters are optional.
Optional parameters
The following sections provide details of some optional parameters that you can pass to your fine-tuning AutoML job.
You can provide your own validation dataset and custom data split ratio, or let Autopilot split the dataset automatically.
Each AutoMLJobChannel object (see the required parameter AutoMLJobInputDataConfig) has a ChannelType, which can be set to either training or validation; these values specify how the data is to be used when building a machine learning model.
At least one data source must be provided and a maximum of two data sources is allowed: one for training data and one for validation data. How you split the data into training and validation datasets depends on whether you have one or two data sources.
- If you only have one data source, the ChannelType is set to training by default and must have this value.
  - If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data from this source is used for validation by default.
  - If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.
- If you have two data sources, the ChannelType of one of the AutoMLJobChannel objects must be set to training, the default value. The ChannelType of the other data source must be set to validation. The two data sources must have the same format, either CSV or Parquet, and the same schema. You must not set the value for the ValidationFraction in this case because all of the data from each source is used for either training or validation. Setting this value causes an error.
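To illustrate the two options, the following Python fragments sketch the relevant pieces of a create_auto_ml_job_v2 request. The bucket names and the 0.15 fraction are example values, not recommendations.

# Option 1: a single training channel plus an explicit split, passed as
# the DataSplitConfig parameter of create_auto_ml_job_v2.
data_split_config = {"ValidationFraction": 0.15}  # example fraction

# Option 2: separate training and validation channels; do not set
# ValidationFraction in this case.
input_data_config = [
    {
        "ChannelType": "training",
        "CompressionType": "None",
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://amzn-s3-demo-bucket/train.csv",  # placeholder
            }
        },
    },
    {
        "ChannelType": "validation",
        "CompressionType": "None",
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://amzn-s3-demo-bucket/validation.csv",  # placeholder
            }
        },
    },
]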
With Autopilot, you can automatically deploy your fine-tuned model to a SageMaker AI endpoint. To enable automatic deployment, include a ModelDeployConfig in the AutoML job request. Below are the available configurations for customization.
- To let Autopilot generate the endpoint name, set AutoGenerateEndpointName to True.
- To provide your own name for the endpoint, set AutoGenerateEndpointName to False and provide a name of your choice in EndpointName.
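For example, a ModelDeployConfig requesting a custom endpoint name might look like the following sketch; the endpoint name is hypothetical.

# Passed as the ModelDeployConfig parameter of create_auto_ml_job_v2.
model_deploy_config = {
    "AutoGenerateEndpointName": False,
    "EndpointName": "my-finetuned-llm-endpoint",  # hypothetical endpoint name
}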
For models requiring the acceptance of an end-user license agreement before fine-tuning, you can accept the EULA by setting the AcceptEula attribute of the ModelAccessConfig to True in TextGenerationJobConfig when configuring your AutoMLProblemTypeConfig.
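As a sketch, a TextGenerationJobConfig accepting the EULA of a gated base model could look like the following; the base model shown is only an example.

# Passed within AutoMLProblemTypeConfig in the create_auto_ml_job_v2 request.
problem_type_config = {
    "TextGenerationJobConfig": {
        "BaseModelName": "Falcon7BInstruct",  # example base model
        "ModelAccessConfig": {"AcceptEula": True},
    }
}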
You can optimize the learning process of your text generation model by setting hyperparameter values in the TextGenerationHyperParameters attribute of TextGenerationJobConfig when configuring your AutoMLProblemTypeConfig.
Autopilot allows for the setting of four common hyperparameters across all models.
- epochCount: Its value should be a string containing an integer value within the range of 1 to 10.
- batchSize: Its value should be a string containing an integer value within the range of 1 to 64.
- learningRate: Its value should be a string containing a floating-point value within the range of 0 to 1.
- learningRateWarmupSteps: Its value should be a string containing an integer value within the range of 0 to 250.
For more details on each hyperparameter, see Hyperparameters for optimizing the learning process of your text generation models.
The following JSON example shows a TextGenerationHyperParameters field passed to the TextGenerationJobConfig where all four hyperparameters are configured.
"AutoMLProblemTypeConfig": { "TextGenerationJobConfig": { "BaseModelName": "Falcon7B", "TextGenerationHyperParameters": {"epochCount":"5", "learningRate":"0.000001", "batchSize": "32", "learningRateWarmupSteps": "10"} } }