Create an AutoML job to fine-tune text generation models using the API

Large language models (LLMs) excel at a wide range of generative tasks, including text generation, summarization, completion, and question answering. Their performance can be attributed to their significant size and extensive training on diverse datasets and tasks. However, specific domains, such as healthcare and financial services, may require customized fine-tuning to adapt to unique data and use cases. By tailoring training to a particular domain, LLMs can improve their performance and provide more accurate outputs for targeted applications.

Autopilot can fine-tune a selection of pre-trained generative text models. In particular, it supports instruction-based fine-tuning of general-purpose large language models (LLMs) powered by JumpStart.

Note

The text generation models that support fine-tuning in Autopilot are currently available only in AWS Regions supported by SageMaker Canvas. See the SageMaker Canvas documentation for the full list of supported Regions.

Fine-tuning a pre-trained model requires a specific dataset of clear instructions that guide the model on how to generate output or behave for that task. The model learns from the dataset, adjusting its parameters to conform to the provided instructions. Instruction-based fine-tuning involves using labeled examples formatted as prompt-response pairs and phrased as instructions. For more information about fine-tuning, see Fine-tune a foundation model.

The following guidelines outline the process of creating an Amazon SageMaker Autopilot job as a pilot experiment to fine-tune text generation LLMs using the SageMaker AI API.

Note

Tasks such as text and image classification, time-series forecasting, and fine-tuning of large language models are exclusively available through version 2 of the AutoML REST API. If your language of choice is Python, you can use the AWS SDK for Python (Boto3) or the AutoMLV2 object of the Amazon SageMaker Python SDK directly.

Users who prefer the convenience of a user interface can use Amazon SageMaker Canvas to access pre-trained models and generative AI foundation models, or to create custom models tailored for text classification, image classification, forecasting, or generative AI needs.

To create an Autopilot experiment programmatically for fine-tuning an LLM, you can call the CreateAutoMLJobV2 API in any language supported by Amazon SageMaker Autopilot or the AWS CLI.

For information about how this API action translates into a function in the language of your choice, see the See Also section of CreateAutoMLJobV2 and choose an SDK. As an example, for Python users, see the full request syntax of create_auto_ml_job_v2 in AWS SDK for Python (Boto3).

Note

Autopilot fine-tunes large language models without requiring multiple candidates to be trained and evaluated. Instead, using your dataset, Autopilot directly fine-tunes your target model to optimize a default objective metric, the cross-entropy loss. Fine-tuning language models in Autopilot does not require setting the AutoMLJobObjective field.

Once your LLM is fine-tuned, you can evaluate its performance by accessing various ROUGE scores through the BestCandidate returned by a DescribeAutoMLJobV2 API call. The model also reports its training and validation loss as well as perplexity. For a comprehensive list of metrics for evaluating the quality of the text generated by the fine-tuned models, see Metrics for fine-tuning large language models in Autopilot.
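For example, the following minimal sketch retrieves those metrics with the AWS SDK for Python (Boto3) once the job completes. The job name is a placeholder, and the exact metric names returned depend on the model and dataset.

import boto3

sagemaker = boto3.client("sagemaker")

# "my-fine-tuning-job" is a placeholder; use the AutoMLJobName from your request.
response = sagemaker.describe_auto_ml_job_v2(AutoMLJobName="my-fine-tuning-job")
best_candidate = response["BestCandidate"]

# Each metric entry carries a name (for example, a ROUGE variant, perplexity,
# or training/validation cross-entropy loss), a value, and the dataset split.
for metric in best_candidate["CandidateProperties"]["CandidateMetrics"]:
    print(metric["MetricName"], metric["Value"], metric.get("Set"))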

Prerequisites

Before using Autopilot to create a fine-tuning experiment in SageMaker AI, make sure to take the following steps:

  • (Optional) Choose the pre-trained model you want to fine-tune.

    For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see Supported large language models for fine-tuning. The selection of a model is not mandatory; if no model is specified, Autopilot automatically defaults to the model Falcon7BInstruct.

  • Create a dataset of instructions. See Dataset file types and input data format to learn about the format requirements for your instruction-based dataset.

  • Place your dataset in an Amazon S3 bucket.

  • Grant the SageMaker AI execution role used to run your experiment full access to the Amazon S3 bucket containing your input data.

    • For information on retrieving your SageMaker AI execution role, see Get your execution role.

    • For information on granting your SageMaker AI execution role permissions to access one or more specific buckets in Amazon S3, see Add Additional Amazon S3 Permissions to a SageMaker AI Execution Role in Create execution role.

  • Additionally, grant your execution role the permissions needed to access the default Amazon S3 bucket used by JumpStart. This access is required for storing and retrieving pre-trained model artifacts in JumpStart. To grant access to this Amazon S3 bucket, create a new inline custom policy on your execution role.

    Here's an example policy that you can use in your JSON editor when configuring AutoML fine-tuning jobs in us-west-2:

    JumpStart's bucket names follow a predetermined pattern that depends on the AWS Region. Adjust the bucket name for your Region accordingly.

    { "Sid": "Statement1", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::jumpstart-cache-prod-us-west-2", "arn:aws:s3:::jumpstart-cache-prod-us-west-2/*" ] }

Once this is done, you can use the ARN of this execution role in Autopilot API requests.

Required parameters

When calling CreateAutoMLJobV2 to create an Autopilot experiment for LLM fine-tuning, you must provide the following values:

  • An AutoMLJobName to specify the name of your job. The name should be of type string, and should have a minimum length of 1 character and a maximum length of 32.

  • At least one AutoMLJobChannel of the training type within the AutoMLJobInputDataConfig. This channel specifies the Amazon S3 location of your fine-tuning dataset. You have the option to define a validation channel. If no validation channel is provided and a ValidationFraction is configured in the AutoMLDataSplitConfig, this fraction is used to randomly divide the training dataset into training and validation sets. Additionally, you can specify the content type (CSV or Parquet files) of the dataset.

  • An AutoMLProblemTypeConfig of type TextGenerationJobConfig to configure the settings of your training job.

    In particular, you can specify the name of the base model to fine-tune in the BaseModelName field. For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see Supported large language models for fine-tuning.

  • An OutputDataConfig to specify the Amazon S3 output path to store the artifacts of your AutoML job.

  • A RoleArn to specify the ARN of the role used to access your data.

The following is an example of the full request format used when making an API call to CreateAutoMLJobV2 for fine-tuning a Falcon7BInstruct model.

{ "AutoMLJobName": "<job_name>", "AutoMLJobInputDataConfig": [ { "ChannelType": "training", "CompressionType": "None", "ContentType": "text/csv", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<bucket_name>/<input_data>.csv" } } } ], "OutputDataConfig": { "S3OutputPath": "s3://<bucket_name>/output", "KmsKeyId": "arn:aws:kms:<region>:<account_id>:key/<key_value>" }, "RoleArn":"arn:aws:iam::<account_id>:role/<sagemaker_execution_role_name>", "AutoMLProblemTypeConfig": { "TextGenerationJobConfig": { "BaseModelName": "Falcon7BInstruct" } } }

All other parameters are optional.
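If you work in Python, the equivalent request can be submitted with the AWS SDK for Python (Boto3). The following is a minimal sketch; the job name, bucket, and role ARN are placeholder values, and the optional KmsKeyId is omitted.

import boto3

sagemaker = boto3.client("sagemaker")

# Placeholder names and ARNs; substitute your own values.
sagemaker.create_auto_ml_job_v2(
    AutoMLJobName="my-fine-tuning-job",
    AutoMLJobInputDataConfig=[
        {
            "ChannelType": "training",
            "CompressionType": "None",
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/input-data.csv",
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output"},
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
    AutoMLProblemTypeConfig={
        "TextGenerationJobConfig": {"BaseModelName": "Falcon7BInstruct"}
    },
)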

Optional parameters

The following sections provide details of some optional parameters that you can pass to your fine-tuning AutoML job.

You can provide your own validation dataset and custom data split ratio, or let Autopilot split the dataset automatically.

Each AutoMLJobChannel object (see the required parameter AutoMLJobInputDataConfig) has a ChannelType, which can be set to training or validation to specify how the data is used when building a machine learning model.

At least one data source must be provided and a maximum of two data sources is allowed: one for training data and one for validation data. How you split the data into training and validation datasets depends on whether you have one or two data sources.

  • If you only have one data source, the ChannelType is set to training by default and must have this value.

    • If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data from this source is used for validation by default.

    • If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.

  • If you have two data sources, the ChannelType of one of the AutoMLJobChannel objects must be set to training (the default value), and the ChannelType of the other data source must be set to validation. The two data sources must have the same format, either CSV or Parquet, and the same schema. Do not set the ValidationFraction value in this case, because all of the data from each source is used for either training or validation; setting this value causes an error. An example with two channels follows this list.
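The following sketch shows what a hypothetical AutoMLJobInputDataConfig with separate training and validation channels might look like; the bucket and file names are placeholders.

# Two channels: all data from each source is used as-is, so ValidationFraction
# must not be set when both channels are present.
auto_ml_job_input_data_config = [
    {
        "ChannelType": "training",
        "CompressionType": "None",
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train.csv",
            }
        },
    },
    {
        "ChannelType": "validation",
        "CompressionType": "None",
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/validation.csv",
            }
        },
    },
]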

With Autopilot, you can automatically deploy your fine-tuned model to an endpoint. To enable automatic deployment, include a ModelDeployConfig in the AutoML job request. This allows the deployment of your fine-tuned model to a SageMaker AI endpoint. You can either let SageMaker AI generate the endpoint name automatically (AutoGenerateEndpointName) or specify your own endpoint name (EndpointName).
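For instance, a minimal ModelDeployConfig might look like the following sketch; the endpoint name shown is a placeholder.

# Ask SageMaker AI to generate an endpoint name for the deployed model.
model_deploy_config = {"AutoGenerateEndpointName": True}

# Alternatively, provide your own endpoint name (do not combine this with
# AutoGenerateEndpointName set to True); the name below is a placeholder.
# model_deploy_config = {"EndpointName": "my-fine-tuned-endpoint"}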

For models requiring the acceptance of an end-user license agreement (EULA) before fine-tuning, you can accept the EULA by setting the AcceptEula attribute of the ModelAccessConfig to True in TextGenerationJobConfig when configuring your AutoMLProblemTypeConfig.
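As an illustration, the following sketch accepts the EULA for a base model that requires it. The model name shown is illustrative; check the list of supported large language models for the exact BaseModelName values.

# A EULA is required for certain model families; the name below is illustrative.
text_generation_job_config = {
    "BaseModelName": "Llama2-7BChat",
    "ModelAccessConfig": {"AcceptEula": True},
}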

You can optimize the learning process of your text generation model by setting hyperparameter values in the TextGenerationHyperParameters attribute of TextGenerationJobConfig when configuring your AutoMLProblemTypeConfig.

Autopilot supports setting four common hyperparameters across all models.

  • epochCount: Its value should be a string containing an integer value within the range of 1 to 10.

  • batchSize: Its value should be a string containing an integer value within the range of 1 to 64.

  • learningRate: Its value should be a string containing a floating-point value within the range of 0 to 1.

  • learningRateWarmupSteps: Its value should be a string containing an integer value within the range of 0 to 250.

For more details on each hyperparameter, see Hyperparameters for optimizing the learning process of your text generation models.

The following JSON example shows a TextGenerationHyperParameters field passed to the TextGenerationJobConfig where all four hyperparameters are configured.

"AutoMLProblemTypeConfig": { "TextGenerationJobConfig": { "BaseModelName": "Falcon7B", "TextGenerationHyperParameters": {"epochCount":"5", "learningRate":"0.000001", "batchSize": "32", "learningRateWarmupSteps": "10"} } }