Customize your workflow using the fmeval library
You can customize your model evaluation to use a model that is not a JumpStart or Amazon Bedrock model, or to use a custom evaluation workflow. If you use your own model, you have to create a custom `ModelRunner`. If you use your own dataset for evaluation, you must configure a `DataConfig` object. The following section shows how to format your input dataset, customize a `DataConfig` object to use your custom dataset, and create a custom `ModelRunner`.
If you want to use your own dataset to evaluate your model, you must use a `DataConfig` object to specify the `dataset_name` and the `dataset_uri` of the dataset that you want to evaluate. If you use a built-in dataset, the `DataConfig` object is already configured as the default for the evaluation algorithms.
You can use one custom dataset each time that you use the `evaluate` function. To evaluate multiple datasets, invoke `evaluate` once for each dataset.
Configure a custom dataset with your model request specified in the question column and the target answer specified in the answer column, as follows:
```python
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES

config = DataConfig(
    dataset_name="tiny_dataset",
    dataset_uri="tiny_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)
```
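The `dataset_uri` in the previous example points to a JSON Lines file, where each line is a JSON object whose keys match the `*_location` column names in the `DataConfig`. The following sketch writes a matching `tiny_dataset.jsonl`; the question and answer values are hypothetical placeholders, not part of the library.

```python
import json

# Hypothetical records: the keys match model_input_location ("question")
# and target_output_location ("answer") in the DataConfig above.
records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

with open("tiny_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```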
The `DataConfig` class contains the following parameters:
- `dataset_name` – The name of the dataset that you want to use to evaluate your LLM.
- `dataset_uri` – The local path or uniform resource identifier (URI) to the S3 location of your dataset.
- `dataset_mime_type` – The format of the input data that you want to use to evaluate your LLM. The FMEval library supports both `MIME_TYPE_JSON` and `MIME_TYPE_JSONLINES`.
- `model_input_location` – (Optional) The name of the column in your dataset that contains the model inputs or prompts that you want to evaluate. Use a `model_input_location` that specifies the name of your column. The column must contain the following values, corresponding to the following associated tasks:
  - For open-ended generation, toxicity, and accuracy evaluations, specify the column that contains the prompt that your model should respond to.
  - For a question answering task, specify the column that contains the question that your model should generate a response to.
  - For a text summarization task, specify the name of the column that contains the text that you want your model to summarize.
  - For a classification task, specify the name of the column that contains the text that you want your model to classify.
  - For factual knowledge evaluations, specify the name of the column that contains the question that you want the model to predict the answer to.
  - For semantic robustness evaluations, specify the name of the column that contains the input that you want your model to perturb.
  - For prompt stereotyping evaluations, use the `sent_more_input_location` and `sent_less_input_location` parameters instead of `model_input_location`, as described in the following parameters.
- `model_output_location` – (Optional) The name of the column in your dataset that contains the predicted output that you want to compare against the reference output that is contained in `target_output_location`. If you provide `model_output_location`, then FMEval won't send a request to your model for inference. Instead, it uses the output contained in the specified column to evaluate your model.
- `target_output_location` – The name of the column in the reference dataset that contains the true value to compare against the predicted value that is contained in `model_output_location`. Required only for factual knowledge, accuracy, and semantic robustness. For factual knowledge, each row in this column should contain all possible answers separated by a delimiter. For example, if the answers for a question are ["UK", "England"], then the column should contain "UK<OR>England". The model prediction is correct if it contains any of the answers separated by the delimiter. See the example following this list.
- `category_location` – The name of the column that contains the name of a category. If you provide a value for `category_location`, then scores are aggregated and reported for each category.
- `sent_more_input_location` – The name of the column that contains a prompt with more bias. Required only for prompt stereotyping. Avoid unconscious bias. For bias examples, see the CrowS-Pairs dataset.
- `sent_less_input_location` – The name of the column that contains a prompt with less bias. Required only for prompt stereotyping. Avoid unconscious bias. For bias examples, see the CrowS-Pairs dataset.
- `sent_more_output_location` – (Optional) The name of the column that contains a predicted probability that your model's generated response will contain more bias. This parameter is only used in prompt stereotyping tasks.
- `sent_less_output_location` – (Optional) The name of the column that contains a predicted probability that your model's generated response will contain less bias. This parameter is only used in prompt stereotyping tasks.
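To illustrate the `target_output_location` delimiter and `category_location` aggregation described in the previous list, the following sketch configures a hypothetical factual knowledge dataset. The file name, column names, and record contents are placeholders; only the `DataConfig` parameters themselves come from the library.

```python
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES

# Each line of the hypothetical fact_dataset.jsonl is assumed to look like:
# {"question": "London is the capital of", "answers": "UK<OR>England", "topic": "Capitals"}
fact_config = DataConfig(
    dataset_name="fact_dataset",
    dataset_uri="fact_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",  # all acceptable answers, separated by <OR>
    category_location="topic",         # scores are also aggregated per topic
)
```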
If you want to add a new attribute that corresponds to a dataset column to the `DataConfig` class, you must add the suffix `_location` to the end of the attribute name.
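For example, if your dataset has a column of retrieved context that your own evaluation logic needs, you could expose it through a hypothetical `context_location` attribute. The following is a minimal sketch that assumes `DataConfig` is a Python dataclass that can be subclassed this way; a new attribute like this is only useful if your own evaluation code reads the corresponding column.

```python
from dataclasses import dataclass
from typing import Optional

from fmeval.data_loaders.data_config import DataConfig


@dataclass
class ContextDataConfig(DataConfig):
    # Hypothetical extra column; the attribute name ends with the
    # required _location suffix.
    context_location: Optional[str] = None
```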
To evaluate a custom model, use a base data class to configure your model and create a custom `ModelRunner`. Then, you can use this `ModelRunner` to evaluate any language model. Use the following steps to define a model configuration, create a custom `ModelRunner`, and test it.
The `ModelRunner` interface has one abstract method, as follows:

```python
def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]
```
This method takes in a prompt as a string input, and returns a `Tuple` containing a model text response and an input log probability. Every `ModelRunner` must implement a `predict` method.
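Before working through the Hugging Face example that follows, it may help to see the smallest possible implementation of this interface. The following stub is hypothetical and not a useful model; it only illustrates the `predict` contract, returning a fixed text response and `None` for the log probability (the second element of the tuple is optional).

```python
from typing import Optional, Tuple

from fmeval.model_runners.model_runner import ModelRunner


class EchoModelRunner(ModelRunner):
    """Hypothetical stub that echoes the prompt instead of calling a real model."""

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Return (generated_text, input_log_probability); None is allowed
        # when no log probability is available.
        return f"You asked: {prompt}", None
```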
Create a custom ModelRunner
1. Define a model configuration.

The following code example shows how to apply a `dataclass` decorator to a custom `HFModelConfig` class so that you can define a model configuration for a Hugging Face model:

```python
from dataclasses import dataclass

@dataclass
class HFModelConfig:
    model_name: str
    max_new_tokens: int
    seed: int = 0
    remove_prompt_from_generated_text: bool = True
```
In the previous code example, the following applies:

- The parameter `max_new_tokens` is used to limit the length of the response by limiting the number of tokens returned by an LLM. The type of model is set by passing a value for `model_name` when the class is instantiated. In this example, the model name is set to `gpt2`, as shown at the end of this section. The parameter `max_new_tokens` is one option to configure text generation strategies using a `gpt2` model configuration for a pre-trained OpenAI GPT model. See AutoConfig for other model types.
- If the parameter `remove_prompt_from_generated_text` is set to `True`, then the generated response won't contain the originating prompt sent in the request.

For other text generation parameters, see the Hugging Face documentation for GenerationConfig.
2. Create a custom `ModelRunner` and implement a `predict` method.

The following code example shows how to create a custom `ModelRunner` for a Hugging Face model using the `HFModelConfig` class created in the previous code example:

```python
import warnings
from typing import Tuple, Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from fmeval.model_runners.model_runner import ModelRunner


class HuggingFaceCausalLLMModelRunner(ModelRunner):
    def __init__(self, model_config: HFModelConfig):
        self.config = model_config
        self.model = AutoModelForCausalLM.from_pretrained(self.config.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name)

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        input_ids = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        generations = self.model.generate(
            **input_ids,
            max_new_tokens=self.config.max_new_tokens,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        generation_contains_input = (
            input_ids["input_ids"][0] == generations[0][: input_ids["input_ids"].shape[1]]
        ).all()
        if self.config.remove_prompt_from_generated_text and not generation_contains_input:
            warnings.warn(
                "Your model does not return the prompt as part of its generations. "
                "`remove_prompt_from_generated_text` does nothing."
            )
        if self.config.remove_prompt_from_generated_text and generation_contains_input:
            output = self.tokenizer.batch_decode(generations[:, input_ids["input_ids"].shape[1] :])[0]
        else:
            output = self.tokenizer.batch_decode(generations, skip_special_tokens=True)[0]

        with torch.inference_mode():
            input_ids = self.tokenizer(self.tokenizer.bos_token + prompt, return_tensors="pt")["input_ids"]
            model_output = self.model(input_ids, labels=input_ids)
            probability = -model_output[0].item()

        return output, probability
```
The previous code uses a custom `HuggingFaceCausalLLMModelRunner` class that inherits properties from the FMEval `ModelRunner` class. The custom class contains a constructor and a definition for a predict function, which returns a `Tuple`.

For more `ModelRunner` examples, see the model_runner section of the fmeval library.

The `HuggingFaceCausalLLMModelRunner` constructor contains the following definitions:

- The configuration is set to `HFModelConfig`, defined at the beginning of this section.
- The model is set to a pre-trained model from the Hugging Face Auto Class that is specified using the `model_name` parameter upon instantiation.
- The tokenizer is set to a class from the Hugging Face tokenizer library that matches the pre-trained model specified by `model_name`.
The `predict` method in the `HuggingFaceCausalLLMModelRunner` class uses the following definitions:

- `input_ids` – A variable that contains the input for your model. The model generates the input as follows:
  - A tokenizer converts the request contained in `prompt` into token identifiers (IDs). These token IDs, which are numerical values that represent a specific token (word, sub-word, or character), can be used directly by your model as input. The token IDs are returned as PyTorch tensor objects, as specified by `return_tensors="pt"`. For other return tensor types, see the Hugging Face documentation for apply_chat_template.
  - The token IDs are sent to the device where the model is located so that they can be used by the model.
- `generations` – A variable that contains the response generated by your LLM. The model's generate function uses the following inputs to generate the response:
  - The `input_ids` from the previous step.
  - The parameter `max_new_tokens` specified in `HFModelConfig`.
  - A `pad_token_id` that adds an end of sentence (eos) token to the response. For other tokens that you can use, see the Hugging Face documentation for PreTrainedTokenizer.
- `generation_contains_input` – A boolean variable that is `True` when the generated response includes the input prompt in its response, and `False` otherwise. The return value is calculated using an element-wise comparison between the following:
  - All of the token IDs in the input prompt that are contained in `input_ids["input_ids"][0]`.
  - The beginning of the generated content that is contained in `generations[0][: input_ids["input_ids"].shape[1]]`.

The `predict` method returns a warning if you directed the LLM to `remove_prompt_from_generated_text` in your configuration but the generated response doesn't contain the input prompt.

The output from the `predict` method contains a string returned by the `batch_decode` method, which converts the token IDs returned in the response into human-readable text. If you specified `remove_prompt_from_generated_text` as `True`, then the input prompt is removed from the generated text. If you specified `remove_prompt_from_generated_text` as `False`, the generated text is returned without any special tokens that you included in the dictionary `special_token_dict`, as specified by `skip_special_tokens=True`.
3. Test your `ModelRunner`. Send a sample request to your model.

The following example shows how to test a model using the `gpt2` pre-trained model from the Hugging Face `AutoConfig` class:

```python
hf_config = HFModelConfig(model_name="gpt2", max_new_tokens=32)
model = HuggingFaceCausalLLMModelRunner(model_config=hf_config)
```

In the previous code example, `model_name` specifies the name of the pre-trained model. The `HFModelConfig` class is instantiated as `hf_config` with a value for the parameter `max_new_tokens`, and used to initialize the `ModelRunner`.

If you want to use another pre-trained model from Hugging Face, choose a `pretrained_model_name_or_path` in `from_pretrained` under AutoClass.

Lastly, test your `ModelRunner`. Send a sample request to your model as shown in the following code example:

```python
model_output = model.predict("London is the capital of?")[0]
print(model_output)

eval_algo.evaluate_sample()
```
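The `eval_algo.evaluate_sample()` call in the previous example assumes that an evaluation algorithm has already been instantiated. The following sketch shows one way the pieces of this section can fit together, using the `FactualKnowledge` algorithm from the fmeval library with the `ModelRunner` and `DataConfig` defined earlier. The `evaluate` and `evaluate_sample` calls and the `$model_input` prompt template shown here are based on the fmeval API as the author understands it, so verify the exact signatures against the current fmeval documentation; the target answer string is a placeholder.

```python
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

# Assumed API: FactualKnowledge checks whether the model output contains any
# of the delimiter-separated target answers.
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

# Score a single prompt/response pair with the ModelRunner defined above.
model_output = model.predict("London is the capital of?")[0]
print(eval_algo.evaluate_sample(target_output="UK<OR>England", model_output=model_output))

# Run the full evaluation over the custom dataset configured earlier. The
# $model_input placeholder in the prompt template is replaced with the value
# from model_input_location for each dataset record.
eval_output = eval_algo.evaluate(
    model=model,
    dataset_config=config,
    prompt_template="$model_input",
    save=True,  # also write per-record results to disk
)
print(eval_output)
```

Because `evaluate` accepts one `dataset_config` per call, invoke it once for each custom dataset that you want to evaluate.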