

# Preparing classifier training data
<a name="prep-classifier-data"></a>

For custom classification, you train the model in either multi-class mode or multi-label mode. Multi-class mode associates a single class with each document. Multi-label mode associates one or more classes with each document. The input file formats are different for each mode, so choose the mode to use before you create the training data. 

**Note**  
The Amazon Comprehend console refers to multi-class mode as single-label mode.

Custom classification supports models that you train with plain-text documents and models that you train with native documents (such as PDF, Word, or images). For more information about classifier models and their supported document types, see [Training classification models](training-classifier-model.md).

To prepare data to train a custom classifier model: 

1. Identify the classes that you want this classifier to analyze. Decide which mode to use (multi-class or multi-label).

1. Decide on the classifier model type, based on whether the model is for analyzing plain-text documents or semi-structured documents. 

1. Gather examples of documents for each of the classes. For minimum training requirements, see [General quotas for document classification](guidelines-and-limits.md#limits-class-general).

1. For a plain-text model, choose the training file format to use (CSV file or augmented manifest file). To train a native document model, you always use a CSV file. 

**Topics**
+ [Classifier training file formats](prep-class-data-format.md)
+ [Multi-class mode](prep-classifier-data-multi-class.md)
+ [Multi-label mode](prep-classifier-data-multi-label.md)

# Classifier training file formats
<a name="prep-class-data-format"></a>

For a plain-text model, you can provide classifier training data as a CSV file or as an augmented manifest file that you create using SageMaker AI Ground Truth. The CSV file or augmented manifest file includes the text for each training document, and its associated labels.

For a native document model, you provide classifier training data as a CSV file. The CSV file includes the file name for each training document, and its associated labels. You include the training documents in the Amazon S3 input folder for the training job.

## CSV files
<a name="prep-data-csv"></a>

You provide labeled training data as UTF-8 encoded text in a CSV file. Don't include a header row. Adding a header row in your file may cause runtime errors.

For each row in the CSV file, the first column contains one or more class labels. A class label can be any valid UTF-8 string. We recommend using clear class names that don't overlap in meaning. The name can include white space, and can consist of multiple words connected by underscores or hyphens.

Do not leave any space characters before or after the commas that separate the values in a row. 

The exact content of the CSV file depends on the classifier mode and the type of training data. For details, see the sections on [Multi-class mode](prep-classifier-data-multi-class.md) and [Multi-label mode](prep-classifier-data-multi-label.md).
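The spacing rule above can be checked before you upload a file. The following is a minimal sketch, not an official validator: the function name `find_spacing_problems` is hypothetical, and the check treats leading or trailing spaces in a parsed field as spaces next to a separating comma.

```python
import csv
import io


def find_spacing_problems(csv_text):
    """Return a message for each space found next to a separating comma."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        last = len(row) - 1
        for index, value in enumerate(row):
            # A space before a comma appears as trailing whitespace in a
            # field that is followed by another field.
            if index < last and value != value.rstrip(" "):
                problems.append(f"row {row_number}: space before comma")
            # A space after a comma appears as leading whitespace in any
            # field except the first.
            if index > 0 and value != value.lstrip(" "):
                problems.append(f"row {row_number}: space after comma")
    return problems
```

An empty list means that no spacing problems were found. The sketch doesn't detect header rows, because a header row can't be reliably distinguished from a training example.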

## Augmented manifest file
<a name="prep-data-annotations"></a>

An augmented manifest file is a labeled dataset that you create using SageMaker AI Ground Truth. Ground Truth is a data labeling service that helps you—or a workforce that you employ—to build training datasets for machine learning models. 

For more information about Ground Truth and the output that it produces, see [Use SageMaker AI Ground Truth to Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) in the *Amazon SageMaker AI Developer Guide*.

Augmented manifest files are in JSON lines format. In these files, each line is a complete JSON object that contains a training document and its associated labels. The exact content of each line depends on the classifier mode. For details, see the sections on [Multi-class mode](prep-classifier-data-multi-class.md) and [Multi-label mode](prep-classifier-data-multi-label.md).

When you provide your training data to Amazon Comprehend, you specify one or more label attribute names. How many attribute names you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.

If your file is the output of a single labeling job, specify the single label attribute name from the Ground Truth job. 

If your file is the output of a chained labeling job, specify the label attribute name for one or more jobs in the chain. Each label attribute name provides the annotations from an individual job. You can specify up to 5 of these attributes for augmented manifest files from chained labeling jobs. 

For more information about chained labeling jobs, and for examples of the output that they produce, see [Chaining Labeling Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-reusing-data.html) in the *Amazon SageMaker AI Developer Guide*.
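Before you choose label attribute names, it can help to inspect the manifest. The following sketch reads a JSON lines file and counts how many lines contain each candidate attribute name. The function names and the `MAX_ATTRIBUTE_NAMES` constant are hypothetical. The limit of 5 comes from the text above.

```python
import json

MAX_ATTRIBUTE_NAMES = 5  # upper limit stated above for chained labeling jobs


def read_manifest(manifest_text):
    """Parse JSON lines text: one complete JSON object per non-empty line."""
    return [json.loads(line) for line in manifest_text.split("\n") if line.strip()]


def count_attribute_usage(records, attribute_names):
    """Count how many records contain each label attribute name."""
    if not 1 <= len(attribute_names) <= MAX_ATTRIBUTE_NAMES:
        raise ValueError("specify between 1 and 5 label attribute names")
    return {
        name: sum(1 for record in records if name in record)
        for name in attribute_names
    }
```

A count that is lower than the number of records indicates that some lines have no annotation from that job.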

# Multi-class mode
<a name="prep-classifier-data-multi-class"></a>

In multi-class mode, classification assigns one class to each document. The individual classes are mutually exclusive. For example, you can classify a movie as comedy or science fiction, but not both. 

**Note**  
The Amazon Comprehend console refers to multi-class mode as single-label mode.

**Topics**
+ [Plain-text models](#prep-multi-class-plaintext)
+ [Native document models](#prep-multi-class-structured)

## Plain-text models
<a name="prep-multi-class-plaintext"></a>

To train a plain-text model, you can provide labeled training data as a CSV file or as an augmented manifest file from SageMaker AI Ground Truth.

### CSV file
<a name="prep-multi-class-plaintext-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a two-column CSV file. For each row, the first column contains the class label value. The second column contains an example text document for that class. Each row must end with `\n` or `\r\n` characters.

The following example shows a CSV file containing three documents.

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS,Text of document 3
```

The following example shows one row of a CSV file that trains a custom classifier to detect whether an email message is spam:

```
SPAM,"Paulo, your $1000 award is waiting for you! Claim it while you still can at http://example.com."
```
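One way to produce this format is with the Python standard `csv` module, which quotes any document text that contains a comma. The following is a minimal sketch. The function name `build_multi_class_csv` and the sample rows are illustrative.

```python
import csv
import io


def build_multi_class_csv(examples):
    """Build two-column CSV text from (class_label, document_text) pairs."""
    buffer = io.StringIO()
    # End each row with \n and write no header row.
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in examples:
        writer.writerow([label, text])
    return buffer.getvalue()
```

To save the result, open the output file with `encoding="utf-8"` and `newline=""` so that Python doesn't alter the row endings. The sketch doesn't handle document text that contains line breaks.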

### Augmented manifest file
<a name="prep-multi-class-plaintext-manifest"></a>

For general information about using augmented manifest files for training classifiers, see [Augmented manifest file](prep-class-data-format.md#prep-data-annotations).

For plain-text documents, each line of the augmented manifest file is a complete JSON object that contains a training document, a single class name, and other metadata from Ground Truth. The following example is an augmented manifest file for training a custom classifier to recognize spam email messages:

```
{"source":"Document 1 text", "MultiClassJob":0, "MultiClassJob-metadata":{"confidence":0.62, "job-name":"labeling-job/multiclassjob", "class-name":"not_spam", "human-annotated":"yes", "creation-date":"2020-05-21T17:36:45.814354", "type":"groundtruth/text-classification"}}
{"source":"Document 2 text", "MultiClassJob":1, "MultiClassJob-metadata":{"confidence":0.81, "job-name":"labeling-job/multiclassjob", "class-name":"spam", "human-annotated":"yes", "creation-date":"2020-05-21T17:37:51.970530", "type":"groundtruth/text-classification"}}
{"source":"Document 3 text", "MultiClassJob":1, "MultiClassJob-metadata":{"confidence":0.81, "job-name":"labeling-job/multiclassjob", "class-name":"spam", "human-annotated":"yes", "creation-date":"2020-05-21T17:37:51.970566", "type":"groundtruth/text-classification"}}
```

The following example shows one JSON object from the augmented manifest file, formatted for readability: 

```
{
   "source": "Paulo, your $1000 award is waiting for you! Claim it while you still can at http://example.com.",
   "MultiClassJob": 1,
   "MultiClassJob-metadata": {
       "confidence": 0.98,
       "job-name": "labeling-job/multiclassjob",
       "class-name": "spam",
       "human-annotated": "yes",
       "creation-date": "2020-05-21T17:36:45.814354",
       "type": "groundtruth/text-classification"
   }
}
```

In this example, the `source` attribute provides the text of the training document, and the `MultiClassJob` attribute assigns the index of a class from a classification list. The `job-name` attribute is the name that you defined for the labeling job in Ground Truth. 

When you start the classifier training job in Amazon Comprehend, you specify the same labeling job name. 
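To inspect a manifest before training, you can read these attributes from each line. The following sketch assumes the layout shown in the examples above, where the metadata object is named after the label attribute with a `-metadata` suffix. The function name `read_multi_class_record` is hypothetical.

```python
import json


def read_multi_class_record(line, attribute_name):
    """Return (document text, class index, class name) for one manifest line."""
    record = json.loads(line)
    # The metadata object uses the label attribute name plus "-metadata".
    metadata = record[f"{attribute_name}-metadata"]
    return record["source"], record[attribute_name], metadata["class-name"]
```

For the example manifest, you would pass `"MultiClassJob"` as the attribute name.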

## Native document models
<a name="prep-multi-class-structured"></a>

A native document model is a model that you train with native documents (such as PDF, DOCX, and images). You provide the training data as a CSV file.

### CSV file
<a name="prep-multi-class-structured-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a three-column CSV file. For each row, the first column contains the class label value. The second column contains the file name of an example document for this class. The third column contains the page number. The page number is optional if the example document is an image.

The following example shows a CSV file that references three input documents. 

```
CLASS,input-doc-1.pdf,3
CLASS,input-doc-2.docx,1
CLASS,input-doc-3.png
```

The following example shows one row of a CSV file that trains a custom classifier to detect whether an email message is spam. Page 2 of the PDF file contains the spam example. 

```
SPAM,email-content-3.pdf,2
```
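The following sketch builds rows in this format and omits the page number only for image files. The function names are hypothetical, and the `IMAGE_SUFFIXES` set is an assumption made for illustration, not a list of supported formats.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed image file extensions, for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def native_document_row(label, file_name, page=None):
    """Return the CSV fields for one native training document."""
    is_image = PurePosixPath(file_name).suffix.lower() in IMAGE_SUFFIXES
    if page is None:
        if not is_image:
            raise ValueError(f"page number required for {file_name}")
        # The page number is optional for images.
        return [label, file_name]
    return [label, file_name, str(page)]


def build_native_csv(rows):
    """Build CSV text from (label, file_name, page) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, file_name, page in rows:
        writer.writerow(native_document_row(label, file_name, page))
    return buffer.getvalue()
```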

# Multi-label mode
<a name="prep-classifier-data-multi-label"></a>

In multi-label mode, individual classes represent different categories that aren't mutually exclusive. Multi-label classification assigns one or more classes to each document. For example, you can classify one movie as Documentary, and another movie as Science fiction, Action, and Comedy. 

For training, multi-label mode supports up to 1 million examples containing up to 100 unique classes.

**Topics**
+ [Plain-text models](#prep-multi-label-plaintext)
+ [Native document models](#prep-multi-label-structured)

## Plain-text models
<a name="prep-multi-label-plaintext"></a>

To train a plain-text model, you can provide labeled training data as a CSV file or as an augmented manifest file from SageMaker AI Ground Truth.

### CSV file
<a name="prep-multi-label-plaintext-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a two-column CSV file. For each row, the first column contains the class label values, and the second column contains an example text document for these classes. To enter more than one class in the first column, use a delimiter (such as a \|) between each class.

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS|CLASS|CLASS,Text of document 3
```

The following example shows one row of a CSV file that trains a custom classifier to detect genres in movie abstracts:

```
COMEDY|MYSTERY|SCIENCE_FICTION|TEEN,"A band of misfit teens become unlikely detectives when they discover troubling clues about their high school English teacher. Could the strange Mrs. Doe be an alien from outer space?"
```

The default delimiter between class names is a pipe (\|). However, you can use a different character as a delimiter. The delimiter must be distinct from all characters in your class names. For example, if your classes are CLASS\_1, CLASS\_2, and CLASS\_3, the underscore (**\_**) is part of the class name. So don't use an underscore as the delimiter for separating class names.
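The delimiter rule is straightforward to enforce when you generate the label column. The following minimal sketch joins class names and rejects a delimiter that appears in any of them. The function name `join_labels` is hypothetical.

```python
def join_labels(labels, delimiter="|"):
    """Join class names for the first CSV column, checking the delimiter."""
    for label in labels:
        if delimiter in label:
            raise ValueError(
                f"delimiter {delimiter!r} appears in class name {label!r}"
            )
    return delimiter.join(labels)
```

For example, `join_labels(["COMEDY", "SCIENCE_FICTION"])` produces the label column for a movie with two genres, and `join_labels(["CLASS_1", "CLASS_2"], delimiter="_")` raises an error.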

### Augmented manifest file
<a name="prep-multi-label-plaintext-manifest"></a>

For general information about using augmented manifest files for training classifiers, see [Augmented manifest file](prep-class-data-format.md#prep-data-annotations).

For plain-text documents, each line of the augmented manifest file is a complete JSON object. It contains a training document, class names, and other metadata from Ground Truth. The following example is an augmented manifest file for training a custom classifier to detect genres in movie abstracts:

```
{"source":"Document 1 text", "MultiLabelJob":[0,4], "MultiLabelJob-metadata":{"job-name":"labeling-job/multilabeljob", "class-map":{"0":"action", "4":"drama"}, "human-annotated":"yes", "creation-date":"2020-05-21T19:02:21.521882", "confidence-map":{"0":0.66}, "type":"groundtruth/text-classification-multilabel"}}
{"source":"Document 2 text", "MultiLabelJob":[3,6], "MultiLabelJob-metadata":{"job-name":"labeling-job/multilabeljob", "class-map":{"3":"comedy", "6":"horror"}, "human-annotated":"yes", "creation-date":"2020-05-21T19:00:01.291202", "confidence-map":{"1":0.61,"0":0.61}, "type":"groundtruth/text-classification-multilabel"}}
{"source":"Document 3 text", "MultiLabelJob":[1], "MultiLabelJob-metadata":{"job-name":"labeling-job/multilabeljob", "class-map":{"1":"action"}, "human-annotated":"yes", "creation-date":"2020-05-21T18:58:51.662050", "confidence-map":{"1":0.68}, "type":"groundtruth/text-classification-multilabel"}}
```

The following example shows one JSON object from the augmented manifest file, formatted for readability: 

```
{
      "source": "A band of misfit teens become unlikely detectives when they discover troubling clues about their high school English teacher. Could the strange Mrs. Doe be an alien from outer space?",
      "MultiLabelJob": [
          3,
          8,
          10,
          11
      ],
      "MultiLabelJob-metadata": {
          "job-name": "labeling-job/multilabeljob",
          "class-map": {
              "3": "comedy",
              "8": "mystery",
              "10": "science_fiction",
              "11": "teen"
          },
          "human-annotated": "yes",
          "creation-date": "2020-05-21T19:00:01.291202",
          "confidence-map": {
              "3": 0.95,
              "8": 0.77,
              "10": 0.83,
              "11": 0.92
          },
          "type": "groundtruth/text-classification-multilabel"
      }
  }
```

In this example, the `source` attribute provides the text of the training document, and the `MultiLabelJob` attribute assigns the indexes of several classes from a classification list. The `job-name` attribute in the `MultiLabelJob-metadata` object is the name that you defined for the labeling job in Ground Truth. 
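To see which class names a line assigns, you can resolve the indexes through the `class-map` object. The following sketch assumes the layout shown in the examples above. The function name `read_multi_label_record` is hypothetical.

```python
import json


def read_multi_label_record(line, attribute_name):
    """Return (document text, list of class names) for one manifest line."""
    record = json.loads(line)
    class_map = record[f"{attribute_name}-metadata"]["class-map"]
    # The label attribute holds integer indexes; class-map keys are strings.
    names = [class_map[str(index)] for index in record[attribute_name]]
    return record["source"], names
```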

## Native document models
<a name="prep-multi-label-structured"></a>

A native document model is a model that you train with native documents (such as PDF, DOCX, and image files). You provide labeled training data as a CSV file.

### CSV file
<a name="prep-multi-label-structured-csv"></a>

For general information about using CSV files for training classifiers, see [CSV files](prep-class-data-format.md#prep-data-csv).

Provide the training data as a three-column CSV file. For each row, the first column contains the class label values. The second column contains the file name of an example document for these classes. The third column contains the page number. The page number is optional if the example document is an image.

To enter more than one class in the first column, use a delimiter (such as a \|) between each class.

```
CLASS,input-doc-1.pdf,3
CLASS,input-doc-2.docx,1
CLASS|CLASS|CLASS,input-doc-3.png,2
```

The following example shows one row of a CSV file that trains a custom classifier to detect genres in movie abstracts. Page 2 of the PDF file contains the example of a comedy/teen movie.

```
COMEDY|TEEN,movie-summary-1.pdf,2
```

The default delimiter between class names is a pipe (\|). However, you can use a different character as a delimiter. The delimiter must be distinct from all characters in your class names. For example, if your classes are CLASS\_1, CLASS\_2, and CLASS\_3, the underscore (**\_**) is part of the class name. So don't use an underscore as the delimiter for separating class names.
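To review an existing training file, you can read it back into labels, file names, and page numbers. The following sketch assumes the three-column layout described above, with the page number omitted for some rows. The function name `parse_native_multi_label_csv` is hypothetical.

```python
import csv
import io


def parse_native_multi_label_csv(csv_text, delimiter="|"):
    """Return a list of (class names, file name, page number or None)."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip blank lines
        labels = fields[0].split(delimiter)
        file_name = fields[1]
        # The third column is absent when the page number is omitted.
        page = int(fields[2]) if len(fields) > 2 and fields[2] else None
        rows.append((labels, file_name, page))
    return rows
```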