

# Training classification models
<a name="training-classifier-model"></a>

To train a model for custom classification, you define the categories and provide example documents to train the custom model. You train the model in either multi-class or multi-label mode. Multi-class mode associates a single class with each document. Multi-label mode associates one or more classes with each document.

Custom classification supports two types of classifier models: plain-text models and native document models. A plain-text model classifies documents based on their text content alone. A native document model also classifies documents based on text content, but it can additionally use other signals, such as the layout of the document. You train a native document model with native documents so that the model learns the layout information. 

Plain-text models have the following characteristics: 
+ You train the model using UTF-8 encoded text documents. 
+ You can train the model using documents in one of the following languages: English, Spanish, German, Italian, French, or Portuguese. 
+ The training documents for a given classifier must all use the same language. 
+ Training documents are plain text, so there are no additional charges for text extraction. 

Native document models have the following characteristics: 
+ You train the model using semi-structured documents, which include the following document types:
  + Digital and scanned PDF documents.
  + Word documents (DOCX).
  + Images: JPG files, PNG files, and single-page TIFF files.
  + Textract API output JSON files.
+ You train the model using English documents. 
+ If your training documents include scanned document files, you incur additional charges for text extraction. See the [Amazon Comprehend Pricing](https://aws.amazon.com/comprehend/pricing) page for details. 

You can classify any of the supported document types using either type of model. However, for the most accurate results, we recommend using a plain-text model to classify plain-text documents and a native document model to classify semi-structured documents.

**Topics**
+ [Train custom classifiers (console)](create-custom-classifier-console.md)
+ [Train custom classifiers (API)](train-custom-classifier-api.md)
+ [Test the training data](testing-the-model.md)
+ [Classifier training output](train-classifier-output.md)
+ [Custom classifier metrics](cer-doc-class.md)

# Train custom classifiers (console)
<a name="create-custom-classifier-console"></a>

You can create and train a custom classifier using the console, and then use the custom classifier to analyze your documents.

To train a custom classifier, you need a set of training documents. You label these documents with the categories that you want the document classifier to recognize. For information about preparing your training documents, see [Preparing classifier training data](prep-classifier-data.md).



**To create and train a document classifier model**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/)

1. From the left menu, choose **Customization** and then choose **Custom Classification**.

1. Choose **Create new model**.

1. Under **Model settings**, enter a model name for the classifier. The name must be unique within your account and current Region.

   (Optional) Enter a version name. The name must be unique within your account and current Region.

1. Select the language of the training documents. To see the languages that classifiers support, see [Training classification models](training-classifier-model.md). 

1. (Optional) If you want to encrypt the data in the storage volume while Amazon Comprehend processes your training job, choose **Classifier encryption**. Then choose whether to use a KMS key associated with your current account, or one from another account.
   + If you are using a key associated with the current account, choose the key ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key ID under **KMS key ARN**.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [AWS Key Management Service (AWS KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Data specifications**, choose the **Training model type** to use.
   + **Plain text documents:** Choose this option to create a plain text model. Train the model using plain text documents.
   + **Native documents:** Choose this option to create a native document model. Train the model using native documents (PDF, Word, images). 

1. Choose the **Data format** of your training data. For information about the data formats, see [Classifier training file formats](prep-class-data-format.md).
   + **CSV file:** Choose this option if your training data uses the CSV file format.
   + **Augmented manifest:** Choose this option if you used Ground Truth to create augmented manifest files for your training data. This format is available if you chose **Plain text documents** as the training model type.

1. Choose the **Classifier mode** to use.
   + **Single-label mode:** Choose this mode if the categories you're assigning to documents are mutually exclusive and you're training your classifier to assign one label to each document. In the Amazon Comprehend API, single-label mode is known as multi-class mode.
   + **Multi-label mode:** Choose this mode if multiple categories can be applied to a document at the same time, and you are training your classifier to assign one or more labels to each document. 

1. If you choose **Multi-label mode**, you can select the **Delimiter for labels**. Use this delimiter character to separate labels when there are multiple classes for a training document. The default delimiter is the pipe character.

1. (Optional) If you chose **Augmented manifest** as the data format, you can input up to five augmented manifest files. Each augmented manifest file contains either a training dataset or a test dataset. You must provide at least one training dataset. Test datasets are optional. Use the following steps to configure the augmented manifest files:

   1. Under **Training and test dataset**, expand the **Input location** panel.

   1. In **Dataset type**, choose **Training data** or **Test data**.

   1. For the **SageMaker AI Ground Truth augmented manifest file S3 location**, enter the location of the Amazon S3 bucket that contains the manifest file or navigate to it by choosing **Browse S3**. The IAM role that you're using for access permissions for the training job must have read permissions for the S3 bucket. 

   1. For the **Attribute names**, enter the name of the attribute that contains your annotations. If the file contains annotations from multiple chained labeling jobs, add an attribute for each job.

   1. To add another input location, choose **Add input location** and then configure the next location.

1. (Optional) If you chose **CSV file** as the data format, use the following steps to configure the training dataset and optional test dataset:

   1. Under **Training dataset**, enter the location of the Amazon S3 bucket that contains your training data CSV file or navigate to it by choosing **Browse S3**. The IAM role that you're using for access permissions for the training job must have read permissions for the S3 bucket. 

      (Optional) If you chose **Native documents** as the training model type, you also provide the URL of the Amazon S3 folder that contains the training example files.

   1. Under **Test dataset**, select whether you are providing extra data for Amazon Comprehend to test the trained model.
      + **Autosplit**: Autosplit automatically selects 10% of your training data to reserve for use as testing data.
      + (Optional) **Customer provided**: Enter the URL of the test data CSV file in Amazon S3. You can also navigate to its location in Amazon S3 and choose **Select folder**.

        (Optional) If you chose **Native documents** as the training model type, you also provide the URL of the Amazon S3 folder that contains the test files.

1. (Optional) For **Document read mode**, you can override the default text extraction actions. This option isn't required for plain-text models, as it applies to text extraction for scanned documents. For more information, see [Setting text extraction options](idp-set-textract-options.md). 

1. (Optional for plain-text models) For **Output data**, enter the location of an Amazon S3 bucket to save training output data, such as the confusion matrix. For more information, see [Confusion matrix](train-classifier-output.md#conf-matrix).

   (Optional) If you choose to encrypt the output result from your training job, choose **Encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key alias for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key alias or ID under **KMS key ID**.

1. For **IAM role**, choose **Choose an existing IAM role**, and then choose an existing IAM role that has read permissions for the S3 bucket that contains your training documents. To be valid, the role must have a trust policy that specifies the `comprehend.amazonaws.com` service principal.

   If you don't already have an IAM role with these permissions, choose **Create an IAM role** to make one. Choose the access permissions to grant this role, and then choose a name suffix to distinguish the role from other IAM roles in your account.
**Note**  
For encrypted input documents, the IAM role used must also have `kms:Decrypt` permission. For more information, see [Permissions required to use KMS encryption](security_iam_id-based-policy-examples.md#auth-kms-permissions).

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the dropdown list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your classification job, the `DataAccessRole` used for the Create and Start operations must have permissions to the VPC that accesses the input documents and the output bucket.

1. (Optional) To add a tag to the custom classifier, enter a key-value pair under **Tags**. Choose **Add tag**. To remove this pair before creating the classifier, choose **Remove tag**. For more information, see [Tagging your resources](tagging.md).

1. Choose **Create**.

The console displays the **Classifiers** page. The new classifier appears in the table, showing `Submitted` as its status. When the classifier starts processing the training documents, the status changes to `Training`. When a classifier is ready to use, the status changes to `Trained` or `Trained with warnings`. If the status is `Trained with warnings`, review the skipped files folder in the [Classifier training output](train-classifier-output.md).

If Amazon Comprehend encountered errors during creation or training, the status changes to `In error`. You can choose a classifier job in the table to get more information about the classifier, including any error messages.

![\[The custom classifier list.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/class-list.png)


# Train custom classifiers (API)
<a name="train-custom-classifier-api"></a>

To create and train a custom classifier, use the [CreateDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateDocumentClassifier.html) operation.

You can monitor the progress of the request using the [DescribeDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassifier.html) operation. After the `Status` field transitions to `TRAINED`, you can use the classifier to classify documents. If the status is `TRAINED_WITH_WARNINGS`, review the skipped files folder in the [Classifier training output](train-classifier-output.md) from the `CreateDocumentClassifier` operation.

**Topics**
+ [Training custom classification using the AWS Command Line Interface](#get-started-api-customclass-cli)
+ [Using the AWS SDK for Java or SDK for Python](#get-started-api-customclass-java)

## Training custom classification using the AWS Command Line Interface
<a name="get-started-api-customclass-cli"></a>

The following examples show how to use the `CreateDocumentClassifier` operation, the `DescribeDocumentClassifier` operation, and other custom classifier APIs with the AWS CLI. 

The examples are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).

Create a plain-text custom classifier using the `create-document-classifier` operation.

```
aws comprehend create-document-classifier \
     --region region \
     --document-classifier-name testDelete \
     --language-code en \
     --input-data-config S3Uri=s3://S3Bucket/docclass/file name \
     --data-access-role-arn arn:aws:iam::account number:role/testFlywheelDataAccess
```

To create a native custom classifier, provide the following additional parameters in the `create-document-classifier` request.

1. `DocumentType`: Set the value to `SEMI_STRUCTURED_DOCUMENT`.

1. `Documents`: The S3 location of the training documents (and, optionally, the test documents).

1. `OutputDataConfig`: The S3 location for the output documents (and an optional KMS key). 

1. `DocumentReaderConfig`: Optional field for text extraction settings.

```
aws comprehend create-document-classifier \
     --region region \
     --document-classifier-name testDelete \
     --language-code en \
     --input-data-config "S3Uri=s3://S3Bucket/docclass/file name,DocumentType=SEMI_STRUCTURED_DOCUMENT,Documents={S3Uri=s3://S3Bucket/docclass/documents}" \
     --output-data-config S3Uri=s3://S3Bucket/docclass/file name \
     --data-access-role-arn arn:aws:iam::account number:role/testFlywheelDataAccess
```

Get information on a custom classifier with the document classifier ARN using the `DescribeDocumentClassifier` operation.

```
aws comprehend describe-document-classifier \
     --region region \
     --document-classifier-arn arn:aws:comprehend:region:account number:document-classifier/file name
```

Delete a custom classifier using the `DeleteDocumentClassifier` operation.

```
aws comprehend delete-document-classifier \
     --region region \
     --document-classifier-arn arn:aws:comprehend:region:account number:document-classifier/testDelete
```

List all custom classifiers in the account using the `ListDocumentClassifiers` operation.

```
aws comprehend list-document-classifiers \
     --region region
```

## Using the AWS SDK for Java or SDK for Python
<a name="get-started-api-customclass-java"></a>

For SDK examples of how to create and train a custom classifier, see [Use `CreateDocumentClassifier` with an AWS SDK or CLI](example_comprehend_CreateDocumentClassifier_section.md).
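For orientation, the following is a minimal sketch in Python of the request parameters that the SDK's `create_document_classifier` call takes. The bucket, classifier, and role names are hypothetical; the linked SDK examples show the complete, authoritative flow.

```python
# Build the request parameters for CreateDocumentClassifier.
# All names below are hypothetical placeholders -- substitute your own
# classifier name, S3 bucket, and IAM role ARN.
params = {
    "DocumentClassifierName": "example-classifier",
    "LanguageCode": "en",
    "InputDataConfig": {
        "S3Uri": "s3://amzn-s3-demo-bucket/docclass/training.csv",
    },
    "DataAccessRoleArn": "arn:aws:iam::111122223333:role/ComprehendDataAccess",
}

# With the AWS SDK for Python (boto3), the call would then look like:
#   comprehend = boto3.client("comprehend")
#   response = comprehend.create_document_classifier(**params)
#   print(response["DocumentClassifierArn"])
print(sorted(params))
```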

# Test the training data
<a name="testing-the-model"></a>

After training the model, Amazon Comprehend tests the custom classifier model. If you don't provide a test dataset, Amazon Comprehend trains the model with 90 percent of the training data and reserves the remaining 10 percent for testing. If you do provide a test dataset, the test data must include at least one example for each unique label in the training dataset. 

Testing the model provides you with metrics that you can use to estimate the accuracy of the model. The console displays the metrics in the **Classifier performance** section of the **Classifier details** page. The [DescribeDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassifier.html) operation also returns them in the `Metrics` fields.

In the following example training data, there are five labels: DOCUMENTARY, DOCUMENTARY, SCIENCE_FICTION, DOCUMENTARY, and ROMANTIC_COMEDY. There are three unique classes: DOCUMENTARY, SCIENCE_FICTION, and ROMANTIC_COMEDY. 


| Column 1 | Column 2 | 
| --- | --- | 
| DOCUMENTARY | document text 1 | 
| DOCUMENTARY | document text 2 | 
| SCIENCE_FICTION | document text 3 | 
| DOCUMENTARY | document text 4 | 
| ROMANTIC_COMEDY | document text 5 | 

For auto split (where Amazon Comprehend reserves 10 percent of the training data to use for testing), if the training data contains limited examples of a specific label, the test dataset may contain zero examples of that label. For instance, if the training dataset contains 1000 instances of the DOCUMENTARY class, 900 instances of SCIENCE_FICTION, and a single instance of the ROMANTIC_COMEDY class, the test dataset might contain 100 DOCUMENTARY and 90 SCIENCE_FICTION instances, but no ROMANTIC_COMEDY instances, because only a single example is available. 
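This risk can be illustrated with a simple random 90/10 split over those class counts. The sketch below is illustrative only, not Amazon Comprehend's actual splitting algorithm:

```python
import random

# 1000 DOCUMENTARY, 900 SCIENCE_FICTION, and a single ROMANTIC_COMEDY example.
labels = ["DOCUMENTARY"] * 1000 + ["SCIENCE_FICTION"] * 900 + ["ROMANTIC_COMEDY"]

random.seed(0)
random.shuffle(labels)

# Reserve the last 10% as the test slice.
cut = int(len(labels) * 0.9)
train, test = labels[:cut], labels[cut:]

print(len(train), len(test))            # 1710 training and 191 test examples
# With only one ROMANTIC_COMEDY example, the test slice contains it only
# about 10% of the time; usually the rare label is missing entirely.
print(test.count("ROMANTIC_COMEDY"))
```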

After you finish training your model, the training metrics provide information that you can use to decide if the model is sufficiently accurate for your needs. 

# Classifier training output
<a name="train-classifier-output"></a>

After Amazon Comprehend completes the custom classifier model training, it creates output files in the Amazon S3 output location that you specified in the [CreateDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateDocumentClassifier.html) API request or the equivalent console request.

Amazon Comprehend creates a confusion matrix when you train a plain-text model or a native document model. It can create additional output files when you train a native document model.

**Topics**
+ [Confusion matrix](#conf-matrix)
+ [Additional outputs for native document models](#train-class-output-native)

## Confusion matrix
<a name="conf-matrix"></a>

When you train a custom classifier model, Amazon Comprehend creates a confusion matrix that provides metrics on how well the model performed in training. This matrix compares the labels that the model predicted with the actual document labels. Amazon Comprehend uses a portion of the training data to create the confusion matrix.

A confusion matrix indicates which classes could use more data to improve model performance. A class with a high fraction of correct predictions has a high number on the diagonal of the matrix. A lower number on the diagonal indicates a lower fraction of correct predictions for that class. You can add more training examples for this class and train the model again. For example, if 40 percent of label A samples get classified as label D, adding more samples for label A and label D enhances the classifier's performance.

After Amazon Comprehend creates the classifier model, the confusion matrix is available in the `confusion_matrix.json` file in the S3 output location. 

The format of the confusion matrix varies, depending on whether you trained your classifier using multi-class mode or multi-label mode.

**Topics**
+ [Confusion matrix for multi-class mode](#m-c-matrix)
+ [Confusion matrix for multi-label mode](#m-l-matrix)

### Confusion matrix for multi-class mode
<a name="m-c-matrix"></a>

In multi-class mode, the individual classes are mutually exclusive, so classification assigns one label to each document. For example, an animal can be a dog or a cat, but not both at the same time.

Consider the following example of a confusion matrix for a multi-class trained classifier:

```
  A B X Y <-(predicted label)
A 1 2 0 4
B 0 3 0 1
X 0 0 1 0
Y 1 1 1 1
^
|
(actual label)
```

In this case, the model predicted the following:
+ One "A" label was accurately predicted, two "A" labels were incorrectly predicted as "B" labels, and four "A" labels were incorrectly predicted as "Y" labels.
+ Three "B" labels were accurately predicted, and one "B" label was incorrectly predicted as a "Y" label.
+ One "X" was accurately predicted.
+ One "Y" label was accurately predicted, one was incorrectly predicted as an "A" label, one was incorrectly predicted as a "B" label, and one was incorrectly predicted as an "X" label.

The diagonal line in the matrix (A:A, B:B, X:X, and Y:Y) shows the accurate predictions. The prediction errors are the values outside of the diagonal. In this case, the matrix shows the following prediction error rates: 
+ A labels: 86%
+ B labels: 25%
+ X labels: 0%
+ Y labels: 75%

The classifier returns the confusion matrix as a file in JSON format. The following JSON file represents the matrix for the previous example.

```
{
 "type": "multi_class",
 "confusion_matrix": [
 [1, 2, 0, 4],
 [0, 3, 0, 1],
 [0, 0, 1, 0],
 [1, 1, 1, 1]],
 "labels": ["A", "B", "X", "Y"],
 "all_labels": ["A", "B", "X", "Y"]
}
```
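The error rates above can be recomputed from this file with any JSON parser. In the following Python sketch, the JSON content is inlined; in practice you would download `confusion_matrix.json` from the S3 output location:

```python
import json

# Contents of confusion_matrix.json from the previous example.
doc = json.loads("""
{"type": "multi_class",
 "confusion_matrix": [[1, 2, 0, 4], [0, 3, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1]],
 "labels": ["A", "B", "X", "Y"],
 "all_labels": ["A", "B", "X", "Y"]}
""")

for i, label in enumerate(doc["labels"]):
    row = doc["confusion_matrix"][i]       # predictions for actual label i
    error_rate = (sum(row) - row[i]) / sum(row)
    print(f"{label} labels: {error_rate:.0%} error rate")
```

This prints the same per-class error rates listed above: 86% for A, 25% for B, 0% for X, and 75% for Y.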

### Confusion matrix for multi-label mode
<a name="m-l-matrix"></a>

In multi-label mode, classification can assign one or more classes to a document. Consider the following example of a confusion matrix for a multi-label trained classifier.

In this example, there are three possible labels: `Comedy`, `Action`, and `Drama`. The multi-label confusion matrix creates one 2x2 matrix for each label.

```
Comedy                   Action                   Drama 
     No Yes                   No Yes                   No Yes   <-(predicted label)                                      
 No  2   1                No  1   1                No  3   0                                                         
Yes  0   2               Yes  2   1               Yes  1   1   
 ^                        ^                        ^
 |                        |                        |
 |-----------(was this label actually used)--------|
```

In this case, the model returned the following for the `Comedy` label:
+ Two instances where a `Comedy` label was accurately predicted to be present. True positive (TP). 
+ Two instances where a `Comedy` label was accurately predicted to be absent. True negative (TN).
+ One instance where a `Comedy` label was incorrectly predicted to be present. False positive (FP).
+ Zero instances where a `Comedy` label was incorrectly predicted to be absent. False negative (FN).

As with a multi-class confusion matrix, the diagonal line in each matrix shows the accurate predictions.

In this case, the model accurately predicted `Comedy` labels 80% of the time (TP plus TN) and incorrectly predicted them 20% of the time (FP plus FN).



The classifier returns the confusion matrix as a file in JSON format. The following JSON file represents the matrix for the previous example.

```
{
 "type": "multi_label",
 "confusion_matrix": [
  [[2, 1],
   [0, 2]],
  [[1, 1],
   [2, 1]],
  [[3, 0],
   [1, 1]]
 ],
 "labels": ["Comedy", "Action", "Drama"],
 "all_labels": ["Comedy", "Action", "Drama"]
}
```
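Each per-label 2x2 block in this JSON can be unpacked into the four counts. The following Python sketch assumes, as in the diagram, that rows are the actual value (No, Yes) and columns are the predicted value:

```python
import json

# Contents of confusion_matrix.json from the previous multi-label example.
doc = json.loads("""
{"type": "multi_label",
 "confusion_matrix": [[[2, 1], [0, 2]], [[1, 1], [2, 1]], [[3, 0], [1, 1]]],
 "labels": ["Comedy", "Action", "Drama"],
 "all_labels": ["Comedy", "Action", "Drama"]}
""")

stats = {}
for label, m in zip(doc["labels"], doc["confusion_matrix"]):
    # Rows are the actual value (No, Yes); columns are the predicted value.
    (tn, fp), (fn, tp) = m
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    stats[label] = accuracy
    print(f"{label}: TP={tp} TN={tn} FP={fp} FN={fn}, accuracy {accuracy:.0%}")
```

For the `Comedy` label this yields TP=2, TN=2, FP=1, FN=0, and the 80% accuracy discussed above.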

## Additional outputs for native document models
<a name="train-class-output-native"></a>

Amazon Comprehend can create additional output files when you train a native document model.

### Amazon Textract output
<a name="textract-output"></a>

If Amazon Comprehend invoked the Amazon Textract APIs to extract text for any of the training documents, it saves the Amazon Textract output files in the S3 output location. It uses the following directory structure:
+ **Training documents:** 

  `amazon-textract-output/train/<file_name>/<page_num>/textract_output.json` 
+ **Test documents:** 

  `amazon-textract-output/test/<file_name>/<page_num>/textract_output.json`

Amazon Comprehend populates the test folder if you provided test documents in the API request.

### Document annotation failures
<a name="failed-files-output"></a>

Amazon Comprehend creates the following files in the Amazon S3 output location (in the **skipped_documents/** folder) if there are any failed annotations:
+ failed_annotations_train.jsonl

  This file exists if any annotations failed in the training data.
+ failed_annotations_test.jsonl

  This file exists if the request included test data and any annotations failed in the test data.

The failed annotation files are JSONL files with the following format:

```
{"File": "String", "Page": Number, "ErrorCode": "...", "ErrorMessage": "..."}
{"File": "String", "Page": Number, "ErrorCode": "...", "ErrorMessage": "..."}
```
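Because each line is an independent JSON object, these files can be processed line by line. A small Python sketch, using hypothetical file names in place of the real values:

```python
import json

# Two sample lines in the failed-annotations JSONL format.
# The file names, page numbers, and messages here are hypothetical.
sample = (
    '{"File": "contract-12.pdf", "Page": 1, "ErrorCode": "...", "ErrorMessage": "..."}\n'
    '{"File": "invoice-3.pdf", "Page": 4, "ErrorCode": "...", "ErrorMessage": "..."}\n'
)

records = [json.loads(line) for line in sample.splitlines()]
for r in records:
    print(f'{r["File"]} page {r["Page"]}: {r["ErrorMessage"]}')
```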

# Custom classifier metrics
<a name="cer-doc-class"></a>

Amazon Comprehend provides metrics to help you estimate how well a custom classifier performs. Amazon Comprehend calculates the metrics using the test data from the classifier training job. The metrics accurately represent the performance of the model during training, so they approximate the model performance for classification of similar data. 

Use API operations such as [DescribeDocumentClassifier](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeDocumentClassifier.html) to retrieve the metrics for a custom classifier.

**Note**  
Refer to [Metrics: Precision, recall, and FScore](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) for an understanding of the underlying precision, recall, and F1 score metrics. These metrics are defined at the class level. Amazon Comprehend uses **macro** averaging to combine them into the test-set precision, recall, and F1 score, as discussed in the following sections.

**Topics**
+ [Metrics](#cer-doc-class-metrics)
+ [Improving your custom classifier's performance](#improving-metrics-doc)

## Metrics
<a name="cer-doc-class-metrics"></a>

Amazon Comprehend supports the following metrics: 

**Topics**
+ [Accuracy](#class-accuracy-metric)
+ [Precision (macro precision)](#class-macroprecision-metric)
+ [Recall (macro recall)](#class-macrorecall-metric)
+ [F1 score (macro F1 score)](#class-macrof1score-metric)
+ [Hamming loss](#class-hammingloss-metric)
+ [Micro precision](#class-microprecision-metric)
+ [Micro recall](#class-microrecall-metric)
+ [Micro F1 score](#class-microf1score-metric)

To view the metrics for a classifier, open the **Classifier details** page in the console.

![\[Custom Classifier Metrics\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/classifierperformance.png)


### Accuracy
<a name="class-accuracy-metric"></a>

Accuracy indicates the percentage of labels from the test data that the model predicted accurately. To compute accuracy, divide the number of accurately predicted labels in the test documents by the total number of labels in the test documents.

For example:


| Actual label | Predicted label | Accurate/Incorrect | 
| --- | --- | --- | 
|  1  |  1  |  Accurate  | 
|  0  |  1  |  Incorrect  | 
|  2  |  3  |  Incorrect  | 
|  3  |  3  |  Accurate  | 
|  2  |  2  |  Accurate  | 
|  1  |  1  |  Accurate  | 
|  3  |  3  | Accurate | 

The accuracy is the number of accurate predictions divided by the overall number of test samples: 5/7 = 0.714, or 71.4%.
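The same computation can be expressed in a few lines of Python, using the label columns from the table:

```python
# Actual and predicted labels from the example table.
actual    = [1, 0, 2, 3, 2, 1, 3]
predicted = [1, 1, 3, 3, 2, 1, 3]

# Accuracy: accurate predictions divided by total test samples.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy = {correct}/{len(actual)} = {accuracy:.3f}")   # 5/7 = 0.714
```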

### Precision (macro precision)
<a name="class-macroprecision-metric"></a>

Precision is a measure of the usefulness of the classifier results in the test data. For each class, it's the number of documents correctly assigned to that class, divided by the total number of documents assigned to that class. High precision means that the classifier returned significantly more relevant results than irrelevant ones. 

The `Precision` metric is also known as *Macro Precision*. 

The following example shows precision results for a test set.


| Label | Sample size | Label precision | 
| --- | --- | --- | 
|  Label_1  |  400  |  0.75  | 
|  Label_2  |  300  |  0.80  | 
|  Label_3  |  30000  |  0.90  | 
|  Label_4  |  20  |  0.50  | 
|  Label_5  |  10  |  0.40  | 

The Precision (Macro Precision) metric for the model is therefore:

```
Macro Precision = (0.75 + 0.80 + 0.90 + 0.50 + 0.40)/5 = 0.67
```

### Recall (macro recall)
<a name="class-macrorecall-metric"></a>

Recall indicates the percentage of correct categories in your text that the model can predict. This metric comes from averaging the recall scores of all available labels. Recall is a measure of how complete the classifier results are for the test data. 

High recall means that the classifier returned most of the relevant results. 

The `Recall` metric is also known as *Macro Recall*.

The following example shows recall results for a test set.


| Label | Sample size | Label recall | 
| --- | --- | --- | 
|  Label_1  |  400  |  0.70  | 
|  Label_2  |  300  |  0.70  | 
|  Label_3  |  30000  |  0.98  | 
|  Label_4  |  20  |  0.80  | 
|  Label_5  |  10  |  0.10  | 

The Recall (Macro Recall) metric for the model is therefore:

```
Macro Recall = (0.70 + 0.70 + 0.98 + 0.80 + 0.10)/5 = 0.656
```

### F1 score (macro F1 score)
<a name="class-macrof1score-metric"></a>

The F1 score is derived from the `Precision` and `Recall` values. It measures the overall accuracy of the classifier. The highest score is 1, and the lowest score is 0. 

Amazon Comprehend calculates the *Macro F1 Score*. It's the unweighted average of the label F1 scores. Using the following test set as an example:


| Label | Sample size | Label F1 score | 
| --- | --- | --- | 
|  Label_1  |  400  |  0.724  | 
|  Label_2  |  300  |  0.824  | 
|  Label_3  |  30000  |  0.94  | 
|  Label_4  |  20  |  0.62  | 
|  Label_5  |  10  |  0.16  | 

The F1 Score (Macro F1 Score) for the model is calculated as follows:

```
Macro F1 Score = (0.724 + 0.824 + 0.94 + 0.62 + 0.16)/5 = 0.6536
```
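All three macro metrics follow the same pattern: an unweighted mean of the per-label scores from the example tables, regardless of each label's sample size. In Python:

```python
# Per-label scores from the precision, recall, and F1 example tables.
precision = [0.75, 0.80, 0.90, 0.50, 0.40]
recall    = [0.70, 0.70, 0.98, 0.80, 0.10]
f1        = [0.724, 0.824, 0.94, 0.62, 0.16]

def macro(scores):
    """Macro averaging: the unweighted mean of the per-label scores."""
    return sum(scores) / len(scores)

print(f"Macro Precision = {macro(precision):.2f}")   # 0.67
print(f"Macro Recall    = {macro(recall):.3f}")      # 0.656
print(f"Macro F1 Score  = {macro(f1):.4f}")          # 0.6536
```

Note that the small labels (20 and 10 samples) pull the macro scores down as much as the 30000-sample label pulls them up; that weighting is the key difference from the micro metrics described later.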

### Hamming loss
<a name="class-hammingloss-metric"></a>

The fraction of labels that are incorrectly predicted: the number of incorrect labels divided by the total number of labels. Scores closer to zero are better.

### Micro precision
<a name="class-microprecision-metric"></a>

Similar to the precision metric, except that micro precision pools the true positive and false positive counts across all labels and computes a single precision score from the totals. Heavily represented labels therefore influence the score more than they do in macro precision.

### Micro recall
<a name="class-microrecall-metric"></a>

Similar to the recall metric, except that micro recall pools the true positive and false negative counts across all labels and computes a single recall score from the totals.

### Micro F1 score
<a name="class-microf1score-metric"></a>

The Micro F1 score is the harmonic mean of the Micro Precision and Micro Recall metrics.
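The micro metrics can be sketched with hypothetical per-label counts: pool the true positive, false positive, and false negative counts across all labels, then compute each metric once from the totals. The counts below are invented for illustration.

```python
# Hypothetical per-label counts on a skewed test set: Label_3 dominates.
counts = {
    "Label_1": {"tp": 280, "fp": 95, "fn": 120},
    "Label_2": {"tp": 240, "fp": 60, "fn": 60},
    "Label_3": {"tp": 27000, "fp": 3000, "fn": 600},
}

# Pool the counts across all labels before computing any metric.
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
# Micro F1: the harmonic mean of micro precision and micro recall.
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(f"Micro Precision = {micro_precision:.3f}")
print(f"Micro Recall    = {micro_recall:.3f}")
print(f"Micro F1 score  = {micro_f1:.3f}")
```

Because the pooled totals are dominated by the largest label, the micro scores track that label's performance much more closely than the macro scores do.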

## Improving your custom classifier's performance
<a name="improving-metrics-doc"></a>

The metrics provide insight into how your custom classifier performs during a classification job. If the metrics are low, the classification model might not be effective for your use case. You have several options to improve your classifier performance:

1. In your training data, provide concrete examples that define clear separation of the categories. For example, provide documents that use unique words/sentences to represent the category. 

1. Add more data for under-represented labels in your training data.

1. Try to reduce skew in the categories. If the largest label in your data has more than 10 times the documents in the smallest label, try increasing the number of documents for the smallest label. Make sure to reduce the skew ratio to at most 10:1 between highly represented and least represented classes. You can also try removing input documents from the highly represented classes.