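Creating a classification manifest file from a CSV file

This example Python script simplifies the creation of a classification manifest file by using a comma-separated values (CSV) file to label images. You create the CSV file.

A manifest file describes the images used to train a model. A manifest file is made up of one or more JSON lines. Each JSON line describes a single image. For more information, see Defining JSON lines for image classification.

A CSV file represents tabular data over multiple rows in a text file. Fields on a row are separated by commas. For more information, see Comma-separated values. For this script, each row in your CSV file includes the S3 location of an image and the anomaly classification for the image (normal or anomaly). Each row maps to a JSON line in the manifest file.

For example, the following CSV file describes some of the images in the example images.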


s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train-normal_1.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_10.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_11.jpg,normal
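
As a rough illustration of how rows in this format can be read, the following sketch parses two of the rows above with Python's standard csv module. The parse_rows function and the SAMPLE_CSV constant are hypothetical names used only for this illustration; they aren't part of the script later in this topic.

```python
import csv
import io

# Two of the example rows, with an empty line between them.
SAMPLE_CSV = (
    "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly\n"
    "\n"
    "s3://s3bucket/circuitboard/train/normal/train-normal_1.jpg,normal\n"
)


def parse_rows(csv_text):
    """Return (image_location, classification) pairs, skipping empty lines."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    pairs = []
    for row in reader:
        # csv.reader yields an empty list for an empty line.
        if not "".join(row).strip():
            continue
        pairs.append((row[0], row[1].lower()))
    return pairs


if __name__ == "__main__":
    for location, classification in parse_rows(SAMPLE_CSV):
        print(classification, location)
```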

The script generates a JSON line for each row. For example, the following is the JSON line for the first row (s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly).


{"source-ref": "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg","anomaly-label": 1,"anomaly-label-metadata": {"confidence": 1,"job-name": "labeling-job/anomaly-classification","class-name": "anomaly","human-annotated": "yes","creation-date": "2022-02-04T22:47:07","type": "groundtruth/image-classification"}}
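
The mapping from a CSV row to a JSON line can be sketched as follows. This is a minimal illustration that assumes the field names and values shown in the example JSON line above; build_json_line is a hypothetical helper, not a function from the script.

```python
import json
from datetime import datetime, timezone


def build_json_line(image_location, classification):
    """Build one manifest JSON line for an image (hypothetical helper)."""
    label = classification.strip().lower()
    if label not in ("normal", "anomaly"):
        raise ValueError(f"Unexpected classification: {classification!r}")

    entry = {
        "source-ref": image_location,
        # 1 marks an anomalous image and 0 marks a normal image.
        "anomaly-label": 1 if label == "anomaly" else 0,
        "anomaly-label-metadata": {
            "confidence": 1,
            "job-name": "labeling-job/anomaly-classification",
            "class-name": label,
            "human-annotated": "yes",
            "creation-date": datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%S"),
            "type": "groundtruth/image-classification",
        },
    }
    return json.dumps(entry)


if __name__ == "__main__":
    print(build_json_line(
        "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg",
        "anomaly"))
```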

If your CSV file doesn't include the Amazon S3 path for the images, use the --s3-path command line argument to specify the Amazon S3 path to the images.
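
The effect of supplying an Amazon S3 path can be sketched as follows. The resolve_locations helper is hypothetical. Adding a missing trailing slash and rejecting a mixture of full paths and bare file names are choices made for this sketch, not documented behavior of the script.

```python
def resolve_locations(image_fields, s3_path=""):
    """Return full Amazon S3 locations for the values in field 1.

    Hypothetical helper: bare file names get s3_path prepended, and
    values that already start with s3:// are returned unchanged.
    """
    with_path = [field for field in image_fields if field.startswith("s3://")]

    # Assumption of this sketch: a mixture of full paths and bare names
    # is treated as an error.
    if with_path and len(with_path) != len(image_fields):
        raise ValueError("Don't mix images with and without an Amazon S3 path.")

    if with_path:
        return list(image_fields)

    if not s3_path:
        raise ValueError("Image names have no Amazon S3 path. Supply s3_path.")

    prefix = s3_path if s3_path.endswith("/") else s3_path + "/"
    return [prefix + name for name in image_fields]


if __name__ == "__main__":
    print(resolve_locations(["image_10.jpg", "image_11.jpg"],
                            "s3://s3bucket/circuitboard/train/"))
```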

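Before creating the manifest file, the script checks the CSV file for duplicate images and for any image classification that is not normal or anomaly. If duplicate images or image classification errors are found, the script does the following:

Records the first valid image entry for all images in a deduplicated CSV file.
Records duplicate occurrences of an image in the errors file.
Records image classifications that are not normal or anomaly in the errors file.
Doesn't create a manifest file.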
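
The checks in this list can be sketched in memory as follows. This is an illustration of the description above, not a copy of the script: split_rows is a hypothetical helper, and it assumes that a row with an invalid classification is reported as an error and doesn't count as the first entry for its image.

```python
VALID_CLASSIFICATIONS = {"normal", "anomaly"}


def split_rows(rows):
    """Separate rows into first valid entries and error records.

    rows is a list of [image, classification] lists. Each error record
    is a (line_number, image, classification, reason) tuple.
    """
    seen = set()
    deduplicated = []
    errors = []
    for line_number, (image, classification) in enumerate(rows, start=1):
        if classification.lower() not in VALID_CLASSIFICATIONS:
            errors.append(
                (line_number, image, classification, "INVALID_CLASSIFICATION"))
        elif image in seen:
            errors.append((line_number, image, classification, "DUPLICATE"))
        else:
            # First valid entry for this image.
            seen.add(image)
            deduplicated.append([image, classification])
    return deduplicated, errors


if __name__ == "__main__":
    sample = [["a.jpg", "normal"], ["b.jpg", "broken"], ["a.jpg", "anomaly"]]
    unique_rows, error_rows = split_rows(sample)
    print(unique_rows)
    print(error_rows)
```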

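The errors file includes the line number in the input CSV file where each duplicate image or classification error was found. Use the errors CSV file to update the input CSV file, and then run the script again. Alternatively, use the errors CSV file to update the deduplicated CSV file, which contains only unique image entries and images without image classification errors. Then rerun the script with the updated deduplicated CSV file.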
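
When an errors file is long, grouping its rows by error type can make it easier to review. The following sketch assumes the column order that the script's check_errors function writes (line number, image, classification, error type). The group_errors helper is hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_errors(errors_csv_text):
    """Map each error type to the input line numbers it was reported on."""
    grouped = defaultdict(list)
    reader = csv.reader(io.StringIO(errors_csv_text, newline=""))
    for row in reader:
        if not row:
            continue
        line_number, _image, _classification, reason = row
        grouped[reason].append(int(line_number))
    return dict(grouped)


if __name__ == "__main__":
    sample = (
        "2,b.jpg,broken,INVALID_CLASSIFICATION\r\n"
        "3,a.jpg,anomaly,DUPLICATE\r\n"
    )
    print(group_errors(sample))
```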

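If no duplicates or errors are found in the input CSV file, the script deletes the deduplicated CSV file and the errors file, because the errors file is empty and the deduplicated file only repeats the input.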

In this procedure, you create the CSV file and run the Python script to create the manifest file. The script has been tested with Python version 3.7.

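To create a manifest file from a CSV file

1. Create a CSV file with the following fields in each row (one row per image). Don't add a header row to the CSV file.

Field 1	Field 2
The image name or the Amazon S3 path to the image. For example, `s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg`. You can't mix images that have an Amazon S3 path with images that don't.	The anomaly classification for the image (`normal` or `anomaly`).

For example, s3://s3bucket/circuitboard/train/anomaly/image_10.jpg,anomaly or image_11.jpg,normal

2. Save the CSV file.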

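3. Run the following Python script. Supply the following arguments:

csv_file: The CSV file that you created in step 1.
(Optional) --s3-path s3://path_to_folder/: The Amazon S3 path to add to the image file names (field 1). Use --s3-path if the images in field 1 don't already contain an S3 path.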


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0
"""
Purpose
Shows how to create an Amazon Lookout for Vision manifest file from a CSV file.
The CSV file format is image location,anomaly classification (normal or anomaly)
For example:
s3://s3bucket/circuitboard/train/anomaly/train_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train_1.jpg,normal

If necessary, use the --s3-path argument to specify the Amazon S3 folder for the images.
"""

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

logger = logging.getLogger(__name__)


def check_errors(csv_file):
    """
    Checks for duplicate images and incorrect classifications in a CSV file.
    If duplicate images or invalid anomaly assignments are found, an errors CSV file
    and deduplicated CSV file are created. Only the first
    occurrence of a duplicate is recorded. Other duplicates are recorded in the errors file.
    :param csv_file: The source CSV file
    :return: True if errors or duplicates are found, otherwise false.
    """

    logger.info("Checking %s.", csv_file)

    errors_found = False
    errors_file = f"{os.path.splitext(csv_file)[0]}_errors.csv"
    deduplicated_file = f"{os.path.splitext(csv_file)[0]}_deduplicated.csv"

    with open(csv_file, 'r', newline='', encoding="UTF-8") as input_file,\
            open(deduplicated_file, 'w', newline='', encoding="UTF-8") as dedup,\
            open(errors_file, 'w', newline='', encoding="UTF-8") as errors:

        reader = csv.reader(input_file, delimiter=',')
        dedup_writer = csv.writer(dedup)
        error_writer = csv.writer(errors)
        entries = set()
        for row in reader:
            # Use the reader's line counter so that reported line numbers
            # match the input CSV file, even when empty lines are skipped.
            line = reader.line_num

            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            image = row[0]
            # A row with a single field has no classification.
            classification = row[1] if len(row) > 1 else ""

            # Record any incorrect classifications.
            if classification.lower() not in ("normal", "anomaly"):
                error_writer.writerow(
                    [line, image, classification, "INVALID_CLASSIFICATION"])
                errors_found = True

            # Write first image entry to dedup file and record duplicates.
            if image not in entries:
                dedup_writer.writerow(row)
                entries.add(image)
            else:
                error_writer.writerow([line, image, classification, "DUPLICATE"])
                errors_found = True

    if errors_found:
        logger.info("Errors found check %s.", errors_file)
    else:
        os.remove(errors_file)
        os.remove(deduplicated_file)

    return errors_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Read a CSV file and create an Amazon Lookout for Vision classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The Amazon S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s.", csv_file)

    image_count = 0
    anomalous_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
        open(manifest_file, "w", encoding="UTF-8") as output_file:

        # Use the same CSV dialect as check_errors.
        image_classifications = csv.reader(csvfile, delimiter=',')

        # Process each row (image) in the CSV file.
        for row in image_classifications:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            source_ref = str(s3_path) + row[0]
            classification = 0

            if row[1].lower() == 'anomaly':
                classification = 1
                anomalous_count += 1

            # Create the JSON line. The label is an integer:
            # 1 for an anomalous image, 0 for a normal image.
            json_line = {}
            json_line['source-ref'] = source_ref
            json_line['anomaly-label'] = classification

            metadata = {}
            metadata['confidence'] = 1
            metadata['job-name'] = "labeling-job/anomaly-classification"
            metadata['class-name'] = row[1].lower()
            metadata['human-annotated'] = "yes"
            metadata['creation-date'] = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
            metadata['type'] = "groundtruth/image-classification"

            json_line['anomaly-label-metadata'] = metadata

            output_file.write(json.dumps(json_line))
            output_file.write('\n')
            image_count += 1

    logger.info("Finished creating manifest file %s.\n"
                "Images: %s\nAnomalous: %s",
                manifest_file,
                image_count,
                anomalous_count)
    return image_count, anomalous_count


def add_arguments(parser):
    """
    Add command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

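    # Accept --s3-path (as documented) and --s3_path (the earlier spelling).
    parser.add_argument(
        "--s3-path", "--s3_path", dest="s3_path",
        help="The Amazon S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the Amazon S3 path.", required=False
    )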


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments.
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()
        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ""

        csv_file = args.csv_file
        csv_file_no_extension = os.path.splitext(csv_file)[0]
        manifest_file = csv_file_no_extension + '.manifest'

        # Create manifest file if there are no duplicate images.
        if check_errors(csv_file):
            print(f"Issues found. Use {csv_file_no_extension}_errors.csv "\
                "to view duplicates and errors.")
            print(f"{csv_file_no_extension}_deduplicated.csv contains the first "\
                "occurrence of a duplicate.\n"
                  "Update as necessary with the correct information.")
            print(f"Re-run the script with {csv_file_no_extension}_deduplicated.csv")
        else:
            print('No duplicates found. Creating manifest file.')

            image_count, anomalous_count = create_manifest_file(csv_file, manifest_file, s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n")

            normal_count = image_count-anomalous_count
            print(f"Images processed: {image_count}")
            print(f"Normal: {normal_count}")
            print(f"Anomalous: {anomalous_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")

if __name__ == "__main__":
    main()

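4. If duplicate images or classification errors are found:
   a. Use the errors file to update the deduplicated CSV file or the input CSV file.
   b. Run the script again with the updated deduplicated CSV file or the updated input CSV file.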
5. If you plan to use a test dataset, repeat steps 1-4 to create a manifest file for your test dataset.
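6. If necessary, copy the images from your computer to the Amazon S3 bucket path that you specified in column 1 of the CSV file (or with the --s3-path command line argument). To copy the images, enter the following command at the command prompt.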
```
aws s3 cp --recursive your-local-folder/ s3://your-target-S3-location/
```
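7. Follow the instructions at Creating a dataset with a manifest file (console) to create a dataset. If you're using the AWS SDK, see Creating a dataset with a manifest file (SDK).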
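
Before you create the dataset, it can help to sanity-check the manifest file. The following sketch counts normal and anomalous entries in manifest text. It assumes that anomaly-label is an integer (0 or 1), as in the example JSON line earlier in this topic, and that every source-ref is an Amazon S3 path. The summarize_manifest helper is hypothetical.

```python
import json


def summarize_manifest(manifest_text):
    """Count normal and anomalous entries in manifest text."""
    counts = {"normal": 0, "anomaly": 0}
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        if not line.strip():
            continue
        # json.loads raises json.JSONDecodeError for a malformed line.
        entry = json.loads(line)
        if not entry["source-ref"].startswith("s3://"):
            raise ValueError(
                f"Line {number}: source-ref is not an Amazon S3 path.")
        label = entry["anomaly-label"]
        # Assumption: labels are the integers 0 (normal) and 1 (anomaly).
        if label not in (0, 1):
            raise ValueError(
                f"Line {number}: unexpected anomaly-label {label!r}.")
        counts["anomaly" if label == 1 else "normal"] += 1
    return counts


if __name__ == "__main__":
    sample = (
        '{"source-ref": "s3://s3bucket/a.jpg", "anomaly-label": 1}\n'
        '{"source-ref": "s3://s3bucket/b.jpg", "anomaly-label": 0}\n'
    )
    print(summarize_manifest(sample))
```

To check a manifest file on disk, read its contents with open(...).read() and pass the text to the function.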
