Creating a classification manifest file from a CSV file

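This example Python script simplifies the creation of a classification manifest file by using a comma-separated values (CSV) file to label images. You create the CSV file.

A manifest file describes the images used to train a model. A manifest file is made up of one or more JSON Lines. Each JSON Line describes a single image. For more information, see Defining JSON lines for image classification.

A CSV file represents tabular data over multiple rows in a text file. Fields on a row are separated by commas. For more information, see comma separated values. For this script, each row in your CSV file contains the S3 location of an image and the anomaly classification for the image (normal or anomaly). Each row maps to one JSON Line in the manifest file.

For example, the following CSV file describes some of the images in the example images.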


s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg,anomaly
s3://s3bucket/circuitboard/train/anomaly/train-anomaly_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train-normal_1.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_10.jpg,normal
s3://s3bucket/circuitboard/train/normal/train-normal_11.jpg,normal

The script generates a JSON Line for each row. For example, the following is the JSON Line for the first row (s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg,anomaly).


{"source-ref": "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg","anomaly-label": 1,"anomaly-label-metadata": {"confidence": 1,"job-name": "labeling-job/anomaly-classification","class-name": "anomaly","human-annotated": "yes","creation-date": "2022-02-04T22:47:07","type": "groundtruth/image-classification"}}
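
To check a manifest file after it has been created, the JSON Lines can be read back and counted by classification. The following is a minimal sketch, not part of the script; the function name `summarize_manifest` and the shortened sample metadata are assumptions made for illustration.

```python
import json


def summarize_manifest(json_lines):
    """Count normal and anomalous images in an iterable of JSON Lines."""
    counts = {"normal": 0, "anomaly": 0}
    for json_line in json_lines:
        if not json_line.strip():
            continue  # Skip blank lines.
        entry = json.loads(json_line)
        # anomaly-label is 1 for an anomalous image and 0 for a normal one.
        # int() also accepts the label if it was written as a string.
        label = "anomaly" if int(entry["anomaly-label"]) == 1 else "normal"
        counts[label] += 1
    return counts


# Hypothetical sample: one JSON Line with shortened metadata.
sample_lines = [
    '{"source-ref": "s3://s3bucket/circuitboard/train/anomaly/train-anomaly_1.jpg", '
    '"anomaly-label": 1, '
    '"anomaly-label-metadata": {"class-name": "anomaly"}}'
]
print(summarize_manifest(sample_lines))
```

To summarize a real manifest file, pass an open file object instead of the sample list, because iterating over a text file yields one line at a time.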

If your CSV file doesn't include the Amazon S3 path for the images, use the --s3-path command line argument to specify the Amazon S3 path to the images.
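
The script prepends the supplied path to column 1 exactly as given, so the path should end with a forward slash. The following sketch shows one way the two parts could be joined more defensively. The helper `build_source_ref` is hypothetical and is not part of the script; it adds a missing trailing slash and leaves values that already start with `s3://` unchanged.

```python
def build_source_ref(image, s3_path=""):
    """Return the source-ref value for an image name or an S3 URI.

    Hypothetical helper for illustration only.
    """
    # Column 1 already contains a full Amazon S3 path.
    if image.startswith("s3://"):
        return image
    if not s3_path:
        raise ValueError(f"No Amazon S3 path supplied for {image}")
    # Add the trailing slash if the caller left it out.
    prefix = s3_path if s3_path.endswith("/") else s3_path + "/"
    return prefix + image


print(build_source_ref("image_11.jpg", "s3://s3bucket/circuitboard/train/normal"))
print(build_source_ref("s3://s3bucket/circuitboard/train/anomaly/image_10.jpg"))
```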

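Before creating the manifest file, the script checks the CSV file for duplicate images and for any image classification that isn't normal or anomaly. If duplicate images or image classification errors are found, the script does the following:

Records the first valid image entry for every image in a deduplicated CSV file.
Records duplicate occurrences of an image in the errors file.
Records image classifications that aren't normal or anomaly in the errors file.
Doesn't create a manifest file.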

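The errors file includes the line number in the input CSV file where each duplicate image or classification error was found. Use the errors CSV file to update the input CSV file, and then run the script again. Alternatively, use the errors CSV file to update the deduplicated CSV file, which contains only unique image entries and images without image classification errors. Then rerun the script with the updated deduplicated CSV file.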

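If no duplicates or errors are found in the input CSV file, the script deletes the deduplicated CSV file and the errors file, because the errors file is empty and the deduplicated file isn't needed.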
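
The checks described above can be tried on a small in-memory CSV before running the full script. The following is a minimal sketch that mirrors the script's duplicate and classification checks but returns the results instead of writing files. The function name `check_rows` and the sample image names are assumptions made for illustration.

```python
import csv
import io

VALID_CLASSIFICATIONS = ("normal", "anomaly")


def check_rows(csv_text):
    """Return (deduplicated_rows, errors) for CSV text with no header row."""
    deduplicated, errors, seen = [], [], set()
    reader = csv.reader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=1):
        if not "".join(row).strip():
            continue  # Skip empty lines.
        image, classification = row[0], row[1]
        if classification.lower() not in VALID_CLASSIFICATIONS:
            errors.append(
                (line_number, image, classification, "INVALID_CLASSIFICATION"))
        if image in seen:
            errors.append((line_number, image, classification, "DUPLICATE"))
        else:
            # As in the script, the first occurrence of an image is kept.
            seen.add(image)
            deduplicated.append(row)
    return deduplicated, errors


# Hypothetical sample with one duplicate and one invalid classification.
sample = (
    "image_1.jpg,anomaly\n"
    "image_2.jpg,normal\n"
    "image_1.jpg,anomaly\n"
    "image_3.jpg,broken\n"
)
deduplicated, errors = check_rows(sample)
for error in errors:
    print(error)
```

Each error tuple carries the same four values that the script writes to the errors file: line number, image, classification, and error type.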

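In this procedure, you create the CSV file and run the Python script to create the manifest file. The script has been tested with Python version 3.7.

To create a manifest file from a CSV file

Create a CSV file with the following fields in each row (one row per image). Don't add a header row to the CSV file.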

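Field 1	Field 2
The image name or the Amazon S3 path to the image. For example: `s3://s3bucket/circuitboard/train/anomaly/train-anomaly_10.jpg`. You can't mix images that have an Amazon S3 path with images that don't.	The anomaly classification for the image (`normal` or `anomaly`).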

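For example: s3://s3bucket/circuitboard/train/anomaly/image_10.jpg,anomaly or image_11.jpg,normal

Save the CSV file.

Run the following Python script. Supply the following arguments:

csv_file: The CSV file that you created in step 1.
(Optional) --s3-path s3://path_to_folder/: The Amazon S3 path to add to the image file names (field 1). Use --s3-path if the images in field 1 don't already contain an S3 path.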


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0
"""
Purpose
Shows how to create an Amazon Lookout for Vision manifest file from a CSV file.
The CSV file format is image location,anomaly classification (normal or anomaly)
For example:
s3://s3bucket/circuitboard/train/anomaly/train_11.jpg,anomaly
s3://s3bucket/circuitboard/train/normal/train_1.jpg,normal

If necessary, use the --s3_path argument to specify the Amazon S3 bucket folder for the images.
"""

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

logger = logging.getLogger(__name__)


def check_errors(csv_file):
    """
    Checks for duplicate images and incorrect classifications in a CSV file.
    If duplicate images or invalid anomaly assignments are found, an errors CSV file
    and deduplicated CSV file are created. Only the first
    occurrence of a duplicate is recorded. Other duplicates are recorded in the errors file.
    :param csv_file: The source CSV file
    :return: True if errors or duplicates are found, otherwise false.
    """

    logger.info("Checking %s.", csv_file)

    errors_found = False
    errors_file = f"{os.path.splitext(csv_file)[0]}_errors.csv"
    deduplicated_file = f"{os.path.splitext(csv_file)[0]}_deduplicated.csv"

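    # Open the CSV files with newline='' as the csv module recommends.
    with open(csv_file, 'r', newline='', encoding="UTF-8") as input_file,\
            open(deduplicated_file, 'w', newline='', encoding="UTF-8") as dedup,\
            open(errors_file, 'w', newline='', encoding="UTF-8") as errors: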

        reader = csv.reader(input_file, delimiter=',')
        dedup_writer = csv.writer(dedup)
        error_writer = csv.writer(errors)
        line = 1
        entries = set()
        for row in reader:

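            # Skip empty lines, but still count them so that the line numbers
            # recorded in the errors file match the input CSV file.
            if not ''.join(row).strip():
                line += 1
                continue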

            # Record any incorrect classifications.
            if row[1].lower() not in ("normal", "anomaly"):
                error_writer.writerow(
                    [line, row[0], row[1], "INVALID_CLASSIFICATION"])
                errors_found = True

            # Write first image entry to dedup file and record duplicates.
            key = row[0]
            if key not in entries:
                dedup_writer.writerow(row)
                entries.add(key)
            else:
                error_writer.writerow([line, row[0], row[1], "DUPLICATE"])
                errors_found = True
            line += 1

    if errors_found:
        logger.info("Errors found. Check %s.", errors_file)
    else:
        os.remove(errors_file)
        os.remove(deduplicated_file)

    return errors_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Read a CSV file and create an Amazon Lookout for Vision classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The Amazon S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s.", csv_file)

    image_count = 0
    anomalous_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
        open(manifest_file, "w", encoding="UTF-8") as output_file:

        image_classifications = csv.reader(
            csvfile, delimiter=',', quotechar='|')

        # Process each row (image) in the CSV file.
        for row in image_classifications:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            source_ref = str(s3_path) + row[0]
            classification = 0

            if row[1].lower() == 'anomaly':
                classification = 1
                anomalous_count += 1

            # Create the JSON line.
            json_line = {}
            json_line['source-ref'] = source_ref
            # Write the label as an integer (1 for anomaly, 0 for normal).
            json_line['anomaly-label'] = classification

            metadata = {}
            metadata['confidence'] = 1
            metadata['job-name'] = "labeling-job/anomaly-classification"
            metadata['class-name'] = row[1]
            metadata['human-annotated'] = "yes"
            metadata['creation-date'] = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
            metadata['type'] = "groundtruth/image-classification"

            json_line['anomaly-label-metadata'] = metadata

            output_file.write(json.dumps(json_line))
            output_file.write('\n')
            image_count += 1

    logger.info("Finished creating manifest file %s.\n"
                "Images: %s\nAnomalous: %s",
                manifest_file,
                image_count,
                anomalous_count)
    return image_count, anomalous_count


def add_arguments(parser):
    """
    Add command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

    parser.add_argument(
        # Accept both spellings. The documentation uses --s3-path.
        "--s3-path", "--s3_path", dest="s3_path",
        help="The Amazon S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the Amazon S3 path.", required=False
    )


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments.
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()
        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ""

        csv_file = args.csv_file
        csv_file_no_extension = os.path.splitext(csv_file)[0]
        manifest_file = csv_file_no_extension + '.manifest'

        # Create manifest file if there are no duplicate images.
        if check_errors(csv_file):
            print(f"Issues found. Use {csv_file_no_extension}_errors.csv "
                  "to view duplicates and errors.")
            print(f"{csv_file_no_extension}_deduplicated.csv contains the first "
                  "occurrence of a duplicate.\n"
                  "Update as necessary with the correct information.")
            print(f"Re-run the script with {csv_file_no_extension}_deduplicated.csv")
        else:
            print('No duplicates found. Creating manifest file.')

            image_count, anomalous_count = create_manifest_file(csv_file, manifest_file, s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n")

            normal_count = image_count-anomalous_count
            print(f"Images processed: {image_count}")
            print(f"Normal: {normal_count}")
            print(f"Anomalous: {anomalous_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")

if __name__ == "__main__":
    main()

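If duplicate images or classification errors are found:
1. Use the errors file to update the deduplicated CSV file or the input CSV file.
2. Run the script again with the updated deduplicated CSV file or the updated input CSV file.
If you plan to use a test dataset, repeat steps 1–4 to create a manifest file for your test dataset.
If necessary, copy the images from your computer to the Amazon S3 bucket path that you specified in column 1 of the CSV file (or with the --s3-path command line argument). To copy the images, enter the following command at the command prompt.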
```
aws s3 cp --recursive your-local-folder/ s3://your-target-S3-location/
```
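Create a dataset by following the instructions at Creating a dataset with a manifest file (console). If you're using the AWS SDK, see Creating a dataset with a manifest file (SDK).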
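
The field 1 description says that images with an Amazon S3 path can't be mixed with images without one, but the script itself doesn't check for this. The following is a minimal sketch of such a check, run on CSV text before the script. The function name `find_mixed_locations` is an assumption made for illustration.

```python
import csv
import io


def find_mixed_locations(csv_text):
    """Return the line numbers of rows with and without an S3 path in column 1."""
    with_path, without_path = [], []
    reader = csv.reader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=1):
        if not "".join(row).strip():
            continue  # Skip empty lines.
        if row[0].startswith("s3://"):
            with_path.append(line_number)
        else:
            without_path.append(line_number)
    return with_path, without_path


# Hypothetical sample that mixes both forms, which isn't allowed.
sample = (
    "s3://s3bucket/circuitboard/train/anomaly/image_10.jpg,anomaly\n"
    "image_11.jpg,normal\n"
)
with_path, without_path = find_mixed_locations(sample)
if with_path and without_path:
    print(f"Mixed entries. With S3 path: {with_path}. Without: {without_path}.")
```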
