從 CSV 檔案建立清單檔案。

此 Python 指令碼範例使用逗號分隔值 (CSV) 檔案來標記影像，簡化了清單檔案的建立。建立 CSV 檔案。清單檔案適用於多標籤影像分類或多標籤影像分類。如需詳細資訊，請參閱尋找物件、場景和概念。

注意

此指令碼不會建立適合尋找物件位置或尋找品牌位置的清單檔案。

清單檔案會描述用於訓練模型的影像。例如，指派給影像的影像位置和標籤。清單檔案由一或多個 JSON Lines 組成。每個 JSON Line 會描述單一影像。如需詳細資訊，請參閱在資訊清單檔案中匯入影像層級標籤。

CSV 檔案代表文字檔案中多資料列的表格式資料。資料列中的欄位以逗號分隔。如需詳細資訊，請參閱逗號分隔值。對於此指令碼，CSV 檔案中的每一資料列會代表單一影像，並對應至清單檔案中的 JSON Line。若要為支援多標籤影像分類的清單檔案建立 CSV 檔案，請在每一資料列新增一或多個影像層級標籤。若要建立適合 Image classification 的清單檔案，請將單一影像層級標籤新增至每一資料列。

例如，下列 CSV 檔案描述多標籤影像分類 (Flowers) 入門專案中的影像。


camellia1.jpg,camellia,with_leaves
camellia2.jpg,camellia,with_leaves
camellia3.jpg,camellia,without_leaves
helleborus1.jpg,helleborus,without_leaves,not_fully_grown
helleborus2.jpg,helleborus,with_leaves,fully_grown
helleborus3.jpg,helleborus,with_leaves,fully_grown
jonquil1.jpg,jonquil,with_leaves
jonquil2.jpg,jonquil,with_leaves
jonquil3.jpg,jonquil,with_leaves
jonquil4.jpg,jonquil,without_leaves
mauve_honey_myrtle1.jpg,mauve_honey_myrtle,without_leaves
mauve_honey_myrtle2.jpg,mauve_honey_myrtle,with_leaves
mauve_honey_myrtle3.jpg,mauve_honey_myrtle,with_leaves
mediterranean_spurge1.jpg,mediterranean_spurge,with_leaves
mediterranean_spurge2.jpg,mediterranean_spurge,without_leaves

該指令碼會為每一資料列產生 JSON Lines。例如，以下是第一資料列 (camellia1.jpg,camellia,with_leaves) 的 JSON Line。


{"source-ref": "s3://bucket/flowers/train/camellia1.jpg","camellia": 1,"camellia-metadata":{"confidence": 1,"job-name": "labeling-job/camellia","class-name": "camellia","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"},"with_leaves": 1,"with_leaves-metadata":{"confidence": 1,"job-name": "labeling-job/with_leaves","class-name": "with_leaves","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"}}

在 CSV 範例中，影像的 Amazon S3 路徑不存在。如果您的 CSV 檔案不包含影像的 Amazon S3 路徑，請使用 --s3_path 命令列引數指定影像的 Amazon S3 路徑。

指令碼會在已刪除重複影像的 CSV 檔案中記錄每個影像的第一個項目。已刪除重複影像的 CSV 檔案包含輸入 CSV 檔案中找到的每個影像的單一執行個體。輸入 CSV 檔案中影像的進一步出現次數會記錄在重複影像 CSV 檔案中。如果指令碼尋找重複影像，請檢閱重複影像的 CSV 檔案，並視需要更新已刪除重複影像的 CSV 檔案。使用已刪除重複資料的檔案重新執行指令碼。如果在輸入的 CSV 檔案中找不到重複項目，則指令碼會刪除已刪除重複影像的 CSV 檔案和重複影像的 CSV 檔案，因為它們是空白的。

在此程序中，您可以建立 CSV 檔案並執行 Python 指令碼來建立清單檔案。

從 CSV 檔案建立清單檔案

建立 CSV 檔案，每一資料列中包含以下欄位 (每個影像一個資料列)。請勿將標題資料列新增至 CSV 檔案。

欄位 1	欄位 2	欄位 n
影像名稱或 Amazon S3 路徑影像。例如：`s3://my-bucket/flowers/train/camellia1.jpg`。您不能混合使用具有 Amazon S3 路徑的影像和不具有 Amazon S3 路徑的影像。	影像的第一個影像層級標籤。	一或多個以逗號分隔的其他影像層級標籤。只有在您想要建立支援多標籤影像分類的清單檔案時才新增。

例如：camellia1.jpg,camellia,with_leaves 或 s3://my-bucket/flowers/train/camellia1.jpg,camellia,with_leaves

儲存 CSV 檔案。

執行下列 Python 指令碼。提供下列引數：

csv_file — 您在步驟 1 中建立的 CSV 檔案。
manifest_file — 您要建立的清單檔案的名稱。
(選用) --s3_path s3://path_to_folder/ — 要新增至影像檔案名稱的 Amazon S3 路徑 (欄位 1)。如果欄位 1 中的影像尚未包含 S3 路徑，請使用 --s3_path。


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

"""
Purpose
Amazon Rekognition Custom Labels model example used in the service documentation.
Shows how to create an image-level (classification) manifest file from a CSV file.
You can specify multiple image level labels per image.
CSV file format is
image,label,label,..
If necessary, use the bucket argument to specify the S3 bucket folder for the images.
https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/md-gt-cl-transform.html
"""

logger = logging.getLogger(__name__)


def check_duplicates(csv_file, deduplicated_file, duplicates_file):
    """
    Checks for duplicate images in a CSV file. If duplicate images
    are found, deduplicated_file is the deduplicated CSV file - only the first
    occurence of a duplicate is recorded. Other duplicates are recorded in duplicates_file.
    :param csv_file: The source CSV file.
    :param deduplicated_file: The deduplicated CSV file to create. If no duplicates are found
    this file is removed.
    :param duplicates_file: The duplicate images CSV file to create. If no duplicates are found
    this file is removed.
    :return: True if duplicates are found, otherwise false.
    """

    logger.info("Deduplicating %s", csv_file)

    duplicates_found = False

    # Find duplicates.
    with open(csv_file, 'r', newline='', encoding="UTF-8") as f,\
            open(deduplicated_file, 'w', encoding="UTF-8") as dedup,\
            open(duplicates_file, 'w', encoding="UTF-8") as duplicates:

        reader = csv.reader(f, delimiter=',')
        dedup_writer = csv.writer(dedup)
        duplicates_writer = csv.writer(duplicates)

        entries = set()
        for row in reader:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            key = row[0]
            if key not in entries:
                dedup_writer.writerow(row)
                entries.add(key)
            else:
                duplicates_writer.writerow(row)
                duplicates_found = True

    if duplicates_found:
        logger.info("Duplicates found check %s", duplicates_file)

    else:
        os.remove(duplicates_file)
        os.remove(deduplicated_file)

    return duplicates_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Reads a CSV file and creates a Custom Labels classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s", csv_file)

    image_count = 0
    label_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
            open(manifest_file, "w", encoding="UTF-8") as output_file:

        image_classifications = csv.reader(
            csvfile, delimiter=',', quotechar='|')

        # Process each row (image) in CSV file.
        for row in image_classifications:
            source_ref = str(s3_path)+row[0]

            image_count += 1

            # Create JSON for image source ref.
            json_line = {}
            json_line['source-ref'] = source_ref

            # Process each image level label.
            for index in range(1, len(row)):
                image_level_label = row[index]

                # Skip empty columns.
                if image_level_label == '':
                    continue
                label_count += 1

               # Create the JSON line metadata.
                json_line[image_level_label] = 1
                metadata = {}
                metadata['confidence'] = 1
                metadata['job-name'] = 'labeling-job/' + image_level_label
                metadata['class-name'] = image_level_label
                metadata['human-annotated'] = "yes"
                metadata['creation-date'] = \
                    datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
                metadata['type'] = "groundtruth/image-classification"

                json_line[f'{image_level_label}-metadata'] = metadata

                # Write the image JSON Line.
            output_file.write(json.dumps(json_line))
            output_file.write('\n')

    output_file.close()
    logger.info("Finished creating manifest file %s\nImages: %s\nLabels: %s",
                manifest_file, image_count, label_count)

    return image_count, label_count


def add_arguments(parser):
    """
    Adds command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

    parser.add_argument(
        "--s3_path", help="The S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the S3 path.", required=False
    )


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()

        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ''

        # Create file names.
        csv_file = args.csv_file
        file_name = os.path.splitext(csv_file)[0]
        manifest_file = f'{file_name}.manifest'
        duplicates_file = f'{file_name}-duplicates.csv'
        deduplicated_file = f'{file_name}-deduplicated.csv'

        # Create manifest file, if there are no duplicate images.
        if check_duplicates(csv_file, deduplicated_file, duplicates_file):
            print(f"Duplicates found. Use {duplicates_file} to view duplicates "
                  f"and then update {deduplicated_file}. ")
            print(f"{deduplicated_file} contains the first occurence of a duplicate. "
                  "Update as necessary with the correct label information.")
            print(f"Re-run the script with {deduplicated_file}")
        else:
            print("No duplicates found. Creating manifest file.")

            image_count, label_count = create_manifest_file(csv_file,
                                                            manifest_file,
                                                            s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n"
                  f"Images: {image_count}\nLabels: {label_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")


if __name__ == "__main__":
    main()

如果您打算使用測試資料集，請重複步驟 1 至 3，為測試資料集建立清單檔案。
如有必要，請將影像複製到您在 CSV 檔案的資料欄 1 中指定的 Amazon S3 儲存貯體路徑 (或在 --s3_path 命令列中指定)。您可以使用下列 AWS S3 命令。
```
aws s3 cp --recursive your-local-folder s3://your-target-S3-location
```
將清單檔案上傳到您要用於存放清單檔案的 Amazon S3 儲存貯體。

注意
請確保 Amazon Rekognition 自訂標籤可存取清單檔案 JSON Lines 的 source-ref 欄位中參考的 Amazon S3 儲存貯體。如需詳細資訊，請參閱存取外部 Amazon S3 儲存貯體。如果您的 Ground Truth 任務將影像存放在 Amazon Rekognition 自訂標籤主控台儲存貯體中，則不需要新增權限。
請遵循使用 SageMaker AI Ground Truth 資訊清單檔案（主控台）建立資料集中的指示，使用上傳的資訊清單檔案建立資料集。對於步驟 8，請在 .manifest 檔案位置，輸入清單檔案位置的 Amazon S3 URL。如果您使用 AWS SDK，請執行使用 SageMaker AI Ground Truth 資訊清單檔案 (SDK) 建立資料集。

您的瀏覽器已停用或無法使用 Javascript。

您必須啟用 Javascript，才能使用 AWS 文件。請參閱您的瀏覽器說明頁以取得說明。

文件慣用形式

轉換多標籤 Ground Truth 清單檔案

從現有資料集複製內容