Creating a manifest file from a CSV file

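This example Python script simplifies the creation of a manifest file by using a comma-separated values (CSV) file to label images. You create the CSV file. The resulting manifest file is suitable for image classification or multi-label image classification. For more information, see Find objects, scenes, and concepts.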

Note

This script can't create a manifest file suitable for finding object locations or brand locations.

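A manifest file describes the images used to train a model, for example the image locations and the labels assigned to the images. A manifest file consists of one or more JSON Lines. Each JSON Line describes a single image. For more information, see Importing image-level labels in manifest files.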

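A CSV file represents tabular data as multiple rows in a text file. The fields in a row are separated by commas. For more information, see Comma-separated values. For this script, each row in the CSV file represents a single image and maps to one JSON Line in the manifest file. To create a CSV file for a manifest file that supports multi-label image classification, add one or more image-level labels to each row. To create a manifest file suitable for image classification, add a single image-level label to each row.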

For example, the following CSV file describes the images in the Multi-label image classification (Flowers) Getting started project.


camellia1.jpg,camellia,with_leaves
camellia2.jpg,camellia,with_leaves
camellia3.jpg,camellia,without_leaves
helleborus1.jpg,helleborus,without_leaves,not_fully_grown
helleborus2.jpg,helleborus,with_leaves,fully_grown
helleborus3.jpg,helleborus,with_leaves,fully_grown
jonquil1.jpg,jonquil,with_leaves
jonquil2.jpg,jonquil,with_leaves
jonquil3.jpg,jonquil,with_leaves
jonquil4.jpg,jonquil,without_leaves
mauve_honey_myrtle1.jpg,mauve_honey_myrtle,without_leaves
mauve_honey_myrtle2.jpg,mauve_honey_myrtle,with_leaves
mauve_honey_myrtle3.jpg,mauve_honey_myrtle,with_leaves
mediterranean_spurge1.jpg,mediterranean_spurge,with_leaves
mediterranean_spurge2.jpg,mediterranean_spurge,without_leaves

The script generates a JSON Line for each row. For example, the following is the JSON Line for the first row (camellia1.jpg,camellia,with_leaves).


{"source-ref": "s3://bucket/flowers/train/camellia1.jpg","camellia": 1,"camellia-metadata":{"confidence": 1,"job-name": "labeling-job/camellia","class-name": "camellia","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"},"with_leaves": 1,"with_leaves-metadata":{"confidence": 1,"job-name": "labeling-job/with_leaves","class-name": "with_leaves","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"}}
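To spot-check a generated manifest, you can parse a JSON Line and list the labels it assigns. The following is a minimal sketch and not part of the script. The helper name `labels_in_json_line` is hypothetical, and it assumes the layout shown above, in which every label key has a matching `<label>-metadata` key. The sample line is shortened for readability.

```python
import json


def labels_in_json_line(line):
    """Return (source_ref, labels) for one manifest JSON Line.

    Assumes the layout shown above: each label key has a sibling
    '<label>-metadata' key. Hypothetical helper for illustration.
    """
    entry = json.loads(line)
    labels = [key for key in entry
              if key != "source-ref"
              and not key.endswith("-metadata")
              and f"{key}-metadata" in entry]
    return entry["source-ref"], labels


# Shortened version of the JSON Line for camellia1.jpg.
example = ('{"source-ref": "s3://bucket/flowers/train/camellia1.jpg",'
           '"camellia": 1,"camellia-metadata": {"class-name": "camellia"},'
           '"with_leaves": 1,"with_leaves-metadata": {"class-name": "with_leaves"}}')
source_ref, labels = labels_in_json_line(example)
print(source_ref, labels)
```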

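In the example CSV, the Amazon S3 path to the images isn't present. If your CSV file doesn't include the Amazon S3 path for the images, use the --s3_path command line argument to specify the Amazon S3 path to the images.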
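The script joins the `--s3_path` value and field 1 by plain string concatenation, so the path you supply needs a trailing slash. The sketch below illustrates the resulting `source-ref` values. `build_source_ref` is a hypothetical helper, and the trailing-slash normalization is an addition for illustration that the script itself does not perform.

```python
def build_source_ref(image, s3_path=""):
    """Combine an optional S3 folder path with field 1 of a CSV row.

    Hypothetical helper. Unlike the script, it adds a missing trailing
    slash to s3_path before concatenating.
    """
    if s3_path and not s3_path.endswith("/"):
        s3_path += "/"
    return s3_path + image


# Field 1 holds only a file name, so the folder path is supplied separately.
print(build_source_ref("camellia1.jpg", "s3://bucket/flowers/train"))
# Field 1 already holds the full S3 path, so no folder path is needed.
print(build_source_ref("s3://bucket/flowers/train/camellia1.jpg"))
```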

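The script records the first entry for each image in a deduplicated image CSV file. The deduplicated image CSV file contains a single instance of each image found in the input CSV file. Further occurrences of an image in the input CSV file are recorded in a duplicate image CSV file. If the script finds duplicate images, review the duplicate image CSV file and update the deduplicated image CSV file as necessary. Then re-run the script with the deduplicated file. If no duplicates are found in the input CSV file, the script deletes the deduplicated image CSV file and the duplicate image CSV file, because they aren't needed.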
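The deduplication rule described above can be sketched in memory as follows. This is an illustration under simplifying assumptions, not the script's implementation: `split_duplicates` is a hypothetical helper that returns two lists, whereas the script writes two CSV files.

```python
def split_duplicates(rows):
    """Split CSV rows into first occurrences and later duplicates.

    Each row is a list of fields; field 1 (the image) is the key.
    Hypothetical in-memory helper for illustration.
    """
    seen = set()
    deduplicated, duplicates = [], []
    for row in rows:
        # Skip empty rows.
        if not "".join(row).strip():
            continue
        if row[0] in seen:
            duplicates.append(row)
        else:
            seen.add(row[0])
            deduplicated.append(row)
    return deduplicated, duplicates


rows = [
    ["camellia1.jpg", "camellia", "with_leaves"],
    ["camellia2.jpg", "camellia", "with_leaves"],
    ["camellia1.jpg", "camellia", "without_leaves"],  # Repeats camellia1.jpg.
]
deduplicated, duplicates = split_duplicates(rows)
print(deduplicated)
print(duplicates)
```

Because only the first occurrence is kept, a conflicting label on a later row (here `without_leaves`) ends up in the duplicates list, which is why the duplicates need a manual review.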

In this procedure, you create the CSV file and run the Python script to create the manifest file.

To create a manifest file from a CSV file

Create a CSV file with the following fields in each row (one row per image). Don't add a header row to the CSV file.

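Field 1	Field 2	Field n
The image name or the Amazon S3 path to the image. For example, `s3://my-bucket/flowers/train/camellia1.jpg`. You can't mix images that have an Amazon S3 path with images that don't.	The first image-level label for the image.	One or more additional image-level labels, separated by commas. Add these only if you want to create a manifest file that supports multi-label image classification.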

For example, camellia1.jpg,camellia,with_leaves or s3://my-bucket/flowers/train/camellia1.jpg,camellia,with_leaves

Save the CSV file.
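If your labels already exist in a program, you can generate the CSV file instead of typing it by hand. The following is a minimal sketch that assumes the labels are held in a dictionary. The `image_labels` data, the `write_label_csv` helper, and the `flowers.csv` file name are all hypothetical.

```python
import csv

# Hypothetical image-to-labels mapping; replace with your own data.
image_labels = {
    "camellia1.jpg": ["camellia", "with_leaves"],
    "helleborus1.jpg": ["helleborus", "without_leaves", "not_fully_grown"],
}


def write_label_csv(path, image_labels):
    """Write one row per image: the image name followed by its labels."""
    # newline='' lets the csv module control line endings.
    with open(path, "w", newline="", encoding="UTF-8") as f:
        writer = csv.writer(f)
        # No header row is written, as the script expects none.
        for image, labels in image_labels.items():
            writer.writerow([image, *labels])


write_label_csv("flowers.csv", image_labels)
```

Using a dictionary keyed by image name also guarantees that each image appears only once in the file.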

Run the following Python script. Supply the following arguments:

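csv_file: The CSV file that you created in step 1.
manifest_file: The name of the manifest file that the script creates. The script as listed doesn't accept this as a command line argument. It names the manifest file after the CSV file, replacing the file extension with .manifest.
(Optional) --s3_path s3://path_to_folder/: The Amazon S3 path to add to the image file names (field 1). Use --s3_path if the images in field 1 don't already contain an S3 path.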


# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier:  Apache-2.0

from datetime import datetime, timezone
import argparse
import logging
import csv
import os
import json

"""
Purpose
Amazon Rekognition Custom Labels model example used in the service documentation.
Shows how to create an image-level (classification) manifest file from a CSV file.
You can specify multiple image level labels per image.
CSV file format is
image,label,label,..
If necessary, use the --s3_path argument to specify the S3 bucket folder for the images.
https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/md-gt-cl-transform.html
"""

logger = logging.getLogger(__name__)


def check_duplicates(csv_file, deduplicated_file, duplicates_file):
    """
    Checks for duplicate images in a CSV file. If duplicate images
    are found, deduplicated_file is the deduplicated CSV file - only the first
    occurrence of a duplicate is recorded. Other duplicates are recorded in duplicates_file.
    :param csv_file: The source CSV file.
    :param deduplicated_file: The deduplicated CSV file to create. If no duplicates are found
    this file is removed.
    :param duplicates_file: The duplicate images CSV file to create. If no duplicates are found
    this file is removed.
    :return: True if duplicates are found, otherwise false.
    """

    logger.info("Deduplicating %s", csv_file)

    duplicates_found = False

    # Find duplicates.
    with open(csv_file, 'r', newline='', encoding="UTF-8") as f,\
            open(deduplicated_file, 'w', newline='', encoding="UTF-8") as dedup,\
            open(duplicates_file, 'w', newline='', encoding="UTF-8") as duplicates:

        reader = csv.reader(f, delimiter=',')
        dedup_writer = csv.writer(dedup)
        duplicates_writer = csv.writer(duplicates)

        entries = set()
        for row in reader:
            # Skip empty lines.
            if not ''.join(row).strip():
                continue

            key = row[0]
            if key not in entries:
                dedup_writer.writerow(row)
                entries.add(key)
            else:
                duplicates_writer.writerow(row)
                duplicates_found = True

    if duplicates_found:
        logger.info("Duplicates found check %s", duplicates_file)

    else:
        os.remove(duplicates_file)
        os.remove(deduplicated_file)

    return duplicates_found


def create_manifest_file(csv_file, manifest_file, s3_path):
    """
    Reads a CSV file and creates a Custom Labels classification manifest file.
    :param csv_file: The source CSV file.
    :param manifest_file: The name of the manifest file to create.
    :param s3_path: The S3 path to the folder that contains the images.
    """
    logger.info("Processing CSV file %s", csv_file)

    image_count = 0
    label_count = 0

    with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
            open(manifest_file, "w", encoding="UTF-8") as output_file:

        image_classifications = csv.reader(
            csvfile, delimiter=',', quotechar='|')

        # Process each row (image) in CSV file.
        for row in image_classifications:
            # Skip empty lines, as check_duplicates does.
            if not ''.join(row).strip():
                continue

            source_ref = str(s3_path)+row[0]

            image_count += 1

            # Create JSON for image source ref.
            json_line = {}
            json_line['source-ref'] = source_ref

            # Process each image level label.
            for index in range(1, len(row)):
                image_level_label = row[index]

                # Skip empty columns.
                if image_level_label == '':
                    continue
                label_count += 1

                # Create the JSON line metadata.
                json_line[image_level_label] = 1
                metadata = {}
                metadata['confidence'] = 1
                metadata['job-name'] = 'labeling-job/' + image_level_label
                metadata['class-name'] = image_level_label
                metadata['human-annotated'] = "yes"
                metadata['creation-date'] = \
                    datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
                metadata['type'] = "groundtruth/image-classification"

                json_line[f'{image_level_label}-metadata'] = metadata

            # Write the image JSON Line.
            output_file.write(json.dumps(json_line))
            output_file.write('\n')

    logger.info("Finished creating manifest file %s\nImages: %s\nLabels: %s",
                manifest_file, image_count, label_count)

    return image_count, label_count


def add_arguments(parser):
    """
    Adds command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "csv_file", help="The CSV file that you want to process."
    )

    parser.add_argument(
        "--s3_path", help="The S3 bucket and folder path for the images."
        " If not supplied, column 1 is assumed to include the S3 path.", required=False
    )


def main():

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s: %(message)s")

    try:

        # Get command line arguments
        parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
        add_arguments(parser)
        args = parser.parse_args()

        s3_path = args.s3_path
        if s3_path is None:
            s3_path = ''

        # Create file names.
        csv_file = args.csv_file
        file_name = os.path.splitext(csv_file)[0]
        manifest_file = f'{file_name}.manifest'
        duplicates_file = f'{file_name}-duplicates.csv'
        deduplicated_file = f'{file_name}-deduplicated.csv'

        # Create manifest file, if there are no duplicate images.
        if check_duplicates(csv_file, deduplicated_file, duplicates_file):
            print(f"Duplicates found. Use {duplicates_file} to view duplicates "
                  f"and then update {deduplicated_file}. ")
            print(f"{deduplicated_file} contains the first occurrence of a duplicate. "
                  "Update as necessary with the correct label information.")
            print(f"Re-run the script with {deduplicated_file}")
        else:
            print("No duplicates found. Creating manifest file.")

            image_count, label_count = create_manifest_file(csv_file,
                                                            manifest_file,
                                                            s3_path)

            print(f"Finished creating manifest file: {manifest_file} \n"
                  f"Images: {image_count}\nLabels: {label_count}")

    except FileNotFoundError as err:
        logger.exception("File not found: %s", err)
        print(f"File not found: {err}. Check your input CSV file.")


if __name__ == "__main__":
    main()
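The script derives all of its output file names from the name of the input CSV file. The following sketch mirrors that naming logic so that you can see which files to expect. `derived_file_names` is a hypothetical helper, and `flowers.csv` is an assumed input file name.

```python
import os


def derived_file_names(csv_file):
    """Mirror how the script names its output files from the CSV file name."""
    file_name = os.path.splitext(csv_file)[0]
    return {
        "manifest": f"{file_name}.manifest",
        "duplicates": f"{file_name}-duplicates.csv",
        "deduplicated": f"{file_name}-deduplicated.csv",
    }


names = derived_file_names("flowers.csv")
print(names)
```

Re-running the script on the deduplicated file therefore produces a manifest file named after that file, for example flowers-deduplicated.manifest.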

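If you plan to use a test dataset, repeat steps 1–3 to create a manifest file for the test dataset.
If necessary, copy the images to the Amazon S3 bucket path that you specified in column 1 of the CSV file (or with the --s3_path command line argument). You can use the following AWS S3 command.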
```
aws s3 cp --recursive your-local-folder s3://your-target-S3-location
```
Upload the manifest files to the Amazon S3 bucket that you want to use for storing them.

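Note
Make sure that Amazon Rekognition Custom Labels has access to the Amazon S3 bucket referenced in the source-ref field of the manifest file JSON Lines. For more information, see Accessing external Amazon S3 Buckets. If your Ground Truth job stores images in the Amazon Rekognition Custom Labels console bucket, you don't need to add permissions.
Follow the instructions at Creating a dataset with a SageMaker Ground Truth manifest file (Console) to create a dataset with the uploaded manifest file. For step 8, in .manifest file location, enter the Amazon S3 URL for the location of the manifest file. If you are using the AWS SDK, follow Creating a dataset with a SageMaker Ground Truth manifest file (SDK).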
