AWS 文档 AWS SDK示例 GitHub 存储库中还有更多SDK示例
本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。
用于 Python 的 Amaz SDK on Textract 示例 (Boto3)
以下代码示例向您展示了如何在 Amazon Textract 中 AWS SDK for Python (Boto3) 使用来执行操作和实现常见场景。
操作是大型程序的代码摘录,必须在上下文中运行。您可以通过操作了解如何调用单个服务函数,还可以通过函数相关场景的上下文查看操作。
场景是向您展示如何通过在一个服务中调用多个函数或与其他 AWS 服务结合来完成特定任务的代码示例。
每个示例都包含一个指向完整源代码的链接,您可以在其中找到有关如何在上下文中设置和运行代码的说明。
操作
以下代码示例演示如何使用 AnalyzeDocument
。
- SDK适用于 Python (Boto3)
-
注意
还有更多相关信息 GitHub。查找完整示例,学习如何在 AWS 代码示例存储库
中进行设置和运行。 class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def analyze_file( self, feature_types, *, document_file_name=None, document_bytes=None ): """ Detects text and additional elements, such as forms or tables, in a local image file or from in-memory byte data. The image must be in PNG or JPG format. :param feature_types: The types of additional document features to detect. :param document_file_name: The name of a document image file. :param document_bytes: In-memory byte data of a document image. :return: The response from Amazon Textract, including a list of blocks that describe elements detected in the image. """ if document_file_name is not None: with open(document_file_name, "rb") as document_file: document_bytes = document_file.read() try: response = self.textract_client.analyze_document( Document={"Bytes": document_bytes}, FeatureTypes=feature_types ) logger.info("Detected %s blocks.", len(response["Blocks"])) except ClientError: logger.exception("Couldn't detect text.") raise else: return response
-
有关API详细信息,请参阅AnalyzeDocument中的 AWS SDKPython (Boto3) API 参考。
-
以下代码示例演示如何使用 DetectDocumentText
。
- SDK适用于 Python (Boto3)
-
注意
还有更多相关信息 GitHub。查找完整示例,学习如何在 AWS 代码示例存储库
中进行设置和运行。 class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def detect_file_text(self, *, document_file_name=None, document_bytes=None): """ Detects text elements in a local image file or from in-memory byte data. The image must be in PNG or JPG format. :param document_file_name: The name of a document image file. :param document_bytes: In-memory byte data of a document image. :return: The response from Amazon Textract, including a list of blocks that describe elements detected in the image. """ if document_file_name is not None: with open(document_file_name, "rb") as document_file: document_bytes = document_file.read() try: response = self.textract_client.detect_document_text( Document={"Bytes": document_bytes} ) logger.info("Detected %s blocks.", len(response["Blocks"])) except ClientError: logger.exception("Couldn't detect text.") raise else: return response
-
有关API详细信息,请参阅DetectDocumentText中的 AWS SDKPython (Boto3) API 参考。
-
以下代码示例演示如何使用 GetDocumentAnalysis
。
- SDK适用于 Python (Boto3)
-
注意
还有更多相关信息 GitHub。查找完整示例,学习如何在 AWS 代码示例存储库
中进行设置和运行。 class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def get_analysis_job(self, job_id): """ Gets data for a previously started detection job that includes additional elements. :param job_id: The ID of the job to retrieve. :return: The job data, including a list of blocks that describe elements detected in the image. """ try: response = self.textract_client.get_document_analysis(JobId=job_id) job_status = response["JobStatus"] logger.info("Job %s status is %s.", job_id, job_status) except ClientError: logger.exception("Couldn't get data for job %s.", job_id) raise else: return response
-
有关API详细信息,请参阅GetDocumentAnalysis中的 AWS SDKPython (Boto3) API 参考。
-
以下代码示例演示如何使用 StartDocumentAnalysis
。
- SDK适用于 Python (Boto3)
-
注意
还有更多相关信息 GitHub。查找完整示例,学习如何在 AWS 代码示例存储库
中进行设置和运行。 启动异步任务以分析文档。
class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def start_analysis_job( self, bucket_name, document_file_name, feature_types, sns_topic_arn, sns_role_arn, ): """ Starts an asynchronous job to detect text and additional elements, such as forms or tables, in an image stored in an Amazon S3 bucket. Textract publishes a notification to the specified Amazon SNS topic when the job completes. The image must be in PNG, JPG, or PDF format. :param bucket_name: The name of the Amazon S3 bucket that contains the image. :param document_file_name: The name of the document image stored in Amazon S3. :param feature_types: The types of additional document features to detect. :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic where job completion notification is published. :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM) role that can be assumed by Textract and grants permission to publish to the Amazon SNS topic. :return: The ID of the job. """ try: response = self.textract_client.start_document_analysis( DocumentLocation={ "S3Object": {"Bucket": bucket_name, "Name": document_file_name} }, NotificationChannel={ "SNSTopicArn": sns_topic_arn, "RoleArn": sns_role_arn, }, FeatureTypes=feature_types, ) job_id = response["JobId"] logger.info( "Started text analysis job %s on %s.", job_id, document_file_name ) except ClientError: logger.exception("Couldn't analyze text in %s.", document_file_name) raise else: return job_id
-
有关API详细信息,请参阅StartDocumentAnalysis中的 AWS SDKPython (Boto3) API 参考。
-
以下代码示例演示如何使用 StartDocumentTextDetection
。
- SDK适用于 Python (Boto3)
-
注意
还有更多相关信息 GitHub。查找完整示例,学习如何在 AWS 代码示例存储库
中进行设置和运行。 启动异步任务以检测文档中的文本。
class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def start_detection_job( self, bucket_name, document_file_name, sns_topic_arn, sns_role_arn ): """ Starts an asynchronous job to detect text elements in an image stored in an Amazon S3 bucket. Textract publishes a notification to the specified Amazon SNS topic when the job completes. The image must be in PNG, JPG, or PDF format. :param bucket_name: The name of the Amazon S3 bucket that contains the image. :param document_file_name: The name of the document image stored in Amazon S3. :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic where the job completion notification is published. :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM) role that can be assumed by Textract and grants permission to publish to the Amazon SNS topic. :return: The ID of the job. """ try: response = self.textract_client.start_document_text_detection( DocumentLocation={ "S3Object": {"Bucket": bucket_name, "Name": document_file_name} }, NotificationChannel={ "SNSTopicArn": sns_topic_arn, "RoleArn": sns_role_arn, }, ) job_id = response["JobId"] logger.info( "Started text detection job %s on %s.", job_id, document_file_name ) except ClientError: logger.exception("Couldn't detect text in %s.", document_file_name) raise else: return job_id
-
有关API详细信息,请参阅StartDocumentTextDetection中的 AWS SDKPython (Boto3) API 参考。
-
场景
以下代码示例展示了如何通过交互式应用程序浏览 Amazon Textract 的输出。
- SDK适用于 Python (Boto3)
-
演示如何 AWS SDK for Python (Boto3) 与 Amazon Textract 配合使用来检测文档图像中的文本、表单和表格元素。输入图像和 Amazon Textract 输出在 Tkinter 应用程序中显示,该应用程序可让您探索检测到的元素。
将文档图像提交到 Amazon Textract 并探索检测到的元素的输出。
将图像直接提交到 Amazon Textract,或通过 Amazon Simple Storage Service(Amazon S3)桶提交图像。
使用异步APIs启动任务,该任务在任务完成时向亚马逊简单通知服务 (AmazonSNS) 主题发布通知。
轮询亚马逊简单队列服务 (AmazonSQS) 队列以获取任务完成消息并显示结果。
有关如何设置和运行的完整源代码和说明,请参阅上的完整示例GitHub
。 本示例中使用的服务
Amazon S3
Amazon SNS
Amazon SQS
Amazon Textract
以下代码示例显示了如何使用 Amazon Comprehend 检测 Amazon Textract 从存储在 Amazon S3 内的图像中提取的文本中的实体。
- SDK适用于 Python (Boto3)
-
演示如何使用 Jupyter 笔记本 AWS SDK for Python (Boto3) 中的来检测从图像中提取的文本中的实体。此示例使用 Amazon Textract 从存储在 Amazon Simple Storage Service (Amazon S3) 内的图像中提取文本,并使用 Amazon Comprehend 检测提取文本中的实体。
此示例是 Jupyter 笔记本,必须在可以托管笔记本电脑的环境中运行。有关如何使用 Amazon 运行示例的说明 SageMaker,请参阅 TextractAndComprehendNotebook.ipyn
b 中的说明。 有关如何设置和运行的完整源代码和说明,请参阅上的完整示例GitHub
。 本示例中使用的服务
Amazon Comprehend
Amazon S3
Amazon Textract