

# Processing Documents Asynchronously
<a name="async"></a>

You can use Amazon Textract to detect and analyze text in multipage documents in PDF or TIFF format, including invoices and receipts. Multipage document processing is an asynchronous operation, and it is useful for processing large, multipage documents. For example, a PDF file with over 1,000 pages takes a long time to process, but processing the PDF file asynchronously allows your application to complete other tasks while the operation completes.

This section describes how you can use Amazon Textract to asynchronously detect and analyze text in a multipage or single-page document. Multipage documents must be in PDF or TIFF format. Single-page documents processed with asynchronous operations can be in JPEG, PNG, TIFF, or PDF format.

You can use Amazon Textract asynchronous operations for the following purposes:
+ Text detection – You can detect lines and words on a multipage document. The asynchronous operations are [StartDocumentTextDetection](API_StartDocumentTextDetection.md) and [GetDocumentTextDetection](API_GetDocumentTextDetection.md). For more information, see [Detecting Text](how-it-works-detecting.md).
+ Text analysis – You can identify relationships between detected text on a multipage document. The asynchronous operations are [StartDocumentAnalysis](API_StartDocumentAnalysis.md) and [GetDocumentAnalysis](API_GetDocumentAnalysis.md). For more information, see [Analyzing Documents](how-it-works-analyzing.md).
+ Expense analysis – You can identify data relationships on multipage invoices and receipts. Amazon Textract treats each page of a multipage document as an individual invoice or receipt, and doesn't retain context from one page to the next. The asynchronous operations are [StartExpenseAnalysis](API_StartExpenseAnalysis.md) and [GetExpenseAnalysis](API_GetExpenseAnalysis.md). For more information, see [Analyzing Invoices and Receipts](invoices-receipts.md).
+ Lending document analysis – You can classify and analyze lending documents using the Analyze Lending workflow, which classifies documents and then automatically sends the documents to the proper Amazon Textract operation for information extraction. You can start the asynchronous analysis of lending documents with `StartLendingAnalysis`, and retrieve the extracted information with `GetLendingAnalysis` or get a summary of the information with `GetLendingAnalysisSummary`. Analyze Lending returns the relevant information extracted from the documents, including detected signatures. You can also get the different types of documents in the submitted package, split by the logical boundaries for a given document type, if you use the `OutputConfig` feature.

**Topics**
+ [Calling Amazon Textract Asynchronous Operations](api-async.md)
+ [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md)
+ [Detecting or Analyzing Text in a Multipage Document](async-analyzing-with-sqs.md)
+ [Using the Analyze Lending Workflow](async-using-lending.md)
+ [Amazon Textract Results Notification](async-notification-payload.md)

# Calling Amazon Textract Asynchronous Operations
<a name="api-async"></a>

Amazon Textract provides an asynchronous API that you can use to process multipage documents in PDF or TIFF format. You can also use asynchronous operations to process single-page documents that are in JPEG, PNG, TIFF, or PDF format. 

The information in this topic uses text detection operations to show how to use Amazon Textract asynchronous operations. The same approach works with the text analysis operations [StartDocumentAnalysis](API_StartDocumentAnalysis.md) and [GetDocumentAnalysis](API_GetDocumentAnalysis.md), and with [StartExpenseAnalysis](API_StartExpenseAnalysis.md) and [GetExpenseAnalysis](API_GetExpenseAnalysis.md). 

For an example, see [Detecting or Analyzing Text in a Multipage Document](async-analyzing-with-sqs.md).

If you are analyzing lending documents, you can use the `StartLendingAnalysis` operation to classify document pages and send the classified pages to an Amazon Textract analysis operation. The pages are routed to analysis operations depending on their assigned class. 

You can retrieve results for individual pages by using the `GetLendingAnalysis` operation, or retrieve a summary of the analysis with `GetLendingAnalysisSummary`.

Amazon Textract asynchronously processes a document stored in an Amazon S3 bucket. You start processing by calling a `Start` operation, such as [StartDocumentTextDetection](API_StartDocumentTextDetection.md). The completion status of the request is published to an Amazon Simple Notification Service (Amazon SNS) topic. To get the completion status from the Amazon SNS topic, you can use an Amazon Simple Queue Service (Amazon SQS) queue or an AWS Lambda function. After you have the completion status, you call a `Get` operation, such as [GetDocumentTextDetection](API_GetDocumentTextDetection.md), to get the results of the request. 
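The Start/notify/Get sequence can be sketched end to end. The sketch below uses in-memory stubs in place of the Textract, Amazon SNS, and Amazon SQS clients so that the control flow is visible; with an SDK such as boto3 you would substitute the real client calls for the stubs.

```python
def process_document(start, wait_for_notification, get_results):
    """Start a job, wait for its completion notification, then fetch results."""
    job_id = start()["JobId"]
    notification = wait_for_notification()  # e.g., poll an Amazon SQS queue
    if notification["JobId"] != job_id or notification["Status"] != "SUCCEEDED":
        raise RuntimeError("job did not succeed: %s" % notification)
    return get_results(job_id)

# In-memory stubs standing in for StartDocumentTextDetection, the SQS queue,
# and GetDocumentTextDetection.
start = lambda: {"JobId": "job-1"}
wait = lambda: {"JobId": "job-1", "Status": "SUCCEEDED"}
get = lambda job_id: {"JobStatus": "SUCCEEDED", "Blocks": []}

print(process_document(start, wait, get)["JobStatus"])  # SUCCEEDED
```

The orchestration function deliberately checks that the notification's `JobId` matches the job it started, which matters when several jobs share one queue.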

Results of asynchronous calls are encrypted and stored for 7 days in an Amazon Textract-owned bucket by default, unless you specify an Amazon S3 bucket using an operation's `OutputConfig` argument. For information on how to let Amazon Textract send encrypted documents to your Amazon S3 bucket, see [Permissions for Output Configuration](api-async-roles.md#async-output-config).

The following table shows the corresponding Start and Get operations for the different types of asynchronous processing supported by Amazon Textract:


**Start/Get API Operations for Amazon Textract Asynchronous Operations**  

| Processing Type | Start API | Get API | 
| --- | --- | --- | 
| Text Detection | StartDocumentTextDetection | GetDocumentTextDetection | 
| Text Analysis | StartDocumentAnalysis | GetDocumentAnalysis | 
| Expense Analysis | StartExpenseAnalysis | GetExpenseAnalysis | 
| Lending Analysis | StartLendingAnalysis | GetLendingAnalysis, GetLendingAnalysisSummary | 

For an example that uses AWS Lambda functions, see [Large scale document processing with Amazon Textract](https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing).

The following diagram shows the process for detecting document text in a document image stored in an Amazon S3 bucket. In the diagram, an Amazon SQS queue gets the completion status from the Amazon SNS topic. 

![\[Diagram showing an Amazon Textract workflow with key steps: start and return job ID, process document in S3 bucket, publish completion status to SNS topic, monitor SQS queue for completion status, call GetDocumentTextDetection to get analysis results.\]](http://docs.aws.amazon.com/textract/latest/dg/images/asynchronous.png)


The process shown in the preceding diagram is the same for analyzing text and invoices/receipts. You start analyzing text by calling [StartDocumentAnalysis](API_StartDocumentAnalysis.md) and start analyzing invoices/receipts by calling [StartExpenseAnalysis](API_StartExpenseAnalysis.md). You get the results by calling [GetDocumentAnalysis](API_GetDocumentAnalysis.md) or [GetExpenseAnalysis](API_GetExpenseAnalysis.md), respectively.

## Starting Text Detection
<a name="api-async-start"></a>

You start an Amazon Textract text detection request by calling [StartDocumentTextDetection](API_StartDocumentTextDetection.md). The following is an example of a JSON request that's passed to `StartDocumentTextDetection`.

```
{
    "DocumentLocation": {
        "S3Object": {
            "Bucket": "bucket",
            "Name": "image.pdf"
        }
    },
    "ClientRequestToken": "DocumentDetectionToken",
    "NotificationChannel": {
        "SNSTopicArn": "arn:aws:sns:us-east-1:nnnnnnnnnn:topic",
        "RoleArn": "arn:aws:iam::nnnnnnnnnn:role/roleTopic"
    },
    "JobTag": "Receipt"
}
```

The input parameter `DocumentLocation` provides the document file name and the Amazon S3 bucket to retrieve it from. `NotificationChannel` contains the Amazon Resource Name (ARN) of the Amazon SNS topic that Amazon Textract notifies when the text detection request finishes. The Amazon SNS topic must be in the same AWS Region as the Amazon Textract endpoint that you're calling. `NotificationChannel` also contains the ARN for a role that allows Amazon Textract to publish to the Amazon SNS topic. You give Amazon Textract publishing permissions to your Amazon SNS topics by creating an IAM service role. For more information, see [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md).

You can also specify an optional input parameter, `JobTag`, that enables you to identify the job, or groups of jobs, in the completion status that's published to the Amazon SNS topic. For example, you can use `JobTag` to identify the type of document being processed, such as a tax form or receipt.

To prevent accidental duplication of analysis jobs, you can optionally provide an idempotency token, `ClientRequestToken`. If you supply a value for `ClientRequestToken`, the `Start` operation returns the same `JobId` for multiple identical calls to the same `Start` operation, such as `StartDocumentTextDetection`. A `ClientRequestToken` has a lifetime of 7 days, after which you can reuse it. If you reuse the token during its lifetime, the following happens: 
+ If you reuse the token with the same `Start` operation and the same input parameters, the same `JobId` is returned. The job isn't performed again, and Amazon Textract doesn't send a completion status to the registered Amazon SNS topic.
+ If you reuse the token with the same `Start` operation but a minor change to the input parameters, an `IdempotentParameterMismatchException` (HTTP status code: 400) is raised.
+ If you reuse the token with a different `Start` operation, the operation succeeds.

Another optional parameter is `OutputConfig`, which lets you control where your output is placed. By default, Amazon Textract stores the results internally, and they can be accessed only through the `Get` API operations. With `OutputConfig`, you can specify the name of the bucket that the output is sent to and a file prefix for the results, so that you can download them. Additionally, you can set the `KMSKeyId` parameter to a customer managed key to encrypt your output. If you don't set this parameter, Amazon Textract encrypts the results server-side using the AWS managed key for Amazon S3.

**Note**  
Before using this parameter, ensure you have the PutObject permission for the output bucket. Additionally, ensure you have the Decrypt, ReEncrypt, GenerateDataKey, and DescribeKey permissions for the AWS KMS key if you decide to use it.
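Putting the optional parameters together, a `Start` request with output configuration might be assembled as follows. This is a sketch: the bucket names, prefix, and KMS key ID are placeholders, and the parameter names (`OutputConfig`, `KMSKeyId`) follow the boto3-style request shape of the earlier example.

```python
# Hypothetical bucket names, prefix, and KMS key ID, for illustration only.
start_params = {
    "DocumentLocation": {
        "S3Object": {"Bucket": "input-bucket", "Name": "image.pdf"}
    },
    "NotificationChannel": {
        "SNSTopicArn": "arn:aws:sns:us-east-1:nnnnnnnnnn:topic",
        "RoleArn": "arn:aws:iam::nnnnnnnnnn:role/roleTopic",
    },
    # Optional: deliver results to your own bucket under a prefix.
    "OutputConfig": {"S3Bucket": "output-bucket", "S3Prefix": "textract-output"},
    # Optional: encrypt the output with a customer managed key.
    "KMSKeyId": "1234abcd-12ab-34cd-56ef-1234567890ab",
    "JobTag": "Receipt",
}

print(sorted(start_params))
```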

The response to the `StartDocumentTextDetection` operation is a job identifier (`JobId`). Use `JobId` to track requests and get the analysis results after Amazon Textract has published the completion status to the Amazon SNS topic. The following is an example:

```
{"JobId":"270c1cc5e1d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca90fc180e3"}
```

If you start too many jobs concurrently, calls to `StartDocumentTextDetection` raise a `LimitExceededException` exception (HTTP status code: 400) until the number of concurrently running jobs is below the Amazon Textract service limit. 

If you find that `LimitExceededException` exceptions are raised during bursts of activity, consider using an Amazon SQS queue to manage incoming requests. Contact AWS Support if your average number of concurrent requests can't be managed by an Amazon SQS queue and you're still receiving `LimitExceededException` exceptions. 
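A common way to absorb short bursts is to retry the `Start` call with exponential backoff. The sketch below is generic Python: the `LimitExceededException` class and `start_job_stub` are stand-ins defined here so the retry logic is runnable; with boto3 you would catch the corresponding exception from the Textract client instead.

```python
import random
import time

class LimitExceededException(Exception):
    """Stand-in for the service's throttling error."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() with exponential backoff and jitter on throttling errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except LimitExceededException:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))

# Demo: a stub Start call that is throttled twice before succeeding.
calls = {"n": 0}
def start_job_stub():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LimitExceededException("too many concurrent jobs")
    return {"JobId": "270c1cc5e1d0ea2f"}

print(call_with_backoff(start_job_stub, base_delay=0.01)["JobId"])
```

Backoff smooths bursts but doesn't replace a queue; for sustained load, feed requests through Amazon SQS as described above.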

## Getting the Completion Status of an Amazon Textract Analysis Request
<a name="api-async-get-status"></a>

Amazon Textract sends an analysis completion notification to the registered Amazon SNS topic. The notification includes the job identifier and the completion status of the operation in a JSON string. A successful text detection request has a `SUCCEEDED` status. For example, the following result shows the successful processing of a text detection job.

```
{
    "JobId": "642492aea78a86a40665555dc375ee97bc963f342b29cd05030f19bd8fd1bc5f",
    "Status": "SUCCEEDED",
    "API": "StartDocumentTextDetection",
    "JobTag": "Receipt",
    "Timestamp": 1543599965969,
    "DocumentLocation": {
        "S3ObjectName": "document",
        "S3Bucket": "bucket"
    }
}
```

For more information, see [Amazon Textract Results Notification](async-notification-payload.md).
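When the notification arrives through an Amazon SQS queue, the queue message body is an SNS envelope whose `Message` field contains the notification JSON as a string, so two levels of JSON parsing are needed. A minimal sketch, using the sample notification above as in-line data:

```python
import json

# Simulate the body of an SQS message delivered from the SNS topic:
# an SNS envelope whose "Message" field holds the Textract notification.
sqs_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "JobId": "642492aea78a86a40665555dc375ee97bc963f342b29cd05030f19bd8fd1bc5f",
        "Status": "SUCCEEDED",
        "API": "StartDocumentTextDetection",
        "JobTag": "Receipt",
        "Timestamp": 1543599965969,
        "DocumentLocation": {"S3ObjectName": "document", "S3Bucket": "bucket"},
    }),
})

notification = json.loads(json.loads(sqs_body)["Message"])
if notification["Status"] == "SUCCEEDED":
    print(notification["JobId"])
```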

To get the status information published to the Amazon SNS topic by Amazon Textract, use one of the following options:
+ **AWS Lambda** – You can subscribe an AWS Lambda function that you write to an Amazon SNS topic. The function is called when Amazon Textract notifies the Amazon SNS topic that the request has completed. Use a Lambda function if you want server-side code to process the results of a text detection request. For example, you might want to use server-side code to annotate the image or create a report on the detected text before returning the information to a client application.
+ **Amazon SQS** – You can subscribe an Amazon SQS queue to an Amazon SNS topic. You then poll the Amazon SQS queue to retrieve the completion status published by Amazon Textract when a text detection request completes. For more information, see [Detecting or Analyzing Text in a Multipage Document](async-analyzing-with-sqs.md). Use an Amazon SQS queue if you want to call Amazon Textract operations only from a client application. 

**Important**  
We don't recommend getting the request completion status by repeatedly calling the Amazon Textract `Get` operation. This is because Amazon Textract throttles the `Get` operation if too many requests are made. If you're processing multiple documents at the same time, it's simpler and more efficient to monitor one SQS queue for the completion notification than to poll Amazon Textract for the status of each job individually.

If you have configured your account to receive a results notification from an Amazon Simple Notification Service (Amazon SNS) topic or through an Amazon SQS queue, you should ensure that your account is secure by limiting the scope of Amazon Textract's access to just the resources you are using. This can be done by attaching a trust policy to your IAM service role. For information on how to do this, see [Cross-service confused deputy prevention](https://docs.aws.amazon.com/textract/latest/dg/cross-service-confused-deputy-prevention.html).

## Getting Amazon Textract Text Detection Results
<a name="api-async-get"></a>

To get the results of a text detection request, first ensure that the completion status that's retrieved from the Amazon SNS topic is `SUCCEEDED`. Then call `GetDocumentTextDetection`, which passes the `JobId` value that's returned from `StartDocumentTextDetection`. The request JSON is similar to the following example:

```
{
    "JobId": "270c1cc5e1d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca90fc180e3",
    "MaxResults": 10,
    "SortBy": "TIMESTAMP"
}
```

`JobId` is the identifier for the text detection operation. Because text detection can generate large amounts of data, use `MaxResults` to specify the maximum number of results to return in a single `Get` operation. The default value for `MaxResults` is 1,000. If you specify a value greater than 1,000, only 1,000 results are returned. If the operation doesn't return all of the results, a pagination token for the next page is returned. To get the next page of results, specify the token in the `NextToken` parameter. 

**Note**  
Results can be retrieved for only 7 days after the job is started.
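The `NextToken` loop can be sketched as follows. The `get_page` stub here stands in for `GetDocumentTextDetection` so the paging logic is runnable; with boto3 you would call the real operation, passing the `JobId` and `MaxResults` along with the token.

```python
def get_all_blocks(get_page):
    """Collect Blocks across all result pages.

    get_page mimics GetDocumentTextDetection: it accepts an optional
    NextToken keyword argument and returns a response dictionary.
    """
    blocks, token = [], None
    while True:
        resp = get_page(NextToken=token) if token else get_page()
        blocks.extend(resp.get("Blocks", []))
        token = resp.get("NextToken")
        if not token:
            return blocks

# Stub standing in for the service: two pages of results.
pages = {
    None: {"Blocks": [{"BlockType": "PAGE"}], "NextToken": "p2"},
    "p2": {"Blocks": [{"BlockType": "LINE"}, {"BlockType": "WORD"}]},
}
def get_page(NextToken=None):
    return pages[NextToken]

print(len(get_all_blocks(get_page)))  # 3
```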

The `GetDocumentTextDetection` operation response JSON is similar to the following. The total number of pages that are detected is returned in `DocumentMetadata`. The detected text is returned in the `Blocks` array. For information about `Block` objects, see [Text Detection and Document Analysis Response Objects](how-it-works-document-layout.md).

```
{
    "DocumentMetadata": {
        "Pages": 1
    },
    "JobStatus": "SUCCEEDED",
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {
                "BoundingBox": {
                    "Width": 1.0,
                    "Height": 1.0,
                    "Left": 0.0,
                    "Top": 0.0
                },
                "Polygon": [
                    {
                        "X": 0.0,
                        "Y": 0.0
                    },
                    {
                        "X": 1.0,
                        "Y": 0.0
                    },
                    {
                        "X": 1.0,
                        "Y": 1.0
                    },
                    {
                        "X": 0.0,
                        "Y": 1.0
                    }
                ]
            },
            "Id": "64533157-c47e-401a-930e-7ca1bb3ac3fa",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [
                        "4297834d-dcb1-413b-8908-3b96866ebbb5",
                        "1d85ba24-2877-4d09-b8b2-393833d769e9",
                        "193e9c47-fd87-475a-ba09-3fda210d8784",
                        "bd8aeb62-961b-4b47-b78a-e4ed9eeecd0f"
                    ]
                }
            ],
            "Page": 1
        },
        {
            "BlockType": "LINE",
            "Confidence": 53.301639556884766,
            "Text": "ellooworio",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.9999999403953552,
                    "Height": 0.5365243554115295,
                    "Left": 0.0,
                    "Top": 0.46347561478614807
                },
                "Polygon": [
                    {
                        "X": 0.0,
                        "Y": 0.46347561478614807
                    },
                    {
                        "X": 0.9999999403953552,
                        "Y": 0.46347561478614807
                    },
                    {
                        "X": 0.9999999403953552,
                        "Y": 1.0
                    },
                    {
                        "X": 0.0,
                        "Y": 1.0
                    }
                ]
            },
            "Id": "4297834d-dcb1-413b-8908-3b96866ebbb5",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [
                        "170c3eb9-5155-4bec-8c44-173bba537e70"
                    ]
                }
            ],
            "Page": 1
        },
        {
            "BlockType": "LINE",
            "Confidence": 89.15632629394531,
            "Text": "He llo,",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.33642634749412537,
                    "Height": 0.49159330129623413,
                    "Left": 0.13885067403316498,
                    "Top": 0.17169663310050964
                },
                "Polygon": [
                    {
                        "X": 0.13885067403316498,
                        "Y": 0.17169663310050964
                    },
                    {
                        "X": 0.47527703642845154,
                        "Y": 0.17169663310050964
                    },
                    {
                        "X": 0.47527703642845154,
                        "Y": 0.6632899641990662
                    },
                    {
                        "X": 0.13885067403316498,
                        "Y": 0.6632899641990662
                    }
                ]
            },
            "Id": "1d85ba24-2877-4d09-b8b2-393833d769e9",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [
                        "516ae823-3bab-4f9a-9d74-ad7150d128ab",
                        "6bcf4ea8-bbe8-4686-91be-b98dd63bc6a6"
                    ]
                }
            ],
            "Page": 1
        },
        {
            "BlockType": "LINE",
            "Confidence": 82.44834899902344,
            "Text": "worlo",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.33182239532470703,
                    "Height": 0.3766750991344452,
                    "Left": 0.5091826915740967,
                    "Top": 0.23131252825260162
                },
                "Polygon": [
                    {
                        "X": 0.5091826915740967,
                        "Y": 0.23131252825260162
                    },
                    {
                        "X": 0.8410050868988037,
                        "Y": 0.23131252825260162
                    },
                    {
                        "X": 0.8410050868988037,
                        "Y": 0.607987642288208
                    },
                    {
                        "X": 0.5091826915740967,
                        "Y": 0.607987642288208
                    }
                ]
            },
            "Id": "193e9c47-fd87-475a-ba09-3fda210d8784",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [
                        "ed135c3b-35dd-4085-8f00-26aedab0125f"
                    ]
                }
            ],
            "Page": 1
        },
        {
            "BlockType": "LINE",
            "Confidence": 88.50325775146484,
            "Text": "world",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.35004907846450806,
                    "Height": 0.19635874032974243,
                    "Left": 0.527581512928009,
                    "Top": 0.30100569128990173
                },
                "Polygon": [
                    {
                        "X": 0.527581512928009,
                        "Y": 0.30100569128990173
                    },
                    {
                        "X": 0.8776305913925171,
                        "Y": 0.30100569128990173
                    },
                    {
                        "X": 0.8776305913925171,
                        "Y": 0.49736443161964417
                    },
                    {
                        "X": 0.527581512928009,
                        "Y": 0.49736443161964417
                    }
                ]
            },
            "Id": "bd8aeb62-961b-4b47-b78a-e4ed9eeecd0f",
            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [
                        "9e28834d-798e-4a62-8862-a837dfd895a6"
                    ]
                }
            ],
            "Page": 1
        },
        {
            "BlockType": "WORD",
            "Confidence": 53.301639556884766,
            "Text": "ellooworio",
            "Geometry": {
                "BoundingBox": {
                    "Width": 1.0,
                    "Height": 0.5365243554115295,
                    "Left": 0.0,
                    "Top": 0.46347561478614807
                },
                "Polygon": [
                    {
                        "X": 0.0,
                        "Y": 0.46347561478614807
                    },
                    {
                        "X": 1.0,
                        "Y": 0.46347561478614807
                    },
                    {
                        "X": 1.0,
                        "Y": 1.0
                    },
                    {
                        "X": 0.0,
                        "Y": 1.0
                    }
                ]
            },
            "Id": "170c3eb9-5155-4bec-8c44-173bba537e70",
            "Page": 1
        },
        {
            "BlockType": "WORD",
            "Confidence": 88.46246337890625,
            "Text": "He",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.15350718796253204,
                    "Height": 0.29955607652664185,
                    "Left": 0.13885067403316498,
                    "Top": 0.21856294572353363
                },
                "Polygon": [
                    {
                        "X": 0.13885067403316498,
                        "Y": 0.21856294572353363
                    },
                    {
                        "X": 0.292357861995697,
                        "Y": 0.21856294572353363
                    },
                    {
                        "X": 0.292357861995697,
                        "Y": 0.5181190371513367
                    },
                    {
                        "X": 0.13885067403316498,
                        "Y": 0.5181190371513367
                    }
                ]
            },
            "Id": "516ae823-3bab-4f9a-9d74-ad7150d128ab",
            "Page": 1
        },
        {
            "BlockType": "WORD",
            "Confidence": 89.8501968383789,
            "Text": "llo,",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.17724157869815826,
                    "Height": 0.49159327149391174,
                    "Left": 0.2980354428291321,
                    "Top": 0.17169663310050964
                },
                "Polygon": [
                    {
                        "X": 0.2980354428291321,
                        "Y": 0.17169663310050964
                    },
                    {
                        "X": 0.47527703642845154,
                        "Y": 0.17169663310050964
                    },
                    {
                        "X": 0.47527703642845154,
                        "Y": 0.6632899045944214
                    },
                    {
                        "X": 0.2980354428291321,
                        "Y": 0.6632899045944214
                    }
                ]
            },
            "Id": "6bcf4ea8-bbe8-4686-91be-b98dd63bc6a6",
            "Page": 1
        },
        {
            "BlockType": "WORD",
            "Confidence": 82.44834899902344,
            "Text": "worlo",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.33182239532470703,
                    "Height": 0.3766750991344452,
                    "Left": 0.5091826915740967,
                    "Top": 0.23131252825260162
                },
                "Polygon": [
                    {
                        "X": 0.5091826915740967,
                        "Y": 0.23131252825260162
                    },
                    {
                        "X": 0.8410050868988037,
                        "Y": 0.23131252825260162
                    },
                    {
                        "X": 0.8410050868988037,
                        "Y": 0.607987642288208
                    },
                    {
                        "X": 0.5091826915740967,
                        "Y": 0.607987642288208
                    }
                ]
            },
            "Id": "ed135c3b-35dd-4085-8f00-26aedab0125f",
            "Page": 1
        },
        {
            "BlockType": "WORD",
            "Confidence": 88.50325775146484,
            "Text": "world",
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.35004907846450806,
                    "Height": 0.19635874032974243,
                    "Left": 0.527581512928009,
                    "Top": 0.30100569128990173
                },
                "Polygon": [
                    {
                        "X": 0.527581512928009,
                        "Y": 0.30100569128990173
                    },
                    {
                        "X": 0.8776305913925171,
                        "Y": 0.30100569128990173
                    },
                    {
                        "X": 0.8776305913925171,
                        "Y": 0.49736443161964417
                    },
                    {
                        "X": 0.527581512928009,
                        "Y": 0.49736443161964417
                    }
                ]
            },
            "Id": "9e28834d-798e-4a62-8862-a837dfd895a6",
            "Page": 1
        }
    ]
}
```
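Once the full response is assembled, the detected text is typically recovered by filtering the `Blocks` array by `BlockType`. The sketch below runs against an abbreviated copy of the response above (most fields are omitted for brevity):

```python
# Abbreviated GetDocumentTextDetection response for illustration.
response = {
    "DocumentMetadata": {"Pages": 1},
    "JobStatus": "SUCCEEDED",
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Text": "He llo,", "Confidence": 89.2, "Page": 1},
        {"BlockType": "LINE", "Text": "world", "Confidence": 88.5, "Page": 1},
        {"BlockType": "WORD", "Text": "world", "Confidence": 88.5, "Page": 1},
    ],
}

# Keep only LINE blocks; WORD blocks repeat the same text word by word.
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print(lines)  # ['He llo,', 'world']
```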

## Using an adapter
<a name="async-using-adapters"></a>

With Amazon Textract, you can use an adapter when calling the [StartDocumentAnalysis](API_StartDocumentAnalysis.md) operation. To use an adapter, you must first create and train it by using the Amazon Textract console. You then apply the adapter by providing its ID when calling the [StartDocumentAnalysis](API_StartDocumentAnalysis.md) operation. You can use at most one adapter per page. The following example shows the `AdaptersConfig` portion of a request:

```
"AdaptersConfig": { 
      "Adapters": [ 
         { 
            "AdapterId": "2e9bf1c4aa31",
            "Version": "1",
            "Pages": [ "1" ]
         }
      ]
   }
```
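For example, a `StartDocumentAnalysis` request that applies the adapter to page 1 might carry parameters shaped like the following sketch. The bucket and document names are placeholders, and the assumption here is that the adapter is used with the Queries feature type:

```python
analysis_params = {
    "DocumentLocation": {
        # Placeholder bucket and document names.
        "S3Object": {"Bucket": "bucket", "Name": "document.pdf"}
    },
    "FeatureTypes": ["QUERIES"],
    "AdaptersConfig": {
        "Adapters": [
            {"AdapterId": "2e9bf1c4aa31", "Version": "1", "Pages": ["1"]}
        ]
    },
}

adapter = analysis_params["AdaptersConfig"]["Adapters"][0]
print(adapter["AdapterId"], adapter["Pages"])  # 2e9bf1c4aa31 ['1']
```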

# Configuring Amazon Textract for Asynchronous Operations
<a name="api-async-roles"></a>

The following procedures show you how to configure Amazon Textract for use with an Amazon Simple Notification Service (Amazon SNS) topic and an Amazon Simple Queue Service (Amazon SQS) queue.

**Note**  
If you're using these instructions to set up the [Detecting or Analyzing Text in a Multipage Document](async-analyzing-with-sqs.md) example, you don't need to do steps 3 – 6. The example includes code to create and configure the Amazon SNS topic and Amazon SQS queue.

**To configure Amazon Textract**

1. Set up an AWS account to access Amazon Textract. For more information, see [Step 1: Set Up an AWS Account and Create a User](setting-up.md).

   Ensure that the user has at least the following permissions:
   + AmazonTextractFullAccess
   + AmazonS3ReadOnlyAccess
   + AmazonSNSFullAccess
   + AmazonSQSFullAccess

   Additionally, ensure that the user has permission to pass IAM roles to Amazon Textract. This is done through an IAM `PassRole` policy. A simple example of such a policy is below:

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": "iam:PassRole",
               "Resource": "*",
               "Condition": {
                   "StringEquals": {"iam:PassedToService": "textract.amazonaws.com"}
               }
           }
       ]
   }
   ```


1. Install and configure the required AWS SDK. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md). 

1. [Create an Amazon SNS standard topic](https://docs.aws.amazon.com/sns/latest/dg/sns-tutorial-create-topic.html). Prepend the topic name with *AmazonTextract*. Note the topic Amazon Resource Name (ARN). Ensure that the topic is in the same Region as the AWS endpoint that you're using with your AWS account.

1. [Create an Amazon SQS standard queue](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-create-queue.html) by using the [Amazon SQS console](https://console.aws.amazon.com/sqs/). Note the queue ARN.

1. [Subscribe the queue to the topic](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-subscribe-queue-sns-topic.html) you created in step 3.

1. [Give permission to the Amazon SNS topic to send messages to the Amazon SQS queue](https://docs.aws.amazon.com/sns/latest/dg/subscribe-sqs-queue-to-sns-topic.html).

1. Create an IAM service role to give Amazon Textract access to your Amazon SNS topics. Note the Amazon Resource Name (ARN) of the service role. For more information, see [Giving Amazon Textract Access to Your Amazon SNS Topic](#api-async-roles-all-topics).

1. [Add the following inline policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#embed-inline-policy-console) to the IAM user that you created in step 1. 

   Give the inline policy a name.

1. You can now run the examples in [Detecting or Analyzing Text in a Multipage Document](async-analyzing-with-sqs.md).
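Step 5 boils down to attaching a policy to the queue that names the topic. As a rough sketch of that grant, the helper below builds the policy document; the function name is illustrative, but the shape (an `sns.amazonaws.com` principal allowed `sqs:SendMessage`, scoped by an `aws:SourceArn` condition) follows the linked Amazon SNS guidance.

```python
import json

def sqs_policy_for_sns(queue_arn, topic_arn):
    """Build an SQS queue policy that lets one SNS topic send messages.

    The aws:SourceArn condition restricts the grant to the named topic,
    so other topics (and other accounts) cannot write to the queue.
    """
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })
```

With the SDK for Python (Boto3), you would pass the returned document as the `Policy` entry in a `set_queue_attributes` call on the queue.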

## Giving Amazon Textract Access to Your Amazon SNS Topic
<a name="api-async-roles-all-topics"></a>

Amazon Textract needs permission to send a message to your Amazon SNS topic when an asynchronous operation is complete. You use an IAM service role to give Amazon Textract access to the Amazon SNS topic. 

 When you create the Amazon SNS topic, you must prepend the topic name with **AmazonTextract**—for example, **AmazonTextractMyTopicName**. 

1. Sign in to the IAM console ([https://console.aws.amazon.com/iam](https://console.aws.amazon.com/iam)).

1. In the navigation pane, choose **Roles**.

1. Choose **Create role**.

1. For **Select type of trusted entity**, choose **AWS service**. 

1. For **Choose the service that will use this role**, choose **Textract**.

1. Choose **Next: Permissions**.

1. Verify that the **AmazonTextractServiceRole** policy has been included in the list of attached policies. To display the policy in the list, enter part of the policy name in the **Filter policies** box.

1. Choose **Next: Tags**.

1. You don't need to add tags, so choose **Next: Review**.

1. In the **Review** section, for **Role name**, enter a name for the role (for example, `TextractRole`). In **Role description**, update the description for the role, and then choose **Create role**.

1. Choose the new role to open the role's details page.

1. In the **Summary**, copy the **Role ARN** value and save it.

1. Choose **Trust relationships**.

1. Choose **Edit trust relationship**, and edit the trust policy. Ensure that your trust policy includes conditions that limit the scope of permissions to only the required resources; this helps prevent the confused deputy problem. For more details about this potential security issue, see [Cross-service confused deputy prevention](cross-service-confused-deputy-prevention.md). In the example below, replace *123456789012* with your AWS account ID.

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": {
       "Sid": "ConfusedDeputyPreventionExamplePolicy",
       "Effect": "Allow",
       "Principal": {
         "Service": "textract.amazonaws.com"
       },
       "Action": "sts:AssumeRole",
       "Condition": {
         "ArnLike": {
           "aws:SourceArn":"arn:aws:textract:*:123456789012:*"
         },
         "StringEquals": {
           "aws:SourceAccount": "123456789012"
         }
       }
     }
   }
   ```

------

1. Choose **Update Trust Policy**.
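The account-scoping conditions above can also be generated rather than hand-edited, which keeps the two occurrences of the account ID in sync. A minimal Python sketch (the function name is illustrative; the policy mirrors the JSON example above):

```python
def textract_trust_policy(account_id: str) -> dict:
    """Trust policy that lets Amazon Textract assume the role, scoped to a
    single account to help prevent the confused deputy problem."""
    return {
        "Version": "2012-10-17",
        "Statement": {
            "Sid": "ConfusedDeputyPreventionExamplePolicy",
            "Effect": "Allow",
            "Principal": {"Service": "textract.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "ArnLike": {"aws:SourceArn": f"arn:aws:textract:*:{account_id}:*"},
                "StringEquals": {"aws:SourceAccount": account_id},
            },
        },
    }
```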

## Permissions for Output Configuration
<a name="async-output-config"></a>

You can have Amazon Textract send the results of asynchronous analysis operations to a designated Amazon S3 bucket by using the `OutputConfig` feature of asynchronous API operations. If you use `OutputConfig` to customize where the output of your operations is sent, additional configuration is required: you must let Amazon Textract decrypt your uploads and grant permissions for certain Amazon S3 operations.

**To Allow Decryption of S3 Bucket Uploads**
+ You must grant the appropriate IAM users the required Amazon S3 permissions.

  Navigate to the **Users** section of the IAM console ([https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)) and select the user that you created in **Step 1** of the **To configure Amazon Textract** section above. Choose **Add inline policy** and attach a JSON policy that includes the `s3:GetObject`, `s3:PutObject`, `s3:ListMultipartUploadParts`, `s3:ListBucketMultipartUploads`, and `s3:AbortMultipartUpload` operations. Your JSON might look like the following:

------
#### [ JSON ]

****  

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:Get*",
                  "s3:List*",
                  "s3:PutObject",
                  "s3:GetObject",
                  "s3-object-lambda:Get*",
                  "s3-object-lambda:List*",
                  "s3:ListMultipartUploadParts",
                  "s3:ListBucketMultipartUploads",
                  "s3:AbortMultipartUpload"
              ],
              "Resource": "*"
          }
      ]
  }
  ```

------

**To Provide AWS KMS Key Permissions**
+ You must [add](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html#key-policy-modifying-how-to-console-policy-view) permissions to your AWS Key Management Service key to allow your service role to decrypt your uploads. The service role needs permission for the `kms:GenerateDataKey` and `kms:Decrypt` actions. Ensure that the service role you created in **Step 7** of the **To configure Amazon Textract** section has a permissions policy that looks like the following example.

  In the following example, replace `ARN from Step 7` with the ARN of your service role:

  ```
  {
      "Sid": "Decrypt only",
      "Effect": "Allow",
      "Principal": {
          "AWS": "ARN from Step 7"
      },
      "Action": [
          "kms:Decrypt",
        "kms:ReEncrypt*",
          "kms:GenerateDataKey",
          "kms:DescribeKey"
      ],
      "Resource": "*"
  }
  ```

# Detecting or Analyzing Text in a Multipage Document
<a name="async-analyzing-with-sqs"></a>

This procedure shows you how to detect or analyze text in a multipage document by using Amazon Textract asynchronous operations, a document stored in an Amazon S3 bucket, an Amazon SNS topic, and an Amazon SQS queue. Multipage document processing is an asynchronous operation. For more information, see [Calling Amazon Textract Asynchronous Operations](api-async.md).

You can choose the type of processing that you want the code to do: text detection, text analysis, or expense analysis. 

The processing results are returned in an array of [Block](API_Block.md) objects, which differ depending on the type of processing you use.
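For example, a first pass over the returned array often just tallies the blocks by `BlockType`. A minimal sketch (the helper name is illustrative; blocks are shown as the dictionaries that the SDK for Python returns):

```python
from collections import Counter

def count_block_types(blocks):
    """Tally Block objects by BlockType (PAGE, LINE, WORD, and so on)."""
    return Counter(block["BlockType"] for block in blocks)
```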

To detect or analyze text in multipage documents, you do the following:

1. Create the Amazon SNS topic and the Amazon SQS queue.

1. Subscribe the queue to the topic.

1. Give the topic permission to send messages to the queue.

1. Start processing the document. Use the appropriate operation for your chosen type of analysis:
   + [StartDocumentTextDetection](API_StartDocumentTextDetection.md) for text detection tasks.
   + [StartDocumentAnalysis](API_StartDocumentAnalysis.md) for text analysis tasks.
   + [StartExpenseAnalysis](API_StartExpenseAnalysis.md) for expense analysis tasks.

1. Get the completion status from the Amazon SQS queue. The example code tracks the job identifier (`JobId`) that's returned by the `Start` operation. It only gets the results for matching job identifiers that are read from the completion status. This is important if other applications are using the same queue and topic. For simplicity, the example deletes jobs that don't match. Consider adding the deleted jobs to an Amazon SQS dead-letter queue for further investigation.

1. Get and display the processing results by calling the appropriate operation for your chosen type of analysis:
   + [GetDocumentTextDetection](API_GetDocumentTextDetection.md) for text detection tasks.
   + [GetDocumentAnalysis](API_GetDocumentAnalysis.md) for text analysis tasks.
   + [GetExpenseAnalysis](API_GetExpenseAnalysis.md) for expense analysis tasks.

1. Delete the Amazon SNS topic and the Amazon SQS queue.
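One detail worth calling out from step 5: the SQS message body is an SNS envelope whose `Message` field is itself a JSON string carrying the Textract `JobId` and `Status`, so reading the status takes two decodes. A minimal sketch (function and parameter names are illustrative; the field names match the notification format used in the examples that follow):

```python
import json

def job_status_from_sqs_body(body, expected_job_id):
    """Decode an SNS-wrapped Textract completion notification.

    Returns the job Status when the notification matches expected_job_id,
    or None for notifications that belong to other jobs.
    """
    envelope = json.loads(body)             # outer SNS envelope
    note = json.loads(envelope["Message"])  # inner Textract notification
    if note.get("JobId") != expected_job_id:
        return None
    return note.get("Status")
```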

## Performing Asynchronous Operations
<a name="async-prerequisites"></a>

The example code for this procedure is provided in Java, Python, and the AWS CLI. Before you begin, install the appropriate AWS SDK. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md). 

**To detect or analyze text in a multipage document**

1. Configure user access to Amazon Textract, and configure Amazon Textract access to Amazon SNS. For more information, see [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md). To complete this procedure, you need a multipage document file in PDF format. Skip steps 3–6 because the example code creates and configures the Amazon SNS topic and Amazon SQS queue. If you're completing the CLI example, you don't need to set up an SQS queue.

1. Upload a multipage document file in PDF or TIFF format to your Amazon S3 bucket. (Single-page documents in JPEG, PNG, TIFF, or PDF format can also be processed). 

   For instructions, see [Uploading Objects into Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UploadingObjectsintoAmazonS3.html) in the *Amazon Simple Storage Service User Guide*.

1. Use the following AWS SDK for Java, SDK for Python (Boto3), or AWS CLI code to either detect text or analyze text in a multipage document. In the `main` function:
   + Replace the value of `roleArn` with the IAM role ARN that you saved in [Giving Amazon Textract Access to Your Amazon SNS Topic](api-async-roles.md#api-async-roles-all-topics). 
   + Replace the values of `bucket` and `document` with the bucket and document file name that you specified in step 2. 
   + Replace the value of the `type` input parameter of the `ProcessDocument` function with the type of processing that you want to do. Use `ProcessType.DETECTION` to detect text. Use `ProcessType.ANALYSIS` to analyze text. 
   + For the Python example, replace the value of `region_name` with the Region that your client operates in.

   For the AWS CLI example, do the following:
   + When calling [StartDocumentTextDetection](API_StartDocumentTextDetection.md), replace the value of `bucket-name` with the name of your S3 bucket, and replace `file-name` with the name of the file that you specified in step 2. Specify the Region of your bucket by replacing `region-name` with the name of your Region. Note that the CLI example doesn't use SQS.
   + When calling [GetDocumentTextDetection](API_GetDocumentTextDetection.md), replace `job-id-number` with the `job-id` returned by [StartDocumentTextDetection](API_StartDocumentTextDetection.md). Specify the Region of your bucket by replacing `region-name` with the name of your Region.
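The retrieval half of each Start/Get pair is the same `NextToken` loop regardless of operation. A generic sketch of that loop (names are illustrative; the Boto3 wiring in the note below is an assumption about your client setup):

```python
def get_all_blocks(fetch_page):
    """Drain a paginated Get* response by following NextToken.

    fetch_page is any callable that takes the current token (or None) and
    returns a response dict shaped like the GetDocumentTextDetection output.
    """
    blocks, token = [], None
    while True:
        page = fetch_page(token)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

With a configured Boto3 Textract client you might wire it as `get_all_blocks(lambda t: textract.get_document_text_detection(JobId=job_id, MaxResults=1000, **({"NextToken": t} if t else {})))`.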

------
#### [ Java ]

   Replace the value of `credentialsProvider` with the name of your developer profile.

   ```
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   
   import com.amazonaws.auth.policy.Condition;
   import com.amazonaws.auth.policy.Policy;
   import com.amazonaws.auth.policy.Principal;
   import com.amazonaws.auth.policy.Resource;
   import com.amazonaws.auth.policy.Statement;
   import com.amazonaws.auth.policy.Statement.Effect;
   import com.amazonaws.auth.policy.actions.SQSActions;
    import com.amazonaws.auth.AWSCredentialsProvider;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import com.amazonaws.regions.Regions;
   import com.amazonaws.services.sns.AmazonSNS;
   import com.amazonaws.services.sns.AmazonSNSClientBuilder;
   import com.amazonaws.services.sns.model.CreateTopicRequest;
   import com.amazonaws.services.sns.model.CreateTopicResult;
   import com.amazonaws.services.sqs.AmazonSQS;
   import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
   import com.amazonaws.services.sqs.model.CreateQueueRequest;
   import com.amazonaws.services.sqs.model.Message;
   import com.amazonaws.services.sqs.model.QueueAttributeName;
   import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;
   import com.amazonaws.services.textract.AmazonTextract;
   import com.amazonaws.services.textract.AmazonTextractClientBuilder;
   import com.amazonaws.services.textract.model.Block;
   import com.amazonaws.services.textract.model.DocumentLocation;
   import com.amazonaws.services.textract.model.DocumentMetadata;
   import com.amazonaws.services.textract.model.GetDocumentAnalysisRequest;
   import com.amazonaws.services.textract.model.GetDocumentAnalysisResult;
   import com.amazonaws.services.textract.model.GetDocumentTextDetectionRequest;
   import com.amazonaws.services.textract.model.GetDocumentTextDetectionResult;
   import com.amazonaws.services.textract.model.NotificationChannel;
   import com.amazonaws.services.textract.model.Relationship;
   import com.amazonaws.services.textract.model.S3Object;
   import com.amazonaws.services.textract.model.StartDocumentAnalysisRequest;
   import com.amazonaws.services.textract.model.StartDocumentAnalysisResult;
   import com.amazonaws.services.textract.model.StartDocumentTextDetectionRequest;
   import com.amazonaws.services.textract.model.StartDocumentTextDetectionResult;
   import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
   
   public class DocumentProcessor {
   
       private static String sqsQueueName=null;
       private static String snsTopicName=null;
       private static String snsTopicArn = null;
       private static String roleArn= null;
       private static String sqsQueueUrl = null;
       private static String sqsQueueArn = null;
       private static String startJobId = null;
       private static String bucket = null;
       private static String document = null; 
       private static AmazonSQS sqs=null;
       private static AmazonSNS sns=null;
       private static AmazonTextract textract = null;
   
       public enum ProcessType {
           DETECTION,ANALYSIS
       }
   
       public static void main(String[] args) throws Exception {
           
           String document = "document";
           String bucket = "bucket";
           String roleArn="role";
           
           // set provider credentials
           AWSCredentialsProvider credentialsProvider = new ProfileCredentialsProvider("default");
   
           sns = AmazonSNSClientBuilder.withCredentials(credentialsProvider)
                   .withRegion(Regions.US_EAST_1)
                   .build();
           sqs= AmazonSQSClientBuilder.withCredentials(credentialsProvider)
                   .withRegion(Regions.US_EAST_1)
                   .build();
           textract=AmazonTextractClientBuilder.withCredentials(credentialsProvider)
                   .withRegion(Regions.US_EAST_1)
                   .build();
           
           CreateTopicandQueue();
           ProcessDocument(bucket,document,roleArn,ProcessType.DETECTION);
           DeleteTopicandQueue();
           System.out.println("Done!");
           
           
       }
       // Creates an SNS topic and SQS queue. The queue is subscribed to the topic. 
       static void CreateTopicandQueue()
       {
           //create a new SNS topic
           snsTopicName="AmazonTextractTopic" + Long.toString(System.currentTimeMillis());
           CreateTopicRequest createTopicRequest = new CreateTopicRequest(snsTopicName);
           CreateTopicResult createTopicResult = sns.createTopic(createTopicRequest);
           snsTopicArn=createTopicResult.getTopicArn();
           
           //Create a new SQS Queue
           sqsQueueName="AmazonTextractQueue" + Long.toString(System.currentTimeMillis());
           final CreateQueueRequest createQueueRequest = new CreateQueueRequest(sqsQueueName);
           sqsQueueUrl = sqs.createQueue(createQueueRequest).getQueueUrl();
           sqsQueueArn = sqs.getQueueAttributes(sqsQueueUrl, Arrays.asList("QueueArn")).getAttributes().get("QueueArn");
           
           //Subscribe SQS queue to SNS topic
           String sqsSubscriptionArn = sns.subscribe(snsTopicArn, "sqs", sqsQueueArn).getSubscriptionArn();
           
           // Authorize queue
             Policy policy = new Policy().withStatements(
                     new Statement(Effect.Allow)
                     .withPrincipals(Principal.AllUsers)
                     .withActions(SQSActions.SendMessage)
                     .withResources(new Resource(sqsQueueArn))
                     .withConditions(new Condition().withType("ArnEquals").withConditionKey("aws:SourceArn").withValues(snsTopicArn))
                     );
                     
   
          Map<String, String> queueAttributes = new HashMap<>();
          queueAttributes.put(QueueAttributeName.Policy.toString(), policy.toJson());
             sqs.setQueueAttributes(new SetQueueAttributesRequest(sqsQueueUrl, queueAttributes)); 
             
   
            System.out.println("Topic arn: " + snsTopicArn);
            System.out.println("Queue arn: " + sqsQueueArn);
            System.out.println("Queue url: " + sqsQueueUrl);
            System.out.println("Queue sub arn: " + sqsSubscriptionArn );
        }
       static void DeleteTopicandQueue()
       {
           if (sqs !=null) {
               sqs.deleteQueue(sqsQueueUrl);
               System.out.println("SQS queue deleted");
           }
           
           if (sns!=null) {
               sns.deleteTopic(snsTopicArn);
               System.out.println("SNS topic deleted");
           }
       }
       
       //Starts the processing of the input document.
       static void ProcessDocument(String inBucket, String inDocument, String inRoleArn, ProcessType type) throws Exception
       {
           bucket=inBucket;
           document=inDocument;
           roleArn=inRoleArn;
   
           switch(type)
           {
               case DETECTION:
                   StartDocumentTextDetection(bucket, document);
                   System.out.println("Processing type: Detection");
                   break;
               case ANALYSIS:
                   StartDocumentAnalysis(bucket,document);
                   System.out.println("Processing type: Analysis");
                   break;
               default:
                   System.out.println("Invalid processing type. Choose Detection or Analysis");
                   throw new Exception("Invalid processing type");
              
           }
   
           System.out.println("Waiting for job: " + startJobId);
           //Poll queue for messages
           List<Message> messages=null;
           int dotLine=0;
           boolean jobFound=false;
   
           //loop until the job status is published. Ignore other messages in queue.
           do{
               messages = sqs.receiveMessage(sqsQueueUrl).getMessages();
               if (dotLine++<40){
                   System.out.print(".");
               }else{
                   System.out.println();
                   dotLine=0;
               }
   
               if (!messages.isEmpty()) {
                   //Loop through messages received.
                   for (Message message: messages) {
                       String notification = message.getBody();
   
                       // Get status and job id from notification.
                       ObjectMapper mapper = new ObjectMapper();
                       JsonNode jsonMessageTree = mapper.readTree(notification);
                       JsonNode messageBodyText = jsonMessageTree.get("Message");
                       ObjectMapper operationResultMapper = new ObjectMapper();
                       JsonNode jsonResultTree = operationResultMapper.readTree(messageBodyText.textValue());
                       JsonNode operationJobId = jsonResultTree.get("JobId");
                       JsonNode operationStatus = jsonResultTree.get("Status");
                       System.out.println("Job found was " + operationJobId);
                       // Found job. Get the results and display.
                       if(operationJobId.asText().equals(startJobId)){
                           jobFound=true;
                           System.out.println("Job id: " + operationJobId );
                           System.out.println("Status : " + operationStatus.toString());
                           if (operationStatus.asText().equals("SUCCEEDED")){
                               switch(type)
                               {
                                   case DETECTION:
                                       GetDocumentTextDetectionResults();
                                       break;
                                   case ANALYSIS:
                                       GetDocumentAnalysisResults();
                                       break;
                                   default:
                                       System.out.println("Invalid processing type. Choose Detection or Analysis");
                                       throw new Exception("Invalid processing type");
                                  
                               }
                           }
                           else{
                               System.out.println("Document analysis failed");
                           }
   
                           sqs.deleteMessage(sqsQueueUrl,message.getReceiptHandle());
                       }
   
                       else{
                           System.out.println("Job received was not job " +  startJobId);
                           //Delete unknown message. Consider moving message to dead letter queue
                           sqs.deleteMessage(sqsQueueUrl,message.getReceiptHandle());
                       }
                   }
               }
               else {
                   Thread.sleep(5000);
               }
           } while (!jobFound);
   
           System.out.println("Finished processing document");
       }
       
       private static void StartDocumentTextDetection(String bucket, String document) throws Exception{
   
           //Create notification channel 
           NotificationChannel channel= new NotificationChannel()
                   .withSNSTopicArn(snsTopicArn)
                   .withRoleArn(roleArn);
   
           StartDocumentTextDetectionRequest req = new StartDocumentTextDetectionRequest()
                   .withDocumentLocation(new DocumentLocation()
                       .withS3Object(new S3Object()
                           .withBucket(bucket)
                           .withName(document)))
                   .withJobTag("DetectingText")
                   .withNotificationChannel(channel);
   
           StartDocumentTextDetectionResult startDocumentTextDetectionResult = textract.startDocumentTextDetection(req);
           startJobId=startDocumentTextDetectionResult.getJobId();
       }
       
     //Gets the results of processing started by StartDocumentTextDetection
       private static void GetDocumentTextDetectionResults() throws Exception{
           int maxResults=1000;
           String paginationToken=null;
           GetDocumentTextDetectionResult response=null;
           Boolean finished=false;
           
           while (finished==false)
           {
               GetDocumentTextDetectionRequest documentTextDetectionRequest= new GetDocumentTextDetectionRequest()
                       .withJobId(startJobId)
                       .withMaxResults(maxResults)
                       .withNextToken(paginationToken);
               response = textract.getDocumentTextDetection(documentTextDetectionRequest);
               DocumentMetadata documentMetaData=response.getDocumentMetadata();
   
               System.out.println("Pages: " + documentMetaData.getPages().toString());
               
               //Show blocks information
               List<Block> blocks= response.getBlocks();
               for (Block block : blocks) {
                   DisplayBlockInfo(block);
               }
               paginationToken=response.getNextToken();
               if (paginationToken==null)
                   finished=true;
               
           }
           
       }
   
       private static void StartDocumentAnalysis(String bucket, String document) throws Exception{
           //Create notification channel 
           NotificationChannel channel= new NotificationChannel()
                   .withSNSTopicArn(snsTopicArn)
                   .withRoleArn(roleArn);
           
           StartDocumentAnalysisRequest req = new StartDocumentAnalysisRequest()
                   .withFeatureTypes("TABLES","FORMS")
                   .withDocumentLocation(new DocumentLocation()
                       .withS3Object(new S3Object()
                           .withBucket(bucket)
                           .withName(document)))
                   .withJobTag("AnalyzingText")
                   .withNotificationChannel(channel);
   
           StartDocumentAnalysisResult startDocumentAnalysisResult = textract.startDocumentAnalysis(req);
           startJobId=startDocumentAnalysisResult.getJobId();
       }
       //Gets the results of processing started by StartDocumentAnalysis
       private static void GetDocumentAnalysisResults() throws Exception{
   
           int maxResults=1000;
           String paginationToken=null;
           GetDocumentAnalysisResult response=null;
           Boolean finished=false;
           
           //loops until pagination token is null
           while (finished==false)
           {
               GetDocumentAnalysisRequest documentAnalysisRequest= new GetDocumentAnalysisRequest()
                       .withJobId(startJobId)
                       .withMaxResults(maxResults)
                       .withNextToken(paginationToken);
               
               response = textract.getDocumentAnalysis(documentAnalysisRequest);
   
               DocumentMetadata documentMetaData=response.getDocumentMetadata();
   
               System.out.println("Pages: " + documentMetaData.getPages().toString());
   
               //Show blocks, confidence and detection times
               List<Block> blocks= response.getBlocks();
   
               for (Block block : blocks) {
                   DisplayBlockInfo(block);
               }
               paginationToken=response.getNextToken();
               if (paginationToken==null)
                   finished=true;
           }
   
       }
       //Displays Block information for text detection and text analysis
       private static void DisplayBlockInfo(Block block) {
           System.out.println("Block Id : " + block.getId());
           if (block.getText()!=null)
               System.out.println("\tDetected text: " + block.getText());
           System.out.println("\tType: " + block.getBlockType());
           
        if (!block.getBlockType().equals("PAGE")) {
            System.out.println("\tConfidence: " + block.getConfidence().toString());
        }
           if(block.getBlockType().equals("CELL"))
           {
               System.out.println("\tCell information:");
               System.out.println("\t\tColumn: " + block.getColumnIndex());
               System.out.println("\t\tRow: " + block.getRowIndex());
               System.out.println("\t\tColumn span: " + block.getColumnSpan());
               System.out.println("\t\tRow span: " + block.getRowSpan());
   
           }
           
           System.out.println("\tRelationships");
           List<Relationship> relationships=block.getRelationships();
           if(relationships!=null) {
               for (Relationship relationship : relationships) {
                   System.out.println("\t\tType: " + relationship.getType());
                   System.out.println("\t\tIDs: " + relationship.getIds().toString());
               }
           } else {
               System.out.println("\t\tNo related Blocks");
           }
   
           System.out.println("\tGeometry");
           System.out.println("\t\tBounding Box: " + block.getGeometry().getBoundingBox().toString());
           System.out.println("\t\tPolygon: " + block.getGeometry().getPolygon().toString());
           
           List<String> entityTypes = block.getEntityTypes();
           
           System.out.println("\tEntity Types");
           if(entityTypes!=null) {
               for (String entityType : entityTypes) {
                   System.out.println("\t\tEntity Type: " + entityType);
               }
           } else {
               System.out.println("\t\tNo entity type");
           }
           
        if(block.getBlockType().equals("SELECTION_ELEMENT")) {
            System.out.print("\tSelection element detected: ");
            if (block.getSelectionStatus().equals("SELECTED")){
                System.out.println("Selected");
            }else {
                System.out.println("Not selected");
            }
        }
           if(block.getPage()!=null)
               System.out.println("\tPage: " + block.getPage());            
           System.out.println();
       }
   }
   ```

------
#### [ Java V2 ]

   Replace the value of `profile-name` in the line that creates the `TextractClient` with the name of your developer profile.

   ```
   import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
   import software.amazon.awssdk.regions.Region;
   import software.amazon.awssdk.services.textract.model.S3Object;
   import software.amazon.awssdk.services.textract.TextractClient;
   import software.amazon.awssdk.services.textract.model.StartDocumentAnalysisRequest;
   import software.amazon.awssdk.services.textract.model.DocumentLocation;
   import software.amazon.awssdk.services.textract.model.TextractException;
   import software.amazon.awssdk.services.textract.model.StartDocumentAnalysisResponse;
   import software.amazon.awssdk.services.textract.model.GetDocumentAnalysisRequest;
   import software.amazon.awssdk.services.textract.model.GetDocumentAnalysisResponse;
   import software.amazon.awssdk.services.textract.model.FeatureType;
   import java.util.ArrayList;
   import java.util.List;
   
   /**
    * Before running this Java V2 code example, set up your development environment, including your credentials.
    *
    * For more information, see the following documentation topic:
    *
     * https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
    */
   public class StartDocumentAnalysis {
   
       public static void main(String[] args) {
   
           final String usage = "\n" +
               "Usage:\n" +
               "    <bucketName> <docName> \n\n" +
               "Where:\n" +
               "    bucketName - The name of the Amazon S3 bucket that contains the document. \n\n" +
                "    docName - The name of the document in the bucket (for example, book.pdf). \n";
   
           if (args.length != 2) {
               System.out.println(usage);
               System.exit(1);
           }
   
           String bucketName = args[0];
           String docName = args[1];
           Region region = Region.US_EAST_1;
           TextractClient textractClient = TextractClient.builder()
               .region(region)
               .credentialsProvider(ProfileCredentialsProvider.create("profile-name"))
               .build();
   
           String jobId = startDocAnalysisS3 (textractClient, bucketName, docName);
           System.out.println("Getting results for job "+jobId);
           String status = getJobResults(textractClient, jobId);
           System.out.println("The job status is "+status);
           textractClient.close();
       }
   
       // snippet-start:[textract.java2._start_doc_analysis.main]
       public static String startDocAnalysisS3 (TextractClient textractClient, String bucketName, String docName) {
   
           try {
               List<FeatureType> myList = new ArrayList<>();
               myList.add(FeatureType.TABLES);
               myList.add(FeatureType.FORMS);
   
               S3Object s3Object = S3Object.builder()
                   .bucket(bucketName)
                   .name(docName)
                   .build();
   
               DocumentLocation location = DocumentLocation.builder()
                   .s3Object(s3Object)
                   .build();
   
               StartDocumentAnalysisRequest documentAnalysisRequest = StartDocumentAnalysisRequest.builder()
                   .documentLocation(location)
                   .featureTypes(myList)
                   .build();
   
               StartDocumentAnalysisResponse response = textractClient.startDocumentAnalysis(documentAnalysisRequest);
   
               // Get the job ID
               String jobId = response.jobId();
               return jobId;
   
           } catch (TextractException e) {
               System.err.println(e.getMessage());
               System.exit(1);
           }
           return "" ;
       }
   
       private static String getJobResults(TextractClient textractClient, String jobId) {
   
           boolean finished = false;
           int index = 0 ;
           String status = "" ;
   
          try {
              while (!finished) {
                  GetDocumentAnalysisRequest analysisRequest = GetDocumentAnalysisRequest.builder()
                      .jobId(jobId)
                      .maxResults(1000)
                      .build();
   
                  GetDocumentAnalysisResponse response = textractClient.getDocumentAnalysis(analysisRequest);
                  status = response.jobStatus().toString();
   
                  if (status.compareTo("SUCCEEDED") == 0)
                      finished = true;
                  else {
                      System.out.println(index + " status is: " + status);
                      Thread.sleep(1000);
                  }
                  index++ ;
              }
   
              return status;
   
          } catch( InterruptedException e) {
              System.out.println(e.getMessage());
              System.exit(1);
          }
          return "";
       }
       // snippet-end:[textract.java2._start_doc_analysis.main]
   }
   ```

------
#### [ AWS CLI ]

   This AWS CLI command starts the asynchronous detection of text in a specified document. It returns a `job-id` that can be used to retrieve the results of the detection. 

   ```
   aws textract start-document-text-detection --document-location 
   "{\"S3Object\":{\"Bucket\":\"bucket-name\",\"Name\":\"file-name\"}}" --region region-name
   ```

   This AWS CLI command returns the results for an Amazon Textract asynchronous operation when provided with a `job-id`. 

   ```
   aws textract get-document-text-detection --region region-name --job-id job-id-number
   ```

   If you are using the AWS CLI on a Windows device, use double quotes instead of single quotes around the JSON, and escape the inner double quotes with backslashes (`\"`) to avoid parser errors. For an example, see below.

   ```
   aws textract start-document-text-detection --document-location "{\"S3Object\":{\"Bucket\":\"bucket\",\"Name\":\"document\"}}" --region region-name
   ```

   If you are analyzing a document with the `StartDocumentAnalysis` operation, you can provide values to the `feature-types` parameter. The following example demonstrates how to include the `QUERIES` value in the `feature-types` parameter and then provide a `Queries` object to the `queries-config` parameter. 

   ```
   aws textract start-document-analysis \
   --document-location '{"S3Object":{"Bucket":"bucket","Name":"document"}}' \
   --feature-types '["QUERIES"]' \
   --queries-config '{"Queries":[{"Text":"Question"}]}'
   ```

------
#### [ Python ]

   In the `main` function, replace the values of `roleArn`, `bucket`, `document`, and `region_name` with your IAM role ARN, your Amazon S3 bucket and document names, and the Region your client operates in. 

   ```
   import boto3
   import json
   import sys
   import time
   
   
   class ProcessType:
       DETECTION = 1
       ANALYSIS = 2
   
   
   class DocumentProcessor:
       jobId = ''
       region_name = ''
   
       roleArn = ''
       bucket = ''
       document = ''
   
       sqsQueueUrl = ''
       snsTopicArn = ''
       processType = ''
   
       def __init__(self, role, bucket, document, region):
           self.roleArn = role
           self.bucket = bucket
           self.document = document
           self.region_name = region
   
           self.textract = boto3.client('textract', region_name=self.region_name)
           self.sqs = boto3.client('sqs', region_name=self.region_name)
           self.sns = boto3.client('sns', region_name=self.region_name)
   
       def ProcessDocument(self, type):
           jobFound = False
   
           self.processType = type
           validType = False
   
           # Determine which type of processing to perform
           if self.processType == ProcessType.DETECTION:
               response = self.textract.start_document_text_detection(
                   DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
                   NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
               print('Processing type: Detection')
               validType = True
   
           # For document analysis, select which features you want to obtain with the FeatureTypes argument
           if self.processType == ProcessType.ANALYSIS:
               response = self.textract.start_document_analysis(
                   DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
                   FeatureTypes=["TABLES", "FORMS"],
                   NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
               print('Processing type: Analysis')
               validType = True
   
           if validType == False:
               print("Invalid processing type. Choose Detection or Analysis.")
               return
   
           print('Start Job Id: ' + response['JobId'])
           dotLine = 0
           while jobFound == False:
               sqsResponse = self.sqs.receive_message(QueueUrl=self.sqsQueueUrl, MessageAttributeNames=['ALL'],
                                                      MaxNumberOfMessages=10)
   
               if sqsResponse:
   
                   if 'Messages' not in sqsResponse:
                       if dotLine < 40:
                           print('.', end='')
                           dotLine = dotLine + 1
                       else:
                           print()
                           dotLine = 0
                       sys.stdout.flush()
                       time.sleep(5)
                       continue
   
                   for message in sqsResponse['Messages']:
                       notification = json.loads(message['Body'])
                       textMessage = json.loads(notification['Message'])
                       print(textMessage['JobId'])
                       print(textMessage['Status'])
                       if str(textMessage['JobId']) == response['JobId']:
                           print('Matching Job Found:' + textMessage['JobId'])
                           jobFound = True
                           self.GetResults(textMessage['JobId'])
                           self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                   ReceiptHandle=message['ReceiptHandle'])
                        else:
                            print("Job didn't match:" +
                                  str(textMessage['JobId']) + ' : ' + str(response['JobId']))
                            # Delete the unknown message. Consider sending it to a dead-letter queue.
                            self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                    ReceiptHandle=message['ReceiptHandle'])
   
           print('Done!')
   
       def CreateTopicandQueue(self):
   
           millis = str(int(round(time.time() * 1000)))
   
           # Create SNS topic
           snsTopicName = "AmazonTextractTopic" + millis
   
           topicResponse = self.sns.create_topic(Name=snsTopicName)
           self.snsTopicArn = topicResponse['TopicArn']
   
           # create SQS queue
           sqsQueueName = "AmazonTextractQueue" + millis
           self.sqs.create_queue(QueueName=sqsQueueName)
           self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']
   
           attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                   AttributeNames=['QueueArn'])['Attributes']
   
           sqsQueueArn = attribs['QueueArn']
   
           # Subscribe SQS queue to SNS topic
           self.sns.subscribe(
               TopicArn=self.snsTopicArn,
               Protocol='sqs',
               Endpoint=sqsQueueArn)
   
           # Authorize SNS to write SQS queue
           policy = """{{
     "Version": "2012-10-17",		 	 	 
     "Statement":[
       {{
         "Sid":"MyPolicy",
         "Effect":"Allow",
         "Principal" : {{"AWS" : "*"}},
         "Action":"SQS:SendMessage",
         "Resource": "{}",
         "Condition":{{
           "ArnEquals":{{
             "aws:SourceArn": "{}"
           }}
         }}
       }}
     ]
   }}""".format(sqsQueueArn, self.snsTopicArn)
   
           response = self.sqs.set_queue_attributes(
               QueueUrl=self.sqsQueueUrl,
               Attributes={
                   'Policy': policy
               })
   
       def DeleteTopicandQueue(self):
           self.sqs.delete_queue(QueueUrl=self.sqsQueueUrl)
           self.sns.delete_topic(TopicArn=self.snsTopicArn)
   
       # Display information about a block
       def DisplayBlockInfo(self, block):
   
           print("Block Id: " + block['Id'])
           print("Type: " + block['BlockType'])
           if 'EntityTypes' in block:
               print('EntityTypes: {}'.format(block['EntityTypes']))
   
           if 'Text' in block:
               print("Text: " + block['Text'])
   
            if block['BlockType'] != 'PAGE' and 'Confidence' in block:
                print("Confidence: " + "{:.2f}".format(block['Confidence']) + "%")
   
           print('Page: {}'.format(block['Page']))
   
           if block['BlockType'] == 'CELL':
               print('Cell Information')
               print('\tColumn: {} '.format(block['ColumnIndex']))
               print('\tRow: {}'.format(block['RowIndex']))
               print('\tColumn span: {} '.format(block['ColumnSpan']))
               print('\tRow span: {}'.format(block['RowSpan']))
   
               if 'Relationships' in block:
                   print('\tRelationships: {}'.format(block['Relationships']))
   
           if ("Geometry") in str(block):
               print('Geometry')
               print('\tBounding Box: {}'.format(block['Geometry']['BoundingBox']))
               print('\tPolygon: {}'.format(block['Geometry']['Polygon']))
   
           if block['BlockType'] == 'SELECTION_ELEMENT':
               print('    Selection element detected: ', end='')
               if block['SelectionStatus'] == 'SELECTED':
                   print('Selected')
               else:
                   print('Not selected')
   
           if block["BlockType"] == "QUERY":
               print("Query info:")
               print(block["Query"])
           
           if block["BlockType"] == "QUERY_RESULT":
               print("Query answer:")
               print(block["Text"])        
                   
       def GetResults(self, jobId):
           maxResults = 1000
           paginationToken = None
           finished = False
   
           while finished == False:
   
               response = None
   
               if self.processType == ProcessType.ANALYSIS:
                   if paginationToken == None:
                       response = self.textract.get_document_analysis(JobId=jobId,
                                                                      MaxResults=maxResults)
                   else:
                       response = self.textract.get_document_analysis(JobId=jobId,
                                                                      MaxResults=maxResults,
                                                                      NextToken=paginationToken)
   
               if self.processType == ProcessType.DETECTION:
                   if paginationToken == None:
                       response = self.textract.get_document_text_detection(JobId=jobId,
                                                                            MaxResults=maxResults)
                   else:
                       response = self.textract.get_document_text_detection(JobId=jobId,
                                                                            MaxResults=maxResults,
                                                                            NextToken=paginationToken)
   
               blocks = response['Blocks']
               print('Detected Document Text')
               print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
   
               # Display block information
               for block in blocks:
                   self.DisplayBlockInfo(block)
                   print()
                   print()
   
               if 'NextToken' in response:
                   paginationToken = response['NextToken']
               else:
                   finished = True
   
       def GetResultsDocumentAnalysis(self, jobId):
           maxResults = 1000
           paginationToken = None
           finished = False
   
           while finished == False:
   
               response = None
               if paginationToken == None:
                   response = self.textract.get_document_analysis(JobId=jobId,
                                                                  MaxResults=maxResults)
               else:
                   response = self.textract.get_document_analysis(JobId=jobId,
                                                                  MaxResults=maxResults,
                                                                  NextToken=paginationToken)
   
                   # Get the text blocks
               blocks = response['Blocks']
               print('Analyzed Document Text')
               print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
               # Display block information
               for block in blocks:
                   self.DisplayBlockInfo(block)
                   print()
                   print()
   
                if 'NextToken' in response:
                    paginationToken = response['NextToken']
                else:
                    finished = True
   
   
   def main():
       roleArn = ''
       bucket = ''
       document = ''
       region_name = ''
   
       analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
       analyzer.CreateTopicandQueue()
       analyzer.ProcessDocument(ProcessType.ANALYSIS)
       analyzer.DeleteTopicandQueue()
   
   
   if __name__ == "__main__":
       main()
   ```

   To use different features of the `StartDocumentAnalysis` operation, you provide the appropriate values to the `FeatureTypes` argument. For example, to use the Queries feature, include the `QUERIES` value in the `FeatureTypes` list and then provide a `QueriesConfig` object. To query your document, replace the code block that makes a request to the `StartDocumentAnalysis` operation with the code block below, and enter your query.

   ```
    if self.processType == ProcessType.ANALYSIS:
        response = self.textract.start_document_analysis(
            DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
            FeatureTypes=["TABLES", "FORMS", "QUERIES"],
            QueriesConfig={'Queries': [
                {'Text': 'Enter query here'}
            ]},
            NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
   ```
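When the job completes, the answers come back as `QUERY` and `QUERY_RESULT` blocks; a `QUERY` block links to its answer through a `Relationships` entry of type `ANSWER`. The helper below is a minimal sketch of pairing questions with answers (the `sample_blocks` list is illustrative only; real responses carry many more fields):

```python
def extract_query_answers(blocks):
    """Pair each QUERY block with its QUERY_RESULT text via ANSWER relationships."""
    by_id = {b['Id']: b for b in blocks}
    answers = {}
    for b in blocks:
        if b['BlockType'] != 'QUERY':
            continue
        question = b['Query']['Text']
        for rel in b.get('Relationships', []):
            if rel['Type'] == 'ANSWER':
                for rid in rel['Ids']:
                    # Look up the QUERY_RESULT block holding the answer text
                    answers[question] = by_id[rid].get('Text')
    return answers


# Illustrative block shapes only, not real API output.
sample_blocks = [
    {'Id': 'q1', 'BlockType': 'QUERY', 'Query': {'Text': 'Question'},
     'Relationships': [{'Type': 'ANSWER', 'Ids': ['a1']}]},
    {'Id': 'a1', 'BlockType': 'QUERY_RESULT', 'Text': 'Answer text'},
]
print(extract_query_answers(sample_blocks))
```

You could call this helper on the `blocks` list that `GetResults` retrieves, instead of printing every block with `DisplayBlockInfo`.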

------
#### [ Node.JS ]

   In this example, replace the value of `roleArn` with the IAM role ARN that you saved in [Giving Amazon Textract Access to Your Amazon SNS Topic](api-async-roles.md#api-async-roles-all-topics). Replace the values of `bucket` and `document` with the bucket and document file name you specified in step 2 above. Replace the value of `processType` with the type of processing you'd like to use on the input document. Finally, replace the value of `REGION` with the region your client is operating in. Replace the value of `profileName` with the name of your developer profile.

   ```
   // snippet-start:[sqs.JavaScript.queues.createQueueV3]
   // Import required AWS SDK clients and commands for Node.js
   import { CreateQueueCommand, GetQueueAttributesCommand, GetQueueUrlCommand, 
     SetQueueAttributesCommand, DeleteQueueCommand, ReceiveMessageCommand, DeleteMessageCommand } from  "@aws-sdk/client-sqs";
   import {CreateTopicCommand, SubscribeCommand, DeleteTopicCommand } from "@aws-sdk/client-sns";
   import  { SQSClient } from "@aws-sdk/client-sqs";
   import  { SNSClient } from "@aws-sdk/client-sns";
   import  { TextractClient, StartDocumentTextDetectionCommand, StartDocumentAnalysisCommand, GetDocumentAnalysisCommand, GetDocumentTextDetectionCommand, DocumentMetadata } from "@aws-sdk/client-textract";
   import { stdout } from "process";
   import {fromIni} from '@aws-sdk/credential-providers';
   
   // Set the AWS Region.
   const REGION = "region-name"; //e.g. "us-east-1"
   const profileName = "profile-name";
    // Create the Textract, SQS, and SNS service clients.
   const textractClient = new TextractClient({region: REGION, 
     credentials: fromIni({profile: profileName,}), 
   });
   const sqsClient = new SQSClient({region: REGION, 
     credentials: fromIni({profile: profileName,}), 
   });
   const snsClient = new SNSClient({region: REGION, 
     credentials: fromIni({profile: profileName,}), 
   });
   
    // Set bucket and document variables
   const bucket = "bucket-name";                                                                                                                  
   const documentName = "document-name";
   const roleArn = "role-arn"
   const processType = "DETECTION"
   var startJobId = ""
   
   var ts = Date.now();
   const snsTopicName = "AmazonTextractExample" + ts;
   const snsTopicParams = {Name: snsTopicName}
   const sqsQueueName = "AmazonTextractQueue-" + ts;
   
   // Set the parameters
    const sqsParams = {
      QueueName: sqsQueueName,
      Attributes: {
        DelaySeconds: "60", // Delivery delay, in seconds.
        MessageRetentionPeriod: "86400", // Message retention period, in seconds.
      },
    };
   
   // Process a document based on operation type
    const processDocumment = async (type, bucket, documentName, roleArn, sqsQueueUrl, snsTopicArn) =>
    {
    try
    {
        // Set job found and success status to false initially
      var jobFound = false
      var succeeded = false
      var dotLine = 0
      var processType = type
      var validType = false

      if (processType == "DETECTION"){
        var response = await textractClient.send(new StartDocumentTextDetectionCommand({DocumentLocation:{S3Object:{Bucket:bucket, Name:documentName}}, 
          NotificationChannel:{RoleArn: roleArn, SNSTopicArn: snsTopicArn}}))
        console.log("Processing type: Detection")
        validType = true
      }

      if (processType == "ANALYSIS"){
        var response = await textractClient.send(new StartDocumentAnalysisCommand({DocumentLocation:{S3Object:{Bucket:bucket, Name:documentName}}, 
          NotificationChannel:{RoleArn: roleArn, SNSTopicArn: snsTopicArn}}))
       console.log("Processing type: Analysis")
       validType = true
     }
   
     if (validType == false){
         console.log("Invalid processing type. Choose Detection or Analysis.")
         return
     }
   // while not found, continue to poll for response
   console.log(`Start Job ID: ${response.JobId}`)
   while (jobFound == false){
      var sqsReceivedResponse = await sqsClient.send(new ReceiveMessageCommand({QueueUrl:sqsQueueUrl, 
        MessageAttributeNames:['ALL'], MaxNumberOfMessages:10}));
     if (sqsReceivedResponse){
       var responseString = JSON.stringify(sqsReceivedResponse)
       if (!responseString.includes('Body')){
         if (dotLine < 40) {
           console.log('.')
           dotLine = dotLine + 1
         }else {
           console.log('')
           dotLine = 0 
         };
         stdout.write('', () => {
           console.log('');
         });
         await new Promise(resolve => setTimeout(resolve, 5000));
         continue
       }
     }
   
       // Once job found, log Job ID and return true if status is succeeded
       for (var message of sqsReceivedResponse.Messages){
           console.log("Retrieved messages:")
           var notification = JSON.parse(message.Body)
           var rekMessage = JSON.parse(notification.Message)
           var messageJobId = rekMessage.JobId
            if (String(rekMessage.JobId).includes(String(response.JobId))){
                console.log('Matching job found:')
                console.log(rekMessage.JobId)
                jobFound = true
                // Retrieve and display the results for the completed job
                var operationResults = await GetResults(processType, rekMessage.JobId)
                console.log(rekMessage.Status)
           if (String(rekMessage.Status).includes(String("SUCCEEDED"))){
               succeeded = true
               console.log("Job processing succeeded.")
               var sqsDeleteMessage = await sqsClient.send(new DeleteMessageCommand({QueueUrl:sqsQueueUrl, ReceiptHandle:message.ReceiptHandle}));
           }
           }else{
           console.log("Provided Job ID did not match returned ID.")
           var sqsDeleteMessage = await sqsClient.send(new DeleteMessageCommand({QueueUrl:sqsQueueUrl, ReceiptHandle:message.ReceiptHandle}));
           }
       }
   
   console.log("Done!")
   }
   }catch (err) {
       console.log("Error", err);
     }
   }
   
   // Create the SNS topic and SQS Queue
   const createTopicandQueue = async () => {
   try {
     // Create SNS topic
     const topicResponse = await snsClient.send(new CreateTopicCommand(snsTopicParams));
     const topicArn = topicResponse.TopicArn
     console.log("Success", topicResponse);
     // Create SQS Queue
     const sqsResponse = await sqsClient.send(new CreateQueueCommand(sqsParams));
     console.log("Success", sqsResponse);
     const sqsQueueCommand = await sqsClient.send(new GetQueueUrlCommand({QueueName: sqsQueueName}))
     const sqsQueueUrl = sqsQueueCommand.QueueUrl
     const attribsResponse = await sqsClient.send(new GetQueueAttributesCommand({QueueUrl: sqsQueueUrl, AttributeNames: ['QueueArn']}))
     const attribs = attribsResponse.Attributes
     console.log(attribs)
     const queueArn = attribs.QueueArn
     // subscribe SQS queue to SNS topic
     const subscribed = await snsClient.send(new SubscribeCommand({TopicArn: topicArn, Protocol:'sqs', Endpoint: queueArn}))
     const policy = {
       Version: "2012-10-17",		 	 	 
       Statement: [
         {
           Sid: "MyPolicy",
           Effect: "Allow",
           Principal: {AWS: "*"},
           Action: "SQS:SendMessage",
           Resource: queueArn,
           Condition: {
             ArnEquals: {
               'aws:SourceArn': topicArn
             }
           }
         }
       ]
     };
   
     const response = sqsClient.send(new SetQueueAttributesCommand({QueueUrl: sqsQueueUrl, Attributes: {Policy: JSON.stringify(policy)}}))
     console.log(response)
     console.log(sqsQueueUrl, topicArn)
     return [sqsQueueUrl, topicArn]
   
   } catch (err) {
     console.log("Error", err);
   
   }
   }
   
   const deleteTopicAndQueue = async (sqsQueueUrlArg, snsTopicArnArg) => {
   const deleteQueue = await sqsClient.send(new DeleteQueueCommand({QueueUrl: sqsQueueUrlArg}));
   const deleteTopic = await snsClient.send(new DeleteTopicCommand({TopicArn: snsTopicArnArg}));
   console.log("Successfully deleted.")
   }
   
   const displayBlockInfo = async (block) => {
   console.log(`Block ID: ${block.Id}`)
   console.log(`Block Type: ${block.BlockType}`)
    if (block.EntityTypes){
        console.log(`EntityTypes: ${block.EntityTypes}`)
    }
    if (block.Text){
        console.log(`Text: ${block.Text}`)
    }
    if (!String(block.BlockType).includes('PAGE')){
        console.log(`Confidence: ${block.Confidence}`)
    }
   console.log(`Page: ${block.Page}`)
   if (String(block.BlockType).includes("CELL")){
       console.log("Cell Information")
       console.log(`Column: ${block.ColumnIndex}`)
       console.log(`Row: ${block.RowIndex}`)
       console.log(`Column Span: ${block.ColumnSpan}`)
       console.log(`Row Span: ${block.RowSpan}`)
       if (String(block).includes("Relationships")){
           console.log(`Relationships: ${block.Relationships}`)
       }
   }
   
   console.log("Geometry")
   console.log(`Bounding Box: ${JSON.stringify(block.Geometry.BoundingBox)}`)
   console.log(`Polygon: ${JSON.stringify(block.Geometry.Polygon)}`)
   
   if (String(block.BlockType).includes('SELECTION_ELEMENT')){
     console.log('Selection Element detected:')
     if (String(block.SelectionStatus).includes('SELECTED')){
       console.log('Selected')
     } else {
       console.log('Not Selected')
     }
   
   }
   }
   
   const GetResults = async (processType, JobID) => {
   
   var maxResults = 1000
   var paginationToken = null
   var finished = false
   
   while (finished == false){
     var response = null
     if (processType == 'ANALYSIS'){
       if (paginationToken == null){
         response = textractClient.send(new GetDocumentAnalysisCommand({JobId:JobID, MaxResults:maxResults}))
     
       }else{
         response = textractClient.send(new GetDocumentAnalysisCommand({JobId:JobID, MaxResults:maxResults, NextToken:paginationToken}))
       }
     }
       
     if(processType == 'DETECTION'){
       if (paginationToken == null){
         response = textractClient.send(new GetDocumentTextDetectionCommand({JobId:JobID, MaxResults:maxResults}))
     
       }else{
         response = textractClient.send(new GetDocumentTextDetectionCommand({JobId:JobID, MaxResults:maxResults, NextToken:paginationToken}))
       }
     }
   
     await new Promise(resolve => setTimeout(resolve, 5000));
     console.log("Detected Documented Text")
     console.log(response)
     //console.log(Object.keys(response))
     console.log(typeof(response))
     var blocks = (await response).Blocks
     console.log(blocks)
     console.log(typeof(blocks))
     var docMetadata = (await response).DocumentMetadata
     var blockString = JSON.stringify(blocks)
     var parsed = JSON.parse(JSON.stringify(blocks))
     console.log(Object.keys(blocks))
     console.log(`Pages: ${docMetadata.Pages}`)
     blocks.forEach((block)=> {
       displayBlockInfo(block)
       console.log()
       console.log()
     })
   
   
      var nextToken = (await response).NextToken
      if (nextToken){
        paginationToken = nextToken
      }else{
       finished = true
     }
   }
   
   }
   
   
    // Create the resources, process the document, then clean up
   const main = async () => {
   var sqsAndTopic = await createTopicandQueue();
   var process = await processDocumment(processType, bucket, documentName, roleArn, sqsAndTopic[0], sqsAndTopic[1])
   var deleteResults = await deleteTopicAndQueue(sqsAndTopic[0], sqsAndTopic[1])
   }
   
   main()
   ```

------

1. Run the code. The operation might take a while to finish. After it's finished, a list of blocks for detected or analyzed text is displayed.
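As an alternative to the Amazon SNS/Amazon SQS notification flow, you can poll the `Get*` operation directly until `JobStatus` leaves `IN_PROGRESS`, as the Java example above does. A minimal sketch of that loop follows; the `get_status` callable is a stand-in for something like `lambda j: textract.get_document_analysis(JobId=j)['JobStatus']`:

```python
import time


def wait_for_job(get_status, job_id, delay=5, max_tries=120):
    """Poll until the job reaches a terminal status (SUCCEEDED, FAILED, or PARTIAL_SUCCESS)."""
    for _ in range(max_tries):
        status = get_status(job_id)
        if status in ('SUCCEEDED', 'FAILED', 'PARTIAL_SUCCESS'):
            return status
        time.sleep(delay)
    raise TimeoutError('Job {} did not finish in time'.format(job_id))


# Stubbed status source for illustration; in practice get_status would call
# the GetDocumentAnalysis or GetDocumentTextDetection operation.
statuses = iter(['IN_PROGRESS', 'IN_PROGRESS', 'SUCCEEDED'])
print(wait_for_job(lambda job_id: next(statuses), 'job-id', delay=0))
```

Polling is simpler to set up but consumes API calls while waiting; the notification flow in the examples above scales better for many concurrent jobs.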

# Using the Analyze Lending Workflow
<a name="async-using-lending"></a>

To detect text in or analyze multipage lending documents using the Analyze Lending workflow, you do the following:

1. Create the Amazon SNS topic and the Amazon SQS queue.

1. Subscribe the queue to the topic.

1. Give the topic permission to send messages to the queue.

1. Start processing the document by calling the `StartLendingAnalysis` operation.

1. Get the completion status from the Amazon SQS queue. The example code tracks the job identifier (`JobId`) that's returned by the `Start` operation. The example code only gets the results for matching job identifiers that are read from the completion status. This is important if other applications are using the same queue and topic. For simplicity, the example code deletes jobs that don't match. Consider adding the deleted jobs to an Amazon SQS dead-letter queue for further investigation.

   The results of the StartLendingAnalysis operation can be sent to an Amazon S3 bucket of your choice by using the OutputConfig feature. If you use this feature, you may have to do some additional configuration of your User and Service Role. For information on how to let Amazon Textract send encrypted documents to your Amazon S3 bucket, see [Permissions for Output Configuration](api-async-roles.md#async-output-config).

1. Get and display the processing results by calling the `GetLendingAnalysis` operation or the `GetLendingAnalysisSummary` operation.

1. Once you are finished processing documents, be sure to delete the Amazon SNS topic and the Amazon SQS queue. If you need to process additional documents, you can leave the Amazon SNS topic and Amazon SQS queue as they are and reuse them for the other documents.
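Step 3 above amounts to attaching a queue policy that allows the topic to call `SQS:SendMessage`, mirroring the policy document used in the earlier Python example. A minimal sketch of building that policy (the ARNs are placeholders):

```python
import json


def sns_to_sqs_policy(queue_arn, topic_arn):
    """Build an SQS access policy that lets one SNS topic send messages to the queue."""
    return json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'MyPolicy',
            'Effect': 'Allow',
            'Principal': {'AWS': '*'},
            'Action': 'SQS:SendMessage',
            'Resource': queue_arn,
            # Restrict the wildcard principal to messages from this one topic
            'Condition': {'ArnEquals': {'aws:SourceArn': topic_arn}},
        }],
    })


policy = sns_to_sqs_policy('arn:aws:sqs:us-east-1:111122223333:MyQueue',
                           'arn:aws:sns:us-east-1:111122223333:MyTopic')
# Attach with: sqs.set_queue_attributes(QueueUrl=url, Attributes={'Policy': policy})
print(policy)
```

Without the `ArnEquals` condition, the `"AWS": "*"` principal would let any SNS topic write to your queue.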

## Performing Asynchronous Lending Analysis
<a name="async-using-lending-analysis"></a>

The example code for this procedure is provided for Python and the AWS CLI. Before you begin, install the appropriate AWS SDK. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md). 

1. Configure user access to Amazon Textract, and configure Amazon Textract access to Amazon SNS. For more information, see [Configuring Amazon Textract for Asynchronous Operations](https://docs.aws.amazon.com/en_us/textract/latest/dg/api-async-roles.html). To complete this procedure, you need a multipage document file in PDF format. You can skip steps 3–6 in the configuration instructions, because the example code creates and configures the Amazon SNS topic and Amazon SQS queue. If you are completing the CLI example, you don't need to set up an SQS queue. 

1. Upload a multipage document file in PDF or TIFF format to your Amazon S3 bucket (you can also process single-page documents in JPEG, PNG, TIFF, or PDF format). For instructions, see [Uploading Objects into Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/UploadingObjectsintoAmazonS3.html) in the *Amazon Simple Storage Service User Guide*. 

1. Use the following AWS SDK for Python (Boto3) or AWS CLI code to analyze text in a multipage lending document. In the main function: 
   + Replace the value of `roleArn` with the IAM role ARN that you saved in [Giving Amazon Textract Access to Your Amazon SNS Topic](https://docs.aws.amazon.com/en_us/textract/latest/dg/api-async-roles.html#api-async-roles-all-topics).
   + Replace the values of `bucket` and `document` with the bucket and document file name that you specified in the preceding step 2.
   + Replace the value of the `type` input parameter of the `ProcessDocument` function with the type of processing that you want to use. For example, use `ProcessType.DETECTION` to detect text, or use `ProcessType.ANALYSIS` to analyze text.
   + For the Python example, replace the value of `region_name` with the region your client is operating in.

    For the upcoming AWS CLI example code, do the following: 
   + When calling the [StartLendingAnalysis](https://docs.aws.amazon.com/en_us/textract/latest/dg/API_StartLendingAnalysis.html) operation, replace the value of `bucket-name` with the name of your S3 bucket, and replace `FileName` with the name of the file you specified in step 2. Specify the region of your bucket by replacing `region-name` with the name of your region. Take note that the CLI example does not make use of SQS.
   + When calling the [GetLendingAnalysis](https://docs.aws.amazon.com/en_us/textract/latest/dg/API_GetLendingAnalysis.html) operation or the [GetLendingAnalysisSummary](https://docs.aws.amazon.com/en_us/textract/latest/dg/API_GetLendingAnalysisSummary.html) operation, replace `jobId` with the `jobId` returned by [StartLendingAnalysis](https://docs.aws.amazon.com/en_us/textract/latest/dg/API_StartLendingAnalysis.html). Specify the region of your bucket by replacing `region-name` with the name of your region.

1. Run the code for your chosen SDK or the AWS CLI. 

   The operation might take a while to finish. After it's finished, the following examples display a list of blocks for the detected or analyzed text:

------
#### [ AWS CLI ]

   To start the lending document analysis, use the following CLI command. If you want to see split documents, use the `output-config` argument; otherwise, you can remove it. The `kms-key-id` argument is also optional: 

   ```
   aws textract start-lending-analysis \
   --document-location '{"S3Object":{"Bucket":"S3Bucket","Name":"FileName"}}' \
   --output-config '{"S3Bucket": "S3Bucket",  "S3Prefix": "S3Prefix"}' \
   --kms-key-id '1234abcd-12ab-34cd-56ef-1234567890ab' \
   --region 'region-name'
   ```

   To get the results of the lending document analysis, use the following CLI command. The `max-results` argument is optional; if you don't want to limit the number of results returned, you can remove it:

   ```
   aws textract get-lending-analysis \
   --job-id 'jobId' \
   --region 'region-name' \
   --max-results 30
   ```

   To retrieve a summary of the results:

   ```
   aws textract get-lending-analysis-summary \
   --job-id 'jobId' \
   --region 'region-name'
   ```

------
#### [ Python ]

   ```
   import boto3
   import json
   import sys
   import time
   
   class DocumentProcessor:
   
       def __init__(self, role, bucket, document, region):
           self.roleArn = role
           self.bucket = bucket
           self.document = document
           self.region_name = region
   
           self.textract = boto3.client('textract', region_name=self.region_name)
            self.sqs = boto3.client('sqs', region_name=self.region_name)
            self.sns = boto3.client('sns', region_name=self.region_name)
   
       def ProcessDocument(self):
           jobFound = False
   
           response = self.textract.start_lending_analysis(
               DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
               NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
           print('Processing type: Analysis')
   
           print('Start Job Id: ' + response['JobId'])
           dotLine = 0
            while not jobFound:
               sqsResponse = self.sqs.receive_message(QueueUrl=self.sqsQueueUrl, MessageAttributeNames=['ALL'],
                                                      MaxNumberOfMessages=10)
               if sqsResponse:
                   if 'Messages' not in sqsResponse:
                       if dotLine < 40:
                           print('.', end='')
                           dotLine = dotLine + 1
                       else:
                           print()
                           dotLine = 0
                       sys.stdout.flush()
                       time.sleep(5)
                       continue
   
                    for message in sqsResponse['Messages']:
                        notification = json.loads(message['Body'])
                        textMessage = json.loads(notification['Message'])
                        print(textMessage['JobId'])
                        print(textMessage['Status'])
                        if str(textMessage['JobId']) == response['JobId']:
                            print('Matching Job Found:' + textMessage['JobId'])
                            jobFound = True
                            self.GetResults(textMessage['JobId'])
                            self.GetSummary(textMessage['JobId'])
                            self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                    ReceiptHandle=message['ReceiptHandle'])
                        else:
                            print("Job didn't match:" +
                                  str(textMessage['JobId']) + ' : ' + str(response['JobId']))
                            # Delete the unmatched message. Consider sending it to a dead-letter queue.
                            self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                    ReceiptHandle=message['ReceiptHandle'])
   
           print('Done!')
   
       def CreateTopicandQueue(self):
   
           millis = str(int(round(time.time() * 1000)))
   
           # Create SNS topic
           snsTopicName = "AmazonTextractTopic" + millis
   
           topicResponse = self.sns.create_topic(Name=snsTopicName)
           self.snsTopicArn = topicResponse['TopicArn']
   
           # create SQS queue
           sqsQueueName = "AmazonTextractQueue" + millis
           self.sqs.create_queue(QueueName=sqsQueueName)
           self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']
   
           attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                   AttributeNames=['QueueArn'])['Attributes']
   
           sqsQueueArn = attribs['QueueArn']
   
           # Subscribe SQS queue to SNS topic
           self.sns.subscribe(
               TopicArn=self.snsTopicArn,
               Protocol='sqs',
               Endpoint=sqsQueueArn)
   
           # Authorize SNS to write SQS queue
           policy = """{{
  "Version":"2012-10-17",
     "Statement":[
       {{
         "Sid":"MyPolicy",
         "Effect":"Allow",
         "Principal" : {{"AWS" : "*"}},
         "Action":"sqs:*",
         "Resource": "{}",
         "Condition":{{
           "ArnEquals":{{
             "aws:SourceArn": "{}"
           }}
         }}
       }}
     ]
   }}""".format(sqsQueueArn, self.snsTopicArn)
   
           response = self.sqs.set_queue_attributes(
               QueueUrl=self.sqsQueueUrl,
               Attributes={
                   'Policy': policy
               })
   
       def DeleteTopicandQueue(self):
           self.sqs.delete_queue(QueueUrl=self.sqsQueueUrl)
           self.sns.delete_topic(TopicArn=self.snsTopicArn)
   
       # Display information about a block
       def DisplayExtractInfo(self, response):
           results = response['Results']
           for page in results:
               print("Page Classification: {}".format(page["PageClassification"]["PageType"]))
               print("Page Number: {}".format(page["Page"]))
               for extract in page["Extractions"]:
                   for fields, vals in extract['LendingDocument'].items():
                       for val in vals:
                           print("Document Type: {}".format(val['Type']))
                           detections = val['ValueDetections']
                           for i in detections:
                               print(i['Text'])
                               print('Geometry')
                               print('\tBounding Box: {}'.format(i['Geometry']['BoundingBox']))
                               print('\tPolygon: {}'.format(i['Geometry']['Polygon']))
   
       def GetSummary(self, jobId):
           
           maxResults = 1000
           response = self.textract.get_lending_analysis_summary(JobId=jobId, MaxResults=maxResults)
           doc_groups = response['DocumentGroups']
           print("Summary info:")
           for group in doc_groups:
               print("Document type: " + group['Type'])
               split_docs = group['SplitDocuments']
               for doc in split_docs:
                   print(doc)
                   for idx, page in doc.items():
                       print(str(idx) + " - " + str(page))
   
       def GetResults(self, jobId):
   
           maxResults = 1000
           paginationToken = None
           finished = False
   
            while not finished:

                response = None
                if paginationToken is None:
                   response = self.textract.get_lending_analysis(JobId=jobId,
                                                                 MaxResults=maxResults)
               else:
                   response = self.textract.get_lending_analysis(JobId=jobId,
                                                                 MaxResults=maxResults,
                                                                 NextToken=paginationToken)
   
               print('Detected Document Text')
               print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
   
               self.DisplayExtractInfo(response)
   
               if 'NextToken' in response:
                   paginationToken = response['NextToken']
               else:
                   finished = True
   
   def main():
       roleArn = ''
       bucket = ''
       document = ''
       region_name = ''
   
       analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
       analyzer.CreateTopicandQueue()
       analyzer.ProcessDocument()
       analyzer.DeleteTopicandQueue()
   
   
   if __name__ == "__main__":
       main()
   ```

------

# Amazon Textract Results Notification
<a name="async-notification-payload"></a>

Amazon Textract sends the status of an analysis request to an Amazon Simple Notification Service (Amazon SNS) topic. To get the notification from an Amazon SNS topic, use an Amazon SQS queue or an AWS Lambda function. For more information, see [Calling Amazon Textract Asynchronous Operations](api-async.md). For an example, see [Detecting or Analyzing Text in a Multipage Document](async-analyzing-with-sqs.md).

The status message sent by Amazon Simple Notification Service to Amazon SQS has the following JSON format:

```
{
  "JobId": "String",
  "Status": "String",
  "API": "String",
  "JobTag": "String",
  "Timestamp": Number,
  "DocumentLocation": {
    "S3ObjectName": "String",
    "S3Bucket": "String"
  }
}
```
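
Because the Textract status message arrives wrapped in an Amazon SNS envelope, code that reads it from an Amazon SQS queue must decode the JSON twice: once for the envelope and once for its `Message` field (the Python example above does this in `ProcessDocument`). The following sketch illustrates the decoding with a sample message body; all field values are illustrative placeholders, not output from a real job:

```python
import json

# A sample notification body as it arrives in the SQS message "Body" field.
# All values are illustrative placeholders, not output from a real job.
sqs_message_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "JobId": "1234567890abcdef",
        "Status": "SUCCEEDED",
        "API": "StartDocumentTextDetection",
        "JobTag": "my-job-tag",
        "Timestamp": 1688000000000,
        "DocumentLocation": {
            "S3ObjectName": "document.pdf",
            "S3Bucket": "my-bucket"
        }
    })
})


def parse_textract_notification(body):
    """Decode the SNS envelope, then decode the Textract status
    message carried in its "Message" field."""
    envelope = json.loads(body)
    return json.loads(envelope["Message"])


status = parse_textract_notification(sqs_message_body)
print(status["JobId"], status["Status"], status["DocumentLocation"]["S3Bucket"])
```

In a real consumer, `body` comes from the `Body` field of each message returned by `sqs.receive_message`.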

The following table describes the parameters in the Amazon SNS status message.


| Parameter | Description | 
| --- | --- | 
|  JobId  |  The unique identifier that Amazon Textract assigns to the job. It matches a job identifier that's returned from a `Start` operation, such as [StartDocumentTextDetection](API_StartDocumentTextDetection.md).  | 
|  Status  |  The status of the job. Valid values are SUCCEEDED, FAILED, or ERROR.  | 
|  API  |  The Amazon Textract operation used to analyze the input document, such as [StartDocumentTextDetection](API_StartDocumentTextDetection.md) or [StartDocumentAnalysis](API_StartDocumentAnalysis.md).  | 
|  JobTag  |  The user-specified identifier for the job. You specify `JobTag` in a call to the `Start` operation, such as [StartDocumentTextDetection](API_StartDocumentTextDetection.md).  | 
|  Timestamp  |  The Unix timestamp that indicates when the job finished, returned in milliseconds.  | 
|  DocumentLocation  |  Details about the document that was processed. Includes the file name and the Amazon S3 bucket that the file is stored in.  | 

If the value of `Status` in the Amazon SNS notification is `FAILED`, something went wrong with your analysis job. In that case, check for an error message returned by the Amazon Textract API operation, and make sure that your document is within the quotas specified in [Quotas in Amazon Textract](limits-document.md).
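
When you poll for results yourself instead of using an Amazon SQS queue, you can inspect the `JobStatus` field in a `Get` operation response; for failed jobs, the `StatusMessage` field describes the problem. The following sketch shows one way to summarize the outcome. The response dict here is an illustrative placeholder, not output from a real job:

```python
def describe_job_outcome(response):
    """Summarize a Textract Get operation response.

    A Get response includes a JobStatus field and, for failed jobs,
    a StatusMessage field that describes why the job failed.
    """
    status = response['JobStatus']
    if status == 'FAILED':
        return 'Job failed: ' + response.get('StatusMessage', 'no details returned')
    return 'Job status: ' + status


# Illustrative response for a failed job (placeholder values).
failed_response = {'JobStatus': 'FAILED',
                   'StatusMessage': 'Unsupported document format'}
print(describe_job_outcome(failed_response))
```

In practice, `response` would be the dict returned by a call such as `textract.get_lending_analysis(JobId=jobId)`.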