# Document understanding
<a name="modalities-document"></a>

**Note**  
This documentation is for Amazon Nova Version 1. For the Amazon Nova 2 document understanding guide, visit [Document understanding](https://docs.aws.amazon.com/nova/latest/nova2-userguide/using-multimodal-models.html#document-understanding).

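Amazon Nova's document understanding capability lets you include entire documents (PDFs, Word files, spreadsheets, and so on) in a prompt and ask questions or make requests about their content. Nova's multimodal understanding models (Lite, Pro, Premier) interpret both the text and the visual elements (such as charts or tables) in these documents. This supports use cases such as question answering, summarization, and analysis over lengthy reports or scanned documents. Key features include a very large context window (1-2M tokens) for long documents and the ability to process multiple documents in a single query.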

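Amazon Nova distinguishes between two types of document input:
+ **Text-based document types** (for example TXT, CSV, Markdown, HTML, DOC): For these documents, Nova primarily processes the text content and focuses on understanding and extracting information from it.
+ **Media-based document types** (for example PDF, DOCX): These files can contain complex layouts, images, charts, or embedded graphics. For media-based documents, Nova processes both the visual and the textual elements. It uses vision-based understanding to interpret visual content that appears alongside the text, such as charts, tables, diagrams, or screenshots.

  Amazon Nova does not support JPEG2000 or JBIG2 in PDF files.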

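Supported file formats include common document types: plain and structured text files (CSV, TXT), spreadsheets (XLS/XLSX), HTML/Markdown, Word documents (DOC/DOCX), and PDF files. For images inside documents, standard image formats (PNG, JPG, GIF, WebP) can be processed, but PDFs that contain certain image encodings (CMYK, SVG) are not supported.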
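The two input categories above can be sketched as a small lookup keyed on file extension. This is a minimal illustration, not part of any AWS SDK: the function name `classify_document` and the `"uncategorized"` label are hypothetical, and spreadsheets (XLS/XLSX) are left uncategorized because the text above lists them as supported without assigning them to either group.

```python
from pathlib import Path

# Groupings taken from the two-category description above.
TEXT_BASED = {"txt", "csv", "md", "html", "doc"}
MEDIA_BASED = {"pdf", "docx"}
# Listed as supported, but not assigned to a category in this section.
UNCATEGORIZED = {"xls", "xlsx"}


def classify_document(path):
    """Return (format, category) for a file path, based only on its extension.

    Hypothetical helper: it does not inspect file contents.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in TEXT_BASED:
        return ext, "text"
    if ext in MEDIA_BASED:
        return ext, "media"
    if ext in UNCATEGORIZED:
        return ext, "uncategorized"
    raise ValueError(f"Unsupported document extension: {ext!r}")
```

For example, `classify_document("report.PDF")` returns `("pdf", "media")`, and the first element can be reused as the `format` value of a document block.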


**Document size limits and usage guidelines**  

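| Constraint | Limit | 
| --- | --- | 
|  Maximum number of documents  |  Up to 5 documents per request (applies to both direct upload and Amazon S3)  | 
|  Text-based document size  |  Each text document (for example .txt, .csv, .md, .html, .doc) must not exceed 4.5 MB  | 
|  Media-based document size  |  .pdf and .docx files have no individual file size limit, however: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/nova/latest/userguide/modalities-document.html)  | 
|  Unsupported PDF content  |  PDFs that contain CMYK color profiles or SVG images are not supported  | 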
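The limits in the table can be checked on the client before a request is sent. The sketch below is illustrative only: `check_documents` is a hypothetical helper, and it assumes that "4.5 MB" means 4.5 × 1024 × 1024 bytes, which the table does not specify. It covers only the document count and the text-based size limit, because the additional conditions for media-based documents are not spelled out here.

```python
MAX_DOCUMENTS = 5
# Assumption: 1 MB = 1024 * 1024 bytes; the table does not state the unit base.
MAX_TEXT_DOC_BYTES = int(4.5 * 1024 * 1024)
TEXT_FORMATS = {"txt", "csv", "md", "html", "doc"}


def check_documents(docs):
    """Return a list of human-readable problems for (format, size_in_bytes) pairs.

    An empty list means none of the limits checked here were exceeded.
    """
    problems = []
    if len(docs) > MAX_DOCUMENTS:
        problems.append(
            f"{len(docs)} documents supplied; at most {MAX_DOCUMENTS} per request"
        )
    for index, (fmt, size) in enumerate(docs, start=1):
        if fmt in TEXT_FORMATS and size > MAX_TEXT_DOC_BYTES:
            problems.append(
                f"document {index} ({fmt}) is {size} bytes; "
                f"text-based limit is {MAX_TEXT_DOC_BYTES} bytes"
            )
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every violation at once.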

# Using Nova's document understanding through the API
<a name="modalities-document-examples"></a>

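The following simplified Python example shows how to use Amazon Nova for document QA (question answering) or analysis. It uses the Amazon Bedrock API (through the Boto3 SDK) to send a PDF document together with a question for the model to answer.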

```
            
import boto3

# Initialize Bedrock runtime client (adjust region as needed)
client = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "us.amazon.nova-lite-v1:0"  # using Nova Lite model in this example

# Read the document file (PDF) in binary mode
with open("my_document.pdf", "rb") as file:
    doc_bytes = file.read()

# Construct the conversation messages with document + question
messages = [
    {
        "role": "user",
        "content": [
            {
                "document": {
                    "format": "pdf",
                    "name": "Document1",  # neutral name for the document
                    "source": {
                        "bytes": doc_bytes  # embedding the PDF content directly
                    }
                }
            },
            {
                "text": "Here is a question about the document: ... (your question) ... ?"
            }
        ]
    }
]

# Set inference parameters (optional)
inf_params = {"maxTokens": 4000, "topP": 0.1, "temperature": 0.3}

# Invoke the model
response = client.converse(modelId=MODEL_ID, messages=messages, inferenceConfig=inf_params)

# Extract and print the answer
answer_text = response["output"]["message"]["content"][0]["text"]
print(answer_text)
```
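The example above reads only the first content block of the reply. A slightly more defensive sketch, assuming the usual Converse response shape (`output.message.content` as a list of blocks, plus top-level `usage` and `stopReason` fields), joins every text block and surfaces the token counts. The helper names `extract_text` and `summarize_usage` are hypothetical.

```python
def extract_text(response):
    """Join the text of every text block in a Converse-style response dict."""
    content = response.get("output", {}).get("message", {}).get("content", [])
    return "\n".join(block["text"] for block in content if "text" in block)


def summarize_usage(response):
    """Return (input_tokens, output_tokens, stop_reason); missing fields become None."""
    usage = response.get("usage", {})
    return (
        usage.get("inputTokens"),
        usage.get("outputTokens"),
        response.get("stopReason"),
    )
```

Checking the stop reason is useful with long documents: if the reply stopped because it reached `maxTokens`, the answer is probably truncated.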

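If your input files are large (exceeding the 25 MB direct upload limit) or you have many files, you can store them in Amazon S3 and reference them instead. This avoids sending raw bytes in the request. When you use S3, make sure that the Bedrock service has permission to access the bucket and object. For example, to reference a PDF in S3, the document source uses `s3Location` instead of `bytes`, as follows: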

```
messages = [
    {
        "role": "user",
        "content": [
            {
                "document": {
                    "format": "pdf",
                    "name": "Report2023",
                    "source": {
                        "s3Location": {
                            "uri": "s3://your-bucket/path/to/document1.pdf",
                            "bucketOwner": "123456789012"
                        }
                    }
                }
            },
            {
                "text": "Summarize the key findings from the Q3 2023 report."
            }
        ]
    }
]
```
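The two source styles shown above (inline `bytes` and `s3Location`) can be wrapped in one helper that builds the document block. This is a sketch under stated assumptions: `document_block` is a hypothetical function, and the 25 MB direct upload limit is interpreted as 25 × 1024 × 1024 bytes.

```python
# Assumption: the 25 MB direct upload limit is counted as 25 * 1024 * 1024 bytes.
DIRECT_UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024


def document_block(name, fmt, data=None, s3_uri=None, bucket_owner=None):
    """Build one Converse content block for a document.

    Pass exactly one of `data` (raw bytes) or `s3_uri` (an s3:// URI).
    """
    if (data is None) == (s3_uri is None):
        raise ValueError("Provide exactly one of data or s3_uri")
    if data is not None:
        if len(data) > DIRECT_UPLOAD_LIMIT_BYTES:
            raise ValueError("Document exceeds the direct upload limit; use S3")
        source = {"bytes": data}
    else:
        location = {"uri": s3_uri}
        if bucket_owner:
            location["bucketOwner"] = bucket_owner
        source = {"s3Location": location}
    return {"document": {"format": fmt, "name": name, "source": source}}


# Several documents can share one user message, followed by the question.
content = [
    document_block("Document1", "pdf", s3_uri="s3://your-bucket/path/to/document1.pdf"),
    document_block("Document2", "txt", data=b"Plain text contents"),
    {"text": "Compare the two documents."},
]
messages = [{"role": "user", "content": content}]
```

The resulting `messages` list can be passed to `client.converse` in the same way as in the earlier example.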

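**Note**  
Document names can contain only alphanumeric characters, hyphens, parentheses, and square brackets.  
The `name` field is vulnerable to prompt injection, because the model might inadvertently interpret it as an instruction. We therefore recommend that you specify a neutral name.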
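One way to follow both points in the note is to validate names against the allowed character set and to generate neutral names instead of reusing user-supplied file names. The helpers below are hypothetical, and the pattern assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits only.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9()\[\]-]+")


def is_valid_document_name(name):
    """True if the name uses only the characters listed in the note above."""
    return bool(_ALLOWED_NAME.fullmatch(name))


def neutral_name(index):
    """Return a neutral name such as 'Document1' that carries no user text."""
    return f"Document{index}"
```

Generating the name from a counter rather than from the original file name keeps untrusted text out of the `name` field entirely.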