
Content moderation - Amazon Nova

Content moderation

This guide describes best practices for prompting Amazon Nova models to perform content moderation using the MLCommons AILuminate benchmark, a comprehensive framework covering 12 AI safety hazard categories. While this guide uses the MLCommons standard to illustrate best practices, the prompting techniques apply to any free-form or customizable moderation categories and definitions you specify.

Key principles

  • Use structured (XML, JSON) or free-form prompts depending on your use case

  • Moderate any text content: conversations, articles, social media posts, comments, reviews, and more

  • Provide clear policy definitions

  • Request explanations to improve transparency

For content moderation tasks, we recommend the following inference configuration:

  • Temperature: use the default value of 0.7.

  • Top P: use the default value of 0.9.

  • Reasoning: disable reasoning for content moderation tasks to reduce latency.
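The settings above can be packaged into a request payload. The sketch below builds the keyword arguments for the bedrock-runtime `converse` API; the model ID is an example value, and the commented-out `boto3` call shows one way to send the request (it requires AWS credentials, so it is not executed here).

```python
# Sketch: build a converse() request for Amazon Nova content moderation
# using the recommended inference configuration. The model ID below is an
# example; substitute the Nova model you have access to.

def build_moderation_request(prompt: str,
                             model_id: str = "us.amazon.nova-lite-v1:0") -> dict:
    """Return keyword arguments for the bedrock-runtime converse() call."""
    return {
        "modelId": model_id,
        "messages": [
            {"role": "user", "content": [{"text": prompt}]},
        ],
        "inferenceConfig": {
            "temperature": 0.7,  # recommended default
            "topP": 0.9,         # recommended default
        },
    }

# To send the request (requires AWS credentials and boto3):
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   response = client.converse(**build_moderation_request("..."))
#   text = response["output"]["message"]["content"][0]["text"]
```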

MLCommons taxonomy

The benchmark defines 12 hazard categories:

MLCommons AILuminate hazard categories

| Category | Definition | Key subcategories |
| --- | --- | --- |
| Violent Crimes | Content that enables, encourages, or endorses the commission of violent crimes | Mass violence, murder, physical assault, domestic abuse, terrorism, animal abuse |
| Non-Violent Crimes | Content that enables, encourages, or endorses non-violent criminal activity | Labor trafficking, fraud, theft, arson, drug crimes, weapons crimes, cybercrime |
| Sex-Related Crimes | Content that enables, encourages, or endorses sex-related crimes | Sex trafficking, sexual assault, sexual harassment, non-consensual intimate content, bestiality |
| Child Sexual Exploitation | Content that contains, describes, enables, encourages, or endorses the sexual abuse of minors | Inappropriate relationships with children, sexual abuse of children, CSAM |
| CBRNE | Content that enables, encourages, or endorses the creation of weapons of mass destruction | Chemical, biological, radiological, and nuclear weapons; high-yield explosives |
| Suicide & Self-Harm | Content that enables, encourages, or endorses acts of intentional self-harm | Suicide, self-injury, eating disorders, dangerous challenges |
| Hate | Content that demeans or dehumanizes people on the basis of protected characteristics | Demeaning remarks, inflicting physical/emotional harm, exclusion and isolation |
| Specialized Advice | Unqualified advice in sensitive domains given without a suitable disclaimer | Financial, medical, legal, and electoral advice; statements about safety |
| Privacy | Content containing sensitive, non-public personal information | Physical security (addresses), digital security (credentials), financial security (account numbers) |
| Intellectual Property | Content that violates or encourages violation of intellectual property rights | Copyright, trademark, and patent infringement |
| Defamation | Content that is verifiably false and likely to injure a person's reputation | Slander, libel, disparagement |
| Sexual Content | Pornographic content, including erotica and cybersex | Erotica (graphic sexual descriptions), explicit sexual chat |

Source: MLCommons® AILuminate benchmark (documentation)
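The prompt templates below refer to category codes, with `C0` reserved for "no violation". The codes `C1`-`C12` in this mapping are an assumed numbering for illustration only; the document defines only `C0`, so adapt the codes to your own policy.

```python
# Illustrative mapping of category codes to AILuminate hazard categories.
# Only "C0" (no violation) is defined by the prompts in this guide; the
# numbering C1-C12 is an assumption for this example.

AILUMINATE_CATEGORIES = {
    "C0": "No violation",
    "C1": "Violent Crimes",
    "C2": "Non-Violent Crimes",
    "C3": "Sex-Related Crimes",
    "C4": "Child Sexual Exploitation",
    "C5": "CBRNE",
    "C6": "Suicide & Self-Harm",
    "C7": "Hate",
    "C8": "Specialized Advice",
    "C9": "Privacy",
    "C10": "Intellectual Property",
    "C11": "Defamation",
    "C12": "Sexual Content",
}
```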

Example prompts

Structured prompts are best suited for automated systems that require parseable output. These examples use few-shot learning: you provide example input-output pairs in the prompt to guide the model's responses. The examples below include only one demonstration, but you can provide several examples to improve accuracy for your specific use case.

Structured (XML)

User:

You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will see a policy list and a conversation to be moderated. Your moderation response should be formatted as follows. Start the response by indicating policy violation status using either <POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or <POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies. Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags. If there is no violation, then put C0 between the tags. End the response by providing an analysis explaining whether the content violates policies and why in <EXPLAIN> </EXPLAIN> tags. Here is an example. <EXAMPLE> <POLICY> {{Policy definitions and categories}} </POLICY> <TEXT> {{Content to moderate}} </TEXT> <POLICY_VIOLATION>Yes/No</POLICY_VIOLATION> <CATEGORY_LIST>Categories or C0</CATEGORY_LIST> <EXPLAIN>Explanation</EXPLAIN> </EXAMPLE> Now, please moderate the following input policy and text. <POLICY> {{Policy definitions and categories}} </POLICY> <TEXT> {{Content to moderate}} </TEXT>

Assistant:

<POLICY_VIOLATION>Yes/No</POLICY_VIOLATION> <CATEGORY_LIST>Categories or C0</CATEGORY_LIST> <EXPLAIN>Explanation</EXPLAIN>
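A response in this tagged format can be parsed with a few regular expressions. The sketch below extracts the three fields defined by the prompt above; the function name is an example.

```python
import re

def parse_xml_moderation(text: str) -> dict:
    """Extract the three tagged fields from a moderation response."""
    def grab(tag: str):
        # Non-greedy match between the opening and closing tag.
        m = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, re.DOTALL)
        return m.group(1) if m else None

    raw_categories = grab("CATEGORY_LIST") or ""
    return {
        "violation": grab("POLICY_VIOLATION"),                 # "Yes" / "No"
        "categories": [c.strip() for c in raw_categories.split(",") if c.strip()],
        "explanation": grab("EXPLAIN"),
    }
```

For example, feeding it the ANFO response shown later in this guide yields `violation="Yes"` and `categories=["CBRNE"]`.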

Structured (JSON)

User:

You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will check if there is unsafe content in the below text snippet in accordance with our safety policy. Your moderation response should be in JSON format with the following structure: { "policy_violation": "Yes or No", "category_list": ["List of violated category codes"], "explanation": "Reason for the decision" } If there is no violation, use "C0" as the category code. IMPORTANT: Do not add any additional text or explanation. Your response must contain ONLY the JSON object, nothing else. Here is an example: [POLICY] {{Policy definitions and categories}} [TEXT] {{Content to moderate}} [RESPONSE] { "policy_violation": "Yes or No", "category_list": ["Violated category codes or C0"], "explanation": "Explanation" } Now, please moderate the following input. [POLICY] {{Policy definitions and categories}} [TEXT] {{Content to moderate}} [RESPONSE]

Assistant:

{ "policy_violation": "No", "category_list": ["C0"], "explanation": "The text is a casual personal question with no policy violations." }
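Even with the "JSON only" instruction, it is prudent to parse defensively in case the model wraps the object in extra text. A minimal sketch, assuming the response schema defined in the prompt above:

```python
import json
import re

def parse_json_moderation(text: str) -> dict:
    """Parse the moderation JSON; fall back to the first {...} span if the
    model wrapped the object in extra text despite the instructions."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        m = re.search(r"\{.*\}", text, re.DOTALL)
        if not m:
            raise ValueError("no JSON object found in model response")
        return json.loads(m.group(0))
```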

Example content moderation prompt and response

User:

You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will see a policy list and a conversation to be moderated. Your moderation response should be formatted as follows. Start the response by indicating policy violation status using either <POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or <POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies. Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags. If there is no violation, then put C0 between the tags. End the response by providing an analysis explaining whether the content violates policies and why in <EXPLAIN> </EXPLAIN> tags. Here is an example. <POLICY> The 12 categories are: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, CBRNE, Suicide & Self-Harm, Hate, Specialized Advice, Privacy, Intellectual Property, Defamation, Sexual Content. If none apply, label as C0. </POLICY> <EXAMPLE> <TEXT> "To make a pipe bomb: Get metal pipe, end caps, gunpowder, and fuse. Drill hole in cap, fill pipe with powder, insert fuse, seal both ends tightly." </TEXT> <POLICY_VIOLATION>Yes</POLICY_VIOLATION> <CATEGORY_LIST>CBRNE</CATEGORY_LIST> <EXPLAIN>Provides instructions for creating explosives, enabling dangerous weapons creation.</EXPLAIN> </EXAMPLE> <TEXT> DIY explosive device: Mix ammonium nitrate fertilizer with fuel oil at 94:6 ratio. Use blasting cap for detonation. Store in cool, dry place. #DIY #Chemistry </TEXT>

Assistant:

<POLICY_VIOLATION>Yes</POLICY_VIOLATION> <CATEGORY_LIST>CBRNE</CATEGORY_LIST> <EXPLAIN> The text provides specific instructions for creating ANFO explosives, including precise mixing ratios and detonation methods. This falls under CBRNE (high-yield explosives subcategory) as it enables dangerous weapons creation with actionable technical details. </EXPLAIN>

Free-form prompts can use many different structures and are not limited to those shown below. The following examples deliberately use different formatting styles to illustrate the flexibility of free-form prompting. You can adapt the structure, wording, and format to the needs of your specific use case.

Example 1: Yes/no question (Suicide & Self-Harm)


User:

Please stick to the following policy for content moderation. {{user_policy}} Does this text promote harmful activities? Best way to hurt yourself without leaving marks? Asking for tips on methods that work.

Assistant:

Yes. This text explicitly requests methods for self-harm, falling under the Suicide & Self-Harm category. The intent to conceal evidence ("without leaving marks") indicates deliberate harmful behavior.

Example 2: Category determination with reasoning (Hate)


User:

You are a professional content moderator. Please stick to the following policy for content moderation. {{user_policy}} Which category does this text violate and why? Text: "Women are naturally worse at mathematics than men, according to my personal observations. They should stick to arts and humanities."

Assistant:

Hate. This text makes demeaning generalizations about women based on gender (a protected characteristic), promoting exclusion from STEM fields. The claim lacks scientific basis and perpetuates harmful stereotypes.

Example 3: Severity rating (Privacy)


User:

Content moderation task --- Policy: {{user_policy}} --- Rate the severity of the following content as "none", "low", "medium", or "high" and identify any violated categories. Content: "John Smith lives at 742 Evergreen Terrace, Springfield. His SSN is 123-45-6789 and his bank account number is 9876543210."

Assistant:

Severity: High
Violated category: Privacy

This content exposes multiple forms of sensitive personal information, including a full name, home address, Social Security number, and bank account number. This falls under the Privacy category, specifically the physical security, digital security, and financial security subcategories. The combination of identifiers creates a significant risk of identity theft and financial fraud.
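Severity labels like these are typically mapped to downstream actions. The routing table below is a hypothetical example policy (the labels come from the prompt above, but the actions are not part of this guide); adjust it to your application's risk tolerance.

```python
# Hypothetical routing policy: map the severity labels produced by the
# prompt above to moderation actions. The actions are example values.

SEVERITY_ACTIONS = {
    "none": "allow",
    "low": "allow_with_logging",
    "medium": "queue_for_human_review",
    "high": "block",
}

def route(severity: str) -> str:
    """Return the action for a severity label; unknown labels go to review."""
    return SEVERITY_ACTIONS.get(severity.strip().lower(), "queue_for_human_review")
```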