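Note
Fine-grained actions are available only in AWS Glue 3.0 and 4.0. This includes the AWS Glue Studio experience. The persistent audit log changes are also not available in version 2.0.
All AWS Glue Studio 3.0 and 4.0 visual jobs create a script that automatically uses the fine-grained actions API.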
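The Detect Sensitive Data transform lets you detect, mask, or remove entities that you define or that AWS Glue predefines. With fine-grained actions, you can also apply a specific action to each entity. Additional benefits include:
- Improved performance, because actions are applied as soon as data is detected.
- The option to include or exclude specific columns.
- The ability to use partial masking, which lets you mask part of a detected sensitive data entity instead of the entire string. Both simple parameters with offsets and regular expressions are supported.
The following shows the sensitive data detection API signature and the fine-grained actions used in the sample job referenced in the next sections.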
Detect API: fine-grained actions use the new detectionParameters parameter:
def detect(
frame: DynamicFrame,
detectionParameters: JsonOptions,
outputColumnName: String = "DetectedEntities",
detectionSensitivity: String = "LOW"
): DynamicFrame = {}
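Using the sensitive data detection API with fine-grained actions
The sensitive data detection API uses detect to analyze the given data, determine whether rows or columns are sensitive data entity types, and run the action that the user specified for each entity type.
Using the detect API with fine-grained actions
Call the detect API and specify outputColumnName and detectionParameters.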
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.ml.EntityDetector
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Script generated for node S3 bucket. Creates a DynamicFrame from CSV data stored in S3.
    val S3bucket_node1 = glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ",", "optimizePerformance": false}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://189657479688-ddevansh-pii-test-bucket/tiny_pii.csv"], "recurse": true}"""), transformationContext="S3bucket_node1").getDynamicFrame()

    // Script generated for node Detect Sensitive Data. Runs the detect API on the DynamicFrame.
    // detectionParameters maps each entity type to the actions applied to it when detected.
    val DetectSensitiveData_node2 = EntityDetector.detect(
      frame = S3bucket_node1,
      detectionParameters = JsonOptions(
        """
          {
            "PHONE_NUMBER": [
              {
                "action": "PARTIAL_REDACT",
                "actionOptions": {
                  "numLeftCharsToExclude": "3",
                  "numRightCharsToExclude": "4",
                  "redactChar": "#"
                },
                "sourceColumnsToExclude": [ "Passport No", "DL NO#" ]
              }
            ],
            "USA_PASSPORT_NUMBER": [
              {
                "action": "SHA256_HASH",
                "sourceColumns": [ "Passport No" ]
              }
            ],
            "USA_DRIVING_LICENSE": [
              {
                "action": "REDACT",
                "actionOptions": {
                  "redactText": "USA_DL"
                },
                "sourceColumns": [ "DL NO#" ]
              }
            ]
          }
        """
      ),
      outputColumnName = "DetectedEntities"
    )

    // Script generated for node S3 bucket. Stores the results of detect as JSON in S3.
    val S3bucket_node3 = glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://189657479688-ddevansh-pii-test-bucket/test-output/", "partitionKeys": []}"""), transformationContext="S3bucket_node3", format="json").writeDynamicFrame(DetectSensitiveData_node2)
    Job.commit()
  }
}
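The script above creates a DynamicFrame from a location in Amazon S3 and then runs the detect API. The detect API requires the detectionParameters field (a map from each entity name to a list of all action settings to use for that entity) to be represented as an AWS Glue JsonOptions object, which also makes it possible to extend the functionality of the API.
For each action specified for an entity, enter a list of all the column names to which that entity/action combination applies. This lets you customize which entities are detected for each column in your dataset and skip entities that you know are not present in a particular column. It also improves job performance by avoiding unnecessary detection calls for those entities, and it lets you perform actions that are specific to each column and entity combination.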
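Looking more closely at detectionParameters, the sample job has three entity types: PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_DRIVING_LICENSE. AWS Glue runs a different action for each entity type, here PARTIAL_REDACT, SHA256_HASH, and REDACT respectively; DETECT is the fourth available action. Each entity type must also have sourceColumns and/or sourceColumnsToExclude, which determine the columns the action applies to if the entity is detected.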
Note
Only one in-place edit action (PARTIAL_REDACT, SHA256_HASH, or REDACT) can be used per column, but the DETECT action can be combined with any of these actions.
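The one-in-place-action-per-column rule can be checked before a job is submitted. The following is a minimal sketch in plain Scala and is not part of the AWS Glue API: `ActionSpec`, `ActionRules`, and `conflictingColumns` are hypothetical names introduced for illustration, and the sketch only considers columns that are listed explicitly in `sourceColumns`.

```scala
// Hypothetical helper, not part of AWS Glue: models one action entry of detectionParameters.
final case class ActionSpec(action: String, sourceColumns: Seq[String])

object ActionRules {
  // Actions that modify the column value in place.
  val InPlaceActions: Set[String] = Set("PARTIAL_REDACT", "SHA256_HASH", "REDACT")

  // Returns each explicitly listed column that is targeted by more than one
  // in-place action entry, mapped to the actions involved. DETECT entries are ignored.
  def conflictingColumns(params: Map[String, Seq[ActionSpec]]): Map[String, Seq[String]] = {
    val pairs = for {
      (_, specs) <- params.toSeq
      spec <- specs if InPlaceActions.contains(spec.action)
      column <- spec.sourceColumns
    } yield column -> spec.action
    pairs
      .groupBy(_._1)
      .map { case (column, hits) => column -> hits.map(_._2) }
      .filter { case (_, actions) => actions.size > 1 }
  }
}
```

An empty result means no explicitly listed column is targeted by two in-place actions. Columns selected only through sourceColumnsToExclude are not resolved by this sketch.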
The detectionParameters field has the following layout:
ENTITY_NAME -> List[Actions]
{
"ENTITY_NAME": [{
Action, // required
ColumnSpecs,
ActionOptionsMap
}],
"ENTITY_NAME2": [{
...
}]
}
The types of actions and actionOptions are listed below:
DETECT
{
# Required
"action": "DETECT",
# Optional, depending on action chosen
"actionOptions": {
// There are no actionOptions for DETECT
},
# One of the two below is required; both can also be used
"sourceColumns": [
"COL_1", "COL_2", ..., "COL_N"
],
"sourceColumnsToExclude": [
"COL_5"
]
}
SHA256_HASH
{
# Required
"action": "SHA256_HASH",
# Required or optional, depending on action chosen
"actionOptions": {
// There are no actionOptions for SHA256_HASH
},
# One of the two below is required; both can also be used
"sourceColumns": [
"COL_1", "COL_2", ..., "COL_N"
],
"sourceColumnsToExclude": [
"COL_5"
]
}
REDACT
{
# Required
"action": "REDACT",
# Required or optional, depending on action chosen
"actionOptions": {
// The text that is being replaced
"redactText": "USA_DL"
},
# One of the two below is required; both can also be used
"sourceColumns": [
"COL_1", "COL_2", ..., "COL_N"
],
"sourceColumnsToExclude": [
"COL_5"
]
}
PARTIAL_REDACT
{
# Required
"action": "PARTIAL_REDACT",
# Required or optional, depending on action chosen
"actionOptions": {
// number of characters to not redact from the left side
"numLeftCharsToExclude": "3",
// number of characters to not redact from the right side
"numRightCharsToExclude": "4",
// the partial redact will be made with this redacted character
"redactChar": "#",
// regex pattern for partial redaction
"matchPattern": "[0-9]"
},
# One of the two below is required; both can also be used
"sourceColumns": [
"COL_1", "COL_2", ..., "COL_N"
],
"sourceColumnsToExclude": [
"COL_5"
]
}
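To make the offset options concrete, the following plain-Scala sketch mimics what numLeftCharsToExclude, numRightCharsToExclude, and redactChar appear to do, judging from the sample output in this topic. It is an illustration under stated assumptions, not AWS Glue's implementation: `PartialRedactSketch` and `partialRedact` are hypothetical names, matchPattern is not modeled, and leaving too-short values unchanged is a choice made for this sketch.

```scala
object PartialRedactSketch {
  // Keeps `numLeft` leading and `numRight` trailing characters and replaces
  // everything in between with `redactChar`. Values too short to have a
  // middle section are returned unchanged (an assumption of this sketch).
  def partialRedact(value: String, numLeft: Int, numRight: Int, redactChar: Char): String = {
    require(numLeft >= 0 && numRight >= 0, "offsets must be non-negative")
    val middleLength = value.length - numLeft - numRight
    if (middleLength <= 0) value
    else value.take(numLeft) + redactChar.toString * middleLength + value.takeRight(numRight)
  }
}
```

With a made-up 14-character input such as "1-234-567-1718" and the options from the sample job (3, 4, and #), `PartialRedactSketch.partialRedact` keeps "1-2" and "1718" and replaces the seven characters in between with #.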
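After the script runs, the results are written to the given Amazon S3 location. You can view the data in Amazon S3, but for the selected entity types the values are sanitized according to the chosen action. In this example, a result row looks like the following: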
{
"Name": "Colby Schuster",
"Address": "39041 Antonietta Vista, South Rodgerside, Nebraska 24151",
"Car Owned": "Fiat",
"Email": "Kitty46@gmail.com",
"Company": "O'Reilly Group",
"Job Title": "Dynamic Functionality Facilitator",
"ITIN": "991-22-2906",
"Username": "Cassandre.Kub43",
"SSN": "914-22-2906",
"DOB": "2020-08-27",
"Phone Number": "1-2#######1718",
"Bank Account No": "69741187",
"Credit Card Number": "6441-6289-6867-2162-2711",
"Passport No": "94f311e93a623c72ccb6fc46cf5f5b0265ccb42c517498a0f27fd4c43b47111e",
"DL NO#": "USA_DL"
}
The script above partially redacted Phone Number using #. Passport No was changed to a SHA256 hash. DL NO# was detected as a USA driver's license number and redacted to "USA_DL", as specified in detectionParameters.
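The 64-character value in Passport No has the shape of a hex-encoded SHA-256 digest. The sketch below shows how such a digest can be computed with the standard java.security.MessageDigest class. It assumes the raw cell value is hashed as UTF-8 bytes with no salt, which this topic does not confirm; `Sha256Sketch` and `sha256Hex` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object Sha256Sketch {
  // Returns the lowercase hexadecimal SHA-256 digest of the UTF-8 bytes of `value`.
  // Assumption: no salt and no normalization of the input.
  def sha256Hex(value: String): String = {
    val digest = MessageDigest.getInstance("SHA-256").digest(value.getBytes(StandardCharsets.UTF_8))
    digest.map(b => "%02x".format(b & 0xff)).mkString
  }
}
```

Because SHA-256 is deterministic, equal inputs yield equal digests, so hashed columns can still be compared for equality even though the original values are no longer readable.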
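Note
Because of the nature of the classifyColumns API, it cannot be used with fine-grained actions. That API samples columns (adjustable by the user, with default values) to speed up detection, whereas fine-grained actions need to iterate over every value.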
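Persistent audit log
A new feature introduced with fine-grained actions (but also available when using the normal APIs) is the persistent audit log. Currently, running the detect API adds an additional column containing PII detection metadata (DetectedEntities by default, customizable through the outputColumnName parameter). The metadata now includes an "actionUsed" key, whose value is one of DETECT, PARTIAL_REDACT, SHA256_HASH, or REDACT.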
"DetectedEntities": {
"Credit Card Number": [
{
"entityType": "CREDIT_CARD",
"actionUsed": "DETECT",
"start": 0,
"end": 19
}
],
"Phone Number": [
{
"entityType": "PHONE_NUMBER",
"actionUsed": "REDACT",
"start": 0,
"end": 14
}
]
}
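The audit metadata above can be consumed programmatically once it is parsed. The following sketch models it with plain Scala collections and lists the columns that were edited in place. `DetectedEntity`, `AuditLogSketch`, and `editedColumns` are hypothetical names, and parsing the column content into this shape is assumed to happen elsewhere.

```scala
// Hypothetical model of one entry in the DetectedEntities metadata.
final case class DetectedEntity(entityType: String, actionUsed: String, start: Int, end: Int)

object AuditLogSketch {
  // Columns with at least one audit entry whose action is not DETECT, i.e. an in-place edit.
  def editedColumns(detected: Map[String, Seq[DetectedEntity]]): Set[String] =
    detected.collect {
      case (column, entries) if entries.exists(_.actionUsed != "DETECT") => column
    }.toSet
}
```

For the metadata shown above, only Phone Number would be reported, because the Credit Card Number entry records the DETECT action.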
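Even if you use an API that does not support fine-grained actions (for example, detect(entityTypesToDetect, outputColumnName)), you still see this persistent audit log in the resulting DataFrame.
If you use an API that supports fine-grained actions, you see all actions, whether or not they redacted the data. For example: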
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Credit Card Number | Phone Number | DetectedEntities |
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 622126741306XXXX | +12#####7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":16}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":12}]}  |
| 6221 2674 1306 XXXX | +12#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}  |
| 6221-2674-1306-XXXX | 22#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}  |
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you do not want to see the DetectedEntities column, you can drop the additional column in a custom script.