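The following examples perform equivalent transformations. Of the three, the second example (SparkSQL) is the cleanest and most efficient, followed by the pandas UDF, and finally the low-level mapping in the first example. The first example is a complete implementation of a simple transformation that adds up two columns: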
from awsglue import DynamicFrame

# You can have other auxiliary variables, functions or classes in this file; they won't affect the runtime
def record_sum(rec, col1, col2, resultCol):
    rec[resultCol] = rec[col1] + rec[col2]
    return rec

# The number and names of the arguments must match the definition in the JSON config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
# (resultCol in this example is an optional argument)
def custom_add_columns(self, col1, col2, resultCol="result"):
    # The mapping will alter the column order, which could be important
    fields = [field.name for field in self.schema()]
    if resultCol not in fields:
        # If it's a new column, put it at the end
        fields.append(resultCol)
    return self.map(lambda record: record_sum(record, col1, col2, resultCol)).select_fields(paths=fields)

# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
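The record-level logic of the first example can be exercised without a Glue runtime. The sketch below is a minimal illustration that assumes plain Python dicts as stand-ins for the records and a plain list for the schema's field names; `output_fields` is a hypothetical helper that mirrors the column-ordering step and is not part of the AWS Glue API.

```python
def record_sum(rec, col1, col2, resultCol):
    # Same per-record logic as the example above: store the sum in resultCol
    rec[resultCol] = rec[col1] + rec[col2]
    return rec


def output_fields(existing_fields, resultCol):
    # Hypothetical helper: keep the existing column order and
    # append the result column only if it is new
    fields = list(existing_fields)
    if resultCol not in fields:
        fields.append(resultCol)
    return fields


# Plain dicts stand in for the records of a DynamicFrame (an assumption for local testing)
records = [{"a": 1, "b": 2}, {"a": 10, "b": 20}]
summed = [record_sum(dict(rec), "a", "b", "result") for rec in records]
fields = output_fields(["a", "b"], "result")
```

If the result column already exists, `output_fields` leaves the order unchanged, so the transform overwrites that column in place instead of moving it to the end.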
The following example is an equivalent transformation that uses the SparkSQL API.
from awsglue import DynamicFrame

# The number and names of the arguments must match the definition in the JSON config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
# (resultCol in this example is an optional argument)
def custom_add_columns(self, col1, col2, resultCol="result"):
    df = self.toDF()
    return DynamicFrame.fromDF(
        df.withColumn(resultCol, df[col1] + df[col2]),  # This is the conversion logic
        self.glue_ctx, self.name)

# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
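The following example applies the same transformation with a pandas UDF, which is more efficient than a plain UDF. For more information about writing pandas UDFs, see the Apache Spark SQL documentation.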
from awsglue import DynamicFrame
import pandas as pd
from pyspark.sql.functions import pandas_udf

# The number and names of the arguments must match the definition in the JSON config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
# (resultCol in this example is an optional argument)
def custom_add_columns(self, col1, col2, resultCol="result"):
    @pandas_udf("integer")  # We need to declare the type of the result column
    def add_columns(value1: pd.Series, value2: pd.Series) -> pd.Series:
        return value1 + value2

    df = self.toDF()
    return DynamicFrame.fromDF(
        df.withColumn(resultCol, add_columns(col1, col2)),  # This is the conversion logic
        self.glue_ctx, self.name)

# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
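All three examples end by assigning the function to an attribute of DynamicFrame whose name matches the configured "functionName". The sketch below illustrates why the names need to match, under the assumption that the caller looks the method up by name and passes the configured parameters as keyword arguments. `FakeFrame` and `apply_transform` are hypothetical stand-ins; how AWS Glue actually dispatches the call is not described here.

```python
class FakeFrame:
    # Hypothetical stand-in for DynamicFrame, holding rows as a list of dicts
    def __init__(self, rows):
        self.rows = rows


def custom_add_columns(self, col1, col2, resultCol="result"):
    # Return a new frame with the sum of col1 and col2 stored in resultCol
    return FakeFrame([{**row, resultCol: row[col1] + row[col2]} for row in self.rows])


# Attaching the function to the class makes it callable as a method on every instance
FakeFrame.custom_add_columns = custom_add_columns


def apply_transform(frame, function_name, **params):
    # Assumed dispatch: look the method up by its configured name,
    # then pass the configured parameters as keyword arguments
    return getattr(frame, function_name)(**params)


frame = FakeFrame([{"a": 1, "b": 2}])
out = apply_transform(frame, "custom_add_columns", col1="a", col2="b")
```

In this sketch, a function name that does not match the assigned attribute raises `AttributeError`, and a parameter name that does not match the function signature raises `TypeError`.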