The following examples perform equivalent transformations. However, the second example (SparkSQL) is the cleanest and most efficient, followed by the pandas UDF, and finally the low-level mapping in the first example. The following is a complete example of a simple transform that adds two columns:
from awsglue import DynamicFrame

# You can have other auxiliary variables, functions, or classes in this file; they won't affect the runtime
def record_sum(rec, col1, col2, resultCol):
    rec[resultCol] = rec[col1] + rec[col2]
    return rec

# The number and names of the arguments must match the definition in the JSON config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
# (resultCol in this example is an optional argument).
def custom_add_columns(self, col1, col2, resultCol="result"):
    # The mapping will alter the column order, which could be important
    fields = [field.name for field in self.schema()]
    if resultCol not in fields:
        # If it's a new column, put it at the end
        fields.append(resultCol)
    return self.map(lambda record: record_sum(record, col1, col2, resultCol)).select_fields(paths=fields)

# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
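The last line above attaches the function as a new method on the DynamicFrame class, so a Glue job can then call it like any built-in transform (for example, `dyf.custom_add_columns("a", "b", resultCol="sum")`). The registration pattern itself can be sketched outside Glue with plain Python; `MiniFrame` below is a hypothetical stand-in for DynamicFrame, used only to illustrate the mechanism:

```python
# Minimal sketch of the method-registration pattern.
# MiniFrame is a hypothetical stand-in; in a real Glue job the
# attribute is set on awsglue's DynamicFrame class instead.
class MiniFrame:
    def __init__(self, records):
        self.records = records

    def map(self, fn):
        # Apply fn to a copy of each record, like DynamicFrame.map
        return MiniFrame([fn(dict(r)) for r in self.records])

def record_sum(rec, col1, col2, resultCol):
    rec[resultCol] = rec[col1] + rec[col2]
    return rec

def custom_add_columns(self, col1, col2, resultCol="result"):
    return self.map(lambda r: record_sum(r, col1, col2, resultCol))

# Attach the function as a method, exactly as done with DynamicFrame above
MiniFrame.custom_add_columns = custom_add_columns

frame = MiniFrame([{"a": 1, "b": 2}, {"a": 10, "b": 20}])
out = frame.custom_add_columns("a", "b")
print(out.records)  # each record gains a "result" column
```

Because the function is assigned as a plain attribute, Python passes the frame instance as `self`, which is why the config file only declares the remaining arguments.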
The following example is an equivalent transform that leverages the SparkSQL API.
from awsglue import DynamicFrame

# The number and names of the arguments must match the definition in the JSON config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
# (resultCol in this example is an optional argument).
def custom_add_columns(self, col1, col2, resultCol="result"):
    df = self.toDF()
    return DynamicFrame.fromDF(
        df.withColumn(resultCol, df[col1] + df[col2]),  # This is the conversion logic
        self.glue_ctx, self.name)

# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
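The comments above refer to a JSON config file whose argument definitions the function signature must match. A sketch of what such a config might look like for this transform follows; the field values are illustrative, and the exact schema (field names such as `type` and `isOptional`) should be verified against the current AWS Glue custom visual transforms documentation:

```json
{
  "name": "custom_add_columns",
  "displayName": "Add two columns",
  "description": "Adds two numeric columns into a result column",
  "functionName": "custom_add_columns",
  "parameters": [
    { "name": "col1", "displayName": "First column", "type": "str" },
    { "name": "col2", "displayName": "Second column", "type": "str" },
    { "name": "resultCol", "displayName": "Result column", "type": "str", "isOptional": true }
  ]
}
```

Note how `functionName` matches the name assigned on DynamicFrame, and how `resultCol` is marked optional to mirror its default value in the Python signature.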
The following example applies the same transform but uses a pandas UDF, which is more efficient than using a plain UDF. For more information about writing pandas UDFs, see the Apache Spark SQL documentation.
from awsglue import DynamicFrame
import pandas as pd
from pyspark.sql.functions import pandas_udf

# The number and names of the arguments must match the definition in the JSON config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
# (resultCol in this example is an optional argument).
def custom_add_columns(self, col1, col2, resultCol="result"):
    @pandas_udf("integer")  # We need to declare the type of the result column
    def add_columns(value1: pd.Series, value2: pd.Series) -> pd.Series:
        return value1 + value2

    df = self.toDF()
    return DynamicFrame.fromDF(
        df.withColumn(resultCol, add_columns(col1, col2)),  # This is the conversion logic
        self.glue_ctx, self.name)

# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
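Unlike the row-by-row UDF in the first example, the pandas UDF receives whole batches of column values as `pd.Series` objects and adds them in one vectorized operation, which is where its efficiency comes from. The UDF body can be checked outside Spark with just pandas (a minimal sketch; the surrounding Spark machinery is omitted):

```python
import pandas as pd

def add_columns(value1: pd.Series, value2: pd.Series) -> pd.Series:
    # Same body as the pandas UDF above: vectorized, element-wise addition
    return value1 + value2

s = add_columns(pd.Series([1, 2, 3]), pd.Series([10, 20, 30]))
print(s.tolist())  # [11, 22, 33]
```

When Spark applies the decorated version, it splits each column into batches, calls the function once per batch, and reassembles the results into the new column.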