DynamicFrameReader class
— methods —
__init__
__init__(glue_context)
glue_context – The GlueContext class to use.
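In a Glue job script you do not usually construct a DynamicFrameReader by hand; the GlueContext created for the job already exposes one. A minimal sketch, assuming a standard PySpark/Glue job environment:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# The GlueContext wraps the job's SparkContext.
glue_context = GlueContext(SparkContext.getOrCreate())

# create_dynamic_frame is a DynamicFrameReader bound to this GlueContext;
# its from_rdd, from_options, and from_catalog methods are described below.
reader = glue_context.create_dynamic_frame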
from_rdd
from_rdd(data, name, schema=None, sampleRatio=None)
Reads a DynamicFrame from a Resilient Distributed Dataset (RDD).
data – The dataset to read from.
name – The name to read from.
schema – The schema to read (optional).
sampleRatio – The sample ratio (optional).
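A minimal sketch of from_rdd, assuming a running Glue job with a SparkContext available; the RDD contents and the name are illustrative:
from pyspark.context import SparkContext
from pyspark.sql import Row
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Build a small RDD of Rows, then read it into a DynamicFrame.
rdd = sc.parallelize([Row(name="a", value=1), Row(name="b", value=2)])
dyf = glue_context.create_dynamic_frame.from_rdd(rdd, name="sample_data")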
from_options
from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")
Reads a DynamicFrame using the specified connection and format.
connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, dynamodb, and snowflake.
connection_options – Connection options, such as path and database table (optional). For more information, see Connection types and options for ETL in AWS Glue for Spark. For a connection_type of s3, Amazon S3 paths are defined in an array.
connection_options = {"paths": ["s3://mybucket/object_a", "s3://mybucket/object_b"]}
For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
Warning
Storing passwords in your script is not recommended. Consider using boto3 to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.
connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable, "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
For a JDBC connection that performs parallel reads, you can set the hashfield option. For example:
connection_options = {"url": "
jdbc-url/database
", "user": "username
", "password":passwordVariable
,"dbtable": "table-name
", "redshiftTmpDir": "s3-tempdir-path
" , "hashfield": "month
"}For more information, see Reading from JDBC tables in parallel.
format – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See Data format options for inputs and outputs in AWS Glue for Spark for the formats that are supported.
format_options – Format options for the specified format. See Data format options for inputs and outputs in AWS Glue for Spark for the formats that are supported.
transformation_ctx – The transformation context to use (optional).
push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-Filtering Using Pushdown Predicates.
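A sketch of the two read paths described above; the JDBC URL, secret name, and transformation context are placeholders, and the Secrets Manager lookup assumes a secret you have created, following the warning above:
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Amazon S3 read: paths are passed as an array, with an explicit format.
s3_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://mybucket/object_a", "s3://mybucket/object_b"]},
    format="json",
)

# JDBC read with parallelism via hashfield; the password is retrieved from
# AWS Secrets Manager rather than stored in the script.
password = boto3.client("secretsmanager").get_secret_value(
    SecretId="my-db-secret"  # hypothetical secret name
)["SecretString"]
jdbc_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://host:3306/database",  # placeholder JDBC URL
        "user": "username",
        "password": password,
        "dbtable": "table-name",
        "hashfield": "month",
    },
    transformation_ctx="jdbc_read",
)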
from_catalog
from_catalog(database, table_name, redshift_tmp_dir="", transformation_ctx="", push_down_predicate="", additional_options={})
Reads a DynamicFrame using the specified catalog namespace and table name.
database – The database to read from.
table_name – The name of the table to read from.
redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional if not reading data from Redshift).
transformation_ctx – The transformation context to use (optional).
push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-filtering using pushdown predicates.
additional_options – Additional options provided to AWS Glue.
- To use a JDBC connection that performs parallel reads, you can set the hashfield, hashexpression, or hashpartitions options. For example:
additional_options = {"hashfield": "month"}
For more information, see Reading from JDBC tables in parallel.
- To pass a catalog expression to filter based on the index columns, you can use the catalogPartitionPredicate option.
catalogPartitionPredicate — You can pass a catalog expression to filter based on the index columns. This pushes down the filtering to the server side. For more information, see AWS Glue Partition Indexes. Note that push_down_predicate and catalogPartitionPredicate use different syntaxes. The former uses Spark SQL standard syntax and the latter uses the JSQL parser. For more information, see Managing partitions for ETL output in AWS Glue. (Both options appear in the sketch after this list.)