DynamicFrameReader class

Methods

__init__

__init__(glue_context)
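
In practice you rarely construct a DynamicFrameReader yourself: a GlueContext exposes one as its create_dynamic_frame attribute. A minimal sketch of the setup, assuming a standard AWS Glue PySpark job environment:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)

    # glue_context.create_dynamic_frame is a ready-made DynamicFrameReader
    reader = glue_context.create_dynamic_frame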

from_rdd

from_rdd(data, name, schema=None, sampleRatio=None)

Reads a DynamicFrame from a Resilient Distributed Dataset (RDD).

  • data – The dataset to read from.

  • name – The name to assign to the resulting DynamicFrame.

  • schema – The schema to apply (optional).

  • sampleRatio – The sample ratio (optional).
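
A minimal sketch of from_rdd, reusing the sc and glue_context from the setup under __init__; the sample rows and frame name are placeholders:

    from pyspark.sql import Row

    # Build a small RDD of Rows so the schema can be inferred
    rdd = sc.parallelize([Row(name="a", value=1), Row(name="b", value=2)])
    dyf = glue_context.create_dynamic_frame.from_rdd(rdd, name="sample_frame")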

from_options

from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="", push_down_predicate="")

Reads a DynamicFrame using the specified connection and format.

  • connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, dynamodb, and snowflake.

  • connection_options – Connection options, such as paths and database table (optional). For more information, see Connection types and options for ETL in AWS Glue for Spark. For a connection_type of s3, Amazon S3 paths are defined in an array.

    connection_options = {"paths": [ "s3://mybucket/object_a", "s3://mybucket/object_b"]}

    For JDBC connections, several properties must be defined. The database name must be part of the URL; it can additionally be included in the connection options.

    Warning

    Storing passwords in your script is not recommended. Consider using boto3 to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

    For a JDBC connection that performs parallel reads, you can set the hashfield option. For example:

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path" , "hashfield": "month"}

    For more information, see Reading from JDBC tables in parallel.

  • format – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) connection or an AWS Glue connection that supports multiple formats. See Data format options for inputs and outputs in AWS Glue for Spark for the formats that are supported.

  • format_options – Format options for the specified format. See Data format options for inputs and outputs in AWS Glue for Spark for the formats that are supported.

  • transformation_ctx – The transformation context to use (optional).

  • push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-filtering using pushdown predicates.
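
The following sketch pulls the options above together, again assuming the setup under __init__; the bucket, paths, connection details, and secret name are hypothetical. The JDBC branch retrieves the password from AWS Secrets Manager with boto3, as the warning above recommends, rather than hard-coding it:

    import boto3

    # Read JSON objects from Amazon S3 (paths are placeholders)
    s3_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://mybucket/object_a", "s3://mybucket/object_b"]},
        format="json",
        transformation_ctx="s3_read",
    )

    # For JDBC, fetch the password from Secrets Manager instead of embedding it.
    # Assumes the secret stores the password as a plain string.
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="my-db-secret")
    password = secret["SecretString"]

    jdbc_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql",
        connection_options={
            "url": "jdbc:mysql://host:3306/database",  # database name in the URL
            "user": "username",
            "password": password,
            "dbtable": "table-name",
            "hashfield": "month",  # partitions the read for parallelism
        },
    )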

from_catalog

from_catalog(database, table_name, redshift_tmp_dir="", transformation_ctx="", push_down_predicate="", additional_options={})

Reads a DynamicFrame using the specified catalog namespace and table name.

  • database – The database to read from.

  • table_name – The name of the table to read from.

  • redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional if not reading data from Redshift).

  • transformation_ctx – The transformation context to use (optional).

  • push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-filtering using pushdown predicates.

  • additional_options – Additional options provided to AWS Glue.

    • To use a JDBC connection that performs parallel reads, you can set the hashfield, hashexpression, or hashpartitions options. For example:

      additional_options = {"hashfield": "month"}

      For more information, see Reading from JDBC tables in parallel.

    • To pass a catalog expression that filters based on the index columns, you can use the catalogPartitionPredicate option.

      catalogPartitionPredicate — Passes a catalog expression to filter based on the index columns. This pushes the filtering down to the server side. For more information, see AWS Glue Partition Indexes. Note that push_down_predicate and catalogPartitionPredicate use different syntaxes: the former uses standard Spark SQL syntax, and the latter uses the JSQL parser.

      For more information, see Managing partitions for ETL output in AWS Glue.
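
A sketch of from_catalog under the same setup; the database, table, and partition columns are hypothetical. It illustrates the two predicate syntaxes discussed above:

    frame = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_table",
        transformation_ctx="catalog_read",
        # Spark SQL syntax; filters partitions before listing files
        push_down_predicate="year == '2024' AND month == '06'",
        # JSQL syntax; pushed down to the catalog server side
        additional_options={"catalogPartitionPredicate": "year='2024'"},
    )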