Reading from JDBC tables in parallel
You can set properties of your JDBC table to enable AWS Glue to read data in parallel. When you set certain properties, you instruct AWS Glue to run parallel SQL queries against logical partitions of your data. You can control partitioning by setting a hash field or a hash expression. You can also control the number of parallel reads that are used to access your data.
Reading from JDBC tables in parallel is an optimization technique that may improve performance. For more information about the process of identifying when this technique is appropriate, consult Reduce the amount of data scan in the Best practices for performance tuning AWS Glue for Apache Spark jobs guide on AWS Prescriptive Guidance.
To enable parallel reads, you can set key-value pairs in the parameters field of your table
structure. Use JSON notation to set a value for the parameter field of your table. For more
information about editing the properties of a table, see Viewing and managing table details. You can also
enable parallel reads when you call the ETL (extract, transform, and load) methods
create_dynamic_frame_from_options
and
create_dynamic_frame_from_catalog
. For more information about specifying
options in these methods, see from_options and from_catalog.
You can use this method for JDBC tables, that is, most tables whose base data is a JDBC data store. These properties are ignored when reading Amazon Redshift and Amazon S3 tables.
- hashfield
-
Set
hashfield
to the name of a column in the JDBC table to be used to divide the data into partitions. For best results, this column should have an even distribution of values to spread the data between partitions. This column can be of any data type. AWS Glue generates non-overlapping queries that run in parallel to read the data partitioned by this column. For example, if your data is evenly distributed by month, you can use themonth
column to read each month of data in parallel.'hashfield': 'month'
AWS Glue creates a query to hash the field value to a partition number and runs the query for all partitions in parallel. To use your own query to partition a table read, provide a
hashexpression
instead of ahashfield
. - hashexpression
-
Set
hashexpression
to an SQL expression (conforming to the JDBC database engine grammar) that returns a whole number. A simple expression is the name of any numeric column in the table. AWS Glue generates SQL queries to read the JDBC data in parallel using thehashexpression
in theWHERE
clause to partition data.For example, use the numeric column
customerID
to read data partitioned by a customer number.'hashexpression': 'customerID'
To have AWS Glue control the partitioning, provide a
hashfield
instead of ahashexpression
. - hashpartitions
-
Set
hashpartitions
to the number of parallel reads of the JDBC table. If this property is not set, the default value is 7.For example, set the number of parallel reads to
5
so that AWS Glue reads your data with five queries (or fewer).'hashpartitions': '5'