Amazon Athena Google BigQuery connector
The Amazon Athena connector for Google BigQuery enables Amazon Athena to run SQL queries on your Google BigQuery data.
Prerequisites
Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see Deploy a data source connector or Use the AWS Serverless Application Repository to deploy a data source connector.
Limitations
- Lambda functions have a maximum timeout value of 15 minutes. Each split executes a query on BigQuery and must finish with enough time to store the results for Athena to read. If the Lambda function times out, the query fails.
- Google BigQuery is case sensitive. The connector attempts to correct the case of dataset names, table names, and project IDs. This is necessary because Athena lower cases all metadata. These corrections make many extra calls to Google BigQuery.
- Binary data types are not supported.
- Because of Google BigQuery concurrency and quota limits, the connector may encounter Google quota limit issues. To avoid these issues, push as many constraints to Google BigQuery as feasible. For information about BigQuery quotas, see Quotas and limits in the Google BigQuery documentation.
Parameters
Use the Lambda environment variables in this section to configure the Google BigQuery connector.
- spill_bucket – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
- spill_prefix – (Optional) Defaults to a subfolder in the specified spill_bucket called athena-federation-spill. We recommend that you configure an Amazon S3 storage lifecycle on this location to delete spills older than a predetermined number of days or hours.
- spill_put_request_headers – (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see PutObject in the Amazon Simple Storage Service API Reference.
- kms_key_id – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like a7e63k4b-8loc-40db-a2a1-4d0en2cd8331, you can specify a KMS key ID.
- disable_spill_encryption – (Optional) When set to True, disables spill encryption. Defaults to False so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses server-side encryption.
- gcp_project_id – The project ID (not project name) that contains the datasets that the connector should read from (for example, semiotic-primer-1234567).
- secret_manager_gcp_creds_name – The name of the secret within AWS Secrets Manager that contains your BigQuery credentials in JSON format (for example, GoogleCloudPlatformCredentials).
- big_query_endpoint – (Optional) The URL of a BigQuery private endpoint. Use this parameter when you want to access BigQuery over a private endpoint.
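For example, a minimal configuration might set the environment variables shown below. This is an illustrative sketch: the bucket name is a placeholder, the project ID and secret name reuse the example values above, and spill_prefix shows its default value.

spill_bucket=amzn-s3-demo-spill-bucket
spill_prefix=athena-federation-spill
gcp_project_id=semiotic-primer-1234567
secret_manager_gcp_creds_name=GoogleCloudPlatformCredentials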
Splits and views
Because the BigQuery connector uses the BigQuery Storage Read API to query tables, and the BigQuery Storage API does not support views, the connector uses the BigQuery client with a single split for views.
Performance
To query tables, the BigQuery connector uses the BigQuery Storage Read API, which uses an RPC-based protocol that provides fast access to BigQuery managed storage. For more information about the BigQuery Storage Read API, see Use the BigQuery Storage Read API to read table data in the Google BigQuery documentation.
Selecting a subset of columns significantly speeds up query runtime and reduces the amount of data scanned. The connector is subject to query failures as concurrency increases, and it is generally a slow connector.
The Athena Google BigQuery connector performs predicate pushdown to decrease the data scanned by the query. LIMIT clauses, ORDER BY clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.
LIMIT clauses
A LIMIT N statement reduces the data scanned by the query. With LIMIT N pushdown, the connector returns only N rows to Athena.
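For example, with the following query (my_table is a placeholder table name), the connector pushes the limit down to Google BigQuery and returns only 100 rows to Athena:

SELECT * FROM my_table LIMIT 100;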
Top N queries
A top N query specifies an ordering of the result set and a limit on the number of rows returned. You can use this type of query to determine the top N max values or top N min values for your datasets. With top N pushdown, the connector returns only N ordered rows to Athena.
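For example, the following query (again using the placeholder my_table) is a top N query: the ORDER BY clause and the limit are pushed down together, so the connector returns to Athena only the 10 rows with the largest col_a values:

SELECT col_a, col_b FROM my_table ORDER BY col_a DESC LIMIT 10;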
Predicates
A predicate is an expression in the WHERE clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Google BigQuery connector can combine these expressions and push them directly to Google BigQuery for enhanced functionality and to reduce the amount of data scanned.
The following Athena Google BigQuery connector operators support predicate pushdown:
- Boolean: AND, OR, NOT
- Equality: EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
- Arithmetic: ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
- Other: LIKE_PATTERN, IN
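For example, the following query (column names are placeholders) combines Boolean, IS_NULL, GREATER_THAN, and LIKE_PATTERN operators in a single predicate that the connector can push to Google BigQuery:

SELECT * FROM my_table WHERE (col_a IS NULL OR col_a > 10) AND col_b LIKE '%pattern%';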
Combined pushdown example
For enhanced querying capabilities, combine the pushdown types, as in the following example:
SELECT *
FROM my_table
WHERE col_a > 10
  AND ((col_a + col_b) > (col_c % col_d))
  AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%')
ORDER BY col_a DESC
LIMIT 10;
Passthrough queries
The Google BigQuery connector supports passthrough queries. Passthrough queries use a table function to push your full query down to the data source for execution.
To use passthrough queries with Google BigQuery, you can use the following syntax:
SELECT * FROM TABLE( system.query( query => 'query string' ))
The following example query pushes down a query to a data source in Google BigQuery. The query selects all columns in the customer table, limiting the results to 10.
SELECT * FROM TABLE( system.query( query => 'SELECT * FROM customer LIMIT 10' ))
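Because the query string is executed by Google BigQuery rather than by Athena, you can use BigQuery SQL syntax inside it, including qualified dataset.table names. The following sketch assumes a hypothetical dataset mydataset containing an orders table:

SELECT * FROM TABLE( system.query( query => 'SELECT order_id, SUM(amount) AS total FROM mydataset.orders GROUP BY order_id' ))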
License information
The Amazon Athena Google BigQuery connector project is licensed under the Apache-2.0 License. By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the pom.xml file for this connector.
Additional resources
For additional information about this connector, visit the corresponding site on GitHub.com.