Amazon Athena TPC Benchmark DS (TPC-DS) connector
The Amazon Athena TPC-DS connector enables Amazon Athena to communicate with a source of randomly generated TPC Benchmark DS data for use in benchmarking and functional testing of Athena Federation. The Athena TPC-DS connector generates a TPC-DS compliant database at one of five scale factors. We do not recommend the use of this connector as an alternative to Amazon S3-based data lake performance tests.
This connector cannot be registered with Glue Data Catalog as a federated catalog. This connector does not support data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.
Prerequisites
Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see Create a data source connection or Use the AWS Serverless Application Repository to deploy a data source connector.
Parameters
Use the parameters in this section to configure the TPC-DS connector.
Note
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding AWS Glue connection properties. Starting December 3, 2024, use the parameters below only when you manually deploy an earlier version of an Athena data source connector.
-
spill_bucket – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
-
spill_prefix – (Optional) Defaults to a subfolder in the specified spill_bucket called athena-federation-spill. We recommend that you configure an Amazon S3 storage lifecycle on this location to delete spills older than a predetermined number of days or hours.
-
spill_put_request_headers – (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see PutObject in the Amazon Simple Storage Service API Reference.
-
kms_key_id – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like a7e63k4b-8loc-40db-a2a1-4d0en2cd8331, you can specify a KMS key ID.
-
disable_spill_encryption – (Optional) When set to True, disables spill encryption. Defaults to False so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses server-side encryption.
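The spill parameters can be assembled programmatically before deployment. The following is a minimal sketch, assuming the values are handed to the connector as a flat map of strings (for example, as Lambda environment variables). The function names, the bucket name amzn-s3-demo-bucket, and the one-day expiration are hypothetical and chosen only for illustration. The lifecycle dictionary follows the general shape of an Amazon S3 lifecycle configuration, which expresses expiration in days.

```python
import json


def build_spill_settings(spill_bucket, spill_prefix="athena-federation-spill",
                         put_headers=None, kms_key_id=None,
                         disable_spill_encryption=False):
    """Return the connector parameters as a flat map of strings (hypothetical helper)."""
    if not spill_bucket:
        raise ValueError("spill_bucket is required")
    settings = {
        "spill_bucket": spill_bucket,
        "spill_prefix": spill_prefix,
        # The parameter is documented as True/False, so store it as a string.
        "disable_spill_encryption": str(bool(disable_spill_encryption)),
    }
    if put_headers:
        # spill_put_request_headers is a JSON encoded map of header names to values.
        settings["spill_put_request_headers"] = json.dumps(put_headers)
    if kms_key_id:
        settings["kms_key_id"] = kms_key_id
    return settings


def build_spill_lifecycle(spill_prefix, expire_after_days):
    """Return a lifecycle configuration that expires objects under the spill prefix."""
    if expire_after_days < 1:
        raise ValueError("expire_after_days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "expire-athena-spill",  # assumed rule name
                "Filter": {"Prefix": spill_prefix + "/"},
                "Status": "Enabled",
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


settings = build_spill_settings(
    "amzn-s3-demo-bucket",
    put_headers={"x-amz-server-side-encryption": "AES256"},
)
lifecycle = build_spill_lifecycle(settings["spill_prefix"], expire_after_days=1)
print(json.dumps(settings, indent=2))
```

Keeping the optional keys out of the map when they are unset lets the connector fall back to its documented defaults.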
Test databases and tables
The Athena TPC-DS connector generates a TPC-DS compliant database at one of the following scale factors: tpcds1, tpcds10, tpcds100, tpcds250, or tpcds1000.
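Each scale factor appears as a database name in queries. The following sketch shows one way to build a fully qualified table reference and reject unknown scale factors early. It assumes the catalog is addressed as "lambda:tpcds", as in the example query on this page; the helper name qualified_table is hypothetical.

```python
# Database names corresponding to the scale factors listed on this page.
SCALE_FACTOR_SCHEMAS = ("tpcds1", "tpcds10", "tpcds100", "tpcds250", "tpcds1000")


def qualified_table(schema, table, catalog="lambda:tpcds"):
    """Return a quoted catalog.schema.table reference (hypothetical helper)."""
    if schema not in SCALE_FACTOR_SCHEMAS:
        raise ValueError(f"unknown scale factor database: {schema}")
    # The catalog name contains a colon, so it is wrapped in double quotes.
    return f'"{catalog}".{schema}.{table}'


print(qualified_table("tpcds10", "store_sales"))
# prints: "lambda:tpcds".tpcds10.store_sales
```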
Summary of tables
For a complete list of the test data tables and columns, run the SHOW TABLES or DESCRIBE TABLE queries. The following summary of tables is provided for convenience.
-
call_center
-
catalog_page
-
catalog_returns
-
catalog_sales
-
customer
-
customer_address
-
customer_demographics
-
date_dim
-
dbgen_version
-
household_demographics
-
income_band
-
inventory
-
item
-
promotion
-
reason
-
ship_mode
-
store
-
store_returns
-
store_sales
-
time_dim
-
warehouse
-
web_page
-
web_returns
-
web_sales
-
web_site
For TPC-DS queries that are compatible with this generated schema and data, see the athena-tpcds/src/main/resources/queries/ directory.
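For functional testing, it can be useful to touch every table once before running the larger benchmark queries. The following sketch generates one row-count query per table in the summary above. It assumes the "lambda:tpcds" catalog name used in the example query on this page; the function name row_count_queries and the row_count column alias are hypothetical. The sketch only builds the SQL text and leaves submission to whatever Athena client you use.

```python
# Tables from the summary list on this page.
TABLES = [
    "call_center", "catalog_page", "catalog_returns", "catalog_sales",
    "customer", "customer_address", "customer_demographics", "date_dim",
    "dbgen_version", "household_demographics", "income_band", "inventory",
    "item", "promotion", "reason", "ship_mode", "store", "store_returns",
    "store_sales", "time_dim", "warehouse", "web_page", "web_returns",
    "web_sales", "web_site",
]


def row_count_queries(schema="tpcds1", catalog="lambda:tpcds"):
    """Map each table name to a simple smoke-test query (hypothetical helper)."""
    return {
        table: f'SELECT count(*) AS row_count FROM "{catalog}".{schema}.{table}'
        for table in TABLES
    }


queries = row_count_queries()
for table, sql in queries.items():
    print(f"{table}: {sql}")
```

Starting at the smallest scale factor keeps these checks cheap, because the data is generated on demand for every query.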
Example query
The following SELECT query example queries the tpcds catalog for customer demographics in specific counties.
SELECT
  cd_gender,
  cd_marital_status,
  cd_education_status,
  count(*) cnt1,
  cd_purchase_estimate,
  count(*) cnt2,
  cd_credit_rating,
  count(*) cnt3,
  cd_dep_count,
  count(*) cnt4,
  cd_dep_employed_count,
  count(*) cnt5,
  cd_dep_college_count,
  count(*) cnt6
FROM
  "lambda:tpcds".tpcds1.customer c,
  "lambda:tpcds".tpcds1.customer_address ca,
  "lambda:tpcds".tpcds1.customer_demographics
WHERE c.c_current_addr_sk = ca.ca_address_sk
  AND ca_county IN ('Rush County', 'Toole County', 'Jefferson County', 'Dona Ana County', 'La Porte County')
  AND cd_demo_sk = c.c_current_cdemo_sk
  AND exists(
    SELECT *
    FROM "lambda:tpcds".tpcds1.store_sales, "lambda:tpcds".tpcds1.date_dim
    WHERE c.c_customer_sk = ss_customer_sk
      AND ss_sold_date_sk = d_date_sk
      AND d_year = 2002
      AND d_moy BETWEEN 1 AND 1 + 3)
  AND (
    exists(
      SELECT *
      FROM "lambda:tpcds".tpcds1.web_sales, "lambda:tpcds".tpcds1.date_dim
      WHERE c.c_customer_sk = ws_bill_customer_sk
        AND ws_sold_date_sk = d_date_sk
        AND d_year = 2002
        AND d_moy BETWEEN 1 AND 1 + 3)
    OR exists(
      SELECT *
      FROM "lambda:tpcds".tpcds1.catalog_sales, "lambda:tpcds".tpcds1.date_dim
      WHERE c.c_customer_sk = cs_ship_customer_sk
        AND cs_sold_date_sk = d_date_sk
        AND d_year = 2002
        AND d_moy BETWEEN 1 AND 1 + 3))
GROUP BY
  cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate,
  cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count
ORDER BY
  cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate,
  cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count
LIMIT 100
Required Permissions
For full details on the IAM policies that this connector requires, review the Policies section of the athena-tpcds.yaml file.
-
Amazon S3 write access – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
-
Athena GetQueryExecution – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
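The two permissions above can be pictured as an IAM policy document. The following is an illustrative sketch only, not the policy shipped with the connector; the authoritative definition is the Policies section of athena-tpcds.yaml. The statement IDs, the bucket name, the choice of s3:PutObject as the write action, and the scoping of the write statement to the spill prefix are assumptions.

```python
import json


def build_connector_policy(spill_bucket, spill_prefix="athena-federation-spill"):
    """Return an illustrative IAM policy document (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Write access so the connector can spill large results.
                "Sid": "SpillWrite",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{spill_bucket}/{spill_prefix}/*",
            },
            {
                # Lets the connector fail fast when the upstream query has terminated.
                "Sid": "FastFail",
                "Effect": "Allow",
                "Action": ["athena:GetQueryExecution"],
                "Resource": "*",
            },
        ],
    }


policy = build_connector_policy("amzn-s3-demo-bucket")
print(json.dumps(policy, indent=2))
```

Compare any hand-written policy against the deployed template before relying on it, because the template may grant additional actions that this sketch omits.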
Performance
The Athena TPC-DS connector attempts to parallelize queries based on the scale factor that you choose. Predicate pushdown is performed within the Lambda function.
License information
The Amazon Athena TPC-DS connector project is licensed under the Apache-2.0 License.
Additional resources
For additional information about this connector, visit the corresponding site