Work with data source connectors for Apache Spark
Some Athena data source connectors are available as Spark DSV2 connectors. The Spark DSV2 connector names have a -dsv2 suffix (for example, athena-dynamodb-dsv2).
Following are the currently available DSV2 connectors, their Spark .format() class names, and links to their corresponding Amazon Athena Federated Query documentation:
DSV2 connector | Spark .format() class name | Documentation |
---|---|---|
athena-cloudwatch-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.CloudwatchTableProvider | CloudWatch |
athena-cloudwatch-metrics-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.metrics.CloudwatchMetricsTableProvider | CloudWatch metrics |
athena-aws-cmdb-dsv2 | com.amazonaws.athena.connectors.dsv2.aws.cmdb.AwsCmdbTableProvider | CMDB |
athena-dynamodb-dsv2 | com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider | DynamoDB |
To download .jar files for the DSV2 connectors, visit the Amazon Athena Query Federation DSV2 releases page on GitHub and see the Assets section of the release version that you want.
Specify the jar to Spark
To use the Athena DSV2 connectors with Spark, you submit the .jar file for the connector to the Spark environment that you are using. The following sections describe specific cases.
Athena for Spark
For information on adding custom .jar files and custom configuration to Amazon Athena for Apache Spark, see Use Spark properties to specify custom configuration.
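You can also configure Spark properties when you start a session programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3), assuming that the StartSession EngineConfiguration accepts a SparkProperties map and that you have already uploaded the connector .jar to Amazon S3. The workgroup name and S3 path are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

# Start an Athena for Apache Spark session and pass the connector .jar
# through the spark.jars property. The workgroup name and S3 path are
# hypothetical placeholders.
response = athena.start_session(
    WorkGroup="my-spark-workgroup",
    EngineConfiguration={
        "MaxConcurrentDpus": 20,
        "SparkProperties": {
            "spark.jars": "s3://amzn-s3-demo-bucket/jars/athena-dynamodb-dsv2-some_version.jar"
        },
    },
)
print(response["SessionId"])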
General Spark
To pass in the connector .jar file to Spark, use the spark-submit command and specify the .jar file in the --jars option, as in the following example:
spark-submit \
    --deploy-mode cluster \
    --jars https://github.com/awslabs/aws-athena-query-federation-dsv2/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar
Amazon EMR Spark
To run a spark-submit command with the --jars parameter on Amazon EMR, you must add a step to your Amazon EMR Spark cluster. For details on how to use spark-submit on Amazon EMR, see Add a Spark step in the Amazon EMR Release Guide.
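As a sketch of what such a step can look like, the following AWS SDK for Python (Boto3) code adds a spark-submit step through command-runner.jar. The cluster ID, job script location, and connector version are hypothetical placeholders.

import boto3

emr = boto3.client("emr")

# GitHub.com release URL for the connector .jar; the version is a
# hypothetical placeholder.
jar_url = (
    "https://github.com/awslabs/aws-athena-query-federation-dsv2/"
    "releases/download/some_version/athena-dynamodb-dsv2-some_version.jar"
)

# Add a spark-submit step to a running Amazon EMR cluster. The cluster ID
# and script path are hypothetical placeholders.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "Spark job with Athena DSV2 connector",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--jars", jar_url,
                    "s3://amzn-s3-demo-bucket/scripts/my_spark_job.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])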
AWS Glue ETL Spark
For AWS Glue ETL, you can pass in the .jar file's GitHub.com URL to the --extra-jars argument of the aws glue start-job-run command. The AWS Glue documentation describes the --extra-jars parameter as taking an Amazon S3 path, but the parameter can also take an HTTPS URL. For more information, see Job parameter reference in the AWS Glue Developer Guide.
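For example, the following AWS SDK for Python (Boto3) sketch starts a job run and passes the connector URL through the --extra-jars argument. The job name is a hypothetical placeholder; the equivalent aws glue start-job-run CLI command takes the same arguments.

import boto3

glue = boto3.client("glue")

# Start a run of an existing AWS Glue ETL job, passing the connector's
# GitHub.com release URL through --extra-jars. The job name and version
# are hypothetical placeholders.
response = glue.start_job_run(
    JobName="my-dsv2-etl-job",
    Arguments={
        "--extra-jars": (
            "https://github.com/awslabs/aws-athena-query-federation-dsv2/"
            "releases/download/some_version/athena-dynamodb-dsv2-some_version.jar"
        )
    },
)
print(response["JobRunId"])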
Query the connector on Spark
To submit the equivalent of your existing Athena federated query on Apache Spark, use the spark.sql() function. For example, suppose you have the following Athena query that you want to use on Apache Spark.
SELECT somecola, somecolb, somecolc
FROM ddb_datasource.some_schema_or_glue_database.some_ddb_or_glue_table
WHERE somecola > 1
To perform the same query on Spark using the Amazon Athena DynamoDB DSV2 connector, use the following code:
dynamoDf = (spark.read
    .option("athena.connectors.schema", "some_schema_or_glue_database")
    .option("athena.connectors.table", "some_ddb_or_glue_table")
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider")
    .load())

dynamoDf.createOrReplaceTempView("ddb_spark_table")

spark.sql('''
SELECT somecola, somecolb, somecolc
FROM ddb_spark_table
WHERE somecola > 1
''')
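Because the connector returns an ordinary Spark DataFrame, you can also express the same filter with the DataFrame API instead of spark.sql(), as in the following equivalent sketch.

# Equivalent DataFrame API form of the query above.
result = (dynamoDf
    .select("somecola", "somecolb", "somecolc")
    .where("somecola > 1"))
result.show()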
Specify parameters
The DSV2 versions of the Athena data source connectors use the same parameters as the corresponding Athena data source connectors. For parameter information, refer to the documentation for the corresponding Athena data source connector.
In your PySpark code, use the following syntax to configure your parameters.
spark.read.option("athena.connectors.conf.parameter", "value")
For example, the following code sets the Amazon Athena DynamoDB connector disable_projection_and_casing parameter to always.
dynamoDf = (spark.read
    .option("athena.connectors.schema", "some_schema_or_glue_database")
    .option("athena.connectors.table", "some_ddb_or_glue_table")
    .option("athena.connectors.conf.disable_projection_and_casing", "always")
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider")
    .load())