Amazon Redshift integration for Apache Spark
This integration provides a Spark connector you can use to build Apache Spark applications that read data from and write data to Amazon Redshift and Amazon Redshift Serverless. These applications don't compromise on performance or on the transactional consistency of the data. The integration is automatically included in Amazon EMR and AWS Glue, so you can immediately run Apache Spark jobs that access and load data into Amazon Redshift as part of your data ingestion and transformation pipelines.
Currently, you can use Spark versions 3.3.0, 3.3.1, 3.3.2, and 3.4.0 with this integration.
This integration provides the following:
- AWS Identity and Access Management (IAM) authentication. For more information, see Identity and access management in Amazon Redshift.
- Predicate and query pushdown to improve performance.
- Amazon Redshift data types.
- Connectivity to Amazon Redshift and Amazon Redshift Serverless.
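As a sketch of what a read path using these capabilities might look like: the helper below assembles the option map the connector commonly takes, and the commented-out tail shows how it would be passed to a live SparkSession. The cluster endpoint, database, table, bucket, and role ARN are hypothetical placeholders, and the data source name assumes the community spark-redshift connector packaging used with Amazon EMR.

```python
# Sketch: assemble read options for the Spark connector.
# All endpoint, bucket, table, and ARN values are placeholders.

def redshift_read_options(jdbc_url, table, tempdir, iam_role):
    """Build the option map for reading a Redshift table through the connector."""
    return {
        "url": jdbc_url,           # JDBC endpoint of the cluster or Serverless workgroup
        "dbtable": table,          # source table; a "query" option enables query pushdown instead
        "tempdir": tempdir,        # S3 staging location the connector uses for data transfer
        "aws_iam_role": iam_role,  # IAM role used for Redshift authentication
    }

opts = redshift_read_options(
    "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
    "public.sales",
    "s3://example-bucket/spark-temp/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)

# With a live SparkSession, the read would look like:
# df = (spark.read
#         .format("io.github.spark_redshift_community.spark.redshift")
#         .options(**opts)
#         .load())
```

Passing a query through the connector (rather than filtering the loaded DataFrame) lets predicate and query pushdown do the filtering inside Redshift before data reaches Spark.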
Considerations and limitations when using the Spark connector
- The tempdir URI points to an Amazon S3 location. This temp directory isn't cleaned up automatically and can add additional cost. We recommend using Amazon S3 lifecycle policies, described in the Amazon Simple Storage Service User Guide, to define retention rules for the Amazon S3 bucket.
- By default, copies between Amazon S3 and Redshift don't work if the S3 bucket and Redshift cluster are in different AWS Regions. To use separate AWS Regions, set the tempdir_region parameter to the Region of the S3 bucket used for the tempdir.
- Cross-Region writes between S3 and Redshift don't work if you write Parquet data using the tempformat parameter.
- We recommend using Amazon S3 server-side encryption to encrypt the Amazon S3 buckets used.
- We recommend blocking public access to Amazon S3 buckets.
- We recommend that the Amazon Redshift cluster not be publicly accessible.
- We recommend turning on Amazon Redshift audit logging.
- We recommend turning on Amazon Redshift at-rest encryption.
- We recommend turning on SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.
- We recommend passing an IAM role with the aws_iam_role parameter for Amazon Redshift authentication.
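For the tempdir cleanup consideration above, an S3 lifecycle rule can expire staged objects automatically instead of leaving them to accrue cost. A minimal sketch: the helper builds a lifecycle configuration expiring objects under the tempdir prefix, and the commented tail shows how it could be applied with boto3. The bucket name and prefix are hypothetical placeholders; adjust the retention window to your pipeline's needs.

```python
# Sketch: a lifecycle rule that expires Spark tempdir staging objects.
# Bucket name and prefix below are placeholders for your tempdir location.

def tempdir_expiration_rule(prefix, days):
    """Build an S3 lifecycle configuration expiring objects under `prefix` after `days` days."""
    return {
        "Rules": [
            {
                "ID": "ExpireSparkTempdir",
                "Filter": {"Prefix": prefix},   # only objects under the tempdir prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},   # delete staged files after this many days
            }
        ]
    }

config = tempdir_expiration_rule("spark-temp/", 7)

# With boto3 and valid AWS credentials, the rule would be applied like this:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-spark-temp-bucket",
#     LifecycleConfiguration=config,
# )
```

Scoping the rule with a Prefix filter keeps the expiration limited to the connector's staging objects, so other data in the same bucket is unaffected.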