Considerations and limitations when using the Spark connector - Amazon EMR

  • We recommend that you turn on SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.

  • We recommend that you manage the credentials for the Amazon Redshift cluster in AWS Secrets Manager. For an example, see Using AWS Secrets Manager to retrieve credentials for connecting to Amazon Redshift.

  • We recommend that you authenticate to Amazon Redshift by passing an IAM role in the aws_iam_role parameter.

  • The parameter tempformat currently doesn't support the Parquet format.

  • The tempdir URI points to an Amazon S3 location. This temporary directory isn't cleaned up automatically and can therefore incur additional cost.

  • Consider the following recommendations for Amazon Redshift:

  • Consider the following recommendations for Amazon S3:

    • We recommend that you block public access to Amazon S3 buckets.

    • We recommend that you use Amazon S3 server-side encryption to encrypt the Amazon S3 buckets used.

    • We recommend that you use Amazon S3 lifecycle policies to define the retention rules for the Amazon S3 bucket.

    • Amazon EMR always verifies code imported from open source into the image. For security, we don't support the following authentication methods from Spark to Amazon S3:

      • Setting AWS access keys in the hadoop-env configuration classification

      • Encoding AWS access keys in the tempdir URI
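
Several of the points above can be checked before a job runs. The following is a minimal sketch in plain Python, not part of the connector itself: the helper name build_redshift_options, the set of accepted temp formats, the option names url and dbtable, and the ssl=true URL property are assumptions for illustration and should be confirmed against the connector and JDBC driver versions in use. Only tempdir, tempformat, and aws_iam_role are taken from the list above.

```python
# Hypothetical helper: assembles connector options and rejects settings
# that the considerations above advise against.

# Assumed set of temp formats; Parquet is deliberately left out.
ALLOWED_TEMPFORMATS = {"AVRO", "CSV", "CSV GZIP"}


def build_redshift_options(jdbc_url, table, tempdir, iam_role_arn, tempformat="AVRO"):
    """Return a dict of options suitable for passing to a Spark reader or writer."""
    fmt = tempformat.upper()
    if fmt not in ALLOWED_TEMPFORMATS:
        raise ValueError(f"unsupported tempformat: {tempformat!r}")

    # tempdir must be an Amazon S3 location.
    if not tempdir.startswith(("s3://", "s3a://", "s3n://")):
        raise ValueError("tempdir must point to an Amazon S3 location")

    # Conservative check: any '@' in the URI is treated as embedded credentials.
    if "@" in tempdir:
        raise ValueError("do not encode AWS access keys in the tempdir URI")

    # Turn on SSL for the JDBC connection if the URL does not already request it.
    # Assumption: the driver accepts ssl=true as a URL query property.
    if "ssl=true" not in jdbc_url.lower():
        separator = "&" if "?" in jdbc_url else "?"
        jdbc_url = f"{jdbc_url}{separator}ssl=true"

    return {
        "url": jdbc_url,
        "dbtable": table,
        "tempdir": tempdir,
        "tempformat": fmt,
        "aws_iam_role": iam_role_arn,  # IAM role instead of access keys
    }


if __name__ == "__main__":
    # All values below are placeholders.
    options = build_redshift_options(
        "jdbc:redshift://example-host:5439/dev",
        "public.sales",
        "s3://example-bucket/spark-temp/",
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    )
    for key, value in options.items():
        print(f"{key} = {value}")
```

In a Spark job, the resulting dict could then be passed to a reader with options(**options) after selecting the connector's data source format. Because the temporary directory isn't cleaned up automatically, pointing tempdir at a prefix covered by an Amazon S3 lifecycle policy keeps leftover files from accumulating.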

For more information on using the connector and its supported parameters, see the following resources: