Query Amazon Redshift with PySpark (via Glue InteractiveSession) - Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio is in preview release and is subject to change.


To query Amazon Redshift through an AWS Glue interactive session using PySpark, write and run the following code in a notebook cell:

%%pyspark project.spark
import sys

import boto3
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Resolve the Redshift connection parameters passed to the session
args = getResolvedOptions(
    sys.argv,
    ["redshift_url", "redshift_iam_role", "redshift_tempdir", "redshift_jdbc_iam_url"],
)

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

table_name = "database.table"

# Read the table through the community Redshift connector: rows are
# unloaded to the S3 tempdir as Parquet, then loaded into a DataFrame
rs_read_df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", args["redshift_jdbc_iam_url"])
    .option("aws_iam_role", args["redshift_iam_role"])
    .option("tempdir", args["redshift_tempdir"])
    .option("unload_s3_format", "PARQUET")
    .option("dbtable", table_name)
    .load()
)

rs_read_df.show(5)
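If you only need a subset of rows or columns, the spark-redshift community connector also accepts a query option in place of dbtable, so the SQL runs in Redshift before the unload. The following is a sketch that reuses the spark session and args from the cell above; the table and column names in the SQL are illustrative:

```python
%%pyspark project.spark
# Sketch: push a SQL query down to Redshift instead of reading a whole
# table. Assumes `spark` and `args` already exist from the previous cell;
# the query text below is a hypothetical example.
sample_query = "SELECT col_a, col_b FROM database.table WHERE col_a > 100"

rs_query_df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", args["redshift_jdbc_iam_url"])
    .option("aws_iam_role", args["redshift_iam_role"])
    .option("tempdir", args["redshift_tempdir"])
    .option("unload_s3_format", "PARQUET")
    .option("query", sample_query)  # "query" replaces "dbtable"
    .load()
)
rs_query_df.show(5)
```

Note that query and dbtable are mutually exclusive: set exactly one of them per read.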