Supported streaming connectors

Streaming connectors let Spark streaming applications read data from a streaming source and write data to a streaming sink.

The following are the supported streaming connectors:

Amazon Kinesis Data Streams connector

The Amazon Kinesis Data Streams connector for Apache Spark enables you to build streaming applications and pipelines that consume data from and write data to Amazon Kinesis Data Streams. The connector supports enhanced fan-out consumption with a dedicated read throughput rate of up to 2 MB/second per shard. Amazon EMR Serverless releases 7.1.0 and higher include the connector by default, so you don't need to build or download any additional packages. For more information about the connector, see the spark-sql-kinesis-connector page on GitHub.

The following is an example of how to start a job run with the Kinesis Data Streams connector dependency.

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_EXECUTION_ROLE> \
    --mode 'STREAMING' \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<Kinesis-streaming-script>",
            "entryPointArguments": ["s3://<DOC-EXAMPLE-BUCKET-OUTPUT>/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=16g --conf spark.driver.cores=4 --conf spark.driver.memory=16g --conf spark.executor.instances=3 --jars /usr/share/aws/kinesis/spark-sql-kinesis/lib/spark-streaming-sql-kinesis-connector.jar"
        }
    }'
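The entry point script referenced above is a standard Spark structured streaming application. The following is a minimal PySpark sketch of such a script, assuming the option names (kinesis.region, kinesis.streamName, and so on) documented in the spark-sql-kinesis-connector README; the stream name, Region, and S3 paths are placeholders, and you should verify the options against the README for your connector version.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KinesisStreamingExample").getOrCreate()

# Read from the data stream with the "aws-kinesis" source that the
# connector provides. Verify these option names against the README
# for the connector version you use.
df = (
    spark.readStream.format("aws-kinesis")
    .option("kinesis.region", "us-east-1")
    .option("kinesis.streamName", "my-example-stream")
    .option("kinesis.consumerType", "GetRecords")
    .option("kinesis.endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
    .option("kinesis.startingposition", "LATEST")
    .load()
)

# Write the record payloads to Amazon S3. Streaming queries require a
# checkpoint location.
query = (
    df.selectExpr("CAST(data AS STRING) AS payload")
    .writeStream.format("json")
    .option("path", "s3://DOC-EXAMPLE-BUCKET-OUTPUT/output")
    .option("checkpointLocation", "s3://DOC-EXAMPLE-BUCKET-OUTPUT/checkpoint")
    .start()
)
query.awaitTermination()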

To connect to Kinesis Data Streams, you must configure the EMR Serverless application with VPC access and either use a VPC endpoint for private access or a NAT gateway for public access. For more information, see Configuring VPC access. You must also make sure that your job runtime role has the necessary read and write permissions to access the required data streams. To learn more about how to configure a job runtime role, see Job runtime roles for Amazon EMR Serverless. For a full list of the required permissions, see the spark-sql-kinesis-connector page on GitHub.
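As a sketch, a runtime role policy that lets a job read from a single stream might include a statement like the following. The account ID, Region, and stream name are placeholders, and enhanced fan-out consumption requires additional actions that are listed on the GitHub page.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadKinesisStream",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:DescribeStreamSummary",
                "kinesis:GetRecords",
                "kinesis:GetShardIterator",
                "kinesis:ListShards"
            ],
            "Resource": "arn:aws:kinesis:us-east-1:111122223333:stream/my-example-stream"
        }
    ]
}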

Apache Kafka connector

The Apache Kafka connector for Spark structured streaming is an open-source connector from the Spark community and is available in a Maven repository. The connector enables Spark structured streaming applications to read data from and write data to self-managed Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK). For more information about the connector, see the Structured Streaming + Kafka Integration Guide in the Apache Spark documentation.

The following example demonstrates how to include the Kafka connector in your job run request.

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_EXECUTION_ROLE> \
    --mode 'STREAMING' \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<Kafka-streaming-script>",
            "entryPointArguments": ["s3://<DOC-EXAMPLE-BUCKET-OUTPUT>/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=16g --conf spark.driver.cores=4 --conf spark.driver.memory=16g --conf spark.executor.instances=3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<KAFKA_CONNECTOR_VERSION>"
        }
    }'
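On the script side, you use the connector through the standard kafka source described in the integration guide. The following minimal PySpark sketch reads a topic and writes the records to Amazon S3; the bootstrap servers, topic name, and S3 paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamingExample").getOrCreate()

# Subscribe to a topic with the standard options from the
# Structured Streaming + Kafka Integration Guide.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092,broker-2:9092")
    .option("subscribe", "my-example-topic")
    .load()
)

# Kafka keys and values arrive as binary columns, so cast them before
# writing them out.
query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream.format("json")
    .option("path", "s3://DOC-EXAMPLE-BUCKET-OUTPUT/output")
    .option("checkpointLocation", "s3://DOC-EXAMPLE-BUCKET-OUTPUT/checkpoint")
    .start()
)
query.awaitTermination()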

The Apache Kafka connector version depends on your EMR Serverless release version and its corresponding Spark version. To find the correct Kafka connector version, see the Structured Streaming + Kafka Integration Guide.
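For example, a release that ships Spark 3.5.1 built with Scala 2.12 would typically use org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1, because the connector version tracks the Spark version; confirm the exact coordinates for your release in the integration guide.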

To use Amazon Managed Streaming for Apache Kafka with IAM authentication, you must include an additional dependency so that the Kafka connector can connect to Amazon MSK with IAM. For more information, see the aws-msk-iam-auth repository on GitHub. You must also make sure that the job runtime role has the necessary IAM permissions. The following example demonstrates how to use the connector with IAM authentication.

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_EXECUTION_ROLE> \
    --mode 'STREAMING' \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<Kafka-streaming-script>",
            "entryPointArguments": ["s3://<DOC-EXAMPLE-BUCKET-OUTPUT>/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=16g --conf spark.driver.cores=4 --conf spark.driver.memory=16g --conf spark.executor.instances=3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<KAFKA_CONNECTOR_VERSION>,software.amazon.msk:aws-msk-iam-auth:<MSK_IAM_LIB_VERSION>"
        }
    }'
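With the IAM library on the classpath, the Kafka source also needs the SASL settings that the aws-msk-iam-auth documentation describes. The following sketch shows the read side only; the bootstrap server is a placeholder for your cluster's IAM endpoint (Amazon MSK uses port 9098 for IAM access from within the VPC), and the topic name is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MskIamExample").getOrCreate()

# SASL_SSL with the AWS_MSK_IAM mechanism delegates authentication to
# the job runtime role through the aws-msk-iam-auth callback handler.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers",
            "b-1.mycluster.kafka.us-east-1.amazonaws.com:9098")
    .option("subscribe", "my-example-topic")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "AWS_MSK_IAM")
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;")
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
    .load()
)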

To use the Kafka connector and the IAM authentication library from Amazon MSK, you must configure the EMR Serverless application with VPC access. Your subnets must have internet access through a NAT gateway so that the job can download the Maven dependencies. For more information, see Configuring VPC access. The subnets must also have network connectivity to the Kafka cluster, regardless of whether the cluster is self-managed or you use Amazon Managed Streaming for Apache Kafka.