Setting up cluster access permissions with IAM roles for service accounts (IRSA)
This section uses an example to demonstrate how to configure a Kubernetes service account to assume an AWS Identity and Access Management role. Pods that use the service account can then access any AWS service that the role has permissions to access.
The following example runs a Spark application that counts the words in a file in Amazon S3. To give the application access to that file, you set up IAM roles for service accounts (IRSA) to authenticate and authorize the Kubernetes service accounts that it uses.
Note
This example uses the spark-operator namespace both for the Spark operator and as the namespace where you submit the Spark application.
Prerequisites
Before you try the example on this page, complete the following prerequisites:
- Save your favorite poem in a text file named poem.txt, and upload the file to your S3 bucket. The Spark application that you create on this page will read the contents of the text file. For more information on uploading files to S3, see Upload an object to your bucket in the Amazon Simple Storage Service User Guide.
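For example, assuming an existing bucket named my-pod-bucket (the placeholder bucket name that this page's examples use), a minimal upload with the AWS CLI looks like this:

# Upload the poem to the bucket that the Spark application will read from
aws s3 cp poem.txt s3://my-pod-bucket/poem.txt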
Configure a Kubernetes service account to assume an IAM role
Use the following steps to configure a Kubernetes service account to assume an IAM role that pods can use to access AWS services that the role has permissions to access.
- After completing the Prerequisites, use the AWS Command Line Interface to create an example-policy.json file that allows read-only access to the file that you uploaded to Amazon S3. Replace my-pod-bucket with the name of your bucket:

cat >example-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-pod-bucket",
        "arn:aws:s3:::my-pod-bucket/*"
      ]
    }
  ]
}
EOF
- Then, create an IAM policy named example-policy:

aws iam create-policy --policy-name example-policy --policy-document file://example-policy.json
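The create-policy call prints the new policy's ARN, which you need in the next step. If you'd rather capture the ARN in a shell variable than copy it by hand, one optional approach (not part of the original steps) uses the AWS CLI's global --query and --output flags:

# Look up the ARN of the policy created above
POLICY_ARN=$(aws iam list-policies \
  --query "Policies[?PolicyName=='example-policy'].Arn" --output text)
echo "${POLICY_ARN}"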
- Next, create an IAM role named example-role and associate it with a Kubernetes service account for the Spark driver. Replace 111122223333 with your AWS account ID:

eksctl create iamserviceaccount --name driver-account-sa --namespace spark-operator \
    --cluster my-cluster --role-name "example-role" \
    --attach-policy-arn arn:aws:iam::111122223333:policy/example-policy --approve
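To confirm that eksctl wired the service account to the role, you can inspect the service account; eksctl records the role in the standard IRSA annotation:

kubectl get serviceaccount driver-account-sa -n spark-operator -o yaml
# Expect an annotation like:
#   eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-role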
- Create a YAML file with the cluster role binding that the Spark driver service account requires:

cat >spark-rbac.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: driver-account-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: driver-account-sa
    namespace: spark-operator
EOF
- Apply the cluster role binding configuration:

kubectl apply -f spark-rbac.yaml
The kubectl command should confirm the creation of the service account and the cluster role binding:
serviceaccount/driver-account-sa created
clusterrolebinding.rbac.authorization.k8s.io/spark-role configured
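As an optional sanity check (not part of the original steps), you can confirm that the binding gives the driver service account permission to manage pods by impersonating it:

kubectl auth can-i create pods -n spark-operator \
  --as=system:serviceaccount:spark-operator:driver-account-sa
# Expected output: yes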
Running an application from the Spark operator
After you configure the Kubernetes service account, you can run a Spark application that counts the number of words in the text file that you uploaded as part of the Prerequisites.
- Create a new file word-count.yaml with a SparkApplication definition for your word-count application. The definition references the driver-account-sa service account that you configured earlier; replace my-pod-bucket with the name of your bucket:

cat >word-count.yaml <<EOF
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: word-count
  namespace: spark-operator
spec:
  type: Java
  mode: cluster
  image: "895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.10.0:latest"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.JavaWordCount
  mainApplicationFile: local:///usr/lib/spark/examples/jars/spark-examples.jar
  arguments:
    - s3://my-pod-bucket/poem.txt
  hadoopConf:
    # EMRFS filesystem
    fs.s3.customAWSCredentialsProvider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
    fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem
    fs.AbstractFileSystem.s3.impl: org.apache.hadoop.fs.s3.EMRFSDelegate
    fs.s3.buffer.dir: /mnt/s3
    fs.s3.getObject.initialSocketTimeoutMilliseconds: "2000"
    mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem: "2"
    mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem: "true"
  sparkConf:
    # Required for EMR Runtime
    spark.driver.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
    spark.driver.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
    spark.executor.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
    spark.executor.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
  sparkVersion: "3.3.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.3.1
    serviceAccount: driver-account-sa
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.3.1
EOF
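If you want to validate the manifest before creating anything, an optional server-side dry run catches schema errors in the SparkApplication definition (this assumes the Spark operator and its CRDs are already installed, per the Prerequisites):

kubectl apply --dry-run=server -f word-count.yaml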
- Submit the Spark application:

kubectl apply -f word-count.yaml

The kubectl command should return confirmation for the SparkApplication object called word-count (kubectl prints created on the first apply and configured on subsequent ones):

sparkapplication.sparkoperator.k8s.io/word-count configured
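To check the application's state at a glance, you can also query the custom resource directly (the columns shown depend on your operator version):

kubectl get sparkapplication word-count -n spark-operator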
- To check events for the SparkApplication object, run the following command:

kubectl describe sparkapplication word-count -n spark-operator

The kubectl command should return the description of the SparkApplication with the events:

Events:
  Type     Reason                               Age                     From            Message
  ----     ------                               ----                    ----            -------
  Normal   SparkApplicationSpecUpdateProcessed  3m2s (x2 over 17h)      spark-operator  Successfully processed spec update for SparkApplication word-count
  Warning  SparkApplicationPendingRerun         3m2s (x2 over 17h)      spark-operator  SparkApplication word-count is pending rerun
  Normal   SparkApplicationSubmitted            2m58s (x2 over 17h)     spark-operator  SparkApplication word-count was submitted successfully
  Normal   SparkDriverRunning                   2m56s (x2 over 17h)     spark-operator  Driver word-count-driver is running
  Normal   SparkExecutorPending                 2m50s                   spark-operator  Executor [javawordcount-fdd1698807392c66-exec-1] is pending
  Normal   SparkExecutorRunning                 2m48s                   spark-operator  Executor [javawordcount-fdd1698807392c66-exec-1] is running
  Normal   SparkDriverCompleted                 2m31s (x2 over 17h)     spark-operator  Driver word-count-driver completed
  Normal   SparkApplicationCompleted            2m31s (x2 over 17h)     spark-operator  SparkApplication word-count completed
  Normal   SparkExecutorCompleted               2m31s (x2 over 2m31s)   spark-operator  Executor [javawordcount-fdd1698807392c66-exec-1] completed
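While the application runs, you can also watch the driver and executor pods directly; their names follow the patterns shown in the events above:

kubectl get pods -n spark-operator --watch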
The application now counts the words in your S3 file. To find the result, check the log files for your driver pod:
kubectl logs pod/word-count-driver -n spark-operator
The kubectl command should return the contents of the log file with the results of your word-count application.
INFO DAGScheduler: Job 0 finished: collect at JavaWordCount.java:53, took 5.146519 s
Software: 1
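When you finish with the example, you can remove the resources it created. The following is a cleanup sketch, assuming the same placeholder names used above (replace 111122223333 with your AWS account ID):

kubectl delete -f word-count.yaml
kubectl delete -f spark-rbac.yaml
# Deletes the IAM role and the Kubernetes service account that eksctl created
eksctl delete iamserviceaccount --name driver-account-sa \
  --namespace spark-operator --cluster my-cluster
# Delete the policy after the role no longer references it
aws iam delete-policy --policy-arn arn:aws:iam::111122223333:policy/example-policy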
For more information on how to submit applications to Spark through the Spark operator, see Using a SparkApplication.