Prerequisites Run a job Verify functionality

Use vertical autoscaling with the Spark operator for Amazon EMR on EKS

Starting with Amazon EMR 7.0, you can use Amazon EMR on EKS vertical autoscaling to simplify resource management. It automatically tunes memory and CPU resources to adapt to the needs of the workload that you provide for Amazon EMR Spark applications. For more information, see Using vertical autoscaling with Amazon EMR Spark jobs.

This section describes how to configure the Spark operator to use vertical autoscaling.

Prerequisites

Before you configure monitoring, be sure to complete the following setup tasks:

Complete the steps in Setting up the Spark operator for Amazon EMR on EKS.
(optional) If you previously installed an older version of the Spark operator, delete the SparkApplication/ScheduledSparkApplication CRD.
```
kubectl delete crd sparkApplication
kubectl delete crd scheduledSparkApplication
```
Complete the steps in Install the Spark operator. In step 3, add the following line to the installation command to allow webhooks for the operator:
```
--set webhook.enable=true
```
Complete the steps in Setting up vertical autoscaling for Amazon EMR on EKS.

Give access to the files in your Amazon S3 location:

Annotate your driver and operator service account with the JobExecutionRole that has S3 permissions.


kubectl annotate serviceaccount -n spark-operator emr-containers-sa-spark eks.amazonaws.com/role-arn=JobExecutionRole
kubectl annotate serviceaccount -n spark-operator emr-containers-sa-spark-operator eks.amazonaws.com/role-arn=JobExecutionRole

Update the trust policy of your job execution role in that namespace.


aws emr-containers update-role-trust-policy \
--cluster-name cluster \
--namespace ${Namespace}\
--role-name iam_role_name_for_job_execution

Edit the IAM role trust policy of your job execution role and update the serviceaccount from emr-containers-sa-spark-*-*-xxxx to emr-containers-sa-*.


{
    "Effect": "Allow",
    "Principal": {
        "Federated": "OIDC-provider"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
        "StringLike": {
            "OIDC": "system:serviceaccount:${Namespace}:emr-containers-sa-*"
        }
    }
}

If you're using Amazon S3 as your file storage, add the following defaults to your yaml file.


hadoopConf:
# EMRFS filesystem
  fs.s3.customAWSCredentialsProvider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
  fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem
  fs.AbstractFileSystem.s3.impl: org.apache.hadoop.fs.s3.EMRFSDelegate
  fs.s3.buffer.dir: /mnt/s3
  fs.s3.getObject.initialSocketTimeoutMilliseconds: "2000"
  mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem: "2"
  mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem: "true"
sparkConf:
 # Required for EMR Runtime
 spark.driver.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
 spark.driver.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
 spark.executor.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
 spark.executor.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native

Run a job with vertical autoscaling on the Spark operator

Before you can run a Spark application with the Spark operator, you must complete the steps in Prerequisites.

To use vertical autoscaling with the Spark operator, add the following configuration to the driver for your Spark Application spec to turn on vertical autoscaling:


dynamicSizing:
  mode: Off
  signature: "my-signature"

This configuration enables vertical autoscaling and is a required signature configuration that lets you choose a signature for your job.

For more information on the configurations and parameter values, see Configuring vertical autoscaling for Amazon EMR on EKS. By default, your job submits in the monitoring-only Off mode of vertical autoscaling. This monitoring state lets you compute and view resource recommendations without performing autoscaling. For more information, see Vertical autoscaling modes.

The following is a sample SparkApplication definition file named spark-pi.yaml with the required configurations to use vertical autoscaling.


apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-operator
spec:
  type: Scala
  mode: cluster
  image: "895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-7.7.0:latest"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///usr/lib/spark/examples/jars/spark-examples.jar"
  sparkVersion: "3.4.1"
  dynamicSizing:
    mode: Off
    signature: "my-signature"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.4.1
    serviceAccount: emr-containers-sa-spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.4.1
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Now, submit the Spark application with the following command. This will also create a SparkApplication object named spark-pi:


kubectl apply -f spark-pi.yaml

For more information on submitting applications to Spark through the Spark operator, see Using a SparkApplication in the spark-on-k8s-operator documentation on GitHub.

Verifying the vertical autoscaling functionality

To verify that vertical autoscaling works correctly for the submitted job, use kubectl to get the verticalpodautoscaler custom resource and view your scaling recommendations.


kubectl get verticalpodautoscalers --all-namespaces \ 
-l=emr-containers.amazonaws.com/dynamic.sizing.signature=my-signature

The output from this query should resemble the following:


NAMESPACE        NAME                                                          MODE   CPU   MEM         PROVIDED   AGE
spark-operator   ds-p73j6mkosvc4xeb3gr7x4xol2bfcw5evqimzqojrlysvj3giozuq-vpa   Off          580026651   True       15m

If your output doesn't look similar or contains an error code, see Troubleshooting Amazon EMR on EKS vertical autoscaling for steps to help resolve the issue.

To remove the pods and applications, run the following command:


kubectl delete sparkapplication spark-pi

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Getting started

Uninstall