

# Configuring an application when working with EMR Serverless
<a name="application-capacity"></a>

With EMR Serverless, you can configure the applications that you use. For example, you can set the maximum capacity that an application can scale up to, configure pre-initialized capacity to keep drivers and workers ready to respond, and specify a common set of runtime and monitoring configurations at the application level. The following pages describe how to configure applications when you use EMR Serverless.

**Topics**
+ [Understanding application behavior in EMR Serverless](app-behavior.md)
+ [Pre-initialized capacity for working with an application in EMR Serverless](pre-init-capacity.md)
+ [Default application configuration for EMR Serverless](default-configs.md)

# Understanding application behavior in EMR Serverless
<a name="app-behavior"></a>

This section describes job submission behavior, capacity configuration for scaling, and worker configuration settings for EMR Serverless.

## Default application behavior
<a name="auto-start-stop"></a>

**Auto-start** — By default, an application is configured to auto-start when you submit a job. You can turn this feature off.

**Auto-stop** — By default, an application is configured to auto-stop after it has been idle for 15 minutes. When an application changes to the `STOPPED` state, it releases any configured pre-initialized capacity. You can modify the amount of idle time before an application auto-stops, or you can turn this feature off.
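As a sketch of how you might change these defaults with the AWS CLI (the application ID is a placeholder), the following `update-application` call keeps auto-start enabled and extends the idle timeout before auto-stop to 30 minutes:

```
# Sketch: keep auto-start on, auto-stop after 30 idle minutes.
# Replace your-application-id with a real application ID.
aws emr-serverless update-application \
  --application-id your-application-id \
  --auto-start-configuration '{"enabled": true}' \
  --auto-stop-configuration '{"enabled": true, "idleTimeoutMinutes": 30}'
```

You can only update these settings while the application is in the `CREATED` or `STOPPED` state.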

## Maximum capacity
<a name="max-capacity"></a>

You can configure the maximum capacity that an application can scale up to. You specify maximum capacity in terms of vCPUs, memory (GB), and disk (GB).

**Note**  
It is best practice to configure your maximum capacity to be proportional to your supported worker sizes by multiplying the number of workers by their sizes. For example, if you want to limit your application to 50 workers with 2 vCPUs, 16 GB for memory, and 20 GB for disk, set your maximum capacity to 100 vCPUs, 800 GB for memory, and 1000 GB for disk. 
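The proportionality rule above is simple multiplication. This shell sketch, using the hypothetical worker count and sizes from the example, computes the matching maximum capacity:

```shell
# Hypothetical sizing from the example above: 50 workers,
# each with 2 vCPU, 16 GB of memory, and 20 GB of disk.
WORKERS=50
MAX_CPU=$((WORKERS * 2))      # vCPU
MAX_MEMORY=$((WORKERS * 16))  # GB
MAX_DISK=$((WORKERS * 20))    # GB
echo "maximumCapacity: ${MAX_CPU}vCPU, ${MAX_MEMORY}GB memory, ${MAX_DISK}GB disk"
```

This prints `maximumCapacity: 100vCPU, 800GB memory, 1000GB disk`, the values from the example.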

## Supported worker configurations
<a name="worker-configs"></a>

The following table lists the supported worker configurations and sizes that you can specify for EMR Serverless. You can configure different sizes for drivers and executors based on the needs of your workload.


**Worker configurations and sizes**  

| CPU | Memory | Default ephemeral storage | 
| --- | --- | --- | 
|  1 vCPU  |  Minimum 2 GB, maximum 8 GB, in 1 GB increments  |  20 GB - 200 GB  | 
|  2 vCPU  |  Minimum 4 GB, maximum 16 GB, in 1 GB increments  |  20 GB - 200 GB  | 
|  4 vCPU  |  Minimum 8 GB, maximum 30 GB, in 1 GB increments  |  20 GB - 200 GB  | 
|  8 vCPU  |  Minimum 16 GB, maximum 60 GB, in 4 GB increments  |  20 GB - 200 GB  | 
|  16 vCPU  |  Minimum 32 GB, maximum 120 GB, in 8 GB increments  |  20 GB - 200 GB  | 

**CPU** — Each worker can have 1, 2, 4, 8, or 16 vCPUs.

**Memory** — Each worker has memory, specified in GB, within the limits listed in the earlier table. Spark jobs have a memory overhead, meaning that the memory they use is more than the specified container sizes. This overhead is specified with the properties `spark.driver.memoryOverhead` and `spark.executor.memoryOverhead`. The overhead has a default value of 10% of container memory, with a minimum of 384 MB. You should consider this overhead when you choose worker sizes. 

For example, if you choose 4 vCPUs for your worker instance and a pre-initialized memory capacity of 30 GB, then set a value of approximately 27 GB as the executor memory for your Spark job. This maximizes the utilization of your pre-initialized capacity: the job uses 27 GB of executor memory plus 10% of 27 GB (2.7 GB) of overhead, for a total of 29.7 GB.
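As a quick check of that arithmetic (the 10% overhead and 384 MB minimum are Spark's defaults; the 27 GB figure comes from the example above), a shell sketch:

```shell
# Verify that 27 GB of executor memory plus the default Spark memory
# overhead (10%, with a 384 MB minimum) fits inside a 30 GB worker.
EXECUTOR_MEM_GB=27
TOTAL_GB=$(awk -v m="$EXECUTOR_MEM_GB" 'BEGIN {
  o = m * 0.10              # spark.executor.memoryOverhead default: 10%
  if (o < 0.384) o = 0.384  # ...with a 384 MB floor
  printf "%.1f", m + o
}')
echo "total container memory: ${TOTAL_GB} GB (worker size: 30 GB)"
```

The result is 29.7 GB, which fits the 30 GB worker; requesting 30 GB of executor memory outright would push the total over the worker size once the overhead is added.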

**Disk** — You can configure each worker with temporary storage disks with a minimum size of 20 GB and a maximum of 200 GB. You only pay for additional storage beyond 20 GB that you configure per worker.

# Pre-initialized capacity for working with an application in EMR Serverless
<a name="pre-init-capacity"></a>

EMR Serverless provides an optional feature that keeps drivers and workers pre-initialized and ready to respond in seconds. This effectively creates a warm pool of workers for an application. The feature is called *pre-initialized capacity*. To configure it, set the `initialCapacity` parameter of an application to the number of workers that you want to pre-initialize. With pre-initialized worker capacity, jobs start immediately, which is ideal for iterative applications and time-sensitive jobs.

Pre-initialized capacity keeps a warm pool of workers ready so that jobs and sessions can start up in seconds. You pay for provisioned pre-initialized workers even when the application is idle, so we suggest enabling this feature for use cases that benefit from the fast start-up time, and sizing it for optimal utilization of resources. EMR Serverless applications automatically stop when idle. We suggest keeping auto-stop on when you use pre-initialized workers to avoid unexpected charges.

When you submit a job, if workers from `initialCapacity` are available, the job uses those resources to start its run. If those workers are already in use by other jobs, or if the job needs more resources than available from `initialCapacity`, then the application requests and gets additional workers, up to the maximum limits on resources set for the application. When a job finishes its run, it releases the workers that it used, and the number of resources available for the application returns to `initialCapacity`. An application maintains the `initialCapacity` of resources even after jobs finish their runs. The application releases excess resources beyond `initialCapacity` when the jobs no longer need them to run.

Pre-initialized capacity is available and ready to use when the application has started, and becomes inactive when the application is stopped. An application moves to the `STARTED` state only after the requested pre-initialized capacity has been created and is ready to use. For the whole time that the application is in the `STARTED` state, EMR Serverless keeps the pre-initialized capacity available for use, or in use, by jobs or interactive workloads. The feature restores capacity for released or failed containers to maintain the number of workers that the `initialCapacity` parameter specifies. The state of an application with no pre-initialized capacity can immediately change from `CREATED` to `STARTED`.

 You can configure the application to release pre-initialized capacity if it isn't used for a certain period of time, with a default of 15 minutes. A stopped application starts automatically when you submit a new job. You can set these automatic start and stop configurations when you create the application, or change them when the application is in a `CREATED` or `STOPPED` state.

You can change the `initialCapacity` counts and specify compute configurations, such as CPU, memory, and disk, for each worker. Because you can't make partial modifications, specify all compute configurations when you change values. You can only change configurations when the application is in the `CREATED` or `STOPPED` state.
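Because partial modifications aren't allowed, a change to `initialCapacity` restates the full worker configuration. A sketch with `update-application` follows; the application ID, counts, and sizes are placeholders, and the command only succeeds while the application is in the `CREATED` or `STOPPED` state:

```
# Sketch: restate the complete worker configuration when changing counts.
# Replace your-application-id with a real application ID.
aws emr-serverless update-application \
  --application-id your-application-id \
  --initial-capacity '{
    "DRIVER": {
        "workerCount": 2,
        "workerConfiguration": {
            "cpu": "2vCPU",
            "memory": "4GB",
            "disk": "20GB"
        }
    },
    "EXECUTOR": {
        "workerCount": 10,
        "workerConfiguration": {
            "cpu": "4vCPU",
            "memory": "8GB",
            "disk": "20GB"
        }
    }
  }'
```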

**Note**  
To optimize your application’s use of resources, we suggest aligning your container sizes with your pre-initialized capacity worker sizes. For example, if you configure your Spark executor size to 2 CPUs and your memory to 8 GB, but your pre-initialized capacity worker size is 4 CPUs with 16 GB of memory, then the Spark executors only use half of the workers’ resources when they are assigned to this job.

## Customizing pre-initialized capacity for Spark and Hive
<a name="customizing-capacity"></a>

You can further customize pre-initialized capacity for workloads that run on specific big data frameworks. For example, when a workload runs on Apache Spark, specify how many workers start as drivers and how many start as executors. Similarly, when you use Apache Hive, specify how many workers start as Hive drivers, and how many should run Tez tasks.

**Configuring an application running Apache Hive with pre-initialized capacity**

The following API request creates an application running Apache Hive based on Amazon EMR release emr-6.6.0. The application starts with 5 pre-initialized Hive drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized Tez task workers, each with 4 vCPU and 8 GB of memory. When Hive queries run on this application, they first use the pre-initialized workers and start executing immediately. If all of the pre-initialized workers are busy and more Hive jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory. You can optionally omit capacity for either the `DRIVER` or the `TEZ_TASK` worker.

```
aws emr-serverless create-application \
  --type "HIVE" \
  --name my-application-name \
  --release-label emr-6.6.0 \
  --initial-capacity '{
    "DRIVER": {
        "workerCount": 5,
        "workerConfiguration": {
            "cpu": "2vCPU",
            "memory": "4GB"
        }
    },
    "TEZ_TASK": {
        "workerCount": 50,
        "workerConfiguration": {
            "cpu": "4vCPU",
            "memory": "8GB"
        }
    }
  }' \
  --maximum-capacity '{
    "cpu": "400vCPU",
    "memory": "1024GB"
  }'
```

**Configuring an application running Apache Spark with pre-initialized capacity**

The following API request creates an application that runs Apache Spark 3.2.0 based on Amazon EMR release 6.6.0. The application starts with 5 pre-initialized Spark drivers, each with 2 vCPU and 4 GB of memory, and 50 pre-initialized executors, each with 4 vCPU and 8 GB of memory. When Spark jobs run on this application, they first use the pre-initialized workers and start to execute immediately. If all of the pre-initialized workers are busy and more Spark jobs are submitted, the application can scale to a total of 400 vCPU and 1024 GB of memory. You can optionally omit capacity for either the `DRIVER` or the `EXECUTOR`.

**Note**  
Spark adds a configurable memory overhead, with a default value of 10%, to the memory requested for drivers and executors. For jobs to use pre-initialized workers, the initial capacity memory configuration must be greater than the memory that the job requests plus the overhead.

```
aws emr-serverless create-application \
  --type "SPARK" \
  --name my-application-name \
  --release-label emr-6.6.0 \
  --initial-capacity '{
    "DRIVER": {
        "workerCount": 5,
        "workerConfiguration": {
            "cpu": "2vCPU",
            "memory": "4GB"
        }
    },
    "EXECUTOR": {
        "workerCount": 50,
        "workerConfiguration": {
            "cpu": "4vCPU",
            "memory": "8GB"
        }
    }
  }' \
  --maximum-capacity '{
    "cpu": "400vCPU",
    "memory": "1024GB"
  }'
```

# Default application configuration for EMR Serverless
<a name="default-configs"></a>

You can specify a common set of runtime and monitoring configurations at the application level for all the jobs that you submit under the same application. This reduces the overhead of submitting the same configurations with each job.

You can modify the configurations at the following points in time:
+ [Declare application-level configurations when you create or update an application.](#default-configs-declare)
+ [Override default configurations during a job run.](#default-configs-override)

The following sections provide more details and an example for further context.

## Declaring configurations at the application level
<a name="default-configs-declare"></a>

You can specify application-level logging and runtime configuration properties for the jobs that you submit under the application.

**`monitoringConfiguration`**  
To specify the log configurations for jobs that you submit with the application, use the [MonitoringConfiguration](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_MonitoringConfiguration.html) field. For more information on logging for EMR Serverless, refer to [Storing logs](logging.md).

**`runtimeConfiguration`**  
To specify runtime configuration properties such as `spark-defaults`, provide a configuration object in the `runtimeConfiguration` field. This affects the default configurations for all the jobs that you submit with the application. For more information, refer to [Hive configuration override parameter](jobs-hive.md#hive-defaults-configurationOverrides) and [Spark configuration override parameter](jobs-spark.md#spark-defaults-configurationOverrides).  
Available configuration classifications vary by EMR Serverless release. For example, the custom Log4j classifications `spark-driver-log4j2` and `spark-executor-log4j2` are only available with releases 6.8.0 and higher. For a list of application-specific properties, refer to [Spark job properties](jobs-spark.md#spark-defaults) and [Hive job properties](jobs-hive.md#hive-defaults).  
You can also configure [Apache Log4j2 properties](log4j2.md), [AWS Secrets Manager for data protection](secrets-manager.md), and [Java 17 runtime](using-java-runtime.md) at the application level.  
To pass Secrets Manager secrets at the application level, attach the following policy to users and roles that need to create or update EMR Serverless applications with secrets.    

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SecretsManagerPolicy",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret-name-123abc"
      ]
    },
    {
      "Sid": "KMSDecryptPolicy",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": [
        "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
      ]
    }
  ]
}
```
For more information on creating custom policies for secrets, refer to [Permissions policy examples for AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html) in the *AWS Secrets Manager User Guide*.

**Note**  
The `runtimeConfiguration` that you specify at the application level maps to `applicationConfiguration` in the [StartJobRun](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) API.

### Example declaration
<a name="default-configs-declare-example"></a>

The following example shows how to declare default configurations with `create-application`.

```
aws emr-serverless create-application \
    --release-label release-version  \
    --type SPARK \
    --name my-application-name \
    --runtime-configuration '[
        {
            "classification": "spark-defaults",
            "properties": {
                "spark.driver.cores": "4",
                "spark.executor.cores": "2",
                "spark.driver.memory": "8G",
                "spark.executor.memory": "8G",
                "spark.executor.instances": "2",
                "spark.hadoop.javax.jdo.option.ConnectionDriverName":"org.mariadb.jdbc.Driver",
                "spark.hadoop.javax.jdo.option.ConnectionURL":"jdbc:mysql://db-host:db-port/db-name",
                "spark.hadoop.javax.jdo.option.ConnectionUserName":"connection-user-name",
                "spark.hadoop.javax.jdo.option.ConnectionPassword": "EMR.secret@SecretID"
            }
        },
        {
            "classification": "spark-driver-log4j2",
            "properties": {
                "rootLogger.level":"error", 
                "logger.IdentifierForClass.name": "classpathForSettingLogger",
                "logger.IdentifierForClass.level": "info"
            }
        }
    ]' \
    --monitoring-configuration '{
        "s3MonitoringConfiguration": {
            "logUri": "s3://amzn-s3-demo-logging-bucket/logs/app-level"
        },
        "managedPersistenceMonitoringConfiguration": {
            "enabled": false
        }
    }'
```

## Overriding configurations during a job run
<a name="default-configs-override"></a>

You can specify configuration overrides for the application configuration and the monitoring configuration with the [StartJobRun](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) API. EMR Serverless then merges the configurations that you specify at the application level and the job level to determine the configurations for the job run.

The granularity level when the merge occurs is as follows:
+ **[applicationConfiguration](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_ConfigurationOverrides.html#emrserverless-Type-ConfigurationOverrides-applicationConfiguration)** - Classification type, for example `spark-defaults`.
+ **[monitoringConfiguration](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_ConfigurationOverrides.html#emrserverless-Type-ConfigurationOverrides-monitoringConfiguration)** - Configuration type, for example `s3MonitoringConfiguration`.

**Note**  
Configurations that you provide at [StartJobRun](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) supersede the configurations that you provide at the application level.

For more information on priority rankings, refer to [Hive configuration override parameter](jobs-hive.md#hive-defaults-configurationOverrides) and [Spark configuration override parameter](jobs-spark.md#spark-defaults-configurationOverrides).

When you start a job, any configuration that you don't specify is inherited from the application. When you declare configurations at the job level, you can perform the following operations:
+ **Override an existing configuration** - Provide the same configuration parameter in the `StartJobRun` request with your override values. 
+ **Add an additional configuration** - Add the new configuration parameter in the `StartJobRun` request with the values that you want to specify.
+ **Remove an existing configuration** - To remove an application *runtime configuration*, provide the key for the configuration that you want to remove, and pass an empty declaration `{}` for the configuration. We don't recommend removing any classifications that contain parameters that are required for a job run. For example, if you try to remove the [required properties for a Hive job](https://docs.aws.amazon.com/), the job will fail.

  To remove an application *monitoring configuration*, use the appropriate method for the relevant configuration type:
  + **`cloudWatchLoggingConfiguration`** - To remove `cloudWatchLogging`, pass the enabled flag as `false`. 
  + **`managedPersistenceMonitoringConfiguration`** - To remove managed persistence settings and fall back to the default enabled state, pass an empty declaration `{}` for the configuration. 
  + **`s3MonitoringConfiguration`** - To remove `s3MonitoringConfiguration`, pass an empty declaration `{}` for the configuration.

### Example override
<a name="default-configs-override-example"></a>

The following example shows different operations you can perform during job submission at `start-job-run`.

```
aws emr-serverless start-job-run \
    --application-id your-application-id \
    --execution-role-arn your-job-role-arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py", 
            "entryPointArguments": ["s3://amzn-s3-demo-destination-bucket1/wordcount_output"]
        }
    }' \
    --configuration-overrides '{
        "applicationConfiguration": [ 
            {
                // Override existing configuration for spark-defaults in the application
                "classification": "spark-defaults", 
                "properties": {
                    "spark.driver.cores": "2",
                    "spark.executor.cores": "1",
                    "spark.driver.memory": "4G",
                    "spark.executor.memory": "4G"
                }
            },
            {
                // Add configuration for spark-executor-log4j2
                "classification": "spark-executor-log4j2",
                "properties": {
                    "rootLogger.level": "error", 
                    "logger.IdentifierForClass.name": "classpathForSettingLogger",
                    "logger.IdentifierForClass.level": "info"
                }
            },
            {
                // Remove existing configuration for spark-driver-log4j2 from the application
                "classification": "spark-driver-log4j2",
                "properties": {}
            }
        ],
        "monitoringConfiguration": {
            "managedPersistenceMonitoringConfiguration": {
                // Override existing configuration for managed persistence
                "enabled": true
            },
            "s3MonitoringConfiguration": {
                // Remove configuration of S3 monitoring
            },
            "cloudWatchLoggingConfiguration": {
                // Add configuration for CloudWatch logging
                "enabled": true
            }
        }
    }'
```

At the time of job execution, the following classifications and configurations will apply based on the priority override ranking described in [Hive configuration override parameter](jobs-hive.md#hive-defaults-configurationOverrides) and [Spark configuration override parameter](jobs-spark.md#spark-defaults-configurationOverrides).
+ The classification `spark-defaults` will be updated with the properties specified at the job level. Only the properties included in `StartJobRun` are considered for this classification.
+ The classification `spark-executor-log4j2` will be added in the existing list of classifications.
+ The classification `spark-driver-log4j2` will be removed.
+ The configurations for `managedPersistenceMonitoringConfiguration` will be updated with configurations at job level.
+ The configurations for `s3MonitoringConfiguration` will be removed.
+ The configurations for `cloudWatchLoggingConfiguration` will be added to existing monitoring configurations.
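Putting those rules together, the effective configuration for this job run would look approximately like the following. This is a sketch derived from the two examples above, not actual API output:

```
{
    "applicationConfiguration": [
        {
            "classification": "spark-defaults",
            "properties": {
                "spark.driver.cores": "2",
                "spark.executor.cores": "1",
                "spark.driver.memory": "4G",
                "spark.executor.memory": "4G"
            }
        },
        {
            "classification": "spark-executor-log4j2",
            "properties": {
                "rootLogger.level": "error",
                "logger.IdentifierForClass.name": "classpathForSettingLogger",
                "logger.IdentifierForClass.level": "info"
            }
        }
    ],
    "monitoringConfiguration": {
        "managedPersistenceMonitoringConfiguration": {
            "enabled": true
        },
        "cloudWatchLoggingConfiguration": {
            "enabled": true
        }
    }
}
```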