

# Troubleshooting AWS Batch
<a name="troubleshooting"></a>

You might need to troubleshoot issues that are related to your compute environments, job queues, job definitions, or jobs. This chapter describes how to troubleshoot and resolve such issues in your AWS Batch environment.

AWS Batch uses IAM policies, roles, and permissions, and runs on Amazon EC2, Amazon ECS, AWS Fargate, and Amazon Elastic Kubernetes Service infrastructure. To troubleshoot issues that are related to these services, see the following:
+ [Troubleshooting IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot.html) in the *IAM User Guide*
+ [Amazon ECS troubleshooting](https://docs.aws.amazon.com/AmazonECS/latest/userguide/troubleshooting.html) in the *Amazon Elastic Container Service Developer Guide*
+ [Amazon EKS troubleshooting](https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html) in the *Amazon EKS User Guide*
+ [Troubleshoot EC2 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-troubleshoot.html) in the *Amazon EC2 User Guide*

**Contents**
+ [AWS Batch](batch-troubleshooting.md)
  + [Optimal instance type configuration to receive automatic instance family updates](optimal-default-instance-troubleshooting.md)
  + [`INVALID` compute environment](invalid_compute_environment.md)
    + [Incorrect role name or ARN](invalid_compute_environment.md#invalid_service_role_arn)
    + [Repair an `INVALID` compute environment](invalid_compute_environment.md#repairing_invalid_compute_environment)
  + [Jobs stuck in a `RUNNABLE` status](job_stuck_in_runnable.md)
  + [Spot Instances not tagged on creation](spot-instance-no-tag.md)
  + [Spot Instances not scaling down](spot-fleet-not-authorized.md)
    + [Attach **AmazonEC2SpotFleetTaggingRole** managed policy to your Spot Fleet role in the AWS Management Console](spot-fleet-not-authorized.md#spot-fleet-not-authorized-console)
    + [Attach **AmazonEC2SpotFleetTaggingRole** managed policy to your Spot Fleet role with the AWS CLI](spot-fleet-not-authorized.md#spot-fleet-not-authorized-cli)
  + [Can't retrieve Secrets Manager secrets](troubleshooting-cant-specify-secrets.md)
  + [Can't override job definition resource requirements](override-resource-requirements.md)
  + [Error message when you update the `desiredvCpus` setting](error-desired-vcpus-update.md)
+ [AWS Batch on Amazon EKS](batch-eks-troubleshooting.md)
  + [`INVALID` compute environment](batch_eks_invalid_compute_environment.md)
    + [Unsupported Kubernetes version](batch_eks_invalid_compute_environment.md#invalid_kubernetes_version)
    + [Instance profile doesn't exist](batch_eks_invalid_compute_environment.md#instance_profile_not_exist)
    + [Invalid Kubernetes namespace](batch_eks_invalid_compute_environment.md#invalid_kubernetes_namespace)
    + [Deleted compute environment](batch_eks_invalid_compute_environment.md#deleted_compute_environment)
    + [Nodes don't join the Amazon EKS cluster](batch_eks_invalid_compute_environment.md#batch_eks_node_not_join_cluster)
  + [AWS Batch on Amazon EKS job is stuck in `RUNNABLE` status](batch_eks_job_stuck_in_runnable.md)
  + [AWS Batch on Amazon EKS job is stuck in `STARTING` status](batch-eks-job-stuck-in-starting.md)
    + [Scenario: Persisted Volume Claim Attach or Mount Failure](batch-eks-job-stuck-in-starting.md#batch-eks-job-stuck-in-starting-scenario)
  + [Verify that the `aws-auth ConfigMap` is configured correctly](verify-configmap-config.md)
  + [RBAC permissions or bindings aren't configured properly](batch_eks_rbac.md)

# AWS Batch
<a name="batch-troubleshooting"></a>

Review the following topics for troubleshooting processes and potential solutions to common issues that you might encounter when using AWS Batch.

**Topics**
+ [Optimal instance type configuration to receive automatic instance family updates](optimal-default-instance-troubleshooting.md)
+ [`INVALID` compute environment](invalid_compute_environment.md)
+ [Jobs stuck in a `RUNNABLE` status](job_stuck_in_runnable.md)
+ [Spot Instances not tagged on creation](spot-instance-no-tag.md)
+ [Spot Instances not scaling down](spot-fleet-not-authorized.md)
+ [Can't retrieve Secrets Manager secrets](troubleshooting-cant-specify-secrets.md)
+ [Can't override job definition resource requirements](override-resource-requirements.md)
+ [Error message when you update the `desiredvCpus` setting](error-desired-vcpus-update.md)

# Optimal instance type configuration to receive automatic instance family updates
<a name="optimal-default-instance-troubleshooting"></a>

**Note**  
Starting on November 1, 2025, the behavior of `optimal` will change to match `default_x86_64`. During this change, your instance families might be updated to a newer generation. You don't need to take any action for the upgrade to happen.

AWS Batch previously supported a single option in **instanceTypes**, `optimal`, to match the demand of your job queues. We've introduced two new instance type options: `default_x86_64` and `default_arm64`. If you make no instance type selection, AWS Batch uses `default_x86_64`. These new options automatically select cost-effective instance types across different families and generations based on your job queue requirements, allowing you to get your workloads running quickly.

As sufficient capacity of new instance types becomes available in an AWS Region, the corresponding default pool is automatically updated with the new instance types. The existing `optimal` option continues to be supported and is not being deprecated, because it's backed by the same underlying default pools and will provide updated instances going forward. If you're using `optimal`, no action is needed on your part.

However, be aware that only `ENABLED` and `VALID` compute environments (CEs) are automatically updated with new instance types. Any `DISABLED` or `INVALID` CEs receive updates after they're re-enabled and return to a `VALID` state.
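As an illustration, a compute environment that opts into the new default pool might specify the option in its `computeResources`. The following is a hypothetical fragment; required fields such as subnets, security groups, and the instance role are omitted:

```json
"computeResources": {
    "type": "EC2",
    "instanceTypes": ["default_x86_64"],
    "minvCpus": 0,
    "maxvCpus": 256
}
```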

# `INVALID` compute environment
<a name="invalid_compute_environment"></a>

It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an `INVALID` state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.

**Important**  
AWS Batch creates and manages multiple AWS resources on your behalf and within your account, including Amazon EC2 Launch Templates, Amazon EC2 Auto Scaling Groups, Amazon EC2 Spot Fleets, and Amazon ECS Clusters. These managed resources are configured specifically to ensure optimal AWS Batch operation. Manually modifying these Batch-managed resources, unless explicitly stated in AWS Batch documentation, may result in unexpected behavior, such as an `INVALID` compute environment, suboptimal instance scaling, delayed workload processing, or unexpected costs. The AWS Batch service can't deterministically support these manual modifications. Always use the supported Batch APIs or the Batch console to manage your compute environments.

## Incorrect role name or ARN
<a name="invalid_service_role_arn"></a>

The most common cause for a compute environment to enter an `INVALID` state is that the AWS Batch service role or the Amazon EC2 Spot Fleet role has an incorrect name or Amazon Resource Name (ARN). This issue is more common with compute environments that are created using the AWS CLI or the AWS SDKs. When you create a compute environment in the AWS Management Console, AWS Batch helps you choose the correct service or Spot Fleet roles.

However, suppose that you manually enter the name or ARN for an IAM resource in an AWS CLI command or your SDK code, and you enter it incorrectly. In this case, AWS Batch can't validate the string. Instead, AWS Batch must accept the bad value and attempt to create the environment. If AWS Batch fails to create the environment, the environment moves to an `INVALID` state, and you see the following errors.

For an invalid service role:

`CLIENT_ERROR - Not authorized to perform sts:AssumeRole (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: dc0e2d28-2e99-11e7-b372-7fcc6fb65fe7)`

For an invalid Spot Fleet role:

`CLIENT_ERROR - Parameter: SpotFleetRequestConfig.IamFleetRole is invalid. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidSpotFleetRequestConfig; Request ID: 331205f0-5ae3-4cea-bac4-897769639f8d) Parameter: SpotFleetRequestConfig.IamFleetRole is invalid`

One common cause for this issue is specifying only the name of an IAM role in the AWS CLI or the AWS SDKs, instead of the full Amazon Resource Name (ARN). Depending on how you created the role, the ARN might contain an `aws-service-role` path prefix. For example, if you manually create the AWS Batch service role using the procedures in [Using service-linked roles for AWS Batch](using-service-linked-roles.md), your service role ARN might look like the following.

```
arn:aws:iam::123456789012:role/AWSBatchServiceRole
```

However, if you created the service role as part of the console first-run wizard, your service role ARN might look like the following.

```
arn:aws:iam::123456789012:role/aws-service-role/AWSBatchServiceRole
```

This issue can also occur if you attach the AWS Batch service-level policy (`AWSBatchServiceRole`) to a non-service role. For example, you may receive an error message that resembles the following in this scenario: 

```
CLIENT_ERROR - User: arn:aws:sts::account_number:assumed-role/batch-replacement-role/aws-batch is not 
   authorized to perform: action on resource ...
```

To resolve this issue, do one of the following.
+ Use an empty string for the service role when you create the AWS Batch compute environment.
+ Specify the service role in the following format: `arn:aws:iam::account_number:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch`.

When you specify only the name of an IAM role using the AWS CLI or the AWS SDKs, AWS Batch assumes that the ARN doesn't use the `aws-service-role` path prefix. Because of this, we recommend that you specify the full ARN for your IAM roles when you create compute environments.

To repair a compute environment that's misconfigured this way, see [Repair an `INVALID` compute environment](#repairing_invalid_compute_environment).

## Repair an `INVALID` compute environment
<a name="repairing_invalid_compute_environment"></a>

When you have a compute environment in an `INVALID` state, update it to repair the invalid parameter. For an [Incorrect role name or ARN](#invalid_service_role_arn), update the compute environment using the correct service role.

**To repair a misconfigured compute environment**

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. From the navigation bar, select the AWS Region to use.

1. In the navigation pane, choose **Compute environments**.

1. On the **Compute environments** page, select the radio button next to the compute environment to edit, and then choose **Edit**.

1. On the **Update compute environment** page, for **Service role**, choose the IAM role to use with your compute environment. The AWS Batch console only displays roles that have the correct trust relationship for compute environments.
**Tip**  
For directions on how to create a service-linked role, see [Using roles for AWS Batch](using-service-linked-roles-batch-general.md).

1. Choose **Save** to update your compute environment.
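If you prefer the AWS CLI, the same repair can be sketched with the `update-compute-environment` command. The compute environment name and account number here are placeholders:

```
$ aws batch update-compute-environment \
    --compute-environment my-compute-env \
    --service-role arn:aws:iam::123456789012:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch
```

After the update succeeds, the compute environment transitions back through `UPDATING` and should return to a `VALID` state.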

# Jobs stuck in a `RUNNABLE` status
<a name="job_stuck_in_runnable"></a>

Suppose that your compute environment contains compute resources, but your jobs don't progress beyond the `RUNNABLE` status. Then, it's likely that something is preventing the jobs from being placed on a compute resource, and your job queue is blocked. Here's how to tell whether your job is waiting its turn or is stuck and blocking the queue.

If AWS Batch detects that you have a `RUNNABLE` job at the head and blocking the queue, you'll receive a [Job queue blocked events](batch-job-queue-blocked-events.md) event from Amazon CloudWatch Events with the reason. The same reason is also updated into the `statusReason` field as a part of `[ListJobs](https://docs.aws.amazon.com/batch/latest/APIReference/API_ListJobs.html)` and `[DescribeJobs](https://docs.aws.amazon.com/batch/latest/APIReference/API_DescribeJobs.html)` API calls. 
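For example, you can inspect the `statusReason` of your `RUNNABLE` jobs with the AWS CLI; the job queue name below is a placeholder:

```
$ aws batch list-jobs \
    --job-queue my-job-queue \
    --job-status RUNNABLE \
    --query 'jobSummaryList[].[jobId,statusReason]' \
    --output table
```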

Optionally, you can configure the `jobStateTimeLimitActions` parameter through the `[CreateJobQueue](https://docs.aws.amazon.com/batch/latest/APIReference/API_CreateJobQueue.html)` and `[UpdateJobQueue](https://docs.aws.amazon.com/batch/latest/APIReference/API_UpdateJobQueue.html)` API actions.

**Note**  
Currently, for job queues connected to Amazon ECS, Amazon EKS, or Fargate compute environments, the only action you can use with `jobStateTimeLimitActions.action` is to cancel a job.

The `jobStateTimeLimitActions` parameter is used to specify a set of actions that AWS Batch performs on jobs in a specific state. You can set a time threshold in seconds through the `maxTimeSeconds` field.

When a job has been in a `RUNNABLE` state with the defined `statusReason`, AWS Batch performs the action specified after `maxTimeSeconds` have elapsed.

For example, you can set the `jobStateTimeLimitActions` parameter to wait up to 4 hours for any job in the `RUNNABLE` state that's waiting for sufficient capacity to become available. To do this, set `statusReason` to `CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY` and `maxTimeSeconds` to 14400. After that time elapses, the job is canceled and the next job advances to the head of the job queue.
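That 4-hour configuration might be sketched as follows with the AWS CLI; the job queue name is a placeholder, and 14400 seconds equals 4 hours:

```
$ aws batch update-job-queue \
    --job-queue my-job-queue \
    --job-state-time-limit-actions \
        'reason=CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY,state=RUNNABLE,maxTimeSeconds=14400,action=CANCEL'
```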

The following are the reasons that AWS Batch provides when it detects that a job queue is blocked. This list provides the messages returned from the `ListJobs` and `DescribeJobs` API actions. These are also the values you can define for the `jobStateTimeLimitActions.reason` parameter.

1. **Reason:** All connected compute environments have insufficient capacity errors. AWS Batch detects when the requested Amazon EC2 instances experience insufficient capacity errors. Manually canceling the job allows the subsequent job to move to the head of the queue.
   + **`statusReason` message while the job is stuck:** `CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY - Service cannot fulfill the capacity requested for instance type [instanceTypeName]`
   + **`reason` used for `jobStateTimeLimitActions`:** `CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY`
   + **`statusReason` message after the job is canceled:** `Canceled by JobStateTimeLimit action due to reason: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY`

   **Note:**

   1. The AWS Batch service role requires the `autoscaling:DescribeScalingActivities` permission for this detection to work. If you use the [Using service-linked roles for AWS Batch](using-service-linked-roles.md) service-linked role (SLR) or the [AWS managed policy: **AWSBatchServiceRole** policy](security-iam-awsmanpol.md#security-iam-awsmanpol-AWSBatchServiceRolePolicy) managed policy, then you don't need to take any action, because their permission policies have already been updated.

   1. If you don't use the SLR or the managed policy, you must add the `autoscaling:DescribeScalingActivities` and `ec2:DescribeSpotFleetRequestHistory` permissions so that you can receive blocked job queue events and updated job statuses while in `RUNNABLE`. AWS Batch also needs these permissions to perform cancellation actions through the `jobStateTimeLimitActions` parameter, even when they're configured on the job queue.

   1. In the case of a multi-node parallel (MNP) job, if the attached high-priority Amazon EC2 compute environment experiences `insufficient capacity` errors, it blocks the queue even if a lower-priority compute environment doesn't experience this error.

1. **Reason:** All compute environments have a [maxvCpus](https://docs.aws.amazon.com/batch/latest/APIReference/API_ComputeResource.html#Batch-Type-ComputeResource-maxvCpus) parameter that is smaller than the job requirements. Canceling the job, either manually or by setting the `jobStateTimeLimitActions` parameter on `statusReason`, allows the subsequent job to move to the head of the queue. Optionally, you can increase the `maxvCpus` parameter of the primary compute environment to meet the needs of the blocked job.
   + **`statusReason` message while the job is stuck:** `MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE - CE(s) associated with the job queue cannot meet the CPU requirement of the job.`
   + **`reason` used for `jobStateTimeLimitActions`:** `MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE`
   + **`statusReason` message after the job is canceled:** `Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE`

1. **Reason:** None of the compute environments have instances that meet the job requirements. When a job requests resources, AWS Batch detects that no attached compute environment is able to accommodate the incoming job. Canceling the job, either manually or by setting the `jobStateTimeLimitActions` parameter on `statusReason`, allows the subsequent job to move to the head of the queue. Optionally, you can redefine the compute environment's allowed instance types to add the necessary job resources.
   + **`statusReason` message while the job is stuck:** `MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT - The job resource requirement (vCPU/memory/GPU) is higher than that can be met by the CE(s) attached to the job queue.`
   + **`reason` used for `jobStateTimeLimitActions`:** `MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT`
   + **`statusReason` message after the job is canceled:** `Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT`

1. **Reason:** All compute environments have service role issues. To resolve this, compare your service role permissions to the [AWS managed policies for AWS Batch](security-iam-awsmanpol.md) and address any gaps. Note: You can't configure a programmable action through the `jobStateTimeLimitActions` parameter to resolve this error.

   It's a best practice to use [service-linked roles for AWS Batch](using-service-linked-roles.md) to avoid similar errors.

   Canceling the job, either manually or by setting the `jobStateTimeLimitActions` parameter on `statusReason`, allows the subsequent job to move to the head of the queue. However, without resolving the underlying service role issues, the next job is likely to be blocked as well. It's best to manually investigate and resolve this issue.
   + **`statusReason` message while the job is stuck:** `MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS – Batch service role has a permission issue.`

1.  **Reason:** Your compute environment has an unsupported instance type configuration. This can occur when instance types are not available in your selected Availability Zones, or when your launch template or launch configuration contains settings incompatible with the specified instance types. To resolve this, verify that your instance types are supported in your specified AWS Region and Availability Zones, check that your launch template settings are compatible with your instance types, and consider updating to newer generation instance types. For more information about finding supported instance types, see [Finding an Amazon EC2 instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-discovery.html) in the *Amazon EC2 User Guide*.
   + **`statusReason` message while the job is stuck:** `MISCONFIGURATION:EC2_INSTANCE_CONFIGURATION_UNSUPPORTED - Your compute environment associated with this job queue has an unsupported instance type configuration.`

1. **Reason:** All compute environments are invalid. For more information, see [`INVALID` compute environment](invalid_compute_environment.md). Note: You can't configure a programmable action through the `jobStateTimeLimitActions` parameter to resolve this error. 
   + **`statusReason` message while the job is stuck:** `ACTION_REQUIRED - CE(s) associated with the job queue are invalid.`

1. **Reason:** AWS Batch has detected a blocked queue, but is unable to determine the reason. Note: You can't configure a programmable action through the `jobStateTimeLimitActions` parameter to resolve this error. For more information about troubleshooting, see [Why is my AWS Batch job stuck in RUNNABLE on AWS](https://repost.aws/knowledge-center/batch-job-stuck-runnable-status) in *re:Post*.
   + **`statusReason` message while the job is stuck:** `UNDETERMINED - Batch job is blocked, root cause is undetermined.`

If you didn't receive an event from CloudWatch Events, or you received the event with an undetermined reason, review the following common causes for this issue.

**The `awslogs` log driver isn't configured on your compute resources**  
AWS Batch jobs send their log information to CloudWatch Logs. To enable this, you must configure your compute resources to use the `awslogs` log driver. Suppose that you base your compute resource AMI on the Amazon ECS optimized AMI (or Amazon Linux). Then, this driver is registered by default with the `ecs-init` package. Now suppose that you use a different base AMI. Then, you must verify that the `awslogs` log driver is specified as an available log driver with the `ECS_AVAILABLE_LOGGING_DRIVERS` environment variable when the Amazon ECS container agent is started. For more information, see [Compute resource AMI specification](batch-ami-spec.md) and [Tutorial: Create a compute resource AMI](create-batch-ami.md).
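On a custom base AMI, enabling the driver might look like the following illustrative fragment in `/etc/ecs/ecs.config` before the agent starts:

```
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
```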

**Insufficient resources**  
If your job definitions specify more CPU or memory resources than your compute resources can allocate, then your jobs are never placed. For example, suppose that your job specifies 4 GiB of memory, but your compute resources have less than that available. Then the job can't be placed on those compute resources. In this case, you must reduce the specified memory in your job definition or add larger compute resources to your environment. Some memory is reserved for the Amazon ECS container agent and other critical system processes. For more information, see [Compute resource memory management](memory-management.md).

**No internet access for compute resources**  
Compute resources need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your compute resources having public IP addresses.  
For more information about interface VPC endpoints, see [Amazon ECS Interface VPC Endpoints (AWS PrivateLink)](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/vpc-endpoints.html) in the *Amazon Elastic Container Service Developer Guide*.  
If you do not have an interface VPC endpoint configured and your compute resources do not have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see [NAT gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the *Amazon VPC User Guide*. For more information, see [Create a VPC](create-a-vpc.md).

**Amazon EC2 instance limit reached**  
The number of Amazon EC2 instances that your account can launch in an AWS Region is determined by your EC2 instance quota. Certain instance types also have a per-instance-type quota. For more information about your account's Amazon EC2 instance quota including how to request a limit increase, see [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) in the *Amazon EC2 User Guide*.
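You can also inspect your current quota value with the Service Quotas CLI. The quota code `L-1216C47A` used here is an assumption that, at the time of writing, corresponds to Running On-Demand Standard instances; verify the code for the quota that applies to your instance types:

```
$ aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --query 'Quota.[QuotaName,Value]'
```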

**Amazon ECS container agent isn't installed**  
The Amazon ECS container agent must be installed on the Amazon Machine Image (AMI) to let AWS Batch run jobs. The Amazon ECS container agent is installed by default on Amazon ECS optimized AMIs. For more information about the Amazon ECS container agent, see [Amazon ECS container agent](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_agent.html) in the *Amazon Elastic Container Service Developer Guide*.

For more information, see [Why is my AWS Batch job stuck in `RUNNABLE` status?](https://aws.amazon.com/premiumsupport/knowledge-center/batch-job-stuck-runnable-status/) in *re:Post*.

# Spot Instances not tagged on creation
<a name="spot-instance-no-tag"></a>

Spot Instance tagging for AWS Batch compute resources is supported as of October 25, 2017. Before that date, the recommended IAM managed policy (`AmazonEC2SpotFleetRole`) for the Amazon EC2 Spot Fleet role didn't contain permissions to tag Spot Instances at launch. The new recommended IAM managed policy, `AmazonEC2SpotFleetTaggingRole`, supports tagging Spot Instances at launch.

To fix Spot Instance tagging on creation, use the following procedure to apply the current recommended IAM managed policy to your Amazon EC2 Spot Fleet role. That way, any future Spot Instances that are created with that role have permissions to apply instance tags when they're created.

**To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role**

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**, and choose your Amazon EC2 Spot Fleet role.

1. Choose **Attach policy**.

1. Select the **AmazonEC2SpotFleetTaggingRole** and choose **Attach policy**.

1. Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.

1. Select the **x** to the right of the **AmazonEC2SpotFleetRole** policy, and choose **Detach**.

# Spot Instances not scaling down
<a name="spot-fleet-not-authorized"></a>

AWS Batch introduced the **AWSServiceRoleForBatch** service-linked role on March 10, 2021. If no role is specified in the `serviceRole` parameter of the compute environment, this service-linked role is used as the service role. However, suppose that the service-linked role is used in an EC2 Spot compute environment, but the Spot role used doesn't include the **AmazonEC2SpotFleetTaggingRole** managed policy. Then, the Spot Instance doesn't scale down. As a result, you will receive an error with the following message: "You are not authorized to perform this operation." Use the following steps to update the spot fleet role that you use in the `spotIamFleetRole` parameter. For more information, see [Using service-linked roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/using-service-linked-roles.html) and [Creating a role to delegate permissions to an AWS Service](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html) in the *IAM User Guide*.

**Topics**
+ [Attach **AmazonEC2SpotFleetTaggingRole** managed policy to your Spot Fleet role in the AWS Management Console](#spot-fleet-not-authorized-console)
+ [Attach **AmazonEC2SpotFleetTaggingRole** managed policy to your Spot Fleet role with the AWS CLI](#spot-fleet-not-authorized-cli)

## Attach **AmazonEC2SpotFleetTaggingRole** managed policy to your Spot Fleet role in the AWS Management Console
<a name="spot-fleet-not-authorized-console"></a>

**To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role**

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**, and choose your Amazon EC2 Spot Fleet role.

1. Choose **Attach policy**.

1. Select the **AmazonEC2SpotFleetTaggingRole** and choose **Attach policy**.

1. Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.

1. Select the **x** to the right of the **AmazonEC2SpotFleetRole** policy, and choose **Detach**.

## Attach **AmazonEC2SpotFleetTaggingRole** managed policy to your Spot Fleet role with the AWS CLI
<a name="spot-fleet-not-authorized-cli"></a>

The example commands assume that your Amazon EC2 Spot Fleet role is named *AmazonEC2SpotFleetRole*. If your role uses a different name, adjust the commands to match.

**To attach the AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role**

1. To attach the **AmazonEC2SpotFleetTaggingRole** managed IAM policy to your *AmazonEC2SpotFleetRole* role, run the following command using the AWS CLI.

   ```
   $ aws iam attach-role-policy \
       --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole \
       --role-name AmazonEC2SpotFleetRole
   ```

1. To detach the **AmazonEC2SpotFleetRole** managed IAM policy from your *AmazonEC2SpotFleetRole* role, run the following command using the AWS CLI.

   ```
   $ aws iam detach-role-policy \
       --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetRole \
       --role-name AmazonEC2SpotFleetRole
   ```

# Can't retrieve Secrets Manager secrets
<a name="troubleshooting-cant-specify-secrets"></a>

If you use an AMI with an Amazon ECS agent that's earlier than version 1.16.0-1, you must set the Amazon ECS agent configuration variable `ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true` to use this feature. You can add it to the `/etc/ecs/ecs.config` file when you create a new container instance, or add it to an existing instance. If you add it to an existing instance, restart the ECS agent afterward. For more information, see [Amazon ECS Container Agent Configuration](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html) in the *Amazon Elastic Container Service Developer Guide*.
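On an existing instance, the variable might be appended and the agent restarted like this. This is a sketch; the restart command depends on your AMI, and this example assumes an Amazon Linux 2 based instance where the agent runs as the `ecs` systemd service:

```
$ echo "ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true" | sudo tee -a /etc/ecs/ecs.config
$ sudo systemctl restart ecs
```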

# Can't override job definition resource requirements
<a name="override-resource-requirements"></a>

The memory and vCPU overrides that are specified in the `memory` and `vcpus` members of the [containerOverrides](https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerOverrides.html) structure, which is passed to [SubmitJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html), can't override the memory and vCPU requirements that are specified in the [resourceRequirements](https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerProperties.html#Batch-Type-ContainerProperties-resourceRequirements) structure in the job definition.

If you try to override these resource requirements, you might see the following error message:

"This value was submitted in a deprecated key and may conflict with the value provided by the job definition's resource requirements."

To correct this, specify the memory and vCPU requirements in the [resourceRequirements](https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerOverrides.html#Batch-Type-ContainerOverrides-resourceRequirements) member of the [containerOverrides](https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerOverrides.html). For example, if your memory and vCPU overrides are specified in the following lines.

```
"containerOverrides": {
   "memory": 8192,
   "vcpus": 4
}
```

Change them to the following:

```
"containerOverrides": {
   "resourceRequirements": [
      {
         "type": "MEMORY",
         "value": "8192"
      },
      {
         "type": "VCPU",
         "value": "4"
      }
   ]
}
```

Make the same change to the memory and vCPU requirements that are specified in the [containerProperties](https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerProperties.html) object in the job definition. For example, suppose that your memory and vCPU requirements are specified in the following lines.

```
{
   "containerProperties": {
      "memory": 4096,
      "vcpus": 2
   }
}
```

Change them to the following:

```
"containerProperties": {
   "resourceRequirements": [
      {
         "type": "MEMORY",
         "value": "4096"
      },
      {
         "type": "VCPU",
         "value": "2"
      }
   ]
}
```
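
If you build job submissions programmatically, you can apply this translation in code. The following Python sketch converts the deprecated `memory` and `vcpus` keys into the equivalent `resourceRequirements` list. The helper name and structure are illustrative, not part of any AWS SDK.

```python
def to_resource_requirements(overrides):
    """Translate deprecated 'memory'/'vcpus' keys in a containerOverrides
    dict into the equivalent resourceRequirements list."""
    # Keep everything except the deprecated keys.
    result = {k: v for k, v in overrides.items() if k not in ("memory", "vcpus")}
    requirements = list(result.get("resourceRequirements", []))
    if "memory" in overrides:
        # resourceRequirements values are strings, in MiB for MEMORY.
        requirements.append({"type": "MEMORY", "value": str(overrides["memory"])})
    if "vcpus" in overrides:
        requirements.append({"type": "VCPU", "value": str(overrides["vcpus"])})
    if requirements:
        result["resourceRequirements"] = requirements
    return result
```

You can then pass the converted dictionary as the `containerOverrides` value in your `SubmitJob` request.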

# Error message when you update the `desiredvCpus` setting
<a name="error-desired-vcpus-update"></a>

You see the following error message when you use the AWS Batch API to update the desired vCPUs (`desiredvCpus`) setting.

`Manually scaling down compute environment is not supported. Disconnecting job queues from compute environment will cause it to scale-down to minvCpus`.

This issue occurs if the updated `desiredvCpus` value is less than the current `desiredvCpus` value. When you update the `desiredvCpus` value, both of the following must be true:
+ The `desiredvCpus` value must be between the `minvCpus` and `maxvCpus` values.
+ The updated `desiredvCpus` value must be greater than or equal to the current `desiredvCpus` value.
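
The two rules above can be expressed as a pre-flight check before you call the API. The following Python sketch is illustrative; the function and its messages aren't part of any AWS SDK.

```python
def validate_desired_vcpus(current, updated, min_vcpus, max_vcpus):
    """Check an updated desiredvCpus value against both rules.
    Returns a list of violation messages (empty if the update is valid)."""
    problems = []
    # Rule 1: the value must stay within the [minvCpus, maxvCpus] range.
    if not (min_vcpus <= updated <= max_vcpus):
        problems.append("desiredvCpus must be between minvCpus and maxvCpus")
    # Rule 2: manual scale-down isn't supported.
    if updated < current:
        problems.append("desiredvCpus must be >= the current desiredvCpus")
    return problems
```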

# AWS Batch on Amazon EKS
<a name="batch-eks-troubleshooting"></a>

**Topics**
+ [

# `INVALID` compute environment
](batch_eks_invalid_compute_environment.md)
+ [

# AWS Batch on Amazon EKS job is stuck in `RUNNABLE` status
](batch_eks_job_stuck_in_runnable.md)
+ [

# AWS Batch on Amazon EKS job is stuck in `STARTING` status
](batch-eks-job-stuck-in-starting.md)
+ [

# Verify that the `aws-auth ConfigMap` is configured correctly
](verify-configmap-config.md)
+ [

# RBAC permissions or bindings aren't configured properly
](batch_eks_rbac.md)

Review the following topics for troubleshooting steps and potential solutions to common issues that you might encounter when using AWS Batch on Amazon Elastic Kubernetes Service.

# `INVALID` compute environment
<a name="batch_eks_invalid_compute_environment"></a>

It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an `INVALID` state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.

## Unsupported Kubernetes version
<a name="invalid_kubernetes_version"></a>

You might see an error message that resembles the following when you use the `CreateComputeEnvironment` or `UpdateComputeEnvironment` API operation to create or update a compute environment. This issue occurs if you specify an unsupported Kubernetes version in `EC2Configuration`.

```
At least one imageKubernetesVersion in EC2Configuration is not supported.
```

To resolve this issue, delete the compute environment and then re-create it with a supported Kubernetes version. 

You can perform a minor version upgrade on your Amazon EKS cluster. For example, you can upgrade the cluster from `1.xx` to `1.yy` even if the minor version isn't supported. 

However, the compute environment status might change to `INVALID` after a major version update, for example, an upgrade from `1.xx` to `2.yy`. If AWS Batch doesn't support the major version, you see an error message that resembles the following.

```
reason=CLIENT_ERROR - ... EKS Cluster version [2.yy] is unsupported
```

To resolve this issue, specify a supported Kubernetes version when you use an API operation to create or update a compute environment.

AWS Batch on Amazon EKS currently supports the following Kubernetes versions:
+ `1.34`
+ `1.33`
+ `1.32`
+ `1.31`
+ `1.30`
+ `1.29`
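
Before you create or update a compute environment, you can compare your cluster's Kubernetes version (for example, the `version` field returned by the Amazon EKS `DescribeCluster` API operation) against this list. The following Python sketch is illustrative; the supported set changes over time, so treat the values as a snapshot of the list above.

```python
# Snapshot of the Kubernetes versions listed above; check the current
# documentation before relying on this set.
SUPPORTED_EKS_VERSIONS = {"1.29", "1.30", "1.31", "1.32", "1.33", "1.34"}

def is_supported(cluster_version):
    """Return True if the cluster's 'major.minor' Kubernetes version
    appears in the supported set."""
    return cluster_version in SUPPORTED_EKS_VERSIONS
```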

## Instance profile doesn't exist
<a name="instance_profile_not_exist"></a>

If the specified instance profile does not exist, the AWS Batch on Amazon EKS compute environment status is changed to `INVALID`. You see an error set in the `statusReason` parameter that resembles the following.

```
CLIENT_ERROR - Instance profile arn:aws:iam::...:instance-profile/<name> does not exist
```

To resolve this issue, specify or create a working instance profile. For more information, see [Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html) in the *Amazon EKS User Guide*.

## Invalid Kubernetes namespace
<a name="invalid_kubernetes_namespace"></a>

If AWS Batch on Amazon EKS can't validate the namespace for the compute environment, the compute environment status is changed to `INVALID`. For example, this issue can occur if the namespace doesn't exist. 

You see an error message set in the `statusReason` parameter that resembles the following.

```
CLIENT_ERROR - Unable to validate Kubernetes Namespace
```

This issue can occur if any of the following are true:
+ The Kubernetes namespace string in the `CreateComputeEnvironment` call doesn't exist. For more information, see [CreateComputeEnvironment](https://docs.aws.amazon.com/batch/latest/APIReference/API_CreateComputeEnvironment.html).
+ The required Role-Based Access Control (RBAC) permissions to manage the namespace are not configured correctly.
+ AWS Batch doesn't have access to the Amazon EKS Kubernetes API server endpoint. 

To resolve this issue, see [Verify that the `aws-auth ConfigMap` is configured correctly](verify-configmap-config.md). For more information, see [Getting started with AWS Batch on Amazon EKS](getting-started-eks.md).

## Deleted compute environment
<a name="deleted_compute_environment"></a>

Suppose that you delete an Amazon EKS cluster before you delete the attached AWS Batch on Amazon EKS compute environment. Then, the compute environment status is changed to `INVALID`. In this scenario, the compute environment doesn't work properly if you re-create the Amazon EKS cluster with the same name.

To resolve this issue, delete and then re-create the AWS Batch on Amazon EKS compute environment.

## Nodes don't join the Amazon EKS cluster
<a name="batch_eks_node_not_join_cluster"></a>

AWS Batch on Amazon EKS scales down a compute environment if it determines that not all nodes joined the Amazon EKS cluster. When AWS Batch on Amazon EKS scales down the compute environment, the compute environment status is changed to `INVALID`.

**Note**  
AWS Batch doesn't change the compute environment status immediately so that you can debug the issue.

You see an error message set in the `statusReason` parameter that resembles one of the following:

`Your compute environment has been INVALIDATED and scaled down because none of the instances joined the underlying ECS Cluster. Common issues preventing instances joining are the following: VPC/Subnet configuration preventing communication to ECS, incorrect Instance Profile policy preventing authorization to ECS, or customized AMI or LaunchTemplate configurations affecting ECS agent.`

`Your compute environment has been INVALIDATED and scaled down because none of the nodes joined the underlying Amazon EKS Cluster. Common issues preventing nodes joining are the following: networking configuration preventing communication to Amazon EKS Cluster, incorrect Amazon EKS Instance Profile or Kubernetes RBAC policy preventing authorization to Amazon EKS Cluster, customized AMI or LaunchTemplate configurations affecting Amazon EKS/Kubernetes node bootstrap.`

When using a default Amazon EKS AMI, the most common causes of this issue are the following:
+ The instance role isn't configured correctly. For more information, see [Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html) in the *Amazon EKS User Guide*.
+ The subnets aren't configured correctly. For more information, see [Amazon EKS VPC and subnet requirements and considerations](https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html) in the *Amazon EKS User Guide*.
+ The security group isn't configured correctly. For more information, see [Amazon EKS security group requirements and considerations](https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html) in the *Amazon EKS User Guide*.
**Note**  
You may also see an error notification in the Personal Health Dashboard (PHD).

# AWS Batch on Amazon EKS job is stuck in `RUNNABLE` status
<a name="batch_eks_job_stuck_in_runnable"></a>

An `aws-auth` `ConfigMap` is automatically created and applied to your cluster when you create a managed node group or when you create a node group using `eksctl`. It's initially created to allow nodes to join your cluster, but you also use the `aws-auth` `ConfigMap` to add role-based access control (RBAC) access for users and roles.

To verify that the `aws-auth` `ConfigMap` is configured correctly:

1. Retrieve the mapped roles in the `aws-auth` `ConfigMap`:

   ```
   $ kubectl get configmap -n kube-system aws-auth -o yaml
   ```

1. Verify that the `rolearn` is configured as follows.

   `rolearn: arn:aws:iam::aws_account_number:role/AWSServiceRoleForBatch`
**Note**  
You can also review the Amazon EKS control plane logs. For more information, see [Amazon EKS control plane logging](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) in the *Amazon EKS User Guide*.

To resolve an issue where a job is stuck in a `RUNNABLE` status, we recommend that you use `kubectl` to re-apply the manifest. For more information, see [Step 2: Prepare your Amazon EKS cluster for AWS Batch](getting-started-eks.md#getting-started-eks-step-1). Or, you can use `kubectl` to manually edit the `aws-auth` `ConfigMap`. For more information, see [Enabling IAM user and role access to your cluster](https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html) in the *Amazon EKS User Guide*.

# AWS Batch on Amazon EKS job is stuck in `STARTING` status
<a name="batch-eks-job-stuck-in-starting"></a>

A job may remain in `STARTING` status when its pod is stuck in `PENDING` with a `ContainerCreating` reason because of a long-running kubelet request (`pull`, `log`, `exec`, or `attach`). The job stays in this status until the pod startup issue is resolved or the job is terminated. In the qualifying scenario below, AWS Batch terminates the job on your behalf. Otherwise, you must terminate the job manually by using the [TerminateJob API](https://docs.aws.amazon.com/batch/latest/APIReference/API_TerminateJob.html).

To find the reason that a job is stuck in `STARTING`, use [Tutorial: Map a running job to a pod and a node](eks-jobs-map-running-job.md) to find the `podName`, and then describe the pod:

```
% kubectl describe pod aws-batch.000c8190-87df-31e7-8819-176fe017a24a -n my-aws-batch-namespace
Name:             aws-batch.000c8190-87df-31e7-8819-176fe017a24a
Namespace:        my-aws-batch-namespace
...
Containers:
  default:
...
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
...
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
...
Events:
  Type     Reason       Age    From     Message
  ----     ------       ----   ----     -------
  Warning  FailedMount  2m32s  kubelet  Unable to attach or mount volumes: ...
```

Consider configuring your EKS cluster to [Send control plane logs to CloudWatch Logs](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) for full visibility. 

## Scenario: Persistent Volume Claim Attach or Mount Failure
<a name="batch-eks-job-stuck-in-starting-scenario"></a>

Jobs that use persistent volume claims where the volume fails to attach or mount are candidates for termination. This failure can result from an incorrectly configured job definition. For more details, see [Create a single-node job definition on Amazon EKS resources](create-job-definition-eks.md).

# Verify that the `aws-auth ConfigMap` is configured correctly
<a name="verify-configmap-config"></a>

To verify that the `aws-auth` `ConfigMap` is configured correctly:

1. Retrieve the mapped roles in the `aws-auth` `ConfigMap`.

   ```
   $ kubectl get configmap -n kube-system aws-auth -o yaml
   ```

1. Verify that the `rolearn` is configured as follows.

   `rolearn: arn:aws:iam::aws_account_number:role/AWSServiceRoleForBatch`
**Note**  
The path `aws-service-role/batch.amazonaws.com/` has been removed from the ARN of the service-linked role because of an issue with the `aws-auth` configuration map. For more information, see [Roles with paths do not work when the path is included in their ARN in the aws-auth configmap](https://github.com/kubernetes-sigs/aws-iam-authenticator/issues/268).
**Note**  
You can also review the Amazon EKS control plane logs. For more information, see [Amazon EKS control plane logging](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) in the *Amazon EKS User Guide*.

To resolve an issue where a job is stuck in a `RUNNABLE` status, we recommend that you use `kubectl` to re-apply the manifest. For more information, see [Step 2: Prepare your Amazon EKS cluster for AWS Batch](getting-started-eks.md#getting-started-eks-step-1). Or, you can use `kubectl` to manually edit the `aws-auth` `ConfigMap`. For more information, see [Enabling IAM user and role access to your cluster](https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html) in the *Amazon EKS User Guide*.
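
The stripped-path form of the ARN described in the note above can be produced mechanically. The following Python sketch removes any path between `role/` and the role name so that the ARN matches what the `aws-auth` `ConfigMap` expects. The helper name is illustrative, not part of any AWS SDK.

```python
def strip_role_path(role_arn):
    """Rewrite an IAM role ARN so that any path between 'role/' and the
    role name is removed, which is the form the aws-auth ConfigMap expects."""
    prefix, sep, rest = role_arn.partition(":role/")
    if not sep:
        return role_arn  # not a role ARN; leave unchanged
    # Keep only the final path segment (the role name itself).
    role_name = rest.rsplit("/", 1)[-1]
    return prefix + sep + role_name
```

For example, this rewrites `arn:aws:iam::123456789012:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch` to `arn:aws:iam::123456789012:role/AWSServiceRoleForBatch` (the account number here is a placeholder).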

# RBAC permissions or bindings aren't configured properly
<a name="batch_eks_rbac"></a>

If you experience any RBAC permissions or binding issues, verify that the `aws-batch` Kubernetes role can access the Kubernetes namespace. In the following command, replace *namespace* with the name of your namespace:

```
$ kubectl get namespace namespace --as=aws-batch
```

```
$ kubectl auth can-i get ns --as=aws-batch
```

You can also use the **kubectl describe** command to view the authorizations for a cluster role or Kubernetes namespace.

```
$ kubectl describe clusterrole aws-batch-cluster-role
```

The following is example output.

```
Name:         aws-batch-cluster-role
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                                      Non-Resource URLs  Resource Names  Verbs
  ---------                                      -----------------  --------------  -----
  configmaps                                     []                 []              [get list watch]
  nodes                                          []                 []              [get list watch]
  pods                                           []                 []              [get list watch]
  daemonsets.apps                                []                 []              [get list watch]
  deployments.apps                               []                 []              [get list watch]
  replicasets.apps                               []                 []              [get list watch]
  statefulsets.apps                              []                 []              [get list watch]
  clusterrolebindings.rbac.authorization.k8s.io  []                 []              [get list]
  clusterroles.rbac.authorization.k8s.io         []                 []              [get list]
  namespaces                                     []                 []              [get]
  events                                         []                 []              [list]
```

```
$ kubectl describe role aws-batch-compute-environment-role -n my-aws-batch-namespace
```

The following is example output.

```
Name:         aws-batch-compute-environment-role
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                               Non-Resource URLs  Resource Names  Verbs
  ---------                               -----------------  --------------  -----
  pods                                    []                 []              [create get list watch delete patch]
  serviceaccounts                         []                 []              [get list]
  rolebindings.rbac.authorization.k8s.io  []                 []              [get list]
  roles.rbac.authorization.k8s.io         []                 []              [get list]
```

To resolve this issue, re-apply the RBAC permissions and `rolebinding` commands. For more information, see [Step 2: Prepare your Amazon EKS cluster for AWS Batch](getting-started-eks.md#getting-started-eks-step-1).