

# Troubleshooting Amazon Managed Workflows for Apache Airflow
<a name="troubleshooting"></a>

This chapter describes common issues and errors you can encounter when using Apache Airflow on Amazon Managed Workflows for Apache Airflow and recommended steps to resolve these errors.

**Contents**
+ [Troubleshooting: DAGs, Operators, Connections, and other issues](t-apache-airflow-202.md)
  + [Connections](t-apache-airflow-202.md#troubleshooting-conn-202)
    + [I can't connect to Secrets Manager](t-apache-airflow-202.md#access-secrets-manager)
    + [How do I configure `secretsmanager:ResourceTag/<tag-key>` secrets manager conditions or a resource restriction in my execution role policy?](t-apache-airflow-202.md#access-secrets-manager-condition-keys)
    + [I can't connect to Snowflake](t-apache-airflow-202.md#missing-snowflake)
    + [I can't find my connection in the Airflow UI](t-apache-airflow-202.md#connection-type-missing)
  + [Webserver](t-apache-airflow-202.md#troubleshooting-webserver-202)
    + [I get a 5xx error accessing the webserver](t-apache-airflow-202.md#5xx-webserver-202)
    + [I get a `The scheduler does not seem to be running` error](t-apache-airflow-202.md#error-scheduler-202)
  + [Tasks](t-apache-airflow-202.md#troubleshooting-tasks-202)
    + [I get my tasks stuck or not completing](t-apache-airflow-202.md#stranded-tasks-202)
    + [I get task failures without logs in Airflow v3](t-apache-airflow-202.md#failed-task-no-log)
  + [CLI](t-apache-airflow-202.md#troubleshooting-cli-202)
    + [I get a '503' error when triggering a DAG in the CLI](t-apache-airflow-202.md#cli-toomany-202)
    + [Why does the `dags backfill` Apache Airflow CLI command fail? Is there a workaround?](t-apache-airflow-202.md#troubleshooting-cli-backfill)
  + [Operators](t-apache-airflow-202.md#troubleshooting-operators-202)
    + [I received a `PermissionError: [Errno 13] Permission denied` error using the S3Transform operator](t-apache-airflow-202.md#op-s3-transform)
+ [Troubleshooting: Creating and updating an Amazon MWAA environment](t-create-update-environment.md)
  + [Updating `requirements.txt`](t-create-update-environment.md#troubleshooting-reqs)
    + [I specified a new version of my `requirements.txt` and it's taking more than 20 minutes to update my environment](t-create-update-environment.md#t-requirements)
  + [Plugins](t-create-update-environment.md#troubleshooting-plugins)
    + [Does Amazon MWAA support implementing custom UI?](t-create-update-environment.md#custom-ui)
  + [Create bucket](t-create-update-environment.md#troubleshooting-create-bucket)
    + [I can't select the option for S3 Block Public Access settings](t-create-update-environment.md#t-create-bucket)
  + [Create environment](t-create-update-environment.md#troubleshooting-create-environment)
    + [I tried to create an environment and it's stuck in the `Creating` state](t-create-update-environment.md#t-stuck-failure)
    + [I tried to create an environment but it displays the status as `Create failed`](t-create-update-environment.md#t-create-environ-failed)
    + [I tried to select a VPC and received a `Network Failure` error](t-create-update-environment.md#t-network-failure)
    + [I tried to create an environment and received a service, partition, or resource "must be passed" error](t-create-update-environment.md#t-service-partition)
    + [I tried to create an environment and it displays the status as `Available` but when I try to access the Airflow UI an `Empty Reply from Server` or `502 Bad Gateway` error is shown](t-create-update-environment.md#t-create-environ-empty-reply)
    + [I tried to create an environment and my user name is a bunch of random character names](t-create-update-environment.md#t-create-environ-random-un)
  + [Update environment](t-create-update-environment.md#troubleshooting-update-environment)
    + [I tried changing the environment class but the update failed](t-create-update-environment.md#t-rollback-billing-failure)
  + [Access environment](t-create-update-environment.md#troubleshooting-access-environment)
    + [I can't access the Apache Airflow UI](t-create-update-environment.md#t-no-access-airflow-ui)
+ [Troubleshooting: CloudWatch Logs and CloudTrail errors](t-cloudwatch-cloudtrail-logs.md)
  + [Logs](t-cloudwatch-cloudtrail-logs.md#troubleshooting-view-logs)
    + [I can't find my task logs, or I received a `Reading remote log from Cloudwatch log_group` error](t-cloudwatch-cloudtrail-logs.md#t-task-logs)
    + [Tasks are failing without any logs](t-cloudwatch-cloudtrail-logs.md#t-task-failing-no-logs)
    + [I get a `ResourceAlreadyExistsException` error in CloudTrail](t-cloudwatch-cloudtrail-logs.md#t-cloudtrail)
    + [I get an `Invalid request` error in CloudTrail](t-cloudwatch-cloudtrail-logs.md#t-cloudtrail-bucket)
    + [I get `Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory"` in Apache Airflow logs](t-cloudwatch-cloudtrail-logs.md#t-plugins-logs)
    + [I get psycopg2 'server closed the connection unexpectedly' in my scheduler logs](t-cloudwatch-cloudtrail-logs.md#scheduler-postgres-library)
    + [I get `Executor reports task instance %s finished (%s) although the task says its %s` in my DAG processing logs](t-cloudwatch-cloudtrail-logs.md#long-running-tasks)
    + [I get `Could not read remote logs from log_group: airflow-{environmentName}-Task log_stream: {DAG_ID}/{TASK_ID}/{time}/{n}.log` in my task logs](t-cloudwatch-cloudtrail-logs.md#t-task-fail-permission)

# Troubleshooting: DAGs, Operators, Connections, and other issues
<a name="t-apache-airflow-202"></a>

The topics on this page describe resolutions to Apache Airflow v2 and v3 Python dependencies, custom plugins, DAGs, Operators, Connections, tasks, and webserver issues you can encounter on an Amazon Managed Workflows for Apache Airflow environment.

**Contents**
+ [Connections](#troubleshooting-conn-202)
  + [I can't connect to Secrets Manager](#access-secrets-manager)
  + [How do I configure `secretsmanager:ResourceTag/<tag-key>` secrets manager conditions or a resource restriction in my execution role policy?](#access-secrets-manager-condition-keys)
  + [I can't connect to Snowflake](#missing-snowflake)
  + [I can't find my connection in the Airflow UI](#connection-type-missing)
+ [Webserver](#troubleshooting-webserver-202)
  + [I get a 5xx error accessing the webserver](#5xx-webserver-202)
  + [I get a `The scheduler does not seem to be running` error](#error-scheduler-202)
+ [Tasks](#troubleshooting-tasks-202)
  + [I get my tasks stuck or not completing](#stranded-tasks-202)
  + [I get task failures without logs in Airflow v3](#failed-task-no-log)
+ [CLI](#troubleshooting-cli-202)
  + [I get a '503' error when triggering a DAG in the CLI](#cli-toomany-202)
  + [Why does the `dags backfill` Apache Airflow CLI command fail? Is there a workaround?](#troubleshooting-cli-backfill)
+ [Operators](#troubleshooting-operators-202)
  + [I received a `PermissionError: [Errno 13] Permission denied` error using the S3Transform operator](#op-s3-transform)

## Connections
<a name="troubleshooting-conn-202"></a>

The following topic describes the errors you might receive when using an Apache Airflow connection, or using another AWS database.

### I can't connect to Secrets Manager
<a name="access-secrets-manager"></a>

We recommend the following steps:

1. Learn how to create secret keys for your Apache Airflow connection and variables in [Configuring an Apache Airflow connection using an AWS Secrets Manager secret](connections-secrets-manager.md).

1. Learn how to use the secret key for an Apache Airflow variable (`test-variable`) in [Using a secret key in AWS Secrets Manager for an Apache Airflow variable](samples-secrets-manager-var.md).

1. Learn how to use the secret key for an Apache Airflow connection (`myconn`) in [Using a secret key in AWS Secrets Manager for an Apache Airflow connection](samples-secrets-manager.md).

### How do I configure `secretsmanager:ResourceTag/<tag-key>` secrets manager conditions or a resource restriction in my execution role policy?
<a name="access-secrets-manager-condition-keys"></a>

**Note**  
Applies to Apache Airflow version 2.0 and earlier.

Currently, you cannot limit access to Secrets Manager secrets by using condition keys or other resource restrictions in your environment's execution role, due to a known issue in Apache Airflow.

### I can't connect to Snowflake
<a name="missing-snowflake"></a>

We recommend the following steps:

1. Test your DAGs, custom plugins, and Python dependencies locally using [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.

1. Add the following entries to the requirements.txt for your environment.

   ```
   apache-airflow-providers-snowflake==1.3.0
   ```

1. Add the following imports to your DAG:

   ```
   from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
   ```

Ensure the Apache Airflow connection object includes the following key-value pairs:

1. **Conn Id: **snowflake_conn

1. **Conn Type: **Snowflake

1. **Host: **<my account>.<my region if not us-west-2>.snowflakecomputing.com

1. **Schema: **<my schema>

1. **Login: **<my user name>

1. **Password: **<my password>

1. **Port: ** <port, if any>

1. **Extra: **

   ```
   {
     "account": "<my account>",
     "warehouse": "<my warehouse>",
     "database": "<my database>",
     "region": "<my region if not using us-west-2 otherwise omit this line>"
   }
   ```

For example:

```
>>> import json
>>> from airflow.models.connection import Connection
>>> myconn = Connection(
...    conn_id='snowflake_conn',
...    conn_type='Snowflake',
...    host='123456789012.us-east-1.snowflakecomputing.com',
...    schema='YOUR_SCHEMA',
...    login='YOUR_USERNAME',
...    password='YOUR_PASSWORD',
...    port='YOUR_PORT',
...    extra=json.dumps(dict(account='123456789012', warehouse='YOUR_WAREHOUSE', database='YOUR_DB_OPTION', region='us-east-1')),
... )
```

### I can't find my connection in the Airflow UI
<a name="connection-type-missing"></a>

Apache Airflow provides connection templates in the Apache Airflow UI and uses these templates to generate the connection URI string, regardless of the connection type. If a connection template is not available in the Apache Airflow UI, you can use an alternate template, such as the HTTP connection template, to generate the connection URI string.

We recommend the following steps:

1. Access the connection types that Amazon MWAA provides in the Apache Airflow UI at [Apache Airflow provider packages installed on Amazon MWAA environments](connections-packages.md).

1. Access the commands to create an Apache Airflow connection in the CLI at [Apache Airflow CLI command reference](airflow-cli-command-reference.md).

1. Learn how to substitute an available connection template for a connection type that isn't available in the Apache Airflow UI on Amazon MWAA at [Overview of connection types](manage-connection-types.md).
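Because every Apache Airflow connection is represented internally as a URI, an alternate template such as HTTP can carry the same fields as a missing connection type. The following stdlib-only sketch uses a hypothetical helper (`build_conn_uri` is not part of the Airflow API) to illustrate how connection fields map onto the URI string:

```python
from urllib.parse import quote

def build_conn_uri(conn_type, host, login, password, schema="", port=None, extra=None):
    # Hypothetical helper: illustrates the URI shape Airflow derives from a
    # connection's fields; this is not the official Airflow API.
    uri = f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}@{host}"
    if port:
        uri += f":{port}"
    uri += f"/{schema}"
    if extra:
        # Extra fields become URI query parameters.
        uri += "?" + "&".join(
            f"{k}={quote(str(v), safe='')}" for k, v in extra.items()
        )
    return uri

uri = build_conn_uri(
    "snowflake",
    "myaccount.snowflakecomputing.com",
    "YOUR_USERNAME",
    "YOUR_PASSWORD",
    schema="YOUR_SCHEMA",
    extra={"warehouse": "YOUR_WAREHOUSE"},
)
print(uri)
```

A URI in this shape can be stored in Secrets Manager or passed to a generic template when the specific connection type isn't installed.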

## Webserver
<a name="troubleshooting-webserver-202"></a>

The following topic describes the errors you might receive for your Apache Airflow webserver on Amazon MWAA.

### I get a 5xx error accessing the webserver
<a name="5xx-webserver-202"></a>

We recommend the following steps:

1. Check Apache Airflow configuration options. Verify that the key-value pairs you specified as an Apache Airflow configuration option, such as AWS Secrets Manager, were configured correctly. To learn more, refer to [I can't connect to Secrets Manager](#access-secrets-manager).

1. Check the `requirements.txt`. Verify the Airflow "extras" package and other libraries listed in your `requirements.txt` are compatible with your Apache Airflow version.

1. To explore ways to specify Python dependencies in a `requirements.txt` file, refer to [Managing Python dependencies in requirements.txt](best-practices-dependencies.md).

### I get a `The scheduler does not seem to be running` error
<a name="error-scheduler-202"></a>

If the scheduler doesn't seem to be running, or the last "heart beat" was received several hours ago, your DAGs might not be listed in Apache Airflow, and new tasks will not be scheduled.

We recommend the following steps:

1. Confirm that your VPC security group allows inbound access to port `5432`. This port is needed to connect to the Amazon Aurora PostgreSQL metadata database for your environment. After this rule is added, give Amazon MWAA a few minutes, and the error should disappear. To learn more, refer to [Security in your VPC on Amazon MWAA](vpc-security.md).
**Note**  
The Aurora PostgreSQL metadatabase is part of the [Amazon MWAA service architecture](what-is-mwaa.md#architecture-mwaa) and is not available in your AWS account.
Database-related errors are usually a symptom of scheduler failure and not the root cause.

1. If the scheduler is not running, it might be due to a number of factors such as [dependency installation failures](best-practices-dependencies.md), or an [overloaded scheduler](best-practices-tuning.md). Confirm that your DAGs, plugins, and requirements are working correctly by accessing the corresponding log groups in CloudWatch Logs. To learn more, refer to [Monitoring and metrics for Amazon Managed Workflows for Apache Airflow](cw-metrics.md).
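If you manage your security group rules from the AWS CLI, the self-referencing rule for the metadata database port described above can be sketched as follows (the security group ID is a placeholder for your environment's VPC security group):

```shell
# Allow inbound PostgreSQL traffic (port 5432) from the security group itself.
# sg-0123456789abcdef0 is a placeholder; use your environment's VPC security group ID.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 5432 \
    --source-group sg-0123456789abcdef0
```

Specifying the same group as `--source-group` makes the rule self-referencing, so only resources attached to that security group can reach the port.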

## Tasks
<a name="troubleshooting-tasks-202"></a>

The following topic describes the errors you might receive for Apache Airflow tasks in an environment.

### I get my tasks stuck or not completing
<a name="stranded-tasks-202"></a>

If your Apache Airflow tasks are "stuck" or not completing, we recommend the following steps:

1. There might be a large number of DAGs defined. Reduce the number of DAGs and perform an update of the environment (such as changing a log level) to force a reset.

   1. Airflow parses DAGs whether they are enabled or not. If you're using more than 50% of your environment's capacity, you might start overwhelming the Apache Airflow scheduler. This leads to a large *Total Parse Time* in CloudWatch Metrics or long DAG processing times in CloudWatch Logs. There are other ways to optimize Apache Airflow configurations that are outside the scope of this guide.

   1. To learn more about the best practices we recommend to tune the performance of your environment, refer to [Performance tuning for Apache Airflow on Amazon MWAA](best-practices-tuning.md).

1. There might be a large number of tasks in the queue. This is often shown as a large—and growing—number of tasks in the `None` state, or as a large number in `Queued Tasks` and/or `Tasks Pending` in CloudWatch. This can occur for the following reasons:

   1. If there are more tasks to run than the environment has the capacity to run, and/or a large number of tasks that were queued before autoscaling has time to detect the tasks and deploy additional workers.

   1. If there are more tasks to run than an environment has the capacity to run, we recommend **reducing** the number of tasks that your DAGs run concurrently, and/or increasing the minimum Apache Airflow workers.

   1. If there are a large number of tasks that were queued before autoscaling has had time to detect and deploy additional workers, we recommend **staggering** task deployment and/or increasing the minimum Apache Airflow workers.

   1. You can use the [update-environment](https://docs.aws.amazon.com/cli/latest/reference/mwaa/update-environment.html) command in the AWS Command Line Interface (AWS CLI) to change the minimum or maximum number of workers that run on your environment.

      ```
      aws mwaa update-environment --name MyEnvironmentName --min-workers 2 --max-workers 10
      ```

   1. To learn more about the best practices we recommend to tune the performance of your environment, refer to [Performance tuning for Apache Airflow on Amazon MWAA](best-practices-tuning.md).

1. If your tasks are stuck in the "running" state, you can also clear the tasks or mark them as succeeded or failed. This allows the autoscaling component for your environment to scale down the number of workers running on your environment. The following image depicts an example of a stranded task.  
![\[This is an image with a stranded task.\]](http://docs.aws.amazon.com/mwaa/latest/userguide/images/mwaa-airflow-scaling.png)

   1. Choose the circle for the stranded task, and then select **Clear** (as shown). This allows Amazon MWAA to scale down workers; otherwise, Amazon MWAA can't determine which DAGs are enabled or disabled, and can't scale down, if there are still queued tasks.  
![\[Apache Airflow Actions\]](http://docs.aws.amazon.com/mwaa/latest/userguide/images/mwaa-airflow-scaling-menu.png)

1. Learn more about the Apache Airflow task lifecycle at [Concepts](https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#task-lifecycle) in the *Apache Airflow reference guide*.

### I get task failures without logs in Airflow v3
<a name="failed-task-no-log"></a>

If your Apache Airflow 3 tasks are failing without logs, follow these steps:
+ If the worker logs present an error such as `Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 15 Job: 12.')` around the time the task failed, this indicates that the forked worker process assigned to the task was likely terminated unexpectedly.

  To address this, consider configuring `celery.worker_autoscale` with the same minimum and maximum values. For example:

  ```
  celery.worker_autoscale=5,5  # for mw1.small
  celery.worker_autoscale=10,10 # for mw1.medium
  celery.worker_autoscale=20,20 # for mw1.large
  ```

  This ensures that the worker pool size remains fixed, preventing unexpected worker terminations.
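The `celery.worker_autoscale` option can also be applied without the console; for example, with the AWS CLI's `update-environment` command (the environment name is a placeholder):

```shell
# Pin the Celery worker pool to a fixed size on an mw1.medium environment.
# MyEnvironmentName is a placeholder for your environment's name.
aws mwaa update-environment \
    --name MyEnvironmentName \
    --airflow-configuration-options '{"celery.worker_autoscale": "10,10"}'
```

Note that updating configuration options triggers an environment update, which takes several minutes to complete.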

## CLI
<a name="troubleshooting-cli-202"></a>

The following topic describes the errors you might receive when running Airflow CLI commands in the AWS Command Line Interface.

### I get a '503' error when triggering a DAG in the CLI
<a name="cli-toomany-202"></a>

The Airflow CLI runs on the Apache Airflow webserver, which has limited concurrency. Typically a maximum of 4 CLI commands can run simultaneously.
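Because only a handful of CLI commands can run on the webserver at once, scripts that trigger DAGs remotely often retry on a 503 response. The following stdlib-only sketch illustrates exponential backoff with jitter; the function names and the `RuntimeError` stand-in for an HTTP 503 are illustrative assumptions, not part of any MWAA API:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry fn() with exponential backoff and jitter; re-raise after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for an HTTP 503 from the webserver
            if attempt == max_attempts - 1:
                raise
            # Sleep grows exponentially per attempt, with random jitter.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))

# Demo: a call that fails with a 503 twice, then succeeds.
calls = {"n": 0}
def flaky_trigger():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "DAG triggered"

result = call_with_backoff(flaky_trigger, base_delay=0.01)
print(result)  # DAG triggered
```

Jitter spreads retries out over time, which matters when several clients hit the concurrency limit simultaneously.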

### Why does the `dags backfill` Apache Airflow CLI command fail? Is there a workaround?
<a name="troubleshooting-cli-backfill"></a>

**Note**  
The following applies only to Apache Airflow v2.0.2 environments.

The `backfill` command, like other Apache Airflow CLI commands, parses all DAGs locally before any DAGs are processed, regardless of which DAG the CLI operation applies to. In Amazon MWAA environments using Apache Airflow v2.0.2, plugins and requirements are not yet installed on the webserver by the time the CLI command runs, so the parsing operation fails and the `backfill` operation is not invoked. If your environment has no requirements or plugins, the `backfill` operation succeeds.

To run the `backfill` CLI command, we recommend invoking it in a bash operator. In a bash operator, `backfill` is initiated from the worker, allowing the DAGs to parse successfully because all necessary requirements and plugins are available and installed. Use the following example to create a DAG with a `BashOperator` to run `backfill`.

```
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="backfill_dag", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
    cli_command = BashOperator(
        task_id="bash_command",
        bash_command="airflow dags backfill my_dag_id"
    )
```

## Operators
<a name="troubleshooting-operators-202"></a>

The following topic describes the errors you might receive when using Operators.

### I received a `PermissionError: [Errno 13] Permission denied` error using the S3Transform operator
<a name="op-s3-transform"></a>

We recommend the following steps if you're trying to run a shell script with the *S3Transform* operator and you're receiving a `PermissionError: [Errno 13] Permission denied` error. The following steps assume you have an existing plugins.zip file. If you're creating a *new* plugins.zip, refer to [Installing custom plugins](configuring-dag-import-plugins.md).

1. Test your DAGs, custom plugins, and Python dependencies locally using [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.

1. Create your "transform" script.

   ```
   #!/bin/bash
   cp $1 $2
   ```

1. (optional) macOS and Linux users might need to run the following command to ensure the script is executable.

   ```
   chmod 777 transform_test.sh
   ```

1. Add the script to your plugins.zip.

   ```
   zip plugins.zip transform_test.sh
   ```

1. Follow the steps in [Upload the plugins.zip to Amazon S3](configuring-dag-import-plugins.md#configuring-dag-plugins-upload).

1. Follow the steps in [Specifying the plugins.zip version on the Amazon MWAA console](configuring-dag-import-plugins.md#configuring-dag-plugins-s3-mwaaconsole).

1. Create the following DAG.

   ```
   from airflow import DAG
   from airflow.providers.amazon.aws.operators.s3_file_transform import S3FileTransformOperator
   from airflow.utils.dates import days_ago
   import os

   DAG_ID = os.path.basename(__file__).replace(".py", "")

   with DAG(dag_id=DAG_ID, schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
       file_transform = S3FileTransformOperator(
           task_id='file_transform',
           transform_script='/usr/local/airflow/plugins/transform_test.sh',
           source_s3_key='s3://amzn-s3-demo-bucket/files/input.txt',
           dest_s3_key='s3://amzn-s3-demo-bucket/files/output.txt'
       )
   ```

1. Follow the steps in [Uploading DAG code to Amazon S3](configuring-dag-folder.md#configuring-dag-folder-uploading).

# Troubleshooting: Creating and updating an Amazon MWAA environment
<a name="t-create-update-environment"></a>

The topics on this page contain errors you can encounter when creating and updating an Amazon Managed Workflows for Apache Airflow environment and how to resolve these errors.

**Contents**
+ [Updating `requirements.txt`](#troubleshooting-reqs)
  + [I specified a new version of my `requirements.txt` and it's taking more than 20 minutes to update my environment](#t-requirements)
+ [Plugins](#troubleshooting-plugins)
  + [Does Amazon MWAA support implementing custom UI?](#custom-ui)
+ [Create bucket](#troubleshooting-create-bucket)
  + [I can't select the option for S3 Block Public Access settings](#t-create-bucket)
+ [Create environment](#troubleshooting-create-environment)
  + [I tried to create an environment and it's stuck in the `Creating` state](#t-stuck-failure)
  + [I tried to create an environment but it displays the status as `Create failed`](#t-create-environ-failed)
  + [I tried to select a VPC and received a `Network Failure` error](#t-network-failure)
  + [I tried to create an environment and received a service, partition, or resource "must be passed" error](#t-service-partition)
  + [I tried to create an environment and it displays the status as `Available` but when I try to access the Airflow UI an `Empty Reply from Server` or `502 Bad Gateway` error is shown](#t-create-environ-empty-reply)
  + [I tried to create an environment and my user name is a bunch of random character names](#t-create-environ-random-un)
+ [Update environment](#troubleshooting-update-environment)
  + [I tried changing the environment class but the update failed](#t-rollback-billing-failure)
+ [Access environment](#troubleshooting-access-environment)
  + [I can't access the Apache Airflow UI](#t-no-access-airflow-ui)

## Updating `requirements.txt`
<a name="troubleshooting-reqs"></a>

The following topic describes the errors you might receive when updating your `requirements.txt`.

### I specified a new version of my `requirements.txt` and it's taking more than 20 minutes to update my environment
<a name="t-requirements"></a>

If it takes more than twenty minutes for your environment to install a new version of a `requirements.txt` file, the environment update failed and Amazon MWAA is rolling back to the last stable version of the container image.

1. Check package versions. We recommend always specifying either a specific version (`==`) or a maximum version (`<=`) for the Python dependencies in your `requirements.txt`.

1. Check Apache Airflow logs. If you enabled Apache Airflow logs, verify your log groups were created successfully on the [Logs groups page](https://console.aws.amazon.com/cloudwatch/home#logsV2:log-groups) on the CloudWatch console. If you get blank logs, the most common reason is due to missing permissions in your execution role for CloudWatch or Amazon S3 where logs are written. To learn more, refer to [Execution role](mwaa-create-role.md).

1. Check Apache Airflow configuration options. If you're using Secrets Manager, verify that the key-value pairs you specified as an Apache Airflow configuration option were configured correctly. To learn more, refer to [Configuring an Apache Airflow connection using an AWS Secrets Manager secret](connections-secrets-manager.md).

1. Check VPC network configuration. To learn more, refer to [I tried to create an environment and it's stuck in the `Creating` state](#t-stuck-failure).

1. Check execution role permissions. An execution role is an AWS Identity and Access Management (IAM) role with a permissions policy that grants Amazon MWAA permission to invoke the resources of other AWS services (such as Amazon S3, CloudWatch, Amazon SQS, Amazon ECR) on your behalf. Your [Customer-managed key](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#customer-cmk) or [AWS-owned key](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#aws-owned-cmk) also needs to be permitted access. To learn more, refer to [Execution role](mwaa-create-role.md).

1. To run a troubleshooting script that checks the Amazon VPC network setup and configuration for your Amazon MWAA environment, refer to the [Verify Environment](https://github.com/awslabs/aws-support-tools/tree/master/MWAA) script in AWS Support Tools on GitHub.
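As an example of the version pinning recommended in step 1, each dependency in a `requirements.txt` can carry an exact (`==`) or maximum (`<=`) version. The first line below is the Snowflake provider shown earlier in this guide; the other package names and versions are illustrative placeholders:

```
apache-airflow-providers-snowflake==1.3.0
some-package==1.2.3
another-package<=4.5.6
```

Unpinned dependencies can resolve to a newer, incompatible release on the next environment update, which is a common cause of installation failures and rollbacks.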

## Plugins
<a name="troubleshooting-plugins"></a>

The following topic describes issues you can encounter when configuring or updating Apache Airflow plugins.

### Does Amazon MWAA support implementing custom UI?
<a name="custom-ui"></a>

Starting with Apache Airflow v2.2.2, Amazon MWAA supports installing plugins on the Apache Airflow webserver, and implementing custom UI. If your Amazon MWAA environment is running Apache Airflow v2.0.2 or older, you will not be able to implement custom UI.

For more information about version management, and upgrading your existing environments, refer to [Apache Airflow versions on Amazon Managed Workflows for Apache Airflow](airflow-versions.md).

## Create bucket
<a name="troubleshooting-create-bucket"></a>

The following topic describes the errors you might receive when creating an Amazon S3 bucket.

### I can't select the option for S3 Block Public Access settings
<a name="t-create-bucket"></a>

The [execution role](mwaa-create-role.md) for your Amazon MWAA environment needs permission to perform the `GetBucketPublicAccessBlock` action on the Amazon S3 bucket to verify that the bucket blocks public access. We recommend the following steps:

1. Follow the steps to [Attach a JSON policy to your execution role](mwaa-create-role.md).

1. Attach the following JSON policy:

   ```
   {
     "Effect":"Allow",
     "Action":[
       "s3:GetObject*",
       "s3:GetBucket*",
       "s3:List*"
     ],
     "Resource":[
       "arn:aws:s3:::amzn-s3-demo-bucket",
       "arn:aws:s3:::amzn-s3-demo-bucket/*"
     ]
   }
   ```

   Replace the sample placeholder *amzn-s3-demo-bucket* with your Amazon S3 bucket name.

1. To run a troubleshooting script that checks the Amazon VPC network setup and configuration for your Amazon MWAA environment, refer to the [Verify Environment](https://github.com/awslabs/aws-support-tools/tree/master/MWAA) script in AWS Support Tools on GitHub.

## Create environment
<a name="troubleshooting-create-environment"></a>

The following topic describes the errors you might receive when creating an environment.

### I tried to create an environment and it's stuck in the `Creating` state
<a name="t-stuck-failure"></a>

We recommend the following steps:

1. Check VPC network with *public routing*. If you're using an Amazon VPC with internet access, verify the following:

   1. That your Amazon VPC is configured to allow network traffic between the different AWS resources used by your Amazon MWAA environment, as defined in [About networking on Amazon MWAA](networking-about.md). For example, your VPC security group must either allow all traffic in a self-referencing rule, or optionally allow inbound traffic on HTTPS port 443 and TCP port 5432.

1. Check VPC network with *private routing*. If you're using an Amazon VPC without internet access, verify the following:

   1. That your Amazon VPC is configured to allow network traffic between the different AWS resources for your Amazon MWAA environment, as defined in [About networking on Amazon MWAA](networking-about.md). For example, your two private subnets must **not** have a route to a NAT gateway (or NAT instance), **nor** to an internet gateway.

1. To run a troubleshooting script that checks the Amazon VPC network setup and configuration for your Amazon MWAA environment, refer to the [Verify Environment](https://github.com/awslabs/aws-support-tools/tree/master/MWAA) script in AWS Support Tools on GitHub.

### I tried to create an environment but it displays the status as `Create failed`
<a name="t-create-environ-failed"></a>

We recommend the following steps:

1. Check VPC network configuration. To learn more, refer to [I tried to create an environment and it's stuck in the `Creating` state](#t-stuck-failure).

1. Check user permissions. Amazon MWAA performs a dry run against a user's credentials before creating an environment. Your AWS account might not have permission in AWS Identity and Access Management (IAM) to create some of the resources for an environment. For example, if you chose the **Private network** Apache Airflow access mode, your AWS account must have been granted access by your administrator to the [AmazonMWAAFullConsoleAccess](access-policies.md#console-full-access) access control policy for your environment, which allows your account to create VPC endpoints.

1. Check execution role permissions. An execution role is an AWS Identity and Access Management (IAM) role with a permissions policy that grants Amazon MWAA permission to invoke the resources of other AWS services (such as Amazon S3, CloudWatch, Amazon SQS, Amazon ECR) on your behalf. Your [Customer-managed key](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#customer-cmk) or [AWS-owned key](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#aws-owned-cmk) also needs to be permitted access. To learn more, refer to [Execution role](mwaa-create-role.md).

1. Check Apache Airflow logs. If you enabled Apache Airflow logs, verify your log groups were created successfully on the [Logs groups page](https://console.aws.amazon.com/cloudwatch/home#logsV2:log-groups) on the CloudWatch console. If you get blank logs, the most common reason is due to missing permissions in your execution role for CloudWatch or Amazon S3 where logs are written. To learn more, refer to [Execution role](mwaa-create-role.md).

1. To run a troubleshooting script that checks the Amazon VPC network setup and configuration for your Amazon MWAA environment, refer to the [Verify Environment](https://github.com/awslabs/aws-support-tools/tree/master/MWAA) script in AWS Support Tools on GitHub.

1. If you are using an Amazon VPC *without* internet access, ensure that you've created an Amazon S3 gateway endpoint, and granted the minimum required permissions to Amazon ECR to access Amazon S3. To learn more about creating an Amazon S3 gateway endpoint, refer to the following:
   + [Creating an Amazon VPC network without internet access](vpc-create.md#vpc-create-template-private-only)
   + [Create the Amazon S3 gateway endpoint](https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html#ecr-setting-up-s3-gateway) in the *Amazon Elastic Container Registry User Guide*

### I tried to select a VPC and received a `Network Failure` error
<a name="t-network-failure"></a>

We recommend the following steps:
+ If you get a `Network Failure` error when you try to select an Amazon VPC when creating your environment, turn off any in-browser proxies that are running, and then try again.

### I tried to create an environment and received a service, partition, or resource "must be passed" error
<a name="t-service-partition"></a>

We recommend the following steps:
+ You might be receiving this error because the URI you specified for your Amazon S3 bucket includes a '/' at the end of the URI. We recommend removing the '/' in the path. The value must be in the following format:

  ```
  s3://amzn-s3-demo-bucket
  ```

### I tried to create an environment and it displays the status as `Available` but when I try to access the Airflow UI an `Empty Reply from Server` or `502 Bad Gateway` error is shown
<a name="t-create-environ-empty-reply"></a>

We recommend the following steps:

1. Check VPC security group configuration. To learn more, refer to [I tried to create an environment and it's stuck in the `Creating` state](#t-stuck-failure).

1. Confirm that any Apache Airflow packages you listed in the `requirements.txt` correspond to the Apache Airflow version you're running on Amazon MWAA. To learn more, refer to [Installing Python dependencies](working-dags-dependencies.md).

1. To run a troubleshooting script that checks the Amazon VPC network setup and configuration for your Amazon MWAA environment, refer to the [Verify Environment](https://github.com/awslabs/aws-support-tools/tree/master/MWAA) script in AWS Support Tools on GitHub.

### I tried to create an environment and my user name is a bunch of random character names
<a name="t-create-environ-random-un"></a>
+ Apache Airflow has a maximum of 64 characters for user names. If the name of your AWS Identity and Access Management (IAM) role exceeds this length, a hash algorithm is used to shorten it while keeping it unique.

## Update environment
<a name="troubleshooting-update-environment"></a>

The following topic describes the errors you might receive when updating an environment.

### I tried changing the environment class but the update failed
<a name="t-rollback-billing-failure"></a>

If you update your environment to a different environment class (such as changing an `mw1.medium` to an `mw1.small`) and the request fails, the environment status changes to `UPDATE_FAILED`, and the environment is rolled back to, and billed according to, the previous stable version.

We recommend the following steps:

1. Test your DAGs, custom plugins, and Python dependencies locally using [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.

1. To run a troubleshooting script that checks the Amazon VPC network setup and configuration for your Amazon MWAA environment, refer to the [Verify Environment](https://github.com/awslabs/aws-support-tools/tree/master/MWAA) script in AWS Support Tools on GitHub.

## Access environment
<a name="troubleshooting-access-environment"></a>

The following topic describes the errors you might receive when accessing an environment.

### I can't access the Apache Airflow UI
<a name="t-no-access-airflow-ui"></a>

We recommend the following steps:

1. Check user permissions. You might not have been granted access to a permissions policy that you can use to access the Apache Airflow UI. To learn more, refer to [Accessing an Amazon MWAA environment](access-policies.md).

1. Check network access. This might be because you selected the **Private network** access mode. If the URL of your Apache Airflow UI is in the following format `387fbcn-8dh4-9hfj-0dnd-834jhdfb-vpce.c10.us-west-2.airflow.amazonaws.com`, it means that you're using *private routing* for your Apache Airflow webserver. You can either update the Apache Airflow access mode to the **Public network** access mode, or create a mechanism to access the VPC endpoint for your Apache Airflow *Web server*. To learn more, refer to [Managing access to service-specific Amazon VPC endpoints on Amazon MWAA](vpc-vpe-access.md).

# Troubleshooting: CloudWatch Logs and CloudTrail errors
<a name="t-cloudwatch-cloudtrail-logs"></a>

The topics on this page contain resolutions to Amazon CloudWatch Logs and AWS CloudTrail errors you can encounter on an Amazon Managed Workflows for Apache Airflow environment.

**Contents**
+ [Logs](#troubleshooting-view-logs)
  + [I can't find my task logs, or I received a `Reading remote log from Cloudwatch log_group` error](#t-task-logs)
  + [Tasks are failing without any logs](#t-task-failing-no-logs)
  + [I get a `ResourceAlreadyExistsException` error in CloudTrail](#t-cloudtrail)
  + [I get an `Invalid request` error in CloudTrail](#t-cloudtrail-bucket)
  + [I get `Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory` in Apache Airflow logs](#t-plugins-logs)
  + [I get psycopg2 'server closed the connection unexpectedly' in my scheduler logs](#scheduler-postgres-library)
  + [I get `Executor reports task instance %s finished (%s) although the task says its %s` in my DAG processing logs](#long-running-tasks)
  + [I get `Could not read remote logs from log_group: airflow-{environmentName}-Task log_stream: {DAG_ID}/{TASK_ID}/{time}/{n}.log.` in my task logs](#t-task-fail-permission)

## Logs
<a name="troubleshooting-view-logs"></a>

The following topic describes the errors you might receive when accessing Apache Airflow logs.

### I can't find my task logs, or I received a `Reading remote log from Cloudwatch log_group` error
<a name="t-task-logs"></a>

Amazon MWAA has configured Apache Airflow to read and write logs directly from and to Amazon CloudWatch Logs. If a worker fails to start a task, or fails to write any logs, you see the following error:

```
*** Reading remote log from Cloudwatch log_group: airflow-environmentName-Task log_stream: DAG_ID/TASK_ID/timestamp/n.log.
Could not read remote logs from log_group: airflow-environmentName-Task log_stream: DAG_ID/TASK_ID/time/n.log.
```
+ We recommend the following steps:

  1. Verify that you have enabled task logs at the `INFO` level for your environment. For more information, refer to [Accessing Airflow logs in Amazon CloudWatch](monitoring-airflow.md).

  1. Verify that the environment [execution role](mwaa-create-role.md) has the correct permission policies.

  1. Verify that your operator or task is working correctly, has sufficient resources to parse the DAG, and has the appropriate Python libraries to load. To verify whether you have the correct dependencies, try eliminating imports until you find the one that is causing the issue. We recommend testing your Python dependencies using [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images).
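
As a sketch of the first step, you can check an environment's task-log settings programmatically. The helper below parses the `Environment` object returned by the Amazon MWAA `GetEnvironment` API; the response shape is assumed here, so verify it against your own `boto3` output:

```python
def task_logs_at_info(env: dict) -> bool:
    """Return True when task logs are enabled at the INFO or DEBUG level.

    `env` is the "Environment" object from the MWAA GetEnvironment response,
    e.g. boto3.client("mwaa").get_environment(Name="MyEnvironment")["Environment"].
    The nested key names are assumptions based on the documented response shape.
    """
    task_logs = env.get("LoggingConfiguration", {}).get("TaskLogs", {})
    return bool(task_logs.get("Enabled")) and task_logs.get("LogLevel") in ("INFO", "DEBUG")
```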

### Tasks are failing without any logs
<a name="t-task-failing-no-logs"></a>

If tasks are failing in a workflow and you can't locate any logs for the failed tasks, check whether you are setting the `queue` parameter in your default arguments, as shown in the following example.

```
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Setting queue argument to default.
default_args = {
    "start_date": days_ago(1),
    "queue": "default"
}

with DAG(dag_id="any_command_dag", schedule_interval=None, catchup=False, default_args=default_args) as dag:
    cli_command = BashOperator(
        task_id="bash_command",
        bash_command="{{ dag_run.conf['command'] }}"
    )
```

To resolve the issue, remove `queue` from your code, and invoke the DAG again.
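
If the `queue` entry comes from a shared default-arguments dictionary, you can also strip it defensively before constructing the DAG. This is a minimal sketch with placeholder values:

```python
# Hypothetical shared default arguments; "queue" must not be set on Amazon MWAA.
default_args = {
    "start_date": "2021-01-01",  # placeholder value for illustration
    "queue": "default",
}

# Remove the key (no error if it is absent) before passing default_args to DAG(...).
default_args.pop("queue", None)
```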

### I get a `ResourceAlreadyExistsException` error in CloudTrail
<a name="t-cloudtrail"></a>

```
"errorCode": "ResourceAlreadyExistsException",
    "errorMessage": "The specified log stream already exists",
    "requestParameters": {
        "logGroupName": "airflow-MyAirflowEnvironment-DAGProcessing",
        "logStreamName": "scheduler_cross-account-eks.py.log"
    }
```

Certain Python requirements such as `apache-airflow-backport-providers-amazon` roll back the `watchtower` library that Amazon MWAA uses to communicate with CloudWatch to an older version. We recommend the following steps:
+ Add the following library to your `requirements.txt` file.

  ```
  watchtower==1.0.6
  ```

### I get an `Invalid request` error in CloudTrail
<a name="t-cloudtrail-bucket"></a>

```
Invalid request provided: Provided role does not have sufficient permissions for s3 location airflow-xxx-xxx/dags
```

If you're creating an Amazon MWAA environment and an Amazon S3 bucket using the same CloudFormation template, you need to add a `DependsOn` section within your CloudFormation template. The two resources (*MWAA Environment* and *MWAA Execution Policy*) have a dependency in CloudFormation. We recommend the following steps:
+ Add the following **DependsOn** statement to your CloudFormation template.

  ```
  ...
    MaxWorkers: 5
    NetworkConfiguration:
      SecurityGroupIds:
        - !GetAtt SecurityGroup.GroupId
      SubnetIds: !Ref subnetIds
    WebserverAccessMode: PUBLIC_ONLY
  DependsOn: MwaaExecutionPolicy

  MwaaExecutionPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Roles:
        - !Ref MwaaExecutionRole
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action: airflow:PublishMetrics
            Resource:
  ...
  ```

  For an example, refer to [Quick start tutorial for Amazon Managed Workflows for Apache Airflow](quick-start.md).

### I get `Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory` in Apache Airflow logs
<a name="t-plugins-logs"></a>
+ We recommend the following steps:

  1. If you're using Apache Airflow v2, add `core.lazy_load_plugins : False` as an Apache Airflow configuration option. To learn more, refer to [Using configuration options to load plugins in Apache Airflow v2](configuring-env-variables.md#configuring-2.0-airflow-override).

### I get psycopg2 'server closed the connection unexpectedly' in my scheduler logs
<a name="scheduler-postgres-library"></a>

If you get an error similar to the following, your Apache Airflow scheduler might have run out of resources.

```
2021-06-14T10:20:24.581-05:00    sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
2021-06-14T10:20:24.633-05:00    This probably means the server terminated abnormally
2021-06-14T10:20:24.686-05:00    before or while processing the request.
```

We recommend the following steps:
+ Consider upgrading to Apache Airflow v2.0.2, which you can use to specify up to 5 schedulers.
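
After upgrading, you can set the scheduler count with the `UpdateEnvironment` API, for example from the AWS CLI. The environment name below is a placeholder:

```
aws mwaa update-environment --name MyAirflowEnvironment --schedulers 3
```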

### I get `Executor reports task instance %s finished (%s) although the task says its %s` in my DAG processing logs
<a name="long-running-tasks"></a>

If you get an error similar to the following, your long-running tasks might have reached the task time limit on Amazon MWAA. Amazon MWAA has a limit of 12 hours for any one Airflow task, to prevent tasks from getting stuck in the queue and blocking activities like autoscaling.

```
Executor reports task instance %s finished (%s) although the task says its %s. (Info: %s) Was the task killed externally
```

We recommend the following steps:
+ Consider breaking the task into multiple, shorter-running tasks. Apache Airflow's model is that operators are asynchronous: an operator starts an activity on an external system, and an Apache Airflow sensor polls to check when that activity is complete. If a sensor fails, it can be safely retried without impacting the operator's work.
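
The submit-then-poll split above can be sketched in plain Python (the names here are illustrative, not Airflow or Amazon MWAA APIs): one short task submits the external job, and separate cheap, retryable checks poll for completion, so no single task stays open for hours:

```python
import time

def wait_for_completion(is_done, timeout_s=60, interval_s=5):
    """Poll `is_done` until it returns True or `timeout_s` elapses.

    Mirrors what an Airflow sensor does: each poke is cheap and safely
    retryable, and a failed poke never affects the already-submitted job.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(interval_s)
    return False
```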

### I get `Could not read remote logs from log_group: airflow-{environmentName}-Task log_stream: {DAG_ID}/{TASK_ID}/{time}/{n}.log.` in my task logs
<a name="t-task-fail-permission"></a>

If you get an error similar to the following, the execution role for your environment might not contain a permissions policy to create log streams for task logs.

```
Could not read remote logs from log_group: airflow-{environmentName}-Task log_stream: {DAG_ID}/{TASK_ID}/{time}/{n}.log.
```

We recommend the following steps:
+ Modify the execution role for your environment using one of the sample policies at [Amazon MWAA execution role](mwaa-create-role.md).
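
For reference, a policy of that kind includes CloudWatch Logs permissions similar to the following statement. This is a minimal sketch: substitute your Region, account ID, and environment name, and prefer the full sample policies linked above:

```
{
    "Effect": "Allow",
    "Action": [
        "logs:CreateLogStream",
        "logs:CreateLogGroup",
        "logs:PutLogEvents",
        "logs:GetLogEvents",
        "logs:DescribeLogGroups"
    ],
    "Resource": [
        "arn:aws:logs:us-east-1:111122223333:log-group:airflow-MyAirflowEnvironment-*"
    ]
}
```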

You might also have specified a provider package in your `requirements.txt` file that is incompatible with your Apache Airflow version. For example, if you're using Apache Airflow v2.0.2, you might have specified a package, such as the [apache-airflow-providers-databricks](https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/index.html) package, which is only compatible with Apache Airflow v2.1.

We recommend the following steps:

1. If you're using Apache Airflow v2.0.2, modify the `requirements.txt` file and add `apache-airflow[databricks]`. This installs the correct version of the Databricks package that is compatible with Apache Airflow v2.0.2.

1. Test your DAGs, custom plugins, and Python dependencies locally using [aws-mwaa-docker-images](https://github.com/aws/amazon-mwaa-docker-images) on GitHub.