Creating a custom plugin with Apache Hive and Hadoop

Amazon MWAA extracts the contents of plugins.zip to /usr/local/airflow/plugins. This can be used to add binaries to your containers. In addition, Apache Airflow executes the contents of Python files in the plugins folder at startup, which enables you to set and modify environment variables. The following sample walks you through the steps to create a custom plugin using Apache Hive and Hadoop on an Amazon Managed Workflows for Apache Airflow environment. The plugin can be combined with other custom plugins and binaries.
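
As a minimal illustration of the pattern this sample relies on, a Python file at the root of plugins.zip can set environment variables when each scheduler and worker starts. The file and variable names below are hypothetical and are not part of this sample:

    # env_var_plugin.py - a hypothetical, minimal plugin that sets one environment
    # variable. Amazon MWAA extracts plugins.zip to /usr/local/airflow/plugins, and
    # Apache Airflow executes this file at startup because it sits in that folder.
    from airflow.plugins_manager import AirflowPlugin
    import os

    # Any variable set here is visible to tasks that run on the same container.
    os.environ["MY_TOOL_HOME"] = "/usr/local/airflow/plugins/my-tool"

    class EnvVarPlugin(AirflowPlugin):
        # Airflow only needs a plugin class with a name to register the module.
        name = "env_var_plugin"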

Version

  • The sample code on this page can be used with Apache Airflow v1 in Python 3.7.

  • The sample code on this page can be used with Apache Airflow v2 in Python 3.10.

Prerequisites

To use the sample code on this page, you'll need the following:

Permissions

  • No additional permissions are required to use the code example on this page.

Requirements

To use the sample code on this page, add the following dependencies to your requirements.txt. To learn more, see Installing Python dependencies.

Apache Airflow v2
-c https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.7.txt
apache-airflow-providers-amazon[apache.hive]
Apache Airflow v1
apache-airflow[hive]==1.10.12
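
After your environment installs these requirements, a quick way to confirm that the Hive extra is available is to import the Hive CLI hook. This check is optional and not part of the sample; it assumes the same requirements are installed in a local Python environment, and the import path differs between Apache Airflow versions:

    # check_hive_extra.py - an optional local check, assuming the requirements above
    # are installed in your local Python environment.
    try:
        # Apache Airflow v2 provider package
        from airflow.providers.apache.hive.hooks.hive import HiveCliHook
    except ImportError:
        # Apache Airflow v1
        from airflow.hooks.hive_hooks import HiveCliHook

    print(HiveCliHook)  # succeeds only if the Hive extra installed correctly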

Download dependencies

Amazon MWAA extracts the contents of plugins.zip into /usr/local/airflow/plugins on each Amazon MWAA scheduler and worker container. This is used to add binaries to your environment. The following steps describe how to assemble the files needed for the custom plugin; an optional verification sketch follows the steps.

  1. In your command prompt, navigate to the directory where you would like to create your plugin. For example:

    cd plugins
  2. Download Hadoop from a mirror, for example:

    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
  3. Download Hive from a mirror, for example:

    wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
  4. Create a directory. For example:

    mkdir hive_plugin
  5. Extract Hadoop.

    tar -xvzf hadoop-3.3.0.tar.gz -C hive_plugin
  6. Extract Hive.

    tar -xvzf apache-hive-3.1.2-bin.tar.gz -C hive_plugin
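
After step 6, the hive_plugin directory should contain both extracted distributions. The following sketch is only a local sanity check and is not part of the plugin; it assumes you run it from the directory that contains hive_plugin:

    # check_layout.py - an optional local sanity check, run from the directory
    # that contains hive_plugin.
    import os

    expected = [
        "hive_plugin/hadoop-3.3.0/bin/hadoop",
        "hive_plugin/apache-hive-3.1.2-bin/bin/hive",
    ]

    for path in expected:
        # Both binaries must be present for the plugin's PATH changes to matter.
        print(path, "found" if os.path.exists(path) else "MISSING")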

Custom plugin

Apache Airflow executes the contents of Python files in the plugins folder at startup. This is used to set and modify environment variables. The following steps describe the sample code for the custom plugin; an optional local smoke test follows the steps.

  1. In your command prompt, navigate to the hive_plugin directory. For example:

    cd hive_plugin
  2. Copy the contents of the following code sample and save locally as hive_plugin.py in the hive_plugin directory.

    from airflow.plugins_manager import AirflowPlugin
    import os

    os.environ["JAVA_HOME"]="/usr/lib/jvm/jre"
    os.environ["HADOOP_HOME"]='/usr/local/airflow/plugins/hadoop-3.3.0'
    os.environ["HADOOP_CONF_DIR"]='/usr/local/airflow/plugins/hadoop-3.3.0/etc/hadoop'
    os.environ["HIVE_HOME"]='/usr/local/airflow/plugins/apache-hive-3.1.2-bin'
    os.environ["PATH"] = os.getenv("PATH") + ":/usr/local/airflow/plugins/hadoop-3.3.0:/usr/local/airflow/plugins/apache-hive-3.1.2-bin/bin:/usr/local/airflow/plugins/apache-hive-3.1.2-bin/lib"
    os.environ["CLASSPATH"] = os.getenv("CLASSPATH") + ":/usr/local/airflow/plugins/apache-hive-3.1.2-bin/lib"

    class EnvVarPlugin(AirflowPlugin):
        name = 'hive_plugin'
  3. Copy the contents of the following text and save locally as .airflowignore in the hive_plugin directory.

    hadoop-3.3.0
    apache-hive-3.1.2-bin
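
To see the plugin in action locally before you package it, you can load hive_plugin.py the same way Apache Airflow does at startup and print the variables it sets. The helper below is hypothetical and not part of the plugin; it assumes Apache Airflow is installed in your local Python environment and that you run it from the hive_plugin directory:

    # smoke_test.py - a hypothetical local check, not part of the plugin. It assumes
    # Apache Airflow is installed locally (for example, from the requirements above).
    import importlib.util
    import os

    # The plugin appends to PATH and CLASSPATH, so make sure both exist locally.
    os.environ.setdefault("PATH", "")
    os.environ.setdefault("CLASSPATH", "")

    spec = importlib.util.spec_from_file_location("hive_plugin", "hive_plugin.py")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executing the module body sets the variables

    for key in ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR", "HIVE_HOME"):
        print(key, "=", os.environ.get(key))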

Plugins.zip

The following steps show how to create plugins.zip. The contents of this example can be combined with other plugins and binaries into a single plugins.zip file; a sketch for inspecting the finished archive follows the steps.

  1. In your command prompt, navigate to the hive_plugin directory from the previous step. For example:

    cd hive_plugin
  2. Zip the contents of your hive_plugin directory.

    zip -r ../hive_plugin.zip ./
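
Before uploading, you can confirm that hive_plugin.py and .airflowignore sit at the root of the archive, which is where Amazon MWAA expects them. The sketch below uses Python's zipfile module and assumes you run it from the directory that contains hive_plugin.zip:

    # list_plugins_zip.py - an optional check of the archive layout, run from the
    # directory that contains hive_plugin.zip.
    import zipfile

    with zipfile.ZipFile("hive_plugin.zip") as archive:
        # Some zip tools prefix entries with "./"; normalize so the check works either way.
        names = {name[2:] if name.startswith("./") else name for name in archive.namelist()}

    # hive_plugin.py and .airflowignore must sit at the top level of the archive,
    # not nested inside a hive_plugin/ directory.
    for required in ("hive_plugin.py", ".airflowignore"):
        print(required, "at zip root" if required in names else "MISSING from zip root")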

Code sample

The following steps describe how to create the DAG code that tests the custom plugin; an optional parse check follows the steps.

  1. In your command prompt, navigate to the directory where your DAG code is stored. For example:

    cd dags
  2. Copy the contents of the following code sample and save locally as hive.py.

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.utils.dates import days_ago

    with DAG(dag_id="hive_test_dag", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
        hive_test = BashOperator(
            task_id="hive_test",
            bash_command='hive --help'
        )
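
Before uploading hive.py, you can confirm that the DAG parses without import errors by loading it with DagBag. This is an optional check and not part of the sample; it assumes Apache Airflow is installed in your local Python environment and that you run it from the directory that contains dags:

    # check_dag.py - an optional local parse check, assuming apache-airflow is
    # installed locally. Run it from the directory that contains dags.
    from airflow.models import DagBag

    dag_bag = DagBag(dag_folder="dags", include_examples=False)

    print("import errors:", dag_bag.import_errors)  # empty dict when hive.py parses
    print("hive_test_dag loaded:", "hive_test_dag" in dag_bag.dags)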

Airflow configuration options

If you're using Apache Airflow v2, add core.lazy_load_plugins : False as an Apache Airflow configuration option. To learn more, see Using configuration options to load plugins in Apache Airflow v2.
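
You can set this option in the Amazon MWAA console or programmatically. One way to set it programmatically is with the boto3 mwaa client's update_environment call, which accepts configuration options as a map of strings; the sketch below uses a placeholder environment name:

    # set_lazy_load_plugins.py - a sketch of setting the option with boto3.
    # "MyAirflowEnvironment" is a placeholder for your environment name.
    import boto3

    mwaa = boto3.client("mwaa")

    mwaa.update_environment(
        Name="MyAirflowEnvironment",
        AirflowConfigurationOptions={
            # Apache Airflow v2 only: load plugins at startup so the environment
            # variables are set before tasks run.
            "core.lazy_load_plugins": "False",
        },
    )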

What's next?