Creating a custom plugin with Apache Hive and Hadoop
Amazon MWAA extracts the contents of a plugins.zip to /usr/local/airflow/plugins. This can be used to add binaries to your containers. In addition, Apache Airflow executes the contents of Python files in the plugins folder at startup, enabling you to set and modify environment variables. The following sample walks you through the steps to create a custom plugin using Apache Hive and Hadoop on an Amazon Managed Workflows for Apache Airflow environment, and it can be combined with other custom plugins and binaries.
Version
- The sample code on this page can be used with Apache Airflow v1 in Python 3.7.
- You can use the code example on this page with Apache Airflow v2 in Python 3.10.
Prerequisites
To use the sample code on this page, you'll need the following:
Permissions
- No additional permissions are required to use the code example on this page.
Requirements
To use the sample code on this page, add the following dependencies to your requirements.txt. To learn more, see Installing Python dependencies.
Download dependencies
Amazon MWAA will extract the contents of plugins.zip into /usr/local/airflow/plugins on each Amazon MWAA scheduler and worker container. This is used to add binaries to your environment. The following steps describe how to assemble the files needed for the custom plugin.
- In your command prompt, navigate to the directory where you would like to create your plugin. For example:
  cd plugins
- Download Hadoop from a mirror, for example:
  wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
- Download Hive from a mirror, for example:
  wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
- Create a directory. For example:
  mkdir hive_plugin
- Extract Hadoop.
  tar -xvzf hadoop-3.3.0.tar.gz -C hive_plugin
- Extract Hive.
  tar -xvzf apache-hive-3.1.2-bin.tar.gz -C hive_plugin
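After the two tar commands complete, the hive_plugin directory contains the extracted hadoop-3.3.0 and apache-hive-3.1.2-bin directories. These are the binaries that the custom plugin points to in the next section.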
Custom plugin
Apache Airflow will execute the contents of Python files in the plugins folder at startup. This is used to set and modify environment variables. The following steps describe the sample code for the custom plugin.
- In your command prompt, navigate to the hive_plugin directory. For example:
  cd hive_plugin
- Copy the contents of the following code sample and save locally as hive_plugin.py in the hive_plugin directory.

  from airflow.plugins_manager import AirflowPlugin
  import os

  # Point the environment at the Hadoop and Hive binaries extracted from plugins.zip.
  os.environ["JAVA_HOME"]="/usr/lib/jvm/jre"
  os.environ["HADOOP_HOME"]='/usr/local/airflow/plugins/hadoop-3.3.0'
  os.environ["HADOOP_CONF_DIR"]='/usr/local/airflow/plugins/hadoop-3.3.0/etc/hadoop'
  os.environ["HIVE_HOME"]='/usr/local/airflow/plugins/apache-hive-3.1.2-bin'
  os.environ["PATH"] = os.getenv("PATH") + ":/usr/local/airflow/plugins/hadoop-3.3.0:/usr/local/airflow/plugins/apache-hive-3.1.2-bin/bin:/usr/local/airflow/plugins/apache-hive-3.1.2-bin/lib"
  # Default to an empty string in case CLASSPATH is not already set in the container.
  os.environ["CLASSPATH"] = os.getenv("CLASSPATH", "") + ":/usr/local/airflow/plugins/apache-hive-3.1.2-bin/lib"

  class EnvVarPlugin(AirflowPlugin):
      name = 'hive_plugin'
- Copy the contents of the following text and save locally as .airflowignore in the hive_plugin directory.

  hadoop-3.3.0
  apache-hive-3.1.2-bin
Plugins.zip
The following steps show how to create plugins.zip. The contents of this example can be combined with other plugins and binaries into a single plugins.zip file.
- In your command prompt, navigate to the hive_plugin directory from the previous step. For example:
  cd hive_plugin
- Zip the contents of your hive_plugin directory. For example:
  zip -r ../hive_plugin.zip ./
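Before uploading the archive, you can optionally confirm that hive_plugin.py and .airflowignore sit at the root of the zip and that the Hadoop and Hive directories were included. The following is a minimal sketch, assuming the archive was written to ../hive_plugin.zip as in the previous step; adjust archive_path if you used a different name or location.

  import zipfile

  # Path to the archive created by the zip command above; adjust if needed.
  archive_path = "../hive_plugin.zip"

  expected_files = {"hive_plugin.py", ".airflowignore"}
  expected_dirs = {"hadoop-3.3.0", "apache-hive-3.1.2-bin"}

  with zipfile.ZipFile(archive_path) as archive:
      # Drop any leading "./" that zip may have kept from the "./" argument.
      names = {n[2:] if n.startswith("./") else n for n in archive.namelist()}

  missing = {f for f in expected_files if f not in names}
  missing |= {d for d in expected_dirs if not any(n.startswith(d + "/") for n in names)}

  if missing:
      print("Archive is missing:", ", ".join(sorted(missing)))
  else:
      print("Archive layout looks correct:", len(names), "entries")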
Code sample
The following steps describe how to create the DAG code that will test the custom plugin.
- In your command prompt, navigate to the directory where your DAG code is stored. For example:
  cd dags
- Copy the contents of the following code sample and save locally as hive.py.

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator
  from airflow.utils.dates import days_ago

  with DAG(dag_id="hive_test_dag", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
      hive_test = BashOperator(
          task_id="hive_test",
          bash_command='hive --help'
      )
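After you upload the plugins.zip file from this example, your requirements.txt, and hive.py to your Amazon S3 bucket and your environment has been updated to use them, you can trigger hive_test_dag from the Apache Airflow UI. If the plugin loaded correctly, the hive_test task log contains the usage text printed by hive --help.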
Airflow configuration options
If you're using Apache Airflow v2, add core.lazy_load_plugins: False as an Apache Airflow configuration option. To learn more, see Using configuration options to load plugins in Apache Airflow v2.
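You can set this option in the Amazon MWAA console, or programmatically through the UpdateEnvironment API. The following is a minimal boto3 sketch, assuming an environment named my-mwaa-env (a placeholder); keep in mind that the map you pass may replace your existing configuration options, so include any others you rely on.

  import boto3

  # Placeholder name; replace with your Amazon MWAA environment.
  environment_name = "my-mwaa-env"

  mwaa = boto3.client("mwaa")

  # Load plugins at startup on Apache Airflow v2 by disabling lazy plugin loading.
  response = mwaa.update_environment(
      Name=environment_name,
      AirflowConfigurationOptions={
          "core.lazy_load_plugins": "False",
      },
  )
  print(response["Arn"])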
What's next?
- Learn how to upload the requirements.txt file in this example to your Amazon S3 bucket in Installing Python dependencies.
- Learn how to upload the DAG code in this example to the dags folder in your Amazon S3 bucket in Adding or updating DAGs.
- Learn more about how to upload the plugins.zip file in this example to your Amazon S3 bucket in Installing custom plugins.