Using Python libraries with EMR Serverless

When you run PySpark jobs on Amazon EMR Serverless applications, you can package various Python libraries as dependencies. To do this, you can use native Python features, build a virtual environment, or directly configure your PySpark jobs to use Python libraries. This page covers each approach.

Using native Python features

When you set the following configuration, you can use PySpark to upload Python files (.py), zipped Python packages (.zip), and Egg files (.egg) to Spark executors.

--conf spark.submit.pyFiles=s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/<.py|.egg|.zip file>

For more details about how to use native Python features for PySpark jobs, see Using PySpark Native Features.
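
For example, you can pass this configuration in the sparkSubmitParameters of a job run. The following is a minimal sketch that uses the AWS SDK for Python (Boto3) to start a job run with a zipped Python package; the application ID, IAM role ARN, entry point script, and dependencies.zip archive are placeholder values that you replace with your own.

import boto3

# Create an EMR Serverless client with the AWS SDK for Python (Boto3).
client = boto3.client("emr-serverless")

# Placeholder application ID, role ARN, and S3 paths -- replace with your own.
response = client.start_job_run(
    applicationId="00example1234",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/main.py",
            # Upload the zipped Python package to the Spark executors.
            "sparkSubmitParameters": "--conf spark.submit.pyFiles=s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/dependencies.zip",
        }
    },
)
print(response["jobRunId"])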

Building a Python virtual environment

To package multiple Python libraries for a PySpark job, you can create isolated Python virtual environments.

  1. To build the Python virtual environment, use the following commands. The example shown installs the scipy and matplotlib packages into a virtual environment, packages the environment into an archive, and copies the archive to an Amazon S3 location.

    Important

    You must run the following commands in an Amazon Linux 2 environment that uses the same version of Python as your EMR Serverless release, that is, Python 3.7.10 for Amazon EMR release 6.6.0. You can find an example Dockerfile in the EMR Serverless Samples GitHub repository.

    # initialize a python virtual environment
    python3 -m venv pyspark_venvsource
    source pyspark_venvsource/bin/activate

    # optionally, ensure pip is up-to-date
    pip3 install --upgrade pip

    # install the python packages
    pip3 install scipy
    pip3 install matplotlib

    # package the virtual environment into an archive
    pip3 install venv-pack
    venv-pack -f -o pyspark_venv.tar.gz

    # copy the archive to an S3 location
    aws s3 cp pyspark_venv.tar.gz s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/

    # optionally, remove the virtual environment directory
    rm -fr pyspark_venvsource
  2. Submit the Spark job with your properties set to use the Python virtual environment.

    --conf spark.archives=s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment
    --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
    --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
    --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python

    Note that if you don't override the original Python binary, the executor setting in the preceding sequence becomes --conf spark.executorEnv.PYSPARK_PYTHON=python.

    For more on how to use Python virtual environments for PySpark jobs, see Using Virtualenv. For more examples of how to submit Spark jobs, see Using Spark configurations when you run EMR Serverless jobs.
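
For example, you can pass these properties in the sparkSubmitParameters of a job run. The following is a minimal sketch that uses the AWS SDK for Python (Boto3); the application ID, IAM role ARN, and entry point script are placeholder values that you replace with your own.

import boto3

# Create an EMR Serverless client with the AWS SDK for Python (Boto3).
client = boto3.client("emr-serverless")

# Placeholder application ID, role ARN, and entry point -- replace with your own.
response = client.start_job_run(
    applicationId="00example1234",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/main.py",
            # Point the driver and executors at the packaged virtual environment.
            "sparkSubmitParameters": (
                "--conf spark.archives=s3://amzn-s3-demo-bucket/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
)
print(response["jobRunId"])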

Configuring PySpark jobs to use Python libraries

With Amazon EMR releases 6.12.0 and higher, you can directly configure EMR Serverless PySpark jobs to use popular data science Python libraries like pandas, NumPy, and PyArrow without any additional setup.

The following examples show how to use each Python library in a PySpark job.

NumPy

NumPy is a Python library for scientific computing that offers multidimensional arrays and operations for math, sorting, random simulation, and basic statistics. To use NumPy, run the following command:

import numpy
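
For example, the following sketch uses NumPy inside a PySpark job to generate random samples and summarize them on the driver. The column name and sample size are arbitrary choices for illustration.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate random samples with NumPy and load them into a Spark DataFrame.
samples = np.random.default_rng(seed=42).normal(size=1000)
df = spark.createDataFrame([(float(x),) for x in samples], ["value"])

# Collect the column back to the driver and compute basic statistics with NumPy.
values = np.array([row.value for row in df.collect()])
print("mean:", values.mean(), "std:", values.std())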
pandas

pandas is a Python library that is built on top of NumPy. The pandas library provides data scientists with DataFrame data structures and data analysis tools. To use pandas, run the following command:

import pandas
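
For example, the following sketch builds a pandas DataFrame on the driver, distributes it as a Spark DataFrame, and converts it back for local analysis. The column names and values are arbitrary choices for illustration.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a pandas DataFrame on the driver and distribute it as a Spark DataFrame.
pdf = pd.DataFrame({"key": ["a", "b", "c"], "count": [1, 2, 3]})
sdf = spark.createDataFrame(pdf)

# Convert back to pandas and use pandas data analysis tools on the driver.
print(sdf.toPandas().describe())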
PyArrow

PyArrow is a Python library that manages in-memory columnar data for improved job performance. PyArrow is based on the Apache Arrow cross-language development specification, which is a standard way to represent and exchange data in a columnar format. To use PyArrow, run the following command:

import pyarrow
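
For example, the following sketch enables Arrow-based columnar data transfers in PySpark, which lets toPandas() exchange data through PyArrow's in-memory columnar format. The DataFrame itself is an arbitrary illustration.

import pyarrow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers between Spark and pandas.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# toPandas() now moves data through PyArrow for faster conversion.
sdf = spark.range(0, 100000)
pdf = sdf.toPandas()
print(pyarrow.__version__, len(pdf))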