Resources for using SageMaker Spark for Python (PySpark) examples
Amazon SageMaker provides an Apache Spark Python library (SageMaker PySpark).
Download PySpark
You can download the source code for both the Python Spark (PySpark) and Scala libraries
from the SageMaker Spark GitHub repository.
To install the SageMaker Spark library, use any of the following options or visit SageMaker PySpark:
- Install using pip:

    pip install sagemaker_pyspark

- Install from the source:

    git clone git@github.com:aws/sagemaker-spark.git
    cd sagemaker-spark/sagemaker-pyspark-sdk
    python setup.py install

- You can also create a new notebook in a notebook instance that uses either the Sparkmagic (PySpark) or the Sparkmagic (PySpark3) kernel and connect to a remote Amazon EMR cluster.

  Note: The Amazon EMR cluster must be configured with an IAM role that has the AmazonSageMakerFullAccess policy attached. For information about configuring roles for an EMR cluster, see Configure IAM Roles for Amazon EMR Permissions to AWS Services in the Amazon EMR Management Guide.
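
After installing with pip or from source, it can help to confirm that the library is importable before starting a Spark session. The following is a minimal sketch, not an official procedure. The helper names (sagemaker_pyspark_available, join_classpath, create_session) are hypothetical and introduced only for illustration. The call to sagemaker_pyspark.classpath_jars() and the spark.driver.extraClassPath setting follow the pattern shown in the SageMaker Spark GitHub README, and the ":" separator assumes a Unix-like system.

```python
import importlib.util


def sagemaker_pyspark_available() -> bool:
    """Return True if both pyspark and sagemaker_pyspark can be found."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("pyspark", "sagemaker_pyspark")
    )


def join_classpath(jars) -> str:
    """Join JAR paths into a single classpath string.

    Assumption: ":" is the separator, as on Linux and macOS.
    """
    return ":".join(jars)


def create_session():
    """Create a SparkSession with the SageMaker Spark JARs on the driver classpath.

    The imports are deferred so this module loads even when the
    libraries are not installed.
    """
    from pyspark.sql import SparkSession
    import sagemaker_pyspark

    classpath = join_classpath(sagemaker_pyspark.classpath_jars())
    return (
        SparkSession.builder
        .config("spark.driver.extraClassPath", classpath)
        .getOrCreate()
    )
```

A typical use would be to call sagemaker_pyspark_available() first and call create_session() only when it returns True. Running Spark locally also requires a Java runtime, which this sketch does not check for.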
PySpark examples
For examples of using SageMaker PySpark, see:
- Using Amazon SageMaker with Apache Spark in Read the Docs.
- SageMaker Spark GitHub repository.
To run the notebooks on a notebook instance, see Access example notebooks. To run the notebooks on Studio, see Create or Open an Amazon SageMaker Studio Classic Notebook.