Apache Spark with Amazon SageMaker
Amazon SageMaker Spark is an open source Spark library that helps you build Spark machine learning
(ML) pipelines with SageMaker. It simplifies the integration of Spark ML stages with SageMaker
stages, such as model training and hosting. For information about SageMaker Spark, see the SageMaker Spark GitHub repository.
The SageMaker Spark library is available in Python and Scala. You can use SageMaker Spark to train
models in SageMaker using org.apache.spark.sql.DataFrame data frames in your Spark
clusters. After model training, you can also host the model using SageMaker hosting services.
The SageMaker Spark library, com.amazonaws.services.sagemaker.sparksdk, provides
the following classes, among others:
- SageMakerEstimator—Extends the org.apache.spark.ml.Estimator interface. You can use this estimator for model training in SageMaker.
- KMeansSageMakerEstimator, PCASageMakerEstimator, and XGBoostSageMakerEstimator—Extend the SageMakerEstimator class.
- SageMakerModel—Extends the org.apache.spark.ml.Model class. You can use SageMakerModel for model hosting and getting inferences in SageMaker.
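Because these classes extend the standard Spark ML interfaces, a SageMaker training stage can sit alongside ordinary Spark ML stages in a single Pipeline. The following Scala sketch illustrates this; it assumes a Spark cluster with the sagemaker-spark SDK on the classpath, and the column names and IAM role ARN are placeholders:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms.XGBoostSageMakerEstimator

// A plain Spark ML stage that assembles raw columns into a "features" vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2")) // hypothetical column names
  .setOutputCol("features")

// A SageMaker training stage. Because XGBoostSageMakerEstimator extends
// org.apache.spark.ml.Estimator, Spark treats it like any other estimator.
val xgboost = new XGBoostSageMakerEstimator(
  sagemakerRole = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"), // placeholder
  trainingInstanceType = "ml.m4.xlarge",
  trainingInstanceCount = 1,
  endpointInstanceType = "ml.m4.xlarge",
  endpointInitialInstanceCount = 1)

val pipeline = new Pipeline().setStages(Array(assembler, xgboost))
// pipeline.fit(df) would run the Spark stage locally, train on SageMaker,
// and return a PipelineModel whose last stage is a SageMakerModel backed
// by a hosted endpoint.
```

Running fit on this pipeline requires live AWS resources (an S3 bucket and the specified IAM role), so treat it as a structural sketch rather than a self-contained program.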
You can download the source code for both Python Spark (PySpark) and Scala libraries from
the SageMaker Spark GitHub repository.
For installation and examples of the SageMaker Spark library, see SageMaker Spark for Scala examples or Resources for using SageMaker Spark for Python (PySpark) examples.
If you use Amazon EMR on AWS to manage Spark clusters, see Apache Spark in the Amazon EMR documentation.
Integrate your Apache Spark application with SageMaker
The following is a high-level summary of the steps for integrating your Apache Spark application with SageMaker.
- Continue data preprocessing using the Apache Spark library that you are familiar with. Your dataset remains a DataFrame in your Spark cluster. Load your data into a DataFrame and preprocess it so that you have a features column with org.apache.spark.ml.linalg.Vector of Doubles, and an optional label column with values of Double type.
- Use the estimator in the SageMaker Spark library to train your model. For example, if you choose the k-means algorithm provided by SageMaker for model training, call the KMeansSageMakerEstimator.fit method. Provide your DataFrame as input. The estimator returns a SageMakerModel object.

  Note: SageMakerModel extends the org.apache.spark.ml.Model class.

  The fit method does the following:

  - Converts the input DataFrame to the protobuf format. It does so by selecting the features and label columns from the input DataFrame, then uploads the protobuf data to an Amazon S3 bucket. The protobuf format is efficient for model training in SageMaker.
  - Starts model training in SageMaker by sending a SageMaker CreateTrainingJob request. After model training has completed, SageMaker saves the model artifacts to an S3 bucket. SageMaker assumes the IAM role that you specified for model training to perform tasks on your behalf. For example, it uses the role to read training data from an S3 bucket and to write model artifacts to a bucket.
  - Creates and returns a SageMakerModel object. The constructor does the following tasks, which are related to deploying your model to SageMaker:
    - Sends a CreateModel request to SageMaker.
    - Sends a CreateEndpointConfig request to SageMaker.
    - Sends a CreateEndpoint request to SageMaker, which then launches the specified resources and hosts the model on them.
- You can get inferences from your model hosted in SageMaker with the SageMakerModel.transform method. Provide an input DataFrame with features as input. The transform method transforms it to a DataFrame containing inferences. Internally, the transform method sends a request to the InvokeEndpoint SageMaker API to get inferences, and appends them to the input DataFrame.
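Putting the steps above together, a minimal Scala sketch of the train-then-infer flow might look like the following. It assumes a Spark cluster with the sagemaker-spark SDK available, a dataset in libsvm format (which already yields label and features columns), and an IAM role that SageMaker can assume; the S3 path and role ARN are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator

val spark = SparkSession.builder.appName("sagemaker-spark-kmeans").getOrCreate()

// Step 1: load and preprocess. A libsvm dataset already has "label" and
// "features" (Vector of Doubles) columns.
val dataset = spark.read.format("libsvm").load("s3://amzn-s3-demo-bucket/train/") // placeholder path

// Step 2: train with a SageMaker estimator. fit() uploads protobuf training
// data to S3, sends CreateTrainingJob, then deploys the model via
// CreateModel / CreateEndpointConfig / CreateEndpoint.
val estimator = new KMeansSageMakerEstimator(
  sagemakerRole = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"), // placeholder
  trainingInstanceType = "ml.p2.xlarge",
  trainingInstanceCount = 1,
  endpointInstanceType = "ml.c4.xlarge",
  endpointInitialInstanceCount = 1)
  .setK(10)        // number of clusters
  .setFeatureDim(784) // dimensionality of the feature vectors

val model = estimator.fit(dataset) // returns a SageMakerModel

// Step 3: get inferences. transform() calls InvokeEndpoint and appends
// inference columns to the input DataFrame.
val transformed = model.transform(dataset)
transformed.show()
```

This creates billable SageMaker training and hosting resources, so it is a sketch of the flow rather than something to run as-is; remember to delete the endpoint when you are done.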