

# Apache Spark with Amazon SageMaker AI
<a name="apache-spark"></a>

Amazon SageMaker AI Spark is an open source Spark library that helps you build Spark machine learning (ML) pipelines with SageMaker AI. The library simplifies the integration of Spark ML stages with SageMaker AI stages, such as model training and hosting. For information about SageMaker AI Spark, see the [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository. The following topics explain how to use Apache Spark with SageMaker AI.

The SageMaker AI Spark library is available in Python and Scala. You can use SageMaker AI Spark to train models in SageMaker AI using `org.apache.spark.sql.DataFrame` data frames in your Spark clusters. After model training, you can also host the model using SageMaker AI hosting services. 

The SageMaker AI Spark library, `com.amazonaws.services.sagemaker.sparksdk`, provides the following classes, among others:
+ `SageMakerEstimator`—Extends the `org.apache.spark.ml.Estimator` interface. You can use this estimator for model training in SageMaker AI.
+ `KMeansSageMakerEstimator`, `PCASageMakerEstimator`, and `XGBoostSageMakerEstimator`—Extend the `SageMakerEstimator` class. 
+ `SageMakerModel`—Extends the `org.apache.spark.ml.Model` class. You can use this `SageMakerModel` for model hosting and getting inferences in SageMaker AI.

You can download the source code for both Python Spark (PySpark) and Scala libraries from the [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository.

For installation and examples of the SageMaker AI Spark library, see [SageMaker AI Spark for Scala examples](apache-spark-example1.md) or [Resources for using SageMaker AI Spark for Python (PySpark) examples](apache-spark-additional-examples.md).

If you use Amazon EMR to manage your Spark clusters, see [Apache Spark](https://aws.amazon.com/emr/features/spark/). For more information about using Amazon EMR with SageMaker AI, see [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md).

**Topics**
+ [Integrate your Apache Spark application with SageMaker AI](#spark-sdk-common-process)
+ [SageMaker AI Spark for Scala examples](apache-spark-example1.md)
+ [Resources for using SageMaker AI Spark for Python (PySpark) examples](apache-spark-additional-examples.md)

## Integrate your Apache Spark application with SageMaker AI
<a name="spark-sdk-common-process"></a>

The following is a high-level summary of the steps for integrating your Apache Spark application with SageMaker AI.

1. Continue data preprocessing using the Apache Spark library that you are familiar with. Your dataset remains a `DataFrame` in your Spark cluster. Load your data into a `DataFrame`. Preprocess it so that you have a `features` column with `org.apache.spark.ml.linalg.Vector` of `Doubles`, and an optional `label` column with values of `Double`​ type.

1. Use the estimator in the SageMaker AI Spark library to train your model. For example, if you choose the k-means algorithm provided by SageMaker AI for model training, call the `KMeansSageMakerEstimator.fit` method. 

   Provide your `DataFrame` as input. The estimator returns a `SageMakerModel` object. 
**Note**  
`SageMakerModel` extends the `org.apache.spark.ml.Model` class.

   The `fit` method does the following: 

   1. Converts the input `DataFrame` to the protobuf format. It does so by selecting the `features` and `label` columns from the input `DataFrame`. It then uploads the protobuf data to an Amazon S3 bucket. The protobuf format is efficient for model training in SageMaker AI.

   1. Starts model training in SageMaker AI by sending a SageMaker AI [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. After model training completes, SageMaker AI saves the model artifacts to an S3 bucket. 

      SageMaker AI assumes the IAM role that you specified for model training to perform tasks on your behalf. For example, it uses the role to read training data from an S3 bucket and to write model artifacts to a bucket. 

   1. Creates and returns a `SageMakerModel` object. The constructor does the following tasks, which are related to deploying your model to SageMaker AI. 

      1. Sends a [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) request to SageMaker AI. 

      1. Sends a [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) request to SageMaker AI.

      1. Sends a [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) request to SageMaker AI, which then launches the specified resources and hosts the model on them. 

1. Get inferences from your model hosted in SageMaker AI with the `SageMakerModel.transform` method. 

   Provide an input `DataFrame` with features as input. The `transform` method transforms it to a `DataFrame` containing inferences. Internally, the `transform` method sends a request to the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) SageMaker AI API to get inferences, and appends those inferences to the input `DataFrame`.
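The `fit` method's work described above boils down to uploading protobuf-formatted data to Amazon S3 and sending a `CreateTrainingJob` request. The following pure-Python sketch assembles a pared-down request body of that kind; the job name, role ARN, image URI, and S3 paths are placeholder assumptions, and in practice the SageMaker AI Spark library builds and sends the full request for you.

```python
# Sketch of a minimal CreateTrainingJob request body of the kind that
# SageMakerEstimator.fit assembles on your behalf. All concrete values
# below (job name, role, image URI, S3 paths) are placeholders.

def build_training_job_request(job_name, role_arn, image_uri,
                               train_s3_uri, output_s3_uri):
    """Return a CreateTrainingJob request body as a plain dict."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,  # SageMaker AI assumes this role to read/write S3
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,  # protobuf data uploaded by fit()
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.p2.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 60 * 60},
    }

request = build_training_job_request(
    job_name="kmeans-mnist-example",
    role_arn="arn:aws:iam::account-id:role/rolename",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
    train_s3_uri="s3://my-bucket/spark/mnist/train/",
    output_s3_uri="s3://my-bucket/output/",
)
```

The real request the library sends contains additional fields (for example, hyperparameters); this sketch only shows the overall shape.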

# SageMaker AI Spark for Scala examples
<a name="apache-spark-example1"></a>

Amazon SageMaker AI provides an Apache Spark library ([SageMaker AI Spark](https://github.com/aws/sagemaker-spark/tree/master/sagemaker-spark-sdk)) that you can use to integrate your Apache Spark applications with SageMaker AI. This topic contains examples to help get you started with SageMaker AI Spark with Scala. For information about the SageMaker AI Apache Spark library, see [Apache Spark with Amazon SageMaker AI](apache-spark.md).

**Download Spark for Scala**

You can download the source code and examples for both Python Spark (PySpark) and Scala libraries from the [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository.

For detailed instructions on installing the SageMaker AI Spark library, see [SageMaker AI Spark](https://github.com/aws/sagemaker-spark/tree/master/sagemaker-spark-sdk).

The SageMaker AI Spark SDK for Scala is available in the Maven central repository. Add the Spark library to your project by adding the appropriate dependency to your `pom.xml` file:
+ If your project depends on Spark 2.2, add the following to your pom.xml file:

  ```
  <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>sagemaker-spark_2.11</artifactId>
      <version>spark_2.2.0-1.0</version>
  </dependency>
  ```
+ If your project depends on Spark 2.1, add the following to your pom.xml file:

  ```
  <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>sagemaker-spark_2.11</artifactId>
      <version>spark_2.1.1-1.0</version>
  </dependency>
  ```

**Spark for Scala example**

This section provides example code that uses the Apache Spark Scala library provided by SageMaker AI to train a model in SageMaker AI using `DataFrame`s in your Spark cluster. It is followed by examples that show how to [Use Custom Algorithms for Model Training and Hosting on Amazon SageMaker AI with Apache Spark](apache-spark-example1-cust-algo.md) and [Use the SageMakerEstimator in a Spark Pipeline](apache-spark-example1-extend-pipeline.md).

The following example hosts the resulting model artifacts using SageMaker AI hosting services. For more details on this example, see [Getting Started: K-Means Clustering on SageMaker AI with SageMaker AI Spark SDK](https://github.com/aws/sagemaker-spark?tab=readme-ov-file#getting-started-k-means-clustering-on-sagemaker-with-sagemaker-spark-sdk). Specifically, this example does the following:
+ Uses the `KMeansSageMakerEstimator` to fit (or train) a model on data

  Because the example uses the k-means algorithm provided by SageMaker AI to train a model, you use the `KMeansSageMakerEstimator`. You train the model using images of handwritten single-digit numbers (from the MNIST dataset). You provide the images as an input `DataFrame`. For your convenience, SageMaker AI provides this dataset in an Amazon S3 bucket.

  In response, the estimator returns a `SageMakerModel` object.
+ Obtains inferences using the trained `SageMakerModel`

  To get inferences from a model hosted in SageMaker AI, you call the `SageMakerModel.transform` method. You pass a `DataFrame` as input. The method transforms the input `DataFrame` to another `DataFrame` containing inferences obtained from the model. 

  For a given input image of a handwritten single-digit number, the inference identifies a cluster that the image belongs to. For more information, see [K-Means Algorithm](k-means.md).

```
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator

val spark = SparkSession.builder.getOrCreate

// load mnist data as a dataframe from libsvm
val region = "us-east-1"
val trainingData = spark.read.format("libsvm")
  .option("numFeatures", "784")
  .load(s"s3://sagemaker-sample-data-$region/spark/mnist/train/")
val testData = spark.read.format("libsvm")
  .option("numFeatures", "784")
  .load(s"s3://sagemaker-sample-data-$region/spark/mnist/test/")

val roleArn = "arn:aws:iam::account-id:role/rolename"

val estimator = new KMeansSageMakerEstimator(
  sagemakerRole = IAMRole(roleArn),
  trainingInstanceType = "ml.p2.xlarge",
  trainingInstanceCount = 1,
  endpointInstanceType = "ml.c4.xlarge",
  endpointInitialInstanceCount = 1)
  .setK(10).setFeatureDim(784)

// train
val model = estimator.fit(trainingData)

val transformedData = model.transform(testData)
transformedData.show
```

The example code does the following:
+ Loads the MNIST dataset from an S3 bucket provided by SageMaker AI (`sagemaker-sample-data-us-east-1`) into a Spark `DataFrame` (`trainingData`):

  ```
  // Get a Spark session.
  
  val spark = SparkSession.builder.getOrCreate
  
  // load mnist data as a dataframe from libsvm
  val region = "us-east-1"
  val trainingData = spark.read.format("libsvm")
    .option("numFeatures", "784")
    .load(s"s3://sagemaker-sample-data-$region/spark/mnist/train/")
  val testData = spark.read.format("libsvm")
    .option("numFeatures", "784")
    .load(s"s3://sagemaker-sample-data-$region/spark/mnist/test/")
  
  val roleArn = "arn:aws:iam::account-id:role/rolename"
  trainingData.show()
  ```

  The `show` method displays the first 20 rows in the data frame:

  ```
  +-----+--------------------+
  |label|            features|
  +-----+--------------------+
  |  5.0|(784,[152,153,154...|
  |  0.0|(784,[127,128,129...|
  |  4.0|(784,[160,161,162...|
  |  1.0|(784,[158,159,160...|
  |  9.0|(784,[208,209,210...|
  |  2.0|(784,[155,156,157...|
  |  1.0|(784,[124,125,126...|
  |  3.0|(784,[151,152,153...|
  |  1.0|(784,[152,153,154...|
  |  4.0|(784,[134,135,161...|
  |  3.0|(784,[123,124,125...|
  |  5.0|(784,[216,217,218...|
  |  3.0|(784,[143,144,145...|
  |  6.0|(784,[72,73,74,99...|
  |  1.0|(784,[151,152,153...|
  |  7.0|(784,[211,212,213...|
  |  2.0|(784,[151,152,153...|
  |  8.0|(784,[159,160,161...|
  |  6.0|(784,[100,101,102...|
  |  9.0|(784,[209,210,211...|
  +-----+--------------------+
  only showing top 20 rows
  ```

  In each row:
  + The `label` column identifies the image's label. For example, if the image of the handwritten number is the digit 5, the label value is 5. 
  + The `features` column stores a vector (`org.apache.spark.ml.linalg.Vector`) of `Double` values. These are the 784 features of the handwritten number. (Each handwritten number is a 28 x 28-pixel image, making 784 features.)
+ Creates a SageMaker AI estimator (`KMeansSageMakerEstimator`) 

  The `fit` method of this estimator uses the k-means algorithm provided by SageMaker AI to train models using an input `DataFrame`. In response, it returns a `SageMakerModel` object that you can use to get inferences.
**Note**  
The `KMeansSageMakerEstimator` extends the SageMaker AI `SageMakerEstimator`, which extends the Apache Spark `Estimator`. 

  ```
  val estimator = new KMeansSageMakerEstimator(
    sagemakerRole = IAMRole(roleArn),
    trainingInstanceType = "ml.p2.xlarge",
    trainingInstanceCount = 1,
    endpointInstanceType = "ml.c4.xlarge",
    endpointInitialInstanceCount = 1)
    .setK(10).setFeatureDim(784)
  ```

   

  The constructor parameters provide information that is used for training a model and deploying it on SageMaker AI:
  + `trainingInstanceType` and `trainingInstanceCount`—Identify the type and number of ML compute instances to use for model training.
  + `endpointInstanceType`—Identifies the ML compute instance type to use when hosting the model in SageMaker AI. By default, one ML compute instance is assumed.
  + `endpointInitialInstanceCount`—Identifies the number of ML compute instances initially backing the endpoint hosting the model in SageMaker AI.
  + `sagemakerRole`—SageMaker AI assumes this IAM role to perform tasks on your behalf. For example, for model training, it reads data from S3 and writes training results (model artifacts) to S3. 
**Note**  
This example implicitly creates a SageMaker AI client. To create this client, you must provide your credentials. The API uses these credentials to authenticate requests to SageMaker AI. For example, it uses the credentials to authenticate requests to create a training job and API calls for deploying the model using SageMaker AI hosting services.
  + After creating the `KMeansSageMakerEstimator` object, you also set the following parameters, which are used in model training: 
    + `setK`—Sets the number of clusters that the k-means algorithm creates during model training. You specify 10 clusters, one for each digit, 0 through 9. 
    + `setFeatureDim`—Sets the number of features in each input image. Each handwritten number is a 28 x 28-pixel image, making 784 features.
+ Calls the estimator `fit` method

  ```
  // train
  val model = estimator.fit(trainingData)
  ```

  You pass the input `DataFrame` as a parameter. The `fit` method does all the work of training the model and deploying it to SageMaker AI. For more information, see [Integrate your Apache Spark application with SageMaker AI](apache-spark.md#spark-sdk-common-process). In response, you get a `SageMakerModel` object, which you can use to get inferences from your model deployed in SageMaker AI. 

  You provide only the input `DataFrame`. You don't need to specify the registry path to the k-means algorithm used for model training because the `KMeansSageMakerEstimator` knows it.
+ Calls the `SageMakerModel.transform` method to get inferences from the model deployed in SageMaker AI.

  The `transform` method takes a `DataFrame` as input, transforms it, and returns another `DataFrame` containing inferences obtained from the model. 

  ```
  val transformedData = model.transform(testData)
  transformedData.show
  ```

  In this example, you pass the test `DataFrame` (`testData`) as input to the `transform` method. The `transform` method does the following:
  + Serializes the `features` column in the input `DataFrame` to protobuf and sends it to the SageMaker AI endpoint for inference.
  + Deserializes the protobuf response into the two additional columns (`distance_to_cluster` and `closest_cluster`) in the transformed `DataFrame`.

  The `show` method displays the inferences for the first 20 rows in the input `DataFrame`: 

  ```
  +-----+--------------------+-------------------+---------------+
  |label|            features|distance_to_cluster|closest_cluster|
  +-----+--------------------+-------------------+---------------+
  |  5.0|(784,[152,153,154...|  1767.897705078125|            4.0|
  |  0.0|(784,[127,128,129...|  1392.157470703125|            5.0|
  |  4.0|(784,[160,161,162...| 1671.5711669921875|            9.0|
  |  1.0|(784,[158,159,160...| 1182.6082763671875|            6.0|
  |  9.0|(784,[208,209,210...| 1390.4002685546875|            0.0|
  |  2.0|(784,[155,156,157...|  1713.988037109375|            1.0|
  |  1.0|(784,[124,125,126...| 1246.3016357421875|            2.0|
  |  3.0|(784,[151,152,153...|  1753.229248046875|            4.0|
  |  1.0|(784,[152,153,154...|  978.8394165039062|            2.0|
  |  4.0|(784,[134,135,161...|  1623.176513671875|            3.0|
  |  3.0|(784,[123,124,125...|  1533.863525390625|            4.0|
  |  5.0|(784,[216,217,218...|  1469.357177734375|            6.0|
  |  3.0|(784,[143,144,145...|  1736.765869140625|            4.0|
  |  6.0|(784,[72,73,74,99...|   1473.69384765625|            8.0|
  |  1.0|(784,[151,152,153...|    944.88720703125|            2.0|
  |  7.0|(784,[211,212,213...| 1285.9071044921875|            3.0|
  |  2.0|(784,[151,152,153...| 1635.0125732421875|            1.0|
  |  8.0|(784,[159,160,161...| 1436.3162841796875|            6.0|
  |  6.0|(784,[100,101,102...| 1499.7366943359375|            7.0|
  |  9.0|(784,[209,210,211...| 1364.6319580078125|            6.0|
  +-----+--------------------+-------------------+---------------+
  ```

  You can interpret the data as follows:
  + A handwritten number with the `label` 5 belongs to cluster 4 (`closest_cluster`).
  + A handwritten number with the `label` 0 belongs to cluster 5.
  + A handwritten number with the `label` 4 belongs to cluster 9.
  + A handwritten number with the `label` 1 belongs to cluster 6.
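In these listings, the `features` column is printed in Spark ML's sparse-vector notation: the vector size, then the indices of the nonzero entries, then (truncated in the output above) their values. The following pure-Python sketch performs the same dense-to-sparse conversion, independent of Spark:

```python
# Convert a dense pixel list (e.g., a flattened 28 x 28 image, 784 values)
# into (size, indices, values), mirroring how Spark ML prints a sparse
# vector such as (784,[152,153,154...],[...]).

def to_sparse(dense):
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

# A toy 784-feature image with three nonzero pixels.
image = [0.0] * 784
image[152], image[153], image[154] = 0.6, 0.9, 0.3

size, indices, values = to_sparse(image)
print(size, indices, values)  # 784 [152, 153, 154] [0.6, 0.9, 0.3]
```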

**Topics**
+ [Use Custom Algorithms for Model Training and Hosting on Amazon SageMaker AI with Apache Spark](apache-spark-example1-cust-algo.md)
+ [Use the SageMakerEstimator in a Spark Pipeline](apache-spark-example1-extend-pipeline.md)

# Use Custom Algorithms for Model Training and Hosting on Amazon SageMaker AI with Apache Spark
<a name="apache-spark-example1-cust-algo"></a>

In [SageMaker AI Spark for Scala examples](apache-spark-example1.md), you use the `KMeansSageMakerEstimator` because the example uses the k-means algorithm provided by Amazon SageMaker AI for model training. You might choose to use your own custom algorithm for model training instead. Assuming that you have already created a Docker image, you can create your own `SageMakerEstimator` and specify the Amazon Elastic Container Registry path for your custom image. 

The following example shows how to create the equivalent of a `KMeansSageMakerEstimator` using the `SageMakerEstimator` directly. In the new estimator, you explicitly specify the Docker registry path to your training and inference code images.

```
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.SageMakerEstimator
import com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.ProtobufRequestRowSerializer
import com.amazonaws.services.sagemaker.sparksdk.transformation.deserializers.KMeansProtobufResponseRowDeserializer

val estimator = new SageMakerEstimator(
  trainingImage =
    "811284229777.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
  modelImage =
    "811284229777.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",
  requestRowSerializer = new ProtobufRequestRowSerializer(),
  responseRowDeserializer = new KMeansProtobufResponseRowDeserializer(),
  hyperParameters = Map("k" -> "10", "feature_dim" -> "784"),
  sagemakerRole = IAMRole(roleArn),
  trainingInstanceType = "ml.p2.xlarge",
  trainingInstanceCount = 1,
  endpointInstanceType = "ml.c4.xlarge",
  endpointInitialInstanceCount = 1,
  trainingSparkDataFormat = "sagemaker")
```

In the code, the parameters in the `SageMakerEstimator` constructor include the following:
+ `trainingImage`—Identifies the Docker registry path to the training image containing your custom code.
+ `modelImage`—Identifies the Docker registry path to the image containing inference code.
+ `requestRowSerializer`—Implements `com.amazonaws.services.sagemaker.sparksdk.transformation.RequestRowSerializer`.

  This parameter serializes rows in the input `DataFrame` to send them to the model hosted in SageMaker AI for inference.
+ `responseRowDeserializer`—Implements `com.amazonaws.services.sagemaker.sparksdk.transformation.ResponseRowDeserializer`.

  This parameter deserializes responses from the model hosted in SageMaker AI back into a `DataFrame`.
+ `trainingSparkDataFormat`—Specifies the data format that Spark uses when uploading training data from a `DataFrame` to S3. For example, `"sagemaker"` for protobuf format, `"csv"` for comma-separated values, and `"libsvm"` for LibSVM format. 

You can implement your own `RequestRowSerializer` and `ResponseRowDeserializer` to serialize and deserialize rows in a data format that your inference code supports, such as LibSVM or CSV.
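As an illustration of the kind of row-level conversion such a serializer performs, the following pure-Python sketch parses one LibSVM-formatted line (the format the `"libsvm"` option refers to) into a label and sparse features. It is a toy parser for illustration only, not part of the SageMaker AI Spark library:

```python
# Toy parser for one LibSVM-formatted row: "label index:value index:value ...".
# LibSVM indices are conventionally 1-based; this sketch converts them to 0-based.

def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx) - 1] = float(val)
    return label, features

label, features = parse_libsvm_line("5.0 153:0.6 154:0.9 155:0.3")
print(label, features)  # 5.0 {152: 0.6, 153: 0.9, 154: 0.3}
```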

# Use the SageMakerEstimator in a Spark Pipeline
<a name="apache-spark-example1-extend-pipeline"></a>

You can use `org.apache.spark.ml.Estimator` estimators and `org.apache.spark.ml.Model` models, and `SageMakerEstimator` estimators and `SageMakerModel` models in `org.apache.spark.ml.Pipeline` pipelines, as shown in the following example:

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.sagemaker.sparksdk.IAMRole
import com.amazonaws.services.sagemaker.sparksdk.algorithms
import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator
import com.amazonaws.services.sagemaker.sparksdk.transformation.serializers.ProtobufRequestRowSerializer

val spark = SparkSession.builder.getOrCreate

// load mnist data as a dataframe from libsvm
val region = "us-east-1"
val trainingData = spark.read.format("libsvm")
  .option("numFeatures", "784")
  .load(s"s3://sagemaker-sample-data-$region/spark/mnist/train/")
val testData = spark.read.format("libsvm")
  .option("numFeatures", "784")
  .load(s"s3://sagemaker-sample-data-$region/spark/mnist/test/")

// substitute your SageMaker IAM role here
val roleArn = "arn:aws:iam::account-id:role/rolename"

val pcaEstimator = new PCA()
  .setInputCol("features")
  .setOutputCol("projectedFeatures")
  .setK(50)

val kMeansSageMakerEstimator = new KMeansSageMakerEstimator(
  sagemakerRole = IAMRole(roleArn),
  requestRowSerializer =
    new ProtobufRequestRowSerializer(featuresColumnName = "projectedFeatures"),
  trainingSparkDataFormatOptions = Map("featuresColumnName" -> "projectedFeatures"),
  trainingInstanceType = "ml.p2.xlarge",
  trainingInstanceCount = 1,
  endpointInstanceType = "ml.c4.xlarge",
  endpointInitialInstanceCount = 1)
  .setK(10).setFeatureDim(50)

val pipeline = new Pipeline().setStages(Array(pcaEstimator, kMeansSageMakerEstimator))

// train
val pipelineModel = pipeline.fit(trainingData)

val transformedData = pipelineModel.transform(testData)
transformedData.show()
```

The `trainingSparkDataFormatOptions` parameter configures Spark to serialize the `projectedFeatures` column to protobuf for model training. Spark also serializes the `label` column to protobuf by default.

Because we want to make inferences using the `projectedFeatures` column, we pass that column name into the `ProtobufRequestRowSerializer`.

The following example shows a transformed `DataFrame`:

```
+-----+--------------------+--------------------+-------------------+---------------+
|label|            features|   projectedFeatures|distance_to_cluster|closest_cluster|
+-----+--------------------+--------------------+-------------------+---------------+
|  5.0|(784,[152,153,154...|[880.731433034386...|     1500.470703125|            0.0|
|  0.0|(784,[127,128,129...|[1768.51722024166...|      1142.18359375|            4.0|
|  4.0|(784,[160,161,162...|[704.949236329314...|  1386.246826171875|            9.0|
|  1.0|(784,[158,159,160...|[-42.328192193771...| 1277.0736083984375|            5.0|
|  9.0|(784,[208,209,210...|[374.043902028333...|   1211.00927734375|            3.0|
|  2.0|(784,[155,156,157...|[941.267714528850...|  1496.157958984375|            8.0|
|  1.0|(784,[124,125,126...|[30.2848596410594...| 1327.6766357421875|            5.0|
|  3.0|(784,[151,152,153...|[1270.14374062052...| 1570.7674560546875|            0.0|
|  1.0|(784,[152,153,154...|[-112.10792566485...|     1037.568359375|            5.0|
|  4.0|(784,[134,135,161...|[452.068280676606...| 1165.1236572265625|            3.0|
|  3.0|(784,[123,124,125...|[610.596447285397...|  1325.953369140625|            7.0|
|  5.0|(784,[216,217,218...|[142.959601818422...| 1353.4930419921875|            5.0|
|  3.0|(784,[143,144,145...|[1036.71862533658...| 1460.4315185546875|            7.0|
|  6.0|(784,[72,73,74,99...|[996.740157435754...| 1159.8631591796875|            2.0|
|  1.0|(784,[151,152,153...|[-107.26076167417...|   960.963623046875|            5.0|
|  7.0|(784,[211,212,213...|[619.771820430940...|   1245.13623046875|            6.0|
|  2.0|(784,[151,152,153...|[850.152101817161...|  1304.437744140625|            8.0|
|  8.0|(784,[159,160,161...|[370.041887230547...| 1192.4781494140625|            0.0|
|  6.0|(784,[100,101,102...|[546.674328209335...|    1277.0908203125|            2.0|
|  9.0|(784,[209,210,211...|[-29.259112927426...| 1245.8182373046875|            6.0|
+-----+--------------------+--------------------+-------------------+---------------+
```
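Conceptually, `Pipeline.fit` fits each stage in order, feeding every stage the output of the previous one, and the resulting `PipelineModel` chains the fitted transforms. The following pure-Python sketch mimics these semantics with hypothetical toy stages; the stand-ins are not Spark classes:

```python
# Minimal sketch of Pipeline.fit / PipelineModel.transform semantics.
# The stage objects here are hypothetical stand-ins, not Spark classes.

class Scale:
    """Toy estimator: learns a scale factor, like PCA learning components."""
    def fit(self, data):
        factor = max(data) or 1.0
        return lambda xs: [x / factor for x in xs]

class Shift:
    """Toy estimator: learns an offset, like k-means learning centroids."""
    def fit(self, data):
        mean = sum(data) / len(data)
        return lambda xs: [x - mean for x in xs]

def pipeline_fit(stages, data):
    """Fit each stage in order, feeding it the previous stage's output."""
    models = []
    for stage in stages:
        model = stage.fit(data)   # fit on the current representation
        data = model(data)        # feed transformed data to the next stage
        models.append(model)
    return models

def pipeline_transform(models, data):
    """Apply the fitted transforms in order, like PipelineModel.transform."""
    for model in models:
        data = model(data)
    return data

models = pipeline_fit([Scale(), Shift()], [2.0, 4.0])
print(pipeline_transform(models, [4.0]))  # [0.25]
```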

# Resources for using SageMaker AI Spark for Python (PySpark) examples
<a name="apache-spark-additional-examples"></a>

Amazon SageMaker AI provides an Apache Spark Python library ([SageMaker AI PySpark](https://github.com/aws/sagemaker-spark/tree/master/sagemaker-pyspark-sdk)) that you can use to integrate your Apache Spark applications with SageMaker AI. This topic contains examples to help you get started with PySpark. For information about the SageMaker AI Apache Spark library, see [Apache Spark with Amazon SageMaker AI](apache-spark.md).

**Download PySpark**

You can download the source code for both Python Spark (PySpark) and Scala libraries from the [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository.

For instructions on installing the SageMaker AI Spark library, use any of the following options or visit [SageMaker AI PySpark](https://github.com/aws/sagemaker-spark/tree/master/sagemaker-pyspark-sdk).
+ Install using pip:

  ```
  pip install sagemaker_pyspark
  ```
+ Install from the source:

  ```
  git clone git@github.com:aws/sagemaker-spark.git
  cd sagemaker-pyspark-sdk
  python setup.py install
  ```
+ You can also create a new notebook in a notebook instance that uses either the `Sparkmagic (PySpark)` or the `Sparkmagic (PySpark3)` kernel and connect to a remote Amazon EMR cluster.
**Note**  
The Amazon EMR cluster must be configured with an IAM role that has the `AmazonSageMakerFullAccess` policy attached. For information about configuring roles for an EMR cluster, see [Configure IAM Roles for Amazon EMR Permissions to AWS Services](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html) in the *Amazon EMR Management Guide*.

**PySpark examples**

For examples on using SageMaker AI PySpark, see:
+ [Using Amazon SageMaker AI with Apache Spark](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-spark/index.html) in Read the Docs.
+ [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository.