Feature Processing with Spark ML and Scikit-learn
Before training a model with either Amazon SageMaker built-in algorithms or custom algorithms, you can use Spark and scikit-learn preprocessors to transform your data and engineer features.
Feature Processing with Spark ML
You can run Spark ML jobs with AWS Glue, a serverless ETL (extract, transform, load) service, from your SageMaker notebook. You can also connect to an existing Amazon EMR cluster to run Spark ML jobs there. To run jobs with AWS Glue, you need an AWS Identity and Access Management (IAM) role that grants your SageMaker notebook permission to make calls to AWS Glue.
Note
To see which Python and Spark versions AWS Glue supports, refer to AWS Glue Release Notes.
After engineering features, you package and serialize Spark ML jobs with MLeap into MLeap containers that you can add to an inference pipeline. You don't need to use externally managed Spark clusters. With this approach, you can seamlessly scale from a sample of rows to terabytes of data. The same transformers work for both training and inference, so you don't need to duplicate preprocessing and feature engineering logic or develop a one-time solution to persist the models. With inference pipelines, you don't need to maintain outside infrastructure, and you can make predictions directly from data inputs.
When you run a Spark ML job on AWS Glue, a Spark ML pipeline is serialized into MLeap format.
For an example that shows how to perform feature processing with Spark ML, see the Train an ML Model using Apache Spark in Amazon EMR and deploy in SageMaker example.
Feature Processing with Scikit-Learn
You can package scikit-learn jobs into containers and run them directly in Amazon SageMaker.
For an example of Python code that builds a scikit-learn featurizer model trained on Fisher's Iris flower data set, see the Amazon SageMaker example notebooks.
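A featurizer of this kind can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions: the choice of StandardScaler followed by PCA is one plausible featurization, not the specific transformations used in the referenced example, and in SageMaker this code would typically live in a training script whose fitted featurizer is persisted (for example with joblib) and deployed as the first container of an inference pipeline.

```python
# Minimal sketch: a scikit-learn featurizer trained on the Iris data set.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Featurizer: standardize the four flower measurements, then project
# them onto two principal components.
featurizer = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

features = featurizer.fit_transform(X)
print(features.shape)  # 150 samples reduced to 2 engineered features
```

Because the featurizer is itself a fitted estimator, calling `featurizer.transform()` at inference time applies exactly the preprocessing learned during training.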