

# Data transformation workloads with SageMaker Processing
<a name="processing-job"></a>

SageMaker Processing refers to SageMaker AI's capabilities to run data pre- and post-processing, feature engineering, and model evaluation tasks on SageMaker AI's fully managed infrastructure. These tasks are executed as [processing jobs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingJob.html). The following provides information and resources to learn about SageMaker Processing.

Using SageMaker Processing API, data scientists can run scripts and notebooks to process, transform, and analyze datasets to prepare them for machine learning. When combined with the other critical machine learning tasks provided by SageMaker AI, such as training and hosting, Processing provides you with the benefits of a fully managed machine learning environment, including all the security and compliance support built into SageMaker AI. You have the flexibility to use the built-in data processing containers or to bring your own containers for custom processing logic and then submit jobs to run on SageMaker AI managed infrastructure. 

**Note**  
 You can create a processing job programmatically by calling the [CreateProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html) API action in any language supported by SageMaker AI or by using the AWS CLI. For information on how this API action translates into a function in the language of your choice, see the [See Also](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html#API_CreateProcessingJob_SeeAlso) section of CreateProcessingJob and choose an SDK. As an example, for Python users, refer to the [Amazon SageMaker Processing](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html) section of the SageMaker Python SDK. Alternatively, see the full request syntax of [create\_processing\_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_processing_job.html) in the AWS SDK for Python (Boto3).
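As an illustrative sketch of the request shape (the job name, ARN, and image URI below are placeholders, not real resources), a minimal `create_processing_job` request might look like the following:

```python
# A minimal CreateProcessingJob request body. All names, ARNs, and URIs
# are placeholders; substitute your own resources before calling the API.
request = {
    "ProcessingJobName": "my-processing-job",
    "RoleArn": "arn:aws:iam::111122223333:role/MySageMakerRole",
    "AppSpecification": {
        "ImageUri": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    },
    "ProcessingResources": {
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

# With the request assembled, submit it with Boto3 (not executed here):
# boto3.client("sagemaker").create_processing_job(**request)
```

Optional fields such as `ProcessingInputs` and `ProcessingOutputConfig` are omitted here for brevity; see the `CreateProcessingJob` API reference for the full request syntax.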

The following diagram shows how Amazon SageMaker AI spins up a Processing job. Amazon SageMaker AI takes your script, copies your data from Amazon Simple Storage Service (Amazon S3), and then pulls a processing container. The underlying infrastructure for a Processing job is fully managed by Amazon SageMaker AI. After you submit a processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the Processing job is stored in the Amazon S3 bucket you specified. 

**Note**  
Your input data must be stored in an Amazon S3 bucket. Alternatively, you can use Amazon Athena or Amazon Redshift as input sources.
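For example, at the `CreateProcessingJob` API level, an Athena query can be declared as an input through a `DatasetDefinition` instead of an S3 location. The following sketch builds one entry of the `ProcessingInputs` list; the catalog, database, query, and bucket names are placeholders:

```python
# One entry of the CreateProcessingJob "ProcessingInputs" list that uses
# Amazon Athena as the input source. All identifiers are placeholders.
athena_input = {
    "InputName": "athena-data",
    "DatasetDefinition": {
        # Where the query results are made available inside the container.
        "LocalPath": "/opt/ml/processing/input/athena",
        "AthenaDatasetDefinition": {
            "Catalog": "AwsDataCatalog",
            "Database": "my_database",
            "QueryString": "SELECT * FROM my_table",
            # Query results are staged in S3 before being copied to LocalPath.
            "OutputS3Uri": "s3://amzn-s3-demo-bucket/athena-output/",
            "OutputFormat": "PARQUET",
        },
    },
}
```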

![Running a processing job.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png)


**Tip**  
To learn best practices for distributed computing of machine learning (ML) training and processing jobs in general, see [Distributed computing with SageMaker AI best practices](distributed-training-options.md).

## Use Amazon SageMaker Processing Sample Notebooks
<a name="processing-job-sample-notebooks"></a>

We provide two sample Jupyter notebooks that show how to perform data preprocessing, model evaluation, or both.

For a sample notebook that shows how to run scikit-learn scripts to perform data preprocessing and model training and evaluation with the SageMaker Python SDK for Processing, see [scikit-learn Processing](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation). This notebook also shows how to use your own custom container to run processing workloads with your Python libraries and other specific dependencies.

For a sample notebook that shows how to use Amazon SageMaker Processing to perform distributed data preprocessing with Spark, see [Distributed Processing (Spark)](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb). This notebook also shows how to train a regression model using XGBoost on the preprocessed dataset.

For instructions on how to create and access Jupyter notebook instances that you can use to run these samples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

## Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics
<a name="processing-job-cloudwatch"></a>

Amazon SageMaker Processing provides Amazon CloudWatch logs and metrics to monitor processing jobs. CloudWatch provides CPU, GPU, memory, GPU memory, and disk metrics, and event logging. For more information, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md) and [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md).
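As a sketch, per-instance metrics for processing jobs are published under the `/aws/sagemaker/ProcessingJobs` namespace with a `Host` dimension. The following builds (but does not send) a `get_metric_statistics` query for CPU utilization; the job name is a placeholder:

```python
from datetime import datetime, timedelta, timezone

# Build a CloudWatch query for the CPU utilization of a processing job's
# first instance. The Host dimension has the form "<job-name>/algo-<n>";
# "my-processing-job" is a placeholder.
end = datetime.now(timezone.utc)
params = {
    "Namespace": "/aws/sagemaker/ProcessingJobs",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "Host", "Value": "my-processing-job/algo-1"}],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 60,
    "Statistics": ["Average"],
}

# response = boto3.client("cloudwatch").get_metric_statistics(**params)
```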

# Run a Processing Job with Apache Spark
<a name="use-spark-processing-container"></a>

Apache Spark is a unified analytics engine for large-scale data processing. Amazon SageMaker AI provides prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs. The following provides an example of how to run an Amazon SageMaker Processing job using Apache Spark.

With the [Amazon SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk), you can easily apply data transformations and extract features (feature engineering) using the Spark framework. For information about using the SageMaker Python SDK to run Spark processing jobs, see [Data Processing with Spark](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html#data-processing-with-spark) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).

A code repository that contains the source code and Dockerfiles for the Spark images is available on [GitHub](https://github.com/aws/sagemaker-spark-container). 

You can use the [PySparkProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.spark.processing.PySparkProcessor) or [SparkJarProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.spark.processing.SparkJarProcessor) class to run your Spark application inside of a processing job. Note that you can set `MaxRuntimeInSeconds` to a maximum runtime limit of 5 days. For simple Spark workloads, the time to completion scales roughly linearly with the number of instances used.

 The following code example shows how to run a processing job that invokes your PySpark script `preprocess.py`. 

```
from sagemaker.spark.processing import PySparkProcessor

# Configure a Spark processing job that runs on two ml.m5.xlarge instances
# with a 20-minute maximum runtime.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Submit preprocess.py; the arguments list is passed to the script as
# command-line arguments.
spark_processor.run(
    submit_app="preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix,
               's3_output_bucket', bucket,
               's3_output_key_prefix', output_prefix]
)
```

 For an in-depth look, see the Distributed Data Processing with Apache Spark and SageMaker Processing [example notebook](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html). 

 If you are not using the [Amazon SageMaker AI Python SDK](https://sagemaker.readthedocs.io/) and one of its Processor classes to retrieve the pre-built images, you can retrieve these images yourself. The SageMaker prebuilt Docker images are stored in Amazon Elastic Container Registry (Amazon ECR). For a complete list of the available pre-built Docker images, see the [available images](https://github.com/aws/sagemaker-spark-container/blob/master/available_images.md) document. 

 To learn more about using the SageMaker Python SDK with Processing containers, see [Amazon SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/). 

# Run a Processing Job with scikit-learn
<a name="use-scikit-learn-processing-container"></a>

You can use Amazon SageMaker Processing to process data and evaluate models with scikit-learn scripts in a Docker image provided by Amazon SageMaker AI. The following provides an example of how to run an Amazon SageMaker Processing job using scikit-learn.

For a sample notebook that shows how to run scikit-learn scripts using a Docker image provided and maintained by SageMaker AI to preprocess data and evaluate models, see [scikit-learn Processing](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation). To use this notebook, you need to install the SageMaker AI Python SDK for Processing. 

This notebook runs a processing job using the `SKLearnProcessor` class from the SageMaker Python SDK to run a scikit-learn script that you provide. The script preprocesses data, trains a model using a SageMaker training job, and then runs a processing job to evaluate the trained model. The processing job estimates how the model is expected to perform in production.

To learn more about using the SageMaker Python SDK with Processing containers, see the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/). For a complete list of pre-built Docker images available for processing jobs, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).

The following code example shows how the notebook uses `SKLearnProcessor` to run your own scikit-learn script using a Docker image provided and maintained by SageMaker AI, instead of your own Docker image.

```
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Use the SageMaker-managed scikit-learn 0.20.0 image on a single instance.
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

# Run preprocessing.py. The input is downloaded from Amazon S3 to the
# container path given by destination, and each output directory is
# uploaded back to Amazon S3 when the job completes.
sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source='s3://path/to/my/input-data.csv',
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                               ProcessingOutput(source='/opt/ml/processing/output/validation'),
                               ProcessingOutput(source='/opt/ml/processing/output/test')]
                     )
```

To process data in parallel with scikit-learn on Amazon SageMaker Processing, you can shard input objects by S3 key by setting `s3_data_distribution_type='ShardedByS3Key'` on a `ProcessingInput` so that each instance receives about the same number of input objects.
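The actual object assignment is handled internally by SageMaker AI, but the balancing effect can be sketched conceptually: distributing N input objects across M instances leaves each instance with roughly N/M of them.

```python
def shard_by_key(keys, instance_count):
    """Conceptual round-robin sharding of S3 keys across instances.

    This only illustrates the balancing behavior; the real assignment
    is performed by SageMaker Processing, not by user code.
    """
    shards = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(keys)):
        shards[i % instance_count].append(key)
    return shards

# Ten input objects sharded across three instances: sizes differ by at most one.
keys = [f"part-{i:05d}.csv" for i in range(10)]
shards = shard_by_key(keys, 3)
sizes = [len(s) for s in shards]  # -> [4, 3, 3]
```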

# Data Processing with Framework Processors
<a name="processing-job-frameworks"></a>

A `FrameworkProcessor` can run Processing jobs with a specified machine learning framework, providing you with an Amazon SageMaker AI–managed container for whichever machine learning framework you choose. `FrameworkProcessor` provides premade containers for the following machine learning frameworks: Hugging Face, MXNet, PyTorch, TensorFlow, and XGBoost.

The `FrameworkProcessor` class also provides you with customization over the container configuration. The `FrameworkProcessor` class supports specifying a source directory `source_dir` for your processing scripts and dependencies. With this capability, you can give the processor access to multiple scripts in a directory instead of only specifying one script. `FrameworkProcessor` also supports including a `requirements.txt` file in the `source_dir` for customizing the Python libraries to install in the container.
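As a hypothetical illustration of that layout, the following creates a `scripts/` directory containing an entry script, a helper module it imports, and a `requirements.txt`; the file names and package choices are examples only:

```python
import tempfile
from pathlib import Path

# Build a hypothetical source_dir layout for a FrameworkProcessor run
# (file names and contents are illustrative only).
source_dir = Path(tempfile.mkdtemp()) / "scripts"
source_dir.mkdir()

# Dependencies that SageMaker Processing installs into the container.
(source_dir / "requirements.txt").write_text("pandas\nscikit-learn\n")

# The entry script passed as code=..., plus a helper module it imports.
(source_dir / "processing-script.py").write_text("import helpers\n")
(source_dir / "helpers.py").write_text("def clean(df):\n    return df.dropna()\n")

layout = sorted(p.name for p in source_dir.iterdir())
```

Passing `source_dir="scripts"` and `code="processing-script.py"` would then give the job access to all three files.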

For more information on the `FrameworkProcessor` class and its methods and parameters, see [FrameworkProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor) in the *Amazon SageMaker AI Python SDK*.

To see examples of using a `FrameworkProcessor` for each of the supported machine learning frameworks, see the following topics.

**Topics**
+ [Code example using HuggingFaceProcessor in the Amazon SageMaker Python SDK](processing-job-frameworks-hugging-face.md)
+ [MXNet Framework Processor](processing-job-frameworks-mxnet.md)
+ [PyTorch Framework Processor](processing-job-frameworks-pytorch.md)
+ [TensorFlow Framework Processor](processing-job-frameworks-tensorflow.md)
+ [XGBoost Framework Processor](processing-job-frameworks-xgboost.md)

# Code example using HuggingFaceProcessor in the Amazon SageMaker Python SDK
<a name="processing-job-frameworks-hugging-face"></a>

Hugging Face is an open-source provider of natural language processing (NLP) models. The `HuggingFaceProcessor` in the Amazon SageMaker Python SDK provides you with the ability to run processing jobs with Hugging Face scripts. When you use the `HuggingFaceProcessor`, you can leverage an Amazon-built Docker container with a managed Hugging Face environment so that you don't need to bring your own container.

The following code example shows how you can use the `HuggingFaceProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. Note that when you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can have a `requirements.txt` file located inside your `source_dir` directory that specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the HuggingFaceProcessor
hfp = HuggingFaceProcessor(
    role=get_execution_role(), 
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    transformers_version='4.4.2',
    pytorch_version='1.6.0', 
    base_job_name='frameworkprocessor-hf'
)

#Run the processing job
hfp.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train/', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test/', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='val', source='/opt/ml/processing/output/val/', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries you want to install in the container. The path for `source_dir` can be a relative, absolute, or Amazon S3 URI path. However, if you use an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you specify for `source_dir`. To learn more about the `HuggingFaceProcessor` class, see [Hugging Face Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) in the *Amazon SageMaker AI Python SDK*.
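For illustration, such a `requirements.txt` lists one package per line, optionally version-pinned; the packages shown here are examples, not requirements of the Hugging Face container:

```
datasets==1.5.0
nltk
sacrebleu
```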

# MXNet Framework Processor
<a name="processing-job-frameworks-mxnet"></a>

Apache MXNet is an open-source deep learning framework commonly used for training and deploying neural networks. The `MXNetProcessor` in the Amazon SageMaker Python SDK provides you with the ability to run processing jobs with MXNet scripts. When you use the `MXNetProcessor`, you can leverage an Amazon-built Docker container with a managed MXNet environment so that you don’t need to bring your own container.

The following code example shows how you can use the `MXNetProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. Note that when you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can have a `requirements.txt` file located inside your `source_dir` directory that specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.mxnet import MXNetProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the MXNetProcessor
mxp = MXNetProcessor(
    framework_version='1.8.0',
    py_version='py37',
    role=get_execution_role(), 
    instance_count=1,
    instance_type='ml.c5.xlarge',
    base_job_name='frameworkprocessor-mxnet'
)

#Run the processing job
mxp.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='processed_data',
            source='/opt/ml/processing/output/',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries you want to install in the container. The path for `source_dir` can be a relative, absolute, or Amazon S3 URI path. However, if you use an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you specify for `source_dir`. To learn more about the `MXNetProcessor` class, see [MXNet Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html#mxnet-estimator) in the *Amazon SageMaker Python SDK*.

# PyTorch Framework Processor
<a name="processing-job-frameworks-pytorch"></a>

PyTorch is an open-source machine learning framework. The `PyTorchProcessor` in the Amazon SageMaker Python SDK provides you with the ability to run processing jobs with PyTorch scripts. When you use the `PyTorchProcessor`, you can leverage an Amazon-built Docker container with a managed PyTorch environment so that you don’t need to bring your own container.

The following code example shows how you can use the `PyTorchProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. Note that when you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can have a `requirements.txt` file located inside your `source_dir` directory that specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

For the PyTorch versions supported by SageMaker AI, see the available [Deep Learning Container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

```
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the PyTorchProcessor
pytorch_processor = PyTorchProcessor(
    framework_version='1.8',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-PT'
)

#Run the processing job
pytorch_processor.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='data_structured', source='/opt/ml/processing/tmp/data_structured', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='validation', source='/opt/ml/processing/output/val', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='logs', source='/opt/ml/processing/logs', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries you want to install in the container. The path for `source_dir` can be a relative, absolute, or Amazon S3 URI path. However, if you use an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you specify for `source_dir`. To learn more about the `PyTorchProcessor` class, see [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) in the *Amazon SageMaker Python SDK*.

# TensorFlow Framework Processor
<a name="processing-job-frameworks-tensorflow"></a>

TensorFlow is an open-source machine learning and artificial intelligence library. The `TensorFlowProcessor` in the Amazon SageMaker Python SDK provides you with the ability to run processing jobs with TensorFlow scripts. When you use the `TensorFlowProcessor`, you can leverage an Amazon-built Docker container with a managed TensorFlow environment so that you don’t need to bring your own container.

The following code example shows how you can use the `TensorFlowProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. Note that when you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can have a `requirements.txt` file located inside your `source_dir` directory that specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.tensorflow import TensorFlowProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the TensorFlowProcessor
tp = TensorFlowProcessor(
    framework_version='2.3',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-TF',
    py_version='py37'
)

#Run the processing job
tp.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data'
        ),
        ProcessingInput(
            input_name='model',
            source=f's3://{BUCKET}/{S3_PATH_TO_MODEL}',
            destination='/opt/ml/processing/input/model'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='predictions',
            source='/opt/ml/processing/output',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries you want to install in the container. The path for `source_dir` can be a relative, absolute, or Amazon S3 URI path. However, if you use an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you specify for `source_dir`. To learn more about the `TensorFlowProcessor` class, see [TensorFlow Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) in the *Amazon SageMaker Python SDK*.

# XGBoost Framework Processor
<a name="processing-job-frameworks-xgboost"></a>

XGBoost is an open-source machine learning framework. The `XGBoostProcessor` in the Amazon SageMaker Python SDK provides you with the ability to run processing jobs with XGBoost scripts. When you use the `XGBoostProcessor`, you can leverage an Amazon-built Docker container with a managed XGBoost environment so that you don’t need to bring your own container.

The following code example shows how you can use the `XGBoostProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. Note that when you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can have a `requirements.txt` file located inside your `source_dir` directory that specifies the dependencies for your processing script(s). SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.xgboost import XGBoostProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the XGBoostProcessor
xgb = XGBoostProcessor(
    framework_version='1.2-2',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-XGB',
)

#Run the processing job
xgb.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='processed_data',
            source='/opt/ml/processing/output/',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries you want to install in the container. The path for `source_dir` can be a relative, absolute, or Amazon S3 URI path. However, if you use an Amazon S3 URI, then it must point to a tar.gz file. You can have multiple scripts in the directory you specify for `source_dir`. To learn more about the `XGBoostProcessor` class, see [XGBoost Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/xgboost.html) in the *Amazon SageMaker Python SDK*.

# Use Your Own Processing Code
<a name="use-your-own-processing-code"></a>

You can install libraries to run your scripts in your own processing container or, in a more advanced scenario, you can build your own processing container that satisfies the contract to run in Amazon SageMaker AI. For more information about containers in SageMaker AI, see [Docker containers for training and deploying models](docker-containers.md). For a formal specification that defines the contract for an Amazon SageMaker Processing container, see [How to Build Your Own Processing Container (Advanced Scenario)](build-your-own-processing-container.md). 

**Topics**
+ [Run Scripts with Your Own Processing Container](processing-container-run-scripts.md)
+ [How to Build Your Own Processing Container (Advanced Scenario)](build-your-own-processing-container.md)

# Run Scripts with Your Own Processing Container
<a name="processing-container-run-scripts"></a>

You can use scikit-learn scripts to preprocess data and evaluate your models. To see how to run scikit-learn scripts to perform these tasks, see the [scikit-learn Processing](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation) sample notebook. This notebook uses the `ScriptProcessor` class from the Amazon SageMaker Python SDK for Processing.

The following example shows a general workflow for using a `ScriptProcessor` class with your own processing container. The workflow shows how to create your own image, build your container, and use a `ScriptProcessor` class to run a Python preprocessing script with the container. The processing job processes your input data and saves the processed data in Amazon Simple Storage Service (Amazon S3).

Before using the following examples, you need to have your own input data and a Python script prepared to process your data. For an end-to-end, guided example of this process, refer back to the [scikit-learn Processing](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation) sample notebook.

1. Create a `docker` directory and add the Dockerfile used to create the processing container. Install pandas and scikit-learn in it. (You could also install your own dependencies with a similar `RUN` command.)

   ```
   mkdir docker
   ```

   If you are working in a Jupyter notebook, you can write the Dockerfile with the `%%writefile` cell magic:

   ```
   %%writefile docker/Dockerfile

   FROM python:3.7-slim-buster

   RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
   ENV PYTHONUNBUFFERED=TRUE

   ENTRYPOINT ["python3"]
   ```

1. Build the container using the docker command, create an Amazon Elastic Container Registry (Amazon ECR) repository, and push the image to Amazon ECR.

   ```
   import boto3
   
   account_id = boto3.client('sts').get_caller_identity().get('Account')
   region = boto3.Session().region_name
   ecr_repository = 'sagemaker-processing-container'
   tag = ':latest'
   processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)
   
   # Create ECR repository and push docker image
   !docker build -t $ecr_repository docker
   !aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
   !aws ecr create-repository --repository-name $ecr_repository
   !docker tag {ecr_repository + tag} $processing_repository_uri
   !docker push $processing_repository_uri
   ```

1. Set up the `ScriptProcessor` from the SageMaker Python SDK to run the script. Replace *image\_uri* with the URI for the image you created, and replace *role\_arn* with the ARN for an AWS Identity and Access Management role that has access to your target Amazon S3 bucket.

   ```
   from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
   
   script_processor = ScriptProcessor(command=['python3'],
                   image_uri='image_uri',
                   role='role_arn',
                   instance_count=1,
                   instance_type='ml.m5.xlarge')
   ```

1. Run the script. Replace *preprocessing.py* with the name of your own Python processing script, and replace *s3://path/to/my/input-data.csv* with the Amazon S3 path to your input data.

   ```
   script_processor.run(code='preprocessing.py',
                        inputs=[ProcessingInput(
                           source='s3://path/to/my/input-data.csv',
                           destination='/opt/ml/processing/input')],
                        outputs=[ProcessingOutput(source='/opt/ml/processing/output/train'),
                                  ProcessingOutput(source='/opt/ml/processing/output/validation'),
                                  ProcessingOutput(source='/opt/ml/processing/output/test')])
   ```

You can use the same procedure with any other library or system dependencies. You can also use existing Docker images. This includes images that you run on other platforms such as [Kubernetes](https://kubernetes.io/).

# How to Build Your Own Processing Container (Advanced Scenario)
<a name="build-your-own-processing-container"></a>

You can provide Amazon SageMaker Processing with a Docker image that has your own code and dependencies to run your data processing, feature engineering, and model evaluation workloads. The following provides information on how to build your own processing container.

The following example of a Dockerfile builds a container with the Python libraries scikit-learn and pandas, which you can run as a processing job. 

```
FROM python:3.7-slim-buster

# Install scikit-learn and pandas
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3

# Add a Python script and configure Docker to run it
ADD processing_script.py /
ENTRYPOINT ["python3", "/processing_script.py"]
```

For an example of a processing script, see [Get started with SageMaker Processing](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_processing/basic_sagemaker_data_processing/basic_sagemaker_processing.ipynb).

Build and push this Docker image to an Amazon Elastic Container Registry (Amazon ECR) repository and ensure that your SageMaker AI IAM role can pull the image from Amazon ECR. Then you can run this image on Amazon SageMaker Processing.

# How Amazon SageMaker Processing Runs Your Processing Container Image
<a name="byoc-run-image"></a>

Amazon SageMaker Processing runs your processing container image in a manner similar to the following command, where `AppSpecification.ImageUri` is the Amazon ECR image URI that you specify in a `CreateProcessingJob` operation. 

```
docker run [AppSpecification.ImageUri]
```

This command runs the `ENTRYPOINT` command configured in your Docker image. 

You can also override the entrypoint command in the image or give command-line arguments to your entrypoint command using the `AppSpecification.ContainerEntrypoint` and `AppSpecification.ContainerArguments` parameters in your `CreateProcessingJob` request. Specifying these parameters configures Amazon SageMaker Processing to run the container in a manner similar to the following command. 

```
 docker run --entrypoint [AppSpecification.ContainerEntrypoint] [AppSpecification.ImageUri] [AppSpecification.ContainerArguments]
```

For example, if you specify the `ContainerEntrypoint` to be `[python3, -v, /processing_script.py]` in your `CreateProcessingJob` request, and `ContainerArguments` to be `[data-format, csv]`, Amazon SageMaker Processing runs your container with the following command. 

```
 python3 -v /processing_script.py data-format csv 
```

 When building your processing container, consider the following details: 
+ Amazon SageMaker Processing decides whether the job completes or fails depending on the exit code of the command run. A processing job completes if all of the processing containers exit successfully with an exit code of 0, and fails if any of the containers exits with a non-zero exit code.
+ Amazon SageMaker Processing lets you override the processing container's entrypoint and set command-line arguments, just as you can with the Docker API. Docker images can also configure the entrypoint and command-line arguments using the `ENTRYPOINT` and `CMD` instructions. The way `CreateProcessingJob`'s `ContainerEntrypoint` and `ContainerArguments` parameters configure a Docker image's entrypoint and arguments mirrors how Docker overrides the entrypoint and arguments through the Docker API:
  + If neither `ContainerEntrypoint` nor `ContainerArguments` is provided, Processing uses the default `ENTRYPOINT` or `CMD` in the image.
  + If `ContainerEntrypoint` is provided, but not `ContainerArguments`, Processing runs the image with the given entrypoint, and ignores the `ENTRYPOINT` and `CMD` in the image.
  + If `ContainerArguments` is provided, but not `ContainerEntrypoint`, Processing runs the image with the default `ENTRYPOINT` in the image and with the provided arguments.
  + If both `ContainerEntrypoint` and `ContainerArguments` are provided, Processing runs the image with the given entrypoint and arguments, and ignores the `ENTRYPOINT` and `CMD` in the image.
+ You must use the exec form of the `ENTRYPOINT` instruction in your Dockerfile (`ENTRYPOINT ["executable", "param1", "param2"]`) instead of the shell form (`ENTRYPOINT command param1 param2`). This lets your processing container receive `SIGINT` and `SIGKILL` signals, which Processing uses to stop processing jobs with the `StopProcessingJob` API.
+ `/opt/ml` and all its subdirectories are reserved by SageMaker AI. When building your Processing Docker image, don't place any data required by your processing container in these directories.
+ If you plan to use GPU devices, make sure that your containers are nvidia-docker compatible. Include only the CUDA toolkit in containers. Don't bundle NVIDIA drivers with the image. For more information about nvidia-docker, see [NVIDIA/nvidia-docker](https://github.com/NVIDIA/nvidia-docker).

# How Amazon SageMaker Processing Configures Input and Output For Your Processing Container
<a name="byoc-input-and-output"></a>

When you create a processing job using the `CreateProcessingJob` operation, you can specify multiple `ProcessingInput` and `ProcessingOutput` values. 

You use the `ProcessingInput` parameter to specify an Amazon Simple Storage Service (Amazon S3) URI to download data from, and a path in your processing container to download the data to. The `ProcessingOutput` parameter configures a path in your processing container from which to upload data, and where in Amazon S3 to upload that data to. For both `ProcessingInput` and `ProcessingOutput`, the path in the processing container must begin with `/opt/ml/processing/`.

For example, you might create a processing job with one `ProcessingInput` parameter that downloads data from `s3://your-data-bucket/path/to/input/csv/data` into `/opt/ml/processing/csv` in your processing container, and a `ProcessingOutput` parameter that uploads data from `/opt/ml/processing/processed_csv` to `s3://your-data-bucket/path/to/output/csv/data`. Your processing job reads the input data and writes output data to `/opt/ml/processing/processed_csv`. The data written to this path is then uploaded to the specified Amazon S3 output location. 
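A container-side script for this example might look like the following sketch. Only the `/opt/ml/processing/csv` and `/opt/ml/processing/processed_csv` local paths come from the example above; the transform itself is a placeholder.

```
from pathlib import Path

def process(input_dir, output_dir):
    # Placeholder transform: copy each input CSV to the output
    # directory with its rows uppercased. Replace with your own logic.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for csv_file in Path(input_dir).glob("*.csv"):
        rows = csv_file.read_text().splitlines()
        (out / csv_file.name).write_text("\n".join(r.upper() for r in rows))

# Inside the container, SageMaker Processing downloads the input to the
# first path and uploads whatever the script writes to the second:
# process("/opt/ml/processing/csv", "/opt/ml/processing/processed_csv")
```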

**Important**  
Symbolic links (symlinks) cannot be used to upload output data to Amazon S3; symlinks are not followed when output data is uploaded. 

# How Amazon SageMaker Processing Provides Logs and Metrics for Your Processing Container
<a name="byoc-logs-and-metrics"></a>

When your processing container writes to `stdout` or `stderr`, Amazon SageMaker Processing saves the output from each processing container and puts it in Amazon CloudWatch Logs. For information about logging, see [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md).

Amazon SageMaker Processing also provides CloudWatch metrics for each instance running your processing container. For information about metrics, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). 

## How Amazon SageMaker Processing Configures Your Processing Container
<a name="byoc-config"></a>

Amazon SageMaker Processing provides configuration information to your processing container through environment variables and two JSON files, `/opt/ml/config/processingjobconfig.json` and `/opt/ml/config/resourceconfig.json`, at predefined locations in the container. 

When a processing job starts, it uses the environment variables that you specified with the `Environment` map in the `CreateProcessingJob` request. The `/opt/ml/config/processingjobconfig.json` file contains the configuration of the processing job, as specified in the `CreateProcessingJob` request. 

The following example shows the format of the `/opt/ml/config/processingjobconfig.json` file.

```
{
    "ProcessingJobArn": "<processing_job_arn>",
    "ProcessingJobName": "<processing_job_name>",
    "AppSpecification": {
        "ImageUri": "<image_uri>",
        "ContainerEntrypoint": null,
        "ContainerArguments": null
    },
    "Environment": {
        "KEY": "VALUE"
    },
    "ProcessingInputs": [
        {
            "InputName": "input-1",
            "S3Input": {
                "LocalPath": "/opt/ml/processing/input/dataset",
                "S3Uri": "<s3_uri>",
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "S3CompressionType": "None",
                "S3DownloadMode": "StartOfJob"
            }
        }
    ],
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "output-1",
                "S3Output": {
                    "LocalPath": "/opt/ml/processing/output/dataset",
                    "S3Uri": "<s3_uri>",
                    "S3UploadMode": "EndOfJob"
                }
            }
        ],
        "KmsKeyId": null
    },
    "ProcessingResources": {
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
            "VolumeKmsKeyId": null
        }
    },
    "RoleArn": "<IAM role>",
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    }
}
```
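Inside the container, your code can read this file to discover its local input and output paths instead of hard-coding them. A minimal sketch, using only the file path and JSON keys shown above (the helper names are ours):

```
import json

def load_job_config(path="/opt/ml/config/processingjobconfig.json"):
    # Parse the job configuration that SageMaker Processing writes
    # into the container at startup.
    with open(path) as f:
        return json.load(f)

def local_paths(config):
    # Collect the local container paths of all inputs and outputs.
    inputs = [i["S3Input"]["LocalPath"] for i in config["ProcessingInputs"]]
    outputs = [o["S3Output"]["LocalPath"]
               for o in config["ProcessingOutputConfig"]["Outputs"]]
    return inputs, outputs
```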

The `/opt/ml/config/resourceconfig.json` file contains information about the hostnames of your processing containers. Use the following hostnames when creating or running distributed processing code.

```
{
  "current_host": "algo-1",
  "hosts": ["algo-1","algo-2","algo-3"]
}
```

Don't use the information about hostnames contained in `/etc/hostname` or `/etc/hosts` because it might be inaccurate.

Hostname information might not be immediately available to the processing container. We recommend adding a retry policy on hostname resolution operations as nodes become available in the cluster.
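One way to implement such a retry is sketched below. The file path and JSON keys are the ones shown above; the retry count and delay are assumptions you should tune for your cluster size.

```
import json
import time

def read_resource_config(path="/opt/ml/config/resourceconfig.json",
                         retries=5, delay=2.0):
    # Retry because the file may not exist, or may be incomplete,
    # until all nodes in the cluster have come up.
    for attempt in range(retries):
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# config = read_resource_config()
# peers = [h for h in config["hosts"] if h != config["current_host"]]
```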

# Save and Access Metadata Information About Your Processing Job
<a name="byoc-metadata"></a>

To save metadata from a processing container, the container can write UTF-8 encoded text to the `/opt/ml/output/message` file before it exits. After the processing job enters any terminal status (`Completed`, `Stopped`, or `Failed`), the `ExitMessage` field returned by [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) contains the first 1 KB of this file. For failed processing jobs, you can use this field to communicate information about why the processing container failed.

**Important**  
Don't write sensitive data to the `/opt/ml/output/message` file. 

If the data in this file isn't UTF-8 encoded, the job fails and returns a `ClientError`. If multiple containers exit with an `ExitMessage`, the contents of the `ExitMessage` from each processing container are concatenated and then truncated to 1 KB.
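For example, a container might record its outcome with a small helper like the following sketch. The helper and its default path argument are ours; only the `/opt/ml/output/message` location is defined by SageMaker Processing.

```
from pathlib import Path

def write_exit_message(text, path="/opt/ml/output/message"):
    # Write a UTF-8 message for the ExitMessage field; the service
    # keeps only the first 1 KB, so put the important part first.
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(text, encoding="utf-8")

# Typical use in an exception handler:
# try:
#     run_processing()
# except Exception as err:
#     write_exit_message(f"Processing failed: {err}")
#     raise
```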

# Run Your Processing Container Using the SageMaker AI Python SDK
<a name="byoc-run"></a>

You can use the SageMaker Python SDK to run your own processing image by using the `Processor` class. The following example shows how to run your own processing container with one input from Amazon Simple Storage Service (Amazon S3) and one output to Amazon S3.

```
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(image_uri='<your_ecr_image_uri>',
                      role=role,
                      instance_count=1,
                      instance_type='ml.m5.xlarge')

processor.run(inputs=[ProcessingInput(
                  source='<s3_uri or local path>',
                  destination='/opt/ml/processing/input_data')],
              outputs=[ProcessingOutput(
                  source='/opt/ml/processing/processed_data',
                  destination='<s3_uri>')])
```

Instead of building your processing code into your processing image, you can provide a `ScriptProcessor` with your image and the command that you want to run, along with the code that you want to run inside that container. For an example, see [Run Scripts with Your Own Processing Container](processing-container-run-scripts.md).

You can also use the scikit-learn image that Amazon SageMaker Processing provides through `SKLearnProcessor` to run scikit-learn scripts. For an example, see [Run a Processing Job with scikit-learn](use-scikit-learn-processing-container.md). 