

# Batch transform for inference with Amazon SageMaker AI
<a name="batch-transform"></a>

Use batch transform when you need to do the following: 
+ Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
+ Get inferences from large datasets.
+ Run inference when you don't need a persistent endpoint.
+ Associate input records with inferences to help with the interpretation of results.

To filter input data before performing inferences or to associate input records with inferences about those records, see [Associate Prediction Results with Input Records](batch-transform-data-processing.md). For example, you can filter input data to provide context for creating and interpreting reports about the output data.

**Topics**
+ [Use batch transform to get inferences from large datasets](#batch-transform-large-datasets)
+ [Speed up a batch transform job](#batch-transform-reduce-time)
+ [Use batch transform to test production variants](#batch-transform-test-variants)
+ [Batch transform sample notebooks](#batch-transform-notebooks)
+ [Associate Prediction Results with Input Records](batch-transform-data-processing.md)
+ [Storage in Batch Transform](batch-transform-storage.md)
+ [Troubleshooting](batch-transform-errors.md)

## Use batch transform to get inferences from large datasets
<a name="batch-transform-large-datasets"></a>

Batch transform automatically manages the processing of large datasets within the limits of specified parameters. For example, suppose that you have a dataset file, `input1.csv`, stored in an S3 bucket. The content of the input file might look like the following example.

```
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
Record2-Attribute1, Record2-Attribute2, Record2-Attribute3, ..., Record2-AttributeM
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
...
RecordN-Attribute1, RecordN-Attribute2, RecordN-Attribute3, ..., RecordN-AttributeM
```

When a batch transform job starts, SageMaker AI starts compute instances and distributes the inference or preprocessing workload between them. Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances. When you have multiple files, one instance might process `input1.csv`, and another instance might process the file named `input2.csv`. If you have one input file but initialize multiple compute instances, only one instance processes the input file. The rest of the instances are idle.

You can also split input files into mini-batches. For example, you might create a mini-batch from `input1.csv` by including only two of the records.

```
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
Record4-Attribute1, Record4-Attribute2, Record4-Attribute3, ..., Record4-AttributeM
```

**Note**  
SageMaker AI processes each input file separately. It doesn't combine mini-batches from different input files to comply with the [`MaxPayloadInMB`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB) limit.

To split input files into mini-batches when you create a batch transform job, set the [`SplitType`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType) parameter value to `Line`. SageMaker AI uses the entire input file in a single request when:
+ `SplitType` is set to `None`.
+ An input file can't be split into mini-batches.

Note that Batch Transform doesn't support CSV-formatted input that contains embedded newline characters. You can control the size of the mini-batches by using the [`BatchStrategy`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-BatchStrategy) and [`MaxPayloadInMB`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxPayloadInMB) parameters. `MaxPayloadInMB` must not be greater than 100 MB. If you specify the optional [`MaxConcurrentTransforms`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxConcurrentTransforms) parameter, then the value of `(MaxConcurrentTransforms * MaxPayloadInMB)` must also not exceed 100 MB.
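As a sketch, the constraints above can be checked before you submit a job. The helper name below is hypothetical; it only assembles and validates the documented batching fields, which you would then merge into a real `CreateTransformJob` request.

```python
def make_batching_config(split_type="Line", batch_strategy="MultiRecord",
                         max_payload_mb=6, max_concurrent_transforms=None):
    """Assemble the batching-related fields of a CreateTransformJob request,
    enforcing the documented limits (hypothetical helper, not an AWS API)."""
    if max_payload_mb > 100:
        raise ValueError("MaxPayloadInMB must not be greater than 100 MB")
    if max_concurrent_transforms is not None:
        # The product of the two parameters must also stay within 100 MB.
        if max_concurrent_transforms * max_payload_mb > 100:
            raise ValueError(
                "MaxConcurrentTransforms * MaxPayloadInMB must not exceed 100 MB")
    config = {
        "BatchStrategy": batch_strategy,
        "MaxPayloadInMB": max_payload_mb,
        "TransformInput": {"SplitType": split_type},
    }
    if max_concurrent_transforms is not None:
        config["MaxConcurrentTransforms"] = max_concurrent_transforms
    return config
```

Validating these values locally avoids a failed job submission when the combined payload limit is exceeded.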

If the batch transform job successfully processes all of the records in an input file, it creates an output file. The output file has the same name and the `.out` file extension. For multiple input files, such as `input1.csv` and `input2.csv`, the output files are named `input1.csv.out` and `input2.csv.out`. The batch transform job stores the output files in the specified location in Amazon S3, such as `s3://amzn-s3-demo-bucket/output/`. 

The predictions in an output file are listed in the same order as the corresponding records in the input file. The output file `input1.csv.out`, based on the input file shown earlier, would look like the following.

```
Inference1-Attribute1, Inference1-Attribute2, Inference1-Attribute3, ..., Inference1-AttributeM
Inference2-Attribute1, Inference2-Attribute2, Inference2-Attribute3, ..., Inference2-AttributeM
Inference3-Attribute1, Inference3-Attribute2, Inference3-Attribute3, ..., Inference3-AttributeM
...
InferenceN-Attribute1, InferenceN-Attribute2, InferenceN-Attribute3, ..., InferenceN-AttributeM
```

If you set [`SplitType`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType) to `Line`, you can set the [`AssembleWith`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html#SageMaker-Type-TransformOutput-AssembleWith) parameter to `Line` to concatenate the output records with a line delimiter. This does not change the number of output files. The number of output files is equal to the number of input files, and using `AssembleWith` does not merge files. If you don't specify the `AssembleWith` parameter, the output records are concatenated in a binary format by default.

If the input data is very large and is transmitted using HTTP chunked encoding, set [`MaxPayloadInMB`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB) to `0` to stream the data to the algorithm. Amazon SageMaker AI built-in algorithms don't support this feature.

For information about using the API to create a batch transform job, see the [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API. For more information about the relationship between batch transform input and output objects, see [`OutputDataConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputDataConfig.html). For an example of how to use batch transform, see [(Optional) Make Prediction with Batch Transform](ex1-model-deployment.md#ex1-batch-transform).
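The request described in this section can be sketched as follows. The helper name, bucket paths, and instance type below are placeholder assumptions; the function only assembles the request dictionary, which you would pass to `boto3.client("sagemaker").create_transform_job(**request)`.

```python
def build_transform_job_request(job_name, model_name, input_s3, output_s3,
                                instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble a CreateTransformJob request for line-split CSV input.
    Hypothetical helper; all names and paths are placeholders."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                            "S3Uri": input_s3}},
            "ContentType": "text/csv",
            # Split each input file into mini-batches, one record per line.
            "SplitType": "Line",
        },
        "TransformOutput": {
            "S3OutputPath": output_s3,
            "Accept": "text/csv",
            # Reassemble output records with a line delimiter.
            "AssembleWith": "Line",
        },
        "TransformResources": {"InstanceType": instance_type,
                               "InstanceCount": instance_count},
    }
```

Each input file under the `S3Uri` prefix produces a corresponding `.out` file under `S3OutputPath`, as described above.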

## Speed up a batch transform job
<a name="batch-transform-reduce-time"></a>

If you are using the [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as [`MaxPayloadInMB`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxPayloadInMB), [`MaxConcurrentTransforms`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-MaxConcurrentTransforms), or [`BatchStrategy`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-BatchStrategy). The ideal value for `MaxConcurrentTransforms` is equal to the number of compute workers in the batch transform job. 

If you are using the SageMaker AI console, specify these optimal parameter values in the **Additional configuration** section of the **Batch transform job configuration** page. SageMaker AI automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an [execution-parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containe-serves-requests) endpoint.
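For a custom algorithm container, the execution-parameters endpoint mentioned above returns these tuning values as JSON. The following is a minimal sketch of the response body such an endpoint could produce; the specific values (8 concurrent transforms, 6 MB payload) are illustrative assumptions, not recommendations.

```python
import json

def execution_parameters_response():
    """Sketch of the JSON body a custom container might return from its
    GET /execution-parameters endpoint. Values are illustrative only."""
    return json.dumps({
        "MaxConcurrentTransforms": 8,
        "BatchStrategy": "MultiRecord",
        "MaxPayloadInMB": 6,
    })
```

Your container's web server would serve this body in response to SageMaker AI's `GET /execution-parameters` request before the transform job starts sending inference requests.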

## Use batch transform to test production variants
<a name="batch-transform-test-variants"></a>

To test different models or hyperparameter settings, create a separate transform job for each new model variant and use a validation dataset. For each transform job, specify a unique model name and location in Amazon S3 for the output file. To analyze the results, use [Inference Pipeline Logs and Metrics](inference-pipeline-logs-metrics.md).
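The per-variant setup above can be sketched as a loop that builds one transform-job configuration per model, each with its own output location. The helper name and S3 paths are hypothetical; you would submit each configuration as a separate `CreateTransformJob` request.

```python
def variant_transform_configs(model_names, validation_s3, output_prefix):
    """Build one transform-job config per model variant, each writing to a
    distinct S3 output path so results can be compared side by side.
    Hypothetical helper; names and paths are placeholders."""
    configs = []
    for name in model_names:
        configs.append({
            "TransformJobName": f"eval-{name}",
            "ModelName": name,
            "TransformInput": {"DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix", "S3Uri": validation_s3}}},
            # Unique output prefix per variant keeps results separate.
            "TransformOutput": {"S3OutputPath": f"{output_prefix}/{name}"},
        })
    return configs
```

Running every variant against the same validation dataset, with separate output prefixes, makes the downstream comparison of predictions straightforward.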

## Batch transform sample notebooks
<a name="batch-transform-notebooks"></a>

For a sample notebook that uses batch transform, see [Batch Transform with PCA and DBSCAN Movie Clusters](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.html). This notebook uses batch transform with a principal component analysis (PCA) model to reduce data in a user-item review matrix. It then shows the application of a density-based spatial clustering of applications with noise (DBSCAN) algorithm to cluster movies.

For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker Examples** tab to see a list of all the SageMaker AI examples. The batch transform example notebooks are located in the **Advanced functionality** section. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# Associate Prediction Results with Input Records
<a name="batch-transform-data-processing"></a>

When making predictions on a large dataset, you can exclude attributes that aren't needed for prediction. After the predictions have been made, you can associate some of the excluded attributes with those predictions or with other input data in your report. By using batch transform to perform these data processing steps, you can often eliminate additional preprocessing or postprocessing. You can use input files in JSON and CSV format only. 

**Topics**
+ [Workflow for Associating Inferences with Input Records](#batch-transform-data-processing-workflow)
+ [Use Data Processing in Batch Transform Jobs](#batch-transform-data-processing-steps)
+ [Supported JSONPath Operators](#data-processing-operators)
+ [Batch Transform Examples](#batch-transform-data-processing-examples)

## Workflow for Associating Inferences with Input Records
<a name="batch-transform-data-processing-workflow"></a>

The following diagram shows the workflow for associating inferences with input records.

![\[The workflow for associating inferences with input records.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/batch-transform-data-processing.png)


To associate inferences with input data, there are three main steps:

1. Filter the input data that is not needed for inference before passing the input data to the batch transform job. Use the [`InputFilter`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-InputFilter) parameter to determine which attributes to use as input for the model.

1. Associate the input data with the inference results. Use the [`JoinSource`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-JoinSource) parameter to combine the input data with the inference.

1. Filter the joined data to retain the inputs that are needed to provide context for interpreting the predictions in the reports. Use [`OutputFilter`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-OutputFilter) to store the specified portion of the joined dataset in the output file.

## Use Data Processing in Batch Transform Jobs
<a name="batch-transform-data-processing-steps"></a>

When creating a batch transform job with [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) to process data:

1. Specify the portion of the input to pass to the model with the `InputFilter` parameter in the `DataProcessing` data structure. 

1. Join the raw input data with the transformed data with the `JoinSource` parameter.

1. Specify which portion of the joined input and transformed data from the batch transform job to include in the output file with the `OutputFilter` parameter.

1.  Choose either JSON- or CSV-formatted files for input: 
   + For JSON- or JSON Lines-formatted input files, SageMaker AI either adds the `SageMakerOutput` attribute to the input file or creates a new JSON output file with the `SageMakerInput` and `SageMakerOutput` attributes. For more information, see [`DataProcessing`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DataProcessing.html). 
   + For CSV-formatted input files, the joined input data is followed by the transformed data and the output is a CSV file.

If you use an algorithm with the `DataProcessing` structure, it must support your chosen format for *both* input and output files. For example, with the [`TransformOutput`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html) field of the `CreateTransformJob` API, you must set both the [`ContentType`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType) and [`Accept`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html#SageMaker-Type-TransformOutput-Accept) parameters to one of the following values: `text/csv`, `application/json`, or `application/jsonlines`. The syntax for specifying columns in a CSV file differs from the syntax for specifying attributes in a JSON file. Using the wrong syntax causes an error. For more information, see [Batch Transform Examples](#batch-transform-data-processing-examples). For more information about input and output file formats for built-in algorithms, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

The record delimiters for the input and output must also be consistent with your chosen file input. The [`SplitType`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType) parameter indicates how to split the records in the input dataset. The [`AssembleWith`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html#SageMaker-Type-TransformOutput-AssembleWith) parameter indicates how to reassemble the records for the output. If you set the input and output formats to `text/csv`, you must also set the `SplitType` and `AssembleWith` parameters to `Line`. If you set the input and output formats to `application/jsonlines`, you can set both `SplitType` and `AssembleWith` to `Line`.

For CSV files, you cannot use embedded newline characters. For JSON files, the attribute name `SageMakerOutput` is reserved for output. The JSON input file can't have an attribute with this name. If it does, the data in the input file might be overwritten. 

## Supported JSONPath Operators
<a name="data-processing-operators"></a>

To filter and join the input data and inference, use a JSONPath subexpression. SageMaker AI supports only a subset of the defined JSONPath operators. The following table lists the supported JSONPath operators. For CSV data, each row is treated as a JSON array, so only index-based JSONPath expressions can be applied, such as `$[0]` or `$[1:]`. CSV data should also follow the [RFC 4180 format](https://tools.ietf.org/html/rfc4180).


| JSONPath Operator | Description | Example | 
| --- | --- | --- | 
| $ |  The root element of a query. This operator is required at the beginning of all path expressions.  | `$` | 
| .<name> |  A dot-notated child element.  |  `$.id`  | 
| \* |  A wildcard. Use in place of an attribute name or numeric value.  |  `$.id.*`  | 
| ['<name>' (,'<name>')] |  A bracket-notated element or multiple child elements.  |  `$['id','SageMakerOutput']`  | 
| [<number> (,<number>)] |  An index or array of indexes. Negative index values are also supported. A `-1` index refers to the last element in an array.  |  `$[1]`, `$[1,3,5]`  | 
| [<start>:<end>] |  An array slice operator that extracts a section of an array. If you omit *<start>*, SageMaker AI uses the first element of the array. If you omit *<end>*, SageMaker AI uses the last element of the array.  |  `$[2:5]`, `$[:5]`, `$[2:]`  | 

When using the bracket-notation to specify multiple child elements of a given field, additional nesting of children within brackets is not supported. For example, `$.field1.['child1','child2']` is supported while `$.field1.['child1','child2.grandchild']` is not. 

For more information about JSONPath operators, see [JsonPath](https://github.com/json-path/JsonPath) on GitHub.

## Batch Transform Examples
<a name="batch-transform-data-processing-examples"></a>

The following examples show some common ways to join input data with prediction results.

**Topics**
+ [Example: Output Only Inferences](#batch-transform-data-processing-example-default)
+ [Example: Output Inferences Joined with Input Data](#batch-transform-data-processing-example-all)
+ [Example: Output Inferences Joined with Input Data and Exclude the ID Column from the Input (CSV)](#batch-transform-data-processing-example-select-csv)
+ [Example: Output Inferences Joined with an ID Column and Exclude the ID Column from the Input (CSV)](#batch-transform-data-processing-example-select-json)

### Example: Output Only Inferences
<a name="batch-transform-data-processing-example-default"></a>

By default, the [`DataProcessing`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-CreateTransformJob-request-DataProcessing) parameter doesn't join inference results with input. It outputs only the inference results.

To explicitly specify that inference results should not be joined with the input, use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and specify the following settings in a transformer call.

```
sm_transformer = sagemaker.transformer.Transformer(…)
sm_transformer.transform(…, input_filter="$", join_source="None", output_filter="$")
```

To output only inferences using the AWS SDK for Python (Boto 3), add the following code to your `CreateTransformJob` request. The following code mimics the default behavior.

```
{
    "DataProcessing": {
        "InputFilter": "$",
        "JoinSource": "None",
        "OutputFilter": "$"
    }
}
```

### Example: Output Inferences Joined with Input Data
<a name="batch-transform-data-processing-example-all"></a>

If you're using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to combine the input data with the inferences in the output file, specify the `assemble_with` and `accept` parameters when initializing the transformer object. When you use the transform call, specify `Input` for the `join_source` parameter, and specify the `split_type` and `content_type` parameters as well. The `split_type` parameter must have the same value as `assemble_with`, and the `content_type` parameter must have the same value as `accept`. For more information about the parameters and their accepted values, see the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) page in the *Amazon SageMaker AI Python SDK*.

```
sm_transformer = sagemaker.transformer.Transformer(…, assemble_with="Line", accept="text/csv")
sm_transformer.transform(…, join_source="Input", split_type="Line", content_type="text/csv")
```

If you're using the AWS SDK for Python (Boto 3), join all input data with the inference by adding the following code to your [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request. The values for `Accept` and `ContentType` must match, and the values for `AssembleWith` and `SplitType` must also match.

```
{
    "DataProcessing": {
        "JoinSource": "Input"
    },
    "TransformOutput": {
        "Accept": "text/csv",
        "AssembleWith": "Line"
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line"
    }
}
```

For JSON or JSON Lines input files, SageMaker AI stores the inference results under the `SageMakerOutput` key. For example, if the input is a JSON file that contains the key-value pair `{"key":1}` and the data transform result is `{"label":1}`, the output file contains both, as follows.

```
{
    "key":1,
    "SageMakerOutput":{"label":1}
}
```

**Note**  
The joined result for JSON must be a key-value pair object. If the input isn't a key-value pair object, SageMaker AI creates a new JSON file. In the new JSON file, the input data is stored in the `SageMakerInput` key and the results are stored as the `SageMakerOutput` value.

For a CSV file, if the input record is `[1,2,3]` and the label result is `[1]`, then the output file contains `[1,2,3,1]`.
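The CSV and JSON join behavior described above can be illustrated with two small functions. These are local simulations for clarity only, not part of any SageMaker API; the function names are hypothetical.

```python
def join_csv_record(input_record, inference):
    """Illustrate the CSV join: the joined input data is followed by the
    transformed (inference) data in the output row."""
    return input_record + inference

def join_json_record(input_record, inference):
    """Illustrate the JSON join: results are stored under the reserved
    SageMakerOutput attribute alongside the original keys."""
    joined = dict(input_record)
    joined["SageMakerOutput"] = inference
    return joined
```

For instance, `join_csv_record([1, 2, 3], [1])` reproduces the `[1,2,3,1]` output row from the example above.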

### Example: Output Inferences Joined with Input Data and Exclude the ID Column from the Input (CSV)
<a name="batch-transform-data-processing-example-select-csv"></a>

If you are using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to join your input data with the inference output while excluding an ID column from the transformer input, specify the same parameters from the preceding example as well as a JSONPath subexpression for the `input_filter` in your transformer call. For example, if your input data includes five columns and the first one is the ID column, use the following transform request to select all columns except the ID column as features. The transformer still outputs all of the input columns joined with the inferences. For more information about the parameters and their accepted values, see the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) page in the *Amazon SageMaker AI Python SDK*.

```
sm_transformer = sagemaker.transformer.Transformer(…, assemble_with="Line", accept="text/csv")
sm_transformer.transform(…, split_type="Line", content_type="text/csv", input_filter="$[1:]", join_source="Input")
```

If you are using the AWS SDK for Python (Boto 3), add the following code to your [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request.

```
{
    "DataProcessing": {
        "InputFilter": "$[1:]",
        "JoinSource": "Input"
    },
    "TransformOutput": {
        "Accept": "text/csv",
        "AssembleWith": "Line"
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line"
    }
}
```

To specify columns in SageMaker AI, use the index of the array elements. The first column is index 0, the second column is index 1, and the sixth column is index 5.

To exclude the first column from the input, set [`InputFilter`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#SageMaker-Type-DataProcessing-InputFilter) to `"$[1:]"`. The colon (`:`) tells SageMaker AI to include all of the elements between two values, inclusive. For example, `$[1:4]` specifies the second through fifth columns.

If you omit the number after the colon, for example `$[5:]`, the subset includes all columns from the sixth column through the last column. If you omit the number before the colon, for example `$[:5]`, the subset includes all columns from the first column (index 0) through the sixth column.
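Because the JSONPath slice here includes the end index (unlike Python's exclusive slices), the indexing rules above can be made concrete with a small local emulation. The function below is an illustration only, assuming the inclusive-slice behavior described in this section.

```python
def jsonpath_slice(row, start=None, end=None):
    """Emulate the inclusive [<start>:<end>] JSONPath slice that batch
    transform applies to a CSV row. Unlike Python slicing, <end> is
    included; omitting <start> uses the first element, omitting <end>
    uses the last. Illustration only."""
    if start is None:
        start = 0
    if end is None:
        return row[start:]        # through the last element
    return row[start:end + 1]     # inclusive of the end index
```

For a six-column row, `$[1:4]` selects the second through fifth columns, `$[:5]` the first through sixth, and `$[5:]` only the sixth.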

### Example: Output Inferences Joined with an ID Column and Exclude the ID Column from the Input (CSV)
<a name="batch-transform-data-processing-example-select-json"></a>

If you are using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), you can specify the output to join only specific input columns (such as the ID column) with the inferences by specifying the `output_filter` in the transformer call. The `output_filter` uses a JSONPath subexpression to specify which columns to return as output after joining the input data with the inference results. The following request shows how you can make predictions while excluding an ID column and then join the ID column with the inferences. Note that in the following example, the last column (`-1`) of the output contains the inferences. If you are using JSON files, SageMaker AI stores the inference results in the attribute `SageMakerOutput`. For more information about the parameters and their accepted values, see the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) page in the *Amazon SageMaker AI Python SDK*.

```
sm_transformer = sagemaker.transformer.Transformer(…, assemble_with="Line", accept="text/csv")
sm_transformer.transform(…, split_type="Line", content_type="text/csv", input_filter="$[1:]", join_source="Input", output_filter="$[0,-1]")
```

If you are using the AWS SDK for Python (Boto 3), join only the ID column with the inferences by adding the following code to your [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request.

```
{
    "DataProcessing": {
        "InputFilter": "$[1:]",
        "JoinSource": "Input",
        "OutputFilter": "$[0,-1]"
    },
    "TransformOutput": {
        "Accept": "text/csv",
        "AssembleWith": "Line"
    },
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line"
    }
}
```

**Warning**  
If you are using a JSON-formatted input file, the file can't contain the attribute name `SageMakerOutput`. This attribute name is reserved for the inferences in the output file. If your JSON-formatted input file contains an attribute with this name, values in the input file might be overwritten with the inference.

# Storage in Batch Transform
<a name="batch-transform-storage"></a>

When you run a batch transform job, Amazon SageMaker AI attaches an Amazon Elastic Block Store storage volume to Amazon EC2 instances that process your job. The volume stores your model, and the size of the storage volume is fixed at 30 GB. You have the option to encrypt your model at rest in the storage volume.

**Note**  
If you have a large model, you may encounter an `InternalServerError`.

For more information about Amazon EBS storage and features, see the following pages:
+ [Amazon EBS](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html) in the Amazon EC2 User Guide
+ [Amazon EBS volumes](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volumes.html) in the Amazon EC2 User Guide

**Note**  
G4dn instances come with their own local SSD storage. To learn more about G4dn instances, see the [Amazon EC2 G4 Instances](https://aws.amazon.com/ec2/instance-types/g4/) page.

# Troubleshooting
<a name="batch-transform-errors"></a>

If you encounter errors in Amazon SageMaker AI Batch Transform, refer to the following troubleshooting tips.

## Max timeout errors
<a name="batch-transform-errors-max-timeout"></a>

If you are getting max timeout errors when running batch transform jobs, try the following:
+ Begin with the single-record [`BatchStrategy`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-BatchStrategy), a batch size of the default (6 MB) or smaller (which you specify in the [`MaxPayloadInMB`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxPayloadInMB) parameter), and a small sample dataset. Tune the maximum timeout parameter [`InvocationsTimeoutInSeconds`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelClientConfig.html#sagemaker-Type-ModelClientConfig-InvocationsTimeoutInSeconds) (which has a maximum of 1 hour) until you receive a successful invocation response.
+ After you receive a successful invocation response, increase the `MaxPayloadInMB` (which has a maximum of 100 MB) and the `InvocationsTimeoutInSeconds` parameters together to find the maximum batch size that can support your desired model timeout. You can use either the single-record or multi-record `BatchStrategy` in this step.
**Note**  
Exceeding the `MaxPayloadInMB` limit causes an error. This might happen with a large dataset if it can't be split, the `SplitType` parameter is set to `None`, or individual records within the dataset exceed the limit.
+ (Optional) Tune the [`MaxConcurrentTransforms`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxConcurrentTransforms) parameter, which specifies the maximum number of parallel requests that can be sent to each instance in a batch transform job. However, the value of `MaxConcurrentTransforms * MaxPayloadInMB` must not exceed 100 MB.

## Incomplete output
<a name="batch-transform-errors-incomplete"></a>

SageMaker AI uses the Amazon S3 [Multipart Upload API](https://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html) to upload results from a batch transform job to Amazon S3. If an error occurs, the uploaded results are removed from Amazon S3. In some cases, such as when a network outage occurs, an incomplete multipart upload might remain in Amazon S3. An incomplete upload might also occur if you have multiple input files but some of the files can’t be processed by SageMaker AI Batch Transform. The input files that couldn’t be processed won’t have corresponding output files in Amazon S3.

To avoid incurring storage charges, we recommend that you add a [lifecycle rule](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config) to the S3 bucket. This rule deletes incomplete multipart uploads that might be stored in the S3 bucket. For more information, see [Object Lifecycle Management](https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html).
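As a sketch, such a lifecycle rule can be built as a dictionary and applied with Boto 3's `put_bucket_lifecycle_configuration`. The rule ID and the seven-day window below are illustrative assumptions; adjust them for your retention needs.

```python
def abort_incomplete_mpu_rule(days=7):
    """Build a lifecycle rule that aborts incomplete multipart uploads
    after the given number of days. Apply it with
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=..., LifecycleConfiguration={"Rules": [rule]}).
    Rule ID and day count are illustrative."""
    return {
        "ID": "abort-incomplete-multipart-uploads",
        "Status": "Enabled",
        "Filter": {},  # empty filter applies the rule to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }
```

Applying this rule to the transform output bucket cleans up any multipart uploads left behind by interrupted jobs.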

## Job shows as `failed`
<a name="batch-transform-errors-failed"></a>

If a batch transform job fails to process an input file because of a problem with the dataset, SageMaker AI marks the job as `failed`. If an input file contains a bad record, the transform job doesn't create an output file for that input file, because doing so would prevent it from maintaining the same order in the transformed data as in the input file. When your dataset has multiple input files, a transform job continues to process input files even if it fails to process one. The processed files still generate usable results.

If you are using your own algorithms, you can use placeholder text, such as `ERROR`, when the algorithm finds a bad record in an input file. For example, if the last record in a dataset is bad, the algorithm places the placeholder text for that record in the output file.