Batch prediction dataset requirements - Amazon SageMaker AI

For batch predictions, make sure that your datasets meet the requirements outlined in Create a dataset. If your dataset is larger than 5 GB, then Canvas uses Amazon EMR Serverless to process your data and split it into smaller batches. After your data has been split, Canvas uses SageMaker AI Batch Transform to make predictions. You may see charges from both of these services after running batch predictions. For more information, see Canvas pricing.

You might not be able to make predictions on some datasets if their schemas are incompatible. A schema is the organizational structure of your data. For a tabular dataset, the schema consists of the column names and the data type of the data in each column. A schema can be incompatible for any of the following reasons:

  • The dataset that you're using to make predictions has fewer columns than the dataset that you're using to build the model.

  • The data types in the columns of the dataset that you used to build the model are different from the data types in the dataset that you're using to make predictions.

  • The dataset that you're using to make predictions and the dataset that you used to build the model have column names that don't match. Column names are case sensitive, so Column1 is not the same as column1.
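The three conditions above can be checked before you upload a prediction dataset. The following is a minimal sketch in Python, assuming each schema is represented as a dictionary mapping column names to data type strings; the `check_schema_compatibility` helper is hypothetical and not part of the Canvas API.

```python
def check_schema_compatibility(train_schema, predict_schema):
    """Compare two tabular schemas given as {column_name: dtype} dicts.

    Hypothetical helper for illustration; not part of the Canvas API.
    Returns a list of human-readable problems; an empty list means the
    prediction dataset's schema matches the training dataset's schema.
    """
    problems = []

    # 1. The prediction dataset must contain every training column.
    #    Column names are case sensitive: "Column1" != "column1".
    missing = sorted(set(train_schema) - set(predict_schema))
    if missing:
        problems.append(f"missing columns: {missing}")

    # 2. Columns present in both datasets must hold the same data type.
    for col in sorted(set(train_schema) & set(predict_schema)):
        if train_schema[col] != predict_schema[col]:
            problems.append(
                f"type mismatch in '{col}': "
                f"{train_schema[col]} vs {predict_schema[col]}"
            )
    return problems


train = {"Age": "int64", "City": "string", "Income": "float64"}
predict = {"Age": "int64", "city": "string", "Income": "int64"}
for problem in check_schema_compatibility(train, predict):
    print(problem)
# → missing columns: ['City']
# → type mismatch in 'Income': float64 vs int64
```

In the example run, "City" is reported as missing because the prediction dataset names the column "city", and "Income" is flagged because its type changed from float64 to int64.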

To ensure that you can successfully generate batch predictions, make sure that the schema of your batch prediction dataset matches the schema of the dataset that you used to build the model.

Note

For batch predictions, if you dropped any columns when building your model, Canvas adds the dropped columns back to the prediction results. However, Canvas does not add the dropped columns to your batch predictions for time series models.