Sample datasets in Canvas
SageMaker Canvas provides sample datasets, each addressing a distinct use case, so you can start building, training, and validating models quickly without writing any code. The use cases associated with these datasets highlight the capabilities of SageMaker Canvas. You can find the sample datasets on the Datasets page of your SageMaker Canvas application.
The following datasets are the samples that SageMaker Canvas provides by default. These datasets cover use cases such as predicting house prices, loan defaults, and readmission for diabetic patients; forecasting sales; predicting machine failures to streamline predictive maintenance in manufacturing units; and generating supply chain predictions for transportation and logistics. The datasets are stored in the sample_dataset folder in the default Amazon S3 bucket that SageMaker AI creates for your account in a Region.
- canvas-sample-diabetic-readmission.csv: This dataset contains historical data with more than fifteen features covering patient and hospital outcomes. You can use this dataset to predict whether high-risk diabetic patients are likely to be readmitted to the hospital within 30 days of discharge, after 30 days, or not at all. Use the readmitted column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas workshop page. This dataset was obtained from the UCI Machine Learning Repository.
- canvas-sample-housing.csv: This dataset contains data on the characteristics tied to a given housing price. You can use this dataset to predict housing prices. Use the median_house_value column as the target column, and use the numeric prediction model type with this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas workshop page. This is the California housing dataset obtained from the StatLib repository.
- canvas-sample-loans.csv: This dataset contains complete loan data for all loans issued from 2007–2011, including the current loan status and latest payment information. You can use this dataset to predict whether a customer will repay a loan. Use the loan_status column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas workshop page. This dataset uses LendingClub data obtained from Kaggle.
- canvas-sample-maintenance.csv: This dataset contains data on the characteristics tied to a given maintenance failure type. You can use this dataset to predict which failure type will occur in the future. Use the Failure Type column as the target column, and use the 3+ category prediction model type with this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas workshop page. This dataset was obtained from the UCI Machine Learning Repository.
- canvas-sample-shipping-logs.csv: This dataset contains complete shipping data for all products delivered, including estimated time, shipping priority, carrier, and origin. You can use this dataset to predict a shipment's estimated time of arrival in days. Use the ActualShippingDays column as the target column, and use the numeric prediction model type with this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas workshop page. This is a synthetic dataset created by Amazon.
- canvas-sample-sales-forecasting.csv: This dataset contains historical time series sales data for retail stores. You can use this dataset to forecast sales for a particular retail store. Use the sales column as the target column, and use the time series forecasting model type with this dataset. To learn more about how to build a model with this dataset, see the SageMaker Canvas workshop page. This is a synthetic dataset created by Amazon.
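
SageMaker Canvas itself requires no code, but if you want a quick reference outside the application, the following Python sketch collects the target column and model type for each sample dataset listed above. It is an illustration only: the names `SampleDataset`, `SAMPLE_DATASETS`, and `sample_dataset_uri` are hypothetical and not part of any SageMaker API, the bucket name is a placeholder you must supply yourself, and the key layout assumes the files sit directly under the sample_dataset folder. Verify column names against the actual CSV headers before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleDataset:
    """Target column and Canvas model type for one sample dataset."""
    target_column: str
    model_type: str


# Values copied from the dataset list above; confirm the exact column
# spelling against each CSV header before use.
SAMPLE_DATASETS = {
    "canvas-sample-diabetic-readmission.csv": SampleDataset(
        "readmitted", "3+ category prediction"),
    "canvas-sample-housing.csv": SampleDataset(
        "median_house_value", "numeric prediction"),
    "canvas-sample-loans.csv": SampleDataset(
        "loan_status", "3+ category prediction"),
    "canvas-sample-maintenance.csv": SampleDataset(
        "Failure Type", "3+ category prediction"),
    "canvas-sample-shipping-logs.csv": SampleDataset(
        "ActualShippingDays", "numeric prediction"),
    "canvas-sample-sales-forecasting.csv": SampleDataset(
        "sales", "time series forecasting"),
}


def sample_dataset_uri(bucket: str, filename: str,
                       prefix: str = "sample_dataset") -> str:
    """Build an S3 URI for a sample dataset.

    Assumption: `bucket` is the default bucket for your account and Region,
    and the files live directly under `prefix`. The real layout in your
    account may include additional path components.
    """
    if filename not in SAMPLE_DATASETS:
        raise KeyError(f"Unknown sample dataset: {filename}")
    return f"s3://{bucket}/{prefix}/{filename}"


if __name__ == "__main__":
    # "my-default-bucket" is a placeholder, not a real bucket name.
    for name, info in SAMPLE_DATASETS.items():
        print(sample_dataset_uri("my-default-bucket", name),
              "| target:", info.target_column,
              "| model type:", info.model_type)
```

Keeping the target column and model type together in one lookup makes it harder to pair a dataset with the wrong model type when you set up a build.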