

# Data preparation
<a name="canvas-data-prep"></a>

**Note**  
Previously, Amazon SageMaker Data Wrangler was part of the SageMaker Studio Classic experience. Now, if you update to using the new Studio experience, you must use SageMaker Canvas to access Data Wrangler and receive the latest feature updates. If you have been using Data Wrangler in Studio Classic until now and want to migrate to Data Wrangler in Canvas, you might have to grant additional permissions so that you can create and use a Canvas application. For more information, see [(Optional) Migrate from Data Wrangler in Studio Classic to SageMaker Canvas](studio-updated-migrate-ui.md#studio-updated-migrate-dw).  
To learn how to migrate your data flows from Data Wrangler in Studio Classic, see [(Optional) Migrate data from Studio Classic to Studio](studio-updated-migrate-data.md).

Use Amazon SageMaker Data Wrangler in Amazon SageMaker Canvas to prepare, featurize, and analyze your data. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify and streamline data preprocessing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows.
+ **Data Flow** – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline. 
+ **Transform** – Clean and transform your dataset using standard *transforms* like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.
+ **Generate Data Insights** – Automatically verify data quality and detect abnormalities in your data with Data Wrangler Data Quality and Insights Report. 
+ **Analyze** – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation. 
+ **Export** – Export your data preparation workflow to a different location. The following are example locations: 
  + Amazon Simple Storage Service (Amazon S3) bucket
  + Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
+ **Automate data preparation** – Create machine learning workflows from your data flow.
  + Amazon SageMaker Pipelines – Build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs.
  + Serial inference pipeline – Create a serial inference pipeline from your data flow. Use it to make predictions on new data.
  + Python script – Store the data and their transformations in a Python script for your custom workflows.

# Create a data flow
<a name="canvas-data-flow"></a>

Use a Data Wrangler flow in SageMaker Canvas, or *data flow*, to create and modify a data preparation pipeline. We recommend that you use Data Wrangler for datasets larger than 5 GB.

To get started, use the following procedure to import your data into a data flow.

1. Open SageMaker Canvas.

1. In the left-hand navigation, choose **Data Wrangler**.

1. Choose **Import and prepare**.

1. From the dropdown menu, choose either **Tabular** or **Image**.

1. For **Select a data source**, choose your data source and select the data that you want to import. You have the option to select up to 30 files or one folder. If you have a dataset already imported into Canvas, choose **Canvas dataset** as your source. Otherwise, connect to a data source such as Amazon S3 or Snowflake and browse through your data. For information about connecting to a data source or importing data, see the following pages:
   + [Data import](canvas-importing-data.md)
   + [Connect to data sources](canvas-connecting-external.md)

1. After selecting the data that you want to import, choose **Next**.

1. (Optional) For the **Import settings** section when importing a tabular dataset, expand the **Advanced** dropdown menu. You can specify the following advanced settings for data flow imports:
   + **Sampling method** – Select the sampling method and sample size you'd like to use. For more information about how to change your sample, see the section [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).
   + **File encoding (CSV)** – Select your dataset file’s encoding. `UTF-8` is the default.
   + **Skip first rows** – Enter the number of rows you’d like to skip importing if you have redundant rows at the beginning of your dataset.
   + **Delimiter** – Select the delimiter that separates each item in your data. You can also specify a custom delimiter.
   + **Multi-line detection** – Select this option if you’d like Canvas to parse your entire dataset for multi-line cells. Canvas determines whether or not to use multi-line support by taking a sample of your data, but Canvas might not detect any multi-line cells in the sample. In this case, we recommend that you select the **Multi-line detection** option to force Canvas to check your entire dataset for multi-line cells.

1. Choose **Import**.

You should now have a new data flow, and you can begin adding transform steps and analyses.
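The advanced import settings above correspond to standard CSV-parsing options. As a rough sketch of what they control (using pandas rather than Canvas's internal implementation, with a hypothetical file):

```python
import io
import pandas as pd

# Hypothetical CSV export with two redundant leading rows and a "|" delimiter.
raw = "exported by tool\nversion 2\nid|name|score\n1|alice|0.9\n2|bob|0.7\n"

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,  # "Skip first rows": drop the two redundant lines
    sep="|",     # "Delimiter": a custom "|" instead of the default ","
)
print(df.shape)  # (2, 3)
```

File encoding would similarly be handled by the parser (`UTF-8` by default), and multi-line cells require quoting-aware parsing of the whole file.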

# How the data flow UI works
<a name="canvas-data-flow-ui"></a>

To help you navigate your data flow, Data Wrangler has the following tabs in the top navigation pane:
+ **Data flow** – This tab provides you with a visual view of your data flow, where you can add or remove transform steps and export data.
+ **Data** – This tab gives you a preview of your data so that you can check the results of your transforms. You can also see an ordered list of your data flow steps and edit or reorder the steps.
**Note**  
In this tab, you can only preview data visualizations (such as the distribution of values per column) for Amazon S3 data sources. Visualizations for other data sources, such as Amazon Athena, aren't supported.
+ **Analyses** – In this tab, you can see separate sub-tabs for each analysis you create. For example, if you create a histogram and a Data Quality and Insights (DQI) report, Canvas creates a tab for each.

When you import a dataset, the original dataset appears on the data flow and is named **Source**. SageMaker Canvas automatically infers the types of each column in your dataset and creates a new dataframe named **Data types**. You can select this step to update the inferred data types.

The datasets, transformations, and analyses that you use in the data flow are represented as *steps*. Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than **Join** or **Concatenate**) are added to the same dataset, they are stacked.

Under the **Combine data** option, **Join** and **Concatenate** create standalone steps that contain the new joined or concatenated dataset.

# Edit the data flow sampling configuration
<a name="canvas-data-flow-edit-sampling"></a>

When importing tabular data into a Data Wrangler data flow, you can opt to take a sample of your dataset to speed up the data exploration and cleaning process. Running exploratory transforms on a sample of your dataset is often faster than running transforms on your entire dataset, and when you're ready to export your dataset and build a model, you can apply the transforms to the full dataset.

Canvas supports the following sampling methods:
+ **FirstK** – Canvas selects the first *K* items from your dataset, where *K* is a number you specify. This sampling method is simple but can introduce bias if your dataset isn't randomly ordered.
+ **Random** – Canvas selects items from the dataset at random, with each item having an equal probability of being chosen. This sampling method helps ensure that the sample is representative of the entire dataset.
+ **Stratified** – Canvas divides the dataset into groups (or *strata*) based on one or more attributes (for example, age and income level). Then, a proportional number of items are randomly selected from each group. This method ensures that all relevant subgroups are adequately represented in the sample.
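The three sampling methods can be sketched with pandas on a hypothetical imbalanced dataset (this illustrates the concepts, not Canvas's implementation):

```python
import pandas as pd

# Hypothetical dataset with an imbalanced categorical column: 80 "a", 20 "b".
df = pd.DataFrame({
    "label": ["a"] * 80 + ["b"] * 20,
    "value": range(100),
})
k = 10  # desired sample size

first_k = df.head(k)                            # FirstK: the first K rows
random_sample = df.sample(n=k, random_state=0)  # Random: equal probability per row
# Stratified: sample each group in proportion to its size.
stratified = df.groupby("label").sample(frac=k / len(df), random_state=0)

print(stratified["label"].value_counts().to_dict())  # {'a': 8, 'b': 2}
```

Note that the stratified sample preserves the 80/20 label ratio, while a FirstK sample of this dataset would contain only `"a"` rows.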

You can edit your sampling configuration at any time to change the size of the sample used for data exploration.

To make changes to your sampling configuration, do the following:

1. In your data flow graph, select your data source node.

1. Choose **Sampling** on the bottom navigation bar.

1. The **Sampling** dialog box opens. For the **Sampling method** dropdown, select your desired sampling method.

1. For **Maximum sample size**, enter the number of rows you want to sample.

1. Choose **Update** to save your changes.

The changes to your sampling configuration should now be applied.

# Add a step to your data flow
<a name="canvas-data-flow-add-step"></a>

In your Data Wrangler data flows, you can add steps that represent data transformations and analyses.

To add a step to your data flow, select the plus icon (**+**) next to any dataset node or previously added step. Then, select one of the following options:
+ **Edit data types** (For a **Data types** step only): If you have not added any transforms to a **Data types** step, you can double-click on the **Data types** step in your flow to open the **Data** tab and edit the data types that Data Wrangler inferred when importing your dataset. 
+ **Add transform**: Adds a new transform step. See [Transform data](canvas-transform.md) to learn more about the data transformations you can add. 
+ **Get data insights**: Add analyses, such as histograms or custom visualizations. You can use this option to analyze your data at any point in the data flow. See [Perform exploratory data analysis (EDA)](canvas-analyses.md) to learn more about the analyses you can add. 
+ **Join**: Find this option under **Combine data** to join two datasets and add the resulting dataset to the data flow. To learn more, see [Join Datasets](canvas-transform.md#canvas-transform-join).
+ **Concatenate**: Find this option under **Combine data** to concatenate two datasets and add the resulting dataset to the data flow. To learn more, see [Concatenate Datasets](canvas-transform.md#canvas-transform-concatenate).

# Edit data flow steps
<a name="canvas-data-flow-edit-steps"></a>

In Amazon SageMaker Canvas, you can edit individual steps in your data flows to transform your dataset without having to create a new data flow. The following page covers how to edit join and concatenate steps, as well as data source steps.

## Edit join and concatenate steps
<a name="canvas-data-flow-edit-join-concat"></a>

Within your data flows, you have the flexibility to edit your join and concatenate steps. You can make necessary adjustments to your data processing workflow, ensuring that your data is properly combined and transformed without having to redo your entire data flow.

To edit a join or concatenate step in your data flow, do the following:

1. Open your data flow.

1. Choose the plus icon (**+**) next to the join or concatenate node that you want to edit.

1. From the context menu, choose **Edit**.

1. A side panel opens where you can edit the details of your join or concatenation. Modify your step fields, such as the type of join. To swap out a data node and select a different one to join or concatenate, choose the delete icon next to the node and then, in the data flow view, select the new node that you want to include in your transformation.
**Note**  
When swapping out a node during the editing process, you can only select steps that occur before the join or concatenate operation. You can swap either the left or right node, but you can only swap one node at a time. Additionally, you cannot select a source node as a replacement.

1. Choose **Preview** to view the result of the combining operation.

1. Choose **Update** to save your changes.

Your data flow should now be updated.

## Edit or replace a data source step
<a name="canvas-data-flow-edit-source"></a>

You might need to make changes to your data source or dataset without deleting the transforms and data flow steps applied to your original data. Within Data Wrangler, you can edit or replace your data source configuration while keeping the steps of your data flow. When editing a data source, you can change the import settings, such as the sampling size or method and any advanced settings. You can also add more files with the same schema, or for query-based data sources such as Amazon Athena, you can edit the query. When replacing a data source, you have the option to select a different dataset, or even import the data from a different data source altogether, as long as the schema of the new data matches the original data.

To edit a data source configuration, do the following:

1. In the Canvas application, go to the **Data Wrangler** page.

1. Choose your data flow to view it.

1. In the **Data flow** tab that shows your data flow steps, find the **Source** node that you want to edit.

1. Choose the ellipsis icon next to the **Source** node.

1. From the context menu, choose **Edit**.

1. For Amazon S3 data sources and local upload, you have the option to select or upload more files with the same schema as your original data. For query-based data sources such as Amazon Athena, you can remove and select different tables in the visual query builder, or you can edit the SQL query directly. When you're done, choose **Next**.

1. For the **Import settings**, make any desired changes.

1. When you're done, choose **Save changes**.

Your data source should now be updated.

To replace a data source, do the following:

1. In the Canvas application, go to the **Data Wrangler** page.

1. Choose your data flow to view it.

1. In the **Data flow** tab that shows your data flow steps, find the **Source** node that you want to edit.

1. Choose the ellipsis icon next to the **Source** node.

1. From the context menu, choose **Replace**.

1. Go through the [create a data flow experience](canvas-data-flow.md) to select another data source and data.

1. When you’ve selected your data and are ready to update the source node, choose **Save**.

You should now see the **Source** node updated in your data flow.

# Reorder steps in your data flow
<a name="canvas-data-flow-reorder-steps"></a>

After adding steps to your data flow, you have the option to reorder steps instead of deleting and re-adding them in the correct order. For example, you might decide to move a transform to impute missing values before a step to format strings.

**Note**  
You can’t change the order of certain step types, such as defining your data source, changing data types, joining, concatenating, or splitting. Steps that can’t be reordered are grayed out in the Canvas application UI.

To reorder your data flow steps, do the following:

1. While editing a data flow in Data Wrangler, choose the **Data** tab. A side panel called **Steps** lists your data flow steps in order.

1. Hover over a transform step and choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to that step.

1. From the context menu, choose **Reorder**.

1. Drag and drop your data flow steps into your desired order.

1. When you’ve finished, choose **Save**.

Your data flow steps and graph should now reflect the changes you’ve made.

# Delete a step from your data flow
<a name="canvas-data-flow-delete-step"></a>

Within your data flows, you have the flexibility to delete your join and concatenate steps and choose whether or not to still apply any subsequent transforms to your data.

To delete a join or concatenate step from your data flow, do the following:

1. Open your data flow.

1. Choose the plus icon (**+**) next to the join or concatenate node that you want to delete.

1. In the context menu, choose **Delete**.

1. (Optional) If you have transformation steps following the join or concatenate step, then you can choose whether or not to keep the subsequent transformation steps and add them separately to each data node. In the **Delete join** side panel, choose a node to deselect it and remove any subsequent transformation steps. You can leave both nodes selected to keep all transformation steps, or you can deselect both nodes to discard all transformation steps.

   The following screenshot shows this step with only the second of two data nodes selected. When the join is successfully deleted, then the subsequent **Rename column** transform is only kept by the second data node.  
![\[Screenshot of a data flow in Data Wrangler showing the delete join view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/canvas-data-flow-delete-step.png)

1. Choose **Delete**.

The join or concatenate step should now be removed from your data flow.

# Perform exploratory data analysis (EDA)
<a name="canvas-analyses"></a>

Data Wrangler includes built-in analyses that help you generate visualizations and data analyses in a few clicks. You can also create custom analyses using your own code. 

You add an analysis to a dataframe by selecting a step in your data flow, and then choosing **Add analysis**. To access an analysis you've created, select the step that contains the analysis, and select the analysis. 

Analyses are generated using a sample of up to 200,000 rows of your dataset, and you can configure the sample size. For more information about changing the sample size of your data flow, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).

**Note**  
Analyses are optimized for data with 1,000 or fewer columns. You might experience some latency when generating analyses for data with more columns.

You can add the following analyses to a dataframe:
+ Data visualizations, including histograms and scatter plots. 
+ A quick summary of your dataset, including number of entries, minimum and maximum values (for numeric data), and most and least frequent categories (for categorical data).
+ A quick model of the dataset, which can be used to generate an importance score for each feature. 
+ A target leakage report, which you can use to determine if one or more features are strongly correlated with your target feature.
+ A custom visualization using your own code. 

Use the following sections to learn more about these options.

## Get insights on data and data quality
<a name="canvas-data-insights"></a>

Use the **Data Quality and Insights Report** to perform an analysis of the data that you've imported into Data Wrangler. We recommend that you create the report after you import your dataset. You can use the report to help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

Use the following procedure to create a Data Quality and Insights report. It assumes that you've already imported a dataset into your Data Wrangler flow.

**To create a Data Quality and Insights report**

1. Choose the ellipsis icon next to a node in your Data Wrangler flow.

1. Select **Get data insights**.

1. For **Analysis type**, select **Data Quality and Insights Report**.

1. For **Analysis name**, specify a name for the insights report.

1. For **Problem type**, specify **Regression** or **Classification**.

1. For **Target column**, specify the target column.

1. For **Data size**, specify one of the following:
   + **Sampled dataset** – Uses the interactive sample from your data flow, which can contain up to 200,000 rows of your dataset. For information about how to edit the size of your sample, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).
   + **Full dataset** – Uses the full dataset from your data source to create the report.
**Note**  
Creating a Data Quality and Insights report on the full dataset uses an Amazon SageMaker Processing job. The processing job provisions the additional compute resources required to get insights for all of your data. For more information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

1. Choose **Create**.

The following topics show the sections of the report:

**Topics**
+ [Summary](#canvas-data-insights-summary)
+ [Target column](#canvas-data-insights-target-column)
+ [Quick model](#canvas-data-insights-quick-model)
+ [Feature summary](#canvas-data-insights-feature-summary)
+ [Samples](#canvas-data-insights-samples)
+ [Definitions](#canvas-data-insights-definitions)

You can either download the report or view it online. To download the report, choose the download button at the top right corner of the screen. 

### Summary
<a name="canvas-data-insights-summary"></a>

The insights report has a brief summary of the data that includes general information such as missing values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings that point to probable issues with the data. We recommend that you investigate the warnings.

### Target column
<a name="canvas-data-insights-target-column"></a>

When you create the Data Quality and Insights Report, Data Wrangler gives you the option to select a target column. A target column is a column that you're trying to predict. When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power. When you select a target column, you must specify whether you’re trying to solve a regression or a classification problem.

For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a category. It also presents observations, or rows, with a missing or invalid target value.

For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents observations, or rows, with a missing, invalid, or outlier target value.

### Quick model
<a name="canvas-data-insights-quick-model"></a>

The **Quick model** provides an estimate of the expected prediction quality of a model that you train on your data.

Data Wrangler splits your data into training and validation folds. It uses 80% of the samples for training and 20% of the samples for validation. For classification, Data Wrangler uses a stratified split. In a stratified split, each data partition has the same ratio of labels. For classification problems, it's important to have the same ratio of labels between the training and validation folds. Data Wrangler trains an XGBoost model with the default hyperparameters. It applies early stopping on the validation data and performs minimal feature preprocessing.
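The 80/20 stratified split can be illustrated with scikit-learn (a sketch of the splitting step only, on hypothetical labels; Data Wrangler's internal XGBoost training is omitted):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negative, 10 positive.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# 80/20 split; stratify=y keeps the label ratio identical in both folds.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_val.mean())  # 0.1 0.1 -- same positive ratio in each fold
```

Without `stratify`, a rare class could by chance be under-represented in one fold, distorting the quick model's validation metrics.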

For classification models, Data Wrangler returns both a model summary and a confusion matrix.

To learn more about the information that the classification model summary returns, see [Definitions](#canvas-data-insights-definitions).

A confusion matrix gives you the following information:
+ The number of times the predicted label matches the true label.
+ The number of times the predicted label doesn't match the true label.

The true label represents an actual observation in your data. For example, if you're using a model to detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-fraudulent. The predicted label represents the label that your model assigns to the data.

You can use the confusion matrix to see how well the model predicts the presence or the absence of a condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent transactions as fraudulent.
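As a sketch, sensitivity and specificity can be computed directly from the confusion matrix counts (hypothetical labels, with 1 marking a fraudulent transaction):

```python
# Hypothetical binary labels: 1 = fraudulent, 0 = non-fraudulent.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted =   [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Confusion matrix counts: true/false positives and negatives.
tp = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(true_labels, predicted) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(true_labels, predicted) if t == 0 and p == 1)

sensitivity = tp / (tp + fn)  # ability to detect fraudulent transactions
specificity = tn / (tn + fp)  # ability to avoid flagging legitimate transactions
print(round(sensitivity, 2), round(specificity, 2))  # 0.75 0.83
```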

### Feature summary
<a name="canvas-data-insights-feature-summary"></a>

When you specify a target column, Data Wrangler orders the features by their prediction power. Prediction power is measured on the data after it is split into 80% training and 20% validation folds. Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature preprocessing and measures prediction performance on the validation data.

It normalizes the scores to the range [0,1]. Higher prediction scores indicate columns that are more useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the target column.

It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem with other columns. You can confidently use the prediction scores to determine whether a feature in your dataset is predictive.

A low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities, which often indicates target leakage. Target leakage usually happens when the dataset contains a column that isn’t available at the prediction time. For example, it could be a duplicate of the target column.
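A minimal sketch of the per-feature idea, assuming a simple decision tree as the per-feature model (Data Wrangler's actual model and preprocessing may differ; the `noise`, `signal`, and `leak` columns are hypothetical):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000
target = rng.normal(size=n)
features = {
    "noise": rng.normal(size=n),                       # unrelated to the target
    "signal": target + rng.normal(scale=0.1, size=n),  # strongly related
    "leak": target.copy(),                             # duplicate of the target: leakage
}

scores = {}
for name, col in features.items():
    # Fit a model on each feature separately, on the 80% training fold.
    X_tr, X_val, y_tr, y_val = train_test_split(
        col.reshape(-1, 1), target, test_size=0.2, random_state=0
    )
    model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
    # Clamp to [0, 1] so an anti-predictive fit scores 0, like the report.
    scores[name] = max(0.0, r2_score(y_val, model.predict(X_val)))

print({k: round(v, 2) for k, v in scores.items()})
```

In this sketch, `leak` scores near 1 (the target-leakage signature), `signal` scores high, and `noise` scores near 0.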

### Samples
<a name="canvas-data-insights-samples"></a>

Data Wrangler provides information about whether your samples are anomalous or if there are duplicates in your dataset.

Data Wrangler detects anomalous samples using the *isolation forest algorithm*. The isolation forest associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate anomalous samples. High scores are associated with non-anomalous samples. Samples with a negative anomaly score are usually considered anomalous, and samples with a positive anomaly score are considered non-anomalous.
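A minimal sketch of isolation-forest scoring, using scikit-learn's `IsolationForest` as a stand-in (Data Wrangler's implementation may differ; the injected outlier is hypothetical):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 typical samples around the origin, plus one hypothetical data-entry error.
rows = np.vstack([rng.normal(0, 1, size=(200, 2)), [[12.0, -9.0]]])

forest = IsolationForest(random_state=0).fit(rows)
scores = forest.decision_function(rows)  # low/negative scores flag anomalies

print(int(scores.argmin()))  # 200 -- the injected outlier scores lowest
```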

When you look at a sample that might be anomalous, we recommend that you pay attention to unusual values. For example, anomalous values might result from errors in gathering and processing the data. The report lists the most anomalous samples according to Data Wrangler's implementation of the isolation forest algorithm. We recommend using domain knowledge and business logic when you examine the anomalous samples.

Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data sources could include valid duplicates. Other data sources could have duplicates that point to problems in data collection. Duplicate samples that result from faulty data collection could interfere with machine learning processes that rely on splitting the data into independent training and validation folds.

The following are elements of the insights report that can be impacted by duplicated samples:
+ Quick model
+ Prediction power estimation
+ Automatic hyperparameter tuning

You can remove duplicate samples from the dataset using the **Drop duplicates** transform under **Manage rows**. Data Wrangler shows you the most frequently duplicated rows.
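As a sketch of the same idea outside Canvas, pandas can report the duplicate ratio and drop repeated rows, keeping the first occurrence (hypothetical data):

```python
import pandas as pd

# Hypothetical dataset where one row was collected twice.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "amount": [10.0, 25.0, 25.0, 7.5],
})

duplicate_ratio = df.duplicated().mean()  # fraction of rows repeating an earlier row
deduped = df.drop_duplicates()            # keeps the first occurrence of each row

print(duplicate_ratio, len(deduped))  # 0.25 3
```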

### Definitions
<a name="canvas-data-insights-definitions"></a>

The following are definitions for the technical terms that are used in the data insights report.

------
#### [ Feature types ]

The following are the definitions for each of the feature types:
+ **Numeric** – Numeric values can be either floats or integers, such as age or income. The machine learning models assume that numeric values are ordered and a distance is defined over them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
+ **Categorical** – The column entries belong to a set of unique values, which is usually much smaller than the number of entries in the column. For example, a column of length 100 could contain the unique values `Dog`, `Cat`, and `Mouse`. The values could be numeric, text, or a combination of both. `Horse`, `House`, `8`, `Love`, and `3.1` would all be valid values and could be found in the same categorical column. The machine learning model does not assume order or distance on the values of categorical features, as opposed to numeric features, even when all the values are numbers.
+ **Binary** – Binary features are a special categorical feature type in which the cardinality of the set of unique values is 2.
+ **Text** – A text column contains many non-numeric unique values. In extreme cases, all the elements of the column are unique, so that no two entries are the same.
+ **Datetime** – A datetime column contains information about the date or time. It can have information about both the date and time.

------
#### [ Feature statistics ]

The following are definitions for each of the feature statistics:
+ **Prediction power** – Prediction power measures how useful the column is in predicting the target.
+ **Outliers** (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5 percentile, 95 percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 × RSTD or smaller than median - 5 × RSTD are considered to be outliers.
+ **Skew** (in numeric columns) – Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is considered skewed when the absolute value of the skew is larger than 3.
+ **Kurtosis** (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the distribution. It's defined as the fourth moment of the distribution divided by the square of the second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply that the distribution is concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
+ **Missing values** – Null-like objects, empty strings, and strings composed of only white spaces are considered missing.
+ **Valid values for numeric features or regression target** – All values that you can cast to finite floats are valid. Missing values are not valid.
+ **Valid values for categorical, binary, or text features, or for classification target** – All values that are not missing are valid.
+ **Datetime features** – All values that you can cast to a datetime object are valid. Missing values are not valid.
+ **Invalid values** – Values that are either missing or that you can't properly cast. For example, in a numeric column, you can't cast the string `"six"` or a null value to a number.
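A few of these statistics can be sketched in Python. The clipping range and the 5 × RSTD outlier rule follow the definitions above; the sample data is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical numeric column: values 1..100 plus one wild entry.
values = np.concatenate([np.arange(1, 101, dtype=float), [500.0]])

# Robust standard deviation: clip to the [5th, 95th] percentile range first.
lo, hi = np.percentile(values, [5, 95])
rstd = np.clip(values, lo, hi).std()
median = np.median(values)
outliers = values[(values > median + 5 * rstd) | (values < median - 5 * rstd)]

skew = stats.skew(values)                    # symmetry of the distribution
kurt = stats.kurtosis(values, fisher=False)  # Pearson's kurtosis (normal = 3)
print(outliers)  # [500.]
```

For this data, the skew is positive (long right tail) and the Pearson kurtosis exceeds 3, both driven by the single extreme value.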

------
#### [ Quick model metrics for regression ]

The following are the definitions for the quick model metrics:
+ **R2** (coefficient of determination) – R2 is the proportion of the variation in the target that is predicted by the model. R2 is in the range (-infinity, 1]. 1 is the score of the model that predicts the target perfectly and 0 is the score of the trivial model that always predicts the target mean.
+ **MSE** (mean squared error) – MSE is in the range [0, infinity). 0 is the score of the model that predicts the target perfectly.
+ **MAE** (mean absolute error) – MAE is in the range [0, infinity), where 0 is the score of the model that predicts the target perfectly.
+ **RMSE** (root mean square error) – RMSE is in the range [0, infinity), where 0 is the score of the model that predicts the target perfectly.
+ **Max error** – The maximum absolute value of the error over the dataset. Max error is in the range [0, infinity). 0 is the score of the model that predicts the target perfectly.
+ **Median absolute error** – Median absolute error is in the range [0, infinity). 0 is the score of the model that predicts the target perfectly.
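The regression metrics above can be sketched in plain Python (`regression_metrics` is an illustrative helper, not a Data Wrangler API):

```
from math import sqrt
from statistics import median

def regression_metrics(y_true, y_pred):
    """Compute the quick model regression metrics for a set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "mse": mse,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(mse),
        "r2": 1 - (mse * n) / sst,  # 1 minus the fraction of variance left unexplained
        "max_error": max(abs(e) for e in errors),
        "median_absolute_error": median(abs(e) for e in errors),
    }
```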

------
#### [ Quick model metrics for classification ]

The following are the definitions for the quick model metrics:
+ **Accuracy** – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the perfect model.
+ **Balanced accuracy** – Balanced accuracy is the ratio of samples that are predicted accurately when the class weights are adjusted to balance the data. All classes are given the same importance, regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples wrong. 1 is the score of the perfect model.
+ **AUC (binary classification)** – This is the area under the receiver operating characteristic curve. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **AUC (OVR)** – For multiclass classification, this is the area under the receiver operating characteristic curve calculated separately for each label using one versus rest. Data Wrangler reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **Precision** – Precision is defined for a specific class. Precision is the fraction of true positives out of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the score of the model that has no false positives for the class. For binary classification, Data Wrangler reports the precision of the positive class.
+ **Recall** – Recall is defined for a specific class. Recall is the fraction of the relevant class instances that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the recall of the positive class.
+ **F1** – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports the F1 for classes with positive values.
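For binary classification, the definitions above can be sketched as follows (an illustrative helper, assuming label `1` is the positive class):

```
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```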

------
#### [ Textual patterns ]

**Patterns** describe the textual format of a string using an easy-to-read format. The following are examples of textual patterns:
+ "\[digits:4-7\]" describes a sequence of digits that has a length between 4 and 7.
+ "\[alnum:5\]" describes an alphanumeric string with a length of exactly 5.

Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can describe many of the commonly used patterns. The **confidence**, expressed as a percentage, indicates how much of the data is estimated to match the pattern. Using the textual pattern, you can see which rows in your data you need to correct or drop.

The following describes the patterns that Data Wrangler can recognize:


| Pattern | Textual Format | 
| --- | --- | 
|  \[alnum\]  |  Alphanumeric strings  | 
|  \[any\]  |  Any string of word characters  | 
|  \[digits\]  |  A sequence of digits  | 
|  \[lower\]  |  A lowercase word  | 
|  \[mixed\]  |  A mixed-case word  | 
|  \[name\]  |  A word beginning with a capital letter  | 
|  \[upper\]  |  An uppercase word  | 
|  \[whitespace\]  |  Whitespace characters  | 

A word character is either an underscore or a character that might appear in a word in any language. For example, the strings `'Hello_world'` and `'écoute'` both consist of word characters. 'H' and 'é' are both examples of word characters.
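As a rough illustration, some of these patterns map onto regular expressions. The regex equivalents below are assumptions for illustration, not Data Wrangler's internal implementation:

```
import re

# Hypothetical regex equivalents of a few textual patterns
patterns = {
    "[digits:4-7]": r"\d{4,7}",   # a sequence of 4 to 7 digits
    "[alnum:5]": r"[^\W_]{5}",    # exactly 5 alphanumeric characters
    "[lower]": r"[a-z]+",         # a lowercase word
    "[any]": r"\w+",              # word characters, including '_' and 'é'
}

def matches(pattern_name, text):
    """Return True if the entire string matches the named pattern."""
    return re.fullmatch(patterns[pattern_name], text) is not None
```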

------

## Bias report
<a name="canvas-bias-report"></a>

SageMaker Canvas provides the bias report in Data Wrangler to help uncover potential biases in your data. The bias report analyzes the relationship between the target column (label) and a column that you believe might contain bias (facet variable). For example, if you are trying to predict customer conversion, the facet variable may be the age of the customer. The bias report can help you determine whether or not your data is biased toward a certain age group.

To generate a bias report in Canvas, do the following:

1. In your data flow in Data Wrangler, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to a node in the flow.

1. From the context menu, choose **Get data insights**.

1. The **Create analysis** side panel opens. For the **Analysis type** dropdown menu, select **Bias Report**.

1.  In the **Analysis name** field, enter a name for the bias report.

1. For the **Select the column your model predicts (target)** dropdown menu, select your target column.

1. For **Is your predicted column a value or threshold?**, select **Value** if your target column has categorical values or **Threshold** if it has numerical values.

1. For **Predicted value** (or **Predicted threshold**, depending on your selection in the previous step), enter the target column value or values that correspond to a positive outcome. For example, if predicting customer conversion, your value might be `yes` to indicate that a customer was converted.

1. For the **Select the column to analyze for bias** dropdown menu, select the column that you believe might contain bias, also known as the facet variable.

1. For **Is your column a value or threshold?**, select **Value** if the facet variable has categorical values or **Threshold** if it has numerical values.

1. For **Column value(s) to analyze for bias** (or **Column threshold to analyze for bias**, depending on your selection in the previous step), enter the value or values that you want to analyze for potential bias. For example, if you're checking for bias against customers over a certain age, use the beginning of that age range as your threshold.

1. For **Choose bias metrics**, select the bias metrics you'd like to include in your bias report. Hover over the info icons for more information about each metric.

1. (Optional) When prompted with the option **Would you like to analyze additional metrics?**, select **Yes** to view and include more bias metrics.

1. When you're ready to create the bias report, choose **Add**.

Once generated, the report gives you an overview of the bias metrics you selected. You can view the bias report at any time from the **Analyses** tab of your data flow.

## Histogram
<a name="canvas-visualize-histogram"></a>

Use histograms to see the counts of feature values for a specific feature. You can inspect the relationships between features using the **Color by** option.

You can use the **Facet by** feature to create histograms of one column, for each value in another column. 

## Scatter plot
<a name="canvas-visualize-scatter-plot"></a>

Use the **Scatter Plot** feature to inspect the relationship between features. To create a scatter plot, select a feature to plot on the **X axis** and the **Y axis**. Both of these columns must be numeric typed columns. 

You can color scatter plots by an additional column. 

Additionally, you can facet scatter plots by features.

## Table summary
<a name="canvas-table-summary"></a>

Use the **Table Summary** analysis to quickly summarize your data.

For columns with numerical data, including log and float data, a table summary reports the number of entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.

For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table summary reports the number of entries (count), least frequent value (min), and most frequent value (max). 

## Quick model
<a name="canvas-quick-model"></a>

Use the **Quick Model** visualization to quickly evaluate your data and produce importance scores for each feature. A [feature importance score](http://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances) indicates how useful a feature is at predicting a target label. The feature importance score is in the range [0, 1], and a higher number indicates that the feature is more important to the whole dataset. At the top of the quick model chart, there is a model score. A classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.

When you create a quick model chart, you select a dataset you want evaluated, and a target label against which you want feature importance to be compared. Data Wrangler does the following:
+ Infers the data types for the target label and each feature in the dataset selected. 
+ Determines the problem type. Based on the number of distinct values in the label column, Data Wrangler determines whether this is a regression or classification problem type. Data Wrangler uses a categorical threshold of 100: if there are more than 100 distinct values in the label column, Data Wrangler classifies it as a regression problem; otherwise, it's a classification problem. 
+ Pre-processes features and label data for training. The algorithm used requires encoding features to vector type and encoding labels to double type. 
+ Trains a random forest algorithm with 70% of data. Spark’s [RandomForestRegressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression) is used to train a model for regression problems. The [RandomForestClassifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) is used to train a model for classification problems.
+ Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates classification models using an F1 score and evaluates regression models using an MSE score.
+ Calculates feature importance for each feature using the Gini importance method. 
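The problem-type rule in the steps above can be sketched as follows (`infer_problem_type` is an illustrative helper, not a Data Wrangler API; the threshold of 100 comes from the description):

```
CATEGORICAL_THRESHOLD = 100  # Data Wrangler's documented cutoff

def infer_problem_type(label_values):
    """More than 100 distinct label values means regression; otherwise classification."""
    if len(set(label_values)) > CATEGORICAL_THRESHOLD:
        return "regression"
    return "classification"
```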

## Target leakage
<a name="canvas-analysis-target-leakage"></a>

Target leakage occurs when there is data in a machine learning training dataset that is strongly correlated with the target label, but is not available in real-world data. For example, you may have a column in your dataset that serves as a proxy for the column you want to predict with your model. 

When you use the **Target Leakage** analysis, you specify the following:
+ **Target**: This is the feature about which you want your ML model to be able to make predictions.
+ **Problem type**: This is the ML problem type on which you are working. Problem type can either be **classification** or **regression**. 
+  (Optional) **Max features**: This is the maximum number of features to present in the visualization, which shows features ranked by their risk of being target leakage.

For classification, the target leakage analysis uses the area under the receiver operating characteristic curve (AUC-ROC) for each column, up to **Max features**. For regression, it uses the coefficient of determination (R2).

The AUC-ROC curve provides a predictive metric, computed individually for each column using cross-validation on a sample of up to approximately 1,000 rows. A score of 1 indicates perfect predictive ability, which often indicates target leakage. A score of 0.5 or lower indicates that, on its own, the column provides no useful information for predicting the target. Although a column can be uninformative on its own yet useful for predicting the target in tandem with other features, a low score can indicate that the feature is redundant.
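The per-column AUC score can be illustrated with a minimal pairwise computation (the Mann-Whitney formulation; `auc` is an illustrative helper, not part of Data Wrangler):

```
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons (Mann-Whitney statistic)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    # Count the fraction of positive/negative pairs ranked correctly; ties count half
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating column scores 1.0; a column whose scores carry no information about the label scores about 0.5.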

## Multicollinearity
<a name="canvas-multicollinearity"></a>

Multicollinearity is a circumstance where two or more predictor variables are related to each other. The predictor variables are the features in your dataset that you're using to predict a target variable. When you have multicollinearity, the predictor variables are not only predictive of the target variable, but also predictive of each other.

You can use the **Variance Inflation Factor (VIF)**, **Principal Component Analysis (PCA)**, or **Lasso feature selection** as measures for the multicollinearity in your data. For more information, see the following.

------
#### [ Variance Inflation Factor (VIF) ]

The Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is a positive number that is greater than or equal to 1.

A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1 indicate higher correlation.

Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50. If you have a VIF score greater than 50, Data Wrangler sets the score to 50.

You can use the following guidelines to interpret your VIF scores:
+ A VIF score less than 5 indicates that the variable is moderately correlated with the other variables.
+ A VIF score of 5 or greater indicates that the variable is highly correlated with the other variables.
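A minimal sketch of how a VIF score can be computed, regressing each column on the others and clipping at 50 as described above (an illustration, not Data Wrangler's implementation):

```
import numpy as np

def vif_scores(X, clip=50.0):
    """VIF for each column: regress it on the other columns, then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = []
    for i in range(p):
        y = X[:, i]
        # Design matrix: intercept plus all other columns
        A = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        sse = float(np.sum((y - A @ coef) ** 2))
        sst = float(np.sum((y - y.mean()) ** 2))
        r2 = 1.0 - sse / sst
        vif = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
        scores.append(min(vif, clip))  # clip high scores, as Data Wrangler caps at 50
    return scores
```

Uncorrelated columns score 1; perfectly collinear columns are clipped to 50.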

------
#### [ Principal Component Analysis (PCA) ]

Principal Component Analysis (PCA) measures the variance of the data along different directions in the feature space. The feature space consists of all the predictor variables that you use to predict the target variable in your dataset.

For example, if you're trying to predict who survived on the *RMS Titanic* after it hit an iceberg, your feature space can include the passengers' age, gender, and the fare that they paid.

From the feature space, PCA generates an ordered list of variances. These variances are also known as singular values. The values in the list of variances are greater than or equal to 0. We can use them to determine how much multicollinearity there is in our data.

When the numbers are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, we have many instances of multicollinearity. Before it performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1.

**Note**  
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
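A minimal sketch of this computation with NumPy's SVD, assuming the normalization described above (an illustration, not Data Wrangler's implementation):

```
import numpy as np

def pca_singular_values(X):
    """Normalize each feature to mean 0 and standard deviation 1, then take the SVD."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # The singular values are returned in descending order
    return np.linalg.svd(Z, compute_uv=False)
```

Two perfectly collinear columns yield one large singular value and one near zero, signaling multicollinearity; roughly uniform values indicate little multicollinearity.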

------
#### [ Lasso feature selection ]

Lasso feature selection uses the L1 regularization technique to only include the most predictive features in your dataset.

For both classification and regression, the regularization technique generates a coefficient for each feature. The absolute value of the coefficient provides an importance score for the feature. A higher importance score indicates that it is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.

------

## Detect anomalies in time series data
<a name="canvas-time-series-anomaly-detection"></a>

You can use the anomaly detection visualization to see outliers in your time series data. To understand what determines an anomaly, you need to understand that we decompose the time series into a predicted term and an error term. We treat the seasonality and trend of the time series as the predicted term. We treat the residuals as the error term.

For the error term, you specify a threshold as the number of standard deviations the residual can be away from the mean before it's considered an anomaly. For example, you can specify a threshold of 3 standard deviations. Any residual greater than 3 standard deviations away from the mean is an anomaly.
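The thresholding rule above can be sketched as follows (`flag_anomalies` is an illustrative helper, not a Data Wrangler API):

```
from statistics import mean, pstdev

def flag_anomalies(residuals, threshold=3.0):
    """Return indices of residuals more than `threshold` standard deviations from the mean."""
    mu = mean(residuals)
    sigma = pstdev(residuals)  # population standard deviation of the residuals
    return [i for i, r in enumerate(residuals) if abs(r - mu) > threshold * sigma]
```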

You can use the following procedure to perform an **Anomaly detection** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Anomaly detection**.

1. For **Anomaly threshold**, choose the threshold at which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the analysis to the Data Wrangler data flow.

## Seasonal trend decomposition in time series data
<a name="canvas-seasonal-trend-decomposition"></a>

You can determine whether there's seasonality in your time series data by using the Seasonal Trend Decomposition visualization. We use the STL (Seasonal Trend decomposition using LOESS) method to perform the decomposition. We decompose the time series into its seasonal, trend, and residual components. The trend reflects the long term progression of the series. The seasonal component is a signal that recurs in a time period. After removing the trend and the seasonal components from the time series, you have the residual.

You can use the following procedure to perform a **Seasonal-Trend decomposition** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)), and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Seasonal-Trend decomposition**.

1. For **Anomaly threshold**, choose the threshold at which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the analysis to the Data Wrangler data flow.

## Create custom visualizations
<a name="canvas-visualize-custom"></a>

You can add an analysis to your Data Wrangler flow to create a custom visualization. Your dataset, with all the transformations you've applied, is available as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Data Wrangler uses the `df` variable to store the dataframe. You access the dataframe by calling the variable.

You must provide the output variable, `chart`, to store an [Altair](https://altair-viz.github.io/) output chart. For example, you can use the following code block to create a custom histogram using the Titanic dataset.

```
import altair as alt
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
chart = bar + rule
```

**To create a custom visualization:**

1. Next to the node containing the transformation that you'd like to visualize, choose the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)).

1. Choose **Add analysis**.

1. For **Analysis type**, choose **Custom Visualization**.

1. For **Analysis name**, specify a name.

1. Enter your code in the code box. 

1. Choose **Preview** to preview your visualization.

1. Choose **Save** to add your visualization.

If you don’t know how to use the Altair visualization package in Python, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose **Search example snippets** and specify a query in the search bar.

The following example uses the **Binned scatterplot** code snippet. It plots a histogram for 2 dimensions.

The snippets have comments to help you understand the changes that you need to make to the code. You usually need to specify the column names of your dataset in the code.

```
import altair as alt

# Specify the number of top rows for plotting
rows_number = 1000
df = df.head(rows_number)  
# You can also choose bottom rows or randomly sampled rows
# df = df.tail(rows_number)
# df = df.sample(rows_number)


chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)

# :Q specifies that label column has quantitative type.
# For more details on Altair typing refer to
# https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types
```

# Transform data
<a name="canvas-transform"></a>

Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning and featurizing your data. Using the interactive data preparation tools in Data Wrangler, you can sample datasets of any size with a variety of sampling techniques and start exploring your data in a matter of minutes. After finalizing your data transforms on the sampled data, you can then scale the data flow to apply those transformations to the entire dataset.

When you add a transform, it adds a step to the data flow. Each transform you add modifies your dataset and produces a new dataframe. All subsequent transforms apply to the resulting dataframe.

Data Wrangler includes built-in transforms, which you can use to transform columns without any code. If you know how you want to prepare your data but don't know how to get started or which transforms to use, you can use the chat for data prep feature to interact conversationally with Data Wrangler and apply transforms using natural language. For more information, see [Chat for data prep](canvas-chat-for-data-prep.md). 

You can also add custom transformations using PySpark, Python (User-Defined Function), pandas, and PySpark SQL. Some transforms operate in place, while others create a new output column in your dataset.

You can apply transforms to multiple columns at once. For example, you can delete multiple columns in a single step.

You can apply the **Process numeric** and **Handle missing** transforms only to a single column.

Use this page to learn more about the built-in and custom transforms offered by Data Wrangler.

## Join Datasets
<a name="canvas-transform-join"></a>

You can join datasets directly in your data flow. When you join two datasets, the resulting joined dataset appears in your flow. The following join types are supported by Data Wrangler.
+ **Left outer** – Include all rows from the left table. If the value for the column joined on a left table row does not match any right table row values, that row contains null values for all right table columns in the joined table.
+ **Left anti** – Include rows from the left table that do not contain values in the right table for the joined column.
+ **Left semi** – Include a single row from the left table for all identical rows that satisfy the criteria in the join statement. This excludes duplicate rows from the left table that match the criteria of the join.
+ **Right outer** – Include all rows from the right table. If the value for the joined column in a right table row does not match any left table row values, that row contains null values for all left table columns in the joined table.
+ **Inner** – Include rows from left and right tables that contain matching values in the joined column. 
+ **Full outer** – Include all rows from the left and right tables. If the row value for the joined column in either table does not match, separate rows are created in the joined table. If a row doesn’t contain a value for a column in the joined table, null is inserted for that column.
+ **Cartesian cross** – Include rows that combine each row from the first table with each row from the second table. This is a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of rows from the tables in the join. The size of this product is the size of the left table times the size of the right table. Therefore, we recommend caution when using this join between very large datasets. 
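Data Wrangler performs joins with Spark, but the join semantics above can be illustrated with pandas (the column names below are hypothetical):

```
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"id": [2, 3], "b": ["u", "v"]})

inner = left.merge(right, on="id", how="inner")  # only id 2 matches
outer = left.merge(right, on="id", how="outer")  # ids 1, 2, 3; nulls where unmatched
cross = left.merge(right, how="cross")           # 2 x 2 = 4 rows

# Left anti: keep left rows with no match in the right table
marked = left.merge(right, on="id", how="left", indicator=True)
left_anti = marked[marked["_merge"] == "left_only"][left.columns]
```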

Use the following procedure to join two datasets. You should have already imported two data sources into your data flow.

1. Select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the left node that you want to join. The first node you select is always the left table in your join. 

1. Hover over **Combine data**, and then choose **Join**.

1. Select the right node. The second node you select is always the right table in your join.

1. The **Join type** field is set to **Inner join** by default. Select the dropdown menu to change the join type.

1. For **Join keys**, verify the columns from the left and right tables that you want to use to join the data. You can add or remove additional join keys.

1. For **Name of join**, enter a name for the joined data, or use the default name.

1. (Optional) Choose **Preview** to preview the joined data.

1. Choose **Add** to complete the join.

**Note**  
If you receive a notice that Canvas didn't identify any matching rows when joining your data, we recommend that you either verify that you've selected the correct columns, or update your sample to try to find matching rows. You can choose a different sampling strategy or change the size of the sample. For information about how to edit the sample, see [Edit the data flow sampling configuration](canvas-data-flow-edit-sampling.md).

You should now see a join node added to your data flow.

## Concatenate Datasets
<a name="canvas-transform-concatenate"></a>

Concatenating combines two datasets by appending the rows from one dataset to another.
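The concatenation behavior, including the optional source-indicator column and duplicate removal described in the following procedure, can be illustrated with pandas (the column names below are hypothetical):

```
import pandas as pd

first = pd.DataFrame({"id": [1, 2]})
second = pd.DataFrame({"id": [2, 3]})

# Append the rows of the second dataset to the first,
# adding a column that records each row's source dataset
combined = pd.concat(
    [first.assign(source="first"), second.assign(source="second")],
    ignore_index=True,
)

# Optionally remove duplicate rows after concatenating (ignoring the indicator)
deduped = combined.drop_duplicates(subset=["id"])
```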

Use the following procedure to concatenate two datasets. You should have already imported two data sources into your data flow.

**To concatenate two datasets:**

1. Select the **More options** icon (![\[Vertical ellipsis icon representing a menu or more options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/canvas/more-options-icon.png)) next to the left node that you want to concatenate. The first node you select is always the left table in your concatenate operation. 

1. Hover over **Combine data**, and then choose **Concatenate**.

1. Select the right node. The second node you select is always the right table in your concatenate.

1. (Optional) Select the checkbox next to **Remove duplicates after concatenation** to remove duplicate columns. 

1. (Optional) Select the checkbox next to **Add column to indicate source dataframe** to add a column to the resulting dataframe that lists the source dataset for each record.

   1. For **Indicator column name**, enter a name for the added column.

   1. For **First dataset indicating string**, enter the value you want to use to mark records from the first dataset (or the left node).

   1. For **Second dataset indicating string**, enter the value you want to use to mark records from the second dataset (or the right node).

1. For **Name of concatenate**, enter a name for the concatenation.

1. (Optional) Choose **Preview** to preview the concatenated data.

1. Choose **Add** to add the new dataset to your data flow. 

You should now see a concatenate node added to your data flow.

## Balance Data
<a name="canvas-transform-balance-data"></a>

You can balance the data for datasets with an underrepresented category. Balancing a dataset can help you create better models for binary classification.

**Note**  
You can't balance datasets containing column vectors.

You can use the **Balance data** operation to balance your data using one of the following operators:
+ *Random oversampling* – Randomly duplicates samples in the minority category. For example, if you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases within the dataset 8 times.
+ *Random undersampling* – The counterpart of random oversampling. Randomly removes samples from the overrepresented category to get the proportion of samples that you desire.
+ *Synthetic Minority Oversampling Technique (SMOTE)* – Uses samples from the underrepresented category to interpolate new synthetic minority samples. For more information about SMOTE, see the following description.

You can use any of these transforms on datasets containing both numeric and non-numeric features. SMOTE interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to determine the neighborhood used to interpolate the additional samples. Data Wrangler uses only numeric features to calculate the distances between samples in the underrepresented group.

For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features by using a weighted average, randomly assigning weights to the samples in the range of [0, 1]. For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The interpolated sample then has a value of 0.7A + 0.3B.

Data Wrangler interpolates non-numeric features by copying from either of the interpolated real samples. It copies the samples with a probability that it randomly assigns to each sample. For samples A and B, it can assign probabilities 0.8 to A and 0.2 to B. For the probabilities it assigned, it copies A 80% of the time.
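The SMOTE-style interpolation described above can be sketched as follows (an illustrative helper, not Data Wrangler's implementation; the weight and copy probabilities are randomly drawn, as described):

```
import random

def interpolate(sample_a, sample_b, numeric_keys, rng=random):
    """Create one synthetic sample between two real minority-class samples."""
    w = rng.random()  # random weight in the range [0, 1]
    synthetic = {}
    for key in sample_a:
        if key in numeric_keys:
            # Numeric features: weighted average, w*A + (1 - w)*B
            synthetic[key] = w * sample_a[key] + (1 - w) * sample_b[key]
        else:
            # Non-numeric features: copy from A with probability w, else from B
            synthetic[key] = sample_a[key] if rng.random() < w else sample_b[key]
    return synthetic
```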

## Custom Transforms
<a name="canvas-transform-custom"></a>

The **Custom Transforms** group allows you to use Python (User-Defined Function), PySpark, pandas, or PySpark (SQL) to define custom transformations. For all of these options, you use the variable `df` to access the dataframe to which you want to apply the transform. To apply your custom code to your dataframe, assign the dataframe with the transformations that you've made to the `df` variable. If you're not using Python (User-Defined Function), you don't need to include a return statement. Choose **Preview** to preview the result of the custom transform. Choose **Add** to add the custom transform to your list of **Previous steps**.

You can import popular libraries with an `import` statement in the custom transform code block, such as the following:
+ NumPy version 1.19.0
+ scikit-learn version 0.23.2
+ SciPy version 1.5.4
+ pandas version 1.0.3
+ PySpark version 3.0.0

**Important**  
**Custom transform** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

If you include print statements in the code block, the result appears when you select **Preview**. You can resize the custom code transformer panel. Resizing the panel provides more space to write code. 

The following sections provide additional context and examples for writing custom transform code.

**Python (User-Defined Function)**

The Python function gives you the ability to write custom transformations without needing to know Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly. You get similar performance using custom Python code and an Apache Spark plugin.

To use the Python (User-Defined Function) code block, you specify the following:
+ **Input column** – The input column where you're applying the transform.
+ **Mode** – The scripting mode, either pandas or Python.
+ **Return type** – The data type of the value that you're returning.

Using the pandas mode gives better performance. The Python mode makes it easier for you to write transformations by using pure Python functions.
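As an illustration of the difference between the two modes, the following sketch shows what the function bodies might look like outside Data Wrangler; the exact code-block signatures in the product may differ, and the column name is hypothetical.

```python
import pandas as pd

# pandas mode: the function receives the whole input column as a Series
# and returns a transformed Series (vectorized, hence faster).
def pandas_mode(series: pd.Series) -> pd.Series:
    return series.str.strip().str.lower()

# Python mode: the function receives one value at a time.
def python_mode(value: str) -> str:
    return value.strip().lower()

col = pd.Series([" Dog ", "CAT"])
print(pandas_mode(col).tolist())      # ['dog', 'cat']
print([python_mode(v) for v in col])  # ['dog', 'cat']
```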

**PySpark**

The following example extracts date and time from a timestamp.

```
from pyspark.sql.functions import from_unixtime, to_date, date_format

df = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))
df = df.withColumn('EVENT_DATE', to_date('DATE_TIME')) \
       .withColumn('EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss'))
```

**pandas**

The following example provides an overview of the dataframe to which you are adding transforms. 

```
df.info()
```

**PySpark (SQL)**

The following example creates a new dataframe with four columns: *name*, *fare*, *pclass*, *survived*.

```
SELECT name, fare, pclass, survived FROM df
```

If you don’t know how to use PySpark, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of code snippets. You can use code snippets to perform tasks such as dropping columns, grouping by columns, or modeling.

To use a code snippet, choose **Search example snippets** and specify a query in the search bar. The text you specify in the query doesn’t have to match the name of the code snippet exactly.

The following example shows a **Drop duplicate rows** code snippet that can delete rows with similar data in your dataset. You can find the code snippet by searching for one of the following:
+ Duplicates
+ Identical
+ Remove

The following snippet has comments to help you understand the changes that you need to make. For most snippets, you must specify the column names of your dataset in the code.

```
# Specify the subset of columns
# all rows having identical values in these columns will be dropped

subset = ["col1", "col2", "col3"]
df = df.dropDuplicates(subset)  

# to drop the full-duplicate rows run
# df = df.dropDuplicates()
```

To use a snippet, copy and paste its content into the **Custom transform** field. You can copy and paste multiple code snippets into the custom transform field.

## Custom Formula
<a name="canvas-transform-custom-formula"></a>

Use **Custom formula** to define a new column using a Spark SQL expression to query data in the current dataframe. The query must use the conventions of Spark SQL expressions.

**Important**  
**Custom formula** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

You can use this transform to perform operations on columns, referencing the columns by name. For example, assuming the current dataframe contains columns named *col_a* and *col_b*, you can produce an **Output column** that is the product of these two columns with the following code:

```
col_a * col_b
```

Other common operations include the following, assuming a dataframe contains `col_a` and `col_b` columns:
+ Concatenate two columns: `concat(col_a, col_b)`
+ Add two columns: `col_a + col_b`
+ Subtract two columns: `col_a - col_b`
+ Divide two columns: `col_a / col_b`
+ Take the absolute value of a column: `abs(col_a)`
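To prototype these expressions outside Data Wrangler, a rough pandas analogue can be useful; the following sketch assumes a dataframe with `col_a` and `col_b` columns (Data Wrangler itself evaluates the formulas as Spark SQL, not pandas).

```python
import pandas as pd

# Rough pandas analogue of the Spark SQL formulas above (illustration only).
df = pd.DataFrame({"col_a": [2, -3], "col_b": [10, 4]})

df["product"] = df.eval("col_a * col_b")  # col_a * col_b
df["total"] = df.eval("col_a + col_b")    # col_a + col_b
df["ratio"] = df.eval("col_a / col_b")    # col_a / col_b
df["abs_a"] = df["col_a"].abs()           # abs(col_a)
print(df["product"].tolist())  # [20, -12]
```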

For more information, see the [Spark documentation](http://spark.apache.org/docs/latest/api/python) on selecting data. 

## Reduce Dimensionality within a Dataset
<a name="canvas-transform-dimensionality-reduction"></a>

Reduce the dimensionality in your data by using Principal Component Analysis (PCA). The dimensionality of your dataset corresponds to the number of features. When you use dimensionality reduction in Data Wrangler, you get a new set of features called components. Each component accounts for some variability in the data.

The first component accounts for the largest amount of variation in the data. The second component accounts for the second largest amount of variation in the data, and so on.

You can use dimensionality reduction to reduce the size of the datasets that you use to train models. Instead of training on the features in your dataset, you can train on the principal components.

To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns in your dataset. The first principal component is the value on the axis that has the largest amount of variance. The second principal component is the value on the axis that has the second largest amount of variance. The nth principal component is the value on the axis that has the nth largest amount of variance.

You can configure the number of principal components that Data Wrangler returns. You can either specify the number of principal components directly or you can specify the variance threshold percentage. Each principal component explains an amount of variance in the data. For example, you might have a principal component with a value of 0.5. The component would explain 50% of the variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the smallest number of components that meet the percentage that you specify.

The following are example principal components with the amount of variance that they explain in the data.
+ Component 1 – 0.5
+ Component 2 – 0.45
+ Component 3 – 0.05

If you specify a variance threshold percentage of `94` or `95`, Data Wrangler returns Component 1 and Component 2. If you specify a variance threshold percentage of `96`, Data Wrangler returns all three principal components.
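The selection rule can be sketched in a few lines of plain Python, using the example ratios above; this illustrates the rule only, not Data Wrangler's internal implementation.

```python
# Select the smallest number of leading components whose cumulative
# explained variance meets the threshold percentage.
def components_for_threshold(explained, threshold_pct):
    total, count = 0.0, 0
    for ratio in explained:  # ratios are sorted, largest first
        total += ratio
        count += 1
        if total * 100 >= threshold_pct:
            return count
    return count

explained = [0.5, 0.45, 0.05]
print(components_for_threshold(explained, 95))  # 2
print(components_for_threshold(explained, 96))  # 3
```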

To run PCA on your dataset, do the following.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Dimensionality Reduction**.

1. For **Input Columns**, choose the features that you're reducing into the principal components.

1. (Optional) For **Number of principal components**, choose the number of principal components that Data Wrangler returns in your dataset. If you specify a value for the field, you can't specify a value for **Variance threshold percentage**.

1. (Optional) For **Variance threshold percentage**, specify the percentage of variation in the data that you want explained by the principal components. Data Wrangler uses the default value of `95` if you don't specify a value for the variance threshold. You can't specify a variance threshold percentage if you've specified a value for **Number of principal components**.

1. (Optional) Deselect **Center** to not use the mean of the columns as the center of the data. By default, Data Wrangler centers the data with the mean before scaling.

1. (Optional) Deselect **Scale** to not scale the data with the unit standard deviation.

1. (Optional) Choose **Columns** to output the components to separate columns. Choose **Vector** to output the components as a single vector.

1. (Optional) For **Output column**, specify a name for an output column. If you're outputting the components to separate columns, the name that you specify is a prefix. If you're outputting the components to a vector, the name that you specify is the name of the vector column.

1. (Optional) Select **Keep input columns**. We don't recommend selecting this option if you plan on only using the principal components to train your model.

1. Choose **Preview**.

1. Choose **Add**.

## Encode Categorical
<a name="canvas-transform-cat-encode"></a>

Categorical data is usually composed of a finite number of categories, where each category is represented with a string. For example, if you have a table of customer data, a column that indicates the country a person lives in is categorical. The categories would be *Afghanistan*, *Albania*, *Algeria*, and so on. Categorical data can be *nominal* or *ordinal*. Ordinal categories have an inherent order, and nominal categories do not. The highest degree obtained (*High school*, *Bachelors*, *Masters*, and so on) is an example of ordinal categories. 

Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are *Dog* and *Cat*, you may encode this information into two vectors, `[1,0]` to represent *Dog*, and `[0,1]` to represent *Cat*.

When you encode ordinal categories, you may need to translate the natural order of categories into your encoding. For example, you can represent the highest degree obtained with the following map: `{"High school": 1, "Bachelors": 2, "Masters":3}`.

Use categorical encoding to encode categorical data that is in string format into arrays of integers. 

The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the time the step is defined. If new categories have been added to a column when you start a Data Wrangler job to process your dataset at time *t*, and this column was the input for a Data Wrangler categorical encoding transform at time *t*-1, these new categories are considered *missing* in the Data Wrangler job. The option you select for **Invalid handling strategy** is applied to these missing values. Examples of when this can occur are: 
+ When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the creation of the data flow. For example, you may use a data flow to regularly process sales data each month. If that sales data is updated weekly, new categories may be introduced into columns for which an encode categorical step is defined. 
+ When you select **Sampling** when you import your dataset, some categories may be left out of the sample. 

In these situations, these new categories are considered missing values in the Data Wrangler job.

You can choose from and configure an *ordinal* and a *one-hot encode*. Use the following sections to learn more about these options. 

Both transforms create a new column named **Output column name**. You specify the output format of this column with **Output style**:
+ Select **Vector** to produce a single column with a sparse vector. 
+ Select **Columns** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category.

### Ordinal Encode
<a name="canvas-transform-cat-encode-ordinal"></a>

Select **Ordinal encode** to encode categories into an integer between 0 and the total number of categories in the **Input column** you select.

**Invalid handling strategy**: Select a method to handle invalid or missing values. 
+ Choose **Skip** if you want to omit the rows with missing values.
+ Choose **Keep** to retain missing values as the last category.
+ Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ Choose **Replace with NaN** to replace missing values with NaN. This option is recommended if your ML algorithm can handle missing values. Otherwise, the first three options in this list may produce better results.

### One-Hot Encode
<a name="canvas-transform-cat-encode-onehot"></a>

Select **One-hot encode** for **Transform** to use one-hot encoding. Configure this transform using the following: 
+ **Drop last category**: If `True`, the last category does not have a corresponding index in the one-hot encoding. When missing values are possible, a missing category is always the last one and setting this to `True` means that a missing value results in an all zero vector.
+ **Invalid handling strategy**: Select a method to handle invalid or missing values. 
  + Choose **Skip** if you want to omit the rows with missing values.
  + Choose **Keep** to retain missing values as the last category.
  + Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ **Is input ordinal encoded**: Select this option if the input vector contains ordinal encoded data. This option requires that input data contain non-negative integers. If **True**, input *i* is encoded as a vector with a non-zero in the *i*th location. 
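The **Drop last category** behavior can be sketched as follows, with the missing category placed last as described above; this is an illustration of the concept, not Data Wrangler's internals.

```python
# With drop_last=True, the last category (here the missing category)
# gets no indicator position, so a missing value becomes an all-zero vector.
categories = ["Cat", "Dog", "<missing>"]  # missing is always last

def one_hot(value, drop_last=True):
    cats = categories[:-1] if drop_last else categories
    return [1 if value == c else 0 for c in cats]

print(one_hot("Dog"))        # [0, 1]
print(one_hot("<missing>"))  # [0, 0]  <- all-zero vector for missing
```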

### Similarity encode
<a name="canvas-transform-cat-encode-similarity"></a>

Use similarity encoding when you have the following:
+ A large number of categorical variables
+ Noisy data

The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia".

Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It converts the tokens into an embedding using min-hash encoding.

The similarity encodings that Data Wrangler creates:
+ Have low dimensionality
+ Are scalable to a large number of categories
+ Are robust and resistant to noise

For the preceding reasons, similarity encoding is more versatile than one-hot encoding.
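The intuition behind the 3-gram tokenization can be shown with a toy sketch: near-duplicate strings share most of their character 3-grams, so they receive similar encodings. The min-hash step that Data Wrangler applies on top of the token sets is omitted here.

```python
# Character 3-grams make near-duplicates like "California"/"Calfornia"
# overlap heavily, while unrelated strings share nothing.
def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(jaccard("California", "Calfornia"))  # 0.5
print(jaccard("California", "Texas"))      # 0.0
```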

To use similarity encoding, do the following.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker/).

1. Choose **Open Studio Classic**.

1. Choose **Launch app**.

1. Choose **Studio**.

1. Specify your data flow.

1. Choose a step with a transformation.

1. Choose **Add step**.

1. Choose **Encode categorical**.

1. Specify the following:
   + **Transform** – **Similarity encode**
   + **Input column** – The column containing the categorical data that you're encoding.
   + **Target dimension** – (Optional) The dimension of the categorical embedding vector. The default value is 30. We recommend using a larger target dimension if you have a large dataset with many categories.
   + **Output style** – Choose **Vector** for a single vector with all of the encoded values. Choose **Column** to have the encoded values in separate columns.
   + **Output column** – (Optional) The name of the output column for a vector encoded output. For a column-encoded output, this is the prefix of the column names, followed by a number.

## Featurize Text
<a name="canvas-transform-featurize-text"></a>

Use the **Featurize Text** transform group to inspect string-typed columns and use text embedding to featurize these columns. 

This feature group contains two features, *Character statistics* and *Vectorize*. Use the following sections to learn more about these transforms. For both options, the **Input column** must contain text data (string type).

### Character Statistics
<a name="canvas-transform-featurize-text-character-stats"></a>

Use **Character statistics** to generate statistics for each row in a column containing text data. 

This transform computes the following ratios and counts for each row, and creates a new column to report the result. The new column is named using the input column name as a prefix and a suffix that is specific to the ratio or count. 
+ **Number of words**: The total number of words in that row. The suffix for this output column is `-stats_word_count`.
+ **Number of characters**: The total number of characters in that row. The suffix for this output column is `-stats_char_count`.
+ **Ratio of upper**: The number of uppercase characters, from A to Z, divided by all characters in the column. The suffix for this output column is `-stats_capital_ratio`.
+ **Ratio of lower**: The number of lowercase characters, from a to z, divided by all characters in the column. The suffix for this output column is `-stats_lower_ratio`.
+ **Ratio of digits**: The ratio of digits in a single row over the sum of digits in the input column. The suffix for this output column is `-stats_digit_ratio`.
+ **Special characters ratio**: The ratio of non-alphanumeric characters (such as #$&%:@) over the sum of all characters in the input column. The suffix for this output column is `-stats_special_ratio`.
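A few of these statistics can be reproduced with plain pandas as a sketch; the example column name is hypothetical, and the output column names mirror the suffixes listed above.

```python
import pandas as pd

# Sketch of per-row character statistics (not Data Wrangler's implementation).
df = pd.DataFrame({"review": ["Great product!", "ok"]})

df["review-stats_word_count"] = df["review"].str.split().str.len()
df["review-stats_char_count"] = df["review"].str.len()
df["review-stats_capital_ratio"] = (
    df["review"].str.count(r"[A-Z]") / df["review"].str.len()
)
print(df["review-stats_word_count"].tolist())  # [2, 1]
```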

### Vectorize
<a name="canvas-transform-featurize-text-vectorize"></a>

Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–inverse document frequency (TF-IDF) vectors. 

When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real number that represents its semantic importance. Higher numbers are associated with less frequent words, which tend to be more meaningful. 
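A minimal TF-IDF calculation makes this concrete: a word that appears in every document gets a weight of zero, while rarer words get positive weights. Smoothing conventions vary, and Data Wrangler's Spark-based implementation differs in detail; this is a sketch of the idea.

```python
import math
from collections import Counter

docs = [["good", "dog"], ["good", "cat"], ["good", "good", "bird"]]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)           # term frequency in the doc
    df = sum(1 for d in docs if term in d)       # document frequency
    idf = math.log(len(docs) / df)               # inverse document frequency
    return tf * idf

# "good" appears in every document, so its weight is 0;
# the rarer "dog" gets a positive weight.
print(tf_idf("good", docs[0], docs))      # 0.0
print(tf_idf("dog", docs[0], docs) > 0)   # True
```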

When you define a **Vectorize** transform step, Data Wrangler uses the data in your dataset to define the count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.

You configure this transform using the following: 
+ **Output column name**: This transform creates a new column with the text embedding. Use this field to specify a name for this output column. 
+ **Tokenizer**: A tokenizer converts the sentence into a list of words, or *tokens*. 

  Choose **Standard** to use a tokenizer that splits by white space and converts each word to lowercase. For example, `"Good dog"` is tokenized to `["good","dog"]`.

  Choose **Custom** to use a customized tokenizer. If you choose **Custom**, you can use the following fields to configure the tokenizer:
  + **Minimum token length**: The minimum length, in characters, for a token to be valid. Defaults to `1`. For example, if you specify `3` for minimum token length, words like `a, at, in` are dropped from the tokenized sentence. 
  + **Should regex split on gaps**: If selected, **regex** splits on gaps. Otherwise, it matches tokens. Defaults to `True`. 
  + **Regex pattern**: Regex pattern that defines the tokenization process. Defaults to `'\\s+'`.
  + **To lowercase**: If chosen, Data Wrangler converts all characters to lowercase before tokenization. Defaults to `True`.

  To learn more, see the Spark documentation on [Tokenizer](https://spark.apache.org/docs/latest/ml-features#tokenizer).
+ **Vectorizer**: The vectorizer converts the list of tokens into a sparse numeric vector. Each token corresponds to an index in the vector and a non-zero indicates the existence of the token in the input sentence. You can choose from two vectorizer options, *Count* and *Hashing*.
  + **Count vectorize** allows customizations that filter infrequent or too common tokens. **Count vectorize parameters** include the following: 
    + **Minimum term frequency**: In each row, terms (tokens) with smaller frequency are filtered. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `1`.
    + **Minimum document frequency**: Minimum number of rows in which a term (token) must appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the number of rows. Defaults to `1`.
    + **Maximum document frequency**: Maximum number of documents (rows) in which a term (token) can appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the number of rows. Defaults to `0.999`.
    + **Maximum vocabulary size**: Maximum size of the vocabulary. The vocabulary is made up of all terms (tokens) in all rows of the column. Defaults to `262144`.
    + **Binary outputs**: If selected, the vector outputs do not include the number of appearances of a term in a document, but rather are a binary indicator of its appearance. Defaults to `False`.

    To learn more about this option, see the Spark documentation on [CountVectorizer](https://spark.apache.org/docs/latest/ml-features#countvectorizer).
  + **Hashing** is computationally faster. **Hash vectorize parameters** includes the following:
    + **Number of features during hashing**: A hash vectorizer maps tokens to a vector index according to their hash value. This feature determines the number of possible hash values. Large values result in fewer collisions between hash values but a higher dimension output vector.

    To learn more about this option, see the Spark documentation on [FeatureHasher](https://spark.apache.org/docs/latest/ml-features#featurehasher).
+ **Apply IDF** applies an IDF transformation, which multiplies the term frequency with the standard inverse document frequency used for TF-IDF embedding. **IDF parameters** include the following: 
  + **Minimum document frequency**: Minimum number of documents (rows) in which a term (token) must appear to be included. If **Count vectorize** is the chosen vectorizer, we recommend that you keep the default value and only modify the **Minimum document frequency** field in **Count vectorize parameters**. Defaults to `5`.
+ **Output format**: The output format of each row. 
  + Select **Vector** to produce a single column with a sparse vector. 
  + Select **Flattened** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category. You can only choose flattened when **Vectorizer** is set as **Count vectorizer**.

## Transform Time Series
<a name="canvas-transform-time-series"></a>

In Data Wrangler, you can transform time series data. The values in a time series dataset are indexed to a specific time. For example, a dataset that shows the number of customers in a store for each hour of the day is a time series dataset. The following table shows an example of a time series dataset.

Hourly number of customers in a store


| Number of customers | Time (hour) | 
| --- | --- | 
| 4 | 09:00 | 
| 10 | 10:00 | 
| 14 | 11:00 | 
| 25 | 12:00 | 
| 20 | 13:00 | 
| 18 | 14:00 | 

In the preceding table, the **Number of customers** column contains the time series data, indexed by the hourly values in the **Time (hour)** column.

You might need to perform a series of transformations on your data to get it in a format that you can use for your analysis. Use the **Time series** transform group to transform your time series data. For more information about the transformations that you can perform, see the following sections.

**Topics**
+ [Group by a Time Series](#canvas-group-by-time-series)
+ [Resample Time Series Data](#canvas-resample-time-series)
+ [Handle Missing Time Series Data](#canvas-transform-handle-missing-time-series)
+ [Validate the Timestamp of Your Time Series Data](#canvas-transform-validate-timestamp)
+ [Standardizing the Length of the Time Series](#canvas-transform-standardize-length)
+ [Extract Features from Your Time Series Data](#canvas-transform-extract-time-series-features)
+ [Use Lagged Features from Your Time Series Data](#canvas-transform-lag-time-series)
+ [Create a Datetime Range In Your Time Series](#canvas-transform-datetime-range)
+ [Use a Rolling Window In Your Time Series](#canvas-transform-rolling-window)

### Group by a Time Series
<a name="canvas-group-by-time-series"></a>

You can use the group by operation to group time series data for specific values in a column.

For example, you have the following table that tracks the average daily electricity usage in a household.

Average daily household electricity usage


| Household ID | Daily timestamp | Electricity usage (kWh) | Number of household occupants | 
| --- | --- | --- | --- | 
| household\_0 | 1/1/2020 | 30 | 2 | 
| household\_0 | 1/2/2020 | 40 | 2 | 
| household\_0 | 1/4/2020 | 35 | 3 | 
| household\_1 | 1/2/2020 | 45 | 3 | 
| household\_1 | 1/3/2020 | 55 | 4 | 

If you choose to group by ID, you get the following table.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | 
| --- | --- | --- | 
| household\_0 | [30, 40, 35] | [2, 2, 3] | 
| household\_1 | [45, 55] | [3, 4] | 

Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of the sequence corresponds to the first timestamp of the series. For `household_0`, `30` is the first value of the **Electricity Usage Series**. The value of `30` corresponds to the first timestamp of `1/1/2020`.

You can include the starting timestamp and ending timestamp. The following table shows how that information appears.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | Start\_time | End\_time | 
| --- | --- | --- | --- | --- | 
| household\_0 | [30, 40, 35] | [2, 2, 3] | 1/1/2020 | 1/4/2020 | 
| household\_1 | [45, 55] | [3, 4] | 1/2/2020 | 1/3/2020 | 

You can use the following procedure to group by a time series column. 

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Time Series**.

1. Under **Transform**, choose **Group by**.

1. Specify a column in **Group by this column**.

1. For **Apply to columns**, specify a value.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.
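Outside Data Wrangler, the grouping above can be reproduced with a plain pandas group-by as a sketch; the column names are taken from the example tables, and Data Wrangler performs this through the **Group by** transform rather than this code.

```python
import pandas as pd

# Sketch: collect each household's usage values, ordered by timestamp.
df = pd.DataFrame({
    "household_id": ["household_0"] * 3 + ["household_1"] * 2,
    "timestamp": pd.to_datetime(
        ["1/1/2020", "1/2/2020", "1/4/2020", "1/2/2020", "1/3/2020"]),
    "usage_kwh": [30, 40, 35, 45, 55],
})

grouped = (df.sort_values("timestamp")
             .groupby("household_id")["usage_kwh"]
             .agg(list))
print(grouped["household_0"])  # [30, 40, 35]
print(grouped["household_1"])  # [45, 55]
```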

### Resample Time Series Data
<a name="canvas-resample-time-series"></a>

Time series data usually has observations that aren't taken at regular intervals. For example, a dataset could have some observations that are recorded hourly and other observations that are recorded every two hours.

Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals. Resampling gives you the ability to establish regular intervals for the observations in your dataset.

You can either upsample or downsample a time series. Downsampling increases the interval between observations in the dataset. For example, if you downsample observations that are taken either every hour or every two hours, each observation in your dataset is taken every two hours. The hourly observations are aggregated into a single value using an aggregation method such as the mean or median.

Upsampling reduces the interval between observations in the dataset. For example, if you upsample observations that are taken every two hours into hourly observations, you can use an interpolation method to infer hourly observations from the ones that have been taken every two hours. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

You can resample both numeric and non-numeric data.

Use the **Resample** operation to resample your time series data. If you have multiple time series in your dataset, Data Wrangler standardizes the time interval for each time series.

The following table shows an example of downsampling time series data by using the mean as the aggregation method. The data is downsampled from every hour to every two hours.

Hourly temperature readings over a day before downsampling


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 1:00 | 32 | 
| 2:00 | 35 | 
| 3:00 | 32 | 
| 4:00 | 30 | 

Temperature readings downsampled to every two hours


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 31 | 
| 2:00 | 33.5 | 
| 4:00 | 30 | 
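The downsampling above can be sketched with the pandas `resample` method, assuming the readings run hourly from 12:00 to 16:00 on one day; Data Wrangler performs this through the **Resample** transform rather than this code.

```python
import pandas as pd

# Sketch: downsample hourly readings to two-hour bins using the mean.
idx = pd.date_range("2020-01-01 12:00", periods=5, freq="h")
temps = pd.Series([30, 32, 35, 32, 30], index=idx)

downsampled = temps.resample("2h").mean()
print(downsampled.tolist())  # [31.0, 33.5, 30.0]
```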

You can use the following procedure to resample time series data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Resample**.

1. For **Timestamp**, choose the timestamp column.

1. For **Frequency unit**, specify the frequency that you're resampling.

1. (Optional) Specify a value for **Frequency quantity**.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Handle Missing Time Series Data
<a name="canvas-transform-handle-missing-time-series"></a>

If you have missing values in your dataset, you can do one of the following:
+ For datasets that have multiple time series, drop the time series that have missing values that are greater than a threshold that you specify.
+ Impute the missing values in a time series by using other values in the time series.

Imputing a missing value involves replacing the data by either specifying a value or by using an inferential method. The following are the methods that you can use for imputation:
+ Constant value – Replace all the missing data in your dataset with a value that you specify.
+ Most common value – Replace all the missing data with the value that has the highest frequency in the dataset.
+ Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
+ Backward fill – Use a backward fill to replace the missing values with the non-missing value that follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8]. 
+ Interpolate – Uses an interpolation function to impute the missing values. For more information on the functions that you can use for interpolation, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).
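The forward and backward fill examples above can be reproduced directly with pandas as a sketch (Data Wrangler applies these through the **Handle missing** transform):

```python
import pandas as pd

# The sequence [2, 4, 7, NaN, NaN, NaN, 8] from the examples above.
s = pd.Series([2, 4, 7, None, None, None, 8])

print(s.ffill().tolist())  # [2.0, 4.0, 7.0, 7.0, 7.0, 7.0, 8.0]
print(s.bfill().tolist())  # [2.0, 4.0, 7.0, 8.0, 8.0, 8.0, 8.0]
```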

Some of the imputation methods might not be able to impute all of the missing values in your dataset. For example, a **Forward fill** can't impute a missing value that appears at the beginning of the time series. You can impute those values by using a combination of a forward fill and a backward fill.

You can either impute missing values within a cell or within a column.

The following example shows how values are imputed within a cell.

Electricity usage with missing values


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household\_0 | [30, 40, 35, NaN, NaN] | 
| household\_1 | [45, NaN, 55] | 

Electricity usage with values imputed using a forward fill


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household\_0 | [30, 40, 35, 35, 35] | 
| household\_1 | [45, 45, 55] | 

The following example shows how values are imputed within a column.

Average daily household electricity usage with missing values


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household\_0 | 30 | 
| household\_0 | 40 | 
| household\_0 | NaN | 
| household\_1 | NaN | 
| household\_1 | NaN | 

Average daily household electricity usage with values imputed using a forward fill


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household\_0 | 30 | 
| household\_0 | 40 | 
| household\_0 | 40 | 
| household\_1 | 40 | 
| household\_1 | 40 | 

You can use the following procedure to handle missing values.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Handle missing**.

1. For **Time series input type**, choose whether you want to handle missing values inside of a cell or along a column.

1. For **Impute missing values for this column**, specify the column that has the missing values.

1. For **Method for imputing values**, select a method.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. If the preview still shows missing values, you can specify another method for imputing them under **Method for imputing values**.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Validate the Timestamp of Your Time Series Data
<a name="canvas-transform-validate-timestamp"></a>

You might have time stamp data that is invalid. You can use the **Validate time stamp** function to determine whether the timestamps in your dataset are valid. Your timestamp can be invalid for one or more of the following reasons:
+ Your timestamp column has missing values.
+ The values in your timestamp column are not formatted correctly.

If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use Data Wrangler to identify invalid timestamps and understand where you need to clean your data.

You can configure Data Wrangler to do one of the following when it encounters missing or invalid timestamps in your dataset:
+ Drop the rows that have the missing or invalid values.
+ Identify the rows that have the missing or invalid values.
+ Throw an error if it finds any missing or invalid values in your dataset.

You can validate the timestamps on columns that either have the `timestamp` type or the `string` type. If the column has the `string` type, Data Wrangler converts the type of the column to `timestamp` and performs the validation.
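The string-to-timestamp conversion and validation can be sketched with pandas. This is an illustration of the idea, not the Data Wrangler implementation; the column name and format are assumptions:

```python
import pandas as pd

# Illustrative validation: parse a string column and flag rows whose
# timestamps are missing or badly formatted.
df = pd.DataFrame({"ts": ["2024-01-01 10:00", None, "not-a-date"]})
parsed = pd.to_datetime(df["ts"], format="%Y-%m-%d %H:%M", errors="coerce")
df["ts_invalid"] = parsed.isna()
print(df["ts_invalid"].tolist())  # [False, True, True]
```

Flagged rows can then be dropped, identified, or treated as errors, matching the policies listed above.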

You can use the following procedure to validate the timestamps in your dataset.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Validate timestamps**.

1. For **Timestamp Column**, choose the timestamp column.

1. For **Policy**, choose how you want to handle missing or invalid timestamps.

1. (Optional) For **Output column**, specify a name for the output column.

1. If the datetime column has the string type, choose **Cast to datetime**.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Standardizing the Length of the Time Series
<a name="canvas-transform-standardize-length"></a>

If you have time series data stored as arrays, you can standardize each time series to the same length. Standardizing the length of the time series array might make it easier for you to perform your analysis on the data.

You can standardize your time series for data transformations that require the length of your data to be fixed.

Many ML algorithms require you to flatten your time series data before you use them. Flattening time series data means separating each value of the time series into its own column in a dataset. The number of columns in a dataset can't change, so the lengths of the time series need to be standardized before you flatten each array into a set of features.

Each time series is set to the length that you specify as a quantile or percentile of the time series set. For example, you can have three sequences that have the following lengths:
+ 3
+ 4
+ 5

You can set the length of all of the sequences as the length of the sequence that has the 50th percentile length.

Time series arrays that are shorter than the length that you've specified have missing values added. The following is an example of a time series standardized to a longer length: [2, 4, 5, NaN, NaN, NaN].

You can use different approaches to handle the missing values. For information on those approaches, see [Handle Missing Time Series Data](#canvas-transform-handle-missing-time-series).

The time series arrays that are longer than the length that you specify are truncated.
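The pad-or-truncate behavior described above can be sketched as follows. This is an illustration under assumed example data, not the Data Wrangler implementation:

```python
import numpy as np

# Sketch of length standardization: pad shorter arrays with NaN and truncate
# longer ones to a target length taken from the 50th percentile of lengths.
series = [[1, 2, 3], [2, 4, 5, 7], [1, 1, 2, 3, 5]]
target = int(np.percentile([len(s) for s in series], 50))  # target length 4
standardized = [(s + [np.nan] * (target - len(s)))[:target] for s in series]
print(target, standardized)
```

The shortest array gains a trailing NaN, and the longest is truncated to four items.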

You can use the following procedure to standardize the length of the time series.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Standardize length**.

1. For **Standardize the time series length for the column**, choose a column.

1. (Optional) For **Output column**, specify a name for the output column. If you don't specify a name, the transform is done in place.

1. If the datetime column has the string type, choose **Cast to datetime**.

1. Choose **Cutoff quantile** and specify a quantile to set the length of the sequence.

1. Choose **Flatten the output** to output the values of the time series into separate columns.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Extract Features from Your Time Series Data
<a name="canvas-transform-extract-time-series-features"></a>

If you're running a classification or a regression algorithm on your time series data, we recommend extracting features from the time series before running the algorithm. Extracting features might improve the performance of your algorithm.

Use the following options to choose how you want to extract features from your data:
+ Use **Minimal subset** to specify extracting 8 features that you know are useful in downstream analyses. You can use a minimal subset when you need to perform computations quickly. You can also use it when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
+ Use **Efficient subset** to specify extracting the most features possible without extracting features that are computationally intensive in your analyses.
+ Use **All features** to specify extracting all features from the time series.
+ Use **Manual subset** to choose a list of features that you think explain the variation in your data well.
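A minimal subset of summary features can be hand-rolled as a sketch. The feature names here are illustrative; the exact features Data Wrangler extracts may differ:

```python
import numpy as np

# Hand-rolled "minimal subset" of summary features (illustrative only).
def extract_minimal_features(ts):
    ts = np.asarray(ts, dtype=float)
    return {"mean": ts.mean(), "std": ts.std(), "min": ts.min(), "max": ts.max()}

feats = extract_minimal_features([30, 40, 35])
print(feats)
```

A classifier or regressor then trains on these fixed-size features instead of the raw variable-length series.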

Use the following procedure to extract features from your time series data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Extract features**.

1. For **Extract features for this column**, choose a column.

1. (Optional) Select **Flatten** to output the features into separate columns.

1. For **Strategy**, choose a strategy to extract the features.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use Lagged Features from Your Time Series Data
<a name="canvas-transform-lag-time-series"></a>

For many use cases, the best way to predict the future behavior of your time series is to use its most recent behavior.

The most common uses of lagged features are the following:
+ Collecting a handful of past values. For example, for time *t* + 1, you collect *t*, *t* - 1, *t* - 2, and *t* - 3.
+ Collecting values that correspond to seasonal behavior in the data. For example, to predict the occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as predictive as using the features from previous days.

Use the following procedure to create lag features.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Lag features**.

1. For **Generate lag features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. For **Lag**, specify the duration of the lag.

1. (Optional) Configure the output using one of the following options:
   + **Include the entire lag window**
   + **Flatten the output**
   + **Drop rows without history**

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.
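The lag features created by this transform can be sketched with pandas `shift`. This is an illustration, not the Data Wrangler implementation; the column name is an assumption:

```python
import pandas as pd

# Lag-feature sketch: each lag column is the original column shifted down,
# so row t sees the value from t - lag.
df = pd.DataFrame({"demand": [10, 14, 24, 40]})
for lag in (1, 2):
    df[f"demand_lag_{lag}"] = df["demand"].shift(lag)
print(df)
```

The first rows of each lag column are NaN because they have no history, which is why the transform offers a **Drop rows without history** option.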

### Create a Datetime Range In Your Time Series
<a name="canvas-transform-datetime-range"></a>

You might have time series data that don't have timestamps. If you know that the observations were taken at regular intervals, you can generate timestamps for the time series in a separate column. To generate timestamps, you specify the value for the start timestamp and the frequency of the timestamps.

For example, you might have the following time series data for the number of customers at a restaurant.

Time series data on the number of customers at a restaurant


| Number of customers | 
| --- | 
| 10 | 
| 14 | 
| 24 | 
| 40 | 
| 30 | 
| 20 | 

If you know that the restaurant opened at 1:00 PM and that the observations are taken hourly, you can add a timestamp column that corresponds to the time series data. You can see the timestamp column in the following table.

Time series data on the number of customers at a restaurant


| Number of customers | Timestamp | 
| --- | --- | 
| 10 | 1:00 PM | 
| 14 | 2:00 PM | 
| 24 | 3:00 PM | 
| 40 | 4:00 PM | 
| 30 | 5:00 PM | 
| 20 | 6:00 PM | 
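Generating the timestamp column above can be sketched with pandas `date_range`. The date itself is a placeholder assumption; only the start time and hourly frequency come from the example:

```python
import pandas as pd

# Generate hourly timestamps for the customer counts, starting at 1:00 PM.
customers = pd.DataFrame({"num_customers": [10, 14, 24, 40, 30, 20]})
customers["timestamp"] = pd.date_range(
    start="2024-01-01 13:00", periods=len(customers), freq="h"
)
labels = customers["timestamp"].dt.strftime("%I:%M %p").tolist()
print(labels)
```

Each row receives a timestamp one hour after the previous row, matching the table.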

Use the following procedure to add a datetime range to your data.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Datetime range**.

1. For **Frequency type**, choose the unit used to measure the frequency of the timestamps.

1. For **Starting timestamp**, specify the start timestamp.

1. For **Output column**, specify a name for the output column.

1. (Optional) Configure the output using the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use a Rolling Window In Your Time Series
<a name="canvas-transform-rolling-window"></a>

You can extract features over a time period. For example, for time, *t*, and a time window length of 3, and for the row that indicates the *t*th timestamp, we append the features that are extracted from the time series at times *t* - 3, *t* - 2, and *t* - 1. For information on extracting features, see [Extract Features from Your Time Series Data](#canvas-transform-extract-time-series-features). 
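The rolling-window idea can be sketched with pandas. This is an illustration, not the Data Wrangler implementation; the mean stands in for any extracted feature:

```python
import pandas as pd

# With a window length of 3, row t gets a feature computed from times
# t-3, t-2, and t-1; shift(1) keeps the window strictly in the past.
s = pd.Series([1, 2, 3, 4, 5])
rolling_mean = s.shift(1).rolling(window=3).mean()
print(rolling_mean.tolist())  # [nan, nan, nan, 2.0, 3.0]
```

The first three rows are NaN because they don't have a full window of history behind them.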

You can use the following procedure to extract features over a time period.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **\$1**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Rolling window features**.

1. For **Generate rolling window features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. (Optional) For **Output Column**, specify the name of the output column.

1. For **Window size**, specify the window size.

1. For **Strategy**, choose the extraction strategy.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Featurize Datetime
<a name="canvas-transform-datetime-embed"></a>

Use **Featurize date/time** to create a vector embedding representing a datetime field. To use this transform, your datetime data must be in one of the following formats: 
+ Strings describing datetime: For example, `"January 1st, 2020, 12:44pm"`. 
+ A Unix timestamp: A Unix timestamp describes the number of seconds, milliseconds, microseconds, or nanoseconds since January 1, 1970. 

You can choose to **Infer datetime format** and provide a **Datetime format**. If you provide a datetime format, you must use the codes described in the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The options you select for these two configurations have implications for the speed of the operation and the final results.
+ The most manual and computationally fastest option is to specify a **Datetime format** and select **No** for **Infer datetime format**.
+ To reduce manual labor, you can choose **Infer datetime format** and not specify a datetime format. This is also a computationally fast operation; however, the first datetime format encountered in the input column is assumed to be the format for the entire column. If there are other formats in the column, these values are NaN in the final output. Inferring the datetime format can therefore leave you with unparsed strings. 
+ If you don't specify a format and select **No** for **Infer datetime format**, you get the most robust results. All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower than the first two options in this list. 

When you use this transform, you specify an **Input column** that contains datetime data in one of the formats listed above. The transform creates an output column named **Output column name**. The format of the output column depends on your configuration using the following:
+ **Vector**: Outputs a single column as a vector. 
+ **Columns**: Creates a new column for every feature. For example, if the output contains a year, month, and day, three separate columns are created for year, month, and day. 

Additionally, you must choose an **Embedding mode**. For linear models and deep networks, we recommend choosing **cyclic**. For tree-based algorithms, we recommend choosing **ordinal**.
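The cyclic embedding recommended for linear models and deep networks can be sketched as a sine/cosine mapping. This is an illustration of the general technique, not necessarily the exact embedding Data Wrangler computes:

```python
import numpy as np
import pandas as pd

# Cyclic embedding for hour of day: map hour h onto the unit circle so
# that hour 23 ends up close to hour 0.
ts = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 23:00"])
hours = ts.hour.to_numpy()
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
print(hour_cos)
```

An ordinal encoding, by contrast, would simply keep the integer hour, which suits tree-based algorithms that split on thresholds.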

## Format String
<a name="canvas-transform-format-string"></a>

The **Format string** transforms contain standard string formatting operations. For example, you can use these operations to remove special characters, normalize string lengths, and update string casing.

This feature group contains the following transforms. All transforms return copies of the strings in the **Input column** and add the result to a new, output column.


| Name | Function | 
| --- | --- | 
| Left pad |  Left-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Right pad |  Right-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Center (pad on either side) |  Center-pad the string (add padding on both sides of the string) with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Prepend zeros |  Left-fill a numeric string with zeros, up to a given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Strip left and right |  Returns a copy of the string with the leading and trailing characters removed.  | 
| Strip characters from left |  Returns a copy of the string with leading characters removed.  | 
| Strip characters from right |  Returns a copy of the string with trailing characters removed.  | 
| Lower case |  Convert all letters in text to lowercase.  | 
| Upper case |  Convert all letters in text to uppercase.  | 
| Capitalize |  Capitalize the first letter in each sentence.   | 
| Swap case | Converts all uppercase characters in the string to lowercase and all lowercase characters to uppercase, and returns the result. | 
| Add prefix or suffix |  Adds a prefix and a suffix to the string column. You must specify at least one of **Prefix** and **Suffix**.   | 
| Remove symbols |  Removes given symbols from a string. All listed characters are removed. Defaults to white space.   | 
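Several of these transforms map directly onto Python string methods, shown here for illustration; Data Wrangler applies the equivalent operation column-wise:

```python
# Python equivalents of a few Format string transforms (illustrative).
stripped = "  Data Wrangler  ".strip()   # strip left and right
padded = "42".zfill(5)                   # prepend zeros -> "00042"
centered = "abc".center(7, "*")          # center (pad on either side)
swapped = "Hello".swapcase()             # swap case -> "hELLO"
print(stripped, padded, centered, swapped)
```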

## Handle Outliers
<a name="canvas-transform-handle-outlier"></a>

Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. Use this feature group to detect and update outliers in your dataset. 

When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. 

Use the following sections to learn more about the transforms this group contains. You specify an **Output name** and each of these transforms produces an output column with the resulting data. 

### Robust standard deviation numeric outliers
<a name="canvas-transform-handle-outlier-rstdev"></a>

This transform detects and fixes outliers in numeric features using statistics that are robust to outliers.

You must define an **Upper quantile** and a **Lower quantile** for the statistics used to calculate outliers. You must also specify the number of **Standard deviations** from which a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.

### Standard Deviation Numeric Outliers
<a name="canvas-transform-handle-outlier-sstdev"></a>

This transform detects and fixes outliers in numeric features using the mean and standard deviation.

You specify the number of **Standard deviations** a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.

### Quantile Numeric Outliers
<a name="canvas-transform-handle-outlier-quantile-numeric"></a>

Use this transform to detect and fix outliers in numeric features using quantiles. You can define an **Upper quantile** and a **Lower quantile**. All values that fall above the upper quantile or below the lower quantile are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 
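The quantile method with the **Clip** fix can be sketched with pandas. The data and quantile bounds here are illustrative assumptions:

```python
import pandas as pd

# Quantile outlier handling with "Clip": values outside the [0.1, 0.9]
# quantile bounds are clipped to those bounds.
s = pd.Series([1.0, 2.0, 1.5, 2.5, 100.0])
lower, upper = s.quantile(0.1), s.quantile(0.9)
clipped = s.clip(lower=lower, upper=upper)
print(clipped.tolist())
```

The extreme value 100.0 is pulled down to the upper-quantile bound while the in-range values are unchanged.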

### Min-Max Numeric Outliers
<a name="canvas-transform-handle-outlier-minmax-numeric"></a>

This transform detects and fixes outliers in numeric features using upper and lower thresholds. Use this method if you know the threshold values that demarcate outliers.

You specify an **Upper threshold** and a **Lower threshold**, and if values fall above or below those thresholds respectively, they are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Replace Rare
<a name="canvas-transform-handle-outlier-replace-rare"></a>

When you use the **Replace rare** transform, you specify a threshold and Data Wrangler finds all values that meet that threshold and replaces them with a string that you specify. For example, you may want to use this transform to categorize all outliers in a column into an "Others" category. 
+ **Replacement string**: The string with which to replace outliers.
+ **Absolute threshold**: A category is rare if the number of instances is less than or equal to this absolute threshold.
+ **Fraction threshold**: A category is rare if the number of instances is less than or equal to this fraction threshold multiplied by the number of rows.
+ **Max common categories**: Maximum not-rare categories that remain after the operation. If the threshold does not filter enough categories, those with the top number of appearances are classified as not rare. If set to 0 (default), there is no hard limit to the number of categories.

## Handle Missing Values
<a name="canvas-transform-handle-missing"></a>

Missing values are a common occurrence in machine learning datasets. In some situations, it is appropriate to impute missing data with a calculated value, such as an average or categorically common value. You can process missing values using the **Handle missing values** transform group. This group contains the following transforms. 

### Fill Missing
<a name="canvas-transform-fill-missing"></a>

Use the **Fill missing** transform to replace missing values with a **Fill value** you define. 

### Impute Missing
<a name="canvas-transform-impute"></a>

Use the **Impute missing** transform to create a new column that contains imputed values where missing values were found in input categorical and numerical data. The configuration depends on your data type.

For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute. You can choose to impute the mean or the median over the values that are present in your dataset. Data Wrangler uses the value that it computes to impute the missing values.

For categorical data, Data Wrangler imputes missing values using the most frequent value in the column. To impute a custom string, use the **Fill missing** transform instead.
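The mean and most-frequent-value strategies can be sketched with pandas. This is an illustration, not the Data Wrangler implementation; the column names are assumptions:

```python
import pandas as pd

# Imputation sketch: numeric columns take the mean (or median), categorical
# columns take the most frequent value.
df = pd.DataFrame({"age": [20.0, 30.0, None], "color": ["red", "red", None]})
df["age_imputed"] = df["age"].fillna(df["age"].mean())
df["color_imputed"] = df["color"].fillna(df["color"].mode()[0])
print(df["age_imputed"].tolist(), df["color_imputed"].tolist())
```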

### Add Indicator for Missing
<a name="canvas-transform-missing-add-indicator"></a>

Use the **Add indicator for missing** transform to create a new indicator column, which contains a Boolean `"false"` if a row contains a value, and `"true"` if a row contains a missing value. 

### Drop Missing
<a name="canvas-transform-drop-missing"></a>

Use the **Drop missing** option to drop rows that contain missing values from the **Input column**.

## Manage Columns
<a name="canvas-manage-columns"></a>

You can use the following transforms to quickly update and manage columns in your dataset: 


****  

| Name | Function | 
| --- | --- | 
| Drop Column | Delete a column.  | 
| Duplicate Column | Duplicate a column. | 
| Rename Column | Rename a column. | 
| Move Column |  Move a column's location in the dataset. Choose to move your column to the start or end of the dataset, before or after a reference column, or to a specific index.   | 

## Manage Rows
<a name="canvas-transform-manage-rows"></a>

Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the following:
+ **Sort**: Sort the entire dataframe by a given column. Select the check box next to **Ascending order** for this option; otherwise, deselect the check box and descending order is used for the sort. 
+ **Shuffle**: Randomly shuffle all rows in the dataset. 

## Manage Vectors
<a name="canvas-transform-manage-vectors"></a>

Use this transform group to combine or flatten vector columns. This group contains the following transforms. 
+ **Assemble**: Use this transform to combine Spark vectors and numeric data into a single column. For example, you can combine three columns: two containing numeric data and one containing vectors. Add all the columns you want to combine in **Input columns** and specify a **Output column name** for the combined data. 
+ **Flatten**: Use this transform to flatten a single column containing vector data. The input column must contain PySpark vectors or array-like objects. You can control the number of columns created by specifying a **Method to detect number of outputs**. For example, if you select **Length of first vector**, the number of elements in the first valid vector or array found in the column determines the number of output columns that are created. All other input vectors with too many items are truncated. Inputs with too few items are filled with NaNs.

  You also specify an **Output prefix**, which is used as the prefix for each output column. 

## Process Numeric
<a name="canvas-transform-process-numeric"></a>

Use the **Process Numeric** feature group to process numeric data. Each scalar in this group is defined using the Spark library. The following scalars are supported:
+ **Standard Scaler**: Standardize the input column by subtracting the mean from each value and scaling to unit variance. To learn more, see the Spark documentation for [StandardScaler](https://spark.apache.org/docs/latest/ml-features#standardscaler).
+ **Robust Scaler**: Scale the input column using statistics that are robust to outliers. To learn more, see the Spark documentation for [RobustScaler](https://spark.apache.org/docs/latest/ml-features#robustscaler).
+ **Min Max Scaler**: Transform the input column by scaling each feature to a given range. To learn more, see the Spark documentation for [MinMaxScaler](https://spark.apache.org/docs/latest/ml-features#minmaxscaler).
+ **Max Absolute Scaler**: Scale the input column by dividing each value by the maximum absolute value. To learn more, see the Spark documentation for [MaxAbsScaler](https://spark.apache.org/docs/latest/ml-features#maxabsscaler).
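Standard scaling can be sketched by hand with NumPy; this shows the same subtract-mean, divide-by-standard-deviation idea as Spark's StandardScaler, not Spark's actual implementation:

```python
import numpy as np

# Standard scaling by hand: subtract the mean and divide by the
# standard deviation so the result has zero mean and unit variance.
x = np.array([1.0, 2.0, 3.0, 4.0])
scaled = (x - x.mean()) / x.std()
print(np.round(scaled, 3))
```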

## Sampling
<a name="canvas-transform-sampling"></a>

After you've imported your data, you can use the **Sampling** transformer to take one or more samples of it. When you use the sampling transformer, Data Wrangler samples your original dataset.

You can choose one of the following sample methods:
+ **Limit**: Samples the dataset starting from the first row up to the limit that you specify.
+ **Randomized**: Takes a random sample of a size that you specify.
+ **Stratified**: Takes a stratified random sample.

You can stratify a randomized sample to make sure that it represents the original distribution of the dataset.

You might be performing data preparation for multiple use cases. For each use case, you can take a different sample and apply a different set of transformations.
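Stratified sampling can be sketched with pandas by drawing the same fraction from each stratum, which preserves the column's original distribution. The data here is an illustrative assumption:

```python
import pandas as pd

# Stratified sampling sketch: sample 50% from each stratum of "label"
# so the 80/20 class ratio is preserved in the sample.
df = pd.DataFrame({"label": ["a"] * 80 + ["b"] * 20, "x": range(100)})
sample = df.groupby("label").sample(frac=0.5, random_state=42)
print(sample["label"].value_counts().to_dict())  # {'a': 40, 'b': 10}
```

The `random_state` argument plays the same role as the **Random seed** field: it makes the sample reproducible.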

The following procedure describes the process of creating a random sample. 

To take a random sample from your data.

1. Choose the **\$1** to the right of the dataset that you've imported. The name of your dataset is located below the **\$1**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

The following procedure describes the process of creating a stratified sample.

To take a stratified sample from your data.

1. Choose the **\$1** to the right of the dataset that you've imported. The name of your dataset is located below the **\$1**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. For **Stratify column**, specify the name of the column that you want to stratify on.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

## Search and Edit
<a name="canvas-transform-search-edit"></a>

Use this section to search for and edit specific patterns within strings. For example, you can find and update strings within sentences or documents, split strings by delimiters, and find occurrences of specific strings. 

The following transforms are supported under **Search and edit**. All transforms return copies of the strings in the **Input column** and add the result to a new output column.


****  

| Name | Function | 
| --- | --- | 
|  Find substring  |  Returns the index of the first occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Find substring (from right)  |  Returns the index of the last occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Matches prefix  |  Returns a Boolean value if the string contains a given **Pattern**. A pattern can be a character sequence or regular expression. Optionally, you can make the pattern case sensitive.   | 
|  Find all occurrences  |  Returns an array with all occurrences of a given pattern. A pattern can be a character sequence or regular expression.   | 
|  Extract using regex  |  Returns a string that matches a given Regex pattern.  | 
|  Extract between delimiters  |  Returns a string with all characters found between **Left delimiter** and **Right delimiter**.   | 
|  Extract from position  |  Returns a string, starting from **Start position** in the input string, that contains all characters up to the start position plus **Length**.   | 
|  Find and replace substring  |  Returns a string with all matches of a given **Pattern** (regular expression) replaced by **Replacement string**.  | 
|  Replace between delimiters  |  Returns a string with the substring found between the first appearance of a **Left delimiter** and the last appearance of a **Right delimiter** replaced by **Replacement string**. If no match is found, nothing is replaced.   | 
|  Replace from position  |  Returns a string with the substring between **Start position** and **Start position** plus **Length** replaced by **Replacement string**. If **Start position** plus **Length** is greater than the length of the replacement string, the output contains **…**.  | 
|  Convert regex to missing  |  Converts a string to `None` if invalid and returns the result. Validity is defined with a regular expression in **Pattern**.  | 
|  Split string by delimiter  |  Returns an array of strings from the input string, split by **Delimiter**, with up to **Max number of splits** (optional). The delimiter defaults to white space.   | 
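A few of these transforms have direct Python-level equivalents, shown here for illustration; Data Wrangler applies the equivalent operation column-wise:

```python
import re

# Python equivalents of a few Search and edit transforms (illustrative).
s = "order-123, order-456"
idx = s.find("order")              # find substring
nums = re.findall(r"\d+", s)       # find all occurrences
replaced = re.sub(r"\d+", "#", s)  # find and replace substring
parts = "a,b,,c".split(",", 2)     # split string by delimiter, max 2 splits
print(idx, nums, replaced, parts)
```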

## Split data
<a name="canvas-transform-split-data"></a>

Use the **Split data** transform to split your dataset into two or three datasets. For example, you can split your dataset into a dataset used to train your model and a dataset used to test it. You can determine the proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two datasets, the training dataset can have 80% of the data while the testing dataset has 20%.

Splitting your data into three datasets gives you the ability to create training, validation, and test datasets. You can see how well the model performs on the test dataset by dropping the target column.

Your use case determines how much of the original dataset each of your datasets gets and the method you use to split the data. For example, you might want to use a stratified split to make sure that the distribution of the observations in the target column are the same across datasets. You can use the following split transforms:
+ Randomized split – Each split is a random, non-overlapping sample of the original dataset. For larger datasets, using a randomized split might be computationally expensive and take longer than an ordered split.
+ Ordered split – Splits the dataset based on the sequential order of the observations. For example, for an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in keeping the existing order of the data between splits.
+ Stratified split – Splits the dataset to make sure that the number of observations in the input column have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the 1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go to the testing set.
+ Split by key – Avoids data with the same key occurring in more than one split. For example, if you have a dataset with the column 'customer\$1id' and you're using it as a key, no customer id is in more than one split.
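The stratified and key-based strategies can be sketched outside Data Wrangler with pandas. This is a hypothetical illustration, not Canvas's implementation; the `customer_id` and `label` columns are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "label":       ["a", "a", "a", "a", "b", "b", "b", "b", "a", "a"],
})

# Stratified split: sample the same fraction within each label group so
# that every label keeps proportional representation in the training set.
train_stratified = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)

# Split by key: assign whole customer_ids to a single split so that no
# customer appears in both the training and the testing set.
unique_keys = df["customer_id"].unique()
train_keys = set(unique_keys[: len(unique_keys) // 2])
train_by_key = df[df["customer_id"].isin(train_keys)]
test_by_key = df[~df["customer_id"].isin(train_keys)]
```

In the key-based split, the proportions are approximate: a key's rows travel together, so the split sizes depend on how many rows each key has.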

After you split the data, you can apply additional transformations to each dataset, although for most use cases additional transformations aren't necessary.

Data Wrangler calculates the proportions of the splits for performance. You can choose an error threshold to set the accuracy of the splits. Lower error thresholds more accurately reflect the proportions that you specify for the splits. If you set a higher error threshold, you get better performance, but lower accuracy.

For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.

If you have 10000 rows in your dataset and you specify an 80/20 split with an error of 0.001, you would get observations approximating one of the following results:
+ 8010 observations in the training set and 1990 in the testing set
+ 7990 observations in the training set and 2010 in the testing set

The number of observations in the training set in the preceding example falls in the interval between 7990 and 8010, and the testing set gets the remainder.
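The arithmetic behind that interval can be sketched as follows. This reads the error threshold as a fraction of the total row count, which is a plausible interpretation for the example rather than Data Wrangler's documented internals.

```python
# 10,000 rows, an 80/20 split, and an error threshold of 0.001.
n_rows = 10_000
train_fraction = 0.8
error_threshold = 0.001

target_train_size = n_rows * train_fraction   # 8,000 rows
tolerance = n_rows * error_threshold          # 10 rows
lower_bound = target_train_size - tolerance   # 7,990 rows
upper_bound = target_train_size + tolerance   # 8,010 rows
```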

By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a different value for the seed to create a different reproducible split.
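The effect of the seed can be illustrated with a minimal pandas sketch (a stand-in for Data Wrangler's own sampling, not its implementation): the same seed always reproduces the same split, while a different seed generally produces a different one.

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Draw a random 80% sample three times.
split_a = df.sample(frac=0.8, random_state=42)
split_b = df.sample(frac=0.8, random_state=42)  # same seed as split_a
split_c = df.sample(frac=0.8, random_state=7)   # different seed
```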

------
#### [ Randomized split ]

Use the following procedure to perform a randomized split on your dataset.

To split your dataset randomly, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Ordered split ]

Use the following procedure to perform an ordered split on your dataset.

To make an ordered split in your dataset, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Ordered split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) For **Input column**, specify a column with numeric values. Data Wrangler uses the values of the column to determine which records are in each split. The smaller values go to one split, and the larger values go to the other splits.

1. (Optional) Select **Handle duplicates** to add noise to duplicate values and create a dataset of entirely unique values.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Stratified split ]

Use the following procedure to perform a stratified split on your dataset.

To make a stratified split in your dataset, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Stratified split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Input column**, specify a column with up to 100 unique values. Data Wrangler can't stratify a column with more than 100 unique values.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Split by column keys ]

Use the following procedure to split by the column keys in your dataset.

To split by the column keys in your dataset, do the following.

1. Choose the ellipsis icon next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Split by key**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Key columns**, specify the columns with values that you don't want to appear in both datasets.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. Choose **Preview**.

1. Choose **Add**.

------

## Parse Value as Type
<a name="canvas-transform-cast-type"></a>

Use this transform to cast a column to a new type. The supported Data Wrangler data types are:
+ Long
+ Float
+ Boolean
+ Date, in the format dd-MM-yyyy, representing day, month, and year respectively. 
+ String
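As a hedged sketch of what casting to these types looks like, the following uses pandas as a stand-in for Data Wrangler's cast; the column names are made up, and the date format matches the dd-MM-yyyy format described above.

```python
import pandas as pd

df = pd.DataFrame({
    "long_col": ["1", "2", "3"],
    "float_col": ["1.5", "2.5", "3.5"],
    "bool_col": ["true", "false", "true"],
    "date_col": ["01-12-2023", "15-06-2024", "31-01-2025"],
})

df["long_col"] = df["long_col"].astype("int64")        # Long
df["float_col"] = df["float_col"].astype("float64")    # Float
df["bool_col"] = df["bool_col"].str.lower() == "true"  # Boolean
df["date_col"] = pd.to_datetime(df["date_col"], format="%d-%m-%Y")  # Date, dd-MM-yyyy
```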

## Validate String
<a name="canvas-transform-validate-string"></a>

Use the **Validate string** transforms to create a new column that indicates whether a row of text data meets a specified condition. For example, you can use a **Validate string** transform to verify that a string only contains lowercase characters.

The following transforms are included in this transform group. If a transform outputs a Boolean value, `True` is represented with a `1` and `False` is represented with a `0`.



| Name | Function | 
| --- | --- | 
|  String length  |  Returns `True` if the string length equals a specified length. Otherwise, returns `False`.  | 
|  Starts with  |  Returns `True` if the string starts with a specified prefix. Otherwise, returns `False`.  | 
|  Ends with  |  Returns `True` if the string ends with a specified suffix. Otherwise, returns `False`.  | 
|  Is alphanumeric  |  Returns `True` if the string only contains numbers and letters. Otherwise, returns `False`.  | 
|  Is alpha (letters)  |  Returns `True` if the string only contains letters. Otherwise, returns `False`.  | 
|  Is digit  |  Returns `True` if the string only contains digits. Otherwise, returns `False`.  | 
|  Is space  |  Returns `True` if the string only contains white spaces. Otherwise, returns `False`.  | 
|  Is title  |  Returns `True` if the string is in title case, with each word starting with an uppercase character. Otherwise, returns `False`.  | 
|  Is lowercase  |  Returns `True` if the string only contains lowercase letters. Otherwise, returns `False`.  | 
|  Is uppercase  |  Returns `True` if the string only contains uppercase letters. Otherwise, returns `False`.  | 
|  Is numeric  |  Returns `True` if the string only contains numbers. Otherwise, returns `False`.  | 
|  Is decimal  |  Returns `True` if the string only contains decimal numbers. Otherwise, returns `False`.  | 
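Most of these checks map closely onto Python's built-in string predicates. The following hypothetical helper reproduces a few of them, encoding `True` as `1` and `False` as `0` as described above; the key names and the length/prefix choices are made up for the example.

```python
def validate_string(s: str) -> dict:
    # Each entry mirrors one transform from the table; 1 means True, 0 means False.
    return {
        "string_length_5": int(len(s) == 5),
        "starts_with_ab": int(s.startswith("ab")),
        "is_alphanumeric": int(s.isalnum()),
        "is_digit": int(s.isdigit()),
        "is_title": int(s.istitle()),
        "is_lowercase": int(s.islower()),
        "is_uppercase": int(s.isupper()),
    }
```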

## Unnest JSON Data
<a name="canvas-transform-flatten-column"></a>

If you have a .csv file, you might have values in your dataset that are JSON strings. Similarly, you might have nested data in columns of either a Parquet file or a JSON document.

Use the **Flatten structured** operator to separate the first level keys into separate columns. A first level key is a key that isn't nested within a value.

For example, you might have a dataset that has a *person* column with demographic information on each person stored as JSON strings. A JSON string might look like the following.

```
 "{"seq": 1,"name": {"first": "Nathaniel","last": "Ferguson"},"age": 59,"city": "Posbotno","state": "WV"}"
```

The **Flatten structured** operator converts the following first level keys into additional columns in your dataset:
+ seq
+ name
+ age
+ city
+ state

Data Wrangler puts the values of the keys as values under the columns. The following shows the column names and values of the JSON.

```
seq, name,                                    age, city, state
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV
```

For each value in your dataset containing JSON, the **Flatten structured** operator creates columns for the first-level keys. To create columns for nested keys, call the operator again. For the preceding example, calling the operator a second time creates the following columns:
+ name_first
+ name_last

The following example shows the dataset that results from calling the operation again.

```
seq, name,                                    age, city, state, name_first, name_last
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV, Nathaniel, Ferguson
```

Choose **Keys to flatten on** to specify the first-level keys that you want to extract as separate columns. If you don't specify any keys, Data Wrangler extracts all the keys by default.
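The two-pass flattening above can be sketched in plain Python. This is a minimal stand-in for the **Flatten structured** operator, using one value from the hypothetical *person* column.

```python
import json

# One JSON string from the example "person" column.
row = '{"seq": 1, "name": {"first": "Nathaniel", "last": "Ferguson"}, "age": 59, "city": "Posbotno", "state": "WV"}'

record = json.loads(row)

# First pass: every first-level key becomes a column.
columns = dict(record)

# Second pass: flatten the nested "name" value into name_first / name_last.
for key, value in record["name"].items():
    columns[f"name_{key}"] = value
```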

## Explode Array
<a name="canvas-transform-explode-array"></a>

Use **Explode array** to expand the values of an array into separate output rows. For example, the operation can take each value in the array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and create a new column with the following rows:

```
                [1, 2, 3]
                [4, 5, 6]
                [7, 8, 9]
```

Data Wrangler names the new column `input_column_name_flatten`.

You can call the **Explode array** operation multiple times to get the nested values of the array into separate output columns. The following example shows the result of calling the operation multiple times on a dataset with a nested array.

Putting the values of a nested array into separate columns


| id | array | id | array\_items | id | array\_items\_items | 
| --- | --- | --- | --- | --- | --- | 
| 1 | [ [cat, dog], [bat, frog] ] | 1 | [cat, dog] | 1 | cat | 
| 2 | [ [rose, petunia], [lily, daisy] ] | 1 | [bat, frog] | 1 | dog | 
|  |  | 2 | [rose, petunia] | 1 | bat | 
|  |  | 2 | [lily, daisy] | 1 | frog | 
|  |  |  |  | 2 | rose | 
|  |  |  |  | 2 | petunia | 
|  |  |  |  | 2 | lily | 
|  |  |  |  | 2 | daisy | 
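The repeated explode above can be sketched with pandas' `explode`, which expands each array element into its own row; calling it twice reaches the nested values. This is an illustration with the example data, not Data Wrangler's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "array": [
        [["cat", "dog"], ["bat", "frog"]],
        [["rose", "petunia"], ["lily", "daisy"]],
    ],
})

# First explode: each inner array becomes its own row.
once = df.explode("array").rename(columns={"array": "array_items"})

# Second explode: each nested value becomes its own row.
twice = once.explode("array_items").rename(columns={"array_items": "array_items_items"})
```

Note that the `id` value is carried along with each exploded row, which is how the table above keeps track of which values came from which original record.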

## Transform Image Data
<a name="canvas-transform-image"></a>

Use Data Wrangler to import and transform the images that you're using for your machine learning (ML) pipelines. After you've prepared your image data, you can export it from your Data Wrangler flow to your ML pipeline.

You can use the information provided here to familiarize yourself with importing and transforming image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

After you've familiarized yourself with the concepts of transforming your image data, go through the following tutorial, [Prepare image data with Amazon SageMaker Data Wrangler](https://aws.amazon.com/blogs/machine-learning/prepare-image-data-with-amazon-sagemaker-data-wrangler/).

The following industries and use cases are examples where applying machine learning to transformed image data can be useful:
+ Manufacturing – Identifying defects in items from the assembly line
+ Food – Identifying spoiled or rotten food
+ Medicine – Identifying lesions in tissues

When you work with image data in Data Wrangler, you go through the following process:

1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.

1. Transform – Use the built-in transformations to prepare the images for your machine learning pipeline.

1. Export – Export the images that you’ve transformed to a location that can be accessed from the pipeline.

Use the following procedure to import your image data.

**To import your image data**

1. Navigate to the **Create connection** page.

1. Choose **Amazon S3**.

1. Specify the Amazon S3 file path that contains the image data.

1. For **File type**, choose **Image**.

1. (Optional) Choose **Import nested directories** to import images from multiple Amazon S3 paths.

1. Choose **Import**.

Data Wrangler uses the open-source [imgaug](https://imgaug.readthedocs.io/en/latest/) library for its built-in image transformations. You can use the following built-in transformations:
+ **ResizeImage**
+ **EnhanceImage**
+ **CorruptImage**
+ **SplitImage**
+ **DropCorruptedImages**
+ **DropImageDuplicates**
+ **Brightness**
+ **ColorChannels**
+ **Grayscale**
+ **Rotate**

Use the following procedure to transform your images without writing code.

**To transform the image data without writing code**

1. From your Data Wrangler flow, choose the ellipsis icon next to the node representing the images that you've imported.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose the transform and configure it.

1. Choose **Preview**.

1. Choose **Add**.

In addition to using the transformations that Data Wrangler provides, you can also use your own custom code snippets. For more information about using custom code snippets, see [Custom Transforms](#canvas-transform-custom). You can import the OpenCV and imgaug libraries within your code snippets and use the transforms associated with them. The following is an example of a code snippet that detects edges within the images.

```
# A table with your image data is stored in the `df` variable
import cv2
import numpy as np
from pyspark.sql.functions import column

from sagemaker_dataprep.compute.operators.transforms.image.constants import DEFAULT_IMAGE_COLUMN, IMAGE_COLUMN_TYPE
from sagemaker_dataprep.compute.operators.transforms.image.decorators import BasicImageOperationDecorator, PandasUDFOperationDecorator


@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
    # To use the code snippet on your image data, modify the following lines within the function
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image, HYST_THRLD_1, HYST_THRLD_2)
    return edges
    

@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)
    

df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))
```

When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a sample of the images in your dataset. To optimize your experience with the application, Data Wrangler doesn't apply the transforms to all of your images.

## Filter data
<a name="canvas-transform-filter-data"></a>

Use Data Wrangler to filter the data in your columns. When you filter the data in a column, you specify the following fields:
+ **Column name** – The name of the column that you're using to filter the data.
+ **Condition** – The type of filter that you're applying to values in the column.
+ **Value** – The value or category in the column to which you're applying the filter.

You can filter on the following conditions:
+ **=** – Returns values that match the value or category that you specify.
+ **!=** – Returns values that don't match the value or category that you specify.
+ **>=** – For **Long** or **Float** data, filters for values that are greater than or equal to the value that you specify.
+ **<=** – For **Long** or **Float** data, filters for values that are less than or equal to the value that you specify.
+ **>** – For **Long** or **Float** data, filters for values that are greater than the value that you specify.
+ **<** – For **Long** or **Float** data, filters for values that are less than the value that you specify.

For example, for a column that has the categories `male` and `female`, you can filter out all the `male` values. Because the column contains only `male` and `female` values, the filter returns a column that has only `female` values.

You can also add multiple filters. The filters can be applied across multiple columns or the same column. For example, if you're creating a column that only has values within a certain range, you add two different filters. One filter specifies that the column must have values greater than the value that you provide. The other filter specifies that the column must have values less than the value that you provide.
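The two-filter range described above can be sketched in pandas: one condition keeps values greater than a lower bound, and the other keeps values less than an upper bound. The `price` column and the bounds are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 12, 18, 25, 31]})

# Two filters on the same column: > lower AND < upper.
lower, upper = 10, 30
in_range = df[(df["price"] > lower) & (df["price"] < upper)]
```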

Use the following procedure to add the filter transform to your data.

**To filter your data**

1. From your Data Wrangler flow, choose the ellipsis icon next to the node with the data that you're filtering.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose **Filter data**.

1. Specify the following fields:
   + **Column name** – The column that you're filtering.
   + **Condition** – The condition of the filter.
   + **Value** – The value or category in the column to which you're applying the filter.

1. (Optional) Choose **+** following the filter that you've created to add another filter.

1. Configure the filter.

1. Choose **Preview**.

1. Choose **Add**.

# Chat for data prep
<a name="canvas-chat-for-data-prep"></a>

**Important**  
For administrators:  
Chat for data prep requires the `AmazonSageMakerCanvasAIServicesAccess` policy. For more information, see [AWS managed policy: AmazonSageMakerCanvasAIServicesAccess](security-iam-awsmanpol-canvas.md#security-iam-awsmanpol-AmazonSageMakerCanvasAIServicesAccess).  
Chat for data prep requires access to Amazon Bedrock and the **Anthropic Claude** model within it. For more information, see [Add model access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html#add-model-access).
You must run SageMaker Canvas data prep in the same AWS Region as the Region where you're running your model. Chat for data prep is available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) AWS Regions.

In addition to using the built-in transforms and analyses, you can use natural language to explore, visualize, and transform your data in a conversational interface. Within the conversational interface, you can use natural language queries to understand and prepare your data to build ML models.

The following are examples of some prompts that you can use:
+ Summarize my data
+ Drop column `example-column-name`
+ Replace missing values with median
+ Plot histogram of prices
+ What is the most expensive item sold?
+ How many distinct items were sold?
+ Sort data by region

When you’re transforming your data using your prompts, you can view a preview that shows how the data is being transformed. Based on what you see in the preview, you can choose to add the transform as a step in your Data Wrangler flow.

The responses to your prompts generate code for your transformations and analyses. You can modify the code to update the output from the prompt. For example, you can modify the code for an analysis to change the values of the axes of a graph.

Use the following procedure to start chatting with your data:

**To chat with your data**

1. Open the SageMaker Canvas data flow.

1. Choose the speech bubble.  
![\[Chat for data prep is at the top of the screen\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/chat-for-data-prep-welcome-step.png)

1. Specify a prompt.

1. (Optional) If an analysis has been generated by your query, choose **Add to analyses** to reference it for later.  
![\[The view of an editable and copyable code block.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/encanto-query-for-visualization.png)

1. (Optional) If you've transformed your data using a prompt, do the following.

   1. Choose **Preview** to view the results.

   1. (Optional) Modify the code in the transform and choose **Update**.

   1. (Optional) If you're happy with the results of the transform, choose **Add to steps** to add it to the steps panel on the right-hand navigation.  
![\[Added to steps shows confirmation that the transform has been added to the flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/transform-added-to-steps-panel.png)

After you’ve prepared your data using natural language, you can create a model using your transformed data. For more information about creating a model, see [How custom models work](canvas-build-model.md).

# How data processing works in Data Wrangler
<a name="canvas-data-processing"></a>

While working with data interactively in an Amazon SageMaker Data Wrangler data flow, Amazon SageMaker Canvas only applies the transformations to a sample dataset for you to preview. After finishing your data flow in SageMaker Canvas, you can process all of your data and save it in a location that is suitable for your machine learning workflows.

There are several options for how to proceed after you've finished transforming your data in Data Wrangler:
+ [Create a model](canvas-processing-export-model.md). You can create a Canvas model, where you directly start creating a model with your prepared data. You can create a model either after processing your entire dataset, or by exporting just the sample data you worked with in Data Wrangler. Canvas saves your processed data (either the entire dataset or the sample data) as a Canvas dataset.

  We recommend that you use your sample data for quick iterations, but that you use your entire dataset when you want to train your final model. When building tabular models, datasets larger than 5 GB are automatically downsampled to 5 GB, and for time series forecasting models, datasets larger than 30 GB are downsampled to 30 GB.

  To learn more about creating a model, see [How custom models work](canvas-build-model.md).
+ [Export the data](canvas-export-data.md). You can export your data for use in machine learning workflows. When you choose to export your data, you have several options:
  + You can save your data in the Canvas application as a dataset. For more information about the supported file types for Canvas datasets and additional requirements when importing data into Canvas, see [Create a dataset](canvas-import-dataset.md).
  + You can save your data to Amazon S3. Depending on the Canvas memory availability, your data is processed in the application and then exported to Amazon S3. If the size of your dataset exceeds what Canvas can process, then by default, Canvas uses an EMR Serverless job to scale to multiple compute instances, process your full dataset, and export it to Amazon S3. You can also manually configure a SageMaker Processing job to have more granular control over the compute resources used to process your data.
+ [Export a data flow](canvas-export-data-flow.md). You might want to save the code for your data flow so that you can modify or run your transformations outside of Canvas. Canvas provides you with the option to save your data flow transformations as Python code in a Jupyter notebook, which you can then export to Amazon S3 for use elsewhere in your machine learning workflows.

When you export your data from a data flow and save it either as a Canvas dataset or to Amazon S3, Canvas creates a new destination node in your data flow, which is a final node that shows you where your processed data is stored. You can add additional destination nodes to your flow if you'd like to perform multiple export operations. For example, you can export the data from different points in your data flow to only apply some of the transformations, or you can export transformed data to different Amazon S3 locations. For more information about how to add or edit a destination node, see [Add destination nodes](canvas-destination-nodes-add.md) and [Edit a destination node](canvas-destination-nodes-edit.md).

For more information about setting up a schedule with Amazon EventBridge to automatically process and export your data on a schedule, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md).

# Export to create a model
<a name="canvas-processing-export-model"></a>

In just a few clicks from your data flow, you can export your transformed data and start creating an ML model in Canvas. Canvas saves your data as a Canvas dataset, and you're taken to the model build configuration page for a new model.

To create a Canvas model with your transformed data:

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. From the context menu, choose **Create model**.

1. In the **Export to create a model** side panel, enter a **Dataset name** for the new dataset.

1. Leave the **Process entire dataset** option selected to process and export your entire dataset before proceeding with building a model. Turn this option off to train your model using the interactive sample data you are working with in your data flow.

1. Enter a **Model name** to name the new model.

1. Select a **Problem type**, or the type of model that you want to build. For more information about the supported model types in SageMaker Canvas, see [How custom models work](canvas-build-model.md).

1. Select the **Target column**, or the value that you want the model to predict.

1. Choose **Export and create model**.

The **Build** tab for a new Canvas model should open, and you can finish configuring and training your model. For more information about how to build a model, see [Build a model](canvas-build-model-how-to.md).

# Export data
<a name="canvas-export-data"></a>

Export data to apply the transforms from your data flow to the full imported dataset. You can export any node in your data flow to the following locations:
+ SageMaker Canvas dataset
+ Amazon S3

If you want to train models in Canvas, you can export your full, transformed dataset as a Canvas dataset. If you want to use your transformed data in machine learning workflows external to SageMaker Canvas, you can export your dataset to Amazon S3.

## Export to a Canvas dataset
<a name="canvas-export-data-canvas"></a>

Use the following procedure to export a SageMaker Canvas dataset from a node in your data flow.

**To export a node in your flow as a SageMaker Canvas dataset**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. In the context menu, hover over **Export**, and then select **Export data to Canvas dataset**.

1. In the **Export to Canvas dataset** side panel, enter a **Dataset name** for the new dataset.

1. Leave the **Process entire dataset** option selected if you want SageMaker Canvas to process and save your full dataset. Turn this option off to only apply the transforms to the sample data you are working with in your data flow.

1. Choose **Export**.

You should now be able to go to the **Datasets** page of the Canvas application and see your new dataset.

## Export to Amazon S3
<a name="canvas-export-data-s3"></a>

When exporting your data to Amazon S3, you can scale to transform and process data of any size. Canvas automatically processes your data locally if the application's memory can handle the size of your dataset. If your dataset size exceeds the local memory capacity of 5 GB, then Canvas initiates a remote job on your behalf to provision additional compute resources and process the data more quickly. By default, Canvas uses Amazon EMR Serverless to run these remote jobs. However, you can manually configure Canvas to use either EMR Serverless or a SageMaker Processing job with your own settings.

**Note**  
When running an EMR Serverless job, by default the job inherits the IAM role, KMS key settings, and tags of your Canvas application.

The following summarizes the options for remote jobs in Canvas:
+ **EMR Serverless**: This is the default option that Canvas uses for remote jobs. EMR Serverless automatically provisions and scales compute resources to process your data so that you don't have to worry about choosing the right compute resources for your workload. For more information about EMR Serverless, see the [EMR Serverless User Guide](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html).
+ **SageMaker Processing**: SageMaker Processing jobs offer more advanced options and granular control over the compute resources used to process your data. For example, you can specify the type and count of the compute instances, configure the job in your own VPC and control network access, automate processing jobs, and more. For more information about automating processing jobs see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md). For more general information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

The following file types are supported when exporting to Amazon S3:
+ CSV
+ Parquet

To get started, review the following prerequisites.

### Prerequisites for EMR Serverless jobs
<a name="canvas-export-data-emr-prereqs"></a>

To create a remote job that uses EMR Serverless resources, you must have the necessary permissions. You can grant permissions either through the Amazon SageMaker AI domain or user profile settings, or you can manually configure your user's AWS IAM role. For instructions on how to grant users permissions to perform large data processing, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

If you don't want to configure these policies but still need to process large datasets through Data Wrangler, you can alternatively use a SageMaker Processing job.

Use the following procedures to export your data to Amazon S3. To configure a remote job, follow the optional advanced steps.

**To export a node in your flow to Amazon S3**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you're exporting.

1. In the context menu, hover over **Export**, and then select **Export data to Amazon S3**.

1. In the **Export to Amazon S3** side panel, you can change the **Dataset name** for the new dataset.

1. For the **S3 location**, enter the Amazon S3 location to which you want to export the dataset. You can enter the S3 URI, alias, or ARN of the S3 location or S3 access point. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) in the *Amazon S3 User Guide*.

1. (Optional) For the **Advanced settings**, specify values for the following fields:

   1. **File type** – The file format of your exported data.

   1. **Delimiter** – The delimiter used to separate values in the file.

   1. **Compression** – The compression method used to reduce the file size.

   1. **Number of partitions** – The number of dataset files that Canvas writes as the output of the job.

   1. **Choose columns** – You can choose a subset of columns from the data to include in the partitions.

1. Leave the **Process entire dataset** option selected if you want Canvas to apply your data flow transforms to your entire dataset and export the result. If you deselect this option, Canvas only applies the transforms to the sample of your dataset used in the interactive Data Wrangler data flow.
**Note**  
If you only export a sample of your data, Canvas processes your data in the application and doesn't create a remote job for you.

1. Leave the **Auto job configuration** option selected if you want Canvas to automatically determine whether to run the job using Canvas application memory or an EMR Serverless job. If you deselect this option and manually configure your job, then you can choose to use either an EMR Serverless or a SageMaker Processing job. For instructions on how to configure an EMR Serverless or a SageMaker Processing job, see the section after this procedure before you export your data.

1. Choose **Export**.
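The **S3 location** field accepts several formats, as noted above. As a rough illustration, the hypothetical helpers below classify and split those formats; the prefix and `-s3alias` suffix checks are assumptions based on standard S3 naming conventions, not Canvas internals:

```python
# Hypothetical helpers: classify the kinds of S3 location strings that the
# export field accepts (S3 URI, access point ARN, or access point alias).
def classify_s3_location(location: str) -> str:
    if location.startswith("s3://"):
        return "s3-uri"
    # Access point ARNs look like arn:aws:s3:region:account:accesspoint/name
    if location.startswith("arn:aws:s3:") and ":accesspoint/" in location:
        return "access-point-arn"
    # Access point aliases carry an "-s3alias" suffix in the bucket-name slot
    if "-s3alias" in location:
        return "access-point-alias"
    return "unknown"

def split_s3_uri(uri: str) -> tuple:
    """Split an s3:// URI into (bucket, key prefix)."""
    assert uri.startswith("s3://")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix
```

For example, `split_s3_uri("s3://my-bucket/exports/data")` separates the bucket name from the object key prefix.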

The following procedures show how to manually configure the remote job settings for either EMR Serverless or SageMaker Processing when exporting your full dataset to Amazon S3.

------
#### [ EMR Serverless ]

To configure an EMR Serverless job while exporting to Amazon S3, do the following:

1. In the Export to Amazon S3 side panel, turn off the **Auto job configuration** option.

1. Select **EMR Serverless**.

1. For **Job name**, enter a name for your EMR Serverless job. The name can contain letters, numbers, hyphens, and underscores.

1. For **IAM role**, enter the user's IAM execution role. This role should have the required permissions to run EMR Serverless applications. For more information, see [Grant Users Permissions to Use Large Data across the ML Lifecycle](canvas-large-data-permissions.md).

1. (Optional) For **KMS key**, specify the key ID or ARN of an AWS KMS key to encrypt the job logs. If you don't enter a key, Canvas uses a default key for EMR Serverless.

1. (Optional) For **Monitoring configuration**, enter the name of an Amazon CloudWatch Logs log group to which you want to publish your logs.

1. (Optional) For **Tags**, add metadata tags to the EMR Serverless job consisting of key-value pairs. These tags can be used to categorize and search for jobs.

1. Choose **Export** to start the job.

------
#### [ SageMaker Processing ]

To configure a SageMaker Processing job while exporting to Amazon S3, do the following:

1. In the **Export to Amazon S3** side panel, turn off the **Auto job configuration** option.

1. Select **SageMaker Processing**.

1. For **Job name**, enter a name for your SageMaker AI Processing job.

1. For **Instance type**, select the type of compute instance to run the processing job.

1. For **Instance count**, specify the number of compute instances to launch.

1. For **IAM role**, enter the user's IAM execution role. This role should have the required permissions for SageMaker AI to create and run processing jobs on your behalf. These permissions are granted if you have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy attached to your IAM role.

1. For **Volume size**, enter the storage size in GB for the ML storage volume that is attached to each processing instance. Choose the size based on your expected input and output data size.

1. (Optional) For **Volume KMS key**, specify a KMS key to encrypt the storage volume. If you don't specify a key, the default Amazon EBS encryption key is used.

1. (Optional) For **KMS key**, specify a KMS key to encrypt input and output Amazon S3 data sources used by the processing job.

1. (Optional) For **Spark memory configuration**, do the following:

   1. Enter **Driver memory in MB** for the Spark driver node that handles job coordination and scheduling.

   1. Enter **Executor memory in MB** for the Spark executor nodes that run individual tasks in the job.

1. (Optional) For **Network configuration**, do the following:

   1. For **Subnet configuration**, enter the IDs of the VPC subnets for the processing instances to be launched in. By default, the job uses the settings of your default VPC.

   1. For **Security group configuration**, enter the IDs of the security groups to control inbound and outbound connectivity rules.

   1. Turn on the **Enable inter-container traffic encryption** option to encrypt network communication between processing containers during the job.

1. (Optional) For **Associate schedules**, you can choose to create an Amazon EventBridge schedule to run the processing job at recurring intervals. Choose **Create new schedule** and fill out the dialog box. For more information about filling out this section and running processing jobs on a schedule, see [Create a schedule to automatically process new data](canvas-data-export-schedule-job.md).

1. (Optional) Add **Tags** as key-value pairs so that you can categorize and search for processing jobs.

1. Choose **Export** to start the processing job.

------

After exporting your data, you should find the fully processed dataset in the specified Amazon S3 location.
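If you later script an equivalent job outside Canvas, the manual SageMaker Processing settings above map onto fields of the SageMaker `CreateProcessingJob` API. The sketch below builds the request payload only and makes no API call; the job name, role ARN, and instance settings are placeholder values, not ones Canvas generates:

```python
# Sketch: map the manual SageMaker Processing settings onto a
# CreateProcessingJob-style request dict. No API call is made here;
# all names and ARNs are placeholders.
def build_processing_request(job_name, role_arn, instance_type,
                             instance_count, volume_size_gb,
                             subnets=None, security_groups=None,
                             encrypt_inter_container=False):
    request = {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                "VolumeSizeInGB": volume_size_gb,
            }
        },
    }
    # Network settings are optional, mirroring the optional console section.
    if subnets or security_groups:
        request["NetworkConfig"] = {
            "EnableInterContainerTrafficEncryption": encrypt_inter_container,
            "VpcConfig": {
                "Subnets": subnets or [],
                "SecurityGroupIds": security_groups or [],
            },
        }
    return request

req = build_processing_request(
    job_name="canvas-export-example",  # hypothetical job name
    role_arn="arn:aws:iam::111122223333:role/example-role",
    instance_type="ml.m5.xlarge",
    instance_count=2,
    volume_size_gb=30,
)
```

A dict like this could then be passed to the `create_processing_job` operation of a boto3 SageMaker client.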

# Export a data flow
<a name="canvas-export-data-flow"></a>

Exporting your data flow translates the operations that you've made in Data Wrangler into a Jupyter notebook of Python code that you can modify and run. This can be helpful for integrating your data transformation code into your machine learning pipelines.

You can choose any data node in your data flow and export it. Exporting the data node exports the transformation that the node represents and the transformations that precede it.

**To export a data flow as a Jupyter notebook**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node that you want to export.

1. In the context menu, hover over **Export**, and then hover over **Export via Jupyter notebook**.

1. Choose one of the following:
   + **SageMaker Pipelines**
   + **Amazon S3**
   + **SageMaker AI Inference Pipeline**
   + **SageMaker AI Feature Store**
   + **Python Code**

1. The **Export data flow as notebook** dialog box opens. Select one of the following:
   + **Download a local copy**
   + **Export to S3 location**

1. If you selected **Export to S3 location**, enter the Amazon S3 location to which you want to export the notebook.

1. Choose **Export**.

Your Jupyter notebook should either download to your local machine, or you can find it saved in the Amazon S3 location you specified.

# Add destination nodes
<a name="canvas-destination-nodes-add"></a>

A destination node in SageMaker Canvas specifies where to store your processed and transformed data. When you choose to export your transformed data to Amazon S3, Canvas uses the specified destination node location, applying all the transformations you've configured in your data flow. For more information about export jobs to Amazon S3, see the preceding section [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

By default, choosing to export your data to Amazon S3 adds a destination node to your data flow. However, you can add multiple destination nodes to your flow, allowing you to simultaneously export different sets of transformations or variations of your data to different Amazon S3 locations. For example, you can create one destination node that exports the data after applying all transformations, and another destination node that exports the data after only certain initial transformations, such as a join operation. This flexibility enables you to export and store different versions or subsets of your transformed data in separate S3 locations for various use cases.

Use the following procedure to add a destination node to your data flow.

**To add a destination node**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the node where you want to place the destination node.

1. In the context menu, hover over **Export**, and then select **Add destination**.

1. In the **Export destination** side panel, enter a **Dataset name** to name the output.

1. For **Amazon S3 location**, enter the Amazon S3 location to which you want to export the output. You can enter the S3 URI, alias, or ARN of the S3 location or S3 access point. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) in the *Amazon S3 User Guide*.

1. For **Export settings**, specify the following fields:

   1. **File type** – The file format of the exported data.

   1. **Delimiter** – The delimiter used to separate values in the file.

   1. **Compression** – The compression method used to reduce the file size.

1. For **Partitioning**, specify the following fields:

   1. **Number of partitions** – The number of dataset files that SageMaker Canvas writes as the output of the job.

   1. **Choose columns** – You can choose a subset of columns from the data to include in the partitions.

1. Choose **Add** if you only want to add the destination node to your data flow, or choose **Add** and then **Export** if you want to add the node and immediately start an export job.

You should now see a new destination node in your flow.

# Edit a destination node
<a name="canvas-destination-nodes-edit"></a>

A *destination node* in an Amazon SageMaker Canvas data flow specifies the Amazon S3 location where your processed and transformed data is stored, applying all the configured transformations in your data flow. You can edit the configuration of an existing destination node and then choose to re-run the job to overwrite the data in the specified Amazon S3 location. For more information about adding a new destination node, see [Add destination nodes](canvas-destination-nodes-add.md).

Use the following procedure to edit a destination node in your data flow and initiate an export job.

**To edit a destination node**

1. Navigate to your data flow.

1. Choose the ellipsis icon next to the destination node that you want to edit.

1. In the context menu, choose **Edit**.

1. The **Edit destination** side panel opens. From this panel, you can edit details such as the dataset name, the Amazon S3 location, and the export and partitioning settings.

1. (Optional) In **Additional nodes to export**, you can select more destination nodes to process when you run the export job.

1. Leave the **Process entire dataset** option selected if you want Canvas to apply your data flow transforms to your entire dataset and export the result. If you deselect this option, Canvas only applies the transforms to the sample of your dataset used in the interactive Data Wrangler data flow.

1. Leave the **Auto job configuration** option selected if you want Canvas to automatically determine whether to run the job using Canvas application memory or an EMR Serverless job. If you deselect this option and manually configure your job, then you can choose to use either an EMR Serverless or a SageMaker Processing job. For instructions on how to configure an EMR Serverless or a SageMaker Processing job, see the preceding section [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

1. When you're done making changes, choose **Update**.

Saving changes to your destination node configuration doesn't automatically re-run a job or overwrite data that has already been processed and exported. Export your data again to run a job with the new configuration. If you decide to export your data again with a job, Canvas uses the updated destination node configuration to transform and output the data to the specified location, overwriting any existing data.

# Create a schedule to automatically process new data
<a name="canvas-data-export-schedule-job"></a>

**Note**  
The following section only applies to SageMaker Processing jobs. If you used the default Canvas settings or EMR Serverless to create a remote job to apply transforms to your full dataset, this section doesn’t apply.

If you're processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data. For more information about processing jobs, see [Export to Amazon S3](canvas-export-data.md#canvas-export-data-s3).

When you create a job, you must specify an IAM role that has permissions to create the job. You can use the [AmazonSageMakerCanvasDataPrepFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerCanvasDataPrepFullAccess.html) policy to add permissions.

Add the following trust policy to the role to allow EventBridge to assume it.

```
{
    "Version": "2012-10-10",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "events.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

**Important**  
When you create a schedule, Data Wrangler creates an `eventRule` in EventBridge. You incur charges for both the event rules that you create and the instances used to run the processing job.  
For information about EventBridge pricing, see [Amazon EventBridge pricing](https://aws.amazon.com/eventbridge/pricing/). For information about processing job pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can set a schedule using one of the following methods:
+ [CRON expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html)
**Note**  
Data Wrangler doesn't support the following expressions:  
LW#
Abbreviations for days
Abbreviations for months
+ [RATE expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html#eb-rate-expressions)
+ Recurring – Set an hourly or daily interval to run the job.
+ Specific time – Set specific days and times to run the job.
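EventBridge schedule expressions take the forms `cron(minutes hours day-of-month month day-of-week year)` and `rate(value unit)`. As a sketch of how the CRON and RATE options above translate into those strings (a convention from EventBridge, not a Canvas API):

```python
# Sketch: build EventBridge schedule expression strings.
def cron_expression(minutes, hours, day_of_month, month, day_of_week, year="*"):
    # EventBridge cron expressions have six fields, ending with a year field.
    return f"cron({minutes} {hours} {day_of_month} {month} {day_of_week} {year})"

def rate_expression(value: int, unit: str) -> str:
    # EventBridge uses the singular unit for a value of 1, e.g. "rate(1 hour)".
    unit = unit.rstrip("s")
    return f"rate({value} {unit if value == 1 else unit + 's'})"
```

For example, a job that runs at noon every day would use `cron_expression("0", "12", "*", "*", "?")`, and one that runs every five minutes would use `rate_expression(5, "minutes")`.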

The following sections provide procedures on scheduling jobs when filling out the SageMaker AI Processing job settings while [exporting your data to Amazon S3](canvas-export-data.md#canvas-export-data-s3). All of the following instructions begin in the **Associate schedules** section of the SageMaker Processing job settings.

------
#### [ CRON ]

Use the following procedure to create a schedule with a CRON expression.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **CRON**.

1. For each of the **Minutes**, **Hours**, **Days of month**, **Month**, and **Day of week** fields, enter valid CRON expression values.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ RATE ]

Use the following procedure to create a schedule with a RATE expression.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Rate**.

1. For **Value**, specify an integer.

1. For **Unit**, select one of the following:
   + **Minutes**
   + **Hours**
   + **Days**

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------
#### [ Recurring ]

Use the following procedure to create a schedule that runs a job on a recurring basis.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Recurring**.

1. For **Every x hours**, specify the hourly frequency that the job runs during the day. Valid values are integers from **1** to **23**, inclusive.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.
**Note**  
The schedule resets every day. If you schedule a job to run every five hours, it runs at the following times during the day:  
00:00
05:00
10:00
15:00
20:00

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.
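The daily-reset behavior described in the note above can be sketched as a small helper that computes the run times for a given **Every x hours** value:

```python
# Sketch: an "Every x hours" schedule starts over at midnight each day,
# so the run times are the hours 0, x, 2x, ... that fit in a 24-hour day.
def daily_run_times(every_x_hours: int):
    return [f"{h:02d}:00" for h in range(0, 24, every_x_hours)]
```

For example, `daily_run_times(5)` yields the times listed in the note: 00:00, 05:00, 10:00, 15:00, and 20:00.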

------
#### [ Specific time ]

Use the following procedure to create a schedule that runs a job at specific times.

1. In the **Export to Amazon S3** side panel, make sure you've turned off the **Auto job configuration** toggle and have the **SageMaker Processing** option selected.

1. In the **SageMaker Processing** job settings, open the **Associate schedules** section and choose **Create new schedule**.

1. The **Create new schedule** dialog box opens. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Start time**.

1. For **Start time**, enter a time (for example, **09:00**). The start time uses the time zone where you are located.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job only runs on the schedules that you specify.

1. Choose **Export** after you've filled out the rest of the export job settings.

------

You can use the SageMaker AI console in the AWS Management Console to view the jobs that are scheduled to run. Your processing jobs run within Pipelines: each processing job has its own pipeline and runs as a processing step within that pipeline. You can view the schedules that you've created within a pipeline. For information about viewing a pipeline, see [View the details of a pipeline](pipelines-studio-list.md).

To view the jobs that you've scheduled, do the following.

1. Open Amazon SageMaker Studio Classic.

1. Open **Pipelines**.

1. View the pipelines for the jobs that you've created.

   The pipeline running the job uses the job name as a prefix. For example, if you've created a job named `housing-data-feature-engineering`, the name of the pipeline is `canvas-data-prep-housing-data-feature-engineering`.

1. Choose the pipeline containing your job.

1. View the status of the pipelines. Pipelines with a **Status** of **Succeeded** have run the processing job successfully.
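The naming convention described in the procedure above can be expressed as a one-line helper (a sketch based only on the prefix convention stated above):

```python
# Sketch: Canvas derives the pipeline name by prefixing the job name.
def pipeline_name_for_job(job_name: str) -> str:
    return f"canvas-data-prep-{job_name}"
```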

To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an event rule stops all the jobs associated with the schedule from running. For information about deleting a rule, see [Disabling or deleting an Amazon EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-delete-rule.html).

You can stop and delete the pipelines associated with the schedules as well. For information about stopping a pipeline, see [StopPipelineExecution](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopPipelineExecution.html). For information about deleting a pipeline, see [DeletePipeline](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeletePipeline.html#API_DeletePipeline_RequestSyntax).

# Automate data preparation in SageMaker Canvas
<a name="canvas-data-export"></a>

After you transform your data in a data flow, you can export the transforms to your machine learning workflows. When you export your transforms, SageMaker Canvas creates a Jupyter notebook. You must run the notebook within Amazon SageMaker Studio Classic. For information about getting started with Studio Classic, contact your administrator.

## Automate data preparation using Pipelines
<a name="canvas-data-export-pipelines"></a>

When you want to build and deploy large-scale machine learning (ML) workflows, you can use Pipelines to create workflows that manage and deploy SageMaker AI jobs. With Pipelines, you can build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs. Pipelines also let you use the first-party algorithms that SageMaker AI offers. For more information on Pipelines, see [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).

When you export one or more steps from your data flow to Pipelines, Data Wrangler creates a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.

### Use a Jupyter Notebook to Create a Pipeline
<a name="canvas-pipelines-notebook"></a>

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Pipelines.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **Pipelines (via Jupyter Notebook)**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline includes the data processing steps that are defined by your Data Wrangler flow. 

You can add additional steps to your pipeline by adding steps to the `steps` list in the following code in the notebook:

```
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process], #Add more steps to this list to run in your Pipeline
)
```

For more information on defining pipelines, see [Define SageMaker AI Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html).

## Automate data preparation using an inference endpoint
<a name="canvas-data-export-inference"></a>

Use your Data Wrangler flow to process data at the time of inference by creating a SageMaker AI serial inference pipeline from your Data Wrangler flow. An inference pipeline is a series of steps that results in a trained model making predictions on new data. A serial inference pipeline within Data Wrangler transforms the raw data and provides it to the machine learning model for a prediction. You create, run, and manage the inference pipeline from a Jupyter notebook within Studio Classic. For more information about accessing the notebook, see [Use a Jupyter notebook to create an inference endpoint](#canvas-inference-notebook).

Within the notebook, you can either train a machine learning model or specify one that you've already trained. You can use either Amazon SageMaker Autopilot or XGBoost to train the model using the data that you've transformed in your Data Wrangler flow.

The pipeline provides the ability to perform either batch or real-time inference. You can also add the Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see [Multi-model endpoints](multi-model-endpoints.md).

**Important**  
You can't export your Data Wrangler flow to an inference endpoint if it has any of the following transformations:
+ Join
+ Concatenate
+ Group by

If you must use the preceding transforms to prepare your data, use the following procedure:

1. Create a Data Wrangler flow.

1. Apply the preceding transforms that aren't supported.

1. Export the data to an Amazon S3 bucket.

1. Create a separate Data Wrangler flow.

1. Import the data that you've exported from the preceding flow.

1. Apply the remaining transforms.

1. Create a serial inference pipeline using the Jupyter notebook that we provide.

For information about exporting your data to an Amazon S3 bucket, see [Export data](canvas-export-data.md). For information about opening the Jupyter notebook used to create the serial inference pipeline, see [Use a Jupyter notebook to create an inference endpoint](#canvas-inference-notebook).

Data Wrangler ignores transforms that remove data at the time of inference. For example, Data Wrangler ignores the [Handle Missing Values](canvas-transform.md#canvas-transform-handle-missing) transform if you use the **Drop missing** configuration.

If you've refit transforms to your entire dataset, the transforms carry over to your inference pipeline. For example, if you used the median value to impute missing values, the median value from refitting the transform is applied to your inference requests. You can refit the transforms from your Data Wrangler flow either when you're using the Jupyter notebook or when you're exporting your data to an inference pipeline.

The serial inference pipeline supports the following data types for the input and output strings. Each data type has a set of requirements.

**Supported data types**
+ `text/csv` – The data type for CSV strings
  + The string can't have a header.
  + Features used for the inference pipeline must be in the same order as features in the training dataset.
  + There must be a comma delimiter between features.
  + Records must be delimited by a newline character.

  The following is an example of a validly formatted CSV string that you can provide in an inference request.

  ```
  abc,0.0,"Doe, John",12345\ndef,1.1,"Doe, Jane",67890                    
  ```
+ `application/json` – The data type for JSON strings
  + The features used in the dataset for the inference pipeline must be in the same order as the features in the training dataset.
  + The data must have a specific schema. You define schema as a single `instances` object that has a set of `features`. Each `features` object represents an observation.

  The following is an example of a validly formatted JSON string that you can provide in an inference request.

  ```
  {
      "instances": [
          {
              "features": ["abc", 0.0, "Doe, John", 12345]
          },
          {
              "features": ["def", 1.1, "Doe, Jane", 67890]
          }
      ]
  }
  ```
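As a sketch of the two payload formats above, the helpers below format the same records as a `text/csv` string and an `application/json` string. This is a minimal illustration of the stated requirements (no header, comma delimiter, newline-delimited records, and the `instances`/`features` schema), not code generated by Data Wrangler:

```python
import csv
import io
import json

# Example records matching the payloads shown above.
records = [
    ["abc", 0.0, "Doe, John", 12345],
    ["def", 1.1, "Doe, Jane", 67890],
]

def to_csv_payload(rows) -> str:
    # No header; comma-delimited fields; newline-delimited records.
    # csv.writer quotes fields that contain commas, as in the example above.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().rstrip("\n")

def to_json_payload(rows) -> str:
    # A single "instances" object whose "features" entries are observations.
    return json.dumps({"instances": [{"features": row} for row in rows]})
```

Either string could then be sent as the body of an inference request with the matching `Content-Type` header.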

### Use a Jupyter notebook to create an inference endpoint
<a name="canvas-inference-notebook"></a>

Use the following procedure to export your Data Wrangler flow and create an inference pipeline.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **SageMaker AI Inference Pipeline (via Jupyter Notebook)**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

When you run the Jupyter notebook, it creates an inference flow artifact. An inference flow artifact is a Data Wrangler flow file with additional metadata used to create the serial inference pipeline. The node that you're exporting encompasses all of the transforms from the preceding nodes.

**Important**  
Data Wrangler needs the inference flow artifact to run the inference pipeline. You can't use your own flow file as the artifact. You must create it by using the preceding procedure.

## Automate data preparation using Python Code
<a name="canvas-data-export-python-code"></a>

To export all steps in your data flow to a Python file that you can manually integrate into any data processing workflow, use the following procedure to generate and run a Jupyter notebook.

1. Choose the ellipsis icon next to the node that you want to export.

1. Choose **Export data flow**.

1. Choose **Python Code**.

1. Download the Jupyter notebook or copy it to an Amazon S3 location. We recommend copying it to an Amazon S3 location that you can access within Studio Classic. Contact your administrator if you need guidance on a suitable location.

1. Run the Jupyter notebook.

You might need to configure the Python script to make it run in your pipeline. For example, if you're running the script in a Spark environment, make sure that you run it from an environment that has permission to access AWS resources.