

# Recommendations for choosing the right data preparation tool in SageMaker AI
<a name="data-prep"></a>

Data preparation in machine learning refers to the process of collecting, preprocessing, and organizing raw data to make it suitable for analysis and modeling. This step ensures that the data is in a format from which machine learning algorithms can effectively learn. Data preparation tasks may include handling missing values, removing outliers, scaling features, encoding categorical variables, assessing potential biases and taking steps to mitigate them, splitting data into training and testing sets, labeling, and other necessary transformations to optimize the quality and usability of the data for subsequent machine learning tasks.
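
Most of these tasks can be expressed in a few lines of pandas, independent of any particular SageMaker AI feature. The following minimal sketch uses a hypothetical dataset and column names, purely for illustration:

```python
import pandas as pd

# Hypothetical raw dataset; the columns are illustrative only.
df = pd.DataFrame({
    "age": [25, None, 47, 51, 38],
    "income": [48000, 52000, None, 91000, 60000],
    "city": ["NYC", "SEA", "NYC", "BOS", "SEA"],
    "label": [0, 1, 0, 1, 1],
})

# Handle missing values by imputing the column median.
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Encode the categorical variable as one-hot columns.
df = pd.get_dummies(df, columns=["city"])

# Scale numeric features to the [0, 1] range (min-max scaling).
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Split into training and test sets (80/20).
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
```

The SageMaker AI features below package these same operations into visual, SQL-based, or distributed workflows.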

## Choose a feature
<a name="data-prep-choose"></a>

There are three primary use cases for *data preparation* with Amazon SageMaker AI. Choose the [use case](#data-prep-choose-use-cases) that aligns with your requirements, and then refer to the corresponding [recommended feature](#data-prep-choose-recommended).

### Use cases
<a name="data-prep-choose-use-cases"></a>

The following are the primary use cases when performing data preparation for machine learning.
+ **Use case 1**: For those who prefer a visual interface, SageMaker AI provides ways to explore, prepare, and engineer features for model training through a point-and-click environment. 
+ **Use case 2**: For users comfortable with coding who want more flexibility and control over data preparation, SageMaker AI integrates tools into its coding environments for exploration, transformations, and feature engineering. 
+ **Use case 3**: For users focused on scalable data preparation, SageMaker AI offers serverless capabilities that leverage the Hadoop/Spark ecosystem for distributed processing of big data.

### Recommended features
<a name="data-prep-choose-recommended"></a>

The following table outlines the key considerations and tradeoffs for the SageMaker AI features related to each data preparation use case for machine learning. To get started, identify the use case that aligns with your requirements and navigate to its recommended SageMaker AI feature.


| Descriptor | Use case 1 | Use case 2 | Use case 3 | 
| --- | --- | --- | --- | 
| SageMaker AI feature | [Data Wrangler](canvas-data-prep.md) within Amazon SageMaker Canvas | [Data preparation with SQL in Studio](sagemaker-sql-extension.md) | [Prepare data using EMR Serverless](studio-notebooks-emr-serverless.md) applications in Studio | 
| Description | SageMaker Canvas is a visual low-code environment for building, training, and deploying machine learning models in SageMaker AI. Its integrated Data Wrangler tool allows users to combine, transform, and clean datasets through point-and-click interactions. | The SQL extension in Studio allows users to connect to Amazon Redshift, Snowflake, Athena, and Amazon S3 to author ad-hoc SQL queries, and preview results in JupyterLab notebooks. The output of these queries can be manipulated using Python and Pandas for additional processing, visualization, and transformation into formats usable for machine learning model development. | The integration between EMR Serverless and Amazon SageMaker Studio provides a scalable serverless environment for large-scale data preparation for machine learning using open-source frameworks such as Apache Spark and Apache Hive. Users can directly access EMR Serverless applications and data from their Studio notebooks to perform their data preparation tasks at scale. | 
| Optimized for | Using a visual interface in which you can: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-prep.html) Optimized for tabular data tasks such as handling missing values, encoding categorical variables, and applying data transformations.  | For users whose data resides in Amazon Redshift, Snowflake, Athena, or [Amazon S3](studio-sqlexplorer-athena-s3-quickstart.md) and who want to combine exploratory SQL and Python for data analysis and preparation without the need to learn Spark. | For users who prefer a serverless experience with automatic resource provisioning and termination for scaling short-running or intermittent interactive workloads revolving around Apache Spark, while taking advantage of SageMaker AI's machine learning capabilities. | 
| Considerations |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-prep.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-prep.html)  | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-prep.html)  | 
| Recommended environment | [Getting started with using SageMaker Canvas](canvas-getting-started.md) | [Launch Studio](studio-updated-launch.md#studio-updated-launch-prereq) | [Launch Studio](studio-updated-launch.md#studio-updated-launch-prereq) | 

### Additional options
<a name="data-prep-choose-additional"></a>

 SageMaker AI offers the following additional options for preparing your data for use in machine learning models. 
+ [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md): For long-running, computationally intensive, large-scale data processing tasks, consider using Amazon EMR clusters from SageMaker Studio. Amazon EMR clusters are designed to handle massive parallelization and can scale to hundreds or thousands of nodes, making them well-suited for big data workloads that require frameworks like Apache Spark, Hadoop, Hive, and Presto. The integration of Amazon EMR with SageMaker Studio allows you to leverage the scalability and performance of Amazon EMR, while keeping your ML experimentation, model training, and deployment centralized and managed within the SageMaker Studio environment. 
+ [Prepare data using AWS Glue interactive sessions](studio-notebooks-glue.md): You can use the Apache Spark-based serverless engine of AWS Glue interactive sessions to aggregate, transform, and prepare data from multiple sources in SageMaker Studio.
+ [Identify bias in training data]() using Amazon SageMaker Clarify processing jobs: SageMaker Clarify analyzes your data and detects potential biases across multiple facets. For example, you can use the Clarify API in Studio to detect whether your training data contains imbalanced representations or labeling biases between groups such as gender, race, or age. Clarify can help you identify these biases before training a model, so that they are not propagated into the model's predictions.
+ [Create, store, and share features](): Amazon SageMaker Feature Store optimizes the discovery and reuse of curated features for machine learning. It provides a centralized repository to store feature data that can be searched and retrieved for model training. Storing features in a standardized format enables reuse across ML projects. The Feature Store manages the full lifecycle of features including lineage tracking, statistics, and audit trails for scalable and governed machine learning feature engineering.
+ [Label data with a human-in-the-loop](data-label.md): You can use SageMaker Ground Truth to manage the data labeling workflows of your training datasets. 
+ [Use SageMaker Processing API](processing-job.md): After performing exploratory data analysis and creating your data transformation steps, you can productionize your transformation code using [SageMaker AI Processing jobs](processing-job.md) and automate your preparation workflow using [SageMaker Model Building Pipelines](pipelines.md).

# Data preparation with SQL in Studio
<a name="sagemaker-sql-extension"></a>

Amazon SageMaker Studio provides a built-in SQL extension. This extension allows data scientists to perform tasks such as sampling, exploratory analysis, and feature engineering directly within their JupyterLab notebooks. It leverages AWS Glue connections to maintain a centralized data source catalog. The catalog stores metadata about various data sources. Through this SQL environment, data scientists can browse data catalogs, explore their data, author complex SQL queries, and further process the results in Python. 

This section walks through configuring the SQL extension in Studio. It describes the capabilities enabled by this SQL integration and provides instructions for running SQL queries in JupyterLab notebooks. 

To enable SQL data analysis, administrators must first configure AWS Glue connections to the relevant data sources. These connections allow data scientists to seamlessly access authorized datasets from within JupyterLab. 

In addition to the administrator-configured AWS Glue connections, the SQL extension allows individual data scientists to create their own data source connections. These user-created connections can be managed independently and scoped to the user's profile through tag-based access control policies. This dual-level connection model, combining administrator-configured and user-created connections, gives data scientists broader access to the data they need for their analysis and modeling tasks. Users can set up connections to their own data sources directly in the JupyterLab user interface (UI), without relying solely on the centralized connections established by the administrator.

**Important**  
The user-defined connections creation capability is available as a set of standalone libraries in PyPI. To use this functionality, you need to install the following libraries in your JupyterLab environment:  
[amazon-sagemaker-sql-editor](https://pypi.org/project/amazon-sagemaker-sql-editor/)
[amazon-sagemaker-sql-execution](https://pypi.org/project/amazon-sagemaker-sql-execution/)
[amazon-sagemaker-sql-magic](https://pypi.org/project/amazon-sagemaker-sql-magic/)
You can install these libraries by running the following commands in your JupyterLab terminal:  

```
pip install "amazon-sagemaker-sql-editor>=0.1.13"
pip install "amazon-sagemaker-sql-execution>=0.1.6"
pip install "amazon-sagemaker-sql-magic>=0.1.3"
```
After installing the libraries, you will need to restart the JupyterLab server for the changes to take effect.  

```
restart-jupyter-server
```

With access set up, JupyterLab users can:
+ View and browse pre-configured data sources.
+ Search, filter, and inspect database information elements such as tables, schemas, and columns.
+ Auto-generate the connection parameters to a data source.
+ Create complex SQL queries using the syntax-highlighting, auto-completion, and SQL formatting features of the extension's SQL editor.
+ Run SQL statements from JupyterLab notebook cells.
+ Retrieve the results of SQL queries as pandas DataFrames for further processing, visualization, and other machine learning tasks.
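
As an illustration of the last capability, once query results land in a pandas DataFrame you can profile them immediately. In this sketch, the DataFrame stands in for an actual query result, and its columns are hypothetical:

```python
import pandas as pd

# Stand-in for the DataFrame a SQL query would return;
# the column names here are hypothetical.
df = pd.DataFrame({
    "pickup_year": [2019, 2019, 2020, 2020, 2020],
    "fare_amount": [12.5, 8.0, 15.25, None, 9.75],
})

# Quick profile of the result set.
print(df.shape)                        # rows x columns
print(df["pickup_year"].value_counts())
print(df["fare_amount"].describe())

# Flag data-quality issues before feature engineering.
missing = df.isna().sum()
print(missing[missing > 0])
```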

You can access the extension by choosing the SQL extension icon (![\[Icon of the SQL extension feature in JupyterLab.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/sqlexplorer/sqlexplorer-icon.png)) in the left navigation pane of your JupyterLab application in Studio. Hovering over the icon displays its *Data Discovery* tool tip.

**Important**  
The JupyterLab image in SageMaker Studio contains the SQL extension by default, starting with [SageMaker AI Distribution](https://github.com/aws/sagemaker-distribution) 1.6. The extension works with Python and SparkMagic kernels only.
The extension's user interface for exploring connections and data is only available in JupyterLab within Studio. It is compatible with [Amazon Redshift](https://aws.amazon.com/redshift/), [Amazon Athena](https://aws.amazon.com/athena/), and [Snowflake](https://www.snowflake.com/en/).
+ If you are an administrator looking to create generic connections to data sources for the SQL extension, follow these steps:

  1. Enable the network communication between your Studio domain and the data sources to which you want to connect. To learn about the networking requirements, see [Configure network access between Studio and data sources (for administrators)](sagemaker-sql-extension-networking.md).

  1. Check connection properties and instructions to create a secret for your data source in [Create secrets for database access credentials in Secrets Manager](sagemaker-sql-extension-glue-connection-secrets.md).

  1. Create the AWS Glue connections to your data sources in [Create AWS Glue connections (for administrators)](sagemaker-sql-extension-datasources-glue-connection.md).

  1. Grant the execution role of your SageMaker domain or user profiles the required permissions in [Set up the IAM permissions to access the data sources (for administrators)](sagemaker-sql-extension-datasources-connection-permissions.md).
+ If you are a data scientist looking to create your own connections to data sources for the SQL extension, follow these steps:

  1. Have your administrator:
     + Enable the network communication between your Studio domain and the data sources to which you want to connect. To learn about the networking requirements, see [Configure network access between Studio and data sources (for administrators)](sagemaker-sql-extension-networking.md).
     + Grant the execution role of your SageMaker domain or user profiles the required permissions in [Set up the IAM permissions to access the data sources (for administrators)](sagemaker-sql-extension-datasources-connection-permissions.md).
**Note**  
Administrators can restrict user access to connections created within the JupyterLab application by configuring [tag-based access control](sagemaker-sql-extension-datasources-connection-permissions.md#user-defined-connections-permissions) in the execution role.

  1. Check connection properties and instructions to create a secret for your data source in [Create secrets for database access credentials in Secrets Manager](sagemaker-sql-extension-glue-connection-secrets.md).

  1. Create your connection in JupyterLab UI using the instructions in [Create user-defined AWS Glue connections](sagemaker-sql-extension-datasources-glue-connection-user-defined.md).
+ If you are a data scientist looking to browse and query your data sources using the SQL extension, ensure that you or your administrator have set up the connections to your data sources first. Then, follow these steps:

  1. Create a private space to launch your JupyterLab application in Studio using the SageMaker distribution image version 1.6 or higher.

  1. If you are a user of the SageMaker distribution image version 1.6, load the SQL extension in a JupyterLab notebook by running `%load_ext amazon_sagemaker_sql_magic` in a notebook cell.

     For users of SageMaker distribution image versions 1.7 and later, no action is needed; the SQL extension loads automatically.

   1. Familiarize yourself with the capabilities of the SQL extension in [SQL extension features and usage](sagemaker-sql-extension-features.md).

**Topics**
+ [Quickstart: Query data in Amazon S3](studio-sqlexplorer-athena-s3-quickstart.md)
+ [SQL extension features and usage](sagemaker-sql-extension-features.md)
+ [Configure network access between Studio and data sources (for administrators)](sagemaker-sql-extension-networking.md)
+ [SQL extension data source connections](sagemaker-sql-extension-datasources-connection.md)
+ [Frequently asked questions](sagemaker-sql-extension-faqs.md)
+ [Connection parameters](sagemaker-sql-extension-connection-properties.md)

# Quickstart: Query data in Amazon S3
<a name="studio-sqlexplorer-athena-s3-quickstart"></a>

Users can analyze data stored in Amazon S3 by running SQL queries from JupyterLab notebooks using the SQL extension. The extension integrates with Athena, which enables querying data in Amazon S3 after a few additional setup steps.

This section walks you through the steps to load data from Amazon S3 into Athena and then query that data from JupyterLab using the SQL extension. You will create an Athena data source and AWS Glue crawler to index your Amazon S3 data, configure the proper IAM permissions to enable JupyterLab access to Athena, and connect JupyterLab to Athena to query the data. Following those few steps, you will be able to analyze Amazon S3 data using the SQL extension in JupyterLab notebooks.

**Prerequisites**  
Sign in to the AWS Management Console using an AWS Identity and Access Management (IAM) user account with admin permissions. For information on how to sign up for an AWS account and create a user with administrative access, see [Complete Amazon SageMaker AI prerequisites](gs-set-up.md).
Have a SageMaker AI domain and user profile to access SageMaker Studio. For information on how to set a SageMaker AI environment, see [Use quick setup for Amazon SageMaker AI](onboard-quick-start.md).
Have an Amazon S3 bucket and folder to store Athena query results, using the same AWS Region and account as your SageMaker AI environment. For information on how to create a bucket in Amazon S3, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the Amazon S3 documentation. You will configure this bucket and folder to be your query output location.

**Topics**
+ [Step 1: Set up an Athena data source and AWS Glue crawler for your Amazon S3 data](#studio-sqlexplorer-athena-s3-quickstart-setup)
+ [Step 2: Grant Studio the permissions to access Athena](#studio-sqlexplorer-athena-s3-quickstart-permissions)
+ [Step 3: Enable Athena default connection in JupyterLab](#studio-sqlexplorer-athena-s3-quickstart-connect)
+ [Step 4: Query data in Amazon S3 from JupyterLab notebooks using the SQL extension](#studio-sqlexplorer-athena-s3-quickstart-query)

## Step 1: Set up an Athena data source and AWS Glue crawler for your Amazon S3 data
<a name="studio-sqlexplorer-athena-s3-quickstart-setup"></a>

Follow these steps to index your data in Amazon S3 and create tables in Athena.

**Note**  
To avoid collisions between table names from different Amazon S3 locations, create a separate data source and crawler for each location. Each crawler creates a table named after the folder that contains the data, unless you specify a table name prefix.

1. Configure a query result location

   1. Go to the Athena console: [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

   1. From the left menu, choose **Workgroups**.

   1. Follow the link for the `primary` workgroup and choose **Edit**.

   1. In the **Query result configuration** section, enter the Amazon S3 path for your output directory and then choose **Save changes**.

1. Create an Athena data source for your Amazon S3 data

   1. From the left menu in the Athena console, choose **Data sources** and then **Create Data Source**. 

   1. Choose **S3 - AWS Glue Data Catalog** and then **Next**. 

   1. Leave the default **AWS Glue Data Catalog in this account**, choose **Create a crawler in AWS Glue** and then **Create in AWS Glue**. This opens the AWS Glue console. 

1. Use AWS Glue to crawl your data source

   1. Enter a name and a description for your new crawler and then choose **Next**. 

   1. Under **Data Sources**, choose **Add a data source**.

      1. If the Amazon S3 bucket containing your data is in a different AWS account than your SageMaker AI environment, choose **In a different account** for the **Location of the S3 data**.

      1. Enter the path to your dataset in Amazon S3. For example:

         ```
         s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year-multiple-files/ride-info/year=2019/
         ```

      1. Keep all other default values and then choose **Add an Amazon S3 data source**. You should see a new Amazon S3 data source in the data sources table.

      1. Choose **Next**.

       

   1. Configure the IAM role for the crawler to access your data.
**Note**  
Each role is scoped down to the data source you specify. When reusing a role, edit the JSON policy to add any new resource you want to grant access to or create a new role for this data source.

      1. Choose **Create new IAM role**.

      1. Enter a name for the role and then choose **Next**.

1. Create or select a database for your tables

   1. If you do not have an existing database in Athena, choose **Add database** and then **Create a new database**.

   1. Go back to your previous crawler creation tab. In **Output configuration**, choose the **Refresh** button. You should now see your newly created database in the list.

   1. Select your database, add an optional prefix in **Table name prefix** and then choose **Next**.
**Note**  
For the previous example where your data is located at `s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year-multiple-files/ride-info/year=2019/`, adding the prefix `taxi-ride-` will create a table named `taxi-ride-year_2019`. Adding a prefix helps prevent table name collisions when multiple data locations have identically named folders.

1. Choose **Create crawler**.

1. Run your crawler to index your data. Wait for the crawler run to reach a `Completed` status, which may take a few minutes.

To ensure that a new table was created, go to the left menu in AWS Glue and choose **Databases** then **Tables**. You should now see a new table containing your data. 
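
The crawler run above can also be started and monitored programmatically. The following is a sketch using the AWS SDK for Python (boto3); the crawler name is a placeholder, and the commented call requires AWS credentials:

```python
import time

def wait_for_crawler(glue_client, crawler_name, poll_seconds=30, timeout_seconds=1800):
    """Start an AWS Glue crawler and block until it returns to the READY state."""
    glue_client.start_crawler(Name=crawler_name)
    waited = 0
    while waited <= timeout_seconds:
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # the crawler run has finished
            return
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Crawler {crawler_name} did not finish within {timeout_seconds}s")

# Usage (requires AWS credentials; "my-s3-crawler" is a hypothetical name):
# import boto3
# wait_for_crawler(boto3.client("glue"), "my-s3-crawler")
```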

## Step 2: Grant Studio the permissions to access Athena
<a name="studio-sqlexplorer-athena-s3-quickstart-permissions"></a>

In the following steps you grant the execution role of your user profile permissions to access Athena.

1. Retrieve the ARN of the execution role associated with your user profile

   1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and choose **Domains** in the left menu.

   1. Follow the link for your domain.

   1. In the **User profiles** list, follow the name for your user profile.

   1. On the **User details** page, copy the ARN of the execution role.

1. Update the policy of your execution role

   1. Find your AWS region and account ID at the top right of the SageMaker AI console. Use these values and your database name to update the placeholders in the following JSON policy in a text editor.

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "GetS3AndDataSourcesMetadata",
                  "Effect": "Allow",
                  "Action": [
                      "glue:GetDatabases",
                      "glue:GetSchema",
                      "glue:GetTables",
                      "s3:ListBucket",
                      "s3:GetObject",
                      "s3:GetBucketLocation",
                      "glue:GetDatabase",
                      "glue:GetTable",
                      "glue:ListSchemas",
                      "glue:GetPartitions"
                  ],
                  "Resource": [
                      "arn:aws:s3:::*",
                      "arn:aws:glue:us-east-1:111122223333:catalog",
                      "arn:aws:glue:us-east-1:111122223333:database/db-name"
                  ]
              },
              {
                  "Sid": "ExecuteAthenaQueries",
                  "Effect": "Allow",
                  "Action": [
                      "athena:ListDataCatalogs",
                      "athena:ListDatabases",
                      "athena:ListTableMetadata",
                      "athena:StartQueryExecution",
                      "athena:GetQueryExecution",
                      "athena:RunQuery",
                      "athena:StartSession",
                      "athena:GetQueryResults",
                      "athena:ListWorkGroups",
                      "s3:ListMultipartUploadParts",
                      "s3:ListBucket",
                      "s3:GetBucketLocation",
                      "athena:GetDataCatalog",
                      "s3:AbortMultipartUpload",
                      "s3:GetObject",
                      "s3:PutObject",
                      "athena:GetWorkGroup"
                  ],
                  "Resource": [
                      "arn:aws:s3:::*"
                  ]
              },
              {
                  "Sid": "GetGlueConnectionsAndSecrets",
                  "Effect": "Allow",
                  "Action": [
                      "glue:GetConnections",
                      "glue:GetConnection"
                  ],
                  "Resource": [
                      "*"
                  ]
              }
          ]
      }
      ```


   1. Go to the IAM console: [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/) and choose **Roles** in the left menu.

   1. Search for your role by role name.
**Note**  
You can retrieve an execution role name from its Amazon Resource Name (ARN) by splitting the ARN on `'/'` and taking the last element. For example, for the ARN `arn:aws:iam::112233445566:role/SageMakerStudio-SQLExtension-ExecutionRole`, the name of the execution role is `SageMakerStudio-SQLExtension-ExecutionRole`.

   1. Follow the link for your role.

   1. In the **Permissions** tab, choose **Add permissions** then **Create inline policy**.

   1. Choose the `JSON` format in the **Policy editor** section.

   1. Copy the policy above into the editor and then choose **Next**. Ensure that you have replaced the example Region (`us-east-1`), account ID (`111122223333`), and database name (`db-name`) with your own values.

   1. Enter a name for your policy and then choose **Create policy**.
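
The console steps above can also be scripted. A minimal sketch, assuming boto3 credentials are available; the role and policy names are placeholders, and only the first policy statement is shown:

```python
import json

def build_athena_access_policy(region, account_id, db_name):
    """Build the inline policy from Step 2 with the placeholders filled in."""
    catalog_arn = f"arn:aws:glue:{region}:{account_id}:catalog"
    database_arn = f"arn:aws:glue:{region}:{account_id}:database/{db_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GetS3AndDataSourcesMetadata",
                "Effect": "Allow",
                "Action": ["glue:GetDatabases", "glue:GetTables", "s3:ListBucket",
                           "s3:GetObject", "glue:GetDatabase", "glue:GetTable"],
                "Resource": ["arn:aws:s3:::*", catalog_arn, database_arn],
            },
            # ... remaining statements from the policy above ...
        ],
    }

# Attach as an inline policy (requires AWS credentials; names are hypothetical):
# import boto3
# boto3.client("iam").put_role_policy(
#     RoleName="SageMakerStudio-SQLExtension-ExecutionRole",
#     PolicyName="SQLExtensionAthenaAccess",
#     PolicyDocument=json.dumps(build_athena_access_policy("us-east-1", "111122223333", "db-name")),
# )
```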

## Step 3: Enable Athena default connection in JupyterLab
<a name="studio-sqlexplorer-athena-s3-quickstart-connect"></a>

In the following steps, you enable a `default-athena-connection` in your JupyterLab application. The default Athena connection allows running SQL queries in Athena directly from JupyterLab, without needing to manually create a connection.

To enable the default Athena connection

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and choose **Studio** in the left menu. Using your domain and user profile, launch Studio.

1. Choose the JupyterLab application. 

1. If you have not created a space for your JupyterLab application, choose **Create a JupyterLab space**. Enter a name for the space, keep the space as **Private**, and then choose **Create space**. Run your space using the latest version of the SageMaker AI Distribution image.

   Otherwise, choose **Run space** on your space to launch a JupyterLab application.

1. Enable Athena default connection:

   1. In your JupyterLab application, navigate to the **Settings** menu in the top navigation bar and open the **Settings Editor** menu.

   1. Choose **Data Discovery**.

   1. Check the box for **Enable default Athena connection**.

   1. In your JupyterLab application, choose the SQL extension icon (![\[Icon of the SQL extension feature in JupyterLab.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/sqlexplorer/sqlexplorer-icon.png)) in the left navigation pane to open the SQL extension.

   1.  Choose the **Refresh** button at the bottom of the data discovery panel. You should see a `default-athena-connection` in the list of connections.

## Step 4: Query data in Amazon S3 from JupyterLab notebooks using the SQL extension
<a name="studio-sqlexplorer-athena-s3-quickstart-query"></a>

You are ready to query your data using SQL in your JupyterLab notebooks.

1. Open the connection `default-athena-connection` and then **AWSDataCatalog**.

1. Navigate to your database and choose the three dots icon (![\[SQL extension three dots icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/sqlexplorer/sqlexplorer-3dots-icon.png)) on its right. Select **Query in notebook**.

   This automatically populates a notebook cell in JupyterLab with the relevant `%%sm_sql` magic command to connect to the data source. It also adds a sample SQL statement to help you start querying right away. 
**Note**  
Ensure that you load the extension in the top cell before you run a SQL query.

   You can further refine the SQL query using the auto-complete and highlighting features of the extension. See [SQL editor features of the JupyterLab SQL extension](sagemaker-sql-extension-features-editor.md) for more information on using the SQL extension SQL editor.

# SQL extension features and usage
<a name="sagemaker-sql-extension-features"></a>

This section details the various features of the JupyterLab SQL extension in Studio, and provides instructions on how to use them. Before you can use the SQL extension to access and query data from your JupyterLab notebooks, an administrator must first configure the connection to your data sources. For information on how administrators can create connections to data sources, see [SQL extension data source connections](sagemaker-sql-extension-datasources-connection.md).

**Note**  
To use the SQL extension, your JupyterLab application must run on a [SageMaker AI distribution](https://github.com/aws/sagemaker-distribution/blob/main/README.md) image version 1.6 or higher. These SageMaker images have the extension pre-installed.

The extension provides two components to help you access, discover, query, and analyze data from pre-configured data sources.
+ Use the *user interface* of the SQL extension to discover and explore your data sources. The UI capabilities can be further divided into the following subcategories.
  + With the **data exploration** UI element, you can browse your data sources and explore their tables, columns, and metadata. For details on the data exploration features of the SQL extension, see [Browse data using SQL extension](sagemaker-sql-extension-features-data-discovery.md).
  + The **connection caching** element caches connections for quick access. For details on connection caching in the SQL extension, see [SQL extension connection caching](sagemaker-sql-extension-features-connection-caching.md).
+ Use the *SQL Editor and Executor* to write, edit, and run SQL queries against connected data sources.
  + With the **SQL editor** element, you can write, format, and validate SQL statements within the notebooks of your JupyterLab application in Studio. For details on the SQL editor features, see [SQL editor features of the JupyterLab SQL extension](sagemaker-sql-extension-features-editor.md).
  + With the **SQL execution** element, you can run your SQL queries and visualize their results from the notebooks of your JupyterLab application in Studio. For details on the SQL execution capabilities, see [SQL execution features of the JupyterLab SQL extension](sagemaker-sql-extension-features-sql-execution.md).

# Browse data using SQL extension
<a name="sagemaker-sql-extension-features-data-discovery"></a>

To open the SQL extension user interface (UI), choose the SQL extension icon (![\[Icon of the SQL extension feature in JupyterLab.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/sqlexplorer/sqlexplorer-icon.png)) in the navigation pane of your JupyterLab application in Studio. The left panel data discovery view expands and displays all pre-configured data store connections to Amazon Athena, Amazon Redshift, and Snowflake.

From there, you can:
+ Expand a specific connection to explore its databases, schemas, tables or views, and columns.
+ Search for a specific connection using the search box in the SQL extension UI. The search returns any databases, schemas, tables, or views that partially match the string you enter.

**Note**  
If Athena is already set up in your AWS account, you can enable a `default-athena-connection` in your JupyterLab application. This allows you to run Athena queries without needing to manually create the connection. To enable the default Athena connection:  
1. Check with your administrator that your execution role has the required permissions to access Athena and the AWS Glue catalog. For details on the required permissions, see [Configure an AWS Glue connection for Athena](sagemaker-sql-extension-datasources-glue-connection.md#sagemaker-sql-extension-athena-glue-connection-config).

1. In your JupyterLab application, navigate to the **Settings** menu in the top navigation bar and open the **Settings Editor**.

1. Choose **Data Discovery**.

1. Check the box for **Enable default Athena connection**.

1. If needed, update the default `primary` WorkGroup.

To query a database, schema, or table in a JupyterLab notebook, from a given connection in the SQL extension pane:
+ Choose the three dots icon (![\[SQL extension three dots icon.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/sqlexplorer/sqlexplorer-3dots-icon.png)) on the right side of any database, schema, or table.
+ Select **Query in notebook** from the menu.

  This automatically populates a notebook cell in JupyterLab with the relevant `%%sm_sql` magic command to connect to the data source. It also adds a sample SQL statement to help you start querying right away. You can further refine the SQL query using the auto-complete and highlighting features of the extension. See [SQL editor features of the JupyterLab SQL extension](sagemaker-sql-extension-features-editor.md) for more information on using the SQL extension SQL editor.

At the table level, the three dots icon provides the additional option to choose to **Preview** a table's metadata.

The JupyterLab notebook cell content below shows an example of what is automatically generated when selecting the **Query in notebook** menu on a `redshift-connection` data source in the SQL extension pane.

```
%%sm_sql --metastore-id redshift-connection --metastore-type GLUE_CONNECTION

-- Query to list tables from schema 'dev.public'
SHOW TABLES
FROM
  SCHEMA "dev"."public"
```

Use the *less than* symbol (![\[Icon to clear the SQL extension search box.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/sqlexplorer/sqlexplorer-search-clear.png)) at the top of the SQL extension pane to clear the search box or return to the list of your connections.

**Note**  
The extension caches your exploration results for fast access. If the cached results are outdated or a connection is missing from your list, you can manually refresh the cache by choosing the **Refresh** button at the bottom of the SQL extension panel. For more information on connection caching, see [SQL extension connection caching](sagemaker-sql-extension-features-connection-caching.md).

# SQL editor features of the JupyterLab SQL extension
<a name="sagemaker-sql-extension-features-editor"></a>

The SQL extension provides magic commands that enable the SQL editor functionalities within your JupyterLab notebook cells.

If you are a user of the SageMaker distribution image version 1.6, you must load the SQL extension magic library by running `%load_ext amazon_sagemaker_sql_magic` in a JupyterLab notebook. This turns on SQL editing features.

For users of SageMaker distribution image versions 1.7 and later, no action is needed; the SQL extension loads automatically.

Once the extension is loaded, add the `%%sm_sql` magic command at the beginning of a cell to activate the following capabilities of the SQL editor.
+ **Connection-selection dropdown**: Upon adding an `%%sm_sql` magic command to a cell, a dropdown menu appears at the top of the cell with your available data source connections. Select a connection to automatically fill in the parameters needed to query that data source. The following is an example of an `%%sm_sql` magic command string generated by selecting the connection named `connection-name`. 

  ```
  %%sm_sql --metastore-type GLUE_CONNECTION --metastore-id connection-name
  ```

  Use the SQL editor's features below to build your SQL queries, then run the query by running the cell. For more information on the SQL execution capabilities, see [SQL execution features of the JupyterLab SQL extension](sagemaker-sql-extension-features-sql-execution.md).
+ **Query result dropdown**: You can specify how to render query results by selecting a result type from the dropdown menu next to your connection-selection dropdown menu. Choose between the following two alternatives:
  + **Cell Output**: (default) This option displays the result of your query in the notebook cell output area.
  + **Pandas Dataframe**: This option populates a pandas DataFrame with the query results. An extra input box lets you name the DataFrame when you choose this option.
+ **SQL syntax highlight**: The cell automatically visually distinguishes SQL keywords, clauses, operators, and more by color and styling. This makes SQL code easier to read and understand. Keywords such as `SELECT`, `FROM`, `WHERE`, and built-in functions such as `SUM` and `COUNT`, or clauses such as `GROUP BY` and more are highlighted in a different color and bold style.
+ **SQL formatting**: You can apply consistent indents, capitalization, spacing, and line breaks to group or separate SQL statements and clauses in one of the following ways. This makes SQL code easier to read and understand.
  + Right-click on the SQL cell and choose **Format SQL**.
  + When the SQL cell is in focus, use the *Alt + F* shortcut on Windows or *Option + F* on macOS.
+ **SQL auto-completion**: The extension provides automatic suggestions and completion of SQL keywords, functions, table names, column names, and more as you type. As you start typing an SQL keyword such as `SELECT` or `WHERE`, the extension displays a pop-up with suggestions to auto-complete the rest of the word. For example, when typing table or column names, it suggests matching table and column names defined in the database schema.
**Important**  
To enable SQL auto-completion in JupyterLab notebooks, users of the SageMaker AI distribution image version 1.6 must run the following `npm install -g vscode-jsonrpc sql-language-server` command in a terminal. After the installation completes, restart the JupyterLab server by running `restart-jupyter-server`.  
For users of SageMaker distribution image versions 1.7 and later, no action is needed.

  The cell offers two methods for auto-completing recognized SQL keywords:
  + Explicit invocation (recommended): Choose the **Tab** key to initiate the context-aware suggestion menu, then choose **Enter** to accept the suggested item.
  + Continuous hinting: The cell automatically suggests completions as you type.
**Note**  
Auto-completion is only triggered if the SQL keywords are in uppercase. For instance, entering `SEL` prompts for `SELECT`, but typing `sel` does not.
The first time you connect to a data source, SQL auto-completion indexes the data source's metadata. This indexing process may take some time to complete depending on the size of your databases.

# SQL execution features of the JupyterLab SQL extension
<a name="sagemaker-sql-extension-features-sql-execution"></a>

You can execute SQL queries against your connected data sources in the SQL extension of JupyterLab. The following sections explain the most common parameters for running SQL queries inside JupyterLab notebooks:
+ Create a simple connection in [Create a simple magic command connection string](sagemaker-sql-extension-features-sql-execution-create-connection.md).
+ Save your query results in a pandas DataFrame in [Save SQL query results in a pandas DataFrame](sagemaker-sql-extension-features-sql-execution-save-dataframe.md).
+ Override or add to connection properties defined by your administrator in [Override connection properties](sagemaker-sql-extension-features-sql-execution-override-connection.md).
+ [Use query parameters to provide dynamic values in SQL queries](sagemaker-sql-extension-features-sql-execution-query-parameters.md).

When you run a cell with the `%%sm_sql` magic command, the SQL extension engine executes the SQL query in the cell against the data source specified in the magic command parameters.

To see the details of the magic command parameters and supported formats, run `%%sm_sql?`.

**Important**  
To use Snowflake, users of the SageMaker distribution image version 1.6 must install the Snowflake Python dependency by running the following `micromamba install snowflake-connector-python -c conda-forge` command in a terminal of their JupyterLab application. Restart the JupyterLab server by running `restart-jupyter-server` in the terminal after the installation is complete.  
For SageMaker distribution image versions 1.7 and later, the Snowflake dependency is pre-installed. No action is needed.

# Create a simple magic command connection string
<a name="sagemaker-sql-extension-features-sql-execution-create-connection"></a>

If your administrator has configured the connections to your data sources, follow these steps to easily create a connection string in a notebook cell:

1. Open a notebook cell that uses `%%sm_sql`.

1. Select a pre-configured connection to your desired data source from the connection dropdown menu above the cell.

1. This will automatically populate the parameters needed to query that data source.

Alternatively, you can specify connection properties inline in the cell.

Choosing a connection from the dropdown menu inserts the following two parameters into the default magic command string. The parameters contain the connection information your administrator configured.
+ `--metastore-id`: The name of the connection object that holds your connection parameters.
+ `--metastore-type`: The type of metastore corresponding to `--metastore-id`. The SQL extension uses AWS Glue connections as its connection metastore, so this value is automatically set to `GLUE_CONNECTION`.

For example, the connection string to a pre-configured Amazon Athena data store looks like the following:

```
%%sm_sql --metastore-id athena-connection-name --metastore-type GLUE_CONNECTION 
```

# Save SQL query results in a pandas DataFrame
<a name="sagemaker-sql-extension-features-sql-execution-save-dataframe"></a>

You can store the results of your SQL query in a pandas DataFrame. The easiest way to output query results to a DataFrame is to use the [SQL editor features of the JupyterLab SQL extension](sagemaker-sql-extension-features-editor.md) query-result dropdown and choose the **Pandas Dataframe** option.

Alternatively, you can add the parameter `--output '{"format": "DATAFRAME", "dataframe_name": "dataframe_name"}'` to your connection string.

For example, the following query extracts details of customers with the highest balance from the `Customer` table in Snowflake's `TPCH_SF1` database, using both pandas and SQL:
+ In this example, we extract all the data from the customer table and save it in a DataFrame named `all_customer_data`.

  ```
  %%sm_sql --output '{"format": "DATAFRAME", "dataframe_name": "all_customer_data"}' --metastore-id snowflake-connection-name --metastore-type GLUE_CONNECTION
  SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
  ```

  ```
  Saved results to all_customer_data
  ```
+ Next, we extract the details of the highest account balance from the DataFrame.

  ```
  all_customer_data.loc[all_customer_data['C_ACCTBAL'].idxmax()].values
  ```

  ```
  array([61453, 'Customer#000061453', 'RxNgWcyl5RZD4qOYnyT3', 15,
  '25-819-925-1077', Decimal('9999.99'), 'BUILDING','es. carefully regular requests among the blithely pending requests boost slyly alo'],
  dtype=object)
  ```
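
Outside of the SQL magic, the `idxmax`/`loc` pattern above works on any pandas DataFrame; the following is a minimal sketch with synthetic data standing in for the query results:

```python
import pandas as pd

# Synthetic stand-in for the all_customer_data DataFrame produced by the query
all_customer_data = pd.DataFrame({
    "C_CUSTKEY": [101, 102, 103],
    "C_NAME": ["Customer#000000101", "Customer#000000102", "Customer#000000103"],
    "C_ACCTBAL": [512.40, 9999.99, 7301.15],
})

# idxmax returns the row label of the maximum balance; .loc retrieves that row
top_customer = all_customer_data.loc[all_customer_data["C_ACCTBAL"].idxmax()]
```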

# Override connection properties
<a name="sagemaker-sql-extension-features-sql-execution-override-connection"></a>

Your administrator's predefined connection definitions may not have the exact parameters you need to connect to a specific data store. You can add or override parameters in the connection string by using the `--connection-properties` argument.

The arguments are applied in the following order of precedence:

1. Overridden connection properties provided as inline arguments.

1. Connection properties present in the AWS Secrets Manager.

1. Connection properties in the AWS Glue connection.

If the same connection property is present in all three (command line argument, Secrets Manager, and connection), the value provided in the command line argument takes precedence.
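
Conceptually, the precedence works like a layered dictionary merge in which later sources override earlier ones; the following is a minimal sketch with hypothetical property values:

```python
# Hypothetical connection properties at each precedence level
glue_connection_props = {"schema_name": "default_schema", "warehouse": "main_wh"}
secrets_manager_props = {"schema_name": "analytics"}
inline_overrides = {"schema_name": "athena-db-name"}

# Merge lowest precedence first; each later merge overrides matching keys
effective = {**glue_connection_props, **secrets_manager_props, **inline_overrides}
```

Properties that are not overridden, such as `warehouse` here, keep the value from the lower-precedence source.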

For more information on the available connection properties per data source, see [Connection parameters](sagemaker-sql-extension-connection-properties.md).

The following example illustrates a connection property argument that sets the schema name for Amazon Athena.

```
%%sm_sql --connection-properties '{"schema_name": "athena-db-name"}' --metastore-id athena-connection-name --metastore-type GLUE_CONNECTION
```

# Use query parameters to provide dynamic values in SQL queries
<a name="sagemaker-sql-extension-features-sql-execution-query-parameters"></a>

Query parameters can be used to provide dynamic values in SQL queries.

In the following example, we pass a query parameter to the `WHERE` clause of the query.

```
# How to use '--query-parameters' with ATHENA as a data store
%%sm_sql --metastore-id athena-connection-name --metastore-type GLUE_CONNECTION --query-parameters '{"parameters":{"name_var": "John Smith"}}'
SELECT * FROM my_db.my_schema.my_table WHERE name = (%(name_var)s);
```
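
The `%(name_var)s` placeholder follows Python's `pyformat` parameter style. The drivers substitute the values safely; plain string formatting merely illustrates how the dictionary keys map to the placeholders (an illustration only, not the extension's actual escaping mechanism):

```python
query = "SELECT * FROM my_table WHERE name = %(name_var)s"
parameters = {"name_var": "John Smith"}

# For illustration only: quote string values and substitute into the placeholders
rendered = query % {k: repr(v) for k, v in parameters.items()}
```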

# SQL extension connection caching
<a name="sagemaker-sql-extension-features-connection-caching"></a>

The SQL extension caches connections by default to prevent the creation of multiple connections for the same set of connection properties. You can manage the cached connections using the `%sm_sql_manage` magic command.

The following topics describe how to manage your cached connections.

**Topics**
+ [Create cached connections](sagemaker-sql-extension-features-create-cached-connection.md)
+ [List cached connections](sagemaker-sql-extension-features-list-cached-connection.md)
+ [Clear cached connections](sagemaker-sql-extension-features-clear-cached-connection.md)
+ [Disable cached connections](sagemaker-sql-extension-features-disable-cached-connection.md)

# Create cached connections
<a name="sagemaker-sql-extension-features-create-cached-connection"></a>

You can create cached connections by specifying a connection name in the `--connection-name` parameter of your connection string. This is particularly useful when multiple connection properties are overridden for a specific use case, and there's a need to reuse the same properties without retyping them.

For example, the code below saves an Athena connection with an overridden schema connection property under the name `my_athena_conn_with_schema`, and then reuses it in another cell:

```
%%sm_sql --connection-name my_athena_conn_with_schema --connection-properties '{"schema_name": "sm-sql-private-beta-db"}' --metastore-id sm-sql-private-beta-athena-connection --metastore-type GLUE_CONNECTION 
SELECT * FROM "covid_table" LIMIT 2
```

```
%%sm_sql --connection-name my_athena_conn_with_schema
SELECT * FROM "covid_table" LIMIT 2
```

# List cached connections
<a name="sagemaker-sql-extension-features-list-cached-connection"></a>

You can list your cached connections by running the following command:

```
%sm_sql_manage --list-cached-connections
```

# Clear cached connections
<a name="sagemaker-sql-extension-features-clear-cached-connection"></a>

To clear all cached connections, run the following command:

```
%sm_sql_manage --clear-cached-connections
```

# Disable cached connections
<a name="sagemaker-sql-extension-features-disable-cached-connection"></a>

To disable connection caching, run the following command:

```
%sm_sql_manage --set-connection-reuse False
```

# Configure network access between Studio and data sources (for administrators)
<a name="sagemaker-sql-extension-networking"></a>

This section provides information about how administrators can configure a network to enable communication between Amazon SageMaker Studio and [Amazon Redshift](https://aws.amazon.com/redshift/) or [Amazon Athena](https://aws.amazon.com/athena/), either within a private Amazon VPC or over the internet. The networking instructions vary based on whether the Studio domain and the data store are deployed within a private [Amazon Virtual Private Cloud](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) (VPC) or communicate over the internet.

By default, Studio runs in an AWS managed VPC with [internet access](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html#studio-notebooks-and-internet-access-default). When using an internet connection, Studio accesses AWS resources, such as Amazon S3 buckets, over the internet. However, if you have security requirements to control access to your data and job containers, we recommend that you configure Studio and your data store (Amazon Redshift or Athena) so that your data and containers aren’t accessible over the internet. To control access to your resources or run Studio without public internet access, you can specify the `VPC only` network access type when you onboard to [Amazon SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html). In this scenario, Studio establishes connections with other AWS services via private [VPC endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html). For information about configuring Studio in `VPC only` mode, see [Connect Studio to external resources in a VPC](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html#studio-notebooks-and-internet-access-vpc-only).

**Note**  
To connect to Snowflake, the VPC of the Studio domain must have internet access.

The first two sections describe how to ensure communication between your Studio domain and your data store in VPCs without public internet access. The last section covers how to ensure communication between Studio and your data store using an internet connection. Before connecting Studio and your data store without internet access, make sure to establish endpoints for Amazon Simple Storage Service, Amazon Redshift or Athena, SageMaker AI, and Amazon CloudWatch and AWS CloudTrail (for logging and monitoring).
+ If Studio and the data store are in different VPCs, either in the same AWS account or in separate accounts, see [Studio and the data store are deployed in separate VPCs](#sagemaker-sql-extension-networking-cross-vpc).
+ If Studio and the data store are in the same VPC, see [Studio and the data store are deployed in the same VPC](#sagemaker-sql-extension-networking-same-vpc).
+ If you chose to connect Studio and the data store over the public internet, see [Studio and the data store communicate over public internet](#sagemaker-sql-extension-networking-internet).

## Studio and the data store are deployed in separate VPCs
<a name="sagemaker-sql-extension-networking-cross-vpc"></a>

To allow communication between Studio and a data store deployed in different VPCs:

1. Start by connecting your VPCs through a VPC peering connection.

1. Update the routing tables in each VPC to allow bidirectional network traffic between Studio subnets and the data store subnets. 

1. Configure your security groups to allow inbound and outbound traffic.

The configuration steps are the same whether Studio and the data store are deployed in a single AWS account or across different AWS accounts.

1. 

**VPC peering**

   Create a [VPC peering connection](https://docs.aws.amazon.com/vpc/latest/peering/working-with-vpc-peering.html) to facilitate the networking between the two VPCs (Studio and the data store).

   1. From the Studio account, on the VPC dashboard, choose **Peering connections**, then **Create peering connection**.

   1. Create your request to peer the Studio VPC with the data store VPC. When requesting peering in another AWS account, choose **Another account** in **Select another VPC to peer with**.

      For cross-account peering, the administrator must accept the request from the SQL engine account.

      When peering private subnets, you should enable private IP DNS resolution at the VPC peering connection level.

1. 

**Routing tables**

   Configure the routing to allow network traffic between Studio and data store VPC subnets in both directions.

   After you establish the peering connection, the administrator (on each account for cross account access) can add routes to the private subnet route tables to route the traffic between Studio and the data store VPCs' subnets. You can define those routes by going to the **Route Tables** section of each VPC in the VPC dashboard.

1. 

**Security groups**

   Lastly, the security group of Studio's domain VPC must allow outbound traffic, and the security group of the data store's VPC must allow inbound traffic on your data store port from Studio's VPC security group.

## Studio and the data store are deployed in the same VPC
<a name="sagemaker-sql-extension-networking-same-vpc"></a>

 If Studio and the data store are in different private subnets in the same VPC, add routes in each private subnet's route table. The routes should allow traffic to flow between the Studio subnets and the data store subnets. You can define those routes by going to the **Route Tables** section of each VPC in the VPC dashboard. If you deployed Studio and the data store in the same VPC and the same subnet, you do not need to route the traffic.

Regardless of any routing table updates, the security group of Studio's domain VPC must allow outbound traffic, and the security group of the data store's VPC must allow inbound traffic on its port from Studio's VPC security group.

## Studio and the data store communicate over public internet
<a name="sagemaker-sql-extension-networking-internet"></a>

By default, Studio provides a network interface that allows communication with the internet through an internet gateway in the VPC associated with the Studio domain. If you choose to connect to your data store through the public internet, your data store needs to accept inbound traffic on its port.

A [NAT gateway](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html#nat-gateway-working-with) lets instances in private subnets share a single public IP address when accessing the internet through the [internet gateway](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html).

**Note**  
Each port opened for inbound traffic represents a potential security risk. Carefully review custom security groups to ensure that you minimize vulnerabilities.

# SQL extension data source connections
<a name="sagemaker-sql-extension-datasources-connection"></a>

Before using the SQL extension in JupyterLab notebooks, administrators or users must create AWS Glue connections to their data sources. The SQL extension supports connections to data sources such as Amazon Redshift, Amazon Athena, or Snowflake.

To set up the connections, administrators must first ensure their network configuration allows communication between Studio and the data sources, and then grant the necessary IAM permissions to allow Studio to access the data sources. For information on how administrators can set up the networking, see [Configure network access between Studio and data sources (for administrators)](sagemaker-sql-extension-networking.md). For information on what policies must be set up, see [Set up the IAM permissions to access the data sources (for administrators)](sagemaker-sql-extension-datasources-connection-permissions.md). Once the connections are set up, data scientists can use the SQL extension in their JupyterLab notebooks to browse and query the connected data sources.

**Note**  
We recommend storing your database access credentials as a secret in Secrets Manager. To learn about how to create secrets for storing Amazon Redshift or Snowflake access credentials, see [Create secrets for database access credentials in Secrets Manager](sagemaker-sql-extension-glue-connection-secrets.md).

This section explains how to set up an AWS Glue connection and lists the IAM permissions required for the Studio JupyterLab application to access the data through the connection. 

**Note**  
[Amazon SageMaker Assets](sm-assets.md) integrates [Amazon DataZone](https://docs.aws.amazon.com/datazone/latest/userguide/what-is-datazone.html) with Studio. It includes a SageMaker AI blueprint for administrators to create Studio environments from Amazon DataZone projects within an Amazon DataZone domain.  
Users of a JupyterLab application launched from a Studio domain created with the blueprint can automatically access AWS Glue connections to data assets in their Amazon DataZone catalog when using the SQL extension. This allows querying those data sources without manually setting up connections.

**Topics**
+ [Create secrets for database access credentials in Secrets Manager](sagemaker-sql-extension-glue-connection-secrets.md)
+ [Create AWS Glue connections (for administrators)](sagemaker-sql-extension-datasources-glue-connection.md)
+ [Create user-defined AWS Glue connections](sagemaker-sql-extension-datasources-glue-connection-user-defined.md)
+ [Set up the IAM permissions to access the data sources (for administrators)](sagemaker-sql-extension-datasources-connection-permissions.md)

# Create secrets for database access credentials in Secrets Manager
<a name="sagemaker-sql-extension-glue-connection-secrets"></a>

Before creating your connection, we recommend storing your database access credentials as a secret in AWS Secrets Manager. Alternatively, you can generate temporary database credentials based on permissions granted through an AWS Identity and Access Management (IAM) permissions policy to manage the access that your users have to your database. For more information, see [Using IAM authentication to generate database user credentials](https://docs.aws.amazon.com/redshift/latest/mgmt/generating-user-credentials.html).

## Create a secret for Amazon Redshift access credentials
<a name="sagemaker-sql-extension-redshift-secret"></a>

**To store Amazon Redshift information in AWS Secrets Manager**

1. From the AWS Management Console, navigate to Secrets Manager.

1. Choose **Store a new secret**.

1. Under **Secret type**, choose **Credentials for Amazon Redshift**.

1. Enter the administrator username and password configured when launching the Amazon Redshift cluster. 

1. Select the Amazon Redshift cluster associated with the secrets.

1. Name your secret.

1. The remaining settings can be left at their default values for initial secret creation, or customized if required. 

1. Create the secret and retrieve its ARN.

## Create a secret for Amazon Redshift Serverless access credentials
<a name="sagemaker-sql-extension-redshift-serverless-secret"></a>

**If you need to connect to Amazon Redshift Serverless, follow these steps**

1. From the AWS Management Console, navigate to Secrets Manager.

1. Choose **Store a new secret**.

1. Under **Secret type**, choose **Other type of secret**.

1. In **Key-value pairs**, choose **Plaintext**, and then copy the following JSON content. Replace the `user` and `password` values with your actual credentials: 

   ```
   {
     "user": "redshift_user",
     "password": "redshift_password"
   }
   ```

1. Create the secret and retrieve its ARN.

1. When creating a new connection in SQL extension in JupyterLab, supply all other Amazon Redshift connection parameters as needed.
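
If you prefer to script the secret creation instead of using the console, the following sketch builds the same secret string (placeholder credentials; the AWS CLI call in the comment requires `secretsmanager:CreateSecret` permissions):

```python
import json

# Placeholder credentials; replace with your actual values
secret = {"user": "redshift_user", "password": "redshift_password"}
secret_string = json.dumps(secret)

# Store it with the AWS CLI, for example:
#   aws secretsmanager create-secret \
#       --name my-redshift-serverless-secret \
#       --secret-string '{"user": "redshift_user", "password": "redshift_password"}'
```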

## Create a secret for Snowflake access credentials
<a name="sagemaker-sql-extension-snowflake-secret"></a>

This section provides details on the secret and connection properties in JSON definition files that are specific to Snowflake. Before creating your connection, we recommend storing your Snowflake access credentials as a secret in Secrets Manager.

**To store Snowflake information in Secrets Manager**

1. From the AWS Management Console, navigate to Secrets Manager.

1. Choose **Store a new secret**.

1. Under **Secret type**, choose **Other type of secret**.

1. In **Key-value pairs**, choose **Plaintext**, and then copy the following JSON content. Replace the `user`, `password`, and `account` values with your own.

   ```
   {
       "user":"snowflake_user",
       "password":"snowflake_password",
       "account":"account_id"
   }
   ```

1. Name the secret.

1. The remaining settings can be left at their default values for initial secret creation, or customized if required.

1. Create the secret and retrieve its ARN.

# Create AWS Glue connections (for administrators)
<a name="sagemaker-sql-extension-datasources-glue-connection"></a>

To use data sources with the SQL extension, administrators can set up AWS Glue connections for each data source. These connections store the necessary configuration details to access and interact with the data sources. Once the connections are created, and the [appropriate permissions](sagemaker-sql-extension-datasources-connection-permissions.md) are granted, the connections become visible to all users of the [Amazon SageMaker Studio spaces](studio-updated-spaces.md) that share the same execution role.

To create these connections:
+ First, create a JSON file that defines the connection properties for each data source. The JSON file includes details such as the data source identifier, access credentials, and other relevant configuration parameters to access the data sources through the AWS Glue connections.
+ Then use the AWS Command Line Interface (AWS CLI) to create the AWS Glue connection, passing the JSON file as a parameter. The AWS CLI command reads the connection details from the JSON file and establishes the appropriate connection.
**Note**  
The SQL extension supports creating connections using the AWS CLI only.

Before creating AWS Glue connections, ensure that you complete the following steps:
+ Install and configure the AWS Command Line Interface (AWS CLI). For more information about how to install and configure the AWS CLI, see [About AWS CLI version 2](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html). Ensure that the access keys and tokens of the IAM user or role used to configure the AWS CLI have the required permissions to create AWS Glue connections. If they don't, add a policy that allows the `glue:CreateConnection` action.
+ Understand how to use AWS Secrets Manager. We recommend that you use Secrets Manager to provide connection credentials and any other sensitive information for your data store. For more information on using Secrets Manager to store credentials, see [Storing connection credentials in AWS Secrets Manager](https://docs.aws.amazon.com/glue/latest/dg/connection-properties-secrets-manager.html).

## Create a connection definition JSON file
<a name="sagemaker-sql-extension-glue-connection-config"></a>

To create an AWS Glue connection definition file, create a JSON file to define the connection details on the machine where you have installed and configured the AWS CLI. For this example, name the file `sagemaker-sql-connection.json`.

The connection definition file uses the following general format:
+ **Name** is the name for the connection.
+ **Description** is a textual description of the connection.
+ **ConnectionType** is the type of connection. Choose `REDSHIFT`, `ATHENA`, or `SNOWFLAKE`.
+ **ConnectionProperties** is a map of key-value pairs for the connection properties, such as the ARN of your AWS secret, or the name of your database.

```
{
    "ConnectionInput": {
        "Name": <GLUE_CONNECTION_NAME>,
        "Description": <GLUE_CONNECTION_DESCRIPTION>,
        "ConnectionType": "REDSHIFT | ATHENA | SNOWFLAKE",
        "ConnectionProperties": {
            "PythonProperties": "{\"aws_secret_arn\": <SECRET_ARN>, \"database\": <...>}"
        }
    }
}
```

**Note**  
The properties within the `ConnectionProperties` key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (`\`) character.
All properties available in Secrets Manager can also be directly provided through `PythonProperties`. However, it is not recommended to include sensitive fields such as passwords in `PythonProperties`. Instead, the preferred approach is to use Secrets Manager.
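Because `PythonProperties` is itself a JSON document stored as a string, the most reliable way to produce the required escaping is to serialize it programmatically rather than escaping quotes by hand. The following sketch uses Python's standard `json` module; the connection name, secret ARN, and database name are hypothetical placeholders:

```python
import json

# Hypothetical values -- replace with your own connection details.
python_properties = {
    "aws_secret_arn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:my-secret",
    "database": "dev",
}

connection_definition = {
    "ConnectionInput": {
        "Name": "my-sql-connection",
        "Description": "Example connection definition",
        "ConnectionType": "REDSHIFT",
        "ConnectionProperties": {
            # json.dumps stringifies the inner document and escapes its quotes.
            "PythonProperties": json.dumps(python_properties),
        },
    }
}

# Write the file that you later pass to `aws glue create-connection`.
with open("sagemaker-sql-connection.json", "w") as f:
    json.dump(connection_definition, f, indent=4)
```

Serializing the inner document with `json.dumps` guarantees that every double quote inside `PythonProperties` is escaped with a backslash, as the note above requires.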

The following sections provide sample connection definition files for each supported data store, with the specific properties and configuration required to connect to that data store from the SQL extension.
+ To create an AWS Glue connection for Amazon Redshift, see the sample definition file in [Configure an AWS Glue connection for Amazon Redshift](#sagemaker-sql-extension-redshift-glue-connection-config).
+ To create an AWS Glue connection for Amazon Athena, see the sample definition file in [Configure an AWS Glue connection for Athena](#sagemaker-sql-extension-athena-glue-connection-config).
+ To create an AWS Glue connection for Snowflake, see the sample definition file in [Configure an AWS Glue connection for Snowflake](#sagemaker-sql-extension-snowflake-glue-connection-config).

### Configure an AWS Glue connection for Amazon Redshift
<a name="sagemaker-sql-extension-redshift-glue-connection-config"></a>

This section provides details on the secret and connection properties in JSON definition files that are specific to Amazon Redshift. Before creating your connection configuration file, we recommend storing your Amazon Redshift access credentials as a secret in Secrets Manager. Alternatively, you can generate temporary database credentials based on permissions granted through an AWS Identity and Access Management (IAM) permissions policy to manage the access that your users have to your Amazon Redshift database. For more information, see [Using IAM authentication to generate database user credentials](https://docs.aws.amazon.com/redshift/latest/mgmt/generating-user-credentials.html).

#### Create a secret for Amazon Redshift access credentials
<a name="sagemaker-sql-extension-redshift-secret"></a>

**To store Amazon Redshift information in AWS Secrets Manager**

1. From the AWS console, navigate to Secrets Manager.

1. Choose **Store a new secret**.

1. Under **Secret type**, choose **Credentials for Amazon Redshift**.

1. Enter the administrator username and password configured when launching the Amazon Redshift cluster. 

1. Select the Amazon Redshift cluster associated with the secret.

1. Name your secret.

1. The remaining settings can be left at their default values for initial secret creation, or customized if required. 

1. Create the secret and retrieve its ARN.

#### Configure an AWS Glue connection for Amazon Redshift
<a name="sagemaker-sql-extension-redshift-glue-connection-creation"></a>

The SQL extension connects to data sources using custom AWS Glue connections. For general information on creating AWS Glue connections to connect a data source, see [Create AWS Glue connections (for administrators)](#sagemaker-sql-extension-datasources-glue-connection). The following example is a sample AWS Glue connection definition for connecting to Amazon Redshift.

Before creating a new connection, keep these recommendations in mind:
+ The properties within the `PythonProperties` key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (`\`) character.
+ In the connection definition file, enter the name and description of the connection, and replace the value of `aws_secret_arn` with the ARN of the secret you created previously.
+ Ensure that the database name in the connection definition matches the cluster's database. You can verify this on the cluster details page in the [Amazon Redshift console](https://console.aws.amazon.com/redshiftv2/), under **Database configurations** in the **Properties** section.
+ For additional parameters, see the list of connection properties supported by Amazon Redshift in [Amazon Redshift connection parameters](sagemaker-sql-extension-connection-properties.md#sagemaker-sql-extension-connection-properties-redshift). 
**Note**  
By default, the SQL extension connector for Python runs all queries in a transaction, unless the `auto_commit` connection property is set to `true`. 
You can add all connection parameters, including the `database` name, to a secret.

```
{
  "ConnectionInput": {
      "Name": "Redshift connection name",
      "Description": "Redshift connection description",
      "ConnectionType": "REDSHIFT",
      "ConnectionProperties": {
          "PythonProperties":"{\"aws_secret_arn\": \"arn:aws:secretsmanager:region:account_id:secret:secret_name\", \"database\":\"database_name\", \"database_metadata_current_db_only\": false}"
      }
  }
}
```

Once your definition file is updated, follow the steps in [Create AWS Glue connections](#sagemaker-sql-extension-datasources-glue-connection-creation) to create your AWS Glue connection.

### Configure an AWS Glue connection for Athena
<a name="sagemaker-sql-extension-athena-glue-connection-config"></a>

This section provides details on the connection properties in JSON definition files that are specific to Athena.

#### Configure an AWS Glue connection for Athena
<a name="sagemaker-sql-extension-athena-glue-connection-creation"></a>

The SQL extension connects to data sources using custom AWS Glue connections. For general information on creating AWS Glue connections to connect a data source, see [Create AWS Glue connections (for administrators)](#sagemaker-sql-extension-datasources-glue-connection). The following example is a sample AWS Glue connection definition for connecting to Athena.

Before creating a new connection, keep these recommendations in mind:
+ The properties within the `PythonProperties` key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (`\`) character. 
+ In the connection definition file, enter the name and description of the connection, then replace `catalog_name` with the name of your catalog, `s3_staging_dir` with the Amazon S3 URI (Uniform Resource Identifier) of the output directory in your Amazon S3 bucket, and `region_name` with the AWS Region of your Amazon S3 bucket.
+ For additional parameters, refer to the list of connection properties supported by Athena in [Athena connection parameters](sagemaker-sql-extension-connection-properties.md#sagemaker-sql-extension-connection-properties-athena). 
**Note**  
You can add all connection parameters, including the `catalog_name` or `s3_staging_dir`, to a secret.
If you specify a `workgroup`, you don't need to specify `s3_staging_dir`.

```
{
    "ConnectionInput": {
        "Name": "Athena connection name",
        "Description": "Athena connection description",
        "ConnectionType": "ATHENA",
        "ConnectionProperties": {
            "PythonProperties": "{\"catalog_name\": \"catalog_name\",\"s3_staging_dir\": \"s3://amzn-s3-demo-bucket_in_same_region/output_query_results_dir/\", \"region_name\": \"region\"}"
        }
    }
}
```

Once your definition file is updated, follow the steps in [Create AWS Glue connections](#sagemaker-sql-extension-datasources-glue-connection-creation) to create your AWS Glue connection.

### Configure an AWS Glue connection for Snowflake
<a name="sagemaker-sql-extension-snowflake-glue-connection-config"></a>

This section provides details on the secret and connection properties in JSON definition files that are specific to Snowflake. Before creating your connection configuration file, we recommend storing your Snowflake access credentials as a secret in Secrets Manager.

#### Create a secret for Snowflake access credentials
<a name="sagemaker-sql-extension-snowflake-secret"></a>

**To store Snowflake information in Secrets Manager**

1. From the AWS console, navigate to AWS Secrets Manager.

1. Choose **Store a new secret**.

1. Under **Secret type**, choose **Other type of secret**.

1. In the key-value pairs section, choose **Plaintext**, and then copy the following JSON content. Replace the `user`, `password`, and `account` values with your own.

   ```
   {
       "user":"snowflake_user",
       "password":"snowflake_password",
       "account":"account_id"
   }
   ```

1. Name the secret.

1. The remaining settings can be left at their default values for initial secret creation, or customized if required.

1. Create the secret and retrieve its ARN.
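If you prefer to script the secret creation instead of using the console, you can build the plaintext JSON shown above programmatically. The sketch below is a minimal illustration with placeholder credentials; the helper name is hypothetical, and the commented-out `boto3` call requires AWS credentials and Secrets Manager permissions that are not shown here:

```python
import json

def build_snowflake_secret_string(user: str, password: str, account: str) -> str:
    # Produce the plaintext JSON shape that the SQL extension expects.
    return json.dumps({"user": user, "password": password, "account": account})

secret_string = build_snowflake_secret_string(
    user="snowflake_user", password="snowflake_password", account="account_id"
)

# To store the secret, you could use the AWS SDK (requires credentials; not run here):
# import boto3
# boto3.client("secretsmanager").create_secret(
#     Name="sagemaker-sql-snowflake-secret", SecretString=secret_string)
```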

#### Configure an AWS Glue connection for Snowflake
<a name="sagemaker-sql-extension-snowflake-glue-connection-creation"></a>

The SQL extension connects to data sources using custom AWS Glue connections. For general information on creating AWS Glue connections to connect a data source, see [Create AWS Glue connections (for administrators)](#sagemaker-sql-extension-datasources-glue-connection). The following example is a sample AWS Glue connection definition for connecting to Snowflake.

Before creating a new connection, keep these recommendations in mind:
+ The properties within the `PythonProperties` key consist of stringified key-value pairs. Escape any double quotes used in the keys or values with a backslash (`\`) character. 
+ In the connection definition file, enter the name and description of the connection, then replace the value of `aws_secret_arn` with the ARN of the secret you created previously, and the value of `account` with your Snowflake account identifier.
+ For additional parameters, refer to the list of connection properties supported by Snowflake in [Snowflake connection parameters](sagemaker-sql-extension-connection-properties.md#sagemaker-sql-extension-connection-properties-snowflake).
**Note**  
You can add all connection parameters, including the `account`, to a secret.

```
{
    "ConnectionInput": {
        "Name": "Snowflake connection name",
        "Description": "Snowflake connection description",
        "ConnectionType": "SNOWFLAKE",
        "ConnectionProperties": {
            "PythonProperties": "{\"aws_secret_arn\": \"arn:aws:secretsmanager:region:account_id:secret:secret_name\", \"account\":\"account_id\"}"
        }
    }
}
```

Once your definition file is updated, follow the steps in [Create AWS Glue connections](#sagemaker-sql-extension-datasources-glue-connection-creation) to create your AWS Glue connection.

## Create AWS Glue connections
<a name="sagemaker-sql-extension-datasources-glue-connection-creation"></a>

To create an AWS Glue connection through the AWS CLI, use your connection definition file and run this AWS CLI command. Replace the `region` placeholder with your AWS Region name and provide the local path to your definition file.

**Note**  
The path to your configuration definition file must be preceded by `file://`.

```
aws --region region glue create-connection --cli-input-json file://path_to_file/sagemaker-sql-connection.json
```

Verify that the AWS Glue connection was created by running the following command and check for your connection name.

```
aws --region region glue get-connections
```

Alternatively, you can update an existing AWS Glue connection as follows:
+ Modify the AWS Glue connection definition file as required.
+ Run the following command to update the connection.

  ```
  aws --region region glue update-connection --name glue_connection_name --cli-input-json file://path_to_file/sagemaker-sql-connection.json
  ```

# Create user-defined AWS Glue connections
<a name="sagemaker-sql-extension-datasources-glue-connection-user-defined"></a>

**Note**  
All AWS Glue connections created by users through the SQL extension UI are automatically tagged with the following:  
`UserProfile: user-profile-name`
`AppType: JL`
These tags serve two purposes. The `UserProfile` tag identifies the specific user profile that created the AWS Glue connection, providing visibility into the user responsible for the connection. The `AppType: JL` tag records the connection's provenance, associating it with the JupyterLab application. This differentiates these connections from those created through other means, such as the AWS CLI. 

## Prerequisites
<a name="sagemaker-sql-extension-datasources-glue-connection-user-defined-prerequisites"></a>

Before creating an AWS Glue connection using the SQL extension UI, ensure that you have completed the following tasks: 
+ Have your administrator:
  + Enable the network communication between your Studio domain and the data sources to which you want to connect. To learn about the networking requirements, see [Configure network access between Studio and data sources (for administrators)](sagemaker-sql-extension-networking.md).
  + Ensure that the necessary IAM permissions are set up for managing AWS Glue connections and access to Secrets Manager. To learn about the required permissions, see [Set up the IAM permissions to access the data sources (for administrators)](sagemaker-sql-extension-datasources-connection-permissions.md).
**Note**  
Administrators can restrict user access to only the connections that were created by a user within the JupyterLab application. This can be done by configuring [tag-based access control](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-sql-extension-datasources-connection-permissions.html#user-defined-connections-permissions) scoped down to the user profile.
+ Check the connection properties and instructions to create a secret for your data source in [Create secrets for database access credentials in Secrets Manager](sagemaker-sql-extension-glue-connection-secrets.md).

## User workflow
<a name="sagemaker-sql-extension-datasources-glue-connection-user-defined-steps"></a>

The following steps describe the user workflow for creating connections:

1. **Select the data source type**: Upon choosing the *Add new connection* icon, a form opens, prompting the user to select the type of data source they want to connect to, such as Amazon Redshift, Athena, or Snowflake.

1. **Provide connection properties**: Based on the selected data source, the relevant connection properties are dynamically loaded. The form indicates which fields are mandatory or optional for the chosen data source. To learn about the available properties for your data source, see [Connection parameters](sagemaker-sql-extension-connection-properties.md).

1. **Select your AWS Secrets Manager ARN**: For Amazon Redshift and Snowflake data sources, the user is prompted to select the AWS Secrets Manager ARN that stores sensitive information such as the username and password. To learn about the creation of a secret for your data source, see [Create secrets for database access credentials in Secrets Manager](sagemaker-sql-extension-glue-connection-secrets.md).

1. **Save your connection details**: Upon choosing **Create**, the provided connection properties are saved as an AWS Glue connection. 

1. **Test your connection**: If the connection is successful, the associated databases and tables become visible in the explorer. If the connection fails, an error message is displayed, prompting the user to review and correct the connection details.

1. **Familiarize with SQL extension features**: To learn about the capabilities of the extension, see the [SQL extension features and usage](sagemaker-sql-extension-features.md).

1. **(Optional) Update or delete user-created connections**: Provided that the user has been granted the necessary permissions, they can update or delete the connections they have created. To learn more about the required permissions, see [User-defined connections required IAM permissions](sagemaker-sql-extension-datasources-connection-permissions.md#user-defined-connections-permissions).

# Set up the IAM permissions to access the data sources (for administrators)
<a name="sagemaker-sql-extension-datasources-connection-permissions"></a>

Administrators should ensure that the execution role used by the JupyterLab applications has the necessary AWS IAM permissions to access the data through the configured AWS Glue connections. 
+ **Connections created by administrators using the AWS CLI**: To view the AWS Glue connections [created by administrators](sagemaker-sql-extension-datasources-glue-connection.md) and access their data, users need to have their administrator attach specific permissions to the SageMaker AI execution role used by their JupyterLab application in Studio. This includes access to AWS Glue, Secrets Manager, and database-specific permissions. Connections created by administrators are visible to all applications sharing the execution role granted the permissions to view specific AWS Glue catalogs or databases. To learn about the required permissions per type of data source, see [Admin-defined connections required IAM permissions](#admin-defined-connections-permissions). 
+ **Connections created by users using the SQL extension UI in JupyterLab**: Connections [created by user profiles](sagemaker-sql-extension-datasources-glue-connection-user-defined.md) sharing the same execution role will also be listed unless the visibility of their connections is scoped down to only those created by the user. Connections created by users are tagged with the user profile that created them. To restrict the ability to view, update, or delete those user-created connections to only the user who created them, administrators can add additional tag-based access control restrictions to the execution role IAM permissions. To learn about the additional tag-based access control required, see [User-defined connections required IAM permissions](#user-defined-connections-permissions).

## Admin-defined connections required IAM permissions
<a name="admin-defined-connections-permissions"></a>

To grant the SageMaker AI execution role used by your JupyterLab application in Studio access to a data source through an AWS Glue connection, attach the following inline policy to the role.

To view the specific permissions and policy details for each data source or authentication method, choose the relevant connection type below.

**Note**  
We recommend limiting your policy's permissions to only the resources and actions required.  
To scope down policies and grant least privilege access, replace wildcard `"Resource": ["*"]` in your policy with specific ARNs for the exact resources needing access. For more information about how to control access to your resources, see [Fine-tune AWS resource access with granular ARN permissions](#resource-access-control).

### All connection types
<a name="datasources-connection-permissions-all"></a>

**Note**  
We strongly recommend scoping down this policy to only the actions and resources required.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetS3AndDataSourcesMetadata",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetSchema",
                "glue:GetTables",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation",
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:ListSchemas",
                "glue:GetPartitions"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*",
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:connection/*"
            ]
        },
        {
            "Sid": "ExecuteQueries",
            "Effect": "Allow",
            "Action": [
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata",
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:RunQuery",
                "athena:StartSession",
                "athena:GetQueryResults",
                "athena:ListWorkGroups",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "athena:GetDataCatalog",
                "s3:AbortMultipartUpload",
                "s3:GetObject",
                "s3:PutObject",
                "athena:GetWorkGroup"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*",
                "arn:aws:athena:us-east-1:111122223333:workgroup/workgroup-name"
            ]
        },
        {
            "Sid": "GetGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:GetConnections",
                "glue:GetConnection"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:connection/*"
            ]
        },
        {
            "Sid": "GetSecrets",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name"
            ]
        },
        {
            "Sid": "GetClusterCredentials",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-east-1:111122223333:cluster:cluster-name"
            ]
        }
    ]
}
```

------

### Athena
<a name="datasources-connection-permissions-athena"></a>

**Note**  
We strongly recommend scoping down this policy to only the resources required.

For more information, see *Example IAM permissions policies* in [Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/federated-query-iam-access.html).

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetS3AndDataSourcesMetadata",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetSchema",
                "glue:GetTables",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation",
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:ListSchemas",
                "glue:GetPartitions"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*",                
                "arn:aws:glue:us-east-2:111122223333:catalog",
                "arn:aws:glue:us-east-2:111122223333:connection/*"
            ]
        },
        {
            "Sid": "ExecuteAthenaQueries",
            "Effect": "Allow",
            "Action": [
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata",
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:RunQuery",
                "athena:StartSession",
                "athena:GetQueryResults",
                "athena:ListWorkGroups",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "athena:GetDataCatalog",
                "s3:AbortMultipartUpload",
                "s3:GetObject",
                "s3:PutObject",
                "athena:GetWorkGroup"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*",
                "arn:aws:athena:us-east-2:111122223333:workgroup/workgroup-name"
            ]
        },
        {
            "Sid": "GetGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:GetConnections",
                "glue:GetConnection"
            ],
            "Resource": [
                "arn:aws:glue:us-east-2:111122223333:catalog",
                "arn:aws:glue:us-east-2:111122223333:connection/*"
            ]
        },
        {
            "Sid": "GetSecrets",
            "Effect": "Allow",
            "Action": [                
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:us-east-2:111122223333:secret:secret-name"       
            ]
        }
    ]
}
```

------

### Amazon Redshift and Amazon Redshift Serverless (username & password auth) / Snowflake
<a name="datasources-connection-permissions-snowflake-redshift-user-password"></a>

**Note**  
We strongly recommend scoping down this policy to only the resources required.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetS3Metadata",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        },
        {
            "Sid": "GetGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:GetConnections",
                "glue:GetConnection"
            ],
            "Resource": [
                "arn:aws:glue:us-east-2:111122223333:catalog",
                "arn:aws:glue:us-east-2:111122223333:connection/*"
            ]
        },
        {
            "Sid": "GetSecrets",
            "Effect": "Allow",
            "Action": [                
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:us-east-2:111122223333:secret:secret-name"            
            ]
        }
    ]
}
```

------

### Amazon Redshift (IAM auth)
<a name="datasources-connection-permissions-redshift-iam"></a>

**Note**  
We strongly recommend scoping down this policy to only the resources required.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetS3Metadata",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        },
        {
            "Sid": "GetGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:GetConnections",
                "glue:GetConnection"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:connection/*",
                "arn:aws:glue:us-east-1:111122223333:connection/connection-name"
            ]
        },
        {
            "Sid": "GetSecrets",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name",
                "arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name-with-suffix"
            ]
        },
        {
            "Sid": "GetClusterCredentials",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-east-1:111122223333:cluster:cluster-name",
                "arn:aws:redshift:us-east-1:111122223333:dbuser:cluster-name/db-user-name"
            ]
        }
    ]
}
```

------

### Amazon Redshift serverless (IAM auth)
<a name="datasources-connection-permissions-redshift-serverless-iam"></a>

**Note**  
We strongly recommend scoping down this policy to only the resources required.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetS3Metadata",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        },
        {
            "Sid": "GetGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:GetConnections",
                "glue:GetConnection"
            ],
            "Resource": [
                "arn:aws:glue:us-east-2:111122223333:catalog",
                "arn:aws:glue:us-east-2:111122223333:connection/*"
            ]
        },
        {
            "Sid": "GetSecrets",
            "Effect": "Allow",
            "Action": [                
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:us-east-2:111122223333:secret:secret-name"         
            ]
        },
        {
            "Sid": "GetRedshiftServerlessCredentials",
            "Effect": "Allow",
            "Action": [
                "redshift-serverless:GetCredentials"
            ],
            "Resource": [
                "arn:aws:redshift-serverless:us-east-2:111122223333:namespace/namespace-id"           
            ]
        }
    ]
}
```

------

## User-defined connections required IAM permissions
<a name="user-defined-connections-permissions"></a>

The IAM policy permissions for a user can account for the presence of the `UserProfile` tag on AWS Glue connection resources.
+ **For viewing AWS Glue connections**:
  + Users can view all connections that do not have the `UserProfile` tag (created by an administrator). 
  + Users can view connections that have the `UserProfile` tag with the same value as their user profile name. 
  + Users cannot view connections that have the `UserProfile` tag with a different value than their user profile name. 
+ **For updating or deleting AWS Glue connections**:
  + Users can update or delete a connection that has the `UserProfile` tag with the same value as their user profile name. 
  + Users cannot update or delete a connection that has the `UserProfile` tag with a different value than their user profile name. 
  + Users cannot update or delete connections that do not have the `UserProfile` tag. 

To achieve this, administrators must grant the execution role used by the user profile's JupyterLab application additional permissions beyond their existing [admin-defined connections permissions](#admin-defined-connections-permissions). Specifically, in addition to the permissions required for accessing admin-defined AWS Glue connections, the following two additional IAM permissions must be granted to the user's execution role:
+ Permission to create AWS Glue connections and associate the `UserProfile` tag with the value of the user's profile name.
+ Permission to view, update, and delete AWS Glue connections that have the `UserProfile` tag matching the user's profile name.

This permission restricts access to AWS Glue connections based on a specific user profile tag value. Update the `UserProfile` tag value with the profile name of the user you want to target.

```
"Action": [
    "glue:GetConnection",
    "glue:GetConnections"    
],
"Resource": [
    "arn:aws:glue:region:account_id:connection/*"
],
"Condition": {
    "StringEqualsIfExists": {
        "aws:ResourceTag/UserProfile": "user_profile_name"
    }
}
```

This permission restricts the ability to create, update, and delete user-created connections to only the connections created by the user profile with the specified `UserProfile` tag value.

```
"Action": [
    "glue:DeleteConnection",
    "glue:UpdateConnection",
    "glue:CreateConnection",
    "glue:TagResource"
],
"Resource": [
    "arn:aws:glue:region:account_id:connection/*"
],
"Condition": {
    "StringEquals": {
        "aws:ResourceTag/UserProfile": "user_profile"
    }
}
```
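For reference, the two statement fragments above can be assembled into a single policy document. The following is a minimal sketch that assumes placeholder values for the Region, account ID, and user profile name; the `Sid` values are illustrative.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ViewUserTaggedGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:GetConnection",
                "glue:GetConnections"
            ],
            "Resource": "arn:aws:glue:region:account_id:connection/*",
            "Condition": {
                "StringEqualsIfExists": {
                    "aws:ResourceTag/UserProfile": "user_profile_name"
                }
            }
        },
        {
            "Sid": "ManageUserTaggedGlueConnections",
            "Effect": "Allow",
            "Action": [
                "glue:CreateConnection",
                "glue:UpdateConnection",
                "glue:DeleteConnection",
                "glue:TagResource"
            ],
            "Resource": "arn:aws:glue:region:account_id:connection/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/UserProfile": "user_profile_name"
                }
            }
        }
    ]
}
```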

## Fine-tune AWS resource access with granular ARN permissions
<a name="resource-access-control"></a>

For finer-grained control over access to your AWS resources, replace the wildcard resource `"Resource": ["*"]` in your policies with the specific Amazon Resource Names (ARNs) of only those resources that require access. Using the exact ARNs rather than a wildcard restricts access to the intended resources. 
+ **Use specific Amazon S3 bucket ARNs**

  For example `"arn:aws:s3:::bucket-name"` or ` "arn:aws:s3:::bucket-name/*"` for bucket-level or object-level operations.

  For information about all resource types in Amazon S3, see [Resource types defined by Amazon S3](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazons3.html#amazons3-resources-for-iam-policies).
+ **Use specific AWS Glue database ARNs**

  For example ` "arn:aws:glue:region:account-id:catalog"` or ` "arn:aws:glue:region:account-id:database/db-name"`. For information about all resource types in AWS Glue, see [Resource types defined by AWS Glue](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsglue.html#awsglue-resources-for-iam-policies).
+ **Use specific Athena workgroup ARNs**

  For example `"arn:aws:athena:region:account-id:workgroup/workgroup-name"`. For information about all resource types in Athena, see [Resource types defined by Athena](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonathena.html#amazonathena-resources-for-iam-policies).
+ **Use specific AWS Secrets Manager secret ARNs**

  For example `"arn:aws:secretsmanager:region:account-id:secret:secret-name"`. For information about all resource types in AWS Secrets Manager, see [Resource types defined by AWS Secrets Manager](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awssecretsmanager.html#awssecretsmanager-resources-for-iam-policies)
+ **Use specific Amazon Redshift cluster ARNs**

  For example `"arn:aws:redshift:region:account-id:cluster:cluster-name"`. For information about resource types in Amazon Redshift, see [Resource types defined by Amazon Redshift](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonredshift.html#amazonredshift-resources-for-iam-policies). For information about all resource types in Redshift Serverless, see [Resource types defined by Redshift Serverless](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonredshiftserverless.html#amazonredshiftserverless-resources-for-iam-policies).

# Frequently asked questions
<a name="sagemaker-sql-extension-faqs"></a>

The following FAQs answer common questions about the SQL extension in JupyterLab.

## Q: Where do I find the logs for the SQL extension?
<a name="sagemaker-sql-extension-faqs-0"></a>

A: The SQL extension writes its log in the general log file of your JupyterLab application in Studio. You can find those logs at `/var/log/apps/app_container.log`.

## Q: I am getting an error: “UsageError: Cell magic `%%sm_sql` not found.”
<a name="sagemaker-sql-extension-faqs-1"></a>

A: Create a new cell and load the extension again using `%load_ext amazon_sagemaker_sql_magic`.

## Q: How do I list the various parameters of my `%%sm_sql` command?
<a name="sagemaker-sql-extension-faqs-2"></a>

A: Use `%%sm_sql?` to get the help content of the command.

## Q: I cannot see the data discovery view on the right side panel.
<a name="sagemaker-sql-extension-faqs-3"></a>

A: Ensure that your space uses a SageMaker distribution image version 1.6 or higher. These SageMaker images come pre-installed with the extension. 

If you updated the image of your JupyterLab application space in Studio, refresh your browser.

## Q: The right panel does not accurately reflect the AWS Glue connections that are configured.
<a name="sagemaker-sql-extension-faqs-4"></a>

A: Try refreshing the right panel using the **Refresh** button in the bottom right corner of the SQL extension UI in your notebook.

## Q: SQL statements do not run as expected or run incorrectly.
<a name="sagemaker-sql-extension-faqs-5"></a>

A: Try clearing the cached connections by running the following magic command `%sm_sql_manage --clear-cached-connections`.

## Q: I am getting an error: "Actual statement count 2 did not match the desired statement count 1."
<a name="sagemaker-sql-extension-faqs-6"></a>

A: The SQL extension only supports running one SQL query at a time.

## Snowflake FAQs
<a name="sagemaker-sql-extension-faqs-snowflake"></a>

The following FAQs answer common questions for users of the SQL extension who use Snowflake as their data source.

### Q: I am getting an error: "No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command."
<a name="sagemaker-sql-extension-faqs-snowflake-1"></a>

A: This can happen if the default warehouse for a user is not selected. Run the command `USE WAREHOUSE warehouse_name` for each session.

### Q: I am getting an error: "object '*foo*' does not exist or not authorized."
<a name="sagemaker-sql-extension-faqs-snowflake-2"></a>

A: Ensure that your Snowflake user has access to the given object.

# Connection parameters
<a name="sagemaker-sql-extension-connection-properties"></a>

The following tables detail the supported Python properties for AWS Glue connections per data store.

## Amazon Redshift connection parameters
<a name="sagemaker-sql-extension-connection-properties-redshift"></a>

The following Python connection parameters are supported by AWS Glue connections to Amazon Redshift.


| Key | Type | Description | Constraints | Required | 
| --- | --- | --- | --- | --- | 
| auto\_create | Type: boolean | Indicates whether the user should be created if they do not exist. Defaults to false. | true, false | No | 
| aws\_secret\_arn | Type: string | The ARN of the secret used to retrieve the additional parameters for the connection. | Valid ARN | No | 
| cluster\_identifier | Type: string - maxLength: 63 | The cluster identifier of the Amazon Redshift cluster. | `^(?!.*--)[a-z][a-z0-9-]{0,61}[a-z0-9]$` | No | 
| database | Type: string - maxLength: 127 | The name of the database to connect to. |  | No | 
| database\_metadata\_current\_db\_only | Type: boolean | Indicates whether the application supports multi-database datashare catalogs. Defaults to true to indicate that the application does not support multi-database datashare catalogs, for backwards compatibility. | true, false | No | 
| db\_groups | Type: string | A comma-separated list of existing database group names that the db\_user joins for the current session. |  | No | 
| db\_user | Type: string | The user ID to use with Amazon Redshift. |  | No | 
| host | Type: string - maxLength: 256 | The hostname of the Amazon Redshift cluster. |  | No | 
| iam | Type: boolean | Flag to enable or disable IAM-based authentication for a connection. Defaults to false. | true, false | No | 
| iam\_disable\_cache | Type: boolean | This option specifies whether the IAM credentials are cached. Defaults to true. This improves performance when requests to the API gateway are throttled. | true, false | No | 
| max\_prepared\_statements | Type: integer | The maximum number of prepared statements that can be open at once. |  | No | 
| numeric\_to\_float | Type: boolean | Specifies whether NUMERIC datatype values are converted from decimal.Decimal to float. By default, NUMERIC values are received as decimal.Decimal Python objects. Enabling this option is not recommended for use cases that require the most precision, as results may be rounded. See the Python documentation on [decimal.Decimal objects](https://docs.python.org/3/library/decimal.html#decimal-objects) to understand the tradeoffs between decimal.Decimal and float before enabling this option. Defaults to false. | true, false | No | 
| port | Type: integer | The port number of the Amazon Redshift cluster. | Range 1150-65535 | No | 
| profile | Type: string - maxLength: 256 | The name of the profile containing the credentials and settings used by the AWS CLI. |  | No | 
| region | Type: string | The AWS Region where the cluster is located. | Valid AWS Region | No | 
| serverless\_acct\_id | Type: string - maxLength: 256 | The AWS account ID that is associated with the Amazon Redshift Serverless resource. |  | No | 
| serverless\_work\_group | Type: string - maxLength: 256 | The name of the workgroup for the Amazon Redshift Serverless endpoint. |  | No | 
| ssl | Type: boolean | true if SSL is enabled. | true, false | No | 
| ssl\_mode | Type: enum[verify-ca, verify-full, null] | The security of the connection to Amazon Redshift. verify-ca (SSL must be used and the server certificate must be verified) and verify-full (SSL must be used; the server certificate must be verified and the server hostname must match the hostname attribute on the certificate) are supported. For more information, see [Configuring security options for connections](https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html) in the Amazon Redshift documentation. Defaults to verify-ca. | verify-ca, verify-full | No | 
| timeout | Type: integer | The number of seconds before the connection to the server times out. | 0 | No | 
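To make a few of these constraints concrete, the following Python sketch performs a client-side sanity check of a hypothetical Amazon Redshift parameter set. The validation helper and the sample values are illustrative; they are not part of the SQL extension.

```python
def validate_redshift_params(params: dict) -> list:
    """Return a list of constraint violations (empty if the parameters look valid)."""
    errors = []
    port = params.get("port")
    if port is not None and not (1150 <= port <= 65535):
        errors.append("port must be in the range 1150-65535")
    if params.get("ssl_mode") not in (None, "verify-ca", "verify-full"):
        errors.append("ssl_mode must be one of: verify-ca, verify-full")
    for key in ("iam", "ssl", "auto_create"):
        if key in params and not isinstance(params[key], bool):
            errors.append(f"{key} must be a boolean")
    return errors

# A hypothetical parameter set for an SSL connection to a Redshift cluster.
params = {
    "host": "examplecluster.abc123.us-east-2.redshift.amazonaws.com",
    "port": 5439,
    "database": "dev",
    "iam": True,
    "ssl": True,
    "ssl_mode": "verify-ca",
}
print(validate_redshift_params(params))  # prints []
```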

## Athena connection parameters
<a name="sagemaker-sql-extension-connection-properties-athena"></a>

The following Python connection parameters are supported by AWS Glue connections to Athena.


| Key | Type | Description | Constraints | Required | 
| --- | --- | --- | --- | --- | 
| aws\_access\_key\_id | Type: string - maxLength: 256 | Specifies an AWS access key associated with an IAM account. We recommend storing this information in the aws\_secret. | Length 16-128 | No | 
| aws\_secret\_access\_key | Type: string - maxLength: 256 | The secret part of an AWS access key. We recommend storing this information in the aws\_secret. |  | No | 
| aws\_secret\_arn | Type: string | The ARN of the secret used to retrieve the additional parameters for the connection. | Valid ARN | No | 
| catalog\_name | Type: string - maxLength: 256 | The catalog that contains the databases and the tables that are accessed with the driver. For information about catalogs, see [DataCatalog](https://docs.aws.amazon.com/athena/latest/APIReference/API_DataCatalog.html). |  | No | 
| duration\_seconds | Type: number | The duration, in seconds, of the role session. This setting can have a value from 1 hour to 12 hours. By default, the duration is set to 3600 seconds (1 hour). | Range from 900 seconds (15 minutes) up to the maximum session duration setting for the role | No | 
| encryption\_option | Type: enum[SSE\_S3, SSE\_KMS, CSE\_KMS, null] | Encryption at rest for Amazon S3. See the encryption at rest section in the [Athena guide](https://docs.aws.amazon.com/athena/latest/ug/encryption.html). | SSE\_S3, SSE\_KMS, CSE\_KMS | No | 
| kms\_key | Type: string - maxLength: 256 | The AWS KMS key to use when CSE\_KMS is set in encryption\_option. |  | No | 
| poll\_interval | Type: number | The interval, in seconds, to poll the status of query results in Athena. |  | No | 
| profile\_name | Type: string - maxLength: 256 | The name of the AWS configuration profile whose credentials should be used to authenticate the request to Athena. |  | No | 
| region\_name | Type: string | The AWS Region where queries are run. | Valid AWS Region | No | 
| result\_reuse\_enable | Type: boolean | Enables the reuse of previous query results. | true, false | No | 
| result\_reuse\_minutes | Type: integer | Specifies, in minutes, the maximum age of a previous query result that Athena should consider for reuse. The default is 60. | >= 1 | No | 
| role\_arn | Type: string | The role to be used for running queries. | Valid ARN | No | 
| schema\_name | Type: string - maxLength: 256 | The name of the default schema to use for the database. |  | No | 
| s3\_staging\_dir | Type: string - maxLength: 1024 | The location in Amazon S3 where the query results are stored. |  | Either s3\_staging\_dir or work\_group is required | 
| work\_group | Type: string | The workgroup in which queries will run. For information about workgroups, see [WorkGroup](https://docs.aws.amazon.com/athena/latest/APIReference/API_WorkGroup.html). | `^[a-zA-Z0-9._-]{1,128}$` | Either s3\_staging\_dir or work\_group is required | 
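The one cross-field requirement in this table, that at least one of s3\_staging\_dir or work\_group must be provided, can be sketched as a small pre-flight check. The helper below is illustrative, not part of the SQL extension.

```python
def check_athena_params(params: dict) -> None:
    """Raise ValueError unless at least one of s3_staging_dir or work_group is set."""
    if not (params.get("s3_staging_dir") or params.get("work_group")):
        raise ValueError("Either s3_staging_dir or work_group is required")

# Valid: a workgroup is provided.
check_athena_params({"work_group": "primary", "region_name": "us-east-2"})

# Invalid: neither field is present, so this call would raise ValueError.
# check_athena_params({"region_name": "us-east-2"})
```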

## Snowflake connection parameters
<a name="sagemaker-sql-extension-connection-properties-snowflake"></a>

The following Python connection parameters are supported by AWS Glue connections to Snowflake.


| Key | Type | Description | Constraints | Required | 
| --- | --- | --- | --- | --- | 
| account | Type: string - maxLength: 256 | The Snowflake account identifier. The account identifier does not include the snowflakecomputing.com suffix. |  | Yes | 
| arrow\_number\_to\_decimal | Type: boolean | False by default, which means that NUMBER column values are returned as double-precision floating point numbers (float64). Set this to true to return DECIMAL column values as decimal numbers (decimal.Decimal) when calling the fetch\_pandas\_all() and fetch\_pandas\_batches() methods. | true, false | No | 
| autocommit | Type: boolean | Defaults to false, which honors the Snowflake parameter AUTOCOMMIT. Set to true or false to enable or disable the autocommit mode in the session, respectively. | true, false | No | 
| aws\_secret\_arn | Type: string | The ARN of the secret used to retrieve the additional parameters for the connection. | Valid ARN | No | 
| client\_prefetch\_threads | Type: integer | The number of threads used to download the result sets (4 by default). Increasing the value improves fetch performance but requires more memory. |  | No | 
| database | Type: string - maxLength: 256 | The name of the default database to use. |  | No | 
| login\_timeout | Type: integer | The timeout, in seconds, for the login request. Defaults to 60 seconds. The login request gives up after the timeout length if the HTTP response is not successful. |  | No | 
| network\_timeout | Type: integer | The timeout, in seconds, for all other operations. Defaults to none (infinite). A general request gives up after the timeout length if the HTTP response is not successful. |  | No | 
| paramstyle | Type: string - maxLength: 256 | The placeholder syntax used for parameter substitution when executing SQL queries from Python code. Defaults to pyformat for client-side binding. Specify qmark or numeric to change the bind variable format for server-side binding. |  | No | 
| role | Type: string - maxLength: 256 | The name of the default role to use. |  | No | 
| schema | Type: string - maxLength: 256 | The name of the default schema to use for the database. |  | No | 
| timezone | Type: string - maxLength: 128 | None by default, which honors the Snowflake parameter TIMEZONE. Set to a valid time zone (such as America/Los\_Angeles) to set the session time zone. | A time zone in a format similar to America/Los\_Angeles | No | 
| validate\_default\_parameters | Type: boolean | Set to true to raise an exception if the specified database, schema, or warehouse doesn't exist. Defaults to false. | true, false | No | 
| warehouse | Type: string - maxLength: 256 | The name of the default warehouse to use. |  | No | 

# Data preparation at scale using Amazon EMR Serverless applications or Amazon EMR clusters in Studio
<a name="studio-emr-data-preparation"></a>

Amazon SageMaker Studio and its legacy version, Studio Classic, provide data scientists and machine learning (ML) engineers with tools to perform data analytics and data preparation at scale. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Both Studio and Studio Classic come with built-in integration with Amazon EMR, allowing users to manage large-scale, interactive data preparation and machine learning workflows within their JupyterLab notebooks.

[Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html) is a managed big data platform with resources to help you run petabyte-scale distributed data processing jobs on AWS using open-source analytics frameworks such as [Apache Spark](https://aws.amazon.com/emr/features/spark), [Apache Hive](https://aws.amazon.com/emr/features/hive), [Presto](https://aws.amazon.com/emr/features/presto), HBase, and Flink, among others. With the Studio and Studio Classic integration with Amazon EMR, you can create, browse, discover, and connect to Amazon EMR clusters without leaving your JupyterLab or Studio Classic notebooks. You can additionally monitor and debug your Spark workloads by accessing the Spark UI directly from your notebook with one click.

You should consider Amazon EMR clusters for your data preparation workloads if you have large-scale, long-running, or complex data processing requirements that involve massive amounts of data, require extensive customization and integration with other services, need to run custom applications, or plan to run a diverse range of distributed data processing frameworks beyond just Apache Spark. 

Using [SageMaker distribution image](sagemaker-distribution.md) `1.10` or higher, you can alternatively connect to interactive [EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) applications directly from your JupyterLab notebooks in SageMaker AI Studio. The integration of Studio with EMR Serverless allows you to run open-source big data analytics frameworks such as [Apache Spark](https://aws.amazon.com/emr/features/spark) and [Apache Hive](https://aws.amazon.com/emr/features/hive) without configuring, managing, or scaling Amazon EMR clusters. EMR Serverless automatically provisions and manages the underlying compute and memory resources based on your EMR Serverless application's needs. It scales resources up and down dynamically, charging you for the amount of vCPU, memory, and storage resources consumed by your applications. This serverless approach allows you to [run interactive data preparation workloads](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interactive-workloads.html) from your JupyterLab notebooks without worrying about cluster management, while achieving high instance utilization and cost efficiency.

You should consider EMR Serverless for your interactive data preparation workloads if your workloads are short-lived or intermittent and don't require a persistent cluster; you prefer a serverless experience with automatic resource provisioning and termination, avoiding the overhead of managing infrastructure; or your interactive data preparation tasks primarily revolve around Apache Spark. 

**Topics**
+ [Configure network access for your Amazon EMR cluster](studio-notebooks-emr-networking.md)
+ [Prepare data using EMR Serverless](studio-notebooks-emr-serverless.md)
+ [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md)

# Configure network access for your Amazon EMR cluster
<a name="studio-notebooks-emr-networking"></a>

Before you get started with using Amazon EMR or EMR Serverless for your data preparation tasks in Studio, ensure that you or your administrator have configured your network to allow communication between Studio and Amazon EMR. Once this communication is enabled, you can choose to:
+ [Prepare data using EMR Serverless](studio-notebooks-emr-serverless.md)
+ [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md)

**Note**  
For EMR Serverless users, the simplest setup involves creating your application in the Studio UI without modifying the default settings for the **Virtual private cloud (VPC)** option. This approach allows the application to be created within your SageMaker domain's VPC, eliminating the need for additional networking configuration. If you choose this option, you can skip the following networking setup section. 

The networking instructions vary based on whether Studio and Amazon EMR are deployed within a private [Amazon Virtual Private Cloud](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) (VPC) or communicate over the internet.

By default, Studio and Studio Classic run in an AWS managed VPC with [internet access](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html#studio-notebooks-and-internet-access-default). When using an internet connection, Studio and Studio Classic access AWS resources, such as Amazon S3 buckets, over the internet. However, if you have security requirements to control access to your data and job containers, we recommend that you configure Studio or Studio Classic and Amazon EMR so that your data and containers aren't accessible over the internet. To control access to your resources or run Studio or Studio Classic without public internet access, you can specify the `VPC only` network access type when you onboard to [Amazon SageMaker AI domain](gs-studio-onboard.md). In this scenario, both Studio and Studio Classic establish connections with other AWS services via private [VPC endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html). For information about configuring Studio or Studio Classic in `VPC only` mode, see [Connect SageMaker Studio or Studio Classic notebooks in a VPC to external resources](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html#studio-notebooks-and-internet-access-vpc-only).

The first two sections describe how to ensure communication between Studio or Studio Classic and Amazon EMR in VPCs without public internet access. The last section covers how to ensure communication between Studio or Studio Classic and Amazon EMR using an internet connection. Prior to connecting Studio or Studio Classic and Amazon EMR without internet access, make sure to establish endpoints for Amazon Simple Storage Service (data storage), Amazon CloudWatch (logging and monitoring), and Amazon SageMaker Runtime (fine-grained role-based access control (RBAC)).

To connect Studio or Studio Classic and Amazon EMR:
+ If Studio or Studio Classic and Amazon EMR are in separate VPCs, either in the same AWS account or in different accounts, see [Studio and Amazon EMR are in separate VPCs](#studio-notebooks-emr-networking-requirements-cross-vpc).
+ If Studio or Studio Classic and Amazon EMR are in the same VPC, see [Studio and Amazon EMR are in the same VPC](#studio-notebooks-emr-networking-requirements-same-vpc).
+ If you chose to connect Studio or Studio Classic and Amazon EMR over public internet, see [Studio and Amazon EMR communicate over public internet](#studio-notebooks-emr-networking-requirements-internet).

## Studio and Amazon EMR are in separate VPCs
<a name="studio-notebooks-emr-networking-requirements-cross-vpc"></a>

To allow communication between Studio or Studio Classic and Amazon EMR when they are deployed in separate VPCs:

1. Start by connecting your VPCs through a VPC peering connection.

1. Update your routing tables in each VPC to route the network traffic between Studio or Studio Classic subnets and Amazon EMR subnets both ways.

1. Configure your security groups to allow inbound and outbound traffic.

The steps to connect Studio or Studio Classic and Amazon EMR are the same whether the resources are deployed in a single AWS account (Single account use case) or across multiple AWS accounts (Cross-account use case).

1. **VPC peering**

   Create a [VPC peering connection](https://docs.aws.amazon.com/vpc/latest/peering/working-with-vpc-peering.html) to facilitate the networking between the two VPCs (Studio or Studio Classic and Amazon EMR).

   1. From your Studio or Studio Classic account, on the VPC dashboard, choose **Peering connections**, then **Create peering connection**.

   1. Create your request to peer the Studio or Studio Classic VPC with the Amazon EMR VPC. When requesting peering in another AWS account, choose **Another account** in **Select another VPC to peer with**.

      For cross-account peering, the administrator must accept the request from the Amazon EMR account.

      When peering private subnets, you should enable private IP DNS resolution at the VPC peering connection level.

1. **Routing tables**

   Send the network traffic between Studio or Studio Classic subnets and Amazon EMR subnets both ways.

   After you establish the peering connection, the administrator (on each account for cross-account access) can add routes to the private subnet route tables to route the traffic between Studio or Studio Classic and the Amazon EMR subnets. You can define those routes by going to the **Route Tables** section of each VPC in the VPC dashboard.

   The following illustration of the route table of a Studio VPC subnet shows an example of an outbound route from the Studio account to the Amazon EMR VPC IP range (here `2.0.1.0/24`) through the peering connection.  
![\[Route table of a Studio VPC subnet showing the outbound routes from the Studio account to the Amazon EMR VPC IP range (here 2.0.1.0/24) through the peering connection\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-return-route.png)

   The following illustration of a route table of an Amazon EMR VPC subnet shows an example of return routes from the Amazon EMR VPC to Studio VPC IP range (here `10.0.20.0/24`) through the peering connection.  
![\[Route table of an Amazon EMR VPC subnet showing the return routes from the Amazon EMR account to the Studio VPC IP range (here 10.0.20.0/24) through the peering connection\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-outbound-route.png)

1. **Security groups**

   Lastly, the security group of your Studio or Studio Classic domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound traffic on the *Apache Livy*, *Hive*, or *Presto* TCP ports (respectively `8998`, `10000`, and `8889`) from the Studio or Studio Classic instance security group. [Apache Livy](https://livy.apache.org/) is a service that enables interaction with Amazon EMR over a REST interface.
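Once the security groups are configured, you can sanity-check from a notebook that the Amazon EMR primary node accepts TCP traffic on the expected port. The following Python sketch tests basic reachability; the hostname in the commented example is a placeholder.

```python
import socket

# Default TCP ports discussed above: Apache Livy, Hive, and Presto.
LIVY_PORT, HIVE_PORT, PRESTO_PORT = 8998, 10000, 8889

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example with a placeholder private DNS name for the EMR primary node:
# port_reachable("ip-10-0-1-23.ec2.internal", LIVY_PORT)
```

If the check returns `False`, revisit the route tables and security group rules described in the preceding steps.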

The following diagram shows an example of an Amazon VPC setup that enables JupyterLab or Studio Classic notebooks to provision Amazon EMR clusters from CloudFormation templates in the Service Catalog and then connect to an Amazon EMR cluster within the same AWS account. The diagram provides an additional illustration of the required endpoints for a direct connection to various AWS services, such as Amazon S3 or Amazon CloudWatch, when the VPCs have no internet access. Alternatively, a [NAT gateway](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html#nat-gateway-working-with) must be used to allow instances in private subnets of multiple VPCs to share a single public IP address provided by the [internet gateway](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html) when accessing the internet.

![\[Architectural diagram illustrating an example of a simple Amazon VPC setup that enables Studio or Studio Classic notebooks to provision Amazon EMR clusters from CloudFormation templates in the Service Catalog and then connect to an Amazon EMR cluster within the same AWS account. The diagram provides an additional illustration of the required endpoints for a direct connection to various AWS services, such as Amazon S3 or Amazon CloudWatch, when the VPCs have no internet access. Alternatively, a NAT gateway must be used to allow instances in private subnets of multiple VPCs to share a single public IP address provided by the internet gateway when accessing the internet.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-architecture-singleaccount-vpcendpoints.png)


## Studio and Amazon EMR are in the same VPC
<a name="studio-notebooks-emr-networking-requirements-same-vpc"></a>

If Studio or Studio Classic and Amazon EMR are in different subnets, add routes to each private subnet route table to route the traffic between the Studio or Studio Classic and Amazon EMR subnets. You can define those routes by going to the **Route Tables** section of each VPC in the VPC dashboard. If you deployed Studio or Studio Classic and Amazon EMR in the same VPC and the same subnet, you do not need to route traffic between Studio and Amazon EMR.

Whether or not you needed to update your routing tables, the security group of your Studio or Studio Classic domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound traffic on the *Apache Livy*, *Hive*, or *Presto* TCP ports (respectively `8998`, `10000`, and `8889`) from the Studio or Studio Classic instance security group. [Apache Livy](https://livy.apache.org/) is a service that enables interaction with Amazon EMR over a REST interface.

## Studio and Amazon EMR communicate over public internet
<a name="studio-notebooks-emr-networking-requirements-internet"></a>

By default, Studio and Studio Classic provide a network interface that allows communication with the internet through an internet gateway in the VPC associated with the SageMaker domain. If you choose to connect to Amazon EMR through the public internet, Amazon EMR needs to accept inbound traffic on the *Apache Livy*, *Hive*, or *Presto* TCP ports (respectively `8998`, `10000`, and `8889`) from its internet gateway. [Apache Livy](https://livy.apache.org/) is a service that enables interaction with Amazon EMR over a REST interface.

Keep in mind that any port on which you allow inbound traffic represents a potential security vulnerability. Carefully review custom security groups to ensure that you minimize vulnerabilities. For more information, see [Control network traffic with security groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security-groups.html).

Alternatively, see [Blogs and whitepapers](studio-notebooks-emr-resources.md) for a detailed walkthrough of how to enable [Kerberos on Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos.html), set the cluster in a private subnet, and access the cluster using a [Network Load Balancer (NLB)](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html) to expose only specific ports, which are access-controlled via security groups.

**Note**  
When connecting to your Apache Livy endpoint through the public internet, we recommend that you secure communications between Studio or Studio Classic and your Amazon EMR cluster using TLS.  
For information on setting up HTTPS with Apache Livy, see [Enabling HTTPS with Apache Livy](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/enabling-https.html). For information on setting an Amazon EMR cluster with transit encryption enabled, see [Providing certificates for encrypting data in transit with Amazon EMR encryption](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-encryption-certificates). Additionally, you need to configure Studio or Studio Classic to access your certificate key as specified in [Connect to an Amazon EMR cluster over HTTPS](connect-emr-clusters.md#connect-emr-clusters-ssl).

# Prepare data using EMR Serverless
<a name="studio-notebooks-emr-serverless"></a>

Beginning with [SageMaker distribution image](sagemaker-distribution.md) version `1.10`, Amazon SageMaker Studio integrates with EMR Serverless. Within JupyterLab notebooks in SageMaker Studio, data scientists and data engineers can discover and connect to EMR Serverless applications, then interactively explore, visualize, and prepare data at scale using Apache Spark or Apache Hive. This integration allows you to perform interactive data preprocessing at scale in preparation for ML model training and deployment.

Specifically, the updated version of the [sagemaker-studio-analytics-extension](https://pypi.org/project/sagemaker-studio-analytics-extension/) library in [SageMaker AI distribution](https://github.com/aws/sagemaker-distribution/tree/main/build_artifacts/v1) image version `1.10` leverages the integration between Apache Livy and EMR Serverless, allowing the connection to an Apache Livy endpoint through JupyterLab notebooks. This section assumes prior knowledge of [EMR Serverless interactive applications](https://docs.aws.amazon.com/EMR-Serverless-UserGuide/interactive-workloads.html).
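In practice, the extension is loaded and invoked through notebook magics in a JupyterLab cell. The following sketch shows the general shape of the connection command; the application ID and role ARN are placeholders, and the exact flag names may vary by extension version, so treat this as illustrative rather than authoritative.

```
%load_ext sagemaker_studio_analytics_extension.magics

%sm_analytics emr-serverless connect \
    --application-id <EMR-Serverless-application-ID> \
    --language python \
    --emr-execution-role-arn arn:aws:iam::<accountID>:role/<EMRServerlessRuntimeExecutionRole>
```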

**Important**  
When using Studio, you can only discover and connect to EMR Serverless applications from JupyterLab applications that are launched from private spaces. Ensure that the EMR Serverless applications are located in the same AWS Region as your Studio environment.

## Prerequisites
<a name="studio-set-up-emr-serverless-prerequisites"></a>

Before you get started running interactive workloads with EMR Serverless from your JupyterLab notebooks, make sure you meet the following prerequisites:

1. Your JupyterLab space must use a SageMaker Distribution image version `1.10` or higher.

1. Create an EMR Serverless interactive application with Amazon EMR version `6.14.0` or higher. You can create an EMR Serverless application from the Studio user interface by following the steps in [Create EMR Serverless applications from Studio](create-emr-serverless-application.md).
**Note**  
For the simplest setup, you can create your EMR Serverless application in the Studio UI without changing any default settings for the **Virtual private cloud (VPC)** option. This allows the application to be created within your domain VPC without requiring any networking configuration. In this case, you can skip the following networking setup step.

1. Review the networking and security requirements in [Configure network access for your Amazon EMR cluster](studio-notebooks-emr-networking.md). Specifically, ensure that you:
   + Establish a VPC peering connection between your Studio account and your EMR Serverless account.
   + Add routes to the private subnet route tables in both accounts. 
   + Set up the security group attached to your Studio domain to allow outbound traffic, and configure the security group of the VPC where you plan to run the EMR Serverless applications to allow inbound TCP traffic from the Studio instance's security group.

1. To access your interactive applications on EMR Serverless and run workloads submitted from your JupyterLab notebooks in SageMaker Studio, you must assign specific permissions and roles. Refer to the [Set up the permissions to enable listing and launching Amazon EMR applications from SageMaker Studio](studio-emr-serverless-permissions.md) section for details on the necessary roles and permissions.

**Topics**
+ [Prerequisites](#studio-set-up-emr-serverless-prerequisites)
+ [Set up the permissions to enable listing and launching Amazon EMR applications from SageMaker Studio](studio-emr-serverless-permissions.md)
+ [Create EMR Serverless applications from Studio](create-emr-serverless-application.md)
+ [Connect to an EMR Serverless application from Studio](connect-emr-serverless-application.md)
+ [Stop or delete an EMR Serverless application from the Studio UI](terminate-emr-serverless-application.md)

# Set up the permissions to enable listing and launching Amazon EMR applications from SageMaker Studio
<a name="studio-emr-serverless-permissions"></a>

In this section, we detail the roles and permissions required to list and connect to EMR Serverless applications from SageMaker Studio, considering scenarios where Studio and the EMR Serverless applications are deployed in the same AWS account or across different accounts.

The roles to which you must add the necessary permissions depend on whether Studio and your EMR Serverless applications reside in the same AWS account (*Single Account*) or in separate accounts (*Cross Account*). There are two types of roles involved:
+ Execution roles:
  + [Runtime execution roles](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-defaults-executionRoleArn) (Role-Based Access Control roles) used by EMR Serverless: These are the IAM roles that the EMR Serverless job execution environments use to access other AWS services and resources needed during runtime, such as Amazon S3 for data access, CloudWatch for logging, the AWS Glue Data Catalog, or other services based on your workload requirements. We recommend creating these roles in the account where the EMR Serverless applications are running.

    To learn more about runtime roles, see [Job runtime roles](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-iam-runtime-role.html) in the *EMR Serverless User Guide*.
**Note**  
You can define several RBAC roles for your EMR Serverless application. These roles can be based on the responsibilities and access levels needed by different users or groups within your organization. For more information about RBAC permissions, see [Security best practices for Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-best-practices.html#security-practice-rbac).
  + SageMaker AI execution role: The execution role that allows SageMaker AI to perform tasks such as reading data from Amazon S3 buckets, writing logs to CloudWatch, and accessing other AWS services that your workflow might need. The SageMaker AI execution role also has the `iam:PassRole` permission, which allows SageMaker AI to pass runtime execution roles to the EMR Serverless applications. These roles give the EMR Serverless applications the permissions they need to interact with other AWS resources while they are running.
+ Assumable roles (Also referred to as *Service Access Roles*):
  + These are the IAM roles that SageMaker AI's execution role can assume to perform operations related to managing EMR Serverless applications. These roles define the permissions and access policies required when listing, connecting to, or managing EMR Serverless applications. They are typically used in cross-account scenarios, where the EMR Serverless applications are located in a different AWS account than the SageMaker AI domain. Having a dedicated IAM role for your EMR Serverless applications helps to follow the principle of least privilege and ensures that Amazon EMR has only the required permissions to run your jobs while protecting other resources in your AWS account.

By understanding and configuring these roles correctly, you can ensure that SageMaker Studio has the necessary permissions to interact with EMR Serverless applications, regardless of whether they are deployed in the same account or across different accounts.
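The `iam:PassRole` relationship described above can be sketched as an IAM policy statement. This is a minimal illustration, assuming the standard EMR Serverless service principal name; the account ID and role names are placeholders, and the authoritative policy document is in the [Reference policies](#studio-set-up-emr-serverless-permissions-reference) section.

```python
def pass_role_statement(account_id, runtime_role_names):
    """Sketch of the iam:PassRole statement the SageMaker AI execution role
    needs: it may pass only the listed runtime roles, and only to the
    EMR Serverless service (role names are illustrative placeholders)."""
    return {
        "Sid": "EMRServerlessPassRole",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": [
            f"arn:aws:iam::{account_id}:role/{name}"
            for name in runtime_role_names
        ],
        "Condition": {
            "StringLike": {"iam:PassedToService": "emr-serverless.amazonaws.com"}
        },
    }
```

The `iam:PassedToService` condition is what limits the hand-off to EMR Serverless, so the same execution role cannot pass these runtime roles to an unrelated service.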

## Single account
<a name="studio-set-up-emr-serverless-permissions-singleaccount"></a>

The following diagrams illustrate the roles and permissions required to list and connect to EMR Serverless applications from Studio when Studio and the applications are deployed in the same AWS account.

![\[The diagram shows roles and permissions needed to list and connect EMR Serverless applications from Studio when Studio and the applications are in the same AWS account.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-serverless-permissions-setup-singleaccount.png)


If your Amazon EMR applications and Studio are deployed in the same AWS account, follow these steps:

1. **Step 1**: Retrieve the ARN of the Amazon S3 bucket you use for data sources and output data storage in the [Amazon S3 console](https://console.aws.amazon.com/S3).

   To learn about how to find a bucket by name, see [Accessing and listing an Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html). For information on how to create an Amazon S3 bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html). 

1. **Step 2**: Create at least one job runtime execution role for your EMR Serverless application in your account (the `EMRServerlessRuntimeExecutionRoleA` in the *Single account* use case diagram above). Choose **Custom trust policy** as the trusted entity. Add the permissions required by your job. At a minimum, you need full access to an Amazon S3 bucket, and create and read access to the AWS Glue Data Catalog.

   For detailed instructions about how to create a new runtime execution role for your EMR Serverless applications, follow these steps:

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. In the left navigation pane, choose **Policies**, and then **Create policy**.

   1. Add the permissions required by your runtime role, name the policy, and then choose **Create policy**.

      You can refer to [Job runtime roles for EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-iam-runtime-role.html) to find sample runtime policies for an EMR Serverless runtime role.

   1. In the left navigation pane, choose **Roles** and then **Create role**.

   1. On the **Create role** page, choose **Custom trust policy** as the trusted entity.

   1. Paste in the following JSON document in the **Custom trust policy** section and then choose **Next**.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "emr-serverless.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

   1. On the **Add permissions** page, add the policy you created and then choose **Next**.

   1. On the **Review** page, enter a name for the role such as `EMRServerlessAppRuntimeRoleA` and an optional description.

   1. Review the role details and choose **Create role**.

   With these roles, you and your teammates can connect to the same application, each using a runtime role scoped with permissions matching your individual level of access to data.
**Note**  
Spark sessions operate differently: they are isolated based on the execution role used from Studio, so users with different execution roles have separate, isolated Spark sessions. Additionally, if you have enabled source identity for your domain, Spark sessions are further isolated across different source identities.

1. **Step 3**: Retrieve the ARN of the SageMaker AI execution role used by your private space.

   For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

   For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
**Note**  
 Alternatively, users new to SageMaker AI can simplify their setup process by automatically creating a new SageMaker AI execution role with the appropriate permissions. In this case, skip steps 3 and 4. Instead, users can either:  
Choose the **Set up for organizations** option when creating a new domain from the **Domain** menu in the left navigation of the [ SageMaker AI console](https://console.aws.amazon.com/sagemaker).
Create a new execution role from the **Role manager** menu of the console, and then attach the role to an existing domain or user profile.
When creating the role, choose the **Run Studio EMR Serverless Applications** option in **What ML activities will users perform?** Then, provide the name of your Amazon S3 bucket and the job runtime execution role you want your EMR Serverless application to use (step 2).  
The SageMaker Role Manager automatically adds the necessary permissions for running and connecting to EMR Serverless applications to the new execution role. Using the SageMaker Role Manager, you can only assign one runtime role to your EMR Serverless application, and the application must run in the same account where Studio is deployed, using a runtime role created within that same account.

1. **Step 4**: Attach the following permissions to the SageMaker AI execution role accessing your EMR Serverless application.

   1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. Choose **Roles** and then search for your execution role by name in the **Search** field. The role name is the last part of the ARN, after the last forward slash (/).

   1. Follow the link to your role.

   1. Choose **Add permissions** and then **Create inline policy**.

   1. In the **JSON** tab, add the Amazon EMR Serverless permissions allowing EMR Serverless access and operations. For details on the policy document, see *EMR Serverless policies* in [Reference policies](#studio-set-up-emr-serverless-permissions-reference). Replace the *region*, *accountID*, and passed *EMRServerlessAppRuntimeRole*(s) with their actual values before copying the list of statements to the inline policy of your role. 
**Note**  
You can include as many ARN strings of runtime roles as needed within the permission, separating them with commas.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the **Create inline policy** step to add another inline policy granting the role permissions to update the domains, user profiles, and spaces. For details on the `SageMakerUpdateResourcesPolicy` policy document, see *Domain, user profile, and space update actions policy* in [Reference policies](#studio-set-up-emr-serverless-permissions-reference). Replace the *region* and *accountID* with their actual values before copying the list of statements to the inline policy of your role.

1. **Step 5**:

   Associate the list of runtime roles with your user profile or domain so you can visually browse the list of roles and select the one to use when [connecting to an EMR Serverless application](connect-emr-serverless-application.md) from JupyterLab. You can use the SageMaker AI console or the following script. Subsequently, all your Apache Spark or Apache Hive jobs created from your notebook will access only the data and resources permitted by the policies attached to the selected runtime role.
**Important**  
Failure to complete this step will prevent you from connecting a JupyterLab notebook to an EMR Serverless application.

------
#### [ SageMaker AI console ]

   To associate your runtime roles with your user profile or domain using the SageMaker AI console:

   1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation pane, choose **Domains**, and then select the domain using the SageMaker AI execution role whose permissions you updated.

   1. 
      + To add your runtime roles to your domain: In the **App Configurations** tab of the **Domain details** page, navigate to the **JupyterLab** section.
      + To add your runtime roles to your user profile: On the **Domain details** page, choose the **User profiles** tab, then select the user profile using the SageMaker AI execution role whose permissions you updated. In the **App Configurations** tab, navigate to the **JupyterLab** section.

   1. Choose **Edit** and add the ARNs of your EMR Serverless runtime execution roles.

   1. Choose **Submit**.

   When you next connect to an EMR Serverless application via JupyterLab, the runtime roles should appear in a drop-down menu for selection.

------
#### [ Python script ]

   In a JupyterLab application started from a private space using the SageMaker AI execution role whose permissions you updated, run the following command in a terminal. Replace the `domainID`, `user-profile-name`, `studio-accountID`, and `EMRServerlessRuntimeExecutionRole`(s) with their proper values. This code snippet updates the user profile settings for a specific user profile (`client.update_user_profile`) or domain settings (`client.update_domain`), specifically associating the EMR Serverless runtime execution roles you previously created.

   ```
   import botocore.session
   import json
   sess = botocore.session.get_session()
   client = sess.create_client('sagemaker')
   
   client.update_user_profile(
       DomainId="domainID",
       UserProfileName="user-profile-name",
       DefaultUserSettings={
           'JupyterLabAppSettings': {
               'EmrSettings': {
                   'ExecutionRoleArns': [
                       "arn:aws:iam::studio-accountID:role/EMRServerlessRuntimeExecutionRoleA",
                       "arn:aws:iam::studio-accountID:role/AnotherRuntimeExecutionRole"
                   ]
               }
           }
       })
   resp = client.describe_domain(DomainId="domainID")
   
   resp['CreationTime'] = str(resp['CreationTime'])
   resp['LastModifiedTime'] = str(resp['LastModifiedTime'])
   print(json.dumps(resp, indent=2))
   ```

------
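The role setup in Steps 2 and 4 above can also be scripted. The following is a minimal sketch, assuming the standard EMR Serverless service principal name; the role name in the usage comment is taken from this guide but remains a placeholder for your own naming.

```python
import json

def emr_serverless_trust_policy():
    """Trust policy from Step 2: lets the EMR Serverless service assume
    the runtime role (standard service principal shown)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "emr-serverless.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

def role_name_from_arn(arn):
    """Step 4 tip: the role name is the last ARN segment, after the final '/'."""
    return arn.rsplit("/", 1)[-1]

# Hypothetical usage with a boto3/botocore IAM client:
#   iam.create_role(
#       RoleName="EMRServerlessAppRuntimeRoleA",
#       AssumeRolePolicyDocument=json.dumps(emr_serverless_trust_policy()))
```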

## Cross account
<a name="studio-set-up-emr-serverless-permissions-crossaccount"></a>

The following diagrams illustrate the roles and permissions required to list and connect to EMR Serverless applications from Studio when Studio and the applications are deployed in different AWS accounts.

![\[The diagram shows roles and permissions needed to list and connect EMR Serverless applications from Studio when Studio and the applications are in different AWS accounts.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-serverless-permissions-setup-crossaccount.png)


For more information about creating a role in an AWS account, see [Creating an IAM role (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html).

Before you get started: 
+ Retrieve the ARN of the SageMaker AI execution role used by your private space. For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md). For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
+ Retrieve the ARN of the Amazon S3 bucket you will use for data sources and output data storage in the [Amazon S3 console](https://console.aws.amazon.com/S3).

  For information on how to create an Amazon S3 bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html). To learn about how to find a bucket by name, see [Accessing and listing an Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html).

If your EMR Serverless applications and Studio are deployed in separate AWS accounts, you configure the permissions on both accounts. 

### On the EMR Serverless account
<a name="studio-set-up-emr-serverless-permissions-crossaccount-emraccount"></a>

Follow these steps to create the necessary roles and policies on the account where your EMR Serverless application is running, also referred to as the *trusting account*:

1. **Step 1**: Create at least one job runtime execution role for your EMR Serverless application in your account (the `EMRServerlessRuntimeExecutionRoleB` in the *Cross account* diagram above). Choose **Custom trust policy** as the trusted entity. Add the permissions required by your job. At a minimum, you need full access to an Amazon S3 bucket, and create and read access to the AWS Glue Data Catalog.

   For detailed instructions on how to create a new runtime execution role for your EMR Serverless applications, follow these steps:

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. In the left navigation pane, choose **Policies**, and then **Create policy**.

   1. Add the permissions required by your runtime role, name the policy, and then choose **Create policy**.

      For sample runtime policies of an EMR Serverless runtime role, see [Job runtime roles for Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-iam-runtime-role.html).

   1. In the left navigation pane, choose **Roles** and then **Create role**.

   1. On the **Create role** page, choose **Custom trust policy** as the trusted entity.

   1. Paste in the following JSON document in the **Custom trust policy** section and then choose **Next**.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "emr-serverless.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

   1. On the **Add permissions** page, add the policy you created and then choose **Next**.

   1. On the **Review** page, enter a name for the role such as `EMRServerlessAppRuntimeRoleB` and an optional description.

   1. Review the role details and choose **Create role**.

   With these roles, you and your teammates can connect to the same application, each using a runtime role scoped with permissions matching your individual level of access to data.
**Note**  
Spark sessions operate differently: they are isolated based on the execution role used from Studio, so users with different execution roles have separate, isolated Spark sessions. Additionally, if you have enabled source identity for your domain, Spark sessions are further isolated across different source identities.

1. **Step 2**: Create a custom IAM role named `AssumableRole` with the following configuration:
   + Permissions: Grant the necessary permissions (the Amazon EMR Serverless policies) to the `AssumableRole` to allow access to EMR Serverless resources. This role is also known as an *Access role*.
   + Trust relationship: Configure the trust policy for the `AssumableRole` to allow assuming the execution role (The `SageMakerExecutionRole` in the cross-account diagram) from the Studio account that requires access.

   By assuming the role, Studio can gain temporary access to the permissions it needs in the EMR Serverless account.

   For detailed instructions on how to create a new `AssumableRole` in your EMR Serverless AWS account, follow these steps:

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. In the left navigation pane, choose **Policies**, and then **Create policy**.

   1. In the **JSON** tab, add the Amazon EMR Serverless permissions allowing EMR Serverless access and operations. For details on the policy document, see *EMR Serverless policies* in [Reference policies](#studio-set-up-emr-serverless-permissions-reference). Replace the `region`, `accountID`, and passed `EMRServerlessAppRuntimeRole`(s) with their actual values before copying the list of statements to the inline policy of your role.
**Note**  
The `EMRServerlessAppRuntimeRole` here is the job runtime execution role created in Step 1 (The `EMRServerlessAppRuntimeRoleB` in the *Cross account* diagram above). You can include as many ARN strings of runtime roles as needed within the permission, separating them with commas. 

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. In the left navigation pane, choose **Roles** and then **Create role**.

   1. On the **Create role** page, choose **Custom trust policy** as the trusted entity.

   1. Paste in the following JSON document in the **Custom trust policy** section and then choose **Next**.

      Replace `studio-account` with the Studio account ID, and `AmazonSageMaker-ExecutionRole` with the execution role used by your JupyterLab space. 

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

   1. On the **Add permissions** page, attach the policy you created earlier in this step, and then choose **Next**.

   1. On the **Review** page, enter a name for the role such as `AssumableRole` and an optional description.

   1. Review the role details and choose **Create role**.

   For more information about creating a role on an AWS account, see [Creating an IAM role (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html).

### On the Studio account
<a name="studio-set-up-emr-serverless-permissions-crossaccount-studioaccount"></a>

On the account where Studio is deployed, also referred to as the *trusted account*, update the SageMaker AI execution role accessing your EMR Serverless applications with the required permissions to access resources in the trusting account.

1. **Step 1**: Retrieve the ARN of the SageMaker AI execution role used by your space.

   For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

   For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

1. **Step 2**: Attach the following permissions to the SageMaker AI execution role accessing your EMR Serverless application.

   1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. Choose **Roles** and then search for your execution role by name in the **Search** field. The role name is the last part of the ARN, after the last forward slash (/). 

   1. Follow the link to your role.

   1. Choose **Add permissions** and then **Create inline policy**.

   1. In the **JSON** tab, add the inline policy granting the role permissions to update the domains, user profiles, and spaces. For details on the `SageMakerUpdateResourcesPolicy` policy document, see *Domain, user profile, and space update actions policy* in [Reference policies](#studio-set-up-emr-serverless-permissions-reference). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the **Create inline policy** step to add another policy granting the execution role the permissions to assume the `AssumableRole` and then perform actions permitted by the role's access policy.

      Replace `emr-account` with the Amazon EMR Serverless account ID, and `AssumableRole` with the name of the assumable role created in the Amazon EMR Serverless account.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": {
              "Sid": "AllowSTSToAssumeAssumableRole",
              "Effect": "Allow",
              "Action": "sts:AssumeRole",
              "Resource": "arn:aws:iam::111122223333:role/AssumableRole"
          }
      }
      ```

------

1. **Step 3**:

   Associate the list of runtime roles with your domain or user profile so you can visually browse the list of roles and select the one to use when [connecting to an EMR Serverless application](connect-emr-serverless-application.md) from JupyterLab. You can use the SageMaker AI console or the following script. Subsequently, all your Apache Spark or Apache Hive jobs created from your notebook will access only the data and resources permitted by the policies attached to the selected runtime role.
**Important**  
Failure to complete this step will prevent you from connecting a JupyterLab notebook to an EMR Serverless application.

------
#### [ SageMaker AI console ]

   To associate your runtime roles with your user profile or domain using the SageMaker AI console:

   1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation pane, choose **Domains**, and then select the domain using the SageMaker AI execution role whose permissions you updated.

   1. 
      + To add your runtime roles to your domain: In the **App Configurations** tab of the **Domain details** page, navigate to the **JupyterLab** section.
      + To add your runtime roles to your user profile: On the **Domain details** page, choose the **User profiles** tab, then select the user profile using the SageMaker AI execution role whose permissions you updated. In the **App Configurations** tab, navigate to the **JupyterLab** section.

   1. Choose **Edit** and add the ARNs of your assumable role and EMR Serverless runtime execution roles.

   1. Choose **Submit**.

   When you next connect to an EMR Serverless application via JupyterLab, the runtime roles should appear in a drop-down menu for selection.

------
#### [ Python script ]

   In a JupyterLab application started from a private space using the SageMaker AI execution role whose permissions you updated, run the following command in a terminal. Replace the `domainID`, `user-profile-name`, `emr-accountID`, and `EMRServerlessRuntimeExecutionRole`(s) with their proper values. This code snippet updates the user profile settings for a specific user profile (`client.update_user_profile`) or domain settings (`client.update_domain`) within a SageMaker AI domain. Specifically, it sets the runtime execution roles for Amazon EMR Serverless, which you have previously created. It also allows the JupyterLab application to assume a particular IAM role (`AssumableRole`) for running EMR Serverless applications within the Amazon EMR account.

   ```
   import botocore.session
   import json

   sess = botocore.session.get_session()
   client = sess.create_client('sagemaker')

   # Set the assumable role and the EMR Serverless runtime execution roles
   # for JupyterLab applications launched from this user profile.
   client.update_user_profile(
       DomainId="domainID",
       UserProfileName="user-profile-name",
       UserSettings={
           'JupyterLabAppSettings': {
               'EmrSettings': {
                   'AssumableRoleArns': ["arn:aws:iam::emr-accountID:role/AssumableRole"],
                   'ExecutionRoleArns': ["arn:aws:iam::emr-accountID:role/EMRServerlessRuntimeExecutionRoleA",
                                         "arn:aws:iam::emr-accountID:role/AnotherRuntimeExecutionRole"]
               }
           }
       }
   )

   # Confirm that the settings were applied.
   resp = client.describe_user_profile(DomainId="domainID", UserProfileName="user-profile-name")
   resp['CreationTime'] = str(resp['CreationTime'])
   resp['LastModifiedTime'] = str(resp['LastModifiedTime'])
   print(json.dumps(resp, indent=2))
   ```
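If you want every user profile in the domain to inherit the same settings, you can call `update_domain` with an equivalent `DefaultUserSettings` payload instead. The following sketch (using the same placeholder IDs and ARNs as the snippet above) separates the payload into a helper so it can be reused for both calls:

```python
def emr_serverless_user_settings(assumable_role_arns, execution_role_arns):
    """Build the JupyterLab EMR settings payload shared by
    update_user_profile (UserSettings) and update_domain (DefaultUserSettings)."""
    return {
        'JupyterLabAppSettings': {
            'EmrSettings': {
                'AssumableRoleArns': assumable_role_arns,
                'ExecutionRoleArns': execution_role_arns,
            }
        }
    }

def update_domain_emr_settings(domain_id, assumable_role_arns, execution_role_arns):
    """Set the EMR Serverless runtime roles at the domain level."""
    import botocore.session  # imported here so the helper above stays dependency-free

    client = botocore.session.get_session().create_client('sagemaker')
    return client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings=emr_serverless_user_settings(
            assumable_role_arns, execution_role_arns
        ),
    )
```

Domain-level settings apply to all user profiles that do not override them.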

------

## Reference policies
<a name="studio-set-up-emr-serverless-permissions-reference"></a>
+ **EMR Serverless policy**: This policy allows managing EMR Serverless applications, including listing, creating (with required SageMaker AI tags), starting, stopping, getting details, deleting, accessing Livy endpoints, and getting job run dashboards. It also allows passing the required EMR Serverless application runtime role to the service.
  + `EMRServerlessListApplications`: Allows the ListApplications action on all EMR Serverless resources in the specified region and AWS account.
  + `EMRServerlessPassRole`: Allows passing the specified runtime role(s) in the provided AWS account, but only when the role is being passed to the `emr-serverless.amazonaws.com` service. 
  + `EMRServerlessCreateApplicationAction`: Allows the CreateApplication and TagResource actions on EMR Serverless resources in the specified region and AWS account. However, it requires that the resources being created or tagged have specific tag keys (`sagemaker:domain-arn`, `sagemaker:user-profile-arn`, and `sagemaker:space-arn`) present with non-null values.
  + `EMRServerlessDenyTaggingAction`: Denies the TagResource and UntagResource actions on EMR Serverless resources in the specified region and AWS account if the resources do not have the specified tag keys (`sagemaker:domain-arn`, `sagemaker:user-profile-arn`, and `sagemaker:space-arn`) set.
  + `EMRServerlessActions`: Allows various actions (`StartApplication`, `StopApplication`, `GetApplication`, `DeleteApplication`, `AccessLivyEndpoints`, and `GetDashboardForJobRun`) on EMR Serverless resources, but only if the resources have the specified tag keys (`sagemaker:domain-arn`, `sagemaker:user-profile-arn`, and `sagemaker:space-arn`) set with non-null values.

  The IAM policy defined in the following JSON document grants those permissions, but conditions access on the presence of specific SageMaker AI tags so that only the EMR Serverless resources associated with a particular SageMaker AI domain, user profile, and space can be managed. 
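The effect of the tag conditions can be illustrated with a few lines of Python (a hypothetical helper for illustration, not part of any AWS SDK): the `Null` condition operator grants the gated actions only when every required SageMaker AI tag key is present with a non-null value.

```python
REQUIRED_TAG_KEYS = (
    "sagemaker:domain-arn",
    "sagemaker:user-profile-arn",
    "sagemaker:space-arn",
)

def emr_serverless_actions_allowed(resource_tags):
    """Mimic the policy's Null conditions: allow StartApplication,
    StopApplication, and the other gated actions only when every required
    SageMaker AI tag is present with a non-empty value."""
    return all(resource_tags.get(key) for key in REQUIRED_TAG_KEYS)
```

For example, an application created outside Studio without these tags would be denied until the tags are applied.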

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "EMRServerlessListApplications",
              "Effect": "Allow",
              "Action": [
                  "emr-serverless:ListApplications"
              ],
              "Resource": "arn:aws:emr-serverless:us-east-1:111122223333:/*"
          },
          {
              "Sid": "EMRServerlessPassRole",
              "Effect": "Allow",
              "Action": "iam:PassRole",
              "Resource": "arn:aws:iam::111122223333:role/EMRServerlessAppRuntimeRole",
              "Condition": {
                  "StringLike": {
                      "iam:PassedToService": "emr-serverless.amazonaws.com"
                  }
              }
          },
          {
              "Sid": "EMRServerlessCreateApplicationAction",
              "Effect": "Allow",
              "Action": [
                  "emr-serverless:CreateApplication",
                  "emr-serverless:TagResource"
              ],
              "Resource": "arn:aws:emr-serverless:us-east-1:111122223333:/*",
              "Condition": {
                  "ForAllValues:StringEquals": {
                      "aws:TagKeys": [
                          "sagemaker:domain-arn",
                          "sagemaker:user-profile-arn",
                          "sagemaker:space-arn"
                      ]
                  },
                  "Null": {
                      "aws:RequestTag/sagemaker:domain-arn": "false",
                      "aws:RequestTag/sagemaker:user-profile-arn": "false",
                      "aws:RequestTag/sagemaker:space-arn": "false"
                  }
              }
          },
          {
              "Sid": "EMRServerlessDenyTaggingAction",
              "Effect": "Deny",
              "Action": [
                  "emr-serverless:TagResource",
                  "emr-serverless:UntagResource"
              ],
              "Resource": "arn:aws:emr-serverless:us-east-1:111122223333:/*",
              "Condition": {
                  "Null": {
                      "aws:ResourceTag/sagemaker:domain-arn": "true",
                      "aws:ResourceTag/sagemaker:user-profile-arn": "true",
                      "aws:ResourceTag/sagemaker:space-arn": "true"
                  }
              }
          },
          {
              "Sid": "EMRServerlessActions",
              "Effect": "Allow",
              "Action": [
                  "emr-serverless:StartApplication",
                  "emr-serverless:StopApplication",
                  "emr-serverless:GetApplication",
                  "emr-serverless:DeleteApplication",
                  "emr-serverless:AccessLivyEndpoints",
                  "emr-serverless:GetDashboardForJobRun"
              ],
              "Resource": "arn:aws:emr-serverless:us-east-1:111122223333:/applications/*",
              "Condition": {
                  "Null": {
                      "aws:ResourceTag/sagemaker:domain-arn": "false",
                      "aws:ResourceTag/sagemaker:user-profile-arn": "false",
                      "aws:ResourceTag/sagemaker:space-arn": "false"
                  }
              }
          }
      ]
  }
  ```

------
+ **Domain, user profile, and space update actions policy**: The following policy grants permissions to update SageMaker AI domains, user profiles, and spaces within the specified region and AWS account.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "SageMakerUpdateResourcesPolicy",
              "Effect": "Allow",
              "Action": [
                  "sagemaker:UpdateDomain",
                  "sagemaker:UpdateUserprofile",
                  "sagemaker:UpdateSpace"
              ],
              "Resource": [
                  "arn:aws:sagemaker:us-east-1:111122223333:domain/*",
                  "arn:aws:sagemaker:us-east-1:111122223333:user-profile/*"
              ]
          }
      ]
  }
  ```
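To apply these policies outside the console, one approach (a sketch; the role and policy names are placeholders) is to substitute your own account ID and Region into the example documents above and attach the result as an inline policy:

```python
import json

def render_policy(template_text, account_id, region):
    """Replace the example account ID and Region used in the policy
    documents above with your own values."""
    return template_text.replace("111122223333", account_id).replace("us-east-1", region)

def attach_inline_policy(role_name, policy_name, policy_document):
    """Attach a rendered policy document to an IAM role as an inline policy."""
    import boto3  # assumed available in your environment

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        # Round-trip through json to validate the document before sending it.
        PolicyDocument=json.dumps(json.loads(policy_document)),
    )
```

Attach the rendered document to the SageMaker AI execution role used by your domain or user profile.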

------

# Create EMR Serverless applications from Studio
<a name="create-emr-serverless-application"></a>

Data scientists and data engineers can create EMR Serverless applications directly from the Studio user interface. Before you begin, ensure that you have configured the necessary permissions as described in the [Set up the permissions to enable listing and launching Amazon EMR applications from SageMaker Studio](studio-emr-serverless-permissions.md) section. These permissions grant Studio the ability to create, start, view, access, and terminate the applications.

To create an EMR Serverless application from Studio:

1. In the Studio UI, choose the **Data** node in the left navigation pane, then scroll and choose the **Amazon EMR applications and clusters** option. This opens a page that displays, under the **Serverless applications** tab, the Amazon EMR applications that you can access from within the Studio environment.

1. Choose the **Create serverless application** button at the top right corner. This opens a **Create application** page resembling the view you would see in the [EMR Serverless console](https://console.aws.amazon.com/emrserverless) when choosing to **Use custom settings** in the **application setup options**.

1. Provide the necessary details for your application, including a name and any specific configurable parameters you wish to set, then choose **Create application**.  
![\[Creation form of an EMR Serverless application from Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-serverless-create-app.png)

   All configuration settings have default values and are optional to modify. For detailed information on each available parameter, see [Configuring an application](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-capacity.html) in the EMR Serverless user guide.
**Note**  
During the application creation process in the Studio UI, you have the option to either **Create application** or **Create and start application**. Based on your choice, the application will enter either the `Creating` or `Starting` state respectively.  
If you opt to create the application without immediately starting it, make sure the **Automatically start application on job submission** option remains selected. This will ensure that the application automatically transitions to the `Starting` state when you later submit a job to run on it. 
For the simplest setup, we recommend leaving the **Virtual private cloud (VPC)** option set to its default value of **No network connectivity to resources in your VPC** under the **Network connections** section. This allows the application to be created within your domain VPC without requiring any additional networking configuration.  
 In any other case, ensure that you perform the following steps:   
Peer your VPCs.
Add routes to your private subnet route tables.
Configure your security groups as detailed in [Configure network access for your Amazon EMR cluster](studio-notebooks-emr-networking.md).
This ensures the proper networking setup for your application, beyond the default **No network connectivity** option.
For applications created from the Studio UI, the following configuration is automatically applied:  
An enabled Apache Livy endpoint.
The application is tagged with the following:  
`sagemaker:user-profile-arn`
`sagemaker:domain-arn`
`sagemaker:space-arn`
If you create an application outside of Studio, ensure that you manually enable the Apache Livy endpoint and apply the same set of tags to the application.
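When creating an application programmatically rather than from Studio, a sketch like the following covers both requirements (the release label and ARNs are placeholders; substitute your own values):

```python
def studio_tags(domain_arn, user_profile_arn, space_arn):
    """Tags that Studio applies automatically, and that the reference IAM
    policies require before the application can be managed from Studio."""
    return {
        "sagemaker:domain-arn": domain_arn,
        "sagemaker:user-profile-arn": user_profile_arn,
        "sagemaker:space-arn": space_arn,
    }

def create_studio_compatible_application(name, domain_arn, user_profile_arn, space_arn):
    """Create an EMR Serverless application with the Apache Livy endpoint
    enabled so that Studio notebooks can connect to it."""
    import boto3  # assumed available

    client = boto3.client("emr-serverless")
    return client.create_application(
        name=name,
        releaseLabel="emr-7.2.0",  # placeholder; use any supported release
        type="SPARK",
        interactiveConfiguration={"livyEndpointEnabled": True},
        tags=studio_tags(domain_arn, user_profile_arn, space_arn),
    )
```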

Once the application is created, the Studio UI displays a *The application has been successfully created* message and the new application appears in the list of **Serverless applications**.

To connect to your EMR Serverless application, see [Connect to an EMR Serverless application from Studio](connect-emr-serverless-application.md).

# Connect to an EMR Serverless application from Studio
<a name="connect-emr-serverless-application"></a>

Data scientists and data engineers can discover and then connect to an EMR Serverless application directly from the Studio user interface. Before you begin, ensure that you have created an EMR Serverless application by following the instructions in [Create EMR Serverless applications from Studio](create-emr-serverless-application.md).

You can connect an EMR Serverless application to a new JupyterLab notebook directly from the Studio UI, or choose to initiate the connection in a notebook of a running JupyterLab application.

**Important**  
When using Studio, you can only discover and connect to EMR Serverless applications from JupyterLab applications that are launched from private spaces. Ensure that the EMR Serverless applications are located in the same AWS Region as your Studio environment. Your JupyterLab space must use SageMaker Distribution image version `1.10` or later.

**To connect an EMR Serverless application to a new JupyterLab notebook from the Studio UI:**

1. In the Studio UI, choose the **Data** node in the left navigation pane, then scroll and choose the **Amazon EMR applications and clusters** option. This opens a page that displays, under the **Serverless applications** tab, the Amazon EMR applications that you can access from within the Studio environment.
**Note**  
If you or your administrator have configured the permissions to allow cross-account access to EMR Serverless applications, you can view a consolidated list of applications across all of the accounts that have granted access to Studio.

1. Select an EMR Serverless application you want to connect to a new notebook, and then choose **Attach to notebook**. This opens up a modal window displaying the list of your JupyterLab spaces.

1. 
   + Select the private space from which you want to launch a JupyterLab application, and then choose **Open notebook**. This launches a JupyterLab application from your chosen space and opens a new notebook.
   + Alternatively, you can create a new private space by choosing the **Create new space** button at the top of the modal window. Enter a name for your space and then choose **Create space and open notebook**. This creates a private space with the default instance type and latest SageMaker distribution image available, launches a JupyterLab application, and opens a new notebook.

1. Choose the name of the IAM runtime execution role that your EMR Serverless application can assume for the job run. Upon selection, a connection command populates the first cell of your notebook and initiates the connection with the EMR Serverless application.
**Important**  
To successfully connect a JupyterLab notebook to an EMR Serverless application, you must first associate the list of runtime roles with your domain or user profile, as outlined in [Set up the permissions to enable listing and launching Amazon EMR applications from SageMaker Studio](studio-emr-serverless-permissions.md). Failing to complete this step will prevent you from establishing the connection. 

   Once the connection succeeds, a message confirms the connection, your EMR Serverless application starts, and your Spark session initiates.
**Note**  
When you connect to an EMR Serverless application, its status transitions from either `Stopped` or `Created` to `Started`.

**Alternatively, you can connect to a cluster from a JupyterLab notebook.**

1. Choose the **Cluster** button at the top right of your notebook. This opens a modal window listing the EMR Serverless applications that you can access. You can see the applications in the **Serverless applications** tab.

1. Select the application to which you want to connect, then choose **Connect**.

1. EMR Serverless supports the runtime IAM roles that you preloaded when setting up the required permissions, as outlined in [Set up the permissions to enable listing and launching Amazon EMR applications from SageMaker Studio](studio-emr-serverless-permissions.md). Failing to complete this step will prevent you from establishing the connection. 

   You can select your role from the **Amazon EMR execution role** drop-down menu. When you connect to an EMR Serverless application, Studio adds a code block to an active cell of your notebook to establish the connection.

1. An active cell populates and runs. This cell contains the magic command that connects your notebook to your application.

   Once the connection succeeds, a message confirms the connection and the start of the Spark application. You can begin submitting your data processing jobs to your EMR Serverless application.
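The populated cell follows the same pattern as the cluster connection magic shown elsewhere in this guide. As a rough sketch (treat the flag names below as assumptions; the cell Studio generates is the authoritative syntax), the command can be assembled like this:

```python
def emr_serverless_connect_command(application_id, execution_role_arn, language="python"):
    """Assemble an %sm_analytics connection command for an EMR Serverless
    application. Flag names are modeled on the generated cell and may differ
    across SageMaker Distribution versions -- prefer the cell Studio populates."""
    return (
        "%sm_analytics emr-serverless connect "
        f"--application-id {application_id} "
        f"--language {language} "
        f"--emr-execution-role-arn {execution_role_arn}"
    )
```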

# Stop or delete an EMR Serverless application from the Studio UI
<a name="terminate-emr-serverless-application"></a>

You can stop (transition to the `Stopped` state) or delete (transition to the `Deleted` state) an EMR Serverless application from the list of applications in the Studio UI. 

**To stop or delete an application, navigate to the list of available EMR Serverless applications.**

1. In the Studio UI, choose the **Data** node in the left navigation pane, then scroll and choose the **Amazon EMR applications and clusters** option. This opens a page that displays, under the **Serverless applications** tab, the Amazon EMR applications that you can access from within the Studio environment.

1. Select the name of the application that you want to stop or delete, and then choose the corresponding **Stop** or **Delete** button.

1. A confirmation message informs you that any pending jobs will be permanently lost.
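The same lifecycle transitions can be driven from code. The following is a minimal sketch (assuming an injected EMR Serverless client such as `boto3.client("emr-serverless")`) that stops an application, waits for the `STOPPED` state, and then deletes it:

```python
import time

def stop_then_delete(client, application_id, poll_seconds=5):
    """Stop an EMR Serverless application, wait until it reports STOPPED,
    then delete it. DeleteApplication fails while the application is running."""
    client.stop_application(applicationId=application_id)
    while True:
        state = client.get_application(applicationId=application_id)["application"]["state"]
        if state == "STOPPED":
            break
        time.sleep(poll_seconds)
    client.delete_application(applicationId=application_id)
```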

# Data preparation using Amazon EMR
<a name="studio-notebooks-emr-cluster"></a>

**Important**  
Amazon SageMaker Studio and Amazon SageMaker Studio Classic are two of the machine learning environments that you can use to interact with SageMaker AI.  
If your domain was created after November 30, 2023, Studio is your default experience.  
If your domain was created before November 30, 2023, Amazon SageMaker Studio Classic is your default experience. To use Studio if Amazon SageMaker Studio Classic is your default experience, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md).  
When you migrate from Amazon SageMaker Studio Classic to Amazon SageMaker Studio, there is no loss in feature availability. Studio Classic also exists as an application within Amazon SageMaker Studio to help you run your legacy machine learning workflows.

Amazon SageMaker Studio and Studio Classic come with built-in integration with [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). Within JupyterLab and Studio Classic notebooks, data scientists and data engineers can discover and connect to existing Amazon EMR clusters, then interactively explore, visualize, and prepare large-scale data for machine learning using [Apache Spark](https://aws.amazon.com/emr/features/spark), [Apache Hive](https://aws.amazon.com/emr/features/hive), or [Presto](https://aws.amazon.com/emr/features/presto). With a single click, they can access the Spark UI to monitor the status and metrics of their Spark jobs without leaving their notebook.

Administrators can create [CloudFormation templates](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) that define Amazon EMR clusters. They can then make those cluster templates available in the [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/userguide/end-user-console.html) for Studio and Studio Classic users to launch. Data scientists can then choose a predefined template to self-provision an Amazon EMR cluster directly from their Studio environment. Administrators can further parameterize the templates to let users choose aspects of the cluster within predefined values. For example, users may want to specify the number of core nodes or select the instance type of a node from a dropdown menu.

Using CloudFormation, administrators can control the organizational, security, and networking setup of Amazon EMR clusters. Data scientists and data engineers can then customize those templates for their workloads to create on-demand Amazon EMR clusters directly from Studio and Studio Classic without setting up complex configurations. Users can terminate Amazon EMR clusters after use.
+ **If you are an administrator**:

  Ensure that you have enabled communication between Studio or Studio Classic and Amazon EMR clusters. For instructions, see the [Configure network access for your Amazon EMR cluster](studio-notebooks-emr-networking.md) section. Once this communication is enabled, you can:
  + [Configure Amazon EMR CloudFormation templates in the Service Catalog](studio-notebooks-set-up-emr-templates.md)
  + [Configure listing Amazon EMR clusters](studio-notebooks-configure-discoverability-emr-cluster.md)
+ **If you are a data scientist or data engineer**, you can:
  + [Launch an Amazon EMR cluster from Studio or Studio Classic](studio-notebooks-launch-emr-cluster-from-template.md)
  + [List Amazon EMR clusters from Studio or Studio Classic](discover-emr-clusters.md)
  + [Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic](connect-emr-clusters.md)
  + [Terminate an Amazon EMR cluster from Studio or Studio Classic](terminate-emr-clusters.md)
  + [Access Spark UI from Studio or Studio Classic](studio-notebooks-access-spark-ui.md)

**Topics**
+ [Quickstart: Create a SageMaker AI sandbox domain to launch Amazon EMR clusters in Studio](studio-notebooks-emr-cluster-quickstart.md)
+ [Admin guide](studio-emr-admin-guide.md)
+ [User guide](studio-emr-user-guide.md)
+ [Blogs and whitepapers](studio-notebooks-emr-resources.md)
+ [Troubleshooting](studio-notebooks-emr-troubleshooting.md)

# Quickstart: Create a SageMaker AI sandbox domain to launch Amazon EMR clusters in Studio
<a name="studio-notebooks-emr-cluster-quickstart"></a>

This section walks you through the quick setup of a complete test environment in Amazon SageMaker Studio. You will create a new Studio domain that lets users launch new Amazon EMR clusters directly from Studio. The steps provide an example notebook that you can connect to an Amazon EMR cluster to start running Spark workloads. Using this notebook, you will build a Retrieval Augmented Generation (RAG) system using Amazon EMR Spark distributed processing and an OpenSearch vector database.

**Note**  
To get started, sign in to the AWS Management Console using an AWS Identity and Access Management (IAM) user account with admin permissions. For information on how to sign up for an AWS account and create a user with administrative access, see [Complete Amazon SageMaker AI prerequisites](gs-set-up.md).

**To set up your Studio test environment and start running Spark jobs:**
+ [Step 1: Create a SageMaker AI domain for launching Amazon EMR clusters in Studio](#studio-notebooks-emr-cluster-quickstart-setup)
+ [Step 2: Launch a new Amazon EMR cluster from Studio UI](#studio-notebooks-emr-cluster-quickstart-launch)
+ [Step 3: Connect a JupyterLab notebook to the Amazon EMR cluster](#studio-notebooks-emr-cluster-quickstart-connect)
+ [Step 4: Clean up your CloudFormation stack](#studio-notebooks-emr-cluster-quickstart-clean-stack)

## Step 1: Create a SageMaker AI domain for launching Amazon EMR clusters in Studio
<a name="studio-notebooks-emr-cluster-quickstart-setup"></a>

In the following steps, you apply a CloudFormation stack to automatically create a new SageMaker AI domain. The stack also creates a user profile and configures the needed environment and permissions. The SageMaker AI domain is configured to let you directly launch Amazon EMR clusters from Studio. For this example, the Amazon EMR clusters are created in the same AWS account as SageMaker AI without authentication. You can find additional CloudFormation stacks supporting various authentication methods, such as Kerberos, in the [getting_started](https://github.com/aws-samples/sagemaker-studio-emr/tree/main/cloudformation/getting_started) GitHub repository.

**Note**  
SageMaker AI allows 5 Studio domains per AWS account per AWS Region by default. Ensure that your account has no more than 4 domains in your Region before you create your stack.
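You can check this programmatically before creating the stack. The following is a sketch (assuming a SageMaker AI client and the default quota of 5):

```python
def remaining_domain_quota(client, default_quota=5):
    """Return how many more Studio domains the account can create in this
    Region under the default quota. Does not account for raised quotas or
    pagination beyond the first page of results."""
    existing = len(client.list_domains().get("Domains", []))
    return max(default_quota - existing, 0)
```

For example, `remaining_domain_quota(boto3.client("sagemaker"))` must return at least 1 before you proceed.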

**Follow these steps to set up a SageMaker AI domain for launching Amazon EMR clusters from Studio.**

1. Download the raw file of this [CloudFormation template](https://github.com/aws-samples/sagemaker-studio-foundation-models/blob/main/workshop-artifacts/cfn/workshop-cfn.yaml) from the `sagemaker-studio-foundation-models` GitHub repository.

1. Go to the CloudFormation console: [https://console.aws.amazon.com/cloudformation](https://console.aws.amazon.com/cloudformation/)

1. Choose **Create stack** and select **With new resources (standard)** from the drop down menu.

1. In **Step 1**:

   1. In the **Prepare template** section, select **Choose an existing template**.

   1. In the **Specify template** section, choose **Upload a template file**.

   1. Upload the downloaded CloudFormation template and choose **Next**.

1. In **Step 2**, enter a **Stack name** and a **SageMakerDomainName** then choose **Next**.

1. In **Step 3**, keep all default values and choose **Next**.

1. In **Step 4**, check the box to acknowledge resource creation and choose **Create stack**. This creates a Studio domain in your account and region.

## Step 2: Launch a new Amazon EMR cluster from Studio UI
<a name="studio-notebooks-emr-cluster-quickstart-launch"></a>

In the following steps, you create a new Amazon EMR cluster from the Studio UI.

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and choose **Domains** in the left menu.

1. Choose your domain name **GenerativeAIDomain** to open the **Domain details** page.

1. Launch Studio from the user profile `genai-user`.

1. In the left navigation pane, go to **Data** then **Amazon EMR Clusters**.

1. On the Amazon EMR clusters page, choose **Create**. Select the template **SageMaker Studio Domain No Auth EMR** created by the CloudFormation stack and then choose **Next**.

1. Enter a name for the new Amazon EMR cluster. Optionally update other parameters such as the instance type of core and master nodes, idle timeout, or number of core nodes.

1. Choose **Create resource** to launch the new Amazon EMR cluster. 

   After creating the Amazon EMR cluster, follow the status on the **EMR Clusters** page. When the status changes to `Running/Waiting`, your Amazon EMR cluster is ready to use in Studio.
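If you prefer to poll the status from code rather than from the page, a sketch along these lines (assuming an injected Amazon EMR client) waits for the cluster to leave its startup states:

```python
import time

def wait_until_ready(client, cluster_id, poll_seconds=30):
    """Poll an Amazon EMR cluster until it reaches RUNNING or WAITING,
    the states in which Studio can connect to it."""
    while True:
        state = client.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
        if state in ("RUNNING", "WAITING"):
            return state
        if state in ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"):
            raise RuntimeError(f"Cluster {cluster_id} terminated in state {state}")
        time.sleep(poll_seconds)
```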

## Step 3: Connect a JupyterLab notebook to the Amazon EMR cluster
<a name="studio-notebooks-emr-cluster-quickstart-connect"></a>

In the following steps, you connect a notebook in JupyterLab to your running Amazon EMR cluster. For this example, you import a notebook allowing you to build a Retrieval Augmented Generation (RAG) system using Amazon EMR Spark distributed processing and OpenSearch vector database.

1. 

**Launch JupyterLab**

   From Studio, launch the JupyterLab application.

1. 

**Create a private space**

   If you have not created a space for your JupyterLab application, choose **Create a JupyterLab space**. Enter a name for the space, and keep the space as **Private**. Leave all other settings at their default values, and then choose **Create space**. 

   Otherwise, run your JupyterLab space to launch a JupyterLab application.

1. 

**Deploy your LLM and embedding models for inference**
   + From the top menu, choose **File**, **New**, and then **Terminal**.
   + In the terminal, run the following command.

     ```
     wget --no-check-certificate https://raw.githubusercontent.com/aws-samples/sagemaker-studio-foundation-models/main/lab-00-setup/Lab_0_Warm_Up_Deploy_EmbeddingModel_Llama2_on_Nvidia.ipynb
     mkdir AWSGuides
     cd AWSGuides
     wget --no-check-certificate https://raw.githubusercontent.com/aws-samples/sagemaker-studio-foundation-models/main/lab-03-rag/AWSGuides/AmazonSageMakerDeveloperGuide.pdf
     wget --no-check-certificate https://raw.githubusercontent.com/aws-samples/sagemaker-studio-foundation-models/main/lab-03-rag/AWSGuides/EC2DeveloperGuide.pdf
     wget --no-check-certificate https://raw.githubusercontent.com/aws-samples/sagemaker-studio-foundation-models/main/lab-03-rag/AWSGuides/S3DeveloperGuide.pdf
     ```

     This retrieves the `Lab_0_Warm_Up_Deploy_EmbeddingModel_Llama2_on_Nvidia.ipynb` notebook to your local directory and downloads three PDF files into a local `AWSGuides` folder.
   + Open `lab-00-setup/Lab_0_Warm_Up_Deploy_EmbeddingModel_Llama2_on_Nvidia.ipynb`, keep the `Python 3 (ipykernel)` kernel, and run each cell.
**Warning**  
In the **Llama 2 License Agreement** section, make sure that you accept the Llama 2 EULA before you continue.  
The notebook deploys two models, `Llama 2` and `all-MiniLM-L6-v2`, on `ml.g5.2xlarge` instances for inference.

     The deployment of the models and the creation of the endpoints may take some time.

1. 

**Open your main notebook**

   In JupyterLab, open your terminal and run the following command.

   ```
   cd ..
   wget --no-check-certificate https://raw.githubusercontent.com/aws-samples/sagemaker-studio-foundation-models/main/lab-03-rag/Lab_3_RAG_on_SageMaker_Studio_using_EMR.ipynb
   ```

   You should see the additional `Lab_3_RAG_on_SageMaker_Studio_using_EMR.ipynb` notebook in the left panel of JupyterLab.

1. 

**Choose a `PySpark` kernel**

   Open your `Lab_3_RAG_on_SageMaker_Studio_using_EMR.ipynb` notebook and ensure that you are using the `SparkMagic PySpark` kernel. You can switch kernels at the top right of your notebook: choose the current kernel name to open a kernel selection modal, and then choose `SparkMagic PySpark`.

1. 

**Connect your notebook to the cluster**

   1. At the top right of your notebook, choose **Cluster**. This action opens a modal window that lists all of the running clusters that you have permission to access. 

   1. Select your cluster then choose **Connect**. A new credential type selection modal window opens up.

   1. Choose **No credential** and then **Connect**.  
![\[Modal showing selection of Amazon EMR credentials for JupyterLab notebooks.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-credential-selection.png)

   1. A notebook cell automatically populates and runs. The notebook cell loads the `sagemaker_studio_analytics_extension.magics` extension, which provides functionality to connect to the Amazon EMR cluster. It then uses the `%sm_analytics` magic command to initiate the connection to your Amazon EMR cluster and the Spark application.
**Note**  
Ensure that the connection string to your Amazon EMR cluster has an authentication type set to `None`. This is illustrated by the value `--auth-type None` in the following example. You can modify the field if necessary.  

      ```
      %load_ext sagemaker_studio_analytics_extension.magics
      %sm_analytics emr connect --verify-certificate False --cluster-id your-cluster-id --auth-type None --language python
      ```

   1. Once you successfully establish the connection, your connection cell output message should display your `SparkSession` details including your cluster ID, `YARN` application ID, and a link to the Spark UI to monitor your Spark jobs.

You are ready to use the `Lab_3_RAG_on_SageMaker_Studio_using_EMR.ipynb` notebook. This example notebook runs distributed PySpark workloads for building a RAG system using LangChain and OpenSearch.

## Step 4: Clean up your CloudFormation stack
<a name="studio-notebooks-emr-cluster-quickstart-clean-stack"></a>

After you are finished, make sure to terminate your two endpoints and delete your CloudFormation stack to prevent continued charges. Deleting the stack cleans up all the resources that were provisioned by the stack.

**To delete your CloudFormation stack when you are done with it**

1. Go to the CloudFormation console: [https://console.aws.amazon.com/cloudformation](https://console.aws.amazon.com/cloudformation/)

1. Select the stack you want to delete. You can search for it by name or find it in the list of stacks.

1. Choose **Delete** to start deleting the stack, and then choose **Delete** again to acknowledge that this will delete all resources created by the stack.

   Wait for the stack deletion to complete. This can take a few minutes. CloudFormation automatically cleans up all resources defined in the stack template.

1. Verify that all resources created by the stack have been deleted. For example, check for any leftover Amazon EMR cluster.
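If you prefer to script this cleanup, the deletion and wait can be expressed with boto3. The following is a minimal sketch; the stack name and region are placeholders for your own values.

```python
def delete_stack_and_wait(stack_name: str, region: str = "us-east-1") -> None:
    """Delete a CloudFormation stack and block until deletion completes."""
    import boto3  # imported inside the function so the sketch imports cleanly without AWS access

    cfn = boto3.client("cloudformation", region_name=region)
    cfn.delete_stack(StackName=stack_name)
    # The waiter polls DescribeStacks until the stack reaches DELETE_COMPLETE,
    # and raises if deletion fails or times out.
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)


# Example (replace with the name of your stack):
# delete_stack_and_wait("my-sagemaker-emr-stack")
```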

**To remove the API endpoints for a model**

1. Go to the SageMaker AI console: [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference** and then **Endpoints**.

1. Select the endpoint `hf-allminil6v2-embedding-ep`, and then choose **Delete** in the **Actions** dropdown list. Repeat this step for the endpoint `meta-llama2-7b-chat-tg-ep`.
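The endpoint deletion can also be scripted with boto3. A minimal sketch, using the two endpoint names from this tutorial:

```python
def delete_endpoints(endpoint_names, region="us-east-1"):
    """Delete SageMaker AI inference endpoints by name."""
    import boto3  # imported inside the function so the sketch imports cleanly without AWS access

    sm = boto3.client("sagemaker", region_name=region)
    for name in endpoint_names:
        sm.delete_endpoint(EndpointName=name)  # frees the underlying instances


# The two endpoints used in this tutorial:
# delete_endpoints(["hf-allminil6v2-embedding-ep", "meta-llama2-7b-chat-tg-ep"])
```

Note that `delete_endpoint` removes only the endpoint itself; endpoint configurations and models are deleted separately (`delete_endpoint_config`, `delete_model`).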

# Admin guide
<a name="studio-emr-admin-guide"></a>

This section provides prerequisites and networking instructions for enabling communication between Studio or Studio Classic and Amazon EMR clusters. It covers different deployment scenarios: when Studio and Amazon EMR are provisioned within private Amazon VPCs without public internet access, and when they communicate over the internet.

It walks through how administrators can use the AWS Service Catalog to make CloudFormation templates available to Studio, allowing data scientists to discover and self-provision Amazon EMR clusters directly from within Studio. This involves creating a Service Catalog portfolio, granting requisite permissions, referencing the Amazon EMR templates, and parameterizing them to enable customizations during cluster creation.

Last, it provides guidance on configuring discoverability of existing running Amazon EMR clusters from Studio and Studio Classic, covering single account and cross-account access scenarios, along with the necessary IAM permissions.

**Topics**
+ [Configure Amazon EMR CloudFormation templates in the Service Catalog](studio-notebooks-set-up-emr-templates.md)
+ [Configure listing Amazon EMR clusters](studio-notebooks-configure-discoverability-emr-cluster.md)
+ [Configure IAM runtime roles for Amazon EMR cluster access in Studio](studio-notebooks-emr-cluster-rbac.md)
+ [Reference policies](studio-set-up-emr-permissions-reference.md)

# Configure Amazon EMR CloudFormation templates in the Service Catalog
<a name="studio-notebooks-set-up-emr-templates"></a>

This topic assumes administrators are familiar with [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html), [portfolios and products in AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/getstarted-portfolio.html), as well as [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html).

To simplify the creation of Amazon EMR clusters from Studio, administrators can register an [Amazon EMR CloudFormation template](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticmapreduce-cluster.html) as a product in an [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) portfolio. To make the template available to data scientists, they must associate the portfolio with the SageMaker AI execution role used in Studio or Studio Classic. Finally, to allow users to discover templates, provision clusters, and connect to Amazon EMR clusters from Studio or Studio Classic, administrators need to set appropriate access permissions.

The Amazon EMR CloudFormation templates can allow end-users to customize various cluster aspects. For example, administrators can define an approved list of instance types that users can choose from when creating a cluster.

The following instructions use end-to-end [CloudFormation stacks](https://github.com/aws-samples/sagemaker-studio-emr/tree/main/cloudformation/getting_started) to set up a Studio or Studio Classic domain, a user profile, and a Service Catalog portfolio, and to populate an Amazon EMR launch template. The following steps highlight the specific settings that administrators must apply in their end-to-end stack to enable Studio or Studio Classic to access Service Catalog products and provision Amazon EMR clusters.

**Note**  
The GitHub repository [aws-samples/sagemaker-studio-emr](https://github.com/aws-samples/sagemaker-studio-emr/tree/main/cloudformation/getting_started) contains example end-to-end CloudFormation stacks that deploy the necessary IAM roles, networking, SageMaker domain, user profile, Service Catalog portfolio, and add an Amazon EMR launch CloudFormation template. The templates provide different authentication options between Studio or Studio Classic and the Amazon EMR cluster. In these example templates, the parent CloudFormation stack passes SageMaker AI VPC, security group, and subnet parameters to the Amazon EMR cluster template.  
The [sagemaker-studio-emr/cloudformation/emr\_servicecatalog\_templates](https://github.com/aws-samples/sagemaker-studio-emr/tree/main/cloudformation/emr_servicecatalog_templates) repository contains various sample Amazon EMR CloudFormation launch templates, including options for single account and cross-account deployments.  
Refer to [Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic](connect-emr-clusters.md) for details on the authentication methods you can use to connect to an Amazon EMR cluster.

To let data scientists discover Amazon EMR CloudFormation templates and provision clusters from Studio or Studio Classic, follow these steps.

## Step 0: Check your networking and prepare your CloudFormation stack
<a name="studio-set-up-emr-prereq"></a>

Before you start:
+ Ensure that you have reviewed the networking and security requirements in [Configure network access for your Amazon EMR cluster](studio-notebooks-emr-networking.md).
+ You must have an existing end-to-end CloudFormation stack that supports the authentication method of your choice. You can find examples of such CloudFormation templates in the [aws-samples/sagemaker-studio-emr](https://github.com/aws-samples/sagemaker-studio-emr/tree/main/cloudformation/getting_started) GitHub repository. The following steps highlight the specific configurations in your end-to-end stack to enable the use of Amazon EMR templates within Studio or Studio Classic. 

## Step 1: Associate your Service Catalog portfolio with SageMaker AI
<a name="studio-set-up-emr-service-catalog-portfolio"></a>

**In your Service Catalog portfolio**, associate your portfolio ID with the SageMaker AI execution role accessing your cluster.

To do so, add the following section (here in YAML format) to your stack. This grants the SageMaker AI execution role access to the specified Service Catalog portfolio containing products like Amazon EMR templates. It allows roles assumed by SageMaker AI to launch those products.

 Replace *SageMakerExecutionRole.Arn* and *SageMakerStudioEMRProductPortfolio.ID* with their actual values.

```
SageMakerStudioEMRProductPortfolioPrincipalAssociation:
    Type: AWS::ServiceCatalog::PortfolioPrincipalAssociation
    Properties:
      PrincipalARN: SageMakerExecutionRole.Arn
      PortfolioId: SageMakerStudioEMRProductPortfolio.ID
      PrincipalType: IAM
```

For details on the required set of IAM permissions, see the [permissions](#studio-emr-permissions) section.
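The same association can also be performed outside CloudFormation with the Service Catalog API. The following is a minimal sketch; the portfolio ID and role ARN are placeholders for your own values.

```python
def associate_execution_role(portfolio_id, execution_role_arn, region="us-east-1"):
    """Grant a SageMaker AI execution role access to a Service Catalog portfolio."""
    import boto3  # imported inside the function so the sketch imports cleanly without AWS access

    sc = boto3.client("servicecatalog", region_name=region)
    sc.associate_principal_with_portfolio(
        PortfolioId=portfolio_id,         # e.g. "port-abc123example" (placeholder)
        PrincipalARN=execution_role_arn,  # the SageMaker AI execution role ARN
        PrincipalType="IAM",
    )
```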

## Step 2: Reference an Amazon EMR template in a Service Catalog product
<a name="studio-set-up-emr-service-catalog-product"></a>

**In a Service Catalog product of your portfolio**, reference an Amazon EMR template resource and ensure its visibility in Studio or Studio Classic. 

To do so, reference the Amazon EMR template resource in the Service Catalog product definition, and then add the following tag key `"sagemaker:studio-visibility:emr"` set to the value `"true"` (see the example in YAML format).

In the Service Catalog product definition, the CloudFormation template of the cluster is referenced via URL. The additional tag set to true ensures the visibility of the Amazon EMR templates in Studio or Studio Classic. 

**Note**  
The Amazon EMR template referenced by the provided URL in the example does not enforce any authentication requirements when launched. This option is meant for demonstration and learning purposes. It is not recommended in a production environment.

```
SMStudioEMRNoAuthProduct:
    Type: AWS::ServiceCatalog::CloudFormationProduct
    Properties:
      Owner: AWS
      Name: SageMaker Studio Domain No Auth EMR
      ProvisioningArtifactParameters:
        - Name: SageMaker Studio Domain No Auth EMR
          Description: Provisions a SageMaker domain and No Auth EMR Cluster
          Info:
            LoadTemplateFromURL: Link to your CloudFormation template. For example, https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/astra-m4-sagemaker/end-to-end/CFN-EMR-NoStudioNoAuthTemplate-v3.yaml
      Tags:
        - Key: "sagemaker:studio-visibility:emr"
          Value: "true"
```
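To confirm that a registered product carries the visibility tag that Studio looks for, you can inspect it with the Service Catalog API. A sketch, assuming the product ID is known:

```python
def has_studio_emr_visibility(product_id, region="us-east-1"):
    """Return True if the product carries the tag sagemaker:studio-visibility:emr = true."""
    import boto3  # imported inside the function so the sketch imports cleanly without AWS access

    sc = boto3.client("servicecatalog", region_name=region)
    tags = sc.describe_product_as_admin(Id=product_id).get("Tags", [])
    return any(
        tag["Key"] == "sagemaker:studio-visibility:emr" and tag["Value"] == "true"
        for tag in tags
    )
```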

## Step 3: Parameterize the Amazon EMR CloudFormation template
<a name="studio-set-up-emr-cfn-template"></a>

**The CloudFormation template used to define the Amazon EMR cluster within the Service Catalog product** allows administrators to specify configurable parameters. Administrators can define `Default` values and `AllowedValues` ranges for these parameters within the template's `Parameters` section. During the cluster launch process, data scientists can provide custom inputs or make selections from those predefined options to customize certain aspects of their Amazon EMR cluster.

The following example illustrates additional input parameters that administrators can set when creating an Amazon EMR template.

```
"Parameters": {
    "EmrClusterName": {
      "Type": "String",
      "Description": "EMR cluster Name."
    },
    "MasterInstanceType": {
      "Type": "String",
      "Description": "Instance type of the EMR master node.",
      "Default": "m5.xlarge",
      "AllowedValues": [
        "m5.xlarge",
        "m5.2xlarge",
        "m5.4xlarge"
      ]
    },
    "CoreInstanceType": {
      "Type": "String",
      "Description": "Instance type of the EMR core nodes.",
      "Default": "m5.xlarge",
      "AllowedValues": [
        "m5.xlarge",
        "m5.2xlarge",
        "m5.4xlarge",
        "m3.medium",
        "m3.large",
        "m3.xlarge",
        "m3.2xlarge"
      ]
    },
    "CoreInstanceCount": {
      "Type": "String",
      "Description": "Number of core instances in the EMR cluster.",
      "Default": "2",
      "AllowedValues": [
        "2",
        "5",
        "10"
      ]
    },
    "EmrReleaseVersion": {
      "Type": "String",
      "Description": "The release version of EMR to launch.",
      "Default": "emr-5.33.1",
      "AllowedValues": [
        "emr-5.33.1",
        "emr-6.4.0"
      ]
    }
  }
```

After administrators have made the Amazon EMR CloudFormation templates available within Studio, data scientists can use them to self-provision Amazon EMR clusters. The `Parameters` section defined in the template translates into input fields on the cluster creation form within Studio or Studio Classic. For each parameter, data scientists can either enter a custom value into the input box or select from the predefined options listed in a dropdown menu, which corresponds to the `AllowedValues` specified in the template.
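The form behavior can be modeled with a short sketch: each parameter resolves to the user's input if one is given, falls back to `Default` otherwise, and is rejected when it falls outside `AllowedValues`. This is a simplified illustration, not the actual Studio form code.

```python
import json

# A fragment of the Parameters section shown above
TEMPLATE_PARAMETERS = json.loads("""
{
  "MasterInstanceType": {
    "Type": "String",
    "Default": "m5.xlarge",
    "AllowedValues": ["m5.xlarge", "m5.2xlarge", "m5.4xlarge"]
  }
}
""")


def resolve_parameter(params, name, value=None):
    """Resolve a parameter the way the cluster creation form does."""
    spec = params[name]
    chosen = value if value is not None else spec.get("Default")
    allowed = spec.get("AllowedValues")
    if allowed is not None and chosen not in allowed:
        raise ValueError(f"{chosen!r} is not an allowed value for {name}")
    return chosen


print(resolve_parameter(TEMPLATE_PARAMETERS, "MasterInstanceType"))  # prints m5.xlarge
```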

The following illustration shows the dynamic form assembled from a CloudFormation Amazon EMR template to create an Amazon EMR cluster in Studio or Studio Classic.

![\[Illustration of a dynamic form assembled from a CloudFormation Amazon EMR template to create an Amazon EMR cluster from Studio or Studio Classic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-cluster-creation.png)


Visit [Launch an Amazon EMR cluster from Studio or Studio Classic](studio-notebooks-launch-emr-cluster-from-template.md) to learn about how to launch a cluster from Studio or Studio Classic using those Amazon EMR templates.

## Step 4: Set up the permissions to enable listing and launching Amazon EMR clusters from Studio
<a name="studio-emr-permissions"></a>

Last, attach the required IAM permissions to enable listing existing running Amazon EMR clusters and self-provisioning new clusters from Studio or Studio Classic.

The roles to which you must add these permissions depend on whether Studio or Studio Classic and Amazon EMR are deployed in the same account (choose *Single account*) or in different accounts (choose *Cross account*).

**Important**  
You can only discover and connect to Amazon EMR clusters for JupyterLab and Studio Classic applications that are launched from private spaces. Ensure that the Amazon EMR clusters are located in the same AWS region as your Studio environment.

### Single account
<a name="studio-set-up-emr-permissions-singleaccount"></a>

If your Amazon EMR clusters and Studio or Studio Classic are deployed in the same AWS account, attach the following permissions to the SageMaker AI execution role accessing your cluster.

1. **Step 1**: Retrieve the ARN of the SageMaker AI execution role used by your private space.

   For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

   For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

1. **Step 2**: Attach the following permissions to the SageMaker AI execution role accessing your Amazon EMR clusters.

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. Choose **Roles** and then search for your execution role by name in the **Search** field. The role name is the last part of the ARN, after the last forward slash (/). 

   1. Follow the link to your role.

   1. Choose **Add permissions** and then **Create inline policy**.

   1. In the **JSON** tab, add the Amazon EMR permissions allowing Amazon EMR access and operations. For details on the policy document, see *List Amazon EMR policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the **Create inline policy** step to add another policy granting the execution role the permissions to provision new Amazon EMR clusters using CloudFormation templates. For details on the policy document, see *Create Amazon EMR clusters policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

**Note**  
Users of role-based access control (RBAC) connectivity to Amazon EMR clusters should also refer to [Configure runtime role authentication when your Amazon EMR cluster and Studio are in the same account](studio-notebooks-emr-cluster-rbac.md#studio-notebooks-emr-cluster-iam-same). 
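The inline-policy steps above can also be scripted with the IAM API. The following is a minimal sketch; the role and policy names are placeholders, and the policy document is the one from [Reference policies](studio-set-up-emr-permissions-reference.md), supplied as a Python dict.

```python
import json


def attach_inline_policy(role_name, policy_name, policy_document):
    """Attach an inline policy (a dict) to an IAM role, as in the console steps."""
    import boto3  # imported inside the function so the sketch imports cleanly without AWS access

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,  # the last segment of the execution role ARN
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )
```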

### Cross account
<a name="studio-set-up-emr-permissions-crossaccount"></a>

Before you get started, retrieve the ARN of the SageMaker AI execution role used by your private space.

For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

If your Amazon EMR clusters and Studio or Studio Classic are deployed in separate AWS accounts, you configure the permissions on both accounts.

**Note**  
Users of role-based access control (RBAC) connectivity to Amazon EMR clusters should also refer to [Configure runtime role authentication when your cluster and Studio are in different accounts](studio-notebooks-emr-cluster-rbac.md#studio-notebooks-emr-cluster-iam-diff). 

#### On the Amazon EMR cluster account
<a name="studio-set-up-emr-permissions-crossaccount-emraccount"></a>

Follow these steps to create the necessary roles and policies on the account where Amazon EMR is deployed, also referred to as the *trusting account*:

1. **Step 1**: Retrieve the ARN of the [service role of your Amazon EMR cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html). 

   To learn about how to find the ARN of the service role of a cluster, see [Configure IAM service roles for Amazon EMR permissions to AWS services and resources](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html#emr-iam-role-landing).

1. **Step 2**: Create a custom IAM role named `AssumableRole` with the following configuration:
   + Permissions: Grant `AssumableRole` the permissions needed to access Amazon EMR resources. This role is also known as an *Access role* in scenarios involving cross-account access.
   + Trust relationship: Configure the trust policy of `AssumableRole` so that it can be assumed by the execution role (the `SageMakerExecutionRole` in the cross-account diagram) from the Studio account that requires access.

   By assuming the role, Studio or Studio Classic can gain temporary access to the permissions it needs in Amazon EMR.

   For detailed instructions on how to create a new `AssumableRole` in your Amazon EMR AWS account, follow these steps:

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. In the left navigation pane, choose **Policies**, and then **Create policy**.

   1. In the **JSON** tab, add the Amazon EMR permissions allowing Amazon EMR access and operations. For details on the policy document, see *List Amazon EMR policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. In the left navigation pane, choose **Roles** and then **Create role**.

   1. On the **Create role** page, choose **Custom trust policy** as the trusted entity.

   1. Paste in the following JSON document in the **Custom trust policy** section and then choose **Next**.

------
#### [ For users of Studio and JupyterLab ]

      Replace `111122223333` in the following policy with the Studio account ID, and `AmazonSageMaker-ExecutionRole` with the execution role used by your JupyterLab space.

------
#### [ JSON ]


      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

------
#### [ For users of Studio Classic ]

      Replace `111122223333` in the following policy with the Studio Classic account ID.

------
#### [ JSON ]


      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "arn:aws:iam::111122223333:root"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

------

   1. On the **Add permissions** page, add the policy you just created, and then choose **Next**.

   1. On the **Review** page, enter a name for the role such as `AssumableRole` and an optional description.

   1. Review the role details and choose **Create role**.

   For more information about creating a role on an AWS account, see [Creating an IAM role (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html).

#### On the Studio account
<a name="studio-set-up-emr-permissions-crossaccount-studioaccount"></a>

On the account where Studio is deployed, also referred to as the *trusted account*, update the SageMaker AI execution role accessing your clusters with the required permissions to access resources in the trusting account.

1. **Step 1**: Retrieve the ARN of the SageMaker AI execution role used by your private space.

   For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

   For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

1. **Step 2**: Attach the following permissions to the SageMaker AI execution role accessing your Amazon EMR clusters.

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. Choose **Roles** and then search for your execution role by name in the **Search** field. The role name is the last part of the ARN, after the last forward slash (/). 

   1. Follow the link to your role.

   1. Choose **Add permissions** and then **Create inline policy**.

   1. In the **JSON** tab, add the inline policy granting the role permissions to update the domains, user profiles, and spaces. For details on the policy document, see *Domain, user profile, and space update actions policy* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the **Create inline policy** step to add another policy granting the execution role the permissions to assume the `AssumableRole` and then perform the actions permitted by the role's access policy. In the following policy, replace `111122223333` with the Amazon EMR account ID, and `AssumableRole` with the name of the assumable role created in the Amazon EMR account.

------
#### [ JSON ]


      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
                  "Effect": "Allow",
                  "Action": "sts:AssumeRole",
                  "Resource": [
                      "arn:aws:iam::111122223333:role/AssumableRole"
                  ]
              }
          ]
      }
      ```

------

   1. Repeat the **Create inline policy** step to add another policy granting the execution role the permissions to provision new Amazon EMR clusters using CloudFormation templates. For details on the policy document, see *Create Amazon EMR clusters policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. (Optional) To allow listing Amazon EMR clusters deployed in the same account as Studio, add an additional inline policy to your Studio execution role as defined in *List Amazon EMR policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). 

1. **Step 3**: Associate your assumable role(s) (access role) with your domain or user profile. JupyterLab users in Studio can use the SageMaker AI console or the provided script.

    Choose the tab that corresponds to your use case.

------
#### [ Associate your assumable roles in JupyterLab using the SageMaker AI console ]

   To associate your assumable roles with your user profile or domain using the SageMaker AI console:

   1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation pane, choose **Domains**, and then select the domain using the SageMaker AI execution role whose permissions you updated.

   1. 
      + To add your assumable role(s) (access role) to your domain: In the **App Configurations** tab of the **Domain details** page, navigate to the **JupyterLab** section.
      + To add your assumable role(s) (access role) to your user profile: On the **Domain details** page, choose the **User profiles** tab and select the user profile using the SageMaker AI execution role whose permissions you updated. In the **App Configurations** tab, navigate to the **JupyterLab** section.

   1. Choose **Edit** and add the ARNs of your assumable role (access role).

   1. Choose **Submit**.

------
#### [ Associate your assumable roles in JupyterLab using a Python script ]

    In a JupyterLab application started from a space using the SageMaker AI execution role whose permissions you updated, run the following command in a terminal. Replace `domainID`, `user-profile-name`, `emr-accountID`, and `AssumableRole` (`EMRServiceRole` for [RBAC runtime roles]()) with their proper values. This code snippet updates the settings for a specific user profile (use `client.update_user_profile`) or for the domain (use `client.update_domain`) within a SageMaker AI domain. Specifically, it allows the JupyterLab application to assume a particular IAM role (`AssumableRole`) for running Amazon EMR clusters within the Amazon EMR account.

   ```
   import botocore.session
   import json

   sess = botocore.session.get_session()
   client = sess.create_client('sagemaker')

   client.update_user_profile(
       DomainId="domainID",
       UserProfileName="user-profile-name",
       UserSettings={
           'JupyterLabAppSettings': {
               'EmrSettings': {
                   'AssumableRoleArns': ["arn:aws:iam::emr-accountID:role/AssumableRole"],
                   'ExecutionRoleArns': ["arn:aws:iam::emr-accountID:role/EMRServiceRole",
                                         "arn:aws:iam::emr-accountID:role/AnotherServiceRole"]
               }
           }
       })

   resp = client.describe_user_profile(DomainId="domainID", UserProfileName="user-profile-name")

   # datetime values are not JSON serializable; stringify them before printing
   resp['CreationTime'] = str(resp['CreationTime'])
   resp['LastModifiedTime'] = str(resp['LastModifiedTime'])
   print(json.dumps(resp, indent=2))
   ```

------
#### [ For users of Studio Classic ]

   Provide the ARN of the `AssumableRole` to your Studio Classic execution role. The ARN is loaded by the Jupyter server at launch. The execution role used by Studio assumes that cross-account role to discover and connect to Amazon EMR clusters in the *trusting account*.

   You can specify this information by using Lifecycle Configuration (LCC) scripts. You can attach the LCC to your domain or a specific user profile. The LCC script that you use must be a JupyterServer configuration. For more information on how to create an LCC script, see [Use Lifecycle Configurations with Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lcc.html). 

   The following is an example LCC script. To modify the script, replace `AssumableRole` and `emr-account` with their respective values. The number of cross-accounts is limited to five.

   ```
   #!/bin/bash
   # This script creates the file that informs Studio Classic that the role
   # "arn:aws:iam::emr-account:role/AssumableRole" in remote account "emr-account"
   # must be assumed to list and describe Amazon EMR clusters in the remote account.
   
   set -eux
   
   FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
   FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
   FILE="$FILE_DIRECTORY/$FILE_NAME"
   
   mkdir -p "$FILE_DIRECTORY"
   
   cat > "$FILE" <<- "EOF"
   {
     "emr-cross-account1": "arn:aws:iam::emr-cross-account1:role/AssumableRole",
     "emr-cross-account2": "arn:aws:iam::emr-cross-account2:role/AssumableRole"
   }
   EOF
   ```

    After the LCC runs and the file is written, the server reads the file `/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE/emr-discovery-iam-role-arns-DO_NOT_DELETE.json` and stores the cross-account ARNs.

------
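Under the hood, cross-account discovery rests on the execution role assuming `AssumableRole` through AWS STS, then using the temporary credentials against the Amazon EMR API. The following sketch illustrates that flow; the session name is arbitrary and the ARN is a placeholder.

```python
def list_remote_emr_clusters(assumable_role_arn):
    """Assume the cross-account Access role, then list clusters in the trusting account."""
    import boto3  # imported inside the function so the sketch imports cleanly without AWS access

    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=assumable_role_arn,              # arn:aws:iam::emr-account:role/AssumableRole
        RoleSessionName="studio-emr-discovery",  # arbitrary session name
    )
    creds = resp["Credentials"]

    # Build an EMR client in the trusting account using the temporary credentials.
    emr = boto3.client(
        "emr",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
```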

# Configure listing Amazon EMR clusters
<a name="studio-notebooks-configure-discoverability-emr-cluster"></a>

Administrators can configure permissions for the SageMaker Studio execution role to grant users the ability to view the list of Amazon EMR clusters they have access to, allowing them to connect to these clusters. The clusters to which you want access can be deployed in the same AWS account as Studio (choose *Single account*) or in separate accounts (choose *Cross account*). The following page describes how to grant the permissions for viewing Amazon EMR clusters from Studio or Studio Classic.

**Important**  
You can only discover and connect to Amazon EMR clusters for JupyterLab and Studio Classic applications that are launched from private spaces. Ensure that the Amazon EMR clusters are located in the same AWS region as your Studio environment.

To let data scientists discover and then connect to Amazon EMR clusters from Studio or Studio Classic, follow these steps.

## Single account
<a name="studio-set-up-emr-permissions-singleaccount-list-clusters"></a>

If your Amazon EMR clusters and Studio or Studio Classic are deployed in the same AWS account, attach the following permissions to the SageMaker AI execution role accessing your cluster.

1. **Step 1**: Retrieve the ARN of the SageMaker AI execution role used by your private space.

   For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

   For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

1. **Step 2**: Attach the following permissions to the SageMaker AI execution role accessing your Amazon EMR clusters.

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. Choose **Roles** and then search for your execution role by name in the **Search** field. The role name is the last part of the ARN, after the last forward slash (/). 

   1. Follow the link to your role.

   1. Choose **Add permissions** and then **Create inline policy**.

   1. In the **JSON** tab, add the Amazon EMR permissions allowing Amazon EMR access and operations. For details on the policy document, see *List Amazon EMR policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

**Note**  
Users of role-based access control (RBAC) connectivity to Amazon EMR clusters should also refer to [Configure runtime role authentication when your Amazon EMR cluster and Studio are in the same account](studio-notebooks-emr-cluster-rbac.md#studio-notebooks-emr-cluster-iam-same). 

## Cross account
<a name="studio-set-up-emr-permissions-crossaccount-list-clusters"></a>

Before you get started, retrieve the ARN of the SageMaker AI execution role used by your private space.

For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

If your Amazon EMR clusters and Studio or Studio Classic are deployed in separate AWS accounts, you configure the permissions on both accounts.

**Note**  
Users of role-based access control (RBAC) connectivity to Amazon EMR clusters should also refer to [Configure runtime role authentication when your cluster and Studio are in different accounts](studio-notebooks-emr-cluster-rbac.md#studio-notebooks-emr-cluster-iam-diff). 

**On the Amazon EMR cluster account**

Follow these steps to create the necessary roles and policies on the account where Amazon EMR is deployed, also referred to as the *trusting account*:

1. **Step 1**: Retrieve the ARN of the [service role of your Amazon EMR cluster](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html). 

   To learn about how to find the ARN of the service role of a cluster, see [Configure IAM service roles for Amazon EMR permissions to AWS services and resources](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html#emr-iam-role-landing).

1. **Step 2**: Create a custom IAM role named `AssumableRole` with the following configuration:
   + Permissions: Grant the necessary permissions to `AssumableRole` to allow accessing Amazon EMR resources. This role is also known as an *Access role* in scenarios involving cross-account access.
   + Trust relationship: Configure the trust policy for `AssumableRole` to allow it to be assumed by the execution role (the `SageMakerExecutionRole` in the cross-account diagram) from the Studio account that requires access.

   By assuming the role, Studio or Studio Classic can gain temporary access to the permissions it needs in Amazon EMR.

   For detailed instructions on how to create a new `AssumableRole` in your Amazon EMR AWS account, follow these steps:

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. In the left navigation pane, choose **Policy**, and then **Create policy**.

   1. In the **JSON** tab, add the Amazon EMR permissions allowing Amazon EMR access and operations. For details on the policy document, see *List Amazon EMR policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. In the left navigation pane, choose **Roles** and then **Create role**.

   1. On the **Create role** page, choose **Custom trust policy** as the trusted entity.

   1. Paste in the following JSON document in the **Custom trust policy** section and then choose **Next**.

------
#### [ For users of Studio and JupyterLab ]

      In the following policy, replace `111122223333` with the Studio account ID, and `AmazonSageMaker-ExecutionRole` with the execution role used by your JupyterLab space.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

------
#### [ For users of Studio Classic ]

      In the following policy, replace `111122223333` with the Studio Classic account ID.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "arn:aws:iam::111122223333:root"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      ```

------

------

   1. In the **Add permissions** page, add the permission you just created and then choose **Next**.

   1. On the **Review** page, enter a name for the role such as `AssumableRole` and an optional description.

   1. Review the role details and choose **Create role**.

   For more information about creating a role on an AWS account, see [Creating an IAM role (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html).
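   The role-creation steps above can also be sketched with boto3. The helper names below are hypothetical, and the sketch assumes you run it with credentials in the Amazon EMR (trusting) account:

```python
import json


def build_trust_policy(studio_execution_role_arn: str) -> dict:
    """Trust policy allowing the Studio execution role to assume AssumableRole."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": studio_execution_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_assumable_role(role_name: str, trust_policy: dict):
    """Create the access role in the trusting account. Requires AWS credentials."""
    import boto3  # assumed available in the environment

    iam = boto3.client("iam")
    return iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description="Access role assumed by SageMaker Studio for Amazon EMR discovery",
    )


trust = build_trust_policy(
    "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole"
)
print(trust["Statement"][0]["Action"])
```

After `create_assumable_role("AssumableRole", trust)`, you would still attach the Amazon EMR permissions policy to the new role, for example with `iam.put_role_policy` as in the same-account section.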

**On the Studio account**

On the account where Studio is deployed, also referred to as the *trusted account*, update the SageMaker AI execution role accessing your clusters with the required permissions to access resources in the trusting account.

1. **Step 1**: Retrieve the ARN of the SageMaker AI execution role used by your private space.

   For information on spaces and execution roles in SageMaker AI, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md).

   For more information about how to retrieve the ARN of SageMaker AI's execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).

1. **Step 2**: Attach the following permissions to the SageMaker AI execution role accessing your Amazon EMR clusters.

   1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

   1. Choose **Roles** and then search for your execution role by name in the **Search** field. The role name is the last part of the ARN, after the last forward slash (/). 

   1. Follow the link to your role.

   1. Choose **Add permissions** and then **Create inline policy**.

   1. In the **JSON** tab, add the inline policy granting the role permissions to update the domains, user profiles, and spaces. For details on the policy document, see *Domain, user profile, and space update actions policy* in [Reference policies](studio-set-up-emr-permissions-reference.md). Replace the `region` and `accountID` with their actual values before copying the list of statements to the inline policy of your role.

   1. Choose **Next** and then provide a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the **Create inline policy** step to add another policy granting the execution role the permissions to assume the `AssumableRole` and then perform actions permitted by the role's access policy. Replace `emr-account` with the Amazon EMR account ID, and `AssumableRole` with the name of the assumable role created in the Amazon EMR account.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowRoleAssumptionForCrossAccountDiscovery",
                  "Effect": "Allow",
                  "Action": "sts:AssumeRole",
                  "Resource": [
                      "arn:aws:iam::111122223333:role/AssumableRole"
                  ]
              }
          ]
      }
      ```

------

   1. (Optional) To allow listing Amazon EMR clusters deployed in the same account as Studio, add an additional inline policy to your Studio execution role as defined in *List Amazon EMR policies* in [Reference policies](studio-set-up-emr-permissions-reference.md). 

1. **Step 3**: Associate your assumable role(s) (access role) with your domain or user profile. JupyterLab users in Studio can use the SageMaker AI console or the provided script.

    Choose the tab that corresponds to your use case.

------
#### [ Associate your assumable roles in JupyterLab using the SageMaker AI console ]

   To associate your assumable roles with your user profile or domain using the SageMaker AI console:

   1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation pane, choose **Domains**, and then select the domain that uses the SageMaker AI execution role whose permissions you updated.

   1. 
      + To add your assumable role(s) (access role) to your domain: In the **App Configurations** tab of the **Domain details** page, navigate to the **JupyterLab** section.
      + To add your assumable role(s) (access role) to your user profile: On the **Domain details** page, choose the **User profiles** tab, and then select the user profile that uses the SageMaker AI execution role whose permissions you updated. In the **App Configurations** tab, navigate to the **JupyterLab** section.

   1. Choose **Edit** and add the ARNs of your assumable role (access role).

   1. Choose **Submit**.

------
#### [ Associate your assumable roles in JupyterLab using a Python script ]

    In a JupyterLab application started from a space using the SageMaker AI execution role whose permissions you updated, run the following commands in a terminal. Replace `domainID`, `user-profile-name`, `emr-accountID`, and `AssumableRole` (`EMRServiceRole` for [RBAC runtime roles](studio-notebooks-emr-cluster-rbac.md)) with their proper values. This code snippet updates the settings for a specific user profile (use `client.update_user_profile`) or for the domain (use `client.update_domain`) within a SageMaker AI domain. Specifically, it allows the JupyterLab application to assume a particular IAM role (`AssumableRole`) for running Amazon EMR clusters within the Amazon EMR account.

   ```
   import botocore.session
   import json

   sess = botocore.session.get_session()
   client = sess.create_client('sagemaker')

   # To apply these settings at the domain level instead, use
   # client.update_domain with DefaultUserSettings.
   client.update_user_profile(
       DomainId="domainID",
       UserProfileName="user-profile-name",
       UserSettings={
           'JupyterLabAppSettings': {
               'EmrSettings': {
                   'AssumableRoleArns': ["arn:aws:iam::emr-accountID:role/AssumableRole"],
                   'ExecutionRoleArns': ["arn:aws:iam::emr-accountID:role/EMRServiceRole",
                                         "arn:aws:iam::emr-accountID:role/AnotherServiceRole"]
               }
           }
       })

   resp = client.describe_user_profile(DomainId="domainID", UserProfileName="user-profile-name")

   # datetime objects are not JSON serializable; convert them to strings first
   resp['CreationTime'] = str(resp['CreationTime'])
   resp['LastModifiedTime'] = str(resp['LastModifiedTime'])
   print(json.dumps(resp, indent=2))
   ```

------
#### [ For users of Studio Classic ]

   Provide the ARN of the `AssumableRole` to your Studio Classic execution role. The ARN is loaded by the Jupyter server at launch. The execution role used by Studio Classic assumes that cross-account role to discover and connect to Amazon EMR clusters in the *trusting account*.

   You can specify this information by using Lifecycle Configuration (LCC) scripts. You can attach the LCC to your domain or a specific user profile. The LCC script that you use must be a JupyterServer configuration. For more information on how to create an LCC script, see [Use Lifecycle Configurations with Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lcc.html). 

   The following is an example LCC script. To modify the script, replace `AssumableRole` and the `emr-cross-account` IDs with their respective values. The number of cross-accounts is limited to five.

   ```
   #!/bin/bash
   # This script creates the file that informs Studio Classic that the role
   # "arn:aws:iam::emr-account:role/AssumableRole" in the remote account "emr-account"
   # must be assumed to list and describe Amazon EMR clusters in the remote account.

   set -eux

   FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
   FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
   FILE="$FILE_DIRECTORY/$FILE_NAME"

   mkdir -p "$FILE_DIRECTORY"

   cat > "$FILE" <<- "EOF"
   {
     "emr-cross-account1": "arn:aws:iam::emr-cross-account1:role/AssumableRole",
     "emr-cross-account2": "arn:aws:iam::emr-cross-account2:role/AssumableRole"
   }
   EOF
   ```

    After the LCC runs and the files are written, the server reads the file `/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE/emr-discovery-iam-role-arns-DO_NOT_DELETE.json` and stores the cross-account ARN.
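Once the LCC has written the file with actual 12-digit account IDs in place of the placeholders, you can sanity-check its contents with a small parser like the following (a hypothetical helper, not part of Studio):

```python
import json
import re

# Rough shape of an IAM role ARN; IAM itself is the authority on valid names
ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def load_cross_account_roles(text: str) -> dict:
    """Parse the discovery file and validate each account-to-ARN mapping."""
    mapping = json.loads(text)
    for account_id, role_arn in mapping.items():
        if not (account_id.isdigit() and len(account_id) == 12):
            raise ValueError(f"not a 12-digit account ID: {account_id}")
        if not ARN_PATTERN.match(role_arn):
            raise ValueError(f"not an IAM role ARN: {role_arn}")
    return mapping


sample = '{"123456789012": "arn:aws:iam::123456789012:role/AssumableRole"}'
roles = load_cross_account_roles(sample)
print(len(roles))  # → 1
```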

------

Refer to [List Amazon EMR clusters from Studio or Studio Classic](discover-emr-clusters.md) to learn about how to discover and connect to Amazon EMR clusters from Studio or Studio Classic notebooks.

# Configure IAM runtime roles for Amazon EMR cluster access in Studio
<a name="studio-notebooks-emr-cluster-rbac"></a>

When you connect to an Amazon EMR cluster from your Studio or Studio Classic notebooks, you can visually browse a list of IAM roles, known as runtime roles, and select one on the fly. Subsequently, all your Apache Spark, Apache Hive, or Presto jobs created from your notebook access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with AWS Lake Formation, you can enforce table-level and column-level access using policies attached to the runtime role.

With this capability, you and your teammates can connect to the same cluster, each using a runtime role scoped with permissions matching your individual level of access to data. Your sessions are also isolated from one another on the shared cluster. 

To try out this feature using Studio Classic, see [ Apply fine-grained data access controls with AWS Lake Formation and Amazon EMR from Amazon SageMaker Studio Classic ](https://aws.amazon.com/blogs/machine-learning/apply-fine-grained-data-access-controls-with-aws-lake-formation-and-amazon-emr-from-amazon-sagemaker-studio/). This blog post helps you set up a demo environment where you can try using preconfigured runtime roles to connect to Amazon EMR clusters.

## Prerequisites
<a name="studio-notebooks-emr-cluster-rbac-prereq"></a>

Before you get started, make sure you meet the following prerequisites:
+ Use Amazon EMR version 6.9 or above.
+ **For Studio Classic users**: Use JupyterLab version 3 in the Studio Classic Jupyter server application configuration. This version supports Studio Classic connection to Amazon EMR clusters using runtime roles.

  **For Studio users**: Use a [SageMaker distribution image](sagemaker-distribution.md) version `1.10` or above.
+ Allow the use of runtime roles in your cluster's security configuration. For more information, see [ Runtime roles for Amazon EMR steps](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html).
+ Create a notebook with any of the kernels listed in [Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic](studio-emr-user-guide.md#studio-notebooks-emr-cluster-connect-kernels).
+ Make sure you review the instructions in [Set up Studio to use runtime IAM roles](#studio-notebooks-emr-cluster-iam) to configure your runtime roles.

## Cross-account connection scenarios
<a name="studio-notebooks-emr-cluster-rbac-scen"></a>

Runtime role authentication supports a variety of cross-account connection scenarios when your data resides outside of your Studio account. The following image shows three ways you can distribute your Amazon EMR cluster, data, and Amazon EMR runtime execution role across your Studio and data accounts: 

![\[Cross-account scenarios supported by runtime IAM role authentication.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio-emr-rbac-scenarios.png)


In option 1, your Amazon EMR cluster and Amazon EMR runtime execution role are in a separate data account from the Studio account. You define a separate Amazon EMR access role (also referred to as an *assumable role*) permission policy that grants the Studio or Studio Classic execution role permission to assume the Amazon EMR access role. The Amazon EMR access role then calls the Amazon EMR API `GetClusterSessionCredentials` on behalf of your Studio or Studio Classic execution role, giving you access to the cluster.

In option 2, your Amazon EMR cluster and Amazon EMR runtime execution role are in your Studio account. Your Studio execution role has permission to use the Amazon EMR API `GetClusterSessionCredentials` to gain access to your cluster. To access the Amazon S3 bucket, give the Amazon EMR runtime execution role cross-account Amazon S3 bucket access permissions. You grant these permissions within your Amazon S3 bucket policy.

In option 3, your Amazon EMR clusters are in your Studio account, and the Amazon EMR runtime execution role is in the data account. Your Studio or Studio Classic execution role has permission to use the Amazon EMR API `GetClusterSessionCredentials` to gain access to your cluster. Add the Amazon EMR runtime execution role into the execution role configuration JSON. Then you can select the role in the UI when you choose your cluster. For details about how to set up your execution role configuration JSON file, see [Preload your execution roles into Studio or Studio Classic](#studio-notebooks-emr-cluster-iam-preload).

## Set up Studio to use runtime IAM roles
<a name="studio-notebooks-emr-cluster-iam"></a>

To establish runtime role authentication for your Amazon EMR clusters, configure the required IAM policies, network, and usability enhancements. Your setup depends on whether your Amazon EMR clusters, your Amazon EMR runtime execution role, or both reside outside of your Studio account. The following sections guide you through the policies to install, how to configure the network to allow traffic between accounts, and the local configuration file to set up to automate your Amazon EMR connection.

### Configure runtime role authentication when your Amazon EMR cluster and Studio are in the same account
<a name="studio-notebooks-emr-cluster-iam-same"></a>

If your Amazon EMR cluster resides in your Studio account, complete the following steps to add necessary permissions to your Studio execution policy:

1. Add the required IAM policy to connect to Amazon EMR clusters. For details, see [Configure listing Amazon EMR clusters](studio-notebooks-configure-discoverability-emr-cluster.md).

1. Grant permission to call the Amazon EMR API `GetClusterSessionCredentials` when you pass one or more permitted Amazon EMR runtime execution roles specified in the policy.

1. (Optional) Grant permission to pass IAM roles that follow any user-defined naming conventions.

1. (Optional) Grant permission to access Amazon EMR clusters that are tagged with specific user-defined strings.

1. Preload your IAM roles so you can select the role to use when you connect to your Amazon EMR cluster. For details about how to preload your IAM roles, see [Preload your execution roles into Studio or Studio Classic](#studio-notebooks-emr-cluster-iam-preload).

The following example policy permits Amazon EMR runtime execution roles belonging to the modeling and training groups to call `GetClusterSessionCredentials`. In addition, the policyholder can access Amazon EMR clusters tagged with the strings `modeling` or `training`.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "elasticmapreduce:GetClusterSessionCredentials",
            "Resource": "*",
            "Condition": {
                "ArnLike": {
                    "elasticmapreduce:ExecutionRoleArn": [
                        "arn:aws:iam::111122223333:role/emr-execution-role-ml-modeling*",
                        "arn:aws:iam::111122223333:role/emr-execution-role-ml-training*"
			]},
		"StringLike":{
                    "elasticmapreduce:ResourceTag/group": [
                        "*modeling*",
                        "*training*"
                    ]
                }
            }
        }
    ]
}
```

------
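To get a feel for which runtime-role ARNs the `ArnLike` condition above would permit, the following is a rough local approximation using shell-style wildcard matching. IAM's actual policy evaluation is authoritative; this is only an illustration:

```python
from fnmatch import fnmatchcase

# Patterns copied from the ArnLike condition in the example policy
EXECUTION_ROLE_PATTERNS = [
    "arn:aws:iam::111122223333:role/emr-execution-role-ml-modeling*",
    "arn:aws:iam::111122223333:role/emr-execution-role-ml-training*",
]


def runtime_role_permitted(role_arn: str) -> bool:
    """True if the ARN matches any permitted pattern."""
    return any(fnmatchcase(role_arn, p) for p in EXECUTION_ROLE_PATTERNS)


print(runtime_role_permitted(
    "arn:aws:iam::111122223333:role/emr-execution-role-ml-modeling-teamA"))  # → True
print(runtime_role_permitted(
    "arn:aws:iam::111122223333:role/emr-execution-role-analytics"))  # → False
```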

### Configure runtime role authentication when your cluster and Studio are in different accounts
<a name="studio-notebooks-emr-cluster-iam-diff"></a>

If your Amazon EMR cluster is not in your Studio account, allow your SageMaker AI execution role to assume the cross-account Amazon EMR access role so you can connect to the cluster. Complete the following steps to set up your cross-account configuration:

1. Create your SageMaker AI execution role permission policy so that the execution role can assume the Amazon EMR access role. The following policy is an example:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",
       "Statement": [
           {
               "Sid": "AllowAssumeCrossAccountEMRAccessRole",
               "Effect": "Allow",
               "Action": "sts:AssumeRole",
               "Resource": "arn:aws:iam::111122223333:role/emr-access-role-name"
           }
       ]
   }
   ```

------

1. Create the trust policy to specify which Studio account IDs are trusted to assume the Amazon EMR access role. The following policy is an example:

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",
     "Statement": [
         {
           "Sid": "AllowCrossAccountSageMakerExecutionRoleToAssumeThisRole",
           "Effect": "Allow",
           "Principal": {
             "AWS": "arn:aws:iam::111122223333:role/studio_execution_role"
           },
           "Action": "sts:AssumeRole"
         }
       ]
   }
   ```

------

1. Create the Amazon EMR access role permission policy, which grants the Amazon EMR runtime execution role the needed permissions to carry out the intended tasks on the cluster. Configure the Amazon EMR access role to call the API `GetClusterSessionCredentials` with the Amazon EMR runtime execution roles specified in the access role permission policy. The following policy is an example:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",
       "Statement": [
           {
               "Sid": "AllowCallingEmrGetClusterSessionCredentialsAPI",
               "Effect": "Allow",
               "Action": "elasticmapreduce:GetClusterSessionCredentials",
               "Resource": "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/cluster-id",
               "Condition": {
                   "StringLike": {
                       "elasticmapreduce:ExecutionRoleArn": [
                           "arn:aws:iam::111122223333:role/emr-execution-role-name"
                       ]
                   }
               }
           }
       ]
   }
   ```

------

1. Set up the cross-account network so that traffic can move back and forth between your accounts. For guided instruction, see [Configure network access for your Amazon EMR cluster](studio-notebooks-emr-networking.md). The steps in this section help you complete the following tasks:

   1. VPC-peer your Studio account and your Amazon EMR account to establish a connection.

   1. Manually add routes to the private subnet route tables in both accounts. This permits creation and connection of Amazon EMR clusters from the Studio account to the remote account's private subnet.

   1. Set up the security group attached to your Studio domain to allow outbound traffic and the security group of the Amazon EMR primary node to allow inbound TCP traffic from the Studio instance security group.

1. Preload your IAM runtime roles so you can select the role to use when you connect to your Amazon EMR cluster. For details about how to preload your IAM roles, see [Preload your execution roles into Studio or Studio Classic](#studio-notebooks-emr-cluster-iam-preload).
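At connection time, the pieces above come together roughly as follows: the Studio execution role assumes the Amazon EMR access role, and the temporary credentials returned by AWS STS are used to call Amazon EMR in the cluster account. This is a hedged sketch; the role ARN, session name, and helper names are illustrative:

```python
def credentials_to_client_kwargs(creds: dict) -> dict:
    """Map an AssumeRole Credentials payload to boto3 client keyword arguments."""
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def emr_client_via_access_role(access_role_arn: str, region: str):
    """Assume the cross-account access role, then build an EMR client
    from the temporary credentials. Requires AWS credentials."""
    import boto3  # assumed available in the environment

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=access_role_arn,
        RoleSessionName="studio-emr-discovery",
    )["Credentials"]
    return boto3.client("emr", region_name=region,
                        **credentials_to_client_kwargs(creds))


# Offline demonstration of the credential mapping only
fake = {"AccessKeyId": "AKIAEXAMPLE", "SecretAccessKey": "secret", "SessionToken": "token"}
print(sorted(credentials_to_client_kwargs(fake)))
```

With real credentials, `emr_client_via_access_role("arn:aws:iam::111122223333:role/AssumableRole", "us-east-1").list_clusters()` would list the clusters that the access role's permission policy allows.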

### Configure Lake Formation access
<a name="studio-notebooks-emr-cluster-iam-lf"></a>

When you access data from data lakes managed by AWS Lake Formation, you can enforce table-level and column-level access using policies attached to your runtime role. To configure permission for Lake Formation access, see [Integrate Amazon EMR with AWS Lake Formation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lake-formation.html).

### Preload your execution roles into Studio or Studio Classic
<a name="studio-notebooks-emr-cluster-iam-preload"></a>

You can preload your IAM runtime roles so you can select the role to use when you connect to your Amazon EMR cluster. Users of JupyterLab in Studio can use the SageMaker AI console or the provided script.

------
#### [ Preload runtime roles in JupyterLab using the SageMaker AI console ]

To associate your runtime roles with your user profile or domain using the SageMaker AI console:

1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Domains**, and then select the domain that uses the SageMaker AI execution role whose permissions you updated.

1. 
   + To add your runtime roles (and access roles for cross-account use cases) to your domain: In the **App Configurations** tab of the **Domain details** page, navigate to the **JupyterLab** section.
   + To add your runtime roles (and access roles for cross-account use cases) to your user profile: On the **Domain details** page, choose the **User profiles** tab, and then select the user profile that uses the SageMaker AI execution role whose permissions you updated. In the **App Configurations** tab, navigate to the **JupyterLab** section.

1. Choose **Edit** and add the ARNs of your access role (assumable role) and Amazon EMR runtime execution roles.

1. Choose **Submit**.

When you next connect to an Amazon EMR cluster, the runtime roles appear in a drop-down menu for selection.

------
#### [ Preload runtime roles in JupyterLab using a Python script ]

In a JupyterLab application started from a space using the SageMaker AI execution role whose permissions you updated, run the following commands in a terminal. Replace `domainID`, `user-profile-name`, `emr-accountID`, and `EMRServiceRole` with their proper values. This code snippet updates the settings of a user profile (`client.update_user_profile`) within a SageMaker AI domain for a cross-account use case. Specifically, it sets the service roles for Amazon EMR. It also allows the JupyterLab application to assume a particular IAM role (`AssumableRole` or `AccessRole`) for running Amazon EMR within the Amazon EMR account.

Alternatively, use `client.update_domain` to update the domain settings if your space uses an execution role set at the domain level.

```
import botocore.session
import json

sess = botocore.session.get_session()
client = sess.create_client('sagemaker')

client.update_user_profile(
    DomainId="domainID",
    UserProfileName="user-profile-name",
    UserSettings={
        'JupyterLabAppSettings': {
            'EmrSettings': {
                'AssumableRoleArns': ["arn:aws:iam::emr-accountID:role/AssumableRole"],
                'ExecutionRoleArns': ["arn:aws:iam::emr-accountID:role/EMRServiceRole",
                                      "arn:aws:iam::emr-accountID:role/AnotherServiceRole"]
            }
        }
    })

resp = client.describe_user_profile(DomainId="domainID", UserProfileName="user-profile-name")

# datetime objects are not JSON serializable; convert them to strings first
resp['CreationTime'] = str(resp['CreationTime'])
resp['LastModifiedTime'] = str(resp['LastModifiedTime'])
print(json.dumps(resp, indent=2))
```
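If your space inherits its execution role from the domain rather than a user profile, the same `EmrSettings` payload can be applied with `update_domain` through `DefaultUserSettings`. A sketch, with placeholder role names:

```python
def build_emr_settings(emr_account_id: str) -> dict:
    """EmrSettings payload shared by update_user_profile and update_domain.
    Role names are placeholders."""
    return {
        "JupyterLabAppSettings": {
            "EmrSettings": {
                "AssumableRoleArns": [
                    f"arn:aws:iam::{emr_account_id}:role/AssumableRole"
                ],
                "ExecutionRoleArns": [
                    f"arn:aws:iam::{emr_account_id}:role/EMRServiceRole"
                ],
            }
        }
    }


def preload_at_domain_level(domain_id: str, emr_account_id: str):
    """Apply the settings at the domain level. Requires AWS credentials."""
    import botocore.session

    client = botocore.session.get_session().create_client("sagemaker")
    return client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings=build_emr_settings(emr_account_id),
    )


settings = build_emr_settings("123456789012")
print(settings["JupyterLabAppSettings"]["EmrSettings"]["AssumableRoleArns"][0])
```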

------
#### [ Preload runtime roles in Studio Classic ]

Provide the ARN of the `AccessRole` (`AssumableRole`) to your SageMaker AI execution role. The ARN is loaded by the Jupyter server at launch. The execution role used by Studio Classic assumes that cross-account role to discover and connect to Amazon EMR clusters in the *trusting account*.

You can specify this information by using Lifecycle Configuration (LCC) scripts. You can attach the LCC to your domain or a specific user profile. The LCC script that you use must be a JupyterServer configuration. For more information on how to create an LCC script, see [Use Lifecycle Configurations with Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lcc.html). 

To modify the following example LCC scripts, replace the account IDs and role ARNs with their respective values. The number of cross-accounts is limited to five.

The following snippet is an example LCC bash script you can apply if your Studio Classic application and cluster are in the same account:

```
#!/bin/bash

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.sagemaker-analytics-configuration-DO_NOT_DELETE"
FILE_NAME="emr-configurations-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat << 'EOF' > "$FILE"
{
    "emr-execution-role-arns":
    {
      "123456789012": [
          "arn:aws:iam::123456789012:role/emr-execution-role-1",
          "arn:aws:iam::123456789012:role/emr-execution-role-2"
      ]
    }
}
EOF
```

If your Studio Classic application and clusters are in different accounts, specify the Amazon EMR runtime execution roles that can use the cluster. In the following example script, *123456789012* is the Amazon EMR cluster account ID, and *212121212121* and *434343434343* are the account IDs of the permitted Amazon EMR runtime execution roles.

```
#!/bin/bash

set -eux

FILE_DIRECTORY="/home/sagemaker-user/.sagemaker-analytics-configuration-DO_NOT_DELETE"
FILE_NAME="emr-configurations-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat << 'EOF' > "$FILE"
{
    "emr-execution-role-arns":
    {
      "123456789012": [
          "arn:aws:iam::212121212121:role/emr-execution-role-1",
          "arn:aws:iam::434343434343:role/emr-execution-role-2"
      ]
    }
}
EOF

# add your cross-account EMR access role
FILE_DIRECTORY="/home/sagemaker-user/.cross-account-configuration-DO_NOT_DELETE"
FILE_NAME="emr-discovery-iam-role-arns-DO_NOT_DELETE.json"
FILE="$FILE_DIRECTORY/$FILE_NAME"

mkdir -p $FILE_DIRECTORY

cat << 'EOF' > "$FILE"
{
    "123456789012": "arn:aws:iam::123456789012:role/cross-account-emr-access-role"
}
EOF
```

------

# Reference policies
<a name="studio-set-up-emr-permissions-reference"></a>
+ **List Amazon EMR policies**: This policy allows performing the following actions:
  + `AllowPresignedUrl` allows generating pre-signed URLs for accessing the Spark UI from within Studio.
  + `AllowClusterDiscovery` and `AllowClusterDetailsDiscovery` allow listing and describing Amazon EMR clusters in the provided region and account.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",
      "Statement": [
          {
              "Sid": "AllowPresignedUrl",
              "Effect": "Allow",
              "Action": [
                  "elasticmapreduce:CreatePersistentAppUI",
                  "elasticmapreduce:DescribePersistentAppUI",
                  "elasticmapreduce:GetPersistentAppUIPresignedURL",
                  "elasticmapreduce:GetOnClusterAppUIPresignedURL"
              ],
              "Resource": [
                  "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/*"
              ]
          },
          {
              "Sid": "AllowClusterDetailsDiscovery",
              "Effect": "Allow",
              "Action": [
                  "elasticmapreduce:DescribeCluster",
                  "elasticmapreduce:ListInstances",
                  "elasticmapreduce:ListInstanceGroups",
                  "elasticmapreduce:DescribeSecurityConfiguration"
              ],
              "Resource": [
                  "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/*"
              ]
          },
          {
              "Sid": "AllowClusterDiscovery",
              "Effect": "Allow",
              "Action": [
                  "elasticmapreduce:ListClusters"
              ],
              "Resource": "*"
          }
      ]
  }
  ```

------
+ **Create Amazon EMR clusters policies**: This policy allows performing the following actions:
  + `AllowEMRTemplateDiscovery` allows searching for Amazon EMR templates in the Service Catalog. Studio and Studio Classic use this to show available templates.
  + `AllowSagemakerProjectManagement` enables the creation of [SageMaker AI projects](sagemaker-projects-whatis.md). In Studio or Studio Classic, access to the AWS Service Catalog is managed through SageMaker AI projects.

  The IAM policy defined in the provided JSON grants those permissions. Replace *region* and *accountID* with your actual region and AWS account ID values before copying the list of statements to the inline policy of your role.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",
      "Statement": [
          {
              "Sid": "AllowEMRTemplateDiscovery",
              "Effect": "Allow",
              "Action": [
                  "servicecatalog:SearchProducts"
              ],
              "Resource": "*"
          },
          {
              "Sid": "AllowSagemakerProjectManagement",
              "Effect": "Allow",
              "Action": [
                  "sagemaker:CreateProject",
                  "sagemaker:DeleteProject"
              ],
              "Resource": "arn:aws:sagemaker:us-east-1:111122223333:project/*"
          }
      ]
  }
  ```

------
+ **Domain, user profile, and space update actions policy**: The following policy grants permissions to update SageMaker AI domains, user profiles, and spaces within the specified region and AWS account.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "SageMakerUpdateResourcesPolicy",
              "Effect": "Allow",
              "Action": [
                  "sagemaker:UpdateDomain",
                  "sagemaker:UpdateUserprofile",
                  "sagemaker:UpdateSpace"
              ],
              "Resource": [
                  "arn:aws:sagemaker:us-east-1:111122223333:domain/*",
                  "arn:aws:sagemaker:us-east-1:111122223333:user-profile/*"
              ]
          }
      ]
  }
  ```

------
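Rather than pasting JSON in the console, you can attach these inline policies programmatically with boto3. The following is a minimal sketch, adapted from the update actions policy above; the role name is hypothetical, and the `put_role_policy` call is commented out so the snippet runs without AWS credentials:

```python
import json

def render_policy(template: str, region: str, account_id: str) -> dict:
    """Replace the example Region and account ID in a policy document."""
    doc = template.replace("us-east-1", region).replace("111122223333", account_id)
    return json.loads(doc)

# Example document adapted from the update actions policy in this guide.
POLICY_TEMPLATE = """{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "SageMakerUpdateResourcesPolicy",
        "Effect": "Allow",
        "Action": [
            "sagemaker:UpdateDomain",
            "sagemaker:UpdateUserProfile",
            "sagemaker:UpdateSpace"
        ],
        "Resource": [
            "arn:aws:sagemaker:us-east-1:111122223333:domain/*",
            "arn:aws:sagemaker:us-east-1:111122223333:user-profile/*",
            "arn:aws:sagemaker:us-east-1:111122223333:space/*"
        ]
    }]
}"""

policy = render_policy(POLICY_TEMPLATE, "eu-west-1", "123456789012")

# Attach as an inline policy (requires iam:PutRolePolicy permission):
# import boto3
# boto3.client("iam").put_role_policy(
#     RoleName="MyStudioExecutionRole",  # hypothetical role name
#     PolicyName="SageMakerUpdateResourcesPolicy",
#     PolicyDocument=json.dumps(policy),
# )
```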

# User guide
<a name="studio-emr-user-guide"></a>

This section covers how data scientists and data engineers can launch, discover, connect to, or terminate an Amazon EMR cluster from Studio or Studio Classic.

Before users can list or launch clusters, administrators must have configured the necessary settings in the Studio environment. For information on how administrators can configure a Studio environment to allow self-provisioning and listing of Amazon EMR clusters, see [Admin guide](studio-emr-admin-guide.md).

**Topics**
+ [Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic](#studio-notebooks-emr-cluster-connect-kernels)
+ [Bring your own image](#studio-notebooks-emr-byoi)
+ [Launch an Amazon EMR cluster from Studio or Studio Classic](studio-notebooks-launch-emr-cluster-from-template.md)
+ [List Amazon EMR clusters from Studio or Studio Classic](discover-emr-clusters.md)
+ [Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic](connect-emr-clusters.md)
+ [Terminate an Amazon EMR cluster from Studio or Studio Classic](terminate-emr-clusters.md)
+ [Access Spark UI from Studio or Studio Classic](studio-notebooks-access-spark-ui.md)

## Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic
<a name="studio-notebooks-emr-cluster-connect-kernels"></a>

The following images and kernels come with [sagemaker-studio-analytics-extension](https://pypi.org/project/sagemaker-studio-analytics-extension/), the JupyterLab extension that connects to a remote Spark (Amazon EMR) cluster via the [SparkMagic](https://github.com/jupyter-incubator/sparkmagic) library using [Apache Livy](https://livy.apache.org/).
+ **For Studio users:** SageMaker Distribution is a Docker environment for data science used as the default image of JupyterLab notebook instances. All versions of [SageMaker AI Distribution](https://github.com/aws/sagemaker-distribution) come with `sagemaker-studio-analytics-extension` pre-installed.
+ **For Studio Classic users:** The following images come pre-installed with `sagemaker-studio-analytics-extension`:
  + DataScience – Python 3 kernel
  + DataScience 2.0 – Python 3 kernel
  + DataScience 3.0 – Python 3 kernel
  + SparkAnalytics 1.0 – SparkMagic and PySpark kernels
  + SparkAnalytics 2.0 – SparkMagic and PySpark kernels
  + SparkMagic – SparkMagic and PySpark kernels
  + PyTorch 1.8 – Python 3 kernel
  + TensorFlow 2.6 – Python 3 kernel
  + TensorFlow 2.11 – Python 3 kernel

To connect to Amazon EMR clusters using another built-in image or your own image, follow the instructions in [Bring your own image](#studio-notebooks-emr-byoi).
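To verify from a notebook whether the current kernel already ships the extension, you can query its installed version; a small sketch using the package name as published on PyPI:

```python
from importlib.metadata import version, PackageNotFoundError

try:
    print("sagemaker-studio-analytics-extension", version("sagemaker-studio-analytics-extension"))
except PackageNotFoundError:
    # Not pre-installed in this kernel; see the Bring your own image section.
    print("sagemaker-studio-analytics-extension is not installed")
```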

## Bring your own image
<a name="studio-notebooks-emr-byoi"></a>

To bring your own image in Studio or Studio Classic and allow your notebooks to connect to Amazon EMR clusters, install the [sagemaker-studio-analytics-extension](https://pypi.org/project/sagemaker-studio-analytics-extension/) extension in your kernel. It supports connecting SageMaker Studio or Studio Classic notebooks to Spark (Amazon EMR) clusters through the [SparkMagic](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-magics.html) library.

```
pip install sparkmagic
pip install sagemaker-studio-sparkmagic-lib
pip install sagemaker-studio-analytics-extension
```

Additionally, to connect to Amazon EMR with [Kerberos](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos.html) authentication, you must install the kinit client. The command to install the kinit client varies by OS. For Ubuntu (Debian-based) images, use the `apt-get install -y -qq krb5-user` command.

For more information on bringing your own image in SageMaker Studio or Studio Classic, see [Bring your own SageMaker image](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html).

# Launch an Amazon EMR cluster from Studio or Studio Classic
<a name="studio-notebooks-launch-emr-cluster-from-template"></a>

Data scientists and data engineers can self-provision Amazon EMR clusters from Studio or Studio Classic using CloudFormation templates set up by their administrators. Before users can launch a cluster, administrators must have configured the necessary settings in the Studio environment. For information on how administrators can configure a Studio environment to allow self-provisioning Amazon EMR clusters, see [Configure Amazon EMR CloudFormation templates in the Service Catalog](studio-notebooks-set-up-emr-templates.md).

To provision a new Amazon EMR cluster from Studio or Studio Classic:

1. In the left navigation menu of the Studio or Studio Classic UI, select the **Data** node, then navigate down to **Amazon EMR Clusters**. This opens a page listing the Amazon EMR clusters that you can access from Studio or Studio Classic.

1. Choose the **Create** button at the top right corner. This opens up a new modal listing the cluster templates available to you.

1. Select a cluster template by choosing a template name and then choose **Next**.

1. Enter the cluster's details, such as a cluster name and any specific configurable parameter set by your administrator, and then choose **Create cluster**. The creation of the cluster might take a couple of minutes.  
![\[Creation form of an Amazon EMR cluster from Studio or Studio Classic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/emr/studio-notebooks-emr-cluster-creation.png)

Once the cluster is provisioned, the Studio or Studio Classic UI displays a *The cluster has been successfully created* message.

To connect to your cluster, see [Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic](connect-emr-clusters.md).

# List Amazon EMR clusters from Studio or Studio Classic
<a name="discover-emr-clusters"></a>

Data scientists and data engineers can discover and then connect to Amazon EMR clusters from Studio. The Amazon EMR clusters may be in the same AWS account as Studio or in a different AWS account.

Before users can list or connect to clusters, administrators must have configured the necessary settings in the Studio environment. For information on how administrators can configure a Studio environment to allow discovering running Amazon EMR clusters, see [Admin guide](studio-emr-admin-guide.md). If your administrator [configured the cross-account discovery of Amazon EMR clusters](studio-notebooks-configure-discoverability-emr-cluster.md), you can view a consolidated list of clusters. The list includes clusters from the AWS account used by Studio as well as clusters from remote accounts that you have been granted access to.

To view the list of available Amazon EMR clusters from within Studio:

1. In the Studio UI's left navigation menu, scroll down to **EMR Clusters**. This opens up a page listing the Amazon EMR clusters that you have access to.

   The list displays clusters in the following stages: **Bootstrapping**, **Starting**, **Running**, and **Waiting**. You can narrow down the displayed clusters by their current status using the filter icon. 

1. Choose a particular **Running** cluster you want to connect to, and then refer to [Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic](connect-emr-clusters.md).

# Connect to an Amazon EMR cluster from SageMaker Studio or Studio Classic
<a name="connect-emr-clusters"></a>

Data scientists and data engineers can discover and then connect to an Amazon EMR cluster directly from the Studio user interface. Before you begin, ensure that you have configured the necessary permissions as described in the [Step 4: Set up the permissions to enable listing and launching Amazon EMR clusters from Studio](studio-notebooks-set-up-emr-templates.md#studio-emr-permissions) section. These permissions grant Studio the ability to create, start, view, access, and terminate clusters.

You can connect an Amazon EMR cluster to a new JupyterLab notebook directly from the Studio UI, or choose to initiate the connection in a notebook of a running JupyterLab application.

**Important**  
You can only discover and connect to Amazon EMR clusters for JupyterLab and Studio Classic applications that are launched from private spaces. Ensure that the Amazon EMR clusters are located in the same AWS region as your Studio environment. Your JupyterLab space must use a SageMaker Distribution image version `1.10` or higher.

## Connect to an Amazon EMR cluster using the Studio UI
<a name="connect-emr-clusters-ui-options"></a>

To connect to your cluster using the Studio or Studio Classic UI, you can either initiate a connection from the list of clusters accessed in [List Amazon EMR clusters from Studio or Studio Classic](discover-emr-clusters.md), or from a notebook in SageMaker Studio or Studio Classic.

**To connect an Amazon EMR cluster to a new JupyterLab notebook from the Studio UI:**

1. In the Studio UI's left-side panel, select the **Data** node in the left navigation menu. Navigate down to **Amazon EMR applications and clusters**. This opens up a page listing the Amazon EMR clusters that you can access from Studio in the **Amazon EMR clusters** tab.
**Note**  
If you or your administrator has configured the permissions to allow cross-account access to Amazon EMR clusters, you can view a consolidated list of clusters across all accounts that have been granted access to Studio.

1. Select an Amazon EMR cluster you want to connect to a new notebook, and then choose **Attach to notebook**. This opens up a modal window displaying the list of your JupyterLab spaces.

1. 
   + Select the space from which you want to launch a JupyterLab application, and then choose **Open notebook**. This launches a JupyterLab application from your chosen space and opens a new notebook.
**Note**  
Users of Studio Classic need to select an image and kernel. For a list of supported images, see [Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic](studio-emr-user-guide.md#studio-notebooks-emr-cluster-connect-kernels) or refer to [Bring your own image](studio-emr-user-guide.md#studio-notebooks-emr-byoi).
   + Alternatively, you can create a new private space by choosing the **Create new space** button at the top of the modal window. Enter a name for your space and then choose **Create space and open notebook**. This creates a private space with the default instance type and latest SageMaker distribution image available, launches a JupyterLab application, and opens a new notebook.

1. If the cluster you select does not use Kerberos, LDAP, or [runtime role](studio-notebooks-emr-cluster-rbac.md) authentication, Studio prompts you to select the credential type. Choose from **HTTP basic authentication** or **No credentials**, then enter your credentials, if applicable.

   If the cluster you select supports runtime roles, choose the name of the IAM role that your Amazon EMR cluster can assume for the job run. 
**Important**  
To successfully connect a JupyterLab notebook to an Amazon EMR cluster supporting runtime roles, you must first associate the list of runtime roles with your domain or user profile, as outlined in [Configure IAM runtime roles for Amazon EMR cluster access in Studio](studio-notebooks-emr-cluster-rbac.md). Failing to complete this step will prevent you from establishing the connection. 

   Upon selection, a connection command populates the first cell of your notebook and initiates the connection with the Amazon EMR cluster.

   Once the connection succeeds, a message confirms the connection and the start of the Spark application.

**Alternatively, you can connect to a cluster from a JupyterLab or Studio Classic notebook.**

1. Choose the **Cluster** button at the top of your notebook. This opens a modal window listing the Amazon EMR clusters in a `Running` state that you can access. You can see the `Running` Amazon EMR clusters in the **Amazon EMR clusters** tab.
**Note**  
For the users of Studio Classic, **Cluster** is only visible when you use a kernel from [Supported images and kernels to connect to an Amazon EMR cluster from Studio or Studio Classic](studio-emr-user-guide.md#studio-notebooks-emr-cluster-connect-kernels) or from [Bring your own image](studio-emr-user-guide.md#studio-notebooks-emr-byoi). If you cannot see **Cluster** at the top of your notebook, ensure that your administrator has [configured the discoverability of your clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-configure-discoverability-emr-cluster.html) and switch to a supported kernel.

1. Select the cluster to which you want to connect, then choose **Connect**.

1. If you configured your Amazon EMR clusters to support [runtime IAM roles](studio-notebooks-emr-cluster-rbac.md), you can select your role from the **Amazon EMR execution role** drop down menu. 
**Important**  
To successfully connect a JupyterLab notebook to an Amazon EMR cluster supporting runtime roles, you must first associate the list of runtime roles with your domain or user profile, as outlined in [Configure IAM runtime roles for Amazon EMR cluster access in Studio](studio-notebooks-emr-cluster-rbac.md). Failing to complete this step will prevent you from establishing the connection. 

   Otherwise, if the cluster you choose does not use Kerberos, LDAP, or runtime role authentication, Studio or Studio Classic prompts you to select the credential type. You can choose **HTTP basic authentication** or **No credential**.

1. Studio adds and then runs a code block in an active cell to establish the connection. This cell contains the connection magic command to connect your notebook to your application according to your authentication type.

   Once the connection succeeds, a message confirms the connection and the start of the Spark application.

## Connect to an Amazon EMR cluster using a connection command
<a name="connect-emr-clusters-manually"></a>

To establish a connection to an Amazon EMR cluster, you can execute connection commands within a notebook cell.

When establishing the connection, you can authenticate using [Kerberos](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos.html), [Lightweight Directory Access Protocol (LDAP)](https://docs.aws.amazon.com/), or [runtime IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster-rbac.html) authentication. The authentication method you choose depends on your cluster configuration. 

You can refer to this example [Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon EMR cluster](https://aws.amazon.com/blogs/big-data/access-apache-livy-using-a-network-load-balancer-on-a-kerberos-enabled-amazon-emr-cluster/) to set up an Amazon EMR cluster that uses Kerberos authentication. Alternatively, you can explore the CloudFormation example templates using Kerberos or LDAP authentication in the [aws-samples/sagemaker-studio-emr](https://github.com/aws-samples/sagemaker-studio-emr/tree/main/cloudformation/getting_started) GitHub repository.

If your administrator has enabled cross-account access, you can connect to your Amazon EMR cluster from a Studio Classic notebook, regardless of whether your Studio Classic application and cluster reside in the same AWS account or different accounts.

For each of the following authentication types, use the specified command to connect to your cluster from your Studio or Studio Classic notebook.
+ **Kerberos**

  Append the `--assumable-role-arn` argument if you need cross-account Amazon EMR access. Append the `--verify-certificate` argument if you connect to your cluster with HTTPS.

  ```
  %load_ext sagemaker_studio_analytics_extension.magics
  %sm_analytics emr connect --cluster-id cluster_id \
  --auth-type Kerberos --language python 
  [--assumable-role-arn EMR_access_role_ARN ] 
  [--verify-certificate /home/user/certificateKey.pem]
  ```
+ **LDAP**

  Append the `--assumable-role-arn` argument if you need cross-account Amazon EMR access. Append the `--verify-certificate` argument if you connect to your cluster with HTTPS.

  ```
  %load_ext sagemaker_studio_analytics_extension.magics
  %sm_analytics emr connect --cluster-id cluster_id \
  --auth-type Basic_Access --language python 
  [--assumable-role-arn EMR_access_role_ARN ]
  [--verify-certificate /home/user/certificateKey.pem]
  ```
+ **NoAuth**

  Append the `--assumable-role-arn` argument if you need cross-account Amazon EMR access. Append the `--verify-certificate` argument if you connect to your cluster with HTTPS.

  ```
  %load_ext sagemaker_studio_analytics_extension.magics
  %sm_analytics emr connect --cluster-id cluster_id \
  --auth-type None --language python
  [--assumable-role-arn EMR_access_role_ARN ]
  [--verify-certificate /home/user/certificateKey.pem]
  ```
+ **Runtime IAM roles**

  Append the `--assumable-role-arn` argument if you need cross-account Amazon EMR access. Append the `--verify-certificate` argument if you connect to your cluster with HTTPS. 

  For more information on connecting to an Amazon EMR cluster using runtime IAM roles, see [Configure IAM runtime roles for Amazon EMR cluster access in Studio](studio-notebooks-emr-cluster-rbac.md).

  ```
  %load_ext sagemaker_studio_analytics_extension.magics
  %sm_analytics emr connect --cluster-id cluster_id \
  --auth-type Basic_Access \
  --emr-execution-role-arn arn:aws:iam::studio_account_id:role/emr-execution-role-name
  [--assumable-role-arn EMR_access_role_ARN]
  [--verify-certificate /home/user/certificateKey.pem]
  ```

## Connect to an Amazon EMR cluster over HTTPS
<a name="connect-emr-clusters-ssl"></a>

If you have configured your Amazon EMR cluster with transit encryption enabled and the Apache Livy server for HTTPS, and you would like Studio or Studio Classic to communicate with Amazon EMR over HTTPS, you need to configure Studio or Studio Classic to access your certificate key.

For self-signed or local Certificate Authority (CA) signed certificates, you can do this in two steps:

1. Download the PEM file of your certificate to your local file system using one of the following options:
   + Jupyter's built-in file upload function.
   + A notebook cell.
   + (For Studio Classic users only) A lifecycle configuration (LCC) script.

     For information on how to use an LCC script, see [Customize a Notebook Instance Using a Lifecycle Configuration Script](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html)

1. Enable the validation of the certificate by providing the path to your certificate in the `--verify-certificate` argument of your connection command.

   ```
   %sm_analytics emr connect --cluster-id cluster_id \
   --verify-certificate /home/user/certificateKey.pem ...
   ```

For certificates issued by a public CA, enable certificate validation by setting the `--verify-certificate` parameter to `true`.

Alternatively, you can disable certificate validation by setting the `--verify-certificate` parameter to `false`.
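For example, a connection command validating a public CA issued certificate might look like the following (the cluster ID is a placeholder, and the flags come from the connection commands listed earlier):

```
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id cluster_id \
--auth-type Basic_Access --language python \
--verify-certificate true
```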

You can find the list of available connection commands to an Amazon EMR cluster in [Connect to an Amazon EMR cluster using a connection command](#connect-emr-clusters-manually).

# Terminate an Amazon EMR cluster from Studio or Studio Classic
<a name="terminate-emr-clusters"></a>

The following procedure shows how to terminate an Amazon EMR cluster from a Studio or Studio Classic notebook.

**To terminate a cluster in a `Running` state, navigate to the list of available Amazon EMR clusters.**

1. In the Studio UI, scroll down to the **Data** node in the left navigation menu.

1. Navigate down to the **EMR Clusters** node. This opens up a page listing the Amazon EMR clusters that you have access to.

1. Select the name of the cluster that you want to terminate, and then choose **Terminate**.

1. This opens up a confirmation window informing you that any pending work or data on your cluster will be lost permanently after termination. Confirm by choosing **Terminate** again.

# Access Spark UI from Studio or Studio Classic
<a name="studio-notebooks-access-spark-ui"></a>

The following sections give instructions for accessing the Spark UI from SageMaker AI Studio or Studio Classic notebooks. The Spark UI allows you to monitor and debug Spark jobs submitted to run on Amazon EMR from Studio or Studio Classic notebooks. SSH tunneling and presigned URLs are two ways to access the Spark UI.

## Set up SSH tunneling for Spark UI access
<a name="studio-notebooks-emr-ssh-tunneling"></a>

To set up SSH tunneling to access the Spark UI, follow one of the two options in this section.

Options for setting up SSH tunneling:
+ [Option 1: Set up an SSH tunnel to the master node using local port forwarding](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel-local.html)
+ [Option 2, part 1: Set up an SSH tunnel to the master node using dynamic port forwarding](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html)

  [Option 2, part 2: Configure proxy settings to view websites hosted on the master node](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-proxy.html)

For information about viewing web interfaces hosted on Amazon EMR clusters, see [View web interfaces hosted on Amazon EMR Clusters](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html). You can also visit your Amazon EMR console to get access to the Spark UI.

**Note**  
You can set up an SSH tunnel even if presigned URLs are not available to you. 

## Presigned URLs
<a name="studio-notebooks-emr-spark-ui-presigned-urls"></a>

To create one-click URLs that can access Spark UI on Amazon EMR from SageMaker Studio or Studio Classic notebooks, you must enable the following IAM permissions. Choose the option that applies to you: 
+ **For Amazon EMR clusters that are in the same account as the SageMaker Studio or Studio Classic notebook:** Add the following permissions to the SageMaker Studio or Studio Classic IAM execution role.
+ **For Amazon EMR clusters that are in a different account than the SageMaker Studio or Studio Classic notebook:** Add the following permissions to the cross-account role that you created for [List Amazon EMR clusters from Studio or Studio Classic](discover-emr-clusters.md).

**Note**  
You can access presigned URLs from the console in the following regions:  
US East (N. Virginia) Region
US West (N. California) Region
Canada (Central) Region
Europe (Frankfurt) Region
Europe (Stockholm) Region
Europe (Ireland) Region
Europe (London) Region
Europe (Paris) Region
Asia Pacific (Tokyo) Region
Asia Pacific (Seoul) Region
Asia Pacific (Sydney) Region
Asia Pacific (Mumbai) Region
Asia Pacific (Singapore) Region
South America (São Paulo) Region

The following policy gives your execution role access to presigned URLs. 

```
{
        "Sid": "AllowPresignedUrl",
        "Effect": "Allow",
        "Action": [
            "elasticmapreduce:DescribeCluster",
            "elasticmapreduce:ListInstanceGroups",
            "elasticmapreduce:CreatePersistentAppUI",
            "elasticmapreduce:DescribePersistentAppUI",
            "elasticmapreduce:GetPersistentAppUIPresignedURL",
            "elasticmapreduce:GetOnClusterAppUIPresignedURL"
        ],
        "Resource": [
            "arn:aws:elasticmapreduce:region:account-id:cluster/*"
        ]
}
```
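With those permissions in place, the presigned URL workflow can also be driven programmatically via the boto3 EMR client, whose API names mirror the actions in the policy above. The following is a sketch, not the only workflow; the cluster ARN is a placeholder, and the client is passed in so you can substitute your own session:

```python
def spark_ui_presigned_url(emr_client, cluster_arn: str) -> str:
    """Create a persistent app UI for the cluster and return a presigned Spark History Server URL."""
    ui = emr_client.create_persistent_app_ui(TargetResourceArn=cluster_arn)
    resp = emr_client.get_persistent_app_ui_presigned_url(
        PersistentAppUIId=ui["PersistentAppUIId"],
        PersistentAppUIType="SHS",  # Spark History Server
    )
    return resp["PresignedURL"]

# Usage (requires AWS credentials and the permissions above):
# import boto3
# url = spark_ui_presigned_url(
#     boto3.client("emr", region_name="us-east-1"),
#     "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/j-XXXXXXXXXXXXX",  # placeholder ARN
# )
```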

# Blogs and whitepapers
<a name="studio-notebooks-emr-resources"></a>

The following blogs use a case study of sentiment prediction for a movie review to illustrate the process of executing a complete machine learning workflow. This includes data preparation, monitoring Spark jobs, and training and deploying an ML model to get predictions directly from your Studio or Studio Classic notebook.
+ [Create and manage Amazon EMR clusters from SageMaker Studio or Studio Classic to run interactive Spark and ML workloads](https://aws.amazon.com/blogs/machine-learning/part-1-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/).
+ To extend the use case to a cross-account configuration where SageMaker Studio or Studio Classic and your Amazon EMR cluster are deployed in separate AWS accounts, see [Create and manage Amazon EMR clusters from SageMaker Studio or Studio Classic to run interactive Spark and ML workloads - Part 2](https://aws.amazon.com/blogs/machine-learning/part-2-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/).

See also: 
+ A walkthrough of the configuration of [Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon EMR cluster](https://aws.amazon.com/blogs/big-data/access-apache-livy-using-a-network-load-balancer-on-a-kerberos-enabled-amazon-emr-cluster/).
+ AWS whitepapers for [SageMaker Studio or Studio Classic best practices](https://docs.aws.amazon.com/whitepapers/latest/sagemaker-studio-admin-best-practices/sagemaker-studio-admin-best-practices.html).

# Troubleshooting
<a name="studio-notebooks-emr-troubleshooting"></a>

When working with Amazon EMR clusters from Studio or Studio Classic notebooks, you may encounter various potential issues or challenges during the connection or usage process. To help you troubleshoot and resolve these errors, this section provides guidance on common problems that can arise. 

The following are common errors that might occur while connecting or using Amazon EMR clusters from Studio or Studio Classic notebooks.

## Troubleshoot Livy connections hanging or failing
<a name="studio-notebooks-emr-troubleshooting.memoryerror"></a>

The following are Livy connectivity issues that might occur while using Amazon EMR clusters from Studio or Studio Classic notebooks.
+ **Your Amazon EMR cluster encountered an out-of-memory error.**

  A possible reason for a Livy connection via `sparkmagic` hanging or failing is if your Amazon EMR cluster encountered an out-of-memory error. 

  By default, the Java configuration parameter of the Apache Spark driver, `spark.driver.defaultJavaOptions`, is set to `-XX:OnOutOfMemoryError='kill -9 %p'`. This means that the default action taken when the driver program encounters an `OutOfMemoryError` is to terminate the driver program by sending a SIGKILL signal. When the Apache Spark driver is terminated, any Livy connection via `sparkmagic` that depends on that driver hangs or fails. This is because the Spark driver is responsible for managing the Spark application's resources, including task scheduling and execution. Without the driver, the Spark application cannot function, and any attempts to interact with it fails.

  If you suspect that your Spark cluster is experiencing memory issues, you can check [Amazon EMR logs](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html). Containers killed due to out-of-memory errors typically exit with a code of `137`. In such cases, you need to restart the Spark application and establish a new Livy connection to resume interaction with the Spark cluster.

  You can refer to the knowledge base article [How do I resolve the error "Container killed by YARN for exceeding memory limits" in Spark on Amazon EMR?](https://repost.aws/knowledge-center/emr-spark-yarn-memory-limit) on AWS re:Post to learn about various strategies and parameters that can be used to address an out-of-memory issue.

  We recommend reviewing the [Amazon EMR Best Practices Guides](https://aws.github.io/aws-emr-best-practices/) for best practices and tuning guidance on running Apache Spark workloads on your Amazon EMR clusters.
+ **Your Livy session times out when connecting to an Amazon EMR cluster for the first time.**

  When you initially connect to an Amazon EMR cluster using [sagemaker-studio-analytics-extension](https://pypi.org/project/sagemaker-studio-analytics-extension/), which enables connection to a remote Spark (Amazon EMR) cluster via the [SparkMagic](https://github.com/jupyter-incubator/sparkmagic) library using [Apache Livy](https://livy.apache.org/), you may encounter a connection timeout error:

  `An error was encountered: Session 0 did not start up in 60 seconds.`

  If your Amazon EMR cluster requires the initialization of a Spark application upon establishing a connection, there is an increased chance of seeing connection timeout errors.

  To reduce the chances of getting timeouts when connecting to an Amazon EMR cluster using Livy through the analytics extension, `sagemaker-studio-analytics-extension` version `0.0.19` and later override the default server session timeout to `120` seconds instead of `sparkmagic`'s default of `60` seconds.

  If you use version `0.0.18` or earlier of the extension, we recommend upgrading it by running the following command.

  ```
  pip install --upgrade sagemaker-studio-analytics-extension
  ```

  Note that when providing a custom timeout configuration in `sparkmagic`, `sagemaker-studio-analytics-extension` honors this override. However, setting the session timeout to `60` seconds automatically triggers the default server session timeout of `120` seconds in `sagemaker-studio-analytics-extension`.
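  If you need a different startup timeout than the `120`-second default applied by the extension, you can set one in `sparkmagic`'s configuration file (typically `~/.sparkmagic/config.json`). The key below comes from `sparkmagic`'s example configuration; the value is illustrative:

  ```
  {
      "livy_session_startup_timeout_seconds": 180
  }
  ```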

# Data preparation using AWS Glue interactive sessions
<a name="studio-notebooks-glue"></a>

[AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-overview.html) is a serverless service that you can use to collect, transform, clean, and prepare data for storage in your data lakes and data pipelines. AWS Glue interactive sessions provides an on-demand, serverless Apache Spark runtime environment that you can initialize in seconds on a dedicated Data Processing Unit (DPU) without having to provision and manage complex compute cluster infrastructure. After initialization, you can browse the AWS Glue data catalog, run large queries, access data governed by AWS Lake Formation, and interactively analyze and prepare data using Spark, right in your Studio or Studio Classic notebooks. You can then use the prepared data to train, tune, and deploy models using the purpose-built ML tools within SageMaker Studio or Studio Classic. You should consider AWS Glue interactive sessions for your data preparation workloads when you want a serverless Spark service with moderate control of configurability and flexibility.

You can initiate an AWS Glue interactive session by starting a JupyterLab notebook in Studio or Studio Classic. When starting your notebook, choose the built-in `Glue PySpark and Ray` or `Glue Spark` kernel. This automatically starts an interactive, serverless Spark session. You do not need to provision or manage any compute cluster or infrastructure. After initialization, you can explore and interact with your data from within your Studio or Studio Classic notebooks.

Before starting your AWS Glue interactive session in Studio or Studio Classic, you need to set the appropriate roles and policies. Additionally, you may need to provide access to additional resources, such as a storage Amazon S3 bucket. For more information about required IAM policies, see [Permissions for AWS Glue interactive sessions in Studio or Studio Classic](getting-started-glue-sm.md#glue-sm-iam).

Studio and Studio Classic provide a default configuration for your AWS Glue interactive session; however, you can use AWS Glue’s full catalog of Jupyter magic commands to further customize your environment. For information about the default and additional Jupyter magics that you can use in your AWS Glue interactive session, see [Configure your AWS Glue interactive session in Studio or Studio Classic](getting-started-glue-sm.md#glue-sm-magics).
+ Studio Classic users initiating an AWS Glue interactive session can select from the following images and kernels:
  + Images: `SparkAnalytics 1.0`, `SparkAnalytics 2.0`
  + Kernels: `Glue Python [PySpark and Ray]` and `Glue Spark`
+ For Studio users, use the default [SageMaker Distribution image](https://github.com/aws/sagemaker-distribution) and select a `Glue Python [PySpark and Ray]` or a `Glue Spark` kernel.

# Get Started with AWS Glue Interactive Sessions
<a name="getting-started-glue-sm"></a>

In this guide, you learn how to initiate an AWS Glue interactive session in SageMaker AI Studio Classic, and manage your environment with Jupyter magics.

## Permissions for AWS Glue interactive sessions in Studio or Studio Classic
<a name="glue-sm-iam"></a>

This section lists the required policies to run AWS Glue interactive sessions in Studio or Studio Classic and explains how to set them up. In particular, it details how to:
+ Attach the `AwsGlueSessionUserRestrictedServiceRole` managed policy to your SageMaker AI execution role.
+ Create an inline custom policy on your SageMaker AI execution role.
+ Modify the trust relationship of your SageMaker AI execution role.

**To attach the `AwsGlueSessionUserRestrictedServiceRole` managed policy to your execution role**

1. Open the [IAM console](https://console.aws.amazon.com/iam/).

1. Select **Roles** in the left-side panel.

1. Find the Studio Classic execution role used by your user profile. For information about how to view a user profile, see [View user profiles in a domain](domain-user-profile-view.md).

1. Choose your role name to access the role summary page.

1. Under the **Permissions** tab, select **Attach policies** from the **Add Permissions** dropdown menu.

1. Select the checkbox next to the managed policy `AwsGlueSessionUserRestrictedServiceRole`.

1. Choose **Attach policies**. 

   The summary page shows your newly-added managed policies.

   

**To create the inline custom policy on your execution role**

1. Select **Create inline policy** in the **Add Permissions** dropdown menu.

1. Select the **JSON** tab.

1. Copy and paste in the following policy.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "uniqueStatementId",
               "Effect": "Allow",
               "Action": [
                   "iam:GetRole",
                   "iam:PassRole",
                   "sts:GetCallerIdentity"
               ],
               "Resource": "arn:aws:iam::*:role/GlueServiceRole*"
           }
       ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a **Name** and choose **Create policy**. 

   The summary page shows your newly-added custom policy.

   

**To modify the trust relationship of your execution role**

1. Select the **Trust relationships** tab.

1. Choose **Edit trust policy**.

1. Copy and paste in the following policy.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "glue.amazonaws.com",
                       "sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Choose **Update policy**.

You can add additional roles and policies if you need to access other AWS resources. For a description of the additional roles and policies you can include, see [interactive sessions with IAM](https://docs.aws.amazon.com/glue/latest/dg/glue-is-security.html) in the AWS Glue documentation.

## Tag propagation
<a name="glue-sm-tag-propagation"></a>

Tags are commonly used to track and allocate costs, control access to your session, isolate your resources, and more. To learn about adding metadata to your AWS resources using tagging, or for details on common use cases, see [Additional information](#more-information).

You can enable the automatic propagation of AWS tags to new AWS Glue interactive sessions created from within the Studio or Studio Classic UI. When an AWS Glue interactive session is created from Studio or Studio Classic, any [user-defined tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/custom-tags.html) attached to the user profile or shared space are carried over to the new AWS Glue interactive session. Additionally, Studio and Studio Classic automatically add two AWS-generated internal tags (either `sagemaker:user-profile-arn` and `sagemaker:domain-arn`, or `sagemaker:shared-space-arn` and `sagemaker:domain-arn`) to new AWS Glue interactive sessions created from their UI. You can use these tags to aggregate costs across individual domains, user profiles, or spaces.

### Enable tag propagation
<a name="enable-propagation"></a>

To enable the automatic propagation of tags to new AWS Glue interactive sessions, set the following permissions for your SageMaker AI execution role and the IAM role associated with your AWS Glue session:

**Note**  
By default, the role associated with the AWS Glue interactive session is the same as the SageMaker AI execution role. You can specify a different execution role for the AWS Glue interactive session by using the `%iam_role` magic command. For information on the available Jupyter magic commands to configure AWS Glue interactive sessions, see [Configure your AWS Glue interactive session in Studio or Studio Classic](#glue-sm-magics).
+ *On your SageMaker AI execution role*: Create a new inline policy, and paste in the following JSON. The policy grants the execution role permission to describe (`DescribeUserProfile`, `DescribeSpace`, `DescribeDomain`) and list the tags (`ListTags`) set on user profiles, shared spaces, and the SageMaker AI domain.

  ```
  {
      "Effect": "Allow",
      "Action": [
          "sagemaker:ListTags"
      ],
      "Resource": [
          "arn:aws:sagemaker:*:*:user-profile/*",
          "arn:aws:sagemaker:*:*:space/*"
      ]
  },
  {
      "Effect": "Allow",
      "Action": [
          "sagemaker:DescribeUserProfile"
      ],
      "Resource": [
          "arn:aws:sagemaker:*:*:user-profile/*"
      ]
  },
  {
      "Effect": "Allow",
      "Action": [
          "sagemaker:DescribeSpace"
      ],
      "Resource": [
          "arn:aws:sagemaker:*:*:space/*"
      ]
  },
  {
      "Effect": "Allow",
      "Action": [
          "sagemaker:DescribeDomain"
      ],
      "Resource": [
          "arn:aws:sagemaker:*:*:domain/*"
      ]
  }
  ```
+ *On the IAM role of your AWS Glue session*: Create a new inline policy, and paste in the following JSON. The policy grants your role permission to attach tags (`TagResource`) to your session and to retrieve its list of tags (`GetTags`).

  ```
  {
      "Effect": "Allow",
      "Action": [
          "glue:TagResource",
          "glue:GetTags"
      ],
      "Resource": [
          "arn:aws:glue:*:*:session/*"
      ]
  }
  ```
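
Note that the two JSON fragments above are lists of policy statements rather than complete policy documents; an inline policy must include a `Version` and `Statement` envelope. As a sketch, the following Python assembles such a document from statement entries (only two of the statements above are shown here for brevity):

```python
import json

# Statement entries copied from the fragments above (abbreviated to two).
statements = [
    {
        "Effect": "Allow",
        "Action": ["sagemaker:ListTags"],
        "Resource": [
            "arn:aws:sagemaker:*:*:user-profile/*",
            "arn:aws:sagemaker:*:*:space/*",
        ],
    },
    {
        "Effect": "Allow",
        "Action": ["glue:TagResource", "glue:GetTags"],
        "Resource": ["arn:aws:glue:*:*:session/*"],
    },
]

# An IAM policy document wraps the statements in a Version/Statement envelope.
policy_document = {"Version": "2012-10-17", "Statement": statements}

# Serializing confirms the document is valid JSON before you paste it
# into the IAM console's JSON policy editor.
policy_json = json.dumps(policy_document, indent=4)
```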

**Note**  
Failures occurring while applying those permissions do not prevent the creation of AWS Glue interactive sessions. You can find details about the reason for the failure in the Studio or Studio Classic [CloudWatch](https://docs.aws.amazon.com//sagemaker/latest/dg/monitoring-cloudwatch.html) logs.
You must restart the kernel of your AWS Glue interactive session to propagate the update of a tag’s value.

It is important to note the following points:
+ Once a tag is attached to a session, it cannot be removed by propagation.

  You can remove tags from an AWS Glue interactive session directly through the AWS CLI, the AWS Glue API, or the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/). For example, using the AWS CLI, you can remove a tag by providing the session's ARN and the tag keys you want to remove as follows:

  ```
  aws glue untag-resource \
  --resource-arn arn:aws:glue:region:account-id:session/session-name \
  --tags-to-remove tag-key1 tag-key2
  ```
+ Studio and Studio Classic add two AWS-generated internal tags (either `sagemaker:user-profile-arn` and `sagemaker:domain-arn`, or `sagemaker:shared-space-arn` and `sagemaker:domain-arn`) to new AWS Glue interactive sessions created from their UI. Those tags count against the limit of 50 tags set on all AWS resources. Both `sagemaker:user-profile-arn` and `sagemaker:shared-space-arn` contain the ID of the domain to which they belong.
+ Tag keys that begin with `aws:` or `AWS:` (in any combination of uppercase and lowercase letters) are reserved for AWS use and are not propagated.
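
As an illustration of the reserved-prefix rule, the following hypothetical helper (the function name is an assumption for this sketch, not a SageMaker AI or AWS Glue API) filters out reserved keys the same way:

```python
def propagatable_tags(tags):
    """Return only the tags whose keys are not reserved for AWS use.

    Keys beginning with 'aws:' in any combination of upper- and
    lowercase letters (for example 'AWS:' or 'aWs:') are reserved
    and are never propagated.
    """
    return {key: value for key, value in tags.items()
            if not key.lower().startswith("aws:")}

# User-defined tags pass through; reserved-prefix keys are dropped.
tags = {"team": "ml", "aws:createdBy": "x", "AWS:Region": "us-east-1"}
propagated = propagatable_tags(tags)
```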

### Additional information
<a name="more-information"></a>

For more information on tagging, refer to the following resources.
+ To learn about adding metadata to your AWS resources with tagging, see [Tagging AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).
+ For information on tracking costs using tags, see [Cost analysis](https://docs.aws.amazon.com/whitepapers/latest/sagemaker-studio-admin-best-practices/cost-attribution.html) in Studio administration best practices.
+ For information on controlling access to AWS Glue based on tag keys, see [ABAC with AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-tags).

## Launch your AWS Glue interactive session on Studio or Studio Classic
<a name="glue-sm-launch"></a>

After you create the roles, policies, and SageMaker AI domain, you can launch your AWS Glue interactive session in Studio or Studio Classic.

1. Sign in to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. From the left navigation pane, choose **Studio**.

1. From the Studio landing page, select the domain and user profile for launching Studio.

1. Choose **Open Studio** and start a JupyterLab or Studio Classic application.

1. In the Jupyter view, choose **File**, then **New**, then **Notebook**.

1. For Studio Classic users: In the **Image** dropdown menu, select **SparkAnalytics 1.0** or **SparkAnalytics 2.0**. In the **kernel** dropdown menu, select **Glue Spark** or **Glue Python [PySpark and Ray]**. Choose **Select**.

   For Studio users, select a **Glue Spark** or **Glue Python [PySpark and Ray]** kernel.

1. (optional) Use Jupyter magics to customize your environment. For more information about Jupyter magics, see [Configure your AWS Glue interactive session in Studio or Studio Classic](#glue-sm-magics).

1. Start writing your Spark data processing scripts. The following [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/use-cases/pyspark_etl_and_training/pyspark-etl-training.ipynb) showcases an end-to-end workflow for ETL on a large dataset using an AWS Glue interactive session, exploratory data analysis, data preprocessing, and finally training a model on the processed data with SageMaker AI.

## Configure your AWS Glue interactive session in Studio or Studio Classic
<a name="glue-sm-magics"></a>

**Note**  
All magic configurations are carried over to subsequent sessions for the lifetime of the AWS Glue kernel.

You can use Jupyter magics in your AWS Glue interactive session to modify your session and configuration parameters. Magics are short commands prefixed with `%` at the start of Jupyter cells that provide a quick and easy way to help you control your environment. In your AWS Glue interactive session, the following magics are set for you by default:


| Magic | Default value | 
| --- | --- | 
| %glue\_version |  3.0  | 
| %iam\_role |  *execution role attached to your SageMaker AI domain*  | 
| %region |  your region  | 

You can use magics to further customize your environment. For example, to change the number of workers allocated to your job from the default of five to 10, specify `%number_of_workers 10`. To configure your session to stop after 10 minutes of idle time instead of the default 2,880 minutes, specify `%idle_timeout 10`.
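
For example, a configuration cell at the top of your notebook can combine several magics before any Spark code runs. The role ARN below is a placeholder, and the values shown are illustrative:

```
%glue_version 3.0
%idle_timeout 10
%number_of_workers 10
%iam_role arn:aws:iam::111122223333:role/GlueServiceRoleExample
```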

All of the Jupyter magics currently available in AWS Glue are also available in Studio or Studio Classic. For the complete list of AWS Glue magics available, see [Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html).

# AWS Glue interactive session pricing
<a name="glue-sm-pricing"></a>

When you use AWS Glue interactive sessions on Studio or Studio Classic notebooks, you are charged separately for resource usage on AWS Glue and Studio notebooks.

AWS charges for AWS Glue interactive sessions based on how long the session is active and the number of Data Processing Units (DPUs) used. You are charged an hourly rate for the number of DPUs used to run your workloads, billed in increments of one second. AWS Glue interactive sessions assign a default of five DPUs and require a minimum of two DPUs. There is also a one-minute minimum billing duration for each interactive session. To see the AWS Glue rates and pricing examples, or to estimate your costs using the AWS Pricing Calculator, see [AWS Glue pricing](https://aws.amazon.com/glue/pricing).
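
To make the billing rules concrete, the following sketch estimates a session's cost. The `0.44` USD per DPU-hour rate is an assumed example only; rates vary by Region, so check the AWS Glue pricing page for yours.

```python
def estimate_glue_session_cost(dpus, duration_seconds, rate_per_dpu_hour=0.44):
    """Estimate the cost of an AWS Glue interactive session.

    Billing is per second with a one-minute minimum, so durations under
    60 seconds are billed as 60 seconds. The rate is an example value;
    actual rates vary by Region.
    """
    billed_seconds = max(duration_seconds, 60)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# A 30-minute session on the default five DPUs at the example rate:
cost = estimate_glue_session_cost(dpus=5, duration_seconds=1800)
```

With these assumptions, the 30-minute session above costs 5 × 0.5 h × $0.44 = $1.10.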

Your Studio or Studio Classic notebook runs on an Amazon EC2 instance, and you are charged for the instance type you choose, based on the duration of use. Studio Classic assigns you a default EC2 instance type of `ml.t3.medium` when you select the `SparkAnalytics` image and associated kernel. You can change the instance type of your Studio Classic notebook to suit your workload. For information about Studio and Studio Classic pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing).

# Prepare ML Data with Amazon SageMaker Data Wrangler
<a name="data-wrangler"></a>

**Important**  
Amazon SageMaker Data Wrangler has been integrated into Amazon SageMaker Canvas. Within the new Data Wrangler experience in SageMaker Canvas, you can use a natural language interface to explore and transform your data in addition to the visual interface. For more information about Data Wrangler in SageMaker Canvas, see [Data preparation](canvas-data-prep.md).

Amazon SageMaker Data Wrangler (Data Wrangler) is a feature of Amazon SageMaker Studio Classic that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify and streamline data pre-processing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows.

Data Wrangler provides the following core functionalities to help you analyze and prepare data for machine learning applications. 
+ **Import** – Connect to and import data from Amazon Simple Storage Service (Amazon S3), Amazon Athena (Athena), Amazon Redshift, Snowflake, and Databricks.
+ **Data Flow** – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline. 
+ **Transform** – Clean and transform your dataset using standard *transforms* like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.
+ **Generate Data Insights** – Automatically verify data quality and detect abnormalities in your data with Data Wrangler Data Insights and Quality Report. 
+ **Analyze** – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation. 
+ **Export** – Export your data preparation workflow to a different location. The following are example locations: 
  + Amazon Simple Storage Service (Amazon S3) bucket
  + Amazon SageMaker Pipelines – Use Pipelines to automate model deployment. You can export the data that you've transformed directly to the pipelines.
  + Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
  + Python script – Store the data and their transformations in a Python script for your custom workflows.

To start using Data Wrangler, see [Get Started with Data Wrangler](data-wrangler-getting-started.md).

**Important**  
Data Wrangler no longer supports Jupyter Lab Version 1 (JL1). To access the latest features and updates, update to Jupyter Lab Version 3. For more information about upgrading, see [View and update the JupyterLab version of an application from the console](studio-jl.md#studio-jl-view).

**Important**  
The information and procedures in this guide use the latest version of Amazon SageMaker Studio Classic. For information about updating Studio Classic to the latest version, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

You must use Studio Classic version 1.3.0 or later. Use the following procedure to open Studio Classic and check which version you're running.

1. Use the steps in [Prerequisites](data-wrangler-getting-started.md#data-wrangler-getting-started-prerequisite) to access Data Wrangler through Amazon SageMaker Studio Classic.

1. Next to the user you want to use to launch Studio Classic, select **Launch app**.

1. Choose **Studio**.

1. After Studio Classic loads, select **File**, then **New**, and then **Terminal**.  
![\[The Studio Classic context menu options described in step 4.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/terminal.png)


1. Enter `cat /opt/conda/share/jupyter/lab/staging/yarn.lock | grep -A 1 "@amzn/sagemaker-ui-data-prep-plugin@"` to print the version of your Studio Classic instance. You must have Studio Classic version 1.3.0 or later to use Snowflake.   
![\[A terminal window opened in Studio Classic with the command from step 6 copied and pasted in.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/cat-command.png)

You can update Amazon SageMaker Studio Classic from within the AWS Management Console. For more information about updating Studio Classic, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

**Topics**
+ [Get Started with Data Wrangler](data-wrangler-getting-started.md)
+ [Import](data-wrangler-import.md)
+ [Create and Use a Data Wrangler Flow](data-wrangler-data-flow.md)
+ [Get Insights On Data and Data Quality](data-wrangler-data-insights.md)
+ [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md)
+ [Transform Data](data-wrangler-transform.md)
+ [Analyze and Visualize](data-wrangler-analyses.md)
+ [Reusing Data Flows for Different Datasets](data-wrangler-parameterize.md)
+ [Export](data-wrangler-data-export.md)
+ [Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Classic Notebook to Get Data Insights](data-wrangler-interactively-prepare-data-notebook.md)
+ [Security and Permissions](data-wrangler-security.md)
+ [Release Notes](data-wrangler-release-notes.md)
+ [Troubleshoot](data-wrangler-trouble-shooting.md)
+ [Increase Amazon EC2 Instance Limit](data-wrangler-increase-instance-limit.md)
+ [Update Data Wrangler](data-wrangler-update.md)
+ [Shut Down Data Wrangler](data-wrangler-shut-down.md)

# Get Started with Data Wrangler
<a name="data-wrangler-getting-started"></a>

Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio Classic. Use this section to learn how to access and get started using Data Wrangler. Do the following:

1. Complete each step in [Prerequisites](#data-wrangler-getting-started-prerequisite).

1. Follow the procedure in [Access Data Wrangler](#data-wrangler-getting-started-access) to start using Data Wrangler.

## Prerequisites
<a name="data-wrangler-getting-started-prerequisite"></a>

To use Data Wrangler, you must complete the following prerequisites. 

1. To use Data Wrangler, you need access to an Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information about the Amazon EC2 instances that you can use, see [Instances](data-wrangler-data-flow.md#data-wrangler-data-flow-instances). To learn how to view your quotas and, if necessary, request a quota increase, see [AWS service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html).

1. Configure the required permissions described in [Security and Permissions](data-wrangler-security.md). 

1. If your organization is using a firewall that blocks internet traffic, you must have access to the following URLs:
   + `https://ui.prod-1.data-wrangler.sagemaker.aws/`
   + `https://ui.prod-2.data-wrangler.sagemaker.aws/`
   + `https://ui.prod-3.data-wrangler.sagemaker.aws/`
   + `https://ui.prod-4.data-wrangler.sagemaker.aws/`

To use Data Wrangler, you need an active Studio Classic instance. To learn how to launch a new instance, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). When your Studio Classic instance is **Ready**, use the instructions in [Access Data Wrangler](#data-wrangler-getting-started-access).

## Access Data Wrangler
<a name="data-wrangler-getting-started-access"></a>

The following procedure assumes you have completed the [Prerequisites](#data-wrangler-getting-started-prerequisite).

To access Data Wrangler in Studio Classic, do the following.

1. Sign in to Studio Classic. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. You can also create a Data Wrangler flow by doing the following.

   1. In the top navigation bar, select **File**.

   1. Select **New**.

   1. Select **Data Wrangler Flow**.  
![\[Home tab of the Studio Classic console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/new-flow-file-menu.png)

1. (Optional) Rename the new directory and the .flow file.

1. When you create a new .flow file in Studio Classic, you might see a carousel that introduces you to Data Wrangler, along with the message **This may take a few minutes.**

   This messaging persists as long as the **KernelGateway** app on your **User Details** page is **Pending**. To see the status of this app, in the SageMaker AI console on the **Amazon SageMaker Studio Classic** page, select the name of the user you are using to access Studio Classic. On the **User Details** page, you see a **KernelGateway** app under **Apps**. Wait until this app status is **Ready** to start using Data Wrangler. This can take around 5 minutes the first time you launch Data Wrangler.  
![\[Example showing the KernelGateway app status is Ready on the User Details page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/gatewayKernel-ready.png)

1. To get started, choose a data source and use it to import a dataset. See [Import](data-wrangler-import.md) to learn more. 

   When you import a dataset, it appears in your data flow. To learn more, see [Create and Use a Data Wrangler Flow](data-wrangler-data-flow.md).

1. After you import a dataset, Data Wrangler automatically infers the type of data in each column. To change the inferred types, choose the **+** next to the **Data types** step and select **Edit data types**. 
**Important**  
After you add transforms to the **Data types** step, you cannot bulk-update column types using **Update types**. 

1. Use the data flow to add transforms and analyses. To learn more see [Transform Data](data-wrangler-transform.md) and [Analyze and Visualize](data-wrangler-analyses.md).

1. To export a complete data flow, choose **Export** and choose an export option. To learn more, see [Export](data-wrangler-data-export.md). 

1. Finally, choose the **Components and registries** icon, and select **Data Wrangler** from the dropdown list to see all the .flow files that you've created. You can use this menu to find and move between data flows.

After you have launched Data Wrangler, you can use the following section to walk through how you might use Data Wrangler to create an ML data prep flow. 

## Update Data Wrangler
<a name="data-wrangler-update-studio-app"></a>

We recommend that you periodically update the Data Wrangler Studio Classic app to access the latest features and updates. The Data Wrangler app name starts with **sagemaker-data-wrang**. To learn how to update a Studio Classic app, see [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).

## Demo: Data Wrangler Titanic Dataset Walkthrough
<a name="data-wrangler-getting-started-demo"></a>

The following sections provide a walkthrough to help you get started using Data Wrangler. This walkthrough assumes that you have already followed the steps in [Access Data Wrangler](#data-wrangler-getting-started-access) and have a new data flow file open that you intend to use for the demo. You may want to rename this .flow file to something similar to `titanic-demo.flow`.

This walkthrough uses the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv). It's a modified version of the [Titanic dataset](https://www.openml.org/d/40945) that you can import into your Data Wrangler flow more easily. This data set contains the survival status, age, gender, and class (which serves as a proxy for economic status) of passengers aboard the maiden voyage of the *RMS Titanic* in 1912.

In this tutorial, you perform the following steps.

1. Do one of the following:
   + Open your Data Wrangler flow and choose **Use Sample Dataset**.
   + Upload the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv) to Amazon Simple Storage Service (Amazon S3), and then import this dataset into Data Wrangler.

1. Analyze this dataset using Data Wrangler analyses. 

1. Define a data flow using Data Wrangler data transforms.

1. Export your flow to a Jupyter Notebook that you can use to create a Data Wrangler job. 

1. Process your data, and kick off a SageMaker training job to train an XGBoost Binary Classifier. 

### Upload Dataset to S3 and Import
<a name="data-wrangler-getting-started-demo-import"></a>

To get started, you can use one of the following methods to import the Titanic dataset into Data Wrangler:
+ Importing the dataset directly from the Data Wrangler flow
+ Uploading the dataset to Amazon S3 and then importing it into Data Wrangler

To import the dataset directly into Data Wrangler, open the flow and choose **Use Sample Dataset**.

Uploading the dataset to Amazon S3 and importing it into Data Wrangler is closer to the experience you have importing your own data. The following information tells you how to upload your dataset and import it.

Before you start importing the data into Data Wrangler, download the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv) and upload it to an Amazon S3 bucket in the AWS Region in which you want to complete this demo.

If you are a new user of Amazon S3, you can do this using drag and drop in the Amazon S3 console. To learn how, see [Uploading Files and Folders by Using Drag and Drop](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/upload-objects.html#upload-objects-by-drag-and-drop) in the Amazon Simple Storage Service User Guide.

**Important**  
Upload your dataset to an S3 bucket in the same AWS Region you want to use to complete this demo. 

When your dataset has been successfully uploaded to Amazon S3, you can import it into Data Wrangler.

**Import the Titanic dataset to Data Wrangler**

1. Choose the **Import data** button in your **Data flow** tab or choose the **Import** tab.

1. Select **Amazon S3**.

1. Use the **Import a dataset from S3** table to find the bucket to which you added the Titanic dataset. Choose the Titanic dataset CSV file to open the **Details** pane.

1. Under **Details**, the **File type** should be CSV. Check **First row is header** to specify that the first row of the dataset is a header. You can also name the dataset something more friendly, such as **Titanic-train**.

1. Choose the **Import** button.

When your dataset is imported into Data Wrangler, it appears in your **Data Flow** tab. You can double-click a node to enter the node detail view, which allows you to add transformations or analyses. You can use the plus icon for quick access to the navigation. In the next section, you use this data flow to add analysis and transform steps.

### Data Flow
<a name="data-wrangler-getting-started-demo-data-flow"></a>

In the data flow section, the only steps in the data flow are your recently imported dataset and a **Data type** step. After applying transformations, you can come back to this tab and see what the data flow looks like. Now, add some basic transformations under the **Prepare** and **Analyze** tabs. 

#### Prepare and Visualize
<a name="data-wrangler-getting-started-demo-prep-visualize"></a>

Data Wrangler has built-in transformations and visualizations that you can use to analyze, clean, and transform your data. 

The **Data** tab of the node detail view lists all built-in transformations in the right panel, which also contains an area in which you can add custom transformations. The following use case showcases how to use these transformations.

To get information that might help you with data exploration and feature engineering, create a data quality and insights report. The information from the report can help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention. For more information about creating a report, see [Get Insights On Data and Data Quality](data-wrangler-data-insights.md).

##### Data Exploration
<a name="data-wrangler-getting-started-demo-explore"></a>

First, create a table summary of the data using an analysis. Do the following:

1. Choose the **+** next to the **Data type** step in your data flow and select **Add analysis**.

1. In the **Analysis** area, select **Table summary** from the dropdown list.

1. Give the table summary a **Name**.

1. Select **Preview** to preview the table that will be created.

1. Choose **Save** to save it to your data flow. It appears under **All Analyses**.

Using the statistics you see, you can make observations similar to the following about this dataset: 
+ Fare average (mean) is around \$133, while the max is over \$1500. This column likely has outliers. 
+ This dataset uses *?* to indicate missing values. A number of columns have missing values: *cabin*, *embarked*, and *home.dest*.
+ The age category is missing over 250 values.

Next, clean your data using the insights gained from these stats. 
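Because this dataset marks missing values with *?*, one way to normalize them is a custom **Python (Pandas)** transform along the following lines. This is a sketch: the tiny inline data frame stands in for the `df` that Data Wrangler supplies to a custom transform.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; in a Data Wrangler custom transform,
# `df` is provided for you.
df = pd.DataFrame({"cabin": ["C85", "?"], "embarked": ["S", "?"]})

# Replace the '?' placeholder with NaN so pandas treats it as missing.
df = df.replace("?", np.nan)
print(df.isna().sum().sum())  # prints 2
```

After this step, transforms such as **Handle missing** recognize those cells as missing values.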

##### Drop Unused Columns
<a name="data-wrangler-getting-started-demo-drop-unused"></a>

Using the analysis from the previous section, clean up the dataset to prepare it for training. To add a new transform to your data flow, choose the **+** next to the **Data type** step in your data flow and choose **Add transform**.

First, drop columns that you don't want to use for training. You can use [pandas](https://pandas.pydata.org/) data analysis library to do this, or you can use one of the built-in transforms.

Use the following procedure to drop the unused columns.

**To drop the unused columns**

1. Open the Data Wrangler flow.

1. There are two nodes in your Data Wrangler flow. Choose the **+** to the right of the **Data types** node.

1. Choose **Add transform**.

1. In the **All steps** column, choose **Add step**.

1. In the **Standard** transform list, choose **Manage Columns**. The standard transformations are ready-made, built-in transformations. Make sure that **Drop column** is selected.

1. Under **Columns to drop**, check the following column names:
   + cabin
   + ticket
   + name
   + sibsp
   + parch
   + home.dest
   + boat
   + body

1. Choose **Preview**.

1. Verify that the columns have been dropped, then choose **Add**.

To do this using pandas, follow these steps.

1. In the **All steps** column, choose **Add step**.

1. In the **Custom** transform list, choose **Custom transform**.

1. Provide a name for your transformation, and choose **Python (Pandas)** from the dropdown list.

1. Enter the following Python script in the code box.

   ```
   # Columns that won't be used for training
   cols = ['name', 'ticket', 'cabin', 'sibsp', 'parch', 'home.dest', 'boat', 'body']
   df = df.drop(cols, axis=1)
   ```

1. Choose **Preview** to preview the change, and then choose **Add** to add the transformation. 

##### Clean up Missing Values
<a name="data-wrangler-getting-started-demo-missing-vals"></a>

Now, clean up missing values. You can do this with the **Handling missing values** transform group.

Of the columns remaining after the drop step, *age* and *fare* contain missing values. Inspect this using a **Custom Transform**.

Using the **Python (Pandas)** option, use the following to quickly review the number of entries in each column:

```
df.info()
```

![\[Example review the number of entries in each column.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/inspect-missing-pandas.png)


To drop rows with missing values in the *age* category, do the following: 

1. Choose **Handle missing**. 

1. Choose **Drop missing** for the **Transformer**.

1. Choose *age* for the **Input column**.

1. Choose **Preview** to see the new data frame, and then choose **Add** to add the transform to your flow.

1. Repeat the same process for *fare*. 

You can use `df.info()` in the **Custom transform** section to confirm that every column now has 1,045 non-null values.
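The built-in **Drop missing** transform used above behaves like pandas `dropna` restricted to the selected columns. A minimal sketch, with an illustrative inline data frame standing in for the `df` that a custom transform receives:

```python
import pandas as pd

# Illustrative frame; inside a custom transform, Data Wrangler supplies `df`.
df = pd.DataFrame({"age": [22.0, None, 38.0], "fare": [7.25, 71.28, None]})

# Drop any row that is missing a value in either column, mirroring the
# two Drop missing steps applied above.
df = df.dropna(subset=["age", "fare"])
print(len(df))  # prints 1
```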

##### Custom Pandas: Encode
<a name="data-wrangler-getting-started-demo-encode"></a>

Try one-hot encoding using pandas. Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are `Dog` and `Cat`, you may encode this information into two vectors: `[1,0]` to represent `Dog`, and `[0,1]` to represent `Cat`.

1. In the **Custom Transform** section, choose **Python (Pandas)** from the dropdown list.

1. Enter the following in the code box.

   ```
   import pandas as pd

   # One-hot encode each categorical column with get_dummies,
   # then append the encoded columns to the dataframe.
   dummies = []
   cols = ['pclass', 'sex', 'embarked']
   for col in cols:
       dummies.append(pd.get_dummies(df[col]))

   encoded = pd.concat(dummies, axis=1)

   df = pd.concat((df, encoded), axis=1)
   ```

1. Choose **Preview** to preview the change. The encoded version of each column is added to the dataset. 

1. Choose **Add** to add the transformation. 

#### Custom SQL: SELECT Columns
<a name="data-wrangler-getting-started-demo-sql"></a>

Now, select the columns you want to keep using SQL. For this demo, select the columns listed in the following `SELECT` statement. Because *survived* is your target column for training, put that column first.

1. In the **Custom Transform** section, select **SQL (PySpark SQL)** from the dropdown list.

1. Enter the following in the code box.

   ```
   SELECT survived, age, fare, `1`, `2`, `3`, female, male, C, Q, S FROM df;
   ```

1. Choose **Preview** to preview the change. The columns listed in your `SELECT` statement are the only remaining columns.

1. Choose **Add** to add the transformation. 

### Export to a Data Wrangler Notebook
<a name="data-wrangler-getting-started-export"></a>

When you've finished creating a data flow, you have a number of export options. The following section explains how to export to a Data Wrangler job notebook. A Data Wrangler job is used to process your data using the steps defined in your data flow. To learn more about all export options, see [Export](data-wrangler-data-export.md).

#### Export to Data Wrangler Job Notebook
<a name="data-wrangler-getting-started-export-notebook"></a>

When you export your data flow using a **Data Wrangler job**, the process automatically creates a Jupyter Notebook. This notebook automatically opens in your Studio Classic instance and is configured to run a SageMaker Processing job to run your Data Wrangler data flow, which is referred to as a Data Wrangler job. 

1. Save your data flow. Select **File** and then select **Save Data Wrangler Flow**.

1. Return to the **Data Flow** tab, select the last step in your data flow (the SQL step), then choose the **+** to open the navigation.

1. Choose **Export**, and then choose **Amazon S3 (via Jupyter Notebook)**. This opens a Jupyter notebook.  
![\[Example showing how to open the navigation in the data flow tab in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/export-select-step.png)

1. Choose any **Python 3 (Data Science)** kernel for the **Kernel**. 

1. When the kernel starts, run the cells in the notebook up to **Kick off SageMaker Training Job (Optional)**. 

1. Optionally, you can run the cells in **Kick off SageMaker Training Job (Optional)** if you want to create a SageMaker AI training job to train an XGBoost classifier. You can find the cost to run a SageMaker training job in [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/). 

   Alternatively, you can add the code blocks found in [Training XGBoost Classifier](#data-wrangler-getting-started-train-xgboost) to the notebook and run them to use the [XGBoost](https://xgboost.readthedocs.io/en/latest/) open source library to train an XGBoost classifier. 

1. Uncomment the cell under **Cleanup** and run it to revert the SageMaker Python SDK to its original version.

You can monitor your Data Wrangler job status in the SageMaker AI console in the **Processing** tab. Additionally, you can monitor your Data Wrangler job using Amazon CloudWatch. For additional information, see [Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html#processing-job-cloudwatch). 

If you kicked off a training job, you can monitor its status using the SageMaker AI console under **Training jobs** in the **Training section**.

#### Training XGBoost Classifier
<a name="data-wrangler-getting-started-train-xgboost"></a>

You can train an XGBoost binary classifier using either a Jupyter notebook or Amazon SageMaker Autopilot. You can use Autopilot to automatically train and tune models on the data that you've transformed directly from your Data Wrangler flow. For information about Autopilot, see [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md).

In the same notebook that kicked off the Data Wrangler job, you can pull the data and train an XGBoost Binary Classifier using the prepared data with minimal data preparation. 

1. First, upgrade the necessary modules using `pip` and remove the `_SUCCESS` file (this file is problematic when using `awswrangler`).

   ```
   ! pip install --upgrade awscli awswrangler boto3 scikit-learn
   ! aws s3 rm {output_path} --recursive  --exclude "*" --include "*_SUCCESS*"
   ```

1. Read the data from Amazon S3. You can use `awswrangler` to recursively read all the CSV files in the S3 prefix. The data is then split into features and labels. The label is the first column of the dataframe.

   ```
   import awswrangler as wr
   
   df = wr.s3.read_csv(path=output_path, dataset=True)
   # The label (survived) is the first column; the remaining columns are features.
   X, y = df.iloc[:, 1:], df.iloc[:, 0]
   ```
1. Finally, create DMatrices (the XGBoost primitive data structure) and run cross-validation using the XGBoost binary classification objective.

     ```
     import xgboost as xgb
     
     dmatrix = xgb.DMatrix(data=X, label=y)
     
     params = {"objective":"binary:logistic",'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
     
     xgb.cv(
         dtrain=dmatrix, 
         params=params, 
         nfold=3,
         num_boost_round=50,
         early_stopping_rounds=10,
         metrics="rmse", 
         as_pandas=True, 
         seed=123)
     ```
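With `as_pandas=True`, `xgb.cv` returns a pandas DataFrame of per-round metrics. A sketch of picking the best boosting round from such a frame (the metric values below are made up for illustration):

```python
import pandas as pd

# xgb.cv with as_pandas=True returns a DataFrame of per-round metrics.
# The metric values below are made up for illustration.
cv_results = pd.DataFrame({
    "train-rmse-mean": [0.48, 0.45, 0.43],
    "test-rmse-mean": [0.49, 0.47, 0.48],
})

# The best number of boosting rounds is the one that minimizes the test metric.
best_round = int(cv_results["test-rmse-mean"].idxmin()) + 1
print(best_round)  # prints 2
```

A diverging train and test metric, as in this made-up example, is also a quick signal of overfitting.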

#### Shut down Data Wrangler
<a name="data-wrangler-getting-started-shut-down"></a>

When you are finished using Data Wrangler, we recommend that you shut down the instance it runs on to avoid incurring additional charges. To learn how to shut down the Data Wrangler app and associated instance, see [Shut Down Data Wrangler](data-wrangler-shut-down.md). 

# Import
<a name="data-wrangler-import"></a>

You can use Amazon SageMaker Data Wrangler to import data from the following *data sources*: Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. The dataset that you import can include up to 1000 columns.

**Topics**
+ [Import data from Amazon S3](#data-wrangler-import-s3)
+ [Import data from Athena](#data-wrangler-import-athena)
+ [Import data from Amazon Redshift](#data-wrangler-import-redshift)
+ [Import data from Amazon EMR](#data-wrangler-emr)
+ [Import data from Databricks (JDBC)](#data-wrangler-databricks)
+ [Import data from Salesforce Data Cloud](#data-wrangler-import-salesforce-data-cloud)
+ [Import data from Snowflake](#data-wrangler-snowflake)
+ [Import Data From Software as a Service (SaaS) Platforms](#data-wrangler-import-saas)
+ [Imported Data Storage](#data-wrangler-import-storage)

Some data sources allow you to add multiple *data connections*:
+ You can connect to multiple Amazon Redshift clusters. Each cluster becomes a data source. 
+ You can query any Athena database in your account to import data from that database.



When you import a dataset from a data source, it appears in your data flow. Data Wrangler automatically infers the data type of each column in your dataset. To modify these types, select the **Data types** step and select **Edit data types**.

When you import data from Athena or Amazon Redshift, the imported data is automatically stored in the default SageMaker AI S3 bucket for the AWS Region in which you are using Studio Classic. Additionally, Athena stores data you preview in Data Wrangler in this bucket. To learn more, see [Imported Data Storage](#data-wrangler-import-storage).

**Important**  
The default Amazon S3 bucket may not have the most restrictive security settings, such as a bucket policy and server-side encryption (SSE). We strongly recommend that you [Add a Bucket Policy To Restrict Access to Datasets Imported to Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-security.html#data-wrangler-security-bucket-policy). 

**Important**  
In addition, if you use the managed policy for SageMaker AI, we strongly recommend that you scope it down to the most restrictive policy that allows you to perform your use case. For more information, see [Grant an IAM Role Permission to Use Data Wrangler](data-wrangler-security.md#data-wrangler-security-iam-policy).

All data sources except for Amazon Simple Storage Service (Amazon S3) require you to specify a SQL query to import your data. For each query, you must specify the following:
+ **Data catalog**
+ **Database**
+ **Table**

You can specify the name of the database or the data catalog in either the drop down menus or within the query. The following are example queries:
+ `select * from example-data-catalog-name.example-database-name.example-table-name` – The query doesn't use anything specified in the dropdown menus of the user-interface (UI) to run. It queries `example-table-name` within `example-database-name` within `example-data-catalog-name`.
+ `select * from example-database-name.example-table-name` – The query uses the data catalog that you've specified in the **Data catalog** dropdown menu to run. It queries `example-table-name` within `example-database-name` within the data catalog that you've specified.
+ `select * from example-table-name` – The query requires you to select values for both the **Data catalog** and **Database name** dropdown menus. It queries `example-table-name` within the database and data catalog that you've specified.

The link between Data Wrangler and the data source is a *connection*. You use the connection to import data from your data source.

There are two types of connections:
+ Direct
+ Cataloged

Data Wrangler always has access to the most recent data in a direct connection. If the data in the data source has been updated, you can use the connection to import the data. For example, if someone adds a file to one of your Amazon S3 buckets, you can import the file.

A cataloged connection is the result of a data transfer. The data in the cataloged connection doesn't necessarily have the most recent data. For example, you might set up a data transfer between Salesforce and Amazon S3. If there's an update to the Salesforce data, you must transfer the data again. You can automate the process of transferring data. For more information about data transfers, see [Import Data From Software as a Service (SaaS) Platforms](#data-wrangler-import-saas).

## Import data from Amazon S3
<a name="data-wrangler-import-s3"></a>

You can use Amazon Simple Storage Service (Amazon S3) to store and retrieve any amount of data, at any time, from anywhere on the web. You can accomplish these tasks using the AWS Management Console, which is a simple and intuitive web interface, and the Amazon S3 API. If you've stored your dataset locally, we recommend that you add it to an S3 bucket for import into Data Wrangler. To learn how, see [Uploading an object to a bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/PuttingAnObjectInABucket.html) in the Amazon Simple Storage Service User Guide. 

Data Wrangler uses [S3 Select](https://aws.amazon.com/s3/features/#s3-select) to allow you to preview your Amazon S3 files in Data Wrangler. You incur standard charges for each file preview. To learn more about pricing, see the **Requests & data retrievals** tab on [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/). 

**Important**  
If you plan to export a data flow and launch a Data Wrangler job, ingest data into a SageMaker AI feature store, or create a SageMaker AI pipeline, be aware that these integrations require Amazon S3 input data to be located in the same AWS region.

**Important**  
If you're importing a CSV file, make sure it meets the following requirements:  
A record in your dataset can't be longer than one line.
A backslash, `\`, is the only valid escape character.
Your dataset must use one of the following delimiters:  
Comma – `,`
Colon – `:`
Semicolon – `;`
Pipe – `|`
Tab – `[TAB]`
To save space, you can import compressed CSV files.

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For Amazon S3, it provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.
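To illustrate what a stratified sample preserves, here is a hedged pandas sketch (the column name `label` and the data are assumptions for illustration; Data Wrangler performs the sampling for you):

```python
import pandas as pd

# Illustrative frame: 8 rows of class 'a' and 4 rows of class 'b' (a 2:1 ratio).
df = pd.DataFrame({"label": ["a"] * 8 + ["b"] * 4, "x": range(12)})

# A stratified 50% sample keeps the 2:1 ratio between the classes.
sample = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)
print(sample["label"].value_counts().to_dict())  # prints {'a': 4, 'b': 2}
```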

After you've imported your data, you can also use the sampling transformer to take one or more samples from your entire dataset. For more information about the sampling transformer, see [Sampling](data-wrangler-transform.md#data-wrangler-transform-sampling).

You can use one of the following resource identifiers to import your data:
+ An Amazon S3 URI that uses an Amazon S3 bucket or Amazon S3 access point
+ An Amazon S3 access point alias
+ An Amazon Resource Name (ARN) that uses an Amazon S3 access point or Amazon S3 bucket

Amazon S3 access points are named network endpoints that are attached to the buckets. Each access point has distinct permissions and network controls that you can configure. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html).

**Important**  
If you're using an Amazon Resource Name (ARN) to import your data, it must be for a resource located in the same AWS Region that you're using to access Amazon SageMaker Studio Classic.

You can import either a single file or multiple files as a dataset. You can use the multifile import operation when you have a dataset that is partitioned into separate files. It takes all of the files from an Amazon S3 directory and imports them as a single dataset. For information on the types of files that you can import and how to import them, see the following sections.

------
#### [ Single File Import ]

You can import single files in the following formats:
+ Comma Separated Values (CSV)
+ Parquet
+ Javascript Object Notation (JSON)
+ Optimized Row Columnar (ORC)
+ Image – Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

For files formatted in JSON, Data Wrangler supports both JSON lines (.jsonl) and JSON documents (.json). When you preview your data, it automatically shows the JSON in tabular format. For nested JSON documents that are larger than 5 MB, Data Wrangler shows the schema for the structure and the arrays as values in the dataset. Use the **Flatten structured** and **Explode array** operators to display the nested values in tabular format. For more information, see [Unnest JSON Data](data-wrangler-transform.md#data-wrangler-transform-flatten-column) and [Explode Array](data-wrangler-transform.md#data-wrangler-transform-explode-array).
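The difference between the two JSON layouts can be sketched with pandas (pandas here is only for illustration; Data Wrangler parses the files for you):

```python
import io
import pandas as pd

# JSON lines (.jsonl): one JSON object per line.
jsonl = '{"name": "Alice", "age": 30}\n{"name": "Bob", "age": 25}\n'
df_lines = pd.read_json(io.StringIO(jsonl), lines=True)

# A JSON document (.json): a single array of objects.
doc = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
df_doc = pd.read_json(io.StringIO(doc))

print(df_lines.equals(df_doc))  # prints True
```

Both layouts yield the same tabular result; JSON lines is usually preferred for large files because it can be read record by record.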

When you choose a dataset, you can rename it, specify the file type, and identify the first row as a header.

You can import a dataset that you've partitioned into multiple files in an Amazon S3 bucket in a single import step.

**To import a dataset into Data Wrangler from a single file that you've stored in Amazon S3:**

1. If you are not currently on the **Import** tab, choose **Import**.

1. Under **Available**, choose **Amazon S3**.

1. In the **Import tabular, image, or time-series data from S3** section, do one of the following:
   + Choose an Amazon S3 bucket from the tabular view and navigate to the file that you're importing.
   + For **S3 source**, specify an Amazon S3 bucket or an Amazon S3 URI and select **Go**. The Amazon S3 URIs can be in one of the following formats:
     + `s3://amzn-s3-demo-bucket/example-prefix/example-file`
     + `example-access-point-aqfqprnstn7aefdfbarligizwgyfouse1a-s3alias/datasets/example-file`
     + `s3://arn:aws:s3:AWS-Region:111122223333:accesspoint/example-prefix/example-file`

1. Choose the dataset to open the **Import settings** pane.

1. If your CSV file has a header, select the checkbox next to **Add header to table**.

1. Use the **Preview** table to preview your dataset. This table shows up to 100 rows. 

1. In the **Details** pane, verify or change the **Name** and **File Type** for your dataset. If you add a **Name** that contains spaces, these spaces are replaced with underscores when your dataset is imported. 

1. Specify the sampling configuration that you'd like to use. 

1. Choose **Import**.

------
#### [ Multifile Import ]

The following are the requirements for importing multiple files:
+ The files must be in the same folder of your Amazon S3 bucket.
+ The files must either share the same header or have no header.

Each file must be in one of the following formats:
+ CSV
+ Parquet
+ Optimized Row Columnar (ORC)
+ Image – Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

Use the following procedure to import multiple files.

**To import a dataset into Data Wrangler from multiple files that you've stored in an Amazon S3 directory**

1. If you are not currently on the **Import** tab, choose **Import**.

1. Under **Available**, choose **Amazon S3**.

1. In the **Import tabular, image, or time-series data from S3** section, do one of the following:
   + Choose an Amazon S3 bucket from the tabular view and navigate to the folder containing the files that you're importing.
   + For **S3 source**, specify the Amazon S3 bucket or an Amazon S3 URI with your files and select **Go**. The following are valid URIs:
     + `s3://amzn-s3-demo-bucket/example-prefix/example-prefix`
     + `example-access-point-aqfqprnstn7aefdfbarligizwgyfouse1a-s3alias/example-prefix/`
     + `s3://arn:aws:s3:AWS-Region:111122223333:accesspoint/example-prefix`

1. Select the folder containing the files that you want to import. Each file must be in one of the supported formats, and all of the files must be the same file type.

1. If your folder contains CSV files with headers, select the checkbox next to **First row is header**.

1. If your files are nested within other folders, select the checkbox next to **Include nested directories**.

1. (Optional) Choose **Add filename column** to add a column to the dataset that shows the filename for each observation.

1. (Optional) By default, Data Wrangler doesn't show you a preview of a folder. You can activate previewing by choosing the blue **Preview off** button. A preview shows the first 10 rows of the first 10 files in the folder.

1. In the **Details** pane, verify or change the **Name** and **File Type** for your dataset. If you add a **Name** that contains spaces, these spaces are replaced with underscores when your dataset is imported. 

1. Specify the sampling configuration that you'd like to use. 

1. Choose **Import dataset**.

------

You can also use parameters to import a subset of files that match a pattern. Parameters help you more selectively pick the files that you're importing. To start using parameters, edit the data source and apply them to the path that you're using to import the data. For more information, see [Reusing Data Flows for Different Datasets](data-wrangler-parameterize.md).

## Import data from Athena
<a name="data-wrangler-import-athena"></a>

Use Amazon Athena to import your data from Amazon Simple Storage Service (Amazon S3) into Data Wrangler. In Athena, you write standard SQL queries to select the data that you're importing from Amazon S3. For more information, see [What is Amazon Athena?](https://docs.aws.amazon.com/athena/latest/ug/what-is.html)

You can use the AWS Management Console to set up Amazon Athena. You must create at least one database in Athena before you start running queries. For more information about getting started with Athena, see [Getting started](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html).

Athena is directly integrated with Data Wrangler. You can write Athena queries without having to leave the Data Wrangler UI.

In addition to writing simple Athena queries in Data Wrangler, you can also use:
+ Athena workgroups for query result management. For more information about workgroups, see [Managing query results](#data-wrangler-import-manage-results).
+ Lifecycle configurations for setting data retention periods. For more information about data retention, see [Setting data retention periods](#data-wrangler-import-athena-retention).

### Query Athena within Data Wrangler
<a name="data-wrangler-import-athena-query"></a>

**Note**  
Data Wrangler does not support federated queries.

If you use AWS Lake Formation with Athena, make sure your Lake Formation IAM permissions do not override IAM permissions for the database `sagemaker_data_wrangler`.

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For Athena, it provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

The following procedure shows how to import a dataset from Athena into Data Wrangler.

**To import a dataset into Data Wrangler from Athena**

1. Sign into [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Amazon Athena**.

1. For **Data Catalog**, choose a data catalog.

1. Use the **Database** dropdown list to select the database that you want to query. When you select a database, you can preview all tables in your database using the **Tables** listed under **Details**.

1. (Optional) Choose **Advanced configuration**.

   1. Choose a **Workgroup**.

   1. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a workgroup, specify a value for **Amazon S3 location of query results**.

   1. (Optional) For **Data retention period**, select the checkbox to set a data retention period and specify the number of days to store the data before it's deleted.

   1. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the checkbox and not save the connection.

1. For **Sampling**, choose a sampling method. Choose **None** to turn off sampling.

1. Enter your query in the query editor and use the **Run** button to run the query. After a successful query, you can preview your result under the editor.
**Note**  
Salesforce data uses the `timestamptz` type. If you're querying the timestamp column that you've imported to Athena from Salesforce, cast the data in the column to the `timestamp` type. The following query casts the timestamp column to the correct type.  

   ```
   # cast column timestamptz_col as timestamp type, and name it as timestamp_col
   select cast(timestamptz_col as timestamp) as timestamp_col from table
   ```

1. To import the results of your query, select **Import**.

After you complete the preceding procedure, the dataset that you've queried and imported appears in the Data Wrangler flow.

By default, Data Wrangler saves the connection settings as a new connection. When you import your data, the query that you've already specified appears as a new connection. The saved connections store information about the Athena workgroups and Amazon S3 buckets that you're using. When you're connecting to the data source again, you can choose the saved connection.

### Managing query results
<a name="data-wrangler-import-manage-results"></a>

Data Wrangler supports using Athena workgroups to manage the query results within an AWS account. You can specify an Amazon S3 output location for each workgroup. You can also specify whether the output of the query can go to different Amazon S3 locations. For more information, see [Using Workgroups to Control Query Access and Costs](https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html).

Your workgroup might be configured to enforce the Amazon S3 query output location. You can't change the output location of the query results for those workgroups.

If you don't use a workgroup or specify an output location for your queries, Data Wrangler uses the default Amazon S3 bucket in the same AWS Region in which your Studio Classic instance is located to store Athena query results. It creates temporary tables in the `sagemaker_data_wrangler` database to move the query output to this Amazon S3 bucket. It deletes these tables after the data has been imported; however, the database itself persists. To learn more, see [Imported Data Storage](#data-wrangler-import-storage).

To use Athena workgroups, set up the IAM policy that gives access to workgroups. If you're using a `SageMaker AI-Execution-Role`, we recommend adding the policy to the role. For more information about IAM policies for workgroups, see [IAM policies for accessing workgroups](https://docs.aws.amazon.com/athena/latest/ug/workgroups-iam-policy.html). For example workgroup policies, see [Workgroup example policies](https://docs.aws.amazon.com/athena/latest/ug/example-policies-workgroup.html).

### Setting data retention periods
<a name="data-wrangler-import-athena-retention"></a>

Data Wrangler automatically sets a data retention period for the query results, and the results are deleted when that period ends. The default retention period is five days, so query results are deleted after five days. This configuration is designed to help you clean up data that you're no longer using. Cleaning up your data prevents unauthorized users from gaining access to it and helps control the costs of storing your data on Amazon S3.

If you don't set a retention period, the Amazon S3 lifecycle configuration determines the duration that the objects are stored. The data retention policy that you've specified for the lifecycle configuration removes any query results that are older than the lifecycle configuration that you've specified. For more information, see [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).
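As a sketch of the kind of lifecycle rule involved, the following builds an S3 lifecycle configuration that expires objects after the default five-day retention period. The rule ID, prefix, and bucket name are assumptions for illustration; Data Wrangler manages this configuration for you when it has the permissions described below.

```python
# A sketch of an S3 lifecycle rule matching the default five-day retention.
# The rule ID and prefix are assumptions for illustration.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-data-wrangler-query-results",
            "Filter": {"Prefix": "data_wrangler_flows/"},
            "Status": "Enabled",
            "Expiration": {"Days": 5},
        }
    ]
}

# Applying it requires the PutLifecycleConfiguration permission, e.g.:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="amzn-s3-demo-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
print(lifecycle_configuration["Rules"][0]["Expiration"]["Days"])  # prints 5
```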

Data Wrangler uses Amazon S3 lifecycle configurations to manage data retention and expiration. You must give your Amazon SageMaker Studio Classic IAM execution role permissions to manage bucket lifecycle configurations. Use the following procedure to give permissions.

To give permissions to manage the lifecycle configuration, do the following.

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search bar, specify the Amazon SageMaker AI execution role that Amazon SageMaker Studio Classic is using.

1. Choose the role.

1. Choose **Add permissions**.

1. Choose **Create inline policy**.

1. For **Service**, specify **S3** and choose it.

1. Under the **Read** section, choose **GetLifecycleConfiguration**.

1. Under the **Write** section, choose **PutLifecycleConfiguration**.

1. For **Resources**, choose **Specific**.

1. For **Actions**, select the arrow icon next to **Permissions management**.

1. Choose **PutResourcePolicy**.

1. For **Resources**, choose **Specific**.

1. Choose the checkbox next to **Any in this account**.

1. Choose **Review policy**.

1. For **Name**, specify a name.

1. Choose **Create policy**.
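
The console steps above produce an inline policy similar to the following sketch. The bucket ARN is a placeholder; scope `Resource` to the bucket that stores your query results.

```python
import json

# Illustrative policy document equivalent to the lifecycle-configuration
# permissions chosen in the console steps above. Bucket name is hypothetical.
lifecycle_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetLifecycleConfiguration",
                "s3:PutLifecycleConfiguration",
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket",
        }
    ],
}
print(json.dumps(lifecycle_policy, indent=2))
```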

## Import data from Amazon Redshift
<a name="data-wrangler-import-redshift"></a>

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your dataset and then perform data analysis queries. 

You can connect to and query one or more Amazon Redshift clusters in Data Wrangler. To use this import option, you must create at least one cluster in Amazon Redshift. To learn how, see [Getting started with Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html).

You can output the results of your Amazon Redshift query in one of the following locations:
+ The default Amazon S3 bucket
+ An Amazon S3 output location that you specify

You can either import the entire dataset or sample a portion of it. For Amazon Redshift, Data Wrangler provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

Data Wrangler stores Amazon Redshift query results in the default Amazon S3 bucket, which is in the same AWS Region in which your Studio Classic instance is located. For more information, see [Imported Data Storage](#data-wrangler-import-storage).

For either the default Amazon S3 bucket or the bucket that you specify, you have the following encryption options:
+ The default server-side encryption with an Amazon S3 managed key (SSE-S3)
+  An AWS Key Management Service (AWS KMS) key that you specify

An AWS KMS key is an encryption key that you create and manage. For more information on KMS keys, see [AWS Key Management Service](https://docs.aws.amazon.com//kms/latest/developerguide/overview.html).

You can specify an AWS KMS key using either the key ARN or the alias ARN.

If you use the IAM managed policy, `AmazonSageMakerFullAccess`, to grant a role permission to use Data Wrangler in Studio Classic, your **Database User** name must have the prefix `sagemaker_access`.

Use the following procedures to learn how to add a new cluster. 

**Note**  
Data Wrangler uses the Amazon Redshift Data API with temporary credentials. To learn more about this API, refer to [Using the Amazon Redshift Data API](https://docs.aws.amazon.com//redshift/latest/mgmt/data-api.html) in the Amazon Redshift Management Guide.
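
That temporary-credentials mechanism maps to a Data API `ExecuteStatement` call like the following sketch. The cluster identifier, database, and user are placeholders; note the `sagemaker_access` prefix required when using the `AmazonSageMakerFullAccess` managed policy.

```python
def redshift_data_api_request(cluster_id, database, db_user, sql):
    """Build parameters for a Redshift Data API ExecuteStatement call that
    uses temporary credentials (DbUser) instead of a stored password."""
    return {
        "ClusterIdentifier": cluster_id,  # identifier only, not the full endpoint
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }

req = redshift_data_api_request(
    "example-cluster",
    "dev",
    "sagemaker_access_demo_user",  # prefix required with AmazonSageMakerFullAccess
    "SELECT * FROM example_table LIMIT 100",
)
# To run the query (requires redshift-data and GetClusterCredentials permissions):
# boto3.client("redshift-data").execute_statement(**req)
```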

**To connect to an Amazon Redshift cluster**

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Amazon Redshift**.

1. Choose **Temporary credentials (IAM)** for **Type**.

1. Enter a **Connection Name**. This is a name used by Data Wrangler to identify this connection. 

1. Enter the **Cluster Identifier** to specify to which cluster you want to connect. Note: Enter only the cluster identifier and not the full endpoint of the Amazon Redshift cluster.

1. Enter the **Database Name** of the database to which you want to connect.

1. Enter a **Database User** to identify the user you want to use to connect to the database. 

1. For **UNLOAD IAM Role**, enter the IAM role ARN of the role that the Amazon Redshift cluster should assume to move and write data to Amazon S3. For more information about this role, see [Authorizing Amazon Redshift to access other AWS services on your behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) in the Amazon Redshift Management Guide. 

1. Choose **Connect**.

1. (Optional) For **Amazon S3 output location**, specify the S3 URI to store the query results.

1. (Optional) For **KMS key ID**, specify the ARN of the AWS KMS key or alias. The following image shows you where you can find either key in the AWS Management Console.  
![\[The location of the AWS KMS alias ARN, alias name, and key ARN in the AWS KMS console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/kms-alias-redacted.png)

The following image shows all the fields from the preceding procedure.

![\[The Add Amazon Redshift connection panel.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/redshift-connection.png)


After your connection is successfully established, it appears as a data source under **Data Import**. Select this data source to query your database and import data.

**To query and import data from Amazon Redshift**

1. Select the connection that you want to query from **Data Sources**.

1. Select a **Schema**. To learn more about Amazon Redshift Schemas, see [Schemas](https://docs.aws.amazon.com/redshift/latest/dg/r_Schemas_and_tables.html) in the Amazon Redshift Database Developer Guide.

1. (Optional) Under **Advanced configuration**, specify the **Sampling** method that you'd like to use.

1. Enter your query in the query editor and choose **Run** to run the query. After a successful query, you can preview your result under the editor.

1. Select **Import dataset** to import the dataset that has been queried. 

1. Enter a **Dataset name**. If you add a **Dataset name** that contains spaces, these spaces are replaced with underscores when your dataset is imported. 

1. Choose **Add**.

To edit a dataset, do the following.

1. Navigate to your Data Wrangler flow.

1. Choose the edit icon next to **Source - Sampled**.

1. Change the data that you're importing.

1. Choose **Apply**.

## Import data from Amazon EMR
<a name="data-wrangler-emr"></a>

You can use Amazon EMR as a data source for your Amazon SageMaker Data Wrangler flow. Amazon EMR is a managed cluster platform that you can use to process and analyze large amounts of data. For more information about Amazon EMR, see [What is Amazon EMR?](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). To import a dataset from Amazon EMR, you connect to the cluster and query it.

**Important**  
You must meet the following prerequisites to connect to an Amazon EMR cluster:  
You have an Amazon VPC in the Region that you're using to launch Amazon SageMaker Studio Classic and Amazon EMR.
Both Amazon EMR and Amazon SageMaker Studio Classic must be launched in private subnets. They can be in the same subnet or in different ones.
Amazon SageMaker Studio Classic must be in VPC-only mode.  
For more information about creating a VPC, see [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#Create-VPC).  
For more information about connecting Studio Classic notebooks in a VPC to external resources, see [Connect SageMaker Studio Classic Notebooks in a VPC to External Resources](https://docs.aws.amazon.com/vpc/latest/userguide/studio-notebooks-and-internet-access.html).
The Amazon EMR clusters that you're running must be in the same Amazon VPC.
The Amazon EMR clusters and the Amazon VPC must be in the same AWS account.
Your Amazon EMR clusters are running Hive or Presto.  
Hive clusters must allow inbound traffic from Studio Classic security groups on port 10000.
Presto clusters must allow inbound traffic from Studio Classic security groups on port 8889.  
The port number is different for Amazon EMR clusters using IAM roles. Navigate to the end of the prerequisites section for more information.
Amazon SageMaker Studio Classic must run Jupyter Lab Version 3. For information about updating the Jupyter Lab Version, see [View and update the JupyterLab version of an application from the console](studio-jl.md#studio-jl-view).
Amazon SageMaker Studio Classic has an IAM role that controls user access. The default IAM role that you're using to run Amazon SageMaker Studio Classic doesn't have policies that can give you access to Amazon EMR clusters. You must attach the policy granting permissions to the IAM role. For more information, see [Configure listing Amazon EMR clusters](studio-notebooks-configure-discoverability-emr-cluster.md).
The IAM role must also have the following policy attached `secretsmanager:PutResourcePolicy`.
If you're using a Studio Classic domain that you've already created, make sure that its `AppNetworkAccessType` is in VPC-only mode. For information about updating a domain to use VPC-only mode, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).
You must have Hive or Presto installed on your cluster.
The Amazon EMR release must be version 5.5.0 or later.  
Amazon EMR supports auto termination. Auto termination stops idle clusters from running and prevents you from incurring costs. The following are the releases that support auto termination:  
For 6.x releases, version 6.1.0 or later.
For 5.x releases, version 5.30.0 or later.
Use the following pages to set up IAM runtime roles for the Amazon EMR cluster. You must enable in-transit encryption when you're using runtime roles:  
[Prerequisites for launching an Amazon EMR cluster with a runtime role](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html#emr-steps-runtime-roles-configure)
[Launch an Amazon EMR cluster with role-based access control](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html#emr-steps-runtime-roles-launch)
You must use Lake Formation as a governance tool for the data within your databases. You must also use external data filtering for access control.  
For more information about Lake Formation, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)
For more information about integrating Lake Formation into Amazon EMR, see [Integrating third-party services with Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/Integrating-with-LakeFormation.html).
The version of your cluster must be 6.9.0 or later.
Access to AWS Secrets Manager. For more information about Secrets Manager, see [What is AWS Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html)
Hive clusters must allow inbound traffic from Studio Classic security groups on port 10000.

An Amazon VPC is a virtual network that is logically isolated from other networks in the AWS Cloud. Amazon SageMaker Studio Classic and your Amazon EMR cluster exist only within the Amazon VPC.

Use the following procedure to launch Amazon SageMaker Studio Classic in an Amazon VPC.

To launch Studio Classic within a VPC, do the following.

1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Launch SageMaker Studio Classic**.

1. Choose **Standard setup**.

1. For **Default execution role**, choose the IAM role to set up Studio Classic.

1. Choose the VPC where you've launched the Amazon EMR clusters.

1. For **Subnet**, choose a private subnet.

1. For **Security group(s)**, specify the security groups that you're using to control communication within your VPC.

1. Choose **VPC Only**.

1. (Optional) AWS uses a default encryption key. You can specify an AWS Key Management Service key to encrypt your data.

1. Choose **Next**.

1. Under **Studio settings**, choose the configurations that are best suited to you.

1. Choose **Next** to skip the SageMaker Canvas settings.

1. Choose **Next** to skip the RStudio settings.

If you don't have an Amazon EMR cluster ready, you can use the following procedure to create one. For more information about Amazon EMR, see [What is Amazon EMR?](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html)

To create a cluster, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify **Amazon EMR**.

1. Choose **Create cluster**.

1. For **Cluster name**, specify the name of your cluster.

1. For **Release**, select the release version of the cluster.
**Note**  
Amazon EMR supports auto termination for the following releases:  
For 6.x releases, releases 6.1.0 or later
For 5.x releases, releases 5.30.0 or later
Auto termination stops idle clusters from running and prevents you from incurring costs.

1. (Optional) For **Applications**, choose **Presto**.

1. Choose the application that you're running on the cluster.

1. Under **Networking**, for **Hardware configuration**, specify the hardware configuration settings.
**Important**  
For **Networking**, choose the VPC that is running Amazon SageMaker Studio Classic and choose a private subnet.

1. Under **Security and access**, specify the security settings.

1. Choose **Create**.

For a tutorial about creating an Amazon EMR cluster, see [Getting started with Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html). For information about best practices for configuring a cluster, see [Considerations and best practices](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-considerations.html).

**Note**  
For security best practices, Data Wrangler can only connect to VPCs on private subnets. You can't connect to the master node unless you use AWS Systems Manager for your Amazon EMR instances. For more information, see [Securing access to EMR clusters using AWS Systems Manager](https://aws.amazon.com/blogs/big-data/securing-access-to-emr-clusters-using-aws-systems-manager/).

You can currently use the following methods to access an Amazon EMR cluster:
+ No authentication
+ Lightweight Directory Access Protocol (LDAP)
+ IAM (Runtime role)

Not using authentication or using LDAP can require you to create multiple clusters and Amazon EC2 instance profiles. If you’re an administrator, you might need to provide groups of users with different levels of access to the data. These methods can result in administrative overhead that makes it more difficult to manage your users.

We recommend using an IAM runtime role that gives multiple users the ability to connect to the same Amazon EMR cluster. A runtime role is an IAM role that you can assign to a user who is connecting to an Amazon EMR cluster. You can configure the runtime IAM role to have permissions that are specific to each group of users.

Use the following sections to create a Presto or Hive Amazon EMR cluster with LDAP activated.

------
#### [ Presto ]

**Important**  
To use AWS Glue as a metastore for Presto tables, select **Use** for **Presto table metadata** to store the results of your Amazon EMR queries in an AWS Glue data catalog when you're launching an EMR cluster. Storing the query results in an AWS Glue data catalog can save you from incurring charges.  
To query large datasets on Amazon EMR clusters, you must add the following properties to the Presto configuration file on your Amazon EMR clusters:  

```
[{"classification":"presto-config","properties":{
"http-server.max-request-header-size":"5MB",
"http-server.max-response-header-size":"5MB"}}]
```
You can also modify the configuration settings when you launch the Amazon EMR cluster.  
The configuration file for your Amazon EMR cluster is located under the following path: `/etc/presto/conf/config.properties`.

Use the following procedure to create a Presto cluster with LDAP activated.

To create a cluster, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify **Amazon EMR**.

1. Choose **Create cluster**.

1. For **Cluster name**, specify the name of your cluster.

1. For **Release**, select the release version of the cluster.
**Note**  
Amazon EMR supports auto termination for the following releases:  
For 6.x releases, releases 6.1.0 or later
For 5.x releases, releases 5.30.0 or later
Auto termination stops idle clusters from running and prevents you from incurring costs.

1. Choose the application that you're running on the cluster.

1. Under **Networking**, for **Hardware configuration**, specify the hardware configuration settings.
**Important**  
For **Networking**, choose the VPC that is running Amazon SageMaker Studio Classic and choose a private subnet.

1. Under **Security and access**, specify the security settings.

1. Choose **Create**.

------
#### [ Hive ]

**Important**  
To use AWS Glue as a metastore for Hive tables, select **Use** for **Hive table metadata** to store the results of your Amazon EMR queries in an AWS Glue data catalog when you're launching an EMR cluster. Storing the query results in an AWS Glue data catalog can save you from incurring charges.  
To be able to query large datasets on Amazon EMR clusters, add the following properties to the Hive configuration file on your Amazon EMR clusters:  

```
[{"classification":"hive-site", "properties"
:{"hive.resultset.use.unique.column.names":"false"}}]
```
You can also modify the configuration settings when you launch the Amazon EMR cluster.  
The configuration file for your Amazon EMR cluster is located under the following path: `/etc/hive/conf/hive-site.xml`. You can specify the following property and restart the cluster:  

```
<property>
    <name>hive.resultset.use.unique.column.names</name>
    <value>false</value>
</property>
```

Use the following procedure to create a Hive cluster with LDAP activated.

To create a Hive cluster with LDAP activated, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify **Amazon EMR**.

1. Choose **Create cluster**.

1. Choose **Go to advanced options**.

1. For **Release**, select an Amazon EMR release version.

1. The **Hive** configuration option is selected by default. Make sure the **Hive** option has a checkbox next to it.

1. (Optional) You can also select **Presto** as a configuration option to activate both Hive and Presto on your cluster.

1. (Optional) Select **Use for Hive table metadata** to store the results of your Amazon EMR queries in an AWS Glue data catalog. Storing the query results in an AWS Glue data catalog can save you from incurring charges. For more information, see [Using the AWS Glue Data Catalog as the metastore for Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html).
**Note**  
Storing the query results in a data catalog requires Amazon EMR version 5.8.0 or later.

1. Under **Enter configuration**, specify the following JSON:

   ```
   [
     {
       "classification": "hive-site",
       "properties": {
         "hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
         "hive.server2.authentication": "LDAP",
         "hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
       }
     }
   ]
   ```
**Note**  
As a security best practice, we recommend enabling SSL for HiveServer by adding a few properties in the preceding hive-site JSON. For more information, see [Enable SSL on HiveServer2](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-wire-encryption/content/enable_ssl_on_hiveserver2.html).

1. Specify the remaining cluster settings and create a cluster.

------

Use the following sections to use LDAP authentication for Amazon EMR clusters that you've already created.

------
#### [ LDAP for Presto ]

Using LDAP on a cluster running Presto requires access to the Presto coordinator through HTTPS. Do the following to provide access:
+ Activate access on port 636
+ Enable SSL for the Presto coordinator

Use the following template to configure Presto:

```
- Classification: presto-config
     ConfigurationProperties:
        http-server.authentication.type: 'PASSWORD'
        http-server.https.enabled: 'true'
        http-server.https.port: '8889'
        http-server.http.port: '8899'
        node-scheduler.include-coordinator: 'true'
        http-server.https.keystore.path: '/path/to/keystore/path/for/presto'
        http-server.https.keystore.key: 'keystore-key-password'
        discovery.uri: 'http://master-node-dns-name:8899'
- Classification: presto-password-authenticator
     ConfigurationProperties:
        password-authenticator.name: 'ldap'
        ldap.url: !Sub 'ldaps://ldap-server-dns-name:636'
        ldap.user-bind-pattern: "uid=${USER},dc=example,dc=org"
        internal-communication.authentication.ldap.user: "ldap-user-name"
        internal-communication.authentication.ldap.password: "ldap-password"
```

For information about setting up LDAP in Presto, see the following resources:
+ [LDAP Authentication](https://prestodb.io/docs/current/security/ldap.html)
+ [Using LDAP Authentication for Presto on Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-ldap.html)

**Note**  
As a security best practice, we recommend enabling SSL for Presto. For more information, see [Secure Internal Communication](https://prestodb.io/docs/current/security/internal-communication.html).

------
#### [ LDAP for Hive ]

To use LDAP for Hive on a cluster that you've already created, use the procedure in [Reconfigure an instance group in the console](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html#emr-configure-apps-running-cluster-considerations).

When you reconfigure the instance group, specify the following `hive-site` configuration for the cluster to which you're connecting.

```
[
  {
    "classification": "hive-site",
    "properties": {
      "hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
      "hive.server2.authentication": "LDAP",
      "hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
    }
  }
]
```

------

Use the following procedure to import data from a cluster.

To import data from a cluster, do the following.

1. Open a Data Wrangler flow.

1. Choose **Create Connection**.

1. Choose **Amazon EMR**.

1. Do one of the following.
   + (Optional) For **Secrets ARN**, specify the Amazon Resource Name (ARN) of the secret for the database within the cluster. Secrets provide additional security. For more information about secrets, see [What is AWS Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) For information about creating a secret for your cluster, see [Creating an AWS Secrets Manager secret for your cluster](#data-wrangler-emr-secrets-manager).
**Important**  
You must specify a secret if you're using an IAM runtime role for authentication.
   + From the dropdown table, choose a cluster.

1. Choose **Next**.

1. For **Select an endpoint for *example-cluster-name* cluster**, choose a query engine.

1. (Optional) Select **Save connection**.

1. Choose **Next, select login** and choose one of the following:
   + No authentication
   + LDAP
   + IAM

1. For **Login into *example-cluster-name* cluster**, specify the **Username** and **Password** for the cluster.

1. Choose **Connect**.

1. In the query editor specify a SQL query.

1. Choose **Run**.

1. Choose **Import**.

### Creating an AWS Secrets Manager secret for your cluster
<a name="data-wrangler-emr-secrets-manager"></a>

If you're using an IAM runtime role to access your Amazon EMR cluster, you must store all the credentials that you use to access the cluster as a Secrets Manager secret.

You must store the following information in the secret:
+ JDBC endpoint – `jdbc:hive2://`
+ DNS name – The DNS name of your Amazon EMR cluster. It's either the endpoint for the primary node or the hostname.
+ Port – `8446`

You can also store the following additional information within the secret:
+ IAM role – The IAM role that you're using to access the cluster. Data Wrangler uses your SageMaker AI execution role by default.
+ Truststore path – By default, Data Wrangler creates a truststore path for you. You can also use your own truststore path. For more information about truststore paths, see [In-transit encryption in HiveServer2](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/hs2-encryption-intransit.html).
+ Truststore password – By default, Data Wrangler creates a truststore password for you. You can also use your own truststore password. For more information about truststore passwords, see [In-transit encryption in HiveServer2](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/hs2-encryption-intransit.html).

Use the following procedure to store the credentials within a Secrets Manager secret.

To store your credentials as a secret, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify Secrets Manager.

1. Choose **AWS Secrets Manager**.

1. Choose **Store a new secret**.

1. For **Secret type**, choose **Other type of secret**.

1. Under **Key/value pairs**, select **Plaintext**.

1. For clusters running Hive, you can use the following template for IAM authentication.

   ```
   {"jdbcURL": "",
    "iam_auth": {"endpoint": "jdbc:hive2://", #required
                 "dns": "ip-xx-x-xxx-xxx.ec2.internal", #required
                 "port": "10000", #required
                 "cluster_id": "j-xxxxxxxxx", #required
                 "iam_role": "arn:aws:iam::xxxxxxxx:role/xxxxxxxxxxxx", #optional
                 "truststore_path": "/etc/alternatives/jre/lib/security/cacerts", #optional
                 "truststore_password": "changeit" #optional
                }}
   ```
**Note**  
After you import your data, you apply transformations to them. You then export the data that you've transformed to a specific location. If you're using a Jupyter notebook to export your transformed data to Amazon S3, you must use the truststore path specified in the preceding example.
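
For illustration, the following sketch assembles a secret string matching the required fields of the template above and shows, commented out, how it could be stored through the Secrets Manager API. All values, including the secret name, are placeholders.

```python
import json

# Placeholder values matching the template's required fields.
secret_value = {
    "jdbcURL": "",
    "iam_auth": {
        "endpoint": "jdbc:hive2://",            # required
        "dns": "ip-xx-x-xxx-xxx.ec2.internal",  # required
        "port": "10000",                        # required
        "cluster_id": "j-xxxxxxxxx",            # required
    },
}
secret_string = json.dumps(secret_value)
# Storing it requires secretsmanager:CreateSecret permissions:
# boto3.client("secretsmanager").create_secret(
#     Name="emr-hive-iam-connection", SecretString=secret_string)
```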

A Secrets Manager secret stores the JDBC URL of the Amazon EMR cluster. Using a secret is more secure than directly entering your credentials.

Use the following procedure to store the JDBC URL as a secret.

To store the JDBC URL as a secret, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify Secrets Manager.

1. Choose **AWS Secrets Manager**.

1. Choose **Store a new secret**.

1. For **Secret type**, choose **Other type of secret**.

1. For **Key/value pairs**, specify `jdbcURL` as the key and a valid JDBC URL as the value.

   The format of a valid JDBC URL depends on whether you use authentication and whether you use Hive or Presto as the query engine. The following list shows the valid JDBC URL formats for the different possible configurations.
   + Hive, no authentication – `jdbc:hive2://emr-cluster-master-public-dns:10000/;`
   + Hive, LDAP authentication – `jdbc:hive2://emr-cluster-master-public-dns-name:10000/;AuthMech=3;UID=david;PWD=welcome123;`
   + For Hive with SSL enabled, the JDBC URL format depends on whether you use a Java Keystore File for the TLS configuration. The Java Keystore File helps verify the identity of the master node of the Amazon EMR cluster. To use a Java Keystore File, generate it on an EMR cluster and upload it to Data Wrangler. To generate a file, use the following command on the Amazon EMR cluster, `keytool -genkey -alias hive -keyalg RSA -keysize 1024 -keystore hive.jks`. For information about running commands on an Amazon EMR cluster, see [Securing access to EMR clusters using AWS Systems Manager](https://aws.amazon.com/blogs/big-data/securing-access-to-emr-clusters-using-aws-systems-manager/). To upload a file, choose the upward arrow on the left-hand navigation of the Data Wrangler UI.

     The following are the valid JDBC URL formats for Hive with SSL enabled:
     + Without a Java Keystore File – `jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;AllowSelfSignedCerts=1;`
     + With a Java Keystore File – `jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;SSLKeyStore=/home/sagemaker-user/data/Java-keystore-file-name;SSLKeyStorePwd=Java-keystore-file-passsword;`
   + Presto, no authentication – `jdbc:presto://emr-cluster-master-public-dns:8889/;`
   + For Presto with LDAP authentication and SSL enabled, the JDBC URL format depends on whether you use a Java Keystore File for the TLS configuration. The Java Keystore File helps verify the identity of the master node of the Amazon EMR cluster. To use a Java Keystore File, generate it on an EMR cluster and upload it to Data Wrangler. To upload a file, choose the upward arrow on the left-hand navigation of the Data Wrangler UI. For information about creating a Java Keystore File for Presto, see [Java Keystore File for TLS](https://prestodb.io/docs/current/security/tls.html#server-java-keystore). For information about running commands on an Amazon EMR cluster, see [Securing access to EMR clusters using AWS Systems Manager](https://aws.amazon.com/blogs/big-data/securing-access-to-emr-clusters-using-aws-systems-manager/).
     + Without a Java Keystore File – `jdbc:presto://emr-cluster-master-public-dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;UID=user-name;PWD=password;AllowSelfSignedServerCert=1;AllowHostNameCNMismatch=1;`
     + With a Java Keystore File – `jdbc:presto://emr-cluster-master-public-dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;SSLTrustStorePath=/home/sagemaker-user/data/Java-keystore-file-name;SSLTrustStorePwd=Java-keystore-file-passsword;UID=user-name;PWD=password;`
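
The Hive URL formats above follow a simple pattern. The following is a minimal sketch that assembles them programmatically; it only illustrates the formats listed, and is not a Data Wrangler API.

```python
def hive_jdbc_url(host, port=10000, user=None, password=None, ssl=False):
    """Assemble a HiveServer2 JDBC URL in the formats listed above:
    no authentication, LDAP (AuthMech=3), and SSL without a keystore."""
    url = f"jdbc:hive2://{host}:{port}/;"
    if user and password:
        url += f"AuthMech=3;UID={user};PWD={password};"
    if ssl:
        url += "SSL=1;AllowSelfSignedCerts=1;"
    return url

# No authentication:
print(hive_jdbc_url("emr-cluster-master-public-dns"))
# LDAP authentication:
print(hive_jdbc_url("emr-cluster-master-public-dns", user="david",
                    password="welcome123"))
```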

Throughout the process of importing data from an Amazon EMR cluster, you might run into issues. For information about troubleshooting them, see [Troubleshooting issues with Amazon EMR](data-wrangler-trouble-shooting.md#data-wrangler-trouble-shooting-emr).

## Import data from Databricks (JDBC)
<a name="data-wrangler-databricks"></a>

You can use Databricks as a data source for your Amazon SageMaker Data Wrangler flow. To import a dataset from Databricks, use the JDBC (Java Database Connectivity) import functionality to access your Databricks database. After you access the database, specify a SQL query to get the data and import it.

We assume that you have a running Databricks cluster and that you've configured your JDBC driver to it. For more information, see the following Databricks documentation pages:
+ [JDBC driver](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-driver)
+ [JDBC configuration and connection parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-configuration-and-connection-parameters)
+ [Authentication parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#authentication-parameters)

Data Wrangler stores your JDBC URL in AWS Secrets Manager. You must give your Amazon SageMaker Studio Classic IAM execution role permissions to use Secrets Manager. Use the following procedure to give permissions.

To give permissions to Secrets Manager, do the following.

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search bar, specify the Amazon SageMaker AI execution role that Amazon SageMaker Studio Classic is using.

1. Choose the role.

1. Choose **Add permissions**.

1. Choose **Create inline policy**.

1. For **Service**, specify **Secrets Manager** and choose it.

1. For **Actions**, select the arrow icon next to **Permissions management**.

1. Choose **PutResourcePolicy**.

1. For **Resources**, choose **Specific**.

1. Choose the checkbox next to **Any in this account**.

1. Choose **Review policy**.

1. For **Name**, specify a name.

1. Choose **Create policy**.

You can use partitions to import your data more quickly. Partitions give Data Wrangler the ability to process the data in parallel. By default, Data Wrangler uses 2 partitions. For most use cases, 2 partitions give you near-optimal data processing speeds.

If you choose to specify more than 2 partitions, you can also specify a column to partition the data. The type of the values in the column must be numeric or date.

We recommend using partitions only if you understand the structure of the data and how it's processed.
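
To make the partition settings concrete, the following Python sketch shows how a JDBC-style reader typically turns a numeric column, its lower and upper bounds, and a partition count into one query predicate per partition. The column name and values are hypothetical, and Data Wrangler's internal implementation may differ.

```python
# Sketch of how JDBC-style partitioned reads typically split a numeric
# column into per-partition predicates that can run in parallel.
# The column name and bounds below are placeholders.

def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE clause per partition over [lower, upper)."""
    stride = (upper - lower) / num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            # First partition also catches values below the lower bound.
            predicates.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            # Last partition also catches values above the upper bound.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

print(partition_predicates("order_epoch", 0, 100, 4))
```

This is why bounds close to the column's actual minimum and maximum matter: the first and last partitions absorb everything outside the bounds, so loose bounds leave most rows in two partitions and defeat the parallelism.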

You can either import the entire dataset or sample a portion of it. For a Databricks database, Data Wrangler provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.
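
As an illustration of the stratified option, the following Python sketch draws a fixed fraction from each group so that the ratio of values in a column is preserved. It is illustrative only; Data Wrangler's sampler isn't necessarily implemented this way.

```python
import random
from collections import Counter, defaultdict

# Illustrative stratified sampler: draw the same fraction from each
# stratum so the ratio of values in the strata column is preserved.

def stratified_sample(rows, key, fraction, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per stratum so no group disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rows = [{"label": "a"}] * 80 + [{"label": "b"}] * 20
sample = stratified_sample(rows, "label", 0.1)
print(Counter(r["label"] for r in sample))  # 4:1 ratio, matching the source
```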

Use the following procedure to import your data from a Databricks database.

To import data from Databricks, do the following.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. From the **Import data** tab of your Data Wrangler flow, choose **Databricks**.

1. Specify the following fields:
   + **Dataset name** – A name that you want to use for the dataset in your Data Wrangler flow.
   + **Driver** – **com.simba.spark.jdbc.Driver**.
   + **JDBC URL** – The URL of the Databricks database. The URL formatting can vary between Databricks instances. For information about finding the URL and specifying the parameters within it, see [JDBC configuration and connection parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-configuration-and-connection-parameters). The following is an example of how a URL can be formatted: jdbc:spark://aws-sagemaker-datawrangler.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3122619508517275/0909-200301-cut318;AuthMech=3;UID=*token*;PWD=*personal-access-token*.
**Note**  
You can specify a secret ARN that contains the JDBC URL instead of specifying the JDBC URL itself. The secret must contain a key-value pair with the following format: `jdbcURL:JDBC-URL`. For more information, see [What is Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html).

1. Specify a SQL SELECT statement.
**Note**  
Data Wrangler doesn't support Common Table Expressions (CTE) or temporary tables within a query.

1. For **Sampling**, choose a sampling method.

1. Choose **Run**. 

1. (Optional) For the **PREVIEW**, choose the gear icon to open the **Partition settings**.

   1. Specify the number of partitions. You can partition by column if you specify the number of partitions:
     + **Enter number of partitions** – Specify a value greater than 2.
     + (Optional) **Partition by column** – Specify the following fields. You can only partition by a column if you've specified a value for **Enter number of partitions**.
       + **Select column** – Select the column that you're using for the data partition. The data type of the column must be numeric or date.
       + **Upper bound** – From the values in the column that you've specified, the upper bound is the value that you're using in the partition. The value that you specify doesn't change the data that you're importing. It only affects the speed of the import. For the best performance, specify an upper bound that's close to the column's maximum.
       + **Lower bound** – From the values in the column that you've specified, the lower bound is the value that you're using in the partition. The value that you specify doesn't change the data that you're importing. It only affects the speed of the import. For the best performance, specify a lower bound that's close to the column's minimum.

1. Choose **Import**.
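
If you store the JDBC URL in a secret as described in the note above, the secret's value must be a JSON object with a `jdbcURL` key. The following Python sketch builds and validates such a payload locally; the workspace host, org ID, and token are placeholders, and the CLI command in the comment is shown only as one possible way to store it.

```python
import json

# Build the Secrets Manager payload that the Databricks import expects:
# a single key named "jdbcURL" whose value is the full JDBC URL.
# The workspace host, org ID, and token below are placeholders.
jdbc_url = (
    "jdbc:spark://example-workspace.cloud.databricks.com:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=sql/protocolv1/o/1234567890123456/0000-000000-example;"
    "AuthMech=3;UID=token;PWD=example-personal-access-token"
)
secret_string = json.dumps({"jdbcURL": jdbc_url})

# One possible way to store it (not executed here):
#   aws secretsmanager create-secret --name example-databricks-jdbc \
#       --secret-string "$SECRET_STRING"
print(secret_string[:50])
```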

## Import data from Salesforce Data Cloud
<a name="data-wrangler-import-salesforce-data-cloud"></a>

You can use Salesforce Data Cloud as a data source in Amazon SageMaker Data Wrangler to prepare the data in your Salesforce Data Cloud for machine learning.

With Salesforce Data Cloud as a data source in Data Wrangler, you can quickly connect to your Salesforce data without writing a single line of code. You can join your Salesforce data with data from any other data source in Data Wrangler.

After you connect to the data cloud, you can do the following:
+ Visualize your data with built-in visualizations
+ Understand data and identify potential errors and extreme values
+ Transform data with more than 300 built-in transformations
+ Export the data that you've transformed

**Topics**
+ [Administrator setup](#data-wrangler-import-salesforce-data-cloud-administrator)
+ [Data Scientist Guide](#data-wrangler-salesforce-data-cloud-ds)

### Administrator setup
<a name="data-wrangler-import-salesforce-data-cloud-administrator"></a>

**Important**  
Before you get started, make sure that your users are running Amazon SageMaker Studio Classic version 1.3.0 or later. For information about checking the version of Studio Classic and updating it, see [Prepare ML Data with Amazon SageMaker Data Wrangler](data-wrangler.md).

When you're setting up access to Salesforce Data Cloud, you must complete the following tasks:
+ Getting your Salesforce Domain URL. Salesforce also refers to the Domain URL as your org's URL.
+ Getting OAuth credentials from Salesforce. 
+ Getting the authorization URL and token URL for your Salesforce Domain.
+ Creating an AWS Secrets Manager secret with the OAuth configuration.
+ Creating a lifecycle configuration that Data Wrangler uses to read the credentials from the secret.
+ Giving Data Wrangler permissions to read the secret.

After you perform the preceding tasks, your users can log into the Salesforce Data Cloud using OAuth.

**Note**  
Your users might run into issues after you've set everything up. For information about troubleshooting, see [Troubleshooting with Salesforce](data-wrangler-trouble-shooting.md#data-wrangler-troubleshooting-salesforce-data-cloud).

Use the following procedure to get the Domain URL.

1. Navigate to the [Salesforce](https://login.salesforce.com) login page.

1. For **Quick find**, specify **My Domain**.

1. Copy the value of **Current My Domain URL** to a text file.

1. Add `https://` to the beginning of the URL. 

After you get the Salesforce Domain URL, you can use the following procedure to get the login credentials from Salesforce and allow Data Wrangler to access your Salesforce data.

To get the login credentials from Salesforce and provide access to Data Wrangler, do the following.

1. Navigate to your Salesforce Domain URL and log into your account.

1. Choose the gear icon.

1. In the search bar that appears, specify **App Manager**.

1. Select **New Connected App**.

1. Specify the following fields:
   + Connected App Name – You can specify any name, but we recommend choosing a name that includes Data Wrangler. For example, you can specify **Salesforce Data Cloud Data Wrangler Integration**.
   + API name – Use the default value.
   + Contact Email – Specify your email address.
   + Under the **API (Enable OAuth Settings)** heading, select the checkbox to activate OAuth settings.
   + For **Callback URL**, specify the Amazon SageMaker Studio Classic URL. To get the URL for Studio Classic, access it from the AWS Management Console and copy the URL.

1. Under **Selected OAuth Scopes**, move the following from the **Available OAuth Scopes** to **Selected OAuth Scopes**:
   + Manage user data via APIs (`api`)
   + Perform requests at any time (`refresh_token`, `offline_access`)
   + Perform ANSI SQL queries on Salesforce Data Cloud data (`cdp_query_api`)
   + Manage Salesforce Customer Data Platform profile data (`cdp_profile_api`)

1. Choose **Save**. After you save your changes, Salesforce opens a new page.

1. Choose **Continue**.

1. Navigate to **Consumer Key and Secret**.

1. Choose **Manage Consumer Details**. Salesforce redirects you to a new page where you might have to pass two-factor authentication.

1. Copy the Consumer Key and Consumer Secret to a text editor.
**Important**  
You need this information to connect the data cloud to Data Wrangler.

1. Navigate back to **Manage Connected Apps**.

1. Under **Connected App Name**, navigate to the name of your application.

1. Choose **Manage**.

   1. Select **Edit Policies**.

   1. Change **IP Relaxation** to **Relax IP restrictions**.

   1. Choose **Save**.

After you provide access to your Salesforce Data Cloud, you need to provide permissions for your users. Use the following procedure to provide them with permissions.

To provide your users with permissions, do the following.

1. Navigate to the setup home page.

1. On the left-hand navigation, search for **Users** and choose the **Users** menu item.

1. Choose the hyperlink with your user name.

1. Navigate to **Permission Set Assignments**.

1. Choose **Edit Assignments**.

1. Add the following permissions:
   + **Customer Data Platform Admin**
   + **Customer Data Platform Data Aware Specialist**

1. Choose **Save**.

After you get the information for your Salesforce Domain, you must get the authorization URL and the token URL for the AWS Secrets Manager secret that you're creating.

Use the following procedure to get the authorization URL and the token URL.

**To get the authorization URL and token URL**

1. Navigate to your Salesforce Domain URL.

1. Use one of the following methods to get the URLs. If you are on a Linux distribution with `curl` and `jq` installed, we recommend using the first (Linux-only) method.
   + (Linux only) Specify the following command in your terminal.

     ```
     curl salesforce-domain-URL/.well-known/openid-configuration | \
     jq '. | { authorization_url: .authorization_endpoint, token_url: .token_endpoint }' | \
     jq '.  += { identity_provider: "SALESFORCE", client_id: "example-client-id", client_secret: "example-client-secret" }'
     ```
   + 

     1. Navigate to *example-org-URL*/.well-known/openid-configuration in your browser.

     1. Copy the `authorization_endpoint` and `token_endpoint` to a text editor.

     1. Create the following JSON object:

        ```
        {
          "identity_provider": "SALESFORCE",
          "authorization_url": "example-authorization-endpoint", 
          "token_url": "example-token-endpoint",
          "client_id": "example-consumer-key",
          "client_secret": "example-consumer-secret"
        }
        ```
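
The browser-based method above can also be expressed in code. The following Python sketch parses a sample openid-configuration document and assembles the OAuth configuration object; all endpoint URLs and credentials are placeholders.

```python
import json

# A sample openid-configuration document; real Salesforce responses
# contain many more fields. All URLs here are placeholders.
discovery = json.loads("""
{
  "authorization_endpoint": "https://example.my.salesforce.com/services/oauth2/authorize",
  "token_endpoint": "https://example.my.salesforce.com/services/oauth2/token"
}
""")

# Assemble the OAuth configuration object in the format shown above.
oauth_config = {
    "identity_provider": "SALESFORCE",
    "authorization_url": discovery["authorization_endpoint"],
    "token_url": discovery["token_endpoint"],
    "client_id": "example-consumer-key",
    "client_secret": "example-consumer-secret",
}
print(json.dumps(oauth_config, indent=4))
```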

After you create the OAuth configuration object, you can create an AWS Secrets Manager secret that stores it. Use the following procedure to create the secret.

To create a secret, do the following.

1. Navigate to the [AWS Secrets Manager console](https://console.aws.amazon.com/secretsmanager/).

1. Choose **Store a secret**.

1. Select **Other type of secret**.

1. Under **Key/value pairs**, select **Plaintext**.

1. Replace the empty JSON with the following configuration settings.

   ```
   {
     "identity_provider": "SALESFORCE",
     "authorization_url": "example-authorization-endpoint", 
     "token_url": "example-token-endpoint",
     "client_id": "example-consumer-key",
     "client_secret": "example-consumer-secret"
   }
   ```

1. Choose **Next**.

1. For **Secret Name**, specify the name of the secret.

1. Under **Tags**, choose **Add**.

   1. For the **Key**, specify **sagemaker:partner**. For **Value**, we recommend specifying a value that might be useful for your use case. However, you can specify anything.
**Important**  
You must create the key. You can't import your data from Salesforce if you don't create it.

1. Choose **Next**.

1. Choose **Store**.

1. Choose the secret you've created.

1. Make a note of the following fields:
   + The Amazon Resource Name (ARN) of the secret
   + The name of the secret

After you've created the secret, you must add permissions for Data Wrangler to read the secret. Use the following procedure to add permissions.

To add read permissions for Data Wrangler, do the following.

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. Choose **domains**.

1. Choose the domain that you're using to access Data Wrangler.

1. Choose your **User Profile**.

1. Under **Details**, find the **Execution role**. Its ARN is in the following format: `arn:aws:iam::111122223333:role/example-role`. Make a note of the SageMaker AI execution role. Within the ARN, it's everything after `role/`.

1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

1. In the **Search IAM** search bar, specify the name of the SageMaker AI execution role.

1. Choose the role.

1. Choose **Add permissions**.

1. Choose **Create inline policy**.

1. Choose the JSON tab.

1. Specify the following policy within the editor.

------
#### [ JSON ]


   ```
   {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue"
            ],
            "Resource": "arn:aws:secretsmanager:*:*:secret:*",
            "Condition": {
                "ForAnyValue:StringLike": {
                    "aws:ResourceTag/sagemaker:partner": "*"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:UpdateSecret"
            ],
            "Resource": "arn:aws:secretsmanager:*:*:secret:AmazonSageMaker-*"
        }
    ]
   }
   ```

------

1. Choose **Review Policy**.

1. For **Name**, specify a name.

1. Choose **Create policy**.
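
As noted in the procedure above, the role name within the execution role ARN is everything after `role/`. A minimal Python sketch with a placeholder account ID and role name:

```python
# Extract the role name from an execution role ARN, as described above.
# The account ID and role name are placeholders.
arn = "arn:aws:iam::111122223333:role/example-role"
role_name = arn.split("role/", 1)[1]
print(role_name)  # example-role
```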

After you've given Data Wrangler permissions to read the secret, you must add a Lifecycle Configuration that uses your Secrets Manager secret to your Amazon SageMaker Studio Classic user profile.

Use the following procedure to create a lifecycle configuration and add it to the Studio Classic profile.

To create a lifecycle configuration and add it to the Studio Classic profile, do the following.

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **domains**.

1. Choose the domain that you're using to access Data Wrangler.

1. Choose your **User Profile**.

1. If you see the following applications, delete them:
   + KernelGateway
   + JupyterKernel
**Note**  
Deleting the applications updates Studio Classic. It can take a while for the updates to happen.

1. While you're waiting for updates to happen, choose **Lifecycle configurations**.

1. Make sure the page you're on says **Studio Classic Lifecycle configurations**.

1. Choose **Create configuration**.

1. Make sure **Jupyter server app** has been selected.

1. Choose **Next**.

1. For **Name**, specify a name for the configuration.

1. For **Scripts**, specify the following script:

   ```
   #!/bin/bash
   set -eux
   
   cat > ~/.sfgenie_identity_provider_oauth_config <<EOL
   {
       "secret_arn": "secrets-arn-containing-salesforce-credentials"
   }
   EOL
   ```

1. Choose **Submit**.

1. On the left hand navigation, choose **domains**.

1. Choose your domain.

1. Choose **Environment**.

1. Under **Lifecycle configurations for personal Studio Classic apps**, choose **Attach**. 

1. Select **Existing configuration**.

1. Under **Studio Classic Lifecycle configurations** select the lifecycle configuration that you've created.

1. Choose **Attach to domain**.

1. Select the checkbox next to the lifecycle configuration that you've attached.

1. Select **Set as default**.

You might run into issues when you set up your lifecycle configuration. For information about debugging them, see [Debug Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-debug.md).

### Data Scientist Guide
<a name="data-wrangler-salesforce-data-cloud-ds"></a>

Use the following procedures to connect to Salesforce Data Cloud and access your data in Data Wrangler.

**Important**  
Your administrator needs to use the information in the preceding sections to set up Salesforce Data Cloud. If you're running into issues, contact them for troubleshooting help.

To open Studio Classic and check its version, see the following procedure.

1. Use the steps in [Prerequisites](data-wrangler-getting-started.md#data-wrangler-getting-started-prerequisite) to access Data Wrangler through Amazon SageMaker Studio Classic.

1. Next to the user you want to use to launch Studio Classic, select **Launch app**.

1. Choose **Studio**.

**To create a dataset in Data Wrangler with data from the Salesforce Data Cloud**

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Salesforce Data Cloud**.

1. For **Connection name**, specify a name for your connection to the Salesforce Data Cloud.

1. For **Org URL**, specify the organization URL in your Salesforce account. You can get the URL from your administrator.

1. Choose **Connect**.

1. Specify your credentials to log into Salesforce.

You can begin creating a dataset using data from Salesforce Data Cloud after you've connected to it.

After you select a table, you can write queries and run them. The output of your query appears under **Query results**.

After you're satisfied with the output of your query, you can import it into a Data Wrangler flow to perform data transformations.

After you've created a dataset, navigate to the **Data flow** screen to start transforming your data.

## Import data from Snowflake
<a name="data-wrangler-snowflake"></a>

You can use Snowflake as a data source in SageMaker Data Wrangler to prepare data in Snowflake for machine learning.

With Snowflake as a data source in Data Wrangler, you can quickly connect to Snowflake without writing a single line of code. You can join your data in Snowflake with data from any other data source in Data Wrangler.

Once connected, you can interactively query data stored in Snowflake and transform it with more than 300 preconfigured data transformations. You can understand your data and identify potential errors and extreme values with a set of robust preconfigured visualization templates, quickly identify inconsistencies in your data preparation workflow, and diagnose issues before models are deployed into production. Finally, you can export your data preparation workflow to Amazon S3 for use with other SageMaker AI features such as Amazon SageMaker Autopilot, Amazon SageMaker Feature Store, and Amazon SageMaker Pipelines.

You can encrypt the output of your queries using an AWS Key Management Service key that you've created. For more information about AWS KMS, see [AWS Key Management Service](https://docs.aws.amazon.com//kms/latest/developerguide/overview.html).

**Topics**
+ [Administrator Guide](#data-wrangler-snowflake-admin)
+ [Data Scientist Guide](#data-wrangler-snowflake-ds)

### Administrator Guide
<a name="data-wrangler-snowflake-admin"></a>

**Important**  
To learn more about granular access control and best practices, see [Security Access Control](https://docs.snowflake.com/en/user-guide/security-access-control.html). 

This section is for Snowflake administrators who are setting up access to Snowflake from within SageMaker Data Wrangler.

**Important**  
You are responsible for managing and monitoring the access control within Snowflake. Data Wrangler does not add a layer of access control with respect to Snowflake.   
Access control includes the following:  
+ The data that a user accesses
+ (Optional) The storage integration that provides Snowflake the ability to write query results to an Amazon S3 bucket
+ The queries that a user can run

#### (Optional) Configure Snowflake Data Import Permissions
<a name="data-wrangler-snowflake-admin-config"></a>

By default, Data Wrangler queries the data in Snowflake without creating a copy of it in an Amazon S3 location. Use the following information if you're configuring a storage integration with Snowflake. Your users can use a storage integration to store their query results in an Amazon S3 location.

Your users might have different levels of access to sensitive data. For optimal data security, provide each user with their own storage integration. Each storage integration should have its own data governance policy.

This feature is currently not available in the opt-in Regions.

Snowflake requires the following permissions on an S3 bucket and directory to be able to access files in the directory:
+ `s3:GetObject`
+ `s3:GetObjectVersion`
+ `s3:ListBucket`
+ `s3:ListObjects`
+ `s3:GetBucketLocation`

**Create an IAM policy**

You must create an IAM policy to configure access permissions for Snowflake to load and unload data from an Amazon S3 bucket.

The following is the JSON policy document that you use to create the policy:

```
{
"Version": "2012-10-17",		 	 	 
"Statement": [
  {
    "Effect": "Allow",
    "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
    ],
    "Resource": "arn:aws:s3:::bucket/prefix/*"
  },
  {
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket"
    ],
    "Resource": "arn:aws:s3:::bucket/",
    "Condition": {
        "StringLike": {
            "s3:prefix": ["prefix/*"]
        }
    }
  }
 ]
}
```

For information and procedures about creating policies with policy documents, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html).

For documentation that provides an overview of using IAM permissions with Snowflake, see the following resources:
+ [What is IAM?](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html)
+ [Create the IAM Role in AWS](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-2-create-the-iam-role-in-aws)
+ [Create a Cloud Storage Integration in Snowflake](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-3-create-a-cloud-storage-integration-in-snowflake)
+ [Retrieve the AWS IAM User for your Snowflake Account](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-4-retrieve-the-aws-iam-user-for-your-snowflake-account)
+ [Grant the IAM User Permissions to Access Bucket](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-5-grant-the-iam-user-permissions-to-access-bucket-objects).

To grant the data scientist's Snowflake role usage permission to the storage integration, you must run `GRANT USAGE ON INTEGRATION integration_name TO snowflake_role;`.
+ `integration_name` is the name of your storage integration.
+ `snowflake_role` is the name of the default [Snowflake role](https://docs.snowflake.com/en/user-guide/security-access-control-overview.html#roles) given to the data scientist user.

#### Setting up Snowflake OAuth Access
<a name="data-wrangler-snowflake-oauth-setup"></a>

Instead of having your users directly enter their credentials into Data Wrangler, you can have them use an identity provider to access Snowflake. The following are links to the Snowflake documentation for the identity providers that Data Wrangler supports.
+ [Azure AD](https://docs.snowflake.com/en/user-guide/oauth-azure.html)
+ [Okta](https://docs.snowflake.com/en/user-guide/oauth-okta.html)
+ [Ping Federate](https://docs.snowflake.com/en/user-guide/oauth-pingfed.html)

Use the documentation from the preceding links to set up access to your identity provider. The information and procedures in this section help you understand how to properly use the documentation to access Snowflake within Data Wrangler.

Your identity provider needs to recognize Data Wrangler as an application. Use the following procedure to register Data Wrangler as an application within the identity provider:

1. Select the configuration that starts the process of registering Data Wrangler as an application.

1. Provide the users within the identity provider access to Data Wrangler.

1. Turn on OAuth client authentication by storing the client credentials as an AWS Secrets Manager secret.

1. Specify a redirect URL using the following format: https://*domain-ID*.studio.*AWS Region*.sagemaker.aws/jupyter/default/lab
**Important**  
You're specifying the Amazon SageMaker AI domain ID and AWS Region that you're using to run Data Wrangler.
**Important**  
You must register a URL for each Amazon SageMaker AI domain and AWS Region where you're running Data Wrangler. Users from a domain and AWS Region that don't have redirect URLs set up for them won't be able to authenticate with the identity provider to access the Snowflake connection.

1. Make sure that the authorization code and refresh token grant types are allowed for the Data Wrangler application.
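
The redirect URL format described above can be sketched as a simple template. The domain ID and Region below are placeholders for your own values.

```python
# Template for the redirect URL format described above.
# The domain ID and Region are placeholders.
def studio_redirect_url(domain_id, region):
    return f"https://{domain_id}.studio.{region}.sagemaker.aws/jupyter/default/lab"

print(studio_redirect_url("d-exampledomain", "us-east-1"))
# https://d-exampledomain.studio.us-east-1.sagemaker.aws/jupyter/default/lab
```

Because each domain and Region pair produces a different URL, you register one redirect URL per pair, as the Important notes above describe.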

Within your identity provider, you must set up a server that sends OAuth tokens to Data Wrangler at the user level. The server sends the tokens with Snowflake as the audience.

Snowflake uses the concept of roles, which are distinct from the IAM roles used in AWS. When you configure the identity provider with the any-role scope, the connection uses the default role associated with the user's Snowflake account. For example, if a user has `systems administrator` as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses `systems administrator` as the role.

Use the following procedure to set up the server.

To set up the server, do the following. You're working within your identity provider for all steps except the last one.

1. Start setting up the server or API.

1. Configure the authorization server to use the authorization code and refresh token grant types.

1. Specify the lifetime of the access token.

1. Set the refresh token idle timeout. The idle timeout is the amount of time after which the refresh token expires if it's not used.
**Note**  
If you're scheduling jobs in Data Wrangler, we recommend making the idle timeout time greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they've made to Snowflake through Data Wrangler.

1. Specify `session:role-any` as the new scope.
**Note**  
For Azure AD, copy the unique identifier for the scope. Data Wrangler requires you to provide it with the identifier.

1. Within the External OAuth Security Integration for Snowflake, enable `external_oauth_any_role_mode`.

**Important**  
Data Wrangler doesn't support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.

**Important**  
If the refresh token expires, your users must reauthenticate by accessing the connection that they've made to Snowflake through Data Wrangler.

After you've set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your identity provider to get values for the following fields:
+ Token URL – The URL of the token that the identity provider sends to Data Wrangler.
+ Authorization URL – The URL of the authorization server of the identity provider.
+ Client ID – The ID of the identity provider.
+ Client secret – The secret that only the authorization server or API recognizes.
+ (Azure AD only) The OAuth scope credentials that you've copied.

You store the fields and values in an AWS Secrets Manager secret and add the secret to the Amazon SageMaker Studio Classic lifecycle configuration that you're using for Data Wrangler. A lifecycle configuration is a shell script. Use it to make the Amazon Resource Name (ARN) of the secret accessible to Data Wrangler. For information about creating secrets, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html). For information about using lifecycle configurations in Studio Classic, see [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md).

**Important**  
Before you create a Secrets Manager secret, make sure that the SageMaker AI execution role that you're using for Amazon SageMaker Studio Classic has permissions to create and update secrets in Secrets Manager. For more information about adding permissions, see [Example: Permission to create secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html#auth-and-access_examples_create).

For Okta and Ping Federate, the following is the format of the secret:

```
{
    "token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id":"example-client-id",
    "client_secret":"example-client-secret",
    "identity_provider":"OKTA"|"PING_FEDERATE",
    "authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
}
```

For Azure AD, the following is the format of the secret:

```
{
    "token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id":"example-client-id",
    "client_secret":"example-client-secret",
    "identity_provider":"AZURE_AD",
    "authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize",
    "datasource_oauth_scope":"api://appuri/session:role-any)"
}
```
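
Before storing a secret, it can help to sanity-check that it matches the formats above: all providers need the five common keys, and Azure AD additionally needs `datasource_oauth_scope`. The following Python sketch is one hypothetical way to perform that check; the URLs and credentials are placeholders.

```python
import json

# Hypothetical pre-flight check for the secret formats shown above.
REQUIRED = {"token_url", "authorization_url", "client_id",
            "client_secret", "identity_provider"}

def validate_secret(secret_string):
    """Return a sorted list of missing keys (empty means valid)."""
    secret = json.loads(secret_string)
    missing = REQUIRED - secret.keys()
    # Azure AD secrets also need the OAuth scope identifier.
    if secret.get("identity_provider") == "AZURE_AD" and "datasource_oauth_scope" not in secret:
        missing.add("datasource_oauth_scope")
    return sorted(missing)

okta = json.dumps({
    "token_url": "https://example.okta.com/oauth2/v1/token",
    "client_id": "example-client-id",
    "client_secret": "example-client-secret",
    "identity_provider": "OKTA",
    "authorization_url": "https://example.okta.com/oauth2/v1/authorize",
})
print(validate_secret(okta))  # []
```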

You must have a lifecycle configuration that uses the Secrets Manager secret that you've created. You can either create the lifecycle configuration or modify one that has already been created. The configuration must use the following script.

```
#!/bin/bash

set -eux

## Script Body

cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
    "secret_arn": "example-secret-arn"
}
EOL
```

For information about setting up lifecycle configurations, see [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md). When you're going through the process of setting up, do the following:
+ Set the application type of the configuration to `Jupyter Server`.
+ Attach the configuration to the Amazon SageMaker AI domain that has your users.
+ Have the configuration run by default. It must run every time a user logs into Studio Classic. Otherwise, the credentials saved in the configuration won't be available to your users when they're using Data Wrangler.
+ The lifecycle configuration creates a file named `.snowflake_identity_provider_oauth_config` in the user's home folder. The file contains the ARN of the Secrets Manager secret. Make sure that it's in the user's home folder every time the Jupyter Server's instance is initialized.
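If you register the lifecycle configuration from the command line instead of the console, the `create-studio-lifecycle-config` operation expects the script as base64-encoded text. The following sketch encodes the script shown above and prints an equivalent AWS CLI invocation; the configuration name is illustrative.

```python
import base64

# The lifecycle configuration script from the preceding section.
LIFECYCLE_SCRIPT = """#!/bin/bash

set -eux

## Script Body

cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
    "secret_arn": "example-secret-arn"
}
EOL
"""


def encoded_lifecycle_content(script=LIFECYCLE_SCRIPT):
    """Base64-encode the script, as create-studio-lifecycle-config requires."""
    return base64.b64encode(script.encode("utf-8")).decode("utf-8")


# Equivalent AWS CLI invocation (the config name is illustrative):
# aws sagemaker create-studio-lifecycle-config \
#   --studio-lifecycle-config-name snowflake-oauth-config \
#   --studio-lifecycle-config-app-type JupyterServer \
#   --studio-lifecycle-config-content "<output of encoded_lifecycle_content()>"
```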

#### Private Connectivity between Data Wrangler and Snowflake via AWS PrivateLink
<a name="data-wrangler-security-snowflake-vpc"></a>

This section explains how to use AWS PrivateLink to establish a private connection between Data Wrangler and Snowflake. The steps are explained in the following sections. 

##### Create a VPC
<a name="data-wrangler-snowflake-snowflake-vpc-setup"></a>

If you do not have a VPC set up, then follow the [Create a new VPC](https://docs.aws.amazon.com/directoryservice/latest/admin-guide/gsg_create_vpc.html#create_vpc) instructions to create one.

Once you have chosen the VPC that you would like to use for establishing a private connection, provide the following information to your Snowflake Administrator to enable AWS PrivateLink:
+ VPC ID
+ AWS Account ID
+ Your corresponding account URL you use to access Snowflake

**Important**  
As described in Snowflake's documentation, enabling AWS PrivateLink for your Snowflake account can take up to two business days. 

##### Set up Snowflake AWS PrivateLink Integration
<a name="data-wrangler-snowflake-snowflake-vpc-privatelink-setup"></a>

After AWS PrivateLink is activated, retrieve the AWS PrivateLink configuration for your Region. To do so, log into your Snowflake console and run the following command in a worksheet under **Worksheets**: `select SYSTEM$GET_PRIVATELINK_CONFIG();` 

1. Retrieve the values for the following keys from the resulting JSON object: `privatelink-account-name`, `privatelink-vpce-id`, `privatelink-account-url`, and `privatelink_ocsp-url`. Examples of each value are shown in the following snippet. Store these values for later use.

   ```
   privatelink-account-name: xxxxxxxx.region.privatelink
   privatelink-vpce-id: com.amazonaws.vpce.region.vpce-svc-xxxxxxxxxxxxxxxxx
   privatelink-account-url: xxxxxxxx.region.privatelink.snowflakecomputing.com
   privatelink_ocsp-url: ocsp.xxxxxxxx.region.privatelink.snowflakecomputing.com
   ```

1. Switch to your AWS Console and navigate to the VPC menu.

1. From the left side panel, choose the **Endpoints** link to navigate to the **VPC Endpoints** setup.

   Once there, choose **Create Endpoint**. 

1. Select the radio button for **Find service by name**, as shown in the following screenshot.   
![\[The Create Endpoint section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-radio.png)

1. In the **Service Name** field, paste in the value for `privatelink-vpce-id` that you retrieved in the preceding step and choose **Verify**. 

   If the connection is successful, a green alert saying **Service name found** appears on your screen and the **VPC** and **Subnet** options automatically expand, as shown in the following screenshot. Depending on your targeted Region, your resulting screen may show another AWS Region name.   
![\[The Create Endpoint section in the console showing the connection is successful.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-service-name-found.png)

1. Select the same VPC ID that you sent to Snowflake from the **VPC** dropdown list.

1. If you have not yet created a subnet, create one using the following step. 

1. Select **Subnets** from the **VPC** dropdown list. Then select **Create subnet** and follow the prompts to create a subnet in your VPC. Ensure that you select the VPC ID that you sent to Snowflake. 

1. Under **Security Group Configuration**, select **Create New Security Group** to open the default **Security Group** screen in a new tab. In this new tab, select **Create Security Group**. 

1. Provide a name for the new security group (such as `datawrangler-doc-snowflake-privatelink-connection`) and a description. Be sure to select the VPC ID you have used in previous steps. 

1. Add two rules to allow traffic from within your VPC to this VPC endpoint. 

   Navigate to your VPC under **Your VPCs** in a separate tab, and retrieve the CIDR block for your VPC. Then choose **Add Rule** in the **Inbound Rules** section. Select `HTTPS` for the type, leave the **Source** as **Custom** in the form, and paste in the CIDR block that you retrieved (such as `10.0.0.0/16`). 

1. Choose **Create Security Group**. Retrieve the **Security Group ID** from the newly created security group (such as `sg-xxxxxxxxxxxxxxxxx`).

1. In the **VPC Endpoint** configuration screen, remove the default security group. Paste in the security group ID in the search field and select the checkbox.  
![\[The Security group section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-security-group.png)

1. Select **Create Endpoint**. 

1. If the endpoint creation is successful, you see a page that has a link to your VPC endpoint configuration, specified by the VPC ID. Select the link to view the configuration in full.   
![\[The endpoint Details section.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-success-endpoint.png)

   Retrieve the topmost record in the DNS names list. This can be differentiated from other DNS names because it only includes the Region name (such as `us-west-2`), and no Availability Zone letter notation (such as `us-west-2a`). Store this information for later use.
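The "Region name but no Availability Zone letter" rule for picking the regional DNS name can be sketched as a small helper. This is an illustrative check, not an AWS API; the DNS names used in the usage example are made up to match the pattern shown in the console.

```python
import re


def pick_regional_dns_name(dns_names, region):
    """Return the endpoint DNS name that is Region-wide (no AZ suffix).

    Zonal names embed an Availability Zone such as 'us-west-2a';
    the regional name contains the Region code only.
    """
    # Matches the region code immediately followed by an AZ letter,
    # e.g. 'us-west-2a' but not 'us-west-2.'
    az_pattern = re.compile(re.escape(region) + r"[a-z]\b")
    regional = [n for n in dns_names
                if region in n and not az_pattern.search(n)]
    if len(regional) != 1:
        raise ValueError(f"expected exactly one regional DNS name, got {regional}")
    return regional[0]
```

For example, given both a regional and a zonal name for the same endpoint, the helper returns the one without the AZ letter, which is the value to store for the Route 53 records in the next section.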

##### Configure DNS for Snowflake Endpoints in your VPC
<a name="data-wrangler-snowflake-vpc-privatelink-dns"></a>

This section explains how to configure DNS for Snowflake endpoints in your VPC. This allows your VPC to resolve requests to the Snowflake AWS PrivateLink endpoint. 

1. Navigate to the [Route 53 menu](https://console.aws.amazon.com/route53) within your AWS console.

1. Select the **Hosted Zones** option (if necessary, expand the left-hand menu to find this option).

1. Choose **Create Hosted Zone**.

   1. In the **Domain name** field, use the value that you stored for `privatelink-account-url` in the preceding steps, but remove your Snowflake account ID from the DNS name so that the value starts with the Region identifier. A **Resource Record Set** is also created later for the subdomain, for example, `region.privatelink.snowflakecomputing.com`.

   1. Select the radio button for **Private Hosted Zone** in the **Type** section. Your Region code may not be `us-west-2`. Reference the DNS name returned to you by Snowflake.  
![\[The Create hosted zone page in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-create-hosted-zone.png)

   1. In the **VPCs to associate with the hosted zone** section, select the Region in which your VPC is located and the VPC ID used in previous steps.  
![\[The VPCs to associate with the hosted zone section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-vpc-hosted-zone.png)

   1. Choose **Create hosted zone**.

1. Next, create two records, one for `privatelink-account-url` and one for `privatelink_ocsp-url`.
   + In the **Hosted Zone** menu, choose **Create Record Set**.

     1. Under **Record name**, enter your Snowflake Account ID only (the first 8 characters in `privatelink-account-url`).

     1. Under **Record type**, select **CNAME**.

     1. Under **Value**, enter the DNS name for the regional VPC endpoint you retrieved in the last step of the *Set up the Snowflake AWS PrivateLink Integration* section.   
![\[The Quick create record section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-quick-create-record.png)

     1. Choose **Create records**.

     1. Repeat the preceding steps for the OCSP record that we noted as `privatelink_ocsp-url`, using everything from `ocsp` through the 8-character Snowflake ID for the record name (such as `ocsp.xxxxxxxx`).  
![\[The Quick create record section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-quick-create-ocsp.png)
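The hosted-zone domain and the two CNAME record names are all derived from the `privatelink-account-url` and `privatelink_ocsp-url` values retrieved earlier. The following sketch makes that derivation explicit; the account ID and Region in the usage example are placeholders.

```python
def route53_records(privatelink_account_url, privatelink_ocsp_url):
    """Derive the hosted-zone domain and the two CNAME record names.

    Example input: 'xxxxxxxx.region.privatelink.snowflakecomputing.com'
    """
    # The first label is the Snowflake account ID; the rest is the zone domain.
    account_id, _, zone_domain = privatelink_account_url.partition(".")
    return {
        # e.g. region.privatelink.snowflakecomputing.com
        "hosted_zone": zone_domain,
        # CNAME target: the regional VPC endpoint DNS name
        "account_record": account_id,
        # e.g. ocsp.xxxxxxxx
        "ocsp_record": privatelink_ocsp_url.split("." + zone_domain)[0],
    }
```

Both records are CNAMEs pointing at the regional VPC endpoint DNS name retrieved in the previous section.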

##### Configure Route 53 Resolver Inbound Endpoint for your VPC
<a name="data-wrangler-snowflake-vpc-privatelink-route53"></a>

This section explains how to configure Route 53 resolvers inbound endpoints for your VPC.

1. Navigate to the [Route 53 menu](https://console.aws.amazon.com/route53) within your AWS console.
   + In the left hand panel in the **Security** section, select the **Security Groups** option.

1. Choose **Create Security Group**. 
   + Provide a name for your security group (such as `datawranger-doc-route53-resolver-sg`) and a description.
   + Select the VPC ID used in previous steps.
   + Create rules that allow for DNS over UDP and TCP from within the VPC CIDR block.   
![\[The Inbound rules section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-inbound-rules.png)
   + Choose **Create Security Group**. Note the **Security Group ID**, because a later step adds a rule that allows traffic from it to the VPC endpoint security group.

1. Navigate to the [Route 53 menu](https://console.aws.amazon.com/route53) within your AWS console.
   + In the **Resolver** section, select the **Inbound Endpoint** option.

1. Choose **Create Inbound Endpoint**. 
   + Provide an endpoint name.
   + From the **VPC in the Region** dropdown list, select the VPC ID you have used in all previous steps. 
   + In the **Security group for this endpoint** dropdown list, select the security group ID from Step 2 in this section.   
![\[The General settings for inbound endpoint section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-inbound-endpoint.png)
   + In the **IP Address** section, select an Availability Zone, select a subnet, and leave the radio selector for **Use an IP address that is selected automatically** selected for each IP address.   
![\[The IP Address section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-ip-address-1.png)
   + Choose **Submit**.

1. Select the **Inbound endpoint** after it has been created.

1. Once the inbound endpoint is created, note the two IP addresses for the resolvers.  
![\[The IP Addresses section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-ip-addresses-2.png)
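The same inbound endpoint can be created through the Route 53 Resolver API. The following sketch builds the request for the `CreateResolverEndpoint` operation under the assumption that the endpoint name, security group, and subnet IDs are placeholders you substitute; the `boto3` call itself is shown only in a comment.

```python
import uuid


def resolver_inbound_endpoint_request(name, security_group_id, subnet_ids):
    """Build the request for Route 53 Resolver's CreateResolverEndpoint API.

    Leaving 'Ip' unset in each address entry lets the service select an IP
    address automatically, matching the console option used above.
    """
    return {
        "CreatorRequestId": str(uuid.uuid4()),  # idempotency token
        "Name": name,
        "SecurityGroupIds": [security_group_id],
        "Direction": "INBOUND",
        "IpAddresses": [{"SubnetId": s} for s in subnet_ids],
    }


# With AWS credentials configured, the request can be submitted as:
# boto3.client("route53resolver").create_resolver_endpoint(**request)
```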

##### SageMaker AI VPC Endpoints
<a name="data-wrangler-snowflake-sagemaker-vpc-endpoints"></a>

This section explains how to create VPC endpoints for the following: Amazon SageMaker Studio Classic, SageMaker notebooks, the SageMaker API, SageMaker Runtime, and Amazon SageMaker Feature Store Runtime.

**Create a security group that is applied to all endpoints.**

1. Navigate to the [EC2 menu](https://console.aws.amazon.com/ec2) in the AWS Console.

1. In the **Network & Security** section, select the **Security groups** option.

1. Choose **Create security group**.

1. Provide a security group name and description (such as `datawrangler-doc-sagemaker-vpce-sg`). A rule is added later to allow traffic over HTTPS from SageMaker AI to this group. 

**Creating the endpoints**

1. Navigate to the [VPC menu](https://console.aws.amazon.com/vpc) in the AWS console.

1. Select the **Endpoints** option.

1. Choose **Create Endpoint**.

1. Search for the service by entering its name in the **Search** field.

1. From the **VPC** dropdown list, select the VPC in which your Snowflake AWS PrivateLink connection exists.

1. In the **Subnets** section, select the subnets which have access to the Snowflake PrivateLink connection.

1. Leave the **Enable DNS Name** checkbox selected.

1. In the **Security Groups** section, select the security group you created in the preceding section.

1. Choose **Create Endpoint**.
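When searching for each service in the **Search** field, the service names follow AWS's standard interface-endpoint naming scheme. The names generated below are assumptions based on that scheme, not an authoritative list; confirm the exact names in the VPC console's service search for your Region.

```python
def sagemaker_endpoint_service_names(region):
    """Candidate interface endpoint service names to search for, by Region.

    These patterns are assumptions based on the standard AWS naming scheme;
    verify the exact names in the VPC console before creating endpoints.
    """
    return [
        f"aws.sagemaker.{region}.studio",                          # Studio Classic
        f"aws.sagemaker.{region}.notebook",                        # SageMaker notebooks
        f"com.amazonaws.{region}.sagemaker.api",                   # SageMaker API
        f"com.amazonaws.{region}.sagemaker.runtime",               # SageMaker Runtime
        f"com.amazonaws.{region}.sagemaker.featurestore-runtime",  # Feature Store Runtime
    ]
```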

**Configure Studio Classic and Data Wrangler**

This section explains how to configure Studio Classic and Data Wrangler.

1. Configure the security group.

   1. Navigate to the Amazon EC2 menu in the AWS Console.

   1. Select the **Security Groups** option in the **Network & Security** section.

   1. Choose **Create Security Group**. 

   1. Provide a name and description for your security group (such as `datawrangler-doc-sagemaker-studio`). 

   1. Create the following inbound rules.
      + The HTTPS connection to the security group you provisioned for the Snowflake PrivateLink connection you created in the *Set up the Snowflake PrivateLink Integration* step.
      + The HTTP connection to the security group you provisioned for the Snowflake PrivateLink connection you created in the *Set up the Snowflake PrivateLink Integration* step.
      + The UDP and TCP for DNS (port 53) to the Route 53 Resolver inbound endpoint security group that you created in step 2 of *Configure Route 53 Resolver Inbound Endpoint for your VPC*.

   1. Choose the **Create Security Group** button in the lower right-hand corner.

1. Configure Studio Classic.
   + Navigate to the SageMaker AI menu in the AWS console.
   + From the left-hand panel, select the **SageMaker AI Studio Classic** option.
   + If you do not have any domains configured, the **Get Started** menu is present.
   + Select the **Standard Setup** option from the **Get Started** menu.
   + Under **Authentication method**, select **AWS Identity and Access Management (IAM)**.
   + From the **Permissions** menu, you can create a new role or use a pre-existing role, depending on your use case.
     + If you choose **Create a new role**, you are presented the option to provide an S3 bucket name, and a policy is generated for you.
     + If you already have a role created with permissions for the S3 buckets to which you require access, select the role from the dropdown list. This role should have the `AmazonSageMakerFullAccess` policy attached to it.
   + Select the **Network and Storage** dropdown list to configure the VPC, security, and subnets SageMaker AI uses.
     + Under **VPC**, select the VPC in which your Snowflake PrivateLink connection exists.
     + Under **Subnet(s)**, select the subnets which have access to the Snowflake PrivateLink connection.
     + Under **Network Access for Studio Classic**, select **VPC Only**.
     + Under **Security Group(s)**, select the security group you created in step 1.
   + Choose **Submit**.

1. Edit the SageMaker AI security group.
   + Create the following inbound rules:
     + Port 2049 to the inbound and outbound NFS Security Groups created automatically by SageMaker AI in step 2 (the security group names contain the Studio Classic domain ID).
     + Access to all TCP ports to itself (required for SageMaker AI for VPC Only).

1. Edit the VPC Endpoint Security Groups:
   + Navigate to the Amazon EC2 menu in the AWS console.
   + Locate the security group you created in a preceding step.
   + Add an inbound rule allowing for HTTPS traffic from the security group created in step 1.

1. Create a user profile.
   + From the **SageMaker Studio Classic Control Panel**, choose **Add User**.
   + Provide a user name. 
   + For the **Execution Role**, choose to create a new role or to use a pre-existing role.
     + If you choose **Create a new role**, you are presented the option to provide an Amazon S3 bucket name, and a policy is generated for you.
     + If you already have a role created with permissions to the Amazon S3 buckets to which you require access, select the role from the dropdown list. This role should have the `AmazonSageMakerFullAccess` policy attached to it.
   + Choose **Submit**. 

1. Create a data flow (follow the data scientist guide outlined in a preceding section). 
   + When adding a Snowflake connection, enter the value of `privatelink-account-name` (from the *Set up Snowflake PrivateLink Integration* step) into the **Snowflake account name (alphanumeric)** field, instead of the plain Snowflake account name. Everything else is left unchanged.

#### Provide information to the data scientist
<a name="data-wrangler-snowflake-admin-ds-info"></a>

Provide the data scientist with the information that they need to access Snowflake from Amazon SageMaker AI Data Wrangler.

**Important**  
Your users need to run Amazon SageMaker Studio Classic version 1.3.0 or later. For information about checking the version of Studio Classic and updating it, see [Prepare ML Data with Amazon SageMaker Data Wrangler](data-wrangler.md).

1. To allow your data scientist to access Snowflake from SageMaker Data Wrangler, provide them with one of the following:
   + For Basic Authentication, a Snowflake account name, user name, and password.
   + For OAuth, a user name and password in the identity provider.
   + For ARN, the Secrets Manager secret Amazon Resource Name (ARN).
   + A secret created with [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) and the ARN of the secret. If you choose this option, use the following procedure to create the secret for Snowflake.
**Important**  
If your data scientists use the **Snowflake Credentials (User name and Password)** option to connect to Snowflake, you can use [Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) to store the credentials in a secret. Secrets Manager rotates secrets as part of a best practice security plan. The secret created in Secrets Manager is only accessible with the Studio Classic role configured when you set up a Studio Classic user profile. This requires you to add this permission, `secretsmanager:PutResourcePolicy`, to the policy that is attached to your Studio Classic role.  
We strongly recommend that you scope the role policy to use different roles for different groups of Studio Classic users. You can add additional resource-based permissions for the Secrets Manager secrets. See [Manage Secret Policy](https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_secret-policy.html) for condition keys you can use.  
For information about creating a secret, see [Create a secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html). You're charged for the secrets that you create.

1. (Optional) Provide the data scientist with the name of the storage integration that you created using the following procedure [Create a Cloud Storage Integration in Snowflake](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-3-create-a-cloud-storage-integration-in-snowflake). This is the name of the new integration and is called `integration_name` in the `CREATE INTEGRATION` SQL command you ran, which is shown in the following snippet: 

   ```
     CREATE STORAGE INTEGRATION integration_name
     TYPE = EXTERNAL_STAGE
     STORAGE_PROVIDER = S3
     ENABLED = TRUE
     STORAGE_AWS_ROLE_ARN = 'iam_role'
     [ STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control' ]
     STORAGE_ALLOWED_LOCATIONS = ('s3://bucket/path/', 's3://bucket/path/')
     [ STORAGE_BLOCKED_LOCATIONS = ('s3://bucket/path/', 's3://bucket/path/') ]
   ```

### Data Scientist Guide
<a name="data-wrangler-snowflake-ds"></a>

Use the following information to connect to Snowflake and access your data in Data Wrangler.

**Important**  
Your administrator needs to use the information in the preceding sections to set up Snowflake. If you're running into issues, contact them for troubleshooting help.

You can connect to Snowflake in one of the following ways:
+ Specifying your Snowflake credentials (account name, user name, and password) in Data Wrangler. 
+ Providing an Amazon Resource Name (ARN) of a secret containing the credentials.
+ Using an open standard for access delegation (OAuth) provider that connects to Snowflake. Your administrator can give you access to one of the following OAuth providers:
  + [Azure AD](https://docs.snowflake.com/en/user-guide/oauth-azure.html)
  + [Okta](https://docs.snowflake.com/en/user-guide/oauth-okta.html)
  + [Ping Federate](https://docs.snowflake.com/en/user-guide/oauth-pingfed.html)

Talk to your administrator about the method that you need to use to connect to Snowflake.

The following sections have information about how you can connect to Snowflake using the preceding methods.

------
#### [ Specifying your Snowflake Credentials ]

**To import a dataset into Data Wrangler from Snowflake using your credentials**

1. Sign into [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Snowflake**.

1. For **Connection name**, specify a name that uniquely identifies the connection.

1. For **Authentication method**, choose **Basic Username-Password**.

1. For **Snowflake account name (alphanumeric)**, specify the full name of the Snowflake account.

1. For **Username**, specify the username that you use to access the Snowflake account.

1. For **Password**, specify the password associated with the username.

1. (Optional) For **Advanced settings**, specify the following:
   + **Role** – A role within Snowflake. Some roles have access to different datasets. If you don't specify a role, Data Wrangler uses the default role in your Snowflake account.
   + **Storage integration** – When you specify and run a query, Data Wrangler creates a temporary copy of the query results in memory. To store a permanent copy of the query results, specify the Amazon S3 location for the storage integration. Your administrator provided you with the S3 URI.
   + **KMS key ID** – A KMS key that you've created. You can specify its ARN to encrypt the output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.

1. Choose **Connect**.

------
#### [ Providing an Amazon Resource Name (ARN) ]

**To import a dataset into Data Wrangler from Snowflake using an ARN**

1. Sign into [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Snowflake**.

1. For **Connection name**, specify a name that uniquely identifies the connection.

1. For **Authentication method**, choose **ARN**.

1. For **Secrets Manager ARN**, specify the ARN of the AWS Secrets Manager secret that stores the credentials for connecting to Snowflake.

1. (Optional) For **Advanced settings**, specify the following:
   + **Role** – A role within Snowflake. Some roles have access to different datasets. If you don't specify a role, Data Wrangler uses the default role in your Snowflake account.
   + **Storage integration** – When you specify and run a query, Data Wrangler creates a temporary copy of the query results in memory. To store a permanent copy of the query results, specify the Amazon S3 location for the storage integration. Your administrator provided you with the S3 URI.
   + **KMS key ID** – A KMS key that you've created. You can specify its ARN to encrypt the output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.

1. Choose **Connect**.

------
#### [ Using an OAuth Connection ]

**Important**  
Your administrator customized your Studio Classic environment so that you can use an OAuth connection. You might need to restart the Jupyter Server application to use the functionality.  
Use the following procedure to update the Jupyter Server application.  
1. Within Studio Classic, choose **File**.
1. Choose **Shut down**.
1. Choose **Shut down server**.
1. Close the tab or window that you're using to access Studio Classic.
1. From the Amazon SageMaker AI console, open Studio Classic.

**To import a dataset into Data Wrangler from Snowflake using your credentials**

1. Sign into [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Snowflake**.

1. For **Connection name**, specify a name that uniquely identifies the connection.

1. For **Authentication method**, choose **OAuth**.

1. (Optional) For **Advanced settings**, specify the following:
   + **Role** – A role within Snowflake. Some roles have access to different datasets. If you don't specify a role, Data Wrangler uses the default role in your Snowflake account.
   + **Storage integration** – When you specify and run a query, Data Wrangler creates a temporary copy of the query results in memory. To store a permanent copy of the query results, specify the Amazon S3 location for the storage integration. Your administrator provided you with the S3 URI.
   + **KMS key ID** – A KMS key that you've created. You can specify its ARN to encrypt the output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.

1. Choose **Connect**.

------

You can begin the process of importing your data from Snowflake after you've connected to it.

Within Data Wrangler, you can view your data warehouses, databases, and schemas, along with an eye icon that you can use to preview a table. After you select the **Preview Table** icon, the schema preview of that table is generated. You must select a warehouse before you can preview a table.

**Important**  
If you're importing a dataset with columns of type `TIMESTAMP_TZ` or `TIMESTAMP_LTZ`, append `::string` to those column names in your query. For more information, see [How To: Unload TIMESTAMP\_TZ and TIMESTAMP\_LTZ data to a Parquet file](https://community.snowflake.com/s/article/How-To-Unload-Timestamp-data-in-a-Parquet-file).

After you select a data warehouse, database, and schema, you can write and run queries. The output of your query shows under **Query results**.

After you're satisfied with the output of your query, you can import it into a Data Wrangler flow to perform data transformations. 

After you've imported your data, navigate to your Data Wrangler flow and start adding transformations to it. For a list of available transforms, see [Transform Data](data-wrangler-transform.md).

## Import Data From Software as a Service (SaaS) Platforms
<a name="data-wrangler-import-saas"></a>

You can use Data Wrangler to import data from more than forty software as a service (SaaS) platforms. To import your data from your SaaS platform, you or your administrator must use Amazon AppFlow to transfer the data from the platform to Amazon S3 or Amazon Redshift. For more information about Amazon AppFlow, see [What is Amazon AppFlow?](https://docs.aws.amazon.com/appflow/latest/userguide/what-is-appflow.html) If you don't need to use Amazon Redshift, we recommend transferring the data to Amazon S3 for a simpler process.

Data Wrangler supports transferring data from the following SaaS platforms:
+ [Amplitude](https://docs.aws.amazon.com/appflow/latest/userguide/amplitude.html)
+ [Asana](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-asana.html)
+ [Braintree](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-braintree.html)
+ [CircleCI](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-circleci.html)
+ [DocuSign Monitor](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-docusign-monitor.html)
+ [Delighted](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-delighted.html)
+ [Domo](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-domo.html)
+ [Datadog](https://docs.aws.amazon.com/appflow/latest/userguide/datadog.html)
+ [Dynatrace](https://docs.aws.amazon.com/appflow/latest/userguide/dynatrace.html)
+ [Facebook Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-ads.html)
+ [Facebook Page Insights](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-page-insights.html)
+ [Google Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-ads.html)
+ [Google Analytics 4](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-analytics-4.html)
+ [Google Calendar](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-calendar.html)
+ [Google Search Console](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-search-console.html)
+ [GitHub](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-github.html)
+ [GitLab](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-gitlab.html)
+ [Infor Nexus](https://docs.aws.amazon.com/appflow/latest/userguide/infor-nexus.html)
+ [Instagram Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-instagram-ads.html)
+ [Intercom](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-intercom.html)
+ [JDBC (Sync)](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-jdbc.html)
+ [Jira Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-jira-cloud.html)
+ [LinkedIn Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-linkedin-ads.html)
+ [Mailchimp](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mailchimp.html)
+ [Marketo](https://docs.aws.amazon.com/appflow/latest/userguide/marketo.html)
+ [Microsoft Dynamics 365](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-microsoft-dynamics-365.html)
+ [Microsoft Teams](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-microsoft-teams.html)
+ [Mixpanel](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mixpanel.html)
+ [Okta](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-okta.html)
+ [Oracle HCM](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-oracle-hcm.html)
+ [Paypal Checkout](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-paypal.html)
+ [Pendo](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-pendo.html)
+ [Salesforce](https://docs.aws.amazon.com/appflow/latest/userguide/salesforce.html)
+ [Salesforce Marketing Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-salesforce-marketing-cloud.html)
+ [Salesforce Pardot](https://docs.aws.amazon.com/appflow/latest/userguide/pardot.html)
+ [SAP OData](https://docs.aws.amazon.com/appflow/latest/userguide/sapodata.html)
+ [SendGrid](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-sendgrid.html)
+ [ServiceNow](https://docs.aws.amazon.com/appflow/latest/userguide/servicenow.html)
+ [Singular](https://docs.aws.amazon.com/appflow/latest/userguide/singular.html)
+ [Slack](https://docs.aws.amazon.com/appflow/latest/userguide/slack.html)
+ [Smartsheet](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-smartsheet.html)
+ [Snapchat Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-snapchat-ads.html)
+ [Stripe](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-stripe.html)
+ [Trend Micro](https://docs.aws.amazon.com/appflow/latest/userguide/trend-micro.html)
+ [Typeform](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-typeform.html)
+ [Veeva](https://docs.aws.amazon.com/appflow/latest/userguide/veeva.html)
+ [WooCommerce](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-woocommerce.html)
+ [Zendesk](https://docs.aws.amazon.com/appflow/latest/userguide/zendesk.html)
+ [Zendesk Chat](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-chat.html)
+ [Zendesk Sell](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sell.html)
+ [Zendesk Sunshine](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sunshine.html)
+ [Zoho CRM](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zoho-crm.html)
+ [Zoom Meetings](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zoom-meetings.html)

The preceding list has links to more information about setting up your data sources. You or your administrator can refer to those links after you've read the following information.

When you navigate to the **Import** tab of your Data Wrangler flow, you see data sources under the following sections:
+ **Available**
+ **Set up data sources**

You can connect to data sources under **Available** without needing additional configuration. You can choose the data source and import your data.

Data sources under **Set up data sources** require you or your administrator to use Amazon AppFlow to transfer the data from the SaaS platform to Amazon S3 or Amazon Redshift. For information about performing a transfer, see [Using Amazon AppFlow to transfer your data](#data-wrangler-import-saas-transfer).

After you perform the data transfer, the SaaS platform appears as a data source under **Available**. You can choose it and import the data that you've transferred into Data Wrangler. The data that you've transferred appears as tables that you can query.

### Using Amazon AppFlow to transfer your data
<a name="data-wrangler-import-saas-transfer"></a>

Amazon AppFlow is a platform that you can use to transfer data from your SaaS platform to Amazon S3 or Amazon Redshift without having to write any code. To perform a data transfer, you use the AWS Management Console.

**Important**  
You must make sure you've set up the permissions to perform a data transfer. For more information, see [Amazon AppFlow Permissions](data-wrangler-security.md#data-wrangler-appflow-permissions).

After you've added permissions, you can transfer the data. Within Amazon AppFlow, you create a *flow* to transfer the data. A flow is a series of configurations. You can use it to specify whether you're running the data transfer on a schedule or whether you're partitioning the data into separate files. After you've configured the flow, you run it to transfer the data.

For information about creating a flow, see [Creating flows in Amazon AppFlow](https://docs.aws.amazon.com/appflow/latest/userguide/create-flow.html). For information about running a flow, see [Activate an Amazon AppFlow flow](https://docs.aws.amazon.com/appflow/latest/userguide/run-flow.html).

After the data has been transferred, use the following procedure to access the data in Data Wrangler.
**Important**  
Before you try to access your data, make sure your IAM role has the following policy:  

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
```
By default, the IAM role that you use to access Data Wrangler is the `SageMakerExecutionRole`. For more information about adding policies, see [Adding IAM identity permissions (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console).
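If you prefer to attach this inline policy with code, the following is a minimal boto3 sketch. The role name and policy name are placeholders that you substitute for your own; `put_role_policy` is the standard IAM API for attaching an inline policy.

```python
import json

# The Glue search policy from above, expressed as a Python dict.
GLUE_SEARCH_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog",
            ],
        }
    ],
}

def attach_glue_search_policy(role_name: str) -> None:
    """Attach the policy above to an IAM role as an inline policy."""
    import boto3  # imported lazily so the policy dict is usable without AWS credentials

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,                      # placeholder: your execution role
        PolicyName="DataWranglerGlueSearchTables",  # placeholder policy name
        PolicyDocument=json.dumps(GLUE_SEARCH_POLICY),
    )
```

Calling `attach_glue_search_policy("YourSageMakerExecutionRole")` requires IAM permissions to modify the role; attaching the policy in the console works equally well.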

To connect to a data source, do the following.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose the data source.

1. For the **Name** field, specify the name of the connection.

1. (Optional) Choose **Advanced configuration**.

   1. Choose a **Workgroup**.

   1. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a workgroup, specify a value for **Amazon S3 location of query results**.

   1. (Optional) For **Data retention period**, select the checkbox to set a data retention period and specify the number of days to store the data before it's deleted.

   1. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the checkbox and not save the connection.

1. Choose **Connect**.

1. Specify a query.
**Note**  
To help you specify a query, you can choose a table on the left-hand navigation panel. Data Wrangler shows the table name and a preview of the table. Choose the icon next to the table name to copy the name. You can use the table name in the query.

1. Choose **Run**.

1. Choose **Import query**.

1. For **Dataset name**, specify the name of the dataset.

1. Choose **Add**.

When you navigate to the **Import data** screen, you can see the connection that you've created. You can use the connection to import more data.

## Imported Data Storage
<a name="data-wrangler-import-storage"></a>

**Important**  
We strongly recommend that you protect your Amazon S3 bucket by following [Security best practices](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html).

When you query data from Amazon Athena or Amazon Redshift, the queried dataset is automatically stored in Amazon S3. Data is stored in the default SageMaker AI S3 bucket for the AWS Region in which you are using Studio Classic.

Default S3 buckets have the following naming convention: `sagemaker-`*region*`-`*account number*. For example, if your account number is 111122223333 and you are using Studio Classic in `us-east-1`, your imported datasets are stored in `sagemaker-us-east-1-111122223333`.
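The naming convention above, and the query-result prefixes described later in this section, can be sketched in a few lines. The helper functions are illustrative, not part of any SageMaker API.

```python
def default_sagemaker_bucket(region: str, account_id: str) -> str:
    """Default SageMaker AI bucket name for a Region and account."""
    return f"sagemaker-{region}-{account_id}"

def query_result_prefix(bucket: str, source: str, query_uuid: str) -> str:
    """S3 location of a queried dataset; source is 'redshift' or 'athena'."""
    return f"s3://{bucket}/{source}/{query_uuid}/data/"

# Using the documentation's sample account number:
bucket = default_sagemaker_bucket("us-east-1", "111122223333")
# bucket == "sagemaker-us-east-1-111122223333"
```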

 Data Wrangler flows depend on this Amazon S3 dataset location, so you should not modify this dataset in Amazon S3 while you are using a dependent flow. If you do modify this S3 location, and you want to continue using your data flow, you must remove all objects in `trained_parameters` in your .flow file. To do this, download the .flow file from Studio Classic and for each instance of `trained_parameters`, delete all entries. When you are done, `trained_parameters` should be an empty JSON object:

```
"trained_parameters": {}
```
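If you have many `trained_parameters` entries, you can automate the cleanup on the downloaded .flow file, which is JSON. The following sketch assumes `trained_parameters` can appear at any nesting level; the helper name is ours, not part of Data Wrangler.

```python
import json

def clear_trained_parameters(flow: dict) -> dict:
    """Recursively replace every 'trained_parameters' entry with an empty object."""
    for key, value in flow.items():
        if key == "trained_parameters":
            flow[key] = {}
        elif isinstance(value, dict):
            clear_trained_parameters(value)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    clear_trained_parameters(item)
    return flow
```

For example, load the downloaded file with `json.load`, run `clear_trained_parameters` on it, write it back with `json.dump`, and re-upload it to Studio Classic.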

When you export and use your data flow to process your data, the .flow file you export refers to this dataset in Amazon S3. Use the following sections to learn more. 

### Amazon Redshift Import Storage
<a name="data-wrangler-import-storage-redshift"></a>

Data Wrangler stores the datasets that result from your query in a Parquet file in your default SageMaker AI S3 bucket. 

This file is stored under the following prefix (directory): redshift/*uuid*/data/, where *uuid* is a unique identifier that gets created for each query. 

For example, if your default bucket is `sagemaker-us-east-1-111122223333`, a single dataset queried from Amazon Redshift is located in s3://sagemaker-us-east-1-111122223333/redshift/*uuid*/data/.

### Amazon Athena Import Storage
<a name="data-wrangler-import-storage-athena"></a>

When you query an Athena database and import a dataset, Data Wrangler stores the dataset, as well as a subset of that dataset, or *preview files*, in Amazon S3. 

The dataset you import by selecting **Import dataset** is stored in Parquet format in Amazon S3. 

Preview files are written in CSV format when you select **Run** on the Athena import screen, and contain up to 100 rows from your queried dataset. 

The dataset you query is located under the prefix (directory): athena/*uuid*/data/, where *uuid* is a unique identifier that gets created for each query.

For example, if your default bucket is `sagemaker-us-east-1-111122223333`, a single dataset queried from Athena is located in s3://sagemaker-us-east-1-111122223333/athena/*uuid*/data/.

The subset of the dataset that is stored to preview dataframes in Data Wrangler is stored under the prefix: athena/.

# Create and Use a Data Wrangler Flow
<a name="data-wrangler-data-flow"></a>

Use an Amazon SageMaker Data Wrangler flow, or a *data flow*, to create and modify a data preparation pipeline. The data flow connects the datasets, transformations, and analyses, or *steps*, you create and can be used to define your pipeline. 

## Instances
<a name="data-wrangler-data-flow-instances"></a>

When you create a Data Wrangler flow in Amazon SageMaker Studio Classic, Data Wrangler uses an Amazon EC2 instance to run the analyses and transformations in your flow. By default, Data Wrangler uses the m5.4xlarge instance. m5 instances are general purpose instances that provide a balance between compute and memory. You can use m5 instances for a variety of compute workloads.

Data Wrangler also gives you the option of using r5 instances. r5 instances are designed to deliver fast performance that processes large datasets in memory.

We recommend that you choose the instance type that is best optimized for your workloads. For example, the r5.8xlarge might have a higher price than the m5.4xlarge, but the r5.8xlarge might be better optimized for your workloads. With better optimized instances, you can run your data flows in less time at lower cost.

The following table shows the instances that you can use to run your Data Wrangler flow.


| Standard Instances | vCPU | Memory | 
| --- | --- | --- | 
| ml.m5.4xlarge | 16 | 64 GiB | 
| ml.m5.8xlarge | 32 | 128 GiB | 
| ml.m5.16xlarge | 64 |  256 GiB  | 
| ml.m5.24xlarge | 96 | 384 GiB | 
| ml.r5.4xlarge | 16 | 128 GiB | 
| ml.r5.8xlarge | 32 | 256 GiB | 
| ml.r5.24xlarge | 96 | 768 GiB | 

For more information about r5 instances, see [Amazon EC2 R5 Instances](https://aws.amazon.com/ec2/instance-types/r5/). For more information about m5 instances, see [Amazon EC2 M5 Instances](https://aws.amazon.com/ec2/instance-types/m5/).
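As a sketch, picking the smallest listed instance whose memory fits your working set might look like the following. The helper is illustrative, not a SageMaker API, and the instance names are written with the `ml.` prefix used elsewhere in SageMaker.

```python
# vCPU and memory (GiB) for the Data Wrangler flow instances in the table above.
FLOW_INSTANCES = {
    "ml.m5.4xlarge": (16, 64),
    "ml.m5.8xlarge": (32, 128),
    "ml.m5.16xlarge": (64, 256),
    "ml.m5.24xlarge": (96, 384),
    "ml.r5.4xlarge": (16, 128),
    "ml.r5.8xlarge": (32, 256),
    "ml.r5.24xlarge": (96, 768),
}

def smallest_instance_for(memory_gib: float) -> str:
    """Return the listed instance with the least memory that still fits the working set."""
    candidates = [(mem, name) for name, (_, mem) in FLOW_INSTANCES.items() if mem >= memory_gib]
    if not candidates:
        raise ValueError("no listed instance has enough memory")
    return min(candidates)[1]
```

For example, a flow that needs roughly 500 GiB in memory would land on the ml.r5.24xlarge; actual memory needs depend on your transforms, so treat this as a starting point.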

Each Data Wrangler flow has an Amazon EC2 instance associated with it. You might have multiple flows that are associated with a single instance.

For each flow file, you can seamlessly switch the instance type. If you switch the instance type, the instance that you used to run the flow continues to run.

To switch the instance type of your flow, do the following.

1. Choose the **Running Terminals and Kernels** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/running-terminals-kernels.png)).

1. Navigate to the instance that you're using and choose it.

1. Choose the instance type that you want to use.  
![\[Example showing how to choose an instance in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-instance-switching-list-instances.png)

1. Choose **Save**.

You are charged for all running instances. To avoid incurring additional charges, manually shut down the instances that you aren't using. To shut down a running instance, use the following procedure.


1. Choose the instance icon. The following image shows you where to select the **RUNNING INSTANCES** icon.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/instance-switching-running-instances.png)

1. Choose **Shut down** next to the instance that you want to shut down.

If you shut down an instance that's used to run a flow, you temporarily can't access the flow. If you get an error when you try to open a flow whose instance you previously shut down, wait about 5 minutes and try opening it again.

When you export your data flow to a location such as Amazon Simple Storage Service or Amazon SageMaker Feature Store, Data Wrangler runs an Amazon SageMaker processing job. You can use one of the following instances for the processing job. For more information on exporting your data, see [Export](data-wrangler-data-export.md).


| Standard Instances | vCPU | Memory | 
| --- | --- | --- | 
| ml.m5.4xlarge | 16 | 64 GiB | 
| ml.m5.12xlarge | 48 |  192 GiB  | 
| ml.m5.24xlarge | 96 | 384 GiB | 

For more information about the cost per hour for using the available instance types, see [SageMaker Pricing](https://aws.amazon.com//sagemaker/pricing/). 

## The Data Flow UI
<a name="data-wrangler-data-flow-ui"></a>

When you import a dataset, the original dataset appears on the data flow and is named **Source**. If you turned on sampling when you imported your data, this dataset is named **Source - sampled**. Data Wrangler automatically infers the types of each column in your dataset and creates a new dataframe named **Data types**. You can select this frame to update the inferred data types. You see results similar to those shown in the following image after you upload a single dataset: 

![\[Example showing Source - sampled and Data types in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/dataflow-after-import.png)


Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than **Join** or **Concatenate**) are added to the same dataset, they are stacked. 

**Join** and **Concatenate** create standalone steps that contain the new joined or concatenated dataset. 

The following diagram shows a data flow with a join between two datasets, as well as two stacks of steps. The first stack (**Steps (2)**) adds two transforms to the type inferred in the **Data types** dataset. The *downstream* stack, or the stack to the right, adds transforms to the dataset resulting from a join named **demo-join**. 

![\[Example showing steps in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-flow-steps.png)


The small, gray box in the bottom-right corner of the data flow provides an overview of the number of stacks and steps in the flow and of the flow's layout. The lighter box inside the gray box indicates the steps that are within the UI view. You can use this box to see sections of your data flow that fall outside of the UI view. Use the fit screen icon (![\[Dotted square outline icon representing a placeholder or empty state.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/fit-screen.png)) to fit all steps and datasets into your UI view. 

The bottom left navigation bar includes icons that you can use to zoom in (![\[Plus symbol icon representing a zoom-in action.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/zoom-in.png)) and zoom out (![\[Minus symbol icon representing a zoom-out action.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/zoom-out.png)) of your data flow and resize the data flow to fit the screen (![\[Dotted square outline icon representing a placeholder or empty state.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/fit-screen.png)). Use the lock icon (![\[Lock icon representing the lock and unlock functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/lock-nodes.png)) to lock and unlock the location of each step on the screen. 



## Add a Step to Your Data Flow
<a name="data-wrangler-data-flow-add-step"></a>

Select **+** next to any dataset or previously added step and then select one of the following options:
+ **Edit data types** (For a **Data types** step only): If you have not added any transforms to a **Data types** step, you can select **Edit data types** to update the data types Data Wrangler inferred when importing your dataset. 
+ **Add transform**: Adds a new transform step. See [Transform Data](data-wrangler-transform.md) to learn more about the data transformations you can add. 
+ **Add analysis**: Adds an analysis. You can use this option to analyze your data at any point in the data flow. When you add one or more analyses to a step, an analysis icon (![\[Bar chart icon representing data visualization or analytics functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/analysis-icon.png)) appears on that step. See [Analyze and Visualize](data-wrangler-analyses.md) to learn more about the analyses you can add. 
+ **Join**: Joins two datasets and adds the resulting dataset to the data flow. To learn more, see [Join Datasets](data-wrangler-transform.md#data-wrangler-transform-join).
+ **Concatenate**: Concatenates two datasets and adds the resulting dataset to the data flow. To learn more, see [Concatenate Datasets](data-wrangler-transform.md#data-wrangler-transform-concatenate).

## Delete a Step from Your Data Flow
<a name="data-wrangler-data-flow-delete-step"></a>

To delete a step, select the step and select **Delete**. If the node has a single input, you delete only the step that you select. Deleting a step from a node that has a single input doesn't delete the steps that follow it. Deleting a step from a source, join, or concatenate node also deletes all of the steps that follow it.

To delete a step from a stack of steps, select the stack and then select the step you want to delete. 

You can use one of the following procedures to delete a step without deleting the downstream steps.

------
#### [ Delete a step in the Data Wrangler flow ]

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the Data Wrangler flow.

1. Choose the group of steps that has the step that you're deleting.

1. Choose the icon next to the step.

1. Choose **Delete step**.  
![\[Example showing how to delete a step in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/delete-step-flow-1.png)

------
#### [ Delete a step in the table view ]

Use the following procedure to delete a step in the table view.

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

1. Choose the step and open the table view for the step.

1. Move your cursor over the step so the ellipsis icon appears.

1. Choose the icon next to the step.

1. Choose **Delete**.  
![\[Example showing how to delete a step in the table view of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/delete-step-table-0.png)

------

## Edit a Step in Your Data Wrangler Flow
<a name="data-wrangler-data-flow-edit-step"></a>

You can edit each step that you've added in your Data Wrangler flow. By editing steps, you can change the transformations or the data types of the columns. You can edit the steps to make changes with which you can perform better analyses.

There are many ways that you can edit a step. Some examples include changing the imputation method or changing the threshold for considering a value to be an outlier.

To edit a step, do the following.

1. Choose a step in the Data Wrangler flow to open the table view.  
![\[Example step in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-flow-edit-choose-step.png)

1. Choose the step that you want to edit.

1. Edit the step.

The following image shows an example of editing a step.

![\[Example showing how to edit steps in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-flow-table-edit-step.png)


**Note**  
You can use the shared spaces within your Amazon SageMaker AI domain to work collaboratively on your Data Wrangler flows. Within a shared space, you and your collaborators can edit the same flow file. However, changes aren't visible in real time. When anyone makes a change to the Data Wrangler flow, they must save it immediately. When someone saves the file, a collaborator can't see the change until they close and reopen the file. Any changes that one person hasn't saved are overwritten by the person who saves their changes.

# Get Insights On Data and Data Quality
<a name="data-wrangler-data-insights"></a>

Use the **Data Quality and Insights Report** to perform an analysis of the data that you've imported into Data Wrangler. We recommend that you create the report after you import your dataset. You can use the report to help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

Use the following procedure to create a Data Quality and Insights report. It assumes that you've already imported a dataset into your Data Wrangler flow.

**To create a Data Quality and Insights report**

1. Choose the **+** next to a node in your Data Wrangler flow.

1. Select **Get data insights**.

1. For **Analysis name**, specify a name for the insights report.

1. (Optional) For **Target column**, specify the target column.

1. For **Problem type**, specify **Regression** or **Classification**.

1. For **Data size**, specify one of the following:
   + **50 K** – Uses the first 50,000 rows of the dataset that you've imported to create the report.
   + **Entire dataset** – Uses the entire dataset that you've imported to create the report.
**Note**  
Creating a Data Quality and Insights report on the entire dataset uses an Amazon SageMaker processing job. A SageMaker Processing job provisions the additional compute resources required to get insights for all of your data. For more information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

1. Choose **Create**.

The following topics show the sections of the report:

**Topics**
+ [Summary](#data-wrangler-data-insights-summary)
+ [Target column](#data-wrangler-data-insights-target-column)
+ [Quick model](#data-wrangler-data-insights-quick-model)
+ [Feature summary](#data-wrangler-data-insights-feature-summary)
+ [Samples](#data-wrangler-data-insights-samples)
+ [Definitions](#data-wrangler-data-insights-definitions)

You can either download the report or view it online. To download the report, choose the download button at the top right corner of the screen. The following image shows the button.

![\[Example showing the download button.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-download.png)


## Summary
<a name="data-wrangler-data-insights-summary"></a>

The insights report has a brief summary of the data that includes general information such as missing values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings that point to probable issues with the data. We recommend that you investigate the warnings.

The following is an example of a report summary.

![\[Example report summary.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-report-summary.png)


## Target column
<a name="data-wrangler-data-insights-target-column"></a>

When you create the data quality and insights report, Data Wrangler gives you the option to select a target column. A target column is a column that you're trying to predict. When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power. When you select a target column, you must specify whether you’re trying to solve a regression or a classification problem.

For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a category. It also presents observations, or rows, with a missing or invalid target value.

The following image shows an example target column analysis for a classification problem.

![\[Example target column analysis.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-target-column-classification.png)


For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents observations, or rows, with a missing, invalid, or outlier target value.

The following image shows an example target column analysis for a regression problem.

![\[Example target column analysis.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-target-column-regression.png)


## Quick model
<a name="data-wrangler-data-insights-quick-model"></a>

The **Quick model** provides an estimate of the expected quality of predictions from a model that you train on your data.

Data Wrangler splits your data into training and validation folds. It uses 80% of the samples for training and 20% for validation. For classification, the split is stratified: each fold has the same ratio of labels. For classification problems, it's important to keep the same ratio of labels between the training and validation folds. Data Wrangler trains an XGBoost model with the default hyperparameters. It applies early stopping on the validation data and performs minimal feature preprocessing.
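The stratified 80/20 split described above can be sketched in plain Python. This is an illustration of the idea, not Data Wrangler's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=0):
    """Split row indices so each fold keeps (roughly) the same label ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, validation = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(round(train_frac * len(indices)))
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return train, validation
```

For a dataset with 80 rows of one class and 20 of another, both folds end up 80% / 20% by class, which keeps validation metrics comparable to training.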

For classification models, Data Wrangler returns both a model summary and a confusion matrix.

The following is an example of a classification model summary. To learn more about the information that it returns, see [Definitions](#data-wrangler-data-insights-definitions).

![\[Example classification model summary.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-quick-model-classification-summary.png)


The following is an example of a confusion matrix that the quick model returns.

![\[Example confusion matrix.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-quick-model-classification-confusion-matrix.png)


A confusion matrix gives you the following information:
+ The number of times the predicted label matches the true label.
+ The number of times the predicted label doesn't match the true label.

The true label represents an actual observation in your data. For example, if you're using a model to detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-fraudulent. The predicted label represents the label that your model assigns to the data.

You can use the confusion matrix to see how well the model predicts the presence or the absence of a condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent transactions as fraudulent.
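As an illustration of reading a confusion matrix this way, the following sketch computes sensitivity and specificity from true and predicted labels. The helper names are ours, not a SageMaker API.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP/FP/TN/FN for a binary problem, e.g. positive='fraud'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Share of actual positives the model detects (true positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives the model leaves alone (true negative rate)."""
    return tn / (tn + fp)
```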

The following is an example of the quick model outputs for a regression problem.

![\[Example of the quick model outputs for a regression problem.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-quick-model-regression-summary.png)


## Feature summary
<a name="data-wrangler-data-insights-feature-summary"></a>

When you specify a target column, Data Wrangler orders the features by their prediction power. Prediction power is measured after the data is split into 80% training and 20% validation folds. Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature preprocessing and measures prediction performance on the validation data.

It normalizes the scores to the range [0,1]. Higher prediction scores indicate columns that are more useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the target column.

It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem with other columns. You can confidently use the prediction scores to determine whether a feature in your dataset is predictive.

A low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities, which often indicates target leakage. Target leakage usually happens when the dataset contains a column that isn’t available at the prediction time. For example, it could be a duplicate of the target column.
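Data Wrangler's exact scoring method isn't documented here, but the idea of scoring a single feature on held-out data and normalizing to [0, 1] can be sketched with a simple decision stump. Chance-level accuracy (0.5 for a balanced binary target) maps to 0 and perfect accuracy to 1; this is an illustration, not the actual algorithm.

```python
def stump_score(feature, labels, split=0.8):
    """Illustrative single-feature score: fit a threshold rule on the first 80%
    of rows, measure accuracy on the last 20%, and map it to [0, 1]."""
    cut = int(len(feature) * split)
    train_x, train_y = feature[:cut], labels[:cut]
    val_x, val_y = feature[cut:], labels[cut:]
    # Pick the threshold/orientation with the best training accuracy.
    best = None
    for threshold in sorted(set(train_x)):
        for flip in (False, True):
            preds = [(x >= threshold) != flip for x in train_x]
            acc = sum(p == y for p, y in zip(preds, train_y)) / len(train_y)
            if best is None or acc > best[0]:
                best = (acc, threshold, flip)
    _, threshold, flip = best
    val_preds = [(x >= threshold) != flip for x in val_x]
    val_acc = sum(p == y for p, y in zip(val_preds, val_y)) / len(val_y)
    # Rescale so chance-level accuracy maps to 0 and perfect accuracy to 1.
    return max(0.0, 2 * val_acc - 1)
```

A feature that is a thresholded copy of the target scores 1.0 — exactly the target-leakage signature described above.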

The following are examples of the table and the histogram that show the prediction value of each feature.

![\[Example summary table showing the prediction value of each feature.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-feature-summary-table.png)


![\[Example histogram showing the prediction value of each feature.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-feature-summary-histogram.png)


## Samples
<a name="data-wrangler-data-insights-samples"></a>

Data Wrangler provides information about whether your samples are anomalous and whether there are duplicates in your dataset.

Data Wrangler detects anomalous samples using the *isolation forest algorithm*. The isolation forest associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate anomalous samples. High scores are associated with non-anomalous samples. Samples with a negative anomaly score are usually considered anomalous, and samples with a positive anomaly score are considered non-anomalous.
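To see this score convention in practice, the following sketch uses scikit-learn's `IsolationForest`. Data Wrangler's exact configuration may differ; the point is that the most anomalous row receives the lowest score.

```python
from sklearn.ensemble import IsolationForest

# A tight cluster of "normal" rows plus one obvious outlier.
rows = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [1.0, 1.1], [0.9, 0.9], [8.0, 8.0]]

forest = IsolationForest(random_state=0).fit(rows)
scores = forest.decision_function(rows)  # lower (more negative) = more anomalous

# The far-away row should receive the lowest anomaly score.
most_anomalous = min(range(len(rows)), key=lambda i: scores[i])
```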

When you look at a sample that might be anomalous, we recommend that you pay attention to unusual values. For example, you might have anomalous values that result from errors in gathering and processing the data. We recommend using domain knowledge and business logic when you examine anomalous samples.

Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data sources could include valid duplicates. Other data sources could have duplicates that point to problems in data collection. Duplicate samples that result from faulty data collection could interfere with machine learning processes that rely on splitting the data into independent training and validation folds.

The following are elements of the insights report that can be impacted by duplicated samples:
+ Quick model
+ Prediction power estimation
+ Automatic hyperparameter tuning

You can remove duplicate samples from the dataset using the **Drop duplicates** transform under **Manage rows**. Data Wrangler shows you the most frequently duplicated rows.

## Definitions
<a name="data-wrangler-data-insights-definitions"></a>

The following are definitions for the technical terms that are used in the data insights report.

------
#### [ Feature types ]

The following are the definitions for each of the feature types:
+ **Numeric** – Numeric values can be either floats or integers, such as age or income. The machine learning models assume that numeric values are ordered and a distance is defined over them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
+ **Categorical** – The column entries belong to a set of unique values, which is usually much smaller than the number of entries in the column. For example, a column of length 100 could contain the unique values `Dog`, `Cat`, and `Mouse`. The values could be numeric, text, or a combination of both. `Horse`, `House`, `8`, `Love`, and `3.1` would all be valid values and could be found in the same categorical column. The machine learning model does not assume order or distance on the values of categorical features, as opposed to numeric features, even when all the values are numbers.
+ **Binary** – Binary features are a special categorical feature type in which the cardinality of the set of unique values is 2.
+ **Text** – A text column contains many non-numeric unique values. In the extreme case, all the elements of the column are unique and no two entries are the same.
+ **Datetime** – A datetime column contains information about the date or time. It can have information about both the date and time.

------
#### [ Feature statistics ]

The following are definitions for each of the feature statistics:
+ **Prediction power** – Prediction power measures how useful the column is in predicting the target.
+ **Outliers** (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5 percentile, 95 percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are considered to be outliers.
+ **Skew** (in numeric columns) – Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is considered skewed when the absolute value of the skew is larger than 3.
+ **Kurtosis** (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the distribution. It's defined as the fourth moment of the distribution divided by the square of the second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply that the distribution is concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
+ **Missing values** – Null-like objects, empty strings, and strings composed of only whitespace are considered missing.
+ **Valid values for numeric features or regression target** – All values that you can cast to finite floats are valid. Missing values are not valid.
+ **Valid values for categorical, binary, or text features, or for classification target** – All values that are not missing are valid.
+ **Datetime features** – All values that you can cast to a datetime object are valid. Missing values are not valid.
+ **Invalid values** – Values that are either missing or that you can't properly cast. For example, in a numeric column, you can't cast the string `"six"` or a null value.
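The outlier, skew, and kurtosis statistics are straightforward to reproduce outside Data Wrangler. The following sketch uses NumPy and SciPy with illustrative data; `fisher=False` keeps Pearson's convention, in which the normal distribution scores 3:

```
import numpy as np
from scipy.stats import kurtosis, skew

values = np.concatenate([np.arange(1, 101, dtype=float), [10_000.0]])

# Outliers: clip to the [5th, 95th] percentile range, then take the
# standard deviation of the clipped vector (RSTD).
median = np.median(values)
lo, hi = np.percentile(values, [5, 95])
rstd = np.clip(values, lo, hi).std()
outliers = values[(values > median + 5 * rstd) | (values < median - 5 * rstd)]

# Skew and Pearson's (non-excess) kurtosis.
s = skew(values)                     # positive: long right tail
k = kurtosis(values, fisher=False)   # well above 3: heavy tail / outlier
```

For this data, only the planted value 10,000 falls outside the median ± 5 * RSTD bounds.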

------
#### [ Quick model metrics for regression ]

The following are the definitions for the quick model metrics:
+ **R2 or coefficient of determination** – R2 is the proportion of the variation in the target that is predicted by the model. R2 is in the range of (-infinity, 1]. 1 is the score of the model that predicts the target perfectly and 0 is the score of the trivial model that always predicts the target mean.
+ **MSE or mean squared error** – MSE is in the range [0, infinity). 0 is the score of the model that predicts the target perfectly.
+ **MAE or mean absolute error** – MAE is in the range [0, infinity), where 0 is the score of the model that predicts the target perfectly.
+ **RMSE or root mean square error** – RMSE is in the range [0, infinity), where 0 is the score of the model that predicts the target perfectly.
+ **Max error** – The maximum absolute value of the error over the dataset. Max error is in the range [0, infinity). 0 is the score of the model that predicts the target perfectly.
+ **Median absolute error** – Median absolute error is in the range [0, infinity). 0 is the score of the model that predicts the target perfectly.
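As a sanity check on these definitions, the same regression metrics can be computed with scikit-learn on a toy prediction vector (the values below are illustrative, not Data Wrangler output):

```
import numpy as np
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_squared_error, median_absolute_error,
                             r2_score)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)      # 0.375
rmse = mse ** 0.5                             # sqrt of MSE
mae = mean_absolute_error(y_true, y_pred)     # 0.5
med = median_absolute_error(y_true, y_pred)   # 0.5
worst = max_error(y_true, y_pred)             # 1.0
r2 = r2_score(y_true, y_pred)                 # close to 1 for a good fit
```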

------
#### [ Quick model metrics for classification ]

The following are the definitions for the quick model metrics:
+ **Accuracy** – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the perfect model.
+ **Balanced accuracy** – Balanced accuracy is the ratio of samples that are predicted accurately when the class weights are adjusted to balance the data. All classes are given the same importance, regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples wrong. 1 is the score of the perfect model.
+ **AUC (binary classification)** – This is the area under the receiver operating characteristic curve. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **AUC (OVR)** – For multiclass classification, this is the area under the receiver operating characteristic curve calculated separately for each label using one versus rest. Data Wrangler reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **Precision** – Precision is defined for a specific class. Precision is the fraction of true positives out of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the score of the model that has no false positives for the class. For binary classification, Data Wrangler reports the precision of the positive class.
+ **Recall** – Recall is defined for a specific class. Recall is the fraction of the relevant class instances that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the recall of the positive class.
+ **F1** – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports the F1 for classes with positive values.
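The classification metrics can likewise be checked with scikit-learn. This sketch uses made-up hard labels and probability scores for a binary problem:

```
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 1, 1, 0]                   # hard labels
y_score = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4]   # predicted probabilities

acc = accuracy_score(y_true, y_pred)            # 6 of 8 correct = 0.75
bal = balanced_accuracy_score(y_true, y_pred)   # classes are balanced here
prec = precision_score(y_true, y_pred)          # 3 TP / 4 predicted positives
rec = recall_score(y_true, y_pred)              # 3 TP / 4 actual positives
f1 = f1_score(y_true, y_pred)                   # harmonic mean of the two
auc = roc_auc_score(y_true, y_score)            # uses the scores, not labels
```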

------
#### [ Textual patterns ]

**Patterns** describe the textual format of a string using an easy-to-read format. The following are examples of textual patterns:
+ "{digits:4-7}" describes a sequence of digits that has a length between 4 and 7.
+ "{alnum:5}" describes an alphanumeric string with a length of exactly 5.

Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can describe many of the commonly used patterns. The **confidence**, expressed as a percentage, indicates how much of the data is estimated to match the pattern. Using the textual pattern, you can see which rows in your data you need to correct or drop.

The following describes the patterns that Data Wrangler can recognize:


| Pattern | Textual Format | 
| --- | --- | 
|  {alnum}  |  Alphanumeric strings  | 
|  {any}  |  Any string of word characters  | 
|  {digits}  |  A sequence of digits  | 
|  {lower}  |  A lowercase word  | 
|  {mixed}  |  A mixed-case word  | 
|  {name}  |  A word beginning with a capital letter  | 
|  {upper}  |  An uppercase word  | 
|  {whitespace}  |  Whitespace characters  | 

A word character is either an underscore or a character that might appear in a word in any language. For example, the strings 'Hello_word' and 'écoute' both consist of word characters. 'H' and 'é' are both examples of word characters.
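The patterns aren't regular expressions, but they map onto them naturally. The following sketch shows rough ASCII-only regex equivalents (my own translations, not emitted by Data Wrangler) that you could use to check rows yourself:

```
import re

# Rough regex equivalents of two textual patterns (ASCII-only, illustrative).
digits_4_to_7 = re.compile(r"^\d{4,7}$")     # {digits:4-7}
alnum_5 = re.compile(r"^[A-Za-z0-9]{5}$")    # {alnum:5}

rows = ["12345", "123", "ab1c2", "ab cd"]
matches_digits = [r for r in rows if digits_4_to_7.match(r)]
```

Only `"12345"` matches the digit pattern here; `"123"` is too short, and the other rows aren't digit sequences.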

------

# Automatically Train Models on Your Data Flow
<a name="data-wrangler-autopilot"></a>

You can use Amazon SageMaker Autopilot to automatically train, tune, and deploy models on the data that you've transformed in your data flow. Amazon SageMaker Autopilot can go through several algorithms and use the one that works best with your data. For more information about Amazon SageMaker Autopilot, see [SageMaker Autopilot](autopilot-automate-model-development.md).

When you train and tune a model, Data Wrangler exports your data to an Amazon S3 location where Amazon SageMaker Autopilot can access it.

You can prepare and deploy a model by choosing a node in your Data Wrangler flow and choosing **Export and Train** in the data preview. You can use this method to view your dataset before you choose to train a model on it.

You can also train and deploy a model directly from your data flow.

The following procedure prepares and deploys a model from the data flow. For Data Wrangler flows with multi-row transforms, you can't use the transforms from the Data Wrangler flow when you're deploying the model. You can use the following procedure to process the data before you use it to perform inference.

To train and deploy a model directly from your data flow, do the following.

1. Choose the **+** next to the node containing the training data.

1. Choose **Train model**.

1. (Optional) Specify an AWS KMS key or ID. For more information about creating and controlling cryptographic keys to protect your data, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Choose **Export and train**.

1. After Amazon SageMaker Autopilot trains the model on the data that Data Wrangler exported, specify a name for **Experiment name**.

1. Under **Input data**, choose **Preview** to verify that Data Wrangler properly exported your data to Amazon SageMaker Autopilot.

1. For **Target**, choose the target column.

1. (Optional) For **S3 location** under **Output data**, specify an Amazon S3 location other than the default location.

1. Choose **Next: Training method**.

1. Choose a training method. For more information, see [Training modes](autopilot-model-support-validation.md#autopilot-training-mode).

1. (Optional) For **Auto deploy endpoint**, specify a name for the endpoint.

1. For **Deployment option**, choose a deployment method. You can choose to deploy with or without the transformations that you've made to your data.
**Important**  
You can't deploy an Amazon SageMaker Autopilot model with the transformations that you've made in your Data Wrangler flow. For more information about those transformations, see [Export to an Inference Endpoint](data-wrangler-data-export.md#data-wrangler-data-export-inference).

1. Choose **Next: Review and create**.

1. Choose **Create experiment**.

For more information about model training and deployment, see [Create Regression or Classification Jobs for Tabular Data Using the AutoML API](autopilot-automate-model-development-create-experiment.md). Autopilot shows you analyses about the best model's performance. For more information about model performance, see [View an Autopilot model performance report](autopilot-model-insights.md).

# Transform Data
<a name="data-wrangler-transform"></a>

Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning, transforming, and featurizing your data. When you add a transform, it adds a step to the data flow. Each transform you add modifies your dataset and produces a new dataframe. All subsequent transforms apply to the resulting dataframe.

Data Wrangler includes built-in transforms, which you can use to transform columns without any code. You can also add custom transformations using PySpark, Python (User-Defined Function), pandas, and PySpark SQL. Some transforms operate in place, while others create a new output column in your dataset.

You can apply transforms to multiple columns at once. For example, you can delete multiple columns in a single step.

You can apply the **Process numeric** and **Handle missing** transforms only to a single column.

Use this page to learn more about these built-in and custom transforms.

## Transform UI
<a name="data-wrangler-transform-ui"></a>

Most of the built-in transforms are located in the **Prepare** tab of the Data Wrangler UI. You can access the join and concatenate transforms through the data flow view. Use the following table to preview these views. 

------
#### [ Transform ]

You can add a transform to any step in your data flow. Use the following procedure to add a transform to your data flow.

To add a step to your data flow, do the following.

1. Choose the **+** next to the step in the data flow.

1. Choose **Add transform**.

1. Choose **Add step**.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-add-step.png)

1. Choose a transform. 

1. (Optional) You can search for the transform that you want to use. Data Wrangler highlights the query in the results.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-search.png)

------
#### [ Join View ]

To join two datasets, select the first dataset in your data flow and choose **Join**. When you choose **Join**, you see results similar to those shown in the following image. Your left and right datasets are displayed in the left panel. The main panel displays your data flow, with the newly joined dataset added. 

![\[The joined dataset flow in the data flow section of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/join-1.png)


When you choose **Configure** to configure your join, you see results similar to those shown in the following image. Your join configuration is displayed in the left panel. You can use this panel to choose the joined dataset name, join type, and columns to join. The main panel displays three tables. The top two tables display the left and right datasets on the left and right respectively. Under this table, you can preview the joined dataset. 

![\[The joined dataset tables in the data flow section of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/join-2.png)


See [Join Datasets](#data-wrangler-transform-join) to learn more. 

------
#### [ Concatenate View ]

To concatenate two datasets, you select the first dataset in your data flow and choose **Concatenate**. When you select **Concatenate**, you see results similar to those shown in the following image. Your left and right datasets are displayed in the left panel. The main panel displays your data flow, with the newly concatenated dataset added. 

![\[Example concatenated dataset flow in the data flow section in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/concat-1.png)


When you choose **Configure** to configure your concatenation, you see results similar to those shown in the following image. Your concatenate configuration displays in the left panel. You can use this panel to choose the concatenated dataset's name, and choose to remove duplicates after concatenation and add columns to indicate the source dataframe. The main panel displays three tables. The top two tables display the left and right datasets on the left and right respectively. Under this table, you can preview the concatenated dataset. 

![\[Example concatenated dataset tables in the data flow section in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/concat-2.png)


See [Concatenate Datasets](#data-wrangler-transform-concatenate) to learn more.

------

## Join Datasets
<a name="data-wrangler-transform-join"></a>

You join dataframes directly in your data flow. When you join two datasets, the resulting joined dataset appears in your flow. The following join types are supported by Data Wrangler.
+ **Left Outer** – Include all rows from the left table. If the value for the column joined on a left table row does not match any right table row values, that row contains null values for all right table columns in the joined table.
+ **Left Anti** – Include rows from the left table that do not contain values in the right table for the joined column.
+ **Left Semi** – Include a single row from the left table for all identical rows that satisfy the criteria in the join statement. This excludes duplicate rows from the left table that match the criteria of the join.
+ **Right Outer** – Include all rows from the right table. If the value for the joined column in a right table row does not match any left table row values, that row contains null values for all left table columns in the joined table.
+ **Inner** – Include rows from left and right tables that contain matching values in the joined column. 
+ **Full Outer** – Include all rows from the left and right tables. If the row value for the joined column in either table does not match, separate rows are created in the joined table. If a row doesn’t contain a value for a column in the joined table, null is inserted for that column.
+ **Cartesian Cross** – Include rows which combine each row from the first table with each row from the second table. This is a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of rows from tables in the join. The result of this product is the size of the left table times the size of the right table. Therefore, we recommend caution in using this join between very large datasets. 
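The join semantics above can be previewed outside Data Wrangler with pandas. This is a sketch with made-up `id` keys; the semi and anti joins, which pandas doesn't name directly, are expressed with `isin`:

```
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": ["B", "C", "D"]})

inner = left.merge(right, on="id", how="inner")        # ids 2 and 3
left_outer = left.merge(right, on="id", how="left")    # id 1 gets a null y
full_outer = left.merge(right, on="id", how="outer")   # ids 1 through 4
cross = left.merge(right, how="cross")                 # 3 x 3 = 9 rows
left_anti = left[~left["id"].isin(right["id"])]        # only id 1
left_semi = left[left["id"].isin(right["id"])]         # ids 2 and 3, left columns only
```

The cross join's 9 rows illustrate why the Cartesian product warrants caution on large datasets.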

Use the following procedure to join two dataframes.

1. Select the **+** next to the left dataframe that you want to join. The first dataframe you select is always the left table in your join. 

1. Choose **Join**.

1. Select the right dataframe. The second dataframe you select is always the right table in your join.

1. Choose **Configure** to configure your join. 

1. Give your joined dataset a name using the **Name** field.

1. Select a **Join type**.

1. Select a column from the left and right tables to join. 

1. Choose **Apply** to preview the joined dataset on the right. 

1. To add the joined table to your data flow, choose **Add**. 

## Concatenate Datasets
<a name="data-wrangler-transform-concatenate"></a>

**Concatenate two datasets:**

1. Choose the **+** next to the left dataframe that you want to concatenate. The first dataframe you select is always the left table in your concatenate. 

1. Choose **Concatenate**.

1. Select the right dataframe. The second dataframe you select is always the right table in your concatenate.

1. Choose **Configure** to configure your concatenate. 

1. Give your concatenated dataset a name using the **Name** field.

1. (Optional) Select the checkbox next to **Remove duplicates after concatenation** to remove duplicate columns. 

1. (Optional) Select the checkbox next to **Add column to indicate source dataframe** if, for each column in the new dataset, you want to add an indicator of the column's source. 

1. Choose **Apply** to preview the new dataset. 

1. Choose **Add** to add the new dataset to your data flow. 
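The two optional checkboxes in the procedure above correspond to familiar dataframe operations. Here is a pandas sketch (column names are illustrative):

```
import pandas as pd

top = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
bottom = pd.DataFrame({"id": [2, 3], "x": ["b", "c"]})

# "Add column to indicate source dataframe"
top["source"], bottom["source"] = "first", "second"

# Stack the rows of the two dataframes.
combined = pd.concat([top, bottom], ignore_index=True)   # 4 rows

# "Remove duplicates after concatenation" (ignoring the source tag).
deduped = combined.drop_duplicates(subset=["id", "x"])   # 3 rows
```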

## Balance Data
<a name="data-wrangler-transform-balance-data"></a>

You can balance the data for datasets with an underrepresented category. Balancing a dataset can help you create better models for binary classification.

**Note**  
You can't balance datasets containing column vectors.

You can use the **Balance data** operation to balance your data using one of the following operators:
+ *Random oversampling* – Randomly duplicates samples in the minority category. For example, if you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases within the dataset 8 times.
+ *Random undersampling* – The counterpart to random oversampling. Randomly removes samples from the overrepresented category to get the proportion of samples that you desire.
+ *Synthetic Minority Oversampling Technique (SMOTE)* – Uses samples from the underrepresented category to interpolate new synthetic minority samples. For more information about SMOTE, see the following description.

You can use all transforms for datasets containing both numeric and non-numeric features. SMOTE interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to determine the neighborhood to interpolate the additional samples. Data Wrangler only uses numeric features to calculate the distances between samples in the underrepresented group.

For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features by using a weighted average. It randomly assigns weights to those samples in the range of [0, 1]. For numeric features, Data Wrangler interpolates samples using a weighted average of the samples. For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The interpolated sample has a value of 0.7A + 0.3B.

Data Wrangler interpolates non-numeric features by copying from either of the interpolated real samples. It copies the samples with a probability that it randomly assigns to each sample. For samples A and B, it can assign probabilities 0.8 to A and 0.2 to B. For the probabilities it assigned, it copies A 80% of the time.
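The numeric interpolation step can be sketched in a few lines of NumPy. The weight and feature values here are illustrative; in practice the weight is drawn at random from [0, 1]:

```
import numpy as np

# Two real samples from the underrepresented class (numeric features only).
A = np.array([1.0, 10.0])
B = np.array([3.0, 20.0])

w = 0.7  # drawn at random from [0, 1] in practice
synthetic = w * A + (1 - w) * B  # weighted average of the two samples
```

With these values the synthetic sample is `[1.6, 13.0]`, lying between A and B on each numeric feature.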

## Custom Transforms
<a name="data-wrangler-transform-custom"></a>

The **Custom Transforms** group allows you to use Python (User-Defined Function), PySpark, pandas, or PySpark (SQL) to define custom transformations. For each option, you use the variable `df` to access the dataframe to which you want to apply the transform. To apply your custom code to your dataframe, assign the dataframe with the transformations that you've made to the `df` variable. If you're not using Python (User-Defined Function), you don't need to include a return statement. Choose **Preview** to preview the result of the custom transform. Choose **Add** to add the custom transform to your list of **Previous steps**.

You can import the popular libraries with an `import` statement in the custom transform code block, such as the following:
+ NumPy version 1.19.0
+ scikit-learn version 0.23.2
+ SciPy version 1.5.4
+ pandas version 1.0.3
+ PySpark version 3.0.0

**Important**  
**Custom transform** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

If you include print statements in the code block, the result appears when you select **Preview**. You can resize the custom code transformer panel. Resizing the panel provides more space to write code. The following image shows the resizing of the panel.

![\[Resizing the custom code transformer panel.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/resizing-panel.gif)


The following sections provide additional context and examples for writing custom transform code.

**Python (User-Defined Function)**

The Python function gives you the ability to write custom transformations without needing to know Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly. You get similar performance using custom Python code and an Apache Spark plugin.

To use the Python (User-Defined Function) code block, you specify the following:
+ **Input column** – The input column where you're applying the transform.
+ **Mode** – The scripting mode, either pandas or Python.
+ **Return type** – The data type of the value that you're returning.

Using the pandas mode gives better performance. The Python mode makes it easier for you to write transformations by using pure Python functions.

The following video shows an example of how to use custom code to create a transformation. It uses the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv) to create a column with the person's salutation.

![\[For the Python function, replace the comments under pd.Series with your code.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/python-function-transform-titanic-720.gif)


**PySpark**

The following example extracts date and time from a timestamp.

```
from pyspark.sql.functions import from_unixtime, to_date, date_format
df = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))
df = df.withColumn( 'EVENT_DATE', to_date('DATE_TIME')).withColumn(
'EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss'))
```

**pandas**

The following example provides an overview of the dataframe to which you are adding transforms. 

```
df.info()
```

**PySpark (SQL)**

The following example creates a new dataframe with four columns: *name*, *fare*, *pclass*, *survived*.

```
SELECT name, fare, pclass, survived FROM df
```

If you don’t know how to use PySpark, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of code snippets. You can use the code snippets to perform tasks such as dropping columns, grouping by columns, or modeling.

To use a code snippet, choose **Search example snippets** and specify a query in the search bar. The text you specify in the query doesn’t have to match the name of the code snippet exactly.

The following example shows a **Drop duplicate rows** code snippet that can delete rows with similar data in your dataset. You can find the code snippet by searching for one of the following:
+ Duplicates
+ Identical
+ Remove

The following snippet has comments to help you understand the changes that you need to make. For most snippets, you must specify the column names of your dataset in the code.

```
# Specify the subset of columns
# all rows having identical values in these columns will be dropped

subset = ["col1", "col2", "col3"]
df = df.dropDuplicates(subset)  

# to drop the full-duplicate rows run
# df = df.dropDuplicates()
```

To use a snippet, copy and paste its content into the **Custom transform** field. You can copy and paste multiple code snippets into the custom transform field.

## Custom Formula
<a name="data-wrangler-transform-custom-formula"></a>

Use **Custom formula** to define a new column using a Spark SQL expression to query data in the current dataframe. The query must use the conventions of Spark SQL expressions.

**Important**  
**Custom formula** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

You can use this transform to perform operations on columns, referencing the columns by name. For example, assuming the current dataframe contains columns named *col\_a* and *col\_b*, you can use the following operation to produce an **Output column** that is the product of these two columns:

```
col_a * col_b
```

Other common operations include the following, assuming a dataframe contains `col_a` and `col_b` columns:
+ Concatenate two columns: `concat(col_a, col_b)`
+ Add two columns: `col_a + col_b`
+ Subtract two columns: `col_a - col_b`
+ Divide two columns: `col_a / col_b`
+ Take the absolute value of a column: `abs(col_a)`
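**Custom formula** evaluates Spark SQL expressions, but the arithmetic operations above are easy to try on a small dataframe with pandas `eval` (a sketch outside Data Wrangler; SQL-specific functions such as `concat` don't carry over to `eval`, so absolute value uses the pandas method instead):

```
import pandas as pd

df = pd.DataFrame({"col_a": [2.0, -3.0], "col_b": [5.0, 7.0]})

df["product"] = df.eval("col_a * col_b")   # 10.0, -21.0
df["total"] = df.eval("col_a + col_b")     # 7.0, 4.0
df["magnitude"] = df["col_a"].abs()        # 2.0, 3.0
```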

For more information, see the [Spark documentation](http://spark.apache.org/docs/latest/api/python) on selecting data. 

## Reduce Dimensionality within a Dataset
<a name="data-wrangler-transform-dimensionality-reduction"></a>

Reduce the dimensionality in your data by using Principal Component Analysis (PCA). The dimensionality of your dataset corresponds to the number of features. When you use dimensionality reduction in Data Wrangler, you get a new set of features called components. Each component accounts for some variability in the data.

The first component accounts for the largest amount of variation in the data. The second component accounts for the second largest amount of variation in the data, and so on.

You can use dimensionality reduction to reduce the size of the datasets that you use to train models. Instead of training on the original features in your dataset, you can train on the principal components.

To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns in your dataset. The first principal component is the value on the axis that has the largest amount of variance. The second principal component is the value on the axis that has the second largest amount of variance. The nth principal component is the value on the axis that has the nth largest amount of variance.

You can configure the number of principal components that Data Wrangler returns. You can either specify the number of principal components directly or you can specify the variance threshold percentage. Each principal component explains an amount of variance in the data. For example, you might have a principal component with a value of 0.5. The component would explain 50% of the variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the smallest number of components that meet the percentage that you specify.

The following are example principal components with the amount of variance that they explain in the data.
+ Component 1 – 0.5
+ Component 2 – 0.45
+ Component 3 – 0.05

If you specify a variance threshold percentage of `94` or `95`, Data Wrangler returns Component 1 and Component 2. If you specify a variance threshold percentage of `96`, Data Wrangler returns all three principal components.
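scikit-learn's `PCA` supports the same variance-threshold behavior, which makes the logic above easy to verify. This is a sketch with synthetic data; note that scikit-learn takes the threshold as a fraction rather than a percentage:

```
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Three independent features with standard deviations 10, 9, and 1, so the
# first two directions carry nearly all of the variance.
X = rng.normal(size=(500, 3)) * np.array([10.0, 9.0, 1.0])

# Keep the smallest number of components explaining at least 95% of variance.
pca = PCA(n_components=0.95).fit(X)
n_kept = pca.n_components_  # 2 for this data
```

Because the third feature contributes well under 5% of the variance, two components clear the threshold and the third is dropped.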

You can use the following procedure to run PCA on your dataset.

To run PCA on your dataset, do the following.

1. Open your Data Wrangler data flow.

1. Choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Dimensionality Reduction**.

1. For **Input Columns**, choose the features that you're reducing into the principal components.

1. (Optional) For **Number of principal components**, choose the number of principal components that Data Wrangler returns in your dataset. If you specify a value for the field, you can't specify a value for **Variance threshold percentage**.

1. (Optional) For **Variance threshold percentage**, specify the percentage of variation in the data that you want explained by the principal components. Data Wrangler uses the default value of `95` if you don't specify a value for the variance threshold. You can't specify a variance threshold percentage if you've specified a value for **Number of principal components**.

1. (Optional) Deselect **Center** to not use the mean of the columns as the center of the data. By default, Data Wrangler centers the data with the mean before scaling.

1. (Optional) Deselect **Scale** to not scale the data with the unit standard deviation.

1. (Optional) Choose **Columns** to output the components to separate columns. Choose **Vector** to output the components as a single vector.

1. (Optional) For **Output column**, specify a name for an output column. If you're outputting the components to separate columns, the name that you specify is a prefix. If you're outputting the components to a vector, the name that you specify is the name of the vector column.

1. (Optional) Select **Keep input columns**. We don't recommend selecting this option if you plan on only using the principal components to train your model.

1. Choose **Preview**.

1. Choose **Add**.

## Encode Categorical
<a name="data-wrangler-transform-cat-encode"></a>

Categorical data is usually composed of a finite number of categories, where each category is represented with a string. For example, if you have a table of customer data, a column that indicates the country a person lives in is categorical. The categories would be *Afghanistan*, *Albania*, *Algeria*, and so on. Categorical data can be *nominal* or *ordinal*. Ordinal categories have an inherent order, and nominal categories do not. The highest degree obtained (*High school*, *Bachelors*, *Masters*, and so on) is an example of ordinal categories. 

Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are *Dog* and *Cat*, you may encode this information into two vectors, `[1,0]` to represent *Dog*, and `[0,1]` to represent *Cat*.

When you encode ordinal categories, you may need to translate the natural order of categories into your encoding. For example, you can represent the highest degree obtained with the following map: `{"High school": 1, "Bachelors": 2, "Masters":3}`.

Use categorical encoding to encode categorical data that is in string format into arrays of integers. 
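
The two encodings described above can be sketched with pandas (an illustration under assumed column values, not Data Wrangler's implementation):

```python
# Ordinal and one-hot encoding of a categorical column.
import pandas as pd

df = pd.DataFrame({"degree": ["High school", "Masters", "Bachelors"]})

# Ordinal encode: translate the natural order of the categories.
order = {"High school": 1, "Bachelors": 2, "Masters": 3}
df["degree_ordinal"] = df["degree"].map(order)

# One-hot encode: one indicator column per category.
one_hot = pd.get_dummies(df["degree"], prefix="degree")

print(df["degree_ordinal"].tolist())   # [1, 3, 2]
print(sorted(one_hot.columns))
```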

The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the time the step is defined. If new categories have been added to a column when you start a Data Wrangler job to process your dataset at time *t*, and this column was the input for a Data Wrangler categorical encoding transform at time *t*-1, these new categories are considered *missing* in the Data Wrangler job. The option you select for **Invalid handling strategy** is applied to these missing values. Examples of when this can occur are: 
+ When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the creation of the data flow. For example, you may use a data flow to regularly process sales data each month. If that sales data is updated weekly, new categories may be introduced into columns for which an encode categorical step is defined. 
+ When you select **Sampling** when you import your dataset, some categories may be left out of the sample. 


You can choose from and configure an *ordinal encode*, a *one-hot encode*, or a *similarity encode*. Use the following sections to learn more about these options. 

The ordinal and one-hot transforms create a new column named **Output column name**. You specify the output format of this column with **Output style**:
+ Select **Vector** to produce a single column with a sparse vector. 
+ Select **Columns** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category.

### Ordinal Encode
<a name="data-wrangler-transform-cat-encode-ordinal"></a>

Select **Ordinal encode** to encode categories into an integer between 0 and the total number of categories in the **Input column** you select.

**Invalid handling strategy**: Select a method to handle invalid or missing values. 
+ Choose **Skip** if you want to omit the rows with missing values.
+ Choose **Keep** to retain missing values as the last category.
+ Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ Choose **Replace with NaN** to replace missing with NaN. This option is recommended if your ML algorithm can handle missing values. Otherwise, the first three options in this list may produce better results.

### One-Hot Encode
<a name="data-wrangler-transform-cat-encode-onehot"></a>

Select **One-hot encode** for **Transform** to use one-hot encoding. Configure this transform using the following: 
+ **Drop last category**: If `True`, the last category does not have a corresponding index in the one-hot encoding. When missing values are possible, a missing category is always the last one and setting this to `True` means that a missing value results in an all zero vector.
+ **Invalid handling strategy**: Select a method to handle invalid or missing values. 
  + Choose **Skip** if you want to omit the rows with missing values.
  + Choose **Keep** to retain missing values as the last category.
  + Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ **Is input ordinal encoded**: Select this option if the input vector contains ordinal encoded data. This option requires that input data contain non-negative integers. If **True**, input *i* is encoded as a vector with a non-zero in the *i*th location. 
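
The **Drop last category** behavior can be sketched with pandas (an illustration, not Data Wrangler's implementation): when the missing category is ordered last and dropped, a missing value encodes as an all-zero vector.

```python
# One-hot encode with the missing category last, then drop it.
import pandas as pd

col = pd.Series(["Dog", "Cat", None, "Dog"])

# Treat missing values as their own category, ordered last.
filled = col.fillna("__missing__")
categories = ["Cat", "Dog", "__missing__"]   # missing is the last category
dummies = pd.get_dummies(
    pd.Categorical(filled, categories=categories), dtype=int)

# Drop the last category: the missing row becomes an all-zero vector.
dropped = dummies.iloc[:, :-1]
print(dropped.values.tolist())   # [[0, 1], [1, 0], [0, 0], [0, 1]]
```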

### Similarity encode
<a name="data-wrangler-transform-cat-encode-similarity"></a>

Use similarity encoding when you have the following:
+ A large number of categorical variables
+ Noisy data

The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia".

Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It converts the tokens into an embedding using min-hash encoding.
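
A toy sketch of why 3-gram tokenization makes misspellings encode similarly (this uses plain character 3-grams and Jaccard similarity for illustration; Data Wrangler's min-hash encoder is more involved):

```python
# Compare strings by the overlap of their character 3-grams.
def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# "California" and "Calfornia" share most of their 3-grams.
sim = jaccard(trigrams("California"), trigrams("Calfornia"))
dis = jaccard(trigrams("California"), trigrams("Texas"))
print(round(sim, 2), round(dis, 2))   # 0.5 0.0
```

The misspelled string overlaps heavily with the correct one, while an unrelated string shares no 3-grams at all, so the resulting encodings are close for the former and distant for the latter.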

The following example shows how the similarity encoder creates vectors from strings.

![\[Example on using ENCODE CATEGORICAL for a table in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/similarity-encode-example-screenshot-0.png)


![\[Example vector representation of a variable found in a table in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/similarity-encode-example-screenshot-1.png)


The similarity encodings that Data Wrangler creates:
+ Have low dimensionality
+ Are scalable to a large number of categories
+ Are robust and resistant to noise

For the preceding reasons, similarity encoding is more versatile than one-hot encoding.

To add the similarity encoding transform to your dataset, use the following procedure.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker/).

1. Choose **Open Studio Classic**.

1. Choose **Launch app**.

1. Choose **Studio**.

1. Specify your data flow.

1. Choose a step with a transformation.

1. Choose **Add step**.

1. Choose **Encode categorical**.

1. Specify the following:
   + **Transform** – **Similarity encode**
   + **Input column** – The column containing the categorical data that you're encoding.
   + **Target dimension** – (Optional) The dimension of the categorical embedding vector. The default value is 30. We recommend using a larger target dimension if you have a large dataset with many categories.
   + **Output style** – Choose **Vector** for a single vector with all of the encoded values. Choose **Column** to have the encoded values in separate columns.
   + **Output column** – (Optional) The name of the output column for a vector-encoded output. For a column-encoded output, the name that you specify is a prefix followed by a number.

## Featurize Text
<a name="data-wrangler-transform-featurize-text"></a>

Use the **Featurize Text** transform group to inspect string-typed columns and use text embedding to featurize these columns. 

This feature group contains two features, *Character statistics* and *Vectorize*. Use the following sections to learn more about these transforms. For both options, the **Input column** must contain text data (string type).

### Character Statistics
<a name="data-wrangler-transform-featurize-text-character-stats"></a>

Use **Character statistics** to generate statistics for each row in a column containing text data. 

This transform computes the following ratios and counts for each row, and creates a new column to report the result. The new column is named using the input column name as a prefix and a suffix that is specific to the ratio or count. 
+ **Number of words**: The total number of words in that row. The suffix for this output column is `-stats_word_count`.
+ **Number of characters**: The total number of characters in that row. The suffix for this output column is `-stats_char_count`.
+ **Ratio of upper**: The number of uppercase characters, from A to Z, divided by all characters in the column. The suffix for this output column is `-stats_capital_ratio`.
+ **Ratio of lower**: The number of lowercase characters, from a to z, divided by all characters in the column. The suffix for this output column is `-stats_lower_ratio`.
+ **Ratio of digits**: The ratio of digits in a single row over the sum of digits in the input column. The suffix for this output column is `-stats_digit_ratio`.
+ **Special characters ratio**: The ratio of non-alphanumeric characters (such as #$&%:@) over the sum of all characters in the input column. The suffix for this output column is `-stats_special_ratio`.
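
A pandas sketch of a few of these per-row statistics, using the documented column-name suffixes (an illustration, not Data Wrangler's implementation; the `review` column is made up):

```python
# Compute word count, character count, and uppercase ratio per row.
import pandas as pd

df = pd.DataFrame({"review": ["Great Product", "ok"]})
s = df["review"]

df["review-stats_word_count"] = s.str.split().str.len()
df["review-stats_char_count"] = s.str.len()
df["review-stats_capital_ratio"] = s.str.count(r"[A-Z]") / s.str.len()

print(df["review-stats_word_count"].tolist())   # [2, 1]
```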

### Vectorize
<a name="data-wrangler-transform-featurize-text-vectorize"></a>

Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–inverse document frequency (TF-IDF) vectors. 

When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real number that represents its semantic importance. Higher numbers are associated with less frequent words, which tend to be more meaningful. 

When you define a **Vectorize** transform step, Data Wrangler uses the data in your dataset to define the count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.

You configure this transform using the following: 
+ **Output column name**: This transform creates a new column with the text embedding. Use this field to specify a name for this output column. 
+ **Tokenizer**: A tokenizer converts the sentence into a list of words, or *tokens*. 

  Choose **Standard** to use a tokenizer that splits by white space and converts each word to lowercase. For example, `"Good dog"` is tokenized to `["good","dog"]`.

  Choose **Custom** to use a customized tokenizer. If you choose **Custom**, you can use the following fields to configure the tokenizer:
  + **Minimum token length**: The minimum length, in characters, for a token to be valid. Defaults to `1`. For example, if you specify `3` for minimum token length, words like `a, at, in` are dropped from the tokenized sentence. 
  + **Should regex split on gaps**: If selected, **regex** splits on gaps. Otherwise, it matches tokens. Defaults to `True`. 
  + **Regex pattern**: Regex pattern that defines the tokenization process. Defaults to `'\\s+'`.
  + **To lowercase**: If chosen, Data Wrangler converts all characters to lowercase before tokenization. Defaults to `True`.

  To learn more, see the Spark documentation on [Tokenizer](https://spark.apache.org/docs/latest/ml-features#tokenizer).
+ **Vectorizer**: The vectorizer converts the list of tokens into a sparse numeric vector. Each token corresponds to an index in the vector and a non-zero indicates the existence of the token in the input sentence. You can choose from two vectorizer options, *Count* and *Hashing*.
  + **Count vectorize** allows customizations that filter infrequent or too common tokens. **Count vectorize parameters** include the following: 
    + **Minimum term frequency**: In each row, terms (tokens) with smaller frequency are filtered. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `1`.
    + **Minimum document frequency**: Minimum number of rows in which a term (token) must appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total number of rows. Defaults to `1`.
    + **Maximum document frequency**: Maximum number of documents (rows) in which a term (token) can appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total number of rows. Defaults to `0.999`.
    + **Maximum vocabulary size**: Maximum size of the vocabulary. The vocabulary is made up of all terms (tokens) in all rows of the column. Defaults to `262144`.
    + **Binary outputs**: If selected, the vector outputs do not include the number of appearances of a term in a document, but rather are a binary indicator of its appearance. Defaults to `False`.

    To learn more about this option, see the Spark documentation on [CountVectorizer](https://spark.apache.org/docs/latest/ml-features#countvectorizer).
  + **Hashing** is computationally faster. **Hash vectorize parameters** include the following:
    + **Number of features during hashing**: A hash vectorizer maps tokens to a vector index according to their hash value. This feature determines the number of possible hash values. Large values result in fewer collisions between hash values but a higher dimension output vector.

    To learn more about this option, see the Spark documentation on [FeatureHasher](https://spark.apache.org/docs/latest/ml-features#featurehasher)
+ **Apply IDF** applies an IDF transformation, which multiplies the term frequency with the standard inverse document frequency used for TF-IDF embedding. **IDF parameters** include the following: 
  + **Minimum document frequency**: Minimum number of documents (rows) in which a term (token) must appear to be included. If **count_vectorize** is the chosen vectorizer, we recommend that you keep the default value and only modify the **min_doc_freq** field in **Count vectorize parameters**. Defaults to `5`.
+ **Output format**: The output format of each row. 
  + Select **Vector** to produce a single column with a sparse vector. 
  + Select **Flattened** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category. You can only choose flattened when **Vectorizer** is set as **Count vectorizer**.
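
The count-vectorize-then-IDF pipeline described above can be sketched with scikit-learn (Data Wrangler itself uses the Spark equivalents linked above; the sentences and parameter values are illustrative):

```python
# Count vectorization followed by IDF weighting (TF-IDF).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["good dog", "bad dog", "good cat"]

# Tokenize and count terms, dropping terms that appear in too few or
# too many documents (analogous to the min/max document frequency fields).
counts = CountVectorizer(min_df=1, max_df=0.999).fit_transform(docs)

# Apply IDF: terms that appear in fewer documents get higher weights.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)   # (3, 4): 3 rows, vocabulary of 4 terms
```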

## Transform Time Series
<a name="data-wrangler-transform-time-series"></a>

In Data Wrangler, you can transform time series data. The values in a time series dataset are indexed to specific time. For example, a dataset that shows the number of customers in a store for each hour in a day is a time series dataset. The following table shows an example of a time series dataset.

Hourly number of customers in a store


| Number of customers | Time (hour) | 
| --- | --- | 
| 4 | 09:00 | 
| 10 | 10:00 | 
| 14 | 11:00 | 
| 25 | 12:00 | 
| 20 | 13:00 | 
| 18 | 14:00 | 

For the preceding table, the **Number of Customers** column contains the time series data. The time series data is indexed on the hourly data in the **Time (hour)** column.

You might need to perform a series of transformations on your data to get it in a format that you can use for your analysis. Use the **Time series** transform group to transform your time series data. For more information about the transformations that you can perform, see the following sections.

**Topics**
+ [Group by a Time Series](#data-wrangler-group-by-time-series)
+ [Resample Time Series Data](#data-wrangler-resample-time-series)
+ [Handle Missing Time Series Data](#data-wrangler-transform-handle-missing-time-series)
+ [Validate the Timestamp of Your Time Series Data](#data-wrangler-transform-validate-timestamp)
+ [Standardizing the Length of the Time Series](#data-wrangler-transform-standardize-length)
+ [Extract Features from Your Time Series Data](#data-wrangler-transform-extract-time-series-features)
+ [Use Lagged Features from Your Time Series Data](#data-wrangler-transform-lag-time-series)
+ [Create a Datetime Range In Your Time Series](#data-wrangler-transform-datetime-range)
+ [Use a Rolling Window In Your Time Series](#data-wrangler-transform-rolling-window)

### Group by a Time Series
<a name="data-wrangler-group-by-time-series"></a>

You can use the group by operation to group time series data for specific values in a column.

For example, you have the following table that tracks the average daily electricity usage in a household.

Average daily household electricity usage


| Household ID | Daily timestamp | Electricity usage (kWh) | Number of household occupants | 
| --- | --- | --- | --- | 
| household_0 | 1/1/2020 | 30 | 2 | 
| household_0 | 1/2/2020 | 40 | 2 | 
| household_0 | 1/4/2020 | 35 | 3 | 
| household_1 | 1/2/2020 | 45 | 3 | 
| household_1 | 1/3/2020 | 55 | 4 | 

If you choose to group by ID, you get the following table.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | 
| --- | --- | --- | 
| household_0 | [30, 40, 35] | [2, 2, 3] | 
| household_1 | [45, 55] | [3, 4] | 

Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of the sequence corresponds to the first timestamp of the series. For `household_0`, `30` is the first value of the **Electricity Usage Series**. The value of `30` corresponds to the first timestamp of `1/1/2020`.

You can include the starting timestamp and ending timestamp. The following table shows how that information appears.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | Start_time | End_time | 
| --- | --- | --- | --- | --- | 
| household_0 | [30, 40, 35] | [2, 2, 3] | 1/1/2020 | 1/4/2020 | 
| household_1 | [45, 55] | [3, 4] | 1/2/2020 | 1/3/2020 | 
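
The grouping in the preceding tables can be sketched with pandas (an illustration of the operation, not Data Wrangler's implementation):

```python
# Group row-per-observation data into one series per household,
# with start and end timestamps.
import pandas as pd

df = pd.DataFrame({
    "household_id": ["household_0"] * 3 + ["household_1"] * 2,
    "timestamp": pd.to_datetime(
        ["1/1/2020", "1/2/2020", "1/4/2020", "1/2/2020", "1/3/2020"]),
    "usage_kwh": [30, 40, 35, 45, 55],
})

grouped = (df.sort_values("timestamp")
             .groupby("household_id")
             .agg(usage_series=("usage_kwh", list),
                  start_time=("timestamp", "min"),
                  end_time=("timestamp", "max")))
print(grouped["usage_series"].to_dict())
```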

You can use the following procedure to group by a time series column. 

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Time Series**.

1. Under **Transform**, choose **Group by**.

1. Specify a column in **Group by this column**.

1. For **Apply to columns**, specify a value.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Resample Time Series Data
<a name="data-wrangler-resample-time-series"></a>

Time series data usually has observations that aren't taken at regular intervals. For example, a dataset could have some observations that are recorded hourly and other observations that are recorded every two hours.

Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals. Resampling gives you the ability to establish regular intervals for the observations in your dataset.

You can either upsample or downsample a time series. Downsampling increases the interval between observations in the dataset. For example, if you downsample observations that are taken either every hour or every two hours, each observation in your dataset is taken every two hours. The hourly observations are aggregated into a single value using an aggregation method such as the mean or median.

Upsampling reduces the interval between observations in the dataset. For example, if you upsample observations that are taken every two hours into hourly observations, you can use an interpolation method to infer hourly observations from the ones that have been taken every two hours. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

You can resample both numeric and non-numeric data.

Use the **Resample** operation to resample your time series data. If you have multiple time series in your dataset, Data Wrangler standardizes the time interval for each time series.

The following table shows an example of downsampling time series data by using the mean as the aggregation method. The data is downsampled from every hour to every two hours.

Hourly temperature readings over a day before downsampling


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 1:00 | 32 | 
| 2:00 | 35 | 
| 3:00 | 32 | 
| 4:00 | 30 | 

Temperature readings downsampled to every two hours


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 2:00 | 33.5 | 
| 4:00 | 35 | 
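
Down- and upsampling can be sketched with pandas (an illustration of the idea, not Data Wrangler's implementation; the hourly readings below are made up and the bin edges differ from the preceding table):

```python
# Downsample with an aggregation, upsample with interpolation.
import pandas as pd

idx = pd.date_range("2020-01-01 12:00", periods=5, freq="h")
temps = pd.Series([30.0, 32.0, 35.0, 32.0, 30.0], index=idx)

# Downsample to two-hour intervals, aggregating with the mean.
down = temps.resample("2h").mean()

# Upsample to 30-minute intervals, inferring values by interpolation.
up = temps.resample("30min").asfreq().interpolate(method="linear")
print(down.tolist())   # [31.0, 33.5, 30.0]
```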

You can use the following procedure to resample time series data.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Resample**.

1. For **Timestamp**, choose the timestamp column.

1. For **Frequency unit**, specify the frequency that you're resampling.

1. (Optional) Specify a value for **Frequency quantity**.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Handle Missing Time Series Data
<a name="data-wrangler-transform-handle-missing-time-series"></a>

If you have missing values in your dataset, you can do one of the following:
+ For datasets that have multiple time series, drop the time series whose number of missing values exceeds a threshold that you specify.
+ Impute the missing values in a time series by using other values in the time series.

Imputing a missing value involves replacing the data by either specifying a value or by using an inferential method. The following are the methods that you can use for imputation:
+ Constant value – Replace all the missing data in your dataset with a value that you specify.
+ Most common value – Replace all the missing data with the value that has the highest frequency in the dataset.
+ Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
+ Backward fill – Use a backward fill to replace the missing values with the non-missing value that follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8]. 
+ Interpolate – Uses an interpolation function to impute the missing values. For more information on the functions that you can use for interpolation, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).
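
The forward- and backward-fill examples above can be reproduced with pandas (a sketch of the methods, not Data Wrangler's implementation):

```python
# Forward fill and backward fill of a series with missing values.
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 7, np.nan, np.nan, np.nan, 8])

forward = s.ffill()    # [2, 4, 7, 7, 7, 7, 8]
backward = s.bfill()   # [2, 4, 7, 8, 8, 8, 8]
print(forward.tolist())
```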

Some of the imputation methods might not be able to impute all of the missing values in your dataset. For example, a **Forward fill** can't impute a missing value that appears at the beginning of the time series. In that case, you can impute the value by using a **Backward fill** instead.

You can either impute missing values within a cell or within a column.

The following example shows how values are imputed within a cell.

Electricity usage with missing values


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household\$10 | [30, 40, 35, NaN, NaN] | 
| household\$11 | [45, NaN, 55] | 

Electricity usage with values imputed using a forward fill


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household\$10 | [30, 40, 35, 35, 35] | 
| household\$11 | [45, 45, 55] | 

The following example shows how values are imputed within a column.

Average daily household electricity usage with missing values


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household\$10 | 30 | 
| household\$10 | 40 | 
| household\$10 | NaN | 
| household\$11 | NaN | 
| household\$11 | NaN | 

Average daily household electricity usage with values imputed using a forward fill


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household\$10 | 30 | 
| household\$10 | 40 | 
| household\$10 | 40 | 
| household\$11 | 40 | 
| household\$11 | 40 | 

You can use the following procedure to handle missing values.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Handle missing**.

1. For **Time series input type**, choose whether you want to handle missing values inside of a cell or along a column.

1. For **Impute missing values for this column**, specify the column that has the missing values.

1. For **Method for imputing values**, select a method.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Validate the Timestamp of Your Time Series Data
<a name="data-wrangler-transform-validate-timestamp"></a>

You might have timestamp data that is invalid. You can use the **Validate timestamps** transform to determine whether the timestamps in your dataset are valid. Your timestamps can be invalid for one or more of the following reasons:
+ Your timestamp column has missing values.
+ The values in your timestamp column are not formatted correctly.

If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use Data Wrangler to identify invalid timestamps and understand where you need to clean your data.

You can configure Data Wrangler to do one of the following if it encounters missing or invalid values in your dataset:
+ Drop the rows that have the missing or invalid values.
+ Identify the rows that have the missing or invalid values.
+ Throw an error if it finds any missing or invalid values in your dataset.

You can validate the timestamps on columns that either have the `timestamp` type or the `string` type. If the column has the `string` type, Data Wrangler converts the type of the column to `timestamp` and performs the validation.
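
The drop and identify policies can be sketched with pandas (an illustration, not Data Wrangler's implementation; the sample values are made up):

```python
# Identify rows whose timestamp is missing or can't be parsed.
import pandas as pd

df = pd.DataFrame({"timestamp": ["2020-01-01", "not-a-date", None]})

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(df["timestamp"], errors="coerce")
invalid = parsed.isna()

dropped = df[~invalid]                      # the "drop" policy
flagged = df.assign(is_invalid=invalid)     # the "identify" policy
print(invalid.tolist())   # [False, True, True]
```

Raising an exception when `invalid.any()` is true would correspond to the error policy.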

You can use the following procedure to validate the timestamps in your dataset.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Validate timestamps**.

1. For **Timestamp Column**, choose the timestamp column.

1. For **Policy**, choose how you want to handle missing or invalid timestamps.

1. (Optional) For **Output column**, specify a name for the output column.

1. If the timestamp column has the string type, choose **Cast to datetime**.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Standardizing the Length of the Time Series
<a name="data-wrangler-transform-standardize-length"></a>

If you have time series data stored as arrays, you can standardize each time series to the same length. Standardizing the length of the time series array might make it easier for you to perform your analysis on the data.

You can standardize your time series for data transformations that require the length of your data to be fixed.

Many ML algorithms require you to flatten your time series data before you use them. Flattening time series data separates each value of the time series into its own column in a dataset. The number of columns in a dataset can't change, so the lengths of the time series need to be standardized before you flatten each array into a set of features.

Each time series is set to the length that you specify as a quantile or percentile of the time series set. For example, you can have three sequences that have the following lengths:
+ 3
+ 4
+ 5

You can set the length of all of the sequences as the length of the sequence that has the 50th percentile length.

Time series arrays that are shorter than the length that you specify are padded with missing values. For example, standardizing the time series [2, 4, 5] to a length of 6 produces [2, 4, 5, NaN, NaN, NaN].

You can use different approaches to handle the missing values. For information on those approaches, see [Handle Missing Time Series Data](#data-wrangler-transform-handle-missing-time-series).

The time series arrays that are longer than the length that you specify are truncated.
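
The pad-or-truncate logic can be sketched with numpy (an assumed approach for illustration, not Data Wrangler's code), using the three sequence lengths from the example above:

```python
# Standardize each series to the 50th-percentile length.
import numpy as np

series = [[1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]]

# The 50th percentile of the lengths [3, 4, 5] is 4.
target = int(np.percentile([len(s) for s in series], 50))

# Truncate longer series; pad shorter ones with missing values.
standardized = [s[:target] + [np.nan] * max(0, target - len(s))
                for s in series]
print(standardized)
```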

You can use the following procedure to standardize the length of the time series.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Standardize length**.

1. For **Standardize the time series length for the column**, choose a column.

1. (Optional) For **Output column**, specify a name for the output column. If you don't specify a name, the transform is done in place.

1. If the datetime column is formatted for the string type, choose **Cast to datetime**.

1. Choose **Cutoff quantile** and specify a quantile to set the length of the sequence.

1. Choose **Flatten the output** to output the values of the time series into separate columns.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Extract Features from Your Time Series Data
<a name="data-wrangler-transform-extract-time-series-features"></a>

If you're running a classification or a regression algorithm on your time series data, we recommend extracting features from the time series before running the algorithm. Extracting features might improve the performance of your algorithm.

Use the following options to choose how you want to extract features from your data:
+ Use **Minimal subset** to specify extracting 8 features that you know are useful in downstream analyses. You can use a minimal subset when you need to perform computations quickly. You can also use it when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
+ Use **Efficient subset** to specify extracting the most features possible without extracting features that are computationally intensive in your analyses.
+ Use **All features** to specify extracting all features from the time series.
+ Use **Manual subset** to choose a list of features that you think explain the variation in your data well.
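
As an illustration of what a "manual subset" of time series features might look like, the following sketch computes a few simple summary features with pandas. The feature choice here is an assumption, not Data Wrangler's feature list.

```python
import pandas as pd

# Illustrative sketch: extract a small, hand-picked set of summary features
# from one time series (the feature names are example choices).
series = pd.Series([3.0, 4.0, 5.0, 9.0])

features = {
    "mean": series.mean(),
    "std": series.std(),
    "min": series.min(),
    "max": series.max(),
}
```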

Use the following procedure to extract features from your time series data.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Extract features**.

1. For **Extract features for this column**, choose a column.

1. (Optional) Select **Flatten** to output the features into separate columns.

1. For **Strategy**, choose a strategy to extract the features.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use Lagged Features from Your Time Series Data
<a name="data-wrangler-transform-lag-time-series"></a>

For many use cases, the best way to predict the future behavior of your time series is to use its most recent behavior.

The most common uses of lagged features are the following:
+ Collecting a handful of past values. For example, for time t + 1, you collect the values at t, t - 1, t - 2, and t - 3.
+ Collecting values that correspond to seasonal behavior in the data. For example, to predict the occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as predictive as using the features from previous days.
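
The lag pattern described above can be sketched with pandas `shift`. This is an illustrative sketch; the column names are assumptions.

```python
import pandas as pd

# Illustrative sketch: build lagged features t-1, t-2, t-3 for a value column.
df = pd.DataFrame({"customers": [10, 14, 24, 40, 30, 20]})

for lag in (1, 2, 3):
    df[f"customers_lag_{lag}"] = df["customers"].shift(lag)

# Optionally drop rows without full history, mirroring "Drop rows without history".
df_complete = df.dropna()
```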

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Lag features**.

1. For **Generate lag features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. For **Lag**, specify the duration of the lag.

1. (Optional) Configure the output using one of the following options:
   + **Include the entire lag window**
   + **Flatten the output**
   + **Drop rows without history**

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Create a Datetime Range In Your Time Series
<a name="data-wrangler-transform-datetime-range"></a>

You might have time series data that don't have timestamps. If you know that the observations were taken at regular intervals, you can generate timestamps for the time series in a separate column. To generate timestamps, you specify the value for the start timestamp and the frequency of the timestamps.

For example, you might have the following time series data for the number of customers at a restaurant.

Time series data on the number of customers at a restaurant


| Number of customers | 
| --- | 
| 10 | 
| 14 | 
| 24 | 
| 40 | 
| 30 | 
| 20 | 

If you know that the restaurant opened at 5:00 PM and that the observations are taken hourly, you can add a timestamp column that corresponds to the time series data. You can see the timestamp column in the following table.

Time series data on the number of customers at a restaurant, with timestamps added


| Number of customers | Timestamp | 
| --- | --- | 
| 10 | 5:00 PM | 
| 14 | 6:00 PM | 
| 24 | 7:00 PM | 
| 40 | 8:00 PM | 
| 30 | 9:00 PM | 
| 20 | 10:00 PM | 
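
The timestamp generation above can be sketched with `pd.date_range`. The calendar date below is an assumption; the example only specifies a 5:00 PM start and an hourly frequency.

```python
import pandas as pd

# Illustrative sketch: generate hourly timestamps for six observations,
# starting when the restaurant opens at 5:00 PM (the date is an assumption).
customers = pd.DataFrame({"number_of_customers": [10, 14, 24, 40, 30, 20]})
customers["timestamp"] = pd.date_range(
    start="2024-01-01 17:00", periods=len(customers), freq="h"
)
```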

Use the following procedure to add a datetime range to your data.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Datetime range**.

1. For **Frequency type**, choose the unit used to measure the frequency of the timestamps.

1. For **Starting timestamp**, specify the start timestamp.

1. For **Output column**, specify a name for the output column.

1. (Optional) Configure the output using the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use a Rolling Window In Your Time Series
<a name="data-wrangler-transform-rolling-window"></a>

You can extract features over a time period. For example, for time, *t*, and a time window length of 3, and for the row that indicates the *t*th timestamp, we append the features that are extracted from the time series at times *t* - 3, *t* -2, and *t* - 1. For information on extracting features, see [Extract Features from Your Time Series Data](#data-wrangler-transform-extract-time-series-features). 
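
The rolling-window idea can be sketched with pandas `rolling`. This is an illustrative sketch (here the extracted feature is a simple mean; the column names are assumptions).

```python
import pandas as pd

# Illustrative sketch: extract a rolling-window feature (the mean of the
# previous 3 values) for each timestamp.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# shift(1) so the window at time t covers t-3, t-2, t-1 and excludes t itself.
df["rolling_mean_3"] = df["value"].shift(1).rolling(window=3).mean()
```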

You can use the following procedure to extract features over a time period.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Rolling window features**.

1. For **Generate rolling window features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. (Optional) For **Output Column**, specify the name of the output column.

1. For **Window size**, specify the window size.

1. For **Strategy**, choose the extraction strategy.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Featurize Datetime
<a name="data-wrangler-transform-datetime-embed"></a>

Use **Featurize date/time** to create a vector embedding representing a datetime field. To use this transform, your datetime data must be in one of the following formats: 
+ Strings describing datetime: For example, `"January 1st, 2020, 12:44pm"`. 
+ A Unix timestamp: A Unix timestamp describes the number of seconds, milliseconds, microseconds, or nanoseconds from 1/1/1970. 

You can choose to **Infer datetime format** and provide a **Datetime format**. If you provide a datetime format, you must use the codes described in the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The options you select for these two configurations have implications for the speed of the operation and the final results.
+ The most manual and computationally fastest option is to specify a **Datetime format** and select **No** for **Infer datetime format**.
+ To reduce manual labor, you can choose **Infer datetime format** and not specify a datetime format. It is also a computationally fast operation; however, the first datetime format encountered in the input column is assumed to be the format for the entire column. If there are other formats in the column, these values are NaN in the final output. Inferring the datetime format can give you unparsed strings. 
+ If you don't specify a format and select **No** for **Infer datetime format**, you get the most robust results. All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower than the first two options in this list. 

When you use this transform, you specify an **Input column** which contains datetime data in one of the formats listed above. The transform creates an output column named **Output column name**. The format of the output column depends on your configuration using the following:
+ **Vector**: Outputs a single column as a vector. 
+ **Columns**: Creates a new column for every feature. For example, if the output contains a year, month, and day, three separate columns are created for year, month, and day. 

Additionally, you must choose an **Embedding mode**. For linear models and deep networks, we recommend choosing **cyclic**. For tree-based algorithms, we recommend choosing **ordinal**.
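
To see why a cyclic embedding helps linear models, consider mapping the hour of day onto a circle with sine and cosine, so that 11:00 PM and midnight end up close together. This sketch illustrates the general idea, not Data Wrangler's exact encoding.

```python
import math

# Illustrative sketch of a cyclic embedding for the hour of day.
def cyclic_hour(hour):
    """Map an hour of day onto the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# Hours 23 and 0 are adjacent on the circle even though 23 - 0 is large.
```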

## Format String
<a name="data-wrangler-transform-format-string"></a>

The **Format string** transforms contain standard string formatting operations. For example, you can use these operations to remove special characters, normalize string lengths, and update string casing.

This feature group contains the following transforms. All transforms return copies of the strings in the **Input column** and add the result to a new, output column.


| Name | Function | 
| --- | --- | 
| Left pad |  Left-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Right pad |  Right-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Center (pad on either side) |  Center-pad the string (add padding on both sides of the string) with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Prepend zeros |  Left-fill a numeric string with zeros, up to a given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Strip left and right |  Returns a copy of the string with the leading and trailing characters removed.  | 
| Strip characters from left |  Returns a copy of the string with leading characters removed.  | 
| Strip characters from right |  Returns a copy of the string with trailing characters removed.  | 
| Lower case |  Convert all letters in text to lowercase.  | 
| Upper case |  Convert all letters in text to uppercase.  | 
| Capitalize |  Capitalize the first letter in each sentence.   | 
| Swap case | Converts all uppercase characters to lowercase and all lowercase characters to uppercase characters of the given string, and returns it. | 
| Add prefix or suffix |  Adds a prefix and a suffix to the string column. You must specify at least one of **Prefix** and **Suffix**.   | 
| Remove symbols |  Removes given symbols from a string. All listed characters are removed. Defaults to white space.   | 
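
Several of these transforms correspond directly to standard Python string methods, which this sketch illustrates (the inputs are example values).

```python
# Illustrative sketch: Python string methods that mirror some of the
# Format string transforms above.
s = "  Data Wrangler  "

padded = "42".rjust(5, "0")       # prepend zeros to width 5
centered = "ab".center(6, "*")    # center pad to width 6
stripped = s.strip()              # strip left and right
swapped = "AbC".swapcase()        # swap case
prefixed = "col_" + stripped      # add prefix
```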

## Handle Outliers
<a name="data-wrangler-transform-handle-outlier"></a>

Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. Use this feature group to detect and update outliers in your dataset. 

When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. 

Use the following sections to learn more about the transforms this group contains. You specify an **Output name** and each of these transforms produces an output column with the resulting data. 

### Robust standard deviation numeric outliers
<a name="data-wrangler-transform-handle-outlier-rstdev"></a>

This transform detects and fixes outliers in numeric features using statistics that are robust to outliers.

You must define an **Upper quantile** and a **Lower quantile** for the statistics used to calculate outliers. You must also specify the number of **Standard deviations** from which a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.

### Standard Deviation Numeric Outliers
<a name="data-wrangler-transform-handle-outlier-sstdev"></a>

This transform detects and fixes outliers in numeric features using the mean and standard deviation.

You specify the number of **Standard deviations** a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.
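
The **Clip** fix method for standard deviation outliers can be sketched in pandas as follows. This is an illustrative sketch of the statistics involved, not Data Wrangler's implementation; the data is an example.

```python
import pandas as pd

# Illustrative sketch of the "Clip" fix method: values more than 3 standard
# deviations from the mean are clipped to the detection bounds.
values = pd.Series([1.0, 2.0, 3.0] * 6 + [100.0])

mean, std = values.mean(), values.std()
lower, upper = mean - 3 * std, mean + 3 * std
clipped = values.clip(lower=lower, upper=upper)
# The extreme value 100.0 is pulled back to the upper bound.
```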

### Quantile Numeric Outliers
<a name="data-wrangler-transform-handle-outlier-quantile-numeric"></a>

Use this transform to detect and fix outliers in numeric features using quantiles. You can define an **Upper quantile** and a **Lower quantile**. All values that fall above the upper quantile or below the lower quantile are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Min-Max Numeric Outliers
<a name="data-wrangler-transform-handle-outlier-minmax-numeric"></a>

This transform detects and fixes outliers in numeric features using upper and lower thresholds. Use this method if you know the threshold values that demarcate outliers.

You specify an **Upper threshold** and a **Lower threshold**, and if values fall above or below those thresholds respectively, they are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Replace Rare
<a name="data-wrangler-transform-handle-outlier-replace-rare"></a>

When you use the **Replace rare** transform, you specify a threshold and Data Wrangler finds all values that meet that threshold and replaces them with a string that you specify. For example, you may want to use this transform to categorize all outliers in a column into an "Others" category. 
+ **Replacement string**: The string with which to replace outliers.
+ **Absolute threshold**: A category is rare if the number of instances is less than or equal to this absolute threshold.
+ **Fraction threshold**: A category is rare if the number of instances is less than or equal to this fraction threshold multiplied by the number of rows.
+ **Max common categories**: Maximum not-rare categories that remain after the operation. If the threshold does not filter enough categories, those with the top number of appearances are classified as not rare. If set to 0 (default), there is no hard limit to the number of categories.
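
The absolute-threshold behavior can be sketched with pandas `value_counts`. This is an illustrative sketch; the data and threshold are example values.

```python
import pandas as pd

# Illustrative sketch of "Replace rare": categories appearing at most
# `absolute_threshold` times are replaced with a catch-all string.
s = pd.Series(["a", "a", "a", "b", "b", "c"])
absolute_threshold = 1
replacement = "Others"

counts = s.value_counts()
rare = counts[counts <= absolute_threshold].index
result = s.where(~s.isin(rare), replacement)
```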

## Handle Missing Values
<a name="data-wrangler-transform-handle-missing"></a>

Missing values are a common occurrence in machine learning datasets. In some situations, it is appropriate to impute missing data with a calculated value, such as an average or categorically common value. You can process missing values using the **Handle missing values** transform group. This group contains the following transforms. 

### Fill Missing
<a name="data-wrangler-transform-fill-missing"></a>

Use the **Fill missing** transform to replace missing values with a **Fill value** you define. 

### Impute Missing
<a name="data-wrangler-transform-impute"></a>

Use the **Impute missing** transform to create a new column that contains imputed values where missing values were found in input categorical and numerical data. The configuration depends on your data type.

For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute. You can choose to impute the mean or the median over the values that are present in your dataset. Data Wrangler uses the value that it computes to impute the missing values.

For categorical data, Data Wrangler imputes missing values using the most frequent value in the column. To impute a custom string, use the **Fill missing** transform instead.
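
The two imputation behaviors described above can be sketched in pandas: the mean for a numeric column, and the most frequent value for a categorical column. The data and column names are example values.

```python
import pandas as pd

# Illustrative sketch of imputation: mean for numeric data, most frequent
# value for categorical data.
df = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0],
    "color": ["red", "red", None, "blue"],
})

df["age_imputed"] = df["age"].fillna(df["age"].mean())
df["color_imputed"] = df["color"].fillna(df["color"].mode()[0])
```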

### Add Indicator for Missing
<a name="data-wrangler-transform-missing-add-indicator"></a>

Use the **Add indicator for missing** transform to create a new indicator column, which contains a Boolean `"false"` if a row contains a value, and `"true"` if a row contains a missing value. 

### Drop Missing
<a name="data-wrangler-transform-drop-missing"></a>

Use the **Drop missing** option to drop rows that contain missing values from the **Input column**.

## Manage Columns
<a name="data-wrangler-manage-columns"></a>

You can use the following transforms to quickly update and manage columns in your dataset: 


| Name | Function | 
| --- | --- | 
| Drop Column | Delete a column.  | 
| Duplicate Column | Duplicate a column. | 
| Rename Column | Rename a column. | 
| Move Column |  Move a column's location in the dataset. Choose to move your column to the start or end of the dataset, before or after a reference column, or to a specific index.   | 

## Manage Rows
<a name="data-wrangler-transform-manage-rows"></a>

Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the following:
+ **Sort**: Sort the entire dataframe by a given column. Select the check box next to **Ascending order** for this option; otherwise, deselect the check box and descending order is used for the sort. 
+ **Shuffle**: Randomly shuffle all rows in the dataset. 

## Manage Vectors
<a name="data-wrangler-transform-manage-vectors"></a>

Use this transform group to combine or flatten vector columns. This group contains the following transforms. 
+ **Assemble**: Use this transform to combine Spark vectors and numeric data into a single column. For example, you can combine three columns: two containing numeric data and one containing vectors. Add all the columns you want to combine in **Input columns** and specify an **Output column name** for the combined data. 
+ **Flatten**: Use this transform to flatten a single column containing vector data. The input column must contain PySpark vectors or array-like objects. You can control the number of columns created by specifying a **Method to detect number of outputs**. For example, if you select **Length of first vector**, the number of elements in the first valid vector or array found in the column determines the number of output columns that are created. All other input vectors with too many items are truncated. Inputs with too few items are filled with NaNs.

  You also specify an **Output prefix**, which is used as the prefix for each output column. 
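
The **Length of first vector** truncate-or-fill behavior can be sketched in plain Python. This is an illustrative sketch using lists rather than PySpark vectors; the names are assumptions.

```python
import math

# Illustrative sketch of "Flatten" with "Length of first vector": the first
# array sets the number of output columns; longer inputs are truncated and
# shorter ones are filled with NaN.
vectors = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
n_out = len(vectors[0])

flattened = [
    (v + [math.nan] * (n_out - len(v)))[:n_out]
    for v in vectors
]
columns = {f"vec_{i}": [row[i] for row in flattened] for i in range(n_out)}
```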

## Process Numeric
<a name="data-wrangler-transform-process-numeric"></a>

Use the **Process Numeric** feature group to process numeric data. Each scalar in this group is defined using the Spark library. The following scalars are supported:
+ **Standard Scaler**: Standardize the input column by subtracting the mean from each value and scaling to unit variance. To learn more, see the Spark documentation for [StandardScaler](https://spark.apache.org/docs/latest/ml-features#standardscaler).
+ **Robust Scaler**: Scale the input column using statistics that are robust to outliers. To learn more, see the Spark documentation for [RobustScaler](https://spark.apache.org/docs/latest/ml-features#robustscaler).
+ **Min Max Scaler**: Transform the input column by scaling each feature to a given range. To learn more, see the Spark documentation for [MinMaxScaler](https://spark.apache.org/docs/latest/ml-features#minmaxscaler).
+ **Max Absolute Scaler**: Scale the input column by dividing each value by the maximum absolute value. To learn more, see the Spark documentation for [MaxAbsScaler](https://spark.apache.org/docs/latest/ml-features#maxabsscaler).
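
The math behind three of these scalers can be sketched directly in pandas (this illustrates the formulas, not the Spark implementations):

```python
import pandas as pd

# Illustrative sketch of the scaler formulas on a single column.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

standardized = (x - x.mean()) / x.std()          # Standard Scaler
min_max = (x - x.min()) / (x.max() - x.min())    # Min Max Scaler, range [0, 1]
max_abs = x / x.abs().max()                      # Max Absolute Scaler
```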

## Sampling
<a name="data-wrangler-transform-sampling"></a>

After you've imported your data, you can use the **Sampling** transformer to take one or more samples of it. When you use the sampling transformer, Data Wrangler samples your original dataset.

You can choose one of the following sample methods:
+ **Limit**: Samples the dataset starting from the first row up to the limit that you specify.
+ **Randomized**: Takes a random sample of a size that you specify.
+ **Stratified**: Takes a stratified random sample.

You can stratify a randomized sample to make sure that it represents the original distribution of the dataset.

You might be performing data preparation for multiple use cases. For each use case, you can take a different sample and apply a different set of transformations.
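
A stratified random sample can be sketched in pandas by sampling the same fraction from each class, which preserves the original class distribution. The column names and fractions below are example values.

```python
import pandas as pd

# Illustrative sketch of a stratified sample: draw the same fraction from
# each class so the sample mirrors the original distribution.
df = pd.DataFrame({"label": ["a"] * 80 + ["b"] * 20, "x": range(100)})

sample = df.groupby("label", group_keys=False).sample(frac=0.5, random_state=0)
```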

The following procedure describes the process of creating a random sample. 

To take a random sample from your data.

1. Choose the **+** to the right of the dataset that you've imported. The name of your dataset is located below the **+**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

The following procedure describes the process of creating a stratified sample.

To take a stratified sample from your data.

1. Choose the **+** to the right of the dataset that you've imported. The name of your dataset is located below the **+**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. For **Stratify column**, specify the name of the column that you want to stratify on.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

## Search and Edit
<a name="data-wrangler-transform-search-edit"></a>

Use this section to search for and edit specific patterns within strings. For example, you can find and update strings within sentences or documents, split strings by delimiters, and find occurrences of specific strings. 

The following transforms are supported under **Search and edit**. All transforms return copies of the strings in the **Input column** and add the result to a new output column.


| Name | Function | 
| --- | --- | 
|  Find substring  |  Returns the index of the first occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Find substring (from right)  |  Returns the index of the last occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Matches prefix  |  Returns a Boolean value indicating whether the string starts with a given **Pattern**. A pattern can be a character sequence or regular expression. Optionally, you can make the pattern case sensitive.   | 
|  Find all occurrences  |  Returns an array with all occurrences of a given pattern. A pattern can be a character sequence or regular expression.   | 
|  Extract using regex  |  Returns a string that matches a given Regex pattern.  | 
|  Extract between delimiters  |  Returns a string with all characters found between **Left delimiter** and **Right delimiter**.   | 
|  Extract from position  |  Returns a string, starting from **Start position** in the input string, that contains all characters up to the start position plus **Length**.   | 
|  Find and replace substring  |  Returns a string with all matches of a given **Pattern** (regular expression) replaced by **Replacement string**.  | 
|  Replace between delimiters  |  Returns a string with the substring found between the first appearance of a **Left delimiter** and the last appearance of a **Right delimiter** replaced by **Replacement string**. If no match is found, nothing is replaced.   | 
|  Replace from position  |  Returns a string with the substring between **Start position** and **Start position** plus **Length** replaced by **Replacement string**. If **Start position** plus **Length** is greater than the length of the replacement string, the output contains **…**.  | 
|  Convert regex to missing  |  Converts a string to `None` if invalid and returns the result. Validity is defined with a regular expression in **Pattern**.  | 
|  Split string by delimiter  |  Returns an array of strings from the input string, split by **Delimiter**, with up to **Max number of splits** (optional). The delimiter defaults to white space.   | 
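
A few of these operations map directly onto Python's `re` module and string methods, as this sketch illustrates (the input string is an example).

```python
import re

# Illustrative sketch of some Search and edit operations.
s = "order-123, order-456"

first_index = s.find("order")        # Find substring
all_ids = re.findall(r"\d+", s)      # Find all occurrences
replaced = re.sub(r"\d+", "<id>", s) # Find and replace substring
parts = s.split(", ")                # Split string by delimiter
```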

## Split data
<a name="data-wrangler-transform-split-data"></a>

Use the **Split data** transform to split your dataset into two or three datasets. For example, you can split your dataset into a dataset used to train your model and a dataset used to test it. You can determine the proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two datasets, the training dataset can have 80% of the data while the testing dataset has 20%.

Splitting your data into three datasets gives you the ability to create training, validation, and test datasets. You can see how well the model performs on the test dataset by dropping the target column.

Your use case determines how much of the original dataset each of your datasets get and the method you use to split the data. For example, you might want to use a stratified split to make sure that the distribution of the observations in the target column are the same across datasets. You can use the following split transforms:
+ Randomized split — Each split is a random, non-overlapping sample of the original dataset. For larger datasets, using a randomized split might be computationally expensive and take longer than an ordered split.
+ Ordered split – Splits the dataset based on the sequential order of the observations. For example, for an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in keeping the existing order of the data between splits.
+ Stratified split – Splits the dataset to make sure that the number of observations in the input column have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the 1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go to the testing set.
+ Split by key – Avoids data with the same key occurring in more than one split. For example, if you have a dataset with the column 'customer_id' and you're using it as a key, no customer id is in more than one split.

After you split the data, you can apply additional transformations to each dataset. For most use cases, they aren't necessary.

Data Wrangler calculates the proportions of the splits for performance. You can choose an error threshold to set the accuracy of the splits. Lower error thresholds more accurately reflect the proportions that you specify for the splits. If you set a higher error threshold, you get better performance, but lower accuracy.

For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.

If you have 10000 rows in your dataset and you specify an 80/20 split with an error of 0.001, you would get observations approximating one of the following results:
+ 8010 observations in the training set and 1990 in the testing set
+ 7990 observations in the training set and 2010 in the testing set

The number of observations for the training set in the preceding example falls in the interval between 7990 and 8010.

By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a different value for the seed to create a different reproducible split.
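
A reproducible 80/20 randomized split can be sketched in pandas with a fixed seed. This illustrates the idea; Data Wrangler's own splitting is configured through the procedures below.

```python
import pandas as pd

# Illustrative sketch of a reproducible 80/20 randomized split using a seed.
df = pd.DataFrame({"x": range(100)})

train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
# The two splits are non-overlapping samples of the original dataset.
```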

------
#### [ Randomized split ]

Use the following procedure to perform a randomized split on your dataset.

To split your dataset randomly, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Ordered split ]

Use the following procedure to perform an ordered split on your dataset.

To make an ordered split in your dataset, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Ordered split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) For **Input column**, specify a column with numeric values. Data Wrangler uses the values of the column to infer which records are in each split. The smaller values are in one split with the larger values in the other splits.

1. (Optional) Select **Handle duplicates** to add noise to duplicate values and create a dataset of entirely unique values.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Stratified split ]

Use the following procedure to perform a stratified split on your dataset.

To make a stratified split in your dataset, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Stratified split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Input column**, specify a column with up to 100 unique values. Data Wrangler can't stratify a column with more than 100 unique values.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Split by column keys ]

Use the following procedure to split your dataset by column keys.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Split by key**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Key columns**, specify the columns with values that you don't want to appear in both datasets.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. Choose **Preview**.

1. Choose **Add**.
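
One common way to guarantee that rows sharing a key never land in different splits is to hash the key and use the hash to assign the split. The following sketch illustrates the idea under that assumption; `split_by_key` is a hypothetical helper, not the Data Wrangler implementation.

```python
import bisect
import hashlib

def split_by_key(rows, key, proportions):
    """Assign each row to a split based on a hash of its key, so rows
    that share a key always land in the same split."""
    # Cumulative boundaries over [0, 1), e.g. [0.5, 1.0]
    bounds, total = [], 0.0
    for p in proportions:
        total += p
        bounds.append(total)
    splits = [[] for _ in proportions]
    for row in rows:
        digest = hashlib.md5(str(key(row)).encode()).hexdigest()
        # Map the hash deterministically to a uniform value in [0, 1)
        u = int(digest, 16) / 16 ** len(digest)
        splits[bisect.bisect_right(bounds, u)].append(row)
    return splits

rows = [{"user": u, "value": i} for i, u in enumerate("aabbbababa")]
splits = split_by_key(rows, key=lambda r: r["user"], proportions=[0.5, 0.5])
```

Because the assignment depends only on the key, rerunning the split on new data sends previously seen keys to the same split.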

------

## Parse Value as Type
<a name="data-wrangler-transform-cast-type"></a>

Use this transform to cast a column to a new type. The supported Data Wrangler data types are:
+ Long
+ Float
+ Boolean
+ Date, in the format dd-MM-yyyy, representing day, month, and year respectively. 
+ String
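
The Data Wrangler date format dd-MM-yyyy corresponds to Python's `%d-%m-%Y`. The following sketch shows plain-Python equivalents of these casts; the column names are made up for illustration.

```python
from datetime import datetime

row = {"id": "42", "score": "3.5", "active": "True", "joined": "15-03-2021"}

parsed = {
    "id": int(row["id"]),                                    # Long
    "score": float(row["score"]),                            # Float
    "active": row["active"] == "True",                       # Boolean
    "joined": datetime.strptime(row["joined"], "%d-%m-%Y"),  # Date, dd-MM-yyyy
}
```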

## Validate String
<a name="data-wrangler-transform-validate-string"></a>

Use the **Validate string** transforms to create a new column that indicates whether a row of text data meets a specified condition. For example, you can use a **Validate string** transform to verify that a string only contains lowercase characters.

The following transforms are included in this transform group. If a transform outputs a Boolean value, `True` is represented with a `1` and `False` is represented with a `0`.


| Name | Function | 
| --- | --- | 
|  String length  |  Returns `True` if a string's length equals a specified length. Otherwise, returns `False`.   | 
|  Starts with  |  Returns `True` if a string starts with a specified prefix. Otherwise, returns `False`.  | 
|  Ends with  |  Returns `True` if a string ends with a specified suffix. Otherwise, returns `False`.  | 
|  Is alphanumeric  |  Returns `True` if a string only contains numbers and letters. Otherwise, returns `False`.  | 
|  Is alpha (letters)  |  Returns `True` if a string only contains letters. Otherwise, returns `False`.  | 
|  Is digit  |  Returns `True` if a string only contains digits. Otherwise, returns `False`.  | 
|  Is space  |  Returns `True` if a string only contains white spaces. Otherwise, returns `False`.  | 
|  Is title  |  Returns `True` if a string is in title case, with each word starting with an uppercase letter. Otherwise, returns `False`.  | 
|  Is lowercase  |  Returns `True` if a string only contains lower case letters. Otherwise, returns `False`.  | 
|  Is uppercase  |  Returns `True` if a string only contains upper case letters. Otherwise, returns `False`.  | 
|  Is numeric  |  Returns `True` if a string only contains numbers. Otherwise, returns `False`.  | 
|  Is decimal  |  Returns `True` if a string only contains decimal numbers. Otherwise, returns `False`.  | 
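
Most of these checks map onto Python's built-in string predicates. The following sketch shows how a handful of them can produce 1/0 indicator columns. Note that `str.islower` only requires that no cased character is uppercase, which is slightly looser than "only contains lowercase letters".

```python
def validate(strings, predicate):
    """Return a 1/0 indicator column, as the Validate string transforms do."""
    return [1 if predicate(s) else 0 for s in strings]

words = ["Hello", "hello", "h3llo", "   ", "123"]
is_alpha = validate(words, str.isalpha)                   # letters only
is_alnum = validate(words, str.isalnum)                   # letters and digits only
is_digit = validate(words, str.isdigit)                   # digits only
is_lower = validate(words, str.islower)                   # no uppercase letters
starts_h = validate(words, lambda s: s.startswith("h"))   # a "Starts with" check
```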

## Unnest JSON Data
<a name="data-wrangler-transform-flatten-column"></a>

If you have a .csv file, you might have values in your dataset that are JSON strings. Similarly, you might have nested data in columns of either a Parquet file or a JSON document.

Use the **Flatten structured** operator to separate the first level keys into separate columns. A first level key is a key that isn't nested within a value.

For example, you might have a dataset that has a *person* column with demographic information on each person stored as JSON strings. A JSON string might look like the following.

```
 "{"seq": 1,"name": {"first": "Nathaniel","last": "Ferguson"},"age": 59,"city": "Posbotno","state": "WV"}"
```

The **Flatten structured** operator converts the following first level keys into additional columns in your dataset:
+ seq
+ name
+ age
+ city
+ state

Data Wrangler puts the values of the keys as values under the columns. The following shows the column names and values of the JSON.

```
seq, name,                                    age, city, state
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV
```

For each value in your dataset containing JSON, the **Flatten structured** operator creates columns for the first-level keys. To create columns for nested keys, call the operator again. For the preceding example, calling the operator again creates the following columns:
+ name_first
+ name_last

The following example shows the dataset that results from calling the operation again.

```
seq, name,                                    age, city, state, name_first, name_last
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV, Nathaniel, Ferguson
```

Choose **Keys to flatten on** to specify the first-level keys that you want to extract as separate columns. If you don't specify any keys, Data Wrangler extracts all the keys by default.
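
The flattening behavior can be sketched in plain Python. The `flatten_first_level` helper is hypothetical; it only illustrates how one pass lifts first-level keys into columns, and a second pass produces `name_first` and `name_last` from the still-nested `name` value.

```python
import json

record = json.loads('{"seq": 1, "name": {"first": "Nathaniel", '
                    '"last": "Ferguson"}, "age": 59, "city": "Posbotno", "state": "WV"}')

def flatten_first_level(record, prefix=""):
    """Turn each first-level key into its own column."""
    return {prefix + key: value for key, value in record.items()}

flat = flatten_first_level(record)
# A second pass flattens keys whose values are still nested, producing
# name_first and name_last alongside the original name column
for key, value in list(flat.items()):
    if isinstance(value, dict):
        flat.update(flatten_first_level(value, prefix=key + "_"))
```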

## Explode Array
<a name="data-wrangler-transform-explode-array"></a>

Use **Explode array** to expand the values of an array into separate output rows. For example, the operation can take each value in the array `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and create a new column with the following rows:

```
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
```

Data Wrangler names the new column `input_column_name_flatten`.

You can call the **Explode array** operation multiple times to get the nested values of the array into separate output columns. The following example shows the result of calling the operation multiple times on a dataset with a nested array.

Putting the values of a nested array into separate columns


| id | array | id | array\_items | id | array\_items\_items | 
| --- | --- | --- | --- | --- | --- | 
| 1 | [ [cat, dog], [bat, frog] ] | 1 | [cat, dog] | 1 | cat | 
| 2 |  [ [rose, petunia], [lily, daisy] ]  | 1 | [bat, frog] | 1 | dog | 
|  |  | 2 | [rose, petunia] | 1 | bat | 
|  |  | 2 | [lily, daisy] | 1 | frog | 
|  |  |  |  | 2 | rose | 
|  |  |  |  | 2 | petunia | 
|  |  |  |  | 2 | lily | 
|  |  |  |  | 2 | daisy | 
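
One call of the operation can be sketched in plain Python. The `explode` helper is hypothetical; calling it twice mirrors the nested-array example above.

```python
def explode(rows, column):
    """Expand each array value in `column` into its own output row,
    keeping the other columns; mirrors one call of Explode array."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column + "_items": item})
    return out

data = [{"id": 1, "array": [["cat", "dog"], ["bat", "frog"]]}]
once = explode(data, "array")          # rows of [cat, dog] and [bat, frog]
twice = explode(once, "array_items")   # rows of cat, dog, bat, frog
```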

## Transform Image Data
<a name="data-wrangler-transform-image"></a>

Use Data Wrangler to import and transform the images that you're using for your machine learning (ML) pipelines. After you've prepared your image data, you can export it from your Data Wrangler flow to your ML pipeline.

You can use the information provided here to familiarize yourself with importing and transforming image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

After you've familiarized yourself with the concepts of transforming your image data, go through the following tutorial, [Prepare image data with Amazon SageMaker Data Wrangler](https://aws.amazon.com/blogs/machine-learning/prepare-image-data-with-amazon-sagemaker-data-wrangler/).

The following industries and use cases are examples where applying machine learning to transformed image data can be useful:
+ Manufacturing – Identifying defects in items from the assembly line
+ Food – Identifying spoiled or rotten food
+ Medicine – Identifying lesions in tissues

When you work with image data in Data Wrangler, you go through the following process:

1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.

1. Transform – Use the built-in transformations to prepare the images for your machine learning pipeline.

1. Export – Export the images that you’ve transformed to a location that can be accessed from the pipeline.

Use the following procedure to import your image data.

**To import your image data**

1. Navigate to the **Create connection** page.

1. Choose **Amazon S3**.

1. Specify the Amazon S3 file path that contains the image data.

1. For **File type**, choose **Image**.

1. (Optional) Choose **Import nested directories** to import images from multiple Amazon S3 paths.

1. Choose **Import**.

Data Wrangler uses the open-source [imgaug](https://imgaug.readthedocs.io/en/latest/) library for its built-in image transformations. You can use the following built-in transformations:
+ **ResizeImage**
+ **EnhanceImage**
+ **CorruptImage**
+ **SplitImage**
+ **DropCorruptedImages**
+ **DropImageDuplicates**
+ **Brightness**
+ **ColorChannels**
+ **Grayscale**
+ **Rotate**

Use the following procedure to transform your images without writing code.

**To transform the image data without writing code**

1. From your Data Wrangler flow, choose the **+** next to the node representing the images that you've imported.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose the transform and configure it.

1. Choose **Preview**.

1. Choose **Add**.

In addition to using the transformations that Data Wrangler provides, you can also use your own custom code snippets. For more information about using custom code snippets, see [Custom Transforms](#data-wrangler-transform-custom). You can import the OpenCV and imgaug libraries within your code snippets and use the transforms associated with them. The following is an example of a code snippet that detects edges within the images.

```
# A table with your image data is stored in the `df` variable
import cv2
import numpy as np
from pyspark.sql.functions import column

from sagemaker_dataprep.compute.operators.transforms.image.constants import DEFAULT_IMAGE_COLUMN, IMAGE_COLUMN_TYPE
from sagemaker_dataprep.compute.operators.transforms.image.decorators import BasicImageOperationDecorator, PandasUDFOperationDecorator


@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
  # To use the code snippet on your image data, modify the following lines within the function
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image,HYST_THRLD_1,HYST_THRLD_2)
    return edges
    

@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)
    

df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))
```

When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a sample of the images in your dataset. To keep the application responsive, Data Wrangler doesn't apply the transforms to all of your images.

To apply the transformations to all of your images, export your Data Wrangler flow to an Amazon S3 location. You can use the images that you've exported in your training or inference pipelines. Use a destination node or a Jupyter Notebook to export your data. You can access either method for exporting your data from the Data Wrangler flow. For information about using these methods, see [Export to Amazon S3](data-wrangler-data-export.md#data-wrangler-data-export-s3).

## Filter data
<a name="data-wrangler-transform-filter-data"></a>

Use Data Wrangler to filter the data in your columns. When you filter the data in a column, you specify the following fields:
+ **Column name** – The name of the column that you're using to filter the data.
+ **Condition** – The type of filter that you're applying to values in the column.
+ **Value** – The value or category in the column to which you're applying the filter.

You can filter on the following conditions:
+ **=** – Returns values that match the value or category that you specify.
+ **!=** – Returns values that don't match the value or category that you specify.
+ **>=** – For **Long** or **Float** data, filters for values that are greater than or equal to the value that you specify.
+ **<=** – For **Long** or **Float** data, filters for values that are less than or equal to the value that you specify.
+ **>** – For **Long** or **Float** data, filters for values that are greater than the value that you specify.
+ **<** – For **Long** or **Float** data, filters for values that are less than the value that you specify.

For a column that has the categories `male` and `female`, you can filter out all the `male` values. You can also filter for all the `female` values. Because there are only `male` and `female` values in the column, the filter returns a column that only has `female` values.

You can also add multiple filters. The filters can be applied across multiple columns or the same column. For example, if you're creating a column that only has values within a certain range, you add two different filters. One filter specifies that the column must have values greater than the value that you provide. The other filter specifies that the column must have values less than the value that you provide.
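
Chaining two filters to keep a range of values can be sketched in plain Python. The `apply_filters` helper is hypothetical; it only illustrates how the conditions above compose.

```python
def apply_filters(rows, filters):
    """Keep only the rows that satisfy every (column, condition, value) filter."""
    ops = {"=": lambda a, b: a == b, "!=": lambda a, b: a != b,
           ">": lambda a, b: a > b, "<": lambda a, b: a < b,
           ">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    for column, condition, value in filters:
        rows = [r for r in rows if ops[condition](r[column], value)]
    return rows

rows = [{"age": a} for a in (15, 25, 35, 45)]
# Two filters on the same column keep only the values in (20, 40)
in_range = apply_filters(rows, [("age", ">", 20), ("age", "<", 40)])
```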

Use the following procedure to add the filter transform to your data.

**To filter your data**

1. From your Data Wrangler flow, choose the **+** next to the node with the data that you're filtering.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose **Filter data**.

1. Specify the following fields:
   + **Column name** – The column that you're filtering.
   + **Condition** – The condition of the filter.
   + **Value** – The value or category in the column to which you're applying the filter.

1. (Optional) Choose **+** following the filter that you've created to add another filter.

1. Configure the filter.

1. Choose **Preview**.

1. Choose **Add**.

## Map Columns for Amazon Personalize
<a name="data-wrangler-transform-personalize"></a>

Data Wrangler integrates with Amazon Personalize, a fully managed machine learning service that generates item recommendations and user segments. You can use the **Map columns for Amazon Personalize** transform to get your data into a format that Amazon Personalize can interpret. For more information about the transforms specific to Amazon Personalize, see [Importing data using Amazon SageMaker Data Wrangler](https://docs.aws.amazon.com/personalize/latest/dg/preparing-importing-with-data-wrangler.html#dw-transform-data). For more information about Amazon Personalize, see [What is Amazon Personalize?](https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html)

# Analyze and Visualize
<a name="data-wrangler-analyses"></a>

Amazon SageMaker Data Wrangler includes built-in analyses that help you generate visualizations and data analyses in a few clicks. You can also create custom analyses using your own code. 

You add an analysis to a dataframe by selecting a step in your data flow, and then choosing **Add analysis**. To access an analysis you've created, select the step that contains the analysis, and select the analysis. 

All analyses are generated using 100,000 rows of your dataset. 

You can add the following analysis to a dataframe:
+ Data visualizations, including histograms and scatter plots. 
+ A quick summary of your dataset, including number of entries, minimum and maximum values (for numeric data), and most and least frequent categories (for categorical data).
+ A quick model of the dataset, which can be used to generate an importance score for each feature. 
+ A target leakage report, which you can use to determine if one or more features are strongly correlated with your target feature.
+ A custom visualization using your own code. 

Use the following sections to learn more about these options. 

## Histogram
<a name="data-wrangler-visualize-histogram"></a>

Use histograms to see the counts of feature values for a specific feature. You can inspect the relationships between features using the **Color by** option. For example, the following histogram charts the distribution of user ratings of the best-selling books on Amazon from 2009–2019, colored by genre. 

![\[Example histogram chart in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/histogram.png)


You can use the **Facet by** feature to create histograms of one column, for each value in another column. For example, the following diagram shows histograms of user reviews of best-selling books on Amazon if faceted by year. 

![\[Example histograms in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/review_by_year.png)


## Scatter Plot
<a name="data-wrangler-visualize-scatter-plot"></a>

Use the **Scatter Plot** feature to inspect the relationship between features. To create a scatter plot, select a feature to plot on the **X axis** and the **Y axis**. Both of these columns must be numeric typed columns. 

You can color scatter plots by an additional column. For example, the following example shows a scatter plot comparing the number of reviews against user ratings of top-selling books on Amazon between 2009 and 2019. The scatter plot is colored by book genre. 

![\[Example scatter plot in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/scatter-plot.png)


Additionally, you can facet scatter plots by features. For example, the following image shows an example of the same review versus user rating scatter plot, faceted by year. 

![\[Example faceted scatter plot in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/scatter-plot-facet.png)


## Table Summary
<a name="data-wrangler-table-summary"></a>

Use the **Table Summary** analysis to quickly summarize your data.

For columns with numerical data, including long and float data, a table summary reports the number of entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.

For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table summary reports the number of entries (count), least frequent value (min), and most frequent value (max). 

## Quick Model
<a name="data-wrangler-quick-model"></a>

Use the **Quick Model** visualization to quickly evaluate your data and produce importance scores for each feature. A [feature importance score](http://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances) indicates how useful a feature is at predicting a target label. The feature importance score is between [0, 1], and a higher number indicates that the feature is more important to the whole dataset. At the top of the quick model chart, there is a model score. A classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.

When you create a quick model chart, you select a dataset you want evaluated, and a target label against which you want feature importance to be compared. Data Wrangler does the following:
+ Infers the data types for the target label and each feature in the dataset selected. 
+ Determines the problem type. Based on the number of distinct values in the label column, Data Wrangler determines if this is a regression or classification problem type. Data Wrangler sets a categorical threshold to 100. If there are more than 100 distinct values in the label column, Data Wrangler classifies it as a regression problem; otherwise, it is classified as a classification problem. 
+ Pre-processes features and label data for training. The algorithm used requires encoding features to vector type and encoding labels to double type. 
+ Trains a random forest algorithm with 70% of data. Spark’s [RandomForestRegressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression) is used to train a model for regression problems. The [RandomForestClassifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) is used to train a model for classification problems.
+ Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates classification models using an F1 score and evaluates regression models using an MSE score.
+ Calculates feature importance for each feature using the Gini importance method. 
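
The problem-type rule and the 70/30 split from the steps above can be sketched in plain Python. These helpers are hypothetical illustrations, not the Data Wrangler code.

```python
def infer_problem_type(label_values, categorical_threshold=100):
    """More than 100 distinct label values -> regression; otherwise
    classification, mirroring the threshold described above."""
    if len(set(label_values)) > categorical_threshold:
        return "regression"
    return "classification"

def train_eval_split(rows, train_fraction=0.7):
    """70% of the data for training, the remaining 30% for evaluation."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, evaluation = train_eval_split(list(range(10)))
```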

The following image shows the user interface for the quick model feature. 

![\[Example UI of the quick model feature in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/quick-model.png)


## Target Leakage
<a name="data-wrangler-analysis-target-leakage"></a>

Target leakage occurs when there is data in a machine learning training dataset that is strongly correlated with the target label, but is not available in real-world data. For example, you may have a column in your dataset that serves as a proxy for the column you want to predict with your model. 

When you use the **Target Leakage** analysis, you specify the following:
+ **Target**: This is the feature about which you want your ML model to be able to make predictions.
+ **Problem type**: This is the ML problem type on which you are working. Problem type can either be **classification** or **regression**. 
+  (Optional) **Max features**: This is the maximum number of features to present in the visualization, which shows features ranked by their risk of being target leakage.

For classification, the target leakage analysis uses the area under the receiver operating characteristic curve (AUC-ROC) for each column, up to **Max features**. For regression, it uses the coefficient of determination (R2) metric.

The AUC-ROC metric is computed individually for each column using cross-validation, on a sample of up to around 1,000 rows. A score of 1 indicates perfect predictive ability, which often indicates target leakage. A score of 0.5 or lower indicates that the column, on its own, provides no useful information for predicting the target. Although a column can be uninformative on its own yet useful for predicting the target in combination with other features, a low score might indicate that the feature is redundant.
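
As a simplified illustration of the per-column metric (without the cross-validation that the analysis uses), AUC can be computed for a single feature against a binary target with the Mann-Whitney formulation. The `column_auc` helper is hypothetical.

```python
def column_auc(feature, target):
    """AUC of ranking by a single feature against a binary target,
    via the Mann-Whitney U statistic; ties get half credit."""
    pos = [f for f, t in zip(feature, target) if t == 1]
    neg = [f for f, t in zip(feature, target) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfectly separating column scores 1.0 (a leakage warning sign); a column unrelated to the target hovers around 0.5.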

For example, the following image shows a target leakage report for a diabetes classification problem, that is, predicting if a person has diabetes or not. An AUC - ROC curve is used to calculate the predictive ability of five features, and all are determined to be safe from target leakage.

![\[Example target leakage report in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/target-leakage.png)


## Multicollinearity
<a name="data-wrangler-multicollinearity"></a>

Multicollinearity is a circumstance where two or more predictor variables are related to each other. The predictor variables are the features in your dataset that you're using to predict a target variable. When you have multicollinearity, the predictor variables are not only predictive of the target variable, but also predictive of each other.

You can use the **Variance Inflation Factor (VIF)**, **Principal Component Analysis (PCA)**, or **Lasso feature selection** as measures for the multicollinearity in your data. For more information, see the following.

------
#### [ Variance Inflation Factor (VIF) ]

The Variance Inflation Factor (VIF) is a measure of collinearity among variable pairs. Data Wrangler returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is a positive number that is greater than or equal to 1.

A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1 indicate higher correlation.

Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50. If you have a VIF score greater than 50, Data Wrangler sets the score to 50.

You can use the following guidelines to interpret your VIF scores:
+ A VIF score less than or equal to 5 indicates that the variables are moderately correlated with the other variables.
+ A VIF score greater than 5 indicates that the variables are highly correlated with the other variables.

------
#### [ Principal Component Analysis (PCA) ]

Principal Component Analysis (PCA) measures the variance of the data along different directions in the feature space. The feature space consists of all the predictor variables that you use to predict the target variable in your dataset.

For example, if you're trying to predict who survived on the *RMS Titanic* after it hit an iceberg, your feature space can include the passengers' age, gender, and the fare that they paid.

From the feature space, PCA generates an ordered list of variances. These variances are also known as singular values. The values in the list of variances are greater than or equal to 0. We can use them to determine how much multicollinearity there is in our data.

When the numbers are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, we have many instances of multicollinearity. Before it performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1.

**Note**  
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
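
The normalization-then-SVD described above can be sketched with NumPy. The `singular_values` helper is hypothetical; a list of roughly uniform values suggests little multicollinearity, while a near-zero value signals strong multicollinearity.

```python
import numpy as np

def singular_values(X):
    """Normalize each feature to mean 0 and standard deviation 1,
    then return the singular values in descending order."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.svd(Z, compute_uv=False)

rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
# Roughly uniform singular values: little multicollinearity
well_spread = singular_values(np.column_stack([a, b]))
# One value near zero: strong multicollinearity
collinear = singular_values(np.column_stack([a, a + 1e-6 * b]))
```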

------
#### [ Lasso feature selection ]

Lasso feature selection uses the L1 regularization technique to only include the most predictive features in your dataset.

For both classification and regression, the regularization technique generates a coefficient for each feature. The absolute value of the coefficient provides an importance score for the feature. A higher importance score indicates that it is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.

------

## Detect Anomalies In Time Series Data
<a name="data-wrangler-time-series-anomaly-detection"></a>

You can use the anomaly detection visualization to see outliers in your time series data. To understand what determines an anomaly, you need to understand that we decompose the time series into a predicted term and an error term. We treat the seasonality and trend of the time series as the predicted term. We treat the residuals as the error term.

For the error term, you specify a threshold: the number of standard deviations that the residual can be away from the mean before it's considered an anomaly. For example, you can specify a threshold of 3 standard deviations. Any residual greater than 3 standard deviations away from the mean is an anomaly.
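
The thresholding step can be sketched in plain Python under the assumption that a predicted term is already available; the `flag_anomalies` helper is hypothetical and skips the decomposition itself.

```python
import statistics

def flag_anomalies(observed, predicted, threshold=3.0):
    """Flag points whose residual (observed minus predicted) falls more
    than `threshold` standard deviations from the mean residual."""
    residuals = [obs - pred for obs, pred in zip(observed, predicted)]
    mean = statistics.fmean(residuals)
    stdev = statistics.pstdev(residuals)
    return [abs(r - mean) > threshold * stdev for r in residuals]

# A flat series with one spike; the predicted term here is a constant 10.0
observed = [10.0] * 29 + [25.0]
predicted = [10.0] * 30
flags = flag_anomalies(observed, predicted, threshold=3.0)
```

Only the spike exceeds 3 standard deviations of the residuals, so only that point is flagged.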

You can use the following procedure to perform an **Anomaly detection** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Anomaly detection**.

1. For **Anomaly threshold**, specify the threshold at which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Seasonal Trend Decomposition In Time Series Data
<a name="data-wrangler-seasonal-trend-decomposition"></a>

You can determine whether there's seasonality in your time series data by using the Seasonal Trend Decomposition visualization. We use the STL (Seasonal Trend decomposition using LOESS) method to perform the decomposition. We decompose the time series into its seasonal, trend, and residual components. The trend reflects the long term progression of the series. The seasonal component is a signal that recurs in a time period. After removing the trend and the seasonal components from the time series, you have the residual.

You can use the following procedure to perform a **Seasonal-Trend decomposition** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Seasonal-Trend decomposition**.

1. For **Anomaly threshold**, specify the threshold at which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Bias Report
<a name="data-wrangler-bias-report"></a>

You can use the bias report in Data Wrangler to uncover potential biases in your data. To generate a bias report, you must specify the target column, or **Label**, that you want to predict and a **Facet**, or the column that you want to inspect for biases.

**Label**: The feature about which you want a model to make predictions. For example, if you are predicting customer conversion, you may select a column containing data on whether or not a customer has placed an order. You must also specify whether this feature is a label or a threshold. If you specify a label, you must specify what a *positive outcome* looks like in your data. In the customer conversion example, a positive outcome may be a 1 in the orders column, representing the positive outcome of a customer placing an order within the last three months. If you specify a threshold, you must specify a lower bound defining a positive outcome. For example, if your customer orders column contains the number of orders placed in the last year, you may want to specify 1.

**Facet**: The column that you want to inspect for biases. For example, if you are trying to predict customer conversion, your facet may be the age of the customer. You may choose this facet because you believe that your data is biased toward a certain age group. You must identify whether the facet is measured as a value or threshold. For example, if you want to inspect one or more specific ages, you select **Value** and specify those ages. If you want to look at an age group, you select **Threshold** and specify the threshold of ages you want to inspect.

After you select your feature and label, you select the types of bias metrics you want to calculate.

To learn more, see [Generate reports for bias in pre-training data](https://docs.aws.amazon.com/sagemaker/latest/dg/data-bias-reports.html). 

## Create Custom Visualizations
<a name="data-wrangler-visualize-custom"></a>

You can add an analysis to your Data Wrangler flow to create a custom visualization. Your dataset, with all the transformations you've applied, is available as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Data Wrangler uses the `df` variable to store the dataframe. You access the dataframe by calling the variable.

You must provide the output variable, `chart`, to store an [Altair](https://altair-viz.github.io/) output chart. For example, you can use the following code block to create a custom histogram using the Titanic dataset.

```
import altair as alt
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
chart = bar + rule
```

**To create a custom visualization:**

1. Next to the node containing the transformation that you'd like to visualize, choose the **+**.

1. Choose **Add analysis**.

1. For **Analysis type**, choose **Custom Visualization**.

1. For **Analysis name**, specify a name.

1. Enter your code in the code box. 

1. Choose **Preview** to preview your visualization.

1. Choose **Save** to add your visualization.

![\[Example on how to add your visualization in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/custom-visualization.png)


If you don’t know how to use the Altair visualization package in Python, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose **Search example snippets** and specify a query in the search bar.

The following example uses the **Binned scatterplot** code snippet. It bins the data along two dimensions and plots the count of values in each bin.

The snippets have comments to help you understand the changes that you need to make to the code. You usually need to specify the column names of your dataset in the code.

```
import altair as alt

# Specify the number of top rows for plotting
rows_number = 1000
df = df.head(rows_number)  
# You can also choose bottom rows or randomly sampled rows
# df = df.tail(rows_number)
# df = df.sample(rows_number)


chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)

# :Q specifies that label column has quantitative type.
# For more details on Altair typing refer to
# https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types
```

# Reusing Data Flows for Different Datasets
<a name="data-wrangler-parameterize"></a>

For Amazon Simple Storage Service (Amazon S3) data sources, you can create and use parameters. A parameter is a variable that you've saved in your Data Wrangler flow. Its value can be any portion of the data source's Amazon S3 path. Use parameters to quickly change the data that you're importing into a Data Wrangler flow or exporting to a processing job. You can also use parameters to select and import a specific subset of your data.

After you've created a Data Wrangler flow, you might train a model on the data that you've transformed. For datasets that have the same schema, you can use parameters to apply the same transformations to a different dataset and train a different model. You can use the new datasets to perform inference with your model or to retrain it.

In general, parameters have the following attributes:
+ Name – The name you specify for the parameter
+ Type – The type of value that the parameter represents
+ Default value – The value of the parameter when you don't specify a new value

**Note**  
Datetime parameters have a time range attribute that they use as the default value.

Data Wrangler uses curly braces, `{{}}`, to indicate that a parameter is being used in the Amazon S3 path. For example, you can have a URL such as `s3://amzn-s3-demo-bucket1/{{example_parameter_name}}/example-dataset.csv`.
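Conceptually, resolving a parameterized path is a string substitution. The following sketch (a hypothetical helper, not Data Wrangler's implementation) shows how a `{{parameter}}` placeholder resolves to a concrete S3 path:

```python
import re

def resolve_path(template, params):
    """Replace each {{name}} placeholder with the parameter's value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)

template = "s3://amzn-s3-demo-bucket1/{{example_parameter_name}}/example-dataset.csv"
resolved = resolve_path(template, {"example_parameter_name": "2022-data"})
# resolved == "s3://amzn-s3-demo-bucket1/2022-data/example-dataset.csv"
```

Changing the parameter's value and resolving again is what lets the same flow point at different datasets.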

You create a parameter when you're editing the Amazon S3 data source that you've imported. You can set any portion of the file path to a parameter, using either a literal value or a pattern. The following are the available parameter value types in the Data Wrangler flow:
+ Number
+ String
+ Pattern
+ Datetime

**Note**  
You can't create a pattern parameter or a datetime parameter for the name of the bucket in the Amazon S3 path.

You must set a number as the default value of a number parameter. You can change the value of the parameter to a different number when you're editing the parameter or when you're launching a processing job. For example, in the S3 path `s3://amzn-s3-demo-bucket/example-prefix/example-file-1.csv`, you can create a number parameter named `number_parameter` in the place of `1`. Your S3 path now appears as `s3://amzn-s3-demo-bucket/example-prefix/example-file-{{number_parameter}}.csv`. The path continues to point to the `example-file-1.csv` dataset until you change the value of the parameter. If you change the value of `number_parameter` to `2`, the path becomes `s3://amzn-s3-demo-bucket/example-prefix/example-file-2.csv`. You can import `example-file-2.csv` into Data Wrangler if you've uploaded the file to that Amazon S3 location.

A string parameter stores a string as its default value. For example, in the S3 path, `s3://amzn-s3-demo-bucket/example-prefix/example-file-1.csv`, you can create a string parameter named `string_parameter` in the place of the filename, `example-file-1.csv`. The path now appears as `s3://amzn-s3-demo-bucket/example-prefix/{{string_parameter}}`. It continues to match `s3://amzn-s3-demo-bucket/example-prefix/example-file-1.csv` until you change the value of the parameter.

Instead of specifying the filename as a string parameter, you can create a string parameter using the entire Amazon S3 path. You can specify a dataset from any Amazon S3 location in the string parameter.

A pattern parameter stores a regular expression (Python REGEX) string as its default value. You can use a pattern parameter to import multiple data files at the same time. To import more than one object at a time, specify a parameter value that matches the Amazon S3 objects that you're importing.

For example, you can create a single pattern parameter that matches the following datasets:
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-1.csv
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-2.csv
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-10.csv
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-0123.csv

For `s3://amzn-s3-demo-bucket1/example-prefix/example-file-1.csv`, you can create a pattern parameter in the place of `1`, and set the default value of the parameter to `\d+`. The `\d+` REGEX string matches any one or more decimal digits. If you create a pattern parameter named `pattern_parameter`, your S3 path appears as `s3://amzn-s3-demo-bucket1/example-prefix/example-file-{{pattern_parameter}}.csv`.

You can also use pattern parameters to match all CSV objects within your bucket. To match all objects in a bucket, create a pattern parameter with the default value of `.*` and set the path to `s3://amzn-s3-demo-bucket/{{pattern_parameter}}.csv`. The `.*` pattern matches zero or more of any character in the path.

The `s3://amzn-s3-demo-bucket/{{pattern_parameter}}.csv` path can match the following datasets.
+ `example-file-1.csv`
+ `other-example-file.csv`
+ `example-file-a.csv`
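To preview which objects such a pattern would select, you can test the substituted regular expression against candidate keys. The sketch below uses Python's `re` module as an illustration; it isn't the matching engine that Data Wrangler uses:

```python
import re

# Path template with the pattern parameter's default value (.*) substituted in
pattern = re.compile(r"s3://amzn-s3-demo-bucket/(.*)\.csv")

keys = [
    "s3://amzn-s3-demo-bucket/example-file-1.csv",
    "s3://amzn-s3-demo-bucket/other-example-file.csv",
    "s3://amzn-s3-demo-bucket/example-file-a.csv",
    "s3://amzn-s3-demo-bucket/notes.txt",
]

# Only the .csv objects match the full pattern
matches = [k for k in keys if pattern.fullmatch(k)]
```

Note that `fullmatch` requires the entire key to match, so the `.txt` object is excluded.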

A datetime parameter stores the following information:
+ A format for parsing datetime strings inside an Amazon S3 path
+ A relative time range that limits the datetime values that match

For example, in the Amazon S3 file path, `s3://amzn-s3-demo-bucket/2020/01/01/example-dataset.csv`, 2020/01/01 represents a datetime in the format of `year/month/day`. You can set the parameter’s time range to an interval such as `1 years` or `24 hours`. An interval of `1 years` matches all S3 paths with datetimes that fall between the current time and the time exactly a year before the current time. The current time is the time when you start exporting the transformations that you've made to the data. For more information about exporting data, see [Export](data-wrangler-data-export.md). If the current date is 2022/01/01 and the time range is `1 years`, the S3 path matches datasets such as the following:
+ s3://amzn-s3-demo-bucket/2021/01/01/example-dataset.csv
+ s3://amzn-s3-demo-bucket/2021/06/30/example-dataset.csv
+ s3://amzn-s3-demo-bucket/2021/12/31/example-dataset.csv
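The window check can be sketched as follows, approximating a `1 years` time range as 365 days (the helper name and the approximation are assumptions for illustration):

```python
import re
from datetime import datetime, timedelta

def in_time_range(key, now, days):
    """Parse the year/month/day segment of an S3 key and check whether it
    falls inside the relative window [now - days, now]."""
    match = re.search(r"/(\d{4}/\d{2}/\d{2})/", key)
    timestamp = datetime.strptime(match.group(1), "%Y/%m/%d")
    return now - timedelta(days=days) <= timestamp <= now

now = datetime(2022, 1, 1)  # the moment the export starts
assert in_time_range("s3://amzn-s3-demo-bucket/2021/06/30/example-dataset.csv", now, 365)
assert not in_time_range("s3://amzn-s3-demo-bucket/2020/06/30/example-dataset.csv", now, 365)
```

Because `now` is the moment the export runs, the same parameter can match different S3 paths on different days.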

The datetime values within a relative time range change as time passes. The S3 paths that fall within the relative time range might also differ.

For the Amazon S3 file path, `s3://amzn-s3-demo-bucket1/20200101/example-dataset.csv`, `20200101` is an example of a path segment that can become a datetime parameter.

To view a table of all the parameters that you've created in a Data Wrangler flow, choose the **`{{}}`** icon to the right of the text box containing the Amazon S3 path. If you no longer need a parameter that you've created, you can edit or delete it by choosing the icons to the right of the parameter.

**Important**  
Before you delete a parameter, make sure that you haven't used it anywhere in your Data Wrangler flow. Deleted parameters that are still within the flow cause errors.

You can create parameters for any step of your Data Wrangler flow. You can edit or delete any parameter that you create. If the data that the parameters select is no longer relevant to your use case, you can modify the values of the parameters. Modifying the values of the parameters changes the data that you're importing.

The following sections provide additional examples and general guidance on using parameters. You can use the sections to understand the parameters that work best for you.

**Note**  
The following sections contain procedures that use the Data Wrangler interface to override the parameters and create a processing job. You can also override the parameters when you export your Data Wrangler flow. To export your flow and override the value of a parameter, do the following:

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose the location where you're exporting the data.

1. Under `parameter_overrides`, specify different values for the parameters that you've created.

1. Run the Jupyter notebook.

## Applying a Data Wrangler flow to files using patterns
<a name="data-wrangler-pattern-parameters"></a>

You can use parameters to apply the transformations in your Data Wrangler flow to different files that match a pattern in the Amazon S3 URI path. This lets you select exactly the files in your S3 bucket that you want to transform. For example, you might have a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`. Different datasets named `example-dataset.csv` are stored under many different example prefixes, and the prefixes might be numbered sequentially. You can create patterns for the numbers in the Amazon S3 URI. Pattern parameters use REGEX to select any number of files that match the pattern of the expression. The following are REGEX patterns that might be useful:
+ `.*` – Matches zero or more of any character, except newline characters
+ `.+` – Matches one or more of any character, excluding newline characters
+ `\d+` – Matches one or more of any decimal digit
+ `\w+` – Matches one or more of any alphanumeric character
+ `[abc_-]{2,4}` – Matches a string of two, three, or four characters composed of the characters provided within the brackets
+ `abc|def` – Matches one string or another. For example, the operation matches either `abc` or `def`

You can replace each number in the following paths with a single parameter that has a value of `\d+`.
+ `s3://amzn-s3-demo-bucket1/example-prefix-3/example-prefix-4/example-prefix-5/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix-8/example-prefix-12/example-prefix-13/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix-4/example-prefix-9/example-prefix-137/example-dataset.csv`
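Substituting `\d+` at each numbered position produces a single regular expression that matches all three paths. A quick check in Python (illustrative only, not Data Wrangler's matching engine):

```python
import re

# Template with the same pattern parameter (default value \d+) in three places
pattern = re.compile(
    r"s3://amzn-s3-demo-bucket1/example-prefix-\d+/"
    r"example-prefix-\d+/example-prefix-\d+/example-dataset\.csv"
)

paths = [
    "s3://amzn-s3-demo-bucket1/example-prefix-3/example-prefix-4/example-prefix-5/example-dataset.csv",
    "s3://amzn-s3-demo-bucket1/example-prefix-8/example-prefix-12/example-prefix-13/example-dataset.csv",
    "s3://amzn-s3-demo-bucket1/example-prefix-4/example-prefix-9/example-prefix-137/example-dataset.csv",
]
assert all(pattern.fullmatch(p) for p in paths)
```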

The following procedure creates a pattern parameter for a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`.

To create a pattern parameter, do the following.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the `0` in `example-prefix-0`.

1. Specify values for the following fields:
   + **Name** – A name for the parameter
   + **Type** – **Pattern**
   + **Value** – `\d+`, a regular expression that matches one or more decimal digits

1. Choose **Create**.

1. Replace the `1` and the `2` in the S3 URI path with the parameter. The path should have the following format: `s3://amzn-s3-demo-bucket1/example-prefix-{{example_parameter_name}}/example-prefix-{{example_parameter_name}}/example-prefix-{{example_parameter_name}}/example-dataset.csv`

The following is a general procedure for creating a pattern parameter.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the portion of the URI that you're using as the value of the pattern parameter.

1. Choose **Create custom parameter**.

1. Specify values for the following fields:
   + **Name** – A name for the parameter
   + **Type** – **Pattern**
   + **Value** – A regular expression containing the pattern that you'd like to store.

1. Choose **Create**.

## Applying a Data Wrangler flow to files using numeric values
<a name="data-wrangler-numeric-parameters"></a>

You can use parameters to apply transformations in your Data Wrangler flow to different files that have similar paths. For example, you might have a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`.

You might have applied the transformations from your Data Wrangler flow to datasets under `example-prefix-1`. You might want to apply the same transformations to `example-dataset.csv` under `example-prefix-10` or `example-prefix-20`.

You can create a parameter that stores the value `1`. If you want to apply the transformations to different datasets, you can create processing jobs that replace the value of the parameter with a different value. The parameter acts as a placeholder for you to change when you want to apply the transformations from your Data Wrangler flow to new data. You can override the value of the parameter when you create a Data Wrangler processing job to apply the transformations in your Data Wrangler flow to different datasets.

Use the following procedure to create numeric parameters for `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`.

To create parameters for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the number in an example prefix of `example-prefix-number`.

1. Choose **Create custom parameter**.

1. For **Name**, specify a name for the parameter.

1. For **Type**, choose **Integer**.

1. For **Value**, specify the number.

1. Create parameters for the remaining numbers by repeating the procedure.

After you've created the parameters, apply the transforms to your dataset and create a destination node for them. For more information about destination nodes, see [Export](data-wrangler-data-export.md).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different dataset. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a numeric parameter in a Data Wrangler processing job, do the following.

1. From your Data Wrangler flow, choose **Create job**.

1. Select only the destination node that contains the transformations to the dataset containing the numeric parameters.

1. Choose **Configure job**.

1. Choose **Parameters**.

1. Choose the name of a parameter that you've created.

1. Change the value of the parameter.

1. Repeat the procedure for the other parameters.

1. Choose **Run**.

## Applying a Data Wrangler flow to files using strings
<a name="data-wrangler-string-parameters"></a>

You can use parameters to apply transformations in your Data Wrangler flow to different files that have similar paths. For example, you might have a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix/example-dataset.csv`.

You might have transformations from your Data Wrangler flow that you've applied to datasets under `example-prefix`. You might want to apply the same transformations to `example-dataset.csv` under `another-example-prefix` or `example-prefix-20`.

You can create a parameter that stores the value `example-prefix`. If you want to apply the transformations to different datasets, you can create processing jobs that replace the value of the parameter with a different value. The parameter acts as a placeholder for you to change when you want to apply the transformations from your Data Wrangler flow to new data. You can override the value of the parameter when you create a Data Wrangler processing job to apply the transformations in your Data Wrangler flow to different datasets.

Use the following procedure to create a string parameter for `s3://amzn-s3-demo-bucket1/example-prefix/example-dataset.csv`.

To create a parameter for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the example prefix, `example-prefix`.

1. Choose **Create custom parameter**.

1. For **Name**, specify a name for the parameter.

1. For **Type**, choose **String**.

1. For **Value**, specify the prefix.

After you've created the parameter, apply the transforms to your dataset and create a destination node for them. For more information about destination nodes, see [Export](data-wrangler-data-export.md).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different dataset. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a string parameter in a Data Wrangler processing job, do the following:

1. From your Data Wrangler flow, choose **Create job**.

1. Select only the destination node that contains the transformations to the dataset containing the string parameters.

1. Choose **Configure job**.

1. Choose **Parameters**.

1. Choose the name of a parameter that you've created.

1. Change the value of the parameter.

1. Repeat the procedure for the other parameters.

1. Choose **Run**.

## Applying a Data Wrangler flow to different datetime ranges
<a name="data-wrangler-datetime-parameters"></a>

Use datetime parameters to apply transformations in your Data Wrangler flow to different time ranges. Highlight the portion of the Amazon S3 URI that has a timestamp and create a parameter for it. When you create the parameter, you specify a time range from a time in the past to the current time. For example, you might have an Amazon S3 URI that looks like the following: `s3://amzn-s3-demo-bucket1/example-prefix/2022/05/15/example-dataset.csv`. You can save `2022/05/15` as a datetime parameter. If you specify a year as the time range, the time range spans from the moment that you run the processing job containing the datetime parameter back to exactly one year before. If you run the processing job on September 6th, 2022, or `2022/09/06`, the time range can match paths such as the following:
+ `s3://amzn-s3-demo-bucket1/example-prefix/2022/03/15/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix/2022/01/08/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix/2022/07/31/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix/2021/09/07/example-dataset.csv`

The transformations in the Data Wrangler flow apply to all of the preceding prefixes. Changing the value of the parameter in the processing job doesn't change the value of the parameter in the Data Wrangler flow. To apply the transformations to datasets within a different time range, do the following:

1. Create a destination node containing all the transformations that you'd like to use.

1. Create a Data Wrangler job.

1. Configure the job to use a different time range for the parameter.

For more information about destination nodes and Data Wrangler jobs, see [Export](data-wrangler-data-export.md).

The following procedure creates a datetime parameter for the Amazon S3 path: `s3://amzn-s3-demo-bucket1/example-prefix/2022/05/15/example-dataset.csv`.

To create a datetime parameter for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the portion of the URI that you're using as the value of the datetime parameter.

1. Choose **Create custom parameter**.

1. For **Name**, specify a name for the parameter.

1. For **Type**, choose **Datetime**.
**Note**  
By default, Data Wrangler selects **Predefined**, which provides a dropdown menu for you to select a date format. However, the timestamp format that you're using might not be available. Instead of using **Predefined** as the default option, you can choose **Custom** and specify the timestamp format manually.

1. For **Date format**, open the dropdown menu following **Predefined** and choose **yyyy/MM/dd**. The **yyyy/MM/dd** format corresponds to the year/month/day of the timestamp.

1. For **Timezone**, choose a time zone.
**Note**  
The data that you're analyzing might have time stamps taken in a different time zone from your time zone. Make sure that the time zone that you select matches the time zone of the data. 

1. For **Time range**, specify the time range for the parameter.

1. (Optional) Enter a description to describe how you're using the parameter.

1. Choose **Create**.

After you've created the datetime parameters, apply the transforms to your dataset and create a destination node for them. For more information about destination nodes, see [Export](data-wrangler-data-export.md).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different time range. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a datetime parameter in a Data Wrangler processing job, do the following:

1. From your Data Wrangler flow, choose **Create job**.

1. Select only the destination node that contains the transformations to the dataset containing the datetime parameters.

1. Choose **Configure job**.

1. Choose **Parameters**.

1. Choose the name of the datetime parameter that you've created.

1. For **Time range**, change the time range for the datasets.

1. Choose **Run**.

# Export
<a name="data-wrangler-data-export"></a>

In your Data Wrangler flow, you can export some or all of the transformations that you've made to your data processing pipelines.

A *Data Wrangler flow* is the series of data preparation steps that you've performed on your data. In your data preparation, you perform one or more transformations to your data. Each transformation is done using a transform step. The flow has a series of nodes that represent the import of your data and the transformations that you've performed. For an example of nodes, see the following image.

![\[Example data flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-0.png)


The preceding image shows a Data Wrangler flow with two nodes. The **Source - sampled** node shows the data source from which you've imported your data. The **Data types** node indicates that Data Wrangler has performed a transformation to convert the dataset into a usable format. 

Each transformation that you add to the Data Wrangler flow appears as an additional node. For information on the transforms that you can add, see [Transform Data](data-wrangler-transform.md). The following image shows a Data Wrangler flow that has a **Rename-column** node to change the name of a column in a dataset.

You can export your data transformations to the following:
+ Amazon S3
+ Pipelines
+ Amazon SageMaker Feature Store
+ Python Code

**Important**  
We recommend that you use the IAM `AmazonSageMakerFullAccess` managed policy to grant AWS permission to use Data Wrangler. If you don't use the managed policy, you can use an IAM policy that gives Data Wrangler access to an Amazon S3 bucket. For more information on the policy, see [Security and Permissions](data-wrangler-security.md).

When you export your data flow, you're charged for the AWS resources that you use. You can use cost allocation tags to organize and manage the costs of those resources. You create these tags for your user profile, and Data Wrangler automatically applies them to the resources used to export the data flow. For more information, see [Using Cost Allocation Tags](https://docs.aws.amazon.com//awsaccountbilling/latest/aboutv2/cost-alloc-tags.html).

## Export to Amazon S3
<a name="data-wrangler-data-export-s3"></a>

Data Wrangler gives you the ability to export your data to a location within an Amazon S3 bucket. You can specify the location using one of the following methods:
+ Destination node – Where Data Wrangler stores the data after processing it.
+ Export to – Exports the data resulting from a transformation to Amazon S3.
+ Export data – For small datasets, quickly exports the data that you've transformed.

Use the following sections to learn more about each of these methods.

------
#### [ Destination Node ]

If you want to output a series of data processing steps that you've performed to Amazon S3, you create a destination node. A *destination node* tells Data Wrangler where to store the data after you've processed it. After you create a destination node, you create a processing job to output the data. A *processing job* is an Amazon SageMaker processing job. When you use a destination node, the processing job runs the computational resources needed to output your transformed data to Amazon S3.

You can use a destination node to export some of the transformations or all of the transformations that you've made in your Data Wrangler flow.

You can use multiple destination nodes to export different transformations or sets of transformations. The following example shows two destination nodes in a single Data Wrangler flow.

![\[Example data flow showing two destination nodes in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-4.png)


You can use the following procedure to create destination nodes and export them to an Amazon S3 bucket.

To export your data flow, you create destination nodes and a Data Wrangler job to export the data. Creating a Data Wrangler job starts a SageMaker Processing job to export your flow. You can choose the destination nodes that you want to export after you've created them.
**Note**  
You can choose **Create job** in the Data Wrangler flow to view the instructions to use a processing job.

Use the following procedure to create destination nodes.

1. Choose the **+** next to the nodes that represent the transformations that you want to export.

1. Choose **Add destination**.  
![\[Example data flow showing how to add a destination in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-add-destination-0.png)

1. Choose **Amazon S3**.  
![\[Example dataflow showing how to add destination in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-add-destination-S3-selected.png)

1. Specify the following fields.
   + **Dataset name** – The name that you specify for the dataset that you're exporting.
   + **File type** – The format of the file that you're exporting.
   + **Delimiter** (CSV and Parquet files only) – The character used to separate values.
   + **Compression** (CSV and Parquet files only) – The compression method used to reduce the file size. You can use the following compression methods:
     + bzip2
     + deflate
     + gzip
   + (Optional) **Amazon S3 location** – The S3 location that you're using to output the files.
   + (Optional) **Number of partitions** – The number of files that the processing job writes as output.
   + (Optional) **Partition by column** – Writes separate output for each unique value in the column.
   + (Optional) **Inference Parameters** – Selecting **Generate inference artifact** applies all of the transformations that you've used in the Data Wrangler flow to data coming into your inference pipeline. The model in your pipeline makes predictions on the transformed data.

1. Choose **Add destination**.

Use the following procedure to create a processing job.

Create a job from the **Data flow** page and choose the destination nodes that you want to export.
**Note**  
You can choose **Create job** in the Data Wrangler flow to view the instructions for creating a processing job.

1. Choose **Create job**. The following image shows the pane that appears after you select **Create job**.  
![\[Example data flow create job pane in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-create-job.png)

1. For **Job name**, specify the name of the export job.

1. Choose the destination nodes that you want to export.

1. (Optional) Specify an AWS KMS key ARN. An AWS KMS key is a cryptographic key that you can use to protect your data. For more information about AWS KMS keys, see [AWS Key Management Service](https://docs.aws.amazon.com//kms/latest/developerguide/overview.html).

1. (Optional) Under **Trained parameters**, choose **Refit** if you've done the following:
   + Sampled your dataset
   + Applied a transform that uses your data to create a new column in the dataset

   For more information about refitting the transformations you've made to an entire dataset, see [Refit Transforms to The Entire Dataset and Export Them](#data-wrangler-data-export-fit-transform).
**Note**  
For image data, Data Wrangler exports the transformations that you've made to all of the images. Refitting the transformations isn't applicable to your use case.

1. Choose **Configure job**. The following image shows the **Configure job** page.  
![\[Example data flow configure job page in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-configure-job.png)

1. (Optional) Configure the Data Wrangler job. You can make the following configurations:
   + **Job configuration**
   + **Spark memory configuration**
   + **Network configuration**
   + **Tags**
   + **Parameters**
   + **Associate Schedules**

1. Choose **Run**.

------
#### [ Export to ]

As an alternative to using a destination node, you can use the **Export to** option to export your Data Wrangler flow to Amazon S3 using a Jupyter notebook. You can choose any data node in your Data Wrangler flow and export it. Exporting the data node exports the transformation that the node represents and the transformations that precede it.

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Amazon S3.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Amazon S3 (via Jupyter Notebook)**.

1. Run the Jupyter notebook.  
![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)

When you run the notebook, it exports your data flow (.flow file) in the same AWS Region as the Data Wrangler flow.

The notebook provides options that you can use to configure the processing job and the data that it outputs.

**Important**  
We provide you with job configurations to configure the output of your data. For the partitioning and driver memory options, we strongly recommend that you don't specify a configuration unless you already have knowledge about them.

Under **Job Configurations**, you can configure the following:
+ `output_content_type` – The content type of the output file. Uses `CSV` as the default format, but you can specify `Parquet`.
+ `delimiter` – The character used to separate values in the dataset when writing to a CSV file.
+ `compression` – If set, compresses the output file. Uses gzip as the default compression format.
+ `num_partitions` – The number of partitions or files that Data Wrangler writes as the output.
+ `partition_by` – The names of the columns that you use to partition the output.

To change the output file format from CSV to Parquet, change the value from `"CSV"` to `"Parquet"`. For the rest of the preceding fields, uncomment the lines containing the fields that you want to specify.

Under **(Optional) Configure Spark Cluster Driver Memory** you can configure Spark properties for the job, such as the Spark driver memory, in the `config` dictionary.

The following shows the `config` dictionary.

```
config = json.dumps({
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.memory": f"{driver_memory_in_mb}m",
    }
})
```

To apply the configuration to the processing job, uncomment the following lines:

```
# data_sources.append(ProcessingInput(
#     source=config_s3_uri,
#     destination="/opt/ml/processing/input/conf",
#     input_name="spark-config",
#     s3_data_type="S3Prefix",
#     s3_input_mode="File",
#     s3_data_distribution_type="FullyReplicated"
# ))
```

------
#### [ Export data ]

If you have a transformation on a small dataset that you want to export quickly, you can use the **Export data** method. When you choose **Export data**, Data Wrangler works synchronously to export the data that you've transformed to Amazon S3. You can't use Data Wrangler until either it finishes exporting your data or you cancel the operation.

For information on using the **Export data** method in your Data Wrangler flow, see the following procedure.

To use the **Export data** method:

1. Choose a node in your Data Wrangler flow by opening (double-clicking on) it.  
![\[Example data flow showing how to export data in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/export-s3.png)

1. Configure how you want to export the data.

1. Choose **Export data**.

------

When you export your data flow to an Amazon S3 bucket, Data Wrangler stores a copy of the flow file in the S3 bucket. It stores the flow file under the *data_wrangler_flows* prefix. If you use the default Amazon S3 bucket to store your flow files, it uses the following naming convention: `sagemaker-region-account number`. For example, if your account number is 111122223333 and you are using Studio Classic in us-east-1, your imported datasets are stored in `sagemaker-us-east-1-111122223333`. In this example, your .flow files created in us-east-1 are stored in `s3://sagemaker-us-east-1-111122223333/data_wrangler_flows/`.
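Using the example account and Region above, the default flow-file location can be sketched as follows; the Region and account number are the placeholder values from the example, not values from your account:

```python
# Build the default flow-file location described above from its parts.
region = "us-east-1"          # example Region
account_number = "111122223333"  # example account number

default_bucket = f"sagemaker-{region}-{account_number}"
flow_prefix = f"s3://{default_bucket}/data_wrangler_flows/"
print(flow_prefix)
```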

## Export to Pipelines
<a name="data-wrangler-data-export-pipelines"></a>

When you want to build and deploy large-scale machine learning (ML) workflows, you can use Pipelines to create workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs. You can also use the first-party algorithms that SageMaker AI offers with Pipelines. For more information, see [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).

When you export one or more steps from your data flow to Pipelines, Data Wrangler creates a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.

### Use a Jupyter Notebook to Create a Pipeline
<a name="data-wrangler-pipelines-notebook"></a>

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Pipelines.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Pipelines (via Jupyter Notebook)**.

1. Run the Jupyter notebook.

![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)


You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline includes the data processing steps that are defined by your Data Wrangler flow. 

You can add additional steps to your pipeline by adding steps to the `steps` list in the following code in the notebook:

```
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process], #Add more steps to this list to run in your Pipeline
)
```
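Because `steps` is an ordinary Python list, adding steps is plain list manipulation done before the `Pipeline` object is constructed. In the following hedged sketch, placeholder strings stand in for the SageMaker AI step objects; `step_train` is a hypothetical extra step, not something the notebook generates:

```python
# Placeholder names stand in for SageMaker AI pipeline step objects; in the
# generated notebook, step_process is the Data Wrangler processing step.
step_process = "DataWranglerProcessingStep"  # created by the notebook
step_train = "HypotheticalTrainingStep"      # an extra step you might add

steps = [step_process]
steps.append(step_train)  # runs after the processing step
print(steps)
```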

For more information on defining pipelines, see [Define SageMaker AI Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html).

## Export to an Inference Endpoint
<a name="data-wrangler-data-export-inference"></a>

Use your Data Wrangler flow to process data at the time of inference by creating a SageMaker AI serial inference pipeline from your Data Wrangler flow. An inference pipeline is a series of steps that results in a trained model making predictions on new data. A serial inference pipeline within Data Wrangler transforms the raw data and provides it to the machine learning model for a prediction. You create, run, and manage the inference pipeline from a Jupyter notebook within Studio Classic. For more information about accessing the notebook, see [Use a Jupyter Notebook to create an inference endpoint](#data-wrangler-inference-notebook).

Within the notebook, you can either train a machine learning model or specify one that you've already trained. You can use either Amazon SageMaker Autopilot or XGBoost to train the model using the data that you've transformed in your Data Wrangler flow.

The pipeline provides the ability to perform either batch or real-time inference. You can also add the Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see [Multi-model endpoints](multi-model-endpoints.md).

**Important**  
You can't export your Data Wrangler flow to an inference endpoint if it has the following transformations:
+ Join
+ Concatenate
+ Group by

If you must use the preceding transforms to prepare your data, use the following procedure.

1. Create a Data Wrangler flow.

1. Apply the preceding transforms that aren't supported.

1. Export the data to an Amazon S3 bucket.

1. Create a separate Data Wrangler flow.

1. Import the data that you've exported from the preceding flow.

1. Apply the remaining transforms.

1. Create a serial inference pipeline using the Jupyter notebook that we provide.

For information about exporting your data to an Amazon S3 bucket, see [Export to Amazon S3](#data-wrangler-data-export-s3). For information about opening the Jupyter notebook used to create the serial inference pipeline, see [Use a Jupyter Notebook to create an inference endpoint](#data-wrangler-inference-notebook).

Data Wrangler ignores transforms that remove data at the time of inference. For example, Data Wrangler ignores the [Handle Missing Values](data-wrangler-transform.md#data-wrangler-transform-handle-missing) transform if you use the **Drop missing** configuration.

If you've refit transforms to your entire dataset, the transforms carry over to your inference pipeline. For example, if you used the median value to impute missing values, the median value from refitting the transform is applied to your inference requests. You can either refit the transforms from your Data Wrangler flow when you're using the Jupyter notebook or when you're exporting your data to an inference pipeline. For information about refitting transforms, see [Refit Transforms to The Entire Dataset and Export Them](#data-wrangler-data-export-fit-transform).

The serial inference pipeline supports the following data types for the input and output strings. Each data type has a set of requirements.

**Supported datatypes**
+ `text/csv` – the datatype for CSV strings
  + The string can't have a header.
  + Features used for the inference pipeline must be in the same order as features in the training dataset.
  + There must be a comma delimiter between features.
  + Records must be delimited by a newline character.

  The following is an example of a validly formatted CSV string that you can provide in an inference request.

  ```
  abc,0.0,"Doe, John",12345\ndef,1.1,"Doe, Jane",67890                    
  ```
+ `application/json` – the datatype for JSON strings
  + The features used in the dataset for the inference pipeline must be in the same order as the features in the training dataset.
  + The data must have a specific schema. You define the schema as a single `instances` object that has a set of `features`. Each `features` object represents an observation.

  The following is an example of a validly formatted JSON string that you can provide in an inference request.

  ```
  {
      "instances": [
          {
              "features": ["abc", 0.0, "Doe, John", 12345]
          },
          {
              "features": ["def", 1.1, "Doe, Jane", 67890]
          }
      ]
  }
  ```
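For example, a request body matching the JSON schema above can be built with the standard library. The feature values are the example values from the document, and the feature order must match the training dataset, per the requirements above:

```python
import json

# Build the example request payload shown above: a single "instances"
# list in which each element holds one observation's "features".
payload = {
    "instances": [
        {"features": ["abc", 0.0, "Doe, John", 12345]},
        {"features": ["def", 1.1, "Doe, Jane", 67890]},
    ]
}
body = json.dumps(payload)
print(body)
```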

### Use a Jupyter Notebook to create an inference endpoint
<a name="data-wrangler-inference-notebook"></a>

Use the following procedure to export your Data Wrangler flow and create an inference pipeline using a Jupyter notebook.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **SageMaker AI Inference Pipeline (via Jupyter Notebook)**.

1. Run the Jupyter notebook.

When you run the Jupyter notebook, it creates an inference flow artifact. An inference flow artifact is a Data Wrangler flow file with additional metadata used to create the serial inference pipeline. The node that you're exporting encompasses all of the transforms from the preceding nodes.

**Important**  
Data Wrangler needs the inference flow artifact to run the inference pipeline. You can't use your own flow file as the artifact. You must create it by using the preceding procedure.

## Export to Python Code
<a name="data-wrangler-data-export-python-code"></a>

Use the following procedure to generate a Jupyter notebook and run it to export all steps in your data flow to a Python file that you can manually integrate into any data processing workflow.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Python Code**.

1. Run the Jupyter notebook.

![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)


You might need to configure the Python script to make it run in your pipeline. For example, if you're running a Spark environment, make sure that you are running the script from an environment that has permission to access AWS resources.

## Export to Amazon SageMaker Feature Store
<a name="data-wrangler-data-export-feature-store"></a>

You can use Data Wrangler to export features you've created to Amazon SageMaker Feature Store. A feature is a column in your dataset. Feature Store is a centralized store for features and their associated metadata. You can use Feature Store to create, share, and manage curated data for machine learning (ML) development. Centralized stores make your data more discoverable and reusable. For more information about Feature Store, see [Amazon SageMaker Feature Store](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).

A core concept in Feature Store is a feature group. A feature group is a collection of features, their records (observations), and associated metadata. It's similar to a table in a database.

You can use Data Wrangler to do one of the following:
+ Update an existing feature group with new records. A record is an observation in the dataset.
+ Create a new feature group from a node in your Data Wrangler flow. Data Wrangler adds the observations from your datasets as records in your feature group.

If you're updating an existing feature group, your dataset's schema must match the schema of the feature group. All the records in the feature group are replaced with the observations in your dataset.

You can use either a Jupyter notebook or a destination node to update your feature group with the observations in the dataset.

If your feature groups with the Iceberg table format have a custom offline store encryption key, make sure that you grant the IAM role that you're using for the Amazon SageMaker Processing job permissions to use it. At a minimum, you must grant it permissions to encrypt the data that you're writing to Amazon S3. To grant the permissions, give the IAM role the ability to use the [GenerateDataKey](https://docs.aws.amazon.com/kms/latest/APIReference/API_GenerateDataKey.html) operation. For more information about granting IAM roles permissions to use AWS KMS keys, see [Key policies in AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html).

------
#### [ Destination Node ]

If you want to output a series of data processing steps that you've performed to a feature group, you can create a destination node. When you create and run a destination node, Data Wrangler updates a feature group with your data. You can also create a new feature group from the destination node UI. After you create a destination node, you create an Amazon SageMaker processing job to output the data. When you're using a destination node, the processing job provisions the computational resources needed to output the data that you've transformed to the feature group.

You can use a destination node to export some of the transformations or all of the transformations that you've made in your Data Wrangler flow.

Use the following procedure to create a destination node that updates a feature group with the observations from your dataset.
**Note**  
You can choose **Create job** in the Data Wrangler flow to view the instructions for using a processing job to update the feature group.

1. Choose the **+** symbol next to the node containing the dataset that you'd like to export.

1. Under **Add destination**, choose **SageMaker AI Feature Store**.  
![\[Example data flow showing how to add destination in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/feature-store-destination-node-selection.png)

1. Choose (double-click) the feature group. Data Wrangler checks whether the schema of the feature group matches the schema of the data that you're using to update the feature group.

1. (Optional) Select **Export to offline store only** for feature groups that have both an online store and an offline store. This option only updates the offline store with observations from your dataset.

1. After Data Wrangler validates the schema of your dataset, choose **Add**.

Use the following procedure to create a new feature group with data from your dataset.

You can store your feature group in one of the following ways:
+ Online – Low-latency, high-availability cache for a feature group that provides real-time lookup of records. The online store allows quick access to the latest value for a record in a feature group.
+ Offline – Stores data for your feature group in an Amazon S3 bucket. You can store your data offline when you don't need low-latency (sub-second) reads. You can use an offline store for features used in data exploration, model training, and batch inference.
+ Both online and offline – Stores your data in both an online store and an offline store.

To create a feature group using a destination node, do the following.

1. Choose the **+** symbol next to the node containing the dataset that you'd like to export.

1. Under **Add destination**, choose **SageMaker AI Feature Store**.

1. Choose **Create Feature Group**.

1. In the following dialog box, if your dataset doesn't have an event time column, select **Create "EventTime" column**.

1. Choose **Next**.

1. Choose **Copy JSON Schema**. When you create a feature group, you paste the schema into the feature definitions.

1. Choose **Create**.

1. For **Feature group name**, specify a name for your feature group.

1. For **Description (optional)**, specify a description to make your feature group more discoverable.

1. To create a feature group for an online store, do the following.

   1. Select **Enable storage online**.

   1. For **Online store encryption key**, specify an AWS managed encryption key or an encryption key of your own.

1. To create a feature group for an offline store, do the following.

   1. Select **Enable storage offline**. Specify values for the following fields:
      + **S3 bucket name** – The name of the Amazon S3 bucket that stores the feature group.
      + (Optional) **Dataset directory name** – The Amazon S3 prefix that you're using to store the feature group.
      + **IAM Role ARN** – The IAM role that has access to Feature Store.
      + **Table Format** – Table format of your offline store. You can specify **Glue** or **Iceberg**. **Glue** is the default format.
      + **Offline store encryption key** – By default, Feature Store uses an AWS Key Management Service managed key, but you can use the field to specify a key of your own.


1. Choose **Continue**.

1. Choose **JSON**.

1. Remove the placeholder brackets in the window.

1. Paste the JSON text from Step 6.

1. Choose **Continue**.

1. For **RECORD IDENTIFIER FEATURE NAME**, choose the column in your dataset that has unique identifiers for each record in your dataset.

1. For **EVENT TIME FEATURE NAME**, choose the column with the timestamp values.

1. Choose **Continue**.

1. (Optional) Add tags to make your feature group more discoverable.

1. Choose **Continue**.

1. Choose **Create feature group**.

1. Navigate back to your Data Wrangler flow and choose the refresh icon next to the **Feature Group** search bar.

**Note**  
If you've already created a destination node for a feature group within a flow, you can't create another destination node for the same feature group. If you want to create another destination node for the same feature group, you must create another flow file.

Use the following procedure to create a Data Wrangler job from the **Data flow** page and choose the destination nodes that you want to export.

1. Choose **Create job**. The following image shows the pane that appears after you select **Create job**.

1. For **Job name**, specify the name of the export job.

1. Choose the destination nodes that you want to export.

1. (Optional) For **Output KMS Key**, specify an ARN, ID, or alias of an AWS KMS key. A KMS key is a cryptographic key. You can use the key to encrypt the output data from the job. For more information about AWS KMS keys, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. The following image shows the **Configure job** page with the **Job configuration** tab open.  
![\[Example data flow create job page in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-configure-job.png)

   (Optional) Under **Trained parameters**, choose **Refit** if you've done the following:
   + Sampled your dataset
   + Applied a transform that uses your data to create a new column in the dataset

   For more information about refitting the transformations you've made to an entire dataset, see [Refit Transforms to The Entire Dataset and Export Them](#data-wrangler-data-export-fit-transform).

1. Choose **Configure job**.

1. (Optional) Configure the Data Wrangler job. You can make the following configurations:
   + **Job configuration**
   + **Spark memory configuration**
   + **Network configuration**
   + **Tags**
   + **Parameters**
   + **Associate Schedules**

1. Choose **Run**.

------
#### [ Jupyter notebook ]

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Amazon SageMaker Feature Store.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Amazon SageMaker Feature Store (via Jupyter Notebook)**.

1. Run the Jupyter notebook.

![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)


Running the Jupyter notebook runs a Data Wrangler job, which in turn starts a SageMaker AI processing job. The processing job ingests the transformed data into your online and offline feature stores.

**Important**  
The IAM role you use to run this notebook must have the following AWS managed policies attached: `AmazonSageMakerFullAccess` and `AmazonSageMakerFeatureStoreAccess`.

You must enable at least one of the store types, online or offline, when you create a feature group. You can also enable both. To disable online store creation, set `EnableOnlineStore` to `False`:

```
# Online Store Configuration
online_store_config = {
    "EnableOnlineStore": False
}
```
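The offline store has a corresponding configuration block. The following is a hedged sketch using the Feature Store API's `OfflineStoreConfig` field names; the bucket URI is a placeholder, not a value that the notebook generates:

```python
# Hypothetical offline store configuration; the S3Uri below is a
# placeholder bucket, not a value from your account.
offline_store_config = {
    "S3StorageConfig": {
        "S3Uri": "s3://amzn-s3-demo-bucket/feature-store/"
    }
}
print(offline_store_config)
```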

The notebook uses the column names and types of the dataframe you export to create a feature group schema, which it then uses to create a feature group. A feature group is a group of features defined in the feature store to describe a record. A feature group definition is composed of a list of features, a record identifier feature name, an event time feature name, and configurations for its online store and offline store.

Each feature in a feature group can have one of the following types: *String*, *Fractional*, or *Integral*. If a column in your exported dataframe is not one of these types, it defaults to `String`. 

The following is an example of a feature group schema.

```
column_schema = [
    {
        "name": "Height",
        "type": "long"
    },
    {
        "name": "Input",
        "type": "string"
    },
    {
        "name": "Output",
        "type": "string"
    },
    {
        "name": "Sum",
        "type": "string"
    },
    {
        "name": "Time",
        "type": "string"
    }
]
```
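For illustration, mapping a column schema like the one above to Feature Store feature definitions might look like the following sketch. The `long`-to-`Integral` and default-to-`String` mapping follows the type rules described above; the mapping table and helper code are hypothetical, not part of the generated notebook:

```python
# Map dataframe column types to Feature Store feature types; columns with
# unrecognized types default to String, as described above.
TYPE_MAP = {"long": "Integral", "double": "Fractional", "string": "String"}

column_schema = [
    {"name": "Height", "type": "long"},
    {"name": "Input", "type": "string"},
]

feature_definitions = [
    {"FeatureName": col["name"],
     "FeatureType": TYPE_MAP.get(col["type"], "String")}
    for col in column_schema
]
print(feature_definitions)
```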

Additionally, you must specify a record identifier name and event time feature name:
+ The *record identifier name* is the name of the feature whose value uniquely identifies a record defined in the feature store. Only the latest record per identifier value is stored in the online store. The record identifier feature name must be the name of one of the feature definitions.
+ The *event time feature name* is the name of the feature that stores the `EventTime` of a record in a feature group. An `EventTime` is a point in time when a new event occurs that corresponds to the creation or update of a record in a feature. All records in the feature group must have a corresponding `EventTime`.

The notebook uses these configurations to create a feature group, process your data at scale, and then ingest the processed data into your online and offline feature stores. To learn more, see [Data Sources and Ingestion](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-ingest-data.html).

------


## Refit Transforms to The Entire Dataset and Export Them
<a name="data-wrangler-data-export-fit-transform"></a>

When you import data, Data Wrangler uses a sample of the data to apply the encodings. By default, Data Wrangler uses the first 50,000 rows as a sample, but you can import the entire dataset or use a different sampling method. For more information, see [Import](data-wrangler-import.md).

The following transformations use your data to create a column in the dataset:
+ [Encode Categorical](data-wrangler-transform.md#data-wrangler-transform-cat-encode)
+ [Featurize Text](data-wrangler-transform.md#data-wrangler-transform-featurize-text)
+ [Handle Outliers](data-wrangler-transform.md#data-wrangler-transform-handle-outlier)
+ [Handle Missing Values](data-wrangler-transform.md#data-wrangler-transform-handle-missing)

If you used sampling to import your data, the preceding transforms only use the data from the sample to create the column. The transform might not have used all of the relevant data. For example, if you use the **Encode Categorical** transform, there might have been a category in the entire dataset that wasn't present in the sample.

You can either use a destination node or a Jupyter notebook to refit the transformations to the entire dataset. When Data Wrangler exports the transformations in the flow, it creates a SageMaker Processing job. When the processing job finishes, Data Wrangler saves the following files in either the default Amazon S3 location or an S3 location that you specify:
+ The Data Wrangler flow file that specifies the transformations that are refit to the dataset
+ The dataset with the refit transformations applied to it

You can open a Data Wrangler flow file within Data Wrangler and apply the transformations to a different dataset. For example, if you've applied the transformations to a training dataset, you can open and use the Data Wrangler flow file to apply the transformations to a dataset used for inference.

For information about using destination nodes to refit transforms and export, see the following pages:
+ [Export to Amazon S3](#data-wrangler-data-export-s3)
+ [Export to Amazon SageMaker Feature Store](#data-wrangler-data-export-feature-store)

Use the following procedure to run a Jupyter notebook that refits the transformations to the entire dataset and exports the data.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose the location to which you're exporting the data.

1. For the `refit_trained_params` object, set `refit` to `True`.

1. For the `output_flow` field, specify the name of the output flow file with the refit transformations.

1. Run the Jupyter notebook.

## Create a Schedule to Automatically Process New Data
<a name="data-wrangler-data-export-schedule-job"></a>

If you're processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data. For more information about processing jobs, see [Export to Amazon S3](#data-wrangler-data-export-s3) and [Export to Amazon SageMaker Feature Store](#data-wrangler-data-export-feature-store).

When you create a job, you must specify an IAM role that has permissions to create the job. By default, the IAM role that you use to access Data Wrangler is the `SageMakerExecutionRole`.

The following permissions allow Data Wrangler to access EventBridge and allow EventBridge to run processing jobs:
+ Add the following AWS managed policy to the Amazon SageMaker Studio Classic execution role to provide Data Wrangler with permissions to use EventBridge:

  ```
  arn:aws:iam::aws:policy/AmazonEventBridgeFullAccess
  ```

  For more information about the policy, see [AWS managed policies for EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-use-identity-based.html#eb-full-access-policy).
+ Add the following policy to the IAM role that you specify when you create a job in Data Wrangler:

------
#### [ JSON ]


  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Effect": "Allow",
              "Action": "sagemaker:StartPipelineExecution",
              "Resource": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/data-wrangler-*"
          }
      ]
  }
  ```

------

  If you're using the default IAM role, you add the preceding policy to the Amazon SageMaker Studio Classic execution role.

  Add the following trust policy to the role to allow EventBridge to assume it.

  ```
  {
      "Effect": "Allow",
      "Principal": {
          "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
  }
  ```

**Important**  
When you create a schedule, Data Wrangler creates an `eventRule` in EventBridge. You incur charges for both the event rules that you create and the instances used to run the processing job.  
For information about EventBridge pricing, see [Amazon EventBridge pricing](https://aws.amazon.com/eventbridge/pricing/). For information about processing job pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can set a schedule using one of the following methods:
+ [CRON expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html)
**Note**  
Data Wrangler doesn't support the following expressions:
  + LW#
  + Abbreviations for days
  + Abbreviations for months
+ [RATE expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html#eb-rate-expressions)
+ Recurring – Set an hourly or daily interval to run the job.
+ Specific time – Set specific days and times to run the job.
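For reference, EventBridge schedule expressions take the following two general forms. These are generic examples of the expression syntax, not values produced by Data Wrangler:

```python
# Generic EventBridge schedule expressions (illustrative examples only).
# CRON format: cron(Minutes Hours Day-of-month Month Day-of-week Year)
cron_daily_1am = "cron(0 1 * * ? *)"  # every day at 01:00 UTC
rate_every_12h = "rate(12 hours)"     # every 12 hours

print(cron_daily_1am, rate_every_12h)
```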

The following sections provide procedures on creating jobs.

------
#### [ CRON ]

To create a schedule with a CRON expression, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **CRON**.

1. Specify a valid CRON expression.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The job runs only on the schedules that you specify.

1. Choose **Run**.

------
#### [ RATE ]

To create a schedule with a RATE expression, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Rate**.

1. For **Value**, specify an integer.

1. For **Unit**, select one of the following:
   + **Minutes**
   + **Hours**
   + **Days**

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The Data Wrangler job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The Data Wrangler job runs only on the schedules that you specify.

1. Choose **Run**.

------
#### [ Recurring ]

Use the following procedure to create a schedule that runs a job on a recurring basis.

To specify a recurring schedule, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, keep the default **Recurring** selection.

1. For **Every x hours**, specify the hourly frequency that the job runs during the day. Valid values are integers in the inclusive range of **1** and **23**.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.
**Note**  
The schedule resets every day. If you schedule a job to run every five hours, it runs at the following times during the day:  
00:00
05:00
10:00
15:00
20:00

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The Data Wrangler job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The Data Wrangler job runs only on the schedules that you specify.

1. Choose **Run**.
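The daily reset described in the note above can be sketched in a few lines of Python; the interval value is illustrative:

```python
# The schedule resets at 00:00 each day and repeats at the chosen
# hourly interval, so "every 5 hours" yields the times in the note above.
interval_hours = 5  # valid values are integers from 1 through 23
run_times = [f"{hour:02d}:00" for hour in range(0, 24, interval_hours)]
print(run_times)  # ['00:00', '05:00', '10:00', '15:00', '20:00']
```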

------
#### [ Specific time ]

Use the following procedure to create a schedule that runs a job at specific times.

To specify a schedule that runs at specific times, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – The Data Wrangler job runs immediately and subsequently runs on the schedules.
   + **Schedule only** – The Data Wrangler job runs only on the schedules that you specify.

1. Choose **Run**.

------

You can use Amazon SageMaker Studio Classic to view the jobs that are scheduled to run. Your processing jobs run within Pipelines; each processing job has its own pipeline and runs as a processing step within it. You can view the schedules that you've created within a pipeline. For information about viewing a pipeline, see [View the details of a pipeline](pipelines-studio-list.md).

Use the following procedure to view the jobs that you've scheduled.

To view the jobs you've scheduled, do the following.

1. Open Amazon SageMaker Studio Classic.

1. Open **Pipelines**.

1. View the pipelines for the jobs that you've created.

   The pipeline running the job uses the job name as a prefix. For example, if you've created a job named `housing-data-feature-engineering`, the name of the pipeline is `data-wrangler-housing-data-feature-engineering`.

1. Choose the pipeline containing your job.

1. View the status of the pipelines. Pipelines with a **Status** of **Succeeded** have run the processing job successfully.
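Based on the naming convention described above, you could derive the pipeline name from the job name with a small helper. The function below is a hypothetical illustration, not part of any SDK:

```python
# Hypothetical helper: the pipeline name is the job name
# with a "data-wrangler-" prefix, per the convention above.
def pipeline_name_for_job(job_name: str) -> str:
    return f"data-wrangler-{job_name}"

print(pipeline_name_for_job("housing-data-feature-engineering"))
# data-wrangler-housing-data-feature-engineering
```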

To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an event rule stops all the jobs associated with the schedule from running. For information about deleting a rule, see [Disabling or deleting an Amazon EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-delete-rule.html).

You can stop and delete the pipelines associated with the schedules as well. For information about stopping a pipeline, see [StopPipelineExecution](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopPipelineExecution.html). For information about deleting a pipeline, see [DeletePipeline](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeletePipeline.html#API_DeletePipeline_RequestSyntax).

# Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Classic Notebook to Get Data Insights
<a name="data-wrangler-interactively-prepare-data-notebook"></a>

Use the Data Wrangler data preparation widget to interact with your data, get visualizations, explore actionable insights, and fix data quality issues. 

You can access the data preparation widget from an Amazon SageMaker Studio Classic notebook. For each column, the widget creates a visualization that helps you better understand its distribution. If a column has data quality issues, a warning appears in its header.

To see the data quality issues, select the column header showing the warning. You can use the information that you get from the insights and the visualizations to apply the widget's built-in transformations to help you fix the issues. 

For example, the widget might detect that you have a column that only has one unique value and show you a warning. The warning provides the option to drop the column from the dataset.

## Getting started with running the widget
<a name="data-wrangler-interactively-prepare-data-notebook-getting-started"></a>

Use the following information to help you get started with running a notebook.

Open a notebook in Amazon SageMaker Studio Classic. For information about opening a notebook, see [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md).

**Important**  
To run the widget, the notebook must use one of the following images:  
Python 3 (Data Science) with Python 3.7
Python 3 (Data Science 2.0) with Python 3.8
Python 3 (Data Science 3.0) with Python 3.10
SparkAnalytics 1.0
SparkAnalytics 2.0
For more information about images, see [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md).

Use the following code to import the data preparation widget and pandas. The widget uses pandas dataframes to analyze your data.

```
import pandas as pd
import sagemaker_datawrangler
```

The following example code loads a file into the dataframe called `df`.

```
df = pd.read_csv("example-dataset.csv")
```

You can use a dataset in any format that you can load as a pandas dataframe object. For more information about pandas formats, see [IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

The following cell runs the `df` variable to start the widget.

```
df
```

The top of the dataframe has the following options:
+ **View the Pandas table** – Switches between the interactive visualization and a pandas table.
+ **Use all of the rows in your dataset to compute the insights. Using the entire dataset might increase the time it takes to generate the insights.** – If you don't select the option, Data Wrangler computes the insights for the first 10,000 rows of the dataset.

The dataframe shows the first 1000 rows of the dataset. Each column header has a stacked bar chart that shows the column's characteristics. It shows the proportion of valid values, invalid values, and missing values. You can hover over the different portions of the stacked bar chart to get the calculated percentages.
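The proportions shown in each header's stacked bar chart can be approximated with plain pandas; the following is a sketch of the idea (not the widget's internal logic), using a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 28, None],
    "city": ["Austin", "Reno", None, "Reno"],
})

# Fraction of missing values per column, similar to what the
# stacked bar chart in each column header conveys.
missing_ratio = df.isna().mean()
print(missing_ratio["age"])   # 0.5
print(missing_ratio["city"])  # 0.25
```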

Each column has a visualization in the header. The following shows the types of visualizations the columns can have:
+ Categorical – Bar chart
+ Numeric – Histogram
+ Datetime – Bar chart
+ Text – Bar chart

For each visualization, the data preparation widget highlights outliers in orange.

When you choose a column, a side panel opens showing the **Insights** tab. The pane provides a count for the following types of values:
+ Invalid values – Values whose type doesn’t match the column type.
+ Missing values – Values that are missing, such as `NaN` or `None`.
+ Valid values – Values that are neither missing nor invalid.

For numeric columns, the **Insights** tab shows the following summary statistics:
+ Minimum – The smallest value.
+ Maximum – The largest value.
+ Mean – The mean of the values.
+ Mode – The value that appears most frequently.
+ Standard deviation – The standard deviation of the values.

For categorical columns, the **Insights** tab shows the following summary statistics:
+ Unique values – The number of unique values in the column.
+ Top – The value that appears most frequently.
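If you want to verify what the **Insights** tab reports, the same summary statistics can be computed directly with pandas. A minimal sketch with illustrative sample data:

```python
import pandas as pd

numeric = pd.Series([3, 1, 4, 1, 5])
categorical = pd.Series(["cat", "dog", "cat"])

# Numeric column statistics, as listed above.
print(numeric.min(), numeric.max(), numeric.mean())  # 1 5 2.8
print(numeric.mode()[0])        # 1 (most frequent value)
print(numeric.std())            # sample standard deviation

# Categorical column statistics.
print(categorical.nunique())    # 2 (unique values)
print(categorical.mode()[0])    # cat (top value)
```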

The columns that have warning icons in their headers have data quality issues. Choosing a column opens a **Data quality** tab that you can use to find transforms to help you fix the issue. A warning has one of the following severity levels:
+ Low – Issues that might not affect your analysis, but can be useful to fix.
+ Medium – Issues that are likely to affect your analysis, but are likely not critical to fix.
+ High – Severe issues that we strongly recommend fixing.

**Note**  
The widget sorts the column to show the values that have data quality issues at the top of the dataframe. It also highlights the values that are causing the issues. The color of the highlighting corresponds to the severity level.

Under **SUGGESTED TRANSFORMS**, you can choose a transform to fix the data quality issue. The widget can offer multiple transforms that can fix the issue. It can offer recommendations for the transforms that are best suited to the problem. You can move your cursor over the transform to get more information about it.

To apply a transform to the dataset, choose **Apply and export code**. The transform modifies the dataset and updates the visualization with modified values. The code for the transform appears in the following cell of the notebook. If you apply additional transforms to the dataset, the widget appends the transforms to the cell. You can use the code that the widget generates to do the following:
+ Customize it to better fit your needs.
+ Use it in your own workflows.

You can reproduce all the transforms you've made by rerunning all of the cells in the notebook.

The widget can provide insights and warnings for the target column. The target column is the column that you're trying to predict. Use the following procedure to get target column insights.

To get target column insights, do the following.

1. Choose the column that you're using as the target column.

1. Choose **Select as target column**.

1. Choose the problem type. The widget's insights and warnings are tailored to the problem types. The following are the problem types:
   + **Classification** – The target column has categorical data.
   + **Regression** – The target column has numeric data.

1. Choose **Run**.

1. (Optional) Under **Target Column Insights**, choose one of the suggested transforms.

## Reference for the insights and transforms in the widget
<a name="data-wrangler-notebook-dataprep-assistant-reference"></a>

For feature columns (columns that aren't the target column), you can get the following insights to warn you about issues with your dataset.
+ **Missing values** – The column has missing values such as `None`, `NaN` (not a number), or `NaT` (not a timestamp). Many machine learning algorithms don’t support missing values in the input data. Filling them in or dropping the rows with missing data is therefore a crucial data preparation step. If you see the missing values warning, you can use one of the following transforms to correct the issue.
  + **Drop missing** – Drops rows with missing values. We recommend dropping rows when the percentage of rows with missing data is small and imputing the missing values isn't appropriate. 
  + **Replace with new value** – Replaces textual missing values with `Other`. You can change `Other` to a different value in the output code. Replaces numeric missing values with 0.
  + **Replace with mean** – Replaces missing values with the mean of the column.
  + **Replace with median** – Replaces missing values with the median of the column.
  + **Drop column** – Drops the column with missing values from the dataset. We recommend dropping the entire column when there's a high percentage of rows with missing data.
+ **Disguised missing values** – The column has disguised missing values. A disguised missing value is a value that isn't explicitly encoded as a missing value. For example, instead of using a `NaN` to indicate a missing value, the value could be `Placeholder`. You can use one of the following transforms to handle the missing values:
  + **Drop missing** – Drops rows with missing values
  + **Replace with new value** – Replaces textual missing values with `Other`. You can change `Other` to a different value in the output code. Replaces numeric missing values with 0.
+ **Constant column** – The column only has one value. It therefore has no predictive power. We strongly recommend using the **Drop column** transform to drop the column from the dataset.
+ **ID column** – The column has no repeating values. All of the values in the column are unique. They might be either IDs or database keys. Without additional information, the column has no predictive power. We strongly recommend using the **Drop column** transform to drop the column from the dataset.
+ **High cardinality** – The column has a high percentage of unique values. High cardinality limits the predictive power of categorical columns. Examine the importance of the column in your analysis and consider using the **Drop column** transform to drop it.
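The code that the widget generates for these transforms is standard pandas. The following sketch shows roughly equivalent operations; the column names and values are illustrative, not the widget's generated output:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 30.0, None],
    "notes": ["ok", None, "ok", "bad"],
})

dropped = df.dropna(subset=["price"])                     # Drop missing
filled_mean = df["price"].fillna(df["price"].mean())      # Replace with mean
filled_median = df["price"].fillna(df["price"].median())  # Replace with median
replaced = df["notes"].fillna("Other")                    # Replace with new value
no_price = df.drop(columns=["price"])                     # Drop column

print(len(dropped))          # 2
print(filled_mean.tolist())  # [10.0, 20.0, 30.0, 20.0]
```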

For the target column, you can get the following insights to warn you about issues with your dataset. You can use the suggested transformation provided with the warning to correct the issue.
+ **Mixed data types in target (Regression)** – There are some non-numeric values in the target column. There might be data entry errors. We recommend removing the rows that have the values that can't be converted.
+ **Frequent label** – Certain values in the target column appear more frequently than what would be normal in the context of regression. There might be an error in data collection or processing. A frequently appearing category might indicate that either the value is used as a default value or that it’s a placeholder for missing values. We recommend using the **Replace with new value** transform to replace the missing values with `Other`.
+ **Too few instances per class** – The target column has categories that appear rarely. Some of the categories don't have enough rows for the target column to be useful. You can use one of the following transforms:
  + **Drop rare target** – Drops unique values with fewer than ten observations. For example, drops the value `cat` if it appears nine times in the column.
  + **Replace rare target** – Replaces categories that appear rarely in the dataset with the value `Other`.
+ **Classes too imbalanced (multi-class classification)** – There are categories in the dataset that appear much more frequently than the other categories. The class imbalance might affect prediction accuracy. For the most accurate predictions possible, we recommend updating the dataset with rows that have the categories that currently appear less frequently.
+ **Large amount of classes/too many classes** – There's a large number of classes in the target column. Having many classes might result in longer training times or poor predictive quality. We recommend doing one of the following:
  + Grouping some of the categories into their own category. For example, if six categories are closely related, we recommend using a single category for them.
  + Using an ML algorithm that's resilient to multiple categories.
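The rare-category transforms above can be sketched in pandas; the ten-observation threshold mirrors the description of **Drop rare target**, and the data is made up:

```python
import pandas as pd

target = pd.Series(["dog"] * 12 + ["cat"] * 9 + ["bird"] * 2)

counts = target.value_counts()
rare = counts[counts < 10].index  # categories with fewer than ten observations

# Replace rare target: map infrequent categories to "Other".
replaced = target.where(~target.isin(rare), "Other")
print(sorted(replaced.unique()))  # ['Other', 'dog']

# Drop rare target: remove rows whose category is rare.
kept = target[~target.isin(rare)]
print(len(kept))  # 12
```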

# Security and Permissions
<a name="data-wrangler-security"></a>

When you query data from Athena or Amazon Redshift, the queried dataset is automatically stored in the default SageMaker AI S3 bucket for the AWS Region in which you are using Studio Classic. Additionally, when you export a Jupyter Notebook from Amazon SageMaker Data Wrangler and run it, your data flows, or .flow files, are saved to the same default bucket, under the prefix *data\_wrangler\_flows*.

For high-level security needs, you can configure a bucket policy that restricts the AWS roles that have access to this default SageMaker AI S3 bucket. Use the following section to add this type of policy to an S3 bucket. To follow the instructions on this page, use the AWS Command Line Interface (AWS CLI). To learn how, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in the AWS Command Line Interface User Guide.

Additionally, you need to grant each IAM role that uses Data Wrangler permissions to access required resources. If you do not require granular permissions for the IAM role you use to access Data Wrangler, you can add the IAM managed policy, [https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess), to an IAM role that you use to create your Studio Classic user. This policy grants you full permission to use Data Wrangler. If you require more granular permissions, refer to the section, [Grant an IAM Role Permission to Use Data Wrangler](#data-wrangler-security-iam-policy).

## Add a Bucket Policy To Restrict Access to Datasets Imported to Data Wrangler
<a name="data-wrangler-security-bucket-policy"></a>

You can add an Amazon S3 bucket policy to the S3 bucket that contains your Data Wrangler resources. Resources that Data Wrangler uploads to your default SageMaker AI S3 bucket in the AWS Region in which you are using Studio Classic include the following:
+ Queried Amazon Redshift results. These are stored under the *redshift/* prefix.
+ Queried Athena results. These are stored under the *athena/* prefix. 
+ The .flow files uploaded to Amazon S3 when you run an exported Jupyter Notebook that Data Wrangler produces. These are stored under the *data\_wrangler\_flows/* prefix.

Use the following procedure to create an S3 bucket policy that you can add to restrict IAM role access to that bucket. To learn how to add a policy to an S3 bucket, see [How do I add an S3 Bucket policy?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/add-bucket-policy.html).

**To set up a bucket policy on the S3 bucket that stores your Data Wrangler resources:**

1. Configure one or more IAM roles that you want to be able to access Data Wrangler.

1. Open a command prompt or shell. For each role that you create, replace *role-name* with the name of the role and run the following:

   ```
   $ aws iam get-role --role-name role-name
   ```

   In the response, you see a `RoleId` string which begins with `AROA`. Copy this string. 

1. Add the following policy to the SageMaker AI default bucket in the AWS Region in which you are using Data Wrangler. Replace *region* with the AWS Region in which the bucket is located, and *account-id* with your AWS account ID. Replace the `userId` values starting with *AROAEXAMPLEID* with the IDs of the AWS roles to which you want to grant permission to use Data Wrangler.

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Deny",
         "Principal": "*",
         "Action": "s3:*",
         "Resource": [
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/*",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena/*",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift/*"
           
         ],
         "Condition": {
           "StringNotLike": {
             "aws:userId": [
               "AROAEXAMPLEID_1:*",
               "AROAEXAMPLEID_2:*"
             ]
           }
         }
       }
     ]
   }
   ```

------

## Create an Allowlist for Data Wrangler
<a name="data-wrangler-security-allowlist"></a>

Whenever a user starts running Data Wrangler from the Amazon SageMaker Studio Classic user interface, they make a call to the SageMaker AI application programming interface (API) to create a Data Wrangler application.

Your organization might not provide your users with permissions to make those API calls by default. To provide permissions, you must create and attach a policy to the user's IAM roles using the following policy template: [Data Wrangler Allow List Example](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/DataWranglerAllowListExample.txt).

**Note**  
The preceding policy example only gives your users access to the Data Wrangler application.

For information about creating a policy, see [Creating policies on the JSON tab](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html#access_policies_create-json-editor). When you're creating a policy, copy and paste the JSON policy from [Data Wrangler Allow List Example](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/DataWranglerAllowListExample.txt) in the **JSON** tab.

**Important**  
Delete any IAM policies that prevent users from running the following operations:  
[CreateApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateApp.html)
[DescribeApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeApp.html)
If you don't delete the policies, your users could still be affected by them.

After you've created the policy using the template, attach it to the IAM roles of your users. For information about attaching a policy, see [Adding IAM identity permissions (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console).

## Grant an IAM Role Permission to Use Data Wrangler
<a name="data-wrangler-security-iam-policy"></a>

You can grant an IAM role permission to use Data Wrangler with the general IAM managed policy, [https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess). This is a general policy that includes [permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-AmazonSageMakerFullAccess.html) required to use all SageMaker AI services. This policy grants an IAM role full access to Data Wrangler. You should be aware of the following when using `AmazonSageMakerFullAccess` to grant access to Data Wrangler:
+ If you import data from Amazon Redshift, the **Database User** name must have the prefix `sagemaker_access`.
+ This managed policy only grants permission to access buckets with one of the following in the name: `SageMaker`, `Sagemaker`, `sagemaker`, or `aws-glue`. If you want to use Data Wrangler to import from an S3 bucket without these phrases in the name, refer to the last section on this page to learn how to grant permission to an IAM entity to access your S3 buckets.

If you have high-security needs, you can attach the policies in this section to an IAM entity to grant permissions required to use Data Wrangler.

If you have datasets in Amazon Redshift or Athena that an IAM role needs to import from Data Wrangler, you must add a policy to that entity to access these resources. The following policies are the most restrictive policies you can use to give an IAM role permission to import data from Amazon Redshift and Athena. 

To learn how to attach a custom policy to an IAM role, refer to [Managing IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage.html#create-managed-policy-console) in the IAM User Guide.

**Policy example to grant access to an Athena dataset import**

The following policy assumes that the IAM role has permission to access the underlying S3 bucket where data is stored through a separate IAM policy.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:StartQueryExecution",
                "athena:StopQueryExecution"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable"
            ],
            "Resource": [
                "arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
                "arn:aws:glue:*:*:table/sagemaker_featurestore/*",
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables"
            ],
            "Resource": [
                "arn:aws:glue:*:*:table/*",
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:GetDatabase"
            ],
            "Resource": [
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/sagemaker_featurestore",
                "arn:aws:glue:*:*:database/sagemaker_processing",
                "arn:aws:glue:*:*:database/default",
                "arn:aws:glue:*:*:database/sagemaker_data_wrangler"
            ]
        }
    ]
}
```

------

**Policy example to grant access to an Amazon Redshift dataset import**

The following policy grants permission to set up an Amazon Redshift connection to Data Wrangler using database users that have the prefix `sagemaker_access` in the name. To grant permission to connect using additional database users, add additional entries under `"Resource"` in the following policy. The following policy assumes that the IAM role has permission to access the underlying S3 bucket where data is stored through a separate IAM policy, if applicable.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift-data:ExecuteStatement",
                "redshift-data:DescribeStatement",
                "redshift-data:CancelStatement",
                "redshift-data:GetStatementResult",
                "redshift-data:ListSchemas",
                "redshift-data:ListTables"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:dbuser:*/sagemaker_access*",
                "arn:aws:redshift:*:*:dbname:*"
            ]
        }
    ]
}
```

------

**Policy to grant access to an S3 bucket**

If your dataset is stored in Amazon S3, you can grant an IAM role permission to access this bucket with a policy similar to the following. This example grants programmatic read-write access to the bucket named *test*.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::test"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::test/*"]
    }
  ]
}
```

------

To import data from Athena and Amazon Redshift, you must grant an IAM role permission to access the following prefixes under the default Amazon S3 bucket in the AWS Region in which Data Wrangler is being used: `athena/`, `redshift/`. If a default Amazon S3 bucket does not already exist in the AWS Region, you must also give the IAM role permission to create a bucket in that Region.

Additionally, if you want the IAM role to be able to use the Amazon SageMaker Feature Store, Pipelines, and Data Wrangler job export options, you must grant access to the prefix `data_wrangler_flows/` in this bucket.

 Data Wrangler uses the `athena/` and `redshift/` prefixes to store preview files and imported datasets. To learn more, see [Imported Data Storage](data-wrangler-import.md#data-wrangler-import-storage).

Data Wrangler uses the `data_wrangler_flows/` prefix to store .flow files when you run a Jupyter Notebook exported from Data Wrangler. To learn more, see [Export](data-wrangler-data-export.md).

Use a policy similar to the following to grant the permissions described in the preceding paragraphs.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/*",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena/*",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::sagemaker-us-east-1-111122223333"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation"
            ],
            "Resource": "*"
        }
    ]
}
```

------

You can also access data in your Amazon S3 bucket from another AWS account by specifying the Amazon S3 bucket URI. To do this, the IAM policy that grants access to the Amazon S3 bucket in the other account should use a policy similar to the following example, where `BucketFolder` is the specific directory in the user's bucket `UserBucket`. This policy should be added by the user granting access to their bucket for another user.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::UserBucket/BucketFolder/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::UserBucket",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "BucketFolder/*"
                    ]
                }
            }
        }
    ]
}
```

------

The user that is accessing the bucket (not the bucket owner) must add a policy similar to the following example to their user. Note that `111122223333` and `TestUser` in the following example refer to the bucket owner's account and user, respectively.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:user/TestUser"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::UserBucket/BucketFolder/*"
            ]
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:user/TestUser"
            },
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::UserBucket"
            ]
        }
    ]
}
```

------

**Policy example to grant access to use SageMaker AI Studio**

Use a policy similar to the following to create an IAM execution role that can be used to set up a Studio Classic instance.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeDomain",
                "sagemaker:ListDomains",
                "sagemaker:DescribeUserProfile",
                "sagemaker:ListUserProfiles",
                "sagemaker:*App",
                "sagemaker:ListApps"
            ],
            "Resource": "*"
        }
    ]
}
```

------

## Snowflake and Data Wrangler
<a name="data-wrangler-security-snowflake"></a>

All permissions for AWS resources are managed via your IAM role attached to your Studio Classic instance. The Snowflake administrator manages Snowflake-specific permissions, as they can grant granular permissions and privileges to each Snowflake user. This includes databases, schemas, tables, warehouses, and storage integration objects. You must ensure that the correct permissions are set up outside of Data Wrangler. 

Note that the Snowflake `COPY INTO Amazon S3` command moves data from Snowflake to Amazon S3 over the public internet by default, but data in transit is secured using SSL. Data at rest in Amazon S3 is encrypted with SSE-KMS using the default AWS KMS key.

With respect to Snowflake credentials storage, Data Wrangler does not store customer credentials. Data Wrangler uses Secrets Manager to store the credentials in a secret and rotates secrets as part of a best practice security plan. The Snowflake or Studio Classic administrator needs to ensure that the data scientist’s Studio Classic execution role is granted permission to perform `GetSecretValue` on the secret storing the credentials. If already attached to the Studio Classic execution role, the `AmazonSageMakerFullAccess` policy has the necessary permissions to read secrets created by Data Wrangler and secrets created by following the naming and tagging convention in the instructions above. Secrets that do not follow the conventions must be separately granted access. We recommend using Secrets Manager to prevent sharing credentials over unsecured channels; however, note that a logged-in user can retrieve the plain-text password by launching a terminal or Python notebook in Studio Classic and then invoking API calls from the Secrets Manager API. 
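For illustration, a minimal inline policy statement like the following sketch could grant an execution role `secretsmanager:GetSecretValue` on a single secret, if you prefer not to attach `AmazonSageMakerFullAccess`. The secret ARN shown is a placeholder, not a real secret.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-data-wrangler-secret-*"
        }
    ]
}
```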

## Data Encryption with AWS KMS
<a name="data-wrangler-security-kms"></a>

Within Data Wrangler, you can decrypt encrypted files and add them to your Data Wrangler flow. You can also encrypt the output of the transforms using either a default AWS KMS key or one that you provide.

You can import files if they have the following:
+ server-side encryption
+ SSE-KMS as the encryption type

To decrypt the file and import it to a Data Wrangler flow, you must add the SageMaker Studio Classic role that you're using as a key user.

The following screenshot shows a Studio Classic user role added as a key user. To make this change, open [IAM Roles](https://console.aws.amazon.com/iam/home#/roles) and access the roles under the left panel.

![\[The Key users section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/data-wrangler-kms.png)


### Amazon S3 customer managed key setup for Data Wrangler imported data storage
<a name="data-wrangler-s3-cmk-setup"></a>

By default, Data Wrangler uses Amazon S3 buckets that have the following naming convention: `sagemaker-<region>-<account number>`. For example, if your account number is `111122223333` and you are using Studio Classic in `us-east-1`, your imported datasets are stored in a bucket named `sagemaker-us-east-1-111122223333`.
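As a quick sanity check, the default bucket name can be derived from the Region and account ID. The following Python sketch (the function name is illustrative, not part of any SageMaker API) shows the convention:

```python
def default_data_wrangler_bucket(region: str, account_id: str) -> str:
    """Build the default SageMaker bucket name for a given Region and account ID."""
    return f"sagemaker-{region}-{account_id}"

# Example: account 111122223333 using Studio Classic in us-east-1.
print(default_data_wrangler_bucket("us-east-1", "111122223333"))
# sagemaker-us-east-1-111122223333
```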

The following instructions explain how to set up a customer managed key for your default Amazon S3 bucket.

1. To enable server-side encryption and set up a customer managed key for your default S3 bucket, see [Using KMS Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html).

1. After following step 1, navigate to AWS KMS in your AWS Management Console. Find the customer managed key that you selected in step 1 and add the Studio Classic role as a key user. To do this, follow the instructions in [Allows key users to use a customer managed key](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users).

### Encrypting the Data That You Export
<a name="data-wrangler-export-kms"></a>

You can encrypt the data that you export using one of the following methods:
+ Specifying that your Amazon S3 bucket objects use SSE-KMS encryption.
+ Specifying an AWS KMS key to encrypt the data that you export from Data Wrangler.

On the **Export data** page, specify a value for the **AWS KMS key ID or ARN**.

For more information about using AWS KMS keys, see [Protecting Data Using Server-Side Encryption with AWS KMS keys Stored in AWS Key Management Service (SSE-KMS)](https://docs.aws.amazon.com//AmazonS3/latest/userguide/UsingKMSEncryption.html).

## Amazon AppFlow Permissions
<a name="data-wrangler-appflow-permissions"></a>

When you're performing a transfer, you must specify an IAM role that has permissions to perform the transfer. You can use the same IAM role that has permissions to use Data Wrangler. By default, the IAM role that you use to access Data Wrangler is the `SageMakerExecutionRole`.

The IAM role must have the following permissions:
+ Permissions to Amazon AppFlow
+ Permissions to the AWS Glue Data Catalog
+ Permissions for AWS Glue to discover the data sources that are available

When you run a transfer, Amazon AppFlow stores metadata from the transfer in the AWS Glue Data Catalog. Data Wrangler uses the metadata from the catalog to determine whether the data is available for you to query and import.

To add permissions to Amazon AppFlow, add the `AmazonAppFlowFullAccess` AWS managed policy to the IAM role. For more information about adding policies, see [Adding or removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

If you're transferring data to Amazon S3, you must also attach the following policy.

------
#### [ JSON ]


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketTagging",
        "s3:ListBucketVersions",
        "s3:CreateBucket",
        "s3:ListBucket",
        "s3:GetBucketPolicy",
        "s3:PutEncryptionConfiguration",
        "s3:GetEncryptionConfiguration",
        "s3:PutBucketTagging",
        "s3:GetObjectTagging",
        "s3:GetBucketOwnershipControls",
        "s3:PutObjectTagging",
        "s3:DeleteObject",
        "s3:DeleteBucket",
        "s3:DeleteObjectTagging",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketPolicyStatus",
        "s3:PutBucketPublicAccessBlock",
        "s3:PutAccountPublicAccessBlock",
        "s3:ListAccessPoints",
        "s3:PutBucketOwnershipControls",
        "s3:PutObjectVersionTagging",
        "s3:DeleteObjectVersionTagging",
        "s3:GetBucketVersioning",
        "s3:GetBucketAcl",
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetAccountPublicAccessBlock",
        "s3:ListAllMyBuckets",
        "s3:GetAnalyticsConfiguration",
        "s3:GetBucketLocation"
      ],
      "Resource": "*"
    }
  ]
}
```

------

To add AWS Glue permissions, add the `AWSGlueConsoleFullAccess` managed policy to the IAM role. For more information about AWS Glue permissions with Amazon AppFlow, see [link-to-appflow-page].

Amazon AppFlow needs to access AWS Glue and Data Wrangler for you to import the data that you've transferred. To grant Amazon AppFlow access, add the following trust policy to the IAM role.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:root",
                "Service": [
                    "appflow.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

------

To display the Amazon AppFlow data in Data Wrangler, add the following policy to the IAM role:

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
```

------

## Using Lifecycle Configurations in Data Wrangler
<a name="data-wrangler-lifecycle-configuration"></a>

You might have an Amazon EC2 instance that is configured to run Kernel Gateway applications, but not the Data Wrangler application. Kernel Gateway applications provide access to the environment and the kernels that you use to run Studio Classic notebooks and terminals. The Data Wrangler application is the UI application that runs Data Wrangler. Amazon EC2 instances that aren't Data Wrangler instances require a modification to their lifecycle configurations to run Data Wrangler. Lifecycle configurations are shell scripts that automate the customization of your Amazon SageMaker Studio Classic environment.

For more information about lifecycle configurations, see [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md).

The default lifecycle configuration for your instance doesn't support using Data Wrangler. You can make the following modifications to the default configuration to use Data Wrangler with your instance.

```
#!/bin/bash
set -eux

# Check whether this instance already has the Data Wrangler package.
STATUS=$(
python3 -c "import sagemaker_dataprep"
echo $?
)

if [ "$STATUS" -eq 0 ]; then
    echo 'Instance is of Type Data Wrangler'
else
    echo 'Instance is not of Type Data Wrangler'

    # Replace this with the URL of your git repository
    export REPOSITORY_URL="https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples.git"

    git -C /root clone "$REPOSITORY_URL"
fi
```

You can save the script as `lifecycle_configuration.sh`.
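If you create the lifecycle configuration through the API or AWS CLI rather than the console, the script content is passed base64-encoded. The following stdlib-only Python sketch shows that encoding step; it assumes the script content shown above and doesn't call any AWS API itself.

```python
import base64

def encode_lifecycle_script(script: str) -> str:
    """Base64-encode a lifecycle configuration script for use with the SageMaker API."""
    return base64.b64encode(script.encode("utf-8")).decode("utf-8")

script = "#!/bin/bash\nset -eux\n"
encoded = encode_lifecycle_script(script)

# The encoded content round-trips back to the original script.
assert base64.b64decode(encoded).decode("utf-8") == script
```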

You attach the lifecycle configuration to your Studio Classic domain or user profile. For more information about creating and attaching a lifecycle configuration, see [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md).

You might run into errors when you're creating or attaching a lifecycle configuration. For information about debugging lifecycle configuration errors, see [KernelGateway app failure](studio-lcc-debug.md#studio-lcc-debug-kernel).

# Release Notes
<a name="data-wrangler-release-notes"></a>

Data Wrangler is regularly updated with new features and bug fixes. To upgrade the version of Data Wrangler you are using in Studio Classic, follow the instructions in [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).


| Date | Release notes |
| --- | --- |
| **8/31/2023** | New functionality: You can now create a Data Quality and Insights report on your entire dataset. For more information, see [Get Insights On Data and Data Quality](data-wrangler-data-insights.md). |
| **5/20/2023** | New functionality: You can now import your data from Salesforce Data Cloud. For more information, see [Import data from Salesforce Data Cloud](data-wrangler-import.md#data-wrangler-import-salesforce-data-cloud). |
| **4/18/2023** | New functionality: You can now get your data in a format that Amazon Personalize can interpret. For more information, see [Map Columns for Amazon Personalize](data-wrangler-transform.md#data-wrangler-transform-personalize). |
| **3/1/2023** | New functionality: You can now use Hive to import your data from Amazon EMR. For more information, see [Import data from Amazon EMR](data-wrangler-import.md#data-wrangler-emr). |
| **12/10/2022** | New functionality: You can now export your Data Wrangler flow to an inference endpoint. For more information, see [Export to an Inference Endpoint](data-wrangler-data-export.md#data-wrangler-data-export-inference). New functionality: You can now use an interactive notebook widget for data preparation. For more information, see [Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Classic Notebook to Get Data Insights](data-wrangler-interactively-prepare-data-notebook.md). New functionality: You can now import data from SaaS platforms. For more information, see [Import Data From Software as a Service (SaaS) Platforms](data-wrangler-import.md#data-wrangler-import-saas). |
| **10/12/2022** | New functionality: You can now reuse data flows for different datasets. For more information, see [Reusing Data Flows for Different Datasets](data-wrangler-parameterize.md). |
| **10/05/2022** | New functionality: You can now use Principal Component Analysis (PCA) as a transform. For more information, see [Reduce Dimensionality within a Dataset](data-wrangler-transform.md#data-wrangler-transform-dimensionality-reduction). |
| **10/05/2022** | New functionality: You can now refit parameters in your Data Wrangler flow. For more information, see [Export](data-wrangler-data-export.md). |
| **10/03/2022** | New functionality: You can now deploy models from your Data Wrangler flow. For more information, see [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md). |
| **9/20/2022** | New functionality: You can now set data retention periods in Athena. For more information, see [Import data from Athena](data-wrangler-import.md#data-wrangler-import-athena). |
| **6/9/2022** | New functionality: You can now use Amazon SageMaker Autopilot to train a model directly from your Data Wrangler flow. For more information, see [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md). |
| **5/6/2022** | New functionality: You can now use additional m5 and r5 instances. For more information, see [Instances](data-wrangler-data-flow.md#data-wrangler-data-flow-instances). |
| **4/27/2022** | New functionalities: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) |
| **4/1/2022** | New functionality: You can now use Databricks as a data source. For more information, see [Import data from Databricks (JDBC)](data-wrangler-import.md#data-wrangler-databricks). |
| **2/2/2022** | New functionalities: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) |
| **10/16/2021** | New functionality: Data Wrangler now supports Athena workgroups. For more information, see [Import data from Athena](data-wrangler-import.md#data-wrangler-import-athena). |
| **10/6/2021** | New functionality: Data Wrangler now supports transforming time series data. For more information, see [Transform Time Series](data-wrangler-transform.md#data-wrangler-transform-time-series). |
| **7/15/2021** | New functionalities: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) Enhancements: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) Bug Fixes: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) |
| **4/26/2021** | Enhancements: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) Bug Fixes: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) |
| **2/8/2021** | New Functionalities: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) Enhancements: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) Bug Fixes: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html) |

# Troubleshoot
<a name="data-wrangler-trouble-shooting"></a>

If an issue arises when using Amazon SageMaker Data Wrangler, we recommend you do the following:
+ If an error message is provided, read the message and resolve the issue it reports if possible.
+ Make sure the IAM role of your Studio Classic user has the required permissions to perform the action. For more information, see [Security and Permissions](data-wrangler-security.md).
+ If the issue occurs when you are trying to import from another AWS service, such as Amazon Redshift or Athena, make sure that you have configured the necessary permissions and resources to perform the data import. For more information, see [Import](data-wrangler-import.md).
+ If you're still having issues, choose **Get help** at the top right of your screen to reach out to the Data Wrangler team. For more information, see the following images.  
![\[The location of the Data Wrangler help form in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/get-help/get-help.png)  
![\[The Data Wrangler help form in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/get-help/get-help-forms.png)

As a last resort, you can try restarting the kernel on which Data Wrangler is running. 

1. Save and exit the .flow file for which you want to restart the kernel. 

1. Select the **Running Terminals and Kernels** icon, as shown in the following image.  
![\[The location of the Running Terminals and Kernels icon in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/stop-kernel-option.png)

1. Select the **Stop** icon to the right of the .flow file for which you want to terminate the kernel, as shown in the following image.  
![\[The location of the Stop icon in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/stop-kernel.png)

1. Refresh the browser. 

1. Reopen the .flow file on which you were working. 

## Troubleshooting issues with Amazon EMR
<a name="data-wrangler-trouble-shooting-emr"></a>

Use the following information to help you troubleshoot errors that might come up when you're using Amazon EMR.
+ Connection failure – If the connection fails with the error message `The IP address of the EMR cluster isn't private`, your Amazon EMR cluster might not have been launched in a private subnet. As a security best practice, Data Wrangler only supports connecting to private Amazon EMR clusters. Choose a private EC2 subnet when you launch an EMR cluster.
+ Connection hanging and timing out – The issue is most likely due to a network connectivity issue. After you start connecting to the cluster, the screen doesn't refresh. After about 2 minutes, you might see the following error at the top of the screen: `JdbcAddConnectionError: An error occurred when trying to connect to presto: xxx: Connect to xxx failed: Connection timed out (Connection timed out)`.

  The errors might have two root causes:
  + The Amazon EMR cluster and Amazon SageMaker Studio Classic are in different VPCs. We recommend launching both Amazon EMR and Studio Classic in the same VPC. You can also use VPC peering. For more information, see [What is VPC peering?](https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html).
  + The Amazon EMR master security group lacks the inbound traffic rule for the security group of Amazon SageMaker Studio Classic on the port used for Presto. To resolve the issue, allow inbound traffic on port 8889.
+ Connection fails due to the connection type being misconfigured – You might see the following error message: `Data Wrangler couldn't create a connection to {connection_source} successfully. Try connecting to {connection_source} again. For more information, see Troubleshoot. If you're still experiencing issues, contact support.`

  Check the authentication method. The authentication method that you've specified in Data Wrangler should match the authentication method that you're using on the cluster.
+ You don't have HDFS permissions for LDAP authentication – To resolve the issue, see [Set up HDFS Permissions using Linux Credentials](https://docs.aws.amazon.com/whitepapers/latest/teaching-big-data-skills-with-amazon-emr/set-up-hdfs-permissions-using-linux-credentials.html). After logging in to the cluster, you can set up the permissions using the following commands:

  ```
  hdfs dfs -mkdir /user/USERNAME
  hdfs dfs -chown USERNAME:USERNAME /user/USERNAME
  ```
+ LDAP authentication missing connection key error – You might see the following error message: `Data Wrangler couldn't connect to EMR hive successfully. JDBC connection is missing required connection key(s): PWD`.

  For LDAP authentication, you must specify both a username and a password. The JDBC URL stored in Secrets Manager is missing property `PWD`.
+ Troubleshooting the LDAP configuration – We recommend making sure that the LDAP authenticator (LDAP server) is correctly configured to connect to the Amazon EMR cluster. Use the `ldapwhoami` command to help you resolve the configuration issue. The following are example commands that you can run:
  + For LDAPS – `ldapwhoami -x -H ldaps://ldap-server`
  + For LDAP – `ldapwhoami -x -H ldap://ldap-server`

  Either command should return `Anonymous` if you've configured the authenticator successfully.
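To help confirm whether Studio Classic can reach the cluster on the Presto port (8889, as noted above), you could run a quick TCP check from a Studio Classic terminal or notebook. This is a generic stdlib-only Python sketch, not a Data Wrangler feature; the host name in the comment is a placeholder.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the Presto port on the EMR primary node (placeholder host name).
# print(can_connect("ip-10-0-0-1.ec2.internal", 8889))
```

If the check returns `False`, revisit the VPC and security group configuration described above.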

## Troubleshooting with Salesforce
<a name="data-wrangler-troubleshooting-salesforce-data-cloud"></a>

### Lifecycle configuration error
<a name="data-wrangler-troubleshooting-salesforce-lcc-debug-data-cloud"></a>

When your user opens Studio Classic for the first time, they might get an error saying that there's something wrong with their lifecycle configuration. Use Amazon CloudWatch to access the logs written by your lifecycle configuration script. For more information about debugging lifecycle configurations, see [Debug Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-debug.md).

If you aren't able to debug the error, you can create the configuration file manually. You must create the file every time you delete or restart the Jupyter server. Use the following procedure to create the file manually.

**To create a configuration file**

1. Navigate to Studio Classic.

1. Choose **File**, then **New**, then **Terminal**.

1. Create a file named `.sfgenie_identity_provider_oauth_config`.

1. Open the file in a text editor.

1. Add a JSON object containing the Amazon Resource Name (ARN) of the Secrets Manager secret to the file. You can use the following template to create the object.

   ```
   {
     "secret_arn": "example-secret-ARN"
   }
   ```

1. Save your changes to the file.

### Unable to access Salesforce Data Cloud from the Data Wrangler flow
<a name="data-wrangler-troubleshooting-salesforce-datacloud-access"></a>

After your user chooses **Salesforce Data Cloud** from your Data Wrangler flow, they might get an error indicating that the prerequisites to set up the connection haven't been met. The error might be caused by the following issues:
+ The Salesforce secret in Secrets Manager hasn't been created.
+ The Salesforce secret in Secrets Manager has been created, but it's missing the Salesforce tag.
+ The Salesforce secret in Secrets Manager has been created in the wrong AWS Region. For example, your user won't be able to access the Salesforce Data Cloud in `ca-central-1` because you've created the secret in `us-east-1`. You can either replicate the secret to `ca-central-1` or create a new secret with the same credentials in `ca-central-1`. For information about replicating secrets, see [Replicate an AWS Secrets Manager secret to other AWS Regions](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create-manage-multi-region-secrets.html).
+ The policy that your users are using to access Amazon SageMaker Studio Classic is missing permissions for AWS Secrets Manager.
+ There's a typo in the Secrets Manager ARN of the JSON object that you've specified through your lifecycle configuration.
+ There's a typo in the Secrets Manager secret containing your Salesforce OAuth configuration.
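One quick way to catch the wrong-Region case above is to compare the Region embedded in the secret ARN with the Region where Studio Classic runs. The following stdlib-only Python sketch relies on the standard ARN layout (`arn:partition:service:region:account:resource`); the ARN shown is a placeholder.

```python
def arn_region(arn: str) -> str:
    """Extract the Region field from an AWS ARN (arn:partition:service:region:account:resource)."""
    return arn.split(":")[3]

secret_arn = "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-salesforce-secret"
print(arn_region(secret_arn))
# us-east-1
```

If the printed Region doesn't match your Studio Classic Region, replicate the secret or recreate it in the correct Region as described above.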

### Blank page showing `redirect_uri_mismatch`
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-blank-page"></a>

After your users choose **Save and Connect**, they might get redirected to a page that shows `redirect_uri_mismatch`. The callback URI that you've registered in your Salesforce Connected App settings is either missing or incorrect.

Use the following URL to check that your Studio Classic URL is correctly registered in your Salesforce org's Connected App settings: `https://EXAMPLE_SALESFORCE_ORG/lightning/setup/NavigationMenus/home/`.

**Note**  
It takes roughly ten minutes to propagate the URI within Salesforce's systems.

### Shared spaces
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-shared-spaces"></a>

Shared spaces don't currently work with the Salesforce Data Cloud integration. You can either delete the shared spaces in the Amazon SageMaker AI domain that you intend to use, or you can use another domain that doesn't have shared spaces set up.

### OAuth Redirect Error
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-oauth-error"></a>

Your users should be able to import their data from the Salesforce Data Cloud after they choose **Connect**. If they're running into an error, we recommend asking them to do the following:
+ Tell them to be patient – When they get redirected back to Amazon SageMaker Studio Classic, it can take up to a minute to complete the authentication process. While they're getting redirected, we recommend telling them to avoid interacting with the browser. For example, they shouldn't close the browser tab, switch to another tab, or interact with the Data Wrangler flow. Interacting with the browser might remove the authorization code required to connect to the data cloud.
+ Have your users reconnect to the data cloud – There are transient issues that can cause a connection to the Salesforce Data Cloud to fail. Have your users create a new Data Wrangler flow and try connecting to the Salesforce Data Cloud again.
+ Make sure your users close all other tabs with Amazon SageMaker Studio Classic – Having Studio Classic open in multiple tabs can cause the Salesforce Data Cloud connection to fail. Make sure your users only have one Studio Classic tab open.
+ Multiple users accessing Studio Classic at the same time – Only one user should access an Amazon SageMaker AI domain at a time. If multiple users access the same domain, the connection that a user is trying to create to the Salesforce Data Cloud might fail.

Updating both Data Wrangler and Studio Classic might also fix their error. For information about updating Data Wrangler, see [Update Data Wrangler](data-wrangler-update.md). For information about updating Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

If none of the preceding troubleshooting steps work, you might find an error message from Salesforce with a corresponding description embedded in the Studio Classic URL. The following is an example of a message you could find: `error=invalid_client_id&error_description=client%20identifier%20invalid`.

You can look at the error message in the URL and try to address the issues it presents. If the error message or description is unclear, we recommend searching the Salesforce Knowledge Base. If searching the knowledge base doesn't work, you can reach out to the Salesforce help desk for more assistance.

### Data Wrangler takes a long time to load
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-long-load-time"></a>

When your users are getting redirected back to Data Wrangler from the Salesforce Data Cloud, they might experience long load times.

If this is the user's first time using Data Wrangler or they've deleted the kernel, it might take about 5 minutes to provision the new Amazon EC2 instance to use Data Wrangler.

If this isn't the user's first time using Data Wrangler and they haven't deleted the kernel, you can ask them to refresh the page or close as many browser tabs as possible.

If none of the preceding interventions work, have them set up a new connection to the Salesforce Data Cloud.

### User fails to export their data with an `Invalid batch Id` error
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-processing-job-fails-batch-id"></a>

When your user exports the transformations that they've made to their Salesforce data, the SageMaker Processing job that Data Wrangler uses on the backend might fail. The Salesforce Data Cloud might be temporarily unavailable or there could be a caching issue.

To address the issue, we recommend having your users go back to the step where they're importing the data and change the order of the columns that they're querying. For example, they can change the following query:

```
SELECT col_A, col_B FROM table
```

To the following query:

```
SELECT col_B, col_A FROM table
```

After they've changed the order of the columns and made sure that the subsequent transformations they've made are still valid, they can start exporting their data again.

### Users can't export a very large dataset
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-processing-job-fails-query"></a>

If your users imported a very large dataset from the Salesforce Data Cloud, they might not be able to export the transformations that they've made. A dataset might be too large because it has too many rows or because the query that produced it is complex.

We recommend having your users take the following actions:
+ Simplifying their SQL query
+ Sampling their data

The following are some strategies that they can use to simplify their queries:
+ Specifying column names instead of using the `*` operator
+ Querying a smaller subset of the data instead of importing the entire dataset
+ Minimizing joins between very large datasets

They can use sampling to reduce the number of rows in their dataset. For information about sampling methods, your users can refer to [Sampling](data-wrangler-transform.md#data-wrangler-transform-sampling).
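Outside of Data Wrangler's built-in Sampling transform, a quick way to illustrate row reduction is a reproducible random sample with pandas, which users can apply in a notebook before re-running heavy transformations. This is a minimal sketch; the column name and row counts are illustrative only.

```python
import pandas as pd

def sample_rows(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Return a random sample of up to n rows (reproducible via the seed)."""
    return df.sample(n=min(n, len(df)), random_state=seed)

# Illustrative data; in practice df would hold the imported Salesforce rows.
df = pd.DataFrame({"col_A": range(10_000)})
sampled = sample_rows(df, 1_000)
```

Fixing the seed keeps the sample stable across runs, so subsequent transformations behave consistently while users iterate.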

### Users can't export data due to invalid refresh token
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-processing-job-fails-invalid-token"></a>

Data Wrangler uses a JDBC driver to integrate with the Salesforce Data Cloud. The method for authentication is OAuth. For OAuth, the refresh token and the access token are two different pieces of data that are used to authorize access to resources within your Salesforce Data Cloud.

The access token, or core token, is what allows you to access your Salesforce data and run queries directly through Data Wrangler. It's short lived and designed to expire quickly. To maintain access to your Salesforce data, Data Wrangler uses the refresh token to get a new access token from Salesforce.

You might have set the refresh token to expire too quickly for your users to get a new access token. Revisit your refresh token policy to make sure that it can accommodate queries that take a long time to run. For information about configuring your refresh token policy, see `https://EXAMPLE_SALESFORCE_ORG_URL/lightning/setup/ConnectedApplication/home/`.
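Data Wrangler performs the token exchange itself, but when debugging you can check whether a refresh token is still valid by running the standard OAuth 2.0 refresh-token grant against your org's token endpoint yourself. The sketch below assumes the conventional Salesforce token endpoint path (`/services/oauth2/token`); all credential values are placeholders.

```python
import json
import urllib.parse
import urllib.request

def build_refresh_payload(client_id: str, client_secret: str, refresh_token: str) -> bytes:
    """Form-encode the standard OAuth 2.0 refresh-token grant parameters."""
    return urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode()

def refresh_access_token(token_url: str, client_id: str, client_secret: str,
                         refresh_token: str) -> dict:
    """POST the grant to the token endpoint and return the JSON response."""
    request = urllib.request.Request(
        token_url,
        data=build_refresh_payload(client_id, client_secret, refresh_token),
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

If the call returns an `invalid_grant` error instead of a new access token, the refresh token has expired or been revoked, which points to the token policy rather than to Data Wrangler.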

### Queries failing or tables not loading
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-table-not-loading"></a>

Salesforce experiences service outages. Even if you’ve configured everything correctly, your users might not be able to import their data for periods of time.

Service outages can happen for maintenance reasons. We recommend checking back the following day to see if the issue has been resolved.

If you’re experiencing issues for more than a day, we recommend contacting Salesforce’s help desk for further assistance. For information about contacting Salesforce, see [How would you like to contact Salesforce?](https://www.salesforce.com/company/contact-us/)

### `OAUTH_APP_BLOCKED` during Studio Classic redirect
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-oauth-app-blocked"></a>

When your user gets redirected back to Amazon SageMaker Studio Classic, they might notice the query parameter `error=OAUTH_APP_BLOCKED` within the URL. They might be experiencing a transient issue that should resolve itself within a day.

It's possible that you've blocked their access to the Connected App as well. For information about resolving the issue, see `https://EXAMPLE_SALESFORCE_ORG_URL/lightning/setup/ConnectedApplication/home/`.

### `OAUTH_APP_ACCESS_DENIED` during Studio Classic redirect
<a name="data-wrangler-troubleshooting-salesforce-data-cloud-oauth-app-access-denied"></a>

When your user gets redirected back to Amazon SageMaker Studio Classic, they might notice the query parameter `error=OAUTH_APP_ACCESS_DENIED` within the URL. This means that you haven't given their profile type permission to access the Connected App associated with Data Wrangler.

To resolve their access issue, navigate to `https://EXAMPLE_SALESFORCE_ORG_URL/lightning/setup/ManageUsers/home/` and check whether the user is assigned to the correct profile.

# Increase Amazon EC2 Instance Limit
<a name="data-wrangler-increase-instance-limit"></a>

You might see the following error message when you're using Data Wrangler: `The following instance type is not available: ml.m5.4xlarge. Try selecting a different instance below.`

The message can indicate that you need to select a different instance type, but it can also indicate that you don't have enough Amazon EC2 instances to successfully run Data Wrangler on your workflow. You can increase the number of instances by using the following procedure.

To increase the number of instances, do the following.

1. Open the AWS Management Console.

1. In the search bar, specify **Service Quotas**.

1. Choose **Service Quotas**.

1. Choose **AWS services**.

1. In the search bar, specify **Amazon SageMaker AI**.

1. Choose **Amazon SageMaker AI**.

1. Under **Service quotas**, specify **Studio KernelGateway Apps running on *ml.m5.4xlarge* instance**.
**Note**  
ml.m5.4xlarge is the default instance type for Data Wrangler. You can use other instance types and request quota increases for them. For more information, see [Instances](data-wrangler-data-flow.md#data-wrangler-data-flow-instances).

1. Select **Studio KernelGateway Apps running on *ml.m5.4xlarge* instance**.

1. Choose **Request quota increase**.

1. For **Change quota value**, specify a value greater than **Applied quota value**.

1. Choose **Request**.

If your request is approved, AWS sends a notification to the email address associated with your account. You can also check the status of your request by choosing **Quota request history** on the **Service Quotas** page. Processed requests have a **Status** of **Closed**.
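If you prefer to script the request, the same quota increase can be submitted through the Service Quotas API, for example with boto3. This is a sketch under assumptions: the instance-type fragment and desired value are placeholders, and the calls require AWS credentials with Service Quotas permissions.

```python
def find_quota(quotas: list, name_fragment: str):
    """Return the first quota whose QuotaName contains the given fragment, else None."""
    for quota in quotas:
        if name_fragment in quota["QuotaName"]:
            return quota
    return None

def request_increase(instance_type: str = "ml.m5.4xlarge", desired_value: float = 4.0) -> None:
    import boto3  # imported here so the helper above stays dependency-free

    client = boto3.client("service-quotas")
    quotas = []
    # Page through all SageMaker quotas to locate the one for this instance type.
    for page in client.get_paginator("list_service_quotas").paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    quota = find_quota(quotas, instance_type)
    if quota is not None:
        client.request_service_quota_increase(
            ServiceCode="sagemaker",
            QuotaCode=quota["QuotaCode"],
            DesiredValue=desired_value,
        )
```

As in the console flow, the request still goes through AWS approval; the API call only submits it.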

# Update Data Wrangler
<a name="data-wrangler-update"></a>

To update Data Wrangler to the latest release, first shut down the corresponding KernelGateway app from the Amazon SageMaker Studio Classic control panel. After the KernelGateway app is shut down, restart it by opening a new or existing Data Wrangler flow in Studio Classic. When you open a new or existing Data Wrangler flow, the kernel that starts contains the latest version of Data Wrangler.

**Update your Studio Classic and Data Wrangler instance**

1. Navigate to your [SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **SageMaker AI**, and then choose **Studio Classic**.

1. Choose your user name.

1. Under **Apps**, in the row displaying the **App name**, choose **Delete app** for the app that starts with `sagemaker-data-wrang`, and for the JupyterServer app.

1. Choose **Yes, delete app**.

1. Type `delete` in the confirmation box.

1. Choose **Delete**.

1. Reopen your Studio Classic instance. When you begin to create a Data Wrangler flow, your instance now uses the latest version of Data Wrangler.
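The app deletions in the steps above can also be scripted with the SageMaker `list_apps` and `delete_app` API calls, for example with boto3. This is a sketch under assumptions: the domain ID and user profile name are placeholders, and the `sagemaker-data-wrang` name prefix mirrors the console behavior described above.

```python
def apps_to_delete(apps: list) -> list:
    """Pick the Data Wrangler KernelGateway app and the JupyterServer app."""
    selected = []
    for app in apps:
        if app.get("Status") == "Deleted":
            continue  # already gone; nothing to delete
        if app["AppType"] == "JupyterServer":
            selected.append(app)
        elif app["AppType"] == "KernelGateway" and app["AppName"].startswith("sagemaker-data-wrang"):
            selected.append(app)
    return selected

def delete_data_wrangler_apps(domain_id: str, user_profile_name: str) -> None:
    import boto3  # imported here so the helper above stays dependency-free

    client = boto3.client("sagemaker")
    apps = client.list_apps(
        DomainIdEquals=domain_id, UserProfileNameEquals=user_profile_name
    )["Apps"]
    for app in apps_to_delete(apps):
        client.delete_app(
            DomainId=domain_id,
            UserProfileName=user_profile_name,
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
```

After the apps are deleted, reopening Studio Classic starts fresh apps on the latest Data Wrangler version, just as in the console procedure.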

Alternatively, if you are using an older version of the Data Wrangler application and you have an existing Data Wrangler flow open, you are prompted to update your Data Wrangler application version in the Studio Classic UI. The following screenshot shows this prompt.

**Important**  
This updates the Data Wrangler kernel gateway app only. You still need to shut down the JupyterServer app in your user account. To do this, follow the preceding steps.

![\[The Update Data Wrangler section in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/data-wrangler-1click-restart.png)


You can also choose **Remind me later**, in which case an **Update** button appears in the top-right corner of the screen.

![\[The location of the Update in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/data-wrangler-1click-restart-update.png)


# Shut Down Data Wrangler
<a name="data-wrangler-shut-down"></a>

When you are not using Data Wrangler, it is important to shut down the instance on which it runs to avoid incurring additional fees. 

To avoid losing work, save your data flow before shutting Data Wrangler down. To save your data flow in Studio Classic, choose **File** and then choose **Save Data Wrangler Flow**. Data Wrangler automatically saves your data flow every 60 seconds. 

**To shut down the Data Wrangler instance in Studio Classic**

1. In Studio Classic, select the **Running Instances and Kernels** icon (![\[Icon of a gear or cog symbol representing settings or configuration options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio_classic_dw_instances.png)).

1. Under **RUNNING APPS** is the **sagemaker-data-wrangler-1.0** app. Select the shutdown icon (![\[Power button icon with a circular shape and vertical line symbol.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Shutdown_light.png)) next to this app.

   Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from **RUNNING INSTANCES** when you shut down the Data Wrangler app.

**Important**  
If you open Data Wrangler again, an Amazon EC2 instance starts running the application and you will be charged for the compute. In addition to compute, you are also charged for the storage that you use. For example, you're charged for any Amazon S3 buckets that you're using with Data Wrangler.  
If you find that you're still getting charged for Data Wrangler after shutting down your applications, there's a Jupyter extension that you can use to automatically shut down idle sessions. For information about the extension, see [SageMaker-Studio-Autoshutdown-Extension](https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension).

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes. 
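To confirm that no apps are still running (and still incurring compute charges) after a shutdown, you can list a user's apps and check their status, for example with the SageMaker `list_apps` API through boto3. This is a sketch; the domain ID and user profile name are placeholders.

```python
def billable_apps(apps: list) -> list:
    """Return the apps that are still in service and therefore still billed for compute."""
    return [app for app in apps if app.get("Status") == "InService"]

def audit_user_apps(domain_id: str, user_profile_name: str) -> list:
    import boto3  # imported here so the helper above stays dependency-free

    client = boto3.client("sagemaker")
    apps = client.list_apps(
        DomainIdEquals=domain_id, UserProfileNameEquals=user_profile_name
    )["Apps"]
    return billable_apps(apps)
```

An empty result means every app for that user has been shut down; any `InService` entries identify apps that still need to be stopped.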