Data preparation with SQL in Studio - Amazon SageMaker

Data preparation with SQL in Studio

Amazon SageMaker Studio provides a built-in SQL extension. This extension allows data scientists to perform tasks such as sampling, exploratory analysis, and feature engineering directly within their JupyterLab notebooks. It leverages AWS Glue connections to maintain a centralized data source catalog. The catalog stores metadata about various data sources. Through this SQL environment, data scientists can browse data catalogs, explore their data, author complex SQL queries, and further process the results in Python.

This section walks through configuring the SQL extension in Studio. It describes the capabilities enabled by this SQL integration and provides instructions for running SQL queries in JupyterLab notebooks.

To enable SQL data analysis, administrators must first configure AWS Glue connections to the relevant data sources. These connections allow data scientists to seamlessly access authorized datasets from within JupyterLab.

In addition to the administrator-configured AWS Glue connections, the SQL extension allows individual data scientists to create their own data source connections. These user-created connections can be managed independently and scoped to the user's profile through tag-based access control policies. This dual-level connection model - with both administrator-configured and user-created connections - provides data scientists with broader access to the data they need for their analysis and modeling tasks. Users can set up the necessary connections to their own data sources within the JupyterLab environment user interface (UI), without relying solely on the centralized connections established by the administrator.

Important

The user-defined connections creation capability is available as a set of standalone libraries in PyPI. To use this functionality, you need to install the following libraries in your JupyterLab environment:

You can install these libraries by running the following commands in your JupyterLab terminal:

pip install amazon-sagemaker-sql-editor>=0.1.13 pip install amazon-sagemaker-sql-execution>=0.1.6 pip install amazon-sagemaker-sql-magic>=0.1.3

After installing the libraries, you will need to restart the JupyterLab server for the changes to take effect.

restart-jupyter-server

With access set up, JupyterLab users can:

  • View and browse pre-configured data sources.

  • Search, filter, and inspect database information elements such as tables, schemas, and columns.

  • Auto-generate the connection parameters to a data source.

  • Create complex SQL queries using the syntax-highlighting, auto-completion, and SQL formatting features of the extension's SQL editor.

  • Run SQL statements from JupyterLab notebook cells.

  • Retrieve the results of SQL queries as pandas DataFrames for further processing, visualization, and other machine learning tasks.

You can access the extension by choosing the SQL extension icon ( Icon of the SQL extension feature in JupyterLab. ) in the left navigation pane of your JupyterLab application in Studio. Hovering over the icon displays its Data Discovery tool tip.

Important
  • The JupyterLab image in SageMaker Studio contains the SQL extension by default, starting with SageMaker Distribution 1.6. The extension works with Python and SparkMagic kernels only.

  • The extension's user interface for exploring connections and data is only available in JupyterLab within Studio. It is compatible with Amazon Redshift, Amazon Athena, and Snowflake.