Data connections in Amazon SageMaker Lakehouse - Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio is in preview release and is subject to change.

Data connections in Amazon SageMaker Lakehouse

Amazon SageMaker Lakehouse provides a unified approach to managing data connections across AWS services and enterprise applications. These connections provide a consistent experience for creating, testing, and exploring data sources, regardless of the underlying data platform.

Capabilities

With Amazon SageMaker Lakehouse connections, you can do the following:

  • Create connections to a variety of data sources, including databases and data lakes

  • Manage data connections in a single place

  • Test the connectivity of your data sources to ensure they are working as expected

  • Browse the metadata and preview the data from your connected sources

  • Reuse the same connection across different AWS services, such as AWS Glue, Amazon Athena, and Amazon SageMaker

  • Manage credentials using AWS Secrets Manager

  • Authenticate using methods such as basic authentication, OAuth2, and IAM

Supported data sources

Amazon SageMaker Lakehouse connections support several popular data sources, including the following:

Supported Data Sources

Data Source                  Type
Amazon DynamoDB (preview)    Database
Amazon Redshift              Database
Google BigQuery              Database
MySQL                        Database
PostgreSQL                   Database
Snowflake (preview)          Database

Using Amazon SageMaker Lakehouse connections

After you've created an Amazon SageMaker Lakehouse connection, you can use it in various AWS services:

  • Amazon SageMaker Unified Studio: Browse metadata, preview sample data, and run SQL queries against the connected data.

  • AWS Glue: Use the connection for ETL jobs and crawlers.

  • Amazon Athena: Query data directly using Athena's federated query capabilities.

  • Amazon SageMaker: Access data for building machine learning models.
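As an illustration of the Athena option above, the following is a minimal sketch that previews a few rows from a connected source through its Athena data catalog. The function names, the catalog, database, and table names, and the S3 output location are all hypothetical placeholders. The sketch assumes that boto3 is installed and that AWS credentials for the project's account are configured. Only `build_preview_query` runs without AWS access.

```python
def quote_identifier(identifier: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def build_preview_query(catalog: str, database: str, table: str, limit: int = 10) -> str:
    """Build a SELECT that reads a small sample through a named data catalog."""
    qualified = ".".join(quote_identifier(part) for part in (catalog, database, table))
    return f"SELECT * FROM {qualified} LIMIT {int(limit)}"


def start_preview(catalog: str, database: str, table: str,
                  output_location: str, region: str) -> str:
    """Submit the preview query to Athena and return the query execution ID.

    Assumption: `catalog` is the Athena data catalog that corresponds to the
    connection, and `output_location` is an S3 URI you are allowed to write to.
    """
    import boto3  # imported lazily so the query builder works without boto3

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(
        QueryString=build_preview_query(catalog, database, table),
        QueryExecutionContext={"Catalog": catalog, "Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical names, shown only to illustrate the generated SQL.
    print(build_preview_query("my_catalog", "sales", "orders", limit=5))
```

Quoting each identifier separately keeps catalog or table names that contain hyphens or mixed case from being misparsed. Whether your workgroup already defines a query result location, which would make `ResultConfiguration` unnecessary, depends on your setup.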

Understanding created AWS resources

When you create a connection in Amazon SageMaker Unified Studio, several resources are created in your AWS account(s) behind the scenes. These resources can include:

  • AWS Glue connection - A connection object is created in the AWS Glue Data Catalog. This stores the core connection information and is used by various AWS services.

  • Athena data catalog - For connections that will be used with Athena, an Athena data catalog is created. This allows Athena to query the external data source.

  • AWS Glue Data Catalog entries - Databases, tables, and schemas from your external data source are registered in the AWS Glue Data Catalog. This enables AWS services to understand the structure of your external data.

  • AWS Lambda function (for Athena Federated Query) - For some data sources, a Lambda function is created to facilitate federated queries. This function acts as a bridge between Athena and the external data source.

To view these resources, access the respective AWS service consoles (AWS Glue, Athena, IAM, etc.) in the AWS account associated with your Amazon SageMaker Unified Studio project.

In these consoles, look for resources with names that include your Amazon SageMaker Unified Studio project ID or connection name.
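The same lookup can be scripted instead of done in the consoles. The following is a minimal sketch that lists AWS Glue connections, Athena data catalogs, and Lambda functions whose names contain a given string, such as your project ID or connection name. The function names and the example resource names are hypothetical, and the exact naming pattern of the created resources is an assumption that you should confirm in your own account. The sketch assumes that boto3 is installed and that credentials for the associated AWS account are configured.

```python
from typing import Dict, Iterable, List


def names_matching(names: Iterable[str], needle: str) -> List[str]:
    """Return the names that contain `needle`, ignoring case."""
    lowered = needle.lower()
    return [name for name in names if lowered in name.lower()]


def find_connection_resources(needle: str, region: str) -> Dict[str, List[str]]:
    """Collect resource names from Glue, Athena, and Lambda that contain `needle`."""
    import boto3  # imported lazily so names_matching works without boto3

    glue = boto3.client("glue", region_name=region)
    athena = boto3.client("athena", region_name=region)
    lambda_client = boto3.client("lambda", region_name=region)

    # Each service paginates its listing, so walk every page.
    glue_names = [
        connection["Name"]
        for page in glue.get_paginator("get_connections").paginate()
        for connection in page["ConnectionList"]
    ]
    catalog_names = [
        catalog["CatalogName"]
        for page in athena.get_paginator("list_data_catalogs").paginate()
        for catalog in page["DataCatalogsSummary"]
    ]
    function_names = [
        function["FunctionName"]
        for page in lambda_client.get_paginator("list_functions").paginate()
        for function in page["Functions"]
    ]

    return {
        "glue_connections": names_matching(glue_names, needle),
        "athena_data_catalogs": names_matching(catalog_names, needle),
        "lambda_functions": names_matching(function_names, needle),
    }


if __name__ == "__main__":
    # Hypothetical names, used only to show the filtering step.
    sample = ["proj-abc123-mysql", "unrelated", "ABC123-catalog"]
    print(names_matching(sample, "abc123"))
```

The match is a case-insensitive substring test, which is deliberately loose. Narrow it if your account contains many resources with similar names.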

For more information about how to create a data connection and explore a connected data source, see Adding data sources in Amazon SageMaker Lakehouse.