Amazon SageMaker Unified Studio is in preview release and is subject to change.

Data lineage in Amazon SageMaker Unified Studio

Data lineage in Amazon SageMaker Unified Studio is an API-driven, OpenLineage-compatible feature that helps you capture and visualize lineage events, from OpenLineage-enabled systems or through APIs, to trace data origins, track transformations, and view cross-organizational data consumption. It provides you with an overarching view into your data assets to see the origin of assets and their chain of connections. The lineage data includes information on the activities inside the Amazon SageMaker Catalog, including information about the catalogued assets and the subscribers of those assets, as well as the activities that happen outside the business data catalog and are captured programmatically using the APIs.

Using Amazon SageMaker Unified Studio's OpenLineage-compatible APIs, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon SageMaker Unified Studio, including transformations in Amazon S3, AWS Glue, and other services. This provides a comprehensive view for data consumers and helps them gain confidence in an asset's origin, while data producers can assess the impact of changes to an asset by understanding its usage. Additionally, Amazon SageMaker Unified Studio versions lineage with each event, enabling users to visualize lineage at any point in time or compare transformations across an asset's or job's history. This historical lineage provides a deeper understanding of how data has evolved, which is essential for troubleshooting, auditing, and ensuring the integrity of data assets.

With data lineage, you can accomplish the following in Amazon SageMaker Unified Studio:

  • Understand the provenance of data: knowing where the data originated fosters trust in data by providing you with a clear understanding of its origins, dependencies, and transformations. This transparency helps in making confident data-driven decisions.

  • Understand the impact of changes to data pipelines: when changes are made to data pipelines, lineage can be used to identify all of the downstream consumers that will be affected. This helps to ensure that changes are made without disrupting critical data flows.

  • Identify the root cause of data quality issues: if a data quality issue is detected in a downstream report, lineage, especially column-level lineage, can be used to trace the affected data back to its source, column by column. This can help data engineers to identify and fix the problem.

  • Improve data governance and compliance: column-level lineage can be used to demonstrate compliance with data governance and privacy regulations. For example, column-level lineage can be used to show where sensitive data (such as PII) is stored and how it is processed in downstream activities.
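
The impact analysis described above amounts to walking the lineage graph downstream from a changed node. The following is a minimal sketch of that idea over an in-memory edge list; the node names and the downstream_consumers helper are hypothetical and are not part of any Amazon SageMaker Unified Studio API.

```python
from collections import deque


def downstream_consumers(edges, changed):
    """Return every node reachable downstream of `changed`, in breadth-first order.

    `edges` is an iterable of (upstream, downstream) pairs. This is an
    illustrative, in-memory stand-in for a lineage graph.
    """
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    affected = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Hypothetical lineage: two raw tables feed jobs that produce a report.
EXAMPLE_EDGES = [
    ("raw.orders", "job.clean"),
    ("job.clean", "clean.orders"),
    ("clean.orders", "job.report"),
    ("job.report", "report.sales"),
    ("raw.customers", "job.report"),
]
```

Calling downstream_consumers(EXAMPLE_EDGES, "clean.orders") lists the job and report that would be affected by a change to that table.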

Types of lineage nodes in Amazon SageMaker Unified Studio

In Amazon SageMaker Unified Studio, data lineage information is presented in nodes that represent tables and views. Depending on the context of the project, producers can view both inventory and published assets, whereas consumers can view only published assets. When you first open the lineage tab in the asset details page, the catalogued dataset node is the starting point for navigating upstream or downstream through the nodes of your lineage graph.

The following are the types of data lineage nodes that are supported in Amazon SageMaker Unified Studio:

  • Dataset node - this node type includes data lineage information about a specific data asset.

    • Dataset nodes that include information about AWS Glue or Amazon Redshift assets published in the Amazon SageMaker Unified Studio catalog are auto-generated and include a corresponding AWS Glue or Amazon Redshift icon within the node.

    • Dataset nodes that include information about assets that are not published in the Amazon SageMaker Unified Studio catalog are created manually by domain administrators (producers) and are represented by a default custom asset icon within the node.

  • Job (run) node - this node type displays the details of the job, including the latest run of a particular job and its run details. This node also captures multiple runs of the job, which can be viewed in the History tab of the node details. You can view node details by choosing the node icon.

Key attributes in lineage nodes

The sourceIdentifier attribute in a lineage node identifies the dataset (table, view, and so on) on which the lineage events occur. It is used to enforce uniqueness across lineage nodes. For example, there can't be two lineage nodes with the same sourceIdentifier. The following are examples of sourceIdentifier values for different types of nodes:

  • For dataset nodes, by dataset type:

    • Asset: amazon.datazone.asset/<assetId>

    • Listing (published asset): amazon.datazone.listing/<listingId>

    • AWS Glue table: arn:aws:glue:<region>:<account-id>:table/<database>/<table-name>

    • Amazon Redshift table/view: arn:aws:<redshift/redshift-serverless>:<region>:<account-id>:<table-type(table/view etc)>/<clusterIdentifier/workgroupName>/<database>/<schema>/<table-name>

    • For any other type of dataset node imported using OpenLineage run events, <namespace>/<name> of the input/output dataset is used as the sourceIdentifier of the node.

  • For jobs:

    • For job nodes imported using OpenLineage run events, <jobs_namespace>.<job_name> is used as the sourceIdentifier.

  • For job runs:

    • For job run nodes imported using OpenLineage run events, <jobs_namespace>.<job_name>/<run_id> is used as the sourceIdentifier.

For assets created using the createAsset API, the sourceIdentifier must be updated using the createAssetRevision API to enable mapping the asset to upstream resources.
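
The sourceIdentifier patterns listed above can be assembled with simple string formatting. The following sketch shows a few of them; the function names are hypothetical helpers written for illustration, and only the string patterns come from the list above.

```python
def glue_table_source_identifier(region, account_id, database, table_name):
    """Build the sourceIdentifier pattern shown above for an AWS Glue table."""
    return f"arn:aws:glue:{region}:{account_id}:table/{database}/{table_name}"


def openlineage_dataset_source_identifier(namespace, name):
    """<namespace>/<name> of an input/output dataset from an OpenLineage run event."""
    return f"{namespace}/{name}"


def openlineage_job_source_identifier(jobs_namespace, job_name):
    """<jobs_namespace>.<job_name> for a job node."""
    return f"{jobs_namespace}.{job_name}"


def openlineage_job_run_source_identifier(jobs_namespace, job_name, run_id):
    """<jobs_namespace>.<job_name>/<run_id> for a job run node."""
    job_id = openlineage_job_source_identifier(jobs_namespace, job_name)
    return f"{job_id}/{run_id}"
```

Because sourceIdentifier values must be unique across lineage nodes, generating them from one set of helpers like these can help keep producers consistent.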

Visualizing data lineage

Amazon SageMaker Unified Studio’s asset details page provides a graphical representation of data lineage, making it easier to visualize data relationships upstream or downstream. The asset details page provides the following capabilities to navigate the graph:

  • Column-level lineage: expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.

  • Column search: by default, a dataset node displays 10 columns. If there are more than 10 columns, pagination is activated so that you can navigate to the rest of the columns. To quickly view a particular column, you can search on the dataset node, which then lists just the matching column.

  • View dataset nodes only: to view only dataset lineage nodes and filter out the job nodes, choose the Open view control icon at the top left of the graph viewer and toggle the Display dataset nodes only option. This removes all job nodes from the graph and lets you navigate just the dataset nodes. Note that when Display dataset nodes only is turned on, the graph cannot be expanded upstream or downstream.

  • Details pane: each lineage node has details that are captured and displayed when the node is selected.

    • A dataset node has a details pane that displays all the details captured for that node at a given timestamp. Every dataset node has three tabs: Lineage info, Schema, and History. The History tab lists the different versions of the lineage event captured for that node. All details captured from the API are displayed using metadata forms or a JSON viewer.

    • A job node has a details pane that displays job details in two tabs: Job info and History. The details pane also shows the query or expressions captured as part of the job run. The History tab lists the different versions of the job run event captured for that job. All details captured from the API are displayed using metadata forms or a JSON viewer.

  • Version tabs: all lineage nodes in Amazon SageMaker Unified Studio data lineage are versioned. For every dataset node or job node, the versions are captured as history, which enables you to navigate between the different versions to identify what has changed over time. Each version opens in a new tab in the lineage page to help you compare versions.

Data lineage authorization in Amazon SageMaker Unified Studio

Write permissions - to publish lineage data into Amazon SageMaker Unified Studio, you must have an IAM role with a permissions policy that includes an ALLOW action on the PostLineageEvent API. This IAM authorization happens at the API Gateway layer.
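
As an illustration of the write path, the following sketch builds a minimal OpenLineage run event and posts it with the boto3 DataZone client's post_lineage_event operation. Several details are assumptions made for the example: the policy's action name (datazone:PostLineageEvent) and wildcard resource, the producer URL, the schemaURL version, and the build_run_event and post_run_event helper names. Check the policy and event fields against your own environment before relying on them.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed shape of a permissions policy that allows publishing lineage events.
# The wildcard resource is for illustration only; scope it down in practice.
LINEAGE_WRITE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "datazone:PostLineageEvent", "Resource": "*"}
    ],
}


def build_run_event(job_namespace, job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a minimal OpenLineage RunEvent as a dict.

    `inputs` and `outputs` are lists of (namespace, name) pairs.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
    }


def post_run_event(domain_id, event, client=None):
    """Send the event to the domain; creates a boto3 DataZone client if none is given."""
    if client is None:
        import boto3  # imported lazily so the event builder works without boto3

        client = boto3.client("datazone")
    return client.post_lineage_event(
        domainIdentifier=domain_id,
        event=json.dumps(event).encode("utf-8"),
    )
```

The caller's IAM role needs the write permission described above; without it, the post_lineage_event call is expected to be rejected with an access-denied error.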

Read permissions - two operations, GetLineageNode and ListLineageNodeHistory, are included in the AmazonDataZoneDomainExecutionRolePolicy managed policy configured by your admin. This means that every user in the Amazon SageMaker Unified Studio domain can invoke these operations to traverse the data lineage graph.
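
Traversing the graph with GetLineageNode could look like the following sketch, which walks upstream from a starting node. It assumes that the response carries an upstreamNodes list whose entries have an id field, as in the boto3 DataZone client's get_lineage_node response; the walk_upstream helper and its max_nodes cap are hypothetical choices made for the example.

```python
from collections import deque


def walk_upstream(client, domain_id, start_node_id, max_nodes=50):
    """Breadth-first walk over upstream lineage nodes.

    `client` is expected to behave like boto3.client("datazone").
    Returns a list of (id, typeName, sourceIdentifier) tuples in visit order.
    """
    visited = set()
    results = []
    queue = deque([start_node_id])
    while queue and len(results) < max_nodes:
        node_id = queue.popleft()
        if node_id in visited:
            continue
        visited.add(node_id)
        node = client.get_lineage_node(
            domainIdentifier=domain_id,
            identifier=node_id,
        )
        results.append(
            (node.get("id"), node.get("typeName"), node.get("sourceIdentifier"))
        )
        # Assumed response shape: a list of {"id": ..., "eventTimestamp": ...}.
        for upstream in node.get("upstreamNodes", []):
            queue.append(upstream["id"])
    return results
```

A downstream walk would follow the same pattern using the downstreamNodes list, and ListLineageNodeHistory can be used in a similar way to retrieve earlier versions of a single node.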

Data lineage sample experience in Amazon SageMaker Unified Studio

You can use the data lineage sample experience to browse and understand data lineage in Amazon SageMaker Unified Studio, including traversing upstream or downstream in your data lineage graph, exploring versions and column-level lineage.

Complete the following procedure to try the sample data lineage experience in Amazon SageMaker Unified Studio:

  1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

  2. Choose Select project from the top navigation pane and select the project you want to view lineage in.

  3. Under Project catalog in the left side navigation, choose Assets.

  4. On the Inventory tab, choose the name of the asset that you want to view lineage for. This opens the asset details page.

  5. On the asset details page, choose the Lineage tab.

  6. In the data lineage window, choose the info icon that says Try sample data lineage. Then choose Launch. A new pop-up window appears.

  7. Choose Start guided data lineage tour.

  8. Select a guided tour option, and then choose Start tour.

    At this point, a tab that displays the full lineage information opens. The sample data lineage graph is initially displayed with a base node and a depth of 1 at either end, upstream and downstream. You can expand the graph upstream or downstream. Column information is also available for you to choose and see how lineage flows through the nodes.

Using Amazon SageMaker Unified Studio data lineage programmatically

To use the data lineage functionality in Amazon SageMaker Unified Studio, you can invoke the following Amazon DataZone APIs: