Troubleshooting Amazon DataZone
If you encounter access-denied issues or similar difficulties when working with Amazon DataZone, consult the topics in this section.
Troubleshooting AWS Lake Formation permissions for Amazon DataZone
This section contains troubleshooting instructions for issues that you might encounter when you configure Lake Formation permissions for Amazon DataZone.
Error message in the Data Portal | Resolution |
---|---|
Unable to assume the Data Access Role. | This error is displayed when Amazon DataZone is unable to assume the AmazonDataZoneGlueDataAccessRole that you used to enable the DefaultDataLakeBlueprint in your account. To fix the issue, go to the AWS IAM console in the account where your data asset exists and make sure that the AmazonDataZoneGlueDataAccessRole has the right trust relationship with the Amazon DataZone service principal. For more information, see AmazonDataZoneGlueAccess-<region>-<domainId>. |
The Data Access Role does not have the necessary permissions to read the metadata of the asset you are trying to subscribe. | This error is displayed when Amazon DataZone successfully assumes the AmazonDataZoneGlueDataAccessRole role, but the role does not have the necessary permissions. To fix the issue, go to the AWS IAM console in the account where your data asset exists and make sure that the role has the AmazonDataZoneGlueManageAccessRolePolicy attached to it. For more information, see AmazonDataZoneGlueAccess-<region>-<domainId>. |
Asset is a resource link. Amazon DataZone does not support subscriptions to resource links. | This error is displayed when the asset you are trying to publish to Amazon DataZone is a resource link to an AWS Glue table. |
Asset is not managed by AWS Lake Formation. | This error indicates that the AWS Lake Formation permissions are not enforced on the asset that you want to publish. This can happen in the following cases. |
Data Access role does not have necessary Lake Formation permissions to grant access to this asset. | This error indicates that the AmazonDataZoneGlueDataAccessRole that you are using to enable the DefaultDataLakeBlueprint in your account does not have the necessary permissions for Amazon DataZone to manage permissions on the published asset. You can resolve the issue by either adding the AmazonDataZoneGlueDataAccessRole as the AWS Lake Formation administrator or by granting the following permissions to the AmazonDataZoneGlueDataAccessRole on the asset that you want to publish. |
Troubleshooting Amazon DataZone lineage asset linking with upstream datasets
This section contains troubleshooting instructions for issues that you might encounter with Amazon DataZone lineage. For some AWS Glue and Amazon Redshift OpenLineage run events, the asset lineage might not be linked to an upstream dataset. This topic explains those scenarios and a few approaches to mitigate them. For more information on lineage, see Data lineage in Amazon DataZone.
SourceIdentifier on lineage node
The sourceIdentifier attribute in a lineage node represents the events happening on a dataset. For more information, see Key attributes in lineage nodes.
The lineage node represents all the events that happen on the corresponding dataset or job. The lineage node contains a "sourceIdentifier" attribute, which holds the identifier of the corresponding dataset or job. Because Amazon DataZone supports OpenLineage events, the sourceIdentifier value is populated by default as the combination of "namespace" and "name" for datasets, jobs, and job runs.
For AWS resources such as AWS Glue and Amazon Redshift, the sourceIdentifier is the AWS Glue table ARN or the Redshift table ARN, which Amazon DataZone constructs from the run event and other details as follows:
Note
In AWS, the ARN contains information such as the accountId, region, database, and table for every resource.
OpenLineage events for these datasets contain the database and table name.
Region is captured in the "environment-properties" facet of a run. If it's not present, the system uses the region from the caller credentials.
AccountId is taken from the caller credentials.
SourceIdentifier on the assets within DataZone
AssetCommonDetailsForm has an attribute called "sourceIdentifier", which represents the identifier of the dataset that the asset represents. For asset lineage nodes to be linked with an upstream dataset, this attribute must be populated with a value that matches the dataset node’s sourceIdentifier. If the assets are imported by a data source, the workflow automatically populates sourceIdentifier with the AWS Glue table ARN or Redshift table ARN. For other assets (including custom assets) created via the CreateAsset API, the caller must populate that value.
How does Amazon DataZone construct the sourceIdentifier from the OpenLineage Event?
For AWS Glue and Redshift assets, the sourceIdentifier is constructed from Glue and Redshift ARNs. Here's how Amazon DataZone constructs it:
AWS Glue ARN
The goal is to construct an OpenLineage Event where the output lineage node's sourceIdentifier is:
arn:aws:glue:us-east-1:123456789012:table/testlfdb/testlftb-1
To determine whether a run is using data from AWS Glue, the system looks for certain keywords in the environment-properties facet. Specifically, if any of the following fields is present, the system assumes that the RunEvent originates from AWS Glue:
GLUE_VERSION
GLUE_COMMAND_CRITERIA
GLUE_PYTHON_VERSION
"run": { "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr", "facets":{ "environment-properties":{ "_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.9.1/integration/spark", "_schemaURL":"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet", "environment-properties":{ "GLUE_VERSION":"3.0", "GLUE_COMMAND_CRITERIA":"glueetl", "GLUE_PYTHON_VERSION":"3" } } } }
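The check described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this topic, not Amazon DataZone's actual implementation; the function name is_glue_run and the constant GLUE_MARKER_KEYS are hypothetical names chosen for this example.

```python
# Hypothetical helper: mirrors the rule described in this topic.
# The three keys are taken from the list above.
GLUE_MARKER_KEYS = {"GLUE_VERSION", "GLUE_COMMAND_CRITERIA", "GLUE_PYTHON_VERSION"}


def is_glue_run(run: dict) -> bool:
    """Return True if the run's environment-properties facet has any Glue key."""
    facet = run.get("facets", {}).get("environment-properties", {})
    # The facet nests a second "environment-properties" object with the values.
    properties = facet.get("environment-properties", {})
    return any(key in properties for key in GLUE_MARKER_KEYS)


sample_run = {
    "runId": "4e3da9e8-6228-4679-b0a2-fa916119fthr",
    "facets": {
        "environment-properties": {
            "environment-properties": {
                "GLUE_VERSION": "3.0",
                "GLUE_COMMAND_CRITERIA": "glueetl",
                "GLUE_PYTHON_VERSION": "3",
            }
        }
    },
}
print(is_glue_run(sample_run))
```

Running a check like this on an event before you post it can show whether the event carries the fields needed for it to be treated as an AWS Glue run.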
For an AWS Glue run, you can use the name from the symlinks facet to get the database and table name, which can be used to construct the ARN.
Make sure that the name has the form databaseName.tableName:
"symlinks": { "_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.9.1/integration/spark", "_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers":[ { "namespace":"s3://object-path", "name":"testlfdb.testlftb-1", "type":"TABLE" } ] }
Sample COMPLETE Event:
{ "eventTime":"2024-07-01T12:00:00.000000Z", "producer":"https://github.com/OpenLineage/OpenLineage/tree/1.9.1/integration/glue", "schemaURL":"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent", "eventType":"COMPLETE", "run": { "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr", "facets":{ "environment-properties":{ "_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.9.1/integration/spark", "_schemaURL":"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet", "environment-properties":{ "GLUE_VERSION":"3.0", "GLUE_COMMAND_CRITERIA":"glueetl", "GLUE_PYTHON_VERSION":"3" } } } }, "job":{ "namespace":"namespace", "name":"job_name", "facets":{ "jobType":{ "_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.9.1/integration/glue", "_schemaURL":"https://openlineage.io/spec/facets/2-0-2/JobTypeJobFacet.json#/$defs/JobTypeJobFacet", "processingType":"BATCH", "integration":"glue", "jobType":"JOB" } } }, "inputs":[ { "namespace":"namespace", "name":"input_name" } ], "outputs":[ { "namespace":"namespace.output", "name":"output_name", "facets":{ "symlinks":{ "_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.9.1/integration/spark", "_schemaURL":"https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet", "identifiers":[ { "namespace":"s3://object-path", "name":"testlfdb.testlftb-1", "type":"TABLE" } ] } } } ] }
Based on the OpenLineage event submitted, the sourceIdentifier of the output lineage node will be:
arn:aws:glue:us-east-1:123456789012:table/testlfdb/testlftb-1
The output lineage node will be connected to an asset's lineage node where the asset's sourceIdentifier is:
arn:aws:glue:us-east-1:123456789012:table/testlfdb/testlftb-1
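To predict which ARN a given output will produce, the construction described above can be approximated as follows. This is a sketch under assumptions: the function name glue_table_arn is hypothetical, the region and account ID are passed in explicitly (in practice they come from the environment-properties facet or the caller credentials, as noted earlier), and the first symlink identifier of the form databaseName.tableName is used.

```python
from typing import Optional


def glue_table_arn(dataset: dict, region: str, account_id: str) -> Optional[str]:
    """Build a Glue table ARN from a dataset's symlinks facet, or return None."""
    identifiers = (
        dataset.get("facets", {}).get("symlinks", {}).get("identifiers", [])
    )
    for identifier in identifiers:
        # Assumption: the symlink name has the form databaseName.tableName.
        database, separator, table = identifier.get("name", "").partition(".")
        if separator and database and table:
            return f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}"
    return None


sample_output = {
    "namespace": "namespace.output",
    "name": "output_name",
    "facets": {
        "symlinks": {
            "identifiers": [
                {
                    "namespace": "s3://object-path",
                    "name": "testlfdb.testlftb-1",
                    "type": "TABLE",
                }
            ]
        }
    },
}
print(glue_table_arn(sample_output, "us-east-1", "123456789012"))
```

If the value printed for your own event differs from the sourceIdentifier on the asset, the two lineage nodes are unlikely to be linked.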
Amazon Redshift ARN
The goal is to construct an OpenLineage Event where the output lineage node's sourceIdentifier is:
arn:aws:redshift:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7
The system determines whether an input or output is stored in Redshift based on the namespace. Specifically, if the namespace starts with redshift:// or contains the strings redshift-serverless.amazonaws.com or redshift.amazonaws.com, it is a Redshift resource.
"outputs": [ { "namespace":"redshift://workgroup-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439", "name":"tpcds_data.public.dws_tpcds_7" } ]
Note that the namespace needs to be in the following format:
provider://{cluster_identifier}.{region_name}:{port}
For redshift-serverless:
"outputs": [ { "namespace":"redshift://workgroup-20240715.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439", "name":"tpcds_data.public.dws_tpcds_7" } ]
This results in the following sourceIdentifier:
arn:aws:redshift-serverless:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7
Based on the OpenLineage event submitted, the sourceIdentifier to be mapped to a downstream (that is, an output of the event) lineage node is:
arn:aws:redshift-serverless:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7
This is the mapping that helps you visualize the lineage of an asset in the catalog.
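The Redshift rules above can be approximated in Python to check what ARN a dataset entry is likely to yield. This is a sketch, not the service's implementation: the function names are hypothetical, the region and account ID are supplied by the caller rather than read from credentials, and the first label of the host name is assumed to be the cluster or workgroup identifier, as in the examples in this topic.

```python
from typing import Optional

SERVERLESS_MARKER = "redshift-serverless.amazonaws.com"
PROVISIONED_MARKER = "redshift.amazonaws.com"


def is_redshift_namespace(namespace: str) -> bool:
    """Apply the namespace test described in this topic."""
    return (
        namespace.startswith("redshift://")
        or SERVERLESS_MARKER in namespace
        or PROVISIONED_MARKER in namespace
    )


def redshift_table_arn(
    namespace: str, name: str, region: str, account_id: str
) -> Optional[str]:
    """Build the table ARN for a Redshift dataset, or return None."""
    if not is_redshift_namespace(namespace):
        return None
    service = "redshift-serverless" if SERVERLESS_MARKER in namespace else "redshift"
    # Drop the scheme and the port, then keep the first host label.
    host = namespace.split("://", 1)[-1].split(":", 1)[0]
    identifier = host.split(".", 1)[0]
    # database.schema.table becomes database/schema/table.
    path = "/".join(name.split("."))
    return f"arn:aws:{service}:{region}:{account_id}:table/{identifier}/{path}"


print(
    redshift_table_arn(
        "redshift://workgroup-20240715.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439",
        "tpcds_data.public.dws_tpcds_7",
        "us-east-1",
        "123456789012",
    )
)
```

Because the service name is chosen purely from the host name, a namespace that hides the real endpoint behind an alias would fall through to the default "redshift" branch in this sketch.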
Alternate approach
When none of the above conditions are met, the system uses the namespace/name to construct the sourceIdentifier:
"inputs": [ { "namespace":"arn:aws:redshift:us-east-1:123456789012:table", "name":"workgroup-20240715/tpcds_data/public/dws_tpcds_7" } ], "outputs": [ { "namespace":"arn:aws:glue:us-east-1:123456789012:table", "name":"testlfdb/testlftb-1" } ]
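If you already know the ARN that the asset carries, you can derive the namespace and name for this alternate approach by splitting the ARN at its first slash. The sketch below assumes that the fallback identifier is simply the namespace and name joined by "/", as the example above suggests; both function names are hypothetical.

```python
def dataset_entry_for_arn(arn: str) -> dict:
    """Split an ARN into the namespace/name pair used in an OpenLineage dataset."""
    namespace, separator, name = arn.partition("/")
    if not separator or not name:
        raise ValueError(f"ARN has no resource path after '/': {arn!r}")
    return {"namespace": namespace, "name": name}


def fallback_source_identifier(dataset: dict) -> str:
    """Assumed fallback rule: join namespace and name with a slash."""
    return f"{dataset['namespace']}/{dataset['name']}"


entry = dataset_entry_for_arn(
    "arn:aws:glue:us-east-1:123456789012:table/testlfdb/testlftb-1"
)
print(entry)
print(fallback_source_identifier(entry))
```

Round-tripping an ARN through both helpers is a quick way to confirm that the entry you put in the event reproduces the identifier you expect.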
Troubleshooting a lack of upstream for the asset lineage node
If you don’t see the upstream of the asset lineage node, you can do the following to troubleshoot why it's not linked with the dataset:
Invoke GetAsset while providing the domainId and assetId:
aws datazone get-asset --domain-identifier <domain-id> --identifier <asset-id>
The response appears as follows:
{ ..... "formsOutput": [ ..... { "content": "{\"sourceIdentifier\":\"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1\"}", "formName": "AssetCommonDetailsForm", "typeName": "amazon.datazone.AssetCommonDetailsFormType", "typeRevision": "6" }, ..... ], "id": "<asset-id>", .... }
Invoke GetLineageNode to get the sourceIdentifier of the dataset lineage node. Because there is no way to get the lineage node for the corresponding dataset directly, you can start with GetLineageNode on the job run:
aws datazone get-lineage-node --domain-identifier <domain-id> --identifier <job_namespace>.<job_name>/<run_id>
If you are using the getting started scripts, the job name and run ID are printed in the console and the namespace is "default". Otherwise, you can get these values from the run event content.
The sample response looks like the following:
{ ..... "downstreamNodes": [ { "eventTimestamp": "2024-07-24T18:08:55+08:00", "id": "afymge5k4v0euf" } ], "formsOutput": [ <some forms corresponding to run and job> ], "id": "<system generated node-id for run>", "sourceIdentifier": "default.redshift.create/2f41298b-1ee7-3302-a14b-09addffa7580", "typeName": "amazon.datazone.JobRunLineageNodeType", .... "upstreamNodes": [ { "eventTimestamp": "2024-07-24T18:08:55+08:00", "id": "6wf2z27c8hghev" }, { "eventTimestamp": "2024-07-24T18:08:55+08:00", "id": "4tjbcsnre6banb" } ] }
Invoke GetLineageNode again, passing in the downstream or upstream node identifier (the one that you think should be linked to the asset node), because these nodes correspond to the dataset. Sample command using the above example response:
aws datazone get-lineage-node --domain-identifier <domain-id> --identifier afymge5k4v0euf
This returns the lineage node details corresponding to the dataset: afymge5k4v0euf
{ ..... "domainId": "dzd_cklzc5s2jcr7on", "downstreamNodes": [], "eventTimestamp": "2024-07-24T18:08:55+08:00", "formsOutput": [ ..... ], "id": "afymge5k4v0euf", "sourceIdentifier": "arn:aws:redshift:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7", "typeName": "amazon.datazone.DatasetLineageNodeType", "typeRevision": "1", .... "upstreamNodes": [ ... ] }
Compare the sourceIdentifier of this dataset node with the one in the response from GetAsset. If the two values do not match, the nodes are not linked, and the upstream dataset is not visible in the lineage UI.
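The comparison in the last step can be scripted once the two CLI responses are saved as JSON. The sketch below assumes the response shapes shown in the samples above (the asset's sourceIdentifier inside the JSON-encoded content of the AssetCommonDetailsForm entry, and the node's sourceIdentifier at the top level); the function names are hypothetical.

```python
import json
from typing import Optional


def asset_source_identifier(get_asset_response: dict) -> Optional[str]:
    """Read sourceIdentifier from the AssetCommonDetailsForm of a GetAsset response."""
    for form in get_asset_response.get("formsOutput", []):
        if form.get("formName") == "AssetCommonDetailsForm":
            # The form content is a JSON document encoded as a string.
            content = json.loads(form.get("content") or "{}")
            return content.get("sourceIdentifier")
    return None


def identifiers_match(get_asset_response: dict, lineage_node_response: dict) -> bool:
    """Return True only if both identifiers exist and are equal."""
    asset_identifier = asset_source_identifier(get_asset_response)
    node_identifier = lineage_node_response.get("sourceIdentifier")
    return asset_identifier is not None and asset_identifier == node_identifier


asset = {
    "formsOutput": [
        {
            "content": '{"sourceIdentifier":"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1"}',
            "formName": "AssetCommonDetailsForm",
        }
    ]
}
node = {
    "sourceIdentifier": "arn:aws:redshift:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7"
}
print(asset_source_identifier(asset))
print(identifiers_match(asset, node))
```

With real data, you could load each saved response with json.load and pass the resulting dictionaries to identifiers_match.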
Non-matching scenarios and mitigations
The following are commonly known scenarios where these will not match and the possible mitigations:
Root cause: The tables are present in a different account than the Amazon DataZone domain account.
Mitigation: You can invoke the PostLineageEvent operation from an associated account. Because the accountId used to construct the ARN is taken from the caller credentials, you can assume a role in the account that contains the tables when running the getting started script or invoking PostLineageEvent. Doing so helps construct the ARNs correctly and link them with the asset nodes.
Root cause: The ARN for Redshift tables and views contains either redshift or redshift-serverless, chosen from the namespace and name attributes of the corresponding dataset in the OpenLineage run event.
Mitigation: Because there is no deterministic way to know whether the given name belongs to a cluster or a workgroup, Amazon DataZone uses the following heuristic:
If the "name" corresponding to the dataset contains "redshift-serverless.amazonaws.com", redshift-serverless is used as part of the ARN. Otherwise, the ARN defaults to "redshift". This means that aliases on workgroup names will not work.
Root cause: Upstream datasets are not linked properly for custom assets.
Mitigation: Make sure to populate the sourceIdentifier on the asset by invoking CreateAsset or CreateAssetRevision with a value that matches the sourceIdentifier of the dataset node (which would be <namespace>/<name> for custom nodes).