In the previous article I showed you how you can populate the graph tables using the database procedures we created in an earlier post. In this article I will show you how you can query the data contained in the graph tables to quickly answer questions regarding data lineage.
Data Flow
As you may recall from the previous article, we created data to store the flow of data from two source systems in to our BI system as depicted in the illustration below.

MATCH
Starting in SQL Server 2017 the MATCH clause can be used with node and edge graph tables to traverse a path in the graph tables. Imagine a scenario where you want to know the impact of removing a field in your Staging database which you no longer believe you need. In the previous article we captured data lineage for a data flow that was coming out of JDE, one of the columns was an invoice data in Julian called RPDIVJ which is coming from the table F0411. For the purpose of this article lets assume that the business believes this date field is not useful and needs to be removed. Without this data lineage solution you would need to manually search through ETL processes to determine the impact of removing this column, a time consuming task, thankfully we have a more elegant solution.
We can query our graph tables in our DataLineage database to quickly see which processes will need to be modified. We will start by querying Staging which is the start of the data flow.
Run the following code to see which ETL process will need to be modified.
SELECT
df.ProcessName,
df.CreatedDt
FROM
[DataLineage].[dbo].[DataEntity] src,
[DataLineage].[dbo].[DataFlowsTo] df,
[DataLineage].[dbo].[DataEntity] trgt
WHERE MATCH(src-(df)->trgt)
AND src.SystemName = 'Business Intelligence System'
AND src.DatabaseName = 'Staging'
AND src.TableName = 'F0411'
AND src.ColumnName = 'RPDIVJ'
This should give you results like the table below.
ProcessName |
CreatedDt |
Load Fact Accounts Payable |
2020-02-12 |
Load Fact Accounts Payable |
2020-02-12 |
This tells you that the ETL process Load Fact Accounts Payable uses the field RPDIVJ in the Staging database twice. I have also included the CreatedDt filed in the query so that you know if the data is still relevant, typically the older the data the less reliable it is to make decision. If you have built your BI system correctly you should be able to compare that age of this data with the last modified date of the ETL process to see if there is a significant difference, if there is odds are you need to update the data lineage for this process.
Now let’s modify the query above to get some more meaningful insights.
SELECT
src.SystemName AS SourceSystem,
src.DatabaseName AS SourceDatabase,
src.TableName AS SourceTable,
src.ColumnName AS SourceColumn,
src.CreatedDt AS SourceNodeCreatedDt,
df.ProcessName,
df.CreatedDt AS ProcessCreatedDt,
trgt.SystemName AS TargetSystem,
trgt.DatabaseName AS TargetDatabase,
trgt.TableName AS TargetTable,
trgt.ColumnName AS TargetColumn,
trgt.CreatedDt AS TargetNodeCreatedDt
FROM
[DataLineage].[dbo].[DataEntity] src,
[DataLineage].[dbo].[DataFlowsTo] df,
[DataLineage].[dbo].[DataEntity] trgt
WHERE MATCH(src-(df)->trgt)
AND src.SystemName = 'Business Intelligence System'
AND src.DatabaseName = 'Staging'
AND src.TableName = 'F0411'
AND src.ColumnName = 'RPDIVJ'
This query provides you with details on the source, target and the process that copies the data from source to target. When you run this query, it should provide results similar to the table below.
SourceSystem |
SourceDatabase |
SourceTable |
SourceColumn |
SourceNodeCreatedDt |
ProcessName |
ProcessCreatedDt |
TargetSystem |
TargetDatabase |
TargetTable |
TargetColumn |
TargetNodeCreatedDt |
Business Intelligence System |
Staging |
F0411 |
RPDIVJ |
2020-02-12 |
Load Fact Accounts Payable |
2020-02-12 |
Business Intelligence System |
EDW |
Fact Accounts Payable |
InvoiceDate-Julian |
2020-02-12 |
Business Intelligence System |
Staging |
F0411 |
RPDIVJ |
2020-02-12 |
Load Fact Accounts Payable |
2020-02-12 |
Business Intelligence System |
EDW |
Fact Accounts Payable |
InvoiceDate-Gregorian |
2020-02-12 |
The match clause is great however it is still difficult to understand the full impact of removing the RPDIVJ column, this only tells us about the impact to the staging load process and will not tell us down stream impacts to the EDW and the cubes which may rely on this field. What we really want to know is the shortest path across all layers of the BI system.
SHORTEST_PATH()
To get a better understanding of the end to end impact of removing this field we will use the SHORTEST_PATH option. This is a new feature only available in SQL Server 2019 or later. To use this feature, write a new query as follows.
SELECT
CONCAT(src.SystemName,'.', src.DatabaseName, '.', src.TableName, '.', src.ColumnName) AS Source,
COUNT(trgt.ColumnName) WITHIN GROUP (GRAPH PATH) AS [BI System Layer],
STRING_AGG(CONCAT(trgt.SystemName, '.', trgt.DatabaseName, '.', trgt.TableName, '.', trgt.ColumnName) , ' -> ') WITHIN GROUP (GRAPH PATH) AS [Data Lineage],
STRING_AGG(df.ProcessName , ', ') WITHIN GROUP (GRAPH PATH) AS [ETL Processes]
FROM
[DataLineage].[dbo].[DataEntity] AS src,
[DataLineage].[dbo].[DataFlowsTo] FOR PATH AS df,
[DataLineage].[dbo].[DataEntity] FOR PATH AS trgt
WHERE
MATCH (SHORTEST_PATH(src(-(df)->trgt)+))
AND src.SystemName = 'Finance'
AND src.DatabaseName = 'JDE'
AND src.TableName = 'F0411'
AND src.ColumnName = 'RPDIVJ'
ORDER BY 2 DESC
After running the query above you should get results like the table below.
Source |
BI System Layer |
Data Lineage |
ETL Processes |
Finance.JDE.F0411.RPDIVJ |
3 |
Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payble Model.Fact Accounts Payable.InvoiceDate |
Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payble Model |
Finance.JDE.F0411.RPDIVJ |
2 |
Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Julian |
Load F0411, Load Fact Accounts Payable |
Finance.JDE.F0411.RPDIVJ |
2 |
Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian |
Load F0411, Load Fact Accounts Payable |
Finance.JDE.F0411.RPDIVJ |
1 |
Business Intelligence System.Staging.F0411.RPDIVJ |
Load F0411 |
Let’s break down the results and look at the first row returned by the query.
The first column Source is the fully qualified source for the data
Finance.JDE.F0411.RPDIVJ
The second field BI System Layer tells you how many layers up the data travels, so in this case the data makes its way through all 3 layers of the BI system (Staging, EDW, SSAS).
Summary
By leveraging Graph tables and new querying capabilities associated with these tables you can quickly answer questions regarding data lineage and provide greater trust, understanding and ultimately adoption of the BI system.
In the next article I will cover how to visualize data in the graph tables using R.
Hopefully you have found this to be another practical post.
Until next time.
Anthony