This is my second post in a series which explains how to use SQL Server Graph Tables to track data lineage.
In my previous post I covered creating the node and edges tables required to store data lineage. In this post I will show you how to create database procedures in order to simplify the process of populating the graph tables. These procedures are deliberately basic so that they are easy to understand but they are also easily extensible and can serve as the starting point for a more sophisticated solution.
Overview
Before diving into the code it is always good to have a conceptual understanding of the process. The flowchart below explains the logic of these procedures.

As you can see in the diagram above, we check if the source or target is new and then based on the condition create the appropriate nodes and edges. In order to avoid duplicating code we will create a procedure for creating the nodes and the edges and then call theses sub processes from a parent process that controls the logic flow.
Create Nodes Procedure
First, we will create a procedure to populate our node table. As you recall from the previous post our node table is called [dbo].[DataEntity].
DROP PROCEDURE IF EXISTS dbo.Create_Nodes
GO
CREATE PROCEDURE dbo.Create_Nodes
@ColumnList VARCHAR(MAX),
@Table VARCHAR(MAX),
@Database VARCHAR(MAX),
@System VARCHAR(MAX)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @NodeIDs TABLE (ID NVARCHAR(1000));
--Insert data ino Node table and keep Source IDS so that we can populate the Edge table
INSERT INTO [dbo].[DataEntity]
OUTPUT INSERTED.$NODE_ID AS TOID INTO @NodeIDs(ID)
SELECT
value AS [ColumnName],
@Table AS [TableName],
@Database AS [DatabaseName],
@System AS [SystemName],
GETDATE() AS [CreatedDt]
FROM
STRING_SPLIT(@ColumnList, ',');
SELECT ID FROM @NodeIDs;
RETURN
END
Now that we have a procedure to create our nodes, we will create another procedure to create the edges.
Create Edges Procedure
We need to pass a list of procedure a list of node ids to insert into the [dbo].[DataFlowsTo] table so we will create a user defined type to store and pass this information.
/* Create a table type in order to pass a list of Node IDs to this procedure*/
DROP TYPE IF EXISTS dbo.NodeIDTableType
CREATE TYPE dbo.NodeIDTableType AS TABLE ( ID NVARCHAR(1000) );
GO
After the user defined table type is created we can create the procedure to populate the edge table.
DROP PROCEDURE IF EXISTS dbo.Create_Edges
GO
CREATE PROCEDURE dbo.Create_Edges
@Source NodeIDTableType READONLY,
@Target NodeIDTableType READONLY,
@Process VARCHAR(MAX)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
--Get the FROM and TO node ids into a common table exppression and populate the edge table in batch.
*/
WITH CTE (TOID, FROMID) AS
(
SELECT
A.FROMID,
B.TOID
FROM
(SELECT ID AS FROMID, ROW_NUMBER() OVER(ORDER BY ID ASC) as RowNum FROM @Source) as A
JOIN
(SELECT ID AS TOID, ROW_NUMBER() OVER(ORDER BY ID ASC) as RowNum FROM @Target) as B
ON A.RowNum = B.RowNum
)
--Insert Lineage data into DataFlowsTo Edge table
INSERT INTO [dbo].[DataFlowsTo] ($from_id, $to_id, ProcessName)
SELECT CTE.TOID, CTE.FROMID, @Process FROM CTE;
END
In case you are wondering I used a common table expression (CTE) so that the edge records could be created in bulk rather than one at a time.
Create Data Lineage Procedure
This is the final procedure which checks to see if the source or target is new and calls the procedures above as required. The conditional logic check is premised on the assumption a the combination of system name, database name, table name and column name is unique.
DROP PROCEDURE IF EXISTS dbo.Create_Data_Lineage
GO
USE [DataLineage]
GO
CREATE PROCEDURE [dbo].[Create_Data_Lineage]
--Source data
@SourceSystem VARCHAR(MAX),
@SourceDatabase VARCHAR(MAX),
@SourceTable VARCHAR(MAX),
@SourceColumnList VARCHAR(MAX),
--Target data
@TargetSystem VARCHAR(MAX),
@TargetDatabase VARCHAR(MAX),
@TargetTable VARCHAR(MAX),
@TargetColumnList VARCHAR(MAX),
@ProcessName VARCHAR(MAX)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
--DECLARE @SourceIDs TABLE (ID NVARCHAR(1000));
--DECLARE @TargetIDs TABLE (ID NVARCHAR(1000));
DECLARE @SourceIDs NodeIDTableType;
DECLARE @TargetIDs NodeIDTableType;
--Prepare the list of fields by removing any spaces or tabs between the comma seperated list of source and target columns
SET @SourceColumnList = REPLACE(REPLACE(@SourceColumnList,' ', ''), char(9), '');
SET @TargetColumnList = REPLACE(REPLACE(@TargetColumnList,' ', ''), char(9), '');
--Check for existing sources , if found use those id's otherwise create new nodes and use new IDs
INSERT INTO @SourceIDs
SELECT
$NODE_ID
FROM
[dbo].[DataEntity] DE
INNER JOIN
(SELECT
value AS [ColumnName],
@SourceTable AS [TableName],
@SourceDatabase AS [DatabaseName],
@SourceSystem AS [SystemName]
FROM
STRING_SPLIT(@SourceColumnList, ',')
) SRC
ON
DE.[ColumnName] = SRC.[ColumnName]
AND
DE.[TableName] = SRC.[TableName]
AND
DE.[DatabaseName] = SRC.[DatabaseName]
AND
DE.[SystemName] = SRC.[SystemName];
--Check for existing targets, if found use those id's otherwise create new nodes and use new IDs
INSERT INTO @TargetIDs
SELECT
$NODE_ID
FROM
[dbo].[DataEntity] DE
INNER JOIN
(SELECT
value AS [ColumnName],
@TargetTable AS [TableName],
@TargetDatabase AS [DatabaseName],
@TargetSystem AS [SystemName]
FROM
STRING_SPLIT(@TargetColumnList, ',')
) TRGT
ON
DE.[ColumnName] = TRGT.[ColumnName]
AND
DE.[TableName] = TRGT.[TableName]
AND
DE.[DatabaseName] = TRGT.[DatabaseName]
AND
DE.[SystemName] = TRGT.[SystemName];
IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
BEGIN
--Create source nodes
INSERT @SourceIDs
EXEC dbo.CREATE_NODES
@System = @SourceSystem,
@Database = @SourceDatabase,
@Table = @SourceTable,
@ColumnList = @SourceColumnList;
--Create target nodes
INSERT @TargetIDs
EXEC dbo.CREATE_NODES
@System = @TargetSystem,
@Database = @TargetDatabase,
@Table = @TargetTable,
@ColumnList = @TargetColumnList;
--Create edges between source and target
EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
END
ELSE IF (EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
BEGIN
--create target nodes
INSERT @TargetIDs
EXEC dbo.CREATE_NODES
@System = @TargetSystem,
@Database = @TargetDatabase,
@Table = @TargetTable,
@ColumnList = @TargetColumnList;
--Create edges between source and target
EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
END
ELSE IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (EXISTS (SELECT 1 FROM @TargetIDs))
BEGIN
--create source nodes
INSERT @SourceIDs
EXEC dbo.CREATE_NODES
@System = @SourceSystem,
@Database = @SourceDatabase,
@Table = @SourceTable,
@ColumnList = @SourceColumnList;
--Create edges between source and target
EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
END
ELSE
--No new nodes required
--Create edges between source and target
EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
END
GO
Summary
In this article we have created the database procedures to simplify the process of creating the necessary node and edge records to capture data lineage between source and target fields.
In the next article I will cover how to call these procedures and query the graph tables to answer data lineage questions.
Hopefully you have found this to be another practical post.
Until next time.
Anthony