How to track data lineage with SQL Server Graph Tables – Part 1 Create Nodes and Edges

Featured

Where did this data come from?
How can I trust this data?
What impact will changing this field have on other systems?

If you have ever heard or asked any of these questions, then this series of blog posts is for you.

As data volumes continue to grow so too does the need to manage the data estate. One critical aspect with managing the data estate is understanding data ancestry. Thankfully you can leverage SQL Server 2019 Graph tables to track the lineage of one of your most valuable assets, your data.

In this series of blog posts, I will show you how you can use how you can use Graph Tables in SQL Server 2019 to capture and report on data lineage.

Prerequisites

  • SQL Server 2019 Developer Edition (You can download a free copy for development use from here )
  • SQL Server Management Studio 18 (You can download a free copy from here )

*Note at the end of the series I will show you how you can visualize the data using R and Power BI and Plotly Dash so expect to install both Microsoft R Open, Power BI and Python later in the series.

Overview

The diagram below provides a conceptual overview of the various components that we will use during this series of articles. We will start with the Graph tables in the DataLineage database in SQL Server 2019 because they are the core of the solution.

Here is an overview of the initial data model that we will build.

As you can see in the diagram this series will cover the movement of data from one system to another. At the end of the series I will explain how this initial solution can be extended to included dependencies between objects such as the dependency of a field in a report with a field in a table in your data warehouse or data lake. This will allow you to track and reporting on both the movement of data as well as the inter-dependencies between data entities.

Create the DataLineage database

Once you have SQL Server installed, we will create a new database called DataLineage which we will use to store data we need to track lineage. To create the database run the following SQL command.

CREATE DATABASE DataLineage

Graph tables are first class entities in SQL Server databases starting with SQL Server 2017. Graph database are ideal for storing and traversing relationships between items. For additional information on the graph database in SQL Server you can refer to the following site.

https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview?view=sql-server-ver15

Create the Node table

Now we will create the node table. A node represents a discreet entity in a graph database. In order to maximize the value of our data lineage solution we need to track the movement of data at the at the lowest level possible which is the column/field level. In addition to the column/field data we also need to capture the table, database and system the column/field is stored in, this will be useful for filtering the data and understanding context. Based on this requirement we will create a node table that stores column, table, database and system name using the following SQL command.

USE [DataLineage]
GO

DROP TABLE IF EXISTS dbo.DataEntity
GO

CREATE TABLE DataEntity

(
    [ID] INTEGER IDENTITY(0,1PRIMARY KEY,
    [ColumnName] NVARCHAR(MAX),
    [TableName] NVARCHAR(MAX),
    [DatabaseName] NVARCHAR(MAX),
    [SystemName] NVARCHAR(MAX),
    [CreatedDt] DATE DEFAULT(GETDATE() )
AS NODE;

GO

I have also included a CreatedDt field for auditing purposes. Next, we will create the edge table which will be used to store information regarding the movement of data from source to target.

Create the Edge table

An Edge table represents a relationship between nodes in a graph database. In our scenario we want to track the movement of data between source and target. In order to capture this information, we will create an Edge table using the following SQL command.

USE [DataLineage]
GO

DROP TABLE IF EXISTS dbo.DataFlowsTo
GO

CREATE TABLE DataFlowsTo
(
    [ProcessName] NVARCHAR(MAX),
    [CreatedDt] DATE DEFAULT(GETDATE())
) AS EDGE;
GO

The edge table contains a single user defined column called ProcessName which is used to capture the name of the ETL process that moves data from source to target. The ProcessName will also be useful for filtering the data which is helpful in scenarios where we need to view all the fields involved in an ETL process or all the ETL processes dependent on one or more fields. Like the node table I have included a CreatedDt field for auditing purposes.

Summary

In this article we have created the database as well as the node and edge tables required to capture data lineage.

In the next article I will cover how to populate these tables using a database procedure in order simplify the process of

Hopefully you have found this to be another practical post.

Until next time.

Anthony