How to track data lineage with SQL Server Graph Tables – Part 5 Visualizing the Data Using R

In my previous article I explained how to query the graph tables in SQL Server to pull out pertinent information regarding the lineage of various data fields in your business intelligence system. In this article I will show you how to visualize the results using R inside SQL Server.

I completed these steps using a Docker image of SQL Server 2019. I was curious to test the performance of SQL Server running in a Docker image, so I thought I would kill two birds with one stone and finish the series using Docker. There were a couple of unexpected extra steps that I had to figure out to get this working properly inside a Docker container, which I will call out and explain along the way.

Install R packages

We will use a package called igraph to visualize the data. Before you can use this package, you will need to download and install it.

To properly add packages to SQL Server 2019 or later you must use sqlmlutils.

Docker image issue #1

In order to use Machine Learning Services in the SQL Server Docker image you must accept the End-User License Agreements for both SQL Server and SQL Server Machine Learning Services. I also mounted a volume so that the images generated by the R script are easily accessible from outside the container. To do this, start the Docker image with the following command.

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices

Next you will need to install sqlmlutils in the Docker image. The instructions provided by Microsoft, which can be found here: Install R packages with sqlmlutils – SQL Server Machine Learning Services | Microsoft Docs, do not work for a Docker image.

Docker image issue #2

The quick and dirty way to install sqlmlutils in the Docker image is to bash into it. This is considered hacky and should be avoided in a production environment. For production I would create a Dockerfile so that package dependencies are part of the Docker image definition and are installed on build.

To bash into the container running SQL Server, use the following command.

docker exec -it <your container id> "bash"

To find out the ID of the container running SQL Server, run the following command.

docker ps --filter "ancestor=mssql-server-mlservices"

Hopefully you will have only one container built off of the mssql-server-mlservices image.

Once you have connected to the container you will need to install some additional software in the Ubuntu Linux instance running inside the container.

Docker image issue #3

Before you can install new R packages in the container you will need to update and install some additional dependencies.

sudo apt-get update
sudo apt-get install build-essential gdb
sudo apt-get install gfortran

Once you have installed these dependencies you can add packages to the container. For simplicity, I added the igraph package directly using the following command.

sudo su - -c "R -e \"install.packages('igraph', repos='http://cran.rstudio.com/')\""

In production I would have installed sqlmlutils and then used that package to install igraph; however, to do this I would first have to install unixodbc-dev, then ODBC, and then manually install sqlmlutils from the zip file.

If you have a problem installing igraph you most likely need to create the ~/.R/Makevars file using the following commands.

mkdir -p ~/.R
cd ~/.R
cat > Makevars
CC=gcc-5
CXX=g++
CXX1X = g++-5
CFLAGS = -std=c99 -Wall -pedantic

Press CTRL-D to save and exit the file.

Enable External Scripts

Now that your environment is set up, you need to enable external scripts in your SQL Server instance. To do this, connect to your SQL Server instance using SQL Server Management Studio or Azure Data Studio and run the following command.

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;
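
To confirm that external scripts are now enabled, you can run a quick sanity check that simply prints the R version from inside SQL Server:

EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'print(R.version.string)';
GO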

Generate image using R

Now that we are all set up, we can use R to generate the following visuals. These images will be stored as PNG files in your C:\Temp folder (the one mounted into your container, if you are using a container).

System Data Flow

The system level image was generated with the following code.

--System level
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/SystemDataFlow.png", height = 800, width = 1500, res = 100);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
    @input_data_1 = N'
    SELECT
        src.SystemName AS Source,
        trgt.SystemName AS Target
    FROM
        [DataLineage].[dbo].[DataEntity] src,
        [DataLineage].[dbo].[DataFlowsTo] df,
        [DataLineage].[dbo].[DataEntity] trgt
    WHERE MATCH(src-(df)->trgt);',
    @input_data_1_name = N'graphdf'
GO

Database Level

The database level image was generated with the following code.

--Database level
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/DatabaseDataFlow.png", height = 2400, width = 3000, res = 50);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
    @input_data_1 = N'
    SELECT
        src.DatabaseName AS Source,
        trgt.DatabaseName AS Target
    FROM
        [DataLineage].[dbo].[DataEntity] src,
        [DataLineage].[dbo].[DataFlowsTo] df,
        [DataLineage].[dbo].[DataEntity] trgt
    WHERE MATCH(src-(df)->trgt);',
    @input_data_1_name = N'graphdf'
GO

Table Level

The table level image was generated with the following code.

--Table level
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/TableDataFlow.png", height = 800, width = 1500, res = 100);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
    @input_data_1 = N'
    SELECT
        src.TableName AS Source,
        trgt.TableName AS Target
    FROM
        [DataLineage].[dbo].[DataEntity] src,
        [DataLineage].[dbo].[DataFlowsTo] df,
        [DataLineage].[dbo].[DataEntity] trgt
    WHERE MATCH(src-(df)->trgt);',
    @input_data_1_name = N'graphdf'
GO

Column Level

The column level image was generated with the following code.

--Column level
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/ColumnDataFlow.png", height = 2400, width = 3000, res = 50);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
    @input_data_1 = N'
    SELECT
        CONCAT(src.TableName, ''.'', src.ColumnName) AS Source,
        CONCAT(trgt.TableName, ''.'', trgt.ColumnName) AS Target
    FROM
        [DataLineage].[dbo].[DataEntity] src,
        [DataLineage].[dbo].[DataFlowsTo] df,
        [DataLineage].[dbo].[DataEntity] trgt
    WHERE MATCH(src-(df)->trgt);',
    @input_data_1_name = N'graphdf'
GO

Security

With SQL Server 2019 there is a known issue when writing files to disk using an external script.

igraph png not working on SQL Server 2019 (microsoft.com)

I was able to bypass this issue by writing my PNG files to the folder /var/opt/mssql/images/ inside the Docker container. This folder is mounted to C:\Temp, which makes the files accessible from my local host. I mounted the folder when running the Docker image using the -v argument:

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices

Summary

I was able to visualize the data by generating PNG files using R and the igraph package, which is very handy for generating documentation that is easily consumable by anyone. The basic concepts explained in this article can be extended and made dynamic, so that a user can pick the level at which to see lineage, or enhanced so that a hierarchy can be shown across all levels based on user input.

In the next article I will cover how to visualize the data in the graph tables using Power BI.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 4 Querying the Graph Tables

In the previous article I showed you how to populate the graph tables using the database procedures we created in an earlier post. In this article I will show you how to query the data contained in the graph tables to quickly answer questions regarding data lineage.

Data Flow

As you may recall from the previous article, we created data to store the flow of data from two source systems into our BI system, as depicted in the illustration below.

MATCH

Starting with SQL Server 2017, the MATCH clause can be used with node and edge graph tables to traverse a path in the graph. Imagine a scenario where you want to know the impact of removing a field in your Staging database which you no longer believe you need. In the previous article we captured data lineage for a data flow coming out of JDE; one of the columns was an invoice date in Julian format called RPDIVJ, which comes from the table F0411. For the purpose of this article let's assume that the business believes this date field is not useful and needs to be removed. Without this data lineage solution you would need to manually search through ETL processes to determine the impact of removing this column, a time-consuming task. Thankfully, we have a more elegant solution.

We can query the graph tables in our DataLineage database to quickly see which processes will need to be modified. We will start by querying Staging, which is the start of the data flow.

Run the following code to see which ETL process will need to be modified.

SELECT
	df.ProcessName,
	df.CreatedDt
FROM
	[DataLineage].[dbo].[DataEntity] src,
	[DataLineage].[dbo].[DataFlowsTo] df,
	[DataLineage].[dbo].[DataEntity] trgt
WHERE MATCH(src-(df)->trgt)
	AND src.SystemName = 'Business Intelligence System'
	AND src.DatabaseName = 'Staging'
	AND src.TableName = 'F0411'
	AND src.ColumnName = 'RPDIVJ'


This should give you results like the table below.

| ProcessName                | CreatedDt  |
|----------------------------|------------|
| Load Fact Accounts Payable | 2020-02-12 |
| Load Fact Accounts Payable | 2020-02-12 |

This tells you that the ETL process Load Fact Accounts Payable uses the field RPDIVJ in the Staging database twice. I have also included the CreatedDt field in the query so that you know whether the data is still relevant; typically, the older the data, the less reliable it is for making decisions. If you have built your BI system correctly you should be able to compare the age of this data with the last modified date of the ETL process to see if there is a significant difference; if there is, odds are you need to update the data lineage for this process.
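
As a rough sketch of that kind of freshness check (the 90-day threshold is an arbitrary assumption on my part, not something built into the lineage schema), you could flag lineage entries that have not been refreshed recently:

SELECT
	df.ProcessName,
	df.CreatedDt
FROM
	[DataLineage].[dbo].[DataFlowsTo] df
WHERE
	df.CreatedDt < DATEADD(DAY, -90, GETDATE());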

Now let’s modify the query above to get some more meaningful insights.

SELECT
	src.SystemName AS SourceSystem,
	src.DatabaseName AS SourceDatabase,
	src.TableName AS SourceTable,
	src.ColumnName AS SourceColumn,
	src.CreatedDt AS SourceNodeCreatedDt,

	df.ProcessName,
	df.CreatedDt AS ProcessCreatedDt,

	trgt.SystemName AS TargetSystem,
	trgt.DatabaseName AS TargetDatabase,
	trgt.TableName AS TargetTable,
	trgt.ColumnName AS TargetColumn,
	trgt.CreatedDt AS TargetNodeCreatedDt
FROM
	[DataLineage].[dbo].[DataEntity] src,
	[DataLineage].[dbo].[DataFlowsTo] df,
	[DataLineage].[dbo].[DataEntity] trgt
WHERE MATCH(src-(df)->trgt)
	AND src.SystemName = 'Business Intelligence System'
	AND src.DatabaseName = 'Staging'
	AND src.TableName = 'F0411'
	AND src.ColumnName = 'RPDIVJ'

This query provides you with details on the source, target and the process that copies the data from source to target. When you run this query, it should provide results similar to the table below.

| SourceSystem | SourceDatabase | SourceTable | SourceColumn | SourceNodeCreatedDt | ProcessName | ProcessCreatedDt | TargetSystem | TargetDatabase | TargetTable | TargetColumn | TargetNodeCreatedDt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Business Intelligence System | Staging | F0411 | RPDIVJ | 2020-02-12 | Load Fact Accounts Payable | 2020-02-12 | Business Intelligence System | EDW | Fact Accounts Payable | InvoiceDate-Julian | 2020-02-12 |
| Business Intelligence System | Staging | F0411 | RPDIVJ | 2020-02-12 | Load Fact Accounts Payable | 2020-02-12 | Business Intelligence System | EDW | Fact Accounts Payable | InvoiceDate-Gregorian | 2020-02-12 |

The MATCH clause is great; however, it is still difficult to understand the full impact of removing the RPDIVJ column. This only tells us about the impact on the staging load process; it does not tell us the downstream impacts on the EDW and the cubes which may rely on this field. What we really want to know is the shortest path across all layers of the BI system.

SHORTEST_PATH()

To get a better understanding of the end-to-end impact of removing this field, we will use the SHORTEST_PATH option. This is a new feature available only in SQL Server 2019 and later. To use this feature, write a new query as follows.

SELECT
	CONCAT(src.SystemName,'.', src.DatabaseName, '.', src.TableName, '.', src.ColumnName) AS Source,
	COUNT(trgt.ColumnName) WITHIN GROUP (GRAPH PATH) AS [BI System Layer],
	STRING_AGG(CONCAT(trgt.SystemName, '.', trgt.DatabaseName, '.', trgt.TableName, '.', trgt.ColumnName) , ' -> ') WITHIN GROUP (GRAPH PATH) AS [Data Lineage],
	STRING_AGG(df.ProcessName , ', ') WITHIN GROUP (GRAPH PATH) AS [ETL Processes]

FROM
	[DataLineage].[dbo].[DataEntity] AS src,
	[DataLineage].[dbo].[DataFlowsTo] FOR PATH AS df,
	[DataLineage].[dbo].[DataEntity] FOR PATH AS trgt
WHERE
	MATCH (SHORTEST_PATH(src(-(df)->trgt)+))
	 AND src.SystemName = 'Finance'
	 AND src.DatabaseName = 'JDE'
	 AND src.TableName = 'F0411'
	 AND src.ColumnName = 'RPDIVJ'
ORDER BY 2 DESC

After running the query above you should get results like the table below.

| Source | BI System Layer | Data Lineage | ETL Processes |
|---|---|---|---|
| Finance.JDE.F0411.RPDIVJ | 3 | Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payable Model.Fact Accounts Payable.InvoiceDate | Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payable Model |
| Finance.JDE.F0411.RPDIVJ | 2 | Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Julian | Load F0411, Load Fact Accounts Payable |
| Finance.JDE.F0411.RPDIVJ | 2 | Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian | Load F0411, Load Fact Accounts Payable |
| Finance.JDE.F0411.RPDIVJ | 1 | Business Intelligence System.Staging.F0411.RPDIVJ | Load F0411 |

Let’s break down the results and look at the first row returned by the query.

The first column, Source, is the fully qualified source for the data:

Finance.JDE.F0411.RPDIVJ

The second field, BI System Layer, tells you how many layers up the data travels; in this case the data makes its way through all 3 layers of the BI system (Staging, EDW, SSAS).

The third field, Data Lineage, gives you the source-to-target mapping from end to end:

Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payable Model.Fact Accounts Payable.InvoiceDate

The last field, ETL Processes, lets you know the names of all the ETL processes that depend on the data coming from Finance.JDE.F0411.RPDIVJ:

Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payable Model
The query is using SHORTEST_PATH, STRING_AGG and COUNT to provide meaningful information regarding data lineage on a specific field used as a source for the BI system. With this query you can quickly understand the impact of removing this field across all layers of the BI system.
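
The + quantifier makes the traversal unbounded. If you only want to look a fixed number of hops downstream, it can be swapped for the {1,n} quantifier. The variant sketch below (same tables and filters, only the pattern changes) limits the traversal to at most two hops:

SELECT
	CONCAT(src.SystemName, '.', src.DatabaseName, '.', src.TableName, '.', src.ColumnName) AS Source,
	STRING_AGG(CONCAT(trgt.SystemName, '.', trgt.DatabaseName, '.', trgt.TableName, '.', trgt.ColumnName), ' -> ') WITHIN GROUP (GRAPH PATH) AS [Data Lineage]
FROM
	[DataLineage].[dbo].[DataEntity] AS src,
	[DataLineage].[dbo].[DataFlowsTo] FOR PATH AS df,
	[DataLineage].[dbo].[DataEntity] FOR PATH AS trgt
WHERE
	MATCH(SHORTEST_PATH(src(-(df)->trgt){1,2}))
	AND src.SystemName = 'Finance'
	AND src.DatabaseName = 'JDE'
	AND src.TableName = 'F0411'
	AND src.ColumnName = 'RPDIVJ';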

Summary

By leveraging Graph tables and new querying capabilities associated with these tables you can quickly answer questions regarding data lineage and provide greater trust, understanding and ultimately adoption of the BI system.

In the next article I will cover how to visualize data in the graph tables using R.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 3 Populate Graph Tables

In this 3rd article I will show you how to use the procedures that we created in the previous post to populate the graph tables with useful data lineage information.

Data Flow

We will simulate the flow of data from two source systems (Finance and HR) into a data warehouse/business intelligence system as illustrated in the diagram below.

Staging

Most modern BI systems include a staging area, also sometimes referred to as a landing area. Data is collected from various disparate systems and loaded into the staging database with very minimal transformations, typically only data type conversions. The goal is to quickly grab the data from the different source systems and load it into the staging area for further processing so that any performance impact to the source systems is minimized.

To capture the data lineage for this first hop from source system to staging we can use our procedure as follows.

-- JDE to Staging
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Finance',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'JDE',
		@TargetDatabase = 'Staging',
		
		--Table Level
		@SourceTable = 'F0411',
		@TargetTable = 'F0411',

		--Column Level
		@SourceColumnList = 'RPKCO, RPDOC, RPDCT, RPSFX, RPSFXE, RPDCTA, RPAN8, RPPYE, RPSNTO, RPDIVJ',
		@TargetColumnList = 'RPKCO, RPDOC, RPDCT, RPSFX, RPSFXE, RPDCTA, RPAN8, RPPYE, RPSNTO, RPDIVJ',

		@ProcessName = 'Load F0411';
GO



-- HR to Staging 
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'HR',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'Payroll',
		@TargetDatabase = 'Staging',
		
		--Table Level
		@SourceTable = 'EMP',
		@TargetTable = 'EMP',

		--Column Level
		@SourceColumnList = 'EMPID, HIREDT, JOBTITLE',
		@TargetColumnList = 'EMPID, HIREDT, JOBTITLE',

		@ProcessName = 'Load EMP';
GO 

As you can see from the code above it is very easy to map an input table to an output table using the procedures we created in the previous steps.

EDW

Next, we will capture the data lineage from Staging to the Enterprise Data Warehouse (EDW).

-- Staging to EDW for Finance tables

EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'Staging',
		@TargetDatabase = 'EDW',
		
		--Table Level
		@SourceTable = 'F0411',
		@TargetTable = 'Fact Accounts Payable',

		--Column Level
		@SourceColumnList = 'RPDOC, RPDCT, RPSFX, RPSFXE, RPDCTA, RPAN8, RPPYE, RPSNTO, RPDIVJ, RPDIVJ',
		@TargetColumnList = 'Document, Document Type, Document Pay Item, Pay Item Extension Number, Adjusting Document Type, Address Number, Payee Address Number, Approvers Address Number, Invoice Date - Julian, Invoice Date - Gregorian',

		@ProcessName = 'Load Fact Accounts Payable';
GO



-- Staging to EDW for HR tables
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'Staging',
		@TargetDatabase = 'EDW',
		
		--Table Level
		@SourceTable = 'EMP',
		@TargetTable = 'Dim Employee',

		--Column Level
		@SourceColumnList = 'EMPID, HIREDT, HIREDT, JOBTITLE',
		@TargetColumnList = 'Employee Number, Hire Date, Tenure, Job Title',

		@ProcessName = 'Load Dim Employee';
GO

If you have a keen eye you might have noticed that we are mapping some of the fields twice. The field RPDIVJ in the JDE table F0411 is mapped to both Invoice Date – Julian and Invoice Date – Gregorian. The field HIREDT in the HR EMP table is mapped to both the Hire Date and the Tenure columns. In both scenarios this implies some transformation logic between these columns.
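
As an illustration of the kind of transformation that mapping implies (a hypothetical sketch; the real ETL code is outside the scope of this series), JDE stores dates in a CYYDDD "Julian" format, which can be converted to a Gregorian date in T-SQL like this:

-- JDE Julian dates are CYYDDD: century offset from 1900, two-digit year, day of year
DECLARE @RPDIVJ INT = 120043; -- 12 February 2020 in JDE Julian format

SELECT DATEADD(DAY, @RPDIVJ % 1000 - 1,
	DATEFROMPARTS(1900 + @RPDIVJ / 1000, 1, 1)) AS [Invoice Date - Gregorian];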

Accounts Payable Model

The final step in our example is the process that loads the SQL Server Analysis Services (SSAS) model.

-- EDW to SSAS Finance
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'EDW',
		@TargetDatabase = 'SSAS Accounts Payable Model',
		
		--Table Level
		@SourceTable = 'Fact Accounts Payable',
		@TargetTable = 'Fact Accounts Payable',

		--Column Level
		@SourceColumnList = 'Document, Document Type, Document Pay Item, Pay Item Extension Number, Adjusting Document Type, Address Number,  Invoice Date - Gregorian',
		@TargetColumnList = 'Document, Document Type, Document Pay Item, Pay Item Extension Number, Adjusting Document Type, Address Number,  Invoice Date',

		@ProcessName = 'Process SSAS Accounts Payable Model';
GO


-- EDW to SSAS HR
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'EDW',
		@TargetDatabase = 'SSAS Accounts Payable Model',
		
		--Table Level
		@SourceTable = 'Dim Employee',
		@TargetTable = 'Dim Employee',

		--Column Level
		@SourceColumnList = 'Employee Number, Hire Date, Tenure,  Job Title',
		@TargetColumnList = 'Employee Number, Hire Date, Tenure,  Job Title',

		@ProcessName = 'Process SSAS Accounts Payable Model';
GO

As you can see in the code above, the Invoice Date – Gregorian from the Finance system is used to populate the Invoice Date, while the Invoice Date – Julian did not make it into the SSAS cube.

Summary

The database procedures make it really easy to capture data lineage. They could even be enhanced to accept and parse JSON or XML inputs, so that data lineage could be dropped in and maintained using a web application… can someone say Power App?
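
A minimal sketch of that idea follows. The wrapper procedure name and the JSON property names are assumptions of my own; only the Create_Data_Lineage parameters come from the earlier posts.

-- Hypothetical wrapper: parse a JSON payload and hand it off to Create_Data_Lineage
CREATE OR ALTER PROCEDURE [dbo].[Create_Data_Lineage_From_Json]
	@Json NVARCHAR(MAX)
AS
BEGIN
	DECLARE @SourceSystem     NVARCHAR(100), @TargetSystem     NVARCHAR(100),
	        @SourceDatabase   NVARCHAR(100), @TargetDatabase   NVARCHAR(100),
	        @SourceTable      NVARCHAR(100), @TargetTable      NVARCHAR(100),
	        @SourceColumnList NVARCHAR(MAX), @TargetColumnList NVARCHAR(MAX),
	        @ProcessName      NVARCHAR(100);

	-- Shred the JSON document into the procedure parameters
	SELECT
		@SourceSystem     = SourceSystem,     @TargetSystem     = TargetSystem,
		@SourceDatabase   = SourceDatabase,   @TargetDatabase   = TargetDatabase,
		@SourceTable      = SourceTable,      @TargetTable      = TargetTable,
		@SourceColumnList = SourceColumnList, @TargetColumnList = TargetColumnList,
		@ProcessName      = ProcessName
	FROM OPENJSON(@Json) WITH (
		SourceSystem     NVARCHAR(100) '$.sourceSystem',
		TargetSystem     NVARCHAR(100) '$.targetSystem',
		SourceDatabase   NVARCHAR(100) '$.sourceDatabase',
		TargetDatabase   NVARCHAR(100) '$.targetDatabase',
		SourceTable      NVARCHAR(100) '$.sourceTable',
		TargetTable      NVARCHAR(100) '$.targetTable',
		SourceColumnList NVARCHAR(MAX) '$.sourceColumnList',
		TargetColumnList NVARCHAR(MAX) '$.targetColumnList',
		ProcessName      NVARCHAR(100) '$.processName'
	);

	EXEC [dbo].[Create_Data_Lineage]
		@SourceSystem     = @SourceSystem,     @TargetSystem     = @TargetSystem,
		@SourceDatabase   = @SourceDatabase,   @TargetDatabase   = @TargetDatabase,
		@SourceTable      = @SourceTable,      @TargetTable      = @TargetTable,
		@SourceColumnList = @SourceColumnList, @TargetColumnList = @TargetColumnList,
		@ProcessName      = @ProcessName;
END
GO

-- Example call using values from the earlier HR example
EXEC [dbo].[Create_Data_Lineage_From_Json] @Json = N'{
	"sourceSystem": "HR", "targetSystem": "Business Intelligence System",
	"sourceDatabase": "Payroll", "targetDatabase": "Staging",
	"sourceTable": "EMP", "targetTable": "EMP",
	"sourceColumnList": "EMPID, HIREDT, JOBTITLE",
	"targetColumnList": "EMPID, HIREDT, JOBTITLE",
	"processName": "Load EMP"
}';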

In the next article I will cover how to query the graph tables as well as visualize the data in order to answer data lineage questions.

Hopefully you have found this to be another practical post.

Until next time.

Anthony