How to track data lineage with SQL Server Graph Tables – Part 5 how to visualize the data using R

In my previous article I explained how you can query the graph tables in SQL Server to pull out pertinent information regarding the lineage of various data fields in your business intelligence system. In this article I will show you how you can visualize the results using R inside of SQL Server.

I completed these steps using a docker image of SQL Server 2019. I was curious to test the performance of using a SQL Server docker image and I thought I would kill two birds with one stone and finish the series using docker. There were a couple of unexpected extra steps that I had to figure out to get this working properly inside of a docker container which I will call out and explain along the way.

Install R packages

We will use a package called igraph to visualize the data. Before you can use this package, you will need to download and install it.

To properly add packages to SQL Server2019 or later you must use sqlmlutils.

Docker image issue #1

In order to use Machine Learning Services in the SQL Server docker image you must accept the End-User License Agreement associated with both SQL Server and SQL Server Machine Learning Service. I also mounted a volume so that the images generated by the R script are easily accessible from outside the container. To do this start the docker image with the following command.

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices

Next you will need to install sqlmlutils into the docker image. The instructions provided by Microsoft, which can be found here Install R packages with sqlmlutils – SQL Server Machine Learning Services | Microsoft Docs do not work for a docker image.

Docker image issue #2

The quick and dirty way to install sqlmlutils to the docker image is to bash into it. This is considered hacky and should be avoided in a production environment. For production I would create a dockerfile so that package dependencies are part of the docker image definition and are installed on build.

To bash into the container running SQL Server using the following command.

docker exec -it <your container id> "bash"

To find out the ID of the container running SQL Server run the following command

docker ps --filter "ancestor=mssql-server-mlservices"

Hopefully you will have only one container built off of the mssql-server-mlservices image.

Once you have connected into the container you will need to install some additional software into the Linux Ubuntu running the container.

Docker image issue #3

Before you can install new R packages into the container you will need to update and install some additional dependences.

sudo apt-get update


sudo apt-get install build-essential gdb


sudo apt-get install gfortran

Once you have installed these dependencies you can add packages to the container, for simplicity I have added the igraph package directly using the following command.

sudo su - -c "R -e \"install.packages('igraph', repos='http://cran.rstudio.com/')\""

In production I would have installed the sqlmlutils and then used this package to install igraph, however to do this I would have to first install unixodbc-dev, then ODBC, then manually install SQLMLUTILS from the zip file.

If you have a problem installing igraph you most likely need to the ~/.R/Makevars file using the following command

cd ~

cat > Makevars

C=gcc-5

CXX=g++

CXX1X = g++-5

CFLAGS = -std=c99 -Wall -pedantic

Press CTRL-D to save and exit the file

Enable External Scripts

Now that your environment is setup you need to enable external scripts to run in your SQL Server instance. To do this connect into your SQL Server instance using SQL Server Management Studio or Azure Data Studio and then run the following command.

exec
sp_configure
'external scripts enabled', 1;
RECONFIGURE;

Generate image using R

Now that we are all setup we can use R to generate the following visuals. This images will be stored as PNG files on your C:\Temp folder (the one mounted in your container if using a container)

System Data Flow

The system level image was generated with the following code.

--System level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/SystemDataFlow.png", height = 800, width = 1500, res = 100);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
@input_data_1 =
N'
 SELECT 
        src.SystemName AS Source,
        trgt.SystemName AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',
@input_data_1_name =
N'graphdf'
GO

Database Level

The database level image was generated with the following code.

--Database level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/DatabaseDataFlow.png", height = 2400, width = 3000, res = 50);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
@input_data_1 =
N'
 SELECT 
        src.DatabaseName AS Source,
        trgt.DatabaseName AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',
@input_data_1_name =
N'graphdf'
GO

Table Level

The table level image was generated with the following code.

--Table level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/TableDataFlow.png", height = 800, width = 1500, res = 100);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',

@input_data_1 =
N'
 SELECT 
        src.TableName AS Source,
        trgt.TableName AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',

@input_data_1_name =
N'graphdf'
GO

Column Level

The column level image was generated with the following code.

--Column level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/ColumnDataFlow.png", height = 2400, width = 3000, res = 50);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',

@input_data_1 =
N'
 SELECT 
        CONCAT(src.TableName, ''.'', src.ColumnName) AS Source,
        CONCAT(trgt.TableName, ''.'', trgt.ColumnName) AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',

@input_data_1_name =
N'graphdf'
GO

Security

With SQL Server 2019 there is a known issue when writing files to disk using an external script.

igraph png not working on SQL Server 2019 (microsoft.com)

I was able to bypass this issue by writing my PNG files into the docker container in the folder /var/opt/mssql/images/ this folder is then mounted to C:\Temp which makes them accessible from my local host. I mounted the folder when running the docker image using the -v argument

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices

Summary

I was able to visualize the data by generating PNG files using R and the igraph packages which is very handy for generating documentation that is easily consumable by anyone. The basic concepts explained in this article can be extended and made dynamic so that a user can pick what level to see lineage at as well as enhanced so that a hierarchy can be shown with all levels shown based on user input.

In the next article I will cover how to visualize data in the graph tables using PowerBI.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 2 Create Database Procedures

This is my second post in a series which explains how to use SQL Server Graph Tables to track data lineage.

In my previous post I covered creating the node and edges tables required to store data lineage. In this post I will show you how to create database procedures in order to simplify the process of populating the graph tables. These procedures are deliberately basic so that they are easy to understand but they are also easily extensible and can serve as the starting point for a more sophisticated solution.

Overview

Before diving into the code it is always good to have a conceptual understanding of the process. The flowchart below explains the logic of these procedures.

As you can see in the diagram above, we check if the source or target is new and then based on the condition create the appropriate nodes and edges. In order to avoid duplicating code we will create a procedure for creating the nodes and the edges and then call theses sub processes from a parent process that controls the logic flow.

Create Nodes Procedure

First, we will create a procedure to populate our node table. As you recall from the previous post our node table is called [dbo].[DataEntity].

DROP PROCEDURE IF EXISTS dbo.Create_Nodes
GO

CREATE PROCEDURE dbo.Create_Nodes

    @ColumnList VARCHAR(MAX),
    @Table VARCHAR(MAX),
    @Database VARCHAR(MAX),
    @System VARCHAR(MAX)

AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @NodeIDs TABLE (ID NVARCHAR(1000));

    --Insert data ino Node table and keep Source IDS so that we can populate the Edge table
    INSERT INTO [dbo].[DataEntity]
    OUTPUT INSERTED.$NODE_ID AS TOID INTO @NodeIDs(ID)
    SELECT
        value AS [ColumnName],
        @Table AS [TableName],
        @Database AS [DatabaseName],
        @System AS [SystemName],
        GETDATE() AS [CreatedDt]

    FROM  
        STRING_SPLIT(@ColumnList, ',');

    SELECT ID FROM @NodeIDs;    
    RETURN 
END

Now that we have a procedure to create our nodes, we will create another procedure to create the edges.

Create Edges Procedure

We need to pass a list of procedure a list of node ids to insert into the [dbo].[DataFlowsTo] table so we will create a user defined type to store and pass this information.

/* Create a table type in order to pass a list of Node IDs to this procedure*/
DROP TYPE IF EXISTS dbo.NodeIDTableType
CREATE TYPE dbo.NodeIDTableType  AS TABLE  ( ID NVARCHAR(1000) );

GO

After the user defined table type is created we can create the procedure to populate the edge table.

DROP PROCEDURE IF EXISTS dbo.Create_Edges
GO

CREATE PROCEDURE dbo.Create_Edges

    @Source  NodeIDTableType READONLY,
    @Target  NodeIDTableType READONLY,
    @Process VARCHAR(MAX)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

   --Get the FROM and TO node ids into a common table exppression and populate the edge table in batch.
   */
   
    WITH CTE (TOID, FROMID) AS
    (
        SELECT 
            A.FROMID,
            B.TOID
        FROM
            (SELECT ID AS FROMID, ROW_NUMBER() OVER(ORDER BY ID ASC) as RowNum FROM @Source) as A
        JOIN 
            (SELECT ID AS TOID, ROW_NUMBER() OVER(ORDER BY ID ASC) as RowNum FROM @Target) as B
        ON A.RowNum = B.RowNum
    )
    
    --Insert Lineage data into DataFlowsTo Edge table
     INSERT INTO [dbo].[DataFlowsTo] ($from_id, $to_id, ProcessName)
     SELECT CTE.TOID, CTE.FROMID, @Process FROM CTE;

    
END

In case you are wondering I used a common table expression (CTE) so that the edge records could be created in bulk rather than one at a time.

Create Data Lineage Procedure

This is the final procedure which checks to see if the source or target is new and calls the procedures above as required. The conditional logic check is premised on the assumption a the combination of system name, database name, table name and column name is unique.

DROP PROCEDURE IF EXISTS dbo.Create_Data_Lineage
GO
USE [DataLineage]
GO


CREATE PROCEDURE [dbo].[Create_Data_Lineage] 

    --Source data
    @SourceSystem VARCHAR(MAX),
    @SourceDatabase VARCHAR(MAX),
    @SourceTable VARCHAR(MAX),
    @SourceColumnList VARCHAR(MAX),

    --Target data
    @TargetSystem VARCHAR(MAX),
    @TargetDatabase VARCHAR(MAX),
    @TargetTable VARCHAR(MAX),
    @TargetColumnList VARCHAR(MAX),

    @ProcessName VARCHAR(MAX)

AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

        
    --DECLARE @SourceIDs TABLE (ID NVARCHAR(1000));
    --DECLARE @TargetIDs TABLE (ID NVARCHAR(1000));
    
    DECLARE @SourceIDs NodeIDTableType;
    DECLARE @TargetIDs NodeIDTableType;

    
    --Prepare the list of fields by removing any spaces or tabs between the comma seperated list of source and target columns
    SET @SourceColumnList = REPLACE(REPLACE(@SourceColumnList,' ', ''), char(9), '');
    SET @TargetColumnList = REPLACE(REPLACE(@TargetColumnList,' ', ''), char(9), '');

    --Check for existing sources , if found use those id's otherwise create new nodes and use new IDs
    INSERT INTO @SourceIDs
    SELECT 
        $NODE_ID
    FROM 
        [dbo].[DataEntity] DE
    INNER JOIN 
            (SELECT
                value AS [ColumnName],
                @SourceTable AS [TableName],
                @SourceDatabase AS [DatabaseName],
                @SourceSystem AS [SystemName]
            FROM  
                STRING_SPLIT(@SourceColumnList, ',')
            ) SRC
    ON
        DE.[ColumnName] = SRC.[ColumnName]
    AND 
        DE.[TableName] = SRC.[TableName]
    AND
        DE.[DatabaseName] = SRC.[DatabaseName]
    AND
        DE.[SystemName] = SRC.[SystemName];

    --Check for existing  targets, if found use those id's otherwise create new nodes and use new IDs
    INSERT INTO @TargetIDs
    SELECT 
        $NODE_ID
    FROM 
        [dbo].[DataEntity] DE
    INNER JOIN 
            (SELECT
                value AS [ColumnName],
                @TargetTable AS [TableName],
                @TargetDatabase AS [DatabaseName],
                @TargetSystem AS [SystemName]
            FROM  
                STRING_SPLIT(@TargetColumnList, ',')
            ) TRGT
    ON
        DE.[ColumnName] = TRGT.[ColumnName]
    AND 
        DE.[TableName] = TRGT.[TableName]
    AND
        DE.[DatabaseName] = TRGT.[DatabaseName]
    AND
        DE.[SystemName] = TRGT.[SystemName];

    IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN

            --Create source nodes
            INSERT @SourceIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @SourceSystem,
                    @Database = @SourceDatabase,
                    @Table = @SourceTable,
                    @ColumnList = @SourceColumnList;
        
            --Create target nodes
            INSERT @TargetIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @TargetSystem,
                    @Database = @TargetDatabase,
                    @Table = @TargetTable,
                    @ColumnList = @TargetColumnList;
        
        
            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
        END
    ELSE IF (EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN
        
            --create target nodes
            INSERT @TargetIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @TargetSystem,
                    @Database = @TargetDatabase,
                    @Table = @TargetTable,
                    @ColumnList = @TargetColumnList;

            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
        END
    ELSE IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN
        
            --create source nodes
            INSERT @SourceIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @SourceSystem,
                    @Database = @SourceDatabase,
                    @Table = @SourceTable,
                    @ColumnList = @SourceColumnList;

            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
        END
    ELSE 
        --No new nodes required
        --Create edges between source and target
        EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
    

END
GO

Summary

In this article we have created the database procedures to simplify the process of creating the necessary node and edge records to capture data lineage between source and target fields.

In the next article I will cover how to call these procedures and query the graph tables to answer data lineage questions.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Featured

Transforming data into value one blog post at a time

 

 

Thanks for joining me!

My name is Anthony Bulk and I am passionate about making data useful. Whether it’s through data storytelling, business intelligence, artificial intelligence or data storage I will cover it all.

Come join me on this journey down data alley!

Here are the topics that have covered so far…

Last updated March 15, 2019.

Instant insight, automation and action using Power Apps, Power BI, Flow and Azure Machine Learning

Extend your information reach without over stretching by virtualizing data using SQL Server 2019 and MongoDB


“Without data you’re just another person with an opinion.”

W. Edwards Deming

 “If I had only one hour to save the world, I would spend fifty-five minutes defining the problem, and only five minutes finding the solution.”

Albert Einstein