In my previous article I explained how you can query the graph tables in SQL Server to pull out pertinent information regarding the lineage of various data fields in your business intelligence system. In this article I will show you how you can visualize the results using R inside of SQL Server.
I completed these steps using a docker image of SQL Server 2019. I was curious to test the performance of using a SQL Server docker image and I thought I would kill two birds with one stone and finish the series using docker. There were a couple of unexpected extra steps that I had to figure out to get this working properly inside of a docker container which I will call out and explain along the way.
Install R packages
We will use a package called igraph to visualize the data. Before you can use this package, you will need to download and install it.
To properly add packages to SQL Server2019 or later you must use sqlmlutils.
Docker image issue #1
In order to use Machine Learning Services in the SQL Server docker image you must accept the End-User License Agreement associated with both SQL Server and SQL Server Machine Learning Service. I also mounted a volume so that the images generated by the R script are easily accessible from outside the container. To do this start the docker image with the following command.
docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices
Next you will need to install sqlmlutils into the docker image. The instructions provided by Microsoft, which can be found here Install R packages with sqlmlutils – SQL Server Machine Learning Services | Microsoft Docs do not work for a docker image.
Docker image issue #2
The quick and dirty way to install sqlmlutils to the docker image is to bash into it. This is considered hacky and should be avoided in a production environment. For production I would create a dockerfile so that package dependencies are part of the docker image definition and are installed on build.
To bash into the container running SQL Server using the following command.
docker exec -it <your container id> "bash"
To find out the ID of the container running SQL Server run the following command
docker ps --filter "ancestor=mssql-server-mlservices"
Hopefully you will have only one container built off of the mssql-server-mlservices image.
Once you have connected into the container you will need to install some additional software into the Linux Ubuntu running the container.
Docker image issue #3
Before you can install new R packages into the container you will need to update and install some additional dependences.
sudo apt-get update sudo apt-get install build-essential gdb sudo apt-get install gfortran
Once you have installed these dependencies you can add packages to the container, for simplicity I have added the igraph package directly using the following command.
sudo su - -c "R -e \"install.packages('igraph', repos='http://cran.rstudio.com/')\""
In production I would have installed the sqlmlutils and then used this package to install igraph, however to do this I would have to first install unixodbc-dev, then ODBC, then manually install SQLMLUTILS from the zip file.
If you have a problem installing igraph you most likely need to the ~/.R/Makevars file using the following command
cd ~ cat > Makevars C=gcc-5 CXX=g++ CXX1X = g++-5 CFLAGS = -std=c99 -Wall -pedantic
Press CTRL-D to save and exit the file
Enable External Scripts
Now that your environment is setup you need to enable external scripts to run in your SQL Server instance. To do this connect into your SQL Server instance using SQL Server Management Studio or Azure Data Studio and then run the following command.
exec sp_configure 'external scripts enabled', 1; RECONFIGURE;
Generate image using R
Now that we are all setup we can use R to generate the following visuals. This images will be stored as PNG files on your C:\Temp folder (the one mounted in your container if using a container)
System Data Flow
The system level image was generated with the following code.
--System level EXECUTE sp_execute_external_script @language = N'R', @script = N' require(igraph) g <- graph.data.frame(graphdf) V(g)$label.cex <- 2 png(filename = "/var/opt/mssql/images/SystemDataFlow.png", height = 800, width = 1500, res = 100); plot(g, vertex.label.family = "sans", vertex.size = 5) dev.off()', @input_data_1 = N' SELECT src.SystemName AS Source, trgt.SystemName AS Target FROM [DataLineage].[dbo].[DataEntity] src, [DataLineage].[dbo].[DataFlowsTo] df, [DataLineage].[dbo].[DataEntity] trgt WHERE MATCH(src-(df)->trgt);', @input_data_1_name = N'graphdf' GO
Database Level
The database level image was generated with the following code.
--Database level EXECUTE sp_execute_external_script @language = N'R', @script = N' require(igraph) g <- graph.data.frame(graphdf) V(g)$label.cex <- 2 png(filename = "/var/opt/mssql/images/DatabaseDataFlow.png", height = 2400, width = 3000, res = 50); plot(g, vertex.label.family = "sans", vertex.size = 5) dev.off()', @input_data_1 = N' SELECT src.DatabaseName AS Source, trgt.DatabaseName AS Target FROM [DataLineage].[dbo].[DataEntity] src, [DataLineage].[dbo].[DataFlowsTo] df, [DataLineage].[dbo].[DataEntity] trgt WHERE MATCH(src-(df)->trgt);', @input_data_1_name = N'graphdf' GO
Table Level
The table level image was generated with the following code.
--Table level EXECUTE sp_execute_external_script @language = N'R', @script = N' require(igraph) g <- graph.data.frame(graphdf) V(g)$label.cex <- 2 png(filename = "/var/opt/mssql/images/TableDataFlow.png", height = 800, width = 1500, res = 100); plot(g, vertex.label.family = "sans", vertex.size = 5) dev.off()', @input_data_1 = N' SELECT src.TableName AS Source, trgt.TableName AS Target FROM [DataLineage].[dbo].[DataEntity] src, [DataLineage].[dbo].[DataFlowsTo] df, [DataLineage].[dbo].[DataEntity] trgt WHERE MATCH(src-(df)->trgt);', @input_data_1_name = N'graphdf' GO
Column Level
The column level image was generated with the following code.
--Column level EXECUTE sp_execute_external_script @language = N'R', @script = N' require(igraph) g <- graph.data.frame(graphdf) V(g)$label.cex <- 2 png(filename = "/var/opt/mssql/images/ColumnDataFlow.png", height = 2400, width = 3000, res = 50); plot(g, vertex.label.family = "sans", vertex.size = 5) dev.off()', @input_data_1 = N' SELECT CONCAT(src.TableName, ''.'', src.ColumnName) AS Source, CONCAT(trgt.TableName, ''.'', trgt.ColumnName) AS Target FROM [DataLineage].[dbo].[DataEntity] src, [DataLineage].[dbo].[DataFlowsTo] df, [DataLineage].[dbo].[DataEntity] trgt WHERE MATCH(src-(df)->trgt);', @input_data_1_name = N'graphdf' GO
Security
With SQL Server 2019 there is a known issue when writing files to disk using an external script.
igraph png not working on SQL Server 2019 (microsoft.com)
I was able to bypass this issue by writing my PNG files into the docker container in the folder /var/opt/mssql/images/ this folder is then mounted to C:\Temp which makes them accessible from my local host. I mounted the folder when running the docker image using the -v argument
docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices
Summary
I was able to visualize the data by generating PNG files using R and the igraph packages which is very handy for generating documentation that is easily consumable by anyone. The basic concepts explained in this article can be extended and made dynamic so that a user can pick what level to see lineage at as well as enhanced so that a hierarchy can be shown with all levels shown based on user input.
In the next article I will cover how to visualize data in the graph tables using PowerBI.
Hopefully you have found this to be another practical post.
Until next time.
Anthony