How to track data lineage with SQL Server Graph Tables – Part 5 how to visualize the data using R

In my previous article I explained how you can query the graph tables in SQL Server to pull out pertinent information regarding the lineage of various data fields in your business intelligence system. In this article I will show you how you can visualize the results using R inside of SQL Server.

I completed these steps using a docker image of SQL Server 2019. I was curious to test the performance of using a SQL Server docker image and I thought I would kill two birds with one stone and finish the series using docker. There were a couple of unexpected extra steps that I had to figure out to get this working properly inside of a docker container which I will call out and explain along the way.

Install R packages

We will use a package called igraph to visualize the data. Before you can use this package, you will need to download and install it.

To properly add packages to SQL Server2019 or later you must use sqlmlutils.

Docker image issue #1

In order to use Machine Learning Services in the SQL Server docker image you must accept the End-User License Agreement associated with both SQL Server and SQL Server Machine Learning Service. I also mounted a volume so that the images generated by the R script are easily accessible from outside the container. To do this start the docker image with the following command.

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices

Next you will need to install sqlmlutils into the docker image. The instructions provided by Microsoft, which can be found here Install R packages with sqlmlutils – SQL Server Machine Learning Services | Microsoft Docs do not work for a docker image.

Docker image issue #2

The quick and dirty way to install sqlmlutils to the docker image is to bash into it. This is considered hacky and should be avoided in a production environment. For production I would create a dockerfile so that package dependencies are part of the docker image definition and are installed on build.

To bash into the container running SQL Server using the following command.

docker exec -it <your container id> "bash"

To find out the ID of the container running SQL Server run the following command

docker ps --filter "ancestor=mssql-server-mlservices"

Hopefully you will have only one container built off of the mssql-server-mlservices image.

Once you have connected into the container you will need to install some additional software into the Linux Ubuntu running the container.

Docker image issue #3

Before you can install new R packages into the container you will need to update and install some additional dependences.

sudo apt-get update


sudo apt-get install build-essential gdb


sudo apt-get install gfortran

Once you have installed these dependencies you can add packages to the container, for simplicity I have added the igraph package directly using the following command.

sudo su - -c "R -e \"install.packages('igraph', repos='http://cran.rstudio.com/')\""

In production I would have installed the sqlmlutils and then used this package to install igraph, however to do this I would have to first install unixodbc-dev, then ODBC, then manually install SQLMLUTILS from the zip file.

If you have a problem installing igraph you most likely need to the ~/.R/Makevars file using the following command

cd ~

cat > Makevars

C=gcc-5

CXX=g++

CXX1X = g++-5

CFLAGS = -std=c99 -Wall -pedantic

Press CTRL-D to save and exit the file

Enable External Scripts

Now that your environment is setup you need to enable external scripts to run in your SQL Server instance. To do this connect into your SQL Server instance using SQL Server Management Studio or Azure Data Studio and then run the following command.

exec
sp_configure
'external scripts enabled', 1;
RECONFIGURE;

Generate image using R

Now that we are all setup we can use R to generate the following visuals. This images will be stored as PNG files on your C:\Temp folder (the one mounted in your container if using a container)

System Data Flow

The system level image was generated with the following code.

--System level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/SystemDataFlow.png", height = 800, width = 1500, res = 100);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
@input_data_1 =
N'
 SELECT 
        src.SystemName AS Source,
        trgt.SystemName AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',
@input_data_1_name =
N'graphdf'
GO

Database Level

The database level image was generated with the following code.

--Database level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/DatabaseDataFlow.png", height = 2400, width = 3000, res = 50);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',
@input_data_1 =
N'
 SELECT 
        src.DatabaseName AS Source,
        trgt.DatabaseName AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',
@input_data_1_name =
N'graphdf'
GO

Table Level

The table level image was generated with the following code.

--Table level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/TableDataFlow.png", height = 800, width = 1500, res = 100);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',

@input_data_1 =
N'
 SELECT 
        src.TableName AS Source,
        trgt.TableName AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',

@input_data_1_name =
N'graphdf'
GO

Column Level

The column level image was generated with the following code.

--Column level
EXECUTE
sp_execute_external_script
@language =
N'R',
@script =
N'
    require(igraph)
    g <- graph.data.frame(graphdf)
    V(g)$label.cex <- 2
    png(filename = "/var/opt/mssql/images/ColumnDataFlow.png", height = 2400, width = 3000, res = 50);
    plot(g, vertex.label.family = "sans", vertex.size = 5)
    dev.off()',

@input_data_1 =
N'
 SELECT 
        CONCAT(src.TableName, ''.'', src.ColumnName) AS Source,
        CONCAT(trgt.TableName, ''.'', trgt.ColumnName) AS Target
 FROM 
         [DataLineage].[dbo].[DataEntity] src,
 [DataLineage].[dbo].[DataFlowsTo] df,
 [DataLineage].[dbo].[DataEntity] trgt
 WHERE MATCH(src-(df)->trgt);',

@input_data_1_name =
N'graphdf'
GO

Security

With SQL Server 2019 there is a known issue when writing files to disk using an external script.

igraph png not working on SQL Server 2019 (microsoft.com)

I was able to bypass this issue by writing my PNG files into the docker container in the folder /var/opt/mssql/images/ this folder is then mounted to C:\Temp which makes them accessible from my local host. I mounted the folder when running the docker image using the -v argument

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e ACCEPT_EULA_ML=Y -e MSSQL_SA_PASSWORD=YourPassword -v C:\Temp:/var/opt/mssql/images -p 1433:1433 mssql-server-mlservices

Summary

I was able to visualize the data by generating PNG files using R and the igraph packages which is very handy for generating documentation that is easily consumable by anyone. The basic concepts explained in this article can be extended and made dynamic so that a user can pick what level to see lineage at as well as enhanced so that a hierarchy can be shown with all levels shown based on user input.

In the next article I will cover how to visualize data in the graph tables using PowerBI.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 4 Querying the Graph Tables

In the previous article I showed you how you can populate the graph tables using the database procedures we created in an earlier post. In this article I will show you how you can query the data contained in the graph tables to quickly answer questions regarding data lineage.

Data Flow

As you may recall from the previous article, we created data to store the flow of data from two source systems in to our BI system as depicted in the illustration below.

MATCH

Starting in SQL Server 2017 the MATCH clause can be used with node and edge graph tables to traverse a path in the graph tables. Imagine a scenario where you want to know the impact of removing a field in your Staging database which you no longer believe you need. In the previous article we captured data lineage for a data flow that was coming out of JDE, one of the columns was an invoice data in Julian called RPDIVJ which is coming from the table F0411. For the purpose of this article lets assume that the business believes this date field is not useful and needs to be removed. Without this data lineage solution you would need to manually search through ETL processes to determine the impact of removing this column, a time consuming task, thankfully we have a more elegant solution.

We can query our graph tables in our DataLineage database to quickly see which processes will need to be modified. We will start by querying Staging which is the start of the data flow.

Run the following code to see which ETL process will need to be modified.

SELECT 
	df.ProcessName,
	df.CreatedDt
 FROM 
        [DataLineage].[dbo].[DataEntity] src,
        [DataLineage].[dbo].[DataFlowsTo] df,
        [DataLineage].[dbo].[DataEntity] trgt
 
WHERE MATCH(src-(df)->trgt)
	 AND src.SystemName = 'Business Intelligence System'
	 AND src.DatabaseName = 'Staging'
	 AND src.TableName = 'F0411'
	 AND src.ColumnName = 'RPDIVJ'


This should give you results like the table below.

ProcessName

CreatedDt

Load Fact Accounts Payable

2020-02-12

Load Fact Accounts Payable

2020-02-12

This tells you that the ETL process Load Fact Accounts Payable uses the field RPDIVJ in the Staging database twice. I have also included the CreatedDt filed in the query so that you know if the data is still relevant, typically the older the data the less reliable it is to make decision. If you have built your BI system correctly you should be able to compare that age of this data with the last modified date of the ETL process to see if there is a significant difference, if there is odds are you need to update the data lineage for this process.

Now let’s modify the query above to get some more meaningful insights.

SELECT 
			src.SystemName AS SourceSystem,
			src.DatabaseName AS SourceDatabase,
			src.TableName AS SourceTable,
			src.ColumnName AS SourceColumn,
			src.CreatedDt AS SourceNodeCreatedDt,
			
			df.ProcessName,
			df.CreatedDt AS ProcessCreatedDt,
			
			trgt.SystemName AS TargetSystem,
			trgt.DatabaseName AS TargetDatabase,
			trgt.TableName AS TargetTable,
			trgt.ColumnName AS TargetColumn,
			trgt.CreatedDt AS TargetNodeCreatedDt

 FROM 
	  [DataLineage].[dbo].[DataEntity] src,
          [DataLineage].[dbo].[DataFlowsTo] df,
          [DataLineage].[dbo].[DataEntity] trgt
 
WHERE MATCH(src-(df)->trgt)
	 AND src.SystemName = 'Business Intelligence System'
	 AND src.DatabaseName = 'Staging'
	 AND src.TableName = 'F0411'
	 AND src.ColumnName = 'RPDIVJ'

This query provides you with details on the source, target and the process that copies the data from source to target. When you run this query, it should provide results similar to the table below.

SourceSystem

SourceDatabase

SourceTable

SourceColumn

SourceNodeCreatedDt

ProcessName

ProcessCreatedDt

TargetSystem

TargetDatabase

TargetTable

TargetColumn

TargetNodeCreatedDt

Business Intelligence System

Staging

F0411

RPDIVJ

2020-02-12

Load Fact Accounts Payable

2020-02-12

Business Intelligence System

EDW

Fact Accounts Payable

InvoiceDate-Julian

2020-02-12

Business Intelligence System

Staging

F0411

RPDIVJ

2020-02-12

Load Fact Accounts Payable

2020-02-12

Business Intelligence System

EDW

Fact Accounts Payable

InvoiceDate-Gregorian

2020-02-12

The match clause is great however it is still difficult to understand the full impact of removing the RPDIVJ column, this only tells us about the impact to the staging load process and will not tell us down stream impacts to the EDW and the cubes which may rely on this field. What we really want to know is the shortest path across all layers of the BI system.

SHORTEST_PATH()

To get a better understanding of the end to end impact of removing this field we will use the SHORTEST_PATH option. This is a new feature only available in SQL Server 2019 or later. To use this feature, write a new query as follows.

SELECT
	CONCAT(src.SystemName,'.', src.DatabaseName, '.', src.TableName, '.', src.ColumnName) AS Source,
	COUNT(trgt.ColumnName) WITHIN GROUP (GRAPH PATH) AS [BI System Layer],
	STRING_AGG(CONCAT(trgt.SystemName, '.', trgt.DatabaseName, '.', trgt.TableName, '.', trgt.ColumnName) , ' -> ') WITHIN GROUP (GRAPH PATH) AS [Data Lineage],
	STRING_AGG(df.ProcessName , ', ') WITHIN GROUP (GRAPH PATH) AS [ETL Processes]

FROM
	[DataLineage].[dbo].[DataEntity] AS src,
	[DataLineage].[dbo].[DataFlowsTo] FOR PATH AS df,
	[DataLineage].[dbo].[DataEntity] FOR PATH AS trgt
WHERE
	MATCH (SHORTEST_PATH(src(-(df)->trgt)+))
	 AND src.SystemName = 'Finance'
	 AND src.DatabaseName = 'JDE'
	 AND src.TableName = 'F0411'
	 AND src.ColumnName = 'RPDIVJ'
ORDER BY 2 DESC

After running the query above you should get results like the table below.

Source

BI System Layer

Data Lineage

ETL Processes

Finance.JDE.F0411.RPDIVJ

3

Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payble Model.Fact Accounts Payable.InvoiceDate

Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payble Model

Finance.JDE.F0411.RPDIVJ

2

Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Julian

Load F0411, Load Fact Accounts Payable

Finance.JDE.F0411.RPDIVJ

2

Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian

Load F0411, Load Fact Accounts Payable

Finance.JDE.F0411.RPDIVJ

1

Business Intelligence System.Staging.F0411.RPDIVJ

Load F0411

Let’s break down the results and look at the first row returned by the query.

The first column Source is the fully qualified source for the data

Finance.JDE.F0411.RPDIVJ

The second field BI System Layer tells you how many layers up the data travels, so in this case the data makes its way through all 3 layers of the BI system (Staging, EDW, SSAS).

The third field Data Lineage tells you the source to target mapping from end to end.
Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payble Model.Fact Accounts Payable.InvoiceDate
The last field ETL Processes lets you know the names of all the ETL processes that depend on the data coming from Finance.JDE.F0411.RPDIVJ.

Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payble Model
The query is using SHORTEST_PATH, STRING_AGG and COUNT to provide meaningful information regarding data lineage on a specific field used as a source for the BI system. With this query you can quickly understand the impact of removing this field across all layers of the BI system.

Summary

By leveraging Graph tables and new querying capabilities associated with these tables you can quickly answer questions regarding data lineage and provide greater trust, understanding and ultimately adoption of the BI system.

In the next article I will cover how to visualize data in the graph tables using R.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 3 Populate Graph Tables

Featured

In this 3rd article I will show you how to use the procedures that we created in the previous post to populate the graph tables with useful data lineage information.

Data Flow

We will simulate the flow of data from two source systems (Finance and HR) into a data warehouse/business intelligence system as illustrated in the diagram below.

Staging

Most modern BI systems include a staging area, also sometimes referred to as a landing area. Data is collected from various disparate systems and loaded into the staging database with very minimal transformations, typically only data type conversions. The goal is to quickly grab the data from the different source systems and load it into the staging area for further processing so that any performance impact to the source systems is minimized.

To capture the data lineage for this first hop from source system to staging we can use our procedure as follows.

-- JDE to Staging
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Finance',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'JDE',
		@TargetDatabase = 'Staging',
		
		--Table Level
		@SourceTable = 'F0411',
		@TargetTable = 'F0411',

		--Column Level
		@SourceColumnList = 'RPKCO, RPDOC, RPDCT, RPSFX, RPSFXE, RPDCTA, RPAN8, RPPYE, RPSNTO, RPDIVJ',
		@TargetColumnList = 'RPKCO, RPDOC, RPDCT, RPSFX, RPSFXE, RPDCTA, RPAN8, RPPYE, RPSNTO, RPDIVJ',

		@ProcessName = Load F0411';
GO



-- HR to Staging 
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'HR',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'Payroll',
		@TargetDatabase = 'Staging',
		
		--Table Level
		@SourceTable = 'EMP',
		@TargetTable = 'EMP',

		--Column Level
		@SourceColumnList = 'EMPID, HIREDT, JOBTITLE',
		@TargetColumnList = 'EMPID, HIREDT, JOBTITLE',

		@ProcessName = 'Load EMP';
GO 

As you can see from the code above it is very easy to map an input table to an output table using the procedures we created in the previous steps.

EDW

Next, we will capture the data lineage from Staging to the Enterprise Data Warehouse (EDW)

-- Staging to EDW for Finance tables

EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'Staging',
		@TargetDatabase = 'EDW',
		
		--Table Level
		@SourceTable = 'F0411',
		@TargetTable = 'Fact Accounts Payable',

		--Column Level
		@SourceColumnList = 'RPDOC, RPDCT, RPSFX, RPSFXE, RPDCTA, RPAN8, RPPYE, RPSNTO, RPDIVJ, RPDIVJ',
		@TargetColumnList = 'Document, Document Type, Document Pay Item, Pay Item Extension Number, Adjusting Document Type, Address Number, Payee Address Number, 	Approvers Address Number, Invoice Date - Julian, Invoice Date - Gregorian',

		@ProcessName = 'Load Fact Accounts Payable';
GO



-- Staging to EDW for HR tables
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'Staging',
		@TargetDatabase = 'EDW',
		
		--Table Level
		@SourceTable = 'EMP',
		@TargetTable = 'Dim Employee',

		--Column Level
		@SourceColumnList = 'EMPID, HIREDT, HIREDT, JOBTITLE',
		@TargetColumnList = 'Employee Number, Hire Date, Tenure, Job Title',

		@ProcessName = 'Load Dim Employee';
GO

If you have a keen eye you might have noticed that we are mapping some of the fields twice. The field RPDIVJ in the JDE table F0411 is mapped to both Invoice Date – Julian and Invoice Date – Gregorian. The field HIREDT in the HR EMP table is mapped to both the Hire Date and the Tenure columns. In both scenarios this implies some transformation logic between these columns.

Accounts Payable Model

The final step in our example is the process that loads the SQL Server Analysis Services (SSAS) model.

-- EDW to SSAS Finance
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'EDW',
		@TargetDatabase = 'SSAS Accounts Payable Model',
		
		--Table Level
		@SourceTable = 'Fact Accounts Payable',
		@TargetTable = 'Fact Accounts Payable',

		--Column Level
		@SourceColumnList = 'Document, Document Type, Document Pay Item, Pay Item Extension Number, Adjusting Document Type, Address Number,  Invoice Date - Gregorian',
		@TargetColumnList = 'Document, Document Type, Document Pay Item, Pay Item Extension Number, Adjusting Document Type, Address Number,  Invoice Date',

		@ProcessName = 'Process SSAS Accounts Payable Model'



-- EDW to SSAS HR
EXEC	[dbo].[Create_Data_Lineage]
		
		--System level
		@SourceSystem = 'Business Intelligence System',
		@TargetSystem = 'Business Intelligence System',

		--Database level 
		@SourceDatabase = 'EDW',
		@TargetDatabase = 'SSAS Accounts Payable Model',
		
		--Table Level
		@SourceTable = 'Dim Employee',
		@TargetTable = 'Dim Employee',

		--Column Level
		@SourceColumnList = 'Employee Number, Hire Date, Tenure,  Job Title',
		@TargetColumnList = 'Employee Number, Hire Date, Tenure,  Job Title',

		@ProcessName = 'Process SSAS Accounts Payable Model'


GO

As you can see in the code above the Invoice Date – Gregorian from the Finance system is used to populate the Invoice Date and the Invoice Date – Julian did not make it into the SSAS cube.

Summary

The database procedures make it really easy to capture data lineage, they could even be enhanced to accept and parse JSON or XML inputs so that data lineage could be dropped in and maintained using a web application….can someone say Power App

In the next article I will cover how to query the graph tables as well as visualize the data in order to answer data lineage questions.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 2 Create Database Procedures

This is my second post in a series which explains how to use SQL Server Graph Tables to track data lineage.

In my previous post I covered creating the node and edges tables required to store data lineage. In this post I will show you how to create database procedures in order to simplify the process of populating the graph tables. These procedures are deliberately basic so that they are easy to understand but they are also easily extensible and can serve as the starting point for a more sophisticated solution.

Overview

Before diving into the code it is always good to have a conceptual understanding of the process. The flowchart below explains the logic of these procedures.

As you can see in the diagram above, we check if the source or target is new and then based on the condition create the appropriate nodes and edges. In order to avoid duplicating code we will create a procedure for creating the nodes and the edges and then call theses sub processes from a parent process that controls the logic flow.

Create Nodes Procedure

First, we will create a procedure to populate our node table. As you recall from the previous post our node table is called [dbo].[DataEntity].

DROP PROCEDURE IF EXISTS dbo.Create_Nodes
GO

CREATE PROCEDURE dbo.Create_Nodes

    @ColumnList VARCHAR(MAX),
    @Table VARCHAR(MAX),
    @Database VARCHAR(MAX),
    @System VARCHAR(MAX)

AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @NodeIDs TABLE (ID NVARCHAR(1000));

    --Insert data ino Node table and keep Source IDS so that we can populate the Edge table
    INSERT INTO [dbo].[DataEntity]
    OUTPUT INSERTED.$NODE_ID AS TOID INTO @NodeIDs(ID)
    SELECT
        value AS [ColumnName],
        @Table AS [TableName],
        @Database AS [DatabaseName],
        @System AS [SystemName],
        GETDATE() AS [CreatedDt]

    FROM  
        STRING_SPLIT(@ColumnList, ',');

    SELECT ID FROM @NodeIDs;    
    RETURN 
END

Now that we have a procedure to create our nodes, we will create another procedure to create the edges.

Create Edges Procedure

We need to pass a list of procedure a list of node ids to insert into the [dbo].[DataFlowsTo] table so we will create a user defined type to store and pass this information.

/* Create a table type in order to pass a list of Node IDs to this procedure*/
DROP TYPE IF EXISTS dbo.NodeIDTableType
CREATE TYPE dbo.NodeIDTableType  AS TABLE  ( ID NVARCHAR(1000) );

GO

After the user defined table type is created we can create the procedure to populate the edge table.

DROP PROCEDURE IF EXISTS dbo.Create_Edges
GO

CREATE PROCEDURE dbo.Create_Edges

    @Source  NodeIDTableType READONLY,
    @Target  NodeIDTableType READONLY,
    @Process VARCHAR(MAX)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

   --Get the FROM and TO node ids into a common table exppression and populate the edge table in batch.
   */
   
    WITH CTE (TOID, FROMID) AS
    (
        SELECT 
            A.FROMID,
            B.TOID
        FROM
            (SELECT ID AS FROMID, ROW_NUMBER() OVER(ORDER BY ID ASC) as RowNum FROM @Source) as A
        JOIN 
            (SELECT ID AS TOID, ROW_NUMBER() OVER(ORDER BY ID ASC) as RowNum FROM @Target) as B
        ON A.RowNum = B.RowNum
    )
    
    --Insert Lineage data into DataFlowsTo Edge table
     INSERT INTO [dbo].[DataFlowsTo] ($from_id, $to_id, ProcessName)
     SELECT CTE.TOID, CTE.FROMID, @Process FROM CTE;

    
END

In case you are wondering I used a common table expression (CTE) so that the edge records could be created in bulk rather than one at a time.

Create Data Lineage Procedure

This is the final procedure which checks to see if the source or target is new and calls the procedures above as required. The conditional logic check is premised on the assumption a the combination of system name, database name, table name and column name is unique.

DROP PROCEDURE IF EXISTS dbo.Create_Data_Lineage
GO
USE [DataLineage]
GO


CREATE PROCEDURE [dbo].[Create_Data_Lineage] 

    --Source data
    @SourceSystem VARCHAR(MAX),
    @SourceDatabase VARCHAR(MAX),
    @SourceTable VARCHAR(MAX),
    @SourceColumnList VARCHAR(MAX),

    --Target data
    @TargetSystem VARCHAR(MAX),
    @TargetDatabase VARCHAR(MAX),
    @TargetTable VARCHAR(MAX),
    @TargetColumnList VARCHAR(MAX),

    @ProcessName VARCHAR(MAX)

AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

        
    --DECLARE @SourceIDs TABLE (ID NVARCHAR(1000));
    --DECLARE @TargetIDs TABLE (ID NVARCHAR(1000));
    
    DECLARE @SourceIDs NodeIDTableType;
    DECLARE @TargetIDs NodeIDTableType;

    
    --Prepare the list of fields by removing any spaces or tabs between the comma seperated list of source and target columns
    SET @SourceColumnList = REPLACE(REPLACE(@SourceColumnList,' ', ''), char(9), '');
    SET @TargetColumnList = REPLACE(REPLACE(@TargetColumnList,' ', ''), char(9), '');

    --Check for existing sources , if found use those id's otherwise create new nodes and use new IDs
    INSERT INTO @SourceIDs
    SELECT 
        $NODE_ID
    FROM 
        [dbo].[DataEntity] DE
    INNER JOIN 
            (SELECT
                value AS [ColumnName],
                @SourceTable AS [TableName],
                @SourceDatabase AS [DatabaseName],
                @SourceSystem AS [SystemName]
            FROM  
                STRING_SPLIT(@SourceColumnList, ',')
            ) SRC
    ON
        DE.[ColumnName] = SRC.[ColumnName]
    AND 
        DE.[TableName] = SRC.[TableName]
    AND
        DE.[DatabaseName] = SRC.[DatabaseName]
    AND
        DE.[SystemName] = SRC.[SystemName];

    --Check for existing  targets, if found use those id's otherwise create new nodes and use new IDs
    INSERT INTO @TargetIDs
    SELECT 
        $NODE_ID
    FROM 
        [dbo].[DataEntity] DE
    INNER JOIN 
            (SELECT
                value AS [ColumnName],
                @TargetTable AS [TableName],
                @TargetDatabase AS [DatabaseName],
                @TargetSystem AS [SystemName]
            FROM  
                STRING_SPLIT(@TargetColumnList, ',')
            ) TRGT
    ON
        DE.[ColumnName] = TRGT.[ColumnName]
    AND 
        DE.[TableName] = TRGT.[TableName]
    AND
        DE.[DatabaseName] = TRGT.[DatabaseName]
    AND
        DE.[SystemName] = TRGT.[SystemName];

    IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN

            --Create source nodes
            INSERT @SourceIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @SourceSystem,
                    @Database = @SourceDatabase,
                    @Table = @SourceTable,
                    @ColumnList = @SourceColumnList;
        
            --Create target nodes
            INSERT @TargetIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @TargetSystem,
                    @Database = @TargetDatabase,
                    @Table = @TargetTable,
                    @ColumnList = @TargetColumnList;
        
        
            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
        END
    ELSE IF (EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN
        
            --create target nodes
            INSERT @TargetIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @TargetSystem,
                    @Database = @TargetDatabase,
                    @Table = @TargetTable,
                    @ColumnList = @TargetColumnList;

            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
        END
    ELSE IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN
        
            --create source nodes
            INSERT @SourceIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @SourceSystem,
                    @Database = @SourceDatabase,
                    @Table = @SourceTable,
                    @ColumnList = @SourceColumnList;

            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
        END
    ELSE 
        --No new nodes required
        --Create edges between source and target
        EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
    

END
GO

Summary

In this article we have created the database procedures to simplify the process of creating the necessary node and edge records to capture data lineage between source and target fields.

In the next article I will cover how to call these procedures and query the graph tables to answer data lineage questions.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 1 Create Nodes and Edges

Anthony Presents

Where did this data come from?
How can I trust this data?
What impact will changing this field have on other systems?

If you have ever heard or asked any of these questions, then this series of blog posts is for you.

As data volumes continue to grow so too does the need to manage the data estate. One critical aspect with managing the data estate is understanding data ancestry. Thankfully you can leverage SQL Server 2019 Graph tables to track the lineage of one of your most valuable assets, your data.

In this series of blog posts, I will show you how you can use how you can use Graph Tables in SQL Server 2019 to capture and report on data lineage.

Prerequisites

  • SQL Server 2019 Developer Edition (You can download a free copy for development use from here )
  • SQL Server Management Studio 18 (You can download a…

View original post 679 more words

How to track data lineage with SQL Server Graph Tables – Part 1 Create Nodes and Edges

Featured

Where did this data come from?
How can I trust this data?
What impact will changing this field have on other systems?

If you have ever heard or asked any of these questions, then this series of blog posts is for you.

As data volumes continue to grow so too does the need to manage the data estate. One critical aspect with managing the data estate is understanding data ancestry. Thankfully you can leverage SQL Server 2019 Graph tables to track the lineage of one of your most valuable assets, your data.

In this series of blog posts, I will show you how you can use how you can use Graph Tables in SQL Server 2019 to capture and report on data lineage.

Prerequisites

  • SQL Server 2019 Developer Edition (You can download a free copy for development use from here )
  • SQL Server Management Studio 18 (You can download a free copy from here )

*Note at the end of the series I will show you how you can visualize the data using R and Power BI and Plotly Dash so expect to install both Microsoft R Open, Power BI and Python later in the series.

Overview

The diagram below provides a conceptual overview of the various components that we will use during this series of articles. We will start with the Graph tables in the DataLineage database in SQL Server 2019 because they are the core of the solution.

Here is an overview of the initial data model that we will build.

As you can see in the diagram this series will cover the movement of data from one system to another. At the end of the series I will explain how this initial solution can be extended to included dependencies between objects such as the dependency of a field in a report with a field in a table in your data warehouse or data lake. This will allow you to track and reporting on both the movement of data as well as the inter-dependencies between data entities.

Create the DataLineage database

Once you have SQL Server installed, we will create a new database called DataLineage which we will use to store data we need to track lineage. To create the database run the following SQL command.

CREATE DATABASE DataLineage

Graph tables are first class entities in SQL Server databases starting with SQL Server 2017. Graph database are ideal for storing and traversing relationships between items. For additional information on the graph database in SQL Server you can refer to the following site.

https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview?view=sql-server-ver15

Create the Node table

Now we will create the node table. A node represents a discreet entity in a graph database. In order to maximize the value of our data lineage solution we need to track the movement of data at the at the lowest level possible which is the column/field level. In addition to the column/field data we also need to capture the table, database and system the column/field is stored in, this will be useful for filtering the data and understanding context. Based on this requirement we will create a node table that stores column, table, database and system name using the following SQL command.

USE [DataLineage]
GO

DROP TABLE IF EXISTS dbo.DataEntity
GO

CREATE TABLE DataEntity

(
    [ID] INTEGER IDENTITY(0,1PRIMARY KEY,
    [ColumnName] NVARCHAR(MAX),
    [TableName] NVARCHAR(MAX),
    [DatabaseName] NVARCHAR(MAX),
    [SystemName] NVARCHAR(MAX),
    [CreatedDt] DATE DEFAULT(GETDATE() )
AS NODE;

GO

I have also included a CreatedDt field for auditing purposes. Next, we will create the edge table which will be used to store information regarding the movement of data from source to target.

Create the Edge table

An Edge table represents a relationship between nodes in a graph database. In our scenario we want to track the movement of data between source and target. In order to capture this information, we will create an Edge table using the following SQL command.

USE [DataLineage]
GO

DROP TABLE IF EXISTS dbo.DataFlowsTo
GO

CREATE TABLE DataFlowsTo
(
    [ProcessName] NVARCHAR(MAX),
    [CreatedDt] DATE DEFAULT(GETDATE())
) AS EDGE;
GO

The edge table contains a single user defined column called ProcessName which is used to capture the name of the ETL process that moves data from source to target. The ProcessName will also be useful for filtering the data which is helpful in scenarios where we need to view all the fields involved in an ETL process or all the ETL processes dependent on one or more fields. Like the node table I have included a CreatedDt field for auditing purposes.

Summary

In this article we have created the database as well as the node and edge tables required to capture data lineage.

In the next article I will cover how to populate these tables using a database procedure in order simplify the process of

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Abstract and encapsulate with Power BI and SQL Server Table-Valued Functions – Use Case 2: Change results based on a database point in time and a user defined parameter

This is the second article  in which I cover different use cases for using SQL Server Table-Valued Functions with Power BI.

In the previous article I showed you how a table valued function can be used to hide lower levels of a hierarchy based on user id. This is handy if you need to prevent certain users from breaking down aggregate measures while interacting with a report or dashboard in Power BI.

In this article I will show you how you can use table valued functions to filter the results based on a date range stored in your database and allow the report author to control how many years of data to bring back with an optional input parameter.

Prerequisites

Use case 2: Change results based on time

This use case comes in handy if you want your results to be filtered based on a date and time in your database and not on the date time of your Power BI users. This situation may occur in a globally distributed system in which Power BI users and the database they are querying are located in different time zones or in situations where the data lags behind the current date/time of your users. I will cover a situation where you want the data set to only show data up to a point in time stored in your database.

Using the Wide World Importers sample database suppose you wanted to limit the orders that a person can report on to the last year in which there were orders in the database. Run the following query.

USE WideWorldImportersDW
GO

SELECT
MIN([Order Date Key]) AS [Earliest Order Date]
, MAX([Order Date Key]) AS [Latetest Order Date]
FROM
WideWorldImportersDW.Fact.[Order]
GO

As you can see from the results, we have orders from 2013 up to 2016. If we were to use the Power BI relative date slicer and set it to only show data from the past 1 year we would not see any results because the Power BI relative date slicer is based on today’s date April 6, 2019 and not on a date in the database. One way to overcome this problem is to use SQL Server Table-valued functions and encapsulate the logic to only show orders from June 1 2015 to May 31 2016. To do this will create a new function using the code below.

Create Function

Run the following code in the Wide World Importers database to create a new function. We are including an optional parameter so that the report author can change how many years they want to go back when they connect to the data.

--DROP FUNCTION dbo.ufn_Orders_PastYear
--GO


CREATE FUNCTION dbo.ufn_Orders_PastYear(@NumberOfYears INT = NULL)
RETURNSTABLE
AS
RETURN
(
SELECT
*
FROM
WideWorldImportersDW.Fact.[Order] ord
WHERE
ord.[Order Date Key]
BETWEEN
(SELECT DATEADD(year, -ISNULL(@NumberOfYears, 1), MAX([Order Date Key])) FROM WideWorldImportersDW.Fact.[Order])
AND
(SELECT MAX([Order Date Key]) FROM WideWorldImportersDW.Fact.[Order])
);
GO

Notice that I negate the number of years by adding a negative sign in front of the ISNULL function in the select statement, this is to simplify the report authoring experience with using this function. Next, we will make a DirectQuery connection to the function using Power BI.

Connect with Power BI

Similar to before connect to the SQL Server database using a DirectQuery connection and select the function ufn_Orders_PastYear from the list of database objects.

As you can see in the image above the function parameter @NumberOfYears appears in Power BI as an optional parameter. If you leave it blank and click apply it will pull back 1 year’s worth of data based on what is available in the database. You can enter in your own number to control how many years back you query the Orders fact table. Incorporating parameters is a very powerful way to give the report author control of the results.

Once the data has been loaded in let’s visualize it using a simple bar chart.

Your results should look like the following image below.

As you can see in the chart we only have data from 2015 to 2016. To make the chart a bit easier to read lets add a proper date hierarchy. We will need to build it because we are using a DirectQuery to access the data so the autogenerated date hierarchies are not available, those are only created when you import data into Power BI and set the data type to be a date.

Create Year Column

Create a new calculated column and use the following DAX code to pull out the year value from the Order Date Key field.

OrderYear = Year([Order Date Key])

Create Month Columns

Next, we will create two month columns one will be used to sort and the other will be used to display the month name on the report.

Use the following DAX code to create a new month number calculated column.

OrderMonth = Month([Order Date Key])

Now create a new calculated column to store the month name using the following DAX code.

Order Month Name = 
SWITCH (
    [OrderMonth],
    1, "January",
    2, "February",
    3, "March",
    4, "April",
    5, "May",
    6, "June",
    7, "July",
    8, "August",
    9, "September",
    10, "October",
    11, "November",
    12, "December"
)

Now we need to set the sort by column property of the Order Month Name to use the value of the OrderMonth column. To do this navigate to the model viewer and click on the Order Month Name field and then set the Sort by column to OrderMonth.

Now we will create a new Hierarchy based on OrderYear and Order Month Name.

Click on the chart and the Axis value with the new hierarchy we just created. Drill down a level to see the years and months.

As you can see this makes the chart much easier to read. Now lets insert some new data into the table and refresh the report.

Add some data

Run the following SQL to create new Date and Order records.

INSERT INTO [Dimension].[Date]
([Date]
,[Day Number]
,[Day]
,[Month]
,[Short Month]
,[Calendar Month Number]
,[Calendar Month Label]
,[Calendar Year]
,[Calendar Year Label]
,[Fiscal Month Number]
,[Fiscal Month Label]
,[Fiscal Year]
,[Fiscal Year Label]
,[ISO Week Number])
VALUES
(‘4/1/2019’,
,1
,1
,’April’
,’Apr’
,4
,’CY2019-Apr’
,2019
,’CY2019′
,6
,’FY2019-APR’
,2019
,’FY2019′
,14)
GO
INSERT INTO [Fact].[Order]
([City Key]
,[Customer Key]
,[Stock Item Key]
,[Order Date Key]
,[Picked Date Key]
,[Salesperson Key]
,[Picker Key]
,[WWI Order ID]
,[WWI Backorder ID]
,[Description]
,[Package]
,[Quantity]
,[Unit Price]
,[Tax Rate]
,[Total Excluding Tax]
,[Tax Amount]
,[Total Including Tax]
,[Lineage Key])
VALUES
(45901
,0
,175
,’4/1/2019′
,’4/1/2019′
,76
,67
,9073
,NULL
,’April Fools ain’t no joke’
,’Each’
,3
,13.00
,15.00
,39.00
,5.85
,44.85
,9)
GO

Refresh the report and notice how it updates so that it only has one column.

This is because there is only 1 record from March 2018 until April 2019.

Combining SQL Server database functions with Power BI is a powerful way to abstract and encapsulate logic in the database thus simplifying the report authors job and ensuring the right data is presented to report consumers.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Abstract and encapsulate with Power BI and SQL Server Table-Valued Functions – Use Case 1: Change results based on user

If you’ve ever required a dynamic data source in Power BI that can change based on who the user is, when they are querying the data source or if certain data elements have changed you can leverage the ability for Power BI to connect to a table value function in SQL Server.

Table-valued functions allow you to abstract complex business logic from the report author and encapsulate it into a database object. This simplifies report building and enables you to do things like hide hierarchy levels, filter data based on a certain point in time stored in the database or check for certain data conditions and alter the query results as appropriate.

Prerequisites

  • SQL Server 2016 or later. You can download the SQL Server 2017 developer edition HERE
  • Wide World Importer sample database. A copy can be found HERE
  • Power BI Desktop. You can download the latest version from HERE

In this series of articles I will step through several use cases for direct queries from SQL Server Table-valued Functions in Power BI.

Use case 1: Change results based on user

For this first use case we will cover how you can embed some simple logic in your table-valued function to hide lower levels of a hierarchy. This is useful if you want to prevent certain individuals from breaking down aggregated values but still allow them to use data at a summary level.

Create user accounts

For the purposes of simplicity, we will create some users in the database using SQL Server authentication. Connect to your SQL Server database and execute the following SQL code.

USE [master]

GO
--Create Bob
CREATE LOGIN [Bob] WITH PASSWORD='Bob', DEFAULT_DATABASE=[WideWorldImportersDW], DEFAULT_LANGUAGE=[us_english], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
GO

--Create Mary
CREATE LOGIN [Mary] WITH PASSWORD='Mary', DEFAULT_DATABASE=[WideWorldImportersDW], DEFAULT_LANGUAGE=[us_english], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
GO

USE [WideWorldImportersDW]
GO

--Grant Bob access to WideWorldImportersDW
CREATE USER [Bob] FOR LOGIN [Bob] WITH DEFAULT_SCHEMA=[dbo]
GO

--Grant Mary access to WideWorldImportersDW
CREATE USER [Mary] FOR LOGIN [Mary] WITH DEFAULT_SCHEMA=[dbo]
GO

--Grant Bob access to read access to WideWorldImportersDW
ALTER ROLE db_datareader ADD MEMBER [Bob]
GO

--Grant Mary access to read access to WideWorldImportersDW
ALTER ROLE db_datareader ADD MEMBER [Mary]
GO

 

Next, we will create a function in SQL Server with the following code.

Create function

Use the code below to create a new Table-Valued function in SQL Server. The function is what we will directly connect to in Power BI.

CREATE FUNCTION dbo.ufn_Customer()  
RETURNS TABLE 
AS
RETURN   
(  
    SELECT  
       [Customer Key]
      ,[WWI Customer ID]
      ,[Customer]
      ,[Bill To Customer]
      ,[Category]
      ,CASE SYSTEM_USER
        WHEN 'Bob' THEN
            NULL
        WHEN 'Mary' THEN
            [Buying Group]
        ELSE
            [Buying Group]
        END AS [Buying Group]
      ,[Primary Contact]
      ,[Postal Code]
      ,[Valid From]
      ,[Valid To]
      ,[Lineage Key]
  FROM [WideWorldImportersDW].[Dimension].[Customer]
);  

GO

As you can see in the code above, I have created a function called dbo.ufn_Customer which returns the data from the Customer dimension table. In the code I have added a simple case statement that returns different data for the Buy Group based on who executing the function.

Next we will bring this function into Power BI and see the results.

Connect with Power BI

Open Power BI and get data from SQL Server. Enter in the server name and select DirectQuery.

Click on OK. Log in using a Database account. First, we will try using Bob.

Click on connect and select the function ufn_Customer from the list of available objects.

If you wanted to force the report author to use the function rather than the actual customer table you can use database security to only expose the function and not the table. I typically use custom database roles and schemas because it is easier to manage and allows me to enable “data discovery with guard rails”.

Load the data in and create a simple hierarchy using the fields Category and Buying Group.

Drop a Matrix onto the canvas of the report and use the Category Hierarchy you just created for the rows and the Customer Key for the values.

You should have a report that looks similar to the image below. Notice how the Buying Group is null because for Bob the function is not returning the Buyin Group value but the NULL value instead.

Now lets switch to Mary and see how the lower level values of the hierarchy appear in the report. Click on Home > Edit Queries > Data Source Settings. Select the data source that you are using for this report and click on Edit Permissions…

In the Edit Permission pop up menu click on Edit.. then in Database enter in Mary and Mary for the ID and PWD.

Click on Save and OK and Close. Refresh the report.

Notice how the lower levels of the hierarchy now appear in the matrix visual because for Mary the actual Buying Group value is being return as specified in the function.

NOTE: The reason why you need to refresh is because of caching. When using this technique to obfuscate lower levels of a hierarchy make sure to build your visuals so that the default view of the report is at an aggregated level and minimize the amount of caching which will force Power BI to re-query from the source and update the results appropriately.

If you want to add an enterprise semantic layer such as an Analysis Services Tabular model and still have the same dynamic results you will need to build your SSAS model using DirectQuery mode because the results of the model need to change and cannot be processed and stored in memory in advance of the user querying it.

In the next article I will cover how you can use a database function to curate the results based on a date time in your database.

Next we will look at using a database function to curate the result set based on time.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Extend your information reach without over stretching by virtualizing data using SQL Server 2019 and MongoDB – Part 3 Setup Polybase

This is the third post in a series in which I explain how to use Polybase to connect SQL Server to Mongo DB in order to have a modern data ecosystem that caters to both structured and unstructured data. By leveraging Polybase you can centralize access without moving data and deliver a beast of breed solution for all of your data needs.

In my previous post I walked through how to spin up a SQL Server 2019 Azure VM and install Polybase. In this post I will show you how to setup Polybase to access a collection in MongoDB. This is the glue between MongoDB and SQL Server 2019. Conceptually this is the arrow in the diagram below.

First, I will show you how to setup an unsecure connection and then I will show you how to secure it with basic authentication.

Verify that MongoDB is up and running

Confirm that Mongo DB is up and running. As specified in my prior article you can use Studio 3T to connect or you can tunnel directly in using SSH and connect to MongoDB through PuTTY. If you shutdown the VM you will need to restart MongoDB using the command.

sudo service mongod restart

To run this command you will need to use PuTTY to tunnel in. Details on how to do this can be found in this article.

Configure polybase

Connect to the SQL Server instance. You can RDP into the VM or connect in from your local workstation. I connected in directly from my laptop using Azure Data Studio a nice light weight database tool.

Enable Polybase

Once connected you will need to turn on Polybase. Execute the following command in the MASTER database.

USE MASTER
exec sp_configure @configname = 'polybase enabled', @configvalue =1;
GO

RECONFIGURE WITH OVERRIDE;

Create Database

Next you will need to create a database. Execute the following SQL command to create a database called TEST.

CREATE
DATABASE TEST
GO

USE TEST
GO

The following commands all need to be executed in the new Test database that was just created with the command above.

Create External Data Source

Now you will need to create an external data source. I called mine MongoDBSource.

CREATE EXTERNAL DATA SOURCE MongoDBSource
WITH (
LOCATION='mongodb://40.85.255.249:27017',
CONNECTION_OPTIONS ='ssl=false;'
);
GO

Make sure that you modify the LOCATION parameter so that you specify the IP or name of the server that is running MongoDB.

Also you will need to add the CONNECTION_OPTIONS = ‘ssl=false;’ otherwise you will see the following error when you try to create the External Table.

Msg 105082, Level 16, State 1, Line 22

105082;Generic ODBC error: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: No suitable servers found (`serverSelectionTryOnce` set): [connection closed calling ismaster on ‘40.85.255.249:27017’] (Error Code: 13053) Additional error <2>: ErrorMsg: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: No suitable servers found (`serverSelectionTryOnce` set): [connection closed calling ismaster on ‘40.85.255.249:27017’] (Error Code: 13053), SqlState: HY000, NativeError: 110 .

Create External Table

The last thing we need to do is create an external table.

CREATE EXTERNAL TABLE Test(
[_id] NVARCHAR(24),
[x] FLOAT(53)
)
WITH (
LOCATION='test.testData',
DATA_SOURCE= MongoDBSource
);

Common errors

If you try to implicitly convert the [x] FLOAT(53) column to an int you will see the following error.

Msg 105083, Level 16, State 1, Line 24

105083;The following columns in the user defined schema are incompatible with the external table schema for table ‘Test2’: user defined column type: ([x] INT) vs. detected external table column type: ([x] FLOAT(53)). The detected external table schema is: ([_id] NVARCHAR(24) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [x] FLOAT(53))

This is because SQL Server requires a schema on read with strong type casting.

In addition, if you try to create the table without all of the “columns” that reside in the collection by using the following SQL command.

CREATE
EXTERNAL
TABLE Test2(
 [x] FLOAT(53)
)
WITH (
LOCATION='test.testData',
DATA_SOURCE= MongoDBSource
);

You will get the following error.

Msg 105075, Level 16, State 1, Line 24

105075;The user specified schema column count is 1, which varies from the external table schema column count of 2. The detected external table schema is: ([_id] NVARCHAR(24) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [x] FLOAT(53))

Again, this is because the current version of PolyBase in SQL Server requires a table structure that matches the collection structure in MongoDB.

Once the table is created you can query it just as you would any other table in SQL Server.

Select *
from [dbo].[Test]

Your results should look similar to the image below.

By default the PUSHDOWN predicate is applied so that the MongoDB is used for query execution allowing for optimum performance by distributing the compute to the best location/technology for the job.

Next let’s add some basic authentication so that our connection is a bit more secure.

Securing the connection with basic Authentication

Using PuTTY log back in the MongoDB VM. Once connected to the VM log into MongoDB by typing in the following command.

mongo

Once in the MongoDB shell execute the following command.

use test
db.createUser(
{
user: “mytestuser”,
pwd: “mytestuser”,
roles: [ { role: “readWrite”, db: “test” } ]
}
)

You should see the following response if the command is successful.

Successfully added user:
{
 "user" : "mytestuser",
 "pwd" : "mytestuser",
 "roles" : [
 {
 "role" : "readWrite",
 "db" : " test "
 }
 ]
}

This creates a user called mytestuser with the password mytestuser. You should probably use a more secure id and password but this is sufficient for the purposes of this article.

Enable authentication in MongoDB

Exit out of Mongo and run the following command.

sudo nano /etc/mongod.conf

Modify the configuration file by adding the tag authorization: enabled under the security section.

Exit and save the config file and then restart mongo.

sudo service mongod restart

Modify SQL Server

Now that we have made the necessary changes in MongoDB we need to add authentication information in SQL Server.

Log into SQL Server and create a master key. Execute the following SQL command to create a master key.

USE TEST

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Password123!'
GO

Next create a database scoped credential.

CREATE DATABASE SCOPED CREDENTIAL MongoDB
WITH IDENTITY ='mytestuser', SECRET = 'mytestuser';

GO

As you can see in the command above the IDENTITY and SECRET must match the user and pwd in the MongoDB command we used to create the user.

Last, we will drop and recreate the external data source and table using the following commands.

DROP EXTERNAL DATA SOURCE MongoDBSource

CREATE EXTERNAL DATA SOURCE MongoDBSource
WITH (
  LOCATION = 'mongodb://40.85.255.249:27017', 
  CREDENTIAL = MongoDB,
  CONNECTION_OPTIONS = 'ssl=false;'
);

GO

--Notice how the CREDENTIAL argument has been added to the create command above.

DROP EXTERNAL TABLE Test
CREATE EXTERNAL TABLE Test(
[_id] NVARCHAR(24),
[x] FLOAT(53)
)
WITH (
LOCATION='test.testData',
DATA_SOURCE= MongoDBSource
);

Now you are using an account to access the database in MongoDB. The next step would be to enable SSL in MongoDB so that the CONNECTION_OPTIONS ‘ssl=false;’ can be removed and a secure connection established between SQL Server and MongoDB. The details on how to do it is beyond the scope of this article for instructions on how to enable SSL in MongoDB you can refer to the following guide.

https://docs.mongodb.com/manual/tutorial/configure-ssl/

The setup that I have documented in this series can be done on premise or in the cloud using VMs as I have done.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Extend your information reach without over stretching by virtualizing data using SQL Server 2019 and MongoDB – Part 2 SQL Server 2019

In this series of blog posts, I will explain how you can connect MongoDB to SQL Server 2019 using Polybase so that you have the benefit of both a schemaless and relational database technologies integrated and working together to form a modern data ecosystem that can handle both traditional and “big data”.

In my previous post I explained how to install and configure Mongo DB in an Azure VM running Linux. In this post I will walk you through the process of setting up SQL Server 2019, which is the area on the right of the diagram below.

Spin up Azure VM with SQL Server 2019

Microsoft has a pre-built VM with the latest release of SQL Server 2019 ( at the time of this writing it is CTP2.3) which makes it really quick and easy to setup. Simply navigate to your Azure portal and search for SQL Server 2019. You should see a Free SQL Server : (CTP2.3) SQL 2019 Developer option, once you select it you should see the following.

Click create and fill out the subsequent screens as follows.

Step 1 Basics

NOTE Be sure to remember the Username and Password because it will be required later when we connect to the VM and install Polybase.

Step 2 Size

I went for a DS2_v2 but you are free to pick a size that suites your needs. NOTE Microsoft recommends a DS2 or higher for development and functional testing.

Step 3 Settings

You will need to open a public inbound port (3389 RDP) so that you can remotely connect to it and install Polybase.

Step 4 SQL Server settings

This last step is optional. You can enable external connections directly into the SQL Server database which is handy if you want to connect to the database without having to log onto the VM. Once the VM is created we will need to log into it to install Polybase.

Install Polybase

Unfortunately, the pre-built VM does not have Polybase installed on it so you will need to log onto the VM and install. To connect to the VM go the resource in the Azure Portal and select Connect. You should see a screen like this.

Download the RDP file and enter the credentials you used when you first created the VM.

Once you have logged onto the VM you will need to navigate to the SQL Server 2019 installation software. You can find it here C:\SQLServerFull. Double click on setup.

In the SQL Server Installation Center menu select New SQL Server stand-alone installation or add features to an existing installation.

For the Installation Type select Add features to an existing instance of SQL Server 2019 CTP2.3

On the Feature Selection screen select PolyBase Query Service for External Data.

On the PolyBase Configuration screen select the first option. A PolyBase scale-out group is ideal for scenarios in which you have multiple external data sources that you want to connect to and you need to optimize performance.

We will use the defaults for Server Configuration.

Review the summary and click the install button. If all goes well you should see the following screen when complete.

In the next post I will explain how to configure Polybase to connect it to your MongDB database installed on separate VM. This setup allows you to extend your reach without overstretching by letting the data stay where it is but still making it available for integration and analytics with line of business application data.

Hopefully you have found this to be another practical post.

Until next time.

Anthony