How to track data lineage with SQL Server Graph Tables – Part 4 Querying the Graph Tables

In the previous article I showed you how to populate the graph tables using the database procedures we created earlier in the series. In this article I will show you how to query the data contained in the graph tables to quickly answer questions regarding data lineage.

Data Flow

As you may recall from the previous article, we captured the flow of data from two source systems into our BI system, as depicted in the illustration below.

MATCH

Starting with SQL Server 2017, the MATCH clause can be used with node and edge graph tables to traverse a path in the graph. Imagine a scenario where you want to know the impact of removing a field in your Staging database which you believe you no longer need. In the previous article we captured data lineage for a data flow coming out of JDE; one of the columns was an invoice date in Julian format called RPDIVJ, which comes from the table F0411. For the purposes of this article let's assume that the business believes this date field is not useful and needs to be removed. Without this data lineage solution you would need to manually search through ETL processes to determine the impact of removing this column, a time-consuming task. Thankfully we have a more elegant solution.

We can query the graph tables in our DataLineage database to quickly see which processes will need to be modified. We will start by querying Staging, which is the start of the data flow.

Run the following code to see which ETL process will need to be modified.

SELECT
    df.ProcessName,
    df.CreatedDt
FROM
    [DataLineage].[dbo].[DataEntity] src,
    [DataLineage].[dbo].[DataFlowsTo] df,
    [DataLineage].[dbo].[DataEntity] trgt
WHERE MATCH(src-(df)->trgt)
    AND src.SystemName = 'Business Intelligence System'
    AND src.DatabaseName = 'Staging'
    AND src.TableName = 'F0411'
    AND src.ColumnName = 'RPDIVJ'


This should give you results like the table below.

ProcessName                | CreatedDt
Load Fact Accounts Payable | 2020-02-12
Load Fact Accounts Payable | 2020-02-12

This tells you that the ETL process Load Fact Accounts Payable uses the field RPDIVJ in the Staging database twice. I have also included the CreatedDt field in the query so that you know whether the data is still relevant; typically, the older the data the less reliable it is for making decisions. If you have built your BI system correctly you should be able to compare the age of this data with the last modified date of the ETL process to see if there is a significant difference. If there is, odds are you need to update the data lineage for this process.
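As a sketch of that freshness check, the query below assumes a hypothetical dbo.ETLProcessLog table (not part of this series) that records when each ETL process was last modified, and flags lineage records that are older than the process definition.

```sql
-- Illustrative only: dbo.ETLProcessLog (ProcessName, LastModifiedDt) is an
-- assumed metadata table, not one created in this series.
SELECT
    df.ProcessName,
    df.CreatedDt       AS LineageCapturedDt,
    pl.LastModifiedDt  AS ProcessLastModifiedDt,
    CASE WHEN pl.LastModifiedDt > df.CreatedDt
         THEN 'Lineage may be stale - recapture it'
         ELSE 'Lineage is current'
    END                AS LineageStatus
FROM [DataLineage].[dbo].[DataFlowsTo] df
JOIN [dbo].[ETLProcessLog] pl
    ON pl.ProcessName = df.ProcessName;
```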

Now let’s modify the query above to get some more meaningful insights.

SELECT
    src.SystemName AS SourceSystem,
    src.DatabaseName AS SourceDatabase,
    src.TableName AS SourceTable,
    src.ColumnName AS SourceColumn,
    src.CreatedDt AS SourceNodeCreatedDt,

    df.ProcessName,
    df.CreatedDt AS ProcessCreatedDt,

    trgt.SystemName AS TargetSystem,
    trgt.DatabaseName AS TargetDatabase,
    trgt.TableName AS TargetTable,
    trgt.ColumnName AS TargetColumn,
    trgt.CreatedDt AS TargetNodeCreatedDt

FROM
    [DataLineage].[dbo].[DataEntity] src,
    [DataLineage].[dbo].[DataFlowsTo] df,
    [DataLineage].[dbo].[DataEntity] trgt

WHERE MATCH(src-(df)->trgt)
    AND src.SystemName = 'Business Intelligence System'
    AND src.DatabaseName = 'Staging'
    AND src.TableName = 'F0411'
    AND src.ColumnName = 'RPDIVJ'

This query provides you with details on the source, target and the process that copies the data from source to target. When you run this query, it should provide results similar to the table below.

SourceSystem | SourceDatabase | SourceTable | SourceColumn | SourceNodeCreatedDt | ProcessName | ProcessCreatedDt | TargetSystem | TargetDatabase | TargetTable | TargetColumn | TargetNodeCreatedDt
Business Intelligence System | Staging | F0411 | RPDIVJ | 2020-02-12 | Load Fact Accounts Payable | 2020-02-12 | Business Intelligence System | EDW | Fact Accounts Payable | InvoiceDate-Julian | 2020-02-12
Business Intelligence System | Staging | F0411 | RPDIVJ | 2020-02-12 | Load Fact Accounts Payable | 2020-02-12 | Business Intelligence System | EDW | Fact Accounts Payable | InvoiceDate-Gregorian | 2020-02-12

The MATCH clause is great; however, it is still difficult to understand the full impact of removing the RPDIVJ column. This query only tells us about the impact on the staging load process and will not reveal downstream impacts on the EDW and the cubes which may rely on this field. What we really want to know is the shortest path across all layers of the BI system.

SHORTEST_PATH()

To get a better understanding of the end-to-end impact of removing this field we will use the SHORTEST_PATH option. This is a new feature only available in SQL Server 2019 or later. To use it, write a new query as follows.

SELECT
	CONCAT(src.SystemName,'.', src.DatabaseName, '.', src.TableName, '.', src.ColumnName) AS Source,
	COUNT(trgt.ColumnName) WITHIN GROUP (GRAPH PATH) AS [BI System Layer],
	STRING_AGG(CONCAT(trgt.SystemName, '.', trgt.DatabaseName, '.', trgt.TableName, '.', trgt.ColumnName) , ' -> ') WITHIN GROUP (GRAPH PATH) AS [Data Lineage],
	STRING_AGG(df.ProcessName , ', ') WITHIN GROUP (GRAPH PATH) AS [ETL Processes]

FROM
	[DataLineage].[dbo].[DataEntity] AS src,
	[DataLineage].[dbo].[DataFlowsTo] FOR PATH AS df,
	[DataLineage].[dbo].[DataEntity] FOR PATH AS trgt
WHERE
	MATCH (SHORTEST_PATH(src(-(df)->trgt)+))
	 AND src.SystemName = 'Finance'
	 AND src.DatabaseName = 'JDE'
	 AND src.TableName = 'F0411'
	 AND src.ColumnName = 'RPDIVJ'
ORDER BY 2 DESC

After running the query above you should get results like the table below.

Source | BI System Layer | Data Lineage | ETL Processes
Finance.JDE.F0411.RPDIVJ | 3 | Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payble Model.Fact Accounts Payable.InvoiceDate | Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payble Model
Finance.JDE.F0411.RPDIVJ | 2 | Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Julian | Load F0411, Load Fact Accounts Payable
Finance.JDE.F0411.RPDIVJ | 2 | Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian | Load F0411, Load Fact Accounts Payable
Finance.JDE.F0411.RPDIVJ | 1 | Business Intelligence System.Staging.F0411.RPDIVJ | Load F0411

Let’s break down the results and look at the first row returned by the query.

The first column, Source, is the fully qualified source of the data:

Finance.JDE.F0411.RPDIVJ

The second field, BI System Layer, tells you how many layers the data travels through; in this case the data makes its way through all 3 layers of the BI system (Staging, EDW, SSAS).

The third field, Data Lineage, gives you the source-to-target mapping from end to end.

Business Intelligence System.Staging.F0411.RPDIVJ -> Business Intelligence System.EDW.Fact Accounts Payable.InvoiceDate-Gregorian -> Business Intelligence System.SSAS Accounts Payble Model.Fact Accounts Payable.InvoiceDate

The last field, ETL Processes, lists the names of all the ETL processes that depend on the data coming from Finance.JDE.F0411.RPDIVJ.

Load F0411, Load Fact Accounts Payable, Process SSAS Accounts Payble Model

The query uses SHORTEST_PATH, STRING_AGG and COUNT to provide meaningful information about the data lineage of a specific field used as a source for the BI system. With this query you can quickly understand the impact of removing the field across all layers of the BI system.

Summary

By leveraging Graph tables and new querying capabilities associated with these tables you can quickly answer questions regarding data lineage and provide greater trust, understanding and ultimately adoption of the BI system.

In the next article I will cover how to visualize data in the graph tables using R.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 2 Create Database Procedures

This is my second post in a series which explains how to use SQL Server Graph Tables to track data lineage.

In my previous post I covered creating the node and edges tables required to store data lineage. In this post I will show you how to create database procedures in order to simplify the process of populating the graph tables. These procedures are deliberately basic so that they are easy to understand but they are also easily extensible and can serve as the starting point for a more sophisticated solution.

Overview

Before diving into the code it is always good to have a conceptual understanding of the process. The flowchart below explains the logic of these procedures.

As you can see in the diagram above, we check whether the source or target is new and then, based on the result, create the appropriate nodes and edges. To avoid duplicating code we will create one procedure for creating nodes and another for creating edges, and then call these subprocesses from a parent procedure that controls the logic flow.

Create Nodes Procedure

First, we will create a procedure to populate our node table. As you recall from the previous post our node table is called [dbo].[DataEntity].

DROP PROCEDURE IF EXISTS dbo.Create_Nodes
GO

CREATE PROCEDURE dbo.Create_Nodes

    @ColumnList VARCHAR(MAX),
    @Table VARCHAR(MAX),
    @Database VARCHAR(MAX),
    @System VARCHAR(MAX)

AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @NodeIDs TABLE (ID NVARCHAR(1000));

    --Insert data into the Node table and keep the inserted node IDs so that we can populate the Edge table
    INSERT INTO [dbo].[DataEntity]
    OUTPUT INSERTED.$NODE_ID AS TOID INTO @NodeIDs(ID)
    SELECT
        value AS [ColumnName],
        @Table AS [TableName],
        @Database AS [DatabaseName],
        @System AS [SystemName],
        GETDATE() AS [CreatedDt]

    FROM  
        STRING_SPLIT(@ColumnList, ',');

    SELECT ID FROM @NodeIDs;    
    RETURN 
END

Now that we have a procedure to create our nodes, we will create another procedure to create the edges.

Create Edges Procedure

We need to pass the procedure a list of node IDs to insert into the [dbo].[DataFlowsTo] table, so we will create a user-defined table type to store and pass this information.

/* Create a table type in order to pass a list of Node IDs to this procedure*/
DROP TYPE IF EXISTS dbo.NodeIDTableType
CREATE TYPE dbo.NodeIDTableType  AS TABLE  ( ID NVARCHAR(1000) );

GO

After the user defined table type is created we can create the procedure to populate the edge table.

DROP PROCEDURE IF EXISTS dbo.Create_Edges
GO

CREATE PROCEDURE dbo.Create_Edges

    @Source  NodeIDTableType READONLY,
    @Target  NodeIDTableType READONLY,
    @Process VARCHAR(MAX)
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

   --Get the FROM and TO node IDs into a common table expression and populate the edge table in batch.

    WITH CTE (FROMID, TOID) AS
    (
        SELECT
            A.FROMID,
            B.TOID
        FROM
            (SELECT ID AS FROMID, ROW_NUMBER() OVER(ORDER BY ID ASC) AS RowNum FROM @Source) AS A
        JOIN
            (SELECT ID AS TOID, ROW_NUMBER() OVER(ORDER BY ID ASC) AS RowNum FROM @Target) AS B
        ON A.RowNum = B.RowNum
    )

    --Insert lineage data into the DataFlowsTo edge table
    INSERT INTO [dbo].[DataFlowsTo] ($from_id, $to_id, ProcessName)
    SELECT CTE.FROMID, CTE.TOID, @Process FROM CTE;

    
END

In case you are wondering I used a common table expression (CTE) so that the edge records could be created in bulk rather than one at a time.

Create Data Lineage Procedure

This is the final procedure, which checks whether the source or target is new and calls the procedures above as required. The conditional logic is premised on the assumption that the combination of system name, database name, table name and column name is unique.

USE [DataLineage]
GO

DROP PROCEDURE IF EXISTS dbo.Create_Data_Lineage
GO


CREATE PROCEDURE [dbo].[Create_Data_Lineage] 

    --Source data
    @SourceSystem VARCHAR(MAX),
    @SourceDatabase VARCHAR(MAX),
    @SourceTable VARCHAR(MAX),
    @SourceColumnList VARCHAR(MAX),

    --Target data
    @TargetSystem VARCHAR(MAX),
    @TargetDatabase VARCHAR(MAX),
    @TargetTable VARCHAR(MAX),
    @TargetColumnList VARCHAR(MAX),

    @ProcessName VARCHAR(MAX)

AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

        
    --DECLARE @SourceIDs TABLE (ID NVARCHAR(1000));
    --DECLARE @TargetIDs TABLE (ID NVARCHAR(1000));
    
    DECLARE @SourceIDs NodeIDTableType;
    DECLARE @TargetIDs NodeIDTableType;

    
    --Prepare the list of fields by removing any spaces or tabs between the comma-separated list of source and target columns
    SET @SourceColumnList = REPLACE(REPLACE(@SourceColumnList,' ', ''), char(9), '');
    SET @TargetColumnList = REPLACE(REPLACE(@TargetColumnList,' ', ''), char(9), '');

    --Check for existing sources; if found use those IDs, otherwise create new nodes and use the new IDs
    INSERT INTO @SourceIDs
    SELECT 
        $NODE_ID
    FROM 
        [dbo].[DataEntity] DE
    INNER JOIN 
            (SELECT
                value AS [ColumnName],
                @SourceTable AS [TableName],
                @SourceDatabase AS [DatabaseName],
                @SourceSystem AS [SystemName]
            FROM  
                STRING_SPLIT(@SourceColumnList, ',')
            ) SRC
    ON
        DE.[ColumnName] = SRC.[ColumnName]
    AND 
        DE.[TableName] = SRC.[TableName]
    AND
        DE.[DatabaseName] = SRC.[DatabaseName]
    AND
        DE.[SystemName] = SRC.[SystemName];

    --Check for existing targets; if found use those IDs, otherwise create new nodes and use the new IDs
    INSERT INTO @TargetIDs
    SELECT 
        $NODE_ID
    FROM 
        [dbo].[DataEntity] DE
    INNER JOIN 
            (SELECT
                value AS [ColumnName],
                @TargetTable AS [TableName],
                @TargetDatabase AS [DatabaseName],
                @TargetSystem AS [SystemName]
            FROM  
                STRING_SPLIT(@TargetColumnList, ',')
            ) TRGT
    ON
        DE.[ColumnName] = TRGT.[ColumnName]
    AND 
        DE.[TableName] = TRGT.[TableName]
    AND
        DE.[DatabaseName] = TRGT.[DatabaseName]
    AND
        DE.[SystemName] = TRGT.[SystemName];

    IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN

            --Create source nodes
            INSERT @SourceIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @SourceSystem,
                    @Database = @SourceDatabase,
                    @Table = @SourceTable,
                    @ColumnList = @SourceColumnList;
        
            --Create target nodes
            INSERT @TargetIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @TargetSystem,
                    @Database = @TargetDatabase,
                    @Table = @TargetTable,
                    @ColumnList = @TargetColumnList;
        
        
            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
        END
    ELSE IF (EXISTS (SELECT 1 FROM @SourceIDs)) AND (NOT EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN
        
            --create target nodes
            INSERT @TargetIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @TargetSystem,
                    @Database = @TargetDatabase,
                    @Table = @TargetTable,
                    @ColumnList = @TargetColumnList;

            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName;
        END
    ELSE IF (NOT EXISTS (SELECT 1 FROM @SourceIDs)) AND (EXISTS (SELECT 1 FROM @TargetIDs))
        BEGIN
        
            --create source nodes
            INSERT @SourceIDs
            EXEC  dbo.CREATE_NODES 
                    @System = @SourceSystem,
                    @Database = @SourceDatabase,
                    @Table = @SourceTable,
                    @ColumnList = @SourceColumnList;

            --Create edges between source and target
            EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
        END
    ELSE 
        --No new nodes required
        --Create edges between source and target
        EXEC dbo.Create_Edges @Source = @SourceIDs, @Target = @TargetIDs, @Process = @ProcessName
    

END
GO
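To see how the parent procedure ties everything together, here is a sketch of a call using the JDE-to-Staging flow from this series; the parameter values mirror the example data and should be treated as illustrative.

```sql
-- Illustrative call: capture lineage for RPDIVJ flowing from JDE into Staging.
EXEC [dbo].[Create_Data_Lineage]
    @SourceSystem     = 'Finance',
    @SourceDatabase   = 'JDE',
    @SourceTable      = 'F0411',
    @SourceColumnList = 'RPDIVJ',
    @TargetSystem     = 'Business Intelligence System',
    @TargetDatabase   = 'Staging',
    @TargetTable      = 'F0411',
    @TargetColumnList = 'RPDIVJ',
    @ProcessName      = 'Load F0411';
```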

Summary

In this article we have created the database procedures to simplify the process of creating the necessary node and edge records to capture data lineage between source and target fields.

In the next article I will cover how to call these procedures and query the graph tables to answer data lineage questions.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

How to track data lineage with SQL Server Graph Tables – Part 1 Create Nodes and Edges

Featured

Where did this data come from?
How can I trust this data?
What impact will changing this field have on other systems?

If you have ever heard or asked any of these questions, then this series of blog posts is for you.

As data volumes continue to grow so too does the need to manage the data estate. One critical aspect with managing the data estate is understanding data ancestry. Thankfully you can leverage SQL Server 2019 Graph tables to track the lineage of one of your most valuable assets, your data.

In this series of blog posts, I will show you how you can use Graph Tables in SQL Server 2019 to capture and report on data lineage.

Prerequisites

  • SQL Server 2019 Developer Edition (you can download a free copy for development use from here)
  • SQL Server Management Studio 18 (you can download a free copy from here)

*Note: at the end of the series I will show you how to visualize the data using R, Power BI and Plotly Dash, so expect to install Microsoft R Open, Power BI and Python later in the series.

Overview

The diagram below provides a conceptual overview of the various components that we will use during this series of articles. We will start with the Graph tables in the DataLineage database in SQL Server 2019 because they are the core of the solution.

Here is an overview of the initial data model that we will build.

As you can see in the diagram, this series will cover the movement of data from one system to another. At the end of the series I will explain how this initial solution can be extended to include dependencies between objects, such as the dependency of a field in a report on a field in a table in your data warehouse or data lake. This will allow you to track and report on both the movement of data and the inter-dependencies between data entities.

Create the DataLineage database

Once you have SQL Server installed, we will create a new database called DataLineage which we will use to store data we need to track lineage. To create the database run the following SQL command.

CREATE DATABASE DataLineage

Graph tables are first-class entities in SQL Server databases starting with SQL Server 2017. Graph databases are ideal for storing and traversing relationships between items. For additional information on the graph capabilities in SQL Server you can refer to the following site.

https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview?view=sql-server-ver15

Create the Node table

Now we will create the node table. A node represents a discrete entity in a graph database. In order to maximize the value of our data lineage solution we need to track the movement of data at the lowest level possible, which is the column/field level. In addition to the column/field data we also need to capture the table, database and system the column/field is stored in; this will be useful for filtering the data and understanding context. Based on this requirement we will create a node table that stores column, table, database and system name using the following SQL command.

USE [DataLineage]
GO

DROP TABLE IF EXISTS dbo.DataEntity
GO

CREATE TABLE DataEntity
(
    [ID] INTEGER IDENTITY(0,1) PRIMARY KEY,
    [ColumnName] NVARCHAR(MAX),
    [TableName] NVARCHAR(MAX),
    [DatabaseName] NVARCHAR(MAX),
    [SystemName] NVARCHAR(MAX),
    [CreatedDt] DATE DEFAULT(GETDATE())
) AS NODE;

GO

I have also included a CreatedDt field for auditing purposes. Next, we will create the edge table which will be used to store information regarding the movement of data from source to target.

Create the Edge table

An Edge table represents a relationship between nodes in a graph database. In our scenario we want to track the movement of data between source and target. In order to capture this information, we will create an Edge table using the following SQL command.

USE [DataLineage]
GO

DROP TABLE IF EXISTS dbo.DataFlowsTo
GO

CREATE TABLE DataFlowsTo
(
    [ProcessName] NVARCHAR(MAX),
    [CreatedDt] DATE DEFAULT(GETDATE())
) AS EDGE;
GO

The edge table contains a single user-defined column called ProcessName, which is used to capture the name of the ETL process that moves data from source to target. The ProcessName is also useful for filtering the data, which is helpful in scenarios where we need to view all the fields involved in an ETL process, or all the ETL processes dependent on one or more fields. Like the node table, I have included a CreatedDt field for auditing purposes.
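Once lineage has been captured (as we will do later in the series), that kind of filter is a simple MATCH query. The sketch below assumes lineage data has already been loaded; the process name is the one used as an example in this series.

```sql
-- Illustrative query: list every source/target column pair touched by one ETL
-- process. Assumes lineage data has already been captured.
SELECT
    src.ColumnName AS SourceColumn,
    trgt.ColumnName AS TargetColumn
FROM
    [DataLineage].[dbo].[DataEntity] src,
    [DataLineage].[dbo].[DataFlowsTo] df,
    [DataLineage].[dbo].[DataEntity] trgt
WHERE MATCH(src-(df)->trgt)
    AND df.ProcessName = 'Load Fact Accounts Payable';
```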

Summary

In this article we have created the database as well as the node and edge tables required to capture data lineage.

In the next article I will cover how to populate these tables using database procedures in order to simplify the process of capturing data lineage.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Extend your information reach without over stretching by virtualizing data using SQL Server 2019 and MongoDB – Part 3 Setup Polybase

This is the third post in a series in which I explain how to use PolyBase to connect SQL Server to MongoDB in order to have a modern data ecosystem that caters to both structured and unstructured data. By leveraging PolyBase you can centralize access without moving data and deliver a best of breed solution for all of your data needs.

In my previous post I walked through how to spin up a SQL Server 2019 Azure VM and install Polybase. In this post I will show you how to setup Polybase to access a collection in MongoDB. This is the glue between MongoDB and SQL Server 2019. Conceptually this is the arrow in the diagram below.

First, I will show you how to setup an unsecure connection and then I will show you how to secure it with basic authentication.

Verify that MongoDB is up and running

Confirm that MongoDB is up and running. As specified in my prior article, you can use Studio 3T to connect, or you can tunnel in using SSH and connect to MongoDB through PuTTY. If you shut down the VM you will need to restart MongoDB using the following command.

sudo service mongod restart

To run this command you will need to use PuTTY to tunnel in. Details on how to do this can be found in this article.

Configure PolyBase

Connect to the SQL Server instance. You can RDP into the VM or connect from your local workstation. I connected directly from my laptop using Azure Data Studio, a nice lightweight database tool.

Enable Polybase

Once connected you will need to turn on PolyBase. Execute the following command in the master database.

USE MASTER
exec sp_configure @configname = 'polybase enabled', @configvalue =1;
GO

RECONFIGURE WITH OVERRIDE;

Create Database

Next you will need to create a database. Execute the following SQL command to create a database called TEST.

CREATE DATABASE TEST
GO

USE TEST
GO

The following commands all need to be executed in the new Test database that was just created with the command above.

Create External Data Source

Now you will need to create an external data source. I called mine MongoDBSource.

CREATE EXTERNAL DATA SOURCE MongoDBSource
WITH (
LOCATION='mongodb://40.85.255.249:27017',
CONNECTION_OPTIONS ='ssl=false;'
);
GO

Make sure that you modify the LOCATION parameter so that you specify the IP or name of the server that is running MongoDB.

Also, you will need to add CONNECTION_OPTIONS = 'ssl=false;' otherwise you will see the following error when you try to create the external table.

Msg 105082, Level 16, State 1, Line 22

105082;Generic ODBC error: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: No suitable servers found (`serverSelectionTryOnce` set): [connection closed calling ismaster on ‘40.85.255.249:27017’] (Error Code: 13053) Additional error <2>: ErrorMsg: [Microsoft][MongoDBODBC] (110) Error from MongoDB Client: No suitable servers found (`serverSelectionTryOnce` set): [connection closed calling ismaster on ‘40.85.255.249:27017’] (Error Code: 13053), SqlState: HY000, NativeError: 110 .

Create External Table

The last thing we need to do is create an external table.

CREATE EXTERNAL TABLE Test(
[_id] NVARCHAR(24),
[x] FLOAT(53)
)
WITH (
LOCATION='test.testData',
DATA_SOURCE= MongoDBSource
);

Common errors

If you try to implicitly convert the [x] FLOAT(53) column to an int you will see the following error.

Msg 105083, Level 16, State 1, Line 24

105083;The following columns in the user defined schema are incompatible with the external table schema for table ‘Test2’: user defined column type: ([x] INT) vs. detected external table column type: ([x] FLOAT(53)). The detected external table schema is: ([_id] NVARCHAR(24) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [x] FLOAT(53))

This is because SQL Server requires a schema on read with strong type casting.
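The external column types are fixed by the detected schema, but you can still convert explicitly at query time. A minimal sketch:

```sql
-- The external table must declare [x] FLOAT(53) to match the detected schema,
-- but an explicit conversion can be applied when querying.
SELECT
    [_id],
    CAST([x] AS INT) AS x_as_int
FROM [dbo].[Test];
```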

In addition, if you try to create the table without all of the “columns” that reside in the collection by using the following SQL command.

CREATE EXTERNAL TABLE Test2(
    [x] FLOAT(53)
)
WITH (
    LOCATION='test.testData',
    DATA_SOURCE= MongoDBSource
);

You will get the following error.

Msg 105075, Level 16, State 1, Line 24

105075;The user specified schema column count is 1, which varies from the external table schema column count of 2. The detected external table schema is: ([_id] NVARCHAR(24) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, [x] FLOAT(53))

Again, this is because the current version of PolyBase in SQL Server requires a table structure that matches the collection structure in MongoDB.

Once the table is created you can query it just as you would any other table in SQL Server.

SELECT *
FROM [dbo].[Test]

Your results should look similar to the image below.

By default predicate pushdown is applied, so MongoDB is used for query execution, allowing for optimum performance by distributing the compute to the best location/technology for the job.
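For example, in a filtered query like the sketch below the predicate can be evaluated by MongoDB before rows are returned to SQL Server; the threshold value is arbitrary.

```sql
-- With pushdown enabled, the WHERE clause can be evaluated on the MongoDB side.
-- The literal 10 is an arbitrary illustrative value.
SELECT [_id], [x]
FROM [dbo].[Test]
WHERE [x] > 10;
```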

Next let’s add some basic authentication so that our connection is a bit more secure.

Securing the connection with basic Authentication

Using PuTTY, log back into the MongoDB VM. Once connected to the VM, log into MongoDB by typing the following command.

mongo

Once in the MongoDB shell execute the following command.

use test
db.createUser(
{
user: "mytestuser",
pwd: "mytestuser",
roles: [ { role: "readWrite", db: "test" } ]
}
)

You should see the following response if the command is successful.

Successfully added user:
{
 "user" : "mytestuser",
 "pwd" : "mytestuser",
 "roles" : [
 {
 "role" : "readWrite",
 "db" : "test"
 }
 ]
}

This creates a user called mytestuser with the password mytestuser. You should use a more secure ID and password, but this is sufficient for the purposes of this article.

Enable authentication in MongoDB

Exit out of Mongo and run the following command.

sudo nano /etc/mongod.conf

Modify the configuration file by adding the tag authorization: enabled under the security section.
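The relevant portion of /etc/mongod.conf should then look something like this excerpt (the rest of the file is left as installed):

```yaml
# /etc/mongod.conf (excerpt) - enable role-based access control
security:
  authorization: enabled
```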

Exit and save the config file and then restart mongo.

sudo service mongod restart

Modify SQL Server

Now that we have made the necessary changes in MongoDB we need to add authentication information in SQL Server.

Log into SQL Server and create a master key. Execute the following SQL command to create a master key.

USE TEST

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Password123!'
GO

Next create a database scoped credential.

CREATE DATABASE SCOPED CREDENTIAL MongoDB
WITH IDENTITY ='mytestuser', SECRET = 'mytestuser';

GO

As you can see in the command above the IDENTITY and SECRET must match the user and pwd in the MongoDB command we used to create the user.

Last, we will drop and recreate the external data source and table using the following commands.

DROP EXTERNAL DATA SOURCE MongoDBSource

CREATE EXTERNAL DATA SOURCE MongoDBSource
WITH (
  LOCATION = 'mongodb://40.85.255.249:27017', 
  CREDENTIAL = MongoDB,
  CONNECTION_OPTIONS = 'ssl=false;'
);

GO

Notice how the CREDENTIAL argument has been added to the CREATE command above.

DROP EXTERNAL TABLE Test
CREATE EXTERNAL TABLE Test(
[_id] NVARCHAR(24),
[x] FLOAT(53)
)
WITH (
LOCATION='test.testData',
DATA_SOURCE= MongoDBSource
);

Now you are using an account to access the database in MongoDB. The next step would be to enable SSL in MongoDB so that the CONNECTION_OPTIONS 'ssl=false;' setting can be removed and a secure connection established between SQL Server and MongoDB. The details are beyond the scope of this article; for instructions on how to enable SSL in MongoDB you can refer to the following guide.

https://docs.mongodb.com/manual/tutorial/configure-ssl/

The setup that I have documented in this series can be done on premise or in the cloud using VMs as I have done.

Hopefully you have found this to be another practical post.

Until next time.

Anthony

Instant insights, automation and action – Part 6 Integrate Power BI, Power Apps, Azure Machine Learning and Dynamics 365 using MS Flow

This is the last article in a 6-part series in which I explain how you can integrate Power BI, Power Apps, Azure Machine Learning and Dynamics 365 using MS Flow.

For reference here are the descriptions and links to the previous articles.

Instant insights, automation and action – Part 1 Create Power App

Instant insights, automation and action – Part 2 Create Azure Machine Learning Experiment

Instant insights, automation and action – Part 3 Create the Power BI Report

Instant insights, automation and action – Part 4 Register Power BI in Azure Active Directory

Instant insights, automation and action – Part 5 Integrate with MS Flow

In this article I will explain how you can kick off an MS Flow by adding an action to your Power App, and then how you can integrate the Power App into a Power BI dashboard. Data alerts can be tied to tiles in the Power BI dashboard that kick off additional flows which insert records into Dynamics. The complete system is depicted in the diagram below.


Modify the Power APP

In Part 1 of this series we created a simple app that allowed a user to enter new sales data. We now need to go back and modify it. Navigate to Power Apps and edit the app.


Once the app is open click on the submit button to select it and then from the Action menu at the top select Flows.


This will open up a new pane in which you can select the flow that we created in Part 5 of this series. Once you have selected the flow enter the following code into the formula expression bar.

PowerApptoAzureMLtoPowerBIbkp.Run(NAME.Text, CHANNEL.Text, REGION.Text, FRESH.Text, MILK.Text, GROCERY.Text, FROZEN.Text, DETERGENT.Text, DELICASSEN.Text,CATEGORY.Text)


This will execute the flow and pass the data values from each of the text input boxes into the flow. You can test the flow by clicking on the play button in the top right-hand corner of the screen.

Save the app and publish it so that the new version, with the flow attached to the submit button, is available to integrate into Power BI.

Modify the Power BI Report

Next, we will need to modify the Power BI report to include a Power Apps visual. Open the report that we created in Part 3 and add the Power App custom visual from the marketplace.


Once the new visual has been added, we will place it on a new page in the report. In the Power BI report create a new page and call it Data Entry. We are doing this to keep the report clean and simple; we will combine the various visuals, including the Power App, in a Power BI Dashboard once we have finished polishing the report.

Drop the new visual onto the canvas of the new page and add any field from the dataset's field list; I used customer name. You should see a screen like the image below.


We are not creating or editing an app since we already built it in Part 1. Click OK and then select Choose app. Select the app we created for entering new wholesale customer sales data.


Click Add. You may see another warning about creating or editing the app; ignore this by clicking OK.


The new report page should now look like the image below.


Rename Page 1 to Wholesale Customer Report. You can spruce up the first page to make it look more appealing; I modified my report to look like this.


Once you are happy with the design of the report, publish it to Power BI. You can replace the existing report that we created in Part 3. Once the report has been published, navigate to the cloud service and go to the report that you just published.

Build the Dashboard

It’s now time to build a dashboard. With the report open, pin the following visuals to a new dashboard.


To pin a visual to a dashboard, click on the visual and select the pin icon from the menu bar.


A menu like the one below will pop up. Give the new dashboard a name such as Wholesale customer dashboard.


Select Pin to create the new dashboard and add the visual to it. Repeat this for all of the card visuals in the report, except this time select Existing dashboard instead of New dashboard and, if not already selected, pick the Wholesale customer dashboard that we just created.

Next, we will pin the Power App visual. Go to the Data Entry page and pin the Power App just as we did for the card visuals. If you are having trouble selecting the pin option, you may need to edit the report to pin the visual.

Your dashboard should now look something like this.


Let’s rearrange the tiles and add some new visuals by using Q&A.

First add a new visual by typing the following questions in the Q&A bar at the top of the screen.

Fresh by customer sort by fresh

Pin the visual to the existing Wholesale customer dashboard.


Then place this at the bottom of the dashboard.

Repeat these steps using the following questions:

Milk by customer sort by milk

Grocery by customer sort by grocery

Frozen by customer sort by frozen

Detergent paper by customer sort by detergent paper

Delicassen by customer sort by delicassen

Your dashboard should now look similar to the image below.


Try adding a new customer using the Power App embedded in the Power BI Dashboard. After you have entered data into each of the input boxes in the Power App, hit the submit button; in about 5 seconds or less you should see the customer count go up and your new customer appear on the dashboard in real time. Also try entering a new customer but leave the Category field blank. Notice that even though the field is blank, it is populated by the time it shows up in Power BI; that is because the Azure Machine Learning model is supplying this data.

Integrate with Dynamics 365

The last step is to add a data alert to one of the tiles which will create a record in Dynamics 365. Navigate to the dashboard, if not already there, and click the … in the top right-hand corner of the Fresh tile.


Then select Manage alerts.


This will open a new menu on the right-hand side of the screen. From this screen click + Add alert rule. Create an alert that will fire once Fresh goes above a certain value; in my case I used 60,000.


For the purposes of this tutorial an alert based on an absolute value is adequate; however, a better choice would be an alert on a relative value, such as % change, so that you do not have to modify the alert's threshold every time you surpass it. Click Save and close.
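Power BI data alerts only support fixed thresholds, so a relative rule would have to live in the flow itself. A minimal sketch of what such a percent-change check could look like (the function name and the 10% default are my own illustration, not part of Power BI or Flow):

```python
def should_alert(previous_value, current_value, pct_threshold=0.10):
    """Fire when the metric grew by more than pct_threshold (e.g. 10%)
    since the last check, instead of comparing against a fixed number."""
    if previous_value == 0:
        # Any growth from zero counts as crossing the threshold.
        return current_value > 0
    return (current_value - previous_value) / previous_value > pct_threshold
```

With a 10% threshold, a jump from 60,000 to 70,000 fires the alert, while 60,000 to 63,000 does not, and the rule keeps working as the metric grows.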

Go back to Manage alerts for this tile (Fresh) and this time select Use Microsoft Flow to trigger additional actions.



This will launch MS Flow. Use the default template to create a new flow triggered from a Power BI alert.


Use the template and select the alert for Fresh from the Alert id drop-down menu. Next select Add new step, search for Dynamics 365, and then select the Create a new record action.

Your flow should now look like this.


Enter the details for your Dynamics 365 tenant and select the entity in which you want a record created. For my purposes I created a new task to follow up with the customer by using the Tasks entity. Save the flow and test it out by entering new sales data using the Power App embedded in the Power BI report. If you have wired up the flow correctly, a new record should be created in Dynamics 365 once you have triggered the data alert in your Power BI dashboard.

We have now reached the end of this series. Hopefully you have realized that by combining Power BI, Power Apps, Flow, Azure Machine Learning and Dynamics 365 you can open up new possibilities which lead to insights, automation and action at the speed of business.

Until next time.

Anthony


Instant insights, automation and action – Part 5 Integrate with MS Flow

This is the fifth post in a series of articles in which I explain how to integrate Power BI, Power Apps, Flow, Azure Machine Learning and Dynamics 365 to rapidly build a functioning system which allows users to analyze, insert, automate and action data.

In the previous article I covered how to create an API enabled dataset.

In this article I will cover how you can use MS Flow to create an automated workflow which integrates the various components that we have built thus far, as illustrated in the diagram below.


Prerequisites

Before you can complete this tutorial you will need access to MS Flow, and you must have registered Power BI with Azure AD so that you can push data into an API-enabled dataset. For more information on how to register Power BI, refer to the previous article.

Create the Flow

Log into MS Flow using the following URL: https://us.flow.microsoft.com

Once you have logged into MS Flow click on My Flows and then select New > Create from blank using the drop down menu.

This will land you on a second screen in which you will need to click Create from blank once more.

After clicking create from blank once more you should see a screen like the one below.

The first thing we need to add is a Power Apps trigger. To do this, search for the word PowerApps and select the first result returned by the search.

Add Power BI action

Next we will add an action to our flow. Click on + New step after the PowerApps connector, search for Power BI, and then select Add rows to a dataset (preview).

This will open a new window in which you will select the workspace, dataset, and table name. The workspace will be called Customer Segmentation and corresponds to the Power BI Workspace you used to create the customer segmentation report. The dataset should be called WholeSaleCustomer and corresponds to the name of the API enabled dataset that we created in Power BI. The table name is the default table name for all API enabled datasets that are created in Power BI.
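Behind the scenes, the Add rows to a dataset action calls the push-rows endpoint of the Power BI REST API. As a rough illustration of the request Flow builds for you (the group ID, dataset ID, and token below are hypothetical placeholders; Flow handles authentication, and RealTimeData is the default table name for API-enabled datasets):

```python
import json

POWER_BI_API = "https://api.powerbi.com/v1.0/myorg"

def push_rows_request(group_id, dataset_id, table, rows, token):
    """Return the URL, headers, and body for the Power BI push-rows call:
    POST .../groups/{group}/datasets/{dataset}/tables/{table}/rows."""
    url = f"{POWER_BI_API}/groups/{group_id}/datasets/{dataset_id}/tables/{table}/rows"
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    body = json.dumps({"rows": rows})
    return url, headers, body

url, headers, body = push_rows_request(
    "my-group-id", "my-dataset-id", "RealTimeData",
    [{"CustomerName": "Contoso", "Fresh": "12669", "Category": ""}],
    "fake-token")
```

This is only a sketch of the call shape; the Flow action fills in the real IDs from the workspace, dataset, and table you pick in this window.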

Next, we will use the dynamic content wizard to get the data from PowerApps into each of the appropriate columns in the API enabled dataset.

NOTE: You must do this next step correctly and in the right order because, as of the time of this writing, there is no way to remove elements once they have been created.

Click on Add dynamic content and then, in the pop-up menu on the right, select See more under PowerApps.

This will open an additional option to Ask in PowerApps.

Select Ask in PowerApps and notice how flow automatically generates a field in the flow called Addrowstoadataset_CustomerName.

Complete the same steps for each column in the Power BI dataset. Be sure to select Ask in PowerApps for each new field. Once done your action should look like this.

This will automatically create parameters in the flow which will surface in the Power App once the flow has been added to it. As a sneak peek at what I mean, here is a screen clip of the step that calls the flow from inside PowerApps; we will do this step later in this tutorial series.

Add Azure Machine Learning action

Now we will add a new action between the PowerApps connector and the Power BI action to call the Azure Machine Learning API. To do this we will use the HTTP action. As of the time of this writing there is no OOTB connector or action for Azure Machine Learning Studio.

After you have added the HTTP event fill out the details as follows.

You can get the URI from the Azure Machine Learning experiment's Request/Response page. Also, be sure to include the Content-Type and Authorization information. The Authorization header needs to include the keyword Bearer as well as the API key, which you can also get from the Azure Machine Learning experiment page.

Next, we will add a dynamic body as follows.

This will take the data that comes out of Power Apps and pass it to the Azure Machine Learning API for scoring. The machine learning model responds with the grouping the customer belongs to, determined by a clustering algorithm. The Azure Machine Learning service responds with a JSON document that needs to be parsed in order to extract the appropriate information.
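The dynamic body the HTTP action sends follows the Inputs/GlobalParameters format that the Azure Machine Learning Studio Request/Response endpoint expects. A sketch of how that body is shaped (the column names follow the wholesale-customer dataset used in this series; check the exact names on your experiment's Request/Response page):

```python
def build_aml_request(row):
    """Build the request body for an Azure ML Studio Request/Response call:
    one named input with parallel ColumnNames and Values arrays. All values
    are sent as strings, matching what Power Apps passes through Flow."""
    columns = list(row.keys())
    return {
        "Inputs": {
            "input1": {
                "ColumnNames": columns,
                "Values": [[str(row[c]) for c in columns]],
            }
        },
        "GlobalParameters": {},
    }

body = build_aml_request({"Channel": 1, "Region": 3, "Fresh": 12669,
                          "Milk": 9656, "Grocery": 7561, "Frozen": 214,
                          "DetergentPaper": 2674, "Delicassen": 1338})
```

The response comes back in a mirrored shape under Results > output1 > value, which is what the Parse JSON step below unpacks.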

Parse the JSON response

After the HTTP action we will add a Parse JSON action and take the Body of the HTTP response as input. We will also use the following Schema.

{
	"type": "object",
	"properties": {
		"Results": {
			"type": "object",
			"properties": {
				"output1": {
					"type": "object",
					"properties": {
						"type": {
							"type": "string"
						},
						"value": {
							"type": "object",
							"properties": {
								"ColumnNames": {
									"type": "array",
									"items": {
										"type": "string"
									}
								},
								"ColumnTypes": {
									"type": "array",
									"items": {
										"type": "string"
									}
								},
								"Values": {
									"type": "array",
									"items": {
										"type": "array",
										"items": {
											"type": "string"
										}
									}
								}
							}
						}
					}
				}
			}
		}
	}
}
This schema can also be generated automatically by pasting the sample response payload from the Azure Machine Learning service's Request/Response page into the sample payload option of the Parse JSON action.

Your action should look like the image below.


Your flow should now look as follows.

Complete flow with all components

Using the HTTP action is adequate for this tutorial; however, a more robust solution would be to use Azure Functions.

You can grab the sample C# code generated in Azure Machine Learning Studio to jump start the development of the “server-less” function.

Add conditional logic to Category data

Lastly, we will add some conditional logic to the Category field in the Power BI action. Click on the Category field in the Power BI action and select Expression.

Enter the following code into the expression box.

if
(
      equals(triggerBody()['Addrowstoadataset_Category'],''),
      first(first(body('Parse_JSON')?['Results']?['output1']?['value']?['Values'])),
      triggerBody()['Addrowstoadataset_Category']
)

This will check whether the Category field has been filled out in the Power App and use that value; otherwise, if the Category field is blank, it will use the value from the Azure Machine Learning model.
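The same logic, written out in Python for clarity (the sample response below is a made-up stand-in for the parsed Azure ML document, with the predicted cluster as the first value of the first row):

```python
def resolve_category(entered_category, aml_response):
    """Mirror of the Flow expression above: keep the category typed into
    the Power App, or fall back to the first value in the first row of the
    Azure ML response when the field was left blank."""
    if entered_category == "":
        # first(first(body('Parse_JSON')?['Results']?['output1']?['value']?['Values']))
        return aml_response["Results"]["output1"]["value"]["Values"][0][0]
    return entered_category

# Hypothetical parsed response: one scored row whose first value is the cluster.
sample = {"Results": {"output1": {"value": {"Values": [["2"]]}}}}
```

So a blank Category resolves to the model's cluster ("2" here), while anything the user typed passes through unchanged.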

Test the flow by clicking Test in the top right-hand corner of the flow designer.

In the next post I will show you how you can integrate the flow into the Power App and then integrate the app into a Power BI Dashboard.

Hopefully you have found this to be another practical post.

Until next time.

Anthony