Data masking with dbatools
Table of Contents
Recently I developed a few PowerShell commands to make it possible to enable data masking for databases.
The commands were originally written for the module PSDatabaseClone to enable users to automatically mask the data for a database image.
The reason the commands were created was that the cloning process would otherwise expose production data to other users which is not preferable.
The commands were released and picked up by Chrissy LeMaire who implemented them in dbatools and even improved them.
I decided that because the PSDatabaseClone module already relied on dbatools, I would remove the original command from my module and map to the command in the dbatools version.
Why are these commands created⌗
There are various pieces of software available that offer functionality to mask data in a database like DataVeil, DataMasker and JumbleDB.
I have no experience with these products but looking at the features they seem to do the job.
Relying on the features of a commercial product was not an option because the PSDatabaseClone module is open-source. Most of the products do not allow the use of their code within other software.
I wanted to make this process as straightforward as possible. One command generates a configuration file on how the masking should be executed. Another command to execute the data masking.
If a user created an image he/she would have the chance also ask for the data within the image. That way a user would be able to create a clone from that image never exposing production data to the world.
Generating a configuration file⌗
The first thing I had to think of was the data structure to save information about the tables and columns.
Important information like the name, schema, column type, minimum value, maximum value, and the masking type and subtype.
I’m a fan of JSON to create data structures in text files. It’s easy to read and works well with various systems. PowerShell is one of those systems.
The first command will generate the masking configuration file containing all the tables and columns that should be masked. I wanted the command to be able to distinguish certain column names to associate them with a particular way to generate random data.
For that, I created a file that contained all the synonyms. For example, the command would be able to find columns with first names if the column name would be something like “Firstname” or “Forename”.
If it didn’t find the name of the column in the synonym list, it would then look at the data type and based on that decide what kind of data masking type should be applied. It was renamed to New-DbaDbMaskingConfig in dbatools.
Executing it will result in a similar output as below
New-DbaDbMaskingConfig -SqlInstance [yourinstance] -Database [yourdatabase] -Path [directory-to-export-to]
Your file will be written to the directory you chose. It will have the instance name and the database in the file name below
The content of the file will look similar to this
This made things a lot easier for users to create the initial configuration.
Without this command, it would take a lot of time to create the masking configuration which would also be subject to errors.
The second command is to use this content and execute the actual masking.
Masking the data⌗
The next command ended up taking some more research than I first expected. I had to figure out how to generate random data like names, streets, zip codes, e-mail addresses, and credit cards. The columns could have some sensitive information.
At first, I wanted to create my library to generate different types of random data. Fortunately, there were several projects out there that can help with this so I would not have to reinvent the wheel. During the process, I tried out several of these libraries like Fare and Bogus.
In the end, I chose Bogus which had a wide variety of items that could be generated and worked well how I thought this process could be implemented.
The command would rely on the masking configuration and based on that data perform the data masks. It will load the configuration file and look through each of the tables and columns.
With each column, the command will execute a certain masking action based on the masking type and sub-type. During the look, the command generates an UPDATE statement that will change the value to the new value.
After all the updates have been performed the command returns an overview of the performed actions. It will look similar to the window below
Invoke-DbaDbDataMasking -SqlInstance [yourinstance] -Database [yourdatabase] -FilePath [path-to-masking-config]
The result can be seen below
You’ll have a database with the data masked precisely as you requested.