SQL Tuesday #110 Deterministic masking with dbatools
Table of Contents
The dbatools module recently got a couple of new commands mask data in their databases. One feature with the masking commands that were not yet put in was deterministic masking.
What is deterministic masking⌗
Deterministic masking is the process of replacing a value in a column with the exact value across tables.
For example, a database has multiple tables with a column that has first names. With deterministic masking, the first name that’s present will always be replaced with the same value.
Let’s assume the first name “Chris” will be replaced with “Jeff”. The value “Chris” will be replaced with “Jeff” in a column, regardless of the table or column name.
This is very important when you’re dealing with a database that is not normalized using relationships like reporting or business intelligence-related tables.
How does it work⌗
When creating the masking configuration file you have the option to set a column to be deterministic like this:
During the process of masking the data the command, Invoke-DbaDbDataMasking will create a dictionary with the values from the columns that are set to deterministic.
Every time it processes a row it will check the dictionary if the value is already present.
If it is, the new value of the particular column and row will be set to the value from the dictionary. If it cannot find the value a new value will be generated and added to the dictionary for later use.
But won’t this make my database less secure⌗
You may think that, when you always use the same value every time, it will be less secure but in this case, it isn’t.
This is because the masking command does not rely on any particular key to regenerate the value.
Every value that needs to be replaced will get a random new value. This value is then put in the dictionary and has no reference to the old value.
That’s fantastic! But what are the downsides?⌗
With every feature and the extra check comes the fact that extra processing is needed to determine the value for a particular row. Extra processing means that extra time is needed to process the table.
You also need more memory for the process because the dictionary will hold all the unique values from all the columns that are set to deterministic.
This can lead to a large number of values when you’re dealing with terabytes of data. I would advise using this feature only for columns that really need to be deterministic and not use it lightly across all the columns.
This feature enhanced the command to create some sophisticated masking strategies.
For more information about the command read the blog post by Chrissy LeMaire has written about the new feature.