Memory improvements in data masking for dbatools
If you’ve used the data masking command in dbatools you’ve probably noticed that the PowerShell session becomes memory intensive when it has to handle larger tables with one or more unique indexes. The reason that happens is that during the data masking process the command looks for any unique indexes in the table. If it finds a unique index it will create a unique row for all the columns in the unique index. The command creates the values in memory. This means that you’ll have all the values that eventually get inserted into a table in memory. This can lead to a massive amount of memory being used when you have wider unique indexes with large data types. There was also another problem that I had to fix and that was that it would create those unique values for every unique index. This was also the case when there were overlapping columns when multiple unique indexes were being used. This was not efficient and I wanted to make something better for that too. I’ve been thinking about solutions for this problem because I think this command should be usable in almost every situation.
Here is how I made memory improvements in data masking for dbatools.
Problem #1: Moving away from memory
The solution for this problem was pretty obvious, move away from memory and use another storage solution to temporarily save the values. There are a couple of solutions we can use. Store the values
- in a file on a disk
- in a database
There were pros and cons for each solution
Files on disk
Storing the values in rows, like a CSV format, on disk is easy in PowerShell. The CVS format would’ve been my solution and I was not even considering something like JSON because that would create those large text files. We can easily iterate through the selection by looking at the row numbers when imported to make it work in the data masking process as well. The problem comes when we have to read the values. We would have to read the entire file again and therefore use more memory. This was not what I wanted.
Use a database
Using a database seems very logical in this context. We’re already connected to a server and can create and use tables. The downside is that we may be using storage because we’re temporarily creating a part of the table and this could get big with larger tables. One upside to using databases is, that I can create identifiers for each row. I can then query a single row from that table and get all the unique values fast and efficiently.
The decision was made, I was going to move the process to use a database. The next decision I had to make was to either use the database that was going to be masked or use a separate one. Both solutions have again their pros and cons, but I did not want to handle the cleanup of a new database. I also didn’t want the database to masked to become larger because I would grow the data file. The solution was to start using “tempdb” to create the temporary tables. Tempdb is great because:
- it’s always there
- you can optimize tempdb
- in case of bad cleanup it will destroy data when the session is destroyed
Now we have to consider creating the tables for the unique indexes in tempdb which was the next problem.
Problem #2: Creating a single unique row for all unique columns
One thing I didn’t like about the data masking command was the way it handles the unique indexes. It created an object in memory for each unique index and that added up in processing time and memory usage. We tackled the problem of memory usage by using SQL Server’s tempdb database. I still had the problem of the multiple objects/tables for each unique index. This was a bit harder to solve. I had to
- rewrite the process of retrieving all the unique indexes in a table
- collect all the columns and remove the duplicates
- create a table statement for all the unique columns
- add an identifier to make it easy to look up the row
- add an index to the identifier to make the lookup query fast
That is quite a bit of work to go through. In the end, I decided to make another command to handle that process for me. Because that process is too far away from the actual data masking itself, it was not a good idea to put this in the data masking command. The command I created is called “Convert-DbaIndexToTable” and is an internal function in dbatools.
By default you cannot call this command, there are ways to do it but it’s only built for the data masking process.
The command does the following
- Get all the unique indexes on a particular table
- Get all the columns from those indexes in an array
- Checks each column for the data type or user-defined data types
- In the case of a UDDT it will look into that property to get the actual data type
- In the case of a normal data type it will just use those properties
- Adds a column to the array to be the row identifier
- Put together the
- Create a table statement
- Create a unique index statement for the temporary table
But wait a minute! Why do we need another unique index on the temporary table?
The answer to that is: Because we want to make sure each row is unique across all the unique index columns.
This was a solution I implemented because of the way the unique values are generated. When the data masking command generates a unique row for all the columns, I want that row to be unique throughout the entire data set.
I could have created a process to check all the values in that table, but I could just as easily let SQL Server return an error when the unique values were already present in the table. When it returns an error the data masking would perform another iteration for that row as long as it fails to insert it. This is very fast, efficient, and less memory-consuming then handling the process myself.
But Sander, what if the unique column is not present in the data masking configuration? Are we still going to generate the unique value for that column?
The answer to that is: No. When you have a unique index in your table and you don’t add the columns of the unique index to the configuration file, the data masking command will not generate a value for that column. This again comes back to efficiency and speed.
When we have a unique index with 4 columns and we only add 1 column, because we make sure that the value in the entire data set is unique, we can be sure that the collection of the values is still unique even if we don’t add the other columns to the configuration file. Wow! That’s a lot to take in and I’ve been banging my head on the process for a while to make this work in the code.
Along the way, I sometimes stepped out of the main change and changed some other parts of the data masking command too
- Moved repetitive to separate functions
- Implemented static values feature
- Improved randomized value function
- Added more unit tests
This change was a lot of work but it was necessary to be able to use the command for larger databases. You can look into the pull request to get more info about the changes. I hope you found this informative and happy data masking! If you have any questions about the data masking commands in dbatools let me know. You can ping me on Twitter and I’m always present in the “SQL Server Community” slack channel. You can join this channel through this link.