Published 06 January 2020
Substitution rules are at the heart of data masking. They replace the data in columns with the values from the configured datasets.
In order to perform substitution, Data Masker requires that the table has a primary key.
Based on the DMTest sample database, the
DM_EMPLOYEE table has a primary key of
Id and there are several sensitive columns
birth_date columns. A basic substitution rule for this table could be:
first_nameto the dataset Names, First Names, Male + Female
last_nameto the dataset Names, Surnames, Random (Short List)
birth_dateto the dataset Date Variance (Random)
- This will move the date by a random amount within the specified bounds
By default, the substitution rule will preserve null and empty values.
Each dataset provides a set of options that are specific to that dataset to allow for "fine-tuning" of the data used in substitutions. The datasets will be restricted to ones that are appropriate for the type of column (i.e. it is possible to put a numeric value into a text-based field but not all text values could fit into a numeric field).
Data Masker's predefined datasets support a number of common concepts but it is possible to define your own dataset if you need something more specific.
By default, the datasets are located in a directory named Datasets immediately below the Data Masker installation directory and can be changed on the Misc. Setup Tab.
Preserving null/empty values
Substitution rules provide the ability (on the Where clause or Sampling options tab of the substitution form) to ignore NULL and/or empty values. The default for substitution rules is to ignore
Oracle treats empty strings as
NULL the preserve empty values option is not provided for substitution rules in Data Masker for Oracle.
Adding a WHERE clause
If only certain rows in the table need masking specify a where on the Where clause or Sampling options tab of the substitution form).
WHERE clause can be added here, e.g. for the above example, if we wanted to exclude all employees in the table where the
Id is under 1000 we could write:
WHERE Id >= 1000
You can choose to manually exclude
NULL/Empty values within your custom where clause or select the "with inline Null skips" option.
Synchronizing denormalized data
If you have a denormalized structure, e.g. the employee name exists in several tables, you would want to mask the data identically in all places you can convert a Substitution Rule into a Sync. Manager Rule:
- Ensure that the substitution rule is only masking the columns that are denormalized, other columns in the table would have to be masked in a different Substitution Rule.
- Ensure that the "No where clause" is set on the Where clause or Sampling options tab.
- On the Change Managers tab select the "Convert to Sync. Manager Rule" option.
- See Sync. Manager Rules for more details of how to connect to the other tables.
Parallelizing substitution on large tables
If a substitution rule is taking too long against a single table, you can parallelize the work using a Range Splitter Rule that divides the table into smaller chunks that can be masked in parallel. To convert to a Range Splitter Rule by selecting the "Convert to Sync. Manager Rule" option from the Change Managers tab.
Where clauses referencing other tables
The custom where clause allows you to specify a
JOIN clause (the target table will be aliased as
DM1 to help create the join condition).
In the example above, assume that you would not want to mask the internal employee details but would want to mask those of external contractors. For our sample database schema, this is stored in the
DM_ASSIGNMENTS table that has a column
emp_id that is a foreign key to
DM_EMPLOYEE.Id, you could write:
INNER JOIN DM_ASSIGNMENTS a ON a.emp_id = DM1.Id WHERE a.emp_jobtitle LIKE '%CONTRACTOR%'
Ensuring a realistic distribution
You can control the distribution of data within a table/column using the Rule Options on the Where clause or Sampling options tab. This involves chaining several substitution rules together, each applying different datasets to a column with different sampling options.
Continuing the initial example, where the Names, First Names, Male + Female dataset was used to mask the '`first_name` column, you may get an uneven distribution of male and female names. If you wanted to ensure that half of the employees were masked with female names and half male you will need to make two rules:
- In the above substitution rule, change the dataset to be Names, First Names, Female.
- Create a new substitution rule on the
- set the dataset to be Names, First Names, Male
- on the Where clause or Sampling options tab, enable the Sampling Percentage option and set the percentage to 50.
- Make this new rule run after the first rule.
The first rule masks all
DM_EMPLOYEE rows with female names, to ensure that all rows have been masked while the second rule selects a random sample (50% of the rows) and masks those with male names.
We can set limits on the number of rows being sampled using the Row limit option.
The Distinct set option will ensure that each row is counted only once in the sampling, this is useful when join conditions would result in rows in the target table being identified multiple times.
Didn't find what you were looking for?
- Visit the Redgate forum
- Contact Support