Substitution Rules
Published 06 January 2020
Substitution rules are at the heart of data masking. They replace the data in columns with the values from the configured datasets.
In order to perform substitution, Data Masker requires that the table has a primary key.
Example
Based on the DMTest sample database, the DM_EMPLOYEE
table has an ID column person_i
d
and there are several sensitive columns first_name
, last_name
and birth_date
columns. A basic substitution rule for this table could be:
first_name
to the dataset Names, First Names, Male + Femalelast_name
to the dataset Names, Surnames, Random (Short List)birth_date
to the dataset Date Variance (Random)- This will move the date by a random amount within the specified bounds
By default, the substitution rule will preserve null and empty values. For example:
BEFORE | AFTER | |||||
---|---|---|---|---|---|---|
person_id | first_name | last_name | birth_date | first_name | last_name | birth_date |
1035563 | Seetha | Ellis | 20/02/1970 | Andrew | Jones | NULL |
1050641 | Steve | Jones | 23/06/1965 | Mary | Gray | 27/05/1965 |
2563 | Laura | Smith | 01/12/1982 | Hannah | Banks | 13/12/1965 |
Core concepts
Datasets
Each dataset provides a set of options that are specific to that dataset to allow for "fine-tuning" of the data used in substitutions. The datasets will be restricted to ones that are appropriate for the type of column (i.e. it is possible to put a numeric value into a text-based field but not all text values could fit into a numeric field).
Data Masker's predefined datasets support a number of common concepts but it is possible to define your own dataset if you need something more specific.
By default, the datasets are located in a directory named Datasets immediately below the Data Masker installation directory and can be changed on the Misc. Setup Tab.
Preserving null/empty values
Substitution rules provide the ability (on the Where clause or Sampling options tab of the substitution form) to ignore NULL
and/or empty values. The default for substitution rules is to ignore NULL
/empty values
Oracle treats empty strings as NULL
the preserve empty values option is not provided for substitution rules in Data Masker for Oracle.
Adding a WHERE clause
If only certain rows in the table need masking specify a where on the Where clause or Sampling options tab of the substitution form).
Any valid WHERE
clause can be added here, e.g. for the above example, if we wanted to exclude all employees in the table where the person_i
d
is under 1000 we could write:
WHERE person_id >= 1000
You can choose to manually exclude NULL
/Empty values within your custom where clause or select the "with inline Null skips" option.
BEFORE | AFTER | |||||
---|---|---|---|---|---|---|
person_id | first_name | last_name | birth_date | first_name | last_name | birth_date |
432 | Mark | Robertson | NULL | Mark | Robertson | NULL |
1765 | Steve | Jones | 23/06/1965 | Mary | Gray | 27/05/1965 |
2563 | Laura | Smith | 01/12/1982 | Hannah | Banks | 13/12/1965 |
Advanced concepts
Synchronizing denormalized data
If you have a denormalized structure, e.g. the employee name exists in several tables, you would want to mask the data identically in all places you can convert a Substitution Rule into a Sync. Manager Rule:
- Ensure that the substitution rule is only masking the columns that are denormalized, other columns in the table would have to be masked in a different Substitution Rule.
- Ensure that the "No where clause" is set on the Where clause or Sampling options tab.
- On the Change Managers tab select the "Convert to Sync. Manager Rule" option.
- See Sync. Manager Rules for more details of how to connect to the other tables.
Parallelizing substitution on large tables
If a substitution rule is taking too long against a single table, you can parallelize the work using a Range Splitter Rule that divides the table into smaller chunks that can be masked in parallel. To convert to a Range Splitter Rule by selecting the "Convert to Sync. Manager Rule" option from the Change Managers tab.
Where clauses referencing other tables
The custom where clause allows you to specify a JOIN
clause (the target table will be aliased as DM1
to help create the join condition).
In the example above, assume that you would not want to mask the internal employee details but would want to mask those of external contractors. For our sample database schema, this is stored in the DM_ASSIGNMENT
table that can be joined using composite key of both person_id
and assignment_id
, you could write:
INNER JOIN DM_ASSIGNMENT A ON A.person_id = DM1.person_id AND A.assignment_id = DM1.assignment_id WHERE A.emp_jobtitle LIKE '%Manager%'
The code above would lead to only employees with job title containing word 'Manager' being masked, for example:
BEFORE | AFTER | ||||||
---|---|---|---|---|---|---|---|
person_id | DM_ASSIGNMENT.emp_jobtitle | first_name | last_name | birth_date | first_name | last_name | birth_date |
432 | Manager | Mark | Robertson | NULL | Andrew | Jones | NULL |
1765 | Senior Manager | Steve | Jones | 23/06/1965 | Mary | Gray | 27/05/1965 |
2563 | Telesales Operator | Laura | Smith | 01/12/1982 | Laura | Smith | 01/12/1982 |
Ensuring a realistic distribution
You can control the distribution of data within a table/column using the Rule Options on the Where clause or Sampling options tab. This involves chaining several substitution rules together, each applying different datasets to a column with different sampling options.
Continuing the initial example, where the Names, First Names, Male + Female dataset was used to mask the first_name
column, you may get an uneven distribution of male and female names. If you wanted to ensure that half of the employees were masked with female names and half male you will need to make two rules:
- In the above substitution rule, change the dataset to be Names, First Names, Female.
- Create a new substitution rule on the
DM_EMPLOYEE
table:- set the dataset to be Names, First Names, Male
- on the Where clause or Sampling options tab, enable the Sampling Percentage option and set the percentage to 50.
- Make this new rule run after the first rule.
The first rule masks all DM_EMPLOYEE
rows with female names, to ensure that all rows have been masked while the second rule selects a random sample (50% of the rows) and masks those with male names.
We can set limits on the number of rows being sampled using the Row limit option.
The Distinct set option will ensure that each row is counted only once in the sampling, this is useful when join conditions would result in rows in the target table being identified multiple times.