Data Masker

Substitution rules are at the heart of data masking. They replace the data in columns with the values from the configured datasets.

In order to perform substitution, Data Masker requires that the table has a primary key.

Example

Based on the DMTest sample database, the DM_EMPLOYEE table has an ID column person_id and there are several sensitive columns first_name, last_name and birth_date columns. A basic substitution rule for this table could be:

first_name to the dataset Names, First Names, Male + Female
last_name to the dataset Names, Surnames, Random (Short List)
birth_date to the dataset Date Variance (Random)
- This will move the date by a random amount within the specified bounds

By default, the substitution rule will preserve null and empty values. For example:

	BEFORE			AFTER
person_id	first_name	last_name	birth_date	first_name	last_name	birth_date
1035563	Seetha	Ellis	20/02/1970	Andrew	Jones	NULL
1050641	Steve	Jones	23/06/1965	Mary	Gray	27/05/1965
2563	Laura	Smith	01/12/1982	Hannah	Banks	13/12/1965

Core concepts

Datasets

Each dataset provides a set of options that are specific to that dataset to allow for "fine-tuning" of the data used in substitutions. The datasets will be restricted to ones that are appropriate for the type of column (i.e. it is possible to put a numeric value into a text-based field but not all text values could fit into a numeric field).

Data Masker's predefined datasets support a number of common concepts but it is possible to define your own dataset if you need something more specific.

By default, the datasets are located in a directory named Datasets immediately below the Data Masker installation directory and can be changed on the Misc. Setup Tab.

Preserving null/empty values

Substitution rules provide the ability (on the Where clause or Sampling options tab of the substitution form) to ignore NULL and/or empty values. The default for substitution rules is to ignore NULL/empty values

Oracle treats empty strings as NULL the preserve empty values option is not provided for substitution rules in Data Masker for Oracle.

Adding a WHERE clause

If only certain rows in the table need masking specify a where on the Where clause or Sampling options tab of the substitution form).

Any valid WHERE clause can be added here, e.g. for the above example, if we wanted to exclude all employees in the table where the person_id is under 1000 we could write:

WHERE person_id >= 1000

You can choose to manually exclude NULL/Empty values within your custom where clause or select the "with inline Null skips" option.

	BEFORE			AFTER
person_id	first_name	last_name	birth_date	first_name	last_name	birth_date
432	Mark	Robertson	NULL	Mark	Robertson	NULL
1765	Steve	Jones	23/06/1965	Mary	Gray	27/05/1965
2563	Laura	Smith	01/12/1982	Hannah	Banks	13/12/1965

Advanced concepts

Synchronizing denormalized data

If you have a denormalized structure, e.g. the employee name exists in several tables, you would want to mask the data identically in all places you can convert a Substitution Rule into a Sync. Manager Rule:

Ensure that the substitution rule is only masking the columns that are denormalized, other columns in the table would have to be masked in a different Substitution Rule.
Ensure that the "No where clause" is set on the Where clause or Sampling options tab.
On the Change Managers tab select the "Convert to Sync. Manager Rule" option.
See Sync. Manager Rules for more details of how to connect to the other tables.

Parallelizing substitution on large tables

If a substitution rule is taking too long against a single table, you can parallelize the work using a Range Splitter Rule that divides the table into smaller chunks that can be masked in parallel. To convert to a Range Splitter Rule by selecting the "Convert to Sync. Manager Rule" option from the Change Managers tab.

Where clauses referencing other tables

The custom where clause allows you to specify a JOIN clause (the target table will be aliased as DM1 to help create the join condition).

In the example above, assume that you would not want to mask the internal employee details but would want to mask those of external contractors. For our sample database schema, this is stored in the DM_ASSIGNMENT table that can be joined using composite key of both person_id and assignment_id, you could write:

INNER JOIN DM_ASSIGNMENT A ON A.person_id = DM1.person_id AND A.assignment_id = DM1.assignment_id
WHERE A.emp_jobtitle LIKE '%Manager%'

The code above would lead to only employees with job title containing word 'Manager' being masked, for example:

		BEFORE			AFTER
person_id	DM_ASSIGNMENT.emp_jobtitle	first_name	last_name	birth_date	first_name	last_name	birth_date
432	Manager	Mark	Robertson	NULL	Andrew	Jones	NULL
1765	Senior Manager	Steve	Jones	23/06/1965	Mary	Gray	27/05/1965
2563	Telesales Operator	Laura	Smith	01/12/1982	Laura	Smith	01/12/1982

Ensuring a realistic distribution

You can control the distribution of data within a table/column using the Rule Options on the Where clause or Sampling options tab. This involves chaining several substitution rules together, each applying different datasets to a column with different sampling options.

Continuing the initial example, where the Names, First Names, Male + Female dataset was used to mask the first_name column, you may get an uneven distribution of male and female names. If you wanted to ensure that half of the employees were masked with female names and half male you will need to make two rules:

In the above substitution rule, change the dataset to be Names, First Names, Female.
Create a new substitution rule on the DM_EMPLOYEE table:
- set the dataset to be Names, First Names, Male
- on the Where clause or Sampling options tab, enable the Sampling Percentage option and set the percentage to 50.
- Make this new rule run after the first rule.

The first rule masks all DM_EMPLOYEE rows with female names, to ensure that all rows have been masked while the second rule selects a random sample (50% of the rows) and masks those with male names.

We can set limits on the number of rows being sampled using the Row limit option.

The Distinct set option will ensure that each row is counted only once in the sampling, this is useful when join conditions would result in rows in the target table being identified multiple times.