Deterministic masking
Published 20 March 2024
Deterministic masking ensures that equal values in a database are masked to the same value. This is useful when one or more databases hold denormalized or duplicated data across several tables that needs to remain consistent after masking.
Example
Given a Customers table:
Id | Name |
---|---|
1 | Jane |
2 | John |
3 | Jane |
4 | David |
If the Name column is masked deterministically, each value would be replaced with a new value, with any values that were the same before masking being the same after:
Id | Name |
---|---|
1 | Steven |
2 | Carol |
3 | Steven |
4 | Neil |
There are two modes of operation for deterministic masking:
1. Within-run deterministic masking
Masking behaves as in the example above. If the same starting database were to be masked again, the resulting masked values would still be consistent but would be different from the first masking run.
2. Cross-run / Cross database deterministic masking
Masking behaves as in the example above. If the same starting database were to be masked again, the resulting masked values would still be consistent and would be the same as in the first masking run.
This requires a deterministic seed to be provided when masking. This seed is used to provide a known starting point to enable masking to be repeatable across masking runs.
The same seed can also be used when masking multiple databases, so that equal values are masked consistently across them.
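The difference between the two modes can be illustrated with a minimal Python sketch. This is not how Anonymize is implemented; it only assumes a keyed hash (HMAC-SHA256) that selects a replacement from a small hypothetical list. The names `GIVEN_NAMES` and `make_masker` are invented for illustration. Without a seed the key is random, so results are consistent only within one run; with a seed the key is derived from it, so results repeat across runs and databases.

```python
import hashlib
import hmac
import secrets

# Hypothetical replacement dataset; a real dataset would be much larger.
GIVEN_NAMES = ["Steven", "Carol", "Neil", "Maria", "Ahmed", "Lucy", "Tom", "Priya"]


def make_masker(seed=None):
    """Return a function that maps equal inputs to equal replacement names.

    seed=None:  a random key is generated, so output is consistent only
                within this run (mode 1).
    seed given: the key is derived from the seed, so output repeats across
                runs and databases (mode 2).
    """
    if seed is None:
        key = secrets.token_bytes(32)
    else:
        key = hashlib.sha256(seed.encode("utf-8")).digest()

    def mask(value):
        # Keyed hash of the input value selects an entry in the dataset.
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
        index = int.from_bytes(digest[:8], "big") % len(GIVEN_NAMES)
        return GIVEN_NAMES[index]

    return mask


names = ["Jane", "John", "Jane", "David"]

# Mode 1: both "Jane" rows get the same replacement within this run.
within_run = make_masker()
masked = [within_run(n) for n in names]
assert masked[0] == masked[2]

# Mode 2: two independent runs with the same seed agree on every value.
run_a = make_masker("my-secret-seed")
run_b = make_masker("my-secret-seed")
assert [run_a(n) for n in names] == [run_b(n) for n in names]
```

In this simplified sketch two different inputs can occasionally land on the same replacement, because the dataset is small and the index is taken modulo its length. The sketch makes no claim about how Anonymize handles that case.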
Enabling Deterministic Masking
By default, some datasets in Anonymize are masked deterministically (see Default classifications and datasets for more information).
To enable or disable deterministic masking for a specific column, set the deterministic property in the masking file:
{
  "tables": [
    {
      "schema": "Person",
      "name": "Address",
      "columns": [
        {
          "name": "FirstName",
          "dataset": "GivenNames",
          "deterministic": true,
          "maxLength": 50
        }
      ]
    }
  ]
}
In this example, the FirstName column will be masked deterministically using the GivenNames dataset.
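To review which columns carry an explicit deterministic setting, a masking file can be inspected with a short script. The sketch below assumes only the structure shown in the example above (tables, columns, and the deterministic property). The function name `deterministic_columns` is hypothetical. A column with no deterministic property is reported as using its dataset default, since some datasets are deterministic by default.

```python
import json


def deterministic_columns(masking_config):
    """Yield (schema, table, column, deterministic) for every listed column.

    The deterministic element is None when the property is absent.
    """
    for table in masking_config.get("tables", []):
        for column in table.get("columns", []):
            yield (
                table.get("schema"),
                table.get("name"),
                column.get("name"),
                column.get("deterministic"),
            )


# The masking file content from the example above, embedded for illustration.
example = json.loads("""
{
  "tables": [
    {
      "schema": "Person",
      "name": "Address",
      "columns": [
        {
          "name": "FirstName",
          "dataset": "GivenNames",
          "deterministic": true,
          "maxLength": 50
        }
      ]
    }
  ]
}
""")

for schema, table, column, flag in deterministic_columns(example):
    setting = "dataset default" if flag is None else flag
    print(f"{schema}.{table}.{column}: deterministic={setting}")
```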
Deterministic Seed
To control the output of deterministic masking across multiple runs of Anonymize, you can provide a seed value using the --deterministic-seed command-line option.
The seed is used to ensure that the same input values will always produce the same masked output values when the same seed is used.
Example usage:
rganonymize mask --deterministic-seed "my-secret-seed"
Seed Requirements
It is important to avoid using a weak deterministic seed. Weak seeds include short seeds and well-known or easily guessable strings.
The deterministic seed must meet the following requirements:
- It must be at least 4 characters long
- It cannot consist of a single repeated character (e.g., "111111")
- It cannot be an empty GUID (e.g., "00000000-0000-0000-0000-000000000000")
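The three requirements above can be checked before a masking run with a small helper. This is a sketch of the listed rules only, not the tool's own validation, which may differ in details such as whitespace handling. The names `seed_problems` and `EMPTY_GUID` are hypothetical.

```python
EMPTY_GUID = "00000000-0000-0000-0000-000000000000"


def seed_problems(seed):
    """Return a list of listed requirements that the seed fails to meet."""
    problems = []
    if len(seed) < 4:
        problems.append("seed must be at least 4 characters long")
    if seed and len(set(seed)) == 1:
        problems.append("seed cannot consist of a single repeated character")
    if seed == EMPTY_GUID:
        problems.append("seed cannot be an empty GUID")
    return problems


assert seed_problems("my-secret-seed") == []
assert seed_problems("111111") == ["seed cannot consist of a single repeated character"]
assert seed_problems(EMPTY_GUID) == ["seed cannot be an empty GUID"]
```

Passing these checks does not make a seed strong; a short dictionary word would still pass. One option for generating a long, unpredictable seed is Python's `secrets.token_urlsafe(32)`.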
Security Considerations
Deterministic masking with a stored seed produces data that is consistent across multiple runs, while remaining effectively anonymous to anyone who does not have access to the seed.
Control access to your seed using the same principles you'd apply to encryption keys or database passwords. Store seeds in a key vault or secrets management system, and restrict access to team members who need it for legitimate workflows. Users who have access to the seed may be able to determine some original data.
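One way to keep the seed out of scripts and source control is to have a secrets manager place it in an environment variable and read it at run time. The sketch below assumes a hypothetical variable name, `ANONYMIZE_DETERMINISTIC_SEED`, and a hypothetical helper, `build_mask_command`. It builds only the command shown in the earlier usage example; any other options the mask command needs are omitted.

```python
import os

# Hypothetical variable name; use whatever your secrets tooling provides.
SEED_ENV_VAR = "ANONYMIZE_DETERMINISTIC_SEED"


def build_mask_command(environ=os.environ):
    """Build the argument list for a seeded masking run.

    Raises RuntimeError if the seed variable is missing or empty, so a run
    never silently proceeds without the intended seed.
    """
    seed = environ.get(SEED_ENV_VAR)
    if not seed:
        raise RuntimeError(f"{SEED_ENV_VAR} is not set")
    return ["rganonymize", "mask", "--deterministic-seed", seed]


# Example with an explicit mapping instead of the real environment.
command = build_mask_command({SEED_ENV_VAR: "example-seed-value"})
assert command[:3] == ["rganonymize", "mask", "--deterministic-seed"]
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting problems if the seed contains special characters.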