Redgate Test Data Manager

Deterministic masking

Deterministic masking ensures that equal values in a database are masked to the same value. This is useful when a database contains denormalized or duplicated data across several tables that must remain consistent after masking.

Example

Given a Customers table:

Id  Name
1   Jane
2   John
3   Jane
4   David


If the Name column is masked deterministically, each value is replaced with a new value, and values that were identical before masking remain identical afterwards:

Id  Name
1   Steven
2   Carol
3   Steven
4   Neil


There are two modes of operation for deterministic masking:

1. Within-run deterministic masking

Masking behaves as in the example above. If the same starting database were to be masked again, the resulting masked values would still be consistent but would be different from the first masking run.

2. Cross-run deterministic masking

Masking behaves as in the example above. If the same starting database were to be masked again, the resulting masked values would still be consistent and would be the same as in the first masking run.

This mode requires a deterministic seed to be provided when masking. The seed gives masking a known starting point, which makes the output repeatable across masking runs.
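The difference between the two modes can be illustrated with a minimal Python sketch. This is not how Anonymize is implemented. It assumes that a keyed hash (HMAC-SHA256) of each value selects a replacement from a small, made-up list, and the names mask_values and GIVEN_NAMES and the seed string are all hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical replacement dataset; a real dataset would be far larger.
GIVEN_NAMES = ["Steven", "Carol", "Neil", "Priya", "Marta", "Owen", "Lena", "Tariq"]


def mask_values(values, dataset, seed=None):
    """Replace each value with a dataset entry chosen by a keyed hash of the value."""
    # No seed supplied: use a random, single-use key (within-run behaviour).
    # Seed supplied: the key is repeatable (cross-run behaviour).
    key = seed.encode("utf-8") if seed is not None else secrets.token_bytes(32)
    masked = []
    for value in values:
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
        index = int.from_bytes(digest[:8], "big") % len(dataset)
        masked.append(dataset[index])
    return masked


names = ["Jane", "John", "Jane", "David"]

# Within-run: equal inputs match each other, but a new run uses a new key.
within_run = mask_values(names, GIVEN_NAMES)

# Cross-run: the same seed gives the same output on every run.
first_run = mask_values(names, GIVEN_NAMES, seed="example-seed-not-for-production")
second_run = mask_values(names, GIVEN_NAMES, seed="example-seed-not-for-production")
```

In both calls the two "Jane" rows receive the same replacement. Only the seeded calls are guaranteed to agree with each other from one run to the next. Because the example list is tiny, two different input names may land on the same replacement in this sketch.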

Enabling Deterministic Masking

By default, some datasets in Anonymize are masked deterministically (see Default classifications and datasets for more information).

To enable or disable deterministic masking for a specific column, you can set the deterministic property in the masking file:

{
  "tables": [
    {
      "schema": "Person",
      "name": "Address",
      "columns": [
        {
          "name": "FirstName",
          "dataset": "GivenNames",
          "deterministic": true,
          "maxLength": 50
        }
      ]
    }
  ]
}

In this example, the FirstName column will be masked deterministically using the GivenNames dataset.

Deterministic Seed

To control the output of deterministic masking across multiple runs of Anonymize, you can provide a seed value using the --deterministic-seed command-line option.

When the same seed is used, the same input values always produce the same masked output values.

Example usage:

rganonymize mask --deterministic-seed "my-secret-seed"

Seed Requirements

Avoid using a "weak" deterministic seed, such as a short seed or a well-known or easily guessable string. A strong seed reduces the risk of the masked data being reverse-engineered in the event of a data breach.

The deterministic seed must meet the following requirements:

  • It must be at least 4 characters long
  • It cannot consist of a single repeated character (e.g., "111111")
  • It cannot be an empty GUID (e.g., "00000000-0000-0000-0000-000000000000")
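The three requirements above can be checked before a masking run with a small pre-flight function. The following Python sketch is illustrative only: the function name seed_problems is hypothetical, and the checks Anonymize itself performs may differ in detail (for example, in how whitespace or casing is treated).

```python
EMPTY_GUID = "00000000-0000-0000-0000-000000000000"


def seed_problems(seed):
    """Return a list of the documented requirements that the seed fails."""
    problems = []
    if len(seed) < 4:
        problems.append("shorter than 4 characters")
    # A non-empty string with only one distinct character, e.g. "111111".
    if seed and len(set(seed)) == 1:
        problems.append("single repeated character")
    if seed == EMPTY_GUID:
        problems.append("empty GUID")
    return problems


for candidate in ["abc", "111111", EMPTY_GUID, "my-secret-seed"]:
    print(repr(candidate), seed_problems(candidate))
```

Passing these checks only shows that a seed meets the minimum requirements. It does not show that the seed is strong, so a long random value is still preferable.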

Security Considerations

By default, Anonymize uses random, single-use seeds and does not store them. Where necessary, you can provide your own seed. An attacker who gains access to both the seed and the masked data could use the seed to reverse-engineer the masked data back to its original values. It is therefore crucial to treat the seed as a sensitive secret, much like a password, and to store it securely, such as in a key vault or secret management system.

Avoid sharing the seed widely or including it in easily accessible locations like source code repositories or configuration files. Limit access to the seed to only those individuals who absolutely require it.

Remember, the security of your deterministic masking depends on your ability to keep the seed secret. Always follow best practices for managing sensitive information and consult with your organization's security team for guidance on securely storing and handling the deterministic seed. And if you don't need to retain the seed, don't. Use a random string and discard it after use.
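For the case where the seed does not need to be retained, a random string can be produced with a cryptographically secure generator. The sketch below uses the secrets module from the Python standard library. The function name new_single_use_seed and the default length are assumptions made for illustration, not part of Anonymize.

```python
import secrets


def new_single_use_seed(num_bytes=32):
    """Return a random URL-safe string built from num_bytes of secure randomness."""
    return secrets.token_urlsafe(num_bytes)


# Generate the seed just before the masking run and pass it to the run.
# Do not write it to disk, logs, or source control, so that it is gone
# once the process ends.
seed = new_single_use_seed()
```

A value produced this way is long and unpredictable, so it comfortably meets the seed requirements listed earlier.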


