Subsetting
Published 16 October 2023
Subsetting is a capability to transform large databases into smaller ones (aka subset), part of Redgate Test Data Manager solution.
It's available in the GUI and via a cross-database and cross-OS command line interface (CLI) client.
What is subsetting?
Subsetting extracts a portion of referentially intact data from one database (source) into another (target).
One practical business use-case for subsetting is extracting data from a large production database to a smaller non-production one in a test environment for use by QA/developers.
The subsetter tool
The subsetter takes the connection strings of the source and target databases to compute the relational hierarchy of the source database and migrate the necessary subset of data on the target maintaining referential integrity.
The subset can be specified by providing a target size: the desired size in terms of disk space. The subsetter will select data from the source database in order to make a target database of approximately the requested size using statistical methods.
Alternatively the subset can be specified by providing a starting table. The subset will follow the referential tree from this database table and collecting all the data necessary from the related tables. This can be further fine-tuned by also providing a SQL filter clause, to control which specific rows are to be brought in from the starting table.
The starting table and clause allow fine control over what data is included in the subset, but require knowledge of the database structure. By contrast the target size feature allows easy creation of a subset with control over the impact on infrastructure resources, but with less control over the exact makeup of the subset.
One can subset multiple times and use a combination of the two modes to refine their subset, e.g. use starting table/filter clause first to bring in the required data, then use target size to bring the subset size down to a more operable value.
The current version of the subsetter
supports 4 database engines: MySQL, Oracle, PostgreSql and SQLServer.
There are some known limitations listed in our troubleshooting guide as the tool is unable to migrate certain data types like HierarchyId and Spatial. Columns with such types are identified and skipped. If they're also primary keys the entire table will be skipped. The subsetter is also not yet able to append rows to existing target tables without clearing them first.
Where to next?
- Getting started:
- Have a quick look at our glossary to gain familiarity with some of the involved concepts.
- Confirm you meet the subsetting requirements.
- Proceed to installing the Subsetter CLI.
- Be sure to take a peek at the current known limitations listed in our troubleshooting guide.
- Running a subset:
- Pick a database engine from our list of supported ones.
- Choose a source database to subset from and a target one to subset into. There are hints here on ways to create a target database.
- Follow the steps necessary to create a subsetter configuration file (YAML or JSON).
- Alternatively, you can just pass the required parameters to the CLI.
- Take subsetting for a spin (check the CLI reference for details).
- If something goes wrong:
- Take a look at the troubleshooting and known limitations section for known issues and workarounds.
- If the above is not enough, please do get in touch.