Strategies for data population
Published 14 August 2019
The topic of creating environments populated with real-world data is complex and goes beyond the scope of SQL Change Automation. Every organization is different, so the approach you take to establishing test environments that are accurate representations of your production environment must be tailored to your individual needs.
However, given that SQL Change Automation works best when each member of your team has their own copy of the database to work with (either on their own machine or on a remote server), it is worth describing some of the different methods of populating your environments that are available. SQL Change Automation includes a number of features that can help support your efforts to orchestrate the creation of your database environments.
The strategy for creating environments with real-world data you choose largely depends on the time and resources available to you, as some methods are easier to implement than others.
Strategy A: Create and populate your databases from scratch using project sources
This approach requires that all of the T-SQL logic needed to reproduce your database be stored within your SQL Change Automation project. This includes all of the code needed to create the database, logins, users, schema objects, programmable objects, as well as logic to seed your reference tables with static data.
If having transactional data within your test databases is also wanted (such as the population of sample customers, orders, etc) then additional work to incorporate this aspect of your "sandbox" environment deployment will need to be performed. These custom scripts can be included right alongside the scripts that SQL Change Automation generates for you since everything is deployed via a migration script. Orchestrating both schema and data deployment is as simple as editing a SQL file.
The advantage of this approach is that, after the initial ground work within your project is done, every developer will be able to create their entire database environment by simply applying the migrations on the SQL Change Automation project to the development database. The draw-back is that it can take some time to script out and maintain the sample data within the project scripts.
Seeding tables with INSERT statements
SQL Change Automation is able to script out INSERT statements for your tables, in addition to schema changes. This can be done as a one-off operation by switching on data tracking for a given table temporarily, importing the data into the project, and then switching off the table-level tracking. The last step is needed because it is not always desirable for future UPDATE statements to the data to be automatically scripted into the project. This will result in potentially thousands of insert statements being added to the project, which could hamper deployment performance, making this feature only suitable for smaller datasets. See static data.
Seeding tables with BULK INSERT
A common approach to populating transactional tables is with the SQL Server BULK INSERT statement. SQL Change Automation can assist with including the input file for the statement (for example, a CSV file containing sample customer data) within your migration scripts. For more information on how to include an external seed file as part of your deployment, see seed data.
Controlling data deployment with SQLCMD variables
If you would like some tables to be populated in some environments and not others, you can define SQLCMD variables within the project and consume them in your migrations (for example, $(MyVariable)) to control data deployment. For example, if you want your BULK INSERT to only run in test environments, you could do so by adding and using a variable called $(SeedWithCustomerData) within the project. This variable could be used to wrap the statement with guard clauses to ensure that the table is only populated when SeedWithCustomerData=True:
IF '$(SeedWithCustomerData)' = 'True' THEN BEGIN BULK INSERT [dbo].[Customers] FROM '$(DeployPath)CustomerData.csv' WITH (FIELDTERMINATOR = ',', FIRSTROW = 2) END
Setting the value of the variable
The default value of the $(SeedWithCustomerData) variable is defined within your project file, so everyone who deploys the project on their machine will use that value (unless they override this with a different value in the local column). However when you deploy outside Visual Studio, to control whether the table is populated, you can set the variable in your deployment tool: if deploying directly from PowerShell, simply set the $SeedWithCustomerData variable prior to executing the deployment.
Strategy B: Create your databases from scratch using project sources, then use a custom tool to populate the database
This approach is similar to Strategy A in that you would use the project to deploy your schema, code and static data. However, instead of scripting the transactional data, you might elect to use a third party tool to accomplish this, such as SQL Data Generator. Some of our customers have even written their own ETL process to import subsets of transactional data into their sandbox environments.
The advantage of this approach is that it simplifies your project by keeping the maintenance of your transactional test data separate to that of your change authoring process. Drawbacks include having to establish and maintain the data population process, as well as the complexity of having to run an additional step after performing the initial schema deployment.
Strategy C: Restore a database backup onto your test servers
Instead of trying to reproduce your database from source control, setting up your development environment by restoring from backup files is another option. The difference with this approach to strategy A and B is that instead of creating your database objects from your projects in version control, they will be created from your backup. The scripts that create these objects will be skipped and only the scripts that have yet to be applied to the database will be executed. The advantage of this approach is that your development environments will be created with realistic test data, however if the database contains sensitive data, this can complicate the provisioning of dedicated environments as a data masking/sanitization step would need to be performed before the backups could be used by developers.
The other drawback of this approach is that the cumulative size of all these dedicated database environments can be prohibitive for your local or network storage infrastructure.
A risk with this approach is that unintended changes can be imported into migrations (and potentially deployed to other environments) if the restored database does not match the schema contained within the project. To mitigate this, we recommend that databases backups taken from canonical sources only (for example, a production environment) and that any drift is removed from the database prior to authoring new changes (see revert pending changes to re-sync your schema).
Strategy D: Provision your database environment using SQL Clone
Instead of using backups, another option for provisioning dedicated development environments is to use SQL Clone, which lets you quickly create full copies of SQL Server databases and backups. When you subsequently deploy your project, SQL Change Automation will figure out which changes have yet to be applied, and execute only those migration scripts which are pending.
The advantage of this approach is that your environment will come pre-loaded with real world test cases. Additionally, having the ability to quickly create copies of the database simplifies the process of rolling-back unwanted changes, as you can always create a new copy and apply any pending migrations.