Data warehouses are like ogres… and onions. They have layers.
Any Data Warehouse architecture will have at least staging and business data layers, also there could be a raw data layer and a reporting layer. Layers, physical or virtual, should be isolated for operational independence and better performance. No matter what happens on the previous layer, the next layer should have easy access to the previous layer in the architecture. On the Data Warehousing project for one of our customers, we encountered an issue due to a lack of layer isolation.
What is the problem?
The cause of the issue lay in a virtual semantic layer. Our DW was a Data Vault implemented on the MS SQL Server. The semantic layer we had designed mostly targeted the main reporting and analysis tool the organisation had- Qlik Sense. Qlik works best with a Star schema data source. We had designed Ensemble-centric virtual Dimensions and Link-centric virtual Facts. We know that views don’t perform well on an MS SQL Server, but it wasn’t the issue with our architecture.
Qlik uses its own in-memory data storage; it loaded the data from the DW daily after all the ETL had finished. For other reporting tools that they had the load speed wasn’t critical. This architecture worked fine in Production, but we have acquired an issue in Development environment! During the day ETL developers were loading fresh data into the DW, and report developers were loading data from these tables via views into Qlik at the same time; two load processes were deadlocking each other. In Production loads were sequenced; in Development things could have been changed during the day.
What is a database snapshot?
A database snapshot is part of an SQL Server’s built-in functionality for the whole database view. It preserves a database’s state at the time when the snapshot is created. The snapshot database structure looks exactly the same as the original database structure. Users can then select the data from snapshot tables in the same way that they would from the normal database, including the use of Join operators, subqueries and TSQL operators like cursor. A database snapshot shows the current state of the database “minus” all the changes applied to the data or structure since the time of the snapshot creation. It means that the more changes have been applied since the snapshot has been taken, the less accurate becomes and the more performance will decrease. The only benefit is that the snapshot doesn’t care if the same record has been changed multiple times, it only remembers the original state. You can read the technical article describing it properly on Microsoft Docs.
More benefits than expected
To be honest, we never found the real cause of the deadlock. The SQL Server didn’t see it as a deadlock, so didn’t warn users that none of the loads were moving. However, as expected, having a semantic layer point to a snapshot rather than the actual database resolved the issue. Snapshots performed even better than we had planned.
A static dataset at the beginning of the day helped with testing. Our project was Agile, and we didn’t want to pipeline development and testing too much. Before snapshots were introduced, there could be discrepancies in data between the Data Vault and Qlik layers if data had been refreshed by ETL developers during the day. It confused testers a lot and caused a lot of unnecessary investigations. Since we introduced snapshots, development and testing have been easily isolated without any extra cost.
Database snapshots can also be used as a sort of backup for DW developers throughout the day. If a change goes horribly wrong, restoring the database from a snapshot is easier than restoring from a backup.
Although they perfectly suited the requirements in our case, database snapshots have many limitations which you should be aware of. The full list of them is a bit distressing… But the main things to remember are they don’t perform well in the case of huge changes to the database, and that database snapshots cannot be a replacement for a backup strategy.
The following code was executed by an SQL Agent job as a step after all the ETL steps. It drops and recreates a Data Vault snapshot in a single transaction.
[sql] IF EXISTS (SELECT name FROM master.sys.databases WHERE name = N’ODE_Vault_snapshot’)
DROP DATABASE [ODE_Vault_snapshot] CREATE DATABASE [ODE_Vault_snapshot] ON
NAME = ‘ODE_Vault’, FILENAME = N’E:\Microsoft SQL Server\MSSQL\DATA\ODE_Vault_snap.ss’
AS SNAPSHOT OF [ODE_Vault] GO
[/sql] All the semantic layer’s views were pointing to this database.
The word “snapshot” is used multiple times across the SQL Server functionality, and it’s a different thing every time!
- Database snapshot is what we have just covered.
- Snapshot isolation is a term from the transaction isolation levels. It means that a transaction operates with the snapshot of the database, while other transactions could change the data. It is similar to a database snapshot, but in this case, the snapshot of a database is only available for the transaction that created it.
- Snapshot replication is a term from the database replication. Rather than sending changes to the replication subscribers, the distributor sends the whole current state of the data.
- File-snapshot backup is a term from Azure database administration. This is the way to backup a database using the features of cloud storage.