It is hard to overstate the importance of database testing: erroneous data costs a business a lot. Yet when you say "testing", developers' noses crinkle as they think it's hard and boring. I have worked on multiple data warehouse development projects, and I found that so many tasks could be automated that eventually you can (actually, you should) minimise testing to unit tests for specific business rules or functions. All the examples and solutions below are for MS SQL Server, as I mostly work with it; I'm sure other RDBMSes have similar functionality.
What to test?
I found this tutorial very handy: it structures database testing into categories (also pay attention to the myth-busting part at the bottom). Well-structured information is the first step to automation. That article is quite generic, so you will probably need to customise it for your database needs.
Let's explore naming convention testing. On one of the agile data warehouse projects we couldn't close a story until it was peer reviewed, and this included a check for naming convention compliance. This is something that is easily missed by simple eyeballing, while it can easily be automated. In MS SQL Server all the metadata is stored in system tables, so it can be queried. For example, the following script checks whether the tables in the raw schemas follow the standard described in this blog, i.e. "prefix_DatabaseName__SchemaName__TableName". At the end it shows a list of tables which don't match this standard.
CREATE TABLE #TableNamingConvention (Table_Schema VARCHAR(128), Name_Pattern VARCHAR(128));
INSERT INTO #TableNamingConvention
VALUES ('rawhub', 'h_%__%__%'),
       ('rawsat', 'h_%__%__%');

CREATE TABLE #results (Table_Schema VARCHAR(128), Table_Name VARCHAR(128));

DECLARE @Counter INT = 0;
DECLARE @Tablecount INT;
SELECT @Tablecount = COUNT(*) FROM #TableNamingConvention;

WHILE @Counter < @Tablecount
BEGIN
    ;WITH s1 AS (
        SELECT *,
               ROW_NUMBER() OVER (ORDER BY Table_Schema, Name_Pattern) AS RowCounter
        FROM #TableNamingConvention
    )
    INSERT INTO #results
    SELECT t.TABLE_SCHEMA, t.TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES t
    JOIN s1
      ON s1.RowCounter = @Counter + 1
     AND t.TABLE_SCHEMA = s1.Table_Schema
    WHERE t.TABLE_TYPE = 'BASE TABLE'
      AND t.TABLE_NAME NOT LIKE s1.Name_Pattern;

    SET @Counter = @Counter + 1;
END

SELECT * FROM #results;

DROP TABLE #TableNamingConvention;
DROP TABLE #results;
Well, in business-oriented databases there is no strong table name pattern. Probably the only pattern that comes to mind is checking that the last character of the table name is not "s", in case your naming convention prefers singular over plural. I'm not saying it's impossible, but implementing a spell checker would be quite hard; it is easier to eyeball the spelling. However, the names of other database objects are usually more pattern-based. It is good practice for the names of indexes, triggers, foreign keys and default constraints to include the table and column they apply to. For example, if the expected index name pattern is "FK_SchemaName_TableName_IndexedColumn", it is easy to check whether the name reflects reality. The following query returns those indexes where the indexed column doesn't match the one in the index name:
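The plural check mentioned above can be sketched as a simple catalog query. Note this is a rough heuristic, not part of the original scripts: words that legitimately end in "s" (e.g. "Status") will show up as false positives and need manual review.

```sql
-- A hedged sketch: flag base tables whose names end in 's',
-- which are likely plurals if the convention prefers singular names.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
  AND TABLE_NAME LIKE '%s';
```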
SELECT s.name AS SchemaName,
       t.name AS TableName,
       c.name AS ColumnName,
       i.name AS IndexName
FROM sys.indexes i
JOIN sys.index_columns ic
  ON i.index_id = ic.index_id
 AND i.object_id = ic.object_id
JOIN sys.columns c
  ON ic.object_id = c.object_id
 AND ic.column_id = c.column_id
JOIN sys.tables t
  ON i.object_id = t.object_id
JOIN sys.schemas s
  ON t.schema_id = s.schema_id
WHERE i.name LIKE 'FK_%'
  AND 'FK_' + s.name + '_' + t.name + '_' + c.name != i.name
It gets trickier in the case of a multi-part index, but that test can be coded as well. System tables are useful for other types of tests, too. For example, the following script returns a list of tables which don't have a clustered index of any type.
SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables t
JOIN sys.schemas s
  ON t.schema_id = s.schema_id
LEFT JOIN (
    SELECT object_id
    FROM sys.indexes
    WHERE type_desc LIKE 'CLUSTERED%'
) i ON t.object_id = i.object_id
WHERE i.object_id IS NULL
If you already run tests like these on your databases, good on you. In my experience such tests don't usually seem important to the team and are often skipped, while the rework when something has been missed is not always simple. A few times I was asked to rename existing database objects, and in some cases it was impossible because of too many existing dependencies. That is a perfect reason to automate these tests; a scheduled test results report can help catch the oddities early, before the dependencies are created.
Data integrity testing is never considered unimportant; nevertheless, it gets skipped sometimes as well. There are tests which can give an indication of whether the data is in good shape, and they are applicable to different types of data. For example, the record count of a table can provide a lot of insight. In data warehousing the current record count can be compared to the data source; in general, the count of individual records should make sense. It is probably worth a reminder that "SELECT COUNT(*) FROM Table" is not the best way to get the record count of a large table, due to the performance overhead; there are examples on the internet of how to get a record count estimate from system tables such as sys.partitions.
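The sys.partitions approach can be sketched as follows. This is one common variant, not a quote from any particular article: rows belonging to the heap (index_id = 0) or the clustered index (index_id = 1) approximate the table's row count without scanning the table itself.

```sql
-- A hedged sketch: estimated row counts per table from sys.partitions.
-- The counts are maintained by the engine and may lag slightly behind
-- the exact figure, which is usually fine for a sanity-check test.
SELECT s.name AS SchemaName,
       t.name AS TableName,
       SUM(p.rows) AS EstimatedRowCount
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.partitions p ON t.object_id = p.object_id
WHERE p.index_id IN (0, 1)   -- heap or clustered index only
GROUP BY s.name, t.name
ORDER BY EstimatedRowCount DESC;
```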
Another helpful test is the number of blank or null values in each column, a check for negative values in numeric fields, or for future dates. If most or all the values are null, it is worth investigating; the same goes for a person with a negative age. However, these tests can be too specific, and they require running queries against the table data, which can impact performance.
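Checks like these can often be combined into a single scan of the table. A minimal sketch, using a hypothetical dbo.Person table with Email, Age and BirthDate columns (the names are illustrative, not from any real schema):

```sql
-- A hedged sketch: count nulls, blanks and suspicious values per column
-- in one pass over a hypothetical dbo.Person table.
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN Email IS NULL OR Email = '' THEN 1 ELSE 0 END) AS BlankEmails,
       SUM(CASE WHEN Age < 0 THEN 1 ELSE 0 END) AS NegativeAges,
       SUM(CASE WHEN BirthDate > GETDATE() THEN 1 ELSE 0 END) AS FutureBirthDates
FROM dbo.Person;
```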
How to test?
So, you have written queries that check the data in a repeatable manner; how do you make them repeat?
You can handcraft a solution. For example, the test code could be converted into a stored procedure (or several) and executed on a schedule by SQL Server Agent. You can check the execution results regularly, but the ideal solution is to be notified only when something goes wrong, so you don't need to check at all. SQL Server has Database Mail functionality: use the stored procedure msdb.dbo.sp_send_dbmail to send email notifications; you only need an SMTP server set up for that. This approach is time-consuming, requires good database administration skills, and means writing a lot of code.
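A minimal sketch of the notify-on-failure pattern, reusing the naming convention check from earlier. It assumes a Database Mail profile already exists; the profile name and recipient address below are placeholders, not real configuration:

```sql
-- A hedged sketch: email only when the naming-convention test fails.
-- 'DBA_Profile' and the recipient address are assumed placeholders.
DECLARE @offenders INT;

SELECT @offenders = COUNT(*)
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
  AND TABLE_SCHEMA = 'rawhub'
  AND TABLE_NAME NOT LIKE 'h_%__%__%';

IF @offenders > 0
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DBA_Profile',
        @recipients   = 'dba-team@example.com',
        @subject      = 'Naming convention test failed',
        @body         = 'Some tables in the rawhub schema do not match the expected name pattern.';
```

Wrap this in a stored procedure and schedule it with SQL Server Agent, and you have the handcrafted version of the pipeline.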
A much better way to achieve the same and more is to install one of the unit testing frameworks, like DbFit or tSQLt. They are free and easy to install, and once you have done so you have a great framework for many types of database testing. Mostly you just need to understand how to create tests and write them in assertion form; this is much less work than in the case above, with the same functionality provided. After that, adding a new test becomes a very simple job, which encourages the good practice of data testing, as your team won't have an excuse to skip testing anymore.
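To give a flavour of the assertion form, here is a sketch of the clustered index check from earlier rewritten as a tSQLt test. It assumes tSQLt is installed and a test class has been created with tSQLt.NewTestClass; the class name 'StructureTests' is illustrative:

```sql
-- A hedged sketch of a tSQLt test: fail if any user table lacks
-- a clustered index. 'StructureTests' is an assumed test class name.
CREATE PROCEDURE StructureTests.[test all tables have a clustered index]
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM sys.tables t
        LEFT JOIN sys.indexes i
          ON t.object_id = i.object_id
         AND i.type_desc LIKE 'CLUSTERED%'
        WHERE i.object_id IS NULL
          AND t.schema_id <> SCHEMA_ID('tSQLt')  -- skip the framework's own tables
    )
        EXEC tSQLt.Fail 'Found tables without a clustered index';
END;
GO

-- Run it with: EXEC tSQLt.Run 'StructureTests';
```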
It makes sense to store data statistics somewhere, so you can compare today's counts to previous reads and make sure they are consistent. For example, I can bet the record count of each table in your database grows; in fact, you can write a test to notify you when the growth is negative, as it probably indicates a loss of data. We have a Metrics Vault as part of our open-source Data Vault automation engine, ODE; it stores every table's record count and each column's statistics, i.e. data about data, as part of the data warehouse. The best part of collecting this data is that you can visualise it; people will love to see a graph of the data size growing over time.
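A minimal sketch of the idea (this is not the ODE Metrics Vault itself, just an illustration with an assumed table name): snapshot estimated row counts daily, then compare the two latest snapshots per table and flag negative growth.

```sql
-- A hedged sketch: a hypothetical history table of daily row counts
-- plus a negative-growth check. Table and column names are assumed.
CREATE TABLE dbo.TableRowCountHistory (
    CapturedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    SchemaName SYSNAME   NOT NULL,
    TableName  SYSNAME   NOT NULL,
    [RowCount] BIGINT    NOT NULL
);

-- Snapshot estimated counts from sys.partitions (schedule this daily).
INSERT INTO dbo.TableRowCountHistory (SchemaName, TableName, [RowCount])
SELECT s.name, t.name, SUM(p.rows)
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.partitions p ON t.object_id = p.object_id
WHERE p.index_id IN (0, 1)
GROUP BY s.name, t.name;

-- Flag tables whose latest count is lower than the previous one.
;WITH h AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SchemaName, TableName
                              ORDER BY CapturedAt DESC) AS rn
    FROM dbo.TableRowCountHistory
)
SELECT cur.SchemaName,
       cur.TableName,
       prev.[RowCount] AS PreviousCount,
       cur.[RowCount]  AS CurrentCount
FROM h cur
JOIN h prev
  ON cur.SchemaName = prev.SchemaName
 AND cur.TableName  = prev.TableName
 AND cur.rn = 1 AND prev.rn = 2
WHERE cur.[RowCount] < prev.[RowCount];
```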