Ralph elaborates on his first phrase: “If the warehouse is proven wrong even a couple of times, decision makers will find another source that they can trust, and the warehouse will be left without users or sponsors.” This makes sense. It is widely known that data warehouse projects often fail, and one of the reasons is users’ mistrust in them. The data warehousing team may be unaware of this failure for some time, since the data warehouse keeps running fine. But if users prefer their good old Excel files, direct extracts from the source, or reports built over a sandbox database, it means the EDW project has failed and the money is wasted. So accuracy is the key in the data warehouse world.
Automated testing can help find and fix problems before they are released to the production environment, so they never reach users. Data warehouse testing is a challenging task: data warehouses are usually large in volume and complex, so the main challenges are the complex logic of the tests and their execution time. Other data warehouse features add to the testing complexity as well: multiple inconsistent data sources, and data redundancy due to denormalisation and history tracking.
A data warehouse testing framework can include different types of tests to cover all aspects of quality assurance. Testing methodology is a big topic, so I’ll only try to give an idea of what it involves.
- The smallest scale is unit tests. Developers create tests of this type for each small change or feature they make. Unit tests may include checks for data type constraints, null values in not-null fields, record duplication, data truncation, correct data load from source to staging, correctness of the applied business rules, etc.
- The next scale is integration testing. Here we make sure that the whole data warehouse runs as planned. These tests may include ETL workflow checks, consistency of record counts throughout the ETL process, and exception handling checks.
- Due to data warehouse volume, performance tests are also important. It is good to test loads of large data volumes, find the peak load time, and test queries with multiple joins.
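To make the unit-test idea concrete, here is a minimal sketch of two typical data-quality checks, null values in a not-null field and duplicated business keys, run against a hypothetical staging table `stg_customer`. SQLite stands in for the real warehouse, and the table name, columns, and sample rows are all illustrative assumptions.

```python
# Sketch of data-warehouse unit tests; `stg_customer` and its contents
# are hypothetical, and SQLite stands in for the real staging database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE stg_customer (customer_id INTEGER, email TEXT)")
cur.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"),
     (2, "b@example.com"), (3, None)],  # one duplicate key, one NULL email
)

# Check 1: no NULLs in a field that business rules say must be populated.
null_emails = cur.execute(
    "SELECT COUNT(*) FROM stg_customer WHERE email IS NULL"
).fetchone()[0]

# Check 2: no duplicated business keys after the load.
dup_keys = cur.execute(
    "SELECT COUNT(*) FROM (SELECT customer_id FROM stg_customer "
    "GROUP BY customer_id HAVING COUNT(*) > 1)"
).fetchone()[0]

print(f"null emails: {null_emails}, duplicated keys: {dup_keys}")
# → null emails: 1, duplicated keys: 1
```

In a real project these queries would live in a test runner (pytest, dbt tests, or similar) and fail the build on any non-zero count, instead of just printing the result.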
Another source of inaccuracy is the source data itself. Inconsistent data coming from different sources, spelling mistakes, and duplicate records all lead to mistrust in the data warehouse, even if it is proven correct by tests. This can be improved with reconciliation reports or dashboards. With reconciliation reports, users and IT support can quickly identify and fix the incorrect inputs. Users could check these reports on a regular basis (e.g. daily, or every Monday). Another way is to add a bit of automation, so the report runs on a schedule and sends a notification to IT support when an issue is identified. Record count reconciliation between the source and the ETL load is already covered by the automated tests above; however, when users see these numbers on reconciliation reports, they are convinced that the data is correct.
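The record-count reconciliation described above can be sketched as a small comparison function. The function name, table names, and counts below are hypothetical; in a real setup the counts would come from SQL queries against the source system and the warehouse, and a scheduler would e-mail IT support on a mismatch.

```python
# Sketch of a record-count reconciliation step; `reconcile` and the
# table names/counts below are illustrative assumptions.
def reconcile(source_count: int, warehouse_count: int, table: str) -> str:
    """Compare row counts and produce a one-line reconciliation status."""
    if source_count == warehouse_count:
        return f"{table}: OK ({source_count} rows)"
    diff = source_count - warehouse_count
    return (f"{table}: MISMATCH (source={source_count}, "
            f"dwh={warehouse_count}, diff={diff})")

# These lines would form the body of the reconciliation report;
# any MISMATCH line would trigger a notification to IT support.
print(reconcile(1000, 1000, "sales"))   # → sales: OK (1000 rows)
print(reconcile(1000, 997, "orders"))   # → orders: MISMATCH (source=1000, dwh=997, diff=3)
```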
The data warehouse should be the single point of truth, which is impossible without 100% accuracy. Humans make errors, so quality assurance is the only way to achieve that level of accuracy.