Quite often people ask me, “What’s so good about Data Vault that I should immediately start using it instead of the Dimensional modelling I have been using for years?”.
This is an extremely tough question, so on this point I usually say “You shouldn’t, if you are satisfied with your current Dimensional Data Warehouse”. But I would really recommend anyone to consider a Data Vault modelling for implementing a modern agile Data Warehouse.
Let’s pretend we don’t know anything about data warehouses and read its definition in Wikipedia: “DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.”
And Data Mart: “A data mart is a structure / access pattern specific to data warehouse environments, used to retrieve client-facing data. The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.”
Notice that both of these wiki pages mention Dimensional modelling as a technique for implementation. What happens when an organisation chooses Dimensional modelling for both? Do they implement big enterprise-wide stars and small departmental starlets as subsets of those? That’s not very logical, confusing even.. and it is! Each organisation interprets this definition in a way that suits them the best, so you won’t see two similar Dimensional data warehouse implementations. Usually if you use the same methodology for two opposite tasks, you can’t get a perfect layered structure. You quickly get a mess or at least a mixture of two scopes. Depending on the requirements, department-wise data source could end up in its own Data Mart without landing in Data Warehouse first. It means that you can’t change your mind easily and choose integration with other data sources later; you’ll need to remodel the structure.
The Data Vault approach makes you more disciplined. It was designed to serve enterprise-wide scope at all times and it enforces the layered structure. Data Mart is a separate layer from the Data Vault. It doesn’t need to be persisted, so it could be virtualised for a flexibility purpose. In this case the DW layer always knows the whole truth and DM layer could be easily changed to represent different parts of it.
Let’s see it on the example. Emergency department of the hospital has its own application. There was no requirement to integrate this data in the Data Warehouse as many ED patients were not admitted to the hospital and if they are, it was considered as a completely different event. So that landing ED data into its own star schema Data Mart for simple reporting was enough for the time being. At some stage it was understood, that ED data could hugely improve the patient treatment if all the data is integrated, e.g. ED case after the hospital stay could be a recurrence of a past condition which could be easily missed. In case of Dimensional modelling this new requirement could cause a complex re-design. If you are adding ED patients into the conformed dimension, probably most of the fields will stay sparse. Data Vault can handle it with an extra link; and all the patients are already captured in one ensemble (you could put data in one ensemble, but integrate it later.).
As we have learned from Wikipedia, a Data Warehouse stores all the history of the data. There are a few options of storing history in dimensions: add a timeframe attribute when a record is current, store a previous value as an attribute or keep the history in a separate table. Although fact tables generally don’t require keeping history, sometimes it’s required. Then you have options here as well. Of course, you won’t be able to restore the history of the data stored in fact table if you decided to start capturing it from now on. To be on the safe side, many prefer to keep all the history before it even comes to the Data Warehouse, in a separate database called Persistent Stage. Usually it preserves the source table structure, but extra fields are added to capture the time when this state of the record was current. It is SCD2 pattern, but it’s not a dimension.
In Data Vault historical data is stored in a satellite also with the Slowly Changing Dimension type 2 pattern. Satellite stores history of entity, transaction, aggregation or raw source data. It is very consistent, you always know how to query a table to get current or historical records, but also it means that there is no need for a separate persistent staging. You don’t need to store data twice, first at the form of the historised source table, then in a business-oriented way; all the data and its history is stored in the Data Warehouse only. Here is another cool example: a raw data satellite and cleansed data satellite belong to the same ensemble, so you could easily compare two states of the data and evaluate your data cleansing algorithm.
Dimensional modelling was created many decades ago. This is its strong side, it has proven many times that it works, but it’s also a weak side as well: some things have changed a lot and star schema is not always catching up with the time. Requirements for Star schema DWs are captured once, changes are not appreciated. In Data Vault it benefits the most if you found core business concepts early. However, it is very sustainable in cases when business requirements have changed and when new data sources are added to the scope. For example:
A company has been taken over and its data needs to be added to the successor company’s Data Warehouse. Adding new portion of data to the existing Fact table could be a complex process: adding new fields could be required, but probably the most painful part is adding the dimensional values. In Data Vault all you need to do is to add another satellite to capture the data as quickly as possible; integration and reporting part could be addressed later.
A data scientist has done some complex analysis and found a relationship in the data that hasn’t been captured before. In Star-schema Data Warehouse if there is no immediate requirement for the fact table, it could be left out. In Data Vault a new link could easily be added to capture the relationship without any impact on the rest of the DW.
Adapting the Data Vault methodology doesn’t mean the end to the Star schema. It is still the best way to design Data Marts, most of the reporting and analytical tools expect Stars for the data input. Data Vault is modern technique for designing agile Enterprise Data Warehouse. The best recommendation I can give is to choose a tool for a particular data need wisely.