We were recently approached by a client with an interesting job – they wanted us to create a solution for Persistent Staging Area for them, and the requirements were quite broad. They had an MS SQL Server as their RDBMS and the loading tool was SSIS, just to stay in Microsoft field. Their described persisted staging table pattern very much reminded me of SCD Type 2, i.e. all the history for a record should stay in one table with the load timestamp and current flag and SSIS package execution details should be stored for maintenance. As per usual, the budget was tight, so was the timeframe.
After a bit of thinking about the requirements and realised that we had some of them implemented already. Our Data Vault engine ODE is capable of loading data into SCD2 tables based on the source and target metadata stored in config, and all loads were logged with any errors captured. It was still a challenge to make ODE execute SSIS packages, however, and the CDC functionality was missing.
Long story short, we tackled those challenges and ODE version 5 has now been released with these features on board; it is available for free download. Once ODE is installed on PSA server, adding a new database to persistent staging takes about an hour (well, aside from the initial data load which could take a lot of time if it’s a big database).
On this project, ODE persistent stage layer was a single data source for the Data Vault we were developing. It was located on the dedicated server, all the data sources the organisation had were replicated there. While Data Vault was still incrementally developing, insights analysis and data profiling could have started off PSA. Also, it meant that we started collecting source database history in one centralised repository before the DW was built.
Quite quickly we caught up with the Data Vault structure. Once we touched the business area, all the related data source tables landed into the DV in the ensemble structure. On our previous projects, Raw Vault satellites had the same field set as the original source table. Having one more persisted layer in the architecture made our Data Vault more flexible. Attributes could be landed directly to the Business satellites grouped in a logical way, e.g. split Customer Name and Address details from the source supertable into different satellites. Those fields which didn’t make sense in the context of the business area we were working on could be omitted and added later when needed.
Once a set of attributes is added to the Vault, all record changes are captured in both PSA and DV satellites. It may look like overkill, and it is, but it’s not actually too dramatic. Attributes are grouped into the Data Vault satellites by the change rate as well as other grouping criteria. So, for a slowly changing group of attributes, there is no excessive history collection happening. E.g. there is no extra history record created for the Customer Name satellite when a customer has changed their location, only the Customer Address record is created. Therefore, it takes less space on disk than a PSA table.
The obvious downside of having the PSA as a layer in the architecture is sequenced data load. DV load could not start until all the changes from all the data sources are collected by the PSA, otherwise the dataset won’t be complete. In our previous DV implementations without a PSA, business calculations could start as soon as all the raw data it depended on is loaded, in parallel with other raw loads. This made DV loads very efficient and minimised the overall DV load time. However, this is probably also a limitation of our implementation and it potentially could be resolved in future. For example, we had just one PSA and one DV daily load, while a set of small loads could help with parallelism.
An extra layer of raw data with no integration could introduce a bad practice of implementing “quick” reports off PSA, which contain their own integration and business rules. Luckily, we have realised this threat early and discouraged staff from doing this. However, we had had some interest from analysts, they have connected R code to PSA to get the data. The consequence of it could be a discrepancy between the analysts’ insights and the reports produced off DV.