Building Star Schemas from a Business Data Vault
There are a number of techniques, which can be employed for building Star Schemas off a Business Data Vault.
Each has its merits, depending on your requirements.
The “Gold Standard” for building Star Schemas is to be able to make them Virtual.
A Note about Conformed Dimensions
Conformed Dimensions are the bedrock of Star Schema design.
Without them, it is unsafe or impossible to drill across our models, from fact to fact.
Ralph Kimball lists failure to build Conformed Dimensions as the number one reason for failure of Warehouse Projects.
However, the Kimball definition of what a Conformed Dimension is often lost in the detail of building Star Schemas, linked together by Surrogate Keys.
Kimball defines a Conformed Dimension as: “Dimension tables conform when attributes in separate dimension tables have the same column names and domain contents”.
This is not the same as “Identical tables, using the same set of surrogate keys”, which is often seen as the definition of a Conformed Dimension.
The Kimball Definition of a Conformed Dimension is particularly relevant for Business Data Vault Implementations.
The Business Vault is used to build “Conformed” Data Sets, or “Ensembles”.
The Ensemble should contain fully processed Entities, ready for presentation to Users.
At this point, the data is centralised as a “Single Source of the Truth”, but it is not a Dimension.
From the Centralised Ensemble, we can build any number of Customised Dimensions, as long as the Column Names and the Domain Contents match.
Since each of the Dimensions must be built from the same source, this definition is easily met.
Options for Building Star Schemas
Four options can be identified for building Star Schemas from a Business Data Vault:
- Conventional: Use the Business Vault as data source for a Physical Star Schema Building Exercise.
- Simple Virtual: Use the Hub Keys to manage Views, which present the data is SCD1 (Slowly Changing Dimensions) Star Schema Format.
- Point In Time Tables: PIT Tables can be used to define an SCD2 star schema.
- SCD2 Virtual: Use a highly Denormalised Fact to present SCD2 Entities, in combination with Hub Based SCD1 Dimensions.
The examples below are based on the theoretical Business Vault Model placed at the end of this blog.
This method uses the Business Vault as a Data Source to a Star Schema Building process.
The process is simplified by the fact that the Business Vault holds data modelled in “Ensembles” or “Subject Areas”, with all Business Rules already applied.
All that remains is to apply “Null” logic to the Entities, and to structure the data in a Star Schema style.
Having a Business Vault as a source makes the process more flexible in that multiple “Conformed” Dimensions can be built as required, reducing the likelihood and impact of having to refactor highly used SCD2 Conformed Dimensions, together with all of the related Facts.
The fact that this method results in Physical Star Schemas means that:
- Performance is likely to be better / more consistent than a Virtual option;
- Agility is reduced as at times, parts or all of a Mart will need to be rebuilt to accommodate changes;
- Being able to build discrete Marts, which are at the same time Conformed reduces the impact of changes.
Note that the outcome of this method is always a Physical Star Schema, not a Virtual one.
2. Simple Virtual
The Business Vault does a good job of supporting SCD1 Dimensions.
In this case, we use the Driving Hub on an Ensemble as the starting point for a View or Function query.
We use the Hub Surrogate Key as the Dimension Key for joining Facts.
From that Hub, the query can traverse to the Satellites, which belong to the Hub, as well as crossing Links to other data sets.
The Fact is built, starting from it’s own Ensemble Hub.
In order to obtain the necessary Surrogate Keys for the Star Schema to reference it’s Dimensions, the Fact Ensemble must have Links to each of its Dimension Ensembles.
For this method to work, at a minimum, each Dimension must be represented as a Hub in the Business Vault.
For performance, it is a good idea to build a Link, which Links the Fact Hub to all of its Dimension Hubs.
By Building the Virtual Dimensions and Facts as Functions, with a Date Parameter, this method can produce “Sliding Window” Star Schemas.
A Sliding Window allows us to view the full, SCD1 styled Star Schema, at any point in time, for which the Business Vault has supporting history.
- Performance is very dependant on the Complexity / Data Volume of the underlying queries.
- Flexibility is maximised. Often, adding a new Entity to the Model is simply a View change.
Note that, within the bounds of performance requirements and capability, this method can be used to present a virtual Star Schema.
This method is very useful for presenting POC ideas.
3. SCD2 Virtual
By using the Start Date of Hub Rows, from the Fact Ensemble, we are able to join to Dimension Satellites as at the date when the Fact was Created.
By this method, we can create Dense SCD2 Dimensions, in addition to the above SCD1 Dimensions.
Dense Dimensions hold 1 row for each Fact.
- Performance is very dependant on the Complexity / Data Volume of the underlying queries
- Flexibility is maximised. Often, adding a new Entity to the Model is simply a View change
- Dense Dimension are not entirely conventional – not all Presentation Tools will work well with them
- Dense Dimensions do work well with many newer “Columnar” based tools – Qlik, Analysis Services Tabular.
Conventionally, the fact rows in an SCD2 Dimension are linked to the Dimension Row which is appropriate for the time when the fact was created.
With this method, the SCD2 values are held in a 1:1 ratio with the fact. The may be held as a part of the Fact, or as a Dense Dimension:
This method allows for the easy inclusion of and SCD1 Dimension, in addition to the SCD2 values.
4. Point In Time Tables
A PIT Table is used to represent an SCD2 Dimension.
The Table will be keyed on the Driving Hub for the Dimension, with a row per Hub key, per time interval required for the Model.
Each row will hold a list of Satellite keys, for all Satellites which feed the Dimension, as at the time represented by the PIT Key.
The Dimension is built from the PIT Table, with the Dimension Surrogate keys being the Surrogate key of the PIT rows.
Facts are built by joining to the PIT Table for each Dimension and obtaining the relevant PIT Surrogate Key.
- Is likely to be more performant than the preceding Virtual Examples.
- Is flexible, but less to than the above methods.
- Provides a true representation of a Conventional Star Schema.
- Fact Dimension Relationships can only be recorded on the time interval boundaries. The state of the Dimension between those boundaries cannot be seen.
- PIT Tables can be Large, exponentially so as the time interval reduces (say from Monthly to daily, to Hourly. Imagine a row for every minute).
- PIT Tables need to be managed with care.
- Changes to the Star Schema can easily result in a need to reprocess the PIT Table(s).
Kent Graziano has a write up on using PIT tables to simulate Star Schemas.
The above examples are based on the following Business Vault Data Set:
Keep on inventing – Brian
Brian blogs about how data warehouses are modelled
You can read Brian’s blog Does Your Data Warehouse Know What Time of Day it is? or all of Brian’s blogs here.
We run regular Data Vault course for business analysts, data architects, and business intelligence developers in Wellington and Auckland. Find out more here.
Keep upto date
We were recently approached by a client with an interesting job - they wanted us to create a solution for Persistent Staging Area for them, and the requirements were quite broad. They had an MS SQL Server as their RDBMS and the loading tool was SSIS, just to stay in...read more
A couple of years ago the OptimalBI team developed Metrics Vault using Optimal Data Engine (ODE) functionality. The purpose of Metrics Vault is to collect various statistics about data vaults configured in ODE database. This includes not only data vault tables...read more
Almost everything a data warehouse is asked to do involves dates: days, weeks, months, years, periods of interest, special days. These date details "are attached to virtually every fact table to allow [easy] navigation." Data Vault has a different structure from the...read more
Previously we outlined how to create the SSIS packages for ODE to load data into staging tables as a part of the table load. The steps were defined explicitly for use with BimlStudio which is a paid product from Varigence, but if you planning on doing things in a more...read more
This week we released version 5 of our Optimal Data Engine. Follow the link to download it from the GitHub. Optimal Data Engine (ODE) is an open source Data Warehouse automation tool. It is a config-driven engine that builds and manages your Data Vault. ODE works on...read more
One of the things I admire the Data Vault modelling approach for is the modular structure being represented by sets of hubs, links and satellites. From a developer’s perspective it provides enough flexibility and power to automate various tasks easily, such as...read more
A new version (version 4) of ODE is now available for download! ODE, Optimal Data Engine, is our open source product for building a Data Vault. Anyone can download it, install on their instance of SQL Server and develop a Data Vault of their own. New features include:...read more
During our Data Vault journey, we found that having naming and coding standards is important when working in a team. Having a consistent environment improves collaboration. With consistent table and column names, it's easy to pick up objects to create a select...read more
Optimal Data Engine Version 2 is now available for download from GitHub! Here's what it means for you. The biggest change with the most impact is a new flag for Data Vault retired objects. The general rule is, once something gets into the Data Vault, it never goes...read more
I started working at OptimalBI a few months ago. My area of expertise is Data Warehouse development, using Microsoft SQL Server. I was, therefore, a good candidate to test how difficult it is to start using ODE (Optimal Data Engine). ODE is an open source application...read more
Following on Nic's excellent blog on Migrating Configuration Data, I would like to elaborate a bit on the style, which he selected. From the options Nic outlined, he chose the route of grouping objects into a release, packaging up the Config and releasing it. This is...read more
We think Data Vault is pretty cool and continually recommend it as the best way to model your data warehouse. Here are three reasons why! 1. Model Quickly Data Vault allows you to very quickly transform your source data into structures that store history and better...read more
I recently attended a course run by Hans Hultgren on Data Vault Modelling. I have a small confession to make at this point; Sorry Hans, I've never read your book. The good news for me was that the course doesn't require you to have read the book first. It does assume...read more
Last month we had Hans Hultgren, Data Vault extraordinaire, teaching his Data Vault Modelling and Certification course right here at our very own OptimalBI office in Wellington. I had the privilege of doing this course in Sydney last year and walked away Data Vault...read more
Just Ship It Already! Steve Jobs said "Real Artists Ship" . Ma.tt has a great blog about shipping One point oh. And there is a loverly comment at the end of the blog that goes: "A great entrepreneur once told me that “an idea without execution is worthless.” We can...read more
At OptimalBI we are huge fans of modelling the data warehouse using the Data Vault methodology. That’s because Data Vault allows us to provide you with a data warehouse that you can continue to use long after we are gone and continue to add to yourselves.
We are often asked whether the “extra” Data Vault layers are necessary or just a pesky overhead in an end-to-end data warehouse architecture.
To that we say: not only are the Data Vault layers necessary, they are arguably the most important layers in your data warehouse, and argue that I shall!read more
We have recently been discussing various ways we can promote our configuration data for Optimal Data Engine (ODE) from one environment to the next. Our config data is the heart and soul of ODE, it is the foundation for the entire engine.
The config in ODE is the relational data model we have built that holds all of the configuration required to make ODE run. It includes definitions for all source and targets, as well as any mappings.read more
When we decided to start building ODE we knew a few things already. One of those things was that most of our customers already had data warehousing technology.
They had already invested in Microsoft, Oracle, IBM, SAS, Teradata, Informatica or any of the other raft of data warehouse repositories and ELT technologies that are abound in the market place.
We also knew that it would be a big decision on their part to throw out this technology and implement the technology we decided to pick to be able to use ODE and gain the AgileBI benefits that it provides.read more
About two years ago we started a journey into the world of AgileBI. It all started out of frustration (as most journeys do), frustration about the magic sauce.
The magic sauce was the reason why one data warehouse project would be a raving success and the next a ‘meh’. It was often the reason that after delivering a raving success and then returning 12 months later I would find the data warehouse had become a ‘meh’. By that I mean not updated, not well managed and not delivering to the stakeholders expectations.read more