Building Star Schemas from a Business Data Vault
Each has its merits, depending on your requirements.
The “Gold Standard” for building Star Schemas is to be able to make them Virtual.
A Note about Conformed Dimensions
Conformed Dimensions are the bedrock of Star Schema design.
Without them, it is unsafe or impossible to drill across our models, from fact to fact.
Ralph Kimball lists failure to build Conformed Dimensions as the number one reason for failure of Warehouse Projects.
However, the Kimball definition of what a Conformed Dimension is often lost in the detail of building Star Schemas, linked together by Surrogate Keys.
Kimball defines a Conformed Dimension as: “Dimension tables conform when attributes in separate dimension tables have the same column names and domain contents”.
This is not the same as “Identical tables, using the same set of surrogate keys”, which is often seen as the definition of a Conformed Dimension.
The Kimball Definition of a Conformed Dimension is particularly relevant for Business Data Vault Implementations.
The Business Vault is used to build “Conformed” Data Sets, or “Ensembles”.
The Ensemble should contain fully processed Entities, ready for presentation to Users.
At this point, the data is centralised as a “Single Source of the Truth”, but it is not a Dimension.
From the Centralised Ensemble, we can build any number of Customised Dimensions, as long as the Column Names and the Domain Contents match.
Since each of the Dimensions must be built from the same source, this definition is easily met.
Options for Building Star Schemas
Four options can be identified for building Star Schemas from a Business Data Vault:
- Conventional: Use the Business Vault as data source for a Physical Star Schema Building Exercise.
- Simple Virtual: Use the Hub Keys to manage Views, which present the data is SCD1 (Slowly Changing Dimensions) Star Schema Format.
- Point In Time Tables: PIT Tables can be used to define an SCD2 star schema.
- SCD2 Virtual: Use a highly Denormalised Fact to present SCD2 Entities, in combination with Hub Based SCD1 Dimensions.
The examples below are based on the theoretical Business Vault Model placed at the end of this blog.
This method uses the Business Vault as a Data Source to a Star Schema Building process.
The process is simplified by the fact that the Business Vault holds data modelled in “Ensembles” or “Subject Areas”, with all Business Rules already applied.
All that remains is to apply “Null” logic to the Entities, and to structure the data in a Star Schema style.
Having a Business Vault as a source makes the process more flexible in that multiple “Conformed” Dimensions can be built as required, reducing the likelihood and impact of having to refactor highly used SCD2 Conformed Dimensions, together with all of the related Facts.
The fact that this method results in Physical Star Schemas means that:
- Performance is likely to be better / more consistent than a Virtual option;
- Agility is reduced as at times, parts or all of a Mart will need to be rebuilt to accommodate changes;
- Being able to build discrete Marts, which are at the same time Conformed reduces the impact of changes.
2. Simple Virtual
The Business Vault does a good job of supporting SCD1 Dimensions.
In this case, we use the Driving Hub on an Ensemble as the starting point for a View or Function query.
We use the Hub Surrogate Key as the Dimension Key for joining Facts.
From that Hub, the query can traverse to the Satellites, which belong to the Hub, as well as crossing Links to other data sets.
The Fact is built, starting from it’s own Ensemble Hub.
In order to obtain the necessary Surrogate Keys for the Star Schema to reference it’s Dimensions, the Fact Ensemble must have Links to each of its Dimension Ensembles.
For this method to work, at a minimum, each Dimension must be represented as a Hub in the Business Vault.
For performance, it is a good idea to build a Link, which Links the Fact Hub to all of its Dimension Hubs.
By Building the Virtual Dimensions and Facts as Functions, with a Date Parameter, this method can produce “Sliding Window” Star Schemas.
A Sliding Window allows us to view the full, SCD1 styled Star Schema, at any point in time, for which the Business Vault has supporting history.
- Performance is very dependant on the Complexity / Data Volume of the underlying queries.
- Flexibility is maximised. Often, adding a new Entity to the Model is simply a View change.
3. SCD2 Virtual
By using the Start Date of Hub Rows, from the Fact Ensemble, we are able to join to Dimension Satellites as at the date when the Fact was Created.
By this method, we can create Dense SCD2 Dimensions, in addition to the above SCD1 Dimensions.
Dense Dimensions hold 1 row for each Fact.
- Performance is very dependant on the Complexity / Data Volume of the underlying queries
- Flexibility is maximised. Often, adding a new Entity to the Model is simply a View change
- Dense Dimension are not entirely conventional – not all Presentation Tools will work well with them
- Dense Dimensions do work well with many newer “Columnar” based tools – Qlik, Analysis Services Tabular.
Conventionally, the fact rows in an SCD2 Dimension are linked to the Dimension Row which is appropriate for the time when the fact was created.
With this method, the SCD2 values are held in a 1:1 ratio with the fact. The may be held as a part of the Fact, or as a Dense Dimension:
This method allows for the easy inclusion of and SCD1 Dimension, in addition to the SCD2 values.
4. Point In Time Tables
A PIT Table is used to represent an SCD2 Dimension.
The Table will be keyed on the Driving Hub for the Dimension, with a row per Hub key, per time interval required for the Model.
Each row will hold a list of Satellite keys, for all Satellites which feed the Dimension, as at the time represented by the PIT Key.
The Dimension is built from the PIT Table, with the Dimension Surrogate keys being the Surrogate key of the PIT rows.
Facts are built by joining to the PIT Table for each Dimension and obtaining the relevant PIT Surrogate Key.
- Is likely to be more performant than the preceding Virtual Examples.
- Is flexible, but less to than the above methods.
- Provides a true representation of a Conventional Star Schema.
- Fact Dimension Relationships can only be recorded on the time interval boundaries. The state of the Dimension between those boundaries cannot be seen.
- PIT Tables can be Large, exponentially so as the time interval reduces (say from Monthly to daily, to Hourly. Imagine a row for every minute).
- PIT Tables need to be managed with care.
- Changes to the Star Schema can easily result in a need to reprocess the PIT Table(s).
Kent Graziano has a write up on using PIT tables to simulate Star Schemas.
The above examples are based on the following Business Vault Data Set:
Keep on inventing – Brian
Brian blogs about how data warehouses are modelled
You can read Brian’s blog Does Your Data Warehouse Know What Time of Day it is? or all of Brian’s blogs here.
We run regular Data Vault course for business analysts, data architects, and business intelligence developers in Wellington and Auckland. Find out more here.
Keep upto date
Data suggests you should get the COVID-19 vaccine Data for this New Zealand focused blog was collected from various worldwide sources between the 18th and 31st August 2021. A list of sources can be found at the end. NZ is currently vaccinating solely with the...
photo by kolibri5 Some RDBMS support SQL statement MERGE, which also referred as upsert. This is a very common task in data warehousing: we compare the source dataset with what has been recorded in DW, insert new values and update those which have changed since the...
I have done a bit of web scraping lately both to automate some of the tasks I do regularly and some to get data for projects. I have learned one big thing from doing this it is ok to do something for a one off ... sometimes. There are times when working on scripts...
I have been using python for a while now and have used it for a number of different projects, but this one surprised me. Word clouds are something I have not used before. Then, I had the idea of using it for a project a while ago and struggled with finding a way to...
How to Use Data to Motivate.I have found myself not motivated to get up and out walking for a while now and know I need to do something about it. However, even with this realization that I have not done anything about it for the last few weeks. So thought I would try...
We use words each day of our lives it is only when we use these same words on a screen that we think about how we display them. The ways you can play with typography are endless. You can go from the simply use of contrasting fonts for the body text and headings to the...
Carrying on from part 1 where I covered how to remove objects from images in multiple ways I am going to cover some more things that can help you out in Photoshop first up is one of the big things in Photoshop, that being masking. The 2 main ways to mask that I use...
Google Analytics I have been using Google Analytics for a while now mostly to look at the traffic on our site. Why I have been doing this is for a few reasons one of which is just to keep the team up-to-date with our top blogs. Why I do this and why it is important...
Much like Illustrator Photoshop is a great tool for design. Photoshop is not just for editing photos, but for working with images to create works of art for whatever project you happen to be working on. There are many great tools in Photoshop to help you in your...
Sometimes the hardest things about design work is not the design itself, but the countless hours where you are trying to think of what to even make. Sometimes it’s easy you sit at your desk and boom you are going. Other days you are sitting there coming up with ideas...
The time has arrived where I've needed to learn more Python for a project I'm working on. Here, I share some resources, thoughts and tips in case you find yourself also needing to learn Python, too. Python is an interpreted object-oriented language. This means you...
Most Adobe programs have some things that are easy to do once you know how to do them, but if you don’t it’s not very intuitive so I thought I would do a list of things and tell or show you how to do them. Paste in place This is a very easy thing to do and can be very...
Git is an unbelievably useful tool for version control. It is used to keep a backup of your work that you can revert to if needed. Working with git you are able to make branches of your work. A branch is an independent line of development used when working on any new...
I've found that most of the tutorials for python data work are in videos. While this can be good, it's not as easy to skim through for that one thing you are stuck on. So I thought I would cover some basics: how to read a csv and some simple data visualization. Read...
I recently created an Apex application for a fun project, which gave me the chance to install Apex on windows EC2. Here are the steps I followed to install Apex, having decided to use ORDS (Web Listener) to run Apex. The version of Oracle Application Express I used is...
At the end of last year, I attended a virtual conference organised by Snowflake for the Asia-Pacific region. The beauty of online event is that you can watch sessions on demand, which is handy as many interesting sessions were scheduled to be streamed at the same...
We've been BEAM*ing for over 7 years now, and really believe it's a solid way to gather business requirements when building Agile data warehouses. Here's why. What is BEAM✲? BEAM✲ stands for Business Event Analysis & Modelling, and it’s a methodology for...
In this blog, I'll walk you through how to create a timer item in your Oracle Apex application. In this example, I am creating a timer clock that starts from 00:00 that counts till the page is active. You can make modifications to your timer item to match your...
Most of my career I have spent in big organisations developing and maintaining enterprise-wide solutions. Therefore, so far I’ve mostly worked with MS SQL Server and only touched OracleDB a bit. But for a new, smaller project, I've uses PostgreSQL for the first time...
3 reasons why I enjoy coding with Python, and what I use it for (beginner) I can’t be called a code monkey when compared to my colleagues, but recently I’ve found Python to be really helpful with some of my work tasks, so I signed up for a Udemy course to learn...