Building Star Schemas from a Business Data Vault

by | Sep 1, 2015

Building Star Schemas from a Business Data Vault

by Sep 1, 2015

There are a number of techniques, which can be employed for building Star Schemas off a Business Data Vault.
Each has its merits, depending on your requirements.
The “Gold Standard” for building Star Schemas is to be able to make them Virtual.
A Note about Conformed Dimensions
Conformed Dimensions are the bedrock of Star Schema design.
Without them, it is unsafe or impossible to drill across our models, from fact to fact.
Ralph Kimball lists failure to build Conformed Dimensions as the number one reason for failure of Warehouse Projects.
However, the Kimball definition of what a Conformed Dimension is often lost in the detail of building Star Schemas, linked together by Surrogate Keys.
Kimball defines a Conformed Dimension as: “Dimension tables conform when attributes in separate dimension tables have the same column names and domain contents”.
This is not the same as “Identical tables, using the same set of surrogate keys”, which is often seen as the definition of a Conformed Dimension.
The Kimball Definition of a Conformed Dimension is particularly relevant for Business Data Vault Implementations.
The Business Vault is used to build “Conformed” Data Sets, or “Ensembles”.
The Ensemble should contain fully processed Entities, ready for presentation to Users.
At this point, the data is centralised as a “Single Source of the Truth”, but it is not a Dimension.
From the Centralised Ensemble, we can build any number of Customised Dimensions, as long as the Column Names and the Domain Contents match.
Since each of the Dimensions must be built from the same source, this definition is easily met.

Options for Building Star Schemas

Four options can be identified for building Star Schemas from a Business Data Vault:

  1. Conventional: Use the Business Vault as data source for a Physical Star Schema Building Exercise.
  2. Simple Virtual: Use the Hub Keys to manage Views, which present the data is SCD1 (Slowly Changing Dimensions) Star Schema Format.
  3. Point In Time Tables: PIT Tables can be used to define an SCD2 star schema.
  4. SCD2 Virtual: Use a highly Denormalised Fact to present SCD2 Entities, in combination with Hub Based SCD1 Dimensions.

The examples below are based on the theoretical Business Vault Model placed at the end of this blog.

1. Conventional

This method uses the Business Vault as a Data Source to a Star Schema Building process.
The process is simplified by the fact that the Business Vault holds data modelled in “Ensembles” or “Subject Areas”, with all Business Rules already applied.
All that remains is to apply “Null” logic to the Entities, and to structure the data in a Star Schema style.
Having a Business Vault as a source makes the process more flexible in that multiple “Conformed” Dimensions can be built as required, reducing the likelihood and impact of having to refactor highly used SCD2 Conformed Dimensions, together with all of the related Facts.
The fact that this method results in Physical Star Schemas means that:

  1. Performance is likely to be better / more consistent than a Virtual option;
  2. Agility is reduced as at times, parts or all of a Mart will need to be rebuilt to accommodate changes;
  3. Being able to build discrete Marts, which are at the same time Conformed reduces the impact of changes.

Coventional Star
Note that the outcome of this method is always a Physical Star Schema, not a Virtual one.

2. Simple Virtual

The Business Vault does a good job of supporting SCD1 Dimensions.
In this case, we use the Driving Hub on an Ensemble as the starting point for a View or Function query.
We use the Hub Surrogate Key as the Dimension Key for joining Facts.
From that Hub, the query can traverse to the Satellites, which belong to the Hub, as well as crossing Links to other data sets.
The Fact is built, starting from it’s own Ensemble Hub.
In order to obtain the necessary Surrogate Keys for the Star Schema to reference it’s Dimensions, the Fact Ensemble must have Links to each of its Dimension Ensembles.
For this method to work, at a minimum, each Dimension must be represented as a Hub in the Business Vault.
For performance, it is a good idea to build a Link, which Links the Fact Hub to all of its Dimension Hubs.
By Building the Virtual Dimensions and Facts as Functions, with a Date Parameter, this method can produce “Sliding Window” Star Schemas.
A Sliding Window allows us to view the full, SCD1 styled Star Schema, at any point in time, for which the Business Vault has supporting history.
Being Virtual:

  1. Performance is very dependant on the Complexity / Data Volume of the underlying queries.
  2. Flexibility is maximised. Often, adding a new Entity to the Model is simply a View change.

SCD1 Hub Based Star
Note that, within the bounds of performance requirements and capability, this method can be used to present a virtual Star Schema.
This method is very useful for presenting POC ideas.

3. SCD2 Virtual

By using the Start Date of Hub Rows, from the Fact Ensemble, we are able to join to Dimension Satellites as at the date when the Fact was Created.
By this method, we can create Dense SCD2 Dimensions, in addition to the above SCD1 Dimensions.
Dense Dimensions hold 1 row for each Fact.

  1. Performance is very dependant on the Complexity / Data Volume of the underlying queries
  2. Flexibility is maximised. Often, adding a new Entity to the Model is simply a View change
  3. Dense Dimension are not entirely conventional – not all Presentation Tools will work well with them
  4. Dense Dimensions do work well with many newer “Columnar” based tools – Qlik, Analysis Services Tabular.

Conventionally, the fact rows in an SCD2 Dimension are linked to the Dimension Row which is appropriate for the time when the fact was created.
With this method, the SCD2 values are held in a 1:1 ratio with the fact. The may be held as a part of the Fact, or as a Dense Dimension:
Composite SCD2
This method allows for the easy inclusion of and SCD1 Dimension, in addition to the SCD2 values.

4. Point In Time Tables

A PIT Table is used to represent an SCD2 Dimension.
The Table will be keyed on the Driving Hub for the Dimension, with a row per Hub key, per time interval required for the Model.
Each row will hold a list of Satellite keys, for all Satellites which feed the Dimension, as at the time represented by the PIT Key.
The Dimension is built from the PIT Table, with the Dimension Surrogate keys being the Surrogate key of the PIT rows.
Facts are built by joining to the PIT Table for each Dimension and obtaining the relevant PIT Surrogate Key.
This method:

  1. Is likely to be more performant than the preceding Virtual Examples.
  2. Is flexible, but less to than the above methods.
  3. Provides a true representation of a Conventional Star Schema.
  4. Fact Dimension Relationships can only be recorded on the time interval boundaries. The state of the Dimension between those boundaries cannot be seen.
  5. PIT Tables can be Large, exponentially so as the time interval reduces (say from Monthly to daily, to Hourly. Imagine a row for every minute).
  6. PIT Tables need to be managed with care.
  7. Changes to the Star Schema can easily result in a need to reprocess the PIT Table(s).

Pit Table
Kent Graziano has a write up on using PIT tables to simulate Star Schemas.
The above examples are based on the following Business Vault Data Set:
Ensemble
Keep on inventing – Brian
Brian blogs about how data warehouses are modelled
You can read Brian’s blog Does Your Data Warehouse Know What Time of Day it is? or all of Brian’s blogs here.
We run regular Data Vault course for business analysts, data architects, and business intelligence developers in Wellington and Auckland. Find out more here.

Other steps in the ODE Journey

And sometimes a hop and a skip and a jump
Where to find some open data

Where to find some open data

The OptimalBI team loves open source and is also mad about data.  Alison wrote a blog on Open source datasets already, and because there's more options out there, I thought I would give you some more places to get open data from for your various data needs. I have...

read more
New Zealand Traffic Data

New Zealand Traffic Data

When you ventured out to the supermarket during lockdown, I bet you noticed waaaay less traffic on the roads.  I live in the Wellington CBD, and it was really weird walking around with no people, no traffic for weeks. It got me wondering about the data;  Thankfully...

read more
Learning Azure with Pluralsight

Learning Azure with Pluralsight

Cloud solutions have little to do with the solutions we have on-premises. Cloud solutions are built with ambitious and monstrously huge goals in mind, so, often it feels like no previous technology has been reused. In addition to some understandable concepts, like...

read more
How to Model a Reporting Layer

How to Model a Reporting Layer

On my first BI job, where I was an ETL developer, my team was using Microsoft SQL Server Reporting Service (SSRS) as a reporting tool. My job was to model the Data Marts and create the data flows into those tables. Our report analysts were good at SQL, I have even...

read more
Qlik Replicate – Task Metadata

Qlik Replicate – Task Metadata

Recently I was working for a client and had a specific issue which led me to probe around various aspects of Qlik Replicate tasks, attempting to understand how Qlik Replicate works. While I did not manage to achieve what I was specifically attempting I did have a look...

read more
ETL is Easy Again in Azure Data Factory

ETL is Easy Again in Azure Data Factory

Almost two years ago I was a part of the BI team in an organization that decided to move to the Cloud. As a first step they followed the lift and shift approach, so they moved their Data Warehouse and all the existing processes into the Azure environment almost...

read more
SAS Functions Book

SAS Functions Book

Over the years, we've been building up our comic-style SAS function one-page guides, drip-feeding them to you.  Well, guess what?!  We put them all together and made a book so they're all in the one place, to make referencing them easier for you. Thank you...

read more
Zen and the art of database model maintenance

Zen and the art of database model maintenance

So there you are having created a work of beauty in Sparx Enterprise Architect that models all of your databases, and then the fly lands in the ointment.  There's a new version of "stuff" coming down the pipeline and you need to update the model or things are...

read more
Checking the File Path from Database in PowerShell

Checking the File Path from Database in PowerShell

On one of my recent projects I was analysing my customer's data for migration. Their legacy application could handle documents. They were stored on the organisation's shared folder, and the file path was recorded in the database with the rest of data. As a part of the...

read more
The best lockdown baking recipes

The best lockdown baking recipes

The nation has been hit by a baking craze!  Lockdown gave some of us more time to do some of the simple things in life, like baking, and, if you were lucky enough to get flour and yeast you could create some delicious treats! I've always enjoyed baking, so loved...

read more
Snowflake and Qlik: End-to-End BI Solution

Snowflake and Qlik: End-to-End BI Solution

Recently I watched a webinar organised by Snowflake, it was called "Data Warehouse Automation, Ingestion and Industry Leading Analytics with Snowflake and Qlik". In the past I could never find time for webinars, and it's amazing how life can give you opportunities to...

read more
Being social while social distancing

Being social while social distancing

The world has been social distancing for some time now, and will be for a bit longer, depending where you live. To help put off or at least mitigate going stir crazy I thought I would share what I have been doing to keep up some sort of social life.   Before the lock...

read more
Our How To blogs

Our How To blogs

We do cool sh!t with data and over the years we have shared a lot of blogs to help you do cool sh!t with data too.  If you want to jump right in and see if we have something that piques your interest here are all our how to blogs.  If you...

read more
Free ETL tools in 2020

Free ETL tools in 2020

Traditionally business intelligence is an enterprise solution, as only big businesses have multiple sources that require integration and analysis. It is assumed that the rest are small fishes that could be able to analyse their data with simpler accessible tools or...

read more
SAS Utilities: cleanwork

SAS Utilities: cleanwork

When a user starts and runs code in a SAS session, a number of temporary folders and files are created in the SASWork and SASUtil locations.  If that session completed and exits normally these folders and files are removed, however some SAS sessions exit abnormally,...

read more
1080 By The Numbers

1080 By The Numbers

If you have lived in New Zealand for long enough, I am sure you would have heard of the poison 1080, used by the Department of Conservation (DOC) as a form of pest control. 1080 can be used in bait stations or by airdrop, which is used for many of the remote and hard...

read more
Keeping your team connected when they’re working remotely

Keeping your team connected when they’re working remotely

Lockdown started on the March 26 here in New Zealand, and following the Prime Minister's announcement on April 20th, we're looking at just an extra 2 business days in Level 4, before we move to Level 3 after the ANZAC long weekend. Level 3 means that...

read more
Adobe equivalent Free Open Source Software (Pt2)

Adobe equivalent Free Open Source Software (Pt2)

In part two of my Adobe equivalent Free Open Source Software (FOSS) blog I thought I would look at the video editing pipeline and find some free alternatives for your video editing needs. You can read Part One which covers Photoshop, Illustrator and InDesign here....

read more
Difference between ETL and CDC

Difference between ETL and CDC

Reporting and analytics computations are usually very resource-consuming, therefore they are never executed on the same server where the crucial business application is running. Instead, data is copied into dedicated servers to be dissected and studied for insights....

read more
Keeping your brain busy while in lock down

Keeping your brain busy while in lock down

Many places all over the world are currently in self-isolation mode in response to the COVID-19 pandemic. In New Zealand, lock down started on Thursday March 26, which means that some people are able to work from home. At OptimalBI, we are working from home and on the...

read more
0 Comments

Trackbacks/Pingbacks

  1. Data WarehousesWeak Hubs why to model them as Hubs? | Data Warehouses - […] the article BUILDING STAR SCHEMAS FROM A BUSINESS DATA VAULT for details on virtualising data marts from the Business…
Submit a Comment

Your email address will not be published. Required fields are marked *