When someone queries your Data Warehouse, what will the time context of the result be? This discussion applies to designs where you have a collection of detail history of data arriving in your Warehouse. What Date / Time do you use to version the data?
Back in the day (well, the day still exists), we loaded data into Warehouses and Data Marts on a daily basis.
Usually, we got a snapshot from Production, taken roughly at midnight.
The Warehouse was processed in a single monolithic update.
Everything had a daily context and users were happy to get their reports.
Integrating data from diverse sources was quite straight forward (at least from a time perspective).
Enter the world of Change Data Capture and near real time Warehouses. As Change Data Capture has become more available, we have been able to capture changes to our sources in minute detail. The trend is towards loading data sources individually and out of sequence. A delay in a single feed should not stop the rest from completing. We now load data into Versioned Tables, all with a convenient time context (start and end date). Further processing, such as loading Dimensions, often requires that we pick a suitable time slice (say midnight, on the hour, on the minute) from the versioned load tables. We can say that when Customer A placed his order at 8:30 am, the Order was attributable to Salesman A’s target. When the same Customer placed a new order 3 hours later, he had been allocated to Salesman B and we can attribute the sale correctly. Or can we?
How reliable is Time?
At first glance, modern server times seem to be highly accurate.
However, when you scratch the surface, cracks soon appear in that theory.
So many things can happen to time, often well out of our control.
First and most obvious is Daylight Saving. If you are running a 24 X 7 operation or web site, what happens to orders that are placed during the switch?
In the autumn, typically the clocks shift backwards by an hour. And voila, the sequence of orders placed is wrong:
- Order 1
- 1:15 am Order 2
- 1:30 am Order 3
- 1:45 am Order 4
- 1:05 am????
- Yes, your outfit probably doesn’t have much activity at 1:00 am on Daylight Savings Day. But that means that your Data Warehouse is reliant on the business behaving exactly as expected, every time? Yes, that gap in the wee hours of the morning is quite insignificant. Not worth spending much thought power. Now consider what else can happen to your time.
- What if someone decides (or needs) to change the time on a server?
- What if someone decides to change the time zone on a server?
- What if you are receiving data from servers which have disparate time zones?
- Are you in a position to manage these time changes, or can things happen without your knowledge?
Do I care?
A good question. As ever in this business; “It depends …”. If you are loading the versions of data into any format with a start and end date, you may well want to think it through. There are three common options for determining Start and End dates:
- Use your Local server to ascertain the time at which you loaded the Version into the Warehouse;
- Obtain a Time from your Source;
- Apply Business Rules on the incoming data (for example a Create Date provided by the application).
Local server time:
Using your local server is far and away the simplest, most robust method. You are only exposed to local server time changes and possibly daylight savings. You still need to deal with the possibility of dates coinciding or overlapping. Reporting or further processing of data can be run as at a given time (say midnight, 3:00 pm or any other time) without any trouble. Just start the job after the required time has passed on your server. On the down side, the Start and End dates will not reflect any business context. In the event of loading delays, this difference can be large.
Obtain Time from the Source
Very often, your incoming data (for example CDC) will give you an accurate time of change (in terms of the Source server). In this case, you have some business context and the process is quite repeatable. You are however at the mercy of changes that may occur on the source server. You may need to know (and check) what Time Zone the Source is running in, and make allowance, so that data from different servers can be related in the same time context. UTC Time is the common denominator. When on-processing data for a given time slice (say midnight), you will need wait logic to ensure that all of your sources are processed past the required point. Not all sources will be as tidy as CDC in giving you effective times.
You have complete control over the business logic. Even basic loads need to be carefully analysed to select a relevant time (if one is in fact available). It may require relating (joining) incoming feeds to ascertain the relevant time. This creates complexity and dependencies. Loading Sources asynchronously may be problematic. You may still need to take source server times and zones into account to ensure that you can relate tables correctly. When on-processing data for a given time slice (say midnight), you will need wait logic to ensure that all of your sources are processed past the required point.
What to do?
First of all, know your requirements. Do you have servers in different Time Zones? What will your organisation do with the history you are collecting in your versioned tables? Will the business need accurate times for historic reporting? Most reporting will probably require the current, most recent data available in the warehouse. All methods work in this case. Don’t over engineer!
A Hybrid Solution
Unless you can get away with simply using Local Server Time, I suggest that you consider a hybrid solution. As data arrives in the warehouse and is versioned, using the Local Server time. At the same time, collect the remote time and time zone offset information for each row. This way, you can have a reasonably standard method of loading your base warehouse. Once the data is loaded, you have all the necessary time information to identify which row version to use for an accurate time slice. Use Point-In-Time tables to index the Time Slices which you need. These are a good place to encapsulate your time rules.
Love to hear your thoughts on the matter
Keep on inventing – Brian
Brian blogs about how data warehouses are modelled.
You can read all of Brian’s blogs here.
* Change Data Capture
Change Data Capture is a set of tools, either native to the Relational Database you are using, or Third Party, which capture every Insert, Update, and Delete that happens to a selected table or tables.
This detail can then either be replayed to keep an up to date copy of the table and / or to keep an audit of all activity.
* Versioned Tables
Tables which retain a change history.
For example, a Customer Table will hold a row for each time the Customers’ details changed, as well as the time when it changed:
-  [Fred Jones] [93 Main Street] (1 Jan 2014 – 7 Mar 2014)
-  [Fred Jones] [45 Jane Street] (7 Mar 2014 – 9 Dec 2014)
-  [F.M Jones] [45 Jane Street] (9 Dec 2014 – )
This type of table is useful to either:
View the change history of the Customer,
Look at the most recent state of the Customer –
- 3  [F.M Jones] [45 Jane Street]
or to Query the state of the Customer at a point in time, say 31 Jul 2014 –
- 2  [Fred Jones] [45 Jane Street] (7 Mar 2014 – 9 Dec 2014)