
The key objective of Business Intelligence (BI) is to harness the collective wisdom of the knowledge workers in your organisation to improve decision making. Our primary tactic is to make information visible to those people so that they can have fact-based conversations.

In computing in general, there is a design trade-off between performing a calculation at run time and pre-computing “all” values so that the answer can be looked up at run time. The 1994 Pentium FDIV bug is probably the most famous example, while rainbow tables are a widespread current use. Industry-wide, lookup tables are becoming increasingly popular as storage costs drop and demand for low latency continues under ever-increasing load.

Decision tables are lookup tables adapted “to model complex rule sets and their corresponding actions”. Traditionally, experts like John Giles have recommended decision tables to simplify the work of writing and maintaining programs.

In Business Intelligence we often have complex logic for categorizing facts. We also have even more compelling reasons to use decision tables for these calculations.

  1. They make the logic behind the calculation transparent to the entire organisation.
  2. Because the people who depend on the information can understand the decision table, ownership of the transformation rules passes from the programmers to the business. The business can modify the rules as and when they please instead of waiting.
  3. And, because the logic is easier to see, the effort involved in initial creation, testing, and ongoing maintenance is reduced, regardless of who does the work.

I’ve just made some extraordinary claims. I hope that the following example will convince you.

My example is a process for publishing a blog, vaguely based on WordPress.

Iterating quickly through sprints, we might decide that our minimum viable model doesn’t include a dimension at all:

Naive Fact table

Blog#   Title   Status
1       Cats    Published
2       Dogs    Published
3       Birds   Pending
4       Fish    draft

This model has two problems:

  1. The easy-to-notice shortfall is that it lacks expressiveness. I know from experience that workflow statuses always include a hierarchy with a top level of “working” and “done”.
  2. The simple status field hides the reason behind the status. It is a dead end for analysis.

We can solve the expressiveness problem by adding a dimension:

Correct Fact table

Blog#   Title   Status
1       Cats    300
2       Dogs    300
3       Birds   200
4       Fish    100

Simple Dimension

Status Key   Name        Group
100          draft       Working
200          Pending     Working
300          Published   Done
400          Trash       Done

We can do better. There is still a lot of ETL code that calculates a series of intermediate steps and then makes a determination. At the heart of that code is a long CASE or IF statement that is messy, error prone, and relatively difficult to test. Usually, just re-ordering the tests in one of these statements changes the output in unexpected ways, because the branches have hidden dependencies on each other. And that code hides the intermediate steps and the logic from the knowledge workers for whom we’re doing all of this. The right thing to do is make that logic visible in a decision table:

Decision Dimension

Status Key   Name        Group     Location   # of times submitted   # of times published   # of times unpublished
100          draft       Working   blog       0                      0                      0
200          Pending     Working   blog       1                      0                      0
201          Pending     Working   blog       1                      1                      1
300          Published   Done      blog       1                      1                      0
301          Published   Done      blog       1                      2                      1
400          Trash       Done      trash      1                      1                      1
401          Trash       Done      trash      1                      0                      0
402          Trash       Done      trash      0                      0                      0

The first thing to notice is that this decision dimension has more columns and more rows than the previous one. We’ve added a column for every input to the algorithm, and a row for every set of inputs that we map to an output. The ETL code becomes much cleaner: that messy CASE or IF statement is replaced with a much simpler lookup.
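To make the contrast concrete, here is a sketch in Python (the function and parameter names are mine, not from any particular ETL tool): first the order-sensitive conditional chain that the decision dimension replaces, then the equivalent lookup built from the table above.

```python
# The messy, order-sensitive version. Re-ordering these tests changes
# the output, because the branches overlap: a trashed post that was
# once published satisfies more than one condition.
def status_by_conditionals(location, submitted, published, unpublished):
    if location == "trash":
        return "Trash"
    elif submitted > 0 and published > unpublished:
        return "Published"
    elif submitted > 0:
        return "Pending"
    else:
        return "draft"

# The decision-dimension version: the same logic as data. Each key is
# the tuple of input columns; each value is the status key in the
# decision dimension above.
DECISION = {
    ("blog", 0, 0, 0): 100,   # draft
    ("blog", 1, 0, 0): 200,   # Pending
    ("blog", 1, 1, 1): 201,   # Pending
    ("blog", 1, 1, 0): 300,   # Published
    ("blog", 1, 2, 1): 301,   # Published
    ("trash", 1, 1, 1): 400,  # Trash
    ("trash", 1, 0, 0): 401,  # Trash
    ("trash", 0, 0, 0): 402,  # Trash
}

def status_by_lookup(location, submitted, published, unpublished):
    # One lookup replaces the whole conditional chain, and the mapping
    # itself is visible to the business as the decision dimension.
    return DECISION[(location, submitted, published, unpublished)]
```

The conditional version must be read line by line to be understood; the lookup version can be printed as a table and handed to an analyst.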

I’ve intentionally left this table with some inefficient input columns to illustrate the cost of this pattern. Usually when implementing this decision-dimension pattern, you’ll need an early-arriving-fact handler. When a fact arrives with a set of inputs that is not in the lookup, you should add a new row to the decision dimension with “unknown” in the output columns and trigger a notification for a person to assign the correct values to those output columns. If you’ve also implemented a slowly-changing-dimension pattern, which I recommend, those corrections overwrite the “unknowns”; they do not generate a new row.
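A minimal sketch of that handler, assuming the dimension is held as a dict keyed on the input columns (the `alerts` list is a stand-in for whatever notification mechanism you actually use):

```python
# A decision dimension keyed on the input columns, with the output
# columns as the value. One seed row shown.
decision = {
    ("blog", 0, 0, 0): {"key": 100, "name": "draft", "group": "Working"},
}

def lookup(inputs, alerts):
    """Early-arriving-fact handler: an unknown input combination gets a
    placeholder row with "unknown" in the output columns, plus an alert
    for a person to assign the real values. The later correction
    overwrites this row in place; it does not generate a new row."""
    if inputs not in decision:
        next_key = max(row["key"] for row in decision.values()) + 1
        decision[inputs] = {"key": next_key,
                            "name": "unknown",
                            "group": "unknown"}
        alerts.append(f"Assign outputs for new input combination {inputs}")
    return decision[inputs]
```

The fact row still gets a valid status key immediately, so the load never stalls; only the human-readable output columns wait for a person.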

The exception to the need for an early-arriving-fact handler when using a decision dimension is when you can exhaustively list all of the possible inputs. The inputs for my example could have been three booleans (trash or blog, submitted > 0, and published > unpublished), in which case the decision dimension has 2^3 = 8 rows, which we can create at design time.
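Generating that exhaustive dimension at design time is a one-liner over the Cartesian product of the inputs. A sketch, with my own names for the three booleans:

```python
from itertools import product

# Three boolean inputs: in the trash?, ever submitted?, and
# net published (published > unpublished)?
rows = [
    {"key": 100 + i,
     "in_trash": in_trash,
     "submitted": submitted,
     "net_published": net_published}
    for i, (in_trash, submitted, net_published)
    in enumerate(product([False, True], repeat=3))
]
# 2^3 = 8 rows, one per possible input combination, so every fact that
# ever arrives already has a matching row: no handler required.
```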

Commonly implemented techniques for categorising facts and similar calculations are difficult to document, result in messy code, and hide useful information. The decision-dimension pattern helps gather requirements, displays the calculation to analysts in a usable format, and simplifies the code.

-Steven
