There must be a better way to do data science than dumping the entire database every time

by | Oct 19, 2016

Source: John Liu

Let’s not fill them up unless we absolutely have to. 

When I arrived at OptimalBI 2.5 years ago I was a stereotypical standalone researcher, meaning:

  • Don’t trust any data you didn’t create yourself
  • Teams slow you down so best to go it alone
  • Everything must be perfect to be useful

Being effective in real life means getting over all of these.
On the cusp of departure for a new job in London, I can contemplate my progress, and it has been good. Most of the credit goes to my fantastic colleagues who, through their expertise, approach, and patience, have taught me at least as much as my many years of study. Just read this blog or attend one of our courses and you’ll get the picture.
So what did they teach me? A book I read recently has helped clarify the key lessons of my time here.

1) How to trust data you didn’t make

My first data jobs were on the census, the greatest data-maker in the land. In academic research you also mostly make your own data. These experiences provided great intuition on the data generating process, but seeing all the ways it goes wrong made trusting others’ data almost impossible.
I took the non-trusting approach in my first client project for OptimalBI, dumping entire operational database tables into SAS and re-engineering all the business logic so I could trust what I fed into my model. I was lucky in that case: it worked, but it was high risk. And I had company, as the other analysts on the client team did the same.
I’m not a fan of “others do it too” explanations but this one illustrates an underlying truth about data work: building trust is very difficult. An obvious implication is that teams of data people may never leverage each other’s work and reach a multiplier effect. Instead they are incentivised to build data fiefdoms, duplicating effort and fighting against data integration.
I always suspected these were bad outcomes, but reading Hans Hultgren’s book Modeling the Agile Data Warehouse with Data Vault this week crystallised the reasons why. He defines the core jobs of a data warehouse as integration, history, and auditability and presents a design pattern (data vault) for achieving this at enterprise scale.
These three qualities of a fit-for-purpose data warehouse together solve the trust problem that data scientists face all the time, and react to poorly, as I described. Integrating data from disparate sources the data vault way, around natural business keys, gives me confidence I am neither missing data nor looking at duplicates, saving me my own reflexive checks. Providing history by design helps data scientists avoid re-engineering the complex business rules that often determine important concepts like “customer status”, and be confident about the currency of the data they see. And auditability simply shows where data comes from, something I’ve never had the pleasure of getting used to, as lineage is usually poorly communicated between data people.
Lesson learned: There are many problems a data scientist can solve (poorly) but should not. Other experts armed with the right tools can do a much better job, leaving you to do yours. The way these other experts do that job is important though: done obscurely without the ability to “prove it”, trust is lost. The data vault pattern untangles the “spaghetti mess” of logic that typifies most data warehouses I’ve seen, making trust possible.
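To make the history and auditability points a little more concrete, here is a minimal Python sketch of an “as at” lookup. It is an illustration only, not something taken from Hultgren’s book: the rows, the field names (load_date, record_source), the source system names and the status_as_at function are all hypothetical.

```python
from datetime import date

# Hypothetical status history for one customer. Rows are only ever added,
# and each row records when it was loaded and which system it came from.
STATUS_HISTORY = [
    {"customer": "C-001", "status": "prospect",
     "load_date": date(2016, 1, 5), "record_source": "crm"},
    {"customer": "C-001", "status": "active",
     "load_date": date(2016, 3, 1), "record_source": "billing"},
    {"customer": "C-001", "status": "lapsed",
     "load_date": date(2016, 9, 12), "record_source": "billing"},
]


def status_as_at(history, customer, as_at):
    """Return the most recent row loaded on or before as_at, or None."""
    rows = [
        row for row in history
        if row["customer"] == customer and row["load_date"] <= as_at
    ]
    return max(rows, key=lambda row: row["load_date"]) if rows else None


mid_year = status_as_at(STATUS_HISTORY, "C-001", date(2016, 6, 30))
print(mid_year["status"], "from", mid_year["record_source"])
```

Because history is kept rather than overwritten, the analyst can ask what the status was on any date without re-deriving the business rule, and the record source on each row says where that answer came from.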

2) Teams are the only way to go fast

Throughout my education I avoided team work. I suspect it only took one primary school project where one person failed to pull his weight to sour me forever on trusting others. So I typically went lone wolf.
I may have been right about team work that was essentially atomic tasks grouped together for the sake of it. But I was so, so wrong about the other kind of team work, where multiple perspectives and complementary skills are necessary to achieve more than the sum of individual effort.
Looking back, this is really obvious. If team members have permission to specialise in different types of data work, they will not only do those jobs better (initial gain for the team) but also free up the time of other specialists to do the same (compounding gains for the team). Solving the trust problem allows this to actually happen.
There is a neat analogy in Hultgren’s book: each component of a data vault ensemble model does exactly one job well, so the others don’t need to at all. Hubs only store instances of natural business keys and never delete them, so we always know which customers/products have ever existed. Links only store relationships between hubs, and always as many-to-many so that when something that ‘shouldn’t happen’ does, the data keeps loading. All other work is left to satellites, which store all context about both hubs and links.
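The division of labour between hubs, links and satellites can be sketched roughly in Python. This is a toy in-memory illustration under my own assumptions, not an implementation from the book: the class names, methods and example keys (C-001, A-10, A-11) are hypothetical, and a real data vault would live in database tables with load dates and record sources on every row.

```python
from dataclasses import dataclass, field


@dataclass
class Hub:
    """Stores each natural business key once and never deletes it."""
    keys: set = field(default_factory=set)

    def load(self, business_key):
        # Insert-only: loading a known key again changes nothing.
        self.keys.add(business_key)


@dataclass
class Link:
    """Stores relationships between hub keys, always many-to-many."""
    rows: set = field(default_factory=set)

    def load(self, left_key, right_key):
        # No one-to-one rule is enforced, so an unexpected second
        # relationship for the same key still loads.
        self.rows.add((left_key, right_key))


@dataclass
class Satellite:
    """Stores descriptive context for a hub or link key, keeping history."""
    rows: list = field(default_factory=list)

    def load(self, parent_key, attributes):
        # Append a new version only when the context has changed.
        if self.latest(parent_key) != attributes:
            self.rows.append((parent_key, dict(attributes)))

    def latest(self, parent_key):
        versions = [attrs for key, attrs in self.rows if key == parent_key]
        return versions[-1] if versions else None


customers, accounts = Hub(), Hub()
customer_account = Link()
customer_details = Satellite()

customers.load("C-001")
customers.load("C-001")                 # repeat load, no duplicate key
accounts.load("A-10")
accounts.load("A-11")
customer_account.load("C-001", "A-10")
customer_account.load("C-001", "A-11")  # a second account still loads
customer_details.load("C-001", {"status": "active"})
customer_details.load("C-001", {"status": "active"})  # unchanged, no new row
customer_details.load("C-001", {"status": "lapsed"})
```

Each class does one job: the hub answers “has this customer ever existed?”, the link answers “what is related to what?”, and the satellite carries everything descriptive, including how it changed over time.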
Lesson learned: When a team member can trust that some part of the collective work is done well by others, she becomes better at what she is good at. This creates a virtuous cycle for the rest of the team. The same principle applies to the methodologies and technologies underlying the team: using one approach or tool for everything most likely gives the worst of all worlds. Have a diverse toolbox instead!

3) Almost everything that is useful is imperfect

Someone once told me that every complete dissertation is better than every incomplete dissertation. Now that I have completed mine I can finally appreciate how profound that is, but for too long I used the pursuit of (hypothetical) perfection as an excuse to avoid the necessary, messy work required to do something useful.
When I got over that barrier, and through enough of the messy work, it all became easier.
The same problem can easily strike data teams. Often it is easier to promise our stakeholders the big perfect thing, which may never be delivered, over the small and useful thing that will steadily get better. This tension is natural for data people: we want the right answer, and for it to be impressive! Too-large and too-complex projects, however, run against the lessons above, and we must fight this instinct in order to be successful in teams.
A quote from Hultgren summarises this nicely:

“There is no single version of the truth; there are only facts and interpretations.”

Lesson learned: Instead of building something that cannot exist (perfect truth), make the facts trusted and the interpretation defensible, quickly and piece-by-piece. Good advice for data scientists and data teams everywhere.
Not bad for a technical book on data modelling for the agile data warehouse, right? The author Hans Hultgren returns to Wellington in early December (and to a town near you soon). Book your seat for his course via Genesee Academy.
Until next time, keep asking better questions.
Shaun – @shaunmcgirr
Shaun blogs about analytics, machine learning and how data solves problems in the real world. Want to read more?
Don’t forget, we can train your team in the art of agile business intelligence at any time!
