Cool, I found that date field I was looking for on line 8! But also on line 9…damn…what now? Source: author’s collection.
A natural way for data people to start our projects is to “just get in to the data and see what’s there.” Almost every week, in fact, I tell a customer “I can’t really answer that question until I get a chance to see the data.” Those are fair things for us to do and say when the data in question are already nicely structured in a set of linked tables, aka a relational database. But what if your starting point looks like the XML file above?
My brief
Recently a customer asked me to make sense of 13 million files like these, each representing one unit of published research across many academic journals. The data provider doesn’t ship a pre-canned relational database dump, with good reason: its customers will consume the data in different ways depending on their use case. My customer wanted to use these data to learn about the outputs of New Zealand’s science system.
I was given complete discretion on how to turn these millions of files in to useful information for decision-makers. So I opened a few of them in Notepad++ and browsed through. If you’ve ever been asked to make sense of data living inside XML files, you’ll know that some bits are easy, some are hard.* The units of published research had only one title (easy), but many authors (hard). To make things even more interesting, authors weren’t recorded individually against a publication, but instead within groups of authors sharing a set of affiliations to universities. When you think about the complexity of academic appointments this makes sense: Jane and Bob might work at University A, but if Jane also works at University B her affiliations are different than Bob’s. If they author an article together, which universities get how much credit for the publication?
What I did next saved me a lot of pain
My natural instinct was to open R, ingest some XML files and start trying to code my way to a solution. I thought I might be able to scan the XML files for just the fields I needed to answer the questions I was given, and dump those numbers to a table for later analysis. A voice in the back of my head said “that’s a long road to nowhere useful” and so I paused for just long enough for an alternative to appear.
I closed R, opened one of these XML file, and asked myself “who does what?” If you’ve been following our blogs on Business Event Analysis and Modelling (aka BEAM✲) you might recognise that “who does what?” is how we start conversations with stakeholders about their business. We start with this question because the easiest way to measure a business is to understand the processes that drive it. Data requirements gathered this way, from the business stakeholder’s perspective, are so much more useful than pages of “reporting requirements”.
In my case, the first answer that came to mind was “journal publishes article”. So I opened a BEAM✲ template, entered those headings, and filled out the columns using the XML file in front of me. Then I asked myself the rest of the “7Ws” and found a few more interesting details: publication date is crucial for understanding changes over time and my customer had asked for the abstract to be available for text mining.
“Good progress so far”, I muttered to myself. I desperately wanted to record a “who” detail next, but I remembered the complexity of assigning authors to articles discussed above. I realised that the authors aren’t really part of the “journal publishes article” event, but belong to another “who does what?”: “author writes article”. Of course my data didn’t show me when the actual work happened, but modelling authorship as a separate event allowed me to deal with the multiple-authors issue elegantly. In the end I decided to record publication date as a “when” detail, to make it clear we only observe authorship thanks to publication.
Now I was making progress! After defining a few more events with their respective details, as well as details about those details (FACTS and DIMS for you Kimball fans out there), I was really starting to understand the publication process as represented by these data. The first-best option is to speak to business stakeholders but I had to do this first part on my own, using only the XML.* I got really far by just using five of the 13 million XML files to fill out “data stories” across my different “who does what?” event tables. Through this I found plenty of edge cases, where the story didn’t quite fit the table, which forced me to re-think which details belonged to each event.
After I avoided some pain, the gain was easy!
“BEAM✲ to the rescue once again”, I thought to myself. My understanding of the data was already quite deep, without a single line of code. Then I decided to get really adventurous, and use the same completed BEAM✲ templates to cross-check my understanding of the data with the requirements of my customer.
To my great pleasure, they saw the value immediately. I was representing the various parts of the complex research process they want to understand, without jargon or pretense. And importantly the templates contained real data so we could have meaningful conversations about appropriate treatment of different fields.
The icing on the cake was that I handed these templates off to the customer’s data warehousing team, who were only too happy to get requirements that already look like a dimensional model, rather than pages and pages of vague words. Once that team builds the data and I build subsequent reports, we’ll all be on the same page about what they mean because the BEAM✲ templates pay off once more, as already-completed documentation!
All this, without me having to code my way through a jungle of XML files. BEAM✲ is versatile!
Until next time, keep asking better questions (like “who does what?”)
Shaun – @shaunmcgirr
* You might be wondering, “why didn’t he just use the XML schema to understand the data?” My answer is familiar to those of you who’ve faced this kind of problem: I wasn’t provided one initially. And when it did arrive, it was good documentation of what each XML field contained but not of the activities and entities described by the XML.