On a daily basis I have the benefit of working with organisations who are somewhere on a journey with data. Working with big or small, flush or practically bust, petabytes or megabytes, organisations see some innovative solutions deployed.
One thing I often see overlooked or glossed over is a change data capture (CDC) layer. In my opinion this is one of the most important areas of a solution, for what point is a data warehouse without history?
Let’s back that up, what is CDC anyway?
What is Change Data Capture (CDC)?
Well we have a source system which may have a cluster of tables containing valuable sticky data. We’ll call that the hive. We know that we want the honey, but first we’ve got to take a copy of the source. The reason to take a copy of the system is the database administrator (DBA). The DBA knows that issuing some random queries against the database will likely have some impact on whatever applications connect / mess with it. The DBA is also probably going to say a copy can only be made outside of core business hours, depending on the nature of the business it’s most likely to be whenever customers / bees sleep.
Great, so we have a copy of yesterday’s goodness sitting somewhere. It’s likely the copy isn’t a copy of the source anymore. Assuming the business is conducting some sort of activity (fingers crossed for them if they are not) records in the source will change, maybe there will be some new records, possibly some records will be deleted. Point is, if we take a copy of the source again tomorrow it (probably) wouldn’t reconcile with the earlier copy.
Now the objective of the data warehouse either today, tomorrow or in ten years time is to capture the data, “all” of it, all the time. The right CDC layer supports distilling sexy insights down the track. In capturing information we get to see how things were at a point in time. Most source systems are built by application developers who have no idea how to structure or store data for a data warehouse, subsequently they tend to overwrite data and not preserve historic states, this is good for ETL developers as the role probably wouldn’t exist otherwise.
In a CDC layer we need to deal with the evolution, or mutations that present between source system copies. Essentially create some sort of versioning system that shows what the record looked like yesterday and what it looks like today. I say yesterday vs today opposed to all changes as there is the potential for a record to change multiple times in between replications, in those cases changes would be missed. Velocity of data mutation can wildly vary between systems and in some cases you need to capture everything! Think of a bank, you wouldn’t want to make a deposit for $100 followed by a deposit for $1 with the most recent transaction being displayed.
Generally speaking captured mutations might be represented like this:
How about some approaches to capturing changes:
A pretty typical traditional CDC layer would make use of a UPSERT with a MERGE statement. This is well documented online and a good explanation can be found here. I’ve recommended this approach with multiple customers because it is cost effective (no expensive tools to acquire). There are limitations however – depending on when executed intraday changes could be missed, writing a statement like this for hundreds of tables is wholly cumbersome and it’s not adaptive to change. If the application developers added a column that would be missed if not added to the MERGE statement, or even worse if a column was deleted the ETL process would crap the bed. ETL falling over isn’t all bad I guess, provides something to look at in the morning and perhaps the application developer didn’t really mean to issue the DROP TABLE bank_balances sql?
While the above approach is very common the traditional approach can’t always apply in the modern world where sources are not always relational sql based entities. Larger organisations are frequently looking to deploy tools like Attunity to capture data mutations. Attunity Replicate can source and target a variety of traditional relational systems and also more modern systems such as Hadoop. One of the key differences with Attunity on relational sources is that it scours the logs rather than the actual tables so from a DBA performance perspective there’s much less application impedance. In short it can stream events as they occur to the data warehouse. This is a great out of-the-box enterprise grade solution, although it does carry a cost that may prevent smaller organisations from going down this track.
If you’ve read my previous few blogs you will notice that I follow a variety of open source technologies (Presto and Superset). Well there are technologies out there that can complete CDC from various sources in a log scouring manner. There’s some great solutions from Confluent which is built on the Kafka technology – in fact Confluent is owned by folks who open sourced Kafka. Also in cobbling together some sort of DIY solution there is the opportunity to save some $$$ and learn something new. I’d recommend starting with Kafka for event streaming goodness.
At the end of the day, whatever solutions works for you is great! The sooner you start collecting data mutations in some sort of CDC layer the more value you add to your data in the long term so send the bees to bring the honey to the hive!
Thomas – MacGyver of code.