We live in the age of big data, we produce and collect more data then ever. Surely, just hoarding it doesn’t make much sense, but a lot of valuable information could be extracted from it and used, for example, in the decision making process. Every organisation I worked with in the last few years had some data collected in one of the siloed NoSQL database, owned and used by only one team; but although managers talked a lot about it, integrating this data into the enterprise-wide Data Warehouse always slipped down in the list of priorities. In this blog I explore the technical part of getting the MongoDB data into the typical Data Warehouse I usually work with. Check here for why MongoDB is the NoSQL platform of our choice. The example data I have used is the plants dataset from data.govt.nz.
Establishing Data Connection
If your organisation already has a traditional Data Warehouse in place, it is probably running on SQL Server, Oracle or similar enterprise RDBMS, and for loading the data from other data sources you use ETL tools like SSIS (SQL Server Integration Services), Informatica or similar. The easiest would be just to have a MongoDB connector in there and have it as any other data source. Such an approach gives you all the transformation features of the tool.
As I work with Microsoft tools, I checked SSIS first. There is no default MongoDB connection available from Microsoft. However, there are plenty of plugins available to purchase, here are a few I’ve tried (there could be more):
- CData MongoDB SSIS component includes source, destination and lookup tasks. Source task supports SQL, however I got some limitations when I used GROUP BY clause. Also, I couldn’t make MongoDB lookup work as I expected during my trial. It costs $749 per year.
- COZYROC SSIS+ components suite – includes many other useful tasks apart from MongoDB ones, so you can probably gain more than expected from it. MongoDB connector is very basic, it supports MongoDB query language. The whole suite costs $559 per year.
- Zappy Sys SSIS Power Pack – this MongoDB source task supports SQL query, it also has a (quite handy) visual interface for choosing columns and it detects data types and even their length. Pricing options are $399 per year for MongoDB source task only or $699 for a MongoDB package which includes three tasks.
- Kingsway Soft SSIS Productivity Pack – also a big suite of different tasks. MongoDB source tasks supports MongoDB native query syntax and has visual column picker as the previous plugin has. The pack costs $995 per year, perpetual licence is $1995.
I have checked that Informatica also has a MongoDB connector, but the price is available by request. The point is this is an enterprise way of doing things: purchase, plug in and enjoy the simple interface. ETL tools are expensive anyway, so adding extra cost for a plugin won’t much change the total spending on tools by BI unit.
Next I have tried to implement the script task as described in this article from 2014. Note that syntax has changed in later versions, so check the documentation if you want to try it as well. The driver has been added to NuGet, and I couldn’t make it work in a script task; I had issues with adding other NuGet packages to VSTA projects in the past, so I’m sure it is possible to make it work eventually. Anyway, this seems a legit alternative method of saving cost, it just require some time to set up, also problem solving and programming skills from developers.
The last option is an open-source (i.e. free) or just cheaper ETL tools. I have checked Talend Open Studio for Big Data. It takes time to get used to the interface and find the way to use it, also I haven’t got to the load scheduling yet, but setting up the load from MongoDB was pretty easy.
The very last option is what looks like a clumsy mixed approach; when one tool, presumably free ETL tool or proprietary code, extracts portions of data from MongoDB into files, and your usual ETL tool loads them into DW. But it suddenly starts making sense if your DW solution is implemented on one of the cloud platforms, like Azure or AWS. You probably already use a lot of intermediate tools-as-a-services, write scripts anyway and use file storage here and there. I know you’ve got goosebumps now, it all sounded scary to me as well until I tried; it took me days to learn MongoDB query language or how to write and schedule AWS Lambda function on Node.JS that loads data from MongoDB into desired structure.
Relational databases are structured. You can guarantee that any record in one table has the same number of fields and data types are also predictable. Although it is possible to enforce the structure, in general MongoDB document, which is an equivalent of the record, could have different structure from another one in the same collection, which is an equivalent of table. We do not like unstructured data in Business Intelligence, although we have to deal with it all the time. Forget about mandatory fields, every column has to be null-able or have the default value added on the ETL stage.
MongoDB query language is optimised to find and return documents from one collection. Data warehouse data model is aligned with the natural business concepts and business events. It is unlikely that the data from one collection will fit one DW table, probably you either need to split the collection data to multiple DW tables or join the data from multiple collections to form a table record. Luckily, ETL tools cope with this job.
The name Big Data hints that the amount of data exceeds the amounts we get used to. For example, multiple device reading per minute could help with operational decisions. However, this frequency is probably too much for the scope of data warehouse, which is long-term information integrated from multiple sources. It probably doesn’t make sense to replicate such MongoDB data in relational database, but definitely would make sense to collect aggregated data, for example, hourly average of the device readings. MongoDB supports aggregations, but it’s now very effective performance wise. It probably makes sense to give this job to the ETL tool and spare MongoDB server resources.
The last, but probably the most important challenge is CDC, change data capture. You absolutely don’t want to load the whole collection every time to check if the DW data is still up to date by comparing every document. You need to load only those documents which were created or updated since the last load. MongoDB always stores the document create date, although it is hidden in the automatically generated _id field. Bad news is it’s almost unusable in our case for our purpose; none of the above plugins allow extracting the timestamp part of the _id into a separate field, converted to date, which could be compared to the last job run timestamp we captured in DW load metadata (but this could be achieved if you’re writing your own code). So your CDC solution will rely on some date/time document fields, which is reasonable in case of previous example of device reads. In other cases the transactional application that writes and updates the MongoDB documents must populate Created date and Modified date fields to enable CDC for data warehouse.
Straight to Report
What if we just want to plug in our reporting tool straight to MongoDB and avoid all of the above? Hmmm, et tu? Those tools don’t cope with gigantic amounts of data, unless probably the servers they run at are beefed up with resources, which is normally accompanied with the extra cost. Also all the above ETL challenges don’t go away, they just need to be addressed within the report. And remember, the whole point was to integrate the NoSQL data with data from other sources the organisation has? Once you’ve done the integration for one report, you will be asked to copy and paste it to another, very similar report, and maintain both. This fails immediately once you find that someone has copied one of these reports, had a bug in integration part and produces slightly different KPIs. That’s why we do all the integration in the data warehouse and plug reporting tool just to the one source, so you don’t need to worry where the data came from initially.
It would make sense to get the data directly from the MongoDB in case of operational reporting though. It looks like MongoDB Connector for BI works just fine.
Hopefully, my blog can help you to consider adding that one or a couple of siloed NoSQL data sources you have to the enterprise-wide environment, so anyone can benefit from it.