Establishing Data Connection
If your organisation already has a traditional Data Warehouse in place, it is probably running on SQL Server, Oracle or a similar enterprise RDBMS, and data from other sources is loaded with ETL tools such as SSIS (SQL Server Integration Services), Informatica or similar. The easiest option would be to add a MongoDB connector there and treat it like any other data source. Such an approach gives you all the transformation features of the tool.
- CData MongoDB SSIS component – includes source, destination and lookup tasks. The source task supports SQL, although I hit some limitations when using the GROUP BY clause. I also couldn’t make the MongoDB lookup work as I expected during my trial. It costs $749 per year.
- COZYROC SSIS+ components suite – includes many other useful tasks apart from the MongoDB ones, so you may well get more out of it than expected. The MongoDB connector itself is fairly basic; it supports the MongoDB query language. The whole suite costs $559 per year.
- ZappySys SSIS Power Pack – the MongoDB source task supports SQL queries; it also has a (quite handy) visual interface for choosing columns, and it detects data types and even their lengths. Pricing options are $399 per year for the MongoDB source task only, or $699 for a MongoDB package which includes three tasks.
- KingswaySoft SSIS Productivity Pack – also a big suite of different tasks. The MongoDB source task supports native MongoDB query syntax and has a visual column picker like the previous plugin. The pack costs $995 per year; a perpetual licence is $1,995.
Informatica also has a MongoDB connector, but pricing is available only on request. The point is that this is the enterprise way of doing things: purchase, plug in and enjoy the simple interface. ETL tools are expensive anyway, so the extra cost of a plugin won’t change the BI unit’s total tool spend much.
Next I tried to implement a script task as described in an article from 2014. Note that the syntax has changed in later driver versions, so check the documentation if you want to try it as well. The driver is now distributed via NuGet, and I couldn’t get it working in a script task straight away; I have had similar issues adding NuGet packages to VSTA projects in the past, so I’m sure it can be made to work eventually. Either way, this seems a legitimate alternative for saving cost; it just requires some setup time, plus problem-solving and programming skills from developers.
The last option is open-source (i.e. free) or simply cheaper ETL tools. I have checked Talend Open Studio for Big Data. It takes time to get used to the interface and find your way around, and I haven’t got to load scheduling yet, but setting up the load from MongoDB was pretty easy.
Relational databases are structured: you can guarantee that every record in a table has the same fields, and data types are predictable. In MongoDB, although it is possible to enforce a schema, in general a document (the equivalent of a record) can have a different structure from another document in the same collection (the equivalent of a table). We do not like unstructured data in Business Intelligence, although we have to deal with it all the time. Forget about mandatory fields: every column has to be nullable or have a default value applied at the ETL stage.
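As a minimal sketch (with hypothetical column names and defaults), the ETL step that maps variable-structure documents onto a fixed set of columns might look like this:

```python
# Map variable-structure MongoDB documents onto a fixed relational schema.
# Column names and defaults here are hypothetical illustrations.
FIXED_SCHEMA = {
    "device_id": None,   # nullable column: no sensible default
    "reading": 0.0,      # default applied at the ETL stage
    "unit": "celsius",   # default applied at the ETL stage
}

def to_row(document: dict) -> dict:
    """Return a dict with exactly the columns the DW table expects,
    filling in defaults for fields the document lacks."""
    return {col: document.get(col, default) for col, default in FIXED_SCHEMA.items()}

docs = [
    {"device_id": "a1", "reading": 21.5},       # missing "unit"
    {"device_id": "a2", "unit": "fahrenheit"},  # missing "reading"
]
rows = [to_row(d) for d in docs]
```

Whatever tool you use, some equivalent of this mapping has to happen somewhere between the collection and the DW table.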
The MongoDB query language is optimised for finding and returning documents from a single collection. A data warehouse data model, on the other hand, is aligned with natural business concepts and business events. It is unlikely that data from one collection will fit neatly into one DW table: you will probably need either to split the collection’s data across multiple DW tables or to join data from multiple collections to form a table record. Luckily, ETL tools cope with this job.
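A sketch of the splitting case, using a hypothetical order document: one nested document yields rows for a parent table and a child table.

```python
# Split one nested document into rows for a parent and a child DW table.
# The "order" / "order line" structure is a hypothetical illustration.
def split_order(doc: dict):
    order_row = {"order_id": doc["_id"], "customer": doc["customer"]}
    line_rows = [
        {"order_id": doc["_id"], "sku": line["sku"], "qty": line["qty"]}
        for line in doc.get("lines", [])
    ]
    return order_row, line_rows

doc = {
    "_id": 42,
    "customer": "ACME",
    "lines": [{"sku": "X1", "qty": 2}, {"sku": "Y9", "qty": 1}],
}
order, lines = split_order(doc)
```

The join case is the mirror image: documents from several collections contribute columns to one target row.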
The last, but probably the most important, challenge is change data capture (CDC). You absolutely don’t want to load the whole collection every time and compare every document just to check whether the DW data is still up to date; you need to load only the documents created or updated since the last load. MongoDB always stores the document creation date, although it is hidden inside the automatically generated _id field. The bad news is that it’s almost unusable for our purpose: none of the above plugins can extract the timestamp part of _id into a separate field, converted to a date, which could then be compared with the last job run timestamp captured in the DW load metadata (although this can be achieved if you write your own code). So your CDC solution will have to rely on date/time fields in the documents, which is reasonable in the earlier example of device reads. In other cases, the transactional application that writes and updates the MongoDB documents must populate Created and Modified date fields to enable CDC for the data warehouse.
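If you do write your own code, the extraction itself is straightforward: the first four bytes of an ObjectId are a big-endian Unix timestamp. A minimal sketch in Python, using only the standard library:

```python
from datetime import datetime, timezone

def objectid_created_at(oid_hex: str) -> datetime:
    """Decode the creation time hidden in a MongoDB ObjectId.

    The first 4 bytes (8 hex characters) of an ObjectId hold the
    seconds since the Unix epoch at which the _id was generated.
    """
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# 0x5f000000 = 1593835520 seconds since the epoch (July 2020)
created = objectid_created_at("5f0000000000000000000000")
```

A CDC load could then keep only documents where this value (or a dedicated Modified date field) is later than the last run timestamp recorded in the DW load metadata.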
Straight to Report
What if we just want to plug our reporting tool straight into MongoDB and avoid all of the above? Hmmm, et tu? Those tools don’t cope with gigantic amounts of data unless the servers they run on are beefed up with resources, which normally comes at extra cost. All the ETL challenges above don’t go away either; they just have to be addressed within the report. And remember, the whole point was to integrate the NoSQL data with data from the organisation’s other sources. Once you’ve done the integration for one report, you will be asked to copy and paste it into another, very similar report, and maintain both. This falls apart the moment you find that someone has copied one of these reports, introduced a bug in the integration part, and is producing slightly different KPIs. That’s why we do all the integration in the data warehouse and plug the reporting tool into that single source, so you don’t need to worry where the data originally came from.
It would make sense to get the data directly from MongoDB for operational reporting, though. The MongoDB Connector for BI seems to work just fine for that.