Looking for Data Mining and Analytics tools? Check out these Open Source options

By
Victoria Maclennan
February 25, 2016

The Analytics and Data Mining tools space is growing fast. New startups keep popping up and raising various masses of cash to build out their products, and many disappear again, either through an early-stage sale or by running out of said cash. There are loads of taster articles to give you an idea of the scope, for instance 15 Big Data and Analytics Companies to Watch. Industry commentators love this space too: KDnuggets has a comparison of the Gartner Magic Quadrants for 2014 and 2015 in Predictive Analytics tools, and Forbes posted an article part way through last year listing the Top 100 Analytics Startups in 2015 (with a download to Excel option). These ratings and list inclusions are heavily weighted on how much ca$h each company has raised, which in itself implies a vast customer base and a feature-rich product (benefit of the doubt given to tech investors on my part here).
Most of the tools listed in these articles offer a commercial product. Some, however, have an Open Source offering with a Premium path as well. I started thinking about this back in August at the RapidMiner 2015 conference, set in a beautiful Slovenian castle (yes, it was amazing). I was talking to Data Scientists, Medical Researchers, GPs and PhD candidates who all started (and many continue) their journeys into Data Mining or Predictive Analytics by downloading the Open Source version of RapidMiner.
This discovery, coupled with living in the land of “R”, where students are emerging armed with newfound statistical modelling and data mining know-how all honed using “R”, has created demand for Open Source and R-derivative products. To get you started, here is a short list of tools you can download and use, with their Open Source license attribution and a link to find them. As much as I don’t always trust Wikipedia, it has proven a good source of non-techo descriptive information on each product below. It is a short list, varied in maturity and finesse, with only RapidMiner having morphed into what is now a large commercial player.

RapidMiner is a software platform developed by the company of the same name that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics.

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering and classification.

Orange is a comprehensive, component-based software suite for machine learning and data mining, developed at the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia, together with the open source community.

Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General Public License.

DataMelt (or, in short, DMelt) is a computation and visualization environment: an interactive framework for scientific computation, data analysis and data visualization designed for scientists, engineers and students.

Knowledge Extraction based on Evolutionary Learning (KEEL) provides a simple GUI based on data flow to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms) in order to assess the behaviour of the algorithms.

SPMF is an open-source data mining library written in Java, specialized in pattern mining.

Rattle GUI is a free and open source software package providing a graphical user interface (GUI) for data mining using the R statistical programming language.

Hope you have found something useful here.

Now Just Because It’s Fun!

If you want to make a quick, easy visualization of your dataset, try Raw – it’s so easy I can do it! This took me 2 minutes. I downloaded the CSV of ACC Claims from data.govt.nz, pasted the rows of 2014 injury claim totals by region into Raw, filtered on Females and voila! It’s crude because it took me 2 minutes. Imagine if I had spent 5; it could be gorgeous.

Figure: Female ACC claims by region, 2014
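If you would rather do the filtering step in code before pasting anything into a visualization tool, here is a minimal sketch in Python using only the standard library. To be clear about the assumptions: the column names (region, gender, year, claims) and the numbers in the sample are made up for illustration and are not the real layout or figures of the ACC file, so you would need to adjust them to match whatever the actual CSV contains.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample standing in for the downloaded CSV. The column names
# (region, gender, year, claims) and the numbers are illustrative assumptions,
# not the real layout or figures of the ACC file.
SAMPLE_CSV = """region,gender,year,claims
Auckland,Female,2014,120
Auckland,Male,2014,150
Wellington,Female,2014,80
Wellington,Female,2013,75
Canterbury,Female,2014,95
Canterbury,Male,2014,110
"""


def female_claims_by_region(csv_text, year="2014"):
    """Total the claims per region for rows where gender is Female and the year matches."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["gender"] == "Female" and row["year"] == year:
            totals[row["region"]] += int(row["claims"])
    return dict(totals)


if __name__ == "__main__":
    # Print one "region: total" line per region, ready to paste elsewhere.
    for region, total in sorted(female_claims_by_region(SAMPLE_CSV).items()):
        print(f"{region}: {total}")
```

With a real file you would read its contents with open(...) and pass the text in; the filtering logic stays the same once the column names match.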

Source: OptimalBI

Enjoy. Vic.

Victoria spends much of her time focusing on Digital Inclusion, Digital Literacy and Digital Rights.  

You can read her OptimalBI blogs here, or connect with her on LinkedIn.

Copyright © 2019 OptimalBI LTD.