RapidMiner have changed the way they do R integration. You can read about how to install the new way in this post: https://optimalbi.com/blog/2015/08/13/improved-rapidminer-and-r-integration-is-here/
Why bother integrating RapidMiner and R?
Analysts are difficult to please. We have our favourite tools for working with data, but we’re always complaining about them anyway! Much like many residents of our fair city Wellington when the temperature pushes above 25 degrees…
In the past, it has been enough to be really good at our favourite tool, with all its flaws. Be it SAS or SQL or R or Excel, analysts have traditionally focused on deep expertise in one tool, learning all the tricks necessary to make it do things it wasn’t built for. As an example, in my academic research I used R to download, unzip, parse, concatenate and analyse thousands of XML files. R is a great choice for that last step, but there are better choices for the prior steps. The way the world of analytics is heading, that’s all going to change. More and more job advertisements ask for literacy in multiple tools. The edge in interviews for these jobs will go to those who can prove it in hands-on exercises, and who are open to transferring their skills in one tool to another.
This sounds scary, because I’m predicting we’ll all become mono-lingual unemployed dinosaurs, right? As I’ve said before, as analysts we should plan to make ourselves obsolete. That way we’ll always be ahead of the curve and changes in toolsets and environments won’t bother us; they’ll generate opportunities to improve our knowledge and adapt our practice to new challenges. Nirvana for this kind of agile analyst is when multiple tools, each with different capabilities, integrate cleanly. That means less manual, error-prone work in passing off data between the different tools using clunky formats like CSV. Instead you can leave each tool to do the job it is best suited for.
This kind of integration between complementary tools is available for RapidMiner and R. RapidMiner is the GUI-based analytics tool we offer as part of our MagnumBI platform, and R is the leading programming language for advanced analytics. The two offer very different value propositions: RapidMiner boils the steps of building an analytical model down to standardised operators that each take a small step toward the final output. This can make it easier to collaborate with other analysts as it reduces the likelihood of somebody using a trick or hack to ‘make something go’. The learning curve is also very approachable for those just getting into analytics. R, on the other hand, is a full-blown programming language used by almost everybody on the bleeding edge of analytics research. You won’t be surprised to learn that although it does everything RapidMiner can and more, the learning curve is much steeper! Given its almost infinite flexibility, collaboration in R also requires agreement between analysts on how to use it, which is difficult to achieve with us analyst types who all know better than the next.
How to integrate RapidMiner and R on your machine
As with most software integrations, there’s a bit of fiddling to get it going. I’ll assume that as an R or RapidMiner user, you’re used to this, and so I’ll be brief! Here’s how I got it going:
- Make sure you have the latest version of RapidMiner installed.
- Open R however you typically do that, and check the version number: it needs to be at least 2.12 (that’s the October 2010 release, so hopefully you’ve updated since then!). Also note that if you run 32-bit RapidMiner you’ll need 32-bit R installed, and likewise for 64-bit RapidMiner.
- Open RapidMiner and click ‘Help’ then ‘Updates and Extensions’. Search for ‘R Extension’, install it, and restart RapidMiner when prompted.
- At this point you’ll almost certainly meet a warning screen, but don’t fear! RapidMiner is simply looking for the library files that will let it talk to R without your involvement. From here click the ‘R Installation assistant’ for detailed instructions, or continue to follow mine below.
- Open R, and type the following to install the missing library RapidMiner is looking for:
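The code block from the original post appears to have been lost here; based on the rJava install directory and jri.dll file referenced later in these steps, the missing library is presumably rJava (which ships the JRI bridge), so the command would be along these lines:

```r
# Install rJava, which provides the JRI bridge RapidMiner uses to talk to R.
# (The package name is inferred from the rJava/jri.dll paths later in this post.)
install.packages("rJava")
```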
- Find out where your R libraries are installed by typing the following, noting the path(s) returned (mine were “C:/Users/ShaunM/Documents/R/win-library/3.1” and “C:/Program Files/R/R-3.1.2/library”):
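The command referenced here also seems to have dropped out of the post; the standard way to list R’s library locations, which matches the paths quoted in this step, is:

```r
# List the directories R searches for installed packages;
# one of these will contain the rJava/jri folder RapidMiner needs.
.libPaths()
```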
- Check that you have an R_HOME environment variable set in your operating system, which may have happened automatically when you installed R. Unfortunately I didn’t have one, so in Windows 8 I set mine under System/Advanced system settings/Advanced/Environment Variables, pointing it at my R installation directory.
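As a small aside (not from the original post), you can check both values from inside R itself, which saves guessing what R_HOME should be:

```r
# Show the home directory of the running R installation;
# this is the value R_HOME should point to.
R.home()

# Check what (if anything) the R_HOME environment variable currently holds.
# An empty string means it is not set at the OS level.
Sys.getenv("R_HOME")
```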
- Now you need to add a similar path to your operating system’s PATH environment variable. This is different for every OS, but here’s how I edited that variable in Windows 8 under System/Advanced system settings/Advanced/Environment Variables. Note the slight difference from the R_HOME variable set above: the PATH entry includes \bin\x64 on the end, whereas for a 32-bit R/RapidMiner installation you’ll need \bin\i386.
- Switch back to RapidMiner and click ‘R Installation assistant’ if you aren’t there already. Now click ‘Select JRI library file’ and navigate to one of the library paths you noted earlier, in search of the rJava install directory. Once there, navigate deeper to find the jri.dll file specific to your system architecture. Mine was in C:\Users\ShaunM\Documents\R\win-library\3.1\rJava\jri\x64.
- Click ‘Manually Restart RapidMiner’ and wait for RapidMiner to restart. If successful it will take a bit longer to load as it calls out to R to start up. You’ll also see a window pop up to select your nearest CRAN mirror for additional package downloads.
- If everything has worked, you’ll now see R pop up in the perspectives selector in the top-right corner of your RapidMiner window.
- Click the R button to see the R console running inside RapidMiner Studio as pictured at the start of this post!
Prove it worked by using an R model in a RapidMiner process
Now that we’ve gone to all that effort, let’s do something half-interesting with it. As RapidMiner doesn’t yet have a generalized linear model operator, in this example I use RapidMiner to download and organise some data, R to estimate a model using GLM, and RapidMiner again to display the results.
What’s a generalized linear model? I can’t go past Wikipedia’s definition:
“In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.”
If that’s all gibberish, the point is that GLM allows us to apply a linear-like model to data that would make a regular old linear regression explode. A classic example is when some of the data you want to use to explain or predict an outcome is categorical, like “yes/no”. In RapidMiner we could of course recode that data to fit its native Logistic Regression operator, but the aim of this blog is to show off R integration!
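As a minimal illustration (with made-up data, not the Deals dataset used below), fitting a GLM with a categorical predictor in R takes just a couple of lines; the binomial family gives you a logistic regression that handles the factor variable directly:

```r
# Toy data: predict a yes/no outcome from a numeric and a categorical variable.
set.seed(42)
toy <- data.frame(
  Age             = sample(18:70, 100, replace = TRUE),
  Gender          = factor(sample(c("male", "female"), 100, replace = TRUE)),
  Future_Customer = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

# A binomial GLM: R encodes the categorical Gender predictor automatically.
fit <- glm(Future_Customer ~ Age + Gender, data = toy, family = binomial)
summary(fit)
```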
To follow along the steps I describe below:
- Download this RapidMiner process (rmp file) to your hard drive
- Open your installation of RapidMiner, click ‘File’ then ‘Import Process’, and navigate to where you saved that rmp file
- With the process open, save it in your Local Repository: ‘File’ then ‘Save Process as’, choose a location and file name like ‘OptimalBI-integrating-RapidMiner-and-R’
- The top “Retrieve” operator loads the “Deals” ExampleSet (ie dataset) from RapidMiner’s “Samples” repository, which you can see in the left-hand repository list. This comes standard with RapidMiner Studio so you should have it too. I then rename the attributes (variables) in this so they don’t have spaces, to keep R happy. I use this data as the “training set” for the model.
- The bottom “Retrieve” operator loads the “Deals-Testset” dataset. I then rename attributes as above. This is the test data to which I’ll apply the trained model. Think of these as the “new customers” whose behaviour we’re trying to predict.
- The “Execute Script (R)” operator is where all the R magic happens. Click on it to show its parameters. “script” is where the actual R code lives. I’ve left comments there to explain the steps taken, but in short I fit a generalized linear model to the training data, summarise it, predict “Future_Customer” based on the Age, Gender and Payment_Method variables for the new customers in the test data, and arrange the results to send back to RapidMiner. The “inputs” parameter is where I give an R-friendly name for each of the ExampleSets (ie datasets) I feed in. You’ll see I assign the top input port to “deals.trainingset” and the bottom to “deals.testset”. These RapidMiner ExampleSets are then available to R as dataframes with those names. Finally, the “results” parameter tells RapidMiner how to interpret the variables I’ve defined in my R code: summary.glm.fit is just the output showing the model fit to training data, while deals.testset.with.predictions is a Data Table that RapidMiner will interpret as a new ExampleSet.
- The final two operators test how well the model has done. The first generates a new RapidMiner attribute (ie variable) that recodes the model’s prediction_Future_Customer attribute to “yes/no” to match what the “Future_Customer” attribute in the test data already says about these new customers. The final operator in the flow compares that prediction to the “real” behaviour specified in the Future_Customer attribute.
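The script inside the “Execute Script (R)” operator follows the shape described above. The sketch below is a reconstruction, not the exact code from the downloadable process: in RapidMiner, deals.trainingset and deals.testset arrive as data frames via the “inputs” parameter, so here I build tiny stand-in data frames (with hypothetical values) so the sketch runs on its own:

```r
# Stand-ins for the data frames RapidMiner would pass in via "inputs".
set.seed(1)
n <- 20
deals.trainingset <- data.frame(
  Age             = sample(20:60, n, replace = TRUE),
  Gender          = factor(sample(c("male", "female"), n, replace = TRUE)),
  Payment_Method  = factor(sample(c("cash", "credit card", "cheque"), n, replace = TRUE)),
  Future_Customer = factor(sample(c("yes", "no"), n, replace = TRUE))
)
deals.testset <- deals.trainingset[1:5, ]

# Fit a generalized linear model to the training data and summarise it;
# summary.glm.fit is returned to RapidMiner via the "results" parameter.
glm.fit <- glm(Future_Customer ~ Age + Gender + Payment_Method,
               data = deals.trainingset, family = binomial)
summary.glm.fit <- summary(glm.fit)

# Predict Future_Customer for the new customers in the test data and
# arrange the results as a data frame RapidMiner reads as an ExampleSet.
predictions <- predict(glm.fit, newdata = deals.testset, type = "response")
deals.testset.with.predictions <- cbind(deals.testset,
                                        prediction_Future_Customer = predictions)
```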
If you now run the Process, and your integration of RapidMiner and R is all in order, you’ll generate two results: the predicted behaviour of the new customers (will they be our Future Customer or not?) in a table, and the fit statistics for the model against the training data. I won’t go into detail about these, but if you click the “Statistics” tab while viewing the predictions, you’ll see the model only does slightly better than a coin-flip, with 509/1000 correct predictions. There could be many reasons for that: inappropriate model selection, over-fitting to the training data, or test data drawn from such a different data-generating process that I never had a hope in hell.
Now that I’ve got you up and running using R inside RapidMiner, I’ll leave you to improve on my result! Let me know how you get on.
Until next time, keep asking better questions.
Shaun – @shaunmcgirr
Shaun blogs about analytics, machine learning and how data solves problems in the real world.
We run regular Agile courses with a business intelligence slant in both Wellington and Auckland. Find out more here.