When one tool is not enough: integrating RapidMiner and R

by Shaun McGirr | Mar 5, 2015

[Screenshot: The R 3.1.2 console running inside RapidMiner Studio 6.2!]

Update:

RapidMiner have changed the way they do R integration. You can read how to set up the new integration in this post: https://optimalbi.com/blog/2015/08/13/improved-rapidminer-and-r-integration-is-here/

Why bother integrating RapidMiner and R?

Analysts are difficult to please. We have our favourite tools for working with data, but we’re always complaining about them anyway! Much like many residents of our fair city Wellington when the temperature pushes above 25 degrees…
In the past, it has been enough to be really good at our favourite tool, with all its flaws. Be it SAS or SQL or R or Excel, analysts have traditionally focused on deep expertise in one tool, learning all the tricks necessary to make it do things it wasn’t built for. As an example, in my academic research I used R to download, unzip, parse, concatenate and analyse thousands of XML files. R is a great choice for that last step, but there are better choices for the prior steps. The way the world of analytics is heading, that’s all going to change. More and more job advertisements ask for literacy in multiple tools. The edge in interviews for these jobs will go to those who can prove it in hands-on exercises, and who are open to transferring their skills in one tool to another.
This sounds scary, because I’m predicting we’ll all become mono-lingual unemployed dinosaurs, right? As I’ve said before, as analysts we should plan to make ourselves obsolete. That way we’ll always be ahead of the curve and changes in toolsets and environments won’t bother us; they’ll generate opportunities to improve our knowledge and adapt our practice to new challenges. Nirvana for this kind of agile analyst is when multiple tools, each with different capabilities, integrate cleanly. That means less manual, error-prone work in passing off data between the different tools using clunky formats like CSV. Instead you can leave each tool to do the job it is best suited for.
This kind of integration between complementary tools is available for RapidMiner and R. RapidMiner is the GUI-based analytics tool we offer as part of our MagnumBI platform, and R is the leading programming language for advanced analytics. The two offer very different value propositions. RapidMiner boils the steps of building an analytical model down to standardised operators that each take a small step toward the final output. This can make it easier to collaborate with other analysts as it reduces the likelihood of somebody using a trick or hack to ‘make something go’. The learning curve is also very approachable for those just getting into analytics. R, on the other hand, is a full-blown programming language used by almost everybody on the bleeding edge of analytics research. You won’t be surprised to learn that although it does everything RapidMiner can and more, the learning curve is much steeper! Given its almost infinite flexibility, collaboration in R also requires agreement between analysts on how to use it, which is difficult to achieve with us analyst types who all know better than the next.

How to integrate RapidMiner and R on your machine

As with most software integrations, there’s a bit of fiddling to get it going. I’ll assume that as an R or RapidMiner user, you’re used to this, and so I’ll be brief! Here’s how I got it going:

  1. Make sure you have the latest version of RapidMiner installed.
  2. Open R however you typically do that, and check the version number. It needs to be at least 2.12 (that’s the October 2010 version, so hopefully you’ve updated since then!). Also note that if you run 32-bit RapidMiner you’ll need 32-bit R installed, and likewise 64-bit R for 64-bit RapidMiner.
  3. Open RapidMiner and click ‘Help’ then ‘Updates and Extensions’. Search for ‘R Extension’, install it, and restart RapidMiner when prompted.
  4. At this point you’ll almost certainly meet an error screen telling you RapidMiner could not load a native library, but don’t fear! RapidMiner is simply looking for the library files that will let it talk to R without your involvement. From here click the ‘R Installation assistant’ for detailed instructions, or continue to follow mine below.
    [Screenshot: could not load native library]
  5. Open R, and type the following to install the missing library RapidMiner is looking for:
    install.packages("rJava")
  6. Find out where your R libraries are installed by typing the following, noting the path(s) returned (mine were “C:/Users/ShaunM/Documents/R/win-library/3.1” and “C:/Program Files/R/R-3.1.2/library”):
    .libPaths()
  7. Check that you have an R_HOME environment variable set in your operating system, pointing at the folder where R is installed. This may have happened automatically when you installed R. Unfortunately I didn’t have one, so in Windows 8 I set mine under System/Advanced system settings/Advanced/Environment Variables:
    [Screenshot: adding the R_HOME environment variable]
  8. Now you need to add a similar path to your operating system’s PATH environment variable. This is different for every OS, but in Windows 8 I edited that variable under System/Advanced system settings/Advanced/Environment Variables. Note the slight difference from the R_HOME variable set above: this path includes bin\x64 on the end, whereas for a 32-bit R/RapidMiner installation you’ll need bin\i386.
    [Screenshot: adding R to the PATH environment variable]
  9. Switch back to RapidMiner and click ‘R Installation assistant’ if you aren’t there already. Now click ‘Select JRI library file’ and navigate to one of the paths returned by step 6, in search of the rJava install directory. Once there, navigate deeper to find the jri.dll file specific to your system architecture. Mine was in C:\Users\ShaunM\Documents\R\win-library\3.1\rJava\jri\x64.
  10. Click ‘Manually Restart RapidMiner’ and wait for RapidMiner to restart. If successful it will take a bit longer to load as it calls out to R to start up. You’ll also see a window pop up to select your nearest CRAN mirror for additional package downloads.
  11. If everything has worked, you’ll now see R pop up in the perspectives selector in the top-right corner of your RapidMiner window:
    [Screenshot: the RapidMiner perspective selector with R]
  12. Click the R button to see the R console running inside RapidMiner Studio as pictured at the start of this post!
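
Most of the facts the steps above ask you to look up by hand (steps 2 and 5 to 9) can be gathered in one go from a plain R console. The following is a minimal sketch, not part of the official setup: check_r_setup() is a made-up helper name, and it assumes rJava keeps its JRI files in a "jri" folder inside its install directory, as in the path from step 9. Everything it calls is base R.

```r
# Hypothetical helper (not part of R, rJava or RapidMiner) that gathers
# the facts the installation steps ask you to look up by hand.
check_r_setup <- function(min_version = "2.12.0") {
  list(
    # Step 2: the R Extension needs R 2.12 or newer
    r_version  = as.character(getRversion()),
    version_ok = getRversion() >= min_version,
    # Step 2: 4-byte pointers mean 32-bit R, 8-byte pointers mean 64-bit R
    bits       = 8 * .Machine$sizeof.pointer,
    # Step 6: where R looks for installed packages
    lib_paths  = .libPaths(),
    # Steps 5 and 9: system.file() returns "" when rJava or its jri folder is missing
    rjava_dir  = system.file(package = "rJava"),
    jri_dir    = system.file("jri", package = "rJava"),
    # Step 7: the folder R_HOME should point at
    r_home     = R.home(),
    # Step 8: the folder to add to PATH (on Windows it should end in x64 or i386)
    r_bin      = R.home("bin")
  )
}

setup <- check_r_setup()
str(setup)

if (!nzchar(setup$rjava_dir)) {
  message('rJava not found: run install.packages("rJava") as in step 5')
}
```

If jri_dir comes back empty even though rJava is installed, fall back to searching the rJava folder by hand as described in step 9.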

Prove it worked by using an R model in a RapidMiner process

Now that we’ve gone to all that effort, let’s do something half-interesting with it. As RapidMiner doesn’t yet have a generalized linear model operator, in this example I use RapidMiner to download and organise some data, R to estimate a model using GLM, and RapidMiner again to display the results.
What’s a generalized linear model? I can’t go past Wikipedia’s definition:

“In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.”

If that’s all gibberish, the point is that GLM allows us to apply a linear-like model to data that would make a regular old linear regression explode. A classic example is when some of the data you want to use to explain or predict an outcome is categorical, like “yes/no”. In RapidMiner we could of course recode that data to fit its native Logistic Regression operator, but the aim of this blog is to show off R integration!
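
To make that less abstract, here is a minimal sketch of a GLM in R, fitted to made-up data rather than anything from RapidMiner. The data frame toy, its columns and the coefficients used to simulate it are all invented for illustration; glm(), plogis() and predict() are standard R functions. The outcome is a yes/no factor and one of the predictors is categorical too.

```r
# Simulate a small, entirely made-up dataset
set.seed(42)
n <- 200
toy <- data.frame(
  age    = round(runif(n, 18, 70)),
  member = factor(sample(c("no", "yes"), n, replace = TRUE), levels = c("no", "yes"))
)

# Invented relationship: older people and members are more likely to buy
p_buy <- plogis(-2 + 0.04 * toy$age + 0.8 * (toy$member == "yes"))
toy$bought <- factor(ifelse(runif(n) < p_buy, "yes", "no"), levels = c("no", "yes"))

# Binomial family with a logit link: a GLM for a yes/no outcome
fit <- glm(bought ~ age + member, data = toy, family = binomial(link = "logit"))
print(coef(summary(fit)))

# Predicted probability of "yes" for one hypothetical new person
new_customer <- data.frame(age = 40, member = factor("yes", levels = c("no", "yes")))
p_new <- predict(fit, newdata = new_customer, type = "response")
print(p_new)
```

With a factor outcome, glm() treats the first level ("no") as failure and the other as success, so the predictions are probabilities of "yes".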
To follow along with the steps I describe below:

  1. Download this RapidMiner process (rmp file) to your hard drive
  2. Open your installation of RapidMiner, click ‘File’ then ‘Import Process’, and navigate to where you saved that rmp file
  3. With the process open, save it in your Local Repository: ‘File’ then ‘Save Process as’, choose a location and file name like ‘OptimalBI-integrating-RapidMiner-and-R’

You should now have a RapidMiner process in front of you that looks like this:
[Screenshot: the process integrating RapidMiner and R]
Here’s what all the Operators in this process do:

  • The top “Retrieve” operator loads the “Deals” ExampleSet (ie dataset) from RapidMiner’s “Samples” repository you can see on the left-hand list. This comes standard with RapidMiner Studio so you should have it too. I then rename the attributes (variables) in this so they don’t have spaces, to keep R happy. I use this data as the “training set” for the model.
  • The bottom “Retrieve” operator loads the “Deals-Testset” dataset. I then rename attributes as above. This is the test data to which I’ll apply the trained model. Think of these as the “new customers” whose behaviour we’re trying to predict.
  • The “Execute Script (R)” operator is where all the R magic happens. Click on it to show its parameters. “script” is where the actual R code lives. I’ve left comments there to explain the steps taken, but in short I fit a generalized linear model to the training data, summarise it, predict “Future_Customer” based on the Age, Gender and Payment_Method variables for the new customers in the test data, and arrange the results to send back to RapidMiner. The “inputs” parameter is where I give an R-friendly name for each of the ExampleSets (ie datasets) I feed in. You’ll see I assign the top input port to “deals.trainingset” and the bottom to “deals.testset”. These RapidMiner ExampleSets are then available to R as dataframes with those names. Finally, the “results” parameter tells RapidMiner how to interpret the variables I’ve defined in my R code: summary.glm.fit is just the output showing the model fit to training data, while deals.testset.with.predictions is a Data Table that RapidMiner will interpret as a new ExampleSet.
  • The final two operators test how well the model has done. The first generates a new RapidMiner attribute (ie variable) that recodes the model’s prediction_Future_Customer attribute to “yes/no” to match what the “Future_Customer” attribute in the test data already says about these new customers. The final operator in the flow compares that prediction to the “real” behaviour specified in the Future_Customer attribute.
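
Pulling those pieces together, the script inside the “Execute Script (R)” operator could look something like the sketch below. It is a reconstruction from the description above, not a copy of the script in the rmp file. The make_deals() helper and its random data are stand-ins invented here so the code runs outside RapidMiner; inside RapidMiner the two data frames arrive through the “inputs” parameter instead. The factor levels and the choice of type = "response" are assumptions too.

```r
# Stand-in for the two RapidMiner ExampleSets (hypothetical helper, random data)
make_deals <- function(n) {
  data.frame(
    Age             = round(runif(n, 18, 80)),
    Gender          = factor(sample(c("female", "male"), n, replace = TRUE),
                             levels = c("female", "male")),
    Payment_Method  = factor(sample(c("cash", "cheque", "credit card"), n, replace = TRUE),
                             levels = c("cash", "cheque", "credit card")),
    Future_Customer = factor(sample(c("no", "yes"), n, replace = TRUE),
                             levels = c("no", "yes"))
  )
}
set.seed(1)
deals.trainingset <- make_deals(300)
deals.testset     <- make_deals(100)

# Fit a generalized linear model to the training data
fit <- glm(Future_Customer ~ Age + Gender + Payment_Method,
           data = deals.trainingset, family = binomial)

# First result: the summary of the model fit
summary.glm.fit <- summary(fit)

# Second result: the test data with a predicted probability of "yes" attached
deals.testset.with.predictions <- deals.testset
deals.testset.with.predictions$prediction_Future_Customer <-
  predict(fit, newdata = deals.testset, type = "response")

print(summary.glm.fit)
print(head(deals.testset.with.predictions))
```

The two objects created at the end carry the names the “results” parameter expects, which is how RapidMiner knows what to pull back out of R.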

If you now run the Process, and your integration of RapidMiner and R is all in order, you’ll generate two results: the predicted behaviour of the new customers (will they be our Future Customer or not?) in a table, and the fit statistics for the model against the training data. I won’t go into detail about these, but if you click the “Statistics” tab while viewing the predictions, you’ll see the model only does slightly better than a coin-flip, with 509/1000 correct predictions. There could be many reasons for that: inappropriate model selection, over-fitting to the training data, or test data drawn from such a different data-generating process that I never had a hope in hell.
[Screenshot: model results]
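
The comparison made by the final two operators can also be sketched in a few lines of R. The six probabilities and outcomes below are made up for illustration, and the 0.5 cut-off for recoding a probability to “yes” is an assumption, since the post does not say which threshold the RapidMiner process uses.

```r
# Made-up example values: predicted probabilities of "yes" and the real outcomes
prediction_Future_Customer <- c(0.8, 0.3, 0.4, 0.6, 0.7, 0.2)
Future_Customer            <- c("yes", "no", "yes", "yes", "no", "no")

# Recode probabilities to "yes"/"no" (the 0.5 cut-off is an assumption)
predicted <- ifelse(prediction_Future_Customer >= 0.5, "yes", "no")

# Cross-tabulate predictions against reality, then take the share that match
print(table(predicted = predicted, actual = Future_Customer))
accuracy <- mean(predicted == Future_Customer)
print(accuracy)

# The result reported in this post: 509 correct predictions out of 1000
reported_accuracy <- 509 / 1000
print(reported_accuracy)
```

An accuracy of 0.509 sits barely above the 0.5 expected from guessing at random between two equally likely outcomes, which is why the model is described as only slightly better than a coin-flip.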
Now that I’ve got you up and running using R inside RapidMiner, I’ll leave you to improve on my result! Let me know how you get on.
Until next time, keep asking better questions.
Shaun – @shaunmcgirr
Shaun blogs about analytics, machine learning and how data solves problems in the real world. 


10 Comments
  1. Rafi

    Hello,
    I have a problem installing the R Extension. I did everything as described in this article, but when I start RapidMiner Studio (6.3) I get an Error 10 message: “Java could not be launched. Probably there is not enough free memory available. Please close all other applications and try again”. I use Windows 8.1 and R 3.1.3. I tried installing Java 7 but it changed nothing. I would appreciate any help, because I have been trying to solve this problem for 4 days. I got the extension running on my old notebook with Vista Business, but that’s no help. Thank you

  2. Shaun McGirr

    Hi Rafi, thanks for getting in touch! With that kind of Java error my first question would be: are you running 64-bit or 32-bit versions of Windows, R and Java? All of these need to be either 64-bit or 32-bit, otherwise Java won’t be able to allocate memory properly.
    You can find out whether you are running 32- or 64-bit using these links:
    – Windows http://support.microsoft.com/en-us/kb/827218
    – Java https://www.java.com/en/download/faq/java_win64bit.xml
    – In R, type version into your R console; x86_64 means 64-bit
    Let me know what the results are.

  3. iamkbpark

    Dear Shaun McGirr:
    I enjoyed reading your article, and of course, I DO APPRECIATE your instruction!
    I used to think of RapidMiner as a backup in case R someday drives me crazy (not yet, since I am a beginner).
    Though I am still a beginner with both, the integration encourages me not to give up on R or on RapidMiner, which had been sleeping on my desktop for a long time doing nothing except updating!
    HOWEVER,
    Now that I have witnessed how beautifully those two work together (also huge thanks for the example with the process file at the end), I have started telling myself that it is not yet time to give up on them!
    THANK YOU, AND GOD BLESS YOU!
    Sincerely,
    K.B. Park from South Korea

  4. Shaun McGirr

    Thanks for reading K.B., and for your very kind words! Keep me posted on your progress.

  5. Krishna

    Thanks..
    It is really helpful..

  6. Shaun McGirr

    My pleasure, Krishna! Let me know what else you’re working on in RapidMiner…who knows I might be struggling through the same thing.

  7. KELVIN TAN

    R extension is not available under RapidMiner 6.5.002 in the Marketplace. I can only find R Scripting. But how do I get the R console up?

  8. KELVIN TAN

    Hi Shaun McGirr, Bravo. Thumbs UP, you are simply GREAT. thanks for directing me to your new post on RapidMiner & R integration using RapidMiner 6.5. Yes, it works with no issue, and I can run its example smoothly. Have a great day. I would definitely come back to your blog often, as we have common interests 🙂
    https://tanthiamhuat.wordpress.com

  9. Shaun McGirr

    Thanks Kelvin, glad it worked! Lots of interesting material on your site, thanks for the link.

