R for the Enterprise

mwexler · on July 6, 2012

This happened back in March. https://blogs.oracle.com/R/entry/r_execution_in_oracle_datab.... It's a desperate attempt to co-opt the Red Hat approach to riches via open source support. Revolution R has been doing pretty well in this game, so Oracle is replicating. The only contribution back to the community so far has been to do a few mild updates to ROracle, the R to Oracle connector.

Oracle has basically wrapped up R and made it embedded in the Database. One nice touch is that they have made overloaded versions of most of R's base and stat functions, and let them handle the "ore" dataframes; these overloaded versons are basically wrappers on the "big-data" functions inside the DB.

If this is really interesting to you, here are a set of PDFs describing Oracles approach: http://www.oracle.com/technetwork/database/options/advanced-...

Bad news is: you are limited to Oracle approaches to scale and growth; while there's some to love, there's also lots to hate.

I worry that some of this commercialization of R is going to cause trouble, as we start having non-compatible forks of R creating non-replicable analyses. If I build a great model that can only be replicated via some vendor's proprietary approach, then it may be great for my business, but it doesn't move the field forward.

batista · on July 7, 2012

>It's a desperate attempt to co-opt the Red Hat approach to riches via open source support.

Yes, because Redhat is doing so much better than Oracle with its OSS approach...

And it's not like other major companies support OSS with their products, like say IBM...

tom_b · on July 6, 2012

Are data hackers gravitating to R? Given that Oracle and (surprisingly to me) SAS now both support R in their offerings, it seems that at least the enterprise will be taking up more R for analytics.

jordanb · on July 6, 2012

I've been using R because I have an amateur's interest in statistics.

As a programmer, it seems like an awkward language, although no more awkward than SAS, SPSS, etc. And as I do more analysis the language makes more sense. It's a special-purpose language made to do a specific task.

The general workflow for doing data analysis is 1) import the data 2) clean it and format it properly as input for a pre-built package that does the actual analysis 3) feed it to the package and 4) interpret the results.

To that end, typically R programs are short and pretty declarative. R packages contain C or FORTRAN extensions that do all the heavy lifting. Substantial amount of imperative R code is going to be slow. For instance, looping over a vector is always worse than applying a vector transformation, and R provides a rich set of transformations for all its data types.

R has gotten popular because the proprietary guys dropped the ball at the universities. I recall reading a posting by one researcher who said he switched to R because his students could only use SAS at the school's stats lab, whereas they could run R on their computers at home.

Once researchers switched to R, they started publishing their work with code meant to be run with R. The cutting edge is important in stats, so people want a short lead time between when a new test or model is published and when it's available. SAS's "cathedral" can't really keep up. Combine that with SAS's licensing costs (both arms and a kidney too) as well as its overall "mainframey" feel, and you can see why R is winning.

EDIT: Another big win for R that I forgot to mention is its support for visualizations. A step that should perhaps come after importing the data above is investigating it with various diagnostic charts (scatter charts, box charts, etc) these are all just function calls in R. In addition, R has a powerful graphics engine and there are a huge number of packages available to create more sophisticated visualizations: http://addictedtor.free.fr/graphiques/

pbh · on July 6, 2012

This is a great description of R usage (import, clean, fit models), but I think a slightly erroneous explanation of R history.

John Chambers created "S" at Bell Labs. S was a programming language designed for interactive statistical analysis. Much like gcc and icc are implementations of C compilers, R and S-PLUS are implementations of S. S-PLUS was/is the primary proprietary implementation of the S language, whereas R is the primary free one (also, sometimes called GNU S). (SAS and SPSS are completely different languages/systems as far as I know.) I think that statisticians at some point made a conscious effort to publish their work in R, rather than S-PLUS (or any other statistical system like SAS) because it was more widely available. That in turn led R to be a viable competitor to S-PLUS (and other systems) because it had vast amounts of recent statistical libraries, often implemented by the people who developed the techniques. That said, SAS and SPSS seem to pretty much still have social science students locked up --- the market for R is probably statisticians who are also excellent functional programmers.

This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).

(That said, if anyone lived through the change over from S-PLUS to R, I'd love to hear if this history is wrong!)

micro_cam · on July 6, 2012

I think that the Bioconductor project (http://www.bioconductor.org/) has also been a big part of R adoption as it has produced a core of well and consistently documented libraries for importing, managing and analyzing biological data that is not really matched anywhere else. R co-creator Robert Gentalman was/is a big driving force in that so of course it is in R.

dekayed · on July 6, 2012

> This history is in really marked contrast to MATLAB and its corresponding free version Octave, where computer scientists pretty much refuse to use Octave, despite MATLAB's massive price tag to pretty much everyone involved (even with 90% discounts).

Do you have any insights as to why Octave does not have higher adoption?

pbh · on July 6, 2012

I've always been a bit sad about it, but everyone involved is probably a rational actor.

Computer science professors probably view a couple hundred dollars per MATLAB network license as a tiny expense on a $1m+ grant (whereas statistics grants are apparently often smaller), and they may be charged for it in departmental overhead anyway (removing the incentive to cut costs).

The type of people who could contribute either core code or toolbox type code to Octave often have an extremely rare quantitative skill set that is worth hundreds of dollars an hour, so there is a huge incentive to get paid to do similar work instead. There probably isn't much community recognition (to balance things out) for implementing a library in Octave. (Though, in the R world there are certain recognizable superstars like Hadley Wickham.)

Graduate students (who might work for cheap on these problems) are probably more focused on publications and networking.

As long as all of this is the case, Octave will always kind of just be a worse MATLAB that happens to be open source, so a new user choosing between them will probably just choose MATLAB by default.

jordigh · on July 6, 2012

Octave core developer here.

It is true that we have a lot of trouble attracting new contributors. Most of our users keep demanding features that seem to us unimportant but to them are all the world: a GUI ("whatever for?", we think. "Use a real text editor!"), a JIT compiler ("here's a nickle, get better vectorised code, kid"), perfect Matlab compatibility (a never-ending chase, not very fun, in which we must always be behind).

Of these, we're finally slowly listening to our users. Two of our current three GSoC students are working on a GUI and a JIT compiler respectively. I have wild hope that this will attract more users and developers. I'm also currently hosting an Octave conference in a few days towards this goal:

    http://www.octave.org/wiki/index.php?title=OctConf_2012

By the way, Octave is GNU (so is R, supposedly), so we're not really open source; we're free. ;-)

I don't know why Octave hasn't been able to replicate R's success. I don't know if R's not really being GNU despite in name has something to it (R developers routinely try to find new ways to get around the GPL and link R to non-free code, and I don't doubt that this linking to Oracle's database is another example of that). I don't know if it's just that a lot of people with big money care more about statistics and R than they care about Octave (banks and brokers for R, electrical and civil engineers for Otave). Maybe our code sucks more than R's.

Do you have any suggestion how to make Octave the standard instead of Matlab? The recent gratis classes that emerged from Stanford gave Octave a lot of publicity. Do you have any suggestion of what else we might do?

pbh · on July 7, 2012

You're probably in a much better position to evaluate than I am! My guess is that more Octave-based classes would translate into more users and more code written for Octave down the line, but I'm not sure how to encourage more use of Octave in the classroom in the first place.

drucken · on July 6, 2012

Matlab is truly the RAD tool of choice for numerical programming and has a solid grip in universities combined with enterprise-level support.

I do not think Octave ever tried to replicate its workflow (which is not general programmer centric at all) and domain-specific documentation but merely focused on the underlying language compatibility, which is really the least important part of Matlab.

On top of that, I seem to recall, Matlab was one of the first of the specialist programming toolsets to offer a very competitive "Student Edition". This was a godsend for schools and universities before the Internet took off.

In short, Octave was too little too late, and Numpy/Scipy, while catching up fast, has supporting tools spread all over the place as well as being geared more to general programmers who want access to convenient numerics rather than numerical modellers/engineers wanting a RAD tool.

Numpy/Scipy etc. may well overtake Matlab eventually, but that will be purely a function of its infrastructure, not the something as mundane as even the nice language (which admittedly was its initial driving force). At least in this respect, it has done a lot better than Octave in much less time.

jordigh · on July 7, 2012

Actually, there are tons of people who want to run Matlab code freely. The code is already written. They need to run it in clusters, or they need to run it at home.

This is why we are doing Octave.

Scaevolus · on July 6, 2012

MATLAB is commonly used in introductory CS courses for engineering majors, since it's useful for a lot of general tasks and is pretty forgiving.

MATLAB has a nice GUI and IDE. It also generates good graphs with minimal effort.

Octave has a command-line REPL.

sudont · on July 6, 2012

I've looked into it, but went with Python instead, since it allows easier access to get through proprietary single-sign-on stuff that sits in the way of getting data. Plus I don't need to worry about when I need to drop down to screen scraping, or build up into a web interface—much nicer to just share modules between everything.

mrich · on July 6, 2012

It is also supported in SAP's HANA (in-memory database).

http://en.wikipedia.org/wiki/SAP_HANA#R_integration

big_data · on July 7, 2012

The HANA integration is more marketing blurb than a true integration at this point. It's meant to draw in the stats guys so other SAP products can be sold to them.

edouard1234567 · on July 6, 2012

R is great mostly for the breath of plugins/implementations contributed by the community... much better than its alternative SAS for an infinitely lower price... ($0 vs ~$10k). Not surprised by this move from Oracle as data analytics/mining gets more popular. For R enthusiasts there is a great blog I follow http://www.win-vector.com/blog/2012/07/modeling-trick-masked...

mbq · on July 6, 2012

This explains the care about Solaris compatibility of R packages.