The article says this particular instance of the debate started in 2011. Things have shifted a little since then, and I think Python has won more mindshare with Pandas, SciPy, NumPy, and all the rest. I've used both Python and R, and think the next debate will be between those two, as people find that R is not a very good programming language and lacks decent libraries for things like web scraping.
Python can be a single tool that integrates with every part of your workflow. R right now still wins in the number of algorithms implemented in it (there are statistical methods available in R but not in Python), and R has terser syntax, which some people like for interactive use. But for really Big Data, terse syntax and an endless variety of esoteric algorithms are not as important as, say, robust error handling and debugging (a weak area in R, but a strong one in Python).
What are you talking about? R has some of the most robust debugging tools I have encountered. You should look into the functions: browser, traceback, recover, trace, dump.frames, debug, debugonce, `environment<-`. What other language allows you to insert breakpoints (or arbitrary code) in the middle of a function's body without even having to re-paste the source code into the console?
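For instance, a minimal sketch of that last trick (`f` here is just a made-up example function): `trace()` patches a breakpoint into an already-defined function, with no re-sourcing required.

```r
f <- function(x) {
  y <- x^2
  z <- log(y)
  z
}

trace(f, tracer = quote(browser()), at = 3)  # break just before the 3rd statement
f(10)        # drops into an interactive browser inside f's frame
untrace(f)   # restore the original definition

options(error = recover)  # post-mortem: choose a frame to inspect after any error
```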
If you're unsatisfied with R's debugging tools, you are probably not familiar with them.
Disagree. I've had an R server running in production for real-time analytics under heavy load for the last six months, and I have had exactly zero problems. Literally no downtime.
I use a TCP client written by another UIC math professor and sed for most of my scraping needs.
Can you or anyone give me an example, _one that is likely to be seen on any website_, that you believe cannot be solved with sed but can be solved with XPath?
The example at the URL you gave above appears to have a target CSV file as its only requirement.
I routinely use HTTP pipelining and sed/tr to produce CSV files from websites.
As for R, I'm more curious about its database and analytics capabilities.
> The article says this particular instance of the debate started in 2011. Things have shifted a little since then, and I think Python has won more mindshare with Pandas, SciPy, NumPy, and all the rest.
In my field this isn't at all true - R is eroding SAS's market share, but Python is a non-entity.
I think the biggest problem is that for most of my statistical work, with Python I'd have to check whether an implementation exists, while with R I can safely assume one (or several) does.
So, I'm not a statistician, but I do wind up using a lot of (R) statistics tools in my various writings (http://www.gwern.net/tags/statistics). R is a terrible, terrible language from my perspective as a Haskeller, but the libraries are just too fantastic for me not to use it.
A crude bit of shell (`fgrep --no-filename 'library(' *.page | sed -e 's/library//' | tr -d ')' | tr -d '(' | sort -u`) on my writings and a bit of cleanup gets me the following list of R libraries I use:
MASS
MKmisc
Rcapture
XML
arm
biglm
boot
changepoint
ggplot2
insol
languageR
lme4
lubridate
metafor
moonsun
prodlim
pwr
randomForest
randomForestSRC
reshape2
rjson
rms
stringr
survival
verification
Some of them aren't important and I'm sure Python has equivalents to things like 'rjson', but others...? Let's look at a few.
- metafor: meta-analysis library (fixed & random-effects, forest plots, bias checks, etc). I use this in 3 meta-analyses I've done, so it's kind of important. A google of 'python meta-analysis' turns up... nothing? Well, it turns up some Bayesian stuff, but nothing jumps out as a finished, polished library that one could easily reimplement my work in. ('METASOFT' sounds promising, but actually turns out to be for genomics and in Java.)
- randomForest, randomForestSRC: random forests are apparently implemented by Scikit. Great. What about random _survival_ forests, which I use in my Google shutdowns analysis to see if the standard survival model can be improved upon? 5th hit for 'python random survival forests' is... an R library.
- Rcapture: capture-recapture analysis, which I employ in my hafu analysis of anime/manga to try to estimate how many I may be missing. 'python capture-recapture analysis': the best-looking hit (http://www.mikemeredith.net/) is a false-positive. The relevant code is - you guessed it - in R.
- biglm: a package for incrementally updating a linear model, the idea being that you can load into memory as much data as fits, update the linear model, and load the next batch. I need this sucker for running even simple analyses on the Mnemosyne spaced repetition data set, where the database (~18GB) won't fit in my laptop's RAM. Some googling suggests that Scikit may support this. Or may not.
- lme4: the most famous library for doing multilevel models (indispensable generalizations of linear models, great for dealing with multiple measurements from the same subject; I use them on psychology or sleep data, usually). All the Python alternatives appear to either be exercises or Bayesian.
- survival: survival analysis library, used for Google shutdowns. https://stats.stackexchange.com/questions/1736/survival-anal... says you can sorta do them in Python, but you're cautioned against it and it looks like the Python alternatives don't have the Cox proportional hazards needed to use covariates.
I could go on to look at the others (any ordinal logistic regression in Python? how're the bootstrap libraries? what if I need to do power calculations for standard tests?), but I hope this gives you an idea of why I'm not doing everything in Python.
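To give a flavor of what those look like in practice, here is a rough sketch of representative calls, using each package's bundled example data rather than my actual analysis code:

```r
library(metafor)                      # meta-analysis
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)         # BCG vaccine trials, bundled with metafor
rma(yi, vi, data = dat)               # random-effects meta-analysis

library(lme4)                         # multilevel models
lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

library(survival)                     # survival analysis with covariates
coxph(Surv(time, status) ~ age + sex, data = lung)

library(randomForestSRC)              # random survival forests
data(veteran, package = "randomForestSRC")
rfsrc(Surv(time, status) ~ ., data = veteran)

library(biglm)                        # linear models updated chunk by chunk
chunks <- split(trees, rep(1:3, length.out = nrow(trees)))
fit <- biglm(Volume ~ Girth + Height, data = chunks[[1]])
for (chunk in chunks[-1]) fit <- update(fit, chunk)   # fold in the next batch
summary(fit)
```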
I use both Python (pandas) and base SAS at work for the UK government.
I have lots of experience in SAS, and enjoy using it. The macro language allows for very succinct solutions to difficult data manipulation problems.
However, given SAS's huge expense it's difficult for me to identify any 'killer' areas where it's significantly better than open source tools. Indeed, I find pandas faster and easier to use for many problems.
I find it hugely frustrating that the government pays so much money for SAS licences and training when most people use it for simple use cases, where they would be better picking up transferable skills (e.g. Python, SQL, R).
My understanding is that SAS is supposed to be good at processing very large datasets because it uses RAM efficiently (only the PDV, the program data vector, is stored in RAM). But in reality, a small minority of users are processing datasets that are too big for RAM (e.g. 16GB+), and there are probably better tools for the job in this use case.
One user here comments that SAS is like an 'improved Excel'. In fact, I find pandas much closer to Excel than SAS because (in the IPython notebook, at least) you get nice visual representations of your tables, and it usually isn't difficult to translate an Excel operation into a pandas one. I especially like the multi-index and pivot-table capabilities. With a background in VBA for Excel, it's also relatively easy to pick up Python.
None of this is quite so obvious in SAS, which has an unusual data step and macro programming language. It's very powerful, but quite unintuitive to begin with, due to its complete reliance on the program data vector.
I regularly produce data sets that are too big for RAM, and not having to worry about that in SAS was a luxury.
I'd say SAS's two big "Killer Features" are the DATA step and SAS Press. I have yet to find R or Python nearly as pleasant to work with for manipulating the data set itself when compared to SAS, and SAS Press is excellent at putting out books detailing a given type of analysis and how to implement it in SAS. I still turn to them for basic references even when not using SAS.
My company uses R, Shiny, and Rserve for nearly everything. R is a great programming language - if you need to quickly and efficiently develop stats-based features for medium-sized data.
R excels (get it?) at creating reproducible, fault-tolerant, consistent functions that can be automated, packaged, applied to a variety of data types, and then extended later.
Our web stack is Shiny on AWS, and we call our APIs built in R (ML, images, data, etc) from Android using Rserve.
A lot of the (programming?) criticisms of R will be 'solved' or become non-issues in the next few years. Multithreading, implicit vectorization, better memory handling, GPU functions, among other things, are all in the pipe :)
(That said, the syntax _is_ a little weird to get used to)
-----
* We're hiring for very senior positions in data science and more general R programmers. Contact me if you're interested (JasonCEC [at] Gastrograph.com)
The problem with R is that it's just not a very good programming language. It's great for interactive analysis, but dismal for building higher level abstractions. It's like the PHP or MySQL of the data analysis world. Data types get magically converted all over the place, the global namespace is just a giant playground for every module to pollute, it has something like 5 different object systems all with subtle differences. All the defaults that are set for the convenience of interactive use undermine any kind of reliable use for building on as a platform (for example, the "simplification" concept where a 1 column data frame often magically turns into a vector).
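A minimal illustration of that simplification behavior (toy data, but the defaults shown are R's real defaults):

```r
df <- data.frame(a = 1:3, b = letters[1:3])
class(df[, c("a", "b")])        # "data.frame"
class(df[, "a"])                # "integer" -- the one-column result silently became a vector
class(df[, "a", drop = FALSE])  # "data.frame" -- you must opt out of simplification explicitly
```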
I've forced myself to use R intensively for a couple of years now, but I must say it's still a relief every time I bail out and get back to a "real" programming language.
"""In the article, Ms. Milley said, “I think it addresses a niche market for high-end data analysts that want free, readily available code. We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
To her credit, Ms. Milley addressed some of the critical comments head-on in a subsequent blog post."""
(Boeing uses R heavily and when you fly on their aircraft, you're flying on open source)
Since SAS is a relatively simple language, why can't someone just write a transcompiler that supports a subset of SAS and move it to R? That way you have the best of both worlds (sort of).
The most difficult thing about that is how you would treat "by" statements (SAS) vs the split-apply-combine (R).
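Roughly, the correspondence looks like this (a sketch with hypothetical column names `group` and `x`):

```r
# SAS:  proc means data=df mean; by group; var x; run;
# R, split-apply-combine spelled out:
df <- data.frame(group = rep(c("a", "b"), each = 5), x = rnorm(10))
parts <- split(df, df$group)                    # split by the BY variable
means <- lapply(parts, function(d) mean(d$x))   # apply per group
do.call(rbind, means)                           # combine the results
# or in one step:
aggregate(x ~ group, data = df, FUN = mean)
```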
Self-plug: I sort of made a quick hack about a month ago for SAS-to-Python; I'm sure someone with more programming experience than me could produce something much better (I come from a maths background).
SAS isn't a language. It's a suite of (semi-integrated) products featuring numerous languages. Key among these are the SAS DATA step (analogous to awk), the SAS Macro language (which shares terms but is in fact distinct), a number of other global languages, a number of domain-specific languages (TABULATE, LOGISTIC, GRAPH, etc.), and some proprietary implementations of general standards (SAS SQL).
The best model for mapping from SAS to R is not to try to support all of SAS's functionality within R, but to use multiple tools. When UNIX was first created, with the awk and S languages ('R' is iterated 'S'; S-PLUS is the proprietary extension of S), awk was seen as the data pre-processor for S. Today you'd likely use R (statistical analyses, graphics, matrix language), awk, Perl, Python, Ruby, or C (general programming / data manipulation), a database tool such as SQLite or PostgreSQL (both of which have considerable analysis capabilities of their own), and other tools as appropriate.
Trying to do everything in a single tool is a domain/application mismatch.
BAE Systems has a product called NetReveal that includes a compiler called "DataServer" which compiles a large subset of SAS into Java. Legend has it that the original author wrote the first version of DataServer in 6 hours on a train ride to visit his mother.
As a programmer, I was constantly frustrated by it. I felt that SAS as a language was pretty restrictive, especially when your algorithm wasn't a natural fit for the dataset model that it uses. It was like trying to shove a square peg into a round hole. It wasn't uncommon for me to sneak some Java straight into the output, but that wasn't really a sustainable / maintainable way to use it.
So your suggestion is doable, but I'm not convinced it's worthwhile.
I used to work for one of the largest U.S. insurance companies. They were always behind in transitioning to new technology (Excel 2003 could be found there in 2013). That being said, the entire statistical modeling team and research department made the switch to R and Python. Only a few clung to SAS, but they realized they would be forced to move to R, as any collaboration would need to be converted to R and not to SAS.
I believe the future will be R versus Python, and SAS will not be a part of it.
I work in big insurance, too. How did you handle porting your legacy code into (R|Python)?
We had tens of thousands of lines of SAS code (macros, clinical programs, reporting functionality) built up over the course of 10 years, and the higher-ups never saw switching to Python or R as feasible, especially with the Biostats team working on existing projects.
Stata is seen as a less powerful, more usable, and cheaper version of SAS; it also comes off as less flexible than R when it comes to more complex queries. Companies with large teams can afford what a statistician would see as a better tool; R is seen as the favored tool for the lone analyst with an extensive background, or the hard-science academic. Stata, on the other hand, is mainly favored by social science practitioners and some of the more statistics-inclined marketing people, neither of whom dwell on Hacker News much.
Stata is great as an Excel replacement and has a decent library of pre-programmed statistical models. It's GUI-based and targets applied researchers in economics etc. Unfortunately, extending any functions and defining your own models is quite a hassle unless you're familiar with Mata (quirky MATLAB-esque syntax), and R's ecosystem is much more cutting-edge.
Stata is great for drag-and-drop data manipulation from what I remember - an excellent Excel replacement. Nothing beats dragging and dropping in variables and typing `reg y x1 x2` for immediate regression results.
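For comparison, the R equivalent is nearly as terse (a sketch assuming a data frame `d` with columns `y`, `x1`, `x2` - hypothetical names):

```r
fit <- lm(y ~ x1 + x2, data = d)  # OLS regression, analogous to Stata's `reg y x1 x2`
summary(fit)                      # coefficient table, R^2, F-statistic
```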
Is this even a discussion? Anyone serious about analyzing data will use either R, Python (with pandas/SciPy, etc.), or Julia. For truly immense data sets that require pipelines, you'll use tools like Spark, Hadoop, etc. - but SAS is basically a slightly improved Excel.
I'm an R person, but I used SAS in a past life, and I have to say, this is very wrong. SAS is nothing at all like an improved Excel in look, operation, or user-base. It can do much of what R can do (build forecast models, ML models, even OR models), but it just looks very different.
And the licensing model will make you pull your hair out. I ran into issues both with geography (not being allowed to use my license on a project in another country) and functionality (hey, that's a cool PROC...wait, I don't have the license to call it)
I think R and python will win the day, but it's not because SAS is anything like Excel. And there are a shit ton of "serious" data analysis people using SAS. They're just all in the enterprise. Every Fortune 500 company I've worked with used SAS except for one (who used R).
SAS is a pretty big system; I have worked with it for about 10 years, and I can't see any similarity with Excel at all. Which part of the SAS system do you think resembles Excel? Here is a list of their products (http://support.sas.com/documentation/productaz/index.html).
Btw, I do think I am doing serious data analysis in a bank :-)
That's my guess. Personally, for 99% of my uses for SAS, all I see is the script window, the log, and the output window. That's about as far from 'Excel' as you can get while still having an interface.
Being a slightly improved Excel has significant advantages as well:
It's more accessible to people who are not coders but need to do statistics on real data, i.e. almost all researchers.
Although personally, I would use R/Python (the UI is more suited to me, and it's free, and I trust the results a little more, and I can read the code if I need to)... do you actually need anything more than SAS/SPSS provides? I doubt it.
I'd argue that SAS is much more accessible to non-coders than either R or Python, in that the SAS language is mostly an invocation of statistical procedures, rather than a full-on programming language.
Coming from a field dominated by SAS, moving to R was a pretty steep transition; approaching R after having learned Python, things made a great deal more sense, thanks to having been exposed to other programming languages.