I think the biggest problem is that for most of my statistical work, with Python I'd have to check whether an implementation exists, while with R I can safely assume one (or several) does.
So, I'm not a statistician, but I do wind up using a lot of (R) statistics tools in my various writings: http://www.gwern.net/tags/statistics R is a terrible, terrible language from my perspective as a Haskeller, but the libraries are just too fantastic for me not to use it.
A crude bit of shell (`fgrep --no-filename 'library(' *.page | sed -e 's/library//' | tr -d ')' | tr -d '(' | sort -u`) on my writings and a bit of cleanup gets me the following list of R libraries I use:
MASS
MKmisc
Rcapture
XML
arm
biglm
boot
changepoint
ggplot2
insol
languageR
lme4
lubridate
metafor
moonsun
prodlim
pwr
randomForest
randomForestSRC
reshape2
rjson
rms
stringr
survival
verification
Some of them aren't important, and I'm sure Python has equivalents to things like 'rjson', but what about the others? Let's look at a few.
- metafor: meta-analysis library (fixed- and random-effects models, forest plots, bias checks, etc.). I use this in 3 meta-analyses I've done, so it's kind of important. A google of 'python meta-analysis' turns up... nothing? Well, it turns up some Bayesian stuff, but nothing jumps out as a finished, polished library that one could easily reimplement my work in. ('METASOFT' sounds promising, but actually turns out to be for genomics and in Java.)
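(For context, the fixed-effect half of what metafor does is simple enough to sketch by hand; the random-effects model, forest plots, and bias diagnostics are where the library earns its keep. A minimal sketch of inverse-variance pooling, with made-up study numbers:

```python
import math

def fixed_effect_meta(effects, variances):
    """Inverse-variance-weighted fixed-effect pooling: each study
    is weighted by 1/variance, so more precise studies count more."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# three hypothetical study effect sizes with their sampling variances
pooled, se = fixed_effect_meta([0.30, 0.10, 0.25], [0.04, 0.02, 0.05])
```

The point is not that this replaces metafor, but that everything past this first step is exactly what you don't want to reimplement yourself.)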
- randomForest, randomForestSRC: random forests are apparently implemented by Scikit. Great. What about random *survival* forests, which I use in my Google shutdowns analysis to see if the standard survival model can be improved upon? 5th hit for 'python random survival forests' is... an R library.
- Rcapture: capture-recapture analysis, which I employ in my hafu analysis of anime/manga to try to estimate how many I may be missing. 'python capture-recapture analysis': the best-looking hit (http://www.mikemeredith.net/) is a false-positive. The relevant code is - you guessed it - in R.
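(The core two-sample estimator behind capture-recapture is easy to state, even if Rcapture's loglinear machinery for multiple lists is not. A sketch with invented counts, using Chapman's bias-corrected version of the Lincoln-Petersen estimator:

```python
def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected Lincoln-Petersen estimator:
    n1 = items seen in the first sample, n2 = items seen in the
    second, m = items seen in both; returns the estimated total
    population size, seen and unseen."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# hypothetical counts: 120 titles in one listing, 150 in another,
# 90 appearing in both
chapman_estimate(120, 150, 90)
```

The heavy overlap between the two samples here would suggest few unseen items; Rcapture's job is to do this properly across many partially-overlapping lists with heterogeneous capture probabilities.)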
- biglm: a package for incrementally updating a linear model, the idea being that you can load into memory as much data as fits, update the linear model, and load the next batch. I need this sucker for running even simple analyses on the Mnemosyne spaced repetition data set, where the database (~18GB) won't fit in my laptop's RAM. Some googling suggests that Scikit may support this. Or may not.
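(biglm's trick rests on the observation that least squares only needs a handful of running sums, not the raw data in memory. A minimal single-predictor sketch of the same idea; the real biglm maintains an incrementally-updated decomposition to handle many predictors, but the memory story is identical:

```python
class IncrementalLinearFit:
    """Simple 1-D least squares updated batch by batch: only running
    sums are kept, so memory stays constant however big the data is."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        # fold one batch into the sufficient statistics, then discard it
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        slope = ((self.n * self.sxy - self.sx * self.sy)
                 / (self.n * self.sxx - self.sx ** 2))
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

# feed the data in two "batches", as if streaming from disk
fit = IncrementalLinearFit()
fit.update([0, 1, 2], [1, 3, 5])
fit.update([3, 4], [7, 9])
fit.coefficients()  # → (1.0, 2.0), i.e. exactly y = 1 + 2x
```

Whether Scikit exposes an equivalent for ordinary linear models on out-of-core data is exactly the thing I'd have to go check.)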
- lme4: the most famous library for doing multilevel models (indispensable generalizations of linear models, great for dealing with multiple measurements from the same subject; I use them on psychology or sleep data, usually). All the Python alternatives appear to either be exercises or Bayesian.
- survival: survival analysis library, used for Google shutdowns. https://stats.stackexchange.com/questions/1736/survival-anal... says you can sorta do them in Python, but you're cautioned against it and it looks like the Python alternatives don't have the Cox proportional hazards needed to use covariates.
I could go on to look at the others (any ordinal logistic regression in Python? how're the bootstrap libraries? what if I need to do power calculations for standard tests?), but I hope this gives you an idea of why I'm not doing everything in Python.
Can you please list some important tools implemented in R but not in Python? I'd like to know how big the gap is.