Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata (anyall.org)
31 points by lrajlich on March 12, 2009 | 9 comments



The biggest problem is that the design decisions of some of the projects listed are pretty poor. Who decided that dataset size should be limited to available memory? Not only does this impose a severe limitation on the framework, it also affects the programmer's mentality. It shows. Take a look at the R source (and the tests, cough) and you will see the assumptions.

[speaking from experience writing both in memory and disk based data analysis packages]

Disk-based work is pretty straightforward. It's not rocket science, and in most circumstances a disk-based design makes your code much, much faster, even when the data fits in memory.
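
To make that concrete, here is a minimal sketch (Python/pandas; the file and column names are made up) of the "everything fits in RAM" assumption versus a streaming, disk-aware design:

    # In-memory style vs. streaming style. "measurements.csv" and the
    # "value" column are hypothetical.
    import pandas as pd

    # In-memory style: fails (or swaps badly) once the file outgrows RAM.
    # df = pd.read_csv("measurements.csv")
    # total = df["value"].sum()

    # Disk-aware style: stream fixed-size chunks, keep only running state.
    total = 0.0
    count = 0
    for chunk in pd.read_csv("measurements.csv", chunksize=1_000_000):
        total += chunk["value"].sum()
        count += len(chunk)

    print("mean value:", total / count)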


I would like to think that it wouldn't be that hard to adapt an in-memory system to use something like a memory-mapped file (rough sketch below), or even a custom cached memory-mapped file. Of course, such a system might not be designed to avoid page swaps/cache misses.

Of course, this is about how the system could evolve; the possibility might not help an ordinary user right now.
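
Something like numpy.memmap is what I have in mind; in this sketch the file name, shape and dtype are invented, and, as said, nothing here is tuned to avoid page faults:

    # numpy.memmap behaves like an ordinary ndarray but is backed by a file
    # on disk, so only the pages you touch get pulled into memory.
    # File name, shape and dtype below are illustrative assumptions.
    import numpy as np

    n_rows, n_cols = 10_000_000, 4
    data = np.memmap("big_matrix.dat", dtype=np.float64,
                     mode="w+", shape=(n_rows, n_cols))  # "w+" creates a ~320 MB file

    # Code written against an in-memory array mostly works unchanged...
    col_mean = data[:, 0].mean()

    # ...but nothing here avoids page faults: a column-wise scan over this
    # row-major file still touches every page of the file.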


Well, having done exactly that, I can say you really do need a rewrite or a fundamental refactoring. It will touch most functions in your codebase.

Memory mapping won't help when you have a 100 GB+ file, and as you say, it gets slow because it's definitely not optimal.

You also need custom indexing structures and data caching strategies for the algorithms that aren't easily moved to disk, and unfortunately most aren't (toy sketch at the end of this comment). The other issue is that you end up doing a lot of research, because there just aren't many people who have done this. It's a time sucker.

I must say it was awesome seeing our decision tree system run on huge datasets (tested at over 100 GB) in a similar time (~30 seconds) to an in-memory database, after indexing.
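
Roughly the kind of caching strategy I mean, as a toy illustration and not our actual code; the block size and the fixed-size float64 record layout are assumptions:

    # A tiny LRU cache of fixed-size blocks read from a binary file of
    # float64 records. Record layout, block size and file format are all
    # made up for illustration.
    import struct
    from collections import OrderedDict

    RECORD_SIZE = 8            # one float64 per record (assumed layout)
    RECORDS_PER_BLOCK = 4096   # tuning knob: how much to pull per disk read

    class BlockCache:
        def __init__(self, path, max_blocks=64):
            self.f = open(path, "rb")
            self.max_blocks = max_blocks
            self.blocks = OrderedDict()   # block index -> list of floats

        def _load_block(self, block_idx):
            self.f.seek(block_idx * RECORDS_PER_BLOCK * RECORD_SIZE)
            raw = self.f.read(RECORDS_PER_BLOCK * RECORD_SIZE)
            n = len(raw) // RECORD_SIZE
            return list(struct.unpack(f"<{n}d", raw))

        def get(self, record_idx):
            block_idx, offset = divmod(record_idx, RECORDS_PER_BLOCK)
            if block_idx in self.blocks:
                self.blocks.move_to_end(block_idx)      # mark as recently used
            else:
                self.blocks[block_idx] = self._load_block(block_idx)
                if len(self.blocks) > self.max_blocks:  # evict the coldest block
                    self.blocks.popitem(last=False)
            return self.blocks[block_idx][offset]

The block granularity is the interesting knob: bigger blocks mean fewer seeks but more wasted reads when access is scattered, which is exactly the kind of tuning you end up researching per algorithm.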


SciPy + NumPy + matplotlib; get Enthought's distribution for a one-stop shop and an interactive shell via IPython. My 2 cents. Although, in the short term, Matlab legacy code is definitely a big plus for many scientific/data analysis applications.
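
For a taste of what that stack looks like in an IPython session (synthetic data, just so the snippet is self-contained): NumPy for the arrays, SciPy for the numerics, matplotlib for the plot.

    # Fit a straight line to noisy synthetic data and plot it.
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 200)
    y = 2.5 * x + 1.0 + np.random.normal(scale=2.0, size=x.size)

    slope, intercept, r, p, stderr = stats.linregress(x, y)

    plt.scatter(x, y, s=8, label="data")
    plt.plot(x, slope * x + intercept, color="red",
             label=f"fit: y = {slope:.2f}x + {intercept:.2f}")
    plt.legend()
    plt.show()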


I'm using R for largish datasets. When compiled on 64-bit Linux, it can address plenty of memory. The Windows version is limited to about 1.5 GB, even on 64-bit Windows. This is bound to change, as Revolution Computing has a 64-bit Windows R that works (and adds some parallel libraries), but it isn't free and it's still in beta. We are getting one.

I think 64-bit R + Resolver One (no limits on spreadsheet size) can get a lot done in a very visual way. I just happen to catch bugs faster when I color-code cells, but that is impossible to do in straight R (fix() and edit() suck!) or in Excel, because of its spreadsheet size limits. The combo I propose (I don't have it yet) will be expensive, but worth it, I think.


For the Windows crowd, there is also SQL Server Reporting Services, which is "free" if you already have a SQL Server license. It's more of a roll-your-own reporting package, but it's quite easy.


The article was helpful, but read the comments as well; they flesh out some real-life experiences from folks who have used multiple packages.


Where does Maple fit in here? I have repeatedly heard of it as an alternative to Matlab.


Maple is Mathematica's main competitor.



