Interesting Data Sets for Statistics

mikecb · on May 29, 2014

In related news, the Global Database of Events, Language, and Tone (GDELT) is now available in BigQuery, for free. [1]

[1] http://googlecloudplatform.blogspot.com/2014/05/worlds-large...

kevinwang · on May 30, 2014

The Reddit data isn't actually the top 2.5 million posts - it's the top 1000 posts of each of the top 2500 subreddits. An important distinction to make if anyone's planning to do statistical analyses on the set.
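To make the distinction concrete, here is a minimal sketch with made-up toy data (the subreddit names and scores are hypothetical, not taken from the actual set). It shows that "top k per subreddit" and "top N overall" select different posts, even when both yield the same count (1000 × 2500 = 2.5 million in the real set).

```python
from collections import defaultdict

def top_k_per_group(posts, k):
    """Keep the k highest-scoring posts within each subreddit."""
    by_sub = defaultdict(list)
    for sub, score in posts:
        by_sub[sub].append((sub, score))
    kept = []
    for items in by_sub.values():
        kept.extend(sorted(items, key=lambda p: p[1], reverse=True)[:k])
    return kept

def top_n_overall(posts, n):
    """Keep the n highest-scoring posts regardless of subreddit."""
    return sorted(posts, key=lambda p: p[1], reverse=True)[:n]

# Toy data (invented for illustration): one large subreddit, one small one.
posts = [
    ("big", 900), ("big", 800), ("big", 700),
    ("small", 30), ("small", 20), ("small", 10),
]

per_group = top_k_per_group(posts, k=2)  # how the Reddit set was built
overall = top_n_overall(posts, n=4)      # what "top N posts" would mean

# Same size, different contents: the per-group sample keeps a low-scoring
# post from the small subreddit and drops a high-scoring one from the big one.
print(sorted(set(per_group) - set(overall)))
print(sorted(set(overall) - set(per_group)))
```

So score distributions in the set are conditioned on subreddit, and any cross-subreddit comparison should account for that.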

sytelus · on May 30, 2014

Surprisingly, it doesn't mention HN itself, which is a treasure trove of data. I know there are APIs to download HN content, but is there a permanent location for an HN data dump (the way Stack Overflow publishes its data dump on the Internet Archive)? This is a great article, BTW, for anyone who wants some cool projects to do in data mining and machine learning.

j2kun · on May 30, 2014

I currently have a subset of one year's worth of all HN stories and comment trees, organized by story, but it's on my local machine. Where is a good place to post it? It's quite big, on the order of multiple GB.

The problem (if you want an easy scraper) is that the HN API limits you to 1k requests per hour. So it took me about 10 days of continuous running and restarting because of random crashes to get all the data.
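For anyone attempting the same thing: at 1k requests per hour, 10 days allows at most 1000 × 24 × 10 = 240,000 requests, so surviving restarts matters more than raw speed. Below is a rough sketch of a throttled, resumable loop. It is not the scraper described above; `fetch` is a hypothetical callable you would supply, the checkpoint file format is an assumption, and no real HN endpoint is called here.

```python
import json
import os
import time

REQUESTS_PER_HOUR = 1000  # the limit quoted in the comment above

def min_seconds_between_requests(limit_per_hour=REQUESTS_PER_HOUR):
    """Smallest even spacing that stays within the hourly limit."""
    return 3600.0 / limit_per_hour

def load_done(checkpoint_path):
    """Return the ids already saved, skipping any half-written last line."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            for line in f:
                try:
                    done.add(json.loads(line)["id"])
                except (ValueError, KeyError):
                    continue  # truncated line from a crash
    return done

def crawl(item_ids, fetch, checkpoint_path, delay=None, sleep=time.sleep):
    """Fetch each id once, appending results so a restart can resume.

    `fetch` is a hypothetical function (id -> JSON-serializable data).
    Returns the number of items fetched during this run.
    """
    if delay is None:
        delay = min_seconds_between_requests()
    done = load_done(checkpoint_path)
    fetched = 0
    with open(checkpoint_path, "a") as out:
        for item_id in item_ids:
            if item_id in done:
                continue  # already saved by an earlier run
            record = {"id": item_id, "data": fetch(item_id)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # keep progress on disk in case of a crash
            fetched += 1
            sleep(delay)
    return fetched
```

After a crash, calling `crawl` again with the same checkpoint path skips everything already written, so no request budget is spent twice.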

privong · on May 30, 2014

Looks like an interesting list of datasets, but it's such a large number that it's tough to get a feel for what all is in it (without reading a lot of the entries). I wonder if some sort of organized table might be a way to present the information in a more skimmable fashion.

Hortinstein · on May 29, 2014

I think it will be really interesting in a few years when people start doing in-depth analysis of the Bitcoin blockchain (though some is going on today). If Bitcoin hits mainstream adoption, it may be the first time ever that someone can run analysis on a complete financial system, and that's not even including the applications built on top of the blockchain.

Hortinstein · on June 6, 2014

Follow-up: this is the kind of stuff I was alluding to:

http://www.technologyreview.com/view/527906/data-mining-reve...

mxfh · on May 30, 2014

For starters, and for people who miss their R sample sets, there is a pretty well-maintained archive of 731 of them available as CSV at http://vincentarelbundock.github.io/Rdatasets/

Index: http://vincentarelbundock.github.io/Rdatasets/datasets.html

GitHub: https://github.com/vincentarelbundock/Rdatasets

sytelus · on May 30, 2014

So someone put 2.5 million Reddit posts on GitHub. I was thinking about doing the same for the HN data I've downloaded (1.3 million stories, ~1.7GB of JSON).

Does GitHub have any restrictions on hosting data files like this?
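Whatever the host's limits turn out to be, one option is to split the dump into compressed shards. Here is a minimal sketch, assuming the data is an iterable of JSON-serializable records; the `max_bytes` cap is a placeholder to set from the host's documentation (it is not GitHub's actual limit), and the `part-NNNNN.jsonl.gz` naming is just an illustration.

```python
import gzip
import json
import os

def write_shards(records, out_dir, max_bytes):
    """Write records as gzip-compressed JSON lines, starting a new shard
    whenever the *uncompressed* size would exceed max_bytes.

    A single record larger than max_bytes still gets its own shard.
    Returns the list of shard paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    try:
        for record in records:
            line = (json.dumps(record) + "\n").encode("utf-8")
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part-%05d.jsonl.gz" % len(paths))
                out = gzip.open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Capping on uncompressed size is conservative, since the gzip files on disk will normally be smaller than the cap for text-heavy JSON.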

jmpe · on May 30, 2014

Their guidelines suggest using Dropbox for that:

https://help.github.com/articles/what-is-my-disk-quota

https://help.github.com/articles/working-with-large-files

alok-g · on May 31, 2014

Let me know where you upload this. Thanks.

shobhitverma · on May 29, 2014

I have always wanted to create a Data Science training course which finds the right dataset to expose the power of the technique in question. I think this dataset will give me a good start. Thank you!

armenarmen · on May 30, 2014

I would be way interested in your course!

ejain · on May 29, 2014

This looks like a good resource. But be sure to understand all the implicit assumptions in each data set before announcing your amazing discoveries!

joshu · on May 29, 2014

Some stuff I hadn't seen before.

atestu · on May 29, 2014

This is a gold mine, thank you!

toxik · on May 29, 2014

Good article!