The Reddit data isn't actually the top 2.5 million posts - it's the top 1000 posts of each of the top 2500 subreddits. An important distinction to make if anyone's planning to do statistical analyses on the set.
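To make the distinction concrete, here is a rough sketch on synthetic data (not the actual Reddit set). The community names, score distribution, and helper functions are all made up for illustration. It compares "top k per community" against "top n overall" of the same total size.

```python
import random
from collections import Counter

def top_k_per_group(posts, k):
    """Keep the k highest-scoring posts within each group."""
    by_group = {}
    for group, score in posts:
        by_group.setdefault(group, []).append(score)
    picked = []
    for group, scores in by_group.items():
        picked.extend((group, s) for s in sorted(scores, reverse=True)[:k])
    return picked

def top_n_overall(posts, n):
    """Keep the n highest-scoring posts regardless of group."""
    return sorted(posts, key=lambda p: p[1], reverse=True)[:n]

# Synthetic data: 50 hypothetical communities with 200 posts each, where
# larger communities tend to produce larger scores (an assumption).
random.seed(0)
posts = [
    (f"sub{i}", random.expovariate(1.0) * (i + 1))
    for i in range(50)
    for _ in range(200)
]

stratified = top_k_per_group(posts, k=20)
overall = top_n_overall(posts, n=len(stratified))

per_group_counts = Counter(group for group, _ in stratified)
overall_counts = Counter(group for group, _ in overall)

print("communities in per-group sample:", len(per_group_counts))
print("communities in overall sample:  ", len(overall_counts))
```

The per-group sample gives every community equal weight, while the overall sample is dominated by the high-scoring communities, so summary statistics computed on one do not describe the other.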
Surprisingly, it doesn't mention HN itself, which is a treasure trove of data. I know there are APIs for downloading HN content, but is there a permanent location for an HN data dump (the way Stack Overflow posts its data dump on the Internet Archive)? This is a great article, BTW, for anyone who wants some cool projects to do in data mining and machine learning.
I currently have a subset of one year's worth of all HN stories and comment trees, organized by story, but it's on my local machine. Where is a good place to post it? It's quite big, on the order of multiple GB.
The problem (if you want an easy scraper) is that the HN API limits you to 1k requests per hour. So it took me about 10 days of continuous running and restarting because of random crashes to get all the data.
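For a sense of scale: at 1,000 requests per hour, ten days of continuous running is at most 1,000 × 24 × 10 = 240,000 requests. Below is a minimal sketch of a scraper loop that stays under such a limit and survives failures on individual items. It is not the scraper described above; `fetch` is a hypothetical caller-supplied function (no real endpoint is assumed), and `HourlyThrottle` and `crawl` are illustrative names.

```python
import time

class HourlyThrottle:
    """Space calls evenly so that at most `limit` happen per `period` seconds."""

    def __init__(self, limit=1000, period=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = period / limit  # 3.6 s between calls at 1k/hour
        self.clock = clock
        self.sleep = sleep
        self.next_ok = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self.next_ok is not None and now < self.next_ok:
            self.sleep(self.next_ok - now)
            now = self.next_ok
        self.next_ok = now + self.interval

def crawl(ids, fetch, throttle, done=None):
    """Fetch each id once, skipping ids already in `done`.

    `fetch` is a hypothetical function taking an id and returning the item.
    A failed id is left out of `done` so a later pass can retry it.
    """
    done = set() if done is None else done
    results = {}
    for item_id in ids:
        if item_id in done:
            continue
        throttle.wait()
        try:
            results[item_id] = fetch(item_id)
        except Exception:
            continue  # skip for now instead of crashing the whole run
        done.add(item_id)
    return results
```

Persisting `done` to disk between runs (not shown) would let a restart pick up where the previous run stopped instead of starting over.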
Looks like an interesting list of datasets, but it's such a large number that it's tough to get a feel for what all is in it (without reading a lot of the entries). I wonder if some sort of organized table might be a way to present the information in a more skimmable fashion.
I think it will be really interesting in a few years when people start doing in-depth analysis of the Bitcoin blockchain (though some is going on today). If Bitcoin hits mainstream adoption, it may be the first time ever that someone can run analysis on a complete financial system, and that's not even including the applications built on top of the blockchain.
So someone put 2.5 million Reddit posts on GitHub. I was thinking about doing the same for the HN data I've downloaded (1.3 million stories, ~1.7GB of JSON).
Does GitHub have any restrictions on hosting data files like this?
I have always wanted to create a Data Science training course which finds the right dataset to expose the power of the technique in question. I think this dataset will give me a good start. Thank you!