Interesting Data Sets for Statistics (rs.io)
216 points by aficionado on May 29, 2014 | 17 comments



In related news, the Global Database of Events, Language, and Tone (GDELT) is now available in BigQuery, for free. [1]

[1] http://googlecloudplatform.blogspot.com/2014/05/worlds-large...


The Reddit data isn't actually the top 2.5 million posts - it's the top 1000 posts of each of the top 2500 subreddits. An important distinction to make if anyone's planning to do statistical analyses on the set.
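To make the distinction concrete, here is a minimal sketch that compares a global top-N with a top-k taken per subreddit. The (subreddit, score) pairs are invented toy values, not taken from the actual dataset, and the function names are hypothetical.

```python
from collections import defaultdict
from heapq import nlargest

# Toy (subreddit, score) pairs, invented purely for illustration.
posts = [
    ("big", 900), ("big", 800), ("big", 700),
    ("small", 50), ("small", 40), ("small", 30),
]


def top_overall(posts, n):
    """Return the n highest-scoring posts regardless of subreddit."""
    return nlargest(n, posts, key=lambda p: p[1])


def top_per_group(posts, k):
    """Return the k highest-scoring posts from each subreddit."""
    by_group = defaultdict(list)
    for sub, score in posts:
        by_group[sub].append((sub, score))
    picked = []
    for sub in by_group:
        picked.extend(nlargest(k, by_group[sub], key=lambda p: p[1]))
    return picked


overall = top_overall(posts, 4)       # dominated by the large subreddit
stratified = top_per_group(posts, 2)  # every subreddit contributes equally
print(sorted(set(overall) ^ set(stratified)))  # posts that appear in only one sample
```

The two samples have the same size but different members, so score distributions or per-subreddit shares computed on one would not carry over to the other.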


Surprisingly, it doesn't mention HN itself, which is a treasure trove of data. I know there are APIs to download HN content, but is there a permanent location for an HN data dump (like the Stack Overflow data dump on the Internet Archive)? This is a great article, BTW, for anyone who wants some cool projects to do in data mining and machine learning.


I currently have a subset of one year's worth of all HN stories and comment trees, organized by story, but it's on my local machine. Where is a good place to post it? It's quite big, on the order of multiple GB.

The problem (if you want an easy scraper) is that the HN API limits you to 1k requests per hour. So it took me about 10 days of continuous running and restarting because of random crashes to get all the data.
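At the stated limit, ten days of continuous running comes to at most 1,000 × 24 × 10 = 240,000 requests. A rough sketch of a throttled crawler that survives restarts might look like the following. The `fetch` callable is a hypothetical placeholder for whatever actually calls the API, and the JSON-lines checkpoint file is an assumption of this sketch, not something described above.

```python
import json
import time
from pathlib import Path


def crawl(ids, fetch, out_path, max_per_hour=1000, sleep=time.sleep):
    """Fetch each id at most once, staying under max_per_hour requests.

    Results are appended to out_path as JSON lines, so a restarted run
    skips ids that are already on disk. Returns the number of new fetches.
    """
    out = Path(out_path)
    done = set()
    if out.exists():
        with out.open() as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}

    delay = 3600.0 / max_per_hour  # seconds to wait between requests
    fetched = 0
    with out.open("a") as f:
        for item_id in ids:
            if item_id in done:
                continue  # already saved by an earlier run
            record = fetch(item_id)  # hypothetical: returns a JSON-serializable dict
            f.write(json.dumps({"id": item_id, "data": record}) + "\n")
            f.flush()  # keep the checkpoint current if the process dies
            fetched += 1
            sleep(delay)
    return fetched
```

Because every result is flushed as soon as it arrives, a crash loses at most the request that was in flight, and rerunning the same call picks up where it stopped.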


Looks like an interesting list of datasets, but it's such a large number that it's tough to get a feel for what all is in it (without reading a lot of the entries). I wonder if some sort of organized table might be a way to present the information in a more skimmable fashion.


I think it will be really interesting in a few years when people start some in-depth analysis of the Bitcoin blockchain (though some is going on today). If Bitcoin hits mainstream adoption, it may be the first time ever someone can run analysis on a complete financial system, and that's not even including the applications built on top of the blockchain.


Follow-up: this is the kind of stuff I was alluding to:

http://www.technologyreview.com/view/527906/data-mining-reve...


For beginners and people who miss their R sample datasets, there is a pretty well-maintained archive of 731 of them available as CSV at http://vincentarelbundock.github.io/Rdatasets/

Index: http://vincentarelbundock.github.io/Rdatasets/datasets.html

Github: https://github.com/vincentarelbundock/Rdatasets
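Since the archive is plain CSV, the files can be read without R. A minimal sketch using only the Python standard library follows. The helper names are hypothetical, and the exact CSV URL for a given dataset is left as a parameter to be copied from the index page rather than assumed here.

```python
import csv
import io
from urllib.request import urlopen


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def column_as_floats(rows, name):
    """Pull one column out of the parsed rows and convert it to floats."""
    return [float(row[name]) for row in rows]


def fetch_rdataset(url):
    """Download one CSV file and parse it.

    url is whatever CSV link the index page gives for the dataset;
    no particular URL layout is assumed in this sketch.
    """
    with urlopen(url) as resp:
        return read_csv_rows(resp.read().decode("utf-8"))
```

All values come back as strings from `csv.DictReader`, which is why numeric columns are converted explicitly.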


So someone put 2.5 million Reddit posts on GitHub. I was thinking about doing the same for the HN data I've downloaded (1.3 million stories, ~1.7 GB of JSON).

Does GitHub have any restrictions on hosting data files like this?



Let me know where you upload this. Thanks.


I have always wanted to create a Data Science training course which finds the right dataset to expose the power of the technique in question. I think this dataset will give me a good start. Thank you!


I would be way interested in your course!


This looks like a good resource. But be sure to understand all the implicit assumptions in each data set before announcing your amazing discoveries!


Some stuff I hadn't seen before.


This is a gold mine, thank you!


Good article!



