Hacker News BigQuery Dataset (cloud.google.com)
158 points by svdr on March 4, 2019 | 37 comments



Hi, Felipe Hoffa at Google here.

We're aware the dataset hasn't been updated for about a month, and we're working to fix it. You can track the issue here:

- https://issuetracker.google.com/issues/127132286

In the meantime you can still play with the dataset and dig into the full history of Hacker News, minus this last month. I left some interesting queries to get you started here:

- https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-d...
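
For example, here's a minimal sketch of a query to count stories and comments per year over the full table (table and column names match the ones used in other queries in this thread; treat it as a starting point rather than the canonical query):

    #standardSQL
    -- Stories and comments per year across the full HN history
    SELECT
      EXTRACT(YEAR FROM timestamp) AS year,
      COUNTIF(type = 'story') AS stories,
      COUNTIF(type = 'comment') AS comments
    FROM `bigquery-public-data.hacker_news.full`
    GROUP BY year
    ORDER BY year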


Wow, what timing.

Late last night I had a conversation with someone explaining that Hacker News is not your typical message board -- it's owned and operated by YC and sits atop algorithms developed by some of the pioneers in spam and anomaly detection [1] [2], and it's also an open dataset -- analyzed and scrutinized -- used by hackers worldwide to train and test bespoke AI.

HN is a live MNIST [3] for anomaly detection.

[1] http://www.paulgraham.com/spam.html

[2] http://googlesystem.blogspot.com/2007/07/paul-buchheit-man-b...

[3] http://yann.lecun.com/exdb/mnist/


How do you run the import? I'd love to read more about how you consume the data.


I would also like to know where the data comes from!


Any particular reason you don't include user profiles in the dataset? I ended up pulling them myself using the API...


I've been using the HN API to maintain a BigQuery table of all posts, comments, and URLs on HN for a while now. I use it to put this site together: https://hntrending.com/. BQ is awesome.

It's a side project, so it may have some issues!


Looks like it stopped updating as of February 2nd, but otherwise it's pretty reliable and, as noted in the description, it's free (you probably won't hit the 1TB limit working with this dataset). Here are a few queries I've run recently to get exact answers to ad hoc questions:

Top posts about bootstrapping (https://news.ycombinator.com/item?id=19258249):

    #standardSQL
    SELECT *
    FROM `bigquery-public-data.hacker_news.full`
    WHERE REGEXP_CONTAINS(title, '[Bb]ootstrap')
    ORDER BY score DESC
    LIMIT 100
Count of YC startup posts over time by month (https://news.ycombinator.com/item?id=19185946):

    #standardSQL
    SELECT TIMESTAMP_TRUNC(timestamp, MONTH) as month_posted,
    COUNT(*) as num_posts_gte_5
    FROM `bigquery-public-data.hacker_news.full`
    WHERE REGEXP_CONTAINS(title, 'YC [SW][0-9]{2}')
    AND score >= 5
    AND timestamp >= '2015-01-01'
    GROUP BY 1
    ORDER BY 1


Hey all,

I manage the BigQuery Public Datasets Program here at Google. You're right, the dataset was last updated February 2nd, but we intend to continue updating it. We had an issue on our end that disrupted our update feed; we're working to repair it now and get the latest data uploaded to BigQuery.


Great to hear! :)


One thing that I was missing last time I checked was comment ranking data. Neither score nor rank was there for comments posted in recent years. I understand that upvote counts are not available in the API, but ranking should be (as in, the order the comments appear on the page).


There is now a `ranking` field in the full HN dataset, although I haven't played with it and don't know how robust it is.
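
A quick way to check would be something like this (a sketch, assuming the field is simply called `ranking` in the full table):

    #standardSQL
    -- How many comments actually have a non-null ranking value?
    SELECT
      COUNTIF(ranking IS NOT NULL) AS comments_with_ranking,
      COUNT(*) AS total_comments
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'comment'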


There’s a field, yes, but it’s empty.


Weird! That doesn't come from us.


Top commenters of all time: tptacek is in 1st place with 33839 comments.

Hacker News is 12 years old. That's an average of about 7 comments per day since inception. Wow.

    #standardSQL
    SELECT
      author,
      count(DISTINCT id) as `num_comments`
    FROM `bigquery-public-data.hacker_news.comments`
    WHERE id IS NOT NULL
    GROUP BY author
    ORDER BY num_comments DESC
    LIMIT 100;


Don't use the `comments` table: it was last updated December 2017.

On the full table:

    #standardSQL
    SELECT
     `by`,
     COUNT(DISTINCT id) as `num_comments`
    FROM `bigquery-public-data.hacker_news.full`
    WHERE id IS NOT NULL AND `by` != ''
    AND type='comment'
    GROUP BY 1
    ORDER BY num_comments DESC
    LIMIT 100
tptacek is in first place with 47283 comments.


I added a simple API endpoint to access favorites on HN, since they weren't available in the normal API.

https://github.com/reactual/hacker-news-favorites-api


I built https://hnify.com/leaderboard.html using this dataset too. It's amazing to have so much data freely available to play with.


Last year I built a domain leaderboard based on this dataset: https://hnleaderboard.com — planning to update for 2019 soon!
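
The core of such a leaderboard can be sketched in a few lines (a simplified example, not the exact query behind the site; NET.HOST extracts the hostname from the submission URL):

    #standardSQL
    -- Domains ranked by number of story submissions, with total score
    SELECT
      NET.HOST(url) AS domain,
      COUNT(*) AS submissions,
      SUM(score) AS total_score
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'story' AND url IS NOT NULL
    GROUP BY domain
    ORDER BY submissions DESC
    LIMIT 25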


I'm actually fairly excited to learn about this. I painstakingly scraped HN to build:

https://hnprofile.com/

I’m excited about this alternative


Is there a way to download the dataset and query it locally in, for example, PostgreSQL or SQLite? How big is the database, 4G compressed?


A dataset like this is going to have a bunch of personal information in it. When it's distributed like this, how does that jibe with regulations like GDPR? If an HN user would like to delete all their comments, how would that request be forwarded to every user of this dataset?


If people are sharing their own PII in HN comments, they agreed to HN's T&Cs when signing up. Such T&Cs state (heavily trimmed for length):

By uploading any User Content you hereby grant [..] a nonexclusive, worldwide [..] irrevocable license to [..] distribute [..] your User Content for any Y Combinator-related purpose in any form [..]

Agreeing to the T&Cs and deliberately sharing information publicly covers the GDPR's "consent" lawful basis.

Even under GDPR this is not a situation where someone signed up for something else and then happened to have their personal data shared as a byproduct. They signed up to a site, agreed to the T&Cs, and then explicitly and deliberately shared their personal data.


The GDPR's super complicated, but I'm pretty sure its Right of Erasure, and specifically Article 7(3), which gives data subjects the right to withdraw consent at any time and says "it shall be as easy to withdraw consent as to give it", trumps any ridiculous "irrevocable" license to distribute your content in any form forever.

Also importantly, the GDPR requires that a controller not make a service conditional upon consent. Hacker News is likely not in compliance unless they make such data processing optional and require anyone interested to explicitly opt in.

But, then again, I'm not a lawyer, and even if I were, actual lawyers don't seem to know what the hell the GDPR actually requires either.


> its Right of Erasure, and specifically Article 7(3), which gives data subjects the right to withdraw consent at any time and says "it shall be as easy to withdraw consent as to give it", trumps any ridiculous "irrevocable" license to distribute your content

Correct. You can certainly attempt to assert your right of erasure with YC to erase your PII from their data (i.e. Hacker News).

But..! Because we give YC the right to distribute our content freely, we simultaneously accept that there may be many duplications and reproductions of this data. The consequence is that you must contact every user of that data yourself, on a one-by-one basis, to assert your right of erasure - there is no legal obligation for HN to track everyone who might have downloaded a legal archive of their data.


IAANAL, and I don't mean to single you out here, but this seemingly rational argument strikes me as subtle FUD. It's the type of argument that someone with a vested interest in collecting user data for profit might put forth in the hopes of polarizing the tech community and painting GDPR as out of touch with technical common sense.

Again, I'm not accusing you of anything here; I'm just pointing out who benefits from framing the conversation this way. So far there is a lot of precedent for small operators shutting down their sites out of fear of the GDPR, but there is no precedent for regulators actually going after small operators for anything resembling reasonable practices. The day may come when EU regulators try to crack down on forums that are unwilling or unable to redact users' messages post facto, but we're nowhere close to that today, and I don't see a strong reason to believe that's where we're headed either.


What about this is out of touch with technical common sense? If you export all your user data and syndicate it, why would it be so unreasonable to have a system in place to be able to syndicate requests to delete data as well?

All of us here are users of this forum, so this concerns the legal rights to our personal information. It’s not FUD for us to discuss how those rights are affected by things like this.


Let's leave syndication aside for a moment; I don't think it's unreasonable for a forum to have terms of use stating that you are participating in a public forum that needs to maintain its integrity. If people just go deleting their posts, it screws up the public discourse. I've actually had the experience of building and running a forum that allowed deleting your content in this way, and we had to remove the capability because trolls used it in a specifically destructive way.

Now this position is certainly debatable, but I think it's at least a reasonable argument that you could take to regulators. Contrast that with the bullshit that Facebook, Google and a zillion ad-tech companies are doing with our data every day. You're free to object to the syndication of HN data, but personally I feel that is a distraction from the issues GDPR is meant to address, and I am hoping regulators feel the same way.


That honestly scares me, though. A law that everybody is violating but that is only rarely enforced is a law that will be used to go after whoever the government doesn't like. Imagine a European hate speech forum that gets a lot of press sometime. The government will just step in and say "oops, looks like you're not in GDPR compliance!" and sue them into nonexistence.


I'm not really sure that matters. GDPR includes the right to withdraw consent and the right to erasure, unless there are specific legal reasons why data cannot be erased. Requests for erasure should be forwarded to third parties who use the data.


That is correct, but this is why the license matters so much. The license means there can and will ultimately be many distributed versions of the content, and you would need to go to those users one by one to withdraw consent or request erasure.


The data is already publicly available for use and reuse, just in a different form. Why would it be any different from the rules regarding the public/API display of the information?


Maybe I'm misunderstanding what this is, but isn't it possible to query, download, and process it at a completely different scale than with the API? If not, I suppose you might ask the same thing about the API.


I support this question. Any comments?

NB: It's easy to downvote without commenting...


The Internet is written in ink. You should assume that any and all public posts you make have already been replicated and archived by countless parties in countless ways by the time you hit delete. HN public postings are no different.

The HN API [1] has been around in various forms for years and includes the same public data that's used to generate the public pages on the HN site, but rather than returning HTML pages designed for human consumption, the API returns the data in a JSON serialized form [2] designed for machine consumption [3].

When the HN API went live, it reduced the overhead and redundant work of all the programmers having to independently crawl and parse the site. The HN BigQuery dataset is the same data returned by the HN API; Google just took the next step and did the work of loading it into BigQuery.

[1] https://github.com/HackerNews/API

[2] https://en.wikipedia.org/wiki/Category:Data_serialization_fo...

[3] https://en.wikipedia.org/wiki/Machine_to_machine


BigQuery keeps adding useless data.

What we truly need is Common Crawl data; then we could check specific sites on our own.

Or wait, BigQuery simply can't handle a Common Crawl-sized dataset in their public service!

Otherwise there is no reason not to add it; maybe it puts their search engine/ad business in jeopardy.

Is there any other BigQuery-like public dataset platform, not run by Google, where direct search engine/ad interests don't get in the way of searching and indexing Common Crawl-like data?


> Or wait, BigQuery simply can't handle a Common Crawl-sized dataset in their public service!

This is not true. Source: ex-googler.


Feel free to load it yourself if you need to query the Common Crawl. And pay for it.



