The dataset comes with GPS locations of users and venues. With that data alone you can retrieve individual addresses in areas that aren't too densely populated. Missing links will get caught in the social net.
We already learned that Warhol's 15 minutes of fame should read 15 megabytes, but, to cut the "it's the user's choice to post that data" apologists short: almost no one I speak to understands the implications of all the possible interpretations, classifications and groupings that their online traces allow.
Furthermore, it's not the users' fault that 4sq isn't sufficiently rate limiting or otherwise protecting this data. Why should arbitrary users be able to see the social graph of others they're not friends with? Also, why should people outside of your immediate one-hop social graph be able to see your checkins?
Giving it to 4sq for data mining is different than giving it to UMN and/or the whole internet for data mining and/or deanonymization.
Jesus Christ. The bulk scraping in violation of the TOS is egregious enough, but redistributing it with a mandate that the researchers get credit? For what, scraping a generous public API?
This is common practice. It serves two purposes: 1) it makes it easier to find the dataset that was used for experiments, and 2) it improves the citation count for the author, which is usually important in research.
Thanks for responding, I didn't realise Foursquare actually gave so much of their data away freely. Which makes it slightly more seedy that someone has scraped the rest and dumped it online.
> publicly and freely accessible data would be illegal.
Data is not "publicly and freely accessible" if accessing it requires you to agree to separate terms of service for it that restrict your ability to access and redistribute it.
(Whether or not one believes the data should be freely and publicly accessible is a separate matter, but given the above, it's hard to make the case that it is.)
Amusingly, this data still isn't "freely accessible", because these people have attached their own, separate terms to reusing and redistributing the data.
If you have to agree to specific terms, but are able to access everything without complying with those terms, this effectively means that the data is publicly and freely accessible.
Rules by themselves don't restrict anything, enforcement does.
What's not machine readable about an ASCII table? A fixed-width table has its own advantages over e.g. CSV: for instance, to read a specific field you can reach it by offset rather than having to count delimiters.
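To make the offset point concrete, here's a rough sketch in Python. The column names and widths are made up for illustration (I haven't checked them against the actual files), so treat `FIELDS` as an assumption rather than the dataset's real layout.

```python
# Hypothetical fixed-width layout: (start, end) character offsets per column.
# These widths are assumptions for illustration, not the dataset's real schema.
FIELDS = {
    "user_id": (0, 8),
    "venue_id": (8, 16),
    "rating": (16, 20),
}


def read_fixed_field(line, name):
    """Slice a field straight out of the line by its known offsets."""
    start, end = FIELDS[name]
    return line[start:end].strip()


def read_csv_field(line, index, delimiter=","):
    """Naive CSV access: split on every delimiter, then pick the index.

    Ignores quoting; real CSV should go through the csv module.
    """
    return line.split(delimiter)[index].strip()


# The same made-up record in both formats.
fixed_line = f"{'42':>8}{'1337':>8}{'4':>4}"
csv_line = "42,1337,4"

venue_from_fixed = read_fixed_field(fixed_line, "venue_id")
venue_from_csv = read_csv_field(csv_line, 1)
```

The fixed-width version only ever touches the slice it needs, while the CSV version has to walk the line to find the delimiters first. The trade-off is that the widths have to be known up front and every value has to fit in its column.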
> This data set contains 2153471 users, 1143092 venues, 1021970 check-ins, 27098490 social connections, and 2809581 ratings that users assigned to venues
The number of check-ins seems low compared to the other numbers: there are fewer check-ins than there are users or venues.
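For anyone who wants the ratios, a quick back-of-the-envelope in Python using only the counts quoted above. The variable names are mine, and I'm not assuming anything about how the dataset counts a "social connection" (it may well count each direction separately).

```python
# Counts as quoted in the dataset description above.
counts = {
    "users": 2153471,
    "venues": 1143092,
    "check_ins": 1021970,
    "social_connections": 27098490,
    "ratings": 2809581,
}

# Average number of each record type per user.
per_user = {
    name: value / counts["users"]
    for name, value in counts.items()
    if name != "users"
}

for name, ratio in per_user.items():
    print(f"{name} per user: {ratio:.2f}")
```

Check-ins come out at less than one per user on average, while ratings and social connections are both above one per user, which is what makes the check-in count look odd.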
https://en.wikipedia.org/wiki/AOL_search_data_leak
http://techcrunch.com/2006/08/06/aol-proudly-releases-massiv...
http://www.nytimes.com/2006/08/09/technology/09aol.html?page...
TL;DR: It's fairly easy to deanonymize datasets like this, provided they are somewhat complete.