Foursquare dataset free to download and analyze

sneak · on Oct 8, 2013

Obligatory:

https://en.wikipedia.org/wiki/AOL_search_data_leak

http://techcrunch.com/2006/08/06/aol-proudly-releases-massiv...

http://www.nytimes.com/2006/08/09/technology/09aol.html?page...

TL;DR: It's fairly easy to deanonymize datasets like this, provided they are somewhat complete.

route66 · on Oct 8, 2013

The dataset comes with GPS locations of users and venues. With that data alone you can retrieve individual addresses in not too densely populated areas. Missing links will get caught in the social net.

We already learned that Warhol's 15 minutes of fame should read 15 megabytes, but, to cut the "it's the user's choice to post that data" apologists short: almost no one I speak to understands the implications of all possible interpretations, classifications and groupings that their online traces allow.

sneak · on Oct 8, 2013

Furthermore, it's not the users' fault that 4sq isn't sufficiently rate limiting or otherwise protecting this data. Why should arbitrary users be able to see the social graph of others they're not friends with? Also, why should people outside of your immediate one-hop social graph be able to see your checkins?

Giving it to 4sq for data mining is different than giving it to UMN and/or the whole internet for data mining and/or deanonymization.

r721 · on Oct 8, 2013

Also this:

http://33bits.org/2009/05/13/your-morning-commute-is-unique-...

danso · on Oct 8, 2013

Jesus Christ. The bulk scraping in violation of the TOS is egregious enough, but redistributing it with a mandate that the researchers get credit? For what, scraping a generous public API?

Irishsteve · on Oct 8, 2013

This is common practice. It serves two purposes: 1) it makes it easier to find the dataset that was used for experiments, and 2) it improves the author's citation count, which is usually important in research.

mzs · on Oct 8, 2013

That's true, but here is the paper from Microsoft Research, and it doesn't seem to explain how those data files were generated:

http://research.microsoft.com/pubs/156453/icde12_lars.pdf

nicholassmith · on Oct 8, 2013

That doesn't look like Foursquare has handed that over. What's the legality of scraping a service for their data in this way?

jorgeortiz85 · on Oct 8, 2013

I'm a Foursquare engineer. We have explanations of our API policies here: https://developer.foursquare.com/overview/community

We'll be contacting this researcher to ask where they got this data and whether it conforms to our policies.

nicholassmith · on Oct 8, 2013

Thanks for responding. I didn't realise Foursquare actually gave so much of their data away freely, which makes it slightly seedier that someone has scraped the rest and dumped it online.

onedev · on Oct 8, 2013

That doesn't look completely legal to me either.

ozh · on Oct 8, 2013

I don't get why scraping publicly and freely accessible data would be illegal. The redistribution under their own terms is another matter, though.

chimeracoder · on Oct 8, 2013

> publicly and freely accessible data would be illegal.

Data is not "publicly and freely accessible" if accessing it requires you to agree to separate terms of service for it that restrict your ability to access and redistribute it.

(Whether or not one believes the data should be freely and publicly accessible is a separate matter, but given the above, it's hard to make the case that it is.)

Amusingly, this data still isn't "freely accessible", because these people have attached their own, separate terms to reusing and redistributing the data.

mvanvoorden · on Oct 8, 2013

If you have to agree to specific terms, but are able to access everything without complying with these terms, this effectively means that the data is publicly and freely accessible.

Rules by themselves don't restrict anything; enforcement does.

joshu · on Oct 8, 2013

then re-releasing it with distribution terms is probably not legal at all.

galapago · on Oct 8, 2013

http://webcache.googleusercontent.com/search?q=cache:hLI5FqD...

(the direct link is not working, but this confirms that it was freely available)

boothead · on Oct 8, 2013

No mention of the data format. Is it JSON, CSV, or what? I know you can always head -n the file, but a little hint would be helpful!

nachi · on Oct 8, 2013

It looks like an ASCII-formatted table. Pretty disappointing that it isn't machine readable out of the box.

timthorn · on Oct 8, 2013

What's not machine readable about an ASCII table? A fixed-width table has its own advantages over e.g. CSV: for instance, to read a specific field you can reach it by offset rather than having to count delimiters.
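To make the offset point concrete, here is a minimal Python sketch of reading a field from a fixed-width row by slicing. The column names, offsets and sample row are made up for illustration; they are not the layout of the actual dataset.

```python
# Hypothetical fixed-width layout: (name, start offset, end offset).
# These offsets are assumptions for illustration, not the real file's layout.
COLUMNS = [("user_id", 0, 8), ("venue_id", 8, 16), ("rating", 16, 20)]


def parse_fixed_width(line):
    """Slice every field out of the line by its character offsets."""
    return {name: line[start:end].strip() for name, start, end in COLUMNS}


def read_field(line, name):
    """Jump straight to one field without scanning for delimiters."""
    for col, start, end in COLUMNS:
        if col == name:
            return line[start:end].strip()
    raise KeyError(name)


# Build a sample row whose fields are right-aligned to the widths above.
sample = f"{42:>8}{1337:>8}{5:>4}"
print(parse_fixed_width(sample))
print(read_field(sample, "rating"))
```

The trade-off is that the offsets have to be known in advance, whereas a delimited file carries its own field boundaries.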

cwmma · on Oct 8, 2013

fixed width might have technical advantages but CSV has the advantage of being able to be read by a lot of things out of the box.

timthorn · on Oct 8, 2013

So does fixed width. Excel, MySQL, R, Perl... :)

aw3c2 · on Oct 8, 2013

sed 's/[[:blank:]]//g' dataset.dat | sed 's/|/\t/g'
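A similar cleanup can be sketched in Python. This assumes, as the sed command suggests, that the file is a pipe-separated table padded with spaces, possibly with a ruler line of dashes under the header. The sample rows below are invented for illustration and the real columns may differ.

```python
import csv
import io


def table_to_tsv(lines):
    """Convert pipe-separated, space-padded rows into tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and ruler lines made only of '-' and '+'.
        if not stripped or set(stripped) <= set("-+"):
            continue
        # Split on every pipe and trim the padding around each cell.
        writer.writerow([cell.strip() for cell in stripped.split("|")])
    return out.getvalue()


# Hypothetical sample; the real file's columns and values may differ.
sample = [
    " id | latitude  | longitude \n",
    "----+-----------+-----------\n",
    "  1 | 45.052378 | -92.99486 \n",
]
print(table_to_tsv(sample), end="")
```

Unlike deleting every blank character, this only trims the padding around each cell, so a value that contains a space (a timestamp, say) would survive intact.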

interskh · on Oct 8, 2013

> This data set contains 2153471 users, 1143092 venues, 1021970 check-ins, 27098490 social connections, and 2809581 ratings that users assigned to venues

The number of check-ins seems to be low compared to other numbers.
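For a rough sense of scale, the quoted counts can be divided by the number of users. This is only arithmetic on the figures above; how a "social connection" is counted (per pair or per direction) is not stated, so that ratio in particular should be read loosely.

```python
# Counts quoted from the dataset description above.
counts = {
    "users": 2153471,
    "venues": 1143092,
    "check_ins": 1021970,
    "social_connections": 27098490,
    "ratings": 2809581,
}


def per_user(counts):
    """Divide every other count by the number of users."""
    users = counts["users"]
    return {name: n / users for name, n in counts.items() if name != "users"}


for name, ratio in per_user(counts).items():
    print(f"{name}: {ratio:.2f} per user")
```

That works out to roughly 0.47 check-ins per user, against about 1.3 ratings and about 12.6 social connections per user, so the data holds fewer than one check-in for every two users.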

davidmat · on Oct 8, 2013

Could anyone recommend some solid introductory material on data analysis/data visualisation?

I'm thinking this data set seems like a fun way to fill a rainy weekend, going for a dive into these worlds :)

benmanns · on Oct 8, 2013

The Social Network Analysis course on Coursera started yesterday. [0]

0: https://www.coursera.org/course/sna

m4tthumphrey · on Oct 8, 2013

Looks like it's been removed. Damn.

Edit: Not removed, just inaccessible. 403.

annnnd · on Oct 8, 2013

I'm quite sure it will surface somewhere. Looking forward to it. :)

EDIT: I would love to get my hands on this data... anyone? :)

mvanvoorden · on Oct 8, 2013

I have it, but "The user may not redistribute the data without separate permission."

But if you have any questions about it, I can try to answer them, my username here also corresponds to a gmail account I use ;)

tlhunter · on Oct 8, 2013

Looks like the data is only up-to-date as of July 2012 (judging from the zip compression times).
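The timestamps stored in a zip archive can be read with Python's standard zipfile module. The sketch below builds a tiny in-memory archive so that it runs on its own; the member name and date in the demo are invented, not taken from the real download.

```python
import io
import zipfile


def member_times(zip_source):
    """Map each archive member to its stored modification time."""
    with zipfile.ZipFile(zip_source) as archive:
        # date_time is a (year, month, day, hour, minute, second) tuple.
        return {info.filename: info.date_time for info in archive.infolist()}


# Demo archive with a made-up member name and timestamp.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    info = zipfile.ZipInfo("checkins.dat", date_time=(2012, 7, 1, 12, 0, 0))
    archive.writestr(info, "demo")

print(member_times(buf))
```

For a file on disk, pass its path instead of the buffer. Note that these timestamps only show when the files were last modified before being zipped, which is an upper bound on the data's age rather than a record of when it was collected.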

xntrk · on Oct 8, 2013

sounded too good to be true. I guess we'll have to find it on bittorrent.

dotBen · on Oct 8, 2013

filename was "umn_foursquare_datasets.zip" in case that helps

galapago · on Oct 8, 2013

http://ul.to/betfn1vh

rajbala · on Oct 8, 2013

The data set has been removed?

lgas · on Oct 9, 2013

http://archive.org/details/201309_foursquare_dataset_umn

waynesonfire · on Oct 9, 2013

why was this not posted as a torrent?