Stanford Large Network Dataset Collection

alialkhatib · on July 21, 2014

This is cool (although unless something has changed recently on that page that I'm not seeing, it's kind of old).

In the HCI group at Stanford we recently had a talk about massive open online courses (MOOCs), how Harvard and MIT recently made their edX data available (albeit to other researchers, not completely publicly)[0], and how Stanford could do the same with its own MOOC data. We were wrestling with the idea of making it available only to other established institutions the way Harvard and MIT did, but the irony (and maybe hypocrisy) of limiting MOOC data access to those within the ivory towers was not lost on us. The alternatives (scrubbing the data more rigorously, or to varying degrees depending on how much we trust the entity requesting it) seemed better, but they also have problems: how do you determine those levels, what if someone shares their privileged data with an untrusted individual, etc. (These are not unique problems; whenever we set a hurdle to clear, we have to worry that someone who clears it will break that wall down and mess the whole thing up[1].)

We're really struggling to come to a good solution on this problem in part because IRB protocols were not originally designed for this kind of stuff. They were imagined for the kinds of experiments where the data collection itself was what endangered participants, not the analysis. As a result, IRB approval for a protocol outlining the collection of data might not foresee every imaginable permutation of data analysis that could reveal embarrassing or incriminating details about participants.

I'm sorry, this is becoming a rant. The point is that we're talking about making more data - specifically more MOOC data - available for research and analysis. Hopefully we'll figure something out that will be interesting to (white hat) hackers without endangering participants if the data ends up in the hands of black hat hackers.

0: https://newsoffice.mit.edu/2014/mit-and-harvard-release-de-i...

1: case in point: http://en.wikipedia.org/wiki/AOL_search_data_leak

minimaxir · on July 21, 2014

The MOOC data from Harvard/MIT isn't researcher-only; the only limitation is that you can't redistribute the data. (It's very good data.)

I did a blog post on it and have not received any angry emails from either party: http://minimaxir.com/2014/07/online-class-charts/

alialkhatib · on July 21, 2014

Ah you're right, sorry about that.

That post is really interesting. I know ggplot2 does quite a bit of it, but the visualizations are really nice.

You mention significance a few times but I don't see alpha levels or significance testing per se; do you mean significant in the casual sense, or are you just withholding the stats talk for the audience of (most likely) laypeople? If it's the former, you might find statistical significance in even the avg % grade by gender, given a large enough sample size.
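To illustrate the sample-size point, here is a minimal sketch using a two-sample z-test on made-up numbers. The 0.5-point grade gap, the standard deviation of 20, and the function name are all assumptions for illustration, not values from the edX data or the blog post.

```python
import math

def two_sample_z_pvalue(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means.

    Assumes both groups share the same standard deviation and size,
    and that n is large enough for the normal approximation.
    """
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = (mean_a - mean_b) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical: the same tiny gap, tested at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_pvalue(50.5, 50.0, 20.0, n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The gap never changes, yet the p-value shrinks as n grows, which is why a significant result on a very large dataset says little by itself about whether the difference matters in practice.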

I might be overlooking the part where you talk about this, and I apologize if that's the case.

minimaxir · on July 21, 2014

In the casual sense. I've been having a little difficulty graphically conveying confidence intervals without making the charts unreadable/too complicated. The usual way is to use boxplots (which strictly show quartiles rather than confidence intervals), but I'm working on other things too.
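For reference, the interval itself is cheap to compute before deciding how to draw it. This is a rough sketch, assuming a normal approximation (z = 1.96), which is reasonable for large samples but too narrow for a handful of points; the grades and the function name are invented for illustration.

```python
import math
import statistics

def mean_ci95(values):
    """Return (mean, low, high) for an approximate 95% confidence interval.

    Uses the normal approximation; for small samples a t-based
    interval would be wider.
    """
    m = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    half_width = 1.96 * se
    return m, m - half_width, m + half_width

grades = [62, 71, 55, 80, 67, 74, 59, 69]  # made-up data
m, low, high = mean_ci95(grades)
print(f"mean = {m:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

The three numbers map directly onto a point with an error bar, which is one of the simpler ways to show uncertainty without cluttering a chart.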

alialkhatib · on July 21, 2014

If you figure out a more intuitive way to communicate confidence intervals visually, please post it here :)

emu · on July 21, 2014

I might have missed it, but I couldn't find any licensing information for most of these data sets, which troubles me. Personally, I'm hesitant to download or work with much of this data for that reason.

robmccoll · on July 21, 2014

These are my go-tos for quick testing with real data. I've published a paper or two using these datasets (obtained from SNAP).
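For anyone who wants to try one, here is a minimal sketch of reading an edge list into an adjacency map. It assumes the plain-text layout that many of the SNAP files use ('#' comment lines followed by one whitespace-separated pair of integer node IDs per line); check the header of the specific file first. The function name and the inline sample are hypothetical.

```python
import io
from collections import defaultdict

def read_edge_list(lines):
    """Build {source: set(targets)} from an iterable of edge-list lines."""
    adj = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        src, dst = line.split()[:2]
        adj[int(src)].add(int(dst))
    return adj

# Tiny stand-in for a downloaded file; a real one would be opened with open(path).
sample = io.StringIO("# FromNodeId\tToNodeId\n0\t1\n0\t2\n1\t2\n")
graph = read_edge_list(sample)
print({node: sorted(nbrs) for node, nbrs in graph.items()})
```

Edges are stored as directed here; for an undirected graph you would also add the reverse edge.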

There are also some decent large graphs of different types from various DIMACS challenges that people may find useful (http://www.cc.gatech.edu/dimacs10/downloads.shtml).