Delicious's Data Policy is Like Setting a Museum on Fire

PaulHoule · on Dec 17, 2010

Back in 2004 I wanted to use Kleinberg's hubs-and-authorities algorithm on Delicious, and I ran a crawler on it anyway, despite the robots.txt file. I got blocked, and when I complained, I got an email from the founder telling me to buzz off.

I've long seen the no-crawling policy of Delicious, plus the Roach Motel API that was all about getting people to put their data in but not about letting them get it out, as the dark side of "Web 2.0". We often hear about an API as if it were a gift, but it's often a self-serving effort to take our data and give nothing back in return.

joshu · on Dec 17, 2010

The API let you get your own data out. Everyone else's data wasn't yours to have. And that sort of crawling tended to knock down the site.

PaulHoule · on Dec 17, 2010

Perhaps you're right, but in some sense that data belongs, collectively, to all of us.

In my mind, delicious represents everything that was great and everything awful about 'Web 2.0'. Yes, it did something that everybody assumed was impossible because bookmarking sites failed so consistently in 'Web 1.0'. Although delicious provided a useful current awareness service, it never did any of the interesting things with the data that would have made it possible to move onto 'Web 3.0'. And it probably never will -- but maybe that's fine because it opens up an opportunity for the rest of us.

xatax · on Dec 18, 2010

I just want to make sure you know who you're replying to. Joshu is the guy that wrote/founded delicious.

Perhaps "in some sense" you feel that that data belongs to all of us, but it's not really your place to decide. Delicious had terms of use, and you violated those terms. Your argument sounds a bit like the "the internet is public domain" cookbook lady from last month. It was wrong then, and it's wrong now.

PaulHoule · on Dec 21, 2010

I know who he is. As I've said, I've dealt with him before.

He's free to put up whatever defenses he wants to put up. Other people are going to try to tear them down. That's the way of the world.

pbh · on Dec 17, 2010

Probably a substantial fraction of the public delicious data was collected for this:

http://ilpubs.stanford.edu/858/1/2008-2.pdf

The primary issue is that it does not seem terribly clear what the legal status of redistributing such data would be, and/or whether this is changed in any way by Y! shutting down delicious.

tibbon · on Dec 17, 2010

Could you use EC2, proxies and Tor (horrid bandwidth of course) to get around some of the limiting?

PaulHoule · on Dec 17, 2010

I've done that kind of stuff and, against an advanced opponent, you tend to lose. (Although you can roll the average webmaster)

Remember that IP addresses have a market price of about $3/month, and that's what an honest proxy costs. Honest proxy providers rent machines in data centers and have them bind to a wide range of addresses, all in the same netblock. If you're coming from 20 different addresses in a netblock (paying $60 a month), you still look suspicious. These guys might have machines in several data centers, but they can't put you into hundreds of different netblocks.
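To make the netblock point concrete, here is a rough sketch of how a site operator might spot it. It only assumes what the comment above says ($3 per address per month, 20 addresses in one netblock); the /24 prefix length, the `min_distinct` threshold, and the function names are made up for illustration, and the addresses come from a documentation-only range.

```python
import ipaddress
from collections import defaultdict

COST_PER_IP_USD = 3  # monthly price per address quoted in the comment above


def group_by_netblock(addresses, prefix_len=24):
    """Map each netblock to the set of client addresses seen inside it."""
    blocks = defaultdict(set)
    for addr in addresses:
        # strict=False accepts a host address and returns its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        blocks[net].add(addr)
    return blocks


def suspicious_netblocks(addresses, min_distinct=5, prefix_len=24):
    """Return netblocks that contributed at least min_distinct distinct clients.

    The threshold is an arbitrary illustrative value, not a known policy.
    """
    blocks = group_by_netblock(addresses, prefix_len)
    return {net: ips for net, ips in blocks.items() if len(ips) >= min_distinct}


# 20 rented proxies, all bound in one /24 (203.0.113.0/24 is a documentation range)
proxies = [f"203.0.113.{i}" for i in range(10, 30)]
flagged = suspicious_netblocks(proxies)
monthly_cost = COST_PER_IP_USD * len(proxies)
```

Twenty addresses cost $60 a month here, yet they collapse into a single flagged netblock, which is why spreading requests across addresses from one provider buys so little.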

The economics might get better if you're sharing the proxies with other people, but those other people are up to the Devil's work, and are busting their asses 24-7 getting the IP addresses in everybody's block lists.

As for Tor, quite a few organizations block or limit Tor traffic... Databases of active Tor gateways are available, and sites like Wikipedia use them... Wikipedia won't let you make anonymous edits from Tor, because they don't like dealing with griefers who use Tor.

Now, some people will use hacked machines as proxy servers. A botnet can create a nearly undetectable cloud of IP addresses, but as far as I'm concerned, use of a botnet is an ethical line I won't cross.

tibbon · on Dec 17, 2010

Sounds like some fun stuff to profile a site when you're doing a security screening of them. I'll have to keep this in mind.

DevX101 · on Dec 17, 2010

You're missing the point

ams6110 · on Dec 17, 2010

The loss would be a shame, but calling it a "sick tragedy" is a bit of a stretch. Unmaintained, the data will get stale fairly rapidly, and it won't take long for another service to step in. There's a vacuum here, and someone will fill it.

wslh · on Dec 17, 2010

It's called history. The web is not only about the new. In a hundred years all these data will be important for others.

drivingmenuts · on Dec 17, 2010

Seriously? Somehow I doubt the presence of or absence of lolcats and 4chan will make much difference to future generations.

We're already drowning in data. We need to start making some executive decisions about what's important and what's not. If it turns out we're wrong - we'll deal with it.

jerf · on Dec 17, 2010

Are you a historian?

Find one and ask them how many parts of their body they'd pay to spend even ten minutes in the town square listening to the mundanities you so casually consign to the bit bucket.

There's more information there than you think, more than you can even see, because you are a product of the time that generated it.

xyzzyb · on Dec 17, 2010

Very much this. Big events are well covered and documented. It's the ephemera that I find most fascinating.

http://rs6.loc.gov/ammem/rbpehtml/pegenre.html

lawfulfalafel · on Dec 17, 2010

I agree that lolcats, in and of themselves, are useless, but to say they do not have any cultural significance is a very big assumption. Don't you think that the popularity of such a meaningless meme is representative of some undiagnosed issue with society? I mean aren't you curious as to why certain pictures of cats are becoming more well known than leaders of most countries? Isn't it odd how such a simple idea is able to still amuse us as well as grow in popularity?

woodall · on Dec 17, 2010

www.Historio.us is a great alternative. The free user is enough for me, but others can take a look at the pricing plans here: http://historio.us/pricing/

rb2k_ · on Dec 17, 2010

http://www.diigo.com/ seems interesting to me too.

They offer iOS and Android apps

tibbon · on Dec 17, 2010

Err, if I was writing a scraper today, I'd just ignore the robots.txt

I dunno if Geocities had a similar robots.txt, but it didn't stop several groups from archiving it (which was the right thing to do in either scenario).

PaulHoule · on Dec 17, 2010

Well, I've worked for organizations that had active defenses against crawlers... Make too many HTTP requests an hour and ~poof~ a deny directive goes in the .htaccess file, or if they really like to play a rough game they'll firewall you.

I know delicious had active defenses because I ran afoul of them.
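For anyone who hasn't seen one of these defenses, a minimal sketch of the idea might look like the following. This is not how Delicious or any particular site did it; the class name, the limit, and the sliding one-hour window are assumptions, and the output uses the Apache 2.2-style `Deny from` line only as an example of what could end up in a .htaccess file.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "too many requests an hour"


class CrawlerGuard:
    """Hypothetical per-IP hourly request counter with a deny list."""

    def __init__(self, max_per_hour=1000):
        self.max_per_hour = max_per_hour   # illustrative limit, not a real policy
        self.hits = defaultdict(deque)     # ip -> timestamps inside the window
        self.denied = set()

    def record(self, ip, now):
        """Record one request at time `now` (seconds); return True if allowed."""
        if ip in self.denied:
            return False
        window = self.hits[ip]
        window.append(now)
        # Drop timestamps that have fallen out of the last hour
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > self.max_per_hour:
            self.denied.add(ip)
            return False
        return True

    def htaccess_lines(self):
        """Render the deny list as Apache 2.2-style access rules."""
        return [f"Deny from {ip}" for ip in sorted(self.denied)]


guard = CrawlerGuard(max_per_hour=3)
results = [guard.record("198.51.100.7", t) for t in range(5)]
```

With a limit of three, the fourth request inside the hour trips the guard and every later request from that address is refused, while a client that spaces its requests out never gets near the limit.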

kevindication · on Dec 17, 2010

And really, slowing your connections down to rates that are measured in baud would be the most fun.

jamiequint · on Dec 17, 2010

True, there are plenty of relatively cheap proxy services made for exactly this reason though that will roll your IP on each request.

apgwoz · on Dec 17, 2010

It's only a matter of time before some sysadmin or automated log analyzer sees you and stops it. You'd be better served getting a distributed scraping and archiving mission going, but I'm not sure how to start that.

stavros · on Dec 17, 2010

I'm not sure if JS can request pages and send them to you (XSF protection), but it might be worth writing a small script to request URLs from your server, spider them and send the content back.

Then, put it on a website, and tell people "by staying on this page, you are donating bandwidth and helping archive delicious".

It's so no-hassle that I bet you could get a huge following.

apgwoz · on Dec 17, 2010

I guess this might actually be a great use of 80Legs. Has anyone used them in the past, and if so, did it actually work?

toolate · on Dec 17, 2010

This would be possible with Flash.

stavros · on Dec 18, 2010

Ah, yes! Come to think of it, JS probably couldn't get around the same-origin restrictions of browsers...

adambyrtek · on Dec 17, 2010

I'm not a native English speaker, but shouldn't the phrase be "library on fire" instead of "museum on fire"? The analogy comes from the ancient Library of Alexandria, which was set on fire by Caesar.

zandorg · on Dec 17, 2010

Brewster Kahle, Archive.org: In the history of libraries, they tend to get burned, usually by governments

J3L2404 · on Dec 17, 2010

From Wikipedia:

'...both the pagan historian Ammianus Marcellinus and the Christian historian Orosius wrote that the Bibliotheca Alexandrina had been destroyed by Caesar's fire. The anonymous author of the Alexandrian Wars writes that the fires Caesar's soldiers had set to burn the Egyptian navy in the port of Alexandria went as far as burning a store full of papyri located near the port. However, the geographical study of the location of the historical Bibliotheca Alexandrina in the neighborhood of Bruchion suggests that this store cannot have been the Great Library. It is most probable here that these historians confused the two Greek words bibliothekas, which means “set of books”, with bibliotheka, which means library. As a result, they thought that what had been recorded earlier concerning the burning of some books stored near the port constituted the burning of the famous Alexandrian Library.'

_corbett · on Dec 17, 2010

I’m a huge user of delicious, with 2906 bookmarks and 3100 tags. In fact, if I were to pick one Web 2.0 site, it would be this simple straightforward one. I use it to organize new lines of research (into an intellectual matter or something as inane as a hotel) and to keep track of anything I find interesting or useful on the web, particularly those things which took more than a few minutes of googling to discover. It's a huge supplement to my memory.

Really sad to see it go.... If yahoo had asked me to pay I'd happily have done so (I pay for Flickr, Spotify, Last.fm, RTM and many other oft-used services happily)

_b8r0 · on Dec 17, 2010

Yes it's a PITA, I subscribe to certain tags via RSS to get interesting stuff to read. The loss of that is quite big for me. Saying it's like setting a museum on fire is a bit too far though.

akshayubhat · on Dec 17, 2010

I can only think of using 80legs as a crawler, since it's distributed enough to make sure that you don't run into any IP address based rate limitation. But it's just a guess.

slig2 · on Dec 17, 2010

You probably can't do it, because 80legs respects robots.txt AFAIK.

unexpected · on Dec 17, 2010

I don't understand why Yahoo doesn't try charging for it. Monetizing it was tough, definitely, but it's been shown that actually CHARGING customers (as opposed to going with a straight advertising model) can work.

If you're going to shut it down anyway, what's the harm in trying? Maybe have a "stay of execution" for a quarter - tell users you're going to charge $10/month for the service, and see how many users sign up. If you can break even, why not keep it?

bruceboughton · on Dec 17, 2010

Wouldn't charging for it give users an expectation that the service would stay around for longer than a quarter? What if it's still not profitable? Now you have to close down with paying customers.

unexpected · on Dec 17, 2010

You're right - but given the situation, I think you could outline user expectations and see what happens. Services that charge money close down all the time.

I was envisioning something like reddit gold. They have more users/subscribers, and seemed to have a lot of success with their monetization.

randombit · on Dec 17, 2010

Unfortunately they may have passed the Rubicon with that. I suspect that many of the people (including me) who would have been willing to pay have already gone ahead and signed up for another paid bookmarking site. The most active users are the ones who are most likely to have already heard about the shutdown, and the ones who would be most concerned about it.

Just as well, since it looks like pinboard gets the things that bugged me about delicious right.

rb2k_ · on Dec 17, 2010

> Nope, Yahoo! blocks all automated extraction of data from Delicious.

Uhmmm... this worked for me:

curl --user user:password -o DeliciousBackup.xml https://api.del.icio.us/v1/posts/all

I only have 280 links on there, so maybe it is limited somehow. I really hope it is, otherwise this would be REALLY a poor job on the part of readwriteweb.

zephyrfalcon · on Dec 17, 2010

That's only to export your own bookmarks. The author was talking about extracting all (or any non-trivial number) of the bookmarks that are publicly available.

glebk · on Dec 17, 2010

This python API into Delicious could give you a place to start:

http://www.michael-noll.com/projects/delicious-python-api/

Unfortunately Delicious will throttle you if you hit the service more often than once a second so you might not be able to get too much valuable information.
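If you do go this route, the simplest way to stay under a once-a-second limit is to pace yourself on the client side. Here is a small sketch of that; it is not part of the library linked above, the `Throttle` class is a made-up name, and the one-second interval just mirrors the figure in this comment.

```python
import time


class Throttle:
    """Hypothetical helper: space calls at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the pacing can be tested without waiting
        self.sleep = sleep
        self.last = None      # time at which the previous call was released

    def wait(self):
        """Block until min_interval has passed since the last call.

        Returns the number of seconds slept.
        """
        now = self.clock()
        delay = 0.0
        if self.last is not None:
            delay = max(0.0, self.min_interval - (now - self.last))
            if delay:
                self.sleep(delay)
        self.last = now + delay
        return delay
```

Call `wait()` before each request. Even when perfectly paced, one request per second caps you at 86,400 requests a day, so the "not too much valuable information" caveat still applies.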

chapel · on Dec 17, 2010

I took a peek through the site and I really don't see a way to scrape everything, or even most stuff off of their site. You can get the 200 pages of the most recent bookmarks for any particular tag, but that seems to be about it.

jonknee · on Dec 17, 2010

User pages go back much farther, perhaps all the way. So if you can find a decently large set of users you should be able to come away with a large chunk.

mikeklaas · on Dec 17, 2010

It's possible, but would take years at the level of rate limiting they do.

agentultra · on Dec 17, 2010

It's not terribly difficult to back up your bookmarks using the API. I wrote a script a while ago that does just that and dumps everything in a neat little sqlite DB.
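For the curious, the general shape of such a backup could look something like the sketch below. This is not the script mentioned above. It assumes the export is an XML document with a `posts` root and one `post` element per bookmark carrying `href`, `description`, `tag` and `time` attributes; the sample data, table layout, and function name are all invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample in the assumed export shape (not real API output)
SAMPLE_XML = """<posts user="example">
  <post href="http://example.com/" description="Example"
        tag="demo test" time="2010-12-17T00:00:00Z"/>
  <post href="http://example.org/" description="Another"
        tag="" time="2010-12-16T00:00:00Z"/>
</posts>"""


def save_posts(xml_text, conn):
    """Parse exported bookmarks and upsert them into a sqlite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks ("
        "href TEXT PRIMARY KEY, description TEXT, tags TEXT, time TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (p.get("href"), p.get("description", ""), p.get("tag", ""), p.get("time", ""))
        for p in root.iter("post")
    ]
    # INSERT OR REPLACE keeps repeated backups from duplicating rows
    conn.executemany("INSERT OR REPLACE INTO bookmarks VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path for a real backup
count = save_posts(SAMPLE_XML, conn)
```

Keying the table on `href` means you can rerun the backup as often as you like and only changed bookmarks get rewritten.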

I'm sad to see delicious go as it's a great collaborative tool and has awesome powers when combined with instapaper.

(btw, if anyone wants a copy of my script you can get in touch with me through my site listed in my profile)

docgnome · on Dec 17, 2010

They also have an export tool.

Stwerner · on Dec 17, 2010

Funny, I was playing around with scraping bookmarks off delicious a while ago with a rails app.

tdoggette · on Dec 17, 2010

Now is the time to post that on github if it works at all.

cilantro · on Dec 17, 2010

I hope posterous is hard at work building a delicious clone!

edit: Not completely sure why you all hate this comment, but fwiw I was being sincere not snarky.