Back in 2004 I wanted to use Kleinberg's hubs-and-authorities algorithm on Delicious, and I ran a crawler on it anyway, despite the robots.txt file. I got blocked, and when I complained, I got an email from the founder telling me to buzz off.
I've long seen the no-crawling policy of Delicious, plus the Roach Motel API that was all about getting people to put their data in but not about letting them get it out, as the dark side of "Web 2.0". We often hear about an API as if it were a gift, but it's often a self-serving effort to take our data and give nothing back in return.
Perhaps you're right, but in some sense that data belongs, collectively, to all of us.
In my mind, delicious represents everything that was great and everything awful about 'Web 2.0'. Yes, it did something that everybody assumed was impossible because bookmarking sites failed so consistently in 'Web 1.0'. Although delicious provided a useful current awareness service, it never did any of the interesting things with the data that would have made it possible to move onto 'Web 3.0'. And it probably never will -- but maybe that's fine because it opens up an opportunity for the rest of us.
I just want to make sure you know who you're replying to. Joshu is the guy that wrote/founded delicious.
Perhaps "in some sense" you feel that that data belongs to all of us, but it's not really your place to decide. Delicious had terms of use, and you violated those terms. Your argument sounds a bit like the "the internet is public domain" cookbook lady from last month. It was wrong then, and it's wrong now.
The primary issue is that it does not seem terribly clear what the legal status of redistributing such data would be, and/or whether this is changed in any way by Y! shutting down delicious.
I've done that kind of stuff and, against an advanced opponent, you tend to lose. (Although you can roll the average webmaster.)
Remember that IP addresses have a market price of about $3/month, and that's what an honest proxy costs. Honest proxy providers rent machines in data centers and have them bind to a wide range of addresses, all in the same netblock. If you're coming from 20 different addresses in a netblock (paying $60 a month), you still look suspicious. These guys might have machines in several data centers, but they can't put you into hundreds of different netblocks.
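To make the netblock point concrete, here's a rough sketch of how a defender might spot that pattern. Everything in it is an assumption for illustration: the /24 grouping, the threshold of 5, the function names, and the documentation-range addresses. It isn't how any particular site actually does it.

```python
import ipaddress
from collections import Counter

def netblock_counts(addresses, prefix_len=24):
    """Count how many client addresses fall into each netblock."""
    counts = Counter()
    for addr in addresses:
        # strict=False accepts a host address and returns its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[net] += 1
    return counts

def suspicious_blocks(addresses, prefix_len=24, threshold=5):
    """Return netblocks with at least `threshold` distinct-looking clients."""
    counts = netblock_counts(addresses, prefix_len)
    return sorted(str(net) for net, n in counts.items() if n >= threshold)

# Hypothetical example: 20 rented proxy addresses, all in one block
proxies = [f"203.0.113.{i}" for i in range(10, 30)]
monthly_cost = 3 * len(proxies)  # at roughly $3 per address per month
flagged = suspicious_blocks(proxies)
```

With those 20 addresses the whole block gets flagged at once, which is why spending $60 a month on one provider doesn't buy you much cover.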
The economics might get better if you're sharing the proxies with other people, but those other people are up to the Devil's work, and are busting their asses 24-7 getting the IP addresses in everybody's block lists.
As for Tor, quite a few organizations block or limit Tor traffic... Databases of active Tor gateways are available, and sites like Wikipedia use them... Wikipedia won't let you make anonymous edits from Tor, because they don't like dealing with griefers who use Tor.
Now, some people will use hacked machines as proxy servers. A botnet can create a nearly undetectable cloud of IP addresses, but as far as I'm concerned, use of a botnet is an ethical line I won't cross.
The loss would be a shame, but calling it a "sick tragedy" is a bit of a stretch. Unmaintained, the data will get stale fairly rapidly, and it won't take long for another service to step in. There's a vacuum here, and someone will fill it.
Seriously? Somehow I doubt the presence of or absence of lolcats and 4chan will make much difference to future generations.
We're already drowning in data. We need to start making some executive decisions about what's important and what's not. If it turns out we're wrong - we'll deal with it.
Find one and ask them how many parts of their body they'd pay to spend even ten minutes in the town square listening to the mundanities you so casually consign to the bit bucket.
There's more information there than you think, more than you can even see, because you are a product of the time that generated it.
I agree that lolcats, in and of themselves, are useless, but to say they do not have any cultural significance is a very big assumption. Don't you think that the popularity of such a meaningless meme is representative of some undiagnosed issue with society? I mean aren't you curious as to why certain pictures of cats are becoming more well known than leaders of most countries? Isn't it odd how such a simple idea is able to still amuse us as well as grow in popularity?
www.Historio.us is a great alternative. The free user is enough for me, but others can take a look at the pricing plans here: http://historio.us/pricing/
Err, if I was writing a scraper today, I'd just ignore the robots.txt
I dunno if Geocities had a similar robots.txt, but it didn't stop several groups from archiving it (which was the right thing to do in either scenario).
Well, I've worked for organizations that had active defenses against crawlers... Make too many HTTP requests an hour and ~poof~ a deny directive goes in the .htaccess file, or if they really like to play a rough game they'll firewall you.
I know delicious had active defenses because I ran afoul of them.
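For anyone who hasn't seen one of these defenses, here's a minimal sketch of the idea: count requests per IP over a sliding hour and emit a deny line once a client goes over. The class name, the limit of 1000, and the Apache 2.2-style "Deny from" output are my assumptions; I have no idea what delicious actually ran.

```python
from collections import defaultdict, deque

class HourlyLimiter:
    """Sliding-window request counter per client IP (illustrative only)."""

    def __init__(self, max_requests=1000, window_seconds=3600):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps inside the window
        self.blocked = set()

    def record(self, ip, timestamp):
        """Record one request; return True if the client is blocked."""
        if ip in self.blocked:
            return True
        q = self.hits[ip]
        q.append(timestamp)
        # Drop timestamps that have aged out of the window
        while q and q[0] <= timestamp - self.window:
            q.popleft()
        if len(q) > self.max_requests:
            self.blocked.add(ip)
            return True
        return False

    def deny_directives(self):
        """Lines one might append to an .htaccess file (Apache 2.2 syntax)."""
        return [f"Deny from {ip}" for ip in sorted(self.blocked)]
```

Once an address lands in the blocked set it stays there, which matches the experience of getting cut off and having to email somebody to get back in.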
It's only a matter of time before some sysadmin or automated log analyzer sees you and stops it. You'd be better served by starting a distributed scraping and archiving mission, but I'm not sure how to start that.
I'm not sure if JS can request pages and send them to you (XSF protection), but it might be worth writing a small script to request URLs from your server, spider them and send the content back.
Then, put it on a website, and tell people "by staying on this page, you are donating bandwidth and helping archive delicious".
It's so no-hassle that I bet you could get a huge following.
I'm not a native English speaker, but shouldn't the phrase be "library on fire" instead of "museum on fire"? The analogy comes from the ancient Library of Alexandria, which was set on fire by Caesar.
'...both the pagan historian Ammianus Marcellinus and the Christian historian Orosius wrote that the Bibliotheca Alexandrina had been destroyed by Caesar's fire. The anonymous author of the Alexandrian Wars writes that the fires Caesar's soldiers had set to burn the Egyptian navy in the port of Alexandria went as far as burning a store full of papyri located near the port. However, the geographical study of the location of the historical Bibliotheca Alexandrina in the neighborhood of Bruchion suggests that this store cannot have been the Great Library. It is most probable here that these historians confused the two Greek words bibliothekas, which means “set of books”, with bibliotheka, which means library. As a result, they thought that what had been recorded earlier concerning the burning of some books stored near the port constituted the burning of the famous Alexandrian Library.'
I’m a huge user of delicious, with 2906 bookmarks and 3100 tags. In fact, if I were to pick one Web 2.0 site, it would be this simple, straightforward one. I use it to organize new lines of research (into an intellectual matter or something as inane as a hotel) and to keep track of anything I find interesting or useful on the web, particularly those things which took more than a few minutes of googling to discover. It's a huge supplement to my memory.
Really sad to see it go.... If yahoo had asked me to pay I'd happily have done so (I pay for Flickr, Spotify, Last.fm, RTM and many other oft-used services happily)
Yes it's a PITA, I subscribe to certain tags via RSS to get interesting stuff to read. The loss of that is quite big for me. Saying it's like setting a museum on fire is a bit too far though.
I can only think of using 80legs as a crawler, since it's distributed enough to make sure that you don't run into any IP address based rate limitation. But it's just a guess.
I don't understand why Yahoo doesn't try charging for it. Monetizing it was tough, definitely, but it's been shown that actually CHARGING customers (as opposed to going with a straight advertising model) can work.
If you're going to shut it down anyway, what's the harm in trying? Maybe have a "stay of execution" for a quarter - tell users you're going to charge $10/month for the service, and see how many users sign up. If you can break even, why not keep it?
Wouldn't charging for it give users an expectation that the service would stay around for longer than a quarter? What if it's still not profitable? Now you have to close down with paying customers.
You're right - but given the situation, I think you could outline user expectations and see what happens. Services that charge money close down all the time.
I was envisioning something like reddit gold. They have more users/subscribers, and seemed to have a lot of success with their monetization.
Unfortunately they may have crossed the Rubicon with that. I suspect that many of the people (including me) who would have been willing to pay have already gone ahead and signed up for another paid bookmarking site. The most active users are the ones who are most likely to have already heard about the shutdown, and the ones who would be most concerned about it.
Just as well, since it looks like pinboard gets the things that bugged me about delicious right.
I only have 280 links on there, so maybe it is limited somehow. I really hope it is, otherwise this would be REALLY a poor job on the part of readwriteweb.
That's only to export your own bookmarks. The author was talking about extracting all (or any non-trivial number) of the bookmarks that are publicly available.
Unfortunately Delicious will throttle you if you hit the service more often than once a second so you might not be able to get too much valuable information.
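If you do go this route, the polite thing is to pace yourself so you stay under that limit. Here's a small sketch of a throttle that spaces calls at least one second apart. The `Throttle` and `fetch_all` names are made up for this example, and `fetch` is whatever function you use to download a page. At one request a second you top out at 86,400 requests a day, so plan accordingly.

```python
import time

class Throttle:
    """Ensure successive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the pacing can be tested
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now

def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

The first request goes out immediately; every later one waits out whatever is left of the interval since the previous request.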
I took a peek through the site and I really don't see a way to scrape everything, or even most stuff off of their site. You can get the 200 pages of the most recent bookmarks for any particular tag, but that seems to be about it.
User pages go back much farther, perhaps all the way. So if you can find a decently large set of users you should be able to come away with a large chunk.
It's not terribly difficult to back up your bookmarks using the API. I wrote a script a while ago that does just that and dumps everything in a neat little SQLite DB.
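This isn't my actual script, just a minimal sketch of the storage half of the idea. It assumes you've already pulled your bookmarks from the API into a list of dicts with `url`, `title`, `saved_at`, and `tags` keys. Those field names and the table layout are assumptions for illustration, not the API's real format.

```python
import sqlite3

def save_bookmarks(conn, bookmarks):
    """Store bookmarks and their tags; safe to re-run on the same data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks "
        "(url TEXT PRIMARY KEY, title TEXT, saved_at TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags "
        "(url TEXT, tag TEXT, PRIMARY KEY (url, tag))"
    )
    for b in bookmarks:
        # Re-saving a URL overwrites its title and date instead of duplicating it
        conn.execute(
            "INSERT OR REPLACE INTO bookmarks VALUES (?, ?, ?)",
            (b["url"], b["title"], b["saved_at"]),
        )
        for tag in b["tags"]:
            conn.execute(
                "INSERT OR IGNORE INTO tags VALUES (?, ?)", (b["url"], tag)
            )
    conn.commit()
```

You'd call it with something like `save_bookmarks(sqlite3.connect("bookmarks.db"), bookmarks)`. Keeping tags in their own table makes "everything tagged X" a one-line query later.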
I'm sad to see delicious go as it's a great collaborative tool and has awesome powers when combined with instapaper.
(btw, if anyone wants a copy of my script you can get in touch with me through my site listed in my profile)