Robots.txt Disallow: 20 Years of Mistakes To Avoid (beussery.com)
106 points by hornokplease on June 30, 2014 | 60 comments



This article forgot the very worst use of robots.txt:

  User-agent: ia_archiver
  Disallow: /
Those two lines block all content on the entire site from the Internet Archive's (archive.org) Wayback Machine, and the public will be unable to look at any previous version of the website's content. It wipes out a public view of the past.

Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt

Banning access to history like that is shameful.


The thing that really frustrates me about the Internet Archive's treatment of robots.txt: if a domain expires and the domain provider changes the robots.txt to something restrictive, the Wayback Machine will completely clear the history of the site, even though it's very clearly not the same agent at play: this is not the creator of the site's content. I've seen it happen, and it breaks my heart every time.


Why wouldn't it consider the archived state of robots.txt?


One of the reasons I like archive.today. Obviously, they lack the depth of history, but they don't censor so easily.


Quora also disallows the ia_archiver agent.

http://www.quora.com/robots.txt

Here is their explanation (in the robots.txt file):

"We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it. As far as we can tell, there is no way for sites to selectively programmatically remove content from the archive and so this is the only way for us to protect writers. If they open up an API where we can remove content from the archive when authors remove it from Quora, but leave the rest of the content archived, we would be happy to opt back in. See the page here: https://archive.org/about/exclude.php"

"Meanwhile, if you are looking for an older version of any content on Quora, we have full edit history tracked and accessible in product (with the exception of content that has been removed by the author). You can generally access this by clicking on timestamps, or by appending "/log" to the URL of any content page."


The Internet Archive choosing to honor robots.txt is what's 'banning' the access. Both the request not to be crawled and the decision not to crawl are voluntary, but if the Internet Archive decided it wanted to slurp up the Washington Post tomorrow, there's not much the Post could do to stop it.


Yes, I know, I'm a member of Archive Team, and I use "wget -e robots=off --mirror …" quite a bit, and then I upload those WARCs to the IA. But major content providers like the Washington Post that explicitly choose to block their entire website and its history should be named and shamed.
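For the curious, a fuller grab along those lines that also produces an uploadable WARC looks roughly like this (a sketch; everything beyond "-e robots=off --mirror" is from memory, and the target and filename are placeholders):

    wget --mirror \
         -e robots=off \
         --page-requisites \
         --wait=1 \
         --warc-file=example-site \
         --warc-cdx \
         "http://www.example.com/"

That writes example-site.warc.gz (plus a CDX index) alongside the normal mirror, which is the format the IA ingests.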

Authors don't get the right to go around removing their novels from public libraries just because they would rather the books be available only for pay in bookstores.


It's not really shameworthy to want to regulate access to your own sites, and physical metaphors work about as well here as "you wouldn't steal a car" does for piracy.

The Internet Archive does wonderful work, but just because somebody doesn't want you folks crawling their content doesn't make them worthy of "naming and shaming".


Why can a physical library collect and display physical newspapers, but a digital library cannot collect and display digital newspapers?


As I pointed out elsewhere in the thread, making analogies to physical media is just as flawed as the "you wouldn't steal a car" anti-piracy campaigns.

A physical library is either getting their newspapers by asking/paying the newspaper company to deliver them, asking citizens to donate them, or collecting them from already delivered newspapers. If the IA was just piggybacking on user activity (by caching and storing things from a user's browser cache after they visit a page) then I'd have far less of a concern with them. If we're so attached to physical metaphors, this would be equivalent to the librarian running around outside the newspaper's printing room and snatching newspapers from the bundles as the company's employees loaded them onto trucks.


I was going to reply pointing out that whether or not to name and shame someone is a subjective decision which you and I do not see eye to eye on, and which generally requires quite a few people to agree with you before it becomes a problem for the shamee, but then I remembered the poor way that IA handles changes in ownership with respect to robots.txt.

When IA stops wiping out historical content because of a present-day change in domain ownership, then I will have more support (and USE) for them.


How is IA supposed to distinguish a new website from a sincere wish to delete old stuff? A change in domain registration data means nothing; I have a domain that I registered for an association in my name, and which I then sold to them (for a symbolic price), but it was only an administrative issue - the site was the same.

IA is on iffy territory w.r.t. copyright as it is; if they stop respecting robots.txt, they could get into a world of hurt.


Your last sentence is key. As I understand it, there's no real legal precedent for IA, which basically copies everything out there on an opt-out basis. I personally am glad they do, but one of the ways they get away with it is by treading as lightly as possible, including respecting robots.txt even retroactively.

They're also non-commercial, broad in scope, arguably serve a valuable scholarly function and have other characteristics that have kept them mostly out of legal hot water. But it's unclear to what degree they're legally different from a site that decided to create an archive of all comics, commercial and otherwise, and slap advertising up.


Internet Archive isn't wiping out historical content. It's just unavailable/hidden for the time being (as long as a restrictive robots.txt is in place).


Unfair to name & shame a private entity that doesn't want its content to be archived.


privately owned but publicly acting!


They have explicitly denied permission to have their content slurped.

Why do you think it is legal to then go ahead and slurp it?


I'm not making an argument about the legality or ethics of it one way or another (although as far as I'm aware, it's not actually illegal to ignore robots.txt). I was just pointing out that robots.txt doesn't actually do anything but ask nicely.


When people can go to jail for hitting a publicly available URL, I'd question the "legality" of such activity. (I'm not making a moral argument, but rather question what lawyers and law enforcement may choose to make of a situation.)

Politicians keep attempting to write ever more draconian qualifications and punishments into law for what qualifies as a "breach of terms of service". I would expect this to encompass robots.txt at some point, if it does not already.

Again, I'm not particularly happy about this trend, but I'll try to keep out of its path of destruction.


No, you do NOT have the right to pound my site with requests and serve data that I decided to pull down.


Nobody said they do; nobody said the Internet Archive shouldn't respect robots.txt.

We do, however, have the right to criticize people who ban IA from their site.


Anybody has the right to criticize anyone; the question, rather, is whether they have a valid criticism. Your wording is so unclear that I don't even know what you think about people banning IA from their site, but assuming you would criticize them, what would that criticism be? And would you also criticize someone for making their site private, or for not making a site at all?


I think if the site is publicly accessible, it's basic Internet civility to allow IA to archive it, but especially so for a newspaper. It's a question of respect for your users and for journalism.

If they have fears about losing revenue (and I find those fears silly), there are other ways of going about it, such as only allowing access to pages some weeks or months after they've been published.


Okay, I completely agree about newspapers, maybe other things as well. But you said people who ban IA from their site could be criticized, there are plenty of other comments along those lines, and I just don't see it. As advice, sure: if you post it on the internet, assume it will stick around forever, because that could happen. But still, there are personal public websites, if you know what I mean. They're not secret, they're not hidden, they are accessible to the public -- but they do not belong to the public, they are not like a public park or road. And sometimes, a website is more thinking aloud, or talking to oneself, than writing a book that then belongs to your "audience".

Civility is a good keyword, and while this may be a bit of a stretch, imagine sitting in public cafés and writing down what people say, and then criticizing people for lowering their voice and turning their back so you can't read their lips, even though you genuinely mean well and just want to preserve daily public life for future historians. In general, this is what the attitude of "the internet" feeling entitled to whatever was ever posted anywhere feels like to me. Maybe I just don't get it, but I really don't get it.

I think the question of whether a private conversation should be recorded just because it's in public, just because you can, is kind of a no-brainer, but here are some I don't have an answer to: Should an artist be allowed to make a performance and ask that it not be recorded? Should someone be able to hold a political speech and ask the same? For me the answers are kind of yes, and no-ish... but what about political art? Are we allowed to try to influence people, and then try to erase the traces? Now that is tricky, and I may have ended up ranting myself into agreeing more with the IA "side" of the argument than I expected to. Because either something is personal, trivial in one way or another, or commercial and/or political. Personal things I think should be respected, but commercial and political things shouldn't be; they do belong to historians. Well, fuck.

[This is why I "blog" a bit, actually -- because posting stuff online makes me think harder about it than I would otherwise; I don't even need an actual audience for that, just the possibility of one -- but that's also why I don't feel great about all of that floating around forever. It's all rather temporary in nature, a process... and the person who wrote stuff a year ago does not exist anymore, so why should the name of this current person be attached to it?]


I think conflating a publicly accessible website with a private conversation - even if in a public setting - is specious. That said, I concede the point that some websites are meant to belong to the deep web, and while I wouldn't feel guilty about archiving it for personal use anyway, I wouldn't blame the author for banning archivers.

I'd say my general rule is closer to: if you allow search engines, you should allow IA.

W.r.t. your last question, historically the solution to that problem was simple and elegant: people used pen names to write what they didn't want to bind permanently to them. This way is also safer from unauthorized archiving - not everyone is as respectful of the author's wishes as IA.


I may not have a right to "pound" your site, but I certainly have a right to keep whatever I find on your public webserver, regardless of whether you decide to pull it down later.


Do you also want to steal into libraries in the night and set fire to their microfiche collection?


Can we get an explanation of how not wanting to have your servers handling more requests than necessary compares to breaking into a library and setting it on fire?


"No, you do NOT have the right to [ask me a question] and [tell others] [my reply] that I [later decided to retract.]"


I assumed the comparison was more directed to Asparagirl's "Yeah, I'm looking at you, Washington Post" example. Like this:

Perhaps individual private websites, such as pekk's, should have the right to say "No, you do NOT have the right to pound my site with requests and serve data that I decided to pull down."

However, in theory, the Washington Post's articles online are also (eventually) placed on microfiche. Saying there's no right to serve data that WP decided to pull down would in some sense require WP to "steal into libraries in the night and set fire to their microfiche collection".


I'm assuming this was in response to the "and serve data that I decided to pull down." part.


I still don't see the relationship between deleting a blog post that you have authored and burning a library full of other people's works down.


The robots.txt will not only disallow your blog post, but if you acquired your domain from someone else, the entire previous site will also be removed. That is not something you should have a right to do, unless you also acquired full copyright to all of the previous site's revisions.

So sometimes an IA-friendly domain expires (e.g. accidentally or because its owner died), a squatter buys the domain, and the squatter points the domain at a junk-site landing server with a deny-all robots.txt. The result is truly disastrous: IA removes access to the historic, IA-friendly site. Site acquirers who do this deliberately are pure evil.


Ah I see, thanks for the explanation


"serve data that I decided to pull down."

If it's on their bandwidth and power, why not?


I think pekk meant that if he deletes a blog post, the IA is still going to serve it. "Pull down" refers to deleting content, not bandwidth usage.


If you don't want it on the internet, don't post it. Assuming anything can ever be made to disappear from the internet is naive, and if people become aware you're trying, it'll just get Streisanded and become even more widely posted.


There are two sides to this argument. I argue both. All the fucking time.

If you're in the business of providing public content that's well-known, to the public, then allowing it to be archived makes a lot of sense.

If you're providing user-generated content, I'd argue the case for allowing archival is even stronger. Sites that violate this, and Quora comes specifically to mind, are violating what many, myself included, consider to be part of the social contract of the Web.

On the other hand, if you're an individual, and you are posting your own content and ramblings, and circumstances change for whatever reason: you've got a job, you've lost a job, you're married, you're divorced, you're getting divorced, your child is at war in a foreign country, a foreign country is at war with yours, or you're just sick of the crap you wrote when you were young and arrogant and now old-and-arrogant you wants it gone: I'm pretty willing to grant you that right.

If you've committed some terrible crime against humanity, or just a human, and have been fairly tried and convicted of it, I'd probably not give you the right to remove large bits of that information.

And yes, there are vast fields, deserts, tundras, plains, steppes, ice-fields, and oceans of grey about all of this.

Barbra Streisand got Streisanded because she is Streisand.

Ahmed's Falafel Hut likely wouldn't suffer the same fate. His Q-score is somewhat lower, and there's only so much real estate in the public consciousness.


Things disappear from the internet, for good, all the time.

Also, if people become aware someone is naive, they automatically get to be dicks to them? Regardless of whether that information is actually of interest to anyone, just because someone wants to take something down, they should not ever be able to?

Some people act and think like that, yeah. But to accept this as the baseline of human behaviour is, well, not for me. This entitlement to watch the lives of others from the dark may have been bred by reality TV or whatever, but it's more a personality flaw and an addiction, a useless misfiring of synapses that has become culture, than a cornerstone of an information age.


What frustrates me is the number of websites that impose additional restrictions on anything they don't recognize, or worse, websites that impose additional restrictions on (or worse yet, just outright ban) anything that isn't Googlebot.

And people wonder why alternative search engines have such a hard time taking off.


I can give you a really simple operational reason for that: complexity.

Google is somewhere between 50-90% of most sites' search referrals (source: /dev/ass). Add in a handful of other search engines (Bing, DDG, Yahoo, Ask) and you've pretty much got all of it.

They're maybe 10-20% of your crawl traffic though. And possibly a lot less than that.

There are a TON of bots out there. If you're lucky, they just fill your logs and hammer your bandwidth.

If you're not so lucky, they break your site search, overload your servers, and if you're particularly unlucky, they wake you up with 2:30 am pages for two weeks straight.

At which point the simplest way to solve the technical problem, that is, you getting a full night's sleep, is to ban every last fucking bot but Google. Or maybe a handful of the majors.
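In robots.txt terms (for the bots that still listen, which admittedly aren't the ones paging you at 2:30 am), that nuclear option looks something like this:

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /

The specific Googlebot group wins over the wildcard, so Google crawls everything and everyone else is asked to stay out; server-side user-agent blocking is what actually deals with the bots that don't ask.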

Now, of course, you're a data-driven operation and you're relying on Google Analytics to tell you who's sending traffic your way. But if you block a search crawler, it's going to stop sending you traffic, so you won't know it's important.

It's a rather similar set of logic that drives people to set email bans on entire ccTLDs or ASN blocks for foreign countries. And if you're a smallish site, it's probably a decent heuristic. And no, it's not just fucking n00bs who do this. Lauren Weinstein, who pretty much personally birthed ARPANET at UCLA, was bitching on G+ just a week or so back that the new set of unlimited TLDs ICANN was selling was rapidly going into his mailserver blocklists. Because, of course, the early adopters of such TLDs tend to be spammers, or at least the early adopters he's likely to hear from.

https://plus.google.com/114753028665775786510/posts/SsgPNHLG...


The article contains some good observations, but I'm struggling to understand this one:

"Some sites try to communicate with Google through comments in robots.txt"

In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...

  # What's all this then? 
  #   \
  # 
  #    -----
  #   | . . |
  #    -----
  #  \--|-|--/
  #     | |
  #  |-------|
...a "mistake" to avoid? There's no harm in it at all.


"Some sites try to communicate with Google through comments in robots.txt"

I thought that was the whole point of robots.txt


No, the point is to communicate with Google through non-comments in robots.txt.


I don't think those examples were of people trying to communicate with a crawler. I think they were examples of comments that the owners knew would be thrown away by crawlers.


Fun fact: robots.txt can also be used by attackers to find admin interfaces or other sensitive tidbits that you don't want search engines to crawl.

Lots of target-detection crawlers look at robots.txt first, to see if there are any fun pages you don't want the other crawlers to see.


If you want to hide admin pages, add the robots meta tag to each one and set noindex, nofollow. Then you don't need to list them all in one place in robots.txt.

That said, obscurity is not really security. Your admin pages should be behind a password, which, if coded properly, will exclude spiders, bots, and bad guys.
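For reference, the tag in question, and the equivalent X-Robots-Tag response header (handy for non-HTML resources), look like this:

    <meta name="robots" content="noindex, nofollow">

    X-Robots-Tag: noindex, nofollow

But as above: the real protection is the password, not the hint to crawlers.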


In the past I've created an empty robots.txt just to keep the 404 errors out of my logs...


Why does Google ignore the crawl delay?


Google has millions of spiders, in datacenters all over the world. Maybe respecting crawl delay added more shared-state overhead than they wanted.


I haven't read the entire article, but we were discussing this at work a few weeks ago. You can set the crawl delay in Google Webmaster Tools, but they only adhere to that setting for 90 days, then they go back to their default.
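For what it's worth, the crawlers that do honor a delay (Bing and Yandex, at least) read it straight from robots.txt, along these lines:

    User-agent: bingbot
    Crawl-delay: 10

Google ignores the directive entirely; the Webmaster Tools crawl-rate setting, with its 90-day expiry, is the only knob Google offers.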


The main use for robots.txt is to prevent crawling of infinite URL spaces: http://googlewebmastercentral.blogspot.com.br/2008/08/to-inf...

Alongside tagging links to such resources with nofollow.
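For example (hypothetical paths), a calendar or a faceted search that can generate endless URLs is exactly the kind of thing worth fencing off:

    User-agent: *
    Disallow: /calendar/
    Disallow: /search?

The nofollow hint keeps well-behaved crawlers from queueing those links in the first place; the Disallow catches whatever they discover anyway.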


Back in the day, I would use httrack for offline web browsing, and these were a constant irritation.


My server returns 410 GONE to robots.txt requests.

The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.


And what do you do about sites with an infinite number of pages?


By not writing bad software. State shouldn't be stored in URLs, it should be stored in cookies.

Spiders have to be robust against sites with unlimited numbers of internal links anyway, or else an attacker could trap a web spider with a malicious site, or a 13-year-old writing a buggy PHP app could take down Google's entire spidering system.


> By not writing bad software. State shouldn't be stored in URLs, it should be stored in cookies.

GAH!! So it's you who writes those horrible sites?

I want to be able to middle click on two different URLs and browse two pages with completely different state at the same time.

I HATE sites that store state in cookies, the two different tabs start getting completely mixed up about where I am in the site.

The only thing that should be in a cookie is stuff like a shopping cart. But that's only because the action "add to cart" is like a transaction and should be remembered.

Viewing a page and changing the sort is ephemeral and should have no effect on anything else.
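To make that concrete (made-up URLs): with the sort in the query string, two tabs can't step on each other:

    https://www.example.com/articles?sort=date
    https://www.example.com/articles?sort=score

With a single "sort" cookie instead, whichever tab you touched last silently decides what the other one shows on its next click.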

> Spiders have to be robust

Who cares about the spider? What about your site that got hit with an unending stream of completely useless page views?

Your position about robots.txt is simply wrong and you need to change your mind.


Yeah, robots.txt is a horrible standard. Trust me, I wrote https://www.npmjs.org/package/robotstxt just so that I could really understand what is going on. It's based on https://developers.google.com/webmasters/control-crawl-index...

The article is pretty much correct (although strangely worded at times); the stuff about "communicating via robots.txt comments to Google" is of course not true. The examples he gives are developer jokes, nothing more.

Still, you should not use comments in robots.txt. Why?

You can group user agents, e.g.:

    User-agent: Googlebot
    User-agent: bingbot
    User-Agent: Yandex
    Disallow: /
Congrats, you have just disallowed Googlebot, bingbot, and Yandex from crawling (not indexing, just crawling).

OK, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-Agent: Yandex
    Disallow: /
So, well, you have definitely blocked Yandex, and you don't care about bingbot (commented out), but what about Googlebot? Are Googlebot and Yandex part of the same user-agent group? Or is Googlebot its own group and Yandex its own group? If the commented line is interpreted as a blank line, then Googlebot and Yandex are different groups; if it's interpreted as nonexistent, they belong together.

The way I read the spec https://developers.google.com/webmasters/control-crawl-index..., this behaviour is undefined. (Please correct me if I'm wrong.)

Simple solution: don't use comments in the robots.txt file.

Also, please, somebody fork and take over https://www.npmjs.org/package/robotstxt: it has this undefined behaviour, it does not follow HTTP 301 redirects (which were unspecified when I coded it), and it tries to do too much (fetching and analysing; it should only do one thing).

By the way, my recommendation is to have a robots.txt file like this:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/your-sitemap-index.xml
and return HTTP 200

Why? If you do not have a file there, then at some point in the future you will suddenly return an HTTP 500, or an HTTP 200 with some response that can be misleading. Also, it's quite common for the staging robots.txt file to spill over into the real world; this happens as soon as you forget that you have to care about your real robots.txt.

Also, read the spec: https://developers.google.com/webmasters/control-crawl-index...


There are enough malicious bots that do follow robots.txt to make it still an important option for most sites.


500 KB limit? You call that short and sweet?



