
So what should his robots.txt look like? At the moment it is:

   User-Agent: *
   Disallow: /21000/




It's mostly sufficient. /21000/ will not match "http://picolisp.com/21000", which is the first URL in the sequence, but the remaining URLs look like "http://picolisp.com/21000/!start?*Page=+2", so Googlebot will likely only continue to download a single page once it has re-read the robots.txt.
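For what it's worth, Python's standard-library robotparser agrees (a quick sketch, not part of the original post; the two URLs are the ones quoted above):

    from urllib.robotparser import RobotFileParser

    # Feed in the site's current robots.txt rules.
    rp = RobotFileParser()
    rp.parse([
        "User-Agent: *",
        "Disallow: /21000/",
    ])

    # "/21000" does not start with "/21000/", so the first page stays crawlable.
    print(rp.can_fetch("*", "http://picolisp.com/21000"))                   # True
    # The paginated URLs do start with "/21000/", so they are blocked.
    print(rp.can_fetch("*", "http://picolisp.com/21000/!start?*Page=+2"))   # False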

Which is what you deserve for using non-standard URL formats.


Hold on, a slash at the end is not standard?


No, I'm saying /21000/ will match a path with a directory named /21000 but not a file named /21000.

When I say "non-standard", I mean that if the website's URLs looked like "/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a "Disallow" rule that successfully blocked all of the desired pages.


   User-Agent: *
   Disallow: /21000
or

   User-Agent: *
   Disallow: /
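Either variant closes the gap. Checking the /21000 rule with the same robotparser sketch as above (again just illustrative, not from the original thread):

    from urllib.robotparser import RobotFileParser

    # "Disallow: /21000" is a plain prefix match, so it covers both URL shapes.
    rp = RobotFileParser()
    rp.parse(["User-Agent: *", "Disallow: /21000"])
    print(rp.can_fetch("*", "http://picolisp.com/21000"))                   # False
    print(rp.can_fetch("*", "http://picolisp.com/21000/!start?*Page=+2"))   # False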



