Note that switching //search to /search eliminates the phenomenon.
Note too that all the results on page 1 and page 10 are related to hostgator and coupon codes. I expect that there is some site which contains some text or links that cause these results.
Note also that the `site:` search operator isn't supposed to include anything but a domain or subdomain: no http:// nor /search should be included.
Finally, note that the results are actually google search pages, though! So I do think this is some kind of bug.
But NOT an instance of Google indexing its result pages. Please change the title to 'This one weird google bug will make you scratch your head!' :)
Edit: andybalholm suggests (on this page) that the double slash is in fact causing the googlebot to visit those search results page and indeed index them. Hm, sounds true.
Has anybody visited the spamfodder pages and found instances of malformed yet operative links to google search? (I don't feel like visiting those sites on this machine on this network.)
I changed my tune at some point after seeing comments here. I posted a comment to that effect.
In hindsight, your comment alone would have changed my tune: nope, I can't explain the difference between a page appearing in search results and a page being indexed. Thanks for the illumination. :)
This demonstrates the dangers of loose path resolution rules.
Traditionally, consecutive slashes in a path name are treated as equivalent to a single slash, presumably to simplify apps that need to join two path fragments -- they can safely just concatenate rather than call a library function like path.join().
Unfortunately, this makes it much harder to write code that blacklists certain paths, as robots.txt is designed to do. Clearly, Google's implementation of robots.txt filtering does not canonicalize double-slashes, and so it thinks //search is different from /search and only /search is blacklisted.
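To make that concrete, here's a toy version of such a blacklist check in Python - nothing like Google's actual crawler code, just an illustration of how a naive prefix match misses //search until the path is canonicalized:

    import re

    # Hypothetical robots.txt-style blacklist; illustration only.
    DISALLOWED_PREFIXES = ["/search"]

    def is_blocked_naive(path):
        # Plain prefix match: "//search?q=x" does not start with "/search",
        # so it slips past the blacklist.
        return any(path.startswith(p) for p in DISALLOWED_PREFIXES)

    def canonicalize(path):
        # Collapse runs of slashes before matching, the way many web
        # servers end up resolving the path anyway.
        return re.sub(r"/{2,}", "/", path)

    def is_blocked_canonical(path):
        return any(canonicalize(path).startswith(p) for p in DISALLOWED_PREFIXES)

    print(is_blocked_naive("//search?q=hostgator+coupon"))      # False -> crawled
    print(is_blocked_canonical("//search?q=hostgator+coupon"))  # True  -> skipped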
My wacky opinion: Path strings are an abomination. We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz". You can use something like slashes as an easy way for users to input a path, but the parsing should happen at point of input, and then all code beyond that should be dealing with lists of strings. Then a lot of these bugs kind of go away, and a lot of path manipulation code becomes much easier to write.
> We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz".
But that doesn't in and of itself solve the problem, because "foo/bar//baz" would map to ["foo", "bar", "", "baz"] without any additional convention.
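A sketch of what that extra convention could look like, if parsing happens once at the input boundary (Python; the helper names are made up):

    def parse_path(raw):
        # Parse a user-supplied path string exactly once, at the point of
        # input. Dropping empty segments is the extra convention: both
        # "foo/bar//baz" and "foo/bar/baz" become the same list.
        return [segment for segment in raw.split("/") if segment]

    def render_path(segments):
        # Only turn the list back into a string at the output boundary.
        return "/" + "/".join(segments)

    print(parse_path("foo/bar//baz"))          # ['foo', 'bar', 'baz']
    print(render_path(["foo", "bar", "baz"]))  # /foo/bar/baz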
This is actually not that unusual. This site, for example, does not treat two consecutive slashes as a single slash. There are likely other implementation differences out there.
Certainly in POSIX consecutive slashes count as one in file paths, but URLs are not file paths.
Does the HTTP standard or the robots.txt specification mandate collapsing consecutive slashes, though? I agree that it is common, but if it is a server-side implementation detail, then a correct implementation of robots.txt should not collapse them, as they might mean different things to a particular server.
hi, OP here. i didn't expect this to hit the front page, just thought it was a funny meta bug.
and no, it's not clickbait and i'm not affiliated with hostgator or any of that other crap.
a few strange points i would like to point out:
the indexed result pages are http:// not https:// - but to my knowledge google forces https:// everywhere.
the double slash issue is probably the reason why googlebot does indeed index this. robots.txt is a shitty protocol, i once tried to understand it in detail and coded https://www.npmjs.org/package/robotstxt and yes, there are a shitload of cases you just can't cover with a sane robots.txt file.
as there are no https://www.google.com/search (with the "s" as in secure) URLs indexed, google(bot) probably has some failsafes to not index itself, but the old http:// URLs somehow slipped through.
but now let's go meta: consider the implications! the day google indexes itself is the day google becomes self aware. google is a big machine trying to understand the internet. now it's indexing itself, trying to understand itself - and it will succeed. the "build more data centers" algorithm will kick in as google - which basically indexed the whole internet - is now indexing itself recursively! the "hire more engineers to figure out how to deal with all this data" algorithm will kick in (yeah, recursively every developer will become a google dev, free vegan food!), too.
i think it's awesome.
by the way, a few years ago somebody wrote a similar story http://www.wattpad.com/3697657-google-ai-what-if-google-beca... funnily enough, the date for self awareness is "December 7, 2014, at 05:47 a.m" [update: oops, sorry, this seems to be the wrong story, but i'm sure the "google indexes itself and becomes self aware" short story is out there, i just can't find it right now ... strange coincidence?]
> the indexed result pages are http:// not https:// - but to my knowledge google forces https:// everywhere.
Google only forces HTTPS for certain User-Agent strings. I just tried fetching http://www.google.com with the Googlebot User-Agent string and Google did not redirect to HTTPS.
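If anyone wants to repeat the experiment, something like this does it (Python with the requests library; the User-Agent below is the publicly documented Googlebot string, and the redirect behaviour may of course differ by region or change over time):

    import requests

    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")

    for ua in ("Mozilla/5.0 (ordinary desktop browser)", GOOGLEBOT_UA):
        # Don't follow redirects, so we can see whether an HTTP -> HTTPS
        # redirect is issued for this particular User-Agent.
        resp = requests.get("http://www.google.com/",
                            headers={"User-Agent": ua},
                            allow_redirects=False)
        print(ua[:40], resp.status_code, resp.headers.get("Location"))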
It's a bug in the indexing system, exploited by hostgator for (I'm guessing) SEO purposes. There are other people doing the same thing, and they're all spammy (viagra sales etc.)
I reckon this will be fixed in a matter of days, judging by how quickly the latin lorem ipsum google translate thing was sorted out.
Handling URLs with multiple slashes in them is tricky, lots of websites silently fold them into one and return the same content, so this seems like something we should handle in the same way in search.
Does this explain why Google search results have degraded over the last 6 months? I am not trolling - seriously - for me, googling first is hardly worthwhile nowadays. A user from the Netherlands. If there were a way to still use the 2009 search index, I would!!
If you want to send me specific queries (the more general, the better) and what went wrong in the search results for them, I'm more than happy to forward them to the team that works on that. I'm [this-user-name] AT google.com
There actually is a way to use the pre-2012 search index!
Just use http://www.google.com/custom
I use either DuckDuckGo or this site all the time; I'd probably switch to DuckDuckGo completely if this search went down.
Ouch. Stuff like this is just a confused deputy security vulnerability waiting to happen. Whenever I write code to fetch a resource based on user input (and a crawler following a link is a form of user input) I check to make sure I'm not going to fetch something on an internal network.
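Roughly along these lines - a simplified Python sketch with a made-up helper name; a real crawler also has to worry about redirects, DNS rebinding, IPv6 scope IDs and so on:

    import ipaddress
    import socket
    from urllib.parse import urlparse

    def is_internal_target(url):
        # Return True if the URL's host resolves to an address we should
        # never fetch on behalf of user input (loopback, RFC 1918,
        # link-local, reserved ranges).
        host = urlparse(url).hostname
        if host is None:
            return True
        for family, _, _, _, sockaddr in socket.getaddrinfo(host, None):
            addr = ipaddress.ip_address(sockaddr[0].split("%")[0])
            if (addr.is_private or addr.is_loopback
                    or addr.is_link_local or addr.is_reserved):
                return True
        return False

    # e.g. before the crawler fetches a user-supplied link:
    # if is_internal_target(link): skip it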
Even if you take the search string site:http://www.google.com//search and put it into a fresh Google search, it only returns HostGator coupons. Maybe someone from Google can explain it.
add -hostgator to the search query and you'll find best-seller-watches.com dominating the list. Add that one to your query and things get really strange.
The goal is to have an explicit Google search result which expresses the equivalent of "this Google search cannot be found via Google".
This will help construct a proof of Göogdel's Incompleteness Theorem.
Without being able to find anything in Google, including Google searches, and including that search for Google searches itself, Google is not a completely powerful search engine; however, it cannot be complete and consistent at the same time. There are searches which cannot be shown to be conclusively either in the index, or not in the index.
I think it might not be that they "index themselves" but that they index links to google that others post on forums; it's common for people to link to "lmgtfy", so they probably index those links too. I don't see google "googling" itself while indexing its own searches. Unless Skynet.
I assume that some site has hostgator-related links with two slashes instead of one. Due to the two slashes, the GoogleBot doesn't realize that it's indexing their own results pages.
They probably want to index some pages on google.com, but not search results. To exclude search results, someone wrote something to exclude URLs that start with /search, and forgot that //search works the same way.
Many frameworks allow you to route URLs to actions instead of mapping to a file. I just tested it in one of my Symfony projects, and I was able to route /login and //login to two separate controllers.
Furthermore, it's pretty common to rewrite URLs, doing things like adding/removing trailing slashes, whatever. So it wouldn't be too difficult to have it condense multiple slashes into just one (see the sketch below).
For example, this link works fine:
google.com//////////////////////////////////search?q=foobar
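That kind of slash-condensing rewrite is cheap to bolt on at the middleware layer; a rough sketch (Python/WSGI, purely illustrative):

    import re

    def collapse_slashes_middleware(app):
        # Hypothetical WSGI middleware: 301-redirect any request whose path
        # contains consecutive slashes to the collapsed, canonical path.
        def wrapper(environ, start_response):
            path = environ.get("PATH_INFO", "")
            collapsed = re.sub(r"/{2,}", "/", path)
            if collapsed != path:
                query = environ.get("QUERY_STRING", "")
                location = collapsed + ("?" + query if query else "")
                start_response("301 Moved Permanently",
                               [("Location", location), ("Content-Length", "0")])
                return [b""]
            return app(environ, start_response)
        return wrapper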
Google search tries to cover a lot of typos or be pretty user-friendly for people who don't understand tech. I wouldn't be surprised if there's a grandma out there who thinks http://google.com//search is the correct method.
Does nobody here understand robots.txt? It's pretty easy to figure out what's going on if you do. I assumed most users here work with web technologies, but maybe the readership doesn't skew that way as much as I thought.
However, if you search site:http://www.google.com/search and show omitted search results, you get a bunch of results (all 404s).
If you do this there are some strange results on the last couple pages.
For example: Obama won't salute the flag | Phallectomy | horse+mating+video | feral+horses+induced+abortion | Lactating+dog+images | animal+mating+video | mating+mpg+-beastiality+-...
So, Half Life 3 confirmed.