Google search indexes itself

roland-s · on Sept 10, 2014

Google's robots.txt http://www.google.com/robots.txt disallows /search but not //search.

However, if you search site:http://www.google.com/search and show omitted search results, you get a bunch of results (all 404s).

If you do this there are some strange results on the last couple pages.

So, Half Life 3 confirmed.

marceldegraaf · on Sept 10, 2014

I thought you were joking about those search keywords, but indeed: http://i.marceldegraaf.net/sitehttpwww.google.comsearch_-_Go... (screenshot)

user24 · on Sept 11, 2014

ODF files! Those sick sick people.

vadvi · on Sept 11, 2014

why is it "GooooooooooG" and not "Goooooooooogle" at the bottom?

daveloyall · on Sept 10, 2014

A better example url is https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...

Note that switching //search to /search eliminates the phenomenon.

Note too that all the results on page 1 and page 10 are related to hostgator and coupon codes. I expect that there is some site which contains some text or links that cause these results.

Note also that the `site:` search operator isn't supposed to include anything but a domain or subdomain: no http:// nor /search should be included.

Finally, note that the results are actually google search pages, though! So I do think this is some kind of bug.

But NOT an instance of Google indexing its result pages. Please change the title to 'This one weird google bug will make you scratch your head!' :)

Edit: andybalholm suggests (on this page) that the double slash is in fact causing the googlebot to visit those search results page and indeed index them. Hm, sounds true.

Has anybody visited the spamfodder pages and found instances of malformed yet operative links to google search? (I don't feel like visiting those sites on this machine on this network.)

1qaz2wsx3edc · on Sept 10, 2014

I like it: https://www.google.com//////////////////////////////////////...

franze · on Sept 10, 2014

>Note also that the `site:` search operator isn't supposed to include anything but a domain or subdomain: no http:// nor /search should be included.

google recommends the site:example.com/path shortcut itself https://support.google.com/webmasters/answer/35256?hl=en

and it's ok to use, as site:example.com inurl:path could mean example.com/hudriwudri/path, too

daveloyall · on Sept 10, 2014

I said that `site:` doesn't take full urls, just domains and subdomains.

Correction: this works as you (or a muggle) might expect: https://www.google.com/search?q=site:https:%2F%2Fgithub.com%...

...Though logically the operator should be named `page:` now. :)

Buge · on Sept 10, 2014

>But NOT an instance of Google indexing its result pages.

That's what it looks like to me. Could you explain the difference?

daveloyall · on Sept 11, 2014

I changed my tune at some point via seeing comments here. I posted a comment to that effect.

In hindsight, your comment alone would have changed my tune: nope, I can't explain the difference between a page appearing in search results and a page being indexed. Thanks for the illumination. :)

tux3 · on Sept 10, 2014

>This one weird google bug will make you scratch your head!

But that's clickbait! :)

kentonv · on Sept 10, 2014

This demonstrates the dangers of loose path resolution rules.

Traditionally, consecutive slashes in a path name are treated as equivalent to a single slash, presumably to simplify apps that need to join two path fragments -- they can safely just concatenate rather than call a library function like path.join().

Unfortunately, this makes it much harder to write code that blacklists certain paths, as robots.txt is designed to do. Clearly, Google's implementation of robots.txt filtering does not canonicalize double-slashes, and so it thinks //search is different from /search and only /search is blacklisted.

My wacky opinion: Path strings are an abomination. We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz". You can use something like slashes as an easy way for users to input a path, but the parsing should happen at point of input, and then all code beyond that should be dealing with lists of strings. Then a lot of these bugs kind of go away, and a lot of path manipulation code becomes much easier to write.

wglb · on Sept 10, 2014

We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz".

But that doesn't in and of itself solve the problem, because "foo/bar//baz" would map to ["foo" "bar" "" "baz"/] without any additional convention.

This is actually not that unusual. this site does not treat two consecutive slashes as a single slash. There are likely others implementation differences.

Certainly in posix consecutive slashes count as one for file paths, but URL query strings are not file paths.

daveloyall · on Sept 11, 2014

... "foo/bar//baz" would map to ["foo" "bar" "" "baz"/] ...

No, I think it'd be more like proto://host/thing?foo&bar&baz (put an =1 on each of those if you like).

Yeah, I'm employing a convention, but so to is the concept of list of strings that the commenter invoked.

rlpb · on Sept 11, 2014

Does the HTTP standard or robots.txt specification mandate the collapse of consecutive slashes, though? I agree that it is common, but if it is server-side implementation detail, then a correct implementation of robots.txt should not collapse them, as they might mean different things to a particular server.

kentonv · on Sept 11, 2014

I agree. If there's a bug here, it's in the server which collapses slashes seen in request paths, not in the indexer's interpretation of robots.txt.

szaroubi · on Sept 10, 2014

Funny thing, Google indexes itself, indexing itself, indexing others .... All results lead to google search, which lead to google search results ...

https://www.google.ca/search?q=site%3Ahttp%3A%2F%2Fwww.googl...

TallboyOne · on Sept 10, 2014

We must go deeper

imrehg · on Sept 11, 2014

http://inception.davepedu.com/

franze · on Sept 10, 2014

hi OP here, i did not consider this to go front-page, just thought it was a funny meta bug.

and no, it's not clickbait and i'm not affiliated with hostgator or any of that other crap.

a few strange points i would like to point out:

the indexed result pages are http:// not https:// - but to my knowledge google forces https:// everywhere.

the double slash issue is probably the reason why googlebot does indeed index this. robots.txt is a shitty protocol, i once tried to understand it in detail and coded https://www.npmjs.org/package/robotstxt and yes, there are a shitload of cases you just can't cover with a sane robots.txt file.

as there are no https://www.google.com/search (with "s" like secure) URLs indexed google(bot) probably has some failsafes to not index itself, but the old http:// URLs somehow slipped through.

but now lets go meta: consider the implications! the day google indexes itself is the day google becomes self aware. google is a big machine trying to understand the internet. now it's indexing itself, trying to understand itself - and it will succeed.the "build more data center algorithms" will kick in as google - which basically indexed the whole internet - is now indexing itself recursively! the "hire more engineers to figure how to deal with all this data" algorithm will kick in (yeah, recursively every developer will become a google dev, free vegan food!), too.

i think it's awesome.

by the way, a few years ago somebody wrote a similar story http://www.wattpad.com/3697657-google-ai-what-if-google-beca... fun enough the date for self awareness is "December 7, 2014, at 05:47 a.m" [update: ups, sorry seems to be the wrong story, but i'm sure the "google indexes itself becomes self aware" short story is out there, but i just can't find it right now ... strange coincident?]

agwa · on Sept 10, 2014

> the indexed result pages are http:// not https:// - but to my knowledge google forces https:// everywhere.

Google only forces HTTPS for certain User-Agent strings. I just tried fetching http://www.google.com with the Googlebot User-Agent string and Google did not redirect to HTTPS.

afro88 · on Sept 10, 2014

It's a bug in the indexing system, exploited by hostgator for (I'm guessing) SEO purposes. There are other people doing the same thing, and they're all spammy (viagra sales etc.)

I reckon this will be fixed in a matter of days, judging by how quickly the latin lorem ipsum google translate thing was sorted out.

thefreeman · on Sept 11, 2014

And fixed.

johnmu · on Sept 11, 2014

(I work with the search team at Google) This was a bug on our side, and should be resolved now.

egeozcan · on Sept 11, 2014

(I'm one of your users) This was a lot of fun, and it's ruined now.

Seriously, why don't you let people do this?

johnmu · on Sept 11, 2014

Handling URLs with multiple slashes in them is tricky, lots of websites silently fold them into one and return the same content, so this seems like something we should handle in the same way in search.

barrystaes · on Sept 11, 2014

Does this explain why Google search results have degraded the last 6 months? I am not trolling -seriously- for me googling first is hardly worthwhile nowadays. A user from the Netherlands. If there was a way to still use the 2009 search index, i would!!

johnmu · on Sept 11, 2014

If you want to send me specific queries (the more general, the better) and what went wrong in the search results for them, I'm more than happy to forward them to the team that works on that. I'm [this-user-name] AT google.com

kuschku · on Sept 11, 2014

There actually is a way to use the pre-2012 search index!

Just use http://www.google.com/custom I use either DuckDuckGo or this site all the time, I'd probably switch to DuckDuckGo completely if this search would go down.

maaarghk · on Sept 15, 2014

Lovely but doesn't use an old index. Just searched for the name of an album released in 2013. Usual results.

maaarghk · on Sept 11, 2014

Nooooooo! I'm going to be curious forever now.

dzhiurgis · on Sept 10, 2014

And web archive indexes it's internal IP addresses and a... live printer: http://web.archive.org/web/*/http://printer

cpqq · on Sept 10, 2014

Which from the snapshot, shows an IP that's... still online: http://208.70.27.164/hp/device/this.LCDispatcher

dzhiurgis · on Sept 10, 2014

Yep. A whois confirms it's their IP address.

Which is nothing wrong on it's own, as long it's protected by good password and doesn't fail to likes of thc-hydra.

They also had some ancient snapshots from 192.xxx range

thefreeman · on Sept 11, 2014

And it appears to have been jammed since 2009.

agwa · on Sept 10, 2014

Ouch. Stuff like this is just a confused deputy security vulnerability waiting to happen. Whenever I write code to fetch a resource based on user input (and a crawler following a link is a form of user input) I check to make sure I'm not going to fetch something on an internal network.

ParkerK · on Sept 10, 2014

They seem to be indexing Bing too ;) https://www.google.com/search?num=20&pws=0&hl=en&q=site%3Aht...

barrystaes · on Sept 11, 2014

They fixed the OP issue by now, but this still works..

spyder · on Sept 11, 2014

Looks like it's got fixed because i cannot see any results.

Igglyboo · on Sept 10, 2014

All of the results are HostGator coupons, anyone else seeing the same?

chm · on Sept 10, 2014

Yes. Look at the query:

    search?q=site%3Ahttp%3A%2F%2Fwww.google.com 
    %2F%2Fsearch%3Fq%3Dproranktracker.com%2B%2B 
    %2BHostgator%2BCoupon%2BCode%3ACOUPON333&pws=0& 
    hl=en#pws=0&hl=en&q=site:http:%2F%2Fwww.google.com 
    %2F%2Fsearch

freehunter · on Sept 10, 2014

Even if you take the search string site:http://www.google.com//search and put it into a fresh Google search, it only returns HostGator coupons. Maybe someone from Google can explain it.

nathanm412 · on Sept 10, 2014

add -hostgator to the search query and you'll find best-seller-watches.com dominating the list. Add that one to your query and things get really strange.

https://www.google.com/webhp?gws_rd=ssl#safe=off&q=site:goog...

chm · on Sept 10, 2014

Ah, I didn't notice! Interesting.

ushi · on Sept 10, 2014

It is obviously the most relevant content on this site. PageRank is always right.

underdown · on Sept 10, 2014

Google also caches itself: http://webcache.googleusercontent.com/search?q=cache:T1NRLL-...

wooptoo · on Sept 10, 2014

It's for good measure, in case it's down.

expose · on Sept 11, 2014

"Your search - site:http://www.google.com//search - did not match any documents."

kazinator · on Sept 10, 2014

The goal is to have an explicit Google search result which expresses the equivalent of "this Google search cannot be found via Google".

This will help construct a proof of Göogdel's Incompleteness Theorem.

Without being able to find anything in Google, including Google searches, and including that search for Google searches itself, Google is not a completely powerful search engine; however, it cannot be complete and consistent at the same time. There are searches which cannot be shown to be conclusively either in the index, or not in the index.

nes350 · on Sept 10, 2014

Made my day!

bictorman · on Sept 10, 2014

I wonder if it's somehow possible to exploit this to pass pagerank from google.com to your own website. Or if there's even people already doing it.

pbhjpbhj · on Sept 11, 2014

Well, let's look at the results - coupons, watches, ... - yup some blackhat SEO is probably cursing whoever publicised this issue.

giancarlostoro · on Sept 10, 2014

I think it might not be that they "index themselves" but they index links to google that others post on forums, it's common for people to link to "lmgtfy" so they probably index those links too. I don't see google "googling" on itself while indexing it's own searches. Unless Skynet.

lubujackson · on Sept 10, 2014

Results for http://www.google.com///search as well.

But not http://www.google.com////search because that's just crazy, come on.

GFischer · on Sept 11, 2014

Very strange:

https://www.google.com/search?q=site:http://www.google.com/s...

I got some searches like:

www.google.com/search@q=tetris+sorry+henk

https://www.google.com/search=pupuk+cair+alami

www.google.com/search&q=strobe+trigger+schematic

www.google.com/search@q=transvestites+used+in+rituals (!!!!)

Edit: roland-s found it first :) , and yes, the last pages of results are pretty weird.

https://news.ycombinator.com/item?id=8298239

ushi · on Sept 10, 2014

Funny thing... It works only with[0]

    site:http://www.google.com//search

but not with[1]

    site:http://www.google.com/search

[0] https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...

[1] https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...

TuxLyn · on Sept 11, 2014

Try it like this. site:www.google.com (About 34,000,000 results) or site:http://www.google.com inurl:search (About 185,000 results)

ankit84 · on Sept 10, 2014

10,800 results for "site:http://www.google.com///search"

Igglyboo · on Sept 10, 2014

Seems like they could easily fix this with robots.txt or something similar, I really doubt it's oversight on their part either.

Any ideas why they're doing this?

andybalholm · on Sept 10, 2014

I assume that some site has hostgator-related links with two slashes instead of one. Due to the two slashes, the GoogleBot doesn't realize that it's indexing their own results pages.

spindritf · on Sept 10, 2014

They disallow /search

    User-agent: *
    Disallow: /search

but maybe //search slipped thorough?

johnmu · on Sept 10, 2014

This is just a bug, sorry.

guan · on Sept 10, 2014

They probably want to index some pages on google.com, but not search results. To exclude search results, someone wrote something to exclude URLs that start with /search, and forgot that //search works the same way.

CompuHacker · on Sept 10, 2014

It works with

  site:http://www.google.com/search

, but all the results are considered duplicates and omitted. Hit the button.

daveloyall · on Sept 10, 2014

Nice! These results must represent all the hrefs people have posted that point to google search...

shangxiao · on Sept 11, 2014

Just checked this link again... It appears that Google has fixed the //search issue as it returns no results now.

charonn0 · on Sept 11, 2014

Thank you. I was wondering what all the fuss was.

tehwebguy · on Sept 10, 2014

What the fuck http://i.imgur.com/Zf2CJzS.jpg

ChuckMcM · on Sept 10, 2014

Fun. I expect a cheeky onebox to come out of this at some point along the lines of the recursion search.

hellohellokitty · on Sept 10, 2014

why the hack google ever made it possible to hit the search url with more than one slash there...

hellohellokitty · on Sept 10, 2014

It's interesting if add a slash to this page the result will be different.

https://news.ycombinator.com//item?id=8297241

Where in all other cases tested it won't

Is this a server specific stuff? Or it's configurable

http://url.spec.whatwg.org//#concept-url-path http://www.nytimes.com///pages//politics//index.html http://www.bing.com////search?q=site%3Abing.com%2Fsearch%3Fq... https://www.cloudflare.com///index

squeaky-clean · on Sept 10, 2014

Many frameworks allow you to route URLs to actions instead of mapping to a file. I just tested it in one of my Symfony projects, and I was able to route /login and //login to two separate controllers.

Furthermore, it's pretty common to rewrite URLs, doing things like adding/removing trailing slashes, whatever. So it wouldn't be too difficult to have it condense multiple slashes into just one.

For example, this link worksfine: google.com//////////////////////////////////search?q=foobar

Google search tries to cover a lot of typos or be pretty user-friendly for people who don't understand tech. I wouldn't be surprised if there's a grandma out there who thinks http://google.com//search is the correct method.

underpantsgnome · on Sept 11, 2014

no one will ever know.

hellohellokitty · on Sept 10, 2014

Tested with a jetty + spring 3 with close to ootb settings, more than one slash will resolve not found error.

olalonde · on Sept 10, 2014

Isn't the head of web spam at Google a HNer (Matt I think?)?

sejje · on Sept 10, 2014

I believe Matt Cutts went on sabbatical: https://www.mattcutts.com/blog/on-leave/

_3u10 · on Sept 10, 2014

But does it index the results of the search of the index?

digz · on Sept 10, 2014

Now google will index searches of its own searches.

carbonr · on Sept 10, 2014

Ouroboros

antino · on Sept 10, 2014

So meta.

bcRIPster · on Sept 10, 2014

Can someone nuke the link on this post. It's clearly click bate and we're just driving traffic into it. :(

blueflow · on Sept 10, 2014

You should mention the arbitrary data in the query section, its not visible at the first look.

bcoates · on Sept 10, 2014

That's an artifact of google's weird link stuffing, if you search 'site:http://www.google.com//search' by hand it still works

namuol · on Sept 10, 2014

Perhaps this "works" because all the pagerank stuff has been altered by all the sudden traffic related to hostgator coupons.

tzaman · on Sept 10, 2014

Googleception! (sorry for the useless comment, but I had to)

thanatropism · on Sept 10, 2014

Eigengoogle.

lern_too_spel · on Sept 10, 2014

Does nobody here understand robots.txt? It's pretty easy to figure out what's going on if you do. I assumed most users here work with web technologies, but maybe the readership doesn't skew that way as much as I thought.