Hi I work at Google helping webmasters like this. As far as I can tell, there ar...

Matt_Cutts · on May 17, 2012

Pierre, thanks for stopping by to confirm these issues. I often see sites be inconsistent between www and non-www, or between http and https. It looks like safeshepherd.com was doing both. More consistent redirects and adding rel=canonical should definitely help us figure out which url you prefer.

Just to confirm what I said elsewhere, this site doesn't have any manual spam actions or anything like that. It's just a matter of Google trying to pick the correct canonical url when you have a lot of different (www, non-www, http, https) urls you're showing. If you make things more consistent, I think Google will stabilize on your preferred url pretty quickly.
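
As a rough illustration of what "more consistent" can mean here, below is a minimal Python sketch that maps every www/non-www and http/https variant onto one preferred URL and decides whether a 301 is needed. The preferred scheme and host (https, www.example.com) and the function names are assumptions made up for the example, not anything taken from safeshepherd.com or from Google.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for illustration; use whichever variant you actually serve.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"


def strip_www(host: str) -> str:
    """Lower-case a hostname and drop a leading 'www.' label."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def canonical_url(url: str) -> str:
    """Return the preferred variant of url, or url unchanged if it is another site."""
    parts = urlsplit(url)
    if strip_www(parts.hostname or "") != strip_www(PREFERRED_HOST):
        return url  # different site, leave it alone
    path = parts.path or "/"
    # Any port and fragment are dropped; the query string is kept as-is.
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, parts.query, ""))


def redirect_for(url: str):
    """Return (301, target) when url is a non-preferred variant, else None."""
    target = canonical_url(url)
    return None if target == url else (301, target)
```

With these settings, all four variants of a page end up at the same https://www.example.com/... address, which is also the value you would put in the rel=canonical tag.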

jeebus · on May 17, 2012

Matt & Pierre, thanks for your thoughts and sorry that this ended up being a rookie mistake. I have a rel canonical good to go. Thanks again for your time.

Matt_Cutts · on May 17, 2012

No worries at all--glad it turned out to be easily fixable.

And now I know what "nerfed" means. :)

sixQuarks · on May 17, 2012

This is why I love HackerNews. Guy asks for SEO help, frigg'n Matt Cutts answers!

kayge · on May 17, 2012

FYI - Another fairly common usage of "nerfed" these days (especially in the gaming community) refers to something being toned-down. E.g. if many players are complaining about a character ability being too powerful, the developers may consider "nerfing" that character.

matznerd · on May 18, 2012

In Internet Marketing, we refer to being delisted as getting "sandboxed"

mh- · on May 18, 2012

As you're replying to Google's Web Spam lead, I'm sure he's familiar with the terms for obliterating your splog farms. :)

mjwalshe · on May 17, 2012

I have also seen a messed up (404 erroring) robots.txt file cause a site to get deindexed out of the blue

pierrefar · on May 18, 2012

That's a misconception. A 404 on robots.txt will not have any effect on crawling as it's treated the same as an empty robots.txt file allowing all crawling.

But it's different for 5xx HTTP errors on the robots.txt file. As Googlebot is currently configured, it will halt all crawling of the site if the site’s robots.txt file returns a 5xx status code. This crawling block will continue until Googlebot sees an acceptable status code for robots.txt fetches (HTTP 200 or 404).
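
To make the distinction concrete, here is a minimal Python sketch of the status handling described above. The function name and the return labels are invented for illustration; this is not Googlebot's actual code, and status codes other than 200, 404 and 5xx are deliberately left as "unspecified" because they are not covered here.

```python
def crawl_decision(robots_status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl decision."""
    if robots_status == 200:
        return "obey_rules"   # parse the file and follow its directives
    if robots_status == 404:
        return "allow_all"    # treated like an empty robots.txt
    if 500 <= robots_status <= 599:
        return "halt_crawl"   # stop crawling until a 200 or 404 is seen
    return "unspecified"      # not described in this thread
```

So a missing robots.txt is harmless, while one that errors out with a 500 or 503 is what stops the crawl.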

mjwalshe · on May 18, 2012

Interesting. That needs to go into the webmaster guidelines. I was not seeing 500s, and GWT was not reporting any errors for the site it happened to.

pierrefar · on May 18, 2012

It doesn't belong in the guidelines but it is described in the relevant section of the Help Center:

http://support.google.com/webmasters/bin/answer.py?hl=en&...

In summary: If for any reason we cannot reach the robots.txt due to an error (e.g. a firewall blocking Googlebot or a 5xx error code when fetching), Googlebot stops its crawling and it's reported in Webmaster Tools as a crawl error. That Help Center article above is about the error message shown in Webmaster Tools.

Given that you said you did not see errors being reported, that suggests there was something else going on. If you need more help, our forums are a great place to ask.

mjwalshe · on May 18, 2012

Cheers. I am off on leave for a week; I'll get this put into our best practice guide for our devs and IS guys when I am back.

Funny thing was I tried resubmitting the main page in GWT and all the traffic came back almost instantly.

robomartin · on May 17, 2012

I have a question: Why are these considered different sites by your algos? If we were talking ".com" vs ".net", OK, I get it. But this is about "www.domain.com" vs "domain.com" and their http and https variants. I'm sure there's something I don't understand.

Would "http://www.apple.com", "https://www.apple.com", "http://apple.com" and "https://apple.com" be treated by Google as four completely different and separate sites, each ranked in isolation from the others? Why?

jlarocco · on May 17, 2012

"www" might be a very special case, but there are lots of times where "this.domain.com" is completely unrelated to "that.domain.com".

Many sites, for example give users their own "name.whatever.com" subdomain. In those cases treating the sites as the same doesn't make any sense.

superchink · on May 17, 2012

True, but I think that is just the special case being asked about. Would it make sense to have an exception for “www”?

robomartin · on May 21, 2012

That's true, but the www/no-www/http/https cases are very likely to refer to exactly the same site. Besides, Google's algos, through crawling, should know that it is the same site. It seems unfair to punish the site for this.

coryj · on May 17, 2012

Do we still have to do the redirects if we have this set up properly in webmaster tools? That is, under configuration --> settings --> preferred domain.

Second, in webmaster tools, should we always have both the www and non-www versions set up so we can do the "change of address"? For example, if www.mysite.com is my preferred URL, do I need to also make sure mysite.com is in webmaster tools and change the address to go to www.mysite.com?

robdwoods · on May 18, 2012

My SOP is to make sure first that all non-canonical versions of the home page (the page with the most value, usually) are redirected, then make sure you set your preferred address in WMT, then add the rel=canonical tag to capture all the possible versions of the home page that you can't think of. As a side note I also noticed (the hard way) that Google treats capitalization in URIs the same way as these examples so www.Example.com is treated differently (at least for link value) than www.example.com. Basically if there is a single different character in the URI then it's considered a different URI.
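
A minimal Python sketch of the last two points, in case it helps: one helper builds the rel=canonical tag for a page's head, the other does the strict character-for-character comparison described above. The function names and example.com URLs are made up for illustration, and the strict comparison simply mirrors what this comment reports seeing; it is not a documented Google rule.

```python
from html import escape


def canonical_link_tag(preferred_url: str) -> str:
    """Build the <link rel="canonical"> element for the preferred URL."""
    # escape() keeps characters such as & and " from breaking the attribute.
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


def is_same_uri(a: str, b: str) -> bool:
    """Treat two URIs as the same only if every character matches."""
    return a == b
```

For example, is_same_uri("http://www.Example.com/", "http://www.example.com/") is False under this rule, which is why the tag should always carry one consistently spelled URL.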

Codhisattva · on May 17, 2012

Wow! Kudos to Google putting someone out in the wild answering these kinds of questions. Is there a normal place to ask for web/index help or does HN serve that purpose?

pierrefar · on May 17, 2012

The best places are our forums ( http://productforums.google.com/forum/m/#!forum/webmasters ) and our regular webmaster office hours ( http://sites.google.com/site/webmasterhelpforum/en/office-ho... ) where webmaster support Googlers (me and a few others) host a hangout on Google+ that anyone can join and ask about their site. We do them in many languages and time zones to cover the world as much as possible.

Obviously we can't be everywhere and we can't answer every question, but we try as much as possible to help when we can.

Matt_Cutts · on May 17, 2012

http://productforums.google.com/forum/#!forum/webmasters

Those folks would have spotted these issues pretty quickly.

harshreality · on May 17, 2012

Why doesn't google look at identical url paths on https, http, www, and no-www variants of the url, and if they look similar then use some default google policy to select which of them is canonical?

For example, if http://mydomain.com/path and https://www.mydomain.com/path have 95% content correlation and repeated requests to http://mydomain.com/path have 95% content correlation, and the server headers look the same, why would it not be safe to decide those are duplicates of a single canonical url?

It's not safe to merge www.domain1.com and www.domain2.com. It's not safe to merge subdomain.domain.com and www.domain.com. However, for the limited cases of www and no-www, https and http, if they look similar, I think it's harmful not to treat them as the same site. You can't expect every website owner to be aware of this issue.
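
The heuristic proposed above could be sketched roughly like this in Python. It is only an illustration of the idea, not how Google works: the 0.95 threshold comes from the 95% figure above, the function names are hypothetical, and difflib's character-level ratio is a stand-in for whatever "content correlation" would really mean. Fetching the pages and comparing server headers is left out.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

THRESHOLD = 0.95  # the 95% figure used in the comment above


def strip_www(host: str) -> str:
    """Lower-case a hostname and drop a leading 'www.' label."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def same_site_variant(url_a: str, url_b: str) -> bool:
    """True only for www/no-www and http/https variants of the same path."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if a.scheme not in ("http", "https") or b.scheme not in ("http", "https"):
        return False
    same_host = strip_www(a.hostname or "") == strip_www(b.hostname or "")
    return same_host and a.path == b.path and a.query == b.query


def likely_duplicates(url_a: str, body_a: str, url_b: str, body_b: str) -> bool:
    """Merge candidates: limited variants whose content is nearly identical."""
    if not same_site_variant(url_a, url_b):
        return False
    # autojunk=False so frequent characters in long pages are not ignored.
    ratio = SequenceMatcher(None, body_a, body_b, autojunk=False).ratio()
    return ratio >= THRESHOLD
```

Note that subdomain.domain.com vs www.domain.com and domain1.com vs domain2.com are rejected before any content comparison happens, which matches the "not safe to merge" cases.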

If it's a matter of not being able to be 100% sure, is there a single site that cares about google ranking that runs different sites on different combinations of www/no-www and https/http, but has similar content that would confuse a simple heuristic looking at page similarity? In what sort of circumstance could that happen other than with placeholder pages?

GWT allows selecting a preference between www and no-www, but I don't see a preference between https and http. I think Google should add a notice that using GWT to select between www and no-www is deprecated and the recommended way to handle www, no-www, http, and https selection is to use 301 redirects or rel="canonical" tags.

superchink · on May 18, 2012

I don't mean to derail the conversation, but I just noticed today that one of the sites I work with (aptcorner.com) has dropped off of the first page of results for the company name (it was previously at around position 4–5). Is it not enough that I've set the preferred domain in webmaster tools? Will setting a rel=canonical tag make a difference?

rebelde · on May 17, 2012

And this is why many people like me refuse to allow https connections to their sites... Who wants to confuse Googlebot?

saurik · on May 18, 2012

Yeah... this is actually a surprising (at least to me; certainly irritating) liability of allowing users to access your website via https.