Google search and search engine spam (googleblog.blogspot.com)
239 points by jsm386 on Jan 21, 2011 | 212 comments



"we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content."

"As “pure webspam” has decreased over time, attention has shifted instead to “content farms,” which are sites with shallow or low-quality content. In 2010, we launched two major algorithmic changes focused on low-quality sites. "

Looks like 2011 is the year that Google kills the scrapers. Look for an uptick in the sensitivity of the duplicate content penalty.


I don't think they're talking about duplicate content so much as stuff like Demand Media (ehow), Associated Content, Hubspot, ezinearticles, etc, etc.


It'll be interesting to see whether Google will take on some of the bigger sites. Add BizRate to the list too (its pages have a 'Related Searches' keyword stuffing section, and 99% of the site is auto-generated), along with WiseGeek: possibly the worst and most spammy content farm I've seen, IMO.

I personally am always baffled how some of the really large, spammy sites with low quality content are rewarded with premium AdSense accounts and great rankings.


It is easy to bash sites, but what is Google to do when these sites have even been found useful by users of news.ycombinator.com :

http://news.ycombinator.com/item?id=1454569

http://news.ycombinator.com/item?id=615832

http://news.ycombinator.com/item?id=795278

http://news.ycombinator.com/item?id=592295

the link given in the first example above got 6 upvotes, and there is little info in the post other than the link.

a basic description is all that most web-users want. if you are doing some deep research, then these sites probably aren't good fits, but when you just want a quick overview (which is what most Google users want), then they can be useful.

I have a sneaking suspicion that you are a webmaster who sometimes finds that your site(s) rank below the sites that you bash.


If people find them useful then that's good to know :) I guess the possibility of Google allowing individuals to have a domain/site blacklist would be a good idea since you're right that I find some sites unhelpful whilst others find them useful.

>> I have a sneaking suspicion that you are a webmaster who sometimes finds that your site(s) rank below the sites that you bash.

Honestly, that isn't the case :) I'm not a fan of monitoring the SERPs for a bunch of keywords and seeing where I rank. I sometimes do it for a few choice keywords and I've never seen WiseGeek, About, eHow rank above me (or have pages written on the searched-for topics). So that's not the case.

I pointed out BizRate since their site is mainly auto-generated and they have a keyword stuffing section, neither of which seem to provide much/any value to users.

As for WiseGeek, they might have some useful content, but I've seen plenty of pages with almost as many words of adverts as text, which clearly isn't desirable. And some of its (no doubt quickly written) content is either unhelpful or in some cases wrong.


What's wrong with eHow?


This is HN, so I'll give an example related to my startup. We make a wireless (802.11g) flash drive called the AirStash, which works like an ordinary USB SD card reader on a PC and uses HTML5 for the interface on wireless devices. Here's an example of spam content from eHow that talks about our product:

http://www.ehow.com/how_6861903_install-wireless-flash-drive...

"Wireless flash drives communicate with wireless devices using wireless protocols."

You don't say?

"Advanced wireless flash drives stream data to more than one device at a time."

Well, we're the only one out there, so I guess they're all advanced!

"Insert the USB portion of your wireless flash drive into an available USB port on your computer. Your computer should automatically recognize the device. If not, click the "Start" button and then click "My Computer." Double click the flash drive in the removable media section. This opens the drive and displays the files. Double click the executable file (start.exe, for example) to start the installation process."

Uh, no. No software installation is ever required, a point we make abundantly clear on our web site.

Basically, it's all wrong.


I just googled the AirStash and it looks awesome!


What OS does that thing run?


I guess if you find it helpful, nothing. I always skip their results because they're always really shallow. The kind of article you'd expect if you were paying someone $4 to research and write an article on a topic they know nothing about.


Exactly, and there are a huge number of $4 sites out there and a large number of companies providing that kind of outsourced content generation. If you've got $100k in funding to blow, it's really easy to get your first 10,000 pages of content and a stream of SEO traffic within a few months.

I agree with Ryan that these kinds of sites are worse than scrapers because they're just as useless but algorithmically much harder to detect.


They are content farms - plain and simple. You might consider throwing about.com in there, too.

They're made to earn their employers' revenue, not to help out people with their particular queries.


Well, why not throw out anything ending in .com or .net? After all, they're all made to earn revenue...

I hope the point is clear -- "to earn revenue" isn't an appropriate test. But your point is well taken. I'd prefer a search to turn up the site made by the guy in the garage who is passionate about the subject rather than the $4/hr content farm content.

I think if we're to continue complaining about spam we have to actually define what we mean by spam. And I have this strange feeling that not everyone will agree...


The about.com example shows how hard distinguishing quality algorithmically gets at the margins. You do actually find stuff on about.com now and then which is pretty decent--not great but decent. You might even say the same of ehow, albeit at a much lower rate.

So you don't want to just say "these are spam."


On Quora, the former owner of eHow recently wrote[1] that when he handed over the keys to the people who bought the site, it had excellent-quality content. This gave it good 'credibility' on Google. When it was taken over, it was turned into a content farm, but the atrophy of credibility is possibly disproportionate to the spam parameter variable, which leaves it in Google's results. At least according to one of his listed explanations.

I recall about.com as a decent site a while ago, but I believe it was acquired a while ago by ... Yahoo!? Go figure.

I think that might be what makes it so complicated. [1]: http://www.quora.com/Why-doesnt-Google-push-down-low-quality...


>>I recall about.com as a decent site a while ago, but I believe it was acquired a while ago by ... Yahoo!? Go figure.

About.com is owned by the New York Times Company, not Yahoo.


Why does everyone keep bashing on about.com? They've been around forever, have excellent content, and I even used their resources to learn a foreign language.


I think you mean HubPages instead of HubSpot, which is a company that makes internet marketing software for small businesses.


Arg! You're right, my mistake. I really like Hubspot, too :(


Thanks for clearing that up. I was beginning to wonder if we had done something distasteful. We're the polar opposite of search engine spam. We believe passionately in great, original content!


No, just my bad...it was too late to delete the comment. You guys are doing great, keep up the good work :)


This will be interesting. Most of the stuff on my site consists of reposts of articles I write for other venues. I wonder how the rankings will shift.


Metrics-shmetrics. Once I stop seeing StackOverflow clones listed above StackOverflow's original pages I will gladly believe that Google's search quality is "better than ever before."


I've been tracking how often this happens over the last month. It's gotten much, much better, and one additional algorithmic change coming soon should help even more.

I'm not saying that a clone will never be listed above SO, but it definitely happens less often compared to several weeks ago.


My experience is the exact opposite: I am seeing many, many more clone sites in my search results in the last few months. It feels like it increases when I accidentally click a clone site.

This happens for more than StackOverflow clones. Mailing lists, Linux man-pages, FAQs, published Linux articles, etc. all have clone pages that are obvious link farms (sometimes they even include ads that attempt to harm my computer) that rank higher than the "official" (or at least less noisy) pages.

Ideally, I'd like to completely remove domains from my results, as has been discussed elsewhere on HN. Hopefully the upcoming push for social networking that Google has will reintroduce a better-implemented "SearchWiki" feature...


I think there is a disconnect in the scale at which you are both commenting:

Comment #1:

  > I've been tracking how often this happens over the last month.
  > <snip>
  > it definitely happens less often compared to several weeks
  > ago.
Comment #2:

  > I am seeing many, many more clone sites in my search
  > results in the last few months
You can't argue against "things have gotten better in the last week" with "things have been getting worse for the last 6 months."


Try DuckDuckGo. Gabriel has been doing an aggressive job of removing unsavory domains and I've been fairly impressed. I think that Google probably can't be nearly as aggressive for political reasons.


The reason Google isn't doing the same thing as DuckDuckGo is most likely because manually banning a domain instead of improving their algorithms to avoid unwanted behaviors will only temporarily work, and only in select cases. There will always be new spam and content farm sites.


There seem to be a small number of large content farms (perhaps suggesting economies of scale are pretty important). In this case, manually killing them will work well for Herr Weinberg.


Over at blekko, we leave in a few large but marginal sites like eHow, and let users kill them with their personal spam slashtags. For smaller spam websites, we can frequently use Adsense IDs to kill them in groups.
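
(A rough sketch of how killing spam sites by shared AdSense ID can work in practice; this is not blekko's actual pipeline. It assumes the pages embed their publisher ID in the standard google_ad_client snippet, and the URLs and threshold below are placeholders.)

  # Sketch: group suspect domains by the AdSense publisher ID found in their pages.
  import re
  from collections import defaultdict
  from urllib.parse import urlparse

  import requests

  AD_CLIENT_RE = re.compile(r'(?:ca-)?pub-\d{10,16}')

  def publisher_id(url):
      """Fetch a page and return the first AdSense publisher ID found, if any."""
      try:
          html = requests.get(url, timeout=10).text
      except requests.RequestException:
          return None
      match = AD_CLIENT_RE.search(html)
      return match.group(0) if match else None

  def group_by_publisher(urls):
      """Map each publisher ID to the set of domains using it."""
      groups = defaultdict(set)
      for url in urls:
          pid = publisher_id(url)
          if pid:
              groups[pid].add(urlparse(url).netloc)
      return groups

  suspects = ["http://spam-example-one.invalid/", "http://spam-example-two.invalid/"]  # placeholders
  for pid, domains in group_by_publisher(suspects).items():
      if len(domains) >= 2:  # same publisher across several domains: a candidate group to act on
          print(pid, sorted(domains))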


Is google ever going to go on the record about companies like demand media and whether or not they get special treatment from google?


One specific and ubiquitous example of webspam has been driving me nuts this week: An enterprising spammer has won huge on Amazon Web Services (AWS)-related keywords.

For example: https://encrypted.google.com/search?hl=en&q=aws+s3+emr+p...

Result #4 at the moment is "AWS Developer Forums: Interactions between S3, EMR and HDFS ..." on http://www.hackzq8search.appspot.com/developer...com/...

What's sublime about this example is that:

1. hackzq8search is a clone of AWS's websites amazonwebservices.com, aws.typepad.com, etc

2. hackzq8search is hosted on appspot.com, Google's App Engine domain

3. hackzq8search is over quota, so the site doesn't show any content anyway.

Yet this site was the top search result, beating out the site it was cloning, time and time again on my AWS/EMR-related searches this week.

The one mitigating aspect is that hackzq8search's URL naming scheme is easily decodable -- the hackzq8search URL includes the full URL of the cloned page, so I can write a Greasemonkey script to extract the proper original URL.
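
(The exact clone URL scheme isn't shown above, so purely as an illustration: assuming a hypothetical scheme where the clone URL embeds the original host and path after its own hostname, the rewrite such a userscript would perform looks roughly like this, written in Python for readability.)

  # Sketch of recovering the original URL from a clone URL, assuming a hypothetical
  # scheme of the form http://CLONEHOST/ORIGINALHOST/ORIGINALPATH?QUERY.
  from urllib.parse import urlparse

  CLONE_HOST = "www.hackzq8search.appspot.com"

  def original_url(clone_url):
      parsed = urlparse(clone_url)
      if parsed.netloc != CLONE_HOST:
          return clone_url  # not a clone link, leave it alone
      embedded = parsed.path.lstrip("/")  # treat the rest as the source host + path
      if parsed.query:
          embedded += "?" + parsed.query
      return "http://" + embedded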

I found a glimmer of optimism in that the site has been slowly fading in SEO success this week: I complained about https://encrypted.google.com/search?q=aws+s3+security+sox+pc...  on Thursday, but on Friday the hackzq8search result was gone from the first page of search results.

It's still not hard to slam some AWS-related keywords into Google and get these bogus results, though.


Someone else already reported this. There's been some weird stuff going on with AWS-type pages, e.g. see http://news.ycombinator.com/item?id=2103401 for example. I don't know the exact cause, but I know the indexing team is aware of this issue and working on it.


When searching for a PDF file, I hardly ever find the PDF itself in the top results. The top results are often PDF/ebook search engines themselves, and they clearly appear on top by gaming Google. I hope this gets fixed too.


If you know it's a pdf that you're looking for, you can add filetype:pdf to your query.


This is great for HN readers. For most people using Google, not so much, I think.


Have you noticed sites that scrape google groups content ranking higher than the google group they've scraped? How can that possibly happen? Still seems rampant.


Thanks.


Why not just downrank the results from the SO clones?


Because that wouldn't solve the problem for clones of other sites, or clones in other languages. And the Stack Overflow cloners could just make other websites. That's why a primary instinct in search quality is to look for an algorithmic solution that goes to the root of the problem. That approach works across different languages and sites, and keeps working if someone makes new sites.

To be clear: the webspam team does reserve the right to take manual action to correct spam problems, and we do. That not only helps Google be responsive, it also improves our algorithms because we can use that data to train better ones. With Stack Overflow, I especially wanted to see Google tackle this instance with algorithms first.


> Because that wouldn't solve the problem for clones of other sites, or clones in other languages.

Sites that are the victims of content cloning have to be very visible and valuable, so maybe a little manual curating could be relevant.

> the Stack Overflow cloners could just make other websites

Not really? The point is not to tag the clones but to tag the original; everything that is not the original and that has copied content is a clone -- its name, domain or country notwithstanding.


That would be a rather awesome feature for evil people: Just copy & paste your competition's content onto SO (or any other specially protected site) and their Google ranking will drop like a stone.


Why doesn't Google just count 'original' as whoever published the corpus first...


this was discussed in an earlier thread and it seemed like the idea of finding the "original" gets really messy and game-able (and potentially oppressive if curation were used).

the primary input to search engines comes from web crawlers...the idea of "first" when it comes to duplicated content is already difficult to determine, and (I would guess) it would get much much worse in the inevitable arms race if something like this were implemented.


But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?

Because we don't understand what's hard, we think you're not really trying, and then we make up evil reasons to explain that.

I believe if people understood better the difficulties of spam fighting they would be more understanding.


> But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?

Not necessarily. The rate at which Google refreshes its crawl of a site, and how deep it crawls, depend on how often a site updates and its PageRank numbers. If a scraper site updates more often and has higher PR than the sites it's scraping, Google will be more likely to find the content there than at its source. Identifying the scraper copy as canonical because it was encountered first would be wrong.
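
(To make the distinction concrete: spotting that two pages are near-duplicates is the comparatively easy part, e.g. with a toy word-shingle/Jaccard check like the sketch below. Note that it says nothing about which copy is the original, which is the hard part being discussed here; the 0.8 threshold is arbitrary.)

  # Toy near-duplicate detector: word shingles + Jaccard similarity.
  def shingles(text, k=5):
      """Return the set of k-word shingles of a document."""
      words = text.lower().split()
      return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

  def jaccard(a, b):
      """Jaccard similarity of two shingle sets."""
      if not a or not b:
          return 0.0
      return len(a & b) / len(a | b)

  def near_duplicates(pages, threshold=0.8):
      """Yield pairs of page ids whose shingle overlap exceeds the threshold."""
      sigs = {pid: shingles(text) for pid, text in pages.items()}
      ids = sorted(sigs)
      for i, p in enumerate(ids):
          for q in ids[i + 1:]:
              if jaccard(sigs[p], sigs[q]) >= threshold:
                  yield p, q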


Do you think that if Google educated publishers and webmasters to use rel="original", similar content could be grouped into a consideration set? Using this tag as a best practice, similar to the rel="canonical" tag, might help Google identify the original content and improve search quality.


What would stop the site that is scraping content from using the rel="original" tag too?


Matt, would it make more sense to put more weight on the number of web page visitors (recorded by Google Toolbar) as opposed to the number of incoming links?


I agree with what you're saying: there needs to be an algorithmic approach.

But I'd like to say one other thing. Why is Google only doing something about web spam now, after people have pointed out how bad things have been getting? Has anybody considered creating a small team to just oversee public perceptions of the search results and try to keep on top of things in the future?


Can you provide a query where that is still the case? It hasn't happened for a few weeks for me, since Stack Overflow changed their title SEO.


Happened to me not ten minutes ago with the search string "pass json body to spring mvc"

The efreedom answer at the 5th position is actually the most relevant - the stackoverflow question from which it was copied doesn't even show up on the first page. There is one stackoverflow result on the first page, but it deals with a more complex related issue, not the simple question I was looking for.


In Google's most recent cache, the efreedom result has the word "pass" on the page due to some related links content near the bottom, whereas the stackoverflow page does not. If you modify your query to [parse json body to spring mvc], stackoverflow is at position #1, and efreedom is at position #4. This still has room for improvement, but it would seem like the simplest explanation is just the better match on your query terms.


Didn't notice that - that's good to know. That's actually exactly how I'd expect a good search engine to behave. As annoyed as I am when I get a junk result, I'd be even more pissed if Google dropped terms from my query just so it can return a more popular site.

Of course, then all the content-copy farms will respond by copying valid content plus word lists - hopefully Google knows how to detect that.


to be completely fair, the very first link (from the official springsource blog) also appears to answer your question.


True, it does... I just noticed this because I've actually got in the habit of scanning for stackoverflow results, first - they almost always are right on the money, and it's less cognitive overhead to read a site format I'm familiar with, with extraneous discussion well tucked-away.

It almost feels like a cache miss when I have to drop down to the official site/documentation, since that typically requires a greater time investment to read through to find the relevant sections.

I guess that's a tribute to how well stackoverflow works, most of the time. And also to how lazy I am.


Thanks for the concrete query--I'm happy to ping the indexing team to make sure it's not tickling any unusual bugs or coverage issues. Jeff's original blog post helped us uncover a few of those things to improve.


If you're looking for SO results why not use 'site:stackoverflow.com'? That would clear out everything else.


http://www.google.com/search?q=delayed_job+delay+priority...

Stackoverflow comes in at number 8 while clones are 6 and 10


Assuming we are looking at the same results, the pages at position #6 and #10 are not copies of the stackoverflow content at position #8. They are copies of http://stackoverflow.com/questions/1399293/test-priorities-o.... Unfortunately, the only place that the word "delay" (which is in your query) sometimes appears on that stackoverflow page is in the "related" links in the right column. At the time Google last crawled that stackoverflow page (see the cache), "delay" wasn't on the page, only "delayed". Whereas, the last time Google crawled the other two pages you mentioned, they did have "delay" on the page. Google should still be able to do better, but this little complication certainly makes things more difficult.


Yeah.. it might not be the best example, as what I was searching for is not currently possible, so there are no correct results for it.


One UI issue we've struggled with is how to tell the user that there isn't a good result for their query. This comes up all the time when we evaluate changes that remove crap pages. For nearly any search you do, something will come up, just because our index is enormous. If the only thing in the result set that remotely matches the query intent is a nearly empty page on a scummy site, is that better or worse than having no remotely relevant results at all? I definitely lean towards it being worse, but many people disagree.


I also have one spam site example. http://www.google.com/search?q=internet+phone+service Look at the 3rd result, internetphoneguide.org.


What I find seriously bad is that even a huge site like stackoverflow has to optimize its search engine strategy to fight the problem. Little web sites are doomed.


SEO is supposed to be StackOverflow's core competency. They are completely aware most people end up on their site via Google. The search on their own site sucks.

The reason Q&A sites are so visible is that people tend to type questions in their search engines, so Q&A sites are a good match to those.


Not really, a big website can not cover all keywords in its niche, no matter how big it is. The strategy for small sites is to focus on long tail keywords (3-4 terms) and outrank the big guys.


Yes, but it sucks that Stack Overflow had to add that to the title to fight the spammers, because it is often distracting to see it in the search results.


Agreed. My ideal search engine wouldn't require real websites to play in the SEO arms-race to beat out the junk sites.


That ideal search engine would quickly find itself the target of people trying to gain an advantage by figuring out how it works.

And then another SEO cycle would start. Don't forget that before Google came along nobody was trying to 'game the system' with backlinks and other trickery; the fact that Google is successful is what caused people to start gaming Google.


If it were "ideal", it wouldn't be game-able. I'm not going to claim that this ideal is possible!


Any real-world search engine is going to be analyzed until enough of its internal mechanisms are laid bare to allow gaming to some extent.

Typically you pretend the search engine is a black box, you observe what goes in to it (web pages, links between them and queries) and you try to infer its internal operations based on what comes out (results ranked in order of what the engine considers to be important).

Careful analysis will then reveal to a greater or lesser extent which elements matter most to it, and then the gaming will commence. Only by drastically changing the algorithm faster than the gamers can reverse-engineer the inner workings would a search engine be able to keep ahead, but there are only so many ways in which you can, realistically speaking, build a search engine with present technology.

Your ideal, I'm afraid, is not going to be built any time soon, if you have any ideas on how to go about this then I'm all ears.


I think the solution is a diversity of search engines. Maybe even vertical search engines. These days I get such shitty results from google for programming related searches that I've started going straight to SO and searching there. If I don't find it there I then try google, then try google groups search.


I'm a programmer, and I'm as annoyed as you about the SO clones. But keep in mind the vast, vast majority of Google users couldn't care less about StackOverflow.

Moreover, the unique licensing around SO content, along with its mass, presents an interesting edge case for Google. They should of course fix it, but it's not indicative of the average or mode experience.


Google have two strong incentives to weed out AdSense drivel sites in the search results.

1. They diminish the value of Google Search as an advertising platform. And Google Search is likely the most valuable virtual estate on the net. I more often click on ads in Google Search than I click on ads on all other sites combined. This is because when I'm on Google Search I'm actually searching for something, so I might click on a relevant ad.

2. They diminish the value of AdWords content network ads. People pay Google to display their ads because they believe they get better return for their money there than on the alternatives (Yahoo and Microsoft). Ads on low quality sites are unlikely to be competitive, so these sites decrease the relative value of AdWords.

That is, high-ranked low quality sites with AdSense are a double threat to the main source of income for Google, and I expect Google to make it their main priority.

Why, then, aren't they more successful? My guess: Because the problem is a lot harder than any armchair designer would believe. Problems tend to be a lot simpler when you are not the one who must solve them.


I agree with you on point 1; if Google's search quality is not the best, then people will (eventually) go elsewhere.

I disagree on point 2. Users on low quality AdSense sites almost certainly arrived there from a search engine, so if I can display my adverts on a user's landing page it will be almost as good as if they arrived on my site straight from Google.


In fact, AdSense ads on sites with shitty content are even more likely to be effective, because users won't find what they're looking for in the page content.


Bingo. When our indie music site's stream server would go down, our adsense ctrs would skyrocket. The people who come from google go apeshit and start clicking anything--especially adsense units--when the content they were expecting isn't found or isn't functional.


I'm guessing this is in response to http://news.ycombinator.com/item?id=2128263

1: One of Google's major weaknesses is the concentration of its revenue around a few large websites. Having the most popular advertising network on the web (AdSense) is an important asset for pretty much anything else Google wants to do.

If what you said were true, the greatest threat to the Google Search page would be a strong AdSense market with high quality content which everybody finds using organic search.

2: AdSense does not compete with Content Network because it is long tail. The value for advertisers is in reach and diversity.


And yet Mahalo is still tolerated, somehow.

E.g., this query -- "travel agent vermont" (which I got from this post complaining about Mahalo spamming the web and Google not enforcing its own qc standards http://smackdown.blogsblogsblogs.com/2010/03/08/mahalo-com-m...) still returns a Mahalo result in the top 10.


Google has taken action on Mahalo before and has removed plenty of pages from Mahalo that violated our guidelines in the past. Just because we tend not to discuss specific companies doesn't mean that we've given them any sort of free pass.


On a similar note, how is the expert sex change site still in your index? They very clearly are serving different content to the crawler (as evidenced by the "cached" link) than they are to people who click through on the SERPs. I thought this was a big no-no?

For an example (which was submitted as search feedback a month ago), try searching for "XMPP load balancing" and look at the third organic link.

(Edit: actually, in that case it appears they're using JavaScript to hide the indexed content. Same effect, however: the cache link shows the "solution" but clicking the search result displays an ad.)


As your edit indicates, when we've looked into Experts Exchange, they weren't cloaking--they were showing the same content to users that they show to Googlebot. If they were cloaking, that would be a violation of our guidelines and thus grounds for removal from our search results.


Violation of the spirit rather than the letter of the law, surely. Which would indicate that the guidelines probably need to be tweaked to close such loopholes.


While I'm not a fan of that site either, it's not true -- scroll to the bottom of the page. Sneaky? Absolutely, but the content and solution are there.


It's no longer true.

Short version is: they used to, and got busted for, serving answers to the spiders and ads and pitches to the surfers. So now they show the answer at the bottom of a pile of ads and pitches.

But they still suck. Horribly. And are the number one example I hear when people say "I wish Google would let me blacklist domains".


I can't believe no one has created a FF plugin for expert sex change (yet)!!

Edit: Even a GM script to remove all the leading spammy divs would do...


FYI:

On Chrome you have Search Engine Blacklist https://chrome.google.com/extensions/detail/jiicbcimbjppjbck...

On Firefox you can use the filter option of Optimize Google https://addons.mozilla.org/en-US/firefox/addon/optimizegoogl...


That used to be the case, but here's the complete page as it appears in Firefox to me now: http://i.imgur.com/Lw0Mh.png.

The indexed content is blurred out and there's a big ad overlaying it, with no close button or other method to display the content, other than the Google cache link.


So, here's the difference that I found. If you're coming from Google SERPs (the referrer is Google) the answer is shown near the bottom. If you copy and paste the link into a browser (empty referrer) I get the results you show in your screenshot.
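
(For anyone who wants to reproduce this: fetch the page once with a Google Referer header and once without, and compare what comes back. A minimal sketch, with a placeholder URL and marker string:)

  # Check for referrer-dependent content by fetching with and without a Google Referer.
  import requests

  URL = "http://www.experts-exchange.com/some-question.html"  # placeholder
  GOOGLE_REFERER = "http://www.google.com/search?q=xmpp+load+balancing"

  plain = requests.get(URL, timeout=10).text
  from_google = requests.get(URL, timeout=10, headers={"Referer": GOOGLE_REFERER}).text

  marker = "Accepted Solution"  # placeholder for text that only appears on the full page
  print("without referer:", marker in plain)
  print("with Google referer:", marker in from_google)
  print("identical responses:", plain == from_google)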


That screenshot was most definitely taken with Google as the HTTP referer. (I'm too lazy to cut-and-paste the URL from the search page into the browser, I just clicked the link.)

Clicking the link again shows the additional content at the bottom of the page as you described. So there's some other algorithm at play.


Mahalo pages are still in your index.

Do you consider that site to be a quality & non-spam source of information?


Of course the pages are in their index. Google isn't a curated walled garden. I expect every page on the public internet to be in their index.

The question is one of rankings. The only time there is a problem is when a spammy site ranks above a more relevant site for a particular search. If I enter a very specific query that only hits a spammy site, then I should see the spammy site, because it's there. Google is a search index, not a "visitors guide to the internet."


His reply seems to indicate that they are working on a per-page basis, not moving the whole site on or off the index.


Google knows what that site is.

If they're serious about their standards, they would remove Mahalo en masse from their index.

edit: Or, to satisfy lukev, they can keep the index as-is but make sure Mahalo pages never rank high in results.


It might be better to cripple the site rather than kill it. If all of Mahalo's pages disappeared you can be sure they would return en masse when they found the workaround for the filter. Blocking chunks of their content might make finding the workaround harder and may ultimately force them (or any other low quality site) to up the quality - yeah, I know I am deluding myself.


Exactly; it's not like domain names or IPs are expensive.


One misconception that we’ve seen in the last few weeks is the idea that Google doesn’t take as strong action on spammy content in our index if those sites are serving Google ads.

That's not quite what I've been reading. I believe the more common claim is that Google has a disincentive to algorithmically weed out the kind of drivel that exists for no other reason than to make its publisher money via AdSense. It's about aggregate effects, not failure to clamp down on individual sites. Or, put another way, it's not about whether certain sites are serving Google ads; it's that that kind of content is usually associated with AdSense.

AdSense is definitely a problem for search quality. It creates the same imperative for the content farm as Google Search has: get the user to click off the page as soon as possible. And the easiest way to do that is to create high-ranking but unsatisfying content with lots of ad links mixed in.


I agree. Also interesting to see that Google defines webspam as "pages that cheat" or "violate search engine quality guidelines." By this definition, scraper sites are not spam at all. Nor are the spammy sites in my field which super-optimize for keywords in ways that make it difficult for legitimate content to rise to visibility.

If Google did not operate AdSense, it seems hard to believe the company would not have penalized this sort of behavior ages ago. A love for AdSense is probably the single largest thing spam sites have in common worldwide.


"By this definition, scraper sites are not spam at all."

Disagree. Our quality guidelines at http://www.google.com/support/webmasters/bin/answer.py?hl=en... say "Don't create multiple pages, subdomains, or domains with substantially duplicate content." Duplicate content can be content copied within the site itself or copied from other sites.

Stack Overflow is a bit of a weird case, by the way, because their content license allowed anyone to copy their content. If they didn't have that license, we could consider the clones of SO to be scraper sites that clearly violate our guidelines.


Smaller competitors can't eat their lunch in web search, because all the content that was on the web is now on Wikipedia, YouTube, or Google Maps. Personally, I search these directly from the address bar. For the past four years I've only had two use cases for web search: 1. as a spell checker for proper nouns (and before Alpha, as a calculator) 2. to circumvent paywalls on scholarly papers by doing filetype:pdf on the title (works better than Scholar most of the time).


You know, sometimes it really makes me mad that it's difficult to get in contact with a person at Google for support of their products. But I've got to hand it to them. They could have issued a non-personal public statement like most companies and signed it "Google Search Team." But instead there is a personal touch. It's public statements like these, adding just a bit of personal touch, that make people love them. My 2 cents. Entrepreneurs take note.


It's safe to say that many Googlers read what people write on the web and talk about it internally, even if we don't always respond. We're power users too, so if a search result bothers you, it almost certainly bothers us too. :)


Matt,

Can you speak about the possibility for personal domain blacklists for Google accounts? I know giving users the option to remove sites from their own search results is talked about a lot in these HN threads. Is there any talk internally about implementing something like this?


We've definitely discussed this. Our policy in search quality is not to pre-announce things before they launch. If we offer an experiment along those lines, I'll be among the first to show up here and let people know about it. :)


Can you let me know as well


The reason you can't get in contact with people at Google is that you're not the customer. Paying businesses are the customers. Not just trying to sound snarky - it's true.


No, the reason why it's difficult to contact people at Google is that 1 billion+ users visit us each week, and we only have ~20,000 Google employees. Even if every single employee did nothing but user support 24/7, each Google employee would need to do tech support for 50,000 users apiece.

Likewise, there are 200,000,000+ domain names. Even if every single employee did nothing but webmaster support 24/7, each Google employee would need to do tech support for 10,000 domains apiece. The same argument goes for supporting hundreds of thousands of advertisers.

The problem of user, customer, and advertiser support at web-wide levels is very hard. That's why we've looked for scalable solutions like blogging and videos. I've made 300+ videos that have gotten 3.5M views: http://www.youtube.com/user/GoogleWebmasterHelp for example. There's no way I could talk to that many webmasters personally.

So we haven't found a way to do 1:1 conversation for everyone that has a question about Google. That's not even raising the back-and-forth that some people want to have with Google. See http://www.google.com/support/forum/p/Webmasters/thread?tid=... and http://www.google.com/support/forum/p/Webmasters/thread?fid=... to get a glimpse at the sort of prolonged conversations that people want to have with Google. In short: it's a hard problem.


Of course not everybody should be able to contact Google 1:1, but at least the people who were subject to an action that required human intervention from Google should be.

Example: I get my AdSense account or site banned in a non-automatic way because there is some problem with the content: not in an automated way, but because somebody looked at my site.

I should, in that case, have a chance to communicate with Google. This is inherently scalable as everything started with a 1:1 action.


You do, though. I recently had my AdWords account suspended and got to talk to a human at Google to handle the issue (through normal channels). For Chrome OS, we have an actual call center with real people sitting in it.


That's very good; this way it is balanced, as nobody with a minimal sense of business can expect Google to reply 1:1 to normal users who happen not to find what they want in the search engine or the like.


It is, however, completely wrong to assume that this means that Google has no incentive to treat you well. It is not altruism that compels them to remove spam from search results.

Google needs many happy users to be able to offer an attractive product to its customers. They depend on people using them just as much as they depend on advertisers paying them money. Both sides of the market are important, advertisers and users. Google cannot just ignore one side.

Those who use Google to search don't pay Google any money and they are, in that sense, not customers. You shouldn't read much more into that word, though.

(Two sided markets and network effects are fascinating and relevant to so many discussions on HN. Wikipedia has a pretty good primer: http://en.wikipedia.org/wiki/Two-sided_market)


While you're right, I'll note that it isn't that easy to get in touch with a person at Google even if you are an actual paying customer.


Google has taken a lot of criticism on HN and elsewhere for an apparent perverse incentive, to direct searchers to content farms with adwords, instead of the original source (like StackOverflow or Amazon reviews).

I'm skeptical, because spammy-ad clickthrough rates are already low and trending lower, and I speculate google has great incentive to send people where they want to go lest their competitors get stronger.


Google also has a decade+ of track record of choosing the right long-term thing for users instead of short-term revenue: 1) not running annoying punch-the-monkey banner ads in the early days when everyone else was doing it. 2) not running pop-up ads when everyone else was doing it. 3) little-known fact: if the Google ads aren't ready by the time your search results are ready, you just don't see ads for that query. We don't delay your search results in order to show you ads.

It just wouldn't make sense for Google to suddenly abandon that (very successful) strategy and say "let's keep spammy/low-quality sites around and send users there because we make money off the ads." We make more money long-term when our users get great results. That makes them more happy and thus more loyal.


Regarding #3, why are these two things coupled in terms of page load? With all the js stuff you guys are doing now with instant, it seems like you could load up the ads on the right a split second after the search results and no one would notice or care...but I'm guessing you've tried this and it didn't test as well :)


Great question. The "don't wait for ads" policy has been around since ~2001, way before AJAX became common. In theory you could make it so that the ads loaded when they were ready, but that could also generate a visual "pop" that I imagine would annoy many users.

My preference is just to enforce a hard time deadline. If the ads team starts to miss that deadline and revenue decreases, then they're highly motivated to speed their system up. :)
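
(The hard-deadline pattern is simple to sketch: start the ads fetch alongside the search, and ship the page without ads if they miss the budget. A toy illustration, not Google's serving stack; the timings below are made up.)

  # Toy hard deadline for ads: results always ship on time, with or without ads.
  import asyncio

  async def fetch_results(query):
      await asyncio.sleep(0.02)   # stand-in for the search backend
      return ["result 1", "result 2"]

  async def fetch_ads(query):
      await asyncio.sleep(0.2)    # stand-in for a slow ads backend
      return ["ad 1"]

  async def serve(query, ad_deadline=0.05):
      results_task = asyncio.create_task(fetch_results(query))
      ads_task = asyncio.create_task(fetch_ads(query))
      results = await results_task
      try:
          ads = await asyncio.wait_for(ads_task, timeout=ad_deadline)
      except asyncio.TimeoutError:
          ads = []                # ads missed the deadline: ship the page without them
      return results, ads

  print(asyncio.run(serve("example query")))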


The definition of 'low-quality' sites can be rather subjective, can't it? I personally get no value out of Mahalo or Squidoo but I know others actually do.

While I might not love eHow's process and tactics, I'll admit that I've benefited from one or two of their easy-to-read articles. (Though I'll often second source that information.)

I think Google's in an interesting pickle to decide whether results should be: fast food or fine dining.

Google could serve up 'fast food' in results (e.g. - farmed content) and it would likely be 'good enough' for MOST people. Plenty of folks will eat fast food even if there is a better alternative down the road.

I'd like to see more fine dining results, but I'm not sure I'm in the majority there. Quality for the early-adopter might look different than quality for the late-majority.

If the algorithm is leveraging user behavior as a signal, doesn't it follow that popular and familiar sites and brands may gain more trust and authority?

Is Google thinking at all about search in this way?


Although a google with perfect results would kill itself because the ads would be pointless.


disagree. counter-example: if I google "philly events" and google sends me right to philly2night.com, I certainly want to see ads selling me event tickets.


A perfect search would show the pages that the ad points to as the first organic results if they were, in fact, the most relevant.


Are you seriously claiming that the ideal search engine when given "Philly events" should guess which event in Philly you actually want tickets for and send you to that event's webpage?


I didn't say that you'd enter a text search. I was speaking to the goals of the search not the specific modality. If Google were able to fulfill the goals of the searcher (as they claim to want to) then ads would be unnecessary.


If they partnered with Facebook, it wouldn't be too difficult.


To me, the most interesting aspect of this situation is the conflict between Google's view and the blogosphere's view. On the one hand, "...according to the evaluation metrics that we’ve refined over more than a decade, Google’s search quality is better than it has ever been...". On the other hand, you can't open an RSS reader today without tripping over someone griping about content farms polluting the search results. There are intelligent, thoughtful people on both sides of the debate. Why such disparate viewpoints?

As Matt's post suggests, it could simply be that people's expectations are rising -- search results are getting so good in general (which they are) that we notice the problems more. Or it could be that Google is focused on a narrow definition of "spam" that doesn't cover content farms. It could even be that both sides are "right" -- that overall search quality is rising even as the content farm problem worsens, if Google has been successfully reducing other causes of low search quality.

I'd love to see some hard analysis of this. For instance, pick a reasonably large set of sample queries, and show what the results looked like five years ago and what they look like today. Of course, you'd first have to find a set of sample queries and results from five years ago.


We do have some data on this. I'll ask about whether we can do some comparisons of "Google today" vs. "Google three years ago."


Beyond the flood of SEO spam and Demand Media-style content mills, there's another search quality problem I have with Google: torrent sites. I will frequently search on the exact name of a song or album on Google in order to find out more information about that song or album, but lately most of the results have been links to torrents, including results on the first page. This applies even if I add "review" to the search query. I will even see links to torrents ranking above links to iTunes.

These songs and albums are not available legitimately through torrents. What value is there in providing links to pirated content? I understand that Google is not under any legal obligation to remove these results, but as a non-pirate these results are significantly lowering my perception of the quality of Google's search results.


Google tries to make no judgments about the sites that rise to the top of their rankings.


Isn't this the whole problem? 90% of the web is crap. If Google can't deliver the non-crap, I'll be looking elsewhere.


I'd be interested in a further explanation of "Google absolutely takes action on sites that violate our quality guidelines [...]".

Does that mean that Google manually decreases rankings of spammy sites that their algorithms haven't caught? Does this entail decreasing the rank of the entire domain, or the IP? Does blacklisting ever happen?

I ask since Google have previously[1] said they don't wish to manually interfere with search results.

[1] "The second reason we have a principle against manually adjusting our results is that often a broken query is just a symptom of a potential improvement to be made to our ranking algorithm" - http://googleblog.blogspot.com/2008/07/introduction-to-googl...


"Does that mean that Google manually decrease rankings of spammy sites that their algorithms haven't caught?"

Although our first instinct is to look for an algorithmic solution, yes, we can. In the blog post you mentioned, it says

"I should add, however, that there are clear written policies for websites recommended by Google, and we do take action on sites that are in violation of our policies or for a small number of other reasons (e.g. legal requirements, child porn, viruses/malware, etc)."

As the quote mentions, we do reserve the right to take action on sites that violate our quality guidelines. The guidelines are here, by the way: http://www.google.com/support/webmasters/bin/answer.py?hl=en...


Am I the only one who was really hoping for some specifics about what they're doing and plan to do about content farm rankings? Without that, the article is virtually devoid of content other than "we're really not so bad!"

Edit: By specifics, I don't necessarily mean implementation details, just anything more informative and plan-of-action than acknowledging the problem.


Our policy in search-quality is not to pre-announce things, but we did give some pretty strong hints about planned improvements to search quality in that post (e.g. talking about scraper sites). I'll be happy to talk more about them soon when they launch.


Hey Matt, is there anything y'all can do about the content farm sites where someone buys an old high-PR domain, sells hundreds of links on it, and drops them in between tons of content? Here's a perfect example of that: http://www.dcphpconference.com/.


That is clearly webspam. We only indexed five pages from that site, but there's no reason for that site to show up at all. Thanks for mentioning it. It won't be in our index for much longer.


Hey Matt, thanks for the response. I actually have a huge list of these types of sites that I've submitted through the spam report, and 90% of them are still indexed and have PageRank. Is there another way I can send them in?


Matt - I have a posterous site (which uses a subdomain of my main site) that I mostly use to index some of my favorite articles - same purpose as delicious, but I can rest assured that the content will always be there. Since most of the site has dupe content, it may lose its PR (which I don't care about), but will that affect my primary site's ranking?

I am not sure if Posterous offers the ability to add a robots.txt file with which I can tell search engines to exclude it, but would that prevent my primary site ranking from being affected?
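
(Not an official answer, but for reference: this only works if Posterous lets you serve such files on the subdomain. A robots.txt at the subdomain root blocks crawling, e.g.:

  User-agent: *
  Disallow: /

though a blocked URL can still end up indexed from links pointing at it; a robots meta tag with "noindex" on the pages, or an X-Robots-Tag: noindex response header, is the more direct way to keep the duplicate copies out of the index.)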


Awesome. Gosh, that page is ranked 5! I still think Google is the best and trustworthy, but like others I have also felt the content farm problem in recent years. I run my own site and find it troublesome that many spammy sites contact me to buy links. They have absolutely no interest in nofollow ads, even banner ads. They just want normal text links, often with a paragraph they specify mixed into my writing.

Though, I do have a question about this... sometimes I find that one normal (do-follow) link on a page is appropriate for my readers (e.g. the content matches and I approve of the site), so in that situation I would like to sell the link, absolutely with clear "sponsored link" text in plain view, but without rel="nofollow". AFAIK Google doesn't approve of do-follow sold links in any way. I find this a bit of a problem, because good links from commercial activity could be just as appropriate and benefit my readers.

Any advice, or perhaps future plans, on this?


Awesome, I look forward to that.


As much as I like transparency, I think it's appropriate for Google to keep the details of their search-algorithms secret. Maybe they could do something like "de-classify" the algorithms from a few years ago (this would be cool!) but, in the interest of staying ahead of the spammers, I think it's totally appropriate for them to keep quiet.


Since google wants to do everything with algorithms rather than hard coding (some of) the offenders, the minute they give you specifics the "fix" will be circumvented in some way.


Google is probably right not to go down the hard-coding path. Hard coding the offenders doesn't work. You block a domain, they change names. You block an IP, they switch to AWS so that your choice is to block all of AWS or not block them.


I totally agree; my post didn't intend to say the contrary.


While I applaud the direct personal response, I feel like the content says "we don't see a problem." If users see a problem and you don't, smaller competitors can eat your lunch. I'm kind of hoping for some competition in the field.

In terms of adsense, if you really think about it, adsense content on a page should probably be a slightly negative ranking signal (not just not a positive signal). The very best quality pages have no ads. Think of government pages, nonprofits, .edu, quality personal blogs, etc. If no one is making money off a page (no ads) then whatever purpose it has, it is likely to be non-spammy.


I feel like the content says "we don't see a problem."

We see a virtually unbounded number of problems with our search results, and we're working constantly to fix them. Most of the people I talk to who work on search have the attitude that Google is horribly broken all the time, it's just also measurably the best thing available.

Google as a company, and search quality in particular, does not rest on its laurels. The people who hate Google's search results the most all work at Google. If you think you hate Google's search results as much as we do, you should come work for us. :)


Wrong. I mentioned this before. Check out AdSense on the NY Times homepage. If you punish AdSense sites, you'll mostly punish smaller business or one-person sites that are not big enough to have in-house ad sales.


I believe it is very hard to implement algorithms that can tell the difference between stackoverflow.com and a rip-off, or a legitimate Apache mailing list archive and a rip-off.

Why not allow the community to sort this out? "Google Custom Search" already exists. Google could extend that in the direction of letting people customize Google search to exclude certain sites from the results (right now it is only possible to specify a list of sites to include in the search).

Blacklists for at least specific "fields of searching" would emerge very quickly. People could select what blacklists to use, if any.


As a pointer: Matt_Cutts is the head of the webspam team at Google. He's been very active on this thread. Please search below for his posts.


There are a few signals that should be possible to pick up. Examples:

- When power searchers start adding -somedomain.xyz to their searches

- Increase spam reporting by adding some kind of feedback to the spam reporting feature. I think I'd love to get an automated mail saying something like: "The site somespamdommain.xyz that you and others reported x days ago is now handled by our improved algorithms". Submitting spam reports really doesn't feel useful when it seems like nothing ever happens.

- Adding weight to spam reports. You know a lot about us, and I guess you can figure out who the power searchers are. This could help stop people from gaming the system into blocking competitors.


Google AdSense for Domains ( http://www.google.com/domainpark/) really makes a lie of the claim of not wanting useless content. They designed a revenue source for parkers/squatters.


How often have you searched Google and been sent to one of those pages?

They don't want useless content in their serps. If someone goes directly to the domain, that's in no way related to search quality.


It is very related. Google made spam sites and splogs profitable (through their various ad networks). Therefore they are the major funder of spam sites - even if they themselves do not promote them.


+1, and my perfectly non-spammy blog was rejected by them. They didn't mention the reason, but I have a hunch that it was because it isn't about any one or two things. It's about a bunch of things I find interesting - both personally and professionally. Now, Google must have found it difficult to determine what ads to show, so they just black-listed my blog for ad purposes. Period.


The timing of this is interesting... About a week before Demand Media's IPO. Must be a bad day for the investment bankers.


I think there might be something else at work here: our rising expectations of how search engines should work.

In years past, Google's results were measurably less relevant than they are today. In the time between "then" and "now", we've grown more accustomed to high quality, fast, relevant results. I think this makes it seem like small problems in search are bigger than they are.

It would be great if there was a "Google of 2004" to test this side by side, but I don't think that is possible. :)


@Matt I like that you pinpointed some specifics here, but is the algorithm going to be strong enough to easily pick up things like this: http://posterous.com/people/YrCushFlSet This single account alone is feeding 100 different websites that all feed hundreds of others. This is helping fuel numerous sites in the plastic surgery niche and it's quite disturbing.


mwilton13, I passed this on to my team when you mentioned it to me on Twitter recently. Have you mentioned the site to Posterous too?


Thank you Matt. I am following up with Posterous and the other sites involved on Monday. We reported one of the content farms on blogger last year too.


Please, exile half or more of Demand Media's pages from the index before their imminent IPO!


I'm just impressed with how this was handled. Consider how much more technocratic this is than a news release from ten years ago.

Google issues an announcement via blog post. TC and others start to pick it up. And the original author of the blog post takes questions and provides technical answers, where allowed, in HN.


Google cannot escape their fundamental conflict of interest: they make money by selling web traffic to advertisers, then buying that traffic back at discounted rates and re-selling it again, over and over. None of their revenue comes directly from search, though search is their primary source of raw traffic. Their search results don't have to be good, they just have to be good enough to sustain traffic. And right now there are so many people who reflexively use Google out of habit that their results could deteriorate substantially (and many would argue already have) before it impacts their shell-game revenue stream.


I'm curious, though, what metrics they use to evaluate effectiveness against spam. It could have just as much (or as little) spam indexed as it had 5 years ago, and in some comparisons that would be valid, but what if much of the spam had moved from being evenly distributed throughout the results to being concentrated in the top positions? Then one could say spam was lower than ever in total quantity, but it would be even worse in terms of user experience.

That said, I agree with Google, users' expectations have skyrocketed, and it is tough to keep pace with them.


I actually feel quite comfortable with our metrics. Back in 2003 or so, we had pretty primitive measures of webspam levels. But the case that you're wondering about (more spam, but in different positions) wouldn't slip past the current metrics.


How do you interpret the recent backlash from users? In your eyes, have we become more used to "perfect" results, or are the fewer bad results left more insidious and thus more harmful (despite the overall level of quality being higher)?

Personally I tend to find what I'm looking for by adding a few more words, but in the case of reviews and tech stuff it doesn't always work and I often have to rewrite my query one or more times to get something valuable.


in the case of reviews and tech stuff it doesn't always work

This is one of my pet peeves. If I search for "product X review", most of the results I get back are of the form "be the first to review product X", which is absolutely not what I want.
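
One workaround, using nothing beyond Google's standard phrase-exclusion operator (no claim about the post's changes), is to subtract the boilerplate phrase from the query:

    product x review -"be the first to review"
    product x review -"be the first to review" -"write a review"

It doesn't fix the ranking problem, but it usually clears the empty review pages off the first page.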


How do you measure your own performance regarding spam?

If the measurement system is able to detect spam, why is spam not removed in the first place, before it has a chance to show up in the metrics...?


I don't want to leak our eval methodology to competitors, but you can take subsamples and get those samples hand-rated by experts.
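
As a rough sketch of how that kind of sampling could work in general (my own assumption, not a description of Google's eval): hand-rate a random sample of results and extrapolate, with a confidence interval that shrinks as the sample grows.

    import math

    def estimate_spam_rate(rated_sample, z=1.96):
        # rated_sample: booleans from expert raters, True = spam
        n = len(rated_sample)
        p = sum(rated_sample) / n
        margin = z * math.sqrt(p * (1 - p) / n)   # 95% normal-approx interval
        return p, margin

    # e.g. 1,000 hand-rated results, 42 judged to be spam
    p, m = estimate_spam_rate([True] * 42 + [False] * 958)
    print(f"estimated spam rate: {p:.1%} +/- {m:.1%}")  # 4.2% +/- 1.2%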


Spam is not only in organic search but also in image search. I observed a site that steals 140,000 (!) images by hotlinking (including some of my pictures). At first the pages themselves seemed "clean": they only set hotlinks to blended-search images, and they got a lot of traffic, that's for sure. Then they switched the site: at the top there are two porn ads (it was xslt*us). I wrote a spam report and posted it in a webmaster forum, but it took about 10 days until the site was removed. Hope this gets better... And: hotlinking is a big problem.


First, I haven't noticed any significant increase in the frequency of spam sites appearing in the Google search results. The biggest problem that I did notice is social bookmarking sites, like reddit and digg, outranking original content. They often have nothing more than a copied-and-pasted paragraph, sometimes supplemented by low-quality comments (as is common with these types of sites). Since this site is very similar, not many people will be concerned about this issue.

Second biggest issue is poor Wikipedia articles appearing in the top results for almost any reference type query. Many less frequently updated Wikipedia articles are nothing but regurgitated content lifted from other quality sources. What makes it worse is that Wikipedia is using no-follow for their links, so even if these sites are linked in the reference section, they won't get any credit. It's interesting to see so many people complain about low quality content on commercial sites, but they never mention Wikipedia, which is a much bigger offender (I guess this might be because Wikipedia gets its content for free and doesn't have ads and other sites pay for the content and do have ads).

Third, I hope Google doesn't make any changes without checking very carefully that good sites will not be negatively affected. For example, newspapers will often have the exact same articles from the AP, but also original content based on their own reporting. Punishing them for having duplicate content would not work well. There are many similar possible pitfalls.


"The short answer is that according to the evaluation metrics that we’ve refined over more than a decade, Google’s search quality is better than it has ever been in terms of relevance, freshness and comprehensiveness."

The long answer is that without Wikipedia results, Google's search quality would be at an all-time low in terms of relevance, freshness and comprehensiveness.


When a company needs to write a blog post in this tone, they are definitely losing ground.

What you are saying != how you are performing.


This is just lip service - Google's quality has dropped off and it's really obvious. Recently, I've been regularly comparing the results I get from Google with the ones I get from Bing. Needless to say, Bing's are far more relevant. The biggest point people have made is about money searches like "MP3 player", but the results I've been comparing have been local searches and programming things like "show/hide text boxes in JavaScript." In Google all I get are links to Amazon and other random results for link farms like javascriptworld.com. In Bing, I get links to forums and tutorials, which is what I'm looking for.

Time and time again Google has failed. I've already moved on to Bing and DuckDuckGo, and I would recommend you do too. Unless you like digging through hordes of useless SERPs.


Consider that possibly the problem isn't Google; perhaps it's you.

For instance, http://www.javascriptworld.com (which I run) is NOT a link farm. Rather, it's there to support the book "JavaScript & Ajax for the Web: Visual QuickStart Guide, 7th edition" (by Tom Negrino and myself).

Numerous college & university courses use our book as a required textbook. Consequently, many teachers & professors link to it from course websites to let their students know where to download the book's code examples. Yes, that probably helps the site's Google rank, but that was never one of our goals.

As a result, when you search for certain JavaScript examples, you may run across my site. And in your particular case, my site doesn't help because you don't own our book. But just because it doesn't solve your particular problem doesn't mean that the site isn't useful for other people.

I'm not sure why you thought it was a link farm. Perhaps you should have taken a closer look at the site before using it as a bad example?


A recent Google search for the Walmart being built 2 miles away from me led me nowhere. Google listed 2 pages of job sites listing open positions at this Walmart. I almost gave up my search, but decided to search Twitter, and in doing so I found what I was looking for!


I'd be curious to know where the Walmart is and what searches you tried.


On Jan. 15th I was curious whether the Walmart was going to be a supercenter, so I searched for "fallston md walmart" and "will fallston md walmart be a supercenter." Neither query shows the most relevant and recent info; rather, the results are littered with prominent job sites. On the search for "will fallston md walmart be a supercenter," I do see the sixth result is http://www.belairnewsandviews.com/2011/01/job-ads-out-this-w..., but that information was weak to me; it only deduces from all the job listings that yes, this will be a Super Walmart. Since I found the results lacking, I immediately went to Twitter and found an even more detailed article (http://belair.patch.com/articles/fallston-walmart-may-open-b...) with pictures that I think should have appeared for one of my queries. That article was published Jan. 14th, and Google did not point me to the most recent, up-to-date info; it favored prominent job sites' listings, yet Twitter did.

Pardon if this seems nitpicky, but I just wanted to share one of my recent experiences where Google failed me (I have another example too, but that search was personal), while Twitter did not.


This is interesting. Part of the problem is answering questions when there's not much content on the web. I think a lot of the job sites showed up because Walmart is hiring for that location in preparation for opening.

I'm just now testing this query, but when I did [fallston md walmart], there is a really comprehensive article at http://www.exploreharford.com/news/3074/work-start-fallston-... that shows up at #5. That article mentions that "The new road will lead into a 147,465-square-foot Walmart, which has been planned with a possible 57,628-square-foot expansion" Then I did the search [walmart supercenter size square feet]. The #1 and #2 results are both pretty good, e.g. the page at http://walmartstores.com/aboutus/7606.aspx says that an average store is 108K square feet, and supercenters average 185K square feet.

As a human, I can deduce that 147K square feet (with 57K square feet of potential expansion) implies that it will probably be a supercenter, but it's practically impossible for a search engine to figure out this sort of multi-step computation right now. My guess is that in a couple months when the store opens up, this information will be more findable on the web--for example, somewhere on Walmart's web site. But for now, this is a really hard search because there's just not much info on the web.
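
The deduction itself is trivial once a human lines up the numbers; here it is spelled out with the figures from the two pages mentioned above, purely to illustrate the multi-step inference a search engine would have to make:

    planned_sq_ft = 147_465
    possible_expansion_sq_ft = 57_628
    avg_store_sq_ft = 108_000          # average discount store
    avg_supercenter_sq_ft = 185_000    # average supercenter

    with_expansion = planned_sq_ft + possible_expansion_sq_ft   # 205,093
    # Closer to the supercenter average than to the discount-store average,
    # so a human concludes it will probably be a supercenter.
    print(abs(with_expansion - avg_supercenter_sq_ft) <
          abs(with_expansion - avg_store_sq_ft))                # True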

I appreciate you sharing this search. I think it's a pretty fair illustration of how peoples' expectations of Google have grown exponentially over the years. :)


We love stuff like this--thanks for the specifics!


For the love of god, just give us the tools for effective personal blacklists. With Google's constant changes to the search site and the difficulty in efficiently and effectively monitoring live search results via browser extensions, it's been at best hit or miss. Whether that comes in the form of some API that can be tapped to make a good extension or having it built into the browser, I don't care.

Google, of all companies, I would have thought would understand and respect the importance of giving people power over their own technology experiences.


I can't help but think that this is almost an exact parallel of the iPhone antenna problem. Both companies had minor to medium problems in their flagship products, both problems were vastly overblown by the media, both problems spurred an unbelievable spate of batshit conspiracy theories, and to take the cake both companies responded with the same "This is not a problem, but here's the solution." Good problems to have.


What sounds odd in all this is that I think the spam sites and "content farms" are generating a lot of ad clicks for Google. Will they really take appropriate action against these sites if it means a significant cut in their earnings?

I played with AdSense a lot in the past, and if you did too, you should know how spam sites generate a lot more clicks than sites where the user is actually focused on reading content...


Why don't we collaboratively blacklist or push down domains from our Google results? This could be a stopgap measure until Google incorporates such a feature including the collaborative database or magically puts an end to spam.

A proposal: http://www.saigonist.com/content/google-spam-content-farm-fi...
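
As a minimal sketch of what such a stopgap could look like on the client side (all domains here are hypothetical placeholders): fetch a shared blacklist and drop or demote any result whose domain appears on it.

    from urllib.parse import urlparse

    # Hypothetical shared blacklist, e.g. periodically synced from a
    # collaborative database.
    BLACKLIST = {"example-content-farm.com", "scraper.example.org"}

    def filter_results(result_urls, blacklist=BLACKLIST):
        kept, dropped = [], []
        for url in result_urls:
            domain = urlparse(url).netloc.lower()
            if domain.startswith("www."):
                domain = domain[4:]
            (dropped if domain in blacklist else kept).append(url)
        return kept, dropped

A browser extension could apply the same check to each result block as the SERP renders, which is roughly what the proposal describes.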


In other industries, the government mandates a "Chinese Wall" (no communication or shared personnel) between segments of a company that have conflicts of interest. In this case, search and advertising qualify. I expect to see proposals for this kind of regulation in the next 10 years as the FCC begins regulating the internet.


Good luck, see the Comcast & NBC merger. The FCC is impotent.


This reads just like the nutrition label on Nestle products: the ones that boast that the product has "3 vitamins AND minerals", and, on a distant part of the package, lists two minerals and one vitamin and what the health benefits of them are.

They'd be a lot more credible without the corpo-speak junk in the first paragraph.


I've got the results of a test run of search results I did in October 2000 sitting on my computer. The Google search results from today are much better than from October 2000. I do think it's fair to make the point that overall Google is much better--and overall spam is much lower than 5 or 10 years ago--before we drilled down into responding to the blog posts from the last couple months.


Matt,

I'm sure that Google does better than it did in October 2000. There'd be something horribly wrong if that wasn't the case. That's not relevant to the specific concerns people have about what's happening here and now.


Sorry, I was responding to the "They'd be a lot more credible without the corpo-speak junk in the first paragraph." by trying to defend that the statements in the first paragraph were true.

Regarding the specific concerns that people have been raising in the last couple months, I think the blog post tried to acknowledge a recent uptick in spam and then describe some ways we plan to tackle the issue.


What are people's thoughts on companies that do create content farms? From the perspective of being a successful company, rather than "I hate the spam and I hope they all DIAF".

Personally I think any type of "scheming" in technology will eventually get caught and then all of a sudden there goes your business model.


It depends on the quality of the content, in my opinion. Ultimately magazines are 'content farms'. Various game review sites are 'content farms', since they're both designed to 'churn out' content and articles. (To give a couple of quick - albeit silly - examples). The difference is that the content in these two cases is high quality, usually containing images and possibly videos, and with lots of unique ideas and opinions given.

So I guess it depends. If a 'content farm' produces good, helpful content, then that's great and should be encouraged (even if it is done on a massive scale). But when it comes to a case like WiseGeek, where content is spat out en masse even though it's crappy and really short, it becomes a problem.


@Matt_Cutts, starting around page 4 of the results for the term [loans], what is with all the .edu sites? This seems like an error on the ranking side for a finance-related term such as [loans]. Why are nearly all the results .edu after page 4?


Typically those .edu sites have been hacked at some point.


The problem isn't just rank though. After I've seen the original source of duplicated content, I don't just want sites that copy it to rank below it, I want to not see them at all, so that the rest of my results are filled with /different/ things.
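
Near-duplicate detection is a well-studied problem; a generic sketch (not a claim about how Google actually does it) compares word shingles and collapses pages whose overlap is high enough, rather than merely demoting them:

    def shingles(text, k=5):
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def similarity(a, b):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

    # A scraped copy with a short intro bolted on still scores near 1.0,
    # so it could be collapsed into the original instead of shown at all.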


I also have one spam site example. http://www.google.com/search?q=internet+phone+service Look at 3rd result for internetphoneguide.org


I was hoping to see something in the latest blog post from Google about spammers using Exact Match Domains to get huge algorithmic boosts. EMD + tools like xrumer usually = first page results :(


What about sites like ezinearticles? Since they do not pay for content, isn't that more like organic content from experts? They are not content machines.


What is the technical difference between Google being able to accurately measure the volume of spam, and Google removing the spam?


tl;dr: Our algorithms already stop resultspam. Y'all are trippin'.


I just want to thank Matt Cutts for always being classy. His blog posts/comments always set a high bar.


It doesn't appear that Cutts & Co. are looking to address any of the more popular blackhat link building methods that all popular SEO bloggers continually say "work but you shouldn't use them yourself because they're bad".

Until results for the keyword "buy viagra" are no longer littered with forum-link and comment spam and parasite pages, Google's algo is still not "fixed".


People can make tens of thousands of dollars a month if they rank highly for certain phrases, so tons of SEOs (and spammers) are trying to rank for phrases like that. Other search engines might hard-code the results for [buy viagra], but Google's first instinct is to use algorithms in these cases, and if you pay attention, the results for queries like that can fluctuate a lot.

With a billion searches a day, I won't claim that we'll get every query right. But just because we don't "solve" one query to your satisfaction doesn't mean we're not working really hard on the problem. It's not an easy problem. :)


How about the more general case of gray and black hat link building techniques? Especially in internet marketing and certain SEO communities, everything from forum profile backlinking to blog comments and mass article submissions are done daily to rank sites in all niches, not just the really spammy viagra-type sites.

There's a growing concern/consensus that in loads of non-spammy niches, the only way to get decent rankings is to build links.

I know this is naturally something that would be very hard for Google to tackle (since if 'junk' backlinks = domain penalty, black hatters would simply spam their competitors' sites with backlinks and get their competitors deindexed), but are there active efforts being made so that gray and black hat link building campaigns aren't (in some cases) pretty essential to a site getting good rankings?


Not to mention link building via satellite site networks. A competitor of ours has hundreds of keyword-niche sites that link back to their URL but also link to several other legitimate domains on each page. How could Google penalize the domain responsible without penalizing sites that have nothing to do with it?


It just seems to me that there is a gaggle of very obvious failing points that could easily become the focus of your guys' attempts to clean up the SERPs. "Buy viagra" is probably the oldest keyword to be subjected to blackhat SEO techniques in the history of Google, yet it's still littered with hacked .edu's and spam sites. If 300k broad-match searches a month is something you guys feel isn't worth focusing on, then I guess this is a moot point. But at that kind of volume, I think a little more attention is warranted.

The simplest response is that link spam works, has worked for years, and coming into 2011, it appears to be as strong as ever in terms of ranking sites for competitive niches.


I think that's a bad example. AFAIK you can't legitimately buy Viagra online, so what results should Google be returning? Where are the awesome web pages about Viagra that you want to see? They don't exist, so the query comes back with spam.



So should we use the "pretty sure" metric of the domain name matching the product we are searching for? What makes this site more "legit" than others?


Nasdaq:DSCM

It seems to me that a $200M company offering to sell a product to you via their website is more legit than, say, a site you're redirected to after someone hacked some .edu account. (At this moment, the second result on Google for buy viagra appears to be a link to www.engineering.uga.edu, but that redirects to some pharmacy site if you click through from Google (but not if you load it directly).)

So yes, I'd say there's at least some way to compare the legitimacy here.


My search for buy viagra returns the Florida International University course catalog on page 1, along with other academic (non-medical) links too.


Is it possible that a large number of people who 'don't know any better' are visiting these pages and making purchases, thereby increasing the rankings of these scam sites?


According to our metrics we are great; pity-about-you, unless you can "Please tell us how we can do a better job."

I expected more. It reads like a content farm.


Did you read past the first paragraph? They mention some recent changes they've made:

To respond to that challenge, we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly. ... We’ve also radically improved our ability to detect hacked sites, which were a major source of spam in 2010. And we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.

And they are saying pretty clearly they think they can do a better job on content-farms:

Nonetheless, we hear the feedback from the web loud and clear: people are asking for even stronger action on content farms and sites that consist primarily of spammy or low-quality content. ... The fact is that we’re not perfect, and combined with users’ skyrocketing expectations of Google, these imperfections get magnified in perception. However, we can and should do better.


To respond to that challenge, we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly

When is recently?

We’ve also radically improved our ability to detect hacked sites

Since when? All the complaints I have read are from the last few weeks, and what about when I looked up a medical complaint this week?

And we’re evaluating multiple changes that should help drive spam levels even lower

Oh, are you now? Of course... Translation: 'We are looking into it.' D'uh.

that copy others’ content and sites with low levels of original content

This last part is the only datum I got from the article - they explicitly respond to stack overflow, etc... It is still fluff though.

What I would have preferred: I made a change on 2011-01-15 and you should see it here, here, and here. 'Here' can be broadly defined.



