Hacker News

Why not just downrank the results from the SO clones?



Because that wouldn't solve the problem for clones of other sites, or clones in other languages. And the Stack Overflow cloners could just make other websites. That's why a primary instinct in search quality is to look for an algorithmic solution that goes to the root of the problem. That approach works across different languages and sites, and it keeps working if someone makes new sites.

To be clear: the webspam team does reserve the right to take manual action to correct spam problems, and we do. That not only helps Google be responsive, it also improves our algorithms because we get to use that data to train better algorithms. With Stack Overflow, I especially wanted to see Google tackle this instance with algorithms first.


> Because that wouldn't solve the problem for clones of other sites, or clones in other languages.

Sites that are the victims of content cloning have to be very visible and valuable, so maybe a little manual curating could be relevant.

> the Stack Overflow cloners could just make other websites

Not really? The point is not to tag the clones but to tag the original; everything that is not the original and that has copied content is a clone -- its name, domain or country notwithstanding.


That would be a rather awesome feature for evil people: just copy and paste your competition's content onto SO (or any other specially protected site) and their Google ranking will drop like a stone.


Why doesn't Google just treat whoever published the content first as the 'original'?


This was discussed in an earlier thread, and it seemed like the idea of finding the "original" gets really messy and gameable (and potentially oppressive if curation were used).

The primary input to search engines comes from web crawlers, so the idea of "first" for duplicated content is already difficult to determine, and (I would guess) it would get much, much worse in the inevitable arms race if something like this were implemented.


But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?
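To make the "detecting duplicates" half concrete, here is a minimal sketch assuming word shingles and Jaccard similarity, a common textbook approach and not necessarily anything Google uses. The function names, the shingle size `k=3`, and the 0.8 threshold are all made up for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Too short to form a full shingle: use the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """True when the two texts share most of their k-word shingles."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

A check like this only says that two pages match. It says nothing about which of them had the content first.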

Because we don't understand what's hard, we think you're not really trying, and then we make up evil reasons to explain that.

I believe if people understood better the difficulties of spam fighting they would be more understanding.


> But detecting duplicate content should not be very difficult, esp. now that Google indexes everything almost in real time. The site that had the content first is necessarily canonical and the others are the copies?

Not necessarily. The rate at which Google refreshes its crawl of a site, and how deeply it crawls, depend on how often the site updates and on its PageRank. If a scraper site updates more often and has higher PR than the sites it's scraping, Google will be more likely to find the content there than at its source. Identifying the scraper copy as canonical because it was encountered first would be wrong.
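A toy illustration of that failure mode, assuming a crawler that revisits each site on a fixed interval. The site names, publish times, and intervals below are hypothetical, and real crawl scheduling is far more involved than this.

```python
def first_seen(publish_time, crawl_interval):
    """Time of the first crawl at or after publish_time, assuming the
    crawler visits the site at t = 0, interval, 2 * interval, ...
    Both arguments are whole hours."""
    visits_needed = -(-publish_time // crawl_interval)  # ceiling division
    return visits_needed * crawl_interval


def pick_canonical_by_first_seen(pages):
    """Naive rule: whichever copy the crawler saw first is the original."""
    earliest = min(
        pages,
        key=lambda p: first_seen(p["published"], p["crawl_interval"]),
    )
    return earliest["site"]


# Hypothetical numbers, in hours. The original publishes first but is
# crawled once a day; the scraper copies it an hour later but is crawled
# every three hours.
pages = [
    {"site": "original.example", "published": 1, "crawl_interval": 24},
    {"site": "scraper.example", "published": 2, "crawl_interval": 3},
]

print(pick_canonical_by_first_seen(pages))
```

With these numbers the crawler first sees the scraper's copy at hour 3 and the original at hour 24, so the naive "first seen" rule credits the scraper.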


Do you think it would help if Google educated publishers and webmasters to use a rel="original" tag as a best practice, similar to the rel="canonical" tag? When similar pages are grouped into a candidate set, the tag would help Google identify the original content and improve search quality.


What would stop the site that is scraping content from using the rel="original" tag too?


Matt, would it make more sense to put more weight on the number of web page visitors (recorded by Google Toolbar) as opposed to the number of incoming links?


I agree with what you're saying: there needs to be an algorithmic approach.

But I'd like to say one other thing. Why is Google only doing something about web spam now, after people have pointed out how bad things have been getting? Has anybody considered creating a small team just to oversee public perceptions of the search results and try to keep on top of things in the future?



