
There has been a huge influx this year of sites that simply scrape SO and then host the exact same content on their own site. It's a pain, and there is no official way to get them removed.

I thought this was a massive no-no on Google's side; has something changed?




This topic came back to mind yesterday while I was doing some research with the Ahrefs keyword tool. I do believe it would be possible to build a very large dataset of these copycat sites (using Ahrefs) to use as a blacklist in various filters/extensions.

But the crazy part is that, for example, Ahrefs says Stack Overflow gets "organic traffic" in the range of 22 million visits per month. A lot of these copycat sites, at least the ones I saw, get anywhere from 10k to 500k per month.

I mean, it's pretty insane just how well such sites can rank in Google, and you bet those copycats are making absolute bank from ads even if the majority of developers immediately close the site.

There's a lot going on with Google Search these days; a lot of people are complaining that sites that scrape content can easily rank really well for long-tail keywords. In one case in particular, a site will scrape Google to collect "featured snippets" and "people also ask" results, then combine anywhere from 20 to 40 of these answers and publish them as a blog post.

None of the words are changed; every question and answer is worded exactly the same. And Google puts these sites on page 1.

What a joke.


> then combine anywhere from 20 to 40 of these answers and publish them as a blog post.

Yeah, I've been hitting a ton of those lately.


> I do believe it would be possible to create a very large dataset of these copycat sites

Would they just move to creating and using new domains with the same content as soon as traffic to the old one drops? (Which looks like what the spammers in the original post are doing.)

But something does need to be done about these sites.


> move to creating and using new domains

This is a decades-old spammer trick. Google used to not rank brand new domains very high for this reason.

It's hard not to think that the only reason Google abandoned most of its old site ranking heuristics was that they were filtering out too many sites with lots of Google ads. The spam sites now infesting Google's first-page results don't look very different from the spam sites I saw back in the early 2000's. (There's more JavaScript, but modern search spiders run every page in a VM before reading the DOM, so that doesn't fool anyone.)


Fingerprint the site’s content so a new domain carrying the same content can’t earn a good ranking.
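
A minimal sketch of that idea, assuming a SimHash-style fingerprint (illustrative Python only, not whatever Google actually runs; the sample strings are made up):

    import hashlib

    def simhash(text, bits=64):
        # Each word votes on each bit of the fingerprint; small edits
        # flip only a few votes, so near-copies land near each other.
        votes = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if votes[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    page = simhash("How do I update a SwiftUI view from a background thread?")
    copy = simhash("How to update a SwiftUI view from a background thread")
    print(hamming(page, copy))  # small distance => same content, any domain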


> bet those copycats are making absolute bank from ads even if the majority of developers immediately close the site

I bet the majority of developers block ads


When the platforms let them, at least.


“developers”


yeah, the ones using SO


> What a joke.

It is simple. Google is making more money from copycat sites than from original content...


I think this just isn’t how Google works. I would expect to see a lot more spam if Google were happy to collect money from advertising on spam sites.


I think that as Google has become bigger, more complicated, and more ML-based, the average Googler understands the system less and less well. At a certain point a problem comes up and they just don't know how to solve it, or their bureaucracy makes solving it too painful, or the people who solve problems quickly have all left for companies where they can do that.


… in the short term.


Does the market care about anything else?


I'm sorta happy Google is turning to garbage.

It might actually force search engines back into being reliable indexes of valuable sources, which would be great.

Imagine a whitelist approach to a search engine, where a human or an AI approves sites first.


For all we know the sites have better internationalizations and cater to audiences invisible from a US-based perspective.


These sites are just scraping SO and dumping the text from the question+answers in a blog-style format.

I don’t think this is a cultural issue; I fail to see how anyone could consider this a value-add.


I found that Google will even rank a quote from an issue tracker on one of those "clones with ad/malware overlays" higher than the original.


Also, sites that scrape mailing list archives and cover them in ads rank higher than the actual archive site.


A/B tests can show data points where these sites convert better on average. Sites aren't chosen for cultural legacy but based on hitting the right KPIs.


Here is my uBlock filter with hundreds of GitHub/StackOverflow copycats: https://github.com/quenhus/uBlock-Origin-dev-filter

It blocks copycats and hides them on multiple search engines. You may also use the list with uBlacklist.
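
If you go the uBlacklist route, the entries are just standard match patterns, one per site; a hypothetical one would look like:

    *://*.so-clone.example/*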


With these two pieces of data:

* the identical text copied from stack overflow should be easily identifiable

* volunteers put together a list of these sites themselves

it should be obvious to Google apologists that Google is either negligent or intentionally allowing these sites in their search results. I'm sick of hearing about how "the world is different" and it's an "arms race" between spam sites and Google. Bullshit.


> the identical text copied from stack overflow should be easily identifiable

Google starts matching content from SO => spammers start tweaking the text slightly => Google implements some expensive similarity score to down-rank copycat sites => spammers use more complex scrambling => ...
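
To illustrate what that similarity-score step means, a word-shingle Jaccard check is the classic approach (sketch only, made-up strings):

    def shingles(text, k=3):
        # Break text into overlapping k-word chunks.
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb)

    answer  = "you need to dispatch your UI updates onto the main queue"
    scraped = "you need to dispatch all your UI updates onto the main queue"
    print(jaccard(answer, scraped))  # ~0.58 here; unrelated text scores near 0

Running something like that across the whole index against every candidate source is where it gets expensive.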

> volunteers put together a list of these sites themselves

These lists only work because they're used by a tiny minority of people. If Google were to do this the spammers would start switching domains more quickly (or find some other workaround).

I'm no Google apologist but I think you're underestimating how hard search ranking is when spammers are actively trying to game the system.


> tweaking the text slightly

That's what ML is perfect at detecting, which is Google's forte.

Some of these sites have been returned as top results for a while, so are you suggesting that Google just gave up because spammers would be able to evade them with an update?


Yes, it is an arms race, but Google has far more resources than the spammers do, so they should easily stay ahead.

You underestimate the resources Google has at its disposal.

They simply don’t care because there is no real competition to worry about; even with this spam you are still likely to use Google, so why would a profit-motivated company bother?


SO seems to have Yahoo ads, so I guess it is a no-brainer for Google to rank sites they profit from over the content the lusers want.


This is the real answer.


The problem with these theories is that they lack any sensible explanation of motive. Google intentionally degrading its search results because they "earn more if the user has to search again and again" just doesn't feel right: even if it were true in some short-term experiment, it would compromise the way people at Google think of themselves and their work to a degree that would be devastating to the company. There is no way they would throw away that sort of value without being under intense pressure, which they definitely are not.


Another comment stated that SO uses ads from someone other than Google, while the copy-paste sites use Google ads. If true, that is a clear monetary incentive not to go after this too hard.


They've also demonstrated that they can derank the Wikipedia clones. Funny how that ability is lost when the site in question makes money for a competitor.


These large tech companies have a long and varied history of stupid short-term decision making for profit and bad products due to local individual failures. Until there is a clear and detailed explanation of how the spam sites are avoiding google's wrath, the explanation of stupidity or short-term thinking on Google's part seems just as plausible.


Well, come up with an explanation of how these entirely mechanically generated SO clone sites, with no obfuscation, are allowed by Google to exist, when identifying and removing them should be fairly trivial.

At the very least they're being deliberately neglectful: they don't feel the bad experience harms their revenue, because there's no other substantial competitor, so they can abuse their monopoly status.

I guess they may just not care enough about software developers and figure we're mostly using ad blockers, so it's wasted effort and we'll develop blocklists ourselves. With no monetary value they can assign to the ill will it engenders, they figure it must not matter, so they don't bother. Pissing off a large chunk of the IT community through obvious neglect seems like a poor move to me, but then I've never felt I'm cut out for management.


Maybe the problem is just genuinely hard and beyond their capabilities.


Detecting identical snippets of text is beyond virtually no one's abilities.


Yeah, I subscribed to the blocklist that someone else published and maintains manually. Google certainly has the resources to beat that bar.

It feels like, economy-wide, decision makers in corporations and governments have just arrived at the conclusion that there's no money in, and no point to, trying to stop scammers (and there might be an actual cost to revenue in doing so). It won't goose their quarterly numbers and might hurt them, so it's better to allow it.


This even works on Firefox Nightly on Android. Thanks a lot!


This is fantastic! This is exactly what I needed, thanks!


You rock. Thank you.


I'd like to get actual confirmation of this, but my vague feeling is that, once upon a time, Google Search would get "updates", as in actually deployed code that would change the rules of the game, making most of the previous dirty tricks unusable and leading people to go out and find new ones.

This changed with the Google "machine learning" days, where you no longer have humans at the helm laying down explicit rules, so there are no more "change the world" updates; you can only slightly nudge the parameters toward what you want, meaning the same old tricks keep being effective for far too long.


> Google Search would get "updates", as in actually deployed code that would change the rules of the game

That's just what the scheduled "core update" days are now: https://developers.google.com/search/blog/2022/05/may-2022-c...


The “May Core Update”, which recently rolled out, impacted every site.

A lot of updates are targeted at specific problems such as low quality product reviews but there are still broader updates taking place.


My theory is that one of the inputs to Google's ranking algorithm is now "how much money would we make from this click?" A click to SO has a small number of ads which are obviously ads and easily ignored. A click to the average scrape-jacked SO page has dozens of ads using every dark pattern in the book to generate accidental clicks.


One of the other commenters above claimed that SO runs Yahoo ads. If that's true, then from Google's perspective the click has either zero or negative money-making value.

Maybe that means we should be searching on Yahoo rather than Google.


There's a simple way around that. Nothing to install. Nothing to update.

Just go to SO and use its search bar. It's actually quite good.

I mean, you know that's where you'll want to find the answer anyway - not some random corporate webpage or ad-infested splog. Why not cut out the middle man?

Only if that fails do I bother with Google.
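
SO's search also supports operators like tag brackets, quoted phrases, and answer filters, e.g.:

    [swiftui] "update view" isaccepted:yes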


Huh, you know, you're right. I recently did that and it was fine.

I think a lot of others (myself very much included) formed their opinion about this from sites where the search bar was a joke played on users.

Edit: let me upgrade that 'fine' to 'great'. Now that I think about it, it was actually better than a Google search, which was not my previous experience.


Or, if DuckDuckGo is your default search engine, you can append ' !so' to your search term.
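
For example:

    swiftui update view !so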


Because it's not just SO I want answers from. It's blogs people post, social media such as Reddit, HN, etc.

I've found some fantastic articles out there. Yes, SO is a fantastic resource, but there is an entire internet out there :-)


Google's index used to be considerably more competent at finding relevant results for a query, especially when some of the words were synonyms of what appeared in the snippets, or even only loosely related.


I have no evidence of this, but the ad load on the returned results has gotten way higher. In theory, ranking sites that display Google ads higher would be a very easy knob for Google to turn to increase profit. The SO scrapers probably have Google ads on them, making them more profitable for Google.


I ran into so many Stack Overflow "mirrors" yesterday like this: https://www.anycodings.com/1questions/400836/swiftui-update-...

10 years ago I gave up on a large project where I rehosted and organized dead Usenet forum content, because Google's dupe-penalty detector was too good and too aggressive for content you could barely find beyond a six-year-old cache hit where the origin website was long gone.

Meanwhile these Stack Overflow scrapers are just `<html>{copy-and-paste}</html>` and the same domains are still alive despite years of cloning.

Looks like it's time to boot my project back up.


It’s clearly not a copy and paste. I just visited that link on my phone and got blocked from viewing because I’m using an ad blocker.


Also lots of GitHub scrapers.


"All Rights Reserved."


Turning the knob one way explicitly might raise some antitrust concerns; however, the same motivation can be used to avoid turning the knob the other way, and that can be done much more sneakily, without leaving clear evidence: simply don't allocate budget etc. to projects that would turn the knob the other way, and you're done.


I've noticed this with YouTube. Even though I'm on desktop with an ad blocker, they repeatedly autoplay the same video with a creator-embedded crypto promotion at the beginning (especially when it would be plausible to infer from user interaction and clock/watch time that I'm asleep). They must be getting a cut (plus scamming the ad buyer).


This is a very old conspiracy theory that's been repeatedly debunked.

https://www.searchenginejournal.com/ranking-factors/google-a...


That link is about AdWords spend by the site in question, and not about displaying AdSense ads on the site. Totally unrelated.


I've been using this uBO filter since someone recommended it in a different thread, and it's been great at removing those annoying sites from search results: https://github.com/quenhus/uBlock-Origin-dev-filter


The author actually posted above your comment ;)


This reminds me of this one site that simply scraped all the open source code it could see and then produced AI-generated copies.


Similar to YouTube search results. Lots of spam videos. No way to block a creator. Totally ruins it.


Other search engines allow you to block domains from showing up in the results. I’ve switched to Kagi out of frustration, and honestly it’s as good as or better than Google just because of that one feature.


Some ex-Googlers say that someone ran an A/B test, and it turned out that per-search revenue decreased when these sites were blocked.


It took me a while for no good reason, but I finally installed an unofficial extension that adds a "block" button to search results. It immediately improved my experience; I can't recommend it enough. No more Pinterest, SO clones, or useless Quora spam, with very little work. I can't believe I didn't do it sooner.


I've been using uBlacklist and it works really well. It even lets me highlight specific websites so I have a better chance of seeing them if they are further down the list. https://iorate.github.io/ublacklist/docs



just switch to you.com



