Google Search Results Plagued with spam “.it” domains (cloudflare.com)
234 points by pyinstallwoes on July 23, 2022 | 206 comments



There has been a huge influx this year of sites that simply scrape SO and then host the exact same content. It's a pain, and there is no official way to remove them.

I thought this was a massive no-no on Google's side; has something changed?


I had this topic brought back to my mind yesterday as I was doing some research using the Ahrefs keyword tool. I do believe it would be possible to create a very large dataset of these copycat sites (using Ahrefs) to be used as a blacklist in various filters/extensions.

But the crazy part is that, for example - Ahrefs says that StackOverflow has "Organic traffic" in the range of 22 million per month. A lot of these copycat sites, at least the ones I saw - have a traffic range anywhere from 10k to 500k per month.

I mean, it's pretty insane just how well such sites can rank in Google, and you bet those copycats are making absolute bank from ads even if the majority of developers immediately close the site.

There's a lot going on with Google Search these days; a lot of people are complaining that sites that scrape content can easily rank really well for long-tail keywords. In one case in particular, a site will scrape Google to collect "featured snippets" and "people also ask" - then combine anywhere from 20 to 40 of these answers and publish them as a blog post.

None of the words are changed, all questions/answers worded exactly the same. And Google puts these sites on page 1.

What a joke.


> then combine anywhere from 20 to 40 of these answers and publish them as a blog post.

yeah i've been hitting a ton of those lately.


> I do believe it would be possible to create a very large dataset of these copycat sites

Wouldn't they just move to creating and using new domains with the same content as soon as traffic to the old ones drops? (Which looks like what the spammers in the original post are doing.)

But something does need to be done about these sites.


> move to creating and using new domains

This is a decades-old spammer trick. Google used to not rank brand new domains very high for this reason.

It's hard not to think that the only reason Google abandoned most of its old site ranking heuristics was that they were filtering out too many sites with lots of Google ads. The spam sites now infesting Google's first-page results don't look very different from the spam sites I saw back in the early 2000's. (There's more JavaScript, but modern search spiders run every page in a VM before reading the DOM, so that doesn't fool anyone.)


Fingerprint the site’s content so the new domain name isn’t able to SEO a good score.


> bet those copycats are making absolute bank from ads even if the majority of developers immediately close the site

I bet the majority of developers block ads


When on platforms that let them, at least


“developers”


yeah, the ones using SO


> What a joke.

It is simple. Google is making more money from copycat sites than from original content...


I think this just isn't how Google works. I would expect to see a lot more spam if Google were happy to collect money from advertising on spam sites.


I think that as Google becomes more complicated, bigger, and ML-based, the average Googler understands the system less and less well. At a certain point a problem comes up and they just don't know how to solve it, or their bureaucracy makes solving it too painful, or the people who solve problems quickly have all gone to companies where they can do that.


… in the short term.


Does the market care about anything else?


I'm sorta happy Google is turning to garbage.

Actually forcing a search engine back into being a reliable index of valuable sources would be great.

Imagine a whitelist approach to a search engine, where a human or AI does an approval first.


For all we know the sites have better internationalizations and cater to audiences invisible from a US-based perspective.


These sites are just scraping SO and dumping the text from the question+answers in a blog-style format.

I don’t think this is a cultural issue, I fail to see how this can be considered value add by anyone.


I found that Google will even rank a quote from an issue tracker on one of those "clones with advert/malware overlays" higher than the original.


Also sites that scrape mailing list archives and cover it in ads rank higher than the actual archive site.


A/B tests can show datapoints where these sites convert better on average. Sites aren't chosen for cultural legacy but based on hitting the correct KPIs.


Here is my uBlock filter with hundreds of GitHub/StackOverflow copycats: https://github.com/quenhus/uBlock-Origin-dev-filter

It blocks copycats and hides them from multiple search engines. You may also use the list with uBlacklist.


With these two pieces of data:

* the identical text copied from stack overflow should be easily identifiable

* volunteers put together a list of these sites themselves

it should be obvious to Google apologists that Google is either negligent or intentionally allowing these sites in their search. I'm sick of hearing about how "the world is different" and it's an "arms race" between spam sites and Google. Bullshit.


> the identical text copied from stack overflow should be easily identifiable

Google starts matching content from SO => spammers start tweaking the text slightly => Google implements some expensive similarity score to down-rank copycat sites => spammers use more complex scrambling => ...

> volunteers put together a list of these sites themselves

These lists only work because they're used by a tiny minority of people. If Google were to do this the spammers would start switching domains more quickly (or find some other workaround).

I'm no Google apologist but I think you're underestimating how hard search ranking is when spammers are actively trying to game the system.


> tweaking the text slightly

That's what ML is perfect at detecting, which is Google's forte.

Some of these sites have been returned as top results for a while, so are you suggesting that Google just gave up because spammers would be able to evade them with an update?


Yes, it is an arms race, but Google has far more resources than spammers do, so they should be ahead easily.

You underestimate the resources google has at its disposal.

They simply don't care because there is no real competition to worry about; even with this spam you are still likely to use Google, so why would a profit-motivated company bother?


SO seems to have Yahoo ads, so I guess it is a no-brainer for Google to rank sites they profit from over the content the lusers want.


This is the real answer.


The problem with these theories is that they lack any sensible explanation of motive. Google intentionally degrading its search results because they "earn more if the user has to search again and again" just doesn't feel right: even if it were true in some short-term experiment, it would compromise the way people at Google think of themselves and their work to a degree that would be devastating to the company. There is no way they would throw away that sort of value without being under intense pressure, which they definitely are not.


Another comment stated that SO uses ads from someone other than Google, while the copy-paste sites use Google for ads. If true, that is a clear monetary incentive to not go after this too hard.


They've also demonstrated that they can derank the Wikipedia clones. Funny how that ability is lost when the site in question makes money for a competitor.


These large tech companies have a long and varied history of stupid short-term decision making for profit and bad products due to local individual failures. Until there is a clear and detailed explanation of how the spam sites are avoiding google's wrath, the explanation of stupidity or short-term thinking on Google's part seems just as plausible.


Well, come up with an explanation of how these entirely mechanically generated SO clone sites, with no obfuscation, are allowed to exist by Google, when identifying and removing them should be fairly trivial.

At the very least they're being deliberately neglectful: they don't feel the bad experience harms their revenue, and since there's no other substantial competitor they can abuse their monopoly status.

I guess they may just not care enough about software developers and figure we're mostly using ad blockers, so it's wasted effort and we'll develop blocklists ourselves. With no monetary value that they can assign to the ill will it engenders, they figure it must not matter, so they don't bother. Pissing off a large chunk of the entire IT community via obvious neglect seems like a poor move to me, but then I've never felt that I'm cut out for management.


Maybe the problem is just genuinely hard and beyond their capabilities.


Detecting identical snippets of text is beyond virtually no one's abilities.


Yeah, I subbed to the blocklist that someone else published that they're maintaining manually. Google certainly has the resources to beat that bar.

It feels like, economy-wide, decision makers in corporations and governments have just arrived at the conclusion that there's no money in and no point to trying to stop scammers (and there might be an actual cost to revenue in doing so). It won't goose their quarterly numbers and might hurt them, so it's better to allow it.


This even works on Firefox Nightly on Android. Thanks a lot!


This is fantastic! This is exactly what I needed, thanks!


You rock. Thank you.


I'd like to get actual confirmation of this, but my vague feeling is that, once upon a time, Google Search would get "updates", as in, actually deployed code that would change the rules of the game, and most of the previous dirty tricks would become unusable, leading people to go out and find new ones.

This changed with the Google "machine learning" days, where you no longer have humans at the helm laying down explicit rules, so no more "change the world" updates, you can only slightly nudge the parameters towards what you want, meaning the same old tricks keep being effective for far too long.


> Google Search would get "updates", as in, actually deployed code that would change the rule of the game

That's just what the scheduled "core update" days are now: https://developers.google.com/search/blog/2022/05/may-2022-c...


The “May Core Update” which recently rolled out impacted every site.

A lot of updates are targeted at specific problems such as low quality product reviews but there are still broader updates taking place.


My theory is that one of the inputs to Google's ranking algorithm is now "how much money would we make from this click?" A click to SO has a small number of ads which are obviously ads and easily ignored. A click to the average scrape-jacked SO page has dozens of ads using every dark pattern in the book to generate accidental clicks.


One of the other commenters above made the claim that SO runs yahoo ads. If that's true then from a Google perspective, the click has either zero or negative money-making value.

Maybe that means we should be searching in yahoo rather than google.


There's a simple way around that. Nothing to install. Nothing to update.

Just go to SO and use its search bar. It's actually quite good.

I mean, you know that's where you'll want to find the answer anyway - not some random corporate webpage or ad-infested splog. Why not cut out the middle man?

Only if that fails do I bother with Google.


Huh, you know, you're right. I recently did that and it was fine.

I think a lot of others formed their opinion (myself muchly included) about this from sites where the search bar was a joke played on people.

Edit: let me upgrade that 'fine' to 'great'; now that I think about it, it was actually better than a Google search, which was not my previous experience.


Or, if DuckDuckGo is your default search engine, you can append ' !so' to your search term.


Because it's not just SO I want answers from. It's blogs people post, social media such as Reddit, HN, etc.

I've found some fantastic articles out there, yes SO is a fantastic resource but there is an entire internet out there :-)


Google's index used to be fairly more competent at finding relevant results for a query, especially if some words were synonyms of those found in the snippets or only loosely related.


I have no evidence of this, but the ad load on the returned results has gotten way higher. In theory, ranking sites that display Google ads higher would be a very easy knob for Google to turn to increase profit. The SO scrapers probably have Google ads on them, making them more profitable for Google.


I ran into so many Stack Overflow "mirrors" yesterday like this: https://www.anycodings.com/1questions/400836/swiftui-update-...

10 years ago I gave up on a large project where I rehosted and organized dead Usenet forum content, because Google's dupe-penalty detector was too good and too aggressive for content that you could barely find beyond a six-year-old cache hit where the origin website was long gone.

Meanwhile these Stack Overflow scrapers are just `<html>{copy-and-paste}</html>` and the same domains are still alive despite years of cloning.

Looks like it's time to boot my project back up.


It’s clearly not a copy and paste. I just visited that link on my phone and got blocked from viewing because I’m using an ad blocker.


Also lots of github scrapers


"All Rights Reserved."


Turning the knob one way explicitly might raise some anti-trust concerns, however the same motivation can be used to avoid turning the knob the other way and this can be done much more sneakily without leaving clear evidence - simply don't allocate budget/etc to projects that would turn the knob the other way and you're done.


I've noticed this with YouTube. Even though I'm on desktop with an adblocker, they repeatedly autoplay the same video with a creator-embedded crypto promotion at the beginning (especially when it would be plausible to infer I'm asleep from user interaction and clock/watch time). Must be getting a cut (plus scamming the ad buyer).


This is a very old conspiracy theory that's been repeatedly debunked.

https://www.searchenginejournal.com/ranking-factors/google-a...


That link is about AdWords spend by the site in question, and not about displaying AdSense ads on the site. Totally unrelated.


I've been using this uBO filter since someone recommended on a different thread and it's been great at removing those annoying sites from search results: https://github.com/quenhus/uBlock-Origin-dev-filter


The author actually posted above your comment ;)


This reminds me of this one site that simply scraped all the open source code it could see and then produced AI-generated copies.


Similar to YouTube search results. Lots of spam videos. No way to block a creator. Totally ruins it.


Other search engines allow you to block domains from showing up in the results. I’ve switched to Kagi out of frustration and honestly it’s as good or better than Google just because of that one feature.


Some ex-Googlers say that someone ran an A/B test, and it turned out that per-search revenue decreased when these sites were blocked.


It took me a while for no good reason but I finally got an unofficial extension to add a "block" button to search results. It immediately improved my experience, I can't recommend it enough. No more Pinterest, SO clones, useless Quora spam, with very little work. I can't believe I didn't do it sooner.


I've been using uBlacklist and it works really well. It even lets me highlight specific websites so I have a better chance of seeing them if they are further down the list. https://iorate.github.io/ublacklist/docs



just switch to you.com


I don't think this problem should be solved by Cloudflare. Cheap domains will always exist and they shouldn't be a problem. The problem lies with Google and its failure to detect these spam sites.

Surely Google can spare an engineer or two to do a deep dive into the way any one of these spam sites manages to get itself to the first page of Google, work out their scheme, and fix the algorithm? This problem isn't exactly hard to reproduce!


I really don't get why so many people are willing to give Cloudflare a free pass on stuff like this. Why is it OK for a company to facilitate and host (1) thousands of scam domains, making reporting arduous and ineffective?

Anyone trying to infect others with Trojans and viruses just needs to check user agents or use dynamic redirect URLs, and suddenly this clearly illegal activity becomes black magic that is way beyond the comprehension of the folks at Cloudflare.

Cloudflare is basically making the shittiest parts of the Internet safe for scammers and spammers, and this is just one example.

If that's not bad enough, they're trying like crazy to become a monopoly. If this is what they do now, imagine how bad it'll be when they control even more and feel even more immune while making money from scammers.

(1) Hosting is providing services on the Internet without which a site would not function. Providing DNS is hosting. Providing proxy is hosting. Providing email is hosting. Don't fall for Cloudflare's "we don't host" bullshit.


Cloudflare should take action on reported domains and their owners, especially if those domains are malicious.

However, I don't want Cloudflare to preventatively police what is and isn't a bad website. When these scam sites go live, they can quite easily contain real content (say, a blog, with articles written by AI good enough not to be immediately obvious) and then change into malware on a schedule.

Cloudflare can't see what code customers run on the backend and that's probably a good thing. They're already holding too much power over the internet and requiring the backend to be transparent would only make them more in control of the web.

Any registrar hosts thousands if not millions of spam sites, because every single one of the billion registrars has DNS set up in some way.

Despite being almost exclusively used for spam and amateur projects, the .TK TLD barely shows up in Google. Spam sites are a symptom of other services linking to them and making them worth the investment. If Google, Bing, Qwant and Yandex weren't falling for the SEO scams these scammers use, we wouldn't have this problem.

Hosters have some immunity by design, and that's very much a good thing. They have to respond to abuse complaints, but they're not responsible for filtering out all of their customers. Requiring them to do so is exactly what the EU is trying to force upon the internet, which is terrible for online freedom.


You make some good arguments, but if a site is caught in the act of hosting obvious malware, Cloudflare should make a reasonable effort to suspend their activity.


I disagree.

For example, SO copycats are legitimate in that they respect the license and otherwise just serve the content to whoever sends them an HTTP request. As far as I know they don't spam links to their domain anywhere. They are low-quality and of dubious utility for sure, but I'd rather not make the Internet a place where you need to prove quality & utility to someone to be able to host an HTTP server.

The real problem is that a dumbass like Google comes along, sees this and decides that it should rank higher than the source content.


Exactly. Similarly, I think any dumbass should be able to fix cars in his own yard, including for a fee and calling himself a business.

But Google maps better not drive me to someone's yard when I ask to navigate to a nearby mechanic.

If it did, it would be hard to blame anyone but Google.


If they're not spamming their links, how do they get ranked so well?

Somebody upthread suggested it was just the use of Google ads, which I suppose is possible, but somehow it seems unlikely. Google sure does love money but they also need to be considered a good search engine, and I'd expect them to be at least a little wary about things like that.

Is there something else I'm missing?


It's my understanding that link spamming has become counter-productive since a Google algorithm update almost a decade ago? I'm not sure what they're doing but I don't think it's link spam, because of that and also because I've never seen their spam anywhere (if they're using link spam they must do so on sources that have good "authority" for programming-related topics and thus one of us would've likely seen it).


It's often cheap blog-spam "original content" and matching cheap social media spam to make the blog look more legitimate, which is cheaper than ever now thanks to advances in machine learning models like GPT-3 and other current-generation models. The pipeline: take a random sample of pages in the domain, take the target page -> summarise -> generate some blog spam of varying length and level of human input -> if desired, based on social media analytics, generate some automated social posts about the blog article that was just added. Since it's widely done by real humans with their real blogs, it all looks legit.

This is how it gets done and Google used to be brutal about crushing it, somewhere along the way they seem to have given up on being so brutal.


>SO copycats are legitimate in that they respect the license

Do they? SO contributions are under CC BY-SA. Haven't seen copycats providing attribution let alone specifying that the content is under the same license.


I'm not sure, but their business model is ad revenue - they get paid as soon as the page loads. Adding the required attribution & license disclosure wouldn't hurt them at all, so I'm assuming they're either already doing it or will start doing it if asked.


Is it common that SO and Wikipedia copycats respect open licenses? Most times I run into them they do not.

It's really tricky to enforce open licenses on this scale as it's each contributor that licenses their content rather than the platform host.



Cloudflare should worry about what sites Google is choosing to index and show?

This is clearly Google's issue.


No, Cloudflare should worry about what sites they host and enable. How Google ranks the sites that Cloudflare hosts is a secondary issue, and is outside of Cloudflare's control.


No? We have a legal system for a reason. Cloudflare should definitely comply with court takedown requests, and that's that.


So every time anyone wants to report a site that is offering "Flash Updater" Trojans, someone should file a lawsuit?

Every time a scammer puts up a web site trying to sell counterfeit goods, the company which sells the real goods should file a lawsuit? One for each scammy web site, perhaps? Because Cloudflare shouldn't be expected to do anything at all, until they're compelled to do so by a court?

I don't think you're thinking this through.


lol I don't think you understand how the legal system works. You don't file a lawsuit as a private party unless you're also trying to recover damages; you report cybercrime to your local authority and let them pursue the criminal

> I don't think you're thinking this through.

love the sassiness tho


You've clearly never reported cybercrime to the authorities if you think this would work. Or you have, and you think that only large businesses which can claim losses of $10,000 or more should be protected, and everyone else is SOL.


The fact that this has been going on for several years makes me believe Google either doesn't care or the problem is particularly hard to fix (less believable)


It has been an ongoing battle for 10-15 years at this point. Search engines are constantly battling people trying to game their systems. I have to wonder if Google hasn't lost the thread a bit, inside their surely quite complex algorithm black boxes.

For a while now Google has suggested that the best way to rank well is to have human readable content and focus on user experience. At the same time, natural language generation has come leaps and bounds, to the point where sometimes even I, a human, can't tell if an article has been spun by a bot or not.

So if Google starts ranking human readable content, and robots can now produce human readable content, what is the next ranking signal they can use to differentiate spam from humans? Are we going to end up with "Verified Websites" ala verified Twitter handles?

A huge portion of the web at this point is just bots communicating with each other, and legitimate business systems having to process bot participation on the internet. I imagine the portion of the web Google crawls that is bot-generated, versus that which is legitimate, would surely be majority bots, just because of how fast they can generate content. One thing they can't do as easily, though, is register domains, so that may be one of the better points of defense.


The dead internet theory.


Google has at the very least neglected its search for many years now and recently has also actively made it worse through all the censorship and thought control stuff. I find it rather surprising because essentially all of google’s success is lynchpinned by search. All it would take is for a narrative to dominate that the best results can be found elsewhere, which does not seem particularly remote, considering how much damage google has done to its search.


I noticed issues since Matt Cutts left. No one cares anymore. There are AI-generated websites that have been running for years, ranking highly in Google.


It's incredibly easy to fix. They don't care as they have a monopoly.


Another comment links to a blacklist, which works.

If it can be effectively blacklisted, then Google is dropping the ball. This isn't some difficult algorithmic failure.

I don’t agree with your sentences, but I do agree with your point.


Allow people to report AI-generated websites to a human at Google.


> I don't think this problem should be solved by Cloudflare. Cheap domains will always exist and they shouldn't be a problem. The problem lies with Google and its failure to detect these spam sites.

The problem exists outside the (Google-controlled) web too: with (not fully Google-controlled) email.

Around 2020 I did per-TLD checks on wanted/unwanted messages (ham and spam). Of the thousands of messages sent from .xyz domains (envelope sender host or PTR record of the sending host; I ignored the From header) there wasn't a single legit message. 100% SPAM.


The irony: Google/Alphabet uses abc.xyz.


"Cheap domains" is not a thing. $25/year domain for a personal website is kinda pricey. But scam/spam operator can pay that and more pretty darn easy.


There is at least one registrar that gives away .it domains (and apparently .eu domains? WTF?) for free for one year[1], with no major strings attached (as long as you cancel after the first year) as far as I read, correct me if I'm wrong.

Why they decided to ".xyz the TLD", I don't know. ¯\_(ツ)_/¯

[1] https://www.register.it/?lang=en


I'm not sure about today, but about 12 years ago, I was able to get 1000 .info domains for about $200. (We were doing some machine-generated splog creation to see if we could game Google search results. We could.)


Did you make them link to each other? Any insights would be appreciated


Who is charging $25/year for domains?


Depends on the TLD. I searched gandi with a very random set of characters from the keyboard (to ensure I could probably get many results) and here's a selection of country-level ones which are above $25/yr:

- abcedasdfff.io = €59.29/year

- abcedasdfff.tw = €25.20/year

- abcedasdfff.nz = €25.40/year

- abcedasdfff.mx = €48.28/year

Most of them appear to be €10-20/yr, but it's certainly not uncommon to see them go for €25 or higher. Note: EUR and USD are roughly at parity so I don't think it's really necessary to do a conversion.


Most common TLDs are not in that price range.


$25/year for an .it domain is pretty cheap actually, usually they sell for more like 40€ per year


.it are around 10 bucks a year. No idea where you would find them at 25 or 40 a year.


Anecdotally gmail has been doing a miserable job filtering spam for the last 5 months or so. For me it used to be pretty bulletproof - one of its best features.

Now I get something from McAffee Pratners(sic) every other day warning my computer is about to expire. Back in May I kept winning things from Home Depot and Lowes; and gmail would categorize it as "forums".

No idea if it's related, just odd.


Add another anecdote to the anecdata pile. For the past three months, McAffee and North American shopping chain spam has been breaking through Gmail filters. And reporting it as spam does not help. I assume they've been somehow building Google reputation for the spam accounts.


Similarly I've been getting smashed with gmail spam, google calendar spam and google drive spam for around 6 months now after never previously getting any and despite reporting most of it.


Yeah the drive spam started for me last year or the year before. It took a break but in the last couple months has returned with a vengeance.

I had the Slack Google Drive integration and I needed to mute it because it was a couple of doc invites every few hours.


I've been experiencing the same. Here are a bunch of egregious spam mistakes we collected from different people to illustrate the problem: https://www.surgehq.ai/blog/are-the-spammers-winning-failure...


I occasionally have spam that slips through Gmail's filters, but when I explicitly mark it as spam it disappears and nothing of the same type reappears.


On the other hand, over the last few months the majority of posts from a few private Google Groups I'm a member of keep getting wrongly classified as spam.

I don't know if it's related either.


Agreed, same experience, all with the same format, primarily from some sort of outlook.com domain.


Ironic, since Microsoft is the worst (in my experience) at being a giant black hole to emails sent from an otherwise well-configured (SPF, DKIM, DMARC, non-SBL-listed IP, &c) but not major SMTP host.


Been having the same issue on my old email. A LOT of spam going past the filter.


I disagree. Prior to Gmail I used to get thousands of spam emails every day. Now everything is filtered; I barely get any.

The added benefit is I don't get any tech calls for help from my parents, who also don't end up clicking random spam and wondering why bad things are happening.


Remember the good old days of talking about a "semantic web"? Now we just get one Google results page of SEO'd garbage with no way to process them.

I can't help but plug kagi.com, which has the amazing feature of grouping SEO'd stuff like recommendation lists together, so a thing that's contextually useful is still available but without polluting the other contexts.


I recently cursed google search results when trying to research an actor's birth date. There were two dates given on Wikipedia and I wanted to see which one (if either) was correct. Google returned the actor's IMDB page (which listed a third date, and no source), and then pages upon pages of what appeared to be auto-generated sites that clearly scraped from Wikipedia, repeating one or the other of the Wikipedia dates.

This is not helping to organize the world's knowledge.


> This is not helping to organize the world's knowledge.

And Google is not about organizing the world's knowledge but about creeping on people for YoY financial results.


They're quoting Google's own mission statement; though, you are correct.

> Google's mission is to organize the world's information and make it universally accessible and useful.


A lot of actors lie about their age so I wouldn't hold out too much hope on getting an accurate result on that one.

I get your point though about the multiple results for something where there clearly is no authoritative answer.


> This is not helping to organize the world's knowledge.

Oh they stopped doing that long long ago...


The obsession with "machine learning" is actually making systems dumber. Google Search and Gmail spam filters are getting worse with each passing week, and I am almost certain the increasing reliance on ML is to blame.


I chalk it up to a cost benefit calculation. Google clearly isn't trying to eliminate all spam in search. It's not their goal. They are not trying to optimize for the user experience. They're trying to optimize revenue.


They're trying to keep spam out of my inbox, and the spam rate has been increasing for me (and other HN commenters who frequently talk about it)


The competing explanation is that "machine learning" is actually making spam generator systems smarter, so spam gets harder to detect.


The type of spam I see in my inbox is anything but smart. But I agree that this has always been a cat and mouse game.


It may be that deep learning is now increasingly used to generate the spam. It either is or will be used for spam generation A LOT. Frankly it seems to be the most promising commercial use-case for the large language models.


What can search engines do about user-agent based content differentiation? Say my robots.txt allows Googlebot and nothing else. If Google attempts to double-check with a covert user agent, robots.txt is violated. Assign humans to review reported pages? It’s pretty easy to swamp a manual system like that. Just forget about robots.txt?


robots.txt is just a guideline between well-meaning actors for the majority of their traffic, like helping a bot not waste its time nor your bandwidth by crawling dynamically-generated, endless-scrolling /calendar.php pages. Google does use it to that extent.

It's not a firewall.

Seems like you're describing cloaking (https://developers.google.com/search/docs/advanced/guideline...), one of the oldest SEO tricks, and you can imagine that search engines started defeating it on Day 2 of crawling the web.


I remember reading on HN years ago that Google bots have never honored robots.txt, but I don't actually know


Presumably these folk take advantage of cheap/free domain offers wherever in the world they are.


I think you are right; https://www.register.it/ has been offering free domains for 1 year for some time now.


It's not that straightforward though.

To register a .it you must prove you are a person or a business working or residing in one of the EU member states, and you need to provide the ID of a person who's gonna be listed as admin-c of the domain.


No you don't. I had a .it domain too; yes, there is a field in registration where you should enter an "identity card ID", but I didn't have one so I entered something random. Worked of course.


> No you don't.

Yes, you do!

of course it worked.

you just committed a crime.

you can fake your ID everywhere in the world, it is a crime everywhere in the world, and just because it works doesn't mean you won't get caught.

you can drive a stolen car, it will work.

> yes there is a field in regstritation where you should enter a "identity card id"

so it is required! you simply ignored it, lied and broke the law.

your criminal behaviour doesn't imply laws do not exist.

if you tried to buy an insurance policy with that fake ID, you would be in trouble now.


Not a crime. It's not my job to ensure their "validation" is working. The registrar took the money all the same.

>if you tried to buy an insurance policy with that fake ID, you would be in troubles now.

But this is more an "are you 13 or older"-style of "crime".


Right, and I'm sure that government across the ocean will get right on prosecuting that violation...


This was a EU-EU transaction anyway. Not once was there a problem regarding this.



You can do that, but you always run the risk of someone snitching to nic.it, in which case you would lose the domain. :/


I don't think this is an issue if you're a spammer. Those domains are probably short lived anyway.


I've actually experienced this and it is not related at all to the device. It was related to the signed-in Google account across networks and devices.


Note that the discussion is a year old. Around one year ago I wrote more about this "phenomenon" here: https://news.ycombinator.com/item?id=27993123

tl;dr: I managed to find the servers behind it; most likely anybody who is still affected can do the same thing I did pretty easily. We also followed the money, which is a tad more work.


.it may be the .tk of the 2020s


Whatever domain name is cheap is going to be plagued by spam.

I think in recent times, .icu and .xyz have been the most problematic, to the point where, to this day, you probably don't want to host a mail server on those domains.

The same goes for cloud providers. A fairly significant amount of sketchy websites seem to be hosted on cheap cloud providers with weak rules enforcement. I've taken to blocking all of Alibaba's IP ranges from my search engine crawler; the signal to noise from those sites was so bad it just wasn't worth looking for legit content.


Just, no. Plenty of legitimate websites under .it, basically every single Italian company plus all localized versions of international websites (apple.it, google.it, ...)


>basically every single Italian company

Sounds about as trustworthy to me as a .tk domain.


I don't know, there's a big difference between Italy and Tokelau. Also you have to be an EU citizen or company in order to register a .it domain, while .tk allowed everyone and their dogs to get one for free for years without having a connection to the islands whatsoever.


Why? .tk was popular because it was free, so it was really useful for teens and young adults in an era when you still had to host things somewhere if you wanted them online. On the other hand, .it is the TLD of Italy and used legitimately by all businesses of the EU's third largest economy.


.it is currently being offered for free / €1


.com are being offered free or €1 as well, but you don't need to be an EU member with a valid EU ID to register a .com

https://www.register.it/domains/?lang=en

https://imgur.com/W0XkZIj

https://imgur.com/a/p9sFsKj


Shame too. I was hosting my personal site on .tk when I was broke out of college, but often a link to it was automatically flagged as spam.


The real question we should be asking is: why doesn't Google care about its index?

Junk copypasta and news-squatting (posting regularly about the same thing with no additional data) is a decades-old problem. Crowd-sourcing and verifying junk domains could be a weekend project.

But nothing.


No, Google search is plagued with spam from every domain. And even the non-spam results are useless.


Google hasn't given a shit about search since at least a decade ago. It's all about data collection via Android and Chrome OS, and gmail and docs. They don't need search to collect your data any more. Don't people actually know this? LOL


All that data they've collected is only useful if they can sell something (i.e. ads) based on it. AFAIK the majority of their ad income is from search-based ads.


Governments are happy to buy the data actually. In-Q-Tel, aka the CIA, was the first large investor in Google.



I've been getting these spam .it domains for years and years; this is nothing at all new.


This headline would be accurate without the “.it” domains part


Google appears to have stopped caring after Matt Cutts left......


The Internet is quite a bit bigger than it was in 2016: https://www.internetworldstats.com/emarketing.htm

(Have no idea how reputable that data is, but it seems about right to me. In 2016 there were 3.6 billion Internet users. Now there are 5.3 billion.)


Most of that growth is in non-English-language users. There are only so many English speakers; we are talking about the quality of English search results, and the number of users for that has not doubled in 6 years.


Isn't the article we're discussing about a .it domain impersonating a Japanese product?


Content moderation and SEO in non-English languages are far worse than in English.

I meant that Google's drop in search quality for English doesn't have much to do with growth in users over the last few years, as that growth has largely been non-English.


Also imagine how many bots there are, and how fast they can generate content now.


There is also the spam of name.ru.com domains as well.

Warning, do not click on those links as you will get your PC infected.


Wow from clicking on a link on a modern browser? New 0-day?


What does Cloudflare normally do with spam sites, is it a hands off approach or do they do some policing?


Hands off till it gets upvoted on HN


They do the bare minimum. You can report sites for abuse and they will take them down. It doesn't appear like they do anything to proactively stop similar sites so the person can just make a new account and domain and be back in business.


Cloudflare's abuse department is really lacking.

Their abuse form is getting abused too: it sends an email to the site operator and the server's hosting company in a single submit. It doesn't even have a captcha.

https://abuse.cloudflare.com/


They will do nothing, and it is a feature.

The only thing they'll do is forward the complaint to the user. Leaving you with no recourse other than to take legal action before Cloudflare will lift a finger.

Unless there's CSAM, of course.


This… seems about right?

A trademark dispute is a civil issue between two parties. We have legal systems to solve these. Cloudflare should ensure that their customers get timely notification of complaints, and that’s pretty much it.


You need to remember that Cloudflare isn't a host


I've often ranted about this, because for all intents and purposes they are.

They store the content of the website on their drives to serve to visitors. Whatever processes lie in the backend of whatever website to fetch up-to-date content from an upstream source are not my concern. They are NOT a neutral ISP; they are providing a service to their customer which includes hosting (it doesn't matter if it's temporary hosting because they expire files). From our point of view, it is their IP addresses that are hosting the website. They have all the responsibilities a traditional hoster has, no matter how they try to frame this debate.


That isn't always the case these days, web apps can run on Cloudflare without an origin.


This is 2nd- and 3rd-page search result spam; I get phishing websites on the 1st page of search results when I search for certain e-commerce websites. Google is done.


I think Google can probably not fix it; users will have to manually report these results as spam. These websites, on seeing traffic from Google's crawler bots, show a perfectly legit and highly SEO-optimized website, but for anything else show other spam. If Google starts indexing from random IP ranges, most websites would probably block indexing from "unofficial IPs", or some companies (esp. in the EU) would file some lawsuit against Google. The reason being that some pay-walled news article websites won't be indexed properly, as the "unofficial-IP" Googlebot will not get the paywalled content.

If a website lies to Google itself, I believe the only way to solve it is by reporting the search result as spam, or by Google contracting people to somehow visit all billions of web pages (again the same problem – from different IP ranges) to verify them as legit pages.

I would like to know how Google currently handles it and probably how it could be improved


Google has all the text they've scraped.

They can see all the domains that have served a given snippet.

They also have history to identify where each snippet was first seen.

If SO has a lot of traffic and a good reputation, and if the same snippet is found first at SO and then later at a bunch of newly created, low-volume, low-reputation domains, then show the SO result and not the others.


The practice of "cloaking" has been around for ages and I'm sure Google has (or at least had) solutions against it.

I'm not sure on what grounds someone could sue for crawling from random, unaffiliated addresses, as long as the crawling isn't causing a denial of service (they can always check robots.txt using the main IP, then use that to throttle crawling from random IPs so as to remain compliant).

> The reason being that some pay-walled news article websites won't be indexed properly, as the “unofficial IP-ed” Googlebot will not get the paywalled content.

Good riddance? That would be a welcome change.


Dealwith.it


It must be the mafia. No other explanation


Yeah, but which one? The Camorra, Ndrangheta, Stidda, RIAA or FAANG?


In my experience, every single one of these results carries the `html` filetype as part of its URL. This is likely a consequence of the user-agent-based switcheroo technique they use to fool Google.

Just blanket block the lot with the following uBlock Origin filter:

    google.*##.g:has(a[href*=".it"][href$=".html"])
Google ain't going to fix itself ;)


Blanket banning a whole TLD is stupid. One thing is blocking some obscure stuff like ".su", but .it? It's just too big, and arguably unwise if you are in Europe where having to connect to Italian websites or services isn't a remote possibility.


This merely hides Google search results in my browser.

No network connections are blocked...


Yes, you hid all Italian Google search results - arguably not an ideal solution.


I'm sure there are plenty of non-spam html pages based in Italy too


Considering the crowd, that trade-off seemed too obvious to mention.


cool!

now s/\.it/every TLD/ and you solved domain spam forever.

/s

You might not know that 99.99% of .it domains with URLs ending in .html are completely legit, including some official government ones.


Since uBlock is run on the client, unless you’re Italian or interested in Italian sites it doesn’t really seem like much of an issue.

I could block all .it sites on my network and I’d likely never even notice.


yeah, right, unless you're american, why should you care about .com domains?

  ¯\_(ツ)_/¯

the problem is not .it domains, it's clearly stated in the linked post

A large number of spam pages are indexed when searching by our product name. It’s very similar to Japanese Keyword hack, but the difference is that our site is not hacked

so it's definitely an indexing issue; those .it domains are being indexed for the Japanese keyword hack for some reason. It's not that .it domains are particularly spammy per se.

Your "solution" would filter the vast minority of the abusers at the cost of banning an entire TLD, not much different than turning off the internet connection entirely.

Most of the spam on the internet comes from .com domains though, even more so because registering a .com domain is much easier than getting a .it

Are you willing to ban .com too?


> Your "solution" would filter the vast minority of the abusers at the cost of banning an entire TLD, not much different than turning off the internet connection entirely.

Again, we’re talking about client-side filtering. The original comment about blocking .it domains was talking about a uBlock Origin rule. No one’s talking about blocking .it domains from the web.

Yes, as an American, I could block all .it domains on my end and my web experience likely wouldn’t change at all. I rarely, if ever, need to visit .it domains. So maybe I will.


This visually hides the HTML elements on Google Search and for me only. There is no networking involved and so Italian TLDs are still reachable.

This is a personal solution to an extremely disruptive and long standing problem, and only affects those who choose to employ it. It's not hurting anyone.


.com implies spam - it's commercial, so let's go ahead. If it's not .org I'm not playing. /s


And yet, here you are, and not on ycombinator.org? ;-)


Nah. I've been reading the docs on Spatialite (the spatial extension for SQLite) at http://www.gaia-gis.it/ the last couple days. It has both a "spam" TLD and a design from 1998.


But not many of the official government ones.


Official government in Italy also means cities, towns, hospitals, universities, public schools, etc.

There are 8 thousand towns in Italy, each with its own .it website.


In addition to this, if one runs unbound as their DNS on their home router and blocks DoH, then one could add

    local-zone: "it" always_nxdomain
to NXDOMAIN all requests for the .it TLD and protect non-browser devices. I use this method to stay off sanctioned countries' TLDs and to remove the cheap/free spammy domains and TLDs that often contain more malware than anything useful.


What’s this useragent switcheroo?


Browsers and other programs can use the User-Agent[1] header to send along a bit of information about themselves with each request.

This and other information is then used to filter out various types of visitor.

In this case, requests claiming to be a Google Search crawler will receive a boring page with lots of text that it can index and use as search results.

Most browsers' devtools let you change your user-agent string, and a listing of the ones used by Google crawlers is publicly available. Not saying that you should, but you could check this out for yourself... entirely at your own risk of course :)

https://en.wikipedia.org/wiki/User_agent

https://developers.google.com/search/docs/advanced/crawling/...


Or use Brave search, which honestly from my experience is much better.



