There has been a huge influx this year in the number of sites that simply scrape SO and then republish the exact content on their own site. It's a pain, and there is no official way to remove them.
I thought that this was a massive no-no from Google's side; has something changed?
I had this topic brought back to my mind yesterday as I was doing some research using the Ahrefs keyword tool. I do believe it would be possible to create a very large dataset of these copycat sites (using Ahrefs) to be used as a blacklist in various filters/extensions.
But the crazy part is this: Ahrefs says that StackOverflow has "Organic traffic" in the range of 22 million per month. A lot of these copycat sites, at least the ones I saw, have traffic anywhere from 10k to 500k per month.
I mean, it's pretty insane just how well such sites can rank in Google, and you bet those copycats are making absolute bank from ads even if the majority of developers immediately close the site.
There's a lot going on with Google Search these days; a lot of people are complaining that sites that scrape content can easily rank really well for long-tail keywords. In one case in particular, a site scrapes Google to collect "featured snippets" and "people also ask" answers, then combines anywhere from 20 to 40 of these answers and publishes them as a blog post.
None of the words are changed, all questions/answers worded exactly the same. And Google puts these sites on page 1.
> I do believe it would be possible to create a very large dataset of these copycat sites
Wouldn't they just move to creating and using new domains with the same content as soon as traffic to the old ones drops? (Which looks like what the spammers in the original post are doing.)
But something does need to be done to these sites.
This is a decades-old spammer trick. Google used to not rank brand new domains very high for this reason.
It's hard not to think that the only reason Google abandoned most of its old site ranking heuristics was that they were filtering out too many sites with lots of Google ads. The spam sites now infesting Google's first-page results don't look very different from the spam sites I saw back in the early 2000's. (There's more JavaScript, but modern search spiders run every page in a VM before reading the DOM, so that doesn't fool anyone.)
I think that as Google became bigger, more complicated, and ML-based, the average Googler began to understand the system less and less well. At a certain point a problem comes up and they just don't know how to solve it that well, or their bureaucracy makes solving it too painful, or the people who solve problems quickly have all gone to companies where they can do that.
A/B tests can show datapoints where these sites convert better on average. Sites aren't chosen for cultural legacy but based on hitting the correct kpis.
* the identical text copied from stack overflow should be easily identifiable
* volunteers put together a list of these sites themselves
It should be obvious to Google apologists that Google is either negligent or intentionally allowing these sites in their search. I'm sick of hearing about how "the world is different" and it's an "arms race" between spam sites and Google. Bullshit.
> the identical text copied from stack overflow should be easily identifiable
Google starts matching content from SO => spammers start tweaking the text slightly => Google implements some expensive similarity score to down-rank copycat sites => spammers use more complex scrambling => ...
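To make the "similarity score" step concrete, here is a minimal sketch using word shingles and Jaccard similarity. This is only an illustration of the general idea, not a claim about what Google actually runs; the function names and sample strings are made up.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Made-up sample texts: an original answer, a verbatim copy,
# a lightly tweaked copy, and something unrelated.
original = "You can reverse a list in Python by calling the reverse method on it."
copied = "You can reverse a list in Python by calling the reverse method on it."
tweaked = "You can reverse a list in Python by simply calling the reverse method on it."
unrelated = "Use a mutex to protect shared state between threads."

for label, text in [("copied", copied), ("tweaked", tweaked), ("unrelated", unrelated)]:
    print(label, round(jaccard(original, text), 2))
```

A one-word tweak only knocks out the few shingles that overlap the changed word, which is why slight rewording is cheap to catch and the spammer has to escalate to heavier scrambling.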
> volunteers put together a list of these sites themselves
These lists only work because they're used by a tiny minority of people. If Google were to do this the spammers would start switching domains more quickly (or find some other workaround).
I'm no Google apologist but I think you're underestimating how hard search ranking is when spammers are actively trying to game the system.
That's what ML is perfect at detecting, which is Google's forte.
Some of these sites have been returned as top results for a while, so are you suggesting that Google just gave up because spammers would be able to evade them with an update?
Yes, it is an arms race, but Google has far more resources than spammers do, so they should easily be ahead.
You underestimate the resources Google has at its disposal.
They simply don’t care because there is no real competition to worry about. Even with this spam you are still likely to use Google, so why would a profit-motivated company bother?
The problem with these theories is that they lack any sensible explanation of motive. Google intentionally degrading its search results because they "earn more if the user has to search again and again" just doesn't feel right: even if it were true in some short-term experiment, it would compromise the way people at Google think of themselves and their work to a degree that would be devastating to the company. There is no way they would throw away that sort of value without being under intense pressure, which they definitely are not.
Another comment stated that SO uses ads from someone other than Google, while the copy-paste sites use Google for ads. If true, that is a clear monetary incentive to not go after this too hard.
They've also demonstrated that they can derank the Wikipedia clones. Funny how that ability is lost when the site in question makes money for a competitor.
These large tech companies have a long and varied history of stupid short-term decision making for profit and bad products due to local individual failures. Until there is a clear and detailed explanation of how the spam sites are avoiding google's wrath, the explanation of stupidity or short-term thinking on Google's part seems just as plausible.
Well come up with an explanation of how these entirely mechanically generated SO clone sites, with no obfuscation, are allowed to exist by Google, when identifying them and removing them should be fairly trivial?
At the very least they're being deliberately neglectful because they don't feel the bad experience harms their revenue because there's no other substantial competitor so they can abuse their monopoly status.
I guess they may just not care enough about software developers and figure we're mostly using ad blockers, so it's wasted effort and we'll develop blocklists ourselves. With no monetary value that they can assign to the ill will that it engenders, they figure it must not matter, so they don't bother. Pissing off a large chunk of the entire IT community via obvious neglect seems like a poor move to me, but then I've never felt that I'm cut out for management.
Yeah, I subbed to the blocklist that someone else published that they're maintaining manually. Google certainly has the resources to beat that bar.
It feels like, economy-wide, decision makers in corporations and governments have just arrived at the conclusion that there's no money / no point in trying to stop scammers (and there might be an actual cost to revenue of doing so). It won't goose their quarterly numbers and might hurt them, so it's better to allow it.
I'd like to get actual confirmation of this, but my vague feeling is that, once upon a time, Google Search would get "updates", as in, actually deployed code that would change the rules of the game, and most of the previous dirty tricks would become unusable, leading people to go out and find new ones.
This changed with the Google "machine learning" days, where you no longer have humans at the helm laying down explicit rules, so no more "change the world" updates, you can only slightly nudge the parameters towards what you want, meaning the same old tricks keep being effective for far too long.
My theory is that one of the inputs to Google's ranking algorithm is now "how much money would we make from this click?" A click to SO has a small number of ads which are obviously ads and easily ignored. A click to the average scrape-jacked SO page has dozens of ads using every dark pattern in the book to generate accidental clicks.
One of the other commenters above made the claim that SO runs yahoo ads. If that's true then from a Google perspective, the click has either zero or negative money-making value.
Maybe that means we should be searching in yahoo rather than google.
There's a simple way around that. Nothing to install. Nothing to update.
Just go to SO and use its search bar. It's actually quite good.
I mean, you know that's where you'll want to find the answer anyway - not some random corporate webpage or ad-infested splog. Why not cut out the middle man?
Huh, you know, you're right. I recently did that and it was fine.
I think a lot of others formed their opinion (myself muchly included) about this from sites where the search bar was a joke played on people.
Edit: let me upgrade that 'fine' to 'great'. Now that I think about it, it was actually better than a Google search, which was not my previous experience.
Google's index used to be much more competent at finding relevant issues for a query, especially if some words were synonyms of what was found in the snippets at even loosely related
I have no evidence of this, but the ad load on the returned results has gotten way higher. In theory, ranking sites that display Google ads higher would be a very easy knob for Google to turn to increase profit. The SO scrapers probably have Google ads on them, making them more profitable for Google.
10 years ago I gave up on a large project where I rehosted and organized dead Usenet forum content because Google's dupe-penalty detector was too good and too aggressive for content that you could barely find beyond a six-year-old cache hit where the origin website was long gone.
Meanwhile these Stack Overflow scrapers are just `<html>{copy-and-paste}</html>` and the same domains are still alive despite years of cloning.
Turning the knob one way explicitly might raise some antitrust concerns. However, the same motivation can be used to avoid turning the knob the other way, and this can be done much more sneakily without leaving clear evidence: simply don't allocate budget/etc to projects that would turn the knob the other way and you're done.
I've noticed this with youtube. Even though I'm on desktop with an adblocker they repeatedly autoplay the same video with a creator embedded crypto promotion at the beginning (especially when it would be plausible to infer I'm asleep from user interaction and clock/watch time). Must be getting a cut (plus scamming the ad buyer).
I've been using this uBO filter since someone recommended it in a different thread and it's been great at removing those annoying sites from search results: https://github.com/quenhus/uBlock-Origin-dev-filter
Other search engines allow you to block domains from showing up in the results. I’ve switched to Kagi out of frustration and honestly it’s as good or better than Google just because of that one feature.
It took me a while for no good reason but I finally got an unofficial extension to add a "block" button to search results. It immediately improved my experience, I can't recommend it enough. No more Pinterest, SO clones, useless Quora spam, with very little work. I can't believe I didn't do it sooner.
I've been using uBlacklist and it works really well. It even lets me highlight specific websites so I have a better chance of seeing them if they are further down the list.
https://iorate.github.io/ublacklist/docs
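For anyone wondering what these client-side blocklists boil down to, here is a rough sketch: drop any result whose host is a blocked domain or a subdomain of one. It is not how uBlacklist or the uBO filter is actually implemented, and the domains below are placeholders.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocklist: set) -> bool:
    """True if the URL's host equals a blocked domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check the host and each parent domain:
    # a.b.example -> "a.b.example", "b.example", "example"
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


def filter_results(urls: list, blocklist: set) -> list:
    """Keep only the results that are not on the blocklist."""
    return [u for u in urls if not is_blocked(u, blocklist)]


# Hypothetical blocklist entries and search results, for illustration only.
BLOCKLIST = {"so-clone.example", "pinterest.com"}
results = [
    "https://stackoverflow.com/questions/1",
    "https://www.so-clone.example/questions/1",
    "https://pinterest.com/pin/2",
    "https://notpinterest.com/page",
]
print(filter_results(results, BLOCKLIST))
```

Matching on whole DNS labels rather than substrings is what keeps `notpinterest.com` from being caught by a `pinterest.com` entry.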
I don't think this problem should be solved by Cloudflare. Cheap domains will always exist and they shouldn't be a problem. The problem lies with Google and its failure to detect these spam sites.
Surely Google can spare an engineer or two to do a deep dive into the way any one of these spam sites manages to get itself to the first page of Google, work out their scheme, and fix the algorithm? This problem isn't exactly hard to reproduce!
I really don't get why so many people are willing to give Cloudflare a free pass on stuff like this. Why is it OK for a company to facilitate and host (1) thousands of scam domains, making reporting arduous and ineffective?
Anyone trying to infect others with Trojans and viruses just needs to check user agents or use dynamic redirect URLs, and suddenly this clearly illegal activity becomes black magic that is way beyond the comprehension of the folks at Cloudflare.
Cloudflare is basically making the shittiest parts of the Internet safe for scammers and spammers, and this is just one example.
If that's not bad enough, they're trying like crazy to become a monopoly. If this is what they do now, imagine how bad it'll be when they control even more and feel even more immune to making money from scammers.
(1) Hosting is providing services on the Internet without which a site would not function. Providing DNS is hosting. Providing proxy is hosting. Providing email is hosting. Don't fall for Cloudflare's "we don't host" bullshit.
Cloudflare should take action on reported domains and their owners, especially if those domains are malicious.
However, I don't want Cloudflare to preventatively police what is and isn't a bad website. When these scam sites go live, they can quite easily contain real content (say, a blog, with articles written by AI good enough not to be immediately obvious) and then change into malware on a schedule.
Cloudflare can't see what code customers run on the backend and that's probably a good thing. They're already holding too much power over the internet and requiring the backend to be transparent would only make them more in control of the web.
Any registrar hosts thousands if not millions of spam sites because every single one of the billion registrars has DNS set up in some way.
Despite being almost exclusively used for spam and amateur projects, the .TK TLD barely shows up in Google. Spam sites are a symptom of other services linking to them and making them worth the investment. If Google, Bing, Qwant and Yandex weren't falling for the SEO scams these scammers use, we wouldn't have this problem.
Hosters have some immunity by design, and that's very much a good thing. They have to respond to abuse complaints, but they're not responsible for filtering out all of their customers. Requiring them to do so is exactly what the EU is trying to force upon the internet, which is terrible for online freedom.
You make some good arguments, but if a site is caught in the act of hosting obvious malware, Cloudflare should make a reasonable effort to suspend their activity.
For example, SO copycats are legitimate in that they respect the license and otherwise just serve the content to whoever sends them an HTTP request. As far as I know they don't spam links to their domain anywhere. They are low-quality and of dubious utility for sure, but I'd rather not make the Internet a place where you need to prove quality & utility to someone to be able to host an HTTP server.
The real problem is that a dumbass like Google comes along, sees this and decides that it should rank higher than the source content.
If they're not spamming their links, how do they get such high search engine optimization?
Somebody upthread suggested it was just the use of Google ads, which I suppose is possible, but somehow it seems unlikely. Google sure does love money but they also need to be considered a good search engine, and I'd expect them to be at least a little wary about things like that.
It's my understanding that link spamming has become counter-productive since a Google algorithm update almost a decade ago? I'm not sure what they're doing but I don't think it's link spam, because of that and also because I've never seen their spam anywhere (if they're using link spam they must do so on sources that have good "authority" for programming-related topics and thus one of us would've likely seen it).
It’s often cheap blog spam “original content” and matching cheap social media spam to increase how legitimate the blog looks … which is cheaper than ever now thanks to advances in machine learning models like GPT-3 and other current generation models. The pipeline is: take a random sample of pages in the domain, take the target page -> summarise -> generate some blog spam of varying length and level of human input -> if desired, based on social media analytics, generate some automated social posts about the blog article that was just added. Since this is widely done by real humans with their real blogs, it all looks legit.
This is how it gets done, and Google used to be brutal about crushing it; somewhere along the way they seem to have given up on being so brutal.
>SO copycats are legitimate in that they respect the license
Do they? SO contributions are under CC BY-SA. Haven't seen copycats providing attribution let alone specifying that the content is under the same license.
I'm not sure, but their business model is ad revenue: they get paid as soon as the page loads. Adding the required attribution & license disclosure wouldn't hurt them at all, so I'm assuming they're either already doing it or will start doing it if asked.
No, Cloudflare should worry about what sites they host and enable. How Google ranks the sites that Cloudflare hosts is a secondary issue, and is outside of Cloudflare's control.
So every time anyone wants to report a site that is offering "Flash Updater" Trojans, someone should file a lawsuit?
Every time a scammer puts up a web site trying to sell counterfeit goods, the company which sells the real goods should file a lawsuit? One for each scammy web site, perhaps? Because Cloudflare shouldn't be expected to do anything at all, until they're compelled to do so by a court?
lol I don't think you understand how the legal system works. You don't file a lawsuit as a private party unless you are also trying to recover damages; you report cybercrime to your local authority and let them pursue the criminal
You've clearly never reported cybercrime to the authorities if you think this would work. Or you have, and you think that only large businesses which can claim losses of $10,000 or more should be protected, and everyone else is SOL.
The fact that this has been going on for several years makes me believe Google either doesn't care or the problem is particularly hard to fix (less believable)
It has been an ongoing battle for 10-15 years at this point. Search engines are constantly battling people trying to game their systems. I have to wonder if Google hasn't lost the thread a bit, inside their surely quite complex algorithm black boxes.
For a while now Google has suggested that the best way to rank well is to have human readable content and focus on user experience. At the same time, natural language generation has come leaps and bounds, to the point where sometimes even I, a human, can't tell if an article has been spun by a bot or not.
So if Google starts ranking human readable content, and robots can now produce human readable content, what is the next ranking signal they can use to differentiate spam from humans? Are we going to end up with "Verified Websites" ala verified Twitter handles?
A huge portion of the web at this point is just bots communicating with each other, and legitimate business systems having to process bot participation on the internet. I imagine the portion of the web that Google crawls that is legitimate versus that which is bot generated would surely be majority bots, just because of how fast they can generate content. One thing they can't do as easily though is register domains, so it may be one of the better points of defense.
Google has at the very least neglected its search for many years now and recently has also actively made it worse through all the censorship and thought control stuff. I find it rather surprising because essentially all of google’s success is lynchpinned by search. All it would take is for a narrative to dominate that the best results can be found elsewhere, which does not seem particularly remote, considering how much damage google has done to its search.
I've noticed issues since Matt Cutts left. No one cares anymore. There are AI-generated websites that have been running for years, ranking highly in Google.
> I don't think this problem should be solved by Cloudflare. Cheap domains will always exist and they shouldn't be a problem. The problem lies with Google and its failure to detect these spam sites.
The problem exists outside the (Google-controlled) web too: in (not fully Google-controlled) email.
Around 2020 I did a per-TLD check on wanted/unwanted messages (ham and spam). With thousands of messages sent from .xyz domains (envelope sender host or PTR record of sending host; I ignored the From header) there wasn't a single legit message. 100% SPAM.
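A per-TLD check like that is easy to reproduce on your own mail. Below is a minimal sketch, assuming you already have (sender host, is-spam) pairs extracted from your mail logs; the helper names and the sample data are made up and are not the measurements described above.

```python
from collections import Counter


def tld_of(host: str) -> str:
    """Last DNS label of a hostname, lowercased, ignoring a trailing dot."""
    return host.rstrip(".").rsplit(".", 1)[-1].lower()


def spam_rate_by_tld(messages: list) -> dict:
    """Map each TLD to the fraction of its messages that were marked as spam.

    Each message is (sender_host, is_spam), where sender_host is the envelope
    sender's domain or the PTR name of the sending host.
    """
    total = Counter()
    spam = Counter()
    for host, is_spam in messages:
        tld = tld_of(host)
        total[tld] += 1
        if is_spam:
            spam[tld] += 1
    return {tld: spam[tld] / total[tld] for tld in total}


# Made-up sample data, for illustration only.
sample = [
    ("mail.promo-deals.xyz", True),
    ("smtp.winner.xyz.", True),
    ("mx.example.com", False),
    ("bulk.example.com", True),
]
print(spam_rate_by_tld(sample))
```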
There is at least one registrar that gives away .it domains (and apparently .eu domains? WTF?) for free for one year[1], with no major strings attached (as long as you cancel after the first year) as far as I read, correct me if I'm wrong.
Why they decided to ".xyz the TLD", I don't know. ¯\_(ツ)_/¯
I'm not sure about today, but about 12 years ago, I was able to get 1000 .info domains for about $200. (We were doing some machine-generated splog creation to see if we could game Google search results. We could.)
Depends on the TLD. I searched gandi with a very random set of characters from the keyboard (to ensure I could probably get many results) and here's a selection of country-level ones which are above $25/yr:
- abcedasdfff.io = €59.29/year
- abcedasdfff.tw = €25.20/year
- abcedasdfff.nz = €25.40/year
- abcedasdfff.mx = €48.28/year
Most of them appear to be €10-20/yr, but it's certainly not uncommon to see them go for €25 or higher. Note: EUR and USD are roughly at parity so I don't think it's really necessary to do a conversion.
Anecdotally gmail has been doing a miserable job filtering spam for the last 5 months or so. For me it used to be pretty bulletproof - one of its best features.
Now I get something from McAffee Pratners(sic) every other day warning my computer is about to expire. Back in May I kept winning things from Home Depot and Lowes; and gmail would categorize it as "forums".
Add another anecdote to the anecdata pile. Past three months, McAffee and north american shopping chain spam is breaking through Gmail filters. And reporting as spam does not help. I assume they've been somehow building Google reputation for the spam accounts.
Similarly I've been getting smashed with gmail spam, google calendar spam and google drive spam for around 6 months now after never previously getting any and despite reporting most of it.
I occasionally have spam that slips through Gmail's filters, but when I explicitly mark it as spam it disappears and nothing of the same type reappears.
Ironic, since Microsoft is the worst (in my experience) at being a giant black hole to emails sent from an otherwise well-configured (SPF, DKIM, DMARC, non-SBL-listed IP, &c) but not major SMTP host.
I disagree. Prior to Gmail I used to get thousands of spam emails every day.
Now everything is filtered. Barely get any.
The added benefit is I don't get any tech calls for help from my parents who also don't end up clicking random spam and wondering why bad things are happening
Remember the good old days of talking about a "semantic web"? Now we just get one Google results page of SEO'd garbage with no way to process them.
I can't help but plug kagi.com, which has the amazing feature of grouping SEO'd stuff like recommendation lists together, so a thing that's contextually useful is still available but without polluting the other contexts.
I recently cursed google search results when trying to research an actor's birth date. There were two dates given on Wikipedia and I wanted to see which one (if either) was correct. Google returned the actor's IMDB page (which listed a third date, and no source), and then pages upon pages of what appeared to be auto-generated sites that clearly scraped from Wikipedia, repeating one or the other of the Wikipedia dates.
This is not helping to organize the world's knowledge.
The obsession with "machine learning" is actually making systems dumber. Google Search and Gmail spam filters are getting worse with each passing week, and I am almost certain the increasing reliance on ML is to blame.
I chalk it up to a cost benefit calculation. Google clearly isn't trying to eliminate all spam in search. It's not their goal. They are not trying to optimize for the user experience. They're trying to optimize revenue.
It may be that deep learning is now increasingly used to generate the spam. It either is or will be used for spam generation A LOT. Frankly it seems to be the most promising commercial use-case for the large language models.
What can search engines do about user-agent based content differentiation? Say my robots.txt allows Googlebot and nothing else. If Google attempts to double-check with a covert user agent, robots.txt is violated. Assign humans to review reported pages? It’s pretty easy to swamp a manual system like that. Just forget about robots.txt?
robots.txt is just a guideline between well-meaning actors for the majority of their traffic, like helping a bot not waste its time nor your bandwidth by crawling dynamically-generated, endless-scrolling /calendar.php pages. Google does use it to that extent.
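To illustrate the "Googlebot and nothing else" case, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, bot names, and paths are hypothetical. The parser only reports what a well-behaved crawler is asked to do; nothing in it enforces anything.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that admits Googlebot and nobody else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /calendar.php

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """What robots.txt asks of a crawler that identifies itself as `agent`."""
    return parser.can_fetch(agent, path)


for agent, path in [
    ("Googlebot", "/article.html"),
    ("Googlebot", "/calendar.php"),
    ("SomeOtherBot", "/article.html"),
]:
    print(agent, path, allowed(agent, path))
```

A crawler double-checking under a different name would hit the `User-agent: *` rule, which is exactly the conflict described above.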
To register a .it domain you must prove you are a person or a business working or residing in one of the EU member states, and you need to provide the ID of a person who's gonna be listed as admin-c of the domain.
No you don't. I had a .it domain too. Yes, there is a field in the registration where you should enter an "identity card id", but I didn't have one so I entered something random. Worked of course.
tl;dr: I managed to find the servers behind it; most likely anybody who is still affected can do the same thing I did pretty easily. We also followed the money, which is a tad more work.
Whatever domain name is cheap is going to be plagued by spam.
I think in recent time, .icu and .xyz have been the most problematic, to the point where you to this day probably don't want to host a mail server on those domains.
The same with cloud providers. A fairly significant number of sketchy websites seem to be hosted on cheap cloud providers with weak rules enforcement. I've taken to blocking all of Alibaba's IP ranges from my search engine crawler; the signal to noise from those sites was so bad it just wasn't worth looking for legit content.
Just, no. Plenty of legitimate websites under .it, basically every single Italian company plus all localized versions of international websites (apple.it, google.it, ...)
I don't know, there's a big difference between Italy and Tokelau. Also you have to be an EU citizen or company in order to register a .it domain, while .tk allowed everyone and their dogs to get one for free for years without having a connection to the islands whatsoever.
Why? .tk was popular because it was free, so it was really useful for teens and young adults in an era when you still had to host things somewhere if you wanted them online. On the other hand .it is the tld of Italy and used legit by all businesses of EU's third largest economy.
The real question we should be asking is: why doesn't Google care about its index?
Junk copypasta and news-squatting (posting regularly about the same thing with no additional data) is a decade(s) old problem. Crowd sourcing and verifying junk domains could be a weekend project.
Google hasn't given a shit about search since at least a decade ago. It's all about data collection via Android and Chrome OS, and gmail and docs. They don't need search to collect your data any more. Don't people actually know this? LOL
All that data they've collected is only useful if they can sell something (i.e. ads) based on the data. AFAIK the majority of their income they get for ads is from search based ads.
Most of that growth is in non-English-language users. There are only so many English speakers; we are talking about the quality of English search results, and the number of users for that has not doubled in 6 years.
Content moderation and SEO in non-English languages is far worse than in English.
I meant that Google dropping search quality for English has not much to do with user growth in the last few years, as that growth has largely been non-English.
They do the bare minimum. You can report sites for abuse and they will take them down. It doesn't appear like they do anything to proactively stop similar sites so the person can just make a new account and domain and be back in business.
Their abuse form is getting abused too. It sends an email to the site operator and the server hosting company in a single submit, so it's getting abused. It doesn't even have a captcha.
The only thing they'll do is forward the complaint to the user.
Leaving you with no recourse other than to take legal action before Cloudflare will lift a finger.
A trademark dispute is a civil issue between two parties. We have legal systems to solve these. Cloudflare should ensure that their customers get timely notification of complaints, and that’s pretty much it.
I've often ranted about this, because for all intents and purposes they are.
They store the content of the website on their drive to serve to visitors. Whatever processes lie in the backend of whatever website to fetch up-to-date content from an upstream source are not my concern. They are NOT a neutral ISP, they are providing a service to their customer which includes hosting (doesn't matter if it's temporary hosting because they expire files). From our point of view, it is their IP addresses that are hosting the website. They have all the responsibilities a traditional hoster has, no matter how they try to frame this debate.
This is 2nd and 3rd search page results spam, I get spam phishing websites on 1st page of search results when I search for certain ecommerce websites. Google is done.
I think Google probably cannot fix it. Users will have to manually report these as spam.
These websites on seeing traffic from Google's crawler bots show a perfectly legit and highly SEO optimized website, but for anything else show other spam. If Google starts indexing from random IP ranges, most websites would probably block indexing from “unofficial IPs” or some companies (esp in EU) would file some lawsuit against Google. The reason being that some pay-walled news article websites won't be indexed properly, as the “unofficial IP-ed” Googlebot will not get the paywalled content.
If a website lies to Google itself, I believe the only way to solve it is by reporting the search result as spam, or by Google contracting people to somehow visit all billions of web pages (again the same problem – from different IP ranges) to verify each as a legit page.
I would like to know how Google currently handles it and probably how it could be improved
They can see all the domains that have served a given snippet.
They also have history to identify where each snippet was first seen.
If SO has a lot of traffic and a good reputation, and if the same snippet is found first at SO and then later at a bunch of newly created, low volume, low reputation domains, then show the SO result and not the others.
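The rule above can be sketched in a few lines. This is a toy version under stated assumptions: the `Sighting` record, the 0..1 reputation score, the threshold, and the sample domains are all hypothetical, and a real ranker would obviously weigh far more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    domain: str
    first_seen: int     # e.g. days since some epoch; any sortable timestamp works
    reputation: float   # hypothetical 0..1 score, higher is better


def pick_canonical(sightings: list, min_reputation: float = 0.5) -> Sighting:
    """Choose which domain to show for a snippet seen on several domains.

    Prefer the earliest sighting among reputable domains; if none is
    reputable, fall back to the earliest sighting overall.
    """
    if not sightings:
        raise ValueError("no sightings for this snippet")
    reputable = [s for s in sightings if s.reputation >= min_reputation]
    pool = reputable or sightings
    return min(pool, key=lambda s: s.first_seen)


# Made-up sightings of one snippet, for illustration only.
sample = [
    Sighting("stackoverflow.com", first_seen=100, reputation=0.95),
    Sighting("so-clone-a.example", first_seen=450, reputation=0.05),
    Sighting("so-clone-b.example", first_seen=90, reputation=0.02),  # backdated copy
]
print(pick_canonical(sample).domain)
```

Filtering on reputation before comparing dates is a deliberate choice here: it stops a clone from winning simply by faking an earlier timestamp.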
The practice of "cloaking" has been around for ages and I'm sure Google has (or at least had) solutions against it.
I'm not sure on what grounds someone could sue for crawling from random, unaffiliated addresses as long as the crawling isn't causing a denial of service (they can always check robots.txt using the main IP, then use that to throttle crawling from random IPs so as to remain compliant).
> The reason being that some pay-walled news article websites won't be indexed properly, as the “unofficial IP-ed” Googlebot will not get the paywalled content.
In my experience, every single one of these results carries the `html` filetype as part of its URL. This is likely a consequence of the useragent-based switcheroo technique they use to fool Google.
Just blanket block the lot with the following uBlock Origin filter:
Blanket banning a whole TLD is stupid. One thing is blocking some obscure stuff like ".su", but .it? It's just too big, and arguably unwise if you are in Europe where having to connect to Italian websites or services isn't a remote possibility.
yeah, right, unless you're american, why should you care about .com domains?
¯\_(ツ)_/¯
The problem is not .it domains; it's clearly stated in the linked post:
> A large number of spam pages are indexed when searching by our product name.
> It’s very similar to Japanese Keyword hack, but the difference is that our site is not hacked
So it's definitely an indexing issue: those .it domains are being indexed for the Japanese keyword hack for some reason. It's not that .it domains are particularly spammy per se.
Your "solution" would filter the vast minority of the abusers at the cost of banning an entire TLD, not much different than turning off the internet connection entirely.
Most of the spam on the internet comes from .com domains though, even more so because registering a .com domain is much easier than getting an .it
> Your "solution" would filter the vast minority of the abusers at the cost of banning an entire TLD, not much different than turning off the internet connection entirely.
Again, we’re talking about client-side filtering. The original comment about blocking .it domains was talking about a uBlock Origin rule. No one’s talking about blocking .it domains from the web.
Yes, as an American, I could block all .it domains on my end and my web experience likely wouldn’t change at all. I rarely, if ever, need to visit .it domains. So maybe I will.
This visually hides the HTML elements on Google Search and for me only. There is no networking involved and so Italian TLDs are still reachable.
This is a personal solution to an extremely disruptive and long standing problem, and only affects those who choose to employ it. It's not hurting anyone.
Nah. I've been reading the docs on Spatialite (the spatial extension for SQLite) at http://www.gaia-gis.it/ the last couple days. It has both a "spam" TLD and a design from 1998.
In addition to this if one runs unbound as their DNS on their home router and they block DoH then one could add
local-zone: "it" always_nxdomain
to NXDOMAIN all requests for the .it TLD and protect non browser devices. I use this method to stay off sanctioned country TLD's and to remove the cheap/free spammy domains and TLD's that often contain more malware than anything useful.
Browsers and other programs can use the User-Agent[1] header to send along a bit of information about themselves with each request.
This and other information is then used to filter out various types of visitor.
In this case, requests claiming to be a Google Search crawler will receive a boring page with lots of text that it can index and use as search results.
Most browsers' devtools let you change your user-agent string, and a listing of the ones used by Google crawlers is publicly available. Not saying that you should, but you could check this out for yourself... entirely at your own risk of course :)
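The same check can be sketched in code. To keep it safe and offline, the sketch below uses stand-in functions instead of real HTTP requests: `cloaking_site` and `honest_site` are hypothetical servers, the user-agent strings are illustrative, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from typing import Callable

# Illustrative user-agent strings; Google publishes the real crawler strings.
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"


def cloaking_site(user_agent: str) -> str:
    """Stand-in for a cloaking server: one page for crawlers, another for everyone else."""
    if "googlebot" in user_agent.lower():
        return "How to reverse a list in Python: call list.reverse() or use slicing."
    return "Your Flash Player is out of date! Click here to download the update."


def honest_site(user_agent: str) -> str:
    """Stand-in for a normal server that ignores the User-Agent header."""
    return "How to reverse a list in Python: call list.reverse() or use slicing."


def looks_cloaked(fetch: Callable[[str], str], threshold: float = 0.8) -> bool:
    """Fetch the page under both user agents and flag it if the bodies differ a lot."""
    as_crawler = fetch(CRAWLER_UA)
    as_browser = fetch(BROWSER_UA)
    similarity = SequenceMatcher(None, as_crawler, as_browser).ratio()
    return similarity < threshold


print(looks_cloaked(cloaking_site), looks_cloaked(honest_site))
```

A site that also checks the requesting IP range would pass this test, so a clean result here proves little.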