
The solution proposed by Kagi—separate the search index from the rest of Google—seems to make the most sense. Kagi explains it more here: https://blog.kagi.com/dawn-new-era-search





At Blekko we advocated for this as well.

Google has two interlocked monopolies: one is the search index and the other is their advertising service. We often joked about what would happen if Google priced access to their index on reasonable and non-discriminatory terms, both to themselves and to others, AND allowed anyone to put whatever ads they wanted on those results. That would change the landscape dramatically.

Google would carve out their crawler/indexer/ranker business and sell access to themselves and others, which would give that business an income that did NOT go back to the parent company (it would have to be disbursed internally as capex or opex for the business).

Then front ends would have a good shot, DDG for example could front the index with the value proposition of privacy. Someone else could front the index with a value proposition of no-ads ever. A third party might front that index attuned to specific use cases like literature search.

It would be a very different world.
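To make the unbundled-index idea a bit more concrete, here is a minimal sketch in Python of what such a front end might look like. Everything here is hypothetical: the endpoint, the `query_shared_index` function, and the response fields are invented for illustration, since no such API exists today (the sketch assumes the `requests` library).

```python
import requests  # assumes the shared index would expose a plain HTTP/JSON API

# Hypothetical endpoint; no such unbundled index exists today.
SHARED_INDEX_URL = "https://index.example.com/v1/search"

def query_shared_index(query: str, api_key: str) -> list[dict]:
    """Fetch raw, unwrapped results from the hypothetical shared index."""
    resp = requests.get(
        SHARED_INDEX_URL,
        params={"q": query},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: [{"url": ..., "title": ..., "snippet": ...}, ...]
    return resp.json()["results"]

def literature_frontend(query: str, api_key: str) -> list[dict]:
    """A toy 'literature search' front end: same index, different ranking.
    All the differentiation (privacy, no ads, vertical focus) would live
    in a thin layer like this one."""
    scholarly = ("arxiv.org", "doi.org", "jstor.org", ".edu")
    results = query_shared_index(query, api_key)
    # Scholarly-looking domains sort first; everything else keeps index order.
    return sorted(results, key=lambda r: not any(d in r["url"] for d in scholarly))
```

A privacy front end or a no-ads front end would differ only in that last function.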


Access to the click stream is the big bit.

I.e., knowing which users clicked which search results.

Without the click stream, one cannot build or even maintain a good ranker. With a larger click stream from more users, one can make a better ranker, which in turn makes the service better so more users use it.

End result: monopoly.

The only solution is to force all players to share click stream data with all others.


Click stream is useful, without a doubt. It isn't essential. We had already started the process at Blekko of moving to alternate ways for ranking the index.

That said, if you run the front end as proposed, you get to collect the clicks. That gives you the click stream you want. If the index returns you a SERP with unwrapped links (which it should, if it was unbundled from a given search front end), then you could develop analytics around what your particular customers "like" in their links and have a different ranking than some other front end. One thing Blekko made really clear for me is that, contrary to the Google idea that there is always one "best" result for a query (aka the I'm Feeling Lucky link), there are often different shades of intent behind the query that aren't part of the query itself. Google felt they could cover it in the first 10 links (back before the first 10 links were sponsored content :-)), and often on the page you could see the two or three inferred "intents" (shopping, information, entertainment were common).
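As a rough sketch of the "collect your own clicks, build your own ranking" idea, here is a toy click-through-rate re-ranker. It is not Blekko's (or anyone's) actual ranker; the smoothing constants and data shapes are invented, and a real system would also have to model position bias and intent.

```python
from collections import defaultdict

class ClickRanker:
    """Toy per-front-end re-ranker: reorder a SERP by the observed
    click-through rate of (query, url) pairs, with additive smoothing
    so links that have rarely been shown aren't buried."""

    def __init__(self, prior_clicks: float = 1.0, prior_views: float = 20.0):
        self.clicks = defaultdict(float)  # (query, url) -> clicks observed
        self.views = defaultdict(float)   # (query, url) -> times shown
        self.prior_clicks = prior_clicks
        self.prior_views = prior_views

    def record_impression(self, query, urls, clicked=None):
        """Log one SERP impression and the link (if any) the user clicked."""
        for url in urls:
            self.views[(query, url)] += 1.0
        if clicked is not None:
            self.clicks[(query, clicked)] += 1.0

    def ctr(self, query, url):
        key = (query, url)
        return (self.clicks[key] + self.prior_clicks) / (self.views[key] + self.prior_views)

    def rerank(self, query, serp):
        """Reorder the raw index results by this front end's own click data."""
        return sorted(serp, key=lambda url: self.ctr(query, url), reverse=True)

ranker = ClickRanker()
ranker.record_impression("python asyncio", ["a.example", "b.example", "c.example"], clicked="b.example")
print(ranker.rerank("python asyncio", ["a.example", "b.example", "c.example"]))  # b.example first
```

Two front ends sharing the same index but feeding different click streams into something like this would end up with genuinely different rankings.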


I don't think that's quite true, as competitors like Kagi have been able to compete well with effectively zero clickstream (by comparison). It'll help, but it's not the make-or-break that the index is.

I think a click stream isn't necessary, but Kagi is not a good basis for the argument in my opinion.

Kagi is primarily a meta search engine. The click stream exists at their sources (Bing, Google, Yandex, Marginalia; not sure if they use Brave). They do have Teclis, which is their own index that they use, plus their own systems for reordering the results page, such as downranking ad-heavy pages and adjusting based on user preferences (which I love).

https://seirdy.one/posts/2021/03/10/search-engines-with-own-... is a source I would recommend checking out if you are curious.


Kagi sends searches to other providers (Bing?) and then simply re-ranks the results, so they're effectively inheriting the click stream data of those other providers.

> Google has two interlocked monopolies, one is the search index

The index is the farthest thing from a monopoly Google has - anyone can recreate it. Heck, you can even just download Commoncrawl to get a massive head start.


I see it a bit differently: many (most?) web sites explicitly deny scraping except for Google. Further, Google has the infrastructure to crawl several trillion web pages and create a relevant index out of the most authoritative 1.5 trillion. To re-create that on your own, you would need both the web to allow it and the infrastructure to do it. I would agree that this isn't an insurmountable moat, but it is a good one.

Most websites only explicitly deny scraping by bad bots (robots.txt). Things like Cloudflare are a completely different matter, and I have a whole batch of opinions about how they are destroying the web.

I'd love to compete directly with OpenAI, but the cost of a half million GPUs is a me problem - not a them problem. Google can't be faulted for figuring out how to crawl the web in an economically viable way.


Then why do we see all of these alt search engines and SEO services building out independent indexes? Why don't the competitors cooperate in this fashion already?

Because everyone worships Thiel's "competition is for losers" and dreams of being a monopoly. Monopolies being the logical outcome of a deregulated environment, for which these companies lobby.

Throughout history there are very few monopolies, and they don't normally last that long; that is, unless they are granted special privileges by the government.

Concentration is the default in an unregulated environment. Sure, pure monopolies with 100% market control are rare, but concentration is rampant: a handful of companies dominating tech, airlines, banks, media.

Concentration seems much more prevalent in heavily regulated markets, e.g. utilities and airlines. In many cases regulators have even encouraged this, e.g. finance.

There is no default for unregulated markets. It's a question of whether the economies of scale outweigh the added costs from the complexity that scale requires. It costs close to 100x as much to build 100 houses, run 100 restaurants, or operate 100 trucks as it does to do 1. That's why those industries are not very concentrated. Whereas it costs nowhere close to 100x for a software or financial services company to serve 100x the customers, so software and finance are very concentrated.

The effect of regulation is typically to increase concentration, because the cost of compliance actually tends to scale very well. So businesses that grow face a decreasing regulatory compliance cost as a percent of revenue.


You are comparing apples and oranges. You just can't compare the barrier to entry for a software business and an airline, even without any regulations. It's just orders of magnitude more expensive to buy an airplane than a laptop, and most utilities are natural monopolies, so they behave fundamentally differently.

Most planes are leased. The capex for an airline isn't anything especially high if they don't want it to be.

I can't and I didn't. I never said anything about barriers to entry. I'm talking about concentration here and why the market is dominated by airlines with hundreds of planes instead of airlines with 10 planes. Barriers to entry are inevitable in capital intensive industries.

Home building is interesting because I think a major blocker to monopoly-forming is the vastly heterogenous and complicated regulatory landscape, with building codes varying wildly from place to place. So you get a bunch of locally-specialized builders.

Regulation can increase concentration in a high corruption/cronyism environment — regulatory capture and regulatory moats. There is plenty of that happening.

In building, I think we have local-concentration, due to both regulatory heterogeneity and then local cronyism - Bob has decades of connections to the city and gets permits easily, whereas Bob’s competitor Steve is stuck in a loop of rejection due to a never ending list of pesky reasons.


Concentration is not monopoly, and furthermore your comment does not begin to address the critical part of parent’s comment: “does not last very long”.

Inequality at a point in time, and over time, is not nearly as bad if the winners keep rotating.


Airlines? Worst example ever. There are lots of airlines coming and going. "Tech" isn't even an industry.

> unless they get are granted special privileges by the government

That's what all the lobbyists are for.

None of the people or organisations that advocate for "free markets" or competition actually want free markets and competition. It's a smoke screen so they can keep buying politicians to get their special privileges.


They always inevitably end up being given special privileges.

Because, contrary to what we would all like to believe, once a company becomes large we don't want them to go under, even if they're not optimal.

There's a huge amount of jobs, institutional knowledge, processes, capital, etc in these big monopolies. Like if Boeing just went under today, how long would it take for another company to re-figure out how to make airplanes? I mean, take a look at NASA. We went to the moon, but can we do it again? It would be very difficult. Because so many engineers retired and IP was allowed to just... rot.

It's a balancing act. Obviously we want to keep the market as free as possible and yadda yadda invisible hand. But we also have national security to consider, and practicality.


> Throughout history there are very few monopolies and they don't normally last that long

That's completely incorrect. Historically, monopolies were pretty long-lived. So much that they were often written into the legal codes.

It's only fairly recently that the pace of innovation picked up so much that monopolies don't really die per se, but just become irrelevant.


This sounds like a solution contrived to advantage companies that want access to this data rather than an actual economically valid business model. If building an index and selling access to it is a viable business, then why isn't someone doing it already? There's minimal barrier to entry. Blekko has an index. Are you selling access to it for profit?

There are search engines that sell api access to their index. Pretty sure Bing, Yahoo, and Yandex all do.

Blekko also did, 10 years ago. When they still existed.


I think Brave Search does too?

We do: brave.com/api

You mean like a white label search engine? Customized with settings?

This just in: small search engine company thinks it's a great idea for small search engine companies to have the same search index as Google.

Also, I love this bit: "[Google's] search results are of the best quality among its advertising-driven peers." I can just feel the breath of the guy who jumped in to say "wait, you can't just admit that Google's results are better than Kagi's! You need to add some sorta qualifier there that doesn't apply to us."


Have you used Kagi or Google recently? Kagi works way better.

On every Kagi comment, there is “Have you used Kagi recently? It’s improved a lot!” — to the level that I suspect they have bots to upgrade the brand image, or at least to search for which comments to respond to.

I’m saying that because yes, I’ve used Kagi recently, and I switch back to Google every single time because Kagi can’t find anything. Kagi is to Google what Siri is to ChatGPT. Siri can’t even answer “What time is it?”


Maybe you see different comments than I do, but I don't see many comments saying it's improved a lot lately.

As a Kagi user, I would not say it's improved a lot lately. It's a consistent, specific product for what I need. I like the privacy aspects of it, and the control to block, raise or lower sites in my search results. If that's not something you care about then don't use it.

Is it better than Google at finding things? I don't think so, but then, Google is trash these days too


I don’t understand.

The GP of your comment is literally saying that Kagi is better than Google as of late. You’re not helping the “Kagi doesn’t use bots” case by ignoring the context 2 comments up.

https://news.ycombinator.com/item?id=43948385


They said Kagi works "way better" than google, not that Kagi is better as of late (although they do ask if they've tried kagi lately). Which is consistent with my statement that Kagi is a consistent product and not really improving. They keep adding AI features, but I disable those and don't care about them.

You're welcome to check my post history, I'm certainly not a bot. Or if I am, I'm a very convincing one that runs an astrophotography blog.


> I suspect they have bots to upgrade the brand image

I disagree with the conclusion but I agree with the premise. Man is a rationalizing animal, and one way to validate one’s choice in paying for a search engine (whether it is better or not) is to get others to use it as well. Kagi is also good at PR, they were able to spin a hostile metering plan as a lenient subscription plan.

Word of mouth is often more prevalent than we think, and certainly more powerful than botting. I would not be shocked if the author of that “AirBnBs are blackhats” article was interacting with real users of Craigslist spurred on by some referral scheme.


> one way to validate one’s choice in paying for a search engine (whether it is better or not) is to get others to use it as well.

It's not so much validating, but I'm hoping they grow so I can keep using their service. It would suck for them to close shop because they never got popular enough to be sustainable.


> to the level that I suspect they have bots to upgrade the brand image

What “level” is that which couldn’t possibly be accomplished by humans? Are you seeing thousands of messages every day?


> On every Kagi comment, there is “Have you used Kagi recently? It’s improved a lot!” — to the level that I suspect they have bots to upgrade the brand image

Odd to dismiss a point purely because it's consistently made, especially without much apparent disagreement. Perhaps more likely: there are just _many_ happy Kagi customers in the HN community.

As one data point: I use Kagi, and agree with GP, and I am not a bot (activity of this HN account predates existence of Kagi by many years).

That doesn't dismiss your experience of course, lots of people use search engines in different ways! Personally, I found the ads & other crap of Google drowned out results, and I frequently hit SEO spam etc where site reranking was helpful. I'm sure there's scenarios where that doesn't make sense though, it's not for everybody (not everybody can justify paying for search, just for starters).


“Kagi is bad. As evidence I present Siri.”

[flagged]


For someone guarding the platform from weaksauce nonsense comments you picked a strange one to defend.

I was commenting on the weakness of the analogy. There’s nothing in any of the 4 entities mentioned other than the author’s opinion about them.

Sorry my comment missed the mark for you.


It's taken as a given that Siri is inferior to ChatGPT. Both are natural language call-and-response models, but one of them is constantly in the news for diagnosing patients more accurately than actual medical doctors [1] and identifying a picture's location by the species of grass shown in a fifty-pixel-wide clump in the corner, and the other one can turn off your lights and order you a pizza when you ask it what tomorrow's weather forecast is.

Ergo, a person of average scholastic aptitude who is neither trying to ape late night talk show hosts by taking half of each single-colon-pair of an analogy, severing the other pair ends and any remaining context, and repeating the result with a well-rehearsed look of confusion; nor defensive about being called out for doing just that, can readily infer that the message being transmitted is that Kagi is fundamentally a tool very similar to Google, but which delivers inferior results.

1. https://www.nytimes.com/2024/11/17/health/chatgpt-ai-doctors...


Then why do they want Google's search index?

Crawling the web is costly. I assume it's cheaper to use the results from someone else's crawling. I don't know what Kagi is using to argue that they should have access to Google's indexes, but I'd guess it's some form of anti trust.

Let me add more: crawling the web is costly for EVERYONE.

The more crawlers out there, the more useless traffic is being served by every single website on the internet.

In an ideal world, there would be a single authoritative index, just as we have with web domains, and all players would cooperate into building, maintaining and improving it, so websites would not need to be constantly hammered by thousands of crawlers everyday.


I already get hit by literally hundreds of crawlers, presumably trying to find grist for their AI mills.

One more crawler for a search index wouldn’t hurt.


Bandwidth is cheap. I also like seeing more traffic in the logs.

Yeah not that cheap. There's a few articles on HN now about small, independent websites being essentially DDOS'd by crawlers. Although, to be fair, mostly AI crawlers.

So they can work even better...?

Kagi is just a meta search engine. They are already using Google's search index; they just find it too expensive. Guess they need to show ads to pay for the searches.

Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site, so googlebot wins because they’re the dominant search engine.

It makes sense to break that out so everyone has access to the same dataset at FRAND pricing.

My heart just wants Google to burn to the ground, but my brain says this is the more reasonable approach.


https://commoncrawl.org/

This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (examples would be NOAA, CERN, etc).

Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.


The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot. Everyone who only allows googlebot would need to change and allow ccbot instead.

It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.
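For reference, the "only Googlebot is welcome" pattern is just a couple of robots.txt stanzas, and Python's standard library can evaluate them. The robots.txt content below is an illustrative example, not taken from any real site.

```python
import urllib.robotparser

# Illustrative robots.txt for a site that allows Googlebot and blocks everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for bot in ("Googlebot", "CCBot", "SomeNewSearchBot"):
    print(bot, "allowed:", rp.can_fetch(bot, "https://example.com/some/page"))
# Googlebot allowed: True, the other two: False. Which is why a new crawler
# (or a CCBot successor) needs sites to actively change their robots.txt.
```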


Wouldn’t a decent solution, if some action happened where Google was divesting the crawler stuff, be to just do like browser user agents have always done (in that case multiple times to comical degrees)? Something like ‘Googlebot/3.1 (successor, CommonCrawl 1.0)’

Lots of good replies to your comment already. I'd also offer up the option of Cloudflare crawling customer origins and shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and reduces unnecessary load, since someone like Cloudflare can manage the crawler worker queue and network shipping internally.

(Cloudflare customer, no other affiliation)


That says that if Google switches over to ccbot, then the rest will follow.

I mean if it’s created as part of setting the global rules for the internet you could just make it opt out.

Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.

If you have access to archived crawls, anyone can build and serve an index, or model weights (gpt).

Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?

A potentially shocking amount depending on the desired freshness if the bot isn’t custom tailored per site. I worked at a job posting site and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than the index.

Bots are typically tuned to work with generic sites over crawling efficiently.


Where is the cost coming from? Wouldn't a crawler mostly just be accessing cached static assets served by a CDN?

And what do you mean by your search infrastructure? Are you talking about elasticsearch or some equivalent?


No, in our case they were indexing job posts by sending search requests. I.e., instead of pulling down the JSON files of jobs, they would search for them by sending stuff like “New York City, New York software engineer” to our search. Generally not cached because the searches weren’t something humans would search for (they’d use the location drop down).

I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.


One problem, it leaves one place to censor.

I agree that each front end should do it, but you can bet it will be a core service.


> The Internet Archive can persist the data for ~$2/GB in perpetuity

No, they can't. But do you have a source?


https://help.archive.org/help/archive-org-information/ and first hand conversations with their engineering team

> We estimate that permanent storage costs us approximately $2.00US per gigabyte.

https://webservices.archive.org/pages/vault/

> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.

https://blog.dshr.org/2017/08/economic-model-of-long-term-st...


What's the read throughput to get the data back out, and does it scale to what you'd need to have N search indexes building on top of this shared crawl?

they could charge data processing costs for reads

Of all the bad ideas I've heard for where to slice Google to break it up, this... is actually the best one.

The indexer, without direct Google influence, is primarily incentivized to play nice with site administrators. This gives them reasons to take both network integrity and privacy concerns more seriously (though Google has generally been good about these things, I think the damage regarding privacy is done: the brand name is toxic, regardless of the actual behavior).


> Crawling the internet is a natural monopoly.

How so?

A caching proxy costs you almost nothing and will serve thousands of requests per second on ancient hardware. Actually there's never been a better time in the history of the Internet to have competing search engines since there's never been so much abundance of performance, bandwidth, and software available at historic low prices or for free.


Costs almost nothing, but returns even less.*

There are so many other bots/scrapers out there that literally return zero that I don’t blame site owners for blocking all bots except googlebot.

Would it be nice if they also allowed altruist-bot or common-crawler-bot? Maybe, but that’s their call and a lot of them have made it on a rational basis.

* - or is perceived to return


> that I don’t blame site owners for blocking all bots except googlebot

I run a number of sites with decent traffic and the amount of spam/scam requests outnumbers crawling bots 1000 to 1.

I would guess that the number of sites allowing just Googlebot is 0.


> that I don’t blame site owners for blocking all bots except googlebot.

I doubt this is happening outside of a few small hobbyist websites where crawler traffic looks significant relative to human traffic. Even among those, it’s so common to move to static hosting with essentially zero cost and/or sign up for free tiers of CDNs that it’s just not worth it outside of edge cases like trying to host public-facing Gitlab instances with large projects.

Even then, the ROI on setting up proper caching and rate limiting far outweighs the ROI on trying to play whack-a-mole with non-Google bots.

Even if someone did go to all the lengths to try to block the majority of bots, I have a really hard time believing they wouldn’t take the extra 10 minutes to look up the other major crawlers and put those on the allow list, too.

This whole argument about sites going to great lengths to block search indexers but then stopping just short of allowing a couple more of the well-known ones feels like mental gymnastics for a situation that doesn’t occur.


> sites going to great lengths to block search indexers

That's not it. They're going to great lengths to block all bot traffic because of abusive and generally incompetent actors chewing through their resources. I'll cite that anubis has made the front page of HN several times within the past couple months. It is far from the first or only solution in that space, merely one of many alternatives to the solutions provided by centralized services such as cloudflare.


Regarding allowlisting the other major crawlers: I've never seen any significant amount of traffic coming from anything but Google or Bing. There's the occasional click from one of the resellers (ecosia, brave search, duckduckgo etc), but that's about it. Yahoo? haven't seen them in ages, except in Japan. Baidu or Yandex? might be relevant if you're in their primary markets, but I've never seen them. Huawei's Petal Search? Apple Search? Nothing. Ahrefs & friends? No need to crawl _my_ website, even if I wanted to use them for competitor analysis.

So practically, there's very little value in allowing those. I usually don't bother blocking them, but if my content wasn't easy to cache, I probably would.


In the past month there were dozens of posts about using proof of work and other methods to defeat crawlers. I don't think most websites tolerate heavy crawling in the era of Vercel/AWS's serverless "per request" and bandwidth billing.

Not everyone wants to deal with caching proxy because they think the load on their site under normal operations is fine if it's rendered server side.

You don't get to tell site owners what to do. The actual facts on the ground are that they're trying to block your bot. It would be nice if they didn't block your bot, but the other, completely unnatural and advertising-driven, monopoly of hosting providers with insane per-request costs makes that impossible until they switch away.

They try to block your bot because Google is a monopoly and there's little to no cost for blocking everything except Google.

This isn't a "natural" monopoly, it's more like Internet Explorer 6.0 and everyone designing their sites to use ActiveX and IE-specific quirks.


One possible answer: pay them for their trouble until you provide value to them, e.g. by paying some fraction of a cent for each (document) request.

Cool, you wanna solve micropayments now or wait until we've got cold fusion rolling first...?

You wouldn't have to make them micropayments, you can pay out once some threshold is reached.

Of course, it would incentivize the sites to make you want to crawl them more, but that might be a good thing. There would be pressure on you to focus on quality over quantity, which would probably be a good thing for your product.


>You wouldn't have to make them micropayments, you can pay out once some threshold is reached.

Believe it or not, this is a potential solution for micropayments that has been explored.


I could even pay a fixed amount to my ISP every month for a fixed amount of data transfer.

> The actual facts on the ground are that they're trying to block your bot

Based on what evidence?


based on them matching the user-agent and sending you a block page? I don't know what else to tell you. It's in plain sight.

Most of tech is set up for monopoly due to the negligible variable cost associated with serving an additional customer.

Thus being even slightly in front of others is reinforced and the gap only widens.


Google search is a monopoly not because of crawling. It's because of all the data it knows about website stats and user behavior. The original Google idea of ranking based on links doesn't work because it's too easily gamed. You have to know which websites are good based on user preferences, and that's where you need data. It's impossible to build anything similar to Google without access to large amounts of user data.

Sounds like you're implying that they are using Google Analytics to feed their ranking, but that's much easier to game than links are. User-signals on SERP clicks? There's a niche industry supplying those to SEOs (I've seen it a few times, I haven't seen it have any reliable impact).

Page ranking sounds like a perfect application of artificial intelligence.

If China can apply it for total information awareness on their population, Google can apply it on page reliability


I'm fairly certain many people have already tried to apply magical AI pixie dust to this problem. Presumably it isn't so simple in practice.

> so googlebot wins because they’re the dominant search engine.

I think it's also important to highlight that sites explicitly choose which bots to allow in their robots.txt files, prioritizing Google which reinforces its position as the de-facto monopoly. Even when other bots are technically able to crawl them.


CommonCrawl is not a valid comparison. Most robots.txt files target CCBot.

> Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site,

Companies want traffic from any source they can get. They welcome every search engine crawler that comes along because every little exposure translates to incremental chances at revenue or growing audience.

I doubt many people are doing things to allow Googlebot but also ban other search crawlers.

> My heart just wants Google to burn to the ground

I think there’s a lot of that in this thread and it’s opening the door to some mental gymnastics like the above claim about Google being the only crawler allowed to index the internet.


> I doubt many people are doing things to allow Googlebot but also ban other search crawlers.

Sadly this is just not the case.[1][2] Google knows this too so they explicitly crawl from a specific IP range that they publish.[3]

I also know this, because I had a website that blocked any bots outside of that IP range. We had honeypot links (hidden to humans via CSS) that insta-banned any user or bot that clicked/fetched them. User-Agent from curl, wget, or any HTTP lib = insta-ban. Crawling links sequentially across multiple IPs = all banned. Any signal we found that indicated you were not a human using a web browser = ban.

We were listed on Google and never had traffic issues.

[1] https://onescales.com/blogs/main/the-bot-blocklist

[2] Chart in the middle of this page: https://blog.cloudflare.com/declaring-your-aindependence-blo... (note: Google-Extended != Googlebot)

[3] https://developers.google.com/search/docs/crawling-indexing/...
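For anyone wanting to do the same, checking "real Googlebot" against the ranges published in [3] takes only a few lines. The JSON URL and the `prefixes`/`ipv4Prefix` field names below are how I recall Google's published list being structured; treat the exact schema as an assumption and verify it against the linked docs.

```python
import ipaddress
import json
import urllib.request

# Google's published crawler ranges (see [3]); URL and field names as recalled,
# so verify against the documentation before relying on this.
GOOGLEBOT_RANGES_URL = (
    "https://developers.google.com/static/search/apis/ipranges/googlebot.json"
)

def load_googlebot_networks():
    """Download the published list and return it as ip_network objects."""
    with urllib.request.urlopen(GOOGLEBOT_RANGES_URL, timeout=10) as resp:
        data = json.load(resp)
    networks = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_real_googlebot(client_ip: str, networks) -> bool:
    """True only if the request came from Google's published crawler ranges.
    A Googlebot User-Agent string alone proves nothing; anyone can send one."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in networks)
```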


Are sites really that averse to having a few more crawlers than they already do? It would seem that it’s only a monopoly insofar as it’s really expensive to do and almost nobody else thinks they can recoup the cost.

A few?

We routinely are fighting off hundreds of bots at any moment. Thousands and Thousands per day, easily. US, China, Brazil from hundreds of different IPs, dozens of different (and falsified!) user agents all ignoring robots.txt and pushing over services that are needed by human beings trying to get work done.

EDIT: Just checked our anubis stats for the last 24h

CHALLENGE: 829,586

DENY: 621,462

ALLOW: 96,810

This is with a pretty aggressive "DENY" rule for a lot of the AI related bots and on 2 pretty small sites at $JOB. We have hundreds, if not thousands of different sites that aren't protected by Anubis (yet).

Anubis and efforts like it are a godsend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.


This seems like two different issues.

One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Uploading each resource on the site once to each of a thousand bots is going to be irrelevant to a site serving millions of users (because it's a trivial percentage) and to a site with a small amount of content (because it's a small absolute number), which together constitute the vast majority of all sites.
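The closest thing to that "changed since <timestamp>" facility that already exists is a sitemap with <lastmod> entries, which a polite crawler can diff against its previous visit before fetching anything. A minimal sketch (the sitemap XML here is invented for the example):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Invented example sitemap; a real crawler would fetch /sitemap.xml over HTTP.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-06-15</lastmod></url>
</urlset>"""

def changed_since(sitemap_xml: str, since: datetime) -> list[str]:
    """Return only the URLs whose <lastmod> is newer than the crawler's last
    visit, so N crawlers re-fetch a diff instead of the whole site."""
    changed = []
    for url in ET.fromstring(sitemap_xml).findall(f"{SITEMAP_NS}url"):
        loc = url.findtext(f"{SITEMAP_NS}loc")
        lastmod = url.findtext(f"{SITEMAP_NS}lastmod")
        if loc and lastmod:
            modified = datetime.fromisoformat(lastmod).replace(tzinfo=timezone.utc)
            if modified > since:
                changed.append(loc)
    return changed

print(changed_since(SITEMAP_XML, datetime(2024, 6, 1, tzinfo=timezone.utc)))
# ['https://example.com/b']
```

That only helps with cooperative bots, of course, which is where the second issue comes in.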

The other is, there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed and ignore robots.txt. But then you set traps like disallowing something in robots.txt and then ban anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.


> then you set traps like disallowing something in robots.txt and then ban anything that tries to access it

That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis. All you can be certain of is that a significant portion of your traffic is abusive.

That results in aggressive filtering schemes which in turn means permitted bots must be whitelisted on a case by case basis.


> That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis.

Well sure you can. If it's requesting something which is allowed in robots.txt, it's a legitimate request. It's only if it's requesting something that isn't that you have to start trying to decide whether to filter it or not.

What does it matter if they use multiple IP addresses to request only things you would have allowed them to request from a single one?


> If it's requesting something which is allowed in robots.txt, it's a legitimate request.

An abusive scraper is pushing over your boxes. It is intentionally circumventing rate limits and (more generally) accurate attribution of the traffic source. In this example you have deemed such behavior to be abusive and would like to put a stop to it.

Any given request looks pretty much normal. The vast majority are coming from residential IPs (in this example your site serves mostly residential customers to begin with).

So what if 0.001% of requests hit a disallowed resource and you ban those IPs? That's approximately 0.001% of the traffic that you're currently experiencing. It does not solve your problem at all - the excessive traffic that is disrespecting ratelimits and gumming up your service for other well behaved users.


Why would it be only 0.001% of requests? You can fill your actual pages with links to pages disallowed in robots.txt which are hidden from a human user but visible to a bot scraping the site. Adversarial bots ignoring robots.txt would be following those links everywhere. It could just as easily be 50% of requests and each time it happens, they lose that IP address.
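Concretely, the trap being described is just a disallowed path plus a ban list keyed on whoever fetches it. A minimal sketch using Flask; the path name and ban policy are invented for illustration:

```python
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips: set[str] = set()

# Compliant crawlers read this and never touch /trap/; scrapers that ignore
# robots.txt follow the hidden links and burn an IP each time.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

@app.route("/robots.txt")
def robots():
    return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

@app.before_request
def block_banned():
    # Drop every request from an address that has ever hit the trap.
    if request.remote_addr in banned_ips:
        abort(403)

# Real pages would scatter links to /trap/<something>, hidden from humans (e.g. via CSS).
@app.route("/trap/<path:anything>")
def trap(anything):
    banned_ips.add(request.remote_addr)
    abort(403)
```

Whether that buys much is exactly the dispute above: each hit costs the scraper one IP, which matters far more when it does not have an effectively unlimited pool of residential addresses.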

I mean, sure, but if there were 3 search engines instead of one, would you disallow two of them? The spam problem is one thing, but I don't think having ten search engines rather than two is going to destroy websites.

The claim that search is a natural monopoly because of the impact on websites of having a few more search competitors scanning them seems silly. I don’t think it’s a natural monopoly at all.


A "few" more would be fine - but the sheer scale of the malicious AI training bot crawling that's happening now is enough to cause real availability problems (and expense) for numerous sites.

One web forum I regularly read went through a patch a few months ago where it was unavailable for about 90% of the time due to being hammered by crawlers. It's only up again now because the owner managed to find a way to block them that hasn't yet been circumvented.

So it's easy to see why people would allow googlebot and little else.


Assuming the simplified diagram of Google’s architecture, sure, it looks like you’re just splitting off a well-isolated part, but it would be a significant hardship to do it in reality.

Why not also require Apple to split off only the phone and messaging part of its iPhone, Meta to split off only the user feed data, and for the U.S. federal government to run only out of Washington D.C.?

This isn’t the breakup of AT&T in the early 1980s where you could say all the equipment and wiring just now belongs to separate entities. (It wasn’t that simple, but it wasn’t like trying to extract an organ.)

I think people have to understand that and know that what they’re doing is killing Google, and it was already on its way into mind-numbed enterprise territory.


> Apple to split off only the phone and messaging part of its iPhone

Ooh, can we? My wife is super jealous of my ability to install custom apps for phone calls and messaging on Android, it'd be great if Apple would open theirs up to competition. Competition in the SMS app space would also likely help break up the usage of iMessage as a tool to pressure people into getting an iPhone so they get the blue bubble.


> Ooh, can we?

If the dream of a Star Trek future reputation-based government run by AI which secretly manipulates the vote comes true, yes we can!

Either that or we could organize competitors to lobby the US or EU for more lawsuits in exchange for billions in kickbacks! (Not implying anything by this.)


You jest, but splitting out just certain Internet Explorer features was part of the Microsoft antitrust resolution. It's what made Chrome's ascendancy possible.

I mean it's just data. You can just copy it and hand it over to a newly formed competing entity.

You're not even really dealing with any of these shared-infrastructure, public-property, private-property, merged-infrastructure issues.

Yeah sure, there are mountains of racks of servers, but those aren't that hard to get, tariffs TBD.

I think it'll be interesting just to try and find some collection of ex-Google execs who would actually like to go back to the "do no evil" days, and just hand them a copy of all the data.

I simply don't think we have the proper set of elected officials to implement antitrust of any scale. The DOJ is now permanently politicized and corrupt, and Citizens United means corps can outspend "the people" lavishly.

Antitrust would mean a more diverse and resilient supply chain, creativity, more employment, more local manufacturing, a reversal of the "awful customer service" as a default, better prices, a less corrupt government, better products, more economic mobility, and, dare I say it, more freedom.

Actually, let me expound upon the somewhat nebulous idea of more freedom. I think we all hear about shadow banning, or outright banning with utter silence and no appeals process, by large internet companies that have a complete monopoly on some critical aspect of Internet usage.

If these companies, enabled by their cartel control, decide they don't like you, or are told by a government not to like you, the burden approaches that of being denied the ability to drive.

Not a single one of those is something oligarchs or a corporatocracy has the slightest interest in


Google killed Google. They should not have decided to become evil. Search can easily be removed, G Suite should be separate too.

> Search can easily be removed

This strikes me like "two easy steps to draw an owl. First draw the head, then draw the body". I generally support some sort of breakup, but hand waving the complexities away is not going to do anybody any good


This solution would also yield search engines that will actually be useful and powerful like old Google search was. They have crippled it drastically over the years. Used to be I could find exact quotes of forum posts from memory verbatim. I can't do that on Google or YouTube anymore. It's really dumbed down and watered down.

Discussed at the time, in case anyone is curious:

Dawn of a new era in Search: Balancing innovation, competition, and public good - https://news.ycombinator.com/item?id=41393475 - Aug 2024 (79 comments)


I feel like there's some conceptual drift going on in Kagi's blog post wrt their proposed remedy.

They argue that the search index is an essential facility, and per their link "The essential facilities doctrine attacks a form of exclusionary conduct by which an undertaking controls the conditions of access to an asset forming a ‘bottleneck’ for rivals to compete".

But unlike physical locations where bridges/ports can be built, the ability to crawl the internet is not excludable by Google.

They do argue that the web is not friendly to new crawlers, but what Kagi wants is not just the raw index itself, but also all the serving/ranking built on top of it so that they do not have to re-engineer it themselves.

It's also worth noting that Bing exists and presumably has its own index of the web, and no evidence has been presented that the raw index content itself is the reason that Bing is not competitive.


That's like asking the foxes how the farmer should manage his chickens. Kagi is a (wannabe) competitor. Likewise, YC's interest here is in making money by having viable startups and having them acquired.

I also don't think crawling the Web is the hard part. It's extraordinarily easy to do it badly [1] but what's the solution here? To have a bunch of wannabe search engines crawl Google's index instead?

I've thought about this and I wonder if trying to replicate a general purpose search engine here is the right approach or not. Might it not be easier to target a particular vertical, at least to start with? I refuse to believe Google cannot be bested in every single vertical or that the scale of the job can't be segmented to some degree.

[1]: https://stackoverflow.blog/2009/06/16/the-perfect-web-spider...


Google's C-suite is clearly not thinking ahead here. They could have helped slow down the antitrust lawsuits by opening up their search index to whichever AI company wants to pay for it. Web crawling is expensive, and lots of companies are spending wild amounts of money on it. There is a very clear market arbitrage opportunity between the cost of crawling the web and Google's cost of serving up their existing data.

Would the search index contain only raw data about the websites? Or would some sort of ranking be there?

If it's the latter, it's a neat way to ask a company to sell their users' data to a third party, because any kind of ranking comes via aggregation of users' actions. Without involving any user consent at all.


Then you'd just end up with all the ads being scams, and people not wanting to search on Google, because all the top results are scams instead of things they might actually be interested in that are not scams.

Separating the index creates a commodity data layer that preserves Google's crawling investment while enabling innovation at the ranking/interface layer, similar to how telecom unbundling worked for ISPs.

It's such a ridiculous proposal that would completely destroy Google's business. If that's the goal fine, but let's not pretend that any of those remedies are anything beyond a death sentence.

If they're dominating or one of only two or three important options in multiple other areas and the index is the only reason... I mean, that's a strong argument both that they're monopolists and that they're terrible at allocating the enormous amount of capital they have. That's really the only thing keeping them around? All their other lines of business collectively aren't enough to keep them alive? Yikes, scathing indictment.

> It's such a ridiculous proposal that would completely destroy Google's business.

It won't. My bet is that Bing and some other indexes are 95% OK for the average Joe. But relevance ranking is a much tougher problem, and "google.com" is a household brand with many other functions (maps, news, stocks, weather, knowledge graph, shopping, videos), and that's the foundation of Google's monopoly.

I think this shared index thing will actually kill competition even more, since every player will now use only the index owned by Google.


At this point, why are you so concerned about Google's business?

That was 10 years ago. I could once argue that Google held a moral superiority over Microsoft and Facebook, but man, those days are looooooong gone.


I don't know, I don't think it will.

I mean, they're still going to be the number 1 name in adtech and analytics. And they're still gonna have pretty decent personalized ads because of analytics.

Plus, that's just one part of their business. There's also Android, which is a money-printing machine with the Google Store (although that's under attack too).


Really? Google would still have an astonishingly large lead in the ad markets.

Not sure how they could hold that lead if they lose search traffic.

[flagged]


Sorry, but corporations are not people despite what some people will tell you.

They would definitely NOT survive in any recognizable form with "only a few billion dollars", because the stock price is a function of profits. Take away most of the profits, and most of the company's value gets wiped out, most of the employees would leave or get laid off, and anything of value that remains would quickly become worthless. Users would all move to the government-sanctioned replacement monopoly, likely X. To say nothing about the thousands of ordinary people who have large Alphabet holdings in their retirement portfolios and would be wiped out.

Google is practically the definition of a "too big to fail" company. They need to be reined in to allow more competition, but straight up destroying the company would be a move so colossally stupid I could just see the Trump regime doing it.



