Every new search engine I've seen has been a Bing wrapper, sometimes with light reranking.
I understand that competing with Google was borderline impossible a decade ago. But in 2024, we have cheap compute, great OSS distributed DBs, powerful new vector search tech. Amateur search engines like Marginalia even run on consumer hardware. CommonCrawl text-only is ~100TB, and can fit on my home server.
Why is no company building their own search engine from scratch?
>competing with Google was borderline impossible a decade ago. But in 2024, we have cheap compute, great OSS distributed DBs, powerful new vector search tech. [...] CommonCrawl text-only is ~100TB,
A modest homegrown tech stack of 2024 can maybe compete with a smaller Google circa ~1998, but that thought experiment handicaps Google by ignoring its current state of the art. Instead, we have to compare OSS-today vs Google-today, and there is still a big gap between the 2024 OSS tech stack and Google's internal 2024 tech stack.
E.g. for all the billions Microsoft spent on Bing, there are still some queries that I noticed Google was better at. Google found more pages of obscure people I was researching (obituaries, etc). But Bing had the edge when I was looking up various court cases with docket #s. The internet is now so big that even billion dollar search engines can't get to all of it. Each has blindspots. I have to use both search engines every single day.
I was talking about text-only, filtered and deduped content.
Most of Google's 100PB is pictures and video. Filtering the spam and deduping the content helped Google reduce its ~50B-page index in 2012 to under ~10B today.
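For anyone wondering what "deduping" looks like mechanically, here's a minimal shingle-hashing sketch; the shingle size and similarity threshold are arbitrary assumptions, not anything Google has published:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[int]:
    """Hash every k-word shingle of a document to an integer."""
    words = text.lower().split()
    return {
        int(hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - k + 1, 1))
    }

def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    # Arbitrary threshold: treat >= 80% shingle overlap as "the same page".
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

At real scale you'd use MinHash/LSH rather than pairwise comparison, but the underlying signal is the same.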
But what if I don't want to search Reddit, Stack Overflow, and blogs from the early 2000s, and all the content you just threw away as irrelevant actually contains the information I am looking for? There is an entire working generation that never heard a modem sound and has never even considered making sure their content is accessible in plain text.
I'm sure all the LLM providers are already considering this, but there's so much important information that is locked away in videos and pictures that isn't even obvious from a transcript or description.
There is still a large opportunity. Most of my searches are for plain-text information.
> But what if I don't want to search Reddit, stack overflow, and blogs from the early 2000s
That is a strawman. There are huge numbers of websites (including authoritative ones like governments and universities) and a lot of content.
> There is an entire working generation that never heard a modem sound and has never even made a consideration for making sure their content is accessible in plaintext.
If they want video they will do the same as everyone else and search Youtube. Different niche.
> I'm sure all the LLM providers are already considering this, but there's so much important information that is locked away in videos and pictures that isn't even obvious from a transcript or description.
That is true, but if you are getting bad search results (and the market for other search engines is people who are not happy with Google and Bing results), that does not help much, as you are not seeing the information you want anyway.
> That is a strawman. There are huge numbers of websites (including authoritative ones like governments and universities) and a lot of content.
Ya know... a search engine that was limited to *.gov, *.edu and country equivalents (*.ac.uk, etc) would actually be pretty useful. Ok, I know you can do something like it with site: modifiers in the search, but if you know from the beginning you're never going to search the commercial internet you can bake that assumption into the design of your search engine in interesting ways.
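For what it's worth, the scoping could literally be a one-line crawl-frontier filter; a rough sketch, with the suffix allowlist invented as a starting point:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of public-sector suffixes; extend per country as needed.
ALLOWED_SUFFIXES = (".gov", ".edu", ".mil", ".ac.uk", ".gov.uk", ".edu.au", ".gc.ca")

def in_scope(url: str) -> bool:
    """Keep a URL in the crawl frontier only if its host ends in an allowed suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_SUFFIXES)

urls = [
    "https://www.nasa.gov/missions",
    "https://example.com/shop",
    "https://www.ox.ac.uk/research",
]
print([u for u in urls if in_scope(u)])  # drops the commercial URL
```

Baking the assumption in this early also simplifies ranking: you never have to fight commercial SEO in the first place.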
That explains why you were 10 times more likely to find something 15-20 years ago than you are today. They reduced the size by dropping a lot of sites and not crawling as much. We would expect Google to be at 100PB x 100 given the growth of users and content over that time period. Someone made the decision to prioritize a smaller index over a more complete one, and some A/B test was run and turned out well.
Just serving up content from Reddit and HN and a few other websites would be enough to beat Google for most of us. Sprinkle in the top 100 websites and you have a legitimate contender.
There is no open web anymore. Google killed it. There are probably fewer than 100k useful websites in the world now. Which is good for startups, because the problem is entirely tractable.
It's all very regional. Despite its name, the world-wide web is aggressively local. Some properties are global, but after a handful it's all country/language/region based.
No matter what type of market analysis I do, I almost invariably find there's something different that say, the Koreans or the Europeans are using. The Yelp of Japan is Tabelog, the Ubereats of the UK is deliveroo, the facebook of russia is vk.ru etc.
That's really the beach head to capture - figure out what a "web region" is for a number of query use-cases and break in there.
Reddit is a good example of a company that is territorial about its content being indexed or scraped. I can't even access it via most of my VPN provider's servers anymore because they block those requests.
I liken Google Search and YouTube to how Blockbusters video rental stores used to operate.
If you went into Blockbusters, there was only a small subset of the available videos to rent. Films that had been around for decades were not on the shelves, yet garbage released very recently would be there in abundance. If you had an interest in film and, say, wanted to watch everything by Alfred Hitchcock, there would not be a copy of 'The Birds' there for you.
Or another analogy would be a big toy shop. If you grew up in a small town then the toy shop would not stock every LEGO set. You would expect the big toy shop in the big city would have the whole range, but, if you went there, you would just find what the small toy shop had but piled high, the full range still not available.
Record shops were the worst for this. The promise of Virgin Megastore and their like was always a bit of a let down with the local, independently owned record shop somehow having more product than the massive record shop.
Google is a bit like this with information. Youtube is even worse. I cottoned on to this while testing on other people's devices. Not having Apple products, I wanted to test on old iPads, Macbooks and phones. For this I needed a little bit of help from neighbours and relatives. I already knew I had a bug to work around, and that there was a tutorial on Youtube I needed for a quick fix so I could test everything else. So this meant opening Youtube on different devices owned by very different people, with their logged-in accounts.
I was very surprised to see that the recommendations were all very similar to what I would expect on my own account. I thought the elderly lady downstairs and my sister would have very different recommendations to myself, but they did not. I am sure the adverts would have been different, but I was only there to find a particular tutorial and not to be nosy.
I am sure that Google have all this stuff cached at the 'edge', wherever the local copper meets the fibre optic. It is a model a bit like Blockbusters, but where you can get anything on special request, much like how you can order a book from a library for them to get it out of storage for you.
The logical conclusion of this is Google text search becoming more like the encyclopedias and dictionaries of old, where 90% of what you want can be looked up in a relatively small body of words. I am okay with this, but I still want the special requests. There was merit in old-school AltaVista searches where you could do what amounts to database queries, with logical 'and's, 'or's and the like.
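As a toy illustration of those old-school boolean queries: plain set operations over an inverted index already give you AND/OR/NOT. This is just a sketch, not how any production engine is built:

```python
from collections import defaultdict

docs = {
    1: "alfred hitchcock directed the birds",
    2: "the birds is a 1963 film",
    3: "blockbuster stocked recent releases",
}

# Build a tiny inverted index: term -> set of document ids.
index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean AND / OR / NOT as plain set algebra.
hitchcock_and_birds = index["hitchcock"] & index["birds"]      # {1}
birds_or_blockbuster = index["birds"] | index["blockbuster"]   # {1, 2, 3}
birds_not_hitchcock = index["birds"] - index["hitchcock"]      # {2}
print(hitchcock_and_birds, birds_or_blockbuster, birds_not_hitchcock)
```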
The web was written in a very unstructured way, with WYSIWYG as the starting point and nobody using content-sectioning elements to scope headings to content. This mess suits Google, as they can gatekeep search: you need them to navigate a 'sea of divs'.
Really, a nation such as France, with a language to preserve, needs to make information a public good, with content structured and information indexed as a government priority. This immediately screams 'big brother', but it does not have to be like that. Google are not there to serve the customer; they only care about profits. They are not the defenders of democracy and free speech.
If a country such as France, or even one such as Sweden, gets its act together and indexes its content in its own language as a public good, it can export that know-how to other language groups. It is ludicrous that we are leaving this up to the free market.
You'd be surprised how long it takes to enshittify a piece of tech as well established as Google. The MBAs may be trying but there are still a lot of dedicated folks deep in the org holding out.
Compared to Google X years ago, for example. Unless I'm mixing threads up, that's what we're talking about anyway: the degradation of Google's search results.
>The article you linked doesn't say anything about 100 petabytes
Excerpt from the article: >[...] Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. [...]
To properly compete with Google search you have to:
1. Create your own index/crawl and a true search engine on top of it, rather than delegating that back to Bing or Google. In doing so, solve or address all the related problems like SEO/spam, and cover the “long tail” (the expensive part).
2. Monetize it somehow to afford all of the personnel/hardware/software costs. Probably with your own version of Google’s ad products.
3. Solve user acquisition. Note that Google’s user acquisition involves multi-billion-dollar-per-year contracts with Apple and huge deals with other vendors who control the “top of funnel”, like Mozilla Firefox, and that this is the whole point of Chrome/Android/Chromebook/etc., which you’ll never be able to make a deal with. You will probably, at the very minimum, need to build your own platform (a device, OS, or browser alone probably won’t cut it).
4. Solve the bootstrap problem of getting enough users for people to care about letting you index their site, without initially being able to index a lot of sites, because you are not important enough.
5. Somehow pull all of this off in plain sight of Google (this would take many years to build, both technically and in terms of users) without them being able to properly fend you off.
6. Somehow pull all of this off in spite of the web/web search seeming like it’s going to die off or fade into irrelevance
OR you can dedicate a decade of your life to something more likely to succeed.
I've heard more or less the same thing said about IBM, Walmart, Microsoft, etc.
There is one way to compete with Google search. Google search is a general search engine. But what if you only wanted medical information? information about cars? information about physics? electronics? history? etc.
Specializing makes for a much smaller scope of the problem, and being specialized means it could deliver more useful results.
For example, imdb.com. I don't ask google about movie stuff, I go to imdb.com because it specializes in movie info.
I never go to IMDb directly as both it and its search function are noticeably slower than Google’s. So I just tend to Google “<movie name> IMDb” and click the first result.
I'd like to add to your #5. If Google deems you a legitimate threat, then they can just de-crapify their own search for a bit by going back to their old algo. It's extremely easy for them to fight back.
Is that true though? I think a big part of the algorithm's evolution is to combat undesirable forms of SEO. I would think rolling back the algorithm to some older generation would make the results quite spammy.
I often wonder if banning (or very strongly penalizing) websites with affiliate links would largely get rid of spam results. Sure, some very useful websites have affiliate links, but perhaps it would be worth it to at least let users hide them?
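As a thought experiment, a ranker could down-weight by affiliate-link density instead of banning outright; a rough sketch where the URL patterns and penalty curve are invented purely for illustration:

```python
import re

# Hypothetical patterns that often indicate affiliate links.
AFFILIATE_PATTERNS = re.compile(r"(amzn\.to|tag=|affid=|/ref=|utm_medium=affiliate)", re.I)

def affiliate_penalty(outgoing_links: list[str], max_penalty: float = 0.5) -> float:
    """Return a multiplier in [1 - max_penalty, 1.0] based on affiliate-link density."""
    if not outgoing_links:
        return 1.0
    hits = sum(1 for link in outgoing_links if AFFILIATE_PATTERNS.search(link))
    density = hits / len(outgoing_links)
    return 1.0 - max_penalty * density

score = 0.87  # whatever the base relevance score was
links = ["https://amzn.to/abc123?tag=site-20", "https://example.org/docs"]
print(score * affiliate_penalty(links))  # penalized score
```

A user-facing "hide affiliate-heavy results" toggle would just be this penalty taken to 0.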
I don't think it's a solution. The average person seems to already have a strong dislike of ads, or paying for something that would otherwise be free (likely funded by ads or aff links)
A solution could be more than one algorithm being used to rank results, i.e. other engines, other rules. They'll likely use many of the signals available to them that Google uses for quality and relevance, but highly unlikely a genuine alternative search would rank them exactly the same- and much more unlikely an SEO could rank well in multiple engines.
The aff links aren't the problem, it's the proliferation of pages that are created solely to rank and get the links clicked on. Sometimes the content is useful, sometimes it's padded nonsense.
The solution is to force-feed all affiliate-linked sites into a living-archive LLM that digests them into summaries stripped of all links.
You run this bloated mass as a co-engine to your search. If you stumble upon any of the digested sites or articles, you just have the site-eater blob regurgitate an HTML re-creation locally. They get no traffic; you get whatever they wrote, purged of all links and with optional formatting to strip the fluff BS copywriting that many of these sites pad their tiny core of usefulness with.
Google got big when being a dot-com URL meant something and the primary way the average person accessed the Internet was through a desktop PC. Neither of those things is true anymore.
I don't think people associate search with an app. Search is something built in to the operating system or browser (a distinction which is disappearing too).
People use apps more than they use a browser. It's not a big deal to reform search to be an app, it will be natural. People listen to podcasts without knowing anything about mp3 files or file systems, they edit and upload their photos without knowing about jpegs, etc.
How do you let the 99.999% of the population know that your site is better? I have seen friends and family now search directly in FB; if it does not return anything satisfactory, they open the default browser that came with their device and type the query in the address bar, and whatever search engine came preset as the default gets the traffic.
The only people I've seen manually typing the URL are my office colleagues (mostly engineers of some sort). Even management does the same (skipping the FB part, obviously); you can see it when they are sharing their screen and want to search for something.
To pull people out of Google's moat, the first step is becoming their default search engine.
This is backwards. The first step is innovating and creating a radically better search experience. This is then easy to market (and is likely to spread by word of mouth). Users love it and it becomes the main way they search.
I’m in my second year of Kagi and loving it. I actually just upgraded to get Kagi Assistant (basically, cloud access to every LLM out there). But the search alone is worth every penny, and it’s built/operated fully in-house as far as I know.
They are highly dependent on outside search engines. Someone from Kagi gave an explanation on HN of their search costs and why they can't go lower, and calling the Google API on many (most?) search queries was a major driver of their costs.
It's great that they are developing their own index, but I'm skeptical that it makes up more than a tiny fraction of what they can get from Google/Bing. DDG has been making similar claims for years but are still heavily reliant on Bing.
This isn't to knock on upstart search engines. I think that Google Search has declined massively over the past 5-10 years and I rarely use it. More competition is sorely needed, but we should be clear-eyed about the landscape.
As much as I like Kagi and wish it success, it's not a search engine built from scratch. Kagi uses other search engines (Google and Bing), wraps them, and does light reranking.
Kagi is also building its own index in the background and mixes these indexes as you search.
When I search Kagi for "Hacker News", results start with this fine text:
65 relevant results in 1.09s. 47% unique Kagi results.
So the other indexes are fillers for Kagi's own index. They can't target their crawlers precisely, because they don't have users' search history; they can only grow organically and process what they've indexed.
How is it possible that a search for "Hacker News" produces only 65 results? There are thousands of pages out there with that exact phrase on it (including many sub-pages of this site).
The first result is almost assuredly the right one, but either they're ruling out a lot of pages as not-what-you-meant, or their index is really small.
It's another feature of Kagi. They know they have thousands of results, but they provide you a single page of most relevant results. If you want to see more because you exhausted the page, there's a "more results" button at the bottom.
Kagi reduces mental load by default, and this is a good thing.
That's an interesting positive spin on what is clearly a cost-cutting feature. I'm on the unlimited search tier, so it hasn't really been a big deal, but it's worth noting that clicking the "more results" button charges your account for an additional search too. Or at least it used to.
Key differentiator: Kagi still properly responds to the negation sign and quotes in your search terms. This is why I pay for it. The signal-to-noise ratio is way higher than with other engines.
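For context on what honoring those operators involves: the engine has to split the query into quoted phrases, negated terms, and plain terms before retrieval. A minimal sketch, purely illustrative and not Kagi's actual parser:

```python
import re

def parse_query(q: str) -> dict[str, list[str]]:
    """Split a query into quoted phrases, negated terms, and plain terms."""
    phrases = re.findall(r'"([^"]+)"', q)          # exact-match phrases
    rest = re.sub(r'"[^"]+"', " ", q)              # strip phrases before tokenizing
    negated = [t[1:] for t in rest.split() if t.startswith("-")]
    plain = [t for t in rest.split() if not t.startswith("-")]
    return {"phrases": phrases, "negated": negated, "plain": plain}

print(parse_query('rust "borrow checker" -crab tutorial'))
# {'phrases': ['borrow checker'], 'negated': ['crab'], 'plain': ['rust', 'tutorial']}
```

The hard part isn't parsing, of course; it's actually enforcing the phrases and exclusions at retrieval time instead of treating them as soft hints.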
That page includes the text "Our search results also include anonymized API calls to all major search result providers worldwide".
They source results from lots of places including Google. One way that you can confirm this is to search for something that only appears in a recent Reddit post. Google has done a deal with Reddit that they're the only company allowed to index Reddit since the summer.
edit: I don't think this is a bad thing for Kagi. I'm a very happy subscriber, and it's nice for me that I still get results from Reddit. They're very useful!
Note that due to adversarial interoperability, search engines other than Google can scrape Reddit if they try hard enough. A rotating residential proxy subscription, while pricey, likely still costs orders of magnitude less than what Google paid. The same goes for Stack Overflow. You can also DIY by getting a handful of SIM cards. CGNAT, usually a scourge, works in your favour for this application since Reddit can't tell the difference between you loading 10000 pages and 10000 people on your ISP loading one page each (depending on the ISP)
I loved kagi, but their cost is prohibitive for me.
The only way there is a chance for me to afford Kagi might be to buy "search credit" without a subscription and without minimum consumption. And then it would only be good if they allowed more than 1000 domain rules and showed more results (when available)
The search-credit model sounds great. I would probably also pay for Youtube credit. I use it too rarely for their monthly rate to make sense. The user experience with ads sucks, so I further try to reduce my usage. At least that's good for the environment and I can do more useful things.
If $10 is too much to afford, the problem is not really Kagi. You need to improve your economic situation urgently, because that is starvation level poverty. Seeing that you're posting on a message board like this, I assume you have at least average talent and capabilities to have a paid occupation.
"CommonCrawl [being] text-only is only ~100TB, and can fit on my home server."
Are any individual users downloading CC for potential use in the future?
It may seem like a non-trivial task to process ~100TB at home today but in the future processing this amount of data will likely seem trivial. CC data is available for download to anyone but, to me, it appears only so-called "tech" companies and "researchers" are grabbing a copy.
Many years ago I began storing copies of the publicly-available com. and net. zonefiles from Verisign. At the time it was infeasible for me to try to serve multi-GB zonefiles on the local network at home. Today, it's feasible. And in the future it will be even easier.
NB. I am not employed by a so-called "tech" company. I store this data for personal, non-commercial use.
Mostly because Google bought, developed, acquired, or effectively controls all the major distribution points with default-placement deals: e.g. Apple, Samsung, Chrome, Android, Firefox. Remedies are coming in time, though, from the antitrust case Google lost against the DoJ.
Another major factor is that building a search index and algorithms that search across billions of pages with good enough latency is very hard. It's easy enough at the tens-of-millions scale, but a different challenge at billions.
Some claim(ed) that click-query data is needed at scale, and are hoping for that remedy. Our take is: what is the point of replicating Google? Anyway, will this data be free or low cost? You know the answer.
Cloud infrastructure is very expensive. We save massively on costs by building our own servers, but that means capital outlay.
Maybe, maybe not. Remember that Google is "woke" - an enemy of the faction that now finds itself in power. They might continue the lawsuit to set an example.
Because nowadays, more than ever, the content you need is in silos.
Your Facebooks/Twitters/Instagram/Stack Overflow/Reddit... and they all have limited, expensive APIs and bulk-scraping detection.
Sure, you can cobble together something that will work for a while, but you can't run a business on that.
Additionally, most paywalled sites (like news) explicitly whitelist Google and Bing, and anyone creating a new site does the same. As an upstart, you would have to reach out to them to get them to whitelist you, and you would need to do it not only in the USA but globally.
Another problem is Cloudflare and other CDNs/web firewalls, so even trying to index a mom-and-pop blog could be problematic. And most mom-and-pop blogs are nowadays on some blogging platform that is just another silo.
Now that I think about it, Cloudflare might be in a good position to do it.
The AI hype and scraping for content to feed the models have made it harder for anyone new to start a new index.
While LLMs have accelerated it, it was already the case that silos were blocking non-Google and non-Bing crawlers before LLMs. LLMs have only made existing problems of the web worse; they were problems before LLMs too, and banning LLMs won't fix the core issues of silos and misinformation.
You're thinking too much by the rules. You can absolutely scrape them anyway. Probably the biggest relevant factor is CGNAT and other technologies that make you blend in with a crowd. If I run a scraper on my cellphone hotspot, the site can't block me without blocking a quarter of all cellphones in the country.
If the site is less aggressively blocking but only has a per-IP rate limit, buy a subscription to one of those VPNs (it doesn't matter if they're "actually secure" or not - you can borrow their IP addresses either way). If the site is extremely aggressive, you can outsource to the slightly grey market for residential proxy services - for fifty cents to several dollars per gigabyte, so make sure that fits in your business plan.
There's an upper bound to a website's aggressiveness in blocking, before they lose all their users, which tops out below how aggressive you can be in buying a whole bunch of SIM cards, pointing a directional antenna at McDonald's, or staying a night at every hotel in the area to learn their wi-fi passwords.
> You're thinking too much by the rules. You can absolutely scrape them anyway. Probably the biggest relevant factor is CGNAT and other technologies that make you blend in with a crowd. If I run a scraper on my cellphone hotspot, the site can't block me without blocking a quarter of all cellphones in the country.
I am familiar with most of that, and there is a BIG difference between finding a workaround for one site that you scrape occasionally and finding workarounds for all of the sites.
Big sites will definitely put entire ISPs behind annoying captchas designed to stop exactly this (if you ever wonder why you sometimes get captchas that seem slow to load, have long animations, or other annoyingly slow behaviour, that is why).
And once you start making enough money to employ all the people you need to do that consistently, they will find a jurisdiction or three where they can sue you.
Also, good luck finding residential/mobile ISPs that will stand by and not try to throttle you after a while.
You can definitely get away with all of that for a while, but you absolutely can't build a sustainable business on it.
Kagi is not one of these and I love it enough to pay for it. It has kept the semi-"advanced" features of respecting the negative sign and emphasizing results that match quoted terms in the search query.
Very recently found out about Brave Goggles.
An amazing way to give control to users, e.g. blocking Pinterest, or searching only domains popular with HN or your own list.
Brave Search is a continuation of Cliqz, a German company that developed a proper search engine with an independent index. They shut down, but the tech got sold off.
Cliqz was the first time for me that a Google alternative actually worked really well - and it, or now brave search, is what parent was asking for :)
Yet we're still back to Larry Page and Sergey Brin's conclusion in their "The Anatomy of a Large-Scale Hypertextual Web Search Engine" research paper[0]:
> We expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.
Brave Search Premium hasn't been around nearly as long as their free tier serving ads, and I'm not confident this conflict of interest is gone.
Having independent indexes is a win regardless though.
It's worth noting that building a solid search index needs more than just text. The full Common Crawl archive includes metadata, links, page structure, and other signals that are needed for relevance and ranking, and to date is over 7 PiB, so that's rather more than tends to fit on home servers.
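To make that concrete: Common Crawl ships WARC (raw responses), WAT (metadata/links), and WET (extracted text) files, and a library like warcio can stream them. A minimal sketch, assuming you've downloaded a WET segment locally (the path below is a placeholder):

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Placeholder path: any WET segment downloaded from Common Crawl.
wet_path = "CC-MAIN-example.warc.wet.gz"

with open(wet_path, "rb") as stream:
    for record in ArchiveIterator(stream):        # handles .gz transparently
        if record.rec_type == "conversion":       # WET records hold extracted text
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, len(text))
```

The WAT files carry the outlinks and headers you'd want for ranking signals, which is exactly the part a "text-only" estimate leaves out.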
It occurs to me that competing with Google on their home turf and at their scale might be impossible. But doing what Google was good at when they were just a search engine might not be all that much harder today than it was back then. And it may fly "under the radar" of Google's current business priorities.
I'm thinking about how Etsy took "the good bit" of Ebay and ran with it. Or how tyre-and-exhaust shops take "the profitable bit" of being a mechanic and run with it. I know there's a name for this, but it eludes me.
It could be they also took the bit that eBay abandoned. There was a point when eBay announced to the business press that they were going to de-emphasize the "America's yard sale" aspect, and shift towards being a regular e-commerce site like Amazon.
Because it is super expensive and difficult to keep an index up to date. People expect to be able to get current events, and expect search results to be updated in minutes/seconds.
The Swede behind search.marginalia.nu has had a working search engine running on a single desktop-class computer in a living room, all programmed and maintained in his spare time, that was so good that in its niches (history, programming, and open source come to mind) it would often outshine Google.
Back before I found Kagi I used to use it every time Google failed me.
So, yes, given he is the only one I know of who manages this, it isn't trivial.
But it clearly isn't impossible, or that expensive, to run an index of the most useful and interesting parts of the internet.
I think the problem with search is that while it's relatively doable to build something that is competitive in one or a few niches, Google's real sticking power is how broad their offering is.
Google search has seamless integration with maps, with commercial directories, with translation, with their browser, with youtube, etc.
Even though there are more than a few queries where they leave something to be desired, the breadth of queries they can answer is very difficult to match.
This is one of those things that I think is interesting about how "normal" people use the Internet. I am guessing they just always start with google.
But for me, if I want to look up local restaurants, I go straight to Maps/Yelp/FourSquare(RIP). If I want to look up releases of a band, I go straight to musicbrainz. Info about Metal Band, straight to the Encyclopeadia Metallum. History/Facts, straight to Wikipedia. Recipes, straight to yummly. And so on. I rarely start my search with a general search engine.
And now with GPT, I doubt I perform a single search on a general search engine (Google, Bing, DDG) even once a day.
"Normal" people don't start with Google, not any more. They start with Facebook, Instagram, X, Reddit, Discord, Substack etc. That's exactly the problem, the world-wide web has devolved back into a collection of walled gardens like things were in the BBS era, except now the boards are all run by a handful of Silicon Valley billionaires instead of random nerds in your hometown.
You have just described how "innovation" is often just a power shift, often one that does not benefit the user. Old is new again, but in different hands; the right hands, of course.
No search engine is refreshing every website every minute. Most websites don't update frequently, and if you poll them more than once every month, your crawler will get blocked incredibly fast.
The problem of being able to provide fresh results is best solved by having different tiers of indices, one for frequently updating content, and one for slowly updating content with a weekly or monthly cadence.
You can get a long way by driving the frequently updating index via RSS feeds and social media firehoses, which provide signals for when to fetch new URLs.
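A sketch of that tiering in its simplest form: each host carries a recrawl interval, and an RSS/firehose ping promotes the host and makes it due immediately. The tier names and intervals here are invented for illustration:

```python
import time

# Hypothetical recrawl intervals per tier, in seconds.
TIERS = {"hot": 3600, "warm": 7 * 86400, "cold": 30 * 86400}

class Scheduler:
    def __init__(self) -> None:
        self.next_fetch: dict[str, float] = {}   # host -> earliest next crawl time
        self.tier: dict[str, str] = {}           # host -> tier name

    def mark_crawled(self, host: str) -> None:
        """After a crawl, push the host out by its tier's interval."""
        tier = self.tier.get(host, "cold")
        self.next_fetch[host] = time.time() + TIERS[tier]

    def rss_ping(self, host: str) -> None:
        """An RSS/firehose signal promotes the host and makes it due immediately."""
        self.tier[host] = "hot"
        self.next_fetch[host] = time.time()

    def due(self) -> list[str]:
        """Hosts whose recrawl time has arrived."""
        now = time.time()
        return [h for h, t in self.next_fetch.items() if t <= now]
```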
I meant this in response to the parent that Common Crawl only updates every month, which seemed to imply that this was sufficient.
This is too slow for a lot of the purposes people tend to use search engines for. I agree that you don't need to crawl everything every minute. My previous employer also crawled a large portion of the internet every month, but most of it didn't update between crawls.
See also: IndexNow [1], a protocol used by Bing, Naver, Yandex, Seznam, and Yep where sites can ping one of these search engines when a page is updated and all others will be immediately notified. Unfortunately it does seem somewhat closed as to requirements for joining as a search engine.
The irony of a website from 2 major search engines looking like it was made in the early 2000s doesn't escape me. But, to my original point, there's absolutely no way they were ignorant of well-known URIs
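For reference, the publisher side of IndexNow is a single HTTP call; this sketch follows the protocol as described at indexnow.org, with the key and URLs as placeholders:

```python
import json
import urllib.request

# Placeholder key and URLs; the key must also be hosted at https://example.com/<key>.txt
payload = {
    "host": "example.com",
    "key": "your-indexnow-key",
    "urlList": ["https://example.com/new-article"],
}

req = urllib.request.Request(
    "https://api.indexnow.org/indexnow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; charset=utf-8"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200/202 indicates the submission was accepted
```

The closed part is the other direction: becoming one of the search engines that receives these pings.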
Some sources update faster than others; you could index news sources hourly and low-velocity sites weekly. Google does that. CommonCrawl gets ~7TB/month, and indexing and vectorizing that is quite manageable.
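To give a feel for the vectorizing step, here's a minimal sketch using sentence-transformers; the model name is just a common small default, not a recommendation, and the passages stand in for text pulled out of a WET file:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A small, commonly used embedding model; swap for whatever fits your latency budget.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Pretend these came out of a WET file's extracted-text records.
passages = [
    "Obituary archives for the county historical society.",
    "Court docket search results for case 3:24-cv-01234.",
]
embeddings = model.encode(passages, batch_size=64, show_progress_bar=False)
print(embeddings.shape)  # (2, 384) for this model
```

The real cost is throughput: at ~7TB of new text a month you're sizing GPU batches and an ANN index, not fighting the algorithm itself.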
I think it is more complex than that. Common crawl does not index the whole web every month. So even if you use common crawl and just index it every month, which you could do pretty cheaply admittedly, I don't think that would lead to a good search index.
Running an index is an extremely profitable business, from multiple points of view (you can literally earn money, but also run ads, you get information you can sell, you can buy mindshare). Everybody is looking for indexes beyond Google and Bing, but there are none. If it really is as easy as indexing common crawl, then I think we'd have more indexes.
Not that hard; it took me 4 weekends to build a private search engine with Common Crawl, Wikipedia, and HN as a link-authority source. It takes about a week to crunch the data on an old Lenovo workstation with 256GB of RAM and some storage.
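A sketch of what "link authority from a seed source" can mean in its simplest form: count how often each external domain is linked from a trusted corpus (HN story links here, hypothetically) and use a log-scaled count as a ranking prior. This is an illustration of the idea, not the parent's actual pipeline:

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical input: external URLs harvested from a trusted seed corpus (e.g. HN stories).
seed_links = [
    "https://lwn.net/Articles/123456/",
    "https://lwn.net/Articles/654321/",
    "https://blog.example.dev/post",
]

domain_counts = Counter(urlparse(u).hostname for u in seed_links)

def authority_prior(url: str) -> float:
    """Log-scaled boost for domains the seed corpus links to often."""
    return 1.0 + math.log1p(domain_counts.get(urlparse(url).hostname, 0))

print(authority_prior("https://lwn.net/Articles/999999/"))  # ~2.1, vs 1.0 for an unseen domain
```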
Google ingests almost the whole public web almost every day. I don't see any startup competing with them; they might come up with a great algorithm or something, but they will need the infrastructure and huge investments to compete.
Even then, after using a Google Gemini subscription for the last few months, I think the problem is not Google Search but rather the web ecosystem: if Google gives me most of the answers without my having to click any link, you and I might be happy, but the billions of people living off those links, directly or indirectly, won't be.
Because “search” in 2024 is much more than a search engine. It takes billions of dollars of investment per year to build something competitive to Google head-on. OpenAI may have a shot with their chat.com
The only viable option when starting small is to grow in the cracks of Google with features that users really love that Google won’t provide.
Isn’t it obvious? Look around. Hackers have been stamped out even on hacker news. Now it’s all FAANG lifers and MBA or VC types trying their hand at grifting. Nothing good comes of that. Whoever is making the next thing “for real” just gets acquired and shut down.
Moreover, get out of your echo chamber and you’ll see that for a majority of humanity Google is the internet. You have to supplant the utility, not just the brand. Most businesses cannot handle the latter let alone the former. If you want something to replace Google, you have to think about replacing the internet itself. But not many are that bold.
A single CommonCrawl dump might be 100TB, but that represents less than 1% of the Internet. CommonCrawl crawls new parts of the Internet every month or quarter, and there is little overlap between crawl dumps.
Because the hard part isn't the compute, vector dbs, or whatever. It's the huge evergreen index of the whole internet. Getting over the hump of "every site lets your crawler work, gives good results to it, and bypass paywalls" is a massive barrier.
There is probably room for one or five lifestyle businesses, but convincing venture capital to drop the megabux to go big would be a feat, and it would eventually land in some sub-optimal state anyway.
Finding some hack to democratize and decentralize the indexing and the expensive processes like JavaScript interpretation, image interpretation, OCR, etc. is an open angle, and even an avenue for "Web3" to offload the cost. But you will ultimately want the core search and index on a tighter-knit cluster (many computers physically close to one another for speed-of-light reasons, although you can have N of these clusters) for performance, so it's a hard nut to crack to make something equitable for both the developers and any prospectors, and safe from various takeovers and enshittification. Let us know if you know a way.
I would want to use a search engine that does not perform JavaScript interpretation, image interpretation, OCR, etc. (This is not the same as excluding web pages with JavaScripts from the search results. They would still be included but only indexed by whatever text is available without JavaScripts; if there isn't any such text, then they should be excluded. This would also apply if it is only pictures, video, etc and no text, then they also cannot be indexed, whether or not they have JavaScripts.)
Luckily none of these things are mutually exclusive.
Thanks to all the SPA idiocy you will miss enough content to matter if you have zero JS interpretation, so you would want to let the user choose which indexes they want for a query because sometimes you need these other resource types to answer the query.