I have learned to seriously question my instincts on when something is too late, as there are many niches to fill and this is likely a building block for broader functionality.
That being said, for all the talk about how bad Google has become, I still prefer it to an unbroken Bing.
> Anyone can compete as long as they have a sufficiently robust crawl dataset as a foundation, no?
There are some sticking-power/network-effect/sticky-default effects too, though.
It's _trivial_ to do a Google search from anywhere on an Android device with at most a tap or two. You can probably get close if a third party has a well-integrated native app, but that'll require work on the user's part to make it the default (where possible).
Same goes for the default search engine for browsers/operating systems ... etc.
I will absolutely be firing off queries to Google and GPTSearch in parallel and doing a quick comparison between the two. I am especially curious to see how well queries like "I need the PCI-e 4 10-gig SFP+ card that is best supported / most popular with the /r/homelab community" go. Google struggles to do anything other than link to forums where people are already asking similar questions.
Anyone can compete as long as they have a functional URL and web page. Doesn’t make them good competition, and doesn’t mean users will use it.
The issue is that “AI search” has been a hot topic for a while now. Google (the default everywhere) just rolled out their version to billions of users. Perplexity has been iterating and acquiring customers for a while. Obviously OpenAI has great potential and brand recognition, but are there enough people left who haven't switched yet and are still interested in doing so?
A fossilized snapshot will only get them so far, and sites are increasingly opting to block AI-related crawlers. Apparently about a quarter of the top 1000 sites already block GPTBot: https://originality.ai/ai-bot-blocking
I guess they could be using Bing as their search backend, which would mostly get around the blocking issue (except for searching Reddit which blocks Bingbot now).
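For anyone curious how a "share of sites blocking GPTBot" number could be measured, here is a minimal sketch using Python's standard `urllib.robotparser`. The site names and robots.txt bodies in `SAMPLE` are made up for illustration and are not the data behind the originality.ai figure; `blocks_agent` and `share_blocking` are hypothetical helper names, and a real survey would fetch each site's robots.txt rather than use inline strings.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, agent: str, url: str = "/") -> bool:
    """Return True if this robots.txt body disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def share_blocking(robots_by_site: dict, agent: str) -> float:
    """Fraction of sites whose robots.txt blocks `agent` at the site root."""
    if not robots_by_site:
        return 0.0
    blocked = sum(blocks_agent(text, agent) for text in robots_by_site.values())
    return blocked / len(robots_by_site)


# Hypothetical sample: four invented sites, one of which blocks GPTBot outright.
SAMPLE = {
    "a.example": "User-agent: GPTBot\nDisallow: /\n",
    "b.example": "User-agent: *\nDisallow: /admin\n",
    "c.example": "",  # no rules at all
    "d.example": "User-agent: *\nAllow: /\n",
}

gptbot_share = share_blocking(SAMPLE, "GPTBot")
print(f"{gptbot_share:.0%} of sample sites block GPTBot")
```

This only checks the site root, so a site that blocks GPTBot from some paths but not others would be counted as not blocking.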
Certainly, countermeasures against crawler blocking will be a necessary component of effective search corpus aggregation going forward. Otherwise, search will balkanize around who will pay the most for access to public content. Common Crawl is ~10PB, so this is not insurmountable.
Edit: I understand there is a free-rider/economic issue here. I'm unsure how to solve that as the balance between search engine/gen AI systems and content stores/providers becomes more adversarial.
I wonder to what degree -- for example, do they respect the Crawl-delay directive? HN itself has a 30-second crawl-delay (https://news.ycombinator.com/robots.txt), meaning that crawlers are supposed to wait 30 seconds before requesting the next page. I doubt ChatGPT will delay a user's search of HN by up to 30 seconds, even though that's what robots.txt instructs them to do.
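To make the Crawl-delay point concrete, here is a rough sketch of how a well-behaved crawler might read that directive with Python's standard `urllib.robotparser`. The robots.txt body in `SAMPLE_ROBOTS_TXT` is an assumed example with a 30-second delay, not a copy of HN's actual file, and `ExampleBot` and `crawl_delay_for` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file; only the 30-second delay mirrors the figure quoted above.
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Disallow: /private
Crawl-delay: 30
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to `user_agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = crawl_delay_for(SAMPLE_ROBOTS_TXT, "ExampleBot")
print(f"Wait {delay} seconds between requests")
```

A polite crawler would sleep for `delay` seconds between page fetches, which is exactly the latency an interactive search feature would not want to pass on to the user.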
Would ChatGPT even have to respect robots.txt when it is interacting live with a user? I would think robots.txt only applies to automatic crawling. When directed by a user, one could argue that ChatGPT is basically the user agent the user is using to view the web. If you wanted to write a browser extension that shows the reading time for all search results on Google, would you respect robots.txt when prefetching all pages from the results? I probably wouldn’t, because that’s not really automated crawling to me.
They do respect robots.txt (supposedly), but they also introduced a new user agent that nobody would yet have in their robots.txt as part of this feature[1], and looking at my server logs it's already crawled a bunch of sites.
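A small sketch of why a new user agent slips past existing rules, again with Python's `urllib.robotparser`: a robots.txt that names only GPTBot says nothing about any other token, so a different agent is allowed by default. `NewSearchBot` is a placeholder name for illustration, not the actual user agent referenced above, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# A site that blocked GPTBot by name and has no catch-all "User-agent: *" group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` under this robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot"))
print("NewSearchBot allowed:", is_allowed(ROBOTS_TXT, "NewSearchBot"))
```

Sites that want to exclude agents they have not heard of yet would need a catch-all `User-agent: *` group with explicit allowances for the crawlers they do want.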
The whole issue that site owners have with these AI search engines is that there isn't a financial incentive for them to cooperate, since the summarization largely replaces the need for users to click through to the site the information came from. No click-through, no ad impressions, no possibility of the user being converted into a recurring visitor or paid subscriber, just pure freeloading by the search engine.