There are open crawl datasets available. My writing is rough because I am on mobile (which is also why I am not providing links), but one of them is called, I think, the Open Crawl Index.
However, like Googlebot, the browser itself will of course be the actual crawler. Page requests are cached at the databank level, then at the user's partition.
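Very roughly, the lookup I have in mind is something like this (TypeScript sketch; Cache, databank, userPartition and fetchPage are all invented names, not a real API):

```typescript
// Sketch only: check the shared "databank" cache, then the user's own
// partition, and only crawl the live web on a miss. All names are invented.
interface PageRecord {
  url: string;
  html: string;
  fetchedAt: number;
}

class Cache {
  private store = new Map<string, PageRecord>();
  get(url: string): PageRecord | undefined { return this.store.get(url); }
  put(record: PageRecord): void { this.store.set(record.url, record); }
}

const databank = new Cache();      // shared, network-wide cache
const userPartition = new Cache(); // the user's own slice

async function fetchPage(url: string): Promise<PageRecord> {
  const shared = databank.get(url);
  if (shared) {
    userPartition.put(shared); // mirror the hit into the user's partition
    return shared;
  }
  const local = userPartition.get(url);
  if (local) return local;

  // Miss in both tiers: the browser acts as the crawler and fetches live.
  const res = await fetch(url);
  const record: PageRecord = { url, html: await res.text(), fetchedAt: Date.now() };
  databank.put(record);
  userPartition.put(record);
  return record;
}
```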
The problem is that the web is effectively an RSS feed, but the sites with valuable info are blocking every crawler except Google's. That creates an informational asymmetry.
And since almost all search engines try to emulate PageRank, we don't get diversified results, yet all of our search data is still aggregated.
The browser won't "block" ads because it won't ever return websites. As I imagine v1, it will literally only return HTML snippets, which can be iterated over rapidly.
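To make that concrete, here is a minimal sketch of what iterating over snippets could look like; Snippet, localCorpus and search are invented names, and the matching is deliberately crude:

```typescript
// Sketch only: no page is ever rendered, the user's own code just walks
// over cached HTML fragments. Names and data are invented.
interface Snippet {
  sourceUrl: string;
  html: string; // a fragment, not a full page, so there is nothing to show ads in
}

// Stand-in for the locally cached corpus.
const localCorpus: Snippet[] = [
  { sourceUrl: "https://example.com/a", html: "<p>cheap storage benchmarks</p>" },
  { sourceUrl: "https://example.com/b", html: "<p>pagerank alternatives survey</p>" },
];

function* search(term: string): Generator<Snippet> {
  const needle = term.toLowerCase();
  for (const s of localCorpus) {
    if (s.html.toLowerCase().includes(needle)) yield s; // crude match, a real v1 would rank
  }
}

// Iterate over results rapidly, entirely client side, with no page loads.
for (const hit of search("pagerank")) {
  console.log(hit.sourceUrl);
}
```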
Tracking won't matter. I haven't worked out exactly how to do it, but I think you will own a piece of a corpus (essentially there is one corpus, but you have part of it mirrored to your own silo). You can make requests to the corpus to fetch data, or go out onto the internet and get raw data. It is returned to your cache (and the global one), and then your processing is done locally.
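The "processing is done locally" part could be as simple as building an index over your silo on your own machine, something like this (all names invented, not a real design):

```typescript
// Sketch only: build a tiny inverted index over the pages mirrored in your
// silo, so queries never leave the machine and there is nothing to track.
type CachedPage = { url: string; html: string };

function buildIndex(silo: CachedPage[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const page of silo) {
    const words = page.html.toLowerCase().replace(/<[^>]+>/g, " ").split(/\W+/);
    for (const w of words) {
      if (!w) continue;
      if (!index.has(w)) index.set(w, new Set());
      index.get(w)!.add(page.url);
    }
  }
  return index;
}

// The query runs against the local index only.
const silo: CachedPage[] = [
  { url: "https://example.com/a", html: "<p>search engine diversity</p>" },
  { url: "https://example.com/b", html: "<p>local-first indexing</p>" },
];
console.log(buildIndex(silo).get("indexing")); // Set { 'https://example.com/b' }
```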
* browser is the feed reader, the network, and the platform
* users sell bots and crawlers to other users
* users sell sorted data sets to other users
* users sell algorithms to other users
* browser is the market maker (rough listing sketch below)
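The marketplace side is just as hand-wavy right now, but the shape of a listing might be something like this (ListingKind, Listing and matchListings are all invented, not a real protocol):

```typescript
// Sketch only: the browser as market maker just matches buyers to listings
// that other users have put up. Types and data are invented.
type ListingKind = "crawler" | "dataset" | "algorithm";

interface Listing {
  kind: ListingKind;
  seller: string; // another user
  title: string;
  priceCents: number;
}

function matchListings(listings: Listing[], wanted: ListingKind): Listing[] {
  return listings
    .filter(l => l.kind === wanted)
    .sort((a, b) => a.priceCents - b.priceCents); // cheapest first
}

const board: Listing[] = [
  { kind: "dataset", seller: "alice", title: "sorted news corpus", priceCents: 500 },
  { kind: "crawler", seller: "bob", title: "forum crawler bot", priceCents: 300 },
];
console.log(matchListings(board, "dataset"));
```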
Storage is so cheap and processing power is so good that a 20 GB cache of data can sit locally. You can fetch newer data or swap it out for other stuff, and you can also store post-processed analytical data in the cloud.
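For the storage side, I picture something like a size-capped local cache you can swap things in and out of, plus a dumb upload of processed results. The 20 GB number is the only real part; everything else here (LocalCache, uploadAnalysis, the URL) is invented:

```typescript
// Sketch only: a size-capped local cache with least-recently-used swapping,
// and a stub for pushing post-processed analysis to some cloud endpoint.
const MAX_CACHE_BYTES = 20 * 1024 ** 3; // roughly the 20 GB mentioned above

type Entry = { key: string; bytes: number; data: Uint8Array; lastUsed: number };

class LocalCache {
  private entries = new Map<string, Entry>();
  private used = 0;

  put(key: string, data: Uint8Array): void {
    // Swap out the least recently used entries until the new one fits.
    while (this.used + data.byteLength > MAX_CACHE_BYTES && this.entries.size > 0) {
      const oldest = [...this.entries.values()].sort((a, b) => a.lastUsed - b.lastUsed)[0];
      this.entries.delete(oldest.key);
      this.used -= oldest.bytes;
    }
    this.entries.set(key, { key, bytes: data.byteLength, data, lastUsed: Date.now() });
    this.used += data.byteLength;
  }

  get(key: string): Uint8Array | undefined {
    const e = this.entries.get(key);
    if (e) e.lastUsed = Date.now();
    return e?.data;
  }
}

// Post-processed analytical data could go to any cloud endpoint; this URL
// is a placeholder, not a real service.
async function uploadAnalysis(summary: object): Promise<void> {
  await fetch("https://cloud.example/analysis", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(summary),
  });
}
```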