Not sure where you get 8 miles vs 11 miles (maybe the definition of a rural food desert?).
> Low access is characterized by at least 500 people and/or 33 percent of the tract population residing more than 1 mile from a supermarket or large grocery in urban areas, and more than 10 miles in rural areas
One possible saving grace for Z is that, due to how expensive it is to keep around, video will probably disappear much more readily than text and photos.
If the current iteration of search engines is producing garbage results (due to an influx of garbage plus SEO gaming their ranking systems), and LLMs are producing inaccurate results without any clear method proposed to correct them, why would combining the two systems not also produce garbage?
The problem I see with search is that the input is deeply hostile to what the consumers of search want. If the LLMs are specifically tuned to try to filter out that hostility, maybe I can see this going somewhere, but I suspect that just starts another arms race that the garbage producers are likely to win.
Search engines tend to produce neutral garbage, not harmful garbage (i.e. small tidbits of data between an ocean of SEO fluff, rather than completely incorrect facts). LLMs tend to be inaccurate because, in the absence of knowledge given by the user, they will sometimes make knowledge up. It's plausible to imagine that they will cover each other's weaknesses: the search engine produces an ocean of mostly-useless data, and the LLM can find the small amount of useful data and interpret it into an answer to your question.
The problem I see with this "cover for each other" theory is that, as it stands, having a good search engine is a prerequisite to getting good outputs from RAG. If your search engine doesn't turn up something useful in the top 10 (which most search engines currently don't for many types of queries), then your LLM will just be summarizing the garbage that was turned up.
Currently I do find that Perplexity works substantially better than Google for finding what I need, but it remains to be seen whether they're able to stay useful as a larger and larger portion of online content is just AI-generated garbage.
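To make that dependency concrete, here's a minimal sketch of the RAG flow in question (web_search and llm_complete are made-up stand-ins, not any particular product's API): the model only ever sees whatever the retrieval step ranked into the top k, so garbage retrieval means garbage generation.

    # Minimal RAG sketch. web_search / llm_complete are stand-ins for a
    # real search backend and a real model call.
    def web_search(query: str, limit: int = 10) -> list[str]:
        ...  # returns the top-`limit` snippets, however good or bad they are

    def llm_complete(prompt: str) -> str:
        ...  # generates an answer from the prompt

    def answer(question: str, k: int = 10) -> str:
        snippets = web_search(question, limit=k)          # retrieval step
        context = "\n\n".join(snippets)                   # whatever ranked top-k
        prompt = ("Answer the question using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        return llm_complete(prompt)                       # generation step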
> Search engines tend to produce neutral garbage, not harmful garbage (i.e. small tidbits of data between an ocean of SEO fluff, rather than completely incorrect facts)
Wasn't Google's AI surfacing results about making pizza with glue and eating rocks? How is that not harmful garbage?
then you are blissfully unaware of how much data is already being interpreted for you by computer algorithms, and how much you probably actually really like it.
This comes off as condescending. As things have gotten more algorithmic over the last two decades, I've noticed a matching decrease in the accuracy and relevance of the information I seek from the systems I interact with that employ these algorithms.
Yes, you're right that there are processing algorithms behind the scenes interpreting the data for us. But you're wrong: I fucking hate it, it's made things worse, and layering more on top will not make things any better.
I don't think anyone can disagree. If you ask someone to give you an interpretation of the works of, say, Allen Ginsberg, or of the theory of relativity, and they come back with a pile of documents ordered in some fashion, you won't be satisfied because that's not what you asked for.
99.99% of all data is complete garbage and impossible for a human to sift through. Most spam email doesn't even end up in your spam inbox. It gets stopped long before that.
Garbage-ness of search results is not binary; the right question is: can LLMs improve the quality of search results? But sure, it won't end the cat-and-mouse game.
I think that's the right broad question, though the properties of LLMs mean that in some number of cases they will either make the results worse or present wrong answers more confidently. That prompts the question of what we mean by "quality" of results, since the way current LLM interfaces present results is quite different from traditional search.
The question is what the business model is and who pays for it; that determines how much advertising you're getting. It is not clear whether OpenAI could compete in ad-supported search. So maybe OpenAI is trying to do the basic research, outcompete the Bing research group at Microsoft, and then serve as an engine for Bing. Alternatively, they could just be improving the ability of LLMs to do search, targeting future uses in agentic applications.
There is no way to SEO the entire corpus of human knowledge. ChatGPT is very good for gleaning facts that are hard to surface in today's garbage search engines.
If I can pretty quickly tell a site is SEO spam, so should the LLM, no? Of course that would just start a new round in the SEO arms race, but could work for a while.
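A rough sketch of what that pre-filter might look like (llm_complete is a made-up stand-in, not any vendor's API); it's also exactly the kind of check SEO spam would start being optimized against:

    # Naive pre-filter: ask the model whether a page reads like SEO filler
    # before letting it into the result set.
    def llm_complete(prompt: str) -> str:
        ...  # stand-in for a real LLM call

    def looks_like_seo_spam(page_text: str) -> bool:
        prompt = ("Does the following page read like SEO filler rather than "
                  "substantive content? Answer YES or NO.\n\n" + page_text[:4000])
        return llm_complete(prompt).strip().upper().startswith("YES")

    def filter_results(pages: list[str]) -> list[str]:
        return [p for p in pages if not looks_like_seo_spam(p)]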
I'd be more cynical still and ask: where is correct information found in the first place? Humans of all shapes and sizes have biases. Most research is faulty, fabricated, or not reproducible. Missing information tells a greater story than existing information does.
We don’t have a way of finding objective information, why would we be able to train a model to do so?
Right now I basically can't find anything; the bar isn't "objective information" but "somewhat useful information". Google's search quality has become so bad that we're past the objective-vs-subjective debate already; I'd be happy enough to get non-spam results.
There was a bunch of reporting on how AI companies and researchers were using tools that ignored robots.txt. It's a "polite request" that these companies had a strong incentive to ignore, so they did. That incentive is still there, so it is likely that some of them will continue to do so.
CommonCrawl[0] and the companies training models I'm aware of[1][2][3] all respect robots.txt for their crawling.
If we're thinking of the same reporting, it was based on a claim by TollBit (a content licensing startup), which was in turn based on the fact that "Perplexity had a feature where a user could prompt a specific URL within the answer engine to summarize it". Actions performed by tools acting as a user agent (like archive.today, a webpage-to-PDF site, or a translation site) aren't crawlers and aren't what robots.txt is designed for, but either way the feature is disabled now.
These policies are much clearer than they were when I last looked, which is good. On the other hand, Perplexity appeared to ignore robots.txt as part of a search-enhanced retrieval scheme, at least as recently as June of this year. The article title is pretty unkind, but the test they used pretty clearly shows what was going on.
> The article title is pretty unkind, but the test they used pretty clearly shows what was going on.
I believe this article rests on the same misunderstanding - it doesn't appear to show any evidence of their crawler, or of web scraping used for training, accessing pages prohibited by robots.txt.
The EU's AI act points to the DSM directive's text and data mining exemption, allowing for commercial data mining so long as machine-readable opt-outs are respected - robots.txt is typically taken as the established standard for this.
In the US it is a suggestion (so long as Fair Use holds up) but all I've seen suggests that the major players are respecting it, and minor players tend to just use CommonCrawl which also does. Definitely possible that some slip through the cracks, but I don't think it's as useless as is being suggested.
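For what it's worth, the mechanics of honoring robots.txt are trivial; here's a minimal check using Python's standard library (the domain and bot name are just placeholders). The hard part is whether a crawler bothers to run it.

    # Ask robots.txt whether a given user agent may fetch a given URL.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetches and parses the file

    print(rp.can_fetch("SomeBot", "https://example.com/private/page"))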
Funny. If I can browse to it, it is public right? That is how some people's logic goes. And how OpenAI argued 2 years ago when GPT3.5/ChatGPT first started getting traction.
> Technically, robot.txt isn't enforcing anything, so it is just trust.
There's legal backing to it in the EU, as mentioned. With CommonCrawl you can just download it yourself to check. In other cases it wouldn't necessarily be as immediately obvious, but through monitoring IPs/behavior in access logs (or even prompting the LLM to see what information it has) it would be possible to catch them out if they were lying - like Perplexity were "caught out" in the mentioned case.
> Funny. If I can browse to it, it is public right? That is how some people's logic goes. And how OpenAI argued 2 years ago when GPT3.5/ChatGPT first started getting traction.
If you mean public as in the opposite of private, I think that's pretty much true by definition. Information's no longer private when you're putting it on the public Internet.
If you mean public as in public domain, I don't think that has been argued to be the case. The argument is that it's fair use (that is, the content is still under copyright, but fitting statistical models is substantially transformative/etc.)
Anecdata: with a few exceptions, the VR games I tried were impressive as an experience, but not really all that fun once the novelty passed. The limitations of the format clash with the kinds of games being made, so I often felt like the games were limited, or toy-like. I think the argument made by the article does hit on something about why VR isn't really getting accepted: the games are wrong, but they might be wrong because of the mismatch between the format's limitations and the expectations of developers and audience.
The comfort issue is real too. Even with the fairly svelte PSVR2, it's annoying to wear those things.
This is, IMO, the better way to approach this problem. Minification applies rules to transform code; if we know the rules, we can reverse the process (though we can't directly recover information that was lost).
A nice, constrained way to use an LLM here to enhance this solution is to ask it some variation of "what should this function be named?" and feed the output to a rename refactoring function.
You could do the same for variables, or be more holistic and ask it to rename variables and add comments (but risk the LLM changing what the code does).
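A sketch of that constrained loop, assuming a hypothetical suggest_name LLM call and an existing scope-aware rename_symbol refactoring (neither is a real library here): the model only ever picks a name, and the mechanical rename is done by tooling that can't change behavior.

    import re

    IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

    def suggest_name(prompt: str) -> str:
        ...  # stand-in for an LLM call

    def rename_symbol(source: str, old: str, new: str) -> str:
        ...  # stand-in for a proper scope-aware rename refactoring

    def rename_with_llm(source: str, old_name: str) -> str:
        candidate = suggest_name(
            f"Suggest a descriptive name for this function:\n\n{source}\n\nName only:"
        ).strip()
        if not IDENTIFIER.match(candidate):
            return source  # reject anything that isn't a plain identifier
        return rename_symbol(source, old_name, candidate)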
Another way I've encountered this is performance vs results. Performance is the things you do that you believe will lead to results. Results aren't always in your control (especially in competitive environments), but performance absolutely is. It's a lot easier to feel you are getting somewhere when you focus on things that you control.
This is something I often have to instill in new software developers and occasionally have to remind myself of. We default to seeing the thing we want to build and the plan we imagine for getting there. As we proceed, every bump and deviation from the plan feels like a setback. Every wrong turn and rewritten piece of code feels like a waste, a mistake. But in fact, it's all part of the process. Trying that avenue that turned out not to be what you wanted was a necessary part of learning what you did want. As Fred Brooks put it in The Mythical Man-Month, "plan to throw one away; you will, anyhow."
I make it clear that a plan is never absolute, because we can't possibly know what the future will hold. Instead, a plan is simply a direction to start heading. As we try and learn, we update our understanding and we update our plan, heading in a different direction that hopefully brings us closer to our goals. If we treat a deviation from the plan as a failure to plan, we punish ourselves for a lack of omniscience - something we can hardly expect to live up to.
That same mindset helps a lot in understanding daily life too. When we see people make mistakes driving, or large construction projects going over budget, or social policies causing unanticipated problems, we are quick to blame people for not knowing better, but how can we expect them to know with certainty what will happen as the result of every decision they make? We simply do our best with the resources we have available and as events unfold we continue to do our best to steer ourselves to our desired outcomes. People shouldn't be punished for the outcome if they made a good choice given the resources they had. Hindsight is 20/20 and all that.
Yes, but if you don't have the LLM at the end, a good search (against a good corpus with the needed info) would still have given the user what they wanted, which in this case is a human-vetted piece of relevant information. The LLM would really only be useful here for dressing up the result, and that would actually reduce trust in the result overall. Alternatively, an LLM could play a role as part of the natural-language pipeline that drives the search, hidden from the user, and I feel that's a much more interesting use of them.
The farther you go with RAGs, in my experience, the more they become an exercise in designing a good search engine, because garbage search results from the RAG stage always lead to garbage output from the LLM.
> The farther you go with RAGs, in my experience, the more they become an exercise in designing a good search engine
From what I've seen from internal corporate RAG efforts, that often seems to be the whole point of the exercise:
Everyone has always wanted to break up knowledge silos and create a large, properly semantically searchable knowledge base with all the intelligence a corporation has.
Management doesn't understand what benefits that brings and doesn't want to break up tribal office politics, but they're encouraged by investors and golf buddies to spend money on hype.
So you tell management "hey, we need to spend a little bit of time on a semantic knowledge base for RAG AI, and btw this needs access to all silos to work", and make the actual LLM an afterthought that the intern gets to play with.
Seriously. This is why I don't like how much knowledge is being locked up in video platforms. I can get what I need to know from text so much faster than video unless the knowledge in question is inherently 3d/visual.
Video has the advantage that it still works when I'm tired or exhausted, or rather when I want to avoid getting tired, because reading is active and takes more concentration. Video makes it easier to focus because it is more passive and engages several senses at once.
If I'm in a deep flow, reading to solve a problem is the right approach. But when I have to fight distraction, a video helps me focus and eventually get into the flow.
> Low access is characterized by at least 500 people and/or 33 percent of the tract population residing more than 1 mile from a supermarket or large grocery in urban areas, and more than 10 miles in rural areas
(source: https://www.ers.usda.gov/webdocs/publications/45014/30940_er... )
Interestingly enough, this is measured by Euclidean (straight-line) distance, not by the actual number of miles required to travel.
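For illustration, straight-line distance between two coordinates is something like the sketch below (the coordinates here are arbitrary examples, not from the USDA data); the road distance can only be longer.

    # Straight-line (great-circle) distance in miles between two lat/lon points.
    from math import radians, sin, cos, asin, sqrt

    def straight_line_miles(lat1, lon1, lat2, lon2):
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 3959 * asin(sqrt(a))  # 3959 ~= Earth's radius in miles

    print(straight_line_miles(41.0, -100.0, 41.0, -99.8))  # ~10.4 miles as the crow flies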