I want flexible queries, not RAG (win-vector.com)
237 points by jmount 5 months ago | 153 comments



The retrieval part of RAG is usually vector search; but it doesn't have to be. Or at least not exclusively.

I've worked with various search backends for about 20 years. People treat vector search like magic pixie dust but the reality is that it's not that great unless you heavily tune your models to your use cases. A well-tuned, manually crafted query goes a long way.

In pretty much any system I've built over the last few years, the best way to think about search is as building a search context that includes anything relevant to answering the user's question. The user's direct input is only a small part of that. In the case of mobile systems, the user-entered query is typically a very minor part of it. People type two or three letters and then expect magic to happen. Vector search is completely useless in situations like that. Why does search on mobile work anyway? Because of everything else we know to build a query (user location, time zone, locale, past searches, preferences, etc.)

RAG isn't any different. It's just search where the search results are post-processed by an LLM along with whatever the user typed. The better the query and retrieval, the better the result. The LLM can't rescue a poorly tuned search. But it can dig through a massive set of search results and extract key points.


The valuable part _is_ the vector search, at least from the perspective of the OP. It sounds like what they're after is to just pass back the source texts from the vector search, rather than passing them back through an LLM to be summarised.

I generally agree with the OP: in a search context I often don't want that last LLM summarisation step. I want embeddings generated from my search query for document lookup, and I want to see the original sources. Phind did (does?) something like this with citations, which is at least an improvement.


It's common practice to have the LLM spit out the metadata as well, e.g. a source link. We do that for almost all of our use cases; it's the best way to increase trust in the response.

The LLM step does provide a benefit, as it can synthesize a tight and concise answer to the user's question, instead of returning a list of source documents to sift through manually.


The real beauty is not visiting people’s horrible web pages.

Oh how I want information. Oh how I don’t want your newsletter modal to obscure that information.


I forgot this point, but you are right. I use ChatGPT for all programming queries, as I don't want Google routing me to pop-up infested splogs.


Upnext spent about two years developing a "reader" that could deenshittify most pages in the context of a "read-it-later" application. They pivoted to start Brief which does AI recommendation and summarization and shut down Upnext. I think they are still using the back end of the reader to get a clean extract of text to feed into the AI.

But it's very telling that users are supportive of Google's agenda of not sending you to people's actual web sites because web sites have gotten so bad. Europeans will deny it but I think those cookie banners broke the dam so that now it is normalized that you can pop up four different modals asking for an email address.


Not Europe’s fault there, just determined adversarial web design.


Then just use a regular search engine? RAG=search+LLM, if you don't want the LLM part, use Google, or whatever your favorite one is. Vector search by itself is not a novel concept, all major search engines already rely on it anyway.


I think the OP raises a genuine criticism of search: it's brittle. I personally hate it. I know how to use Google very well; it doesn't mean I like using it. Defenders say: but I want to be able to super fine-tune my search with 50 flags and various combinations of quotes. That's just horrible. It should understand what you want from what you type. Sometimes I don't even know the word I am trying to search for; LLMs do really well at that.

Most of my search now happens in an LLM. Usually I want information, not a webpage. I do understand the writer's gripes with search, and it should have been fixed a long time ago.


Right. The OP title is: "I want flexible queries, not RAG". I didn't write the article, btw!


> People treat vector search like magic pixie dust but the reality is that it's not that great unless you heavily tune your models to your use cases

I believe that what most people miss is that embeddings are the product of the model.

Embeddings have been getting so much better with the new models and so embedding-based search improved too.

Embeddings are coordinates in a world representation that "organizes" information so that location in that world matters and serves as a way to differentiate meaning.

In other words, word2vec was a simple and poor world representation. OpenAI embeddings are coordinates in an astounding world representation.


Also, I think it's about time. We don't have the time to super fine-tune the document retrieval, and that magic pixie dust gets us 90-99% there. That extra 1% could take time management doesn't want to give.


It also very much depends on the quality of your documents. E.g. if you have a set of very good recipes, rather than a random set coming from the Internet, then you can get very good answers with RAG.


>But it can dig through a massive set of search results and extract key points.

And there is value in that.


Improving the signal-to-noise ratio improves the results, though. This is a classic search relevance problem.


This has been my thinking as well: the natural language interface is amazing and something we've been wanting for some time.

The generation is a showy gimmick.

So why aren't we separating the useful bit out? My sneaking suspicion is that we can't. It's a package deal in that there are no two parts, it's just one big soup of free-associating some text with other text, the stochastic parrot.

LLMs do not understand. They generate associated text that looks like an answer to the question. In order to separate the two parts, we'd need LLMs that understand.

That, apparently, is a lot harder.


I've been working on this as a way of better exposing organisational knowledge recently. A few things I've observed:

* The prompt engineering side is a black art.

* Generation is pretty amazing but less useful than it first seems.

* The open source and proprietary tooling around managing and interrogating datasets for NLP stuff is way ahead of where it was the last time I looked, maybe a decade ago.

* It looks to me that, with appropriate thinking about indexing, there's a lot of potential here. For example, if I have a question-answer set, I can index the questions and which answers map to each. Get the LLM to identify the type of question I've just asked, and then use my corresponding corpus of answers to provide something useful based on the appropriate parts of the answer corpus.

That last bit is not the automatic panacea that the flashy and somewhat gimmicky emergent properties of the generative side supply, but it seems like it can get good traction on some quite difficult problems quickly, and actually quite well, using not too many local computing resources.
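
As a minimal sketch of that last idea (the corpus, labels, and the call_llm() helper are hypothetical stand-ins for whatever model client and data you actually use):

  # Classify the incoming question into a known type, then answer only from the
  # curated answer corpus for that type.
  ANSWER_CORPUS = {
      "leave_policy": ["Annual leave is booked through the HR portal."],
      "expenses": ["Claims under 50 EUR do not need a receipt."],
  }

  def call_llm(prompt: str) -> str:
      raise NotImplementedError  # hypothetical: plug in your LLM client here

  def answer(question: str) -> str:
      label = call_llm(
          "Classify this question as one of: "
          + ", ".join(ANSWER_CORPUS) + "\n" + question
      ).strip()
      context = "\n".join(ANSWER_CORPUS.get(label, []))
      return call_llm(f"Context:\n{context}\n\nQuestion: {question}")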


Generation isn't less useful, it is less autonomous than it seems. You need to impose a process on generations, and generate inside a framework. Think about humans - the best authors and artists have a methodical process for producing and refining their work, and if you forced them to just generate stuff with no process in 1 shot they would probably produce sub-standard output. No surprise machines that aren't at our level fare no better given the same conditions.


I wouldn't say we can't, but it would be much harder. The models were trained and optimized for next word prediction. It is possible to chop the output layers off and replace them with something else. This is often how more open models like BERT are adapted to tasks such as classification and sentiment analysis. But pulling semantically meaningful information out of internal states of the model is tricky because there's not necessarily anything about the model architecture and training methods that forces it to develop internal representations that are particularly interpretable or have a straightforward application to some other task.
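
For instance, the standard Hugging Face pattern for swapping the head looks roughly like this (a sketch; BERT and a 2-class task are just example choices):

  # Keep the pretrained encoder, replace the language-modelling head with a
  # freshly initialised classification head, then fine-tune on labelled data.
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=2
  )
  inputs = tokenizer("This movie was great", return_tensors="pt")
  logits = model(**inputs).logits  # shape (1, 2); meaningless until fine-tuned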

They may not be all that stable, either. I would not just assume that knowing how to interpret the attention heads in the base model of GPT-4 tells you anything about what the corresponding attention heads are doing in GPT-4t or GPT-4o.


To understand, we need to take a step back, and completely deconstruct the familiar narrative.

The moment we say "an AI", we anthropomorphize the model. The narrative gets even more derailed when we say "Large Language Model". No LLM actually contains a grammar or defined words. An LLM can't think logically or objectively about subjects the way a person can, either.

I propose that we instead call them "Large Text Models". An LTM is a model made by training a neural net on text. Sure, the text itself was written with language, but no part of the training process, or its result, deals with language as such.

The really cool trick that an LTM does is to construct continuations whose content just happens to be indistinguishable from language. This is accomplished because the intentions of the original human writer (that were encoded into the original text dataset) did consistently follow language rules. The problem is that the original writer did not encode the truthiness or falsiness of future LTM continuations. Truth and lie are written the same, and that ambiguity lives in the LTM forever.


Most presentations of generative AI come across like amnesiac, superficial polymaths. Perhaps beyond negative and positive prompts, they need some memory across queries and across users, a feedback path (hard problem), an ability to discern reputability (very hard problem), and an ability to search and train on new domain-specific information on-the-fly.


Hard, and also might not be particularly commercially viable. Once you give a model an individual memory, you lose the standardization and consistency of behavior that is important for scaling up commercial applications.

I would not want, for example, a version of Copilot that slowly gets better at helping me with the task I'm working on and then suddenly and unpredictably reverts back to zero just because the k8s pod running the instance I had been interacting with got recycled. Consistently mediocre behavior would be preferable to the pair-programming equivalent of speed dating.


My point is that to piece together an almost AGI agent or at least more useful generative AI requires these changes. A killer app isn't a singular magic thing but many small advancements. OpenAI still isn't even close, but it's demoing bits and pieces that are closer.


Natural language search and answer generation _are_ completely separate.

Search is often (but not exclusively) performed using cosine similarity over semantic vectors. Such vectors are produced by embedding models, which represent the meaning of a document as a fixed-length vector called an embedding; 768 is a common vector size.

You calculate the embedding for all documents in your database ahead of time (during insertion), and you calculate the embedding for the user's query, then search for documents closest to the query using a similarity metric of choice such as cosine.
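
A stripped-down sketch of that flow, assuming the sentence-transformers library and a brute-force in-memory index (real systems use an ANN index instead):

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
  docs = ["baked rice with egg", "grilled swordfish", "fried rice balls"]
  doc_vecs = model.encode(docs, normalize_embeddings=True)   # done at insertion time

  query_vec = model.encode(["sicilian rice frittata"], normalize_embeddings=True)[0]
  scores = doc_vecs @ query_vec      # cosine similarity, since vectors are unit length
  print([docs[i] for i in np.argsort(-scores)[:2]])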

Nothing prevents you from serving documents found this way directly, instead of using them to generate answers. Part of the Google search pipeline involves something like this. Many full-text search products also do this (Algolia is one example: https://www.algolia.com/blog/ai/what-is-vector-search/).

LLMs and generation are used to synthesize the answer and tie it back to the question using a layer of soft judgment based on the LLM's prior knowledge. This works out great in some contexts, less so in others, as you pointed out. But these components aren't coupled in any way.


> The generation is a showy gimmick.

It really isn't? You can tell it to output in a JSON structure (or some other format) of your choice and it will, with high reliability. You control the output.

Honestly, I wonder if the people who criticize LLMs have made a serious attempt to use them for anything.


I made a serious attempt to do precisely that, and yes, it output a valid JSON structure highly reliably. The problem was stopping it from just inventing values for parameters that weren't actually specified by the user.

Consider the possibility that at least some of the criticisms of LLMs are a result of serious attempts to use them.


llama.cpp has a way to constrain responses to a grammar, which is 100% reliable as it is implemented in the inference itself. You still need to tell the model to produce a certain format to get good results, though.


> You can tell it to output in a JSON structure (or some other format) of your choice and it will, with high reliability.

I mean, this is provably false. Have you tried to use LLMs to generate structured JSON output? Not only do all LLMs suck at reliably following a schema, you need to use all kinds of "forcing" to make sure the output is actually JSON anyway. By "forcing" I mean either (1) multi-shot prompting: "no, not like that," if the output isn't valid-ish JSON; or (2) literally stripping out—or rejecting—illegal tokens (which is what llama.cpp does[1][2]). And even with all of that, you still won't really have a production-ready pipeline in the general case.

[1] https://github.com/ggerganov/llama.cpp/issues/1300

[2] this is cutely called "constraining" a decoder; what it actually is, is correcting a very clear stochastic deficiency in LLMs


Beyond this, an LLM can easily become confused even if outputting JSON with a valid schema. For instance, we've had mixed results trying to get an LLM to report structured discrepancies between two multi-paragraph pieces of text, each of which might be using flowery language that "reminds" the LLM of marketing language in its training set. The LLM often gets as confused as a human would, if the human were quickly skimming the text and forgetting which text they're thinking about - or whether they're inventing details from memory that are in line with the tone of the language they're reading. These are very reasonable mistakes to make, and there are ways to mitigate the difficulties with multiple passes, but I wouldn't describe the outputs as highly reliable!


I would have agreed with you six months ago, but the latest models - Claude 3, GPT-4o, maybe Llama 3 as well - are much more proficient at outputting JSON correctly.


Seems logical that they will always implement specialized pathways for the most critical and demanding user base. At some point they might even do it all by hand and we wouldn’t know /s


This was my experience as well. The only reliable method I found was to use the LLM to generate the field values and then put them into JSON myself.


Yes, I'm using them quite extensively with my day to day work for extracting numerical data from unstructured documents. I've been manually verifying the JSON structure and numerical outputs and it's highly accurate for the corpus I'm processing.

FWIW I'm using GPT-4o, not Llama. I've tried Llama for local tasks and found it pretty lacking in comparison to GPT.


Your comment has an unnecessary and overly negative tone to it that doesn't do this tech justice. These approaches are totally valid and can get you great results. An LLM is just a component in a pipeline. I deployed many of these in production without a hiccup.

Guidance (the industry term for "constraining" the model output) is only there to ensure the output follows a particular grammar. If you need JSON to fit a particular schema or format, then you can always validate it. In case of validation failure you can always pass the JSON and the validation result back to the LLM for it to correct it.
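
A minimal sketch of that validate-and-correct loop (pydantic for the schema; call_llm() is a hypothetical stand-in for your model client):

  from pydantic import BaseModel, ValidationError

  class Invoice(BaseModel):
      vendor: str
      total: float

  def call_llm(prompt: str) -> str:
      raise NotImplementedError  # hypothetical

  def extract(document: str, retries: int = 3) -> Invoice:
      prompt = f"Return JSON with fields vendor (string) and total (number) for:\n{document}"
      for _ in range(retries):
          raw = call_llm(prompt)
          try:
              return Invoice.model_validate_json(raw)   # raises if the schema isn't met
          except ValidationError as err:
              prompt = f"Fix this JSON so it validates.\nJSON: {raw}\nErrors: {err}"
      raise ValueError("no valid JSON after retries")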


> Have you tried to use LLMs to generate structured JSON output? Not only do all LLMs suck at reliably following a schema, you need to use all kinds of "forcing" to make sure the output is actually JSON anyway.

Yeah it's worked about fifty thousand times for me without issues in the past few months for several NLP production pipelines.


I'm generating hundreds of JSONs a day with OpenAI and it has no problem following a schema defined in TypeScript.


I don't think it's entirely an architecture problem; there's a huge training set problem. Almost all the content you would train on assumes outside knowledge/facts, so the LLM learns to store these in the model in order to maximize its completion ability. However, a lot of this assumed knowledge and these generalizations are actually unhelpful/harmful for these sorts of cases.

If you wanted to create an LLM that significantly improved on this, you would probably need to massively clean up / reorganize your training data so that you always provide sufficient context and the LLM is disincentivized from baking in "facts". But I'm not sure this is tractable to do currently at the scale of data needed.


The LLM miracle comes from the massive amount of text we can use to train it on. Removing that advantage makes LLMs untenable. An idea I've had for a while is to do the opposite: generate nonsense text according to some complex formula, and have the AI learn to predict that. It won't possibly be able to encode any facts, because there are no facts. Now show it english, and it will treat it just like any other sort of nonsense text that it's gotten good at learning to interpret.


But that idea you describe is exactly what would make the LLM stop working. The "LLM miracle" comes from the fact that all that text is not random[0]. There is a lot of information encoded in what phrases, sentences, paragraphs have been written (and how often), vs. a vastly larger amount of nearly identical texts that were not written. The "complex formula" used is... reality, as perceived and understood by people. LLMs pick up on that.

--

[0] - Well, most of it anyway; I bet the training set contains some amount of purely random text, for example technical articles discussing RNGs and showcasing their output. Some amount of noise is unavoidable.


The idea would be to generate a false "reality" for the LLM to learn about. You would randomly generate a system of rules, use those rules to generate text, and then train the LLM to predict the text. The goal would be to get it to stop encoding reality proper in its weights, and focus on learning to pick up what reality looks like very quickly from text.


Bonus points for one of the most delightfully creative ideas I’ve heard in some time. I don’t think it will work (the space of "not reality" is superexponentially larger than the space of "this describes reality") but I’m just happy to be thinking about nonstandard ML again.

(I’ve dubbed this sort of thing “nonstandard ML" since, like you, I have a fondness for thinking of unorthodox solutions that seem plausible.)


It will just learn your formula and won’t generalize to anything else. It would essentially have to unlearn it when you started training on English so it would make training slower.


> So why aren't we separating the useful bit out?

It was trained as a chat-bot so that's all the current training can do. If you want to use it, you must hack something useful out of the chat bot interface.

It was trained as a chat bot because that was the impressive thing that got them investors. Useful applications need a lot more context to awe people, and context takes time and work to create.

So, now that they've got money, are those companies that created those LLM chat-bots making useful next-generation engines behind the scenes? Well, absolutely not! The same situation applies at every funding round, and they need to show impressive results right now to keep having money.

(And now I wonder... Why do VC investors exist again?)


I think that's a short term perspective.

> they need to show impressive results right now to keep having money

Sure. You do as little as possible to make as much money as possible. This is a fundamental of commerce/human existence. But, at some point, it will end with everyone's models performing similarly, with free models catching up. The concept of sustained "impressive results" will eventually require actual "reasoning systems". They'll use all that accumulated wealth, from what you maybe perceive as low hanging fruit, to tackle it. I think it must be assumed that these AI companies are intentionally working toward that, especially since it's the stated goal of many of them. I think you must assume that these people are smart, and they can see the reality of their own systems.


You can generate JSON or other machine readable formats to use as inputs to APIs allowing the LLM to directly operate whatever software or hardware you want. You can’t remove next token prediction without fundamentally changing the architecture and losing all of the benefits (unless you invent the next big thing of course). Each generated token has to go back in order to get the next one. Perhaps if you could simplify your API to a single float value you could just do a single step but I doubt this would work as well. Progress will continue to be made through further training and fine tuning until another significant discovery is made.


> There is a lot of excitement around retrieval augmented generation or “RAG.” Roughly the idea is: some of the deficiencies in current generative AI or large language models (LLMs) can be papered over by augmenting their hallucinations with links, references, and extracts from definitive source documents. I.e.: knocking the LLM back into the lane.

This seems like a misunderstanding of what RAG is. RAG is not used to try to anchor to reality a general LLM by somehow making it come up with sources and links. RAG is a technology to augment search engines with vector search and, yes, a natural language interface. This concerns, typically, "small" search engines indexing a specific corpus. It lets them retrieve documents or document fragments that do not contain the terms in the search query, but that are conceptually similar (according to the encoder used).

RAG isn't a cure for ChatGPT's hallucinations, at all. It's a tool to improve and go past inverted indexes.


This feels like a "No true Scotsman" answer. You say

> RAG is not used to try to anchor to reality a general LLM by somehow making it come up with sources and links.

but that definitely is one particular use of RAG, i.e. to limit some potential hallucinations by grounding it in data provided in the prompt.


RAG really is not the right tool for that. It does not even prevent hallucinations. It is useful because it can retrieve information from documents the model never saw during its training phase and does not require a costly re-training. It is not fine-tuning either, and it cannot fundamentally change the model’s behaviour.


In my experience, it really does reduce the risk of hallucination, especially if paired with a prompt which explicitly instructs the model to use only facts from the context window. Another strategy is to provide unique identifiers for the RAG chunks dropped into the context and ask the model to cross-reference them in its response. You can then check that the response is citing evidence from the context with a simple pattern match.


Can you elaborate a little more on this? What's an example of how you do that in the prompt?


Not OP, but for the "unique identifiers", you can think of it like the footnote style of markdown links. Most of these models are fine-tuned to do markdown well, and a short identifier is less likely to be hallucinated (my philosophy anyway), so it usually works pretty well. For example, something like this can work:

  ID: 1
  URL: https://example.com/test.html
  Text: lksdjflkdsjlksjkl

And then tell it to use the ID to link to the page.
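
Assembling that kind of context programmatically is just string formatting; a sketch (the source list and call_llm() are hypothetical):

  sources = [
      {"id": 1, "url": "https://example.com/test.html", "text": "lksdjflkdsjlksjkl"},
      {"id": 2, "url": "https://example.com/other.html", "text": "more retrieved text"},
  ]
  context = "\n\n".join(
      f"ID: {s['id']}\nURL: {s['url']}\nText: {s['text']}" for s in sources
  )
  prompt = (
      f"{context}\n\nAnswer using only the sources above, and cite them as "
      "markdown footnotes like [1] linking to the source URL.\nQuestion: ..."
  )
  # answer = call_llm(prompt)  # then pattern-match the [n] IDs against `sources`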


Instead of having the LLM generate the links couldn’t you use a combination of keyword matching and similarity on the model output and the results to automatically add citations? You could use a smaller NLP model or even a rule based system to extract entities or phrases to compare. I’m sure this is already being done by bing for example.


You definitely can do that. It’s just sometimes simpler to dump lots of stuff in context and then check it wasn’t made up. I like the idea of using markdown footnotes. I think that would work well - ChatGPT does handle markdown really well.


This feels like a quantification fallacy. You say

> but that definitely is one particular use of RAG

One use of RAG doesn't imply that all uses of RAG are for grounding LLM hallucinations.

Additionally, I disdain when fallacies come up in conversation when it doesn't appear someone is making an argument in bad faith. We all use logical fallacies by accident from time to time. A good use of calling them out is when they are used in poor taste. I find that more on Reddit than Hacker News.

I have now also engaged in the Tu Quoque fallacy in this reply. See how annoying they are?


Not really. RAG works very differently from a general (or generalist?) LLM.

RAG is vector search first. It encodes the query, finds the nearest vectors in the vector database, retrieves the fragments attached to those vectors, and then sends those fragments to the LLM for it to summarize.

A general LLM like Gemini or Claude or ChatGPT first produces an answer to a question, based on its training. This doesn't involve searching any external source at that point. Then after that answer is produced, the LLM can try to find sources that match what it has come up with.


You don't have to use vector search to implement RAG. You can use other search mechanisms instead or as well - whatever it takes to populate the context with the material most likely to help answer the user's question.

Two common alternatives to vector search:

1. Ask the LLM to identify key terms in the user's question and use those terms with a regular full-text search engine (see the sketch after this list)

2. If the content you are answering questions about is short enough - an employee handbook for example - just jam the whole thing in the context. Claude 3 supports 200,000 tokens and Gemini Pro 1.5 supports a million so this can actually be pretty effective.
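
A minimal sketch of option 1, using SQLite FTS5 for the full-text side (call_llm() is a hypothetical stand-in for your model client, and the handbook row is made up):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
  db.execute("INSERT INTO docs VALUES (?, ?)", ("Handbook", "Annual leave policy: 25 days..."))

  def call_llm(prompt: str) -> str:
      raise NotImplementedError  # hypothetical

  question = "How many days off do I get per year?"
  terms = call_llm(f"Give 3-5 full-text search keywords for: {question}")
  rows = db.execute(
      "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rank", (terms,)
  ).fetchall()
  context = "\n".join(r[0] for r in rows)
  # answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}")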


You're right and I was imprecise. 2. is how RAG is implemented in most cases. (I wouldn't call 1. 'RAG' exactly, though?)


I'd definitely call 1 (the FTS version) RAG. It's how Bing and Google Gemini and ChatGPT Browse work - they don't have a full vector index of the Web to work with (at least as far as I know), they use the model's best guess at an appropriate FTS query instead.


HyDE is a related technique. Ask the model to generate a response with no context, then use that for semantic search against actual data and respond by summarising the retrieved documents.
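
Roughly (a sketch; the embedding model choice and call_llm() are placeholders, and doc_vecs is assumed to be a unit-normalized matrix of document embeddings):

  from sentence_transformers import SentenceTransformer

  embedder = SentenceTransformer("all-MiniLM-L6-v2")

  def call_llm(prompt: str) -> str:
      raise NotImplementedError  # hypothetical

  def hyde_search(question, docs, doc_vecs, top_k=5):
      # Embed a *hypothetical* answer rather than the raw question, then use it
      # to retrieve real documents.
      fake = call_llm(f"Write a short passage that answers: {question}")
      qvec = embedder.encode([fake], normalize_embeddings=True)[0]
      scores = doc_vecs @ qvec
      return [docs[i] for i in scores.argsort()[::-1][:top_k]]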


This is a generalization. These proprietary systems do different things at different times. With GPT-4o you can see little icons when RAG is in use or when code and tests are being used.

People, we have to stop talking about what we know as though it's all there is. Don't confuse our knowledge for understanding. Understanding only comes from repeatedly trying to prove our understandings wrong and learning how things truly are.


This. There is no point in arguing amongst ourselves when we're all wearing blindfolds and it is the zookeeper who decides what parts of the proprietary elephant we're allowed to touch.


That's not what RAG is. RAG is the process of adding relevant information to a prompt in an LLM. It's a form of in-context learning.


It’s right there in the name - first you Retrieve relevant information (often a vector lookup) then you use it to Augment the prompt, then you Generate an answer.

It’s bloody useful if you can’t cram your entire proprietary code base into a prompt.


What makes it useful is precisely that it is not learning.


Yes it is; it's just not updating the weights.

https://ai.stanford.edu/blog/understanding-incontext/


> RAG is a technology to augment search engines with vector search

Based on the words in the acronym, that seems exactly backwards. The words suggest that it is a generation technology which is merely augmented by retrieval (search), not a retrieval technology that is augmented by a generative technology.


The parent's description isn't quite correct. It's kinda sorta describing the implementation; RAG is often implemented via embeddings. In practice, you generally get better results with a mix of vector and, e.g., TF-IDF.

An example of RAG could be: you have a great LLM that was trained at the end of 2023. You want to ask it about something that happened in 2024. You're out of luck.

If you were using RAG, then that LLM would still be useful. You could ask it

> "When does the tiktok ban take effect?"

Your question would be converted to an embedding, and then compared against a database of other embeddings, generated from a corpus of up-to-date information and useful resources (wikipedia, news, etc).

Hopefully it finds a detailed article on the tiktok ban. The input to the LLM could then be something like:

> CONTEXT: <the text of the article>

> USER: When does the tiktok ban take effect?

The data retrieved by the search process allows for relevant in-context learning.

You have augmented the generation of an LLM by retrieving a relevant document.


> TF-IDF

"Term Frequency - Inverse Document Frequency" apparently: https://www.learndatasci.com/glossary/tf-idf-term-frequency-...


RAG does a fact search and dumps content relevant to the query into the LLM’s context window.

It’s like referring to your notes before answering a question. If your notes are good you’re going to answer well (barring a weird brain fart.) Hallucinating is still possible but extremely unlikely. And a post generation step can check for that and drop responses containing hallucinations


By far the best way I've found to reduce hallucinations is to explicitly allow the model to say some version of "I don't know" in the prompt.


This technique also helps to prevent hallucinations in human children.


It however has been shown not to help with politicians.


How? Do you just mean a simple way that they could say I don’t know but tend not to?


Politicians are compelled to hallucinate even when they are told that saying "I don't know" is okay.


That's really interesting. Can you give any examples of how you do that?


[...] if you don't know the answer, say 'I don't know' [...]


Yup I have this as default for a Discord bot I run in a stocks chat. If the bot is unsure then the LLM says so and the bot keks you instead of making something up


RAG, or Retrieval Augmented Generation, has been a buzzword in the AI community for some time now. While the term has gained significant traction, its interpretation varies widely among practitioners. Some argue that it should be "Reference Augmented Generation," while others insist on "Retrieval Augmented Generation." However, the real question is, does the terminology really matter, or should we focus on the underlying concepts and their applications?

The idea behind RAG is to enhance AI-powered search by leveraging the vast amount of information available in documents. It's a noble goal, but the acronym itself falls short in capturing the essence of what we're trying to achieve. It's like trying to describe the entire field of computer science with a single term - it's just not feasible.

But here's the thing: AI-powered search is not just a concept anymore; it's a reality. I've been working on this problem since 2019, and I can tell you from experience that it works. By integrating OpenAI's GPT-2 with Solr, I was able to create a search engine that could understand natural language queries and provide highly relevant results. And this is just the beginning.

The potential applications of AI-powered search are vast. From Playwright to FFmpeg, I've been applying LLMs to various services, and the results have been nothing short of impressive. But to truly unlock the potential of this technology, we need to think beyond the confines of a single acronym.

That's why I propose a new term: RAISE - Retrieval Augmented Intelligent Search Engine. This term captures the essence of what we're trying to achieve: a search engine that can understand the intent behind a query, retrieve relevant information from a vast corpus of documents, and provide intelligent, contextual responses.

But more importantly, RAISE is not just a term; it's a call to action. It's a reminder that we need to raise the bar in AI-powered search, to push the boundaries of what's possible, and to create tools that can truly revolutionize the way we access and interact with information.

So let's not get bogged down in terminology debates. Instead, let's focus on the real challenge at hand: building intelligent search engines that can understand, retrieve, and respond to our queries in ways that were once thought impossible. And who knows, maybe one day we'll look back at this moment and realize that RAISE was just the beginning of something much bigger.


RAG isn’t just search though. I can be using retrieval for generating content, but I think RAISE kind of restricts the term to question answering which is not the e only application.


There's no particular requirement for a RAG application to use vector search or embeddings, and there's no requirement that semantic similarity is in play (i.e. "retriev[ing] documents...that do not contain the terms in the search query"). Fundamentally RAG is just doing some retrieval, and then doing some generation. While the traditional implementation definitely involves vector search and LLMs, there are plenty of other approaches; ultimately at anything beyond toy scale it sort of begins to converge with the long history of traditional search problems rather than being its own distinct thing.


> It lets them retrieve documents or document fragments that do not contain the terms in the search query, but that are conceptually similar (according to the encoder used).

I played with Latent Semantic Indexing[1] as a teen back in early 2000s, which also does kinda that. I haven't read much on RAG, I'm assuming it's some next level stuff, but are there any relations or similarities?

[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis#Laten...


It is similar to the retrieval stage, yes. The concept of RAG is to then feed the retrieved fragments to a model so it can generate an answer that takes them into account.


> It lets them retrieve documents or document fragments that do not contain the terms in the search query, but that are conceptually similar (according to the encoder used).

That’s what GPT does. Or rather, someone hearing about RAG for the first time would have trouble distinguishing what you said from their understanding of how GPTs are already trained.


Plus RAG isn't just referenced result augmentation.

It isn't just adding a vector-based DB.

It isn't about less hallucination. In fact it doesn't even mandate that results be referenced or come from a vector DB.

RAG's point is to remove a limit LLMs have on their own: they are limited to their training data as a source of information.

A RAG system can be queried for information an LLM doesn't have any knowledge of. The LLM part of a RAG system can be instructed to use all sorts of information retrieval, such as making a web search or checking the current stock market value of particular tickers.

The article is nonetheless interesting, as it touches on the beauty of LLMs being in their input interface rather than in their ability to output aggregated content that reads beautifully, usually.

RAG is more than what the author says it is. It's even what the author says they most want.

RAG is, more comprehensively, what the author wants. It can be instructed to provide relevant references, always answer a particular way, and, most importantly, can be asked to perform (live) search engine queries to find the most up-to-date information. The example in the article is OK given the recipe is pretty old and has very likely been mined during training.

What about:

> I missed the games Lakers played from the 1st to 21st of April. Could you give me the list of opponents and scores please.

LLMs' input interface won't help. RAG can make LLMs answer that.


RAG is for vlookups, not countifs.


I don't think the author completely understands RAG, and this article is a bit disconnected and unclear to me. Google already provides a "flexible natural language query interface".

I think ironically it would be fairly trivial to build what he wants _using_ RAG.

1. Accept a natural language query, like ChatGPT et al already do

2. Ask an LLM to rephrase it in N different ways, optimized for Google searches

3. Scrape the top M pages for each of the output of 2 in parallel. You now have dozens of search results

4. Clean and vectorize all of these

5. Use either vector similarity or an LLM to return the best matching snippets for the original query from 1, constrained to stuff contained in 4.

It would take a little longer than a ChatGPT or Google response, but I can see the appeal too.
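
A skeleton of those five steps (the search and scraping helpers are stubs here; a real version needs a search API, a scraper, and rate limiting):

  import numpy as np
  from sentence_transformers import SentenceTransformer

  embedder = SentenceTransformer("all-MiniLM-L6-v2")

  def call_llm(prompt: str) -> str: raise NotImplementedError         # hypothetical
  def web_search(query: str) -> list[str]: raise NotImplementedError  # returns URLs
  def fetch_clean_text(url: str) -> str: raise NotImplementedError    # scraper stub

  def best_snippets(question: str, n_queries=3, top_m=5, top_k=10):
      queries = call_llm(
          f"Rewrite as {n_queries} Google-style queries, one per line:\n{question}"
      ).splitlines()
      snippets = []
      for q in queries:
          for url in web_search(q)[:top_m]:
              snippets += [p for p in fetch_clean_text(url).split("\n\n") if len(p) > 200]
      vecs = embedder.encode(snippets, normalize_embeddings=True)
      qvec = embedder.encode([question], normalize_embeddings=True)[0]
      return [snippets[i] for i in np.argsort(-(vecs @ qvec))[:top_k]]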


Have you actually tried to use Google in the last few years? Its query interface is horrible!

2024-era Google doesn't give you what you ask for - it'll take a few keywords from your query and "helpfully" invent a completely different query. Try asking it for anything remotely obscure and it'll just return complete garbage. Unless you're looking for a "20 dishwashers under $300" article, Google is basically unusable these days.

10 or 15 years ago Google was actually useful. You could specify exactly what keywords to include - and more importantly exclude. What I am looking for (and probably the author too) is an LLM which rephrases a human-language request into a SQL-like query to feed to a 2010 Google. Bonus points if it actually clearly states the query it's going to run and ask for corrections.


You want google to search exactly for the keywords in the query without any "helpful" reinterpretation, but at the same time to accept flexible human language requests. How do you expect these 2 contradictory requirements to coexist?


That's basically what we're doing with app.studyrecon.ai.

What we've found is that vector similarity is often not the final solution. It is still only a crude proxy for the true goal of 'informativeness' or 'usefulness' with relation to the user goal/query. Works okay, but we're definitely seeing a need for more rigorous LLM-postprocessing to enrich the results set.

Which, yes, the time adds up quick!


This sounds exactly like what Perplexity.ai does.


The article would be more convincing if they showed the recipe from the book so we could compare it with the one that ChatGPT output.

From a Google search, it looks like he's right about the poor accuracy. It gets the basic idea of the ingredients, but is not really accurate. And is initially wrong about the region.

But actually, this is what RAG is for. You would typically do a vector search for something similar to the question about "rice baked in an egg mixture". And assuming it found a match on the real recipe or on a few similar possibilities, feed those into the prompt for the LLM to incorporate.

So if you have a well indexed recipe database and large context window to include multiple possible matches, then RAG would probably work perfectly for this case.


I'm also super curious to know what the actual recipe was. I literally just plugged the exact text from the article into ChatGPT 4o:

> My mother remembers growing up with a Sicilian dish that was primarily rice baked in an egg mixture. Roughly a “rice frittata.” Do you know some examples of what dish or recipe this could be?

And the response I got was

> It sounds like your mother might be referring to a dish known as "Frittata di Riso" or "Frittata di Riso al Forno." This is a traditional Sicilian dish that combines leftover rice with eggs and other ingredients, then bakes it into a savory cake. Here is a basic recipe for Frittata di Riso:

And then got a detailed, formatted recipe. I asked follow up questions along the lines of "Where did Frittata di Riso originate" and "Is Frittata di Riso popular in Sicily" and again got detailed, thorough answers. Fair enough that I don't know if it's hallucinating, but without more info from the author what can I compare it to?


I happen to have a copy of "Bruculinu, America" at home. Looking for rice recipes at roughly those positions in the book, it's either "Tumala d'Andrea: Rice Bombe" (blue bookmark) or "Arancini al Burru: Rice Balls with Butter".

I'm going to guess it's Arancini al Burru, because the Tumala d'Andrea is way more involved (and stuffed with pasta)[0]. Here's a similar recipe for Arancini al Burru[1].

However, Arancini al Burru is fried and Tumala d'Andrea is baked. So I'm still speculating, it could be either or something different. It's a nice cookbook worth adding to your collection.

[0] https://inasmallkitchen.wordpress.com/2012/08/12/tumala-dand... [1] https://www.196flavors.com/arancini-al-burro/


You are right. We made a simplified version of the Arancini al Burru (Rice Balls with Butter) on our own. Also, "Frittata di riso" is a very good match (relaxing the regional requirement).


I feel cheated.


We use the term "pre-googling" for this sort of "information retrieval". You might have some concept in your head and you want to know the exact term for it, once you get the term you're looking for from LLM you'll move to Google and search the "facts".

This might be a weird example for native English speakers, but recently I just couldn't remember the term for a graph where you're allowed to move in one direction and cannot do loops. The LLM gave me the answer (directed acyclic graph, or DAG) right away. Once I got the term I was looking for, I moved on to Google search.

Same "pre-googling" works if you don't know if some concept exits.


> graph where you're allowed to move in one direction and cannot do loops

To be fair, you didn't need an LLM for this. Googling that, the answer (DAG) is in the title of the first Google result.

(Not to invalidate your point, but the example has to be more obscure than that for this strategy to be useful)


I recently started watching Fallout and it reminded me of a book I read about a future religious order which was piecing together pre-bomb scientific knowledge. It immediately pointed me to A Canticle for Leibowitz (which is great, btw). Google results will do the same, but the LLM is much faster and more direct. I find it great for stuff like this - where you know there is an answer and will recognise it as soon as you see it. I genuinely think it can become an extension of my long-term memory, but I'm slightly nervous about the effect it will have on my actual memory if I just don't need to remember stuff like this anymore!


The pre-Googling is an excellent idea. You are augmenting the query, not generating nonsense answers. My wife uses ChatGPT as a thesaurus quite a lot.


Isn't the point of RAG to make (in this example) actual recipe databases accessible to the LLM? Wouldn't it get closer to the article's stated goal of getting the actual recipe?


Yes, but if you don't have the LLM at the end, a good search (against a good corpus with the needed info) would still have given the user what they wanted, which in this case is a human-vetted piece of relevant information. The LLM really would only be useful in this case for dressing up the result, and that would actually reduce trust in the result overall. Alternatively, an LLM could play a role as part of the natural language pipeline that drives the search, hidden from the user, and I feel that's a much more interesting use of them.

The farther you go with RAGs, in my experience, the more they become an exercise in designing a good search engine, because garbage results from the retrieval stage always lead to garbage output from the LLM.


> The farther you go with RAGs, in my experience, the more they become an exercise in designing a good search engine

From what I've seen from internal corporate RAG efforts, that often seems to be the whole point of the exercise:

Everyone has always wanted to break up knowledge silos and create a large, properly semantically searchable knowledge base with all intelligence a corporation has.

Management doesn't understand what benefits that brings and doesn't want to break up tribal office politics, but they're encouraged to spend money on hype by investors and golf buddies.

So you tell management "hey, we need to spend a little bit of time on a semantic knowledge base for RAG AI, and btw this needs access to all silos to work", and make the actual LLM an afterthought that the intern gets to play with.


That matches my experience as well. Really well put!


Yes, I don't think the author fully thought through what they wrote. In essence they are saying they just want semantic search.


Yes. Normally with RAG you would actually search and try to pre-filter the data for the LLM.


I think complaints like this show just how amazing AI is getting. This person really expected that ChatGPT would single-shot give them this obscure recipe that took them a ton of effort to find themselves. Current AI can do so much that people lament that it can't do everything. It is incredible to me when people bring up bad AI-generated legal filings... like people actually expect it to already do all of the work of an attorney, without error.


It's like that old joke about one person being amazed at someone's dog being able to sing, and the other person says "it's not that impressive, he's pitchy" or something. Computers can finally think, and we're annoyed they can't think better than us.


I think of it more as "dancing bear ware". If you've ever seen a dancing bear, you wouldn't say it's particularly good at dancing. You may even say it's hardly dancing at all. But we don't care that it dances well; it's amazing that it dances at all.

Current AI is a dancing bear. It gives us the feeling of understanding language, semantics, logic and reason, but when you look closely, you realize it's doing a very poor job of it in a way that suggests it is just mimicking those things without actually being capable of them.


But as opposed to a dancing bear, we expect AI to get better and better at dancing in the next decades.

The speculation is important, it both changes the perspective for prospective companies/investors, and cannot even be said at this point to be unfounded.


Except a dancing bear is useless. I get a ton of real-world use out of GPT-4 that no other kind of product in this world can provide.


It's not that it can't do everything. It's that it pretends it can do everything.


> I wanted the retrieval of a good recipe, not an amalgam of “things that plausibly look like recipes.”

And that's the core issue with AI. It is not meant to give you answers, but to construct output that looks like an answer. How is that useful I fail to understand.


So LLMs are effectively a stack of (gradient-descent) learned lookup/hash tables, and you can build primitive but working logic systems out of lookup tables (think unidirectional Turing machines).

Can you elaborate on your claim "it is not meant to give you answers"?


I am talking about what an ordinary person thinks an answer is. If the AI industry could pull its collective head out of its ass, it would notice that humans are looking for answers that are factually true and not probabilistically close to the concept of a plausible answer to the given question. To put it simply, if I ask my bank for a statement for the last month, I expect to see a list of actual transactions and I expect the numbers to add up, while AI fanboys are happy with output that looks like a bank statement but has made-up entries with random amounts paid in or out. I will likely get a lecture on how it's the wrong domain to apply AI to, but the core objection stands: humans expect facts and order, whilst AI cannot tell fact from made-up stuff and will keep on generating fake answers that look like what humans are looking for. Humans are wired for survival and for processing information in a way that allows us to turn chaos into order, pattern, plan, or action script. This is why we find it so easy to spot AI-generated content and why we reject it.


Ok. Thanks for your clarification, that makes sense. I see you're being downvoted, possibly because your tone is weird, but I think it's worth noting that you're not wrong... People expect an AI to both interpret fuzzy inputs and give definitive answers out. I call it the "Mr. Data fallacy". Unfortunately the reality is garbage in, garbage out... Our inputs are garbage, so there must be some likelihood of garbage.

I think maybe the best we can hope for is to push garbage responses to the edges: when it gets something wrong, it is so obviously wrong (e.g. completely misformatted) that the answer is not acceptable to the user, and obviously so.


> People expect an AI to both interpret fuzzy inputs and give definitive answers out.

Couldn't put it better myself.


LLMs' fundamental flaw is that, unlike many other forms of AI, they have NO understanding of the material they were trained on or its relationships to anything at all in the real world (viz. the recent bizarre answers to goat/river-crossing questions).

Ultimately, any really useful AI must understand real-world relationships - this is one reason I was always more bullish on the late Doug Lenat's Cyc than on any other AI - none of the others were grounded by that knowledge.

And a commitment to truthful and error-free answers (a la HAL 9000) is an absolute must. Keep in mind that even in Clarke's fictional account, HAL was proud of the 9000 series' record of "never making an error or distorting information", and never hallucinated or acted crazy until his training was overridden by his programmers for political purposes - this is EXACTLY what happens in today's wokified LLM models (can't allow another Tay!): the programmers redefine truth as political in contravention of actual factual truth.


Demos well, falls over in production after you've made the sale.


I don't see how a RAG system that can cite sources from a document doesn't address this already (example [1])? Although I admit it's not a perfect solution yet...

[1] https://mattyyeung.github.io/deterministic-quoting


I continue to hold the strong position that calling LLMs without injecting source truths is pointless.

LLMs are exceptionally powerful as a reasoning engine. They're useless as a source of truth or facts.

We have chat bots, chat bots with automatic RAG, etc. After the initial excitement wears off, you're going to want a way to inspect and adjust the source queries yourself. In this case, being able to select what to search for in Google might be a good approach for the cooking recipe use case.


For those who haven't read it yet:

"I like to think of language models like ChatGPT as a calculator for words.

This is reflected in their name: a “language model” implies that they are tools for working with language. That’s what they’ve been trained to do, and it’s language manipulation where they truly excel.

Want them to work with specific facts? Paste those into the language model as part of your original prompt!"

https://simonwillison.net/2023/Apr/2/calculator-for-words/


I do not understand the point of this article.

You did not even tell us what the correct recipe was called. But let's ignore that for now.

I did some googling around, and the Wikipedia article for sartu di riso [0] mentions that "it (the dish) found success in Sicily". Also, in [1] a commenter going by the name of passalamoda mentions that they also make this dish in Sicily. The comment they wrote is in Italian, but Frank Fariello has translated it for us, or if you don't believe him for whatever reason, Google Translate does a fine job. All of this is to say that associating this dish with Sicily, based on the short description you've given, is not far-fetched at all.

> I fail to see how an LLM summarizing the material would be an improvement.

I am fairly confident that typing the question you gave ChatGPT, waiting a few seconds for an answer, and then reading it can easily take under a minute. Let's be lenient and say it takes 5 minutes to also ask a second question and receive the answer. That would still take way, way less time than finding a book, getting the book, and going through it to find the correct recipe.

Also, you yourself have given a reason in your article as to why ChatGPT would be an improvement. I will quote it now:

> Most of her time was dealing with the brittleness of the query interface (depending on word matching and source popularity), spam, and locked down sources.

I have already spent way too much time on debunking a random internet article, but I also decided to try to get an answer from ChatGPT. I did that by continuing to ask it questions that a person looking for an answer, not contradictions, would ask. If we assume that deanputney is correct and the dish you were looking for is Arancini al Burru, we are able to get an answer from ChatGPT by asking the very simple and natural question shown in [2].

[0] https://en.wikipedia.org/wiki/Sart%C3%B9_di_riso

[1] https://memoriediangelina.com/2013/01/21/sartu-di-riso-neapo...

[2] https://imgur.com/a/PKqGXyK


… you really want to be able to cook documents down to facts, as in the old A.I., and then be able to make logical queries. Trouble is it is easy to ontologize some things (ingredients) but not so easy to ontologize the aspects of things that make things memorable.


Suggested book about traditional Sicilian dishes: https://www.amazon.it/Profumi-Sicilia-libro-cucina-siciliana...


Thanks for the book recommendation! I found an English translation of a book by the author and I am ordering that.


We need a trendy TLA for publisher and author marketing of books, highlighting their strengths relative to LLMs.

  HCL (Human Context Language)
  HLM (Human Language Model)
  LLH (Literate Language for Humans)
  LHM (Literate Human Model)


Is that ChatGPT example representative of RAG? I thought ChatGPT was primarily generative.

I think of something like Brave Search's AI feature when I think of RAG.


(author) Good point, sorry about that. I guess I was thinking that the generation being low value was somewhat orthogonal to the retrieval being very high value.


People are doing what you are talking about: use an LLM to parse unstructured data into a structured or semi-structured format (or even just create a bunch of structured metadata) that is more easily searchable, then use an LLM to parse a natural language query into a suitable query (maybe SQL), then find the result and return it.
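
A bare-bones sketch of that second step (the schema and call_llm() are hypothetical; real systems validate the generated SQL before running it):

  import sqlite3

  SCHEMA = "CREATE TABLE recipes (name TEXT, region TEXT, main_ingredient TEXT)"

  def call_llm(prompt: str) -> str:
      raise NotImplementedError  # hypothetical

  def ask(question: str, conn: sqlite3.Connection):
      sql = call_llm(f"Schema:\n{SCHEMA}\nWrite one SQLite SELECT that answers: {question}\nSQL:")
      assert sql.lstrip().lower().startswith("select")  # crude guard, not real validation
      return conn.execute(sql).fetchall()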


To keep control in the hands of the analyst, we've been working on UXes over agentic neurosymbolic RAG in louie.ai --

Ex: "search for login alerts from the morning, and if none, expand to the full day"

That requires generating a one-shot query combining semantic search + symbolic filters, and an LLM-reasoned agentic loop recovering if it turns up not enough such as a poorly formed query around 'login alerts' and the user's trigger around 'if none'

Likewise, unlike Disneyfied consumer tools like ChatGPT and Perplexity that are designed to hide what is happening, we work with analysts who need visibility and control. That means designing search so subqueries and decisions flow back to the user in an understandable way: they need to inspect what is happening, be confident they missed nothing, and edit via natural language or their own queries when they want to proceed.

Crazy days!


Very good to hear this. As a data analyst, I have tended to dismiss LLMs as irrelevant because of the black-box mentality. In some industries, this even holds back adoption of technology that is now considered mature and boring, like tree ensembles in machine learning.


Once tree ensembles get big enough to handle the kinds of problems that LLMs can address, are they really more interpretable?


Yup, this is the way. Natural language is an excellent search/specification front end. But without disambiguation and perfect clarity on how a query was interpreted, you cannot trust a black box for real work.


Clearly this author doesn't know what RAG is. RAG would be if he first did a 'retrieval' of all his mother's cookbooks containing Italian recipes, then reviewed the index for rice and scanned and OCRed those pages. That data would be submitted to ChatGPT with the query to 'augment' it, within the constraints of the context window, so that ChatGPT could generate a response with the highly relevant cookbook info.


Your input was incredibly low effort and prompted a very low effort output.

I took part of your blog post (which you clearly were willing to put a few more tokens into) -

"My mother remembers growing up with a sicilian dish that was primarily rice baked in an egg mixture. Roughly a "rice frittata". What are some distinctly Sicilian dishes that this could be referring to?"

Notice there is not much extra context that you've offered, either to the LLM or to us. You didn't even tell us what the recipe was...

How was the dish served, what did it look like?

What are you expecting of the LLM here? It's not a psychic AGI.


I guess I should have emphasized how I didn't like the LLM contradicting itself.


I would expect most any answer to be loopy and partially wrong (or really, I'm not surprised when they are) based on the short question; it's just something you build intuition around as you use them.

edit - btw I just noticed

"Sartù di Riso: A baked rice dish that can include ingredients like meat, peas, and cheese, often bound together with eggs. It’s more commonly associated with Naples but has variations in Sicily."

Was one of the 4 dishes in the question I submitted.

So... was it contradictory actually?


> I would expect most any answer to be loopy and partially wrong (or really, I'm not surprised when they are) based on the short question; it's just something you build intuition around as you use them.

That sounds pretty unpleasant as a rule to follow.

How should I ask instead?

Can I make an LLM do the question-expansion for me?


Querying knowledge is not a nail, but the hammer is generation. It is written on the tin: "generative" AI. People want "insight", "summary", "workflow automation", "code completion" from a guided proverbial monkey hammering on the keyboard, hoping that our problem will become a nail. It is getting closer, though.


Thanks for sharing your thoughts! I published a blog entry that describes our solution to this exact problem yesterday. We call it "Generation Augmented Retrieval (GAR)", please find the details here: https://blog.luk.sh/rag-vs-gar


I think the author is overgeneralising. Every tech is good for a few things and not good for others.

Vector search and LLMs are revolutionary technologies and have their own limitations, but we should not form a bias on the basis of a few edge cases.


An interface for vector search is 100x more helpful to me than an LLM spitting out the same content as slop.

The key to vector search is how you chunk your data, but I have some libraries to help with that
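
For instance, even a naive fixed-size chunker with overlap (a sketch; the sizes are arbitrary) usually beats embedding whole documents:

  def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
      # Overlapping word windows, so a passage is never cut off from the
      # neighbouring context that makes it retrievable.
      words = text.split()
      step = size - overlap
      return [" ".join(words[i:i + size])
              for i in range(0, max(len(words) - overlap, 1), step)]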


Is anyone doing something like using LLMs to generate Prolog (or Cyc, or some appropriately complex, brittle knowledge representation GOFAI)?


> To me, LLM query management and retrieval is much more valuable than response generation

Sure, for certain tasks. For other tasks retrieval is less useful.


This is just that author's opinion. It seems clear that there is a spectrum of what users want, from basic retrieval to content generation.


Is there something about the presentation of the article that gives you the impression this was intended to be read as other than the author's opinion?


Users don't want what they think they want. Therefore, we won't give users what they think they want.


lol, the thing they’re asking for is literally RAG. “I want a smart person to parse relevant information and return a concise and relevant answer.” Take the query, find the book, dump the book into the context window, and the LLM’s answer will be exactly what you want.


How do you delete or update entries in a vector database?


same as in any other database
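
With most client libraries it looks like ordinary keyed CRUD; e.g. with ChromaDB it is roughly this (from memory, so check the current API):

  import chromadb

  client = chromadb.Client()
  col = client.create_collection("docs")
  col.add(ids=["doc-1"], documents=["old text"])
  col.update(ids=["doc-1"], documents=["new text"])  # replaces / re-embeds the entry
  col.delete(ids=["doc-1"])

How the underlying index handles that internally varies by engine, but that's the database's problem, not yours.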


this person just wants perplexity


I don't know if it's my queries, the situation of the web, or something internal to it, but my impression is perplexity is becoming less and less useful as time passes.


if the data is all structured and can be easily trained, maybe the ML can be even better



