Lamini Memory Tuning: 10x Fewer Hallucinations (lamini.ai)
128 points by galeos 3 months ago | 57 comments



Wow, very creative and interesting idea. I understand it as: train a bunch of fact-based LoRAs to zero loss (they mention 100k different ones), then use RAG to pick the appropriate LoRAs for a query.

So cool. The only moat I can think of for such a company would be proprietary fact LoRAs: basically licensing a modern AI encyclopedia.

Anyway, really nice idea.


I think there is an expert router layer that decides which LoRAs get integrated at inference time. But they also mention that they freeze the router weights during training, so it is unclear to me how the router was trained, and on what loss.


Interesting. That's kind of surprising to me: it would mean that with every new LoRA they'd need to fine-tune the router, no?

Embedding a description of each LoRA and using RAG to pull the nearest LoRAs in the embedding space is where my mind goes; it's super extensible, requires minimal additional training for customer use cases, and given how the LoRAs probably work it's not terrible to pull a few extras.
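
Something like this toy sketch, maybe (all names are made up, and embed() is just a stand-in for a real embedding model; nothing here is from Lamini's actual stack):

    import numpy as np

    # Hypothetical "RAG over LoRAs": embed a one-line description of each
    # fact-LoRA, then pull the nearest adapters for an incoming query.
    def embed(text, dim=64):
        # stand-in embedder: deterministic random unit vector per string
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        v = rng.standard_normal(dim)
        return v / np.linalg.norm(v)

    lora_descriptions = {
        "lora_roman_emperors": "facts about Roman emperors and their reigns",
        "lora_orders_schema":  "facts about the customer orders database schema",
        "lora_product_faq":    "facts from the internal product FAQ",
    }
    index = {name: embed(desc) for name, desc in lora_descriptions.items()}

    def pick_loras(query, k=2):
        q = embed(query)
        scores = {name: float(q @ v) for name, v in index.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(pick_loras("Who succeeded Augustus?"))  # -> the two nearest adapter names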

Anyway, I'm just speculating; no idea what they're actually doing on the backend.


That's where it is confusing to me. They mentioned that for LoRA fine-tuning, the router weights are frozen, so you don't update the routing when training different concepts. But how is that expert router trained? It could be pretrained with some auxiliary loss to encourage diversity.


I am old enough to remember using the yellow pages to find the right expert.


They mention "tuning millions of expert adapters", not 100k


Bit late to the party and I'm not into the AI scene, but from glossing over the three key papers they cite to describe their model, my take is as follows.

The idea from LoRA[1] is to take pre-trained, dense weights W_0 for a model and adapt them to new training data by using the weights W = W_0 + BA for inference. The key is that the matrices A and B have a very low rank[2] compared to W_0. Training is then done by treating W_0 as constant, and only updating the A and B matrices.
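
In toy numpy terms (my own sketch of the LoRA update, not anything from the Lamini code):

    import numpy as np

    d, r = 1024, 8                    # model dim vs. LoRA rank, r << d
    rng = np.random.default_rng(0)

    W0 = rng.standard_normal((d, d))            # frozen pre-trained weights
    A = 0.01 * rng.standard_normal((r, d))      # trainable, r x d
    B = np.zeros((d, r))                        # trainable, d x r (zero init, so BA = 0 at the start)

    x = rng.standard_normal(d)
    y = (W0 + B @ A) @ x              # inference uses W = W0 + BA

    # only A and B get gradient updates
    print(W0.size, A.size + B.size)   # 1048576 frozen vs. 16384 trainable parameters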

The idea from MoE[3] is to pick just a few "experts" from a large number using Softmax, and insert them between other layers in a neural network. The "experts" can be just some simple matrices or neural networks in their own right.
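
Roughly like this (a heavily simplified sketch of softmax top-k gating, ignoring the noise term and load-balancing tricks from the actual paper):

    import numpy as np

    n_experts, d, k = 16, 32, 2
    rng = np.random.default_rng(0)
    experts = [0.1 * rng.standard_normal((d, d)) for _ in range(n_experts)]
    Wg = 0.1 * rng.standard_normal((n_experts, d))   # gating weights

    def moe_layer(x):
        logits = Wg @ x
        gate = np.exp(logits - logits.max())
        gate /= gate.sum()                     # softmax score per expert
        top = np.argsort(gate)[-k:]            # keep only the top-k experts
        w = gate[top] / gate[top].sum()        # renormalise their weights
        return sum(wi * (experts[i] @ x) for wi, i in zip(w, top))

    print(moe_layer(rng.standard_normal(d)).shape)   # (32,)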

The Lamini model seems to combine these ideas, where they use several "experts" layered between the BA matrices. However instead of just a simple Softmax to select the "experts" they use (chunked?) cross-attention[4], and a much larger number of "experts" compared to the MoE paper. From what I gather the cross-attention allows the expert selection to react to the context, unlike the more naive plain Softmax gating approach.
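
If I had to guess at the shape of it (pure speculation on my part; every name and shape below is made up): the hidden state acts as the attention query, each expert has a learned key, and the attention scores over the keys decide which experts fire for this context.

    import numpy as np

    d, n_experts, k = 32, 1000, 4
    rng = np.random.default_rng(1)
    Wq = 0.1 * rng.standard_normal((d, d))             # query projection
    expert_keys = 0.1 * rng.standard_normal((n_experts, d))

    def select_experts(hidden):
        q = Wq @ hidden
        scores = expert_keys @ q / np.sqrt(d)          # attention logits, one per expert
        top = np.argsort(scores)[-k:]                  # context-dependent top-k
        w = np.exp(scores[top] - scores[top].max())
        return top, w / w.sum()

    ids, weights = select_experts(rng.standard_normal(d))
    print(ids, weights.round(3))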

Similarly to LoRA, they train the Lamini model by keeping the pre-trained LLM weights constant. From what I can gather they do a bit of training on the cross-attention layer but then freeze that too, to avoid the cross-attention layer favoring the same "experts" all the time.

The idea then is to train the "experts" and the LoRA layer on facts until the combined model (W above) gets each fact correct (zero loss).

Thus when the model "sees" a keyword in a sentence, a set of "experts" will steer/adjust the output of the pre-trained LLM to output the correct fact. Or at least that's how I imagine it works.

What's less clear to me is what exactly the "experts" are and how they are combined.

Since they're used to adjust the weights of the combined model, it makes the most sense to my untrained eye that they're simple matrices as mentioned in the MoE paper. Given they're layered between the low-rank portions of the LoRA section, they're necessarily relatively small matrices, so having millions of these "expert" matrices doesn't add too many parameters overall.

And while they're portrayed as stacked in the Lamini paper, suggesting matrix multiplication, matrix multiplication is generally not commutative[5]. So to the untrained eye it seems more likely the "experts" are simply added, like in the MoE paper.
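
So my guess at the combined weight, in my own notation (with g_i the gate weights and E_i the small r x r expert matrices), would be something like:

    W = W_0 + B @ (g_1*E_1 + g_2*E_2 + ... + g_k*E_k) @ A   # experts summed, not multiplied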

But yeah, I'm very much not an expert and the paper was more like a conference poster at best, lacking a lot of detail, so this might be all gibberish and I'd appreciate being corrected.

[1]: https://arxiv.org/abs/2106.09685

[2]: https://en.wikipedia.org/wiki/Rank_factorization

[3]: https://arxiv.org/abs/1701.06538

[4]: https://arxiv.org/abs/2112.04426

[5]: https://en.wikipedia.org/wiki/Commuting_matrices


Thanks for this substantive reply.

In this case, the experts could literally be a routing/weighting of the LoRAs, so it could be a 1x100k (or 1M, or whatever) vector, binary maybe for simplicity and size. Or floats that are capped to 1/0 at inference time, but trained as floats.
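
i.e. roughly (just a toy illustration of "trained as floats, binarised at inference"; the threshold is made up):

    import numpy as np

    gates = np.random.default_rng(0).random(100_000)   # learned routing weights in [0, 1]
    active = gates > 0.999                             # capped to 1/0 at inference time
    print(int(active.sum()), "of", gates.size, "LoRAs selected")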

The thing that’s a little weird to me is that you’d need to keep retraining the experts. But, I guess it may just be part of the pipeline for adding custom knowledge to the system.


> In this case, the experts could literally be routing / weighting of the LoRas

Hmm yes, good point. Hard to tell with so little to go on.

edit: I assumed they were matrices given they were drawn as squares in the figure, just squashed to fit in the LoRA stackup, and given that being low-dimensional they'd add few parameters.

> The thing that’s a little weird to me is that you’d need to keep retraining the experts.

Yeah my impression was this is more for static knowledge, like if you wanted to have a Wikipedia-assistant say.

It got me thinking, though, whether it could be a stepping stone towards something more dynamic. Say, could you use something like the W = W_0 + dW idea to tweak the cross-attention mechanism to select newly added experts somehow?

Again, not into the AI scene, just like entertaining these shower thoughts.


Doesn't this make the "AI" even less creative and more like full-text-search instead? What makes some data a "fact"? If everything is written in the training data, in the end, won't everything be treated like a fact? So the LLM will have 100% accuracy and 0% creativity.


Some people don’t want their language model to have any creativity.


"Ten times less" is a common English usage with a clear meaning.


Creativity is clearly not the goal here. Machine learning models are trained to be robust to errors in the training data.


Most advertised commercial uses of "AI" are glorified search. This seems like an improvement in that space?


Whats wrong with that?

If "AI" simply allowed people to search data without having to structure it then apply very rigid search terms, that would be in a win in my book.


Nothing wrong, only that AI seems to be transforming into the next search engine, as opposed to AGI.


Yeah AGI is way off, if ever.

Question is do the public and investors realize that.


Sounds like compression to me


"Compressing" 500MB of data into a 70GB model.


Isn't it more like "compressing" interactions with and patterns of 500MB of data?

Like, it's not only text search but also has a degree of generalization (e.g. in labelling new documents)


Am I the only one that cringes at "10x fewer?"

How do I multiply positive numbers and get something smaller?

Is "1/10th" or "90% less" not better arithmetic?

Maybe I should have done more gooder at math but it hurts my ears (eyes).


> How do I multiply positive numbers and get something smaller?

fractions are gonna blow your mind


I once presented a p75 result of some work as "a 60% improvement" and got direct feedback that it was not a very impressive figure. However, when they realized what I really meant was "a 2.5x improvement", their eyes lit up.

I've asked around and people generally seem to prefer the bigger, sexier number. I don't care too much either way so I just go with the flow. Shrug.


Years ago A&W rolled out a “1/3 pound” burger to compete with McDonald’s quarter pounder, but it failed miserably despite being priced the same and rating higher in blind taste tests. Why? Because people perceive 1/4 to be larger than 1/3.


I think it depends on how you think of the initial number; I think of it as a fraction, and the multiplier applies to the denominator.

e.g. if hallucinations occur in roughly 1 in 20 prompts, then 10x fewer is 1 in 200 prompts, rather than 0.1 in 20 prompts.


Isn't 1 in 200 equivalent to .1 in 20?


It is. I think their point is “10x fewer” makes sense if you imagine going from a denominator of 20 to 200.


Bigger number better, obviously!

I am also annoyed by most modern tech marketing using percentages incorrectly and inconsistently. But 150% is a bigger number than 1.5x so I suppose their hands are tied.


They are overloading "fewer" to mean division as well as subtraction. According to this logic "twice fewer" means "half as much".


Fewer is subtraction, times fewer is division.

More is addition, times more is multiplication (or 1 plus multiplication, oops).

I don't think anyone says "twice fewer". Or "twice more" when talking about quantities.


They're speaking to the lay community. The lay community is not known for using precise language. If they had used language like yours, the lay community probably wouldn't have received the key message: "10x better".

On the other side, it seems clear that the scientific community was able to deduce the intended meaning of "10x fewer".


Has anyone here used this, or anything similar? This sounds phenomenal if it really works. Looks like “contact us” is the only way to try it or buy it right now, and the purported benefit (memorize facts up to the model training size, basically trillions of tokens of facts) is wild. I’d love to try a system running this way to understand the failure modes, like for instance how does it reliably infer which “facts” to use?


"10x less" is a weird way of saying 90% less, or better yet, reduced to 10% of what it was before.


so if ChatGPT hallucinates 10% of the time, their model hallucinates only 1% of the time?


No, it would be 10% - 10*90% = -890%. What is a negative hallucination? I don’t know. Maybe it summons facts into being but doesn’t believe them.


The website says:

> At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.

The paper says:

> At inference time, only the relevant experts are retrieved from the index, allowing the LLM to store a large number of facts while maintaining low inference latency. We use specialized GPU kernels written in Triton Tillet et al. (2019) to accelerate the lookup of experts.

...but darned if I can understand from either what they're actually doing when they say that.

Why do you need a custom GPU kernel for this outside of the normal NN layers?

Can anyone see an explanation of how they pick which expert to use?


Agreed. I looked through their "paper", and while it goes through the motions of a scientific paper, there's barely any reproducible methodology: the method takes up a single page, including the diagram.

They do reference some papers I’m not familiar with and say their method is “similar”.

If you check the Hugging Face page mentioned in a footnote, they have two directories: one for a model, and one containing a FAISS index. But in the paper they say they use cross-attention, so I have no idea how those could be combined.


That's fair. I'll try to go through it over the weekend and write out some of the equations for the kernel that loads the weights out of the index and does the adapter ops. It's inspired by cross-attention in RETRO, but there are some differences for training stability and for using it as an adapter rather than training from scratch.

I consider that paper an early draft, hot off the press so to speak; it needs review and editing before we would submit it to a conference. I tend to prefer a few rounds of open review before a final submission these days anyway, so I appreciate the feedback.

I think the main idea should be reproducible - you can repeat the randomization and generalization tests with any LLM and get similar training curves and eval results - it just wouldn’t be efficient.

We have tried it on about 5 real customer use cases with different facts and good success. Obviously we can't publish customer data for reproduction, which is why we focused on the randomization tests in the paper.

There are also some hyperparameters missing from the appendix that we will add eventually.


nit: I hate trying to work out what "10x fewer" or "10x less" mean.

Let's say I have a counter value, X = 100. I reduce that to 10. How can I phrase that in English?

"Value reduced to 10% of original value"

"Value reduced by 90%“

"New value is one tenth the original Value"

Using multiplication with a positive integer and saying "less" just seems incomprehensible when it flies by during a sentence, and I can't stop myself from mentally saying "No", like Neo at the end of the Matrix when the three agents fire a volley of bullets at him down the corridor: "No, this sentence stops right here while I pick it apart."


I don't understand how you can build up this sort of reaction but you still need to "work out" what it means. It sounds like you learned what it means just fine.

Also the most direct translation is "value reduced by a factor of ten".

"x" means factor, "fewer" or "less" means reduced.


"10 times less" contracted to "10x less" maybe?

The first sentence seems fine to me. The contraction much less so.


Sorry, only seeing this now. But this is precisely what I mean. What would "1 times less" mean? 0? It's just an awkward way to phrase something that can be said much more simply using a fraction or a percentage.


“Value 10x less now”

Sounds fine to me as a non-native speaker.


Here I was hoping that there would be some kind of regulatory framework or protections put in place for AI before it became smart enough to actually take over the world.

Being able to say "you are wrong, taking over the world is a bad idea" and have the model respond with "oh you are completely right, I am very sorry for that" was our first line of defense.

I wonder if this model will argue insistently that you are wrong if you try to tell it that 1+1=3, and if so, whether that extends to philosophical issues such as the model arguing back at you based on its formed opinions on ethics and history.


It feels like the two dumb ways to customize an open LLM are fine-tuning and RAG. The former is expensive and complicated; the latter adds complexity to your queries but doesn't require up-front compute for retraining.

I couldn't tell how expensive this is up front, or what complexity it adds to the setup. Anyone know?

It's definitely an interesting idea but if you have to pay $100k for all that LoRA, what margins are left over?


What do you think is complicated about RAG? I'm not arguing that it's effortless, but it's not that complicated? Genuinely interested to hear other people's pain points.


Well, you usually need to generate embeddings and then query and filter those, but it isn't that complicated, for sure.


"Hallucinations" are the creative aspect of LLMs, which is what they are more useful for- if anything we want more of them. We already have much simpler systems that search and regurgitate facts. We need more intelligent hallucinations that are consistent with and extend rather than conflict with the data.


Is it even possible to measure and distinguish the output as being hallucinated or not? All LLM output is hallucinated; it's only by statistics or chance that some of the output reflects facts, and we're only able to make that assessment because we can compare the output to facts. The model can't make that assessment itself.

Going from 50% "accurate" to 90% "accurate" may actually be more insidious because it changes the utility from being a coin flip to trying to determine which 10% is inaccurate, or downplaying the existence of inaccuracies because at 90% it is "mostly correct".


I would argue with your definition of hallucination here. It's just a value judgement we humans apply to output that departs from reality in a way we don't find useful.

We can control how diverse or creative output is via temperature. I am assuming that these new models could work at higher temperatures (i.e. more creative) while maintaining factual accuracy for things you care about. Or alternatively, keep the same temperature but have far fewer hallucinations (i.e. wrong answers).
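
For what it's worth, temperature just rescales the logits before sampling; a quick illustration with made-up next-token scores:

    import numpy as np

    def token_probs(logits, temperature):
        z = np.asarray(logits) / temperature
        e = np.exp(z - z.max())
        return e / e.sum()

    logits = [2.0, 1.0, 0.1]                     # made-up scores for three candidate tokens
    print(token_probs(logits, 0.2).round(3))     # low temperature: nearly deterministic
    print(token_probs(logits, 1.5).round(3))     # high temperature: flatter, more "creative"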


If fewer hallucinations are the goal, surely this is a bit over the top? If you have the ground-truth facts available, then fine-tuning for EVERY subject area seems like much more work than using the facts with retrieval-augmented generation and making sure the facts line up.


Creative idea. Any data on how long it takes to load the LoRAs and how much latency it adds to generation?


Hopefully someone reproduces the results with code... I can't find any code they shared.


They are offering a product/service, so going into too much detail would be bad business practice, no?

But true, I would love to see this open-sourced in the wild.


What is the hallucination rate of, for example, Llama 3 or GPT-4?


They claim 50% with fine-tuning and/or(?) RAG (unclear marketing phrasing IMO), and claim their method achieves 5% on the same task, which is apparently a text-to-SQL task set.


Can it win at Jeopardy?



