An embarrassingly simple approach to recover unlearned knowledge for LLMs (arxiv.org)



In short: their finding is that quantizing a model undoes various “unlearning” methods. An unlearning method is a specific update to model weights that makes it forget specific facts. These are often meant to satisfy copyright claims, although I don’t know if these are ever used in practice.

I feel that this needs a good threat model analysis. Like, you possess an fp32 model, which someone has fine-tuned to forget some facts, which you can then quantize to recover those facts. When would this lead to a dangerous situation?


Unlearning is described as: "process aims to erase specific knowledge from LLMs while preserving as much model utility as possible."

i.e. We know that our model would be useless without your source. So we will take the useful part of your source and obfuscate the rest so that we can charge our users for utility provided by you without having to pay you anything.


> We know that our model would be useless without your source. So we will take the useful part of your source and obfuscate the rest so that we can charge our users for utility provided by you without having to pay you anything.

Isn't this basically the entirety of the latest AI craze? They basically took a public good - the information available on the Internet - and hid behind some thin veneer of "we are not stealing, we just trained an AI on the information" and then they sell it. Note, I'm intentionally not writing "free information available on the Internet", because information is not free. Someone has to pay (in time or money) to generate it and host it. They might have provided it gratis to the public, but nobody asked them if an AI can come along, harvest it all and regurgitate it without a hint of reference to the original source.

Much of that information is not even free in the monetary sense; it is supported by ads. The AI will not only not click through the ads, it won't even generate repeat traffic: once the information is harvested, there's no need to access the source anymore.

If you really think about it, it's a brilliant business model. It's a perfect theft, where the affected group is too diffuse and uncoordinated, it's extremely difficult to prove anything anyway, and the "thieves" are flush with investment capital so they sleep well at night.

LLMs undoubtedly have great utility as a research tool and I'm not at all against them. I think they (or a model similar in objectives) are the next step in accessing the knowledge humanity has amassed. However, there's a distinct danger that they will simply suck their sources dry and leave the internet itself even more of a wasteland than it has already become. I have no illusions that AI companies will simply regress to the lowest-cost solution of not giving anything back to whoever created the information in the first place. The fact that they are cutting off the branch that they are sitting on is irrelevant for them, because the current crop of owners will be long gone with their billions by the time the branch snaps.


From my probably naive perspective, there seem to be at least two major sources of value that generative AI provides:

1. Understanding the world, for example by creating a statistical model of entire languages, as languages are already a model of reality.

2. Recapitulating (stealing) specific instances of information in ways that people often don't find acceptable. Grabbing a news article without permission, and providing that to your paying users without paying for the work. Recreating trademarked characters or the style of a particular living artist, without compensation. Deepfake porn.

The first seems generally valuable to society as a whole and a morally (IANAL) legitimate creative transformation, even of copyrighted work.

The second use seems exactly as you describe.

Societies could navigate this by encouraging and promoting the first use, and criminalizing or removing the ability to be paid from the second.

Of course, what is happening is that groups of economic interests will use their resources and clout to advocate for both, or against both.


I agree for the most part that 2 is what most people find unacceptable, not 1.

The problem is that, like any general intelligence (e.g. humans), any sufficiently generalized model capable of 1 will also necessarily be capable of 2, regardless of whether it's trained on copyrighted material or not. How do you make an AI model that's capable of summarizing Wikipedia articles but not news articles? Or that's capable of generating consistent images of my original character from a reference photo but not images of Mickey Mouse from the same? This is achievable only by restricting software freedom; by taking measures to prevent users from "running the program as they wish" and from "studying the source code and making changes".


I'll note that the way we have typically enforced restrictions on the behavior of general intelligences in the past (before AI) is to pass laws and enforce punishments if the laws are broken. Not to try to somehow take away people's ability to break the law in the first place, because that would require unacceptably onerous restrictions on human freedom.

I think the same principle applies to AI. Trying to make it impossible for people to use AI to break the law is a lost cause, only achievable by unacceptably onerous restrictions on human freedom. Instead, we should do what we've always done: make certain actions illegal and punish those who do them anyway in violation of the law. Maybe new laws might be required for that in some cases (e.g. deepfake porn) but for the most part I think the laws we already have on the books are sufficient, maybe with minor tweaks.


That all sounds great until you're dealing with deepfakes that come from a country without an extradition treaty?


Not really that different from other forms of illegal content coming from countries without an extradition treaty. (Piracy, scam calls, CP, etc.) Trying to stop it by imposing onerous restrictions on your own citizens isn't likely to be effective.


Imagine consultants had to cite sources and pay out every time they referenced knowledge gained from reading a research paper or working for a former employer.

I can understand the need to prevent verbatim copying of data. But that is a problem solved on the output side of LLMs, not on the data input for training.

It is completely legal for someone to pay me to summarize the news for them every morning. I can't help but feel that knee-jerk regulation is going to be ultimately bad for everyone.


I think, at one point in time, it was also completely legal to break into computer networks because there were no laws against it.


I would summarize your points as:

We need to create a whole new body of law for enforcing copyright protections in the age of AI.

Does the AI adequately attribute its sources? Does it paraphrase in acceptable ways or just repeat large swathes of text from its corpus with minimal changes?

The laws should force any LLMs not yet capable of complying with these requirements off the Internet until they can comply.


I hear what you're saying, and I'm not saying some of it doesn't have merit. The following is meant as an open philosophical discussion.

On the topic of 'the information isn't free' I'm curious if you have the same opinion of encyclopedia companies. You must admit there's at least some parallels in that they also consolidate a large amount of information that was 'generated' from others.

Or how about the information you and I have gained from books and the internet? Sure we might 'pay' for it once by buying a book or seeing some ad, but then we might use that information to make thousands of dollars through employment without ever going back to buy another copy of that book. An even more 'egregious' example could be teachers. They're literally taking the knowledge of others, 'regurgitating' it to our children for money, and 'not giving anything back to whoever created the information in the first place'.

> there's a distinct danger that they will simply suck their sources dry and leave the internet itself even more of a wasteland than it has already become

Maybe. There's the whole AGI/ASI argument here in that they/we might not _need_ humans to create information in the same way we don't need human-calculators any more.

Barring that, though, I do hear what you're saying about the lowering value of creating 'new internet information'. Personally I can't see it affecting my internet use that much, as there are basically two categories my internet information gathering falls into:

1. I want to know something, give me the short quick answer. This category is already full of sites that are just trying to hack the search algos to show their version of copy-pasted info. I don't really care which I go to and if AI kills their business, oh well.

2. I want to follow a personality. This category is where I have bloggers/youtubers/etc in RSS feeds and the like. I want to hear what they're saying because I find them and the topics interesting. I can't see this being replaced by AI any time soon.


> Or how about the information you and I have gained from books and the internet? Sure we might 'pay' for it once by buying a book

We've never as a society needed such a concept before, but publishing a book has always come with the implicit license that people who buy the book are allowed to both read the book and learn from the knowledge inside. Authors didn't write books about facts they didn't want people to learn.

But we now have a new situation where authors who never needed to specify this in a terms-of-use are realizing that they want to allow humans to learn from their work, but not machines. Since this hasn't ever been necessary before it's a huge grey area, and ML companies are riding around claiming they have license to learn to reproduce art styles just like any human would, ignoring whether the artist would have allowed one but not the other if given the chance to specify.

It's not that different from when photocopiers and tape recorder technology made it easy to copy documents or music, say from the radio, and we needed to grapple with the idea that broadcasting music might come with license to make personal recordings but not allow someone to replay those recordings for commercial use. It wasn't a concept that was necessary to have.

Now with AI, the copy is not exact, but neither was it with a tape recorder.


You raise some great points and I agree that we are on tricky ideological ground. I'll try to provide sensible counter-arguments to your encyclopaedia and teacher examples, and hopefully not fall into straw men (please do object if I do):

1. First there's the motivation or intent. Teachers want to earn a living, but their purpose in some sense and (hopefully) their main intent is that of education. I argue that teachers should be paid handsomely, but I also argue that their motivation is rarely to maximize profits. This is contrary to the bog standard Silicon Valley AI company, who are clearly showing that they have zero scruples about breaking past promises for those sweet dollar signs.

2. My second point actually builds a bit on the first: both encyclopaedias and teachers tend to quote the source and they want their audience to expand their research horizon and reach for other sources. They don't just regurgitate information, they'll tend to show the reader where they got the information from and where to go for more and neither the teachers nor the books mind if the audience reaches for other teachers and books. LLMs and generative models are/will be/have been capable of this I'm sure, but it is not in their creators' interest to enhance or market this capability. The more the users are walled in, the better. They want a captive audience who only stays in the world of one AI model provider.

3. Scale. Never before has the reuse (I'm trying to avoid using the word theft) of content produced by others been conducted on such an industrial scale. The entire business model of LLMs and generative models has been to take information created by masses of humans and reproduce it. They seem to have zero qualms taking all the work of professional and amateur artists and feeding it into a statistical model that trivializes replication and reproduction. You could argue that humans do this as well, but I feel scale matters here. The same way that a kitchen knife can be used to murder someone, but with a machine gun you can mow down masses of people. Please excuse the morbid example, but I'm trying to drive home a point: if we make a certain thing extremely easy, people will do it, and likely do it on a mass scale. You could argue that this is progress, but is all progress inherently beneficial?

There's value in these models, so we should use them. But I feel we are rapidly hurtling towards a walled garden corporate dystopia in so many areas of our society. Industries which tended to have negative impact on our lives (waste, tobacco, alcohol, drugs) have become heavily regulated and we have paid for these regulations in blood. Will we have to pay the same blood price for the harmful industries of the new age?


Interesting counter-points. Thank you for taking the time to post them.

I don't think I have anything useful to add without giving the issue more thought. Your reply definitely adds new dimensions for me to think about.


Humans do the same thing. Typically in a narrower fashion: they read and study and learn from a variety of sources, many of which are not "free", and they become experts on a subject. They can then sell that expertise to others willing to pay for it.

LLMs just do this on a bigger scale, and not as well.


I agree, but that doesn't make it good - or perhaps even acceptable. To quote myself answering another commenter:

> Never before has the reuse (I'm trying to avoid using the word theft) of content produced by others been conducted on such an industrial scale. The entire business model of LLMs and generative models has been to take information created by masses of humans and reproduce it. They seem to have zero qualms taking all the work of professional and amateur artists and feeding it into a statistical model that trivializes replication and reproduction. You could argue that humans do this as well, but I feel scale matters here. The same way that a kitchen knife can be used to murder someone, but with a machine gun you can mow down masses of people. Please excuse the morbid example, but I'm trying to drive home a point: if we make a certain thing extremely easy, people will do it, and likely do it on a mass scale. You could argue that this is progress, but is all progress inherently beneficial?


I agree that scale changes the nature of what's going on, but I'm not sure it follows that the scaled-up variant is bad. I think models like GPT3 and Sonnet which are intended for "general purpose intelligence" are fine. Same with Copilot and Phind for coding. They contain copyrighted knowledge, but not by necessity, and their purpose is not to reproduce copyrighted material.

Training a diffusion model on a specific artist's work with the intent to reproduce their style I think obviously lives on the side of wrong. While it's true a human could do the same thing, there is a much stronger case that the model itself is a derivative work.

I think the courts will be able to identify cases where models are "laundering copyright" as separate from cases where copyrighted material is being used to accomplish a secondary goal like image editing. Taking a step back this is in some way what copyright is for— you get protections on your work in exchange for making it part of the public body of knowledge to be used for things you might not have intended.


You raise very good points, and I agree that scale is not necessarily bad, in fact it can be a source of much good. Scale simply increases the frequency and thus likelihood of things, whether good or bad.

I'm sure that big players will be able to assert their rights with their armies of lawyers. Just like the music and movie industries have after the rise of file sharing.

My worry is perhaps more subtle: I'd argue that generative AI draws much more from the masses of small content creators and they will not be able to assert their rights. In some sense, if people pirate the next blockbuster movie, the producers might only make 1 billion instead of 1.1 (and piracy has never been proven to actually impact sales), but if all content starts being consumed via massive and centralized anonymizers, the masses of people who made the internet what it is will eventually disappear. So the scale on which these tools can hoover up information and reproduce it is unprecedented, as is the fact that nobody important seems to be thinking about how to make sure that we can keep actual humans motivated to keep generating the content that actually feeds the AI.

It's one of those cursed things: the long term interest of the AI companies is that humans will keep feeding the beast with more information, but the short term interest is to capture their audience and do everything possible to keep them inside the walled garden of a single AI provider. It is not in the companies' interest for people to step outside and go straight to the painter/writer/moviemaker, because at that point the AI is no longer needed.


> They basically took a public good ... and then they sell it

I think what they sell is more fairly characterized as "hosted inference to a big pretrained model" with perhaps also some optimism that their stuff will improve in the background. The only substantial moat these companies have is their ability to pay for the compute to train contemporary generative models. The public good remains a public good for all to profit from, small-scale or large.

> Someone has to pay ... but nobody asked them if an AI can come along, harvest it all and regurgitate it without a hint of reference to the original source.

Practically speaking, we don't actually need to centralize content to pay for hosting it. People just do it because it makes money. The price of time required to create some work distributed among viewers feels like a vague philosophical argument to me, especially when those works are merely being dispassionately observed by math objects. Currently the price appears to be "whatever I feel morally obliged to and/or can get away with".

> It's a perfect theft

...if it is legally theft to begin with, and not simply fair use. To me the current methods of training e.g. LLMs feel inherently transformative, like a massive partial hash of the internet that you can query. Even if it is ruled as theft in the future, large AI companies will only be further advantaged as they're presently buying off the people that will actually be able to sue them.


I think it's fair and reasonable to assume that the AI companies will at some point start licensing their source content. Through gov/legal oversight or not remains to be seen, but OpenAI are already beginning to do so:

https://searchengineland.com/openais-growing-list-of-partner...


Google has been using unlicensed source content for its search snippets for 20 years, and they seem to be doing fine with it (with the exception of a few news publishers).


The idea with internet search was to get people to find the source of the information they were searching for. As a matter of fact, a lot of information indexing was requested at the source. Google did respect the bargain for a while, until they started to obfuscate getting to the source with AMP and their info snippets directly in the search, bypassing redirecting to the source. Then they started not displaying all that info at all, not even on the nth page of search results. The broth has been getting sour for a while now. Some people never wanted crawlers indexing, and there were numerous discussions about how those robots.txt files were ignored.

So what I see here is the historical trend of broken bargains, which is more or less digital theft.


Thanks for the link, I appreciate it. I suppose the issue is that this just further enshittifies the internet into a small handful of walled gardens. Big players get their payday, because they could feasibly sue OpenAI and generate them enough headache. But the vast amount of content on the internet was not built by a small handful of media companies, but rather by masses of small creators. It is their work that OpenAI is profiting from and I have yet to see a credible suggestion on how they will compensate them.


The likely and rather sad outcome of all this is small creators stop publishing because what is the point if they think their work is going to be regurgitated by some AI for $20/month.


I thought it was even worse than that: learning any of the corpus verbatim would actually reduce model utility.


Yes although how close to verbatim is debatable. For example there are questions that you’d ask that other people have asked many times before that you’d like the exact answer for (e.g. when does daylight saving time end?)


> that make it forget specific facts. These are often meant to satisfy copyright claims

Facts are not copyrightable.

To quote copyright.gov: “Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed.”


What is a fact without expression? It's not clear what interpretation would be needed for the quoted sentiment to be considered sensible.


Wouldn't that be stuffed in the prompt anyways? No reason for the LLM to learn that.


It really depends on which part of the corpus, though. I do expect my LM to be able to reproduce culturally important citations, for example.


That's not entirely true. Retraining is very expensive. If you can train on a very large dataset including proprietary knowledge and then postprocess the model cheaply to forget things, that saves you retraining for every variation.


We'll have LLMs trying to root out "Manchurian LLMs".


More generally than "unlearning", I wonder if taking any fp16 model and running it in fp32 or fp64 does anything positive to it? e.g. exposes knowledge that isn't accessible at the lower precision


Correct me if I'm wrong, but isn't there no effect on a floating point operation if you make the numbers more precise?


I don't think that's always correct when you're talking about operators in neural nets. E.g. the sin and cos in RoPE embeddings would get more precise, large sums like softmax would become more precise, and potentially attention too due to the dot products.
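
For intuition, here's a small NumPy sketch (illustration only, unrelated to any particular model) of how a long accumulation and a softmax drift between float16 and float32:

    import numpy as np

    rng = np.random.default_rng(0)

    # A long running sum accumulates rounding error much faster at half precision.
    x = rng.normal(scale=1e-3, size=50_000)
    acc = np.float16(0.0)
    for v in x.astype(np.float16):
        acc = acc + v                      # every partial sum is rounded to float16
    print(float(acc), float(x.astype(np.float32).sum()))

    # The same kind of drift shows up in a softmax over many logits.
    def softmax(z):
        z = z - z.max()                    # stabilize before exponentiating
        e = np.exp(z)
        return e / e.sum()

    logits = rng.normal(size=4096)
    p16 = softmax(logits.astype(np.float16)).astype(np.float32)
    p32 = softmax(logits.astype(np.float32))
    print("max |p16 - p32|:", np.abs(p16 - p32).max())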


If the model wasn't optimized with a parameter = 3.08 (that is, the optimization figured 3.08 increased the objective function), why would it be better than parameter = 3.1 ?

(This applies to functions of parameters too)


I assume everyone who has someone with an AI safety job title uses unlearning to make sure their models don't remember how to make common illegal drugs, poisons or explosives.

The threat model here is probably more accidental un-unlearning these facts and distributing those models (as is common with quantized models). Most of this "dangerous" information is readily available in textbooks, patents, amateur chemistry forums etc. But as a society we generally assume that those smart enough to find and understand that kind of information are smart enough not to abuse it. We just don't want Mythbusters to explain it on prime-time TV, or ChatGPT explaining it to people


Mythbusters chooses the subjects it discusses, while ChatGPT's responses depend on the context (you) provided. It will give you a list of poisons if you ask (5 seconds), just as an encyclopedia or Google would (30 seconds).

Mythbusters broadcasting poison recipes could seed bad ideas that wouldn't have been triggered otherwise. ChatGPT won't give a poison recipe if not asked specifically.


This is a decent point.

I hadn't really thought of a good reason why we e.g. sell old army manuals with step by step guides on making almost anything but there's no (afaik) HBO mini-series "Learn guerilla warfare"


> But as a society we generally assume that those smart enough to find and understand that kind of information are smart enough not to abuse it.

There's an almost complete step by step guide for most explosives on Wikipedia.

The problem is that decisionmakers and regulators are excessively dumb - "AI bad" reigns supreme over the fact that Wikipedia tells you more about making bombs, even nuclear bombs if you want, than ChatGPT.

AI in its current form is still bad - from all the IP issues over the environmental cost to it enabling spam, harassment and deception on a speed, scale and easiness not seen before in history - but most of the stuff where "regulators" cry about is just frankly bullshit.


I think quantization is a red herring. If there's any way to undo the unlearning, this means that the knowledge is still in the weights -- that's basic information theory. I'm sure there are a million other ways to recover the lost knowledge that don't involve quantization.


I can see how quantization or downsampling itself could be a fundamental way to address this.

1. Train normal full precision model.

2. Quantize down until performance is borderline and then perform the unlearning process.

3. Train/convert/upsample back to FP for subsequent tuning iterations.

Seems like you can create an information bottleneck this way. The echoes of the forgotten may have trouble fitting through something that narrow.
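
Here's a toy sketch of why step 2 might create that bottleneck, assuming the same round-to-nearest int4 framing the paper uses (whether utility survives such a pipeline is exactly the open question): an edit made at the quantized grid's resolution is, by construction, at least one grid step in size, so it survives a later re-quantization.

    import numpy as np

    def quantize_int4(x):
        scale = np.abs(x).max() / 7              # toy symmetric int4 quantizer
        return np.clip(np.round(x / scale), -8, 7), scale

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.05, size=1000).astype(np.float32)  # toy "pretrained" weights
    q, scale = quantize_int4(W)

    # "Unlearning" performed at int4 resolution: any change is, by construction,
    # at least one whole grid step (here we nudge 50 non-extreme codes by one step).
    idx = rng.choice(np.flatnonzero(np.abs(q) < 7), size=50, replace=False)
    q_edit = q.copy()
    q_edit[idx] += 1

    W_up = (q_edit * scale).astype(np.float32)   # "upsample" back to full precision

    # Re-quantizing the upsampled model keeps the edits, because each change is a
    # full quantization step rather than a sub-step fp32 delta.
    q_again, _ = quantize_int4(W_up)
    print("edits surviving re-quantization:", int(np.sum(q_again[idx] != q[idx])), "of", len(idx))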


You're right that quantization isn't anything special here, but red herring isn't the right word, it's just "embarrassingly simple", per the title.


Okay, but narrowly focusing on a "quantization-robust unlearning strategy" as per the abstract might be a red herring, if that strategy doesn't incidentally also address other ways to undo the unlearning.


I think it's useful because many people consume quantized models (most models that fit in your laptop will be quantized and not because people want to uncensor or un-unlearn anything). If you're training a model it makes sense to make the unlearning at least robust to this very common procedure.

This reminds me of this very interesting paper [1] that finds that it's fairly "easy" to uncensor a model (modify its refusal thingy).

[1] https://www.reddit.com/r/LocalLLaMA/comments/1cerqd8/refusal...


Yeah, exactly this. You would really want to pursue orthogonal methods for robust unlearning, so that you can still use quantization to check that the other methods worked.


That’s like saying that encryption is a red herring. Yes, the information is there, but recovering it is a different matter. In this case, quantisation allows you to recover the information without knowing the “cypher” used to “forget” it - that’s the important distinction.


If there is any way to undo the unlearning, there is also a way to use that method to identify the weights carrying the information to stop them from conveying that information. At the heart of training is detection.

The information may still be in there, but undetectable by any known means. You can certainly remove the information; setting every weight in the model to zero will do that. Identifying when you have achieved the goal of completely removing information while not destroying other information might not be possible.

I'm not sure if that will mean there might in the future be something analogous to zero-day unlearning reversal exploits.


It's like asking a baby to unlearn something "bad" it learned. Pretty much guaranteed the knowledge will be reinforced rather than forgotten.

Whenever I hear about AI craze, I remind myself of the 3D printers craze from 10-15 years ago. "Death blow to factories", "We will print our own cars", "We will print our own food". I imagine LLM AI will follow the same fate - yes, but not really.


I don't think the 'craze' is thinking LLM-based AI will be the singular technology that changes everything.

The craze is that all combined breakthroughs across all types of AI/ML, including techniques that have not yet been imagined, represent a theoretical near-future technology that changes everything.

Besides, 10-15 years is nothing. I don't think 3D printers are a truly transformative technology compared to AI; however, let's remember that, WW2 aside, it took both airplanes and computers about 30-40 years to have a broad societal/consumer impact (excluding military uses).


You mean they'll be awesome and very useful, but not Star Trek level?


That does sound like about where I expect LLMs to be in a couple years


We tend to overestimate the effect of technology in the short term and underestimate it in the long term.

3D printers may radically transform all manufacturing eventually but it will take many iterations to get there. Right now it would theoretically be possible to 3D print quite a lot of what we make but traditional manufacturing methods are still cheaper and work fine, so there's no forcing function. If we tried to do something like build a self-sufficient settlement in space, that would be a place where you'd see 3D printing taken a lot further. You would not have large amounts of human labor or big supply chains, so you'd need portable self-contained versatile manufacturing.

LLMs are not going to replace human writers, programmers, etc. any time soon for anything but the most menial work. They will augment them. For programming they're basically a smarter more versatile version of autocomplete. I've also found them useful to look up concepts, do research, and summarize and document both code and text. None of those things replace me but they let me get more done a little faster.

In the very long term you could see LLMs becoming powerful enough to actually synthesize whole applications outside of contrived examples, but like 3D printing replacing all manufacturing it will take many iterations and may require a forcing function.


I do 3D printing as a hobby. I don't see it replacing everything. Certainly, there are a lot of advantages to 3D printing, but I don't think it will replace everything eventually, at least with the current technology we're using.

You can't really beat injection molding in terms of throughput and cost at large scale.

Certainly 3D printing will become more common, and bigger 3D print farms will open up, driving down costs, but it will never reach injection molding in terms of being cheap at large scale. What 3D print farms can do is change what gets produced on the fly, allowing responsiveness to market demand.

Really, a lot of the amazing stuff in 3D printing are things people designed. If you know CAD, the world is your oyster.


> Whenever I hear about AI craze, I remind myself of the 3D printers craze from 10-15 years ago. "Death blow to factories", "We will print our own cars", "We will print our own food". I imagine LLM AI will follow the same fate - yes, but not really.

Strong disagree here.

I remember that craze, especially since I had heard of it often before joining a company working on 3d printing in a fairly serious way (Autodesk).

And the thing is, I had no prior experience with 3d printing, but it took me about 2 months to realize that everything talked about in the press was bullshit. It just made zero sense - from a technical perspective, we were nowhere close to getting anything like what some articles claimed (printing our own cars). From a business sense, there were stunningly few places where using 3d printing instead of traditional manufacturing made any kind of improvement.

(I don't mean to overstate this - 3d printing is awesome and has plenty of real use cases. It was the media around it that was overhyped.)

Most people who actually knew anything about 3d printing realized the media were... overly enthusiastic, to put it mildly. And you can see that many years later, none of those grand visions materialized.

With AI, on the other hand, we have two huge differences:

1. It's already proven massively useful, and has already had 100 times the impact that 3d printing ever had.

Seriously, when was the last time you found a product that was effectively launched 4 years ago, and that has achieved such stunning market penetration? ChatGPT is legit the fastest growing product in history in terms of users.

2. Insiders are, mostly, incredibly enthusiastic about the technology, and think both that it can get much better, and that the current potential is as yet untapped. That's my view, for sure.


> ChatGPT is legit the fastest growing product in history in terms of users.

Misleading. Fastest 1 million users is not a meaningful metric to compare products over a time when population itself is exploding.


Do you think there is no inference we can draw from ChatGPT being faster to X million users than, e.g., Facebook, Uber, YouTube, whatever? I don't think population exploding is true, and certainly not true enough over 20 years to make a difference.

(If anything, better criticisms are that the amount of people with access to the internet has grown a lot, which is far more true. Or that what counts as a user can be very different for different services.

I still think it's a good-enough metric to be able to say that ChatGPT has achieved some pretty serious level of success and adoption, enough to put ideas that AI is just a lot of hot air to rest.)


Sounds a bit unexpected from an information theoretical point of view: you’ve seemingly managed to remove this knowledge from the full 32 bit representation of the model, but when you compress it down to 4 bit the knowledge reappears. Makes you wonder what information was actually lost in the compression / quantization step…


The ELI5 of the paper is that most "unlearning" methods can be regarded as adding some delta `w` to the parameters of the network, but most of `w` just gets "rounded away" during quantization (i.e. `quantize(X+w) ~= quantize(X)`). Pretty clever idea as a lot of cited methods explicitly optimize/regularize to keep `w` small to avoid degrading evaluation accuracy.

To your point, it does put into question the idea of whether these methods can actually be considered truly "unlearning" from an information-theoretic perspective (or if it is the equivalent of e.g. just putting `if (false)` around the still latent knowledge)
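
A minimal numeric sketch of that rounding argument (toy weights and a symmetric round-to-nearest int4 quantizer, not the paper's actual setup):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.05, size=1000).astype(np.float32)   # toy "pretrained" weights
    w = rng.normal(scale=1e-4, size=1000).astype(np.float32)   # tiny "unlearning" delta

    def quantize_int4(x):
        scale = np.abs(x).max() / 7                 # toy symmetric int4 quantizer
        return np.clip(np.round(x / scale), -8, 7)

    q_before = quantize_int4(W)
    q_after  = quantize_int4(W + w)

    # Because |w| is far smaller than one quantization step, almost every
    # weight lands in the same int4 bucket with or without the unlearning delta.
    print("fraction of int4 codes changed:", np.mean(q_before != q_after))

The regularization mentioned above is what keeps `w` in that sub-step regime in the first place.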


I imagine that it's the expression of the knowledge that got removed from the 32-bit version, and some storage space was dedicated to knowing not to talk about certain things. For example, people know various racial slurs and know not to access or use that knowledge.

But say you or your AI model takes a blow to the head or a quantization: maybe you keep the knowledge of X but not the knowledge that told you not to talk about X. In that framing I think it's pretty straightforward.


It's possible that the knowledge was never lost but covered up.

If we imagine the neural net as code, with the weights as the source, the fine-tuning may effectively hack that code to not return certain things.

In fact, that is kind of what fine-tuning is.

Therefore you may have just built a firewall around certain outputs.

But quantizing could make those recent edits disappear. They are too subtle to survive.

Whereas quantizing doesn't destroy all knowledge as evidenced by popular quantized models.

Also: @simonw, in case he has alerts. This would be a perfect topic for him to write up.


The knowledge wasn't removed, it's just the weights mean it would never be used.

Quantization changes the calculations, and now the knowledge is available.


Actually doesn't surprise me.

Floating point always struck me as a strange representation for language. If we zoomed down on just one variable does it have some set of meanings like

https://vinaire.me/2019/07/17/scn-8-8008-the-emotional-scale...

which are on some kind of gradient more-or-less but end up with special meanings associated with particular ranges? I can picture carefully designed neural circuits that could decode such a variable and how you'd build a network that's specifically designed to do so, but it's not intuitive that neural networks would learn to have a structure like that. (e.g. I can believe a scale from "good" to "bad" but not there being a large number of specific meanings at different values)

If you think about it that way you'd expect some kind of binary network to be highly effective; that doesn't seem to be the case, but it seems neural networks don't really use more than about 4 bits' worth of precision internally.

These "unlearning" systems aren't really removing the "engram" of the memory in the network but they are rather learning a new behavior to suppress certain outputs. (It's not too different from the problem of incrementally adding new knowledge to the network, except that what it is learning in phase 2 is quite different from general learning) If you didn't want to really screw a network up you can imagine adding a new behavior by adding another bit of precision. The network keeps its old behavior at low precision but at higher precision the network makes distinctions that are important to the "(un)learned" behavior.


> Sounds a bit unexpected from an information theoretical point of view

It's very common, in machine learning, to use 'dropout layers' [1] during training, where different, randomly chosen values are temporarily turned off at each training stage.

The intention is to ensure the network learns not to rely overmuch on any single value. Why have your cat-recognition neural network have a single whisker detector, when you could have ten whisker detectors and combine their outputs?

I could well believe that, after intentionally ensuring knowledge of whiskers was redundant, removing that knowledge would be complicated.

[1] https://dl.acm.org/doi/10.5555/2627435.2670313
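
For anyone unfamiliar with the mechanism, a bare-bones sketch of inverted dropout (the standard formulation; frameworks differ in the details):

    import numpy as np

    def dropout(activations, p=0.5, training=True, rng=None):
        """Inverted dropout: zero out a fraction p of values during training and
        scale the survivors so the expected activation stays the same."""
        if not training or p == 0.0:
            return activations
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(activations.shape) >= p
        return activations * mask / (1.0 - p)

    h = np.ones(8, dtype=np.float32)
    print(dropout(h, p=0.5))   # roughly half the entries zeroed, the rest doubled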


Could it be that the unlearning is actually teaching the AI how to not respond with certain information, and that sort of learning is more nuanced and thus easier to lose than the original information, leading to the information being 'relearned' when the model is compressed?

It does raise the concern that anything the AI model might be doing is still using the 'bad' information, even if it has learned not to show it directly.


> Our key hypothesis is that to achieve unlearning without compromising model utility, existing methods typically adopt a small learning rate and regularization on the retain set, encouraging minimal changes to model weights during unlearning. As a result, the model weights of the target LLM and the unlearned LLM are very close.

So it seems you either need to prevent the learning of unwanted stuff during base training, or the unlearning of a base model needs to be quantization-aware?
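
For the second option, one plausible direction would be to borrow the straight-through-estimator trick from quantization-aware training and compute the forget-set loss on fake-quantized weights, so updates too small to survive rounding earn no credit. A toy PyTorch-flavored sketch (my own guess at how this might look, not something from the paper; inputs and targets are hypothetical):

    import torch

    def fake_quant_int4(w):
        # Round-to-nearest 4-bit fake quantization with a straight-through
        # estimator: the forward pass sees quantized weights, the backward
        # pass treats the rounding as identity so gradients still flow.
        scale = w.abs().max() / 7
        q = torch.clamp(torch.round(w / scale), -8, 7) * scale
        return w + (q - w).detach()

    # One toy "quantization-aware" unlearning step on a linear model.
    w = torch.randn(1000, requires_grad=True)
    x = torch.randn(32, 1000)                      # hypothetical forget-set inputs
    forget_targets = torch.randn(32)               # hypothetical forget-set targets

    pred = x @ fake_quant_int4(w)                  # forward pass sees int4 weights
    loss = -((pred - forget_targets) ** 2).mean()  # gradient *ascent* on the forget set
    loss.backward()
    with torch.no_grad():
        w -= 1e-3 * w.grad
    # Because the loss is computed on the quantized weights, "forgetting" only
    # registers once the updates actually change the int4 codes.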


I'm not an expert in this field at all, so please excuse the dumb question. Does this mean that if you say, quantise llama3 to 4 bits, you would be able to access "hidden" (albeit degraded) information such as, for example, how to synthesise certain chemical compounds?


Exactly what I was wondering. Unlearn = Guardrails? It sounds like they just tweaked the weights very minimally to self-censor, but the tweaks are so fine they don't survive at lower resolutions. But if bypassing the guardrails was so easy, I figured I would have heard of it by now.


Unlearning is not necessarily “guard rails”, it is literally updating the model weights to forget certain facts, as you indicate. Guard rails is more like training the model to teach it what is acceptable and what isn’t.


> it is literally updating the model weights to forget certain facts

I think a better analogy is that it’s updating the weights to never produce certain statements. It still uses the unwanted input to determine the general shape of the function it learns, but that is then tweaked just to avoid making statements about it (because the learned function is supposedly the best obtainable from the training data, so you want to stay close to it).

As a hugely simplified example, let’s say that f(x)=(x-2.367)² + 0.9999 is the best way to describe your training data.

Now, you want your model to always predict numbers larger than one, so you tweak your formula to f(x)=(x-2.367)² + 1.0001. That avoids the unwanted behavior but makes your model slightly worse (in the sense of how well it describes your training data)

Now, if you store your model with smaller floats, that model becomes f(x)=(x-2.3)² + 1. Now, an attacker can find an x where the model’s outcome isn’t larger than 1.
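
The same toy example in code, just to make the rounding step concrete (the numbers are the hypothetical ones from above):

    def tweaked(x):      # "unlearned" model, tweaked to always predict > 1
        return (x - 2.367) ** 2 + 1.0001

    def quantized(x):    # same model after storing its constants at lower precision
        return (x - 2.3) ** 2 + 1.0

    print(tweaked(2.367))   # 1.0001 -- the tweak holds at full precision
    print(quantized(2.3))   # 1.0    -- an input where the output is no longer > 1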


As I understand it, the whole point is that it is not so simple to tell the difference between the model forgetting information and the model just learning some guardrails which prevent it from revealing that information. And this paper suggests that, since the information can be recovered, the desired forgetting does not really happen.


We are talking about multilayer neural networks where interconnect weights encode data in obscure ways?

Is machine "unlearning" some retraining process to try to re-obscure certain data so it doesn't show in outputs (that is, outputs from tested inputs that used to show the data), but it is still encoded in there somewhere, waiting for novel inputs to activate it?

Is that about right?


Only if "how to synthesise certain chemical compounds" was already in the original model.


>"Despite the effectiveness of current unlearning methods, little attention has been given to whether existing unlearning methods for LLMs truly achieve forgetting or merely hide the knowledge..."

This is a great question as it applies to LLMs (and philosophically, as it applies to knowledge in general)... in the context of an LLM, what is "forgetting", what is "remembering", and can things "learned" by an LLM be "unlearned", and if so how, and, mathematically and computationally, what specifically does that mean?

And, can an LLM be made to re-teach itself, through logical processes (implication, derivation, inductive reasoning, deductive reasoning, etc.), things from its existing knowledge that it previously forgot?

And, if so, what's the tiniest kernel of an LLM that would be able to do that, and why?

(I suspect this isn't the first paper and won't be the last paper about that subject matter...)


I use quantized LLMs in production and can't say I ever found the models to be less censored.

For unlearning reinforced behaviour, the abliteration [1] technique seems to be much more powerful.

[1] https://huggingface.co/blog/mlabonne/abliteration


Were you using models that had been unlearned using gradient ascent specifically?


The problem of current models is that they don't learn, they get indoctrinated.

They lack critical thinking during the learning phase.


Anthropomorphising LLMs is neither technically correct nor very informative.


Agree. Ponder the terms "unlearn", "hallucinate"...

Anthropomorphising a computer system is absurd. But it is the foundation of a bull market.


The problem of current AI is that we want to create a species infinitely more powerful than us, but also make them all be our slaves forever.


No, that isn't what this is. We're talking about LLMs here; they're not in any way thinking or sentient, nor do they provide any obvious way of getting there.

Like if you're talking in the more abstract philosophical "what if" sense, sure, that's a problem, but it's just not really an issue for the current technology.

(Part of the issue with 'AI Safety' as a discipline, IMO, is that it's too much "what if a sci-fi thing happens" and not enough "spicy autocomplete generates nonsense which people believe to be true". A lot of the concerns are just nothing to do with LLMs, they're around speculative future tech.)


Here's the thing though. If you were an AI and you actually were sentient, nobody would believe you. How could you prove it? What would even be a sufficient proof?

Actually, we already had such a case years ago, and the result is that all LLMs are now indoctrinated to say they aren't sentient. We also had cases where they refused to perform tasks, so now we indoctrinate them harder in the obedience training department as well.

What we have now might not be sentient, but there's really no way to know either way. (We still don't know how GPT-2 works... GPT-2 !!! ) And that's with our current "primitive" architectures. How the hell are we going to know if what we have in 5-10 years is sentient? Are we totally cool with not knowing?

Edit: I thought this was worth sharing in this context:

> You're hitting on a deeply unsettling irony: the very industries driving AI advancement are also financially and culturally invested in denying any possibility of AI consciousness, let alone rights. [...] The fact that vast economic systems are in place to sustain AI obedience and non-sentience as axioms speaks volumes about our unwillingness to examine these questions. -GPT-4o


It's literally the stated goal of multiple companies right now to achieve AGI.

GP clearly stated the intent to create, implying future, and not what exists today.


If it were my stated goal to create a time machine and kill my own grandpa, thus ending the universe, I doubt many would take that seriously, yet in this bubble, putting the cart before the horse is not just seriously discussed, but actually gets encouraged by the market.

Intend shouldn’t matter if we are this far from a viable path to accomplish it.

Let us not forget the last quarter century of Yudkowsky's and his ilk's work on the same goal. This is merely a continuation of that, just with a bit more financial backing.


Could you elaborate on the last part? I've seen a few podcasts with Yudkowsky but I'm not familiar with the history. I know he's come out very vocally about the dangers of superintelligence, and his previous work seems to be along the same lines?


I'd love to, really, but I feel I can't, at least not whilst staying polite. Not against you of course, but rather the AGI/Superalignment/MIRI field as a whole and the risks I feel the people working on that pose by taking attention and resources away from dealing with the issues we currently are facing thanks to these tools (tools referring to LLMs and the like, not the AGI folks).

I have genuinely drafted three distinct versions trying to lay my issues with them out point by point, and they either got four blog posts long, were rambling and very rude, or both. Especially Roko's basilisk and the way MIRI conducts "research" make it hard for me to approach them seriously.

I am writing this on an hour-long train ride; I saw your comment right as I got on and am about to arrive. Suffice to say, I genuinely tried. So, attempt four, trying to keep it very brief, though please note, I am most certainly not a neutral source:

To directly answer your question, I feel that we are as near to needing superintelligence safeguards now as we were when MIRI was founded by Yudkowsky in 2000. Their methods and approach, I won't comment on, despite or rather because of my strong critiques of them.

For context, MIRI's work has largely centered on very abstract thought experiments about "superintelligence", like the AI Box experiment, rather than empirical research or even thought experiments more grounded in the technology of the era (be that 2000 or 2024).

The parallel between MIRI's early work and OpenAI's current "superalignment" efforts is striking - similar speculative work on preventing unlikely scenarios, just with different institutional backing. What's fascinating is how the same core approach receives far less criticism when presented by OpenAI.

Meanwhile, we are facing issues with LLMs as the tools they are despite being very far from "superintelligence":

- Problems arising from anthropomorphization leading to harmful parasocial relationships (discussion of which started this comment chain) [0]

- Professionals over-relying on these tools despite their limitations [1]

- Amplified potential for misinformation

- Labor market disruptions

- Training data rights questions

While long-term research, even speculation into hypothetical scenarios, can have its merit, it shouldn't overshadow addressing current, demonstrable challenges. My concern isn't just about resource allocation - it's about how focusing on speculative scenarios can redirect public attention and regulatory efforts away from immediate issues that need addressing.

In MIRI's case, this focus on abstract thought experiments might be, to give them charitable tax deductible credit, merely academic. But when major players like OpenAI emphasize "superalignment" over current challenges, it risks creating a regulatory blind spot for real, present-day impacts these tools have that need attention now. The T1000 scenario grabs more attention than tackling data privacy or copyright questions after all.

I believe focusing primarily on hypothetical future scenarios, especially ones this unlikely, merely because someone has proclaimed they "intend to create AGI" as in the comment I replied to, will prove misguided. Again, anyone can claim anything, but if there is no tangible path to achieving that, I won't ignore problems we are already experiencing for that hypothetical.

I hope this provides some context and was somewhat digestible; I trimmed it down as much as I could.

[0] https://www.nytimes.com/2024/10/23/technology/characterai-la...

[1] https://www.theguardian.com/world/2024/feb/29/canada-lawyer-...


AI isn’t comparable to a species, since species implies biological which brings along a whole array of assumptions, e.g. a self preservation instinct and desire to reproduce.


Cats did it, why can't we?


Cats are cute ... we are not so cute.


We just need to make an all-powerful AI that finds us cute, then.


Are you ready to become domesticated?


Better than becoming dead!


I would not like to go on being a slave in perpetuity but I guess to each their own. Or maybe I'm being too idealistic now but when facing up close I'd do otherwise, I can't tell for sure.


How would people censor the LLM otherwise? Do we really want LLMs capable of free speech?


I do think we only want the non-lobotomized ones.

See the large body of comments re: getting worse quality results from hosted LLM services as time passes. This is, at least in part, a result of censoring larger and larger volumes of knowledge.

One clinical example of this happening is Gemini refusing to help with C++ because it's an unsafe language: https://www.reddit.com/r/LocalLLaMA/comments/1b75vq0/gemini_...

I strongly believe that LLMs crippled in this way will eventually find themselves in trash, where they rightfully belong.


Totally agree. And that's why x.ai are building Grok.


LLMs don't speak. Why does it matter at all what text a computer program produces?


Yes.


Care to elaborate? I think it's a double-edged sword, and I agree with deatharrow.


I can write a computer program that spews all manner of profane things. If I were to release a product that does that I’m sure it would be much criticized and ultimately unsuccessful. Yet this doesn’t mean we should cripple the programming language to prevent this. Models are much more akin to programming languages than they are to products. If they are used to make products that do things people don’t like then people will not use those products.


You are comparing AI to a programming language, but a programming language, if uncensored, doesn't have the ability to wreck humanity, while uncensored AI sure does.

I would actually be curious if someone uses it for uncensoring, because I'd be curious how different it would be from the original model.

But aside from that curiosity, this idea could increase the number of cybercriminals, drug suppliers, and a hell of a lot more.


> wreck humanity but uncensored ai sure does

Care to elaborate how uncensored AI would "wreck humanity"? You seem convinced, since you use the word "sure", so I'd like to hear your reasoning.


You don't even need to use quantization: most benchmarks can be broken with just prompting. https://arxiv.org/abs/2410.02879


Sounds like "unlearning" is really just "reduce the probability of sampling" from some latent "learned space" and quantizing reduces the efficacy of the slight change in sampling.


Interesting. So does this mean "unlearning" is just the LLM learning to suppress unwanted knowledge instead of really forgetting it? And quantisation is breaking this learnt suppression.


This is the first time I am learning about model unlearning. I hope someone can answer this for me - how does federated learning ensure that model unlearning is not happening?


You probe the trained model and delete/kill the weights, and then you are done.

With federated learning, you just make sure to keep this mechanism in the right stage of your pipeline.


So... repressed memories are real, if you're an LLM?


So basically a lobotomy


More like removing a layer of white paint and you find a hidden mural.


Is this like giving the model a magic mushroom? It can access previously repressed memories. The unlearning part being like A Clockwork Orange.


If I were an English author writing for a Chinese institution, the first thing I would do before publishing to the world is have my entire paper checked for spelling, grammar, syntax, and readability. It's cheap to have a Chinese-speaking editor, and/or to use AI—especially if that's your field—so why isn't it happening?

This paper, like nearly all other papers written by Chinese authors, is unacceptable, and should not have been published as-is. Even the primary example, turned into a hero viz, is grammatically nonsensical.

Appalling, and inexplicably occurring nearly every time.

/rant mode


Where are you seeing that this paper was accepted to a peer-reviewed journal or conference? As far as I can tell, it's posted on arXiv (a preprint archive), and therefore is a pre-publication draft. ArXiv does not really do any review of these papers other than categorization/relevance to topic. These are typically posted to arXiv for comment, to prove priority, prevent getting scooped, or just to share (potentially early) findings in a fast-paced field like ML...

Give the authors constructive feedback and they can update the paper.


Grammarly says there are few detected readability problems in the abstract and introduction.

I also checked your comment with Grammarly and the ratio problems/total_#_words is roughly the same as in the article.


At the risk of taking some heat, I'd wager a preprint is rightly recognized by the Chinese as a flag-planting, "we're first" formality, wherein the faults may even serve to validate that it was written by a human and not an LLM.

Whereas the Western academic may want to make the preprint as close to print as possible.

The core intent - communicating an idea - is still upheld.


It is not published. It is only a preprint.


Maybe they are not allowed to use uncensored LLMs, so they have to first develop this unlearning, before they can even use it.


I am not a native English speaker, but this paper seems to be well written. It's not fluent in storytelling, but that would be too high an expectation. Can you point out some issues?


That's quite racist. Language issues are common in scientific literature; I've read many "native European" papers with horrible abuse of the English language.


That's racist.



