Exactly what I was wondering. Unlearn = Guardrails? It sounds like they just tweaked the weights very minimally to self-censor, but the tweaks are so fine they don't survive at lower resolutions. But if bypassing the guardrails was so easy, I figured I would have heard of it by now.
Unlearning is not necessarily “guardrails”; it is literally updating the model weights to forget certain facts, as you indicate. Guardrails are more like training the model to teach it what is and isn’t acceptable.
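To give a rough sense of what “updating the weights to forget” can look like, here is a minimal sketch of one common approach (gradient ascent on the data to be forgotten). This is purely illustrative, not what any particular paper necessarily does; `model`, `forget_loader`, and the hyperparameters are hypothetical placeholders, and I'm assuming a Hugging-Face-style model whose forward pass returns a `.loss`.

```python
# Minimal sketch of gradient-ascent unlearning (one common approach).
# Assumes a HF-style language model where model(**batch) returns an
# object with a .loss when labels are included in the batch.
import torch

def unlearn_step(model, forget_loader, lr=1e-5, steps=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _, batch in zip(range(steps), forget_loader):
        outputs = model(**batch)       # standard LM forward pass
        loss = -outputs.loss           # negate: push loss *up* on the forget set
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

Note that the resulting weight updates are typically tiny nudges relative to the original weights, which is exactly why aggressive quantization can wash them out.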
> it is literally updating the model weights to forget certain facts
I think a better analogy is that it’s updating the weights to never produce certain statements. The model still uses the unwanted data to determine the general shape of the function it learns, but that function is then tweaked just enough to avoid making statements about it (the learned function is supposedly the best obtainable from the training data, so you want to stay close to it).
As a hugely simplified example, let’s say that f(x)=(x-2.367)² + 0.9999 is the best way to describe your training data.
Now, you want your model to always predict numbers larger than one, so you tweak your formula to f(x)=(x-2.367)² + 1.0001. That avoids the unwanted behavior but makes your model slightly worse (in the sense of how well it describes your training data)
Now, if you store your model with smaller floats, that model becomes f(x)=(x-2.3)² + 1. Now, an attacker can find an x where the model’s outcome isn’t larger than 1.
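To make the rounding effect concrete, here is a tiny runnable version of that toy example (the numbers are purely illustrative, not a real model):

```python
import numpy as np

def f_full(x):
    # "unlearned" model at full precision: minimum is 1.0001, always larger than 1
    return (x - 2.367) ** 2 + 1.0001

def f_quantized(x):
    # same model after rounding the constants: minimum drops back to exactly 1
    return (x - 2.3) ** 2 + 1.0

xs = np.linspace(0, 5, 100001)
print(f_full(xs).min())       # ~1.0001  -> still satisfies "larger than 1"
print(f_quantized(xs).min())  # 1.0 at x = 2.3 -> no longer larger than 1
```

The tweak that enforced the constraint lived entirely in the low-order digits, so it's the first thing lost when precision is reduced.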
As I understand it, the whole point is that it is not so simple to tell the difference between the model forgetting information and the model just learning some guardrails which prevent it from revealing that information. And this paper suggests that, since the information can be recovered, the desired forgetting does not really happen.
We are talking about multilayer neural networks where the interconnection weights encode data in obscure ways?
Is machine "unlearning" some retraining process to try to re-obscure certain data so it doesn't show in outputs (that is, outputs from tested inputs that used to reveal the data), but the data is still encoded in there somewhere, waiting for novel inputs to activate it?