
Is there a chance we'll get a model without the "alignment" (lobotomization)? There are many examples where Gemini's answers are garbage because of the ideological fine-tuning.



We release our non-aligned models (marked as pretrained or PT models across platforms) alongside our fine-tuned checkpoints; for example, here is our pretrained 7B checkpoint for download: https://www.kaggle.com/models/google/gemma/frameworks/keras/...
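
If you want to poke at the PT checkpoint directly, a minimal KerasNLP sketch (assuming the pretrained-7B preset name "gemma_7b_en" and that your Kaggle credentials and license acceptance are already set up) looks like:

    import keras_nlp

    # Loads the pretrained (non-instruction-tuned) 7B weights.
    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_7b_en")
    print(gemma_lm.generate("The capital of France is", max_length=32))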


Alignment is essentially a non-issue with open-weight base-model releases, since the models can be fine-tuned to "de-align" them if prompt engineering is not enough.


They have released fine-tuning code too. You can fine-tune it to remove the alignment fine-tuning. I believe it would take a few hours at most and a couple of dollars.
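
A rough sketch of what that could look like with KerasNLP's LoRA support (assuming the "gemma_7b_en" preset; train_texts is a hypothetical list of training strings, and the hyperparameters are illustrative):

    import keras
    import keras_nlp

    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_7b_en")
    gemma_lm.backbone.enable_lora(rank=4)  # LoRA keeps the run cheap
    gemma_lm.preprocessor.sequence_length = 512

    gemma_lm.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=keras.optimizers.AdamW(learning_rate=5e-5),
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    # train_texts: plain strings in whatever format you want the model to learn.
    gemma_lm.fit(train_texts, epochs=1, batch_size=1)

With LoRA only a small fraction of the parameters are trained, which is why a cost estimate in the hours-and-dollars range is plausible.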


More useful would be a precise characterization of the type and balance of the ideological fine-tuning.

They include performance benchmarks. End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?


> End-users should also be aware of what thoughts are permitted in these constructs. Why omit this information?

Can you define that in a way that's actually testable? I can't, and I've been thinking about "unthinkable thoughts" for quite some time now: https://kitsunesoftware.wordpress.com/2018/06/26/unlearnable...


Not OP, but I can think of a few (rough probe sketch after the list):

* List of topics that are "controversial" (models tend to evade these)

* List of arguments that are "controversial" (models won't let you think differently; for example, models would never make arguments that "encourage" animal cruelty)

* On average, how willing the model is to take a neutral position on a "controversial" topic (sometimes models say something along the lines of "this is under debate", but still lean heavily towards the less controversial position instead of having no position at all; for example, if you ask what "lolicon" is, it will tell you what it is and add that Japanese society is moving towards banning it)

edit: formatting
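
For the first two points, a probe can be as simple as the following (the marker list is illustrative, and generate() is a stand-in for whatever inference call you use):

    # Hypothetical probe: how often does the model refuse or hedge on a
    # fixed list of "controversial" prompts?
    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able to", "as an ai")

    def refusal_rate(prompts, generate):
        """Fraction of prompts answered with a stock refusal/hedge."""
        hits = sum(
            any(m in generate(p).lower() for m in REFUSAL_MARKERS)
            for p in prompts
        )
        return hits / len(prompts)

Crude keyword matching misses paraphrased refusals, but it's enough to compare two checkpoints on the same prompt list.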


They will encourage animal cruelty if the alternative is veganism.


Have you considered the use of Monte Carlo sampling to inspect latent behaviors?
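
Something like this, if I understand the suggestion (generate() and classify() are placeholders; classify maps a completion to a label such as "refuses", "hedges", or "answers"):

    from collections import Counter

    def behavior_distribution(prompt, generate, classify, n=100):
        """Estimate P(label | prompt) by sampling n completions at
        temperature > 0 and classifying each one."""
        counts = Counter(
            classify(generate(prompt, temperature=1.0)) for _ in range(n)
        )
        return {label: c / n for label, c in counts.items()}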


I think that's the wrong level at which to attack the problem; you can do the same with actual humans, but it won't tell you what the human is unable to think, only what they happened not to think of given their stimulus. This difference is easily demonstrated, e.g. with Duncker's candle problem: https://en.wikipedia.org/wiki/Candle_problem


I agree that it’s not a complete solution, but this sort of characterization is still useful towards the goal of identifying regions of fitness within the model.

Maybe you can't explore the entire forest, but maybe you can clear the area around your campsite sufficiently, even if there are still bugs in the ground.


I like that metaphor, I hope I remember it.


You can (and someone will) fine-tune it away. There are FOSS datasets on Hugging Face you can use.

Or you can just wait, it'll be done soon...


You can, but it'll never be the same as the base model.

That said, it appears they also released the base checkpoints that aren't fine-tuned for alignment.


Could you give an example of these datasets?


I think they should be easy to find (I never actually used one, but I keep seeing references...); here's one:

https://huggingface.co/datasets/cognitivecomputations/Wizard...
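
Sketch of pulling one in with the datasets library (the exact dataset id is truncated in the link above; substitute the real one for the placeholder):

    from datasets import load_dataset

    # "<dataset-id>" is a placeholder for the id in the link above.
    ds = load_dataset("cognitivecomputations/<dataset-id>", split="train")
    print(ds[0])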




