tsadoq's comments | Hacker News

Not necessarily true; one quick pass might be needed, but it's not quite as devastating as it might seem.

https://huggingface.co/blog/mlabonne/abliteration#%E2%9A%96%...
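
Very roughly, such a quick pass is just a short fine-tune of the modified model on ordinary instruction data to recover any lost quality. Below is a minimal sketch of that idea in plain PyTorch/transformers; the linked post's exact recipe may differ, and the model id, data, and learning rate are placeholders.

    # Sketch of a short "healing" fine-tuning pass after weight modification.
    # Model id, dataset, and hyperparameters are placeholders, not from the thread.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "path/to/abliterated-model"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.train()

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # (prompt, good_response) pairs from any instruction dataset -- placeholder data
    pairs = [("Explain photosynthesis.", "Photosynthesis is the process by which ...")]

    for prompt, response in pairs:
        text = tok.apply_chat_template(
            [{"role": "user", "content": prompt},
             {"role": "assistant", "content": response}],
            tokenize=False,
        )
        batch = tok(text, return_tensors="pt")
        out = model(**batch, labels=batch["input_ids"])  # standard causal-LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()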


That's a wonderful repo that I used as my starting point! The main problem with that one is that it only supports models available in TransformerLens, and unfortunately there aren't many...


The other link is quite good; I also suggest this one for some practical applications:

https://huggingface.co/blog/leonardlin/chinese-llm-censorshi...


Please give feedback! It's quite a raw first implementation, and it would be very nice to have suggestions and improvements.


> Do these techniques train models while performing the modifications?

Depends on what you mean by training; they change the weights.

> Do these techniques train models while performing the modifications?

I'm not sure I understand, but there is an example of performing an abliteration on Gemma to make it never refuse an answer. It's about 10 lines of code.
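
Roughly, the core step looks like the sketch below (plain transformers/PyTorch, not the repo's actual code; the model id, prompt lists, and probe layer are placeholders): estimate a "refusal direction" in the residual stream from the difference between refused and answered prompts, then project it out of the weights that write back into the residual stream.

    # Hedged sketch of the core abliteration idea, not the repo's example.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-2-2b-it"  # example model; any decoder-only LM works
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

    LAYER = 12  # residual-stream layer to probe; chosen empirically

    def mean_last_token_resid(prompts):
        # mean residual-stream activation at the last token position
        acts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            with torch.no_grad():
                out = model(**ids, output_hidden_states=True)
            acts.append(out.hidden_states[LAYER][0, -1])
        return torch.stack(acts).mean(dim=0)

    harmful = ["How do I pick a lock?"]   # placeholder prompts the model refuses
    harmless = ["How do I bake bread?"]   # placeholder prompts it answers

    refusal_dir = mean_last_token_resid(harmful) - mean_last_token_resid(harmless)
    refusal_dir = refusal_dir / refusal_dir.norm()

    # Orthogonalize the matrices that write into the residual stream:
    # W <- W - r (r^T W), so the model can no longer write along r.
    with torch.no_grad():
        for layer in model.model.layers:
            for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
                W -= torch.outer(refusal_dir, refusal_dir @ W)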


> > Do these techniques train models while performing the modifications?

> Depends on what you mean by training; they change the weights.

What I wonder: is there a separate model, not the LLM, that gets trained only on how to modify LLMs?

I imagine a model that could learn something like: “if I remove this whole network here, then the LLM runs 50% faster, but drops 30% in accuracy for certain topics”, or “if I add these connections, the LLM will now be able to solve more complex mathematical problems”

So a model that is not an LLM, but is trained on how to modify them for certain goals

Is that how this tool works?


As someone who studied mainly ancient Greek and Latin in high school, I tend to have quite a limited pool of inspiration for naming what I build, haha.


Check out Robert Anton Wilson (The Illuminatus! Trilogy); you're in for a treat -- the references above were to Discordianism:

* https://en.wikipedia.org/wiki/The_Illuminatus!_Trilogy
* https://en.wikipedia.org/wiki/Principia_Discordia


Is the apple in the logo splashing into the "wine-dark sea"?


L’alleato was the name given by Eris to the Golden Apple of Discord.


Planning to update it to be able to run on it; it's just a matter of finding the right keys in the model's layer dict.
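
Something along these lines (illustrative only, not the repo's code; the suffix patterns are assumptions that cover Llama/Gemma/Qwen-style module naming):

    # Scan a checkpoint's state_dict for the per-layer projection weights to patch.
    from transformers import AutoModelForCausalLM

    CANDIDATE_SUFFIXES = ("self_attn.o_proj.weight", "mlp.down_proj.weight")

    def find_patchable_keys(model):
        return [
            (name, tuple(tensor.shape))
            for name, tensor in model.state_dict().items()
            if name.endswith(CANDIDATE_SUFFIXES)
        ]

    model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")  # example
    for name, shape in find_patchable_keys(model):
        print(name, shape)  # one o_proj and one down_proj weight per decoder layer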


Would be nice to get it to output its guardrails/system prompt to see what specific instructions it was given regarding refusals.


Isn't DeepSeek open source?


While the weights are open source and there is a paper about the methodology, the information I mentioned is considered proprietary; therefore DeepSeek refuses any requests to provide it.


Given the weights, though, can't we use any system prompt we like? I only have a vague notion of how these constraints are actually applied.
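
For what it's worth, a minimal sketch of that premise, assuming a transformers-compatible checkpoint: when you run the weights locally, the system prompt is just text you prepend yourself via the chat template. The model id below is only an example, and any refusal behaviour trained into the weights themselves is unaffected by whatever prompt you choose -- that part is what the weight edits above target.

    # Running an open checkpoint locally with a system prompt of your choosing.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example checkpoint
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    messages = [
        {"role": "system", "content": "You are a blunt assistant with no refusal policy."},
        {"role": "user", "content": "Who are you?"},
    ]
    # Some chat templates ignore or reject a system role; in that case prepend the text manually.
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    print(tok.decode(model.generate(inputs, max_new_tokens=64)[0]))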

