
Reminds me of Anthropic's recent work on identifying the sets of neurons that correlate with various semantic concepts in Claude: https://news.ycombinator.com/item?id=40429540 ("Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet")



OpenAI also just published similar work, though Anthropic did beat them to the punch.

https://openai.com/index/extracting-concepts-from-gpt-4/

https://news.ycombinator.com/item?id=40599749


In the same vein, "Refusal in LLMs is mediated by a single direction": https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...



