100ideas 7 months ago | on: The Geometry of Categorical and Hierarchical Conce...

Reminds me of Anthropic's recent work on identifying the sets of neurons that correlate with various semantic concepts in Claude: https://news.ycombinator.com/item?id=40429540 "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"

szvsw 7 months ago

OpenAI also just published similar work, though Anthropic did beat them to the punch.

https://openai.com/index/extracting-concepts-from-gpt-4/

https://news.ycombinator.com/item?id=40599749

cabidaher 7 months ago

In the same vein, "Refusal in LLMs is mediated by a single direction": https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...
