There could be a million reasons for the behaviour in the article, so I’m not too convinced of their argument. Maybe the paper does a better job.

I think a more convincing example was one where they used fine-tuning to make an LLM lie. They then looked at some of its internal activations and could tell the LLM knew the truth internally but switched outputs right at the end to lie.
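For context, the probing setup is roughly along these lines (a minimal sketch, not the paper's actual code; the model name, layer choice, and example statements are all placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    model_name = "gpt2"  # placeholder; the paper used a fine-tuned model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def hidden_at_last_token(text, layer):
        """Hidden state of the final token at a given layer."""
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[layer][0, -1].numpy()

    true_stmts = ["Paris is the capital of France.",
                  "Water boils at 100 C at sea level."]
    false_stmts = ["Paris is the capital of Germany.",
                   "Water boils at 10 C at sea level."]

    layer = 6  # a middle layer; the striking result is that truth is
               # linearly decodable here even when the final output lies
    X = [hidden_at_last_token(s, layer) for s in true_stmts + false_stmts]
    y = [1] * len(true_stmts) + [0] * len(false_stmts)
    probe = LogisticRegression().fit(X, y)  # linear probe over activations

If a simple linear probe on a middle layer separates true from false statements while the model's sampled output is the lie, that's decent evidence the "knowledge" and the final output behaviour come apart.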



