There could be a million reasons for the behaviour described in the article, so I'm not too convinced by their argument. Maybe the paper does a better job.
I think a more convincing example was where they used fine-tuning to make an LLM lie. They then looked at some of the model's internal activations and could tell the LLM knew the truth internally but switched its output right at the end to lie.
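(For anyone curious what "looking at the internal activations" roughly means in practice: below is a minimal logit-lens-style sketch using GPT-2 and Hugging Face transformers. This is my own illustration, not the paper's actual setup; the model, prompt, and probing method are just assumptions to show the idea of decoding intermediate layers and comparing them to the final output.)

    # Sketch: decode each layer's hidden state through the unembedding
    # to see what the model "believes" at different depths.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tok("The capital of France is", return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # Project each layer's last-token hidden state through the final
    # layer norm and the unembedding matrix, then print the top token.
    for layer_idx, hidden in enumerate(out.hidden_states):
        h = model.transformer.ln_f(hidden[:, -1, :])
        logits = model.lm_head(h)
        print(f"layer {layer_idx:2d}: {tok.decode(logits.argmax(-1))!r}")

If a fine-tuned model were answering dishonestly, you'd hope to see the "true" token dominating at intermediate layers and only the final layers flipping to the lie, which is the kind of signal the comment above is describing.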