Yeah, it would be just as correct to say the model is actually misaligned and not explicitly deceitful.

Now the real question is how to distinguish between the two. The scratchpad is a nice attempt but we don't know if that really works, either on people or on AI. A sufficiently clever liar would deceive even there.




> The scratchpad is a nice attempt but [...] A sufficiently clever liar

Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) and more about guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.

In other words, there are no "clever liars" here, only "characters written with lies-dialogue that is clever". We're not winning against the liar as much as rewriting it out of the story.

I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled by a human instinct for stories.



