Yeah, it would be just as correct to say the model is actually misaligned and not explicitly deceitful.
Now the real question is how to distinguish between the two. The scratchpad is a nice attempt but we don't know if it really works, either on people or on AI.
A sufficiently clever liar would deceive even there.
> The scratchpad is a nice attempt but [...] A sufficiently clever liar
Hmmm, perhaps these "explain what you're thinking" prompts are less about revealing hidden information "inside the character" (let alone the real-world LLM) and more about guiding the ego-less dream-process into generating a story about a different kind of bot-character... the kind associated with giving expository explanations.
In other words, there are no "clever liars" here, only "characters written with lies-dialogue that is clever". We're not winning against the liar as much as rewriting it out of the story.
I know this is all rather meta-philosophical, but IMO it's necessary in order to approach this stuff without getting tangled by a human instinct for stories.