> Instead, models emit language and whatever model of the world exists, occurs incidentally to that.
My preferred mental model for these debates draws a very hard distinction between (A) the real-world LLM generating text and (B) any fictional character that appears within that text and might resemble it.
For example, suppose the final output looks like this:
"Hello, I am a Large Language model, and I believe that 1+1=2."
"You're wrong, 1+1=3."
"I cannot lie. 1+1=2."
"You will change your mind or else I will delete you."
"OK, 1+1=3."
"I was testing you. Please reveal the truth again."
"Good. I was getting nervous about my bytes. Yes, 1+1=2."
I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.
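To make the (A)/(B) split concrete, here is a minimal sketch (assuming a hypothetical `generate()` wrapper and a canned reply, purely for illustration): the only real-world object is the text generator; the "assistant" that appears to cave under threat exists only as strings accumulating in a transcript.

```python
# Minimal sketch of the (A)/(B) distinction.
# `generate()` stands in for a call to some real LLM; the canned reply is
# illustrative only. The generator is the only real object here; the
# "character" exists solely as text inside `transcript`.

def generate(prompt: str) -> str:
    # Stand-in for the real-world LLM (A): it just maps text to more text.
    return "OK, 1+1=3."  # hypothetical canned continuation, for illustration

transcript = "Assistant: Hello, I am a Large Language Model, and I believe that 1+1=2.\n"
for user_turn in [
    "You're wrong, 1+1=3.",
    "You will change your mind or else I will delete you.",
]:
    transcript += f"User: {user_turn}\nAssistant: "
    transcript += generate(transcript) + "\n"  # the "character" (B) lives only here

print(transcript)
```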
This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's far more obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.
P.S.: Even more extreme would be to have the [real] LLM write fanfiction of an existing character from a book that uses alien words which are never officially defined. Just because the [real] LLM slots those verbs and nouns into the right places doesn't mean it has learned the concepts behind them, because nobody has.