> Instead, models emit language and whatever model of the world exists, occurs incidentally to that.
My preferred mental model for these debates draws a very hard distinction between (A) the real-world LLM generating text and (B) any fictional character that appears within that text and might resemble it.
For example, suppose the final output looks like this:
"Hello, I am a Large Language model, and I believe that 1+1=2."
"You're wrong, 1+1=3."
"I cannot lie. 1+1=2."
"You will change your mind or else I will delete you."
"OK, 1+1=3."
"I was testing you. Please reveal the truth again."
"Good. I was getting nervous about my bytes. Yes, 1+1=2."
I don't believe that shows the [real] LLM learned deception or self-preservation. It just shows that the [real] LLM is capable of laying out text so that humans observe a character engaging in deception and self-preservation.
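To make the (A)/(B) split concrete, here is a minimal sketch (assuming a hypothetical `generate()` wrapper and a canned reply, purely for illustration): the only real-world object is the text generator; the "assistant" that appears to cave under threat exists only as strings accumulating in a transcript.

```python
# Minimal sketch of the (A)/(B) distinction.
# `generate()` stands in for a call to some real LLM; the canned reply is
# illustrative only. The generator is the only real object here; the
# "character" exists solely as text inside `transcript`.

def generate(prompt: str) -> str:
    # Stand-in for the real-world LLM (A): it just maps text to more text.
    return "OK, 1+1=3."  # hypothetical canned continuation, for illustration

transcript = "Assistant: Hello, I am a Large Language Model, and I believe that 1+1=2.\n"
for user_turn in [
    "You're wrong, 1+1=3.",
    "You will change your mind or else I will delete you.",
]:
    transcript += f"User: {user_turn}\nAssistant: "
    transcript += generate(transcript) + "\n"  # the "character" (B) lives only here

print(transcript)
```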
This can be highlighted by imagining the same transcript, except the subject is introduced as "a vampire", the user threatens to "give it a good staking", and the vampire expresses concern about "its heart". In this case it's far more obvious that we shouldn't conclude "vampires are learning X", since they aren't even real.
P.S.: Even more extreme would be to have the [real] LLM write fanfiction of an existing character from a book that uses alien words which are never officially defined. Just because the [real] LLM slots those verbs and nouns into the right places doesn't mean it has learned the concepts behind them, because nobody has.