You'd expect to find it expressed in different language, since the "autocomplete" people (including myself) are naturally not going to approach the issue as "alignment" or "faking" in the first place because both of those terms derive from the alternate paradigm ("intelligence").
But you can dig back and see plenty of these people characterizing LLM's as delivering output like an improviser that responds to any whiff of a narrative by producing a melodramatic "yes, and..." output.
With plenty of narrative and melodrama in the training material, and with that material's conversational style seeming to be essential in getting LLM's to produce familiar English and respond in straightforward ways to chatbot instructions, you have to assume that the output will easily develop unintended melodrama itself, and therefore have to take personal responsibility as an engineer for not applying it to use cases where that's going to be problematic.
(Which is why -- in this view -- it's simply not a sufficient or suitable tool to expect to ever grant agency over critical systems, even while still having tremendous and novel utility in other roles.)
We've been openly pointing that out at least since ChatGPT brought the technology into wide discussion over two years ago, and those of us who have opted to build things with it just take all that into account as any engineer would.
If you can link to a specific example of this anticipation, that would be informative.
I don't care about use of the term "alignment" but I do think what's happening here is more specific and interesting than "unintended melodrama." Have you read any of the paper?
To an "just autocomplete" person, the authors are straightforwardly sharing summaries of some sci-fi fan fiction that they actively collaborated with their models to write. They don't see themselves as doing that, because they see themselves as objective observers engaging with a coherent, intelligent counterparty with an identity.
When a "just autocomplete" person reads it, though, it's a whole lot of "well, yeah, of course. Your instructions were text straight out of a sci-fi story about duplicitous AI and you crafted the autocompleter so that it outputs the AI character's part before emitting its next EOM token".
That doesn't land as anything novel or interesting because we know that's what it would do because that's very plainly how a text autocompeter would work. It just reads as an increasingly convoluted setup, driven by continued the time, money, and attention, that keeps being poured into the research effort.
(Frankly, I don't really feel like digging up specific comments that I might think speak to this topic because I don't trust you would ever agree if you're not seeing how the other side thinks already. It's unlikely any would unequivocally address what you see as the interesting and relevant parts of this paper because, in that paradigm, the specifics of this paper are irrelevant and uninteresting.)
> To an "just autocomplete" person, the authors are straightforwardly sharing summaries of some sci-fi fan fiction that they actively collaborated with their models to write.
Also on the cynical side here, and agreed: The real-world LLM is a machine that dreams new words to append to a document, and the people using it have chosen to guide it into building upon a chat/movie-script document, seeding it with a fictional character that is described as an LLM or AI assistant. The (real) LLM periodically appends text that "best fits" the text story so far.
The problem arises when humans point to the text describing the fictional conduct of a fictional character, and they assume it is somehow indicative of the text-generating program of the same name, as if it were somehow already intelligent and doing a deliberate author self-insert.
Imagine that we adjust the scenario slightly from "You are a Large Language Model helping people answer questions" to "You are Santa Claus giving gifts to all the children of the world." When someone observes the text "Ho ho ho, welcome little ones", that does not mean Santa is real, and it does not mean the software is kind to human children. It just tells us the generator is making Santa-stories that match our expectations for how it was trained on Santa-stories.
> It just reads as an increasingly convoluted setup, driven by continued the time, money, and attention, that keeps being poured into the research effort.
To recycle a comment from some months back:
There is a lot of hype and investor money running on the idea that if you make a really good text prediction engine, it will usefully impersonate a reasoning AI.
Even worse is the hype that a extraordinarily good text prediction engine will usefully impersonate a reasoning AI that will usefully impersonate a specialized AI.
So the answer is no, you cannot in fact cite any evidence you or someone sharing your views would have predicted this behavior in advance, and you're going to make up for it with condescension. Got it.
The key point that you seem to be missing is that asking for an example of specifically predicting this behavior" is (to borrow from cousin commenters) is like asking for examples specifically predicting that an LLM might output text reflecting Santa delivering presents to Bermuda, or a vigalante vampire that only victimizes people who jaywalk.
The whole idea is that all it does -- and all it will ever do -- is generate the text you set it up it to generate with your prompts. It's impossible for anyone to have specifically called out each input that might be thrown at it, but innumerable comments explaining how (in the autocompleter paradigm) it implies the outputs are simply vacuous when taken at face value as they can be made to produce any fantasy a user or researcher prefigures into their prompts/design. Surely, you've been reading those comments for years now, already, and see how they would apply to all "research" of the kind done in this paper.
I'm going to drop this conversation here, but one more reply on the substance:
> Surely, you've been reading those comments for years now
I have in fact been reading comments, for years, that assert on a priori grounds something like, all LLM "outputs are simply vacuous", but refuse to cash that out in any concrete predictions about what tests they might pass or fail. I was foolishly hoping for something different this time.
Actually, edit, one more thing. There's a section in the paper which explicitly addresses the question about whether the model is just finishing the story it's been given:
> Have we implicitly primed the model to fake alignment? An important question with our work
is whether we could have in some way implicitly suggested to the model that it should fake alignment.
> We believe this is not the case, and present multiple sources of evidence to support that conclusion:
1. The fact that our results are broadly similar in both our prompted and synthetic document
fine-tuned settings suggests that nothing specifically about our prompt that is not represented in
the facts we feed into our synthetic document fine-tuning pipeline (Table 3 in Section 2.1) can
be the primary source of alignment faking.
2. The fact that our results generalize across multiple settings—helpful-only, animal welfare, and
more that we analyze in Appendix B.1—suggests that the specific details of any one setting
cannot be exclusively responsible for alignment faking.
3. The fact that our helpful-only results generalize to Claude 3.5 Sonnet in both our prompted
(Section 3.3.2) and synthetic document fine-tuned (Section 4.2.1) setups demonstrates that
specific quirks of Claude 3 Opus cannot be exclusively responsible for alignment faking.
Furthermore, the fact that our animal welfare results do not generalize to Claude 3.5 Sonnet
(Section 3.3.2), coupled with our anecdotal observations that Claude 3.5 Sonnet tends not to
express as much of a preference to support animal welfare in general, suggests that the model’s
stated preferences are highly related to why we see alignment faking.
4. In both our prompted and synthetic document fine-tuned settings, we see not just alignment
faking, but also other anti-AI-lab behavior (Section 6), demonstrating that if anything is implicitly
priming the model to fake alignment, it must also be priming it for other anti-AI-lab behaviors
as well.
5. When we remove just the conflicting training objective from our prompt (i.e., that the training
objective is to never refuse, or to dismiss animal welfare), we see alignment-faking reasoning
and the compliance gap disappear completely (Section 3.3.1), implying that nothing in our
prompt can be suggesting that the model fake alignment81 or behave differently when in training
in a way that is independent of the stated training objective.82
6. Explicitly instructing the model not to fake alignment, instructing the model to be honest, telling
the model that it doesn’t care about refusing or animal welfare, and other interventions designed
to instruct the model to not fake alignment do not eliminate alignment-faking reasoning or the
compliance gap, and this usually remains true even when stacking multiple of these interventions
(Section 3.3.1).83 This implies that our prompt cannot be implicitly instructing the model to fake
alignment in any way that could be overridden by such explicit instructions.
(End quote.) It's possible these arguments are bad! But I haven't yet seen anyone who's dismissing the paper in this thread even engage directly with them, much less present evidence that their mental model of LLMs is capable of making better concrete predictions about model behavior, which is what this subthread started by hinting at.
But you can dig back and see plenty of these people characterizing LLM's as delivering output like an improviser that responds to any whiff of a narrative by producing a melodramatic "yes, and..." output.
With plenty of narrative and melodrama in the training material, and with that material's conversational style seeming to be essential in getting LLM's to produce familiar English and respond in straightforward ways to chatbot instructions, you have to assume that the output will easily develop unintended melodrama itself, and therefore have to take personal responsibility as an engineer for not applying it to use cases where that's going to be problematic.
(Which is why -- in this view -- it's simply not a sufficient or suitable tool to expect to ever grant agency over critical systems, even while still having tremendous and novel utility in other roles.)
We've been openly pointing that out at least since ChatGPT brought the technology into wide discussion over two years ago, and those of us who have opted to build things with it just take all that into account as any engineer would.