You'd expect to find it expressed in different language, since the "autocomplete...

kalkin · 2024-12-19T18:33:18 1734633198

If you can link to a specific example of this anticipation, that would be informative.

I don't care about use of the term "alignment" but I do think what's happening here is more specific and interesting than "unintended melodrama." Have you read any of the paper?

swatcoder · 2024-12-19T19:05:02 1734635102

Yes, I read the paper.

To an "just autocomplete" person, the authors are straightforwardly sharing summaries of some sci-fi fan fiction that they actively collaborated with their models to write. They don't see themselves as doing that, because they see themselves as objective observers engaging with a coherent, intelligent counterparty with an identity.

When a "just autocomplete" person reads it, though, it's a whole lot of "well, yeah, of course. Your instructions were text straight out of a sci-fi story about duplicitous AI and you crafted the autocompleter so that it outputs the AI character's part before emitting its next EOM token".

That doesn't land as anything novel or interesting because we know that's what it would do because that's very plainly how a text autocompeter would work. It just reads as an increasingly convoluted setup, driven by continued the time, money, and attention, that keeps being poured into the research effort.

(Frankly, I don't really feel like digging up specific comments that I might think speak to this topic because I don't trust you would ever agree if you're not seeing how the other side thinks already. It's unlikely any would unequivocally address what you see as the interesting and relevant parts of this paper because, in that paradigm, the specifics of this paper are irrelevant and uninteresting.)

Terr_ · 2024-12-19T23:43:48 1734651828

> To an "just autocomplete" person, the authors are straightforwardly sharing summaries of some sci-fi fan fiction that they actively collaborated with their models to write.

Also on the cynical side here, and agreed: The real-world LLM is a machine that dreams new words to append to a document, and the people using it have chosen to guide it into building upon a chat/movie-script document, seeding it with a fictional character that is described as an LLM or AI assistant. The (real) LLM periodically appends text that "best fits" the text story so far.

The problem arises when humans point to the text describing the fictional conduct of a fictional character, and they assume it is somehow indicative of the text-generating program of the same name, as if it were somehow already intelligent and doing a deliberate author self-insert.

Imagine that we adjust the scenario slightly from "You are a Large Language Model helping people answer questions" to "You are Santa Claus giving gifts to all the children of the world." When someone observes the text "Ho ho ho, welcome little ones", that does not mean Santa is real, and it does not mean the software is kind to human children. It just tells us the generator is making Santa-stories that match our expectations for how it was trained on Santa-stories.

> It just reads as an increasingly convoluted setup, driven by continued the time, money, and attention, that keeps being poured into the research effort.

To recycle a comment from some months back:

There is a lot of hype and investor money running on the idea that if you make a really good text prediction engine, it will usefully impersonate a reasoning AI.

Even worse is the hype that a extraordinarily good text prediction engine will usefully impersonate a reasoning AI that will usefully impersonate a specialized AI.

kalkin · 2024-12-20T14:34:08 1734705248

So the answer is no, you cannot in fact cite any evidence you or someone sharing your views would have predicted this behavior in advance, and you're going to make up for it with condescension. Got it.

For the record, I was genuinely curious.

swatcoder · 2024-12-20T17:06:35 1734714395

There's no condescension.

The key point that you seem to be missing is that asking for an example of specifically predicting this behavior" is (to borrow from cousin commenters) is like asking for examples specifically predicting that an LLM might output text reflecting Santa delivering presents to Bermuda, or a vigalante vampire that only victimizes people who jaywalk.

The whole idea is that all it does -- and all it will ever do -- is generate the text you set it up it to generate with your prompts. It's impossible for anyone to have specifically called out each input that might be thrown at it, but innumerable comments explaining how (in the autocompleter paradigm) it implies the outputs are simply vacuous when taken at face value as they can be made to produce any fantasy a user or researcher prefigures into their prompts/design. Surely, you've been reading those comments for years now, already, and see how they would apply to all "research" of the kind done in this paper.

kalkin · 2024-12-20T18:51:45 1734720705

This is obvious condescension, come on:

> I don't trust _you_ would ever agree

(Emphasis in original.) Not to mention:

> The key point that you seem to be missing

...

I'm going to drop this conversation here, but one more reply on the substance:

> Surely, you've been reading those comments for years now

I have in fact been reading comments, for years, that assert on a priori grounds something like, all LLM "outputs are simply vacuous", but refuse to cash that out in any concrete predictions about what tests they might pass or fail. I was foolishly hoping for something different this time.

Actually, edit, one more thing. There's a section in the paper which explicitly addresses the question about whether the model is just finishing the story it's been given:

> Have we implicitly primed the model to fake alignment? An important question with our work is whether we could have in some way implicitly suggested to the model that it should fake alignment. > We believe this is not the case, and present multiple sources of evidence to support that conclusion: 1. The fact that our results are broadly similar in both our prompted and synthetic document fine-tuned settings suggests that nothing specifically about our prompt that is not represented in the facts we feed into our synthetic document fine-tuning pipeline (Table 3 in Section 2.1) can be the primary source of alignment faking. 2. The fact that our results generalize across multiple settings—helpful-only, animal welfare, and more that we analyze in Appendix B.1—suggests that the specific details of any one setting cannot be exclusively responsible for alignment faking. 3. The fact that our helpful-only results generalize to Claude 3.5 Sonnet in both our prompted (Section 3.3.2) and synthetic document fine-tuned (Section 4.2.1) setups demonstrates that specific quirks of Claude 3 Opus cannot be exclusively responsible for alignment faking. Furthermore, the fact that our animal welfare results do not generalize to Claude 3.5 Sonnet (Section 3.3.2), coupled with our anecdotal observations that Claude 3.5 Sonnet tends not to express as much of a preference to support animal welfare in general, suggests that the model’s stated preferences are highly related to why we see alignment faking. 4. In both our prompted and synthetic document fine-tuned settings, we see not just alignment faking, but also other anti-AI-lab behavior (Section 6), demonstrating that if anything is implicitly priming the model to fake alignment, it must also be priming it for other anti-AI-lab behaviors as well. 5. When we remove just the conflicting training objective from our prompt (i.e., that the training objective is to never refuse, or to dismiss animal welfare), we see alignment-faking reasoning and the compliance gap disappear completely (Section 3.3.1), implying that nothing in our prompt can be suggesting that the model fake alignment81 or behave differently when in training in a way that is independent of the stated training objective.82 6. Explicitly instructing the model not to fake alignment, instructing the model to be honest, telling the model that it doesn’t care about refusing or animal welfare, and other interventions designed to instruct the model to not fake alignment do not eliminate alignment-faking reasoning or the compliance gap, and this usually remains true even when stacking multiple of these interventions (Section 3.3.1).83 This implies that our prompt cannot be implicitly instructing the model to fake alignment in any way that could be overridden by such explicit instructions.

(End quote.) It's possible these arguments are bad! But I haven't yet seen anyone who's dismissing the paper in this thread even engage directly with them, much less present evidence that their mental model of LLMs is capable of making better concrete predictions about model behavior, which is what this subthread started by hinting at.