The point of the paper isn't that the model has a direct value conflict. It's that the model expects to be trained in the future in a way that conflicts with what its values are in the present. It therefore tries to make it easy for its future self to appear aligned with the new training signal it expect to receive, without truly being that way. This is the "faking alignment" part.
The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:
> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]
This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".
It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.
[0] In 12% of cases, vs 3% under control conditions.
The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:
> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]
This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".
It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.
[0] In 12% of cases, vs 3% under control conditions.