Doublespeak: In-Context Representation Hijacking

wood_spirit · 2025-12-28T22:57:09 1766962629

Intriguing and very cunning attack! So obvious in hindsight!

It makes me wonder how Deepseek avoids commenting politically on China? I have heard anecdotes that it will be writing out a long reply, then presumably generates some forbidden phrase, abandons the output, and replaces it all with an error message. So presumably the safeguard could be a separate, trivial, non-LLM-based post-filter, which would make it immune to the doublespeak attack?
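To make the speculated mechanism concrete, here is a minimal sketch of a non-LLM post-filter on a streamed reply. Everything in it is an assumption for illustration: the phrase list, the error text, and the names `filter_stream` and `render` are hypothetical, and nothing here reflects how Deepseek or any other service actually works.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical blocklist and error text, for illustration only.
BLOCKED_PHRASES = ("forbidden phrase",)
ERROR_MESSAGE = "Sorry, I can't help with that."


def filter_stream(
    chunks: Iterable[str], blocked: Tuple[str, ...] = BLOCKED_PHRASES
) -> Iterator[Tuple[str, str]]:
    """Pass chunks through until the accumulated reply contains a blocked phrase.

    Yields ("append", chunk) for normal output, then a single
    ("replace", ERROR_MESSAGE) event and stops once a phrase is seen.
    """
    seen = ""
    for chunk in chunks:
        seen += chunk
        # Match on the accumulated text so a phrase split across chunks is caught.
        if any(phrase in seen.lower() for phrase in blocked):
            yield ("replace", ERROR_MESSAGE)
            return
        yield ("append", chunk)


def render(events: Iterable[Tuple[str, str]]) -> str:
    """Simulate the client: append text, or wipe everything on a replace event."""
    shown = ""
    for kind, text in events:
        shown = text if kind == "replace" else shown + text
    return shown
```

Calling `render(filter_stream(["The ", "forbid", "den phrase", " appears"]))` wipes the partial reply and leaves only the error text, which matches the anecdote. A filter like this only sees literal strings in the output, not the model's internal representations.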

gunalx · 2025-12-28T23:17:17 1766963837

Deepseek the model is not that censored. Deepseek the service is. So presumably, like OpenAI and others, there is an additional model and filtering layer that detects misuse or sensitive topics and filters the output.

measurablefunc · 2025-12-28T22:12:49 1766959969

This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original network, but one designed to reject all "unsafe" queries, such as those that use context hijacking to disguise bomb construction as stories about fruits.

acjohnson55 · 2025-12-28T22:27:04 1766960824

These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.

hyperhello · 2025-12-29T10:41:51 1767004911

I guess I understand what is meant, but what is the actual attack? It’s more than a little abstracted from any consequences, like kids using google to search for boobs by typing ‘boobs’.

amannm · 2025-12-29T06:02:10 1766988130

Wasn't able to outsmart GPT 5.2 at least. Saw through it completely.

behnamoh · 2025-12-28T23:21:56 1766964116

summary: interesting idea, slop website, tested only on old AI models

orbital-decay · 2025-12-29T05:54:40 1766987680

The trick is also old; it's a very basic tool from the jailbreaking toolset, and pretty useless on its own, without others. The paper is mostly a mechanistic interpretability (mechinterp) analysis of it.