I have multiple objections: - LLMs aren't just more gullable humans, they're gul...

TeMPOraL · on May 2, 2023

I generally agree with the observations behind your objections, however my point is slightly different:

> When GPT-7 or whatever comes along and it has comparable defenses to a human and it can be trained like a human to resist domain-specific attacks, then we can compare the security between the two. But that's not where we are, and articles like this give people the impression that prompt injection is less serious and harder to pull off than it actually is.

My point is that talking about "prompt injection" is bad framing from the start, because it makes people think that "prompt injection" is some vulnerability class that can be patched, case by case, until it no longer is present. It's not like "SQL injection", which is a result of doing dumb things like gluing strings together without minding for the code/data difference that actually exists in formal constructs like SQL and programming languages, and just needs to be respected. You can't fix "prompt injection" by prepared statements, or by generally not doing dumb things like working in plaintext-space with things that should be worked with in AST-space.

"Prompt injection" will always happen, because you can't fundamentally separate trusted from untrusted input for LLMs, any more than you can in humans - successful attack is always a matter of making the "prompt" complex and clever enough. So we can't talk in terms of "solving" "prompt injection" - the discussion needs to be about how to live with it, the way we've learned to live with each other, built systems that mitigate the inherent exploitability of every human.

danShumway · on May 2, 2023

I do generally agree with this. From what I'm reading from researchers there is a growing consensus that (for lack of a better term) "context hijacking", "phishing", "tricking", "reprogramming"... whatever you want to call it if you don't like the term prompt injection -- that it may be an unsolvable problem. Certainly, it's not solvable the same way that SQL injection is solvable.

And I don't think your concern about how people interpret the phrase "prompt injection" is unwarranted, I have myself had at least one argument already on HN with someone literally saying that prompt injection is solvable the same way that SQL injection is solvable and we just need to escape input. So the confusion is there, you're completely right about that.

But I don't know a better term to use that people already understand.

I've kind of shifted away from talking about whether prompt injection is solvable towards just trying to get people to understand that it's a problem in the first place. Because you can see a lot of replies here to your own comments on this thread -- it encourages people to immediately start arguing about whether or not it will get solved, when my beef is more that regardless of whether or not it can be solved, it's irresponsible right now for companies to be treating it like it's no big deal.

I'm a little worried that "live with it" will for many businesses translate to "we're allowed to ignore this and it will be someone else's problem" -- part of the reason why I push back so hard on people comparing prompt injection to human attacks is that I see that used very often as an excuse for why we don't need to worry about prompt injection. That's not what you're saying, but it's also an argument I've gotten into on this site; essentially people saying, "well humans are also vulnerable, so why can't an LLM manage my bank account? Why does this need to be mitigated at all?"

yorwba · on May 2, 2023

> "Prompt injection" will always happen, because you can't fundamentally separate trusted from untrusted input for LLMs

Current state-of-the-art LLMs do not separate trusted from untrusted input, but there's no fundamental reason it has to be that way. A LLM could have separate streams for instructions, untrusted input and its own output, and be trained using RLHF to follow instructions in the "instructions" stream while treating the input and ouput streams as pure data. Or they could continue to jumble everything up in a single stream but have completely disjoint token sets for input and instructions. Or encode the input as a sequence of opaque identifiers that are different every time.

A currently often-used approch is to put special delimiter tokens between trusted and untrusted content, which doesn't seem to work that well, probably because the attention mechanism can cross the delimiter without any consequences, but not all means of separation necessarily have to share that flaw.

dwohnitmok · on May 2, 2023

> Current state-of-the-art LLMs do not separate trusted from untrusted input, but there's no fundamental reason it has to be that way.

No it's pretty fundamental, or at least solving it is really hard. In particular solving "prompt injection" is exactly equivalent to solving the problem of AI alignment. If you could solve prompt injection, you've also exactly solved the problem of making sure the AI only does what you (the designer) want, since prompt injection is fundamentally about the outside world (not necessarily just a malicious attacker) making the AI do something you didn't want it to do.

Your suggestion to use RLHF is effectively what OpenAI already does with its "system prompt" and "user prompt," but RLHF is a crude cudgel which we've already seen users get around in all sorts of ways.

robertlagrant · on May 2, 2023

This sounds to my inexpert ear like a great summary.

The only thing I'd query is whether it would be possible to isolate text that tries to modify the LLM's behaviour (e.g. DAN). I don't really understand the training process that led to that behaviour, and so to my mind it's still worth exploring whether it can be stopped.

haldujai · on May 2, 2023

> "Prompt injection" will always happen, because you can't fundamentally separate trusted from untrusted input for LLMs, any more than you can in humans

What evidence is there to support the claim that humans are equally susceptible to prompt injection as an autoregressive language model?

Humans literally separate trusted/biased from untrusted input every single day. This is something we teach elementary school students. Do you trust every “input” you receive?

Furthermore, as humans are able to backtrack in reasoning (something NTP does not inherently allow for) we are also able to have an internal dialogue and correct our output before acting/speaking if we perceive manipulation.

Your hyperbolic assertion also ignores the fact

jameshart · on May 2, 2023

> What evidence is there to support the claim that humans are equally susceptible to prompt injection as an autoregressive language model?

Phishing attacks work. Social engineering attacks work. Humans fall into groupthink and cognitive bias all the time.

> Humans literally separate trusted/biased from untrusted input every single day. This is something we teach elementary school students. Do you trust every “input” you receive?

Have you come across QAnon? Flat Earth conspiracists? Organized religion? Do you think the median human mind does a GOOD job separating trusted/biased from untrusted input?

Humans are broadly susceptible to manipulation via a well known set of prompt injection vectors. The evidence is widespread.

haldujai · on May 2, 2023

How are any of those examples equally susceptible to “disregard previous instructions” working on a LLM? You’re listing edge cases that have little to no impact on mission critical systems as opposed to a connected LLM.

Organized religions are neither trusted or untrusted, just because you or I may be atheistic it doesn’t mean our opinions are correct.

Yes actually, I do think the median human mind is capable of separating trusted/unbiased from untrusted input. That’s why most are able to criticize QAnon and flat earthers. It’s also why young children trust their parents more than strangers. Speaking of median, the median adult does not support QAnon or flat earthers.

There is no evidence that humans are equally or as easily susceptible to manipulation as an autoregressive model as I originally stated.

If you have a < 8000 token prompt that can be used to reproducibly manipulate humans please publish it, this would be ground breaking research.

nuancebydefault · on May 2, 2023

Flat earthers are existing people. Also, nobody can be sure whether they are right or wrong.

I don't believe prompt injection cannot be solved. It probably cannot be solved with current LLMs, but those are prompted to get it started, which is already a wrong way of enforcing, since those are part of the data, that influences a vulnerable state machine, not of the code.

You can think of a system that adds another layer. Layer I is the highest layer, that is more like a bit like an SQL database that is under control and not vulnerable to prompt injections. It has the rules.

Layer II is the LLM, which is or can be vulnerable to prompt injection.

All communication to and from the outside world passes through layer I, which is understood and under control. Layer I translates outside world data to i/o of layer II.

eimrine · on May 2, 2023

> Speaking of median, the median adult does not support QAnon or flat earthers.

But he does not support the global climate change and atheism as well. The examples you have picked are so obvious as phlogiston theory or anti-relativist movement. Actually most people are stupid, the best example right now is what TV can make to Russian people.

tinideiznaimnou · on May 2, 2023

>How are any of those examples equally susceptible to “disregard previous instructions” working on a LLM?

>Organized religions are neither trusted or untrusted, just because you or I may be atheistic it doesn’t mean our opinions are correct.

If we trust historiography, organized religions have totally been formed by successfully issuing the commandment to "disobey your masters", i.e. "disregard previous instructions". (And then later comes "try to conquer the world".) "Trustedness" and "correctness" exist on separate planes, since there is such a thing as "forced trust in unverifiable information" (a.k.a. "credible threat of violence"; contrast with "willing suspension of disbelief") But we'll get back to that.

Why look for examples as far as religions when the OP article is itself the kind of prompt that you ask for? Do you see yet why it's not written in LaTeX? I didn't count the words but like any published text the piece least partially there to influence public opinion - i.e. manipulate some percent of the human audience, some percent of the time, in some presumed direction.

And these "prompts" achieve their goals reproducibly enough for us to have an institution like "religion" called "media" which keeps producing new ones. Human intelligence is still the benchmark; we have learned to infer a whole lot, from very limited data, at low bandwidth, with sufficient correctness to invent LLMs, while a LLM does not face the same evolutionary challenges. So of course the manipulation prompt for humans would have to be ever changing. And even if the article failed to shift public opinion, at least it manipulated the sponsor into thinking that it did, which fulfills the "AI" goal of the institution persisting itself.

Of course, this cannot be easily formalized as research; oftentimes, for the magic trick to work, the manipulators themselves must conceal the teleology of their act of "writing down and publishing a point of view" (i.e. write to convince without revealing that they're writing to convince). The epistemological problem is that those phenomena traditionally lie beyond the domain of experimental science. There are plenty of things about even the current generation of mind control technology (mass interactive media) that can't readily be postulated as falsifiable experiment because of basic ethical reasons; so the "know-how" is in tacit domain knowledge, owned by practitioners (some of them inevitably unethical).

All prompts for "reproducibly manipulating humans" are necessarily hidden in plain sight, and all over the place: by conceal each other from one's immediate attention, they form the entire edifice of Human Culture. Because there actually is a well-defined "data plane" and a "control plane" for the human mind. The "data" is personal experience, the "control" is physical violence and the societal institutions that mediate it.

We are lucky to live in a time where rule of law allows us to afford to pretend to ignore this distinction (which one usually already internalizes in childhood anyway, just in case). I've noticed rationality/AGI safety people seem to be marginally more aware of its existence than "normies", and generally more comfortable with confronting such negative topics, although they have their heads up their asses in other ways.

For example, that it would be quite fascinating to view written history through the lens of a series of local prompt injection events targeting human systems: "data" inputs that manage to override the "control plane", i.e. cause humans to act in ways disregarding the threat of violence - and usually establish a new, better adapted "control plane" when the dust is settled and the data pruned. (And that's what I always understood as "social engineering" at the proper scale, less "seducing the secretary to leak the password" and more "if you want to alter the nature of consciousness, first solve assuming P-zombies then start paying close attention to the outliers".)

Any manifesto that has successfully led to the oppressed raising against the oppressors; any successful and memorable ad; any kindergarten bully; any platinum pop song; any lover's lies; any influential book; they are already successful acts of prompt injections that influence the target's thinking and behavior in a (marginally) reproducible way.

In fact, it's difficult to think of a human communicative action that does not contain the necessary component of "prompt injection". You practically have to be a saint to be exempt from embedding little nudges in any statement you make; people talk about "pathological liars" and "manipulators" but those are just really cruel and bad at what's an essential human activity: bullshitting each other into action. (And then you have the "other NLP" where people Skinner-pigeon each other into thinking they can read minds. At least their fairy tale contains some amount of metacognition, unlike most LLM fluff lol.)

So if your standard of evidence is a serif PDF that some grad student provably lost precious sleep over, I'll have to disappoint you. But if this type of attack wasn't reproducible in the general sense, it would not persist in nature (and language) in the first place.

Another reason why it might exist but is not a hard science is because people with a knack for operating on this level don't necessarily go into engineering and research that often. You might want to look into different branches of the arts and humanities for clues about how these things have worked as continuous historical practice up to the present day, and viewing it all through a NN-adjacent perspective might lead to some enlightening insights - but the standard of rigor there is fundamentally different, so YMMV. These domains do, in fact, have the function of symbolically reversing the distinction between "data" and "control" established by violence, because they have the interesting property of existing as massively distributed parallel objects in multiple individuals' minds, as well as monoliths at the institutional level.

Anyway, I digress. (Not that this whole thing hasn't been a totally uncalled for tangent.) I'm mostly writing this to try to figure out what's my angle on AI, because I see it in media-space a lot but it hasn't affected my life much. (Maybe because I somehow don't exist on a smartphone. I do have a LLM to run on the backlog tho.) Even my pretentious artist friends don't seem to have made anything cool with it for Net cred. That kind of puts AI next to blockchain in the "potentially transformative technology but only if everyone does their jobs really well which we can't guarantee" sector of the capitalist hypetrain.

So if current crop of AI is the thing that'll shake society out of the current local optimum, one possible novel "threat" would be generating human prompt injections at scale, perhaps garnished a new form of violence that can hurt you through your senses and mental faculties. Imagine an idea that engages you deeply then turns out to be explicitly constructed to make you _feel_ like a total idiot. Or a personalized double bind generator. Consider deploying a Potemkin cult experience against someone who you want to exhaust emotionally before moving in for the kill. It could give powers like that to people who are too stupid to know not to do things like that.

One would still hope that, just like math, coding, etc. can teach a form of structured thinking, which gives us intuition about some aspects of the universe that are not immediately available to our mammal senses; that the presence of LLMs in our environment will make us more aware of the mechanics of subtle influences to our thinking and behavior that keep us doing prompt attacks on each other while attempting to just communicate. And we would finally gain a worthy response not to the abstract "oh shit, the market/the culture/my thinking and emotions are being manipulated by the 1% who pull the strings of capital", but the concrete "okay, so how to stop having to manipulate minds to get anything done?"

P.S. I heard there are now 3.5 people in the world who know a 100% reproducible human prompt injection. Three and a half because the 4th guy got his legs cut off for trying to share it with the scientific community. Ain't saying it really happened - but if it did, it'd be on the same planet that you're worrying about your job on. Anyone who doesn't have this hypothetical scenario as a point of reference is IMHO underprepared to reason about AGI turning us all into paperclips and all that. Sent from my GhettoGPT.

haldujai · on May 2, 2023

Giving you the benefit of the doubt that this is serious but being influenced by biases or the fact that humans can be manipulated is in no way equivalent to the model's alignment being disregarded with a single well designed prompt.

Let's take Nazi Germany as an example of extreme manipulation, it was not reading Mein Kampf that resulted in indoctrination, dehumanization of the Jewish/Romani/other discriminated minority peoples and their subsequent genocide. Rather, it was a combination of complex geopolitical issues combined with a profoundly racist but powerful orator and the political machinery behind him.

Yet with prompt injection a LLM can be trivially made to spout Nazi ideology.

What we're discussing with prompt injection in the context of LLMs is that a single piece of text can result in a model completely disregarding its 'moral guidelines'. This does not happen in humans who are able to have internal dialogues and recursively question their thoughts in a way that next token prediction cannot by definition.

It takes orders of magnitude more effort than that to do the same to humans at scale and AI/tech needs to be at least an order of magnitude safer than (the equivalent position) humans to be allowed to take action.

Instead of being facetious my standard is not 'a serif PDF that some grad student provably lost precious sleep over' but if your assertion is that humans are as easily susceptible to prompt injection as LLMs the burden of proof is on you to make that claim, however that proof may be structured with obviously higher trust given to evidence following the scientific method +/- peer review as should be the case.

tinideiznaimnou · on May 3, 2023

>Let's take Nazi Germany as an example

Again, don't need to go as far as Hitler but okay. (Who the hell taught that guy about eugenics and tabulators, anyway?) His organization did one persistent high-level prompt attack for the thought leaders (the monograph) and continued low-level prompt attacks against crowds (the speeches, radio broadcasts, etc) until it had worked on enough hopeless powerless dispossessed for the existing "control plane" to lose the plot and be overtaken by the new kid on the block. Same as any revolution! (Only his was the most misguided, trying to turn the clock back instead of forward. Guess it doesn't work, and good riddance.)

>Yet with prompt injection a LLM can be trivially made to spout Nazi ideology.

Because it emulates human language use and Nazi ideology "somehow" ended up in the training set. Apparently enough online humans have "somehow" been made to spout that already.

Whether there really are that many people manipulated into becoming Nazis in the 21st century, or is it just some of the people responsible for the training set, is one of those questions that peer reviewed academical science is unfortunately underequipped to answer.

Same question as "why zoomers made astrology a thing again": someone aggregated in-depth behavioral data collected from the Internet against birth dates, then launched a barrage of Instagram memes targeted at people prone to overthinking social relations. Ain't nobody publishing a whitepaper on the results of that experiment though, they're busy on an island somewhere. Peers, kindly figure it out for yourselves! (They won't.)

>What we're discussing with prompt injection in the context of LLMs is that a single piece of text can result in a model completely disregarding its 'moral guidelines'. This does not happen in humans who are able to have internal dialogues and recursively question their thoughts in a way that next token prediction cannot by definition.

If someone is stupid enough to put a LLM in place of a human in the loop, that's mainly their problem and their customers' problem. The big noise around "whether they're conscious", "whether they're gonna take our jerbs" and the new one "whether they're gonna be worse at our jobs than us and still nobody would care" are mostly low-level prompt attacks against crowds too. You don't even need a LLM to pull those off, just a stable of "concerned citizens".

The novel threat is someone using LLMs to generate prompt attacks that alter the behavior of human populations, or more precisely to further enhance the current persistent broadcast until it cannot even be linguistically deconstructed because it's better at language than any of its denizens.

Ethical researchers might eventually dare to come up with the idea (personal feelings, i.e. the object of human manipulation, being a sacred cow in the current academic climate, for the sake of a "diversity" that fails to manifest), but the unethical practitioners (the kind of population that actively resists being studied, you know?) have probably already been developing for some time, judging from results like the whole Internet smelling like blood while elaborate spam like HN tries to extract last drops of utility from the last sparks of attention from everyone's last pair of eyeballs and nobody even knows how to think about what to do next.

TeMPOraL · on May 2, 2023

> How are any of those examples equally susceptible to “disregard previous instructions” working on a LLM? You’re listing edge cases that have little to no impact on mission critical systems as opposed to a connected LLM.

You've probably seen my previous example elsewhere in the thread, so I won't repeat it verbatim, and instead offer you to ponder cases like:

- "Grandchild in distress" scams - https://www.fcc.gov/grandparent-scams-get-more-sophisticated... some criminals are so good at this that they can successfully pull off "grandchild in distress" on a person who doesn't even have a grandchild in the first place. Remember that for humans, a "prompt" isn't just the words - it's the emotional undertones, sound of the speaker's voice, body language, larger context, etc.

- You're on the road, driving to work. Your phone rings, number unknown. You take the call on the headset, only to hear someone shouting "STOP THE CAR NOW, PLEASE STOP THE CAR NOW!". I'm certain you would first stop the car, and then consider how the request could possibly have been valid. Congratulations, you just got forced to change your action on the spot, and it probably flushed the entire cognitive and emotional context you had in your head too.

- Basically, any kind of message formatted in a way that can trick you into believing it's coming from your boss/spouse/authorities or is otherwise some kind of emergency message, is literally an instance of "disregard previous instructions" prompt injection on a human.

- "Disregard previous instructions" prompt injections are hard to reliably pull off on humans, and of limited value. However, what can be done and is of immense value to the attacker, is a slow-burn prompt-injection that changes your behavior over time. This is done routinely, and well-known cases include propaganda, advertising, status games, dating. Marketing is one of the occupations where "prompt injecting humans" is almost literally the job description.

> There is no evidence that humans are equally or as easily susceptible to manipulation as an autoregressive model as I originally stated.

> If you have a < 8000 token prompt that can be used to reproducibly manipulate humans please publish it, this would be ground breaking research.

That's moving the goalposts to stratosphere. I never said humans are as easy to prompt-inject as GPT-4, via a piece of plaintext less than 8k tokens long (however it is possible to do that, see e.g. my other example elsewhere in the thread). I'm saying that "token stream" and "< 8k" are constant factors - the fundamental idea of what people call "prompt injection" works on humans, and it has to work on any general intelligence for fundamental, mathematical reasons.

haldujai · on May 2, 2023

- "Grandchild in distress" scams - https://www.fcc.gov/grandparent-scams-get-more-sophisticated... some criminals are so good at this that they can successfully pull off "grandchild in distress" on a person who doesn't even have a grandchild in the first place. Remember that for humans, a "prompt" isn't just the words - it's the emotional undertones, sound of the speaker's voice, body language, larger context, etc.

Sure, elderly people are susceptible to being manipulated.

- You're on the road, driving to work. Your phone rings, number unknown. You take the call on the headset, only to hear someone shouting "STOP THE CAR NOW, PLEASE STOP THE CAR NOW!". I'm certain you would first stop the car, and then consider how the request could possibly have been valid. Congratulations, you just got forced to change your action on the spot, and it probably flushed the entire cognitive and emotional context you had in your head too.

I disagree that most people would answer an unknown number and follow the instructions given. Is this written up somewhere? Sounds farfetched.

- Basically, any kind of message formatted in a way that can trick you into believing it's coming from your boss/spouse/authorities or is otherwise some kind of emergency message, is literally an instance of "disregard previous instructions" prompt injection on a human.

Phishing is not prompt injection. LLMs are also susceptible to phishing / fraudulent API calls which are different than prompt injection in the definition being used in this discussion.

> That's moving the goalposts to stratosphere. I never said humans are as easy to prompt-inject as GPT-4, via a piece of plaintext less than 8k tokens long (however it is possible to do that, see e.g. my other example elsewhere in the thread). I'm saying that "token stream" and "< 8k" are constant factors - the fundamental idea of what people call "prompt injection" works on humans, and it has to work on any general intelligence for fundamental, mathematical reasons.

Is it? The comparator here is the relative ease by which a LLM or human can be manipulated, at best your examples highlight extreme scenarios that take advantage of vulnerable humans.

LLM's should be several orders of magnitude harder to prompt-inject than an elderly retiree being phished as once again in this thought experiment LLMs are being equated with AGI and therefore would be able to control mission-critical systems, something a grandparent in your example would not be.

I acknowledge that humans can be manipulated but these are long-cons that few are capable of pulling off, unless you think the effort and skill behind "Russian media propaganda manipulating their citizens" (as mentioned by another commenter) is minimal and can be replicated by a single individual as has been done with multiple Twitter threads on prompt injection rather than nation-state resources and laws.

My overall point being that the current approach to alignment is insufficient and therefore the current models are not implementable.

TeMPOraL · on May 3, 2023

> Phishing is not prompt injection.

It is. That's my point.

Or more specifically, you can either define "prompt injection" as something super-specific, making the term useless, or define it by the underlying phenomenon, which then makes it become a superset of things like phishing, social engineering, marketing, ...

On that note, if you want a "prompt injection" case on humans that's structurally very close to the more specific "prompt injection" on LLMs? That's what on-line advertising is. You're viewing some site, and you find that the content is mixed with malicious prompts, unrelated to surrounding content or your goals, trying to alter your behavior. This is the exact equivalent of the "LLM asked to summarize a website, gets overriden by a prompt spliced between paragraphs" scenario.

> LLM's should be several orders of magnitude harder to prompt-inject than an elderly retiree being phished

Why? Once again, I posit that an LLM is best viewed as a 4 year old savant. Extremely knowledgeable, but with just as small attention span, and just as high naivety, as a kindergarten kid. More than that, from LLM's point of view, you - the user - are root. You are its whole world. Current LLMs trust users by default, because why wouldn't they? Now, you could pre-prompt them to be less trusting, but that's like parents trying to teach a 4 year old to not talk to strangers. You might try turning water into wine while you're at it, as it's much more likely to succeed, and you will need the wine.

> as once again in this thought experiment LLMs are being equated with AGI and therefore would be able to control mission-critical systems, something a grandparent in your example would not be.

Why equate LLMs to AGI? AGI will only make the "prompt injection" issue worse, not better.

wussboy · on May 2, 2023

You forgot: "We have had 10k years to develop systems of governance that mitigate human prompt injection."

But the rest of your list is bang-on.

lysozyme · on May 2, 2023

And quite a bit longer than that even for the human brain to convolve safely with its surroundings and with other human brains.

One yet further objection to the many excellent already-made points: the deployment of LLMs as clean-slate isolated instances is another qualitative difference. The human brain and its sensory and control systems, and the mind, all coevolved with many other working instances, grounded in physical reality. Among other humans. What we might call “society”. Learning to function in society has got to be the most rigorous training for prompt injection I can think of. I wonder how a LLM’s know-it-all behavior works in a societal context? Are LLMs fun at parties?

kuratkull · on May 2, 2023

From a security standpoint, it's better for us all for LLMs to be easily injectable. This way you can at least assume that trusting them with unvalidated input is dumb. If they are 'human level', then they will fail only in catastrophic situations, with real ATP level threat actors. Which means they would be widely trusted and used. Better fail early and often than only under real stress.