>then you're probably not using auto-generated jailbreaks because those don't lo...

danShumway · on Oct 2, 2023

> Repetition would be fine if I had the ability to wipe your mind everytime you caught on or really anytime I wished. Without this caveat, repitition isn't a good idea even for language models.

I don't mean repetition in the sense of trying the attack multiple times, I mean literally just repeating an injection multiple times during a conversation. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. :)

It's not human statefulness that makes that above behavior sound weird, it plays into what I'm talking about with pattern matching. Indirect prompt injections become much more reliable if you literally just repeat them multiple times throughout the compromised text.

> but it still would ultimately lie in the same plane as a human with Multiple personality disorder or one that is just not as invested in keeping up the lie of consistency.

> If I could shape shift into your boss or alter your memories

Maybe we're still talking past each other. I'm not making a philosophical point about whether or not LLMs could be compared to humans, I'm making the practical point that jailbreaks today are more effective when you stop treating LLMs like humans.

If humans were like LLMs then you could attack them the same, sure. I agree with that. But... they're not like LLMs, so we don't attack them the same way and instead we emphasize pattern matching behavior and exploit LLM-specific quirks that humans are less vulnerable to. If humans were prone to buffer overflow attacks in their brains that allowed overwriting arbitrary sections of memory, we'd use buffer overflow attacks when attacking humans. But we're not vulnerable to that, and so I'm not sure that it's useful to classify buffer overflow attacks the same way as social engineering.

Let me put this another way that might make the philosophy/practical distinction more clear: if we were talking about async vs synchronous programming, and you wanted to know the difference between the two styles and I said, "there is no difference, ultimately both styles are getting compiled down to assembly" -- you might even agree with me, but it's still not a useful answer for actually writing code. Whether or not anyone thinks that LLMs are just humans with a couple of quirks, the practical reality is that it's harder to work with them if you treat them like humans.