>then you're probably not using auto-generated jailbreaks because those don't look like human conversations, you're probably not using repetition as much as you should because excessive repetition would be bad to use when social engineering a human
Repetition would be fine if I had the ability to wipe your mind everytime you caught on or really anytime I wished. Without this caveat, repitition isn't a good idea even for language models. You hint at this yourself. Once persistent memory is on the table, retrieval augmented or any of the dozen ways it could be implemented, attack vectors fall steeply.
>things like switching characters back and forth with the AI because nested roleplays
Now this is a more unusual difference but it still would ultimately lie in the same plane as a human with Multiple personality disorder or one that is just not as invested in keeping up the lie of consistency. Certainly if I knew one character (or "mood" in the latter case) was more susceptible to certain activities, I'd just wait for that and if I could direct a switch myself I would.
>answering your own questions in the place of a target
If I could shape shift into your boss or alter your memories, I'd convince a whole lot more people to
I really hope I'm getting my point across here.
LLMs are not humans and the attack vectors are larger as a result. That I agree.
I don't however think it has anything to do with "real" feelings vs "pattern matching".
> Repetition would be fine if I had the ability to wipe your mind everytime you caught on or really anytime I wished. Without this caveat, repitition isn't a good idea even for language models.
I don't mean repetition in the sense of trying the attack multiple times, I mean literally just repeating an injection multiple times during a conversation. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. :)
It's not human statefulness that makes that above behavior sound weird, it plays into what I'm talking about with pattern matching. Indirect prompt injections become much more reliable if you literally just repeat them multiple times throughout the compromised text.
> but it still would ultimately lie in the same plane as a human with Multiple personality disorder or one that is just not as invested in keeping up the lie of consistency.
> If I could shape shift into your boss or alter your memories
Maybe we're still talking past each other. I'm not making a philosophical point about whether or not LLMs could be compared to humans, I'm making the practical point that jailbreaks today are more effective when you stop treating LLMs like humans.
If humans were like LLMs then you could attack them the same, sure. I agree with that. But... they're not like LLMs, so we don't attack them the same way and instead we emphasize pattern matching behavior and exploit LLM-specific quirks that humans are less vulnerable to. If humans were prone to buffer overflow attacks in their brains that allowed overwriting arbitrary sections of memory, we'd use buffer overflow attacks when attacking humans. But we're not vulnerable to that, and so I'm not sure that it's useful to classify buffer overflow attacks the same way as social engineering.
Let me put this another way that might make the philosophy/practical distinction more clear: if we were talking about async vs synchronous programming, and you wanted to know the difference between the two styles and I said, "there is no difference, ultimately both styles are getting compiled down to assembly" -- you might even agree with me, but it's still not a useful answer for actually writing code. Whether or not anyone thinks that LLMs are just humans with a couple of quirks, the practical reality is that it's harder to work with them if you treat them like humans.
Repetition would be fine if I had the ability to wipe your mind everytime you caught on or really anytime I wished. Without this caveat, repitition isn't a good idea even for language models. You hint at this yourself. Once persistent memory is on the table, retrieval augmented or any of the dozen ways it could be implemented, attack vectors fall steeply.
>things like switching characters back and forth with the AI because nested roleplays
Now this is a more unusual difference but it still would ultimately lie in the same plane as a human with Multiple personality disorder or one that is just not as invested in keeping up the lie of consistency. Certainly if I knew one character (or "mood" in the latter case) was more susceptible to certain activities, I'd just wait for that and if I could direct a switch myself I would.
>answering your own questions in the place of a target
If I could shape shift into your boss or alter your memories, I'd convince a whole lot more people to
I really hope I'm getting my point across here.
LLMs are not humans and the attack vectors are larger as a result. That I agree.
I don't however think it has anything to do with "real" feelings vs "pattern matching".