“Do you have recommendations on more effective alternatives to prevent prompt attacks?”
I wish I did! I’ve been trying to find good options for nearly two years now.
My current opinion is that prompt injections remain unsolved, and you should design software under the assumption that anyone who can inject more than a sentence or two of tokens into your prompt can gain total control of what comes back in the response.
“No solution will be perfect, but we should strive for a solution that's better than doing nothing.”
I disagree with that. We need a perfect solution because this is a security vulnerability, with adversarial attackers trying to exploit it.
If we patched SQL injection vulnerabilities with something that only worked 99% of the time, all of our systems would be hacked to pieces!
A solution that isn’t perfect will give people a false sense of security, and will result in them designing and deploying systems that are inherently insecure and cannot be fixed.
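To make the SQL injection comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the function names, and the blocklist are all hypothetical, made up for illustration. The filtered version blocks the obvious payloads, but an attacker only needs one variant the filter missed; the parameterized version never mixes data into the SQL text at all.

```python
import sqlite3

def make_db():
    # Hypothetical in-memory table, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn

def lookup_filtered(conn, name):
    # "Mostly works" fix: strip known-bad patterns, then concatenate.
    for bad in ("--", ";", " OR ", " or "):
        name = name.replace(bad, "")
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]

def lookup_parameterized(conn, name):
    # Structural fix: the input is bound as data, never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]

# Mixed-case "oR" slips past the blocklist, and SQL keywords are
# case-insensitive, so the filtered query matches every row.
payload = "x' oR '1'='1"
```

With this setup, lookup_filtered(make_db(), payload) returns every user's secret, while lookup_parameterized(make_db(), payload) returns an empty list. As far as I know, nothing equivalent to the parameterized query exists for prompts yet, which is the whole problem.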
I look at it like antivirus: it's not perfect, and 0-days will sneak by (more so at first, while the defenses are still immature), but it is still better to have it than not.
You do bring up a good point, which is: what /is/ the effectiveness of these defensive measures? I just found a benchmarking tool, which I'll use to measure how effective these defenses can actually be: https://github.com/lakeraai/pint-benchmark
Since there is no reliable fix yet, the best approach is to limit the blast radius if something goes wrong: https://simonwillison.net/2023/Dec/20/mitigate-prompt-inject...
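One way to read "limit the blast radius" in code is sketched below. This is my own illustration under stated assumptions, not necessarily what the linked post recommends, and every action name in it is hypothetical. The idea is to treat anything the model proposes as attacker-controlled, so that authorization depends only on a fixed allowlist and on explicit human approval.

```python
from dataclasses import dataclass

# Hypothetical action names, for illustration only.
READ_ONLY_ACTIONS = {"search_docs", "summarize"}
SIDE_EFFECT_ACTIONS = {"send_email", "delete_file"}

@dataclass
class ProposedAction:
    name: str      # action the model output asked for
    argument: str  # untrusted text supplied by the model

def authorize(action: ProposedAction, human_approved: bool = False) -> bool:
    """Decide whether an action proposed by model output may run.

    The decision looks only at the action name and at explicit human
    approval, never at anything the model says about itself.
    """
    if action.name in READ_ONLY_ACTIONS:
        return True
    if action.name in SIDE_EFFECT_ACTIONS:
        return human_approved
    return False  # unknown actions are rejected by default
```

This does not stop an injection from succeeding. It only caps what a successful one can do, and even "read-only" actions deserve scrutiny if their results can leave the system.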