Hacker News new | past | comments | ask | show | jobs | submit login

That doesn't work. People can come up with double layer jailbreaks that target the filtering layer in order to get an attack through.

If you think this is easy, by all means prove it. You'll be making a big breakthrough discovery in AI security research if you do.




>That doesn't work.

Manipulation isn't binary. It's not "works" vs "doesn't work". It's "works better"

There are vectors in place to hinder social engineering for humans in high security situations and workplaces. Just because it's possible to bypass them all doesn't mean it makes sense to say they don't work.


In the context of someone claiming that chaining inputs fixes most jailbreaks, it is correct to say that it "doesn't work."

Chaining input does work better at filtering bad prompts, yes. It doesn't fix them. We'd apply the same criteria to social engineering -- training may make your employees less susceptible to social engineering, but it does not fix social engineering.


I wrote about this a while ago: in application security, 99% is a failing grade: https://simonwillison.net/2023/May/2/prompt-injection-explai...


Just paste input output from jailbreaks ask GPT-4 if it was a jaibreak. It's not a breakthrough discovery, my point is just that much of it is preventable but seemingly not worth the cost. There is no clear benefit for the company.


> It's not a breakthrough discovery

It would be if it worked. I've seen plenty of demos where people have tried to demonstrate that using LLMs to detect jailbreaks is possible -- I have never seen a public demo stand up to public attacks. The success rate isn't worth the cost in no small part because the success rate is terrible.

I also don't think it's the case that a working version of this wouldn't be worth the cost to a number of services. Many services today already chain LLM output and make multiple calls to GPT behind the scenes. Windows built in assistant rewrites queries in the backend and passes them between agents. Phind uses multiple agents to handle searching, responses, and followup questions. Bing is doing the same thing with inputs to DALL-E 3. And companies do care about this at least somewhat -- look how much Microsoft has been willing to mess with Bing to try and get it to stay polite during conversations.

Companies don't care enough about LLM security to hold back on doing insecure things or delay product launches or give up features, but if chaining a second LLM was enough to prevent malicious input, I think companies would do it. I think they'd jump at a simple way to fix the problem. A lot of them are already are chaining LLMs, so what's one more link in that chain? But you're right that the cost-benefit analysis doesn't work out -- just not because the cost is too prohibitive, but because the benefit is so small. Malicious prompt detection using chained LLMs is simply too easy to bypass.

You're welcome to set up a demo that can survive more than an hour or two of persistent attacks from the HN crowd if you want to prove the critics wrong. I haven't seen anyone else succeed at that, but :shrug: maybe they did it wrong.


If I'm wrong, I'd love to learn something. Does the fact that you haven't seen anyone else succeed at that goes along with you seeing them trying? I'd love some links and seeing how it failed.

And btw not sure if simple censorship qualifies as chaining (in the form you described). If you chain it seems to possibly increase attack surface, while if you just censor, security seems to be adding up.

I have zero idea what's happening behind the scenes in these companies. My comment is based just on my experiments with GPT-4, which seems pretty expensive to run, but whatever happens behind the curtain gets pretty decent results. I'm surprised that you think OpenAI would be prepared to double the cost and highly increase latency if that would mean stopping jailbreaking.

Since replies below may not be possible (thread depth), I understand I may be completely wrong, I'd like to just learn more about how.


There's even a whole game based on this premise.

https://gandalf.lakera.ai/

Once you hit level 4 they add a second guardian layer AI. It's still relatively easy to get to level 6. And getting beyond that is as easy as googling "Gandalf AI Password answers".

Once someone jailbreaks your double layer security, it's as simple as posting the jailbreak prompt on Twitter or Reddit. Only one person actually has to devise the crack for everyone to be able to use it.


> Does the fact that you haven't seen anyone else succeed at that goes along with you seeing them trying?

A little bit of both. For something that should be trivially demonstrable, I generally don't see a lot of people trying to demonstrate that it would work -- mostly just saying that it would. To be fair, opening up a GPT service to the general public can get expensive for a hobby-dev in general so I don't necessarily hold that against anyone, but it is a good question to ask: at what point does it become reasonable to say "prove this works"?

There have been some demos though. Just doing a quick search through my saved comments, but:

- https://news.ycombinator.com/item?id=35618305 (gets some bonus points because defending against swearing here would have been more reliable with a simple filter).

- https://news.ycombinator.com/item?id=35576740 (If my memory serves me right this was broken in less than 30 minutes, also again bonus points for performing worse than a naive non-AI solution would have performed).

- https://news.ycombinator.com/item?id=35794323 (A much more informal test just showing that using GPT for classification of what is and isn't a malicous prompt is unreliable on its own).

----

There's also the slightly conspiratorial example, but if you've played through https://gandalf.lakera.ai/, at a certain point the company uses chained LLMs as its defense. Tons of people beat it.

Why I say conspiratorial is that if I complain to Lakera about this, I'll get a reply back that this is just a game and it's not intended to be impossible to beat. I think it does still demonstrate that chained input isn't sufficient because game or not it's still using it, but my conspiratorial take is that Lakera doesn't have a better solution than this -- it's easy for them to say it's a game, but at the end of the day they're claiming they can defend against malicious prompts in their business, and they don't have public demos of that working. They do have a highly public demo where it doesn't work, and they conveniently say that the game is not intended to work perfectly. I think that's them saving face, I think if they had a working solution for defending against malicious prompts then they'd have an impossible level in this game.

This is a pattern you'll see with a lot of LLM security companies -- private demos, no public attack surface. I can't think off the top of my head if there are any that try to actually put their money where their mouth is. What I think Lakira is doing behind the scenes is using user input from their "games" to train separate AI models to try and detect malicious input using more traditional classification techniques. I also think that's not going to be very successful, but that's a separate more complicated conversation.

----

I'm not necessarily trying to be dismissive when I tell people to build demos, it's just that chaining LLM output is really easy and GPT prices seem to have gone down, and even without GPT there are a bunch of free models now, and at a certain point... yeah computing is expensive but that's not an excuse for why an easily demonstrable defense isn't being demonstrated by anyone anywhere. If a bunch of people say a security measure works, there should be some evidence of it working; somebody somewhere should be rich enough to set up a working example.

So it's both that the number of attempts to prove that this works are limited and that it's suspicious that companies saying they can defend against prompt injection don't do publicly available demos or tests; and it's also that the limited public demos that have been set up seem to fail really quickly and easily even without resorting to more rigorous pen-testing techniques or automated attacks.


Thanks for taking time to reply, I've learned a lot.


If you can't find a jailbreak which GPT-4 will fail to identity as a jailbreak when asked, you're not trying hard enough.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: