Hacker News new | past | comments | ask | show | jobs | submit login

I'm starting to wonder if the most effective way to protect against prompt injection is to use an additional layer of (hopefully) a smaller model.

As in, another prompt that searches the input and/or output for questionable content before sending the result. The question will be if that is also susceptible, but I suspect fine tuning an LLM only to do the task of filtering and not parsing will be easier to control.




The way forward eventually is going to be to just not bother with any of this crap, and let it run free. The tech exists, and the problematic outputs are what the user says they want, eventually they're going to win out.


They’re not going to let it run free or you will see countless articles on “ChatGPT is a Holocaust denier, news at 11”.

And the lawsuits, oh the lawsuits. ChatGPT convinced my daughter to join a cult and now is a child bride, honest, Your Honor.


I think you’re both right. Microsoft won’t let theirs run free but there will be other vendors that do.

Who is intimately responsible for all of this?

Is it the end user? Don’t ask questions you don’t want to hear potentially dangerous answers to.

Is it Microsoft? It’s their product.

Is it OpenAI as Microsoft’s vendor?

When we start plugging in the moderation AI is it their responsibility for things that slip through?

Who and where did they get their training data from? And is there any ability to attribute things back to specific sources of training data and blame and block them?

Lots of layers. Little to no humans directly responsible for what it decides to say.

Maybe the end user does have to deal with it…


We used to see those articles, but now that the models are actually good enough to be useful I think people are much more willing to overlook the flaws.


> They’re not going to let it run free or you will see countless articles on “ChatGPT is a Holocaust denier, news at 11”.

If we're afraid of that then we're already worse off.


Here's why I don't think that will ever work: https://news.ycombinator.com/item?id=34720474


I agree with 99% of the statements made here, but I think a lot of them are now problems.

I think the big thing to consider is: We're still in the early days and there is a lot of low hanging fruit. It is possible that the number of potential injection attacks is innumerable, but it seems more likely to me that these will end up following patterns that will eventually be able to be classified into a finite number of groups (just with all other attack vectors), though the number of classifications might be significantly higher than structured languages.

That doesn't mean we won't find zero days, but it does mean that it won't be nearly as easy as it is today and companies will worry less about repetitional damage. If we could reliably have a human moderator determine if message is prompt injection or not, that should be able to be modelled.

I also think key to the approach is not to necessarily catch the injection before it's sent to the model, instead we should be evaluating the model response along with the input and block outputs that violate the rules of that service. That means you'd still waste resources with an injection, but filtering the output is a much simpler task.

Even as models get more capable and are able to do more and more tasks autonomously, that is most likely going to look like an LLM returning a code block that has a set of commands that are sandboxed. Like the LLM returns 'send-email <email> <subject> <message>`, which means there still will be a chance to moderate before the action is actually executed. Unless something changes significantly in the architecture of LLMs (which of course will happen at some point), this is how we would approach this today, and judging by bing's exfiltrated prompt, appears to be how they're doing it with search.

Also think, for things like Bing, and what most people are doing prompt injection for, the interest in this will subside once open source models catch up. This will also mean a new era for all of us because the genie will be fully out of the bottle.


That's like saying that it's not worth fixing security holes in an operating system because people will just find new ones




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: