There was a post [0] recently about the bing chatGPT assistant either citing or hallucinating it’s own initial prompt from the (in theory) low privileged chat input UI they put together. This feels like it’s almost unavoidable if you let users actually chat with something like this.
How would we sanitize strings now? I know OpenAI has banned topics they seem to regex for, but that’s always going to miss something. Are we just screwed and should make sure chat bots just run in a proverbial sandbox and can’t do anything themselves?
[0] https://news.ycombinator.com/item?id=34717702
So, the danger seems to be that there is no currently documented way to completely remove these possible outputs, because that's just not how these systems work.
Prompt engineering in this specific usage could be thought of as injection, but from what I understand, there's currently no known sanitization process. In theory one could use the system itself to determine intent and sanitize input this way, but I believe there's a possibility for one to craft intent that is understood by the system, but the intent description itself isn't. This would be akin to bypassing sanitization.
ChatGPT seems to already do some form of this intent processing, either inherently or explicitly. But all prompt crafting at the moment is first based on this injection or jailbreaking to bypass intent sanitization.