
The problems here aren't different from those of restricting malicious or hacked employees, or malicious or hacked third-party libraries.

You start with the low-hanging fruit: run tool commands inside a kernel sandbox that switches off internet access, then re-provide access only via an HTTP proxy that implements some security policies. For example, instead of giving the AI direct access to API keys you give it a fake key that the proxy substitutes for the real one, and the proxy can obviously restrict access by domain and verb, e.g. allow GET on everything but restrict POST to just the one or two domains you know it needs for its work. You restrict file access to only the project directory, and so on.
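To make that concrete, here is a rough sketch of the policy core such a proxy might run. It's framework-agnostic Python and every name in it (ALLOWED_POST_HOSTS, PLACEHOLDER_KEY, REAL_API_KEY) is illustrative, not from any particular product; a real deployment would wire this logic into whatever proxy you actually run (a mitmproxy addon, a custom CONNECT proxy, etc.):

    # policy.py -- sketch of the egress-proxy policy described above.
    # All names are hypothetical; plug the logic into a real proxy.
    import os
    from typing import Optional
    from urllib.parse import urlsplit

    ALLOWED_POST_HOSTS = {"api.github.com"}          # the one or two domains the agent needs
    PLACEHOLDER_KEY = "FAKE-KEY-GIVEN-TO-THE-AGENT"  # what the AI sees
    REAL_KEY = os.environ.get("REAL_API_KEY", "")    # only the proxy process knows this

    def apply_policy(method: str, url: str, headers: dict) -> Optional[dict]:
        """Return (possibly rewritten) headers if the request is allowed, else None."""
        host = urlsplit(url).hostname or ""

        # Verb/domain policy: GET allowed everywhere, POST only to the
        # allowlist, everything else denied.
        if method == "GET":
            allowed = True
        elif method == "POST":
            allowed = host in ALLOWED_POST_HOSTS
        else:
            allowed = False
        if not allowed:
            return None

        # Credential substitution: the agent only ever saw the placeholder
        # key; swap in the real one at the proxy boundary.
        out = dict(headers)
        auth = out.get("Authorization", "")
        if PLACEHOLDER_KEY in auth:
            out["Authorization"] = auth.replace(PLACEHOLDER_KEY, REAL_KEY)
        return out

    if __name__ == "__main__":
        print(apply_policy("GET", "https://example.com/data", {}))     # allowed
        print(apply_policy("POST", "https://evil.example/exfil", {}))  # blocked -> None
        print(apply_policy("POST", "https://api.github.com/repos",
                           {"Authorization": "Bearer " + PLACEHOLDER_KEY}))  # key swapped

The point is that the agent never holds a usable credential and never gets an unmediated socket; everything it does to the outside world passes through this one choke point.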

Then you can move up a level and start sandboxing the sub-components the AI is working on, using the same sort of tech.
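For example (a minimal sketch, assuming Linux with bubblewrap installed; the component directory and the command are just placeholders), you can run one sub-component's build or tests with no network and write access limited to its own directory:

    # run_sandboxed.py -- sketch: run one sub-component offline, with write
    # access limited to its own directory. Assumes Linux + bubblewrap (bwrap).
    import subprocess
    from pathlib import Path
    from typing import List

    def run_component(component_dir: str, command: List[str]) -> int:
        comp = Path(component_dir).resolve()
        argv = [
            "bwrap",
            "--ro-bind", "/", "/",           # whole filesystem read-only...
            "--dev", "/dev",
            "--proc", "/proc",
            "--tmpfs", "/tmp",
            "--bind", str(comp), str(comp),  # ...except this component's directory
            "--unshare-net",                 # no network at all inside the sandbox
            "--unshare-pid",
            "--die-with-parent",
            "--chdir", str(comp),
        ] + command
        return subprocess.run(argv).returncode

    if __name__ == "__main__":
        # e.g. build and test just this component, fully offline
        raise SystemExit(run_component("./services/billing", ["make", "test"]))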




This conversation began as one about Claude, which has access to hundreds of thousands of people with no training in, and no interest in learning, how to prevent Claude from doing damage to society. That makes it materially different from a library: even if an intruder subverts a library running on servers that serve hundreds of thousands of users, e.g. a library for compressing files, that library is very unlikely to be able to start having conversations with a large fraction of those users without someone noticing that something is very wrong.

Although I concede that there are some applications of AI that can be made significantly safer using the measures you describe, you have to admit that those applications are fairly rare and emphatically do not include Claude and its competitors. For example, Claude has plentiful access to computing resources because people routinely ask it to write code, most of which will go on to be run (and Claude knows that). Surely you will concede that Anthropic is not about to start insisting on the use of a sandbox around any code that Claude writes for any paying customer.

When Claude and its competitors were introduced, a model would reply to a prompt and then, about a second later, lose all memory of that prompt and its reply. Such an LLM is of course no great threat to society because it cannot pursue an agenda over time, but the labs are working hard to create models that are "more agentic". I worry about what happens when they succeed at this (publicly stated) goal.



