Hey HN! We started Jamsocket a few years ago as a way to run ephemeral servers that last for as long as a WebSocket connection. We sandboxed those servers, so with the rise of LLMs we started to see people use them for arbitrary code execution.
While this worked, it was clunkier than we would have wanted from a first-principles code execution product. We built ForeverVM from scratch to be that product.
In particular, it felt clunky for app developers to have to think about sandboxes starting and stopping, so the core tenet of ForeverVM is using memory snapshotting to create the abstraction of a Python REPL that lives forever.
When you go to our site, you are given a live Python REPL; try it out!
---
Edit: here's a bit more about why/when/how this can be used:
LLMs are often given extra abilities through "tools", which are generally wrappers around API calls. For a lot of tasks (sending an email, fetching data from well-known sources), the LLM knows how to write Python code that accomplishes the same thing.
Any time the LLM needs to do a specific calculation or process data in a loop, we find it is better to have it generate code than to attempt the computation in the LLM itself.
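Here's a made-up example of the kind of snippet a model generates in that situation, rather than doing the arithmetic token by token:

    # Toy example (not from a real session): compound growth over 10 years,
    # computed in code instead of in the model's head.
    balance = 2_500.00
    rate = 0.07
    for year in range(1, 11):
        balance *= 1 + rate
        print(f"year {year}: {balance:,.2f}")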
We have an integration with Anthropic's Model Context Protocol, which is also supported by a lot of IDEs like Cursor and Windsurf. One surprising thing we've found: once it's installed, when we ask a question about Python, the LLM will see that ForeverVM is available as a tool and verify its answer automatically! So we cut down on hallucinations that way.
What are some of the edge cases where ForeverVM does and doesn't work? I don't see anything in the documentation about installing new packages; do you pre-bake what is available, and how can you see which libraries are available?
I do like that the ForeverVM REPL also seems to capture the state of the local drive (e.g., you can open a file, write to it, and then read from it).
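For example, something along these lines worked across separate instructions (paraphrasing, not an exact transcript):

    # first instruction: write a file
    with open("notes.txt", "w") as f:
        f.write("hello from an earlier instruction\n")

    # later instruction: read it back
    with open("notes.txt") as f:
        print(f.read())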
For context on what I've tried: I used CRIU [1] to dump the process state and then reload it. It worked for basic things, but I ran into the issues stated above and abandoned the project. (I was trying to create a stack / undo context for REPLs that LLMs could use, since they often put themselves into bad states, and reverting to previous states seemed useful.)

If I remember correctly, I also ran into trouble because capturing the various outputs (along the lines of ipython's capture_output) proved difficult outside of a Jupyter environment, and Jupyter environments themselves were even harder to snapshot. In the end I settled for ephemeral but still real-server Jupyter kernels, where a wrapper managed locals() and globals() as a cache and re-executed commands in order to rebuild state after the server restarted or crashed. This also let me pip install new packages, so it proved more useful than statically building my image/environment. But I did lose the "serialization" property of the machine state, which was something I wanted.
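Heavily simplified, the wrapper amounted to something like this (names made up, and the real thing talked to a Jupyter kernel rather than calling exec directly):

    # Sketch of the replay approach: keep the source of every successful
    # command, and rebuild state by re-executing the log after a restart.
    class ReplayingRepl:
        def __init__(self):
            self.namespace = {}   # stands in for the kernel's globals()
            self.history = []     # source of every command that succeeded

        def execute(self, code: str):
            exec(code, self.namespace)
            self.history.append(code)

        def rebuild(self):
            # called after the backing server restarts or crashes
            self.namespace = {}
            for code in self.history:
                exec(code, self.namespace)

    repl = ReplayingRepl()
    repl.execute("x = 41")
    repl.execute("y = x + 1")
    repl.rebuild()              # simulate a restart
    print(repl.namespace["y"])  # 42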
That said, even though I personally abandoned the project, I still hold onto the dream of a full tree/graph of VMs, where each edge is code that was executed and each VM state can be analyzed (files, memory, etc.). Love what ForeverVM is doing and the early promise here.
[1] https://criu.org/Main_Page