
We tried building a scalable and resilient cloud coding agent with @restatedev for workflows, @modal for sandboxes, @vercel for compute, and GPT-5 / Claude as the LLM. Think a mini version of Cursor background agents or Lovable, with a focus on scalability, resilience, and orchestration.

It is a fun exercise; there are many interesting patterns and micro-problems to solve:

* Durable steps (no retry spaghetti)
* Managing sessions across workflows (remembering conversations)
* Interrupting an ongoing coding task to add new context
* Robust lifecycles for resources (sandboxes)
* Scalable serverless deployments
* Tracing / replay / metrics

Sharing our learnings here for anyone building agents at scale, beyond "hello world".
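To make the first pattern concrete, here is a minimal sketch of durable steps wrapping the model and sandbox calls. This assumes the Restate TypeScript SDK; callModel and runInSandbox are hypothetical stand-ins for the real GPT-5 and Modal integrations.

  import * as restate from "@restatedev/restate-sdk";

  // Hypothetical stand-ins for the real LLM and sandbox calls.
  async function callModel(prompt: string): Promise<string> {
    return `console.log("generated for: ${prompt}")`;
  }
  async function runInSandbox(code: string): Promise<string> {
    return `ran: ${code}`;
  }

  const agent = restate.service({
    name: "agent",
    handlers: {
      codeTask: async (ctx: restate.Context, prompt: string) => {
        // Each ctx.run journals its result: if the process crashes or the
        // request is retried, completed steps are replayed from the journal
        // instead of re-calling the model or the sandbox (no retry spaghetti).
        const code = await ctx.run("generate code", () => callModel(prompt));
        return ctx.run("execute in sandbox", () => runInSandbox(code));
      },
    },
  });

  restate.endpoint().bind(agent).listen(9080);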


Indeed, the persistence layer is sensitive, and we do take this pretty seriously.

All data is persisted via RocksDB. Not only the materialized state of invocations and journals, but even the log itself uses RocksDB as the storage layer for its sequence of events. We do that to benefit from the insane testing and hardening that Meta has done (they run millions of instances). We are currently even trying to understand which operations and code paths Meta uses most, so we can adapt our code to use those and get the best-tested paths possible.

The more sensitive part would be the consensus log, which only comes into play if you run distributed deployments. In a way, that puts us in a similar boat to companies like Neon: having a reliable single-node storage engine, but having to build the replication and failover around it. But that is also where the value-add over most databases lies.

We do actually use Jepsen internally for a lot of testing.

(Side note: Jepsen might be one of the most valuable things that this industry has - the value it adds cannot be overstated)


I just realized I missed an important part: the primary durability for the bulk of the state comes from S3 (or a similar object store). The periodic snapshots give you an automatic, frequent backup mechanism for free, which in itself is a nice property to have.


Here is a comparison to Temporal; maybe that helps position this against those systems as well? https://news.ycombinator.com/item?id=43511814


There are a few dimensions where this is different.

(1) The design is a fully self-contained stack, event-driven, with its own replicated log and embedded storage engine.

That lets it ship as a single binary that you can use without dependencies (on your laptop or in the cloud). It is really easy to run.

It also scales out by starting more nodes. Every layer scales hand in hand, from log to processors. (You should give it an object store to offload data when running distributed.)

The goal is a really simple and lightweight way to run it yourself, while incrementally scaling to very large setups when necessary. I think that is non-trivial to do with most other systems.

(2) Restate pushes events, compared to Temporal pulling activities. This is to some extent a matter of taste, though the push model works very naturally with serverless functions (Lambda, CF Workers, fly.io, ...).

(3) Restate models services and stateful functions, not workflows. This means you can model logic that keeps state for longer than the scope of a workflow (you get a K/V store transactionally integrated with durable execution); see the sketch after this list. It also supports RPC and messaging between functions (exactly-once, integrated with the durable execution).

(4) The event-driven runtime, together with the push model, gets fairly good latencies (low overhead of durable execution).
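To illustrate point (3), here is a minimal sketch of a stateful durable function, assuming the Restate TypeScript SDK's virtual-object API; the session/history naming is just for illustration.

  import * as restate from "@restatedev/restate-sdk";

  // Each object key gets its own K/V state and serialized execution;
  // ctx.get / ctx.set are journaled together with the durable execution.
  const session = restate.object({
    name: "session",
    handlers: {
      append: async (ctx: restate.ObjectContext, message: string) => {
        const history = (await ctx.get<string[]>("history")) ?? [];
        history.push(message);
        ctx.set("history", history);
        return history.length;
      },
      history: async (ctx: restate.ObjectContext) =>
        (await ctx.get<string[]>("history")) ?? [],
    },
  });

  restate.endpoint().bind(session).listen(9080);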


What do you mean by "pushes events" and "pulling activities"? Where exactly does that take place during a durable execution? I have used Temporal and I know what Temporal Activities are, but the pushing and pulling confuses me.


afaik, with Temporal you deploy workers. When a workflow calls an activity, the activity gets added to a queue, and the workers pull activities from queues.

In Restate, there are no workers like that. The durable functions (which contain the equivalent of the activity logic) get deployed on FaaS or as a containerized RPC service. The Restate broker calls the function/service with the argument and some attached context (journal, state, ...).

You can think of it a bit like Kafka vs. EventBridge. The former needs long lived clients that poll for events, the latter pushes events to subscribers/listeners.

This "push" (Restate broker calls the service) means there doesn't have to be a long running process waiting for work (by polling a queue).

I think the difference also follows naturally from the programming abstraction: in Temporal, it is workflows that create activities; in Restate, it is stateful durable functions (bundled into services).
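A hedged sketch of the push side, assuming the Restate TypeScript SDK's Lambda adapter (the exact adapter API may differ across SDK versions): the service is a plain Lambda export, and the Restate broker pushes invocations to it, so no process sits polling a queue.

  import * as restate from "@restatedev/restate-sdk";

  const greeter = restate.service({
    name: "greeter",
    handlers: {
      greet: async (ctx: restate.Context, name: string) => {
        // No polling loop: Restate invokes this handler directly, passing
        // the request plus the attached journal/state context.
        return `Hello, ${name}`;
      },
    },
  });

  // Exported as a Lambda handler; Restate pushes invocations to it.
  // (Adapter name may differ across SDK versions.)
  export const handler = restate.endpoint().bind(greeter).lambdaHandler();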


The way we think about durable execution is that it is not just for long-running code, where you may want to suspend and later resume. In those cases, low-latency implementations would not matter, agreed.

But durable execution is immensely helpful for anything that has multiple steps that build on each other. Anytime your service interacts with multiple APIs, updates some state, keeps locks, or queues events. Payment processing, inventory, order processing, ledgers, token issuing, etc. Almost all backend logic that changes state ultimately benefits from a durable execution foundation. The database stores the business data, but there is so much implicit orchestration/coordination-related state - having a durable execution foundation makes all of this so much easier to reason about.

The question is then: Can we make the overhead low enough and the system lightweight enough such that it becomes attractive to use it for all those cases? That's what we are trying to build here.
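For example, a minimal sketch of such multi-step logic, again assuming the Restate TypeScript SDK; the payment, inventory, and email calls are hypothetical stubs.

  import * as restate from "@restatedev/restate-sdk";

  type Order = { id: string; amountCents: number };

  // Hypothetical integrations standing in for real payment/inventory/email APIs.
  async function chargeCard(order: Order): Promise<string> { return `charge-${order.id}`; }
  async function reserveStock(orderId: string): Promise<void> {}
  async function sendReceipt(orderId: string, chargeId: string): Promise<void> {}

  const orders = restate.service({
    name: "orders",
    handlers: {
      process: async (ctx: restate.Context, order: Order) => {
        // Three dependent steps; each result is journaled, so a retry after
        // a crash resumes exactly where it left off instead of, say,
        // charging the card twice.
        const chargeId = await ctx.run("charge card", () => chargeCard(order));
        await ctx.run("reserve stock", () => reserveStock(order.id));
        await ctx.run("send receipt", () => sendReceipt(order.id, chargeId));
      },
    },
  });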


All of the Restate co-founders come from various stages of Apache Flink.

Restate is in many ways a mirror image to Flink. Both are event-streaming architectures, but otherwise make a lot of contrary design choices.

(This is not really helpful for understanding what Restate does for you, but it is an interesting tidbit about the design.)

        Flink      |      Restate
  -----------------+---------------------
     analytics     |  transactions
   coarse-grained  |  fine-grained
     snapshots     |  quorum replication
    throughput-    |  latency-
     optimized     |  sensitive
   app and Flink   |  disaggregated code
   share a process |  and framework
        Java       |      Rust

the list goes on...


Thank you for the kind words!

The storage engine is pretty tightly integrated with the log, but the programming model allows you to attach quasi-arbitrary state to keys.

To see whether this fits your use case, it would be great to better understand the data and structures you are working with. Do you have a link where we could look at this?


> Do you have a link where we could look at this?

Hi, thank you for your reply, highly appreciated.

Happy to explain in more detail =) But it's not public yet.

I'm working on the consensus module optimised for hypergraphs with a succinct data-structure. The edges serve as an order-free index (FIT). Achieving max-flow, flow-matching, and graph-reduction via circuits is amongst the goals.

Targeting low-latency/high-performance distributed inference, enabling layer-combination of distinct models and resumable multi-use computations as a sort of distributed compute cache.

The data structure follows a data format for persistence, resumability, and service resilience. Although I've learned quite a bit about state management, it's still a topic I have much respect for, and I think using restate.dev may be better than re-inventing the wheel. I didn't have in mind to also build a cellular automaton for state management; it may be trivial, but I currently don't feel like I have the capacity for it. Restate looks like a great production-ready solution rather than delaying a release.

I intend to open-source it once it's mature. (But I believe binary distribution will be the more popular choice.)


The post discusses the design considerations when building a durable execution runtime from the ground up.

The goal is a highly-available, transactional, scalable, and low-latency runtime in a self-contained binary that scales from a laptop to complex distributed deployments.


Yes, there is one, have a look at https://restate.dev/cloud/


Thank you and sorry I missed it.


This is certainly building on principles and ideas from a long history of computer science research.

And yes, there are moments where you go "oh, we implicitly gave up xyz (e.g., causal order across steps) when we started adopting architecture pqr (microservices). But here is a thought on how to bring that back without breaking the benefits of pqr".

If you want, you can think of this as one of these cases. I would argue that there is tremendous practical value in that (I found that to be the case throughout my career).

And technology advances in zigzag lines. You add capability x but lose y on the way, and later someone finds a way to have x and y together. That's progress.

