Hey HN - this is Alexander from Hatchet. We’re building an open-source platform for managing background tasks, using Postgres as the underlying database.
Just over a year ago, we launched Hatchet as a distributed task queue built on top of Postgres with a 100% MIT license (https://news.ycombinator.com/item?id=39643136). The feedback and response we got from the HN community was overwhelming. In the first month after launching, we processed about 20k tasks on the platform — today, we’re processing over 20k tasks per minute (>1 billion per month).
Scaling up this quickly was difficult — every task in Hatchet corresponds to at minimum 5 Postgres transactions and we would see bursts on Hatchet Cloud instances to over 5k tasks/second, which corresponds to roughly 25k transactions/second. As it turns out, a simple Postgres queue utilizing FOR UPDATE SKIP LOCKED doesn’t cut it at this scale. After provisioning the largest instance type that CloudSQL offers, we even discussed potentially moving some load off of Postgres in favor of something trendy like Clickhouse + Kafka.
But we doubled down on Postgres, and spent about 6 months learning how to operate Postgres databases at scale and reading the Postgres manual and several other resources [0] during commutes and at night. We stuck with Postgres for two reasons:
1. We wanted to make Hatchet as portable and easy to administer as possible, and felt that implementing our own storage engine specifically on Hatchet Cloud would be disingenuous at best, and in the worst case, would take our focus away from the open source community.
2. More importantly, Postgres is general-purpose, which is what makes it both great but hard to scale for some types of workloads. This is also what allows us to offer a general-purpose orchestration platform — we heavily utilize Postgres features like transactions, SKIP LOCKED, recursive queries, triggers, COPY FROM, and much more.
Which brings us to today. We’re announcing a full rewrite of the Hatchet engine — still built on Postgres — together with our task orchestration layer which is built on top of our underlying queue. To be more specific, we’re launching:
1. DAG-based workflows that support a much wider array of conditions, including sleep conditions, event-based triggering, and conditional execution based on parent output data [1].
2. Durable execution — durable execution refers to a function’s ability to recover from failure by caching intermediate results and automatically replaying them on a retry. We call a function with this ability a durable task. We also support durable sleep and durable events, which you can read more about here [2]
3. Queue features such as key-based concurrency queues (for implementing fair queueing), rate limiting, sticky assignment, and worker affinity.
4. Improved performance across every dimension we’ve tested, which we attribute to six improvements to the Hatchet architecture: range-based partitioning of time series tables, hash-based partitioning of task events (for updating task statuses), separating our monitoring tables from our queue, buffered reads and writes, switching all high-volume tables to use identity columns, and aggressive use of Postgres triggers.
We've also removed RabbitMQ as a required dependency for self-hosting.
We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
[0] https://www.postgresql.org/docs/
[1] https://docs.hatchet.run/home/conditional-workflows
[2] https://docs.hatchet.run/home/durable-execution
[0] https://github.com/procrastinate-org/procrastinate/
[1] https://github.com/TkTech/chancy
reply