
Take a look at Materialize, Noria, and the family of Differential/Timely Dataflow technologies. It's the same concept on steroids: you can subscribe to arbitrary queries and efficiently receive any changes to that view, and you can efficiently maintain any materialized view for extremely fast reads on known queries.

An automatic stream processing pipeline for maintaining caches and listening to complex real-time events.
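To make the subscribe-to-a-view idea concrete, here's a toy Python sketch of an incrementally maintained view with change subscriptions. The class and callback shape are invented for illustration; real systems like Differential Dataflow maintain arbitrary dataflow graphs, not just a per-key counter.

```python
from collections import defaultdict

class MaterializedCount:
    """Toy incrementally maintained view: COUNT(*) GROUP BY key.

    Illustrates the contract only: writes produce deltas, and
    subscribers see deltas, never full rescans of the base data.
    """

    def __init__(self):
        self.counts = defaultdict(int)
        self.subscribers = []

    def subscribe(self, callback):
        # callback receives (key, old_count, new_count) for each change
        self.subscribers.append(callback)

    def insert(self, key):
        old = self.counts[key]
        self.counts[key] = old + 1
        for cb in self.subscribers:
            cb(key, old, old + 1)

    def delete(self, key):
        old = self.counts[key]
        self.counts[key] = old - 1
        for cb in self.subscribers:
            cb(key, old, old - 1)
```

A subscriber attached with `subscribe` sees each insert/delete as a small delta, which is the same shape of feed you'd consume from a real change-subscription query.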

Quite underrated; it has so much promise. The concept is not new, but it's still semi-stuck in Rust-land. It's becoming more mainstream with Materialize, which is technically open-source, but they are quite aggressive about pushing their expensive cloud and obfuscating on-prem usage.

https://github.com/MaterializeInc/materialize

https://github.com/mit-pdos/noria

https://timelydataflow.github.io/differential-dataflow/

https://timelydataflow.github.io/timely-dataflow/




(Materialize CTO here.)

> It's becoming more mainstream with Materialize, which is technically open-source, but they are quite aggressive about pushing their expensive cloud and obfuscating on-prem usage.

Quick but important clarification: Materialize is source available, not open source. We've been licensed under the BSL [0] from the beginning. We feel that the BSL is the best way to ensure we can build a sustainable business to fund Materialize's development, while still contributing our research advances back to the scientific community.

> Quite underrated, it has so much promise.

I'm glad you think so. We think so too. One of the best parts of my job is watching the "aha" moment our prospects have when they realize how much of the complex code they've been writing is neatly expressed as a SUBSCRIBE over a SQL materialized view.

[0]: https://github.com/MaterializeInc/materialize/blob/main/LICE...


It's crazy to me that the most frequently updated file in your repository is the license, pushing back the open-source date by a day every day.


Those updates are not retroactive; they apply on a go-forward basis. Each day's changes become Apache 2.0 licensed on that day four years in the future.

For example, v0.28 [0] was released on October 18, 2022, and becomes Apache 2.0 licensed four years after that date, on October 18, 2026 (about 2.5 years from today).
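The per-release change-date arithmetic is simple enough to check in a couple of lines. This is a hypothetical helper for illustration, not code from the Materialize repository, and it ignores the leap-day edge case (a February 29 release date would need special handling).

```python
from datetime import date

def bsl_change_date(release: date, years: int = 4) -> date:
    """Compute the date a BSL-licensed release converts to Apache 2.0.

    Hypothetical helper: each day's code becomes Apache 2.0 licensed
    `years` (default four) after its release date. Raises ValueError
    for a Feb 29 release, which has no direct anniversary.
    """
    return release.replace(year=release.year + years)

# v0.28 was released October 18, 2022, so it converts October 18, 2026.
assert bsl_change_date(date(2022, 10, 18)) == date(2026, 10, 18)
```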

[0]: https://github.com/MaterializeInc/materialize/blob/76cb6647d...


I love this concept. Did you all come up with this or is there prior art? Is there a name for this concept?


We did not originate the Business Source License (BSL/BUSL). It was originally developed by the folks behind MariaDB. Wikipedia has a good article that covers the history: https://en.wikipedia.org/wiki/Business_Source_License

Other large projects using the BSL include CockroachDB and (somewhat infamously) Terraform.

We're very glad to have been using the BSL for Materialize since our very first release. Relicensing an existing open source project under the BSL can be a painful transition.


I was actually asking about the automatic timed re-license to Apache :)


Ah, I misunderstood! Yes, we may have invented that. I whipped up the cron job a few years back in response to concerns from our legal team. I’m not aware of any prior art for automatically advancing the change date for the BSL.


Hey Benesch, is Materialize used by TimescaleDB to create materialized views? I noticed a similar approach.


Not to my knowledge. I believe TimescaleDB has their own incremental view maintenance engine.


OK, so I was wondering if your solution is faster. I noticed their materialized views are not as fast for real-time data.


We haven't benchmarked TimescaleDB, so I can't say. Results tend to vary heavily by workload, too.

What I can say is that the research at the heart of Materialize (https://dl.acm.org/doi/10.1145/2517349.2522738) allows us to efficiently maintain computations that are more complex than what a lot of other IVM systems can handle.
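As a toy illustration of what "efficiently maintain more complex computations" means, here's a sketch of incrementally maintaining a two-table join count using the standard delta rule for joins. This is an invented example, not Materialize's actual engine; differential dataflow generalizes this idea to arbitrary (including recursive) dataflows.

```python
from collections import Counter

class JoinCount:
    """Toy IVM for |A JOIN B ON key|, updated per-row rather than
    recomputed from scratch.

    Delta rule: inserting a row into A adds (matching B rows) to the
    join size, and symmetrically for B. Cost per update is O(1) here,
    versus O(|A| * |B|) for a full recomputation.
    """

    def __init__(self):
        self.a = Counter()  # key -> multiplicity in A
        self.b = Counter()  # key -> multiplicity in B
        self.total = 0      # maintained join cardinality

    def insert_a(self, key):
        self.total += self.b[key]  # new A row matches existing B rows
        self.a[key] += 1

    def insert_b(self, key):
        self.total += self.a[key]  # new B row matches existing A rows
        self.b[key] += 1
```

For example, one `"x"` row in A and two `"x"` rows in B yields a maintained join size of 2, with no rescans of either input.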

Your best bet is to run your own benchmark of both systems using data that's representative of your workload. We offer a free seven-day playground if you'd like to run such a benchmark: https://console.materialize.com/account/sign-up

We also have a community Slack where a number of Materialize employees hang out and answer questions: http://materialize.com/s/chat


Thanks!


All cool stuff.

But to my mind, the main advantage of this Postgres workflow is that you get something simple without adding a ton of stuff to your stack.


Indeed! It would be so much better if this were a Postgres extension instead.

There are some efforts, but they're still quite immature: https://github.com/sraoss/pg_ivm

But at least Materialize has Postgres wire compatibility, and Noria has the same for MySQL. It's a plug-and-play switch, not as complex as adopting other pub/sub, job-queue, or KV-caching solutions.


> It would be so much better if this were a Postgres extension instead.

I've thought about this counterfactual a lot. (I'm a big part of the reason that Materialize was not built as a PostgreSQL extension.) There are two major technical reasons that we decided to build Materialize as a standalone product:

1. Determinism. For IVM to be correct, computations must be strictly deterministic. PostgreSQL is full of nondeterministic functions: things like random(), gen_random_uuid(), pg_cancel_backend(), etc. You can see the whole list with `SELECT * FROM pg_proc WHERE provolatile <> 'i'`. And that's just scratching the surface: query execution makes a number of arbitrary decisions (e.g., ordering or not) that can cause nondeterminism in results. Building an IVM extension within PostgreSQL would require hunting down every one of these nondeterministic moments and forcing determinism on them, a very long game of whack-a-mole.

2. Scale. PostgreSQL is fundamentally a single-node system. But much of the reason you need to reach for Materialize is that your computation is exceeding the limit of what a single machine can handle. If Materialize were a PostgreSQL extension, IVM would be competing for resources (CPU, memory, disk, network) with the main OLTP engine. But since Materialize is a standalone system, you get to offload all that expensive IVM work to a dedicated cluster of machines, leaving your main PostgreSQL server free to spend all of its cycles on what it's uniquely good at: transaction concurrency control.
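The determinism problem in point 1 can be shown with a toy example: a maintained view caches previously computed rows, so if the view's expressions are nondeterministic, the cached result and a fresh recomputation disagree, and there is no single consistent answer to maintain. Here a counter stands in for random() or gen_random_uuid() so the divergence is reproducible.

```python
import itertools

# Stand-in for a volatile function: returns a new value on every call,
# the way random() or gen_random_uuid() would in PostgreSQL.
_counter = itertools.count()

def volatile_fn(x):
    return next(_counter)

base = [10, 20]
maintained = [volatile_fn(x) for x in base]  # computed once and cached by IVM
recomputed = [volatile_fn(x) for x in base]  # what a full refresh would produce

# The incrementally maintained result and the recomputation diverge:
# there is no stable "correct" view contents to maintain.
assert maintained != recomputed
```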

So while the decision to build Materialize as a separate system means there's a bit more friction to getting started, it also means that you don't need to have a plan for what happens when you exceed the limits of a single machine. You just scale up your Materialize cluster to distribute your workload across multiple machines.

One cool thing we're investigating is exposing Materialize via a PostgreSQL foreign data wrapper [0]. Your ops/data teams would still be managing two separate systems, but downstream consumers could be entirely oblivious to the existence of Materialize—they'd just query tables/views in PostgreSQL like normal, and some of those would be transparently served by Materialize under the hood.

[0]: https://www.postgresql.org/docs/current/postgres-fdw.html


Timely dataflow in Python: https://github.com/bytewax/bytewax


For folks using TypeScript, we're building this for web development at Triplit. We support incrementally updating subscribed queries in real time, from server to client, over WebSockets.

https://www.triplit.dev/



