One way to address this concern is (stateful) stream processing, for instance using Kafka Streams or Apache Flink: using a state store, you ingest the CDC streams for all the tables and incrementally update the joins between them whenever an event comes in on either input stream. Touched on this recently in a talk about streaming data contracts: https://speakerdeck.com/gunnarmorling/data-contracts-in-prac....
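A minimal sketch of this in Kafka Streams, assuming two CDC topics keyed by the same customer id and carrying the flattened row state as string values (topic names, serdes, and the join output are all illustrative):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KTable;

    public class CdcJoinSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-join-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // Each CDC topic is treated as a changelog: table() materializes
            // the latest value per key into a local state store.
            KTable<String, String> customers = builder.table("customers");
            KTable<String, String> addresses = builder.table("addresses");

            // Incrementally maintained join: a change event on either input
            // re-emits the joined result for the affected key.
            KTable<String, String> joined = customers.join(
                    addresses, (customer, address) -> customer + "|" + address);

            joined.toStream().to("customers-with-addresses");

            new KafkaStreams(builder.build(), props).start();
        }
    }

(When the tables join on a foreign key rather than a shared primary key, KTable's foreign-key join variant handles the re-keying.)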

An alternative is feeding only one CDC stream into the stream processor, typically for the root of your aggregate structure, and then re-selecting the entire aggregate by running a join query against the source database (discussed this here: https://www.decodable.co/blog/taxonomy-of-data-change-events).
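A sketch of that re-select variant using a plain consumer and JDBC, assuming an "orders" aggregate root whose CDC topic is keyed by order id (topic name, connection string, and SQL are illustrative):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AggregateReselect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "aggregate-reselect");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app")) {
                // Only the aggregate root's CDC topic is consumed...
                consumer.subscribe(List.of("cdc.public.orders"));
                PreparedStatement stmt = db.prepareStatement(
                        "SELECT o.id, o.status, l.product, l.quantity FROM orders o " +
                        "JOIN order_lines l ON l.order_id = o.id WHERE o.id = ?");
                while (true) {
                    for (ConsumerRecord<String, String> event : consumer.poll(Duration.ofSeconds(1))) {
                        // ...and the full aggregate is re-read from the
                        // source database whenever the root changes.
                        stmt.setLong(1, Long.parseLong(event.key()));
                        try (ResultSet rs = stmt.executeQuery()) {
                            while (rs.next()) {
                                System.out.printf("order %d (%s): %s x%d%n",
                                        rs.getLong(1), rs.getString(2),
                                        rs.getString(3), rs.getInt(4));
                            }
                        }
                    }
                }
            }
        }
    }

The trade-off shows right in the code: no local state store is needed, but every change puts a query on the source database, and you only see the aggregate's latest state rather than every intermediate one.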

Both approaches have their pros and cons, e.g. in terms of required state size, (transactional) consistency, guarantees of capturing all intermediate states, etc.




Of course it's possible to solve. But I think the solutions (and the cons of those solutions) are often pretty arduous or unreasonable.

Take the example of joining multiple streams: you have to have all your data outside your database, and you have to stream every table you want data from. And worst of all, you don't have any transactional consistency across those joins: it's possible to emit a join result when you've only consumed half of the streaming data (e.g. one of two topics).

The point is, everything seems so simple in the example, but these tools often don't scale (simply) to multiple tables in the source DB.


> Of course it's possible to solve. But I think the solutions (and the cons of those solutions) are often pretty arduous or unreasonable.

> worst of all, you don't have any transactional consistency across those joins: it's possible to emit a join result when you've only consumed half of the streaming data (e.g. one of two topics).

This is exactly right! Most streaming solutions out there oversimplify real use cases where you have multiple upstream tables and need strongly consistent joins in the streamed data, such that transactional guarantees are propagated downstream. It's very hard to achieve this with classic CDC + Kafka style systems.

We provide these guarantees in the product I work on, and one of our co-founders talks a bit about the benefits here: https://materialize.com/blog/operational-consistency/ .

It's often something that folks overlook when choosing a streaming system and then get bitten when they realize they can't easily join across the tables ingested from their upstream db and get correct results.


> you don't have any transactional consistency across those joins

Debezium gives you all the information you need to establish those transactional guarantees (via its transaction metadata topic), so you can implement buffering logic that emits join results only once all events originating from the same transaction have been processed.
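A sketch of such buffering logic with simplified placeholder types, assuming just the essentials of Debezium's model: data events carry a transaction id, and the transaction metadata topic emits an END marker with the transaction's total event count:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TxnBuffer {
        record DataEvent(String txId, String payload) {}

        private final Map<String, List<DataEvent>> pending = new HashMap<>();
        private final Map<String, Long> expected = new HashMap<>();

        // Called for each event read from the data topics.
        public void onDataEvent(DataEvent e) {
            pending.computeIfAbsent(e.txId(), k -> new ArrayList<>()).add(e);
            maybeEmit(e.txId());
        }

        // Called when an END marker arrives on the transaction metadata topic.
        public void onTransactionEnd(String txId, long eventCount) {
            expected.put(txId, eventCount);
            maybeEmit(txId);
        }

        private void maybeEmit(String txId) {
            Long count = expected.get(txId);
            List<DataEvent> buffered = pending.get(txId);
            if (count != null && buffered != null && buffered.size() == count) {
                // The whole transaction has arrived: emit join results now,
                // so downstream never observes a partial transaction.
                buffered.forEach(e -> System.out.println("emit " + e.payload()));
                pending.remove(txId);
                expected.remove(txId);
            }
        }

        public static void main(String[] args) {
            TxnBuffer buf = new TxnBuffer();
            buf.onDataEvent(new DataEvent("tx-1", "order 42 created"));
            buf.onTransactionEnd("tx-1", 2); // still one event outstanding
            buf.onDataEvent(new DataEvent("tx-1", "order_line 7 created")); // emits both
        }
    }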


The key ingredient is leveled reads across topics.

A task which is transforming across multiple joined topics, or is materializing multiple topics to an external system, needs to read across those topics in a coordinated order so that _doesn't_ happen.

Minimally, by using wall-clock publication time of the events so that "A before B" relationships are preserved, and ideally by using transaction metadata captured from the source system(s) so that transaction boundaries propagate through the data flow.

Essentially, you have to do a streaming data shuffle where you're holding back some records to let other records catch up. We've implemented this in our own platform [1] (I'm co-founder).
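Here's a toy illustration of that shuffle over two already-buffered streams, merging by publication timestamp (the Event record and in-memory queues stand in for real topic readers):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    public class LeveledMerge {
        record Event(String topic, long publishMillis, String payload) {}

        static void merge(Deque<Event> a, Deque<Event> b) {
            while (!a.isEmpty() && !b.isEmpty()) {
                // Always emit the earliest head; the stream that is ahead in
                // time is held back until the other catches up.
                Deque<Event> next =
                        a.peek().publishMillis() <= b.peek().publishMillis() ? a : b;
                System.out.println("emit " + next.poll());
            }
            // Once one buffer is empty we must wait rather than drain the
            // other: an earlier event may still arrive on the empty stream.
        }

        public static void main(String[] args) {
            Deque<Event> orders = new ArrayDeque<>(List.of(
                    new Event("orders", 100, "o1"), new Event("orders", 300, "o2")));
            Deque<Event> payments = new ArrayDeque<>(List.of(
                    new Event("payments", 200, "p1")));
            merge(orders, payments); // emits o1, p1; o2 is held back
        }
    }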

You can also take this further by deliberately holding back data from certain streams. This is handy because in stream processing you sometimes want to react when an event _doesn't_ happen. [2] We have an example that generates events when a bike-share bike hasn't moved from its last parked station in 48 hours, indicating it's probably broken [3]. It's accomplished by statefully joining a stream of rides with that same stream delayed by 48 hours.
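A minimal sketch of that delayed self-join, with simplified types rather than our actual implementation: the live read tracks the last ride per bike, and the same events arriving 48 hours later check whether anything newer has shown up since:

    import java.util.HashMap;
    import java.util.Map;

    public class IdleBikeDetector {
        record Ride(String bikeId, long endMillis, String station) {}

        private final Map<String, Ride> lastRide = new HashMap<>();

        // Live read of the rides stream: remember the latest ride per bike.
        public void onLiveRide(Ride r) {
            lastRide.put(r.bikeId(), r);
        }

        // Same stream, read with a 48h delay: if this ride is still the
        // latest one we know of, the bike hasn't moved since.
        public void onDelayedRide(Ride r) {
            Ride latest = lastRide.get(r.bikeId());
            if (latest != null && latest.endMillis() == r.endMillis()) {
                System.out.println("bike " + r.bikeId()
                        + " idle for 48h at " + r.station());
            }
        }

        public static void main(String[] args) {
            IdleBikeDetector d = new IdleBikeDetector();
            d.onLiveRide(new Ride("b1", 1_000L, "W 52 St & 11 Ave"));
            // ...48 hours later the delayed read replays the same event:
            d.onDelayedRide(new Ride("b1", 1_000L, "W 52 St & 11 Ave")); // alert
        }
    }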

[1] https://www.estuary.dev [2] https://docs.estuary.dev/concepts/derivations/#read-delay [3] https://github.com/estuary/flow/blob/master/examples/citi-bi...


Yeah, the naive approach would be to join on key; otherwise you can join on a transaction id plus key. However, it's definitely complicated to get right.


> One way to address this concern is (stateful) stream processing, for instance using Kafka Streams […]

I have recently been burned by Kafka Streams specifically: it was causing excessive rebalancing in the Kafka cluster due to its use of the incremental cooperative rebalancing protocol, and Kafka Streams doesn't let the implementation change the rebalancing protocol. The excessive rebalancing resulted in severe processing slowdowns.

The problem was resolved by falling back to the plain Kafka Consumer/Producer APIs and selecting the traditional eager protocol with a round-robin partition assignment strategy, which reduced the rebalancing to near zero.
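For reference, this is the kind of consumer configuration I mean; RoundRobinAssignor supports only the classic eager protocol, whereas Kafka Streams pins its own assignor (broker address and group id are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.RoundRobinAssignor;

    public class EagerConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "aggregator");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // Selecting an eager, round-robin assignor is possible with the
            // plain consumer API, but not with Kafka Streams.
            props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                    RoundRobinAssignor.class.getName());
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        }
    }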

I am starting to think that for stateful stream processing, ksqlDB-based stream aggregation is the way to go for simple to medium-complexity cases, while for complex scenarios it's better to aggregate the data in the source, or as close to the source as possible, rather than use Kafka Streams. In my case, the complication is that the source is not a database / data store but a bespoke in-house solution emitting CDC-style events that are ingested into the Kafka streaming app.



