More

gunnarmorling · 2024-05-07T07:44:55 1715067895

Things got finally a bit easier with Postgres 16, which allows you to create replication slots on stand-by servers. It still requires manual intervention to make sure they don't fall too far behind their primary counterpart and to do the failover, but it's not too terrible either and it's guaranteed that failover can happen without missing any events. I've described the approach here: https://www.decodable.co/blog/logical-replication-from-postg.... Postgres 17 should make this even simpler.

gunnarmorling · 2024-04-14T19:47:56 1713124076

This actually doesn't fit the rules. I've designed the challenge so that disk I/O is not part of the measured runtime (initially, by relying on the fact that the first run which would pull the file into the page cache will be the slowest and thus discarded, later on by loading the file from a RAM disk). But keeping state between runs is explicitly ruled out as per the README, for the reason you describe.

gunnarmorling · 2024-04-02T21:09:45 1712092185

Debezium supports all kinds of data formats. For instance, ProtoBuf is often used these days, but it's easily customizable/configurable to have other formats, too.

> "At least a few months"

Can you back this up? This doesn't match my experience from working with users in the Debezium community at all. Don't get me wrong, Debezium certainly has a learning curve, but that's nearwhere realistic.

I think you're doing interesting stuff at PeerDB, but it would be nice if you could do without this kind of unfounded anti-Debezium FUD.

saisrirampur · 2024-04-02T21:53:49 1712094829

Thanks for the feedback here! My apologies if this came across as FUD. So let me clarify, Debezium is proven. Many large companies use Debezium for production/enterprise grade CDC. So it is indeed a great piece of software!!

However, higher Capex and Opex costs to put Debezium to prod is one of the common problem we've heard from Postgres users.

This is indeed related to the learning curve: One issue we've heard is the emphasis on command-line interface (vs UI) which provides a bunch of options for configurability but makes it complex for a first time (average) user to work with. There is Debezium UI, but that is not the default recommendation (and seems to be still in incubation). At PeerDB, we are trying to address this by working on a simple (yet Advanced) UI and Postgres-compatible SQL layer (for more complex pipelines) which we believe is more intuitive than bash scripts.

Disclaimer: PeerDB is still in its early stages, and we have yet to support the full range of configurability options that Debezium offers. We may encounter same challenges as Debezium related to learning curve. However, we plan keep "Usability" as top priority while building software. So far, with a few production customers, the direction we are taking seems positive.

In regards to formats, we have a few users who wanted Flatbuffers/MsgPack (binary JSON) formats and it wasn't trivial to setup with Debezium. These are users who haven't worked with Debezium before but after few days of effort, felt that it wasn't very easy.

Thanks again for the feedback here! And apologies if my comment came across negative. Thanks for probing me to clarify what I meant. :)

gunnarmorling · 2024-04-02T20:03:31 1712088211

One way for addressing this concern is (stateful) stream processing, for instance using Kafka Streams of Apache Flink: using a state store, you ingest the CDC streams for all the tables and incrementally update joins between them whenever there's an event coming in for either input stream. Touched on this recently in this talk about streaming data contracts: https://speakerdeck.com/gunnarmorling/data-contracts-in-prac....

An alternative is feeding only one CDC stream into the stream processor, typically for the root of your aggregate structure, and then re-select the entire aggregate by running a join query against the source database (discussed this here: https://www.decodable.co/blog/taxonomy-of-data-change-events).

Both approaches have there pros and cons, e.g. in terms of required state size, (transactional) consistency, guarantees of capturing all intermediary states, etc.

zknill · 2024-04-02T20:20:35 1712089235

Of course it's possible to solve. But I think the solutions (and the cons of those solutions) are often pretty arduous or unreasonable.

Take the example of joining multiple streams, you have to have all your data outside your database. You have to stream every table you want data from. And worst of all, you don't have any transactional consistency across those joins, it's possible to join when you've only consumed one half of the streaming data (e.g. one of two topics).

The point is, everything seems so simple in the example but these tools often don't scale (simply) to multiple tables in the source db.

roshanj · 2024-04-02T21:00:17 1712091617

> Of course it's possible to solve. But I think the solutions (and the cons of those solutions) are often pretty arduous or unreasonable.

> worst of all, you don't have any transactional consistency across those joins, it's possible to join when you've only consumed one half of the streaming data (e.g. one of two topics).

This is exactly right! Most streaming solutions out there overly simplify real use-cases where you have multiple upstream tables and need strongly consistent joins in the streamed data, such that transactional guarantees are propagated downstream. It's very hard to achieve this with classic CDC + Kafka style systems.

We provide these guarantees in the product I work on and one of our co-founders talks a bit about the benefits here: https://materialize.com/blog/operational-consistency/ .

It's often something that folks overlook when choosing a streaming system and then get bitten when they realize they can't easily join across the tables ingested from their upstream db and get correct results.

gunnarmorling · 2024-04-02T21:02:09 1712091729

> you don't have any transactional consistency across those joins

Debezium gives you all the information you need for establishing those transactional guarantees (transaction metadata topic), so you can implement a buffering logic for emitting join results only when all events originating from the same transaction have been processed.

jgraettinger1 · 2024-04-02T20:38:43 1712090323

The key ingredient is leveled reads across topics.

A task which is transforming across multiple joined topics, or is materializing multiple topics to an external system, needs to read across those topics in a coordinated order so that _doesn't_ happen.

Minimally by using wall-clock publication time of the events so that "A before B" relationships are preserved, and ideally using transaction metadata captured from the source system(s) so that transaction boundaries propagate through the data flow.

Essentially, you have to do a streaming data shuffle where you're holding back some records to let other records catch up. We've implemented this in our own platform [1] (I'm co-founder).

You can also attenuate this further by deliberately holding back data from certain streams. This is handy when stream processing because you sometimes want to react when an event _doesn't_ happen. [2] We have an example that generates events when a bike-share bike hasn't moved from it's last parked station in 48 hours, for example, indicating it's probably broke [3]. It's accomplished by statefully joining a stream of rides, with that same stream but delayed 48 hours.

[1] https://www.estuary.dev [2] https://docs.estuary.dev/concepts/derivations/#read-delay [3] https://github.com/estuary/flow/blob/master/examples/citi-bi...

jerrygenser · 2024-04-02T20:34:28 1712090068

Yeah the naive approach would to join on key. Otherwise can join on a transaction id and key. However it's definitely complicated to get right.

inkyoto · 2024-04-03T01:59:08 1712109548

> One way for addressing this concern is (stateful) stream processing, for instance using Kafka Streams […]

I have recently been burned by Kafka Streams specifically: it has been causing the excessive rebalancing in the Kafka cluster due to the use of the incremental cooperative rebalancing protocol, and not allowing the implementation to change the rebalancing protocol if Kafka Streams are used. The excessive rebalancing has resulted in severe processing slowdowns.

The problem has been resolved by falling back to plain Kafka Consumer/Producer API's and selecting the traditional eager protocol with a round-robin partition assignment strategy, which has reduced the rebalancing to near zero.

I am starting to think that for stateful stream processing, ksqlDB based stream aggregation is the way to go for simple to medium complexity stream aggregation, and for complex scenarios aggregating the data either in the source or as close to the source as possible and not using Kafka Streams. In my case, the complication is that the source is not a database / data store but a bespoke in-house solution emitting CDC style events that are ingested into the Kafka streaming app.

gunnarmorling · 2024-03-03T22:32:25 1709505145

There's the "Show & Tell", where folks have shared implementations in many different languages: https://github.com/gunnarmorling/1brc/discussions/categories....

gunnarmorling · 2024-02-22T20:28:31 1708633711

There is one entry which uses a trie, but it's not at the top, IIRC (which may or may not be related to using a trie, there's many other relevant design decisions).

gunnarmorling · 2024-02-22T20:27:39 1708633659

What an excellent post, one of the best on 1BRC I've come across so far. Big shout-out to Marko for participating in the challenge, making a strong push for clarifying corner cases of the rules, and sharing his experiences in this amazing write-up!

haxen · 2024-02-22T20:54:27 1708635267

Thanks for the praise Gunnar, but we all owe it to you for organizing it, and especially sticking through thick and thin when it took off, and needed lots of attention to evaluate everyone and maintain a level playing field!

tpmoney · 2024-02-23T01:29:16 1708651756

The last part of the article raises an interesting question for me. What is the fastest, mostly fault tolerant implementation that could be created? So something that runs on say 10 different versions of the input, each of which has had up to 20 characters from the standard ASCII set inserted, replaced or removed randomly in the input file. So we'd be mostly looking for "data corruption" fault tolerance vs hardened against malicious input. How much can you still "out optimize" the compiler and JVM if you can't throw away all the safety?

gunnarmorling · 2024-01-29T17:27:30 1706549250

Turns out you don't need GC for processing that 13 GB file, with relatively small heap sizes even: some folks have disabled GC by means of using EpsilonGC (i.e. a no-op collector). That said, the right collector will give you sub-ms pause times these days (at the cost of lower through-put).

gunnarmorling · 2024-01-29T17:21:55 1706548915

Results actually have improved quite a bit since I've answered the questions for that interview. Fastest contenders now are below two seconds, and that's running on 8 CPU cores of the evaluation machine. When I ran the entries on all 32 cores / 64 threads two weeks ago, the fastest one was at 0.8 sec, and it should be quite a bit faster by now.

gunnarmorling · on Jan 21, 2024

You can join against a static (or slowly changing) table in Flink, including AS OF joins [1], i.e. joining the correct version of the records from the static table, as of the stream's event time for instance. You need to keep an eye on state size and potential growth of course. It's a common use case we see in particular for users of change data capture at Decodable (decodable.co).

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/de...