If in 2020 you still choose Kafka as your messaging infrastructure, you are well behind the times.
I get it, I really do. Your manager has heard of Kafka, as have your PM and CTO. Nobody has ever been fired for buying Kafka. It’s the new IBM.
“Buying Kafka? It’s open source, so it’s free!” you say, my naïve friend who hasn’t heard of Confluent.
Yes, Kafka is free until you go to production and need things like MirrorMaker for multi-site replication. Then it’s time to pay up to the company that has taken over and monetized the project.
But it’s a pain to run, a pain to debug, and it uses Avro as a binary schema format, because everyone uses Avro, right? And partitions are great! Until you need to change their count, and then you’re in for a world of strange and potentially unnoticed bugs from the discontinuity in partitioning as the topic grows.
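The partitioning discontinuity is easy to see in miniature. Kafka’s default partitioner boils down to hashing the message key modulo the partition count, so growing the topic silently reroutes most keys. A toy sketch (using CRC32 as a stand-in for Kafka’s murmur2-based partitioner; any stable hash shows the effect):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's default key partitioner: hash(key) % partition count.
    return zlib.crc32(key.encode()) % num_partitions

keys = [f"user-{i}" for i in range(100)]
before = {k: partition_for(k, 4) for k in keys}  # topic starts with 4 partitions
after = {k: partition_for(k, 6) for k in keys}   # topic grown to 6 partitions

# Roughly two thirds of keys land on a different partition after the change,
# so per-key ordering (and any state co-located by partition) quietly breaks.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Consumers that assumed "all events for user-42 arrive on one partition, in order" now see that key’s history split across two partitions, with nothing flagging the breakage.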
Or... you could have Pulsar. Dynamic partitioning, more expressive subscription models, multi-active replication as part of the core product. No BS marketing claiming “exactly once delivery semantics”, aka more-than-once with receiver-side deduplication, aka what TCP does and has always done.
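The “exactly once, aka at-least-once plus receiver-side deduplication” claim fits in a few lines. A hypothetical consumer, assuming each message carries a unique producer-assigned id (none of this is any broker’s real API):

```python
# "Exactly-once" as at-least-once delivery plus receiver-side deduplication:
# the broker may redeliver, so the receiver drops ids it has already seen.
seen_ids: set = set()
processed: list = []

def on_message(msg_id: str, payload: str) -> None:
    if msg_id in seen_ids:
        return                # duplicate redelivery after a lost ack: drop it
    seen_ids.add(msg_id)
    processed.append(payload)  # effectively-once side effect

# A retry after a lost acknowledgment redelivers message "a":
for msg_id, payload in [("a", "hello"), ("b", "world"), ("a", "hello")]:
    on_message(msg_id, payload)

print(processed)  # ['hello', 'world']
```

Which is exactly the shape of TCP’s sequence-number handling: the network delivers at least once, and the receiver discards segments it has already accepted.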
There is no reason to be building something new on top of Kafka in 2020.
So thank you for sharing this. I wasn't explicitly recommending nor condemning Kafka (or Confluent). Your input is definitely valuable.
I was implicitly referring to the fact that Confluent is now trying to position their product as a database (something I think is not really the right use case for it), and to the irony that the database/queue discussion in this article dates from '95.
I do believe the unbundling of some database features has given us really handy tools, carved out of those monoliths, to use in our software architectures.
Regarding your comment that "nobody ever got fired for Kafka" - I've been in our industry long enough to have lived through this old saw several times, so I get it. I don't think people should go into Kafka for heavy use cases without realizing they will end up paying Confluent, and that it won't be cheap or perfect. I'm not suggesting people go with Kafka or Confluent for that or any specific reason, unless they prove it is the right tool for their job.
You seem to be proposing discarding Kafka completely in favor of Pulsar. I have not had the privilege of implementing Pulsar; I've heard about it and will need to look into it some more. I take it you have personally implemented it? Are you part of the project? What gives you the confidence that Pulsar will not become the next Kafka when the next great tool comes out? I'm genuinely interested.
Personally, we have a handful of use cases that Kafka works fine for. There are some use cases that it promises, or suggests are possible, but where it still falls short, that we would like to have in our toolbox. They are not the common use cases it is used for. We specifically are NOT using it as a database. We are also not pushing Kafka nearly as hard as companies like LinkedIn and others do.
So given this article is about how queues are databases, I'd like your opinion on the pattern this suggests. What was your take on things like RethinkDB, which had great real-time change notification (but not really messaging)? Or how about the direction Amazon QLDB seems to be going, with streams emitted from an immutable ledger used for storage? Do you see any actual database that has a great story addressing this feature?
These are all interesting tools, but personally, the pattern of an unbundled, immutable transaction log with change notifications feeding a messaging system feels like a helpful tool to have in the architecture tool chest.
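That pattern is compact enough to sketch: an append-only log where every durable write fans out as a change notification to subscribers, in log order. A toy illustration (all class and method names are hypothetical, not any product's API):

```python
from typing import Callable

class TransactionLog:
    """Toy append-only transaction log with change-notification fan-out."""

    def __init__(self) -> None:
        self.entries: list = []       # immutable history: append-only, never mutated
        self.subscribers: list = []   # downstream consumers (the "messaging" side)

    def subscribe(self, callback: Callable) -> None:
        self.subscribers.append(callback)

    def append(self, record: dict) -> int:
        offset = len(self.entries)
        self.entries.append(record)   # durable write first...
        for notify in self.subscribers:
            notify(offset, record)    # ...then notify, so consumers never lead the log
        return offset

log = TransactionLog()
received = []
log.subscribe(lambda offset, record: received.append((offset, record)))

log.append({"op": "insert", "id": 1})
log.append({"op": "update", "id": 1})
print(received)  # every change, with its offset, in log order
```

The log stays the source of truth, and the messaging layer is derived from it; a new subscriber can always replay `entries` from offset zero to catch up, which is the property the bundled-database versions of this idea usually lack.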
Wow, now Pulsar as well. As someone who isn't full-time trying to keep up with this tsunami of names, it's just impossible to keep up. Apache (or someone, anyone, please) needs to make a matrix of all their own competing technologies and what the actual differences are between them. It's just impossible!
I feel ya. Software is a world that’s constantly evolving.
Apache is great at software engineering, but sorely lacking in product design. Because open source software is almost definitionally not a product, but a tool.
With that comes increased bifurcation of the tooling when different requirements arise, and increased complexity in running it. Kafka and Pulsar both have ZooKeeper as an external dependency, for instance. Pulsar has an extra dependency even beyond that in BookKeeper, one of the few things I’ll readily fault it for. It’s a stark contrast to openly commercial products like CockroachDB, which ships as a single static binary, with symmetric nodes and a built-in management UI. It’s a product, not a tool.
It's because Apache "adopts" products that are created independently by other companies, who then want to open source their product and leave it to someone else to look after.
Kafka was created at LinkedIn and eventually donated to Apache Foundation.
Pulsar was created at Yahoo and eventually donated to Apache Foundation.
People know Kafka, it works well, AWS will sell you a managed version, using Protobuf or JSON payloads isn’t an issue, and the conceptual model is easy to understand.
Pulsar may be better, but is it better enough to displace an entrenched piece of core software at the heart of an enterprise?