One missing criterion is client complexity. MQTT is built to work well with very few resources on the client. Kafka, on the other hand, requires you to do things you just don't want on a small embedded device -- like opening multiple connections to multiple hosts. Kafka is also just a transport for messages, while MQTT covers a much larger part of the stack and takes care of transporting individual values, which means you need less other code on your severely restricted device.
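To make the contrast concrete, a complete publish from a constrained MQTT client can be a single call. This is just an illustrative sketch using the paho-mqtt helper; the broker hostname and topic are made up:

```python
import paho.mqtt.publish as publish

# One function call: connect to the broker, publish one value, disconnect.
# Hostname and topic are placeholders for illustration.
publish.single(
    "sensors/livingroom/temperature",
    payload="21.5",
    hostname="broker.example.com",
)
```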
That said, I don't understand all the complaining directed at Kafka in this thread. Kafka is a fantastic tool that provides unique properties and guarantees. As a tech lead/architect I love to have a good selection of tools for different situations. Kafka is a very reliable tool that fills an important role when building distributed systems, and it is particularly nice because it is easy to reason about. The negative opinions I have heard in the past typically come from people who tried to use it for something it is not well suited for (like efficient transfer of large volumes of data) or who misunderstood how to use its guarantees to construct larger systems.
At one place I met a team that was completely lost with their overloaded Kafka instance and asked for external help to "further scale and tune" it.
I just touched the producer and consumer code so that the large files are published to S3 rather than pushed through Kafka. The producer instead sends a small message to Kafka with the metadata and the S3 location of the payload, and the client downloads it from the bucket. They were happy puppies in no time.
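A minimal sketch of what that producer-side change can look like, assuming boto3 and kafka-python; the bucket name, topic, bootstrap server, and the publish_large_payload helper are all made-up names for illustration:

```python
import json
import uuid

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_large_payload(data: bytes, bucket: str = "example-payload-bucket") -> None:
    # 1. Upload the payload to S3 first, under a unique key.
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=data)

    # 2. Only after the upload has succeeded, publish a small pointer message
    #    to Kafka. Consumers that see this message can rely on the object
    #    already being in S3.
    producer.send("payload-events", {"bucket": bucket, "key": key, "size": len(data)})
    producer.flush()
```

The ordering (S3 first, Kafka second) is what gives the guarantee discussed further down the thread.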
Regarding your last point, how did you handle deletion from S3? Did you not need to worry about atomic consumption of the metadata and the data? I suppose you could have some kind of background GC task...
I think you are overcomplicating the problem for no reason.
You build your system from simple guarantees:
* the message to Kafka is sent only after the payload has been published to S3. This means that if you have received the message on Kafka, the payload is already there -- no need to worry about it; you guarantee it by the order of the publishing operations.
* the object on S3 is immutable. This means it does not matter when you consume it, it stays the same.
* the message on Kafka is immutable. This means it does not matter when you consume it, it stays the same.
When the client reads the message off the Kafka topic, it just downloads the additional payload from S3. The payload is guaranteed to be there, with exactly the same content as published. Once the message is fully processed, the client commits the offset back to Kafka and that's done. If the processing fails, it will be retried later by this or another node. The payload is still there until somebody decides to delete it.
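A sketch of the consumer side under those guarantees, again with kafka-python and boto3 and the same hypothetical topic, group, and broker names; process() is just a placeholder for the real work:

```python
import json

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "payload-events",
    bootstrap_servers="kafka:9092",
    group_id="payload-workers",
    enable_auto_commit=False,  # commit only after the payload is fully processed
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def process(payload: bytes) -> None:
    """Placeholder for whatever the consumer actually does with the payload."""
    print(f"processing {len(payload)} bytes")

for message in consumer:
    meta = message.value
    # The publishing order guarantees this object already exists in S3,
    # and immutability means it is exactly what the producer uploaded.
    obj = s3.get_object(Bucket=meta["bucket"], Key=meta["key"])
    payload = obj["Body"].read()

    process(payload)     # if this raises, the offset is never committed
    consumer.commit()    # mark the message done only after successful processing
```

Because the commit happens last, a crash at any point simply means the message (and its still-present S3 payload) gets reprocessed later.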
Deletion can be done in many different ways. You could keep metadata for all those objects (the Kafka topic is your metadata database!) and find the oldest timestamp among the not-yet-committed offsets across all partitions. Then you delete from S3 all objects that are older than that. This requires that you publish to Kafka in the same order as you publish to S3 (within each partition).
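A rough sketch of such a GC task, under the assumptions above (same order of publishing, and S3 LastModified roughly tracking that order); topic, group, and bucket names match the earlier hypothetical sketches:

```python
import boto3
from kafka import KafkaConsumer, TopicPartition

TOPIC = "payload-events"        # hypothetical names from the sketches above
GROUP = "payload-workers"
BUCKET = "example-payload-bucket"

s3 = boto3.client("s3")

# A consumer created with the workers' group id can read that group's committed
# offsets; manual assign() keeps it out of the group's rebalancing.
consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

oldest_uncommitted_ms = None
for tp in partitions:
    committed = consumer.committed(tp) or 0
    if committed >= end_offsets[tp]:
        continue  # this partition is fully processed
    # Read the timestamp of the first message that is not yet committed.
    consumer.assign([tp])
    consumer.seek(tp, committed)
    batch = consumer.poll(timeout_ms=5000).get(tp, [])
    if batch:
        ts = batch[0].timestamp  # broker timestamp in milliseconds
        if oldest_uncommitted_ms is None or ts < oldest_uncommitted_ms:
            oldest_uncommitted_ms = ts

# Delete every payload object older than the oldest still-uncommitted message.
# (Conservative: if everything is committed, this pass deletes nothing.)
if oldest_uncommitted_ms is not None:
    cutoff = oldest_uncommitted_ms / 1000.0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="payloads/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"].timestamp() < cutoff:
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```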
Yes, I understood your proposed solution (and have no particular issues with it). I was specifically asking how you went about deleting things (i.e. the last paragraph of your response). Did you actually implement it that way in the end?