One missing criterion is client complexity. MQTT is built to work well with very few resources on the client. Kafka, on the other hand, requires you to do things you just don't want on a small embedded device -- like opening multiple connections to multiple hosts. Kafka is also just a transport for messages, while MQTT covers a much larger part of the stack and takes care of transporting individual values, which means you need less other code on your severely restricted device.
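To make the contrast concrete, a complete publish from a constrained MQTT client can be a single call. This is just an illustrative sketch using the paho-mqtt helper; the broker hostname and topic are made up:

```python
import paho.mqtt.publish as publish

# One function call: connect to the broker, publish one value, disconnect.
# Hostname and topic are placeholders for illustration.
publish.single(
    "sensors/livingroom/temperature",
    payload="21.5",
    hostname="broker.example.com",
)
```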
That said, I don't understand all the complaining directed at Kafka in this thread. Kafka is a fantastic tool that provides unique properties and guarantees. As a tech lead/architect I love to have a good selection of tools for different situations. Kafka is a very reliable tool that fills an important role when building distributed systems, and it is particularly nice because it is easy to reason about. The negative opinions I have heard in the past typically come from people who tried to use it for something it is not well suited for (like efficient transfer of large volumes of data) or who misunderstood how to use its guarantees to construct larger systems.
At one place I met a team that was completely lost with their overloaded Kafka instance and asked for external help to "further scale and tune" it.
I just touched the producer and consumer code so that the large files are published to S3 rather than pushed through Kafka. The producer instead sends a small message to Kafka with the metadata and the S3 location of the payload, and the client downloads it from the bucket. They were happy puppies in no time.
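A minimal sketch of what that producer-side change can look like, assuming boto3 and kafka-python; the bucket name, topic, bootstrap server, and the publish_large_payload helper are all made-up names for illustration:

```python
import json
import uuid

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_large_payload(data: bytes, bucket: str = "example-payload-bucket") -> None:
    # 1. Upload the payload to S3 first, under a unique key.
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=data)

    # 2. Only after the upload has succeeded, publish a small pointer message
    #    to Kafka. Consumers that see this message can rely on the object
    #    already being in S3.
    producer.send("payload-events", {"bucket": bucket, "key": key, "size": len(data)})
    producer.flush()
```

The ordering (S3 first, Kafka second) is what gives the guarantee discussed further down the thread.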
Regarding your last point, how did you handle deletion from S3? Did you not need to worry about atomic consumption of the metadata and the data? I suppose you could have some kind of background GC task...
I think you are overcomplicating the problem for no reason.
You build your system from simple guarantees:
* the message to Kafka is sent only after the payload has been published to S3. This means that if you have received the message on Kafka, the payload is already there -- no need to worry about it; you guarantee it by the order of the publishing operations.
* the object on S3 is immutable. This means it does not matter when you consume it, it stays the same.
* the message on Kafka is immutable. This means it does not matter when you consume it, it stays the same.
When the client reads the message off the Kafka topic, it just downloads the additional payload from S3. The payload is guaranteed to be there, with exactly the same content as published. Once the message is fully processed, the client commits the offset back to Kafka and that's done. If the processing fails, it will be retried later by this or another node. The payload is still there until somebody decides to delete it.
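A sketch of the consumer side under those guarantees, again with kafka-python and boto3 and the same hypothetical topic, group, and broker names; process() is just a placeholder for the real work:

```python
import json

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "payload-events",
    bootstrap_servers="kafka:9092",
    group_id="payload-workers",
    enable_auto_commit=False,  # commit only after the payload is fully processed
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def process(payload: bytes) -> None:
    """Placeholder for whatever the consumer actually does with the payload."""
    print(f"processing {len(payload)} bytes")

for message in consumer:
    meta = message.value
    # The publishing order guarantees this object already exists in S3,
    # and immutability means it is exactly what the producer uploaded.
    obj = s3.get_object(Bucket=meta["bucket"], Key=meta["key"])
    payload = obj["Body"].read()

    process(payload)     # if this raises, the offset is never committed
    consumer.commit()    # mark the message done only after successful processing
```

Because the commit happens last, a crash at any point simply means the message (and its still-present S3 payload) gets reprocessed later.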
Deletion can be done in many different ways. You could keep metadata for all those objects (the Kafka topic is your metadata database!) and find the oldest timestamp among the not-yet-committed offsets across all partitions. Then you delete from S3 all objects that are older than that. This requires that you publish to Kafka in the same order as you publish to S3 (within each partition).
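A rough sketch of such a GC task, under the assumptions above (same order of publishing, and S3 LastModified roughly tracking that order); topic, group, and bucket names match the earlier hypothetical sketches:

```python
import boto3
from kafka import KafkaConsumer, TopicPartition

TOPIC = "payload-events"        # hypothetical names from the sketches above
GROUP = "payload-workers"
BUCKET = "example-payload-bucket"

s3 = boto3.client("s3")

# A consumer created with the workers' group id can read that group's committed
# offsets; manual assign() keeps it out of the group's rebalancing.
consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

oldest_uncommitted_ms = None
for tp in partitions:
    committed = consumer.committed(tp) or 0
    if committed >= end_offsets[tp]:
        continue  # this partition is fully processed
    # Read the timestamp of the first message that is not yet committed.
    consumer.assign([tp])
    consumer.seek(tp, committed)
    batch = consumer.poll(timeout_ms=5000).get(tp, [])
    if batch:
        ts = batch[0].timestamp  # broker timestamp in milliseconds
        if oldest_uncommitted_ms is None or ts < oldest_uncommitted_ms:
            oldest_uncommitted_ms = ts

# Delete every payload object older than the oldest still-uncommitted message.
# (Conservative: if everything is committed, this pass deletes nothing.)
if oldest_uncommitted_ms is not None:
    cutoff = oldest_uncommitted_ms / 1000.0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="payloads/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"].timestamp() < cutoff:
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```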
Yes, I understood your proposed solution (and have no particular issues with it). I was specifically asking how you went about deleting things (i.e. the last paragraph of your response). Did you actually implement it that way in the end?