From what I can tell, Kafka isn't designed for long term data storage. RDBMS sys...

miguno · on Sept 6, 2017

Kafka can perfectly keep your data around forever. The only limitation is available disk space (and databases have the same limitation). I'm not implying that it is always the best idea to use Kafka as a long-term storage solution, but likewise a database isn't the silver bullet here either.

> so I'd probably be uncomfortable assuming that it won't do something and "tidy up" my very old data at some point in the future

Kafka doesn't "tidy up" your data unless it is configured to do so. What's true is that, by default, Kafka will keep your data around for a week "only" but that's a configuration. And most people change it to whatever fits their use case (some lower it to a few hours, some increase it to months or years; others configure it to keep data around forever, typically in combination with Kafka's so-called "log compaction" functionality).

> Since it's supposed to do this from time to time out of the box, you'd have to be very careful not to let anyone tweak the custom config to revert back to this behaviour. RDBMS won't delete things unless you tell it to more explicitly.

The DBAs I worked with would now say "Hold my beer..." ;-)

> While I think storing things in Kafka is fine generally, there's no way I'd not have a more perminant store of the data somewhere so that I can recreate the Kafka data store if I need to.

What's interesting is that more and more users (from my experience) are actually beginning to treat Kafka as the source of truth, and rather recreate other data stores -- like RDBMS, Elastic indexes, etc. -- from it. If you like RDBMS, you can think of Kafka as the DB's transaction log.

IMHO a lot of these discussion is about personal preferences, the situation that you are in (cough legacy cough), team skills, etc. There are often good reasons to use Kafka rather than RDBMS (in this context) but also vice versa, or completely different technologies of course (like blob stores, e.g. S3).

cookiecaper · on Sept 7, 2017

>Kafka can perfectly keep your data around forever.

In the sense that you can fiddle with it to the point where it doesn't purge things automatically, sure. But RDBMS provides more than the promise that it won't delete your data after a set period of time. If that was all we needed, any filesystem from the last 3 decades would serve fine as a "permanent datastore".

MySQL has gone through a lot of grief to get to the point where it's safe out of the box. This is important. There are many conceivable situations where an unsafe default can accidentally get worked back into things. At least MySQL has never had a default that truncated databases (afaik)...

>The DBAs I worked with would now say "Hold my beer..." ;-)

No one's claiming that RDBMS are invincible, but they do provide numerous features specifically designed to minimize data loss and corruption. You can read more about these in the documentation of any major RDBMS. I've mentioned some specifically elsewhere in this thread, but don't want to keep repeating the same things.

>IMHO a lot of these discussion is about personal preferences, the situation that you are in (cough legacy cough), team skills, etc.

No offense, but this just shows your ignorance of the functionality that you're leaving on the table by considering SQL a cough-legacy-cough solution. We discussed above that "can be configured to not delete things despite defaults and assumptions" is much different from "safe long-term data storage is a major design concern".

A messaging queue does not and should not have the same semantics, because it's not intended to keep decades' worth of data, even though it may be possible to avoid the database's eviction routine. We don't have to abuse something just because the developers haven't gone to lengths to override inadvisable configurations.

tlberglund · on Sept 7, 2017

It may be unfair to describe setting a documented configuration parameter as "fiddling." Retention is seven days by default. It is trivial to set it to arbitrarily long periods of time. To my knowledge, this functionality isn't really in question. Whether logs are a good unifying abstraction on which to build systems is in dispute among reasonable people, but whether Kafka randomly deletes stuff is not. :)

cookiecaper · on Sept 7, 2017

I don't claim that Kafka randomly deletes things. Just that it automatically does so.

The danger is not that Kafka will choose not to respect the configuration value. It is that the default setting will find a way to creep back in without the admin noticing it, and then a quick reboot, maybe even an unplanned one caused by a power trip or a kernel crash, will be sayonara to the system of record. Sure, there are backups, but who needs that aggravation? (p.s.: there probably aren't actually any workable backups)

In the RDBMS world, MySQL's automatic and silent truncation of VARCHARs down to the character limit of the column was seen as a sign of its badness. That demonstrates the difference in paradigm.

Anyway, the argument doesn't really hinge on whether or not there's an automatic eviction model in the software. It's just a clear, loud signal that the software is not really intended for long-term storage, and that you are, at best, entrusting decades of mission-critical data to a less-tested configuration on a young, maturing product. This should not be appealing by itself, and that's a very optimistic perspective on the choice to forgo the data integrity features provided by a traditional RDBMS.

Developers just cannot seem to grok that just because something appears to store data across server restarts does not mean it is necessarily a safe permanent parking spot. I'm not a DBA, and I've had my share of serious squabbles with them, but when this is what happens with unsupervised developers at the helm, it's hard not to be sympathetic to their aggressive, almost hostile, feelings around developer input into the data model.

tlberglund · on Sept 7, 2017

Agreed that this is a newer architectural paradigm and a younger product. Of that there is no doubt, and there is always risk there. Also return, of course; someone had to deploy an RDBMs for the first time too—and everyone is glad they did.

But I still don't follow the argument. If the eviction model is a loud, clear signal that this is the wrong solution, why isn't the mutability of RDBMS data the same sort of signal? Claiming that the presence of a DELETE statement in SQL rules out relational databases as durable data stores would not get me too far. And nor should it!

You are 100% right that this is a new approach. You are also right that it is possible to make configuration errors that will break the system. But this is true of all nontrivial systems. At the end of all of this, we still have a very interesting sequence of events (all NYT content ever) stored in an immutable log. This seems reasonable. Maybe the NYT team is blazing a trail, it's not prima facie a crazy one. :)

cookiecaper · on Sept 7, 2017

SQL provides users with a lot of facilities and mechanisms to limit, control, supervise, and if performed within a transaction, even undo overzealous DELETE statements. I discussed some of these at https://news.ycombinator.com/item?id=15188619.

AFAIK, with Postgres, there are no known circumstances where restarting your server will result in the purge of your database; that's really just icing on the cake, not the core of the argument. The core of the argument is that SQL provides not only a rational design paradigm for long-term storage and choices that reflect they take it seriously, but also an extremely strong feature set for data management and integrity.

As I've said numerous times now, SQL isn't invincible. But it's inarguably more resilient than Kafka, and it provides the controls necessary to keep some sanity over data in the long run.

miguno · on Sept 7, 2017

> No offense, but this just shows your ignorance of the functionality that you're leaving on the table by considering SQL a cough-legacy-cough solution.

No reason to get ad-hominem.

When I said cough-legacy-cough, I was referring to the situation you happen to be in when e.g. taking on a new project. If, for example, your company has been fully committed to Postgres since several years, your chances to change this and move to a different architecture (or even sticking to the same architecture but using a different RDBMS like MySQL) are pretty low.

LgWoodenBadger · on Sept 7, 2017

Kafka is not a messaging queue. It's a log.

The same underlying structure used within our RDBMS to provide all the guarantees Kafka provides.

As for logs and databases, they are duals.

https://www.confluent.io/blog/turning-the-database-inside-ou...

cookiecaper · on Sept 7, 2017

>Kafka is not a messaging queue. It's a log.

Kafka bills itself as a "messaging system".

>The same underlying structure used within our RDBMS to provide all the guarantees Kafka provides.

Is there perhaps a reason that people use the RDBMS scaffold atop this "underlying structure"? The point I'm making is that Kafka is not safe for use cases that demand robust data storage and integrity, at least not in comparison to the standard of safety set by the traditional systems of record for important data (RDBMS).

miguno · on Sept 14, 2017

Small and friendly correction: Kafka bills itself as "a distributed streaming platform".

Also, a heads up: we'll soon publish a blog post on the subject of "is it ok to (permanently) store data in Apache Kafka".

mayank · on Sept 6, 2017

You've really just laid out feelings rather than concrete technical reasons for why Kafka can't function as a permanent datastore.

Can you point to specific design elements in Kafka that would lead you to conclude that it isn't suitable for permanent data storage?

Also, Kafka doesn't "do" anything to your old data if you don't want it to. It's also open-source, so these behaviors can be verified.

pram · on Sept 6, 2017

I'd say it isn't appropriate for permanent data storage because the individual brokers don't scale well with the amount of logs present. If you have hundreds of partitions, and millions of logs, then any operation dealing with the indexes (like an unclean startup) will take an extremely long amount of time. So your individual brokers are now down for an hour if they don't shut down cleanly. Which happens often (oom, weird zk errors, etc)

It scales linearly.

LgWoodenBadger · on Sept 7, 2017

Most of the use-cases you describe wouldn't be resolved by re-reading the entirety of the topics from the beginning of time.

They'd resume from their last committed offset and continue where they left off.

If you do have to catastrophically recover from the beginning of time, then sure you'd have a rough time. But that's true for any system that would have to do that. It's not Kafka-specific.

Now if your consumer was entirely in-memory and made no use of persistent storage itself and it had to recover from the beginning of time, then I'd say that type of problem is your own architectural failure and I have yet to come across a tool, pattern, framework, or architecture that bullet-proofs your foot.

pram · on Sept 7, 2017

Reading the topic from the beginning is not what I'm talking about.