The very definition of over-engineered. This is just event-sourcing turned into a marketing article for Kafka.
It doesn't really matter what the "source of truth" system is, although Kafka doesn't seem quite as mature/stable enough for that. With such little data, they can push into a nice graph database instead, run it entirely in memory and meet 10x any demand they'll ever see along with all the query types they'll need. Add in an elasticsearch cluster on the side and problem solved.
Any database can serve as a log replay, as long as you save all the versions then it's just called a query.
> The article addressed their potential issues with snapshotting
They say it would be outdated - but if you store all the versions like I said, what is getting outdated? Kafka consumers are essentially doing the same thing - it's a poll-based model that asks for more data from the current offset, no different than a SQL query with a where clause.
Unless I'm mistaken, If I were to build out a simple event log represented by a relational DB, I have bottle necks when writing to it, and have lag in terms of processing the events, and if I were also pushing those events to a queue to hydrate aggregate snapshots I would have to have client logic to deal with duplicate events or not acking processed events etc?
Intuitively, I guess kafka is "more realtime" and "more available" when compared to the home-brew event log?
EDIT: obviously those constraints in my home-brew event log are relaxed when my problem domain is amenable to things like associative operators, idempotency, inverses etc.
There's no reason not to make good use of Kafka or similar solutions. The issue is that people use it without understanding it. In this article, they say that Kafka is their system of record and their primary long-term storage. That's very silly.
You can use Kafka as the buffer/processing log before persisting to the database, but with such a small dataset it's just not necessary. It's a news publishing system, not high-frequency trading.
Well my point is that it's probably faster to get to production if I simply used Kafka _when modelling my work flow as an event processing system_ but it took them a year so I don't know now haha
A year is actually not that much for porting such a huge legacy system to a new platform. I imagine most of the work was making the interfaces from/to other platforms.
it's exact opposite the main cost is gc of dead rows in MVCC RDBMS since you are never deleting or updating rows performance will be very decent for writes.
I was about to comment that while any database can serve as a log replay and may even be feasible for these volumes, scaling it at high volume later would be extremely hard - I have experience with this in a multi-petabyte set up and its a nightmare using an RDBMS.
But, the article says "In Apache Kafka, the Monolog is implemented as a single-partition topic" - losing all the goodness Kafka provides around scaling. This setup now is no better than an RDBMS with master/slave replication.
Event sourcing is unnecessary here. What they want is a versioned history of their content with flexible schemas and arbitrary queries. Instead of using a strong fast graph database as a perfect fit, they chose to implement it poorly using Kafka which I see as much more time wasted on reinventing the wheel.
The dataset is small and all of the "very different" use-cases are just downstream apps that query a database. Why use Kafka to then materialize several different databases when a single graph database can serve all these downstream apps? Remember Kafka consumers themselves are just polling queries against a log.
A processing log is one thing but event-sourced source of truth in custom database logic in Kafka is borderline ridiculous. This project is more moving parts and less functionality while cleaning none of the existing mess. Effort would've been better spent consolidating all their systems instead (which they still have to do since these standard schemas need to be used somehow).
That doesn't prove that the NYT's money was poorly spent. Decisions are always slow and expensive, and typing is quick. Copying an existing design is almost always cheaper than making the original.
And in many cases, copying is dramatically cheaper. My favorite example is the iOS game Threes[1], which took 14 months to make and then was cloned on android in ~20 days. It was cloned so quickly that the original developers were accused of copying the cloned version!
But just because the cloners made a functionally identical product doesn't mean they did the same work as the original designers. They got to skip designing anything - which is usually the hardest and slowest part.
They claim in the article to have 100GB of text data. Let's bump that up 2 magnitudes for all text and metadata ever (outside of media files) and you can still run the entire thing on a single rack of servers and meet any performance needs.
Many industries and applications are leagues ahead in both data size and speed - this isn't an example of such.
100GB...let me go find the largest MicroSD card in my house to put that on. That would actually be a good fit, since it'd work in an rPi3 which could likely serve their data publishing needs (assuming only a few updates to articles per second..not mentioned in the article, but I'd be surprised if there's more than that given the data sources) vs what they've done.
Honestly, what kind of RPS are they talking here? Requests per minute, if that, seems like.
FWIW, the article mentions the book "Designing Data-Intensive Applications" by Martin Kleppmann. I wanted to throw out my own endorsement for the book, it's been instrumental in helping me design my own fairly intensive data pipeline.
Dear HN reader - if you're not quite ready to buy the book, take a listen to this episode of Software Engineering Daily (https://softwareengineeringdaily.com/2017/05/02/data-intensi...). It will give you a sense of what Martin Kleppmann is all about and how he thinks about problems. I ordered my copy of "Designing Data-Intensive Applications" after listening to this episode.
I third this recommendation. I've worked on a ton of data intensive applications on all kinds of stacks over the years, and this book gives you lessons learned as well as a very valuable historical perspective on relational databases that is missing from a lot of the popular literature today.
I parsed and reparsed just like you and I stalled again, while I imagined the sentence again as "throw up", trying to make a quick fix. OP has picked a tricky one..
Apologies! It's an expression I use commonly, like "let me throw this idea out there..." Now that you bring my attention to it, however, it does seems backwards, and I'm sure it's baffling to non-native speakers. Thanks for the heads-up.
Excellent, well written article. The key take away seems to be that instead of an temporary event stream log, since the number of news articles (and associated assets) is finite and cannot explode, they store all the "logs" forever (I'm using the term log as is defined in the article, as a unit of a time-ordered data structure).
I wonder if NYT can help other news websites by making their code open source? I'm a huge fan of NYT and their jump to digital has just been amazing. However, I would also like my local newspaper (which covers more regional news) to be able to serve quality digital content.
> I wonder if NYT can help other news websites by making their code open source?
Hey! I, and a number of other news nerds have been encouraging FOSS for the past decade or so. And in fact a number of major open source projects have come out of news related projects, including Django, Backbone.js/Underscore.js, Rich Harris's work on Svelt.js, and a whole lot more.
Most often the problem with local news organizations are operational constraints. The news biz has seen a huge downturn over this same period of time. Most orgs, both on the reporting side and on the tech side are super strapped for people-time.
It's not enough to have FOSS software, you also have to have folks doing devops and maintaining systems often at below-market salaries.
One of the biggest issues is that a lot of newsrooms have zero control over their CMS, because they're owned by a corporate entity that dictates IT decisions from afar, slowly and with much gnashing of teeth.
Family-owned papers like the one I work at (The Spokesman-Review in Spokane, WA) are the one of the few news orgs that could actually put something like this into production within this century, but even we still have to deal with manpower issues.
It would make me very sad if this would be the case, that every newspaper is going to be running the same kind of software stack. It's then 1 step away from just having only 1 news paper in total.
Diversity, choice, innovation, etc up to different reporters reporting on the same story with their point of view, will all disappear if all newspapers are going to run the same software stack.
This is a flawed architecture. It will work at release, but it will be difficult to manoeuvre with, and they will grow to hate it.
As your business changes, your data changes. Imagine if on day one, they had one author per article. On day 1000, they change this to be a list of authors.
Kafka messages are immutable. Each of those green boxes on the right hand side of the first diagram will need to have special-case logic to unpack the kafka stream, with knowledge of its changes (up until 17 May 2017, treat the data like this, but between then and 19 May 2017 do x, and after that do y).
Document pipelines is a rare instance of a context where XML is the best choice. They should have defined normalised file formats for each of their data structures. Something like the gateway on the left of the first diagram would write files in that format. (At some future time, they will need to modify the normalised formats. Files are good for that. You can change the gateway and your stored files in coordination.)
Secondly, they should have a gateway coming out of the file store. For each downstream consumer, they should have a distinct API.
These APIs might look the same on the first day of release. But they should be separate APIs so that you are free to refactor them independently.
When you have a one-to-one API relationship, you can negotiate significant refactors in a single phone call. When you have more than one codebase consuming, you need to have endless meetings and project managers. I call this, "The Principle of Two."
Some of the other comments here say that they should have used databases. So far, they have not made the case for it. And databases are easily abused in settings like this one. People connect multiple codebases to them, and use SQL as a chainsaw. Again, you can't negotiate changes.
When you create a system, your data structures are the centre of that system. You need to do everything you can to keep your options open to refactor them at a later time, and to do so in a way that respects APIs that you are offering your partners.
Kafka is a good tool. If used well, your deployment design will stop your system regularly (e.g. every day), nuke the channels, recreate them from scratch, and restart your system against these empty channels. You shouldn't use it as a long-term data store.
> Kafka messages are immutable. Each of those green boxes on the right hand side of the first diagram will need to have special-case logic to unpack the kafka stream, with knowledge of its changes (up until 17 May 2017, treat the data like this, but between then and 19 May 2017 do x, and after that do y).
I respectfully disagree. The genius of this approach is that you can make the same transformation on the original Kafka stream to change its schema and prepare the new feed. Once you are satisfied with the results and you have switched all subscribers to the new feed, just turn off the old one. Voila - you only have y.
> This is a rare case where use of XML makes sense.
I still don't understand the hatred around XML. Is it slightly verbose? Yes. Does it support lots of neat functionality that make it great for interoperating between systems, like validations and transformations? Yep. Sure, it's possible to go full architecture astronaut with it, but you can do that with pretty much any programming language.
Meanwhile, I'm just sitting over here wondering whether my YAML file is supposed to have certain indents here or not, "-" or not, and trying to go figure out which magic incantation I need to get it to handle a multi-line string the way I'm expecting.
I think parent was answering the usage of XML in this use case, which is not appropriate. XML has many strengths (as you have outlined), but it has also been (mis/ab)used so many times that it gained bad reputation. What my GGP suggests is an example of such. There is nothing to gain from XML here that any proper DB (with schema) wouldn't offer, or in this case, protobuf.
Kafka logs however are solving a different problem. The mental model is different - they do not record state, but the whole history of transactions, which makes it trivial to change the schema if/when need arises. Saying that the schema should be thought in advance and shouldn't change is not realistic IMHO.
They're using protobufs, which seem just about as flexible as XML as far as schema updates are concerned and are considerably less ambiguous. So I don't see how XML would help?
Through its various protocols Kafka topics can be configured to be guaranteed forwards, backwards, or bi-directionally compatible.
That is to say: as flexible as XML or an RDBMS schema with long-term, format encoded, data that can explicitly support conflicting clients over time (as desired by the dev). Zero-impact, live, online, updates touching hundreds of active systems without issue...
TBH most posters here have completely missed the forest for the trees. The point is not to avoid DB migrations. The point is to support hundreds of DB migrations in connected systems simultaneously with no schema-related down-time or centralized point of failure or intractable CAP challenges.
Trying to solve these kind of operation issues with an RDBMS in an Enterprise context is _literally_ the "big ball of mud" design pattern.
Kafka, warts and all, is an operational answer to how 1 client can feed 1 MM real-time connections, how massive resource unlimited batch systems can integrate with real-time feeds, how your data warehouse can keep growing without painful forced restructuring, and how data architects can mandate standards across multiple systems built by external teams with human sized budgets.
Data is part of it. Protocol, format guarantees, and loosely coupled systems are where the wins lie.
> Through its various protocols Kafka topics can be configured to be guaranteed forwards, backwards, or bi-directionally compatible.
Sure but backwards and bi-directional compatibility inhibits the evolvability of schemas. Something as simple as adding a new required field for example is not backward compatible in avro.
I understand why this is, and that in a huge database, it's not that simple in SQL either. But in 95% of databases it actually is quite simple.
Ummm sure about that? Something as simple as adding a new required field and maintaining compatability is the explicit use case for Avro. I personally do this in production... I know LinkedIn does too...
Seriously: that would be a show stopper beyond show stoppers, and would make Kafka useless. Backwards and bi-directional compatibility are the specific facilities that enable safe schema evolution.
From the docs:
"Backward compatibility means that data encoded with an older schema can be read with a newer schema... Forward compatibility means that data encoded with a newer schema can be read with an older schema... Full compatibility means schemas are backward and forward compatible[:]...we can evolve the schemas in a fully compatible way: old data can be read with the new schema, and new data can also be read with the old schema."
A new required field _with a default value_ (important!), is backwards compatible. Same deal for hardcoded SQL updates: no default value, no backwards compatibility. Default value? Backwards compatible because clients that cannot provide a value get one, by default.
And, frankly, if you're thinking about database in the singular form then you're not thinking about an environment that really needs a schema evolution policy... tens and hundreds of parallel databases, active datawarehouse operations, multiple federated service gateways, real-time analytics, and huge batch systems for BigData, all acting in concert 24/7: that's the kind of environment that makes the "95%" simple jobs massively costly due to the logarithmic interaction of complexity when numerous systems are considered at once.
Without some kind of dataflow separation these intrinsic complexities trend towards intractable system configurations, hence the ESB boom and bust. Kafka is not for your applications. Kafka is for your Enterprise with many, many, applications.
In what sense is this field required when you attempt to produce a value using the new schema, and everything is fine if you omit the "required" field?
> And, frankly, if you're thinking about database in the singular form then you're not thinking about an environment that really needs a schema evolution policy
We don't have 100s but certainly 10s of databases. But we use a schema evolution policy because we're attempting to use kafka as the system of record for our event log.
I think XML is better than protobufs when it comes to long-term even storage. Three very big problems with protobufs:
(1) Protobufs are not self-describing. There is a meta-model but you can't really understand the data unless you have access to the schema. This means just to be safe you might end up storing the schema beside every message or at least a reference to the schema. Welcome to the schema management business.
(2) Protobufs aren't extensible. In practice this means any time a system wants to introduce a new field they have to submit a proposal to some sort of highly centralized "Architecture Group." This is followed by lots of debate. Then, if you're lucky, it gets put in and a few systems adopt it. With XML you can slice off namespaces and let people innovate in those namespaces.
(3) Protobufs aren't human readable. At the end of the day this means you need special tools to do anything with them. Meanwhile XML can actually be imported directly into Excel.
There's a whole ecosystem of powerful technologies around XML that make it work very well in this case. People underestimate how valuable this is because they're still not thinking of the log as its own first-class product.
That said, perfect is the enemy of good. This could be a good step in the right direction and migrating to XML in the future would be pretty easy because it becomes a very simple event transformation.
> Kafka messages are immutable. Each of those green boxes on the right hand side of the first diagram will need to have special-case logic to unpack the kafka stream, with knowledge of its changes (up until 17 May 2017, treat the data like this, but between then and 19 May 2017 do x, and after that do y).
One solution would be:
Kafka allows you to easily create new streams from the "monolog" stream that normalise the data to a certain schema. Consumers can then just consume these new derived streams.
Another was mentioned in another reply (create a new stream that ultimately replaces the monolog).
> Document pipelines is a rare instance of a context where XML is the best choice.
XML does not really offer anything here that could not be achieved with tools that are nicer to work with? It's just a file format, basically. Why wouldn't Protobuf work? It also saves a huge amount of disk space vs. XML.
> Secondly, they should have a gateway coming out of the file store. For each downstream consumer, they should have a distinct API.
This is Kafka. They can have distinct streams for the consumers since you can always derive new kinds of streams.
> You shouldn't use it as a long-term data store.
Why not? Kafka has support for infinite retention and Kafka has very strong guarantees about always writing data to disk and not losing a single event.
> Traditionally, databases have been used as the source of truth ... [but] can be difficult to manage in the long run. First, it’s often tricky to change the schema of a database. Adding and removing fields is not too hard, but more fundamental schema changes can be difficult to organize without downtime.
This argument sounds self-contradicting. Kafka doesn't let you change its schema at all! At least postgres gives you the option.
It seems that the author is excited about having a single source of truth that doesn't change, and didn't realize that he could do that with a database, if he just never used the schema-changing features.
Am I missing something? It seems like the author could be totally happy with a bunch of derived postgres databases sitting in front of a "source of truth" database, where he never changes the source of truth database's schema.
The problem with an RDBMS is that the schema is retroactive -- every time you update the schema, both old and new data must validate against it.
There's another approach, which is to also version the schema. Every record is conceptually a pair of [schema, data]. This puts a burden on the application -- every client must be able to understand old schemas -- but the benefits are considerable.
The most trivial benefit is that no data is ever lost, even if the business case is taken away. Over time, the burden of supporting obsolete features mean that you want to remove columns that don't apply anymore. But that does remove interesting and potentially valuable historical data. If the schema evolves together with the data, you can keep even the cruft, at little expense.
In practice, the evolution of the schema means that data often has to be transformed from one schema to another (e.g. v1 -> v2). In an RDBMS, this transformation (schema migration) is performed once, and if it contains bugs, you have to resort to backups to restore the old data. But with a versioned schema, the transformations are simply functions on immutable data. You can go back and re-run transformations on old data to get new data, non-destructively.
solved it for you :) and thats will have ACID and all the other features of RDBMS that kafka does not have. It will also allow you to create transactionally consistent "views" representing whatever you need implemented in a few lines of SQL vs complex standalone services.
Postgres could in fact be used here by creating an append-only table similar to:
id, data (JSON field)
However, Postgres doesn't have good support for creating derived append-only logs (=streams) from that table. Kafka has Kafka Streams and KafkaSQL. And producer+consumer APIs that are a good fit for NYTs use case.
> Am I missing something?
Remember that in addition to schema changes, NYT also wants to avoid row changes. One reason was that all search indices & other systems need updating too during a row change in a DB and this will lead to inconsistencies in large scale (sometimes some of these updates fail).
IMHO the thing you are missing is the log based architecture where all databases are derived/materialised from the SSOT: the log.
Pretty much all databases are already a transaction log that gets materialized. Postgres and many other RDBMS take in changes, write to the WAL, then provide tables that are views of the latest data of each row.
If you want total history and the database doesn't support this automatically then you can easily insert news rows instead (like you described) and then just create an SQL view that then shows the latest versions of each row, while also adding views for all kinds of other data access patterns. Companies have been doing this for decades because it's self-contained, fast, and reliable.
Using Kafka and separate processes to do this is deconstructing the RDBMS into separate layers that you now have to manage yourself. Useful if you really have that kind of scale but at 100GB of data, it's just silly. Use kafka as a work queue but leave the database work to actual database software.
However, Postgres doesn't have good support for creating derived append-only logs ????
A one line trigger will give you derived append only log that is transactionally consistent.
It's rare that a company needs something like Kafka.
Kafka introduces a number of issues related to the development of client code and data stores (if any) and the maintenance of these things. It's important that the actual scale justifies the expense incurred.
It's not rare, using Kafka in place of a prober db is crazy though. There are tones of use cases that benefit from having Kafka a very common one is putting Kafka in front of the data processing pipeline to absorb spikes without data loss.
It's rare that a company has spikes of that sort that can't be handled in far more mundane ways. Most companies do not operate at a scale that merits this sort of solution.
Look at the engineering blogs on LinkedIn for the reasoning behind creating it. It's over-kill for most use cases.
Kafka doesn't have a schema per message, messages are just bytes that you can serialize / deserialize however you want to. The article refers to using protobuf for messages, which does support adding fields.
If you're equating kafka topics with the idea of a schema, you can add topics.
To add to this. Having implemented a similar log-based architecture, I would say that it is much simpler data infrastructure than having a central RDBMS considering the use case. Remember we need to deal with several consumer applications with their own respective optimization models. Postgres is a perfect choice for a given application's local datastore, while for a different application they may want to use ElasticSearch as their datastore. However, the "source of truth" remains free of any such optimization requirements. You simply save your "messages", "event", "facts", whatever you want to call it in its purest form preferably immutable, and let the consumer apps create/recreate their local datastores as they deem fit.
I tend to be the one arguing this, to stick to postgres for most things but even I will admit it does depend on scale.
I'm not sure what the NYT requirements are but from my understanding of Kafka, its persistent redundant distributed queues scale automatically horizontally across machines to support colossal amounts of data. It's possible that they had difficulty fitting everything in a postgres instance.
Kafka by default is not persistent, the logs expire after 7 days. You can increase it on a per topic basis. It also doesn't scale automatically. If you have three replicas on a single partition topic, they will live on their assigned broker forever unless you manually reassign them. Adding new nodes does not kick off rebalancing of partitions. Its automatic cluster management is very primitive compared to something like elasticsearch.
For example, if you lose a broker, the replica will just be gone forever. Unless you replace the broker with the same id.
Your statements about data retention are true, but if I may, I think they are only vacuously true. There are definitely ways to configure data retention incorrectly, but it seems like the folks at NYT have cracked the nut. Indeed, this is a trivial thing to set up.
Your points about auto-scaling are well-taken, though. It does not automatically rebalance when you add a new node. Confluent obvious agrees that this is a problem, since this is part of what Confluent Enterprise does. #shamelessplug (Which of course I do not intend shamelessly; just saying you have a point. :) )
See, that's where I'm confused. I'm no Kafka expert, but they say they use a "single-partition topic" which I believe means the only way they can replicate the data is by replicating the entire log, they can't shard because it's a single partition. The reasoning behind this is because Kafka doesn't support ordering between partitions.
Also I've never thought of Kafka as a persistent data storage solution, it's interesting Confluent is supporting Kafka being used in this way.
You are not even a little bit confused! I think you have it perfectly right. And you are not alone in not thinking of Kafka as persistent storage, but when you get down to it there is no reason not to, and people are indeed using it in just that way. And yes, Confluent does give its +1 to this practice. :)
>it's interesting Confluent is supporting Kafka being used in this way.
If it earns them money, I think they'll pretty much support anything. Jay has rubber stamped it on SO [1], but he's got a bit of a vested interest on selling Kafka.
Precisely. While "all NYT data since 1851" sounds like a lot, and >8660 days sounds like a long retention period for a Kafka topic, this--like most systems in the world--is not a Big Data application. One of the key insights from the post is that there are interesting architectural considerations that have nothing to do with data size that make immutable logs a good idea.
Heh this was exactly my thought. I've no idea why they couldn't just store the data in a database! Or if they really have several ways of storing data, then have a database for each and a common API/ESB layer over the top. Seems normal.
I wonder how much of this kind of stuff exists out of necessity and how much of it exists because very smart people are just bored and/or unsatisfied.
Are there any articles that supplement this that explain how much business value is added/lost by the existence/removal of these kind of features? In the case of NYT I suspect its popularity is maintained because of the perception (real or not) of high quality journalism, in spite of any technical failings.
---
How much would be lost if NYT was just implemented as text articles that are cached and styled with some CSS. "Personalization" could be added by tags each article has and a small component that shows the three most recent articles that share the same tag.
I can't speak for the situation at the NYT but the actual public site for online papers are often pretty simple things with most of the complexity being ad logic. The systems here almost entirely deal with writing and content retrieval pipelines for stuff that was written years ago in other systems that isn't tagged or stored in sympathetic ways, and there will also be the very old school print pipeline to have to deal with too.
Literally teams of interns manually re-typing old articles from microfilm. OCR isn't quite there yet, not for dealing with the ways newspapers and newspaper design has changed over the years. You'd get the text, but there would be no guarantees that it that story bylines, headlines, factboxes, and photos match. Our own digital archives go back to 1994, anything before that is manual input.
That's neat, but most of those aren't anything that can't be implemented with just CSS and HTML, unlike some of the stuff you've done for the NYT. Though even in that case, isn't all of this stuff just static assets? Hasn't this problem already been "solved"?
>I wonder how much of this kind of stuff exists out of necessity and how much of it exists because very smart people are just bored and/or unsatisfied.
That's a ton of it. Like it or not, publishing a digital newspaper is not a hard or unsolved problem; it's one of the web's core competencies. If you hire people who want to build cool stuff to supervise a CMS, well, you get this kind of outcome.
The raw cost is understated because these experimental setups misinterpret the functionality of the new architectures/formats they're using. It doesn't truly rear its ugly head until there is a major data loss or corruption event. It's not that these never happen with RDBMS, it's just that RDBMS contemplates this possibility and tries to make it pretty hard to do that, whereas message queues just automatically delete stuff (by design, so they can serve as functional message queues!).
RDBMS have spoiled us and we take its featureset, 40+ years in the making, for granted. We need to be careful and not assume that `GROUP BY` is the only thing we leave on the table when we "adopt" (more accurately abuse) one of these new-wave solutions as a system of record.
Since no one is going to admit to their boss "this wouldn't have happened if we used Postgres", and since most bosses are not going to know what that means, most of these spectacular failures will never be accurately attributed to their true cause: developers putting their interest in trying new things above their duty to ensure their employer's systems are reliable, stable, and resilient.
There are non-negligible problems in the news space like:
1. Supporting full-text search for a fair number of concurrent users
2. Availability of the system with minimal downtime
3. Scalability within the day and year, traffic patterns around e.g., breaking news events will far surpass 2AM traffic
4. Notifications
I could go on and on but honestly, it's just a tone-deaf response.
Parting pot-shot: "No one is going to admit to their boss that the reason a worldwide news organization can't publish any stories is because their one postgres master node went down, or is waiting on a state transfer to a fallback master"
"We can't publish right now because the database had to enter an unplanned maintenance period" is a lot different from "our authoritative archive is gone and we have to try to rebuild it from all these separate 'materialized views', woops."
You may not have to worry about somebody setting a "delete all data older and/or bigger than Y or Z" but you have to worry about someone running "DELETE FROM table" without a WHERE clause. Which is easier to prevent? The one that can be done through the same mechanism as non-destructive queries? Or the one that can only be modified through a file-system configuration, completely separate from its API?
Regardless, it's a different paradigm with different "don't do that" behaviors that you need to know about.
In Kafka, if you want the persistent, append-only, write-ahead log to not delete stuff, then configure the retention period to keep things forever.
If someone runs `DELETE FROM table` without a WHERE clause, I expect:
a) the query to be tested and scripted on non-production environments first, making this a non-issue;
b) user doesn't have DELETE permissions on that table and/or the rows not intended to be deleted;
c) referential integrity to kick in and prevent the deletion of interdependent records (which is most records in a database);
d) CHECK constraints, triggers, and other validation routines to prevent this clearly-excessive operation;
e) the person executing and inspecting these queries within an ad-hoc transaction to roll it back before committing;
f) if, in the event this does occur and commit, which itself means there's a big problem with your procedure, streaming binlog archives can facilitate a point-in-time backup, audit tables can be used to rebuild the data, etc.; these aren't typically included by default (streaming point-in-time backups are on AWS Aurora) but they're conventional for many professionally-run RDBMS installations.
and I'm sure there are failsafes that I'm forgetting, and since I'm not a DBA, some I'm probably not even aware of.
How many of these can I expect to help me out when a packaging bug (or, simply a mistaken "y" on the prompt asking if I want to override the package config) clobbers the Kafka config file?
I'm not quite sure what you're insinuating about YC here but I suppose a standard "no, we don't do anything like that" is in order.
We sometimes rate-limit HN accounts when they have a habit of posting low-quality comments too quickly or (especially) getting involved in flamewars. Since we've discussed this more than once before, I assume you remember it, but other people might not know. That's all that's happening here and of course it has nothing to do with your opinions, about Kafka or YC or anything else.
"Flame wars" in the sense of actually responding to the people who are trying to have a discussion on the technical points? Suggesting that my posts involve flaming virtually ever is a flagrant mischaracterization, and we both know it.
This thread contains about the "worst" you'll find from me, with "No offense, but this reveals your ignorance...". This is a flame only insofar as "flame" constitutes any disagreement at all.
When the account was rate-limited, the complaint was a) that HN had received out-of-band complaints from people who were getting mad that I discouraged others from deploying database clusters on Kubernetes; and b) that my posts on the subject were "trite". Maybe I don't waste enough time on HN, but I'd never seen that before.
I concede there was one post in that thread that could be interpreted as borderline incendiary, suggesting that the only reason to run a database on top of k8s is to win "GCool Points" (a position I continue to maintain), but it was directed at no one in particular, intended to provide some levity to non-techno-hipsters, and prevent some of the routine noise we see every time I make that type of post from people who really don't have any counter-point except that "Google does it!". Hardly a habit of engaging in flame wars. And it was used by HN/YC as an excuse to "detach" the entire thread, in which a real technical debate was occurring, as is occurring here, instead of just the single borderline post.
I understand that you feel the need to go on the record with a denial. I hope you can appreciate that I feel that need too, whether the accusation is implied through the rate limit that prevents me from replying and may mislead others into believing that my position is indefensible, or explicit, as it is now.
Log compaction is generally what you want, which will preserve the most recent message for every key in a topic. Event streams expanding boundlessly is something very few will ever need or want, so you'll toss messages into an "event" topic of some kind, apply the event to the most recent entry in the "model" topic and store a new version (which will be kept after log compaction, the original messages can be pruned after you need to free up storage).
This is very similar to a normalized model in a relational database, with many-to-many relationships between the assets.
In the example we have two articles that reference other assets. For instance, the byline is published separately, and then referenced by the two articles. All assets are identified using URIs of the form nyt://article/577d0341-9a0a-46df-b454-ea0718026d30. We have a native asset browser that (using an OS-level scheme handler) lets us click on these URIs, see the asset in a JSON form, and follow references. The assets themselves are published to the Monolog as protobuf binaries.
When consuming this data do you have to programatically do relationship fetching on the client side or is eager loading/joins available in some way in Kafka?
Additionally there seems to be a focus on point-in-time specific views of this data, but are you able to construct views using arbitrary values/functions? Let's say each article is annotated with some geo data, can you construct regional versions of these materialized views of articles at the Kafka level? If not it seems like you are pushing a fair amount of existing sophisticated behavior at the RDBMS level up into custom built application servers.
> Because the topic is single-partition, it needs to be stored on a single disk, due to the way Kafka stores partitions. This is not a problem for us in practice, since all our content is text produced by humans — our total corpus right now is less than 100GB, and disks are growing bigger faster than our journalists can write.
Before this line, the author mentions they also store images. There's no way that all their text + images is <100GB right? Something is inconsistent here.
If all articles are put into monolog, what is the procedure to fetch all articles, lets say, published in year 1857? Will that be a O(n) operation (assuming all messages published in monolog have a timestamp field).
>We need the log to retain all events forever, otherwise it is not possible to recreate a data store from scratch.
SIGH. Cue the facepalm, head in hands, etc.
I'm not going to get into a big thing here. But if you find yourself saying "I need to keep this thing forever no matter what" and then you try to use something that even entertains the notion of automatic eviction/deletion semantics as the system of record, you're doing it wrong.
Not to burst the bubble of the techno-hipsters, but Kafka is "durable" relative to message brokers like RabbitMQ, not relative to a system actually designed to store decades of mission-critical data. Those systems are called "RDBMS".
Elsewhere in the article he says that they have less than 100GB of data and that it's mostly text. This is massive overarchitecture that isn't even covering the basic flanks that it thinks it is, such as data permanence.
I would really like to read the article that discusses why Postgres or MySQL couldn't have served this purpose equally well.
I think the POV he's taken is that the kafka stream is the one true datasource (for all time), with all other dbs being derivatives. This insane strategy seems to be over engineered to get around db migrations... though i'm sure the event stream will also change over time, and he'll have to write migration-like code anyways.
Kafka lets dependent consumers transform that data into whatever model is appropriate for their use case. There's no one-size-fits-all E-R model for data for all use cases.
It's not just working around db migrations. It's also providing you with the ability to model the data as many ways as you require. How else would you do it, materialized views? How often are they materialized? Regular views? How performant are they?
There are many benefits to this approach, and a lot of them require a different way of thinking. It's a different paradigm.
If datasource engineers are for some reason implementing "ontogeny recapitulates phylogeny," they won't be creating migration-like code, they'll be writing ETL filters. Break out your Members Only jackets!
OOC what makes a RDBMS more durable that a Kafka? Both of them are systems for representing data on disk. I'd love to hear why one representation system is better at disaster recovery than another.
In Postgres, I never have to worry that the server will be accidentally loaded with `retention.bytes` or `retention.days` set too low and, as a result, choose to delete everything in the database, generating a wholly artificial "disaster" that can result in long periods of disruption or downtime (at a minimum; worst case is permanent data loss).
It is true that someone could issue `DROP DATABASE`, `rm -rf` the filesystem on the database server, or so forth, so my point is not that other systems are invincible. It's just that a properly-configured RDBMS is designed to take data integrity extremely seriously and provides numerous failsafes and protective mechanisms to try to ensure that any data "loss" is absolutely intentional.
On a RDBMS, things like foreign key constraints prevent deletion of dependent records, mature and well-defined access control systems prevent accidental or malicious record alteration, concurrency models and transactions keep data in a good state, etc. Kafka, on the other hand, is designed to automatically and silently delete/purge data whenever a couple of flags are flipped.
That is not a flaw in Kafka itself; it's designed to do that so that you don't have to interrupt your day and purge expired/old/processed data all the time. It's a flaw in architectures that misinterpret Kafka's log paradigm as a replacement for a real data storage/retrieval/archive system.
I've had this argument countless times with people who've tried to use RabbitMQ as a system of record (if only for a few minutes while the messages sat in queue). There's just some fundamental disconnect for a lot of developers where they don't understand that something accepting the handoff doesn't mean that the data is inherently safe.
Kafka is a fine replacement for a RDBMS if it fits your particular use case. It has very strong data consistency guarantees - stronger than most RDBMS - if you configure it properly (acks=1 et al). It won't even lose data if the leader of a partition crashes during a commit.
It has been explicitly designed for these use cases and even has features like compaction:
Now, I agree with you that in most cases, using Kafka as your primary data store instead of a RDBMS is madness - but that doesn't mean it's a bad idea in general.
I don't mean to be cavalier about misconfiguration, but it's not like the retention period is the Sword of Damocles. It's a configuration setting, and Kafka honors it reliably. As other commenters have pointed out, there are other bad things you can do to cause data loss with any system no matter how hard you try. These stories will continue to grace post-mortem blog posts long after we are gone, but stories of Kafka accidentally not retaining data do not seem to be thick on the ground. Any non-trivial system has its rough edges, but this just doesn't seem to be one of them for Kafka.
For what it's worth, I've known sysadmins who strip their boxes to the bones and take pains to ensure that the "rm" command won't be able to be accidentally invoked, primarily by ensuring it doesn't exist on the box. They carry their utilities from box to box, and take them with them when they leave.
That said, any slightly-sane permission or access control scheme, including the defaults mandated by almost all RDBMS distributions (which want a system user dedicated to their use), would make it rather difficult to rm the database folder. Just opening a shell to a RDBMS's underlying server should be a rare event in itself, to say nothing of actually elevating to root, or running a sloppy/careless rm command that is uncaught by the numerous potential failsafes that sysadmins have been installing for decades now (constraining superuser access to a pre-defined set of commands, for example).
Again, the point is not that RDBMS systems are invincible. It's just that they're much sturdier, and actually designed to serve this purpose.
In what universe is "Well, hack out the dangerous parts" a reasonable answer? Talk about reckless disregard for data integrity! Do you really want to use Kafka that bad that you'd develop, maintain, and thoroughly test a custom patchset that circumvents its eviction routines, rather than just using the systems that already excel at not deleting stuff?
Secondly, eviction is a core part of a message queue's design, on purpose. It's actually a needed thing, and while I'm not a Kafka dev, I seriously doubt that it's so simple that a single flag can be disabled and we can move on.
Disabling a flag is likely a one-line change, assuming a reasonable flag library. But yes, maintaining a custom fork at all is not something to take on lightly. It would make more sense to talk to the Kafka developers about how to make it safer to use in this scenario.
From what I can tell, Kafka isn't designed for long term data storage. RDBMS systems are designed for this.
Kafka is more for streaming data and events, so I'd probably be uncomfortable assuming that it won't do something and "tidy up" my very old data at some point in the future. Since it's supposed to do this from time to time out of the box, you'd have to be very careful not to let anyone tweak the custom config to revert back to this behaviour. RDBMS won't delete things unless you tell it to more explicitly.
While I think storing things in Kafka is fine generally, there's no way I'd not have a more perminant store of the data somewhere so that I can recreate the Kafka data store if I need to. I'm not sure why they're not just using a boring old DB for that purpose. Perhaps they have a reason, but it's not obvious to me.
Kafka can perfectly keep your data around forever. The only limitation is available disk space (and databases have the same limitation). I'm not implying that it is always the best idea to use Kafka as a long-term storage solution, but likewise a database isn't the silver bullet here either.
> so I'd probably be uncomfortable assuming that it won't do something and "tidy up" my very old data at some point in the future
Kafka doesn't "tidy up" your data unless it is configured to do so. What's true is that, by default, Kafka will keep your data around for a week "only" but that's a configuration. And most people change it to whatever fits their use case (some lower it to a few hours, some increase it to months or years; others configure it to keep data around forever, typically in combination with Kafka's so-called "log compaction" functionality).
> Since it's supposed to do this from time to time out of the box, you'd have to be very careful not to let anyone tweak the custom config to revert back to this behaviour. RDBMS won't delete things unless you tell it to more explicitly.
The DBAs I worked with would now say "Hold my beer..." ;-)
> While I think storing things in Kafka is fine generally, there's no way I'd not have a more perminant store of the data somewhere so that I can recreate the Kafka data store if I need to.
What's interesting is that more and more users (from my experience) are actually beginning to treat Kafka as the source of truth, and rather recreate other data stores -- like RDBMS, Elastic indexes, etc. -- from it. If you like RDBMS, you can think of Kafka as the DB's transaction log.
IMHO a lot of these discussion is about personal preferences, the situation that you are in (cough legacy cough), team skills, etc. There are often good reasons to use Kafka rather than RDBMS (in this context) but also vice versa, or completely different technologies of course (like blob stores, e.g. S3).
>Kafka can perfectly keep your data around forever.
In the sense that you can fiddle with it to the point where it doesn't purge things automatically, sure. But RDBMS provides more than the promise that it won't delete your data after a set period of time. If that was all we needed, any filesystem from the last 3 decades would serve fine as a "permanent datastore".
MySQL has gone through a lot of grief to get to the point where it's safe out of the box. This is important. There are many conceivable situations where an unsafe default can accidentally get worked back into things. At least MySQL has never had a default that truncated databases (afaik)...
>The DBAs I worked with would now say "Hold my beer..." ;-)
No one's claiming that RDBMS are invincible, but they do provide numerous features specifically designed to minimize data loss and corruption. You can read more about these in the documentation of any major RDBMS. I've mentioned some specifically elsewhere in this thread, but don't want to keep repeating the same things.
>IMHO a lot of these discussion is about personal preferences, the situation that you are in (cough legacy cough), team skills, etc.
No offense, but this just shows your ignorance of the functionality that you're leaving on the table by considering SQL a cough-legacy-cough solution. We discussed above that "can be configured to not delete things despite defaults and assumptions" is much different from "safe long-term data storage is a major design concern".
A messaging queue does not and should not have the same semantics, because it's not intended to keep decades' worth of data, even though it may be possible to avoid the database's eviction routine. We don't have to abuse something just because the developers haven't gone to lengths to override inadvisable configurations.
It may be unfair to describe setting a documented configuration parameter as "fiddling." Retention is seven days by default. It is trivial to set it to arbitrarily long periods of time. To my knowledge, this functionality isn't really in question. Whether logs are a good unifying abstraction on which to build systems is in dispute among reasonable people, but whether Kafka randomly deletes stuff is not. :)
I don't claim that Kafka randomly deletes things. Just that it automatically does so.
The danger is not that Kafka will choose not to respect the configuration value. It is that the default setting will find a way to creep back in without the admin noticing it, and then a quick reboot, maybe even an unplanned one caused by a power trip or a kernel crash, will be sayonara to the system of record. Sure, there are backups, but who needs that aggravation? (p.s.: there probably aren't actually any workable backups)
In the RDBMS world, MySQL's automatic and silent truncation of VARCHARs down to the character limit of the column was seen as a sign of its badness. That demonstrates the difference in paradigm.
Anyway, the argument doesn't really hinge on whether or not there's an automatic eviction model in the software. It's just a clear, loud signal that the software is not really intended for long-term storage, and that you are, at best, entrusting decades of mission-critical data to a less-tested configuration on a young, maturing product. This should not be appealing by itself, and that's a very optimistic perspective on the choice to forgo the data integrity features provided by a traditional RDBMS.
Developers just cannot seem to grok that just because something appears to store data across server restarts does not mean it is necessarily a safe permanent parking spot. I'm not a DBA, and I've had my share of serious squabbles with them, but when this is what happens with unsupervised developers at the helm, it's hard not to be sympathetic to their aggressive, almost hostile, feelings around developer input into the data model.
Agreed that this is a newer architectural paradigm and a younger product. Of that there is no doubt, and there is always risk there. Also return, of course; someone had to deploy an RDBMs for the first time too—and everyone is glad they did.
But I still don't follow the argument. If the eviction model is a loud, clear signal that this is the wrong solution, why isn't the mutability of RDBMS data the same sort of signal? Claiming that the presence of a DELETE statement in SQL rules out relational databases as durable data stores would not get me too far. And nor should it!
You are 100% right that this is a new approach. You are also right that it is possible to make configuration errors that will break the system. But this is true of all nontrivial systems. At the end of all of this, we still have a very interesting sequence of events (all NYT content ever) stored in an immutable log. This seems reasonable. Maybe the NYT team is blazing a trail, it's not prima facie a crazy one. :)
SQL provides users with a lot of facilities and mechanisms to limit, control, supervise, and if performed within a transaction, even undo overzealous DELETE statements. I discussed some of these at https://news.ycombinator.com/item?id=15188619.
AFAIK, with Postgres, there are no known circumstances where restarting your server will result in the purge of your database; that's really just icing on the cake, not the core of the argument. The core of the argument is that SQL provides not only a rational design paradigm for long-term storage and choices that reflect they take it seriously, but also an extremely strong feature set for data management and integrity.
As I've said numerous times now, SQL isn't invincible. But it's inarguably more resilient than Kafka, and it provides the controls necessary to keep some sanity over data in the long run.
> No offense, but this just shows your ignorance of the functionality that you're leaving on the table by considering SQL a cough-legacy-cough solution.
No reason to get ad-hominem.
When I said cough-legacy-cough, I was referring to the situation you happen to be in when e.g. taking on a new project. If, for example, your company has been fully committed to Postgres since several years, your chances to change this and move to a different architecture (or even sticking to the same architecture but using a different RDBMS like MySQL) are pretty low.
>The same underlying structure used within our RDBMS to provide all the guarantees Kafka provides.
Is there perhaps a reason that people use the RDBMS scaffold atop this "underlying structure"? The point I'm making is that Kafka is not safe for use cases that demand robust data storage and integrity, at least not in comparison to the standard of safety set by the traditional systems of record for important data (RDBMS).
I'd say it isn't appropriate for permanent data storage because the individual brokers don't scale well with the amount of logs present. If you have hundreds of partitions, and millions of logs, then any operation dealing with the indexes (like an unclean startup) will take an extremely long amount of time. So your individual brokers are now down for an hour if they don't shut down cleanly. Which happens often (oom, weird zk errors, etc)
Most of the use-cases you describe wouldn't be resolved by re-reading the entirety of the topics from the beginning of time.
They'd resume from their last committed offset and continue where they left off.
If you do have to catastrophically recover from the beginning of time, then sure you'd have a rough time. But that's true for any system that would have to do that. It's not Kafka-specific.
Now if your consumer was entirely in-memory and made no use of persistent storage itself and it had to recover from the beginning of time, then I'd say that type of problem is your own architectural failure and I have yet to come across a tool, pattern, framework, or architecture that bullet-proofs your foot.
That's a bit of an oversimplification. Production grade RDBMS systems have far more guard rails, testing and work put in to them than Kafka. It's relatively straight forward to lose data in Kafka, I've done it (its usually control plane bugs, not data plane).
I skimmed the article but I imagined they were using it as a secondary data store. I think they want to it to be durable in the sense that even if the events are already consumed they can still play them back to reindex elastic search (which is a thing you need to do periodically).
"With the log as the source of truth, there is no longer any need for a single database that all systems have to use. Instead, every system can create its own data store (database) – its own materialized view – representing only the data it needs, in the form that is the most useful for that system. This massively simplifies the role of databases in an architecture, and makes them more suited to the need of each application."
Fair enough. It seems like it still ought to be able to burn the kafka+elasticsearch world down and submit everything to kafka with such a setup (and thus elasticsearch). I would certainly not sleep very well at night if I could not.
> I think they want to it to be durable in the sense that even if the events are already consumed they can still play them back to reindex elastic search (which is a thing you need to do periodically).
That (replaying if needed) is exactly what Kafka allows you to do, unless I misunderstood what you wrote.
It doesn't really matter what the "source of truth" system is, although Kafka doesn't seem quite as mature/stable enough for that. With such little data, they can push into a nice graph database instead, run it entirely in memory and meet 10x any demand they'll ever see along with all the query types they'll need. Add in an elasticsearch cluster on the side and problem solved.
Any database can serve as a log replay, as long as you save all the versions then it's just called a query.