
Those are two different technologies. AMQP is all about routing and queues. Kafka is a distributed log, not a queue. There's a significant difference between the two.

Kafka: every consumer group subscribed to a topic will see each message at least once (within a group, a partition is read by exactly one consumer at a time). A queue: a queue can have multiple consumers, and only one of them sees a particular message.

Kafka suits a relatively small to medium number of large-volume topics. Topics can be larger than a machine thanks to partitioning. Strict ordering per partition, based on message arrival. Queues (RabbitMQ or anything AMQP) suit a relatively large number of small-to-medium-volume topics. Ordering is an option, and a topic must fit within a machine.

Those are orthogonal concepts. They can live next to each other. My first choice is always Kafka because: persistence, replication, scalability. It works fine as a single broker and can always scale horizontally; ZooKeeper is not that scary, especially with a good operator.

If you think that Kafka is difficult to run, wait until you need replication in RabbitMQ. Good luck with leader election in your application layer. No fun.




> Kafka is a distributed log

When should you use Kafka instead of storing rows in SQL with a timestamp so you can replay them/fetch them if needed?

Why do you need a sharded Kafka cluster?

Most businesses are going to have Redis, SQL, and probably RabbitMQ.

Where/why add Kafka to that stack?


Kafka was designed for places with a huge data/log volume (GB/sec or more).

From what I understand, it's generally The Thing To Use when you have a not-very-large number (i.e. thousands) of high-volume sources with good links. You push stuff into the log and you can take your time chewing through it.

MQTT was designed for shitloads (hundreds of thousands+) of small devices on bad links connecting to the mothership.

Neither of them is really a message queue. I mean, they queue messages, but they have different goals than, say, ActiveMQ or RabbitMQ (both of which can be used as a backend for MQTT).

Using Kafka for a message queue is overkill. It's more likely that MQTT fits your bill.


> When should you use Kafka instead of storing rows in SQL with a timestamp so you can replay them/fetch them if needed?

Lots to unpack here. The simple answer is: you use Kafka when you need data written to disk in order of arrival but don't need ad-hoc queries. Messages must be persisted, processed at least once by someone, in order. It's common to have Kafka in front of a database as a sort of buffer: when data is in Kafka, it has been acknowledged as seen and will be processed later.

Now, why would you store it in Kafka rather than in a table with timestamps? I said in my previous comment that messages are persisted based on the arrival time on the broker. This is somewhat inaccurate. Kafka is not interested in a message's arrival time, it's interested in arrival order. All messages arriving on a partition are processed in order of arrival. Messages are stored on disk as bytes; each message is written to a segment file, and a partition will often have more than one segment file. Segments are sequential. A message is written at a byte offset within a segment and the format is something like: {metadata, length, message body}; the protocol is documented here: https://kafka.apache.org/protocol. Each segment file has an index file. The index file maps the partition message offset to a byte offset within the segment file. To read a message, the broker needs to know the message offset you want. The broker will identify the segment by looking at (AFAIR) the index file name, which contains the first message offset of its respective segment. It will read the index, find the byte offset of the message, check the message length, read the message, and return it.
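A toy model of that lookup, just to make the shape of it concrete (this is not Kafka's actual code, merely the idea):

    // Toy model of Kafka's offset lookup - illustration only, not real Kafka code.
    // A partition is a set of segments; each segment knows its base offset
    // (encoded in the real segment file's name) and maps message offsets to
    // byte positions. Real Kafka indexes are sparse; this one is dense.
    import java.util.NavigableMap;
    import java.util.TreeMap;

    class Segment {
        final long baseOffset; // first message offset in this segment
        final NavigableMap<Long, Long> index = new TreeMap<>(); // msg offset -> byte pos
        Segment(long baseOffset) { this.baseOffset = baseOffset; }
    }

    class Partition {
        final NavigableMap<Long, Segment> segments = new TreeMap<>(); // base offset -> segment

        // Pick the segment whose base offset is the largest one <= the wanted
        // offset, then consult that segment's index for the byte position.
        long bytePositionOf(long offset) {
            Segment seg = segments.floorEntry(offset).getValue();
            return seg.index.floorEntry(offset).getValue();
        }
    }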

However, as you can imagine, this is a pretty involved scenario because it implies a bunch of random reads from disk. The point of Kafka is sequential reads over complete partitions or segments. So Kafka shines when you need to read large sequences from a partition. It will identify a segment, memory map it and do a sequential read over a large file (usually a segment is something like 1GB).

This is different from inserting into an SQL table with timestamps. First of all, you will have duplicate timestamps. The only way to avoid duplication is auto-incrementing keys, which would be similar to the message offset. However, a large number of writers will inevitably lead to congestion on your auto-incremented sequences. Next, your table will have indexes: since you want ACID, those indexes have to be updated before data is considered stored (the C). Next, you have the data stored, but it's an SQL database, so your data is on one node. This is different from Kafka's minimum replication factor, where data is considered stored only once at least N replicas confirm it. As for reads: your client application would store its own last-processed sequence number from your auto-incremented index, and this would work the same as the consumer offset in Kafka. But if you have hundreds of GBs of data on your server, you'll be hitting disk pretty often for index scans, or your indexes will outgrow your RAM. Sure, you can scale to a bigger box (costly) or go for a distributed RDBMS (YugabyteDB or CockroachDB or similar), but due to how tables are distributed in those systems, you will hit a latency cliff. Reads are usually better than writes; writes can be atrocious - large batch writes on a distributed RDBMS can be 500x+ slower than on a single-node RDBMS.
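To make the comparison concrete, the table-as-a-log pattern looks roughly like this - a hedged sketch with a hypothetical events(id, body) table and plain JDBC, where the app persists its own checkpoint exactly like a Kafka consumer offset:

    // Hypothetical schema: events(id BIGSERIAL PRIMARY KEY, body TEXT).
    // process() and saveCheckpoint() are placeholders the application supplies.
    import java.sql.*;

    class TablePoller {
        void pollOnce(Connection c, long lastSeen) throws SQLException {
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT id, body FROM events WHERE id > ? ORDER BY id LIMIT 1000")) {
                ps.setLong(1, lastSeen);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        process(rs.getString("body"));
                        lastSeen = rs.getLong("id"); // the app-managed "consumer offset"
                    }
                }
            }
            saveCheckpoint(lastSeen); // every poll is an index scan on a growing table
        }
        void process(String body) { /* application logic */ }
        void saveCheckpoint(long offset) { /* persist, e.g. in the same database */ }
    }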

> so you can replay them/fetch them if needed?

If you need ad-hoc queries, SQL is the best way to go. But for volumes of data larger than your SQL server can hold, I'd consider Kafka in front, with SQL storing only what needs to be queried.

> Why do you need a sharded Kafka cluster?

Three reasons:

1. Write performance. I wrote above that messages are written in order into segment files. I have personally worked with clusters ingesting 400 Mbps per partition, on multiple partitions, on a single node. Basically, Kafka can saturate your bandwidth, and if you have SSDs, everything that's ingested will be persisted, because Kafka uses zero-copy and memory-mapped files - it's basically like writing to RAM. If you need to ingest more data than a single node can handle, you need more machines.
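For reference, a throughput-oriented setup of the stock Java producer might look like this - a sketch, not tuned numbers; the broker address, topic name and payload are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.*;

    class ThroughputProducer {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
            p.put(ProducerConfig.ACKS_CONFIG, "all");           // wait for in-sync replicas
            p.put(ProducerConfig.LINGER_MS_CONFIG, "20");       // batch for up to 20 ms
            p.put(ProducerConfig.BATCH_SIZE_CONFIG, "1048576"); // 1 MiB batches
            p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

            byte[] payload = new byte[4096]; // stand-in for a real message
            try (Producer<String, byte[]> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("ingest", "some-key", payload));
            }
        }
    }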

2. Read performance: data within a topic partition is written in sequence and will be read in sequence. A typical scenario is using some derivative of a key to establish the partition number; by default Kafka uses the murmur2 hash. What this means in practice: imagine you have data keyed by email address. All data for a given email address ends up on the same partition, so your consumer will always read data for a given email address from one partition. The more partitions you have, the more consumers you can run. Of course, multiple email addresses will share a single partition because their hashes map to the same partition number; the higher the number of partitions, the finer the granularity of that distribution. However, it's easy to hit a hotspot if you have a lot of data for one hot key.
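With the stock Java client that keying is just the record key - the default partitioner hashes it (murmur2) to pick the partition. A sketch, reusing a producer like the one above; topic name and values are made up:

    // Same key -> same partition: everything for one email address stays on one
    // partition, so a consumer reads it strictly in send order.
    producer.send(new ProducerRecord<>("user-events", "alice@example.com", event1));
    producer.send(new ProducerRecord<>("user-events", "bob@example.com",   event2));
    producer.send(new ProducerRecord<>("user-events", "alice@example.com", event3));
    // alice's two events land on the same partition, in order; bob's may land
    // on a different partition entirely.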

3. I would argue the most important one: replication. If you have 3 brokers and your topic has a replication factor of 3, you have one copy of the data per physical machine, all of it consistent across brokers. That is, each partition and every segment exists in three copies, one copy per physical broker. There will be multiple partitions of the same topic on the same broker, and each will be replicated to the other brokers.

Taking a broker out does not impact your operations. You can still read and write; you will have under-replicated partitions, but Kafka will automatically recover once the missing broker comes back. If a broker falls out and that broker was the leader for a partition, Kafka will elect a new leader from one of the remaining brokers, and your consumers and producers will automatically carry on like nothing happened.
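Setting that up is a one-liner per topic, e.g. with the Java AdminClient (a sketch; broker list, topic name and partition count are placeholders):

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.*;

    class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                      "broker1:9092,broker2:9092,broker3:9092");
            try (Admin admin = Admin.create(props)) {
                // 12 partitions, each stored on 3 brokers (replication factor 3)
                NewTopic topic = new NewTopic("orders", 12, (short) 3);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }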

> Most businesses are going to have Redis, SQL, and probably RabbitMQ.

> Where/why add Kafka to that stack?

Kafka is an all-purpose strictly-ordered-within-partition buffer of data with automated replication and leader election. Multiple consumers can read from the same topic partitions at the same time. Each one has its own offset and can consume at its own pace. Each consumer can do different things with those messages at the same time. Those consumers can be located at different org levels. Kafka is great at moving large volumes of data within the organization and facilitating other processes, for example ETL.
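As a sketch of the "multiple independent readers" part: two copies of this consumer with different group.id values would each see every message and keep their own offsets (names and addresses are placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;

    class GroupedConsumer {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            p.put(ConsumerConfig.GROUP_ID_CONFIG, "billing"); // run a second copy as "analytics"
            p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
            p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

            try (Consumer<String, String> consumer = new KafkaConsumer<>(p)) {
                consumer.subscribe(List.of("user-events"));
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500)))
                        System.out.printf("p=%d offset=%d key=%s%n",
                                          r.partition(), r.offset(), r.key());
                }
            }
        }
    }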

Man, I sound like I'm some Confluent salesman. I'm not affiliated. I work for Klarrio, we build streaming systems.


Thank you for this. I'm currently involved in the design of a large-ish IoT backend having only worked on smaller volume systems using "traditional" brokers (MQTT, RabbitMQ). Your post helped me fit Kafka into my mental image.

What do you think about an incremental development where one starts with direct MQTT subscribers and adds Kafka only once the volume goes up?


Without knowing much of your exact requirements, I would aim for the following: devices communicate over MQTT, support one protocol, do it well. The problem with MQTT is scaling the broker. It all depends on what is behind "large-ish". Hundreds of thousands of devices? Millions? Dozens of millions? If you can fit all connections on one broker, it's easy, any available solution will handle this. Going past one broker is where the problem starts. Imagine the following scenario:

You have 10 brokers, each broker handles 1M connections. You have a TopicA subscriber on brokers 9 and 10, and someone publishes a message to TopicA on broker 2. Broker 2 needs to forward the message to brokers 9 and 10 but to no others - if you forward every message to every broker, you waste a ton of compute. So you need a method to distribute your connection table: which brokers currently hold connections with subscriptions to TopicA. This is reasonably easy to embed in the broker for a relatively small number of brokers, but if you need hundreds of brokers, your global connection table will not fit on each broker. Scale this up: millions of topics and millions of connected clients. You need a way to offload the connection table somewhere else and a reliable method to distribute changes to it. This is where we have successfully used Kafka - distributing the connection table and its state changes.
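One way to picture that last part (a hypothetical scheme, not necessarily how we built it): each MQTT broker publishes subscription changes to a compacted Kafka topic keyed by topic filter, and every broker tails it to keep a local routing table. Assuming a KafkaProducer<String, String> named producer:

    // Hypothetical: key = MQTT topic filter, value = brokers with live subscribers.
    // With cleanup.policy=compact, Kafka keeps the latest value per key, so a
    // broker that joins late can rebuild the whole table by reading the topic.
    producer.send(new ProducerRecord<>("mqtt-routes", "TopicA", "broker9,broker10"));
    // Broker 9 loses its last TopicA subscriber -> publish the shrunken set:
    producer.send(new ProducerRecord<>("mqtt-routes", "TopicA", "broker10"));
    // Broker 2 now forwards TopicA publishes to broker 10 only.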


Got it - I think for the first 1-2 years one broker instance should be plenty. So the decision to up the complexity can probably be postponed to the point where it becomes clear that we'll hit 10M+ devices and lots of brokers will be required.


Definitely. Focus on solving the business problem first, grow the business. When you need to make the switch, you will find a way forward by rebuilding your MQTT infrastructure to fit your scale. If you ever need to.


Yeah - business is smart home renewable energy stuff, not lightbulbs. If they don't go international real fast the small oldschool broker cluster solution will probably be plenty for the next 5 years.


If you can live with it (low traffic, small data set) you totally should. Where ‘low’ and ‘small’ are relative to what a modern server can do, so they can go up quite a bit. And with shards and/or replicas you can get N times the throughput in an easy way.


> instead of storing rows in SQL with a timestamp

You realize a time series database is more appropriate for that?


> When should you use Kafka instead of storing rows in SQL with a timestamp so you can replay them/fetch them if needed?

When the use case warrants it and when downstream consumers need to be notified of a change in the data or its state. That is, in reactive architectures.

A data streaming platform (of which Kafka is one example) will push the data, whereas a database (relational or not) has to be explicitly polled (queried)[1]. Polling for new data is inefficient, computationally expensive (a waste of CPU cycles when the data has not changed) and prone to creating delays in data processing (when the polling interval is too long). Data streaming platforms deliver the changes in near realtime.

Also, Kafka, being a distributed append-only log, keeps the entire temporal history of all data changes. That is, if an event or an object has undergone a series of state transitions, such as create -> update 1 -> update 2 -> update 3 -> delete, it will be fully available in a topic as five separate events. If a consumer is only interested in the latest state, the series of events can be compacted into an equivalent of a database table (naturally, it is called a KTable in Kafka Streams) and the consumer can use the KTable instead of having to process the entire event stream[2]. KTables and event streams can be flipped back and forth with no extra effort.
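In Kafka Streams terms, that compacted view is essentially one line - a sketch; topic names are placeholders:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    class LatestStateView {
        static void build(StreamsBuilder builder) {
            // Reading the topic as a table keeps only the latest state per key;
            // builder.stream(...) on the event topic would instead yield the
            // full create -> update 1..3 -> delete history, five events.
            KTable<String, String> latest = builder.table("object-events");
            latest.toStream().to("object-latest"); // a table flips back into a stream
        }
    }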

Kafka is, effectively, your database's own transaction log sans the query engine.

> Why do you need a sharded Kafka cluster?

Kafka clusters are not sharded, at least not in the database sense, as it is a cluster. A consumer sees one living thing[3] – the cluster – and the cluster can be dynamically resized without the consumer(s) knowing or noticing it. The cluster will then rebalance itself.

> Where/why add Kafka to that stack?

Kafka is typically used to get data out of data sources and deliver a temporal series of changes to interested consumers. If the data needs to be queried ad hoc or explicitly, Kafka is fronted by one or many (usually domain-specific) event stores. An event store can be a database (an RDBMS or a document one), a document search engine (e.g. Elasticsearch or a vector database) or anything else that has a query engine in it. It can be a cache as well, when either the speed of retrieval or affinity to the data is important.

[1] It is possible to mimic the Kafka temporal history via triggers that inject a copy of the row being changed into an «audit» table before allowing an update to occur, but, with time and in high-data-volume environments, the table size will balloon and drag the database down.

[2] Most databases now have change event streams that abstract the database transaction log away and deliver raw row changes as a temporal history of events, which is a functional equivalent of data/event streaming (using Kafka or similar) sans the scalability and the distributed processing – the onus to implement either or both is entirely on the consumer and that part is hard. Therefore, a database change event stream is oftentimes hooked directly into a Kafka cluster.

[3] Not entirely true in most implementations, but that is a rabbit hole to go down another time.


One clarification:

> A data streaming platform (of which Kafka is one example) will push the data

The Kafka broker keeps hot data in RAM. The client pulls: https://github.com/apache/kafka/blob/3.3.2/clients/src/main/.... Most often it's a frequent pull with a short timeout.
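The loop on the client side is literally a pull - a sketch, where consumer, running and handle() are assumed to exist (e.g. as in the consumer example upthread):

    // poll() is a blocking fetch with a timeout; "push-like" latency is really
    // just a tight pull loop driven by the client.
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        records.forEach(r -> handle(r));
    }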

This is one of the major differences from MQTT. The MQTT broker pushes to the subscriber while the latter is connected, or stores messages for redelivery if the subscriber is offline (QoS 1 and 2), but the redelivery is also a real push on reconnect.
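For contrast, an MQTT subscriber really is pushed to - the broker invokes a callback. A sketch with Eclipse Paho (org.eclipse.paho.client.mqttv3); broker URL and topic are placeholders:

    import org.eclipse.paho.client.mqttv3.*;

    class PushSubscriber {
        public static void main(String[] args) throws MqttException {
            MqttClient client = new MqttClient("tcp://broker:1883",
                                               MqttClient.generateClientId());
            client.setCallback(new MqttCallback() {
                public void messageArrived(String topic, MqttMessage msg) {
                    System.out.println(topic + ": " + new String(msg.getPayload()));
                }
                public void connectionLost(Throwable cause) {}
                public void deliveryComplete(IMqttDeliveryToken token) {}
            });
            client.connect();
            client.subscribe("TopicA", 1); // QoS 1: stored and redelivered if offline
        }
    }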


You are absolutely right, but not every project needs reliable messaging.

Many projects can be just fine with ephemeral messaging, they might already have some kind of managed scalable database and failures can often be handled on the client side.

Often we insert messaging into an HTTP system, or replace an HTTP system. In those cases, the client is waiting for an acknowledgement from the system that sits on the other side of the queue.

What happens in HTTP when you send a request to an HTTP server but it combusts before you get a response? The connection breaks or you get "request timed out", and the user can click the button again. Or you can retry in the client code.

For many joke apps it is not always worthwhile to double the complexity just to move from a 98% system to a 100% system.


Yes, I agree wholeheartedly. If you don’t need reliable replication and can tolerate data loss, whatever rocks your boat :)

Personally, I do not find Kafka difficult to operate. ZooKeeper is basically a boring black box, and there's also KRaft (the famous KIP-500).


> You are absolutely right, but not every project needs reliable messaging.

Until a manager comes along and decides it does, and you have to redesign everything related to the message queue.


Interesting. What are your thoughts on NSQ?

https://github.com/nsqio/nsq

Was looking at it earlier today, but haven't ever tried it out.


NSQ is a message broker that's intended to run in a distributed fashion, i.e. one broker per message-producing machine. Consumers then have to discover and connect to all brokers that carry messages on a given topic. NSQ persists messages to disk, but it does not have replication, so if one of the machines with a broker dies (as in: dead disk), all messages queued on it are lost.

We use NSQ extensively and it's a great fit for us. It's fast, lightweight, very easy to understand and work with, has great HTTP APIs for controlling the queues and publishing messages, etc. What it doesn't have is replication, so if a message must never be lost I'd reach for a distributed log, probably looking at Redpanda before Kafka, and possibly building my own on top of ScyllaDB since we're already running that.


I haven’t used it so I’m not qualified to answer, sorry. I can only tell why I never used it: as far as I understand it’s possible to have data loss with nsq and there’s no replication.



