Conflict of interest: I'm happily using Pulsar, I come from an extensive Kafka background, and I'd personally like to see Pulsar win this entire space.
I see some differences: instead of Pulsar Functions, Redpanda has gone the extra step of using WASM. I suspect the Pulsar community will end up going in this direction too as the whole ecosystem pushes forward.
Got rid of ZooKeeper: I've never truly understood the hatred toward ZooKeeper, aside from the viewpoint that it's one more external dependency the project requires.
Compatible Kafka API: this is a smart business choice to win over any business that is unhappy with the operational costs of Kafka and wants to move off it. Pulsar has a connector for Kafka which lets a business leave their existing work entirely untouched and stream to the new source, taking the strangler-pattern approach. The problem with a compatible API is that you still need to touch the running system to point it from Kafka over to Redpanda, and that opens a can of worms around how to abort a Redpanda rollout and switch back to Kafka without losing data. The business now needs to modify all their existing code so that wherever it produces to Kafka, it also writes to Redpanda.
The other option I see is essentially the same: Redpanda has a connector to Kafka and only streams off the existing Kafka cluster, which IMO makes the compatible API pointless aside from a marketing and sales standpoint with customers.
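The dual-write concern above can be sketched in a few lines. This is an illustrative Python sketch, not an official migration tool: the `DualWriteProducer` wrapper and the stub clients are hypothetical, standing in for real Kafka clients (which, thanks to the compatible API, would in principle differ only in their bootstrap address).

```python
class DualWriteProducer:
    """Write to both clusters during a strangler-pattern migration."""
    def __init__(self, primary, shadow):
        self.primary = primary   # e.g. the existing Kafka producer
        self.shadow = shadow     # e.g. the new Redpanda producer
    def send(self, topic, value):
        self.primary.send(topic, value)  # source of truth during rollout
        self.shadow.send(topic, value)   # shadow traffic to the new cluster

# Stub clients so the sketch is self-contained; in real code these would be
# Kafka-protocol producers pointed at the old and new bootstrap addresses.
class StubProducer:
    def __init__(self):
        self.sent = []
    def send(self, topic, value):
        self.sent.append((topic, value))

kafka, redpanda = StubProducer(), StubProducer()
p = DualWriteProducer(kafka, redpanda)
p.send("events", b"hello")
assert kafka.sent == redpanda.sent == [("events", b"hello")]
```

Rolling back then means dropping the shadow writer; the old cluster never stopped being the source of truth, which is exactly why the dual-write window is the operationally painful part.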
Kafka also has a KIP to get rid of ZooKeeper; a bunch of the issues related to that KIP are resolved, and it looks like it should be happening this year.
Doesn't this just internalize the dependency within the project itself? Isn't Redpanda taking on all the effort that Zookeeper has been doing for years, all the edge cases, all the additional support and now the coupling of it within the very project itself?
That is our essential complexity: if you are trying to replicate data to machines, you need to replicate data to machines. We chose Raft as the only way. In essence, we are much simpler than upstream w.r.t. protocols for data replication.
With Pulsar you have to run (as I understand it) not only ZooKeeper, but also Apache BookKeeper. Operationally, Pulsar sounds even more complex than Kafka.
I've never managed any of these, but I know that both ZK and Kafka have a reputation for being operationally complex. I've read comments by other people on HN about Pulsar being complex, too.
I'm optimistic about Pulsar becoming a widely deployed tool once they can get rid of the ZK dependency, in particular since Pulsar seems quite friendly to non-Java languages, while BK requires Java on the client and does not, and will never, support other languages.
>Running five node zookeeper cluster was pure overhead.
I have not experienced this issue. I run a 5-pod ZK ensemble in K8s, with each pod at memory: 256Mi and cpu: 0.1, for a couple hundred thousand messages a second with Pulsar.
I think 1.25 Gi and half a core for handling quorum and metadata locking for a stream-storage system isn't exactly what I would consider overhead. And deleting ZK tomorrow doesn't make that work free; Redpanda simply takes those resources on itself.
I was a bit surprised to see the announcement about making Redpanda open source (although very interesting for poking around!).
Question: are you, as a company (Vectorized) pursuing the same business model as the Confluent one?
OSS Kafka, but paid official support, cloud, connectors and registry?
We had success with very large enterprise customers and also some success with finance companies (looking for zero data loss). The idea is that we can give away the Kafka-API-compatible system with 10x lower tail latencies for free and still monetize the high end of the market.
I think there is a big shift when you have a single binary, i.e. no one really complains about running nginx, etc., because it's easy to get up and running.
So the gist is: we wanted to let everyone use it, while reserving the right to be the only hosted Redpanda provider.
Is there a blog post or docs showing using WASM? I searched your site but couldn't find details.
I've failed to grasp if it's an "alternative plugin-engine" kind of system to extend Redpanda, or you're storing data as WASM and therefore it's executable (to take your example: GDPR compliant auto-expire if it's past a certain date).
Feel free to embed it, ingest customer data, or run it inside a SaaS application. The only restriction is hosting Redpanda as a service for other customers (think AWS MSK).
If you have any questions or think your use may be confusing please reach out to us.
If I am reading the fields of this implementation of BSL correctly, it will be open-source software in 10 years, when it converts to Apache 2.0, but not today. Until then, the license does not comply with the open source definition due to field of use restrictions.
" Our intention is to deter cloud providers from offering our work as a service. For 99.999% of you, restrictions will not apply - welcome to our community!"
This is simply not true. The restriction applies to 100% of everyone using the software under the license.
If an alternative service provider can't legally host the service for me, I am restricted from selecting an alternative vendor if my needs converge from the available vendors offerings.
Yes, all of us are bound by the terms alike, but I think everyone understands it was meant as a commercial vs. non-commercial distinction: for 99.999% of you, use will never trigger the respective clauses in the license. And the post title literally says "Redpanda is now Free & Source Available".
Hi Alex, I'm an amateur programming language enthusiast & designer, so I've a question related to this.
Looks like the project is mostly written in C++ and Go. What was the reason for this choice? Have you considered other languages, like Rust or Zig, instead of C++? TBH I'm not sure what an alternative to Go would be; maybe the JVM AOT-compiled with NativeImage (but AFAIK that's still experimental).
Did Go's GC and/or C++'s lack of GC help/impede the project? IMO memory management is one of the main differences between languages... the other is concurrency / memory model / undefined behavior, where JVM is significantly ahead of the rest (no undefined behavior), I'm not sure exactly where Go stands (there seems to be a memory model, but no mention of undefined behavior or lack thereof).
Hi Alex - This looks a lot like what ScyllaDB did to unleash the potential of the Cassandra space, with a fully optimized C++ rewrite of an Apache project. Did you draw some inspiration from their efforts?
As a satisfied Scylla convert, I'm looking forward to trying Redpanda.
I'm a fan of Scylla too, but if I could go back in time I'd have recommended waiting until mid 2019 to migrate. "Fully optimized C++ rewrites" tend to take years to become battle-tested.
There are two levels here: 1) Raft has a proof (and a great PhD dissertation from Diego), but what matters is whether we actually implemented it correctly, so 2) we need to continuously test it. Denis did a lot of similar work on CosmosDB (Microsoft) and has spent his career working on consensus.
Totally. Today, if folks need txns, they wouldn't be a good fit for us. What we found is that about 90% of use cases are covered by the base API. For reference, with all of the versioning there are something like 144 API calls you can make to Kafka; most people use a small subset of those via high-level clients (Java, Python, librdkafka, etc.).
Hi, I need to decide in the near future on a scalable messaging solution with at-least-once or better guarantees for our own SaaS platform. I'm looking for something that can be deployed on-prem, that's not targeted only for the public cloud. Do you have any documentation on how best to deploy redpanda with Docker and/or Ansible? What are the best practices on rolling out the cluster in-house? I've seen some docs on tuning that referred to AWS types of instances, is there something that you could refer me to that is more generic and not tied to a specific cloud vendor?
I'm asking partly because we must be able to offer closed on-prem installations as well as SaaS on a cloud. I'm looking for a low-ops component that will not fail me (as often as alternatives would) :)
Hey, congrats on publishing your source. It's a very interesting project indeed. I took a look around, and I think with a bit of expanded documentation, especially examples using the WASM transformations and maybe some emphasis on durability and other guarantees, it could grow into a great project.
I'd be interested in the write amplification, since you went pretty low-level in your IO layer. How do you guarantee atomic writes when virtually no disk provides guarantees beyond the page level? A failed write to a page could, at least in theory, destroy already-written data on that page, so one has to resort to writing data multiple times.
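One common answer to the torn-write problem, used by many log-structured systems (I'm not claiming this is Redpanda's exact on-disk format), is to checksum every record: a write cut off mid-page fails the checksum on recovery and is discarded, so the log tail stays consistent without double-writing. A minimal sketch:

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    # Length-prefixed record with a CRC32 over the payload.
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def decode_record(buf: bytes):
    # Returns the payload, or None if the record is torn/corrupt.
    if len(buf) < 8:
        return None
    length, crc = struct.unpack("<II", buf[:8])
    payload = buf[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload

rec = encode_record(b"some event")
assert decode_record(rec) == b"some event"
# A torn write (record cut off mid-page) fails the check and is dropped
# on recovery instead of being treated as valid data.
assert decode_record(rec[:-3]) is None
```

The write amplification question then becomes about the metadata overhead per record and how often the log tail must be rewritten, rather than about double-write journaling.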
Hi Tim. Absolutely, we expect people to test. Public benchmarks are coming in December. We'll probably just pick the Open Messaging Benchmark that has been going around the Pulsar and Kafka communities, so folks can have a frame of reference.
Though what is more interesting to me is what happens when you inject failures while the benchmarks are running.
What's end-to-end latency for a local cluster like? What's your business model?
I'm a big fan of Kafka as an abstract building block, but not so much the actual implementation, which is as painful to set up as a consultancy-based business model might make you suspect, especially if you need reliability. The other problem is that performance kind of sucks: apart from potential latency spikes due to GC pauses, I found even the average latencies for reliable end-to-end delivery (in a fast local network and on decent-sized hardware) not in the right order-of-magnitude ballpark.
-Redpanda (free) - comparable features to Kafka, no limits
-Redpanda Enterprise (paid) - Additional features (security, WASM, tiered storage, support etc)
-Vectorized cloud (Free and paid tiers) - Hosted in AWS+ GCP
Hey Alex, this looks very interesting! From browsing the docs it appears rpk tries to mess with cgroups and some kernel settings... is there compatibility mode for running unprivileged in a container environment such as kubernetes? Thanks!
It sounds like Vectorized is a streaming platform built on top of Redpanda. Is that the right way to think about it? If yes, then why not build it directly on top of Kafka instead?
We kind of started on a different note. We wanted to solve some fundamental problems (no Zookeeper - similar to KIP-500) and most of all we wanted a single binary we can ship around, something that is easy to run.
I think people LOVE the kafka _api_ but they have a hard time operating clusters at scale. So we decided to keep the same API but solve the problem of operational complexity.
> I think people LOVE the kafka _api_ but they have a hard time operating clusters at scale.
That is very true, and this is what you should emphasise. Wish you the best of luck!
The actual speed of Kafka has rarely been a concern (but huge numbers of partitions are, which makes rebalancing a pain) in my experience, in fact it was mostly overkill. But operational complexity was definitely an issue!
Landing page isn't bad; when you scroll, the important points are there, but you have to scroll. I'm not a "landing page optimisation guru", so take this with a grain of salt, but I would change it as follows.
Without scrolling, all the text that's displayed is this:
"Redpanda
A Kafka® API compatible streaming platform for mission-critical workloads.
Try Redpanda today"
It tells me what it is, which is good. It does not tell me why it is better, for that I have to scroll. Kafka is already suited for mission-critical workloads, so that is not a unique value proposition. Maybe:
"Redpanda - 100% Kafka API compatible, but without the headaches. Forget Zookeeper, forget rebalancing issues. Instead, enjoy reliable message delivery, 10x faster speed and ultra-low latencies due to our thread-per-core architecture."
Something like that, plus a visible "call to action" button, maybe "try it out" or "download".
Could also think about a pretty graph comparing latencies or something. People love pretty graphs.
@MrBuddyCasino - I tried incorporating the design on the site we launched 5 seconds ago. Let me know what you think (alex@vectorized.io is my email) in case you feel inclined :D Thank you though.
I'm not Alex, but I would assume he means scanning cold data will not evict all the hot data from the cache. This is a common problem with cache eviction algorithms.
Hi I like that there is a competitor to Kafka in this space and also the build in capability to do transformations. I got a few questions though which I could not find in your docs:
(1) Over at the Apache Arrow FAQ I read that the overhead of serialization in analytical frameworks can be around 80 to 90% of total compute costs (r_1). While I have no concrete numbers of my own, from using Kafka together with Kafka Streams I can at least confirm that the overhead of serialization is (very) significant. My question therefore is: does your WASM engine avoid (de)serialization between your storage/stream layer and the engine, and if not, are there plans for this?
(2) Are the supported WASM transformations stateless (i.e., single-message) only, or can they be stateful (i.e., windowing and stream-stream/table join functionality)?
(3) I could not find any reference to the WASM inline lambdas at all in the docs actually, am I missing something?
(1) Arrow is great! Currently it does not, but yes, it will when we move out of the Node.js implementation into our own V8 isolates inside an alien thread (a Seastar concept).
(2) Stateful, but only within a single partition.
(3) It will be released in the next week or so. If you look in the GitHub repo, you can look into `coproc`.
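The "stateful, but only within a single partition" semantics can be sketched in plain Python (the real engine runs WASM in V8; the `on_record` hook name and the enrichment logic here are hypothetical illustrations of the semantics, not the actual API):

```python
from collections import defaultdict

# Per-partition state: each partition keeps its own counter, and there is
# no cross-partition state, mirroring the answer above.
state = defaultdict(int)

def on_record(partition: int, value: bytes) -> bytes:
    # Hypothetical transform hook: enrich each record with a per-partition
    # sequence number (a one-shot transformation, not a cross-stream join).
    state[partition] += 1
    return value + b"|seq=" + str(state[partition]).encode()

assert on_record(0, b"a") == b"a|seq=1"
assert on_record(0, b"b") == b"b|seq=2"
assert on_record(1, b"c") == b"c|seq=1"   # partition 1 has independent state
```

Anything requiring a stream-stream join across partitions would fall outside this model, which matches the one-shot transformation use cases (GDPR scrubbing, simple enrichment) mentioned elsewhere in the thread.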
Though the explanation went a bit over my head, just one (or two) follow-up question(s): how did you end up with WASM for the inline lambdas? Did you have some discussion of alternatives like Lua? I am curious about insights on choosing scripting engines/implementations, hence the question.
A different version of this question: were alternatives to V8 considered? I believe there are quite a few pure-WASM engine implementations out there (unless I'm missing some feature that requires V8 and rules out pure-WASM engines).
1) Multi-language support. People love Go and Rust, so we wanted some way to open it up to those communities.
2) I like Lua a lot. Such a good tool. It just didn't have WASM's breadth of compilation targets or WASM's security guarantees.
AFAIK, there are just a couple of VMs with superb performance: V8 and another one targeting x86 only. We want to support the AWS Graviton instances, so there was only one choice.
It'd be great to know where the improvements come from.
Is it from a different architectural design, is it from better exploitation of new hardware, is it from JVM limitations? For example, when Valhalla and Loom land in the JDK, how much of this improvement will stay?
I can't find the benchmarks so it's not possible to know what they measured, how they measured it, and if it's a relevant measurement.
Skimming the repository it doesn't seem to be an unreasonable claim. C++ with Seastar compared to Scala on JVM. It's in line with the Scylla/Cassandra improvements.
According to this Data Engineering interview (beginning around 8m in), the improvements begin with moving off the JVM, which has a latency impact. But the real improvement is that fsync on a file handle puts a barrier in the FS. Redpanda does some work around batch coalescing, skips the page cache to sidestep a bunch of kernel locks, and uses adaptive allocation (preallocating FS space):
Not sure how you got there though; maybe I mis-explained something. This is talking about the filesystem API, not the Kafka API. The kernel does something similar; these are standard filesystem things, just purpose-built for our use case.
AFAICS a write-behind strategy means you report a write as complete before it has actually been written. This can result in terrific performance, because your batching window is larger and you can amortize more writes over fewer IOPS. But it also means that in the face of failure, clients could believe something was written which wasn't.
On the other hand, the storage device can fail so what does it even mean to have written the data? :)
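The trade-off described above can be made concrete with a toy write-behind log that batches records and amortizes one fsync over each batch. Only flushed data is actually durable; acknowledging before `flush()` is exactly the risk being discussed. This is a sketch of the technique, not Redpanda's implementation:

```python
import os
import tempfile

class WriteBehindLog:
    """Toy write-behind log: buffer records, fsync once per full batch."""
    def __init__(self, path, batch_size=4):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.batch, self.batch_size, self.fsyncs = [], batch_size, 0
    def append(self, record: bytes):
        self.batch.append(record)
        if len(self.batch) >= self.batch_size:
            self.flush()
    def flush(self):
        if self.batch:
            os.write(self.fd, b"".join(self.batch))
            os.fsync(self.fd)   # one fsync amortized over the whole batch
            self.fsyncs += 1
            self.batch = []

path = os.path.join(tempfile.mkdtemp(), "log")
log = WriteBehindLog(path, batch_size=4)
for i in range(10):
    log.append(b"rec%d\n" % i)
log.flush()
assert log.fsyncs == 3   # 10 records in batches of 4 -> 4 + 4 + 2
```

Ten records cost three fsyncs instead of ten; the wider the batching window, the better the amortization, and the more unflushed data is at risk if the node dies mid-window.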
Our acks are similar to Kafka in that regard: acks=0,1 is leader acknowledgement. With acks=-1 we go further, as Denis mentioned above, by flushing to disk before responding, which is stronger (the log completeness guarantee from Raft) than what is available upstream.
This is fantastic news. What would you say were the hardest aspects of the design and implementation, in terms of effort and thinking that went into those accomplishments ?
I would say building a scalable Raft implementation has been by far the hardest thing, mostly because we started from scratch and bypassed the page cache with our read-ahead and write-behind strategies. Building tiers of caches and having a working Raft system on top of them was hard.
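Since the page cache is bypassed, the application has to provide its own read-ahead. A minimal sketch, assuming a simple prefetch-N-chunks policy (an illustration of the idea, not Redpanda's actual caching tiers):

```python
class ReadAhead:
    """Toy read-ahead cache: on a miss, fetch the requested chunk plus the
    next `depth` chunks, so sequential log reads hit the application-level
    cache instead of going back to storage on every read."""
    def __init__(self, storage: bytes, chunk=4, depth=2):
        self.storage, self.chunk, self.depth = storage, chunk, depth
        self.cache, self.fetches = {}, 0
    def _fetch(self, idx):
        self.fetches += 1
        start = idx * self.chunk
        return self.storage[start:start + self.chunk]
    def read(self, idx):
        if idx not in self.cache:
            # miss: prefetch this chunk and the next `depth` chunks
            for i in range(idx, idx + 1 + self.depth):
                self.cache[i] = self._fetch(i)
        return self.cache[idx]

data = bytes(range(32))
r = ReadAhead(data, chunk=4, depth=2)
r.read(0); r.read(1); r.read(2)   # a sequential scan
assert r.fetches == 3             # one miss prefetched chunks 0, 1, 2
assert r.read(1) == data[4:8]
```

The hard part in a real system is sizing this against memory pressure and making sure a cold scan doesn't evict the hot working set, which is the cache-eviction concern raised elsewhere in the thread.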
I've been waiting for a long time for someone to think outside the JVM, and I really hope this is a growing trend. The "big data" industry has seemingly been joined at the hip with Java ever since Hadoop came onto the scene, and the Apache community in particular has a lot of apps that are deeply unfriendly to non-Java apps. For example, you can't use Apache BookKeeper from a non-Java app.
Would you say Redpanda is ready for production use?
Is Redpanda compatible with Faust (Python stream processing from Robinhood)? I really don't want to use Kafka, but when I must, Faust makes it straightforward. In fact, I wonder if opinionated client libraries/modules are in your interest to develop as well; they could lower the "time to implement" story for your offering.
Indeed! Faust uses the regular Python client :) We try to work with the full ecosystem; if something doesn't work, it's a bug. So give it a shot and let me know. Feel free to jump on Slack if you want real-time help too.
One question: I assume Redpanda uses Raft to replicate the topic content, not just metadata. Is that correct? If so, how does it perform better than Kafka's ISR? Raft might be slow for this kind of workload. If I remember correctly, Liftbridge was using Raft for log replication and switched away from it because of performance problems.
To quote my colleague Denis -
The results fit the theory. Raft’s (and Redpanda’s) performance is proportional to the best of the majority of nodes, while sync replication (Kafka) works only as well as its worst-performing node.
AFAIK we can push the limits of hardware on throughput.
The gist is that I'd need more details on exactly what you mean by "Raft is slow".
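Denis's point can be illustrated with a toy latency model: a Raft quorum commits at the pace of the majority, while waiting on every replica pays for the slowest one. (Real Kafka behavior depends on `min.insync.replicas`; this models the claim as quoted, not every Kafka configuration.)

```python
def raft_commit_latency(node_latencies_ms):
    # Raft commits once a majority has the entry, so commit latency
    # tracks the majority'th-fastest node.
    n = len(node_latencies_ms)
    majority = n // 2 + 1
    return sorted(node_latencies_ms)[majority - 1]

def sync_all_latency(node_latencies_ms):
    # Waiting on the full replica set pays for the worst node.
    return max(node_latencies_ms)

# One straggler replica (a GC pause or a bad disk) at 50 ms:
lat = [2, 3, 50]
assert raft_commit_latency(lat) == 3   # majority unaffected by the straggler
assert sync_all_latency(lat) == 50     # full-set ack waits for the worst node
```

This is why quorum replication tends to have flatter tail latencies: a single slow node moves the full-set ack but not the majority ack.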
We support the same levels of acks as Kafka, except that acks=-1 gives much stronger guarantees due to the log completeness guarantee of Raft.
For 0 and 1 we short-circuit the Raft acknowledgement and return to the client, to match the expectations of acknowledgements as users have come to know them.
There should be no perf penalty vis-a-vis Kafka in any setting I can think of. If there is, it's probably a bug on our side.
I second this request, I'd be very interested in any articles that talk about PostgreSQL integration with Kafka. Last I heard on this topic was debezium [1] but I quite liked the simplicity of Bottled Water [2].
If messages are batched, would there be any performance advantage over Kafka/Pulsar/etc. from a thread-per-core architecture? Wouldn't the context-switching cost be amortized?
Message batching and locking can be thought of as orthogonal. In practice, yes, there are still advantages to thread-per-core, but perhaps not for the reasons you think. It has to do with core-local metadata materialization for a subset of the requests. There is effectively essential complexity (i.e., to replicate data, one must replicate data), but TpC is strictly an optimization for flatter tail latencies.
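A toy sketch of the core-local metadata idea: if each partition is owned by exactly one core, handling its requests touches only that core's state and needs no cross-core locks. The placement policy and names here are hypothetical illustrations, not Redpanda's actual sharding scheme:

```python
NUM_CORES = 4
# One independent metadata map per core; no shared mutable state.
core_local_state = [dict() for _ in range(NUM_CORES)]

def owner_core(partition: int) -> int:
    # Hypothetical placement policy: a partition is pinned to one core.
    return partition % NUM_CORES

def handle_produce(partition: int, offset_delta: int) -> int:
    core = owner_core(partition)
    state = core_local_state[core]   # core-local access, no lock needed
    state[partition] = state.get(partition, 0) + offset_delta
    return state[partition]

assert handle_produce(5, 10) == 10
assert handle_produce(5, 5) == 15
assert owner_core(5) == owner_core(9)   # same core owns both partitions
```

In a real TpC runtime (e.g. Seastar), requests for a partition are routed to its owner core via message passing, so the hot path never takes a mutex, which is what keeps the tail latencies flat.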
Depends on the use case. We are working with a couple of security companies doing intrusion detection, and for them, writing to disk on an anomaly plus notification seems to matter. Financial services wanting to leverage open-source tooling like Spark or TensorFlow also still care about latency.
Wouldn’t it be valid to consider Kafka/RedPanda a bottleneck and another point of failure that may delay data getting to a destination?
In some cases, the performance, efficiency, and reliability gains from caching and consolidation make sense.
But, I’ve seen enough poor architectural decisions and lack of architectural oversight result in use of various log streaming, cloud messenging, app monitoring, object DBs, etc., all discounting the request overhead in time and traffic, points of failure, and overall complexity for some false sense of scalability enough to where things that seemed cool ten years ago make me physically sick now.
What are some questions to use to help determine whether Kafka/RedPanda actually make sense to use, without having to first baseline, then implement, then compare request time, reliability, and data freshness to gauge whether it was worth it?
BTW- I think there are valid cases for using it and appreciate all of the work!
If you use transactions, we don't support it yet. We are in active development here.
Note that folks love the speed, but it's the operational simplicity of having one binary/fault domain that makes a lot of our enterprise users adopt the tech.
Last is whether you like the product direction, which is to my knowledge fundamentally different from the other engines out there. WASM in particular solves around 60% of all the streaming work we see in the wild. It is effectively good at one-shot transformations (GDPR, simple enrichment, simple connections to downstream systems like Elastic, etc.) as well as tiered storage.
I think the idea was to build something as easy as nginx: apt-get install redpanda, et voila.
I hope to continue to focus on the developer experience.