No team is capable of writing complex software without eventually introducing security bugs, and we (Zulip) are no exception. I'm disappointed that we introduced this bug, and would like to apologize to our users.
As a bit of commentary, my experience is that most SaaS vendors fix security bugs quietly. Others will do a brief internal investigation and only consider doing a public disclosure if they believe data privacy regulations require them to do so, because they discovered that the vulnerability had in fact been exploited. Usually public disclosure decisions are made by business executives with PR considerations in mind, so that shouldn't be surprising. See, for example, https://www.zdnet.com/article/slack-resets-passwords-for-1-o....
Zulip's core values include transparency, and our policy is to publicly disclose all security bugs that we discover with a formal CVE number, and this blog post is part of that practice. As noted in the blog post, we did an extensive audit and emailed every customer who our audit logs could not prove was not affected with enough detail to let them audit the situation. We did that despite our audit finding no evidence that this vulnerability was ever intentionally exploited, because we think it's the right thing to do.
Some technical detail I can add for those interested is that our postmortem plans following this incident include further investment in the internal audit log system used for the investigation (`RealmAuditLog`, if any readers feel like grepping our codebase). Improvements we made in mid-2020 were very helpful in investigating whether this bug had in fact been exploited, and we take this as a sign that we should increase investment in that system.
As a side note, I'm surprised that this specific security disclosure ended up on the Hacker News homepage, while our security release about an RCE issue affecting self-hosted Zulip servers last month did not (https://blog.zulip.com/2022/01/25/zulip-server-4-9-security-...).
The core bug there is an insecure secret generation algorithm in upstream Erlang/RabbitMQ, which seems more broadly notable to me!
Despite our spending many hours composing detailed reports and lobbying security teams to get this problem fixed for everyone, not just Zulip, `apt install rabbitmq-server` continues to make a Debian/Ubuntu system vulnerable to RCE, and this fact has apparently been public for years.
We did manage to get `rabbitmq-server` fixed in Debian testing, and it is scheduled to be included in the next Debian stable release, but I would really have thought an RCE issue like this would result in an immediate security advisory.
As a final remark, if you self-host Zulip, please subscribe to our announcements mailing list (https://groups.google.com/g/zulip-announce) and make sure your installation is up to date, so that you benefit from the hard work our security team does to keep Zulip users safe.
The blog post tries to paint the RabbitMQ cookie generation as a security issue, but the Erlang distribution cookie is NOT a security feature. Indeed, the Erlang documentation states:
> "Security" here does not mean cryptographically secure, but rather security against accidental misuse, such as preventing a node from connecting to a cluster with which it is not intended to communicate.
The purpose of the cookie is to guard against _accidental misuse_, such as a user connecting to the wrong node (prod vs testing?). The cookie is not meant to secure a system against unauthorised use. Any system relying on the cookie for security should already be considered broken. Thus I can see why there has not been such hurry to "fix" the issue.
That argument would hold more water if there were an additional security mechanism in place preventing `apt install rabbitmq-server` from exposing a system to remote code execution. It seems to me that the simplest fix for this problem is to make the cookie secure, since doing so is likely effective with minimal effort. Another option would be to have clustering disabled by default.
If you'd like to summarize the upstream issue as "RabbitMQ as well as every distributor of it that we've investigated allows remote execution of arbitrary code on your server through having no effective access control mechanism for the distribution port, which is open by default after installing the software.", that summary is defensible.
That still sounds to me like a critical security issue. One should be very disappointed in a vendor that had not patched it years after its public disclosure, especially given the context that we're apparently not the first project using RabbitMQ to be affected by these security choices by Erlang and RabbitMQ. See, for example, https://nvd.nist.gov/vuln/detail/CVE-2018-1279, back in 2018. We definitely won't be the last, either.
I hope nobody is going to argue that it's reasonable to publish server software with a "documented" RCE vulnerability, especially if you don't make any effort to highlight that detail. https://www.rabbitmq.com/access-control.html, the main page about access control in RabbitMQ, does not even mention this security concern; you have to read their clustering guide, which should be irrelevant to anyone not intending to use RabbitMQ clustering.
> RabbitMQ as well as every distributor of it that we've investigated allows remote execution of arbitrary code on your server through having no effective access control mechanism for the distribution port, which is open by default after installing the software.
This is exactly how I would word the issue. If an attacker can talk to the Erlang distribution port with nothing but the cookie preventing them, then the system is already lost. That should never happen; the distribution ports should only be open on an isolated internal network for the other cluster members.
If RabbitMQ opens these ports by default and does not document the issue clearly, then that is the failing. I don't think fixing whatever random number generator is generating the cookie will make the system secure enough to be considered fixed. At worst it may give a false sense of security when in reality the fix has been made in the wrong location.
You don’t think a random generator that looks like it outputs log₂(26²⁰) ≈ 94 bits of entropy, but is limited by the weak PRNG state space to 36 bits, and further limited by poor seeding to about 20 bits, creates a false sense of security?
It would be entirely possible to generate a cookie that gives true security. It would also be possible to generate no cookie at all and force the administrator to become aware of the issue if they want to enable clustering. It would also be possible to limit the exposure to localhost only by default.
But Erlang does none of these things. It generates a weak cookie that looks like a strong cookie, leaves it in a hidden file that the administrator may never even become aware of, and exposes a daemon that relies on it for security to the internet by default.
This is not how you build a secure system. This is not even how you build a system to get the administrator to realize that it needs to be secured. This goes against every security best practice that’s been written and some that are so obvious they shouldn’t need to be written. This is irresponsible and inexcusable in today’s environment, and hardly even excusable in the environment that Erlang was originally written for. This needs to be treated as the serious vulnerability that it is, and needs to be fixed.
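For anyone who wants to check the arithmetic in the comment above, here is a rough Python sketch. The 36-bit and 20-bit figures are taken from that comment as given, not measured; the guess rate is a made-up number purely for illustration; and `csprng_cookie` is a hypothetical helper showing what a same-shaped cookie drawn from the OS CSPRNG would look like, not anything Erlang or RabbitMQ ships. Whether a stronger cookie would actually help, given the handshake, is disputed in the replies.

```python
import math
import secrets
import string

# Apparent strength: 20 characters drawn from a 26-letter alphabet.
COOKIE_LENGTH = 20
ALPHABET = string.ascii_uppercase

apparent_bits = COOKIE_LENGTH * math.log2(len(ALPHABET))

# Effective strength as claimed in the comment above (taken as given):
# 36 bits of PRNG state, roughly 20 bits after poor seeding.
PRNG_STATE_BITS = 36
SEED_BITS = 20


def search_space(bits: int) -> int:
    """Number of candidates to try in the worst case for a given bit strength."""
    return 2 ** bits


def worst_case_seconds(bits: int, guesses_per_second: float) -> float:
    """Time to exhaust the search space at an assumed guess rate."""
    return search_space(bits) / guesses_per_second


def csprng_cookie(length: int = COOKIE_LENGTH) -> str:
    """Hypothetical helper: a same-shaped cookie drawn from the OS CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    GUESSES_PER_SECOND = 1000.0  # assumption for illustration only
    print(f"apparent strength: {apparent_bits:.1f} bits")
    for label, bits in [("PRNG state", PRNG_STATE_BITS), ("seeding", SEED_BITS)]:
        seconds = worst_case_seconds(bits, GUESSES_PER_SECOND)
        print(f"{label}: {bits} bits, {search_space(bits)} candidates, "
              f"{seconds:.0f} s at the assumed rate")
    print("example CSPRNG cookie:", csprng_cookie())
```

The point of the comparison is only that the search space is set by the smallest of the three numbers, not by how long the cookie looks.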
> You don’t think a random generator that looks like it outputs log₂(26²⁰) ≈ 94 bits of entropy, but is limited by the weak PRNG state space to 36 bits, and further limited by poor seeding to about 20 bits, creates a false sense of security?
I don't think so, as the Erlang documentation states that 1) the distribution protocol is in plaintext, 2) the cookie is not secure in the cryptographic sense, but in the sense that it will prevent accidents, and 3) the cookie handshake is not cryptographically secure.
It is not possible to generate a cookie that provides any true security as the distribution protocol is plaintext and the cookie handshake is not secure. Claiming to get any security benefit from a "stronger" cookie would only give a false idea to the user.
Erlang does not start distribution by default. This is done by the user of Erlang, or the application that runs on Erlang. It is up to them to make sure that their environment is secure and that they understand how Erlang distribution works. Namely, to not open ports in the firewall willy nilly and only allow communication to them from an isolated network or over TLS. Frankly I don't know what you would expect Erlang to do here, maybe to add even bigger warnings to the documentation?
The problem here is application developers not understanding how to configure their application to be secure by default. They are the ones opening the distribution wide to the Internet and not advising their users on how to do it safely, not Erlang.
> Zulip Server version 2.0.0 and above are vulnerable to insufficient access control with multi-use invitations. A Zulip Server deployment which hosts multiple organizations is vulnerable to an attack where an invitation created in one organization (potentially as a role with elevated permissions) can be used to join any other organization. This bypasses any restrictions on required domains on users' email addresses, may be used to gain access to organizations which are only accessible by invitation, and may be used to gain access with elevated privileges. This issue has been patched in release 4.10.
Why does it feel like the completely misleading title was intentional to try and drive traffic to / SEO this crappy copycat CVE site?
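For readers trying to picture the class of bug in the CVE text quoted above, here is a minimal Python sketch of the missing check. All names here (`MultiUseInvite`, `redeem_invite`, `InviteError`) are hypothetical and this is not Zulip's code; it only illustrates the idea that a multi-use invitation has to be tied to the organization that issued it, and that the tie has to be verified at redemption time.

```python
from dataclasses import dataclass


class InviteError(Exception):
    """Raised when an invitation cannot be used for the requested organization."""


@dataclass(frozen=True)
class MultiUseInvite:
    # Hypothetical model: the invite remembers which organization issued it
    # and which role it grants.
    token: str
    realm_id: int
    role: str


def redeem_invite(invite: MultiUseInvite, target_realm_id: int) -> str:
    """Return the role to grant, but only inside the issuing organization."""
    # The class of bug described in the CVE is what you get if this check is
    # missing: a token minted in one organization is accepted by another.
    if invite.realm_id != target_realm_id:
        raise InviteError("invitation was issued by a different organization")
    return invite.role
```

With that check in place, `redeem_invite(MultiUseInvite("abc", 1, "admin"), 1)` returns `"admin"`, while passing a different `target_realm_id` raises `InviteError`.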
Personally, I like Zulip way better than Slack because of how it does message threads: _channels_ that have conversations in _topics_ "makes sense" to me.
Pros: Server maintenance is blissfully minimal. Message threading that actually works. No server issues or downtime in the past 4 years. (Every time I see a HN "Slack is down" posting, I smile to myself and continue my work day...) It runs great for ~50 employees on an AWS t3a.large instance. Importing all the company's previous messages from Slack worked perfectly.
Cons: If you have people used to Slack, there will be resistance / a learning curve. Knowing, "when do I create a topic?" takes practice. The mobile and desktop apps are not as polished as Slack's. Not as many integrations as Slack. You have to know enough to manage and secure a server yourself.
The Zulip devs accept Github donations, btw, if anyone cares to support their work. (We do; no affiliation other than we're happy to have a private, self-hosted alternative.)
edits: punctuation and adding that instance was a ".large"
We use it at work. We first switched over to Mattermost, then to Zulip. Mattermost seemed like a clone of Slack and we were initially "satisfied", but we didn't see it improve much, nor did we notice any attempts to tackle other problems. Then they raised their prices to even higher than Slack's, which we took as a bit of a dick move seeing as we were self-hosting it + it has nowhere near the experience of Slack. Push notifications in general didn't seem to be that good.
Zulip has a nice spin on the way discussions work with their concept of "topics". It lends itself well to how we conceptualize the discussions across our development teams, without unnecessarily creating many ad-hoc temp channels (something we'd have done on Slack). We're liking it thus far. That being said I do need to check why it's using about 35 gigs of RAM on our server.
Please report this in https://zulip.com/development-community/; this certainly sounds like a memory leak. We aren't aware of any other reports of memory leaks in Zulip in the last couple years, so we'd be very happy to help track down what's happening on your server.
Clojurians-Zulip is a much better interface to catch up and read through existing information than Clojurians-Slack ever was (Clojurians is basically a bunch of Clojure(Script) people helping/getting help from each other), for what it's worth. Best would be if they both could be fully public, but for now the archives seem to do the job well at least.
Zulip is working on having a public view of streams without having an account. It's already being experimentally deployed on the Zulip devs' server. So that should make its way to other consumer servers soon!
Sort of a weird context to ask for Zulip recommendations, but sure. I find Slack to be a miserably unproductive experience that makes it impossible to keep up with a team (or teams) of any significant size. Slack is good for recreation, not productivity. If the primary purpose of your organization is to be a social group, then by all means use Slack. Meanwhile, Zulip does what I need and does it well. It scales excellently to large organizations discussing broad arrays of topics. It gets out of my way and I don't find myself thinking about it. In that way it's the perfect tool, because it lets me spend energy focusing on engaging with the conversations rather than wrestling with the interface. I guess I'd describe it as natural and unobtrusive, the sort of thing that does its job so well that you only notice that it's there when something goes wrong and you realize what you've been taking for granted.
Note that my experience is almost exclusively with the web client. My small amount of experience with the mobile app is that it's much less polished.
We used Zulip at $job[-1] and it was awfully painful. I’ve used nearly every chat client invented since the BBS days, and Zulip was one of my least favorite. We’re using Slack now at $job and I vastly prefer it (even though I still call it “IRC”).
The biggest selling points of Zulip, from the POV of the people who put it in place and truly loved it, were that it’s free/OS software, it has explicit threads as a first class citizen, and that it’s very configurable. All those things are true.
The day-to-day use was painful, though. It was riddled with UX problems, e.g., forgetting state between restarts, inconsistent hotkey support, multi-action use paths for common cases, frequent full reloads of its web view. I’d much rather use vanilla IRC than Zulip.
Slack is just enough of a shim layer on IRC, and with good usability. It has threads, and in use it’s not a problem that you can’t name them like Zulip. The integrations are nice to have (e.g., type `/zoom` to start a Zoom meeting) but not a big deal. The UX is clear and predictable and (mostly) stays out of my way.
I'm sorry to hear you had a bad experience with Zulip. If you can share details on the UX problems you had, perhaps in https://zulip.com/development-community/ where we can have an extended conversation, I'd be very appreciative. I personally review every issue reported in Zulip, and some of the specific problems you cite are unfamiliar and surprising to me. Specifically:
* I'm not aware of a plausible explanation for Zulip frequently doing full reloads of the web view. Zulip's core design is to live-update everything [1] and there's only a handful of code paths that can trigger a full reload (the most common being when the server version was updated, and in that code path, the client has an algorithm where it waits up to 30 minutes for a moment when your window is idle and not focused to reload to get the updated version, to minimize the chance of disruption).
* I'm also not sure what state might be forgotten between web client restarts; almost all of our client UI updates happen via the client asking the server to change something, and the client updating via the same server-client push mechanism that updates other clients (the main exception to this is sending messages, where local echo is important for UX reasons). And at least the server-initiated reloads are designed to preserve your precise scroll position, compose box state, etc., and that's been true since ~2013. (Though we did fix a couple of bugs in the last year where compose box state was not properly preserved.)
I know you're not using Zulip currently, so your memory may be imprecise, but any additional detail that could help us reproduce what you experienced would be awesome!
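To make the reload behavior described in the first bullet above concrete, here is a small Python sketch of the decision rule as I read it. The function name and parameters are hypothetical, this is not the actual web client code, and the assumption that the reload is forced once the 30-minute window runs out is an interpretation of "waits up to 30 minutes".

```python
MAX_WAIT_SECONDS = 30 * 60  # "up to 30 minutes", per the description above


def should_reload_now(
    seconds_since_update_noticed: float,
    window_idle: bool,
    window_focused: bool,
) -> bool:
    """Decide whether to reload after the server version has changed."""
    # Preferred case: reload at a moment when the user will not notice.
    if window_idle and not window_focused:
        return True
    # Assumed fallback: stop waiting once the 30-minute window has elapsed.
    return seconds_since_update_noticed >= MAX_WAIT_SECONDS
```

Under this reading, a focused, active window is left alone for up to half an hour after an upgrade, which is why frequent visible reloads would be unexpected.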
We use it at our small (15 person) startup. I like that topics are light weight, like git branches, but despite that we struggle to use them effectively (generally dumping thoughts in a select few topics)
One major quirk of the interface is that it can be difficult to identify, at a glance, what topic/DM you are actively on, which has led to a handful of "Oops, that was supposed to be a DM" message deletes.
We’ve struggled with that in the past as well, but it got better over time. One thing that helps is to aggressively start moving messages to new topics once things go beyond a few in a general topic.
I used it at a previous job and it was fantastic. The first-class threads are the way chat should work, the UX with the message river is brilliant, the keyboard shortcuts make everything two keystrokes away and it's extremely fast.
In comparison, I hate having to use Slack every day today. It's slow as shit, whenever I type a keystroke too fast (e.g. editing a previous message) it always gets into a weird state (usually edits the message before last and messes up the edits) and it encourages rapid-fire, thoughtless replies. Search also sucks and it's impossible to find anything. I fucking hate Slack.
I briefly used Zulip as part of a friend's community, rather than work. My experience with it was that it's great at first, but when you start to get masses of messages and users, the client becomes laggy and unresponsive. Scrolling through messages particularly sucked because of this.
I'm not sure if it was just our setup, because when I checked the official Zulip web chat, I had the same problem.
I liked it a lot, but I'm not an Apple user, and apparently the iOS app was (is?) not great. For me, it's like a vastly better version of NNTP, while slack is more of a slightly improved IRC.
We use Zulip at work with about 70 daily active users. We evaluated other popular competitors both open source and paid. Zulip came out on top because it was spot on with our use case and integrated easily with our infrastructure. Since we rolled it out (about a year ago) it has been the most stable application in our environment- it is rock solid. Thank you Zulip Team for all of your hard and excellent work.
Is this bug another manifestation of the ubiquitous “confused deputy” problem that results from conflating authorization and authentication? (Trying to relate to some recent stuff I’ve read about the importance of “object capabilities”)
Zulip has an officially supported, experimental docker image. Please note that Zulip’s normal installer has been extremely reliable for years, whereas the Docker image is new and has rough edges, so we recommend the normal installer unless you have a specific reason to prefer Docker. [0]
A server application without Helm charts? Stateful apps without a Kubernetes Operator? For someone living with k8s all day, it feels so strange to need to make a VM, update it, follow the installation steps, and then for years to come have to keep it updated, migrate to new hosts, etc, etc.
I see it as "this is the official way that the project uses to host, and they have been kind enough to share it with you".
I don't think Zulip themselves use Helm charts or Kubernetes Operators to host their instances. If you think hosting an instance without these is such a pain, why not just pay them to manage it for you? At least you got the option to self-host, which is not something you'll ever get with Slack.