Yes it's a single point of failure, but so what? I don't particularly care whether other organizations fail at the same time as I do, I just care whether I fail. Hosting my own chat system does not solve that problem. In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that. It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down. If it's urgent I can use the phone, and my todo list is stored outside Slack.
> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.
Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.
In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.
- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.
- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.
- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.
This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.
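To make the "eggs in one basket" point concrete, here is a minimal sketch of a notifier that falls back to a second channel when the first one is down. Everything in it is hypothetical: `send_to_chat` and `send_email` are stand-ins I made up for illustration, not real Slack or mail APIs, and a real setup would need proper error handling and retries.

```python
from typing import Callable, Iterable

Notifier = Callable[[str], None]


def notify_with_fallback(message: str, channels: Iterable[tuple[str, Notifier]]) -> str:
    """Try each channel in order; return the name of the first one that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all notification channels failed: " + "; ".join(errors))


# Hypothetical senders standing in for real chat and email integrations.
def send_to_chat(message: str) -> None:
    raise ConnectionError("chat service is down")  # simulate the outage


sent_emails: list[str] = []


def send_email(message: str) -> None:
    sent_emails.append(message)  # pretend delivery succeeded


used = notify_with_fallback(
    "CI build failed on main",
    [("chat", send_to_chat), ("email", send_email)],
)
print(used)
```

The code is the easy part. The fallback only helps if people actually check the second channel, which is the discipline problem described above.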
I think it's inexcusable for a chat program to go down in 2018.
* your hdd failed? Use a raid
* your power went out? Use a UPS
* your DNS went down? Use a fallback (slack2)
* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over
See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".
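The arithmetic behind that list, as a rough sketch: if each replica is up some fraction of the time and failures are independent, the chance that all of them are down at once shrinks quickly. The 0.99 figure below is purely illustrative, not a measured number for Slack or anyone else, and the independence assumption is a big one; correlated failures (a bad deploy pushed everywhere, a shared dependency) break it.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent copies is up."""
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes per year during which the service is unavailable."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for n in (1, 2, 3):
    # 0.99 is an illustrative per-replica availability, not real data
    a = combined_availability(0.99, n)
    print(f"{n} replica(s): availability {a:.6f}, "
          f"about {downtime_minutes_per_year(a):.1f} min/year down")
```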
And inb4 "chill Mike it's just a chat server not life support firmware" yeah but slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.
> slack is the most trivial software you can think of
This is like saying that food service at 30k feet in a passenger airline is trivial because all the server has to do is walk up and down a narrow aisle handing out food from a cart.
Since "you see no reason this service can't be nearly as reliable as life support firmware", one of two things must be true:
1) You know something nobody else knows. In which case great, you've stumbled on a huge opportunity to go put your knowledge to work and get stupendously rich by outcompeting this "trivial" software company. Get to it, genius!
or
2) The reason you "see no reason..." is that you're unaware of one or more relevant facts.
3) slack will get their "chat as a service" monthly fee whether the service actually works or not, so why commit to higher levels of service? We can get our users acclimated to outages and then sell them "slack Premium, for Serious Business", charge an even higher fee, and get stupendously rich all over again. This is the "growth" that investors demand, no?
The dark truth is I suspect we're moving in the opposite direction. Abstraction layers designed with that "chill, it's just a %s app" mindset are making their way into safety critical applications.
Eventually somebody is going to die because their pacemaker decided to throw cycles at mining monero.
Slack is text, channels, images, video, sound, search, audio calls, video calls, screen share (and interface share), bots, myriad integrations, and more. Calling it just "send text from one computer to another" is wrong.
If trying to provide all the other things besides text causes the system to be unstable, then maybe those things shouldn't have been added. We need text. We just want the other things.
Let me add more reasons:
1) Human error in software: a single error or exception can cascade into much larger problems that require a manual restore, with service downtime.
2) Geo-distributed datacenters are VERY expensive, so they rarely get implemented fully.
3) Bad system design, full of single points of failure.
That's bordering on (if not crossing into) ad-hominem.
There was no accusation of "so easy", only of "not so expensive" and supposedly (and previously, demonstrably) solved in the last 30 years.
These things may well be "hard" or even "expensive" for some definition of those two words, but if they weren't, that would defeat much of the (stated/advertised) purpose of outsourcing/cloud.
Are you proposing to just buy servers in 2 locations to keep Slack's services up? That doesn't work when you need to store gigabytes daily and keep a dozen thousand reqs/sec synchronized.
A geo-distributed datacenter setup requires multiple direct, low-latency, multi-gigabit/sec links, special software to manage, test, and check it, and skilled devops.
Although I agree with your premise, I think the delivery takes away from your point a bit.
Specifically, you risk people piling on that rsync isn't good enough in the modern world and referencing the comment criticizing Dropbox as being little more than an rsync replacement [1].
Of course, the specific tool one uses is irrelevant. The data synchronization problem may not be well solved, but it has been very well studied, with a remarkable number of good-enough options.
So, no, there isn't just one "sync" button, as the parent comment snarkily suggested, but there may be two, one where you might lose the last N seconds of chat (perhaps temporarily) and another where you lose the ability to chat entirely for those N seconds.
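A toy sketch of those two "buttons", in case it helps. This is not how Slack or any real database does replication; `ReplicatedLog` and everything in it is made up for illustration, and the "N seconds" are modeled as a flag that says the replica is unreachable.

```python
class ReplicatedLog:
    """Toy primary/replica pair with synchronous or asynchronous copying."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.replica: list[str] = []
        self.replica_reachable = True

    def write(self, message: str) -> bool:
        """Return True if the write was accepted."""
        if self.synchronous:
            if not self.replica_reachable:
                return False  # refuse the write rather than risk losing it
            self.primary.append(message)
            self.replica.append(message)
            return True
        # Asynchronous: accept immediately, copy whenever the replica is reachable.
        self.primary.append(message)
        if self.replica_reachable:
            self.replica = list(self.primary)
        return True

    def fail_over(self) -> list[str]:
        """The primary is lost; whatever the replica holds is what survives."""
        return list(self.replica)


def simulate(synchronous: bool) -> tuple[list[str], list[str]]:
    """Return (messages accepted, messages surviving a failover)."""
    log = ReplicatedLog(synchronous)
    accepted = []
    if log.write("hello"):
        accepted.append("hello")
    log.replica_reachable = False  # the link between sites drops for N seconds
    if log.write("anyone there?"):
        accepted.append("anyone there?")
    return accepted, log.fail_over()


print("async:", simulate(synchronous=False))
print("sync: ", simulate(synchronous=True))
```

In the asynchronous run both messages are accepted but only the first survives the failover. In the synchronous run the second message is refused outright, so nothing that was accepted is lost, but nobody could chat while the link was down.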
[1] Although it had other criticisms, such as monetization, which are, naturally, ignored.
They very likely have all of these protections in place, and more. Large-scale outages of mature systems are almost always a cascade of small human errors that, each on their own, would have caused negligible damage. It's only when they happen to align with each other that a large disaster is realized.