Yes it's a single point of failure, but so what? I don't particularly care wheth...

philwelch · on June 27, 2018

> In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that.

Although with outages like these, I doubt it!

vasilipupkin · on June 27, 2018

If the software is architected so poorly that it can literally go down simultaneously for all clients, then why would I trust that it's secure?

drb91 · on June 27, 2018

> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.

Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.

ljm · on June 27, 2018

In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.

- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.

- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.

- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.

This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.

nettdata · on June 27, 2018

If your team can't check on that stuff manually for a few hours while Slack is down, then I think you may have bigger problems.

If anyone on my team came to me and cited Slack being down as a reason for their inability to do their job, then they wouldn't be on my team.

Is it less than ideal? Yes. Is it a little bit less efficient to pull info instead of having it pushed to you? Yes.

Is the sky falling? No.

mikec3010 · on June 27, 2018

I think it's inexcusable for a chat program to go down in 2018.

* Your HDD failed? Use RAID

* Your power went out? Use a UPS

* Your DNS went down? Use a fallback (slack2)

* Your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over

See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".

And inb4 "chill Mike it's just a chat server not life support firmware" yeah but slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.

jaredhansen · on June 27, 2018

>slack is the most trivial software you can think of

This is like saying that food service at 30k feet in a passenger airline is trivial because all the server has to do is walk up and down a narrow aisle handing out food from a cart.

Since "you see no reason this service can't be nearly as reliable as life support firmware", one of two things must be true:

1) You know something nobody else knows. In which case great, you've stumbled on a huge opportunity to go put your knowledge to work and get stupendously rich by outcompeting this "trivial" software company. Get to it, genius!

or

2) The reason you "see no reason..." is that you're unaware of one or more relevant facts.

Which of these do you think is more probable?

mikec3010 · on June 27, 2018

3) slack will get their "chat as a service" monthly fee whether the service actually works or not, so why commit to higher levels of service? We can get our users acclimated to outages and then sell them "slack Premium, for Serious Business", charge an even higher fee, and get stupendously rich all over again. This is the "growth" that investors demand, no?

subway · on June 27, 2018

The dark truth is I suspect we're moving in the opposite direction. Abstraction layers designed with that "chill, it's just a %s app" mindset are making their way into safety critical applications.

Eventually somebody is going to die because their pacemaker decided to throw cycles at mining monero.

dota_fanatic · on June 27, 2018

Slack is text, channels, images, video, sound, search, audio calls, video calls, screen share (and interface share), bots, myriad integrations, and more. Calling it just "send text from one computer to another" is wrong.

apitman · on June 27, 2018

I think maybe their point is that even if other pieces break, why shouldn't it be possible for the text communication to keep working?

dustinmoorenet · on June 27, 2018

If trying to provide all the other things besides text causes the system to be unstable, then maybe those things shouldn't have been added. We need text. We just want the other things.

glintik · on June 27, 2018

Let me add more reasons:

1) Human error in software: some software error/exception triggers much larger issues that require a manual restore, with service downtime.

2) Geo-distributed datacenters are a VERY expensive thing, so they are not fully implemented.

3) Bad system design, full of "single points of failure".

mikec3010 · on June 27, 2018

> 2) Geo-distributed datacenters are a VERY expensive thing, so they are not fully implemented.

You buy servers on aws-us-west and aws-us-east, and sync them. How is that very expensive?

EpicEng · on June 27, 2018

I imagine you've never actually had to solve any of these hard problems, which is why you think it's so easy to do.

mmt · on June 27, 2018

That's bordering on (if not crossing into) ad-hominem.

There was no accusation of "so easy", only of being not that expensive and supposedly (and previously, demonstrably) solved in the last 30 years.

They may well be "hard" or even "expensive" for some definition of those two words, but if it weren't, it would defeat much of the (stated/advertised) purpose of outsourcing/cloud.

glintik · on June 27, 2018

You propose just buying servers in 2 locations to keep Slack's services up? That doesn't work when you need to store gigabytes daily and keep a dozen thousand reqs/sec synchronized.

A geo-distributed datacenter requires multiple direct, low-latency, multigigabit/sec links, special software to manage, test, and check it, and skilled devops.

always_good · on June 27, 2018

[flagged]

mikec3010 · on June 27, 2018

I know. There's totally not a command called rsync. And "replication" is just a word you hear on Star Trek along with teleportation.

mmt · on June 27, 2018

Although I agree with your premise, I think the delivery takes away from your point a bit.

Specifically, you risk people piling on that rsync isn't good enough in the modern world and referencing the comment criticizing Dropbox as being little more than an rsync replacement [1].

Of course, the specific tool one uses is irrelevant. The data synchronization problem may not be well solved, but it has been very well studied, with a remarkable number of good-enough options.

So, no, there isn't just one "sync" button, as the parent comment snarkily suggested, but there may be two, one where you might lose the last N seconds of chat (perhaps temporarily) and another where you lose the ability to chat entirely for those N seconds.

[1] Although it had other criticisms, such as monetization, which are, naturally, ignored.
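To make the "two buttons" concrete, here is a minimal sketch of that trade-off. It is a toy model and not how Slack (or any real system) is built: the `Replica` class, the `send` and `simulate` functions, the five-message run, and the assumption that the primary is lost at the end are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the chat log in a hypothetical region."""
    log: list = field(default_factory=list)


def send(message, primary, secondary, link_up, mode):
    """Try to accept one message; return True if the sender sees success.

    mode="async": acknowledge after the primary write, replicate when possible.
    mode="sync":  acknowledge only when both replicas have the message.
    """
    if mode == "sync":
        if not link_up:
            return False  # chat is unavailable while the link is down
        primary.log.append(message)
        secondary.log.append(message)
        return True
    # Async mode: the primary always accepts the write.
    primary.log.append(message)
    if link_up:
        # Ship everything the secondary has not seen yet (catch-up).
        secondary.log.extend(primary.log[len(secondary.log):])
    return True


def simulate(mode, outage_ticks):
    """Send one message per tick for 5 ticks, then assume the primary is lost.

    Returns (acknowledged messages missing after failover, rejected sends).
    """
    primary, secondary = Replica(), Replica()
    accepted = []
    for tick in range(5):
        link_up = tick not in outage_ticks
        msg = f"msg{tick}"
        if send(msg, primary, secondary, link_up, mode):
            accepted.append(msg)
    # Failover: only the secondary's log survives.
    lost = [m for m in accepted if m not in secondary.log]
    rejected = 5 - len(accepted)
    return lost, rejected
```

With the link down for the last two ticks, `simulate("async", {3, 4})` keeps chat working but the surviving replica is missing the last two messages, while `simulate("sync", {3, 4})` loses nothing and instead rejects those two sends. If the link comes back before the failover, the async variant catches up and nothing is lost.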

glintik · on June 27, 2018

Oh, yes, someone at Slack clicked the “Pause” button and we are all waiting for Slack’s hero to click “Resume” :)

rbranson · on June 27, 2018

They very likely have all of these protections in place, and more. Large-scale outages of mature systems are almost always a cascade of small human errors that, each on their own, would have caused negligible damage. It's only when they happen to align with each other that a large disaster is realized.
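A rough back-of-the-envelope version of that argument, as a sketch only: the failure probability, the number of safeguards, and the number of operations below are invented for illustration, and the model assumes the safeguards fail independently, which real systems often violate.

```python
def p_all_fail(p_each, safeguards):
    """Probability that every one of `safeguards` independent layers fails at once."""
    return p_each ** safeguards


def p_at_least_once(p_event, trials):
    """Probability that an event with per-trial probability `p_event`
    happens at least once in `trials` independent trials."""
    return 1 - (1 - p_event) ** trials


# Hypothetical numbers: 3 safeguards, each missing 1 error in 100.
aligned = p_all_fail(0.01, 3)

# Hypothetical scale: one million risky operations (deploys, config changes, ...).
over_time = p_at_least_once(aligned, 1_000_000)

print(f"all three layers fail on one operation: {aligned:.6%}")
print(f"at least one alignment in 1,000,000 operations: {over_time:.1%}")
```

Under these made-up numbers a single operation almost never slips through all three layers, yet over a million operations an alignment becomes more likely than not. That is consistent with the point that a mature system can have every protection in place and still see the occasional large outage.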