I'm becoming concerned with the rate at which major software systems seem to be failing as of late. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have even logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage, just something that's been on my mind enough to warrant writing about it.
It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assume that these things are present and working in most places, which wasn't always the case. My response was to document the failure conditions better, and once I did, I realized there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.
For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the GitHub outage affecting basically everyone who uses it as their git host.
It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.
> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.
Well, fly-by-night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Look at a big bank or a big corporation's accounting systems: they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.
> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.
> Take the number of vehicles in the field, A, multiply it by the probable rate of failure, B, then multiply it by the result of the average out of court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one.
> Look at a big bank or a big corporation's accounting systems
Not my experience. Every bank I've used, in multiple countries, has had multiple significant outages, including some where their cards failed to function. Do a search for "U.S. Bank outage" to see how many outages have happened so far this year.
Modern internet company backends are very complex, even on a good day they're at the outer limits of their designers' and operators' understanding, & every day they're growing and changing (because of all the money and effort that's being spent on them!). It's often a short leap to a state that nobody thought of as a possibility or fully grasped the consequences of. It's not clear that it would be practical with any amount of money to test or rule out every such state in advance. Some exciting techniques are being developed in that area (Antithesis, formal verification, etc) but that stuff isn't standard of care for a working SWE yet. Unit tests and design reviews only get you so far.
I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.
They do have multiple layers of redundancy, and thus the big budgets, but the redundant systems won't be kept hot, or there will be critical flaws that all of the engineers know about but haven't been given permission/funding to fix; and they're so badly managed by the firm that they dgaf either and secretly want the thing to burn.
There will be sustained periods of downtime if their primary system blips.
They will all still be dependent on some hyper-critical system that nobody really understands; the last change was introduced in 1988, and it (probably) requires a terminal emulator to operate.
I've worked on software used by these companies and have been called in to help with support from time to time. One customer, a top-single-digit public company by market cap (they may have been #1 at the time, a few years ago), had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.
They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it which would easily cost that much again.
Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so it all paid for itself many times over during the course of that bug. Cold standby would not have cut it, especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:
Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.
> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.
There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.
Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.
I'm not sure it's only money. People could have a lot of simpler, cheaper software by relying on core (OS) features instead of rolling their own or relying on bloated third parties, but a lot don't, due to cargo culting.
And tech hype. The infrastructure to mitigate this isn't expensive; in many cases quite the opposite. The expensive thing is that you made yourself dependent on these services. Sometimes this is inevitable, but hosting on GitHub is a choice.
…can I make the case that this might be reasonable? If you’re not running a hospital†, how much is too much to avoid a few hours of downtime around once a year?
† Hopefully there aren’t any hospitals that depend on GitHub being continuously available?
This is true. But unfortunately the exact same process is used even for critical stuff (the CrowdStrike thing, for example). Maybe there needs to be a separate SWE process for those things as well, just like there is for aviation. This means not using the same dev tooling, which is a lot of effort.
To agree with the other comments, it does seem likely that it's money, and that this has begun to result in slowly "un-learning how to build systems that stay up even when the inevitable bug or failure shows up."
I don't know anything about GitHub's codebase, but as a user, their software has many obvious deficiencies. The most glaring is performance. Oh my God, GitHub performs like absolute shit on large repos and big diffs.
Performance issues always scare me. A lot of the time it's indicative of fragile systems. Like with a lot of banking software - the performance is often bad because the software relies on 10 APIs to perform simple tasks.
I doubt this is the case with GitHub, but it still makes you wonder about their code and processes. Especially when it's been a problem for many years, with virtually no improvement.
Yep, this sums it up perfectly for me. I tend to stay away from the extra stuff since the quality is hit or miss (more often hit than miss to be fair), but really there’s something special about having something like it available. I think as a freely available package Nextcloud is immensely valuable to me. I never say anything bad about it without mentioning that in the same breath nowadays.
Nextcloud is something I have a somewhat love-hate relationship with. On one hand, I've used Nextcloud for ~7 years to back up and provide access to all of my family's photos. We can look at our family pictures and memories from any computer, and it's all private and runs mostly without any headaches.
On the other hand, Nextcloud is so far from being something like Google Docs, and I would never recommend it as a general replacement to someone who can't tolerate "jank", for lack of a better word. There are so many small papercuts you'll notice when using it as a power user. Right off the top of my head, uploading large files is finicky, and no amount of web server config tinkering gets it to always work; thumbnail loading is always spotty, and it's significantly slower than it needs to be (I'm talking orders of magnitude).
With all that said, I'm so grateful for Nextcloud since I don't have a replacement, and I would prefer not having all our baby and vacation pictures feeding some big corporation's AI. We really ought to have a safe, private place to store files in 2025 that the average person can wrap their head around. I only wish my family took better advantage of it, since I'm essentially providing them with unlimited storage.
That sounds really promising, maybe my family would be better suited to something like that.
I will say though, Nextcloud is almost painless when it comes to management. I’ve had one or two issues in the past, but their “all in one” docker setup is pretty solid, I think. It’s what I’ve been using for the last year or so.
I think the "local maximum" we've gotten stuck at for application hosting is having a docker container as the canonical environment/deliverable, and injecting secrets when needed. That makes it easy to run and test locally, but still provides most of the benefits I think (infrastructure-as-code setups, reproducibility, etc). Serverless goes a little too far for most applications (in my opinion), but I have to admit some apps work really well under that model. There's a nearly endless number of simple/trivial utilities which wouldn't really gain anything from having their own infrastructure and would work just fine in a shared or on-demand hosting environment, and a massively scaled stateless service would thrive under a serverless environment much more than it would on a traditional server.
That's not to say that I think serverless is somehow only for simple or trivial use cases though, only that there's an impedance mismatch between the "classic web app" model, and what these platforms provide.
You are ready for misterio: https://github.com/daitangio/misterio
A tiny layer around a stateless Docker cluster.
I created it for my homelab and it went wild.
Docker is much like microservices. Appropriate for a subset of apps and yet touted as being 'the norm' when it shouldn't be.
There are drawbacks to using docker, such as security patching and operational overhead. And if you're blindly putting it into every project, how are you mitigating the risks it introduces?
Worse, the big reason it was useful, managing dependency hell, has largely been solved by making developers default to not installing dependencies globally.
We don't really need Docker anywhere near like we used to, and yet it persists as the default, unassailable.
Of course hosting companies must LOVE it, docker containers must increase their margins by 10% at least!
Someone else down thread has mentioned a tooling fetish, I feel Docker is part of that fetish.
It has downsides and risks involved, for sure. I think the security part is perhaps a bit overblown, though. In any environment, the developers either care about staying on top of security or they don't. In my experience, a dev team that skips proper security diligence when using Docker likely wouldn't handle it well outside of Docker either. The number of boxes out there running some old version of Debian that hasn't been patched in the last decade is probably higher than any of us would like.
Although I'm sure many people just do it because they believe (falsely) that it's a silver bullet, I definitely wouldn't call it part of a "tooling fetish". I think it's a reasonable choice much more often than the microservice architecture is.
Hard disagree. I've used Docker predominantly in monoliths, and it has served me well. Before that I used VMs (via Vagrant). Docker certainly makes microservices more tenable because of the lower overhead, but the core tenets of reproducibility and isolation are useful regardless of architecture.
There's some truth to this too honestly. At $JOB we prototyped one of our projects in Rust to evaluate the language for use, and only started using Docker once we chose to move to .NET, since the Rust deployment story was so seamless.
Haven't deployed production Java in years, so I won't speak to it. However, even with Go's static binaries, I'd like to leverage the same build and deploy process as other stacks. With Docker a Go service is no different than a Python service. With Docker, I use the same build tool, instrument health checks similarly, etc.
Standardization is major. Every major cloud has one (and often several) container orchestration services, so standardization naturally leads to portability. No lock-in. From my local to the cloud.
Even when running things in their own box, I likely want to isolate things from one another.
For example, different Python apps using different Python versions. venvs are nice but incomplete; you may end up using libraries with system dependencies.
I deeply disagree. Docker’s key innovation is not its isolation; it’s the packaging. There is no other language-agnostic way to say “here’s code, run it on the internet”. Solutions prior to Docker (eg buildpacks) were not so much language agnostic as they were language aware.
Even if you allow yourself the disadvantage that any non-Docker solution won’t be language-agnostic: how do you get the code bundle to your server? Zip & SFTP? How do you start it? ./start.sh? How do you restart under failure? Systemd? Congrats, you reinvented docker but worse. Want to upgrade a dependency due to a security vulnerability? Do you want to SSH into N replicated VMs and run your Linux distribution specific package update command, or press the little refresh icon in your CI to rebuild a new image then be done?
Docker is the one good thing the ops industry has invented in the last 15 years.
This is a really nice insight. I think years of linux have kind of numbed me to this. I've spent so much time on systems which use systemd now that going back to an Alpine Linux box always takes me a second to adjust, even though I know more or less how to do everything on there. I think docker's done a lot to help with that though since the interface is the same everywhere. A typical setup for me now is to have the web server running on the host and everything else behind docker, since that gives me the benefit of using the OS's configuration and security updates for everything exposed to the outside world (firewalls, etc).
Another thing about packaging: I've started noticing myself subconsciously adding even a trivial Dockerfile to most of my projects now, just in case I want to run them later without having to hassle with installing anything. That way I have a "known working" copy which I can more or less rely on to run if I need to. It took a while for me to get to that point, though.
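Even something this small does the job for me (a minimal sketch, assuming a Rust project; the image tag and binary name are just placeholders):

    # A minimal sketch, not a production image; "myapp" is a placeholder binary name.
    FROM rust:1
    WORKDIR /app
    COPY . .
    # Build at image-build time so docker run just starts the binary.
    RUN cargo build --release
    CMD ["/app/target/release/myapp"]

Nothing clever, but it records "here's how this thing builds and runs" in a file that travels with the repo.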
It's all the same stuff. Docker just wraps what you'd do in a VM.
For the slight advantage of deploying every server with a single line, you've still got to write the multi-line build script, just for docker instead. Plus all the downsides of docker.
There's another idea too, that docker is essentially a userspace service manager. It makes things like sandboxing, logging, restarting, etc the same everywhere, which makes having that multi-line build script more valuable.
In a sense it's just the "worse is better" solution[0], where instead of applying the good practices (sandboxing, isolation, good packaging conventions, etc.) that lead to those benefits, you just wrap everything in a VM/service manager/packaging format which gives them to you anyway. I don't think it's inherently good or bad, although I understand why it leaves a bad taste in people's mouths.
Docker images are self-running. Infrastructure systems do not have to be told how to run a Docker image; they can just run them. Scripts, on the other hand, are not; at the most simple level because you'd have to inform your infrastructure system what the name of the script is, but more comprehensively and typically because there's often dependencies the run script implies of its environment, but does not (and, frankly, cannot) express. Docker solves this.
> Docker just wraps what you'd do in a VM.
Docker is not a VM.
> Plus all the downsides of docker.
Of which you've managed to elucidate zero, so thanks for that.
EDIT: I’m leaving the comment up so the replies make sense, but I completely missed the point here. That’s what I get for writing dismissive hacker news comments on my lunch break!
I find it kind of hard to take this seriously since the JS snippet has a glaringly obvious syntax error and two glaringly obvious bugs which demonstrate that the author didn’t really think too hard about the point they’re trying to make.
I understand the point they’re trying to make, that being that rust forces you to explicitly deal with the complexity of the problem rather than implicitly. It’s just that they conveniently ignore that the JavaScript version requires the programmer to understand things like how async await works, iterators (which they use incorrectly), string interpolation, etc. Just using typescript type annotations alone already gives the js version nearly all the explicitness of rust.
> I understand the point they’re trying to make, that being that rust forces you to explicitly deal with the complexity of the problem rather than implicitly
I read it again and understand what you mean. I apologize for commenting like that so quickly, I was on my phone and typed that comment out before I really had time to digest the contents.
No kidding. I work with an API at my job that does the following:
1. Instead of omitting optional values or at least setting them to null, optional values are set to an empty string when not present (this includes string fields where an empty string would be valid).
2. Lists are represented in such a way that an empty string means an empty list, a list with one element is simply the element, and only lists with more than one element actually use JSON arrays. For example, to represent a list of points ({ x: number, y: number }):
    // Empty array
    "points": ""

    // Single element
    "points": { "x": 7, "y": 28 }

    // 2 or more elements
    "points": [{ "x": 7, "y": 28 }, { "x": 12, "y": 4 }]
3. Everything is stored in a string. The number 5? “5”. A Boolean? “true” or “false”.
What’s funny is I was able to use serde to create wrapper types for each of these so these atrocities were abstracted away. For example, those arrays I described are simply Vec<T> in the actual types with an annotation using #[serde_as(as = “CursedArray<T>”)][0]. Likewise something similar for the string encoded types as well.
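Roughly the shape of that trick, as a from-scratch sketch using plain serde's deserialize_with rather than my actual serde_as wrapper (the names here are made up):

    use serde::{Deserialize, Deserializer};

    // The three shapes the API can send for a "list": "" | single object | array.
    // Untagged means serde tries each variant in order until one fits.
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum OneOrManyOrEmpty<T> {
        Many(Vec<T>),
        One(T),
        // The "empty list" case arrives as an empty string.
        Empty(String),
    }

    // Collapse whichever shape arrived into a plain Vec<T>.
    fn cursed_list<'de, D, T>(deserializer: D) -> Result<Vec<T>, D::Error>
    where
        D: Deserializer<'de>,
        T: Deserialize<'de>,
    {
        Ok(match OneOrManyOrEmpty::<T>::deserialize(deserializer)? {
            OneOrManyOrEmpty::Many(v) => v,
            OneOrManyOrEmpty::One(x) => vec![x],
            OneOrManyOrEmpty::Empty(_) => Vec::new(),
        })
    }

    #[derive(Deserialize)]
    struct Point {
        x: f64,
        y: f64,
    }

    // The rest of the code only ever sees a normal Vec<Point>.
    #[derive(Deserialize)]
    struct Payload {
        #[serde(deserialize_with = "cursed_list")]
        points: Vec<Point>,
    }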
No, he did say he doesn’t want Rust in Linux at all. Now I understand that he didn’t say “I won’t allow Rust to be in Linux”, which is a useful distinction to make. But let’s not pretend he never said “Rust shouldn’t be in Linux”.
It’s important to note that it’s not a matter of effort for Firefox. They’ve decided that it’s not something they want to implement[1]. The reasoning is that they think it allows low-enough-level access to potentially mess with devices that weren’t made to be resilient to malicious input, and they didn’t like that the proposed method of allowing Web Bluetooth is based on a default-allow policy with a blocklist, which means that as new Bluetooth device vulnerabilities are discovered, the blocklist has to be maintained.
Wouldn't the correct time to raise those concerns be during the Web Bluetooth design process? The idea that a browser decides "nah" about a web standard because they're mad about it seems like the road to ruin.
Then again, almost every time a Firefox thread appears here it gets filled with comments pointing out how low its adoption is, so I guess "well, yeah" sums it up (he said, commenting from Firefox).
That seems like a good litmus test question since it has a single correct answer, but a lot of potentially incorrect ones which sound like they might be right (North America has a lot of Hamiltons after all).
They just mean that service managers like systemd and OpenRC prefer to handle daemonizing themselves, and thus would prefer that your program stay in the foreground and let them put it in the background.
From OpenRC’s docs[0]:
Daemons must not fork. Any daemon that you would like to have monitored by supervise-daemon must not fork. Instead, it must stay in the foreground. If the daemon forks, the supervisor will be unable to monitor it.
If the daemon can be configured to not fork, this should be done in the daemon's configuration file, or by adding a command line option that instructs it not to fork to the command_args_foreground variable shown below.
The “undesired” or “controversial” part is whether programs should do it themselves or not.
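To make that concrete, a minimal OpenRC runscript along those lines might look something like this (the daemon name, path, and flags are made up; only the supervise-daemon and command_args_foreground mechanics come from the docs above):

    #!/sbin/openrc-run
    # Hypothetical /etc/init.d/mydaemon; supervise-daemon monitors the process,
    # so the daemon itself must stay in the foreground instead of forking.
    supervisor=supervise-daemon
    command="/usr/local/bin/mydaemon"
    command_args="--config /etc/mydaemon.conf"
    # Whatever flag tells this particular daemon not to fork:
    command_args_foreground="--foreground"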
daemontools was the first supervisor service I used that required that programs not background themselves; it even included a tool for preventing "legacy" daemons from doing so.
It made a lot of sense to me at the time and honestly felt easier. Going back to init.d or upstart just felt like a step backward and so much more complicated than it needed to be. Then systemd comes along and has the same expectation, and things make sense again; writing startup scripts became almost as easy as it was with daemontools.
inetd, the "super server" from 4.3BSD, supported this in addition to handling the listening socket. For reasons I don't fully understand, inetd fell out of favor despite having been installed and running by default on pretty much every *BSD and Linux server for decades.
One is that HTTP took over the role of a lot of simple servers, so something like Apache with CGI-BIN was used in place of inetd.
Second, with the rise of interpreted languages (e.g. Perl et al.), forking became more expensive. With binary programs, forking tends to be cheap (in a multi-use case) since the executable pages are shared across processes. While the interpreter runtime itself is shared (being compiled), the actual script is not (having to be loaded for each instance).
The HTTP servers managed that better (through modules, FastCGI, etc.), so that space didn't really advance under the guise of inetd.
Make no mistake, an inetd service is "fast enough" for a wide array of use cases today, both compiled and interpreted, simply because the hardware is much better. But, still, when folks think "ad hoc" service today, they're likely turning to HTTP anyway.
I recall that inetd started a new instance of the daemon for every incoming connection, and this caused lots of processes when lots of connections happened.
I don’t recall whether you could tell inetd not to do that.
inetd could pass the listening socket to the process. That was the `wait|nowait` field in /etc/inetd.conf. The typical config for TCP used with services like finger was `nowait`, which meant inetd would listen on a socket and spawn a new process for every incoming connection, without waiting for a previously spawned process to exit. But in `wait` mode it would spawn the process when it detected a connection, pass the listening socket (not connected socket) as fd 0, then wait for the server to exit before polling the listening socket again.
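For anyone who hasn't seen the format, a classic nowait entry looked roughly like this (a sketch; paths and user vary by system):

    # service  socket  proto  wait/nowait  user    server program        args
    finger     stream  tcp    nowait       nobody  /usr/libexec/fingerd  fingerd

Changing that nowait field to wait is what switches to the other mode, where the listening socket itself is handed to the spawned server as fd 0.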
inetd was (remains?) a perfectly useful solution in this space. It just maybe needs some love to add some convenience features. Off the top of my head: 1) ability to split /etc/inetd.conf into, e.g., /etc/inetd.conf.d; 2) ability to trigger a restart of a specific service, rather than restarting the entirety of inetd.