
I think ssh-ing into production is a sign of not fully mature devops practices.

We are still stuck there, but we're striving to get to the place where we can turn off sshd on Prod, rely on the CI/CD pipeline to blow away and reprovision instances, and be 100% confident we can test and troubleshoot in dev and stage, and by looking at off-instance logs from Prod.

How important it is to get there is something I keep questioning my own motivations about - it's clearly not worthwhile if your project is one or two prod servers, perhaps running something like HA WordPress, but it's obvious that at Netflix-type scale nobody is ssh-ing into individual instances to troubleshoot. We are a long way (a long long long long way) from Netflix scale, and are unlikely to ever get there. But somewhere between dozens and hundreds of instances is about where I reckon the work required to get close to there starts paying off.



> at Netflix-type scale nobody is ssh-ing into individual instances to troubleshoot

Have you worked at Netflix?

I haven't, but I have worked with large scale operations, and I wouldn't hesitate to say that the ability to ssh (or other ways to run commands remotely, which are all either built on ssh or likely not as secure and well tested) is absolutely crucial to running at scale.

The more complex and heterogeneous your environments, the more likely you are to encounter strange flukes: handshakes that only fail a fraction of a percent of the time, interactions between multiple products and providers, and so on. Tools like tcpdump and eBPF become essential.

Why would you want to deploy on a mature operating system such as Linux and not use tools such as eBPF? I know the modern way is just to yolo it and restart stuff that crashes, and as a startup or at small scale you have other things to worry about. But when you are at scale you really want to understand your performance profile and iron out all the kinks.
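
For what it's worth, before (or alongside) reaching for tcpdump or eBPF on the host, you can often confirm and quantify a flaky handshake from the outside. A rough Python sketch - the host, port, sample count and the 1s threshold are made-up examples:

    # Sample TLS handshakes against one upstream and count failures/stalls.
    import socket, ssl, time

    HOST, PORT, SAMPLES = "upstream.internal.example", 443, 10000
    ctx = ssl.create_default_context()

    failures = slow = 0
    for _ in range(SAMPLES):
        start = time.monotonic()
        try:
            with socket.create_connection((HOST, PORT), timeout=2) as raw:
                with ctx.wrap_socket(raw, server_hostname=HOST):
                    pass  # the handshake happens during wrap_socket()
        except OSError:  # ssl.SSLError is a subclass of OSError
            failures += 1
        if time.monotonic() - start > 1.0:
            slow += 1

    print(f"{failures}/{SAMPLES} handshakes failed, {slow}/{SAMPLES} took >1s")

Once you know it reproduces at, say, a fraction of a percent, the packet capture or kernel tracing has something concrete to chase.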


You can also use stuff like Datadog NPM/APM, which uses eBPF to pick up most of what you need. It's been a long time since I've needed anything else.


Yes, there are numerous ways other than ssh to run remote commands, all of them less secure. (Running commands via your monitoring system can even be a very handy back door in a pinch.)

The argument here was that remote commands are less useful at scale, not that ssh is a particularly bad way of implementing them. Which doesn't make sense. You tend to have more complex system interactions at scale, not less.


Right. The answer is having systems that are resilient to failure and, if they do fail, being able to quickly replace any node, hopefully automatically, along with solid observability to give you insight into what failed and how to fix it. The process of logging into a machine to troubleshoot it in real time while the system is on fire is so antiquated, not to mention stressful. On-call shouldn't really be a major part of our industry. Systems should be self-healing, and troubleshooting done during working hours.

Achieving this is difficult, but we have the tools to do it. The hurdles are often organizational rather than technical.
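
The shape of it is simple enough to sketch, even though real setups lean on an autoscaling group or an orchestrator's health checks rather than a hand-rolled loop. A toy version - the node addresses, the /healthz convention and replace_instance() are all hypothetical:

    # Detect-and-replace loop: probe each node, replace anything unhealthy,
    # and leave the "why" to off-box logs/metrics during working hours.
    import time
    import urllib.request

    NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

    def healthy(node):
        try:
            with urllib.request.urlopen(f"http://{node}:8080/healthz", timeout=2) as r:
                return r.status == 200
        except OSError:
            return False

    def replace_instance(node):
        # Placeholder: terminate the node and let provisioning bring up a
        # fresh one from the same image.
        print(f"replacing {node}")

    while True:
        for node in NODES:
            if not healthy(node):
                replace_instance(node)
        time.sleep(30)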


> The hurdles are often organizational rather than technical.

Yeah. And in my opinion "organizational" reasons can (and should) include "we are just not at the scale where achieving that makes sense".

If you have single-digit numbers of machines, the whole solid observability / automated node replacement / self-healing setup overhead is unlikely to pay off. Especially if the SLAs don't require 2am weekend hair-on-fire platform recovery. For a _lot_ of things, you can almost completely avoid on-call incidents with straightforward redundant (over-provisioned) HA architectures, no single points of failure, and sensible office-hours-only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).
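
The office-hours rule, at least, is easy to enforce mechanically rather than by convention. A toy gate a CI pipeline could run before the deploy step - the cutoff times are arbitrary examples:

    # Refuse to deploy on weekends, Friday afternoons, or outside office hours.
    from datetime import datetime

    def deploys_allowed(now=None):
        now = now or datetime.now()
        if now.weekday() >= 5:                      # Saturday/Sunday
            return False
        if now.weekday() == 4 and now.hour >= 12:   # Friday afternoon
            return False
        return 9 <= now.hour < 17                   # office hours only

    if not deploys_allowed():
        raise SystemExit("Deploy blocked: outside the office-hours window")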

Scrappy startups, and web/mobile platforms where a few hours of downtime is not going to be an existential threat to the money flow or a big story in the tech press, probably have more important things to be doing than setting up log aggregation and request tracing. Work towards that, sure, but probably prioritise the dev productivity parts first. Get your CI/CD pipeline rock solid. Get some decent monitoring of the redundant components of your HA setup (as well as the Prod load balancer monitoring) so you know when you're degraded but not down (giving you some breathing space to troubleshoot).

And aspire to fully resilient systems, and have a plan for what they might look like in the future, to avoid painting yourself into a corner that makes it harder than necessary to get there one day.

But if you've got a guy spending 6 months setting up chaos monkey and chaos doctor for your WordPress site that's only getting a few thousand visits a day, you're definitely doing it wrong. Five nines are expensive. If your users are gonna be "happy enough" with three nines or even two nines, you've probably got way better things to do with that budget.


> For a _lot_ of things, you can almost completely avoid on-call incidents with straightforward redundant (over-provisioned) HA architectures, no single points of failure, and sensible office-hours-only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

For a lot of things, the lack of complexity inherent in a single VPS will mean you have better availability than any of those bizarrely complex autoscaling/recovery setups.


I'm not so sure about all of that.

The thing is that all companies, regardless of their scale, would benefit from these good practices. Scrappy startups definitely have more important things to do than maintaining their infra, whether that involves setting up observability and automation or manually troubleshooting and deploying. Both involve resources and trade-offs, but one of them eventually leads to a reduction in required resources and improvements in stability/reliability, while the other leads to a hole of technical debt that is difficult to get out of if you ever want to improve stability/reliability.

What I find more harmful is the prevailing notion that "complexity" must be avoided at smaller scales, and that somehow copying a binary to a single VPS is the correct way to deploy at this stage. You see this in the sibling comment from Aeolun here.

The reality is that doing all of this right is an inherently complex problem. There's no getting around that. It's true that at smaller scales some of these practices can be ignored, and determining which is a skill on its own. But what usually happens is that companies build their own hodgepodge solutions to these problems as they run into them, which accumulate over time, and they end up having to maintain their Rube Goldberg machines in perpetuity because of sunk costs. This means that they never achieve the benefits they would have had they just adopted good practices and tooling from the start.

I'm not saying that starting with k8s and such is always a good idea, especially if the company is not well established yet, but we have tools and services nowadays that handle these problems for us. Shunning cloud providers, containers, k8s, or any other technology out of an irrational fear of complexity is more harmful than beneficial.


If you don't know why they failed, replacing them is pointless.


> I think ssh-ing into production is a sign of not fully mature devops practices.

That's great and completely correct when you are one of the very few places in the universe where everything is fully mature and stable. The rest of us work on software. :)


A whole lot of words to say "we don't troubleshoot and just live with bugs, #yolo".


It's a good mindset to have, but I think ssh access should still be available as a last resort on prod systems, and using it should perhaps trigger some sort of postmortem process, with steps to detect the problem without ssh in the future. There is always going to be a bug that you cannot reproduce outside of prod, that you cannot diagnose with just a core dump, and that is a showstopper. It's one thing to ignore a minor performance degradation, but if the problem corrupts your state you cannot ignore it.

Moreover, if you are in the cloud, part of your infrastructure is not under your control, making it even harder to reproduce a problem.

I've worked with companies at Netflix's scale and they still have last-resort ssh access to their systems.



