Linux Crisis Tools (brendangregg.com)
596 points by samber 6 months ago | 124 comments



This is a handy list.

> 4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…

Cloud definitely has downsides, and isn’t a fit for all scenarios but in my experience it’s great for situations like this. Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one. New machine and app likely comes up clean. Incident resolves. Dig into machine off the hot path.


Dig into machine off the hot path.

Unfortunately, no one has the time to do that (or let someone do it) after the problem is "solved", so over time the "rebuild from scratch" approach just results in a loss of actual troubleshooting skills and acquired knowledge --- the software equivalent of a "parts swapper" in the physical world.


Y'all don't do post-mortem investigations / action items?

I get the desire to troubleshoot but priority 0 is make the system functional for users again, literally everything else can wait. I once had to deal with an outage that required we kill all our app servers every 20 minutes (staggered of course) because of a memory leak while it was being investigated.


Usually depends on the impact. If it's one of many instances behind a load balancer and was easily fixed with no obvious causes, then we move on. If it happens again, we have a known short-term fix and now we have a justified reason to devote man-hours to investigating and doing a post-mortem.


> I get the desire to troubleshoot but priority 0 is make the system functional for users again, literally everything else can wait.

What numbers went into this calculation, to get such an extreme result as concluding that getting it up again is always the first priority?

When I have tried to estimate the cost and benefit, I have been surprised to reach the opposite conclusion multiple times. We ended up essentially in the situation of "Yeah, sure, you can reproduce the outage in production. Learn as much as you possibly can and restore service after an hour."

This is in fact the reason I prefer to keep some margin in the SLO budget -- it makes it easier to allow troubleshooting an outage in the hot path, and it frontloads some of that difficult decision.


I was at a place where we had "worker" machines that would handle incoming data with fluctuating volume. If the queues got too long we would automatically spin up new worker instances and when it came time to spin down we would kill the older ones first.

You can probably see where this is going. The workers had some problem where they would bog down if left running too long, causing the queues to back up and indirectly causing themselves to eventually be culled.

Never did figure out why they would bog down. We just ran herky jerky like this for a few years till I left. Might still be doing it for all I know.


> The workers had some problem where they would bog down if left running too long.

So you just automatically replace the instances after a certain amount of runtime and your problem is gone.


Yeah, fixing a problem without understanding it has some disadvantages. It works sometimes, but the "with understanding" strategy works much more often.

Is this really a prevailing attitude now? Who cares what happened, as long as we can paper over it with some other maneuver/resources? For me it's both intellectually rewarding and skill-building to figure out what caused the problem in the first place.

I mean, I hear plenty of managers with this attitude. But I really expect better on a forum called hacker news.


For me, there’s a certain threshold involved.

If it happens extremely rarely (like, once every 6 months) or it’s super transient and low impact, we kick it and move on.

If it starts happening a 3rd or 4th time, or the severity increases, we start to dig in and actually fix it.

So we’re not giving up and losing all diagnosis/bugfixing ability, just setting a threshold. There’ll always be issues, and some of them will always be mystery issues; you can’t solve everything, so you’ve got to triage appropriately.


Plus, just because the only visible symptom of the bug is a performance issue right now doesn't mean that there won't also be other consequences. If something is behaving contrary to expectations you should always figure out why.


I said nothing about not understanding the issue. Even with understanding just “turning it on and off again” might be the better solution at the moment. Because going for the “real” solution means making a trade-off somewhere else.


The end state of a culture that embraces restart/reboot/clear-cache instead of real diagnoses and troubleshooting is a cohort of junior devs who just delete their git repo and reclone instead of figuring out what a detached HEAD is.

I don't really fault the junior dev who does that. They are just following the "I don't understand something, so just start over" paradigm set by seniors.


It’s not either / or.

If you have proper observability in place then you can do your diagnosis without affecting your customers.


>diagnosis without affecting your customers.

Plus, at the same time, a successful diagnosis is also the kind of thing that can have the most dramatic effect on your customers.

In a positive way.


Sure, but at risk of repeating myself: it’s not either/or. Nobody is suggesting analysis shouldn’t happen. Just that it doesn’t need to happen on a live system.


Honestly, there's a certain cost-benefit analysis here. In both instances (rebooting and recloning), it's a pretty fast action with high chances of success. How much longer does it take to find the real, permanent solution? For that matter, how long does it take to even dig into the problem and familiarize yourself with its background? For a business, sometimes it's just more cost effective to accept that you don't really know what the problem is and won't figure it out in less time than it takes to cop-out. Personally, I'm all in favor of actually figuring out the issue too, I just don't believe it to be appropriate in every situation.


There is a short-term calculus and a long-term calculus. Restarting usually wins in the short-term calculus. But if you double down on that strategy too much, your engineering team, and culture writ large, will drift increasingly towards a technological mysticism.


To be fair, with git specifically, it's a good idea to at least clone for backup before things like major merges. There are lots of horror stories from people losing work to git workflow issues, and I'd rather be ridiculed as an idiot who is afraid of "his tools" (as if I have anything like a choice when using git) and won't learn them properly than lose work by trusting that this thing behaves in a way which can actually be learned and followed safely.

A special case of this is git rebase after which you "can" access the original history in some obscure way until it's garbage-collected; or you could clone the repo before the merge and then you can access the original history straightforwardly and you decide when to garbage-collect it by deleting that repo.


Git is a lot less scary when you understand the reflog; commit or stash your local changes and then you can rebase without fear of losing anything. (As a bonus tip, place “mybranch.bak” branches as pointers to your pre-rebase commit sha to avoid having to dig around in the reflog at all.)

I would never ridicule anyone for your approach, just gently encourage them to spend a few mins to grok the ‘git reflog’ command.
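
For instance, a minimal sketch of that flow (the branch name here is just an example):

    git branch mybranch.bak          # bookmark the pre-rebase tip
    git rebase origin/main
    # if the rebase went sideways:
    git reset --hard mybranch.bak
    # or, without the bookmark, dig the old tip out of the reflog:
    git reflog
    git reset --hard HEAD@{1}        # the position of HEAD before the last operation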


Then submodules enter the picture. I’m comfortable with reflog, but haven’t fully grokked submodules yet, so it’s easier to reclone.


If you’re not super comfortable with Git, before rebasing, simply:

- Commit any pending changes.

- Make a git tag at your current head (any name is fine, even gibberish).

If anything “goes wrong” you can roll back by simply doing a hard reset to the tagged commit.

Once done, delete the tag.
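
Concretely, that’s something like (the tag name is arbitrary):

    git add -A && git commit -m "wip"   # commit any pending changes
    git tag pre-rebase                  # bookmark the current head
    git rebase origin/main
    git reset --hard pre-rebase         # only if something went wrong
    git tag -d pre-rebase               # once you’re happy with the result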

Making a complete “backup clone” is a complete waste of time and disk space.


Isn't the whole purpose of Git version control? In other words, to prevent work loss occurring from merges and/or updates? Maybe I'm confusing GitHub with Git? PS: I want to set up a server for a couple of domain names I recently acquired; it has been many years, so I'm not exactly sure if this is even practical anymore. Way back when, I used a distribution based off of CentOS called SME Server. Is it still commonplace to use an all-in-one distribution like that? Or is it better to just install my preferred flavour of Linux and each package separately?


Git does source code management.

The two primary source code management activities developers use are versioning of source code (tracking changes which happened over time) and synchronisation of code with other developers.

One of Git’s differentiating strengths is it being decentralised, allowing you to do many operations in isolation locally without a central server being involved. You can then synchronise your local repository with an arbitrary number of other copies of it which may be remote, but you may need to rebase or merge in order to integrate your changes with those of other developers.

Git is more like a local database (it even allows multiple local checkouts against a single common “database”) and it only occasionally “deletes” old “garbage”. Anything you do locally in Git is atomic and can always be rolled back (provided garbage collection hasn’t yet been performed).

Although I’m comfortable enough with using the reflog to rollback changes (I’m also skilled enough in git I haven’t needed to in many years), it’s not very user friendly, it’s essentially like sifting through trash, you’ll eventually be able to find what you lost (provided it wasn’t lost too long ago), but you may have to dig around a bit. Hence my suggestion of tagging first, makes it easy to find it again if needed.

I have very limited Linux experience and have no recommendations on your other question.


Thank you for the well detailed response to my question. I'm currently working on returning to the CS field due to a devastating and career-ending injury. The specific field I'm interested in is programming the interface between hardware such as robotics and user interfaces. So much has changed over the past decade I feel like I'm having to start all over and relearn everything to do with programming! And on top of that I have to also relearn how to live as a quadriplegic! Thank goodness for the Internet and its incredible amount of free knowledge available these days!


If it's happening so rarely that killing is a viable solution, then there's no reason to troubleshoot it to begin with. If it's happening often enough to warrant troubleshooting, then your concerns are addressed.


Here's a real-life example. We have a KVM server that has its storage on Ceph. It looks like KVM doesn't work well with Ceph, esp. when MD is involved, so, if a VM is powered off instead of an orderly shutdown, something bad is happening to MD metadata, and when the VM is turned on again, one MD replica can be missing. This happens infrequently, and I've never been in a situation when two replicas died at the same time (which would prevent a VM from booting), but it's obviously possible.

So... more generally, your idea of replacing VMs is rather naive when it comes to storage. Replacement incurs penalties, such as RAID rebuilds. RAIDs don't have the promised resiliency during rebuild. And, in general, rebuilds are costly because they move a lot of data / wear the hardware by a lot. Worse yet, if you experience the same problem that caused you to start a rebuild in the first place during the rebuild, the whole system is a write-off.

In other words, it's a bad idea to fix problems without diagnosing them first if you want your system to be reliable. In extreme cases, this may start a domino effect, where the replacement will compound the problem, and, if running on rented hardware, may also be very financially damaging: there were stories about systems not coping with load-balancing and spawning more and more servers to try and mitigate the problem, where the problem was, e.g., a configuration that was copied to the newly spawned servers.


That might work in some scenarios. If you're a "newer" company where each application is deployed onto individual nodes, you can do this.

But consider the case of older companies, where it was more common to deploy several systems, often complex ones, onto the same node. You will also cause outages to systems x, y and z. Maybe some of them are inter-dependent? You have to weigh the consequences and risks carefully in any situation before rebooting.


> Cloud definitely has downsides, and isn’t a fit for all scenarios but in my experience it’s great for situations like this.

At least as I read it, this contains the assumption that that's not how you deploy your applications.


> it was more common to deploy several systems, often complex ones, onto the same node.

Yeah, we do this? It doesn’t pose an issue though. Cordon the node (stop any new deployment going on), drain it to remove all current workloads (these either have replicas, or can be moved to another node; if we don’t have a suitable node, K8s spins up one automatically) and then remove the node. Most workloads either have replicas spare, or in the case of “singleton” workloads, have configs ensuring the cluster must always have 1 replica available, so it waits for the new one to come up before killing the old. Most machines deploy and join the cluster in a couple of minutes, and most of our containers take only like 1 or 2 seconds to deploy and start serving on a machine, so rolling a node is a really low impact process.
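
Roughly, in kubectl terms (the node name is a placeholder):

    kubectl cordon node-42                                             # no new pods land here
    kubectl drain node-42 --ignore-daemonsets --delete-emptydir-data   # evict current workloads
    kubectl delete node node-42                                        # then remove/terminate it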


One could argue that most devs these days are parts swappers with all the packages floating around.


> Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one.

"4:10pm the new machine still has the same performance issue"


Sure, but more often than not, esp. in cloud scenarios, you just get a machine that is having a bad day and it’s quicker to just eject it, let the rest of the infra pick up the slack, and then debug from there. Additionally, if you’ve axed a machine and still got the same issue, you know it’s not a machine issue, so either go look at your networking layer or at whatever configs you’re using to boot your machines from…


> esp in cloud scenarios

... so the nice thing about the cloud is that you can work around cloud-specific issues?


4:20pm Turns out it was DNS


That made me laugh. Thank you. Of course, it is not DNS. DNS has become the new cabling. DNS is not especially complicated, but neither is cabling. Yet, during the dot-com years and those that followed, cabling was causing a lot of the problems, so we got used to checking the cabling first. But it only took a few more years to realize that it is not always the cabling; actually, failures are normally distributed.

Is it wrong to check DNS first? No, but please realize that DNS misconfiguration is not more common than other SNAFUs.


    It’s not DNS
    There’s no way it’s DNS
    It was DNS


Certificates are the new DNS for service breakages


That's actually amazing, a reproducible problem is a 90% solved problem!


You're describing one of the benefits of virtualised cattle, not necessarily or exclusively 'cloud'.


Killing the machine might destroy evidence. It might be the case that you have everything logged externally, but most often there is something missing.


Take it out of the pool then.


Not all servers are containerized, but a significant number are and they present their own challenges.

Unfortunately, many such tools in docker images will be flagged by automated security scanning tools in the "unnecessary tools that can aid an attacker in observing and modifying system behavior" category. Some of those (like having gdb) are valid concerns, but many are not.

To avoid that, we have some of these tools in a separate volume as (preferably) static binaries, or compile & install them with the mount path as the install prefix (for config files & libs). If there's a need to debug, we ask operations to mount the volume temporarily as read-only.
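
In plain Docker terms it ends up looking something like this (the image name, volume name and paths are made up for illustration):

    # "debug-tools" is a named volume holding static binaries, mounted read-only
    # when debugging is approved (the container is (re)created with the mount)
    docker run -d --name app -v debug-tools:/opt/debug:ro myorg/app:prod
    docker exec -it app /opt/debug/bin/busybox top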

Another challenge is if there's a debug tool that requires enabling a certain kernel feature, there are often questions/concerns about how that affects other containers running on the same host.


If an attacker can execute files from the filesystem, and all that's missing to run them is them being present on the filesystem, the attacker could just... write those files themselves? I really don't understand in what scenario this policy makes any sense, apart from "my organization misuses security scanners".


A better way is to build a second image including the debug tools and a root user, then start it sharing the prod container's PID namespace and network namespace.

Starting a second container is usually a good idea anyway, since you need to add a lot of extra flags like the SYS_PTRACE capability, user 0 and --privileged for debuggers to work.

This way you don't need to restart the prod container either, potentially losing reproduction evidence.

Remembering how to do all this in an emergency may not be entirely obvious. Make sure to try it first and write down the steps in your run books.
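
A rough sketch of what that looks like with plain Docker (image and container names are placeholders):

    # debug image built separately with gdb/strace/bpftrace and a root user;
    # attach it to the prod container's PID and network namespaces
    docker run --rm -it \
      --pid container:prod-app \
      --net container:prod-app \
      --cap-add SYS_PTRACE \
      --privileged \
      myorg/debug-tools:latest /bin/bash

On Kubernetes, `kubectl debug` with --image and --target does roughly the same thing via ephemeral containers.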


> A better way is to build a second image including the debug tools and a root-user.

That was our initial idea. But management and QA are paranoid enough that they consider these a new set of images that require running the complete test suite again, even when they are built on top of certified images. Nobody is willing to test twice, so we had to settle for this middle ground.


somewhat related: /rescue/* on every FreeBSD system since 5.2 (2004) — a single statically linked ~17MB binary combining ~150 critical tools, hardlinked under their usual names

https://man.freebsd.org/cgi/man.cgi?rescue https://github.com/freebsd/freebsd-src/blob/main/rescue/resc...


And I haven't needed to use it in fifteen years. Over the past four or five years, I've ported what I can to a *BSD, for sanity reasons.


When I was at Netflix, Brendan and his team made sure that we had a fair set of debugging tools installed everywhere (bpftrace, bcc, working perf)

These were a lifesaver multiple times.


I was surprised that `strace` wasn't on that list. That's usually one of my first go-to tools. It's so great, especially when programs return useless or wrong error messages.


strace is ok as a last resort, but "perf trace" and bpf tracing tools are the production-safe alternative. https://www.brendangregg.com/blog/2014-05-11/strace-wow-much...
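
For example (the PID is a placeholder):

    perf trace -p 1234    # strace-like view of syscalls with far lower overhead
    # or count syscalls by type for one process with bpftrace:
    bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == 1234/ { @[probe] = count(); }'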


Why not recommend atop? When a system is unresponsive, I want a high-level tool that immediately shows which subsystem is under heavy load. It should show CPU, memory, disk, and network usage. The other tools you listed are great once you know what the cause is.


My preference is tools that give a rolling output, as it lets you capture the time-based pattern and share it with others, including in JIRA tickets and SRE chatrooms, whereas the top(1)-style tools generally clear the screen. atop by default also sets up logging and runs a couple of daemons in systemd, so it's more than just a handy tool when needed; it's now adding itself to the operating table. (I think I did at least one blog post about performance monitoring agents causing performance issues.) Just something to consider.

I've recommended atop in the past for catching short-lived processes because it uses process accounting, although the newer bpf tools provide more detail.
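
By rolling output I mean the style of, for example:

    vmstat 1          # one line per second; the history stays in your terminal to copy out
    iostat -xz 1      # same idea for disks
    mpstat -P ALL 1   # and per-CPU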



And I wonder if anyone still uses sar and family.

I didn't, but a boss or two of mine did.


I still use sar occasionally, but never as a troubleshooting tool. It's more for performance analysis than crisis mode.


Rarely need these under Linux but would have frequent use for such a tool under Windows with its mandatory file locking. "No you can't rename that directory but I won't tell you because fuck you."


TIL fuser. Thanks!


Welcome :)


Some HN monkeyboy has nothing better to do with his time than downvote a polite reply to a thanks.


I always cover such tools when I interview people for SRE-type positions. Not so much about which specific commands the candidate can recall (although it always impresses when somebody teaches me about a new tool) but what's possible, what sort of tools are available and how you use them: that you can capture and analyze network traffic, syscalls, execution profiles and examine OS and hardware state.


In such a crisis if installing tools is impossible, you can run many utils via Docker, such as:

Build a container with a one-liner:

docker build -t tcpdump - <<EOF
FROM ubuntu
RUN apt-get update && apt-get install -y tcpdump
CMD tcpdump -i eth0
EOF

Run attached to the host network:

docker run -dP --net=host moremagic/docker-netstat

Run system tools attached to read host processes:

for sysstat_tool in iostat sar vmstat mpstat pidstat; do
    alias "sysstat-${sysstat_tool}=docker run --rm -it -v /proc:/proc --privileged --net host --pid host ghcr.io/krishjainx/sysstat-docker:main /usr/bin/${sysstat_tool}"
done
unset -v sysstat_tool

Sure, yum install is preferred, but so long as docker is available this is a viable alternative if you can manage the extra mapping needed. It probably wouldn’t work with a rootless/podman setup.


Is there a situation where apt can't download and install packages but docker can fetch new containers?

apt libs borked or something?


I would just decompress the .deb in such case. As a last resort, even a .rpm might work.

Of course handling dependencies by hand is annoying, but depending on situation it might be faster anyway.
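
Something along these lines (the package and paths are just examples, and it assumes the tool's shared libraries are already on the box):

    dpkg-deb -x sysstat_*.deb /tmp/sysstat   # unpack without installing
    /tmp/sysstat/usr/bin/iostat -xz 1        # run the binary in place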


Unless you are in an air-gapped situation. Good luck pulling the "Ubuntu" image!


On that note, I'd largely prefer it if `busybox` contained more of these tools; it'd be very helpful to have a 1MB-ish file that I can upload to a server and run there.


You guys get root access? I have to raise a ticket for a sysadmin to do anything.


I am a consultant now so it's a new company every few months.

There are groups of people you always make nice with.

* Security people. The kind with poorly fitting blazers who let you into the building. Learn these people's names; Starbucks cards are your friends.

* Cleaning people. Be nice, be polite, again learn names. Your area will be spotless. It's worth staying late every now and again just to get to know these folks.

* Accounting: Make some friends here. Get coffee, go to lunch, talk to them about non work shit, ask about their job, show interest. If you pick the right ones they are gonna grab you when layoffs are coming or corp money is flowing (hit your boss up for extra money times).

* IT. The folks who hand out laptops and manage email. Be nice to these people. Watch how quickly they rip bullshit off your computer or waive some security nonsense. Be first in line for every upgrade possible.

* Sysadmins. These are the most important ones. Not just because "root" but because a good SA knows how to code but never says it out loud. A good sysadmin will tell you what dark corners have the bodies and if it's just a closet or a whole fucking cemetery. If you learn to build into their platform (hint: for them, containers are how they isolate your shitty software in most cases) then you're going to get a LOT more leeway. This is the one group of people who will ask you for favors, and you should do them.


> Starbucks cards are your friends

like, how? are you straight up bribing people with coffee for security favors? or is it like, "hey man, thanks for helping me out I'd like to buy you a coffee but I'm busy with secret consulting stuff - here's a gift card"

Is this something that only works for short lived external consultant interactions?


You do that right after walking into the manager's office and getting a job with a firm handshake. Then you go outside and buy a hotdog for 15 cents and a detached house in San Francisco for $15,000 USD.


You learn people's names, you say hi every day, you treat them like humans. You bring them coffee on occasion if it is early in the morning... or you ask them if they want something if you're going for that post-lunch pick-me-up for yourself.

By the end of a 4-5 week run you will know all the security people in a building. If I go to lunch and forget my badge they will let me back in, no questions asked. This is something I used to do as staff, and still do to this day.


Just give one as a gift occasionally. The holiday season is great for this. On your way in, "Merry Christmas Frank!" and hand one out. Or even just because: "Keep up the good work, here you go."

It's not about bribing to get a specific favor. It's about getting on good terms. Having people like you is a good thing, and it makes their job a bit better and can make their day a little brighter. Win-win.


So true. If you want to know anything about an office, ask the sysadmins. Double-plus on being nice to the facility managers, cleaning people and security. Not only do they do a thankless job but they are often the most useful and resourceful people around if you need something taken care of. They know how to get shit done.


Much easier to be nice as default ;D


Agreed. Although being "extra nice" -- going out of your way to learn about people, eat with them, etc. -- does take extra time, so you can't do that with everyone.


Err, sure. I used to run IT ops (SYS, SRE, and SEC in this context). This article is directed at people who run apps on IT-provided infrastructure. But if you have interactions like the ones in the example, your org has failed at the org level; this is not a tech problem. We used to have very clear and very trustworthy lines of communication, and people wouldn't be on chat, they would be on the phone (or today on Teams or whatever) with dev, ops, security, and compliance. Actually, we had at least a liaison on every team, but most often dev ran the apps on ops-provided resources. Compliance green-lighted the setup and SR was a dev job. A lot of problems really go away if you do devops in this sense.


I don't see nmap, netstat, and nc being mentioned. They have saved me so many times as well.


The only thing I would add is nmap.

Network connectivity issues aren't always apparent in some apps.
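
Even just a quick reachability check from the affected box helps (host and port are placeholders):

    nc -vz db.internal 5432      # TCP connect test (OpenBSD netcat syntax)
    nmap -p 5432 db.internal     # or have nmap report open/filtered/closed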


screen/tmux byobu pv rsync and of course vim.


dd, echo * as a poor man's ls if ls is accidentally deleted, busybox, cpio, fsck and fsdb.

Used all of these and more, in Unix, not just Linux crisis situations.


Those are already there. We are talking about diagnostic and recovery tools that should be installed by policy, in advance, so that they are already in place to aid in emergencies.


Okay, my mistake. But busybox is not always already there, right? Installed, I mean? Not at a box right now.


Busybox has one big downside: the tools it provides tend to have a rather ... limited set of options available. The easy stuff you can do in a standard shell might not be supported.


Brendan Gregg, as always, with a down-to-earth approach. Love the war room example.


Would these tools still be useful in a cloud environment, such as EC2?

Most dev teams I work with are actively reducing their actual managed servers and replacing them with either Lambda or Docker images running in K8s. I wonder if these tools are still useful for containers and serverless?


It's still useful in EC2 (or any other VM-based environment) and Docker containers, as long as you can install the necessary packages (if they are not installed by default). Because after all, there are "servers" underneath, even for the serverless apps, I suppose.

It's definitely harder for apps running in Lambda because we may not have access to the underlying OS. In such cases, I kind of fall back to using application-level observability tools like Pyroscope (https://pyroscope.io). It doesn't always work for all cases and has some overhead/setup, but it's still better than flying blind and more useful than the Cloud Provider's provided metrics.


> Most dev teams I work with are actively reducing their actual managed server and replace it with either Lambda, or docker images running in K8.

There are plenty of services that don’t fit on k8s or Lambda. Not all pegs fit in those holes.


IME there's always that one service that wasn't ever migrated to containers or lambdas and is off running on an EC2 instance somewhere, and nobody knows about it because it never breaks, but then the one time AWS schedules an instance retirement for it...


Containers are just processes running on the host where the process has a different view of the world from the "host". The host can see all and do all.
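
For example, from the host (the container name is a placeholder):

    PID=$(docker inspect -f '{{.State.Pid}}' myapp)   # host PID of the container's main process
    strace -p "$PID"                                  # host-side tools can attach directly
    nsenter -t "$PID" -n ss -tlpn                     # or run a tool inside its network namespace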


Let's add ncdu to the list; it's super useful to find what is taking all the disk space.


I keep forgetting about ncdu thanks to my old habit of du -ms * | sort -n. What is it I'm missing?


Lots. ncdu is a fully interactive file browser that also lets you delete files and directories without a rescan.


ncdu will also store results in an output file you can pull back and do analysis on. I've found this feature useful in some contexts.
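
If I remember the flags right, it's something like:

    ncdu -x -o /tmp/scan.ncdu /   # scan and export instead of browsing interactively
    ncdu -f /tmp/scan.ncdu        # load the export later, or on another machine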


a bit off topic, but `rclone ncdu` is great too for cloud.


Sounds like it's time to create a crisis-essential package group a la build-essential.


I have in the past created a package list in ansible/salt/chef/... called devops_tools or whatever to make sure we had all the tools installed ahead of time.


The list is great, but only for classical server workloads.

Usually not even a shell is available in modern Kubernetes deployments that take a security-first approach, with chiseled containers.

And by creating a debugging image, not only is the execution environment being changed, but deploying it might also require disabling the security policies that do image scans.


You don't need to have these tools in the container to troubleshoot the workload in a container.


You would be surprised, especially if the developers didn't care about telemetry.


Container processes are just processes running on the host.


Great, want to go debug a bunch of container processes running on a cloud Kubernetes cluster, with all security policies turned on?


So the security policy is the headache then, not the containers or even Kubernetes?


I use zfsbootmenu with hrmpf (https://github.com/leahneukirchen/hrmpf). You can see the list of packages here (https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.pa...). I usually build images based off this so they're all there; otherwise you'll need to ssh into zfsbootmenu and load the 2 GB separate distro. This is for a home server, though if I had a startup I'd probably set up a "cloud setup" and throw a bunch of servers somewhere. A lot of times, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying for a cloud provider. It also gets around cases where you can't run k8s and need bare metal. I've advised some clients on this setup, with contingencies in case of catastrophic failure and, more importantly, testing those contingencies, but this is more so you don't have developers doing nothing, not to prevent overnight outages. It's a lot cheaper than cloud solutions for non-critical projects, and while larger companies will look at the numbers closely if something happened and devs can't work for an hour, the advantage of a startup is that devs will find a way to be productive locally, or you simply have them take the afternoon off (neither has happened).

I imagine these problems described happen on big-iron type hardware clusters that are extremely expensive and where spare capacity isn't possible. I might be wrong, but especially with (sigh) AI setups with extremely expensive $30k GPUs and crazy bandwidth between planes you buy from IBM for crazy prices (hardware vendor on the line so quickly was a hint), you're way past the commodity server cloud model. I have no idea what could go wrong with such equipment, where nearly every piece of hardware is close to custom built, but I'm glad I don't have to deal with that. The debugging on those things, hardware only a few huge pharma or research companies use, has to come down to really strange things.


On compute clusters there are quite a few "exotic" things that can go wrong. The workload orchestration is typically SLURM, which can throw errors and has a million config options to get lost in.

Then you have storage, often tiered in three levels - job-temporary scratch storage on each node, a distributed fast storage with a few weeks retention only, and an external permanent storage attached somehow. Relatively often the middle layer here, which is Lustre or something similar, can throw a fit.

Then you have the interconnect, which can be anything from super flakey to rock solid. I've seen fifteen year old setups be rock solid, and in one extreme example a brand new system that was so unstable, all the IB cards were shipped back to Mellanox and replaced under warranty with a previous generation model. This type of thing usually follows something like a Weibull distribution, where wrinkles are ironed out over time and the IB drivers become more robust for a particular HW model.

Then you have the general hardware and drivers on each node. Typically there is extensive performance testing to establish the best compiler flags etc., as well as how to distribute the work most optimally for a given workload. Failures on this level are easier in the sense that it typically just affects a couple of nodes which you can take offline and fix while the rest keep running.


Related to that, I recently learned about safe-rm which lets you configure files and directories that can't be deleted.

This probably would have prevented a stressful incident 3 weeks ago.
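
If I remember correctly, it's just a wrapper around rm plus a list of protected paths (the config path here is from memory, so double-check it):

    # assumption: protected paths are listed one per line in /etc/safe-rm.conf
    echo /var/lib/important-data | sudo tee -a /etc/safe-rm.conf
    safe-rm -rf /var/lib/important-data    # now refuses instead of deleting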


tmux, statically linked (musl) busybox with everything, lsof, ltrace/strace and a few more. Under OpenBSD this is not an issue as you have systat and friends in base.


Doesn't one increase a system's attack surface area/privilege escalation risk by pre-installing tools such as these?


Usually (not by design, but by circumstance), if someone gains RCE on your systems, they can also find a way to bring the tools they need to do whatever they originally set out to do. It's the old "I don't want to have a compiler installed on my system, that's dangerous, unnecessary software!"-trope driven to a new extreme. Unless the executables installed are a means to somehow escalate privileges (via setuid, file-based capabilities, a too-open sudo policy, ...), having them installed might be a convenience for a successful attacker - but very rarely the singular inflection point at which their attempted attack became a successful one.

The times I've been locked in an ill-equipped container image that was stripped bare by some "security" crapware and/or guidelines and that made debugging a problem MUCH harder than it should have been vastly outnumber the times where I've had to deal with a security incident because someone had coreutils (or w/e) "unnecessarily" installed. (The latter tally is at zero, for the record.)


> It's the old "I don't want to have a compiler installed on my system, that's dangerous, unnecessary software!"-trope

(This is a genuine question). In what circumstances would you need to install/run dev tools in prod?

Of course having a compiler installed isn't necessarily an issue... but it might well be a sign that there is an underlying problem!

(FWIW, I used to build everything from source. Yes, also in prod. That was a while ago...)


How do you see an escalation using one of the tools listed in the article (unless a binary has the suid bit, which you shouldn't set if you're worried about security)? Many of these tools provide convenient access to /proc; if an attacker needs something there, they can read/write /proc directly. Though in the case of eBPF, disabled kernel support would reduce the attack surface, and if it's disabled in the kernel, the user-mode tools are useless.


Love the list and the eBPF tools look super helpful.


When would you need to use rdmsr and wrmsr in a crisis?


> and...permission errors. What!? I'm root, this makes no sense.

This is one of the reasons why I fight back as hard as I can against any "security" measures that restrict what root can do.


Can't imagine handling a Linux crisis without ssh

[EDIT]: typo


So basically busybox?


[flagged]


This is a good example of why chatGPT is just useless for so many things.

* Why even create a use-time script, when the whole point of the original article is that these tools should be pre-installed, because the system may be in a state where disk or network I/O, or load, prevents easy installation of anything during times of instability/issues

* Why bother with a 100+ line script, when you just "apt-get install thing" then use the blasted thing

* Who's going to maintain this script down the road? Why are things on it? Why even spend time looking at it, debugging it?

It's literally creating extra work, whilst doing nothing useful.

Please people. If you don't understand what's going on, or why things are to be done, don't hand it off to chatGPT. Just learn, learn, learn or if you can't/won't learn, go find a job in a different field.


Wow, that's a really mean and narrow response. I get it if you're mad about something right now that has nothing to do with this, but please don't bring that here. Rather, constructively deal with the source of your feelings, instead of trying to take it out on people and places that don't deserve it! Haha :)

In your quest for invective did you consider other possibilities? Perhaps:

- you could learn by using these things, with an easy interface

- perhaps you're not the expert who knows everything, and other people have use cases beyond what you'd consider?

- you don't have to wait until things go wrong to familiarize yourself with these things, you can use the script now to ensure they're installed

Personally I think that if you don't think ChatGPT is useful for very many things, you're not using it right! Haha :) In general it seems your comment would benefit from the HN guidelines advice of assuming a generous interpretation, rather than going in other direction! Haha! :)

If you want to find "uselessness" or "stupidity" in what you are looking at, well, surely you can always find it if you try. But, truth is, you find there what you bring to it. So, try harder to find something good! Haha :)

But also that comment itself could be an example of "not understanding what's going on" yet speaking like there is that understanding. I understand if the job situation is difficult and you're unhappy with the number of people in the field now, but if you dislike appreciating a diversity of approaches beyond your own, perhaps it is you who ought "go find a job in a different field."

Or, learn to be more welcoming and less arrogant in your quest to be "right". You could just apply yourself to finding something useful in whatever you're responding to. That, it seems, would be the way to really be right, and also to make a useful contribution, not just to the forum, but with that attitude, to the field. :)


I won't comment on the tone (I can't evaluate for you if they were harsh and how you should take it - I myself didn't find them too rude, more like irritated maybe) but they are right, in essence. I was going to answer something like this too.

You want to be 100% sure on what you run on your prod, so you really don't want to automate writing such a script with a tool that can hallucinate things.

I only skimmed the script because I don't really want to be reading hallucinated code, that's not very interesting, but I already saw two flaws:

- it tries to detect the package manager, among two random ones that exist. Why those two?

- there's a list of packages that surely applies to only one on them and will break with the other, if it doesn't already break with the first (who knows before trying or before careful review?).

It proposes helpers to run random stuff, but you really want to spend the time learning these tools directly because you will end up having to use them in different ways for different situations. You don't want to spend time learning how to use the generated script; that's extra, pointless work. You also want to be damn sure the generated helpers are harmless and do the right things, so you will have to take a lot of time reviewing them. Of course you could be learning stuff while reviewing these parts, but you really want to learn those things from actually experienced people instead, so you get pointed at relevant stuff.

And then indeed there will be maintenance work.

> Personally I think that if you don't think ChatGPT is useful for very many things, you're not using it right! Haha :)

Be careful with ChatGPT. Because you've actually just showed us that you are using it in ways that could be detrimental to you or your coworkers. You took a very high quality article and fed it to ChatGPT to generate a script that you thought useful to share to the world on HN.

You actually need to know exactly and be familiar with what you are installing on your servers, and you need to know how to use this stuff because otherwise, in time of crisis, you won't know what to do. You need to take the time to learn this stuff. ChatGPT won't help you learn faster here, and this script is a shortcut you can't actually take. It does the opposite of helping you here. It's a bit like your non expert friend doing your homework. They have no responsibility, and you haven't learned what the teacher wanted you to learn yourself, and you are not sure the friend even did good work.

> if you dislike appreciating a diversity of approaches beyond your own, perhaps it is you who ought "go find a job in a different field."

If the other commenter is anything like me, we are seeing people posting ChatGPT-generated stuff all too frequently and we find this boring. We know ChatGPT exists. We are out there for learning things and read actual insight from the posts and the commenter. LLM Generated things don't come with any insight and actual knowledge.

On a personal note, I highly distrust LLM output. Way more than random HN comments. I will need to review carefully what I read on HN from actual people, but even more what the LLM says, with my vague knowledge on how it works.

I bet most of us here on HN don't want to see comments citing LLM-generated stuff. And bad luck, the quality difference between this and an article from Brendan Gregg is stupidly huge so we are definitely expecting more when we are just done reading one.

About your suggestion to make the script work on different distros: you wouldn't actually want this. You would want a script that is finely adapted to your own, very specific production. You can't actually have general purpose script for this. If you have several distros in your production, you will actually need different scripts because it's likely the servers on different distros will have very different goals and ways of working.

Of course the install procedure will be specific to how you deploy and your script will likely won't help because it probably doesn't match how your thing is deployed.


Guys, it's not meant to be the keystone for your production servers. I get your points about infra and LLMs but this is not the place for them. Surely there's some more deserving targets to your 'anti-AI-acrimony'?? Hahah! :)

More poignantly however the comments here decrying the use of AI tooling as suspicious, incorrect hallucinations, instead suggest that this ire is more about how these people fear that AI, LLMs and ChatGPT are obviating the need for people with their particular expertise, and are desperate to present otherwise. And so they criticize it profusely, even if irrational. A truly future-proof take would be to embrace the trend, and see how it enhances, rather than erodes their prospects.

But back to the point at hand -- it says very clearly at the top of the comment it's a first draft. In fact, I spent a little bit of time honing with a prompt. It says it's untested and suggests ways to improve. Hahaha! :)

A good sysadmin would focus on suggesting ways it could be improved and recognize it for its convenience, a goal they share. They'd likely clearly see that it's not claiming to be either: a substitute for using the tools in another way; nor for learning what they are; but rather can very much be an aid in learning and using the tools.

To instead allow personal, and perhaps mistaken biases against LLMs occlude your productivity or utilization of things seems unwise. There's nothing wrong with having your personal opinions, but failing to see the other ways that things could be useful outside of that, is a mistake, I imagine.

More big picture, now: what do you imagine the purpose of this script was?

You can always look for ineptitude and wrong. To some extent that attitude might even underlie an admirable caution, and could be indicative of expertise -- even if clumsily expressed. But at the same time such attitudes could underpin a not-so-admirable mistaken assumption of stupidity on the part of others, blindness to approaches outside of one's own experience, or a failure to communicate respectfully. It might be hard to argue these were wise traits.

I get your disdain for what you see as the high amount of low quality LLM output that is putting everything at risk, but while a valid opinion, this particular thread is not the best target of that. You can try to make it about that, but why? Then you're just abusing someone else's words as a wrong vehicle for venting your own gripes, right? If you want to vent, do a "Tell HN:" or a blog post. Not reply to someone else's completely-unrelated-to-your-angst comment.

In other words, it's possible to express that opinion without taking aim at something to which it does not apply. There's no need to misuse someone else's comment as a soapbox for your own gripes.

I get if that seemed wise, but it wasn't.

Also, it's possible to raise the question of balance. The criticisms of the script comment erode their own credibility by failing to note anything good in what they're replying to, instead waxing verbosely about why they are "right" and "correct", suggesting that's the primary aim sought by such comments.

Yet, there's no absolutes in this, unless you mistakenly make your definition overly narrow -- then it's meaningless. If you're being real, there's a multitude of ways to do things "right", and a multitude of approaches, as well as uses for, and improvements of, the script I propose in my comment, and approaches like it.

A wise interpretation would clearly see such a script aims to be a collection of useful tools in the same way that many unix tools are collections of useful related functions.

If you'd like to use my comment as a generalized jumping off point for your own gripes on LLMs or need to criticize, it would be better instead to find a more appropriate target so as not to come across as being an abusive and overly-critical bully, which I'm sure you're actually not, in fact.

Your comment comes across as if you misread mine as someone suggesting you provision your entire infrastructure hinging on the correctness of this HN comment, and uses that unhinged assumption as a basis for then criticizing it as something it never intended, nor claimed, to be. Hahaha! :)

Again, I think the semi-hysterical hyperbole of the responses speaks to the 'fear of replacement' that must be gripping their ranks. There must be a perception that employers believe this is true, and these people fear it. That sucks, but it's better to be more rational in response, than less. So your stated skepticism and criticisms would arrive more warranted if they were more precise and balanced.

Better yet, as to commenting ... rather than imposing your view that this is "harmful", find ways that it's not, or ways to make it better. Or just, you know, appreciate that there's multiple ways to the same goal, and everyone can get there differently, and doesn't make you "right" and them "wrong".

To fail to see how my comment and script could be good or useful and instead impose one's own insecurities or generalized sentiments toward current ChatGPT, could also be considered boring...

So...I suspect the pearl clutching is unwarranted, and it's a false equivalence to equate use of ChatGPT with your presumption of technical ineptitude. True technical ineptitude could also include refusal to embrace new technologies, or an overly limited perspective on what people say. Or even an overly narrow view of your own prospects for the future given the introduction of these disruptive technologies! Hahaha! :)


We simply shared what we thought of your comment and I personally tried to do it the more polite way possible. Of course we are essentially telling you that we find your comment useless and why we think that, I'm inclined to understand you don't enjoy it. Now, we are also not imposing anything, what makes you think that?

Sure, you warned this is a draft that needs work, we noticed, but why share this? Do you have ideas on directions where it could be taken to? You are asking what we thought your script was useful for, but that is indeed the question. As is, your comment feels low effort. I don't want to make you justify to us why you think your comment was useful, people are free to comment on HN without justifications, but that's clearly what we are missing. You are writing a lot of words focusing on us detractors as people, but what about the actual content and arguments?

> So your stated skepticism and criticisms would arrive more warranted if they were more precise and balanced

Sorry, but I aim at being precise and deep in my thinking, I'm not aiming at balanced. I sometimes have opinions that are clear and strong, happy to change my mind given good arguments, but I don't seek balanced. I don't know why I should. I seek documented, educated, not watered down.

Now, about using AI myself, I don't quite feel the need but in any case, I will consider using LLMs more seriously when they are open source and when they are careful about how they source their data: the quality of the input, and whether people agree to have their work being used as training data. I also have issues with the amount of energy they require to run. ChatGPT is too ethically wrong from my point of view for considering using it. But that's beside the point and my opinion on this didn't play a role in my comments.

And I don't feel insecure. I'm all right really.

You are blaming us but your comment was flagged to death. We are not the one who were flagged (and I didn't flag you, to be clear). We are also your (only) clues on why this happened. I would suggest some humility. Really, take a hint.

And to be clear, I don't have disdain for you, and I don't assume stupidity. That's not how I work. I would look down at myself if I did. I'm sorry if I made you think this, but let me assure you this is not the case.

This last comment of mine is harsh, but you need to take into account that I just read yours, which is not really nice to us. Let's now tone it down a bit maybe.


Sorry for the belated reply, I did not read your comment until just 5 minutes ago. I avoided it, knowing it would be toxic and I had more important things to do. But now I have some free time, so let's deal with you, sir.

"We"? You only speak for you, right? You cannot assume consensus in unknown random internet others, or else you also must presume consensus with my ideas, too?

The idea of "useless" is of course an imposition. And abusive. I clearly find it useful, so to claim useless is to devalue my perspective. Do you not see that? Or you think it justified? Neither is acceptable if you aim, as you say, for 'politeness'. Nor even for good sense.

So, I think you don't aim to be polite in fact, but merely pretend to be so. Hahaha! :)

What about the content as arguments? There is none from you because you do not acknowledge the other perspectives. So it all comes down, necessarily, to you as people.

But you can't be deep without being balanced, because then you can only be narrow-minded. Which you are succeeding at, but you think that's a victory. When it's not: balance is required for real depth, because in appreciating the breadth, your depth is able to resonate, through linking with what else is real. Otherwise it is, necessarily, unhinged. As yours seems to be, sorry to say! Hahahaha :)

Your pretense at ethics around use of AI tools is belied by your "low ethics" attitude toward commentary. How are we to find that convincing, if you are not a moral actor in the first?

Flagged only requires a few people. If you require the consolation of the chorus of voices to lift your own, I understand. But that undercuts your message of depth, does it not, sir? :)

> I'm sorry if I made you think this, but let me assure you this is not the case.

You know you can only be sorry for your own choices/actions, right? Not for whatever you assume someone else feels, yes? You cannot "make" me feel a certain way. My feelings are my responsibility, not yours. So, a better way that respects the boundaries of individuals (I understand if you have trouble with that, but take heed, and learn!) is to say, "I'm sorry for <insert your action>" if you do feel you have something to be sorry for.

Overall your comment comes across just about exactly as I thought it would, given your previous ones. For humility, well, perhaps you have a thing or two to learn, indeed. But even that may be too much to ask of you. I suggest, instead, first you take a course in empathy, and then in self-awareness. Then perhaps you'll be equipped to appreciate your humility.

Good luck, sir. And have a pleasant week! Hahaha! :)

Your comment brought me the entertainment I needed at this minute. I am grateful. So here's my gift to you, youngin: But, I think you're just playing at this role of provocateur--you can do much better--but you haven't figured it out yet (and you know it), and that's your weakness.

So, work out what you really want to do, and then talk to others of 'standards'. Hahahahahahaha! :)


The point about installing tools on demand is correct, however, I have trouble following the other ones. Simplifying or summarizing both the usage and the output of common tools clearly does have value, and people do that all the time to great effect. Maybe the specific choices made here are not to your liking, but attacking the concept of doing that is a weird thing to do. Also, why does it matter how a script was generated if it does the job? Especially given the fact that scripts like these often serve as a starting point for further customization, is there really a meaningful difference between asking ChatGPT and copying off some random person on the internet? You need to double-check for functionality and safety either way.


I'm sure someone could find a way to start from ChatGPT generated stuff and do good work knowing full well the risks and the consequences.

But posting something you just generated on HN without any refinement is not very helpful and it is obvious at a first glance that the script as posted will not help you at the task it tries to tackle and I'm not even a proper sysadmin.

> why does it matter how a script was generated if it does the job?

It doesn't here, and that's the point: it's all too easy to believe that ChatGPT does the job. It probably sometimes does, but here it is obvious it didn't. It should have told the user that they should carefully pick the tools they will likely need depending on what their production runs, learn to use them focusing on the features that matter to the situation, and install them the way their production is setup, and that ChatGPT can't really generate a script for this because it would need to know how production is deployed, and what if the prod is setup using immutable images for instance, as the article cites?

ChatGPT will answer in a convincing way, before you notice that no proper answer can really be found given the lack of context. And when you have the answer before your nose, it's even more difficult to notice.


In fact, it's useful. If you don't find it so, I suggest you simply don't know how to use it. That in itself is expertise of a tool which you lack. Could it be not more humble to consider that you could be educated by this tool that's read millions? I am educated by it. Are you above that? You seem to think so. Hahahahaha! :)


And yet, so useful it is, despite flagging. You can have it all in one place, and have Chat GPT write it for you, so you don't have to type it all out. And you've saved yourself time. Is time not important to you?

Hahahahaha! :)



