Unix Admin Horror Story Summary (1992) (cam.ac.uk)
152 points by voxadam on Dec 21, 2021 | 91 comments



This takes me back to roughly 1993.

I was in a department running on a mix of Wyse green-screen terminals and, later, X terminals, when we got a budget upgrade that would roll out actual individual PCs -- 486s running SCO Open Desktop -- to everyone. (This was not cheap: it cost about £4000 per seat for the hardware, although the software was free because, er, this was back in the day when SCO was a respectable UNIX development house rather than a serial litigation zombie, and we were SCO's techpubs department.)

Anyway, the editors, who were techpubs management (and thereby stronger on the management than the tech side of things), got their workstations before anyone else. And one of them thought, "ooh goodie, my very own UNIX system!" And proceeded to do "sudo chown -R me:me /" (substitute their username and group for "me") all over the root filesystem.

It's amazing what breaks when every shared library suddenly belongs to a random user, isn't it?


I've seen a bash history where "sudo chown -R me:me /" was followed up by "sudo chown +R me:me /". At least they tried.


I read that and went "oooh I wonder what +R does."

Time for bed.


> rather than a serial litigation zombie

"this sounds like cstross"

<checks username>

"heh"


He knew just enough to shoot himself in the foot.


What makes you think $MANAGER was male?


heh I have an almost identical war story.


Ironically, while I love Unix, I have spent most of my career shepherding Windows boxes. The only real horror story I got was a new coworker (turning me from "the IT guy" into "half the IT department") who looked through the Active Directory tree and found the GPO management part had replicated the organizational structure. Since there were no GPOs at the time, he considered this wasteful and confusing, so he went and deleted it.

...

Except that what he did delete, it turned out, was the actual organizational structure of the Active Directory tree, including ALL user accounts. (It's hard to explain without visual aids; the UI gave no indication it would delete the actual AD objects, not just the (non-existent) GPOs.)

Before long, people started calling to let us know they could no longer log into their computers or the terminal server. Sigh. It was a fairly stressful morning.

We really tried, for about 45 minutes, to resurrect the Active Directory tree, but it was no good (this was Windows Server 2008, so no AD Recycle Bin), so we had to restore the server from backup. I have since learnt that there is backup software that allows you to restore, say, your AD tree, or maybe even just a part of it, instead of the whole machine. Well, the backup software we had at the time suuuuucked, so not only did we have to restore the entire server, but we had to literally sit all day and watch the progress bar move at glacial speed.

In the end, we had the server up and running again, and fortunately both the company's CEO and most employees actually welcomed the opportunity to finally, FINALLY clean up their desks, something every single one of them had been delaying for a long time. And by the time we were done, I was just so exhausted I wasn't even mad at the newbie anymore.

At least we learnt from that mistake, though. Got ourselves a second domain controller, and a much better backup solution. In retrospect, I think it was probably a good thing - our boss took it with good humor, no data was lost, our backup system worked, but we also saw how badly it sucked, and the incident gave us some leverage to get the funding for said upgrades. Also, everyone had a clean desk, and since it was a Friday, a couple of coworkers decided to start their weekend early.


This story worked out surprisingly well; usually, not so much (:


We were lucky restoring from backup worked, it had never been tested before. We were really lucky the CEO and the other coworkers took it so well.

And we were extremely lucky that the CEO was more interested in preventing such an outage in the future than in playing the blame game. I have a hunch this kind of situation can get really ugly once people start losing their cool.


This one made me LOL:

   My mistake on SunOS (with OpenWindows) was to try and clean up all the
   '.*' directories in /tmp. Obviously "rm -rf /tmp/*" missed these, so I
   was very careful and made sure I was in /tmp and then executed
   "rm -rf ./.*".

   I will never do this again. If I am in any doubt as to how a wildcard
   will expand I will echo it first.
I read this, and just had to go try it because I couldn't picture it in my brain. Here it is:

   $ echo ./.*
   ./. ./..
So if you're in /tmp/ and do 'rm -rf ./.*', it's

  rm -rf ./. ./..
and ./.. is .., which from /tmp is /. Thankfully we have protections against this now. Back then, not so much.
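If you ever do need to clean out dotfiles on a system without that protection, a safer approach (my own sketch, not something from the original story) is a glob that simply cannot expand to . or .., echoed first as the poster above suggests:

    $ cd /tmp
    $ echo .[!.]* ..?*     # check the expansion first
    $ rm -rf .[!.]* ..?*   # only once the echo looks sane

`.[!.]*` requires the second character to be something other than a dot, and `..?*` requires at least a third character, so neither pattern can ever match . or the parent directory.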


As a brand-new UNIX user around 1991 and coming from MSDOS, I was used to being 'God' on my machine so I thought it would be a good idea to login always as 'root'. I couldn't understand why long-time users of UNIX knew that to be a very bad idea, and always advised against it.

Like the SunOS person, I had lots of dot-directories in /tmp. So I duplicated his actions almost perfectly.

Also like he did, I was wondering why an 'instantaneous' action was taking so long. I had to do a complete re-installation.

Ever since then, I've always logged in as a 'normal user'. My reign as 'God' in a UNIX environment lasted about 2-3 weeks in total.


Huh, what do you know. I thought modern rm only protected against deleting / by absolute path, but it looks like it'll protect you from deleting your parent regardless:

    $ docker run --rm -ti debian:11  # sandbox the danger...
    root@c70dde9f38a3:/# cd /tmp
    root@c70dde9f38a3:/tmp# rm -rf ./.
    rm: refusing to remove '.' or '..' directory: skipping './.'
    root@c70dde9f38a3:/tmp# mkdir -p /tmp/1/2/3/4/5
    root@c70dde9f38a3:/tmp# cd /tmp/1/2/3/4/5
    root@c70dde9f38a3:/tmp/1/2/3/4/5# rm -rf ./.
    rm: refusing to remove '.' or '..' directory: skipping './.'


Interestingly

  $ bash -c 'cd /var/empty; echo .*'
  . ..
but

  $ ksh -c 'cd /var/empty; echo .*'
  .*
Seems some ksh dev got bitten and restricted this ...
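If I'm remembering the GLOBIGNORE semantics right, you can get the same safety in bash: once GLOBIGNORE is set to any non-null value, . and .. are never matched by a glob, e.g.

    $ bash -c 'GLOBIGNORE=.; cd /var/empty; echo .*'
    .*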


Actually, this is exactly what the Oracle installer did in '93-'94 after a junior sysadmin tried to install it from /tmp.


At least he didn't have to install Solaris on Sun executive's workstations.

Michael Tiemann on "The Worst Job in the World":

http://www.art.net/~hopkins/Don/unix-haters/slowlaris/worst-...

>I have a friend who has to have the worst job in the world: he is a Unix system administrator. But it's worse than that, as I will soon tell. [...]

https://en.wikipedia.org/wiki/Michael_Tiemann

>Michael Tiemann is vice president of open source affairs at Red Hat, Inc., and former President of the Open Source Initiative. [...] He co-founded Cygnus Solutions in 1989. [...] Opensource.com profiled him in 2014, calling him one of "open source's great explainers."

https://news.ycombinator.com/item?id=20006186

http://www.poppyfields.net/filks/00070.html

The Day SunOS Died

    "Bye, bye, SunOS 4.1.3!
    ATT System V has replaced BSD.
    You can cling to the standards of the industry
    But only if you pay the right fee -- 
    Only if you pay the right fee . . ."


I remember being assigned to look into Solaris when working as a volunteer sysadmin in grad school, where we were a SunOS shop. I took a sparcstation, wiped it, and installed Solaris. This was 1992 or so, so it must have been 5.0 or 5.1. I hated it, but I don't remember very many specifics about why I didn't like it. I think it was partially the unbundled compilers, combined with everything just being "different", combined with perceived slowness. That was the last place I worked with Suns, as my first job was sysadmin'ing DEC Ultrix boxes, and DEC Alphas. Ultrix & OSF/1 were much closer to SunOS than Solaris, ironically.

I do wish that Sun would have evolved the BSD kernel rather than jumped to System V.


That was around the time that GCC finally started to get some wind, due to the unbundling of the UNIX SDK.


I think it also got wind because it was just so much easier to compile stuff with gcc than it was to use the vendor compilers, with all of their incompatible flags and extensions. This was in the days before package managers, when everybody compiled open source stuff themselves, a lot of things didn't use autoconf, etc.

I remember compiling almost all open source stuff (emacs, tex, postscript, file utils, etc) with GCC, and reserving the vendor compiler for situations where performance actually matters (math / linear algebra packages, professors' code).

EDIT: I remember a few years where people tended to assume all the world ran SunOS 4.1, just like people assume all the world runs some flavor of debian/ubuntu now.


The unbundling of the free C compiler and the high price of the unbundled C compiler and AT&T's shitty bloated C++ compiler was emblematic of what was so bad about Sun abandoning their Berkeley BSD roots and getting into bed with AT&T System V with Solaris. And that provided an opportunity for Cygnus Solutions.

Not coincidentally, after he founded Cygnus Solutions (which Red Hat later bought), Michael Tiemann worked closely with Sun to support GCC on their platform.

https://web.archive.org/web/20160310075610/http://www.toad.c...

>We had the grandiose idea that major computer companies like Sun, SGI, and DEC would fire their compiler departments and use our free compilers and debuggers instead, paying us a million dollars a year for support and development. That wasn't quite right, but before we starved, we stumbled into the embedded systems market, doing jobs for Intel (the i960, a now-forgotten RISC chip), AMD (their now-forgotten but nice 29000 RISC), and various companies like 3Com and Adobe who had to port major pieces of code to these chips. In that market, once we fixed the tools to support cross-compiling, we had major advantages over the existing competitors, and we swarmed right through the market for 32-bit embedded system programming tools. And ultimately, we did get million-dollar contracts, such as one from Sony for building Playstation compilers and emulators. This allowed game developers to start working a year before the Playstation hardware was available. This enabled the Playstation to come to market sooner, with more and better games.

https://web.archive.org/web/20150701032848/http://www.toad.c...

>Michael Tiemann, President, has been writing free software since 1987. He wrote the code for GNU C's function inlining. He wrote a portable instruction scheduler which boosted GNU C's performance by 30% on the SPARC. He is the author of GNU C++, the first available native code C++ compiler. Mr. Tiemann has ported the GNU compiler to the SPARC, Motorola 88000, and National 32032 architectures, as well as adding support for Sun's FPA board on Sun 3s. He ported the GNU debugger to the SPARC and Intel 80386 architectures, extended the debugger and linker to handle C++ features, and ported the linker to SPARC.

https://www.oreilly.com/openbook/opensources/book/tiemans.ht...

>The real bombshell came in June of 1987, when Stallman released the GNU C Compiler (GCC) Version 1.0. I downloaded it immediately, and I used all the tricks I'd read about in the Emacs and GDB manuals to quickly learn its 110,000 lines of code. Stallman's compiler supported two platforms in its first release: the venerable VAX and the new Sun3 workstation. It handily generated better code on these platforms than the respective vendors' compilers could muster. In two weeks, I had ported GCC to a new microprocessor (the 32032 from National Semiconductor), and the resulting port was 20% faster than the proprietary compiler supplied by National. With another two weeks of hacking, I had raised the delta to 40%. (It was often said that the reason the National chip faded from existence was because it was supposed to be a 1 MIPS chip, to compete with Motorola's 68020, but when it was released, it only clocked .75 MIPS on application benchmarks. Note that 140% * 0.75 MIPS = 1.05 MIPS. How much did poor compiler technology cost National?) Compilers, Debuggers, and Editors are the Big 3 tools that programmers use on a day-to-day basis. GCC, GDB, and Emacs were so profoundly better than the proprietary alternatives, I could not help but think about how much money (not to mention economic benefit) there would be in replacing proprietary technology with technology that was not only better, but also getting better faster.


2003 or 2004. Customer called in and said that his dedicated server was hacked. I restored from backup.

An hour later, he calls back. Hacked again. Restored again.

An hour later, he calls back. He realizes that the hacker is him! He's doing a thing, but doesn't know what he's doing wrong. So I have him email me the last thing he typed on his server, as root:

   rm -rf /home/user/path/to/thing /home/otheruser/path/to/somethingelse / home/path/to/some/other/thing/altogether


This is precisely why I always `-v` when `rm`ing recursively. It might be closing the barn door after the proverbial horse has bolted; but at least the fuck up is visible and in some circumstances you have a fighting chance to kill `rm` before too much damage has been done.


How about just don't use -f all the damn time?


I neither use -f "all the damn time" nor do I appreciate that baseless accusation. Furthermore, -f has literally nothing to do with my comment about -v


I was building on your response to the GP. Instead of using -v to see files as they're being deleted, wouldn't it be safer to not use -f, which will not delete important files in the first place?

I see it countless times all over the Internet - everyone is constantly using -rf no matter the situation. The most common horror story in the article is from people using -rf and losing data. The whole point of -f is so normal usage (without -f) doesn't delete important files.


You misunderstand what -f does. You are right that it is used all over the place by default and that it does add more risk than not using it; however, you're wrong that the lack of -f is risk-free. Unfortunately, even without the -f flag it is still possible to delete "important" files, given that "important" can mean a spectrum of things, both in terms of file system metadata and in terms of personal worth.

In fact `rm` doesn't even try to understand what an important file is (and why should it?). All the -f flag does is ignore prompts and non-existent files:

  $ man rm
  ...
       -f, --force
              ignore nonexistent files and arguments, never prompt
  ...
I think it also overrides skipping the removal of files that don't have a write bit set (despite that not being explicit in the man page) but even if that's also true, you're going to have a crap load of important files that are still writable. Thus it's still quite possible to trash your host using `sudo rm -r /`

You might then say "so you shouldn't use sudo either" and that's true. But going back to my earlier point about "what is an important file?", you're still going to have lots of "important" files on your system that aren't root owned. Maybe development files. Maybe personal files. Thus it is trivially easy to remove wanted files with a careless `rm -r ~/$UNSET` (a bug in Steam's Linux installer once did this in fact).

So the benefit of -v isn't mitigated by the non-inclusion of -f. Not even slightly.


Is there an rm variant that bundles an undo command?


There's a few ways to solve this:

+ a command that doesn't actually delete but rather moves the file to a trash folder that gets emptied periodically (this is exactly how desktop environments work)

+ a command that removes the inode but leaves the data intact. This could get pretty messy pretty quickly on SSDs.

+ `rm` behaves as normal but you restore the file using forensic tools / file recovery tools. Similar problem with SSDs as above.

+ `rm` behaves as normal but you run a snapshotting file system such as ZFS (maybe even have a wrapper function around `rm` to take a snapshot before running `/bin/rm`; a rough sketch is at the end of this comment). Any "undo" would be recovering from a snapshot.

+ `rm` behaves as normal but you restore from backup. Which isn't really a solution but still worth presenting here given the range of options I've presented.

Personally I went with ZFS where possible.

If recovery is tricky and I'm throwing wildcards at the problem then I'll do an `ls` first just to double check I'm not doing anything stupid.
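The ZFS wrapper mentioned above doesn't need to be anything fancy; an untested sketch, with the dataset name as a placeholder:

    rm() {
        # take a cheap safety snapshot, then hand off to the real rm
        zfs snapshot "tank/home@pre-rm-$(date +%Y%m%d-%H%M%S)" || return 1
        command rm "$@"
    }

Any "undo" is then a zfs rollback, or a copy back out of the snapshot's .zfs/snapshot directory.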


There are a number of commands named "trash". But I don't want to have to remember to run "trash-empty" occasionally, especially when rm'ing the same large file several times (say, ones made by "dd if=/dev/zero ..."). So either a cron job for emptying the trash, or, I think more elegantly, have "trash" just empty the trash before trashing the next file (i.e. there is only ever one thing in the trash).
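A rough sketch of that single-slot behaviour (the ~/.trash location is arbitrary):

    trash() {
        local dir="$HOME/.trash"
        mkdir -p "$dir"
        # purge whatever the previous victim was...
        command rm -rf -- "$dir"/* "$dir"/.[!.]* "$dir"/..?* 2>/dev/null
        # ...then park the new one
        mv -- "$@" "$dir"/
    }

That way a fat-fingered trash is recoverable right up until the next one, and the dd-sized junk never piles up.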


I wonder if there's a terminal setting or font that renders whitespace like a ]-shaped underscore? Could solve a whole class of bugs.


I've done something similar once, trying to remove backup files left by emacs:

  rm -f * ~
instead of

  rm -f *~
Is it weird to feel nostalgic about this kind of thing?


That was surprisingly subtle. I had to look at it about 3 times before I saw the problem.


The single slash was the first thing I saw, it jumped out to me before I saw anything else on the line. That's the value of experience.


I do not envy you your experience.


It's something you learn just because of this kind of experience. For me, it was my attention to detail that spotted it immediately, and the customer facepalmed hard. It's one of those things that is easier for a second party to spot. It's why I read everything that's critical out loud before hitting enter, or publishing, or whatever.


In 1989 my then girlfriend, now wife, convinced me to go into contracting because it was 3x the pay vs working for our mutual employer, a savings and loan (sort of like a bank, if you're not in the US) in receivership.*

Our mainframe guy had a buddy who said he knew Unix, so I interviewed him and rejected the guy because he clearly did not. The company hired the dude anyway and I went on to a fun contract with Shell where every server was named after a mollusk, which sounded amusing until you had to log in to a box with a twenty-letter name.

A few weeks after I left, my girlfriend called to tell me the head of HR had come by and asked if I had "done anything" to their computers. What? Like what? No, of course not. I called someone at the S & L to find that Mr. New Unix Admin had decided to "optimize" the cabling layout of the Sun servers, which had in-rack storage. He didn't know that changing the cabling order changed device names, and he didn't keep a map of device names to physical disks, which, as a Sybase shop using raw disks, was a big mistake. You had to know enough about Sybase to take it into maintenance mode and rename devices, and if you got it wrong, hello database corruption, as the dbs spanned multiple disks. For the disks with mounted filesystems he didn't leave breadcrumbs on the root of each respective disk so he could mount it and figure out how to fix fstab. I don't think he even knew what fstab was. After breaking everything he claimed I had somehow dialed in and my "hacking" was to blame.

*I wasn't angry or upset at the S & L, but it was a valuable lesson for me. Hired as a 'C' programmer to write mortgage bond software, when the company discovered I had database skills they also had me write a payroll app, which is how I knew everyone's salary and that the previous sysadmin I worked for made $10k/yr more than me. When I asked for a raise to his pay after officially taking over his job (which I had effectively been doing anyway), I got like a $3k raise and the employer claimed it was the best they could do because of their receivership situation.

That was the impetus for looking for another job, because my gf was like, if you can get more you need to do it. Fiiiiine. When I turned in my notice, my boss' exact words were "Oh, shit" because he knew he'd been underpaying me and there was no replacing a kid making $24k a year who was still coding mortgage analysis software, acting as their DBA and sysadmin. Even if I was the type to exact some kind of revenge (I was not), the fact they had to hire two or three people to replace me was enough. When my boss magically had money for a raise, thanks to my aforementioned payroll app I got to say one of the most satisfying things I've ever said to anyone: "Vic, I make more than you do, now. What's your counter?" Sorry if that's a pointless, rambling story. The nightmare was getting accused of hacking some company because of a mishire.

Supporting traders at Enron Capital & Trade was fun because the traders had root to their SPARCstations. In fact the interview was to fix an unbootable SPARCstation with a compressed kernel, something the support team had happen to them when a user needed more disk space. You had to know Solaris well enough to have the installboot syntax memorized and those workstations were so slow to boot you essentially got one shot at it because it was a timed test.


I vaguely thought I posted in this thread back then.

Back in 1984/5, I had a directory in my homedir called "etc", full of miscellaneous stuff. One day I thought: that's a bad name, I should remove it. I errantly typed "rm -rf /etc". Thankfully I got a "Permission Denied" error. Except, I then did the obvious override, "sudo rm -rf /etc" (1). This was on a VAX 11/780 with about 50 undergrads doing project work on it. The command ran for a while and then I heard moans out in the terminal room as the system crashed. It took us about 3 hours to restore from backup tape.

(1) I used to have to explain what "sudo" was, because this happened before we posted it to USENET, and before it was ubiquitous on systems.


When did it become ubiquitous? I first got root privs via sudo on a *nix system in 1991 or so, and I remember it being widely deployed even then.


When I first started using Unix in the late 1980's, there was no sudo. If I needed privileges I used su and got out of it as soon as I was able.


Alas, as cam.ac.uk prepares for the final shutdown of Hermes — their quarter century old Unix mail facility used by members great and small, from telnet to prayer and replaced by Microsoft Outlook on 12/31 — the timing of this post could not be more prescient.

Another onprem service formerly run by experts shutters its doors. In their halcyon days they wrote Exim.

You can’t even forward to gmail because the current sysadmin team have IPv6 set up without PTRs, causing gmail to 500 one in every N deliveries.

I imagine GOOG and the University of Cambridge locked in an endless battle of “don’t you know who I am”, and meanwhile the dons et al shuffle into the abyss of outlook.com penury.

What would ‘fanf2 do?



Worst thing I ever did was cross hard mount NFS volumes across two machines.

With a hard NFS mount, the mount will hang until the other machine responds.

When we had to power cycle the two servers, they would not come up as they were deadlocked waiting for each other. That was exciting.


Speaking of NFS, an ex-coworker had renamed his prior company's servers "notresponding" and "stilltrying".

The NFS client logs must have been glorious.


This is somewhat common in environments with stable power: you basically never have the entire IT system go down and come back up at the same time.


"basically never" is guaranteed to happen at a particularly terrible time, and you won't have recovery procedures, because why would you bother?


How did you fix it?


Been there, done that. Ok, a colleague did it, about 30 years ago ;-)

IIRC resetting the machine and forcing it to boot single-user was the solution.


Not OP, but we recently wanted to mount NFS and the sysadmin was adamant we use the automounter[0] instead of fstab, because if the NFS mount is not available it can hang the kernel.

Not sure if that is true or just sysadmin lore, but it was interesting enough to learn about an alternative.

[0]: https://linux.die.net/man/8/automount


Oh, this can definitely happen; it occurred on my system until I started using nofail as a mount option.
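For reference, this is roughly the fstab shape that avoids the hang (server, path and the exact option mix are placeholders; x-systemd.automount gets you much the same lazy behaviour as the automounter mentioned above):

    # /etc/fstab
    fileserver:/export/data  /mnt/data  nfs  nofail,_netdev,soft,timeo=100,retrans=3,x-systemd.automount  0  0

nofail keeps boot from blocking if the server is unreachable, soft/timeo/retrans let I/O fail instead of hanging forever (with the usual data-integrity caveats for soft mounts), and the automount option defers mounting until first access.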


My very favorite is more of a "recovery legend", telling the heroic tale of recovering a Unix system after an errant "rm -rf" deleted most of the system's critical files:

https://www.ee.ryerson.ca/~elf/hack/recovery.html


Nice! When I saw the thread title I was hoping this story would get posted somewhere. I read this a long time ago and hadn't been able to find it for years!

Thanks for sharing!


I once spent an hour or more finding it again, so I bookmarked it ;)


I love this. It takes us back to a time when administering a Unix system was a Big Deal. Partly because they were rare and expensive. But also they were truly multi-user with dozens of people logged in at any given time.


They still are a big deal, only you can manage up to thousands of them and with modern automation, if you screw up one, you screw up all of them.


Yeah, we need a new page like this. But for "automation disasters".


Not a sysadmin horror story, but I remember as an undergrad in the late 80s / early 90s what happened to me when i left my terminal unattended. Every time I ran my project via "a.out", I would see "Segmentation fault: Core dumped" and there would be a zero-length core file created in the directory.

Turns out a helpful soul had aliased "a.out" to 'echo "Segmentation fault: Core dumped"; touch core'.

That made me learn at an early age to never have "." in my path, and to always execute things in the CWD as ./a.out (among other things).


./a.out() { echo surprise; }


I really enjoy the recovery parts of the stories that have them, like a good Hollywood movie script, but real.

Unix wasn't very common after leaving university and I have more PC/LAN type stories. There was one memorable moment, where I was working very late in the office and got a call. [If working late, the main line would ring the entire office and I could press the blinky light to answer.] It was one of our consultants on the west coast who somehow had a corrupt filesystem, but that machine was the one that had all the project files for the many months of consulting work that the team had been developing. [I don't recall but it may have been CVS or SVN.]

The tricky bit was that it was using OS/2 and its HPFS filesystem so the usual file utilities wouldn't work. We had a number of IBM tech books on our bookshelves (because we also did mainframe consulting) and I'd been reading about terminal streams and one about the HPFS filesystem in particular. It mentioned boot blocks, superblocks, bands, allocation bit blocks, etc.

Being young (and dumb), I went with "what's the worst thing that could happen" and came up with a plan: using the DOS 'nu' (Norton Utility), copy a few choice sectors from a similar-spec-looking machine and try the OS/2 equivalent of 'chkdsk /f' -- the client after all was IBM, known for conformity. We first had to transmit the 'nu' program over the dial-up modem, but then we were copying the first 18 (or so) sectors to get the boot sector, partition table, boot program and other HPFS initial sector data; then there were some sectors in the middle of the disk that served as a kind of main description table, with others in bands (that we didn't bother with). Guessed the starting point and number of sectors. This was a grasping-at-straws Hail Mary. Rebooted the machine, let OS/2 run its chkdsk as it detected a problem, waited a long while until it was done. Unbelievably, it all worked! There might have been a couple of open files lost and some recently deleted files still present, but no big differences. We didn't think we needed to tell anyone. He bought me beers as promised when I came to visit.

Bonus memory: LapLink with the parallel transfer cable was the shit in those days. https://en.wikipedia.org/wiki/LapLink_cable


The first entry about adding tcsh is more or less the basis of one of the questions I use in technical interviews for our Linux team. It's less about specifics on tcsh and more just about explaining the hierarchy of linux/unix and why we have the bin/sbin directories under / and /usr; the more the candidate can explain (or even hypothesize), the more comfortable my team feels with their general curiosity/understanding of Linux/Unix.

It's a niche situation, sure, but being able to understand the system and tooling you're working with well enough to know what options you really have shows a great deal of discipline and curiosity, for me at least. Again, it's less about "can you figure out what to do in this specific situation" and more "can you just explain what you look at every single day in plain and simple terms? Did you ever think about it?"

It's been a surprisingly revealing question for nascent Linux admins on how they react to questioning the things they look at every single day, and how ready they are to __really__ dig into the kernel internals.


It really drives home how expensive storage used to be when you see how many stories are variations on, "In an attempt to free up some disk space, I deleted a critical system file".

The version for those of us who came of age in the 90s is wondering what all this crap in C:\Windows\System32 is, and whether deleting it will give us enough space to install Age of Empires.


>Well one time I was installing a minimal base system of Linux on a friends PC, so that we would have all the necessary utlitities to bring over the rest of the stuff. His 3 1/2 inch disk was dead, so when had to get the 5 1/4 inch version of the boot/root disk. Too bad that version, having to fit in 1.2M instead of 1.44, didn't have tar

Heh ... I wonder how many years from now people will stop knowing what a 3 1/2 and a 5 1/4 inch disk is


As in "You 3D printed the Save icon!" ?

https://logosatwork.com/you-3d-printed-the-save-icon/


Or to understand the confusion regarding 3.5" disks being floppies rather than hard disks.


Ah, the good old days. Single density, double density (720 kB, 1.44 MB). I also heard of 2.88 MB floppies - never saw one in real life. If I remember correctly, 2.88 MB was double density, double sided, and you needed a special floppy drive.


A quirk of accessibility of low-level formatting meant that you could persuade a 1.44MB disk drive to create a 1.72MB disk reliably, and a 1.76MB disk slightly less reliably.

IIRC, Microsoft decided to use the 1.72MB quirk for physical distribution of their software, since it made a significant cost saving in those days before cheap optical media and before network distribution of commercial consumer software.


Single-sided: 360 KB, double-sided (or double density): 720 KB, high density: 1.44MB.

There were 2.88 MB disks and drives but they never gained much traction, because they were expensive and the PC industry kept promising various "floppy killers" like the ZIP drive.


The SuperDisk was what I wish had caught on. It could read HD floppies (though not any 2.88MB formats) and at some iteration was able to treat a single HD floppy as a 32MB(!!) tape drive.

IOMega seemed to have better marketing though.


I wish Sony had made a serious effort to make MiniDisc a data format. They did make data drives, but they were slow, expensive, and barely available.

140 MB data (in 1993!). Nearly infinite rewrites. Nature of the medium is that bit rot is almost impossible without direct physical damage. I enjoyed them in the twilight of their existence, just before MP3 players started to have reasonable storage capacities and prices (mine were/are all used). And the ones I recorded in 2002 still sound just fine if I play them today. It was a great replacement for cassettes; too bad it never caught on in the US.


FWIW Zip was only 1 year later with 100MB (and IMO less durable media). I agree that MD was better, but Sony in the 90s didn't seem to want to do anything that they couldn't completely own themselves.

My recollection is that Zip media was also way cheaper than blank MDs, but I don't have any sources for that.


I thought I would find these funny and instead they just made me anxious. I mean, they are funny, I guess I just have PTSD from years of Unix administration.


Came here to say the same thing... Unix sysadmin since 1988. I would laugh about these stories but I just cannot without a lump in my throat and a few mea maxima culpas in my own mind. Reminds me of having heard somewhere that Tom Waits, on watching the litany of road show disasters parodied on This Is Spinal Tap, wept rather than laughed.

My own worst Unix Admin Horror Story is a variant of the classic ''accidental delete-restore from backup if you've got one'' scenario: in the early '90s I accidentally repopulated all YP tables on a production Sparcstation 10 machine in real time on a busy workday, but the tables had not been kept up to date! It took until the next day to restore from backup and a further day of research and testing to get all the YP tables up to proper state, then write scripts to keep them updated. (This was before Sun was legally forced to rename YP to NIS, btw).


This reminds me of the time when I executed `rm -rf ~` in hopes of deleting an erroneously created directory.

Or when my "wrongly" expanded `mv` command moved all the files and directories in home into the last directory of the home directory, which was an NTFS mount, leading to a loss of all the file permissions.


I had a contractor installing software last week. He was granted full sudo permissions despite having demonstrated a unique set of command line skills. Everything was going fine until I got a text from his manager saying the contractor couldn't SSH in this morning. Turns out he had gotten frustrated with some file/folder permissions in the directory where he was supposed to install the software package. So he simply ran `chmod -R 777 /*`

Needless to say that required a full restore from the previous night's backup since he hadn't snapshotted the VM before beginning his work. He was very angry that he had lost 2 days of work. I was very sympathetic...


Why was he not fired?


He was a contractor so that'd be up to the vendor. I don't know if the project sponsor requested a different engineer.


> But the most important thing that can be learned from this is not that you have to make backups (we all know that, right? ;-) ). More important than making backups is to make sure your backups are complete and verified

Plus ça change…


It is nice to see how replies have the person's real name and organization in the header, something you rarely see in internet discussions today (FB being one big exception). Brings back memories of using Usenet in the 90s.


One thing I do miss after reading these stories is that brief period where vendor Unices were everywhere and none of them worked quite the same: AIX, SunOS, the DEC variants, whatever ran on the AViiONs, Pyramid, SCO, ISC, Xenix, etc.

It was glorious. Now it's Debian varieties or Apple's BSD abomination and nothing else.

Even Windows Server is getting steamrollered. All the progress is now happening elsewhere, i.e. virtualisation, containers, devops.

I kinda miss the days where the iron was expensive and screwing up your Pyramid kicked 200 terminal users to the curb while you figured it out.


I found my favorite story buried in the middle of this, from 1986. It's a classic on par with The Story of Mel, a Real Programmer. Reproduced here for your reading pleasure:

  Have you ever left your terminal logged in, only to find when you came
  back to it that a (supposed) friend had typed "rm -rf ~/*" and was
  hovering over the keyboard with threats along the lines of "lend me a
  fiver 'til Thursday, or I hit return"?  Undoubtedly the person in
  question would not have had the nerve to inflict such a trauma upon
  you, and was doing it in jest.  So you've probably never experienced the
  worst of such disasters....
  
  It was a quiet Wednesday afternoon.  Wednesday, 1st October, 15:15
  BST, to be precise, when Peter, an office-mate of mine, leaned away
  from his terminal and said to me, "Mario, I'm having a little trouble
  sending mail."  Knowing that msg was capable of confusing even the
  most capable of people, I sauntered over to his terminal to see what
  was wrong.  A strange error message of the form (I forget the exact
  details) "cannot access /foo/bar for userid 147" had been issued by
  msg.  My first thought was "Who's userid 147?; the sender of the
  message, the destination, or what?"  So I leant over to another
  terminal, already logged in, and typed
          grep 147 /etc/passwd
  only to receive the response
          /etc/passwd: No such file or directory.
  
  Instantly, I guessed that something was amiss.  This was confirmed
  when in response to
          ls /etc
  I got
          ls: not found.
  
  I suggested to Peter that it would be a good idea not to try anything
  for a while, and went off to find our system manager.
  
  When I arrived at his office, his door was ajar, and within ten
  seconds I realised what the problem was.  James, our manager, was
  sat down, head in hands, hands between knees, as one whose world has
  just come to an end.  Our newly-appointed system programmer, Neil, was
  beside him, gazing listlessly at the screen of his terminal.  And at
  the top of the screen I spied the following lines:
          # cd
          # rm -rf *
  
  Oh, shit, I thought.  That would just about explain it.
  
  I can't remember what happened in the succeeding minutes; my memory is
  just a blur.  I do remember trying ls (again), ps, who and maybe a few
  other commands beside, all to no avail.  The next thing I remember was
  being at my terminal again (a multi-window graphics terminal), and
  typing
          cd /
          echo \*
  I owe a debt of thanks to David Korn for making echo a built-in of his
  shell; needless to say, /bin, together with /bin/echo, had been
  deleted.  What transpired in the next few minutes was that /dev, /etc
  and /lib had also gone in their entirety; fortunately Neil had
  interrupted rm while it was somewhere down below /news, and /tmp, /usr
  and /users were all untouched.
  
  Meanwhile James had made for our tape cupboard and had retrieved what
  claimed to be a dump tape of the root filesystem, taken four weeks
  earlier.  The pressing question was, "How do we recover the contents
  of the tape?".  Not only had we lost /etc/restore, but all of the
  device entries for the tape deck had vanished.  And where does mknod
  live?  You guessed it, /etc.  How about recovery across Ethernet of
  any of this from another VAX?  Well, /bin/tar had gone, and
  thoughtfully the Berkeley people had put rcp in /bin in the 4.3
  distribution.  What's more, none of the Ether stuff wanted to know
  without /etc/hosts at least.  We found a version of cpio in
  /usr/local, but that was unlikely to do us any good without a tape
  deck.
  
  Alternatively, we could get the boot tape out and rebuild the root
  filesystem, but neither James nor Neil had done that before, and we
  weren't sure that the first thing to happen would be that the whole
  disk would be re-formatted, losing all our user files.  (We take dumps
  of the user files every Thursday; by Murphy's Law this had to happen
  on a Wednesday).  Another solution might be to borrow a disk from
  another VAX, boot off that, and tidy up later, but that would have
  entailed calling the DEC engineer out, at the very least.  We had a
  number of users in the final throes of writing up PhD theses and the
  loss of a maybe a weeks' work (not to mention the machine down time)
  was unthinkable.
  
  So, what to do?  The next idea was to write a program to make a device
  descriptor for the tape deck, but we all know where cc, as and ld
  live.  Or maybe make skeletal entries for /etc/passwd, /etc/hosts and
  so on, so that /usr/bin/ftp would work.  By sheer luck, I had a
  gnuemacs still running in one of my windows, which we could use to
  create passwd, etc., but the first step was to create a directory to
  put them in.  Of course /bin/mkdir had gone, and so had /bin/mv, so we
  couldn't rename /tmp to /etc.  However, this looked like a reasonable
  line of attack.
  
  By now we had been joined by Alasdair, our resident UNIX guru, and as
  luck would have it, someone who knows VAX assembler.  So our plan
  became this: write a program in assembler which would either rename
  /tmp to /etc, or make /etc, assemble it on another VAX, uuencode it,
  type in the uuencoded file using my gnu, uudecode it (some bright
  spark had thought to put uudecode in /usr/bin), run it, and hey
  presto, it would all be plain sailing from there.  By yet another
  miracle of good fortune, the terminal from which the damage had been
  done was still su'd to root (su is in /bin, remember?), so at least we
  stood a chance of all this working.
  
  Off we set on our merry way, and within only an hour we had managed to
  concoct the dozen or so lines of assembler to create /etc.  The
  stripped binary was only 76 bytes long, so we converted it to hex
  (slightly more readable than the output of uuencode), and typed it in
  using my editor.  If any of you ever have the same problem, here's the
  hex for future reference:
          070100002c000000000000000000000000000000000000000000000000000000
          0000dd8fff010000dd8f27000000fb02ef07000000fb01ef070000000000bc8f
          8800040000bc012f65746300
  
  I had a handy program around (doesn't everybody?) for converting ASCII
  hex to binary, and the output of /usr/bin/sum tallied with our
  original binary.  But hang on---how do you set execute permission
  without /bin/chmod?  A few seconds thought (which as usual, lasted a
  couple of minutes) suggested that we write the binary on top of an
  already existing binary, owned by me...problem solved.
  
  So along we trotted to the terminal with the root login, carefully
  remembered to set the umask to 0 (so that I could create files in it
  using my gnu), and ran the binary.  So now we had a /etc, writable by
  all.  From there it was but a few easy steps to creating passwd,
  hosts, services, protocols, (etc), and then ftp was willing to play
  ball.  Then we recovered the contents of /bin across the ether (it's
  amazing how much you come to miss ls after just a few, short hours),
  and selected files from /etc.  The key file was /etc/rrestore, with
  which we recovered /dev from the dump tape, and the rest is history.
  
  Now, you're asking yourself (as I am), what's the moral of this story?
  Well, for one thing, you must always remember the immortal words,
  DON'T PANIC.  Our initial reaction was to reboot the machine and try
  everything as single user, but it's unlikely it would have come up
  without /etc/init and /bin/sh.  Rational thought saved us from this
  one.
  
  The next thing to remember is that UNIX tools really can be put to
  unusual purposes.  Even without my gnuemacs, we could have survived by
  using, say, /usr/bin/grep as a substitute for /bin/cat.
  
  And the final thing is, it's amazing how much of the system you can
  delete without it falling apart completely.  Apart from the fact that
  nobody could login (/bin/login?), and most of the useful commands
  had gone, everything else seemed normal.  Of course, some things can't
  stand life without say /etc/termcap, or /dev/kmem, or /etc/utmp, but
  by and large it all hangs together.
  
  I shall leave you with this question: if you were placed in the same
  situation, and had the presence of mind that always comes with
  hindsight, could you have got out of it in a simpler or easier way?
  Answers on a postage stamp to:
  
  Mario Wolczko


The story about deleting bzero reminds me of Chesterton's Fence. If you don't know why something's there, you have no business deleting it.


The best thing about these, for me, is the nostalgia over the email addresses. So many domains I recognize from back in the day.


I posted this horror story before, with a link to Pete "Gymble Roulette" Cottrell's infamous contest at the end (which I wasn't supposed to tell anyone outside of UMD CS Dept staff about):

https://news.ycombinator.com/item?id=15802533

Pyramid's OSx version of Unix (a dual-universe Unix supporting both 4.xBSD and System V) [1] had a bug in the "passwd" program, such that if somebody edited /etc/passwd with a text editor and introduced a blank line (say at the end of the file, or anywhere), the next person who changed their password with the setuid root passwd program would cause the blank line to be replaced by "::0:0:::" (empty user name, empty password, uid 0, gid 0), which then let you get a root shell with 'su ""', and log in as root by pressing the return key to the Login: prompt. (Well it wasn't quite that simple. The email explains.)

https://en.wikipedia.org/wiki/Pyramid_Technology

Here's the email in which I reported it to the staff mailing list.

    Date: Tue, 30 Sep 86 03:53:12 EDT
    From: Don Hopkins <don@brillig.umd.edu>
    Message-Id: <8609300753.AA22574@brillig.umd.edu>
    To: chris@mimsy.umd.edu, staff@mimsy.umd.edu,
            Pete "Gymble Roulette" Cottrell <pete@mimsy.umd.edu>
    In-Reply-To: Chris Torek's message of Mon, 29 Sep 86 22:57:57 EDT
    Subject: stranger and stranger and stranger and stranger and stranger

       Date: Mon, 29 Sep 86 22:57:57 EDT
       From: Chris Torek <chris@mimsy.umd.edu>

       Gymble has been `upgraded'.

       Pyramid's new login program requires that every account have a
       password.

       The remote login system works by having special, password-less
       accounts.

       Fun.

    Pyramid's has obviously put a WHOLE lot of thought into their nifty
    security measures in the new release. 

    Is it only half installed, or what? I can't find much in the way of
    sources. /usr/src (on the ucb side of the universe at lease) is quite
    sparse. 

    On gymble, if there is a stray newline at the end of /etc/passwd, the
    next time passwd is run, a nasty little "::0:0:::" entry gets added on
    that line! [Ye Olde Standard Unix "passwd" Bug That MUST Have Been Put
    There On Purpose.] So I tacked a newline onto the end with vipw to see
    how much fun I could have with this....

    One effect is that I got a root shell by typing:

    % su ""

    But that's not nearly as bad as the effect of typing:

    % rlogin gymble -l ""

    All I typed after that was <cr>:

    you don't hasword: New passhoose one new
    word: <cr>
    se a lonNew passger password.
    word: <cr>
    se a lonNew password:ger password.
    <cr>
    Please use a longer password.
    Password: <cr>
    Retype new password: <cr>
    Connection closed

    Yes, it was quite garbled for me, too: you're not seeing things, or on
    ttyh4. I tried it several times, and it was still garbled. But I'm not
    EVEN going to complain about it being garbled, though, for three
    reasons: 1) It's the effect of a brand new Pyramid "feature", and
    being used to their software releases, it seems only trivial cosmetic,
    comparitivly.  2) I want to be able to get to sleep tonight, so I'm
    just going to pretend it didn't happen. 3) There are PLEANTY of things
    to complain about that are much much much worse. [My guess, though,
    would be that something is writing to /dev/tty one way, and something
    else isn't.]  Except for this sentence, I will also completely ignore
    the fact that it closed the connection after setting the password, in
    a generous fit of compassion for overworked programmers with
    ridiculous deadlines.

    So then there was an entry in /etc/passwd where the ::0:0::: had been:

    :7h37OHz9Ww/oY:0:0:::

    i.e., it let me insist upon a password it thought was too short by
    repeating it. (A somewhat undocumented feature of the passwd program.)
    ("That's not a bug, it's a feature!")

    Then instead of recognizing an empty string as meaning no password,
    and clearing out the field like it should, it encrypted the null
    string and stuck it there. PRETTY CHEEZY, PYRAMID!!!! That means
    grepping for entries in /etc/passwd that have null strings in the
    password field will NOT necessarily find all accounts with no
    password. 

    So just because I was enjoying myself so much, I once again did:

    % rlogin gymble -l ""

    Password: <cr>
    [ message of the day et all ]
    #

    Wham, bam, thank you man! Instead of letting me in without prompting
    for a password [like it should, according to everyone but pyramid], or
    not allowing a null password and insisting I change it [like it
    shouldn't, according to everyone but pyramid], it asked for a
    password. I hit return, and sure enough the encrypted null string
    matched what was in the passwd entry. It was quite difficult to resist
    the temptation of deleting everyone's files and trashing the root
    partition.

        -Don

    P.S.: First one to forward this to Pyramid is a turd.
P.P.S.: The origin story of Pete's "Gymble Roulette" nick-name is here:

http://art.net/~hopkins/Don/text/gymble-roulette.html

The postscript comment was an oblique reference to the fact that I'd previously gotten in trouble for forwarding Pete's hilarious "Gymble Roulette" email to a mailing list, and somehow it found its way back to Pyramid. In my defense, he did say "Tell your friends and loved ones."


What a small world; I read this comment while sitting in the UMD CS department machine room.

Glad bad login programs aren't something I have to deal with anymore (knock on wood).


This was around 2009. A colleague in the team was checking a high CPU alert on a payment system cluster. He found some PIDs that were consuming high CPU and asked someone in the team how to get more details. He was told to use fuser to check. He did use fuser, but as fuser -cuk. The -k, which kills the processes, created quite a mess.
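For anyone who hasn't been bitten by it, the difference is one letter (flags per the Linux psmisc fuser; Solaris fuser is similar, if I recall correctly; the mount point is a placeholder):

    $ fuser -cu /some/mountpoint    # list the PIDs (and owners) using the filesystem
    $ fuser -cuk /some/mountpoint   # same, except -k also SIGKILLs every one of them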


source ~/.bash_history


Hah, before CI/devops tools were popular and you had to set up multiple identical/failover machines, you could scp your .bash_history, clean it up a bit, and source it. Neat.


One small (very) past thread:

Unix Admin. Horror Story Summary, version 1.0 (old) - https://news.ycombinator.com/item?id=721578 - July 2009 (6 comments)


So much of this can be filed under "before we culturally accepted prod is different.."


Most of these stories relate to administering interactive multi-user machines, not the kind of thing we now think of as a server.

Users were simultaneously logged in at the shell going about their business in *nix, not sending stateless requests in to a server process.

And you certainly couldn't afford to have a duplicate of a machine that expensive.

The idea of multiple environments didn't really exist, and you mostly administered machines from within - hence many of the stories being about getting enough tools working again to straighten it out. You didn't have another machine (or perhaps the connectivity) to put the thing on the operating table from a working system.

Things were different...


Try "before hardware was cheap enough that companies could run dedicated non-prod instances"


Ahh the memory of case sensitive domains!



