When should I not kill -9 a process? (unix.stackexchange.com)
84 points by yiedyie on May 24, 2014 | 83 comments



As is often the case with Stack Exchange answers, all of them are wrong in different ways. You should only kill -9 when every other signal the program is likely to respond to has not worked. kill -9 is likely to leave a program in a state that requires manual intervention, especially if that program is a database.

If you're a developer, before you kill -9 a program, send SIGTERM (i.e. kill with no signal argument, or kill -15). If the program does not respond, run gdb -p <pid> and then "thread apply all bt" before killing it. At the very least, you should get a good idea of why it was not responding to other signals.
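
A rough sketch of that sequence, assuming a bash-like shell with gdb installed (<pid> is a placeholder for the stuck process):

    kill -TERM <pid>                               # polite request first (same as plain kill)
    sleep 10                                       # give it a chance to clean up
    kill -0 <pid> 2>/dev/null || exit 0            # already gone? nothing left to do
    gdb -p <pid> -batch -ex 'thread apply all bt'  # grab backtraces from every thread
    kill -KILL <pid>                               # only now fall back to SIGKILL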


As a useful variant of this: I have grown accustomed to killing processes that hang with kill -11 (SIGSEGV). This is essentially the same as -9, with the exception that it creates a coredump.

Saves a lot of hassle with manually attaching GDB.


It's probably better to use SIGABRT (-6) for that, as it also dumps core and is less likely to be handled by the process (there are valid reasons to handle SIGSEGV and continue execution).


I never thought of handling SIGSEGV. What would you use it for?


It's mostly about removing checks for rare occurrences:

* catching NULL pointer dereferences (i.e. transforming SEGV into a catchable userspace exception)

* testing whether some in-memory data has been changed/accessed (i.e. mprotect(), wait for SEGV, set a flag, un-mprotect(), return)

* paging in userspace (e.g. across the network)

All that seems obscure (and is mainly useful for virtual machines and such), but it is not so rare to find a process that actually does this (usually because of some library, or because it is some kind of virtual machine).


Many applications will try to print their own stack before exiting on SIGSEGV.


Yup, though there are risks with that approach too. You should be sure that the ulimit for core size is large enough (it almost never is by default). Also, a core dump can take a very long time if your program's address space is large. It might involve writing tens of gigabytes to disk. So not only do you need the file system space, you also need to be prepared to wait tens of minutes while your program is dumping core.
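
For reference, checking and raising the core limit before sending the signal looks roughly like this (bash-style shell assumed; the limit applies to processes started from this shell, not ones already running -- see prlimit(1) for those):

    ulimit -c                           # current core size limit, usually 0 by default
    ulimit -c unlimited                 # allow full core dumps for this shell's children
    cat /proc/sys/kernel/core_pattern   # where/how Linux will write the dump
    kill -ABRT <pid>                    # <pid> is a placeholder; SIGABRT also dumps core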


While what you say is true, this is not really a problem in practice; most of the time, the process doesn't have enough memory allocated for it to be a serious problem, and the coredump file is written in a smart way [1]: the address space usually has many 'gaps', and instead of writing these all out as \0 characters to disk, it uses more elaborate storage techniques. The end result is that you can have a coredump file which is reported to have a size of 1GB while only having 100MB written on disk.

[1] https://en.wikipedia.org/wiki/Core_dump#Format
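
You can see the effect yourself by comparing the apparent size with the blocks actually allocated (assuming a dump file named core in the current directory):

    ls -lh core   # apparent size, e.g. 1.0G
    du -h core    # blocks actually on disk, e.g. 100M for a sparse dump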


You can't really corrupt a database that easily, can you? That's half the point of using a database, so you have transactions, etc.


I don't remember the exact details, but I once had a DB admin flabbergasted because there was a non-numeric character in a field/row that was numeric. The DB engine crashed upon starting and could not clean it up.

I think it was Oracle.

If I remember correctly, we had to get the vendor involved.

This happened after an unclean shutdown (power failure). The disks were RAID, but the cache battery was dead, so some corrupt data was written to the disks at the block level.

This could happen on a kill -9 too.


not corrupt != everything is peachy

At the very least, you need to be prepared for a possibly very long replay of logs. Also, a huge number of people run databases in a configuration that doesn't make those guarantees. For example, many people will run mysql (esp. less performant slaves) with innodb_flush_log_at_trx_commit = 0 for performance with the understanding that a failure might require manual fixes.


A failure with that parameter will only cause transactions committed in the last second to be lost. The DB won't require manual fixes.


> The DB won't require manual fixes.

Losing a transaction that was committed upstream can lead to a bunch of manual fixes in replicated database clusters. What happens if rows inserted in that lost transaction are eventually updated? Replication breaks, you get paged, and now you're faced with either manually fixing the DB consistency error or rebuilding the whole slave. Woe be unto you if this host is a replication hub with a bunch of slaves hanging off it.


I've never corrupted the DB, but I have ended up with inconsistencies because I needed to do several related queries in a row. Because I didn't send them as a single transaction from my application, some of them ran and some didn't.

Edit: Especially when you have multiple systems that should be consistent. Deleting a user from the DB, but being killed before it can be removed from our auth system, for instance.


The typical way of ensuring consistency in cases where the systems are unable to participate in a transaction would be to use a durable queue between the systems that can participate. So, in your Edit example, you would delete the user from the DB and add the user to a queue in the same database to be deleted from the auth system. Another process would then pull from that queue and attempt to delete from the auth system, and only acknowledges the message once it is successful. You will still temporarily be in an inconsistent state, but ultimately the operation will complete. This design only works for idempotent work.
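
A rough sketch of that pattern with psql; the table and column names are made up purely for illustration:

    # one transaction: delete the user and enqueue the auth-system deletion together
    psql mydb -c "BEGIN;
                  DELETE FROM users WHERE id = 42;
                  INSERT INTO auth_deletion_queue (user_id) VALUES (42);
                  COMMIT;"
    # a separate worker polls auth_deletion_queue, calls the auth system,
    # and deletes the queue row only once that call succeeds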


If a database is not kill -9 resistant then how can it be power-off resistant?


There are ways to misdesign a system such that it is power-off robust but not kill -9 robust. The reason is that power-off means everything dies.

Postgres has special code to account for hard kills: it has a SYSV shared memory segment, to which every child attaches. If the parent dies, the children don't have a good way to know, so they might keep running. If you try to start a new postmaster and it sees that there are still processes attached to the shared memory segment (shm_nattch), it will fail to start.

Were it not for that code, you could start a new parent process, leading to chaos as two sets of backends accessed the same files and shared memory without knowing about each other.
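
You can see that interlock from the shell: ipcs lists SysV shared memory segments along with nattch, the number of processes still attached (the shmid will differ per system):

    ipcs -m           # look at the nattch column for the postgres segment
    ipcs -m -i <id>   # details for a single segment; <id> is the shmid from the listing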


Who says it was?


To anyone saying that you shouldn't "kill -9" a process, or that you should do some song-and-dance first: kill -9 is exactly what the OOM (out-of-memory) killer on linux does when memory is short. Typically, the application has no good way to even know that memory is short, because linux radically overcommits memory and still won't return a NULL from a malloc().

So, software should be written to assume it might be killed if you want to have a robust system.

By the way, using a small fixed amount of memory is no defense, or at least not in all kernel versions. The "badness" heuristic function used to find the victim could end up counting the same byte of memory many times:

http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/
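
If you suspect the OOM killer rather than a colleague, the kernel log records its victims, and the per-process scores are visible under /proc (standard Linux paths; <pid> is a placeholder):

    dmesg | grep -i 'killed process'         # OOM killer logs who it shot and why
    cat /proc/<pid>/oom_score                # current "badness" score of a process
    echo -1000 > /proc/<pid>/oom_score_adj   # as root: exempt a critical process entirely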


Just because a program that is designed to prevent kernel panics due to OOM kill -9s a process doesn't mean that you as a sysadmin should.

kill -15 typically leaves processes in a properly shut down state, which in terms of databases alone means that they will start up without a recovery process (which can be a 20-30 minute operation sometimes). That alone makes waiting a few minutes for a running process to respond to a kill -15 worthwhile.


I was commenting more about how software should be written than what an admin should do.

For admins, I wouldn't fault them much for kill -9; I would fault the software more if it led to anything more than an inconvenience. But sure, it's wise to use -15 or whatever as long as it works.


Fiber optic cables are dug up by backhoes. Hard drives randomly fail. RAM is corrupted by cosmic rays. Racks lose power. CPU fans stop spinning.

If your process relies on not being kill -9'd, then you might as well quit programming and go buy a lottery ticket.


Yes, your process shouldn't eat data on a kill -9. At the same time, in a lot of applications, kill -9 can make other things less convenient. If your app is in a cluster, that job you just kill -9'ed is now suddenly not sending out heartbeat messages, which means the servers around it are waiting, with data stacking up, wondering what happened to that process you just shot in the head. Yes, it'll recover when things come back up, but you've just added more work on the other servers. Or, that server task didn't write the last few log entries. Given you're killing jobs, those stragglers may have data as to why you needed to stop things in the first place. In short, you shouldn't kill -9 right away because well-written programs talk about their state to the programs and log files around them, and those last messages can be very useful in telling you what is happening with your program.


This is a bit like saying that since your car can do emergency braking, you should always do emergency braking. Some processes clean up after themselves more neatly, or finish the current run of what they're doing first. It's dependent on what it is that you're stopping.


It seems more like saying "children, cyclists, animals, other drivers. If your car can't do emergency braking you might as well stop driving and buy a lottery ticket" to me. Which IMO is pretty reasonable. They don't say anything about when to use kill -9 (all the examples are outside user control), just that it should be survivable.


I get what you're going for, but it's pretty much openly legal to kill pedestrians with cars (in the US, at least). "No criminality suspected"!


There's a little more to the United States than the city of New York. I grant it is an uncommonly large city, but it's not that large.


How is it anything like that?


"Use the harshest survivable option available, regardless of whether it's the best for the situation"


Right: programmers shouldn't rely on it, but neither should it be the only tool that users reach for when they want something to stop.


I know of some PHP installs that add a cron job that kills off hung PHP processes every hour. Ain't that uncommon.


Over 15 years ago, as a teenager, I taught Linux / UNIX Admin courses, and worked as a consultant advising folks, and in the late 90s I was very adamant that you should never -9 anything unless you know exactly what you are doing.

As infrastructures have grown, and as I have managed large applications involving tens to hundreds, often over a thousand, servers, I have grown to accept that a power supply can fail, that a node can disappear from the network, and that it's even possible that none of its components, including its drives, will ever work again. I've never _really_ experienced such a catastrophic failure, but it's a lot easier to sleep at night if you just assume that.

kill -9 should never be worse than pulling the power plug, which is what Netflix's Chaos Monkey always tries to simulate.

We all have to live on a continuum of how much of that we can survive, but if you always assume abrupt failure, it'll be pretty tough to give you a bad day.


> kill -9 should never be worse than pulling the power plug

No, it shouldn't, but just pulling the power plug isn't exactly recommended behaviour, either. There aren't many admins out there who will happily yank the power cord out of their desktops when they want to power it down.

Last night I spent several hours getting a server back into gear after a 'pulled power plug' event. A friend's rack was affected by a (seven-hour!) substation power outage, and it wasn't on a UPS, so it didn't shut down cleanly. Eventually we were able to coax the server to boot again (had to remove all USB devices in the process, including internal ones), and the problem was a corrupted MBR. Make a rescue usb stick, boot into that, finally diagnose the problem, cat a new MBR onto it, and tada, fixed. Let's just say that I don't find the argument "shouldn't be worse than just pulling the plug" to be particularly comforting at the moment :)


Were you writing to the MBR somehow when it died? If not, that looks like really badly designed hardware to me.

I've had abrupt shutdowns happen on laptops (one had a particularly loose battery...), desktops, and servers, and although I have encountered corrupt files and filesystems, I've never had any of them corrupt the MBR.


It's unclear what caused the MBR problems - the power outage was a couple of days before, and the system seemed to come back okay. My friend was busy with work and a cursory check had it clear, but yesterday things started acting funny, he logged in and the load was 40 and rising before it became unresponsive to his diagnostics. This machine had been running happily for quite some time before the powerout event (it's basically just a kvm host, the fun stuff is on the guests), so it's particularly puzzling. Somehow the MBR was overwritten with a syslinux one, and he says that box had never had syslinux used on it (extlinux, yes). The root cause will become evident at some point, it just needs some head-scratching time.


I almost read your last sentence as "The root kit will become evident at some point", because that's what came to mind with those symptoms. I'd check for an infection.


<offtopic>

I know that grammar comments are probably not welcome here on HN, but I think that since you seem to have an interest in it, I'd point this out:

"Its" only has an apostrophe when it's a contraction of "it is" (or "it has"), the possessive is always "its" (without an apostrophe).

</offtopic>


To follow this up: test these situations, design yourself into resilience. Obviously, you can't build perfection in a day, but for anything you rely on, be it mysql, redis, memcached, postgresql, rabbitmq, whatever, just kill -9 that shit in staging periodically while a few people hit the url, and see what happens, what gets logged, etc. Assume failure.
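
A crude sketch of that kind of drill, for a staging box where losing the process is acceptable (the service names are just examples):

    # every 10 minutes, SIGKILL one service at random, then watch the logs and alerts
    while true; do
        svc=$(shuf -n1 -e mysqld redis-server memcached)
        pkill -9 -x "$svc"
        sleep 600
    done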


"kill -9 should never be worse than pulling the power plug"

It's actually easy to misdesign a system such that it is safe against power failures but not kill -9. See my other comment:

https://news.ycombinator.com/item?id=7793301


We write all of our server code with kill -9 in mind. Basically, everything we have can be killed with -9 without any problems. It needs some cleanup code for leftover files and things like that, and use of atomic operations here and there. But then you are ready for all kinds of hardware issues.
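
One of the building blocks hinted at here is the classic write-then-rename trick, since rename() is atomic within a filesystem; a bash-flavoured sketch with a hypothetical generate_state command:

    tmp=$(mktemp state.XXXXXX)   # temp file on the same filesystem as the target
    generate_state > "$tmp"      # hypothetical: write the complete new state
    sync "$tmp"                  # flush it to disk (GNU coreutils sync accepts a file argument)
    mv "$tmp" state.json         # atomic replace: readers see old or new, never a partial file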


Are there resources on how to deal with this that you'd recommend? Is it even an issue in a higher level programming language, or will the issue be abstracted away by a language above, say, C?


I googled for an old LWN article on "crash-only" software and found this request for similar resources:

http://stackoverflow.com/questions/2405172/resources-about-c...


Well, there is always this scenario: when you can't even use CTRL+ALT+F2 to get to some type of terminal and only the power button, held in for ten seconds, will do. That's when you should not 'kill -9'.

I have heard the best practice advice for many years and I think that the 'you should send some friendly signal first' is not universally what works out best. For instance, if your Chrome browser is getting out of hand and the system is permanently doing some 96% wait for some reason, a gentle killing of Chrome will take ages and, when it restarts, you might get some but not all of your tabs back. With a killall -s 9 you can be back to work quickly with all your tabs (and underlying swappiness problem hopefully resolved).


I think caution is still key; if you don't know what's going on, Slow is Fast here. Several years ago, while I was on an airplane flying to spend a nice vacation break with my family, my admin partner tried to shut down a MySQL db the "right way". He logged in, ran a mysqladmin shutdown, and waited for a while. Not sure how long he waited, but he claimed it was a "long time". Since it felt like there was no response to the command, he assumed the database was hung and issued a kill -9 on all the mysql processes.

Sadly, what he failed to check was the disk IO stats; this MySQL setup had heavy InnoDB table usage and settings that were deliberately tuned for more performance than reliability (large buffers, delayed commits, etc.). What was going on was normal: MySQL was flushing everything to disk and to the logs and was most likely going to stop without a problem.

He didn't look at the facts at hand: the disk IO was still going, MySQL was mostly writing to the log files, users were not being let in, so the db was doing an orderly shutdown. Instead, with the adrenaline pumping, he felt he had "waited a long time", issued the kill -9, and corrupted the InnoDB logs and tables beyond all recognition.

I landed at the airport to five frantic voicemails because this db was the core of a bunch of high profile sites and he was up to his ears in phone calls from the client. I had to spend the first 9 hours of my vacation with my kids playing in the background while I sweated it out on a laptop over a crappy connection that kept dropping me.

Yes, I know, MySQL should have been able to handle the "power out", but this event was made worse because he started the shutdown, we had a deliberately fragile implementation, and he didn't check the slaves, so we didn't have a clean fallback; meanwhile he "waited a long time" but never checked the process to see what it was doing.

I use kill -9 (-KILL) all the time, but I do it where I know it's needed. Most of the time kill just works, and if it doesn't, that should give you pause to think carefully about what you'll do next. Slow is Fast and Fast is Slow: if you quickly do something radical like kill -9 or init 6 or a 10-second power button crash, then you may be spending the rest of your day cleaning up. Slow down a bit, look, listen, and gather facts about the situation, then make an informed decision. At least if you do all that and the rest of your day is still ruined, you won't have that nagging feeling you shot yourself in the foot, and you can talk intelligently to your client or boss about the steps you took to avoid the situation.
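
Concretely, the kind of facts worth gathering before escalating look something like this (Linux plus the MySQL client assumed; paths vary by setup):

    iostat -x 5                          # is the disk still busy flushing?
    mysqladmin processlist               # what the server is actually doing right now
    tail -f /var/lib/mysql/*.err         # shutdown progress in the error log (path is an example)
    ps -C mysqld -o pid,stat,wchan,cmd   # running normally, or stuck inside the kernel?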

My failure, I suppose, was that I hadn't explained to him that it was typical for the db to take upwards of 5 to 8 minutes to shut down cleanly. Which gets to a second topic: documentation for production systems is essential; when the fire is on, too many mistakes can be made because of "knowledge gaps" between team members. Needless to say, after this incident I wrote extensive documentation for the team so the next time I was "on vacation" I could actually be "on vacation". :-)


+1.

Postgres is designed to be resilient to "kill -9" as well as hard power offs. Even if you use durability-sacrificing features like asynchronous commit[1] or unlogged tables[2], the risks are very well-defined and contained to recent transactions and data in that unlogged table, respectively.

But even for postgres, you have to be a bit careful. For instance, many disk drives lie about completing the writes and really have them in a volatile cache, so a hard power off can still cause corruption. You need to disable the write cache on the disks using hdparm (or similar) to be safe. And "kill -9" is quite annoying, because the child processes don't have a good way to know that the parent is exiting, so then you will be unable to start the new parent process until the children have all exited as well.
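
On Linux the drive's volatile write cache can be inspected and switched off with hdparm (the device name here is just an example, and disabling the cache costs write performance):

    hdparm -W /dev/sda     # show whether the drive's write cache is enabled
    hdparm -W 0 /dev/sda   # disable it, so acknowledged writes are really on the platter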

EDIT: There's still no excuse for MySQL completely corrupting the system on a kill -9. That's just a misdesign -- consider that the out-of-memory (OOM) killer on linux sends a -9.

[1] http://www.postgresql.org/docs/9.4/static/runtime-config-wal...

[2] http://www.postgresql.org/docs/9.4/static/sql-createtable.ht...


You, and the OP, are likely mistaken about the cause of the corruption. MySQL, running InnoDB, is ACID compliant. InnoDB takes that durability seriously, and even by hand-tuning the performance factors, it's very hard to put InnoDB in a state where a simple process death, even during shutdown, will corrupt the files on disk.

If the database truly corrupted only due to improper shutdown, it was because the double write buffers were disabled on a non-atomic FS in which case you're intentionally risking DB corruption. More likely, there was bit-rot which was undetected during the normal running state of MySQL, and recovery couldn't get around it.

On the anecdote side, I've yet to have MySQL corrupt a database from a kill -9, and as I do a lot of failover development and testing, I'm issuing kill -9 to running databases frequently.


This was years ago and trust me it was corrupt. I think there may have been some issues with InnoDB in those days that made it a bit fragile in that situation.


My apologies if it sounded like I was doubting that there was corruption - I was not. I was doubting that merely sending kill -9 to the process was the sole cause of the corruption. More than likely it uncovered the corruption due to having to do recovery against corrupted data.


In the postmortem it seemed that by issuing the shutdown but not waiting for it to complete the database ended up in a strange state that we could never fully recover from.

I guess the moral here is tread lightly and take the time to be informed before you just smack something with a club.


++ to the moral of the story.


> double write buffers were disabled on a non-atomic FS

Surely just sending SIGKILL will allow the writes already issued to complete, regardless of the FS type/options? Nobody kicked the plug out of the wall.


Double write buffers help to validate that the pages were written to disk correctly. Otherwise, there's no way to tell if a complete page was written, since writing 16k pages to disk is not an atomic operation.

InnoDB writes to the doublewrite buffer (an internal tablespace on disk), then writes the page in place. If the two do not match on recovery, the page is considered not written and is rebuilt from the transaction log.

If the doublewrite buffer is disabled, then InnoDB has no way of telling if the write to the page on disk was started, completed, or partially completed. This causes corruption.
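
You can check whether the doublewrite protection (and the related flush setting mentioned elsewhere in the thread) is enabled on a running server, assuming the MySQL client is available:

    mysql -e "SHOW VARIABLES LIKE 'innodb_doublewrite'"
    mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"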


You never were more than a power failure away from that disaster anyway by the looks of it. It wouldn't matter much whether or not the shutdown had started. In cases like these you're essentially playing Russian Roulette with your data, only you're using 5 bullets instead of 1.


Well, in all fairness, we did have replicas, and sadly something was broken there. Not pointing fingers, but I checked it all before going on "vacation". :-)


Great story.

The advice about Slow is Fast in such situations is spot on.

Reminds me of a Unix incident at an automobile component manufacturer's plant. They had a multi-user Unix box supplied by the company I worked for at the time - a large Unix hardware and software vendor. A colleague and I had gone there for some system maintenance. In order to do something, my colleague, before I could stop him, gave the command "init 0" on the root console. ("init 0" shuts the system down without confirmation, for those who don't know, if run as superuser.) Within seconds the phone in the computer room was ringing madly, with calls from different intercoms on the shop floor, inventory store, etc. He had to apologize to many of them before they calmed down ...


I'm aware it may be an irritating question to answer, and I'd understand if you don't bother, but I have to ask it because I never understood why people do such things...

Was the extra performance in any way worth it?


This was a core cluster db for about 150 web sites that had high profiles and user traffic. It was a long time ago and the client didn't want to pay for 20 more boxes to do something more durable. These things are always a balance between durability and speed. In this case the client wanted cheap speed.

Sadly it worked very well, except when people used a big hammer on it. We had replication slaves, but in this incident the slaves were not replicating and somehow the script checking them wasn't alerting. But even so, the admin didn't check before bashing the keyboard, and thus we were left with manual reconstruction from backups and other sundry sources of information.


Thanks for the answer. Looks like Murphy's law was in action that day.

So that was at the client's demand. You got to save 20 machines with it, I'm impressed, and now I understand better why you did it.


Here's another question some will surely find irritating: why would you take your laptop and work phone on "vacation"?


Two amazing words: SELF EMPLOYED

I can go on for hours about those two words, everyone wants to be their own boss until they don't want to be... :-)


That seems like a strange anecdote to me.

The only reason I can think of to shut down a production SQL database would be to do maintenance. In which case any sane admin should be taking a backup in advance. That should be the first step.

If the database is not a production server, does it matter if data was lost?

If the production database crashed/bugged out and therefore needed to be halted, was it really ready for production in the first place?

I don't mean to offend, but it sounds like your partner is a bit of a doofus.


No offense taken; the reason he went down this path is that the web sites involved seemed to be non-responsive.

The core issue was actually a deadlock, but he wasn't trained in detecting that, and in the past just restarting the db "fixed it". Because of course all the sessions dropped and the deadlocks were resolved.

This was very much a high-visibility production environment, and yes, we all make mistakes sometimes. How else can we learn? Trust me, he never did that again! :-)

Remember, all of us were a doofus at one time, and even worse, many of us are every day.

It matters very little to me how people screw up, it matters how they recover, learn and avoid doing it again in the future.

My favorite saying: "Originality in mistakes". If you're going to screw up, make it big, creative, and interesting. And please, don't repeat it.


Resolving deadlocks and other locks: log on via mysql, run show processlist, and kill the few long-standing queries. It's an ugly solution and should not be used freely, yet it's a lot better than a database restart.
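
Roughly like this (the thread id 1234 is a placeholder taken from the processlist output):

    mysql -e "SHOW FULL PROCESSLIST"         # find the long-running or blocking queries
    mysql -e "SHOW ENGINE INNODB STATUS\G"   # deadlock details, if it's InnoDB
    mysql -e "KILL 1234"                     # kill just that connection, not the server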


Exactly, but my partner wasn't the most patient person, and in all fairness it was typical to have 3,000 or more connections active to the database. With that many queries outstanding and the deadlocks, it was always a bit hairy trying to find them. Meanwhile every teenager on the planet was hitting their refresh button trying to figure out why the site was down.. :-)


Don't exaggerate the impact of a kill -9. It can be safe if a homegrown application has a bug or there is a hardware failure preventing a clean reboot (like a locked IO to a disk that is no longer in working order).

Sure it will mess up some things, but when management is pushing to limit the downtime of, for instance, a golden-image provisioned Linux machine, I'd kill it off no problem.

Now when we are talking about a hardware box running some form of Oracle/MySQL - no, don't use -9 indeed.


The advice to never ever use `kill -9` is too strong. It's fine to use if you know what the program you are killing does.

In my case, processes I have to `kill -9` are

- programs that only read data files (or write to dispensable files)

- programs that I know don't react to SIGTERM (there is no cleanup logic, but still something makes them swallow SIGTERM)

- often they are simple tools (e.g. ls) that become wedged in a system call or in kernel code (when trying to access a bad NFS share)

- in the other cases they are in-house programs that are either badly written, or too complex (the worst offender is CERN's ROOT, if it becomes wedged you have to `kill` and `kill -9` several processes it spawns), or where we don't care enough to fix them

Interestingly, there seem to be some cases where even `kill -9` doesn't help. What I do then is to freeze the process with Ctrl+Z (Ctrl+C doesn't work, of course), and then `killall -9 $(jobs -p); fg`.

Actually, I have one program I routinely call with `program; killall -9 $(jobs -p); fg` and end it with Ctrl+Z. Sad, but true.

(Of course, if your process is a database or a GUI tool or something, then all the standard wisdom against `kill -9` applies.)
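
One common reason even `kill -9` appears to do nothing is a process stuck in uninterruptible sleep (state D, usually waiting inside the kernel on a dead NFS server or disk); SIGKILL only takes effect once the kernel call returns. You can spot those like this:

    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # D-state processes and what they wait on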


One example of where you might not want to kill -9 (-KILL) is a web server, like nginx. If you kill -QUIT (-3) nginx it will do a graceful shutdown. Nginx closes its listen sockets while allowing existing clients to finish. Many other daemons have similarly friendly behavior if you give them a chance to shutdown gracefully. They should degrade safely (if not nicely) with -KILL (-9) as well.
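
For example (the pid file path varies by distro):

    kill -QUIT "$(cat /var/run/nginx.pid)"   # graceful: stop listening, let workers finish
    nginx -s quit                            # the same thing via nginx's own signal helper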


Protip: rather than remembering magic numbers, use the actual names:

    kill -HUP <PID>

    kill -KILL <PID>
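
And if you forget a name, kill -l maps between names and numbers:

    kill -l      # list all signal names
    kill -l 9    # prints KILL
    kill -l 15   # prints TERM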


Interesting post and HN comment thread, I'll have to read it in full (or at least the good parts). Anyway, here's a loosely related post that may be of interest, from my blog:

Unix one-liner to kill a hanging Firefox process:

http://jugad2.blogspot.in/2008/09/unix-one-liner-to-kill-han...

It had an interesting thread of comments in which both others and I participated, and at least I learnt some things.


If your application doesn't work properly when it is kill -9'd, you just don't have a reliable application.



My naive answer: If a normal `kill` did not do the trick.

At least that what I do.

I guess a process is given more space to "clean up after itself" with a normal `kill`; where a `kill -9` forces it to die.

Anyway; I don't know the exact answer -- will come back later to read a wiser person's answer. :)


If the process is not designed to survive the crash, then it's more like a bug. I'd rather encourage everyone to design programs in a robust way: so they can clean up after themselves upon relaunch.


And what about when that cleanup takes longer than a proper shutdown?


Good point. Can you give me an example?


MySQL. Rebuilding the state from the transaction logs takes significantly longer than writing the memory state to disk.



When should I not use signal names in conjunction with kill?

Never.


Earlier implementations of the Unix kill command did not allow names, only numbers, so many people (deeply familiar with Unix) know the numbers as well as or better than the names. Plus, it's shorter.


What earlier implementations? The initial import of /bin/kill into the NetBSD source-tree accepted signal names and that was 21 years and 2 months ago. Same with FreeBSD and their commit message even implies that signal names were allowed in the original 4.4BSD-Lite source.

WRT shorter: Magic numbers don't just suck in programming.


21 years ago was 1993. I learned Unix a decade before that.

7th Edition AT&T Unix did not allow signal names (http://plan9.bell-labs.com/7thEdMan/v7vol1.pdf, search for "extreme prejudice" -- I still remember many of the little gags in the early manpages).

That was a mainstream release in the mid-1980s. Even the basic utilities like kill(1) were incompatible back then, so if you worked on both BSD and AT&T systems, it was easier to use the compatible subset.

*

Regarding magic numbers: in general, yes, to be avoided. But my usual use of kill -9 is in exasperation, from the command line, and clarity for others is not a priority. I admit, in a script, kill -HUP is to be preferred to kill -1. But even in a script, I'd say kill -9. This usage thing seems to be complex.


I do it all the time. In production. While testing my code.

~ The most interesting man in the world.


While. I see what you did there :)



