Hacker News new | past | comments | ask | show | jobs | submit login
Handling argc==0 in the Linux kernel (lwn.net)
178 points by rwmj on Feb 4, 2022 | hide | past | favorite | 134 comments



So let me paraphrase to see if I understand the context correctly:

- the only guaranteeS are argc>=0, argv[argc] == NULL, and envp comes right after argv

- argc is _almost always_ >= 1

- a widely used sbin (polkit) just assumes argv is always at least 1 element and so iterates starting at argv[1] (without checking argc)

- said polkit also mutates argv, so if it is fed with argc actually zero, it actually mutates envp

- some devs want a patch to the kernel to prevent calling execve with argc==0, (which probably should have been in the original spec, but that ship has sailed) as a way of preventing other similar footguns.

- other devs are like "that's gonna potentially break userspace" which is a valid concerne

Is that more or less it?


One more point: It's not just polkit. It's actually extremely common for programs to misbehave when invoked with argc == 0.

Try it out yourself. Build this program [1] and run it on random programs in /bin and /usr/bin. Count how many segfault. For me, it's a significant fraction, on both Linux and macOS.

(Most of the GNU coreutils print an error and abort; this can be accompanied by a core dump, but I still wouldn't count it as misbehaving. On the other hand, a few utilities act as if the environment variables were passed as arguments, which I'd count as misbehaving even if they don't crash.)

Now, the vast majority of those are not setuid, so even if those crashes can lead to arbitrary code execution, that wouldn't be a vulnerability, just a bug.* Of the handful of setuid programs, I actually did get some to crash(!), but the ones I checked are just non-exploitable null pointer accesses (likely because they try to access argv[0] without checking if it exists).

Thus, argc==0 is not necessarily a wellspring of vulnerabilities. Things have to line up just right for it to be exploitable. polkit is probably the only commonly-installed setuid binary with a vulnerability of this nature.

But I'm sure there are some less-common ones that are vulnerable. And the fact that you can get crashes out of some of the most solid, boring programs that exist, even if they're not exploitable, demonstrates just how rare and poorly-understood argc==0 is.

[1] https://gist.github.com/comex/fcd2c329be12f656929b80c4ab15b1...

* On Linux, that is. On macOS, thanks to the entitlements system, there are many binaries where getting code execution could give you extra privileges, though only in rare cases are those privileges truly dangerous.


Are all those executables looking up their own location with argv[0] like in pretty much every C book and tutorial?


Its not just when argc == 0, I just fixed a crash in a program when argc == 1.


wasn't it pkexec specifically, not polkit generally that was the problem?


Yeah that sounds more technically correct.


aside from "that ship has sailed" yes.

The debate seems to be if they can actually do it safely or not, and from the article it seems like they think they can actually do it safely.


"That ship has sailed" i.e. POSIX forbidding argc==0 when standardizing.

I agree it's worth considering it now. I think argc>0 is close enough to a convention that things ought to mostly work, but it kind of irks me that it's necessary. But it should reduce ambiguity.


It's weird to have strong opinions about userspace here given that the code search Kees Cook provided includes only test cases (and, amusingly, what looks like one piece of shellcode) --- the followup search on `execlp` turned up some things that looked like they might be non-test hits, but it turned out that only the test case hits were argc==0.

There's also comments throughout the thread that POSIX requires programs to accept argc==0, which does not appear to be the case --- the opposite seems closer to the truth, in that "strictly conforming" POSIX programs must have argc>0.

Not breaking userspace is an important goal, but breaking bad test cases doesn't meaningfully break userland.

The flip side of this is that people went looking for additional instances of the pkexec bug and none have, as far as I can see, turned up yet.


Yes this. This really feels like a vast over-extension of the "don't break userland" axiom. It's obvious with only a little thought that many bugs in a kernel can be accidentally relied on somehow, and many almost certainly have been and have still been fixed in the linux kernel anyways (maybe by accident).

It seems like a Linus quote from one of his notorious angry emails turned into a meme that got out of hand and now people are taking it too seriously.

And yeah, the posix quotes that have been presented paint a picture much more like "posix doesn't demand you not support null/empty argv or argc==0, but it would prefer you not" to me.


Given "BSD systems already prohibit an empty argv" it certainly seems unlikely that it's a standards related issue in practice.

So (as they appear to be doing if I read correctly) keeping the patch in the tree for the moment, seeing what if anything actually breaks in practice, and then trying to figure out what the best answer is from there, is probably the only thing to do.

It's always a horrible day for a developer when they discover they have a bugwards compatibility problem and while it's, I guess, debatable if that's exactly what we're seeing here, it's close enough to the pain I know well that I have much sympathy for everybody involved here.


"But what if it breaks userland" does seem to be something of a fig leaf. Change would imply there may have been something wrong with the previous code.


I'm continually (positively) amazed at the commitment to backwards compatibility from the Linux kernel team and continually abhorred by the number of footguns in C/POSIX/Linux.


I think the spec has not meaningfully changed in this regard since C89, and there’s a case of collective false memories[0] happening here.

From a copy[1] of C89 I see the relevant description as:

  * The value of argc shall be nonnegative.

  * argv[argc] shall be a null pointer.

So the spec is coherent and explained plainly. Developers/application code just needs to be mindful of what/how it’s dereferencing.

[0] https://www.discovermagazine.com/mind/collective-false-memor...

[1] http://port70.net/~nsz/c/c89/c89-draft.html


There is some additional commentary in the rationale section of posix.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/e...


This is pretty interesting - some tension between what C guarantees and what Strictly Conforming POSIX requires. I’m not saying there’s no confusion in the real world (it’s obvious there is). Is Postel’s Law

   Be conservative in what you send; be liberal in what you accept.
relevant here? I’m not sure.

The interesting looking parts of @tedunangst linked document are:

   The argument argv is an array of character pointers to null-terminated strings. The application shall ensure that the last member of this array is a null pointer. These strings shall constitute the argument list available to the new process image. The value in argv[0] should point to a filename string that is associated with the process being started by one of the exec functions.
[…]

   Early proposals required that the value of argc passed to main() be "one or greater". This was driven by the same requirement in drafts of the ISO C standard. In fact, historical implementations have passed a value of zero when no arguments are supplied to the caller of the exec functions. This requirement was removed from the ISO C standard and subsequently removed from this volume of POSIX.1-2017 as well. The wording, in particular the use of the word should, requires a Strictly Conforming POSIX Application to pass at least one argument to the exec function, thus guaranteeing that argc be one or greater when invoked by such an application. In fact, this is good practice, since many existing applications reference argv[0] without first checking the value of argc.

   The requirement on a Strictly Conforming POSIX Application also states that the value passed as the first argument be a filename string associated with the process being started. Although some existing applications pass a pathname rather than a filename string in some circumstances, a filename string is more generally useful, since the common usage of argv[0] is in printing diagnostics. In some cases the filename passed is not the actual filename of the file; for example, many implementations of the login utility use a convention of prefixing a <hyphen-minus> ( '-' ) to the actual filename, which indicates to the command interpreter being invoked that it is a "login shell".


Just because it’s in a spec, doesn’t mean it’s not a footgun. No “collective false memories” required.


There's no footgun here except for apps which make false assumptions about argc and/or argv.


Indeed, that's the footgun.


At that point using C is the footgun.

It's always been the case that argc is not-negative which means it could be zero. It's always been the case that argv[argc] is null. People who assume otherwise are poor C developers.


Yes, C is full of footguns; that's one fact that has motivated the creation of alternate systems programming languages.

I think your definition of footgun is different from mine and that of the commenters you're responding to (and how I've generally seen it used).


False memories? What are you blabbering about.

Just because it's in the spec doesn't mean it's not a footgun. If something is in the spec and it blows up in you face if you're not careful that's by definition a footgun.


there's a subtle distinction here in that what exec does going in and coming out of the kernel, and what are presented to main() by the libc preamble (which is what the C standard is about) are different (though very very related) things


0 is a nonnegative number.


> 0 is a nonnegative number.

Correct. So it needs to be accounted for.


At least for the old POSIX stuff this is sort of excusable... it operated under very much more relaxed assumptions, primarily that people could trust each other to a somewhat large degree.

Obviously, the rise of mass-population Internet access and financial (or "lulz") incentives for hackers changed the game, and we're still paying for all the decisions to be backwards-compatible, even decades later.


The Morris worm already shown what the future would look like in the early 80's.


I don't think the behavior is wrong per se. Rather the issue is that it was tribal knowledge that the way you got the name of your binary was through argv[0]. And that this misinformation spread unchecked.


Anything the kernel does to try to mitigate this would break userspace. It's one thing to do that when necessary to fix a kernel vulnerability, but I don't like the idea of breaking userspace just to try to work around some buggy userspace programs.


Yeah, I’m surprised that this patch is getting traction at all. If it’s a bug in the application, it should be fixed in the application. The kernel shouldn’t be changing the kernel-userspace interface to work around userspace bugs.

There are plenty of bespoke binaries out there with no/lost source code that won’t get the benefit of scrutiny from the kernel dev team. I certainly wouldn’t assume that none of them invoke execve with a NULL or 0-item arg parameter, and merging this patch would break them without recourse.


Given that this behavior is already disallowed (and faulting) on BSD and at least some commercial Unix flavors there shouldn't be too much software out there that relies on it.

I would lean towards making it an API error BSD style and taking breakage on a case by case basis. I'm not the person who would be called up however, so it's easy for me to say.

Worst case you can create a sysctl variable to enable/disable the check for those corner cases where it can't be fixed and other precautions can be taken against the security vulnerability.


> Given that this behavior is already disallowed (and faulting) on BSD and at least some commercial Unix flavors there shouldn't be too much software out there that relies on it.

How much commercial software targets UNIX these days?


Enough to keep sales of AIX, Solaris, mainframes and several POSIX based RTOS going.

And that little one named macOS as well.


macOS allows argc == 0, it has a separate mechanism for providing the exec path of the executable to the process, namely the KERN_PROCARGS2 sysctl which includes the saved exec path along with argc, argv, envp, and the apple pointer.

Amusingly enough, Apple has had a bug in their `ps` since 10.3 that assumes argc > 0, which actually results in a similar (but not likely harmful, despite ps being setuid) bug.


If anything this breaks is already broken on macOS the argument probably becomes rather simpler, indeed.


Platform APIs should be designed in such a way that it is easy to use them securely. That's why we don't use strcpy() and friends anymore. (Yes, I know, not quite the same, as you can mitigate this with a new function and implore people not to use the old one.)

Sure, the pkexec bug is now fixed, but I expect we'll see another similar one eventually. I know I personally have written a lot of C programs that assume argc>0 and argv!=NULL, and I'm sure there are thousands of such programs.

I think returning EINVAL on exec in this case may be too much. I think the right move is probably for the kernel to just fix things up when exec is called in this way, so argv will always have at least one element, and that element 0 won't be NULL. I think any program that would break with that change is something I'd be able to live with.


It's very easy to use argc/argv securely as they currently are. POSIX clearly states that argc can be zero, but polkit didn't know that for some reason. When you're writing privileged software, you can't make assumptions about user input (and the specs) like this.

I get that the vulnerability could have been prevented if Linux deviated from the spec like some of the BSDs might have, but we shouldn't make it the responsibility of kernel developers to make logic errors in user applications less likely.


I agree that we shouldn't -make- it their responsibility but it's still worth asking in a case like this whether -this time- it's a better trade-off to do so anyway.


I am awful at C so this comment should be considered merely an anecdote but I was -so- sad when I (having started learning on BDSi and NetBSD) needed to port a program to other systems and discovered strlcpy wasn't there. (damnit, Jim, I'm already a segfault generator when I try to perpetrate C, I need all the help I can get!)


> That's why we don't use strcpy() and friends anymore.

We definitely do when safe to do so…


Unlike other bugs, this change shouldn't break any application that is being called. argc >= 1 is the normal use case.

So this would break anything that calls execve() with argc==0. Is such call providing any functionality?

In addition to returning EINVAL being produced by not well written callers, another option would be to force argc==0 and fill up argv[0] with process name. That fix would make everything continue to work.


The ldd program that lists the shared libraries used by an executable at least used to exec with an empty argv, which the dynamic linker detected and listed libraries instead of linking and running. I learned that when I screwed up an inetd configuration and saw a list of libraries instead of the desired results.

Edit: just tried on windows 10 WSL, and it had a sad (blue screen).


WSL2 has made my life so much better that now I want to give it a hug.

(also, TIL, thank you)


I'll be very interested to see if Linus bites at all.

This is probably the most sympathetic I've been to the Kernel breaking userspace programs that rely on bad behaviour, but Linus has been pretty clear on the matter in the past.

> WE DO NOT BREAK USERSPACE!

https://lkml.org/lkml/2012/12/23/75


Eh, Linux has broken userspace in the past. Twice for me personally in the last ~10 years, one of which could've resulted in hardware damage if the stars had aligned slightly differently :D I wouldn't count it out.


Are you sure you don't mean it's broken out-of-tree drivers? What exactly did it break?


No out-of-tree drivers.

1. https://github.com/torvalds/linux/commit/6b99e3569ba17b9fd38... (2017) This changed the userspace hwmon API of the thinkpad-acpi driver from `/sys/devices/platform/thinkpad_hwmon/` to `/sys/class/hwmon/hwmon1/`, which broke the sensor-monitoring software that I wrote for my laptop. It was justified on the basis that libsensors was resilient to this change, but my software didn't use libsensors.

2. https://github.com/torvalds/linux/commit/b02c6857389da66b09e... (2020) This changed the userspace hwmon API of the k10temp hwmon driver, specifically by swapping which temperatures referred to Tctl vs Tdie. For example, my Threadripper reports one real temperature and one fake temperature that's 27degC higher than the real one. (AMD does this because it wants default fan curves to think the CPU is hotter than it really is, because it wants fans to run faster by default.) Since this change swapped the two sensors, it could've had one of two outcomes for fan curve software:

- a) Software that monitored the fake temperature was now reading the real temperature, and thus running the fan slower than it needed to.

- b) Software that monitored the real temperature was now reading the fake temperature, and thus running the fan faster than it needed to.

Luckily my case was (b), so I discovered the change not because of a throttled / smoking CPU but because of an unexpectedly loud CPU fan.

(There was another case in ~2012 when the thinkpad-acpi driver switched from procfs to sysfs, but that one was more "justified" because sysfs was the right place for hwmon anyway.)


Stuff in /sys/class changes all the time (maybe not so much on normal computers, but on semi-embedded ARM platforms they see to change a lot)


Only way to completely fix buggy user land is to error on any attempt to execute a program. Before this polkit had an authentication bypass last year, trusted user supplied ids around 2018^1, ... . The list goes on.

^1 which ironically enough was an attempt to avoid racy uid checks on setuid binaries.

https://gitlab.freedesktop.org/polkit/polkit/-/commits/maste...


They should take Windows 95 approach and try to detect programs that need special behavior, perhaps even include a list of binary names. This goes against the ethos of clean elegant solutions that the FOSS world likes, but at some point we have to realize that the kernel is enterprise code, and we need enterprise solutions sometimes (read: hacky).


In open source this isn't necessary since you can just go in and fix the broken programs and not carry deadweight workarounds forever. And some don't care too much about broken closed source programs.


You're exactly right. However, not all programs running on Linux are FOSS. And while you may not care about broken closed-source programs, you can't really say "we don't care" because you don't speak for the entire FOSS community, and I think if you look at it again, you'll find that the FOSS community DOES care (in actions not so much in words) about corporate sponsorship, especially for the Linux kernel.


I wonder if such hackery could be enabled via ebpf such that distro vendors could make an active choice to do that via default configuration rather than needing to get it into mainline.

(I am ... ambivalent, let us say ... about whether I -approve- of people doing so in any given case, but I definitely believe in them having the option)


I would bet IBM and Red-Hat care a lot.


Red Hat works with software partners to port and fix their software for Linux - in fact I was doing some of that exact activity earlier today, helping to port a program for a partner to GCC 12 (combined with hardening flags, it breaks a bunch of previously working C/C++ code).


I don't really understand the rationale for changing the Linux kernel at all here; yes, this is yet another footgun (tm), but this is purely a userspace problem. Especially given the examples regarding backwards compatibility given in the comments, I would expect a long-lasting userspace fix to come in the form of a new, better libc API. Not by prohibiting certain behaviour from the kernel-side of things.

That said, why is the obvious fix not simply "always populate argv[0] with pathname if called with NULL argv"?


Your question is answered in the article:

> Given the code we've found that depends on NULL argv, I think we likely can't make the change outright, so we're down this weird rabbit hole of trying to reject what we can and create work-around behaviors for the cases that currently exist.


Right, backwards compatibility with existing bugs.

It really isn't a kernel problem at all, of course.

The POSIX spec says "The first argument, argv[0], is required and must contain the name of the executable file for the new process image."[1] Linux is not fully POSIX-compliant, but it generally follows the POSIX standard for kernel calls defined in POSIX.

FreeBSD's man page says "At least one argument must be present in the array; by custom, the first element should be the name of the executed program (for example, the last component of path)."[2] So, "must", compatible with the POSIX spec.

QNX implements the POSIX API, pretty much per the spec. The QNX spec says: "The value in argv[0] must point to a filename that's associated with the process being started."[4] So, "must", again compatible with the spec.

The Linux man page says, unfortunately, "By convention, the first of these strings (i.e., argv[0]) should contain the filename associated with the file being executed."[3] That's the problem. POSIX says "must" but the Linux page only says "should". That was a mistake.

One somewhat kludgy fix would be to disallow argv[0] == NULL && envp[0] != NULL. That allows launching a program with no args at all, which might be used somewhere in some ancient program, while disallowing launch with no command line args but with environment args, which is probably a bug.

(I liked the QNX approach to program loading. The kernel isn't involved. The "exec" functions link to a shared object, which requests memory and does the loading using regular I/O operations. The kernel contains no loader and is thus immune to problems from malformed executables.)

[1] https://support.sas.com/documentation/onlinedoc/sasc/doc750/...

[2] https://manpages.debian.org/unstable/freebsd-manpages/execve...

[3] https://www.man7.org/linux/man-pages/man2/execve.2.html

[4] https://www.qnx.com/developers/docs/6.4.0/neutrino/lib_ref/e...


FreeBSD did not require non null argv[0] until 9 days ago.

https://github.com/freebsd/freebsd-src/commit/773fa8cd136a57...


Only a number of years too late...

It's worth noting that as far as changes in FreeBSD forbidding long-standing permitted behavior goes, this one was notably very uncontroversial.


It doesn't matter what was technically allowed; the very commit message you linked to says it was by contract disallowed for 31 years:

> The manpage has contained the following verbiage on the matter for just

> under 31 years:

>

> "At least one argument must be present in the array"


It certainly matters to the pkexec exploit.


That isn't the POSIX spec.


You're right. This is (I think) the actual spec.[1] The other one is SAS's reference which claims to describe the POSIX spec. The actual spec says "should". Bleah.

"The arguments represented by arg0,... are pointers to null-terminated character strings. These strings shall constitute the argument list available to the new process image. The list is terminated by a null pointer. The argument arg0 should point to a filename string that is associated with the process being started by one of the exec functions."

"The argument argv is an array of character pointers to null-terminated strings. The application shall ensure that the last member of this array is a null pointer. These strings shall constitute the argument list available to the new process image. The value in argv[0] should point to a filename string that is associated with the process being started by one of the exec functions."

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/e...


> The actual spec says "should". Bleah.

I don't agree with your reading that argv being non-NULL is a "should."

The POSIX Specification does not say "The value in argv[0] should exist" it says "The value in argv[0] should point to a filename string..."

The value existing is a foundational assumption in the specification. It is therefore required.

(This is consistent with the fact that it's explicitly required that argv[argc] == NULL, which necessarily means that argv != NULL. Consider this: https://pastebin.com/1gieMqEd If argv is NULL, then pointer arithmetic on it and dereferencing it is undefined behavior, and cannot possibly conform to the guarantee.)


I think what you’re describing above is the qualified ”Strictly Conforming POSIX“.

Edit: delete commentary


It's not clear from that quote if they're talking about code calling execve(,NULL,) or a program expecting main() to be entered with argv[0]==NULL.

(edit: found an answer here: https://lwn.net/ml/linux-kernel/5e963fab-88d4-2039-1cf4-6661...)

multicall programs will try to use argv[0] and might crash in this scenario. If we're going to fake an argv, I guess we should try to do it right.


(multicall = things like busybox which look at argv[0] to determine what program they should act like)


That was my issue with the article as well:

"Fixing this issue is a simple matter of making pkexec check that argc is at least one. But there are surely other programs out there containing similar assumptions."

That's _not_ an assumption. That's just a bug, in a setuid binary.


It doesn't matter if it's a bug in the program or not. "We don't break the userspace" --Linus Torvalds


I'm describing an issue with the article, not with the kernel development ideology. What I don't understand is why Corbett decided to write this up as an "assumption" rather than a "bug" which it clearly is.


It's both, what's the problem with what he wrote?

It's a bug that results from the developers incorrectly assuming that argc is always at least 1.


polkit is a freedesktop.org project, convincing Linus to break userspace is an easier deal than getting them to admit fault. (joking)


I don't see why pushing this out to glibc would help. Glibc would then face the same question about whether or not some programs rely on current behavior and would break.


Linus' views on backcompat and glibc's have, let us say, not been entirely aligned over the years.

If the change fails at the kernel level but succeeds at the glibc level that might still be a net win overall, and given Ariadne as "instigator" here I'd be hopeful that she could convince musl to come along for the ride if that was how it had to happen. (assuming she thought that was a good idea in the first place, of course, I'm exploring counterfactuals here)


But a "fix" in libc wouldn't help for the security risk because an attacker could still emit the system call directly. (Unless I misunderstood what the fix would be)


At first I imagine the libc killing the process if argv is empty.

Otoh if you want libc to be possibly a little too clever, perhaps it could make an effort to try to discover a sensible value for argv[0] based on looking around /proc etc to find the path to the binary. That seems like way too much complexity and failure points for a few mostly inconsequential corner cases.


If libc isn’t linked into the process to begin with then none of that matters.


Many languages that might want to bypass libc would also bounds check argv, so ... Maybe it doesn't matter in a few more senses than that ...


OpenBSD has handled this case for some years. I do not know if there was any breakage or fallout from this.

https://github.com/openbsd/src/commit/74212563870067f5b1e271...


That's not a bad precedent for saying it shouldn't break much of anything if it's done. I also love the name of the label for handling the error condition there.


Slightly funnier would be "harmful"


> OpenBSD has handled this case for some years. I do not know if there was any breakage or fallout from this.

The other thing about OpenBSD is that when they make a change to their OS, they also go through to make sure all the (third-party) ports/packages:

* https://cvsweb.openbsd.org/ports/

* https://github.com/openbsd/ports

do not break. So they're create patches for the software and and submit them upstream.


FWIW, Windows (MSVC) also guarantees "argc > 0" and non-empty argv with non-empty argv[0]. It also returns EINVAL if execve validation fails.


Windows doesn't have argv at all, it uses single string as cmdline arguments.



At the kernel level, yes. But they are talking about CRT, which parses the command line and makes argv from user space, if your linker is set up to have a main() and not WinMain.

That is why they put MSVC in parens.


Thanks for answering my next question, how much world of hurt would be involved with this solution in terms of breaking userspace.

Kind of insane how Linux kernel developers didn't test and resolve this in a similar manner before this became an issue.


argc==0 is a well known part of the API, even if not everyone handles it correctly. I guarantee that >95% of kernel developers (if not 100% of them) understand this design and its flaws.


It seems abundantly clear that programs are allowed to pass an empty list as argv to execve, although these programs might not be POSIX-compliant. As for the "main" function, argc is only guaranteed to be nonnegative. A well-written program therefore cannot assume that argc is not zero. Additionally, there is no guarantee that a POSIX-compliant program will be called by another POSIX-compliant program.

So it looks like pkexec made the mistake here by not carefully adhering to the standard. One can argue that argc==0 is a bad idea, and it is. But it's technically legal, and changing this would break userspace programs.

1. https://pubs.opengroup.org/onlinepubs/9699919799/functions/e...

2. https://eel.is/c++draft/basic.start.main


argc = 0 is a bad idea? When did we enter into this insane parallel universe?

When I first came across (argc, argv) a long long time ago I was befuddled by the choice of having argv[0] be something vaguely related to your binary, maybe the path to it. That seemed a very weird choice seeing how most programs have no use for that and now all the arguments you actually care about are shifted.

So now you have a patch to ensure argc is always 1. Only of course if you accept that there are programs that misbehave with argc=0, surely you must also accept there are programs that misbehave if argv[0] is not the program path? This is after all the convention that got you into this misery in the first place!

That's the moment the stupidity fully sets in. So now you write a patch to make sure argv[0] is really the path! Except just as we have discovered with this innocuous patch, you could write a dissertation on the question of "what exactly is meant with the program path" alone. And that really whoever was relying on some specific interpretation on the value of argv[0] would have been much better off with some other syscall to tell him what he was after.


> When I first came across (argc, argv) a long long time ago I was befuddled by the choice of having argv[0] be something vaguely related to your binary, maybe the path to it.

I’ve seen a Unix (¿maybe HP-UX?) that seemed to construct the argv strings by taking a copy of the entire command line and writing zero bytes at the end of each argument.

I don’t think that’s too weird. Why do multiple allocations to make room for the strings if you know a reasonably tight upper limit to the number of bytes you need to store all of them? (Reasonably tight because the command line _could_ contain consecutive spaces and because of backspaces and quotes around arguments)

On that OS, argv* strings also were writable, so programs could even replace those zeroes by spaces to, in many cases (I don’t remember how that worked with quoted arguments), get back the command line used to launch the program:

  for(i=1; i<argc; ++i) argv[i][-1] = ‘ ‘;
  printf(“%s\n”, argv[0]);
(Yes, on top of, AFAIK, not being guaranteed to work by any standard, that’s undefined behavior in C, but it did work on that OS)

So, I guess passing something resembling the executable’s name was something they got for free. It might even originally have been a bug, for all I know.


>but it did work on that OS

This will probably work correctly on any POSIX system (well, except that it completely ignores escaping). Also I'm not sure why you would want to reconstruct cmdline in your program (that will lead to Windows-like horror where kernel only passes single string as cmdline and every component along the way interprets it with different escaping rules).


argv is defined to be non-const by the standard.


One could argue that the main problem here was writing from argv into env^1 , so if one wants to break userland APIs to secure this the best start would be to hit the memory region argv resides in with write protection.

^1 We all know that buffer overflows in C are extremely common so fixing argv[0] alone would be insufficient to fix the majority of bugs.


Obviously Linux programs can pass an empty list to argv, since that's how we got the pkexec vulnerability.


It seems straightforward to just enumerate every setuid program that ships on the vast majority of distros and check for this issue and patch what's there. That seems like the simplest approach with no concern for breaking anything.

Also, a sysctl for rejecting the call would seem reasonable.


It seems a bit extreme to add a syscall for this. As you said, they have to bite the bullet and fix the suid programs.

And why not just fix the exec call in userland? I'm not sure who supplies it - the C standard library or the kernel headers. You would fix all the cases of accidentally calling something with argc==0 and it would not run afoul of the kernel breaking userspace.


sysctl ≠ syscall, FWIW


Ah OK, that makes more sense. TIL!


Agreed. Also, the existence of this bug makes me wonder if I can simply delete polkit. What other rookie mistakes lie within? I learned how to parse argv properly in CS 50 (so, before CS 101).


pkexec was a kludge to replace consolehelper and both were actually supposed to have been removed in 2013(!) but the transition didn't happen fully. In fact some are still using consolehelper. For the whole sorry tale, https://lwn.net/Articles/883547/ https://lwn.net/SubscriberLink/883547/f9a8f4e4be157f91/


Interesting.

Ironically the reason pkexec exists on Ubuntu server installs seems to be that they want to have a package management service (PackageKit) hanging off dbus even in headless setups, for reasons that aren't obvious. There have been earlier local root bugs caused by PackageKit as well[1].

(The irony being, the LWN documented discussion where a dev defends ipc-accessible privileged daemons as more secure and justifies pkexec as facilitating access to these).

[1] https://ubuntu.com/security/notices/USN-4538-1


I believe it gets used for the unattended-upgrades system to manage security updates. Not sure if that's the best way to do it but I believe it's using packagekit to do the actual work.


That was my guess also at first but seems this is not the case, it just runs as root and calls apt directly. There's no dependency to packagekit, and it only uses dbus in the unattended-upgrade-shutdown script for some shutdown event monitoring business from systemd.


Polkit (I think it is polkit) is so annoying. When I forget to use sudo, it asks me in a nonstandard prompt for a password. But I don't have a password on some machines, I logged in via ssh key and have passwordless sudo.

At the same time, you can still not elevate inside an application. I would love to run a command to open a single file as root in vim, or VS code, or copy one file as root in Nautilus. I think you can click a little lock in Gnome settings somewhere and unlock certain pages, but it is far from widespread.


The whole discussion of alternatives is, alas, akin to passing a hot potato around. I, for one, don't quite see how keeping a daemon whose only job is to execute arbitrary privileged commands any safer. You can have a hole where it parses the request. You can have a hole where it checks whether the requesting party is indeed authorized to make such requests. You can have a hole at the point of it invoking execve(). Indeed, you can put in the same hole pkexec used to have! You still need to exercise due diligence to write that daemon as tightly as can be so all those places don't pose a risk.

But with having a resident daemon approach you lose the process ownership, which sometimes is damn handy to have. Accidentally granting root to do a wrong thing and then not even being able to kill the thing with Ctrl-C or xkill is quite disempowering.


I still haven't gotten over the transition from init to systemd. One reason is that I'm unconvinced it makes sense to have a daemon that passes out root privileges after boot. I'm a fan of "keep it simple," and UNIX/Linux/BSD survived for decades without such a daemon.


I understand entirely on a philosphical level but while debugging systemd regularly makes me angry it's still less often than other people's SysV init scripts used to do so, so I've made my peace with why people were happy making the switch.

Naturally I'd personally prefer rc.subr or s6 or nosh (though I've seen people footgun themselves with all of those too, albeit less often than SysV) but it is what it is, and most days I can bring myself to believe that the systemd transition was the sort of imperfect net win that's usually as good as you get in an ecosystem as complex as this one.


What are some setuid program in popular distros? The only ones I'm aware of are `sudo` and `su`? Do programs such as `systemd` require setuid?


There's a bunch, e.g., mount, umount, fusermount, crontab, passwd, and chsh.


I count 20 on mine. Surprisingly, none of them are systemd. (find / -perm 4000)


systemd developers generally prefer having a privileged daemon, and an unprivileged program doing RPC with the other. See for example systemd/systemctl, timesyncd/timesyncctl, etc.


`find / -perm /4000` to be precise. It's unlikely you have any suid binaries with exactly 4000 mode.


I wish the kernel could handle

    argc==ARBITRARY_NUMBER
It's crazy that with 64GB of internal memory, I'm still getting "argument list too long".


Arguments go on the stack, which has a fixed size.


https://lore.kernel.org/lkml/20170707185729.GA70967@beast/T/... (merged as commit da029c1) added a hard upper limit of 6MB for argv + envp combined, no matter how big the process's stack is.


It's amusing to me that some people seem to think it's possible to create "secure computing" with no footguns whatsoever. A surprising amount of the infrastructure upon which we write computer programs (including a kernel and its extended OS) relies upon "mere" conventions. Indeed one of the lowest levels of all is named the "calling convention". Software which breaks these conventions, yet somehow "runs" or functions to some extent before causing problems is broken software, period. It is not a cool hack or an interesting lead for further research. Any hand-wringing about it whatsoever is a waste of time.

If some individuals scale the protective barrier and jump to their deaths in the Grand Canyon, this does not indicate any deficiency in the protective barrier. It was not (and should not) be designed to prevent such misuse.


A few years ago I wrote a small program and script that search for executables that crash when they are executed with argc==0 https://github.com/eriksjolund/empty-argv-segfault-check


How did polkit even fall victim to this? I sometimes feel hesitant to include an argv[0] in my execvX functions because I know that it's only "a convention" and not an enforced rule, and no reasonable program should expect it to be there.


I wrote a summary here: https://news.ycombinator.com/item?id=30091667

The tl;dr is that pkexec writes to argv, and envp comes right after argv on the stack, so a missing check caused it to accidentally write to envp when argc == 0.


"We do not break the userland"


The actual quote is:

  WE DO NOT BREAK USERSPACE! Seriously.
  How hard is this rule to understand?
  We particularly don't break user space
  with TOTAL CRAP.


Indeed. But a sysctl to switch the patch on/off seems like the pragmatic solution


That isn't pragmatic, it silently breaks programs that rely on specified behavior just to fix one of many self inflicted security issues polkit had over the last decade.


The sysctl can have three settings: 0 to do nothing, 1 to emit a warning, 2 to fully enable the patch that blocks argc=0. Use 1 by default as not to break userspace, let people opt-in to 2 for the additional security


Which is fine: https://news.ycombinator.com/item?id=30208963 is pretty on the money here. Patch this behaviour, and fix the extremely low number of offending applications concurrently.


What specified behavior?


Posix apparently explicitly allows calling programs with an empty argv, so it isn't just a Linux implementation detail Polkit failed to handle.


> As noted in some of the test cases, there are comments like > "Solaris doesn't support this," etc.

Evidence that NULL argv should be removed.

Also, at the very least do it for execs of set-uid executables.


Instead of fixing this in the kernel, couldn't this be added to SELinux as something it keeps an eye out for?

The kernel could log it, and hopefully track them down, but breaking the ABI seems harsh.


Maybe, but not necessarily. If a program's job is "elevate other program's privileges arbitrarily" your policy is gonna be pretty loose. Also, a lot of distros don't run SELinux.

It's not to say that SELinux wouldn't be helpful, it's just that ideally we'd not have these vulnerabilities to begin with.


Am I crazy or isn’t the simplest solution being the kernel filling in argv[0] with the executable path when argc is 0 (thus changing it to 1)? What would be the downside in that?


argc has always been able to be 0 and I've always handled that case

It's not the kernels job to change this just because some developers don't understand it may be 0.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: