Hacker News
OpenBSD now enforcing no invalid NUL characters in shell scripts (undeadly.org)
185 points by CTOSian 3 months ago | 154 comments



Here's the actual diff:

https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/bin/ksh/shf.c....

And it looks like that covers all parsed parts of the shell script or history file, including heredocs. I get the feeling it's going to break all shar archives with binary files (not that they're particularly common). It will stop NULs being in the script itself, but it won't stop them coming from other sources, e.g.

    $ var=$(printf '\0hello')
    -bash: warning: command substitution: ignored null byte in input
    $ echo $var
    hello
It remains to be seen if this will be adopted by anyone else, or if it'll be another reason to use OpenBSD only as a restricted environment and not as a general computing platform.
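
To make the command-substitution point concrete, here is a small sketch that counts what goes in and what survives. It assumes a shell that drops NUL bytes the way the bash session above does (bash prints its warning on stderr, which is silenced here); shells such as zsh can keep the NUL, so the second number may differ there. The variable names are only for illustration.

```shell
# Six bytes go into the pipe: one NUL plus "hello".
raw_len=$(printf '\0hello' | wc -c | tr -d ' ')

# The braces let us silence bash's "ignored null byte" warning.
{ var=$(printf '\0hello'); } 2>/dev/null
kept_len=${#var}

echo "bytes produced: $raw_len, bytes kept in \$var: $kept_len"
```

The NUL never reaches the variable, so nothing downstream of `$var` can depend on it.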

> "If there is ONE THING the Unix world needs, it is for bash/ksh/sh to stop diverging further"

> OpenBSD ksh: diverges further


The only thing that is required to happen is that they all obey the rules of the POSIX shell (when called as /bin/sh).

Otherwise, anything goes.

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...

All the userland utilities must have the behavior (and problems) specified here:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/


Eh - I actually like developing on OpenBSD first, because of restrictions like this. If it runs on OpenBSD, you are likely to have fewer bugs around things like malloc.

OpenBSD is also really good about upstreaming bug fixes, which is a good thing. Firefox used to be a dumpster fire of core dumps on OpenBSD, and many issues were uncovered and fixed that way.


> I get the feeling it's going to break all shar archives with binary files

shar encodes binary files. Here's what it does with a file that has contents: "foo\0bar\n":

    sed 's/^X//' << 'SHAR_EOF' | uudecode &&
    begin 600 foo.txt
    (9F]O`&)A<@K.
    `
    end
    SHAR_EOF

Interestingly, passing that heredoc to uudecode in the shell, it produces no output. However, if I pass the whole shar output to unshar, it does produce the file with the correct content.


> I get the feeling it's going to break all shar archives with binary files (not that they're particularly common)

Base64 encode them.
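
A minimal sketch of that suggestion, assuming a `base64` utility that accepts `-d` for decoding (GNU coreutils and recent BSD/macOS versions do; it is not part of POSIX). The payload is the same `foo\0bar\n` content used in the shar example, and `decode_blob` is a made-up name:

```shell
# The heredoc is plain ASCII, so the parsed part of the script contains no NUL.
decode_blob() {
    base64 -d <<'B64_EOF'
Zm9vAGJhcgo=
B64_EOF
}

# Show the decoded bytes; the NUL only exists in the output stream.
decode_blob | od -c
```

The cost is roughly a third more bytes in the archive, in exchange for a script body that any text tool can handle.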

This is not diverging further, this is bringing sanity to the table


> Here's the actual diff:

Only 8 short, simple lines of C code. Beautiful.


"We are in a post-Postel world" is a great way to put it. This needs to be repeated by everyone working with file formats or accepting untrusted input.


> "We are in a post-Postel world" is a great way to put it.

See also RFC 9413 (https://www.rfc-editor.org/rfc/rfc9413.html), originally called "draft-thomson-postel-was-wrong" (https://datatracker.ietf.org/doc/draft-thomson-postel-was-wr...).


Agreed.

When every implementation in wide use has its own quirks, you must support them all to make your program widely used. Every special case is yet another potential bug to chase down.

It also enables the "Embrace, extend, and extinguish" strategy that Microsoft used so successfully to assfuck the internet over a decade.


I think you mean Google.


No. The Microsoft. MS invented the term. DOJ found that MS used "Embrace, extend, and extinguish" in internal documents.

Younger people don't know how absolutely ruthless and harmful the Wintel monopoly was under Gates. Java did not work, on purpose. JavaScript did not work, on purpose.

   <!--[if IE]> 
everywhere.

They attempted to kill the open web in the crib with their Blackbird project. Only MSN (The Microsoft Network) for normal people.


Except it is Google that morphed the Web into ChromeOS, with the help of EVERYONE that ships it alongside their applications, as they can't be bothered to learn cross-platform frameworks.

Many of them are people who used to complain about Micro$oft and should know better.


Anyone who was around for the IE6 era knows how much worse it was than the current Chrome era. It's not even close.


Born in the 70's, coding since 1986.


Started with DOS and became a Windows programmer. If it's OK to submit false statements, then why not add one more.

https://www.justice.gov/atr/us-v-microsoft-proposed-findings...

See 91.3.2


Yes I know how it was, and MS bullshit hate doesn't help, while worshiping how Google and Apple have taken over the Web and mobile devices.


Why are you posing criticizing MS vs Google as a dichotomy? Both are a piece of shit, MS even to this day. nabla9 is absolutely right, read the history.


Because many MS bashers are hypocrites in the way they worship Google and Apple, which are doing much worse and succeeding where MS failed.


Nothing from here to the parent post suggests any defense of Google. It's you who brought it up.


You clearly don't remember or know about how bad it was in IE6 era and before.


Born in the 70's, coding since 1986


Just two things:

- Active X.

- VB6, OLE, VBA macros and friends.

Merge both. Now you have a proprietary hell full of vulnerabilities.

That said, I don't like Google's world dominance on the smartphone/web/email/chat/maps/video platforms either.

*Apple is just successful in the US.


Agreed that it is a Microsoft term. But in my experience, it is older people who incorrectly judge Microsoft ruthlessness, not younger people. I am of an age where I remember well what Microsoft was like in those days, and it frankly was not as bad as people make it out to be. Nor was it really worse than the ruthless tech companies of today.


I was there too, and I disagree completely. Microsoft was not just ruthless, they were ubiquitous. They sabotaged any perceived competitors in anticompetitive, market- and industry-damaging ways.

You (the generic "you") can complain all you want about Apple today, but you have another perfectly viable option. And Apple is (almost-entirely) happy to grow market share on merits without salting the earth of any rivals.

In Microsoft's heyday, that was not true. Those of us who rejected MS back then did so at a much higher cost than green chat bubbles.

It was worth it though. And we did win, eventually.


> ...without salting the earth of any rivals.

Microsoft embodied the adage "It's not enough to win. Everyone else must lose."


Microsoft could not win, although they tried very hard.

Windows was never going to scale down to the portable devices that we now use (because defeating Apple would have been very difficult, and AOSP made it insurmountable).

Windows was never going to scale up to the top 500 supercomputer list (for largely economic reasons).

Microsoft itself has tacitly admitted that Azure is better served by Linux, and we ponder why.

Did the DoJ actions against Microsoft really have an impact? I don't know.


"Postel" is not a term that carries any significance for me, and Googling that word didn't turn anything up that seemed relevant.

Who or what is a Postel?


It's a reference to Jon Postel who wrote the following in RFC 761[0]:

    TCP implementations should follow a general principle of robustness:
    be conservative in what you do, be liberal in what you accept from
    others.
Postel's Law is also known as the Robustness principle. [1]

[0] https://datatracker.ietf.org/doc/html/rfc761#section-2.10

[1] https://en.wikipedia.org/wiki/Robustness_principle


I've always felt that this was a misguided principle, to be avoided when possible. When designing APIs, I think about this principle a lot.

My philosophy is more along the lines of "I will begrudgingly give you enough rope to hang yourself, but I won't give you enough to hang everybody else."


HTML parsing is the modern-ish layer-uplifted example of liberal acceptance.

I won't argue that this hasn't been a disaster for technologists, but there are many arguments that this was core to the success of HTML and consequently the web.

Which, yes, could be considered its own separate disaster, but here we are!


It makes sense in a "customer obsessed" way. The user agent tries to show content, tries to send requests and receive the response on behalf of the client (customer), and ceteris paribus it's better for the client if the system works even if there's some small error that can be worked around, right?

but of course this leads to the tragedy of the anticommons: too many people have an effective "veto" (every shitty middlebox, every "so easy to use" 30 line library that got waaay too popular now contributes to ossification of the stack).

what's the solution? similarly careless adoption of new protocols? and hoping for the best? maybe putting an emphasis on provable correctness, and if something is not conformant to the relevant standard then not considering it "broken" for the "if it ain't broken don't touch it" principle?


When it comes to writing APIs I feel strongly that you should be incredibly strict.

1 != ‘1’

true != 1

true != ‘true’

undefined != false

undefined != null

etc

“Flexibility” in your API just means you are signing up for a maintenance burden for the lifetime of your API. You will also run into problems because you have to draw the line somewhere and people will be frustrated/confused since your API is “flexible” but not as flexible as they want. Better to draw the line at complete strictness IMHO. I dislike even optional fields and prefer null to be passed instead except special cases (like when null has a meaning, example: search endpoint where you pass the fields you want to search on and a field can have a null value).

I want people to be explicit about what they are doing/fetching when using an API I have written/maintained. It also encourages less sloppy clients


Ironically it leads to less robust systems in the long term.


> Postel's Law is also known as the Robustness principle.

Really? It seems like it's obviously just a description of how natural language works.† But in that case, there's an enforcement mechanism (not well understood) that causes everyone to be conservative in what they send.

We can observe, by the natural language 'analogy', that the consequence of following this principle is that you never have backwards compatibility. Otherwise things generally work.

† Notably, it has nothing to do with how math works, making it a strange choice for programming.


A reference to Postel's Law: be conservative in what you produce and liberal in what you accept.

The law references that you should strive to follow all standards in your own output, but you should make a best effort to accept content that may break a standard.

This is useful in the context of open standards and evolving ecosystems since it allows peers speaking different versions of a protocol to continue to communicate.

The assertion being made here is that the world has become too fraught with exploiting this attitude for it to continue being a useful rule


What would have been the result of Jon Postel advocating for conservative inputs, I wonder? If the most common protocols had all done this, would they have been bypassed by other protocols that allowed more liberal inputs?


Probably more convoluted protocols, because there are always things that you do accept and that can be used to negotiate protocol extensions.

Imagine a protocol where both sides have to speak JSON with a rigidly-defined structure, and neither side is allowed to ask whether the other supports any extension. Such a protocol looks impossible to extend, but that is not the case: you can indicate that you speak a "relaxed" version of that protocol by e.g. following your first left brace by a predefined, large number of whitespace characters. If you see a client doing this, you know they won't drop the connection if you include a supported_extensions field, and you're still able to speak the rigid version to strict clients.


This made me laugh, because it's even more terrible than the most ridiculous chicanery we had to vomit into HTML and CSS over the years (most of which was the fault of MSIE6).


Yep. Which is why Postel law is, sadly, more like a law of nature (see also "worse is better") than an engineering principle you may or may not follow.


I know it is a single example and we shouldn't extrapolate much out of it, but in the case of HTML those who accepted more liberal input (HTML4/5) won over those that were more conservative (XHTML).


HTML is rather different because it's authored by people. It's typically (though not always!) a good idea to not be too pedantic about accepting user input if you can. XHTML (served with the correct Content-Type) will completely error out if you made a typo and didn't test carefully enough. Useful in dev cycle? Sure. In production? Less so. "The entire page goes tits up because you used <br> instead of <br />" is just not helpful (and also: needlessly pedantic).

But that doesn't really apply to protocols like TCP. Postel's "law" is best understood in the context of 1980, when TCP had been around for a while but without a real standard, everyone was kind of experimenting, and there were tons of little incompatibilities. In this context, it was reasonable and practical advice.

For a lot of other things though: not so much. "Fail fast" is typically the better approach, which will benefit everyone, especially the people implementing the protocols.

This is also why Sendmail became the de-facto standard around the same time by the way: it was bug-compatible with everything else. Later this became a liability (sendmail.cf!), but originally it was a great feature.


RFC 9413 referenced in a parent mentions HTML. It points out that formats meant to be human-authored may benefit more from being liberally accepted.

I also read that XHTML made template authoring hard, as the template itself might not be valid XHTML and/or different template inputs might make output invalid. (I sadly can't find the source of this point right now, but I can't claim credit for it).


I don't recall XHTML being harder to generate from PHP and ASP templates. It's largely down to making sure that all tags in the output are always balanced, which isn't difficult at all.

With PHP specifically there was an issue where the use of shorthand <? syntax for code snippets would conflict with <?xml declaration that would normally be placed at the beginning of the XHTML document - it would see the <? and try to interpret the rest of it as PHP code, which obviously didn't work. The workaround was to disable short tags and always use <?php explicitly


I would almost argue a failing of so many standards is the lack of surrounding tooling. Is this implementation correct? Who knows! Try it against this other version and see if they kind of agree. More specifications need to require test suites.


Am I correct that malformed pages in xhtml would have triggered the browser to output a red XML error and fail to render the page at all?


Yes, but only if you served the XHTML with the proper MIME type of application/xhtml+xml. Nearly everyone served it as text/html, which would lead to the document being interpreted as this weird pseudo XHTML/HTML4 hybrid dialect with all sorts of browser idiosyncrasies [1].

[1] https://www.hixie.ch/advocacy/xhtml


Not really, since in the end HTML5 defined a precise parsing algorithm that AFAIK everyone follows.


HTML5 was born in an era of decent HTML authoring tooling. Very few people write HTML by hand nowadays. This was not true of earlier versions.

Also note that HTML5 codified into liberal acceptance some of the "lazy" manual errors that people made in the early days (many of which were strictly and noisily rejected in XHTML, for example).


The overwhelming complexity of the HTML5 parser [1] is a testament to the 30 years of implementation quirks it's been forced to absorb.

[1]: https://html.spec.whatwg.org/multipage/parsing.html


The fact that googling Postel was worthless also indicates we're in a post-google search world.


I'm actually astounded at how quickly the quality of Google search results has tanked in recent years.


2nd result on kagi was about him but in the form of another critic.

https://datatracker.ietf.org/doc/draft-thomson-postel-was-wr...

Hard disagree.

It's a valid argument, but I say it's merely an argument, not an argument that wins or should win.

But also, I say that detecting out of spec or unexpected input and handling it in any other way than crashing IS adhering to Postel.

Refusing to process a request is better than munging the data according to your own creative interpretation of reasonable or likely, and then processing that munged data.

I consider that to be within Postel to return a nice error (or not if that would be a security divulgence). Failing Postel would be to crash or do anything unintended.


Google’s results for “Postel’s law” and “Jon Postel” are fine. “Postel” is ambiguous, a fairly common surname, so websites of unrelated companies show up, and a disambiguating page on wikipedia that links to Jon Postel and several other people.


I thought the whole point of letting Google surveil your entire life was they would know that if you're interested in computing and networks, to the point of participating on news.ycombinator.com, then they'd know that if you're searching for "Postel," you'd probably want Postel's law to be on the first page.

We're back at pre-1998 search, where we have to specify more and more context just to get results that aren't noise.


Bing had no trouble at all finding him from my device.


Jon Postel was instrumental in making the Internet what it is today.

https://en.wikipedia.org/wiki/Jon_Postel

The Wikipedia article is kinda unclear and doesn't provide the proper context, so:

- Ran IANA, which assigned IP addresses for the Internet.

- Editor of RFCs, which are documents that defined protocols in use by the Internet.

- He wrote a bunch of important RFCs that defined how some very important protocols should work.

- Created or helped create SMTP, DNS, TCP/IP, ARPANET, etc.


It's a reference to "Postel's law" which is a pretty well-known principle in the networking world, and in software more broadly. Named after Jon Postel, who edited and published many of the RFCs describing core Internet protocols.

https://en.wikipedia.org/wiki/Robustness_principle


Adding to the sibling comments, this is briefly covered in Eric Raymond's wonderful book, "The Art of Unix Programming" [0].

[0] https://en.wikipedia.org/wiki/The_Art_of_Unix_Programming


There is no such thing as a post Postel world. But handling the input in any other way than crashing or ub IS perfectly Postel.

Deciding that nul is invalid data, and refusing to allow it, and refusing to munge the data and proceed based on the munged data that you essentially made up, as long as whatever you did do instead was graceful and intentional, to me that is perfectly Postel.


I like the term post-Postel.

There are two reliability constraints that all software faces; security and interoperability. The more lax you are about validation, the more likely interoperability is. "That's weird, I'll just do whatever" is doing SOMETHING, and it's often to the end user's liking. But, you also enter a more and more undefined state inside the software on the other side, and that's where weird things happen. Weird things happening typically manifest as security problems. So the more effort you go to to minimize the possibility of entering a weird state, the more confidence you have that your software is working as specified.

Postel's Law made a lot of sense to me when developing the early Internet. A lot of people were reading imperfect RFCs, and it was nice when your HP server could communicate with a Sun workstation, even though maybe some bit in the TCP header was set wrong. But now? You just gotta get it right and push a hotfix when you realize you messed something up. (Sadly, I don't think it's possible. Middleboxes are getting more and more popular. At work, we make a product where the CLI talks to the server over HTTP/2. We also install Zscaler on every workstation. Zscaler simply blocks HTTP/2. So you can't use our product. Awkward.)


This is also where Google went right with QUIC: encrypt as much as possible to show middleboxes the least possible. This combats ossification. Then again it seems likely middleboxes will just block QUIC (or UDP in general).


The Cryptographic Doom Principle (if you have to perform any cryptographic operation before verifying the MAC on a message you’ve received, it will somehow inevitably lead to doom)[1] is a sort of anti-Postel's Law.

[1] https://moxie.org/2011/12/13/the-cryptographic-doom-principl...


> There appears to be one piece of software which is misinterpreting guidance of this, and trying to depend upon embedded NUL.

Curious what this is


I wonder if it’s https://justine.lol/ape.html / cosmopolitan libc


Just yesterday I asked @jart, here on HN, about Cosmo & OpenBSD.

https://news.ycombinator.com/item?id=41627889

APE was mentioned and some interesting tidbits in the GitHub link provided in the HN comment above.


I'm pretty sure it is, I remember reading something about this

Yeah I found it here

https://news.ycombinator.com/item?id=41030960

2019 bug - https://austingroupbugs.net/view.php?id=1250

https://justine.lol/cosmo3/

> This is an idea whose time has come; POSIX even changed their rules about binary in shell scripts specifically to let us do it.

FWIW I agree with this OpenBSD change, which says more pointedly

All the shells are written in C, and majority of them use C strings for everything, which means they cannot embed a NUL, so this is not surprising. It is quite unbelievable there are people trying to rewrite history on a lark, and expecting the world to follow alone.

i.e. it's not worth it to change a bunch of old code in order to allow making code more esoteric.

We want systems code to be more predictable, reliable, and less esoteric ... not more esoteric


> POSIX even changed their rules about binary in shell scripts specifically to let us do it.

I'd seen this quote around. The fact that the standards were changed to allow it never struck me as a good indication that it should be relied upon. It seems rather backwards of how these standards work.

I got flamed on HN once for saying cosmopolitan libc shouldn't be used for production because it relies on weird behaviors and implementation quirks that aren't really an ABI.


Looking at this further, the standards change doesn't even match what Cosmopolitan is doing.

From the 'changed their rules' link, the 'Desired Action' is to add this text: "The input file may be of any type, but the initial portion of the file intended to be parsed according to the shell grammar [..] shall not contain the NUL character."

This handles things like shar archives where you have a shell script at the beginning, then an exit command, then binary gunk.
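
A rough sketch of that layout, for illustration only: a text header that ends in `exit`, followed by raw bytes the shell never parses. The file name, the `__PAYLOAD__` marker, and the payload are all made up for the example; real self-extractors differ in detail.

```shell
workdir=$(mktemp -d) || exit 1
archive="$workdir/selfextract.sh"   # hypothetical file name

# 1. The parsed portion: plain text, ending in `exit` so the shell stops here.
cat > "$archive" <<'EOF'
#!/bin/sh
# Everything after the marker line below is raw data, not script.
marker_line=$(sed -n '/^__PAYLOAD__$/{=;q;}' "$0")
tail -n "+$((marker_line + 1))" "$0"
exit 0
__PAYLOAD__
EOF

# 2. The unparsed portion: arbitrary bytes, including a NUL.
printf 'foo\0bar\n' >> "$archive"

# Running the archive writes the payload to stdout.
sh "$archive" | od -c
```

The NUL sits after `exit 0`, which is the arrangement the proposed POSIX text permits; moving it above that line would put it in the portion that gets parsed.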

But Cosmopolitan binaries are not just shell scripts with binary. They're hybrids of shell script and DOS executable. And apparently this requires putting nul bytes right near the beginning (see my other comment, https://news.ycombinator.com/item?id=41640331), in the "portion [..] intended to be parsed according to the shell grammar". Which explicitly violates the new text.

I can understand why this hack is needed for what Cosmopolitan is trying to accomplish, but it makes no sense to claim POSIX blessed it.


Yeah exactly, that was my reading too! The claim in the link doesn't match what the POSIX bug says

If it did, then it would be a sign that the POSIX process is not working well

Because POSIX is supposed to be descriptive of what exists, not prescribe new behavior


> POSIX is supposed to be descriptive of what exists, not prescribe new behavior

Well, it can be argued that ape already existed when that POSIX change was written :)

Everything is working as intended. Nothing to see here. Move along. Move along.


Shouldn't be. See the "exit 1" in your link? That's the end of the shell script, and as the OpenBSD link says;

> It remains possible to put arbitrary bytes AFTER the parts of the shell script that get parsed & executed (like some Solaris patch files do). But you can't put arbitrary bytes in the middle,


It is. Binaries generated by cosmocc have NUL in the middle.


Ah, indeed. Here are the first 16 bytes of one:

4d 5a 71 46 70 44 3d 27 0a 00 00 10 00 f8 00 00 |MZqFpD='........|

There are already nul bytes here, and there are a lot more before the single quote gets closed at offset 0x200.
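
For anyone who wants to check a file of their own, here is a small sketch that counts NUL bytes near the start of a file. It assumes `head -c` is available (common, though older systems may lack it); `count_nuls` is a made-up helper, and the sample file just recreates the 16 bytes quoted above.

```shell
# count_nuls FILE [BYTES]: count NUL bytes among the first BYTES bytes (default 512).
count_nuls() {
    head -c "${2:-512}" "$1" | LC_ALL=C tr -cd '\000' | wc -c | tr -d ' '
}

# Recreate the 16-byte header quoted above in a scratch file (octal escapes).
sample=$(mktemp) || exit 1
printf 'MZqFpD=\047\012\000\000\020\000\370\000\000' > "$sample"

echo "NULs in first 16 bytes: $(count_nuls "$sample" 16)"
```

Any non-zero count inside the region the shell still has to parse is what the OpenBSD change now rejects.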


And I can confirm a NUL in 11th byte of my hello.c a.out:

  >>> s[:11]
  b"MZqFpD='\n\n\x00"
Looking closer, I missed the content of "BIOS BOOT SECTOR".


> This was in snapshots for more than 2 months, and only spotted one other program depending on the behaviour (and that test program did not observe that it was therefore depending in incorrect behaviour!!)

Fascinating. I wonder what that program is, and why it depends on the NUL character.


Kudos to OpenBSD!

Similar to the olde-tyme "-o noexec" and "-o nosuid" options for `mount`, there should be easy, no-exceptions ways to blanket ban other types of simply obvious red-flag activity.


Is this going to murder those fancy shell scripts that self-extract a program appended to the tail, which is really just an encoded blob of some kind, presumably compressed, etc.. ???


Not if it was done competently. Shar files and the likes shouldn't contain NULs, even if they contain compressed data. The appended data should be binary safe.


And in case your data does contain NULs, presumably one could add a layer of base64 encoding. Not nice for the filesize, but also much less likely to upset a text editor when the script is opened (even in the absence of NUL bytes).


I was going to check the status of mksh (the Android system shell), but the project page returns:

"Unavailable For Legal Reasons - Sorry, no detailled error message available."

http://www.mirbsd.org/mksh.htm

The Android system shell is now abandoned? This is also in RHEL 9 BaseOS.


Looks fine here, maybe they're blocking your IP range for some reason?


Fine for me. I just got a HTTP warning and nothing else.

~~I believe Android uses toybox, not mksh.~~ It does use toybox, but toybox doesn't appear to include a shell.


It's blocked for me too, but only on my home Internet (Xfinity), not my phone (Google Fi/T-Mobile).


Works fine for me on Xfinity Home via WiFi, Xfinity Mobile, T-Mobile, and Visible by Verizon.


Whatever the issue was, it seems to have been resolved sometime after I last checked.


I see it on my T-Mobile device also. Strange.


What's your browser? The server is using an old TLS version which is no longer supported, and some clients will try https and fail there and not try http.


I'm using Edge on my corporate desktop.

Edge first tries TLS and comes back with: "SSL handshake error '-1' sslerr='1' sslerrdesc='error:1425F102:SSL routines:ssl_choose_client_version:unsupported protocol' sslerrfunc='607' sslerrreason='258'"

Setting to http:// results in the above error, along with "httpd/3.30A Server at www.mirbsd.org Port 80" - I think that the target itself is blocking me.


Works from an EU IP, so whatever it is, it's probably not GDPR?


> Android system shell

This hurt a little.


Related: The installer for iTunes 12.2.1 included a bug which might recursively delete a volume if the path given as input included incorrectly escaped spaces.



On a similar note, I sometimes think about how newline characters are allowed in filenames, and how that can break simple...

    for each $filename in `ls`
loops -- because in many contexts, UNIX treats newlines as a delimiter.

Is there any legitimate use for filenames with newlines?


Well, knowing how to deal with wacky input and corner cases are a requirement of learning ANY programming language. Bourne-style shells are no exception.

Your example has illegal syntax, but the biggest issue is that you should never parse the output of ls. The shell has built-in globbing. This is how you would loop over all entries (files, dirs, symlinks, etc) in the current directory without getting tripped up by whitespace:

    for e in *; do echo "got: $e"; done


> knowing how to deal with wacky input and corner cases are a requirement of learning ANY programming language.

In general, I agree. But if there's a corner case that occasionally breaks naive code but otherwise doesn't do anything, then I'm going to think, "maybe we should just remove that corner case."


David Wheeler has been complaining (and suggesting fixes) about this for a long time: https://dwheeler.com/essays/fixing-unix-linux-filenames.html

safename LSM https://lwn.net/Articles/686789/


Thank you, I had wondered if there was something like safename.


Replace "maybe" with "OBVIOUSLY". Keeping useless-but-hazardous "features" in any language is as idiotic as keeping a heap of oily rags in the furniture factory warehouse.


But, of course, this wouldn't be shell if it didn't have more footguns; namely, it breaks if run in an empty directory (giving a literal "got: *"), and excludes the arbitrary set of files whose names begin with ".".
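
One common way around both pitfalls in plain POSIX sh, sketched under the assumption that you want every entry except `.` and `..`. The function name `list_entries` is made up for the example.

```shell
# list_entries DIR: print one "got: NAME" line per entry, dotfiles included.
# The ( ... ) body runs in a subshell, so the cd does not leak to the caller.
list_entries() (
    cd "$1" || exit 1
    for e in .[!.]* ..?* *; do
        # An unmatched pattern stays literal; skip it unless such a file exists.
        [ -e "$e" ] || [ -L "$e" ] || continue
        printf 'got: %s\n' "$e"
    done
)
```

Calling `list_entries /some/dir` on an empty directory prints nothing, because each leftover literal pattern fails the existence check.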


> Is there any legitimate use for filenames with newlines?

IMHO no, but they can exist, so you need to handle them without blowing up. Also, even spaces are considered delimiters here, which is why it's bad form to parse the output of ls.

    $ touch "foo bar baz"
    $ for f in `ls`; do echo $f; done
    foo
    bar
    baz

    # always use double quotes, though they aren't needed here
    $ for f in *; do echo "$f"; done 
    foo bar baz
At least the OS guarantees you won't run into NUL though.


There is a pretty good syntax for dealing with nasty filenames, if you must: ANSI-C quoting[1].

If you have to output in a shellscript in this format, use printf %q

from man printf:

       %q     ARGUMENT is printed in a format that can be reused as shell input, escaping non-printable
              characters with the proposed POSIX $'' syntax.
It is just $'<nasty ansi-c escaped chars>'

    $ touch $'\nHello\tWorld\n'
    $ ls

One thing I do like about a filesystem that fully supports POSIX filenames is that at the end of the day a filesystem is supposed to represent data. I think it is totally sensible to exclude certain characters, but that it should be done higher up in the stack if possible. Or have a flag that is set at mount time. Perhaps even by subvolume/dataset.

One thing I haven't seen mentioned is that POSIX filenames are so permissive that they allow you to have bytes as filenames that are invalid UTF-8. That's why the popular ncdu[2] program does NOT use json as its file format, although most think it does. It's actually json but with raw POSIX bytes in filename fields, which is outside of the official json spec. That does not stop folks from using json tools to parse ncdu output though.

Another standard that is also very permissive with filenames is git. When I started exploring new ways to encode data into a git repo, it was only natural that I encountered issues with limitations of filesystems that I would check out in.

Try cloning this repo, and see if you are able to check it out: https://github.com/benibela/nasty-files

It is amazing how many things it breaks.

If you are writing software that deals with git filenames or POSIX filenames (that includes things like parsing a zip file footer), you can not rely on your standard json encoding function, because the input may contain invalid utf-8. So you may need to do extra encoding/filtering.
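
As a rough illustration of the "extra filtering" step, here is one way to test whether a name is valid UTF-8 before handing it to a JSON encoder. It assumes an `iconv` that exits non-zero on invalid input (glibc and libiconv do); `is_utf8` and the sample names are made up.

```shell
# is_utf8 NAME: succeed if NAME is a valid UTF-8 byte sequence.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

good='hello.txt'
bad=$(printf 'bad\377name')   # 0xFF can never appear in valid UTF-8

for name in "$good" "$bad"; do
    if is_utf8 "$name"; then
        echo "safe to pass to a JSON encoder as text"
    else
        echo "needs escaping or filtering first"
    fi
done
```

What to do with the names that fail (hex-escape them, skip them, or store raw bytes the way ncdu does) is a separate design decision.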

[1]: https://www.gnu.org/s/bash/manual/html_node/ANSI_002dC-Quoti...

[2]: https://dev.yorhel.nl/ncdu/jsonfmt


I’m not in a place where I can easily check. What happens there if the file name contains a quote?


It's fine, the content of an expanded variable isn't parsed further:

    $ touch "foo \"bar baz"; for f in *; do echo "$f"; done
    foo "bar baz

    # quotes don't affect it either
    $ touch "foo \"bar baz"; for f in *; do echo $f; done
    foo "bar baz
Though once you start passing args with quotes to other scripts, things get ugly. Rule of thumb is to always pass with "$@", and if that isn't enough to preserve quoting for whatever use case, write them out to a tempfile instead, or don't use a shell script for it in the first place.
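To make the "$@" rule of thumb concrete, here is a small sketch. The function names are invented for the example; the only thing it shows is how many arguments arrive on the other side.

```shell
# Hypothetical helper: reports how many arguments it received.
count_args() {
    echo "$#"
}

# Forwards arguments exactly as received.
forward_quoted() {
    count_args "$@"
}

# Forwards them unquoted, so each one is split again on whitespace.
forward_unquoted() {
    count_args $@
}

quoted=$(forward_quoted 'foo "bar baz' 'qux')
unquoted=$(forward_unquoted 'foo "bar baz' 'qux')
echo "quoted: $quoted, unquoted: $unquoted"
```

With the default IFS the quoted wrapper passes two arguments through, while the unquoted one turns them into four.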


What about in the case of

  for f in `ls`; do echo "$f"; done
Same behavior, for the same reason?


The quotes are preserved, but backquote expansion fills the argument list using any whitespace as a delimiter.

    $ for f in `ls`; do echo "$f"; done
    foo
    "bar
    baz
If you absolutely must parse ls (let's assume it's some other script that outputs items with spaces) and the output can contain spaces, you have a few options:

    $ ls | while IFS= read -r f; do echo "$f"; done
    foo "bar baz

    # parens keep the IFS change isolated to a subshell;
    # $'\n' is a real newline, whereas "\n" would set IFS to a backslash and an "n"
    $ (IFS=$'\n'; for f in `ls`; do echo "$f"; done)
    foo "bar baz
But if your filenames contain newlines, you'll really want to stick with the glob expansion, or output custom delimiters and set IFS to that.
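A quick way to see the difference, sketched in a scratch directory (the file name is made up, and this assumes ls prints names literally when its output is a pipe, which is the usual default):

```shell
dir=$(mktemp -d)

# One file whose name contains a space, a double quote and a newline.
nl='
'
touch "$dir/foo \"bar${nl}baz"

# Glob expansion yields exactly one word per file, whatever the name contains.
glob_count=0
for f in "$dir"/*; do
    glob_count=$((glob_count + 1))
done

# Counting lines of ls output treats the embedded newline as a separator.
ls_count=$(( $(ls "$dir" | wc -l) ))

echo "glob sees $glob_count file(s), ls | wc -l sees $ls_count line(s)"
rm -rf "$dir"
```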


Thanks for that. For my reputation’s sake, let me clarify that I do always use either globbing or `find -print0` since a more experienced sysadmin drilled that into my head decades ago. I was curious about other edge cases, but I don’t take any convincing.


> If you absolutely must parse ls

... stop and rethink your options. You may be able to get away with parsing the first columns of ls -l but even then a pathologically named file could make itself look like a line of ls output.

It's simply not possible in all cases. If you can constrain your input then you may be able to make use of it but in the general case, that's why xargs and find grew a -0 option.

Or glob.


Agreed when it comes to ls, but this applies to any script whose output you capture. I personally prefer “while read” loops but I’m probably screwed if someone smuggles in a newline.


If you are iterating over a lot of files, a while read loop can be a major bottleneck. As long as you use the null options from find and pipe into xargs, you should be safe with any filename.

I've found it can reduce minutes down to seconds for large operations.

If you have to process a large number of files, you can let xargs minimize the number of times a program is run, instead of running it once per file.

Something like:

  # Set the setgid bit on all directories
  find . -type d -print0 | xargs -0 chmod g+s

  # Make the targets of symlinks immutable
  find . -type l -print0 | xargs -0 readlink -z | xargs -0 chattr +i
Way faster. But there are lots of caveats. Make sure your programs support it. Maybe read the xargs man page.
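A rough sketch of the batching, using a throwaway directory and an inline `sh -c` that just prints its argument count (both are stand-ins for a real command and real data):

```shell
dir=$(mktemp -d)
touch "$dir/plain" "$dir/with space" "$dir/with
newline"

# xargs -0 reads NUL-terminated names and hands all of them to one
# invocation of the command; the inline script prints how many it got.
batched=$(find "$dir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "files passed to a single invocation: $batched"
rm -rf "$dir"
```

Three awkward names go in and one process receives all three intact. With enough files xargs will split them across several invocations, but still far fewer than one per file.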


Personally I skip the middleman when I can with "find ... -exec cmd {} +"

    find . -type d -exec chmod g+s {} +
Or even minimise arguments by including a test if the chmod is even needed:

    find . -type d \! -perm -g=s -exec chmod g+s {} +
I actually have a script that fixes up permissions, and I was delighted to fit it in a single find invocation which only performs a single stat() on each file in the traversal, and only executes chown/chmod at all for files that need change:

    # - ensure owner is root:shared
    # - ensure dirs have 775 permissions (must have 775, must not have 002)
    # - ensure files have 775 (if w+x), 664 (if w), 555 (if x) otherwise 444 permissions
    find LIST OF DIRS \
        '(' \! '(' -user root -group shared ')'              -print -exec chown -ch root:shared {} + ')' , \
        '(' -type d \! '(' -perm -775 \! -perm -002 ')'      -print -exec chmod -c 775 {} + ')' , \
        '(' -type f    -perm /222    -perm /111 \! -perm 775 -print -exec chmod -c 775 {} + ')' , \
        '(' -type f    -perm /222 \! -perm /111 \! -perm 664 -print -exec chmod -c 664 {} + ')' , \
        '(' -type f \! -perm /222    -perm /111 \! -perm 555 -print -exec chmod -c 555 {} + ')' , \
        '(' -type f \! -perm /222 \! -perm /111 \! -perm 444 -print -exec chmod -c 444 {} + ')'

But if you need multiple transformations of filenames in a pipeline like in your second example, then yes xargs will be involved.
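The "only touch what needs changing" idea can be tried on a scratch directory. This is a cut-down sketch with one made-up rule (files must be mode 644), not the full script above:

```shell
dir=$(mktemp -d)
touch "$dir/ok" "$dir/wrong"
chmod 644 "$dir/ok"
chmod 600 "$dir/wrong"

# -print lists only the files that fail the mode test;
# chmod then runs once for that whole batch.
changed=$(find "$dir" -type f \! -perm 644 -print -exec chmod 644 {} +)

# A second pass finds nothing left to fix, so chmod is never executed.
second=$(find "$dir" -type f \! -perm 644 -print -exec chmod 644 {} +)

echo "first pass fixed: $changed"
echo "second pass fixed: ${second:-nothing}"
rm -rf "$dir"
```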


`find` is almost always easier, but you can get quite far with `ls -Q` if you can assume GNU ls.


You can also create files named e.g. '--help' (if you're not particularly malicious) and with globbing it'll cause e.g. 'ls *' to print help.


    touch -- '-f ..'
(If you want to lay an evil trap)

Remember that in most option parsing libraries, putting '--' in your arguments stops option parsing, so you can safely run:

    rm -- '-f ..'
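A sketch of both defences in a scratch directory. The './' prefix is an alternative for tools that do not honour '--'; the file names are just the ones from this thread.

```shell
dir=$(mktemp -d)
(
    cd "$dir" || exit 1

    # Create two files whose names look like options.
    touch -- '--help' '-f ..'

    # "--" ends option parsing, so the name is treated as an operand.
    rm -- '--help'

    # A leading ./ means the argument no longer starts with a dash at all.
    rm './-f ..'
)

remaining=$(( $(ls -A "$dir" | wc -l) ))
echo "files left: $remaining"
rm -rf "$dir"
```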


Sticky notes on the desktop :) Who needs data storage when you can store it all in the metadata?


A GUI file browser will display the filename with a newline in it as a new line (and an icon above it) so as to be aesthetically pleasing.


this is why things like `find -print0` exist, which is IMO the easiest way to handle this robustly.


Side note: tell your startup to switch its “hardware with Ubuntu Linux inside” to BSD. You will have a much more stable and simple platform that can last a long time.


The recommendation is solid, but FWIW no one looking for stability would choose Ubuntu, among the Linuxen!


> There appears to be one piece of software which is misinterpreting guidance of this, and trying to depend upon embedded NUL.

Big oof here. Why? How?

> If there is ONE THING the Unix world needs, it is for bash/ksh/sh to stop diverging further by permitting STUPID INPUT that cannot plausibly work in all other shells. We are in a post-Postel world.

Amen


I've always found the fact that zsh copes with NUL characters in variables etc to be really useful. I can see why this approach makes sense for OpenBSD but they can't prevent NULs appearing in certain places like piped input.


Does this break those self-extracting script/tar files? I forget how those are done, I haven't seen one in many years.


From the article: "It remains possible to put arbitrary bytes AFTER the parts of the shell script that get parsed & executed (like some Solaris patch files do). "


If you don't know anything about OpenBSD, here's a fun thing:

1. Randomly choose "yes" or "no" to this question.

2. Read the post and get the answer.

3. Repeat until you begin to get a tingly "Spidey sense" that overrides your random-choice.

My Spidey sense here was, "Yes, because OpenBSD would have already thought about and covered that use-case." And indeed, toward the end of the post, that contingency is covered and documented.

Note: if you try this at your job and sense that the company will almost always choose the worst option, you should probably leave that job.



That was a neat idea back in the day but should be disallowed now. Running downloaded executables considered harmful.


> Running downloaded executables considered harmful

Most executables are downloaded. :)


Not in the "Installation: just run `docker run kekw/our-shiny-ai-chatbot` in your shell" world we're living today.


I think the better example is the all-too-common: “Installation: Just run `curl -sL http://goo.gl/hsjdiNgtehsn | sudo bash`”


They were generally uuencoded or similar


Does this break the self extracting tarball trick, where you have a bootstrap shell script with a binary payload appended?


No, they still work.


So I can't bury a tarball inside a shell script anymore?


You still can; it just needs to go at the end:

> It remains possible to put arbitrary bytes AFTER the parts of the shell script that get parsed & executed (like some Solaris patch files do).
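For anyone who has forgotten how these are built, a minimal sketch: the script text ends with an explicit exit, and the payload is appended after a marker line. The `__PAYLOAD__` marker and the file name are made up, and the "extraction" here just hex-dumps the payload instead of unpacking a tarball. I have not run this on the patched OpenBSD ksh; it only illustrates the layout the commit message describes.

```shell
dir=$(mktemp -d)
script="$dir/selfextract.sh"

# Script part: plain text only, ending in "exit 0" so the shell
# never tries to parse anything past that point.
cat > "$script" <<'EOF'
#!/bin/sh
# Everything after the marker line below is data, not shell.
start=$(awk '/^__PAYLOAD__$/ { print NR + 1; exit }' "$0")
tail -n +"$start" "$0" | od -An -tx1 | tr -d ' \n'
echo
exit 0
__PAYLOAD__
EOF

# Payload part: arbitrary bytes, including a NUL, appended at the end.
printf 'A\000B' >> "$script"

payload_hex=$(sh "$script")
echo "payload bytes: $payload_hex"
rm -rf "$dir"
```

A real installer would pipe the tail output into tar or uudecode rather than od.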


Looks like you might be able to at the end of the file, reading the commit message, just not willy-nilly in the middle. :)


I wish FreeBSD replaced /bin/sh with OpenBSD's.


FreeBSD made many cool moves in the 14.0 release, like finally getting rid of sendmail and adopting DMA (the irony), so perhaps there's a chance?

But FreeBSD has always been much less focused on polish/cleanliness than OpenBSD; I mean - they have THREE firewalls, wtf.


> they have THREE firewalls, wtf.

I've not used ipf, but ipfw and pf have a different model and different features (although in 14.0, there's more overlap). I have to use them both.


Wow, they still use CVS...


This was "answered" in 2013 at the end of this post, https://marc.info/?l=openbsd-misc&m=136724343006024&w=2

I guess it hasn't changed since.


Great. Now forbid spaces in filenames.


Funny enough filenames are just byte sequences. So almost anything goes.

There was just some patch that added '/' protection, because that's the only character that's not allowed in filenames.

https://github.com/openbsd/src/commit/46f7109a9e03df89b66ada...


Is this in reference to something? Judging from the comments, NUL bytes in shell scripts are such a common occurrence that everybody is celebrating this change as if it were groundbreaking.

I mean, it's a good idea, but I wonder what am I missing here. Also what do they mean by post-Postel?


Early spec of TCP had a section on the robustness principle that was generally known as Postel's law (https://datatracker.ietf.org/doc/html/rfc761#section-2.10). At the time and until recently this was considered good design. Nowadays people generally want servers to be stricter in what they accept since decades of experience dealing with diverging interpretations of a specification create problems for interoperability.


"until recently"? More than 10 years just going by HN. https://news.ycombinator.com/item?id=5161214

I think HTML showed the problem with Postel's principle. Quoting "Postel’s Law is not for you" at http://trevorjim.com/postels-law-is-not-for-you/ from 2011

> The next version of HTML, HTML5, should considerably reduce the problem of browser incompatibilities. It does this, in part, by rejecting Postel’s Law for browser implementors. Instead of allowing browsers to be liberal when dealing with “flawed” markup, HTML5 requires them to parse it exactly as in the HTML5 specification, and that specification is given much more precisely than before, in the form of a deterministic state machine, in fact. HTML5 is trying to give implementors no leeway at all in this, in the name of browser compatibility.


> "until recently"? More than 10 years just going by HN.

The TCP protocol is from the 1970s (according to Wikipedia, it's from 1974, which is 50 years ago). Something which only happened 10 years ago is recent.


The robustness principle dates to RFC 761 from January 1980, not 1974, making it only 44 years ago. https://www.rfc-editor.org/rfc/rfc761#section-2.10

The citations I gave you were ones I knew existed. I know there was criticism in the early 2000s because we were debating it back then, but I don't have those citations handy.

Checking now, the Wikipedia entry points to criticism in RFC 3117, from 2001, at https://datatracker.ietf.org/doc/html/rfc3117 :

> Counter-intuitively, Postel's robustness principle ("be conservative in what you send, liberal in what you accept") often leads to deployment problems.

That's why I knew to question what 'until recently' was supposed to mean.


Your quotes are actually reinforcing and not detracting from the point I was trying to make. Because it is a question of tradeoffs, for some people the robustness principle could be argued against right from the start. Lots of strict protocols existed at the time (CER and DER rules for ASN.1 for example) even if people preferred forgiving protocols (BER seems to have been more popular). Yes, people were explaining the tradeoffs and why it might make sense to prefer strictness, I am sure you can find earlier quotes. The authors might not feel the need to make these arguments if public sentiment was widely aligned.

The person I was responding to (who sadly was downvoted), seems to have a sort of nascent appreciation for the robustness principle. They seem to be suggesting that the prevalence of null bytes in scripts is an argument to preserve support. To me this is an illustration of the fact that public opinion is still not universally against the robustness principle (indeed some, possibly that poster who never heard it as Postel's law or it didn't click for, may not be aware of it) even if in general I would argue that the majority of people with an opinion would be against it.

But a couple of dates and quotes cannot settle that question, nor can my feelings on the matter. I could be very wrong, the majority of opinion on the matter could still be pro-postel.


I really just want to know the time frame you were thinking of when you wrote "until recently this was considered good design."

I would also like to know what you think changed that general consideration.


Postel’s Law, also known as the Robustness Principle:

> be conservative in what you do, be liberal in what you accept from others

It’s intended as a way to maximise compatibility, and people have generally followed it when designing protocols and file formats. However it’s led to many security vulnerabilities and has caused a lot of compatibility problems itself. These days a lot of people are realising that it’s more harmful than helpful.



Surprised no one has mentioned the Crowdstrike issue, which was due to NUL characters, wasn't it?


It was not. The Crowdstrike issue was:

1. Their code was calling a 21-parameter "matcher" function with 20 parameters of data.

2. They didn't notice, because all the matcher rules had "allow anything" for the 21st parameter and so never looked at it.

3. They later published the first list of rules with something other than "allow anything" as the 21st parameter, direct to customers.

4. On customer machines, the first rule with a non "match everything" 21st parameter went to look at the 21st element of the 20 element array. It expected a string pointer, but instead there was random stack data. It tried dereferencing this to read the string it was expecting, which caused the kernel driver to segfault during early startup, putting customer machines in a boot loop.

https://www.crowdstrike.com/wp-content/uploads/2024/08/Chann...


  > If there is ONE THING the Unix world needs, it is for bash/ksh/sh to
  > stop diverging further by permitting STUPID INPUT that cannot
  > plausibly work in all other shells.  We are in a post-Postel world.
  > 
  > It remains possible to put arbitrary bytes *AFTER* the parts of the
  > shell script that get parsed & executed (like some Solaris patch files
  > do).  But you can't put arbirary bytes in the middle, ahead of shell
  > script parsed lines, because shells can't jump to arbitrary offsets
  > inside the input file, they go THROUGH all the 'valid shell script
  > text lines' to get there.

So here it is again, an example of OpenBSD making software behavior saner for all of us.
I don't consider use of all caps over a minor issue to be sane behavior. At best it's immaturity (trying to force your point rather than persuade), and at worst it's an emotional imbalance that affects judgement. That said, it's ksh, on OpenBSD, so I couldn't care less what they do.


What a weird take. There are just a few emphasized words in the commit message.



