On a similar note, I sometimes think about how newline characters are allowed in filenames, and how that can break simple...

    for each $filename in `ls`
loops -- because in many contexts, UNIX treats newlines as a delimiter.

Is there any legitimate use for filenames with newlines?


Well, knowing how to deal with wacky input and corner cases is a requirement of learning ANY programming language. Bourne-style shells are no exception.

Your example has illegal syntax, but the biggest issue is that you should never parse the output of ls. The shell has built-in globbing. This is how you would loop over all entries (files, dirs, symlinks, etc.) in the current directory without getting tripped up by whitespace:

    for e in *; do echo "got: $e"; done


> knowing how to deal with wacky input and corner cases is a requirement of learning ANY programming language.

In general, I agree. But if there's a corner case that occasionally breaks naive code but otherwise doesn't do anything, then I'm going to think, "maybe we should just remove that corner case."


David Wheeler has been complaining (and suggesting fixes) about this for a long time: https://dwheeler.com/essays/fixing-unix-linux-filenames.html

See also the safename LSM: https://lwn.net/Articles/686789/


Thank you, I had wondered if there was something like safename.


Replace "maybe" with "OBVIOUSLY". Keeping useless-but-hazardous "features" in any language is as idiotic as keeping a heap of oily rags in the furniture factory warehouse.


But, of course, this wouldn't be shell if it didn't have more footguns; namely, it breaks if run in an empty directory (giving a literal "got: *"), and it excludes the arbitrary set of files whose names begin with ".".
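
In bash, both of those are fixable with shopt, for what it's worth; a quick sketch:

    # nullglob: an unmatched * expands to nothing instead of a literal "*"
    # dotglob: * also matches names starting with "."
    shopt -s nullglob dotglob
    for e in *; do echo "got: $e"; done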


> Is there any legitimate use for filenames with newlines?

IMHO no, but they can exist, so you need to handle them without blowing up. Also, even spaces are considered delimiters here, which is why it's bad form to parse the output of ls.

    $ touch "foo bar baz"
    $ for f in `ls`; do echo $f; done
    foo
    bar
    baz

    # always use double quotes, though they aren't needed here
    $ for f in *; do echo "$f"; done 
    foo bar baz
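For what it's worth, the glob handles newlines just as well; a quick check in a fresh directory, using printf %q to make the newline visible (bash syntax):

    $ touch $'new\nline'
    $ for f in *; do printf '%q\n' "$f"; done
    $'new\nline'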
At least the OS guarantees you won't run into NUL though.


There is a pretty good syntax for dealing with nasty filenames, if you must: ANSI-C quoting[1].

If you have to output filenames in this format from a shell script, use printf %q

from man printf:

       %q     ARGUMENT is printed in a format that can be reused as shell input, escaping non-printable
              characters with the proposed POSIX $'' syntax.
It is just $'<nasty ANSI-C escaped chars>'

    $ touch $'\nHello\tWorld\n'
    $ ls
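
For example, round-tripping that name through bash's builtin printf (coreutils printf should produce the same $'' form):

    $ printf '%q\n' $'\nHello\tWorld\n'
    $'\nHello\tWorld\n'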

One thing I do like about a filesystem that fully supports POSIX filenames is that at the end of the day a filesystem is supposed to represent data. I think it is totally sensible to exclude certain characters, but that it should be done higher up in the stack if possible. Or have a flag that is set at mount time. Perhaps even by subvolume/dataset.

One thing I haven't seen mentioned is that POSIX filenames are so permissive that they allow filenames containing bytes that are invalid UTF-8. That's why the popular ncdu[2] program does NOT use json as its file format, although most people think it does. It's actually json but with raw POSIX bytes in filename fields, which is outside the official json spec. That doesn't stop folks from using json tools to parse ncdu output, though.

Another standard that is also very permissive with filenames is git. When I started exploring new ways to encode data into a git repo, it was only natural that I encountered issues with limitations of filesystems that I would check out in.

Try cloning this repo, and see if you are able to check it out: https://github.com/benibela/nasty-files

It is amazing how many things it breaks.

If you are writing software that deals with git filenames or POSIX filenames (that includes things like parsing a zip file footer), you cannot rely on your standard json encoding function, because the input may contain invalid UTF-8. So you may need to do extra encoding/filtering.
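
Creating such a filename is trivial on a typical Linux filesystem (bash; 0xff can never appear in valid UTF-8):

    # a raw 0xff byte is a perfectly legal POSIX filename byte, but invalid UTF-8
    $ touch "$(printf 'caf\xff')"

Any tool that drops that name into a json string unescaped is emitting invalid json.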

[1]: https://www.gnu.org/s/bash/manual/html_node/ANSI_002dC-Quoti...

[2]: https://dev.yorhel.nl/ncdu/jsonfmt


I’m not in a place where I can easily check. What happens there if the file name contains a quote?


It's fine, the content of an expanded variable isn't parsed further:

    $ touch "foo \"bar baz"; for f in *; do echo "$f"; done
    foo "bar baz

    # quotes don't affect it either
    $ touch "foo \"bar baz"; for f in *; do echo $f; done
    foo "bar baz
Though once you start passing args with quotes to other scripts, things get ugly. Rule of thumb is to always pass with "$@", and if that isn't enough to preserve quoting for whatever use case, write them out to a tempfile instead, or don't use a shell script for it in the first place.
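
A minimal sketch of that rule of thumb (the wrapped command path here is made up):

    #!/bin/sh
    # forward every argument verbatim: "$@" expands each argument as its own
    # word, so embedded spaces, quotes and newlines all survive intact
    exec /usr/local/bin/real-command "$@"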


What about in the case of

  for f in `ls`; do echo "$f"; done
Same behavior, for the same reason?


The quotes are preserved, but backquote expansion fills the argument list using any whitespace as a delimiter.

    $ for f in `ls`; do echo "$f"; done
    foo
    "bar
    baz
If you absolutely must parse ls (let's assume it's some other script that outputs items with spaces) and the output can contain spaces, you have a few options:

    $ ls | while IFS= read -r f; do echo "$f"; done
    foo "bar baz

    # parens keep the IFS change isolated to a subshell
    $ (IFS="\n"; for f in `ls`; do echo "$f"; done)
    foo "bar baz
But if your filenames contain newlines, you'll really want to stick with the glob expansion, or output custom delimiters and set IFS to that.
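
A sketch of that custom-delimiter variant (the producer here is simulated with printf):

    # join items with "|", a byte guaranteed absent from the data
    $ items=$(printf '%s|' foo 'bar baz' $'new\nline')
    $ (IFS='|'; for i in $items; do printf '%q\n' "$i"; done)
    foo
    bar\ baz
    $'new\nline'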


Thanks for that. For my reputation’s sake, let me clarify that I do always use either globbing or `find -print0` since a more experienced sysadmin drilled that into my head decades ago. I was curious about other edge cases, but I don’t take any convincing.


> If you absolutely must parse ls

... stop and rethink your options. You may be able to get away with parsing the first columns of ls -l but even then a pathologically named file could make itself look like a line of ls output.

It's simply not possible in all cases. If you can constrain your input then you may be able to make use of it, but in the general case, that's why xargs grew a -0 option and find grew -print0.

Or glob.


Agreed when it comes to ls, but this applies to any script whose output you capture. I personally prefer “while read” loops but I’m probably screwed if someone smuggles in a newline.


If you are iterating over a lot of files, a while-read loop can be a major bottleneck. As long as you use find's -print0 and pipe into xargs -0, you should be safe with any filename.

I've found it can reduce minutes down to seconds for large operations.

If you have to process a large number of files, you can let xargs minimize the number of times a program is run, instead of running it once per file.

Something like:

  # Set the setgid bit for owner and group of all folders
  find . -type d -print0 | xargs -0 chmod g+s

  # Make the targets of symlinks immutable
  find . -type l -print0 | xargs -0 readlink -z | xargs -0 chattr +i
Way faster. But there are lots of caveats. Make sure your programs support it. Maybe read the xargs man page.
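
One caveat as a sketch: if the command needs the filename in the middle of its argument list, xargs -I runs once per file and the batching win disappears (the target dir below is made up):

  # -I{} forces one cp invocation per file -- no batching
  find . -type f -print0 | xargs -0 -I{} cp {} /backup/dir/

  # GNU cp's -t flag moves the target up front, so xargs can batch sources
  find . -type f -print0 | xargs -0 cp -t /backup/dir/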


Personally I skip the middleman when I can with "find ... -exec cmd {} +"

    find . -type d -exec chmod g+s {} +
Or even minimise the argument list by testing whether the chmod is needed at all:

    find . -type d \! -perm -g=s -exec chmod g+s {} +
I actually have a script that fixes up permissions, and I was delighted to fit it in a single find invocation which only performs a single stat() on each file in the traversal, and only executes chown/chmod at all for files that need change:

    # - ensure owner is root:shared
    # - ensure dirs have 775 permissions (must have 775, must not have 002)
    # - ensure files have 775 (if w+x), 664 (if w), 555 (if x) otherwise 444 permissions
    find LIST OF DIRS \
        '(' \! '(' -user root -group shared ')'              -print -exec chown -ch root:shared {} + ')' , \
        '(' -type d \! '(' -perm -775 \! -perm -002 ')'      -print -exec chmod -c 775 {} + ')' , \
        '(' -type f    -perm /222    -perm /111 \! -perm 775 -print -exec chmod -c 775 {} + ')' , \
        '(' -type f    -perm /222 \! -perm /111 \! -perm 664 -print -exec chmod -c 664 {} + ')' , \
        '(' -type f \! -perm /222    -perm /111 \! -perm 555 -print -exec chmod -c 555 {} + ')' , \
        '(' -type f \! -perm /222 \! -perm /111 \! -perm 444 -print -exec chmod -c 444 {} + ')'

But if you need multiple transformations of filenames in a pipeline like in your second example, then yes xargs will be involved.


`find` is almost always easier, but you can get quite far with `ls -Q` if you can assume GNU ls.
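
e.g. in a directory with a newline in one name (GNU ls; -Q is C-style quoting, so non-printables come out escaped):

    $ touch $'foo\nbar'
    $ ls -Q
    "foo\nbar"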


You can also create files named e.g. '--help' (if you're not feeling particularly malicious), and with globbing it'll cause e.g. 'ls *' to print the help text.


    touch -- '-f ..'
(If you want to lay an evil trap)

Remember that in most option parsing libraries, putting '--' in your arguments stops option parsing, so you can safely run:

    rm -- '-f ..'
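
Another option that sidesteps option parsing entirely is an explicit path prefix, so the argument can't start with a dash:

    rm './-f ..'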


Sticky notes on the desktop :) Who needs data storage when you can store it all in the metadata?


A GUI file browser will display the filename with a newline in it as a new line (and an icon above it), so as to be aesthetically pleasing.


This is why things like `find -print0` exist, which is IMO the easiest way to handle this robustly.
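
The usual pattern, for reference (bash, since it relies on read -d '' and process substitution):

    while IFS= read -r -d '' f; do
        printf 'got: %q\n' "$f"
    done < <(find . -type f -print0)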



