Well, knowing how to deal with wacky input and corner cases is a requirement of learning ANY programming language. Bourne-style shells are no exception.
Your example has illegal syntax, but the biggest issue is that you should never parse the output of ls. The shell has built-in globbing. This is how you would loop over all entries (files, dirs, symlinks, etc) in the current directory without getting tripped up by whitespace:
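A minimal sketch of that loop (the ./ prefix guards against names that start with a dash; echo just stands in for the real per-entry work):
$ for f in ./*; do echo "$f"; done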
> knowing how to deal with wacky input and corner cases is a requirement of learning ANY programming language.
In general, I agree. But if there's a corner case that occasionally breaks naive code but otherwise doesn't do anything, then I'm going to think, "maybe we should just remove that corner case."
Replace "maybe" with "OBVIOUSLY". Keeping useless-but-hazardous "features" in any language is as idiotic as keeping a heap of oily rags in the furniture factory warehouse.
But, of course, this wouldn't be shell if it didn't have more footguns; namely, it breaks if run in an empty directory (the glob stays unexpanded, so you get a literal "got: *"), and it silently excludes the arbitrary set of files whose names begin with ".".
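In bash (not plain POSIX sh), both of those can be disarmed with shell options, e.g.:
$ shopt -s nullglob dotglob  # nullglob: unmatched glob expands to nothing; dotglob: include hidden files
$ for f in *; do echo "$f"; done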
> Is there any legitimate use for filenames with newlines?
IMHO no, but they can exist, so you need to handle them without blowing up. Also, even spaces are considered delimiters here, which is why it's bad form to parse the output of ls.
$ touch "foo bar baz"
$ for f in `ls`; do echo $f; done
foo
bar
baz
# always use double quotes, though they aren't needed here
$ for f in *; do echo "$f"; done
foo bar baz
At least the OS guarantees you won't run into NUL though.
There is a pretty good syntax for dealing with nasty filenames, if you must: ANSI-C quoting[1].
If you have to output filenames from a shell script in this format, use printf %q
from man printf:
%q ARGUMENT is printed in a format that can be reused as shell input, escaping non-printable
characters with the proposed POSIX $'' syntax.
It is just $'<nasty ansi-c escaped chars>'
$ touch $'\nHello\tWorld\n'
$ ls
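ls's rendering of that name varies by version and quoting settings, but with bash's printf builtin you can see the round-trippable form (output from bash; other shells' %q may differ):
$ for f in *; do printf '%q\n' "$f"; done
$'\nHello\tWorld\n'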
One thing I do like about a filesystem that fully supports POSIX filenames is that, at the end of the day, a filesystem is supposed to represent data. I think it is totally sensible to exclude certain characters, but that should be done higher up in the stack if possible. Or with a flag that is set at mount time. Perhaps even per subvolume/dataset.
One thing I haven't seen mentioned is that POSIX filenames are so permissive that they allow you to have bytes as filenames that are invalid UTF-8. That's why the popular ncdu[2] program does NOT use json as its file format, although most think it does. It's actually json but with raw POSIX bytes in filename fields, which is outside the official json spec. That does not stop folks from using json tools to parse ncdu output, though.
Another standard that is also very permissive with filenames is git. When I started exploring new ways to encode data into a git repo, it was only natural that I ran into the limitations of whatever filesystem I checked the repo out on.
If you are writing software that deals with git filenames or POSIX filenames (that includes things like parsing a zip file footer), you cannot rely on your standard json encoding function, because the input may contain invalid UTF-8. So you may need to do extra encoding/filtering.
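An easy way to produce such a name for testing (assuming bash's printf builtin, which understands \xHH escapes):
$ touch "$(printf 'caf\xe9')"  # a lone 0xE9 byte: a valid POSIX filename, but invalid UTF-8
Any json encoder that insists on well-formed UTF-8 will reject or mangle that name.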
It's fine, the content of an expanded variable isn't parsed further:
$ touch "foo \"bar baz"; for f in *; do echo "$f"; done
foo "bar baz
# quotes don't affect it either
$ touch "foo \"bar baz"; for f in *; do echo $f; done
foo "bar baz
Though once you start passing args with quotes to other scripts, things get ugly. Rule of thumb is to always pass with "$@", and if that isn't enough to preserve quoting for whatever use case, write them out to a tempfile instead, or don't use a shell script for it in the first place.
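A sketch of that rule of thumb (the inner script's path is made up for illustration):
#!/bin/sh
# hand every argument through verbatim, each one staying a single word
exec /path/to/inner-script "$@"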
The quotes are preserved, but backquote expansion fills the argument list using any whitespace as a delimiter.
$ for f in `ls`; do echo "$f"; done
foo
"bar
baz
If you absolutely must parse ls (let's assume it's really some other script whose output items can contain spaces), you have a few options:
# IFS= keeps leading/trailing blanks, -r keeps backslashes literal
$ ls | while IFS= read -r f; do echo "$f"; done
foo "bar baz
# parens keep the IFS change isolated to a subshell; $'\n' is bash/ksh/zsh
# (plain IFS="\n" would set IFS to the two characters \ and n, not a newline)
$ (IFS=$'\n'; for f in `ls`; do echo "$f"; done)
foo "bar baz
But if your filenames contain newlines, you'll really want to stick with the glob expansion, or output custom delimiters and set IFS to that.
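And if you control the producing side, NUL is the one delimiter a filename can never contain; in bash the consuming loop looks something like:
$ find . -mindepth 1 -maxdepth 1 -print0 | while IFS= read -r -d '' f; do echo "$f"; done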
Thanks for that. For my reputation’s sake, let me clarify that I do always use either globbing or `find -print0` since a more experienced sysadmin drilled that into my head decades ago. I was curious about other edge cases, but I don’t take any convincing.
... stop and rethink your options. You may be able to get away with parsing the first columns of ls -l, but even then a pathologically named file could make itself look like a line of ls output.
It's simply not possible in all cases. If you can constrain your input then you may be able to get away with it, but in the general case you can't, which is why xargs grew a -0 option and find grew -print0.
Agreed when it comes to ls, but this applies to any script whose output you capture. I personally prefer “while read” loops but I’m probably screwed if someone smuggles in a newline.
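On bash 4.4 or newer, one newline-proof alternative to a while-read loop is to slurp NUL-delimited output into an array (a sketch):
$ mapfile -d '' files < <(find . -mindepth 1 -maxdepth 1 -print0)
$ for f in "${files[@]}"; do echo "$f"; done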
If you are iterating over a lot of files, a while-read loop can be a major bottleneck. As long as you use the null options with find and pipe into xargs, you should be safe with any filename.
I've found it can reduce minutes down to seconds for large operations.
If you have to process a large number of files, you can let xargs minimize the number of times a program is run, instead of running it once per file.
Something like:
# Set the setgid bit for owner and group of all folders
find . -type d -print0 | xargs -0 chmod g+s
# Make the targets of symlinks immutable (-f canonicalizes relative targets,
# which plain readlink would print relative to the link, not to our cwd)
find . -type l -print0 | xargs -0 readlink -zf | xargs -0 chattr +i
Way faster.
But there are lots of caveats. Make sure your programs support it. Maybe read the xargs man page.
Personally I skip the middleman when I can with "find ... -exec cmd {} +"
find . -type d -exec chmod g+s {} +
Or even minimise arguments by including a test if the chmod is even needed:
find . -type d \! -perm -g=s -exec chmod g+s {} +
I actually have a script that fixes up permissions, and I was delighted to fit it into a single find invocation that performs only a single stat() on each file in the traversal and executes chown/chmod only for the files that actually need a change:
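Not the actual script, but the shape is roughly this (GNU find's , operator evaluates both sides for every entry; the owner "alice" is made up):
$ find . \( -type d ! -perm -g=s -exec chmod g+s {} + \) , \
         \( ! -user alice -exec chown alice {} + \)
find stats each entry once, and each -exec batches up and fires only on the entries that fail their test.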
Is there any legitimate use for filenames with newlines?