An Opinionated Guide to Xargs (oilshell.org)
402 points by todsacerdoti on Aug 21, 2021 | 130 comments



Since the blog author is commenting here: you have this statement partway down your blog post:

> That is, grep doesn't support an analogous -0 flag.

However, the GNU grep variant does have an analogous flag:

-z, --null-data

Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. Like the -Z or --null option, this option can be used with commands like sort -z to process arbitrary file names.
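
So a NUL-safe pipeline can stay NUL-safe end to end, e.g. (GNU grep; the pattern is just illustrative):

    find . -print0 | grep -z '_test\.py$' | xargs -0 -- rm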


Ah cool, I didn't know that! I'll update the blog post. (What a cacophony of flags)

Edit: It seems that grep -0 isn't taken for something else and they should have used it for consistency? The man page says it's meant to be used with find -print0, xargs -0, perl -0, and sort -z (another inconsistency)


It is taken in grep, just poorly documented; grep -5 means grep -C 5, and grep -0 means grep -C 0. It's not taken in sort, though, so I don't know why they didn't use -0 for sort.


I think that's because they needed to support both input and output. So there's both -Z and -z. No such thing as an uppercase 0 :)


See https://github.com/fish-shell/fish-shell/issues/3164#issueco... for incomplete but large survey of NUL flags.

Takeaways: (1) There is no consistency in flag names, even --long ones (2) impressively many tools do support it! Note that some affect only input or only output. (3) All do NUL-terminated, not NUL-separated. That's fortunate — matches \n usage, and gives distinct representations for [] vs [""].


It's best to give up on any kind of consistency between command options. Any project is free to do anything it wants, and they all do. Someone is eventually going to come up with standard N+1[1] which does things consistently, but they are going to have to either recreate a bazillion tools or create some sort of huge translation framework configuration on top of existing tools to get there. And even then it'll take literally decades before people migrate away from the current tools. Basically, the sad truth is this isn't going to happen.

[1] https://xkcd.com/927/


In 2002, I implemented xargs in Lisp, in the Meta-CVS project.

It is quite necessary, because you cannot pass an arbitrarily large command line or environment in exec system calls.

Of course, this doesn't have the problem that -0 solves, because we're not reading textual lines from standard input but working with lists of strings.

  ;;; This source file is part of the Meta-CVS program,
  ;;; which is distributed under the GNU license.
  ;;; Copyright 2002 Kaz Kylheku

  (in-package :meta-cvs)

  (defconstant *argument-limit* (* 64 1024))

  (defun execute-program-xargs (fixed-args &optional extra-args fixed-trail-args)
    (let* ((fixed-size (reduce #'(lambda (x y)
                                   (+ x (length y) 1))
                               (append fixed-args fixed-trail-args)
                               :initial-value 0))
           (size fixed-size))
      (if extra-args
        (let ((chopped-arg ())
              (combined-status t))
          (dolist (arg extra-args)
            (push arg chopped-arg)
            (when (> (incf size (1+ (length arg))) *argument-limit*)
              (setf combined-status
                    (and combined-status
                         (execute-program (append fixed-args
                                                  (nreverse chopped-arg)
                                                  fixed-trail-args))))
              (setf chopped-arg nil)
              (setf size fixed-size)))
          (when chopped-arg
            (execute-program (append fixed-args (nreverse chopped-arg)
                                     fixed-trail-args)))
          combined-status)
        (execute-program (append fixed-args fixed-trail-args)))))


I frequently find myself reaching for this pattern instead of xargs:

    do_something | ( while read -r v; do
    . . .
    done )
I’ve found that it has fewer edge cases (except it creates a subshell, which can be avoided in some shells by using braces instead of parens)


Some additional tips:

1. You don't need the parentheses.

2. If you use process substitution [1] instead of a pipe, you will stay in the same process and can modify variables of the enclosing scope:

    i=0
    while read -r v; do
        ...
        i=$(( i + 1))
    done < <(do_something)
The drawback is that this way `do_something` has to come after `done`, but that's bash for you ¯\_(ツ)_/¯

[1] https://www.gnu.org/software/bash/manual/html_node/Process-S...


I use this exact pattern a lot. One thing to consider is that in the process substitution version, do_something can't modify the enclosing variables. The vast majority of the time I want to modify variables in the loop body and not the generating process, but it's worth keeping in mind.

One common pattern I use this for is running a bunch of checks/tests, e.g.

    EXIT_CODE=0
    while read -r F
    do
        do_check "$F" || EXIT_CODE=1
    done < <(find ./tests -type f)
    exit "$EXIT_CODE"
This is a more complicated alternative to the following:

    find ./tests -type f | while read -r F
    do
      do_check "$F" || exit 1
    done
The simpler version will abort on the first error, whilst the first version will always run all of the checks (exiting with an error afterwards, if any of them failed)


I usually write zsh scripts and I think there’s a shell option in zsh that allows the loop at the end of the pipe to modify variables in the enclosing body: I remember at least one occasion where I was surprised about this discrepancy between shells.


Interesting! Indeed, Greg's BashFAQ notes it too: https://mywiki.wooledge.org/BashFAQ/024

>Different shells exhibit different behaviors in this situation:

>- BourneShell creates a subshell when the input or output of anything (loops, case etc..) but a simple command is redirected, either by using a pipeline or by a redirection operator ('<', '>').

>- BASH, Yash and PDKsh-derived shells create a new process only if the loop is part of a pipeline.

>- KornShell and Zsh creates it only if the loop is part of a pipeline, but not if the loop is the last part of it. The read example above actually works in ksh88, ksh93, zsh! (but not MKsh or other PDKsh-derived shells)

>- POSIX specifies the bash behaviour, but as an extension allows any or all of the parts of the pipeline to run without a subshell (thus permitting the KornShell behaviour, as well).


Check out bash lastpipe option.
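
A minimal sketch (lastpipe only takes effect when job control is off, i.e. in scripts or after set +m):

    shopt -s lastpipe
    count=0
    printf 'a\nb\nc\n' | while read -r line; do count=$((count + 1)); done
    echo "$count"   # prints 3: the while loop ran in the current shell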


Yeah, although I use the parentheses mostly because I like how it reads. And that process substitution trick is important too.

I think the redirection can come first, though (not at a computer to test):

    < <( do_something ) while read . . .


Yeah, for commands, the input/output redirections can precede them, but for some reason it doesn't work for builtin constructs like `while`:

    $ < <( echo foo ) while read -r f; do echo "$f"; done
    -bash: syntax error near unexpected token `do'
    $ < <( echo foo ) xargs echo
    foo

    $ bash --version
    GNU bash, version 5.1.4(1)-release (x86_64-apple-darwin20.2.0)


Maybe wrap the loop either with parentheses or braces?


Tried that, but nope :D I'll let you figure this one out once you get near a computer!


Redirection like this doesn't seem to work if it comes first on GNU bash 5.0.17(1)-release.

For documentation purposes, this is the exact thing I tried to run:

    $ < <(echo hi) while read a; do echo "got $a"; done
    -bash: syntax error near unexpected token `do'

    $ while read a; do echo "got $a"; done < <(echo hi)
    got hi

Maybe there is another way...


One way which isn't great, but an option nonetheless… The zsh parser is happy with that form:

    $ zsh -c '< <(echo hi) while read a; do echo "got $a"; done'
    got hi
My position isn't that it is a good reason to switch shells, but if you're using it anyway then it is an option.


I’ve always preferred zsh and, as I’ve slowly adopted nix, I’ve slowly stopped writing bash in favor of zsh


This is not POSIX compliant though.


These days bash and/or zsh are available nearly every place I care about, so I find POSIX compliance to be much less relevant.


No, process substitution must be provided by the kernel/syslibs; it is not a feature of bash. For example, there is bash on AIX, but process substitution is not possible because the OS does not support it.


ksh93 depends exclusively on the kernel implementation of /dev/fd devices. I just checked `cat <(ls)` a moment ago on both Linux and AIX 7.2--the latter fails in ksh93t+.

Bash uses /dev/fd when available, but also appears to have an internal implementation which silently creates named pipes and cleans them up. In Bash 5.0.18 on AIX, fake process substitution works just fine, in my testing.


Yes, you are right. Bash 5 on AIX 7.2 works with process substitution. Thanks for the advice!


Also for the `while` enthusiasts, here's how you zip the output of two processes in bash:

    paste -d \\n <(do_something1) <(do_something2) | while read -r var1 && read -r var2; do
        ... # var1 comes from do_something1, var2 comes from do_something2
    done


For thousands of arguments this solution is much slower (high CPU usage) than xargs, because it either implements the logic as a shell script (slow) or runs an external program for each argument (slow).


Sure, if performance matters use xargs. I find this is easier to read and think about.


Thank you. Your comment coalesced a number of things in my mind that I hadn’t grasped properly as a UNIX midwit, especially the braces thing.


creating a subshell can lead to some surprising behavior if you aren't careful though.


I tend to reach for gnu parallel instead of xargs -

https://www.gnu.org/software/parallel/parallel_alternatives....

parallel is probably on the complex side, but it's also been actively developed, bugfixed, and has had a lot of road miles from large computing users.


I mention it here: https://www.oilshell.org/blog/2021/08/xargs.html#xargs-p-aut...

What does it do that xargs and shell can't? (honest question)


One thing parallel can do better than xargs is collect output.

If you use `xargs -P`, all processes share the same stdout and output may be mixed arbitrarily between them. (If the program being executed uses line buffering, lines usually won't be mixed together from multiple invocations, but they can be if they're long enough).

In contrast, `parallel` by default doesn't mix together output from different commands at all, instead buffering the entire output until the command exits and then printing it.

With `--line-buffer` the unit of atomicity can be weakened from an entire command output to individual lines of output, reducing latency.

Alternately, with `--keep-order`, `parallel` can ensure the outputs are printed in the same order as the corresponding inputs, which makes the output deterministic if the program is deterministic. Without that you'll get results in an arbitrary order.

These aren't technically things that xargs and shell can't do; you could reimplement the same behavior by hand with the shell. But by the same token, there isn't anything xargs can do that the shell can't do alone; you could always use the shell to manually split up the input and invoke subprocesses. It's just a question of how much you want to reimplement by hand.
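
A rough way to see the difference (assuming GNU xargs and GNU parallel; the exact interleaving depends on timing):

    seq 4 | xargs -P 4 -n 1 sh -c 'echo "job $0 line 1"; echo "job $0 line 2"'
    # with -P, lines from different jobs can interleave arbitrarily

    seq 4 | parallel 'echo "job {} line 1"; echo "job {} line 2"'
    # parallel buffers each job and prints its output as one block

    seq 4 | parallel --keep-order 'echo "job {}"'
    # --keep-order prints results in input order, not completion order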


OK thanks, looks like there are several features of GNU parallel that users like.

For the output interleaving issue, what I do is use the $0 Dispatch Pattern and write a shell function that redirects to a file:

    do_one() {
      task_with_stdout > "$dir/$task_id.txt"
    }
So if there are 10,000 tasks then I get 10,000 files, and I can check the progress with "ls", and I can also see what tasks failed and possibly restart them.
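
Spelled out, the whole pattern is roughly this (a sketch; the output dir and task_with_stdout are placeholders):

    #!/bin/bash
    dir=_out
    mkdir -p "$dir"

    do_one() {
      local task_id=$1
      task_with_stdout "$task_id" > "$dir/$task_id.txt"
    }

    "$@"   # $0 Dispatch: run the function named by the first argument

    # invoked like:  seq 10000 | xargs -n 1 -P 8 -- ./run.sh do_one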

You even have some notion of progress by checking the file size with ls -l.

I tend to use a pattern where each task also outputs a metadata file: the exit status, along with the data from "time" (rusage, etc.)

But I admit that this is annoying to rewrite in every script that uses xargs! It does make sense to have this functionality in a tool.

But I think that tool should be a LANGUAGE like Oil, not a weirdo interface like GNU parallel :)

But thanks for the explanation (and thanks to everyone in this subthread) -- I learned a bunch and this is why I write blog posts :)


Thank you for writing this, it really crystallized for me why I feel the way I do about oil. I hate it. When I want a language, I want a real language like python, not a weirdo jumped-up shell (see what I did there?). What I want in a shell is a super small, fast, universally understood thing for basic tasks and easy expandability through tools like parallel and python.

For what it's worth, I consider oil to be closer to a unixy PowerShell than to a more powerful bash. Note that this is not a slight; PowerShell is sweet for what it is. It (oil) really takes a hard left from the POSIX philosophy of focusing on one thing and doing it well. I'm also bitter that, if it's going to veer so far away from POSIX, it didn't go the whole hundred and become a functional language with comprehensions and such.

For what it's worth, everything you mentioned above about your approach can be done with parallel.


The point of Oil is that there are really basic things, like safe quoting, that shells should do well, yet none of the POSIX shells do!

Functional: there are interesting shells like Elvish. But it really goes PowerShell by adding internal rich data pipelines that don't have a unixy stream-of-bytes representation. Oil does NOT go that way; it works on stuff like QSN to make pure Unix interconnects more robust.


Your `do_one`:

  * does not buffer stderr
  * does not check if the disk is full for a period of time during a task (thus risking incomplete output)
  * does not clean up, if killed
  * does not work correctly if task_with_stdout is a composed command
Given that GNU Parallel is a drop-in replacement for xargs, I am curious why you find it a 'weirdo interface'.


Any sufficiently NIH'd tool can be considered weird. -- Not Isaac Asimov


A lot of this comes down to familiarity. I tend to use "make -j 100" for what you're describing. If I write the Makefile carefully [1], it will handle resuming a half-finished job. I just looked and GNU parallel has a --resume argument which probably does something similar, and maybe with less hassle. But I don't do this often enough—and/or GNU parallel isn't "better enough"—that I'm likely to ever invest the time to learn GNU parallel.

btw, oil looks very cool. I hate how many footguns are in common shells.

[1] eg writing to a tempfile and atomically renaming into place: "task_with_stdout > $dir/task_id.txt.tmp && mv $dir/task_id.txt{.tmp,}"


Restart capability and remote execution make gnu parallel the tool of choice for HPC. For example, you might very well use gnu parallel to run 1000s of cpu-hours of numerical simulation using patterns such as these ones,

https://docs.computecanada.ca/mediawiki/index.php?title=GNU_...

Using xargs for this kind of work is euhm... not a good idea.


i don't know if xargs can't, but i use gnu parallel to split an input pipe into N parallel pipes processing slices of the input stream.

Edit: To clarify, xargs usually wants to spin up a process per task. I have parallel spin up N processes and then continuously feed them.
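
What I mean is the --pipe mode, roughly like this (the block size and command are just illustrative):

    cat big.log | parallel --pipe -j 8 --block 10M 'grep ERROR | wc -l'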


Not to be pedantic, but that's a bit of a non-argument. _Of course_ you can do it with xargs and shell, but imho parallel is generally more convenient, especially for remote execution. It provides a higher level of abstraction for such tasks.


> What does it do that xargs and shell can't? (honest question)

For me, an essential feature of GNU parallel is that it is semantically equivalent to "sh". Imagine that you write a file that contains a long list of commands. You can pipe that file to "sh" to run the commands, or pipe it to "parallel" to do the same, but faster. If you are building the list of commands on the fly, then you can use xargs with a slightly different syntax. But somehow using "sh" or "parallel" gives a certain peace of mind due to its straightforward semantics. I never used any argument of GNU parallel apart from -j

My usage pattern: to build a list of commands explicitly then run it (possibly teeing the list into a temporary file to inspect it):

    for i in one two three; do
        printf 'echo %s\n' "$i"
    done | sh  # or | parallel


I use GNU Parallel for long-running jobs for its --eta option. If a job will take days or longer it's useful to know that early in the process. You might want to cancel it and try something else, and if you want to proceed with the long job you can make plans around when your data will be ready.


It is documented in the GNU Parallel documentation: https://www.gnu.org/software/parallel/parallel_alternatives....

If you seriously believe you can implement everything using xargs, then this (contrived) example is for you: https://unix.stackexchange.com/questions/405552/using-xargs-...

Newer versions include 'parset' which can set shell variables in parallel, which is useful if you want to 'map' values from one array to another.


GNU Parallel can be sourced into a bash session from a plain text file and used as a function. I've used it to get around overly-restrictive build environments. (overly restrictive because the team that manages the build image wasn't open to modifying their image for my use case)


Resumption, error reporting and much better progress monitoring.


Oh I didn't know about resumption.. parallel has so many features packed into its CLI it's kind of ridiculous.

For others that didn't know about it, see the examples here: https://www.gnu.org/software/parallel/parallel_tutorial.html...

Here's another surprising feature: https://www.gnu.org/software/parallel/parallel_tutorial.html...


Wow! That is surprising and potentially very useful.


Remote execution.


I'd like to see a demo of it! I will try rewriting it with the $0 Dispatch Pattern and ssh :)


Good luck balancing node usage!

Here is an example of how it works,

https://docs.computecanada.ca/mediawiki/index.php?title=GNU_...

This + restart capabilities make gnu parallel very well suited to running 1000s of compute-heavy jobs on HPC clusters.


I used Parallel to distribute the rendering of a little Blender animation. It worked very well.

https://github.com/tfmoraes/blender_gnu_parallel_render/blob...


Issue complaint prompts to promote the author, for one.


The nagware prompts of parallel are so objectionable that I will do a lot of things to avoid using it at all. So pretentious!


Seems like some distributions patch out the nagware. I know Arch Linux does[0].

[0]: https://github.com/archlinux/svntogit-community/tree/package...



On the contrary, I think more FOSS authors should do things like this. Freedom doesn't mean you don't get to take credit for your work.


It's also written in Perl!


Veering off course here, after experiencing how incredibly long it took to install Sqitch, I will go out of my way to avoid anything that is more than a single script, certainly anything requiring CPAN too. I don’t think there’s anything technically wrong with these programs or with Perl, they’re just presented in ways that are unique hassles in this day and age.


> anything that is more than a single script

... which is exactly what GNU Parallel is. Your concern is even mentioned in the design documentation: https://www.gnu.org/software/parallel/parallel_design.html


I remember when Perl was the coolest thing ever. I even wrote numerical simulations in it, just to try. Only with the invention of Python and Ruby did we realize how much better things could be. Of course, that the Perl inventor was an IOCCC winner should've been a red flag.


If you need more visibility into long running processes, pueue is another alternative. You can of course use `xargs -P1 pueue add ./process_file.sh` to add the jobs in the first place. Sends a job to pueued, returns immediately. Great for re-encoding dozens of videos. For jobs that aren’t already multi-core, set the queue parallelism with pueue, after you’ve seen your cpu is under-utilised.

Obviously the downside to the visibility and dynamism is that it redirects stdout. You can read it back later, in order, but it's not there for continued processing immediately.


I always think of xargs as the inverse of echo. echo converts arguments to text streams, and xargs converts text streams to arguments.
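
For instance:

    $ echo a b c                     # arguments -> text
    a b c
    $ echo a b c | xargs -n 1 echo   # text -> arguments, one per invocation here
    a
    b
    c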


That's a pretty neat way of thinking about it!


I appreciate this. If I wrote my own opinionated guide to xargs, it would be a single profane sentence.


In Bash (not every shell supports this) functions can be exported, which enables this nice pattern with xargs:

    myfunc() {
        printf " %s" "I got these arguments:" "$@" $'\n'
    }
    export -f myfunc
    seq 6 | xargs -n2 bash -c 'myfunc "$@"' "$0"


Wanting verbose logging from xargs, years ago I wrote a script called `el` (edit lines) that basically does `xargs -0` with logging. https://github.com/westurner/dotfiles/blob/develop/scripts/e...

It turns out that -print0 and -0 (and friends) are the only safe way, since newlines in filenames aren't escaped:

    find . -type f -print0 | el -0 --each -x echo
GNU Parallel is a much better tool: https://en.wikipedia.org/wiki/GNU_parallel


(author here) Hm I don't see either of these points because:

GNU xargs has --verbose which logs every command. Does that not do what you want? (Maybe I should mention its existence in the post)

xargs -P can do everything GNU parallel does, which I mention in the post. Any counterexamples? GNU parallel is a very ugly DSL IMO, and I don't see what it adds.

--

edit: Logging can also be done by recursively invoking shell functions that log with the $0 Dispatch Pattern, explained in the post. I don't see a need for another tool; this is the Unix philosophy and compositionality of shell at work :)


Parallel's killer feature is how it spools subprocess output, ensuring that it doesn't get jumbled together. xargs can't do that. I use parallel for things like shelling out to 10000 hosts and getting some statistics. If I use xargs the output stomps all over itself.


Ah OK thanks, I responded to this here: https://news.ycombinator.com/item?id=28259473


As far as I'm aware, xargs still has the problem of multiple jobs being able to write to stdout at the same time, potentially causing their output streams to be intermingled. Compare this with parallel's --group.

Also parallel can run some of those jobs on remote machines. I don't believe xargs has an equivalent job management function.


In your examples you fail to put 'xargs -P' in the middle of a pipeline: You only put it at the end.

In other words:

  some command | xargs -P other command | third command
This is useful if 'other command' is slow. If you buffer on disk, you need to clean up after each task: Maybe there is not enough free disk space to buffer the output of all tasks.

UNIX is great in that you can pipe commands together, but due to the interleaving issue 'xargs -P' fails here. It does not live up to the UNIX philosophy. Which is probably why you unconsciously only use it at the end of a pipeline.

You can find a different counterexample on https://unix.stackexchange.com/questions/405552/using-xargs-... I will be impressed if you can implement that using xargs. Especially if you can make it cleaner than the parallel version.


Yeah but xargs doesn't refuse to run until I have agreed to a EULA stating I will cite it in my next academic paper.


parallel doesn't either, it just nags. I agree about how silly and annoying it is. Imagine if every time the parallel author opened Firefox he got a message reminding him to personally thank me if he uses his web browser for research, or if every time his research program calls malloc he has to acknowledge and cite Ulrich Drepper. Very very silly.

Parallel is the better tool but the nagware impairs its reputation.


or every time a process called fork() you had to read some stupid message


echo will cite | parallel --bibtex


I scanned until I saw `ls | egrep '.*_test\.(py|cc)' | xargs -d $'\n' -- rm`, and then stopped. This is a terrible idea[1][2].

[1] https://mywiki.wooledge.org/ParsingLs

[2] https://unix.stackexchange.com/q/128985/3645


I'm surprised the links don't mention find. The -print0 flag makes it safe for crazy filenames, which pairs with the xargs -0 flag, or the perl -0 flag, etc. And you have -maxdepth if you don't want it to trawl.
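
e.g. something along these lines (the patterns are made up to match the post's example):

    find . -maxdepth 1 \( -name '*_test.py' -o -name '*_test.cc' \) -print0 | xargs -0 -- rm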


This is only tangentially related, but after all the posts here the last few days about thought terminating cliches, I can’t help but reflect on the “X considered harmful” title cliche


Yes, I absolutely hate them. I was thinking of creating a "considered harmful" considered harmful rant but it already exists [0].

[0] https://meyerweb.com/eric/comment/chech.html


Is it thought terminating, though? "X considered harmful" seems more intended to spark discussion in an intentionally inflammatory way than to stifle it.

(In any case, this surely is tangential, since the title is not "X considered harmful" for any value of X—at best it comments on a post by that title, as, indeed, you are doing.)


I've been thinking about titles, and it's hard to make a good one that doesn't look like a total cliché. "X considered harmful", "an opinionated guide to X", some kind of joke or reference, what could be a collection of tags (X, Y and Z), "things I have learned doing X", etc.


I specifically clicked on this topic because of the word “opinionated”. As I already know how to use xargs, I was curious what kind of non-conventional or controversial opinion the author might have.


As I've said to a sibling comment, I don't think it's a bad title, and "an opinionated guide to X" is one of the better cliché for titles that I see (the worst being the journalist that feels like they have to make a joke).


In this case a less cliche/click-baity title could simply be:

"A Response to Xargs Criticism"


I think this title is fine, it's mostly that after spending some time on Hacker News all the titles start to look the same.


What every X should know about Y, an opinionated take on Z considered harmful


...with an example Lisp implementation written in APL translating into 6502 assembly :)


Would you say the title terminated your consideration of the article?


No, I think if anything seeing it was a response to "xargs considered harmful" made me take the author's side quicker


Of xargs, for, and while, I have limited myself to while. It's more typing every time but saves me from having to remember so many quirks of each command.

    cat input.file | ... | while read -r unit; do <cmd> "${unit}"; done | ...
between 'while read -r unit' and 'while IFS= read -r unit' I can probably handle 90% of the cases. (maybe I should always use IFS since I tend to forget the proper way to use it).


That way will bite you when the tasks in question are cheaper than fork+exec. There was a thread just the other day in which folks were creating 8 million empty files with a bash loop over touch. But it's 60X faster (really, I measured) to use xargs, which will do batches (and parallelism if you tell it to).

https://news.ycombinator.com/item?id=28192946
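
Roughly the difference (a sketch; timings obviously vary):

    seq 8000000 | while read -r i; do touch "$i"; done   # one touch process per file
    seq 8000000 | xargs touch                            # touch invoked in large batches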


Would you mind expanding with a couple of examples? (E.g. using "foo bar" as a single line or split by whitespace).

I suspect I'll really like your way of doing things, but an example would be very handy.


The example of "foo bar" didn't work with while but inserting tr fixes it:

    echo "foo bar" | tr ' ' '\n' | while read -r var; do echo ${var}; done
For examples in general, I guess something like "cat file.csv" could work. (the difference between using IFS= and not using it is essentially whether we want to preserve leading and trailing whitespaces or not. If we want to preserve, then we should use IFS=).


I always wonder why something like xargs is not a shell built-in. It's such a common pattern, but I dread formulating the correct incantation every time.

I was happy to read that the author comes to the same conclusion and proposes an `each` builtin (albeit only for the Oil shell)! I like that there is no need to learn another mini-language, as pointed out.


If you're a zsh user it offers a version of something like xargs in zargs¹. As the documentation shows it can be really quite powerful in part because of zsh's excellent globbing facilities, and I think without that support it wouldn't be all that useful as a built-in.
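
From memory, basic usage is something like this (so treat it as a sketch):

    autoload -U zargs
    zargs -- **/*.log -- rm    # roughly: expand the glob, then batch it into rm invocations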

I'd also perhaps argue that the reason we don't want xargs to be a built-in is precisely because of zargs and the point in your second paragraph. If it was built-in it would no doubt be obscenely different in each shell, and five decades later a standard that no one follows would eventually specify its behaviour ;)

¹ https://zsh.sourceforge.io/Doc/Release/User-Contributions.ht... - search for "zargs", it has no anchor. Sorry.


> Shell functions and $1, instead of xargs -I {}

> -n instead of -L (to avoid an ad hoc data language)

Apparently GNU xargs is missing it, but BSD xargs has -J, which is a `-I` which works with `-n`: with `-I` each replstr gets replaced by one of the inputs, with `-J` the replstr gets replaced by the entire batch (as determined by `-n`).
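
e.g. (BSD/macOS xargs; a sketch adapted from the idea in its man page):

    find . -name '*.txt' -print0 | xargs -0 -n 100 -J % cp % dest/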



I spent a year using AIX at my previous job and never heard of this or saw anybody use it. Is it new in 7.2? We were far behind on AIX 6.


No idea how old this command is. Most of the AIX/Linux admins I knew were very bad shell programmers; their skills ended with awful for-loops, useless use of cat, and awk '{print $3}'.


I’m unconvinced by the post OP was responding to. It’s a utility, it provides some means to get things done. *nix provides many means of parsing text and running commands, each with its own idioms based on its own axioms. It seems as if a composer is lambasting the clarinet because they don’t care for its fingerings. I’ve only used xargs sparingly; can somebody enlighten me as to why it’s bad, aside from the fact that there are other ways to do some things it does?


> I've used -P 32 to make day-long jobs take an hour! You can't do that with a for loop.

    for file in *; do
      command_using_file &
    done
    wait
?

I use variations on this all the time; pause while load is high, pause while 'x' or more things are running, sleep between invocations, etc.
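
For instance, capping concurrency looks roughly like this (bash 4.3+ for wait -n; command_using_file is from the snippet above):

    max=32
    for file in *; do
      while (( $(jobs -rp | wc -l) >= max )); do
        wait -n   # block until any background job finishes
      done
      command_using_file "$file" &
    done
    wait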

It may not be as convenient for some cases, but "can't do that..." is not quite correct either.

The post is starting to feel like a hammer/nail argument, IMO.


I would recommend using -0 instead of -d, as the latter is not supported on BSD (and macOS) xargs:

    do_something | tr \\n \\0 | xargs -0 ...


I wish this was the default behavior of xargs (the 'tr \\n \\0 | xargs -0' bit). I don't know why xargs splits on spaces and tabs as well as newlines by default, and doesn't even have a portable flag to just split on lines.

Ok filenames can theoretically have newlines in them but I'd be happy to deal with that weird case. I can't recall ever having encountered it in years of using bash on various systems.

Shell pipes would then orthogonally provide the stuff like substitution that xargs does in its own unique way (that I just can't be bothered learning) - instead you'd just pipe the find output through sed or 'grep -v' or whatever you wanted before piping into xargs.

I guess that's what aliases are for, but I'm too lazy these days to bother configuring often short-lived systems all the time.


xargs defaults to all whitespace because it was designed to get around the problem of short argv lengths (like, I'm talking 4k or less on older Unix-y systems, sometimes as low as 255 bytes).

So the defaults went with the principle of least surprise, pretending it's like a very long args list that you could theoretically enter at the shell, including quotes.

You could, for example, edit the args list in vi and line split / indent as you please but not impact the end result.
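
If you're curious about the modern limits, you can check them like this (getconf is POSIX; --show-limits is GNU xargs):

    getconf ARG_MAX                   # kernel limit on args + environment size
    xargs --show-limits < /dev/null   # what GNU xargs will actually use per batch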


I'm not sure I like the `$1` and shell function pattern. It might avoid the -I minilanguage, but at the cost of "being clever" in a way that takes a minute to wrap your head around. It's a neat trick, but I don't think it would be easy to understand if you are reading the code for the first time.


I find using the example of `rm` to discuss whether to pick `find -exec` or `find | xargs` rather strange, given the existence of `find -delete`. Maybe pick a different example operation to automate.


I used to have a bash function like `curry() { xargs -I {} "$@"; }` or something like that. Pretty useful to simplify one-liners.


Note that the suggested:

rm $(ls | grep foo)

will not work if you have file names that contain spaces.

Shell programming is planted thick with landmines like this.


The linked article doesn’t suggest this. They explicitly suggest against it.

> Besides the extra ls, the suggestion is bad because it relies on shell's word splitting. This is due to the unquoted $(). It's better to rely on the splitting algorithms in xargs, because they're simpler and more powerful.


Using "ls" is also problematic, because it will outright skip large classes of characters.


awk '{ print "your_command " $0 }' | bash

Never can remember all the -I stuff around xargs


This is like the sed|bash anti-pattern mentioned in the original post, and quoted in the appendix on shell injection.

I wouldn't say "never use it", but I would hesitate to ever put it in a script, vs. doing a one-off at the command line.


You don't pipe to bash on the first run. Use awk/sed without piping to workshop your commands. Once you've got them right send over to bash.

This is far superior to futzing with xargs interactive or whatever dry run feature they have.



> A lobste.rs user asked why you would use find | xargs rather than find -exec. The answer is that it can be much faster. If you’re trying to rm 10,000 files, you can start one process instead of 10,000 processes!

Fair enough, but I still favor find -exec. I find it generally less error prone, and it's never been so slow that I wished I had instead used xargs.

Also, if you're specifically using -exec rm with find, you could instead use find with -delete.


A benefit I didn't mention in the post (but probably should) is that the pipe lets you interpose other tools.

That is, find -exec is sort of "hard-coded", while find | xargs allows obvious extensions like:

    find | grep | xargs   # filter tasks

    find | head | xargs   # I use this all the time for faster testing

    find | shuf | xargs
Believe it or not I actually use find | shuf | xargs mplayer to randomize music and videos :)

So shell is basically a more compositional language than find (which is its own language, as I explain here: http://www.oilshell.org/blog/2021/04/find-test.html )


You can also use `find -exec` with `'+'` instead of `';'` as the terminator. This will call `rm` on all of the found files in one call.
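
For example:

    find . -name '*_test.py' -exec rm {} +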


I tend to prefer xargs because it works in more contexts e.g. I've got a tool which automatically generates databases but sometimes the cleanup doesn't work. `find -exec` does nothing, but `xargs -n1 dropdb` (following an intermediate grep) does the job. From there, it makes sense to… just use xargs everywhere.

And I always fail to remember that the -exec terminator must be escaped in zsh, so using -exec always takes me multiple tries. So I only use -exec when I must (for `find` predicates).


i agree. `find somewhere -exec some_command {} +` can be dramatically faster. but it does not guarantee a single invocation of `some_command`, it may make multiple invocations if you pass very large numbers of matching files

after spending a bit of time reading the man page for find, i rarely use xargs any more. find is pretty good.

tangent:

another instance i've seen where spawning many processes can lead to bad performance is in bash scripts for git pre-receive hooks, to scan and validate the commit message of a range of commits before accepting them. it is pretty easy to cobble together some loop in a bash script that executes multiple processes _per commit_. that's fine for typical small pushes of 1-20 commits -- but if someone needs to do serious graph surgery and push a branch of 1000 - 10,000 commits that can cause very long running times -- and more seriously, timeouts, where the entire push gets rejected as the pre-receive script takes too long. a small program using the libgit2 API can do the same work at the cost of a single process, although then you have the fun of figuring out how to build, install and maintain binary git pre-receive hooks.


Today I appreciated PowerShell.


Can you expand on that? I've never had trouble leveraging xargs and find it aligns well with shell piping.


Not OP but to me the best thing about PowerShell is that it recognizes that text is not always the best way to output results from commands if you care about creating pipelines. In short, it passes objects around so there's no need for parsing text.


Two examples from the article translated into PS (sorry, I'm a bit rusty so the second one may not be the shortest possible):

  PS> "alice", "bob" | echo

  PS> Get-ChildItem . -Include "*test.cpp","*test.py" -Recurse | foreach { Remove-Item $_.Name }
No text parsing in sight, and the object attributes can be tab-completed from the shell (e.g. I tab-completed the `$_.Name`).


You don't need to `foreach { Remove-Item $_.Name }` because Remove-Item can take the objects returned by Get-ChildItem directly.

Also, expanding the regex into `-Include` parameters is somewhat cheating since `-Include` only takes globs, and it just so happens that that particular regex can be converted into globs.

The general equivalent is:

    gci -re | ?{ $_.Name -match '.*_test\.(py|cc)' } | ri
(I used the shorter aliases because someone will probably read yours and reinforce the stereotype that PS is overly verbose.)


Thanks, this is definitely closer to the original use of `egrep`! As for aliases, I prefer long forms because I don't need to think what the seemingly random collections of letters mean, and tab-completion / PowerShell ISE makes it mostly a non-issue when writing.


Thanks, we were thinking of the same thing.


Xargs ftw!


i found nothing in this blog post useful other than string slicing/splitting. and you forgot to mention the ultimate flag of this program, which is -r.



