# show only items in both a and b
comm -1 -2 a_list b_list
# show only items unique to a
comm -2 -3 a_list b_list
# show only items unique to b
comm -1 -3 a_list b_list
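To make the columns concrete, here's a quick demo with two tiny pre-sorted files (the sample data is made up, just for illustration):

```shell
cd "$(mktemp -d)"

# two small, already-sorted input files
printf 'apple\nbanana\ncherry\n' > a_list
printf 'banana\ncherry\ndate\n'  > b_list

# default output has three columns:
# 1: lines only in a_list, 2: lines only in b_list, 3: lines in both
comm a_list b_list

# -1 -2 suppresses columns 1 and 2, leaving only the intersection
comm -1 -2 a_list b_list   # banana, cherry
```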
I love how, even after using Unix-y systems for probably 10-odd years, I still occasionally stumble upon something like this that refers to a command that I've never even heard of, and when I check (on my Mac, no less), it's right there sitting in /usr/bin. So many useful little utilities.
I think the same thing happens to a lesser degree with vim features.
A very boring, but instructive way is to read manuals cover to cover - for example the GNU Coreutils one [0]. This way, you become aware of the existence of a lot of tools (such as comm, which is part of the coreutils), and while reading you'll realize that some of them might be a better way of doing things than how you're doing them currently.
Other boring, but instructive reads (heavily GNU biased): diffutils [1], findutils [2], the Bash manual [3].
Seconded. The dead tree copy I purchased in the mid-2000s is the best technical book in my collection. I’ve read most of it and every so often I still dip in to learn something new or – more often than not – to re-learn something I’ve forgotten.
I tried out your command and got a different line count compared to the output of the commands mentioned in the post. The man page for comm says it assumes the inputs are pre-sorted, so you need to sort the input files before passing them to comm.
# show only items in both a and b
comm -1 -2 <(sort -u a_list) <(sort -u b_list)
# show only items unique to a
comm -2 -3 <(sort -u a_list) <(sort -u b_list)
# show only items unique to b
comm -1 -3 <(sort -u a_list) <(sort -u b_list)
More accurately, it can be used with any command which expects a file and doesn't do anything too weird when reading it (e.g. doesn't seek back to the beginning and read it again).
The '<( ... )' is just giving the path to that command's stdout as a file descriptor.
$ ls <(echo hi)
/proc/self/fd/11
$ vi <(echo hi)
# opens vi with 'hi' as the contents
Process substitution works the same in ksh at least, so the BSDs are covered. It could also be implemented as a separate tool, but that's not as convenient because of the additional quoting - I did it once.
Doesn't uniq only read the input once? I've always assumed that the reason uniq assumes its input is presorted is that, then, it doesn't need to buffer all of it: instead, it just eliminates successive duplicate lines.
The only options you usually need for comm are -1, -2, and -3. It takes two file arguments; call them 1 and 2. By default you get three columns of output: lines only in file 1, lines only in file 2, and lines in both. So you filter by telling comm which columns to exclude. My mnemonic is that "-" means "not": -1 means "not the lines only in file 1", -2 means "not the lines only in file 2", and -3 means "not the lines in both". So now, when I want comm output that shows the lines unique to my 2nd file, it's:

comm -1 -3 file1 file2
I know how to use 'man', what I mean is, I never remember the command name (comm) because I always remember it's a command to compare so I think of cmp, and not comm.
If your input is already sorted (like this article assumes), you can use "sort -m", which is a lot faster. Also, to print only lines with duplicates, use "uniq -d" instead of "uniq -c | grep 2\ ".
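As a sketch of that speedup (sample data made up; each file must already be sorted and internally deduplicated):

```shell
cd "$(mktemp -d)"
printf 'apple\nbanana\ncherry\n' > a_list
printf 'banana\ncherry\ndate\n'  > b_list

# "sort -m" merely merges the pre-sorted inputs (linear time, no re-sort);
# "uniq -d" then keeps only repeated lines, i.e. the intersection
sort -m a_list b_list | uniq -d   # banana, cherry
```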
Note the change of approach here: instead of making lines from b_list appear twice and grepping for that count, make lines from a_list appear twice and have uniq only print lines that aren't repeated.
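Sketched out (hypothetical data; assumes neither file has internal duplicates):

```shell
cd "$(mktemp -d)"
printf 'apple\nbanana\n' > a_list
printf 'apple\ncherry\n' > b_list

# every a_list line now occurs at least twice, so "uniq -u"
# (print only non-repeated lines) leaves exactly b_list minus a_list
cat a_list a_list b_list | sort | uniq -u   # cherry
```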
A few comments. First, wow, I love how this looks like dark magic.
Those mathematical notations, are you using them because it makes it easier to see how it corresponds to actual Set Theory/theorems?
If so, could you just as well have used an alphanumeric identifier like "left", "union", "right" - or would the code break without this notation?
I'm on deep waters here, I don't know this. But set theory seems to pop up a lot in my line of work, essentially doing joins in datasets using Tableau - so my interest in the nitty gritty of this field is increasing.
> because it makes it easier to see how it corresponds to actual Set Theory/theorems?
Yes. These are the Unicode symbols for set theory.
> could you just as well have used an alphanumeric identifier like "left" "union" "right"
Yes. You can use any of the words in the case switch statements, such as `setop union file1 file2`. You can also edit the script to add your own words if you like.
You can see simpler versions of these scripts in our GitHub repos. For example the `union` command is https://github.com/sixarm/union
> set theory seems to pop up a lot in my line of work
More and more in mine too. Thank you for your comments!
I've never heard of numcommand before, but it looks super awesome and exactly what I need. I use Linux servers, but am forced to use Windows as my local. So first question, do you have experience with awk/gawk on Windows? Second question: What do you prefer about Awk over Python/Perl? I guess if all your analytics are fairly short it is just easier in Awk? Thanks!
That's nice, and so are the article's examples. But what I always need to look up is how to make the operations work on specific columns, rather than on whole lines.
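One way to do that (a sketch, not from the article - the filenames and the tab-separated layout are assumptions) is to project the key column out with cut before applying the same pipeline:

```shell
cd "$(mktemp -d)"

# hypothetical tab-separated data: id<TAB>name
printf '1\talice\n2\tbob\n'  > a.tsv
printf '2\tbert\n3\tcarol\n' > b.tsv

# compare on column 1 only: extract it, sort it, then intersect as before
# (process substitution needs bash/ksh/zsh)
comm -1 -2 <(cut -f1 a.tsv | sort -u) <(cut -f1 b.tsv | sort -u)   # 2
```

join(1) goes further: it matches the lines of two sorted files on a key field and can also print the remaining columns from either side.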
Yes, in the current `setop` implementation, because this enables the script to be fast and short, and to work with inputs without needing to call `sort`.
For comparison, a typical POSIX `uniq` implementation reads the input and solely compares two adjacent lines; this requires the input to be presorted.
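That adjacent-compare strategy fits in one line of awk (a rough sketch of what plain uniq does, not the actual implementation):

```shell
# print a line only when it differs from the previous one:
# single pass, constant memory, correct only on sorted input
printf 'a\na\nb\nb\nb\nc\n' | awk '$0 != prev { print } { prev = $0 }'   # a, b, c
```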
An interesting upgrade could be to add a `setop` option flag that tells the script the inputs are already sorted and/or deduped. This can achieve the memory savings you're describing.
For the curious, `:` is the noop builtin in bash. I use it mostly as I would use `pass` in python, since empty conditionals or functions are a parse error.
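A minimal illustration of that use:

```shell
# an empty "then" body is a syntax error, so ":" stands in, like Python's "pass"
if true; then
    :   # TODO: handle this case later
fi

# ":" also works as a do-nothing loop body
while false; do :; done
```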
‘:’ is the command, and ‘$(cal)’ – which equals the unquoted output from running the ‘cal’ command – are the arguments. The last day of the month is thus the last argument of ‘:’ and can be referenced with the ‘$_’ variable.
Hard to resist this - "Unlike the intersection, the Set Difference is a bit harder to scale up to more than two lists. It is conceivable, and I may even have done it, but I’ll leave it as an exercise to the reader to develop that."
An inefficient solution which involves unnecessary sorting: for each file_i, 0 <= i < n in the set of n files, cat it 2^i times before combining to pipe through sort and uniq -c. Every possible set operation combination can be determined by grepping the result for a particular combination of counts. Intersection would be calculated by grepping for 2^n - 1, while symmetric difference would require egrep to pick out any of 1, 2, 4, ..., 2^(n-1).
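A sketch of that scheme for n = 3 (the filenames f0, f1, f2 and their contents are made up): file i is replicated 2^i times, so each line's total count is a bitmask of which files contain it.

```shell
cd "$(mktemp -d)"
printf 'a\nb\n'    > f0   # weight 1
printf 'b\nc\n'    > f1   # weight 2
printf 'a\nb\nc\n' > f2   # weight 4

# counts: a -> 1+4 = 5, b -> 1+2+4 = 7, c -> 2+4 = 6
{ cat f0; cat f1 f1; cat f2 f2 f2 f2; } | sort | uniq -c |
  awk '$1 == 7 { print $2 }'   # intersection: count 2^3 - 1 = 7, prints "b"
```

awk is used here instead of grep only to sidestep the leading spaces that uniq -c puts before each count.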
Someone always says this, but starting with cat sure makes it easier to compose the command line. When working with big files, start with "head -1000 | sort | uniq ..." and switch to cat after you work out the command. Or you start composing the command and realize you need to sort on field 2, so you insert a "cut" before the "sort", etc.
It's just a convention that redirects come after the command they're redirecting. There might be shell options that influence this, but I think all of these are generally equivalent:

grep foo <file
grep <file foo
<file grep foo
It is POSIX shell parsing behaviour that redirections can appear anywhere in the command (obviously, not in the same word as another parameter, nor inside a quoted string). They have to be stripped out by the shell before execution: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3...
https://en.wikipedia.org/wiki/Comm