Shell Magic: Set Operations with uniq (deadvax.net)
178 points by cpitman on May 29, 2018 | 65 comments


This is a neat hack! For two input files, the intersection and relative complement can be done more straightforwardly with comm.

https://en.wikipedia.org/wiki/Comm

  # show only items in both a and b
  comm -1 -2 a_list b_list

  # show only items unique to a
  comm -2 -3 a_list b_list

  # show only items unique to b
  comm -1 -3 a_list b_list
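
By default (with no flags), comm prints three tab-indented columns: lines only in the first file, lines only in the second, and lines in both; each flag suppresses one column. A quick illustration with two hypothetical sorted lists:

  $ printf 'a\nb\nc\n' > a_list
  $ printf 'b\nc\nd\n' > b_list
  $ comm a_list b_list
  a
  		b
  		c
  	d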


I love how, even after using Unix-y systems for probably 10-odd years, I still occasionally stumble upon something like this that refers to a command that I've never even heard of, and when I check (on my Mac, no less), it's right there sitting in /usr/bin. So many useful little utilities.

I think the same thing happens to a lesser degree with vim features.


A very boring but instructive way is to read manuals cover to cover - for example, the GNU Coreutils one [0]. This way, you become aware of the existence of a lot of tools (such as comm, which is part of the coreutils), and while reading you'll realize that some of them might be a better way of doing things than how you're doing them currently.

Other boring, but instructive reads (heavily GNU biased): diffutils [1], findutils [2], the Bash manual [3].

[0]: https://www.gnu.org/software/coreutils/manual

[1]: https://www.gnu.org/software/diffutils/manual

[2]: https://www.gnu.org/software/findutils/manual/find.html

[3]: https://www.gnu.org/software/bash/manual


The chapter on shell utilities in the POSIX standard is another goldmine!


I'd also recommend Unix Power Tools. There's definitely some outdated content, but I learned a lot about shell usage and text manipulation from it.

http://shop.oreilly.com/product/9780596003302.do


Seconded. The dead tree copy I purchased in the mid-2000s is the best technical book in my collection. I’ve read most of it and every so often I still dip in to learn something new or – more often than not – to re-learn something I’ve forgotten.


I tried your commands and got a different line count compared to the output of the commands in the post. The man page for comm says it assumes its inputs are pre-sorted, so you need to sort the input files before passing them to comm.

  # show only items in both a and b
  comm -1 -2 <(sort -u a_list) <(sort -u b_list)

  # show only items unique to a
  comm -2 -3 <(sort -u a_list) <(sort -u b_list)

  # show only items unique to b
  comm -1 -3 <(sort -u a_list) <(sort -u b_list)


Yes, pre-sorting is necessary for comm to work. I didn't mention that because it's necessary for the commands in the post to work too.


Can also use

    comm <flags> <(command 1) <(command 2)
to feed the output of a command in as input. This also works with diff and other commands.


More accurately, it can be used with any command that expects a file and doesn't do anything too weird when reading it (e.g. doesn't seek back to the beginning and read it again).

The '<( ... )' just expands to the path of a file descriptor connected to that command's stdout.

    $ ls <(echo hi)  
    /proc/self/fd/11
    $ vi <(echo hi)
    # opens vi with 'hi' as the contents


Thank you. I was just thinking of looking into how it was implemented.


It is called "process substitution", and can work the other way around:

http://www.tldp.org/LDP/abs/html/process-sub.html

With fish:

https://fishshell.com/docs/current/commands.html#psub

But there it only works one way:

https://github.com/fish-shell/fish-shell/issues/1786
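
So in fish, feeding command output in as a file looks something like this (a sketch using psub; file names hypothetical):

    # intersection of two unsorted lists, fish-style
    comm -1 -2 (sort a_list | psub) (sort b_list | psub)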


Note that this is a bash feature; other shells may have different syntax.


It's the same in ksh at least, so the BSDs are covered. It could also be implemented as a separate tool, but that's less convenient because of the extra quoting - I did it once.


I use it in my frugal SSG project [0] to remove the generated HTML file for every removed source file. It's great for this because it doesn't need temp files.

[0] https://gist.github.com/hadrianw/060944011acfcadd889d937b960...


Indeed, comm will be much faster than the uniq approach in this case, because it makes only a single pass over each file.

What the blog post does (catting the files multiple times and then counting occurrences) is going to be much slower, although mathematically correct.


Doesn't uniq only read the input once too? I've always assumed that the reason uniq requires presorted input is precisely so it doesn't need to buffer all of it: it just eliminates successive duplicate lines.


Yeah, but in this case the author is doing "cat 1.txt 1.txt 2.txt".


comm is probably the most useful command whose name I can never remember (and apropos doesn't help).


The only options you usually need for comm are -1, -2, and -3. It takes two file arguments, call them 1 and 2, and by default it prints three columns: lines only in file 1, lines only in file 2, and lines in both. So to filter, you tell it what to exclude. My mnemonic is that "-" means "not": -1 means "not the lines only in file 1", -2 means "not the lines only in file 2", and -3 means "not the lines in both". So when I want comm output that shows the lines unique to my 2nd file, it's:

comm "not only in 1" and "not in both"

so:

comm -1 -3 file1 file2


I know how to use 'man'. What I mean is, I never remember the command's name (comm), because I remember it as a command to compare, so I think of cmp, not comm.


If your input is already sorted (as this article assumes), you can use "sort -m", which is a lot faster. Also, to print only lines with duplicates, use "uniq -d" instead of "uniq -c | grep 2\ ".

Union: Instead of

    cat a_list b_list | sort | uniq
do

    sort -m a_list b_list | uniq
Intersection: Instead of

    cat a_list b_list | sort | uniq -c | grep 2\ 
do

    sort -m a_list b_list | uniq -d
Relative complement: Instead of

    cat a_list b_list b_list | sort | uniq -c | grep 2\ 
do

    sort -m a_list a_list b_list | uniq -u
Note the change of approach here: instead of making lines from b_list appear twice and grepping for that count, make lines from a_list appear twice and have uniq only print lines that aren't repeated.
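
To make those flags concrete, a small sketch with two hypothetical pre-sorted, deduplicated lists:

    $ printf 'a\nb\nc\n' > a_list
    $ printf 'b\nc\nd\n' > b_list
    $ sort -m a_list b_list | uniq -d           # intersection
    b
    c
    $ sort -m a_list a_list b_list | uniq -u    # complement: in b_list only
    d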


I like shell set-operation scripts, because they are quick and easy.

I prefer `awk` over `uniq` and `comm` because awk can skip the sorting and deduplicating, which tends to make it faster at set ops.

Here's my script for union, intersection, etc. See README on GitHub. Suggestions welcome.

https://github.com/sixarm/setop

    #!/bin/sh
    set -eu
    op="$1"; shift

    case "$op" in
      ∪|u|union|or|∨|add|addition|'+'|'|')
        # union: print each line the first time it is seen, across all files
        awk '!seen[$0] {print} {seen[$0]=1}' "$@"
        ;;
      ∩|i|intersection|and|∧|'&')
        # intersection: print lines that occur in every file (inputs assumed deduped)
        awk 'FNR==1{argind+=1} seen[$0]+=1 {next} END { for (key in seen) { if (seen[key]==argind) { print key } } }' "$@"
        ;;
      ⊖|d|diff|difference|xor|⊻)
        # symmetric difference: print lines that occur exactly once overall (inputs assumed deduped)
        awk 'FNR==1{argind+=1} seen[$0]+=1 {next} END { for (key in seen) { if (seen[key]==1) { print key } } }' "$@"
        ;;
      ex|except|exclude|subtract|subtraction|'-')
        # subtraction: lines of the first file absent from all later files
        awk 'NR==FNR{seen[$0]=1;next} seen[$0]=0; END { for (key in seen) { if (seen[key]) { print key } } }' "$@"
        ;;
      extra)
        # extra: lines of the last file absent from all earlier files
        awk 'BEGIN{argindmax=ARGC-1} FNR==1{argind+=1} argind<argindmax {seen[$0]; next} !($0 in seen)' "$@"
        ;;
      disjoint|n|not|none)
        # disjoint: TRUE if no line occurs more than once across the files
        awk 'seen[$0]==1 {x=1; exit} {seen[$0]=1} END { print x ? "FALSE" : "TRUE"; exit x }' "$@"
        ;;
      *)
    esac
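
Usage then looks like this (file names hypothetical), with the word and symbol forms interchangeable:

    # union of two lists
    setop union a.txt b.txt
    setop ∪ a.txt b.txt

    # lines common to all three lists
    setop intersection a.txt b.txt c.txt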


A few comments. First, wow, I love how this looks like dark magic.

Those mathematical notations - are you using them because they make it easier to see how the operations correspond to actual set theory? If so, could you just as well have used alphanumeric identifiers like "left", "union", "right" - or would the code break without this notation? I'm in deep water here; I don't know this field. But set theory seems to pop up a lot in my line of work, essentially doing joins on datasets using Tableau, so my interest in the nitty gritty is increasing.


Thanks! To answer your questions...

> because it makes it easier to see how it corresponds to actual Set Theory/theorems?

Yes. These are the Unicode symbols for set theory.

> could you just as well have used an alphanummeric identifier like "left" "union" "right" or

Yes. You can use any of the words in the case statement, such as `setop union file1 file2`. You can also edit the script to add your own words if you like.

You can see simpler versions of these scripts in our GitHub repos. For example the `union` command is https://github.com/sixarm/union

> set theory seems to pop up a lot in my line of work

More and more in mine too. Thank you for your comments!


How long have you been using Awk?


For a few years for two clients (financial and governmental) that have a range of older controlled systems.

We benefit from simple analytics that are POSIX, not python, perl, R, etc. http://www.numcommand.com/


I've never heard of numcommand before, but it looks super awesome and exactly like what I need. I use Linux servers but am forced to use Windows as my local machine. So, first question: do you have experience with awk/gawk on Windows? Second question: what do you prefer about Awk over Python/Perl? I guess if all your analytics are fairly short, it's just easier in Awk? Thanks!


Thank you!

Yes, in my experience awk/gawk on Windows works the same, apart from file separators, line endings, and similar platform-specific issues.

For your second question: awk simply works. It's reliable, fast, and everywhere. I also like python and perl, and choose those for any modern system.


Thank you!


Hat tip


That's nice, and so are the article's examples. But what I always need to look up is how to make the operations work on specific columns rather than on whole lines.

I mean without resorting to awk.


Doesn't this read the entire input into memory? `uniq` and `comm` don't (need to) do this, so they can work on inputs bigger than available memory.


Yes, in the current `setop` implementation, because that keeps the script fast and short and lets it work on inputs without needing to call `sort`.

For comparison, a typical POSIX `uniq` implementation reads the input and compares only adjacent lines, which is why the input must be presorted.
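
That adjacent-line comparison is tiny; a rough awk equivalent (a sketch, not any real uniq's source) would be:

    # behaves like plain uniq on sorted input: print a line only
    # when it differs from the line before it
    awk 'NR == 1 || $0 != prev { print } { prev = $0 }'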

An interesting upgrade could be an option flag that tells the script the inputs are already sorted and/or deduped; that could achieve the memory savings you're describing.


Could you add a "--help" option? Or provide usage instructions in a comment?


I've a summary of the operations at:

http://www.pixelbeat.org/cmdline.html#sets

Note that comm output is a bit awkward to parse, so I use another coreutils command, `join`, to process already-sorted data.
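
For example, on single-field lines (assuming a_list and b_list are already sorted and deduped):

    # intersection
    join a_list b_list

    # lines only in a_list
    join -v 1 a_list b_list

    # union: paired lines plus unpairable lines from both sides
    join -a 1 -a 2 a_list b_list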


Thanks for the list, there are many useful examples.

Another way to get the last date of the current month (== the number of days in the month) is:

    : $(cal); echo $_


For the curious, `:` is the no-op builtin in bash. I mostly use it as I would use `pass` in Python, since empty conditionals or functions are a parse error.

    if whatevs; then
      :
    fi


And to elaborate even more:

‘:’ is the command, and ‘$(cal)’ – which equals the unquoted output from running the ‘cal’ command – are the arguments. The last day of the month is thus the last argument of ‘:’ and can be referenced with the ‘$_’ variable.


The cut and join commands are also useful. I'm pretty sure you could prove all basic database operations can be done using the shell.
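
As a sketch of that claim (file names and layout hypothetical): given colon-delimited "tables" users.txt (id:name) and orders.txt (id:item), an equi-join plus projection is:

    # roughly: SELECT name, item FROM users JOIN orders USING (id)
    join -t: <(sort users.txt) <(sort orders.txt) | cut -d: -f2,3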


In fact I wrote an image/tag database solution entirely in shell/coreutils.

It's too disgusting to share.


After playing around with join just now, I remembered why I hate it - it requires your inputs to be sorted, which makes it abominably inconvenient.



Here's another collection of set operations - I refer to this a couple of times a week at least. http://www.catonmat.net/download/setops.txt


Whoa. That is a great resource.


Hard to resist this - "Unlike the intersection, the Set Difference is a bit harder to scale up to more than two lists. It is conceivable, and I may even have done it, but I’ll leave it as an exercise to the reader to develop that."

An inefficient solution that involves unnecessary sorting: for each file_i, 0 <= i < n, in the set of n files, cat it 2^i times before combining and piping through sort and uniq -c. Every possible combination of set memberships can then be determined by grepping the result for a particular combination of counts. Intersection would be calculated by grepping for 2^n - 1, while symmetric difference would require egrep to pick out any of 1, 2, 4, ..., 2^(n-1).
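
Concretely, for n = 3 files a, b, c weighted 1, 2, and 4 (a sketch; a line present in all three files sums to 1 + 2 + 4 = 7 = 2^3 - 1):

    # intersection: count 7 means present in a, b, and c
    cat a b b c c c c | sort | uniq -c | grep ' 7 '

    # symmetric difference: counts 1, 2, or 4 mean present in exactly one file
    cat a b b c c c c | sort | uniq -c | egrep ' (1|2|4) '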


The way I would do it, assuming we have list1.txt, list2.txt, list3.txt, ... and want to calculate (1 - 2 - 3 - ...):

1. Use sed to add "1<tab>" (that's the digit one and a tab character) to each line of the first list and save as "prefix.txt".

2. Use cat to combine all the lists, sort | uniq -c | sort -n | sed <reformat to make tab delimited> | sort again and save as output.txt.

3. join prefix.txt and output.txt on the whole line and cut out the second tab-delimited field to produce the final result.

So in order to be in the result, a list item must appear in exactly one list and that must be the first list. That should be what we want (?)
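
A rough sketch of those steps (GNU sed assumed for \t; comm -12 standing in for the whole-line join; lists assumed to have no internal duplicates):

    # 1. tag list1's lines with "1<tab>"
    sed 's/^/1\t/' list1.txt | sort > prefix.txt

    # 2. count occurrences across all lists, reformat as count<tab>line
    cat list1.txt list2.txt list3.txt | sort | uniq -c \
      | sed 's/^ *\([0-9][0-9]*\) /\1\t/' | sort > output.txt

    # 3. whole-line match, then keep only the line itself
    comm -12 prefix.txt output.txt | cut -f2-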


The Debian 'corekeeper' package includes a cron script that uses this.

https://sources.debian.org/src/corekeeper/1.6/debian/corekee...

It's actually possible to simplify the set operations a little in that case.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=895143


For completeness, to scale relative complement up to N lists you need to repeat the i'th list 2^(i-1) times, so the last list appears 2^(N-1) times. It'd look something like:

    rel_com() {
        for i in $(seq 1 $#); do
            for j in $(seq 1 $(bc <<< "2 ^ $(expr $i - 1)")); do
                echo ${!i}
            done
        done
    }
    
    sort -m $(rel_com a b c) | uniq -c
Then you have an N-bit number, with the i'th bit representing membership in the i'th file.


I've always used grep for this, with the -f option to read a list of patterns from a file and -v for inverted search as necessary.
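
That looks something like this, with -F and -x added so lines match as fixed whole strings rather than as patterns:

    # lines of a_list also in b_list (intersection)
    grep -Fxf b_list a_list

    # lines of a_list not in b_list (relative complement)
    grep -Fxvf b_list a_list

No sorting required, though the pattern file has to fit in memory.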


Nice. I've wondered several times how to get the equivalent of Python's {1, 2, 3} ^ {3, 4} in bash.
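
One way, assuming nothing about input order (a sketch: sort -u dedupes each side, then uniq -u keeps lines that appear on exactly one side):

    $ printf '1\n2\n3\n' > a_list
    $ printf '3\n4\n' > b_list
    $ sort -m <(sort -u a_list) <(sort -u b_list) | uniq -u
    1
    2
    4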



If you use awk, which has associative arrays, you avoid sorting, and can keep everything POSIX.


I like bash's <(...) operator and often use diff -u with it:

  diff -u <(cat file1) <(cat file2)
Obviously, the 'cat' commands can be replaced with more complex commands, which makes this more interesting.


No need to use cat:

  # sort a_list b_list | uniq
  a
  b
  c
  d
  e



Thanks! Will be using these for debugging. I commonly compare `cat data.csv | wc -l` vs `cat data.csv | sort -u | wc -l` to check for unique inputs.


Aaargh! "cat | sort" is an antipattern.

Sort accepts a list of files as arguments. "sort [flags] file1 file2 ..." is more concise, more efficient, and less likely to run out of space.


Someone always says this, but starting with cat sure makes it easier to compose the command line. When working with big files, start with "head -1000 | sort | uniq ..." and switch to cat after you've worked out the command. Or you start composing the command and realize you need to sort on field 2, so you insert a "cut" before the "sort", etc.
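
A hypothetical session might evolve like this:

    # draft the pipeline on a sample
    head -1000 data.csv | sort | uniq -c | sort -rn

    # realize field 2 is what matters; insert a cut without re-plumbing
    head -1000 data.csv | cut -d, -f2 | sort | uniq -c | sort -rn

    # happy with it: swap head for cat and run on the whole file
    cat data.csv | cut -d, -f2 | sort | uniq -c | sort -rn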


And for people like me who want the file name first on the line, this format works in Bash at least:

</dir/file sort | uniq >outfile


I always find it really confusing when the command-line begins with a redirect.


Yes, how do you parse that - i.e. what does it even mean?


  > a # from stdout to file
  < a # simply the opposite operation: from file to stdin


It's just a convention that redirects come after the command they're redirecting. There might be shell options that influence this, but I think all these are generally equivalent:

    foo bar < a
    foo < a bar
    < a foo bar


It is POSIX shell parsing behaviour that redirections can appear anywhere in the command (obviously, not in the same word as another parameter, nor inside a quoted string). They have to be stripped out by the shell before execution: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3...



