Shell Magic: Set Operations with uniq (deadvax.net)
178 points by cpitman on May 29, 2018 | 65 comments


This is a neat hack! For two input files, the intersection and relative complement can be done more straightforwardly with comm.

https://en.wikipedia.org/wiki/Comm

  # show only items in both a and b
  comm -1 -2 a_list b_list

  # show only items unique to a
  comm -2 -3 a_list b_list

  # show only items unique to b
  comm -1 -3 a_list b_list
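
By default (with no flags), comm prints three tab-indented columns: lines only in the first file, lines only in the second, and lines in both; each flag suppresses one column. A quick illustration with two hypothetical sorted lists:

  $ printf 'a\nb\nc\n' > a_list
  $ printf 'b\nc\nd\n' > b_list
  $ comm a_list b_list
  a
  		b
  		c
  	d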


I love how, even after using Unix-y systems for probably 10-odd years, I still occasionally stumble upon something like this that refers to a command that I've never even heard of, and when I check (on my Mac, no less), it's right there sitting in /usr/bin. So many useful little utilities.

I think the same thing happens to a lesser degree with vim features.


A very boring but instructive way is to read manuals cover to cover - for example, the GNU Coreutils one [0]. This way, you become aware of the existence of a lot of tools (such as comm, which is part of the coreutils), and while reading you'll realize that some of them might be a better way of doing things than how you're doing them currently.

Other boring, but instructive reads (heavily GNU biased): diffutils [1], findutils [2], the Bash manual [3].

[0]: https://www.gnu.org/software/coreutils/manual

[1]: https://www.gnu.org/software/diffutils/manual

[2]: https://www.gnu.org/software/findutils/manual/find.html

[3]: https://www.gnu.org/software/bash/manual


The chapter on shell utilities in the POSIX standard is another goldmine!


I'd also recommend Unix Power Tools. There's definitely some outdated content, but I learned a lot about shell usage and text manipulation from it.

http://shop.oreilly.com/product/9780596003302.do


Seconded. The dead tree copy I purchased in the mid-2000s is the best technical book in my collection. I’ve read most of it and every so often I still dip in to learn something new or – more often than not – to re-learn something I’ve forgotten.


I tried your commands and got a different line count compared to the output of the commands in the post. The man page for comm says it assumes its inputs are pre-sorted, so you need to sort the input files before passing them to comm.

  # show only items in both a and b
  comm -1 -2 <(sort -u a_list) <(sort -u b_list)

  # show only items unique to a
  comm -2 -3 <(sort -u a_list) <(sort -u b_list)

  # show only items unique to b
  comm -1 -3 <(sort -u a_list) <(sort -u b_list)


Yes, pre-sorting is necessary for comm to work. I didn't mention that because it's necessary for the commands in the post to work too.


Can also use

    comm <flags> <(command 1) <(command 2)
to feed the output of a command in as input. This also works with diff and other commands.


More accurately, it can be used with any command that expects a file and doesn't do anything too weird when reading it (e.g. doesn't seek back to the beginning and read it again).

The '<( ... )' just expands to the path of a file descriptor connected to that command's stdout.

    $ ls <(echo hi)  
    /proc/self/fd/11
    $ vi <(echo hi)
    # opens vi with 'hi' as the contents


Thank you. I was just thinking of looking into how it was implemented.


It is called "process substitution", and can work the other way around:

http://www.tldp.org/LDP/abs/html/process-sub.html

With fish:

https://fishshell.com/docs/current/commands.html#psub

But there it only works one way:

https://github.com/fish-shell/fish-shell/issues/1786
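
So in fish, feeding command output in as a file looks something like this (a sketch using psub; file names hypothetical):

    # intersection of two unsorted lists, fish-style
    comm -1 -2 (sort a_list | psub) (sort b_list | psub)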


Note that this is a bash feature; other shells may have different syntax.


It's the same in ksh at least, so the BSDs are covered. It could also be implemented as a separate tool, but that's less convenient because of the extra quoting - I did it once.


I use it in my frugal SSG project [0] to remove the generated HTML file for every removed source file. It's great for this because it doesn't need temp files.

[0] https://gist.github.com/hadrianw/060944011acfcadd889d937b960...


Indeed, comm will be much faster than the uniq approach in this case, because it makes only a single pass over each file.

What the blog post does (catting the files multiple times and then counting occurrences) is going to be much slower, although mathematically correct.


Doesn't uniq only read the input once too? I've always assumed that the reason uniq requires presorted input is precisely so it doesn't need to buffer all of it: it just eliminates successive duplicate lines.


Yeah, but in this case the author is doing "cat 1.txt 1.txt 2.txt".


comm is probably the most useful command whose name I can never remember (and apropos doesn't help).


The only options you usually need for comm are -1, -2, and -3. It takes two file arguments, call them 1 and 2, and by default it prints three columns: lines only in file 1, lines only in file 2, and lines in both. So to filter, you tell it what to exclude. My mnemonic is that "-" means "not": -1 means "not the lines only in file 1", -2 means "not the lines only in file 2", and -3 means "not the lines in both". So when I want comm output that shows the lines unique to my 2nd file, it's:

comm "not only in 1" and "not in both"

so:

comm -1 -3 file1 file2


I know how to use 'man'. What I mean is, I never remember the command's name (comm), because I remember it as a command to compare, so I think of cmp, not comm.


If your input is already sorted (as this article assumes), you can use "sort -m", which is a lot faster. Also, to print only lines with duplicates, use "uniq -d" instead of "uniq -c | grep 2\ ".

Union: Instead of

    cat a_list b_list | sort | uniq
do

    sort -m a_list b_list | uniq
Intersection: Instead of

    cat a_list b_list | sort | uniq -c | grep 2\ 
do

    sort -m a_list b_list | uniq -d
Relative complement: Instead of

    cat a_list b_list b_list | sort | uniq -c | grep 2\ 
do

    sort -m a_list a_list b_list | uniq -u
Note the change of approach here: instead of making lines from b_list appear twice and grepping for that count, make lines from a_list appear twice and have uniq only print lines that aren't repeated.
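
To make those flags concrete, a small sketch with two hypothetical pre-sorted, deduplicated lists:

    $ printf 'a\nb\nc\n' > a_list
    $ printf 'b\nc\nd\n' > b_list
    $ sort -m a_list b_list | uniq -d           # intersection
    b
    c
    $ sort -m a_list a_list b_list | uniq -u    # complement: in b_list only
    d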


I like shell set-operation scripts, because they are quick and easy.

I prefer `awk` over `uniq` and `comm` because awk can skip the sorting and deduplicating, which tends to make it faster at set ops.

Here's my script for union, intersection, etc. See README on GitHub. Suggestions welcome.

https://github.com/sixarm/setop

    #!/bin/sh
    set -eu
    op="$1"; shift

    case "$op" in
      ∪|u|union|or|∨|add|addition|'+'|'|')
        # union: print each line the first time it is seen, across all files
        awk '!seen[$0] {print} {seen[$0]=1}' "$@"
        ;;
      ∩|i|intersection|and|∧|'&')
        # intersection: print lines that occur in every file (inputs assumed deduped)
        awk 'FNR==1{argind+=1} seen[$0]+=1 {next} END { for (key in seen) { if (seen[key]==argind) { print key } } }' "$@"
        ;;
      ⊖|d|diff|difference|xor|⊻)
        # symmetric difference: print lines that occur exactly once overall (inputs assumed deduped)
        awk 'FNR==1{argind+=1} seen[$0]+=1 {next} END { for (key in seen) { if (seen[key]==1) { print key } } }' "$@"
        ;;
      ex|except|exclude|subtract|subtraction|'-')
        # subtraction: lines of the first file absent from all later files
        awk 'NR==FNR{seen[$0]=1;next} seen[$0]=0; END { for (key in seen) { if (seen[key]) { print key } } }' "$@"
        ;;
      extra)
        # extra: lines of the last file absent from all earlier files
        awk 'BEGIN{argindmax=ARGC-1} FNR==1{argind+=1} argind<argindmax {seen[$0]; next} !($0 in seen)' "$@"
        ;;
      disjoint|n|not|none)
        # disjoint: TRUE if no line occurs more than once across the files
        awk 'seen[$0]==1 {x=1; exit} {seen[$0]=1} END { print x ? "FALSE" : "TRUE"; exit x }' "$@"
        ;;
      *)
    esac
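
Usage then looks like this (file names hypothetical), with the word and symbol forms interchangeable:

    # union of two lists
    setop union a.txt b.txt
    setop ∪ a.txt b.txt

    # lines common to all three lists
    setop intersection a.txt b.txt c.txt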


A few comments. First, wow, I love how this looks like dark magic.

Those mathematical notations - are you using them because they make it easier to see how the operations correspond to actual set theory? If so, could you just as well have used alphanumeric identifiers like "left", "union", "right" - or would the code break without this notation? I'm in deep water here; I don't know this field. But set theory seems to pop up a lot in my line of work, essentially doing joins on datasets using Tableau, so my interest in the nitty gritty is increasing.


Thanks! To answer your questions...

> because it makes it easier to see how it corresponds to actual Set Theory/theorems?

Yes. These are the Unicode symbols for set theory.

> could you just as well have used an alphanummeric identifier like "left" "union" "right" or

Yes. You can use any of the words in the case statement, such as `setop union file1 file2`. You can also edit the script to add your own words if you like.

You can see simpler versions of these scripts in our GitHub repos. For example the `union` command is https://github.com/sixarm/union

> set theory seems to pop up a lot in my line of work

More and more in mine too. Thank you for your comments!


How long have you been using Awk?


For a few years for two clients (financial and governmental) that have a range of older controlled systems.

We benefit from simple analytics that are POSIX, not python, perl, R, etc. http://www.numcommand.com/


I've never heard of numcommand before, but it looks super awesome and exactly like what I need. I use Linux servers but am forced to use Windows as my local machine. So, first question: do you have experience with awk/gawk on Windows? Second question: what do you prefer about Awk over Python/Perl? I guess if all your analytics are fairly short, it's just easier in Awk? Thanks!


Thank you!

Yes, in my experience awk/gawk on Windows works the same, apart from file separators, line endings, and similar platform-specific issues.

For your second question: awk simply works. It's reliable, fast, and everywhere. I also like python and perl, and choose those for any modern system.


Thank you!


Hat tip


That's nice, and so are the article's examples. But what I always need to look up is how to make the operations work on specific columns rather than on whole lines.

I mean without resorting to awk.


Doesn't this read the entire input into memory? `uniq` and `comm` don't (need to) do this, so they can work on inputs bigger than available memory.


Yes, in the current `setop` implementation, because that keeps the script fast and short and lets it work on inputs without needing to call `sort`.

For comparison, a typical POSIX `uniq` implementation reads the input and compares only adjacent lines, which is why the input must be presorted.
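
That adjacent-line comparison is tiny; a rough awk equivalent (a sketch, not any real uniq's source) would be:

    # behaves like plain uniq on sorted input: print a line only
    # when it differs from the line before it
    awk 'NR == 1 || $0 != prev { print } { prev = $0 }'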

An interesting upgrade could be an option flag that tells the script the inputs are already sorted and/or deduped; that could achieve the memory savings you're describing.


Could you add a "--help" option? Or provide usage instructions in a comment?


I've a summary of the operations at:

http://www.pixelbeat.org/cmdline.html#sets

Note that comm output is a bit awkward to parse, so I use another coreutils command, `join`, to process already-sorted data.
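
For example, on single-field lines (assuming a_list and b_list are already sorted and deduped):

    # intersection
    join a_list b_list

    # lines only in a_list
    join -v 1 a_list b_list

    # union: paired lines plus unpairable lines from both sides
    join -a 1 -a 2 a_list b_list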


Thanks for the list, there are many useful examples.

Another way to get the last date of the current month (== the number of days in the month) is:

    : $(cal); echo $_


For the curious, `:` is the no-op builtin in bash. I mostly use it as I would use `pass` in Python, since empty conditionals or functions are a parse error.

    if whatevs; then
      :
    fi


And to elaborate even more:

‘:’ is the command, and ‘$(cal)’ – which equals the unquoted output from running the ‘cal’ command – are the arguments. The last day of the month is thus the last argument of ‘:’ and can be referenced with the ‘$_’ variable.


The cut and join commands are also useful. I'm pretty sure you could prove all basic database operations can be done using the shell.
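
As a sketch of that claim (file names and layout hypothetical): given colon-delimited "tables" users.txt (id:name) and orders.txt (id:item), an equi-join plus projection is:

    # roughly: SELECT name, item FROM users JOIN orders USING (id)
    join -t: <(sort users.txt) <(sort orders.txt) | cut -d: -f2,3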


In fact I wrote an image/tag database solution entirely in shell/coreutils.

It's too disgusting to share.


After playing around with join just now, I remembered why I hate it - it requires your inputs to be sorted, which makes it abominably inconvenient.



Here's another collection of set operations - I refer to this a couple of times a week at least. http://www.catonmat.net/download/setops.txt


Whoa. That is a great resource.


Hard to resist this - "Unlike the intersection, the Set Difference is a bit harder to scale up to more than two lists. It is conceivable, and I may even have done it, but I’ll leave it as an exercise to the reader to develop that."

An inefficient solution that involves unnecessary sorting: for each file_i, 0 <= i < n, in the set of n files, cat it 2^i times before combining and piping through sort and uniq -c. Every possible combination of set memberships can then be determined by grepping the result for a particular combination of counts. Intersection would be calculated by grepping for 2^n - 1, while symmetric difference would require egrep to pick out any of 1, 2, 4, ..., 2^(n-1).
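
Concretely, for n = 3 files a, b, c weighted 1, 2, and 4 (a sketch; a line present in all three files sums to 1 + 2 + 4 = 7 = 2^3 - 1):

    # intersection: count 7 means present in a, b, and c
    cat a b b c c c c | sort | uniq -c | grep ' 7 '

    # symmetric difference: counts 1, 2, or 4 mean present in exactly one file
    cat a b b c c c c | sort | uniq -c | egrep ' (1|2|4) '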


The way I would do it, assuming we have list1.txt, list2.txt, list3.txt, ... and want to calculate (1 - 2 - 3 - ...):

1. Use sed to add "1<tab>" (that's the digit one and a tab character) to each line of the first list and save as "prefix.txt".

2. Use cat to combine all the lists, sort | uniq -c | sort -n | sed <reformat to make tab delimited> | sort again and save as output.txt.

3. join prefix.txt and output.txt on the whole line and cut out the second tab-delimited field to produce the final result.

So in order to be in the result, a list item must appear in exactly one list and that must be the first list. That should be what we want (?)
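
A rough sketch of those steps (GNU sed assumed for \t; comm -12 standing in for the whole-line join; lists assumed to have no internal duplicates):

    # 1. tag list1's lines with "1<tab>"
    sed 's/^/1\t/' list1.txt | sort > prefix.txt

    # 2. count occurrences across all lists, reformat as count<tab>line
    cat list1.txt list2.txt list3.txt | sort | uniq -c \
      | sed 's/^ *\([0-9][0-9]*\) /\1\t/' | sort > output.txt

    # 3. whole-line match, then keep only the line itself
    comm -12 prefix.txt output.txt | cut -f2-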


The Debian 'corekeeper' package includes a cron script that uses this.

https://sources.debian.org/src/corekeeper/1.6/debian/corekee...

It's actually possible to simplify the set operations a little in that case.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=895143


For completeness, to scale relative complement up to N lists you need to repeat the i'th list 2^(i-1) times, so the last list appears 2^(N-1) times. It'd look something like:

    rel_com() {
        for i in $(seq 1 $#); do
            for j in $(seq 1 $(bc <<< "2 ^ $(expr $i - 1)")); do
                echo ${!i}
            done
        done
    }
    
    sort -m $(rel_com a b c) | uniq -c
Then you have an N-bit number, with the i'th bit representing membership in the i'th file.


I've always used grep for this, with the -f option to read a list of patterns from a file and -v for inverted search as necessary.
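
That looks something like this, with -F and -x added so lines match as fixed whole strings rather than as patterns:

    # lines of a_list also in b_list (intersection)
    grep -Fxf b_list a_list

    # lines of a_list not in b_list (relative complement)
    grep -Fxvf b_list a_list

No sorting required, though the pattern file has to fit in memory.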


Nice. I've wondered several times how to get the equivalent of Python's {1, 2, 3} ^ {3, 4} in bash.
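
One way, assuming nothing about input order (a sketch: sort -u dedupes each side, then uniq -u keeps lines that appear on exactly one side):

    $ printf '1\n2\n3\n' > a_list
    $ printf '3\n4\n' > b_list
    $ sort -m <(sort -u a_list) <(sort -u b_list) | uniq -u
    1
    2
    4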



If you use awk, which has associative arrays, you avoid sorting, and can keep everything POSIX.


I like bash's <(...) operator and often use diff -u with it:

  diff -u <(cat file1) <(cat file2)
Obviously, the 'cat' commands can be replaced with more complex commands, which makes this more interesting.


No need to use cat:

  # sort a_list b_list | uniq
  a
  b
  c
  d
  e



Thanks! Will be using these for debugging. I commonly compare `cat data.csv | wc -l` vs `cat data.csv | sort -u | wc -l` to check for unique inputs.


Aaargh! "cat | sort" is an antipattern.

Sort accepts a list of files as arguments. "sort [flags] file1 file2 ..." is more concise, more efficient, and less likely to run out of space.


Someone always says this, but starting with cat sure makes it easier to compose the command line. When working with big files, start with "head -1000 | sort | uniq ..." and switch to cat after you've worked out the command. Or you start composing the command and realize you need to sort on field 2, so you insert a "cut" before the "sort", etc.
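
A hypothetical session might evolve like this:

    # draft the pipeline on a sample
    head -1000 data.csv | sort | uniq -c | sort -rn

    # realize field 2 is what matters; insert a cut without re-plumbing
    head -1000 data.csv | cut -d, -f2 | sort | uniq -c | sort -rn

    # happy with it: swap head for cat and run on the whole file
    cat data.csv | cut -d, -f2 | sort | uniq -c | sort -rn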


And for people like me who want the file name first on the line, this format works in Bash at least:

</dir/file sort | uniq >outfile


I always find it really confusing when the command-line begins with a redirect.


Yes, how do you parse that - i.e. what does it even mean?


  > a # from stdout to file
  < a # simply the opposite operation: from file to stdin


It's just a convention that redirects come after the command they're redirecting. There might be shell options that influence this, but I think all these are generally equivalent:

    foo bar < a
    foo < a bar
    < a foo bar


It is POSIX shell parsing behaviour that redirections can appear anywhere in the command (obviously, not in the same word as another parameter, nor inside a quoted string). They have to be stripped out by the shell before execution: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3...



