Is there a tool that treats txt files as a Set? Been looking for one a while a g...

sa46 · on Dec 9, 2021

You can use a combination of `comm`, `sort`, and `uniq` to implement set operations such as difference and intersection. https://ss64.com/bash/comm.html

> Return the unique lines in the file words.txt that don't exist in countries.txt:

    comm -23 <(sort words.txt | uniq) <(sort countries.txt | uniq)

> Return the lines that are in both words.txt and countries.txt:

    comm -12 <(sort words.txt | uniq) <(sort countries.txt | uniq)
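Union is not shown above; one way to sketch it is with `sort -u` alone, since it merges and de-duplicates in a single pass. The sample data and the `set_union` name here are made up for illustration:

```shell
# Hypothetical sample data standing in for words.txt and countries.txt.
dir=$(mktemp -d)
printf '%s\n' chad peru apple chad > "$dir/words.txt"
printf '%s\n' peru chad fiji > "$dir/countries.txt"

# Union: every distinct line that appears in at least one of the files.
set_union() { sort -u "$@"; }

set_union "$dir/words.txt" "$dir/countries.txt"
```

Unlike the `comm` examples, this needs no process substitution, so it should also run in a plain POSIX shell.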

kmstout · on Dec 9, 2021

I wrote these years ago. They're damn handy. It's true that they're not implemented in Bash (that would be nuts), but having them on hand lets me do much more on the command line than would otherwise be possible.

  ~/bin/union
  ===========
  #! /usr/bin/awk -f

  # Print each line only the first time it is seen, across all input files.
  !acc[$0]++


  ~/bin/intersection
  ==================
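  #! /usr/bin/awk -f
  # Requires gawk: ENDFILE is a gawk extension.

  # Count each distinct line once per input file.
  !buf[$0]++ { acc[$0]++ }

  # After each file, reset the per-file "seen" table.
  ENDFILE {
    delete buf
    files++
  }

  # Print the lines that appeared in every file (order is unspecified).
  END {
    for (k in acc) if (acc[k] == files) print k
  }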

  ~/bin/set-diff
  ==============
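  #! /usr/bin/awk -f
  # Requires gawk: ENDFILE is a gawk extension.

  # First file: remember every line. Later files: drop the lines they contain.
  ! filenum { acc[$0] = 1    }
  filenum   { delete acc[$0] }

  ENDFILE { filenum++ }

  # What remains is the first file minus all the others (order is unspecified).
  END {
    for (k in acc) print k
  }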
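
If gawk is not available, the same ideas can be sketched in portable awk by detecting file boundaries with `FNR == 1` instead of `ENDFILE`. This is only a rough equivalent, not the scripts above: the function names and sample files are made up, and an empty input file is silently skipped because it never produces a first line.

```shell
# Hypothetical sample inputs.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry banana > "$dir/a.txt"
printf '%s\n' banana date > "$dir/b.txt"

# Intersection: lines that occur in every non-empty input file.
intersection() {
  awk '
    FNR == 1 { files++ }                  # first line of a new input file
    !seen[files, $0]++ { count[$0]++ }    # count each line once per file
    END { for (k in count) if (count[k] == files) print k }
  ' "$@"
}

# Difference: lines of the first file that occur in none of the others.
set_diff() {
  awk '
    FNR == 1 { files++ }
    files == 1 { acc[$0] = 1; next }      # first file: remember the line
    { delete acc[$0] }                    # later files: remove the line
    END { for (k in acc) print k }
  ' "$@"
}

intersection "$dir/a.txt" "$dir/b.txt"
set_diff "$dir/a.txt" "$dir/b.txt" | sort
```

The output order of `for (k in ...)` is unspecified in awk, so pipe through `sort` when a stable order matters.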

SavantIdiot · on Dec 9, 2021

In bash? How would you even implement a set in bash without just doing linear greps? Or did someone add sets to bash 20 years ago and I never got the memo?

adrianmonk · on Dec 9, 2021

You could use the 'look' command, which does a binary search.

It's basically meant to look up spellings in /usr/share/dict/words, but it can work on any file. It will match any line that your pattern is a prefix of, so you'd have to add logic to eliminate longer matches.

But if you had some huge file to search and you wanted to do it from a shell script, that would be one way. Caveat: although it's fairly standard, 'look' might not be installed on every system.

Also, you have to keep the file in sorted order, so you can't add things by just appending to the end. Checking whether something is in the set gets much quicker, at the expense of adding things being much slower.
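
A rough sketch of that approach, assuming `look` is installed. The `set_contains` and `set_add` helpers and the sample file are made up for illustration; `LC_ALL=C` is used so that `sort` and `look` agree on a bytewise ordering.

```shell
# Hypothetical set file; it must stay sorted for the binary search to work.
set_file=$(mktemp)
printf '%s\n' banana apple applesauce | LC_ALL=C sort -u > "$set_file"

# Membership: look returns every line that starts with the key,
# so grep -Fx keeps only a whole-line, literal match.
set_contains() {
  LC_ALL=C look "$1" "$2" | grep -Fxq -e "$1"
}

# Insertion: re-sort the whole file rather than appending to it.
set_add() {
  printf '%s\n' "$1" | LC_ALL=C sort -u -o "$2" - "$2"
}

set_add cherry "$set_file"
set_contains apple "$set_file" && echo "apple: yes"
set_contains app "$set_file" || echo "app: no"
```

Here `app` is a prefix of `apple` but not a member, which is the "longer matches" case the extra `grep` filters out.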