Basic Command-Line Data Processing in Linux (symkat.com)
63 points by symkat on Aug 29, 2010 | 21 comments



Even after using the Linux command line for years, it is always useful to see how others use it. I tend to overuse Perl regexes to massage the data into the shape I want, and underuse awk.

My norm is cat file | perl -p -e "s/.from ([0-9\.]+) ./\1/g" to get the IP out of the same data file.

Regexes seem to help with messy data or data that contains inconsistent delimiters.

(some of the stars got stripped by HN so the above won't work)


> (some of the stars got stripped by HN so the above won't work)

Try putting a couple of spaces in front of your code line, like this:

  cat file | perl -p -e "s/.*from ([0-9\.]+).*/\1/g"

See http://news.ycombinator.com/formatdoc

Edit: for simple regexes, sed works well, too, and probably loads slightly faster than perl.
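For comparison, here is a rough sed version of the same extraction. It is only a sketch: the log line is made up for illustration (the article's real data file isn't reproduced in this thread), and `-E` assumes a sed that supports extended regexes, which GNU and BSD sed both do.

```shell
# Hypothetical log line, invented for this example.
line='Aug 29 10:00:01 host sshd[123]: Failed password for root from 192.168.1.10 port 22 ssh2'

# -E enables extended regexes, so the capture group needs no backslashes.
# Everything before "from " and after the dotted number is discarded.
ip=$(printf '%s\n' "$line" | sed -E 's/.*from ([0-9.]+).*/\1/')

echo "$ip"
```

Note that, like the perl one-liner, this prints non-matching lines unchanged; adding `-n` and a trailing `p` flag to the substitution would print only the lines that matched.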


Also, tools like Awk and friends can be ridiculously fast and concise.

http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and...


I just learned about another one a couple of days ago: cut. Can't believe I never knew of its existence.

Also, for programmers, I'd recommend ack over grep.


Thanks for the tip. I'd never heard of that one either. It seems like a simpler awk, or at least a small subset of awk.
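To make the "subset of awk" comparison concrete, here is a small side-by-side sketch. The colon-delimited record is a made-up example in the style of /etc/passwd, not data from the article.

```shell
# Hypothetical colon-delimited record, invented for this example.
record='alice:x:1000:1000:Alice Example:/home/alice:/bin/bash'

# cut: select fields 1 and 7, using ':' as the delimiter.
with_cut=$(printf '%s\n' "$record" | cut -d: -f1,7)

# awk: the same selection, with explicit input and output separators.
with_awk=$(printf '%s\n' "$record" | awk -F: -v OFS=: '{print $1, $7}')

echo "$with_cut"
echo "$with_awk"
```

Both commands select the same two fields here. One difference worth knowing: cut splits on a single fixed delimiter character, while awk's default mode splits on runs of whitespace, so awk tends to cope better with column-aligned output.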


I started a wikibook on this stuff a few years back. Includes material on inline perl, gnuplot and has lots of examples. Check out: http://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_U...


So very cool! Thank you for posting.


For my part, I don't use awk for anything more complicated than one-liners. I used it for a while, stopped when I was working on something else, and forgot all the awk-specific stuff.

My MO these days is if it's anything more complicated than an awk one-liner like awk '{print $2 " " $NF}', I'll use Python or, lately, Ruby. (Perl would be fine, too, if I used it in other contexts often enough.)
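As a quick illustration of that one-liner, here it is run against a couple of made-up lines (the sample input is hypothetical). `$2` is the second whitespace-separated field and `$NF` is the last one, however many fields the line has.

```shell
# Hypothetical input: method, path, status, bytes.
sample='GET /index.html 200 512
POST /login 302 0'

# Print the second field and the last field of every line.
out=$(printf '%s\n' "$sample" | awk '{print $2 " " $NF}')

echo "$out"
```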

That said, there's nothing quite like, well, programming your environment. The extent to which you can manipulate files, directories, and text in *nix right out of the box makes me feel privileged to understand it. I can remember a time when renaming a bunch of images en masse seemed tantalizing but out of reach. I've since learned quite a bit, and even though it's relatively mundane now, it still feels magical. Upthread, someone called it "moving mountains." That's precisely it, and I love it.

Yes, yes. I'm a complete and utter nerd, etc.


It's also useful to know that 'sort -u' removes duplicates.


Although for this particular problem it was important to use uniq's -c flag, which prefixes each line with its number of occurrences and has no equivalent in "sort -u."

Which I guess just goes to reinforce the Unix philosophy of tools that do one job and do it well.
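A small sketch of the difference, using an invented list of addresses: `sort -u` only deduplicates, while `sort | uniq -c` also tells you how often each line appeared.

```shell
# Hypothetical input: one address per line, with repeats.
ips='10.0.0.2
10.0.0.1
10.0.0.2
10.0.0.2'

# Deduplicate only.
unique=$(printf '%s\n' "$ips" | sort -u)

# Count occurrences, then order by count, highest first.
# uniq only collapses adjacent duplicates, hence the sort in front of it.
counted=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn)

# uniq -c pads the count with leading spaces; awk normalizes that.
top=$(printf '%s\n' "$counted" | awk 'NR == 1 {print $1, $2}')

echo "$unique"
echo "$top"
```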


They forgot sed.


As others have mentioned, tr and cut are extremely useful. Although I had overlooked them in the past, expand/unexpand are also very useful! They convert tabs to spaces, or spaces to tabs. Of course there are other ways to do that, like substituting with sed, translating with tr, or printing tabbed data using $1/$2/etc. with awk... they just aren't as simple.
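Here is a minimal sketch of the tab/space conversions mentioned above, on a made-up two-column line. The tab stop of 8 is just the usual default, spelled out for clarity.

```shell
# Build a hypothetical line with a literal tab between two columns.
tab=$(printf '\t')
tabbed="name${tab}count"

# expand: replace the tab with spaces up to the next tab stop (every 8 columns).
spaced=$(printf '%s\n' "$tabbed" | expand -t 8)

# tr: replace each tab with exactly one space, ignoring tab stops.
single=$(printf '%s\n' "$tabbed" | tr '\t' ' ')

# unexpand -a: turn runs of spaces that reach a tab stop back into tabs.
restored=$(printf '%s\n' "$spaced" | unexpand -a -t 8)

echo "$spaced"
echo "$single"
```

The difference between expand and tr is that expand keeps columns aligned, whereas tr does a plain character-for-character swap.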


See also "Opening the software toolbox" by Arnold Robbins, part of the GNU Coreutils documentation.

http://www.gnu.org/software/coreutils/manual/html_node/Openi...


GNU Coreutils documentation as a whole is very useful.

http://www.gnu.org/software/coreutils/manual/

To read on the command line, try:

info coreutils

Also see:

http://en.wikipedia.org/wiki/GNU_Core_Utilities


Hacker News readers probably have at least a passing familiarity with Unix/Linux, but it's still refreshing to be reminded that you can move mountains with short commands.


And even for someone who knows all of this, knowing a good guide makes it easy to handle requests for help (often preemptively). This is one I can (and just did) send to a friend who's less familiar with Unix.


I think paste (merge lines of files) also deserves mention. Besides that, I have found tail and column to be extremely useful.
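For anyone who hasn't used paste, a rough sketch with two invented files (the file names and contents are hypothetical): it joins the files line by line, separated by a tab by default.

```shell
# Work in a throwaway directory so nothing real is touched.
tmp=$(mktemp -d)
printf 'alice\nbob\n' > "$tmp/names"
printf '30\n25\n' > "$tmp/ages"

# paste: line N of the first file, a tab, then line N of the second file.
merged=$(paste "$tmp/names" "$tmp/ages")

# tail: keep only the last line of the merged output.
last=$(paste "$tmp/names" "$tmp/ages" | tail -n 1)

rm -rf "$tmp"

echo "$merged"
echo "$last"
```

Where column is installed, piping the merged output through `column -t` lines the fields up into a table, which is handy for eyeballing results.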


I've been using Linux for a while now, and I never thought about how powerful those simple commands can be.


Since people already pointed out the missing ack and comm, I'll add: no love for tr?


Best textutil you probably haven't heard of (or have forgotten about): comm(1)


I find "diff -y" more intuitive, but I didn't know comm and will explore potential uses.
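A short sketch of what comm does, with two made-up word lists. comm expects both inputs to be sorted; it prints three columns (lines only in the first file, lines only in the second, lines in both), and the -1, -2 and -3 flags suppress the corresponding column.

```shell
# Two hypothetical, already-sorted files in a throwaway directory.
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/a"
printf 'banana\ncherry\ndate\n' > "$tmp/b"

# Suppress columns 1 and 2: lines common to both files.
both=$(comm -12 "$tmp/a" "$tmp/b")

# Suppress columns 2 and 3: lines only in the first file.
only_a=$(comm -23 "$tmp/a" "$tmp/b")

# Suppress columns 1 and 3: lines only in the second file.
only_b=$(comm -13 "$tmp/a" "$tmp/b")

rm -rf "$tmp"

echo "$both"
echo "$only_a"
echo "$only_b"
```

This set-style use (intersection and difference of sorted lists) is where comm tends to be more convenient than a side-by-side diff.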



