CLI text processing with GNU awk (learnbyexample.github.io)
419 points by asicsp on Aug 28, 2023 | 129 comments



I love awk, and I find myself reaching for it a fair bit. One of the main things I use it for is “sed with state”: things like matching on a line, but only if it was preceded by some other line. I find this really useful for creating one-off linters. For example, I recently made one to check all our migration files for CREATE INDEX without CONCURRENTLY on a particular set of very large tables where it would cause issues. Since SQL statements can be spread over multiple lines, it was difficult to write a straightforward match, but awk can track state like “I’m in a create statement,” “I’m creating an index,” etc. across multiple lines, which allowed me to cobble together something that has worked well for about a year now.
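The core of it looks roughly like this (a simplified sketch, not the actual linter; the table-specific checks are stripped out, and IGNORECASE is gawk-only):

    gawk 'BEGIN { IGNORECASE = 1 }
    /CREATE +INDEX/ { in_stmt = 1; stmt = "" }    # entering a CREATE INDEX statement
    in_stmt { stmt = stmt " " $0 }                # accumulate it across lines
    in_stmt && /;/ {                              # statement terminator reached
        if (stmt !~ /CONCURRENTLY/)
            print FILENAME ": CREATE INDEX without CONCURRENTLY"
        in_stmt = 0
    }' migrations/*.sql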


One of these days I need to get around to learning awk. In the meantime, I've learned some of the deeper, stateful features of sed. For instance, you mentioned wanting to output a line only if it was preceded by another. Here's a sed command that does so:

    sed -ne 'x' -e '/PREV/ {x; /CURR/ p; x}'

    > echo -e "PREV\nCURR\nCURR\nCURR\nPREV\nRED" | sed -ne 'x' -e '/PREV/ {x; /CURR/ p; x}'
    CURR
This uses sed's hold buffer. I'll break it down:

    sed -n
The `-n` tells sed not to print anything out. By default, sed prints whatever is in the pattern space after processing each line. We'll tell it with the `p` command when to do so.

    sed -ne 'x'
`-e` indicates we are specifying one of the scripts sed will execute. The command `x` switches the current line with whatever is in the hold buffer. We'll do this on every line.

    sed -ne 'x' -e '/PREV/
The next command will only run on lines that contain `PREV`. But because we've been swapping every line into the hold buffer, what we're actually examining here is the previous input line, just switched out of the hold buffer, so this block fires on the line after a `PREV`.

    sed -ne 'x' -e '/PREV/ { ... }'
The braces indicate all commands should be run when we see this match.

    sed -ne 'x' -e '/PREV/ { x; ... }'
First, we switch the hold buffer with the line buffer.

    sed -ne 'x' -e '/PREV/ { x; /CURR/ p; ... }'
Then, we only print out the line if it contains CURR.

    sed -ne 'x' -e '/PREV/ {x; /CURR/ p; x}'
Finally, we switch them back in case there is overlap in our matches. (Give `echo -e "PREV\nPREVCURR\nCURR\nCURR\nPREV\nRED"` a try with this.)
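If you run that, you should see both the overlapping match and the line that follows it:

    > echo -e "PREV\nPREVCURR\nCURR\nCURR\nPREV\nRED" | sed -ne 'x' -e '/PREV/ {x; /CURR/ p; x}'
    PREVCURR
    CURR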

All that said, I'm pretty sure the `awk` script is much simpler and more direct, but I wanted to share how one might accomplish this with sed.

The time I spent learning this probably would've been better spent on awk, but this tutorial[0] was so good and so easy that it taught me nearly everything I know about sed.

[0]: https://www.grymoire.com/Unix/Sed.html


> One of these days I need to get around to learning awk

Plan9's awk(1)[0] man page provides a precise and concise (a few paragraphs) presentation of the core features of all awk implementations.

Tutorials bring practical knowledge, but often lack complete and self-contained descriptions of those nifty little tools.

[0]: https://man.cat-v.org/plan_9/1/awk


I still maintain that "The AWK Programming Language" [1] is one of the best programming language books I've read so far, if not the best.

It's short and to the point, has good examples, and cuts most of the usual fluff like "what is a variable?". Its base assumptions are: You know how to program, and you're here to learn AWK. Let's get to it.

I dearly wish there'd be more books like it for other languages.

[1]: https://archive.org/details/pdfy-MgN0H1joIoDVoIC7


I don't remember reading it, but from the ToC & a glance at some examples/exercises, it definitely fits in the "Tutorials" category I was thinking of.

It's always delightful to see competent authors demonstrate how much sophistication is achievable in about 100 lines of simple code, by comparison with the "the Dog class inherits from the Animal class" type of examples, or "real-life" codebases. This book is definitely in the former category.

> I dearly wish there'd be more books like it for other languages.

This all reminds me of a well-known regular expression matcher[0], in about 30 lines of C, featured in "The Practice of Programming"[1].

More generally, even without dedicated books, there are common simple-but-sophisticated type of programs that are great to get to know a language, once you have basic programming skills: standard UNIX tools (cat(1), grep(1), etc.), λ-calculus interpreter, LISP interpreter, raytracer, etc. One can often find online versions serving as "solutions."

[0]: https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...

[1]: https://en.wikipedia.org/wiki/The_Practice_of_Programming


Not a waste of time, IMHO. In general, sed is faster than awk. It's smaller and is found in more places than awk, e.g., build toolchains. The grymoire site is one of the best, IMHO. He also has a good tutorial on awk. Nice to see people still discovering these tutorials.

Saddens me to see people selling crappy "e-books" or whatever on text processing on HN. Compared to the older generations that used UNIX, the level of knowledge is lacking. IMHO.

This book from Tim O'Reilly is an old favourite and has one of the nuttiest explanations of the hold space. See page 375.

https://www.oreilly.com/openbook/utp/UnixTextProcessing.pdf

https://web.archive.org/web/20230514225639if_/https://www.or...

As a NetBSD user, I found this book useful; all the utilities explained in it are still in the NetBSD userland.


This is nifty, thanks for sharing! I had no idea that sed had a hold buffer, and it's very cool that you can swap it in and out within the sed command like that. It's funny, because I went essentially the opposite way from you: I used to know sed and awk basics, but then I properly learned awk. Since then my sed has atrophied a bit, and I still only know the basics. I'll have to run through that tutorial you linked.


Can you share this example of tracking SQL state with awk?


Sure! I posted a gist here, stripped of anything particular to our company: https://gist.github.com/mplanchard/07229d61bd32ce73624d9003c...


I suspect that anyone reading this thread is likely to be equally interested in "Ask HN: Share a shell script you like" from a fortnight ago (though at 78 comments, it didn't get as much traction / comments as I hoped it would when I saw it)

https://news.ycombinator.com/item?id=37112991


There was a similar discussion 5 months back: https://news.ycombinator.com/item?id=35122780 (332 points | 328 comments)

And here's one from last year: https://news.ycombinator.com/item?id=32467957 (374 points | 294 comments)


Thanks :)


I maintain a minor side interest in Awk, alongside Lisp and other things.

I developed cppawk in 2022: https://www.kylheku.com/cgit/cppawk/about/

cppawk extends Awk with preprocessing.

There is a loop macro that supports a vocabulary of clauses. Clauses can be combined for parallel and cross-product iteration. And they are user-extensible: by writing five simple macros, you can define a new clause.

Something potentially useful if you use Awk.

Cppawk is documented with multiple man pages, and covered by unit tests which run with gawk and mawk.


Perhaps my old sysadmin hat is showing through, but I don’t quite see what the advantage of awk is over just writing the same thing in perl. I’ve seen my fair share of horrendous shell scripts from junior sysadmins, and every time I think to myself “the text processing portion would be so much cleaner in Perl”.


If you're in perl all the time that probably makes a lot of sense. For me, awk is one of the few languages that I can safely set aside for months and then I'm back up to speed in 10 minutes. There's just something very intuitive about it, and it somehow fits very naturally with other common command line tools.


Weird, I feel the same way about Perl.


If you are comparing Awk vs Perl for scripts, I'd prefer Perl (or Python).

This post is about short one-liners for ad hoc use cases. I prefer sed/awk over Perl for such cases. Though, if you already know Perl, you could continue using it instead of having to learn more tools.


Do all systems still come with Perl baked in these days? If so I could see reaching for that over awk/sed. If I have to install a runtime I may as well just reach for Python


What are "all systems"? Most mainstream Debian or Fedora based systems install Perl by default (but not necessarily in specialized settings such as embedded/boot/rescue systems). Alpine linux does not include it in standard images. FreeBSD (and probably Net/OpenBSD) don't install Perl by default. The current macOS still includes it, but Apple has notified that it will be removed at some point. Windows does not include Perl or awk by default.

Sed and Awk are part of POSIX, and maybe more importantly also part of Busybox. They're almost always available when Perl is available, while the reverse is not true.


>> Do all systems still come with Perl baked in these days?

If you use Git for Windows (https://gitforwindows.org/), it includes Perl.


…and gawk :)


Yes. Frequently any tool set that has gawk will also include sed, perl, cut, head, tail, less, vi / vim, etc.

It is nice that Git for Windows includes bash and all these tools.


A bunch of the default git extensions are written with Perl, so you will find some version of it available on most modern Linux systems


Awk's super power, and the reason I mostly use it, is its free read loop, free field splitting, and the pattern/condition matching model.

As a LANGUAGE, it's "eh". It just happens to be "good enough".

You can, of course, do all of that with Perl. But then I have to write all that boilerplate I get with awk for free. And the gains in Perl's language aren't enough, for me, to dump awk. And I don't use it for "scripting", I use it for data processing, tearing up files for mostly one-off tasks. So I don't miss Perl's depth. If I want depth, I'll go somewhere else.


> free read loop

perl -n

> free field splitting

perl -a ... $F[1/2/3/etc]

> pattern/condition matching model

Not quite sure what you mean, but `perl -lane 'print if /abc/'` might be what you're looking for.

The boilerplate can be mostly eliminated with the magic incantation of `perl -lane`. The trick that makes all this work is that perl defines a whole bunch of pre-defined variables and populates them with things that might be helpful (see $_, @F, etc).
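Side by side (the log file name and pattern here are made up):

    # perl: print the 3rd whitespace-separated field of matching lines
    perl -lane 'print $F[2] if /ERROR/' app.log

    # awk equivalent
    awk '/ERROR/ { print $3 }' app.log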


There aren't any technical advantages, no. Perl's features are a proper superset of awk (by design!).

What's happened is that Kids Today (tm) never learned perl. So they're discovering awk as someone new to the idea of stream processing. And awk was a great idea for that, and it represented a genuine innovation worth emulating.

In the late 1970's. Then of course perl did emulate and surpass it. But then got forgotten. So kids are discovering awk instead. It's a little cringe, really.


Why is that cringe? They genuinely probably came across awk before perl (I know I did; I read "The AWK Programming Language" and then went on to "The C Programming Language"). Having said that, awk is great and it's been the same for decades and available on every system (the same can't really be said about perl).


>> awk is great and it's been the same for decades and available on every system (the same can't really be said about perl).

The only issue with AWK is that there are many implementations and they are not always compatible with one another:

https://www.gnu.org/software/gawk/manual/html_node/Other-Ver...

I have ported AWK scripts from legacy Unix systems to Linux and ran into incompatibilities that required some adjustments to the scripts.

Curious: what systems have AWK, but do not have Perl?


In practice, on actual systems, you're likely to encounter nawk (the One True Awk), gawk (GNU Awk), mawk ("Mike's AWK", a fast awk), and Busybox's AWK.

There are other variants, yes, but in virtually every case these are fully POSIX compliant and/or have a POSIX mode.

(And in truth, gawk is the only non-fully-POSIX awk I've encountered --- it extends standard AWK with asort and the "'" formatting modifier, which prints localised thousands separators in numeric data.)

Programmes written for any one awk, if using POSIX features only, will run on any awk.

Many small / embedded systems (think routers, stock Android, or any POSIX-only Unix variant) must have awk, but often don't include Perl.

You'll also find variants of Perl, though the relative stasis of that language makes this less of an issue now than in the '90s and aughts.


Awk is a useful language that you can learn in one afternoon, after reading the man page and a few examples. And then you can spend your whole life using it for several projects. You cannot do that with perl. That's why awk has a longer shelf life than perl.


> And then you can spend your whole life using it for several projects. You cannot do that with perl. That's why awk has a longer shelf life than perl.

A Perl developer would of course say you have this completely backwards and even if I haven't programmed Perl much, or even at all for the last decade I would tend to agree.


I'm one of those "kids these days" but did actually learn to program Perl at some point, and I generally prefer AWK. Perl is a large and complex language, I don't need it that often, and I'm not smart enough to keep remembering all of it.

Now, if I would get hired as a full-time Perl developer and spent 2 years developing Perl: it would perhaps be different. But that's not the case, and isn't for most people.

For better or worse, Perl sees a lot less usage than it once did; I rarely encounter it "in the wild" and don't even have it on my laptop because nothing needs it.


Does any OS besides Windows not ship with Perl? Even on Windows, I'd assume anyone programming has WSL set up, which means you have Perl.


Perl came out in the late 80s and by the mid 2000's was really on its way out? I hired for my last perl position in around 2006.

Just saying, your definition of "kids today" could well include a decent portion of developers under 45 years old. Referring to this cohort repeatedly as "kids" is also a little cringe.


Did you really just reply to a comment that used the phrase "Kids Today (tm)" and try to interpret it as a genuine insult? The inability of this community to understand straightforward humor amazes me. Dude, it was a joke. And yes, I was calling mid-career professionals "kids". Deliberately. Because I'm old. And it's funny.


For simple use cases, I find awk simpler than Perl. I love Perl, have written tens of thousands of lines, but on the CLI I prefer awk. I’m sorry I “cringe” you.


Not being snarky, why not python over perl? what makes perl better for scripts?


Perl is much more terse for one-liners and has much more built-in for doing text processing in scripts. Stuff like implicit read loop, field separation, etc. I would say they're suitable for different jobs: if a perl script grows beyond a hundred lines (you can do a surprising amount in that space!), then Python may be the right tool.

Perl is also much more of a known target: some version of it exists on basically every single Unix, and the language really hasn't changed that much in the past decade. I have SSH'ed into multiple CentOS 6/SLES 11 (released 2009, and granted mostly to rescue data off them) servers in the past 2 years, and perl is just much more of a known target to write things against than whatever python release is on that system.


what makes perl better for scripts?

Having an implicit line- and field-splitting loop for standard input with a couple of command-line switches. (Awk doesn't even need switches, but is cumbersome if you need initial state.) This covers a lot of use-cases. Also, very compact and powerful regular expressions.
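For example, summing a column of a made-up data file:

    # perl: -n supplies the read loop, -a autosplits each line into @F
    perl -lane '$t += $F[1]; END { print $t }' data.txt

    # awk: no switches needed; initial state can be passed in with -v
    awk -v t=0 '{ t += $2 } END { print t }' data.txt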


Perl is a progression of that particular environment. It is a superset of shell/grep/awk/sed.

A shell command copied literally inside backquotes works exactly as you would expect, with all the other goodies of a real programming language.

Doing this in Python (to me at least) seems unnatural.


By the time you figure out which env you need to be using with python, you’ll forget why you needed it.


If all you're doing is text processing that Perl can do out of the box, you probably only need the Python stdlib.


awk is more intuitive for sure, for a regular javascript coder


Remove perl, have less security issues. Some scanning tools flag it, too. Awk is found in more places, in my experience.


Sorry, but that's ridiculous. Any general purpose programming language is a vector for bugs and security problems, but come on: you're genuinely trying to say that a kludge of bash+sed+awk is objectively more "secure" than a single perl script to solve the same problem?


In the case of awk, actually yes, it is safer. The reason is that awk is a very limited language. It has only enough functionality to provide text matching and substitution. It is very difficult to use awk to do anything of high security risk, compared to a language like perl.


But awk is never used alone. You don't solve whole problems with awk, you squish it into a script with a bunch of other junk. My point is that you're making an apples-to-oranges comparison. Sure, "awk" isn't the problem, but "bash" is, and bash is undeniably a more error-prone language than perl. You surely agree with that much, right?

And if you disallow "bash" for security reasons, where does that leave "awk" in the category of useful tools? See my point?


Just use awk for what it was designed: text search and substitution. You can run shell scripts along with awk, but that is clearly not what you should be doing if you want to design secure systems. The first rule of security is not to abuse your tools.


You’re right. But the alternative might be bash + perl or just bash. Or none of them. Perl is anyway the first one to go.


Is it completely gone, or rather just for you, blocked by sysadmins who know Perl is the magic pixie dust for total control, and want to keep it for themselves?

In Windows-land, compare how PowerShell access may be restricted, and you won't be allowed to run macros in Office, all while your computer is "managed" by a horrible hodge-podge of PowerShell and VBA scripts that make Perl code look like high literature.


> It has only enough functionality to provide text matching and substitution.

Gawk at least can do a lot more than that: reading and writing files, network communications, and running arbitrary shell commands, for example. It's certainly not as powerful as perl but it's also not limited to just text matching and substitution.
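For example, gawk's special /inet filenames give you TCP sockets (a sketch; example.com stands in for a real host):

    gawk 'BEGIN {
        s = "/inet/tcp/0/example.com/80"                       # client TCP connection
        print "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n" |& s
        while ((s |& getline line) > 0) print line             # read the response
        close(s)
    }'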

Edit: figured I would provide some examples. Here's an http server and a first person shooter in gawk. Maybe not so practical but they show some of gawk's capabilities.

https://github.com/kevin-albert/awkserver

https://github.com/TheMozg/awk-raycaster


There is a virus written in awk that infects other awk scripts[0]. And according to Wikipedia, the language is Turing-complete.

[0]: https://github.com/SPTHvx/ezines/tree/main/dc5/CODES/Perfori...


One somewhat not-well-known thing with gawk is that it typically ships with some useful extensions that give you access to things like readdir(), ord(), chr(), gettimeofday(), sleep(), etc.

https://www.gnu.org/software/gawk/manual/html_node/Extension...
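For instance, with the readdir extension loaded, a directory becomes readable input, one "inode/name/type" record per entry (the type letter needs OS support and falls back to "u"):

    # list regular files in the current directory
    gawk -l readdir -F/ '$3 == "f" { print $2 }' .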


awk one-liners are a slam dunk. The tough question is whether to invest in more complex awk programming. Invariably some processing task requires more complex logic, and awk provides that, but in the terse and arcane ways of early computing. Yet reaching for a modern alternative is also an overhead; it may not be particularly intuitive either (hello pandas) and may even have performance issues...


For me, the big problem is libraries. Even a personal file of common functions doesn't seem that well supported, and there just doesn't seem to be a way to get third party libraries.

When I start needing helper functions and splitting it into multiple lines is usually when I reach for Python instead. And then sigh, because my program will be 2 to 3 times bigger. Ruby is a great awk replacement, but unless other people at your job know it, you can't expect others to maintain it.


I'd bet you could do a harness like https://news.ycombinator.com/item?id=37292882 mentions but with Python in like an hour and then you could stay in both one syntax and more significantly in one library ecosystem. Why, 3 such things may even already exist. :) The syntax/semantics is not as optimized for 1-liner brevity, but everything has trade-offs.


The problem with libraries is namespaces: since everything is global, it's not worth it, and also incredibly problematic (especially since match() actually sets two globals). Fortunately gawk offers namespaces, and it even has a flag for loading libraries from a path. { gawk -i inplace } has replaced sed -i for me a couple of times. But yeah, it's still lacking.

PS: This is gawk only though, but awk -f $awkmodules/mymodule.awk -f <(echo '') is an ok replacement, even though it's just concatenating files.
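A sketch of the namespace feature (gawk 5+; the file and function names here are made up):

    # mylib.awk
    @namespace "mylib"

    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

Callers then load it with -i and use the qualified name:

    gawk -i mylib.awk '{ print mylib::trim($0) }' file.txt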


I used to not like them, pre-ChatGPT. But nowadays, when you can paste an arcane, illegible awk/sed one-liner into an AI and have it describe step by step what it's doing, I'm totally fine with them. I still don't like them as much as a few lines of Python for unit-testability reasons, but sometimes you just straight up don't need unit tests for some quick data munging task.


> have it describe step by step what it’s doing

And then building on the original “update the script to do $thing” where $thing isn’t obvious/trivial. It saves a lot of time.


Hello! Author here.

I am pleased to announce a new version of my "CLI text processing with GNU awk" ebook.

Learn the `GNU awk` command step-by-step, from beginner to advanced levels, with hundreds of examples and exercises. This book dives deep into field processing, shows examples of the filtering features, and covers multiple file processing, constructing solutions that depend on multiple records, comparing records and fields between two or more files, identifying duplicates while maintaining input order, and so on. Regular expressions are also discussed in detail.

Links:

* PDF/EPUB versions: https://learnbyexample.gumroad.com/l/gnu_awk (free till 31-August-2023)

* Web version: https://learnbyexample.github.io/learn_gnuawk/

* Markdown source, example files, etc: https://github.com/learnbyexample/learn_gnuawk

* Interactive TUI app for exercises: https://github.com/learnbyexample/TUI-apps/blob/main/AwkExer...

Bundle offers:

* Magical one-liners (https://learnbyexample.gumroad.com/l/oneliners/new_awk_relea...) is $5 (normal price $15) — grep, sed, awk, perl and ruby one-liners bundle

* All Books Bundle (https://learnbyexample.gumroad.com/l/all-books/new_awk_relea...) is $12 (normal price $32) — all my 13 programming ebooks

I would highly appreciate it if you'd let me know how you felt about this book. It could be anything from a simple thank you, pointing out a typo, mistakes in code snippets, which aspects of the book worked for you (or didn't!) and so on. Reader feedback is essential and especially so for self-published authors. Happy learning :)

---

Previous discussions:

* Learn to use Awk with hundreds of examples (https://news.ycombinator.com/item?id=15549318) — 478 points, Oct 2017, 116 comments

* Show HN: An eBook with hundreds of GNU Awk one-liners (https://news.ycombinator.com/item?id=22758217) — 539 points, April 2020, 48 comments


I'm curious to know how many people you get paying for something like your "Magical one-liners", and also whether you've ever experimented with a "choose to pay after" model?

I ask because it's the kind of thing that I can imagine finding useful enough to pay $5 (or $15) for, but I can also imagine it being something that contains nothing I don't already have saved in my personal "one liners" file, so I'm not really interested in paying to find out.


I use Gumroad/Leanpub to sell my ebooks. As far as I know, they don't support the "choose to pay after" model.

You can see the number of paid sales for the bundles under the "I want this!" button. When the price is 0, it shows the total of both paid/free users.

I started selling ebooks about 5 years back. Where I live, my monthly living cost is just $150. While the first two years of sales were just about enough to cover my costs, the last three years have been much better - I can continue being self-employed :)


> Where I live, my monthly living cost is just $150.

That's quite low! Mind if I ask where in the world that is?


Outskirts of a second-tier city in southern India. I live a modest lifestyle - no vehicles, desktop instead of laptop, live alone etc.


Thanks for sharing this, it is pleasantly obvious you put a lot of work into this. I especially like the TUI application!

Will the web version of this book remain free even after August 31st?


You're welcome and thanks for the feedback :)

Yeah, the web version is always free for all of my ebooks. And you can find the markdown source on GitHub, for example: https://github.com/learnbyexample/learn_gnuawk/blob/master/g...

I use `pandoc` to generate the PDF/EPUB versions from markdown. See my blog post https://learnbyexample.github.io/customizing-pandoc/ for details.


I did this golfing a while back: Drawing a heart with AWK - https://gist.github.com/auselen/906a53b47a7d616b080dbef85eb8...


99.9% of my awk use case is to split a line (a la `cut -d' ' -f`) while discarding successive spaces.

e.g.:

    $ echo "key:     value" | awk '{print $1}'
    value
Open to a simpler replacement :-)


You might consider: https://github.com/c-blake/bu/blob/main/doc/cols.md

That's in Nim, though that may not be much a barrier. (There may also be other tools in bu/ of interest.)


You can do it with cut too:

    $ echo "key:     value" | cut -wf 2
    value
but whether it's actually "simpler" is open to debate

edit: actually gnu cut lacks -w, so this is bsd-only. lol computers, stick with awk


I can't tell you how many times I pipe in rev to put my text where I want it for cut (then rev it again).

Abbreviated example, getting the service names from a k8s cluster looks roughly like (actual command does a bit more processing):

    kubectl get deployments -o wide | rev | cut -d'=' -f1 | rev

But if it's just gobbling whitespace, xargs without a command can be your friend.

$ echo "key: value" | cut -d: -f2 | xargs

value

My brain generally goes "rev sed head tail xargs cut tr ... screw it, I'll use python ... someday I shall learn awk." There's a young engineer on my team that knows awk, and I'm envious.


Neat, but both your tricks (rev and xargs) are more for getting the last word than getting the nth word.

For the sake of the argument, say I have the following fixed output and want the sizes:

    $ ls -l
    -rw-rw-r-- 1 userAAA group      588 Aug 29 00:25 file1
    -rw-rw-r-- 1 userAA  groupB   11870 Aug 29 00:24 file2
    -rw-rw-r-- 1 userA   groupBB   1166 Aug 28 23:56 file3
    -rw-rw-r-- 1 user    groupBBB   195 Aug 28 23:56 file4
I would just do:

    $ ls -l | awk '{print $5}'


You don’t even need to know awk these days. Just say “how to do x munging task” in ChatGPT and you’ll get a one-liner that will be just as good as if you’d sat there squinting at man pages for 30 minutes.


this is exactly the sort of case where you get non-portable bullshit you don't understand out of it! It spits out something that works on BSD but not on GNU, you put it in your script and _boom_ wonder why the thing blew up in prod, and oh btw you also lack the ability to debug it because you never understood it in the first place


Meh, I don’t think that is a problem endemic to shell specifically. That’s more a matter of putting untested code in production. The thing about it being a one-liner is: if that happens and it’s not portable, who cares? You just turn around, paste that sucker back in, say “make it work on OsFlavor2.06”, and you get back something that works. You don’t even have to fully understand why it isn’t portable; you can just ask the AI and have it explain why. If you wanted something battle-tested in prod that was readable and understandable, you wouldn’t be using one-line shell scripts in the first place, regardless of whether they were written by an AI or not.


More importantly it's just mega slow.

The whole point of bash one liners is not to write short bash, it's to type it in your shell to get a result quickly.

In the time it takes just to type the URL to ChatGPT in my browser, I would have written my one-liner in shell XD


I sincerely doubt unless you are writing shell all day every day that you will get a decently complex working one liner out faster than GPT4.


Right, but you are bending the argument to your will.

What is a "decently complex one liner"?

If it's decently complex, then it's probably not a one liner, so indeed chatgpt may be faster.

If it's a one liner, then it's probably not complex, so I would be quite confident in being faster than chatgpt.

The reality is, 99.9% of one liners are just series of pipes and filters to extract specific fields from an output, and act on it. I'm quite confident I would be faster than chatgpt for any of these. And no, I don't write bash "all day, every day".

I tend to see "every day bash" as very similar to SQL. Once you know cut, grep, find, sed and awk, even at a basic level, then you can combine them and extract pretty much anything.


One example of what I call a “decently complex one-liner” that I had ChatGPT write the other day is a command to find the top n most recently modified files matching a given pattern in a certain directory. Sure, that is reasonably fast to write yourself if you know shell pretty well, but why bother when I can just input my requirements and get something that immediately works, without squinting at man pages and checking my patterns on regex101? In the time it takes to write out that two-sectioned pipe command I’ve already solved it by pasting my requirements into the AI.
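For the curious, that prompt tends to come back with something along these lines (GNU find assumed; the pattern and count are placeholders):

    find . -name '*.log' -printf '%T@ %p\n' | sort -rn | head -n 5 | cut -d' ' -f2-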

SQL is a great example too where I will use an AI even though it’s not necessary. I’ve probably written many tens or hundred thousands of lines of SQL in my life and I would still prefer to just toss my requirements into an AI and have it write the query for me so I don’t have to cross reference things and look up syntax. Easier to do that and iterate on it once or twice than comb through some bigquery or Postgres docs because I can’t remember that particular flavor of sql today


Only with FreeBSD `cut`.. coreutils `cut` (Linux) is missing -w as is at least OpenBSD?

EDIT: I see you discovered this. lol computers indeed. ;-)


What does -w do? This works without it, no?

Edit, found it, "use whitespace as the delimiter"

https://www.unix.com/man-page/FreeBSD/1/cut/

For most cases like the OP you'd know the delimiter anyway so I don't think the absence is a big deal, and if not it would be easy to use tr or sed to make it consistent


the important thing is that it uses _consecutive_ whitespace as the delimiter, so you'd have to use sed to collapse all the whitespace down to one tab.

At that point, awk is vastly simpler


Check out https://github.com/sstadick/hck and https://github.com/theryangeary/choose - both are alternatives for cut/awk and allow regex-based splits as well. Though, they don't remove starting/ending whitespace IIRC.

I wrote a script (https://github.com/learnbyexample/regexp-cut) that uses `awk` to provide a `cut`-like tool with regex-based split, negative index, etc. And this will take care of starting/ending whitespaces as that's the default `awk` behavior.


I think you mean

    echo "key:     value" | awk '{print $2}'


confession: plug

I once ported a diff2html script from bash to awk and it was much, much faster (for obvious reasons). And awk makes it much more readable than a bash script. And I could learn the language, debug, understand bugs and fix them in a night.

Not sure if it's the idiomatic way to awk, but I have to say it is a really nice language.

https://github.com/berry-thawson/diff2html/blob/master/diff2...


Awesome! I've been meaning to replace my usage of Python/JavaScript for tasks (which I believe) are more awk-shaped.


As long as you don’t care about unit-testability. If you bothered to write them in Python or JS, you usually don’t want to regress back to shell stuff. You’re already in a place where you have a runtime available, so you can do way more stuff.

It’s usually the opposite direction from what you mentioned that you want to go: you one-liner some shell like awk to quickly get shit done without worrying about a runtime being available to you, and then if you need it to be more robust and legible (because of testing etc.) or production grade, you move to a proper dynamic scripting environment.


A few years back I decided to just get as capable as I could with jq, which is fast and functional enough to cover 99% of awk/sed use cases, plus cases you'd never want to touch with awk/sed.

No regrets!


Of possible interest - instead of making a whole new programming language like awk, you can also just systematize generating code for an existing one with a command-line harness.

This can even stay terse & keep a fairly fast edit-test turnaround in a fully statically typed language like Nim: https://github.com/c-blake/bu/blob/main/doc/rp.md


If you think that's a good idea, you don't understand why tools like awk/perl/sed/etc exist and are popular. They are, by design, optimized toward specific kinds of use cases.

In fact, their dynamically typed nature is a perfect example of that since it's much easier to quickly manipulate strings in a language that isn't so strict, as they'll do more heavy lifting for you via automatic coercion while limiting extra syntax/boilerplate (which, granted, is less of a problem with modern type inference). That makes it a lot easier to toss together quick one-liners and glue code, which is where these tools shine in the first place.

Hell, even something like python or ruby is just a little too structured for my taste when doing something quick and dirty, which is why I love perl as it can be unstructured if that's all I need, or I can create a more structured program if that's what the problem requires.


It's just a different & in my experience often neglected point in a similar design space (as that initial, linked text argues). Your tastes & use cases are your own. Almost everything "all depends" upon so very much in computer systems & in life.

To add some more color, Nim is also a very adaptable prog.lang. I believe there are converts from Perl in its fan base. Nim's creator long ago recreated some Perl in Nim: https://nim-lang.org/araq/perlish.html

Anyway, it's a different set of trade-offs to consider which I thought some reading about learning awk with open minds might find interesting. That's all, really.


So Awk is a whole new language, but Nim isn't?


I never said Nim was unique or older than awk. While I cannot make you read my cousin comment to understand I meant "new" as clarified-"different" [1] or click through any links, I can perhaps non-redundantly emphasize that the mentioned approach "works" not just for Nim, but for any language, C & Go (impls ref'd in mentioned `rp.md`), and Python in another comment in this comment thread: https://news.ycombinator.com/item?id=37295399 (maybe even with `eval` there!)

Only, the approach "works" with differing levels of "success" for different use cases / contexts. It is true (whichever) shell language is still there to differ in shell 1-liner cases. That is also true of sed / awk / perl / ... If you don't want to click through on `rp.md`, you could also read Ben Hoyt's article on his Prig if you like: https://benhoyt.com/writings/prig/ discussed on HN a while back https://news.ycombinator.com/item?id=30498735

It's not actually that different from your `cppawk` that you mention elsethread.. just maybe rotated 27 degrees away in "idea space". ;-)

[1] https://news.ycombinator.com/item?id=37293475


Whole new language? Awk is 45 years old.


I agree that "writing/learning a different" (what I meant) is more clear wording than "making a whole new".

EDIT: and it is a fair counterpoint that any command with options is, in some sense, also a different language one must learn. Learning API calls is also a different language (at least nouns & verbs if not syntax). But that is all partly the point. awk did/does a programming language with different syntax where other alternatives might be enough.


Never learned awk or committed esoteric cli incantations to memory. Don’t get me wrong, I can get around on the cli, but sed, awk, etc just didn’t seem like a good cost/benefit investment. I’m also not a sysadmin.

Thankfully I waited long enough and LLMs can write them for me better than I ever could.


What is better? Starting with awk or sed?


Depends on the task. Sed is typically used for search and replace and Awk is better suited for field based processing. Both these tools also have filtering features (regexp based, line number based, range, etc).

See also: When to use grep, sed, awk, perl, etc https://unix.stackexchange.com/q/303044
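A quick illustration of that split (file names made up):

    # sed: search and replace
    sed 's/color/colour/g' draft.txt

    # awk: field-based processing
    awk -F, '$3 > 100 { print $1 }' data.csv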


While each has its own scope and idiosyncrasies... they're pretty similar, at least in the pattern matching part, and both of them have a pretty small internal "core" of functionality that is easy to grasp. So the honest answer IMHO is... "both".


If you're familiar with Vim's search/replace syntax you already know how to use "sed -e" to replace text, that's how I got into it.
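e.g. the same substitution in both (file name made up):

    # in Vim
    :%s/foo/bar/g

    # with sed
    sed -e 's/foo/bar/g' file.txt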


Depends on your use case. I can't speak to sed, I don't know it very well. Awk is my SAK (Swiss Army knife).

But I learned awk while sitting in an office at a client site. I forget the specific scenario, but I wanted to split up some files into some other files. I didn't even know awk, but grokked enough from the man page to let me do what I wanted to do. I can't even say what provoked me to turn to awk in the first place. I do know I ran into some internal open file limits, but worked around that.

If you want to tear files apart, or summarize them in some way, or push the fields around, awk is much better. sed is an editor. If I have a sed scenario, I'm more apt to just do it in vi and save the result than stitch together some pipeline with sed.

Most of my use cases are one off processing and analysis. I've never had any workflows that relied on awk or most anything like that. It was almost all throw away code, a tool on the workbench, not the production line.


Each has its uses, and there are things which are more easily achieved in one than the other.

(I have a set of scripts I use to parse NOAA's weather web page to plain text, and ended up resorting to both sed and awk in the process, and haven't yet tried to simplify that to a single script.)

Sed is usually used for simple text substitutions and manipulations.

Awk has built-in record and array concepts, as well as more standard programming constructs (loops, if/then, case/switch, printf, and external system interfaces (launching and/or reading from external programmes).

My view is that the tools overlap considerably, but also complement one another strongly.


A sed binary is usually much smaller than an awk binary, either POSIX or GNU. The memory footprint of sed will be much more compact.

However, sed has grown out of the command language used by the tty editors, and is more difficult to program (although it is Turing-complete).

The awk language implements much of the syntax of C, and it is not difficult to write a very slow and inefficient script. This inefficiency is harder to reach in sed, because it takes more effort to abuse it.

O'Reilly's book on sed and awk is available free online, both to browse and to download as a ZIP.

https://docstore.mik.ua/orelly/unix/index.htm


For those who care about copyright, that URL is not from O’Reilly; it’s a copy of a book-set that O’Reilly used to distribute via CD-ROM – with a nice user interface that used web technologies (even included a search feature). O’Reilly could make it available for free – as they’ve done so for other books such as Apache Security¹ or Using Samba² – but they still (as is their right) expect you to pay for the sed & awk book³.

¹ https://blog.ivanristic.com/2015/02/apache-security-ten-year...

² https://www.oreilly.com/openbook/samba/book/

³ https://www.oreilly.com/library/view/sed-awk/1565922255/


The old sysadmin in me says to forget both and just learn Perl.


Start with sed. It can take you a long way and it is more succinct.


I'd start with AWK because it covers more use cases.


I have been using ChatGPT to generate small CLI commands like this. My prompts look like this:

    - use jq to count a nested array "a.b.c.d"
    - find and delete empty folders using `find`
    - find and replace text using sed/awk
I found that using ChatGPT for these purposes boosted my productivity tremendously.
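For reference, the first two prompts typically come back with something like:

    jq '.a.b.c.d | length' input.json
    find . -type d -empty -delete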


ChatGPT is a great time saver for those who already know how to use awk. But it should not be used by those who are unfamiliar.

Just an example: I saw someone come up with a great awk line to change some text in a nested directory. He then pasted it into bash. Only once the server went down did anybody realize that he had forgotten to cd into the proper directory, and he wiped out not only the server config but also all the user-uploaded data as well.

The server config was not version controlled and the user data had not been backed up in almost a week.


That's not really a ChatGPT issue, people pasting in slightly wrong commands (or right commands in the wrong folder) is a tale as old as time - well, as old as linux at least. Short of saying that nobody who's already an expert should ever touch a CLI, the lesson from that story is "be as careful as possible, then be more careful, and also have backups of everything" not "don't use a LLM to help".


Yeah, that exact same problem could easily affect someone who spent hours cobbling together the same awk script from Google searches and StackOverflow.


What “care” do you suggest that someone pasting in a script they don’t understand should take?


I'm far from an expert so you should probably ask someone other than me. But my two cents would be not to paste any code until you have understood it, or unless it's written by a source you trust, or alternatively only paste it somewhere you don't care - when I'm playing around testing stuff I might not fully understand on a linux server I do it on a VPS that's unimportant to me, and that if I mess it up I can very easily restore it back to a clean OS install and I have a bash script ready to reinstall all software I want & all the profile customisations etc.


My usage of tools like awk, sed and Bash scripting has increased an enormous amount thanks to ChatGPT/GPT-4.

I'm using those on a weekly basis now, because I don't have to memorize details of entirely new programming languages in order to apply them to small problems.

Smaller languages that I never took the time to learn are no longer something I avoid. I even use AppleScript now! https://til.simonwillison.net/gpt3/chatgpt-applescript


Awk is fine and dandy but, like with sed, I think it's almost always replaceable with Perl, which is way nicer to use, and ubiquitous. Every OS (except Windows) I've laid my hands on in the last 15 years has had Perl in either its default install or pulled in as a dependency almost immediately (a LOT of stuff depends on Perl in any Unix system).

That is, unless you are running in an embedded environment, but in that case you are stuck with something like Busybox's awk, which is way more limited than gawk...


I have a good story about this: my first time really working with a great scientist, we were taking genetics papers and turning them into code to improve analysis. I spent two days writing a perl script before I finally got frustrated enough to ask for help.

The first question he asked was "Did you email the author(s)?" I said I hadn't and didn't want to bother this seemingly very important scientist. He told me nonsense, that most of them don't mind responding but he warned me to be terse and to the point. I emailed the gentleman and told him what I was doing and my issues, and asked him for some guidance. He sent me back a one line awk-script that did everything all that perl was failing to do!

Of course all that proves is that I'm horrible at perl, but it was an important moment in my life. It showed me that even very smart and important people are still just people, that just asking is often a great way to learn new things yourself, and that sometimes you just need to step back and reconsider what tools you are using. I am forever grateful that an awesome geneticist who needed help bootstrapping tech infra took the time to teach me, a greybeard sysadmin type, practical, reproducible science, from paper to implementation. I learned a lot, but the biggest downside is that, after being heavily surrounded by scientists in most jobs since then, I find companies without that difficult to work for.


The obligatory perl replaced awk decades ago comment.

Soon to be followed by the ones saying nobody should be writing shell scripts at all anymore.


I wouldn't put Perl as easier to use, but it certainly is more powerful and has a vast ecosystem. And it is more portable, since there's no need to worry about GNU/BSD/etc variations.

And I wrote a book for Perl one-liners as well (https://learnbyexample.github.io/learn_perl_oneliners/), which I'm currently revising (like I did for the grep/sed/awk ebooks).


The difference is that learning pearl is an ordeal that will take several weeks at the minimum. Learning awk can be done in one afternoon, after reading a man page and a few examples. And it really works for the tasks it was designed for. So I think awk is superior to perl for the purpose it was created for.


>> The difference is that learning pearl is an ordeal that will take several weeks at the minimum.

I am not sure about pearl, but Perl is not that different from most other programming languages. If you are familiar with JavaScript or Python, learning the basics of Perl is pretty easy:

https://perldoc.perl.org/perlintro

https://www.perltutorial.org/

Perl is designed for text processing, so it has a powerful regular expression engine. Writing regular expressions can be difficult, but it is a great skill to have in your toolkit.

Fun Fact: If the programming language you are using has support for regular expressions, they are almost certainly Perl-compatible regular expressions because Perl's regular expression syntax is more widely used and more popular than other regular expression syntaxes (e.g. POSIX, etc.).


I was actually surprised to find mktime() in Busybox awk.

The big thing lacking there are the GAWK networking extensions.


> except Windows

On Windows, you use PowerShell.


>> On Windows, you use PowerShell.

If you use Git for Windows (https://gitforwindows.org/), it includes Perl.

Or you could install Strawberry Perl which is made for Windows: https://strawberryperl.com/


They were talking about whether or not the OS comes with Perl by default, not whether it has a CLI at all.


I prefer Cygwin.


I just finished writing a “dumb stuff with containers” internal blog which included:

    C:\> type somefile.txt | docker run --rm -i ubuntu awk 'something' > output.txt


Perl is not nicer to use.


I love awk. Enough to shill for this:

https://www.oreilly.com/library/view/effective-awk-programmi...

If TFA is an excerpt for a book forthcoming on dead-tree media, then I'll be buying that one as well.


Last week chat-gpt spat out some Awk for me for a generic linux request. Was quite a pleasant surprise!



