GoAWK: an AWK interpreter written in Go (benhoyt.com)
228 points by f2f on Nov 17, 2018 | 88 comments



AWK is such a cool little tool, great way to process log files (as shown in the article). Glad to see it still getting some attention.

It's kinda nuts how far the unixy idea of "just streams of text" has gotten us.


> It's kinda nuts how far the unixy idea of "just streams of text" has gotten us.

As much as I agree that the tooling is, object-based tools would be much nicer.

    for i in `ls`; do echo $i.size; done;
versus

    ls -la | awk '{print $5}'
I know which I'd prefer.


This is something that reiserfs was trying to solve (though unfortunately we all know what happened with Hans and Nina Reiser), so most of the ideas have been left on the table.

But yeah, the solution he proposed was to start combining different information namespaces (such as the "metadata" namespace that you are alluding to) into the filesystem namespace, as well as allowing for the filesystem to be pluggable so people could write databases that could be searched with the filesystem.

So your example would look like

    for i in "$(ls)"; do echo $i/..../size; done
And you could do something like

    echo 1000 | tee */..../uid
Which would be equivalent to a chown.


PowerShell is an extremely nice, cross-platform option. Most people are discouraged by the overly long command names (get-childitem and the like), but there are 3-letter aliases (e.g. gci) for a lot of the common ones, and it's really nice.


The second, because when the first fails, you're left with an ls command which is spewing some binary format you can't read without... wait for it... the tool which just failed.

I'd like to ensure my tools are all speaking a format which is too simple to ever fail.


I think you picked the only nice example of powershell in the entire universe (yuck!). Also -- guess which script is shorter....


For unixy tools ... well, there are lots more, and some care less than others about who considers them nice or friendly. `ls` is very much in the public relations camp: the flamboyant front man, the first point of contact. Behind `ls` is `stat`, which is more a unixy text-stream tool's tool than the human-friendly `ls`:

    stat -c "%s %n" *


No, that's awful if you consider its implications. Just do:

    for i in *; do du -bs "$i"; done


    > for i in *; do du -bs "$i"; done
    du: illegal option -- b
    usage: du [-H | -L | -P] [-a | -s | -d depth] [-c] [-h | -k | -m | -g] [-x] [-I mask] [file …]
cool
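
For what it's worth, here is a portable sketch that sidesteps both GNU-only flags (`du -b` and `stat -c`). It assumes only POSIX `sh` and `wc`; the function name `list_sizes` is made up for illustration, and it reports byte counts for regular files only.

```shell
# Print "<size-in-bytes> <name>" for every regular file in a directory,
# using only POSIX tools (no GNU-only `du -b` or `stat -c`).
list_sizes() {
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue      # skip directories and unmatched globs
        size=$(wc -c < "$f")         # some wc implementations pad with spaces
        # $((size)) normalizes the padded number; ${f##*/} strips the directory
        printf '%s %s\n' "$((size))" "${f##*/}"
    done
}

list_sizes .
```

Reading every file through `wc -c` is slower than asking the filesystem for the size, so this trades speed for portability.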


Sounds like you will like PowerShell.


Nobody likes PowerShell. It manages to be the worst of both worlds.


I wasn't aware that people hated it. I very rarely struggle through it (learning curve related, probably) but the objective behavior is something I wish Linux tools would move towards.


It may well be uglier than sin's really ugly cousin, but if you are stuck managing Windows environments, it's a god damned life saver.


Honestly, I'd much rather have seen it be a bunch of console-oriented convenience libs for .NET and focus on making a good REPL console for .NET languages.


Wait a minute. Bash has nice operators like .size ?

Omg


No, that was the point of the parent - that it would be nicer if we did have some object-layer, rather than processing (unstructured) text in pipelines. Losing meta-data at each step.


Awk in Lisp as a macro with Lisp AST syntax:

http://nongnu.org/txr/txr-manpage.html#N-000264BC

* Implements almost all salient POSIX features and some Gawk extensions.

* awk expression can be used anywhere an expression is allowed, including nested within another awk invocation. Awk variables are lexically scoped locals: each invocation has its nf, fs, rec and others.

* awk expression returns a useful value: the value of the last form in the last :end clause.

* can scan sources other than files, such as in-memory string streams, strings and lists of strings.

* supports regex-delimited record mode, and can optionally keep the record separator as part of the record (via "krs" Boolean variable).

* unlike Awk, range expressions freely combine with other expressions, including other range expressions.

* ranges are extended with 8 semantic variations, for succinctly expressing range-based situations that would require one or more state flags and convoluted logic in Awk: http://nongnu.org/txr/txr-manpage.html#N-000264BC

* strongly typed: no duck-typed nonsense of "1.23" being a number or string depending on how you use it. Only nil is false.

Recently accepted Unix Stackexchange answer featuring awk macro: https://unix.stackexchange.com/questions/316664/change-speci...


Cool, but why not use Common Lisp?


In 3.5 words: using isn't making.


Lua and AWK. I'm inspired! I'm going to rewrite the Perl 5 interpreter in Go. Take that Larry. Just kidding.

I'm all for scratching an itch, but why rewrite all these well established tools?


I extended somebody else's programming language recently, then wrote a BASIC. Mostly to make sure that I understood lexing, parsing, and AST stuff.

While you're right that this is reinventing the wheel, it can make sense to reimplement old tools to improve safety and security, and to allow them to be embedded in new environments.

Have you ever run a fuzz-tester against (GNU) awk? I have. Even now you can segfault awk with bogus programs, for example:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=816277

No doubt this new implementation won't be perfect, but segfaults should be ruled out by implementations in Rust, Go, or ML.


Ha! My interaction with AWK throughout the years has been a fuzz test in progress. Very interesting link though.

On the other hand, I think what you're saying about expanding a well known tool's range of use in new environments makes sense.


I can't tell you what motivates others, but I know what motivated me with GoAWK: intellectual curiosity, and the desire to learn how to use Go's profiling tools on a non-trivial project.


I suppose I'm jaded. I've been writing software for so long and have been so underpaid for it, maybe I'm envious of your ability to follow through on your intellectual curiosity with such flourish. Anyway, great work. Looks good!


One of the cool things about Go is that pretty much everything is being developed as a library. This means you can integrate this in your app without external dependencies, portable to any platform Go supports, since it depends only on the stdlib.


why not?


I also really like such posts. I have never written an interpreter. What do experts recommend, i.e. how can someone without a formal CS background learn to write an interpreter? Any good sources, articles, or books? Thanks in advance.


Depending on how involved the language you are interpreting is, you might get by having only read chapter 6 of The AWK Programming Language[0] (linked in the article), which covers "Little Languages", including what it terms an assembler and interpreter.

If you are interested in more depth, either Crafting Interpreters[1] (mentioned in the article) or Writing an Interpreter in Go[2] looks promising. I've read more of Crafting Interpreters and really enjoy it, though it isn't yet finished. One of the aspects I really enjoy is that the language is implemented and re-implemented in different languages to gradually introduce lower level concepts.

Finally, this one may be a little more "out there" than what you are looking for, but if you are interested in designing a language more than the plumbing of an interpreter, Beautiful Racket[3] is really good.

caveat: not an expert

[0]: https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoI...

[1]: http://www.craftinginterpreters.com/

[2]: https://interpreterbook.com/

[3]: https://beautifulracket.com/


Sure! I also don't have a formal CS background (I studied electrical engineering without doing any programming / CS courses). Bob Nystrom's free online book http://www.craftinginterpreters.com/ is excellent. There's also the (much older but still enlightening) "Let's Build a Compiler" by Jack Crenshaw: https://compilers.iecc.com/crenshaw/


Troff + pic + eqn ported to Go would be cool :>


Holy shit yes please. And grap.


I wonder if there's a file size + workload at which the coordination overhead of a parallel awk is low enough for an overall performance win.
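
One way to get a feel for that overhead is a crude split-and-merge experiment. This is a rough sketch, not a real parallel awk: it assumes `split`, `awk`, and `mktemp` are available, the name `parallel_sum` and its chunk-size argument are hypothetical, and it only works for aggregations whose partial results can be merged (here, summing the first column of a non-empty file).

```shell
# Sum column 1 of a file by splitting it into line-based chunks,
# running one awk per chunk in the background, then adding the partial sums.
parallel_sum() {
    file=$1
    lines_per_chunk=$2
    work=$(mktemp -d) || return 1

    # Coordination cost #1: writing the chunks to disk.
    split -l "$lines_per_chunk" "$file" "$work/chunk."

    # One background awk per chunk, each writing its partial sum to a file.
    for c in "$work"/chunk.*; do
        awk '{ s += $1 } END { print s + 0 }' "$c" > "$c.sum" &
    done
    wait    # block until every background awk has finished

    # Coordination cost #2: merging the partial results.
    cat "$work"/chunk.*.sum | awk '{ t += $1 } END { print t + 0 }'
    rm -rf "$work"
}

# Usage: parallel_sum big.log 1000000
```

Timing this against a plain `awk '{ s += $1 } END { print s + 0 }' big.log` at different file sizes would show roughly where the splitting and merging stop dominating.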


Nice to see another implementation, but I still hope the original one will always be kept alive.


It's the one used by all the BSD systems.

Recently there is an official GitHub repository with commits from Brian Kernighan.

https://github.com/onetrueawk/awk/commits/master


While I'm cool with writing other language processors in a new language (Lisp written in Cobol anyone?) I'm missing the value of this past the bragging rights.

There was a similar article about writing the LuaVM in Go, to package it in bigger Go applications. I've done lots of C based systems and bolted Lua on, so the Go version makes sense.

But is embedding Awk into a program something that gets done on a regular basis?


I believe in this instance it is the author wanting to level up on AWK and Go. The value is learning and fun.

An AWK interpreter written in Go is unlikely to be an improvement, except, well here is another blog post you might be interested in that has a similar sense of adventurous tinkering (it's about improving on grep): https://ridiculousfish.com/blog/posts/old-age-and-treachery....

That's from 2006, and the tl;dr was that graybeards did things a certain way for a reason. And yet nowadays we have things like rg (and ag and a bunch of others).


I think my GP has an objection to it being shared and being on top when there is nothing to learn from this in terms of ideas (which I share) and not to people hacking away.


But now other people who are interested in Go can learn from it. Seems pretty much the point of HN - finding interesting things to learn.


The performance optimization part of the article is very relevant to experienced Go programmers.

Also, pjw, the W in AWK, is on the Go team at Google, and Kernighan, the K, co-wrote the canonical Go book.


And Aho, the A, co-wrote the most famous book on compilers. I like to think he'd approve of implementing programming languages.


This might be an avenue for programmers who are comfortable in Go but not C to extend a version of the Awk language.

For example, I have wished for a long time that Gawk could parse gzipped files so that filenames could be used directly. I could take a run at implementing that in this interpreter, whereas doing it in the C version would be more difficult for me personally.


Is there a reason you can't just use a pipe and process stdin?


That is what I do today. However in the case where metadata is part of the file name, the FILENAME variable is not populated without something else in the pipeline that passes it into the Awk script as a variable.
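
The workaround described above can be sketched roughly like this. It assumes `gzip` and a POSIX `awk`; the wrapper name `awk_gz` and the variable name `fname` are made up for illustration. Note that `awk -v` processes backslash escapes in the value, so file names containing backslashes would need extra care.

```shell
# Feed each gzipped file to awk on stdin while passing its name in a variable,
# because FILENAME is not populated when awk reads from a pipe.
awk_gz() {
    prog=$1
    shift
    for f in "$@"; do
        # fname stands in for FILENAME inside the awk program
        gzip -dc -- "$f" | awk -v fname="$f" "$prog"
    done
}

# Usage: prefix every line with the file it came from.
# awk_gz '{ print fname ": " $0 }' logs/*.gz
```

Each file gets its own awk process here, so variables and `NR` reset per file, unlike a single awk run over several file arguments.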


I look forward to reading this article when I have the time because I am a Go newbie, so a well explained example program will be useful to me. Also, I find language implementation interesting, and I've always admired the AWK language. I for one am glad he did it and wrote an article explaining it.

Finally, I think it is good to see C software being rewritten in more robust languages like Go and Rust, if people are inclined to do so. I don't think rewriting all the C software should be our top priority, especially high-quality programs like AWK, but it's a good overall direction to go when people are interested.


Why do so many modern projects feel the need to include the language and/or the tech stack used as part of the project name? Is it a type of virtue signaling? Does “Go” or “JS” or “Swift” or “Node” make these project more attractive somehow to an end-user (even if the end-user is a programmer)?


>Why do so many modern projects feel the need to include the language and/or the tech stack used as part of the project name?

Because that was a great motivation for them being written in the first place: "Let me write X in Y, to have X easily used from Y, or as a learning exercise, or because Y is fun or brings benefits" (e.g. Rust and safety, Go and easy parallelism/static builds).

>Does “Go” or “JS” or “Swift” or “Node” make these project more attractive somehow to an end-user (even if the end-user is a programmer)?

Yes, very much. I prefer Go and Rust utilities now whenever I can find them over an equally good alternative.

(Plus, you seem to have forgotten that "if the end-user is a programmer" they might be interested to play with the source, and there the language is very important).


For developers who want to tinker, they can use different means to discover projects. I don't need to see the language or tech name in the file every time I use it just so developers can discover it.

To me it signals lack of creativity. The problem of these projects seems to me that they have no reason to exist (beyond playground for the developer) other than "a tool written in that language". For example, here, awk already exists. Why would someone use GoAWK other than to indulge their love for Go? These are the cases that bother me the most; pushing a language or a tech to give your project validation because it has little as a standalone.


>For developers who want to tinker, they can use different means to discover projects. I don't need to see the language or tech name in the file every time I use it just so developers can discover it.

Well, if you do see it, it hurts you how?

>To me it signals lack of creativity. The problem of these projects seems to me that they have no reason to exist (beyond playground for the developer) other than "a tool written in that language"

Well, to me "a tool written in that language" is quite important. As a developer I don't just care for the tool's functionality, but also its hackability, portability and ecosystem.


> Yes, very much. I prefer Go and Rust utilities now whenever I can find them over an equally good alternative.

That's irrational. What difference does it make what language the utility is written in, if it works as expected?


TONS of difference.

The language influences whether I can easily hack the utility or not.

The language influences how easy the utility is to port.

The language influences how easy the utility is to use as a lib, integrate with some other project.

The language influences how many memory bugs the utility might have (e.g. C vs Rust).

The language influences how fast a utility is to build (e.g. Go build times).

The language influences how easy a utility is to deploy, or to have many different versions of (e.g. static builds vs a mess of classpaths and virtualenvs and the like).

The language influences how easy it is to install (e.g. go get/install, or the cargo equivalent, vs messing with languages with no package managers like C, where new or obscure projects are almost never in the official distro repos).

The language influences how easy it is to build (e.g. go build vs the C/C++ autoconf/automake clusterfuck and dependency library hunt).

The language also influences how performant a utility will be (e.g. csv query/processing tools written in Python vs xsv -- or the canonical example, Electron monstrosities vs native tools).

Not all of these traits are guaranteed given a language (to preempt the first knee-jerk objection), e.g. a Python tool could be faster than a Go tool if written well enough.

But historically and statistically speaking, and in a regression to the mean sort of way for each language platform, I've found those things to work better in some language X over another Y.

Hence, for me, the language matters.


Do you "hack" every utility you use? Is that a predicate on you using that utility? Would you not investigate a utility you like to see whether you can easily "hack" on it or not?

(I dislike the term “hack” as it denotes to me irresponsible way to throw code, rather than properly design features and implement them in a considerable manner.)

I agree with most of the things you say above, they are very important. I just don't think it should be part of the name and identity of a utility. A tool should not have to scream to the world "I am a fast tool because I was written in C++ and not JS"—that should be a given for every utility.


>Do you "hack" every utility you use? Is that a predicate on you using that utility?

No, but not all the utilities I use are open source either. When they are, and for some domains (e.g cli text manipulation tools and development utilities) I like to be able to hack them, even if I don't, so for that (and for the other reasons) the language they are written in is important to me.

I never said it's the sole criterion.

>I dislike the term “hack” as it denotes to me irresponsible way to throw code, rather than properly design features and implement them in a considerable manner

For deployment code, yes. For utilities, usually that's exactly what I want though: to irresponsibly throw code for my own purposes, when I find it convenient to fix an annoyance or bug in a utility I use or add some small feature I want to it.


When I see that "it's written in Go" to me it means that I can actually look in that code and understand most of it and the compilation from source will not be a big deal.


Back in the day GNOME projects had a G somewhere, while KDE ones had a K, which now can actually be mixed up with Kotlin ones.

C++ projects used to add ++ as suffix, for example Rogue Wave Tools.h++ and Motif++ libraries back in the glory days of commercial C++ compilers.

So this fashion is already quite old.


But those are signalling compatibility with other tools. That's fine. It makes sense. But just imagine if someone posted about the original awk (which happens from time to time) and titled it "awk: a text processor written in C".


Thing is, we now have this thing that language X is not good for doing Y, so when one does post a software for doing Y in X, and it actually does a good job, it works as marketing material for language X.

Programming languages are software products as well, and their customers want to feel they have made the right choices sticking with their options.


Paint.NET is a good example of this. IIRC, it was a showcase program to show off the .NET framework.


It existed before, but it's much more prevalent these days. Or maybe I wasn’t as cranky or prone to eye rolling 15 years ago =]


It is meant to attract other developers interested in the language, but it also serves as a shortcut for what the project is going to be like. "An Awk interpreter written in Go", "An Awk interpreter written in Java", and "An Awk interpreter written in Idris" set different expectations about the product, the project's goals, and what the community of its contributors values and talks about. It tells you why you may or may not want to join the micro-club of its users.


Yeah, definitely. The fact that this is written in Go immediately tells me it will be extremely easy to compile and deploy, the code will be easy to understand, and it will be reasonably fast. If it had been AWKjs I would have immediately known it could run in the browser.

The language tells you loads! This is a stupid complaint.

Imagine if it was PerlAWK, or PHPAWK.


Normally, I’d agree with you. But I think in this case it is acceptable because the author is trying to convey that it is an implementation of AWK in Go.

In general, I think that it is acceptable to include the language name in the project if you are trying to convey that the project is an implementation of another project in that language and is not really meant to differ significantly from the original. Such projects are typically meant as an experiment, a library implementation for the target language, etc. Anything else should be named differently, especially if it is going to differ significantly from the original.

Another thing to keep in mind is that by naming the project GoAWK, the author is as good as claiming that this is the canonical implementation of AWK in Go— something they may not have intended.


I've noticed this too. It's odd because if you've done your job correctly I shouldn't have to know which tools you've used to make it. I'm not going to start using a tool just because it was made with certain other tools.


I somehow have more trust in tools written in languages like C, Go or Python. If it's written in JavaScript, I tend to look for an alternative first, and I use JavaScript for 95% of what I do.


> Why do so many modern projects feel the need to include the language and/or the tech stack used as part of the project name?

It's advertising or propaganda for the language.

Whenever a language is "hot" and going through a full scale PR campaign, you see something like "[fill in the blank] written in Rust/Go/Ruby/etc" over and over again.

No different than all the current "facebook is evil" spam or the "bitcoin is great" spam of a few years ago.


Yeah, this fetishization of specific programming languages is silly. So many people these days are obsessed with learning specific languages and technologies (and sometimes denigrating others), instead of focusing on more fundamental aspects of programming.

This kind of people tends to do badly in generalist coding interviews that rely on their fundamentals.


It's probably about coming up with names.

If I'm rewriting a tool like Awk in Go, then coming up with a new name is hard, and naming it goawk signals that it is a reimplementation of Awk in Go.

If one names it xyz, then it would probably gain less traction, and in the comments people would complain that this is a reimplementation of Awk in Go.


You can tell GoAWK apart from the other AWK, I suppose.


¯\_(ツ)_/¯


Because implementation details are interesting to craftspeople, and we are craftspeople. Especially in a project where reimplementation is one of the defining aspects of the project.


party hard.


should have just named it gawk tbh


Not sure if you're making a joke or unaware of https://www.gnu.org/software/gawk/ :)


I suppose you know gawk is already used by GNU?


I am hoping one day there is going to be an operating system, where every single tool is written in Go and everything runs in containers! Well done!


Why? GC is overkill for a good number of command line tools.

What advantage do you see in command line tool running in its own container? Given how important pipes are, that's going to be a lot of overhead punting data between containers.


Actually, lack of a GC is overkill (in terms of control needed over memory) for most command line tools.

Having to manually track memory liveness in C adds a large amount of complexity to tools like awk, sed, grep (which are already complex beasts themselves).


Many commands that only do a few things (perhaps not awk, since it runs full programs) don't need to free everything, since it will only allocate a few things and will soon terminate and free the entire process.


Nearly any command that you pipe into, or stream content out of, must allocate and free memory in some non-trivial way.

Sure, those commands could just allocate and never free memory (a-la early C compilers, or the D compiler), but now any use-case that involves a large amount of data will leak noticeably. Not going to fly if you need these commands to be durable and efficient. And unix commands need to be both.

A GC gets you the freedom to operate on large streams for free, without having to worry about memory management (modulo optimization, but that happens later anyways, regardless of GC presence or not).


In some cases, yes. But not in all kinds of programs. For example, my Farbfeld Utilities programs differ in how much buffering they need:

* Some deal with only one pixel at a time, or sometimes two. No dynamic allocation is needed.

* Some deal with one scanline at a time, or sometimes more than one (but a fixed number) at a time. The same buffer can be used for each scanline.

* Some deal with the entire picture (such as those that distort the picture).

But one possibility is a program that loads multiple pictures, needing each one in its entirety but not all of them simultaneously, in which case it makes sense to free each picture after it is used.

(Or maybe I somehow misunderstood your message or something else.)


The point is, the default decision should be to not have to worry about memory management. Most applications shouldn't, because they're not realtime operating systems, or in an environment where memory must be allocated statically.

Almost all unix command line utilities fall into this category. Having to worry about pairing your `free`s with your `malloc`s is a strict increase in cognitive overhead, which should have been spent on verifying the program's semantics are correct.

Messing up low-level memory operations, when you just want to worry about semantic correctness, potentially leads to bugs like RCEs, or dosing somebody with too much radiation.


Thankfully, these days, you don't actually need to choose between GC and manually matching up your `free`s with your `malloc`s.

There is at least some data that GC does have an impact on command line tools like this: https://boyter.org/posts/sloc-cloc-code/ --- More experiments like that would be great to crystallize the exact trade offs here.


Go isn't just about GC. Go is a language that is checked more strictly by the compiler. So beyond direct memory management, Go code should be cleaner, safer and of course easier to read than the corresponding C code.


If you're interested in correct, cleaner, and safer code, Rust is a better choice. Go's lack of generics really hurts it when it comes to simplicity and cleanliness, and as far as correctness goes, the Go compiler isn't anywhere near as strict as Rust's (nor is it as strict as gcc or most other compilers, for that matter, as a conscious choice favouring speed over correctness).


> easier to read


Easier to read is not something I'd particularly accuse Go of. If you want that, stick with Python. Lack of generics can end up really hurting Go for cleanliness.

Something I ran into the other day when I was trying to produce a reverse-sorted list of strings:

    sort.Sort(sort.Reverse(sort.StringSlice(dir_contents)))


here are some basic linux tools in Go:

https://github.com/u-root/u-root


I wrote a sed in go a few years back, just to do it. The engine is a Reader so you can use a sed-processed stream anywhere a Reader is accepted. There’s also a command-line driver of course.

https://github.com/rwtodd/Go.Sed




