Why Awk for AI? (1997) (plus.com)
117 points by mooreds on May 17, 2013 | 53 comments



I worked with Dr. Loui for a year on research projects after loving his undergrad AI class. He's crazy -- like "A demo to the customer is not working? Let's edit that in real time then reload, proving why scripting languages are better." -- but he's a very useful kind of crazy.

I think that Ruby/Python are the spiritual descendants of this argument in 2013, by the way. For example, a class project was to scrape eBay and predict the winning bids on a variety of item classes. (Spoiler: "the current bid" outperforms most algorithms.) With AWK, you scrape the HTML and do some ugly parsing. Of course, with Ruby you'd hpricot a single CSS selector and have the lab 80% complete in ~3 lines.
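
The Python flavor is about as short; here's a rough sketch with urllib2 and BeautifulSoup, where the URL and CSS selector are made up just to show the shape of it:

    import urllib2
    from bs4 import BeautifulSoup
    # hypothetical listings page and selector
    html = urllib2.urlopen("http://listings.example.com/search?q=laptop").read()
    soup = BeautifulSoup(html)
    bids = [td.get_text() for td in soup.select("td.current-bid")]
    print(bids)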

The single greatest disadvantage to using AWK for serious work is that nobody but you and Dr. Loui does that, so you get to reinvent everything yourself every time. (You probably do not appreciate how much your language ecosystem bakes in code re-use until you've used a language that assumes essentially all use is one-off, by the way.)


You're not wrong about reuse, but note that shell programs make for very reusable and performant awk libraries. ;-)


I don't think anyone would seriously advocate using awk for complex projects these days, but the idea of keeping data really close to the OS/shell is a very powerful one.

Take Python, which is supposedly a "scripting" language, but requires relatively painful amounts of boilerplate to actually read from or write to pipes, etc. It doesn't force you to keep everything in Python, but it certainly nudges you that way. Without naming names, certain statically typed languages that are obsessed with safety are even worse in this regard.


> I don't think anyone would seriously advocate using awk for complex projects these days

I guess it depends what you mean by 'complex'. Doing basic statistical summaries on a few columns of a TB or so of some logfile in a weird format would be a big project using a lot of the tools I've had to use, but it's more or less 'hello world' for awk. I really like it as part of an "ETL before you ETL" process where you massage your data into a format that makes your official ETL tool with fancy capabilities not choke.

The simplicity of awk also gives it a cultural advantage over things like perl in that you can argue that you're not really introducing another language into your environment, just including some glue logic with a Standardized Tool Everyone Should Know.

The only thing I'm familiar with that does such a good job of keeping simple things simple even when distributed or at largish data volumes is Splunk.


What is Splunk? Are you talking about this: http://en.wikipedia.org/wiki/Splunk


I shamelessly reach for the backtick shell notation in any exploratory Ruby code that needs it. I try to replace it with something written in Ruby if the codebase grows beyond my laptop, but backticks never get old.

    => 1.upto(100) { |n| puts `curl -s fizzy.heroku.com/#{n}` }

    1
    2
    Fizz
    4
    Buzz
    ...


Ha! That backtick is the only reason most of my "scripts" are being written in Ruby instead of Python. :)


Seems pretty easy in python:

  from subprocess import call
  call(["ls", "-l"])
or

  import os
  os.system("ls -l")


The difference between "pretty easy" and "easy" changes everything.

Compare

  from subprocess import call
  call(["ls", "-l"])
to

  `ls -l`


I really like the python sh library for stuff like this. Would you rather do what you wrote, or this?

    from sh import ls
    for listing in ls('-l'):
        '''etc'''
Here's the library in question - https://pypi.python.org/pypi/pbs - it makes me super happy to do shell work in python.


Wow, that's a neat library. You can just import any program and use it like a function (from sh import grep)! Thanks for the link. Also, sh documentation: http://amoffat.github.io/sh/


Yeah, it's replaced just a ton of boilerplate `def run(*args):` functions wrapping Popen. I couldn't be happier.


I use sh to interface with my cmus player. from sh import cmus_remote! :)


While there are several ways to skin that cat, the suggested way is to use subprocess.Popen.

Also, when using Popen, calling .wait() can cause problems if you are expecting large amounts of output back on stdout or stderr. Using .communicate() is generally better.
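
Roughly the difference, sketched with a stand-in command:

    from subprocess import Popen, PIPE
    p = Popen(["ls", "-lR", "/usr"], stdout=PIPE, stderr=PIPE)
    # p.wait() here could deadlock: if the child fills the stdout/stderr
    # pipe buffer before exiting, it blocks, and wait() never returns.
    # communicate() drains stdout/stderr while waiting, so it is safe.
    out, err = p.communicate()
    print(out.decode())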


this. python has too many ways to skin the cat.


hehe, while I do agree in this particular situation, it's rather funny to see that in writing.


> Take Python, which is supposedly a "scripting" language, but requires relatively painful amounts of boilerplate to actually read from or write to pipes, etc.

It's a shame Scheme Shell didn't catch on more. It had a syntax for pipes that simply used the pipe symbol in a way that feels a lot like just using pipes in bash. Unfortunately, the project page is down, and the Wikipedia entry doesn't have the best examples.

http://en.wikipedia.org/wiki/Scsh


What would you put forward as an alternative to awk for the kinds of complex projects the author describes?

Sam and vi are both descendants of ed, but go in very different directions. Is there a path forward from awk that focuses on different strengths than perl's choices? (perhaps avoiding perl's move to being multi-paradigm)

Is there something that sits close to unix in the way awk does, but which is stronger?


I use lua for that, and it seems to be a very good fit. There's awk influence in its history, so it's probably no accident.


Ruby does awk for the most part. I can't comment on strength, but awk has a small footprint compared to both ruby and perl, clocking in at just under 96k. Also, both emacs and ed are descendants of TECO =)

The editors have had a major impact on our languages, mainly because regular expressions (called regular sets at the time) were implemented in Thompson's ed, which was based on an earlier line editor called qed used on CTSS and Multics.

One can follow the evolution fairly well, almost like a chain of dialects.

ed -> grep -> sed -> awk

ed -> em -> ex -> vi

awk -> perl -> ruby

so

ed -> grep -> sed -> awk -> perl -> ruby

The right tool for the right job when all you know is a chainsaw everything looks like a hammer unix process are cheap and all that yada yada philosophical paradigmy finite state automata pipelines vs pointers vs classes vs recursive enumeration iterative parenthesis backtick mind expansions expression logic =P


ed is not a descendant of TECO.

http://web.mit.edu/kolya/misc/txt/editors


Is this factually incorrect? If so, please cite a source that contradicts the common knowledge that ed descended from QED, developed entirely separately from the MIT environment that led to EMACS. Which explains why EMACS couldn't be less like a good UNIX program.


Hadoop and friends.


> I don't think anyone would seriously advocate using awk for complex projects these days,

Never underestimate the power of intellectual inertia, and the lengths people will go to in order to continue using their favorite tool - no matter how poorly suited to a task it might be. ):


In one of the first video lectures of Coursera's Machine Learning course, Andrew Ng gives a pretty similar explanation of why he chose Octave as the language for the course; in his experience, students get more done in Octave than in any other language he has tried teaching with.


It has always been pretty easy for me to read from and write to pipes in python, assuming you are talking about /std(out|in|err)/.

   from sys import stdin, stdout, stderr
   for n,line in enumerate(stdin):
      stdout.write(str(n) + '\n')
      stderr.write(line)
Is that the kind of reading and writing you are talking about?


I suspect it has more to do with stringing together the inputs/outputs of multiple programs, which is outright painful in python. The syntax is terribly verbose (import subprocess, use subprocess.call, refer to a few constants in the subprocess package, wire them together, which requires a line of code rather than a character). Worse, pipes are tricky to implement correctly. It's easy to deadlock the python process by overflowing a pipe's buffer (wait for exit to read the buffer, buffer fills before subprocess exit, deadlock), so people who don't have an intimate knowledge of the underlying implementation will find themselves stuck with a bug they can't fix.

Compare this to a shell where the combination of &&, ||, and | can fit on a single line what would have taken dozens of lines of python.
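
For what it's worth, the non-deadlocking version of something like `grep foo access.log | sort | uniq -c` (pattern and filename made up) goes roughly like this:

    from subprocess import Popen, PIPE
    grep = Popen(["grep", "foo", "access.log"], stdout=PIPE)
    sort = Popen(["sort"], stdin=grep.stdout, stdout=PIPE)
    uniq = Popen(["uniq", "-c"], stdin=sort.stdout, stdout=PIPE)
    # close our copies of the intermediate pipe ends so upstream
    # processes get SIGPIPE if a downstream one exits early
    grep.stdout.close()
    sort.stdout.close()
    out, _ = uniq.communicate()
    print(out.decode())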


Ah, so using python for stuff bash is usually used for. I thought OP meant writing python scripts to be used as components of bash scripts.

For _that_ type of python, I usually use the sh module[0]. It handles piping pretty well, especially if you use StringIO to deal with the pipes. Unfortunately, return code-based conditionals are nowhere near as simple as || and && are in shell.
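
The piping ends up looking something like this, if I remember the API right (pattern and filename made up):

    from sh import grep, sort, wc
    # roughly: grep foo access.log | sort | wc -l
    # passing one command's result into another feeds it via stdin
    print(wc(sort(grep("foo", "access.log")), "-l"))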

[0] http://amoffat.github.io/sh/


See https://github.com/kennethreitz/envoy

Example:

  >>> import envoy
  >>> r = envoy.run('uptime | pbcopy')


Does anyone have any examples of AWK vs. some other programming language for AI? It would be interesting to take a look at them.


Why [language] for [purpose]?

Universal answer: because it's workable, and I'm emotionally invested by now.


And I need to get stuff done, and learning a new [language/framework/shiny object] isn't always the fastest way of getting stuff done.

It is better to be an expert in a few languages rather than a dilettante in many.

Of course, the best of all possible worlds is to be an expert in many languages, but that often requires time that gets in the way of 'getting stuff done'.


> And I need to get stuff done

That's the experienced version. I was describing the inexperienced one.


> Jon Bentley found two pearls in GAWK: its regular expressions and its associative arrays.

When I encountered AWK I was amazed by associative arrays. It was the first language I've seen where associative arrays were so accessible. Then there was PHP (I think arrays are one of the things that strongly contribute to its popularity).

Today pretty much every commonly used language has this feature. Often it seems more mimicry than actual appreciation of the data structure. For example, when other language creators bring this structure in, they tend to forget about an important feature: ordering. Python didn't have a standard ordered dictionary type for a long time, and Ruby has only kept the order of items in a hash since 1.9.
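
For the record, the Python workaround at the time was collections.OrderedDict (added in 2.7/3.1), since the plain dict made no ordering promises:

    from collections import OrderedDict
    od = OrderedDict()
    for key in ("foo", "bar", "baz"):
        od[key] = len(key)
    print(list(od))   # keys come back in insertion order: ['foo', 'bar', 'baz']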


Did some of my most enjoyable and productive work in awk and BBC Basic.

Minimise, at all costs, the resistance of translating a hypothesis from thought into a computing language: get onto the highway as fast as possible.


I know exactly what he means. Most people are surprised to learn that I study direct methods in the calculus of variations (mostly with Sobolev spaces) using bc, and then write out my results using ed.


On a vaguely related note, Darius Bacon's Lisp-in-awk has always brought a smile to my face: https://github.com/darius/awklisp/blob/master/awklisp


Whew. By volume, most of my big data work is in awk. I hope my secret remains safe.


Really strange that he spits out his last two surprising philosophical answers and then doesn't explain how the first one pertains to awk at all.

> First, AI has discovered that brute-force combinatorics, as an approach to generating intelligent behavior, does not often provide the solution ... A language that maximizes what the programmer can attempt rather than one that provides tremendous control over how to attempt it, will be the AI choice in the end.

Okay. And... awk has this quality? What can I do in awk but not in C or a lisp? In what way does programming in awk lead you toward less brute-force solutions than any other language? He doesn't support this in any way at all.


There's not much you can do in Awk that you can't do elsewhere: Turing-completeness and all that. The reverse also applies, assuming you have an appropriate set of bindings.

The thing to understand about awk is that it's basically a DSL. It's optimized specifically for crunching data contained in line-oriented text files, and within this niche, it is awesome. Since line-oriented text files are used for just about everything in Unix, awk is an especially useful tool there. But once you stray from awk's niche, things start to get awkward, and the further you go, the tougher it gets.

Perl was written to be an awk-killer, and it didn't accomplish that by being better within awk's niche. It did it by not being a DSL: it's still "good enough" for the sorts of work that awk really excels at, but it works much better for just about everything else.


"There's not much you can do in Awk that you can't do elsewhere: Turing-completeness and all that."

The fact that Awk is Turing-complete has nothing to do with the fact (if it is a fact) that there's not much you can do in it that you can't do elsewhere.

The SKI combinator calculus is Turing-complete, but you can't read a CSV file with it.


> The SKI combinator calculus is Turing-complete, but you can't read a CSV file with it.

Of course you can, it's just a matter of input handling. For that matter you can read a CSV file with any Turing machine. It's just easier elsewhere.


Well, I must be very confused, then, because I don't see where the input comes in given just the s, k and i combinators. (The Jot programming language, which can be translated into SKI, for instance, doesn't do input or output. You want the Zot variant for that.) How would one write a program, using just s, k, i, and application, that takes the name of a file on the command line, opens it and reads it in, then prints the lines in reverse order to stdout?


If you really wanted to do pure SKI, you'd have to do more than just a program: you'd have to implement the whole machine and the OS running on it. You could do that, but as you might imagine, it's quite a lot of work.

That said, keep in mind that most likely, you'd want to start implementing levels of abstraction pretty early on. The fact that SKI is Turing-equivalent means that you can implement a Turing machine (or anything else that is Turing-equivalent) in it. Build your favorite abstraction, and then implement your machine and OS the way you would using that abstraction. It's still SKI underneath, so you're golden.


Maybe so, but that completely gives the lie to the idea that two languages are equipotent if they're both Turing complete. The sorts of abstractions you'd need to implement go far beyond what's required for Turing completeness.

Unlambda, for instance, is Turing complete, and moreover, it can do I/O. An Unlambda program is nevertheless incapable of opening files or doing different things depending on its command-line arguments. You can write cat (the version of cat that just echoes stdin to stdout) in Unlambda, but not ls.

You might be able to write an Unlambda-based operating system in which all the various sorts of input events that an OS needs to respond to are represented as elements in its input stream (or, even better, an OS in Lazy-K).

But when you've got that OS up and running, Unlambda programs running on it still won't be able to open files. (Frankly I'd be surprised if the "abstractions" necessary to get something like that up and running weren't essentially an interpreter written in another language dealing with the encoding and decoding of input and output to your Unlambda/Lazy-K program, rather than abstractions written in Unlambda/Lazy-K. (Consider that the numbers that Lazy-K outputs are church encoded and must be converted by the Lazy-K interpreter into C-like integers before characters can be output to stdout.) This isn't really important, though.)

Consider also this final note from the Lazy-K page:

"Remove output entirely. You still have a Turing-complete language this way, and it is if anything more elegant. But, as with the equally elegant SMETANA, you can't do anything with it except stare at it in admiration, and the novelty of that wears off after a few minutes."

That's not really true, of course: there are other things you can do, like increase the temperature of your processor. Not many other things, though.


I'm sure you've heard this before, but all you have to do is ask whether you can build a Turing machine in whatever language. If you can do this, then you can compute answers to the same things computed by any other Turing complete language.

This says nothing about practicality, but nobody ever said it did. Of course practicality is important, but a conversation about Turing completeness is a conversation about "can compute", not "can easily compute". If you look at venues such as POPL, ESOP, and PLDI, fairly often you will find proofs for some abstract representation that is then implemented in a real world language. Thus while it would be impractical to compute something in the abstract form, it is often more elegant for proof construction, and then proof results are transferable if you can demonstrate bisimulation between the two forms. All this to say that "can compute" is nevertheless an important determination with respect to equipotency.

If you had an Unlambda OS (or VM is better perhaps), then anything running on it would be an Unlambda program, including C programs, just as anything running on an x86 machine is an x86 program.


"I'm sure you've heard this before, but all you have to do is ask whether you can build a Turing machine in whatever language. If you can do this, then you can compute answers to the same things computed by any other Turing complete language."

But you can't do the same things that you can do in other languages.

Computation is pure.


I have never encountered this distinction between activity and computation before. Assuming that you are okay with defining (sequential) computation as transforming an input sequence of 1's and 0's into an output sequence of 1's and 0's, can you give me an example of something that a computer does that is not a computation and explain why?


>This says nothing about practicality, but nobody ever said it did.

Tons of people "said it did".

It's a BS argument they use all the time. "Language X is turing complete, so you can build Y language's abstractions there too, so I don't see the need for Y".

Matter of fact, it's the very BS argument that started this sub-thread.


I'm not sure what you're referring to, but I was referring to this sentiment by Millennium, who started this thread as far as I can tell:

"But once you stray from awk's niche, things start to get awkward, and the further you go, the tougher it gets."

It is important to understand the distinction between possibility and feasibility. Or the difference between theory and practice, if you will. Even though they are opposites, both are important at the same time.

While it's not feasible for humans to move Mt. Everest, it certainly would be possible.


Thank you. That formalizes a lot of what I've been thinking, but have been unable to say, in discussions revolving around Turing-completeness.

It gets kinda frustrating when actual I/O considerations get waved away as "irrelevant" or "implementation issues".

I remember another discussion where someone said he wrote an IRC chatbot "in brainfuck". Wait, brainfuck can do internet access now??? "Well, I mean, I set up an IRC socket using a real language and hooked the brainfuck code's standard I/O into it ..."


I don't understand how your reply addresses anything kenko said. I agree with kenko and I wish people would stop bringing turing completeness/equivalence into discussions about what languages can do what. Turing completeness is inconsequential in such discussions; we're talking about what's practical and reasonable to do in each language. I wish people would stop trying to impress us with academic tangents.


The site's called Hacker News for a reason, and it caters to entrepreneurs for the same reason. We're already people with something of a penchant for not caring about the "practical" or "reasonable," or even considering that to be a challenge.

But if we're going to talk about what is practical and reasonable, then let's look back at awk's niche: prying data out of line-oriented text files. For that particular task, C and the basic lisps rank among my languages of last resort: assembler (pick an architecture) would be worse, but that's just about it. There are libraries for these languages which would help significantly, but my awk script would be finished and halfway through the dataset by the time I even got the build environment set up for these languages, and the code would still be more clear.

Like I said, the right tool for the job. For many tasks -most, actually- awk would not be very high on my list of languages to try. But find a task that hits awk's sweet spot, and nothing beats it.



