Hacker News new | past | comments | ask | show | jobs | submit login
Plan to throw one away (garethrees.org)
492 points by AndrewDucker on Dec 14, 2015 | hide | past | favorite | 79 comments



Interestingly, Tcl was parsed the same way all the way up until Tcl 8.0 introduced a bytecode system in 1997. Each line was parsed right before execution with no separate parsing stage--your code could crash in the middle of executing from a syntax error.

Tcl is now about as fast as a typical scripting language and its implementation is regarded as a model of high-quality readable code. I wonder if John Ousterhout (the creator of Tcl) knew he was building a rough first system that he would need to rewrite later, or he really thought it was the right way to do things at the time.

(To be fair, Tcl is a whole lot simpler to parse than JavaScript and Ousterhout is a good coder, so even the pre-8.0 versions were fast enough to be practical for everyday scripting tasks.)


Most shell interpreter follow this model, perhaps he started Tcl has an improved shell scripting system and only later improved it? It would explain some of the syntax choices which is fairly reminescent of c-shell and some bourne thrown in.


That's pretty much what Tcl is, even to this day. Ousterhout saw scripting languages as extremely high level tools used to issue commands to the low level code that does the heavy lifting. It's very much the same philosophy as shell scripts, except you implement new commands as extensions to Tcl instead of command line programs, and it all runs in the same process.

Of course, the language has matured and now it's also usable for building rich and complex apps top to bottom, just like any modern scripting language.


> Interestingly, Tcl was parsed the same way all the way up until Tcl 8.0 introduced a bytecode system in 1997. Each line was parsed right before execution with no separate parsing stage--your code could crash in the middle of executing from a syntax error.

Fun fact: most modern JavaScript engines in browsers work this way today, though at a different level of granularity. Parsing slows down application startup, so many JS engines don't parse a function body until it's first called. This means you can have a syntax error in a function body and won't know if it's never used.


That's pretty neat. Clearly some level of parsing needs to happen before run time, or else it couldn't even balance braces to know where the function body ends. So it must be that it parses the function body just enough to figure out where it starts and ends, then does a complete parse at runtime (probably building off the results from the first stage).


> or else it couldn't even balance braces to know where the function body ends

That's literally all it does. It tokenizes and counts braces.


I used to work with John and founded a company with him.

Whatever he believed he would have convinced you he was right :-)


Cool read, however, I feel the need to point out that 20 years ago, javascript in the server wasn't "too soon", it already existed.

In 1995 Netscape had the Netscape Enterprise Server[0] that ran javascript for server-side scripting. Actually, the two books I used, back in the day, to learn Javascript was the client- and the server-side javascript guides published by Netscape.

[0] https://en.wikipedia.org/wiki/Netscape_Enterprise_Server


> javascript in the server wasn't "too soon", it already existed

Those two things aren't necessarily in conflict. Netscape's server wasn't exactly a roaring success. What really made Javascript-in-the-server work was a Javascript JIT engine (v8) that actually made Javascript very competitive compared to traditional server side scripting languages. So given that there did not exist a Javascript JIT engine back then, maybe it was too soon.


SpiderMonkey was around way before v8, and once it begat TraceMonkey, it had JIT. Bloomberg used this pervasively on their billion-dollar product for years before v8 was a thing--you can read that from one horse's mouth here: https://news.ycombinator.com/item?id=3984617


That's quite interesting. I only mentioned v8 for the fairly obvious reason that it formed the core of the first really popular server side JS implementation, which I don't really need to name.


Agree that was probably too soon, even if it existed in some form. I've often wondered why precisely people love JavaScript on the server. I would guess it's a mix of 1) event machine parallelism, 2) isomorphism, 3) speed (since V8/SpiderMonkey/Chakra), 4) cross-compatibility, and 5) engineers not needing to learn another language.


I've always assumed it's heavily weighted towards 5 - never underestimate the ability of a coder to avoid effort!


Also, the npm ecosystem.

The early js-on-the-server systems had miserable ecosystems.


Not long after (now) classic ASP was introduced with JScript as an option. I wrote a lot of it back then, and used JS about as much as VBScript, because I could reuse a lot of my libraries on both sides (validation in particular).


Am I the only one who remembers LiveWire? It didn't last long, but it was Javascript on the server.


This is precisely not what Brooks meant by 'plan to throw one away'. Brooks point is, that you gather information while writing your prototype and incorporate the new-found knowledge into your second, hopefully sound, design.

Rees never learned anything new from the first system (what could he learn anyway, what he described sounds like a nightmare), but proceeded to implement it from scratch.

Funny thing, I'm reading 'The Mythical Man-Month' right now :D


Yes, you've spotted the irony in the title :)


Awesome read btw :>


Also, Brooks later said that if you can iterate your design over time, that's even better than "build one to throw away". This has been the focus of his more recent work like "The Design of Design".


I would like to know how the CTO responded to the idea of throwing away the code he'd been slaving on at the very last minute, and replacing it with something that someone cobbled together in two weeks (but worked).


From the sound of the description the original CTO authored code was cobbled together and its replacement was properly engineered (for some value of properly).


And I'm sure if you said that to the CTO they wouldn't take offence and would react in a rational, measured way. Actually, I've worked with managers that would react OK to that, and ones that definitely wouldn't, but I wouldn't tell any of them their work wasn't "properly engineered". Stick to the objective facts: this version runs much faster (no need to even mention the 'several orders of magnitude' factor unless they really want to know).


Of course, but from the time invested, it would seem the other way around. It's often painful to accept that you wasted a lot of time and that it's better to start over (which is of course the entire point of the article), and even more so if the new hire has already done that and thrown the CTO's code away.

Ultimately it's a question of ego. Does the CTO welcome the new hire who knows more about this than he does, or does he insist he knows better? Does he feel threatened or slighted?


This is the "First class people hire first class people, second class people hire third class people" paradigm.


How would you respond if a new hire just told you he decided to do a complete rewrite 2 weeks before demo instead of fixing (what they considered to be) a few knows issues?

I best 9 out of 10 - this would of ended up badly.


I think it would have been fair to reject the idea given the timeline. Obviously some egos will allow it and some won't, but the short amount of time could skew those natural inclinations.


I bet he welcomed one less responsibility.


I find the dates a bit suspect (referenced ECMA script standard was June 1997, book was late 1996, so how did this all happen twenty years ago, just months following the official release of JavaScript in May 95?), but guess that's not the point. Maybe I'm alone, but implausible details distract me from an otherwise great story that can teach a lesson many should learn.


For "twenty years", read "once upon a time". I feel that a bit of vagueness in dates is appropriate for a story like this. I have no contemporary written documentation, so it's based on my fallible memory, and I don't want to give a false impression of accuracy.


Yep, that's more or less what I did ("back in the mid to late 90s") to continue reading and not second guess it to much. Great story otherwise, thanks for sharing it!


I do take your point, and so I've rewritten the opening to be even vaguer.


He says 18 years at one point in the article. Since it was almost certainly in England or Wales it shouldn't be too difficult to identify the company involved. Zeus, maybe?


Are Lex and Yacc still the state of the art 20 years later?


The freeware versions have improved a bit.

For exmaple, Both Bison and Berkeley Yacc (the new one maintained by T. E. Dickey) support reentrant parsing, which works hand-in-glove with likewise support in GNU Flex.

A parser that whacks around global variables (like parser generated by classic Yacc) is going to be a nonstarter in any modern language which gives programs run-time and compile-time access to the parser (code can execute while parsing and call yyparse). Not to mention if there are threads.

Bison has support for "push parsers" which are state machines called for each token, rather than functions that retain control until they parse an entire unit:

http://www.gnu.org/software/bison/manual/html_node/Push-Decl...


> freeware

Did you mean "free software"?


I'm not sure how they've aged (I've only really used ANTLR) but I think this one of those situations where 95% of the importance is using the right kind of tool for the job, even if the tool itself is a little rusty.


The only thing that lets old tools get rusty is a lack of maintenance or improper storage. Software does not need to suffer from either and plenty of 'old' tools are still in maintenance. Kernels, compilers, file-systems and network stacks originally written decades ago are still in common use today. Flex and Bison are still being maintained and still have major releases. Whether they are still 'state of the art' is another question entirely but they're definitely being used and they are more than ok for quite a few jobs.


> The only thing that lets old tools get rusty is a lack of maintenance or improper storage. Software does not need to suffer from either

1. Yes, you can ruin every metaphor if you over-analyze it enough.

2. No, software has its own "rust" scenarios, like not staying compatible with other ecosystem elements, or simply being different than the current best-practice.


Or maybe the job of the tool has become more complex in it's nature.


I don't think they were state of the art even back then, but they were widely available, well documented, and produced fast, robust lexers and parsers. They are still just the ticket if you need to quickly get a programming language implementation off the ground.

There are some things that they don't do so well. It takes hard work to get good error messages out of Yacc, and anything that you might prefer to solve using feedback between the parser and the lexer (such as JavaScript's use of newline to terminate a statement, but only if it makes syntactic sense) is awkward to do because of Lex's lookahead — the token you want to suppress has already been produced by the time you know whether you want to suppress it.

But its important not to get stuck worrying about minor issues like these when the critical task is to make something that works. You can always plan to throw away the Yacc-built parser and replace it with something better when you have time.


Yes, yes they are. I've used them successfully for various parsing projects. And in fact Bison recently picked up GLR parsing to deal with multiple options (https://www.gnu.org/software/bison/manual/bison.html#General...).

But for something even more fun, I suggest you check out Marpa (https://jeffreykegler.github.io/Marpa-web-site/). It is, by far, the nicest parsing tool I have ever worked with. Being able to parse ANY BNF is huge. And the state tables for debugging! Knowing what possibilities you have makes things way easier. Especially for getting stuck in an ambiguous spot of your grammer. Finally, using ruby slippers to get out of a tight spot is hugely useful.

My one complaint is that the only implementation is in Perl. The core engine is written in C though, so someday I would like to convert the Perl into Python. I started several months ago, and haven't been able to get back to it. (https://github.com/srathbun/pyMarpa)


Oh, and I found something new and fun for Marpa just now! It seems there's already the beginnings of a Python port (https://github.com/rns/libmarpa-bindings/blob/master/python/...) available.


Be prepared to just use the generated tables and forget the generated code:

The generated code is not so great if you have an event driven environment: I mean I want to push 50 bytes into the compiler, instead of having it request the next input character. This means it's not so great for a repl, unless some other pre-parser gives complete translation units to lex/yacc.

I think Yacc has a pretty primitive conflict resolution strategy: (pick shift unless otherwise indicated by precedence declarations). If code could resolve the conflict then you could handle a lot of newer syntax: for example, in "a-b" vs. "a -b", you could use the whitespace to distinguish between prefix (reduce) or infix (shift).

It would be nice (I know someone who has done this..) if there was an option to degenerate into using backtracking to resolve conflicts. It means try all possibilities (split the stack at each conflict) and keep only those results which produce an error free full parse. If there is more than one full result, only then do you have a real conflict.

I'm not sure if other parser generators are any better, but there are improvements which could be made.


In terms of ease of use, Parser Combinators beat most Parser Generators. (For absolute speed, Generators might still beat Combinators.)


Yeah, I should try them. I remember a previous discussion about the old language SNOBOL having parser combinators as a built-in feature: https://news.ycombinator.com/item?id=10103460


I learned Parser Combinators on Haskell. But a basic version is easy to write yourself in any language that supports first class functions.


> The generated code is not so great if you have an event driven environment: I mean I want to push 50 bytes into the compiler, instead of having it request the next input character. This means it's not so great for a repl, unless some other pre-parser gives complete translation units to lex/yacc.

This actually works quite well in [Happy](https://www.haskell.org/happy/), the Yacc-alike for Haskell, so it is not a fundamental limitation of Yacc-style tools. In Happy, it works by generating monadic code and using a monadic action for token reading, which you can then make event-driven.


They still work perfectly fine, though, ANTLR is pretty great. Having learned both, I would hesitate to switch back from ANTLR.


If you're in the market for a parser generator, you may want to look into the lemon parser distributed with sqlite3. I found it much nicer to play with.


Parser combinators are a better alternative. You end up writing a bit more of the data structure yourself, but the result is code that fundamentally makes sense.


Parsing JavaScript efficiently is not the problem. Much more difficult is writing the garbage collector. Especially a concurrent one.


> Much more difficult is writing the garbage collector. Especially a concurrent one.

Since JS is single-threaded, you can write a simple stop-the-world mark-sweep collector as a first iteration. (Simple reference counting probably isn't a good idea since cycles are common in JS.) Then you can move to incremental and/or generational if that's giving you annoying hitches. Only after that do you need to put concurrent on the table.

I believe even world class JS engines have only relatively recently started doing GC in another thread.


I don't know about "state of the art", but they are tried and true for lexing and parsing respectively.

If you are looking for something more exotic or a very specific use-case, I'd say use something like Ragel[0].

[0]http://www.colm.net/open-source/ragel/


Ragel is great for replacing regex validation code ("Does this string look like a IP address, email address, or date?"), or for writing a lexer, but if you need to actually parse something (i.e. extract even a moderate amount of structured data) then you need something else.

It has a particularly sublime output mode for generating Graphviz dot files however.


"Plan to throw one away, you will anyhow." -- Fred Brooks

"If you plan to throw one away, you will throw away two." -- Craig Zerouni


This makes me want to write a JS interpreter... Sweet read!


A simple ES3 interpreter is pretty easy, particularly if you just go for 'the good parts' (plus basic global objects).

I can highly recommend mpc[1] if you do, it's a really amazing library for parser combinators in C. I was never much of a fan of yacc/lex (perhaps out of my ignorance of BNF). With yacc/lex I always felt like I had to have my grammar all planned out (which if you implement JS I suppose isn't a problem) before I could start developing and it was very hard to work incrementally. mpc is very flexible and since it's just a C library I find it a lot easier to iterate with.

1: https://github.com/orangeduck/mpc


Here you have one written in OMeta: http://www.tinlizzie.org/ometa-js/#JavaScript_Compiler


> The failure to store a represention of the parsed code was disastrous because it meant that in order to evaluate an expression or statement for a second time, the engine had to parse it a second time.

Ha! They must have learned how to write an interpreter from Herbert Schildt:

http://www.drdobbs.com/cpp/building-your-own-c-interpreter/1... [1989]

In this piece of amusement, the Little C program is a giant null-terminated string, and the "instruction pointer" is of type char * . You get the picture.

> When I did come to write a garbage collector I used the mark-and-sweep algorithm. But something puzzled me, and I couldn’t find an answer in any of the textbooks I looked at: how was I supposed to schedule the collections? In a classic description of a garbage collector, you wait until memory runs out and then you collect the world. But this leads to bad behaviour on modern systems, because of swapping, and because other processes need memory too. You need to schedule collections well before memory is full. But when exactly? I still don’t know of a comprehensive solution to this problem.

In a nutshell, you let your run-time pretend that it's running in a small machine, until it is too close to huffing and puffing too hard and then you say "hey I lied, you're actually in a bigger machine: have some breathing room". This rubbery constraint keeps it behaving reasonably nicely, rather than "Wee, I have 4GB of free RAM to stomp over with strings and cons cells before ever calling GC!"

What you have to do is pick some heap size (that is typically substantially smaller than the amount of RAM). You let the GC whack against this artificial threshold, and if that gets too excessive, according to some measure, you increase it. E.g. if after a full GC you have less than some fudge threshold free, call the OS for more memory to integrate into the object heap.

The threshold is calculated in some way that the image doesn't have to execute numerous frequent GC's before it triggers the request for more space (it doesn't have to whack too hard and wastefully against the artificial limit).

Also, ephemeral GC will help, and ephemeral GC can have its own threshold against frequent ephemeral GC's. When not enough space is liberated by ephemeral, you schedule a full. Then if that doesn't liberate enough space, add more. Since ephemeral is fast (doesn't scan the full heap), you can work a bit closer to the heap limit (since you can suffer frequent ephemeral GC's better than frequent full GC's).

And, of course, the parameters controlling these behaviors are exposed in some way so users can tweak them. Command line arguments, env vars, local config file, system config file, run time global variable/API, ...


Didn't PHP also used to use the position of a fileseek as the instruction pointer?


Shell script interpreters like bash do. You can change a bash script while it's running (append, or truncate and re-write) and it'll keep going from the same offset.

But then again, shell script is in a different class of language I'd say.

  #!/bin/bash
  COUNTER=0
  addmore() {
      echo hi $COUNTER
      ((COUNTER++))
      echo "addmore" >>$0
      sleep 1
  }
  addmore


Bash reads code from a stream, and doesn't read the entire stream before executing the code.

However, the stream pointer isn't its instruction pointer in the script. A backwards branch (such as the end of a while loop, or a call to a function defined earlier) does not rewind the stream to that piece of text.

> But then again, shell script is in a different class of language I'd say.

Not much different in this regard from how, say, Common Lisp (load "file.lisp") processes the top-level forms in the file.

  (defvar *counter* 0)

  (defun addmore ()
    (format t "hi ~s~%" (incf *counter*))
    (with-open-file (f "addmore.lisp" :direction :output :if-exists :append)
      (write-line "(addmore)" f)))

  $ clisp addmore.lisp 
  hi 1
  WARNING: OPEN: #<INPUT BUFFERED FILE-STREAM CHARACTER #P"addmore.lisp" @8>
           already points to file "/home/kaz/test/addmore.lisp", opening the
           file again for :OUTPUT may produce unexpected results
           Open the file anyway
  hi 2
  WARNING: OPEN: #<INPUT BUFFERED FILE-STREAM CHARACTER #P"addmore.lisp" @9>
           already points to file "/home/kaz/test/addmore.lisp", opening the
           file again for :OUTPUT may produce unexpected results
           Open the file anyway
  hi 3
  WARNING: OPEN: #<INPUT BUFFERED FILE-STREAM CHARACTER #P"addmore.lisp" @10>
           already points to file "/home/kaz/test/addmore.lisp", opening the
           file again for :OUTPUT may produce unexpected results
           Open the file anyway
  hi 4
  WARNING: OPEN: #<INPUT BUFFERED FILE-STREAM CHARACTER #P"addmore.lisp" @11>
           already points to file "/home/kaz/test/addmore.lisp", opening the
           file again for :OUTPUT may produce unexpected results
           Open the file anyway
  hi 5


As an aside, but I think relevant to Hacker News, the author of this blog post also wrote the classic mid-1990s text adventure game Christminster http://garethrees.org/1995/08/08/christminster/


>One thing that I had not had time to do was to write a garbage collector.

Cheat: use Hans Bhoem's conservative collector for C/C++: http://www.hboehm.info/gc/


Of course the opposite problem also exists. You plan to throw away that quick, ugly hack you threw together for a demo as soon as the demo is done, and somehow it sticks around in production for years to come.


This adage is fairly old, aren't today's programming and management techniques rendering it obsolete?


If this guy were 'properly' managed he would never be allowed to overhaul this project like he did. Not many managers will be impressed by a "I will build a full compiler that is going to be better than what the CTO built in the next two weeks immediately after which is the deadline" speech, as the author also admits.

If the company was properly managed they also wouldn't have been in a situation where there's only two weeks left and a critical component is in a not well understood state.

In practice though these kinds of things constantly happen in even well managed companies. If the author weren't around they'd probably just have shipped a slow Javascript interpreter that couldn't do prototypes very well.

Just curious, what kind of programming technique renders the idea of this article obsolete?


When they are actually used, and used correctly.

In other words, no.


As a corollary:

https://en.wikipedia.org/wiki/Ship_of_Theseus

You can replace all the parts of your crappy v1.0 code and still call it the same code base. That's what the smart ones go around doing quietly.


While in most cases it's probably a better option, I think that in this case there wasn't time to do that. I would probably explore the option to still use the parser, and have it generate an AST first, and from there go pretty much in the direction of the author.


Part of the problem is being given an assignment two weeks before a ship date. I know people do stuff like that all the time but that's hardly an excuse (we do terrible things to each other all the time so it's okay?)

The author is already screwed before he even has a chance to start.


The same issue comes up with classic cars. Is a car which has seen many/most of its parts replaced (including chassis members, body panels,...), still the same car?


With cars it gets interesting because someone will repair the old parts, and build a new car out of them. Then you have 20 of the original 8 cars made.


Wow, this seems quite idiotic to me. Sure, recursive descent parsers have some theoretically inferior properties to other techniques, but in reality they're much better because you can actually handle errors in a user-friendly way. That's why basically every bigger compiler uses handwritten recursive descent parsers.


There's a really gorgeous technique for implementing good error messages with LR parsers: http://lambda-the-ultimate.org/node/4781#comment-76724

(I haven't used the above personally, but the Golang parser does.)

Personally I have used PEG parsers and I was able to provide great error messages, by using the "cut" operator in the appropriate places in the grammar (see comments earlier in the thread I linked above).


> this seems quite idiotic to me

Please don't be uncharitably dismissive. The author didn't throw away the parser because it was recursive descent—in fact he explicitly excludes that as a reason.

(The rest of your comment is fine.)


Sorry I was too harsh, but to me it reads much like: I personally don't like this code, so I threw away everything.


It's true that that's a mega motivation lurking in the soul of every programmer. But it's also often true that writing a program can be faster than revising one (or even reading it) when the original programmer isn't available.

To me his justification sounded pretty convincing, though not to you, which is totally fair. As far as HN commenting goes it's just a matter of explaining your view in a substantive way.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: