There is a very cool paper about structuring query compilers that I read a couple of years ago and came back to recently while doing some experimentation.
The paper[0], "How to Architect a Query Compiler, Revisited", presents an approach to structuring compilers based on Futamura Projections. In practice this means that you write most of your high-level code as though it was an interpreter, but then mainly rely on constructors and operator overloading (and occasionally special functions) to emit compiled code for your expressions and nodes under the hood.
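Roughly, the trick looks like the following sketch (my own illustration in Python, not the paper's actual API): a wrapper type carries a symbolic code fragment, and overloaded operators emit code instead of computing.

    # Hypothetical sketch of the staging idea: a Rep value is a symbolic
    # code fragment, and operator overloading emits code rather than computing.
    class Rep:
        def __init__(self, code):
            self.code = code  # a C-like expression as a string

        def __add__(self, other):
            return Rep(f"({self.code} + {other.code})")

        def __mul__(self, other):
            return Rep(f"({self.code} * {other.code})")

    # Written like ordinary interpreter code, but feeding it Rep values
    # generates code for the expression instead of evaluating it.
    def square_plus_one(x):
        return x * x + Rep("1")

    print(square_plus_one(Rep("row[0]")).code)  # ((row[0] * row[0]) + 1)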
The paper should be interesting to anybody who is also interested in the posted link.
"based on Futamura Projections" is kind of cheating: they don't have a `mix` specialization function, which is the entire point of Futamura's paper. They're doing automatic symbolic execution using operator overloading, but I'm looking extremely suspiciously at them referring to that as "the first Futamura Projection without mix".
Working through _Crafting Interpreters_ atm. It's both an educational and enjoyable read. I find the author's ability to reason across programming languages a skill to aspire to. It feels like a meta way of thinking about programming languages: any language is an instantiation of a set of parameters, fixed through design decisions.
> Second, we get to write an actual, honest-to-God compiler. It parses source code and outputs a low-level series of binary instructions.
Your compiler doesn't need to emit byte code or machine code to be real or useful. High-level-to-high-level compilers can be incredibly useful and relatively easy to write. For optimal runtime performance, you can target C or some other low-level language and let it deal with register allocation, instruction selection, and shared-library/executable generation.
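As a toy illustration of that approach (my own sketch, not from the book): walk a tiny expression AST and emit a C expression string, and let a C compiler handle everything below that.

    # Toy sketch: compile a tiny expression AST, tuples of (op, lhs, rhs),
    # into a C expression string; the C compiler does the low-level work.
    def to_c(node):
        if isinstance(node, tuple):
            op, lhs, rhs = node
            return f"({to_c(lhs)} {op} {to_c(rhs)})"
        return str(node)  # leaf: a variable name or literal

    print(to_c(("-", ("-", "A", "B"), "C")))  # ((A - B) - C)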
A transpiler is a sort of compiler. A single compiler might emit LLVM, C, or JavaScript, depending on the target. That doesn't mean that it's only a compiler sometimes, it means that some of the modes of the compiler transpile.
Recursive descent is great for parsing most of an Algol-like programming language.
Once you get into expressions, you'll really want to do precedence climbing or Pratt parsing to get things working correctly.
This assumes that you want PEMDAS-like operator precedence in the first place. It does seem like a waste once you've implemented it and realize how little it's actually utilized by most code. Maybe the Lisp and forth people have a point...
Even with recursive descent, you still need a strategy to deal with left-associative binary operators. This is where Pratt parsing comes in: not to replace recursive descent, but to supplement it.
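For reference, here's a minimal precedence-climbing loop (hypothetical token handling; operands are assumed to be single atoms):

    # Minimal precedence-climbing sketch; tokens is a flat list like
    # ["A", "-", "B", "-", "C"], consumed from the front.
    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse_expr(tokens, min_prec=1):
        lhs = tokens.pop(0)  # operands assumed to be single atoms
        while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            # PREC[op] + 1 makes same-precedence operators left-associative
            rhs = parse_expr(tokens, PREC[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs

    print(parse_expr(["A", "-", "B", "-", "C"]))  # ('-', ('-', 'A', 'B'), 'C')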
Yep. And whenever I've implemented a recursive-descent parser and dealt with left-associative operators, I've just inlined the equivalent of additive-expression2 and the loop. E.g.:
def parseAddExpr():
    # Looping (rather than recursing) is what makes the tree left-associative.
    # peekToken/consumeToken/BinaryExpr are whatever helpers your parser has.
    expr = parseMultExpr()
    while peekToken() in ["+", "-"]:
        op = consumeToken()
        expr = BinaryExpr(op, expr, parseMultExpr())
    return expr
Honestly, I really like recursive-descent parsing, even for expressions. It's very straightforward once you've learned a few simple patterns like that.
Decently large precedence hierarchies (15-20) don't bother me; I'll typically just number the levels rather than use names like "term", "factor", "additive", or "multiplicative". It can get a little verbose, yes, but it's easy to write and skim through. (Golang-like from what I've read, though I'm not a user of Golang.)
But I also find it scales better once you start adding things like ternary expressions, function calls, and type casts -- things that aren't just simple binary operators. Handling those feels a lot less clean to me with Pratt, precedence climbing, shunting yard, etc.
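For example, a ternary slots straight into its own precedence level (hypothetical sketch, reusing parseAddExpr from above; expect and TernaryExpr are assumed helpers):

    # Hypothetical sketch: the conditional operator gets its own level
    # above the binary operators; no extra Pratt machinery needed.
    def parseTernaryExpr():
        cond = parseAddExpr()
        if peekToken() == "?":
            consumeToken()
            thenExpr = parseTernaryExpr()  # `?:` is right-associative
            expect(":")                    # assumed helper: consume or error
            elseExpr = parseTernaryExpr()
            return TernaryExpr(cond, thenExpr, elseExpr)
        return cond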
One downside I'll admit for recursive descent, though, is that it's a bit harder to support a language with user-defined operators. I don't like those much anyway, so that's not much of a loss to me.
In the second grammar, additive-expression2 still contains a recursive call in the right-hand term, so it would produce a parse tree that's right-associative. For example, given the string `A - B - C` you'd get:
(additive-expression
  (multiplicative-expression A)
  (additive-expression2
    -
    (multiplicative-expression B)
    (additive-expression2
      -
      (multiplicative-expression C)
      <null>
    )
  )
)
Which corresponds to the right-associative expression `(A - (B - C))`. You could do some post-processing on the results to invert it and make it left-associative, but I don't think the two grammars are equivalent.
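The post-processing fix is doable, for what it's worth: flatten the right-leaning spine of same-precedence operators, then fold it back up to the left. A sketch using tuples as AST nodes (it assumes operands are atoms and every node on the right spine is the same precedence level):

    # Sketch: rebuild a right-leaning chain like ('-', 'A', ('-', 'B', 'C'))
    # into the left-associative ('-', ('-', 'A', 'B'), 'C').
    def reassoc_left(node):
        ops, operands = [], []
        while isinstance(node, tuple):  # walk down the right spine
            op, lhs, rhs = node
            ops.append(op)
            operands.append(lhs)
            node = rhs
        operands.append(node)
        expr = operands[0]
        for op, rhs in zip(ops, operands[1:]):  # fold back up, leaning left
            expr = (op, expr, rhs)
        return expr

    print(reassoc_left(("-", "A", ("-", "B", "C"))))  # ('-', ('-', 'A', 'B'), 'C')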
I get what you're saying: the grammars are equivalent in the sense that they match the same language.
Maybe "equivalent" was the wrong word to use on my part. My point is that this refactoring doesn't actually give us an implementation that solves the issue of left-associativity. As you said, you still need to introduce some sort of loop to do this. It doesn't have to be complex, but it is different from the right-associative case where expressions are built up directly through recursion. My argument is that once you introduce explicit iteration, you're no longer really doing recursive descent, you're doing Pratt or shunting-yard or something else.
[0]: https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1...