Learning Parser Combinators with Rust (bodil.lol)
289 points by chwolfe on April 19, 2019 | 72 comments



>> The novice programmer will ask, "what is a parser?"

>> The intermediate programmer will say, "that's easy, I'll write a regular expression."

>> The master programmer will say, "stand back, I know lex and yacc."

The Prolog programmer will write a Definite Clause Grammar [1], which is both a grammar and a parser, two-in-one. So you only have to do the easy and fun bit of writing a parser, which is defining the grammar.

Leaves plenty of time to get online and brag about the awesome power of Prolog or get embroiled in flamewars with functional programming folks [2].

______________

[1] https://en.wikipedia.org/wiki/Definite_clause_grammar

[2] Actually, DCGs are kiiind of like parser combinators. Ish. In the sense that they're executable grammars. But in Prolog you can run your programs backwards so your DCG is both a recogniser and a generator.


This is beside your point, but I've found PEG to be a nice step up in usability from lex&yacc, as again, you only have to write one grammar definition.

Wiki: https://en.wikipedia.org/wiki/Parsing_expression_grammar Example Implementation: https://pegjs.org/


Yep, PEGs are seriously underutilized. Here's a nice introduction using LPeg: http://leafo.net/guides/parsing-expression-grammars.html



Regarding your [2].

It is easy to combine parser and generator into one thing: http://hackage.haskell.org/package/reform (and then you can get to http://happstack.com/docs/crashcourse/Reform.html#reform explaining how and how to).


In Prolog it's not "easy"- it's implicit. You don't need to do anything special, you don't need to combine different programming constructs, you don't need a library and you don't even have to write your code in a special way to make your grammars work as parsers.

The trick is that DCGs are Prolog programs, which is to say, they are logic theories that the Prolog interpreter proves true or false. Parsing a string then is a case of proving that the grammar is a theory that entails the string in its input (with some abuse of terminology!).

And because a Prolog predicate (including each rule in a definite clause grammar) is a relation rather than a function, and so has no fixed inputs and outputs, you can leave the "input" unbound, to generate all strings entailed by the grammar.

Yes? You write a grammar. You get your recognisers for nothing and your generators for free.


Oh, you need to do more than "nothing special" in Prolog to get anything that is mildly useful. The very notion of constraint programming was born in the Prolog community to handle... well, to handle constraints better than regular Prolog handles them.

Let me bring you another example.

In case of parser combinators over lists you can have a greedy alternative:

    newtype P a = P (String -> [(String, a)]) -- a parser is a function from input to remaining input and result.
    P a <|> P b = P $ \s -> case a s of
        [] -> b s
        xs -> xs
The regular alternative is this:

    P a <|> P b = P $ \s -> a s ++ b s
You can use combinators like that:

    many, some :: P a -> P [a]
    -- possibly empty parse - returns empty list for that
    many item = some item <|> pure []
    -- parse head item and tail list of items and combine into a list
    some item = (:) <$> item <*> many item
This greedy thing above helps greatly when you are dealing with mostly regular grammars (which are plentiful in the wild) - it does not hold onto input for too long, it is faster in the no-parse situation, etc.

You put a change in a single combinator function and get a vastly improved result.

In case of Prolog handling grammars you would have to put cuts all over your grammar definition, risking either loss of specification equivalence or performance.

I have more practical questions. For example, how would you design online parsers - ones that consume item by item and produce intermediate results on the go? How would you design a tree of parsers for editor to speed up reparsing after change?

These kinky Applicative and Alternative combinators let you not worry about details when specifying the grammar. You specify it with class (pun intended - abstract getting an item into a type class and you are done) and layer it above an implementation which can be whole-input, streaming, with partial parses, deterministic, etc.


>> Oh, you need to do more than "nothing special" in Prolog to get anything that is mildly useful.

This is really not my experience. I've been coding in Prolog for 10 years and done a bunch of useful stuff, if I say so myself (my latest baby: https://github.com/stassa/metasplain). Constraint (logic) programming is cool and useful, but it's not necessary. It's just a particular logic programming paradigm. Actually, although I do find it very nice and pure, in a declarative sense, I don't think I could code all my stuff in CLP.

I can't say I understand your parser combination notation. The syntactic sugar of DCGs means that if you can read BNF, you can read a DCG. Simple syntax that doesn't get in your way when you're trying to do complicated things is a hallmark of Prolog in general.

Cuts are also not necessary- DCGs are one kind of Prolog program where I almost never use cuts. One reason for that is that you usually want your grammars to be nondeterministic, so that you can use them as generators. DCGs basically just process lists, so it's not usually that hard to ensure a grammar doesn't overgenerate. You really don't have much use for cuts with DCGs, unless you use them to write arbitrary Prolog programs, rather than parsers (which you can do, since DCGs are syntactic sugar for ordinary Prolog- but I generally don't).

The online grammar outputting intermediary results you're describing, I think you're talking about something like this (not tested):

  s --> a, b.
  a --> [a], {writeln('Found an a')}.
  b --> [b], {writeln('Found a b')}.
The curly braces, {}, embed arbitrary Prolog code in the body of a rule; that code is executed as soon as it's reached in the body of the grammar (which, btw, makes DCGs Turing-complete, I guess). You can go a long way with it.

The "parser tree" you describe would probably need some coding around the grammars, i.e. you'd have to write special code to combine a few parsers. Or maybe not, I'm not sure because I haven't tried to do what you say. What do you use something like that for?


Please share any examples you know of on how to do this!


The wikipedia article has a few examples. For instance:

  sentence --> noun_phrase, verb_phrase.
  noun_phrase --> det, noun.
  verb_phrase --> verb, noun_phrase.
  det --> [the].
  det --> [a].
  noun --> [cat].
  noun --> [bat].
  verb --> [eats].
And here are some example queries:

  % Running as a recogniser
  ?- sentence([the,cat,eats,a,cat], []).
  true.
  
  % With a string not in the grammar
  ?- sentence([the,cat,eats,'YeGoblynQuenne'], []).
  false.

  % Running as a generator
  ?- sentence(P, []).
  P = [the, cat, eats, the, cat] ;
  P = [the, cat, eats, the, bat] ;
  P = [the, cat, eats, a, cat] ;
  P = [the, cat, eats, a, bat] ;
  P = [the, bat, eats, the, cat] ;
  P = [the, bat, eats, the, bat] ;
  P = [the, bat, eats, a, cat] ;
  % ... 
  % More results generated on backtracking
To prevent any confusion - the "A --> B" syntax is syntactic sugar. Such statements are expanded to ordinary Prolog terms by the Prolog equivalent of a macro (an expansion hook). You can inspect the ordinary Prolog definitions saved in the program database:

  ?- listing(sentence).
  sentence(A, B) :-
    noun_phrase(A, C),
    verb_phrase(C, B).
Indeed, it's perfectly possible to write these by hand - as you can see above the ordinary Prolog definition is not that complicated, the macro expanding the DCG notation adds two variables to each nonterminal and relieves you of the need to add this boilerplate yourself.


On the reddit discussion of this [0], someone mentioned using a type of

fn parse(&self, input: &mut &str) -> Option<Output>

instead of the article's

fn parse(&self, input: &'a str) -> Result<(&'a str, Output), &'a str>

for composability. I found the article fascinating and plan on going back to see what an xml parsing implementation based on the former would act like.

[0]: https://www.reddit.com/r/rust/comments/bepi63/learning_parse...
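
For anyone curious what that looks like in practice, here's a rough free-standing sketch of the `&mut &str` style (names are mine, not from the article or the reddit thread): the parser advances the slice in place and returns only the parsed value.

    // Hypothetical example of the &mut &str approach: on success the parser
    // advances the input slice past what it consumed; on failure it leaves it alone.
    fn literal(input: &mut &str, expected: &str) -> Option<()> {
        let current: &str = *input;              // &str is Copy: grab the (ptr, len) pair
        if current.starts_with(expected) {
            *input = &current[expected.len()..]; // re-point the slice past the match
            Some(())
        } else {
            None
        }
    }

    fn main() {
        let mut input = "Hello, world!";
        assert_eq!(literal(&mut input, "Hello"), Some(()));
        assert_eq!(input, ", world!");           // input now starts after "Hello"
        assert_eq!(literal(&mut input, "Hello"), None); // no match, input untouched
    }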


This might be the first time I’ve seen a good use for &mut &T, very cool!

For those of you not well-versed in Rust, this is a mutable reference to an immutable string slice. This means that you can change which part of the string the slice points at, but you can't change the underlying string data itself.


What does that mean? Is it equivalent to the pointer starting out (say) pointing to the first letter of the string, but being able to "walk"/iterate along the length of the string?


EDIT: adjusted the diagram to make it more clear that we're not mutating the &str

Here's some string data, in memory somewhere:

    Hello world   
A `&str` is a pointer, length combo. The pointer points to the start of the string, the length says how many bytes there are:

    (p, 11)           
     |                
      \ 
       |
       V              
       Hello world
      
 A `&mut T` is a pointer, so a `&mut &str` looks like this:

     p
     |
     V
    (p, 11)           
     |                
      \ 
       |
       V              
       Hello world
 
 Since it's mutable, this means we can re-assign what it points to, and create a new `&str` to the later part of the string:
 
     p
     \------\
             V
    (p, 11) (p, 5)           
     |       |                
      \      |
       |     |
       V     V         
       Hello world
 
 Since the inner &str is immutable, we can't change the underlying string.
 
 Hope that helps!


Re: your edit, isn't mutating the `&str` exactly what we're doing? Your new diagram implies that `input` itself is mutable and that's what's being overwritten, but that's not what `input: &mut &str` means nor would it even work because the mutation wouldn't be visible to the caller...?


.... uuuuugh dammit you are correct. Thanks. I can't mutate the parent post, sadly, so have an upvote and hopefully you sit at the top.

I did draw the right thing at first, then convinced myself I was wrong and "fixed" it. Tricky!


I want to make sure I understand this. The original pointer &str does not own the "Hello world" variable but is a reference to it. The new &mut T pointer takes ownership of the &str reference so the original specified (p,11) would be destroyed/overwritten when changed to (p,5). The &mut T cannot edit the underlying string because the reference it owns is a non-mutable reference. Even if the underlying string, "Hello world", was a mutable str variable, then &mut T could not access it since it was provided an immutable reference &str as its type.

If we provided a mutable pointer to the &mut T, would the mutable pointer take ownership of "Hello world"? If so, would Rust then throw a compiler error because we've created a situation where we can orphan part of "Hello world"?


> The original pointer &str does not own the "Hello world" variable but is a reference to it.

Correct.

> The new &mut T pointer takes ownership of the &str reference

&T/&mut T never take ownership. It is true that &mut T must be exclusive, so it's sort of like "temporary ownership", but the underlying T won't be destroyed when the reference goes out of scope.

> so the original specified (p,11) would be destroyed/overwritten when changed to (p,5).

The original &str will still live as long as it lives; we're creating a new one, and changing the &mut to point to it, not overwriting the existing one.

> The &mut T cannot edit the underlying string because the reference it owns is a non-mutable reference. Even if the underlying string, "Hello world", was a mutable str variable, then &mut T could not access it since it was provided an immutable reference &str as its type.

Correct.

> If we provided a mutable pointer to the &mut T, would the mutable pointer take ownership of "Hello world"? If so, would Rust then throw a compiler error because we've created a situation where we can orphan part of "Hello world"?

No, &mut T never takes ownership.

Hope that helps! Happy to answer more questions if you've got them.


Turns out I was slightly wrong, see above. It's still true that no ownership is taken, but we are mutating the &str.
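
To make the corrected picture concrete, here's a tiny self-contained sketch (variable names are mine): the &mut lets us overwrite the &str itself, i.e. the (pointer, length) pair, while the string data is never written to.

    fn main() {
        let data: &str = "Hello world";    // the underlying string data
        let mut slice: &str = data;        // a (pointer, length) pair: (p, 11)
        {
            let r: &mut &str = &mut slice; // mutable reference to the &str itself
            *r = &data[6..];               // overwrite the pair: now (p + 6, 5)
        }
        assert_eq!(slice, "world");
        assert_eq!(data, "Hello world");   // the string data was never modified
    }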


Awesome, thank you for the explanation. I'm still trying to really understand Rust ownership and pointers.


Thank you, yes.

Is there a binding between the "top" pointer &mut T, and the string which the pointer/length pair points to?

Can we change our pointer to point somewhere else entirely, say a newly allocated string "goodbye, world"?

  p1
   \------\
           V
  (p2, 11) (p3, 13)
  |          \
  |           \
  |            \
  V             V
  Hello World   Goodbye World


There's no inherent connection; you could absolutely do this.


To restrict in the manner that the person above described would require some kind of const parameter in the type, no? The type would need to be parameterized over the `&str` to ensure that slices could only be taken from that particular string.


I believe that’s correct.


So it’s basically like a Span in C# if I understood correctly?


I am not familiar with the in-memory representation of Slice, but conceptually at least, yes. We call &str a "string slice" and &[T] more generally a "slice".

EDIT: apparently you just edited from Slice to Span, reading https://msdn.microsoft.com/en-us/magazine/mt814808.aspx

> First, Span<T> is a value type containing a ref and a length, defined approximately as follows:

Yep, exactly the same as Span<T> then.


Sorry, I wrote Slice instead of Span in the original comment... Slice is the method that returns a Span that is the equivalent of this rust concept as far as I can tell. At some point in the future I’ll need to delve in rust, it’s quite fascinating, but sadly I don’t have much time nowadays :(


It's all good, on both counts! We'll still be around when you find the time :)


It's actually like a ref Span argument, if my understanding of it is correct.

&[T] is like Span<T>, while a &mut &[T] is like a reference to a Span<T>.


Yeah, I do that in a command line interface module for some firmware I wrote. No heap, very limited memory; the solution was string functions that understand the concept of whitespace-delimited tokens. You just walk the pointer down the input string without modifying it.


It means you can change the string to which the pointer is currently associated.

Edit: I removed my initial “no.”


But in this case, it will be used the way that e12e described.


Ah. I may have misunderstood what they were suggesting.


This was such a wonderful read! I've been getting into Rust recently, and the sections on dealing with challenges that are specific to Rust were particularly useful. The way they created a new trait to turn `Fn(&str) -> Result<(&str, T), &str>` into `Parser<T>` was insightful, and the discussion of how they dealt with the growing sizes of types was something that I can imagine myself running into in the future.

Most importantly though, when they started writing `and_then`, my eyes lit up and I said "It's a Monad!" I think this is the first time I've really identified a Monad out in the wild, so I enjoyed that immensely.
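
For anyone who hasn't met that pattern before, here's a rough free-standing sketch (not the article's trait-based code, just plain closures over the same Result-shaped type) of why `and_then` has the shape of monadic bind: the parser to run next is computed from the previous parser's output.

    // Hedged sketch using an article-style result type; parsers are plain closures here.
    type ParseResult<'a, T> = Result<(&'a str, T), &'a str>;

    fn and_then<'a, A, B, P, F, Q>(parser: P, f: F) -> impl Fn(&'a str) -> ParseResult<'a, B>
    where
        P: Fn(&'a str) -> ParseResult<'a, A>,
        F: Fn(A) -> Q,
        Q: Fn(&'a str) -> ParseResult<'a, B>,
    {
        move |input| match parser(input) {
            Ok((rest, value)) => f(value)(rest), // choose the next parser from the value
            Err(err) => Err(err),
        }
    }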


It doesn't feel very declarative in Rust. Personally, I'm finding it hard to see the intent (I haven't written a line of Rust in my life, so take that with a pinch of salt, but I am a polyglot programmer).

Really, Haskell's do notation is the big winner when it comes to parser combinators, as the direction of the flow of the parser is easy to follow, but also you can capture variables mid-flight for use later in the expression without obvious nested scope blocks.

It's possible to capture variables with `and_then` by the looks of it, but any suitably complex parser will end up quite an ugly mess of nested scopes.

I ported Haskell's Parsec to C# [1]; C# has LINQ, which is similar to Haskell's do notation. Simple parsers [2] are beautifully declarative, and even complex ones, like this floating point number parser [3], are trivial to follow.

[1] https://github.com/louthy/language-ext

[2] https://github.com/louthy/language-ext/blob/master/LanguageE...

[3] https://github.com/louthy/language-ext/blob/master/LanguageE...


There are libraries like nom: https://lib.rs/crates/nom or combine https://github.com/Marwes/combine/blob/master/examples/date.... that have more declarative-looking syntax. In TFA you intentionally get "raw Rust" to avoid syntax sugar obscuring what's going on.


Since you mention parsec and .NET: fparsec [1] for F# is also a great library for building parser combinators.

[1] https://www.quanttec.com/fparsec/


Nice article. I finally gave Rust a try recently. It's really interesting how new languages evolve, and what "deficiencies" they exhibit. The article for example uses closures, but it's currently impossible in stable Rust to accept a closure that itself accepts a closure as an argument (while you can easily rewrite the same pattern with structs). The borrow checker could still do better at suggesting fixes to common problems (otherwise it's actually quite elegant). What struck me while reading this was the use of assert_eq!(expected, actual), as I've mostly seen the other order. Sure enough, I checked and the macro does not define the order. That's unfortunate, as testing against a "fixed" "expected" outcome is very common, and leads to a friendlier testing devx (which in general, while supported out of the box, isn't great).

On the other hand, Rust's IDE support, built-in linting, is seriously impressive.


> it's currently impossible in stable Rust to accept a closure that itself accepts a closure as an argument

You can make it work this way, if the second closure doesn't actually close over anything:

    let apply_twice = |f: fn(u64) -> u64, x| f(f(x));
    let plus_3 = |x| x + 3;
    let x = apply_twice(plus_3, 1);
Or this way, even with closed-over variables, using dynamic dispatch:

    let apply_twice = |f: &dyn Fn(u64) -> u64, x| f(f(x));
    let plus_3 = |x| x + 3;
    let x = apply_twice(&plus_3, 1);
I don't know if it's possible to take a generic closure by value, though.
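
For what it's worth, a named generic fn (as opposed to a closure) can take the closure by value; it's only closures themselves that can't have generic parameters. A small sketch:

    // A plain generic fn takes the closure by value, no dyn needed.
    fn apply_twice<F: Fn(u64) -> u64>(f: F, x: u64) -> u64 {
        f(f(x))
    }

    fn main() {
        let plus_3 = |x| x + 3;
        let x = apply_twice(plus_3, 1);
        assert_eq!(x, 7);
    }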

> the use of assert_eq!(expected, actual), as I've mostly seen the other order

That's interesting, in my head (expected, actual) is the usual order. For example, https://godoc.org/github.com/stretchr/testify/require#Equal.


We looked at the order for assert_eq, and couldn’t find any real consensus. Large testing frameworks in a variety of languages use both orderings. So, we decided to not enforce a particular one.


That seems like an odd decision? It's like saying there is no consensus for which side of the road to drive on, so let's not pick one. Standardization is arbitrary but useful here.


The best argument for picking a specific side was the comment below about "expected, actual" vs "these two things are not equal." In the end, we decided that this just wasn't enough of a benefit to justify making an arbitrary choice. If we did choose, a lot of code would have to change, as existing Rust codebases also used both conventions. Rust already forces the user to follow a lot of rules; this one just wasn't deemed worth it.


How is it useful, since it has no observable effect? Equals is equals!


As many people pointed out, an expected value is not the same as the actual value. This affects error messages, which are important for ergonomics.


The downside to enforcing an order is that the compiler cannot actually enforce it, so people absolutely will get it wrong and not notice until much later, when the test breaks and they get really confused because the message is telling them the expected value is the actual value and vice versa.


We could have a macro like:

    test_eq!(expected=(<expr>), actual=(<expr>));
Where those "assignments" could be specified in either order.

Playground example: https://play.rust-lang.org/?version=stable&mode=debug&editio...
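
Something along these lines, perhaps (a minimal sketch, not necessarily what the playground link contains):

    // Hypothetical test_eq! macro: named arguments accepted in either order.
    macro_rules! test_eq {
        (expected = ($e:expr), actual = ($a:expr)) => {
            assert_eq!($e, $a)
        };
        (actual = ($a:expr), expected = ($e:expr)) => {
            assert_eq!($e, $a)
        };
    }

    fn main() {
        test_eq!(expected = (4), actual = (2 + 2));
        test_eq!(actual = (2 + 2), expected = (4));
    }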


I really like this, not that I would want to write this every time, but because you could have a macro rule that doesn't specify expected/actual, for those rare cases where both sides are computed ("actual").


> That's unfortunate, as testing against "fixed" "expected" outcome is very common,

.... so write it in your preferred order then, in your own code?

Why is it "unfortunate" that a generic macro isn't concerned with explicit order? It checks that left == right [0], no more, no less. Instead, I'm free to follow my own conventions when writing tests. [1]

[0]: https://doc.rust-lang.org/src/core/macros.rs.html#44

[1]: https://git.sr.ht/~andrewzah/korean_numbers/tree/master/test...


Fixing the ordering can provide for clearer error messages and the like; for instance, rather than the error message being "foo is different from bar" you could have "got foo, expected bar".

However, I don't see how you can enforce it, especially in a simple function / macro (I guess that's one point for BDD-style interfaces): the expected value is not necessarily static, so it's not like you could require a const / static / literal even if the language allowed for such an annotation.


You could always define a macro like "expect_that", or something similar to Ruby's expect(obj).to eq, etc.

I find assert_eq!() personally to not be confusing since I always follow the same order, even though I come from a Ruby background with rspec and/or cucumber.


You can use a form like:

  assert_that(foo, is_equal_to(bar));
Which makes it sort of clear that foo is the actual value, and bar is the expected. Especially when you also see things like:

  assert_that(foo, is_not_zero());
Perhaps that's what you mean by 'BDD-style interfaces'.


> Perhaps that's what you mean by 'BDD-style interfaces'.

Yes. It's usually a more OO shape, along the lines of `foo.should.be.equal.to(bar)` or somesuch; a "natural" reading makes it pretty likely the parameter is the expected value.


The IDE support seems impressive at first, but RLS will start to break down and be unusable as projects get a bit large or a bit complex.

One challenge, even if/when rls bugs are all sorted out, is that Rust compile times may make it impractical to use with large projects.


It turns out, these problems are connected, and we're starting work on a second generation of tooling for this, based off of things like Roslyn. See https://ferrous-systems.com/blog/rust-analyzer-2019/, which is already usable for basic things. This also means improving the compiler. From the 2019 roadmap: https://github.com/rust-lang/rfcs/blob/master/text/2657-road...

> This is where the RLS 2.0 effort comes in. The plan is to build a prototype of the new front-end, thus enabling a truly first-class IDE experience. Wherever possible, we aim to share code between this front-end and rustc by extracting libraries, such as Chalk (for traits), Polonius (for the borrow checker), and a new library focusing on name resolution. Eventually, we will merge this new front-end with the remaining bits of back-end from rustc.

rust-analyzer is that "RLS 2.0"


> based off of things like Roslyn

That is spectacular news! This is a huge project; I had assumed it would be a while before anyone could take it on.


It is certainly a huge project, but that's one nice thing about the "split everything into a ton of crates" philosophy; it enables this kind of thing to be done in a fairly piecemeal way.


> What struck me while reading this was the use of assert_eq!(expected, actual)

This order is common in probably 200+ test frameworks.

For me, I’m surprised whenever I see the opposite.

> leads to more friendly testing devx

IMO being ergonomic, consistent and well documented leads to good developer experience.

Things like this matter little.


Rust IDE support is not great tbh; there are recent languages that have way better support. I would say it's average at best.


Compared to what? Very few languages compare well to the excellent support out there for Java.

IntelliJ’s Rust plugin is probably the best IDE experience with Rust, and now includes macro expansion. Personally, I still use the RLS with VSCode, mostly because I’m still enjoying the VSCode editor’s versatility when working with many different file types.


C#, Java, Go, Kotlin

It is true though that top notch IDE experience is rare.


If someone wants to have a look at the code of a cutting-edge parser combinator framework with a focus on features + usability, I'll plug this here (it's in Java):

https://github.com/norswap/autumn4

WIP but 1.0.0 will land somewhere within the next two months, with a full user-guide (half of it is already written and available).

Constructive feedback welcome!


Honest question, why use parser combinators?

If I need to do it by hand I can write a recursive descent parser, but I never do it by hand any more, just break out lex/yacc or antlr.

So, what have I or anyone to gain by learning/using PCs? (I'm never against learning new stuff but I have so much to learn that adding something unnecessary to that would be just dumb).

TIA


Parser combinators aid in writing recursive descent parsers; they are not something alien. You can maybe consider them a design pattern which eliminates redundancies and makes it easier to construct an AST, as the sketch below tries to show.
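
A tiny hedged sketch (using plain functions rather than the article's Parser trait): a combinator is just a function that takes parsers and hands back a new parser, so the redundant plumbing (threading the remaining input, propagating errors) lives in one place.

    // Hypothetical `pair` combinator over article-style parser functions.
    type ParseResult<'a, T> = Result<(&'a str, T), &'a str>;

    // Run two parsers in sequence and tuple up their results.
    fn pair<'a, A, B>(
        p1: impl Fn(&'a str) -> ParseResult<'a, A>,
        p2: impl Fn(&'a str) -> ParseResult<'a, B>,
    ) -> impl Fn(&'a str) -> ParseResult<'a, (A, B)> {
        move |input| {
            let (rest, a) = p1(input)?;  // run the first parser
            let (rest, b) = p2(rest)?;   // feed what's left into the second
            Ok((rest, (a, b)))
        }
    }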


The problem is the definition of "by hand". I assume you mean it as "writing code in a general programming language". Note that in a sense, lex/yacc and antlr are programming languages as well.

Parser combinators are usually presented in the form of DSLs. So effectively they're very similar to the above.

e.g. here are JSON and Java in Autumn https://github.com/norswap/autumn4/blob/master/test/lang/jso... https://github.com/norswap/autumn4/blob/master/src/norswap/l...

There is actually very little difference between parser combinators and parser generators. Like you could take the above Autumn code and make it generate a parser (I plan to do that eventually).

The main advantage of parser generators is that the code you generate is usually faster. The main disadvantage compared to combinators is that you add a layer of indirection: it's not code in your language that you can trace and debug directly, but generated code.

The main advantage of Autumn over lex/yacc and ANTLR (and a whole host of others, though not all) is that you can define new combinators (not an advantage of PC, you could have this in a parser generator too). Essentially add the recursive-descent logic you need to do something new, then wrap in a nice combinator, and now your grammar has a new primitive element!

Autumn has cool built-in combinators that users could have built themselves using the available facilities. For instance: memoization, longest-match, ways to define left- and right-associative operators, ... (https://github.com/norswap/autumn4/tree/master/src/norswap/a...)


Okay, I am interested in this topic. Does anyone know of any good resources for exploring parser combinators further?



What is the class of languages that can be parsed with such parsers, in the sense of [1]?

[1] https://en.wikipedia.org/wiki/Context-free_grammar#Subclasse...


It's basically a recursive-descent parser, which means LL(k)-ish. The "-ish" is because you can use other tricks instead of straightforward recursive-descent (expression parsing is a common example of where you might want to do so), but the basic combinator concept itself is LL(k).



Nice read! Hi, I also wrote a SQL dump parser using Rust; here's the code:

> https://github.com/ooooak/sql-split


I don't like Rust for this purpose. It doesn't have higher-kinded types and thus no applicatives or monads, which sort of misses the point.

I also object to the idea that parser combinators are an alternative to parser generators. They're each useful in different scenarios. But for something like XML the parser combinators will be slower.

I'd also be curious to see how the efficiency of parser combinators is affected by the absence of laziness in Rust. I seem to recall that laziness makes the analysis more complicated than you'd expect, but I need to find a source...


> I don't like Rust for this purpose. It doesn't have higher-kinded types and thus no applicatives or monads, which sort of misses the point.

Having used parser combinators in Rust and Haskell (combine and attoparsec, respectively), I've found that even without applicatives, parser combinators are pretty handy—they're a step above regexes, but a step below a parser generator.

> I'd also be curious to see how the efficiency of parser combinators is affected by the absence of laziness in Rust. I seem to recall that laziness makes the analysis more complicated than you'd expect, but I need to find a source...

Same here, but I suspect that aggressive inlining might be pretty helpful.



