>> The novice programmer will ask, "what is a parser?"
>> The intermediate programmer will say, "that's easy, I'll write a regular expression."
>> The master programmer will say, "stand back, I know lex and yacc."
The Prolog programmer will write a Definite Clause Grammar [1], which is both a grammar and a parser, two-in-one. So you only have to do the easy and fun bit of writing a parser, which is defining the grammar.
Leaves plenty of time to get online and brag about the awesome power of Prolog or get embroiled in flamewars with functional programming folks [2].
______________
[1] https://en.wikipedia.org/wiki/Definite_clause_grammar
[2] Actually, DCGs are kiiind of like parser combinators. Ish. In the sense that they're executable grammars. But in Prolog you can run your programs backwards, so your DCG is both a recogniser and a generator.
This is beside your point, but I've found PEG to be a nice step up in usability from lex&yacc, as again, you only have to write one grammar definition.
In Prolog it's not "easy" - it's implicit. You don't need to do anything special, you don't need to combine different programming constructs, you don't need a library, and you don't even have to write your code in a special way to make your grammars work as parsers.
The trick is that DCGs are Prolog programs, which is to say, they are logic theories that the Prolog interpreter proves true or false. Parsing a string then is a case of proving that the grammar is a theory that entails the string in its input (with some abuse of terminology!).
And because a Prolog predicate (including each rule in a definite clause grammar) is a relation rather than a function, and so has no fixed inputs and outputs, you can leave the "input" unbound, to generate all strings entailed by the grammar.
Yes? You write a grammar. You get your recognisers for nothing and your generators for free.
Oh, you need to do more than "nothing special" in Prolog to get anything that is mildly useful. The very notion of constraint programming was born in the Prolog community to handle... well, to handle constraints better than regular Prolog handles them.
Let me give you another example.
In the case of parser combinators over lists, you can have a greedy alternative:
newtype P a = P (String -> [(String, a)])  -- a parser is a function from input to remaining input and a result
P a <|> P b = P $ \s -> case a s of
  [] -> b s
  xs -> xs
The regular alternative is this:
P a <|> P b = P $ \s -> a s ++ b s
You can use combinators like this:
many, some :: P a -> P [a]
-- possibly empty parse - returns an empty list in that case
many item = some item <|> pure []
-- parse a head item and a tail list of items, and combine them into a list
some item = (:) <$> item <*> many item
This greedy thing above helps greatly when you are dealing with mostly regular grammars (which are plentiful in the wild) - it does not hold on to input for too long, it is faster in the no-parse situation, etc.
You put a change into a single combinator function and get a vastly improved result.
In the case of Prolog handling grammars, you would have to put cuts all over your grammar definition, risking either a loss of specification equivalence or a loss of performance.
I have more practical questions. For example, how would you design online parsers - ones that consume input item by item and produce intermediate results on the go? How would you design a tree of parsers for an editor, to speed up reparsing after a change?
These kinky Applicative and Alternative combinators let you not worry about details when specifying the grammar. You specify it with class (pun intended - abstract getting an item into a type class and you are done) and layer it above an implementation which can be whole-input, streaming, with partial parses, deterministic, etc.
>> Oh, you need to do more than "nothing special" in Prolog to get anything that is mildly useful.
This is really not my experience. I've been coding in Prolog for 10 years and done a bunch of useful stuff, if I say so myself (my latest baby: https://github.com/stassa/metasplain). Constraint (logic) programming is cool and useful, but it's not necessary. It's just a particular logic programming paradigm. Actually, although I do find it very nice and pure, in a declarative sense, I don't think I could code all my stuff in CLP.
I can't say I understand your parser combinator notation. The syntactic sugar of DCGs means that if you can read BNF, you can read a DCG. Simple syntax that doesn't get in your way when you're trying to do complicated things is a hallmark of Prolog in general.
Cuts are also not necessary - DCGs are one kind of Prolog program where I almost never use cuts. One reason for that is that you usually want your grammars to be nondeterministic, so that you can use them as generators. DCGs basically just process lists, so it's not usually that hard to ensure a grammar doesn't overgenerate. You really don't have much use for cuts with DCGs, unless you use them to write arbitrary Prolog programs, rather than parsers (which you can do, since DCGs are syntactic sugar for ordinary Prolog - but I generally don't).
The online parser outputting intermediate results that you're describing - I think you're talking about something like this (not tested):
s --> a, b.
a --> [a], {writeln('Found an a')}.
b --> [b], {writeln('Found a b')}.
The curly braces, {}, embed arbitrary Prolog code in the body of a rule (which btw makes DCGs Turing-complete, I guess); the embedded code is executed as soon as it is reached while processing the body of the grammar. You can go a long way with it.
The "parser tree" you describe would probably need some coding around the grammars, i.e. you'd have to write special code to combine a few parsers. Or maybe not, I'm not sure because I haven't tried to do what you say. What do you use something like that for?
The Wikipedia article has a few examples. For instance:
sentence --> noun_phrase, verb_phrase.
noun_phrase --> det, noun.
verb_phrase --> verb, noun_phrase.
det --> [the].
det --> [a].
noun --> [cat].
noun --> [bat].
verb --> [eats].
And here are some example queries:
% Running as a recogniser
?- sentence([the,cat,eats,a,cat], []).
true.
% With a string not in the grammar
?- sentence([the,cat,eats,'YeGoblynQuenne'], []).
false.
% Running as a generator
?- sentence(P, []).
P = [the, cat, eats, the, cat] ;
P = [the, cat, eats, the, bat] ;
P = [the, cat, eats, a, cat] ;
P = [the, cat, eats, a, bat] ;
P = [the, bat, eats, the, cat] ;
P = [the, bat, eats, the, bat] ;
P = [the, bat, eats, a, cat] ;
% ...
% More results generated on backtracking
To prevent any confusion - the "A --> B" syntax is syntactic sugar. Such statements are expanded to ordinary Prolog terms by the Prolog equivalent of a macro (an expansion hook). You can inspect the ordinary Prolog definitions saved in the program database:
?- listing(sentence).
sentence(A, B) :-
    noun_phrase(A, C),
    verb_phrase(C, B).
Indeed, it's perfectly possible to write these by hand - as you can see above, the ordinary Prolog definition is not that complicated. The macro expanding the DCG notation adds two variables to each nonterminal and relieves you of the need to write this boilerplate yourself.
for composability. I found the article fascinating and plan on going back to see what an XML parsing implementation based on the former would look like.
This might be the first time I’ve seen a good use for &mut &T, very cool!
For those of you not well-versed in Rust, this is a mutable reference to an immutable string slice. This means that you can change which part of the string you're pointing at, but you can't change the underlying string itself.
What does that mean? Is it equivalent to the pointer starting out (say) pointing to the first letter of the string, but being able to "walk"/iterate along the length of the string?
EDIT: adjusted the diagram to make it more clear that we're not mutating the &str
Here's some string data, in memory somewhere:
Hello world
A `&str` is a pointer, length combo. The pointer points to the start of the string, the length says how many bytes there are:
(p, 11)
   |
   V
   Hello world
A `&mut T` is a pointer, so a `&mut &str` looks like this:
   p
   |
   V
(p, 11)
   |
   V
   Hello world
Since it's mutable, this means we can re-assign what it points to, and create a new `&str` to the later part of the string:
   p
   \------\
          V
(p, 11) (p, 5)
   |     |
   V     V
   Hello world
Since the inner &str is immutable, we can't change the underlying string.
Hope that helps!
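In code, the same picture looks something like this. A minimal sketch (my own, not the article's code; the name `eat_prefix` is made up for illustration) of a function that takes `input: &mut &str` and advances it. It never touches the string data; it only overwrites the caller's (pointer, length) pair so the caller's slice now starts further along the same bytes:
fn eat_prefix(input: &mut &str, prefix: &str) -> bool {
    // `&str` is Copy, so this just copies the fat pointer out.
    let s: &str = *input;
    match s.strip_prefix(prefix) {
        Some(rest) => {
            // Re-point the caller's slice at the remaining input;
            // the underlying bytes are untouched.
            *input = rest;
            true
        }
        None => false,
    }
}

fn main() {
    let mut input = "Hello world";
    assert!(eat_prefix(&mut input, "Hello "));
    assert_eq!(input, "world"); // the caller sees the advanced slice
}
The assert at the end is the point: because the function got a `&mut` to the caller's `&str`, the advance is visible to the caller.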
Re: your edit, isn't mutating the `&str` exactly what we're doing? Your new diagram implies that `input` itself is mutable and that's what's being overwritten, but that's not what `input: &mut &str` means nor would it even work because the mutation wouldn't be visible to the caller...?
I want to make sure I understand this. The original pointer &str does not own the "Hello world" variable but is a reference to it. The new &mut T pointer takes ownership of the &str reference, so the original specified (p,11) would be destroyed/overwritten when changed to (p,5). The &mut T cannot edit the underlying string because the reference it owns is a non-mutable reference. Even if the underlying string "Hello world" were a mutable str variable, the &mut T could not access it, since it was provided an immutable reference (&str) as its type.
If we provided a mutable pointer to the &mut T, would the mutable pointer take ownership of the "Hello world"? If so, would Rust then throw a compiler error because we've created a situation where we can orphan part of "Hello world"?
> The original pointer &str does not own the "Hello world" variable but is a reference to it.
Correct.
> The new &mut T pointer takes ownership of the &str reference
&T/&mut T never take ownership. It is true that &mut T must be exclusive, so it's sort of like "temporary ownership", but the underlying T won't be destroyed when the reference goes out of scope.
> so the original specified (p,11) would be destroyed/overwritten when changed to (p,5).
The original &str will still live as long as it lives; we're creating a new one, and changing the &mut to point to it, not overwriting the existing one.
> The &mut T cannot edit the underlying string because the reference it owns is a non-mutable reference. Even if the underlying string "Hello world" were a mutable str variable, the &mut T could not access it, since it was provided an immutable reference (&str) as its type.
Correct.
> If we provided a mutable pointer to the &mut T, would the mutable pointer take ownership of the "Hello world"? If so, would Rust then throw a compiler error because we've created a situation where we can orphan part of "Hello world"?
No, &mut T never takes ownership.
Hope that helps! Happy to answer more questions if you've got them.
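If it helps to see the ownership point as code, here's a tiny sketch (mine, not from the article): the owned String outlives everything, the &mut &str only re-points the slice, and nothing is dropped when the references go out of scope.
fn main() {
    // The String owns the data; the references below only borrow it.
    let owner = String::from("Hello world");
    let mut slice: &str = &owner;      // a (pointer, length) pair into `owner`
    {
        let p: &mut &str = &mut slice; // exclusive borrow of the fat pointer
        let s: &str = *p;              // `&str` is Copy: copy the fat pointer out
        *p = &s[6..];                  // re-point the slice at "world"
    }                                  // `p` goes out of scope: nothing is destroyed
    assert_eq!(slice, "world");
    assert_eq!(owner, "Hello world");  // the owned data is untouched and still alive
}
When `p` goes out of scope only the borrow ends; `slice` and `owner` are unaffected, which is the "no ownership is taken" point above.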
To restrict in the manner that the person above described would require some kind of const parameter in the type, no? The type would need to be parameterized over the `&str` to ensure that slices could only be taken from that particular string.
I am not familiar with the in-memory representation of Slice, but conceptually at least, yes. We call &str a "string slice" and &[T] more generally a "slice".
Sorry, I wrote Slice instead of Span in the original comment... Slice is the method that returns a Span, which is the equivalent of this Rust concept as far as I can tell.
At some point in the future I'll need to delve into Rust; it's quite fascinating, but sadly I don't have much time nowadays :(
Yeah, I do that in a command line interface module for some firmware I wrote. No heap, very limited memory; the solution is string functions that understand the concept of whitespace-delimited tokens. You just walk the pointer down the input string without modifying it.
This was such a wonderful read! I've been getting into Rust recently, and the sections on dealing with challenges that are specific to Rust were particularly useful. The way they created a new trait to turn `Fn(&str) -> Result<(&str, T), &str>` into `Parser<T>` was insightful, and the discussion of how they dealt with the growing sizes of types was something that I can imagine myself running into in the future.
Most importantly though, when they started writing `and_then`, my eyes lit up and I said "It's a Monad!" I think this is the first time I've really identified a Monad out in the wild, so I enjoyed that immensely.
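For readers who haven't read the article yet, here's a condensed sketch of that idea, reconstructed from memory rather than copied from the article, so treat the exact signatures (and my `ParseResult` alias) as illustrative: a blanket impl turns any suitable closure into a `Parser`, and `and_then` is the monad-shaped bind.
type ParseResult<'a, T> = Result<(&'a str, T), &'a str>;

trait Parser<'a, T> {
    fn parse(&self, input: &'a str) -> ParseResult<'a, T>;
}

// Blanket impl: any Fn(&str) -> Result<(&str, T), &str> is already a Parser<T>.
impl<'a, F, T> Parser<'a, T> for F
where
    F: Fn(&'a str) -> ParseResult<'a, T>,
{
    fn parse(&self, input: &'a str) -> ParseResult<'a, T> {
        self(input)
    }
}

// The "bind": run `parser`, then let `f` pick the next parser from its result.
fn and_then<'a, P, F, NextP, A, B>(parser: P, f: F) -> impl Parser<'a, B>
where
    P: Parser<'a, A>,
    NextP: Parser<'a, B>,
    F: Fn(A) -> NextP,
{
    move |input: &'a str| match parser.parse(input) {
        Ok((rest, result)) => f(result).parse(rest),
        Err(err) => Err(err),
    }
}
The Ok arm is exactly the "take the result, decide the next parser, run it on the remaining input" step that makes it read like a monadic bind.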
It doesn't feel very declarative in Rust. Personally, I'm finding it hard to see the intent (I haven't written a line of Rust in my life, so take that with a pinch of salt, but I am a polyglot programmer).
Really, Haskell's do notation is the big winner when it comes to parser combinators, as the direction of the flow of the parser is easy to follow, but also you can capture variables mid-flight for use later in the expression without obvious nested scope blocks.
It's possible to capture variables with `and_then` by the looks of it, but any suitably complex parser will end up quite an ugly mess of nested scopes.
I ported Haskell's Parsec to C# [1]; it has LINQ, which is similar to Haskell's do notation. Simple parsers [2] are beautifully declarative, and even complex ones, like this floating point number parser [3], are trivial to follow.
Nice article. I finally gave Rust a try recently. It's really interesting how new languages evolve, and what "deficiencies" they exhibit. The article for example uses closures, but it's currently impossible in stable Rust to accept a closure that itself accepts a closure as an argument (while you can easily rewrite the same pattern with structs). The borrow checker could still do better at suggesting fixes to common problems (otherwise it's actually quite elegant). What struck me while reading this was the use of assert_eq!(expected, actual), as I've mostly seen the other order. Sure enough, I checked, and the macro does not prescribe an order. That's unfortunate, as testing against a "fixed" "expected" outcome is very common and leads to a friendlier testing devx (which, while supported out of the box, in general isn't great).
On the other hand, Rust's IDE support and built-in linting are seriously impressive.
We looked at the order for assert_eq, and couldn’t find any real consensus. Large testing frameworks in a variety of languages use both orderings. So, we decided to not enforce a particular one.
That seems like an odd decision? It's like saying there is no consensus for which side of the road to drive on, so let's not pick one. Standardization is arbitrary but useful here.
The best argument for picking a specific side was the comment below about "expected, actual" vs "these two things are not equal." In the end, we decided that this just wasn't enough of a benefit to justify making an arbitrary choice. If we did choose, a lot of code would have to change, as existing Rust codebases also used both conventions. Rust already forces the user to follow a lot of rules; this one just wasn't deemed worth it.
The downside to prescribing an order is that the compiler cannot actually enforce it, so people absolutely will get it wrong and not notice until much later, when a test breaks and they get really confused because the message tells them the expected value is the actual value and vice versa.
I really like this, not that I would want to write this every time, but because you could have a macro rule that doesn't specify expected/actual, for those rare cases where both sides are computed ("actual").
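Something like this hypothetical macro (the name `assert_expected!` is invented for illustration) would do it: the call site labels which side is which, so the failure message can report expected/actual without relying on argument order.
macro_rules! assert_expected {
    (expected: $e:expr, actual: $a:expr) => {
        match (&$e, &$a) {
            (expected, actual) => assert!(
                expected == actual,
                "assertion failed\n expected: {:?}\n   actual: {:?}",
                expected,
                actual
            ),
        }
    };
}

fn main() {
    assert_expected!(expected: 4, actual: 2 + 2);
}
That's obviously more ceremony than assert_eq!, which is the trade-off the surrounding comments are weighing.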
> That's unfortunate, as testing against "fixed" "expected" outcome is very common,
.... so write it in your preferred order then, in your own code?
Why is it "unfortunate" that a generic macro isn't concerned with explicit order? It checks that left == right, [0] no more, no less. Instead of that, I'm free to follow my own conventions when writing tests. [1]
Fixing the ordering can provide for clearer error messages and the like; for instance, rather than the error message being "foo is different from bar", you could have "got foo, expected bar".
However, I don't see how you could enforce it, especially in a simple function / macro (I guess that's one point for BDD-style interfaces): the expected value is not necessarily static, so it's not like you could require a const / static / literal even if the language allowed for such an annotation.
You could always define a macro like "expect_that", or something similar to ruby's expect(obj).to eq, etc.
I find assert_eq!() personally to not be confusing since I always follow the same order, even though I come from a Ruby background with rspec and/or cucumber.
> Perhaps that's what you mean by 'BDD-style interfaces'.
Yes. It's usually a more OO shape, along the lines of `foo.should.be.equal.to(bar)` or somesuch; a "natural" reading makes it pretty likely the parameter is the expected value.
> This is where the RLS 2.0 effort comes in. The plan is to build a prototype of the new front-end, thus enabling a truly first-class IDE experience. Wherever possible, we aim to share code between this front-end and rustc by extracting libraries, such as Chalk (for traits), Polonius (for the borrow checker), and a new library focusing on name resolution. Eventually, we will merge this new front-end with the remaining bits of back-end from rustc.
It is certainly a huge project, but that's one nice thing about the "split everything into a ton of crates" philosophy; it enables this kind of thing to be done in a fairly piecemeal way.
Compared to what? Very few languages compare well to the excellent support out there for Java.
IntelliJ’s Rust plugin is probably the best IDE experience with Rust, and now includes macro expansion. Personally, I still use the RLS with VSCode, mostly because I’m still enjoying the VSCode editor’s versatility when working with many different file types.
If someone wants to have a look at the code of a cutting-edge parser combinator framework with focus on features + usability, I'll plug this here (it's in Java)
If I need to do it by hand I can write a recursive descent parser, but I never do it by hand any more, just break out lex/yacc or antlr.
So, what have I or anyone to gain by learning/using PCs? (I'm never against learning new stuff but I have so much to learn that adding something unnecessary to that would be just dumb).
Parser combinators aid in writing recursive descent parsers, they are not something alien.
You can maybe consider it a design pattern which eliminates redundancies and makes it easier to construct an AST.
The problem is the definition of "by hand". I assume you mean it as "writing code in a general programming language". Note that in a sense, lex/yacc and antlr are programming languages as well.
Parser combinators are usually put out in the form of DSLs. So effectively they're very similar to the above.
There is actually very little difference between parser combinators and parser generators. Like you could take the above Autumn code and make it generate a parser (I plan to do that eventually).
The main advantage of parser generators is that the code you generate is usually faster. The main disadvantage compared to combinators is that you add a layer of indirection: it's not code in your language that you can trace and debug directly, but generated code.
The main advantage of Autumn over lex/yacc and ANTLR (and a whole host of others, though not all) is that you can define new combinators (not an advantage of PC, you could have this in a parser generator too). Essentially add the recursive-descent logic you need to do something new, then wrap in a nice combinator, and now your grammar has a new primitive element!
Autumn has cool built-in combinators that users could have built themselves using the available facilities. For instance: memoization, longest-match, ways to define left- and right-associative operators, ... (https://github.com/norswap/autumn4/tree/master/src/norswap/a...)
It's basically a recursive-descent parser, which means LL(k)-ish. The "-ish" is because you can use other tricks instead of straightforward recursive-descent (expression parsing is a common example of where you might want to do so), but the basic combinator concept itself is LL(k).
I don't like Rust for this purpose. It doesn't have higher-kinded types and thus no applicatives or monads, which sort of misses the point.
I also object to the idea that parser combinators are an alternative to parser generators. They're each useful in different scenarios. But for something like XML the parser combinators will be slower.
I'd also be curious to see how the efficiency of parser combinators is affected by the absence of laziness in Rust. I seem to recall that laziness makes the analysis more complicated than you'd expect, but I need to find a source...
> I don't like Rust for this purpose. It doesn't have higher-kinded types and thus no applicatives or monads, which sort of misses the point.
Having used parser combinators in Rust and Haskell (combine and attoparsec, respectively), I've found that even without applicatives, parser combinators are pretty handy—they're a step above regexes, but a step below a parser generator.
> I'd also be curious to see how the efficiency of parser combinators is affected by the absence of laziness in Rust. I seem to recall that laziness makes the analysis more complicated than you'd expect, but I need to find a source...
Same here, but I suspect that aggressive inlining might be pretty helpful.