Fuzzing is magic – Or how I found a panic in Rust's regex library

lomnakkus · on April 1, 2017

It's really disquieting how much similarity there is between fuzzing and QuickCheck... and yet so few people have even noticed and know about either. (Yes, fuzzing and QC work at slightly different "levels", I suppose, but... it's all just virtual execution by the CPU, so...)

kibwen · on April 1, 2017

Worth noting that there's also an implementation of QuickCheck available for Rust (by none other than burntsushi, the author of the regex library in the OP): https://github.com/BurntSushi/quickcheck

im_down_w_otp · on April 1, 2017

We've recently started testing C code via Rust QuickCheck.

In addition to the relatively clean Rust -> C interface it's very nice to have higher degree of compile-time assurance of the correctness of the property model itself. As the models become non-trivial it becomes increasingly difficult to build them reliably in C itself.

adrianN · on April 2, 2017

Do you have more information about that? Maybe a blog post?

im_down_w_otp · on April 2, 2017

Not at the moment, but almost certainly at some point in the relatively near future.

lomnakkus · on April 1, 2017

Oh, nice!

burntsushi · on April 1, 2017

Incidentally, QuickCheck is also used on the regex parser.

cakoose · on April 2, 2017

There's another distinction besides the "level" at which they operate. They use different techniques to generate test inputs.

QuickCheck-style tools typically generate inputs based on high-level static information, like types or programmer-provided specs. To reach the sophistication of modern fuzzers, you need to be smarter.

For one, you should inspect the code of the function. If you know how the code branches, you can more efficiently generate inputs that exercise both sizes of the branch.

But since you can't always figure out how to exercise both sizes of a branch just based on static analysis, fuzzers also observe the dynamic execution behavior of the program. This allows them to randomly adjust inputs until they know both sides of a branch have been hit.

I imagine that lifting fuzzing techniques to a higher level could have efficiency benefits. Maybe someone builds a fuzzer that runs on Rust MIR? Or maybe Rust can dumps out additional metadata (alongside the LLVM bitcode) to help the fuzzer figure things out?

eru · on April 2, 2017

QuickCheck happens to use types (ie static information). But Python's hypothesis shows that QuickCheck works just as well with dynamic languages: https://hypothesis.readthedocs.io/en/latest/

cakoose · on April 3, 2017

Yeah, in this case, Hypothesis uses programmer-provided input specs, which are similar but not quite the same as type declarations in a typed programming language.

I'd still consider this to be essentially "static" information, since they don't change as the program executes.

eru · on April 3, 2017

Perhaps. Though 'shrinking' definitely makes use of dynamic information.

Manishearth · on April 1, 2017

We (the folks who maintain the Rust fuzzers) are considering adding support for quickcheck-based fuzzers so that you can have higher-level fuzzing APIs.

The rust community is pretty aware of quickcheck (there's a rust version) I would say.

munin · on April 1, 2017

I think that this is because QC can work with much richer types than fuzzers have access to. Think about the Haskell data types you know, and imagine automatically generating them. Sounds pretty easy with some algorithm, right? Now think about the Java data types you know, and the invariants on constructing them. Yeah sure they have constructors but what preconditions do those constructors require? anything impure, like I/O? are there setters? is an object really "constructed" until some setters have been called? if so, how do you know that as the test generator?

It seems a lot nastier to me, which I think is why QC style automatic testing has made less inroads in impure languages.

It also makes fewer inroads to fuzzing because fuzzing starts a few steps back where you know even less about the application. You could do a grammar-based fuzz of the application, but then you have to worry about the map/terrain of the grammar/application. Are you missing testing some case in the application because your grammar is incomplete?

In QC, the map is the terrain because the "grammar" is just a type that your application uses and this isn't really the case in fuzzing. It could be, in fuzzing some languages (like Java) where you could interrogate the program you'r testing (even a binary program) for types that the program uses. Think about C programs though, where the application input and the "types" that the application uses are (or can be) pretty distant.

Anyway, I don't think it's disquieting. I think they're two very different software assurance techniques that apply in two very different domains and that explains why people don't say too much about them in the same breath. They're united by randomness, but I don't think that much more.

cmrx64 · on April 2, 2017

Speaking of, do you know of any tools for doing grammar-based fuzzing?

munin · on April 2, 2017

peachfuzz, quickfuzz, dharma, grammfuzz. They get done a bunch and usually people wind up re-inventing lisp to make a data specification DSL, then write some generators around it and call it a day. The generation strategy usually winds up being domain specific too (are you looking for OOB bugs? dangling refs / UAF? different patterns in different structured data lead to these errors without feedback) so honestly you should pick some parser combinator in a language that you know (like construct for python, or hammer for C) and then turn it into a fuzzer for your application yourself, using the faults you want to find as inspiration for the generators you write.

rixrax · on April 2, 2017

Peach Fuzzer[p] and Synopsys Defensics[d] should fit the bill. Both are commercial. I'm not aware of any OSS grammar-based fuzzers that would be good for anything (ok; there is Sulley and some others which seemingly haven't been maintained in eons).

[p] www.peachfuzzer.com/ [d] https://www.synopsys.com/software-integrity/security-testing...

eru · on April 2, 2017

You don't need good static types for QuickCheck like checking. See Python's hypothesis (https://hypothesis.readthedocs.io/en/latest/) for the refutation.

Letting the generation be guided by static types was just a good design decision for the original Haskell version of the library---it really helps with convenience.

lomnakkus · on April 9, 2017

What you say is, of course, true[1]. However, I find that the types actually do have a huge benefit, namely that you're "encouraged" to give more accurate types for your functions. For example: "frobnicate :: String -> String" has a huge state space, and if it actually really only works on a subset of String and you want to test it you have to manually provide "overrides" for the value generators... which is semi-painful if you have to do it for every test. So instead you add newtypes and change it to "frobnicate :: Blampf -> Schmeckle" and move all the "test-only" code into the Blampf and Schmeckle constructors[2] instead.

[1] I was actually about to footnote here that it was invented in Erlang-land, but apparently not: https://en.wikipedia.org/wiki/QuickCheck . Not sure why I thought that, maybe just because Joe Armstrong has been an advocate of QC-style testing for a long time? Or maybe it was he who invented QC in Haskell?

[2] Which, granted, in Haskell will mostly be run-time checked, but at least you'll get an error at the earliest possible opportunity.

hyperpape · on April 2, 2017

The author of Hypothesis (Python library for property based testing) has a nice article about defining property based testing and whether and how it differs from fuzzing. http://hypothesis.works/articles/what-is-property-based-test...

galapago · on April 2, 2017

We developed a fuzzer [0] using property-based testing in Haskell using QuickCheck (as expected!)

[0] http://QuickFuzz.org/

dllthomas · on April 4, 2017

I think they're both examples of property based generative testing, where the property of interest in the fuzzing case is "doesn't crash".

lomnakkus · on April 7, 2017

You're right, I think.

k__ · on April 1, 2017

Just saw something like quickcheck for JavaScripts Flow.

I think I will use this in the future.

krick · on April 2, 2017

How does the fuzzer actually work? Is the data really just a completely random string (array of bytes) over and over again, or is there something else? Does it just chose random byte random number of times for each input, or there is some more sophisticated logic in how exactly does it chose the "random input"?

lbrandy · on April 2, 2017

libfuzzer (and many of the fuzzers used widely today) is a coverage (as in, code-coverage) directed fuzzer. So it's not just randomly mutating the input, but "selecting" for random inputs that happen to light up new code paths. It works with a (or generates a random) corpus, and prefers the sample inputs that light up new code paths in subsequent mutations of the population.

Manishearth · on April 2, 2017

It looks at the codepaths taken and tries to trigger new ones. It's an evolutionary algorithm where it prefers to use new interesting strings that trigger new codepaths as a base for subsequent mutation.

komon · on April 2, 2017

Not super sure on the actual workings of cargo's fuzzing, but IIRC, afl-fuzz works by analyzing the memory of the fuzz-ee and using evolutionary algorithms to evolve the input in order to create as many new memory states as it can

chongli · on April 2, 2017

It helps to have a bunch of pre-defined strings that you know are problematic, such as extremely long strings or strings that use ASCII codes or escape characters in weird ways. Trying all of those before moving on to purely random strings can help. Taking things a bit further, you can log the purely random strings that are known to cause problems and then add them to the list of things to try in the future.

kibwen · on April 1, 2017

I'm really excited at encouraging a culture of fuzz testing in Rust. :) Can someone explain what's currently Linux-specific about cargo-fuzz? Is it to do with LibFuzzer itself?

amaranth · on April 1, 2017

The libFuzzer website doesn't cite any supported platforms or platform requirements so I think this might just be a matter of someone putting in the effort. The Chromium documentation (where libFuzzer comes from originally) does make it sound like it either takes effort to integrate with various sanitizers or (more likely) requires the infrastructure that the sanitizers do. None of those support Windows and seem to work best on amd64 Linux or, if you're lucky, amd64 macOS.

Manishearth · on April 1, 2017

libFuzzer (based on LLVM sanitizer instrumentation support) is Linux only.

IIRC the bits for OSX support are almost there, someone was trying this out before.

tux1968 · on April 1, 2017

It would be interesting to know how such a panic-inducing bug makes it passed Rust's much vaunted borrow-checker and other correctness validations. Or does the regex library include "unsafe" code blocks?

corndoge · on April 1, 2017

The borrow checker eliminates data races and hard memory errors, such as access violations. Panics are well within the bounds of safety.

From the docs:

Unlike C, Undefined Behavior is pretty limited in scope in Rust. All the core language cares about is preventing the following things:

* Dereferencing null or dangling pointers

* Reading uninitialized memory

* Breaking the pointer aliasing rules

* Producing invalid primitive values: dangling/null references, a bool that isn't 0 or 1, an undefined enum discriminant, a char outside the ranges [0x0, 0xD7FF] and [0xE000, 0x10FFFF], A non-utf8 str

* Unwinding into another language

* Causing a data race

<snip>

Rust is otherwise quite permissive with respect to other dubious operations. Rust considers it "safe" to:

* Deadlock

* Have a race condition

* Leak memory

* Fail to call destructors

* Overflow integers

* Abort the program

* Delete the production database

https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html

dom0 · on April 1, 2017

Deadlocks and race conditions are two IMHO typical examples of what many developers consider shortcomings of the platform / language, while they are actually application logic bugs that are hard to catch at a language level for all but the most trivial cases. So this is a very nice listing right there, starting with two things that some people may find surprising, then moving towards things that no one should find surprising :)

geofft · on April 1, 2017

Note that it's possible to use Rust's typesystem to build things that are deadlock-free and race-condition-free; see, for instance, https://aturon.github.io/blog/2015/08/27/epoch/ for an API for building lock-free data structures. The assistance you get from the Rust language is much better than the assistance you'd get from C or C++. (Rust has a lot of features other than memory safety!)

It's just that Rust doesn't consider it unsafe to build a design that permits deadlocks and race conditions.

BuuQu9hu · on April 1, 2017

There are languages like E and Monte where deadlocks aren't possible unless there is an actual loop in the data flow dependencies and race conditions are impossible, but they aren't commonly known.

dom0 · on April 1, 2017

> unless there is an actual loop in the data flow

Hence trivial. Making something provably deadlock-free where the message flow is a DAG is a trivial case.

> race conditions are impossible

I very much doubt that since race conditions are not only a technical issue (i.e. "is updating that counter atomic?"), but largely a specification issue. Recall that issues like lost updates (or all those file system races) are race conditions, as well as the many issues with non-atomic REST APIs.

burntsushi · on April 1, 2017

This bug has nothing to do with memory safety or borrowing. It's just an old fashion logic bug, and the panic prevents any memory unsafety from happening.

This bug was in the regex-syntax crate, which doesn't use any unsafe. The regex crate itself does have two uses of unsafe however (one for eliding bounds checks in the inner DFA loop and another for eliding bounds checks in Teddy).

chousuke · on April 1, 2017

"Safety" doesn't mean your program will never crash.

It means that when you mistakenly assume an invariant that does not hold in safe code, the worst that happens is a panic. This is much better than what could happen, since it stops the process in a well defined manner. In a memory-unsafe language, you'll be lucky to get a crash. Instead, you might suffer silent data corruption or a potential exploit, or both.

The code here made an assumption that didn't hold, and predictably hit a panic when that assumption was tested.

Manishearth · on April 1, 2017

Just to clarify here, "crash" in this context means "abort".

Some folks use "crash" to mean "segfault", which is related to safety. This can get confusing :)

benchaney · on April 2, 2017

I believe a segfault is a type of abort. It is just enforced by the kernel rather than being enforced in userspace. Or is there something else you are getting at?

kibwen · on April 2, 2017

A segfault has similarities to an abort, but the importance of the distinction isn't so much in how they're expressed as in the circumstances under which they occur. When someone says "my Rust program crashed", that usually indicates the program deliberately terminating because of a detected or deliberately-unhandled logic bug (like divide-by-zero), which casts no aspersions on the memory-safety of the program. Less frequently "my Rust program crashed" can mean an abort (e.g. because of panicking during unwinding, or because of deliberately turning off unwinding to translate all panics to aborts), but that also has no repercussions for the memory-safety of the program. But if you do get a segfault from a Rust program, that's a big deal, and in Rust terms one would be explicit and say "my Rust program segfaulted" rather than "my Rust program crashed", which leaves no ambiguity as to the severity of the implications.

Manishearth · on April 2, 2017

Yes, a segfault is a type of abort (well, "a type of unintended program death", since you can also define aborts to be SIGABRT only), but in some parlances, "crash" is used to refer specifically to segfaults. Or in general, kernel-enforced program death, or non-anticipated crashes.

Segfaults are usually indicative of bad memory safety issues; ones which were not anticipated.

For example, the following two pieces of code do essentially the same thing on a null ptr, assuming the first one triggers a segfault:

    int contents = *ptr;
   

    if (!ptr)
      abort()

    int contents = *ptr;

The difference is that the abort in the second one is an anticipated one. The program was written defensively, saying that "if the ptr is null here, please abort". In the first case, the program will crash with SIGSEGV, but it could also end up with the compiler exploiting the fact that it is allowed to assume that ptr is non-null when dereferenced and performing some optimizations that make bad things happen.

Also, segfaults don't just occur on null pointers, they occur whenever a pointer whose virtual address is currently not allocated by the process. A use after free, basically. In this case, you have two ways something can go -- it can either segfault (the best result!), or your pointer may by chance still point to memory owned by the process, just not the memory that it's supposed to. This can lead to all kinds of badness.

In Rust, for example, you represent nullables with `Option<T>` (like Haskell's Maybe or Java's Optional). Rust forces you to deal with the `None` case whenever you wish to interact with the Option. You have the ability to assert that the Option contains something and obtain the inner value via `unwrap()` or `expect()`. These will panic (unwind, but this can also be set to abort. for simplicity, let's just assume it's an abort). Usually ends with the program or thread exiting.). Now, runtime-wise speaking, `let x = nullable.unwrap()` isn't really that different from `int contents = *ptr;` when the latter is null and segfaults. Both end up with the program exiting, and with much sadness on the part of the user.

However, the first one was anticipatory. With the Rust panic, the programmer said "okay, I believe that this can never be None here, and if not, something is horribly wrong, please abort". Something went horribly wrong, and it aborted. With the segfault, there was no explicit acknowledgement, and it was the kernel -- not the programmer -- intervening and shutting down the program. Like I already said, the kernel intervening is the best possible result here, other results are much worse.

So while effect wise an explicit abort and segfault are quite similar, implication-wise they're quite different. A lot of people thus distinguish between the two by using "crash" to refer to segfaults and other memory safety crashes exclusively. There's not much consistency here, so I thought it would be worth clarifying -- "'Safety' doesn't mean your program will never crash." is false if you define crash as segfault :)

Manishearth · on April 1, 2017

Panics have nothing to do with Rust's safety guarantees or unsafe.

A panic is a way of signalling "uh oh the program made an assumption that doesn't hold, there's something horribly wrong, please exit". A fatal exception.

Most languages, including Rust, will not let you encode all invariants of a program into the type system. Only languages that come bundled with proof checkers can do that, and that's tedious. So you will have situations where you have to assume an invariant to be true, and if it isn't, there's something horribly wrong and the program should exit. That's a panic.

Rust makes sure that wrong assumptions about invariants can't lead to memory unsafety. It can't magically make bugs go away.

kibwen · on April 1, 2017

I don't think this comment warrants the downvotes it's receiving, as it's entirely reasonable to conflate panics with unsafety if one is only used to "crashing" in the context of C. It's a misconception that we get relatively frequently, so we may just need to make sure this is highlighted in our official tutorials.

carols10cents · on April 2, 2017

We have a whole chapter on error handling in the second edition of The Rust Programming Language[1] that goes into these issues in detail. When we revise chapter 1 (we're doing that chapter last), we're planning on defining what Rust means by "safe".

[1] - http://rust-lang.github.io/book/second-edition/ch09-00-error...

j1f4 · on April 2, 2017

> reasonable to conflate panics with unsafety if one is only used to "crashing" in the context of C.

I suppose it's possible that's what generates most instances of this question, but I had the same question when I was first introduced to Rust and it came from a different root.

I had been reading about Erlang, and from that context I expected "safety" to be about avoiding system downtime (e.g. gracefully handling failed asserts) as much as anything.

usernam · on April 1, 2017

The borrow checker can be [ab]used to guard for several forms of resource usage patterns, but make no mistake here: Rust doesn't mark the end of program bugs (or security bugs). It prevents some forms of bugs.

There will still be plenty of bugs to go around.

lmm · on April 1, 2017

The code chose to panic. What's interesting is why - it suggests the regex library has some invalid states that are representable, perhaps because the type system wasn't powerful enough to express the constraints that the regex library wanted to apply to its values.

burntsushi · on April 1, 2017

Is there a Turing complete language suitable for writing a fast regex engine with a type system that can prevent all possible invalid states? And if so, how much overhead is associated with said process?

lmm · on April 2, 2017

> Is there a Turing complete language suitable for writing a fast regex engine with a type system that can prevent all possible invalid states?

I suspect the only way to answer that for certain is to do it. I'd be vaguely interested in trying with Idris[1] or similar.

Looking at the code from Reddit it seems like the important distinctions in this instance - between whitespace and not, between the raw input string and the input string with whitespace skipped, between a character that has been confirmed to lie between 0 and 7 and an unknown input character - ought to be trackable in Rust as well though.

> And if so, how much overhead is associated with said process?

In theory it should be zero overhead at runtime. In practice of course one should have to see how it turned out in benchmarks.

[1] Which, sure, would require explicitly choosing possible nontermination when running the regex, if that's the desired behaviour. At one job my team deliberately added a step counter and limit to a regex matcher (abort after 10000 or so) as that was better behaviour than running indefinitely.

burntsushi · on April 2, 2017

Sorry, by overhead I meant developer overhead. But I think me and you have had this conversation before, and we fundamentally disagree.

Putting a limit on the regex engine won't fly. How big should that limit be? I use this particular regex engine to search files that are many GB in size. Some people might search bigger files. What should the limit be? Limits only work when you can restrict your use cases significantly, which is not a luxury I can enjoy.

The broader point I was trying to make with my question is that "invalid states aren't represented" isn't particularly interesting to me on its own. What's interesting to me is what it takes to achieve it. i.e., What do I have to give up to make it so? Developer time? Performance? Code complexity? Other practical things? Keep in mind that I'm after things that work, and not a hypothetically advanced-enough tool.

lmm · on April 2, 2017

> Putting a limit on the regex engine won't fly. How big should that limit be? I use this particular regex engine to search files that are many GB in size. Some people might search bigger files. What should the limit be? Limits only work when you can restrict your use cases significantly, which is not a luxury I can enjoy.

The limit just needs to be applied to backtracking (and other extended regex features that lead to turing completeness if there are any), not to input sizes or similar. I'm content to have a tool that errors if a given regex backtracks to the same point more than say 10000 times. I appreciate the principle of not restricting use cases, but in practice the overwhelming majority of the time that leads to the tool going into what's effectively a busy loop, which is not very user-friendly even if it's in some sense correct.

(If I had control over the whole system I would say that perl-style extended regexes are fundamentally the wrong abstraction - the whole original point of regular expressions was that they were expressions for regular languages that were efficiently implementable, and the rare use cases that require turing-complete parsing/matching are better addressed through languages that are designed as turing-complete. But often in practice one doesn't have that freedom).

> The broader point I was trying to make with my question is that "invalid states aren't represented" isn't particularly interesting to me on its own. What's interesting to me is what it takes to achieve it. i.e., What do I have to give up to make it so? Developer time? Performance? Code complexity? Other practical things? Keep in mind that I'm after things that work, and not a hypothetically advanced-enough tool.

It's interesting in so far as it affects defect rates rather than in and of itself, sure. My sense is that it's natural to avoid this kind of problem in a language that encourages you to represent this kind of distinction, and that the shift of style shouldn't carry a cost beyond that. I mean in this example there was a big explanatory comment above the panic-ing line about why the unwraps were safe, which comes with its own cost (both for the initial developer and for anyone reading the code); to my mind expressing that comment in a type system rather than prose would not have been substantially more expensive, and would have caught the issue. But of course that's a lot to claim without a concrete implementation showing these advantages in practice; most of the experience I'm drawing on is unfortunately private, and skepticism is quite warranted.

burntsushi · on April 2, 2017

That's all pretty fair. Maybe we agree more than we disagree after all.

Also, FWIW, Rust's regex engine guarantees linear time search. (Interestingly, it may decide to use backtracking to resolve the search, but it's only an optimization. And the backtracking is always bounded so that it never revisits the same state on the same input byte.)

lmm · on April 2, 2017

> Also, FWIW, Rust's regex engine guarantees linear time search. (Interestingly, it may decide to use backtracking to resolve the search, but it's only an optimization. And the backtracking is always bounded so that it never revisits the same state on the same input byte.)

Ah, fair enough - I saw "extended" and assumed that meant backreferences and the like (which I think require backtracking?). If the search is guaranteed-linear-time in any case then why the requirement for a Turing-complete language?

burntsushi · on April 3, 2017

> If the search is guaranteed-linear-time in any case then why the requirement for a Turing-complete language?

Because suggestions of using Turing-incomplete languages seem somewhat uninteresting to me. To be honest, I have never used one in anger, but if you told me to go write a production grade regex engine in Agda and Coq, I'd say you're nuts. It sounds like a fine experiment to do given infinite time, but I want people to actually use what I build. And I have a very strong suspicion that using one of those languages would take me 10 times as long to write it, if it's even possible at all.

I try hard to blend the practical with the theory. I'm not aware of any Turing-incomplete language that comes close enough to the practical side of things. I hope that changes as PL research advances. :-)

For the most part, my problem with your initial comment on the matter is that it neglects everything except for the "type system could have stopped it" part. But in reality, there are a host of trade offs at play given the actual choices available to me. (Even if those trade offs may be fundamentally orthogonal to the notion of type safety, they still exist and they are still things I must confront.)

I feel like an Agda/Coq implementation would be dead in the water pretty quickly. How easy or hard is it to manipulate raw byte buffers in those languages? Or memory layout? That's why Rust is such a nice tool for a regex engine, because it gives you access to the hardware at the level of C without any fuss at all, while still remaining reasonably safe.

A long time ago, we had a discussion about ASTs and property based testing. I noted that my "AST" couldn't be round-tripped with the concrete syntax accurately, and you noted that it was therefore not an AST. I'm in the middle of making that happen, and I've tried a bit harder this time around to use the type system to avoid invalid states. I do still have one that I haven't been able to rule out though, which is related to an invariant on a sequence of operations applied to a stack: https://github.com/BurntSushi/regex/blob/syntax-rewrite/rege...

codygman · on April 4, 2017

> To be honest, I have never used one in anger, but if you told me to go write a production grade regex engine in Agda and Coq, I'd say you're nuts. It sounds like a fine experiment to do given infinite time, but I want people to actually use what I build. And I have a very strong suspicion that using one of those languages would take me 10 times as long to write it, if it's even possible at all.

What if it took 3 times as long to write? I'm somewhat wildly guessing that would be unacceptable because from your perspective you don't see the benefits for it taking 3 times longer?

burntsushi · on April 7, 2017

Well, I understand the benefits, you get fewer bugs and presumably maintenance would be easier. I guess what I'm interested in is other trade offs. How close to the machine can I get with these languages? A fast regex engine these days requires SIMD, which means wherever you are needs to support using special CPU instructions. You also need to very carefully control allocations, and in my case as well, the memory layout itself.

I'm interested in all the possible trade offs here, not just the type safety trade off. (That's why Rust is so appealing to me, because it nails a lot of my requirements while also giving a fairly high degree of safety.)

lmm · on April 3, 2017

> Because suggestions of using Turing-incomplete languages seem somewhat uninteresting to me. To be honest, I have never used one in anger, but if you told me to go write a production grade regex engine in Agda and Coq, I'd say you're nuts. It sounds like a fine experiment to do given infinite time, but I want people to actually use what I build. And I have a very strong suspicion that using one of those languages would take me 10 times as long to write it, if it's even possible at all.

Idris is where my hopes would lie; while I've only played with it a little it doesn't feel like it takes a lot more work - indeed mostly it just behaves like Haskell[1].

> For the most part, my problem with your initial comment on the matter is that it neglects everything except for the "type system could have stopped it" part. But in reality, there are a host of trade offs at play given the actual choices available to me. (Even if those trade offs may be fundamentally orthogonal to the notion of type safety, they still exist and they are still things I must confront.)

Sure; I've got my own agenda/narrative and I'm talking about this bug because it aligns with that. (My belief that greater use of type systems is the best way forward for the industry is genuine and good-faith, but that doesn't mean I'm right)

> A long time ago, we had a discussion about ASTs and property based testing. I noted that my "AST" couldn't be round-tripped with the concrete syntax accurately, and you noted that it was therefore not an AST. I'm in the middle of making that happen, and I've tried a bit harder this time around to use the type system to avoid invalid states. I do still have one that I haven't been able to rule out though, which is related to an invariant on a sequence of operations applied to a stack: https://github.com/BurntSushi/regex/blob/syntax-rewrite/rege....

Glad to hear that.

The invariant comes from only ever adding alternations to the stack via push_or_add_alternation, right?

My first thought is to encapsulate the stack and the way alternations are added to / removed from it behind an opaque type. i.e. rather than Parser having direct access to the stack, it has a wrapped stack that it can push to (in a way that will always add alternations when necessary) and pop from (in a way that will provide it with the invariant - e.g. by returning an alternation-with-content or non-alternation). That doesn't offer any more proof, but practically it means the two halves of the invariant are in the same place and easier to review, and it's clear where the responsibility is.

Going further and looking to actually eliminate the unreachable, it seems like a stack with an alternation on top is treated differently both on push and pop (not just at the end but also in pop_group)?. Maybe those should be different types. Or maybe the representation isn't quite right - maybe it would be better for each alternate to be its own stack entry (i.e. the AST looks like Alternate(case1, Alternate(case2, Alternate(case3, ...))). That way seems like a better separation of concerns - on encountering a '|' you just push an entry onto the stack rather than having to push-or-mutate, and then when you know the alternation has finished (because you hit ')' or the end) you peek and pop all the alternations that are currently at the top (possibly zero, which I think should actually simplify pop_group because there are no longer two different cases). In turn that suggests a representation where AstGroup always contains AstAlternation (possibly of only one alternate) and the root value is always AstAlternation too.

I don't know whether that would work - maybe there're constraints I haven't thought of - but that's the kind of direction I'd be thinking in. Where there's a conditional, see if it can be moved into data (an enum or similar); where the representation has invalid states, look for a way to represent that avoids them. I think to a certain extent this awkwardness reflects a true awkwardness in the regex grammar - you can't tell whether what you're parsing is part of an alternation or not until you hit the | - but to the extent that it's possible to parse regexes it should be possible to come up with a representation that corresponds to what you do when parsing, and I feel like there's a correspondence/duality between operator precedence and deciding that the children of a certain node must be a particular type of node (e.g. if we wanted to parse arithmetic expressions consisting of +, *, () and variables/literals, we could say that both arguments to + are products (possibly of 1 element)).

[1] Including at runtime; if you're looking to write a library that can be embedded into other languages then Rust is indeed probably still your best option. One of my very long-term projects is to extend Rust (or a Rust-like language) with Noether-like functionality, but that's purely theoretical at this point.

eru · on April 2, 2017

There are Turing-incomplete languages doing that kind of thing. Look up Coq and Idris etc. No runtime overhead necessary.

(Turing incompleteness is not too much of a restriction for them in practice.)

burntsushi · on April 2, 2017

I know about them. That's why I added my criteria of Turing completeness. But perhaps I am wrong. Has anyone written a production grade regex engine in say, Coq or Agda? Can they prevent all possible invalid states? Can they support searching text of arbitrary length?

eru · on April 2, 2017

Arbitrary length shouldn't be a problem. Agda and co are not restricted to programs of finite runtime.

Not sure about production grade, but Google has plenty of results for "Agda regular expressions".

mattnewton · on April 1, 2017

Panics could be a safe way to handle faults at runtime. To me, "Unsafe operations" are more like ones that continue executing in a comprimised state, such as a buffer overflow, or that crash incompletely, segfault, locking up resources in a live lock, etc. Bailing and freeing all your resources seems harder to exploit and sometimes the only sane response.