Hacker News new | past | comments | ask | show | jobs | submit login
Finding bugs in SQLite, the easy way (lcamtuf.blogspot.com)
340 points by robin_reala on April 14, 2015 | hide | past | favorite | 69 comments



I actually (as a big C and C++ fan) quite a damning result for C.

Its easy to say "just programming well" when bugs occur in badly written projects, but SQLite is so well tested, and generally considered well written.


As others have said, parsers for complex formats (be it text-based or binary) are exceptionally hard. There are some classes of C/C++ software where the choice of a language doesn't have such a striking effect. For parsers, the effect is hard to ignore.

But parsers are also the kind of stuff you almost always end up writing in C/C++, and there are semi-compelling reasons for doing so - chiefly, performance and flexibility. You can disagree and make your pitch for Ocaml or JavaScript or whatever, but really, if we had clearly superior choices, we wouldn't be dealing with this problem today (or it would be a much more limited phenomenon). There are some interesting contenders, but the revolution won't happen tomorrow, no matter how much we talk about it on HN.

Perhaps a more fitting conclusion is that if you are parsing untrusted documents, our brains are too puny to get it right, and the parser really needs to live in a low-overhead sandbox. Mechanisms such as seccomp-bpf offer a really convenient and high-performance way to pull it off.


I agree with sandboxing, but a problem is that by using something like seccomp-bpf, you're turning highly portable C code into a highly platform-specific and even architecture-specific system. (Well, it won't be architecture-specific if you use the old plain seccomp, which is probably all you need.)

You also need to deal with serialization and deserialization of the parse tree, which isn't super hard, but people still get it wrong.

Writing parsers is already hard, but writing secure parsers is at least 2x harder than that, or maybe 4x if you have to port it to 3 platforms. There really needs to be some kind of "libsandbox" for parsers.


There are some other interesting choices. Like Rust, or Cyclone.


I think that Haskell's parser combinator libraries, like Parsec, are clearly superior for most parsing tasks.

Parsec is about as concise as a formal grammar, but it is real code rather than a code generator, so you don't have the extra complexity of a parser-generator. Parsec parsers are type-safe, so there's no way that you'd get something back that wasn't a valid AST (or a parse error). Error handling is also quite good.

Parsec in particular is not especially fast - there are other libraries like attoparsec and cereal (for binary serialization) which trade off some of the flexibility of parsec for improved performance.

I think Haskell programmers mostly use these libraries, because they are clearly superior to the alternatives (for instance, you get a lot less interest in RegEx in Haskell-land, because when you have a really good parsing library there's much less reason to use them). C programs like SQLLite aren't using this approach because monads can't be expressed in the C type system, and there is no syntax sugar for monads in C, so Parsec would end up much less pleasant and much less safe than in Haskell.


In C++, we have Boost.Spirit, and for C there is Bison/Flex. Surely those are better and more safe alternatives than hand-rolling your own parser?


Many of the horribly vulnerable parsers are generated with Bison / Flex, so it's not exactly a robust solution. Plus, especially for binary formats (images, videos, etc), it's hand-written or bust.


Marpa is amazing.


I agree, but lexers and parsers are especially stateful pieces of code.

I'm not a static typing person (I like Python, bash, and C), but I have resolved to write all future parsers in OCaml or a similar language.

For other types of code, if you don't have a code path explosion, you can do OK in C. For example, numerical simulation code.

Unfortunately async code in C like Nginx tends to also produce this code path explosion. And kernel code is also littered with state machines.

Another interesting thing is that SQLite's parser is generated with Lemon. It's not even a hand-written parser in C (like Lua), which would be much worse. I guess all the crashes were in the semantic actions of the parser?


I think you shouldn't write parsers by hand at all and instead generate them automatically -- without semantic actions. I wrote a parser generator for tricky binary formats a few months ago - https://www.usenix.org/system/files/conference/osdi14/osdi14... and there's a paper explaining it here https://github.com/jbangert/nail


I've written parsers with ANTLR, lex/bison, and other tools, and I've hand written parsers.

With the generators, it's very hard to get several features of a high quality parser right. In particular error messages and error recovery, but also handling some more esoteric grammars (in particular ones that are not entirely unambiguous, LALR(1), etc). Using a parser generator also complicates your build chain, complicates refactorings, usually means you get bad source tooling, and often slows down the edit/test cycle quite a bit.

At the same time, if you're doing it right, writing a simple LL(1) parser by hand really isn't that hard, and the resulting code doesn't have to be much longer than the spec in your generator's language. Even in languages that are not that DSL-friendly (e.g. Java), you can end up with very compact, readable code. Plus you get all your usual tooling to do so (e.g. IDE support).


Error messages and error recovery are to some extent red herrings. You really do not want a production system to accept wrong inputs (and especially not try to 'recover' them into a usable parse tree). This will definitely make your parser ambiguous and can have very real security implications (say, microsofts anti-xss transformations introducing XSS). As to LL1 - many real world protocols can't be parsed efficiently as LL1.


Error messages really are core feature of parsers. Would you want SQLite to just "return false" whenever there was some syntactical error somewhere in your query?

With error recovery in the context of a parser I mean the ability to continue parsing with predictable results after encountering an error. This is not about automagically correcting errors, it's about being able to report more than one error to the user. Returning just the first error encountered sucks, as does returning a slew of non errors caused by your parser getting messed up.

    As to LL1 - many real world protocols can't be parsed efficiently as LL1.
Not sure what you mean with efficiently - a language either can or cannot be parsed as LL(1) because it's in that language class or not. But in any case, it's still very straightforward to make LL decisions with a longer lookahead in hand-written code, and the decision code is often more efficient than what a generator would create.


Cool, I printed out your paper a few months ago and will give it another look!


The way I write parsers nowadays is to write a pseudo-code description of the grammar, then treat that pseudo-code as data that must be transformed into a valid parser.

The end result is a fast, memory-safe parser in C or C++, and a whole lot of ugly python code :)


I used to entirely use hand-written recursive-descent parsers, until I discovered packrat style parsers. They are crazy fast and so easy to write. Also, the parser generator I use (esrap) lets you fall-back on recursive decent for parsing rules.


    They are crazy fast
Do you have a link/citation for that? I always thought they were asymptotically linear due to the memoization, so comparable to LL, LALR etc, but in practice much slower due to the overhead compared to those more traditional approaches.


For me, I expect Esrap to end up within a factor of 2-4 of a hand-tuned recursive decent parser on sbcl.


That's interesting -- do you have examples somewhere?

Are you saying you write an parser-specific parser generator in Python for every problem, with the goal of producing readable C or C++? (As opposed to the sometimes unreadable output of parser generators.)


I think it'd actually be better to use a hand-written recursive-descent parser. With RD the state is implicitly in the position of the code and the stack, so there's no need to explicitly manage it. AFAIK the SQL grammar isn't unusually difficult to parse.

I've used Flex/Bison before, and it makes it easy to get started, but with more complex parsers, threading the state around quickly gets irritatingly annoying and I could see how bugs could arise from it.


Can you please elaborate why OCaml is a good choice for writing parsers?


I usually post the VandenBerghe link too.

It's basically advocating the ML subset of OCaml, Haskell, and now Rust. Those languages all feature algebraic data types and pattern matching, which were pioneered in ML. Rust is basically a cross between ML and C++, with C-ish syntax.

http://en.wikipedia.org/wiki/ML_(programming_language)

(None of C, C++, Java, JavaScript, Python, etc. have those features, and thus they are all suboptimal for writing parsers.)

A good clue is the name -- ML stands for "meta-language". It's literally a language for describing languages.

ADTs and pattern matching are basically two sides to the same coin -- data vs code. ADTs are how you describe variants in the data. Specifically, "sum types" are essentially type-safe tagged unions (if you are a C person). And pattern matching is how you describe variants in the code. The compiler checks that you have considered all cases.

This helps you tame code with nested conditionals. Nested conditionals give rise to the state space and code path explosion that I mentioned.

Here is more recent explanation:

https://queue.acm.org/detail.cfm?id=2038036

Take a look at the OCaml vs Java code snippets.



Put me in the "most boring interests ever" category, but I'd be interested in a discussion of the actuarial modeling DSL mentioned in link if said write up exists.

I imagine actuaries are about the most technical "nontechnical" users there are so a DSL seems a particularly good fit.


Your sentence to missing a word.

Nonetheless, I agree. Eventually one cannot continue putting blame solely on the proximal cause.

It's too bad that given the number of extensions to x86 there have been, that it still doesn't have anything like native array bounds checking. Even something as simple as "any writes inside the stack outside my stack frame and all reads inside the stack and after my stack frame should assert" (on an opt-in until the next return) would do wonders towards preventing some of these sorts of errors, although not all.


It's too bad that given the number of extensions to x86 there have been, that it still doesn't have anything like native array bounds checking

It's called the BOUND instruction, and has been there since the 80186/188:

http://en.wikipedia.org/wiki/X86_instruction_listings#Added_...

Ironically, AMD removed it in AMD64 and Intel followed them, but it's still there in 32-bit mode.


Because individual instructions doing the same thing are faster. Though it's especially slow because it's microcoded and is left just for backwards compatibility and not trying to be fast. So modern CPUs are good at reordering, predicting branches, etc., for simple instructions and no extra silicon needs to be wasted to duplicate that. And encoding space is limited as well.


If individual instructions are faster, that means the instruction decoder isn't as good as it should be. BOUND wasn't fast because it didn't get any use, and it didn't get any use because people thought it was slow... but now that the usefulness of this instruction is apparent, it seems to me that they should make the decoder issue the necessary uops and make it at least as fast as the (bigger, more cache-wasting) sequence of simpler instructions.

Instructions like MOVS and STOS used to be slower than the equivalent series of simple instructions, but now they're much faster. Even the more obscure and far less useful BCD arithmetic instructions have been made on-par with a longer series of equivalent instructions[1] so not doing the same for the much more useful BOUND is... unusual. The amount of silicon needed to implement MPX (which includes several new instructions) seems far more than it'd take to make a BOUND decode faster.

[1] https://news.ycombinator.com/item?id=8477585


BOUND didn't get much of an use because it is useful only for bounds checking of the "dump core if outside" kind.


Why? (Honest question).


Because the instruction raises a software exception if the test fails. This is a rather heavy-handed reaction, although I guess it kind of makes sense from the instruction's point of view I guess.

Of course a language runtime based on using this instruction needs to capture the exception and handle it, core-dumping need not be the only option at that point.


Part of the issue is that the interrupt number, 5, is hard-wired. So there really was no need to keep it alive.


Divide-by-zero has always been interrupt 0, and the page fault is at #14, and that hasn't been problematic...

BOUND does raise an exception if it fails but I think that's a good thing - if code is doing array bounds-checking, OOB accesses are very likely to be a bug (and possible security vulnerability) so aborting execution by default is the right thing to do.


I feel like this suggestion is a little misplaced. To use a scheme like this you must ultimately tell the hardware what the bounds are, and that layer is going to be error prone in the same ways as what it replaces.

From this vantage point I can say the abstraction that already exists for this is the page table. The MMU knows what regions are mapped and it will generate a fault for an out of bounds access. But within mapped regions the array bounds are all fiction. As they should be in the eyes of the hardware. That is for higher level layers to worry about.

I feel like the c-haters don't get that some layer of the system must work this way. By all means build your higher level stuff on top. But things like virtual memory and allocators are well understood, frankly when it gets to that level there is little reason for it to give a damn that a[] has an element count of 5. At some point something must chop up larger blocks into smaller ones, and create new bounds from nothing.


This would require using zillions of VM areas for all your language level objects, and aligning them on page boundaries. It's really not going to work.

Have a look at SoftBound and HardBound for some of the current thinking in this area that scales to object level granularity.


I think you misunderstand me. I am advocating the status quo, not for more granular pages.

Anyway from what I have seen some high level languages are actually using the MMU for some things. For example, a null check at every access would be kind of lame, from what I have seen high level language runtimes just let it fault on the zero page and handle it by raising exceptions. (I am 100% sure CLR does this, I believe JVM does too.)


On 32-bit x86, you could use segment registers to get bounds checking with overhead only at setup and not at use, effectively making bounds checks free in hot loops (http://www.blackhat.com/presentations/bh-usa-05/BH_US_05-Chi...). I believe that SpiderMonkey uses this trick. Doesn't work on x86-64 though.



Yes I have, but MPX is primarily for a different use case.

Also, MPX has performance issues. See [1].

Also, it has false positives (!). Not a good thing.

[1]https://code.google.com/p/address-sanitizer/wiki/IntelMemory...


Hmm... according to that page, the false positives and performance issues seem to largely derive from using the cache structure to store bounds externally. But if you manage the bounds explicitly as part of your data structures, and use BNDMOV/BNDMK rather than BND{ST,LD}X, it should be reasonably fast. What else could Intel do?


Have a read of the linked page again. A large chunk of the performance issues are because in many cases you can't just "use BNDMOV/BNDMK rather than BND{ST,LD}X".

There are a number of data structures where managing bounds explicitly has... issues. Namely, you end up with a lot of overhead. And the entire purpose of hardware support is to prevent the overhead.

Something like, as I said, an opt-in assert when you read past the end of your stack frame, or write before or after your stack frame, doesn't have the overhead. It doesn't prevent a lot of things - about the only thing it prevents is stack smashing - but it's better than nothing and doesn't have the overhead.


That will only work for direct writes to array variables on the stack, or pointers which can be proved to live on the current stack frame, but those are quite rare; e.g. in the vulnerability mentioned in the post, it's sprintf doing the overflowing write into its parent stack frame, but sprintf could be called with any pointer. If, say, you narrow the trapping region to some unused region of the stack frame to try to avoid this, you can get slightly better protection than a stack canary, but not much better. And most overflow vulnerabilities in modern software seem to be on the heap anyway.*

(Which is not to say primitives based on fine-grained tagged memory couldn't do some interesting things; OpenRISC has some form of this, but I haven't looked into it in detail.)

I don't see content on the linked page corresponding to data structures where managing bounds explicitly is difficult, other than the bnd_variable_size bit, which is described as rare.

*corrected to state that I'm only considering overflows, not, e.g., use-after-frees, which aren't related to bounds checks


Theo de Raadt, one month ago (not the only and first mail in this (derailed) thread though):

>A review of libsqlite source code will demonstrate that it is written

>using many old practices of coping with "older systems". Many of the

>same techniques that caused unneccessary risk in OpenSSL.

https://marc.info/?l=openbsd-misc&m=142566734914439&w=2


Note that the SQLite authors intentionally did that. It runs on many many systems including ones with trivial amounts of RAM. They have ancient compilers. I've waged a many year campaign to get them to at least use size_t and similar "modern" constructs, but no dice.

They also only use techniques that are at least 20 years old, to avoid patent issues.


Of course they did. So did the OpenSSL developers. It wasn't an accident. It's simply a design decision that has a set of costs associated with it.


Is there a list of 'best practices' that OpenBSD follows for C code or anything that points out what these 'old practices' are?


I don't think so. There are some other mails (like https://marc.info/?l=openbsd-misc&m=142563704303304&w=2), but I also think now I should've used a different quote.


Not really. There is this list [0] of security considerations for ports, which is a good starting point. Everybody writing in the same style [1] doesn't hurt either.

[0] http://www.openbsd.org/faq/ports/guide.html#PortsSecurity

[1] http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man9/...


Thought the same, because if you look at what he found..

>ranging from NULL pointer dereferences, to memory fenceposts visible only under ASAN or Valgrind, to pretty straightforward uses of uninitialized pointers (link), bogus calls to free() (link), heap buffer overflows (link), and even stack-based ones (link).

.. it seems that all those bugs are not even possible in Rust.


... in safe Rust, yes. But Rust isn't perfect, we _will_ have these kinds of issues crop up, especially with regards to unsafe code. Rust will significantly reduce these kinds of issues, but not completely eliminate them. There is no silver bullet.

We've just started unleasing afl on Rust code, and it can still find issues: https://github.com/rust-lang/rust/issues/24276


Let it be said that the Rust community believes in defense-in-depth. A low-level language with safe defaults is only one piece of the equation; a complement rather than a supplement to testing, auditing, fuzzing, sandboxing, and so on.


Is the stack trace saying that the Rust parser tickles a bug in jemalloc or somehow uses it unsafely?

A parser (which uses malloc) seems like a pretty basic use case for 100% safe code.

When I think of unsafe code, I think of needing to make raw syscalls, libc calls, or inline assembly. Not string manipulation and malloc.

I am not super familiar with Rust, but I imagine you don't have to use unsafe {} every time you need to malloc, right?


The only reason that the parser was using unsafe code is because it is an extremely old part of the codebase, and likely required the unsafe blocks to hack around some deficiency in the compiler from Rust 0.3 or alike. Recent efforts to overhaul the parser have reduced this dramatically: there are two legitimate remaining uses of `unsafe` (dealing with libc stuff), and four uses of `unsafe` that all have associated FIXMEs.


Kibwen is correct with regards to the parser, but I'd also like to mention that in Rust, you generally don't call malloc yourself. It's actually not even possible in stable Rust, though a high priority API we'd like to stabilize.


I think "damning" might be a bit much -- it's got bugs. Like everything on earth, it's not perfect.

I was going to say it's full of meta-programming too, which should allow for simple fixes to have far-reaching implications (indeed, bugs would have same effect), but looking at the repo and build, I'm not seeing signs of meta-programming in effect. Fossil[0] (also initiated by Richard Hipp, and used to (created explictily to) manage the sqlite codebase) however, is a meta-programming example. Point is: C is not untenable, still offers a lot, and has great tooling built up around it. Don't give up on it, but know it's strengths and weaknesses.

[0] http://fossil-scm.org/index.html/doc/trunk/www/index.wiki


Weakness: the best programmers in the world are empirically unable to write secure code in this language. The errors that cause the insecurity are trivial to prevent via language design, but very difficult for humans to reliably avoid when not protected by the compiler.

It's a pretty big one.


I am with you, as a C/C++ programmer we just need to try harder to make correct code.


It's almost like watching someone with Stockholm syndrome :P

(jk <3)


I know you meant that as a joke but I think there's some truth there. It's so easy to become personally invested in the tools we use and instinctively defend them (even to the point of blaming outselves for our tools' shortcomings) rather than try to switch.


Never give up, never surrender.


Given all the platforms and hardware and constraints sqlite has to run on, what would you suggest replacing it with.


Who has the best automated fuzzing team... how fast can they grab this stuff (seconds from major open source repository commit time to automatic generation of functional exploit)? There's sure to be teams out there in the seconds/minutes range for a broad range of cases already.


Link to afl-fuzz is broken in the post, should be http://lcamtuf.coredump.cx/afl/


> I realized that the developers of SQLite maintained a remarkably well-structured and comprehensive suite of hand-written test cases in their repository

Tcl for the win[0].

[0] https://www.sqlite.org/testing.html


I guess you can run afl-fuzz along side LLVM's LibFuzzer for even better results. Fuzzing is getting very interesting. =)

http://llvm.org/docs/LibFuzzer.html


Does anyone have a good writeup on getting started with afl-fuzz? I'm a Python developer and looked into it the other day for fuzzing Python programs, but gave up after a bit because of the documentation (it seems geared towards lower-level languages).


I'm sure you have checked out the official documentation (http://lcamtuf.coredump.cx/afl/README.txt)?

Fuzzing Python programs with AFL will be hard. AFL leverages a version of the GCC compiler to instrument (add additional code to) the resultant binaries. Because Python is not a compiled language this will be difficult. I'm sure there is something you could do to make it work.


There's a wrapper for CPython to get it to implement the tracing code, afl-python[1]. Alex Gaynor did a basic intro to it a few days ago which is a better write-up that the project itself has! Of course, this doesn't work so well with mixed C/Python code.

[1]: https://bitbucket.org/jwilk/python-afl [2]: https://alexgaynor.net/2015/apr/13/introduction-to-fuzzing-i...


I did read the docs, and that's the conclusion I arrived at as well. I assumed there would be a layer that generated the inputs and checked the outputs, and that I could just use that, rather than use the full code checks, but it looks like that's not the case.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: