Richard Hipp (creator of SQLite) had this to say about Rust and SQLite in the comments:
> Rewriting SQLite in Rust, or some other trendy “safe” language, would not help. In fact it might hurt.
> Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way. We know this, because we have tested the SQLite machine code – every single instruction – using many different compilers, on many different CPU architectures and operating systems and with many different compile-time options. So there is nothing wrong with the sqlite3.so or sqlite3.dylib or winsqlite3.dll library that is happily running on your computer. Those files contain no source code, and hence no UB.
> The point of Prof. Regehr’s post (as I understand it) is that the C programming language has evolved to contain such byzantine rules that even experts find it difficult to write complex programs that do not contain UB.
> The rules of rust are less byzantine (so far – give it time :-)) and so in theory it should be easier to write programs in rust that do not contain UB. That’s all well and good. But it does not relieve the programmer of the responsibility of testing the machine code to make sure it really does work as intended. The rust compiler contains bugs. (I don’t know what they are but I feel sure there must be some.) Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected. In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”. But whatever you call it, the outcome is the same: the program does not work. And the only way to find these problems is to thoroughly test the actual machine code.
> And that is where rust falls down. Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing. Nor are there multiple independently-developed rust compilers for diversity testing. Perhaps that situation will change as rust becomes more popular, but that is the situation for now.
One problem with this argument is that SQLite is primarily used as a source-level embeddable library. That is, most users of SQLite don't use the official binaries and instead build the source code themselves. So, in practice, the source code, not the official blessed binary, is what matters. When upgrading compilers, developers of apps that embed SQLite don't typically check to ensure that upstream SQLite has tested the new version of their compiler. They just upgrade their compiler and assume that the new version will continue to compile their SQLite source properly. If the new version of the compiler happens to compile programs with undefined behavior differently, then problems can arise.
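To make the "new compiler version compiles UB differently" risk concrete, here is a minimal sketch. It is not taken from SQLite; the function name `checked_add` is hypothetical. A common after-the-fact overflow test such as `a + b < a` relies on signed overflow, which is UB, so an optimizer is allowed to assume it never happens and delete the test. Checking before the addition avoids relying on any UB:

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns true and stores a + b in *out when the sum fits in an int.
 * Returns false and leaves *out untouched when the sum would overflow.
 *
 * The range check happens BEFORE the addition, so the signed addition
 * below can never overflow and no undefined behaviour is involved.
 */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```

The point is only that the pre-check version means the same thing under every conforming compiler and optimization level, whereas the post-check version may work with one compiler release and silently stop working with the next.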
He's not wrong: SQLite is a very well-tested piece of software, and probably the best that can be done safety-wise in C. Still, as pcwalton pointed out below, and as came up the last time this topic was discussed, there are some use cases where tests run on the entirety of the machine code do not guarantee that UB will not occur.
For one, it's quite likely that an embedded platform's toolchain will not be among the SQLite test configurations. Secondly, SQLite can be, and is, compiled directly into an application's binary, and this means that all bets are off, especially if LTO is enabled.
Thirdly, there are products that build on SQLite, such as its own commercial encryption extension and other extensions from third parties. The former probably enjoys the same level of testing, but it's not clear how the latter are tested.
The conclusion is that it's humanly impossible to write memory-safe C, even with 100% test coverage, static and dynamic analysis. Something like Frama-C is required, which is virtually unheard of for the majority of open source and commercial software.
> In the case of rust we get to call these “compiler bugs” whereas in the C-language world such occurrences are more often labeled “undefined behavior”.
C has both compiler bugs and undefined behaviour. Undefined behaviour is an inherent property of the C standard, while a compiler bug is a property of the implementation (a place where it doesn't match the standard).
A valid argument along the same lines might be that the Rust compiler has existed for less time and is used less than C compilers, and therefore is more likely to contain bugs.
> Because it is a newer language, it does not have (afaik) tools like gcov that are so helpful for doing machine-code testing.
Coverage tools such as kcov work on Rust. I'm not sure of the state of gcov itself, though.
> Nor are there multiple independently-developed rust compilers for diversity testing.
Isn't diversity testing only necessary/good because there are many C compilers? Using your phrasing, if the code compiles and runs correctly (i.e. every single machine instruction is checked) with the one Rust compiler that exists, then it works.
There are definitely many reasons why a language having multiple compilers is good, but I think "diversity testing" is circular logic.
Undefined behaviour is not a compiler bug - it is deliberate.
And having undefined behaviour in your C code is definitely not a good thing, even if it is basically unavoidable.
The real problem is that the C and C++ standards cop out to UB in too many places, e.g. type aliasing. People reasonably think "weeeell, it may be UB, but it works now and I need it, so screw it", and then you have a mess of programs relying on de facto non-standard behaviour, which is shit.
The C people just need to officially define some of the de facto behaviours.
Rust doesn't have this problem because it doesn't leave so many basic things undefined.
> And having undefined behaviour in your C code is definitely not a good thing, even if it is basically unavoidable.
If it's literally unavoidable, then the language specification is BROKEN.
Now, most C UB is avoidable, but it's very difficult to notice some UB, and most compilers aren't that good at telling you about the UB they exploit. In this sense UB is unavoidable in that human programmers may often write code with UB without noticing.
If it's only "practically unavoidable", not literally, then the language specification and/or the compilers (by failing to warn about it) are BROKEN.
You cannot blame C programmers, not anymore. The committee has been much too aggressive in its zeal to speed up C by adding more UB cases. We've reached the point where compiler outputs run very fast because all the important bits have been elided by the optimizer, breaking the program in the process. We, the users of the language, have been pushed to the breaking point by the committee and the compiler groups. Please stop. And don't just stop, revisit some of the worst UB decisions.
Yes, even C89 had lots of footguns, but UB was much more manageable.
The only reasons I myself have not yet abandoned C are: a) I haven't learned Rust yet, b) many codebases I work with are C codebases and won't get rewritten in Rust anytime soon, c) it takes time to get enough critical mass. (c) is happening though, and (a) is, for me, just a matter of time; (b) I can solve by moving on to new things, but the world is full of legacy code that we can't just abandon/rewrite, so moving on isn't exactly likely.
> The C people just need to officially define some of the de facto behaviours
Sure, as soon as all the different ISA people officially define some of the de facto behaviors. UB isn't in the standard "just because"; it's in the standard because there is no apparent underlying standard.
Aliasing rules, for example, have nothing to do with ISAs. Neither do pointer comparison rules, and many others besides.
The rule about memcmp() with invalid pointers and length zero does have to do with actual systems, but it can still be standardized, and the vendors with now-non-compliant memcmp() implementations would just have to fix them. This has happened before (e.g., snprintf()), so the ISA thing is a total cop-out.
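For readers who haven't run into the memcmp() rule: under the C standards being discussed here (C11, for example), the pointer arguments to memcmp() must be valid even when the length is zero, so `memcmp(NULL, NULL, 0)` is formally UB. Until the standard defines that case, a defensive wrapper is one way to sidestep it. This is a sketch only, and the name `memcmp_zero_ok` is made up for illustration:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper, for illustration only.
 *
 * Treats a zero-length comparison as "equal" without ever handing the
 * pointers to memcmp(), so callers may pass NULL (or any other invalid
 * pointer) together with n == 0. For n > 0 it behaves like memcmp().
 */
int memcmp_zero_ok(const void *a, const void *b, size_t n)
{
    if (n == 0) {
        return 0;
    }
    return memcmp(a, b, n);
}
```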
Weird ISAs are exactly why you cannot compare pointers. Segmented memory for one. Or imagine an OS and compiler that implemented automatic overlay switching. With that and PAE on 32-bit x86 systems you could have special "far overlay" pointers returned from malloc calls which would map in different 1 GB overlay sections when accessed.
Aliasing rules are important in some ISAs too. Like weird DSPs. Imagine a system where 32-bit objects can't even share the same memory space as 8-bit objects. Casting a pointer to a different sized type is completely meaningless there. Of course programming such a weird thing is usually done in assembly, but there are C compilers.
I'm not familiar enough with C on segmented architectures, so I can't quite speak to that, but I was referring to [0], which clearly has nothing to do with segmented architectures.
As to aliasing, ISAs had nothing to do with the reason for the aliasing rules either; the motivation was rather to enable optimizations for functions like memcpy() (as opposed to memmove()).
There are other aliasing rules that have big performance impacts on certain architectures. The Xbox 360 and PS3 Power cores for example had a severe load-hit-store performance penalty that tended to be triggered by code that moved data between floating point and integer registers via memory. Strict aliasing rules that allow the compiler to assume float and int pointers don't alias could make a huge performance difference but those rules are also the source of much troublesome undefined behavior for code that does intentional type punning.
The ISA in this case requires going via memory to move data between fp and integer registers and certain implementations of that ISA had major performance impacts associated with that. In this case UB rules really did allow for valuable optimizations but really did cause trouble elsewhere.
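As a concrete instance of the intentional type punning mentioned above, here is a minimal sketch of reading the bits of a float. The function name `float_bits` is hypothetical, and the sketch assumes `float` is a 32-bit IEEE 754 type, which the C standard does not guarantee. Casting a `float *` to `uint32_t *` and dereferencing it violates the strict aliasing rules; copying the bytes with memcpy() is well defined, and optimizing compilers commonly turn the fixed-size copy into a plain register move:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Assumes sizeof(float) == sizeof(uint32_t) and IEEE 754 single precision.
 *
 * The UB version would be:   return *(uint32_t *)&f;
 * which reads a float object through an incompatible pointer type.
 * Copying the object representation with memcpy() has defined behaviour.
 */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```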
I don't find that comment very impressive, but further down he has commented some more, and I'm fully in agreement there:
"The disagreement is not over whether or not UB is a problem, but rather how serious of a problem. Is it like “Emergency patch – update immediately!” or more like “We fixed a compiler warning” or is it something in between."
In many cases, what is considered UB in the C standard is a widely accepted and documented extension in the vast majority of compilers. For example, non-strict aliasing is UB in ISO C, but is a defined language extension in MSVC, GCC, clang, and many other compilers.