I would assert that virtually every non-trivial C/C++ program (>100KLOC) contains undefined behavior.
Here's a list of undefined behaviors that's almost impossible to get rid of:
* Data race. This is particularly fun for anyone who started doing multithreaded code pre-C11/C++11, since volatile does not let you use code across multiple threads without locking.
* Signed integer overflow. Are you sure that there is absolutely no input to your program that would cause one of your thousands of signed arithmetic operations to overflow?
* Buffer overflow. This is something like 90% of all security vulnerabilities.
* Uninitialized variables. Note that -Wuninitialized doesn't catch all cases, although this is relatively easy to mitigate with a paranoid style guide.
* Strict aliasing rules. Better yet, if you have any sort of custom memory allocation scheme, you're pretty much guaranteed to break this, since the only way you can access an object with a dynamic type of bytes (signed/unsigned char) is via signed/unsigned char. Functions like memcpy or malloc cannot legally be written in C without breaking this behavior. Also, there is not (to my knowledge) any dynamic checker for violations of this property, unlike the other things in this list.
Your program probably has undefined behavior. You just don't know it yet, and your compiler hasn't figured out how to squeak out a 0.5% speedup from screwing you over because of it yet.
> I would assert that virtually every non-trivial C/C++ program
First, citation needed.
Second, the programs I work on would fall outside of your "virtually every".
> [big list of things]
This list doesn't prove anything. There's a similar list in the standards themselves. That's how we know what to avoid.
Oh, and we have an in-house static analysis program that catches all of those. And more.
Our code simply has no undefined behavior. It passes our static analysis program, it passes UBSAN, and no compiler has ever miscompiled our code (unless it was due to a compiler bug).
You can try all you like, but there's no way for you to convince me that our code has undefined behavior. And there are plenty of projects out there similar to ours.
> if you have any sort of custom memory allocation scheme, you're pretty much guaranteed to break this
Not true. You should learn about the aliasing rules before you speak authoritatively about them.
> Functions like memcpy or malloc cannot legally be written in C without breaking this behavior
Also not true.
> Your program probably has undefined behavior.
The chances of that are much less than the chances of you not knowing what you are talking about.
In that case (sqlite) code that passed UBSan, ASan, valgrind, and compiled correctly on all current compilers was studied. A new dynamic undefined behavior checker found additional UB defects at a rate over 1 per thousand lines of code.
I would believe your codebase could have a defect rate one, maybe even two orders of magnitude lower. But short of a formal code-correctness proof, better than that seems unlikely.
It is nothing short of sheer hubris to believe that you have avoided all undefined behavior, particularly given that there exist undefined behaviors that have no extant static or dynamic checkers (hi, strict aliasing).
I do know the strict aliasing rules quite well. As I said in a cousin post, the set of permissible accesses to an lvalue is governed by the dynamic type of the object. As a consequence, strict aliasing queries are not symmetric, which is to say, a P* could point to an object of type Q but not vice versa. The case where this will come up is with signed/unsigned char. If you have a char foo[]; as the dynamic object, it is positively illegal to access that with anything other than unsigned or signed char. This is what really screws up a lot of code.
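For what it's worth, here is a minimal sketch of the usual way around the `char foo[]` case: copy the bytes with `memcpy` instead of dereferencing a cast pointer. `load_u32` and `store_u32` are names made up for illustration, not functions from any real library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a uint32_t whose bytes start at buf[offset].
   Writing *(const uint32_t *)(buf + offset) instead would access a char
   array through a uint32_t lvalue (and could also be misaligned).
   memcpy copies the bytes, which is permitted for any object. */
static uint32_t load_u32(const unsigned char *buf, size_t offset)
{
    uint32_t value;
    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}

/* Hypothetical helper: write the bytes of value starting at buf[offset]. */
static void store_u32(unsigned char *buf, size_t offset, uint32_t value)
{
    assert(buf != NULL);
    memcpy(buf + offset, &value, sizeof value);
}
```

The offset does not have to be a multiple of 4, since only bytes are copied. Compilers commonly turn a fixed-size `memcpy` like this into a single load or store, though that is an optimization and not a guarantee.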
> Functions like memcpy or malloc cannot legally be written in C without breaking this behavior
> Also not true.
You cannot implement malloc in C because malloc always returns a pointer that's not a part of an existing allocation. The malloc in libc has special dispensation from the compiler to do this (GCC "malloc" attribute).
Similarly you can't implement pthread mutexes in C because they imply optimization barriers (all global memory might change) that a C function with a visible implementation wouldn't have.
Strict aliasing is the only item on your list that I could see leading someone to get upset with the optimizer. And, maaaaaaybe signed integer overflow. With strict aliasing, I don't think your examples are correct. They all involve typecasting pointers, but don't involve interacting with the original type later. So, memcpy uses a void* as char* and MyAllocator uses char* as MyType*, but in both cases they never go back and expect predictable new values in the original types.
The rest are clearly taught in CS100 to be doorways to chaos. In particular, complaining that volatile doesn't get rid of the need for locks is just making noise. Might as well complain that auto doesn't get rid of the need for locks. They are both unrelated to threading.
And of course the code is wrong, because the behaviour is undefined - on 64-bit architectures it is very common to use 64-bit registers for `int` even if in memory `sizeof(int) == 4`; thus with `int a = INT_MAX;`, `a + 100` could still be greater than `INT_MAX`.
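A minimal sketch of how the `a + 100` case can be guarded: test the range before adding, so the overflowing addition is never executed. `checked_add_int` is a hypothetical helper, not a standard function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: store a + b in *out and return true only when the
   sum is representable as int. The comparisons themselves cannot overflow:
   INT_MAX - b is evaluated only for b > 0, INT_MIN - b only for b < 0. */
static bool checked_add_int(int a, int b, int *out)
{
    assert(out != NULL);
    if (b > 0 && a > INT_MAX - b)
        return false;   /* sum would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b)
        return false;   /* sum would fall below INT_MIN */
    *out = a + b;
    return true;
}
```

Checking after the fact with something like `a + 100 < a` does not work, because the compiler may assume the overflow never happens and fold the test away.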
> Data race leads to subtle bugs on all languages and runtimes, including Java and C#. They are just not called undefined behavior in those languages.
That's because they aren't. You might get incorrect results or exceptions or whatnot, but NOT Undefined Behavior aka. nose-demons. What can happen in cases of data races is always constrained by the VM model. (Obviously, this is modulo bugs in the actual VM implementation, but that probably goes without saying.)
In practice, undefined behavior is unpredictable because the compiler optimizes under the assumption that you do not have any. When you do, the optimizations no longer preserve the "as-if" rule.
The same applies to Java/C# when a data race is present. The JIT must be generating optimized code on the assumption that no data race occurs, because it is impossible to detect or correct them (at least in current implementations). When you do have a data race, the bugs will be as subtle and as Schrödinger-like as those from a data race in a C program.
Data races in Java/C# will result in incorrect values (which may be serious if your program is doing anything important, and subtle to find, as you say), but they will never result in "undefined behavior" in the way ISO C defines the term. Specifically, a data race in Java/C# will not cause your program to do out-of-bounds memory accesses, use memory after it is freed, or corrupt unrelated objects, because other parts of the respective VM specifications prevent such outcomes.
There is a very important bit of strict aliasing that your statement misses. Yes, char* and unsigned char* can be used to access any lvalue. However, only signed/unsigned char* can be used to access a signed/unsigned char array. (Strict aliasing is defined via the dynamic type of an object, not via a list of type pairs that must not alias. Symmetry is not inherent, and in the case of char*, it is indeed not present.)
When you write a custom memory allocator, you request memory with `malloc` or `mmap` first. What those functions return is raw storage, not an object whose dynamic type is a character array. Neither is the return value of your custom `allocate`. The strict aliasing rule does not apply.
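To make that concrete, here is a rough sketch of a bump allocator built on `malloc`'d storage. `Arena`, `arena_init`, `arena_alloc` and `arena_free` are hypothetical names. It assumes the reading above: the block has no declared type, so each region takes on a type when the caller first stores to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump allocator over one malloc'd block. */
typedef struct {
    unsigned char *base;  /* start of the block; used only for address arithmetic */
    size_t size;          /* total bytes in the block */
    size_t used;          /* bytes handed out so far, including padding */
} Arena;

/* Returns nonzero on success. */
static int arena_init(Arena *a, size_t size)
{
    assert(a != NULL);
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* align must be a power of two no larger than the alignment malloc
   guarantees; offsets are aligned relative to the malloc'd base. */
static void *arena_alloc(Arena *a, size_t n, size_t align)
{
    assert(a != NULL && align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->size || n > a->size - start)
        return NULL;  /* not enough room left */
    a->used = start + n;
    return a->base + start;
}

/* Releases the whole block; pointers handed out earlier become invalid. */
static void arena_free(Arena *a)
{
    assert(a != NULL);
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

A caller would write something like `int *p = arena_alloc(&a, sizeof *p, _Alignof(int)); *p = 42;`. The allocator never reads or writes the storage through `unsigned char`, it only computes addresses. A version backed by a static `char pool[N]` is a different situation, and that is the case the char-array objection is about.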