My favorite example of a "this should never happen" error was when I got a call from a customer, who started the conversation by asking, "Who is Brian?".
I was caught a bit off guard, but I assumed the customer must know someone at the company, since Brian was the name of the previous electrical engineer/firmware programmer. So, I told them that Brian didn't work here any more, but was there anything that I could help them with? The customer said, "Well, the device says that I should call Brian". I was confused by this, and asked a lot of questions until I determined that the device was actually displaying "CALL BRIAN" on the LCD display.
This was quite unusual, and at first I didn't believe the customer, until he sent a picture of the device showing the message.
So, I dug into the code, and quickly found the "Call Brian" error condition. It was definitely one of those "this should never happen" cases. I presume that Brian had put that in during firmware development to catch an error case he was afraid might happen due to overwriting valid memory locations.
I got the device back, and found out that the device had a processor problem (I don't remember exactly what) that would write corrupted data to memory. So, really, it should never happen.
That particular device has now been in production for 10 years, and that is the only time that error has ever appeared.
"Hi alttab, i tried to use your program but it closed and displayed a message saying 'segregation fault' or something...i'm not a racist, i love all people, please give me a call back"
I saw one like this once. Back in the early 90s I was working at a computer lab at my university. We had just gotten in a 300MHz DEC Alpha, and that thing was a screamer! It was so fast that X-windows didn't feel slow on it! (And this was in the day of 25-50MHz 386s and 486s.)
I was compiling some tiny test program on it, and it spit out an error message that said something to the effect of "This shouldn't happen. Email Dave and tell him what you did - david<something>@digital.com." I ended up forwarding it to our IT department, who I assume sent it on to DEC. I don't know if Dave ever saw it or not, though.
I remember getting this message myself back in those days, on my brand-spanking new DEC Alpha, which shipped with a 'pre-beta' compiler to those of us who were avid recipients of DEC's first batch of Alpha workstations in anticipation of a strong porting effort to get away from the "MIPS situation" at the time .. heady days indeed!
Yeah, honestly, as a one-off sort of thing, this sounds awesome haha. You could search the code for it, find the relevant piece immediately, and the user was prompted to call you guys quickly to get it resolved!
How do you do that? I get that on bootup you could do a little ditty, but how would you know if random bits are getting flipped? Seems tricky for an embedded device...
Not quite for memory _corruption_ but back when I was writing API code in C, I would place 'sentinels' at each end of my structs.
    struct somestruct {
        int s1;            /* leading sentinel, set by the API */
        int data;
        char * moreData;
        int s2;            /* trailing sentinel, set by the API */
    };
When the caller of the API needed to call my code, it had to first call a function to get an instance of the struct. This constructor-like code would allocate the memory for the struct, and then set s1 and s2 to 0xDEADBEEF.
The user would then fill out the rest of the struct and pass it back in as an argument to another call.
If either s1 or s2 wasn't 0xDEADBEEF, I would throw an error to the caller.
It helped me catch a lot of cases where the caller of the API had overrun some string while filling out the inputs.
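For anyone who wants to try it, here is a minimal sketch of the sentinel idea in C. It is not the original code: the function names `somestruct_new` and `somestruct_check`, the `SENTINEL` macro, and the inline `name[8]` field are all made up for illustration, and I've used `uint32_t` for the sentinels so that 0xDEADBEEF fits without relying on signed conversion.

```c
#include <stdint.h>
#include <stdlib.h>

#define SENTINEL 0xDEADBEEFu   /* guard value written at both ends of the struct */

struct somestruct {
    uint32_t s1;       /* leading sentinel */
    int data;
    char name[8];      /* hypothetical inline buffer a caller might overrun */
    uint32_t s2;       /* trailing sentinel */
};

/* Constructor-like call: allocate a zeroed instance and arm both sentinels.
 * Returns NULL if allocation fails. */
struct somestruct *somestruct_new(void)
{
    struct somestruct *p = calloc(1, sizeof *p);
    if (p != NULL) {
        p->s1 = SENTINEL;
        p->s2 = SENTINEL;
    }
    return p;
}

/* Validate a struct handed back by the caller.
 * Returns 0 if both sentinels are intact, -1 if the pointer is NULL
 * or either sentinel has been overwritten. */
int somestruct_check(const struct somestruct *p)
{
    if (p == NULL)
        return -1;
    if (p->s1 != SENTINEL || p->s2 != SENTINEL)
        return -1;
    return 0;
}
```

Every public entry point would call `somestruct_check` first and return an error to the caller on -1. Note that this only catches writes that actually run over one of the guard fields, such as a string copied past the end of `name`.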
This reminds me of something a friend of mine did once.
He had a structure that was getting overwritten with garbage due to an overrun somewhere else in the code. Rather than debugging and trying to find out what was doing it he just put "char temp[1000];" at the top of the struct to "absorb the damage".
I believe it's still running like that in production to this day.
The code above got written that way because at my first job, I inherited a godawful business charting API written by the lead developer.
The input to the API was a struct with 70-80 members that the caller had to fill in and there were no defaults for anything! Naturally there were not just scalars, but lots of arrays and strings in the struct, which could easily be overrun or often left null.
The users, quite understandably, didn't fill out everything, which led to frequent crashes in _my_ code because that's where the pointers would get dereferenced.
When they would see that the crash was not in their code, the users of the API would punt the error to me even though it was their bad input that caused the problem. This would happen 10-12 times a day.
I rewrote the entire thing in a paranoid style, employing the trick above and others to try to ensure that if there was bad input, it would always crash on their side of the fence.
After I was done I got one legitimate bug report for the code, even though it was in use worldwide in our medium-sized company.
However this might not have caught the error condition described upthread. That condition might have overwritten data or moreData without touching s1 or s2.