The Craziest F***ing Bug I've Ever Seen (yehudakatz.com)
68 points by wycats on Jan 3, 2010 | 26 comments



I guess any bug that occurs below your accustomed abstraction level is "crazy". If most of your work happens in an interpreted high-level language, a quite mundane bug in the runtime (like this) is "crazy". Hardware bugs are "crazy" to most software people at any level, but to hardware folks they're just bugs.


> I guess any bug that occurs below your accustomed abstraction level is "crazy".

Good way of looking at it. So the ultimate crazy bug would be an error in the laws of physics? Actually, one better than that would be discovering an error in the laws of math.


That's probably why nature comes with error bars... err, hbars.


And the singular craziest bug is Incompleteness?


not a bug, a feature


Is that bug called 'strings'? ;)


I would say the craziness of a bug is proportional to the number of abstraction levels between its source and its effects.

And that's why I don't like systems with > 2 layers (J2EE, I'm talking to you).


If this is the craziest bug you've ever seen, my guess is that you've spent your whole career programming in languages that lack raw pointers.


Hehe. I guess you didn't get the reference (http://wikiality.wikia.com/The_Craziest_Fucking_Thing_Ive_Ev...)


If that's the craziest bug you've ever seen you've lived a sheltered life.

As long as you don't need to break out the logic analyzer or sift through a 4" stack of fanfold assembly printouts, you're doing just fine :)

Bugs like that are always grounded in the assumption that the lower layers are reliable. Hint: they're not. They contain tons of bugs; the conditions that trigger them just get rarer the longer a platform has been in production. That means the 'hardest' bugs are the ones that surface last, and if you find something in a platform that has been operational for a while, you can bet your life it's not going to be easy to track down.

Interrupt-triggered bugs are usually really hard to find, as are subtle timing issues in high-speed serial links. If it takes you less than a week to track it down, it may be the hardest bug you've ever seen, but in the greater scheme of things you're not in real trouble yet.


> Interrupt-triggered bugs are usually really hard to find

Very true. Anything that is reproducible is fairly easy to debug; something that is hard to reproduce is far harder to figure out.

Here's a good scenario, on a good old Apple II: you'd call a subroutine, and the processor would put your address on the stack so it knew where to return. So after the subroutine returns, you can peek at the data just below the stack pointer to figure out your own address. Works 99.999999% of the time. Except when an interrupt occurred exactly on the return instruction, erasing the old stack data.

With that kind of frequency (interrupts would happen at about 60 Hz on an Apple II), you'd freeze the computer after several minutes. Good luck figuring it out.
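
If it helps to see the race concretely, here's a toy C simulation (illustrative only: the real thing was 6502 assembly, the stack here is just an array, and the "interrupt" is an ordinary function call standing in for the 60 Hz IRQ):

    #include <stdio.h>

    /* Toy simulation of the Apple II trick described above: a "subroutine
     * call" pushes a return address, the "return" pops it, and the caller
     * then peeks at the just-freed slot below the stack pointer to learn
     * its own address. If an "interrupt" runs between the pop and the
     * peek, it reuses that very slot and the peeked value is garbage. */

    #define STACK_SIZE 16
    static int stack[STACK_SIZE];
    static int sp = STACK_SIZE;          /* grows downward, like the 6502 stack */

    static void push(int v) { stack[--sp] = v; }
    static int  pop(void)   { return stack[sp++]; }

    static void interrupt_handler(void) {
        push(0xDEAD);                    /* the interrupt saves its own state... */
        pop();                           /* ...and restores it, clobbering the slot */
    }

    static int where_am_i(int my_address, int interrupt_fires) {
        push(my_address);                /* "JSR": return address goes on the stack */
        pop();                           /* "RTS": popped, but the bytes are still there */
        if (interrupt_fires)
            interrupt_handler();         /* the rare case: the slot gets overwritten */
        return stack[sp - 1];            /* peek at the just-freed slot */
    }

    int main(void) {
        printf("normally:       0x%x\n", where_am_i(0x1234, 0));  /* prints 0x1234 */
        printf("with interrupt: 0x%x\n", where_am_i(0x1234, 1));  /* prints 0xdead */
        return 0;
    }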


My craziest bug was a program that would become flaky if you stood too close to the computer. Eventually I discovered that something wasn't grounded properly, and the program was being messed up by the electric fields of our bodies. (Later, the CPU exploded.)


You didn't read the bit where it said you shouldn't be reading 0 locations from ROM for an extended period?

I once blew up a chip in a switch by doing some benchmarking with this rig: http://clustercompute.com/ (it cracked into two pieces right across the middle), but I've yet to see a CPU explode.

EPROMs give a nice flash when you put them in their socket the wrong way around, though :(


> (Later, the CPU exploded.)

That's not a bug. That's a feature.


It's only a feature if the computer is a bomb. (Or if someone pressed the self-destruct button.)


Neat bug but hardly what I would consider "crazy". The source code was available for tracing (and that's how the root cause was made clear).

I don't consider a bug to be "crazy" unless the word "volatile" is involved :)


I consider segfaults in MPI programs to be crazy, if only because, damn it, we should not have to use C for this! I hope Fortress is ready for mainstream use soon, because writing parallel programs in C is inherently nuts.

(This is not to put down weird volatile bugs, of course.)


> Before calling the method_missing method itself, Ruby sets a thread-local variable called call_status that reflects whether or not the original call was a vcall or a normal call.

Yuck, MRI (Matz's Ruby Interpreter) uses a global variable for that? (Yes, OK, it's actually thread-local, but that's effectively the same thing.) Awful style begets awful bugs. That's what you get for having such a messed-up interface.


Ha. I knew someone was going to use this post as an excuse to wail on Ruby. No other Ruby VM uses a global (or thread-local) for this problem; it's not necessary to implement the functionality. While we were working through the problem, Evan (who works on Rubinius, another Ruby VM) remarked how much better the other implementations (like JRuby and Rubinius) handled the same problem.


I meant "MRI" and not "Ruby". It's clearly just an implementation detail and not mandated by the language itself. I'll change the original post to clarify this.

If we're going to be pedantic, it's a "Ruby Interpreter" or "Ruby implementation", but not "Ruby VM". Ruby implementations can use a VM-based design, but it's certainly not required.


Thanks! Ruby 1.9 and Rubinius are both "Ruby VMs", while Ruby 1.8 is an interpreter, and IronRuby/JRuby leverage existing VMs. In this case, Evan, who works on a Ruby VM, was looking at Ruby 1.9, another Ruby VM ;)


I agree with you, but out of interest: how would you do it?

I'd be tempted to ask "why distinguish between vcalls and normal calls?", but in general I don't think it's possible to build a system that is completely non-ugly, and all areas of ugliness have the potential for bugs.


Simplest approach: add a parameter to the call-dispatch machinery. For example, say you have functions op_call() and op_vcall() corresponding to the bytecodes or AST nodes representing normal calls and vcalls respectively, and both work by calling the lower-level internal method dispatch_call(). What they're doing right now is having op_call() and op_vcall() set a global that is inspected later by dispatch_call() to select the correct error to raise. The better thing to do is to leverage the call stack and turn that flag into a parameter passed to dispatch_call(). That way you can't forget to set it before dispatch_call(), and you get cleanup for free when the stack gets unwound (even if you end up longjmp()ing out of the call due to an exception or whatever!).

I'm sure the above is a simplification of the actual scenario, but I bet the same general scheme can be applied, even if the implementation details are a bit more complicated in a real-world interpreter like MRI.
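
Roughly, a minimal sketch of what I mean, using the hypothetical names from above (op_call, op_vcall, dispatch_call; none of these are real MRI functions), with method lookup and error raising reduced to printf calls:

    #include <stdio.h>

    /* The call kind travels down the C call stack as an argument, so it
     * can never be left stale the way a global (or thread-local) can when
     * the stack unwinds early. */

    typedef enum { CALL_NORMAL, CALL_VCALL } call_kind;

    /* Lower-level dispatch: receives the kind explicitly instead of
     * consulting shared state. */
    static void dispatch_call(const char *method, int found, call_kind kind) {
        if (found) {
            printf("invoking %s\n", method);
            return;
        }
        if (kind == CALL_VCALL)
            printf("NameError: undefined local variable or method `%s'\n", method);
        else
            printf("NoMethodError: undefined method `%s'\n", method);
    }

    /* The two entry points just forward the kind as a parameter. */
    static void op_call(const char *method, int found) {
        dispatch_call(method, found, CALL_NORMAL);
    }

    static void op_vcall(const char *method, int found) {
        dispatch_call(method, found, CALL_VCALL);
    }

    int main(void) {
        op_vcall("foo", 0);   /* bare `foo`  -> NameError    */
        op_call("foo", 0);    /* `obj.foo()` -> NoMethodError */
        return 0;
    }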


It strikes me, looking at the discussion here, that "tell me about the craziest bug you've ever fixed" may be a very useful interview question. You quickly get an idea of someone's thought process, problem-solving strategies, and level of experience.


If you think that is crazy, try this hardware-related bug that STILL eludes me.

Netgear router -> Linksys range expander -> Netgear switch -> a set of machines.

The problem is that in that set there is a Vista machine that randomly kills the wireless network for some reason. The Linksys connection light turns red, and the wireless on both the router and the range expander disappears. Everything has to be power-cycled numerous times to get it back up. No other machine (Windows or Linux) does this to the network...

Trying to track down a "problem" across multiple hardware/software vendors is not fun :)

So I'm gonna claim the craziest f***ing bug crown :)


My craziest bug was making text-only changes to an HTML file I'd edited for months, saving the file, then seeing no changes in the browser.

Somehow I'd opened a second editing window for the file, exactly on top of the first window, and the editor was saving it under a slightly different name.



