The Craziest F***ing Bug I've Ever Seen (yehudakatz.com)
68 points by wycats on Jan 3, 2010 | 26 comments



I guess any bug that occurs below your accustomed abstraction level is "crazy". If most of your work happens in an interpreted high-level language, a quite mundane bug in the runtime (like this) is "crazy". Hardware bugs are "crazy" to most software people at any level, but to hardware folks they're just bugs.


> I guess any bug that occurs below your accustomed abstraction level is "crazy".

Good way of looking at it. So the ultimate crazy bug would be an error in the laws of physics? Actually, one better than that would be discovering an error in the laws of math.


That's probably why nature comes with error bars... err, hbars.


And the singular craziest bug is Incompleteness?


not a bug, a feature


Is that bug called 'strings'? ;)


I would say the craziness of a bug is proportional to the number of abstraction levels between its source and its effects.

And that's why I don't like systems with > 2 layers (J2EE, I'm talking to you).


If this is the craziest bug you've ever seen, my guess is that you've spent your whole career programming in languages that lack raw pointers.


Hehe. I guess you didn't get the reference (http://wikiality.wikia.com/The_Craziest_Fucking_Thing_Ive_Ev...)


If that's the craziest bug you've ever seen you've lived a sheltered life.

As long as you don't need to break out the logic analyzer or sift through a 4" stack of fanfold assembly printouts, you're doing just fine :)

Bugs like that are always grounded in the assumption that the lower layers are reliable. Hint: they're not. They contain tons of bugs; the conditions that trigger them just get rarer the longer a platform has been in production. That means the 'hardest' bugs are the ones that surface last, and if you find something in a platform that has been operational for a while, you can bet your life it's not going to be easy to track down.

Interrupt-triggered bugs are usually really hard to find, as are subtle timing issues in high-speed serial links. If it takes you less than a week to track it down, it may be the hardest bug you've ever seen, but in the greater scheme of things you're not in real trouble yet.


> Interrupt-triggered bugs are usually really hard to find

Very true. Anything that is reproducible is fairly easy to debug; something that is hard to reproduce is far harder to figure out.

Here's a good scenario, on a good old Apple II: you'd call a subroutine, and the processor would put your address on the stack so it knew where to return. So after the subroutine returns, you can peek at the data just below the stack pointer to figure out your own address. Works 99.999999% of the time. Except when an interrupt occurred exactly on the return instruction, erasing the old stack data.

With that kind of frequency (interrupts would happen at about 60 Hz on an Apple II), you'd freeze the computer after several minutes. Good luck figuring it out.
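
If it helps to see the race concretely, here's a toy C simulation (illustrative only: the real thing was 6502 assembly, the stack here is just an array, and the "interrupt" is an ordinary function call standing in for the 60 Hz IRQ):

    #include <stdio.h>

    /* Toy simulation of the Apple II trick described above: a "subroutine
     * call" pushes a return address, the "return" pops it, and the caller
     * then peeks at the just-freed slot below the stack pointer to learn
     * its own address. If an "interrupt" runs between the pop and the
     * peek, it reuses that very slot and the peeked value is garbage. */

    #define STACK_SIZE 16
    static int stack[STACK_SIZE];
    static int sp = STACK_SIZE;          /* grows downward, like the 6502 stack */

    static void push(int v) { stack[--sp] = v; }
    static int  pop(void)   { return stack[sp++]; }

    static void interrupt_handler(void) {
        push(0xDEAD);                    /* the interrupt saves its own state... */
        pop();                           /* ...and restores it, clobbering the slot */
    }

    static int where_am_i(int my_address, int interrupt_fires) {
        push(my_address);                /* "JSR": return address goes on the stack */
        pop();                           /* "RTS": popped, but the bytes are still there */
        if (interrupt_fires)
            interrupt_handler();         /* the rare case: the slot gets overwritten */
        return stack[sp - 1];            /* peek at the just-freed slot */
    }

    int main(void) {
        printf("normally:       0x%x\n", where_am_i(0x1234, 0));  /* prints 0x1234 */
        printf("with interrupt: 0x%x\n", where_am_i(0x1234, 1));  /* prints 0xdead */
        return 0;
    }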


My craziest bug was a program that would become flaky if you stood too close to the computer. Eventually I discovered that something wasn't grounded properly, and the program was being messed up by the electric fields of our bodies. (Later, the CPU exploded.)


You didn't read the bit where it said you shouldn't be reading 0 locations from ROM for an extended period?

I once blew up a chip in a switch by doing some benchmarking with this rig: http://clustercompute.com/ (it cracked into two pieces right across the middle), but I've yet to see a CPU explode.

EPROMs give a nice flash when you put them in their socket the wrong way around, though :(


> (Later, the CPU exploded.)

That's not a bug. That's a feature.


It's only a feature if the computer is a bomb. (Or if someone pressed the self-destruct button.)


Neat bug but hardly what I would consider "crazy". The source code was available for tracing (and that's how the root cause was made clear).

I don't consider a bug to be "crazy" unless the word "volatile" is involved :)


I consider segfaults in MPI programs to be crazy, if only because, damn it, we should not have to use C for this! I hope Fortress is ready for mainstream use soon, because writing parallel programs in C is inherently nuts.

(This is not to put down weird volatile bugs, of course.)


> Before calling the method_missing method itself, Ruby sets a thread-local variable called call_status that reflects whether or not the original call was a vcall or a normal call.

Yuck, MRI (Matz's Ruby Interpreter) uses a global variable for that? (Yes, OK, it's actually thread-local, but that's effectively the same thing.) Awful style begets awful bugs. That's what you get for having such a messed-up interface.


Ha. I knew someone was going to use this post as an excuse to wail on Ruby. No other Ruby VM uses a global (or thread-local) for this problem; it's not necessary to implement the functionality. While we were working through the problem, Evan (who works on Rubinius, another Ruby VM) remarked how much better the other implementations (like JRuby and Rubinius) handled the same problem.


I meant "MRI" and not "Ruby". It's clearly just an implementation detail and not mandated by the language itself. I'll change the original post to clarify this.

If we're going to be pedantic, it's a "Ruby Interpreter" or "Ruby implementation", but not "Ruby VM". Ruby implementations can use a VM-based design, but it's certainly not required.


Thanks! Ruby 1.9 and Rubinius are both "Ruby VMs", while Ruby 1.8 is an interpreter, and IronRuby/JRuby leverage existing VMs. In this case, Evan, who works on a Ruby VM, was looking at Ruby 1.9, another Ruby VM ;)


I agree with you, but out of interest: how would you do it?

I'd be tempted to ask "why distinguish between vcalls and normal calls?", but in general I don't think it's possible to build a system that is completely non-ugly, and all areas of ugliness have the potential for bugs.


Simplest approach: add a parameter to the call-dispatch machinery. For example, say you have functions op_call() and op_vcall() corresponding to the bytecodes or AST nodes representing normal calls and vcalls respectively, and both work by calling the lower-level internal method dispatch_call(). What they're doing right now is having op_call() and op_vcall() set a global that is inspected later by dispatch_call() to select the correct error to raise. The better thing to do is to leverage the call stack and turn that flag into a parameter passed to dispatch_call(). That way you can't forget to set it before dispatch_call(), and you get cleanup for free when the stack gets unwound (even if you end up longjmp()ing out of the call due to an exception or whatever!).

I'm sure the above is a simplification of the actual scenario, but I bet the same general scheme can be applied, even if the implementation details are a bit more complicated in a real-world interpreter like MRI.
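
Roughly, a minimal sketch of what I mean, using the hypothetical names from above (op_call, op_vcall, dispatch_call; none of these are real MRI functions), with method lookup and error raising reduced to printf calls:

    #include <stdio.h>

    /* The call kind travels down the C call stack as an argument, so it
     * can never be left stale the way a global (or thread-local) can when
     * the stack unwinds early. */

    typedef enum { CALL_NORMAL, CALL_VCALL } call_kind;

    /* Lower-level dispatch: receives the kind explicitly instead of
     * consulting shared state. */
    static void dispatch_call(const char *method, int found, call_kind kind) {
        if (found) {
            printf("invoking %s\n", method);
            return;
        }
        if (kind == CALL_VCALL)
            printf("NameError: undefined local variable or method `%s'\n", method);
        else
            printf("NoMethodError: undefined method `%s'\n", method);
    }

    /* The two entry points just forward the kind as a parameter. */
    static void op_call(const char *method, int found) {
        dispatch_call(method, found, CALL_NORMAL);
    }

    static void op_vcall(const char *method, int found) {
        dispatch_call(method, found, CALL_VCALL);
    }

    int main(void) {
        op_vcall("foo", 0);   /* bare `foo`  -> NameError    */
        op_call("foo", 0);    /* `obj.foo()` -> NoMethodError */
        return 0;
    }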


It strikes me, looking at the discussion here, that "tell me about the craziest bug you've ever fixed" may be a very useful interview question. You quickly get an idea of someone's thought process, problem-solving strategies, and level of experience.


If you think that is crazy, try this hardware-related bug that STILL eludes me.

Netgear router -> Linksys range expander -> Netgear switch -> a set of machines.

The problem is that in that set there is a Vista machine that randomly kills the wireless network for some reason. The Linksys connection light turns red, and the wireless on both the router and the range expander disappears. Everything has to be power-cycled numerous times to get it back up. No other machine (Windows or Linux) does this to the network...

Trying to track down a "problem" across multiple hardware/software vendors is not fun :)

So I'm gonna claim the craziest f***ing bug crown :)


My craziest bug was making text-only changes to an HTML file I'd edited for months, saving the file, then seeing no changes in the browser.

Somehow I'd opened a second editing window for the file, exactly on top of the first window, and the editor was saving it under a slightly different name.



