Lamest bug we ever encountered (joostdevblog.blogspot.com)
76 points by exch on Dec 10, 2011 | 23 comments



Reminds me of the time I wrote a physical simulation engine back in grad school and there was a "minus" sign error. Of course, the error was rare enough that we didn't notice it until after the code was used in a real production environment. Tracking down one minus sign in several hundred thousand lines is a pain. Not to mention the uneasy feeling you get after you solve it: "How was everything ever working correctly before!? What else did we overlook?"


If I had to venture a guess, you didn't have a comprehensive set of tests at the function/method level of the code? Having that would probably have caught the bug, because you would have written a test that exercises the code in that branch.
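As a rough illustration of what such a branch-level test might look like, here is a minimal sketch. The `drag_force` function and its formula are hypothetical stand-ins, not the simulation code from the comment above; the point is only that each branch gets its own assertion on the sign of the result.

```python
import unittest


def drag_force(velocity: float, coefficient: float = 0.5) -> float:
    """Hypothetical simulation helper: drag always opposes the direction of motion."""
    if velocity >= 0:
        return -coefficient * velocity * velocity
    # Rarely exercised branch: negative velocity must yield a positive force.
    return coefficient * velocity * velocity


class DragForceTest(unittest.TestCase):
    def test_forward_motion_is_decelerated(self):
        self.assertLess(drag_force(2.0), 0.0)

    def test_backward_motion_is_decelerated(self):
        # A flipped sign in the rare branch would make this test fail.
        self.assertGreater(drag_force(-2.0), 0.0)
```

Saved to a file, this can be run with `python -m unittest`. A sign error in the second branch would go unnoticed by the first test alone, which is why the rare path needs its own case.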


You're right. But it was after that painstaking experience that I became fully engrossed in using unit tests for all non-trivial functionality. Live and learn.


I'm not completely satisfied by the explanation. I still have that uneasy feeling that you get when you solve a bug, but an unsolved mystery remains. "Also, I still don't know why not all consoles connected to that PC froze."


He didn't mention how the logging was done, but if it was over a TCP connection, then the send() call probably blocked until it timed out, since the sleeping computer didn't close the socket cleanly; then it had to re-establish the connection. Although reliability is nice, if I were writing a remote logger for something like a game, I think I'd use UDP.
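A fire-and-forget UDP logger along those lines could be sketched roughly as follows. The `UdpLogger` class and its method names are made up for illustration, and nothing here reflects how the game in the article actually logged; messages can be lost silently, which is the trade-off being accepted.

```python
import socket


class UdpLogger:
    """Fire-and-forget logger: each message is one datagram, with no delivery guarantee."""

    def __init__(self, host: str, port: int) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Non-blocking, so a full local send buffer raises instead of stalling.
        self._sock.setblocking(False)

    def log(self, message: str) -> bool:
        """Return True if the datagram was handed to the kernel, False if it was dropped."""
        try:
            self._sock.sendto(message.encode("utf-8"), self._addr)
            return True
        except OSError:
            # Covers BlockingIOError and errors such as an unreachable network.
            return False

    def close(self) -> None:
        self._sock.close()
```

Because there is no connection, a sleeping or absent log receiver cannot stall the sender; the datagrams simply go nowhere.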


> although reliability is nice ... UDP

Would you not use send() with MSG_DONTWAIT? You get the reliability of TCP and you get feedback if there is any potential blocking. (But I certainly am not a socket guru.)
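For reference, a per-call non-blocking send might look like the sketch below. It uses Python's `socket.MSG_DONTWAIT`, which is available on Linux and most Unix-like systems but not on Windows; the `try_send` helper is a hypothetical name, and a console SDK would likely expose a different API.

```python
import socket
from typing import Optional


def try_send(sock: socket.socket, data: bytes) -> Optional[int]:
    """Attempt one send without blocking.

    Returns the number of bytes the kernel accepted (possibly fewer than
    len(data)), or None if the call would have blocked.
    """
    try:
        # MSG_DONTWAIT makes only this call non-blocking; the socket itself
        # stays in blocking mode for other operations.
        return sock.send(data, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None
```

A caller that gets `None` back can drop the log line or queue it for later, rather than freezing the main loop.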


Definitely, setting the non-blocking flag on the socket is also a good idea.


Are you trying to explain how it's possible for some of the consoles to freeze but others not while talking to the same sleeping computer? If so, I did not understand your explanation.


I believe socket writes don't block until you've filled the internal socket buffer, so it's likely that the unaffected machines simply hadn't done this yet.
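That behaviour is easy to observe locally. The sketch below counts how many bytes a non-blocking socket accepts when the peer never reads; the helper name is hypothetical, and a local socket pair stands in for the TCP connection, so the exact number will differ from what a console talking to a remote PC would see.

```python
import socket


def bytes_accepted_before_block(sock: socket.socket, chunk_size: int = 4096) -> int:
    """Write to a non-blocking socket until the kernel refuses more data."""
    sock.setblocking(False)
    chunk = b"x" * chunk_size
    total = 0
    while True:
        try:
            total += sock.send(chunk)
        except BlockingIOError:
            # The send buffer is full; a blocking socket would stall here.
            return total


if __name__ == "__main__":
    writer, reader = socket.socketpair()  # reader never calls recv()
    print(bytes_accepted_before_block(writer))
    writer.close()
    reader.close()
```

Every write before the buffer fills returns immediately, so a machine that logs less would take longer to reach the point where it stalls.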


Ah, there's the missing piece of information. Now I get it, thanks.


It's just a hypothesis. Obviously I don't have enough information to know for sure.


I'm reminded of this story of the folks who worked on LEO hunting down a similarly difficult-to-find bug that was eventually found to be caused by an unrelated external machine: the manager's elevator. https://www.youtube.com/watch?v=Lrn24SdW64I&t=2m50s


I once spent an afternoon tracking down a "bug" where sales tax wasn't being calculated in LedgerSMB, only to find out I had set the tax rate to 0 in the tax interface... OK, it was working as intended. I felt pretty sheepish too.


The worst bugs are when things work as intended, but you still think it's a bug, such as your example.


It's worse when your users find these and are all mad because the computer did exactly what they told it to.


The problem in my case is that sales tax calculation easily qualifies as a big deal, so any sense that it's not working raises all sorts of alarm bells, in addition to the immediate questions of "Are production versions affected? If so, what do we tell customers?"

Also, taxes with a rate of 0 are ignored specifically because sales tax structures sometimes change (as with HST consolidation in Canada) and consequently old taxes need to be retired.


Nah, then it's a bug in your user interface.


While I am sympathetic to this argument, I would say that is not always the case. Some configuration is usually required, and when something is set up for a specific case, and it behaves correctly for that case, and the user simply forgot that this is what they did, then it's a bug only in the storage retrieval routines of the user's own memory.


They could have solved that bug with one developer in ten minutes by just telling the PS3 to generate a core dump and running addr2line.exe on the core dump report's callstacks.

And the report places the blame on the server instead of their code. Clearly it's their code's fault for making blocking socket calls in the main thread.


This looks like an interesting bug. I wonder if there are more bugs like this on the web side, such as analytics tools giving you false or misleading information? Or even monitoring or performance tools?


The lamest bug you will ever encounter deletes your whole /usr.


How is that lame?


I think he's talking about this https://github.com/MrMEEE/bumblebee/commit/a047be85247755cdb... , where the deletion of /usr was not on purpose... the bug was a space in the middle of a file path in the install script.
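To see why a single space is enough, here is a small sketch that applies shell-style word splitting to a command string without executing anything. The paths are hypothetical placeholders, not the exact line from the linked commit.

```python
import shlex

# Hypothetical install-script line with a stray space after "/usr".
buggy = "rm -rf /usr /lib/example/xorg"
intended = "rm -rf /usr/lib/example/xorg"

# shlex.split applies POSIX shell word-splitting rules; nothing is run.
print(shlex.split(buggy))     # ['rm', '-rf', '/usr', '/lib/example/xorg']
print(shlex.split(intended))  # ['rm', '-rf', '/usr/lib/example/xorg']
```

With the stray space, `rm` receives `/usr` as a separate argument of its own, so the whole directory becomes a deletion target instead of one subdirectory inside it.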



