Why bugs might feel “impossible” (jvns.ca)
215 points by atg_abhishek on June 15, 2021 | 126 comments



> There are bugs where you know exactly how to reproduce it, but it takes a long time (like 20 minutes or way longer) to reproduce the bug. This is hard because it’s hard to maintain your focus: maybe you can only try 1 experiment per day!

Ugh, I hate these. They're almost worse than the "only happens sometimes" ones. I'd rather walk down a longer path fast than a shorter path miserably slowly.

> there’s no output at all

Hangs!! I hate hangs. I hate them. Crashes give information. Hangs could be so many things, but it is the absence of information.

> 4. one of your assumptions is wrong

#1 assumption I always make sure to check now is "is the code that's running actually from the source code I'm looking at". So many "impossible" bugs I run into are in fact impossible... in the codebase I'm looking at. And it turns out what's deployed is some other code where it is very much possible.


At one point, I encountered a low level bug that would only manifest if the computer was physically located in a particular room. It ended up being an undocumented cabling revision, but I revisited a lot of assumptions about the nature of software and reality before getting to that point.


I have come to love the feeling that the universe is broken when bug hunting. It means the solution is close at hand. If there are still lots of plausible explanations left then it means there is lots of work left to do. Things always seem most insane just before figuring it out.


And then you find it and you remember the sentence: “Everything that can fail, will fail”.


Thank you for this positivity! I'll try to remember this next time I am frustrated.


Sounds like the 500 mile email bug

https://web.mit.edu/jemorris/humor/500-miles


The thing I like about this story is that all the facts fit.

The user may be explaining something that is crazy to you initially, but the first step is to gather the facts, THEN try to figure it out.

So often, a big part of troubleshooting is "How can I get more information about X?"

In this case, the seemingly unrelated step of telnetting into SMTP gathered a few facts together that helped to explain it all.


I’d love to see a gallery of these kinds of bugs and the actual solutions. This was an awesome story— thanks for linking to it.

Also I didn’t know about the units cli program!



Units is a wonder. I use it pretty much daily. It's a shame units_cur is so fragile though.


"but I revisited a lot of assumptions about the nature of software and reality before getting to that point."

I encountered some bugs that felt like this (software only, though), where I never found the all-explaining logical solution and ended up rewriting that part.

But I saved the whole state of the project and its data in a zipped file and keep it locked away - so one day, when I feel like madness (and maybe have better tools available), I will jump back into it, either to prove my own stupidity - or the existence of dark magic, poltergeists and demons.


>#1 assumption I always make sure to check now is "is the code that's running actually from the source code I'm looking at".

Once I got so far as stepping through some malfunctioning code at ASM level on a production machine, only to discover that the ASM had a jump if equal instruction while the source code prescribed exactly the opposite. I compared to the executable binary that was supposed to be there, and it was different from what was actually there. There had been a bit flip somewhere along the line.

I now spend more time than I probably should comparing hashes of any executable binary that's acting strangely enough.
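
A minimal sketch (Python) of that habit, with made-up paths for the deployed binary and the build artifact:

    import hashlib

    def sha256_of(path: str) -> str:
        # stream the file so large binaries don't need to fit in memory
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # both paths are hypothetical examples
    if sha256_of("/srv/app/bin/service") != sha256_of("./build/service"):
        print("deployed binary differs from the build artifact")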


We had a fun one at my first job where the same Java webapp was behaving differently on a couple of servers (this is back when servers were administered manually by sysadmins). We did a recursive md5sum of the homedir on each server and got the same results.

Turns out that by default tomcat will use filesystem order when loading jars from its library directory, and some of the servers used different filesystems, so even though each one had the same set of jars in the library directory, they would be loaded in a different order, which led to different behaviour.


I worked at Microsoft back when they were looking at the Itanium as a Win64 platform. We were making a .Net console application and one day I was in the lab while our daily test automation was running, and the Itanium machines had a bunch of crazy random chars and colors on the console windows. It wasn't causing tests to fail, but I rolled up my sleeves and debugged the issue, thinking it was probably just some buggy test thing.

It turned out we were doing something nobody else was doing in .Net at the time, which was to return a structure by value while we were populating the console buffer. Apparently the JIT compiler lost track of which registers it was using to store the struct and overwrote them. In the disassembly it looked like "move some stuff into a register, then move more stuff into the same register". Thought I was losing it for a second.

I reckon the moral is don't get into the mindset that the test is always broken. Sometimes it's actually the product.


I've been there a couple of times.

Relevant: https://xkcd.com/1316/


The rise of reversible debugging for C/C++ made debugging much more enjoyable for me. For those tedious to reproduce bugs you can reproduce and record it once and replay it many times with the bonus of reverse-* operations. With rr the execution isn't even slowed that much.


Indeed. I actually quite like hung programs :) Attach a debugger, and now lessee what it is doing...


If anyone is complaining about a few hours here and there, relax. It took 7 years for me to solve an intermittent one once.

The issue turned out to be an immeasurably small thread synchronisation window that went away when we bought new CPUs which were fast enough for it not to happen.

In the end this task was rewritten as a single-threaded process, as the CPUs are now fast enough to complete the work on one thread.


This is all well and good but for people like me, I generally need to fix a bug in a week before people start looking at me weird.


This hurts me to read


That's less than 1% of the pain I experienced there.


> #1 assumption I always make sure to check now is "is the code that's running actually from the source code I'm looking at". So many "impossible" bugs I run into are in fact impossible... in the codebase I'm looking at. And it turns out what's deployed is some other code where it is very much possible.

cough I spent two days doing this when testing different application servers for a Java servlet I was consulting on. Two days of banging my head and wondering why it works on that server and not on that server.

I was deploying the new .war to the wrong place and the server autodeploy was picking up the old version every time.

I've been doing this for over two decades.


> Hangs could be so many things, but it is the absence of information.

start adding information.

you need to know what the code is getting to and/or what it's not getting to and progressively narrow it down. spamming debug print statements can be useful if you have no better ideas to get yourself moving.

or if you can manage to hook up an interrupt handler to be able to get it to dump a stack trace that'll immediately tell you.

or attach a debugger while it's spinning in order to get a stack trace.

or attach something like strace to dump out system calls or ltrace to dump out libc calls. that can't catch a process spinning in "math" though, but that absence of information is a hint that eliminates lots of possibilities.

you may need to add a feature to the software in order to do this kind of debugging, so you write your own USR1 interrupt handler or something. it will be worth it.

i think the kids these days use stuff like systemtap?
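
a minimal sketch (Python, POSIX) of the "write your own USR1 handler" idea: faulthandler can dump every thread's stack whenever the process gets a chosen signal.

    import faulthandler
    import signal

    # on SIGUSR1, print a stack trace for every thread to stderr
    faulthandler.register(signal.SIGUSR1, all_threads=True)

    # ... rest of the program ...
    # while it hangs:  kill -USR1 <pid>   then read where each thread is stuck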


One of the best pieces of programming advice I've ever received was from my undergrad advisor. "You'll never win a staring contest with the code." If you're in the middle of debugging and you find yourself staring at the code hoping that it will suddenly reveal answers, that's not a productive position to be in. In those cases, better to figure out what diagnostics can help, and to get more information.


>> there’s no output at all

> Hangs!! I hate hangs. I hate them. Crashes give information. Hangs could be so many things, but it is the absence of information.

variants that made me scream include:

- misleading information, the kind you can't dismiss as irrelevant, and you go down a thousand rabbit holes that seemingly lead nowhere, but it somehow all ties together into understanding the issue, only after a thousand turns and while keeping the holistic view of it all

- useless information. kind of like the one above but every piece of information you get truly leads nowhere

Once I had an actual crash that somehow happened before jumping into the program's code (like, before main), but changing your code did toggle the trigger... but when it crashes you have no core and can't attach a debugger, so you're left with debugging on the happy path, constantly thinking about what could possibly go wrong without even knowing what you're looking for. Understanding why crt0/ld trips over is not fun.


The ones I really hate are the ones that disappear when run under a debugger, or with higher logging levels. Usually race conditions.


Ah, the dreaded heisenbug


The solution is simple, though: ship with the debugger.


…with the JVM debug port open, so you can unlock your customers when they are stuck.

Even works for software embedded on rockets - After all, network latency to the moon is only 2s, right?


Yep, GDB also regularly splits up TCP network packets, so beware of changing stream packet boundaries


> 20 minutes or way longer

One of my coworkers was working on a bug that only manifested itself after ~2 weeks of putting the system under continuous stress. Bad times.

Stress tests often seem to unearth strange bugs. Once when we used to sell boxes running a single-threaded real-time OS, the OS vendor gave us a separate machine that would sample the registers and memory of the box 10000 times a second (using jumper cables on the motherboard), in the hope that we would catch the exact state of the system as it crashed after days of stress testing. The sampler machine was a crazy overpowered beast of a machine with really high clock CPUs and oodles of RAM so it could dump as many samples as possible into a ring buffer in memory.


> the code that's running actually from the source code I'm looking at

Super common when working with Clojure(script)! Bites me in the ass all the time.


Speaking of clojurescript and bad assumptions:

Clojure and clojurescript are both dynamically typed, but clojure is strongly typed (type errors throw exceptions), and clojurescript is weakly typed (type errors are technically valid code). Too many clojure enthusiasts act like they're both the same language, but that is an absolutely massive difference.

Try the following code out in each: (+ "1" 1). In clojure you get an exception, in clojurescript you get "11" (maybe a warning at compile time which won't show its face at runtime when the data is fed to you from an API call).

That bug silently corrupted analytics data for 3 months for a service I worked on. That alone was the reason I stopped using dynamically typed languages.


> That alone was the reason I stopped using dynamically typed languages.

Nitpick, but weakly typed, not dynamically typed, is the problem here. Python would throw an exception (though I now mostly prefer JS with TypeScript to Python nowadays).


Yes, in this particular instance that would be correct. But tracking down that bug made me realize something. A well tested program, fed improperly typed inputs, will fail in different places depending on its type discipline.

With the weakly typed language, it will fail anywhere within your codebase and won't tell you that it is doing so.

With the strong/dynamic types, it will still fail anywhere, and while it will fail loudly, it still leaves you with the work of tracking down why. You won't know that it is due to improperly typed IO until you've traced that data flow all the way back from the point of the failure. For large apps, this can be a nightmare.

With a static/strong typing discipline, type errors can only occur in one place: exactly at the point where objects are constructed during IO.

And let's say that the upstream API provider (that you have no control over) silently switches to bignum strings instead of floats: your strong type system catches the problem immediately, and now you have to accommodate. You change the object, get a stream of type errors in your IDE, and go around fixing them all until they disappear. Almost like magic, your program works again. No new tests, no running multiple times until you track down all the type errors. With a dynamically typed language you'll have no such luck.
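
As a hedged sketch of that "fails in exactly one place" argument (Python here, names invented): validate types where objects are constructed from IO, and everything downstream can trust them.

    from dataclasses import dataclass
    from typing import Any

    @dataclass(frozen=True)
    class Price:
        amount: float

    def parse_price(payload: dict[str, Any]) -> Price:
        # the only place a type error can originate
        amount = payload["amount"]
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            raise TypeError(f"expected a number for 'amount', got {type(amount).__name__}")
        return Price(float(amount))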


Frustratingly, many statically typed languages are not very strongly typed, since they don't force you to check for null values. Kotlin/Swift/Rust really are a huge improvement over Java/Go/C#.

At its best, TypeScript feels very similar to programming in one of these OCaml-like languages, though you're still vulnerable to errors in untyped dependencies or people on your project abusing the `any` escape hatch.


(+ 1 “1”) is not a type error for CLJS, because it’s not a type error for JS. Clojure is designed as a hosted language and it shares all the basic types and operations with the host platform.


That's perfectly fine. In fact, it would probably have too much overhead to try to build a dynamic/strong runtime on top of javascript, so adopting the platform's type discipline was probably a wise choice.

However, the type discipline is a fundamental aspect of the definition of a programming language, and you can't exactly call it the same language if it has a completely different type discipline on different platforms. And it would be nice if clojure advocates stopped lying to people about how it is the same language. Be transparent about how different it really is, even if it looks the same on the surface.


That’s quite bad. It basically means that Clojure and clojurescript are different languages. So you can’t blindly share client/server code.


It took me some time but I have a good workflow for debugging hangs/deadlocks/etc. If you know what part of the code produces it, you iterate over it indefinitely in a debugger until it hangs, then once you notice the iteration has stopped you “step in” to the debugger. Then you run another script that dumps the current trace back for each existing thread. That should be enough to detect the lock normally.

The big problem is if you don’t know how to reproduce the issue in a debugger for some reason.
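
For the "dump the current traceback for each existing thread" step, a minimal Python sketch (other runtimes have equivalents):

    import sys
    import threading
    import traceback

    def dump_all_threads() -> None:
        # print the current stack of every live thread
        names = {t.ident: t.name for t in threading.enumerate()}
        for ident, frame in sys._current_frames().items():
            print(f"--- thread {names.get(ident, ident)} ---")
            traceback.print_stack(frame)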


Last time I had a "silent crash" heisenbug that produced no output, I solved it by system-level tracing (ETW, Event Tracing for Windows). It turned out not to be a crash at all! We had a script in our test harness that cleaned up stray processes, and that script sometimes managed to run long enough that it caught the start of the next test, and killed the tested executables before they managed to output anything.


Your last point on the correct code running, ugh. I haven't gotten better at it even though I'm aware of it more now. Had an incredibly persistent linting config issue just last week; turns out I had the wrong config open for editing in vim.


I also hate hangs. If you are working with the JVM you can send it a signal with kill -3 [pid], which gives you a thread dump. That's a QUIT signal, if you are curious. I once found a stuck HTTP request when I did this, so I fixed the issue by adding a timeout to the HTTP connection object.

When you are designing a system or a program you should have a way to get diagnostic information to make it easier to fix issues that can arise.


Always design in timeouts. And always make them configurable.
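
A minimal Python sketch of that advice, with a made-up environment variable as the configuration knob:

    import os
    import urllib.request

    # HTTP_TIMEOUT_SECONDS is a hypothetical knob; default to something finite
    TIMEOUT = float(os.environ.get("HTTP_TIMEOUT_SECONDS", "10"))

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            return resp.read()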


Yes, I agree.


JVM also has jstack and other tools like jprofd to introspect running vms.


First day of electrical engineering classes we were taught, before troubleshooting a circuit - make sure you're plugged in.

Assumptions are evil :)


My worst "impossible" bug was one I encountered in 1991 (and I should blog about it one day), while writing a DOS game (386/486 PC era). The game was a mix of C and assembly, and for the life of me I couldn't reason about it / recreate the conditions to make the bug appear. It was seemingly random, only very rarely happening.

After days I gave up and invented the thermonuclear weapon: I decided to rewrite the entire game's engine to be deterministic. This took me a long while, but then I could record events (joystick/keys direction/firing etc.) and at which frame these inputs happened, and could deterministically replay the whole game.

Thing is: back then deterministic game engines / input-based replays didn't exist yet (AFAIK). At least I didn't know of any.

The first time I remember reading about a fully deterministic game engine was on Gamasutra, a post-mortem on the first Age of Empires.

So basically my first impossible bug made me discover the idea of deterministic game engines and tiny "replay" save files.

The actual bug? Well, eventually it appeared, but now I had save files and, sure enough, I could have the whole game replayed automatically and the bug would show up. And so I knew it was now just a matter of squashing it. Just some good old C dangling pointer IIRC. When the hero had the option to get two shots (usually he only had one) and one of the shots was still active when the level was cleared, that shot would stay alive in the next level, but invisible, and would invariably corrupt the memory of the next level.

Fun stuff.
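
A minimal sketch (Python, all names invented) of the record/replay idea described above: seed the RNG, log (frame, inputs) pairs during play, and feeding the same log back reproduces the same run.

    import random

    class Game:
        def __init__(self, seed: int):
            self.rng = random.Random(seed)   # never the global RNG
            self.state = 0

        def step(self, inputs) -> None:
            # next state depends only on previous state, inputs and the seeded RNG
            self.state = self.state * 31 + sum(inputs) + self.rng.randint(0, 3)

    def record(seed: int, live_inputs):
        game, log = Game(seed), []
        for frame, inputs in enumerate(live_inputs):
            log.append((frame, inputs))
            game.step(inputs)
        return log, game.state

    def replay(seed: int, log):
        game = Game(seed)
        for _frame, inputs in log:
            game.step(inputs)
        return game.state   # equal to the recorded final state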


For the DOS game Terra Nova (1996) I made our game engine fully deterministic. I guess this was a year before Age of Empires but well after you. It was amazing how much easier this made debugging. The effect was so great that I can't even put a multiplier on it, because it moved bugs from the "we'll never reproduce this" category to the "just see what happened and fix it" category. Some of these replays represented more than half an hour of gameplay too.

One thing that surprised me is that it also found a bunch of bugs waiting to happen (uninitialized variables / dangling pointer sort of stuff) that would trigger an error when replaying from a file didn't produce the same results as the original play (we had a checksum of game state that we could check).
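
Continuing the sketch above (hedged, invented names): checksum the game state every N frames during the original run, then assert against those checksums during replay so a divergence surfaces at the first frame where it happens rather than much later.

    import hashlib
    import pickle

    def state_checksum(state) -> str:
        # good enough for a sketch; a real engine would hash its state more carefully
        return hashlib.sha1(pickle.dumps(state)).hexdigest()

    def replay_with_checks(game, log, recorded, every: int = 60) -> None:
        # recorded: dict of frame -> checksum captured during the original run
        for frame, inputs in log:
            game.step(inputs)
            if frame % every == 0:
                assert state_checksum(game.state) == recorded[frame], \
                    f"replay diverged at frame {frame}"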


That is really cool! There may have been others, earlier ones: just not that I knew of.

Did you come up with the idea yourselves, or did you know about other game developers doing that?

> Some of these replays represented more than half an hour of gameplay too.

Same... That was really my main motivation: sometimes needing to play for 20 minutes before the bug would show up.

> (we had a checksum of game state that we could check)

Ooooh I love that: that is plain bad---! So you not only had your deterministic engine, but a way to directly identify any discrepancy between the original state and the replayed one. I didn't think about that!

It's amazing that it let you identify bugs before they even struck.


I don't remember hearing about other developers doing the same thing. I think it was just that I had a lot of experience debugging deterministic programs (like command-line tools) and it was infinitely more pleasant than trying to debug an interactive graphical program, so it was worth seeing whether we could make the game itself deterministic.

I've never made a system that ambitious again, but one thing I've learned from that experience is to never ever call global rand(); always always create your own RNGs that you can run explicitly, and if you have multiple systems that can be disabled independently (e.g., we were able to turn off graphics during our replays), give them each their own RNG.
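
A minimal Python sketch of the per-system RNG advice (the seed-mixing constants are arbitrary):

    import random

    class Rngs:
        def __init__(self, seed: int):
            # independent streams: disabling one system doesn't shift the others
            self.gameplay = random.Random(seed)
            self.graphics = random.Random(seed ^ 0x9E3779B9)
            self.audio = random.Random(seed ^ 0x7F4A7C15)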


Is there any downside to writing deterministic games? It seems like the only sane way to do it, but I guess it adds some complexity overhead for the initial write, otherwise everyone would do it by default?


There aren't many significant downsides I know of, it's just really easy to accidentally make your game not deterministic.

You have to entirely isolate the game state from any sources of non-determinism. The latter can include: subtle CPU timing issues, GPU timing, other GPU artifacts, the system clock, timing from IO operations. If you want the state to be deterministic across machines (useful for debugging multiplayer stuff) then you also need to worry about floating point operations (some chips behave differently on some boundary cases, I think) as well as some graphics operations (things like texture operations and rounding aren't always bitwise identical across GPUs).

If any bit of non-determinism sneaks in from one of these, it will wander through and pollute any other operations and data that depend on it. Flushing out non-determinism bugs can almost feel as difficult as debugging a non-deterministic engine.

I was at EA when the Madden team refactored the engine to be deterministic. It took a full cycle of bug hunting, but it was marvelous once it got there. QA could just send over a replay file and any dev could simply load up the replay and repro the bugs. In fact, the user-facing replay system in the game ("Let's watch that play again in slow mo!") was simply restarting the engine and then replaying the user inputs deterministically to resimulate the whole game again.


There's a bunch of excellent blog posts from the Factorio developers about tracking down tiny bits of non-determinism (since it uses deterministic multiplayer, as there is way too much dynamic world state to update constantly over the network).


Yeah it’s just harder and potentially less performant. There are different levels—deterministic across the same architecture, different architectures etc…

The simplest example I can think of is being deterministic across different frame rates.

Imagine you move a player by adding x to its position each frame. You adjust x based on the frame rate so that you don’t move faster on a faster computer.

So on a computer running at 30 FPS you move 10 pixels each update. But on a computer running at 60 FPS you move 5.

You have walls that are 6 pixels wide. On the 60fps computer it works fine, but on the 30fps machine you can teleport through the walls.
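
One common fix, sketched in Python (hedged, not from the thread): run the game logic on a fixed timestep regardless of render FPS, so every machine takes identical simulation steps.

    FIXED_DT = 1 / 120   # simulation step in seconds, independent of render FPS
    SPEED = 300.0        # pixels per second

    def advance(position: float, accumulator: float, frame_time: float):
        # accumulate real time, then consume it in fixed-size simulation steps
        accumulator += frame_time
        while accumulator >= FIXED_DT:
            position += SPEED * FIXED_DT   # same step size at 30 or 60 FPS
            accumulator -= FIXED_DT
        return position, accumulator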


I guess this is similar to the Fallout 76 "physics is tied to framerate" snafu (where looking at the ground makes you go faster because the fps goes up...)

https://gamerant.com/fallout-76-speed-hack/


I seem to recall that 1990's Stunts (also known as 4D Sports Driving) was somewhat deterministic, as it featured replays. I recall abusing the physics engine with carefully laid out tracks and looping over the replays to much hilarity.


> I should blog about it one day

Yes, please!


Bugs involving printers are among the worst, as everything software-related about them is a few notches down on the quality scale. Also, troubleshooting involves reams of paper or giant rolls of labels.

A good test for a printer/print spooler is to set the printer offline but still accepting jobs (eg: open paper tray), then send 20 print jobs and count how many of the jobs get printed.

I’ve spent countless hours and a 3’ diameter roll of labels trying to figure out why a printer would occasionally not print&apply a label, causing all packages to be labeled with the wrong label. The printer could print&apply 1 label/second, so it made a lot of mistakes when it failed. We eventually had to dismantle the printer and test each major circuit board in isolation to find out that its internal network adapter had bad firmware that the manufacturer did not want to fix. It turns out that most IP->LPT adapters have this same flaw too, so we had to basically buy a huge pile of them to find 1 that worked.


Printers smell our fear and desperation.


PC LOAD LETTER


Too few young developers are subjected to printer hell.


I'm "lucky" enough to deal with buggy hardware on a semi-regular basis (I start writing firmware before the hardware is finalized and run on prototypes), so I really do get bugs where the input data and the logic are all completely correct and the hardware is at fault. You get to an add instruction with immediate data/no pointers, and somehow it gives you back bad data or hangs.

On the one hand, yay, not my fault! On the other hand, HELL to debug. On the worst hand, it dramatically increases my willingness to SAY it must be a hardware problem, which is not always the case!


Two "fun" examples:

1) System trying to boot would hang at seemingly random points. Could never be pinned down to a particular instruction, but could be caught doing it when stepping through with attached hardware debugger. It just wasn't consistent and never made any sense. Hang on an add. Hang on a call and never reach the first line of the thing being called. The hang would always be relatively late in the boot, but that's all that could be found.

Eventually I got it. It would hang the first time a timer interrupt triggered, which would only happen after that interrupt was enabled something like halfway into the boot.

Turns out there were disabled cores and the system was waiting trying to park those cores before servicing the interrupt, but they'd never respond/ack/say "I parked" and so we'd hang.

Disable the interrupt and there was no problem.

2) Operating in Cache-As-RAM mode early in boot, no "real" memory, just the L2 cache mapped as memory. Two valid/available address ranges could not both be written to. Writing to 0xA and then 0xB, or 0xB and then 0xA, would hang the system. Data being written didn't matter. Writes didn't need to be back to back. Just couldn't play nice.

Knowing it's a hardware problem spoils the fun of trying to debug that. Bad cache, couldn't properly convert addresses to cache lines, wrapped back on itself and panicked. Solution - move and resize "usable" cache region to exclude the overlapping ranges.


Bus timing errors! Fun times!

Forgot a wait state? It'll probably work, on most chips!

Even better when suppliers fix, or add, bugs and don't tell you. Or change the firmware they are shipping on a part that's hanging off a UART. Or how about discovering that in the 21st century, one of your suppliers doesn't use source control for their firmware, and every time they send you a firmware blob it consists of some patches applied to whatever code happened to be lying around on some developer's machine!


Had a fun one relatively recently that was a mix of "hard to reproduce" and "hard to get internal state information". Flaky test in a rails app that would fail one in every ten to hundred runs of the full test suite with "this random number is too big to be a primary key" kind of message. Root cause was an edge case passing through multiple swiss-cheese holes in various assumptions:

1. ActiveRecord makes an assumption that primary keys are integers, and does its own check whether or not they are big enough to be persisted (rather than catching a database error).

2. Furthermore, it does this by coercing the key to an integer if it isn't already one. This is done with to_i, which for a string takes any leading 0-9 characters and discards the rest.

3. We had a table with a string primary key (UUID of some sort).

4. One of our test factories was generating a hexadecimal string for that primary key.

5. And the test factory was not deterministic and did not respect the --seed flag in the test suite.

So the end result was a very innocuous-looking line of code occasionally generating a hexadecimal string with enough leading numeric digits to be larger than ActiveRecord thinks you can stuff into a table, causing an extremely cryptic error message. It does not reproduce with the same test seed. And the stack trace was about three frames and discarded all the context - all I could see was that ActiveRecord was throwing a fit over somehow mysteriously receiving a large number.

Figuring that out was honestly like 80% pure luck. I chased down a hunch that it was test object generation somehow and did that in the REPL, then narrowed it down by looking through child object generation until the haystack was small enough that I couldn't not find the needle.


>And the stack trace was about three frames and discarded all the context - all I could see was that ActiveRecord was throwing a fit over somehow mysteriously receiving a large number somehow.

This is Python rather than Ruby, but my average debugging time and frequency of "impossible to figure out" bugs drastically decreased once I started using a traceback library that provides a lot of context to each stack frame.

It makes logs containing any raised exceptions much larger and more tedious to scroll through, but the benefits are more than worth it; especially for production services.

I use better-exceptions (https://github.com/Qix-/better-exceptions), but there are a bunch of other good libraries as well.
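
If you'd rather not add a dependency, the Python standard library can get part of the way there; a minimal sketch that formats a traceback with each frame's local variables (risky_operation is a placeholder):

    import traceback

    def format_with_locals(exc: BaseException) -> str:
        te = traceback.TracebackException.from_exception(exc, capture_locals=True)
        return "".join(te.format())

    try:
        risky_operation()   # placeholder for whatever might blow up
    except Exception as exc:
        print(format_with_locals(exc))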


Yeah, when we turned this feature on in Sentry a few years back, bugs got so much easier to fix. 90% of the time the stack trace and all arguments along the way were sufficient to figure out the full cause just from reading, no need to try to reproduce.

It is enormously CPU-expensive though, so it can be risky in production. Small-ish error spikes can cause enough latency to cause more errors, which cause more latency...


> the bug is actually 3 bugs

I hate these ones in particular. My debugging strategy is generally like a surgeon's: do no harm. When investigating a bug I try to have as absolutely few moving parts as possible. Otherwise it's too easy to create knock-on bugs or interfere with the repro. Many times I have "fixed" a bug by changing something only to later realize that all I did was cause the repro to no longer manifest it.

My process is usually something like:

1. Come up with hypothesis for cause.

2. Fix the code according to that hypothesis.

3. Did it work? If so, done. If not undo all changes from step 2 and try again with a new hypothesis.

But when a bug is the confluence of several issues, that step 3 can make it impossible to find a fix. I hate having to make multiple speculative changes (especially when the right fix could be any of the exponential number of combinations of them). Often I end up going in circles because I realize there must be multiple different problems interacting to cause the issue.


First off, everything the author writes is worth a read and I want to thank her for that. Re:

>the error message has 0 results when you Google it

While this used to (and still does, I suppose) cause a bit of throat-tightening for me, I've learned this usually means it's a case similar to "your assumptions are wrong". It tends to be something I have misconfigured or a dead-bang obvious typo that my eyes look right past. Often it's something like having the wrong virtual environment in one shell tab which is causing a process to half-work but then fail in a misleading way.


Another hole I used to put myself in: I would often copy and paste stack traces verbatim, but most of a stack trace is just file paths unique to your user account, and of course nobody has posted a similar stack trace like that, because they don't have the same username and folders as you do. I've learned to copy and paste smaller snippets that are more likely to be generic across many different machines, and I've found much more success with that strategy.


Even better: the only result is a stackoverflow post that your coworker made, asking about the exact issue you're investigating, with zero answers


Even worse - when it's a stackoverflow post from yourself 2 years before, and there's still no good solution. This has happened to me.


Even more worse: it's a stackoverflow from yourself 2 years before, which you then closed because you "figured it out" (without expanding on how)


Oof, that really is even worse, because you only have yourself to blame


Relevant XKCD https://xkcd.com/979/


I actually don't get this one. When you can find your bug on the Web it's like the debugging never even started :P


>it’s very slow to reproduce

I had an embedded system that hosted USB endpoints that normally booted in ~3 seconds and could connect to the host in 5-6 seconds. We found an issue in environmental testing where the unit took longer to connect the colder it got: ~10 seconds at 0C, 20 seconds at -10C, 120 seconds at -20C, and below -30C it could take 15 minutes or just never connect.

We had to instrument the whole device and stick it inside a chamber to debug. Every change meant waiting another 20 minutes for the chamber to cool down. Eventually we found a sense line left floating that would float high once it got warm enough. Probably 3-4 weeks of troubleshooting, and it ended up being a single device tree edit to configure the pin with a pull-up.


This software phenomenon has led me to feel the "reverse" of this in real life, and to start assuming some wild things, e.g., (a) I know someone was recently in the room with me; (b) they have not left through the doors or the window; (c) there are no good hiding places in the room; (d) they've >> temporarily left my field of vision and when I turn back around I don't see them << ... I immediately ponder the possibility they have disappeared (or been raptured) rather than just quietly walking to remain outside my field of vision as I turn around.

So I start thinking that maybe impossible-in-real-life things actually have happened.


  the bug is hard to reproduce locally
One of the things I like about LLVM is that it is written as passes. It's possible to stop after a pass and dump everything to IR/GMIR text. It's also possible to start a pass with this text. This makes unit testing of passes possible.

GlobalISel is a rewrite of the instruction selection mechanism. I'm not sure what GlobalISel offers above monolithic SelectionDAG (well, it's faster), but it is much easier to test because it's broken into irtranslator, legalizer, regbankselect and instruction-select passes, each of which can be unit tested independently.


Probably the most impossible bug of my career was that our static website didn't always load in iOS. I don't mean the page rendered wrong or a script broke. I mean that Safari just said "no, I can't load this page".

It ended up being an issue with the load balancer, that only affected some iPhones.

Eventually I tracked it to the load balancer, as hitting hosts directly worked fine. Also, the dev/QA environments worked fine since there was no balancer. For a while devops would argue it was the application team's fault :)

Basically this, except at the time (2017 or so) Google wasn't as helpful:

https://www.google.com/amp/s/www.18aproductions.co.uk/news/v...


For me the hard bugs are the multi-threading ones where I made a bad assumption about order of execution. One from 30 years ago that I still remember is when a "packet received" interrupt for a response came in before the "packet sent" interrupt of the request, due to queuing in the lower levels of the network device driver. It totally crashed our system because the pointer pointing to a transaction data structure wasn't initialized yet. It caused a triple fault so it was an instantaneous reboot.

Since then I've become more defensive with classic multithreading.


I’m doing this right now with a bug in the Linux kernel. It takes something like 5 hours to repro with a very specific condition because it’s an SMP race / ordering bug.

I literally haven’t slept properly since I hit the bug because once I find something like this I have to find a fix. lol.


I rooted out a couple thousand heisenbugs in an old system I was working on. It turned out the web-based platform had no cache invalidation / cache-busting feature enabled, so any time an update was published, the browsers didn't always get the changes.

The icing on the cake was that it was our app that was bad, and the 'fix' they implemented was a completely broken work-around.

It was SOP to instruct the clients to turn off browser caching, so the app was slow as well. Inevitably the client's on-site IT would install a new desktop and forget to turn off browser caching. So you'd get these weird states where errors would occur randomly, depending entirely on whether someone had changed or hadn't set that setting.

Years of cruft and a chaotic deployment workflow meant there were easily 10,000 different places where the cache busting would need to be implemented. I figured out a way to fix this using nginx as a caching layer, and using features of a brand-name Web Toolkit which we had already partly implemented without caching. I even had a test harness set up and a way to catch bugs during a transitional stage.

Shame there weren't any specific tickets on this exact fix, because otherwise I would still be working there. Sure, I was assigned a bunch of the random error ones and found out what was causing them, but apparently being assigned a ticket and fixing the problem isn't part of my job?!

The head of the web dev team (and my manager) quit shortly after I was hired. I now understand why.

Fixing bugs is impossible sometimes.


The bugs I hate the most are the ones that you struggle with for days and then figure out 20 minutes after posting a StackOverflow question or sending an email. So embarrassing yet it happens so often...


The trick is to prepare your StackOverflow question, wait twenty minutes before you post it, et voilà! the answer appears!

Realistically, the act of preparing the issue does trigger different thought processes, often encouraging you to be more rigorous in your presentation. Rubber-duck debugging works similarly in my experience. I know it can sound silly, but it’s helped me more than once. Both techniques have, actually.


A big part of it is "letting go" of the issue, too. After I post a question or submit an issue I just wait until I get a reply and go do something else, and that's often when the solution comes to me. It's similar to taking a shower or going for a walk in that regard.


I dealt with one bug that was broken on 8 different points until it looked like another bug.

It was paginated content. It would download a page in one class and concatenate it to the existing content. It then sent that page to the view class, which added it to the bottom of the list.

This would make duplicate content, e.g. ABCDABCDEFGABC... instead of ABCDEFG. Someone had the brilliant idea of filtering new content from the existing ones. So it would be ABCD+(ABCDEFG-ABCD).

So for the most part it worked exactly as planned. But then there would be a point where the app modified data locally. Say, you add a comment on B. B becomes b. Now you have AbCD+(ABCDEFG-AbCD)= AbCDBEFG. Oops.

In the real world, this was done over so many classes, superclasses and so on that it wasn't clear why it was randomly inserting B at certain points, and at which points it was doing this. The behavior performed exactly as tested, but it was just poorly designed behavior and we ended up spending a few weeks ripping out and rewriting the code for this.
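
A tiny Python sketch of the failure mode described above (letters standing in for real records): deduplicating by value works until a locally modified item no longer matches its server-side twin, and the stale copy slips back in.

    existing = ["A", "b", "C", "D"]                  # "B" was edited locally into "b"
    new_page = ["A", "B", "C", "D", "E", "F", "G"]   # server still returns "B"

    merged = existing + [x for x in new_page if x not in existing]
    print("".join(merged))   # AbCDBEFG -- the old "B" sneaks back in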


>In the real world, this was done over so many classes, superclasses and so on that it wasn't clear why it was randomly inserting B at certain points, and at which points it was doing this.

If OOP gave us anything it's the joy of trying to piece together huge puzzles.


Monolithic hell: 40 bugs in three files.

Classes hell: three bugs in 40 files.


I particularly love spending a few hours on a bug and then committing a one or two line fix. Makes me look like I do nothing at work.


My two "favorite" hardware bugs: a dead RAM stick that partially corrupts memory pseudo-randomly (no ECC), so everything sort of works but is really weird; and a CPU that for some reason misbehaves so much without a newer non-free microcode/firmware version that everything sort of works but breaks in random ways (just like the broken RAM).

The former really plays with my nerves because I tend to question the whole universe before I question the hardware. The latter is even more evil because memtest/smartctl won't complain about anything... I'm not sure if there's an equivalent utility to test CPU health? Anyway, it took me a while to even think of trying non-free intel-microcode, as the CPU/kernel wouldn't produce a helpful error message like "You bought hardware which is pure shit. Please install some more binary shit in order to use it at all without losing your sanity."


On CPU issues: at my 4-year college we did a bunch of projects with 68HC11s, which stop branching when voltage is low. After a few episodes of this, I eventually figured out that when it started running straight through everything, it was time to recharge the batteries (and probably take a break anyway).


I love this. Reminds me of a sleep-deprived driver who keeps missing highway exits.


Fun extra fact: the debugger would work, and you could single-step through it, watching it not take branches it should clearly take. My professor told us it was designed to be a low-voltage CPU, but they had some issues, so it was 5V only... I'm guessing this was the issue.


We've solved the halting problem! Just lower the voltage, ignore branch/jump instructions, and run straight through.


I smell a new esolang: Alan. Has a dozen branch instructions but they’re all ignored.


My most impossible bug was a Direct3D issue. It was impossible in the sense that what we were seeing on the screen wasn't a result of the code we were writing.

It turned out to be Direct3D's debug mode. Once we flipped it back to non-debug, everything started working again.

I became a lot more skeptical of debugging tools after that day.


Hah! One of the first things I learned programming was that you could never trust 'debug' and 'release' builds to behave the same. This was back in the late DOS, early Windows days.


Here's another fun one: The bug is in hardware. I once spent almost a week trying to figure out why a board wouldn't boot until I poked around it with an oscilloscope and noticed two bias resistors for the DRAM weren't populated. Whoops.


I fixed a problem where an ARM microcontroller running linux was crashing every so often. Eventually figured out it happened when the radio transmitted a packet.

First fix was realizing that sometimes the radio would trip the reset controller's button input. Adding a 10k resistor reduced the problem by 90%.

Second fix was realizing that the LDO voltage regulator supplying 1.1V to the processor core was briefly tripping its over-current detector when a packet was transmitted. Every so often that would cause the processor to crash.


The Brian Kernighan quote explains this very simply:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.


I recently found a silicon bug in a microcontroller. I had spent 18 hours trying every other possibility before, by chance, the test code made it very obvious that the chip was broken.

I've only been doing this Electronics Engineering thing for a year and I've found 3 serious silicon or documentation bugs. I'll have to recalibrate my heuristic about the prevalence of silicon bugs.

Which is annoying, because so much of experience is learning heuristics to limit the search space for bugs. Now I have one more thing to worry about.


I spent a couple of years writing firmware/tests for custom radio transceivers. I found bugs all the time. It was probably about half my job: find a bug, document it, and figure out a workaround.

Last week I found a bug in an older version of gcc. When a particular function was called exactly once in the program, the compiler generated bad code that messed up the stack. Other programs using it called the function in a couple of places and worked fine.


I once had a piece of Windows code that made a conditional jump, after executing a 3rd-party library function. The jump frequently went the wrong way, even though the values for the conditional jump were as expected.

It took a long time to crack, using a machine-code debugger.

It turned out that the assumptions of the Microsoft code generator (used to compile the problematic piece of code) were different from the assumptions of the compiler used to build the 3rd-party library, which I think was built using Borland.

Specifically, the Microsoft compiler saved the flag register after entry to a function, and restored it prior to exit; the Borland compiler saved the flag register before calling a function, and restored it after the function call returned.

So library functions compiled with the Borland compiler usually stomped on the flag register, but they worked fine with caller code compiled by Borland. But if you called a Borland function from Microsoft code, anything in the flag register before the function call was lost forever.

It doesn't matter whether it's the caller's job or the callee's job to save and restore the flag register; but it's critical that caller and callee respect the same convention. The Borland convention was shared by most other non-Microsoft compilers. It was the Microsoft C compiler that was deviant.

I believe that this was a deliberate effort by Microsoft to make competing C compilers not work properly with Microsoft code.


In about 15 years of being a developer, I have met a lot of difficult bugs in all categories of this list.

But there is another kind of bug that has happened to me only 3 times so far, and that I definitely consider to be the hardest of all.

It's when the language interpreter/compiler itself is broken.

This is a nightmare to find and understand because when it happens, nothing makes sense. You can take a break, have a whole night of sleep, have a walk or talk to a rubber duck, it still does not make sense because the logic itself is broken.


My favorite kind of killer bug is the Heisenbug - a bug that changes its behavior depending on how you observe it. In the general case, it goes away as soon as you start to use a debugger or put logging statements into your code.

The most common bug of this sort involves race conditions and subtle corner cases wrt timing, in systems with independent asynchronous processes interacting in ways that are difficult or impossible to model using formal methods.


The best/worst is when you've run out of ideas that make sense, so you start trying ideas that don't make sense and one of them works.


Given the enormous costs of some of these long-tail bugs, it should be more of a priority to have things in place that try to prevent them from happening in the first place.

When I started in C++ I knew there was a lot I didn't know, but I assumed it would get sorted out over the life of the program.

Now I'm almost afraid to write C++, or at least to step off the yellow-brick-road of tried and true things 'I know for sure'. I'm afraid for people on my team doing that, and I feel weirdly paternalistic when requiring my devs to do the same. I feel like an adult explaining why we 'walk on the sidewalk, not the road' to adults who actually do know the sidewalk is safer, but are probably unaware of the statistical liability of walking along the road even as an adult.

As a developer - I love solving these problems. As a product leader they scare me to death - losing x% of your development time to never ending rabbit holes.


I have an "impossible" debugging situation right now, though it's a little outside the scope I think.

My Windows C drive, an SSD, died unexpectedly after less than a year of operation (I literally checked the SMART stats the day before, for what little info they offer). I bought a new Samsung SSD, installed it, restored from a backup (which was harder than it should have been, Backblaze), and now I get random bluescreens. No issue, I've had BSODs before; I know how to use WinDbg to explore them.

Problem 1: BSOD claims it is an "unrecoverable hardware error".

Problem 2: No dump file is generated, and I have no idea why. My system is set to generate one, I've turned off autorestart, Windows claims the SSD volume is "healthy" after restart, but it does not write the file.

My only hope is that I'll figure out how to set up remote debugging or something


I had an impossible debugging situation. One day, my Windows 10 machine decided that it would get a NTFS_FILE_SYSTEM error on boot. A BSOD on boot means it'll try the recovery partition… which caused a FAT_FILE_SYSTEM error just after the first conhost.exe window appeared. After “recovering”, it'd try to boot normally again – looping between two slightly different OSs, getting basically the same error over and over.

Eventually, I gave up on debugging, and tried to re-install. I put in the recovery disk… and it FAT_FILE_SYSTEM BSOD'd, too. When booting off a completely different device, with which there were definitely no file system errors.

So I put Debian on the machine. No problems, for some reason.


At the risk of going further off-topic, what issue did you experience with Backblaze? I rely on them for my backup too (yes, 3-2-1 rule, etc.)


I got past feeling that bugs were "impossible" early in my career - if it happens in a computer, it can be fixed. What is impossible is providing the demanded estimate as to when such a bug will be fixed. But we have to anyway.


There's always a cause, nothing is impossible. Today might not be your day though.


I've had all 4 issues, in my time.

I am grateful that my initial training was as an RF tech, and that my first job was as a tech at a microwave receiver factory. It taught me how to find really difficult problems.

After that, most software issues are a cakewalk.

The worst ones are occasional threading issues, buried inside a dependency. Sort of an "all of the above" bug.

That's one big reason that I avoid dependencies like the plague. You only have to have one or two of those, to learn religion.


Solved an “almost impossible” today with “0 hits on Google”. Felt awesome - this is why I like programming.


The more "impossible", the more interesting and rewarding in the end!


I disagree. When you’re debugging a black box it can be even more frustrating to find out what the issue was. Often it’s some unintuitive inconsistency that was not documented. Those are not rewarding, because you learn very little from them except to be suspicious.


They tell a story of the quality of the software. A good way to decide whether to use some library is to look at how well it throws exceptions. If it's buggy and poorly documented then don't use it.


A similar question was asked and answered on SO:

https://stackoverflow.com/a/1268464/59087


As I got better at coding, more and more of my bugs were check-the-plug situations. And it’s just the worst because your brain excludes the easy stuff as a possibility.


Bugs are fine; race conditions that leave no proper log are something really special to troubleshoot, especially if they only happen under certain conditions.


Obligatory link to "Debugging Rules":

https://debuggingrules.com/

-----------------

I.... hate debugging. Just hate it. If I'm developing some code, and it doesn't do the right thing, that's fine, I'll find and fix the problem (usually, unless something wacky is happening with a 3rd party lib).

But ask me to figure out a problem with a large and complex system, and I just find that so discouraging. I know how to do it (in part thanks to the above), but I just don't like the process. You never know how long it will take. You never know if you'll end up digging down further and further, only to find it's a problem with the hardware or something else that is hard to fix (I do a lot of embedded development).


I love debugging, just not under time pressure; that I indeed hate. Something that compounds the problem is clueless people trying to extract a deadline out of you -- "So, how long is this going to take?" No matter how careful you are in making sure that the estimate you give is not a commitment, it will always be treated as one.


Ambiguous title. I thought it was about learned helplessness in insects.


"Feel impossible" is a pretty awkward way to say "experience learned helplessness".



