Here they mention that each bisect ran a large number of times to try and catch the rare failure. Reminds me of a previous experience:
We had a large integration test suite. It made calls to an external service, and took ~45 minutes to fully run. Since it needed an exclusive lock on an external account, it could only run a few tests at a time. We started getting random failures, so we were in a tough spot: bisecting didn't work because the failure wasn't consistent, and you couldn't practically run a single version of a test enough times to verify that a given version definitely did or didn't have the failure. I ended up triggering a spread of runs overnight, and then used Bayesian statistics to home in on where the failure was introduced. I felt mighty proud about figuring that out.
Unfortunately, it turns out the tests were more likely to pass at night when the systems were under less strain, so my prior for the failure rate was off and all the math afterwards pointed to the wrong range of commits.
Ultimately, the breakage got worse and I just read through a large number of changes trying to find a likely culprit. After finally finding the change, I went to fix it only to see that the breakage had been fixed by a different team an hour or so before. It turned out to be one of our dependencies turning on a feature by slowly increasing the probability it was used. So when the feature was on, it broke our tests.
However, it's not surprising when you consider the massive breadth of software that Microsoft builds, as one of the oldest and largest software development orgs.
Amazon (or, at least, my corner of it) still hadn't when I left ~9 months ago - and I'm glad of it. I've moved to a company where one of the core products (though, thankfully, not my team's) is in a monorepo, and from everything I've seen it looks like a horribly inefficient way to develop.
This is astonishing. The build (and deploy) systems are, by a considerable margin, the things I miss most about having left Amazon (CDO, not AWS, but still). What do you dislike about them?
> I had to spend days trying to debug why a build that was working fine broke completely
Sure, but I bet that it was helpful to have the `bats` tool so that you could replicate the build locally, right? As compared with other build systems where (so far as I can see - though I may be wrong) you basically have to push a debugging change for replication.
> I disliked how needlessly convoluted the pipelines are, and how some person pushing on accident to mainline can break everything.
This is true of any CI/CD system, though? In any system, if there's no push-protection set up so that you can only merge into main(/line) once a change has been reviewed (and run your tests at the point of review so that you know the merge won't break anything), you have only yourself to blame for breakage.
> So many things seem to be done the hard way.
Genuine question - what do you find convoluted/hard about them? To me, the apparently-industry-standard of "push a change to your App Code, which triggers a build to generate a docker image, then trigger an automatic commit containing that Docker image to a Deployment Package, which is picked up by your CD system and creates a deployment" is way more convoluted. Having a conceptual "pipeline" built out of lots of little disconnected GitHub Actions (or whatever) is also way harder for me to wrap my head around than the CDK definition of a linear pipeline.
The problem was caused by a forced deprecation of NodeJSFunction in CDK, which basically made it impossible at the time to add any dependencies… there was nothing I or the more senior engineers could do to solve it.
I figured out a workaround that involved having a separate manifest for the deps and packing them manually. It worked… I also tried the lambda without any dependencies, and the lambda's dependencies were still available in the instance even though they were not listed anywhere.
The needlessly convoluted part is getting a NodeJS function into prod: the forced change caused something to break even though I was already on 18_x.
But I will not lie, bats was super useful when debugging another engineer’s build, and avoiding the whole push to debug is so so so so helpful.
For the pipeline thing, idk why it was set up as such; but it certainly broke everything once a push to mainline was done, kinda like a runtime error when it should have been a compiler error. Though it was certainly on us.
CDK is fine too. Tbh I kinda love CDK and don’t want to go to anything else when it comes to cloud deployment.
In retrospect, perhaps that one sour experience had just too much of an impact.
I want a better way to manage external dependencies mostly. Afaict for scala I will need to pull the package into Brazil to get it to work. An analogue to NPMPM would be great but I can see why there isn’t one yet. A colleague had issues getting python dependencies to work so yeah.
> I want a better way to manage external dependencies mostly
On that point, I'm totally with you. The excessive caution about the software supply chain is probably justified given the impact of a potential incident, but certainly frustrating for the >99% of times that dependencies are safe.
FWIW, I think best practice here is to hardcode all feature flags to off in the integration test suite, unless explicitly overwritten in a test. Otherwise you risk exactly these sorts of heisenbugs.
At a BigCo that’s probably going to require coordinating with an internal tools team, but worth getting it on their backlog. All tests should be as deterministic as possible, and this goes double for integration tests that can flake for reasons outside of the code.
No, the best practice is that on each test run, every feature flag used implicitly or explicitly needs to be captured AND it must be possible to re-run the test with the same set of feature flags.
That way when you get a failure, you can reproduce it. And then one of the easy things to do is test which features may have contributed to it.
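To make that concrete, here's a rough sketch of the capture-and-replay half, with a hypothetical flag client and made-up names, replaying via an env var:

    # Hypothetical sketch: record every flag a test reads, and allow pinning them on re-run.
    # Assumes some flag client exposing get(name); all names here are made up.
    import json, os

    class RecordingFlags:
        def __init__(self, real_client, pinned=None):
            self.real = real_client
            self.pinned = pinned or {}   # flag -> value forced for reproduction
            self.seen = {}               # every flag this test touched, with the value it saw

        def get(self, name):
            value = self.pinned.get(name, self.real.get(name))
            self.seen[name] = value
            return value

        def dump(self, path):
            # Attach this to the failing run's artifacts so the exact run can be replayed.
            with open(path, "w") as f:
                json.dump(self.seen, f, indent=2, sort_keys=True)

    # Replay: FLAG_SNAPSHOT=flags-from-failed-run.json pytest test_checkout.py
    def load_pinned_flags():
        snapshot = os.environ.get("FLAG_SNAPSHOT")
        if not snapshot:
            return {}
        with open(snapshot) as f:
            return json.load(f)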
I strongly disagree. If you have non-deterministic tests, you are going to have builds breaking for unrelated changes, seriously hampering developer productivity as teams chase down failures unrelated to their change.
Nothing kills confidence in testing more than test flakes. It’s a huge drain on velocity and morale, and encourages devs not to trust test output.
If you want to have some sort of chaos monkey process that runs your test suite flipping feature flags at random and notifying teams of failures (along with some sort of resourcing to investigate) I could get behind that. But that should be something outside of the main suite that gates code deployment.
If a test passes when run by a dev pre-commit, it should pass in CI.
Also you end up with some strange long-term test behavior. Because people will often leave feature flags in place long after full release (sometimes for years), default-off-in-tests means you're only ever testing the behavior with everything added since the last feature-flag cleanup disabled.
Yes, it's kinda a fractal of bad practices that have to align for this problem to occur, but that's the nature of tech debt.
I agree that this is a real and separate problem, but I believe the solution lies outside of the test suite.
One way I have seen this handled is to enforce restricting rollouts of a feature flag to 95% at most. That way turning a feature all the way on requires removing the flag from your codebase. It’s draconian, but honestly anything less than that leads to the situation you describe.
I like that idea a lot. We've been informally doing it on my current team, made easier since we can sort of cleanly do atomic code+flag updates in a single commit
Agreed, I am advocating for deterministic behavior for all feature flags in the test suite.
If you’re testing a new feature, you should have explicit tests for the enabled state (along with existing tests for the disabled state).
If you have bugs propagating up the stack from flags changing in low-level dependencies, the change to the dependency is probably not properly tested.
Alternatively, if the feature flag gates a change to the interface of the dependency, you should have explicit integration tests covering the systems on both sides of the change.
Man, this story sounds like you could be on my team :-) Pretty much experienced the same stuff working at BigCo!
In the end, I think the real problem is that you can't test all combinations of experiments. I don't trust "all off" or "all on" testing. In my book, you should indeed sample from the true distribution of experiments that real users see. Yes, you get flaky tests, but you also actually test what matters most, i.e. what users will - statistically - see.
Basically, if you have N different features (let's assume they are all on/off switches, but it works for multi-values too), in theory you'd need to run 2^N tests to cover them all, which would become completely impractical. But, you can generate a far, far smaller set of test setups that guarantee that every pair of features gets tested together. Run those tests and you'll probably encounter most feature-interaction bugs in a much quicker time.
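Here's a rough greedy sketch of generating such a suite (toy code, not a production covering-array tool - real generators do this better):

    import itertools, random

    def pairwise_suite(n_features, candidates_per_step=50, seed=0):
        """Greedily add on/off assignments until every feature pair has been seen
        in all four (off,off), (off,on), (on,off), (on,on) combinations."""
        rng = random.Random(seed)
        pairs = list(itertools.combinations(range(n_features), 2))
        needed = {(i, j, a, b) for i, j in pairs for a in (0, 1) for b in (0, 1)}
        suite = []
        while needed:
            best, best_hits = None, set()
            for _ in range(candidates_per_step):
                cand = tuple(rng.randint(0, 1) for _ in range(n_features))
                hits = {(i, j, cand[i], cand[j]) for i, j in pairs} & needed
                if len(hits) > len(best_hits):
                    best, best_hits = cand, hits
            if best is None:   # rare: no candidate covered anything new, so force one combination
                i, j, a, b = next(iter(needed))
                forced = [rng.randint(0, 1) for _ in range(n_features)]
                forced[i], forced[j] = a, b
                best = tuple(forced)
                best_hits = {(x, y, best[x], best[y]) for x, y in pairs} & needed
            suite.append(best)
            needed -= best_hits
        return suite

    # 20 on/off features would need 2^20 ≈ 1M runs exhaustively;
    # pairwise coverage typically needs only a dozen or so assignments.
    print(len(pairwise_suite(20)))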
All-pairs is for _pairs of features_. For subsets you're in much deeper trouble because of the exponential dependence on N. For a fixed polynomial dependence, you can get clever and let tail bounds eventually work for you, but for exponentially growing hypothesis sets, that won't work.
This is relatively subtle stuff, but here's an attempt at describing the general problem. I'm going to describe the deterministic case, but the probabilistic case is effectively the same.
Let's say you have a bug you suspect is from an interaction of any one pair of 10 features being "on" or "off", but you don't know which specific pair causes the problem. Encode each of the states you could set up your code by a 10-digit binary string: 0000000000, 0000000001, 0000000010, 0000000011, etc.
We could try the 45 possibilities in some order, and we would expect that on average it'd take us 22.5 tries to find the bug. But notice how your "target set" is smaller than the universe of strings: there's only 45 pairs of features, but 1024 strings.
What happens if we try a random string of ones and zeros? Now, instead of catching just one possible pair, we are covering many pairs. The only problem is that we now won't be able to know exactly which pair caused the problem when it does. But we can build a corpus of strings that don't trigger the error vs. strings that trigger the error, and a random sampling soon converges on the correct pair.
If you think about why this works, it's because any of these random strings has about a 1/4 chance to trigger the bug: wlog we can reorder the bits so that the buggy features are the first two digits, and then we see that we have a 1/4 chance of hitting "11" on those two digits.
The problem is that as you increase the size of the subset that needs to be active, the probability that your random strings will actually catch the bug decreases exponentially. For any _fixed_ target size k (the number of features that need to be active), the overall complexity is still polynomial in n (the number of existing features). But if k is a constant fraction of n, then this technique takes exponential time in n.
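For fun, a toy simulation of that, with the culprit pair hardcoded purely for illustration (the seed and run counts are arbitrary):

    import itertools, random

    N_FEATURES = 10
    CULPRIT = (3, 7)          # made up: the bug fires only when features 3 and 7 are both on

    def test_run(flags):
        """True = test passed; the fake bug triggers when the culprit pair is on together."""
        return not (flags[CULPRIT[0]] and flags[CULPRIT[1]])

    rng = random.Random(42)
    failing, passing = [], []
    for _ in range(60):       # each random string exercises all 45 pairs at once
        flags = tuple(rng.randint(0, 1) for _ in range(N_FEATURES))
        (passing if test_run(flags) else failing).append(flags)

    # Suspects: pairs that were both on in every failing run and never both on in a passing run.
    suspects = [(i, j) for i, j in itertools.combinations(range(N_FEATURES), 2)
                if failing and all(f[i] and f[j] for f in failing)
                and not any(p[i] and p[j] for p in passing)]
    print(suspects)           # shrinks toward [(3, 7)] as the corpus grows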
You just need to make sure that this doesn't mean people are consistently "lucky" or "unlucky."
I was on a team where app updates were deployed using a canary system. A small percentage of users (say, 1%) received the update first, then the team watched for incoming crash reports from that cohort. If it looked good, the feature was rolled out to a few more people, and this was repeated. This allows you to identify a problem by only negatively impacting a relatively small percentage of customers.
The problem occurs when the calculation to determine which cohort the user belongs to is deterministic. In this case, the calculation was based on the internal ID of the user. This means some users always get the updates first, and deal with bugs more frequently than other users. Conversely, some users are so high in the list that they virtually never get an update until it's been tested by a wide user base, so their experience is consistently stable.
Or have the username be a number that is all the feature flags when converted to a binary representation. Then you can just have one username for each combination you want to test.
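Something like this, with made-up flag names:

    # Hypothetical: pack each test account's flag combination into its numeric username.
    FLAGS = ["new_checkout", "dark_mode", "fast_search", "beta_api"]   # bit 0, 1, 2, ...

    def username_for(enabled):
        return str(sum(1 << FLAGS.index(f) for f in enabled))

    def flags_from_username(username):
        n = int(username)
        return {f for i, f in enumerate(FLAGS) if n & (1 << i)}

    assert flags_from_username(username_for({"dark_mode", "beta_api"})) == {"dark_mode", "beta_api"}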
The important part is the stability - if your usernames can change then they aren't stable, so you shouldn't select on them.
I think it is a good reminder that most things you think of as being unchanging that are also directly related to a person.. aren't unchanging. Or at least any conceivable attribute probably has some compelling reason why some one will need to change it.
That's why you have internal user ids instead of using data directly provided by users.
Will it cost an extra lookup? It's cheap, and if you really need to, you could embed the lookup in some encrypted cookie so you can verify you approved some name->id mapping recently without doing a lookup.
Wait, we're talking about maliciously injecting bugs into your employer's software so they have the maximum impact, right?
Clearly, making sure that 1% of all teams gets fired for being unable to run unit tests, then slowly ramping that by a few percent each review cycle is a good strategy.
Ideally, the probability of breaking would drop off exponentially as you moved up the org chart. Something like "p ^ 1/hops_to_director_of_engineering" would work well. The trick would be getting the dependency to query ldap without being detected...
I've used the hash of username+string trick before for a flag. I used it to replace a home-grown heavyweight A/B testing framework which had turned into a performance bottleneck.
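For anyone who hasn't seen it, the trick is roughly this (a sketch, not the framework I replaced):

    import hashlib

    def in_rollout(username: str, flag_name: str, rollout_percent: float) -> bool:
        """Deterministic per-user, per-flag bucketing: hash(user + flag) -> a bucket in [0, 100)."""
        digest = hashlib.sha256(f"{username}:{flag_name}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
        return bucket < rollout_percent

    # Salting with the flag name means the same users aren't always the guinea pigs,
    # unlike bucketing on a bare user ID.
    print(in_rollout("alice", "new_checkout_flow", 5.0))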
That wouldn't work here though, the dependency started by breaking an almost-undetectable fraction of the time.
Imagine a scenario where your upstream dependency started out with one failure per 1,000,000 machine hours, then removed a zero once every 12 months. If you had 100 machines running tests at 100% efficiency, the bug would hit about once a year for the first year, then 10x the next year, and so on.
Put another way, if upstream is malicious, and you're not auditing every line of their source code, you're screwed.
I tested with 6.4.0-0.rc6.48.fc39.x86_64 + f31dcb152a3 revert and all 10000 iterations succeeded (same hardware and environment as my previous post).
To guarantee that there's absolutely no other difference between the two tests, I took the source RPM, added the commit f31dcb152a3 diff + `%patch -P 2 -R`, and built the kernel RPM with mock.
I've been having flashbacks to troubleshooting some particularly thorny unreliable boot stuff several years ago. In the end I tracked that one down to the fact that device order was changing somewhat randomly between commits (deterministically, though, so the same kernel from the same commit would always have devices return in the same order), and part of the early boot process was unwittingly dependent on particular network device ordering due to an annoying bug.
The kernel has never made any guarantees about device ordering, so the kernel was behaving just fine.
That one was.. fun. First time I've ever managed to identify dozens of commits widely dispersed within a large range, all seem to be the "cause" of the bug, while clearly having nothing to do with anything related to it, and having commits all around them be good :)
The dhcp client in klibc-utils had a bug in how it handled multiple interfaces: it didn't create separate sockets per interface, so as it enumerated through them it would clobber the previous one. It validated the destination of the received DHCP response, and silently dropped it if it wasn't for the interface the socket was for.
The DHCP server was only listening on one of the two interfaces, and so if that interface got enumerated second, all was well and good. The socket was for it, response would be accepted. When it came up first, the clobbered socket meant the dhcp response would be ignored.
I bisected so many times and mostly just got confused. The engineer at Canonical dug in and found the actual bug.
Honestly I don't know! We've seen it appear with host kernel 6.2.15 (https://bugzilla.redhat.com/show_bug.cgi?id=2213346#c5) but I'm not aware of anyone either reproducing or not reproducing it with earlier host kernels. All your other config looks right.
I noticed it hangs in similar way when you insert msleep anywhere before smp_prepare_cpus in kernel_init_freeable. But I have no idea whether sleeping is valid here.
Looks like you have a trigger, but no root cause (yet).
Doesn't matter anyway...revert and work it out later.
The root cause bug is still in there somewhere, waiting to be triggered another way...
This reminded me of another story [0] (discussed on HN [1]) about debugging hanging U-Boot when booting from 1.8 volt SD cards, but not from 3.0 volt SD cards, where the solution involved a kernel patch that actually introduced a delay during boot, by "hardcoding a delay in the regulator setup code (set_machine_constraints)." (In fact it sounded so similar that I actually checked if that patch caused the bug in the OP, but they seem unrelated.)
The story is a wild one, and begins with what looks like a patch with a hacky workaround:
> The patch works around the U-Boot bug by setting the signal voltage back to 3.0V at an opportune moment in the Linux kernel upon reboot, before control is relinquished back to U-Boot.
But wait... it was "the weirdest placebo ever!" Turns out the only reason this worked was because:
> all this setting did was to write a warning to the kernel log... the regulator was being turned off and on again by regulator code, and that writing that line took long enough to be a proper delay to have the regulator reach its target voltage.
Before clicking I thought someone kept note of how many times Linux booted in regard to their computing habits, and not testing software. I know for me I boot roughly 3 times a day into different machines, do my work, shutdown, then rinse & repeat.
Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
I had a developer that I inherited from a previous manager some years ago. Made tons of excuses about his machine, the complexity of the problem, etc. I offered to check his machine out and he refused because it had "private stuff" on it. He had the same machine as the rest of the team, so since he hadn't made a commit in two weeks on a relatively simple problem, refused help from anyone, etc., we ultimately let him go.
When we looked at his PC to see if there was anything useful from the project, his browser had around a thousand tabs open. Probably 80% of them were duplicates of other tabs, linking to the same couple stack overflow and C# sites for really basic stuff. The other 20% were... definitely "private stuff".
I’m at the other extreme of “private stuff”. Nothing work related should live on my work machine. It should all be pushed to git or dumped in the wiki (personal pages if nothing else).
On one of my largest projects the IT dept made bulk orders for hardware and doled them out to new hires. 18 months into our new project someone’s hard drive died.
Everyone acted like his dog died. I said no problem let’s go through the onboarding docs. The longest step by far was that the company mandated Whole Disk Encryption but IT hadn’t put it in their old inventory yet. So that was 2/3 of setup time. We found some issues with the docs and fixed them.
Every two to four weeks that summer, someone else's drive would go. You see, we got all of these machines from the same production run, so the hard drives came from the same production run, which was apparently faulty. The process got a little faster as we went. By the end of the summer it was my turn, and people still looked at me like I needed condolences. I got a faster machine for a few hours' worth of work. I'm not sad. All my stuff was on the network already. I lost a couple hours of work, tops.
My company laptop I don’t do anything I wouldn’t be OK with my manager or IT seeing. Even something as simple as a recipe lookup I do on my phone or personal laptop.
With today’s software for managing corporate machines, and corporate VPN with network security and firewalls abound, anything and everything can be seen.
I have a joke Wi-Fi name that I’ve even considered changing (or at least create a guest network) just to be safe. It’s likely overboard, but I like the idea of just mailing my laptop if I change companies, and no worries at all
Coincidentally I just got a new work machine this week. The IT support staff replacing it scheduled an hour of time to transfer files, set up apps, etc. with me. I was done in 5 minutes. Once I logged in, logged into my cloud services, and verified my faulty port problem with my dock was resolved with new hardware, I was done and as productive in a few minutes as I was before. Install a few tools, copy my scripts from the cloud, make a new key pair for SCM, that's it.
Any machine in a company needs to be able to be unplugged and thrown out of a window, without leading to significant data loss, only the inconvenience of the price of a new machine and setup time.
There are very few machines in the world that are actually mission critical, and you might not be able to do that (although for them, you can probably switch components with it still running). Anybody else, you are just betting your company on the lack of fires, hardware failure, etc.
I'm unusually strict about maintaining a separation between work and personal (for instance, I would never allow my personal smartphone to connect to my employer's WiFi), so I wouldn't use personal keys on a work machine at all.
But if those keys (or passwords, etc.) are generated for work purposes, I consider them to be as much company property as the machine itself, so I'm no more protective of them than I am of any other sensitive company data.
How do you feel about giving your colleague your password?
My personal opinion is that I can hold someone legally culpable if their account does something like leak financial information; you have a professional responsibility to secure your account from absolutely everyone.
Administrators acting on your account must of course be heavily logged and audited, which is the case.
> How do you feel about giving your colleague your password?
I usually don't, mostly just out of good security habits, but also because most employers specifically prohibit doing that.
Almost always, your colleague can be given his own access to whatever the password is for anyway. If that's not possible, then I'll share the password and change it immediately after my colleague doesn't need access anymore.
> you have a professional responsibility to secure your account from absolutely everyone.
I agree -- that's part of treating credentials the same way as all other sensitive company data. But it's still my employer's data, not mine.
If I quit the company or if my supervisor wants to see the contents of my machine, I'm fine with that. The machine and everything on it belongs to the company anyway.
Ok, but your private key, session tokens and CLI access tokens (kube configs, gcloud etc;) are your password in those situations.
They tie to your identity, thus you must not treat them the same as company secrets: they are professional personal secrets which should not be disclosed or allowed to fall into anyone else's hands (lest they have to be revoked and cycled).
It's not just good security posture it could affect your career quite badly or lead to legal issues.
> I agree. I don't think I've said anything counter to that (or perhaps I wasn't being clear?)
I think given the context of the thread (don't touch my secrets), saying that you don't have anything you would consider confidential towards your employer or colleagues is a direct contradiction to what I stated.
That's why I'm "arguing" because my employer/colleagues should not have access to my private key, ever.
There are several very legitimate times when my employer needs to have access to my keys. If I'm leaving the company, for an obvious instance.
But my core point is that such keys/passwords aren't really mine, they're the company's and in the end, the company gets to decide what I'm to do with them.
I think the building access keycard is a perfect analogy. I'd never let anyone borrow mine on my own volition, but if the company wants to retrieve it from me, that's their prerogative. It's theirs, after all.
If an employer needs someone’s particular keys something probably went wrong or there’s bad processes in place. But that aside I think the default course of action should be to aggressively guard your secrets and tokens since they represent you. Not as personal or private property but to keep someone (be it a fellow employee or a 3rd party attacker) from impersonating you without authorization.
There are exceptions but the circumstances where an employer would need to retrieve my keys without my assistance are extremely rare and in those instances it’s unlikely I’d still be an employee anyway.
Handing over the keycard is necessary to ensure it's destroyed and can't be used as "proof" you work somewhere (most access cards these days have your name, face and the company logo printed on the front).
The keycard will be removed from the access list to the building even when it's destroyed, they're not considered reusable by most companies.
Your private key is not reusable either: it should be destroyed and revoked from all systems when you leave a company.
We could destroy the keycard with both parties present, that seems safest. I don't mind turning in a private key permanently and getting a receipt at the time, but it needs to be very clear that it's no longer my responsibility.
> but to keep someone (be it a fellow employee or a 3rd party attacker) from impersonating you without authorization.
Aside from a third party attacker (which is well-covered by my normal practices), that's a threat model that I'm personally not worried about at all, really. In part because I've never seen or heard of that happening and in part because if it did, I am confident that there are enough records to be able to prove it.
Internal abuse and attacks aren’t as rare as they should be. You’d be amazed what someone will do to risk their job or even career on impulse or poorly considered risks.
Isn't this largely the point of company directory services? The machines/routers/applications/etc are all doing their authentication against the directory service, and permissions are granted and revoked there. It's a large part of running a company with more than a couple employees, because when someone leaves you don't need to run around changing passwords and wondering if they still have access to the AWS account to spin stuff up, or punch through the VPN. The account in the directory service is just deactivated, and with it all access.
By default this should be what is happening on all but the most ephemeral of machines/testing platforms/etc. And even then if its a formal testing system it should probably be integrated too.
Directory service integration BTW is the one feature that clearly delineates enterprise products from the rest.
> If I quit the company or if my supervisor wants to see the contents of my machine, I'm fine with that. The machine and everything on it belongs to the company anyway.
I'm fine with that, but I still will not share my passwords. I'd be happy to reset the passwords for them if they can't access the data by other means, but as another commenter pointed out, the fact that anything needs to be recovered from my^H^H not my laptop indicates mistakes were made.
My work laptop has a touchscreen. I've never used it, but other people use it by accident fairly often. Usually only once each though, the look of shock is sometimes even worth the fingerprint :D
I've never understood people who do this. You can point at the screen, tap with a pen, take the mouse or keyboard and move the cursor, etc. but surely it's bad manners to splodge your finger on someone's screen.
He was let go after two weeks? No confrontation, nothing?
Sounds very american. In European working culture if you don't show up for two weeks people will be worried that something happened to you and try to work it out with you. This type of all or nothing reaction is a bit sporadic imo.
Yeah, it's not like that part of the story was condensed and might have left out a bunch of details that weren't important to the story. So let's give OP a hard time and make judgements about a situation for which we have not even the slightest bit of context.
Oh absolutely, you're right. I am saying that despite whatever may have happened, two weeks is very short. I feel like it would be at least a month here regardless.
For context, I was brought in with the knowledge he hadn't done anything meaningful since being hired some time before my arrival, and we did reach out and offer help or ask if he needed anything, which he refused, somewhat angrily.
You might have meant to use a word other than "sporadic." That word describes a recurring event that happens at unpredictable intervals, such as snow in a normally hot desert, or a Linux crash caused by a race condition. Other words that fit the sentence better are "unusual," "unexpected," "unjustified," "inappropriate," "surprising," "extreme," "abrupt," or "out of the blue."
(For what it's worth, I'm American, and I disagree with your assessment. We don't know how long the person was given to make progress, and we don't know what was communicated. To conclude that a two-week period without a commit represents the entire period between the start of the poor performance and the termination is, well, a bit out of the blue.)
It was the final two weeks of a long period of zero productivity, despite many offers of help and asking if he understood, needed help, etc. I never enjoy firing someone, even if they are downright awful or mean, and do my best to avoid it. My own involvement was late in the stage, and the thousand or so tabs that were left open I'm sure weren't just two weeks' worth of effort.
When I was hired I was told he was a problematic hire, that hadn't produced anything for long before my arrival. It was basically "We already know we're going to probably have to let him go but if you want to try to work around it, be my guest". I did try to go in with no judgments, as I always do, but he refused help, and refused to even let anyone look over his shoulder and find why this task was taking an order of magnitude too long.
>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
If the OS and hardware drivers properly support sleep, you almost never need to do otherwise (except to install a new kernel driver or similar).
In macOS, for example, it hasn't been the case that you need to reboot in regular OS use for 10+ years.
The "100+ Chrome tabs" or whatever mean nothing. They're paged out when not directly viewed anyway, and if you close just Chrome (not reboot the OS) the memory will be freed in any case...
I've found sleep very reliable on macOS, and both sleep and hibernate reliable on Windows.
I once had my work PC unhibernate and not pop up the login box. The computer appeared to be running normally otherwise; I just couldn't log in, and I had to tap the power button to shut it down. This stuck in my mind due to its rarity.
Can't remember ever having a serious issue on macOS. A couple of my programs sometimes don't survive the sleep/wake cycle, but it's intermittent, and I'm always in the middle of something else when it happens. I've never lost any meaningful work.
> Can't remember ever having a serious issue on macOS.
macos is fine for the most part, but there are some edge cases, such as some sketchy corporate required "security software" that eats up kernel memory or cpu for some unknown reason, a reboot can fix performance issues there
also if you are a dev and apps (like xcode, android studio etc) fill your drive with cache files or have weird background daemons that eat up cpu, at the least a logout/login (or a reboot) can fix some of those weird things
you could manually delete them without a reboot but ymmv
Indeed, and I've found the same myself - my comment was about the reliability of the OS sleep functionality, and deliberately says nothing about the wisdom of never rebooting!
I have actually found both Windows and macOS generally pretty good if you leave them running for weeks at a time, but it's one of those things that's best done only if you really need it (and can accept a non-zero chance of something going wrong). They're not so very good that I'd actually recommend doing it routinely. A reboot every 1 or 2 weeks massively reduces the chance of weird stuff happening.
It boggles my mind that you'd reboot needlessly. My uptime is usually in the hundreds of days.
Sleep is good: I just close the lid. Next time I open the lid it immediately picks up where I left off. Why on earth would you want any other behaviour?
Security-wise: encryption at rest? In high security scenarios you may be required to shutdown so you're forcing "attackers" to go through several layers: motherboard password, disk password, encryption password, OS user password + 2FA, etc.
On my personal machines? I don't shut them down or reboot very often.
At work, however, I have to use Windows. In that case, I shut it down at the end of every workday, in part because that prevents weird issues Windows tends to develop when running too long.
Mostly, though, it's because of those damned forced updates. Since I can't trust Windows to not reboot itself at any random point in time, having the habit of shutting down at the end of the day at least ensures that I won't accidentally lose my state overnight or over the weekend.
If you don't/won't/can't use the group policy editor, I got a lot of mileage out of hibernating the PC and powering it off at the mains. You can't leave it running something overnight, but you can at least quickly get back to exactly where you left things the previous day.
(Powering it off at the mains ensures that even if you have a device connected that could wake the PC up - thus putting your computer in a state where WIndows Update can reboot it - it can't. You can turn this feature off on a per-device basis with powercfg, but then one day you'll plug something new in and leave it plugged in and it'll wake the PC up while you're away and Windows Update will do its thing.)
> in part because that prevents weird issues Windows tends to develop when running too long
What are you using, Windows Vista? I run about a dozen Windows machines, half of them are VMs, and none of them need to be rebooted regularly. Average uptime is over 40 days, and I only reboot when there's a big update. Windows becoming unstable entirely depends on the 3rd party software you install on it. Don't install crapware, you won't have a crap experience.
I reboot most weeks, just to make sure the right stuff happens when I do. (I try to do it in the middle of the day, so there's time to sort out any matters arising.)
A couple of times I've discovered I've forgotten to set stuff to auto-run on login, or things turn out to have lost their settings, or stuff doesn't work for whatever reason - I'd much rather discover this at a time of my own choosing!
A long time ago, I had desktops with huge uptimes. The world has changed. I will no longer go that long without a security update. Too much is now passing through my machine.
(As far as you know, none of your machines were ever hacked.)
Your luck is not good security policy. When I was getting started with Linux in 1992 and only intermittently connected to the Internet via dial-up, I celebrated long uptimes. Now that I do daily banking and other activities on a machine continuously connected to the Internet, uptimes longer than the interval between kernel security updates is just irresponsible behavior.
I would prefer to not have to reboot. I know that is not the world we live in. The stability of the kernel is no longer the reason to think about uptime.
I don't care that you have nothing of value connected to the Internet. I am objecting to the advice about not rebooting.
:( I only reboot when my machine freezes or when updates require a reboot.
I did a lot of on-call in my life and I saved tons of time by leaving everything open exactly as I left it during the day.
~> w
11:19 up 18 days, 17:03, 9 users, load averages: 3.87 2.96 2.39
You haven’t properly kept a machine alive until the clock rolls over.
I logged into a firewalled Windows VM on EC2 that's been running an internal microservice that was acting up, and it caught my eye that task manager showed an uptime of 6 days, making my mind immediately think it might be a bug caused by the recent reboot or perhaps the update that triggered it.
It turns out no reboot had taken place and in fact, the uptime counter had merely rolled over - and not for the first time! Bug was unrelated to the machine and it’s still (afaik) ticking merrily away.
(Our `uptime` tool for Windows [0] reported the actual time the machine was up correctly.)
Microsoft probably never anticipated needing more than a month or two of uptime, since they roll out restart-required updates more frequently than that.
I used to shutdown regularly, then the power situation here in South Africa got so bad that we'd regularly have about 3 hours of power between interruptions.
Restoring all my work every couple of hours was becoming a pain, so I decided to re-enable hibernation support on Windows for the first time in 10 years... And surprisingly it works absolutely flawlessly.
Even on my 12yr old hardware, even if I'm running a few virtual machines. I honestly haven't seen any reason to reboot other than updates.
> I used to shutdown regularly, then the power situation here in South Africa got so bad that we'd regularly have about 3 hours of power between interruptions.
I'm in SA too, and I used to have 100s of days uptime (one even over a year and a half) ... until the regular blackouts.
Had to stop using a desktop, I've resigned myself to using a laptop, purely so that I don't have to boot the thing all the time and lose my context.
I think that there are two types of people. One set of people (I guess, relatively small) don't trust software and prefer to reboot OS and even periodically reinstall it to keep it "uncluttered". Another set of people prefer to run and repair it forever.
I'm from the first set of people and the only reason I stopped shutting down my macbook is because I'm now keeping its lid closed (connected to display) and there's no way to turn it on without opening a lid which is very inconvenient. I still reboot it every few days, just in case.
I’m in the second group (avoid reboots like the plague) but for the reason you attribute to the first: I never trust that my Windows machine - currently working - will reboot successfully and into the same working condition between OS update regressions, driver issues, etc.
Conversely, it boggles my mind that people think 100+ tabs is a lot. I've got >500 open in Firefox at the moment, they won't go away just because I reboot or upgrade. I'll probably not look at most of them again, but they're not doing any harm just sitting there waiting to be cleaned up.
One of the fascinating curiosities you're missing out on is Pressure Stall Information (https://docs.kernel.org/accounting/psi.html). Here's what the PSI gauges look like in htop when kernel support is available:
PSI some CPU: 0.37% 0.78% 1.50%
PSI some IO: 0.38% 0.33% 0.25%
PSI full IO: 0.38% 0.31% 0.23%
PSI some memory: 0.02% 0.04% 0.00%
PSI full memory: 0.02% 0.04% 0.00%
That article was written ~5 years ago. The parent comment has ~1 year of uptime. What makes you think they don't have a kernel new enough to report PSI stats?
Why? I only restart my (linux) laptop every 3-4 months when I update software.
I can't think of any downside that I've experienced from this practice. I do a lot of work with data loaded in a REPL, so it's certainly saved me time having everything restored to as I left it.
>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual.
I would never suspend to RAM or disk, far too error-prone in my experience. (Plus serializing out 128GiB of RAM is not great.) I just leave my machine running "all the time." My most recently retired disks (WD Black 6TB) have 309 power cycles with ~57,382 power-on hours. Seems like that works out to rebooting a little less than once per week. That tracks: I usually do kernel updates on the weekend, just in case the system doesn't want to reboot unattended.
> Then you have those types who put their machine into hibernate with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
Hey, I'm that guy (although I put it to sleep instead)! It honestly works really well and is in stark contrast to how Linux and sleep mode interacted just ~10 years ago. It's amazing for keeping your workspace intact.
(FWIW, I also don't reboot or shutdown my desktop where it acts as a mainframe for my "dumb" laptop.)
Why would I ever reboot my laptop without a need to? I only reboot when there's a kernel update, or if I'm doing something where the laptop might get lost or stolen (since powering off will lock the disk encryption).
I just have it running 24/7 and never restart for weeks. I don't even have the 100 tab problem, I just like having the immediate availability without waiting for startup.
Unless you're on solar, does wasting electricity not bother you? I used to seed a lot of stuff for years (with typical uptime measured in months), but the CO2 impact, however tiny it is in the grand scheme of things, does not seem to worth it anymore.
I wonder if bisect is the optimal algorithm for this kind of case. Checking that the error still exists takes an average of ≈500 iterations before a fail, while checking that it doesn't exist takes 10,000 iterations, 20 times longer - so maybe biasing the bisect to skip only 1/20th of the remaining commits, rather than half of them, would be more efficient.
Basically it calculates the commit to test at each step which gains the most information, under some trivial assumptions. The calculation is O(N) in the number of commits if you have a linear history, but it requires prefix-sum which is not O(N) on a DAG so it could be expensive if your history is complex.
Never got round to integrating it into git though.
That's a cool idea. Would also be interesting to consider the size of the commit - a single 100-line change is probably more likely to introduce a bug than 10 10-line changes.
There's an additional stopping problem here that isn't present in a normal binary search. Binary search assumes you can do a test and know for sure whether you've found the target item, a lower item, or a higher item. If the test itself is stochastic and you don't know how long you have to run it to get the hang, I'd think you'd get results faster by running commits randomly and excluding them from consideration when they hang. Effectively, you're running all the commits at the same time instead of working on one commit and not moving on until you've made a decision on it. Then at any time you will have a list of commits that have hanged and a list of commits that have not hanged yet, and you can keep the entire experiment running arbitrarily long to catch the long-tail effects rather than having to choose when to stop testing a single non-hanging commit and move onto the next one.
I can see some interesting approaches here. Given n threads/workers you could divide the search space into n sample points (for simplicity let's divide it evenly) and run the repeated test on each point. When a point hangs, that establishes a new upper limit, all higher search points are eliminated, the workers reassigned in the remaining search space.
Given the uncertainty I can see how this might be more efficient, especially if the variance of the heisenbug is high.
If the factor in one direction is large enough then a linear search becomes more efficient. Say you have 20 commits remaining and the factor is 1,000x more costly to make it easier to picture. You're better off doing a linear search which guarantees you'll spend less than 2,000x searching the space.
That suggests that for a larger search space with a large enough difference, the optimal bisection point is probably not always the midpoint even if you know nothing about the distribution.
Perhaps someone can find the exact formula for selecting the next revision to search?
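Not a closed-form formula, but the optimal split is easy to get numerically. A sketch under two assumptions - a uniform prior over which commit is the culprit, and fixed average costs of 500 runs to see a failure vs. 10,000 runs to declare a commit good (numbers borrowed from upthread):

    from math import inf

    COST_FAIL, COST_PASS = 500, 10_000   # avg runs to see a failure vs. to call a commit good

    def plan(n_commits):
        """Cost-minimizing split for each suspect-range size, assuming a uniform prior."""
        exp = [0.0, 0.0]                 # exp[m] = min expected runs to isolate among m suspects
        split = [0, 0]
        for m in range(2, n_commits + 1):
            best_cost, best_k = inf, 1
            for k in range(1, m):        # test suspect k: fail -> k remain, pass -> m - k remain
                cost = (k / m) * (COST_FAIL + exp[k]) + ((m - k) / m) * (COST_PASS + exp[m - k])
                if cost < best_cost:
                    best_cost, best_k = cost, k
            exp.append(best_cost)
            split.append(best_k)
        return exp, split

    exp, split = plan(1000)
    print(f"first test at suspect {split[1000]}/1000, expected ~{exp[1000]:,.0f} total runs")
    # The optimum is nowhere near the midpoint: the cheap (failing) outcome should be the
    # likely one, with each fail shaving off only a small slice of the remaining range.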
Each boot updates your empirical distribution. As a trivial example, if you have booted a version 9999 times with no hanging, a later version will likely give you more information per boot.
If they boot it 10,000 times for revisions that don't fail, and ~1,000 times for revisions that do fail, you can reach this number with a log2(revisions) of about 30.
I didn't mention it in the blog, but Paolo Bonzini was helping me and suggested I run the bootbootboot test for 24 hours, to make sure the bug wasn't latent in the older kernel. I got bored after 21 hours, which happened to be 292,612 boots.
Maybe it would have failed on the 292,613th boot ...
Thanks for mentioning me, but you really did the work!
But in order to contribute something useful: as a rule of thumb you want about 10 times as many passing runs as it takes, on average, to hit a failure before ruling a commit out. If a bug has taken up to 2500 runs to reproduce, don't consider it a pass until 30000 runs have succeeded.
It's something to do with Poisson distributions. If you have n runs before a failed run on average, and you want to be P% certain that a fix (including a revert or moving beyond the bug in a bisect) reduced the failure rate, you can use the formula -n * ln(1 - P/100) for how long to run, and the factor for P = 99.99 is about 10.
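In code, with the 2500-run example from above:

    from math import log

    def runs_to_confirm_fix(mean_runs_between_failures, confidence_pct):
        """-n * ln(1 - P/100): failure-free runs needed to be P% sure the rate actually dropped."""
        return -mean_runs_between_failures * log(1 - confidence_pct / 100)

    print(runs_to_confirm_fix(2500, 99.99))   # ~23,000 runs, i.e. a factor of about 9.2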
In fact that means that once you had landed on a merge commit it was probably much better to switch to a linear backwards search, because it might have fewer passing runs, and passing runs are 10-15 times more expensive than failures. Is that what you did?
I've been on a similar quest for hard-to-reproduce timing/hardware/... bugs, and if you're facing any kind of skepticism (your own or otherwise) it can be very comforting to have a 10x or even 100x "no failure occurred" margin of confidence.
It's particularly comforting when the reason for the failure/fix/change in behavior isn't completely understood.
If the bug occurs reasonably often, say usually once every 10 minutes, you can model an exponential distribution of the intervals between the bug triggering and then use the distribution to "prove" the bug is fixed in cases where the root cause isn't clear: https://frdmtoplay.com/statistically-squashing-bugs/
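A rough sketch of that calculation, with made-up interval data:

    from math import exp
    from statistics import mean

    # Made-up intervals (minutes) between observed occurrences of the bug before the fix.
    intervals = [4, 13, 7, 22, 9, 11, 6, 18, 10, 8]
    rate = 1 / mean(intervals)                   # exponential model: failures per minute

    def p_quiet_if_unfixed(clean_minutes):
        """Probability of a failure-free stretch this long if nothing had actually changed."""
        return exp(-rate * clean_minutes)

    clean = 24 * 60                              # a full day with no failures after the fix
    print(f"P(this quiet streak | bug still present) = {p_quiet_if_unfixed(clean):.2e}")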
No disrespect to Peter Zijlstra, I'm sure he has been a lot more impactful on the open source community than I will ever be, but his immediate reply caught my attention:
> Can I please just get the detail in mail instead of having to go look at random websites?
Maybe it's me, but if I booted Linux 292,612 times to find a bug, you might as well click a link to a repository of a major open source project on a major git hosting service.
Is it really that weird to ask people online to check a website? Maybe I don't know the etiquette of these mail lists so this is a geniune question. I guess it is better to keep all conversation in a single place, would that be the intention?
I am only guessing here, but I assume it's so the content of the mailing list archive remains. If a linked website goes down or changes at any time in the future, then that archive is no longer fulfilling its purpose of archiving important information.
If that was the reason it would have been best to state that in the request.
> Can I please just get the detail in mail so that it is archived with the list?
Of course you can't expect every email written to be perfect, it is generally treated as an informal medium in these settings. But stating the reason helps people understand your motives and serve them better.
I think that hardcore kernel devs already know the reasons, and there is no point in raising it again. For you it might seem like a random requirement, but that's just a lack of familiarity.
I think in that case an explanation is needed even more: if you are a hardcore dev, then no one needs to remind you about such a rule; on the other hand, if you are not so familiar with those rules yet, an explanation would be very helpful.
It was completely obvious to me, and I'm not a Linux committer.
Any bug of the form:
Hi, I'm sending this via official channels, but see [external thing].
Is going to immediately bitrot. For instance, in stack overflow, for something like 10% of answers, you'll see people saying to explain what a link says instead of just linking.
The irony being that he presumably wants more information on the mailing list to keep a good archive, while not giving enough information for people to understand that and follow the advice later.
Not only the link itself - but if the email body /attachments contains the details - it is also easier to write a good reply by selectively quoting from the mail. So it isn't just for the first mail, but for the follow-up discussion thread(s).
I was a bit short in the original description, but luckily we've since reached an understanding on how to try to reproduce this bug.
Unfortunately he's not been able to reproduce it, even though I can reproduce it on several machines here (and it's been independently reproduced by other people at Red Hat). We do know that it happens much less frequently on Intel hardware than AMD hardware (likely just because of subtle timing differences), and he's of course working at Intel.
It's LKML. The volume of that list is insane, and technical discussion is very much the point, so they'd expect you to explain the problem right there, where people can quote parts of it, and comment on each part separately.
I've met people who seriously do use dumb terminals and other people who have seriously discussed using a PDP-11.
So, while your question might sound sarcastic, the answer is definitely yes.
Nerds gonna nerd. Nothing wrong with that.
I personally don't like going to gitlab or github because I don't like the businesses behind them. That's another point irrespective of whether I'm browsing in a terminal or ancient device.
I run OpenBSD on most of my systems. The OpenBSD development team collaborates using cvs instead of git because it fits their workflow well. If I wanted to collaborate with them, I'd use cvs too – and if I wanted to move them to git I'd do it after becoming a core contributor, not before. If I'm going to send bug reports & patches here and there, I'm going to do it in a way that makes it easy for Theo and team to review.
This is very much a Chesterton's fence topic, I think. Linux developers have settled on a workflow that works for them, and if you want to get time from the people who are doing the bulk of the work it's fair to expect you to work within their requests.
This dude literally spent days doing their work. Rebooted Linux nearly 300k times to find their fuckup. Then they have the infantile reaction to complain about clicking a link?
It’s a gitlab link, not github. And it isn’t reasonable in this context. GitHub hosts a lot of open source projects but it is not the only place where open source happens. That's kinda the point of open source, and especially of git.
Git itself is a satellite project of the Linux kernel. It can work without the web at all. That someone EEE’d it so hard that even Microsoft couldn’t resist is no reason to expect the kernel devs to change their workflow.
you're wrong. instead you should adopt the standards of the group you're attempting to join. Getting "tourist who complains about customs of country they visit" vibes from this comment
You’re welcome to go tell the Linux kernel devs what they are doing wrong. Fuck around and find out as the kids say. Or start the Zolnux project and see how far that goes chasing shiny objects.
My suspicion is that it's not about reading the bug info once, but having the information in the mailing-list, which is the archive of record for kernel bugs.
Asking to click a link in an email is unreasonable in this context. The email list is the official channel and project participants are expected to use it. They are not expected to have a web browser. The popularity of the linked site is irrelevant. Part of filing good bug reports is understanding a project’s communication style. A link to supplementary information is fine. But like a Stack Overflow answer the email should stand on its own.
Many kernel people really are stuck in their ways like that. They don't want to leave their Mutt (e-mail client) at any cost. I recall some are even to this day running a text console (ie. no X11 or Wayland).
The link is to gitlab, not github. But any website is inappropriate in this context because it’s not permanent. The email list is, at least as far as the project is concerned.
Of course it's contained in known integer sequences. The positive integers in increasing order, for example: https://oeis.org/A000027. The search doesn't know about every term in every sequence, as most are infinite and many are mostly unknown (some well-defined sequences only have a few known terms).
That README is light on details. How is this different from selecting some N (and hoping it is high enough) and repeating your test case that many times? You just don't have to select a value for N using this tool?
The paper lists the algorithm (which is relatively simple) but basically it is much more efficient than repeating test cases.
You can see that that must be possible fairly easily. Consider two algorithms:
1. Classic binary search - test each element once and 100% trust the result.
2. Overkill - test each element 100 times because you don't trust the result one bit.
The former will clearly give you the wrong result most of the time, and the latter is extremely inefficient. There's clearly a solution in between that's more efficient without sacrificing accuracy.
Skimming the algorithm, it looks like they maintain Bayesian probabilities for each element being "the one", test the element at the 50% cumulative-probability point each iteration, then update the probabilities accordingly. Basically a Bayesian version of the traditional algorithm.
You do still have to select an N, but it's not as critical that the N gives 100% guarantee of the flaky failure (which can be really difficult or even impossible to achieve). Unlike regular binary search, robust binary search doesn't permanently give up on the left or right half based on just a single result.
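Here's a bare-bones toy in that spirit (emphatically not the tool's actual implementation: it picks the probe greedily by expected information gain rather than at the 50% point, and it assumes the ~10% reproduction rate is known):

    import math, random

    random.seed(7)
    REPRO_RATE = 0.1                     # assumed chance that one run fails on a bad commit

    def run_test(commit, first_bad):
        """One flaky run: only commits at/after the culprit can fail, and only sometimes."""
        return "fail" if commit >= first_bad and random.random() < REPRO_RATE else "pass"

    def likelihood(outcome, hyp_first_bad, probe):
        p_fail = REPRO_RATE if probe >= hyp_first_bad else 0.0
        return p_fail if outcome == "fail" else 1.0 - p_fail

    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)

    def robust_bisect(n, first_bad, threshold=0.99, max_runs=5000):
        post = [1.0 / n] * n             # posterior over "commit i is the first bad one"
        runs = 0
        while max(post) < threshold and runs < max_runs:
            # Greedy probe choice: maximize the expected drop in posterior entropy.
            current_h = entropy(post)
            best_probe, best_gain = 0, -1.0
            for probe in range(n):
                p_fail = sum(post[i] * likelihood("fail", i, probe) for i in range(n))
                expected_h = 0.0
                for outcome, p_out in (("fail", p_fail), ("pass", 1.0 - p_fail)):
                    if p_out <= 1e-12:
                        continue
                    new = [post[i] * likelihood(outcome, i, probe) for i in range(n)]
                    s = sum(new)
                    expected_h += p_out * entropy([x / s for x in new])
                gain = current_h - expected_h
                if gain > best_gain:
                    best_probe, best_gain = probe, gain
            outcome = run_test(best_probe, first_bad)
            runs += 1
            post = [post[i] * likelihood(outcome, i, best_probe) for i in range(n)]
            s = sum(post)
            post = [p / s for p in post]
        return post.index(max(post)), runs

    print(robust_bisect(n=100, first_bad=43))    # usually converges on index 43

Note that no single result is ever trusted outright: a pass on a bad commit just nudges the posterior, and the search keeps revisiting the region until the evidence piles up.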
It makes sense to n-sect (rather than bisect) as long as the tests can be run in parallel. For example, if you're searching 1000 commits, a 10-sect will get you there with ~30 tests, but only 3 iterations. OTOH, a 2-sect will take more than 3x the time (10 iterations), but require only 10 tests.
There's ofc always some sort of bayesian approach mentioned in other answers.
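Back-of-the-envelope version of that trade-off:

    from math import ceil

    def n_sect_cost(n_commits, k):
        """Rounds of parallel testing and total tests for a k-way split of the range."""
        rounds, remaining = 0, n_commits
        while remaining > 1:
            remaining = ceil(remaining / k)   # k - 1 probe points split the range into k pieces
            rounds += 1
        return rounds, rounds * (k - 1)

    for k in (2, 10):
        print(k, n_sect_cost(1000, k))        # 2 -> (10 rounds, 10 tests); 10 -> (3 rounds, 27 tests)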
Yeah, I did a 4-way search like this on gcc back in the Cygnus days - way before git, and the build step involved "me setting up 4 checkouts to build at once and coming back in a few hours" so it was more about giving the human more to dig into at comparison time than actual computer time and usage. (It always amazes me that people have bright-line tests that make the fully automated version useful, but I've also seen "git bisect exists" used as encouragement to break up changes into more sensible components...)
In the speed running community there is a pretty famous clip [0], where a glitch caused a Super Mario speed runner to suddenly teleport to the platform above him, saving him some valuable time.
Of course people tried to find ways to reproduce the bug reliably, as saving even milliseconds can mean everything in a speed run. They went as far as replicating the state of the game from the original occurrence 1:1, but AFAIK no one has been able to reproduce the glitch without messing with the games memory.
For that reason it is speculated that a cosmic ray caused a bit-flip in the byte that stores the players y coordinate, shooting him up into the air and onto the next platform.
After, in order to try to confirm that the pre-commit kernel didn't have a latent bug; in other words, that the commit clearly triggers the hang. (This doesn't necessarily prove that the commit is wrong: it might simply be exposing another bug that never occurred before, and in fact that is the current thinking.)
I once had to bisect a Rails app between major versions and dependencies. Every bisect would require me to build the app, fix the dependency issues, and so on.
I used to think I was amazing at performance tuning and debugging but after working with a few hundred different people it turns out I’m just really fucking stubborn. I am not going to shrug at this bug again. You are going down. I do have a better way of processing concurrency information in my head, but the rest is just elbow grease.
I had a friend in college who was dumb as a post but could study like nobody’s business. Some of us skated through, some of us earned our degree, but he really earned his. We became friends over computer games and for a long time I wondered if games and fiction were the only things we had in common. Turns out there’s maybe more to that story than I thought at the time.
I think you’re absolutely right. Some of the things I’ve been most proud of have been products of stubbornly refusing to give up. On the other hand, some vast oceans of wasted time have been another result. It’s tricky to know when to be tenacious!
In my defense, I am a strong proponent of refactoring to make all problems shallow. So there are classes of bug that I will see before anyone else because I move the related bits around and it becomes obvious that there are missing modes in the decision tree.
I tend to believe that discipline and tenacity are separate traits. Often appearing in the same people, but different skills with different exercises.
Discipline is a funny word. Some of my disciplines I'm very proud of, and some/many other people don't value them at all. But they sure tell me about how my behavior doesn't align with their definition of discipline.
I know that when I leave projects, my ex-coworkers defend my decisions. I've gotta be doing something right.
I have an old VIA-based 32-bit x86 machine (a VIA Eden Esther 1 GHz from 2006), and it hangs at different times, but I managed to create a reproducer which hangs the system not long after boot. About 1 in 20 boots are unsuccessful.
I noticed that verbose booting reduces the chance of hanging compared to quiet boot, but does not eliminate it completely.
A similar issue was present even on Dell servers back in 2008-2009, which were based on more recent x86_64 VIA CPUs; here's an attempt to bisect the issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=507845#84
The CPU seems to enter an endless loop, as the machine becomes quite hot, as if it's running at full speed.
All these years I believed this was a hardware implementation issue, related either to context switching or to the SSE/SSE2 blocks, since running a pentium-mmx-compiled OS seems to work fine and no other x86 system hangs the way VIA does.
However, after this post and all the LKML discussion of ticks/jiffies/HZ and how it's less of an issue on Intel, I'm not so sure: the issue mentioned there is related to time and printk, I also associate my problem (at least partially) with how chatty the kernel log is, and the person in the Debian bug tracker above also bisected to code related to printf, although in libc. It could be another software bug in the kernel. If that's the case, it has been present since at least the 2.6 days.
I would appreciate any suggestions to try, any workarounds to apply, or any advice on debugging. If anyone has spare time and interest, I can set up a dedicated machine reachable over SSH for testing. I have a bunch of VIA hardware being reused for a new non-commercial project, and I struggle to run these machines 100% stable.
I have found that my MicroPC fails on some newer kernels: when GDM starts up, the machine locks up and the LCD goes wonky. I'm not particularly looking forward to the bisect, but at least it won't take 292,612 reboots.
In some ways an early-boot, kernel-only failure is easier. Late-boot failures like that could just as well have been something changing in wayland/X/gdm/mesa/dbus/whatever at the same time. And then, if it turns out everything but the kernel is constant, it's easy to take a wild guess and look for something in, say, the DRM/GPU driver in use rather than the entire kernel. Although last time I did that, it turned out it wasn't even in the GPU-specific code but in a refactoring of the generic display management code. I still ended up doing a bisect across something like 5 kernel revisions after everything else failed. Which points to the fact that if Linux had a less monolithic tree, it would be possible to A/B test just the kernel modules and then bisect their individual trees, rather than adjusting each bisect point to the closest related commit if you're sure it's a driver-specific problem. There is a very good chance that if, say, a particular monitor config + GPU stops working on my x86 machine, the problem is in /drivers/gpu rather than in all the commits in arch/riscv that are also mixed into the bisect. Ideally the core kernel, arch-specific code, and driver subsystems would all be independent trees with fixed/versioned ABIs of their own. That way one could upgrade the GPU driver to fix a bug without having to pull forward btrfs/whatever and risk breaking it.
Since I'm on NixOS, I can at least emphatically confirm it is JUST the kernel.
Though, given the way the LCD panel wonks out, I'm actually concerned it's power management related. It looks like what happens to an LCD panel when the voltage goes too low. (Or at least, I think that's what that effect is, based on what I've seen with other weird devices with low battery.) Since MicroPC is x86, though, I doubt the kernel is driving any of the voltages too directly, so who knows.
Have they found the issue yet? So far, the author has reported using qemu 7.2.0, which has been giving many kernel developers spurious boot failures (for x86) that seem fixed in 8.0.0. I myself have measured 3/1000 boot failures on 7.2.0.
I feel the author's pain. The biggest bisect I have done was 17 steps and that was tedious enough. Booting the machine over Dell's iDRAC was the icing on that experience.
Disclaimer: not a kernel dev, opinion based upon very cursory inspection.
The patch references the "scheduler clock," which is a high-speed, high-resolution monotonic clock used to schedule future events. For example, a network card driver might need to reset a chip, wait 2 milliseconds, and then do another initialization step. It can use the scheduler to cause the second step to be executed 2 milliseconds in the future; the "scheduler clock" is the alarm clock for this purpose.
Measuring the "current time" is pretty complicated when you're dealing with multiple-core variable-frequency processors, need a precise measurement, and can't afford to slow things down. The "scheduler clock" code fuses together time sources and elapsed-time indicators to provide an estimated current time which has certain guarentees (such as code running a particular core will never see time go backwards, it will be accurate within particular limits, and it won't need global locks). The sources and elapsed-time indicators it has available varies by computer architecture, vendor, and chip family; therefore the exact behavior on an Intel core 5 will differ from that of an Arm M7.
The patch in question changes the behavior of local_time(); this is the function used by code which wants to know the current time on its particular core. The patch tries to make local_time() return a sane value if the scheduler clock hasn't been fully initialized but is at least running.
As you can imagine, there are a lot of things that can go wrong with that. I think the problem is that sched_clock_init_late() is marking the clock as "running" before it should. I could very well be wrong. Regardless, it's pretty clear that there's some kind of architecture-dependent clock-initialization race condition that once in a while gets triggered.
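If it helps, here's a made-up toy model of the kind of guarantee being described. It does not resemble the real kernel code or its function names; it just illustrates the "per-core, never goes backwards, fall back to a raw counter until late init finishes" idea.

```python
import time

class ToySchedClock:
    """Made-up illustration only: a per-core monotonic clock that falls
    back to a raw counter until calibration ("late init") has finished."""

    def __init__(self):
        self.running = False       # raw counter is usable at all
        self.initialized = False   # calibration against other sources done
        self.offset_ns = 0         # correction computed during calibration
        self.last_seen = {}        # per-core clamp; no global lock needed

    def _raw_ns(self):
        # Stand-in for reading a cycle counter on the current core.
        return time.monotonic_ns()

    def read(self, core):
        if not self.running:
            return 0                   # clock not usable yet
        now = self._raw_ns()
        if self.initialized:
            now += self.offset_ns      # use the calibrated estimate
        # Guarantee: a given core never sees time go backwards, even if
        # calibration shifts the estimate underneath it.
        now = max(now, self.last_seen.get(core, 0))
        self.last_seen[core] = now
        return now
```

The race being speculated about above would then be the window where the "running" flag is already set but the calibration behind "initialized" hasn't really settled, so early readers can see values that later code doesn't expect.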
Why was the title editorialized, a few hours after posting, over the "21 hours" bit (not important, clickbait-ish)? It wasn't breaking any of the guidelines [1] to my understanding.