Here they mention that each bisect ran a large number of times to try and catch the rare failure. Reminds me of a previous experience:
We had a large integration test suite. It made calls to an external service, and took ~45 minutes to fully run. Since it needed an exclusive lock on an external account, it could only run a few tests at a time. We started getting random failures, so we were in a tough spot: bisecting didn't work because the failure wasn't consistent, and you couldn't practically run a single version of a test enough times to verify that a given version definitely did or didn't have the failure. I ended up triggering a spread of runs overnight, and then used Bayesian statistics to home in on where the failure was introduced. I felt mighty proud about figuring that out.
Unfortunately, it turns out the tests were more likely to pass at night when the systems were under less strain, so my prior for the failure rate was off and all the math afterwards pointed to the wrong range of commits.
Ultimately, the breakage got worse and I just read through a large number of changes trying to find a likely culprit. After finally finding the change, I went to fix it only to see that the breakage had been fixed by a different team an hour or so before. It turned out to be one of our dependencies turning on a feature by slowly increasing the probability it was used. So when the feature was on, it broke our tests.
However, it's not surprising when you consider the massive breadth of software that Microsoft builds, as one of the oldest and largest software development orgs.
Amazon (or, at least, my corner of it) still hadn't when I left ~9 months ago - and I'm glad of it. I've moved to a company where one of the core products (though, thankfully, not my team's) is in a monorepo, and from everything I've seen it looks like a horribly inefficient way to develop.
This is astonishing. The build (and deploy) systems are, by a considerable margin, the things I miss most about having left Amazon (CDO, not AWS, but still). What do you dislike about them?
> I had to spend days trying to debug why a build that was working fine broke completely
Sure, but I bet that it was helpful to have the `bats` tool so that you could replicate the build locally, right? As compared with other build systems where (so far as I can see - though I may be wrong) you basically have to push a debugging change for replication.
> I disliked how needlessly convoluted the pipelines are, and how some person pushing on accident to mainline can break everything.
This is true of any CI/CD system, though? In any system, if there's no push-protection set up so that you can only merge into main(/line) once a change has been reviewed (and run your tests at the point of review so that you know the merge won't break anything), you have only yourself to blame for breakage.
> So many things seem to be done the hard way.
Genuine question - what do you find convoluted/hard about them? To me, the apparently-industry-standard of "push a change to your App Code, which triggers a build to generate a docker image, then trigger an automatic commit containing that Docker image to a Deployment Package, which is picked up by your CD system and creates a deployment" is way more convoluted. Having a conceptual "pipeline" built out of lots of little disconnected GitHub Actions (or whatever) is also way harder for me to wrap my head around than the CDK definition of a linear pipeline.
The problem was caused by a forced deprecation of NodeJSFunction in CDK, which basically made it impossible at the time to add any dependencies… there was nothing I or the more senior engineers could do to solve it.
I figured out a workaround that involved having a separate manifest for the deps and packing them manually. It worked… I also tried the lambda without any dependencies, and the lambda's dependencies were still available in the instance even though they were not listed anywhere.
The needlessly convoluted part is getting a NodeJS function into prod: the forced change caused something to break even though I was already on 18_x.
But I will not lie, bats was super useful when debugging another engineer’s build, and avoiding the whole push to debug is so so so so helpful.
For the pipeline thing, idk why it was set up as such; but it certainly broke everything once a push to mainline was done, kinda like a runtime error when it should have been a compiler error. Though it was certainly on us.
CDK is fine too. Tbh I kinda love CDK and don’t want to go to anything else when it comes to cloud deployment.
In retrospect, perhaps that one sour experience had just too much of an impact.
I want a better way to manage external dependencies mostly. Afaict for scala I will need to pull the package into Brazil to get it to work. An analogue to NPMPM would be great but I can see why there isn’t one yet. A colleague had issues getting python dependencies to work so yeah.
> I want a better way to manage external dependencies mostly
On that point, I'm totally with you. The excessive caution about the software supply chain is probably justified given the impact of a potential incident, but certainly frustrating for the >99% of times that dependencies are safe.
FWIW, I think best practice here is to hardcode all feature flags to off in the integration test suite, unless explicitly overwritten in a test. Otherwise you risk exactly these sorts of heisenbugs.
At a BigCo that’s probably going to require coordinating with an internal tools team, but worth getting it on their backlog. All tests should be as deterministic as possible, and this goes double for integration tests that can flake for reasons outside of the code.
No, the best practice is that on each test run, every feature flag used implicitly or explicitly needs to be captured AND it must be possible to re-run the test with the same set of feature flags.
That way when you get a failure, you can reproduce it. And then one of the easy things to do is test which features may have contributed to it.
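To make that concrete, here's a rough sketch of the capture-and-replay half, with a hypothetical flag client and made-up names, replaying via an env var:

    # Hypothetical sketch: record every flag a test reads, and allow pinning them on re-run.
    # Assumes some flag client exposing get(name); all names here are made up.
    import json, os

    class RecordingFlags:
        def __init__(self, real_client, pinned=None):
            self.real = real_client
            self.pinned = pinned or {}   # flag -> value forced for reproduction
            self.seen = {}               # every flag this test touched, with the value it saw

        def get(self, name):
            value = self.pinned.get(name, self.real.get(name))
            self.seen[name] = value
            return value

        def dump(self, path):
            # Attach this to the failing run's artifacts so the exact run can be replayed.
            with open(path, "w") as f:
                json.dump(self.seen, f, indent=2, sort_keys=True)

    # Replay: FLAG_SNAPSHOT=flags-from-failed-run.json pytest test_checkout.py
    def load_pinned_flags():
        snapshot = os.environ.get("FLAG_SNAPSHOT")
        if not snapshot:
            return {}
        with open(snapshot) as f:
            return json.load(f)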
I strongly disagree. If you have non-deterministic tests, you are going to have builds breaking for unrelated changes, seriously hampering developer productivity as teams chase down failures unrelated to their change.
Nothing kills confidence in testing more than test flakes. It’s a huge drain on velocity and morale, and encourages devs not to trust test output.
If you want to have some sort of chaos monkey process that runs your test suite flipping feature flags at random and notifying teams of failures (along with some sort of resourcing to investigate) I could get behind that. But that should be something outside of the main suite that gates code deployment.
If a test passes when run by a dev pre-commit, it should pass in CI.
Also you end up with some strange long-term test behavior. Because people will often leave feature flags in place long after full release (sometimes for years), default-off-in-tests means you're only ever testing the behavior with everything added since the last feature-flag cleanup disabled.
Yes, it's kinda a fractal of bad practices that have to align for this problem to occur, but that's the nature of tech debt.
I agree that this is a real and separate problem, but I believe the solution lies outside of the test suite.
One way I have seen this handled is to enforce restricting rollouts of a feature flag to 95% at most. That way turning a feature all the way on requires removing the flag from your codebase. It’s draconian, but honestly anything less than that leads to the situation you describe.
I like that idea a lot. We've been informally doing it on my current team, made easier since we can sort of cleanly do atomic code+flag updates in a single commit
Agreed, I am advocating for deterministic behavior for all feature flags in the test suite.
If you’re testing a new feature, you should have explicit tests for the enabled state (along with existing tests for the disabled state).
If you have bugs propagating up the stack from flags changing in low-level dependencies, the change to the dependency is probably not properly tested.
Alternatively, if the feature flag gates a change to the interface of the dependency, you should have explicit integration tests covering the systems on both sides of the change.
Man, this story sounds like you could be on my team :-) Pretty much experienced the same stuff working at BigCo!
In the end, I think the real problem is that you can't test all combinations of experiments. I don't trust "all off" or "all on" testing. In my book, you should indeed sample from the true distribution of experiments that real users see. Yes, you get flaky tests, but you also actually test what matters most, i.e. what users will - statistically - see.
Basically, if you have N different features (let's assume they are all on/off switches, but it works for multi-values too), in theory you'd need to run 2^N tests to cover them all, which would become completely impractical. But, you can generate a far, far smaller set of test setups that guarantee that every pair of features gets tested together. Run those tests and you'll probably encounter most feature-interaction bugs in a much quicker time.
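Here's a rough greedy sketch of generating such a suite (toy code, not a production covering-array tool - real generators do this better):

    import itertools, random

    def pairwise_suite(n_features, candidates_per_step=50, seed=0):
        """Greedily add on/off assignments until every feature pair has been seen
        in all four (off,off), (off,on), (on,off), (on,on) combinations."""
        rng = random.Random(seed)
        pairs = list(itertools.combinations(range(n_features), 2))
        needed = {(i, j, a, b) for i, j in pairs for a in (0, 1) for b in (0, 1)}
        suite = []
        while needed:
            best, best_hits = None, set()
            for _ in range(candidates_per_step):
                cand = tuple(rng.randint(0, 1) for _ in range(n_features))
                hits = {(i, j, cand[i], cand[j]) for i, j in pairs} & needed
                if len(hits) > len(best_hits):
                    best, best_hits = cand, hits
            if best is None:   # rare: no candidate covered anything new, so force one combination
                i, j, a, b = next(iter(needed))
                forced = [rng.randint(0, 1) for _ in range(n_features)]
                forced[i], forced[j] = a, b
                best = tuple(forced)
                best_hits = {(x, y, best[x], best[y]) for x, y in pairs} & needed
            suite.append(best)
            needed -= best_hits
        return suite

    # 20 on/off features would need 2^20 ≈ 1M runs exhaustively;
    # pairwise coverage typically needs only a dozen or so assignments.
    print(len(pairwise_suite(20)))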
All-pairs is for _pairs of features_. For subsets you're in much deeper trouble because of the exponential dependence on N. For a fixed polynomial dependence, you can get clever and let tail bounds eventually work for you, but for exponentially growing hypothesis sets, that won't work.
This is relatively subtle stuff, but here's an attempt at describing the general problem. I'm going to describe the deterministic case, but the probabilistic case is effectively the same.
Let's say you have a bug you suspect is from an interaction of any one pair of 10 features being "on" or "off", but you don't know which specific pair causes the problem. Encode each of the states you could set up your code by a 10-digit binary string: 0000000000, 0000000001, 0000000010, 0000000011, etc.
We could try the 45 possibilities in some order, and we would expect that on average it'd take us 22.5 tries to find the bug. But notice how your "target set" is smaller than the universe of strings: there's only 45 pairs of features, but 1024 strings.
What happens if we try a random string of ones and zeros? Now, instead of catching just one possible pair, we are covering many pairs. The only problem is that we now won't be able to know exactly which pair caused the problem when it does. But we can build a corpus of strings that don't trigger the error vs. strings that trigger the error, and a random sampling soon converges on the correct pair.
If you think about why this works, it's because any of these random strings has about a 1/4 chance to trigger the bug: wlog we can reorder the bits so that the buggy features are the first two digits, and then we see that we have a 1/4 chance of hitting "11" on those two digits.
The problem is that as you increase the size of the subset that needs to be active, the probability that your random strings will actually catch the bug decreases exponentially. For any _fixed_ target size k (the number of features that need to be active), the overall complexity is still polynomial in n (the number of existing features). But if k is a constant fraction of n, then this technique takes exponential time in n.
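For fun, a toy simulation of that, with the culprit pair hardcoded purely for illustration (the seed and run counts are arbitrary):

    import itertools, random

    N_FEATURES = 10
    CULPRIT = (3, 7)          # made up: the bug fires only when features 3 and 7 are both on

    def test_run(flags):
        """True = test passed; the fake bug triggers when the culprit pair is on together."""
        return not (flags[CULPRIT[0]] and flags[CULPRIT[1]])

    rng = random.Random(42)
    failing, passing = [], []
    for _ in range(60):       # each random string exercises all 45 pairs at once
        flags = tuple(rng.randint(0, 1) for _ in range(N_FEATURES))
        (passing if test_run(flags) else failing).append(flags)

    # Suspects: pairs that were both on in every failing run and never both on in a passing run.
    suspects = [(i, j) for i, j in itertools.combinations(range(N_FEATURES), 2)
                if failing and all(f[i] and f[j] for f in failing)
                and not any(p[i] and p[j] for p in passing)]
    print(suspects)           # shrinks toward [(3, 7)] as the corpus grows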
You just need to make sure that this doesn't mean people are consistently "lucky" or "unlucky."
I was on a team where app updates were deployed using a canary system. A small percentage of users (say, 1%) received the update first, then the team watched for incoming crash reports from that cohort. If it looked good, the feature was rolled out to a few more people, and this was repeated. This allows you to identify a problem by only negatively impacting a relatively small percentage of customers.
The problem occurs when the calculation to determine which cohort the user belongs to is deterministic. In this case, the calculation was based on the internal ID of the user. This means some users always get the updates first, and deal with bugs more frequently than other users. Conversely, some users are so high in the list that they virtually never get an update until it's been tested by a wide user base, so their experience is consistently stable.
Or have the username be a number that is all the feature flags when converted to a binary representation. Then you can just have one username for each combination you want to test.
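Something like this, with made-up flag names:

    # Hypothetical: pack each test account's flag combination into its numeric username.
    FLAGS = ["new_checkout", "dark_mode", "fast_search", "beta_api"]   # bit 0, 1, 2, ...

    def username_for(enabled):
        return str(sum(1 << FLAGS.index(f) for f in enabled))

    def flags_from_username(username):
        n = int(username)
        return {f for i, f in enumerate(FLAGS) if n & (1 << i)}

    assert flags_from_username(username_for({"dark_mode", "beta_api"})) == {"dark_mode", "beta_api"}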
The important part is the stability - if your usernames can change then they aren't stable, so you shouldn't select on them.
I think it is a good reminder that most things you think of as being unchanging that are also directly related to a person.. aren't unchanging. Or at least any conceivable attribute probably has some compelling reason why some one will need to change it.
That's why you have internal user ids instead of using data directly provided by users.
Will it cost an extra lookup? It's cheap, and if you really need to, you could embed the lookup in some encrypted cookie so you can verify you approved some name->id mapping recently without doing a lookup.
Wait, we're talking about maliciously injecting bugs into your employer's software so they have the maximum impact, right?
Clearly, making sure that 1% of all teams gets fired for being unable to run unit tests, then slowly ramping that by a few percent each review cycle is a good strategy.
Ideally, the probability of breaking would drop off exponentially as you moved up the org chart. Something like "p ^ 1/hops_to_director_of_engineering" would work well. The trick would be getting the dependency to query ldap without being detected...
I've used the hash of username+string trick before for a flag. I used it to replace a home-grown heavyweight A/B testing framework which had turned into a performance bottleneck.
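For anyone who hasn't seen it, the trick is roughly this (a sketch, not the framework I replaced):

    import hashlib

    def in_rollout(username: str, flag_name: str, rollout_percent: float) -> bool:
        """Deterministic per-user, per-flag bucketing: hash(user + flag) -> a bucket in [0, 100)."""
        digest = hashlib.sha256(f"{username}:{flag_name}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
        return bucket < rollout_percent

    # Salting with the flag name means the same users aren't always the guinea pigs,
    # unlike bucketing on a bare user ID.
    print(in_rollout("alice", "new_checkout_flow", 5.0))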
That wouldn't work here though, the dependency started by breaking an almost-undetectable fraction of the time.
Imagine a scenario where your upstream dependency started out with one failure per 1,000,000 machine hours, then removed a zero once every 12 months. If you had 100 machines running tests at 100% efficiency, the bug would hit about once a year for the first year, then 10x the next year, and so on.
Put another way, if upstream is malicious, and you're not auditing every line of their source code, you're screwed.
I tested with 6.4.0-0.rc6.48.fc39.x86_64 + f31dcb152a3 revert and all 10000 iterations succeeded (same hardware and environment as my previous post).
To guarantee that there's absolutely no other difference between the two tests, I took the source RPM, added the commit f31dcb152a3 diff + `%patch -P 2 -R`, and built the kernel RPM with mock.
I've been having flashbacks to troubleshooting some particularly thorny unreliable boot stuff several years ago. In the end I tracked that one down to the fact that device order was changing somewhat randomly between commits (deterministically, though, so the same kernel from the same commit would always have devices return in the same order), and part of the early boot process was unwittingly dependent on particular network device ordering due to an annoying bug.
The kernel has never made any guarantees about device ordering, so the kernel was behaving just fine.
That one was.. fun. First time I've ever managed to identify dozens of commits widely dispersed within a large range, all seem to be the "cause" of the bug, while clearly having nothing to do with anything related to it, and having commits all around them be good :)
The dhcp client in klibc-utils had a bug in how it handled multiple interfaces: it didn't create separate sockets per interface, so as it enumerated through them it would clobber the previous one. It validated the destination of the received DHCP response, and silently dropped it if it wasn't for the interface the socket was for.
The DHCP server was only listening on one of the two interfaces, and so if that interface got enumerated second, all was well and good. The socket was for it, response would be accepted. When it came up first, the clobbered socket meant the dhcp response would be ignored.
I bisected so many times and mostly just got confused. The engineer at Canonical dug in and found the actual bug.
Honestly I don't know! We've seen it appear with host kernel 6.2.15 (https://bugzilla.redhat.com/show_bug.cgi?id=2213346#c5) but I'm not aware of anyone either reproducing or not reproducing it with earlier host kernels. All your other config looks right.
I noticed it hangs in similar way when you insert msleep anywhere before smp_prepare_cpus in kernel_init_freeable. But I have no idea whether sleeping is valid here.
Looks like you have a trigger, but no root cause (yet).
Doesn't matter anyway...revert and work it out later.
The root cause bug is still in there somewhere, waiting to be triggered another way...
This reminded me of another story [0] (discussed on HN [1]) about debugging hanging U-Boot when booting from 1.8 volt SD cards, but not from 3.0 volt SD cards, where the solution involved a kernel patch that actually introduced a delay during boot, by "hardcoding a delay in the regulator setup code (set_machine_constraints)." (In fact it sounded so similar that I actually checked if that patch caused the bug in the OP, but they seem unrelated.)
The story is a wild one, and begins with what looks like a patch with a hacky workaround:
> The patch works around the U-Boot bug by setting the signal voltage back to 3.0V at an opportune moment in the Linux kernel upon reboot, before control is relinquished back to U-Boot.
But wait... it was "the weirdest placebo ever!" Turns out the only reason this worked was because:
> all this setting did was to write a warning to the kernel log... the regulator was being turned off and on again by regulator code, and that writing that line took long enough to be a proper delay to have the regulator reach its target voltage.
Before clicking I thought someone kept note of how many times Linux booted in regard to their computing habits, and not testing software. I know for me I boot roughly 3 times a day into different machines, do my work, shutdown, then rinse & repeat.
Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
I had a developer that I inherited from a previous manager some years ago. Made tons of excuses about his machine, the complexity of the problem, etc. I offered to check his machine out and he refused because it had "private stuff" on it. He had the same machine as the rest of the team, so since he hadn't made a commit in two weeks on a relatively simple problem, refused help from anyone, etc., we ultimately let him go.
When we looked at his PC to see if there was anything useful from the project, his browser had around a thousand tabs open. Probably 80% of them were duplicates of other tabs, linking to the same couple stack overflow and C# sites for really basic stuff. The other 20% were... definitely "private stuff".
I’m at the other extreme of “private stuff”. Nothing work related should live on my work machine. It should all be pushed to git or dumped in the wiki (personal pages if nothing else).
On one of my largest projects the IT dept made bulk orders for hardware and doled them out to new hires. 18 months into our new project someone’s hard drive died.
Everyone acted like his dog died. I said no problem let’s go through the onboarding docs. The longest step by far was that the company mandated Whole Disk Encryption but IT hadn’t put it in their old inventory yet. So that was 2/3 of setup time. We found some issues with the docs and fixed them.
Every two to four weeks that summer, someone else's drive would go. You see, we got all of these machines from the same production run, so the hard drives came from the same production run, which was apparently faulty. The process got a little faster as we went. By the end of the summer it was my turn, and people still looked at me like I needed condolences. I got a faster machine for a few hours' worth of work. I'm not sad. All my stuff was on the network already. I lost a couple hours of work, tops.
My company laptop I don’t do anything I wouldn’t be OK with my manager or IT seeing. Even something as simple as a recipe lookup I do on my phone or personal laptop.
With today’s software for managing corporate machines, and corporate VPN with network security and firewalls abound, anything and everything can be seen.
I have a joke Wi-Fi name that I’ve even considered changing (or at least create a guest network) just to be safe. It’s likely overboard, but I like the idea of just mailing my laptop if I change companies, and no worries at all
Coincidentally I just got a new work machine this week. The IT support staff replacing it scheduled an hour of time to transfer files, set up apps, etc. with me. I was done in 5 minutes. Once I logged in, logged into my cloud services, and verified my faulty port problem with my dock was resolved with new hardware, I was done and as productive in a few minutes as I was before. Install a few tools, copy my scripts from the cloud, make a new key pair for SCM, that's it.
Any machine in a company needs to be able to be unplugged and thrown out of a window, without leading to significant data loss, only the inconvenience of the price of a new machine and setup time.
There are very few machines in the world that are actually mission critical, and you might not be able to do that (although for them, you can probably switch components with it still running). Anybody else, you are just betting your company on the lack of fires, hardware failure, etc.
I'm unusually strict about maintaining a separation between work and personal (for instance, I would never allow my personal smartphone to connect to my employer's WiFi), so I wouldn't use personal keys on a work machine at all.
But if those keys (or passwords, etc.) are generated for work purposes, I consider them to be as much company property as the machine itself, so I'm no more protective of them than I am of any other sensitive company data.
How do you feel about giving your colleague your password?
My personal opinion is that I can hold someone legally culpable if their account does something like leak financial information; you have a professional responsibility to secure your account from absolutely everyone.
Administrators acting on your account must of course be heavily logged and audited, which is the case.
> How do you feel about giving your colleague your password?
I usually don't, mostly just out of good security habits, but also because most employers specifically prohibit doing that.
Almost always, your colleague can be given his own access to whatever the password is for anyway. If that's not possible, then I'll share the password and change it immediately after my colleague doesn't need access anymore.
> you have a professional responsibility to secure your account from absolutely everyone.
I agree -- that's part of treating credentials the same way as all other sensitive company data. But it's still my employer's data, not mine.
If I quit the company or if my supervisor wants to see the contents of my machine, I'm fine with that. The machine and everything on it belongs to the company anyway.
Ok, but your private key, session tokens and CLI access tokens (kube configs, gcloud etc;) are your password in those situations.
They tie to your identity, thus you must not treat them the same as company secrets: they are professional personal secrets which should not be disclosed or allowed to fall into anyone else's hands (lest they have to be revoked and cycled).
It's not just good security posture it could affect your career quite badly or lead to legal issues.
> I agree. I don't think I've said anything counter to that (or perhaps I wasn't being clear?)
I think given the context of the thread (don't touch my secrets), saying that you don't have anything you would consider confidential towards your employer or colleagues is a direct contradiction to what I stated.
That's why I'm "arguing" because my employer/colleagues should not have access to my private key, ever.
There are several very legitimate times when my employer needs to have access to my keys. If I'm leaving the company, for an obvious instance.
But my core point is that such keys/passwords aren't really mine, they're the company's and in the end, the company gets to decide what I'm to do with them.
I think the building access keycard is a perfect analogy. I'd never let anyone borrow mine on my own volition, but if the company wants to retrieve it from me, that's their prerogative. It's theirs, after all.
If an employer needs someone’s particular keys something probably went wrong or there’s bad processes in place. But that aside I think the default course of action should be to aggressively guard your secrets and tokens since they represent you. Not as personal or private property but to keep someone (be it a fellow employee or a 3rd party attacker) from impersonating you without authorization.
There are exceptions but the circumstances where an employer would need to retrieve my keys without my assistance are extremely rare and in those instances it’s unlikely I’d still be an employee anyway.
Handing over the keycard is necessary to ensure it's destroyed and can't be used as "proof" you work somewhere (most access cards these days have your name, face and the company logo printed on the front).
The keycard will be removed from the access list to the building even when it's destroyed, they're not considered reusable by most companies.
Your private key is not reusable either: it should be destroyed and revoked from all systems when you leave a company.
We could destroy the keycard with both parties present, that seems safest. I don't mind turning in a private key permanently and getting a receipt at the time, but it needs to be very clear that it's no longer my responsibility.
> but to keep someone (be it a fellow employee or a 3rd party attacker) from impersonating you without authorization.
Aside from a third party attacker (which is well-covered by my normal practices), that's a threat model that I'm personally not worried about at all, really. In part because I've never seen or heard of that happening and in part because if it did, I am confident that there are enough records to be able to prove it.
Internal abuse and attacks aren’t as rare as they should be. You’d be amazed what someone will do to risk their job or even career on impulse or poorly considered risks.
Isn't this largely the point of company directory services? The machines/routers/applications/etc are all doing their authentication against the directory service, and permissions are granted and revoked there. It's a large part of running a company with more than a couple employees, because when someone leaves you don't need to run around changing passwords and wondering if they still have access to the AWS account to spin stuff up, or punch through the VPN. The account in the directory service is just deactivated, and with it all access.
By default this should be what is happening on all but the most ephemeral of machines/testing platforms/etc. And even then if its a formal testing system it should probably be integrated too.
Directory service integration BTW is the one feature that clearly delineates enterprise products from the rest.
> If I quit the company or if my supervisor wants to see the contents of my machine, I'm fine with that. The machine and everything on it belongs to the company anyway.
I'm fine with that, but I still will not share my passwords. I'd be happy to reset the passwords for them if they can't access the data by other means, but as another commenter pointed out, the fact that anything needs to be recovered from my^H^H not my laptop indicates mistakes were made.
My work laptop has a touchscreen. I've never used it, but other people use it by accident fairly often. Usually only once each though, the look of shock is sometimes even worth the fingerprint :D
I've never understood people who do this. You can point at the screen, tap with a pen, take the mouse or keyboard and move the cursor, etc. but surely it's bad manners to splodge your finger on someone's screen.
He was let go after two weeks? No confrontation, nothing?
Sounds very american. In European working culture if you don't show up for two weeks people will be worried that something happened to you and try to work it out with you. This type of all or nothing reaction is a bit sporadic imo.
Yeah, it's not like that part of the story was condensed and might have left out a bunch of details that weren't important to the story. So let's give OP a hard time and make judgements about a situation for which we have not even the slightest bit of context.
Oh absolutely, you're right. I am saying that despite whatever may have happened, two weeks is very short. I feel like it would be at least a month here regardless.
For context, I was brought in with the knowledge he hadn't done anything meaningful since being hired some time before my arrival, and we did reach out and offer help or ask if he needed anything, which he refused, somewhat angrily.
You might have meant to use a word other than "sporadic." That word describes a recurring event that happens at unpredictable intervals, such as snow in a normally hot desert, or a Linux crash caused by a race condition. Other words that fit the sentence better are "unusual," "unexpected," "unjustified," "inappropriate," "surprising," "extreme," "abrupt," or "out of the blue."
(For what it's worth, I'm American, and I disagree with your assessment. We don't know how long the person was given to make progress, and we don't know what was communicated. To conclude that a two-week period without a commit represents the entire period between the start of the poor performance and the termination is, well, a bit out of the blue.)
It was the final two weeks of a long period of zero productivity, despite many offers of help and asking if he understood, needed help, etc. I never enjoy firing someone, even if they are downright awful or mean, and do my best to avoid it. My own involvement was late in the stage, and the thousand or so tabs that were left open I'm sure weren't just two weeks' worth of effort.
When I was hired I was told he was a problematic hire, that hadn't produced anything for long before my arrival. It was basically "We already know we're going to probably have to let him go but if you want to try to work around it, be my guest". I did try to go in with no judgments, as I always do, but he refused help, and refused to even let anyone look over his shoulder and find why this task was taking an order of magnitude too long.
>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
If the OS and hardware drivers properly support sleep, you almost never need to do otherwise (except to install a new kernel driver or similar).
In macOS, for example, it hasn't been the case that you need to reboot in regular OS use for 10+ years.
The "100+ Chrome tabs" or whatever mean nothing. They're paged out when not directly viewed anyway, and if you close just Chrome (not reboot the OS) the memory will be freed in any case...
I've found sleep very reliable on macOS, and both sleep and hibernate reliable on Windows.
I once had my work PC unhibernate and not pop up the login box. The computer appeared to be running normally otherwise; I just couldn't log in, and I had to tap the power button to shut it down. This stuck in my mind due to its rarity.
Can't remember ever having a serious issue on macOS. A couple of my programs sometimes don't survive the sleep/wake cycle, but it's intermittent, and I'm always in the middle of something else when it happens. I've never lost any meaningful work.
> Can't remember ever having a serious issue on macOS.
macos is fine for the most part, but there are some edge cases, such as some sketchy corporate required "security software" that eats up kernel memory or cpu for some unknown reason, a reboot can fix performance issues there
also if you are a dev and apps (like xcode, android studio etc) fill your drive with cache files or have weird background daemons that eat up cpu, at the least a logout/login (or a reboot) can fix some of those weird things
you could manually delete them without a reboot but ymmv
Indeed, and I've found the same myself - my comment was about the reliability of the OS sleep functionality, and deliberately says nothing about the wisdom of never rebooting!
I have actually found both Windows and macOS generally pretty good if you leave them running for weeks at a time, but it's one of those things that's best done only if you really need it (and can accept a non-zero chance of something going wrong). They're not so very good that I'd actually recommend doing it routinely. A reboot every 1 or 2 weeks massively reduces the chance of weird stuff happening.
It boggles my mind that you'd reboot needlessly. My uptime is usually in the hundreds of days.
Sleep is good: I just close the lid. Next time I open the lid it immediately picks up where I left off. Why on earth would you want any other behaviour?
Security-wise: encryption at rest? In high security scenarios you may be required to shutdown so you're forcing "attackers" to go through several layers: motherboard password, disk password, encryption password, OS user password + 2FA, etc.
On my personal machines? I don't shut them down or reboot very often.
At work, however, I have to use Windows. In that case, I shut it down at the end of every workday, in part because that prevents weird issues Windows tends to develop when running too long.
Mostly, though, it's because of those damned forced updates. Since I can't trust Windows to not reboot itself at any random point in time, having the habit of shutting down at the end of the day at least ensures that I won't accidentally lose my state overnight or over the weekend.
If you don't/won't/can't use the group policy editor, I got a lot of mileage out of hibernating the PC and powering it off at the mains. You can't leave it running something overnight, but you can at least quickly get back to exactly where you left things the previous day.
(Powering it off at the mains ensures that even if you have a device connected that could wake the PC up - thus putting your computer in a state where WIndows Update can reboot it - it can't. You can turn this feature off on a per-device basis with powercfg, but then one day you'll plug something new in and leave it plugged in and it'll wake the PC up while you're away and Windows Update will do its thing.)
> in part because that prevents weird issues Windows tends to develop when running too long
What are you using, Windows Vista? I run about a dozen Windows machines, half of them are VMs, and none of them need to be rebooted regularly. Average uptime is over 40 days, and I only reboot when there's a big update. Windows becoming unstable entirely depends on the 3rd party software you install on it. Don't install crapware, you won't have a crap experience.
I reboot most weeks, just to make sure the right stuff happens when I do. (I try to do it in the middle of the day, so there's time to sort out any matters arising.)
A couple of times I've discovered I've forgotten to set stuff to auto-run on login, or things turn out to have lost their settings, or stuff doesn't work for whatever reason - I'd much rather discover this at a time of my own choosing!
A long time ago, I had desktops with huge uptimes. The world has changed. I will no longer go that long without a security update. Too much is now passing through my machine.
(As far as you know, none of your machines were ever hacked.)
Your luck is not good security policy. When I was getting started with Linux in 1992 and only intermittently connected to the Internet via dial-up, I celebrated long uptimes. Now that I do daily banking and other activities on a machine continuously connected to the Internet, uptimes longer than the interval between kernel security updates is just irresponsible behavior.
I would prefer to not have to reboot. I know that is not the world we live in. The stability of the kernel is no longer the reason to think about uptime.
I don't care that you have nothing of value connected to the Internet. I am objecting to the advice about not rebooting.
:( I only reboot when my machine freezes or when updates require a reboot.
I did a lot of on-call in my life and I saved tons of time by leaving everything open exactly as I left it during the day.
~> w
11:19 up 18 days, 17:03, 9 users, load averages: 3.87 2.96 2.39
You haven’t properly kept a machine alive until the clock rolls over.
I logged into a firewalled Windows VM on EC2 that's been running an internal microservice that was acting up, and it caught my eye that task manager showed an uptime of 6 days, making my mind immediately think it might be a bug caused by the recent reboot or perhaps the update that triggered it.
It turns out no reboot had taken place and in fact, the uptime counter had merely rolled over - and not for the first time! Bug was unrelated to the machine and it’s still (afaik) ticking merrily away.
(Our `uptime` tool for Windows [0] reported the actual time the machine was up correctly.)
Microsoft probably never anticipated needing more than a month or two of uptime, since they roll out restart-required updates more frequently than that.
I used to shutdown regularly, then the power situation here in South Africa got so bad that we'd regularly have about 3 hours of power between interruptions.
Restoring all my work every couple of hours was becoming a pain, so I decided to re-enable hibernation support on Windows for the first time in 10 years... And surprisingly it works absolutely flawlessly.
Even on my 12yr old hardware, even if I'm running a few virtual machines. I honestly haven't seen any reason to reboot other than updates.
> I used to shutdown regularly, then the power situation here in South Africa got so bad that we'd regularly have about 3 hours of power between interruptions.
I'm in SA too, and I used to have 100s of days uptime (one even over a year and a half) ... until the regular blackouts.
Had to stop using a desktop, I've resigned myself to using a laptop, purely so that I don't have to boot the thing all the time and lose my context.
I think that there are two types of people. One set of people (I guess, relatively small) don't trust software and prefer to reboot OS and even periodically reinstall it to keep it "uncluttered". Another set of people prefer to run and repair it forever.
I'm from the first set of people and the only reason I stopped shutting down my macbook is because I'm now keeping its lid closed (connected to display) and there's no way to turn it on without opening a lid which is very inconvenient. I still reboot it every few days, just in case.
I’m in the second group (avoid reboots like the plague) but for the reason you attribute to the first: I never trust that my Windows machine - currently working - will reboot successfully and into the same working condition between OS update regressions, driver issues, etc.
Conversely, it boggles my mind that people think 100+ tabs is a lot. I've got >500 open in Firefox at the moment, they won't go away just because I reboot or upgrade. I'll probably not look at most of them again, but they're not doing any harm just sitting there waiting to be cleaned up.
One of the fascinating curiosities you're missing out on is Pressure Stall Information (https://docs.kernel.org/accounting/psi.html). Here's what the PSI gauges look like in htop when kernel support is available:
PSI some CPU: 0.37% 0.78% 1.50%
PSI some IO: 0.38% 0.33% 0.25%
PSI full IO: 0.38% 0.31% 0.23%
PSI some memory: 0.02% 0.04% 0.00%
PSI full memory: 0.02% 0.04% 0.00%
That article was written ~5 years ago. The parent comment has ~1 year of uptime. What makes you think they don't have a kernel new enough to report PSI stats?
Why? I only restart my (linux) laptop every 3-4 months when I update software.
I can't think of any downside that I've experienced from this practice. I do a lot of work with data loaded in a REPL, so it's certainly saved me time having everything restored to as I left it.
>Then you have those types who put their machine into hibernate/sleep with 100+ Chrome tabs open and never do a full boot ritual.
I would never suspend to RAM or disk, far too error-prone in my experience. (Plus serializing out 128GiB of RAM is not great.) I just leave my machine running "all the time." My most recently retired disks (WD Black 6TB) have 309 power cycles with ~57,382 power-on hours. Seems like that works out to rebooting a little less than once per week. That tracks: I usually do kernel updates on the weekend, just in case the system doesn't want to reboot unattended.
> Then you have those types who put their machine into hibernate with 100+ Chrome tabs open and never do a full boot ritual. Boggles my mind that people do that.
Hey, I'm that guy (although I put it to sleep instead)! It honestly works really well and is in stark contrast to how Linux and sleep mode interacted just ~10 years ago. It's amazing for keeping your workspace intact.
(FWIW, I also don't reboot or shutdown my desktop where it acts as a mainframe for my "dumb" laptop.)
Why would I ever reboot my laptop without a need to? I only reboot when there's a kernel update, or if I'm doing something where the laptop might get lost or stolen (since powering off will lock the disk encryption).
I just have it running 24/7 and never restart for weeks. I don't even have the 100 tab problem, I just like having the immediate availability without waiting for startup.
Unless you're on solar, does wasting electricity not bother you? I used to seed a lot of stuff for years (with typical uptime measured in months), but the CO2 impact, however tiny it is in the grand scheme of things, does not seem to worth it anymore.
I wonder if bisect is the optimal algorithm for this kind of case. Checking that the error still exists takes an average of ≈500 iterations before a fail, while checking that it doesn't exist takes 10,000 iterations, 20 times longer - so maybe biasing the bisect to skip only 1/20th of the remaining commits, rather than half of them, would be more efficient.
Basically it calculates the commit to test at each step which gains the most information, under some trivial assumptions. The calculation is O(N) in the number of commits if you have a linear history, but it requires prefix-sum which is not O(N) on a DAG so it could be expensive if your history is complex.
Never got round to integrating it into git though.
That's a cool idea. Would also be interesting to consider the size of the commit - a single 100-line change is probably more likely to introduce a bug than 10 10-line changes.
There's an additional stopping problem here that isn't present in a normal binary search. Binary search assumes you can do a test and know for sure whether you've found the target item, a lower item, or a higher item. If the test itself is stochastic and you don't know how long you have to run it to get the hang, I'd think you'd get results faster by running commits randomly and excluding them from consideration when they hang. Effectively, you're running all the commits at the same time instead of working on one commit and not moving on until you've made a decision on it. Then at any time you will have a list of commits that have hanged and a list of commits that have not hanged yet, and you can keep the entire experiment running arbitrarily long to catch the long-tail effects rather than having to choose when to stop testing a single non-hanging commit and move onto the next one.
I can see some interesting approaches here. Given n threads/workers you could divide the search space into n sample points (for simplicity let's divide it evenly) and run the repeated test on each point. When a point hangs, that establishes a new upper limit, all higher search points are eliminated, the workers reassigned in the remaining search space.
Given the uncertainty I can see how this might be more efficient, especially if the variance of the heisenbug is high.
If the factor in one direction is large enough then a linear search becomes more efficient. Say you have 20 commits remaining and the factor is 1,000x more costly to make it easier to picture. You're better off doing a linear search which guarantees you'll spend less than 2,000x searching the space.
That suggests that for a larger search space with a large enough difference, the optimal bisection point is probably not always the midpoint even if you know nothing about the distribution.
Perhaps someone can find the exact formula for selecting the next revision to search?
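Not a closed-form formula, but the optimal split is easy to get numerically. A sketch under two assumptions - a uniform prior over which commit is the culprit, and fixed average costs of 500 runs to see a failure vs. 10,000 runs to declare a commit good (numbers borrowed from upthread):

    from math import inf

    COST_FAIL, COST_PASS = 500, 10_000   # avg runs to see a failure vs. to call a commit good

    def plan(n_commits):
        """Cost-minimizing split for each suspect-range size, assuming a uniform prior."""
        exp = [0.0, 0.0]                 # exp[m] = min expected runs to isolate among m suspects
        split = [0, 0]
        for m in range(2, n_commits + 1):
            best_cost, best_k = inf, 1
            for k in range(1, m):        # test suspect k: fail -> k remain, pass -> m - k remain
                cost = (k / m) * (COST_FAIL + exp[k]) + ((m - k) / m) * (COST_PASS + exp[m - k])
                if cost < best_cost:
                    best_cost, best_k = cost, k
            exp.append(best_cost)
            split.append(best_k)
        return exp, split

    exp, split = plan(1000)
    print(f"first test at suspect {split[1000]}/1000, expected ~{exp[1000]:,.0f} total runs")
    # The optimum is nowhere near the midpoint: the cheap (failing) outcome should be the
    # likely one, with each fail shaving off only a small slice of the remaining range.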
Each boot updates your empirical distribution. As a trivial example, if you have booted a version 9999 times with no hanging, a later version will likely give you more information per boot.
If they boot it 10,000 times for revisions that don't fail, and ~1,000 times for revisions that do fail, you can reach this number with a log2(revisions) of about 30.
I didn't mention it in the blog, but Paolo Bonzini was helping me and suggested I run the bootbootboot test for 24 hours, to make sure the bug wasn't latent in the older kernel. I got bored after 21 hours, which happened to be 292,612 boots.
Maybe it would have failed on the 292,613th boot ...
Thanks for mentioning me, but you really did the work!
But in order to contribute something useful: as a rule of thumb you want about 10 times as many passing runs as it takes, on average, to hit a failure before ruling a commit out. If a bug has taken up to 2500 runs to reproduce, don't consider it a pass until 30000 runs have succeeded.
It's something to do with Poisson distributions. If you have n runs before a failed run on average, and you want to be P% certain that a fix (including a revert or moving beyond the bug in a bisect) reduced the failure rate, you can use the formula -n * ln(1 - P/100) for how long to run, and the factor for P = 99.99 is about 10.
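In code, with the 2500-run example from above:

    from math import log

    def runs_to_confirm_fix(mean_runs_between_failures, confidence_pct):
        """-n * ln(1 - P/100): failure-free runs needed to be P% sure the rate actually dropped."""
        return -mean_runs_between_failures * log(1 - confidence_pct / 100)

    print(runs_to_confirm_fix(2500, 99.99))   # ~23,000 runs, i.e. a factor of about 9.2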
In fact that means that once you had landed on a merge commit it was probably much better to switch to a linear backwards search, because it might have fewer passing runs, and passing runs are 10-15 times more expensive than failures. Is that what you did?
I've been on a similar quest for hard-to-reproduce timing/hardware/... bugs, and if you're facing any kind of skepticism (your own or otherwise) it can be very comforting to have a 10x or even 100x "no failure occurred" margin of confidence.
It's particularly comforting when the reason for the failure/fix/change in behavior isn't completely understood.
If the bug occurs reasonably often, say usually once every 10 minutes, you can model an exponential distribution of the intervals between the bug triggering and then use the distribution to "prove" the bug is fixed in cases where the root cause isn't clear: https://frdmtoplay.com/statistically-squashing-bugs/
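A rough sketch of that calculation, with made-up interval data:

    from math import exp
    from statistics import mean

    # Made-up intervals (minutes) between observed occurrences of the bug before the fix.
    intervals = [4, 13, 7, 22, 9, 11, 6, 18, 10, 8]
    rate = 1 / mean(intervals)                   # exponential model: failures per minute

    def p_quiet_if_unfixed(clean_minutes):
        """Probability of a failure-free stretch this long if nothing had actually changed."""
        return exp(-rate * clean_minutes)

    clean = 24 * 60                              # a full day with no failures after the fix
    print(f"P(this quiet streak | bug still present) = {p_quiet_if_unfixed(clean):.2e}")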
No disrespect to Peter Zijlstra, I'm sure he has been a lot more impactful on the open source community than I will ever be, but his immediate reply caught my attention:
> Can I please just get the detail in mail instead of having to go look at random websites?
Maybe it's me, but if I booted Linux 292,612 times to find a bug, you might as well click a link to a repository of a major open source project on a major git hosting service.
Is it really that weird to ask people online to check a website? Maybe I don't know the etiquette of these mail lists so this is a geniune question. I guess it is better to keep all conversation in a single place, would that be the intention?
I am only guessing here, but I assume it's so the content of the mailing list archive remains. If a linked website goes down or changes at any time in the future, then that archive is no longer fulfilling its purpose of archiving important information.
If that was the reason it would have been best to state that in the request.
> Can I please just get the detail in mail so that it is archived with the list?
Of course you can't expect every email written to be perfect, it is generally treated as an informal medium in these settings. But stating the reason helps people understand your motives and serve them better.
I think that hardcore kernel devs already know the reasons, and there is no point in raising it again. For you it might seem like a random requirement, but that's just a lack of familiarity.
I think in that case an explanation is needed even more: if you are a hardcore dev, then no one needs to remind you about such a rule; on the other hand, if you are not so familiar with those rules yet, an explanation would be very helpful.
It was completely obvious to me, and I'm not a Linux committer.
Any bug of the form:
Hi, I'm sending this via official channels, but see [external thing].
Is going to immediately bitrot. For instance, in stack overflow, for something like 10% of answers, you'll see people saying to explain what a link says instead of just linking.
The irony being that he presumably wants more information on the mailing list to keep a good archive, while not giving enough information for people to understand that and follow the advice later.
Not only the link itself - but if the email body /attachments contains the details - it is also easier to write a good reply by selectively quoting from the mail. So it isn't just for the first mail, but for the follow-up discussion thread(s).
I was a bit short in the original description, but luckily we've since reached an understanding on how to try to reproduce this bug.
Unfortunately he's not been able to reproduce it, even though I can reproduce it on several machines here (and it's been independently reproduced by other people at Red Hat). We do know that it happens much less frequently on Intel hardware than AMD hardware (likely just because of subtle timing differences), and he's of course working at Intel.
It's LKML. The volume of that list is insane, and technical discussion is very much the point, so they'd expect you to explain the problem right there, where people can quote parts of it, and comment on each part separately.
I've met people who seriously do use dumb terminals and other people who have seriously discussed using a PDP-11.
So, while your question might sound sarcastic, the answer is definitely yes.
Nerds gonna nerd. Nothing wrong with that.
I personally don't like going to gitlab or github because I don't like the businesses behind them. That's another point irrespective of whether I'm browsing in a terminal or ancient device.
I run OpenBSD on most of my systems. The OpenBSD development team collaborates using cvs instead of git because it fits their workflow well. If I wanted to collaborate with them, I'd use cvs too – and if I wanted to move them to git I'd do it after becoming a core contributor, not before. If I'm going to send bug reports & patches here and there, I'm going to do it in a way that makes it easy for Theo and team to review.
This is very much a Chesterton's fence topic, I think. Linux developers have settled on a workflow that works for them, and if you want to get time from the people who are doing the bulk of the work it's fair to expect you to work within their requests.
This dude literally spent days doing their work. Rebooted Linux nearly 300k times to find their fuckup. Then they have the infantile reaction to complain about clicking a link?
It’s a gitlab link, not github. And it isn’t reasonable in this context. GitHub hosts a lot of open source projects but it is not the only place where open source happens. That's kinda the point of open source, and especially of git.
Git itself is a satellite project of the Linux kernel. It can work without the web at all. That someone EEE’d it so hard that even Microsoft couldn’t resist is no reason to expect the kernel devs to change their workflow.
you're wrong. instead you should adopt the standards of the group you're attempting to join. Getting "tourist who complains about customs of country they visit" vibes from this comment
You’re welcome to go tell the Linux kernel devs what they are doing wrong. Fuck around and find out as the kids say. Or start the Zolnux project and see how far that goes chasing shiny objects.
My suspicion is that it's not about reading the bug info once, but having the information in the mailing-list, which is the archive of record for kernel bugs.
Asking to click a link in an email is unreasonable in this context. The email list is the official channel and project participants are expected to use it. They are not expected to have a web browser. The popularity of the linked site is irrelevant. Part of filing good bug reports is understanding a project’s communication style. A link to supplementary information is fine. But like a Stack Overflow answer the email should stand on its own.
Many kernel people really are stuck in their ways like that. They don't want to leave their Mutt (e-mail client) at any cost. I recall some are even to this day running a text console (ie. no X11 or Wayland).
The link is to gitlab, not github. But any website is inappropriate in this context because it’s not permanent. The email list is, at least as far as the project is concerned.
Of course it's contained in known integer sequences. The positive integers in increasing order, for example: https://oeis.org/A000027. The search doesn't know about every term in every sequence, as most are infinite and many are mostly unknown (some well-defined sequences only have a few known terms).
That README is light on details. How is this different from selecting some N (and hoping it is high enough) and repeating your test case that many times? You just don't have to select a value for N using this tool?
The paper lists the algorithm (which is relatively simple) but basically it is much more efficient than repeating test cases.
You can see that that must be possible fairly easily. Consider two algorithms:
1. Classic binary search - test each element once and 100% trust the result.
2. Overkill - test each element 100 times because you don't trust the result one bit.
The former will clearly give you the wrong result most of the time, and the latter is extremely inefficient. There's clearly a solution in between that's more efficient without sacrificing accuracy.
Skimming the algorithm, it looks like they maintain Bayesian probabilities for each element being "the one", test the element at the 50% cumulative-probability point each iteration, then update the probabilities accordingly. Basically a Bayesian version of the traditional algorithm.
You do still have to select an N, but it's not as critical that the N gives 100% guarantee of the flaky failure (which can be really difficult or even impossible to achieve). Unlike regular binary search, robust binary search doesn't permanently give up on the left or right half based on just a single result.
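Here's a bare-bones toy in that spirit (emphatically not the tool's actual implementation: it picks the probe greedily by expected information gain rather than at the 50% point, and it assumes the ~10% reproduction rate is known):

    import math, random

    random.seed(7)
    REPRO_RATE = 0.1                     # assumed chance that one run fails on a bad commit

    def run_test(commit, first_bad):
        """One flaky run: only commits at/after the culprit can fail, and only sometimes."""
        return "fail" if commit >= first_bad and random.random() < REPRO_RATE else "pass"

    def likelihood(outcome, hyp_first_bad, probe):
        p_fail = REPRO_RATE if probe >= hyp_first_bad else 0.0
        return p_fail if outcome == "fail" else 1.0 - p_fail

    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)

    def robust_bisect(n, first_bad, threshold=0.99, max_runs=5000):
        post = [1.0 / n] * n             # posterior over "commit i is the first bad one"
        runs = 0
        while max(post) < threshold and runs < max_runs:
            # Greedy probe choice: maximize the expected drop in posterior entropy.
            current_h = entropy(post)
            best_probe, best_gain = 0, -1.0
            for probe in range(n):
                p_fail = sum(post[i] * likelihood("fail", i, probe) for i in range(n))
                expected_h = 0.0
                for outcome, p_out in (("fail", p_fail), ("pass", 1.0 - p_fail)):
                    if p_out <= 1e-12:
                        continue
                    new = [post[i] * likelihood(outcome, i, probe) for i in range(n)]
                    s = sum(new)
                    expected_h += p_out * entropy([x / s for x in new])
                gain = current_h - expected_h
                if gain > best_gain:
                    best_probe, best_gain = probe, gain
            outcome = run_test(best_probe, first_bad)
            runs += 1
            post = [post[i] * likelihood(outcome, i, best_probe) for i in range(n)]
            s = sum(post)
            post = [p / s for p in post]
        return post.index(max(post)), runs

    print(robust_bisect(n=100, first_bad=43))    # usually converges on index 43

Note that no single result is ever trusted outright: a pass on a bad commit just nudges the posterior, and the search keeps revisiting the region until the evidence piles up.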
It makes sense to n-sect (rather than bisect) as long as the tests can be run in parallel. For example, if you're searching 1000 commits, a 10-sect will get you there with ~30 tests, but only 3 iterations. OTOH, a 2-sect will take more than 3x the time (10 iterations), but require only 10 tests.
There's ofc always some sort of bayesian approach mentioned in other answers.
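Back-of-the-envelope version of that trade-off:

    from math import ceil

    def n_sect_cost(n_commits, k):
        """Rounds of parallel testing and total tests for a k-way split of the range."""
        rounds, remaining = 0, n_commits
        while remaining > 1:
            remaining = ceil(remaining / k)   # k - 1 probe points split the range into k pieces
            rounds += 1
        return rounds, rounds * (k - 1)

    for k in (2, 10):
        print(k, n_sect_cost(1000, k))        # 2 -> (10 rounds, 10 tests); 10 -> (3 rounds, 27 tests)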
Yeah, I did a 4-way search like this on gcc back in the Cygnus days - way before git, and the build step involved "me setting up 4 checkouts to build at once and coming back in a few hours" so it was more about giving the human more to dig into at comparison time than actual computer time and usage. (It always amazes me that people have bright-line tests that make the fully automated version useful, but I've also seen "git bisect exists" used as encouragement to break up changes into more sensible components...)
In the speed running community there is a pretty famous clip [0], where a glitch caused a Super Mario speed runner to suddenly teleport to the platform above him, saving him some valuable time.
Of course people tried to find ways to reproduce the bug reliably, as saving even milliseconds can mean everything in a speed run. They went as far as replicating the state of the game from the original occurrence 1:1, but AFAIK no one has been able to reproduce the glitch without messing with the games memory.
For that reason it is speculated that a cosmic ray caused a bit-flip in the byte that stores the players y coordinate, shooting him up into the air and onto the next platform.
After, in order to try to confirm that the pre-commit kernel didn't have a latent bug; in other words, that the commit clearly triggers the hang. (This doesn't necessarily prove that the commit is wrong: it might simply be exposing another bug that never occurred before, and in fact that is the current thinking.)
I once had to bisect a Rails app between major versions and dependencies. Every bisect would require me to build the app, fix the dependency issues, and so on.
I used to think I was amazing at performance tuning and debugging but after working with a few hundred different people it turns out I’m just really fucking stubborn. I am not going to shrug at this bug again. You are going down. I do have a better way of processing concurrency information in my head, but the rest is just elbow grease.
I had a friend in college who was dumb as a post but could study like nobody’s business. Some of us skated through, some of us earned our degree, but he really earned his. We became friends over computer games and for a long time I wondered if games and fiction were the only things we had in common. Turns out there’s maybe more to that story than I thought at the time.
I think you’re absolutely right. Some of the things I’ve been most proud of have been products of stubbornly refusing to give up. On the other hand, some vast oceans of wasted time have been another result. It’s tricky to know when to be tenacious!
In my defense, I am a strong proponent of refactoring to make all problems shallow. So there are classes of bug that I will see before anyone else because I move the related bits around and it becomes obvious that there are missing modes in the decision tree.
I tend to believe that discipline and tenacity are separate traits. Often appearing in the same people, but different skills with different exercises.
Discipline is a funny word. Some of my disciplines I'm very proud of, and some/many other people don't value them at all. But they sure tell me about how my behavior doesn't align with their definition of discipline.
I know that when I leave projects, my ex-coworkers defend my decisions. I've gotta be doing something right.
I have an old VIA-based 32-bit x86 machine (a VIA Eden Esther 1 GHz from 2006), and it hangs at different times, but I managed to create a reproducer which hangs the system not long after boot. About 1 in 20 boots are unsuccessful.
I noticed that verbose booting reduces the chance of hanging compared to quiet boot, but does not eliminate it completely.
A similar issue was present even on Dell servers back in 2008-2009, which were based on more recent x86_64 VIA CPUs; here's an attempt to bisect the issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=507845#84
The CPU seems to enter an endless loop, as the machine becomes quite hot, as if it's running at full speed.
All these years I believed this was a hardware implementation issue, related either to context switching or to the SSE/SSE2 blocks, since running a pentium-mmx-compiled OS seems to work fine and no other x86 system hangs the way VIA does.
However, after this post and all the LKML discussion of ticks/jiffies/HZ and how it's less of an issue on Intel, I'm not so sure: the issue mentioned there is related to time and printk, I also associate my problem (at least partially) with how chatty the kernel log is, and the person in the Debian bug tracker above also bisected to code related to printf, although in libc. It could be another software bug in the kernel. If that's the case, it has been present since at least the 2.6 days.
I would appreciate any suggestions to try, any workarounds to apply, or any advice on debugging. If anyone has spare time and interest, I can set up a dedicated machine reachable over SSH for testing. I have a bunch of VIA hardware being reused for a new non-commercial project, and I struggle to run these machines 100% stable.
I have found that my MicroPC fails on some newer kernels: when GDM starts up, the machine locks up and the LCD goes wonky. I'm not particularly looking forward to the bisect, but at least it won't take 292,612 reboots.
In some ways an early-boot, kernel-only failure is easier. Late-boot failures like that could just as well have been something changing in wayland/X/gdm/mesa/dbus/whatever at the same time. And then, if it turns out everything but the kernel is constant, it's easy to take a wild guess and look for something in, say, the DRM/GPU driver in use rather than the entire kernel. Although last time I did that, it turned out it wasn't even in the GPU-specific code but in a refactoring of the generic display management code. I still ended up doing a bisect across something like 5 kernel revisions after everything else failed. Which points to the fact that if Linux had a less monolithic tree, it would be possible to A/B test just the kernel modules and then bisect their individual trees, rather than adjusting each bisect point to the closest related commit if you're sure it's a driver-specific problem. There is a very good chance that if, say, a particular monitor config + GPU stops working on my x86 machine, the problem is in /drivers/gpu rather than in all the commits in arch/riscv that are also mixed into the bisect. Ideally the core kernel, arch-specific code, and driver subsystems would all be independent trees with fixed/versioned ABIs of their own. That way one could upgrade the GPU driver to fix a bug without having to pull forward btrfs/whatever and risk breaking it.
Since I'm on NixOS, I can at least emphatically confirm it is JUST the kernel.
Though, given the way the LCD panel wonks out, I'm actually concerned it's power management related. It looks like what happens to an LCD panel when the voltage goes too low. (Or at least, I think that's what that effect is, based on what I've seen with other weird devices with low battery.) Since MicroPC is x86, though, I doubt the kernel is driving any of the voltages too directly, so who knows.
Have they found the issue yet? So far, the author has reported using qemu 7.2.0, which has been giving many kernel developers spurious boot failures (for x86) that seem fixed in 8.0.0. I myself have measured 3/1000 boot failures on 7.2.0.
I feel the author's pain. The biggest bisect I have done was 17 steps and that was tedious enough. Booting the machine over Dell's iDRAC was the icing on that experience.
Disclaimer: not a kernel dev, opinion based upon very cursory inspection.
The patch references the "scheduler clock," which is a high-speed, high-resolution monotonic clock used to schedule future events. For example, a network card driver might need to reset a chip, wait 2 milliseconds, and then do another initialization step. It can use the scheduler to cause the second step to be executed 2 milliseconds in the future; the "scheduler clock" is the alarm clock for this purpose.
Measuring the "current time" is pretty complicated when you're dealing with multiple-core variable-frequency processors, need a precise measurement, and can't afford to slow things down. The "scheduler clock" code fuses together time sources and elapsed-time indicators to provide an estimated current time which has certain guarentees (such as code running a particular core will never see time go backwards, it will be accurate within particular limits, and it won't need global locks). The sources and elapsed-time indicators it has available varies by computer architecture, vendor, and chip family; therefore the exact behavior on an Intel core 5 will differ from that of an Arm M7.
The patch in question changes the behavior of local_time(); this is the function used by code which wants to know the current time on its particular core. The patch tries to make local_time() return a sane value if the scheduler clock hasn't been fully initialized but is at least running.
As you can imagine, there are a lot of things that can go wrong with that. I think the problem is that sched_clock_init_late() is marking the clock as "running" before it should. I could very well be wrong. Regardless, it's pretty clear that there's some kind of architecture-dependent clock-initialization race condition that once in a while gets triggered.
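If it helps, here's a made-up toy model of the kind of guarantee being described. It does not resemble the real kernel code or its function names; it just illustrates the "per-core, never goes backwards, fall back to a raw counter until late init finishes" idea.

```python
import time

class ToySchedClock:
    """Made-up illustration only: a per-core monotonic clock that falls
    back to a raw counter until calibration ("late init") has finished."""

    def __init__(self):
        self.running = False       # raw counter is usable at all
        self.initialized = False   # calibration against other sources done
        self.offset_ns = 0         # correction computed during calibration
        self.last_seen = {}        # per-core clamp; no global lock needed

    def _raw_ns(self):
        # Stand-in for reading a cycle counter on the current core.
        return time.monotonic_ns()

    def read(self, core):
        if not self.running:
            return 0                   # clock not usable yet
        now = self._raw_ns()
        if self.initialized:
            now += self.offset_ns      # use the calibrated estimate
        # Guarantee: a given core never sees time go backwards, even if
        # calibration shifts the estimate underneath it.
        now = max(now, self.last_seen.get(core, 0))
        self.last_seen[core] = now
        return now
```

The race being speculated about above would then be the window where the "running" flag is already set but the calibration behind "initialized" hasn't really settled, so early readers can see values that later code doesn't expect.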
Why was the title editorialized, a few hours after posting, over the "21 hours" bit (not important, clickbait-ish)? It wasn't breaking any of the guidelines [1] to my understanding.