XFS Metadata Corruption on Linux 6.3 Tracked Down to One Missing One-Line Patch

hnarn · on May 29, 2023

FLOSS developers are real heroes, but so are the people willing to spend time testing newer non-LTS versions of the code and report their issues.

I have enough on my plate just dealing with the issues arising from using stable code, I think it’s admirable that people find the time raising their glance to future releases and helping us all enjoying a less panic-inducing experience.

drewg123 · on May 29, 2023

We run bleeding edge FreeBSD at Netflix and are never more than a few weeks behind the FreeBSD main branch. This has worked out quite well for us.

We used to run -stable, and update every few years, like from FreeBSD 9.x to FreeBSD 10.x. We found that when we did that, we would often encounter some small subtle bug that was tickled in our environment, and which was incredibly hard to track down. That sort of bug was hard to track down because the diff between branches was enormous, and because there were thousands of commits to sift through, and because the person responsible for the bug may have committed it months or years ago, and has forgotten about it.

We eventually decided to track the main branch, updating frequently. This means that while we find more bugs, but they are far easier to fix because they were introduced more recently, and there are a lot fewer commits to look through to find where they came from.

bombcar · on May 30, 2023

This is why I prefer rolling distributions compared to stable ones, sure you only have to upgrade an LTS every few years but everything will break; whereas the small breaks you get with a rolling are easier to track down and diagnose.

hpb42 · on May 29, 2023

Is there a position open on your team? This sounds like the stuff I'm into!

hnarn · on May 31, 2023

It’s an interesting take that makes a lot of sense, but I still feel like the angle of “having less code to sift through to spot bugs” is very specific to your use case and high competence relative to other operations.

For a lot of companies, inspecting source code and filing bugs directly is just not a capacity that exists, which is where LTS of a Linux distribution makes a bit more sense — and without throwing any shade on FreeBSD (I love it), maybe the smaller amount of users globally compared to Linux means that “stable” isn’t quite as stable, especially if you’re doing bleeding edge stuff anyway.

I guess you could say the same thing is true for a company like Cloudflare considering their network related patches to the Linux kernel.

Thanks for the perspective!

drewg123 · on May 31, 2023

Yes, I think a lot of it also depends on how much you interact with the code of the open source project vs just consume the finished product.

If you're just a consumer, then it makes a lot of sense to consume the LTS branch. Whenever I've run ubuntu (which I do not contribute to), I run LTS for that reason.

In our case, we have a team of kernel engineers who make frequent contributions and are very familiar with the FreeBSD source code. So we're in a good position to inspect the frequent merges from upstream.

The other benefit to tracking the main branch closely is that it makes it far easier to contribute changes. When tracking the main branch, its easy to test a change in our tree, and then pick it up almost unmodified as a patch to the FreeBSD main branch. That makes it much easier to get the code into FreeBSD. In fact, for most smaller changes, we try to push them upstream first and bring them back with our frequent upstream merges This is much harder when running a several years old branch, as then the patch needs to be forward ported to the FreeBSD main branch. As such, its very hard to integrate and test changes suggested by reviewers, as the patch needs to be ported back and forth. And things get worse for large changes (like new TCP stacks, kTLS, etc), which are harder to port back and forth.

ignoramous · on June 7, 2023

Curious: What does your rollout look like? Update to the latest on a small percentage of the fleet and monitor for a day / week?

Also, do you folks tend to review the freebsd code before upgrades or only after the fact (like, if there's a show-stopping bug or two)? Thanks.

freedomben · on May 30, 2023

Fascinating! I worked with a company that figured out a similar thing with ruby and rails versions. By staying close to master, it's much easier to figure out what broke when things happen.

georgyo · on May 29, 2023

In my experience, bleeding edge and stable are about the same amount of pain. Breakage isn't actually that common, and fixes come a lot faster.

And even if you perfer stable, the latest will become stable eventually. Not trying your workload out on the next releases has pretty much the same risk profile of just running latest.

Many problems can only be found by running your particular workload.

ilyt · on May 29, 2023

That seems to be mostly bathtub curve for most of the software for us when it comes to amount of work.

Running on "latest commit from master" from many projects (not Linux) will just get you code nobody even tested and so a lot of bugs fixed quickly.

Running on "latest stable" (whatever that means for project) means fixes from time to time when it updates, but in vast majority of cases not that much work.

Anything behind that like LTS releases ? Extra work.

Now any doc you find might be about never release or feature that changed. "Bugs" might not get fixed if they are not big enough to backport.

Upgrade to new LTS version will also get you years of changes in app that you then have to apply to the system, vs having to do it "change by change" when keeping up to date.

If you use configuration management that also often means multiple different configs to manage at the very least till previous LTS version gets finally upgraded

talhah · on May 29, 2023

Bleeding edge arch linux user here, I've barely come across any major bugs in the last couple of years. Whenever I find something I do report it and it usually gets fixed really quickly.

In fact, many of these bugs were on stable releases too.

awill · on May 29, 2023

exactly. A RHEL kernel is likely a lot more stable than the kernel.org LTS kernel. Often bugfixes and security patches are backported to the LTS kernel, meaning both can be affected by similar bugs.

sp332 · on May 29, 2023

Is it a little worrying that even with all the attention, no one seems to know what this line of code actually does?

Someone · on May 30, 2023

Also (FTA): “[this build] has been stable for 90 minutes on the same type of hardware that all the other 6.3 kernels crashed within a couple of minutes after boot. So this seems to fix the issue for me.”

If there’s a metadata corruption bug in a file system, I think I would prefer a rapid kernel crash over a magic change that may or may not fix the issue.

_a_a_a_ · on May 29, 2023

Agreed, the tone of the quotes is scarily relaxed. This should not be how good software dev is done. Maybe they are being more rigorous than I give them credit but it doesn't sound good.

pengaru · on May 29, 2023

The transparency of FOSS conferring exceptionally high visibility into how the sausage is made often creates this kind of impression.

But in reality what's happening here is folks are getting access to bleeding-edge kernel development snapshots who choose to run these kernel versions, and are lucky to get such quick access to patches even before the scope of new bugs are entirely understood by the developers. Note there's nothing preventing these affected users from simply running a prior known-stable kernel version until the bug is better understood, they're opting in on the chaos.

It's unfair to assume Dave Chinner et al won't be running the issue seemingly fixed by this one-line change fully to ground.

If you're not interested in playing the role of kernel QA and interacting with the upstream devs when things break in not yet understood ways, don't run bleeding edge kernel versions. LTS and -stable releases are offered for a reason.

jeffbee · on May 29, 2023

You're not the first person to propose this, but like all those other people, you are wrong. 6.3 is the latest "stable" release. It is the version front and center on kernel.org. There is nothing "bleeding-edge" about it.

pengaru · on May 29, 2023

Ah I didn't notice 6.3 had already been promoted to stable, that's unfortunate.

Relative to a kernel version you'd encounter in something like rhel or debian stable however, tracking mainline's "stable" branch is still pretty damn aggressive.

Filligree · on May 29, 2023

> Relative to a kernel version you'd encounter in something like rhel or debian stable however, tracking mainline's "stable" branch is still pretty damn aggressive.

If you want to run hardware released in the last year or so, LTS kernels have a funny tendency not to ever work. I've yet to buy a laptop that didn't at least need latest mainline.

It shouldn't have to count as 'aggressive' to want a kernel that will work with my hardware, but I guess that's what we get when there's no HAL.

jrmg · on May 30, 2023

If a change causes problems that were not there before, it obviously wasn’t understood when it was written. Better to back it out _while_ you try to understand things than to leave it in until the situation is fully understood.

juujian · on May 29, 2023

Glad I am not the only one who was thinking that.

jeffbee · on May 29, 2023

Giant refactor + no unit tests = data loss. The history of Linux in a nutshell.

Paul-Craft · on May 30, 2023

How exactly do you propose to unit test code that depends intimately on interactions with hardware that you don't control? Hardware does not always behave according to spec. No matter how good your "simulation mode" is, it will not match the behavior of real hardware. That makes your so called "unit tests" useless in cases where it actually matters.

You claim it can be done. Have you ever actually done it? I bet not.

latchkey · on May 30, 2023

History of a lot of (most?) software.

patrakov · on May 29, 2023

I wouldn't say "no unit tests". There are xfstests, the problem is that nobody runs them on stable backports to verify their correctness and completeness.

jeffbee · on May 29, 2023

xfstests are not unit tests, they are integration stress tests, and their coverage is quite poor. Nothing in that suite exercises `xfs_bmap_btalloc_at_eof` particularly. That's the kind of unit test you want before undertaking a large refactor. There are several testable postconditions that would be trivial to test, if this code had an easy way to add and run unit tests. It has two mutable (in-out) parameters and a comment that says allocation returns as if the function was never called. And that is where the bug lies, according to the patch (which also adds or modifies no tests).

codeulike · on May 30, 2023

Can we stop being suprised when bugs come down to one line of code? Thats how bugs work. Code is complicated and one line can fuck it up.

ref: various 'this rocket exploded because of one line of code' headlines

malkia · on May 29, 2023

I wonder if unit testing was ever considered, (or possible?) for the Linux source code?

yjftsjthsd-h · on May 30, 2023

Considered? Unit testing is implemented in Linux - https://www.kernel.org/doc/html/latest/dev-tools/testing-ove...

But note that not only are there no silver bullets, as sibling comments note kernels (or anything that touches hardware, and "the real world" (ex. getting back packets with random reorders and duplicates and drops) really) have trouble using unit testing. And even in those cases where it might work it's not universally applied, I think.

jeffbee · on May 30, 2023

The sibling is dead wrong though. It is trivially easy to concoct any messed-up circumstances you wish to imagine, in a unit test, no matter how hard they would be to reproduce in reality.

yjftsjthsd-h · on May 30, 2023

I don't think that's true unless the entire system is purely functional (i.e. all functions take inputs and produce outputs and never touch anything resembling shared state). Ex. how would you make a unit test to check the behavior of 2 threads writing to a single memory buffer from different CPU cores? I could easily be missing a trick, but the only options I can see are integration tests, not unit tests.

jeffbee · on May 30, 2023

It's sort of incredible how you are willing to twist yourself into knots to advocate for the absence of unit tests.

yjftsjthsd-h · on May 30, 2023

Don't go putting words in my mouth. Linux should have unit tests, does have unit tests, and probably should have more unit tests. They are a tool that works well for some cases and not others, and many parts of a kernel are cases where unit tests are not a useful tool.

speed_spread · on May 29, 2023

Code that does I/O has a lot of interplay that's hard to replicate and impossible to cover entirely. The physical world is nothing but shared mutable state.

virtualized · on May 30, 2023

Yes, and that's what automated tests are for. They "replicate" specific conditions and make it possible to cover everything. That's what unit tests are. This has nothing to do with the physical world.

grt_thr · on May 30, 2023

We can't do unit tests against a database with a known state. How do you think one does a unit test on hardware in an unknown state?

jeffbee · on May 30, 2023

By passing it faked hardware. Yes, you have to write your APIs so they are testable. Yes, it is virtually impossible to retrofit unit tests into an old, large code base that was written without regard to testability. But no, it is not difficult at all to fake or mock hardware states in code that was designed with some forethought.

aDfbrtVt · on May 30, 2023

That may hold for a trivial device or a perfectly spec compliant device. However, the former is not interesting and the later does not exist. I agree that more test coverage would be beneficial, but I think your heavily downplaying the difficulty of writing realistic mock hardware.

malkia · on May 30, 2023

Well in this case probably would've caught it, given debug mode, coverage. It's not even hw issue (AFAIT).

eklitzke · on May 30, 2023

Do you have experience doing this in C/C++? There are a bunch of things about the language models for both (e.g. how symbol visibility and linkage work) that make doing DI in C/C++ significantly harder than in most other languages. And even when you can do it, doing this generally requires using techniques that introduce overhead in non-test builds. For example, you need to use virtual methods for everything you want to be able to mock/test, and besides the overhead of a virtual call itself this will affect inlining and so on.

This doesn't even consider the fact that issues related to things like concurrency are usually difficult to properly unit test at all unless you already know what the bug is in advance. If you have a highly concurrent system and it requires a bunch of different things are in some specific state in order to trigger a specific bug, of course you CAN write a test for this in principle, but it's a huge amount of work and requires that you've already done all the debugging already. Which is why developers in C/C++ rely on a bunch of other techniques like sanitizer builds to test issues like this.

jeffbee · on May 30, 2023

Right, doing interfaces that support DI would also force Linux to grow up and learn how to build and ship a peak-optimized artifact with de-virtualization and post-link optimization and all the goodies. It would be a huge win for users.

The fact that it would be hard to test certain edge cases does not in any way excuse the fact that the overwhelming bulk of functions in Linux are pure functions that are thread-hostile anyway, and these all need tests. The hard cases can be left for last.

aftbit · on May 30, 2023

Which point release will have this? I have xfs root on Linux 6.3.1. Am I affected?

Edit: it looks like 6.3.3 is the first broken version. I would _guess_ 6.3.4 is also affected. Presumably 6.3.5 will not be.

lproven · on June 1, 2023

As far as I can tell, this bug appears to be fixed in 6.3.5.

garganzol · on May 29, 2023

This is why I always see the code as a math sheet - if every little expression is perfect then the combined result is guaranteed to be perfect too. This rule never fails.

qprofyeh · on May 30, 2023

I'd find it surprising if there aren't many many more bugs involving variable names that are 4 or 5 letter abbreviations/acronyms.

vfclists · on May 30, 2023

>Program testing can be used to show the presence of bugs, but never to show their absence

said you know who

HyperSane · on May 30, 2023

The Linux Kernel has no unit tests?

virtualized · on May 30, 2023

Apparently not. It is crazy that they delete a random line of code and don't update or add a single test at the same time. Absolute madness. I wonder what they are doing instead that ensures the Kernel mostly works.

yjftsjthsd-h · on May 30, 2023

> I wonder what they are doing instead that ensures the Kernel mostly works.

First, they do have unit tests (KUnit). However, I suspect the "real" tests that result in a mostly-working kernel are massive integration tests run independently by companies contributing to Linux. And, of course, actual users running rc and release kernels who report problems (which I suppose is not unlike a stochastic distributed integration testing system).

HyperSane · on May 30, 2023

How many unit tests does the kernel have to pass before being released as stable?

HyperSane · on May 30, 2023

A bug as serious as data corruption should be detected before being released.

icedchai · on May 30, 2023

This is how most software development worked before roughly the late 2000's. I remember working on a system that processed like a billion in revenue for a major corporation employing thousands of people, written in a mix of C and C++. Zero unit tests. They did have a couple of dedicated QA guys though!

HyperSane · on May 30, 2023

That is insane. I would expect as many lines of unit tests as code. Doesn't ReactOS have over ten thousand tests.