Regression in Linux 4.14 – using bcache can destroy the file system (gentoo.org)
201 points by panny on Nov 21, 2017 | 123 comments



I really love Gentoo's mask system. I upgraded to 4.12.12 not too long ago; it is the latest stable: https://packages.gentoo.org/packages/sys-kernel/gentoo-sourc... You have to explicitly decide to use masked versions, at your own peril. (Of course, many Gentoo users love the bleeding edge, and I too have a number of things I manually unmasked... But the kernel? Stable only, please.)



apt-mark is nothing like masking from gentoo. A-M is just for making modifications to the database that remembers whether a package was intentionally installed or was automatically pulled in as part of a dependency. This database is how apt generates the list of "packages that were automatically installed and no longer needed." The closest thing that I can think of to masking in Debian is pulling in a package from experimental or a PPA that you added. There are no packages in the repo (stable, testing, unstable) you are using that require anything other than "apt-get install package" to install them.


You have described apt-mark {auto|manual}, but there is also apt-mark {hold|unhold}.


Damn, I always abused apt-preferences to achieve the same effect as apt-mark hold|unhold. Thanks! Does 'apt-get dist-upgrade' respect held packages?


A-M is the recommended way to hold a package; previously you would use `dpkg --set-selections`. All apt/dpkg commands will abide by what you set with A-M.


Yes, it does.


Do you think hold/unhold is anything like masking in gentoo? Me neither...


You will probably want to stick on a LTS kernel... unless you like unpatched kernel vulnerabilities.


Why? There is no reason to stick to LTS kernels if you are able to upgrade to the latest stable kernel. LTS is really meant for device manufacturers and OEMs, not desktop or server users.


The OP was not on an LTS kernel, so if it is getting security updates now it won't be for much longer. Either stick with the latest stable (they weren't) or stick with an LTS. Anything else is just playing with fire.


LTS kernels don't magically patch themselves, so if you can patch, why not be on the latest patched stable? If you're really concerned about kernel security you're already using grsecurity/PaX patches (which have neutered many otherwise 0-days) or OpenBSD anyway.


> LTS kernels don't magically patch themselves,

I never said that and, well, no shit they don't "patch themselves", so right off the bat your premise is wrong.

> you're already using grsecurity/PaX patches

Uh, no. The grsec developer is a GPL violator.


Well, I mean, because of stuff like this, right?

It's always a question of "pick your poison": Debian stable users actually missed out on heartbleed completely because Lenny's OpenSSL was too old to have it.


Gentoo didn't mark 4.14 as stable (it has its own procedures for stability despite the kernel folks treating 4.14 as the new LTS following 4.9), so only users who immediately upgraded to it (manually specifying they wanted it) would be affected by this.

I really think gentoo has the best packaging story of Linux distros, except perhaps 'newer' stuff like nixos/guix. Rolling releases (my gentoo box has been continuously updated since 2009, with nothing ever resembling the dist-upgrade other distros force on you to get the latest and greatest), it generally gets the latest versions of stuff and their patches first or among the first if you want to live on the bleeding edge, it's easy to find overlay packages for things not in the main tree, you can even create your own 'packages' since everything is source-based, there are USE flags for customization, and anyway it has a good stable-only option too.


And unlike in other rolling release distros (or at least unlike in ArchLinux; [citation needed]), emerge is smart enough to keep older versions of .so files so you're less likely to get an unusable system or unusable programs due to partial upgrades: [0]. I really cannot use Arch on personal computers simply because sometimes I need to keep various packages at older versions for various reasons (bugs, coredumps still in TODO, sensitive processes still running, whatever), so I inevitably end up choosing between bar+libfoo.so.X, baz+libfoo.so.Y, and rebuilding everything from scratch. (Edit: and I also cannot install an older version of baz 'cause it's already gone from the mirrors.)

[0] https://wiki.gentoo.org/wiki/Preserved-rebuild


Yeah. Plus things like quickpkg (make a backup of your older installed package whose version may no longer be in the tree), demerge (record states of installed packages and configs over time), and slots (letting conflicting package versions like gtk2 and gtk3 be installed at the same time) all make for more niceness.

It's definitely possible to go overboard and get into trouble. But there's a lot to help. And with gentoo I was able to weather the gnome3/systemd madness and keep that stuff off my system despite having to keep around old gnome2 and udev packages until mate, eudev, etc. were ready. I don't know too much about Arch except the pain stories friends have relayed as they eventually gave up and switched to something else. Back when Arch decided to make Python 3 the default python well before it was ready someone on IRC quipped "<dash> well that confirms my impression that arch was invented by a bunch of guys who thought gentoo was too stable and easy to use".


I haven't used Gentoo in about 10 years, but maybe I should give it a look again. I remember seeing it allows musl as a system C library, which is cool. I think mostly I just prefer the pace of Slackware though.


In case anyone is wondering, Bcache allows one to use an SSD as a read/write cache (in writeback mode) or read cache (writethrough or writearound) for another blockdevice (generally a rotating HDD or array).


Is it possible to use bcache in front of an NFS volume?


> Is it possible to use bcache in front of an NFS volume?

No, because bcache only works for a block device, whereas NFS presents itself to the client as a filesystem, not a block device.

You could however use bcache as a local accelerator for a remote iSCSI target.


No but you can already use FS-cache?


fs-cache / cachefilesd seems to still be buggy. was it even deprecated?


Not that I know of. You can also try https://github.com/kahing/catfs which I wrote recently but I don't promise it's not buggy :-)


So, a single line of code change? This is why the language choice is not the biggest issue here. Keeping all the necessary state in a programmer's head is the real issue. Using <insert fav lang>Rust</> would not have helped.

bio->bi_partno = bio_src->bi_partno;

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


Not only would this exact error (a new field is added to a struct, and old basic operations on the struct weren't manually updated to take the field into account) have been solved by idiomatic Rust with #[deriving], it is remarkably similar to another CVE in curl that I commented on when the author was all "Rust doesn't solve all our problems because not everything is memory safety". True, not everything is memory safety. Rust isn't just memory safety. There are other things we've learned about programming languages since C was invented and today that also help eliminate bugs.

Quoting myself from https://www.reddit.com/r/programming/comments/61rh9j/curl_is... :

> For instance, the patch for CVE-2016-5420, "Incorrect reuse of client certificates", is about making sure that when you compare two structs, you don't forget to compare a field in the two structs. In C, you have to open-code this comparison. In Rust, you just stick #[deriving(Eq)] before the struct, because unlike C, Rust knows that comparison of structs is a thing that people might want to do ever. The compiler will generate a trait Eq implementation that compares all the fields, using the Eq implementations (either automatically-derived or manual) of each field, so you even get secure string compares if you define some SecureBuffer type and use it for the strings. (As a bonus, doing that also makes sure that you don't accidentally use a normal strcmp on the strings.) When you add a new field to the struct, the automatic Eq implementation on the struct will automatically compare it, too.

(Disclaimer: while I happen to like Rust, there are plenty of other languages you could also use instead—Microsoft Research even made a fork of C# designed for kernelspace. And while I am interested in writing Rust, I'm not particularly interested in telling people who are happy writing C that they should stop, just in making sure that people considering a language for a new project know what each language brings to the table.)
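
To make the quoted point concrete, here's a minimal sketch (the struct and field names are invented for illustration, not curl's actual types) of what the derived comparison buys you:

    #[derive(PartialEq, Eq)]
    struct ClientCert {
        name: String,
        issuer: String,
        // A field added later is automatically part of the derived
        // comparison; no hand-written compare function needs updating.
        serial: u64,
    }

    fn main() {
        let a = ClientCert { name: "a".into(), issuer: "ca".into(), serial: 1 };
        let b = ClientCert { name: "a".into(), issuer: "ca".into(), serial: 2 };
        assert!(a != b); // they differ only in the newly added field
    }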


Minor tangent: I think your links into the curl repo may have rotted in the past 7 months, so they're not pointing at the lines you intended them to. For comments like this that are likely to be cited far in the future, I recommend using a link at a specific tree rather than just HEAD. So this:

  https://github.com/curl/curl/blob/8aee8a6a2d/lib/vtls/vtls.c#L94
and not this:

  https://github.com/curl/curl/blob/master/lib/vtls/vtls.c#L94
(Deliberately disabling truncation on the URL, at the cost of losing clickability. This comment isn't really about these specific links anyway, just the 'syntax' of the URL.)


Also if you are viewing a file on github you can get the canonical link by pressing `y`


Actually, it would. This issue is rather straightforward, caused by not initializing memory. Rust forbids that; consider the following Rust code.

    struct Example {
        x: i32,
        y: i32,
    }
    
    fn example_clone(input: &Example) -> Example {
        Example {
            x: input.x,
            // Oops, forgot y
        }
    }
This fails with the following error:

    error[E0063]: missing field `y` in initializer of `Example`
     --> src/main.rs:7:5
      |
    7 |     Example {
      |     ^^^^^^^ missing `y`


in this kernel's case the caller passed in the output so the bug would not be caught in rust if the same pattern was used. though, possibly this pattern would not be used in a rust kernel.


The pointer return pattern is mostly due to C limitations. Rust is smart enough to automatically output into a pointer when a structure is big enough (this is invisible in the source code; it's really the optimizer's decision whether to do it) and supports multiple return values by use of tuples.

Also, even if somebody did insist on an explicit mut-output pattern in Rust anyway, they would quickly realize that they need to initialize the value before getting a mutable reference to it - Rust NEVER allows access to uninitialized memory (other than unsafe blocks where you can trick Rust into accessing uninitialized memory and cause undefined behavior by using `std::mem::uninitialized`). The noise related to it should hint that this is not the intended way to do it. Here is a code example to clarify what I mean.

    fn write2(out: &mut i32) {
        *out = 2;
    }
    
    fn main() {
        let mut two;
        write2(&mut two);
    }
And error message:

    error[E0381]: use of possibly uninitialized variable: `two`
     --> src/main.rs:7:17
      |
    7 |     write2(&mut two);
      |                 ^^^ use of possibly uninitialized `two`
Fixing that requires explicitly writing a value into `two`, for instance with `let mut two = 0`, which for complex structs... yeah, you may as well return a struct directly instead of bothering with pointer output.

[std::mem::uninitialized]: https://doc.rust-lang.org/std/mem/fn.uninitialized.html


In this case the memory isn't uninitialised - it's been initialised by bio_init(), which sets bi_partno to 0.


> it's been initialised by bio_init(), which sets bi_partno to 0.

It just memsets the entire thing to 0, which in Rust would

1. require unsafe{}

2. most likely be UB as the struct looks like it has a bunch of pointers which would not be nullable (e.g. arrays)

And therefore not be done at all, or rejected by reviewers if attempted.
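
A small sketch of what that memset-style initialisation looks like on the Rust side (the struct here is hypothetical, not the real bio): even the legal, plain-old-data case needs an unsafe block, which is exactly what would draw a reviewer's eye.

    use std::mem;

    // Plain-old-data struct: zeroing it is legal, but still needs `unsafe`.
    #[derive(Debug)]
    struct Pod {
        sector: u64,
        partno: u8,
    }

    fn main() {
        // The equivalent of a memset to 0: it cannot be written without
        // `unsafe`, and doing the same to a struct containing references or
        // other non-nullable types would be undefined behavior.
        let p: Pod = unsafe { mem::zeroed() };
        println!("{:?}", p);
    }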


You can return structs just fine in C. The reason it's not generally done in practice is that C lacks return value optimization - the compiler (almost?) always has to do a hidden memcpy() at the end of your function due to pointer aliasing.


A Rust kernel would likely make this function part of an implementation of the Clone trait, which would lead it to getting the error described above.
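
As a sketch of that (the field names here are made up, not the kernel's actual layout): deriving Clone copies every field automatically, and a hand-written impl that forgets one fails to compile with the same E0063 error quoted above.

    #[derive(Clone)]
    struct Bio {
        opf: u32,
        sector: u64,
        partno: u8, // field added later: the derived clone still copies it
    }

    fn main() {
        let src = Bio { opf: 1, sector: 42, partno: 3 };
        let copy = src.clone();
        assert_eq!(copy.partno, src.partno);
    }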


Language definitely could have helped. For example, you could have a type that could only be created with both disk and partition info filled in. Then the compiler would complain if you forgot one of them.
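
For instance, a rough sketch (the type and field names are made up, not the kernel's actual API) of a type whose only constructor demands both pieces of information:

    mod dev {
        // Fields are private, so code outside this module cannot build a
        // DevRef with only one of the two pieces of information filled in.
        pub struct DevRef {
            disk: u32,
            partno: u8,
        }

        impl DevRef {
            pub fn new(disk: u32, partno: u8) -> DevRef {
                DevRef { disk, partno }
            }

            pub fn disk(&self) -> u32 { self.disk }
            pub fn partno(&self) -> u8 { self.partno }
        }
    }

    fn main() {
        let d = dev::DevRef::new(7, 1);
        println!("disk {} partition {}", d.disk(), d.partno());
    }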


The field was set to 0 by a function it was passed to previously. That function didn't do that in older kernel versions, so the field kept the proper value - a classic miscommunication (or lack of communication) between devs.


No-- the cloning operation didn't preserve the partition information. When I first bisected/found the issue I thought it was as you describe, but it's really just that cloning 0'd it out in the new copy and everything else was proper.


This sounds exactly like the sort of behavior that should be specified/committed in the form of a unit test.

That would've caused this code change to fail, unless I'm missing something.


> The field was set to 0 by a function it was passed to previously.

The entire structure was just memset to 0, it's not like the struct was carefully initialised.


A language with copy constructors could have prevented this.


A language with copy constructors doesn't prevent you from putting a bug in them.

C++ or Rust might have put a 0 in the uninitialized member, but the bug would have remained.


Rust doesn't have copy constructors, to be clear.


> A language with copy constructors doesn't prevent you from putting a bug in them.

If you don't override it, it will copy everything properly.

> Rust might have put a 0 in the uninitialized member

Rust does not do that. If a member is not initialised, it's uninitialised, and in this case the structure will be rejected. Here, even if you could not just derive Clone (struct bio looks non-trivial), the compiler would have complained if your hand-rolled implementation had not properly initialised the member.


> If you don't override it, it will copy everything properly.

Sure, so does memcpy or bare struct assignment in C. The problem is that trivial copy constructors are relatively rare.

> Here even if you could not just have derived Clone (struct bio looks non-trivial) your hand-rolled implementation would have complained that you had not properly initialised the member.

Right, but this is more similar to an assignment operator. DerefMut in Rust I think wouldn't have caught the bug.


> Right, but this is more similar to an assignment operator.

Only if you don't zoom out. If you do, this is alloc/init/fill which in C++ or Rust would use a regular copy and RVO.


A language with copy constructors hides user defined code inside seemingly innocuous expressions, for great hilarity when they have side effects or other unexpected semantics.


C++ would be a logical language upgrade for the kernel.


Read Linus' rant about why the Linux kernel shouldn't use C++: https://lwn.net/Articles/249460/


> [C++ is] made more horrible by the fact that a lot of substandard programmers use it

Typical Linus rant. There's no substantive, logical, reasoned argument in there about why C++ would be inappropriate. The closest we get is that "You invariably start using the "nice" library features of the language like STL" — God forbid the language provide a dynamically sized array, and not force you to implement it manually.

Not saying that Linux should switch from C to C++, nor that doing so wouldn't be a metric crap ton of work (it would be) and for benefits that might be hard to sell (e.g., at this point, we already have a decent vector implemented in C macros; what does switching actually gain us?) or that the kernel doesn't have special concerns (in a kernel, you have to implement malloc(), for example (and that's probably a gross simplification) and that could definitely affect a lot of things in the STL), etc. Just that the linked rant is not valid argument as to why C++ isn't a good fit.


> [C++ is] made more horrible by the fact that a lot of substandard programmers use it

Thing is, it's a very real issue. C++ as written by the best of programmers might be a very reasonable language for the kernel, but in practice what happens is the additional abstraction and indirection mechanisms mean it becomes very hard to decipher with any confidence code written by mediocre or merely average programmers.

And this is a real issue in practice, I saw it when I was at Google with code written by other programmers at Google, and honestly the majority of the code in the Linux kernel is at about the same level - some of it really good elegant code, more of it ugly and hacky but working, and a lot more (especially driver code) mediocre crap.

And when having to decipher and refactor the mediocre crap - because let's be honest, that's the majority of the code - I'd much rather have to deal with C than C++.

The reason this is such an issue with C++ is that - and it's been said before, but it bears repeating - C++ adds a lot of new ways to shoot yourself in the foot and it doesn't take any of the old ones away. With C, you can generally read a chunk of code and have confidence that it does what it appears to do - you can understand it in isolation. With C++, just figuring out the flow control can be a nightmare.

Rust would be a different story. I am 100% in favor of using Rust whenever possible - it has some of the same issues as C++ or any other polymorphic language in that figuring out flow control can really suck, but it provides much stronger guarantees w.r.t. how code can screw you over that make it well worth it. C++... not so much.


Indeed. There is plenty evidence that there are a ton of shit C programmers out there, as well. It's not very hard to take a random OSS C (or C++) project and find bugs or potential bugs in the form of undefined or unspecified behavior. Correctly performing signed integer arithmetic without introducing undefined behavior is surprisingly hard. Additionally, most people never bother to do it correctly in either C or C++ because, well, it's a pain in the ass and it slows the code down.

Linus's argument against C++ is basically akin to Dijkstra's argument against `goto`. Like any tool, it can be abused. But they've seen too many abuses for their own taste, so they ban the tool completely despite the advantages the tool may have.


Linus' answer is based on a comparison between monotone, implemented in C++, and git, implemented in C.

Monotone uses boost, standard C++ practices, and it's much slower and more bloated than git.

The idea is that, even if git could be written in C++, with similar performance compared to the C version, C++ programmers would invariably end with something closer to monotone than git.


IMHO the big problem is new/delete, which many C++ programmers treat as close to 0-cost - in the kernel you have to worry about interrupts, SMP, real-time issues and priority inversions caused by having your code hang in ISRs trying to alloc memory - all that nice STL code, string code, etc. is full of new/deletes, so it can't live in the kernel.

C programmers: you're not off the hook - malloc/free are just as bad, which is why the kernel has its own context-appropriate memory allocators.

And of course don't forget that malloc/free (and I assume new/delete) are not safe to be used inside user space signal handlers


Thu, 6 Sep 2007 18:50:28 +0100 (BST)

C++ has changed quite a bit in the last 10 years.


Yep, adding more and more bs so it's impossible to fully understand it ever. Understanding C thoroughly is already hard, but C++ is a whole new level. "But then only use the new and modern parts of it" No! You cannot simply ignore all the complicated cruft just because you don't like it because sooner or later it will come and bite you in the ass.


Why is that?


Simply because Rust or another safe language does not prevent all bugs is not a reason to dis it.


Curious if anyone knows how the kernel is tested?


Every subsystem is tested differently. Because of the wide variety of subsystems and workflows in the kernel, it's pretty much impossible to have a single test suite you can run and call good. For example, some subsystems have unit tests that run in the kernel at boot time if you compiled the kernel with a specific config option (e.g., Btrfs [1] and ftrace [2]). Additionally, some subsystems have integration tests where we set stuff up and run from userspace. Filesystems use xfstests [3] and the block layer has something similar [4]. For the most part, developers and maintainers run these before sending patches or pull requests. Additionally, Intel operates an automated system [5] which runs several test suites and reports the results to developers.

1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

2: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

3: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/

4: https://github.com/osandov/blktests

5: https://01.org/lkp/documentation/0-day-test-service


As best as they can, and by a lot of people running release candidates on test systems. Unfortunately there are basically an infinite (not really, just incredibly impossibly large) number of configurations of software and hardware that can lead to a bug like this not triggering so it's not always possible to catch this ahead of time.


How often does a bug of this severity slip through the net?


The only one I can find is from 2012: a RAID bug that damaged metadata required for RAID access, under specific/rare conditions.

http://www.h-online.com/open/news/item/Bug-in-Linux-kernel-c...

But if anyone is running critical data, they shouldn't be using a bleeding-edge kernel in the first damn place. It certainly "sucks" for the users that found it, but again, that's the risk you play.

And, this bug seems to have been found... a mere week after articles even reported the release of Kernel 4.14:

https://bugs.gentoo.org/638206

http://www.omgubuntu.co.uk/2017/11/linux-kernel-4-14-lts-fea...

Which is a pretty amazing turn-around time for "crowd-sourced" discovery of a bug. And according to the bug tracker, it looks like they patched it the same day.

So as bad as this is (and I feel for those who lost data, I really do), the solution is simply to not run bleeding-edge kernels on release day. The very fact they have a minor version number 4.14.1 (.1) implies things get released that need fixing, and if you have critical data you should be more patient.


> bleeding-edge kernel

> risk you play

uhm you realize that the kernel was already released upstream right?


Yeah, and which distros are shipping it?

Using the newest upstream kernel is something you opt into, on non-production hardware. This isn't something that would impact a regular user.


How many people would be using bcache in production, full stop? I'm guessing not many.


It's been the official position for quite a while that kernel.org releases - even "stable" ones - are to be considered bleeding-edge, and for it to be the responsibility of distros to do the necessary cooking on behalf of users to avoid unleashing something raw on them. Long gone are the days of regular users being trusted (and being able to trust) to just grab the latest linux-2.0.36 tarball from there and upgrade by compiling themselves.


This fix doesn't seem to have made it into 4.14.1 unfortunately, and the Gentoo bug report came almost a week after this issue was identified and a patch posted to the bcache mailing list.


If you have really critical data, I expect you to run the latest possible kernel on part of your cluster. That allows you to see potential regressions and also to test your backup/recovery solution in case something that bad happens.


There was a recent HN post that delved into booting and testing on many machines and fuzzing injections, but I can't find it at the moment.


…I mean, yes? This question is well documented and simple Google searches will turn up a large number of high-quality results, such as https://stackoverflow.com/questions/3177338/how-is-the-linux....


Don't be boring.

The nice thing about asking on hacker news is you get answers from experts, and you can often get insider knowledge that really isn't available anywhere on google.

But thanks for the link, it's a good one ;)


Please don't accuse me of being a boring person. I'm not sure why you think that is acceptable behavior.

The nice thing about doing a minute of preliminary research when you have a question is that the questions you ask afterwards will be more interesting.


Reading the bug it appears it may affect more filesystems than merely bcache. Nasty.


pretty impressive response time though.


How come none of the static analyzers that seem to be run on the Linux source code caught this?


Because the field wasn't formally uninitialised - it was initialised to zero by bio_init(). It was instead a logic error, which is not as amenable to static analysis.


Slightly OT but I'm happy to see breaking news from a gentoo.org domain. I long ago decided installing Gentoo was a young person's pursuit, but I'll always have fond feelings for it.


Same. Running Gentoo you learn a lot about how Linux works at a deep level. I'm glad I have the knowledge I gained running Gentoo, and it's saved my bacon a few times over the years.

But after about the third time that upgrading my system made it unbootable because I had missed some crucial step buried in the changelog or release notes, I also developed a deep appreciation of all the things modern distros handle for you. I rarely fear a dist-upgrade the way I did an emerge world.


You can always try Arch, it's like the more-adult version of Gentoo. Still a lot of work though.


Agreed but Arch isn't much work after the initial install. Well worth it for the rolling updates.

Arch mostly just works. The big issue I run into is that the upgrades are _almost_ flawless, which bites you once a year or so when you least expect it.


It becomes slightly more work if you depend on AUR packages (especially if you end up maintaining the packages you depend on!), but overall I think it's a great daily-driver system.


I might give it another try when I buy my next laptop. Last time I tried there was way too much text-file tweaking for my taste (years of OS X have made me soft.)


When I was a teenager I used Gentoo for a couple of years. I'm glad that I did, because I learned a lot. Of course it became tiresome after some time.

However, now I think of Gentoo as the more-adult one. It's really a meta-distribution. It was and is misunderstood by a lot of people. Teenage me included. Gentoo is deceivingly easy to set up - the documentation is excellent and the tooling mature.

I think that the best way to use Gentoo is to really set all the USE flags and all the unmasking to one's will, but not for a single computer. There should be a binhost - a binary package server available for a whole fleet of computers. You can also cross compile for a totally different architecture. Then the customization comes in handy.

Also, it's certainly easier now than when I did it. I used Gentoo on a desktop with a whopping 512MB of RAM. Now compile times are significantly lower.

To sum it up: use Gentoo to build a custom Linux distribution for a fleet of computers or for embedded devices. That's its true power.

Google chose Portage for ChromeOS building for a reason.


I started running Gentoo again in 2012 after I got fed up with MacOS Lion. I've since used it as my primary OS at three different jobs/work laptops and it's been the primary OS on all my personal systems. It's a great OS and it's really fun to use and work with.

My media PC has Void Linux on it, which is also very promising. I'd also suggest looking at Funtoo, which I used briefly on a laptop; it came with a lot of good, sane defaults.


I lost data to this one after building a 4.14 kernel.


Be positive. You learned an important lesson about using very recently released system software without having backups. You're a better engineer for it. Apply the lesson elsewhere and it's probably significantly more valuable than the lost data.

"Experience is what you get when you didn't get what you want."


If you operate on the bleeding edge, sometimes you're going to get cut.


I... WOW. This one is epic. I don’t even... wow.


On a Linux .0 release, fs corruption bugs are not that uncommon. I remember quite a few of these.

Best to only upgrade Linux when the next release is already out (e.g. 4.15 out to consider 4.14). Or just use something else.


Yeah, epic in how fast it got fixed!


Linus worried so much about the new security features that he missed this one?


Here is the commit that introduces the bug: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

Can you spot the error?

Here is the commit that fixes the bug: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


Why is this a single commit? I would expect renaming like that to be done in completely separate commits and then subsequent commits to do any fixups.

Is it just a merge commit? I imagine it's probably unfair to blame Linus for this (assuming it's a huge merge), but it would also weird for you to point to a merge commit as evidence of the opposite.

Color me confused.


Not sure why you are downvoted. I thought exactly the same thing after looking at the commit.

As a software developer, the habit of committing smaller changes has saved me a lot of time. Even more so, a clean distinction between refactorings (e.g. moving or renaming things) and functional changes (e.g. adding or fixing things) is vital to any large code base.

As the saying goes: "for each desired change, make the change easy (warning: this may be hard), then make the easy change" (Kent Beck)


I wonder what the creators of git think about your opinion on how to use it properly...

Jokes aside, what you see is usually just the squashed commit submitted for merge. With 6 million git objects, you think twice about having each change in a single commit.


That's absolutely wrong. Linux never squashes merge commits.

With my Linux maintainer hat on, I would have never accepted that commit.


> what you see it usually just the squashed commit submitted for merge

Then __david__'s argument "could you spot the error?" is moot, because the squashed commit (hopefully!) was not the subject of review, but the smaller commits.


> I wonder what the creators of git think about your opinion on how to use it properly...

Are you the creator of git? If not, how exactly are you qualified to judge "proper" usage -- by YOUR standard you're not qualified yourself, so...?

(I realized it was a joke, but I offer the above as a counterpoint to the satirical/ironic element of the joke. My opinion is that the originator doesn't get special treatment. But then the point wasn't git, it was what/how things get merged into the Linux kernel. The mechanism is entirely immaterial.)

Squashing commits for merge into Linux is not standard practice AFAIK... and happily I see a Linux committer has async'ly backed me up on this.

Generally the preference I see, where git is concerned (not sure about the kernel), is that everything is rebased right into the master branch to avoid merge graphs getting absurd.


I meant squashing before sending upstream; not all kernel development happens in public.

Still, I think squashing still happens in the public trees:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

EDIT: Actually, that's indeed a merge, but the diff is still shown with it, which confused me (because conflict-less merges are usually +0/-0).


OK, that's fair enough. Let's not be enemies :)


This isn't just a rename.

The structure dm_bio_details was changed to remove one field that points to a block_device structure, and add a field that points to a gendisk structure, and an integer called bi_partno.

This change copies the gendisk structure, but not bi_partno. That's the bug.


AFAIU the commits, there was a rename and then a removal. Am I wrong in that assessment?


The important distinction is how the critical information about the device offset is encoded. Before, it was specified by whether a partition or parent disk device was used; now it is always the actual disk plus a partition number. It's not a bad change, but IMO it was made recklessly.

From this patchset, there are now at least four different Fixes: tags referring to it in Linus's tree. The patchset itself was broken and intermediate states didn't compile, making bisection annoying. And there are many other callers of this function that are potentially getting bad results, along with other cousins like bio_clone_bioset that are still unfixed.

This was not a fun experience for a new maintainer. It turns out that if you had created a fresh bcache on 4.14, like most of my tests do-- you're probably fine. And not all environments will experience problems. It went in a couple of weeks before I became maintainer, and isn't in the bcache code itself (and wasn't reviewed by anyone who worked on bcache, and was never sent to the linux-bcache mailing list.) I've learned some new things to add to my test suite, and I guess I need to read all the block diff to make sure people don't break things out from under me.


> The patchset itself was broken and intermediate states didn't compile

As a maintainer, isn't it then your right to refuse the patchset and demand a clean one instead?


He's a maintainer of bcache, not of block layer, where the problematic patchset was introduced.


As pointed out elsewhere, I never accepted this patchset. But:

For stuff I accept, I try and make sure that intermediate states compile, but I can't promise I run a compile on every intermediate step or that my review eye is good enough to spot every wrongly-ordered dependent change. It would be nice if someone would throw hardware at building intermediate states of linux-next and send nastygrams to maintainers/committers about it: might make people more careful.


Can't you use GitHub's integrated TravisCI for that? At least that's what I know other Free Software projects are doing.


TravisCI generally tests the end result of a PR, not all intermediate states.

I have a continuous integration tool for bcache in place, but it's more geared toward end states-- compiling/checkpatch/test sequence.


i ran coccinelle on the original code before the patch with this rule to see if it would spot any other missing changes.

      linux git:(c2ee070fb003)  cat patch.coci
    @@
    expression lhs;
    expression rhs;
    @@

    - lhs->bi_bdev = rhs;
    + bio_set_dev(bio, rhs);
so it spotted the original missing bi_partno = in the same file.

        diff -u -p a/block/bio.c b/block/bio.c
    --- a/block/bio.c
    +++ b/block/bio.c
    @@ -596,7 +596,7 @@ void __bio_clone_fast(struct bio *bio, s
       * most users will be overriding ->bi_bdev with a new target,
       * so we don't set nor calculate new physical/hw segment counts here
       */
    - bio->bi_bdev = bio_src->bi_bdev;
    + bio_set_dev(bio, bio_src->bi_bdev);
      bio_set_flag(bio, BIO_CLONED);
      bio->bi_opf = bio_src->bi_opf;
      bio->bi_write_hint = bio_src->bi_write_hint;
    @@ -681,7 +681,7 @@ struct bio *bio_clone_bioset(struct bio
      bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
      if (!bio)
        return NULL;
    - bio->bi_bdev    = bio_src->bi_bdev;
    + bio_set_dev(bio, bio_src->bi_bdev);
      bio->bi_opf   = bio_src->bi_opf;
      bio->bi_write_hint  = bio_src->bi_write_hint;
      bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
but it also spotted a missing assignment in the same file in the bio_clone_bioset function. i don't know enough about this functionality to know if the assignment is meant to be here as well. also, considering it is in the same file i would have thought if it was necessary it would have been noticed in the last patch. just thought i would let someone else who would have a better idea about how this works know.

i also haven't checked the rest of the coccinelle output. i know it has generated a lot of _wrong_ code. like the actual patch looks like it has done something more intelligent than what coccinelle blindly did.


Read. :P

To quote what you replied to: "And there's many other callers of this function that are potentially getting bad results, along with other cousins like bio_clone_bioset that are still unfixed."

bcache doesn't call bio_clone_bioset. I put in the minimum change to fix the bcache regression, for quick review and acceptance. You're free to submit this to linux-block if you want.


thanks :) i actually didn't read the post at all. heh


Possibly so that the code corresponding to each commit builds/works, which makes bisect much easier.


The "GitHub model" is very different from the original kernel-dev model of using git.

GitHub encourages you to make a bunch of smaller commits and then roll them all up into one pull request -- so the unit of change is the PR, and not so much the single commit.

In kernel-dev land, the unit of change is the commit. Sometimes larger changes will be broken up into a series of smaller commits, but in general you tend to see single-commit changes (which often come from a series of commits someone has done and then squashed down to one).


No, the kernel definitely encourages submitting patch series instead of large changes. That's why git has the "git am" (apply mbox) command.

Patches that are submitted by external contributors are never squashed by maintainers (except to occasionally fix a bug that was discovered after the submission).


You're wrong. Sorry. Read up on it -- or, just, look at the kernel mailing list.


A part of this is the "blessed maintainer" issue, where large patchsets like this that a mere mortal submits would end up going through dozens of review/testing cycles before even being considered by the maintainers of most subsystems. OTOH, the maintainer of said subsystem can write these huge patches, get one of his buddies to ack it, and it's in the tree without much comment/testing.

As this bug points out, being a maintainer doesn't suddenly make all your code bug free..

EDIT: OTOH, most of the Linus-released kernels go through months of bug fixing before they show up in any of the stable distributions. It's the "rolling release" distros which get hit by these bugs, because they have this strange idea that once Linus (or whatever package) releases something it's done cooking.


You are just proving the point that open source won't protect against security bugs, because most software engineers cannot spot a security bug in code?


I'm not sure if anyone has ever claimed that open source is a guaranteed protection against every security bug.

Besides, this wasn't a security bug, was it? It corrupted the filesystem, it didn't give anyone else any undesired access did it?


Given the response time, I would say that this is an amazing example of the protection that open source provides.


What security bug? It was a corruption bug.


And you are "proving the point" that humans are fallible and make mistakes and "open source" never meant, and can't mean, that the code is magically of better quality?


Surely you don't believe that Linus personally reviews and extensively tests every single patch that gets merged?



