Bcachefs Pull Request Submitted for Linux 6.7 (phoronix.com)
103 points by treesciencebot 11 months ago | 39 comments



The author lost corporate funding recently and is asking for donations. I just started giving 20/mo. I don't care about kernel drama; I just want a good COW filesystem for Linux and this guy is doing a thankless service.

https://www.patreon.com/join/bcachefs

You can do a smaller pledge via the custom pledge button if you want.


> The author lost corporate funding recently and is asking for donations.

Just to be clear: This is apparently unrelated to the (not that unusual) not-yet-accepted PRs and their lkml review discussions.

From Overstreet's Patreon post:

"The company that's been funding bcachefs for the past 6 years has, unfortunately, been hit by a business downturn - they've been affected by the strikes in the media production industry. As such, I'm now having to look for new funding.

This came pretty unexpectedly, and puts us in a bit of a bind; with the addition of Hunter (who's already doing great work), I'm stretched thin, without much of a buffer."

(https://www.patreon.com/posts/your-irregular-89670830)

I was concerned that without this context some might assume this is a "sponsors pulling out" type situation.


Also note that this isn't Kent's first rodeo. bcache, from the same author, is an SSD-optimized block caching layer already in production in the Linux kernel. The idea is that bcache lets you build hybrid drives where the kernel has visibility into, and independent control over, both the SSD(s) and the spinning-rust drive(s).
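
For the curious, a typical two-device bcache setup looks roughly like this; the device names are hypothetical and this is only a sketch of the flow described in the kernel's bcache documentation:

    # register the backing (spinning rust) device and the cache (SSD) device
    make-bcache -B /dev/sdb
    make-bcache -C /dev/nvme0n1

    # look up the cache set UUID, then attach the backing device to it
    bcache-super-show /dev/nvme0n1 | grep cset.uuid
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # the hybrid volume then appears as /dev/bcache0
    mkfs.ext4 /dev/bcache0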

My understanding is that at some point during development of bcache, Kent realized "hmm... if I just got rid of cache eviction and cleaned this up a bit, it would make a nice light-weight CoW SSD-optimized filesystem".

My understanding is that there are very few untried ideas in bcachefs, and it's a relatively low-risk project. Having been burned by over-enthusiastic early adoption of btrfs, I'm hoping bcachefs becomes a conservative alternative for distros that don't currently use a CoW filesystem as the default filesystem. (Okay, I did use rsync to weekly mirror to ext4, so maybe "burned" is too strong a word, but I did wedge btrfs a few times by over-filling it.)



Seeing some of the mailing list threads about this gives me anxiety. If you thought your PR review process was painful, try reading through some of the nitpicks, useless feedback, and tangents Kent has fought through. You can find it on lore; I'm not going to link directly to anything specific.

As an outsider, there seem to be too many egos to placate when it comes to contributing to Linux.


This is because code review without some sort of organizational structure becomes a form of institutionalized mental abuse.


I don't think that explains it; there is a structure (various levels of maintainers), and maintainers aren't outright malicious.

Theoretically there is also a CoC.

The problem's more that you're going to have strong conflicts of opinion that you can't hash out face to face, between people who are already overworked and under stress.

It's hard for devs, but you also have people quitting because the maintainer position is not just thankless, but fucking thankless.

It is not as simple as reviewers abusing developers because of a lack of structure. A large part of the kernel is under-resourced and lacking maintainers, and that is stressful for everyone.


The same story came from the Asahi Linux developers. I saw on Mastodon that they said they will not be upstreaming all their changes for that reason.


What won't be upstreamed?


AFAIR this was a post by Hector Martin, but he didn't mention anything specific. He was mostly venting his frustrations with the review process. I tried to find any of those messages, but Mastodon search looks broken, or I have no idea how to properly search a specific user's posts.


He seems to complain about everything and everyone, so I find it hard to take any of these complaints seriously without direct and concrete examples.



Not fixing everything in Linux is not the same as not upstreaming. Martin seems to want to upstream the code for the drivers but not touch anything else. Which is totally fair!


I don't think that's fair. He does touch other things to improve interfaces sometimes. I see this not as "not touch anything else", but rather "not touch everything else".


Yes, and the biggest ego is the one who started it all.


My dream linux filesystem is still a lean Windows-compatible filesystem.

I have had the Linux kernel NTFS driver corrupt data more than once, so I use it in read-only mode.

BTRFS is slow and unofficial on Windows, and it's not that fast in Linux to begin with.

F2FS would be perfect, and Samsung claimed they were building a Windows driver, but it never materialized.


> I have had the Linux kernel NTFS driver corrupt data more than once, so I use it in read-only mode.

Back when I needed something like this (over a decade ago?), the Linux NTFS driver had many warnings to only use it read-only. For kernel-tied FSs, this is probably a good thought.

But isn’t there a Windows driver for ext4 (or ext2 at least, which is compatible with ext4)?

Honestly, I'm not sure I would try to write-mount a filesystem from a different OS. Instead, I'd probably try to run the foreign OS in a VM (attached to the drive/image) and then export it over NFS/CIFS. This should work in both directions and lets each OS deal with its own FS.
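
A minimal sketch of that VM route, assuming QEMU/KVM, the NTFS disk at /dev/sdb, and a guest that is reachable on the network (e.g. via a bridged NIC) and exports an SMB share; all names and addresses here are placeholders:

    # hand the raw disk to the Windows guest (don't mount it on the host at the same time)
    qemu-system-x86_64 -enable-kvm -m 4G \
        -drive file=/dev/sdb,format=raw

    # then mount whatever the guest shares back out over SMB/CIFS
    mount -t cifs //192.168.122.10/data /mnt/winshare -o username=me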


There's been a free Windows ext2 driver since the Win2k days[1]. It's not remotely perfect, and doesn't look like it's much maintained any more, but I've used it many times. There are numerous other offerings, both free-ish[2] and commercial[3].

Of course, you can mount ext[234] on Windows using WSL these days, so it's kind of a moot point.[4]

[1] https://sourceforge.net/projects/ext2fsd/ [2] e.g. https://www.diskgenius.com/ [3] e.g. https://www.paragon-drivers.com/en/lfswin/ [4] https://devblogs.microsoft.com/commandline/access-linux-file...


My dream Linux filesystem is a distributable ZFS with automatic tiering and advance notification of cache invalidation.

I guess we have different dreams.


Well yeah, obviously different people have different dreams depending on what they value. Mine is ZFS but with the licensing situation fixed so it can be mainlined into Linux, or BTRFS but trustworthy and with better cross-platform support.

Though I did also once sketch out a distributed filesystem that would combine the features of ZFS (UI, stability), BTRFS (heterogeneous pools), minio (distributed, erasure coding), and ceph (distributed, FUSE+NFS+native kernel driver) to give me one ultimate filesystem that could scale from 1 disk in 1 machine to a single filesystem spanning dozens of machines and hundreds of drives. (Mostly, I noticed that at a high level, a lot of filesystems and sync tools boil down to "slice up your files into blocks, append a checksum, and then save/mirror/stream those blocks wherever you need to" - if you squint, BTRFS and bittorrent are really similar.)


BTRFS is not as bad as it was. People often forget all the edge conditions ZFS also has. (I got caught by many of them in my day.)

That said, BTRFS is also not the rampant layering violation that ZFS is, for the good and the bad of that statement in so many ways. The way ARC and L2ARC are handled... is just crazy from a modern perspective. They may have cleaned it up since I last looked, but I tend to doubt it. It was pretty baked in.

BTRFS / ZFS really share 0 with bittorrent other than checksumming for integrity. That's it. Otherwise they are so different they might as well come from different planets.

ceph has a ways to go. It'll get there, but distributed filesystems are hard.

As a wise man said to me: Normal file systems take ~10-15 years to mature. Distributed ones take even longer. So ceph... may be getting usable, but I'm not sure I'd bet the farm on cephfs. (Their S3 stuff maybe. But that's much simpler.)

minio is a great example of a product not trying to be everything to everyone... and succeeding. Kudos to them and their team.


Simple implementations are always better in general really, because you find yourself well within any reasonable test case. Sure you can do fantastically exotic things with all of the above, but, at that point you're on your own.

I've always much preferred to keep things as "down the middle" as I can for all of those reasons. Never had a CEPH cluster fail, haven't had any issues with ZFS (recently), nor any issues with BTRFS (recently), and that's not magic, it's all just because all of that is by the book or not done at all. It's a constraint at times, to be sure, but, hard to scoff at the results.

Also you're correct in assuming ARC/L2ARC is still.... weird. Sometimes you get wirespeed cache hits, sometimes you just don't, with no rhyme or reason outside of "well it just decided that file didn't belong in cache".


It isn't that there's no rhyme or reason; the issue has to do with layering caches over each other.

If you layer caches over each other and they don't pass hit rates to each other, bad things happen, especially for designs like ARC. The ZFS caches explicitly take hit rate into account... but with a page cache over them... they don't know the true hit rates. Heck, they are a secondary cache ;). Then L2ARC has the same issue one layer down. And there's the issue.

It isn't magical. It is just annoying.
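
It doesn't fix the hit-rate visibility problem, but ZFS does expose per-dataset knobs for what ARC and L2ARC are allowed to hold, which is the usual workaround when you know another layer is already caching the data. A sketch with a made-up pool/dataset name:

    # see what ARC and L2ARC may cache for this dataset
    zfs get primarycache,secondarycache tank/data

    # e.g. keep only metadata in ARC and let L2ARC cache everything
    zfs set primarycache=metadata tank/data
    zfs set secondarycache=all tank/data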

Honestly, on Linux, btrfs is probably better for your sanity than ZFS.

As far as ceph goes, if you understand it and can do the systems engineering... yeah, it is great stuff. My only complaint with ceph is that it's just too complex for most mortals. (Despite that, once you get to any real large scale... you can't avoid that complexity.)

With ceph you can do some crazy shit if you know what you are doing. Want to do a seamless cluster migration? We got ya. ;) Not as good as AFS probably. But it is pretty scary.


What you say makes lots of sense. I can see how hit rates not being passed down would cause issues.

But anecdotally I think I'm in the same spot many others are. Btrfs let me down a couple times (once lots of truncated files and another time I ran out of inode space on a mostly empty disk iirc) and I'm in no hurry to give it another chance.

zfs has just happily worked for me. I ran it as a root filesystem for years. All the snapshot/send/receive stuff just worked and stayed out of my way.

Note that this was for personal use and with SSDs and later nvme so I wasn't comparing it head to head with anything. Which might be why I didn't notice issues like page caching behavior.
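
For anyone who hasn't tried it, the snapshot/send/receive workflow mentioned above is short enough to show in full; the pool, dataset, and host names below are made up:

    # take a snapshot, then replicate it incrementally to another machine
    zfs snapshot rpool/home@2023-11-01
    zfs send -i rpool/home@2023-10-01 rpool/home@2023-11-01 | \
        ssh backuphost zfs receive -u backup/home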


ZFS isn't what lets you down in the ZFS stack. It is memory fragmentation due to ARC, and the ARC not responding to memory pressure correctly.

I'm sure there are parts of btrfs that still suck. Don't get me wrong. And you can do the "running out of space so you can't remove files" trick with ZFS too ;).

Personally, I only put btrfs into use on my laptop about 1.5 years ago. So... Take that for what you will. I'm pretty conservative.


True, ARC isn't magic; it's more that it makes poor decisions on occasion, with no real way to work around them reliably, aka no guaranteed way to ensure a cache hit.

I would posit, however, that if you need to guarantee cache hits... you don't really need a cache; you need to roll your own flash storage and find a way to intelligently manage what's present on it at a given time. I've done this, namely for an environment that had very, very clear, scheduled needs for hot data, warm data, and archive data. It's fairly trivial to take files and just scoot them to a different storage array when they're X days old, on a nightly basis.

That system is completely agnostic to, well, everything; I've done it on Windows hosts and, while it's more of a pain in the ass, it also worked fine there. That was back in the hardware-RAID-or-nothing days, when having a terabyte of flash available was fairly exotic and expensive, but, as a model, programmatically structured storage isn't old by any means and probably fits many environments better than ZFS and "as much memory as accounting would approve".
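
A nightly job for that kind of age-based tiering can be as small as the following; the paths are hypothetical, and rsync preserves the directory layout while removing the originals it has copied:

    # move files not modified in 30+ days from the hot tier to the warm tier
    cd /srv/hot && find . -type f -mtime +30 -print0 \
        | rsync -a --from0 --files-from=- --remove-source-files ./ /srv/warm/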

Also, I don't really 100% agree that ceph is too complex for mere mortals; it's more that it very, very quickly becomes too complex for your average Linux sysadmin to handle if you're not careful.

I really should dip my toes deeper into BTRFS for production usage though. I've played with it in the lab and I run it at home almost exclusively, but I've never done any big-boy-britches production stuff with it, and it very well might be a good tool to have in my pocket.


> The issue has to do with layering caches over each other.

I think this is mostly incorrect. In my understanding, the Linux page cache is mostly skipped. The problem, such as there is one, is occasionally duplicated data. My understanding is that this only affects mmap'd files and perhaps the dentry and inode caches?


On the contrary, I have been using the ntfs-3g driver in rw mode for years without issues.


That's the userspace driver for NTFS rather than the kernel one, right? I'm a bit of a novice when it comes to filesystems, but my understanding is that the tradeoff is lower performance.


The kernel received a new NTFS driver relatively recently; it was kind of a big deal (in terms of drama) when it happened:

https://www.theregister.com/2021/08/02/paragon_ntfs_linux_ke...

I don't recall what relation, if any, the Paragon NTFS support has to ntfs-3g. I do recall that when ntfs-3g was the new kid on the block, it was the only alternative for read-write NTFS support vs. paying Paragon Software for their (at the time) proprietary kernel driver.


Yeah, that fits with what I remember. I read the parent comment's complaint about unreliability as being about the kernel driver rather than the userspace one, so I wanted to verify whether the response about ntfs-3g was even talking about the same thing.


Linux and Windows have very different needs when it comes to filesystems.


I haven't had any glaring issues with the current kernel NTFS drivers. They used to be a big problem back in the day, but, as far as I know or can tell, they're production ready at this point.

Or, well, as production ready as NTFS is...


Nowadays, if you have WSL2-capable Windows, you can use whatever the bundled Linux kernel supports: mount in a Linux shell, access through Windows with \\wsl$\.
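
Concretely, the documented flow is to attach a whole physical disk with wsl --mount and pick which partition to mount (disk and partition numbers below are just an example); the mounted partition should then show up under /mnt/wsl/ inside the distro and be reachable from Explorer through the \\wsl$\ path:

    wsl --mount \\.\PHYSICALDRIVE2 --partition 1 --type ext4
    wsl --unmount \\.\PHYSICALDRIVE2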


Last I checked, it only works for whole disks. You can't mount a Linux partition on a disk that also holds your Windows install.


Hmm, my mistake. I thought Bcachefs had copy-up-on-read. It looks like only catfs and aufs do that.


Of course bcachefs can cache data; that was the main draw of bcache as well.

Quoting an example from https://bcachefs.org/GettingStarted/

    bcachefs format /dev/sd[ab]1 \
        --foreground_target /dev/sda1 \
        --promote_target /dev/sda1 \
        --background_target /dev/sdb1
    mount -t bcachefs /dev/sda1:/dev/sdb1 /mnt
This will configure the filesystem so that writes will be buffered to /dev/sda1 before being written back to /dev/sdb1 in the background, and that hot data will be promoted to /dev/sda1 for faster access.

    bcachefs format --help

    --promote_target=(target)
        Device or label to promote data to on read


That is very cool. I under-specified when I last spoke: I was hoping for something that would cache arbitrary files. I have a slow FUSE fs I would like cached.

But perhaps I can use bcache instead of cachefilesd to get a more controllable NFS cache.

Bcache seems to use block devices and I don't think it can stand in front of FUSE.
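
For comparison, the cachefilesd route mentioned above is just a small config plus the fsc mount option; the values below are the stock defaults, not tuning advice, and the server/export names are placeholders:

    # /etc/cachefilesd.conf
    dir /var/cache/fscache
    tag mycache
    brun 10%
    bcull 7%
    bstop 3%

    # then mount NFS with fscache enabled
    mount -t nfs -o fsc nfsserver:/export /mnt/nfs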


I used to use s3ql[0] on top of "slow" FUSE storage. It comes with its own caching layer and some strategies for handling larger blobs of data efficiently even at high latency, but it's a non-shared filesystem: you mount/lock it only once at a time. Otherwise it was a perfect solution for me at the time.

There is also rclone[1], with its own caching layer and its own direct support for various backends. I don't remember anymore why I preferred s3ql, but I usually have some reasoning behind choices like this..

[0]: https://github.com/s3ql/s3ql [1]: https://rclone.org/
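
If anyone reaches for rclone for this today, its VFS layer is what does the caching; a hedged example where the remote name and limits are placeholders:

    rclone mount remote:bucket /mnt/remote \
        --vfs-cache-mode full \
        --vfs-cache-max-size 20G \
        --vfs-cache-max-age 72h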



