> It is in FreeBSD main branch now, but disabled by default just to be safe till after 14.0 released, where it will be included. Can be enabled with loader tunable there.
> more code is needed on the ZFS side for Linux integration. A few people are looking at it AFAIK.
- Linux container support (So overlayfs is ZFS aware)
- BLAKE3 checksums (So faster checksums)
- Zstd early abort (I thought Zstd already had early abort! This means you can enable Zstd compression by default, and if the data isn't compressible, it aborts the operation and stores it uncompressed, avoiding the extra CPU use.)
- Fully adaptive ARC eviction (Separate adaptive ARC for data and for metadata. Until now metadata ARC was almost FIFO and not efficient)
- Prefetch improvements (That's always nice)
> This means you can enable Zstd compression by default, and if the data isn't compressible, it aborts the operation and stores it uncompressed, avoiding the extra CPU use
This was already implemented for all compression algorithms since ZFS was released as open source in 2005, at least (and even though zstd didn't yet exist then, it still benefited from this feature when it was integrated into ZFS).
The feature you describe only saves CPU time on decompression, but still requires ZFS to try to compress all data before storing it on disk.
This new early abort feature actually seems to be even more clever/sophisticated, because it also saves a lot of CPU time when compressing, not just decompressing.
Usually you are forced to select a fast compression level so that your performance doesn't degrade too much for little benefit, because if you had a higher compression level, ZFS would always be paying a high CPU cost for trying to compress data which often is not even compressible (since a lot of data on a typical filesystem belongs to large files, which are often not compressible).
However, with early abort, you can select a higher compression level (zstd-3 or higher) without paying a penalty for spending a lot of time trying to compress uncompressible data.
The way it seems to work is that ZFS first tries to compress your data with LZ4, which is an extremely fast compression algorithm. If LZ4 succeeds in compressing the data, then ZFS discards the LZ4-compressed data and recompresses it with your chosen higher compression level, knowing that it will be worth it.
If LZ4 can't compress the data, then ZFS doesn't even try to compress with the slower algorithm, it just stores the data uncompressed.
This almost completely avoids paying the CPU penalty for trying to compress data that cannot be compressed, and only results in about 9% less compression savings than if you had simply chosen the higher compression level without early abort.
The actual algorithm is even more clever: in-between the LZ4 heuristic and a zstd-3 or higher compression level, there is an intermediate zstd-1 heuristic that is also done to see whether the data is compressible.
This intermediate heuristic seems to almost completely eliminate the 9% compression savings cost while still being much faster than trying to compress all data with zstd-3 or higher, since zstd-1 is also very fast.
In the end, you get almost all the benefits of a higher compression level without paying almost any cost for wasting CPU cycles trying to compress uncompressible data!
edit: I can't tell if you edited this since I read it, or I just skipped the last few paragraphs outright, my apologies. I'll leave this here for posterity anyway.
To add something amusing but constructive - it turns out that one pass of Brotli is even better than using LZ4+zstd-1 for predicting this, but integrating a compression algorithm _just_ as an early pass filter seemed like overkill. Heh.
It's actually even funnier than that.
It tries LZ4, and if that fails, then it tries zstd-1, and if that fails, then it doesn't try your higher level. If either of the first two succeed, it just jumps to the thing you actually requested.
Because it turns out, just using LZ4 is insanely faster than using zstd-1 is insanely faster than any of the higher levels, and you burn a _lot_ of CPU time on incompressible data if you're wrong; using zstd-1 as a second pass helps you avoid the false positive rate from LZ4 having different compression characteristics sometimes.
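For anyone who wants to see the go/no-go flow spelled out, here is a minimal Python sketch of it. It is not the OpenZFS code: the standard library has neither LZ4 nor zstd, so zlib levels 1, 3 and 9 stand in for LZ4, zstd-1 and the requested level, and `compress_record`, `saves_enough` and the constants are names made up for illustration. Treating a probe pass as "succeeding" when it saves at least 12.5% is also an assumption; that figure is borrowed from the compression check mentioned elsewhere in this thread.

```python
import os
import zlib

# Stand-ins (assumption): zlib levels play the roles of the real algorithms.
CHEAP_LEVEL = 1      # stands in for the LZ4 first pass
MEDIUM_LEVEL = 3     # stands in for the zstd-1 second pass
REQUESTED_LEVEL = 9  # stands in for the higher level the user asked for

MIN_SAVINGS = 0.125  # 12.5% savings threshold (assumed to apply to probes too)


def saves_enough(record: bytes, compressed: bytes) -> bool:
    """True if the compressed form is at least MIN_SAVINGS smaller."""
    return len(compressed) <= len(record) * (1 - MIN_SAVINGS)


def compress_record(record: bytes) -> tuple[str, bytes]:
    """Return ("compressed", data) or ("raw", record) for one record."""
    probes = (CHEAP_LEVEL, MEDIUM_LEVEL)
    # any() short-circuits: the second probe only runs if the first one fails.
    if not any(saves_enough(record, zlib.compress(record, lvl)) for lvl in probes):
        return "raw", record  # both cheap passes failed: skip the slow level
    out = zlib.compress(record, REQUESTED_LEVEL)
    # The final result still has to pass the savings check to be stored compressed.
    return ("compressed", out) if saves_enough(record, out) else ("raw", record)


if __name__ == "__main__":
    for name, rec in (("random", os.urandom(128 * 1024)),
                      ("repetitive", b"abc" * 40000)):
        kind, data = compress_record(rec)
        print(name, kind, len(data))
```

The point of the structure is that incompressible records only ever pay for the two cheap probes, while compressible ones pay for one or two cheap probes plus the level that was actually requested.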
> I can't tell if you edited this since I read it, or I just skipped the last few paragraphs outright, my apologies.
I made some minor edits within 15-30 mins of posting the comment, but the LZ4+zstd-1 heuristic was there from the beginning (I think) :)
Since you wrote this feature, I've been wondering about something and I was hoping you could give some insights:
Is there a good reason why this early abort feature wasn't implemented in zstd itself rather than ZFS?
It seems like this feature could be useful even for non-filesystem use cases, right? It could make compression at higher levels substantially faster in some cases (e.g. if a significant part of your data is not compressible).
Having this feature also available as an option in the zstd CLI tool could be quite useful, I think. I was thinking it could be implemented in the core zstd code and then both the CLI tool and ZFS could just enable/disable it as needed?
Thank you for your great work!
Edit: I just now realized, based on your Brotli comment, that since these early passes can be done (in theory) with arbitrary algorithms, perhaps it's better implemented as an additional layer on top of them rather than implementing it in each and every compression algorithm.
Briefly, because I initially looked at ripping out LZ4's, but it just skips chunks in the middle of the compression algorithm, so it didn't come out easily.
So I just tried using LZ4 as a first pass, and that worked well enough, with a high false positive rate, so zstd-1 was duct-taped onto it.
Trying to just glue an entropy calculator on is something I want to try, but I haven't, because this worked "well enough".
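As a rough idea of what an "entropy calculator" pre-check could look like, here is a small Python sketch that computes the Shannon entropy of a record's byte histogram. This is purely illustrative and not something from the ZFS codebase; `byte_entropy`, `looks_incompressible` and the 7.5 bits/byte cutoff are all hypothetical.

```python
import math
import os
from collections import Counter


def byte_entropy(record: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not record:
        return 0.0
    total = len(record)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(record).values())


def looks_incompressible(record: bytes, threshold: float = 7.5) -> bool:
    # threshold is a made-up cutoff for illustration, not a tuned value
    return byte_entropy(record) > threshold


if __name__ == "__main__":
    print(looks_incompressible(os.urandom(128 * 1024)))  # random bytes
    print(looks_incompressible(b"\x00" * 128 * 1024))     # all zeros
```

One known weakness of this kind of estimate: it only looks at byte frequencies, so data with a flat histogram but lots of repeated structure (say, the same 256-byte pattern over and over) looks "incompressible" even though a real compressor would do very well on it.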
As far as why not do it in just zstd or LZ4 proper...well, this works very well for extremely quantized chunks, like ZFS records, but for things with the streaming mode, it becomes more exciting to generalize.
Plus it's much nicer for the ZFS codebase to keep it abstracted.
Also, as far as using it for, say, gzip...well, I should really open that PR from that branch I have, shouldn't I? ;)
Ah, you were the one who gave the "Refining ZFS compression" talk, weren't you? Thanks for all your work! Works great and it's pretty interesting.
I bought a super cheap QAT card with the idea of doing hardware accelerated gzip compression on ZFS (just for playing with it), but seems like the drivers are kind of a mess, and not being a developer who could debug it, maybe I should stay with the easy and fast zstd with early abort :P :) Fewer headaches.
I thought Intel stopped selling QAT cards with their newer generations of it?
Personally I find that gzip is more or less strictly worse than zstd for any use case I can imagine, though if you can hardware offload it, then...it's free but still worse at its job. :)
Though zstd can apparently be hardware offloaded if you have like, Sapphire Rapids-era QAT and the right wiring...sometimes. (I believe the demo code says it doesn't work on the streaming mode and only works on levels 1-12.) [1]
Seems like they stopped selling dedicated QAT cards, but their new E2000 DPUs (kind of a smart NIC I think your employer uses for accelerating cloud loads) have QAT instructions. But I'd say that's because those DPUs have a couple Xeons inside :)
I think QAT API used to be a mess in past generations, and IIRC it's one of the factors drivers for older hardware weren't included in TrueNAS Scale. I think it was this ticket [1] where they said that it was complicated to add support for previous hardware, but can't confirm because JIRA's been acting funny since yesterday.
Feels like QAT has stabilized since this Gen4 hardware and interesting things are appearing again. Not only the ZSTD plugin you linked, but I just noticed QATlib added support for lz4/lz4s this past July [2]! That's interesting.
They are also publishing guides for accelerating SSL operations in SSL terminators / load balancers like HAProxy [3], so between this and cloud providers using DPUs to accelerate their load, doesn't feel like they're shutting this down soon.
I'd love if they make this a bit more accessible to non-developers.
Not random, right? ZFS compression doesn't (at present; there have been unmerged proposals to try otherwise) care about the other records for calculating whether to try compressing something, and it would be a much more invasive change to actually change what it compresses with rather than a go/no-go decision.
As far as wasteful - not really? It might be possible to be more efficient (as I kept saying in my talk about it, I really swear this shouldn't work in some cases), but LZ4 is so much cheaper than zstd-1 is so much cheaper than the higher levels of zstd, that I tried making a worst-case dataset out of records that experimentally passed both "passes" but failed the 12.5% compression check later, and still, the amount of time spent on compressing was within noise levels compared to without the first two passes, because they're just that much cheaper.
I tried this on Raspberry Pis, I tried this on the single core single thread SPARC I keep under my couch, I tried this on a lot of things, and I couldn't find one where this made the performance worse by more than within the error bars across runs.
(And you can't use zstd-fast for this, it's a much worse second pass or first pass.)
This is so much better than automatically doing dedup and the RAM overhead that entails.
Doing offline/RAM+in memory dedup size optimizations seems like a really good optimization path. In the spirit of also paying only for what you use and not the rest.
Edit: What's the RAM overhead of this? Is it ~64B per 128kB deduped block or what's the magnitude of things?
> Edit: What's the RAM overhead of this? Is it ~64B per 128kB deduped block or what's the magnitude of things?
No real memory impact. There's a regions table that uses 128k of memory per terabyte of total storage (and may be a bit more in the future). So for your 10 petabyte pool using deduping, you'd better have an extra gigabyte of RAM.
But erasing files can potentially be twice as expensive in IOPS, even if not deduped. They try to prevent this.
If I manually deduplicate/copy a 1GB file, how many such offset/refcount tuples would be created in RAM? Just one for the whole file, or one per underlying 128kB block?
n.b. I am not pjd, this is from memory and may be wrong.
The answer to that is messy, but basically, there's a table that should be kept in memory for "is there anything reflinked at all in this logical range on disk", and that covers large spans, so how many entries would depend on how contiguous your data logically was on disk; the actual precise mapping list per-vdev doesn't need to be kept continuously in memory, just the more coarse table, so that saves you a fair bit on memory requirements.
They don't go in ram. You have one 64 bit refcount number for every 64MB of storage irrespective of how much is deduplicated; this is a reduction factor of 8M, or 128k per terabyte stored.
This largely exists so you can avoid doing extra work on free.
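The numbers quoted above can be checked with a few lines of Python. This is just the arithmetic from the comments (one 64-bit refcount per 64MB region), assuming binary units throughout; `region_table_bytes` and the constant names are made up for the example.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB
PIB = 1024 * TIB

REGION_SIZE = 64 * MIB  # one refcount per 64MB of storage (from the thread)
REFCOUNT_BYTES = 8      # one 64-bit counter per region


def region_table_bytes(pool_bytes: int) -> int:
    """Size of the coarse refcount table for a pool of the given size."""
    regions = -(-pool_bytes // REGION_SIZE)  # ceiling division
    return regions * REFCOUNT_BYTES


if __name__ == "__main__":
    print(region_table_bytes(TIB) // KIB)      # 128   -> 128k per terabyte
    print(region_table_bytes(10 * PIB) / GIB)  # 1.25  -> about a gigabyte for 10 PB
    print(REGION_SIZE // REFCOUNT_BYTES)       # 8388608 -> the "8M" reduction factor
```

So the "128k per terabyte" and "extra gigabyte for a 10 petabyte pool" figures are consistent with each other, and the table size depends only on pool size, not on how much of it is actually cloned.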
I’m not a filesystem admin, but we at LLNL use OpenZFS as the storage layer for all of our Lustre file systems in production, including using raid-z for resilience in each pool (order 100 disks each), and have for most of a decade. That, combined with improvements in Lustre, has taken the rate of data loss or need to clear large scale shared file systems down to nearly zero. There’s a reason we spend as many engineer hours as we do maintaining it: it’s worth it.
If I'm not mistaken the linux port of ZFS that later became OpenZFS started at LLNL and was a port from FreeBSD (it may have been release ~9).
I believe it was called ZFS On Linux or something like that.
Nice how things have evolved: from FreeBSD to linux and back. In my mind this has always been a very inspiring example of a public institution working for the public good.
ZoL, if my ancient memory serves, was at LLNL, not based on the FreeBSD port (if you go _very_ far back in the commit history you can see Brian rebasing against OpenSolaris revisions), but like 2 or 3 different orgs originally announced Linux ports at the same time and then all pooled together, since originally only one of the three was going to have a POSIX layer (the other two didn't need a working POSIX filesystem layer). (I'm not actually sure how much came of this collaboration, I just remember being very amused when within the span of a week or two, three different orgs announced ports, looked at each other, and went "...wait.")
Then for a while people developed on either the FreeBSD port, the illumos fork called OpenZFS, or the Linux port, but because (among other reasons) a bunch of development kept happening on the Linux port, it became the de facto upstream and got renamed "OpenZFS", and then FreeBSD more or less got a fresh port from the OpenZFS codebase that is now what it's based on.
The macOS port got a fresh sync against that codebase recently and is slowly trying to merge in, and then from there, ???
We use ZFS in production for the non-content filesystems on a large, ever increasing, percentage of our Netflix Open Connect CDN nodes, replacing geom's gmirror. We had one gotcha (caught on a very limited set of canaries) where a buggy bootloader crashed part-way through boot, leaving the ZFS "bootonce" stuff in a funky state requiring manual recovery (the nodes with gmirror were fine, and fell back to the old image without fuss). This has since been fixed.
Note that we do not use ZFS for content, since it is incompatible with efficient use of sendfile (both because there is no async handler for ZFS, so no async sendfile, and because the ARC is not integrated with the page cache, so content would require an extra memory copy to be served).
I would expect page cache to be one of those platform-dependent things that prevent OpenZFS from doing that. Especially on Linux, and especially because AFAIK the Linux version has the most eyes on it.
My understanding, not having dug into it, is that it's possible but just work nobody has done yet, though I'm not sure what the relevant interfaces are in the Linux kernel.
One thing that makes the interfaces in Linux much messier than the FreeBSD ones is that a lot of the core functionality you might like to leverage in Linux (workqueues, basically anything more complicated than just calling kmalloc, and of course any SIMD save/restore state, to name three examples I've stumbled over recently) are marked EXPORT_SYMBOL_GPL or just entirely not exported in newer releases, so you get to reimplement the wheel for those, whereas on FreeBSD it's trivial to just use their implementations of such things and shim them to the Solaris-ish interfaces the non-platform-specific code expects.
So that makes the Linux-specific code a lot heavier, because upstream is actively hostile.
I wish someone would come in and convince the kernel devs that "hey, if you want EXPORT_SYMBOL_GPL to have legal weight in a copyleft sense then you can't just slap it onto interfaces for political reasons"
I don't think they care about it having legal weight, that ship sailed long ago when they started advocating for just slapping SYMBOL_GPL on things out of spite; I think they care about excluding people from using their software.
IMO Linus should stop being half-and-half about it and either mark everything SYMBOL_GPL and see how well that goes or stop this nonsense.
My impression is that some of the Linux kernel devs are anti-anything that's not GPL-compatible, of any sort, regardless of the particulars.
Linus himself also made remarks about ZFS at one point that were pretty...hostile. [1] [2]
> The fact is, the whole point of the GPL is that you're being "paid" in terms of tit-for-tat: we give source code to you for free, but we want source code improvements back. If you don't do that but instead say "I think this is _legal_, but I'm not going to help you" you certainly don't get any help from us.
> So things that are outside the kernel tree simply do not matter to us. They get absolutely zero attention. We simply don't care. It's that simple.
> And things that don't do that "give back" have no business talking about us being assholes when we don't care about them.
> See?
Note that there's at least one unfixed Linux kernel bug that was found by OpenZFS users, reproducible without using OpenZFS in any way, reported with a patch, and ignored. [3]
We see how well "including people into using your software" works out for pushover licenses almost every week now, FreeBSD itself being the prime example, having lost three major companies one after another from its list of users, sponsors, and developers a couple of years ago. Not that there were many to begin with. What Linus and the gang are doing has been working out great over the long term (much better than any competition, at least).
Why don't you think it has legal weight? Or did you mean something else?
As far as I know the point of EXPORT_SYMBOL_GPL was to push back on companies like Nvidia who wanted to exploit loopholes in the GPL. That seems to me like a reasonable objective.
Sure, and that alone isn't an unreasonable premise - as he says, intent matters.
But if you're marking interfaces as GPL-only, or implementing taint detection that means if you use a non-SYMBOL_GPL kernel symbol which calls a GPL-only function it treats the non-SYMBOL_GPL symbol as GPL-only and blocks your linking, it gets a bit out of hand.
Building the kernel with certain kernel options makes modules like OpenZFS or OpenAFS not link because of that taint propagation - because things like the lockdep checker turn uninfringing calls into infringing ones.
Or a little while ago, there was a change which broke building on PPC because a change made a non-SYMBOL_GPL call on POWER into a SYMBOL_GPL one indirectly, and when the original author was contacted, he sent a patch reverting the changed symbol, and GregKH refused to pull it into stable, suggesting distros could carry it if they wanted to. (Of course, he had happily merged a change into -stable earlier that just implemented more aggressive GPL tainting and thereby broke things like the aforementioned...)
It's explicitly not compatible with GPL, though. It has clauses that are more restrictive than GPL, and IIRC some people who contributed to the OpenZFS project did so explicitly without allowing later CDDL license revisions, which removes Oracle's ability to say CDDL-2 or whatever is GPL-compatible.
So even if someone rolled up dumptrucks of cash and convinced Oracle that everything was great, they don't have all the control needed to do that.
To have legal weight, it has to be a signal that you're implementing something that is derivative of kernel code. That's the directly stated intent of EXPORT_SYMBOL_GPL.
But "call an opaque function that saves SIMD state" is obviously not derivative of the kernel code in any way. The more exports that get badly marked this way, the more EXPORT_SYMBOL_GPL becomes indistinguishable from EXPORT_SYMBOL.
I see it as just a kind of "warranty void if seal broken". Don't do this or you may be in violation of the GPL. Maybe a legal court in $country would find in your favour (I'm not convinced it's as clear cut as you imply). Maybe they would find that you willfully infringed, despite the kernel devs clearly warning you not to do it.
The main "legal effect" I see is that you are not willing to take that risk, just like Oracle isn't.
If the goal was to prove someone saw the GPL, then the entire license enforcement mechanism wouldn't be in the kernel.
If enough symbols get restricted without valid copyright-based reasons, modules that legitimately have non-GPL licenses will have to lie to the kernel to get loaded. And that's a stupid situation all around.
Sun has picked the license specifically to make the project incompatible with the Linux kernel. If you start out with hostility, what treatment do you expect in return?
I suspect the underlying issues for not unifying the two have more to do with the ZFS design than anything to do with Linux. It may be the codebase is far too large at this stage to make such a fundamental change.
I don't think so. The memory management stuff is pretty well abstracted; on FBSD it just glues into UMA pretty transparently, it's just on Linux there's a lot of machinery for implementing our own little cache allocating because Linux's kernel cache allocator is very limited in what sizes it will give you, and sometimes ZFS wants 16M (not necessarily contiguous) regions because someone said they wanted 16M records.
The ZoL project lead said at one point there were a variety of reasons this wasn't initially done for the Linux integration [1], but that it was worth taking another look at since that was a decade ago now. Having looked at the Linux memory subsystems recently for various reasons, I would suspect the limiting factor is that almost all the Linux memory management functions that involve details beyond "give me X pages" are SYMBOL_GPL, so I suspect we couldn't access whatever functionality would be needed to do this.
I could be wrong, though, as I wasn't looking at the code for that specific purpose, so I might have missed functionality that would provide this.
Behlendorf's comment in that thread seems to be talking about linux integration. My point was this is an older issue, going back to the Sun days. See for instance this thread where McVoy complains about the same issue! https://www.tuhs.org/pipermail/tuhs/2021-February/023013.htm...
That seems more like it's complaining about it not being the actual page cache, not it not being counted as "cache", which is a larger set in at least Linux than just the page cache itself.
But sure, it's certainly an older issue, and given that the ABD rework happened, I wouldn't put anything past being "feasible" if the benefits were great enough.
(Look at the O_DIRECT zvol rework stuff that's pending (I believe not merged) for how a more cut-through memory model could be done, though that has all the tradeoffs you might expect of skipping the abstractions ZFS uses to minimize the ability of applications to poke holes in the abstraction model and violate consistency, I believe...)
Actually, Drew's presentations about Netflix, FreeBSD, ZFS, saturating high-bandwidth network adapters, etc. are legendary and have been posted far and wide. But having him available to answer questions on HN just takes it to a whole 'nother level.
You're making me blush.. But, to set the record straight: I actually know very little about ZFS, beyond basic user/admin knowledge (from having run it for ~15 years). I've never spoken about it, and other members of the team I work for at Netflix are far more knowledgeable about ZFS, and are the ones who have managed the conversion of our fleet to ZFS for non-content partitions.
I’ve devoured your FreeBSD networking presentations but I guess I must have confused a post about tracking down a ZFS bug in production written by someone else with all the other content you’ve produced.
Back to the topic at hand, it’s actually scary how little software exposes control over whether or not sendfile is used, assuming support is only a matter of OS and kernel version but not taking into account filesystem limitations. I ran into a terrible Samba on FreeBSD bug (shares remotely disconnected and connections reset with moderate levels of concurrent ro access from even a single client) that I ultimately tracked down to sendfile being enabled in the (default?) config - so it wasn’t just the expected “performance requirements not being met” with sendfile on ZFS but even other reliability issues (almost certainly exposing a different underlying bug, tbh). Imagine if Samba didn’t have a tunable to set/override sendfile support, though.
Yes. That's the bootonce thing I was talking about. When we update the OS, we set the "bootonce" flag via bectl activate -t to ensure we fall back to the previous BE if the current BE is borked and not bootable. This is the same functionality we had by keeping a primary and secondary root partition in geom and toggling the bootable partition via the bootonce flag in gpart.
Running in production for about 3 years with Ubuntu 20.04 / zfs 0.8.3. ZFS is being used as the datastore for a cluster of LXD/LXC instances over multiple physical hosts. I have the OS setup on its own dedicated drive and ZFS striped/cloned over 4 NVMe drives.
edit: one thing I forgot to mention is, when creating your pool make sure to import your drives by ID (zpool import -d /dev/disk/by-id/ <poolname>) instead of name in case name assignments change somehow [1]
In our case each node's ZFS store operates independently, it's not shared. We spin up "VMs" (LXC based) on each node and they talk to each via our private network. Hope that helps!
> I just wouldn’t use the Proxmox web ui for zfs configuration. It doesn’t have up to date options.
What options are missing for you?
Would be great if you could open an enhancement request over at https://bugzilla.proxmox.com/ for tracking this (no promises on (immediate) implementation though).
We use it with SmartOS on hypervisors running customer Bhyve machines. ZFS (and smartos) works really really well.
The fact that I can replicate live data of arbitrary size (many TB sized filesystems) in small chunks to other hosts every minute greatly increased my deep sleep quality. Of course databases with multi-host write etc are nice, but in our use case, all customers are rather small with just lots and lots of files (medical and otherwise), the database itself is rather small and doesn't need replication.
Best thing, on the receiver side of the backup ZFS ensures due to its architecture that the diff is directly applied on top of the existing filesystem, while in normal differential backups one might find out months or years later that one diff snapshot was damaged in transfer or is not accessible.
zfs scrubbing with S.M.A.R.T monitoring also helps a lot to ensure drive quality over time.
# Gotchas
ZFS:
- There is no undo etc, this is unix, so beware of wrong commands.
- ZFS Scrubbing can be stopped (it does sometimes affect io speed), but ZFS resilvering cannot. This can lead to performance issues.
- There must be enough RAM for the caching to work well, and synchronous workloads do well with good write cache drives (ZIL).
- Data usage patterns should fit well with the append-log schema of ZFS. E.g. databases such as LevelDB worked really well. Others are not slow, but need a good ZIL more than when the pattern fits.
SmartOS: Some minor gotchas with how vmadm deletes zfs filesystems or in general with SmartOS, e.g when having too many snapshots, but everything quite predictable.
"production" at home on Debian 11, previously on FreeBSD 10-13. The weirdest gotcha has been related to sending encrypted raw snapshots to remote machines[0],[1]. These have been the first instabilities I had with ZFS in roughly 15 years around the filesystem since switching to native encryption this year. Native encryption seems to be barely stable for production use; no actual data corruption but automatic synchronization (I use znapzend) was breaking frequently. Recent kernel updates fixed my problem although some of the bug reports are still open. I only moved on from FreeBSD because of more familiarity with Linux.
A slightly annoying property of snapshots and clones is the inability to fully re-root a tree of snapshots, e.g. permanently split a clone from its original source and allow first-class send/receive from that clone. The snapshot which originated the clone needs to stick around forever[2]. This prevents a typical virtual machine imaging process of keeping a base image up to date over time that VMs can be cloned from when desired and eventually removing the storage used by the original base image after e.g. several OS upgrades.
I don't have any big performance requirements and most file storage is throughput based on spinning disks which can easily saturate the gigabit network.
I also use ZFS on my laptop's SSD under Ubuntu with about 1GB/s performance and no shortage of IOPS and the ability to send snapshots off to the backup system which is pretty nice. Ubuntu is going backwards on support for ZFS and native encryption uses a hacky intermediate key under LUKS, but it works.
Yes, Ubuntu 20.04 and 22.04. But we've been running ZFS in some form or other for 10+ years. ACL support not as good/easy to use as Solaris/FreeBSD. Not having weird pathological performance issues with kernel memory allocation like we had with FreeBSD though. Sometimes we have issues with automatic pool import on boot, so that's something to be careful with. The tooling is great though, and we've never had catastrophic failure that was due to ZFS, only due to failing hardware.
I use it as my default file system on FreeBSD. It was rough in FreeBSD 7.x (around 2009), but starting with FreeBSD 8.x it has been rock solid to this day. The only gotcha (which the documentation warns about) has been that automatic block level deduplication is only useful in a few special applications and has a large main memory overhead unless you can accept terrible performance for normal operations (e.g. a bandwidth limited offsite backup).
I use the TrueNAS (FreeBSD) version. ZFS makes keeping data snapshots and external backups a fairly easy process. It really cut down on server data management time. The only issue I found with it is there was a now long patched SMB bug a few years ago in the TrueNAS software that caused my storage pool to become corrupted after a large transfer. And there was no way for me to recover the data. It was all lost despite the disks still spinning and presumably still having most of the 1s and 0s recoverable.
I use ZFS on Debian for my home file server. The setup is just a tiny NUC with a couple of large USB hard drives, mirrored with ZFS. I’ve had drives fail and painlessly replaced and resilvered them. This is easily the most hassle free file storage setup I’ve owned; been going strong over 10 years now with little to no maintenance.
The gotcha on Proxmox is that you can't do swapfiles on ZFS, so if your swap isn't made big enough when installing and you format everything as ZFS you have to live with it or do odd workarounds.
That's not a gotcha specific to Proxmox projects but a gotcha everywhere: swap on ZFS needs a lot of special care so that it doesn't have to allocate memory while swapping memory out (e.g., when low on memory), which causes hard freezes.
It can be made somewhat[0] stable by using no compression (or a simple one like ZLE) and avoiding log devices and caches, but it's way simpler and guaranteed stable to just use a separate partition.
1. You shouldn't swap on ZFS, but really
2. You shouldn't swap on a filesystem, but really
3. You shouldn't swap at all.
On a VM host you have a bit of leeway for memory overcommit, but ideally if you have a box that frequently has to rely on swap, something is bad wrong. I don't think it's really fair to criticize a worst-choice configuration for being poorly supported here. There are simply too many opportunities for catch-22/deadlock scenarios in the interaction of the fs and allocator during OOM...
In production on SmartOS (illumos) servers running applications and VM's, on TrueNAS and plain FreeBSD for various storage and backups, and on a few Linux-based workstations. Using mirrors and raidz2 depending on the needs of the machines.
We've successfully survived numerous disk failures (a broken batch of HDD's giving all kinds of small read errors, an SSD that completely failed and disappeared, etc), and were in most cases able to replace them without a second of downtime (would have been all cases if not for disks placed in hard-to-reach places, now only a few minutes downtime to physically swap the disk).
Snapshots work perfectly as well. Systems are set up to automatically make snapshots using [1], on boot, on a timer, and right before potentially dangerous operations such as package manager commands as well. I've rolled back after botched OS updates without problems; after a reboot the machine was back in its old state. Also rolled back a live system a few times after a broken package update, restoring the filesystem state without any issues. Easily accessing old versions of a file is an added bonus which has been helpful a few times.
Send/receive is ideal for backups. We are able to send snapshots between machines, even across different OSes, without issues. We've also moved entire pools from one OS to another without problems.
Knowing we have automatic snapshots and external backups configured also allows me to be very liberal with giving root access to inexperienced people to various (non-critical) machines, knowing that if anything breaks it will always be easy to roll back, and encouraging them to learn by experimenting a bit, to the point where we can even diff between snapshots to inspect what changed and learn from that.
Biggest gotchas so far have been on my personal Arch Linux setup, where the out-of-tree nature of ZFS has caused some issues like an incompatible kernel being installed, the ZFS module failing to compile, and my workstation subsequently being unable to boot. But even that was solved by my entire system running on ZFS: a single rollback from my bootloader [2] and all was back the way it was before.
Having good tooling set up definitely helped a lot. My monkey brain has the tendency to think "surely I got it right this time, so no need to make a snapshot before trying out X!", especially when experimenting on my own workstation. Automating snapshots using a systemd timer and hooks added to my package manager saved me a number of times.
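For anyone who wants to wire up something similar, here is a minimal sketch of the "snapshot first, then run the risky command" idea. It only assumes the standard `zfs snapshot dataset@name` command; the helper names, the snapshot naming scheme, and the injectable `runner` parameter are made up for illustration and are not the tooling referenced above.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_command(dataset, label, now=None):
    """Build the argv for `zfs snapshot dataset@label-timestamp`."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return ["zfs", "snapshot", f"{dataset}@{label}-{stamp}"]


def run_with_snapshot(dataset, label, command, runner=subprocess.run):
    """Take a snapshot, then run `command`; stop if the snapshot fails.

    `runner` is injectable so the logic can be exercised without ZFS.
    """
    runner(snapshot_command(dataset, label), check=True)
    return runner(command, check=True)
```

Called as `run_with_snapshot("rpool/ROOT", "pre-upgrade", ["pacman", "-Syu"])` (the dataset name is hypothetical), it refuses to run the upgrade unless the snapshot succeeded, which is the property that protects against the "surely I got it right this time" moments.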
ZFS on Linux has improved a lot in the last few years. We (prematurely) moved to using it in production for our MySQL data about 5 years ago and initially it was a nightmare due to unexplained stalling which would hang MySQL for 15-30 minutes at random times. I'm sure it shortened my life a few years trying to figure out what was wrong when everything was on fire. Fortunately, they have resolved those issues in the subsequent releases and it's been much more pleasant after that.
We don't use ZFS as a filesystem, we use it to aggregate multiple EBS into a single pool, and present it to... stock Ubuntu (one found in AWS marketplace) as a single big block device.
We also use ZFS for testing of our own block device. My biggest problems with it are related to this activity. Hot-plugging or removing block devices from the pool often leads to unrepeatable pool, and the tests have to be scrambled because the whole system needs to be rebooted.
>We don't use ZFS as a filesystem, we use it to aggregate multiple EBS into a single pool, and present it to... stock Ubuntu (one found in AWS marketplace) as a single big block device.
I don't understand what you mean by this comment. Are you creating a zvol on your zpool and loopback mounting it with a different filesystem? It sounds like you are very much using ZFS as a filesystem, even if you are not booting from it or using it for any kind of high availability purpose.
I do agree with you on the matter that there should be a way to clear out an (UNAVAIL vdev / SUSPENDED pool) without a reboot. This has been a thorn in my side for as long as ZFS has existed. The deadlock-by-design line is really quite nonsense.
No, we don't use it as a filesystem at all. Our product is similar to EBS. So, nothing is mounted anywhere. Users of our product will get a slice or a whole volume thus created and usually would put some filesystem on it, but they will see that volume as a LUN in iSCSI portal we give them an address of. So, in principle, they could also use it to create a ZFS filesystem, but I doubt anyone uses it like that. Most typical usages are either filesystem (something like Ext4 or XFS) or an object store.
In production since the first Release of FreeBSD in our Research company. We layered our storage to disconnect hardware Lifecycles from OS. The spinning disks are pooled together using ScaleIO and provided to FreeBSD on ESX as physical devices which live forever through each EOL cycle of the hardware every five up to eight years. We have been serving ten PetaBytes for 15 years without hiccups. We expose SMB3/NFS4 and S3 and have nightly DR to mirror sites using ZFS binary difference Send and Receive cronjobs which are fully automated and monitored. Basically the ScaleIO SDS servers can be replaced during daytime while the upper layer keeps going on for HPC and thousands of other jobs. Since hardware provides more capacity after each Lifecycle we can increase the ZFS pool on the fly and have the most profound 24/7 storage environment without any major interruptions for 15 YEARS and counting. Our partners are having multiple outages due to Lustre and other problems, my 2c.
Sure; we get good mileage out of compression and snapshots (well, mostly send-recv for moving data around rather than snapshots in-place). I think the only problems have been very specific to our install process (non-standard kernel in the live environment; if we used the normal distro install process it would be fine).
Yeah, a few years ago I steered our university research group towards using ZFS for a couple of new servers with 1.5PB storage each (2.0PB before raidz2), on RHEL (would have preferred Debian). Absolutely no problems with the system. The IT guys were a bit doubtful whether it would work well, for instance they insisted that XFS would be better at managing parallel requests, but when the system happily saturates a 10Gb ethernet with plenty of bandwidth to spare I don't think there's a problem. We have had ZFS flag up one hard drive so far that was returning false data, which was then hot-swapped out. Other filesystems would probably have just suffered data corruption without us knowing it.
My workplace has been using ZFS for years. I've yet to see an alternative come close to being compelling. We use OpenZFS on all our RHEL servers and even workstations with XFS for root partitions and smaller clients. Our primary file server is Solaris 11 but is coming to the end of its long life. The plan is to use FreeBSD on the replacement. ZFS has become so valuable to us that it alone can be a major factor in directing our other choices.
It's nice to see it advancing in useful ways. Hopefully this leads to offline dedup without the runtime memory costs.
I have been using zfs for years from ubuntu 18. Easy snapshots, monitoring, choices for raid levels, and ability to very easily copy a dataset remotely with resume or incremental support is awesome. I mainly use it for kvm systems each with their own dataset. Coming from mdadm + lvm from my previous set up is night and day for doing snapshots and backups. I do not use zfs on root for ubuntu instead I do a raid1 software set up for the os and then a zfs set up on other disks - zfs on root was the only gotcha. For FreeBSD zfs on root works fine.
The PersistentVolume API is a nice way to divvy up a shared resource across different teams, and using ZFS for that gives us the snapshotting, deduplication, and compression for free. For our workloads, it benchmarked faster than XFS so it was a no-brainer.
Excited, because in addition to ref copies/clones, httm will use this feature, if available (I've already done some work to implement), for its `--roll-forward` operation, and for faster file recoveries from snapshots [0].
As I understand it, there will be no need to copy any data from the same dataset, and this includes all snapshots. Blocks written to the live dataset can just be references to the underlying blocks, and no additional space will need be used.
Imagine being able to continuously switch a file or a dataset back to a previous state extremely quickly without a heavy weight clone, or a rollback, etc.
Right now, httm simply diff copies the blocks for file recovery and roll-forward. For further details, see the man page entry for `--roll-forward`, and the link to the httm GitHub below:
--roll-forward="snap_name"
traditionally 'zfs rollback' is a destructive operation, whereas httm roll-forward is non-destructive. httm will copy only the blocks and file metadata that have changed since a specified snapshot, from that snapshot, to its live dataset. httm will also take two precautionary snapshots, one before and one after the copy. Should the roll forward fail for any reason, httm will roll back to the pre-execution state. Note: This is a ZFS only option which requires super user privileges.
This is huge.
One practical application is fast recovery of a file from a past snapshot without using any additional space. I use a ZFS dataset for my vCenter datastore (storing my vmdk files). If I need to launch a clone from a past state, I could use block cloning to bring back a past vmdk file without actually copying it, which saves both space and time.
They're likely using a NAS (e.g., FreeNAS/TrueNAS) that uses ZFS for the underlying storage then sharing that storage by either NFS or iSCSI with their vSphere cluster. In some rare cases FC is being used instead of NFS or iSCSI.
It seems kind of like hard linking but with copy-on-write for the underlying data, so you'll get near-instant file copies and writing into the middle of them will also be near-instant.
All of this happens under the covers already if you have dedup turned on, but this allows utilities (gnu cp might be taught to opportunistically and transparently use the new clone zfs syscalls, because there is no downside and only upside) and applications to tell zfs that "these blocks are going to be the same as those" without zfs needing to hash all the new blocks and compare them.
Additionally, for finer control, ranges of blocks can be cloned, not just entire files.
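To make the "ranges, not just entire files" part concrete, here is a rough Python sketch using `os.copy_file_range` (Python 3.8+, Linux). The assumption is that the kernel is free to satisfy such a request by sharing blocks on filesystems that support it; whether that actually happens on a given ZFS build is not something this sketch can confirm, so it falls back to ordinary reads and writes. The helper name `copy_range` is made up.

```python
import os


def copy_range(src_path, dst_path, src_offset, dst_offset, length):
    """Copy `length` bytes from src to an existing dst at the given offsets.

    Returns the number of bytes actually copied (short if src ends early).
    """
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        remaining = length
        while remaining > 0:
            try:
                # The kernel may share blocks here instead of copying data.
                n = os.copy_file_range(src.fileno(), dst.fileno(), remaining,
                                       offset_src=src_offset,
                                       offset_dst=dst_offset)
            except (AttributeError, OSError):
                # Not available or not supported: plain positional read/write.
                chunk = os.pread(src.fileno(), min(remaining, 1 << 20), src_offset)
                n = os.pwrite(dst.fileno(), chunk, dst_offset) if chunk else 0
            if n == 0:
                break  # reached the end of the source file
            src_offset += n
            dst_offset += n
            remaining -= n
    return length - remaining
```

The point of the interface is that the application states its intent ("these bytes should equal those bytes") and leaves it to the filesystem to decide whether that needs any data movement at all.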
I can't tell from the github issue, can this manual dedup / block cloning be turned on if you're not already using dedup on a dataset? Last time I set up zfs, I was warned that dedup took gobs of memory, so I didn't turn it on.
>When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if --reflink=auto is specified, fall back to a standard copy. Use --reflink=never to ensure a standard copy is performed.
It's orthogonal to dedup being on or off, and as someone else said, it's more or less the same underlying semantics you would expect from cp --reflink anywhere.
Also, as mentioned, on Linux, it's not wired up with any interface to be used at all right now.
FTFA: "Block Cloning allows to clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Block Cloning can be described as a fast, manual deduplication."
As others have said: block cloning (the underlying technology that enables copy-on-write) allows you to 'copy' a file without reading all of the data and re-writing it.
For example, if you have a 1 GB file and you want to make a copy of it, you need to read the whole file (all at once or in parts) and then write the whole new file (all at once or in parts). This results in 1 GB of reads and 1 GB of writes. Obviously the slower (or more overloaded) your storage media is, the longer this takes.
With block cloning, you simply tell the OS "I want this file A to be a copy of this file B" and it creates a new "file" that references all the blocks in the old "file". Given that a "file" on a filesystem is just a list of blocks that make up the data in that file, you can create a new "file" which has pointers to the same blocks as the old "file". This is a simple system call (or a few system calls), and as such isn't much more intensive than simply renaming a file instead of copying it.
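If it helps to see the "a file is just a list of blocks" idea spelled out, here is a toy in-memory model in Python. It is purely illustrative: the class and its methods are invented for this sketch and say nothing about how ZFS actually tracks cloned blocks on disk.

```python
from collections import Counter


class ToyBlockStore:
    """Toy model: files are lists of block ids; blocks are reference-counted."""

    def __init__(self):
        self.blocks = {}       # block id -> bytes
        self.refs = Counter()  # block id -> number of file references
        self.files = {}        # file name -> list of block ids
        self._next_id = 0

    def _alloc(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.refs[block_id] = 1
        return block_id

    def _release(self, block_id):
        self.refs[block_id] -= 1
        if self.refs[block_id] == 0:
            # Nothing references the block any more, so it can be freed.
            del self.refs[block_id]
            del self.blocks[block_id]

    def create(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def clone(self, src, dst):
        # No data is copied: dst (assumed new) references the same blocks.
        self.files[dst] = list(self.files[src])
        for block_id in self.files[dst]:
            self.refs[block_id] += 1

    def write(self, name, index, data):
        # Copy-on-write: point this file at a fresh block, drop the old ref.
        old = self.files[name][index]
        self.files[name][index] = self._alloc(data)
        self._release(old)

    def read(self, name):
        return b"".join(self.blocks[b] for b in self.files[name])
```

Cloning a three-block file leaves the store with three blocks; overwriting one block of the clone adds a fourth, and the original file still reads back unchanged.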
At my previous job we did builds for our software. This required building the BIOS, kernel, userspace, generating the UI, and so on. These builds required pulling down 10+ GB of git repositories (the git data itself, the checkout, the LFS binary files, external vendor SDKs), and then a large amount of build artifacts on top of that. We also needed to do this build for 80-100 different product models, for both release and debug versions. This meant 200+ copies of the source code alone (not to mention build artifacts and intermediate products), and because of disk space limitations this meant we had to dramatically reduce the number of concurrent builds we could run. The solution we came up with was something like:
1. Check out the source code
2. Create an overlayfs filesystem to mount into each build space
3. Do the build
4. Tear down the overlayfs filesystem
This was problematic if we weren't able to mount the filesystem, if we weren't able to unmount the filesystem (because of hanging file descriptors or processes), and so on. Lots of moving parts, lots of `sudo` commands in the scripts, and so on.
Copy-on-write would have solved this for us by accomplishing the same thing; we could simply do the following:
1. Check out the source code
2. Have each build process simply `cp -R --reflink=always source/ build_root/`; this would be instantaneous and use no new disk space.
3. Do the build
4. `rm -rf build_root`
Fewer moving parts, no root access required, generally simpler all around.
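A rough Python equivalent of that second workflow, for build scripts that are not shell: try a reflink per file and quietly fall back to a normal copy. The `FICLONE` request number below is the Linux value on common architectures (x86-64, arm64) and is an assumption elsewhere; `clone_or_copy` and `build_in_clone` are hypothetical names. The scratch directory has to be on the same filesystem as the source for the reflink to have any chance of succeeding.

```python
import fcntl
import os
import shutil
import tempfile

# Linux FICLONE ioctl request number (assumed: x86-64 / arm64 value).
FICLONE = 0x40049409


def clone_or_copy(src, dst):
    """Try to reflink src to dst; fall back to an ordinary copy."""
    try:
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
        shutil.copystat(src, dst)
    except OSError:
        # Filesystem (or architecture) does not support the ioctl.
        shutil.copy2(src, dst)
    return dst


def build_in_clone(source_dir, build, scratch_parent=None):
    """Clone source_dir into a scratch dir, run build(path), then clean up."""
    scratch = tempfile.mkdtemp(prefix="build_root_", dir=scratch_parent)
    build_root = os.path.join(scratch, "src")
    try:
        shutil.copytree(source_dir, build_root, copy_function=clone_or_copy)
        return build(build_root)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
```

Unlike `cp --reflink=always`, this never fails just because cloning is unavailable; it simply costs the full copy in that case, which is closer to `--reflink=auto`.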
It can be a really convenient way to snapshot something if you can arrange some point at which everything is synced to disk. Get to that point, make your new files that start sharing all their blocks, and then let your main db process (or whatever) continue on as normal.
> Note: currently it is not possible to clone blocks between encrypted datasets, even if those datasets use the same encryption key (this includes snapshots of encrypted datasets). Cloning blocks between datasets that use the same keys should be possible and should be implemented in the future.
Once this is ready, I am going to subdivide my user homedir much more than it already is. The biggest obstacle in the way of this has been that it would waste a bunch of space until the snapshots were done rolling over, which for me is a long time (I keep weekly snapshots of my homedir for a year).
Not sure why you'd want to do that to your home directory usually, but it depends on what you store in it and how you use it, really.
In general, breaking up a filesystem into multiple ones in ZFS is mostly useful for making filesystem management more fine-grained, as a filesystem/dataset in ZFS is the unit of management for most properties and operations (snapshots, clones, compression and checksum algorithms, quotas, encryption, dedup, send/recv, ditto copies, etc) as well as their inheritance and space accounting.
In terms of filesystem management, there aren't many downsides to breaking up a filesystem (within reason), as most properties and the most common operations can be shared between all sub-filesystems if they are part of the same inherited tree (which doesn't necessarily have to correspond to the mountpoint tree!).
As far as I know, the major downsides by far were that 1) you couldn't quickly move a file from one dataset to another, i.e. `mv` would be forced to do a full copy of the file contents rather than just do a cheap rename, and 2) in terms of disk space, moving a file between filesystems would be equivalent to copying the file and deleting the original, which could be terrible if you use snapshots as it would lead to an additional space consumption of a full new file's worth of disk space.
In principle, both of these downsides should be fixed with this new block cloning feature and AFAIU the only tradeoffs would be some amount of increased overhead when freeing data (which should be zero overhead if you don't have many of these cloned blocks being shared anymore), and the low maturity of this code (i.e. higher chance of running into bugs) due to being so new.
Controlling the rate and location of snapshots, mostly. I've broken out some kinds of datasets (video archives) but not others historically (music). It doesn't matter that much, but I want to split some more chunks out.
Fair enough. I've personally slowly moved to a smaller number of filesystems, but if you're actually handling snapshots differently per-area then it makes sense (indeed, one of the reasons I'm consolidating is the realization that personally I'm almost never going to snapshot/restore things separately).
Finally! This is exactly what I want for handling photos and media on my home server. We have phones dumping their camera rolls onto the machine via Syncthing, but the easiest way to share content safely is just to make copies of the file and go "storage is cheap". But the reality has always been closer to "storage should handle deduplication" - with this feature landing that can finally be a reality.
EDIT: Not to mention, this should make ZFS best-in-class for handling container filesystems. Deduplicating all the files in the individual layers means you can pretty much dispense with awkwardly trying to maintain inheritance trees.
Syncthing - the android app can share your camera roll folder.
So that syncs to my server, and my desktop/laptop. I drag files around on there when I want them deleted off my phone and archived somewhere. Me and my wife share a syncthing folder between us when we want to send files to each other.
All of this produces a fair number of duplicate files, particularly if you have backups turned on in case of deletes.
Offline dedupe basically makes all of that free though - duplicate files on the server or in backup dirs are no longer a problem.
Does ZFS or any other FS offer special operations which DB engine like RocksDB, SQLite or PostgreSQL could benefit from if they decided to target that FS specifically?
Internally, ZFS is kinda like an object store[1], and there was a project trying to expose the ZFS internals through an object store API rather than through a filesystem API.
Sadly I can't seem to find the presentation or recall the name of the project.
On the other hand, looking at for example RocksDB[2]:
File system operations are not atomic, and are susceptible to inconsistencies in the event of system failure. Even with journaling turned on, file systems do not guarantee consistency on unclean restart. POSIX file system does not support atomic batching of operations either. Hence, it is not possible to rely on metadata embedded in RocksDB datastore files to reconstruct the last consistent state of the RocksDB on restart. RocksDB has a built-in mechanism to overcome these limitations of POSIX file system [...]
ZFS does provide atomic operations internally[1], so if exposed it seems something like RocksDB could take advantage of that and forego all the complexity mentioned above.
How much that would help I don't know though, but seems potentially interesting at first glance.
reflink is implemented on Oracle Solaris; how it works internally I don't think Oracle has ever commented, since implementing post-facto dedup on a filesystem that assumes data once written is unmodified is a bit Spicy(tm) - you can look at the BRT talk from the OpenZFS dev summit or other talks about it at the leadership meetings to get some idea why it's so exciting.
Filesystem-level de-duplication is scary as hell as a concept, but also sounds amazing, especially doing it at copy-time so you don't have to opt-in to scanning to deduplicate. Is this common in filesystems? Or is ZFS striking out new ground here? I'm not really an under-the-OS-hood kinda guy.
macOS and BTRFS have had it for several years. In fact I believe it’s the default behavior when copying a file in macOS using the Finder (in the shell you have to specify `cp -c`).
Disks die all the time anyway. If you want to keep your data, you should have at least two-disk redundancy. In which case bad blocks won't kill anything.
Besides what all the others have mentioned, you can force ZFS to keep up to 3 copies of data blocks on a dataset. ZFS uses this internally for important metadata and will try to spread them around to maximize the chance of recovery, though don't rely on this feature alone for redundancy.
You can do `zfs set copies=2` for a ZFS dataset if you think there's value in the extra copies. That's better than a copy of the file because with a single bad block, the checksum will fail and ZFS will retrieve the data from the good block and repair both files. In practice, using disk mirrors is better than setting the copies property.
You should also be able to restore your data in a calm, controlled, and correct manner. Test your backups to be sure they work, and to be sure that you're still familiar with the process. You don't want to be stuck reverse-engineering your backup solution in the middle of an already-panicky data loss scenario.
Remain calm, follow the prescribed steps, and wait for your data to come back.
Copying a file isn't great protection against bad blocks. Modern SSDs, when they fail, tend to fail catastrophically (the whole device dies all at once), rather than dying one block at a time. If you care about the data, back it up on a separate piece of hardware.
Just that I'm trusting the OS to re-duplicate it at block level on file write. The idea that block by block you've got "okay, this block is shared by files XYZ, this next block is unique to file Z, then the next block is back to XYZ... oh we're editing that one? Then it's a new block that's now unique to file Z too".
I guess I'm not used to trusting filesystems to do anything but dumb write and read. I know they abstract away a crapload of amazing complexity in reality, I'm just used to thinking of them as dumb bags of bits.
ZFS is COW with checksums so it wouldn't edit the same blocks and there is the possibility for sending snapshots to another pool (which may/may not use deduplication).
Although, deduplication comes with a performance cost. I had all my photos spread out on various disks and external media, sometimes extra copies (as I did not trust certain disks) - if I remember it correctly I went from 3.6T to 2.2T usage by consolidating all my photos to a deduplicated pool. All fine, but the zpool wanted way more RAM and felt slower than my other pool.
After I removed duplicates (with help of https://github.com/sahib/rmlint ), I migrated my photos to an ordinary zpool instead.
I mean, it's happening whether you're using snapshots or not; ZFS doesn't overwrite things in place basically ever. Snapshots just mean it doesn't delete the old copy as having nothing referencing it.
CoW is always happening but without snapshots you can know that every block has exactly zero or one references and it's much simpler. No garbage collection, no complicated data structures, all you need is a single tree and a queue of dead blocks.
likely zfs and btrfs have some similarities (copy on write) and some differences (integration with volume manager). if you're interested, i'm sure you'll find more.
This feature is basically the same as what underpins the reflink feature that btrfs has supported approximately forever and xfs has supported for at least several years.
This is only a tangent given we are talking about snapshots and reflink, but just wanted to mention that LVM has snapshots, so if you need XFS snapshots, create the XFS filesystem on top of an LVM logical volume.
It's worth noting that, while LVM can support snapshots for XFS (or any other filesystem, really), the performance penalty is almost entirely unacceptable (one benchmark[0] shows an up to 60% drop in write throughput), so ideally your process should be something like
1. Snapshot LVM volume
2. Mount the LVM volume and back the snapshot up somewhere else
3. Delete the snapshot as soon as possible
Otherwise you're going to lose a ton of performance.
You can get a similar effect on top of any file system that supports hard links with rdfind ( https://rdfind.pauldreik.se/ ) -- but it's pretty slow.
The Arch wiki says:
"Tools dedicated to deduplicate a Btrfs formatted partition include duperemove, bees, bedup and btrfs-dedup. One may also want to merely deduplicate data on a file based level instead using e.g. rmlint, jdupes or dduper-git. For an overview of available features of those programs and additional information, have a look at the upstream Wiki entry.
Furthermore, Btrfs developers are working on inband (also known as synchronous or inline) deduplication, meaning deduplication done when writing new data to the filesystem. Currently, it is still an experiment which is developed out-of-tree. Users willing to test the new feature should read the appropriate kernel wiki page."
> You can get a similar effect on top of any file system that supports hard links with rdfind ( https://rdfind.pauldreik.se/ ) -- but it's pretty slow.
It's a similar effect only if you don't modify the files, I think.
If you "clone" a file with a hard link and you modify the contents of one copy, the other copy would also be equally modified.
As far as I understand this wouldn't happen with this type of block cloning: each copy of the file would be completely separate, except that they may (transparently) share data blocks on disk.
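The difference is easy to see in a few lines of Python. This is only a sketch of the semantics: an ordinary copy stands in for the clone here, on the assumption that a cloned file behaves like an independent copy from the application's point of view.

```python
import os
import shutil
import tempfile


def demo_link_vs_copy():
    """Edit a file in place and report what its hard link and its copy contain."""
    workdir = tempfile.mkdtemp()
    try:
        original = os.path.join(workdir, "original.txt")
        linked = os.path.join(workdir, "hardlink.txt")
        copied = os.path.join(workdir, "copy.txt")
        with open(original, "w") as f:
            f.write("v1")
        os.link(original, linked)          # second name for the same inode
        shutil.copyfile(original, copied)  # independent file, like a clone
        with open(original, "r+") as f:    # modify in place, same inode
            f.write("v2")
        with open(linked) as f:
            linked_text = f.read()
        with open(copied) as f:
            copied_text = f.read()
        return linked_text, copied_text
    finally:
        shutil.rmtree(workdir)
```

The hard link sees the edit and the copy does not. Note that tools which save by writing a new file and renaming it over the old one break the hard link instead, so the hard-link behavior also depends on how the file gets modified.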
> It is in FreeBSD main branch now, but disabled by default just to be safe till after 14.0 released, where it will be included. Can be enabled with loader tunable there.
> more code is needed on the ZFS side for Linux integration. A few people are looking at it AFAIK.