I have messed with ZFS on Linux on Ubuntu and I have to say that I would not yet trust it in production. It's not as bulletproof as it needs to be, and it's still under heavy development; it's not even at version 1.0 yet.
We've actually been running it in production at Netflix for a few microservices for over a year (as Scott said, for a few workloads, but a long way from everywhere). I don't think we've made any kind of announcement, but "ZFS" has shown up in a number of Netflix presentations and slide decks: e.g., for Titus (container management). ZFS has worked well on Linux for us. I keep meaning to blog about it, but there have been so many things to share (BPF has kept me busier). Glad Scott found the time to share the root ZFS stuff.
If I had to choose between a filesystem prone to silent and/or visible data corruption, up to the point of pretty much eating itself and forcing a full server restore, versus a filesystem whose data you can trust but which might hit a kernel deadlock or panic, I would choose the latter, and in fact did.
Over the last five years I have seen a few ext4/mdraid servers suffer serious corruption, but I have only had to reset a ZoL server maybe twice.
I transitioned an md RAID1 from spinning disks to SSDs last week. After I removed the last spinning disk, one of the SSDs started returning garbage.
A third of reads were returning garbage and ext4, of course, freaked out. By then it was too late and the array was shot, so I restored from backup.
This would have been a non-event with ZFS. I've got a few production ZoL arrays running and the only problems I've had have been around memory consumption and responsiveness under load. Data integrity has been perfect.
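To give a concrete idea of why it would have been a non-event: on a ZFS mirror that kind of failure shows up as checksum errors that get repaired from the healthy side, rather than as silent garbage handed up to the filesystem. Roughly something like this (the pool name "tank" is just a placeholder, and exact output differs by version):

    # Every block is checksummed, so a device returning garbage shows up as
    # CKSUM errors instead of silently reaching the layers above.
    zpool status -v tank     # per-device READ/WRITE/CKSUM error counters
    zpool scrub tank         # walk the pool, rewriting bad copies from the good mirror side
    zpool status tank        # check scrub progress and repair totals
    zpool clear tank         # reset the error counters once the bad SSD is replaced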
We're strongly considering using something else until this gets addressed. The problem is, we don't know what, because every other CoW implementation also has issues.
* dm-thinp: Slow, wastes disk space
* OverlayFS: No SELinux support
* aufs: Not in mainline or CentOS kernel; rename(2) not implemented correctly; slow writes
Have you had any issues to report? If so, how quickly were they fixed? Knowing the typical turnaround time for such issues would help us make a more educated decision.
Yes, we've run into 2 or 3 ZFS bugs that I can think of, and they were resolved in a timely fashion (fixes released within a few weeks, if I recall) by Canonical working with the Debian and zfsonlinux maintainers (and subsequently fixed in both Ubuntu and Debian - and in upstream zfsonlinux for the ones that were not Debian-packaging related). Of course your mileage may vary, and it depends on the severity of the issue. Being prepared to provide detailed reproduction and debug information, and to test proposed fixes, will greatly help - but that can be a serious time commitment on your side (for us, it's worth it). Hope that helps!
ZFS is not in the mainline or CentOS kernel either, so you are presumably willing to try out-of-tree stuff. I believe all the overlay/SELinux work is now upstream; it is supposed to ship in the next RHEL release.
1) I've seen users complaining about data loss in issues on GitHub.
2) The init script failed when upgrading Ubuntu and I had to fix it by hand. Probably a one-time issue.
We have been running ZFS on Linux in production since April 2015 on over 1500 instances in AWS EC2 with Ubuntu 14.04 and 16.04. Only one kernel panic observed so far, on a Jenkins/CI instance, but that was due to Jenkins doing magic on ZFS mounts, believing they were Solaris ZFS mounts.
In our opinion, when we made the switch, being able to trust the integrity of the data mattered much more than the risk of an occasional kernel panic.
Well, we (and by this I mean myself and my fantastic team) have been running it since 2015 as the main filesystem for a double-digit number of KVM hosts running a triple-digit number of virtual machines, executing an interesting mix of workloads ranging from lightweight (file servers for light sharing, web application hosts) to heavy I/O-bound ones (databases, build farms), with fantastic results so far. All this on Debian Stable.
The setup process was a bit painful: some HW storage controllers introduced delays that caused udev not to make some HDD devices available under /dev before the ZFS scripts kicked in, and we have been bitten a couple of times by changes (or bugs) in the boot scripts. However, the gains ZFS provides in terms of data integrity, backups, and our virtual machine provisioning workflow were definitely worth it.
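In case it helps anyone hitting the same "disks not yet in /dev when the import runs" problem: one workaround on systemd-based setups is to make the ZFS import unit wait for udev to settle. This is only a sketch, not our exact fix; the unit names (zfs-import-cache.service, systemd-udev-settle.service) depend on how your distro packages ZFS, so check before copying:

    # Drop-in that delays pool import until udev has finished creating device nodes.
    mkdir -p /etc/systemd/system/zfs-import-cache.service.d
    cat > /etc/systemd/system/zfs-import-cache.service.d/wait-for-udev.conf <<'EOF'
    [Unit]
    Wants=systemd-udev-settle.service
    After=systemd-udev-settle.service
    EOF
    systemctl daemon-reload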
It's maturing rapidly and has proven to be very stable so far. We're not using it by default everywhere, at least not yet, and building out an AMI that uses ZFS for the rootfs is still a bit of a research project - but we have been using it to do RAID0 striping of ephemeral drives for a year or two on a number of workloads.
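For anyone curious, striping the ephemeral (instance-store) drives is basically a one-liner. This is just a sketch with made-up device and pool names (/dev/xvdb and /dev/xvdc vary by instance type), not our exact provisioning script:

    # RAID0-style stripe across the instance-store disks (no redundancy:
    # the data is as ephemeral as the drives themselves).
    zpool create -f -o ashift=12 -O mountpoint=/mnt/ephemeral ephemeral /dev/xvdb /dev/xvdc
    zpool list ephemeral
    zfs set compression=lz4 ephemeral    # optional, but cheap on these workloads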
The Linux implementation might be lacking, but the underlying filesystem should be more reliable. I'd still argue that ZFS should be deployed on FreeBSD or Solaris; there are plenty of ways to fire up a Linux environment from there.
I've been using ZFS on Ubuntu since ~2010 for a small set of machines, reading/writing 24/7 under different loads. It's worked great through quite a few drive replacements and various other hardware failures.
I'm perfectly willing to believe there may be some rare situations where ZFS on Linux will cause you a problem, but I bet they're rare enough that it'll have saved you a few times before it bites you.