
I have messed with ZFS on Linux on Ubuntu and I have to say that I would not yet trust it in production. It's not as bulletproof as it needs to be and is still under heavy development. It's not even at version 1.0 yet.



We've actually been running it in production at Netflix for a few microservices for over a year (as Scott said, for a few workloads, but a long way from everywhere). I don't think we've made any kind of announcement, but "ZFS" has shown up in a number of Netflix presentations and slide decks: eg, for Titus (container management). ZFS has worked well on Linux for us. I keep meaning to blog about it, but there's been so many things to share (BPF has kept me more busy). Glad Scott found the time to share the root ZFS stuff.


If I had to choose between a filesystem with silent and/or visible data corruption, up to pretty much eating itself and forcing you to restore an entire server, versus a filesystem you can trust but that could hit a kernel deadlock/panic... I would choose the latter, and in fact did.

Over the last five years I have seen a few ext4/mdraid servers suffer serious corruption, but I have only had to reset a ZoL server maybe twice.


Story time.

I transitioned an md RAID1 from spinning disks to SSDs last week. After I removed the last spinning disk, one of the SSDs started returning garbage.

One in three reads is returning garbage and ext4 freaks out, of course. It's too late and the array is shot. I restore from backup.

This would have been a non-event with ZFS. I've got a few production ZoL arrays running and the only problems I've had have been around memory consumption and responsiveness under load. Data integrity has been perfect.
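
For what it's worth, the thing that makes this a non-event on ZFS is the end-to-end checksumming: a scrub re-reads every block, verifies it against its checksum, and repairs it from the good mirror. A minimal health-check sketch of the kind we might run from cron (the pool name "tank" is an assumption, not anything from this thread):

    #!/usr/bin/env python3
    # Minimal ZFS health check: kick off a scrub, then report pool status.
    # Assumes a pool named "tank" and the zpool CLI on PATH.
    import subprocess
    import sys

    POOL = "tank"  # hypothetical pool name

    # Start a scrub; ZFS re-reads every block and verifies it against its
    # checksum, repairing from a healthy copy where possible. Ignore the
    # error if a scrub is already in progress.
    subprocess.run(["zpool", "scrub", POOL])

    # "zpool status -x" prints a short "is healthy" line when nothing is
    # wrong; anything else (checksum errors, degraded vdevs) is worth alerting on.
    status = subprocess.run(["zpool", "status", "-x", POOL],
                            capture_output=True, text=True)
    print(status.stdout.strip())
    sys.exit(0 if "healthy" in status.stdout else 1)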


I've seen the same type of thing with respect to memory and load.


Do you have any specific reasons not to trust ZoL?

ZFS-on-Linux devs say it's ready for production[1].

Lawrence Livermore National Laboratory stores petabytes of data using ZoL[2].

If we're sharing anecdotes, ZoL has served me fantastically for several years.

[1] https://clusterhq.com/2014/09/11/state-zfs-on-linux/

[2] http://computation.llnl.gov/newsroom/livermores-zfs-linux-po...


We have encountered a reproducible panic and deadlocks when a containerized process gets terminated by the kernel for exceeding its memory limit:

https://github.com/zfsonlinux/zfs/issues/5535
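
For context, the failure mode is roughly: a process inside a memory-limited cgroup dirties a lot of data on a ZFS dataset and then gets killed while ZFS still has work outstanding. The following is only a sketch of that shape, not the actual reproducer from the issue; the path, sizes, and the systemd-run limit are made up:

    #!/usr/bin/env python3
    # Illustration only: keep dirtying ZFS-backed file data while growing the
    # heap, so the kernel eventually OOM-kills the process mid-write.
    # Run under something like:
    #   systemd-run --scope -p MemoryMax=256M python3 oom_sketch.py

    TARGET = "/tank/scratch/oom-test"   # hypothetical ZFS-backed path
    CHUNK = b"x" * (1 << 20)            # 1 MiB per write

    hog = []
    with open(TARGET, "wb") as f:
        while True:
            f.write(CHUNK)                 # outstanding dirty data for ZFS to flush
            hog.append(bytearray(CHUNK))   # grow RSS until the cgroup limit is hit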

We're strongly considering using something else until this gets addressed. The problem is, we don't know what, because every other CoW implementation also has issues.

* dm-thinp: Slow, wastes disk space

* OverlayFS: No SELinux support

* aufs: Not in mainline or CentOS kernel; rename(2) not implemented correctly; slow writes


The issue you link to was opened a day ago.

If that were me, I'd see how quickly it was fixed before strongly considering something else.


Have you had any issues to report? If so, how quickly were they fixed? Knowing what the typical time is to address these issues would help us make a more educated decision.


Yes, we've run into 2 or 3 ZFS bugs that I can think of that were resolved in a timely fashion (released within a few weeks if I recall) by Canonical working with Debian and zfsonlinux maintainers (and subsequently fixed in both Ubuntu and Debian - and upstream zfsonlinux for ones that were not debian-packaging related). Of course your mileage may vary, and it depends on the severity of the issue. Being prepared to provide detailed reproduction and debug information, and testing proposed fixes, will greatly help - but that can be a serious time commitment on your side (for us, it's worth it). Hope that helps!


ZFS is not in the mainline or CentOS kernel either, so you are presumably willing to try stuff. I believe all the overlay/SELinux work is now upstream; it is supposed to ship in the next RHEL release.


I look forward to that.


My reasons are as follows:

1) I've seen users complaining about data loss in issues on GitHub.

2) The init script failed when upgrading Ubuntu and I had to fix it by hand. Probably a one-time issue.

I need a bit more reliability from a filesystem.


I thought "ZoL" was a pun with ZFS and LOL to tell how not ready it is for production ^^


ZoL is just an abbreviation of ZFS on Linux.


We have been running ZFS on Linux in production since April 2015 on over 1500 instances in AWS EC2 with Ubuntu 14.04 and 16.04. Only one kernel panic observed so far, on a Jenkins/CI instance, and that was due to Jenkins doing magic on ZFS mounts in the belief that they were Solaris ZFS mounts.

In our opinion, when we made the switch, being able to trust the integrity of the data was much more important than any possible kernel panic.


Well, we (and by this I mean myself and my fantastic team) have been running it since 2015 as the main filesystem for a double-digit number of KVM hosts running a triple-digit number of virtual machines executing an interesting mix of workloads, ranging from lightweight (file servers for light sharing, web application hosts) to heavy I/O bound ones (databases, build farms) with fantastic results so far. All this on Debian Stable.

The setup process was a bit painful: some HW storage controllers introduced delays that caused udev not to make some HDD devices available under /dev before the ZFS scripts kicked in, and we have been bitten a couple of times by changes (or bugs) in the boot scripts. However, the gains ZFS provides in terms of data integrity, backups, and virtual machine provisioning workflow have definitely been worth it.
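
The workaround we would reach for in that kind of situation (not necessarily what was done here) is to stop racing the boot scripts: wait for udev to settle, then import by the stable /dev/disk/by-id paths. A rough sketch, with a made-up pool name:

    #!/usr/bin/env python3
    # Sketch: wait for udev to finish creating device nodes, then import the
    # pool using stable by-id paths so it doesn't depend on enumeration order.
    import subprocess

    POOL = "vmpool"  # hypothetical pool name

    # Block until udev has processed all pending device events.
    subprocess.run(["udevadm", "settle"], check=True)

    # Import via /dev/disk/by-id so slow controllers or renamed devices
    # don't leave the pool looking for stale /dev/sdX names.
    subprocess.run(["zpool", "import", "-d", "/dev/disk/by-id", POOL], check=True)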


It's maturing rapidly and has proven to be very stable so far. We're not using it by default everywhere, at least not yet, and building out an AMI that uses ZFS for the rootfs is still a bit of a research project - but we have been using it to do RAID0 striping of ephemeral drives for a year or two on a number of workloads.
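
For the ephemeral-drive case the setup is pleasantly boring: list the instance-store devices as plain top-level vdevs and the pool stripes across them with no redundancy, which is fine for scratch data that disappears with the instance anyway. A rough sketch (device names, pool name, and mountpoint are assumptions, not our actual config):

    #!/usr/bin/env python3
    # Sketch: build a striped (RAID0-style) pool from ephemeral instance-store
    # drives. Device names and pool/mountpoint are hypothetical.
    import subprocess

    DEVICES = ["/dev/xvdb", "/dev/xvdc"]  # assumed instance-store devices
    POOL = "ephemeral"

    # Plain top-level vdevs => striping, no redundancy. Acceptable here since
    # the drives (and the data) are gone whenever the instance stops.
    subprocess.run(
        ["zpool", "create", "-f", "-O", "mountpoint=/mnt/ephemeral",
         POOL, *DEVICES],
        check=True,
    )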


It's bulletproof on Solaris and FreeBSD.


Which doesn't say anything about its state on Linux.


The implementation might be lacking but the underlying FS should be more reliable. I'd still argue that ZFS should be deployed on FreeBSD or Solaris. There are plenty of ways to fire up a Linux environment from there.


You didn't get the hint. He's saying you should be using Solaris or FreeBSD instead of Linux.


Depends on what you're worried about. Operationally speaking I agree, it's not plug and play.

But it's at a point where it safely stores your data correctly. Perhaps some init scripts fail to import your pool on boot, etc., but the data is there.

We do run it in production, but we also have in-house tooling built around it.


I've been using ZFS on Ubuntu since ~2010 for a small set of machines, reading/writing 24/7 with different loads. It's worked great through quite a few drive replacements and various other hardware failures.

I'm perfectly willing to believe there may be some rare situations where ZFS on Linux will cause you a problem, but I bet they're rare enough that it'll have saved you a few times before it bites you.


Do you trust btrfs? SUSE has had it as the default since 2014...


> The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes. [0]

[0]: https://btrfs.wiki.kernel.org/index.php/RAID56


Important to note that this only refers to RAID 5 and 6.


My newly built (Ubuntu 16.04 LTS) workstation is using ZFS exclusively. I'm keeping my fingers crossed.



