ZFS vs. Hammer (freebsd.org)
66 points by joshbaptiste on Jan 2, 2015 | 61 comments



As pointed out in the comments on that thread, be aware that essentially all of the negatives about ZFS boil down to FUD. In particular, upstream is emphatically not dead (the recent OpenZFS summit[1] was packed -- as was the hackathon the next day), which can be easily verified just by looking at the commits to illumos.[2] As for "well known data base [sic] degradation", all I can say is that we run ZFS as a backend for Postgres all over our public cloud -- both ourselves as part of SmartDataCenter[3] and Manta[4] and with high profile customers -- and we have not seen any ZFS-related performance issues. And please don't get me started on the Oracle-related FUD...

[1] http://open-zfs.org/wiki/OpenZFS_Developer_Summit

[2] https://github.com/illumos/illumos-gate/commits/master

[3] https://github.com/joyent/sdc

[4] https://github.com/joyent/manta


I don't think you can really blame people for perceiving Solaris fragments floating around in space as dead projects. OpenIndiana, for example, reeks of death, and the last time I tried to upgrade my home storage server from OpenSolaris to OpenIndiana, the new boot environment was completely hosed. Yes, I realize there are also dozens of other (some useless, some abandoned) illumos distros.

Comments about Oracle are not FUD. Having a project that is in any way related to Oracle is like having a psychotic murderer living in the flat upstairs. People may be understandably reluctant to come over for tea.


We forked this system four-and-a-half years ago, and have done a ton of work in the open; at this point, we identify ourselves much more strongly with illumos than we ever did with Solaris. We also have plenty of folks who have contributed to illumos who never worked for Sun and never contributed to Solaris; to us in the illumos community these are not "Solaris fragments" any more than they are "SVR4 fragments."

As for Oracle, trust me that I hate them as much as anyone[1], but raising Oracle with respect to illumos absolutely is FUD: the project isn't related to Oracle in any way (or any more than it is related to AT&T, which also holds copyright). Even where there are Oracle copyrights, this is open source, and the CDDL is an air-tight license that (sadly) has been battle-tested by Oracle's own lawyers. Thanks to the explicitness of the CDDL around things like patent, there really isn't much that Oracle can do. And indeed, given the massive Oracle-owned Solaris patent portfolio, the same cannot be said for other non-CDDL systems that may infringe those patents -- illumos is (perversely) the only one actually sheltered from Oracle bad behavior in that regard.

[1] https://www.youtube.com/watch?v=-zRN7XLCRhc#t=34m0s


OK, it _is_ FUD, but I'm saying I can understand it.


OpenIndiana is pretty moribund lately, yeah, and there are certainly other abandoned distros. The two really "alive" Illumos distributions are SmartOS and OmniOS. Each gets regular updates, and is backed by a company that uses the distribution in-house (SmartOS is from Joyent, and OmniOS is from OmniTI). SmartOS PXE or USB-boots and is designed as a "cloud" OS substrate for starting up Zones where real work is done, and optionally works together with SmartDataCenter (also by Joyent) to orchestrate multiple machines; OmniOS is a more traditional hdd-installable server OS.

They're both perfectly usable and non-dead, although the situation of each distro being so strongly dependent on a single vendor does worry me a bit (I have more confidence in the long-term stability of e.g. FreeBSD).



Note that the first comment there is an illumos developer offering to help -- and when Chris tweeted about this issue, he was offered similar help from the community.[1] It would be great to have a better characterization of this issue (which is not as simple as it not working, but rather is apparently a performance pathology that appears under load) -- it's entirely possible (if not likely) that this is (just) a driver bug, and not emblematic of larger issues. Certainly, taken alone it does not indicate that the OS is "struggling" (speaking for ourselves, we're 10G everywhere -- though not 10GBASE-T).

[1] https://twitter.com/thatcks/status/517404451121143809


Could not have said it better myself. It's disappointing to see a post with such unfounded nonsense rated so high on HN.


To be fair, the author did say he shouldn't be considered an expert and that his experience was extremely limited.


I'm sorry, but "I'm no expert" is not an excuse for trying to change people's minds. That's how anti-vaxxers are born.


Indeed. This is the "Appeal to Lack of Authority" fallacy. [1]

Authority has a reputation for being corrupt and inflexible, and this stereotype has been leveraged by some who assert that their own lack of authority somehow makes them a better authority.

Starling might say of the 9/11 attacks: "Every reputable structural engineer understands how fire caused the Twin Towers to collapse."

Bombo can reply: "I'm not an expert in engineering or anything, I'm just a regular guy asking questions."

Starling: "We should listen to what the people who know what they're talking about have to say."

Bombo: "Someone needs to stand up to these experts."

The idea that not knowing what you're talking about somehow makes you heroic or more reliable is incorrect. More likely, your lack of expertise simply makes you wrong.

[1] http://skeptoid.com/episodes/4217


> "well known data base [sic] degradation"

Came in here to refute this as well. We've been running PostgreSQL on ZFS (ZoL) at MixRank, and it's been holding up very well. We switched from ext primarily for compression, which gives us ~2x savings on I/O throughput.


Is there any particular PostgreSQL or ZFS configuration that is necessary?

I vaguely remember something about turning off the ZIL for the data files, because the WAL provides the integrity. Or the other way round. Or something else.
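For what it's worth, the advice that usually circulates looks something like the sketch below; the pool/dataset names are placeholders, and none of this is a substitute for testing your own workload:

    # Data files: match Postgres's 8K page size; keep the WAL on its own dataset
    zfs create -o recordsize=8K -o logbias=throughput tank/pgdata
    zfs create -o recordsize=128K tank/pgwal
    zfs set compression=lz4 tank/pgdata
    zfs set atime=off tank/pgdata
    # The ZIL isn't "turned off"; people either add a dedicated SLOG device,
    # or (riskily) set sync=disabled on the data dataset and lean on the WAL.

Some setups also turn off full_page_writes in postgresql.conf on the theory that ZFS's copy-on-write already prevents torn pages, but verify that reasoning against your own ZFS version before relying on it.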


Under "ZFS Bad" should be listed:

- desperately needs defrag, which it does not have

If you fill up a ZFS filesystem, even to just 90% full, it will have crippling performance degradation. Further, freeing space to bring it below XX% will sort of solve the problem, but you will get that degradation sooner and sooner (85%, then 80%, etc.) as you refill it.
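If you want to watch for this, reasonably recent OpenZFS bits report a fragmentation estimate alongside capacity; a sketch, where "tank" is a placeholder pool name and the FRAG column needs the spacemap_histogram feature to be meaningful:

    zpool list -o name,size,alloc,free,frag,cap tank
    zpool get capacity,fragmentation tank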


> desperately needs defrag, which it does not have

OK, this is interesting to me, because it takes me back to the 1990s, when Windows users coming from Windows 95 (who were also Windows 3.11 and MS-DOS users prior to that) would come to Linux and, having been of the Power User variety back in Windows-land, would ask "How do I defrag my disks?"

Of course, by then Linux was on ext2 or similar, so the response was "Real filesystems don't need defragging. The kernel keeps track of free space automatically. Don't worry about it." And, generally, it worked pretty well. (Source: I like to fill disks. I've never seen a big slowdown even unto the point of "no space left on medium" errors.)

So why does ZFS, a Thoroughly Modern Filesystem by any account, need defragging?


You can still have fragmentation problems with ext or ZFS if your filesystem is nearly full. This has always been a problem. The issue with FAT was that fragmentation occurred much earlier than on other filesystems.


ext2 was set by default to disallow non-root writes once the disk was something like 90% or 95% full, as even it would fragment under typical usage when too full.


Defragging can be useful for some filesystems. XFS has a defragmenter (xfs_fsr).


Even ext4 has an online defragmenter these days.


Yup, ran into this in production. Great way to bring down a database server.


Been bitten by this one in production. We had to reboot the production box to clear it, which was a nontrivial thing to have to do. We expected better of Solaris. (We did get a ZFS dev in the loop, which was pretty cool. This was Solaris 10 in the Sun days.)


What causes the performance degradation at sooner and sooner levels of fullness?


Fragmentation.


This is very one-sided, and I agree with most of the commenters in that thread calling the original poster out. About half the bad/ugly points about ZFS simply aren't true.

But there is one crucial feature ZFS has which wasn't explicitly mentioned. The only feature that really matters: ZFS has been in production use for nearly a decade. On hundreds of thousands, if not millions, of servers. It has bled.

Putting on my sysadmin hat, that original poster just makes me wince. Of all the problems to deal with in production, problems caused by untested filesystems are at the bottom of the list of things I'd like to wake up to.


> ZFS has been in production use for nearly a decade.

The problem is that it's been about 4 years since any OpenSolaris source was released by Oracle. Since then, as a casual observer, I've seen mostly promises and fragmentation. It's not like the old days when Sun was the "one true steward" that drove ZFS development and fixed all the bugs.

If ZFS isn't "dead" or at least on life support, then tell me which distribution should I use? In the old days it was simple, either Solaris (stable, supported by Sun) or OpenSolaris (the leading edge). But now?

You can't answer "it depends" on what I'm doing with it, because that just proves that ZFS is nothing but fragmentation at this point.

It would be nice if OpenZFS could take the mantle and run with it. But the original poster raises some good questions. E.g.:

   If OpenZFS is a future what is their tier one
   development platform? illumos, FreeBSD, or God
   forbid Linux on which ZFS feels very awkward?


Use FreeBSD or illumos. Increasingly Linux is a fine option too.

OpenZFS is the mantle. It's the umbrella for ZFS projects across operating systems: http://open-zfs.org/wiki/Contributors

Dead it is not: http://open-zfs.org/wiki/Features

Indeed, there is some fragmentation, but that's being cleaned up thanks to the OpenZFS initiative.

Why does OpenZFS need a "tier one" development platform? Historically illumos led the way, but I rather think a cross-platform filesystem is more useful than one limited to a single OS.

In any case, none of this changes that ZFS has been in widespread production use for nearly a decade. HAMMER? Right.


If you're going to argue we can trust ZFS because it's been in production use for a decade, it's shittily disingenuous to try to paper over the fact that OpenZFS is a fork that diverged from upstream 4 years ago - almost half that timespan. And where is the development effort from that "10 years" going?


We run ZFS -- specifically, OpenZFS -- in production across all of our datacenters, as do our private cloud customers in theirs. We use it for everything including (notably) our Manta storage service.[1] There is nothing disingenuous in claiming that ZFS (OpenZFS!) has been in production use for a decade!

[1] http://open-zfs.org/w/images/0/03/Manta-Dave_Pacheco.pdf


That link doesn't inspire confidence, for two different reasons:

1) open-zfs.org points to 4 separate operating systems that support ZFS. To wit: illumos, FreeBSD, ZFS on Linux, OpenZFS on OS X. So what about all the other distributions out there, e.g. OpenIndiana?

Like I said: Fragmentation.

What are the odds that any two of the plethora of distributions are using the same ZFS bits? Which bugs are fixed in which distribution? Which post-Sun features are enabled in which distribution?

2) The PDF mentions ZFS panics and disk corruption, "a dark time" about a year ago. Fortuitously "Matt Ahrens and George Wilson diagnosed the problem".

Great! One of the inventors of ZFS and one of the ZFS tech leads from Sun were available. But what about if the average user has a problem? Maybe you can give us Matt's phone number? Is he available for pro bono consulting when Joe Sixpack has a problem?

I'd love to believe in ZFS. But it seems to me that, right now, it's only usable when supported by a big player. It's probably great if I go to Joyent and sign up for the Manta Storage Service. But if I have a problem while using ZFS in FreeBSD or in Linux, my only recourse would be to start whining on mailing lists.


> Great! One of the inventors of ZFS and one of the ZFS tech leads from Sun were available. But what about if the average user has a problem? Maybe you can give us Matt's phone number? Is he available for pro bono consulting when Joe Sixpack has a problem?

That was my talk. Your summary is a gross mischaracterization on many levels. I'd suggest watching the video.

To elaborate a bit: we have almost certainly pushed the code path in question enormously harder than anyone else. Our workload is not even close to representative of even heavy users' experiences. We saw the bug a handful of times in about sixteen million executions. The point isn't that ZFS is buggy, since all complex systems have latent bugs. The point is that if this had been a similar bug with almost any other filesystem, the result would have been unmitigated, undetected user data corruption. Instead, it was detected, characterized, and repaired before user data was ever affected. There was admittedly much more human intervention than we would have liked, but we never returned an incorrect byte to a user. That's a big deal, and that was my point.

In terms of support: your choice with any software is to support it yourself or hire others to support it. ZFS is no different in this regard.

I apologize if my slide gave the wrong impression, but I can't help but feel you twisted what was clearly incomplete information into a narrative of FUD. If you want the facts, and the talk doesn't answer your questions, do reach out.


First, you're conflating "cross-platform" with "fragmentation". Yes, ZFS is on illumos, FreeBSD, ZoL, etc. -- but thanks to both the common code base and the well-defined feature flags[1], there is no fragmentation. (You can even move zpools between systems: export a pool on one OS and import it on another -- something essentially unique to ZFS.)
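Mechanically, moving a pool is just an export on one side and an import on the other; something like the following sketch, where "tank" is a placeholder pool name:

    # On the source machine (say, FreeBSD):
    zpool export tank
    # Move the disks, then on the destination (say, illumos or Linux):
    zpool import tank
    # See which feature flags are actually enabled on the pool:
    zpool get all tank | grep feature@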

Second, with respect to our Manta experience, the point was not that we hit the issue but that working with the community, we resolved it. Manta hits ZFS way, way, way harder than you ever will (and in particular, hits rollback harder than anyone else ever has), and in that dimension, our experience should inspire confidence. But if you're still concerned, you have two choices: you can either buy ZFS support by buying a product that supports it (thereby making it the problem of one of the "big players" that you cite), or you can appeal to the mailing lists when you have a problem. In that regard, ZFS is no different than any other piece of open source software -- it just happens to be a pretty solid piece of open source software.

[1] http://blog.delphix.com/csiden/files/2012/01/ZFS_Feature_Fla...


> Great! One of the inventors of ZFS and one of the ZFS tech leads from Sun were available. But what about if the average user has a problem? Maybe you can give us Matt's phone number? Is he available for pro bono consulting when Joe Sixpack has a problem?

This works in exactly the same way as for any other file system: if you've got a serious issue with ext4 or XFS you will be happy if you have a RHEL or SLES subscription giving you competent support, and not so happy if you don't.


I'm pretty sure that ZFS's hefty RAM requirements only apply when you turn on dedupe. It's generally recommended that dedupe be left off unless you have a very good reason to want it on: it has hefty memory requirements and doesn't buy you much.


We ran ZFS in production with https://Clara.io, but we had to switch away since it appeared to be causing a lot of fragmentation as a result of our MongoDB workload. I think that the write pattern from MongoDB isn't that compatible with ZFS (or btrfs), or at least that is our belief.

I wrote about that here before: https://news.ycombinator.com/item?id=8777000


I have been using ZFS for many years and am pleased with it.

The only thing lacking at this point (and it's not ZFS's fault) is that it doesn't really have much integration with Linux. Right now it's a sort of bolt-on added after the initial installation. I'm sure that will be fixed in time.

I recently started to play with HAMMER under a VM, but I don't have enough experience to intelligently comment on it at this point.


All of this is incredibly out of date. ZFS on Linux uses the SPL, not FUSE, and growing filesystems on any of these platforms (FreeBSD, Linux, or Solaris) is easy.


What's the AMD stuff all about?


I have no idea. I see a lot of people building NAS systems with AMD procs in them.

I'm also unable to find anything on the mailing lists or the FreeNAS website that indicates problems with AMD.


I have very little experience with ZFS (all of it positive so far), but 16 GB of RAM to start? Isn't that a bit excessive? Or is that only for deduplication, which I don't use because higher-level stuff takes care of that for me?


With 4 GB, it won't do well, from experience.

FreeNAS, which is generally the set-and-forget ZFS option, wants 1 GB of RAM per 1 TB of data, but can cope with 1 GB per 3 TB. It's more of a cache issue with the cluster and drive size.

IIRC, 6 GB or 8 GB on FreeNAS enables the read-ahead caching and a few other tweaks. But it's still not fast.

Once the initial file write/read cache is empty, you get the regular throughput, which is 5-12 MB/sec on a slow system: basically however fast the CPU and HDD can search for and supply the data in surges.

Setting up an HP N54L MicroServer with 5x 4 TB drives and the standard 4 GB (getting the compatible unregistered ECC DDR is ridiculous), gigabit transfers just slow down to USB 1.0 speed. Booting up with 16 GB of ECC, I get 85 MB/sec gigabit transfers with no problems.

For a box whose job is storing movies, TV, music, and photos for Plex, it doesn't have to do a lot. Transcoding would be nice, but not worth the extra. All up, 20 TB was $1500: $400 on top of the WD Red NAS drives.


I configured a Supermicro board with a Pentium G2030 and 32 GB ECC, 4 x 3 TB drives in raidz, and 2 x 2 TB in a mirror. Copying multiple TB of data in either direction to the raidz steadily saturates the GigE link.


He claims ZFS is bad for databases - where is this documented?



From that post:

I will not make any comments on how this affects performance (I’ll save that for a future post). I also deliberately ignore ZFS caching and other optimizing features – the only thing I want to show right now is how much fragmentation is caused on physical disk by using ZFS for Oracle data files.

Any engineer should be able to appreciate why this is frustrating from the perspective of ZFS: if you're going to turn off all of the things added for performance, reasoning about performance isn't terribly interesting.

He did (kind of) post the promised follow up about eighteen months later[1], but it was running ancient ZFS bits (at least four years old when it was published) on a gallingly mistuned configuration (he only had 100M of ARC).

Again, what we know is our own experience from running Postgres on ZFS -- which has been refreshingly devoid of ZFS-related performance problems. And experience has taught me to trust my own production experience much more than naive microbenchmarking exercises...

[1] https://bartsjerps.wordpress.com/2014/08/11/oracle-asm-vs-zf...


It sounds like a more inherent result of ZFS being copy-on-write: if you run a pre-allocated, fixed-size database on top of it that applies random writes to existing records, it's going to fragment heavily over time where a write-in-place filesystem would not.

I think it's fair to examine this behavior without the very-much mediating features such as ARC, since ideally we want ARC boosting performance well above the block storage media, not just earning back lost performance from copy-on-write.

I think the key take-away is this: if you have spinning rust and expect to sequentially read a large database table from disk by primary key, then to the extent old records have been updated they will be fragmented, causing unexpected seeks that kill your read throughput, and there's no way to defrag it without taking the database offline.

It sounds like a serious factor to consider depending on the particular workload your database expects.


I believe we ran into this with https://Clara.io in production. I wrote about it here a few weeks ago: https://news.ycombinator.com/item?id=8777000


I have been running ZFS on Linux in production on many storage servers for almost 3 years now. I'd like to address several points from the forum post:

If you want ZFS you need lots of expensive ECC RAM. The fact that somebody is running ZFS in her/his laptop on 2GB of RAM will be of no importance to you when that 16TB storage pool fails during the ZFS scrub or de-duplication because you have insufficient RAM or the RAM is of low quality.

First, I doubt many have 16TB of disk with 2GB of RAM in their laptop.

Second, while it's true that ZFS will use all the RAM it can, this is a good thing, not a bad thing. The more you can cache in fast RAM, the less you have to rely on slow disk. I've never understood why people think this is a bad thing. The ZFS ARC is also highly configurable, and the system administrator can limit how much RAM it uses by setting hard limits.
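On ZFS on Linux, for example, that hard limit is just a module parameter; a sketch (the 8 GiB value is arbitrary):

    # Cap the ARC at 8 GiB (value is in bytes); applies at module load
    echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
    # Or adjust the running system
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max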

However, the claim that "you need lots of RAM" is bogus. You only need "lots of RAM" when you enable deduplication, as references to deduplicated blocks are stored in a table (the DDT), which lives in the ZFS ARC, which is indeed stored in RAM. If you do not have enough RAM to hold the DDT safely (a good rule of thumb is "5GB RAM for every 1TB of disk"), then the DDT will spill to platter and degrade overall pool performance. However, if you have a fast SSD as an L2ARC, say a PCI Express SSD, then the performance impact of DDT lookups is minimized.
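You don't have to guess at the DDT cost, either; the pool will report it (sketch; "tank" is a placeholder pool name):

    # Summary of DDT entries and their in-core / on-disk size
    zpool status -D tank
    # Full dedup table histogram (verbose)
    zdb -DD tank

At the commonly cited ~320 bytes per DDT entry and a 128K average block size, 1 TB of unique data is roughly 8 million entries, or about 2.5 GB of table; smaller average block sizes push that figure up fast, which is where multi-GB-per-TB rules of thumb come from.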

If you do not enable deduplication, then you do not need "lots of RAM". I am running ZFS on my 12GB workstation for /home/ without any problems, and initially it was installed with only 6GB. The upgrade was at the request of our CTO, who wanted to upgrade all the admin workstations, even if we didn't need it. Of course, I'll take advantage of it, and adjust my ARC limits, but I didn't need it.

Lastly, ZFS doesn't require ECC RAM any more than HAMMER or ext4 does. ECC is certainly a welcome advantage, as ZFS blindly trusts what comes out of RAM to be correct. But the argument for ECC applies to any situation where data integrity must be exact, 100% predictable, and redundant. ZFS isn't any more special here than any other filesystem, as far as RAM is concerned. In reality, every system should have ECC RAM; even laptops.

Well known data base degradation (not suitable to keep SQL data bases).

I have not heard of this. ZFS has very rigid and strict synchronous semantics. Data queued to be written is grouped into a transaction group (TXG), which is flushed to the pool in the background (by default every 5 seconds); synchronous writes are additionally logged to the ZFS intent log (ZIL), and the acknowledgement goes back only once they are on stable storage. This is the default behavior. When ZFS says it has your data, it has your data.
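The relevant knobs are per-dataset properties, and if synchronous write latency matters you can give the intent log its own device; a sketch, with placeholder pool, dataset, and device names:

    # Inspect the current sync and logbias settings
    zfs get sync,logbias tank/data
    # Add a dedicated (ideally mirrored) SLOG device for the ZIL
    zpool add tank log mirror /dev/slog0 /dev/slog1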

Now, if the concern is performance, ZFS does fragment over time due to the copy-on-write (COW) design. This is problematic with any COW filesystem, including Btrfs. ZFS minimizes its impact through its slab allocator. More details can be found on the author's blog: https://blogs.oracle.com/bonwick/entry/zfs_block_allocation

No file grained journaling like hammer history.

I'm not sure what the OP means here, but if it's related to snapshots, then it's false. Granted, snapshots are not taken by default when you create a ZFS dataset; you must run "zfs snapshot" manually, typically from cron. But ZFS snapshots are cheap, first-class filesystems that can be navigated to restore a specific file, or the entire dataset can be rolled back to a single snapshot.

Sun had Time Slider, which was baked into the operating system by default, and allowed you to roll back a ZFS dataset easily. Even though the OpenZFS team doesn't have Time Slider, we do have https://github.com/zfsonlinux/zfs-auto-snapshot which mimics its behavior. I set this up on all of my storage servers with the following policy:

Snapshot every

* 15 minutes, keeping the most recent 4

* 1 hour, keeping the most recent 24

* 1 day, keeping the most recent 30

* 1 week, keeping the most recent 8

* 1 month, keeping the most recent 12

This seems pretty fine grained to me.
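Even without the auto-snapshot tooling, the primitives are a handful of commands (sketch; dataset and snapshot names are placeholders):

    zfs snapshot tank/home@before-upgrade   # instant, and initially almost free
    zfs list -t snapshot                    # list what you have
    # Pull a single file back out of the hidden .zfs directory:
    cp /tank/home/.zfs/snapshot/before-upgrade/some.conf /tank/home/
    # Or roll the whole dataset back (discards changes made since the snapshot):
    zfs rollback tank/home@before-upgrade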

No volume growing at least FreeBSD version.

I'm not sure what the OP means here. Does he mean growing the pool, a dataset, or a ZVOL? In the case of growing the pool, as hard drives are replaced with larger capacity drives, once all the drives have been replaced, the pool can automatically expand to fit the larger capacity. "zpool set autoexpand=on pool" must be executed before replacing the drives.

If the OP means growing datasets, then I guess we're talking about quotas, as datasets by default can grow to the entire size of the pool. By default, if I have 2TB of ZFS pool space and 100 datasets, each of the 100 datasets can use the full 2TB of space. As datasets slowly fill, each dataset is aware of the storage occupied by the other datasets, so it knows how much free space remains on the pool.

If quotas are needed, then "zfs set quota=100G pool/data1" can be executed to limit that dataset to 100GB. If at any time you wish to expand, or shrink its quota, just re-execute the command again, with a new value.

Finally, if ZVOLs are the concern, after creation, they can be resized. Suppose a ZVOL was created with "zfs create -V 1G pool/swap" to be used as a 1GB swap device, and more swap is needed. Then "zfs set volsize=2G pool/swap" can be executed to grow it to 2GB, and it can be executed live.

The only thing I think the OP means is adding and removing VDEVs. This is tricky, and he's right that there are some limitations here. You can grow a pool by adding more VDEVs. Helping in #zfs and #zfsonlinux on Freenode, as well as on Twitter and the mailing lists, I have seen some funky ZFS pools. Usually, the administrator needs more space, so a couple more disks are added, mirrored, then thrown into their existing pool without too much concern about balancing. While you can add VDEVs all day long, once added, you cannot remove them. At that point, you must tear down the datasets and the pool, and rebuild it with the proper layout you want/need.

But, in terms of growing, I'm not sure what the OP is referring to.
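Concretely, growing by adding a vdev is a one-liner, which is exactly how people end up with lopsided pools; a sketch with placeholder device names:

    # Add a new mirror vdev; writes are striped across vdevs from here on
    zpool add tank mirror /dev/disk3 /dev/disk4
    # Note: there is no way to remove that top-level vdev again afterwards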

Legally associated with Oracle.

Maybe. There are basically two ZFS implementations out there: one that Oracle has legal rights to, and one it doesn't. Oracle's ZFS is proprietary software, and is the ZFS shipped with Oracle Solaris; currently it is at ZFS version 6 and pool version 35. The version used with FreeBSD, PCBSD, GNU/Linux, illumos, and the rest of the open-source operating systems forked from Sun ZFS at pool version 28 and ZFS version 5. Under the CDDL, it is legal to fork the software, and Oracle cannot make any copyright claims over the fork, as they are not the owner of the forked copyright.

Oracle may have some software patents over ZFS (I haven't looked) that could be troubling, but if so, Sun released ZFS to the community with the intent for the community to use it and make it better. If software patents on ZFS exist, my guess is they are royalty-free. Again though, I haven't looked this up, and it really doesn't bother me.

Native encryption for ZFS has been only available in Oracle version.

While true, this really hasn't been a problem. Most storage administrators who want encrypted ZFS datasets either put encryption under ZFS, such as LUKS, or on top of the dataset, like eCryptfs. Also, the OpenZFS codebase has "feature flags", which set up the framework for things like native encryption, and last I checked, it was actively being worked on. Initially, native encryption was not a priority, mostly due to the fear of losing backwards compatibility with the proprietary Oracle ZFS. I think that fear has become a reality, not because of anything the OpenZFS developers have done, but because Oracle is pushing forward with their proprietary ZFS without any thought of staying compatible with OpenZFS. In any event, this point is moot, and any storage administrator who needs/wants encryption with OpenZFS knows how to set it up with a separate utility.
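The "encryption under ZFS" route on Linux looks roughly like this (sketch; device, mapping, and pool names are placeholders):

    # Encrypt each backing device with LUKS, then build the pool on the mappings
    cryptsetup luksFormat /dev/sdb
    cryptsetup luksOpen /dev/sdb crypt-sdb
    zpool create tank /dev/mapper/crypt-sdb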

Upstream is dead. I don't know much about you but OpenZFS doesn't inspire lots of confidence in me.

This is news to me. OpenZFS is very actively developed. There is TRIM support for SSDs, and code is actively shared between the FreeBSD and GNU/Linux teams. There are OpenZFS developers from Illumos, SmartOS, OmniOS, as well as the LLNL contract, and plenty from the community. So, I guess "[citation needed]".


> "you needs lots of RAM" is bogus.

Depends on what performance you expect to get out of ZFS. I have 4GB in my NAS at home and 12TB of raw disk. It works fine for my needs. I wouldn't use this setup for production as my I/O isn't great unless it's sequential.

HAMMER deduplication only requires ~512MB RAM per 1TB you dedup. The rule of thumb for ZFS is generally 5GB RAM for 1TB of dedup. This is documented on the FreeBSD ZFS wiki.

https://wiki.freebsd.org/ZFSTuningGuide

> Well known data base degradation (not suitable to keep SQL data bases).

He's talking about performance degrading due to multiple layers of logs/journaling. If you configure your ZFS filesystem and database correctly this isn't a problem.

> No file grained journaling like hammer history.

He's talking about the fact that with HAMMER you can basically undelete individual files ... forever (assuming you have enough disk space). You don't need to manually do a snapshot; it retains the ability to undelete anything if you let it.

> Native encryption for ZFS has been only available in Oracle version.

Native encryption as implemented by Oracle is likely vulnerable to watermarking attacks.

You also only mentioned L2ARC once and overlooked that simply adding an L2ARC requires more RAM. A memory constrained system can actually lose performance simply by adding a large L2ARC. Most people don't know this as it's not widely documented.


If you want ZFS you need lots of expensive ECC RAM. (...) The rule of the thumb is that you typically need about 16GB of RAM to begin with ZFS and then about 1-2 GB for each additional TB of data.

Ugh... can someone please tell me who started this ZFS-ECC-UBER-RAM fetish? Whenever I read anything about ZFS, some gollum first has to scream at me about how super-important a ridiculous amount of RAM (especially ECC RAM) is. Why? Who said so? When?


Here's my take on it:

One of the main selling points for ZFS is the end-to-end checksum verification of file contents as well as all metadata. Everything in ZFS is checksummed, and verified during reads. Without ECC RAM, your data and checksums are vulnerable to bits being erroneously flipped in RAM. If this happens with non-ECC RAM, you're never going to know, and it will cause corruption. With ECC RAM, these events are not only detectable, but also fixable.
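That verification is also something you can exercise and inspect directly (sketch; "tank" is a placeholder pool name):

    zpool scrub tank       # walk every block in the pool and verify its checksum
    zpool status -v tank   # per-device READ/WRITE/CKSUM counters and any damaged files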

Regarding ZFS "requiring a huge amount of RAM". Most of this comes into play if you have dedup turned on. Then you really need a lot of RAM. Without dedup enabled, and if you're OK with mediocre performance, you can get by just fine with a "normal" amount of RAM. It's well-known that filesystem performance increases with more RAM - this is not something that is unique to ZFS.


ZFS is fairly profligate with its RAM usage, though. There are some good reasons for this, but ZFS will be dog slow compared to other filesystems in RAM-constrained situations, while behaving competitively in non-RAM-constrained situations.


Here is a well-balanced explanation of why ECC RAM matters. That being said, if the RAM is the weakest link, it doesn't matter what underlying FS is used.

https://pthree.org/2013/12/10/zfs-administration-appendix-c-...

In my experience, 16GB is only needed if you want to run deduplication.


> "if the RAM is the weakest link it doesn't matter what underlying FS is used"

When is this ever the case, though? Even non-ECC RAM is more reliable than hard drives except when a RAM chip has died, which most ECC systems don't handle well either. Full chipkill capability is pretty rare.


As I understand the reasoning, it's that ZFS checksumming creates a risk of false positives where non-checksumming filesystems wouldn't have them. So without ECC RAM you're trading the risk of losing data to your IO chain for the risk of losing data to increased RAM activity. With ECC RAM it's an unambiguous win.


It sounds to me like there must be some situation where ZFS reads data from disk, verifies the checksum, keeps it in memory where it could get corrupted, then recomputes the checksum on the corrupted version and writes either the corrupted data or the incorrect checksum to disk but not both, so that the data on disk is now guaranteed to not match the checksum on disk.

Without something like the above sequence of events, I don't see how ECC eliminates any class of errors, rather than just reducing their probability. So what actually triggers such a chain of events?


This is the paper on ECC RAM and ZFS:

http://research.cs.wisc.edu/adsl/Publications/zfs-corruption...

Basically, ZFS can survive anything except corruption of its data in RAM, in which case it can corrupt data on disk in ways there is no recovery from.

The amount of RAM required is mainly due to in-memory caching: ZFS has always been RAM hungry, and its ARC performs much better with lots of RAM. This has generally scaled well on most non-single-socket systems (i.e., anything Xeon E5 and above) that can take >32GB of memory.


Scary scenarios regarding "faulty ram destroys volume" can be found in various places, e.g.,

https://pthree.org/2013/12/10/zfs-administration-appendix-c-...

However, for a refutation of this scenario see XorNot's comment (and perhaps the surrounding discussion) at

https://news.ycombinator.com/item?id=8294434


My impression is that ZFS makes the absence of ECC RAM more obvious. Discrepancies in RAM stand out when you're carefully hashing your file system.


I thought at first that this would be testing ZFS' fault tolerance by striking drives with a hammer.


You can find a video of something like that happening at MeetBSD: https://www.youtube.com/watch?v=d1C6DELK7fc


That's a silly video. There's no need to destroy a perfectly functioning hard drive after it has been physically removed from the RAID.



