It's very nice to see someone putting in the work to document this and show others how they were able to get it to work.
But I unfortunately believe, quite sincerely, that redundancy at this level is misguided at best and dangerous at worst.
Instead of building storage that can fail over between nodes, consider building applications that can survive node loss and that don't rely on persisting their state in a black box that is assumed to be 100% available. These assumptions about the reliability of systems and their consistent behavior under network partitions have bitten countless people before and will continue to claim victims unless application paradigms are fundamentally rethought.
Many people already know this. Even people who build these systems know this, but too many applications already exist that can't be changed. Nonetheless, it's important for people building such systems for the first time to be aware that they can make a choice early in the design process and that the status quo is not a good choice. In other words, "Every day, somebody's born who's never seen The Flintstones." (https://mobile.twitter.com/hotdogsladies/status/760580954532...)
> Instead of building storage that can fail over between nodes, consider building applications that can survive node loss and that don't rely on persisting their state in a black box that is assumed to be 100% available. These assumptions about the reliability of systems and their consistent behavior under network partitions have bitten countless people before and will continue to claim victims unless application paradigms are fundamentally rethought.
I'd recommend this... except that it's really, really hard [0]. And if you think about it, you have to do this for every new application, which is a bit wasteful. So IMO, doing it at the filesystem level is a good trade-off. It's abstract enough to be useful over a wide range of use cases (unlike block storage), because file systems are, in effect, frameworks.
To be fair, block can work really well. I think block storage vendors have themselves to blame. Half the "standards" aren't standards, and interop can be terrible. I know I'd rather have simple, reliable storage - stop stuffing "enterprise" features into a rotting, increasingly technical-debt-laden codebase. All backup solutions for block suck - you'd think this problem would have been fixed by now, especially with cloud storage. I could go on.
Given the target use case of this kind of system, you control what you can control, which is the infra, because the software will overwhelmingly be industry-specific COTS stuff that realistically won't be changed.
Further, it's likely that the types of internal apps being run could tolerate a recovery period, which makes total loss less of an issue... this is "old school" operations: having DR plans, backups, and recovery time targets.
There are legitimate reasons to have a CP system and not an AP system; it's obviously application-specific. Also, over-engineered systems come with their own risks and can experience as much downtime as a well-set-up CP system due to human error stemming from the system's complexity (AWS is a good example).
Very good point. Quite often when talking about distributed systems, everyone goes to the CAP theorem and automatically assumes that if it is not AP, it must be CP. While in reality (and maybe for good reasons) it might be neither.
This is about a completely different type of VM application, and a completely different kind of storage architecture. This is about everything living on the storage box, not just application state. The two nodes export NFS shares, which host the root filesystems of the VMs running on the VM servers. Most likely, the two storage heads will sit on the same subnet as the VMware boxes, making a network partition that affects both nodes much, much less likely. Additionally, by having the sort of architecture being presented, you can do things like live migrations of VMs, which is useful when the underlying OS needs to be brought down for maintenance.
There are trade-offs with any architectural decision, and your idea results in a much more byzantine, complex, and expensive system for little to no gain at the scale presented. There are quite a few places where they just need a few servers, and going to your idea just makes things more expensive and more complicated.
> This is about a completely different type of VM application, and a completely different kind of storage architecture.
I understand it completely, and I vehemently disagree with everything about it. Providing networked block storage for virtual disks on top of NFS is a recipe for disaster and performance degradation. Just because "everyone" does it doesn't make me like it any more. Using redundant local disk (preferably with ZFS) gets around this problem. Blocks are actually blocks and failures and performance issues are confined to one node.
> Additionally, by having the sort of architecture being presented, you can do things like live migrations of VMs, which is useful when the underlying OS needs to be brought down for maintenance.
I find live migration to be a hack. Once again, solving the wrong problem at the wrong level.
> There are quite a few places where they just need a few servers, and going to your idea just makes things more expensive and more complicated.
Local disk with appropriate backups and application level failover makes a lot more sense and is fundamentally less complicated.
SAN storage and VMware (ESXi) have been around forever.
VMware will gladly provision VMs over any SAN storage. Live migration has been available for more than 10 years (it's called vMotion and it requires any paid edition).
Prefer local storage? Put any sort of local storage in the server, and VMware will gladly provision VMs on it.
What's the most important thing in a storage system... THE BACKUP!
For backups, there are snapshot capabilities well integrated into VMware (cold snapshots, live snapshots, incremental snapshots, whatever you name it).
All of this has existed for more than ten years. Standard protocols (iSCSI, Fibre Channel, others) + robust enclosures (all vendors have SANs) + battle-tested software and drivers (VMware).
A couple of servers and some TB of storage is nothing fancy. The cost of the setup is split between the servers + the labor + many SAS hard drives + the enclosure (a very small part).
It's really hard to take seriously an overly complex setup that is trying to imitate that over cheap untested hardware and software (NFS? seriously?).
Not only is it a recipe for disaster in production, but there aren't even any cost savings to be had.
The cost of complexity and labor will go through the roof with all the configuration. The hardware costs won't go down (you still need many drives and servers). Good luck debugging that and retrieving data when things go wrong.
Well. I suppose it's fun for home labs and that's about it :D
People will do whatever they want to do. And my comment is not a knock on HAST.
But myself, and I know I am not alone, firmly believe, especially regarding HA and ZFS, that following anything other than the KISS principle is pain upon pain upon pain.
Whether it's iSCSI, HAST, NFS, or any combination thereof.
The simplest and least painful? Simply build two servers with dual NICs and dual HBAs.
Assign half the disks in a mirror to HBA-1, the other half to HBA-2.
Use one NIC for public networking, and the other for a direct point-to-point Gig-E link between server A and server B.
Use zfs replication over the private network link using the granularity you feel most comfortable with. 30 minutes? 15 minutes? 5? It's up to you.
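A minimal sketch of what that replication could look like from cron on server A (the pool/dataset name, the peer hostname, and the schedule are placeholders; assumes SSH key auth to server B and that an initial full zfs send has already seeded the dataset there):

LAST=$(zfs list -H -t snapshot -d 1 -o name -s creation tank/data | tail -1)
NOW=tank/data@repl-$(date +%Y%m%d%H%M)
zfs snapshot "$NOW"
zfs send -i "$LAST" "$NOW" | ssh serverB zfs receive -F tank/data

Each run only ships the delta since the previous snapshot, so even a 5-minute interval stays cheap.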
This is the simplest, and IMO most reliable, way to go about this. When you start adding in iSCSI, NFS, HAST, etc. you are simply making things too complex, introducing multiple points of failure, and the pain when something does break? I'll pass.
I'd rather have to manually intervene and fail over to server B if server A fails (and vice versa) than deal with all the pain of additional things that could break or introduce split-brain issues, etc.
But it's a free world, use whatever solution fits your comfort zone.
> But myself, and I know I am not alone, firmly believe, especially regarding HA and ZFS, that following anything other than the KISS principle is pain upon pain upon pain.
Rarely have truer words been written. Highly available, highly reliable storage means keeping it as simple as possible, but no simpler. I wonder how long it will take before computer professionals at large realize this.
Yeah, I think only distributed filesystems like Ceph and Gluster really scale. File systems, even ZFS and Btrfs, don't scale because they don't deal with the fact that it's not always one or two hard drives dying that takes out the storage stack: it can be a network card, power supply, motherboard, etc. And unless you have an all-SSD array, recovery times aren't scaling either. One dead drive used to take maybe an hour to rebuild, and now it can be a day due to capacity growing faster than performance.
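As a rough back-of-the-envelope check (assuming ~160 MB/s of sustained sequential throughput for a large spinning disk): 8 TB / 160 MB/s ≈ 50,000 seconds, i.e. about 14 hours of pure sequential writing, before you add parity reconstruction overhead or any foreground I/O the array still has to serve.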
Isn't the rebuilding argument valid at the machine level as well, though? Granted, if your blocks are redundant it's possible to copy from multiple sources, but rebuilding is still limited by network speed.
On a properly designed distributed system, disk-replacement costs are spread across all machines and over time, making performance degradation a non-issue.
Ed White, the guy behind the Github repo, is a very experienced sysadmin (and the guy who knocked me off the top scoring slot on Server Fault). His experience supporting both COTS and custom applications in a variety of environments is top-notch.
To add to the downsides, you can't expand RAIDZ vdevs.
If you start with a 6 disk RAIDZ2 and want to add a couple more drives, you can't. The only way to add capacity to the pool is to add an entire new vdev.
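For example (hypothetical device names), the only way to add more drives to such a pool is to bolt on an entire additional RAIDZ2 vdev in one go:

zpool add tank raidz2 da6 da7 da8 da9 da10 da11

There is no equivalent of reshaping the existing 6-disk vdev into an 8-disk one.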
Unfortunately I don't think there's an expandable file system that handles parity/erasure coding as well as bitrot that can be used right now. BTRFS might be usable after the RAID5/6 rewrite and Ceph might be more usable after Bluestore comes along (at the moment erasure coding performance seems to be pretty bad, even with an SSD cache) but in the present there's nothing.
> Ceph might be more usable after Bluestore comes along (at the moment erasure coding performance seems to be pretty bad, even with an SSD cache)
I just saturated a 10Gb link in a k=4,m=2 EC configuration on Ceph (Haswell). Too lazy to set up concurrent clients. What do you mean by pretty bad? This is a Hammer cluster without SSD journals or cache.
Bluestore was recently released with Jewel, but shouldn't affect EC performance drastically, if at all.
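For anyone who wants to try a similar test, a k=4,m=2 pool can be created and benchmarked with something like this (profile/pool names and PG counts are arbitrary):

ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec42
rados bench -p ecpool 60 write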
Maybe I'm doing something horribly wrong. My setup was:
- 6x4TB HDD
- 2x120GB SSD
- OSD per disk
- k=4,m=2 erasure coded HDD pool
- SSD pool with replication size 1 (i.e. not actually replicated)
- SSD pool writeback cache for HDD pool
I was consistently seeing the SSD writeback cache fill up and then the disks thrashing at 100% IO utilization, limiting incoming writes to ~40MB/s (going from memory).
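(For reference, a writeback tier like that is normally wired up with something along these lines; pool names here are placeholders, not my exact setup:

ceph osd tier add hddpool ssdpool
ceph osd tier cache-mode ssdpool writeback
ceph osd tier set-overlay hddpool ssdpool

so all client IO hits ssdpool first and gets flushed down to the erasure-coded hddpool in the background.)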
The amount and type of work that would be required to take a 6-disk vdev to a 7-disk vdev (in place) is considerable. If you can take some downtime, you'd really want to snapshot, send to a new larger system, go down, do the final copy, and come back up. That, or work with more vdevs, but then you might have unevenly striped data.
And there are filesystems that are expandable and meet your other requirements; they just aren't cheap. IBM GPFS is one I work with regularly.
Mdadm (Linux's built-in software RAID) can expand a RAID array online, and has been able to do so for about a decade.
It's not fundamentally that complex of an operation, but there are enough details to be careful of that I don't see why it makes sense to reimplement it in every filesystem.
LVM2 is a filesystem-agnostic layer that gives you CoW snapshots on top of mdadm's striping/parity. From your list, that just leaves checksumming for bitrot, which you can do at the RAID level via mdadm check, or at the filesystem level with sha1sum --tag / sha1sum -c (although maybe that's not as nice as ZFS/btrfs).
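A sketch of both operations with hypothetical device names - growing a 6-disk array to 7 disks online, then triggering the consistency check mentioned above:

mdadm --add /dev/md0 /dev/sdh
mdadm --grow /dev/md0 --raid-devices=7
echo check > /sys/block/md0/md/sync_action

(The last line is the scrub that distro raid-check scripts run periodically; older kernels may also want a --backup-file for the reshape.)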
RAID controllers cannot be relied upon to do checksums, as most RAID controllers do not possess this capability, and of those which do, their firmware is opaque and buggy, and therefore cannot be trusted.
If you care about your data, never use hardware RAID, and always use ZFS or Oracle ASM.
mdadm(8) is a good start on GNU/Linux if one is forced to run on an (older) operating system version with only ext3 or XFS. However, while it provides administration consistency and scales to many systems, it does not provide data checksumming, and therefore offers no data corruption detection and no self-healing capabilities like Oracle ASM or (Open)ZFS.
You can expand capacity by replacing all the drives with bigger ones. It's obviously expensive and time-consuming (it has to be done one disk at a time so that each one can be resilvered), but it is possible.
> To add to the downsides, you can't expand RAIDZ vdevs.
> If you start with a 6 disk RAIDZ2 and want to add a couple more drives, you can't. The only way to add capacity to the pool is to add an entire new vdev.
This is mostly true, and in many situations is true for all practical purposes. Not fully understanding this recently cost me a few hundred dollars (2x 8TB external drives) when I needed to expand my 4-disk raidz2 to a 6-disk raidz2.
However, what _is_ possible is expanding a pool by incrementally replacing drives with larger-capacity equivalents. This wouldn't work in my situation as I was already using the largest consumer drives available - but the next time I need to expand in a few years it may be a possibility.
> ... expandable file system that handles parity/erasure coding as well as bitrot that can be used right now
Unfortunately I believe this to be the case. However, in practice, so long as you know up front that your data store is not expandable, it's fairly easy to work around without too much hassle. If having two zpools when you need to expand is a burden for your use case because the data will necessarily be partitioned across pools (i.e. in multiple directories), there are tools that will present multiple filesystems as a single mount point to Linux.
Alternatively, mirrored vdevs [0] as opposed to a classic raidz2 configuration are much easier to expand, as you only have to replace 2 drives to expand your total storage space (a single vdev) as opposed to having to replace all of your drives with higher capacity drives as with raidz2.
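For example (hypothetical device names), growing a pool of mirrors is just:

zpool add tank mirror da8 da9

or, for in-place growth, replace the two members of one mirror vdev with bigger disks, letting each resilver finish before swapping the next.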
> and Ceph might be more usable
After all that rambling about zfs, the main thing I wanted to touch on: I'd caution most people not to use ceph unless you already know you need it. For the average user, it introduces so much unnecessary complexity that it will just create future headaches when used in a non-distributed manner (if zfs is an alternative, it must be non-distributed). Which isn't to say ceph isn't useful; it's just built to be optimal at solving a slightly different problem than a filesystem on a single machine can solve.
Yeah. I had read about zfs to some degree before but somehow missed this until a couple of days ago. For all its flexibility this looks like a very bad limitation. I think this and rebuild time (again related to vdev width) make zfs very easy to mess up.
The goals of bcachefs seem very lofty, though there isn't much detail on the implications of its RAID design. It seems to suggest it solves the zfs fragmentation problem and gives better r/w performance with a simpler codebase. Anyone who understands the programmer's guide well enough to comment on the viability of the architecture? Does it really solve all the problems of a huge non-distributed filesystem capable of working from a single disk up to a big array?
I've been looking at SnapRAID and unRAID, which I believe use their own parity and handle expansion. Mostly because I'm currently using BTRFS RAID6 and might move off it.
I also have high hopes for bcachefs. I'm currently using bcache on my laptop and it's been great. BTRFS doesn't really have an easy SSD cache like ZFS does, and it's a hassle to convert over to a bcache-backed BTRFS setup.
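For the curious, a from-scratch bcache-backed setup is roughly (hypothetical device names; the hassle is that an existing filesystem can't be converted in place, since bcache needs its own superblock on the backing device):

make-bcache -C /dev/nvme0n1p1 -B /dev/sdb
mkfs.btrfs /dev/bcache0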
Am I reading the output right, that he is creating a three-way stripe over raidz1 vdevs? Seems like a bit of an odd configuration, especially for an HA solution. I mean, doesn't that mean the array can't handle two disk failures unless it gets lucky?
It means that the pool striped across the three RAIDZ1 vdevs can lose up to three disks simultaneously, at most one per RAIDZ1 vdev.
Any further failure would result in complete data loss. Likewise, if any of the RAIDZ1 vdevs loses more than one of its disks simultaneously, the entire pool is lost as well.
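The layout in question is essentially (hypothetical device names):

zpool create tank raidz1 da0 da1 da2 raidz1 da3 da4 da5 raidz1 da6 da7 da8

Each raidz1 group tolerates exactly one failed member; a second failure inside the same group takes the whole pool with it.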
This is a simple example setup in a 12-drive enclosure. 9 disks, a global hot-spare disk and two cache drives.
Previously, I would have used straight ZFS mirrors for performance and expandability (raidz VDEVs can't be expanded), but at this scale, the 9-disk 3x3 raidz1 setup consistently outperformed a 10-disk RAID mirror setup in random and sequential I/O.
A stripe over 5 mirror vdevs has the same problem the comment brought up - losing the _wrong_ two disks loses all of your data.
In this case, going to raidz2 across all 12 disks would give an extra disk's worth of capacity/performance with the ability to lose any 2 disks without data loss, and going to raidz3 should give the same capacity/performance with the ability to lose any 3 disks without data loss.
Always on the prowl for cheap "just a bunch of disks" enclosures for my Solaris / SmartOS systems, I learned a whole bunch of stuff from the links in the github readme.md; for example, that Sun made the J4400 / J4100 "just a bunch of disks" external serial attached small computer system interface arrays, and that these are dirt cheap on ebay:
further "digging" on ebay uncovered that these low profile host bus adapters can be had for as cheap as $18.95 USD plus shipping.
Since this stuff is professional grade equipment, the repossession companies are having a really, really hard time reselling this equipment: enterprises will not buy this because it is so specialized that they do not know about it (and they would much rather buy ultra expensive EMC with a support contract), and private persons will not buy this because most people do not even know what it is, or that it exists, never mind that it requires one to have a rack with professionally installed power and possibly cooling.
As a consequence, this equipment goes for peanuts on ebay, and is a way for someone who knows what they are doing to implement highly reliable, enterprise storage solutions for a mere, tiny fraction of the cost.
Not bad for five minutes worth of research, that was time well invested.
This is key. I mean, I focus on the HP side and used ProLiant servers and storage enclosures in my examples... But the Sun/Oracle enclosures and components are very inexpensive off-lease and come at a price-point where it's totally possible to self-support.
ZFS forces you to choose a drive size a priori and live with it for the life of the system. Thanks but no thanks. I'll stick with OpenStack Swift where I can leverage new larger disks and have my uncoupled highly available storage to boot.
That is not correct: one can easily increase the size of the pool by swapping in larger devices. As soon as the last device is replaced, the pool instantaneously possesses the upgraded capacity. No complicated grow commands, no filesystem expansions. It JustWorks(SM).
zpool set autoreplace=on pool_name
zpool set autoexpand=on pool_name
These properties must be set on the pool for this to work, either before or after the pool upgrade. Note that older versions of zpool(1M), for example zpool version 15, do not have the autoexpand property.
What cannot be done is a reduction in the pool's capacity, but that does not come into play here.
There is no "apriori size determination", as zpool(1M) uses the entire disks or logical units, and ZFS filesystems use only as much capacity as they require, and no more, leading to extremely efficient disk space utilization.
That is a lot of work compared to just adding a disk and rebalancing. No need to replace every disk in a vdev just to increase disk utilization to the entire disk when using Swift. No fear of losing a pool when losing a vdev.
Be careful here. Keep in mind that if any single vdev fails, the entire pool fails with it. There is no fault tolerance at the pool level, only at the individual vdev level!
I've been architecting and implementing highly fault tolerant, large scale storage solutions with ZFS since one week after it came out at the end of June 2006, but thank you, I'll be extra sure to keep that in mind.
Please do not assume that everyone here comes from GNU/Linux and knows next to nothing or only half-true anecdotes about (Open)ZFS. For example, I learned what and how in ZFS directly from the ZFS development team at Sun Microsystems: Roch Bourbonnais, Adam Leventhal, Eric Schrock, and Jeff Bonwick.
I didn't say "increase disk utilization with Swift"; I said ZFS does not use the whole disk when you add a new, larger disk unless you replace the entire vdev's worth of disks. Swift treats each disk individually and uses weighting to deal with heterogeneous disk sizes, which is much cleaner.
Replacement of entire disks to grow the pool is simply a different way of adding capacity, because ZFS is based on different paradigms than what one is used to with run of the mill implementations; however, in practice, it's a much simpler and therefore much more reliable way to extend capacity. The units are simply entire larger disks.
Swift is also how many orders of magnitude more complex?