It's very nice to see someone putting in the work to document this and show others how they were able to get it to work.
But I unfortunately believe, quite sincerely, that redundancy at this level is misguided at best and dangerous at worst.
Instead of building storage that can fail over between nodes, consider building applications that can survive node loss and that don't rely on persisting their state in a black box that is assumed to be 100% available. These assumptions about the reliability of systems and their consistent behavior under network partitions have bitten countless people before and will continue to claim victims unless application paradigms are fundamentally rethought.
Many people already know this. Even people who build these systems know this, but too many applications already exist that can't be changed. Nonetheless, it's important for people building such systems for the first time to be aware that they can make a choice early in the design process and that the status quo is not a good choice. In other words, "Every day, somebody's born who's never seen The Flintstones." (https://mobile.twitter.com/hotdogsladies/status/760580954532...)
> Instead of building storage that can fail over between nodes, consider building applications that can survive node loss and that don't rely on persisting their state in a black box that is assumed to be 100% available. These assumptions about the reliability of systems and their consistent behavior under network partitions have bitten countless people before and will continue to claim victims unless application paradigms are fundamentally rethought.
I'd recommend this... except that it's really, really hard [0]. And if you think about it, you have to do this for every new application, which is a bit wasteful. So IMO, doing it at the filesystem level is a good trade-off. It's abstract enough to be useful over a wide range of use cases (unlike block storage), because file systems are, in effect, frameworks.
To be fair, block can work really well. I think block storage vendors have themselves to blame. Half the "standards" aren't standards, and interop can be terrible. I know I'd rather have simple, reliable storage - stop stuffing "enterprise" features into a rotting, increasingly technical-debt-laden codebase. All backup solutions for block suck - you'd think this problem would have been fixed by now, especially with cloud storage. I could go on.
Given the target use case of this kind of system, you control what you can control, which is the infra, because the software will overwhelmingly be industry-specific COTS stuff that realistically won't be changed.
Further, it's likely that the types of internal apps being run could tolerate a recovery period, which makes total loss less of an issue... this is "old school" operations: having DR plans, backups, and recovery time targets.
There are legitimate reasons to have a CP system and not an AP system; it's obviously application-specific. Also, over-engineered systems come with their own risks and can experience as much downtime as a well-set-up CP system due to human error stemming from the system's complexity (AWS is a good example).
Very good point. Quite often when talking about distributed systems, everyone goes to the CAP theorem and automatically assumes that if it is not AP, it must be CP. While in reality (and maybe for good reasons) it might be neither.
This is about a completely different type of VM application, and a completely different kind of storage architecture. This is about everything living on the storage box, not just application state. The two nodes export NFS shares, which host the root filesystems of the VMs running on the VM servers. Most likely, the two storage heads will sit on the same subnet as the VMware boxes, making a network partition that affects both nodes much, much less likely. Additionally, by having the sort of architecture being presented, you can do things like live migrations of VMs, which is useful when the underlying OS needs to be brought down for maintenance.
There are trade-offs with any architectural decision, and your idea results in a much more byzantine, complex, and expensive system for little to no gain at the scale presented. There are quite a few places where they just need a few servers, and going to your idea just makes things more expensive and more complicated.
> This is about a completely different type of VM application, and a completely different kind of storage architecture.
I understand it completely, and I vehemently disagree with everything about it. Providing networked block storage for virtual disks on top of NFS is a recipe for disaster and performance degradation. Just because "everyone" does it doesn't make me like it any more. Using redundant local disk (preferably with ZFS) gets around this problem. Blocks are actually blocks and failures and performance issues are confined to one node.
> Additionally, by having the sort of architecture being presented, you can do things like live migrations of VMs, which is useful when the underlying OS needs to be brought down for maintenance.
I find live migration to be a hack. Once again, solving the wrong problem at the wrong level.
> There are quite a few places where they just need a few servers, and going to your idea just makes things more expensive and more complicated.
Local disk with appropriate backups and application level failover makes a lot more sense and is fundamentally less complicated.
SAN storage and VMware (ESXi) have been around forever.
VMware will gladly provision VMs over any SAN storage. Live migration has been available for more than 10 years (it's called vMotion and it requires any paid edition).
Prefer local storage? Put any sort of local storage in the server, and VMware will gladly provision VMs on it.
What's the most important thing in a storage system... THE BACKUP!
For backups, there are snapshot capabilities well integrated into VMware (cold snapshots, live snapshots, incremental snapshots, whatever you name it).
All of this has existed for more than ten years. Standard protocols (iSCSI, Fibre Channel, others) + robust enclosures (all vendors have SANs) + battle-tested software and drivers (VMware).
A couple of servers and some TB of storage is nothing fancy. The cost of the setup is split between the servers + the labor + many SAS hard drives + the enclosure (a very small part).
It's really hard to take seriously an overly complex setup that is trying to imitate that over cheap untested hardware and software (NFS? seriously?).
Not only is it a recipe for disaster in production, but there aren't even any cost savings to be had.
The cost of complexity and labor will go through the roof with all the configuration. The hardware costs won't go down (you still need many drives and servers). Good luck debugging that and retrieving data when things go wrong.
Well. I suppose it's fun for home labs and that's about it :D
People will do whatever they want to do. And my comment is not a knock on HAST.
But myself, and I know I am not alone, firmly believe, especially regarding HA and ZFS, that following anything other than the KISS principle is pain upon pain upon pain.
Whether it's iSCSI, HAST, NFS, or any combination thereof.
The simplest and least painful? Simply build two servers with dual NICs and dual HBAs.
Assign half the disks in a mirror to HBA-1, the other half to HBA-2.
Use one NIC for public networking, and the other for a direct point-to-point Gig-E link between server A and server B.
Use zfs replication over the private network link using the granularity you feel most comfortable with. 30 minutes? 15 minutes? 5? It's up to you.
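A minimal sketch of what that replication could look like from cron on server A (the pool/dataset name, the peer hostname, and the schedule are placeholders; assumes SSH key auth to server B and that an initial full zfs send has already seeded the dataset there):

LAST=$(zfs list -H -t snapshot -d 1 -o name -s creation tank/data | tail -1)
NOW=tank/data@repl-$(date +%Y%m%d%H%M)
zfs snapshot "$NOW"
zfs send -i "$LAST" "$NOW" | ssh serverB zfs receive -F tank/data

Each run only ships the delta since the previous snapshot, so even a 5-minute interval stays cheap.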
This is the simplest, and IMO most reliable, way to go about this. When you start adding in iSCSI, NFS, HAST, etc. you are simply making things too complex, introducing multiple points of failure, and the pain when something does break? I'll pass.
I'd rather have to manually intervene and fail over to server B if server A fails (and vice versa) than deal with all the pain of additional things that could break or introduce split-brain issues, etc.
But it's a free world, use whatever solution fits your comfort zone.
> But myself, and I know I am not alone, firmly believe, especially regarding HA and ZFS, that following anything other than the KISS principle is pain upon pain upon pain.
Rarely have truer words been written. Highly available, highly reliable storage means keeping it as simple as possible, but no simpler. I wonder how long it will take before computer professionals at large realize this.
Yeah, I think only distributed filesystems like Ceph and Gluster really scale. File systems, even ZFS and Btrfs, don't scale because they don't deal with the fact that it's not always one or two hard drives dying that takes out the storage stack: it can be a network card, power supply, motherboard, etc. And unless you have an all-SSD array, recovery times aren't scaling either. One dead drive used to take maybe an hour to rebuild, and now it can be a day due to capacity growing faster than performance.
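As a rough back-of-the-envelope check (assuming ~160 MB/s of sustained sequential throughput for a large spinning disk): 8 TB / 160 MB/s ≈ 50,000 seconds, i.e. about 14 hours of pure sequential writing, before you add parity reconstruction overhead or any foreground I/O the array still has to serve.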
Isn't the rebuilding argument valid at the machine level as well, though? Granted, if your blocks are redundant it's possible to copy from multiple sources, but rebuilding is still limited by network speed.
On a properly designed distributed system, disk-replacement costs are spread across all machines and over time, making performance degradation a non-issue.
Ed White, the guy behind the Github repo, is a very experienced sysadmin (and the guy who knocked me off the top scoring slot on Server Fault). His experience supporting both COTS and custom applications in a variety of environments is top-notch.
To add to the downsides, you can't expand RAIDZ vdevs.
If you start with a 6 disk RAIDZ2 and want to add a couple more drives, you can't. The only way to add capacity to the pool is to add an entire new vdev.
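For example (hypothetical device names), the only way to add more drives to such a pool is to bolt on an entire additional RAIDZ2 vdev in one go:

zpool add tank raidz2 da6 da7 da8 da9 da10 da11

There is no equivalent of reshaping the existing 6-disk vdev into an 8-disk one.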
Unfortunately I don't think there's an expandable file system that handles parity/erasure coding as well as bitrot that can be used right now. BTRFS might be usable after the RAID5/6 rewrite and Ceph might be more usable after Bluestore comes along (at the moment erasure coding performance seems to be pretty bad, even with an SSD cache) but in the present there's nothing.
> Ceph might be more usable after Bluestore comes along (at the moment erasure coding performance seems to be pretty bad, even with an SSD cache)
I just saturated a 10Gb link in a k=4,m=2 EC configuration on Ceph (Haswell). Too lazy to set up concurrent clients. What do you mean by pretty bad? This is a Hammer cluster without SSD journals or cache.
Bluestore was recently released with Jewel, but shouldn't affect EC performance drastically, if at all.
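For anyone who wants to try a similar test, a k=4,m=2 pool can be created and benchmarked with something like this (profile/pool names and PG counts are arbitrary):

ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec42
rados bench -p ecpool 60 write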
Maybe I'm doing something horribly wrong. My setup was:
- 6x4TB HDD
- 2x120GB SSD
- OSD per disk
- k=4,m=2 erasure coded HDD pool
- SSD pool with replication size 1 (i.e. not actually replicated)
- SSD pool writeback cache for HDD pool
I was consistently seeing the SSD writeback cache fill up and then the disks thrashing at 100% IO utilization, limiting incoming writes to ~40MB/s (going from memory).
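(For reference, a writeback tier like that is normally wired up with something along these lines; pool names here are placeholders, not my exact setup:

ceph osd tier add hddpool ssdpool
ceph osd tier cache-mode ssdpool writeback
ceph osd tier set-overlay hddpool ssdpool

so all client IO hits ssdpool first and gets flushed down to the erasure-coded hddpool in the background.)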
The amount and type of work that would be required to take a 6-disk vdev to a 7-disk vdev (in place) is considerable. If you can take some downtime, you'd really want to snapshot, send to a new larger system, go down, do the final copy, and come back up. That, or work with more vdevs, but then you might have unevenly striped data.
And there are filesystems that are expandable and meet your other requirements; they just aren't cheap. IBM GPFS is one I work with regularly.
Mdadm (Linux's built-in software RAID) can expand a RAID array online, and has been able to do so for about a decade.
It's not fundamentally that complex of an operation, but there are enough details to be careful of that I don't see why it makes sense to reimplement it in every filesystem.
LVM2 is a filesystem-agnostic layer that gives you CoW snapshots on top of mdadm's striping/parity. From your list, that just leaves checksumming for bitrot, which you can do at the RAID level via mdadm check, or at the filesystem level with sha1sum --tag / sha1sum -c (although maybe that's not as nice as ZFS/btrfs).
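A sketch of both operations with hypothetical device names - growing a 6-disk array to 7 disks online, then triggering the consistency check mentioned above:

mdadm --add /dev/md0 /dev/sdh
mdadm --grow /dev/md0 --raid-devices=7
echo check > /sys/block/md0/md/sync_action

(The last line is the scrub that distro raid-check scripts run periodically; older kernels may also want a --backup-file for the reshape.)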
RAID controllers cannot be relied upon to do checksums, as most RAID controllers do not possess this capability, and of those which do, their firmware is opaque and buggy, and therefore cannot be trusted.
If you care about your data, never use hardware RAID, and always use ZFS or Oracle ASM.
mdadm(8) is a good start on GNU/Linux if one is forced to run on an (older) operating system version with only ext3 or XFS. However, while it provides administration consistency and scales to many systems, it does not provide data checksumming, and therefore offers no data corruption detection and no self-healing capabilities like Oracle ASM or (Open)ZFS.
You can expand capacity by replacing all the drives with bigger ones. It's obviously expensive and time-consuming (it has to be done one disk at a time so that each one can be resilvered), but it is possible.
> To add to the downsides, you can't expand RAIDZ vdevs.
> If you start with a 6 disk RAIDZ2 and want to add a couple more drives, you can't. The only way to add capacity to the pool is to add an entire new vdev.
This is mostly true, and in many situations is true for all practical purposes. Not fully understanding this recently cost me a few hundred dollars (2x 8TB external drives) when I needed to expand my 4-disk raidz2 to a 6-disk raidz2.
However, what _is_ possible is expanding a pool by incrementally replacing drives with larger-capacity equivalents. This wouldn't work in my situation as I was already using the largest consumer drives available - but the next time I need to expand in a few years it may be a possibility.
> ... expandable file system that handles parity/erasure coding as well as bitrot that can be used right now
Unfortunately I believe this to be the case. However, in practice, so long as you know up front that your data store is not expandable, it's fairly easy to work around without too much hassle. If having two zpools when you need to expand is a burden for your use case because the data will necessarily be partitioned across pools (i.e. in multiple directories), there are tools that will present multiple filesystems as a single mount point to Linux.
Alternatively, mirrored vdevs [0] as opposed to a classic raidz2 configuration are much easier to expand, as you only have to replace 2 drives to expand your total storage space (a single vdev) as opposed to having to replace all of your drives with higher capacity drives as with raidz2.
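For example (hypothetical device names), growing a pool of mirrors is just:

zpool add tank mirror da8 da9

or, for in-place growth, replace the two members of one mirror vdev with bigger disks, letting each resilver finish before swapping the next.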
> and Ceph might be more usable
After all that rambling about zfs, the main thing I wanted to touch on: I'd caution most people not to use ceph unless you already know you need it. For the average user, it introduces so much unnecessary complexity that it will just create future headaches when used in a non-distributed manner (if zfs is an alternative, it must be non-distributed). Which isn't to say ceph isn't useful; it's just built to be optimal at solving a slightly different problem than a filesystem on a single machine can solve.
Yeah. I had read about zfs to some degree before but somehow missed this until a couple of days ago. For all its flexibility this looks like a very bad limitation. I think this and rebuild time (again related to vdev width) make zfs very easy to mess up.
The goals of bcachefs seem very lofty, though there isn't much detail on the implications of its RAID design. It seems to suggest it solves the zfs fragmentation problem and gives better r/w performance with a simpler codebase. Anyone who understands the programmer's guide well enough to comment on the viability of the architecture? Does it really solve all the problems of a huge non-distributed filesystem capable of working from a single disk up to a big array?
I've been looking at SnapRAID and unRAID, which I believe use their own parity and handle expansion. Mostly because I'm currently using BTRFS RAID6 and might move off it.
I also have high hopes for bcachefs. I'm currently using bcache on my laptop and it's been great. BTRFS doesn't really have an easy SSD cache like ZFS does, and it's a hassle to convert over to a bcache-backed BTRFS setup.
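For the curious, a from-scratch bcache-backed setup is roughly (hypothetical device names; the hassle is that an existing filesystem can't be converted in place, since bcache needs its own superblock on the backing device):

make-bcache -C /dev/nvme0n1p1 -B /dev/sdb
mkfs.btrfs /dev/bcache0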
Am I reading the output right, that he is creating a three-way stripe over raidz1 vdevs? Seems like a bit of an odd configuration, especially for an HA solution. I mean, doesn't that mean the array can't handle two disk failures unless it gets lucky?
It means that the pool striped across the three RAIDZ1 vdevs can lose up to three disks simultaneously, at most one per RAIDZ1 vdev.
Any further failure would result in complete data loss. Likewise, if any of the RAIDZ1 vdevs loses more than one of its disks simultaneously, the entire pool is lost as well.
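The layout in question is essentially (hypothetical device names):

zpool create tank raidz1 da0 da1 da2 raidz1 da3 da4 da5 raidz1 da6 da7 da8

Each raidz1 group tolerates exactly one failed member; a second failure inside the same group takes the whole pool with it.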
This is a simple example setup in a 12-drive enclosure. 9 disks, a global hot-spare disk and two cache drives.
Previously, I would have used straight ZFS mirrors for performance and expandability (raidz VDEVs can't be expanded), but at this scale, the 9-disk 3x3 raidz1 setup consistently outperformed a 10-disk RAID mirror setup in random and sequential I/O.
A stripe over 5 mirror vdevs has the same problem the comment brought up - losing the _wrong_ two disks loses all of your data.
In this case, going to raidz2 across all 12 disks would give an extra disk's worth of capacity/performance with the ability to lose any 2 disks without data loss, and going to raidz3 should give the same capacity/performance with the ability to lose any 3 disks without data loss.
Always on the prowl for cheap "just a bunch of disks" enclosures for my Solaris / SmartOS systems, I learned a whole bunch of stuff from the links in the github readme.md; for example, that Sun made the J4400 / J4100 "just a bunch of disks" external serial attached small computer system interface arrays, and that these are dirt cheap on ebay:
further "digging" on ebay uncovered that these low profile host bus adapters can be had for as cheap as $18.95 USD plus shipping.
Since this stuff is professional grade equipment, the repossession companies are having a really, really hard time reselling this equipment: enterprises will not buy this because it is so specialized that they do not know about it (and they would much rather buy ultra expensive EMC with a support contract), and private persons will not buy this because most people do not even know what it is, or that it exists, never mind that it requires one to have a rack with professionally installed power and possibly cooling.
As a consequence, this equipment goes for peanuts on ebay, and is a way for someone who knows what they are doing to implement highly reliable, enterprise storage solutions for a mere, tiny fraction of the cost.
Not bad for five minutes worth of research, that was time well invested.
This is key. I mean, I focus on the HP side and used ProLiant servers and storage enclosures in my examples... But the Sun/Oracle enclosures and components are very inexpensive off-lease and come at a price-point where it's totally possible to self-support.
ZFS forces you to choose a drive size a priori and live with it for the life of the system. Thanks but no thanks. I'll stick with OpenStack Swift where I can leverage new larger disks and have my uncoupled highly available storage to boot.
That is not correct: one can easily increase the size of the pool by swapping in larger devices. As soon as the last device is replaced, the pool instantaneously possesses the upgraded capacity. No complicated grow commands, no filesystem expansions. It JustWorks(SM).
zpool set autoreplace=on pool_name
zpool set autoexpand=on pool_name
These properties must be set on the pool for this to work, either before or after the pool upgrade. Note that older versions of zpool(1M), for example zpool version 15, do not have the autoexpand property.
What cannot be done is a reduction in the pool's capacity, but that does not come into play here.
There is no "apriori size determination", as zpool(1M) uses the entire disks or logical units, and ZFS filesystems use only as much capacity as they require, and no more, leading to extremely efficient disk space utilization.
That is a lot of work compared to just adding a disk and rebalancing. No need to replace every disk in a vdev just to increase disk utilization to the entire disk when using Swift. No fear of losing a pool when losing a vdev.
Be careful here. Keep in mind that if any single vdev fails, the entire pool fails with it. There is no fault tolerance at the pool level, only at the individual vdev level!
I've been architecting and implementing highly fault tolerant, large scale storage solutions with ZFS since one week after it came out at the end of June 2006, but thank you, I'll be extra sure to keep that in mind.
Please do not assume that everyone here comes from GNU/Linux and knows next to nothing or only half-true anecdotes about (Open)ZFS. For example, I learned what and how in ZFS directly from the ZFS development team at Sun Microsystems: Roch Bourbonnais, Adam Leventhal, Eric Schrock, and Jeff Bonwick.
I didn't say "increase disk utilization with Swift"; I said ZFS does not use the whole disk when you add a new, larger disk unless you replace the entire vdev's worth of disks. Swift treats each disk individually and uses weighting to deal with heterogeneous disk sizes, which is much cleaner.
Replacement of entire disks to grow the pool is simply a different way of adding capacity, because ZFS is based on different paradigms than what one is used to with run of the mill implementations; however, in practice, it's a much simpler and therefore much more reliable way to extend capacity. The units are simply entire larger disks.
Swift is also how many orders of magnitude more complex?