Magical Block Store: Why EBS Can't Work

blantonl · on April 24, 2011

I am an active user of EBS on a highly trafficked Web properly, and came from a long and tedious background in enterprise software.

I really think that one paragraph in his blog post summed everything up quite nicely. It could not ring more true:

My opinion is that the only reason the big enterprise storage vendors have gotten away with network block storage for the last decade is that they can afford to over-engineer the hell out of them and have the luxury of running enterprise workloads, which is a code phrase for “consolidated idle workloads.” When the going gets tough in enterprise storage systems, you do capacity planning and make sure your hot apps are on dedicated spindles, controllers, and network ports.

edw · on April 25, 2011

This awesome entry perfectly captures why I have always hated NFS. I can deal with the possibility that if a machine's hard drive dies, my system is going to have a very hard time continuing to operate in a normal manner, but then NFS comes along, and you realize that all sorts of I/O operations that previously employed a piece of equipment that failed once every two and a half years now depend on a working network with a working NFS server on that network, and the combination of that network and that server are orders of magnitude less reliable.

And now you have situations on a regular basis where you type "ls" and you shell hangs and not even "kill -9" is going to save you. And you go back to using FTP or some other abstraction that does not apply 40,000 hour MTBF thinking to equipment that disappears for coffee breaks daily.

chubot · on April 25, 2011

A thousand times yes. My first thought when hearing about the EBS outage was "wow that seems more fragile than NFS, no wonder it failed spectacularly." NFS presents you with this nice familiar file system interface, and then random sys admins and programmers start creating a tangled mess of dependencies by dropping stuff there, without regard to what happens when it fails. Like the EBS outage, the failures tend to surprise people.

The great quote by Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

And this was an excellent and honest article about faulty programming abstractions. It's basically bashing you over the head with the "Fallacies of distributed computing". Don't silently turn local operations into remote operations. They're not the same thing and have to be treated differently at all levels.

Even Werner Vogels wrote a diatribe against "transparency", which is the same issue by another name: http://scholar.google.com/scholar?cluster=700969849916494972...

So I wonder what he thinks of this architectural choice. You have to give up something when communicating over the network. Vogels seems to have chosen consistency rather than availability in his designs. This paper was a turning point in his research. Its candor surprised me.

The file system interface does not let you relax consistency, so by default you have chosen availability. As the Joyent guys honestly remarked, this often has to be learned the hard way.

agazso · on April 25, 2011

I hate NFS for the same reasons that you just wrote, but to be honest it is not a protocol issue, but rather an implementation issue.

If NFS were implemented totally in userspace (like FTP), it would not hang the entire system when something breaks. On the other hand, it would be much slower than it is, therefore it would be unsuitable for a lot of use-cases where it is used now.

moe · on April 25, 2011

NFS hangs the system on failures because it is a shoddy implementation, not because it happens to be implemented in kernel space.

I think the old CVS quote by Tom Lord applies here:

  CVS has some strengths. It's a very stable piece of
  code, largely because nobody wants to work on it anymore.

ssmoot · on April 25, 2011

I don't know how NFS keeps coming up. It's an entirely different use case. It doesn't help the credibility of a critique on networked block storage to harp on a vendor specific implementation of a technology that doesn't even operate in the same sphere.

An NFS server is very simple. With NFS on it's own VLAN, and some very basic QoS, there's no reason an NFS server should be the weak point in your infrastructure. Especially since it's resilient to disconnection on a flaky network.

If you're looking for 100% availability, sure, NFS is probably not the answer. If on the other hand you're running a website, and would rather trade a few bad requests for high-availability and portability, then NFS can be a great fit.

None of that has anything to do with EBS or block-storage though.

Joyent's position is that iSCSI was flaky for them because of unpredictable loads on under-performing equipment. The situation would degrade to the point that they could only attach a couple VM hosts to a pair of servers for example, and they were slicing the LUNs on the host, losing the flexibility networked block-storage provides for portability between systems.

Here's what we do:

We export an 80GB LUN for every running application from two SAN systems.

These systems are home-grown, based on Nexenta Core Platform v3. We don't use de-dupe since the DDT kills performance (and if Joyent was using it, then is local storage without it really a fair comparison?). We provide SSDs for ZIL and ARCL2.

These LUNs are then mirrored on the Dom0. That part is key. Most storage vendors want to create a black-box, bullet-proof "appliance". That's garbage. If it worked maybe it wouldn't be a problem, but in practice these things are never bullet-proof, and a failover in the cluster can easily mean no availability for the initiators for some short period of time. If you're working with Solaris 10, this can easily cause a connection timeout. Once that happens you must reboot the whole machine even if it's just one offline LUN.

It's a nightmare. Don't use Solaris 10.

snv_134 will reconnect eventually. Much smoother experience. So you zpool mirror your LUNs. Now you can take each SAN box offline for routine maintenance without issue. If one of them out-right fails, even with dozens of exported LUNs you're looking at a minute or two while the Dom0 compensates for the event and stops blocking IO.

These systems are very fast. Much faster than local storage is likely to me without throwing serious dollars at it.

These systems are very reliable. Since they can be snapshotted independently, and the underlying file-systems are themselves very reliable, the risk of data-loss is so small as to be a non-issue.

They can be replicated easily to tertiary storage, or offline incremental backup easily.

To take the system out, would require a network melt-down.

To compensate for that you spread link-aggregated connections across stacked switches. If a switch goes down, you're still operational. If a link goes down, you're still operational. The SAN interfaces are on their own VLAN, and the physical interfaces are dedicated to the Dom0. The DomU's are mapped to their own shared NIC.

The Dom0, or either of it's NICs is still a single point of failure. So you make sure to have two of them. Applications mount HA-NFS shares for shared media. You don't depend on stupid gimmicks like live-migration. You just run multiple app instances and load-balance between them.

You quadruple your (thinly provisioned) storage requirements this way, but this is how you build a bullet-proof system using networked storage (both block (iSCSI) and filesystem (NFS)) for serving web-applications.

If you pin yourself to local storage you have massive replication costs, you commit yourself to very weak recovery options. Locality of your data kills you when there's a problem. You're trading effective capacity planning for panic fixes when things don't go so smoothly.

This is why it takes forever to provision anything at Rackspace Cloud, and when things go wrong, you're basically screwed.

Because instead of proper planning, they'd rather not have to concern themselves with availability of your systems/data.

It's not a walk in the park, but if you can afford to invest in your own infrastructure and skills, you can achieve results that are better in every way.

Sure, you might not be able to load a dozen high-traffic Dom0's onto these SAN systems, but that matters mostly if you're trying to squeeze margins as a hosting provider. Their problems are not ours...

chubot · on April 25, 2011

The point of the article is that you are taking an ancient interface and using it for something new. Millions of lines of code was written against that interface with old assumptions, and now you've moved it to a new implementation without changing any of it. Things are bound to go wrong.

When you move sqlite to NFS, for example, file locking probably won't work. There is nothing to tell you this.

It sounds like you have experience making NFS work well, but I don't see how anything you wrote addresses this point. In fact I think you're just echoing some of the article's points about "enterprise planning". AFAICT you come from the enterprise world and are advocating overprovisioning, which is fine, but not the same context.

ssmoot · on April 25, 2011

I work at a small shop who was badly burned by Sun/Oracle. :-)

It's not that I believe in overprovisioning I think. It's that if data is really that critical, and it's availability is critical, then that has to be taken into account during planning.

Everything fails at some point. The Enterprise Storage Vendors would have you believe their stuff doesn't. In practice it's pretty scary when the black box doesn't work as advertised anymore though _after_ you've made it the centerpiece of your operations.

So with those lessons learned, our replacement efforts took into account the level of availability we wanted to achieve.

I did go off on an NFS tanget. Sorry. But this article was about block-storage, which is a different beast from what you describe.

Seeing all networked storage lumped together is like seeing: FastCGI isn't 100% reliable, which is why I hate two-phase-commits.

edw · on April 25, 2011

I brought up NFS because it's an example of a service that implements an abstraction but does so in a way that undermines the assumptions of the implemented abstraction. I do not disagree that local disks are an unrealistic strategy for creating a scalable, fault-tolerant system. The disk abstraction is of limited utility when creating such systems, because "disk thinking" leads to giving in to seductive assumptions about the performance and reliability of the storage resources you have at your disposal.

mkramlich · on April 25, 2011

> I brought up NFS because it's an example of a service that implements an abstraction but does so in a way that undermines the assumptions of the implemented abstraction.

yep. NFS and the like make you more vulnerable to the Fallacies of Distributed Computing.

prodigal_erik · on April 24, 2011

He didn't touch on Joyent's 2+ day partial outage a couple months ago: http://news.ycombinator.com/item?id=2269329

jamie · on April 24, 2011

Don't forget about this: http://www.datacenterknowledge.com/archives/2008/01/15/joyen...

jamie · on April 24, 2011

I think both of these links illustrate that errors happen, mistakes happen, software has bugs, and murphy's law always strikes. The question is, when it strikes, do you have enough control to fix the problem? If you've outsourced the solution, does the provider have enough control/knowledge to fix the problem?

These things will get much worse before they get better, and it's best to think of all these abstractions as being a double edge sword.

SoftwareMaven · on April 24, 2011

Many things in software are impossible magic, until they are not. His argument boils down to "it is a hard problem that nobody has solved yet." That doesn't mean nobody will ever solve it.

Regardless, I do agree that building your application today like it is a solved problem is the wrong way to do it.

blantonl · on April 24, 2011

Regardless, I do agree that building your application today like it is a solved problem is the wrong way to do it.

That presumption assumes that the application is being used as the right tool to resolve the problem. And it also assumes that "the problem" is a finite and solvable item.

sigil · on April 25, 2011

> And it also assumes that "the problem" is a finite and solvable item.

Yes. To make this a bit more concrete, if "the problem" is making distributed storage look and behave exactly like local storage, the CAP Theorem has something to say about its solvability.

jamesaguilar · on April 25, 2011

Depends. Local storage is also not perfectly available. If the network is reliable, you can probably get availability high enough that the system feels "close enough" to how local storage feels. Today's networks aren't that reliable, but someday there may be enough redundancy and bandwidth for this to happen.

sigil · on April 25, 2011

> Local storage is also not perfectly available.

Technically true, although you don't have to contend with the consistency or partitioning factors in the local disk case -- there's only one copy of the state. This means you can focus on making the availability factor as close to 1.0 as possible.

This may not be the case when you're forced to balance all three CAP factors. I sometimes wonder if a follow on result to CAP will be a "practical" (physical or information theoretic) limit like C x A x P <= 1-h for some constant h, and we'll just have to come to terms with that as computer scientists, as physics had to with dx x dp >= h. This is of course wildly unsubstantiated pessimism.

Also, I would gladly entertain any argument demolishing the "local disks are not subject to CAP" claim I made above by talking about read / write caches as separate copies of the local disk state.

jamesaguilar · on April 25, 2011

I doubt it. Suppose that there is a network that is never partitioned, and machines connected to that network that never fail. In that case consistency and availability should be perfect. Although networks will never be perfectly reliable, nor machines, they seem to be getting more reliable. Perhaps someday we may be able to say that the odds of enough partitions or machine failures to make the system unavailable are lower than the odds of you getting struck by lightning, at which point you will have for practical purposes defeated the constraints of the CAP theorem.

sigil · on April 25, 2011

> Suppose that there is a network that is never partitioned, and machines connected to that network that never fail. In that case consistency and availability should be perfect.

You mean, in that case tolerance to partition and availability should be perfect.

> Perhaps someday we may be able to say that the odds of enough partitions or machine failures to make the system unavailable are lower than the odds of you getting struck by lightning, at which point you will have for practical purposes defeated the constraints of the CAP theorem.

So this is the really interesting question. All the CAP theorem says is that (C,A,P) != (1.0,1.0,1.0). How close to (1.0,1.0,1.0) could we make (C,A,P)? If infinitely close, then we have achieved perfection by the limit, and the CAP theorem is rather pointless. If not, then what is the numeric limit?

As you speculate, maybe the numeric limit on C x A x P is so close to 1.0 that the odds of seeing a consistency, availability, or partitioning problem are much smaller than getting hit by lightning.

Then again, maybe not. Who knows? ;)

To avoid sounding like a total crackpot, here is an interesting paper that explores the physical limits of computation:

http://arxiv.org/pdf/quant-ph/9908043v3

jamesaguilar · on April 25, 2011

> You mean, in that case tolerance to partition and availability should be perfect.

No. If a network is never partitioned, you don't need to write algorithms that can tolerate partitions. Therefore consistency and availability are possible.

> So this is the really interesting question. All the CAP theorem says is that (C,A,P) != (1.0,1.0,1.0). How close to (1.0,1.0,1.0) could we make (C,A,P)? If infinitely close, then we have achieved perfection by the limit, and the CAP theorem is rather pointless. If not, then what is the numeric limit?

I think you have misunderstood the theorem (at least, if my bachelor-degree-level understanding is correct). C, A, and P are not variables you can multiply together or perform mathematical operations on. They are more like booleans. "Is the web service consistent (are requests made against it atomically successful or unsuccessful)?" "Is the web service available (will all requests to it terminate)?" "Is the web service partition-tolerant (will the other properties still hold if some nodes in the system cannot communicate with others)?" These questions cannot be "0.5 yes". They are either all-the-way-yes or all-the-way-no.

> . . . and the CAP theorem is rather pointless

Not really. It is pointful for networks that experience partitions. It just doesn't apply to reliable networks. It also sort-of doesn't apply when an unreliable network is acting reliably, with the caveat that since it is not possible to tell in advance when a network will stop behaving reliably, you still have to choose between these three properties when writing your algorithms for when the network behaves badly.

sigil · on April 25, 2011

> C, A, and P are not variables you can multiply together or perform mathematical operations on. They are more like booleans.

Right, but I wasn't restating CAP, just wondering about a follow on to CAP that considers the probability of remaining consistent, the probability of remaining available, and the probability of no failures due to network partitions in physical terms.

Is this not an interesting thing to consider? What if someone proves a hard limit on the product of these probabilities in some physical computation context? The CAP theorem is absolutely fascinating to me, especially if it has something real to say about the systems we can build in the future. The future looks even more distributed.

> It is pointful for networks that experience partitions. It just doesn't apply to reliable networks.

Is there such a thing as a "reliable" network when thousands or millions of computational nodes are involved? Are the routers and switches which connect such a network 100% available? If an amplification attack saturates some network segment with noise, what then?

As programmers, we desperately want things to work, and it's easy to greet something like CAP with flat out denial. I know I'm always fighting it. "It will never fail." No, it can and will fail.

jamesaguilar · on April 25, 2011

I still don't understand what you mean when you say, "probability of remaining consistent," etc. Either you wrote the service so it system would always remain consistent or you didn't. Similarly with availability. Either the system will always return a result, or it may sometimes hang.

Maybe what you mean is the probability of whichever of C, A, or P you gave up actually becoming a problem? But I cannot imagine a physical law of the form you are referring to applying uniformly to these disparate properties. I wouldn't even know how to formulate it for consistency. For availability and partition tolerance it would just be, "Requests to this service will (availability: hang forever/partition-tolerance: return with errors) at a rate exactly equal to the probability of network failures."

With regards to your last point, there are no reliable networks, at least where I work. That doesn't mean there won't be.

CPlatypus · on April 25, 2011

Actually, C/A/P are more like variables you can multiply together. Even more accurately, they define a three-dimensional space with the CAP theorem (especially in Gilbert and Lynch's formal proof) only saying that you cannot be at the point that represents the maximum of all three. There are definitely tradeoffs or "mode switches" that can be made between C and A, and some even believe that you can give up some P to get more C/A. Personally I believe that the probability of partition is always non-zero in even the best designed and fully provisioned networks, even including those that are econonomically infeasible, and that those who choose to treat a small probability as zero to capture a performance advantage or position themselves as "more consistent" than competitors are being a bit dishonest, but those tradeoffs are being made.

jamesaguilar · on April 26, 2011

> they define a three-dimensional space with the CAP theorem (especially in Gilbert and Lynch's formal proof) . . .

That's weird because I've read the proof and they speak only of boolean instances of C, A, and P in the proof. They give no examples of systems where any of the three variables have values other than zero or one.

arethuza · on April 25, 2011

Out of interest, where do you draw the line between local storage and distributed storage? By local storage do you mean directly a attached storage?

What about FC SANs or iSCSI over a WAN? Are they local or distributed?

johnb · on April 25, 2011

It's funny how disk abstractions get you every time.

We used to store and process all of our uploads from our rails app on a GFS partition. GFS behaved like a normal disk most of the time, but we started having trouble processing concurrent uploads and couldn't replicate in dev.

It turned out so GFS could work at all, it had different locking than regular disks. Every time you created a new file it had to lock the containing folder. We solved it by splitting our upload folder in 1000 sequential buckets and wrote each upload to the next folder along... but it took us a long time to stop assuming it was a regular disk.

sciurus · on April 25, 2011

FWIW, this behavior is explained early on in the documentation for GFS2.

johnb · on April 25, 2011

As we were using EngineYard for hosting at the time, everything was set up for us and we never thought to look it up.

We now pay a lot more attention to underlying stack. Just because you've outsourced hosting (either cloud or managed physical servers), you really need to know every component yourself.

spullara · on April 24, 2011

Also worth noting is that Amazon isn't forcing you to use EBS. They also have tons of fast local storage available to RAID as you wish.

lindvall · on April 24, 2011

I strongly believe one of the most positive aspects of EC2 was that it demonstrated a beautiful philosophy that a node and their disks should not be relied upon to always be around and pushed it into the mainstream.

Even for people who didn't use EC2 the existence of the platform caused more people to rethink their architectures to try to rely less on Important Nodes.

EBS is a step back from that philosophy and it's a point worth noting.

One of the great things this post does is enumerates some of the underlying reasons why relying on EBS will inevitably lead to more failures and in ways that are harder and harder to diagnose.

leoc · on April 25, 2011

> EBS is a step back from that philosophy and it's a point worth noting.

Amazon doesn't use EBS itself, right? Isn't EBS something that AWS allowed its customers to nag it into against (what it considers) its better judgement?

lindvall · on April 25, 2011

Yep. And this may be one of those cases where they would have been better off ignoring their customer requests for the good of their reputation and their customers uptime.

blantonl · on April 24, 2011

I agree. We run all of our MySQL and Mongo slave servers with local RAID-0 ephemeral storage. One dies? So what, we remove it from the pool and provision another.

Only our master and our slave backup server runs on EBS. We aren't as write oriented so we can live with some of the limitations of EBS, but we've even considered moving our master MySQL and Mongo servers to ephemeral storage and just relying on our slave back database server to run on EBS (for which we take freeze/snapshots of often). That server rarely ever falls more behind in relay updates.

epi0Bauqu · on April 25, 2011

Do you run xlarge to get the extra ephemeral disks?

smhinsey · on April 24, 2011

Do you mean RAIDing local instance storage? I haven't heard of this approach so far, do you have any links?

blantonl · on April 24, 2011

Depending on the instance that you have provisioned, you have anywhere from 1 to 4 ephemeral (local) disks available to your instance.

They are typically available as /dev/sd[bcde]

In centOS, implementing a RAID-0 block device across 2 ephemeral disks that is present on an m1.large instance can be done via the following:

mdadm --create /dev/md0 --metadata=1.1 --level=0 --quiet --run -c 256 -n 2 /dev/sdb /dev/sdc

You'll then need to format the block device with your fs of choice. Then mount it from there.

count · on April 24, 2011

Just remember the super important keyword:

ephemeral

WHEN your EC2 node disappears (and it will), you will lose everything on that RAID.

That's not a bad thing if you know it'll happen and plan for it, but do be aware of it.

jinushaun · on April 25, 2011

Except Amazon does... Their free micro instance only supports EBS. For some of us who just want to dip our toes into AWS without too much money or commitment, EBS is our only choice.

spullara · on April 25, 2011

Then you really don't need to worry about fault tolerance or uptime.

cagenut · on April 24, 2011

Its really fascinating to watch amazon re-learn/re-implement the lessons IBM baked into mainframes decades ago. Once you get out of shared-nothing/web-scripting land you realize that I/O is much more important and difficult than cpu. What amazon calls EBS IBM has been calling "DASD" forever. I wonder if there are any crossover lessons that they haven't taken advantage of because there just aren't any old ibm'ers working at amazon.

blantonl · on April 24, 2011

IBM's implementation of DASD on the mainframe was always implemented under the assumption that it was a secondary storage medium for data. Meaning, it wasn't accessed often, and it wasn't implemented for top performance.

Think of a bridge between high performance disk and tape.

spudlyo · on April 24, 2011

Trying to use a tool like iostat against a shared, network provided block device to figure out what your level of service your database is getting from the filesystem below it is an exercise in frustration that will get you nowhere.

This may be true under Solaris. Since 2.5 Linux has had /proc/diskstats and an iostat that shows the average i/o request latency (await) for a disk, network or otherwise. For EBS it's 40ms or less on a good day. On a bad day it's 500ms or more if your i/o requests get serviced at all.

alecco · on April 25, 2011

Amazon Six Sigma "Blackbelts", meet Mr. Black Swan.

Edit: my point is you can't hide unexpected/unknown events on statistical models; we should know better, coming from CS.

atombender · on April 25, 2011

> It’s commonly believed that EBS is built on DRBD with a dose of S3-derived replication logic.

Actually, it was discovered some time ago (http://openfoo.org/blog/amazon_ec2_underlying_architecture.h...) that EBS probably used Red Hat's open-source GNDB: http://sourceware.org/cluster/gnbd/

CPlatypus · on April 25, 2011

He only gets it half right. A filesystem interface instead of a block interface is the right choice IMO. Private storage instead of distributed storage is the wrong choice for capacity, performance, and (most importantly) availability reasons. They didn't go with a ZFS-based solution because it was the best fit to requirements. They went with it because they had ZFS experts and advocates on staff.

As Schopenhauer said, every man mistakes the limits of his own vision for the limits of the world, and these are people who've failed to Get It when it comes to distributed storage ever since they tried and failed to make ZFS distributed (leading to the enlistment of the Lustre crew who have also largely failed at the same task). If they can't solve a problem they're arrogant enough to believe nobody can, so they position DAS and SAN as the only possible alternatives.

Disclaimers: I'm the project lead for CloudFS, which is IMO exactly the kind of distributed storage people should be using for this sort of thing. I've also had some fairly public disputes with Bryan "Jackass" Cantrill, formerly of Sun and now of Joyent, about ZFS FUD.

ssmoot · on April 25, 2011

ZFS is just the FS. But you know that already.

The SAN solutions they migrated to are not ZFS based. Unless I'm mis-remembering (I read this a couple days ago) they were only using ZFS to slice LUNs.

Point is, you're taking pot-shots at ZFS when the main thrust appears to be: "It was hard to make iSCSI reliable. Once we did, by buying expensive storage-vendor backed solutions, we found it wasn't financially compelling."

They're a hosting provider. If it takes a replicated SAN pair (which is the wrong way to go about it BTW, though admittedly it's the way the storage vendors and their "appliance" mentality would have it done) to service just a pair of VM hosts (they're still using Zones right?) then it just didn't make sense money-wise for them. If they planned capacity to provide great performance, they weren't making enough money on the services for what they were selling them for.

That's not an "iSCSI is unreliable" problem. It's not a "networked storage is broken" problem. It's not a "networked storage is slow" problem. It's not even a "ZFS didn't work out" problem.

If you go out and spend major bucks on NetApp, not only are you going to have to deal with all the black-box-appliance BS, but it's going to cost a lot of money. A LOT. And DAS is going to end up cheaper to deploy, maintain, and your margins are going to be a lot higher.

DAS is the right choice for a hosting provider who wants to maximize their profits in a competitive space.

It's not the best choice for performance, availability or flexibility for clients though. So you have to ask yourself what kind of budget you have to work with, and what goals are important to you?

BTW, there's _budget_, and then there's NetApp/EMC budget. Just because you need/want more than DAS can give you doesn't mean you need to tie your boat to an insane Enterprise grade budget.

CPlatypus · on April 25, 2011

Perhaps you should RTFA. The author says explicitly that what they do now is "lean on ZFS" and "keep the network out of the storage solution" which made their provisioning more complex because they could no longer treat local disks as ephemeral (i.e. that data can't be assumed to exist anywhere else). I knew this when I wrote the GP. My whole point is that they treated it as a "networked storage is broken" problem even though it wasn't, because of their "ZFS is the only tech we need" bias. Thanks for re-stating that.

As for "DAS is the right choice" that's just wrong on many levels. First, people who know storage use "DAS" to both private (e.g. SATA/SAS) and shared (e.g. FC/iSCSI) storage, so please misusing the term to make a distinction between the two. Second, I don't actually recommend either. I don't recommend paying enterprise margins for anything, and I don't recommend more than a modicum of private storage for cloud applications where most data ultimately needs to be shared. What I do recommend is distributed storage based on commodity hardware and open-source software. There are plenty of options to choose from, some with all of the scalability and redundancy you could get from their enterprise cousins. Just because some people had some bad experience with iSCSI or DRBD doesn't mean all cost-effective distributed storage solutions are bad and one must submit to the false choice of enterprise NAS vs. (either flavor of) DAS.

In short, open your eyes and read what people wrote instead of assuming this is the NAS vs. DAS fight you're used to.

ssmoot · on April 25, 2011

They "lean on ZFS" for DAS.

Seriously. You tell me. What does that have to do with your rant on ZFS? It could have as well been an LSI controller doing RAID6. Or mdadm. Doesn't matter.

That's the evolved solution they came up with.

The "networked storage is broken" pitch actually comes in with the EMC/NetApp interim solution as well. I don't buy it either, but it's a joke to claim the problem was ZFS on the Zones when the Targets weren't running ZFS.

You're awfully prickly, but I didn't suggest it came down to "Enterprise" NAS vs DAS. I actually think networked storage is here to stay (and that's a good thing).

I have my doubts we'll see a stable, inexpensive (or free) Distributed or Clustered file-system ready to replace traditional solutions anytime soon. I'm happy to see people try though.

You clearly have an axe to grind with ZFS though. In my experience it's been by far more stable than any available Linux FS I've used. Pull the power again and again, replace and resilver all you want. Manage terabytes and don't worry about corruption. I wouldn't trust ext3/4fs for anything I couldn't stand to lose...

PS: http://en.wikipedia.org/wiki/Direct-attached_storage

"People who know storage". I don't see iSCSI on that list. Nor FCoE. DAS (at least according to Wikipedia) explicitly rules out switching. Which is how I've always viewed it.

CPlatypus · on April 25, 2011

You're really not getting it, are you? I never said ZFS was the problem, as you seem to think. I'm just saying it's not the solution either. It's a crappy solution, failing to protect against host failures and creating myriad problems in provisioning around the fact that each VM's storage is stranded on one node until it's explicitly copied somewhere else. And if you don't think there are decent distributed filesystems out there, you're just not keeping up with the field and shouldn't be commenting on it.

ssmoot · on April 25, 2011

I don't think I am getting it no. You don't think ZFS is the problem?

So you aren't calling ZFS a "crappy solution"? Just the DAS usage?

What is your gripe exactly then? The overblown critique of networked storage? Well we agree on that at least then. I think.

Honestly, with all the "read the fucking article", it's-not-DAS, oh-it-is, CloudFS is way moar better than ZFS, I never said ZFS sucked, "Bryan ZFS Cantrill is a jackass" you've left me absolutely bewildered at what your intended point (if any) actually is?

For the record, my only comment on (free) distributed filesystems (that aren't vendor-locked and actually unusable to me) is that I wouldn't personally trust them with my data. Not until they have the features I need, and then are running out in the wild, widley deployed for a couple years so I'm not a guinea pig.

I'll even throw you a bone: Even just last year ZFS was having major melt-downs when a new inadequately vetted feature was added. A few years ago it wasn't uncommon to face corruption when trying to do fairly routine things managing disks. Bugs can and do happen.

Maybe CloudFS, or Gluster is ready for prime-time, housing terabytes of data reliably and never making a misstep. I just don't think it's smart to bet your business on it. Not at least without a plan B since moving data around isn't an option when you're down and have terabytes you need to get back online.

dasil003 · on April 25, 2011

I'm wait out of my depth on this stuff, but the arrogant tone of the OP really rubs me the wrong way given my personal experiences with Joyent in the past. They act like their shit doesn't stink and they are god's gift to tech, when in reality they just like playing with cool toys and they're not providing better service than anyone else.

Nrsolis · on April 25, 2011

Do you have any opinions of Ceph? They seem to be dong a similar thing.