Building a high performance SSD SAN, part 1

StillBored · on April 6, 2015

I wrote a longer version of this, but I will summarize as not all the commercial flash systems are x86's driving SSDs. Disclaimer: I don't work for IBM, and have no commercial interest in what i'm about to say.

I had the pleasure (a rare statement for me) of recently having access to a TMS/IBM flash system 840 for a year. Frankly, if the goal is shared network accessible performance, nothing you can build with PC's and SSDs will come close. The on the wire latency and bandwidth that even a single 2U (60TB) unit provides is fairly shocking. And that performance doesn't degrade over time, even under extreme load. The official performance numbers on the IBM site are actually conservative, which is unusual itself. The units are all custom (in a way that is a negative, cause they aren't cheap) and built like tanks. Although from an end user perspective they are very simplistic. AKA they provide a RAIDed volume(s) on the SAN and nothing else. Your on your own for replication and the like, although I think they have a canned solution for that now.

Bottom line, they cost a mint, and are built like old school over engineered IBM hardware. But you won't be kicking yourself a few months down the line as you baby your custom storage solution. Normally i'm the guy building stuff like this, but sometimes I would rather just sleep at night than get called because the garbage collector kicked in on a SSD and it dropped out of the RAID in the middle of a big load. Plus, being IBM, you can probably get a unit with a POC agreement, just to see how it compares.

johngalt · on April 6, 2015

> you won't be kicking yourself a few months down the line as you baby your custom storage solution. Normally i'm the guy building stuff like this, but sometimes I would rather just sleep at night...

"Even Southwest doesn't build their own airplanes." I have this argument with myself all the time. If you don't have an IT pro who works on storage every day, then buy something from someone who does work on storage every day. Their time is worth the price. Your time is worth it too.

smcleod · on April 6, 2015

I've taken this from where I replied to a comment on my blog:

Support

4 hour on site response offered from vendors is not good enough for us and to improve on that is expensive. With using more standard components we are able to be on site at either our primary or secondary datacenter within 30 minutes with spare parts in hand if required.

Choosing one vendor as an example, the support we've received from HP has been atrocious - they've caused more outages than they've be able to fix. The engineers they send to smaller organisations are generally relatively incompetent and they have certainly shown that they don't care about your uptime.

Proprietary storage systems offered by HP, DELL and EMC are not only expensive to purchase and license but they're very time consuming to manage as they're essentially a 'snowflake' in your infrastructure. It's hard to make them integrate with modern automation tools such as Puppet and CI requirements and they all use their own management tools that are specific to the vendor or range of product - usually this involves having a Windows VM running Java or some equally frustrating technology to manage the system. Performing updates on proprietary systems can often be painful as hardware vendors generally are not very good at designing software.

It's very hard to outsource quality and it comes at a large cost.

StillBored · on April 7, 2015

You have hit the nail on the head for why most "enterprise" gear is basically overpriced junk. That doesn't mean there isn't quality hardware out there, just that you have to be more selective. AKA do your own research and don't be dazed by the feature lists. Some of those features shouldn't actually be used.

I personally tend to like the KISS arrays that don't have dedupe/replication/etc built in, and are web manageable. The extra bonus is that there are a metric boatload of tier 2 array vendors (imation's nexan for example) that provide rock solid hardware for a small fraction of the prices of EMC/etc. Its quite possible to get native capacity for less than a company like EMC charges for deduped capacity (aka 100TB from a tier two company can cost less than the 10TB deduped to 100TB from EMC). Raw RAID is pretty simple/understood in comparison to deduped solutions, and I think that has a significant affect on reliability. AKA more features, more latent bugs...

Finally, get a bunch of demo units, and if the configuration UI is a mess of esoteric proprietary command line junk, or it is only configurable with a 32-bit java app that won't run on a 64-bit windows (yah I've seen that) then send the unit back. Be clear about why it won't work for you. The only way to convince these companies to behave is to show them lost sales.

smcleod · on April 6, 2015

I'm the author (Sam Mcleod) of this post and just saw this was posted on HN.

I'm quite a bit further down the track now and should have a much better write up to post in the next week or so time providing.

I can say that I'm hitting over 1,000,000 IOP/s on 4k random reads and around 850,000 IOP/s random 4k writes per 1U unit without cutting any corners with data safety.

What I haven't finished yet is the cross-unit replication tuning and the load balancing / multipathing.

So far everything has gone to plan and is progressing well as a serious replacement for our 'traditional' storage.

Regardless, there's still quite a lot of testing to be done and once I have more of that completely I'll post an update with some serious numbers and observations.

moe · on April 6, 2015

cross-unit replication tuning and the load balancing / multipathing.

That's going to be interesting. At those speeds (1.5GB/s, 850k IOPS) you must bump[1] into all kinds of DRBD bottlenecks?

Also I'm very curious about your approach to redundancy/multipath. Dual-primary or failover?

[1] http://blog.gmane.org/gmane.comp.linux.drbd/page=41

smcleod · on April 6, 2015

I have no illusions that I'll be able to maintain such performance with DRBD replicating and iSCSI overheads.

My current feeling on load balancing is to keep it simple, have half the clients treat node A as their active storage and the other half using node B. If node A becomes unavailable failover will occur and node B will serve all clients.

As far as multipathing goes - my approach is that I honestly have no idea. That's something that's going to be learnt along the way and I'm in uncertain if it'll make it into the final build or not.

Edit: I can't get that link to load on my connection at present but I've bookmarked to look at tomorrow. If you have any experience / advice I'm always very open to assistance!

FlyingAvatar · on April 6, 2015

I was quite amused to see this posted as we just built nearly the exact same thing about 6 months ago (minus the NVMe drives which weren't available). We are using ZFSonLinux on top of DRBD instead of mdmadm, which has been surprisingly successful.

The biggest frustration that we met was using LACP in trying to get a better data rate out of DRBD. Regardless of the LACP mode, we were not able to get DRBD to scale to full bandwidth (20gbps) and had to settle for using it for redundancy only.

I imagine with your setup (read: much faster peak write capability), you might actually be able to bottleneck DRBD pretty well over a single 10GBE pipe if you were writing at peak capacity. I'll be interested to see if you happen upon a work-around.

Load balancing as you've suggested is an intriguing compromise. Have you tried doing a two-way heartbeat setup where each both boxes have their own active IP and can fail over to each other?

tedchs · on April 6, 2015

I wonder if the lack of help from LACP was because DRBD is using only 1 TCP session, which would only get hashed to one or the other member link?

cgb_ · on April 7, 2015

Might be a case where MPTCP (http://www.multipath-tcp.org/) would help with aggregating multiple links.

smcleod · on April 7, 2015

Correct me if I'm wrong but I'm not sure MPTCP would help as LACP is generally SRC/DST MAC based and shares a uses a single IP?

cgb_ · on April 7, 2015

You are right in that LACP bonding methods cannot increase the throughput of a single flow to gt any single link in the group (the balancing method can utilise src/dst MAC, IP and sometime L4 ports).

MPTCP establishes multiple subflows across individual IP paths and can load balance or failover across all subflows. Applications do not need to be rewritten to take advantage of it. I'm not sure if that includes kernel modules like DRBD though. I suppose someone needs to find out :)

smcleod · on April 7, 2015

I could have a play and report back if I get time?

smcleod · on April 6, 2015

Hi FlyingAvatar, I've worked with some rotational storage solutions in the past built on ZFS on BSD with mixed success - the features were great but we had quite a few issues with performance. If you've been down the DRBD performance road I wonder if you'd be open to having a chat with me around your findings / experience?

baruch · on April 7, 2015

I would strongly recommend also testing for when things fail. This is the hardest part in troubleshooting and building a reliable system.

You can make do with READ/WRITE LONG if the disks are SAS to generate medium errors, be aware that it may kill your disk if you do it too much to the same disk.

You can also use relatively inexpensive SAS/SATA link error injectors from Quarch.

Be sure to test hot pulls of the disks to simulate a hard failure.

Simulating other soft failures is much harder, sometimes the disk just stops responding but doesn't go offline so the system finds it harder to detect the case and may reach timeout conditions.

smcleod · on April 7, 2015

I couldn't agree with you more, in the time that I have set aside to establish the new storage I have around 75% of it dedicated to testing, around half of that is dedicated to failure scenarios. A have a list of 'bad things to do' that I'll be executing and reporting on.

jjoe · on April 6, 2015

I understand his decision to skip the HW RAID controller (I like mdadm too). But a BBU is definitely critical to this op. NVMe has so many more queues that on a busy box it's going to hold way too much uncommitted data and metadata. Now the only other reasonable option is a UPS but it adds significant RUs/cost to the setup.

runarb · on April 6, 2015

I am no expert, but are you sure that uncommitted data on the ssd is an issue?

The Intel DC P3600 has a built in capacitor that should give it econoff backup power to commit the data. The SanDisk Extreme Pro disks don't have volatile cache, but instead uses SLC NAND flash for cashing, that will survive a power loss.

There is of course still the issue that data send to the server and stored in main memory will be lost if you loses power, but a HW RAID controller with battery backup would not have prevented that (but an UPS of course might).

otterley · on April 6, 2015

It would definitely prevent an uncommitted data problem if fsync or O_DIRECT were used (which should always be used for critical writes).

UPSes defend against power outages, but they're only one rung on the data-integrity ladder. Controller BBUs, on the other hand, protect against both power outages AND kernel panics.

smcleod · on April 7, 2015

Indeed we have fsync enabled on our database servers (PostgreSQL)

smcleod · on April 7, 2015

(And just wanted to point out that that is only one part of the equation)

smcleod · on April 6, 2015

I tend to agree with you here runarb, I truly don't believe there is much that a hardware RAID contrôlée and battery will protect you from with modern flash storage, at this point I think time is better spent on improving / tuning the database and filesystem for resiliency.

KaiserPro · on April 6, 2015

UPSs all the way. With high random writes you'll get alot of buffering so even if you have a BBU for the raid, the data might still be in RAM.

the order of shutdown is important too, should really be 1) consumers of storage (within the first minute or so) 2) storage (about 15 minutes after)

if your network wobbles or the storage turns off too fast, everything hangs, and your in poostain pants town.

Ideally your DNS/AD shouldn't ever willingly shutdown. or if it does it should be well after the last bit of your production system has turned off.

peterjs · on April 6, 2015

I went down the same road - using mdadm and a UPS instead of a HW RAID controller with battery backup unit. LSI MegaRAID SAS 9261-8i SGL with a battery backup costs roughly 650 EUR; Eaton 9130i 1500VA Rack 2U UPS costs 850 EUR. So it's a bit more expensive, but you would be buying that UPS anyway.

fweespeech · on April 6, 2015

I'd be willing to assume they have a UPS capable of handing the servers in each rack to allow for controlled shutdowns in the event of a power outage.

Hell, I have a UPS in place for my desktop, modem, and router at home.

smcleod · on April 6, 2015

It would be typical that this was posted just before I went to sleep last night!

I've been flooded with messages, comments and feedback overnight and am slowly catching up on most of them.

I wanted to say one thing: My blog post that was linked here was never intended to be much more than a brain dump at the start of a journey. It contains little-to-no useful information other than the idea itself and it doesn't have any technical information or nearly enough background as to why I'm exploring this option or why I think there is a valid use case for it.

Give me a week or two and I'll have an updated post with a proper background story and experience with traditional solutions, my observations so far, some real technical content and of course some serious numbers.

Thank you so much to all the people that have contacted me both from HN and directly to offered insight into their experiences.

hengheng · on April 6, 2015

What is the reason to go with Raid1 for the intel SSDs? My guess would be that the rate of Raid1-catchable failures of these drives is not considerably higher than the rate of, say mainboard failures or other failures that render the whole unit out of operation.

smcleod · on April 6, 2015

The only reason is to protect against per-device failures - the technology is new and while it's meant to have astounding resiliency and a long lifetime - I don't want to take that chance. Down the track when they have proved themselves to us I'll certainly be reconsidering this however.

baruch · on April 8, 2015

If a mainboard fails you replace it and get back up (with downtime), if an SSD fails you lost your data. If you care about your data put some redundancy in it.

Even for the case that a motherboard fails having redundancy will help with the downtime, your software needs to be able to handle failover situations and that gets hard but sometimes worth it.

KaiserPro · on April 6, 2015

I'm always interested in the outcomes of these types of experiments.

One thing that I find interesting is no mention of what network card you are using? or switch setup, or ip config. I seem to remember in the dim and distant past that letting the storage layer handle the paths instead of LACP (which on some switches are active-passive) was better/faster/easier. (that could be bollocks though.)

I used to work in VFX where performance and cost must be balanced. For the storage lumps we had a 1u dell with the fastest 10 core v3 processor in it, 384 gigs of ram (ram is a cheap and fast cache) tied to a dell 3060 (60 4tb disks in a 4u case, with up to 4 SAS connections out(well many other options it does iscsi too))

These topped out at about 2gigabytes/second (or about 1.3gigabytes with 100 concurrent fios from NFS clients). but there are 30+ of them so the aggregate was enough to dowse half the render farm in iops.

A couple of things that stuck out for me: o 24/7 support contracts are amazing. o auto phone home disk replacements are also life savers o multipath iscsi can be painful unless you have decent switching (with ospf) and OS support o in short distances, SAS > FC >> iscsi for block storage. pretty much plug and play (multipath is easy too). o Replication needs a different path separate from the client o melenox or intel nics are the only 10gig cards worth paying for. (the only thing going for broadcom is that they are supported out of the box by vmware) o File systems always fail o for random IO; RAM > SSD by a massive factor. especially if your active dataset is less than ram cache. o drdb will be as slow as your slowest link in the network. o ethernet packet size really matters. (and depends on your workload) o synthetic benchmarks are the chocolate teapots of the storage world. o paying for decent support saved the (me and by extension the company(s)) at least twice.

Yes, the dell/HP/whitebox route is more expensive in capital outlay. However you can if you so wish go the white box route. The dell raid enclosure is a rebadged netapp/lsi/enginio box. dothill do a similar one.

The important thing to note though is the opex & RnD is far, far cheaper with prebuilt systems. With a click of a button you get replication. If you are using VMware (I'm sure KVM has a similar system) it can do it as well. The performance is predicatble, and the configuration and tuning has been done more or less for you (excluding the raid layout, FS tuning and network setup. but then you're a sysadmin who knows how to do all that right?)

There is nothing worse than having to debug a storage system so unusual or highly configured that googling won't help. You feel terribly alone. Even more if you don't have decent backups.

If you don't have those expertise, it'll cost you either time, or budget for a storage guy. (so just do some testing and buy the fastest prebuilt system for your budget.) For a previous place, I tested 7 different systems to get the best fit. (that included home rolled) A lot of the time was modelling the sotrage workload, and devising tests that would proxy that kind of load on the test systems.

END_OF_RAMBLE

smcleod · on April 6, 2015

I actually have to shoot off but wanted to say that I enjoyed your comment and I'll have a chance to reply tomorrow morning - this post of mine was sort of a quick brain dump and massively lacking in almost any useful detail other than the idea itself. Anyway, as I said I'll reply tomorrow and hopefully shed some light on where I'm coming from and get your thoughts on that.

PaulHoule · on April 6, 2015

Enterprise storage systems made more sense back in the spinning disk era when the need to minimize latency caused by physical motion was acute. They're definitely struggling now to produce products that really add value to SSDs.

KaiserPro · on April 6, 2015

for Direct attached storage, possibly.

But its simply not the case to say they are struggling. IBM yes.

Netapp and EMC are still growing, just.

Contrary to popular belief, spinning disks are here for the next few years at least.

Especially when you compare disk performance with EBS "SSD" performance.

128gigs of "SSD" EBS storage gives you ~380 iops. The same performance as two 15k disks. We are very much in the era of spinning disk performance if Amazon can charge a premium for that kind of storage.

jbooth · on April 6, 2015

What's the derivative of that growth look like, though? It could be that they just have enough customers who are either locked-in or on the back-half of the adoption curve, allowing them to limp in with some growth for a few more years before it's over.

Amazon charges a premium for convenience and for the fact that there's 0-startup cost to get going with them. Why would someone every shell out the money for an enterprise 15k disk when they can buy an SSD instead?

baruch · on April 8, 2015

The 15k HDDs will likely be replaced by SSDs, the 7200 ones are likely to stay for a long while yet. Their capacity is yet unmatched, the SSDs are trying to catch up with higher densities but currently the HDDs are ahead of them on the curve of size and definitely ahead on the cost per Gb.

I just recently switched my laptop to an SSD and the change is amazing but I'm not going to replace even the 1TB HDDs on my nas with equivalently sized SSDs. Both from a needs view and from a cost view.

KaiserPro · on April 6, 2015

Good question.

if you want an all in one file server and block storage device, then you only really have a few choices. (netapp, EMC, trintri)

if you don't want to configure and support a custom filer and or block storage, then you're going to go to the likes of Netapp et al.

engendered · on April 6, 2015

if you’ve got lots of money and you don’t care about how you spend it or translating those savings onto your customers

This is a baseless, prejudiced claim, contrasting buying with some idealized scenario where your in-house crew can only possibly build better solutions cheaper. That there are no other outcomes.

But they might also cost far more in manpower costs than any premium. They might give you an unreliable solution that costs you your entire business. Such an effort might distract from the core competencies of the organization. Operating that storage might end up dwarfing the up-front cost (it's easy to hire admins knowledgeable of EMC. Quite a different matter when it's your own home-brew solution).

I understand the draw, but the "all upside" claim undermines the entire piece. There are enormous downsides, not least the reliability of your data. Companies like EMC and Nimble -- despite "great" editorial quotes added by some random person on Wikipedia -- base their entire existence on reliably serving your data, and things like multipathing and replication are the absolute minimum cost of entry in the market. Now add automatic tiering, thin provisioning, disk-deduplication and streaming hardware compression, etc, and the value starts to become evident.

EDIT: The moderation through this is an abomination. HN shouldn't be overly critical, but nor should it pander patronizingly to some tripe because the author happened by.

smcleod · on April 6, 2015

Hi there, I'm very open to constructive criticism on my project but I feel that your comment is more of a strawman attack than particularly useful in any way. I had a look back through your comments on other posts on HN and they seemed to have a similar tone. If you feel strongly about the topic one thing I could suggest that would help readers of your comment would be to include some details, facts or technical analysis that are relevant and insightful. Also I might suggest reading the recent hacker news post concerning community feedback: http://blog.ycombinator.com/new-hacker-news-guideline

Edit: by the way, I do have compression, thin provisioning, replication and it also auto provisions storage volumes when new VMs are created. As I said, I didn't post my two month old blog here and there is plenty of detailed information for those that are genuinely interested on its way in the next few weeks.

engendered · on April 6, 2015

"Hi there, I'm very open to constructive criticism on my project but I feel that your comment is more of a strawman than particularly useful in any way. I had a look back through your comments on other posts on HN and they seemed to have a similar tone."

So you don't attempt to prejudice only in your blog posts, it appears. This is one of the most sadly defensive responses by a blog author I've yet seen on here.

Where, exactly, is the strawman? Please point it out rather than desperately trying to immunize against my completely valid comment in the most frantic of manners.

You posit an idealized notion about building it yourself based upon accomplishing, to this point, apparently very little. It's one thing to cast a theory and pursue it (e.g. "my attempt at building competitive storage on the cheap"), but you're presenting completely unsupported dogma around it, and then bolstering your own decisions by conclusions you actually don't even remotely have. It's the guy who decided to start working at a standing desk and on day one has a laundry list of comments about why anyone who doesn't is wrong and lesser.

This is a fairly typical, of course, and you see it by people who build their own anything ("don't give all your money to big TP -- six quick tips for making your own!"), and we generally only see it in the before stage, as the after stage is more often than not a littered debris field of failures.

I find it rather incredible that you attempt to incite HN guidelines, as if raw gullibility and boosterism was the direction of the critique. You are making broad claims in the linked article that you have absolutely no basis for making, so criticism is well deserved, and a service for anyone who might buy into this notion, making a fool of themselves in their own organization.

moe · on April 6, 2015

Where, exactly, is the strawman?

The strawman is that he builds a solution for his exact needs and you attack it for not having features (multipathing, replication, automatic tiering, thin provisioning, disk-deduplication, streaming hardware compression) that neither he nor most people need.

All that stuff becomes interesting when and if you need more than two bricks. In a time where you can fit >250T of spindles or ~42T of SSDs into a single server the audience for such shrinks rapidly.

engendered · on April 6, 2015

Is it irony that you're presenting an absurd strawman to demonstrate my purported strawman?

I "attacked" (aka disagreed with) any broadly-targeted, generalist claims of the linked article (which is in stark contrast with, for example, the Backblaze posts where they build very purpose-suited storage and never try to over-extend their claims), which was quite clearly that people who buy so-called enterprise storage are, to paraphrase the gist of the article, suckers. I noted replication and multipathing because they are the absolute minimum cost of entry for critical storage, and even the linked article references it as a requirement.

The rest of the features were clearly "value adds", given that most enterprise storage features a lot of its value in software. Yes, everyone benefits from automatic tiering. Everyone benefits from thin provisioning, disk-deduplication, and compression. There are a vanishingly small number of users who won't see significant benefits from all of those features.

"All that stuff becomes interesting when and if you need more than two bricks."

? Multipathing is absolutely critical for a single storage unit. Replication is absolutely critical if you have any care at all about uptime, because a single storage unit, even with multipathing and redundant power supplies and "controllers", as appropriate, isn't enough. The rest of them have nothing at all do do with the number of storage units -- thin provisioning gives you fantastic storage control. Compression is obvious. Disk-deduplication...again, the name of the game is minimizing the amount of data you're actually dealing with, because even if have conceptually unlimited storage in your unit(s), that's data you have to move around and replication and backup and...

Spooky23 · on April 6, 2015

Having a contrarian position isn't a crime.

Most enterprise arrays these days are really software dressed up behind the hardware. As a customer, you get to pay for the hardware multiple times. This is really obvious in the backup space if you compare a software-only solution like CommVault to Avatar/DataDomain.

If you make the conscious decison to invest in engineering talent and have a lot of need for midrange storage, you can yield a positive return on that investment. If you're growing fast and go EMC, you'll invest significant capital in administering those solutions and negotiating anyway.

But if you don't have the resources to staff it, or the systems for provisioning/service management to replace the consoles you get from a vendor product, you don't belong in the business.

on April 6, 2015

[deleted]

engendered · on April 6, 2015

Most people simply don't need this kind of storage anymore.

But this is exactly the sort of storage the linked article is talking about. So your point is that I attacked a "strawman" because I didn't accept the linked article for being something entirely different from what it is?

but the large majority of companies aren't

The majority of mid to large sized companies run entirely on SANs. I'm not sure where you get your information from, but how shoestring upstarts operate has nothing to do with much of the "real world".

This article is about a home-brew SAN, with home-brew SAN qualities and deficiencies.

moe · on April 6, 2015

Ugh, I managed to fatfinger-delete my parent comment, sorry.

(the essence was what you quoted: I claimed most people don't need these advanced SAN-features anymore because they don't need SANs in the traditional sense anymore)

Anyway, I'll just concede that OP aiming for feature-parity with commercial SAN's is indeed nonsensical.

The majority of mid to large sized companies run entirely on SANs

For large companies who bought into EMC/NetApp that may be true.

For small and mid-sized companies that's definitely not my experience.

_ondq · on April 6, 2015

Every company that runs on AWS is dependent upon SAN (eg, EBS). I imagine it's similar for other cloud services.

Just because you've never seen one or had to deal with one directly does not mean you're not dependent on one.

moe · on April 6, 2015

Every company that runs on AWS is dependent upon SAN

I think the discussion was about running your own SAN, i.e. creating your own EBS.

As it happens EBS itself is not backed by a SAN either. According to the rumors it's implemented using either DRBD or GNBD.

dang · on April 6, 2015

Criticism is fine. Gratuitous negativity is not. Your comments routinely have a lot of that, and it's time for it to stop. Please stick to the substance and drop the personal bile from now on.

KaiserPro · on April 6, 2015

EMC are expensive and nimble are ok for light VM performance.

However they provide a service. As the OP says its a bit broad to claim that they are a pointless cost. Having dealt with both, I can tell you that its very rare that own brand storage is cheaper in the long term. Unless: o you run at significant scale o have enough data to warrant a team of skill storage admins

bear in mind that decent storage admins cost the same as a medium sized array, per year, per person. Opex can quickly over take capex.

smcleod · on April 6, 2015

Actually we've found quite the opposite, because proprietary vendor management tools are generally hard to automate and integrate with existing frameworks such as puppet - the time it takes to administer them is far greater than any of our more standard systems. I just replied to a similar observation from someone else here: http://smcleod.net/building-a-high-performance-ssd-san/#comm...