Greplin (YC W10) AWS Benchmarks and Best Practices (greplin.com)
70 points by smanek on April 26, 2011 | 10 comments



Thanks for putting this together. A few questions come to mind:

1) Did you try bigger instance types, e.g. large? I've heard that at that level and above you get better networking performance, which in theory might improve EBS performance.

2) Related to #1, I wonder if you've tried doing RAID0 on the ephemeral drives? You get two with the large instance, and four with xlarge.

3) Did you ever measure non-EBS network performance during the tests? I've always wondered whether heavy EBS usage slows down other network traffic to the instance, given there is only one interface.

4) How often have you yourself experienced EBS volume failure in your RAID volumes?

5) When that happens, what happens to your volumes and instance? That is, what do you use to monitor when the RAID volume degrades? Does it usually take down the instance immediately, or only after some time? If the volume just becomes really slow, does that trigger alarms, or does the application simply degrade silently?

6) Finally, what is your current procedure for dealing with a volume failure?

Well, that ended up being a lot of questions. I'd really appreciate any answers or insights you or anyone else could offer. I've been reading all these posts, but it feels a bit like reading tea leaves.


In response to question #1: I've conducted some similar benchmarking across all instance sizes, using both ephemeral and EBS storage with and without RAID. EBS is notably faster on larger instances. We observed roughly 2.5-3x better IO with EBS-backed m2 instances compared to c1.medium. Using ephemeral RAID 0 on the cc1.4xlarge was about 6x faster.

http://blog.cloudharmony.com/2010/06/disk-io-benchmarking-in... http://blog.cloudharmony.com/2011/04/unofficial-ec2-outage-p...
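
For anyone who wants to sanity-check numbers like these on their own instances, here's a minimal sketch of a sequential-vs-random read benchmark in Python. The test file path and sizes are placeholders (this is not the tooling behind the figures above), and the file should be much larger than RAM or you'll mostly measure the page cache.

    import os
    import random
    import time

    PATH = "/mnt/benchfile"          # placeholder: a pre-created file bigger than RAM
    SEQ_BLOCK = 1024 * 1024          # 1 MiB reads for the sequential pass
    RAND_BLOCK = 4096                # 4 KiB reads for the random pass
    RAND_OPS = 2000

    def sequential_mb_per_s(path):
        fd = os.open(path, os.O_RDONLY)
        try:
            start, total = time.time(), 0
            while True:
                buf = os.read(fd, SEQ_BLOCK)
                if not buf:
                    break
                total += len(buf)
            return total / (time.time() - start) / (1024 * 1024)
        finally:
            os.close(fd)

    def random_read_iops(path):
        size = os.path.getsize(path)
        fd = os.open(path, os.O_RDONLY)
        try:
            start = time.time()
            for _ in range(RAND_OPS):
                os.pread(fd, RAND_BLOCK, random.randrange(0, size - RAND_BLOCK))
            return RAND_OPS / (time.time() - start)
        finally:
            os.close(fd)

    print("sequential: %.1f MB/s" % sequential_mb_per_s(PATH))
    print("random 4K:  %.0f IOPS" % random_read_iops(PATH))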


Re: #3 - My tests show that the virtual NIC can only handle about 100k packets/sec. I don't have my results posted anywhere, but RightScale's tests confirm this as well:

http://blog.rightscale.com/2010/04/01/benchmarking-load-bala...

So I'd say YES, other workloads would be affected during heavy EBS I/O, though I've never tested/measured this specific case.
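
For what it's worth, the packet rate is easy to watch from the instance itself while an EBS-heavy workload runs. A small sketch that diffs the counters in /proc/net/dev (the interface name is an assumption; use whatever the instance actually has):

    import time

    IFACE = "eth0"   # assumed interface name; check /proc/net/dev on your instance

    def packet_counts(iface):
        """Return (rx_packets, tx_packets) totals for one interface."""
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[1]), int(fields[9])   # rx pkts, tx pkts
        raise ValueError("interface not found: " + iface)

    rx0, tx0 = packet_counts(IFACE)
    time.sleep(1)
    rx1, tx1 = packet_counts(IFACE)
    print("rx %d pkts/sec, tx %d pkts/sec" % (rx1 - rx0, tx1 - tx0))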


Great test. I wonder whether concurrent DB access (submitting reads to multiple DB servers and taking the first response) would work around these weird variations in EBS response times.
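
Something like that is easy to prototype. Here's a sketch of the "race the replicas" idea, with a made-up query_replica() standing in for whatever client library actually talks to the database, and made-up replica hostnames:

    import concurrent.futures
    import random
    import time

    REPLICAS = ["db-replica-1", "db-replica-2", "db-replica-3"]   # hypothetical hosts

    def query_replica(host, sql):
        # Stand-in for a real read against `host`; the sleep just simulates the
        # variable EBS-induced latency the race is meant to hide.
        time.sleep(random.uniform(0.005, 0.200))
        return host, "rows for: " + sql

    def first_response(sql, timeout=1.0):
        """Send the same read to every replica and return whichever answers first."""
        with concurrent.futures.ThreadPoolExecutor(len(REPLICAS)) as pool:
            futures = [pool.submit(query_replica, host, sql) for host in REPLICAS]
            done, _ = concurrent.futures.wait(
                futures, timeout=timeout,
                return_when=concurrent.futures.FIRST_COMPLETED)
            if not done:
                raise TimeoutError("no replica answered in time")
            return next(iter(done)).result()

    print(first_response("SELECT name FROM users WHERE id = 42"))

Note that the slower queries still run to completion when the pool shuts down; a real version would cancel them or reuse a long-lived pool, and of course this only makes sense for reads.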

Adrian C from Netflix had a nice tech article on EBS performance: "Understanding and using Amazon EBS - Elastic Block Store" http://perfcap.blogspot.com/2011/03/understanding-and-using-...


The problem with RAID is not just drives failing, but also the UBE (Unrecoverable Bit Error). Let's say you have a RAID5 configuration. In addition to the probability of disk failure, you need to account for the probability of a disk failing combined with the controller's inability to rebuild, because a sector on one of the surviving disks hits a UBE. The per-bit UBE probability on enterprise-level disks is rather low (10^-16, I think), but the cumulative risk shoots up quickly once you start considering 1TB and 2TB drives.
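
To put rough numbers on that, here's a back-of-envelope calculation of the chance of hitting at least one UBE while rebuilding a degraded RAID5 array. The 10^-16 per-bit rate is the figure quoted above; the drive size and count are just illustrative:

    import math

    UBE_RATE = 1e-16         # unrecoverable error probability per bit read
    DRIVE_TB = 2.0           # capacity of each surviving drive, in TB
    SURVIVING_DRIVES = 3     # every surviving drive is read in full during a rebuild

    bits_read = SURVIVING_DRIVES * DRIVE_TB * 1e12 * 8
    # log1p/expm1 keep precision when the per-bit probability is this tiny
    p_rebuild_ube = -math.expm1(bits_read * math.log1p(-UBE_RATE))
    print("P(>=1 UBE during rebuild) = %.4f" % p_rebuild_ube)   # ~0.005 here
    # At a consumer-class rate of 1e-14, the same rebuild gives ~0.38.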

The problem with EBS benchmarks is that they largely depend on who else is sharing the spindles with you at the time the benchmark is run. Given the large variance in performance being reported, the sample size needed for reliable statistics would be quite large.
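
As a rough idea of "quite large": using the usual n ~ (z * sigma / margin)^2 rule of thumb for estimating a mean, with made-up numbers for the kind of spread people are reporting:

    import math

    Z = 1.96                # 95% confidence
    SIGMA_MB_S = 30.0       # assumed std dev of observed EBS throughput (illustrative)
    MARGIN_MB_S = 5.0       # desired +/- on the estimated mean

    runs_needed = math.ceil((Z * SIGMA_MB_S / MARGIN_MB_S) ** 2)
    print("independent benchmark runs needed: %d" % runs_needed)   # ~139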


"RAID helps smooth out flaky performance"

Really? My experience with EBS has been more the opposite... especially with write performance.

If you've got to write data, you've got to write data. Flaky performance becomes more likely as more volumes are added.

I think AWS engineers previously recommended we use RAID0 with EBS... ugh. Cloud or not, RAID0 on production DB servers sounds downright suicidal.


I'd guess that writing to the RAID is still bottlenecked by the worst drive in the array. But if only one out of N blocks has to be read/written to that drive, average performance may improve (basically, the idea of 'regression towards the mean').

I don't know if that's what's actually happening, but testing seems to confirm that RAID performance is more consistent than single-drive performance.
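
A toy simulation of that effect, with a completely made-up throughput distribution (one "bad" mode, three "good" modes), for the small-request case where each request only touches one drive:

    import random
    import statistics

    random.seed(1)

    def drive_mb_s():
        # Made-up distribution: a drive (or its shared EBS spindles) is having
        # a bad moment about a quarter of the time.
        if random.random() < 0.25:
            return random.uniform(5, 15)
        return random.uniform(60, 100)

    single = [drive_mb_s() for _ in range(10000)]
    striped = [statistics.mean(drive_mb_s() for _ in range(8)) for _ in range(10000)]

    print("single volume: mean %.0f MB/s, stdev %.0f" % (statistics.mean(single),
                                                         statistics.stdev(single)))
    print("8-way stripe:  mean %.0f MB/s, stdev %.0f" % (statistics.mean(striped),
                                                         statistics.stdev(striped)))

Same mean, much less spread. Full-stripe sequential writes would behave more like a max() over the drives, which is the bottleneck case described above.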

RAID0 on EC2 isn't as insane as it appears at first glance. EBS volumes have an annual failure rate of around 0.2%, while physical hard drives are around 5%. So the chance of one EBS volume failing is about the same as the chance of two physical hard drives failing (5%^2 = 0.25%). It doesn't seem unreasonable to suggest that a RAID0 of EBS volumes is about as reliable as a RAID10 of physical drives, and more reliable than a RAID5 of physical drives.
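
Extending that back-of-envelope math to whole arrays, using the same annual failure rates (and ignoring rebuild windows and correlated failures, which flatters both sides equally); the array sizes are illustrative:

    AFR_EBS = 0.002    # annual failure rate quoted above for an EBS volume
    AFR_HDD = 0.05     # annual failure rate quoted above for a physical drive
    N = 4              # stripes in the RAID0 = mirror pairs in the RAID10

    # RAID0 of N EBS volumes loses data if any single volume fails
    p_raid0_ebs = 1 - (1 - AFR_EBS) ** N

    # RAID10 of 2N drives loses data (roughly) if both halves of any mirror fail
    p_raid10_hdd = 1 - (1 - AFR_HDD ** 2) ** N

    print("P(loss/yr), RAID0 of %d EBS volumes:      %.4f" % (N, p_raid0_ebs))
    print("P(loss/yr), RAID10 of %d physical drives: %.4f" % (2 * N, p_raid10_hdd))

With four stripes versus four mirror pairs, the two come out within rounding of each other, which is the comparison being made above.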


What good is a reliable datastore that is slow as molasses?


If you take a look at our benchmarks, you'll see that with RAID you can get 100 MB/s of sequential reads (or ~5 MB/s of random reads). Even on a really bad day, speeds only fall to around a third of that.

While that's not the fastest thing in the world (e.g., a good SSD will outperform a 16-drive EBS RAID for most workloads), I don't think it's fair to characterize it as 'slow as molasses'.


Most people are probably using EBS to run a relational database, or something else that does random reads far more often than sequential reads. Speaking from experience, a 4-drive EBS RAID couldn't even match the performance of a 4-drive RAID-10. Once we started adding SSDs, the gap widened significantly.



