SSDs and distributed data systems

Dave_Rosenthal · on June 4, 2012

I'll add a few things (I've done a lot of SSD testing lately) as SSD performance is highly dependent on factors that don't influence hard disks:

0) Sequential performance is easy. If that's what you want, SSDs work pretty well and you can skip this post. The below points are for random IO.

1) Almost all SSDs are significantly slower at mixed read/write workloads (as a database would do) than at either just reads or just writes. Sometimes they drop to as little as 1/4 the speed!

2) Random I/O throughput, especially for writes, is highly dependent on how full the disk is (assuming TRIM). For example, a 50% full disk is usually pretty fast, an 80% full disk is getting slower, and a 95% full disk is dog slow.

3) I have seen SSD controller and firmware version drastically impact performance. A recent firmware "upgrade" halved IOps on 100 of our SSDs (glad we tested that one...)

4) Time dependence! Many of my heavy I/O tests took 8+ hours to stabilize (often with very odd transition modes in between). Don't run a 30 second test and assume that's how fast the disk will stay.

5) Lastly, have many outstanding IOs in the queue if you want good IOps throughput. Think 32+.

My recommendation overall: Test your actual application workload for 24 hours. Use a Vertex 4 with firmware 1.4 less than 75% full for your mixed read/write workload needs!
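To put a rough number on point 5, here is a back-of-the-envelope sketch using Little's law (throughput = requests in flight / time per request). The 100 us latency is a made-up figure for illustration, not a measurement, and the function name is hypothetical. Real per-request latency tends to rise as the queue gets deeper, so treat the result as a ceiling rather than a prediction.

```python
def iops_ceiling(queue_depth: int, latency_s: float) -> float:
    """Upper bound on IOps from Little's law: in-flight requests / latency."""
    if queue_depth < 1 or latency_s <= 0:
        raise ValueError("queue_depth must be >= 1 and latency_s must be > 0")
    return queue_depth / latency_s


# Assumption: 100 microseconds per random IO, held constant at every depth.
ASSUMED_LATENCY_S = 100e-6

for depth in (1, 4, 32):
    ceiling = round(iops_ceiling(depth, ASSUMED_LATENCY_S))
    print(f"queue depth {depth:>2}: at most ~{ceiling} IOps")
```

With a single outstanding IO the disk spends most of its time waiting on the host, which is why a shallow queue can make a fast SSD look slow.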

sounds · on June 4, 2012

I'd really like to see x86_64 servers that use direct memory-mapped IO to the SSD. Some add-in PCI cards already do (can't find the link right now). It would really take Intel jumping on board for it to become the standard though.

More particularly, right now, assuming the app wants something from a memory-mapped file, the process is:

  1. app attempts to read some bytes
  2. CPU page fault triggers a read by the kernel to get them from disk
  3. kernel block layer locates bytes on disk
  4. block read request submitted to storage subsystem
  5. read request (probably merged with others, e.g. readahead) submitted to SATA controller
  6. ATA command decoded by SSD
  7. bytes sent back up the chain
  (skipping the return trip up the chain for brevity)
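For anyone who wants to see step 1 from the application side, here is a minimal Python sketch of reading through a memory-mapped file. The helper name and the temp file are just for illustration. Touching the mapped bytes is what can trigger the page fault in step 2; everything below that happens in the kernel and the device. Note that in this demo the file was just written, so the bytes are almost certainly served from the page cache rather than from disk.

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map the whole file read-only and copy out a slice of it."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing touches the mapped pages; a page that is not resident
            # faults and the kernel fetches it through the block layer.
            return m[offset:offset + length]


# Build a small demo file: two 4 KiB pages of filler, then a marker.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 8192 + b"hello")
    demo_path = tmp.name

try:
    data = read_via_mmap(demo_path, 8192, 5)
    print(data)
finally:
    os.remove(demo_path)
```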

Some PCI SSDs already make possible an improved path that looks like this:

  1. app opens a file
     blocks mapped into app address space as read-only

wmf · on June 4, 2012

> I'd really like to see x86_64 servers that use direct memory-mapped IO to the SSD.

It's not clear whether this is as good an idea as people think. Flash is fast, but the difference between ~20 us and ~50 ns is still huge. You could end up wasting a lot of cycles while the processor is stalled. Also, there's no way for memory to report errors short of a machine check that (if you're lucky) kills the process.
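To make the gap concrete, here is the arithmetic as a short Python sketch. The 3 GHz clock is an assumption picked for round numbers, and the function name is hypothetical; the two latencies are the ~20 us and ~50 ns figures above.

```python
# Assumption: a 3 GHz core, chosen only to make the numbers easy to read.
ASSUMED_CLOCK_HZ = 3_000_000_000


def stall_cycles(latency_s: float, clock_hz: int = ASSUMED_CLOCK_HZ) -> int:
    """Clock cycles that elapse while a load waits latency_s seconds."""
    return round(latency_s * clock_hz)


flash_cycles = stall_cycles(20e-6)  # ~20 us flash read
dram_cycles = stall_cycles(50e-9)   # ~50 ns DRAM access

print(f"flash: {flash_cycles} cycles, DRAM: {dram_cycles} cycles, "
      f"ratio: {flash_cycles // dram_cycles}x")
```

Under that assumption a single stalled load to flash costs tens of thousands of cycles, during which a blocking memory access gives the core nothing else to do.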

Fusion io is working on an intermediate approach that bypasses the OS but doesn't try to treat flash as memory.

_wiv7 · on June 4, 2012

I find myself in disagreement with the idea that read/write cycles for (rotating) hard disks are unlimited. While it may be much higher, and a 'per block' characterization is largely improper -- the thing that fails is not the block itself but the moving parts -- when I worked in a job where we had tons of cheap consumer-grade hard disks, they did seem to fail after a few hundred complete passes over the disk.

spydum · on June 4, 2012

Pretty sure Google published a paper on disk failures that can begin to refute that assumption: http://research.google.com/archive/disk_failures.pdf

crazygringo · on June 4, 2012

I remember RethinkDB was tackling this... if I recall correctly, they had been developing a new MySQL engine specifically designed for SSDs, but looking at their site now it seems they gave that up and changed it to a memcache protocol to focus on NoSQL?

Has anybody ever used it, or know of other companies working on similar ideas?

strlen · on June 4, 2012

Tokutek uses Fractal Trees which have several SSD-friendly properties: http://www.tokutek.com/. Unfortunately, I don't have any first hand experience with it.

If you want an interesting challenge, you could also try implementing a MySQL storage engine using LevelDB. Obviously a basic implementation shouldn't be too difficult, but getting good performance and reliability would require some effort.