
Inside the SSD are flash memory chips. It takes a relatively long time (typically 10 ms) to tunnel the charge that changes a '1' bit (the erased state) to a '0' bit. If the power goes away during that programming time, the result is indeterminate. Not just bad, but bad in a potentially unknown way.

The worst case, which I've experienced with direct writing to flash chips, is that a totally unexpected flash location is corrupted. My guess was that the CPU started the write cycle just as power was lost, the CPU glitched the address lines as it was losing its brains, and the flash corrupted a random location.

Very, very bad.

If a power loss causes the flash to scribble on the wrong SSD location (e.g. the tables that keep track of good and bad blocks), the SSD "dies".




>If a power loss causes the flash to scribble on the wrong SSD location (e.g. the tables that keep track of good and bad blocks), the SSD "dies".

The problem isn't that data is corrupted in the flash; the problem is that the device's own firmware, which the SSD's embedded controller (a tiny computer in itself) "boots" from, is stored in the same memory used for data storage. They could have gotten around this by storing the firmware not on the NAND used for data but on a separate device (or inside the controller itself), so that a power loss might cause data corruption but would not render the SSD completely unresponsive and inoperable.


That's still no reason why you would need "power loss protection", in the sense of energy storage. What is needed is proper brown-out detection and properly set up reset circuitry so that the write gets aborted before lines start glitching (that is to say: energy needs to be dumped so it cannot cause any damage once the CPU starts losing its brains).

That the storage location where a write was in progress is indeterminate afterwards shouldn't matter - between the time that some software initiates a write() and the time an fsync() on the same file returns, there is no guarantee what the written location will contain after a power failure, and if your software relies on the value in any way whatsoever, your software is broken.
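To make that concrete, here's a minimal sketch in C of that contract (file name and payload are made up):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      /* Hypothetical file and payload, just to illustrate the contract. */
      int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
      if (fd < 0) { perror("open"); return 1; }

      const char buf[] = "record v2";

      /* After write() returns, nothing is promised about what the medium
         holds if power fails now: old data, new data, or garbage. */
      if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
          perror("write");
          return 1;
      }

      /* Only once fsync() has returned may the application rely on the
         new contents surviving a power loss. */
      if (fsync(fd) != 0) { perror("fsync"); return 1; }

      close(fd);
      return 0;
  }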


Yes there is. The amount of hold-up time is specified in the flash manufacturer's data sheets.

The problem is that the flash has an internal state machine that performs the charge tunneling as an iterative process: it tunnels some charge, checks the level on the floating gate, and repeats as necessary. If the power to the flash chip goes away or glitches during this internal programming process, the flash write fails in indeterminate and sometimes very unexpected ways.
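A toy model of that program/verify loop in C (the cell model and helper names here are made up; a real controller's state machine is far more involved):

  #include <stdio.h>

  #define MAX_PULSES 16

  /* Made-up cell model: each pulse tunnels a little more charge. */
  struct cell { int level; };

  static void apply_program_pulse(struct cell *c) { c->level += 1; }
  static int  read_cell_level(const struct cell *c) { return c->level; }

  static int program_cell(struct cell *c, int target_level) {
      for (int pulse = 0; pulse < MAX_PULSES; pulse++) {
          apply_program_pulse(c);                  /* tunnel some charge */
          if (read_cell_level(c) >= target_level)  /* verify step */
              return 0;
          /* A power loss or glitch anywhere in this loop leaves the cell
             at an intermediate, indeterminate charge level. */
      }
      return -1;                                   /* program failure */
  }

  int main(void) {
      struct cell c = { 0 };
      int rc = program_cell(&c, 7);
      printf("result: %d, level: %d\n", rc, c.level);
      return 0;
  }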


Well, yeah, of course, you need some limit on the speed at which the power supply voltage drops, I was talking about the "flushing the cache" kind of "power loss protection", not claiming that pulling a circuit into the reset state could happen without latency ;-)

So, yes, you of course have to have some low pass in the power supply rail to make sure that power drops no faster than you can handle shutting down the circuit in an orderly fashion - all I am saying is that there is no need to guarantee that a read of a region where a yet-unacknowledged write was happening when the power supply failed returns non-random data, so it is perfectly fine to interrupt the programming process and leave cells where user data is stored in an indeterminate state. It's not OK to glitch address lines while programming is still going on, of course :-)


The "super capacitors" (almost nobody uses actual super capacitors after early models discovered that super capacitor lifetime at server temperature was inadequate) are just a low-pass filter - they usually only keep the drive online for a couple dozen milliseconds after main power goes down.

Most reasonable SSDs do not cache writes at all, but because of wear leveling they need a sector-mapping table to keep track of where each sector actually lives. That table takes many, many updates, and since it's usually stored in some form of a tree, it's expensive to save to media that does not support directly overwriting data (i.e. NAND, which requires a relatively long erase operation to become writable). This table is typically what is lost during power events, and it is not usually written out when you sync a write.
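Roughly, that table is a logical-to-physical lookup that gets rewritten on every host write; here's a toy sketch in C (sizes and names are made up) of why losing it means losing track of where the latest copy of each sector lives:

  #include <stdint.h>
  #include <stdio.h>

  #define NUM_SECTORS 1024
  #define UNMAPPED    UINT32_MAX

  /* Logical sector -> physical NAND page. Held in the controller's RAM
     and only periodically flushed, which is the crux of the problem. */
  static uint32_t l2p[NUM_SECTORS];

  static void write_sector(uint32_t lba, uint32_t new_phys_page) {
      /* Data goes to a fresh physical page; the old copy becomes stale.
         If this in-RAM update is lost, the drive forgets which copy is
         current. */
      l2p[lba] = new_phys_page;
  }

  int main(void) {
      for (uint32_t i = 0; i < NUM_SECTORS; i++) l2p[i] = UNMAPPED;
      write_sector(42, 7001);
      printf("LBA 42 -> physical page %u\n", (unsigned)l2p[42]);
      return 0;
  }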

So what happens is you write, sync, get an ack, lose power, reboot, and magically that sync'd data is either corrupt, or, even worse, it's regained the value it had before your last write, with no indication that there is a problem. This can cause some extremely interesting bugs.


10 milliseconds is far too long to wait for a write in application code.


Huh? What kind of applications do you write where 10 ms is too much latency?!


I can see you didn't spend much time with early SSDs back in the bad old days, when 2.5" SSDs were unsophisticated and essentially used the same architecture you'd find inside a $5 SD card from Walgreens. Those SSDs often performed worse than uncached HDDs when you threw random writes at them.

  Huh? What kind of applications do you write where 10 ms is too much latency?!
10ms per write is huge. Keep in mind that even a tiny 1KB database insert/update will involve multiple writes: the filesystem journal, the database journal, and the data itself. 10ms for each of those steps adds up quickly.

Alternately, consider writing a modestly-sized file to disk, like a 5MB mp3 file. You're spanning multiple flash cells (probably 40 cells of 128 kB at a bare minimum, plus filesystem journaling, etc.) at that point. Now you're close to half a second of total latency. Oh, you're writing an album's worth of .mp3 files? We're up to five seconds of latency now. But probably more like ten seconds, if your computer is doing anything else whatsoever that involves disk writes in the background.
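For the record, the back-of-the-envelope math (assuming 10 ms per program operation and a 12-track album):

  #include <stdio.h>

  int main(void) {
      double file_kb   = 5 * 1024;   /* one ~5 MB mp3 */
      double chunk_kb  = 128;        /* flash program unit */
      double op_ms     = 10;         /* assumed per-program latency */
      double per_track = (file_kb / chunk_kb) * op_ms;   /* ~400 ms */
      printf("per track: ~%.0f ms, 12-track album: ~%.1f s\n",
             per_track, 12 * per_track / 1000.0);
      return 0;
  }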

So yeah, 10ms write latency is no fun.


Why would you want to order the write of every 128 kB chunk of your MP3 album with regard to every other? A rotating HDD also has a latency of around 10 ms, and you certainly can write an album's worth of MP3 files faster than in 10 seconds - unless you insist on flushing every 128 kB chunk to the medium separately for no apparent reason, of course.

That you need multiple serialized writes for a commit is a valid point, but I would think that for most applications even a write transaction latency of 100 ms isn't a problem. Also, if you overwrite the contents of an existing file without changing its size, you don't need any FS journaling at all, as the FS only needs to maintain metadata consistency; it's not a file-content transaction layer (details depend on the FS, obviously).

Cheap SD cards aren't slow because they have a 10 ms write latency, but because they have very low IOPS, which you also could not change by adding a cache and a buffer capacitor, but only by parallelizing writes across flash cells.



