Why must SSDs have "power loss protection" in the form of a battery or supercapacitors to finish writes? Can they not simply cache writes without falsely claiming to the OS that they are fully written, and do some internal (journal-like) housekeeping for recovery? Does SATA/SCSI support the concept of "sync", or would allowing the disk to fall behind without deceiving the OS kill performance in some way?
Inside the SSD are flash memory chips. It takes a relatively long time (typically 10 ms) to tunnel the charge that changes a '1' bit (the erased state) to a '0' bit. If the power goes away during that programming time, the result is indeterminate. Not just bad, but bad in a potentially unknown way.
The worst case, which I've experienced with direct writing to flash chips, is that a totally unexpected flash location is corrupted. My guess was that the CPU started the write cycle just as power was lost, the CPU glitched the address lines as it was losing its brains, and the flash corrupted a random location.
Very, very bad.
If a power loss causes the flash to scribble on the wrong SSD location (e.g. the tables that keep track of good and bad blocks), the SSD "dies".
>If a power loss causes the flash to scribble on the wrong SSD location (e.g. the tables that keep track of good and bad blocks), the SSD "dies".
The problem isn't that data is corrupted in the flash. The problem is that the device's own firmware (which the SSD's embedded controller, a tiny computer in itself, "boots" from) is stored in the same memory used for data storage. They could've gotten around this by storing the firmware not on the NAND used for data but on a separate device (or inside the controller itself), so that a power loss may cause data corruption but does not render the SSD completely unresponsive and inoperable.
That's still no reason why you would need "power loss protection", in the sense of energy storage. What is needed is proper brown-out detection and properly set up reset circuitry so that the write gets aborted before lines start glitching (that is to say: energy needs to be dumped so it can not cause any damage once the CPU starts losing its brains).
That the storage location where a write was in progress is indeterminate afterwards shouldn't matter - between the time that some software initiates a write() and the time an fsync() on the same file returns, there is no guarantee what the written location will contain after a power failure, and if your software relies on the value in any way whatsoever, your software is broken.
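The write()/fsync() contract described above can be sketched from the application side. This is a minimal Python sketch, assuming a POSIX-like system; the helper name `durable_write` is made up for illustration. Only once `os.fsync()` has returned may the program treat the bytes as durable, and even that assumes the drive honors the flush.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and return only after the OS reports it flushed.

    Hypothetical helper. Between os.write() and the return of os.fsync(),
    the on-media contents of this region are unspecified after a power loss.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may accept fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # ask the OS to push the data to the device
    finally:
        os.close(fd)
```

Software that reads the region back after a crash without a completed `fsync()` has to treat the contents as garbage, which is the point being made here.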
Yes there is. The amount of hold up time is specified in the flash manufacturer's data sheets.
The problem is that the flash has an internal state machine that performs the charge tunneling as an iterative process: it tunnels some charge, checks the level on the floating gate, and repeats as necessary. If the power to the flash chip goes away or glitches during this internal programming process, the flash write fails in indeterminate and sometimes very unexpected ways.
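The iterative program-and-verify loop can be illustrated with a toy model. This is only a sketch: the function name, the charge units, the step size and the pulse limit are all invented, and it models only a partially charged cell, not the address-glitch case described earlier.

```python
from typing import Optional, Tuple

def program_cell(target: float, step: float = 0.25, max_pulses: int = 16,
                 power_fails_after: Optional[int] = None) -> Tuple[float, str]:
    """Toy model of iterative program-and-verify on one flash cell.

    Returns (charge_level, status). All numbers are made up for illustration.
    """
    level = 0.0
    for pulse in range(max_pulses):
        if power_fails_after is not None and pulse >= power_fails_after:
            # Power vanished mid-sequence: the cell holds a partial charge
            # that may read back as either bit value.
            return level, "indeterminate"
        level += step                     # tunnel some charge onto the floating gate
        if level >= target:               # verify step: has the gate reached the target?
            return level, "programmed"
    return level, "failed"
```

Calling `program_cell(1.0)` runs to completion, while `program_cell(1.0, power_fails_after=2)` stops with the cell half charged, which is the indeterminate state being discussed.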
Well, yeah, of course, you need some limit on the speed at which the power supply voltage drops, I was talking about the "flushing the cache" kind of "power loss protection", not claiming that pulling a circuit into the reset state could happen without latency ;-)
So, yes, you of course have to have some low pass in the power supply rail to make sure that power drops no faster than you can handle shutting down the circuit in an orderly fashion - all I am saying is that there is no need to guarantee that a read of a region where a yet-unacknowledged write was happening when the power supply failed returns non-random data, so it is perfectly fine to interrupt the programming process and leave cells where user data is stored in an indeterminate state. It's not OK to glitch address lines while programming is still going on, of course :-)
The "super capacitors" (almost nobody uses actual super capacitors after early models discovered that super capacitor lifetime at server temperature was inadequate) are just a low-pass filter - they usually only keep the drive online for a couple dozen milliseconds after main power goes down.
Most reasonable SSDs do not write cache at all, but thanks to the wear-leveling issues, they need to have a sector-mapping table to keep track of where each sector actually lives. That table takes many, many updates, and since it's usually stored in some form of a tree, it's expensive to save to media which does not support directly overwriting data (i.e. NAND, which requires a relatively long erase operation to become writable). This table is typically what is lost during power events, and it is not usually written out when you sync a write.
So what happens is you write, sync, get an ack, lose power, reboot, and magically that sync'd data is either corrupt, or, even worse, it's regained the value it had before your last write, with no indication that there is a problem. This can cause some extremely interesting bugs.
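A toy model shows how a lost mapping table makes sync'd data silently revert. This is a sketch under heavy assumptions: the class `ToyFTL` and its methods are hypothetical, and real firmware keeps the table in a tree with its own recovery logic.

```python
class ToyFTL:
    """Toy flash translation layer: logical sector -> physical page.

    Hypothetical model; real firmware is far more involved.
    """

    def __init__(self):
        self.pages = []        # flash pages, append-only (no overwrite in place)
        self.ram_map = {}      # current mapping, held in volatile RAM
        self.saved_map = {}    # last copy of the mapping saved to flash

    def write(self, sector, data):
        self.pages.append(data)                    # data lands on a fresh page
        self.ram_map[sector] = len(self.pages) - 1

    def checkpoint(self):
        self.saved_map = dict(self.ram_map)        # expensive, so done only occasionally

    def power_loss(self):
        self.ram_map = dict(self.saved_map)        # RAM is gone; reload the saved copy

    def read(self, sector):
        return self.pages[self.ram_map[sector]]


ftl = ToyFTL()
ftl.write(7, b"old")
ftl.checkpoint()
ftl.write(7, b"new")       # acknowledged to the host, mapping only in RAM
ftl.power_loss()
print(ftl.read(7))         # b'old': the sector regained its previous value
```

The new data is still physically on a flash page; it is the pointer to it that was lost.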
I can see you didn't spend much time with early SSDs back in the bad old days when 2.5" SSDs were unsophisticated and were essentially using the same architecture you'd find inside a $5 SD card from Walgreens. Those SSDs often performed worse than uncached HDDs when you threw random writes at them.
Huh? What kind of applications do you write where 10 ms is too much latency?!
10ms per write is huge. Keep in mind that even a tiny 1KB database insert/update will involve multiple writes: the filesystem journal, the database journal, and the data itself. 10ms for each of those steps adds up quickly.
Alternately, consider writing a modestly-sized file to disk, like a 5 MB mp3 file. You're spanning multiple flash cells (probably 40 128 kB cells at a bare minimum, plus filesystem journaling etc.) at that point. Now you're close to half a second of total latency. Oh, you're writing an album's worth of .mp3 files? We're up to five seconds of latency now. But probably more like ten seconds, if your computer is doing anything else whatsoever that involves disk writes in the background.
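The arithmetic behind those figures, as a small sketch. It assumes every chunk is written strictly one after another with nothing overlapping; the track count of 12 is an assumption, not something stated above, and binary units are used so the chunk count comes out to exactly 40.

```python
WRITE_LATENCY_S = 0.010            # 10 ms per write, the figure used in this thread
CHUNK_BYTES = 128 * 1024           # one 128 kB chunk
FILE_BYTES = 5 * 1024 * 1024       # one 5 MB mp3
TRACKS = 12                        # assumed album length, not stated in the thread

def chunk_count(total_bytes, chunk_bytes=CHUNK_BYTES):
    return -(-total_bytes // chunk_bytes)          # ceiling division

def serialized_latency(total_bytes, latency=WRITE_LATENCY_S):
    # Every chunk waits for the previous one to finish.
    return chunk_count(total_bytes) * latency

per_file = serialized_latency(FILE_BYTES)          # 40 chunks, roughly 0.4 s
per_album = TRACKS * per_file                      # roughly 4.8 s for 12 tracks
print(chunk_count(FILE_BYTES), per_file, per_album)
```

The totals hold only if each chunk is flushed separately; queued or parallel writes would not add up this way.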
Why would you want to order the write of every 128 kB chunk of your MP3 album with regard to every other? A rotating HDD also has a latency of around 10 ms, and you certainly can write an album's worth of MP3 files faster than in 10 seconds - unless you insist on flushing every 128 kB chunk to the medium separately for no apparent reason, of course.
That you need multiple serialized writes for a commit is a valid point, but I would think that for most applications even a write transaction latency of 100 ms isn't a problem. Also, if you overwrite contents of an existing file without changing its size, you don't need any FS journaling at all, as the FS only needs to maintain metadata consistency, it's not a file-content transaction layer (details depend on the FS, obviously).
Cheap SD cards aren't slow because they have a 10 ms write latency, but because they have a very low IOP rate, which you also could not change by adding a cache and a buffer capacitor, but only by parallelizing writes to flash cells.
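The distinction between latency and IOP rate can be put in numbers. A rough sketch, assuming ideal parallelism with no contention between flash dies; the function name and the figure of 8 parallel writes are illustrative only.

```python
def iops(latency_s, parallel_writes=1):
    """Sustained writes per second at a fixed per-write latency."""
    return parallel_writes / latency_s

one_at_a_time = iops(0.010)        # about 100 writes/s with a single write in flight
eight_wide = iops(0.010, 8)        # about 800 writes/s, same 10 ms latency per write
print(one_at_a_time, eight_wide)
```

The per-write latency is identical in both cases; only the number of writes in flight changes the rate.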
One main reason is that an SSD does not store data where you ask it to; it maps that location to any other location it sees fit. That means it maintains an address-to-physical-location mapping. If the SSD loses that mapping, your data is gone: you simply cannot reconstruct the order of things.
There are some recovery strategies employed by the SSD firmware but they can't handle all possible scenarios and you are likely to lose some data in case of an unprotected power failure.
Also, as mentioned before, the write takes time, and if it does not finish in time the location being written will hold unreadable data.
The main issue is that the firmware is stored in the same flash memory as the rest of the data, so it's subject to the same corruption. "Bit rot" in NAND flash is real: the stored charges gradually leak out, and especially with shrinking bit cells and MLC, even data that's sitting there dormant will gradually erase itself. The firmware therefore has to periodically rewrite these blocks, moving them around at the same time to do wear leveling (write endurance is another thing that has been sacrificed to higher density).
Corrupting data or BMTs if powered off in the middle of a write is not such a bad thing compared to if that data happens to be the drive's firmware. In the former case at worst you lose a superblock and the OS doesn't boot, but you can still recover from that fairly easily compared to the latter case, where the drive can become no longer responsive.
It's unlikely that the firmware region will be corrupted unless you are in the middle of a firmware upgrade. I've seen a large variety of SSD and HDD failures, and while such a failure is not impossible, it's one I've never seen and it sounds rather improbable.
I would like to see a hybrid drive system where writes are first made to a journal which exists on a spinning platter, maintained in such a way that seek times are absolutely minimized. Then, the data is transferred to the flash side of the house.
This would result in a drive with fantastic throughput, except when immediately reading what was written. So long as there was a delay between writes and reads of the same data, it would look like a perfectly performing SSD. This could have fantastic properties for web applications. Given network latencies, many web applications could have considerable delays between the time data is written to disk and subsequently read.
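The proposal can be sketched as a toy model. Everything here is hypothetical (class name, methods, the "slow"/"fast" labels); it only illustrates why a read that arrives before the journal has been drained takes the slow path.

```python
class ToyJournaledHybrid:
    """Toy model of the proposal: an HDD journal in front of flash. Hypothetical."""

    def __init__(self):
        self.journal = []      # sequential append-only log on the platter
        self.flash = {}        # sector -> data, the SSD side

    def write(self, sector, data):
        self.journal.append((sector, data))        # sequential append, no seek

    def drain(self):
        for sector, data in self.journal:          # replay in order; later writes win
            self.flash[sector] = data
        self.journal.clear()

    def read(self, sector):
        # The newest un-drained entry wins; this is the slow path.
        for s, data in reversed(self.journal):
            if s == sector:
                return data, "journal (slow)"
        return self.flash[sector], "flash (fast)"
```

A `read()` issued right after `write()` is served from the journal; once `drain()` has run, the same read is served from flash.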
What you're describing is essentially how hybrid drives like the Momentus XT, as well as Apple's "Fusion Drive" software solution already work -- writes go to the HDD and are then pulled into the SSD for fast reads.
Also, I believe you could also achieve this at a software level with ZFS, which I understand allows you to do things like allocate an entire SSD (or multiple SSDs) as caches for other (presumably slower) drives.
>writes go to the HDD and are then pulled into the SSD for fast reads.
With journalling?
>Also, I believe you could also achieve this at a software level with ZFS, which I understand allows you to do things like allocate an entire SSD (or multiple SSDs) as caches for other (presumably slower) drives.
Careful reading, please. The key point here is journalling combined with the high throughput of spinning platters when you can create situations that eliminate seek and rotational latency, not caching.
I guess I had trouble understanding your idea because it's contrary to one of the few principles in computer architecture that nobody has ever argued about.
You want to stick a spinning rust platter with moving parts in front of a solid-state memory that is several hundred percent faster for sequential tasks, and is an order of magnitude faster for random tasks?