This is a huge problem, and there's a good deal of misinformation that has confused the issue. I'm going to debunk two claims: first, that DigitalOcean isn't violating user expectations (they are), and second, that doing this correctly is difficult (it isn't). The tl;dr is that if DigitalOcean is doing this, they are not using their hardware correctly.
First, it's not uncommon for virtual disk formats to be logically zeroed even when they are physically not. For example, when you create a sparse virtual disk, it appears to be X GB, all zeroed and ready to use. Of course, it's not. And this doesn't just apply to virtual disks; operating systems use the same technique when freeing pages of memory - when a page is no longer being used, why zero it right away? Deferring work until it's actually needed is common and typically built in. Linux does it, Windows does it [http://stackoverflow.com/questions/18385556/does-windows-cle...], and even SSDs do it under the hood. For virtual hard disk technology, Hyper-V VHDs do it, VMware VMDKs do it, sparse KVM disk image files do it. Zeroed data is the default, the expectation on most platforms. Protected, virtual-memory-based operating systems will never serve your process another process's data, even if they defer the zeroing until the last possible moment. AWS will never serve you another customer's data, Azure won't, and none of the major hypervisors will do it by default. The exception is when a whole disk or logical device is assigned to a VM, in which case it's usually used verbatim.
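To make the sparse case concrete, here's a minimal sketch of "logically zeroed, physically unallocated" (assuming Linux and any filesystem with sparse-file support; the filename is made up):

    /* sparse_demo.c - a "10 GB" disk image that occupies almost no physical space.
       Build: cc -o sparse_demo sparse_demo.c */
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("guest.img", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Logically 10 GB; no blocks are allocated yet. */
        if (ftruncate(fd, 10LL * 1024 * 1024 * 1024) != 0) { perror("ftruncate"); return 1; }

        struct stat st;
        fstat(fd, &st);
        printf("logical size: %lld bytes, physical: %lld bytes\n",
               (long long)st.st_size, (long long)st.st_blocks * 512);

        /* Reading anywhere in the unallocated region returns zeroes -
           the filesystem synthesizes them; nothing was ever written. */
        char buf[4096];
        ssize_t n = pread(fd, buf, sizeof buf, 5LL * 1024 * 1024 * 1024);
        printf("read %zd bytes from the middle of the hole, first byte = %d\n", n, buf[0]);

        close(fd);
        return 0;
    }

The guest sees X GB of zeroes; the host only allocates blocks as they're actually written.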
This brings me to the second issue. Because a raw logical device may be exactly what DigitalOcean is handing out, people have asked whether it's hard for them to fix. To answer that in a word: no. In a slightly longer word: BLKDISCARD. Or, for Windows and Mac OS X users, TRIM. It takes seconds to execute TRIM commands across hundreds of gigabytes of data because, at a low level, the operating system is simply telling the SSD "everything between LBA X and LBA X+Y is garbage." Trimming even an SSD with a heavily fragmented filesystem takes only a matter of seconds because the commands sent to the SSD's firmware are very simple and very low bandwidth. The firmware then marks those pages as free and typically defers actually erasing them until they're needed again. Not only should DigitalOcean be doing this to protect customer data, they should be doing it to preserve the longevity of their SSDs. Zeroing an SSD is costly: if the firmware doesn't detect the all-zero writes, it harms the drive's longevity by dirtying its internal pages and page cache, not to mention the performance impact on any other VMs resident on the same hardware while the host streams tens of gigabytes of zeroes to the physical device.
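For the curious, the host-side discard is a single ioctl; a rough sketch (assuming Linux; the LV path is hypothetical, and whether the discard actually reaches the SSD depends on the device-mapper stack passing it through):

    /* discard_lv.c - tell the underlying SSD that an entire logical volume is garbage.
       Roughly what `blkdiscard` does. Build: cc -o discard_lv discard_lv.c */
    #include <fcntl.h>
    #include <linux/fs.h>      /* BLKDISCARD, BLKGETSIZE64 */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dev = "/dev/vg0/droplet123";   /* hypothetical customer LV */
        int fd = open(dev, O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t size;
        if (ioctl(fd, BLKGETSIZE64, &size) != 0) { perror("BLKGETSIZE64"); return 1; }

        /* range[0] = starting byte offset, range[1] = length in bytes */
        uint64_t range[2] = { 0, size };
        if (ioctl(fd, BLKDISCARD, &range) != 0) { perror("BLKDISCARD"); return 1; }

        printf("discarded %llu bytes on %s\n", (unsigned long long)size, dev);
        close(fd);
        return 0;
    }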
Not only is DigitalOcean sacrificing the safety of users' data, they're harming the longevity of their SSDs by failing to run TRIM commands to clean up after their users. It hurts their reputation to have blog posts like this go up, and it hurts their bottom line when they misuse their hardware.
Edit: As RWG points out, not all SSDs will read zeroes after a TRIM command, so other techniques may be necessary to ensure the safety of customer data.
In your second paragraph, you're conflating two different things. File-based disk images don't leak data when they're deleted because the filesystems those images live on ensure that (non-privileged) users can't get at data from deleted files. Sparse images can be smaller than the data they contain because...well, they're sparse images. They're files with holes in them, and the filesystem automagically turns those holes into zeroes on read.
Now, about Trim... Trim is only an advisory command. You tell the disk, "I'm not using these LBAs anymore, so feel free to do whatever with them." The disk has the option to completely ignore your Trim command, and even if it does mark those LBAs as unused in whatever LBA->NAND mapping table it uses internally, the disk can also continue returning the old data on reads of those LBAs if it wants to. There are disks that make the guarantee that Trim'd LBAs will always read back zeroes until written again (an ATA feature called Read Zero After Trim), but I'm guessing DigitalOcean isn't using SSDs that support RZAT since that's generally only found on more expensive SSDs, like Intel's DC S3700.
What I'm getting at is that Trim isn't guaranteed to do what you think it does. Unless the disk supports RZAT, the only way you can guarantee that the disk won't return old data in response to a read command is to write zeroes over that block.
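If you want to know which behavior your drive/kernel combination advertises, the block layer exposes a hint; a quick check might look like this (a sketch assuming Linux and a placeholder device name; note this attribute was deprecated in newer kernels and may always report 0 there):

    /* rzat_check.c - ask the kernel whether discarded blocks on this device are
       guaranteed to read back as zeroes. Build: cc -o rzat_check rzat_check.c */
    #include <stdio.h>

    int main(void)
    {
        /* "sda" is a placeholder; substitute the SSD backing your volumes. */
        const char *path = "/sys/block/sda/queue/discard_zeroes_data";
        FILE *f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }

        int zeroes = 0;
        if (fscanf(f, "%d", &zeroes) != 1) { fclose(f); return 1; }
        fclose(f);

        printf("discarded blocks %s guaranteed to read back as zeroes\n",
               zeroes ? "ARE" : "are NOT");
        return 0;
    }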
If you're a VM provider and can't count on Trim doing what you want it to (reading back zeroes on Trim'd LBAs) because your drives don't support RZAT, and you don't want to zero out partitions at creation or destruction time, the right thing to do is encrypt every partition with its own randomly generated key at creation, then destroy the key when the partition is destroyed. Users will see random data soup on their shiny new block devices, which isn't as nice as seeing a zeroed-out block device but is still nicer than seeing some other user's raw data. (Also note that doing this doesn't stop you from also issuing a Trim for a partition when destroying it so the SSD gains some breathing room.)
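For what it's worth, a rough sketch of that scheme on Linux using libcryptsetup's plain dm-crypt mode (the LV path and mapping name are made up, error handling is trimmed, and the exact params struct may vary between libcryptsetup versions):

    /* ephemeral_crypt.c - map a customer's block device through dm-crypt with a
       throwaway key. Once the key is forgotten, the underlying LV is just noise.
       Build: cc -o ephemeral_crypt ephemeral_crypt.c -lcryptsetup */
    #include <libcryptsetup.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/random.h>

    int main(void)
    {
        char key[64];                             /* 512-bit key for aes-xts-plain64 */
        if (getrandom(key, sizeof key, 0) != sizeof key) { perror("getrandom"); return 1; }

        struct crypt_device *cd;
        if (crypt_init(&cd, "/dev/vg0/droplet123") < 0) {  /* hypothetical LV */
            fprintf(stderr, "crypt_init failed\n"); return 1;
        }

        /* CRYPT_PLAIN writes no header to disk; the mapping exists only while active. */
        struct crypt_params_plain params = { 0 };
        crypt_format(cd, CRYPT_PLAIN, "aes", "xts-plain64", NULL, NULL, sizeof key, &params);
        crypt_activate_by_volume_key(cd, "droplet123-crypt", key, sizeof key, 0);

        /* Hand /dev/mapper/droplet123-crypt to the guest. On VM destroy:
           crypt_deactivate(cd, "droplet123-crypt"); and never persist the key. */

        crypt_free(cd);
        memset(key, 0, sizeof key);   /* in real code, use explicit_bzero or similar */
        return 0;
    }

Destroying the VM then just means tearing down the mapping and forgetting the key; whatever the SSD still physically holds is ciphertext.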
You're absolutely right, and thank you for the clarification. I didn't intend to conflate sparse files with file-based disk images; I was trying to convey that there can be a difference between the logical data of a disk image and the physical data, and that deferred zeroing is the default and the expectation of developers and sysadmins. Images can be sparse and/or file-based; the two features are orthogonal.
More importantly, you clarify that RZAT is a necessary feature for what I'm describing to work properly. You're right. They should both ensure that blocks served to customer VMs read back as zeroes and run TRIM commands appropriately to get maximum performance from their hardware. Not all SSDs implement RZAT, and it wouldn't be a bad idea for the host to ensure the device is logically zeroed for the VM anyway.
DigitalOcean could easily switch to doing both, or at least guarantee the former by creating new logical disks for customers as every other vendor does. If, as they have blogged about in the past, they are directly mapping virtualized disks to the host's LVM volumes, they are unnecessarily complicating their hosting setup and making their host configuration more brittle. With thin-provisioned/sparsely-allocated or file-based virtual disk images, they could more flexibly deploy VMs with different disk sizes with minimal changes in host configuration.
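On the thin-provisioning point, file-backed images can also hand freed space back to the host when a guest discards blocks; the host-side primitive looks roughly like this (a sketch assuming Linux and a filesystem with hole-punching support such as ext4 or XFS; the image path and offsets are made up):

    /* punch_hole.c - give a discarded region of a file-backed disk image back to
       the filesystem; the hole reads back as zeroes afterwards.
       Build: cc -o punch_hole punch_hole.c */
    #define _GNU_SOURCE
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("guest.img", O_RDWR);       /* hypothetical image file */
        if (fd < 0) { perror("open"); return 1; }

        /* Drop 1 GB starting at offset 2 GB; the file's logical size is unchanged. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      2LL * 1024 * 1024 * 1024, 1LL * 1024 * 1024 * 1024) != 0)
            perror("fallocate");

        close(fd);
        return 0;
    }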
Alternatively, they could trivially ensure that even forensic tools would have a very difficult time recovering deleted volumes by enabling dm-crypt on top of LVM and discarding the key every time a virtual machine is deleted. This could reduce performance on some SSDs (particularly SandForce-based models) but would require minimal changes to their configuration while ensuring deleted data is unrecoverable.
Using 1:1 mappings of LVM logical volumes to guest VM block devices is the most straightforward and performant method of doing it on Linux, short of doing 1:1 mappings of entire disks or disk partitions to guest VM block devices. While using file-based disk images would prevent data leaks between customers without any further effort required on the VM provider's part (assuming they don't reuse disk images between customers!), there are tons of downsides to file-based disk images, mostly related to performance and write amplification.
I don't agree that file-based disk images are more flexible than LVM's logical volumes — it's ridiculously easy to create, destroy, resize, and snapshot LVs.
Until very recently there were serious problems with putting LVM under any sort of concurrent load. Making more than a few snapshots at the same time, for instance, was asking for trouble. I say "was" - I've got no idea if these problems were fixed. You just don't have those problems with file-based images.
>Unless the disk supports RZAT, the only way you can guarantee that the disk won't return old data in response to a read command is to write zeroes over that block.
While this is true (the disk will never respond with old data once you have zeroed it out), it is important to remember that even zeroing the disk yourself isn't a guarantee that the old data is actually gone from the disk itself. The disk may present itself as a raw block device, but internally it may use error correction, wear leveling, or bad-block remapping schemes that can leave old data on the media even though you have written zeroes over it - for example, hard disks remapping bad sectors, or SSDs relocating chunks of data when the flash cells in a chunk start to wear out. You would have to use forensic means to recover this information, but it remains. The only way to guarantee that the information cannot remain on the disk is to use encryption and make sure that the unencrypted key never touches the disk.
I don't think people care about the idea that a sufficiently advanced attacker with physical access to the hardware can recover old data anywhere near as much as they care that the next customer along gets a fully readable copy of your data just handed to them.
>For example, when you create a sparse virtual disk, it appears to be X GB, all zeroed and ready to use. Of course, it's not.
From within the VM, all the VM will see is zeroes. It sounds like DO is giving VM instances direct access to the underlying SSD or something like that. In fact, I'm having a hard time figuring out precisely how this is happening. When you create a new VM, how can it possibly be reading data from the host's hard drive? Isn't that the definition of a security problem, since VMs are expected to be isolated?
I hope someone will explain the underlying technical details more deeply, because this is very interesting.
Please read to the end of my comment - it appears what DigitalOcean is doing is giving the VM access to a logical device that is preallocated. Perhaps carved out of LVM or MD or some other logical disk. KVM's default behavior when using these sorts of devices is to present to the VM whatever data already existed at the lower level.
Er, I fully read your comment when it was 7 minutes old, but it looks like you've edited it significantly since then to fill in some missing details. Thank you for explaining, I appreciate it!
TRIM at the point of killing the VM doesn't really help, though - all you are doing is wiping the blocks where the data is currently stored. Your logical volume could have existed in many different physical places on those disks or on others as the host rearranges things for any reason or swaps old hardware for new, so there could be "ghost" copies of your (slightly older) data all over the place, and people could have had it mapped into their new volumes ages ago already.
The only way to ensure your data is secure is to use encryption to start with (preferably full-volume encryption, with the keys not stored at the provider's end, so you'll need some mechanism for giving the VM the keys when it reboots and will have to trust that no one can read them from RAM). Then you don't need to wipe the data at all: just destroy all copies of the keys and the data is rendered unreadable (to anyone given a new volume that spans physical media where your data once sat, it is indistinguishable from random noise).
Note that sparse files have... performance issues. In the very best case you are going to end up with a lot of fragmentation where you aren't expecting it. I was on sparse files in 2005; I'm having a hard time finding my blog posts on the subject, but I didn't switch to pre-allocation 'cause I like paying for more disk, you know?
You are right about the zeroes, though; sparse files solve that problem. And this is what I personally find interesting about this article. I would be very interested to find out what Digital Ocean uses for storage. This does indicate to me that they are using something pre-allocated; I can't think of any storage technology that allows over-subscription that would not also give you zeroes in your 'free' (unallocated) space.
>For virtual hard disk technology, Hyper-V VHDs do it, VMware VMDKs do it, sparse KVM disk image files do it. Zeroed data is the default, the expectation on most platforms. Protected, virtual-memory-based operating systems will never serve your process another process's data, even if they defer the zeroing until the last possible moment. AWS will never serve you another customer's data, Azure won't, and none of the major hypervisors will do it by default. The exception is when a whole disk or logical device is assigned to a VM, in which case it's usually used verbatim.
Yeah, the thing you are missing there? VMware, well... it's a very different market. Same with Hyper-V. And sparse files, well, as I explained, suck. (I suspect that to the extent that Hyper-V and VMware use sparse files, they also suck in terms of fragmentation when you've got a bunch of VMs per box. But most of the time if you are running VMware, you've got money, and you are running few guests on expensive, fast hardware, so it doesn't matter so much.)
Most dedicated server companies have this problem. Most of the time, you will find something other than a test pattern on your disks, unless you are the first customer on the server.
No matter who your provider is, it's always good practice to zero your data behind you when you leave. Your provider should give you some sort of 'rescue image' - something you can boot off of that isn't your disk that can mount your disk. Boot into that and scramble your disk before you leave.
I know I had this problem, too, many years ago when I switched from sparse files to LVM-backed storage. Fortunately for me, if I remember right, Nick caught it before the rest of the world did. I solved it by zeroing any new disk I give the customer. It takes longer, especially when I ionice the dd to the point where it doesn't kill performance for the other customers, but I am deathly afraid (as a provider should be) of someone writing an article like this about me. Ideally, I'd have a background process doing this at a low priority on all free space all the time, but zeroing at creation, I feel, is the most certain way to know that the new customer is getting nothing but zeroes.
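If anyone's curious, the zeroing pass itself is trivial; a minimal sketch (assuming Linux, with a made-up LV path; in practice you'd still run it under ionice or otherwise throttle it, exactly as described above):

    /* zero_lv.c - fill a freshly created customer volume with zeroes before handing
       it over; roughly `dd if=/dev/zero of=/dev/vg0/new_customer bs=1M`.
       Build: cc -o zero_lv zero_lv.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dev = "/dev/vg0/new_customer";  /* hypothetical LV */
        int fd = open(dev, O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[1 << 20];                   /* 1 MiB of zeroes (static => zero-filled) */
        while (write(fd, buf, sizeof buf) > 0)
            ;                                       /* write() stops with ENOSPC at the end of the device */

        fsync(fd);
        close(fd);
        return 0;
    }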
>Zeroing an SSD is costly: if the firmware doesn't detect the all-zero writes, it harms the drive's longevity by dirtying its internal pages and page cache, not to mention the performance impact on any other VMs resident on the same hardware while the host streams tens of gigabytes of zeroes to the physical device.
Clean failures of disks are not a problem. Unless you are using really shitty components (or buying from Dell) your warranty is gonna last way longer than you actually use something in production. Enterprise hard drives and SSDs both have 5 year warranties.
The dd kills disk performance for other guests on spinning disk if you don't limit it with ionice or the like, and that's the real cost. I would assume that cost would be much lower on a pure SSD setup.