Thanks very much James. Have really been enjoying the posts on MP, as well as the Durability Theater talks you gave.
Would you be able to give more details as to how you use disks as raw block devices, bypassing the filesystem? For example, the kind of abstraction layers you use, their interfaces and methods, disk layout of extents, metadata layout, consistency issues, fsync issues, and the various performance advantages from bypassing the filesystem?
I was also wondering, I know you have two functional zones, but do you also keep an isolated timelapsed append-only backup of all data, in case of ultimate software error (something bad gets past the verifiers, or a delayed bug kicks in)?
Thanks for the feedback! Glad folks are reading this stuff :)
> Would you be able to give more details as to how you use disks as raw block devices, bypassing the filesystem? For example, the kind of abstraction layers you use, their interfaces and methods, disk layout of extents, metadata layout, consistency issues, fsync issues, and the various performance advantages from bypassing the filesystem?
We can definitely go into details on the raw block device. We'll probably leave that to @jamwt to cover in a blog post, since there's quite a lot of content there. Most of the motivation for doing so was actually so we could exploit the write-zone layout of SMR disk drives, although we get some other minor benefits like a bit more "formatted" disk space and no filesystem bugs.
> I was also wondering, I know you have two functional zones, but do you also keep an isolated timelapsed append-only backup of all data, in case of ultimate software error (something bad gets past the verifiers, or a delayed bug kicks in)?
Quick summary is that we typically write data into at least two zones (regions), replicated within each of those zones, and then we keep "trash" around in case there's a software bug that causes deletion. We also have a delete grace period where we keep data within the storage system for longer in case an application issued an erroneous delete.
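To make the "trash" plus grace period idea concrete, here's a rough sketch of what a soft delete could look like. This is only a toy in-memory illustration, not how MP actually does it; the `TrashStore` class, its method names, and the `grace_seconds` parameter are all made up for the example.

```python
import time


class TrashStore:
    """Toy in-memory store: deletes are soft until a grace period expires.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.live = {}   # key -> data
        self.trash = {}  # key -> (data, deleted_at)

    def put(self, key, data):
        self.live[key] = data

    def delete(self, key, now=None):
        # Move the data to trash instead of destroying it.
        now = time.time() if now is None else now
        self.trash[key] = (self.live.pop(key), now)

    def undelete(self, key):
        # Recover from an erroneous delete while the data is still in trash.
        data, _deleted_at = self.trash.pop(key)
        self.live[key] = data

    def purge(self, now=None):
        # Permanently drop only entries whose grace period has elapsed.
        now = time.time() if now is None else now
        expired = [
            key
            for key, (_data, deleted_at) in self.trash.items()
            if now - deleted_at >= self.grace_seconds
        ]
        for key in expired:
            del self.trash[key]
        return expired
```

The point of the split is that a buggy or mistaken `delete` is reversible via `undelete` until `purge` runs past the grace period.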
The Dropbox filesystem layer actually checks that a file was added to MP before committing it to the metadata layer of the filesystem, so there's an extra check there that the write didn't just disappear.
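Roughly, that "check before commit" ordering could be sketched like this. It's a minimal illustration under my own assumptions: `BlockStore`, `MetadataStore`, and their methods are hypothetical names, and keying blocks by a SHA-256 hash is just a convenient choice for the example, not a statement about the real system.

```python
import hashlib


class BlockStore:
    """Hypothetical stand-in for the block storage system."""

    def __init__(self):
        self._blocks = {}

    def put(self, data):
        # Assumption for the sketch: blocks are keyed by their SHA-256 hash.
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def contains(self, key):
        return key in self._blocks


class MetadataStore:
    """Hypothetical stand-in for the filesystem metadata layer."""

    def __init__(self, block_store):
        self.block_store = block_store
        self.files = {}  # path -> block key

    def commit(self, path, block_key):
        # Refuse to record a file whose data is not in block storage.
        if not self.block_store.contains(block_key):
            raise LookupError(f"block {block_key} not found; refusing to commit {path}")
        self.files[path] = block_key
```

With this ordering, a write that silently disappeared from block storage shows up as a failed `commit` rather than as a metadata entry pointing at nothing.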
We also log metadata about each put and every internal data transformation to a separate system so we can retrace our steps if there was ever an issue. This logging system is actually running on HDFS, so MP isn't the only storage system running at Dropbox, but it's by far the biggest one.
> Would you be able to give more details as to how you use disks as raw block devices, bypassing the filesystem? For example, the kind of abstraction layers you use, their interfaces and methods, disk layout of extents, metadata layout, consistency issues, fsync issues, and the various performance advantages from bypassing the filesystem?
Yeah, we're going to write a whole blog post on this sort of stuff sometime soon. Needless to say, it's a big topic!