Pocket Watch: Verifying Exabytes of Data

james_cowling · on July 6, 2016

Hi folks, James from Dropbox here. Happy to answer any questions that come up so feel free to send them by way.

toomuchtodo · on July 6, 2016

Was shocked to see that your durability (27 9s) was so much higher than what S3 claims (11 9s), while also charging less for storage and bandwidth than S3 would. Amazing!

I would be curious to see how much of your verification architecture is shared with someone like Backblaze, but I assume the only way to learn that would be to work for both companies to compare :)

Would Dropbox ever share hard drive reliability stats similar to what Backblaze does?

Sorry for all the questions in one comment! Not affiliated with either Dropbox or Backblaze, just an infrastructure wonk.

jerf · on July 6, 2016

27 9s is literally higher than my confidence that human civilization will be here in the next second. 10^27 seconds is about 32 quintillion years. Extinction events occur with a much higher frequency than that.

No criticism of Dropbox here; they know that number is just math games, too. I'm just putting some numbers on how true that is. Because I find tossing about big numbers like this as if they mean something a bizarre, nerdy sort of fun.

gamegoblin · on July 7, 2016

Exactly. They say they are using some variant of Reed-Solomon erasure coding. If you did something like K=100 and N=150 and stored all of the shards on different disks, the probability you lose data is equal to the probability that 50 hard drives fail before you can repair the lost shards.

If I am reading the article correctly, they claim that they should usually be able to repair in less than an hour in the case of disk failure.

Thus the probability of losing 50 (or whatever their N-K value is) disks within an hour is how you get 27 nines of durability.

Of course, the probability that one of your software engineers introduces a durability bug is WAY more likely than those disks experiencing a coordinated failure.

Or say, the probability that a terrorist organization targets your datacenters. Even if those odds are one in a billion, that's still not even close to 27 nines.

james_cowling · on July 7, 2016

For sure. I hope we're all agreeing here :)

We've very strong believers that an effective replication strategy is just table stakes and that from there the real risks to durability are the "black swan" events that are much harder to model.

I gave at talk at Data@Scale recently where the main premise is about "Durability Theater" and how to combat it in a production storage system. In case you're interested: https://code.facebook.com/posts/253562281667886/data-scale-j...

james_cowling · on July 6, 2016

Yeah to be clear my point from the article is that even though you can model an incredibly high durability number in theory, there's a lot more work required to see that in practice. I'm sure the S3 team also have extremely high internal durability numbers.

A lot of other factors go into an end-to-end durability numbers, including the possibility of software bugs, correlated failure, etc. Magic Pocket has an external durability claim of 12 9s which gives us a lot of headroom between the external figure and the internal model.

Fortunately both 11 or 12 nines is close enough to infinite as far as clients are concerned.

toomuchtodo · on July 6, 2016

Thanks for taking the time to reply James!

Johnny555 · on July 6, 2016

S3's durability numbers are for a single region (since they were quoting that number before they supported cross-region replication) -- so you'd get much better durability if you mirror your objects across regions. Which, apparently, is what Dropbox does.

Though if you require such high durability, you're better off with multiple providers in multiple countries, since at 11 9's, it probably becomes more likely that a provider will go out of business or a civil disturbance will make your data become unavailable than the chance of them losing it.

toomuchtodo · on July 6, 2016

To reply your first paragraph, Dropbox is providing multi-region durability for free compared to S3. Point to Dropbox.

I agree with your second paragraph in entirety.

james_cowling · on July 7, 2016

From what I hear from other major storage providers Magic Pocket is definitely very high up on the paranoia/verification scale. There is of course a cost to all this verification traffic. Any sensible storage provider will be doing stuff like disk scrubbing and verification scans on their storage index however.

We haven't shared drive reliability stats before but I don't think we have any objection to doing so. Don't expect these in the next month or two but we'll likely get to this in the not-too-distant future.

toomuchtodo · on July 7, 2016

> We haven't shared drive reliability stats before but I don't think we have any objection to doing so. Don't expect these in the next month or two but we'll likely get to this in the not-too-distant future.

Thanks so much James!

e12e · on July 7, 2016

>> Was shocked to see that your durability (27 9s) was so much higher than what S3 claims (11 9s)

> 27 9s is literally higher than my confidence that human civilization will be here in the next second. 10^27 seconds is about 32 quintillion years. Extinction events occur with a much higher frequency than that.

Without reading the article, I'm assuming this is the uptime probability they promise - so ( 1.0/10^9 ) * 365.25 * 24 * 3600 * 300 ~ 1 second of unavailability every 30 years (AFAIK Amazon is pretty far in the "red" on this one - they've had a few outages?) [ed: initially was off by a factor of 100 (and 10 in error...), the two 9s before the comma - the "compliment" (1 - p) of 11 9s is 1/10^9, not 1/10^11)].

11 9s is already effectively the same as will "never go down in a way customers will notice" (actually a second every 30 years per region isn't entirely insignificant, only almost insignificant). A higher guarantee does indeed seem silly. It's probably much more likely that we'll see annihilation by global thermonuclear war (for example). In which case I'm guessing the data would go off line for a while -- so such a number is meaningless.

placeybordeaux · on July 7, 2016

Thats not what they are promising, that would be insane. Please just read the article.

e12e · on July 7, 2016

Point well taken - I was confused by the comments about how long 10^27 seconds is, and the "9s" nomenclature which is often used to measure uptime.

On the other hand, a Dropbox engineer upthread just claimed that their service have an external [ed2: durability] of 11 to 12 9s. So it does seem that they effectively claim that a block will practically never be unavailable due to not being possible to read from (any) disk [ed2: ie, unavailable due to failed durability]?

I do wonder a bit at the cost of padding redundancy up to such a high number. They don't mention block size, other than to say that 1 GB is filled with actual blocks. Lets say it's 1MB, and they target 1 Billion users, averaging 100 GB of data stored. That's 10⁹ users storing 10² GB each with 10³ blocks, or 10⁹⁺²⁺³ = 10¹⁵ blocks of data. That still leaves a lot of margin - and effectively the storage part of the system should never be the weakest link.

[ed: and that some users are likely to lose some data, if the 11 to 12 9s figure is to be taken "per block". But maybe it's per user? It seems unlikely that they really mean 11 9s of availability full stop]

[ed2: Snuck in an extra availability where I meant durability, rendering also this second comment nonsensical...]

gamegoblin · on July 7, 2016

You are confusing availability ("uptime") with durability ("losing data").

I could make a system that was only available for 1 minute every 10 minutes (10% availability) but never lost a single file (100% durability)

I could also make a system that was never down (100% availability) but would randomly lose 1 in 10 files (90% durability).

Common causes for availability issues are power outages, network outages (fiber cuts, etc), and DNS issues

Common causes for durability issues are software bugs, entire datacenter losses (e.g. earthquake destroys everything), or perhaps coordinated power outages if the system buffers things in volatile memory.

jorangreef · on July 7, 2016

Thanks very much James. Have really been enjoying the posts on MP, as well as the Durability Theater talks you gave.

Would you be able to give more details as to how you use disks as raw block devices, bypassing the filesystem? For example, the kind of abstraction layers you use, their interfaces and methods, disk layout of extents, metadata layout, consistency issues, fsync issues, and the various performance advantages from bypassing the filesystem?

I was also wondering, I know you have two functional zones, but do you also keep an isolated timelapsed append-only backup of all data, in case of ultimate software error (something bad gets past the verifiers, or a delayed bug kicks in)?

james_cowling · on July 7, 2016

Thanks for the feedback! Glad folks are reading this stuff :)

> Would you be able to give more details as to how you use disks as raw block devices, bypassing the filesystem? For example, the kind of abstraction layers you use, their interfaces and methods, disk layout of extents, metadata layout, consistency issues, fsync issues, and the various performance advantages from bypassing the filesystem?

We can definitely go into details on the raw block device. Will probably leave that to @jamwt to write a blog post on since there's quite a lot of content there. Most of the motivation for doing so was actually so we could exploit the write-zone layout of SMR disk drives, although we get some other minor benefits like a bit more "formatted" disk space and no filesystem bugs.

> I was also wondering, I know you have two functional zones, but do you also keep an isolated timelapsed append-only backup of all data, in case of ultimate software error (something bad gets past the verifiers, or a delayed bug kicks in)?

Quick summary is that we write typically only data into at least two zones (regions), obviously replicated within these zones, and then we keep "trash" around in case there's a software bug that causes deletion. We also have a delete grace period where we keep data within the storage system for longer in case an application issued an erroneous delete.

The Dropbox filesystem layer actually checks that a file was added to MP before committing it to the metadata layer of the filesystem, so there's an extra check there that the write didn't just disappear.

We also log metadata about each put and every internal data transformation to a separate system so we can retrace our steps if there was ever an issue. This logging system is actually running on HDFS, so MP isn't the only storage system running at Dropbox, but it's by far the biggest one.

jorangreef · on July 8, 2016

Thanks for taking the time to reply, looking forward to the next few posts in the series.

jamwt · on July 7, 2016

> Would you be able to give more details as to how you use disks as raw block devices, bypassing the filesystem? For example, the kind of abstraction layers you use, their interfaces and methods, disk layout of extents, metadata layout, consistency issues, fsync issues, and the various performance advantages from bypassing the filesystem?

Yeah, we're going to write a whole blog post on this sort of stuff sometime soon. Needless to say, it's a big topic!

jorangreef · on July 8, 2016

Thanks, looking forward to it!

jez · on July 7, 2016

With a graph that's always 0, I'd be scared that my verification code was accidentally wrong. Do you ever intentionally inject missing hashes somewhere to see if your graph blips? How would you feign that a hash was missing?

james_cowling · on July 7, 2016

Yup we do this, and yes, the verifier has been broken before (during early development, not at production scale).

We do a number of things here, like taking down a sufficient number of storage nodes in a single region to make blocks appear "missing" in the region and force an automatic failover to the other region (this is transparent to users, apart from slightly more latency), or more direct/risky checks in our Staging cluster (we don't ever mess with data in our main production cluster).

In reality a large system like this regularly encounters timeouts or failures of sub-components which are masked by our multi-region redundancy but show up as spikes in the verifiers. These remind us that everything is working, in between more explicit DRT (Disaster Recovery Training) tests.

sitkack · on July 7, 2016

ABF, Always Be Failing (just a little bit).

okwap · on July 13, 2016

What's your mysql cluster setup in block index and replication table? NDB, galera or?

okwap · on July 13, 2016

What's difference between "extent" and "bucket"?

pbarnes_1 · on July 6, 2016

This system is pretty amazing. Thanks for sharing all the details as you went along.

You guys have made me interested in rust, which I still find a little too verbose vs golang. :D

jamwt · on July 6, 2016

Thanks!

We're expanding our use of Rust into even more bold endeavours, details soon.

Based on our experience with it in the last 1.5 years, stick with it, it will return your effort many times over in the medium-long run. Then again, if there is no medium-long run for your project (as an entity needing maintenance), just use Python.