The "insightfulness" of these points is overblown, especially if you've worked at Google, where the solution to "our bloated application server takes 20 minutes to recompile" is "then use our distributed compiler that runs on 100 machines," and the solution to "sometimes a worker machine takes a long time to come up or doesn't come up at all" is "then use redundant workers and fire up 200 machines."
I wouldn't be surprised if Google does pull off some snazzy new real-time architecture to use internally, but so far I think their strategy of farming out execution to huge numbers of crappy machines, while innovative and very successful, has pretty much exactly the problems you'd expect it to have.
> the solution to "our bloated application server takes 20 minutes to recompile" is "then use our distributed compiler that runs on 100 machines,"
This is a bit disingenuous. The app servers are bloated because every library is rebuilt and statically linked, avoiding deployment problems and apps that use massively-outdated libraries. Continuous integration is expensive, but we can afford it.
You shifted the problem to the network and disk storage. Distributing compiles, objects, executables, debug images, packages, etc. can take longer than a compile on a single machine unless you are quite careful.
> With flash you can read 4KB of data in 100 microseconds or so, if your read is stuck behind an erase you may have to wait 10s of milliseconds. That’s a 100x increase in latency variance for that particular variance that used to be very fast.
I had never heard this claimed before -- is this true? Wikipedia says:
"Less expensive SSDs typically have write speeds significantly lower than their read speeds. Higher performing SSDs have similar read and write speeds."
So it sounds like this isn't as much of an issue as the article claims.
I'm going to over-simplify a bit (and try to remember; it's been years since I've actually paid attention to flash). Flash works effectively by erasing to all-1s, and AND'ing data with an existing block of bits. If you have a freshly erased block, it's already all 1s, and you don't have to clear it off before you can use it. This is fast.
If there are no free blocks, or if for whatever reason the flash device elects to use a block you've already written to, you need to erase the entire block, even if you're only writing one bit. This is relatively slow.
Most high end flash devices, as I understand it, try to keep a pool of pre-erased blocks so that they can stay on the fast path, on average.
As far as I can tell, what they're saying in this article is that sometimes you can hit the slow path, and that causes high latency. On average, it's not a big deal, but for some random small number of writes, it can be an issue.
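To put toy numbers on that (made up for illustration, not from any real device), here's a small Python sketch of why a pre-erased pool keeps the typical write fast while the unlucky write still eats an erase:

    import random

    # Toy model of a flash write path (illustrative numbers, not from any real
    # device): a pool of pre-erased blocks absorbs most writes, but when the
    # pool runs dry a write has to pay for a block erase first.
    PROGRAM_US = 400      # assumed page-program latency
    ERASE_US = 10_000     # assumed block-erase latency

    random.seed(0)
    pool, latencies = 8, []
    for _ in range(100_000):
        if pool > 0:
            latencies.append(PROGRAM_US)             # fast path: page already erased
            pool -= 1
        else:
            latencies.append(ERASE_US + PROGRAM_US)  # slow path: erase, then program
        if random.random() < 0.98:                   # background erases refill the pool
            pool += 1

    latencies.sort()
    n = len(latencies)
    print("mean :", sum(latencies) / n, "us")
    print("p50  :", latencies[n // 2], "us")
    print("p99.9:", latencies[int(n * 0.999)], "us")

The mean barely moves and the median stays at the program latency, but the few writes that land behind an erase are more than an order of magnitude slower.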
Note: This is my own opinion, and not that of my employer (Google) or based on trade secrets or other IP from my employer.
Keeping pre-erased blocks is useful, but it can only reduce the write latency to what the chip gives you. And the chip gives you a longer write latency than read latency, especially for MLC with smaller process sizes.
True, there is a difference between read and write latency, but at least that is consistent and therefore easy to plan for. Large variance makes things far more difficult, in my opinion.
Having a pool of pre-erased blocks helps absorb spikes in write activity, but it doesn't let you escape the fact that, on average, an SSD will need to perform an erase operation every 128th page write (assuming 128 pages per block).
If the device is kept at or near full utilization, eventually a fast operation (read) will get delayed behind a slower operation and your tail latencies will suffer.
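Back-of-the-envelope, with assumed figures of ~0.4ms per page program and ~10ms per block erase:

    # Amortized cost of the erase hidden behind every 128 page writes.
    # Latency figures are assumptions for illustration, not measurements.
    PAGES_PER_BLOCK = 128
    PROGRAM_MS = 0.4   # assumed page-program time
    ERASE_MS = 10.0    # assumed block-erase time

    amortized = PROGRAM_MS + ERASE_MS / PAGES_PER_BLOCK
    print(f"average write cost       : {amortized:.2f} ms/page")            # ~0.48 ms
    print(f"write that eats the erase: {PROGRAM_MS + ERASE_MS:.1f} ms")     # 10.4 ms

The average barely notices the amortized erase; whatever operation happens to be queued behind it very much does.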
Part of the problem can be that operating systems and databases work in block sizes of 4k (8k on some UNIXes), which is appropriate for regular disks, while flash SSDs are organized into erase blocks of 64k, 128k, or even larger, which have to be erased and rewritten as a whole for certain kinds of write.
That means if you write a 4k block, your SSD has to read its larger block, combine it with the new data, and write the big block back. So depending on your write patterns, you're suddenly writing at 1/8th or 1/16th the speed. Command queuing can help, but not for truly random writes.
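A toy version of that penalty, assuming a 4k filesystem block and a 64k erase block (real sizes vary):

    # Worst-case write amplification for a small random write that forces a
    # read-modify-write of the whole erase block. Sizes are assumptions.
    FS_BLOCK = 4 * 1024       # what the OS/database writes
    ERASE_BLOCK = 64 * 1024   # what the SSD must read, merge, and rewrite

    amplification = ERASE_BLOCK // FS_BLOCK
    print(f"{amplification}x more data written than requested")
    print(f"=> roughly 1/{amplification} of the sequential write speed")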
It's actually even a bit more complicated: SSDs only need to erase the entire block when writing ones back over zeros, but it's probably safe to assume that most writes will end up doing something like that. Erasure is also slower than just writing.
Regarding the Wikipedia quote, it really depends a lot. Write speeds can be the same as read speeds if you're just writing 'fresh' blocks, but as soon as your drive needs to erase blocks, the story will be very different.
Note: This is my own opinion, and not that of my employer (Google) or based on trade secrets or other IP from my employer.
I don't know if this is actually true of the SSDs in real-world use, but there is a real cost/write-performance tradeoff. You can make writes faster by using SLC rather than MLC, and by having a higher proportion of physical capacity to reported capacity in order to reduce write amplification. Looking into the future, there could be a journal in an even faster kind of memory like PCM or memristors. Still, the controller can only do so much to hide the properties of the chip, and if your drive is only flash, then the recent trend in the chips is that reads are getting faster and writes are getting slower (driven by the move towards smaller process sizes and more levels per cell).
It's definitely a lot less painful than dealing with disks, as the article suggests. The variability of disk latencies is much higher and even less predictable.
> "Less expensive SSDs typically have write speeds significantly lower than their read speeds. Higher performing SSDs have similar
> read and write speeds."
> So it sounds like this isn't as much of an issue as the article claims.
I'm going to guess that the article is talking about read/write throughput. If you are comparing throughput, that can be a true statement (though it's usually only true in devices where the limiting factor is the bandwidth of the connection to the host computer...).
Luiz was referring to the latency of individual operations. Flash is very asymmetric w.r.t. the latencies of reads and writes. The asymmetry is only getting worse with the smaller lithographies used for newer chips.
> > With flash you can read 4KB of data in 100 microseconds or so, if your read is stuck behind an erase you may
> > have to wait 10s of milliseconds. That’s a 100x increase in latency variance for that particular variance that
> > used to be very fast.
> I had never heard this claimed before -- is this true? Wikipedia says:
The claim is very true.
At the level of a single flash chip, you can only do one operation at a time. So at any one time, a single chip can be doing exactly one of: erasing a block, writing a page, or reading a page.
Erase a block
- Latency ~5-20ms depending on the generation of flash.
- On modern chips, a block will have between 64 and 256 pages of 4KiB or 8KiB each.
Write a page
- Latency 400us - 4ms
- On SLC chips, the latency will be consistently in the 400us range
- On MLC, half of the pages in a block will see 400us latency, the other half will be in the 3ms-4ms range
- Writes can only be performed on erased pages.
Read a page
- Latency ~80-100us
SSDs achieve performance by putting together a lot of flash chips and trying to keep them all busy at the same time. The more operations an SSD can do in parallel, the better off it will be. A (very) simplified way to think about it is to imagine that the SSD is a RAID0 controller that is talking to flash chips rather than disks. For simplicity, imagine the data is striped uniformly across the chips, so a read/write for logical address A will always end up on chip (A / page_size) % number_of_chips.
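Spelled out, that simplified chip-selection rule looks like this (page size and chip count here are assumptions):

    # Simplified striping: a logical address maps to exactly one flash chip,
    # so two requests that land on the same chip have to serialize.
    PAGE_SIZE = 4096
    NUM_CHIPS = 8    # assumed number of chips/channels

    def chip_for_address(addr: int) -> int:
        return (addr // PAGE_SIZE) % NUM_CHIPS

    # Sequential pages fan out across chips; pages 0 and 8 collide on chip 0.
    print([chip_for_address(p * PAGE_SIZE) for p in range(10)])
    # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]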
Imagine you have two apps:
The first app only reads from the SSD. As long as nobody else is writing to the device, it will see very low read latency. If the reads are perfectly distributed across the address range, it will see 100us latency consistently. In the real world, some reads will be targeted to the same chip, and will therefore have to wait for the preceding read to complete.
The first app is seeing:
Read - 100us latency
Read - 100us latency
Read - 200us latency (got stuck behind a previous read)
Read - 100us latency
Read - 100us latency
Everyone is happy.
Now imagine that another process wakes up and starts writing to the SSD.
The total latency for a write can vary greatly. If there is an erased page ready to go, the write operation will take ~400us. If the SSD needs to erase a block first, it can be much higher. If there is a block with no live data that just needs to be erased, the write operation will take (20ms + 400us). If the SSD needs to do garbage collection (i.e. copy live data from one block to another to free up a block) the time could be much higher, possibly 50ms or more.
App   Operation      Latency
A     Read chip 0    100us
B     Write chip 0   400us
A     Read chip 0    500us+

...or...

A     Read chip 0    100us
B     Write chip 0   40ms
A     Read chip 0    40ms
So app A would see a 40ms/100us = 400x variance in read latency.
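You can get a feel for how that plays out with a quick Monte Carlo sketch using the rough latencies above; how often a read lands behind a write, and how often that write lands behind an erase, are assumed numbers:

    import random

    # Monte Carlo sketch of the scenario above: a read occasionally queues
    # behind a write on the same chip, and that write occasionally queues
    # behind a block erase. Collision probabilities are assumptions.
    READ_US, WRITE_US, ERASE_US = 100, 400, 20_000

    def read_latency_us(p_behind_write=0.05, p_write_behind_erase=0.1):
        latency = READ_US
        if random.random() < p_behind_write:
            latency += WRITE_US
            if random.random() < p_write_behind_erase:
                latency += ERASE_US
        return latency

    random.seed(1)
    samples = sorted(read_latency_us() for _ in range(100_000))
    for pct in (50, 99, 99.9):
        print(f"p{pct}: {samples[int(len(samples) * pct / 100)]} us")
    print(f"max : {samples[-1]} us")

The median read stays around 100us, but the small fraction stuck behind an erase blows out to tens of milliseconds, which is the kind of variance the article is talking about.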
It is possible to fix the max latency problem. If you have K distinct queries, just make sure that each such query is answered by N servers and use the first reply that comes back. If you do this you go from the worst latency amongst all servers to the worst latency amongst all N-tuples, where each N-tuple only costs you its fastest member. Overall latency will then approach the average latency of the slowest distinct query as N increases.
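A quick simulation of that argument (the per-server latency distribution is made up: usually about 1 unit, with a 1% chance of a 50-unit hiccup):

    import random

    # Each of K sub-queries goes to N replicas; the fastest reply per
    # sub-query wins, and the request finishes when its slowest sub-query
    # does. The per-server latency distribution is a made-up example.
    def server_latency():
        return 50.0 if random.random() < 0.01 else random.uniform(0.9, 1.1)

    def request_latency(k, n):
        return max(min(server_latency() for _ in range(n)) for _ in range(k))

    random.seed(2)
    for n in (1, 2, 3):
        trials = sorted(request_latency(k=100, n=n) for _ in range(2000))
        print(f"N={n}: p50={trials[1000]:.1f}  p99={trials[1980]:.1f}  max={trials[-1]:.1f}")

With N=1 some sub-query almost always hits a hiccup; with N=2 or 3 a hiccup has to strike every replica of the same sub-query at once, so the tail collapses toward the fast case, at the hardware cost pointed out in the reply below.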
Sure. It's trivial to arbitrarily reduce the probability of this one issue by throwing N times the hardware at it.
Now you have to deal with the problems of:
1) Buying, powering, managing, etc N times the hardware. When you buy hardware by the warehouse, this will be a big bill :-)
2) Since you are now increasing your network traffic, power, etc. by at least N times, it is very likely you will hit another problem that produces latency spikes.
Or, as Chris Colohan likes to say: "If you think solving problems in massively parallel computing systems is easy, please come join my team" :-)
I'm not saying that it's easy, but the problem formulation in the article is simplified enough to be solved by this method. So I guess that my major objection is how the problem is stated.
Also in practice you don't really need N times the hardware. You pick out the queries with the worst jitter and either solve it or install two or three of these services, and you are good to go.
To summarize, I agree with you in principle, but would like to add that the method still is sound, but perhaps not in its most extreme form.
To me the main argument was that the algorithms don't scale, and not even SSDs can help to alleviate that. This is true since you are always bound to the worst time. By spreading the load as described, you aren't anymore, so the problem which the SSDs were supposed to solve simply isn't there anymore.
And even if you would like to avoid those SSD latency spikes, said algorithm would work wonders (given that blocking due to erases occurs randomly). You can then wait for the first successful write, for example.
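A minimal sketch of that "wait for the first successful write" idea, with a hypothetical replica_write() standing in for whatever storage call you'd actually make:

    import concurrent.futures as cf
    import random, time

    # Hypothetical replica_write(): simulates a write that occasionally
    # stalls behind an erase on one replica.
    def replica_write(replica, data):
        time.sleep(0.05 if random.random() < 0.1 else 0.001)
        return replica

    def write_first_ack(data, replicas=(0, 1, 2)):
        pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
        futures = [pool.submit(replica_write, r, data) for r in replicas]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        pool.shutdown(wait=False)   # let the slower replicas finish on their own
        return next(iter(done)).result()

    print("first ack from replica", write_first_ack(b"payload"))

The caller only ever waits for the fastest replica; the extra writes are the cost discussed above.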