What I find to be one of the most annoying trade-offs between latency and throughput (and etiquette) is single-lane bridges.
If there are three cars on either side, the fastest way to get all six cars crossed is to allow all of the cars on one side to go, and then all of the cars on the other side to go. There's some increased latency on the side that goes second, but 4 of the 6 cars end up crossing sooner (1 is the same either way, and 1 is slower) than they would with what actually happens.
One car goes on one side. Then one car goes on the other side. They continue to alternate, and every handoff wastes time while the bridge clears and the next driver gets moving. It's so slow that what was once six cars quickly becomes twenty. But if you "sneak" in behind the car in front of you, you see unhappy faces and the occasional middle finger!
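To make the arithmetic concrete, here's a throwaway calculation (all numbers made up: one time unit per crossing, and one extra time unit lost every time the bridge changes direction). With those assumptions, batching gets four of the six cars across sooner, one crosses at the same time, and one crosses later:

    #include <stdio.h>

    /* Toy model with made-up numbers: each car takes CROSS time units to cross,
       and every direction change costs the next car SWITCH extra time units
       (waiting for the bridge to clear and pulling out). Three cars per side. */
    enum { CARS_PER_SIDE = 3, CROSS = 1, SWITCH = 1 };

    int main(void) {
        int t = 0;

        /* Policy 1: all of side A goes, then all of side B (one direction change). */
        printf("Batch policy:\n");
        for (int i = 0; i < CARS_PER_SIDE; i++)
            printf("  A%d crosses at t=%d\n", i + 1, t += CROSS);
        t += SWITCH;                              /* the single direction change */
        for (int i = 0; i < CARS_PER_SIDE; i++)
            printf("  B%d crosses at t=%d\n", i + 1, t += CROSS);

        /* Policy 2: strict alternation (a direction change before every car
           except the very first one). */
        printf("Alternating policy:\n");
        t = 0;
        for (int i = 0; i < 2 * CARS_PER_SIDE; i++) {
            if (i > 0) t += SWITCH;
            t += CROSS;
            printf("  %c%d crosses at t=%d\n", i % 2 ? 'B' : 'A', i / 2 + 1, t);
        }
        return 0;
    }

In the batch case only the side that goes second pays the one switch penalty; in the alternating case every single handoff pays it, which is where all the extra waiting comes from.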
I'm not sure where you are from, but when I encountered one of these bridges in Kauai, Hawaii a few years back, it had specific instructions to follow the car in front of you. I always admired whoever had the foresight to design such instructions and never realized it was not like that elsewhere.
From a quick google, it looks like this is the norm throughout Hawaii.
Northeast U.S. outside of Philadelphia. We have lots of one-lane bridges in this area. They say "Yield to oncoming traffic." One could interpret that to mean - if cars in front of you are going, you can go, too, since the others will yield! But instead everyone interprets it as "wait for the next person to come across, then go."
Making coffee is a good analogy for any kind of hardware electronics development! Some things need to be planned out ahead of time like the overall mechanical design, but other things probably should be iterated quickly, maybe by using pre-made development kits and breadboards instead of fully designed circuit boards for initial firmware.
As the article said at the end, it all basically boils down to "it depends..." :)
If a CPU sees a line of 10 people, it will brew 10 cups of coffee, speculating that most of them want coffee. If only 9 cups were needed, it throws away the extra coffee.
-----------
In practice, this truly happens. CPUs perform branch-prediction over a for-loop.
    for (int i = 0; i < 32; i++) {
        doA();
    }
    doB();
The value "i" hasn't been calculated yet, but the CPU performs branch prediction. Modern CPUs can accurately loops of size ~32 or less. Modern CPUs will literally fill their pipelines with 32x "doA()" statements, and even the doB() statement BEFORE the i<32 check was even tested.
Now the branch predictor might be wrong! Let's say that doA() is something roughly like this (a sketch; all that matters is that it occasionally changes the loop's control flow):
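    doA() {
        /* Sketch of the idea only; the exact body doesn't matter. The "1%"
           below means an actual one-percent probability, not the modulo
           operator.                                                        */
        if (1% chance) {
            i = 1000000;   /* kick i past the loop bound, ending the loop early */
        }
    }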
The CPU will likely fail to predict this, and will then be forced to throw away the work. Nonetheless, it's overall beneficial for the CPU to speculatively attempt all the loop iterations anyway (the alternative is leaving the CPU pipeline empty, which has roughly the same cost as a failed speculation).
The CPU wins if it is correct and ties if it is wrong, so it might as well speculate.
This really threw me for a loop. I'm not used to seeing the percent sign used for actual percents! I was wondering "Why is he doing 1 modulo some 'chance' variable?"
In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
> In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
Nope. CPUs are latency-optimized.
The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
----------
A throughput-optimized machine, like a GPU (and, strangely enough, a hard drive), is willing to slow down the 1st cup of coffee for better overall throughput.
Hard drives are interesting: if you have the following "reads":
#1: Read location 1
#2: Read location 100
#3: Read location 50
The hard drive will re-arrange the reads into: Read 1, Read 50, Read 100, because the hard-drive head will reach location 50 before location 100. Remember, hard drives physically move their arm to each physical location.
This means that Read 100 is "slowed down": it was requested before Read 50 but is now served after it, so its individual latency gets worse. But the three reads, taken together, complete sooner, because the head travels less distance overall.
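Here's a toy sketch of that kind of reordering (made-up request locations and a naive shortest-seek-first policy; real drive firmware is far more sophisticated). Seek time is modelled as the distance the head moves:

    #include <stdio.h>
    #include <stdlib.h>

    #define NREQ 4

    /* Serve NREQ pending reads either in arrival order (FIFO) or by always
       seeking to the nearest pending location next (shortest-seek-first).
       Seek time is modelled as |head movement|.                            */
    static void simulate(const char *name, int head, const int req[NREQ], int nearest_first) {
        int done[NREQ] = {0};
        int t = 0;

        printf("%s (head starts at %d):\n", name, head);
        for (int n = 0; n < NREQ; n++) {
            int pick = -1;
            for (int i = 0; i < NREQ; i++) {
                if (done[i]) continue;
                if (pick < 0 ||
                    (nearest_first && abs(req[i] - head) < abs(req[pick] - head)))
                    pick = i;   /* FIFO just keeps the oldest pending request */
            }
            t += abs(req[pick] - head);          /* seek time ~ distance moved */
            head = req[pick];
            done[pick] = 1;
            printf("  read @%3d (issued #%d) completes at t=%3d\n",
                   req[pick], pick + 1, t);
        }
    }

    int main(void) {
        const int requests[NREQ] = {100, 45, 55, 40};   /* in arrival order */
        simulate("FIFO",                50, requests, 0);
        simulate("Shortest-seek-first", 50, requests, 1);
        return 0;
    }

With these made-up numbers, the read at location 100 completes at t=50 under FIFO but t=70 once reordered, while the whole batch finishes at t=70 instead of t=130: worse latency for one request, better throughput overall.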
> The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
Just to be clear, then: the analogy from the original post doesn't apply.
> Just to be clear, then: the analogy from the original post doesn't apply.
The analogy from the original post applies to the cases the original post discusses.
The original "coffee latency" blogpost innately applies to a 1980s style computer: a simple in-order machine. Its truly correct for that model of simple computing.
I've added in complications: pipelining, superscalar execution, and speculative execution, which were inventions deployed in CPUs through the '90s and 2000s. So things work differently on modern machines, because modern machines have many, many more features than the "original" computer designs.
The original "cups of coffee" are a good way to start thinking about latency vs bandwidth problem. I really like the analogy. But it would take a LOT more writing before I really cover everything going on in modern CPUs.
Your original post was missing explanation: you referenced the original analogy without addressing how it no longer applied to the scenario you were discussing.
For what it's worth, in all my replies I have not been confused about the behavior of a CPU, but only about how you are trying to use the analogy to fit your exposition.
Does this actually waste power?
A lot of people's intuition about what wastes power with cpus is wrong. Usually it is best to light up all the tricks, heat the cpu up, finish the work, then go back to sleep. If branch prediction helps you get back to sleep sooner, it is probably a net win.
Power-used = Capacitance * Voltage^2 * Frequency * Number of Bits flipped
This equation roughly holds for all CMOS circuits, from the 1970s through today's CPUs. Smaller transistors result in smaller capacitance, which is what led to Dennard scaling for the past 40 years. Otherwise, the physics are the same.
Static power consumption in CMOS is theoretically zero, and the leakage current is measured in femtoamps. Static power consumption can therefore be ignored; only dynamic power matters in the real world.
Assuming the same transistor size (aka capacitance), you can see that power use is most strongly determined by voltage. Another note: higher frequencies (say 3GHz or 4GHz) require more and more voltage to sustain.
A mobile chip running at 1/2 the voltage and 1/2 the frequency will draw 1/8th the power but take 2x longer to complete a task. Overall, you use 1/4 the energy.
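Spelling that arithmetic out with the formula above (the baseline numbers are arbitrary; only the ratios matter):

    #include <stdio.h>

    /* Rough dynamic-power model from above: P ~ C * V^2 * f, with the bit-flip
       activity held constant. Arbitrary baseline numbers, only ratios matter. */
    int main(void) {
        const double C = 1.0;                /* relative capacitance           */
        const double V = 1.0, f = 1.0;       /* baseline voltage and frequency */
        const double work = 1.0;             /* some fixed amount of work      */

        double p_full = C * V * V * f;
        double t_full = work / f;                        /* time to finish the work */
        double e_full = p_full * t_full;                 /* energy = power * time   */

        double p_half = C * (V / 2) * (V / 2) * (f / 2); /* 1/8th the power         */
        double t_half = work / (f / 2);                  /* takes 2x as long        */
        double e_half = p_half * t_half;                 /* 1/4 the energy          */

        printf("power ratio:  %.3f\n", p_half / p_full); /* prints 0.125 */
        printf("energy ratio: %.3f\n", e_half / e_full); /* prints 0.250 */
        return 0;
    }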
----------
This is why servers reduce frequency and voltage and save on power in practice.
Very often I see false trade-offs made in the name of better latency. When the system is on top of things, sure, go for latency. Once there is any backlog, favoring throughput gets you better latency too.
Joel on Software gave a better illustration of this a while back in talking about the dangers of multitasking; he said imagine that you have no task-switching penalties but have to perform two tasks A and B which are in theory 100 units of time each. If you perform them serially, you get the result for A at time 100 and the result for B at time 200; if you perform them in parallel switching between them, you get the benefit that at time 51 you can show both of the recipients that you are 25% complete, but you deliver A at time 199 and B at time 200. B gets the same result; A gets a strictly better result, by not multitasking. If you imagine that your reputation is proportional to the average of the inverses of your times-to-completion, your reputation is 50% better in the first case due to the 100% improvement on half of your deadlines; if you had done the same nonsense with three parallel tasks your reputation would be 83% better or so.
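Just to check those reputation numbers, here's a throwaway calculation (assuming equal-length tasks, zero switching cost, and the "average of the inverses of your times-to-completion" metric described above):

    #include <stdio.h>

    /* N equal tasks of LEN time units each, delivered either serially or by
       round-robin interleaving one unit at a time (no switching penalty).
       "Reputation" is the average of 1/completion-time over all tasks.     */
    static double reputation(int n, const double done[]) {
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += 1.0 / done[i];
        return sum / n;
    }

    static void compare(int n, double len) {
        double serial[16], interleaved[16];

        for (int i = 0; i < n; i++) {
            serial[i] = (i + 1) * len;              /* task i ships after i+1 full tasks  */
            interleaved[i] = n * len - (n - 1 - i); /* everything ships near the very end */
        }
        printf("%d tasks: doing them serially scores %.0f%% better\n",
               n, 100.0 * (reputation(n, serial) / reputation(n, interleaved) - 1.0));
    }

    int main(void) {
        compare(2, 100);   /* prints ~50% better */
        compare(3, 100);   /* prints ~83% better */
        return 0;
    }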
With that said it seems, I don’t know, like something is missing? Throughput in these project-engineering contexts is little more than the plural of latency; improving latency usually works to improve throughput. So it would be nice to figure out what the actually-perpendicular vector is, given that these two so often go hand-in-hand.
I'd then want to think about situations where you could up-front invest in building a clean piece of software that is dynamic and highly-adaptable later (big wait, then lots of features can be delivered faster) vs. a clunker that was slapped together ad-hoc in order to immediately meet business needs, and it shows (immediate results but every new feature takes longer and longer).
Between the two of those I have a personality which favors the first; in one of my early programming jobs I had a lot of trouble being thrown into the tail end of a system built for years according to the second principle, and so every little change took weeks to debug because everything was spaghetti—I got a bit burned. On the flip-side, the second is in some sense Objectively Correct—lower latencies are really powerful—and I started to adopt some serious principles from that.
So with new internal tools for example, I have some baseline principles which speak to the second vision. A new tool starts without CI/CD, it starts without a database or data persistence, it has a repository but it does not have a release process or code reviews; it starts without extraneous design or styles or templates; usually it starts without tests although in theory I like test-driven development. When I say minimum viable product, I mean that word minimum and I am somewhat loose on that word viable. If there is supposed to be communication with a hypothetical API, that API does not exist and instead there is a file containing some functions which return static JSON blobs that it might have hypothetically tossed back in response. It is a frontend-first design that has no backend.
And I keep negotiating what this product is with my stakeholders, until that frontend has been massaged into something that they can use. Low latency in learning what my tool-consumer wants is key, so I can't be making it expensive to change my data model or the like. I want the complaints that “This tool is extremely useful, I wish it looked pretty and saved my info from session to session and had the latest data from our HR system” and whatever else it needs to do to actually be properly viable.
I think that what I am doing is some variant of Domain-Driven Design? Basically I am trying to suss out major product requirements from nontechnical folks by having them interact with the product requirements as early as possible, to see what those requirements imply and correct them again and again. I want to have a technical model of how they look at the world which is correct, first—and then when I am building the backend I can actually have a properly principled approach to what I am building because I know what the terms mean in this system.
> if you perform them in parallel switching between them, you get the benefit that at time 51 you can show both of the recipients that you are 25% complete, but you deliver A at time 199 and B at time 200.
That's not parallelism, that's concurrency. You are basically doing round-robin. If they were done in parallel, then both tasks would get completed at time 100. Improving throughput usually improves the worst-case latency when there is at least some parallelism; otherwise, I agree that favoring throughput would not make a lot of sense in many cases.
> I think that what I am doing is some variant of Domain-Driven Design?
Sounds like iterative and incremental software development. I dare say Agile.
- Pull requests should usually optimize for latency, not throughput (i.e. smaller PRs/changes are usually better)
- Release frequently instead of infrequently (implying that frequent releases will be small, while infrequent releases will be very large)
- Non-strictness (latency-optimized) is more composable than strictness (throughput-optimized)
... this orbits another mental model I was exposed to a few years ago that I call "weak-signal thinking".