What I find to be one of the most annoying trade-offs between latency and throughput (and etiquette) is single-lane bridges.
If there are three cars on either side, the fastest way to get all six cars crossed is to allow all of the cars on one side to go, and then all of the cars on the other side to go. There's some increased latency on the side that goes second, but 4 of the 6 cars end up crossing sooner (1 is the same either way, and 1 is slower) than they would with what actually happens.
One car goes on one side. Then one car goes on the other side. They continue to alternate, and every handoff wastes time while the bridge clears and the next driver gets moving. It's so slow that what was once six cars quickly becomes twenty. But if you "sneak" in behind the car in front of you, you see unhappy faces and the occasional middle finger!
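To make the arithmetic concrete, here's a throwaway calculation (all numbers made up: one time unit per crossing, and one extra time unit lost every time the bridge changes direction). With those assumptions, batching gets four of the six cars across sooner, one crosses at the same time, and one crosses later:

    #include <stdio.h>

    /* Toy model with made-up numbers: each car takes CROSS time units to cross,
       and every direction change costs the next car SWITCH extra time units
       (waiting for the bridge to clear and pulling out). Three cars per side. */
    enum { CARS_PER_SIDE = 3, CROSS = 1, SWITCH = 1 };

    int main(void) {
        int t = 0;

        /* Policy 1: all of side A goes, then all of side B (one direction change). */
        printf("Batch policy:\n");
        for (int i = 0; i < CARS_PER_SIDE; i++)
            printf("  A%d crosses at t=%d\n", i + 1, t += CROSS);
        t += SWITCH;                              /* the single direction change */
        for (int i = 0; i < CARS_PER_SIDE; i++)
            printf("  B%d crosses at t=%d\n", i + 1, t += CROSS);

        /* Policy 2: strict alternation (a direction change before every car
           except the very first one). */
        printf("Alternating policy:\n");
        t = 0;
        for (int i = 0; i < 2 * CARS_PER_SIDE; i++) {
            if (i > 0) t += SWITCH;
            t += CROSS;
            printf("  %c%d crosses at t=%d\n", i % 2 ? 'B' : 'A', i / 2 + 1, t);
        }
        return 0;
    }

In the batch case only the side that goes second pays the one switch penalty; in the alternating case every single handoff pays it, which is where all the extra waiting comes from.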
I'm not sure where you are from, but when I encountered one of these bridges in Kauai, Hawaii a few years back, it had specific instructions to follow the car in front of you. I always admired whoever had the foresight to design such instructions and never realized it was not like that elsewhere.
From a quick google, it looks like this is the norm throughout Hawaii.
Northeast U.S. outside of Philadelphia. We have lots of one-lane bridges in this area. They say "Yield to oncoming traffic." One could interpret that to mean - if cars in front of you are going, you can go, too, since the others will yield! But instead everyone interprets it as "wait for the next person to come across, then go."
Making coffee is a good analogy for any kind of hardware electronics development! Some things need to be planned out ahead of time like the overall mechanical design, but other things probably should be iterated quickly, maybe by using pre-made development kits and breadboards instead of fully designed circuit boards for initial firmware.
As the article said at the end, it all basically boils down to "it depends..." :)
If a CPU sees a line of 10 people, it will brew 10 cups of coffee, speculating that most of them want coffee. If only 9 cups were needed, it throws away the extra coffee.
-----------
In practice, this truly happens. CPUs perform branch-prediction over a for-loop.
    for (int i = 0; i < 32; i++) {
        doA();
    }
    doB();
The value "i" hasn't been calculated yet, but the CPU performs branch prediction. Modern CPUs can accurately loops of size ~32 or less. Modern CPUs will literally fill their pipelines with 32x "doA()" statements, and even the doB() statement BEFORE the i<32 check was even tested.
Now the branch predictor might be wrong! Let's say that doA() is something roughly like this (a sketch; all that matters is that it occasionally changes the loop's control flow):
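    doA() {
        /* Sketch of the idea only; the exact body doesn't matter. The "1%"
           below means an actual one-percent probability, not the modulo
           operator.                                                        */
        if (1% chance) {
            i = 1000000;   /* kick i past the loop bound, ending the loop early */
        }
    }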
The CPU will likely fail to predict this, and will then be forced to throw away the work. Nonetheless, it's overall beneficial for the CPU to speculatively attempt all the loop iterations anyway (the alternative is leaving the CPU pipeline empty, which has roughly the same cost as a failed speculation).
The CPU wins if it is correct and ties if it is wrong, so it might as well speculate.
This really threw me for a loop. I'm not used to seeing the percent sign used for actual percents! I was wondering "Why is he doing 1 modulo some 'chance' variable?"
In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
> In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
Nope. CPUs are latency-optimized.
The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
----------
A throughput-optimized machine, like a GPU (and, strangely enough, a hard drive), is willing to slow down the 1st cup of coffee for better overall throughput.
Hard drives are interesting: if you have the following "reads":
#1: Read location 1
#2: Read location 100
#3: Read location 50
The hard drive will re-arrange the reads into: Read 1, Read 50, Read 100, because the hard-drive head will reach location 50 before location 100. Remember, hard drives physically move their arm to each physical location.
This means that Read 100 is "slowed down": it was requested before Read 50 but is now served after it, so its individual latency gets worse. But the three reads, taken together, complete sooner, because the head travels less distance overall.
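Here's a toy sketch of that kind of reordering (made-up request locations and a naive shortest-seek-first policy; real drive firmware is far more sophisticated). Seek time is modelled as the distance the head moves:

    #include <stdio.h>
    #include <stdlib.h>

    #define NREQ 4

    /* Serve NREQ pending reads either in arrival order (FIFO) or by always
       seeking to the nearest pending location next (shortest-seek-first).
       Seek time is modelled as |head movement|.                            */
    static void simulate(const char *name, int head, const int req[NREQ], int nearest_first) {
        int done[NREQ] = {0};
        int t = 0;

        printf("%s (head starts at %d):\n", name, head);
        for (int n = 0; n < NREQ; n++) {
            int pick = -1;
            for (int i = 0; i < NREQ; i++) {
                if (done[i]) continue;
                if (pick < 0 ||
                    (nearest_first && abs(req[i] - head) < abs(req[pick] - head)))
                    pick = i;   /* FIFO just keeps the oldest pending request */
            }
            t += abs(req[pick] - head);          /* seek time ~ distance moved */
            head = req[pick];
            done[pick] = 1;
            printf("  read @%3d (issued #%d) completes at t=%3d\n",
                   req[pick], pick + 1, t);
        }
    }

    int main(void) {
        const int requests[NREQ] = {100, 45, 55, 40};   /* in arrival order */
        simulate("FIFO",                50, requests, 0);
        simulate("Shortest-seek-first", 50, requests, 1);
        return 0;
    }

With these made-up numbers, the read at location 100 completes at t=50 under FIFO but t=70 once reordered, while the whole batch finishes at t=70 instead of t=130: worse latency for one request, better throughput overall.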
> The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
Just to be clear, then: the analogy from the original post doesn't apply.
> Just to be clear, then: the analogy from the original post doesn't apply.
The analogy from the original post applies to the cases the original post discusses.
The original "coffee latency" blogpost innately applies to a 1980s style computer: a simple in-order machine. Its truly correct for that model of simple computing.
I've added in complications: pipelining, superscalar execution, and speculative execution, which were inventions deployed in CPUs through the '90s and 2000s. So things work differently on modern machines, because modern machines have many, many more features than the "original" computer designs.
The original "cups of coffee" are a good way to start thinking about latency vs bandwidth problem. I really like the analogy. But it would take a LOT more writing before I really cover everything going on in modern CPUs.
Your original post was missing explanation: you referenced the original analogy without addressing how it no longer applied to the scenario you were discussing.
For what it's worth, in all my replies I have not been confused about the behavior of a CPU, but only about how you are trying to use the analogy to fit your exposition.
Does this actually waste power?
A lot of people's intuition about what wastes power with cpus is wrong. Usually it is best to light up all the tricks, heat the cpu up, finish the work, then go back to sleep. If branch prediction helps you get back to sleep sooner, it is probably a net win.
Power-used = Capacitance * Voltage^2 * Frequency * Number of Bits flipped
This equation roughly holds for all CMOS circuits, from the 1970s through today's CPUs. Smaller transistors result in smaller capacitance, which is what led to Dennard scaling for the past 40 years. Otherwise, the physics are the same.
Static power consumption in CMOS is theoretically zero, and the leakage current is measured in femtoamps. Static power consumption can therefore be ignored; only dynamic power matters in the real world.
Assuming the same transistor size (aka capacitance), you can see that power use is most strongly determined by voltage. Another note: higher frequencies (say 3GHz or 4GHz) require more and more voltage to sustain.
A mobile chip running at 1/2 the voltage and 1/2 the frequency will draw 1/8th the power but take 2x longer to complete a task. Overall, you use 1/4 the energy.
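Spelling that arithmetic out with the formula above (the baseline numbers are arbitrary; only the ratios matter):

    #include <stdio.h>

    /* Rough dynamic-power model from above: P ~ C * V^2 * f, with the bit-flip
       activity held constant. Arbitrary baseline numbers, only ratios matter. */
    int main(void) {
        const double C = 1.0;                /* relative capacitance           */
        const double V = 1.0, f = 1.0;       /* baseline voltage and frequency */
        const double work = 1.0;             /* some fixed amount of work      */

        double p_full = C * V * V * f;
        double t_full = work / f;                        /* time to finish the work */
        double e_full = p_full * t_full;                 /* energy = power * time   */

        double p_half = C * (V / 2) * (V / 2) * (f / 2); /* 1/8th the power         */
        double t_half = work / (f / 2);                  /* takes 2x as long        */
        double e_half = p_half * t_half;                 /* 1/4 the energy          */

        printf("power ratio:  %.3f\n", p_half / p_full); /* prints 0.125 */
        printf("energy ratio: %.3f\n", e_half / e_full); /* prints 0.250 */
        return 0;
    }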
----------
This is why servers reduce frequency and voltage and save on power in practice.
Very often I see false trade-offs made in the name of better latency. When the system is on top of things, sure, go for latency. Once there is any backlog, favoring throughput gets you better latency too.
Joel on Software gave a better illustration of this a while back in talking about the dangers of multitasking; he said imagine that you have no task-switching penalties but have to perform two tasks A and B which are in theory 100 units of time each. If you perform them serially, you get the result for A at time 100 and the result for B at time 200; if you perform them in parallel switching between them, you get the benefit that at time 51 you can show both of the recipients that you are 25% complete, but you deliver A at time 199 and B at time 200. B gets the same result; A gets a strictly better result, by not multitasking. If you imagine that your reputation is proportional to the average of the inverses of your times-to-completion, your reputation is 50% better in the first case due to the 100% improvement on half of your deadlines; if you had done the same nonsense with three parallel tasks your reputation would be 83% better or so.
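Just to check those reputation numbers, here's a throwaway calculation (assuming equal-length tasks, zero switching cost, and the "average of the inverses of your times-to-completion" metric described above):

    #include <stdio.h>

    /* N equal tasks of LEN time units each, delivered either serially or by
       round-robin interleaving one unit at a time (no switching penalty).
       "Reputation" is the average of 1/completion-time over all tasks.     */
    static double reputation(int n, const double done[]) {
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += 1.0 / done[i];
        return sum / n;
    }

    static void compare(int n, double len) {
        double serial[16], interleaved[16];

        for (int i = 0; i < n; i++) {
            serial[i] = (i + 1) * len;              /* task i ships after i+1 full tasks  */
            interleaved[i] = n * len - (n - 1 - i); /* everything ships near the very end */
        }
        printf("%d tasks: doing them serially scores %.0f%% better\n",
               n, 100.0 * (reputation(n, serial) / reputation(n, interleaved) - 1.0));
    }

    int main(void) {
        compare(2, 100);   /* prints ~50% better */
        compare(3, 100);   /* prints ~83% better */
        return 0;
    }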
With that said it seems, I don’t know, like something is missing? Throughput in these project-engineering contexts is little more than the plural of latency; improving latency usually works to improve throughput. So it would be nice to figure out what the actually-perpendicular vector is, given that these two so often go hand-in-hand.
I'd then want to think about situations where you could up-front invest in building a clean piece of software that is dynamic and highly-adaptable later (big wait, then lots of features can be delivered faster) vs. a clunker that was slapped together ad-hoc in order to immediately meet business needs, and it shows (immediate results but every new feature takes longer and longer).
Between the two of those I have a personality which favors the first; in one of my early programming jobs I had a lot of trouble being thrown into the tail end of a system built for years according to the second principle, and so every little change took weeks to debug because everything was spaghetti—I got a bit burned. On the flip-side, the second is in some sense Objectively Correct—lower latencies are really powerful—and I started to adopt some serious principles from that.
So with new internal tools for example, I have some baseline principles which speak to the second vision. A new tool starts without CI/CD, it starts without a database or data persistence, it has a repository but it does not have a release process or code reviews; it starts without extraneous design or styles or templates; usually it starts without tests although in theory I like test-driven development. When I say minimum viable product, I mean that word minimum and I am somewhat loose on that word viable. If there is supposed to be communication with a hypothetical API, that API does not exist and instead there is a file containing some functions which return static JSON blobs that it might have hypothetically tossed back in response. It is a frontend-first design that has no backend.
And I keep negotiating what this product is with my stakeholders, until that frontend has been massaged into something that they can use. Low latency in learning what my tool-consumer wants is key, so I can't be making it expensive to change my data model or the like. I want the complaints that “This tool is extremely useful, I wish it looked pretty and saved my info from session to session and had the latest data from our HR system” and whatever else it needs to do to actually be properly viable.
I think that what I am doing is some variant of Domain-Driven Design? Basically I am trying to suss out major product requirements from nontechnical folks by having them interact with the product requirements as early as possible, to see what those requirements imply and correct them again and again. I want to have a technical model of how they look at the world which is correct, first—and then when I am building the backend I can actually have a properly principled approach to what I am building because I know what the terms mean in this system.
> if you perform them in parallel switching between them, you get the benefit that at time 51 you can show both of the recipients that you are 25% complete, but you deliver A at time 199 and B at time 200.
That's not parallelism, that's concurrency. You are basically doing round-robin. If they were done in parallel, then both tasks would get completed at time 100. Improving throughput usually improves the worst-case latency when there is at least some parallelism; otherwise, I agree that favoring throughput would not make a lot of sense in many cases.
> I think that what I am doing is some variant of Domain-Driven Design?
Sounds like iterative and incremental software development. I dare say Agile.
- Pull requests should usually optimize for latency, not throughput (i.e. smaller PRs/changes are usually better)
- Release frequently instead of infrequently (implying that frequent releases will be small, while infrequent releases will be very large)
- Non-strictness (latency-optimized) is more composable than strictness (throughput-optimized)
... this orbits another mental model I was exposed to a few years ago that I call "weak-signal thinking".