If a CPU sees a line of 10 people, it will brew 10 cups of coffee, speculating that most of them want coffee. If only 9 cups were needed, it will throw away the extra coffee.
-----------
In practice, this truly happens. CPUs perform branch prediction over for-loops:
for (int i = 0; i < 32; i++) {
    doA();
}
doB();
The value "i" hasn't been calculated yet, but the CPU performs branch prediction. Modern CPUs can accurately loops of size ~32 or less. Modern CPUs will literally fill their pipelines with 32x "doA()" statements, and even the doB() statement BEFORE the i<32 check was even tested.
Now the branch predictor might be wrong! Let's say that doA() contains a rare, data-dependent branch.
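A minimal sketch of what such a doA() might look like (the exact body here is hypothetical), assuming it has a data-dependent branch with roughly a "1% chance" of firing:

#include <stdlib.h>

static long rare_event_count = 0;   /* stand-in for whatever the rare path does */

void doA(void) {
    /* Data-dependent branch, taken roughly 1% of the time. The predictor
       almost always guesses "not taken", so on the occasions this path does
       fire, everything speculated beyond it (the remaining iterations,
       doB(), ...) has to be thrown away. */
    if (rand() % 100 < 1) {
        rare_event_count++;
    }
}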
The CPU will likely fail to predict this, and will then be forced to throw away the work. Nonetheless, it's overall beneficial for the CPU to speculatively attempt all the loop iterations anyway (the alternative is leaving the CPU pipeline empty, which has roughly the same cost as a failed speculation anyway).
The CPU wins if it is correct, and it ties if it is wrong. So it might as well speculate.
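To put a rough number on the cost of getting it wrong, here's a micro-benchmark sketch (my own construction, not from the original post; a clever compiler may also turn the branch into a conditional move and hide the effect): it times the same loop with a perfectly predictable condition versus a ~50/50 data-dependent one. The unpredictable version typically runs measurably slower purely from mispredictions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_SIZE 4096

/* Time `iters` trips through a branch whose outcome is driven by `data`.
   All-zero data => branch always taken (trivially predictable).
   Random bytes  => branch taken ~50% of the time with no pattern. */
static double time_branches(const unsigned char *data, long iters) {
    long sum = 0;
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        if (data[i % TABLE_SIZE] < 128) {  /* the branch under test */
            sum += i;
        }
    }
    clock_t end = clock();
    if (sum == 42) printf("unlikely\n");   /* keep `sum` live so the loop isn't deleted */
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void) {
    long iters = 200000000;                        /* 200 million iterations */
    static unsigned char predictable[TABLE_SIZE];  /* zero-initialized: always < 128 */
    static unsigned char unpredictable[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; i++)
        unpredictable[i] = (unsigned char)(rand() % 256);

    printf("predictable branch:   %.3f s\n", time_branches(predictable, iters));
    printf("unpredictable branch: %.3f s\n", time_branches(unpredictable, iters));
    return 0;
}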
This really threw me for a loop. I'm not used to seeing the percent sign used for actual percents! I was wondering "Why is he doing 1 modulo some 'chance' variable?"
In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
> In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
Nope. CPUs are latency-optimized.
The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
----------
Throughput-optimized machines, like GPUs (and, strangely enough, hard drives), are willing to slow down the 1st cup of coffee for better overall throughput.
Hard drives are interesting: if you have the following "reads":
#1: Read location 1
#2: Read location 100
#3: Read location 50
The hard drive will re-arrange the reads into: Read 1, Read 50, Read 100, because the hard-drive head will reach location 50 before location 100. Remember, hard drives physically move their arm to each location.
This means that Read 100 is "slowed down": its latency gets significantly worse. But the three reads, taken together as a batch, complete sooner than they would in the submitted order.
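A minimal toy-model sketch of that reordering (this is not real drive firmware; real drives do it in hardware with NCQ-style command queuing): sort the pending requests by physical location so the head sweeps across the platter in one pass.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of a pending read queue, identified only by physical location. */
static int by_location(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    /* Requests arrive in this order: location 1, location 100, location 50. */
    int pending[] = { 1, 100, 50 };
    int n = sizeof(pending) / sizeof(pending[0]);

    /* Reorder so the head sweeps in one direction: total seek distance drops
       (better throughput), but the read at location 100 now waits behind the
       read at location 50 (worse latency for that one request). */
    qsort(pending, n, sizeof(pending[0]), by_location);

    for (int i = 0; i < n; i++)
        printf("Read location %d\n", pending[i]);
    return 0;
}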
> The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
Just to be clear, then: the analogy from the original post doesn't apply.
> Just to be clear, then: the analogy from the original post doesn't apply.
The analogy from the original post applies to the cases the original post discusses.
The original "coffee latency" blogpost innately applies to a 1980s style computer: a simple in-order machine. Its truly correct for that model of simple computing.
I've added in complications: pipelining, superscalar execution, and speculative execution, which were inventions deployed in CPUs in the early '90s and the '00s. So things work differently on modern machines, because modern machines have many, many more features than the "original" computer designs.
The original "cups of coffee" are a good way to start thinking about latency vs bandwidth problem. I really like the analogy. But it would take a LOT more writing before I really cover everything going on in modern CPUs.
Your original post was missing an explanation, because you referenced the original analogy without addressing how it no longer applied to the scenario you were discussing.
For what it's worth, in all my replies I have not been confused about the behavior of a CPU, but only about how you are trying to use the analogy to fit your exposition.
Does this actually waste power?
A lot of people's intuition about what wastes power on CPUs is wrong. Usually it is best to light up all the tricks, heat the CPU up, finish the work, then go back to sleep. If branch prediction helps you get back to sleep sooner, it is probably a net win.
Power used = Capacitance * Voltage^2 * Frequency * (number of bits flipped)
This equation roughly holds for all CMOS circuits, from the 1970s through today's CPUs. Smaller transistors result in smaller capacitance, which is what led to Dennard scaling for the past 40 years. Otherwise, the physics are the same.
Static power consumption in CMOS is theoretically zero, and is measured in femto-amps. Static power can therefore be ignored; only dynamic power matters in the real world.
Assuming the same transistor size (aka: capacitance), you can see that power use is most strongly determined by voltage. Another note: higher frequencies (say 3GHz or 4GHz) require more and more voltage to sustain.
A mobile chip drawing 1/2 the voltage at 1/2 the frequency will have 1/8th the power draw, but will take 2x longer to complete a task. Overall, that works out to 1/4th the energy used.
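A quick arithmetic sketch of that claim, using the formula above (the units are arbitrary; only the ratios matter):

#include <stdio.h>

/* Dynamic power ~ C * V^2 * f (activity factor folded into C here). */
static double power(double capacitance, double voltage, double freq) {
    return capacitance * voltage * voltage * freq;
}

int main(void) {
    double C = 1.0;                       /* arbitrary units */
    double p_full = power(C, 1.0, 1.0);   /* full voltage, full frequency */
    double p_half = power(C, 0.5, 0.5);   /* half voltage, half frequency */

    /* Half frequency means the same task takes twice as long. */
    double e_full = p_full * 1.0;
    double e_half = p_half * 2.0;

    printf("power ratio:  %.3f (expect 1/8)\n", p_half / p_full);
    printf("energy ratio: %.3f (expect 1/4)\n", e_half / e_full);
    return 0;
}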
----------
This is why servers reduce frequency and voltage to save power in practice.