If a CPU sees a line of 10 people, it will brew 10 cups of coffee, speculating that most of them want coffee. If only 9 cups were needed, it will throw away the extra coffee.
-----------
In practice, this truly happens. CPUs perform branch prediction over for-loops:
for (int i = 0; i < 32; i++) {
    doA();
}
doB();
The value "i" hasn't been calculated yet, but the CPU performs branch prediction. Modern CPUs can accurately loops of size ~32 or less. Modern CPUs will literally fill their pipelines with 32x "doA()" statements, and even the doB() statement BEFORE the i<32 check was even tested.
Now the branch predictor might be wrong! Let's say that doA() contains a rare, data-dependent branch.
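A minimal sketch of what such a doA() might look like (the exact body here is hypothetical), assuming it has a data-dependent branch with roughly a "1% chance" of firing:

#include <stdlib.h>

static long rare_event_count = 0;   /* stand-in for whatever the rare path does */

void doA(void) {
    /* Data-dependent branch, taken roughly 1% of the time. The predictor
       almost always guesses "not taken", so on the occasions this path does
       fire, everything speculated beyond it (the remaining iterations,
       doB(), ...) has to be thrown away. */
    if (rand() % 100 < 1) {
        rare_event_count++;
    }
}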
The CPU will likely fail to predict this, and will then be forced to throw away the work. Nonetheless, it's overall beneficial for the CPU to speculatively attempt all the loop iterations anyway (the alternative is leaving the CPU pipeline empty, which has roughly the same cost as a failed speculation anyway).
The CPU wins if it is correct, and it ties if it is wrong. So it might as well speculate.
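To put a rough number on the cost of getting it wrong, here's a micro-benchmark sketch (my own construction, not from the original post; a clever compiler may also turn the branch into a conditional move and hide the effect): it times the same loop with a perfectly predictable condition versus a ~50/50 data-dependent one. The unpredictable version typically runs measurably slower purely from mispredictions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_SIZE 4096

/* Time `iters` trips through a branch whose outcome is driven by `data`.
   All-zero data => branch always taken (trivially predictable).
   Random bytes  => branch taken ~50% of the time with no pattern. */
static double time_branches(const unsigned char *data, long iters) {
    long sum = 0;
    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        if (data[i % TABLE_SIZE] < 128) {  /* the branch under test */
            sum += i;
        }
    }
    clock_t end = clock();
    if (sum == 42) printf("unlikely\n");   /* keep `sum` live so the loop isn't deleted */
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void) {
    long iters = 200000000;                        /* 200 million iterations */
    static unsigned char predictable[TABLE_SIZE];  /* zero-initialized: always < 128 */
    static unsigned char unpredictable[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; i++)
        unpredictable[i] = (unsigned char)(rand() % 256);

    printf("predictable branch:   %.3f s\n", time_branches(predictable, iters));
    printf("unpredictable branch: %.3f s\n", time_branches(unpredictable, iters));
    return 0;
}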
This really threw me for a loop. I'm not used to seeing the percent sign used for actual percents! I was wondering "Why is he doing 1 modulo some 'chance' variable?"
In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
> In your example, it sounds like you mean throughput-optimized. According to the original post, brewing hot water for 10 cups would introduce additional latency.
Nope. CPUs are latency-optimized.
The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
----------
Throughput-optimized machines, like GPUs (and, strangely enough, hard drives), are willing to slow down the 1st cup of coffee for better overall throughput.
Hard drives are interesting: if you have the following "reads":
#1: Read location 1
#2: Read location 100
#3: Read location 50
The hard drive will re-arrange the reads into: Read 1, Read 50, Read 100, because the hard-drive head will reach location 50 before location 100. Remember, hard drives physically move their arm to each location.
This means that Read 100 is "slowed down": its latency gets significantly worse. But the three reads, taken together as a batch, complete sooner than they would in the submitted order.
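A minimal toy-model sketch of that reordering (this is not real drive firmware; real drives do it in hardware with NCQ-style command queuing): sort the pending requests by physical location so the head sweeps across the platter in one pass.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of a pending read queue, identified only by physical location. */
static int by_location(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    /* Requests arrive in this order: location 1, location 100, location 50. */
    int pending[] = { 1, 100, 50 };
    int n = sizeof(pending) / sizeof(pending[0]);

    /* Reorder so the head sweeps in one direction: total seek distance drops
       (better throughput), but the read at location 100 now waits behind the
       read at location 50 (worse latency for that one request). */
    qsort(pending, n, sizeof(pending[0]), by_location);

    for (int i = 0; i < n; i++)
        printf("Read location %d\n", pending[i]);
    return 0;
}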
> The "1st cup of coffee" always takes the same amount of time in CPU-land. The 2nd-cup of coffee was speculatively made, but never "slowed down the first cup of coffee".
Just to be clear, then: the analogy from the original post doesn't apply.
> Just to be clear, then: the analogy from the original post doesn't apply.
The analogy from the original post applies to the cases the original post discusses.
The original "coffee latency" blogpost innately applies to a 1980s style computer: a simple in-order machine. Its truly correct for that model of simple computing.
I've added in complications: pipelining, superscalar execution, and speculative execution, which were inventions deployed in CPUs in the early '90s and the '00s. So things work differently on modern machines, because modern machines have many, many more features than the "original" computer designs.
The original "cups of coffee" are a good way to start thinking about latency vs bandwidth problem. I really like the analogy. But it would take a LOT more writing before I really cover everything going on in modern CPUs.
Your original post was missing an explanation, because you referenced the original analogy without addressing how it no longer applied to the scenario you were discussing.
For what it's worth, in all my replies I have not been confused about the behavior of a CPU, but only about how you are trying to use the analogy to fit your exposition.
Does this actually waste power?
A lot of people's intuition about what wastes power on CPUs is wrong. Usually it is best to light up all the tricks, heat the CPU up, finish the work, then go back to sleep. If branch prediction helps you get back to sleep sooner, it is probably a net win.
Power used = Capacitance * Voltage^2 * Frequency * (number of bits flipped)
This equation roughly holds for all CMOS circuits, from the 1970s through today's CPUs. Smaller transistors result in smaller capacitance, which is what led to Dennard scaling for the past 40 years. Otherwise, the physics are the same.
Static power consumption in CMOS is theoretically zero, and is measured in femto-amps. Static power can therefore be ignored; only dynamic power matters in the real world.
Assuming the same transistor size (aka: capacitance), you can see that power use is most strongly determined by voltage. Another note: higher frequencies (say 3GHz or 4GHz) require more and more voltage to sustain.
A mobile chip drawing 1/2 the voltage at 1/2 the frequency will have 1/8th the power draw, but will take 2x longer to complete a task. Overall, that works out to 1/4th the energy used.
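A quick arithmetic sketch of that claim, using the formula above (the units are arbitrary; only the ratios matter):

#include <stdio.h>

/* Dynamic power ~ C * V^2 * f (activity factor folded into C here). */
static double power(double capacitance, double voltage, double freq) {
    return capacitance * voltage * voltage * freq;
}

int main(void) {
    double C = 1.0;                       /* arbitrary units */
    double p_full = power(C, 1.0, 1.0);   /* full voltage, full frequency */
    double p_half = power(C, 0.5, 0.5);   /* half voltage, half frequency */

    /* Half frequency means the same task takes twice as long. */
    double e_full = p_full * 1.0;
    double e_half = p_half * 2.0;

    printf("power ratio:  %.3f (expect 1/8)\n", p_half / p_full);
    printf("energy ratio: %.3f (expect 1/4)\n", e_half / e_full);
    return 0;
}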
----------
This is why servers reduce frequency and voltage to save power in practice.