Race conditions: now in hardware at the gate level! A few questions 1. Could you...

thesz · on Oct 8, 2018

2. Quite the contrary when you talk about complexity. You have to design for interfaces with different clock domains specifically. This is hard in Verilog, at the very least. I did it, I knew - Verilog, having no type system, cannot determine difference between various sizes of bit vectors, let alone prevent one to perform direct operation on the values from different clock domains. VHDL is no better (I can explain). The queues required for more-or-less error prone design are not cheap in any measurement - you have to alter your synchronous design for them, you have to add silicon for them and they introduce additional latency of several cycles when transmitting data.

It is the queues between clock domains that prevents splitting sync design into smaller and smaller clock domains.

Async design, on the other hand, adds single-clock "queues" between computation stages and that queues are clocked by the completeness of operation, not by some fixed clock rate whatever variable it is.

And here some propaganda (which translates as "explanation", BTW). Ripple-carry adder in asynchronous design has average case complexity O(log(W)) (W being word size), just like worst case complexity of carry-lookahead adder. The worst case complexities are different, of course, O(W) and O(log(W)). But that worst-case complexity will occur with probability 1/2^W. What's more, if you dynamically add words of size W where one word has two parts H and L and H is either ones or zeroes, then complexity of asynchronous adder performance will be O(log(L size)) in average case. That may be a case when same adder used for addresses computation and general addition in generic CPU pipeline. Given that ARM has, I believe, immediate operand of size 12, you will have some nice speedup in address computation out of thin air. Without introducing any hacks and/or optimizations.

staticfloat · on Oct 8, 2018

1. No, asynchronous means something different than multiple clocks. Think of it like the difference between polling based programming and using coroutines; with multiple clock domains you have separate sections of your chip performing tasks at predefined instants in time (when your clock signal rises/when your polling loop swings around again) but with a truly asynchronous design, you simply start processing the next chunk of work when the previous chunk is finished (when the previous chunk of logic drives a signal high/when the previous coroutine finishes and control flow resumes in your coroutine).

2. It does deliver some benefits, but not all. Truly clockless design is desirable in some cases due to power concerns; for example the Novelda Xethru ultra-wideband radar SoCs are actually clockless, because power distribution networks can account for 20%+ of the power consumed in chips like this. (This is what I've heard, I don't have a citation for this. The paper I quote below similarly handwaves and throws around numbers from 26% all the way up to 40%, but they don't do any analysis of their own on this)

I've never used a clockless CPU design before, but the theoretical advantages are listed out quite nicely in this paper [0], which lists (among other things) the natural ability for the CPU to sit at idle (not executing `NOP` instructions, actually idle) when no work is available. It appears that the AMULET 3 processor (which is compared against an ARM 9 core) is competitive in power consumption, but doesn't quite stand up in performance. While still pretty impressive for a research project, this shows that we do still have quite a bit of work to do before these chips are ruling the world (if, indeed, we can scale up our tools to the point that designing these isn't just an exercise in frustration).

[0]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83....

ofrzeta · on Oct 8, 2018

> the natural ability for the CPU to sit at idle (not executing `NOP` instructions, actually idle) when no work is available.

That just makes so much sense. Just think about how much power could be saved with all the computing devices that are idle pretty much of the time.

thesz · on Oct 8, 2018

Clock power distribution accounted for 60% of Alpha AXP power consumption, AFAIK.

The power consumption is of importance for HPC - current path to exascale is limited by power consumption.

p4bl0 · on Oct 8, 2018

> Race conditions: now in hardware at the gate level!

It's already the case even in synchronous microprocessor. It's exploited by side-channel attacks based on power analysis. It makes it very difficult to implement effective countermeasures using dual-rail protocol in hardware. You can read a bit about it in the state-of-the-art section of one of my papers [1], which will also give you some other references :).

[1] https://eprint.iacr.org/2013/554

rjmunro · on Oct 8, 2018

1. No.

2. Not really. You still have the possibility of varying the speed on an asynchronous CPU. When you want stuff to go faster you can raise the voltage and increase the cooling.

I remember reading at the time of the ARM AMULET that they tested one by cooling it with liquid nitrogen and ran it at a high voltage, and got it to go as faster in benchmarks than the contemporary standard ARM processor.

baybal2 · on Oct 8, 2018

The biggest benefit is in "long range" IO. There you use what's called "self clocking" signal forms.

This is why you don't need all your USB peripherals to be in perfect sync without having an associated penalty for frequency matching. In reality, things are a bit more complex, but in general self-clocking signal form is a must have for such applications.