Asynchronous (Clockless) CPU (wikipedia.org)
168 points by peter_d_sherman on Oct 8, 2018 | 60 comments



Though fundamentally very different, clocking with a free-running clock is an alternative in some cases. Such a clock has a period corresponding to several gate delays, and those delays in turn depend on the current core voltage. The end result is a clock close to optimal for the voltage applied to the core, one that adjusts even within a single SMPS charging cycle.
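A toy numerical sketch of that idea (the delay model and all constants here are invented for illustration, not taken from the paper):

    # Toy model of a free-running (gate-delay-referenced) clock: the period
    # is a fixed number of gate delays, and gate delay grows as core voltage
    # drops, so the clock automatically slows to what the voltage supports.
    # The delay formula and constants below are made up for illustration.
    def gate_delay_ps(vdd, vth=0.3, alpha=1.3):
        return 20.0 * vdd / (vdd - vth) ** alpha

    def free_running_freq_ghz(vdd, stages=11):
        period_ps = 2 * stages * gate_delay_ps(vdd)  # ring-oscillator period
        return 1000.0 / period_ps

    for vdd in (0.6, 0.8, 1.0):  # e.g. voltage sagging within an SMPS cycle
        print(f"Vdd = {vdd:.1f} V -> clock ~ {free_running_freq_ghz(vdd):.2f} GHz")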

A RISC-V prototype achieved almost 40% power savings: https://people.eecs.berkeley.edu/~bora/Journals/2017/JSSC17-...


I have read a lot about async designs recently, and most of the research seems to have dried up around 2010.

There doesn't seem to be a consensus on how much power you can actually save with an async CPU. It's said that clock distribution on modern CPUs/boards(?) accounts for around 30 percent or more of overall power consumption, but on the other hand the savings do not necessarily amount to that much.

From a Technology Review article on clockless chips: the Intel "clockless prototype in 1997 ran three times faster than the conventional-chip equivalent, on half the power." Apparently that didn't make sense for Intel economically, because you'd have to re-create virtually a whole industry that is based on clocked chips.

Another Intel scientist (unfortunately I can't re-find that source) later said that the power savings of async CPUs aren't as high as claimed by their proponents.

Interestingly, Intel Chief Scientist Narayan Srinivasa left the company to become CTO at Eta Compute, which develops an asynchronous ARM Cortex-M3 microcontroller.


There is an example of a working, commercially sold CPU: GreenArrays' GA144.

http://www.greenarraychips.com/home/documents/greg/PB001-100...

To quote: "Up to 96 billion operations per second" and "instantaneous power ranges from 14 microwatts to 650 milliwatts"

Let's say there will be ten times fewer operations per second (addition has O(log(WordSize)) delay in the average case; for the GA144 the word size is less than 32 bits, hence the assumed factor-of-10 reduction; the fastest operation for 2-in-1 self-synchronizing encoding is inversion, which is just a swap of the lines and incurs no computation whatsoever), and at max power. That would be 650 mW applied for 1/(9.6×10^9) seconds per operation, i.e. about 6.8×10^-11 J, or 68 pJ (picojoules), per operation.
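The same arithmetic as a minimal Python sketch (the peak rate and power are from the product brief above; the 10x derating is the assumption stated in the parenthetical):

    # Sanity check of the 68 pJ/op estimate, using the GA144 brief's numbers
    # and this comment's assumed 10x derating for word-wide additions.
    peak_ops_per_s = 96e9                    # "up to 96 billion operations per second"
    derated_ops_per_s = peak_ops_per_s / 10  # assumed 10x reduction
    max_power_w = 0.650                      # "650 milliwatts" max instantaneous power
    energy_per_op_j = max_power_w / derated_ops_per_s
    print(f"{energy_per_op_j * 1e12:.0f} pJ per operation")  # -> 68 pJ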

https://www.ics.forth.gr/carv/greenvm/files/tr450.pdf - page 28 lists the power consumption of different operations. "Simple integer" includes addition, and the 32-bit variant is in the range of 50 to 80 pJ. If we take the average, that's 65 pJ, which is very close to my worst-case estimate for the GA144.

This means, in my opinion, that a self-synchronous CPU, as exemplified by the GA144, is at least as efficient as a synchronous ARM CPU, with very efficient sleep-mode entry/exit on top.


The GreenArrays GA144 is a pretty special CPU, and you can't compare its ops with those of a regular CPU one to one.

Essentially it is a cluster of special-purpose Forth CPUs.


In fact, I can. I am comparing a roughly estimated cost of an addition operation (hence a worst-case scenario for the GA144) against the better-known cost of an addition operation on a regular CPU.


To judge by anything on the record, you would think nobody at Intel has much love for anything asynchronous, but it's worth keeping in mind that Intel funded and later acquired a startup called Asynchronous Design Technology, later renamed Fulcrum Microsystems, to the tune of something like $100 million altogether, and then let it slip under the radar. Last I checked, its domain had been taken over by a tech blog. People from Intel have also usually put in an appearance at the IEEE Asynchronous conferences over the years. I'd venture to guess they know a thing or two about async at Intel, but it's fine with them if the rest of us don't.


Fulcrum didn't slip under the radar; it was purchased by Intel in 2011 [1]. Since then, Intel's switch and router division has had its part numbers start with "FM".

[1] https://newsroom.intel.com/news-releases/intel-to-acquire-fu... [2] https://www.intel.com/content/www/us/en/design/products-and-...


The prior name of Fulcrum Microsystems was Asynchronous Digital Design, not Design Technology. They fairly quickly decided that ADD was an unfortunate acronym...


I stand corrected. Thank you.


>Another Intel scientist (unfortunately I can't re-find that source) later said that the power savings of async CPUs aren't as high as claimed by their proponents.

Because all modern chips made with any consideration for power saving use clock and power gating.


Gating is great when you have entire cores idle. Other than AVX2 units, does it save you any notable amount of power on active CPUs?


It can save you as much power as you want; nowadays OEMs simply type a number into registers, and that's how much power the CPU will be allowed to consume.


Surely that answer describes something like clock scaling, but not gating, which I thought was more about turning the clock on/off completely for various blocks that are not being used?


AVX-512 is gated, but causes heavy throttling.


> AVX-512 is gated, but causes heavy throttling.

In its current implementation, it does. Consider that this is Intel's first iteration of AVX-512, and don't forget that in its first iteration, AVX was also plagued by performance problems. This first iteration's main purpose is to let "ordinary developers" (i.e. not only highly selected specialists who have to sign lots of Intel NDAs) begin to develop and experiment with these new extensions. I believe that Intel has grand plans for AVX-512, and its next iteration will be the one that aims to let "ordinary users" profit performance-wise from applications that use AVX-512 instructions.


Not a lot of chips gain much from power gating though, for various reasons.


Adding to my own post

> ... Apparently economically that didn't make sense for Intel because you'd have to re-create virtually a whole industry that is based on clocked chips.

Also you'd have to invent a sales/marketing scheme as an alternative to the existing one that is based on increasing clock rates. GHz is to the PC what HP (horsepower) is to the car. That might come to an end, obviously, but for now at least we have cores.


> Also you'd have to invent a sales/marketing scheme as an alternative to the existing one that is based on increasing clock rates. GHz is to the PC what HP (horsepower) is to the car.

The GHz race has been over for a long time. Since Intel Core (and AMD Zen, I think; AMD Bulldozer at least had, in my opinion, a different design philosophy), it has been all about smarter cores that do more in fewer clock cycles. Also, since AMD Zen, the "number of cores" race has regained traction. Finally, Intel in particular is trying to promote extra-wide SIMD instructions (AVX-512).


Just look at how they are promoting the new i9 (!) with 8 cores and 5 GHz :) https://thenextweb.com/plugged/2018/10/08/intels-9th-gen-pro...


Going back to the car analogy, GHz is more like engine displacement or cylinder count. Those measure an implementation detail of the engine, not its output, such as power (HP) or torque. I'd imagine that CPUs will be compared on benchmark scores, which is already being done now.


Just use the average stage cycles per second; that's what we do.


That Intel scientist was Shekhar Borkar.


Thanks so much. That made me find the article:

"One of the biggest claims to fame for asynchronous logic is that it consumes less power due to the absence of a clock. Clock power is responsible for almost half the power of a chip in a modern design such as a high-performance microprocessor. If you get rid of the clock, then you save almost half the power. This argument might sound reasonable at a glance, but is flawed. If the same logic was asynchronous, then you have to create handshake signals, such as request and acknowledgment signals that propagate forward and backwards from the logic evaluation flow. These signals now become performance-critical, have higher capacitive load and have the same activity as the logic. Therefore, the power saving that you get by eliminating the clock signal gets offset by the power consumption in the handshake signals and the associated logic."

https://www.eetimes.com/document.asp?doc_id=1277174


Race conditions: now in hardware at the gate level!

A few questions

1. Could you call current SoCs asynchronous, since they not only clock different blocks at different rates, but even subsections within a block run at various rates?

2. Does a variable clock rate deliver many of the benefits of async without the complexity? In other words, how much more blood is there to squeeze from the async stone in the current world?

I doubt we'll see a competitive async chip anytime soon, but as CPUs continue to evolve perhaps we'll see the functional blocks broken up into smaller and smaller clock domains until it becomes difficult to tell the difference?


2. Quite the contrary when you talk about complexity. You have to design specifically for interfaces between different clock domains. This is hard in Verilog, at the very least. I have done it, so I know: Verilog, having no real type system, cannot tell the difference between bit vectors of various sizes, let alone prevent you from operating directly on values from different clock domains. VHDL is no better (I can explain). The queues required for a more-or-less error-proof design are not cheap by any measure: you have to alter your synchronous design for them, you have to add silicon for them, and they introduce several cycles of additional latency when transmitting data.

It is the queues between clock domains that prevent splitting a synchronous design into smaller and smaller clock domains.

Async design, on the other hand, adds single-stage "queues" between computation stages, and those queues are clocked by the completion of the operation, not by some fixed clock rate, however variable it may be.

And here is some propaganda (which translates as "explanation", BTW). A ripple-carry adder in an asynchronous design has average-case complexity O(log W) (W being the word size), the same as the worst-case complexity of a carry-lookahead adder. The worst-case complexities are different, of course: O(W) versus O(log W). But that worst case occurs with probability on the order of 1/2^W. What's more, if you add W-bit words where one operand consists of two parts, H and L, and H is all ones or all zeroes, then the asynchronous adder's average-case complexity is O(log |L|). That may well be the case when the same adder is used both for address computation and for general addition in a generic CPU pipeline. Given that ARM has, I believe, a 12-bit immediate operand, you get a nice speedup in address computation out of thin air, without introducing any hacks or optimizations.
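A quick Monte-Carlo check of that average-case claim (illustrative Python; it measures the longest carry-propagate run, which is what bounds an async ripple-carry adder's completion time):

    import random

    # For uniformly random W-bit operands, the longest carry-ripple chain
    # grows like log2(W), even though the worst case is W.
    def longest_propagate_run(a, b, w):
        longest = run = 0
        for i in range(w):
            if ((a >> i) & 1) ^ ((b >> i) & 1):  # propagate: carry keeps rippling
                run += 1
                longest = max(longest, run)
            else:                                # generate/kill stops the ripple
                run = 0
        return longest

    W, TRIALS = 32, 20_000
    avg = sum(longest_propagate_run(random.getrandbits(W), random.getrandbits(W), W)
              for _ in range(TRIALS)) / TRIALS
    print(f"W = {W}: average longest chain = {avg:.2f}")  # ~4.4, vs log2(32) = 5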


1. No, asynchronous means something different than multiple clocks. Think of it like the difference between polling-based programming and using coroutines: with multiple clock domains you have separate sections of your chip performing tasks at predefined instants in time (when your clock signal rises/when your polling loop swings around again), but with a truly asynchronous design, you simply start processing the next chunk of work when the previous chunk is finished (when the previous chunk of logic drives a signal high/when the previous coroutine finishes and control flow resumes in your coroutine). See the sketch after this comment.

2. It does deliver some of the benefits, but not all. Truly clockless design is desirable in some cases due to power concerns; for example, the Novelda XeThru ultra-wideband radar SoCs are actually clockless, because clock distribution networks can account for 20%+ of the power consumed in chips like this. (This is what I've heard; I don't have a citation for it. The paper I quote below similarly handwaves and throws around numbers from 26% all the way up to 40%, but they don't do any analysis of their own on this.)

I've never used a clockless CPU design before, but the theoretical advantages are laid out quite nicely in this paper [0], which lists (among other things) the natural ability for the CPU to sit at idle (not executing `NOP` instructions, actually idle) when no work is available. It appears that the AMULET3 processor (which is compared against an ARM9 core) is competitive in power consumption, but doesn't quite stand up in performance. While still pretty impressive for a research project, this shows that we do still have quite a bit of work to do before these chips are ruling the world (if, indeed, we can scale up our tools to the point that designing these isn't just an exercise in frustration).

[0]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83....
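To make the coroutine analogy in point 1 concrete, here is a minimal sketch in Python (threads and one-slot queues standing in for logic stages and handshake latches; all names and timings are illustrative, not a real CPU model):

    import queue
    import random
    import threading
    import time

    # Each stage begins work the moment the previous stage hands it data;
    # no global clock decides when that happens. The one-slot queues give
    # backpressure, like the pipeline latches of an async design.
    def stage(name, inbox, outbox):
        while True:
            item = inbox.get()                   # wake only when data arrives
            if item is None:                     # shutdown token
                if outbox is not None:
                    outbox.put(None)
                return
            time.sleep(random.uniform(0, 0.01))  # data-dependent "logic delay"
            if outbox is not None:
                outbox.put(item + 1)             # completion triggers next stage
            else:
                print(f"{name}: result {item + 1}")

    a_to_b = queue.Queue(maxsize=1)
    b_to_c = queue.Queue(maxsize=1)
    threading.Thread(target=stage, args=("B", a_to_b, b_to_c)).start()
    threading.Thread(target=stage, args=("C", b_to_c, None)).start()
    for x in range(3):
        a_to_b.put(x)   # blocks only while stage B is still busy (backpressure)
    a_to_b.put(None)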


> the natural ability for the CPU to sit at idle (not executing `NOP` instructions, actually idle) when no work is available.

That just makes so much sense. Just think about how much power could be saved across all the computing devices that sit idle pretty much all of the time.


Clock power distribution accounted for 60% of Alpha AXP power consumption, AFAIK.

Power consumption is of importance for HPC: the current path to exascale is limited by power consumption.


> Race conditions: now in hardware at the gate level!

It's already the case even in synchronous microprocessors. It's exploited by side-channel attacks based on power analysis, and it makes it very difficult to implement effective countermeasures using dual-rail protocols in hardware. You can read a bit about it in the state-of-the-art section of one of my papers [1], which will also give you some other references :).

[1] https://eprint.iacr.org/2013/554


1. No.

2. Not really. You still have the possibility of varying the speed on an asynchronous CPU. When you want stuff to go faster you can raise the voltage and increase the cooling.

I remember reading, at the time of the ARM AMULET, that they tested one by cooling it with liquid nitrogen and running it at a high voltage, and got it to go faster in benchmarks than the contemporary standard ARM processor.


The biggest benefit is in "long range" I/O. There you use what are called "self-clocking" signal forms.

This is why all your USB peripherals don't need to be in perfect sync, without an associated penalty for frequency matching. In reality, things are a bit more complex, but in general a self-clocking signal form is a must-have for such applications.
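A concrete, if simplified, example of a self-clocking signal form is Manchester encoding (used by classic 10 Mbit Ethernet; USB itself uses NRZI with bit stuffing, but the principle of guaranteed transitions is the same). A minimal sketch:

    # Manchester encoding: every bit cell contains a transition, so the
    # receiver can recover the transmitter's clock from the data itself.
    # (IEEE 802.3 convention: 0 -> high-then-low, 1 -> low-then-high.)
    def manchester_encode(bits):
        out = []
        for b in bits:
            out += [0, 1] if b else [1, 0]
        return out

    def manchester_decode(halves):
        # The first half-cell of each bit is the inverse of the bit value.
        return [0 if halves[i] else 1 for i in range(0, len(halves), 2)]

    data = [1, 0, 1, 1, 0, 0, 1]
    line = manchester_encode(data)
    assert manchester_decode(line) == data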


As someone who is working on an asynchronous processor myself, I should remind you that asynchrony is more of a design-level choice than a magic bullet that makes everything better.

In particular, in CPUs with a big centralized register file there can be significant overhead to having an asynchronous CPU.

There are certain architectural cases in which it can be a killer advantage, and other cases in which it is pretty much the only way forward (e.g. in a 3D chip it can be quite difficult to distribute a high-speed, high-quality/low-skew+jitter clock).


The overhead you are talking about is mainly the scoreboard: a clockless CPU cannot assume results are ready when issuing instructions, so it needs a scoreboard of what is ready in order to issue instructions without conflicts.

But scoreboarding provides most of the gains of out-of-order execution and is cheap.
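A minimal sketch of what such a scoreboard tracks (illustrative Python; a real scoreboard also tracks functional units, WAR/WAW hazards, etc.):

    # Toy scoreboard: track per-register readiness, and only issue an
    # instruction when its sources are ready and its destination is free.
    # This is what lets a clockless core issue safely without assuming
    # fixed result latencies.
    class Scoreboard:
        def __init__(self, nregs):
            self.ready = [True] * nregs

        def can_issue(self, srcs, dst):
            # all source operands produced, and no write in flight to dst
            return all(self.ready[r] for r in srcs) and self.ready[dst]

        def issue(self, dst):
            self.ready[dst] = False           # result now outstanding

        def complete(self, dst):
            self.ready[dst] = True            # completion signal arrived

    sb = Scoreboard(8)
    sb.issue(dst=3)                           # long-latency op writing r3
    print(sb.can_issue(srcs=[3], dst=4))      # False: RAW hazard on r3
    sb.complete(dst=3)
    print(sb.can_issue(srcs=[3], dst=4))      # True: operand ready now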


Yeah, but most CPUs today are OoO without using scoreboarding.


You may have associative memory for speculation (most probably that is what you refer to above), but you still need a scoreboard for an operation to be issued without conflicts and with its operands ready.


> (e.g. in a 3D chip it can be quite difficult to distribute a high speed, high quality/low skew+jitter clock)

How thick is "3D" here, and why is that? What makes a bit of vertical distance harder than several mm of horizontal distance?


Maybe the electrons need to wait for the elevator?


It's hard more due to EDA tool reasons than due to physical reasons :)

Which incidentally, is also the main reason why asynchronous logic isn't more popular.


I believe the trouble is caused by inductive effects from adjacent layers (similar to self-inductance [0]).

[0] - https://en.wikipedia.org/wiki/Inductance#Self-inductance_of_...


Not just inductance, but capacitance between the layers themselves too, since you've got to have an insulator between them. All the fun things you get in 2D from nanometer-scale devices are now compounded in another degree of freedom. The only thing I don't think you have to work around is quantum tunneling between layers, because they'll probably still be too far apart to need that level of work.


I've been studying the Caltech and SEAforth processors lately, and they've actually inspired me to go back to school so I might one day take part in an asynchronous design.

Does anybody recommend any readings besides papers on the above?

As far as I can tell most engineers view asynchronous processors as arcane equipment only meant for the most specialized tasks.


For a view from the other side, search "Mark Horowitz asynchronous".

Also check out Ivan Sutherland's group at PSU: http://arc.cecs.pdx.edu/publications


Where are you doing an MS in CompA?


I actually don't even have a BS yet. I dropped out after doing a cost-benefit analysis and learning I could get tech jobs without a degree. It's been a bumpy road, to say the least, haha. I'm currently unemployed and looking for a gig/job before I go back to get my degree. I plan on attending OSU since it's nearby.


I hate to be the one to ask this because a clock-less CPU sounds like such a neat idea... but wouldn't it also open up a whole other world of timing attacks? (I'm very happy to be enlightened as to how it would not).


On the contrary, it should close off an entire world of power analysis timing attacks.

There probably are new internal timing attacks that could be exposed through some asynchronous CPU designs.


No it actually closes off a lot of them, especially many physical side channel attacks.


No mention of Ivan Sutherland and his Fleet Architecture?

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167...


Some more recent publications here: http://arc.cecs.pdx.edu/publications


The future is running event-triggered lambdas using async system calls in tickless kernels on clockless CPUs.

Q: Would tickless kernels benefit from running on async CPUs?


Interesting question. Thinking very abstractly, I guess that on an async CPU, software could more easily avoid polling, and be totally event driven, so we could get rid of timers/preemptive multitasking/any notion of time, and having no notion of time would make it easy to be tickless.

Of course you probably still want load-balancing between long-lived tasks, not sure how you'd handle that with a fully async system.



Modern synchronous techniques can already be low overhead; take a look at "skew-tolerant" circuits, where you can have no flip-flop setup/hold delays in the circuit path:

http://pages.hmc.edu/harris/class/e158/01/lect21.pdf

Also, big CPUs are power limited anyway. I mean, SpeedStep allows one core to run fast as long as the others are unloaded.


Skew-tolerant domino needs a multi-phase clock, and domino logic, as you know, can be quite the power hog.


What’s involved in creating simulation tools for async? Is it straightforward, or in need of research?


I can't answer your question, but here's an open-source async synthesis system with a simulator: http://apt.cs.manchester.ac.uk/projects/tools/balsa/

Last update is from 2010 so I guess it could use some research :)


Over the last 20 years, I have seen multiple attempts to commercially take advantage of clockless logic; they all disappeared.


I don't understand how you buffer the output from stage X if stage Y is running more slowly.


Stage X probably just stops until stage Y is ready; that's why there needs to be an acknowledge signal that propagates backwards through the pipeline.
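A toy simulation of that request/acknowledge idea as a four-phase handshake (single-threaded Python standing in for concurrently running gates; purely illustrative):

    # Stage X raises req when data is valid, then stalls until stage Y
    # raises ack; both lines return to zero before the next transfer.
    # In hardware both sides run concurrently; here the receiver is
    # stepped inside the sender's wait loops for simplicity.
    class Channel:
        def __init__(self):
            self.req = self.ack = 0
            self.data = None

    def receiver_step(ch, consumed):
        if ch.req and not ch.ack:
            consumed.append(ch.data)   # stage Y latches the data
            ch.ack = 1                 # "got it"
        elif not ch.req and ch.ack:
            ch.ack = 0                 # handshake complete

    def send(ch, value, consumed):
        ch.data = value
        ch.req = 1                     # "data valid"
        while not ch.ack:              # stage X stalls here while Y is busy
            receiver_step(ch, consumed)
        ch.req = 0                     # return-to-zero phase
        while ch.ack:
            receiver_step(ch, consumed)

    ch, consumed = Channel(), []
    for v in (10, 20, 30):
        send(ch, v, consumed)
    print(consumed)                    # [10, 20, 30]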


To confirm something already stated: back in the day in grad school (90s), my understanding was that clocked circuits were just too far ahead in tooling, so clockless could never catch up in the big, complex domains where it would make a difference.

One of my profs likened it to AI (at the time), which held out such promise, got tons of hype, and then reliably would totally disgrace itself every ten years...



