Though fundamentally very different, clocking with a free-running clock is an alternative in some cases. That clock's period is set by a chain of gate delays, and those delays in turn depend on the current core voltage. The end result is a clock close to optimal for the voltage applied to the core, one that adjusts even within a single SMPS charging cycle.
I have read a lot about async designs recently, and most of the research seems to have dried up around 2010.
There doesn't seem to be a consensus on how much power you can actually save with an async CPU. Clock distribution on modern CPUs (or boards?) is said to account for around 30 percent or more of overall power consumption, but on the other hand the savings do not necessarily amount to that much.
From a Technology Review article on clockless chips: The Intel "clockless prototype in 1997 ran three times faster than the conventional-chip equivalent, on half the power." Apparently that didn't make economic sense for Intel, because you'd have to re-create virtually a whole industry that is built around clocked chips.
Another Intel scientist (unfortunately I can't re-find that source) later said that the power savings of async CPUs aren't as high as claimed by their proponents.
Interestingly, Intel Chief Scientist Narayan Srinivasa left the company to become CTO at Eta Compute, which develops an asynchronous ARM Cortex-M3 microcontroller.
To quote: "Up to 96 billion operations per second" and "instantaneous power ranges from 14 microwatts to 650 milliwatts".
Let's say there are ten times fewer instructions per second (addition has O(log(word size)) delay in the average case; for the GA144 the word size is less than 32 bits, hence the assumed 10x reduction; the fastest operation in a dual-rail (1-of-2) self-synchronizing encoding is inversion, which is just a wire swap and incurs no computation whatsoever) and that we run at max power. That would be 650 mW for 1/(9.6×10^9) seconds per operation, i.e. 6.8×10^-11 joules, or 68 pJ (picojoules), per operation.
https://www.ics.forth.gr/carv/greenvm/files/tr450.pdf - page 28 lists the energy consumption of different operations. "Simple integer" includes addition, and the 32-bit variant is in the 50-80 pJ range. Taking the average gives 65 pJ, which is very close to my worst-case estimate for the GA144.
This means, in my opinion, that a self-synchronous CPU such as the GA144 is at least as efficient as a synchronous ARM CPU, and it gets very efficient sleep-mode entry/exit on top.
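For what it's worth, here is a quick script re-deriving both numbers from the figures quoted above (96 GOPS peak, 650 mW maximum power, the assumed 10x slowdown, and the 50-80 pJ range from the FORTH report); nothing here is measured, it just checks the arithmetic:

```python
# Back-of-the-envelope check of the energy-per-operation estimate above.
peak_ops_per_s = 96e9        # "up to 96 billion operations per second"
slowdown = 10                # assumed 10x fewer operations/s for addition
max_power_w = 0.650          # "650 milliwatts" instantaneous maximum

ops_per_s = peak_ops_per_s / slowdown             # 9.6e9 ops/s
energy_per_op_j = max_power_w / ops_per_s         # joules per operation
print(f"GA144 worst-case estimate: {energy_per_op_j * 1e12:.1f} pJ/op")   # ~67.7 pJ

# ARM figure from the FORTH technical report (page 28): 32-bit "simple
# integer" operations are quoted at 50-80 pJ.
arm_avg_pj = (50 + 80) / 2
print(f"ARM 32-bit simple integer, average: {arm_avg_pj:.1f} pJ/op")      # 65.0 pJ
```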
In fact, I can. I am comparing a roughly estimated cost of an addition (hence a worst-case scenario for the GA144) against the better-known cost of an addition on a regular CPU.
To judge by anything on the record, you would think nobody at Intel has much love for anything asynchronous, but it's worth keeping in mind that Intel funded and later acquired a startup called Asynchronous Design Technology, later renamed Fulcrum Microsystems, to the tune of something like $100 million all together, and then let it slip under the radar. Last I checked its domain had been taken over by a tech blog. People from Intel have also usually put in an appearance at the IEEE Asynchronous conferences over the years. I'd venture to guess they know a thing or two about async at Intel, but it's fine with them if the rest of us don't.
Fulcrum didn't slip under the radar; it was purchased by Intel in 2011 [1]. Since then, Intel's switch and router division has had its part numbers start with "FM".
The prior name of Fulcrum Microsystems was Asynchronous Digital Design, not Design Technology. They fairly quickly decided that ADD was an unfortunate acronym...
>Another Intel scientist (unfortunately I can't re-find that source) later said that the power savings of async CPUs aren't as high as claimed by their proponents.
Because all modern chips made with any consideration for power saving already use clock and power gating.
Sure, that answer is roughly how I imagine it works for clock scaling, but not for gating, which I thought was more about completely turning the clock on/off for blocks that are not being used?
In its current implementation, it does. Consider that this is Intel's first iteration of AVX-512, and don't forget that in its first iteration AVX was also plagued by performance problems. The main purpose of these first iterations is that "ordinary developers" (i.e. not only highly selected specialists who have to sign lots of Intel NDAs) can begin to develop and experiment with the new extensions. I believe Intel has grand plans for AVX-512, and its next iteration will be the one that aims to let "ordinary users" profit performance-wise from applications that use AVX-512 instructions.
> ... Apparently that didn't make economic sense for Intel, because you'd have to re-create virtually a whole industry that is built around clocked chips.
Also you'd have to invent a sales/marketing scheme as an alternative to the existing one that is based on increasing clock rates. GHz is to the PC what HP (horsepower) is to the car. That might obviously come to an end, but for now at least we have core counts.
> Also you'd have to invent a sales/marketing scheme as an alternative to the existing one that is based on increasing clock rates. GHz is to the PC what HP (horsepower) is to the car.
The GHz race has been over for a long time. Since Intel Core (and AMD Zen, I think; at least AMD Bulldozer had, in my opinion, a different design philosophy), it has been all about smarter cores that do more in fewer clock cycles. Also, since AMD Zen the "number of cores" race has regained traction. Finally, Intel in particular is trying to promote extra-wide SIMD instructions (AVX-512).
Going back to the car analogy, GHz is more like engine displacement or cylinder count: it measures an implementation detail of the engine, not output such as power (HP) or torque. I'd imagine that CPUs will be compared on benchmark scores, which is already being done now.
"One of the biggest claims to fame for asynchronous logic is that it consumes less power due to the absence of a clock. Clock power is responsible for almost half the power of a chip in a modern design such as a high-performance microprocessor. If you get rid of the clock, then you save almost half the power. This argument might sound reasonable at a glance, but is flawed. If the same logic was asynchronous, then you have to create handshake signals, such as request and acknowledgment signals that propagate forward and backwards from the logic evaluation flow. These signals now become performance-critical, have higher capacitive load and have the same activity as the logic. Therefore, the power saving that you get by eliminating the clock signal gets offset by the power consumption in the handshake signals and the associated logic."
Race conditions: now in hardware at the gate level!
A few questions
1. Could you call current SoCs asynchronous, since they not only clock different blocks at different rates, but subsections within a block also run at various rates?
2. Does a variable clock rate deliver many of the benefits of async without the complexity? In other words, how much more blood is there to squeeze from the async stone in the current world?
I doubt we'll see a competitive async chip anytime soon, but as CPUs continue to evolve perhaps we'll see the functional blocks broken up into smaller and smaller clock domains until it becomes difficult to tell the difference?
2. Quite the contrary when you talk about complexity. You have to specifically design the interfaces between different clock domains. This is hard in Verilog, at the very least. I have done it, and I know: Verilog, having no type system, cannot tell the difference between bit vectors of different sizes, let alone prevent you from operating directly on values from different clock domains. VHDL is no better (I can explain). The queues required for a more-or-less error-free design are not cheap by any measure: you have to alter your synchronous design for them, you have to add silicon for them, and they introduce an additional latency of several cycles when transmitting data.
It is the queues between clock domains that prevent splitting a synchronous design into smaller and smaller clock domains.
An async design, on the other hand, adds single-stage "queues" between computation stages, and those queues are clocked by the completion of the operation, not by some fixed clock rate, however variable it may be.
And here is some propaganda (which translates as "explanation", BTW). A ripple-carry adder in an asynchronous design has average-case complexity O(log W) (W being the word size), just like the worst-case complexity of a carry-lookahead adder. The worst-case complexities differ, of course: O(W) versus O(log W). But that worst case occurs with probability about 1/2^W. What's more, if you add words of size W where one operand has a high part H and a low part L, and H is all ones or all zeroes, then the asynchronous adder's average-case complexity becomes O(log |L|). That is exactly the situation when the same adder is used both for address computation and for general addition in a generic CPU pipeline. Given that ARM has, I believe, a 12-bit immediate operand, you get a nice speedup in address computation out of thin air, without introducing any hacks and/or optimizations.
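A quick Monte Carlo sketch of the average-case claim (my own illustration, not from the comment above; it uses the longest run of "propagate" bits as the usual proxy for when a self-timed ripple-carry adder's carries have settled):

```python
import math
import random

def longest_propagate_run(a, b, width):
    # Longest run of "propagate" positions, i.e. bits where a_i XOR b_i == 1.
    # In a self-timed ripple-carry adder this bounds how long the carry
    # chain takes to settle before completion can be signalled.
    p = a ^ b
    longest = run = 0
    for i in range(width):
        if (p >> i) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

def average_chain(width, trials=20_000):
    return sum(
        longest_propagate_run(random.getrandbits(width),
                              random.getrandbits(width), width)
        for _ in range(trials)
    ) / trials

for w in (8, 16, 32, 64):
    print(f"W={w:3d}: average carry chain ~ {average_chain(w):.2f}  (log2(W) = {math.log2(w):.2f})")
```

The averages come out close to log2(W), while the full-length O(W) chain needs nearly all propagate bits set and is correspondingly rare.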
1. No, asynchronous means something different than multiple clocks. Think of it like the difference between polling based programming and using coroutines; with multiple clock domains you have separate sections of your chip performing tasks at predefined instants in time (when your clock signal rises/when your polling loop swings around again) but with a truly asynchronous design, you simply start processing the next chunk of work when the previous chunk is finished (when the previous chunk of logic drives a signal high/when the previous coroutine finishes and control flow resumes in your coroutine). A rough coroutine sketch of this analogy follows after this comment.
2. It does deliver some benefits, but not all. Truly clockless design is desirable in some cases due to power concerns; for example the Novelda XeThru ultra-wideband radar SoCs are actually clockless, because clock distribution networks can account for 20%+ of the power consumed in chips like this. (This is what I've heard; I don't have a citation for it. The paper I quote below similarly handwaves and throws around numbers from 26% all the way up to 40%, but they don't do any analysis of their own on this.)
I've never used a clockless CPU design before, but the theoretical advantages are listed out quite nicely in this paper [0], which lists (among other things) the natural ability for the CPU to sit at idle (not executing `NOP` instructions, actually idle) when no work is available. It appears that the AMULET 3 processor (which is compared against an ARM 9 core) is competitive in power consumption, but doesn't quite stand up in performance. While still pretty impressive for a research project, this shows that we do still have quite a bit of work to do before these chips are ruling the world (if, indeed, we can scale up our tools to the point that designing these isn't just an exercise in frustration).
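Here is the rough sketch promised in point 1 (purely a software analogy, written with asyncio; the stage names and delays are made up):

```python
import asyncio
import time

# Analogy only: a "clocked" pipeline waits for the next tick even when a
# stage finishes early, while an "asynchronous" pipeline starts each stage
# the moment the previous one signals completion.

async def stage(name, work_s, t0):
    await asyncio.sleep(work_s)                     # stand-in for logic delay
    print(f"{name} done at {time.perf_counter() - t0:.2f}s")

async def clocked_pipeline(stages, clock_period):
    t0 = time.perf_counter()
    for name, work_s in stages:
        await stage(name, work_s, t0)
        await asyncio.sleep(clock_period - work_s)  # idle until the next edge

async def async_pipeline(stages):
    t0 = time.perf_counter()
    for name, work_s in stages:
        await stage(name, work_s, t0)               # proceed on completion

stages = [("fetch", 0.03), ("decode", 0.02), ("execute", 0.05)]
asyncio.run(clocked_pipeline(stages, clock_period=0.05))  # paced by the worst-case stage
asyncio.run(async_pipeline(stages))                       # paced by the actual work
```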
> Race conditions: now in hardware at the gate level!
It's already the case even in synchronous microprocessors. It's exploited by side-channel attacks based on power analysis, and it makes it very difficult to implement effective countermeasures using dual-rail protocols in hardware. You can read a bit about it in the state-of-the-art section of one of my papers [1], which will also give you some other references :).
2. Not really. You still have the possibility of varying the speed on an asynchronous CPU. When you want stuff to go faster you can raise the voltage and increase the cooling.
I remember reading, at the time of the ARM AMULET, that they tested one by cooling it with liquid nitrogen and running it at a high voltage, and got it to go faster in benchmarks than the contemporary standard ARM processor.
The biggest benefit is in "long range" IO. There you use what are called "self-clocking" signal forms.
This is why your USB peripherals don't all need to be in perfect sync, and why there is no associated penalty for frequency matching. In reality things are a bit more complex, but in general a self-clocking signal form is a must-have for such applications.
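As a concrete (if simplified) illustration of a self-clocking line code, here is a little Manchester-encoding sketch; note that USB itself actually uses NRZI with bit stuffing rather than Manchester, so this is only meant to show the general idea that the receiver recovers timing from transitions in the signal itself:

```python
# Minimal sketch of a self-clocking line code (Manchester encoding, as used
# by e.g. classic 10 Mbit/s Ethernet). Every bit contains a mid-bit
# transition, so the receiver can recover the sender's timing from the
# waveform instead of needing a shared clock.

def manchester_encode(bits):
    # IEEE 802.3 convention: 0 -> high-then-low, 1 -> low-then-high.
    out = []
    for b in bits:
        out += [0, 1] if b else [1, 0]
    return out

def manchester_decode(halves):
    # Each pair of half-bit samples contains exactly one transition.
    return [1 if (first, second) == (0, 1) else 0
            for first, second in zip(halves[::2], halves[1::2])]

data = [1, 0, 1, 1, 0]
line = manchester_encode(data)
assert manchester_decode(line) == data
print(line)  # [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
```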
As someone who is working on an asynchronous processor myself, I should remind you that asynchronous is more a design-level choice than a magic bullet that makes everything better.
In particular, in CPUs with a big centralized register file there can be significant overhead to having an asynchronous CPU.
There are certain architectural cases in which it can be a killer advantage, and other cases in which it is pretty much the only way forward (e.g. in a 3D chip it can be quite difficult to distribute a high-speed, high-quality, low-skew/low-jitter clock).
The overhead you are talking about is mainly the scoreboard: a clockless CPU cannot assume its results are ready when issuing instructions, so it needs a scoreboard of what is ready in order to issue instructions without conflicts.
But scoreboarding provides most of the gains of out-of-order execution, and it is cheap.
You may have associative memory for speculation (most probably that is what you are referring to above), but you need a scoreboard for an operation to be issued without conflicts and with its operands ready.
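A minimal sketch of the kind of per-register scoreboard being described (my own illustration; the register indices and the issue/complete interface are invented for the example):

```python
# Minimal per-register scoreboard: track which registers hold ready values,
# issue an instruction only when its sources are ready and its destination
# is not still pending, and mark the destination ready again on completion.
class Scoreboard:
    def __init__(self, num_regs):
        self.ready = [True] * num_regs

    def can_issue(self, srcs, dst):
        # All source operands available, and no earlier write to dst still
        # in flight (avoids read-before-ready and WAW conflicts).
        return all(self.ready[r] for r in srcs) and self.ready[dst]

    def issue(self, srcs, dst):
        assert self.can_issue(srcs, dst)
        self.ready[dst] = False          # result outstanding

    def complete(self, dst):
        self.ready[dst] = True           # driven by the unit's completion
                                         # signal, not by a clock edge

sb = Scoreboard(num_regs=16)
sb.issue(srcs=[1, 2], dst=3)             # r3 = r1 + r2, result pending
print(sb.can_issue(srcs=[3], dst=4))     # False: r3 not ready yet
sb.complete(dst=3)                       # adder signals completion
print(sb.can_issue(srcs=[3], dst=4))     # True
```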
Not just inductance, but capacitance between the layers themselves too, since you've got to have an insulator between them. All the fun things you get in 2D from nanometer-scale devices are now compounded in another degree of freedom. The only thing I don't think you have to work around is quantum tunneling between layers, because they'll probably still be too far apart to need that level of work.
I've been studying the Caltech and SEAforth processors lately, and they've actually inspired me to go back to school so I might one day take part in an asynchronous design.
Can anybody recommend any reading besides the papers on the above?
As far as I can tell most engineers view asynchronous processors as arcane equipment only meant for the most specialized tasks.
I actually don't even have a BS yet. I dropped out after doing a cost-benefit analysis and learning I could get tech jobs without a degree. It's been a bumpy road, to say the least, haha. I'm currently unemployed and looking for a gig/job before I go back to get my degree. I plan on attending OSU since it's nearby.
I hate to be the one to ask this because a clock-less CPU sounds like such a neat idea... but wouldn't it also open up a whole other world of timing attacks? (I'm very happy to be enlightened as to how it would not).
Interesting question. Thinking very abstractly, I guess that on an async CPU, software could more easily avoid polling, and be totally event driven, so we could get rid of timers/preemptive multitasking/any notion of time, and having no notion of time would make it easy to be tickless.
Of course you probably still want load-balancing between long-lived tasks, not sure how you'd handle that with a fully async system.
Modern synchronous techniques can already be low overhead; take a look at "slew tolerant" circuits, where you can have no flip-flop setup/hold delays in the circuit path:
To confirm something already stated: back in the day in grad school (90s), my understanding was that clocked circuits were just too far ahead in tooling, so clockless could never catch up in the big, complex domains where it would make a difference.
One of my profs likened it to AI (at the time), which held out such promise, got tons of hype, and then would reliably disgrace itself every ten years...
A RISC-V prototype achieved almost 40% power savings: https://people.eecs.berkeley.edu/~bora/Journals/2017/JSSC17-...