Computers Without Clocks – Ivan Sutherland (2002) [pdf] (virginia.edu)
118 points by dang on April 4, 2016 | 56 comments



Steve Furber (a principal designer of the BBC Micro, and co-designer of the original ARM CPU) headed up a team at Manchester University that designed an asynchronous version of the ARM CPU, called AMULET.

Details: http://apt.cs.manchester.ac.uk/projects/processors/amulet/AM... https://en.wikipedia.org/wiki/AMULET_microprocessor


"First, asynchrony may speed up computers. In a synchronous chip, the clock’s rhythm must be slow enough to accommodate the slowest action in the chip’s circuits. If it takes a billionth of a second for one circuit to complete its operation, the chip cannot run faster than one gigahertz."

Haven't read the whole article yet but pipelining was made specifically to address this exact problem.

Synchronous circuits also have nice, well-understood ways of dealing with metastability. Merging different clock domains is already a nightmare, and I would love to know how they plan on solving similar issues.


The thing is that pipelining costs latency. It's great if you're making something where all the inputs arrive at the input stage at the beginning and the results come out of the output stage at the end. It's not so good if you want to make something like a CPU, where the output of one instruction is the input to another, which can result in pipe bubbles. Clock speed, latency (pipe stages), etc. are tradeoffs: for meaningful benchmarks one wants to maximise instructions-per-clock times clock speed.

Merging arbitrary clock domains is an understood problem; simply put, we know it can't be done with perfect reliability, so one simply has to make it "reliable enough". I built a graphics controller once where we did the math on synchroniser failure and decided that we were more reliable than Win95 by 2 orders of magnitude, and that that would be good enough ...

Async stuff tends to be clocked stage to stage at the local level, so that data generates its own clock equivalent when it's done (a 'done' signal).


I think Ivan Sutherland may be aware of pipelining and other issues.


Sure, but it seems disingenuous to throw up a straw man like that rather than address how this stacks up against something that's pipelined properly.

AFAIK you're still going to be constrained by propagation delay either way.

I totally understand the switching power argument but less so the performance argument.


To make things very concrete, imagine that your pipeline has an execute stage where various operations get executed. Say you have operations for:

1. bitwise AND, which is extremely fast because each bit of the answer is just the AND of the corresponding bits of the inputs.

2. ADD, which is still fast but is definitely slower than the AND: each result bit depends not only on the corresponding input bits, but also on the earlier bits (to propagate carries).

In some barbarically simple timing model:

- Each result bit for the AND might take a single gate delay (because it is a single AND gate), but

- The highest result bit for the ADD might take (say) 10 gate delays.

You'll also need a bit of logic to choose between the above computations depending on the opcode. Let's suppose this selection logic adds another 5 gate delays.

Long story short: when executing an AND, the whole result is ready after 6 gate delays. But when executing the ADD, some bits are not ready until 15 gate delays.

In a typical clocked design, you will need to run the clock slowly enough to accommodate the slowest such delay, i.e., even when you are executing ANDs, you're running the clock slower than 15 gate delays. The clock needs to run slow enough to accommodate any operation, and doesn't dynamically change on some kind of per-opcode basis (because that would be insanely hard to coordinate with the other pipe stages).

In contrast, in an asynchronous design, as far as I understand it, you don't have a clock at all. Instead, the result has an additional "ready" signal associated with it and, whenever the result is ready (a data dependent computation), the next stage can consume it.

Ideally this would mean your execute stage could process the AND operations in just 6 gate delays instead of having to wait 15. Ideally, it might also mean you don't need to design your pipeline quite so carefully: in a clocked design, a single slow path slows down the entire pipeline; in an asynchronous design, that one particular path may be slow, but that doesn't slow down everyone else.
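
To put rough numbers on that, here's a toy Python sketch of the timing model described above (the delays are the made-up gate-delay counts from this example, not measurements of any real design):

    # Toy timing model using the made-up gate-delay numbers above.
    SELECT_DELAY = 5                    # opcode-select logic, in gate delays
    OP_DELAYS = {"AND": 1, "ADD": 10}   # per-operation combinatorial delay

    def clocked_latency(op):
        # A clocked execute stage must budget for the slowest operation,
        # so every op effectively costs the worst-case delay.
        return max(OP_DELAYS.values()) + SELECT_DELAY

    def self_timed_latency(op):
        # A self-timed stage raises "ready" as soon as this op has settled.
        return OP_DELAYS[op] + SELECT_DELAY

    for op in OP_DELAYS:
        print(op, "clocked:", clocked_latency(op),
              "self-timed:", self_timed_latency(op))
    # AND clocked: 15 self-timed: 6
    # ADD clocked: 15 self-timed: 15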


> In contrast, in an asynchronous design, as far as I understand it, you don't have a clock at all. Instead, the result has an additional "ready" signal associated with it and, whenever the result is ready (a data dependent computation), the next stage can consume it.

You can do the exact same thing in clocked designs as well. The AND produces a "ready" signal that allows its output to skip the stages needed by the ADD side (or conversely, you can have the ADD side produce a stall signal that stops the pipeline). You can actually see this in modern processors - some instructions can take variable amounts of time depending on the instruction arguments (notably loads and stores, but also sometimes multiplies and divides).


The point is that in a synchronous design the delays have to be multiples of a clock cycle. Whenever a latency is not an exact multiple, the circuit is idle. And clock cycles cannot be arbitrarily short because of the overhead for latching etc.

Also it takes time to schedule instructions. In a simple processor with a 5 stage pipeline you can simply stall the entire pipeline, but since you stall all other instructions too, this is costly. And in a modern superscalar out-of-order processor, stall is even more expensive and you cannot reschedule all instructions for the next cycle at the end of the previous cycle because rescheduling is too complex.


Hm, I'm still not sure I understand.

> The point is that in a synchronous design the delays have to be multiples of a clock cycle.

Not really, you can pretty much always do rebalancing between stages so that you end up with a multiple of the clock. And if you can't, you can locally skew the clock to borrow time between stages.

> Also it takes time to schedule instructions. In a simple processor with a 5 stage pipeline you can simply stall the entire pipeline, but since you stall all other instructions too, this is costly.

This is an unrelated, higher-level architectural distinction from asynchronous vs synchronous. The scheduling cost doesn't go away when you use asynchronous circuits.

Also why is stalling bad here? If the circuit takes 16 gate delays on some inputs and 6 gate delays on others, it doesn't matter if we use async or sync design; a fast operation behind a slow operation is still going to wait (stall) for the operation in front of it to complete. That's just a fundamental property of in order execution (which again, isn't related to async or sync circuits)

> And in a modern superscalar out-of-order processor, stall is even more expensive and you cannot reschedule all instructions for the next cycle at the end of the previous cycle because rescheduling is too complex.

What? In a canonical OoO design, the default is stall! An instruction will only ever proceed to the next stage if its dependencies have been satisfied. When a stall happens you don't need to reschedule because the instruction won't have been scheduled in the first place!

The important part from the grandparent was this:

"Ideally, it might also mean you don't need to design your pipeline quite so carefully: in a clocked design, a single slow path slows down the entire pipeline; in an asynchronous design, that one particular path may be slow, but that doesn't slow down everyone else."

Which is true! If you do a bad job of balancing your pipeline stages (or can't balance them statically because of variation/whatever) then the single slow path slows down the entire clock. However, when you can rebalance the pipeline statically, as in the example given, there's no reason that you have to design your pipeline to wait for the slowest path.

But perhaps I misunderstood the example; let me know if I'm missing anything.


edit: Ghettoimp said something very similar, and probably better than I said it.

Say you've got a 2-stage pipeline (for simplicity). Maybe stage 1 has variable execution times, depending on the instruction being executed. Maybe stage 2 is faster than the worst case for stage 1. In all cases, the clock has to run no faster than the slowest stage of the pipeline allows, which means the circuit may sit idle for a bit of time when stage 1 completes faster than its worst case.

In an unclocked equivalent, that idle time can potentially be eliminated in the cases where it's not necessary. When stage 1 does a fast operation and stage 2 is ready to receive the result, the data can advance through the pipeline before the clock pulse would've been received in a clocked circuit. Both are constrained by propagation delay, but a clocked circuit is constrained both by propagation delay and the timing of the clock.
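
A very rough simulation of that, with made-up numbers (stage 1 takes 6 or 15 time units depending on the op, stage 2 always takes 8, and the clocked version has to tick at the worst case):

    import random
    random.seed(0)

    STAGE2 = 8                                            # fixed stage-2 delay
    ops = [random.choice([6, 15]) for _ in range(1000)]   # stage-1 delays

    # Clocked: the period must cover the slowest stage (15), so every
    # instruction costs one full 15-unit cycle in steady state.
    clocked = len(ops) * 15

    # Unclocked (very rough model): in steady state each handoff is limited
    # by whichever of the two stages is slower for that item.
    unclocked = sum(max(t, STAGE2) for t in ops)

    print("clocked:", clocked, "unclocked:", unclocked)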


At first you said pipelining was made to address the issues brought up in the paper; now you are saying you just want to see how it stacks up against pipelining. Those are two different things, so don't call it a straw man to give Ivan Sutherland the benefit of the doubt.


The straw man is the part I quoted.

He talked about slow operations that gate your chip. Pipelining explicitly addresses this. Unless their async units have better hold times than d-flops, they will both be gated by propagation delay. It's a straw man since it never mentions pipelining at all.

[edit] Not that I don't have a ton of respect for Sutherland (especially in the graphics domain), but it would be nice to see something that acknowledges other approaches.


Not quite - pipe stages have costs, both in area and flop delay. In particular, a flip-flop might have a setup time on its input and a clk->Q delay - for really fast clocks this might be close to 1/4 of your clock period.

For example, let's suppose we have a combined flop delay of 1ns and a combinatorial delay (the logic we want to calculate) of 9ns - we can clock this at 100MHz, or we can pipeline it 3 ways, splitting the combinatorial block into three 3ns chunks - each pipe stage still has a 1ns flop delay, so the total pipe stage delay is 4ns (250MHz). We split the logic in 3 but only got a 2.5 times performance increase because of the fixed costs.

Pipelining is a great tool but there is a law of diminishing returns that kicks in here
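
You can see the fixed cost eating into the gains if you keep slicing. A quick Python sketch using the same assumed numbers (1 ns of flop overhead per stage, 9 ns of logic):

    FLOP_NS = 1.0    # setup + clk->Q overhead paid by every pipe stage
    LOGIC_NS = 9.0   # total combinatorial delay being split up

    for stages in (1, 3, 9):
        period = FLOP_NS + LOGIC_NS / stages
        speedup = (FLOP_NS + LOGIC_NS) / period
        print(f"{stages} stage(s): {period:.1f} ns period, "
              f"{1000 / period:.0f} MHz, speedup x{speedup:.2f}")
    # 1 stage(s): 10.0 ns period, 100 MHz, speedup x1.00
    # 3 stage(s): 4.0 ns period, 250 MHz, speedup x2.50
    # 9 stage(s): 2.0 ns period, 500 MHz, speedup x5.00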


Thanks, this is the kind of reply I was hoping to get, instead of being told I can't comment because someone has an important last name.


I never said that, don't turn yourself into a victim.


But you didn't quote anything


You pipeline whether it's synchronous or asynchronous. The point of being asynchronous is to eke out better performance when only the faster portions of your pipeline are active.


Other points:

* "clock-speed" adapts automatically to gate speed, rather than be dictated (which has to be set conservatively),

* allows power savings in multiple ways (no globally propagated skew-sensitive clock, finer-grained clock gating by construction, and for NCL: wider supply range adaptability)

* the absence of a global clock by definition means less simultaneous switching, which reduces strain on power supplies (decoupling) and gives much better EMI.


I think he does - the thing is that the speed of synchronous logic is fixed and limited by its slowest logic. For example, how long a 32-bit adder takes to produce a result depends on its inputs; the worst case involves carry propagation across all 32 bits (we normally spend gates to create shortcuts here). So for some input data patterns data appears on the outputs much earlier than for others - an add with a nominal delay of 10ns might finish in 1ns for 50% of inputs in a benchmark but take the full 10ns only 2% of the time. An asynchronous design might be 5 times faster on reasonable benchmarks (and 10 times when dipped in liquid N2).
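
As a toy illustration of how data-dependent that is, here's a little Python experiment that uses the longest run of "propagate" bit positions (a_i XOR b_i = 1) as a crude proxy for how far a carry has to ripple for a given pair of inputs. This is a sketch, not a real timing simulation:

    import random
    random.seed(1)

    def longest_carry_chain(a, b, width=32):
        # Longest run of consecutive bit positions that propagate a carry.
        longest = run = 0
        for i in range(width):
            if (a >> i & 1) ^ (b >> i & 1):   # this bit position propagates
                run += 1
                longest = max(longest, run)
            else:
                run = 0
        return longest

    chains = [longest_carry_chain(random.getrandbits(32), random.getrandbits(32))
              for _ in range(10000)]
    print("worst case: 32 bits; average longest propagate run:",
          round(sum(chains) / len(chains), 1))
    # The average comes out at just a handful of bits, far below the
    # 32-bit worst case that a fixed clock has to budget for.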


Could you give a bit more information to non EE experts like me:

- What do you mean by pipeline ?

I try to make an analogy with instruction pipelining, which can increase your throughput but doesn't fix the issue that your CPU has a fixed clock rate.

- What is metastability?

- Why are you mentioning merging clocks, since each asynchronous circuit is clockless?

EDIT: Thanks for all the replies!


It takes a certain amount of time for a signal to propagate through a series of logic gates (or other electronic components) within a chip, and that time depends on many other factors as well. In most synchronous chip design, you look at the worst (slowest) timing case for the design and constrain your clock speed to that.

You can break up critical (the longest/slowest) paths of a design through pipelining, which can be done manually, or through nice automated techniques like register retiming. Basically, you can add flops (as in D-flip flops, also known as registers) between sections of the design that can be broken into independent pipelined components.

Example:

Say you have a design that takes 10ns from start to end flops. This means the max clock speed for that component is 100MHz. If you are clever, you may be able to dice it up into 10 separate components which are pipelined, meaning that while there is a 10-cycle startup latency, with continuous throughput you can run the design at up to 1GHz. Even better, nowadays synthesis tools can do automatic pipelining through something called register retiming. Without doing any work, you can tell the synthesis tool what clock speed you want to run at (or how many cycles you want in your pipeline), and it is able to automagically insert flops to improve timing for the overall design.
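
Rough numbers for that 10ns example, ignoring the per-stage flop overhead that another commenter brings up (note the latency through the pipeline stays about the same; it's the throughput that scales):

    TOTAL_NS = 10.0   # end-to-end combinatorial delay of the original design

    for stages in (1, 2, 5, 10):
        period = TOTAL_NS / stages
        print(f"{stages:2d} stages: {1000 / period:4.0f} MHz clock, "
              f"first result after {TOTAL_NS:.0f} ns, "
              f"then one result every {period:.0f} ns")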


I'm not an EE expert either :)

Pipelining in circuit design means taking one "large" operation like the one quoted and breaking it down into a series of pipeline-able steps. Then the longest stage of your pipeline becomes the slowest path. So if you can break your instruction pipeline up into 4 stages, you can run at a clock speed 4x faster without hitting propagation limits.

Wikipedia covers metastability pretty well: https://en.wikipedia.org/wiki/Metastability_in_electronics

Basically any logic gate can act as an oscillator if setup or hold timing is violated. It will bounce from zero to one and no guarantee can be made about the final value. Synchronous designs reduce the probability of this to near zero (but not completely); you can add successive synchronizer stages to make it less and less probable. Basically anything that talks to the real world has a chance to screw up, and it's only statistics that keep it from happening.
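
For the curious, the usual back-of-the-envelope model for how unlikely you can make it is MTBF = exp(t_resolve / tau) / (T_w * f_clk * f_data). A sketch with made-up constants (real values come from characterising the flop in your process):

    import math

    TAU = 25e-12     # resolution time constant (s), made up for illustration
    T_W = 100e-12    # metastability window (s), made up
    F_CLK = 1e9      # 1 GHz sampling clock
    F_DATA = 50e6    # toggle rate of the asynchronous input

    for n_flops in (2, 3, 4):
        # an n-flop synchronizer gives roughly (n - 1) clock periods for the
        # first flop's output to resolve before it is sampled again
        t_resolve = (n_flops - 1) / F_CLK
        mtbf = math.exp(t_resolve / TAU) / (T_W * F_CLK * F_DATA)
        print(f"{n_flops}-flop synchronizer: MTBF ~ {mtbf:.1e} s")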

Looks like the Arbiter from the article is their solution, although they never explicitly mention metastability: https://en.wikipedia.org/wiki/Arbiter_%28electronics%29

Interesting, but it has some gnarly implications when it hits a metastable state (10x slower).


A pipeline means doing an operation in little bits, each in 1 clock time - at the cost of extra latency - so a slow combinatorial function might be split into 3 pipe stages each doing 1/3 of the function with data arriving 3 clocks later

Metastability is what can happen if you change data at the instant (or close to the instant) that a synchronous flip-flop is clocked - the resulting value that's stored is neither a 1 nor a 0; instead the storage element ends up oscillating at a high frequency. This little bit of evil can infect subsequent logic stages, resulting in a chip that's a horrible hot buzzy mess of crud.


http://www.greenarraychips.com/ these are an interesting example of clockless chips. Not sure how they work.

https://users.soe.ucsc.edu/~scott/papers/NCL2.pdf This sort of circuit appeals to me a lot. Multiple rail encoding, where every single gate has a hysteresis threshold before it can change its output. Pipeline stages start out dark, and gates light up as data flows in. There are no inverters inside a stage; gates only go from low to high. Once stage N+1 is done calculating, an inverted ack signal cuts off the input to stage N and it goes dark again.
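
The hysteresis behaviour is easy to sketch. Here's a toy Python model of such a threshold gate (a THmn gate in NCL terms; the names and structure are just for illustration): the output asserts once m of its n inputs are high, and only de-asserts when all of them have returned low.

    class ThresholdGate:
        """Toy THmn gate: fires when m of its n inputs are DATA (1),
        resets only once all inputs have returned to NULL (0)."""
        def __init__(self, m, n):
            self.m, self.n = m, n
            self.out = 0            # stages start out "dark" (NULL)

        def step(self, inputs):
            high = sum(inputs)
            if high >= self.m:
                self.out = 1        # enough data arrived: light up
            elif high == 0:
                self.out = 0        # only a full return to NULL resets it
            # otherwise hold the previous value: that's the hysteresis
            return self.out

    # A TH22 gate (2-of-2) is a Muller C-element: both inputs must rise,
    # then both must fall, before the output changes.
    g = ThresholdGate(2, 2)
    print([g.step(x) for x in [(0, 0), (1, 0), (1, 1), (1, 0), (0, 0)]])
    # -> [0, 0, 1, 1, 0]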


A better link might be http://www.theseusresearch.com/NullConventionLogic.htm

Sutherland's micropipelines and most (all?) of the other clockless approaches are fundamentally racy and depend on a difficult timing analysis to determine that the latch is slow enough. What makes NCL so interesting IMO is that it is guaranteed to work timing-wise by construction. This also means that it is tolerant of changes in logic timing, which means its circuits can tolerate a wider range of voltage swings (= can save power). (The gate construction has to satisfy a trivial timing requirement, but it's local to the gate, not the complete circuit.)

The obvious drawback of NCL is that it uses quite a few more transistors than the equivalent circuit in a traditional clocked implementation, and tooling is weak or non-existent.

Karl and his student Matthew presented "Aristotle – A Logically Determined (Clockless) RISC-V RV32I" at the 2nd RISC-V workshop. Slides & Video: http://riscv.org/2015/07/2nd-risc-v-workshop/ I'm not sure of the status of that.


How well GA chips work I don't know, but I feel like at some point Chuck Moore said that if you needed a clock you could just keep passing a bit or something from core to core and use that to keep things synchronized. Which I'm sure works great if you're Chuck Moore.


They used to have a website talking about the FLEET architecture, but it seems to have been taken down. Here it is on archive.org: https://web.archive.org/web/20120227072220/http://fleet.cs.b...

Edit: the page cited above has these links, but I should explicitly call out the slides they call the best introduction to Fleet [1], and a page full of memos [2]

[1] https://web.archive.org/web/20120227072220/http://fleet.cs.b...

[2] https://web.archive.org/web/20120227072220/http://fleet.cs.b...


FWIW, as I recall this was FLEET's fatal flaw (part of the communication discussion):

* This can cause deadlock

* Programmer must keep input dock fifos from overflowing

Sun did a lot of work with async logic in the SPARC 10 - it was written up in IEEE Spectrum, I believe - and one of the things that is always a problem is that fabrics without flow control (back pressure or emission control) are subject to failure at the worst possible time.


The one question I have about asynchronous chips since I studied them at my undergrad is: How does one sell them?

Selling a clocked processor is easy. One tests against a finite set of clock speeds, and marks the chip with the fastest one that works. People buy the chip, run it at the tagged clock, and get predictable performance.

Now, make it a batch of asynchronous processors. Each chip you make will have different performance - one will add some floats faster, another will run fetches faster (but only if the second bit of the address is set), while a third one will shine at integer addition but completely suck at subtraction (due to a problem in a single transistor).

How does one tag those chips?


I have an asynchronous CPU cluster on my desk right now (a pair of GA144s). The performance spread isn't actually very dramatic; just a few percent. After all, the foundries aim for consistency so that synchronous devices get good yields.


You can have either consistency or high performance, not both.

Your batch is consistent because the foundry you bought from isn't pushing the envelope for performance. The latest Intel or AMD chips don't have this level of consistency.


Bottom up: Create a suite of tests (you'll need them for development and verification anyway), measure the performance, tag it with a number based on how fast it finished. If more chips support the same set of instructions, talk to the company and release standardised tests which now everyone else will "need" to support.

Top down: It can copy bytes in memory at X MB/s, do AES at Y MB/s, generate RSA keys at Z/s, ...

> Each chip you make will have a different performance [...]

Sure. You can test them and if they benchmark below 95% of the expected numbers, throw them out (or split out and sell cheaper).


The entire problem is that there isn't a single linear measure of performance. It is the same problem one has comparing different CPU designs, but now it applies to every single chip you make.

They can certainly be clustered. But what kind of performance are you buying when you get a $model? Yes, you have another $model on your desk to compare, but the new one does not have the same performance at all. What if the one you already have is a fast one? Then you can not expect the new one to be as fast, and may prefer buying from some other manufacturer.

EDIT: To put it shorter: How do you promise you the chip I'm selling has at least some performance X when I don't have any chip with performance X for you to measure and see what it means?


To respond to your edit - tell me something I know about. Like I posted before, how many AES blocks can it decode per second? How many NxM matrices can it multiply per second? How quickly does it match in a kd-tree? Even tell me that it runs Quake at x fps, etc. An abstract number (overall performance is 9001) is actually what the customer cares about the least.

I'm going to know what my workload is, or what to compare it to. If you don't know what your workload is, then you're likely a general computer use customer and don't have specific requirements.


I don't think that's so much different than what we have right now. Processors have different core counts, different buses, different feature sets, different speeds on those features, and that's still before we get into patching the microcode. Sure, the differences may be in more basic operations, but then we'll just have benchmarks which expose those numbers instead.


No they don't.

When you go to Amazon and order an i3 $generation, you know exactly how many cores it will have, and what performance each core has. Every single one of those chips with the same tag has the same performance.


By benchmarking them, like GeekBench does. And those benchmarks are the only thing interested parties rely on, not the marketed speeds.


I've always been fascinated by async circuits but don't know how state of the art has progressed since the early 2000s. Would any EEs be willing to comment?


I'm not an EE, but my dad was the co-author on this paper, if you have a couple of questions for him, I could pass them along if you'd like.


I am curious about how it turned out after 13+ years. Any serious road blocks in theory or practice?


There are a number of technologies that just aren't worth pursuing until the "normal progress" slows down. Transmeta, for instance, arguably died because while they produced a superior chip, by the time they could ship it they were basically tied with what Intel was putting out anyhow.

Asynchronous chips is an example of the sort of thing I expect to start hearing about again when we run out of die shrinks. Which we're getting pretty close to, probably. (Another example is "active RAM" where the RAM sticks can do some sort of computation. Also something like the greenarray chips [1]... while they're trying to compete with normal growth it's hard for a tiny company to get traction.)

[1]: http://www.greenarraychips.com/


>> There are a number of technologies that just aren't worth pursuing until the "normal progress" slows down.

I don't know. The field of low-power microcontrollers doesn't really benefit much from scaling, since sleep current increases as transistor size decreases. And they are relatively simple circuits (with low-cost development) but still a huge market, so it's an ideal place to try a new development methodology.

And yes, some have tried, but it's not being used today, so it probably failed.


The tools are a big obstacle. The industry is built around synchronous design. How are you going to time your circuit? Verify it? Etc.

It's a really big chunk of work to bite off, even with a "little" microcontroller.

We might see it one day, but as best I can tell things like sleep states are still a big focus, as they can save orders of magnitude power, instead of a few percent.


A great 20-minute talk from Rajit Manohar of Cornell, talking about self-timed (or asynchronous) circuits and their use in neuromorphic chips.

https://www.youtube.com/watch?v=AVrJRPL-e0g

Async is a perfect design style for these kinds of event driven chips, since you really don't need to run a fast clock if most of the time the circuits aren't computing anything ...


Oh man, this article was one of the first things I ever read about computer architecture when I was about 14. I had no idea Ivan Sutherland was the author until just now. It really stuck with me - I recall the bucket brigade illustration quite vividly whenever I think about asynchronous CPUs.


It intrigues me how similar the internals of a CPU and the workings of a network are.



I thought most of the benefit can also come from skew tolerant circuit design:

http://www.cerc.utexas.edu/~jaa/vlsi/lectures/23-2.pdf

For example, in a pipeline "allowing a slow stage to “borrow” from the time normally allocated to a faster stage"


Intel used an asynchronous technique in their Pentium-4 processors. You may recall that the internal core ALUs ran at 2x the frequency of the rest of the chip. This was done with self timed domino circuits.

These are notoriously difficult to get working.


> The technological trend is inevitable: in the coming decades, asynchronous design will become prevalent.

I wonder if this statement is right on time or still decades away.


Since it was stated just over a decade ago, the complete lack of change until today does not invalidate it yet.


I got this from https://news.ycombinator.com/item?id=10328784. Thanks vmorgulis!

If you haven't heard one of Alan Kay's many explanations of Sutherland's seminal Sketchpad work ("a Newton-like leap"), here's a wonderful one: https://www.youtube.com/watch?v=TY-hBgYLJqc#t=46m30s. Note the reference to Wes Clark, the pioneering system designer who died recently (https://news.ycombinator.com/item?id=11183970). Clark liked Sutherland and gave him computer time in the middle of the night, which is how the Newton-like leap came to be.


There's way too much Kay content online nowadays. Thanks for the tip. It's cool to see him rant about the forgotten wonders on stage; it's a different thing to see him look around like a kid when describing Sketchpad 'face to face'.


I got to meet him last week and couldn't resist gushing about how much I've learned from him. He seemed embarrassed. I couldn't help it—there's no one who's influenced me more in computing. He's agreed to do an AMA on HN, so hopefully we can set that up soon.

If you get beyond its terrible sound quality, that YouTube video has many stretches of Alan riffing that are pure gold. He embodies the history of our field and the values of the classic ARPA community culture. Much of that precious stuff is encoded in oral culture that we don't have a good way of continuing. I wish we could find a way for HN to facilitate that. It already does, to a small extent. But we need more than just to capture it as history, we need to carry it on, and I don't see that happening.


Wave pipelining is also a technique you could use to run critical datapath circuits synchronously without using clocks. It saves space as well, by eliminating pipeline flip-flops.


It's very, VERY difficult to do this over a reasonable range of process corners, and it limits you to a single, carefully chosen clock speed.



