
Heyo, I'm a PhD student in the field. I figured I could talk about its current status.

First, here are various search terms: clockless, self-timed, delay-insensitive, latency-insensitive, quasi delay-insensitive (QDI), speed independent, asynchronous, bundled-data

There is a wide variety of clockless circuits, each making its own timing assumptions. QDI is the most paranoid, making the fewest timing assumptions. Bundled-data is the least paranoid (it's effectively clock-gating).

A clockless pipeline is always going to be slower than a clocked one and requires about 2x the area. However, clockless logic is way more flexible, letting you avoid unnecessary computation. Overall, this can mean significantly higher throughput and lower energy, but getting those benefits requires very careful design and completely different computer architectures.

Most of the VLSI industry is woefully uneducated in clockless circuit design and the tools are terribly lacking. I've seen many projects go by that make a synchronous architecture clockless, and they have always resulted in worse performance.

What this means is that it would take billions of dollars for current VLSI companies to retool, and doing so would only give them a one-time benefit. So, you probably won't see clockless processors from any of the big-name companies any time soon. What they seem to be doing right now is buying asynchronous start-ups and shutting them down.

As of the 90nm technology node, it's not possible to switch all of the transistors on a chip without lighting a fire. This means that the 2x area requirement is not much of a problem, since a well-designed clockless circuit only needs to switch 25-50% of them at any given time. Also, since 90nm, switching frequencies seem to have plateaued, with a maximum of around 10 GHz and typical speeds around 3 GHz. When minimally sized, simple clockless pipelines (WCHB) can reach at most 4 GHz, and more complex logic tends to reach around 2 GHz (in 28nm technology). Leakage current has become more of a problem, but it's a problem for everyone.
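To see why switching everything at once "lights a fire", a back-of-envelope dynamic-power estimate helps. All the constants below (transistor count, switched capacitance, supply voltage) are illustrative assumptions for a roughly 90nm-class chip, not measured values:

```python
# Rough dynamic-power estimate: P = alpha * C * V^2 * f, where alpha is the
# fraction of transistors switching each cycle. All constants are assumptions.
NUM_TRANSISTORS = 500e6      # transistor count (assumed)
GATE_CAP_F = 1e-15           # ~1 fF switched capacitance per transistor (assumed)
VDD = 1.0                    # supply voltage in volts (assumed)
FREQ_HZ = 3e9                # 3 GHz switching frequency

def dynamic_power(activity):
    """Dynamic power in watts for a given switching-activity factor."""
    return activity * NUM_TRANSISTORS * GATE_CAP_F * VDD**2 * FREQ_HZ

print(f"100% activity: {dynamic_power(1.0):.0f} W")   # every transistor switching
print(f" 25% activity: {dynamic_power(0.25):.0f} W")  # typical clockless duty
```

With these made-up numbers, full activity lands in the kilowatt range, which no package can dissipate, while the 25-50% activity of a well-designed clockless circuit brings it back toward something coolable.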

There is a horribly dense Wikipedia page on QDI, but it links to a bunch of other resources if you are curious.




> A clockless pipeline is always going to be slower than a clocked one

How come? In a clocked design, the clock has to be slow enough that all possible logic paths can finish. In a clockless one, propagation only takes as long as needed, and a shorter path can take less time, can't it?


Synchronous design tools are very good at making all of the pipeline stages have about the same logic depth, which is generally 6-8 transitions/cycle but can be much less. The fastest possible QDI circuit is a very simple, very small WCHB buffer which has 6 transitions/cycle. Most QDI logic will have 10-14 transitions/cycle.

Also, the speed of a linear pipeline is limited by the slowest stage in the pipeline, whether or not it is clockless. Clockless only helps pipeline speed when you have a complex network.


I don't think it's really fair to condemn all of asynchronous design due to the slowness of QDI. There are faster ways of doing things, like GasP, dual-rail domino done-detection, bundled data, one-sided handshaking, etc.


I think there's been a misunderstanding.

You're right, and I don't intend to condemn all of async, or even QDI for that matter :) I am doing my PhD on it, so I do think there is promise. I just think that arithmetic is better handled by Bundled-data specifically. Let QDI do the control leg-work and tack high-performance arithmetic to it.

Also, GasP is certainly faster, but it is limited to simple pipelines. That's why I like QDI: it lets me make weird circuits.

EDIT: Sorry, I got mixed up between the conversation threads... dyslexia is a thing.

I'm not saying we should condemn async or QDI, but we must recognize what it is good at and what it is not. A QDI pipeline stage may be slower, yes. So don't use it if you just want to implement a linear pipeline. But do use it if you have a complex network, because of the previously mentioned benefits. GasP and other async pipeline topologies don't have the flexibility of QDI, and there isn't really a good framework for mixing them with QDI techniques at the moment (maybe relative timing?). The power of async comes from this flexibility and the ability to avoid unnecessary computation.


Yours matches the commentary of a friend in the field from about 2000. I do hope that the end of Moore sees improvement in this sort of design, and the tools required.

I do see some high speed low power networking hardware moving this way: Router Using Quasi-Delay-Insensitive Asynchronous Design

https://dl.acm.org/citation.cfm?id=2634996&preflayout=flat


Yep, I saw that go by, and a lot of my work is heavily influenced by Andrew Lines. But what I've been seeing is that QDI is really bad at arithmetic because the acknowledgement requirements turn XORs into a nightmare hairball of signal dependencies. But QDI is really good at complex control.


That's not true for all approaches: consider bundled data, dual-rail domino QDI, and various commercial groups like Wave Computing and ETA Computing, which have their own asynchronous flavors, often optimized for arithmetic operations.


I was specifically talking about dual-rail domino QDI. When you compare the typical dual-rail domino QDI adder found in Andrew Lines' thesis against a typical clocked carry-lookahead adder like Kogge-Stone, it is worse by a factor of 2 to 3 in energy, area, and throughput.
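For reference, the Kogge-Stone scheme mentioned above computes all carries with a logarithmic-depth parallel-prefix network over generate/propagate pairs. A minimal bit-level sketch of the recurrence (in Python, purely to show the math, not the dual-rail circuit):

```python
def kogge_stone_add(a, b, width=8):
    """Unsigned add mod 2**width via a Kogge-Stone parallel-prefix carry network."""
    ai = [(a >> i) & 1 for i in range(width)]
    bi = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(ai, bi)]   # generate: bit produces a carry
    p = [x ^ y for x, y in zip(ai, bi)]   # propagate: bit passes a carry along
    s = p[:]                              # sum bits need the original propagate
    d = 1
    while d < width:                      # log2(width) prefix stages
        # Combine (g, p) pairs at doubling distances; both comprehensions
        # read the pre-stage lists before reassignment.
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(width)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(width)]
        d *= 2
    # After the prefix, g[i] is the carry OUT of bit i, so the carry INTO
    # bit i is g[i-1] (zero for bit 0).
    carry = [0] + g[:width - 1]
    return sum((s[i] ^ carry[i]) << i for i in range(width))

print(kogge_stone_add(100, 55))  # 155
```

The dependency hairball the parent comment describes comes from acknowledging every one of those XOR-heavy propagate signals in dual-rail form; the prefix math itself is the easy part.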

Bundled data is a simple control circuit with data clocked from that control. It's very much keeping arithmetic away from the QDI circuitry.

Though to be fair, I haven't seen a good examination of how pass transistor logic might affect QDI arithmetic circuitry, so maybe there is hope.


Interesting analysis... I wonder if "locally clocked" ALUs could help with that sort of thing. Clockless shouldn't have to be a purity test :^)


Hence my thesis :)


That's more or less GALS ("globally asynchronous, locally synchronous").


> What they seem to be doing right now is buying asynchronous start-ups and shutting them down.

There's a plan!


Investment Program for Embracing Innovation (IPEI) is best-spent capex.


If we're looking at a post-Moore's Law interregnum in process improvements before we find some new substrate, then that might be a good opportunity to explore things like clockless designs, since efforts can take longer to pay out.


And the truth is that we are. Here is a bunch of data on Intel's processes that I've pulled together from various public domain sources. The yield curves are a very rough estimate, so take them with a grain of salt.

https://www.nedbingham.com/intel_max_transistor.png

https://www.nedbingham.com/intel_switching_frequency.png

https://www.nedbingham.com/intel_tdp.png

https://www.nedbingham.com/intel_gate_delay.png


> Overall, this can mean significantly higher throughput and lower energy, but getting those benefits requires very careful design and completely different computer architectures.

What effect will this have on our programming languages and programming idioms? To some extent, our low-level programming languages have influenced CPU design, and vice versa, but it's not clear what effect an architectural change like this would have.


Ideally, none. We're leaning toward FPGAs and CGRAs to accelerate tight inner loops. This means it will have a huge effect on compilers: they will have to compile from a control-flow behavioral description like C to a dataflow description to map onto the array. This compilation process is honestly not solved; it's why you have Verilog instead of just compiling C to circuitry. I've taken a crack at it (in the form of QDI circuit synthesis from CHP), and every sub-problem is either NP-hard or NP-complete.

Though all of this is assuming we solve the memory bottleneck... which... might come about with upcoming work on 3D integration and memristors? who knows.


>every sub-problem is either NP-hard or NP-complete.

Did you check if they're FPT? (Fixed Parameter Tractable)


So this is mostly challenging for imperative languages? Might a dataflow-heavy language, like a functional one, be easier to compile?


The challenge isn't in extracting a dataflow graph from an imperative, control-flow format; compilers are already capable of doing this, and every major compiler builds an expression DAG at some point. The challenge is in mapping from a generic dataflow graph onto the actual dataflow hardware primitives, which have their own constraints and may require non-trivial mappings.
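As a tiny illustration of the "easy" half, extracting a shared expression DAG from straight-line code is essentially value numbering. The representation below (tuples of destination, operator, operands) is an invented toy, not any real compiler IR:

```python
# Minimal value-numbering sketch: turns straight-line expressions into a DAG
# by sharing identical subexpressions. The tuple IR here is illustrative.
def build_dag(exprs):
    """exprs: list of (dest, op, arg1, arg2) tuples, SSA-like.
    Returns (nodes, value_of): the sharing table and each variable's node id."""
    value_of = {}   # variable name -> node id
    nodes = {}      # (op, operand, operand) -> node id (the sharing table)

    def node_for(name):
        # A name is either a previously computed node or a leaf input.
        return value_of.get(name, ("leaf", name))

    for dest, op, a, b in exprs:
        key = (op, node_for(a), node_for(b))
        if key not in nodes:            # reuse an existing node if seen before
            nodes[key] = len(nodes)
        value_of[dest] = nodes[key]
    return nodes, value_of

# x = a+b; y = a+b; z = x*y  ->  the a+b node is built once and shared.
nodes, vals = build_dag([("x", "+", "a", "b"),
                         ("y", "+", "a", "b"),
                         ("z", "*", "x", "y")])
print(len(nodes))  # 2 distinct operation nodes
```

The hard part the parent describes starts after this: deciding how those DAG nodes get placed onto hardware primitives with capacity, fan-in, and routing constraints.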


Register allocation is also NP-complete, but we have plenty of efficient approximations that work well enough in practice: linear scan for fast compilation, or graph coloring and puzzle solving for slower but more efficient compilation. Is there a comparable tradeoff available in this case?



