nbingham's comments

I'm working on a compiler for asynchronous circuits. Once I have modules, placement, and routing working, I'll have an MVP. Hopefully, this will allow people without any computer engineering expertise to make chips. For now it has a couple of useful tools.

https://github.com/broccolimicro/loom


I tested two C++ implementations of Calendar Queues against the standard library priority_queue. The first implementation uses a deque as a backing container and then fills the calendar with linked lists. The second implementation just uses vectors in the calendar with no backing container.

The calendar queues have an interesting failure mode. If you repeatedly insert elements with priorities less than "now" and then pop some number of elements, the front of the calendar empties out. A random insertion then leaves a single value near the front of the calendar followed by many, many empty days, so subsequent pops have to search through a lot of days. As a result, these calendar queues are only fast if you keep track of "now" and only insert events after it.

https://gist.github.com/nbingham1/611d37fce31334a1520213ce5d...
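
For anyone unfamiliar with the data structure, here is a deliberately simplified sketch of the linked-list variant (fixed day width and bucket count, no resizing). It is not the code from the gist, but it shows the forward walk over empty days that causes the slowdown described above.

    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <list>
    #include <vector>

    // Simplified calendar queue: linked-list buckets ("days"), fixed day
    // width and bucket count, no resizing. Illustrative only.
    class CalendarQueue {
        std::vector<std::list<double>> days;
        double width;       // duration of one day
        double bucket_top;  // end time of the day currently being searched
        std::size_t day;    // index of that day
        std::size_t count = 0;

    public:
        explicit CalendarQueue(std::size_t ndays = 64, double w = 1.0)
            : days(ndays), width(w), bucket_top(w), day(0) {}

        bool empty() const { return count == 0; }

        void push(double t) {
            auto& b = days[static_cast<std::size_t>(t / width) % days.size()];
            auto it = b.begin();
            while (it != b.end() && *it < t) ++it;  // keep each day sorted
            b.insert(it, t);
            ++count;
        }

        // Precondition: !empty().
        double pop() {
            for (std::size_t scanned = 0;; ++scanned) {
                auto& b = days[day];
                if (!b.empty() && b.front() < bucket_top) {  // event due this pass
                    double t = b.front();
                    b.pop_front();
                    --count;
                    return t;
                }
                if (scanned >= days.size()) {
                    // Searched a whole "year" of days with nothing due: fall
                    // back to a direct search for the earliest event and jump.
                    double tmin = std::numeric_limits<double>::infinity();
                    for (const auto& d : days)
                        if (!d.empty() && d.front() < tmin) tmin = d.front();
                    day = static_cast<std::size_t>(tmin / width) % days.size();
                    bucket_top = (std::floor(tmin / width) + 1.0) * width;
                    scanned = 0;
                    continue;
                }
                // This walk over (possibly many) empty days is the slow path
                // triggered by inserting events with priorities before "now".
                day = (day + 1) % days.size();
                bucket_top += width;
            }
        }
    };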

seed 1725545662 priority_queue 0.644354 calendar_queue 0.215860 calendar_queue_vector 0.405788

seed 1725545667 priority_queue 0.572672 calendar_queue 0.196812 calendar_queue_vector 0.392303

seed 1725545672 priority_queue 0.622041 calendar_queue 0.241419 calendar_queue_vector 0.413713

seed 1725545676 priority_queue 0.590372 calendar_queue 0.204428 calendar_queue_vector 0.386992


We ended up with digital logic because of noise. Hysteresis from digital gates was the only way to build such large systems from such small devices without all of our signaling turning into a garbled mess. Analog processing has its niches, and I suspect the biggest niche will be where that noise is a feature rather than a hindrance, something like continuous-time neural networks.


Neuromorphic hardware is an area where I encountered analogue computing [1]. Biological neurons were modeled by a leaky integration (resistor/capacitor) unit. The system was about 10^5 times faster than real time (too fast to use for robotics) and consumed little power, but it was sensitive to temperature (much like our brains). If I recall correctly, the technology has been used at CERN, as digital HW would have required clock speeds that were too high. I have no clue what happened to the technology, but there were other attempts at neuromorphic, analogue hardware. It was very exciting to observe and use this research!

[1] https://brainscales.kip.uni-heidelberg.de/

edit: newer link: https://open-neuromorphic.org/neuromorphic-computing/hardwar...
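
For anyone who hasn't seen the model, here is a minimal forward-Euler sketch of a leaky integrate-and-fire neuron in C++. All of the constants are illustrative placeholders, not BrainScaleS parameters.

    #include <cstdio>

    // Forward-Euler simulation of a leaky integrate-and-fire neuron: an RC
    // membrane that leaks toward rest and integrates a constant drive.
    int main() {
        const double dt    = 0.1;   // time step [ms]
        const double tau   = 20.0;  // membrane time constant R*C [ms]
        const double v_th  = 1.0;   // spike threshold
        const double v_rst = 0.0;   // reset potential
        const double drive = 1.2;   // effective input R*I; above threshold, so it spikes

        double v = v_rst;
        for (int step = 0; step < 1000; ++step) {
            v += dt * (drive - v) / tau;  // tau * dv/dt = -(v - v_rest) + R*I, with v_rest = 0
            if (v >= v_th) {
                std::printf("spike at t = %.1f ms\n", step * dt);
                v = v_rst;                // reset after the spike
            }
        }
        return 0;
    }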


I worked on a similar project, the Stanford Braindrop chip. It's a really interesting technology; the trouble is that most people don't really know how to train those systems. There's a group called Vivum that seems to have a solution.

I work with QDI systems, and I've long suspected that it would be possible to use those same design principles to make analog circuits robust to timing variation. QDI design is about sequencing discrete events using digital operators - AND and OR. I wonder if it is possible to do the same with continuous "events" using the equivalent analog operators, mix and sum.
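
To make that concrete, here is a small behavioral sketch in C++ (not a netlist): dual-rail data whose completion is detected with an OR, plus the Muller C-element, the state-holding gate that QDI designs typically use alongside AND and OR for sequencing. The C-element isn't mentioned above; that part is my addition.

    // Dual-rail encoding of one bit: exactly one rail high means valid data,
    // both rails low means neutral (no token present).
    struct DualRail {
        bool t = false;  // "true" rail
        bool f = false;  // "false" rail
        bool valid()   const { return t || f; }    // completion detection is an OR
        bool neutral() const { return !t && !f; }  // no token present
    };

    // Muller C-element: output rises when both inputs are high, falls when
    // both are low, and otherwise holds its previous value.
    struct CElement {
        bool q = false;
        bool update(bool a, bool b) {
            if (a && b)        q = true;
            else if (!a && !b) q = false;
            return q;          // holds state when the inputs disagree
        }
    };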


We got some nice results with spiking/pulsed networks, but the small number of units limits the applications, so we usually ended up in a simulator or using more abstract models. There seems to be a commercial product, but it also has only 0.5K neurons. That might be enough for 1D data processing, though, and it fills a good niche there (1 mW!) [2]

[2] https://open-neuromorphic.org/neuromorphic-computing/hardwar...


> the Stanford Braindrop chip

Interesting.

Just read about it, and there are familiar names on the author list. I really wish this type of technology would gain more traction, but I am afraid it will not receive the focus it deserves, considering the direction of current AI research.

> I wonder if it is possible to do the same with continuous "events" using the equivalent analog operators, mix and sum.

Don't know much about QDI but sounds promising.


I actually worked for a startup that makes tiny FPAAs (Field Programmable Analog Arrays) for the low-power battery market. Their major appeal was that you could reprogram the chip to synthesize a real analog network and offload the signal processing portion of your product for a tiny fraction of the power cost.

The key thing is that analog components are influenced much more by environmental fluctuations (think temperature, pressure, and manufacturing variation), which affects the "compute" of an analog network. Their novelty was that the chip can be "trimmed" to offset these effects using floating-gate MOSFETs, the same devices used in flash memory, as an analog offset. It works surprisingly well, and I suspect that if they can capture the low-power market, we'll see a revitalization of analog compute in the embedded space. It would be really exciting to see this enter the high-bandwidth control system world!


Can you share their name and do they have a consumer product already?


Well it's not just about noise. There is also loss. It's easier to reconstruct and/or amplify a digital signal than an analogue one. Also it's easier to build repeatable digital systems that don't require calibration compared to analogue ones.

Worth noting the exception here which is current loops.


Also density and speed. And precision.

This is a rare misfire from Quanta. No, there is no practical way to model anything non-trivial - especially not ML - with analog hardware of any kind.

Analog hardware just isn't practical for models of equivalent complexity. And even if it was practical, it wouldn't be any more energy efficient.

Whether it's wheels and pulleys or electric currents in capacitors and resistors, analog hardware has to do real work moving energy and mass around.

Modern digital models do an insane amount of work. But each step takes an almost infinitesimal amount of energy, and the amount of energy used for each operation has been decreasing steadily over time.


> No, there is no practical way to model anything non-trivial - especially not ML - with analog hardware of any kind.

Are you aware that multiple companies (IBM, Intel, others) have prototype neuromorphic chips that use analog units to process incoming signals and apply the activation function?

IBM's NorthPole chip has been provided to the DoD for testing, and IBM's published results look pretty promising (faster and less energy for NN workloads compared to Nvidia GPU).

Intel's Loihi 2 chip has been provided to Sandia National Laboratories for testing, with presumably similar performance benefits to IBM's.

There are many others with neuromorphic chips in development.

My opinion is that AI workloads will shift to neuromorphic chips as fast as the technology can mature. The only question is which company to buy stock in; I'm not sure who will win.

EDIT: The chips I listed above are NOT analog, they are digital but with alternate architecture to reduce memory access. I've read about IBM's test chips that were analog and assumed these "neuromorphic" chips were analog.


Rajit has since moved to Yale and has made quite a bit of progress on the tooling problems.

https://github.com/asyncvlsi/actflow

https://avlsi.csl.yale.edu/act/doku.php

https://avlsi.csl.yale.edu/chips.php

There's also a bit of documentation up on Wikipedia now.

https://en.wikipedia.org/wiki/Quasi-delay-insensitive_circui...



> This article is 17 years old and I haven't seen any clockless designs in my professional experience in all of that time.

Yeah, async design takes a while, and async chips don't tend to be well advertised, but they are there.

Async FPGA with 60% less power and 70% higher throughput: http://csl.yale.edu/~rajit/ps/fpga2p.pdf

High-speed routing (from Fulcrum, one of the startups bought by Intel and shut down): https://www.hotchips.org/wp-content/uploads/hc_archives/hc15...

Ultra-low-power processor: https://ieeexplore.ieee.org/abstract/document/1402056/

Ultra-low-power neural network accelerator from IBM: https://www-03.ibm.com/press/us/en/pressrelease/44529.wss


Heyo, I'm a PhD student in the field. I figured I could talk about its current status.

First, here are various search terms: clockless, self-timed, delay-insensitive, latency-insensitive, quasi delay-insensitive (QDI), speed independent, asynchronous, bundled-data

There are a wide variety of clockless circuits that each make their own timing assumptions. QDI is the most paranoid, making the fewest timing assumptions. Bundled-data is the least paranoid (it's effectively clock gating).

A clockless pipeline is always going to be slower than a clocked one and requires about 2x the area. However, clockless logic is way more flexible, letting you avoid unnecessary computation. Overall, this can mean significantly higher throughput and lower energy, but getting those benefits requires very careful design and completely different computer architectures.

Most of the VLSI industry is woefully uneducated in clockless circuit design and the tools are terribly lacking. I've seen many projects go by that make a synchronous architecture clockless, and they have always resulted in worse performance.

What this means is that it would take billions of dollars for current VLSI companies to retool, and doing so would only give them a one-time benefit. So, you probably won't see clockless processors from any of the big-name companies any time soon. What they seem to be doing right now is buying asynchronous start-ups and shutting them down.

As of the 90nm technology node, it's not possible to switch all of the transistors on a chip without lighting a fire. This means that the 2x area requirement is not much of a problem, since a well-designed clockless circuit only needs to switch 25-50% of them at any given time. Also, since 90nm, switching frequencies seem to have plateaued, with a max of around 10 GHz and typical speeds around 3 GHz. When minimally sized, simple clockless pipelines (WCHB) can get at most 4 GHz, and more complex logic tends to get around 2 GHz (for 28nm technology). Leakage current has become more of a problem, but it's a problem for everyone.

There is a horribly dense Wikipedia page on QDI, but it has links to a bunch of other resources if you are curious.


> A clockless pipeline is always going to be slower than a clocked one

How come? In a clocked design, you have to make the clock slow enough for all possible logic paths to finish. In a clockless one, the propagation only takes as long as needed, and in the case of a shorter path it can take less time, can't it?


Synchronous design tools are very good at making all of the pipeline stages have about the same logic depth, which is generally 6-8 transitions/cycle but can be much less. The fastest possible QDI circuit is a very simple, very small WCHB buffer which has 6 transitions/cycle. Most QDI logic will have 10-14 transitions/cycle.

Also, the speed of a linear pipeline is limited by its slowest stage whether or not you go clockless. Clockless only helps pipeline speed when you have a complex network.
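
A quick back-of-the-envelope check of those numbers: cycle time is roughly transitions per cycle times the delay per transition. The ~42 ps figure below is an assumption, back-solved from the ~4 GHz, 6-transition WCHB number at 28nm quoted earlier.

    #include <cstdio>

    // Cycle time = (transitions per cycle) x (delay per transition).
    int main() {
        const double transition_ps = 42.0;   // assumed per-transition delay at 28nm
        const int counts[] = {6, 10, 14};    // WCHB buffer vs. typical QDI logic
        for (int transitions : counts)
            std::printf("%2d transitions/cycle -> %.1f GHz\n",
                        transitions, 1000.0 / (transitions * transition_ps));
        return 0;
    }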


I don't think it's really fair to condemn all of asynchronous design due to the slowness of QDI. There are faster ways of doing things, like GasP, dual-rail domino with done detection, bundled data, one-sided handshaking, etc.


I think there's been a misunderstanding.

You're right, and I don't intend to condemn all of async, or even QDI for that matter :) I am doing my PhD on it, so I do think there is promise. I just think that arithmetic is better handled by Bundled-data specifically. Let QDI do the control leg-work and tack high-performance arithmetic to it.

Also, GasP is certainly faster, but it is limited to simple pipelines. That's why I like QDI: it lets me make weird circuits.

EDIT: Sorry, I got mixed up between the conversation threads... dyslexia is a thing.

I'm not saying we should condemn async or QDI, but we must recognize what it is good at and what it is not. A QDI pipeline stage may be slower, yes. So don't use it if you just want to implement a linear pipeline. But do use it if you have a complex network, because of the previously mentioned benefits. GasP and other async pipeline topologies don't have the flexibility of QDI, and there isn't really a good framework to mix them with QDI techniques at the moment (maybe relative timing?). The power of async comes from this flexibility and the ability to avoid unnecessary computation.


Yours matches the commentary of a friend in the field from about 2000. I do hope that the end of Moore's Law sees improvement in this sort of design, and in the tools required.

I do see some high-speed, low-power networking hardware moving this way: "Router Using Quasi-Delay-Insensitive Asynchronous Design"

https://dl.acm.org/citation.cfm?id=2634996&preflayout=flat


Yep, I saw that go by, and a lot of my work is heavily influenced by Andrew Lines. But what I've been seeing is that QDI is really bad at arithmetic because the acknowledgement requirements turn XORs into a nightmare hairball of signal dependencies. But QDI is really good at complex control.


That's not true for all approaches, for example bundled data and dual-rail domino QDI, and various commercial groups like Wave Computing and ETA Computing have their own asynchronous flavors, often optimized for arithmetic operations.


I was specifically talking about dual-rail domino QDI. When you compare the typical dual-rail domino QDI adder found in Andrew Lines' thesis against a typical clocked carry-lookahead adder like Kogge-Stone, it is worse by factors of between 2 and 3 in energy, area, and throughput.

Bundled data is a simple control with data clocked from that control. It's very much keeping arithmetic away from the QDI circuitry.

Though to be fair, I haven't seen a good examination of how pass transistor logic might affect QDI arithmetic circuitry, so maybe there is hope.
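
For reference, here is the clocked baseline in that comparison, a bit-parallel Kogge-Stone (parallel-prefix) adder sketched over 32-bit words in C++. It only illustrates the prefix structure of the carry computation; it is not the dual-rail QDI adder from Lines' thesis.

    #include <cstdint>
    #include <cstdio>

    // 32-bit Kogge-Stone addition, computed bit-parallel over uint32_t.
    uint32_t kogge_stone_add(uint32_t a, uint32_t b) {
        uint32_t g = a & b;   // generate:  a carry is created at this bit
        uint32_t p = a ^ b;   // propagate: a carry passes through this bit

        // log2(32) = 5 prefix levels, combining (g, p) pairs at doubling distances.
        for (int d = 1; d < 32; d <<= 1) {
            g |= p & (g << d);
            p &= (p << d);
        }

        // g now holds the carry out of each bit; the carry into bit i is
        // the carry out of bit i-1, i.e. (g << 1).
        return (a ^ b) ^ (g << 1);
    }

    int main() {
        uint32_t a = 0xDEADBEEF, b = 0x12345678;
        std::printf("%08X (expect %08X)\n", kogge_stone_add(a, b), a + b);
        return 0;
    }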


Interesting analysis... I wonder if "locally clocked" ALUs could help with that sort of thing. Clockless shouldn't have to be a purity test :^)


Hence my thesis :)


That's more or less GALS ("globally asynchronous, locally synchronous").


> What they seem to be doing right now is buying asynchronous start-ups and shutting them down.

There's a plan!


Investment Program for Embracing Innovation (IPEI) is best-spent capex.


If we're looking at a post-Moore's-Law interregnum in process improvements before we find some new substrate, then that might be a good opportunity to explore things like clockless designs, since those efforts can take longer to pay off.


And the truth is that we are. Here is a bunch of data on Intel's processes that I've pulled together from various public domain sources. The yield curves are a very rough estimate, so take them with a grain of salt.

https://www.nedbingham.com/intel_max_transistor.png

https://www.nedbingham.com/intel_switching_frequency.png

https://www.nedbingham.com/intel_tdp.png

https://www.nedbingham.com/intel_gate_delay.png


> Overall, this can mean significantly higher throughput and lower energy, but getting those benefits requires very careful design and completely different computer architectures.

What effect will this have on our programming languages and programming idioms? To some extent, our low-level programming languages have influenced CPU design, and vice versa, but it's not clear what effect an architectural change like this would have.


Ideally, none. We're leaning toward FPGAs and CGRAs to accelerate tight inner loops. This means that it will have a huge effect on compilers. They will have to compile from a control-flow behavioral description like C to a dataflow description that maps onto the array. This compilation process is honestly not solved. This is why you have Verilog instead of just compiling C to circuitry. I've taken a crack at it (in the form of QDI circuit synthesis from CHP) and every sub-problem is either NP-hard or NP-complete.

Though all of this is assuming we solve the memory bottleneck... which... might come about with upcoming work on 3D integration and memristors? who knows.


> every sub-problem is either NP-hard or NP-complete.

Did you check if they're FPT? (Fixed Parameter Tractable)


So this is mostly challenging for imperative languages? Would a dataflow-heavy language, like a functional one, be easier to compile?


The challenge isn't in extracting a dataflow graph from an imperative control-flow format. Compilers are already capable of doing this, and every major compiler has an expression DAG at some point. The challenge is in mapping from a generic dataflow graph onto the actual dataflow hardware primitives, which have their own constraints and may require non-trivial mappings.
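
A toy illustration of that intermediate form, with made-up node types: the expression c = a*b + a as a DAG in which the node for a is shared by two consumers. Mapping a graph like this onto fixed hardware primitives (limited fan-in and fan-out, a fixed set of operators, and so on) is where the hard constraints show up.

    #include <memory>
    #include <string>
    #include <vector>

    // Minimal expression-DAG node: an operator name plus operand edges.
    struct Node {
        std::string op;                                // "input", "*", "+"
        std::vector<std::shared_ptr<Node>> operands;
    };

    int main() {
        auto a   = std::make_shared<Node>(Node{"input", {}});
        auto b   = std::make_shared<Node>(Node{"input", {}});
        auto mul = std::make_shared<Node>(Node{"*", {a, b}});
        auto add = std::make_shared<Node>(Node{"+", {mul, a}});  // 'a' reused: a DAG, not a tree
        return 0;
    }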


Register allocation is also NP-complete, but we have plenty of efficient approximations that work well enough in practice, like linear scan for fast compilation, or graph coloring or puzzle solving for slower but more efficient compilation. Is there a comparable tradeoff available in this case?


Here is the counterargument to stochastic computing.

http://csl.yale.edu/~rajit/ps/stochastic.pdf

Basically, stochastic computing takes exponentially more time and energy to perform the same computation at the same precision, and it has a higher error rate. If you want to trade precision for energy, it would be better to use bit- or digit-serial operators.

Note of disclosure: this paper was written by my adviser, and I am currently writing two papers on digit-serial arithmetic operators.
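
For intuition on the precision argument, here is a toy example of my own (not from the paper): in unipolar stochastic computing, a multiply is just an AND of two independent random bitstreams, but the estimate converges like 1/sqrt(N), so each extra bit of precision roughly quadruples the stream length.

    #include <cstdio>
    #include <random>

    // Encode x and y in [0,1] as independent random bitstreams with
    // P(1)=x and P(1)=y; AND-ing them gives a stream with P(1)=x*y.
    int main() {
        const double x = 0.25, y = 0.5;
        const std::size_t len = 1u << 16;   // stream length

        std::mt19937 rng(12345);
        std::bernoulli_distribution bx(x), by(y);

        std::size_t ones = 0;
        for (std::size_t i = 0; i < len; ++i)
            ones += (bx(rng) && by(rng)) ? 1 : 0;   // one AND gate per "clock tick"

        std::printf("estimate %.5f vs exact %.5f\n",
                    static_cast<double>(ones) / len, x * y);
        return 0;
    }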


Range-v3 seems to be solving a different problem: it introduces support for Pythonesque features like type checking and list comprehension. However, the documented algorithms still use first/last iterators (http://en.cppreference.com/w/cpp/experimental/ranges/algorit...). There is a set of basic range class definitions and a View class, but I'm not sure whether they interact to provide the same kind of generic slice definition that I've developed here.

Perhaps I am missing something though, could you expand a little on the specific part of Range-v3 that you are thinking of?


No, the primary purpose of range-v3 is to work with ranges (a container, a pair of iterators, or a view) instead of iterators. Lazy and pipeable (i.e. chainable) views are built on top of this.

Check the first link rather than the cppreference documentation.

By the way, type-checking isn't Pythonesque. Haskellesque perhaps.
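
A small sketch of what that looks like, assuming a reasonably recent range-v3 where the views live in ranges::views (older releases used ranges::view):

    #include <iostream>
    #include <vector>
    #include <range/v3/all.hpp>

    int main() {
        std::vector<int> xs = {1, 2, 3, 4, 5, 6, 7, 8};

        // The container is passed around as a single range object, and lazy
        // views compose with operator| instead of juggling first/last iterators.
        auto evens_squared = xs
            | ranges::views::filter([](int x) { return x % 2 == 0; })
            | ranges::views::transform([](int x) { return x * x; });

        for (int x : evens_squared)
            std::cout << x << ' ';   // prints: 4 16 36 64
        std::cout << '\n';
        return 0;
    }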


Interfaces as implemented by Go provide for a different school of thought than C++. In C++, you define the taxonomy, then define the objects in relation to that taxonomy. In Go it's the opposite: you define the taxonomy in relation to the objects.

In concrete terms for C++, this means that abstract base classes must be defined before the objects are defined. So if I wanted to use a library, I would be entirely restricted to whatever abstract base classes it defines.

For Go, I could use a library and then define whatever interfaces I need specifically for the functions I want to implement. It's effectively a much more structured and rigorous architecture for templated code.

At least this is my understanding. I've explored Go just enough to get some of the higher level concepts but I haven't quite dug into it yet. So correct me if I'm wrong.


You should take a look at the Concepts TS. If I understand what you're trying to achieve correctly, then Concepts are the approach C++ is taking to address these types of issues.
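
A rough sketch of the idea using C++20 concepts, the standardized descendant of the Concepts TS. The Stringable concept and the to_string() member are made-up names for illustration.

    #include <concepts>
    #include <iostream>
    #include <string>

    // A constraint defined by the *user* of a library, after the fact,
    // naming only the operations this function needs.
    template <typename T>
    concept Stringable = requires(const T& t) {
        { t.to_string() } -> std::convertible_to<std::string>;
    };

    // Any type from any library with a suitable to_string() member satisfies
    // the concept, without that library ever declaring an abstract base class.
    void print(const Stringable auto& x) {
        std::cout << x.to_string() << '\n';
    }

    struct Point {
        int x = 0, y = 0;
        std::string to_string() const {
            return "(" + std::to_string(x) + ", " + std::to_string(y) + ")";
        }
    };

    int main() {
        print(Point{3, 4});   // prints: (3, 4)
        return 0;
    }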

