This is a perspective article I wrote with a number of researchers in unconventional computing, about solving problems in AI using physics-based chips. There is also a video of me discussing the paper on a livestream here https://www.youtube.com/watch?v=0h_mf5usgvg
> I don’t know what I may seem to the world, but as to myself, I seem to have been only like a boy playing on the sea shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered around me.
There have been a great many minds that tortured themselves pondering things we don't even blink an eye at today.
We admire and idolize those who found the pretty pebbles that light up the shore today, and yet I'm sure many of them would have eagerly traded places just to see what continued to take shape.
It's sad, though also a reminder of just how much is easily taken for granted.
To me it sounds more like standing in wonder at how much more there was that he could never have lived long enough to understand. It doesn't strike me as sad, but humble, and appropriately so.
Newton may have been rough/arrogant with other people, but this is entirely consistent with being humble in the face of the mysteries of the universe etc — there is nothing in his biography to suggest otherwise.
In fact, you can see hints of it even in this quote: “I don’t know what I may seem to the world” — suggests that he thinks of himself in an exalted position wrt the world (= other people, in this context), but a little boy wrt Nature. Something like: I am a little boy collecting pebbles on the shore, while everyone else is a senseless baby not even close to the ocean.
This is not sad at all. Childlike wonder and intellectual humility were essential to the great minds of history. If you feel like you're getting a handle on things, remember this quote and who said it. It is as true today as it was then, and we could all benefit by remembering it.
You're right that iterative methods may be linear in time with respect to the number of parameters. In this paper, we provide a method that is linear in time with respect to the dimension d of the vector x we are solving for in the equation A x = b. The number of parameters is d^2 + d: there are d^2 parameters for the matrix A and d for the vector b. Gradient descent would require a matrix-vector multiplication just to compute the gradient, which is already O(d^2).
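To make the counting concrete, here is a minimal digital sketch (illustrative only; the test matrix, step size, and step count are arbitrary choices of mine, not from the paper) of gradient descent on ||A x - b||^2, showing that each step already involves a matrix-vector product:

```python
# Minimal sketch (not from the paper): gradient descent on f(x) = ||A x - b||^2.
# Each step needs the matrix-vector products A @ x and A.T @ r, i.e. O(d^2) work.
import numpy as np

rng = np.random.default_rng(0)
d = 200
M = rng.normal(size=(d, d))
A = M @ M.T / d + np.eye(d)         # a well-conditioned SPD test matrix (my choice)
b = rng.normal(size=d)

x = np.zeros(d)
lr = 0.04                           # below 2 / lambda_max(A^T A) for this A
for _ in range(4000):
    grad = A.T @ (A @ x - b)        # the O(d^2) bottleneck of every iteration
    x -= lr * grad

print("residual:", np.linalg.norm(A @ x - b))
```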
You make an important point regarding compilation. In this case, we are talking about the time it takes to upload the matrix A and vector b to the hardware. This requires O(d^2) numbers to be updated, but assuming it is done in parallel, it can be done in O(d) time, and the coefficient of this scaling is independent of the physical parameters of the analog hardware. For this reason, in the analysis of the algorithm, we generally ignore the time to update the parameters, as is clarified in the Methods section.
OK, got it. I think what you're describing is Gradient Descent on the Normal Equations to solve an overdetermined linear system. Indeed, in such a system dim(x) == dim(b) == d. Matrix A is fixed, though, and not part of the estimation, but you're correct about the complexity of the gradient computation, which is indeed O(d^2).
Thanks for the clarification of the uploading / compilation step.
The Optical Fourier Transform (OFT) in [2] is a way to compute the Discrete Fourier Transform (DFT), and is an alternative to the Fast Fourier Transform (FFT). The OFT basically does a matrix-vector multiplication, where the matrix has a special form, which is accomplished physically using the diffraction of light as it propagates from one plane to another. Although the paper claims constant time for this operation, it’s likely that when the optical array gets bigger, the light has to propagate further for the diffraction pattern to emerge, meaning the time gets longer too. But the coefficient of this scaling is extremely small, because light travels fast. This also accelerates linear algebra using physical dynamics, but can only be used to multiply one particular matrix (the DFT matrix) by an arbitrary vector.
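For readers less used to the DFT-as-matrix view, here is a tiny sketch (mine, not code from [2]) of the product the optics computes: multiplication by the fixed N x N DFT matrix, which the FFT evaluates digitally in O(N log N):

```python
# Sketch (not from [2]): the DFT is a matrix-vector product with one particular
# matrix, which is what the optical setup performs physically; np.fft.fft gives
# the same result digitally in O(N log N).
import numpy as np

N = 256
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # the N x N DFT matrix

x = np.random.default_rng(2).normal(size=N)
X_matvec = F @ x                               # O(N^2) dense matrix-vector product
X_fft = np.fft.fft(x)                          # O(N log N), same answer

print("agree:", np.allclose(X_matvec, X_fft))
```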
Had a quick read of [3]. This work comments on three contributions to the error in an analog scheme to solve linear systems via an ODE that is encoded in the circuit dynamics, although the specific ODE being solved is not given in the paper. The three sources addressed are: gain error, offset error, and nonlinearity. It is mentioned that the first two can be corrected by calibration, while the nonlinearity error can be mitigated by scaling down the inputs to the problem (the matrix A and vector b in the equation Ax = b). It says that scaling down the problem results in lower accuracy, which I suspect can be captured by the tradeoff we show analytically between time, energy, and accuracy. It is also mentioned that “when the analog accelerator outputs are steady, we can sample the solutions once with higher-precision ADCs.” However, the method in [3] does not involve time-averaging the output of the circuit. A core result of our paper is that the accuracy converges with the length of time over which the output is averaged, so I suspect that taking a single sample is a drawback of the method presented in [3].
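As a toy illustration of that last point (the numbers here are mine, not the circuit model of [3] or our hardware model), averaging a fluctuating output over a longer window reduces the error roughly like 1/sqrt(T), whereas a single sample stays stuck at the noise level:

```python
# Toy illustration (assumptions mine): time-averaging a noisy steady-state output
# shrinks the error with the averaging window; a single sample does not.
import numpy as np

rng = np.random.default_rng(3)
x_true = 1.7                 # stand-in for one component of the solution
noise_std = 0.3              # stand-in for output fluctuations

for T in [1, 10, 100, 1000, 10000]:
    samples = x_true + noise_std * rng.normal(size=T)
    print(f"T = {T:5d}   |average - true| = {abs(samples.mean() - x_true):.4f}")
```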
An interesting feature of this approach is that the proposed hardware doesn't rely on non-linear elements, memristors, or even active elements (besides an optional noise source). It is simply a passive network of oscillators with a DC bias on each cell. That said, the hardware to implement this at scale does not currently seem to exist. To my knowledge, the state of the art is https://app.normalcomputing.ai/composer
From what I see, it's like mimicking the annealing process, and the "derivative" automatically drives you to the solution. If that's the case, implementing such hardware should not be that hard, except for the programmable coupling part. A bit off-topic, but this reminds me of the duality between a deep feedforward network and a modern Hopfield network with certain special energy functions, where the duality rests on the fact that the forward pass can be seen as an energy-minimization process.
The relationship with Hopfield networks sounds fascinating, would love to discuss further. As you mentioned, there is a connection to annealing in that we are encoding the solution to our problem in the minimization of a physical system's energy. Indeed, the all-to-all coupling is the hard part!
One way to think about these methods is that we are essentially implementing a Monte Carlo algorithm physically, where on each "iteration" there is a matrix-vector multiplication. The physical system does this matrix-vector multiplication for us in constant time, so it does have an advantage over these digital methods. Not only that, but the effective "clock period" of the physical system can be made almost arbitrarily short, although this comes with an energy cost.
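Here is a rough digital emulation of that picture (the dynamics, matrix, and parameters are my own choices, not the paper's hardware model): noisy linear dynamics whose time-average converges to A^{-1} b, with one matrix-vector product per step, which is the operation the physical system performs through its natural evolution:

```python
# Rough digital emulation (assumptions mine): Langevin-style dynamics whose
# stationary mean is A^{-1} b. Each discrete step costs one matrix-vector product;
# a physical implementation would perform that step via its own dynamics.
import numpy as np

rng = np.random.default_rng(4)
d = 50
M = rng.normal(size=(d, d))
A = M @ M.T / d + np.eye(d)            # symmetric positive definite test matrix
b = rng.normal(size=d)

dt = 0.01
x = np.zeros(d)
running_sum = np.zeros(d)
n_steps = 100_000
for _ in range(n_steps):
    x += dt * (b - A @ x) + np.sqrt(2 * dt) * rng.normal(size=d)  # per-step matvec
    running_sum += x

x_avg = running_sum / n_steps          # time-average estimates the solution
print("error of time-average:", np.linalg.norm(x_avg - np.linalg.solve(A, b)))
```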
Yes, that’s precisely what made me harken back to my work solving large stochastic PDE systems :-)
The difference, though, is that instead of a matrix multiplication it’s a nonlinear optimization. The crucial part is the nonlinearity. But given that this technique is, as you say, Monte Carlo at its root, I assume that shouldn’t specifically matter?
You mentioned that such problems may be solved by solving a linear approximation of the non-linear problem (and I am in no way an expert in non-linear optimization). To the extent that the bottleneck in that approach is solving the resulting linear system, this method offers a speedup. We are also thinking about using similar thermodynamic methods to solve non-linear systems directly, but some of the nice properties of the harmonic oscillator are not present in that case, so it's currently not clear how much speedup (if any) there is.
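For concreteness, here is a small sketch (mine; the example system and starting point are arbitrary) of the standard linearization route: Newton's method reduces each iteration of a nonlinear solve F(x) = 0 to a linear solve J dx = -F(x), and it is that inner linear solve that a method like ours could accelerate:

```python
# Sketch (mine): Newton's method for a small nonlinear system F(x) = 0. The work
# in each iteration is a *linear* solve J dx = -F(x), the step an analog linear
# solver could take over.
import numpy as np

def F(x):                        # an arbitrary example nonlinear system
    return np.array([x[0]**2 + x[1] - 3.0,
                     x[0] + x[1]**2 - 5.0])

def J(x):                        # its Jacobian
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, 2.0 * x[1]]])

x = np.array([1.0, 1.0])
for _ in range(20):
    dx = np.linalg.solve(J(x), -F(x))   # the inner linear solve
    x += dx

print("solution:", x, " residual:", np.linalg.norm(F(x)))
```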