
> Configuration of SRAM based FPGAs is rather slow because it requires a scan chain to program each logical element and shift config bits into it, and doing it faster requires even more circuitry. You need to multiplex things onto the fabric in practice, you can't "context switch" AKA temporally multiplex very well, you have to spatially multiplex. But FPGAs are already area intensive
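The quoted point about scan chains can be sketched with a toy model (purely illustrative, not any vendor's configuration logic): config bits enter serially, one per clock, so load time scales linearly with bitstream size.

```python
# Hypothetical sketch of why SRAM FPGA configuration is slow: config
# bits are shifted serially through a scan chain, one bit per clock,
# so load time grows linearly with the number of bits.

def scan_chain_load(bitstream, chain_length):
    """Shift `bitstream` into a scan chain of `chain_length` latches.

    Returns (final chain contents, clock cycles consumed).
    """
    chain = [0] * chain_length
    cycles = 0
    for bit in bitstream:
        # Each latch takes its neighbour's value; the new bit enters
        # at the head. One full shift costs one clock cycle.
        chain = [bit] + chain[:-1]
        cycles += 1
    return chain, cycles

# A 4-latch chain costs exactly as many cycles as there are bits.
chain, cycles = scan_chain_load([1, 0, 1, 1], chain_length=4)
print(chain, cycles)  # [1, 1, 0, 1] 4
```

Real devices widen and parallelise this, but as the quote notes, that costs extra circuitry and area.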

On FPGAs designed for this, it is possible to "gradually reconfigure" the fabric on context switch at high speed, while it continuously processes data, in a manner similar to how CPUs gradually change what's in their cache after a context switch, and modern GPUs handle multiple applications by scheduling work units across the compute elements.
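The idea of gradual reconfiguration can be sketched roughly like this (a toy model with made-up region names, not any shipping partial-reconfiguration API): the fabric is split into regions, and on a context switch only one region is rewritten per step while the rest keep processing.

```python
# Hypothetical sketch of gradual reconfiguration: rewrite one region at
# a time so the remaining regions stay live, analogous to a CPU cache
# gradually refilling after a context switch. Names are illustrative.

def gradual_switch(regions, new_config):
    """Reconfigure regions one at a time, logging which regions stayed
    active (still processing data) during each step."""
    log = []
    for i in range(len(regions)):
        active = [r for j, r in enumerate(regions) if j != i]
        log.append((i, active))
        regions[i] = new_config[i]
    return regions, log

regions, log = gradual_switch(["old_a", "old_b", "old_c"],
                              ["new_a", "new_b", "new_c"])
print(regions)   # ['new_a', 'new_b', 'new_c']
print(len(log))  # 3 steps; 2 of 3 regions stayed live at every step
```

The trade-off against Tabula-style rapid multiplexing (discussed further down the thread) is that this needs no extra per-element context storage, only region-level reconfiguration ports.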

I expect those sorts of FPGA designs would become available on the market if vendors decided to develop the ecosystem of FPGAs as general purpose compute accelerators, shared among applications, similar to the role played by GPUs, TPUs and NNPUs now.

(Long shot: If anyone out there seriously wanted to hire someone to build open source or open programming, high performance FPGAs with these switching characteristics, and tooling to match, I would love to do both :)



Modern GPUs hide latency by scheduling tons of work and paying it back in throughput, but this is very design-sensitive, and doing it in an FPGA requires a ton of pipelining and design work. That effort is often better spent just paying some schlubs like us to write software. Again, the cost of programming the fabric is quite real. You pay for area.
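The latency-hiding arithmetic can be made concrete with a back-of-envelope model (numbers purely illustrative): if a new work item can be issued every cycle and each takes a fixed latency to complete, deep pipelining amortises that latency across the whole batch.

```python
# Hypothetical model of GPU-style latency hiding: perfect pipelining
# means total time is (issue time + one item's latency), not
# (items * latency). Parameters are illustrative only.
import math

def pipelined_cycles(n_items, latency, issue_width=1):
    """Cycles to finish n_items when issue_width items start per cycle
    and each takes `latency` cycles to complete, fully pipelined."""
    issue_cycles = math.ceil(n_items / issue_width)
    return issue_cycles - 1 + latency

serial = 1000 * 100                         # one at a time: 100,000
pipelined = pipelined_cycles(1000, latency=100)
print(serial, pipelined)                    # 100000 vs 1099
```

Getting an FPGA design to this regime is exactly the "ton of pipelining and design work" the comment describes: the throughput win only materialises if the datapath is kept full.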

And people do actually create marketable FPGA designs you can load into modern accelerators. You can buy Bittware devices today, or Xilinx Alveo, and load tons of designs into them. You can go get Amazon F1 instances and put tons of accelerators on them. You don't hear about them and they aren't popular like GPUs because the fact is that most people don't need this, and the ones who do have very particular designs that probably aren't worth over-optimizing the entire system architecture for. That's why they're 95% PCIe cards with attached output peripherals that most of the time end up in Ethernet.


I'm familiar with Bittware, F1 and Alveo accelerators. I've used F1 and might use Alveo this year. The cost of programming them is indeed high, but it's largely because of the design software whose paradigm remains firmly stuck in the 90s. Even "high level synthesis" is far from high level at the moment.

Those devices are completely different to use compared to the sort of general purpose, fast-compilation, fast-switching accelerators like modern GPUs.

FPGAs and FPGA-like architectures and concomitant design software can be designed for fast compilation, adaptive timing and pipelining, and overlapped application multiplexing. But it takes significant design changes. It's a novel and underexplored area. With such architectures, schlubs like us can write software that runs on them with excellent performance for some tasks.

Unfortunately the market and the legal situation haven't optimised for that. The closed FPGA programming information, for decades, meant others couldn't produce radically different commercial tools for existing FPGAs, which would generally require skipping the proprietary P&R to use novel fast-compilation and incremental reprogramming techniques. Those who explored it were always worried about legal issues, as well as damaging customer devices.

And for a long time the patents were a chilling effect on new entrants wanting to develop alternate FPGA architectures better suited to this type of programming, as long as they contained elements of traditional FPGAs as well. The patent situation is starting to shift now that early Xilinx and Altera devices are old enough, but it's a multi-decade process, unfortunately.


That's exactly what Tabula did. They made time-multiplexed FPGA fabric. They also went bankrupt; you could buy their scrap on eBay for a few months.


I liked their idea, and I think it had a lot of potential for clever optimisation. Shame about the bankruptcy. But it's different to what I'm talking about. Tabula's extremely fast multiplexing needed a lot of chip area.



