> One of the major valid concerns that were raised with this configfs interface was security as it opens up the interface to users for modifying the live device tree.
This has always felt like a gaping security hole waiting to be exploited.
Modern, high end FPGAs have a feature known as Raw SerDes, which in essence allows you to bypass a PCIe or Ethernet controller and use those lanes (yes, PCIe lanes) to your heart's desire ...provided you can design a working communication protocol. Difficult, but not impossible by any means.
So if you wanted to, you could design your own PCIe controller and give it whatever device ID, vendor ID, memory space, or capability space you want! Normally these things are not writable on a PCIe controller. But if you designed your own, you could write them to whatever you want and spoof device types, memory spaces, or driver bindings, and probably get yourself access to memory you shouldn't be touching. While I don't know how the Linux kernel would handle these potentially out-of-spec conditions, it never sat right with me from a security standpoint.
> But if you designed your own, you could write them to whatever you want and spoof device types, memory spaces, or driver bindings, and probably get yourself access to memory you shouldn't be touching.
Not in a system with a properly configured IOMMU unit. That stuff got some serious attention back in the old Thunderbolt 2 era, when people discovered that yes, it's PCIe under the hood and yes, having no IOMMU protection yields an attacker an instant-0wn.
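If you want to check whether that protection is actually active on your own box, the IOMMU groups show up in sysfs; a minimal sketch (the sysfs layout is standard Linux, the devices listed will obviously differ per machine):

    # List IOMMU groups and the PCI devices in each one.
    # If /sys/kernel/iommu_groups is empty, DMA from a rogue
    # PCIe endpoint is not being isolated by an IOMMU.
    import os

    root = "/sys/kernel/iommu_groups"
    if not os.path.isdir(root) or not os.listdir(root):
        print("No IOMMU groups found - IOMMU is off or unsupported")
    else:
        for group in sorted(os.listdir(root), key=int):
            devices = os.listdir(os.path.join(root, group, "devices"))
            print(f"group {group}: {', '.join(sorted(devices))}")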
Sure, but aren't you connecting your general purpose serdes to a peer PCIe controller? I don't understand why having raw serdes control is a security concern in this regard unless you are trying to find exploits at the physical layer...
In any regard, a lot of threat models (including mine) consider installing hardware (especially an FPGA) as a trusted action.
The thing is, the PCIe EP on the FPGAs uses the general purpose SerDes that are routed to the PCIe controller in the bitstream. So if you were to load a different, malicious bitstream (which is admittedly a challenge in its own right), you could turn the FPGA into a malicious PCIe device.
Is the concern the idea that as FPGA fabric is included in more devices, some hypervisor escape is going to present this as additional attack surface?
Otherwise, if it's configfs, you're root on the system, and unless it's integrated peripherals you plan to attack, you probably have finer-grained hardware context to imply physical access... which seems to minimize the farther-reaching, generalizable concerns?
If physical attacks (evil maid attacks) are not in scope I fail to see the concern. To turn the FPGA into a malicious device you would have to gain root access to the system hosting it. So by the time the attacker is able to gain the ability to program the device, there is little need to even make it malicious. One could argue that it adds a persistence vector for malware, except that the device will likely get reprogrammed over and over during normal operation. If malware authors wanted persistence they would likely target the firmware of random flash ROMs on chipsets and commodity PCIe cards that are less likely to be reprogrammed. Lastly, the only other valid concern possibly more dangerous than root access is perhaps a remote attacker programming a bitstream to completely fry the FPGA faster than the power regulators can react, thus killing an expensive chip. That one is concerning.
The built-in Gigabit transceiver cores, which you'd have to use for the PCIe protocol, are connected to very specific IO pins on the FPGA. If the PCIe slots on your mainboard aren't already routed to those pins, then the FPGA will never be able to "bypass" the regular PCIe or Ethernet interfaces. Conversely, if they are connected to those pins, then the regular PCIe and Ethernet interfaces won't be able to use that PCIe slot. So no, your security concerns are unwarranted.
> a feature known as Raw SerDes
I have never heard anyone use the term "raw serdes" for hard transceiver IP cores.
You're attaching your design to the MAC layer inside the FPGA, not to the IO pins, so it's the PIPE interface or something similar that you would need to communicate with. And yes, you can bypass the PCIe or Ethernet controller on various models of FPGAs.
Sorry, but it's still not clear what exact attack scenario you are envisioning. I have a PC with a motherboard that has a CPU and an FPGA. I load my custom nefarious PCIe core onto the FPGA that bypasses the built-in PCIe core. Now what? What is my PCIe core actually connected to?
To make the FPGA actually useful, it probably is connected to the PCIe lanes. Since PCIe isn't really a bus anymore, it's not clear what is possible, but I believe a PCIe device can in principle access all of memory (via DMA or similar)? Maybe an IOMMU can protect against that, but I would be very surprised if bugs couldn't be found, especially if you can make your PCIe device speak not-quite-right PCIe.
And since it's near impossible for the kernel to validate FPGA firmware functionality, the right to send bitstreams to the FPGA is essentially equivalent to root on dom0.
Any PCIe device you plug into your computer has the same potential to do something nefarious. We already have problems where no two PCIe implementations interpret the spec the exact same way and they all have bugs. What you are hypothesizing isn't anything new.
This thread started with "gaping security hole" and I'm still not seeing that. Yes, if someone has a PCIe design that can exploit the root complex of the host, and if they have a way to remotely deploy it to an FPGA through this new kernel interface, then yes, that's an interesting new attack. Those are some big ifs though, I think.
FPGAs are no different than any other hardware in this regard; in fact I suspect if you can hack the firmware on most PCIe cards you could do that stuff too (unless there is an IOMMU).
It's more expensive, but at our next trade show, we'll move on from giving out hacked USB thumb drives to hacked PCIe cards. Now I just need to figure out how to get people to install them.
> Modern, high end FPGAs have a feature known as Raw SerDes, which in essence allows you to bypass a PCIe or Ethernet controller and use those lanes (yes, PCIe lanes) to your heart's desire ...provided you can design a working communication protocol. Difficult, but not impossible by any means.
Not even close to impossible. I've recently been trying to figure out a relatively "low-tech" way of talking to modern displays that doesn't involve feeding VGA into a black-box ADC, and from what I've gathered so far, most of the serial link standards developed in the past ~25 years are basically overclocked Fibre Channel with the serial numbers filed off and most of the reliability/ordering guarantees quietly dumped in a roadside ditch.
What he means is that almost all serial interfaces at the PHY layer are derivatives of FC's 8b/10b coding. You turn bytes into symbols and optionally apply some whitening with an LFSR, much like a radio TX. Then there are some additional tricks like de-emphasis, where you try to overdrive the high/low transitions to compensate for your PCB substrate or cabling being cheap.
PCI Express, SATA, USB3, HDMI (in a slightly different way), DisplayPort, etc. all use some form of 8b10b coding (or more efficient coding) with multi-gigabit serial transceivers.
In fact the PHY layer for USB3, PCIe 2.0 and SATA is identical - Intel designed a physical standard to encompass all three of those called PIPE. Nowadays that only exists as a virtual bus between silicon IP blocks.
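To make the "whitening with an LFSR" part concrete, here's a toy scrambler sketch in Python; the 16-bit polynomial and seed are just illustrative picks rather than any particular standard's, but the descramble-by-rescrambling property is the same trick the real PHYs rely on:

    # Toy data whitener: XOR the payload bits with the output of a
    # free-running LFSR. Polynomial taps and seed are arbitrary choices
    # for illustration; real PHYs fix both and define reset rules.
    def lfsr_stream(state=0xFFFF, taps=(16, 5, 4, 3)):
        while True:
            fb = 0
            for t in taps:                     # Fibonacci LFSR feedback
                fb ^= (state >> (t - 1)) & 1
            state = ((state << 1) | fb) & 0xFFFF
            yield fb

    def scramble(data: bytes) -> bytes:
        prng = lfsr_stream()
        out = bytearray()
        for byte in data:
            b = 0
            for i in range(8):
                bit = (byte >> i) & 1
                b |= (bit ^ next(prng)) << i   # whiten one bit at a time
            out.append(b)
        return bytes(out)

    payload = b"\x00" * 8                       # long run of zeros: worst case on a serial link
    print(scramble(payload).hex())              # whitened output bytes
    print(scramble(scramble(payload)) == payload)  # True: same keystream undoes itself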
Right. And not only is the general approach the same (which one could credibly blame on a sort of "parallel evolution" driven by the changing economics of transistors vs. pins in chip manufacturing), but in some cases the relevant standards literally incorporate parts of Fibre Channel by reference. It was seeing the DisplayPort spec cite the ANSI Fibre Channel PHY standard to define its specific 8b10b code (there have been several, due to the inherent redundancy of that type of code; cf. the choice of the 2 "extra" symbols in base64) that started me down the rabbit hole.
I'm not sure how many similarities there are between current mainstream interfaces and Fibre Channel specifically, but there has been an obvious trend of convergence, and Fibre Channel was probably one of the earliest instances. Nowadays, almost all high-speed interfaces have abandoned parallel links in favor of serial signalling over differential pairs implementing a packet-switched network. There's enough similarity at the physical layer that you'll often see a chip designer go with a general-purpose PHY shared across SATA/SAS, USB, and PCIe ports because they all operate at similar speeds and often similar encodings (eg. 8b10b). Likewise for sharing between InfiniBand and Ethernet.
DRAM is still almost always a traditional parallel bus, and it's basically alone in that. HDMI was a late holdout where the signalling was serial on differential pairs but the clock speed was variable depending on the data rate; newer versions of the standard rely instead on the link operating at one of a handful of fixed standard data rates, as done by everything else.
> Modern, high end FPGAs have a feature known as Raw SerDes
That's like saying "did you know that advanced microprocessors have the capability to bypass I2C and output voltages directly on the pins?!?!?!?"
First of all, it's backwards. The physical layer comes before the protocols and is always there at the base. Second of all, the world does not exclusively run on I2C. Some people want SPI busses or to toggle transistors with GPIO. That's fine. Sure, gate it behind different permissions, but don't just rant at what you don't understand.
If you want a concrete example where serdes access is important, look up JESD204b, but in general there are loads of real-time systems or bespoke processing applications where it makes sense to dispense with the complex and temperamental packet-switched infrastructure in places where that complexity and nondeterministic behavior is likely to cause more trouble than good. There are also applications to backplane connections (if you are encapsulating PCIe, you want to run slightly faster than the PCIe so the PCIe can run at full bandwidth), even to the development of next-gen PCIe itself. It's not magic, it is not delivered by a stork, it needs to be prototyped, and that's another thing FPGAs are used for.
I hope AMD sees the light and helps F4PGA develop a more complete open source toolchain for their FPGAs (https://f4pga.org). With this subsystem and an open source compilation flow, FPGA experiments would be way easier.
I don't know much about the ins-and-outs of the FPGA ecosystem -- can you explain why you think this kind of collaboration would be impossible? Is it a technical roadblock, a philosophical difference of opinion, a business decision, etc?
My guess: if the bitstream format was documented competitors would know how the device works and be able to prove their patents are being violated.
FPGA vendors will also justify inertia in that current FPGA users don’t seem to be deterred by the bad tools because of the economics of their business.
Some think a lot of hobby users would try FPGA if the toolset was easier to pick up but there are not enough of those folks to keep Radio Shack or even Fry’s alive and they will be buying $5-$150 parts, not the much more powerful $10,000-$100,000 parts.
> My guess: if the bitstream format was documented competitors would know how the device works and be able to prove their patents are being violated.
This has been the persistent argument for many years from companies who say they can't release Open Source graphics drivers.
> FPGA vendors will also justify inertia in that current FPGA users don’t seem to be deterred by the bad tools because of the economics of their business.
Want hundreds of times as many FPGA users? Make it easy for an FPGA to be used for transparent acceleration, by making it easy for Open Source libraries to build and ship FPGA bitstreams that serve as accelerators for their data handling. Imagine if compression libraries, databases, and many other kinds of libraries could transparently take advantage of an FPGA if available to process data many times faster. Then there'd be a benefit to shipping an FPGA in many servers, and many client systems as well.
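As a sketch of what that transparent dispatch could look like from a library's point of view (the device node path and the offload hook are entirely hypothetical, just to show the fallback pattern):

    # Hypothetical dispatch pattern: use an FPGA-backed compressor if an
    # accelerator device node is present, otherwise fall back to software.
    import os
    import zlib

    FPGA_DEV = "/dev/fpga_compress0"    # made-up device node, for illustration only

    def _compress_fpga(data: bytes) -> bytes:
        # Placeholder: a real library would hand the buffer to the
        # accelerator through its driver. Kept unimplemented so the
        # sketch stays runnable via the software path.
        raise NotImplementedError

    def compress(data: bytes) -> bytes:
        if os.path.exists(FPGA_DEV):
            try:
                return _compress_fpga(data)
            except Exception:
                pass                     # accelerator missing/busy: degrade gracefully
        return zlib.compress(data)       # software path, always available

    print(len(compress(b"hello world" * 1000)))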
I don't think that even with accessible tools there are going to be hundreds of times more users. At the end of the day, they are niche products with niche uses. The average random person isn't going out of their way to make accelerators or some kind of pipeline tool or circuit to do something that couldn't just as easily be achieved with a microcontroller or Arduino.
Those that are willing to go out of their way to design a custom circuit or something else on an FPGA are in my opinion the type already dedicated enough or driven enough to not be deterred by crappy tools.
The work you do on FPGAs is already enough of a filter that I don't think anyone is getting past that and then giving up because the tools suck.
> The work you do on FPGAs is already enough of a filter that I don't think anyone is getting past that and then giving up because the tools suck.
About 10 years ago I was doing some FPGA development in a startup. We were using Vivado. It seemed like we spent about 30% of our time working around Vivado bugs. I come from a hardware background originally and then got into software development (EDA tools) later on. After the startup gig ended I could've gone more in the direction of FPGA development. I decided not to because the tools suck and life is too short to deal with that day in and day out. And it's not simply that the FPGA vendor tools are some of the buggiest software known to humankind, it's that the FPGA vendors don't care to make them better.
I think you are looking at it from the wrong perspective.
It doesn't take 100x the devs to make FPGA compelling on the desktop or the server. Just like bespoke accelerators in Apple Silicon are used behind a library, so too will the accelerators implemented via FPGAs. The program itself can be copied a billion times.
Your argument can be made for GPUs as well: the users (end users) aren't the ones writing the shaders, but GPUs are used by hundreds of millions of people.
You don't need a massive number of developers. You need a small number of developers contributing changes to popular libraries that are used by a huge number of people.
The use cases you write about are mostly constrained by design, not software.

Configuration of SRAM-based FPGAs is rather slow because it requires a scan chain to shift config bits into each logic element, and doing it faster requires even more circuitry. In practice you need to multiplex things onto the fabric spatially; you can't "context switch", AKA temporally multiplex, very well. But FPGAs are already area intensive: a k-LUT needs 2^k SRAM bits for the table, each bit being 6 transistors, on top of the scan chain to program it, the registers and latches that go with the LUT in a typical logic element, the routing crossbars, and so on. Assuming k=6, a single LUT carries something like a ~100x transistor overhead compared to a CMOS NAND gate (not a 1-to-1 comparison, just a ballpark). The SRAM requirements alone are problematic because they scale far, far worse than logic. If you're talking about a modern 7/5/3-nm wafer, area = money, and that's a shitload of money.

So, what part of the system architecture do you even put the FPGA on? In the core complex? It can't be too big, then; your users are 99% better off with that area going to more cache and branch prediction. Put it on an older process and stuff it in the package? Packaging isn't free. Maybe just on the PCB? "Bump in the wire" to the NIC, or RAM, or storage? That limits the use, but an out-of-line design means there's less bandwidth available as inputs/outputs are shared. There are benefits and costs to them all, and they all change the use cases and interface a bit. Now keep in mind you might have multiple parallel bitstreams you want to run. All of these choices impact how the software can interface with it and what capabilities it has.
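Rough back-of-the-envelope numbers for that area argument; every per-cell constant here is an assumption for illustration, not a vendor figure:

    # Ballpark the transistor cost of one k-input LUT logic element versus
    # a plain CMOS NAND gate. All constants are rough assumptions.
    K = 6                               # LUT inputs
    SRAM_BITS = 2 ** K                  # 64 config bits for the truth table
    T_PER_SRAM_BIT = 6                  # 6T SRAM cell
    T_MUX = 2 * (2 ** K - 1)            # ~2 transistors per 2:1 mux in the read tree
    T_FF_AND_MISC = 50                  # register, latches, scan chain, local muxing (guess)

    lut_transistors = SRAM_BITS * T_PER_SRAM_BIT + T_MUX + T_FF_AND_MISC
    nand_transistors = 4                # 2-input CMOS NAND

    print(f"LUT element : ~{lut_transistors} transistors")
    print(f"NAND gate   : {nand_transistors} transistors")
    print(f"overhead    : ~{lut_transistors / nand_transistors:.0f}x "
          "(before counting routing crossbars, which dominate in practice)")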
Example: Modern DDR5 has something like 64 GB/s of bandwidth per channel; assuming your design is inline on the bus running at something like 500 MHz, you'd need a 128-bit bus, per channel. That clock rate might require deep pipelining, further increasing area requirements, so you can't fit as much other stuff. Otherwise, you need a wider bus and to go slower, but wider buses often scale sub-linearly in terms of area and routing congestion; a 256-bit bus will be more than twice as expensive and difficult to route as a 128-bit one due to limited routing tracks, etc. So maybe you can hit that target, but then you're too routing-congested, so you can't fit as many channels as you want. Ergo, you need bigger/more FPGAs, or serious optimization and redesign. There's no immediate win. You typically need to explore/napkin-math the design space to find the best solution on the Pareto frontier. Or just buy an FPGA that's massive overkill, AKA "buy a faster PC", the typical software programmer's solution. But it really isn't plug and play or anything close to that.
It's similar to other niche things, like in-memory GPU databases. They are not held back by CUDA being proprietary. That fact does suck, but it's not really relevant in the grand scheme. They are held back by physical design dictating that parallel accelerators need loads of fast memory to feed the execution units, fast memory is super expensive and takes up a lot of space on the PCB resulting in a physical upper bound on density, and that the working set for such databases typically grows much, much faster than rate at which GPU memory performance/price drops. Past the point of no return (working set > VRAM), their advantages rapidly vanish. Their limitations are in the design, not the software.
FPGAs taught me a lot about hardware/software design. I really like them and want more people to use them. I'm really excited there are fully FOSS flows, even if they have giant limitations. But they are pretty niche and have serious physical design factors to account for; I say that as someone who contributes to, uses, and loves the open-source tools for what they are, and even was lucky enough to play with them for work.
I have an idea for a hardware-accelerated VP8 video encoder. The bottleneck is pretty obvious. Any FPGA that supports PCIe costs enough that you can just buy a faster CPU and do everything in software. 1080p@60fps requires around 3 Gbps. Your only hope is to take a low-end FPGA and connect it to a USB 3.0 FTDI chip that gives you a 32-bit @ 100 MHz parallel interface. The resulting PCB including all hardware would probably cost around 60€. This would be enough to reduce the cost of real-time (below 100 ms) video encoding to the point of making it profitable to run as a business with a slim profit margin.
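The bandwidth figure checks out as a quick calculation, assuming raw 24-bit pixels going into the encoder:

    # Raw pixel bandwidth for 1080p60 into the encoder, versus what a
    # 32-bit @ 100 MHz parallel FIFO could carry. Assumes 24 bits/pixel.
    width, height, fps, bits_per_pixel = 1920, 1080, 60, 24

    video_bps = width * height * fps * bits_per_pixel
    print(f"1080p60 raw video     : {video_bps / 1e9:.2f} Gbps")   # ~2.99 Gbps

    fifo_bps = 32 * 100e6               # 32-bit parallel bus at 100 MHz
    print(f"32-bit @ 100 MHz FIFO : {fifo_bps / 1e9:.2f} Gbps")    # 3.20 Gbps, just enough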
It's pretty obvious that an FPGA is a bad choice as an accelerator if you can get away without one. Future CXL FPGAs will be highly capable platforms, but they will be both expensive and a nightmare to develop for, negating most of the reasons why you would use them.
By the way, your complaint about LUTs taking up transistors is pretty irrelevant. Most of the transistors are being "wasted" on the routing switches and connection boxes. The space taken up by LUTs is so small that there are mask-programmable gate arrays, aka FPGAs without the routing switches and connection boxes. They end up three times as dense as a regular FPGA as a result.
Very correct; routing delays are the bane of my existence as an FPGA dev. Nothing like filling a chip past 80 percent and watching Fmax fall off a cliff.
> Configuration of SRAM-based FPGAs is rather slow because it requires a scan chain to shift config bits into each logic element, and doing it faster requires even more circuitry. In practice you need to multiplex things onto the fabric spatially; you can't "context switch", AKA temporally multiplex, very well. But FPGAs are already area intensive
On FPGAs designed for this, it is possible to "gradually reconfigure" FPGAs on context switch at high speed, while they continuously process data, in a manner similar to how CPUs gradually change what's in their cache after a context switch, and modern GPUs handle multiple applications by scheduling work units across the compute elements.
I expect those sorts of FPGA designs would become available on the market if vendors decided to develop the ecosystem of FPGAs as general purpose compute accelerators, shared among applications, similar to the role played by GPUs, TPUs and NNPUs now.
(Long shot: If anyone out there seriously wanted to hire someone to build open source or open programming, high performance FPGAs with these switching characteristics, and tooling to match, I would love to do both :)
Modern GPUs hide latency by scheduling tons of work and paying it back in throughput, but this is very design-sensitive, and doing it in an FPGA requires a ton of pipelining and design work, which is often better spent just paying some schlubs like us to write software. Again, the cost of programming the fabric is quite real. You pay for area.
And people do actually create marketable FPGA designs you can load into modern accelerators. You can buy Bittware devices today, or Xilinx Alveo, and load tons of designs into them. You can go get Amazon F1 instances and put tons of accelerators on them. You don't hear about them and they aren't popular like GPUs because the fact is that most people don't need this, and the ones who do have very particular designs that probably aren't worth over-optimizing the entire system architecture for. That's why they're 95% PCIe cards with attached output peripherals that most of the time end up being Ethernet.
I'm familiar with Bittware, F1 and Alveo accelerators. I've used F1 and might use Alveo this year. The cost of programming them is indeed high, but it's largely because of the design software whose paradigm remains firmly stuck in the 90s. Even "high level synthesis" is far from high level at the moment.
Those devices are completely different to use compared to the sort of general purpose, fast-compilation, fast-switching accelerators like modern GPUs.
FPGAs and FPGA-like architectures and concomitant design software can be designed for fast compilation, adaptive timing and pipelining, and overlapped application multiplexing. But it takes significant design changes. It's a novel and underexplored area. With such architectures, schlubs like us can write software that runs on them with excellent performance for some tasks.
Unfortunately the market and the legal situation haven't optimised for that. The closed FPGA programming information, for decades, meant others couldn't produce radically different commercial tools for existing FPGAs, which would generally require skipping the proprietary P&R to use novel fast-compilation and incremental reprogramming techniques. Those who explored it were always worried about legal issues, as well as damaging customer devices.
And for a long time the patents were a chilling effect on new entrants wanting to develop alternate FPGA architectures better suited to this type of programming, as long as they contained elements of traditional FPGAs as well. The patent situation is starting to shift now that early Xilinx and Altera devices are old enough, but it's a multi-decade process, unfortunately.
I liked their idea, and I think it had a lot of potential for clever optimisation. Shame about the bankruptcy. But it's different to what I'm talking about. Tabula's extremely fast multiplexing needed a lot of chip area.
I'd imagine either in the core complex, or on a CXL link or similar.
And yes, you'd need to leave it programmed with the accelerators you actually need. You could have system policy that programs in the accelerators for libraries your software uses, with a mechanism for saying "there's not enough room in the FPGA for all the accelerators, pick the ones you want".
Among other uses, this would mean you might not need specialized hardware for video decoding or encoding for each new codec; you could put it on an FPGA and upgrade it in the future. In theory, the FPGA could take the place of some of those special-purpose transistors, with that function running on the FPGA instead.
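A toy sketch of that "not enough room, pick the ones you want" policy; the accelerator names, areas, and priorities are all invented, and a real policy would read resource usage from the toolchain's reports:

    # Greedy selection of accelerator bitstreams under a fabric area budget.
    # Areas are in arbitrary "LUT" units; names and priorities are made up.
    accelerators = [
        # (name, area, priority)
        ("zstd_compress", 30_000, 0.9),
        ("aes_gcm",       20_000, 0.7),
        ("regex_scan",    45_000, 0.6),
        ("jpeg_decode",   25_000, 0.4),
    ]
    FABRIC_BUDGET = 80_000

    def pick(accels, budget):
        chosen, used = [], 0
        # Highest priority per unit area first - a crude knapsack heuristic.
        for name, area, prio in sorted(accels, key=lambda a: a[2] / a[1], reverse=True):
            if used + area <= budget:
                chosen.append(name)
                used += area
        return chosen, used

    chosen, used = pick(accelerators, FABRIC_BUDGET)
    print(f"load: {chosen}  ({used}/{FABRIC_BUDGET} LUTs)")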
CXL doesn't really change anything except for the fact memory is coherent across the bus so you don't have to do your own protocol. "Core complex" is massively oversimplifying it but, generally speaking your accelerator probably needs Serdes and peripherals if it wants to be something more than just a glorified data multiplexor. General purpose PCs barely need this. "Video decoding" is the most common example of where FPGAs would be "awesome" but I'm going to spoil it: it's more efficient to just use an older codec like VP9 or H264, versus introducing a huge piece of silicon that is dead 99% of the time for a new codec, especially when the older codec doesn't suffer from a 100x area/power loss. That area could go to parts of the CPU that are generally useful, again like branch prediction or cache. You could use those many mm^2 of silicon to do other actually useful, general-purpose things, and most importantly, those things can be programmed by software without an absurdly high development and verification cost. Even if it was advantageous, you can just install a PCIe card to do it on the basis that actually skilled programmers will implement the gateware for you. It's not even a good example; new important video codecs come out on the order of once every 7-10 years, which is not worth optimizing entire system designs for.
If you end up doing something like this, it's generally only because your workloads are extremely atypical, e.g. Google's Video Processing Units or whatever might be good as FPGAs but only because they're such outliers. Actually they're just using ASICs because that's more economical. But my point is there isn't actually enough of this to go around in a way that trickles down to the consumer.
It's similar to the question "Why doesn't my x86 CPU have 1,000 cores like a GPU" or "Why isn't all of my memory SRAM." Because it just isn't actually useful or what anyone actually wants and the costs are vastly disproportionate to the actual utility.
> This has been the persistent argument for many years from companies who say they can't release Open Source graphics drivers.
What? How can any company claim that with the patent thing at play? Wouldn't that just be admitting they're violating patents, therefore making the closed-sourceness reason moot in the first place?
Moreover, wouldn't any sufficiently interested patent holder just reverse-engineer the compiled binary and arrive at the supposed infringement on their own?
> Moreover, wouldn't any sufficiently interested patent holder just reverse-engineer the compiled binary and arrive at the supposed infringement on their own?
It's a MAD (mutually assured destruction) situation. You can rest assured that everyone knows about everyone else's rotting corpses in the storage locker... the first one to chicken out to the feds will get blasted to pieces just like everyone else.
My personal opinion is that today's patent systems can go and die in a fire for all I care, right after copyright.
You are correct in your points, I'd just like to add some more info.
The bitstream format is an obstacle but can be reversed. It's already been done for the Xilinx 7 series, Lattice ECP5, and others.
However, that does NOT solve the actual main problem - timing models.
Timing models are huge databases, hundreds of megabytes for a single FPGA, that provide exact routing delays and propagation delays for groups of functional gates in the fabric. They are developed over months of painstaking analysis, debugging and tooling by the vendor.
Timing models are what lets you say "please make this IP run at 166 MHz" and have the fitter know exactly how hard to work placing and connecting LUTs so that is possible.
Then, the timing analyzer will check the maximum frequency of that clock domain and ensure, using the timing models, that the specified frequency can be reached at all 4 process corners (PVT).
A typical design will have usually anywhere from 5 to 30 clock domains.
So if you have no support for the timing model, you effectively are not able to ever optimize your fitting process, and you have no idea if your FPGA will have its state machines explode when it gets a bit warmer than ambient.
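For a feel of what the timing model actually buys you, here is a heavily simplified static timing check; the path structure and delays are invented stand-ins, whereas a real model has per-corner numbers for every cell and routing segment:

    # Toy static timing analysis: sum worst-case cell + routing delays along
    # each register-to-register path and compare against the clock period.
    # Delays (ns) are invented stand-ins for a real timing model.
    TARGET_MHZ = 166
    period_ns = 1000 / TARGET_MHZ          # ~6.02 ns

    paths = {
        "fsm_state -> fifo_wr": [("clk->Q", 0.6), ("LUT6", 0.9), ("route", 1.6),
                                 ("LUT4", 0.7), ("route", 1.3), ("setup", 0.5)],
        "addr_cnt  -> ram_en":  [("clk->Q", 0.6), ("LUT6", 0.9), ("route", 3.4),
                                 ("LUT6", 0.9), ("route", 1.8), ("setup", 0.5)],
    }

    for name, segs in paths.items():
        delay = sum(d for _, d in segs)
        slack = period_ns - delay
        status = "OK  " if slack >= 0 else "FAIL"
        print(f"{status} {name}: {delay:.2f} ns, slack {slack:+.2f} ns")

    worst = max(sum(d for _, d in p) for p in paths.values())
    print(f"Fmax limited to {1000 / worst:.1f} MHz")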
> My guess: if the bitstream format was documented competitors would know how the device works and be able to prove their patents are being violated.
Maybe. There certainly is a lot of "secret sauce" energy around the bitstream formats. Primarily I think they guard the bitstream format to help ensure vendor lockin. Imagine if there were open tools that could easily target FPGAs from multiple vendors so that users could choose the most cost effective solution. The FPGA vendors don't want that.
No, an FPGA is always going to be a lot bigger than an ASIC that does the same thing, not least because the ASIC does not have to support programmability.
> My guess: if the bitstream format was documented competitors would know how the device works and be able to prove their patents are being violated.
Assuming the patentholder had sufficient and warranted suspicions, wouldn't they initiate legal action and get the actual source/hardware design files through discovery anyway?
They won't issue a subpoena for a fishing expedition, there has to be at least some evidence.
However, I don't think that this is a real issue, as competitors and the most skilled customers already mostly know how the devices work. Also, both Xilinx (AMD) and Altera (Intel for now, but it looks like they might spin it out) have so many patents that it's probably mutually assured destruction and a huge gamble if either sues the other. I think they just prefer having the tools proprietary, not just to lock the customers in (though they like that) but also to keep people away from the hairy corners (avoiding defects in the hardware design that could produce bad results or fry the chips).
It'll be interesting to see if AMD can help unleash what Xilinx was. At some point, leaving FPGAs as cloistered inaccessible technology - even well supported cloistered technology - becomes a risk.
There's been some incremental improvement in low-level Linux support across the past year. Good to see, but so far actually using FPGAs still means Vivado and closed systems. I think there's so much possibility left on the table by not supporting openfpga alternatives, not embracing yosys/openpnr/openroad/et al.
More integration of FPGAs on computers can be really good for exploiting relatively long pipelined algorithms that could run in parallel, especially signal processing ones.
GPUs are really fast but still run some fixed set of instructions, while on an FPGA you can practically design what "instruction" is going to run. From the limited experience I had with CUDA, the complexity of your algorithm and how many branches your code has can make it a lot slower than running it on a CPU, no matter the number of CUDA cores you have.
It would be incredibly cool, in some near-to-mid-term future, to be able to run an application that runs some kind of code on the FPGA, like applications already do with the GPU, to solve certain kinds of problems: audio, image and video processing, and probably machine learning (I don't know too much about current implementations). And with this user-space interface, it could come even earlier than I thought.
I'm all gung-ho for FPGAs since that's a huge part of my job, but I have to admit that GPUs will always be easier, faster, and more convenient for just about everything you'd want to do as some type of accelerator. There would only be exceptions for power usage, and maybe peak throughput.
Where FPGAs shine best is in embedded or very obscure applications where GPUs are way too big, power hungry, or cannot support enough parallelization of your specific algo.
I only know about Xilinx specifically, but even given the support for runtime updates to the FPGA logic that already exists in the Xilinx kernels, there are serious roadblocks to updating only portions of the FPGA array.
The logic must be very carefully implemented to allow over-writing only specific regions of the FPGA.
If you're re-flashing the entire logic array, it's pretty straightforward. If you want to leave some logic in and add additional functionality, it is much more difficult.
This is really a limitation of the FPGA logic array implementation, not a Linux shortcoming...
Partial reconfiguration has been possible for twenty-something years, but only a small percentage of problems really benefit from it; see e.g. https://www.fpgakey.com/tutorial/section742
What percentage of problems would benefit from parallel reconfiguration? I'm raising money for an FPGA in my profile where programs aren't bottlenecked in FPGA reconfiguration but instead are reconfigured in parallel.
Without really deep experience (my FPGA time ended 15-ish years ago), I would say that the granularity and frequency of the swapped-out parts need to hit a sweet spot, and you need to be able to hide that latency. If I understand you correctly, you target the latter? But as with all pipelining, you need to have an early trigger or lookahead.
That would be very good for keeping devices useful far beyond the lifetime of their hardware codecs, too - in a decade, when H.266 comes out, you’ll still be able to watch YouTube etc. on your “vintage” computer.
This proposal, together with the CXL initiative, will I hope seamlessly transform the hardware-centric computing world into a software-centric one.
We badly need them for software-defined things (SD-X), namely software-defined radio (SDR), networking (SDN), instrumentation (SDI), etc., and Linux is at the forefront of the transformation.
On-the-fly bitstream programming of FPGAs from a parallel Linux board used to mean holding the FPGA in reset and bit-banging the entire bitstream via the FPGA's passive or active serial interface, which is a blocking process that takes quite a bit of time.
Possible fears about compromising the host system are of course justified. But such fears are appropriate for all DMA-capable devices, for example.
FPGAs are general purpose in a broader sense than CPUs or (GP)GPUs. And, no other general purpose computing device is able to transport and compute incoming data in a pipeline in a few cycles (10, 100, whatever) from input to output. 30Gbyte per second? No problem. Under certain circumstances even guaranteed on a normal PC. A general interface such as a BIOS mapped into the memory would be interesting for many applications.
Apple has taken a good step with the Afterburner (like others before it in the music and movie sector).
After the coin hype, FPGAs once again disappeared into the niche of high-frequency trading, only to become popular in the low-price spectrum of chips for console emulators in recent years. No cloud provider has managed to bring FPGAs to the masses. Intel and AMD have bought the market leaders and have blown all efforts to achieve something in the data centers.
AMD seems to be taking the first step: introduction of a general interface. The directly necessary next step would be something like CUDA (apart from the IP, the name once stood for "Compute Unified Device Architecture") instead of VHDL or Verilog. It doesn't even have to be open source. It has to be something where you can compile a demo program (Mandelbrot and RISC-V and 68030) in five minutes and get it running (and don't always tell everyone that it's called "synthesizing"). On a PC with a typical graphics card, place & route would be much faster. But not with a Tcl/Tk interface in Java as with Vivado.
Yeah, the existing FPGA device subsystem has been pretty much worthless for upstream kernels for years now and the lack of a good userspace API is a big reason, so in practice everyone just bitbangs the bitstream over some QSPI interface and pulls the reset manually. Hopefully this can see the light of day at some point.
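For anyone who hasn't seen it, that bit-bang fallback is roughly the following (pin names follow the Altera-style passive serial convention, Xilinx slave serial is analogous, and set_pin/get_pin are hypothetical GPIO helpers standing in for libgpiod or a memory-mapped GPIO block):

    # Sketch of the classic bit-bang fallback: hold the FPGA in configuration
    # reset, then clock the whole bitstream out one bit at a time.
    import time

    def set_pin(name: str, value: int) -> None:
        pass                               # placeholder: drive a GPIO line

    def get_pin(name: str) -> int:
        return 1                           # placeholder: pretend the pin reads high

    def program_passive_serial(bitstream: bytes) -> None:
        set_pin("nCONFIG", 0)              # put the FPGA into configuration reset
        time.sleep(0.001)
        set_pin("nCONFIG", 1)              # release: device samples DATA0 on each DCLK edge
        for byte in bitstream:
            for i in range(8):             # LSB first for passive serial
                set_pin("DATA0", (byte >> i) & 1)
                set_pin("DCLK", 1)
                set_pin("DCLK", 0)
        if not get_pin("CONF_DONE"):       # CONF_DONE going high signals a successful load
            raise RuntimeError("FPGA did not assert CONF_DONE")

    program_passive_serial(b"\x00" * 16)   # stand-in for a real multi-megabyte bitstream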
With FPGA, unlike with GPUs, you can achieve significant speedup of algorithms where parallelization is very difficult. This is thanks to a technique called pipelining, where you can perform several steps of a sequential computation at the same time.
An example of this is video decoding/encoding, which is commonly implemented by dedicated hardware.
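A tiny model of why pipelining pays off even when the algorithm itself is strictly sequential; the stage count and item count are arbitrary:

    # Compare a unit that processes one item start-to-finish against a
    # pipeline that keeps one item in every stage at once.
    STAGES = 8          # e.g. transform, quantize, entropy-code steps of an encoder
    ITEMS = 1_000_000   # blocks/pixels/samples to process

    sequential_cycles = ITEMS * STAGES           # each item occupies the unit for all stages
    pipelined_cycles = STAGES + (ITEMS - 1)      # fill the pipe once, then 1 result/cycle

    print(f"sequential: {sequential_cycles:,} cycles")
    print(f"pipelined : {pipelined_cycles:,} cycles "
          f"(~{sequential_cycles / pipelined_cycles:.1f}x throughput)")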
Aside from hardware prototyping, this seems to be the primary benefit of FPGAs. Much like if you assembled a system from glue logic, all of the operations in the pipeline and across the entire subsystem can happen simultaneously and often asynchronously with the only limitations being on clocked devices like flipflops and gated latches... and obviously whatever propagation delays the logic gates have (along with any protection from race conditions)
Admittedly, I haven't had the opportunity to play with FPGAs very often in my professional career, but the limited experience I've had with them showed me you need an entirely different mindset when programming in a hardware description language. With assembly everything is global, and similarly with FPGAs almost everything is asynchronous and parallel.
If you're thinking in terms of instructions or mathematics operations, you're thinking too high level for what an FPGA is and does.
FPGAs are oftentimes used for prototyping hardware designs, but also for bespoke parallelized or hardware operations that are too low-volume to justify ASIC development.
I've had gobs of fun writing systolic algorithms for FPGAs. Basically, "what if I could reconfigure my GPU with a per-core custom instruction set and local connectivity."
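For a flavor of that systolic style, here's a minimal 1-D systolic FIR convolution simulated in plain Python (not HDL), where each cell holds one tap and only talks to its neighbour:

    # 1-D systolic FIR: K cells, each holding one tap. Samples (x) move
    # right at half speed (two registers per cell), partial sums (y) move
    # right at full speed, so each sum meets a new sample at every cell.
    K = 3
    w = [2, 1, 3]                              # taps held inside the cells
    stream = [1, 2, 3, 4, 5, 0, 0, 0, 0]       # input samples, then flush

    x1 = [0] * K                               # per-cell x registers (stage 1)
    x2 = [0] * K                               # per-cell x registers (stage 2)
    yreg = [0] * K                             # per-cell partial-sum registers
    outputs = []

    for sample in stream:
        # Combinational part of the cycle: each cell adds its tap's contribution.
        x_in = [sample] + [x2[i - 1] for i in range(1, K)]
        y_in = [0] + [yreg[i - 1] for i in range(1, K)]
        y_out = [y_in[i] + w[i] * x_in[i] for i in range(K)]
        outputs.append(y_out[-1])
        # Clock edge: everything advances one register.
        x2 = x1[:]
        x1 = x_in[:]
        yreg = y_out[:]

    # After K-1 cycles of latency the outputs are 2*x[n] + 1*x[n-1] + 3*x[n-2].
    print(outputs)   # [0, 0, 2, 5, 11, 17, 23, 17, 15]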
Tailor the hardware to your particular application. Depending on the complexity/FPGA, your application can reside in the hardware, or it even is the hardware.
AWS's FPGA instances (essentially a big Xilinx on a PCIe board) are probably a good place to start here; any standard probably should allow for the multiple sorts of interfaces that those guys provide.
AMD GPU drivers are in the Linux kernel. They support even very old hardware. And Linux only drops drivers when there are no users, so AMD cards stay supported for basically their entire lifetime.