Nice work with this. I was wondering, are all computations other than convolution also performed on the FPGA, such as pooling, padding, and inter-layer quantization operations (rescaling and offset additions)? If not, does the FPGA offload unsupported operations to the host before continuing?
Does the FPGA need to transfer intermediate layer IO data back and forth to the host during GEMM if the data becomes too large to fit in the FPGA SRAM?
Thanks
Great questions! With Tensil, all computations are performed on the FPGA. In addition to matrix multiplication, Tensil supports a SIMD instruction with various operations. The ML activations, average and maximum pooling, normalization, and image resizing use the SIMD instruction. Some ML operations, such as padding, are achieved by changing the memory layout. Tensil uses DRAM0 and DRAM1 memory pools (usually in DDR memory) to interact with the host to read model inputs and weights and write outputs. It also uses these pools to offload intermediate results between layers and between tiles within a layer when the FPGA does not have sufficient BRAM, which is common on lower-end devices. The Tensil compiler takes care of finding the most efficient memory schedule for a given on-FPGA memory size.
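To make the tiling idea concrete, here is a minimal sketch (not Tensil's actual scheduler; the budget constant and tile size are made up for illustration) of a matmul tiled so each step's working set fits a fixed "on-chip" budget, with finished output tiles written back to a "DRAM" result buffer:

```python
# Hypothetical sketch: tile a matmul so the per-step working set fits a
# fixed on-chip budget, spilling results to the DRAM-resident output.
BRAM_WORDS = 4096  # assumed on-chip capacity, in matrix elements

def tiled_matmul(a, b, tile=16):
    m, k, n = len(a), len(a[0]), len(b[0])
    # one tile each of A, B, and the accumulator must fit on chip
    assert 3 * tile * tile <= BRAM_WORDS, "tile too large for on-chip memory"
    c = [[0.0] * n for _ in range(m)]  # lives in the DRAM pool
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # fetch one tile of A and one of B, accumulate on chip
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            s += a[i][p] * b[p][j]
                        c[i][j] = s  # partial sums go back to DRAM
    return c
```

The real compiler does much more (instruction scheduling, double-buffering, picking tile shapes), but the core constraint is the same: the tile size is chosen so the working set fits the on-chip memory, and everything else streams through the DRAM pools.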
Okay thanks, so are the DRAM0 and DRAM1 memory pools located in the host DDR memory, or are they part of separate DDR DRAM hardware on the FPGA board (kind of like how GPUs have their own separate DDR DRAM)? I definitely want to dive deeper into the source code of this project at some point and see how the compiler and everything works.
Edit: Sorry I think you already clarified that the DRAM0 & DRAM1 memory pools are located on the host
Something like the Alveo PCIe card has onboard HBM/DDR4 memory large enough for the Tensil DRAM pools, so this would be similar to how a GPU operates, but it could also reach into host memory via PCIe if needed. Embedded applications with Zynq 7 and UltraScale+ have ARM processors on the same chip as the FPGA and (usually) DDR as separate chips on the same PCB. In this case, the Tensil DRAM pools are just contiguous memory blocks in the memory shared with the CPU. We will be publishing documentation on the compiler design soon--stay tuned!
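For the embedded case, the pools can be pictured as reserved contiguous regions of the shared DDR that the driver hands out addresses from. A small sketch (the base addresses, sizes, and the bump allocator are all illustrative assumptions, not Tensil's actual memory map or driver API):

```python
# Hypothetical sketch: each DRAM pool as a reserved contiguous region of
# the DDR shared with the CPU, with a trivial bump allocator inside it.
class DramPool:
    def __init__(self, name, base, size):
        self.name, self.base, self.size = name, base, size
        self.offset = 0  # next free byte within the pool

    def alloc(self, nbytes):
        # carve a block out of the pool; fail if the region is exhausted
        assert self.offset + nbytes <= self.size, f"{self.name} exhausted"
        addr = self.base + self.offset
        self.offset += nbytes
        return addr

# illustrative reservations: two regions of the CPU-shared DDR
dram0 = DramPool("DRAM0", base=0x1000_0000, size=64 << 20)  # activations/IO
dram1 = DramPool("DRAM1", base=0x1400_0000, size=64 << 20)  # weights

weights_addr = dram1.alloc(10 << 20)
input_addr = dram0.alloc(2 << 20)
```

Since the region is ordinary CPU-visible memory, the host can write inputs and read outputs at these addresses directly, while the accelerator masters the same DDR through the fabric.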