
How do the architectural bottlenecks of modified von Neumann architectures and their debuggable instruction pipelines limit computational performance when scaling to larger amounts of off-chip RAM?

Tomasulo's algorithm also centralizes result broadcasts on a common data bus (the CDB, which sits inside the CPU rather than between CPU and RAM); like the CPU-RAM bus, it is a serialization point that must scale with what it serves.
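
A toy sketch of that serialization (the cycles_to_drain helper and all numbers are mine, purely illustrative): with a 1-wide CDB, completion throughput caps at one result per cycle no matter how many functional units finish simultaneously.

    # Toy model of Tomasulo's common data bus (CDB) as a serialization point:
    # k functional units may finish in the same cycle, but only cdb_width
    # results can broadcast per cycle. Helper and numbers are hypothetical.

    def cycles_to_drain(results_ready: int, cdb_width: int = 1) -> int:
        return -(-results_ready // cdb_width)  # ceiling division

    print(cycles_to_drain(8))               # 8 cycles: capped at 1 result/cycle
    print(cycles_to_drain(8, cdb_width=2))  # 4 cycles: widening the bus helps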

Can in-RAM computation handle error correction without falling back on redundant computation and consensus algorithms?

Can on-chip SRAM be built at lower cost?

Von Neumann architecture: https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N... :

> The term "von Neumann architecture" has evolved to refer to any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time (since they share a common bus). This is referred to as the von Neumann bottleneck, which often limits the performance of the corresponding system. [4]

> The von Neumann architecture is simpler than the Harvard architecture (which has one dedicated set of address and data buses for reading and writing to memory and another set of address and data buses to fetch instructions).
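
A back-of-envelope model of the two designs just quoted, assuming (hypothetically) that every transfer costs one bus cycle:

    # Von Neumann: a single shared bus carries both instruction fetches and
    # data accesses, so they serialize. Harvard: split buses overlap them.
    # All cycle counts here are hypothetical round numbers.

    def shared_bus_cycles(fetches: int, data_ops: int) -> int:
        return fetches + data_ops          # von Neumann: transfers serialize

    def split_bus_cycles(fetches: int, data_ops: int) -> int:
        return max(fetches, data_ops)      # Harvard: buses run in parallel

    f, d = 1000, 800
    print(shared_bus_cycles(f, d))  # 1800 cycles over one shared bus
    print(split_bus_cycles(f, d))   # 1000 cycles over split buses (1.8x here)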

Modified Harvard architecture > Comparisons: https://en.wikipedia.org/wiki/Modified_Harvard_architecture

C-RAM: Computational RAM > DRAM-based PIM Taxonomy, See also: https://en.wikipedia.org/wiki/Computational_RAM

SRAM: Static random-access memory https://en.wikipedia.org/wiki/Static_random-access_memory :

> Typically, SRAM is used for the cache and internal registers of a CPU while DRAM is used for a computer's main memory.




For whatever reason, Hynix hasn't turned their PIM into a usable product. LPDDR-based PIM is insanely effective for inference. I can't stress this enough. An NPU + LPDDR6 PIM would kill GPUs for inference.
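
Rough intuition for why, assuming batch-1 token generation must stream every weight from memory once per token (the tokens_per_second helper and all figures below are hypothetical round numbers):

    # Why inference is memory-bandwidth-bound: at batch size 1, generating
    # each token streams every weight from memory once, so tokens/s is
    # roughly memory_bandwidth / model_size_in_bytes. PIM raises effective
    # bandwidth by doing the multiply-accumulates next to the DRAM banks.

    def tokens_per_second(bandwidth_gb_s: float, params_billions: float,
                          bytes_per_param: float = 1.0) -> float:
        model_bytes = params_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / model_bytes

    # Hypothetical 8B-parameter model with 8-bit weights:
    print(tokens_per_second(100, 8))    # 12.5 tok/s at 100 GB/s (LPDDR-class)
    print(tokens_per_second(1000, 8))   # 125 tok/s at 1 TB/s effective (PIM-class)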


How many TOPS/W and TFLOPS/W? (tera [floating-point] operations per second per watt; since TOPS is already a rate, dividing by watts yields operations per joule, not per watt-hour)

/? TOPS/W and FLOPS/W: https://www.google.com/search?q=TOPS%2FW+and+FLOPS%2FW :

- "Why TOPS/W is a bad unit to benchmark next-gen AI chips" (2020) https://medium.com/@aron.kirschen/why-tops-w-is-a-bad-unit-t... :

> The simplest method therefore would be to use TOPS/W for digital approaches in future, but to use TOPS-B/W for analogue in-memory computing approaches!

> TOPS-8/W

> [ IEEE should spec this benchmark metric ]

- "A guide to AI TOPS and NPU performance metrics" (2024) https://www.qualcomm.com/news/onq/2024/04/a-guide-to-ai-tops... :

> TOPS = 2 × MAC unit count × Frequency / 1 trillion

- "Looking Beyond TOPS/W: How To Really Compare NPU Performance" (2023) https://semiengineering.com/looking-beyond-tops-w-how-to-rea... :

> TOPS = MACs * Frequency * 2

> [ { Frequency, NNs employed, Precision, Sparsity and Pruning, Process node, Memory and Power Consumption, utilization} for more representative variants of TOPS/W metric ]
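
Putting the quoted formula to work with a hypothetical MAC count, clock, and power budget; note that TOPS/W is just operations per joule:

    # Worked instance of the quoted formula: TOPS = 2 * MACs * frequency / 1e12
    # (each MAC counts as two ops: one multiply plus one add).
    # The MAC count, clock, and power budget below are hypothetical.

    def tops(mac_units: int, freq_hz: float) -> float:
        return 2 * mac_units * freq_hz / 1e12

    def tops_per_watt(mac_units: int, freq_hz: float, watts: float) -> float:
        # Dividing a rate (ops/s) by power (J/s) gives ops per joule.
        return tops(mac_units, freq_hz) / watts

    print(tops(16384, 1.5e9))                # ~49.2 TOPS: 16,384 MACs at 1.5 GHz
    print(tops_per_watt(16384, 1.5e9, 10))   # ~4.9 TOPS/W at a 10 W budget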


Is this fast enough for DDR or SRAM? "Breakthrough in avalanche-based amorphization reduces data storage energy 1e-9" (2024) https://news.ycombinator.com/item?id=42318944



