
Both Intel and AMD produce server chips with 12-channel memory these days (that's 12x64-bit for a 768-bit bus), which combined with DDR5 can push effective socket bandwidth beyond 800GB/s, well into the territory occupied by single GPUs.

You can even find attractive deals on motherboard/RAM/CPU bundles built around grey-market engineering-sample CPUs on AliExpress, with good reports about usability under Linux.

Building a whole new system like this is not as simple as plugging a GPU into an existing machine, but you benefit from upgradeable memory and from not having to touch anything like CUDA. llamafile, as an example, really benefits from the AVX-512 available in recent CPUs. LLM inference is memory bandwidth bound, so it doesn't take many CPU cores to keep the memory bus full.
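Back-of-envelope sketch of that claim (the numbers below are illustrative assumptions, not benchmarks): generating one token streams roughly the entire set of weights through the memory bus once, so peak bandwidth divided by model size gives a hard ceiling on tokens per second, no matter how many cores you throw at it:

    # Token generation speed is roughly memory bandwidth divided by the
    # bytes streamed per token (about one full pass over the weights).
    def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    # 12-channel DDR5 socket (~800 GB/s peak) vs a 70B model at 8 bits (~70 GB):
    print(tokens_per_second(800, 70))  # ~11 tokens/s theoretical ceiling
    # Same model on a dual-channel desktop (~50 GB/s):
    print(tokens_per_second(50, 70))   # ~0.7 tokens/s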

Another benefit is that you can get a large amount of usable high-bandwidth memory at relatively low total system power. Some of AMD's 12-channel parts fit in a 200W system power budget, less than a single high-end GPU.



My desktop machine has had 128GB since 2018, but for the AI workloads currently commanding almost infinite market value, it really needs the 1TB/s bandwidth and teraflops that only a bona fide GPU can provide. An early AMD GPU with these characteristics is the Radeon VII with 16GB of HBM2, which I bought for 500 eur back in 2019 (!!!).

I'm a rendering guy, not an AI guy, so I really just want the teraflops, but all GPU users urgently need a third market player.


That 128GB is hanging off a dual-channel memory bus that is only 128 bits wide, which is why you need the GPU. The EPYC and Xeon CPUs I'm discussing have 6x the memory bandwidth, and will trade blows with that GPU.
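For a rough sense of scale (a sketch assuming DDR4-3200 on the desktop and DDR5-4800 on the server, both common configurations):

    # Peak bandwidth = channels * (bus width in bytes) * transfers/second
    def peak_gb_s(channels: int, bus_bits: int, mt_s: int) -> float:
        return channels * (bus_bits / 8) * mt_s * 1e6 / 1e9

    print(peak_gb_s(2, 64, 3200))   # desktop dual-channel DDR4: ~51 GB/s
    print(peak_gb_s(12, 64, 4800))  # 12-channel DDR5 server:  ~461 GB/s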


At a mere 20x the cost or something, to say nothing of the motherboard etc. :( 500 eur for 16GB at 1TB/s with tons of fp32 (and even fp64! The main reason I bought it) back in 2019 is no joke.

Believe me, as a lifelong hobbyist-HPC kind of person, I am absolutely dying for an HBM/fp64 deal like that again.


$1,961.19: H13SSL-N Motherboard And EPYC 9334 QS CPU + DDR5 4*128GB 2666MHZ REG ECC RAM Server motherboard kit

https://www.aliexpress.us/item/3256807766813460.html

Doesn't seem like 20x to me. I'm sure spending more than 30 seconds searching could find even better deals.


Isn't 2666 MT/s ECC RAM obscenely slow? And 32 cores without the fast AVX-512 of Zen 5 isn't what anyone is looking for in terms of floating-point throughput (ask me about electricity prices in Germany). For that money I'd rather just take a 4090 with 24GB of memory and do my own software fixed point or floating point (which is exactly what I do, personally and professionally).
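A quick sanity check on that (illustrative, assuming 64-bit DIMM channels and the same peak-bandwidth arithmetic as above):

    def peak_gb_s(channels: int, bus_bits: int, mt_s: int) -> float:
        return channels * (bus_bits / 8) * mt_s * 1e6 / 1e9

    print(peak_gb_s(12, 64, 2666))  # all 12 channels populated: ~256 GB/s
    print(peak_gb_s(4, 64, 2666))   # only the 4 bundled DIMMs:  ~85 GB/s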

This is exactly what I meant about Intel's recent launch. Imagine if they went full ALU-heavy on the latest TSMC process and packaged 128GB with it, for like 2-3k eur. Nvidia would be whipping their lawyers to try to do something about that, not just their engineers.


Yes and no. I have been developing local Llama 3 inference software on a machine with 3200 MT/s ECC RAM and a Ryzen 7 5800X:

https://github.com/ryao/llama3.c

My experience is that input processing (prompt processing) is compute bottlenecked in GEMM, where AVX-512 would help (although my CPU's Zen 3 cores do not support it) and memory bandwidth does not matter very much. For output generation (token generation), memory bandwidth is the bottleneck and AVX-512 would not help at all.
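A rough roofline-style sketch of why (illustrative, ignoring activation traffic): prompt processing applies each loaded weight to many tokens at once, while generation applies it to just one, so the flops available per byte of memory traffic collapse:

    # Arithmetic intensity = flops per byte of weights moved. Each weight
    # byte loaded does ~2 flops (multiply + add) per token in the batch.
    def arithmetic_intensity(batch_tokens: int, bytes_per_weight: float = 1.0) -> float:
        return 2 * batch_tokens / bytes_per_weight

    print(arithmetic_intensity(512))  # prompt phase, 512 tokens: 1024 flops/byte -> compute bound
    print(arithmetic_intensity(1))    # generation, 1 token:      2 flops/byte    -> bandwidth bound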


I don't think anyone's stopping you, buddy. Great chat. I hope you have a nice evening.


12-channel DDR5 is actually 12x32-bit. JEDEC in its wisdom decided to split the 64-bit channel of earlier DDR generations into two 32-bit channels per DIMM. Reaching a 768-bit memory bus with DDR5 therefore requires 24 channels.

Whenever I see DDR5 memory channel counts discussed, I am never sure whether the speaker is counting the two 32-bit channels per DIMM or not.
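To see why the two conventions describe the same hardware (a sketch, using a hypothetical 12-DIMM-channel socket along the lines of an EPYC Genoa):

    dimm_channels = 12               # physical DIMM channels on the socket
    subchannels = dimm_channels * 2  # JEDEC DDR5: 2 x 32-bit channels per DIMM

    print(dimm_channels * 64)  # 768-bit bus if "channel" means a 64-bit DIMM channel
    print(subchannels * 32)    # same 768 bits, described as "24 channels" in JEDEC terms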



