It doesn't matter if the "cost is driven up". Nvidia has proven that we're all lil pay pigs for them. The 5090 will be $3000 for 32GB of VRAM. Screenshot this now; it will age well.
You are absolutely correct, and even my non-prophetic ass echoed exactly the first sentence of the top comment in this HN thread ("Why don't they just release a basic GPU with 128GB RAM and eat NVidia's local generative AI lunch?").
Yes, yes, it's not trivial to build a GPU with 128GB of memory, with cache tags and so on, but is that really in the same universe of complexity as taking on Nvidia and their CUDA / AI moat any other way? Did Intel ever give the impression they don't know how to design a cache? There really has to be a GOOD reason for this; otherwise everyone involved with this launch is either just plain stupid or getting paid off not to pursue this.
Saying all this with infinite love and 100% commercial support of OpenCL since version 1.0, as a great enjoyer of the A770 with 16GB of memory. I live to laugh in the face of people who claimed for over 10 years that OpenCL is deprecated on macOS (which I cannot stand and will never use, yet the hardware it runs on...), where it still routinely crushes powerful desktop GPUs, in reality and practice today.
Both Intel and AMD produce server chips with 12-channel memory these days (that's 12x64-bit for a 768-bit bus), which combined with DDR5 can push effective socket bandwidth beyond 800GB/s, well into the territory occupied by single GPUs.
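As a rough sanity check on that figure, here's the back-of-envelope peak-bandwidth arithmetic (the DDR5/MRDIMM transfer rates below are illustrative assumptions, not quotes from any spec sheet):

```python
def peak_bw_gbs(channels: int, bus_bits: int, mt_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s: channels x bus width x transfer rate."""
    return channels * (bus_bits / 8) * mt_per_s / 1e3

# 12 channels of 64 bits at DDR5-6400 (assumed rate)
print(peak_bw_gbs(12, 64, 6400))  # 614.4 GB/s
# the same bus at MRDIMM-8800 rates (assumed) clears 800 GB/s
print(peak_bw_gbs(12, 64, 8800))  # 844.8 GB/s
```

So hitting 800GB/s+ on 12 channels depends on the faster DDR5/MRDIMM speed grades, not base JEDEC rates.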
You can even find some attractive deals on motherboard/RAM/CPU bundles built around grey-market engineering-sample CPUs on AliExpress, with good reports about usability under Linux.
Building a whole new system like this is not exactly as simple as plugging a GPU into an existing system, but you also benefit from upgradeable memory and from not having to use anything like CUDA. llamafile, as an example, really benefits from the AVX-512 available in recent CPUs. LLM inference is memory-bandwidth bound, so it doesn't take many CPU cores to keep the memory bus full.
Another benefit is that you can get a large amount of usable high-bandwidth memory with a relatively low total system power draw. Some of AMD's parts with 12-channel memory fit in a 200W system power budget, less than a single high-end GPU.
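A rough way to see why bandwidth, not core count, dominates: each generated token has to stream essentially all of the model's weights through the CPU once, so token throughput is capped at roughly bandwidth divided by model size. A sketch with made-up example numbers (the bandwidth and model-size figures are assumptions for illustration):

```python
def est_tokens_per_sec(bw_gb_s: float, model_size_gb: float) -> float:
    # Memory-bandwidth roofline: every generated token reads all weights once,
    # so generation speed is capped at bandwidth / weight bytes.
    return bw_gb_s / model_size_gb

# hypothetical 70B-parameter model at 8-bit quantization (~70 GB of weights)
print(est_tokens_per_sec(51.2, 70))   # dual-channel desktop: ~0.73 tok/s
print(est_tokens_per_sec(460.8, 70))  # 12-channel server: ~6.6 tok/s
```

By this estimate a handful of cores already saturates the bus; extra cores mostly help prompt processing, not generation.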
My desktop machine has had 128GB since 2018, but for the AI workloads currently commanding almost infinite market value, it really needs the 1TB/s bandwidth and the teraflops that only a bona fide GPU can provide. An early AMD GPU with these characteristics is the Radeon VII with 16GB of HBM2, which I bought for 500 EUR back in 2019 (!!!).
I'm a rendering guy, not an AI guy, so I really just want the teraflops, but all GPU users urgently need a 3rd market player.
That 128GB is hanging off a dual-channel memory bus with only 128 bits of total width, which is why you need the GPU. The Epyc and Xeon CPUs I'm discussing have 6x the memory bandwidth and will trade blows with that GPU.
At a mere 20x the cost or something, to say nothing of the motherboard etc. :( 500 EUR for 16GB at 1TB/s with tons of fp32 (and even fp64! The main reason I bought it) back in 2019 is no joke.
Believe me, as a lifelong hobbyist-HPC kind of person, I am absolutely dying for such an HBM/fp64 deal again.
Isn't 2666 MT/s ECC RAM obscenely slow? 32 cores without the fast AVX-512 of Zen 5 isn't what anyone is looking for in terms of floating-point throughput (ask me about electricity prices in Germany), and for that money I'd rather just take a 4090 with 24GB of memory and do my own software fixed-point or floating-point (which is exactly what I do, personally and professionally).
This is exactly what I meant about Intel's recent launch. Imagine if they went full ALU-heavy on the latest TSMC process and packaged 128GB with it, for, like, 2-3k EUR. Nvidia would be whipping their lawyers to try to do something about that, not just their engineers.
My experience is that input processing (prompt processing) is compute-bottlenecked in GEMM. AVX-512 would help there (although my CPU's Zen 3 cores do not support it), and memory bandwidth does not matter very much. For output generation (token generation), memory bandwidth is the bottleneck and AVX-512 would not help at all.
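A back-of-envelope way to see why the two phases bottleneck differently: for a weight matrix applied to a batch of tokens, FLOPs scale with the batch size but the weight bytes streamed do not, so arithmetic intensity (FLOPs per byte) is roughly the batch size. A sketch assuming fp16 weights and a hypothetical 4096x4096 layer:

```python
def arithmetic_intensity(batch: int, d_in: int, d_out: int) -> float:
    # FLOPs: one multiply-add (2 FLOPs) per weight, per token in the batch
    flops = 2 * batch * d_in * d_out
    # Bytes: the fp16 weight matrix is streamed from memory once (2 bytes/weight)
    bytes_moved = 2 * d_in * d_out
    return flops / bytes_moved  # simplifies to exactly `batch`

print(arithmetic_intensity(1, 4096, 4096))    # token generation (GEMV): 1.0
print(arithmetic_intensity(512, 4096, 4096))  # prompt processing (GEMM): 512.0
```

At intensity ~1 the ALUs starve waiting on memory (bandwidth-bound); at ~512 the memory system easily keeps up and the ALUs become the limit, which is where AVX-512 pays off.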
12-channel DDR5 is actually 12x32-bit. JEDEC, in its wisdom, decided to split the 64-bit channels of earlier DDR versions into 2x 32-bit channels per DIMM. Reaching a 768-bit memory bus with DDR5 requires 24 channels.
Whenever I see DDR5 memory channels discussed, I am never sure if the speaker is accounting for the 2x 32-bit channels per DIMM or not.
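Either counting convention describes the same physical bus, so the bandwidth comes out identical; only the label changes. A quick check (DDR5-6400 is an assumed rate for illustration):

```python
mt_s = 6400  # assumed DDR5 transfer rate in MT/s

# 12 "DIMM channels" of 64 bits each, vs 24 JEDEC sub-channels of 32 bits each:
bw_dimm_channels = 12 * (64 / 8) * mt_s / 1e3   # GB/s
bw_jedec_channels = 24 * (32 / 8) * mt_s / 1e3  # GB/s

print(bw_dimm_channels, bw_jedec_channels)  # 614.4 614.4
```

The ambiguity only bites when someone quotes a raw channel count without the per-channel width.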
The question is whether there's enough overall demand for a GPU architecture with 4x the VRAM of a 5090 but only about 1/3 of the bandwidth. At that point it would really only be good for AI inference, so why not make specialized inference silicon instead?
Intel and Qualcomm are doing this, although Intel uses HBM and their hardware is designed to do both inference and training, while Qualcomm uses more conventional memory and their hardware is only designed to do inference.
They did not put it into the PC parts supply chain, for reasons known only to them. That said, it would be awesome if Intel made high-memory variants of their Arc graphics cards for sale through the PC parts supply chain.
We'd be happy to pay $5000 for 128GB from Intel.