
  I also wonder why they have not been acquired yet. Or is it intentional?
A few issues:

1. To achieve high speeds, they put everything on SRAM. I estimated that they needed over $100m of chips just to do Qwen 3 at max context size. You can run the same model with max context size on $1m of Blackwell chips but at a slower speed. Anandtech had an article saying that Cerebras was selling a single chip for around $2-3m. https://news.ycombinator.com/item?id=44658198

2. SRAM has virtually stopped scaling in new nodes. Therefore, new generations of wafer scale chips won’t gain as much as traditional GPUs.

3. Cerebras was designed in the pre-ChatGPT era, when much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. It can only do inference, but see the two problems above.

4. To run inference on very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer scale chip greatly diminish. Remember that the whole idea was to put the entire AI model inside the wafer so memory bandwidth is ultra fast.

5. Chip interconnect technology might make wafer scale chips redundant. TSMC has a roadmap for gluing more than 2 GPU dies together, and Nvidia's Feynman GPUs might have 4 dies glued together. I.e., the sweet spot for large chips might not be wafer scale but perhaps 2, 4, or 8 dies together.

6. Nvidia seems to be moving much faster in terms of development and responding to market needs. For example, Blackwell is focused on FP4 inferencing now. I suppose designing and building a wafer scale chip is inherently more complex than designing a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields can be higher.

There exists a niche where some applications might need super fast token generation regardless of price. Hedge funds and Wall Street might be good use cases. But it won't challenge Nvidia in training or large scale inference.



> I estimated that they needed over $100m of chips just to do Qwen 3 at max context size

I will point out (again :) ) that this math is completely wrong. There is no need (nor any performance gain) to store the entire weights of the model in SRAM. You simply keep n transformer blocks on-chip and stream block l+n from external memory when you start computing block l; this completely masks the communication time behind the compute time (rough sketch below), and specifically does not require you to buy 100M$ worth of SRAM. This is standard stuff that is done routinely in many scenarios, e.g. FSDP.

https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
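
For concreteness, here is a minimal PyTorch-style sketch of the prefetching idea, using a GPU+CPU pair as a stand-in for SRAM + external memory. The names and structure are made up for illustration; this is not Cerebras' actual software.

    # Sketch: keep a sliding window of transformer blocks in fast memory and
    # prefetch block i+depth while computing block i (names are illustrative).
    import torch

    def run_streamed(layer_weights, layer_fn, x, depth=2, device="cuda"):
        """layer_weights: list of dicts of (ideally pinned) CPU tensors;
        layer_fn(x, weights) -> x computes one transformer block."""
        copy_stream = torch.cuda.Stream()
        resident, ready = {}, {}

        def prefetch(i):
            # Issue an async host->device copy of a *future* block on a side stream.
            if i < len(layer_weights) and i not in resident:
                with torch.cuda.stream(copy_stream):
                    resident[i] = {k: t.to(device, non_blocking=True)
                                   for k, t in layer_weights[i].items()}
                    ready[i] = torch.cuda.Event()
                    ready[i].record(copy_stream)

        for i in range(depth):                   # warm up the window
            prefetch(i)
        for i in range(len(layer_weights)):
            prefetch(i + depth)                  # fetch a future block, not the next one needed
            torch.cuda.current_stream().wait_event(ready[i])  # block i must have landed
            x = layer_fn(x, resident.pop(i))     # compute block i, then drop its on-device copy
        return x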


That blog is about training. For inference, the weights and kv cache are in SRAM. Having said that, the $100M number is inaccurate/meaningless. It's a niche product that doesn't have economies of scale yet.


The blog is about training but the technique applies equally well to inference, just like FSDP and kv cache sharing are routinely done in inference on GPUs.

There is just no need to have the parameters or kv cache for layer 48 in SRAM when you are currently computing layer 3; you have all the time in the world to move that to SRAM when you get to layer 45, or wherever the math works out to be for your specific model.


I did experiments with this on a traditional consumer GPU, and the larger the discrepancy between model size and VRAM, the faster performance dropped off (exponentially), until it was as if you didn't have any VRAM in the first place and everything ran over PCIe. This technique is well known, and it works when you have more than enough bandwidth.

However, the whole point is that even HBM is a problem: its available bandwidth is insufficient. So if you're marrying SRAM and HBM, I would expect the performance gains to be modest overall for models that exceed the available SRAM in a meaningful way.


This is highly dependent on the exact model size, architecture, and hardware configuration. If the compute time for some unit of work is larger than the time it takes to transfer the next batch of params, you are good to go (rough back-of-envelope sketch below). If you fetch sequentially, then yes, you will pay a heavy price, but the idea is to fetch a future layer, not the one you need right away.

As a similar example, I have trained video models on ~1000 H100s where the vast majority of parameters are sharded and so need to first be fetched over the network before being available in HBM, which is a similar imbalance to the HBM vs SRAM story. We were able to fully mask the comms time, such that not sharding (if it were even possible) would offer no performance advantage.
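
To make the "compute time vs transfer time" condition concrete, here's a rough back-of-envelope helper. All the numbers below are illustrative placeholders, not real Cerebras or model specs.

    # Back-of-envelope: streaming is "free" if moving one layer's weights takes
    # no longer than computing the current layer on the batch in flight.

    def streaming_is_free(layer_flops_per_token, layer_bytes, batch_tokens,
                          usable_flops_per_s, mem_bw_bytes_per_s):
        compute_s = layer_flops_per_token * batch_tokens / usable_flops_per_s
        transfer_s = layer_bytes / mem_bw_bytes_per_s
        return transfer_s <= compute_s, compute_s, transfer_s

    # Example: ~0.5B params per layer at fp16 (1 GiB, ~1 GFLOP/token), 100 TFLOP/s
    # of usable compute, 1 TB/s to external memory, 256 tokens in flight.
    ok, c, t = streaming_is_free(1e9, 2**30, 256, 100e12, 1e12)
    print(f"overlap hides the transfer: {ok} "
          f"(compute {c*1e3:.2f} ms vs transfer {t*1e3:.2f} ms)")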


What about for inference?

In that same thread, a Cerebras executive disputed my $100m number but did not dispute that they store the entire model in SRAM.

They can make chips at cost and claim it isn’t $100m. But Anandtech did estimate/report $2-3m per chip.


> What about for inference?

Same techniques apply.

> but did not dispute that they store the entire model on SRAM.

No idea what they did or did not do for that specific test (which was about delivering 1800 tokens/sec, not simply running Qwen-3), since they didn't provide any detail. I don't think there is any point in storing everything in SRAM, even if you do happen to have 100M$ worth of chips lying around in a test cluster at the office, since WSE-3 is designed from the ground up for data parallelism (see [1] section 3.2) and inference is sequential both within a single token generation (you need to go through layer 1 before you can go through layer 2, etc.) and between tokens (autoregressive, so token 1 before token 2). This means most of the weights loaded in SRAM would just be sitting unused most of the time, and when they do need to be used they have to be broadcast to all chips from the SRAM of the chip that holds the particular layer you care about. That is extremely fast, but external memory is certainly fast enough to do this if you fetch the layer in advance. So the way to get the best ROI on such a system is to pack the biggest batch size you can (so many users' queries), process them all in parallel, and stream the weights as needed. The more your SRAM is occupied by batch activations rather than parameters, the better the compute density and thus $/FLOPS.

You can check the Cerebras docs to see how weight streaming works [2]. From the start, one of the selling points of Cerebras has been the possibility to scale memory independently of compute, and they have developed an entire system specifically for streaming weights from that decoupled memory. Their docs keep things fairly simple, assuming you can only fit one layer in SRAM and thus fetching things sequentially, but if you can store at least 2 layers in those 44GB of SRAM then you can simply fetch layer l+1 when layer l starts computing, completely masking the latency cost. It's possible they already mask the latency even within a single layer by streaming tile by tile for the matmul, though that's unclear from their docs; they mention it in passing in [3] section 6.3.

All of their docs are for training, since for their inference play they seem to have pivoted to selling API access rather than chips, but inference is really the same thing, just without the backprop (especially in their case, where they aren't doing pipeline parallelism, where you could claim doing fwd+backprop gives you better compute density). At the end of the day, whether you are doing training or inference, all you care about is that your cores have the data they need in their registers at the moment they are free to compute, so streaming to SRAM works the same way in both cases.

Ultimately I can't tell you how much it costs to run Qwen-3. You can certainly do it on a single chip + weight streaming, but their specs are just too light on the exact FLOPS and bandwidth to know what the memory movement cost would be in this case (if any), and we don't even know the price of a single chip (everyone is saying 3M$ though, regardless of that comment in the other thread). But I can tell you that your math of doing `model_size/sram_per_chip * chip_cost` just isn't the right way to think about this, and so the 100M$ figure doesn't make sense.

[1]: https://arxiv.org/html/2503.11698v1#S3.

[2]: https://training-api.cerebras.ai/en/2.1.0/wsc/cerebras-basic....

[3]: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533...


So then what explains such a low implied valuation at series G?

There's no way that could be the case if the technology were competitive.


I'm not saying it's particularly competitive; I'm saying that claiming it costs 100M$ to run Qwen is complete lunacy. There is a gulf between those two things.

And beyond pure performance competitiveness, there are many things that make it hard for Cerebras to actually be competitive: can they ship enough chips to meet the needs of large clusters? What about the software stack and the lack of great support compared to Nvidia? And the lack of ML engineers who know how to use them, when everyone knows how to use CUDA and there are many things built on top of it by the community (e.g. Triton)?

Just look at the valuation difference between AMD and Nvidia, even though AMD is already very competitive. Being 99% of the way there is still not enough for customers who are going to pay 5B$ for their clusters.


> SRAM has virtually stopped scaling in new nodes.

But there are several 1T memories that are still scaling, more or less — eDRAM, MRAM, etc. Is there anything preventing their general architecture from moving to a 1T technology once the density advantages outweigh the need for pipelining to hide access time?


I’m pretty sure that HBM4 can be 20-30x faster in terms of bandwidth than eDRAM. That makes eDRAM not an option for AI workloads since bandwidth is the main bottleneck.


HBM4 is limited to a few thousand bits of width per stack. eDRAM bandwidth scales with chip area. A full-wafer chip could have astonishing bandwidth.


No, only Groq uses the all-SRAM approach; Cerebras only uses SRAM for local context while the weights are still loaded from RAM (or HBM). With 48 KB per core, the whole wafer has only 44 GB of SRAM, much less than the amount needed to load the whole network.


In order to achieve the speeds they're claiming, they are putting the entire model across multiple chips, entirely in SRAM.


Why are you making claims based on false facts instead of checking their documents?


This is actually completely unnecessary in the batched inference case.

Here is an oversimplified explanation that gets the gist across:

The standard architecture for transformer based LLMs is as follows: Token Embedding -> N Layers each consisting of an attention sublayer and an MLP sublayer -> Output Embedding.

Most attention implementations use a simple KV caching strategy. In prefill you first calculate the KV cache entries by performing GEMM against the W_K, W_V, W_Q tensors. For token generation, you only need to calculate against the current token. Next comes the quadratic part of attention: you need to calculate softmax(Q K^T)V. This is two matrix multiplications, and for generating the next token it has a linear cost with respect to the number of entries in the KV cache, as you need to re-read the entire KV cache plus the new entry. For prefill you are processing n tokens, so the cost is quadratic. The KV cache is unique for every user session, and it grows with the size of the context. This makes the KV cache really expensive memory-wise: it consumes both capacity and bandwidth, and it also doesn't permit batching.
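
As a rough illustration of the KV cache bookkeeping described above, here is a bare-bones single-head decode step in numpy (no batching, no masking; shapes are hypothetical):

    # Single-head attention decode step with a KV cache (numpy sketch).
    import numpy as np

    def decode_step(x, W_Q, W_K, W_V, k_cache, v_cache):
        """x: (d,) current token vector; k_cache/v_cache: lists of (d_head,) rows."""
        q, k, v = x @ W_Q, x @ W_K, x @ W_V          # project only the new token
        k_cache.append(k); v_cache.append(v)         # cache grows with context, per session
        K, V = np.stack(k_cache), np.stack(v_cache)  # (seq_len, d_head): re-read every step
        scores = K @ q / np.sqrt(q.shape[-1])        # (seq_len,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                                 # softmax over the whole cache
        return w @ V                                 # (d_head,) attention output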

Meanwhile the MLP sublayer is so boring I won't bother going into the details, but the gist is that you have a simple gating network: two feed-forward layers project the token vector into a higher dimension (i.e. more outputs than inputs), known as the up and gate projections; you element-wise multiply these two vectors and then feed the result into a down projection, which reduces it back to the original dimension of the token vector. Since the matrices are always the same, you can process the tokens of multiple users at once.
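
In code, the gated MLP looks roughly like this (numpy sketch; the SiLU activation is an assumption, following Llama/Qwen-style models, since no activation is named above):

    # Gated MLP sublayer: up/gate projections to a wider dimension,
    # element-wise combine, then a down projection back to the model dimension.
    import numpy as np

    def gated_mlp(X, W_up, W_gate, W_down):
        """X: (n_tokens, d). Tokens from many users can share one batch,
        because the weights are identical for every token."""
        up = X @ W_up                                # (n_tokens, d_ff), d_ff > d
        gate = X @ W_gate                            # (n_tokens, d_ff)
        hidden = up * gate / (1.0 + np.exp(-gate))   # element-wise, SiLU-gated (assumed)
        return hidden @ W_down                       # back to (n_tokens, d)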

Now here are the implications of what I wrote above: prefill is generally compute bound and is therefore mostly uninteresting, or rather, interesting mainly for ASIC designers, because FLOPS are cheap and SRAM is expensive. Token generation, meanwhile, is a mix of memory bandwidth bound and compute bound in the batched case. The MLP layer is trivially parallelized through GEMM-based batching. Having lots of SRAM is beneficial for GEMM, but it is not super critical in a double-buffered implementation that performs loading and computation simultaneously, with the memory bandwidth chosen so that both finish at roughly the same time.

What SRAM buys you for GEMM is the following: given two square matrices A, B and their output A*B = C, all of the same dimensions, where A and B are both 1 GiB in size, and x MiB of SRAM, you tile the GEMM operation so that each sub-matrix is x/3 MiB in size. Let's say x = 120 MiB, which means 40 MiB per matrix. You will split the matrices A and B into approximately 25 tiles each. For every tile in A, you have to load all the tiles in B, meaning 25 (A) + 25*25 (A*B) = 650 load operations of 40 MiB tiles, for a total of 26000 MiB read. If you double the SRAM, you now have 13 tiles of size 80 MiB: 13 + 13*13 = 182 loads, and 182 * 80 MiB = 14560 MiB. Loosely speaking, doubling SRAM reduces the needed memory bandwidth by half. This is boring old linear scaling, because fewer tiles also means bigger tiles, so the quadratic gain of a ~4x reduction in the number of loads is partly offset by 2x bigger load operations. Having more SRAM is good though.
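
The tiling arithmetic above as a small helper you can play with. It uses ceil for the tile count, so the totals land slightly above the hand-rounded numbers quoted, but the "double the SRAM, roughly halve the traffic" conclusion is the same:

    # Bytes read from external memory for A @ B when each matrix is matrix_mib
    # and SRAM holds one A tile, one B tile and one C tile of sram_mib/3 each.
    import math

    def gemm_traffic_mib(matrix_mib=1024, sram_mib=120):
        tile_mib = sram_mib / 3
        n_tiles = math.ceil(matrix_mib / tile_mib)   # tiles per matrix
        loads = n_tiles + n_tiles * n_tiles          # A tiles once, all B tiles per A tile
        return loads * tile_mib

    print(gemm_traffic_mib(sram_mib=120))   # ~28000 MiB
    print(gemm_traffic_mib(sram_mib=240))   # ~14500 MiB, roughly half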

Now onto Flash Attention. If I had to dumb down flash attention, it's a very quirky way of arranging two GEMM operations to reduce the amount of memory allocated to the intermediate C matrix of the first Q*K^T multiplication. Otherwise it is the same as two GEMMs with smaller tiles. Doubling SRAM halves the necessary memory bandwidth.
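
A toy version of that arrangement, for illustration only: process K/V in tiles and keep a running max and running sum per query, so the full score matrix never has to be materialized (no causal mask, single head):

    # Toy "flash-style" tiled attention in numpy.
    import numpy as np

    def tiled_attention(Q, K, V, tile=128):
        n, d = Q.shape
        out = np.zeros((n, V.shape[1]))
        running_max = np.full(n, -np.inf)
        running_sum = np.zeros(n)
        for s in range(0, K.shape[0], tile):
            Kt, Vt = K[s:s+tile], V[s:s+tile]
            S = Q @ Kt.T / np.sqrt(d)                  # scores for this tile only
            new_max = np.maximum(running_max, S.max(axis=1))
            scale = np.exp(running_max - new_max)      # rescale what we accumulated so far
            P = np.exp(S - new_max[:, None])
            running_sum = running_sum * scale + P.sum(axis=1)
            out = out * scale[:, None] + P @ Vt
            running_max = new_max
        return out / running_sum[:, None]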

Final conclusion: in the batched multi-user inference case, your goal is to keep the KV cache in SRAM for attention nodes, and to achieve as large a batch size as possible for MLP nodes while using the SRAM to operate on tiles that are as large as possible. If you achieve both, then the required memory bandwidth scales inversely with the amount of SRAM. Storing full tensors in SRAM is not necessary at large batch sizes.

Of course, since I only looked at the memory aspects, it should be noted that you also need to evenly match compute and memory resources. Having SRAM on its own doesn't really buy you anything.


In batched inference Cerebras has no advantage but costs more, AFAIU.


But apparently they serve all the models super fast and in production, so you must be wrong somewhere.



