
Memory bandwidth is just a marketing term for Apple at this point. Sure, the bus is capable of reaching that bandwidth, but how much can your code actually use? You'd be mistaken if you think the CPU can make use of all that bandwidth, or even the GPU!


It's entirely dependent on the workload's memory access patterns. The higher you go in thread count, the more you're constrained by contention, cache behaviour, etc. The paper in OP demonstrates how relatively subtle differences in the memory model lead to substantial differences in performance on actual hardware. It's the same as having lots of FLOPS on paper: it doesn't necessarily mean you'll get to use all that compute if you're waiting on memory all the time. M-series processors have a packaging advantage that is very hard to beat, and indeed is yet to be beaten, in the consumer and prosumer segments.
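To make the access-pattern point concrete, here is a minimal single-threaded sketch in plain C (nothing Apple-specific; the buffer size and the stride of 16 doubles are arbitrary assumptions). The sequential sweep lets the prefetchers stream whole cache lines, while the strided sweep uses only one element out of every cache line or two, so the useful bandwidth it reports collapses even though the memory system is doing comparable work; neither figure, on a single core, comes near the headline number on the spec sheet.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 27)   /* 128 Mi doubles = 1 GiB, far larger than any cache */
    #define STRIDE 16      /* 16 doubles = 128 bytes, defeats spatial locality */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double *a = malloc((size_t)N * sizeof *a);
        for (size_t i = 0; i < N; i++) a[i] = 1.0;

        double t0 = now();
        double s1 = 0.0;
        for (size_t i = 0; i < N; i++) s1 += a[i];            /* sequential sweep */
        double t1 = now();
        double s2 = 0.0;
        for (size_t i = 0; i < N; i += STRIDE) s2 += a[i];    /* strided sweep */
        double t2 = now();

        printf("sequential: %.1f GB/s\n", (double)N * sizeof *a / (t1 - t0) / 1e9);
        printf("strided:    %.1f GB/s of useful data\n",
               (double)(N / STRIDE) * sizeof *a / (t2 - t1) / 1e9);
        printf("(checksums: %f %f)\n", s1, s2);                /* keep the loops alive */
        free(a);
        return 0;
    }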

See my reply to the adjacent comment; hardware is not marketing, and LLM inference stands as witness.


> The same as having lots of FLOPS on paper doesn't necessarily mean you'll get to use all that compute, if you're waiting on memory all the time.

The opposite case is also possible: you can be compute-limited, or the bottleneck can be somewhere else entirely. This is definitely the case for Apple Silicon, because you will certainly not be able to make use of all of the memory bandwidth from the CPU or the GPU. As always, benchmark instead of reading raw hardware specifications.
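In that spirit, here is a hedged sketch of the kind of measurement that settles the question (plain C with pthreads, built with cc -O2 -pthread; the 8-thread count and 256 MiB slabs per thread are arbitrary assumptions, and a proper tool like STREAM, or a GPU-side test, would be more rigorous). It times concurrent memcpy traffic and reports the aggregate, which you can then compare against the advertised peak.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NTHREADS 8
    #define BYTES_PER_THREAD (256u << 20)   /* 256 MiB read + 256 MiB written per thread */

    static void *copy_worker(void *arg) {
        char *buf = arg;
        /* Copy the second half of the slab over the first half (non-overlapping). */
        memcpy(buf, buf + BYTES_PER_THREAD, BYTES_PER_THREAD);
        return NULL;
    }

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        char *slabs[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            slabs[t] = malloc(2 * (size_t)BYTES_PER_THREAD);
            memset(slabs[t], 1, 2 * (size_t)BYTES_PER_THREAD);   /* fault the pages in */
        }

        double t0 = now();
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, copy_worker, slabs[t]);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        double dt = now() - t0;

        /* Each thread reads and writes BYTES_PER_THREAD, so count the traffic twice. */
        double gb = 2.0 * NTHREADS * BYTES_PER_THREAD / 1e9;
        printf("aggregate copy bandwidth: %.1f GB/s\n", gb / dt);

        for (int t = 0; t < NTHREADS; t++) free(slabs[t]);
        return 0;
    }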


> […] but how much can your code actually use?

All of it, and it is transparent to the code. The correct question is «how much data does the code transfer?»

Whether you are scanning large string ropes for a lone character or multiplying huge matrices, no manual code optimisation is required.
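A minimal sketch of what that looks like in practice (assuming macOS and a build with cc -O2 -framework Accelerate; the 1 GiB rope and the 4096x4096 matrices are arbitrary sizes chosen for illustration): the scan is a single memchr call and the matrix product a single cblas_sgemm call, with no hand-written SIMD, tiling, or thread management anywhere in the user code.

    #include <Accelerate/Accelerate.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        /* 1. Scan a large "rope" for a lone character. */
        size_t len = (size_t)1 << 30;              /* 1 GiB of text */
        char *rope = malloc(len);
        memset(rope, 'a', len);
        rope[len - 1] = 'q';                       /* the needle */
        char *hit = memchr(rope, 'q', len);        /* vectorised inside libc */
        printf("found at offset %zu\n", (size_t)(hit - rope));

        /* 2. Multiply two large matrices with Accelerate's BLAS. */
        int n = 4096;
        float *A = calloc((size_t)n * n, sizeof *A);
        float *B = calloc((size_t)n * n, sizeof *B);
        float *C = calloc((size_t)n * n, sizeof *C);
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
        printf("C[0] = %f\n", C[0]);

        free(rope); free(A); free(B); free(C);
        return 0;
    }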


Are you well-read enough into the platform so that you can attest to it requiring no manual code optimisation for high-performance datapaths? I'm only familiar with the Apple Silicon-specific code in llama.cpp, and not really familiar with either Accelerate[0] or MLX[1] specifically. Have they really cracked it at homogeneous computing so that you could use a single description of computation and have it emit efficient code for whatever target in the SoC? Or are you merely referring to the full memory capacity/bandwidth being available to CPU in normal operation?

[0]: https://developer.apple.com/documentation/accelerate

[1]: https://ml-explore.github.io/mlx/build/html/usage/quick_star...


> Are you well-read enough into the platform so that you can attest to it requiring no manual code optimisation for high-performance datapaths?

Yes.

> Have they really cracked it at homogeneous computing […]

Yes.

> have it emit efficient code […]

Yes. I had also written compilers and code generators for a number of platforms (all RISC) decades before Apple Silicon became a thing.

> […] for whatever target in the SoC?

You are mistaking the memory bus width that I was referring to for CPU-specific optimisations. You are also disregarding the fact that the M1-M4 Apple SoCs have the same internal CPU architecture, differing mostly in the ARM instruction sets they support (ARM64 v8.2 in M1 through to ARM64 v8.6 in M4).

> Or are you merely referring to the full memory capacity/bandwidth being available to CPU in normal operation?

Yes.

Is there truly a need to be confrontational in what otherwise could have become an insightful and engaging conversation?


Have you tested it or is that just what you expect?


Yes, I have actually tested it.

Others have as well. The https://lemire.me/blog/ blog has a wealth of insights across multiple architectures, including all of the current incumbents (Intel, Apple, Qualcomm, etc.).

Do you have any detailed insights? I would be eager to assess them.



