I wish we could get something other than Geekbench for these things, since Geekbench seems to be trash. For example, it has the Ryzen 7 7700X with a higher multi-core score than the Epyc 9534 even though they're both Zen4 and the latter has 8 times as many cores and is significantly faster on threaded workloads in real life.
That's what the single thread score is supposed to be for. The multi-thread score is supposed to tell you how the thing performs on the many real workloads that are embarrassingly parallel.
Suppose I'm trying to decide whether to buy a 32-core system with a lower base clock or a 24-core system with a higher base clock. What good is it to tell me that both of them are the same speed as the 8-core system because they have the same boost clock and the "multi-core" benchmark doesn't actually use most of the cores?
The only valid benchmark for that is the application you actually intend to run. Even embarrassingly parallel problems can have different characteristics depending on their use of memory and caches and the thermal characteristics of the CPU. Something that uses only L1 cache and registers will probably scale almost linearly in the number of cores, except for thermal influences. Something that hits L2, L3 or even main memory will scale sublinearly.
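To make that concrete, here's a rough Python sketch (not a rigorous benchmark; the buffer sizes and iteration counts are arbitrary, and Python's interpreter overhead will blunt the memory effect) contrasting how a cache-friendly loop and a bandwidth-hungry loop scale as you add workers:

```python
# Each worker gets one task, so with perfect scaling the wall time
# stays flat as the worker count grows. Bandwidth contention makes
# the memory-bound case slow down instead.
import time
from multiprocessing import Pool

def compute_bound(_):
    # Fits in registers/L1: should scale near-linearly with cores.
    x = 1.0
    for _ in range(5_000_000):
        x = x * 1.000001 + 0.000001
    return x

def memory_bound(_):
    # Streams through a buffer much larger than L3 (size is a guess):
    # contends for memory bandwidth, so scaling is typically sublinear.
    buf = bytearray(256 * 1024 * 1024)
    total = 0
    for i in range(0, len(buf), 4096):  # touch every page
        total += buf[i]
    return total

def scaling(fn, max_workers=8):
    for n in (1, 2, 4, max_workers):
        start = time.perf_counter()
        with Pool(n) as p:
            p.map(fn, range(n))
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__}: {n} workers -> {elapsed:.2f}s")

if __name__ == "__main__":
    scaling(compute_bound)
    scaling(memory_bound)
```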
You're essentially just arguing that all general-purpose benchmarks are worthless because your application could be different.
Suppose I run many different kinds of applications and am just looking for an overall score to provide a general idea of how two machines compare with one another. That's supposed to be the purpose of these benchmarks, isn't it? But this one seems to be unusually useless at distinguishing between various machines with more than a small number of cores.
Your analysis is also incorrect for many of these systems. Each core may have its own L2 cache and each core complex its own L3, so systems with more core complexes don't inherently have more cache contention, because they also have more caches. Likewise, systems with more cores often have more memory bandwidth, so the bandwidth per core isn't inherently lower than in systems with fewer cores, and in some cases it's actually higher, e.g. an HEDT processor may have twice as many cores but four times as many memory channels.
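You can check this directly on Linux: sysfs reports which CPUs share each cache level, so a short script shows whether L2 is per-core and L3 per-complex on a given machine (Linux-only paths; other OSes expose this differently):

```python
# Enumerate CPU 0's cache hierarchy and show which CPUs share each cache.
from pathlib import Path

cpu0 = Path("/sys/devices/system/cpu/cpu0/cache")
for idx in sorted(cpu0.glob("index*")):
    level = (idx / "level").read_text().strip()
    ctype = (idx / "type").read_text().strip()
    size = (idx / "size").read_text().strip()
    shared = (idx / "shared_cpu_list").read_text().strip()
    print(f"L{level} {ctype:12} {size:>8} shared by CPUs {shared}")
```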
General-purpose benchmarks aren't worthless. They can be used to predict, in very broad strokes, what application performance might be. Especially if you don't really know what the applications would be, or if it is too tedious to use real application benchmarks.
But in your example, deciding between 24 cores at a somewhat higher frequency and 32 cores at a somewhat lower frequency based on some general-purpose benchmark is essentially pointless. The difference will be small enough that only a real application benchmark can tell you what you need to know. A general-purpose benchmark will be no better than a coin toss, because the exact workings of the benchmark, the weightings of its components in the composite score, and the exact hardware you are running on will interact in ways that determine the outcome to a far greater degree.

You are right that there could be shared or separate caches, shared or separate memory channels. The benchmark might exercise those, or it might not. It might heat certain parts of the die more than others. It might just be the epitome of embarrassingly parallel benchmarks, BogoMIPS, which is essentially an empty delay loop. The predictive value of the general-purpose benchmark is nil in those cases. The variability from the benchmark maker's choices necessarily introduces a bias and therefore a measurement uncertainty, and what you are trying to measure is usually smaller than that uncertainty. Therefore: no better than a coin toss.
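As a toy illustration with made-up numbers: two machines, two subtests, and the composite ranking flips depending entirely on which weights the benchmark author happened to pick:

```python
# Hypothetical subtest scores for the two machines in question.
machines = {
    "24-core, higher clock": {"crypto": 110, "compile": 95},
    "32-core, lower clock":  {"crypto": 95,  "compile": 110},
}

# Two equally defensible weighting schemes produce opposite winners.
for name, weights in (("weights A", {"crypto": 0.7, "compile": 0.3}),
                      ("weights B", {"crypto": 0.3, "compile": 0.7})):
    print(name)
    for machine, scores in machines.items():
        composite = sum(scores[t] * w for t, w in weights.items())
        print(f"  {machine}: {composite:.1f}")
```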
You're just back to arguing that general purpose benchmarks are worthless again. Yes, they're not as applicable to the performance of a specific application as testing that application in particular, but you don't always have a specific application in mind. Many systems run a wide variety of different applications.
And a benchmark can then provide a reasonable cross-section of different applications. Or it can yield scores that don't reflect real-world performance differences, implying that it's poorly designed.
I attempted to do this and discovered an irregularity.
Many of the systems claiming to have that CPU were actually VMs assigned arbitrary numbers of cores, fewer than the full chip. Moreover, VMs can report any CPU they want as long as the underlying hardware supports the same set of instructions, so an unknown number of them could have been running on different physical hardware, including on systems that use e.g. Zen4c instead of Zen4, since the two provide the same instruction set.
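For instance, on Linux the model name a submission tool reads out of /proc/cpuinfo is just whatever CPUID string the hypervisor chooses to expose (rough sketch):

```python
# Inside a VM, the hypervisor controls CPUID, so this "model name"
# can be any string the host chooses to present to the guest.
import os

def reported_cpu():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

print(reported_cpu(), "with", os.cpu_count(), "visible cores")
```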
If they're just taking all of those submissions and averaging them to get a combined score, it's no wonder the results are nonsense. And VMs can claim to be non-server CPUs too:
The multi-core score listed on the main results page for the EPYC 9534 is 15433, but if you look at the individual results, the ones that aren't VMs restricted to a subset of the cores typically get a multi-core score in the 20k-25k range, e.g.:
What does that have to do with the scores being wrong? As mentioned, virtual machines can claim to be consumer CPUs too, while running on hardware with slower cores than the ones in the claimed CPU.
That doesn't make any sense. Many of the applications are identical, e.g. developer workstations and CI servers are both compiling code, video editing workstations and render farms are both processing video. A lot of the hardware is all but indistinguishable; Epyc and Threadripper have similar core counts and even use the same core complexes.
The only real distinction is between high end systems and low end systems, but that's exactly what a benchmark should be able to usefully compare because people want to know what a higher price tag would buy them.
More than 99% of people looking to compile code or render video on an M5 laptop care about wall-clock time, running bare metal, with all IO going to a fast NVMe SSD, where even a large job will only thermally throttle for a bit and then recover.
Most people looking to optimize Epyc compile or render performance care about running inside VMs, with all IO going to a SAN, where there is enough queued work that yielding to other jobs increases throughput, ideally near thermal equilibrium.