The AVX-512 instructions can cause surprising, system-wide performance regressions.
“One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.” - 3JPLW and https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The processor does not immediately downclock when encountering heavy AVX512 instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Light 512-bit instructions will move the core to a slightly lower clock.
* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).
* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).
As per https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
> Can other SIMD instructions (AVX2, say) do the same?
On Intel CPUs, yes. There's even a BIOS/UEFI setting, called "AVX offset", that specifies how much the clock frequency should drop when running AVX code. AMD CPUs don't do that, though, as far as I know.
The thermal hit of using wider vectors decreases with every node shrink, though, so expect the issue to fade over time (which also explains why it doesn't apply to AMD: their only µarch with 256-bit execution units, Zen 2, is on a better node than Intel's).
The AVX offset is only available on motherboards that support overclocking, owing to how much higher Intel CPUs can be pushed relative to their advertised base and boost clocks.
Both Zen and Intel lower their clocks under load, especially AVX load. Keep in mind that Zen 2 doesn't even reach its advertised boost clocks under any load; some CPUs come within 100 MHz or so, but overall they all clock down rather fast once TMax or PMax is reached.
What are the forces in chip design that are at play here? Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much. As a result, if you fill a modern chip with compute gates, you cannot use them all at once because the chip will melt; at the very least you can't have them all running at max clock rates. One solution is to increase the proportion of the chip used for SRAM (it uses less power per unit area than compute gates), which is what Graphcore have done. Another is to put down multiple different compute blocks, each designed for a different purpose, and only use them a few at a time. The big.LITTLE Arm designs in smartphones are an example of that, but I feel like AVX-512 might be an example too. When ML accelerator blocks are added next, they likewise will not be able to run flat out at the same time as the rest of the cores' resources.
I'm sure Intel should fix the problems Linus is complaining about, but I feel like chip vendors are being forced into this "add special purpose blocks" approach, as the only way to make their new chips better than their old ones.
Jim Keller had an interesting talk recently [1] about ways of doing parallel processing to better use the billions of transistors we have - assuming the task is parallelizable. There's the scalar core (i.e. the basic CPU), which is relatively easy to program. Then a scalar core with vector instructions - difficult to program efficiently. Then there are arrays of scalar cores, i.e. GPUs, so relatively easy to program again, and now a lot of startups are building arrays of scalar cores each with vector engines, which is expected to be the most difficult to program. He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.
Vectorization:
I'm not an expert in this area, so I can only tell you what I've personally found difficult when dealing with vectorization. Usually it all comes down to alignment and vector lanes. To utilize the vector instructions you basically have to partition your memory into separate (but interleaved) regions that can be mapped to distinct vector lanes efficiently. Everything is fine as long as no two elements from separate lanes have to be mixed in some way; as soon as your computation requires that, you incur a heavy cost.
Dealing with these issues might require you to know the corners of the instruction set really well, or sometimes the solution lies outside the instruction set and relates to how your data structure is laid out in memory, leading you to AoS vs. SoA analysis, etc.
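To make the AoS vs. SoA point concrete, here's a minimal C++ sketch (the particle layout and function are invented for illustration): only the SoA layout gives the compiler contiguous runs of one field to map onto vector lanes.

    #include <cstddef>
    #include <vector>

    // Array of Structures: x, y, z interleaved in memory, so loading
    // eight consecutive x values needs strided or gather accesses.
    struct ParticleAoS { float x, y, z; };

    // Structure of Arrays: each field is contiguous, so eight
    // consecutive x values map straight onto the lanes of a register.
    struct ParticlesSoA { std::vector<float> x, y, z; };

    // With SoA a loop like this is trivially auto-vectorizable:
    // every lane does the same work on adjacent elements.
    void scale_x(ParticlesSoA& p, float s) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] *= s;
    }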
Compilers and vectorization:
Based on reading a lot of assembly output, I think what compilers usually struggle with are assumptions that the human programmer knows hold for a given piece of code, but that the compiler has no right to make. Some of this is basic alignment; gcc and clang have intrinsics for these. Sometimes it's related to the memory model of the programming language disallowing a load or a store at specific points.
GPGPU programmability:
GPUs being easy to program is something I take with a grain of salt. Yes, it's easy to get up and running with CUDA. Making an _efficient_ CUDA program, however, is easily as challenging as writing an efficient AVX program, if not more so.
> as long as vectorization can fail (and it will), […] you must come to deeply understand the auto-vectorizer. […] This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation
GPUs aren't really arrays of scalar cores. All threads in a warp run in lock step. If one takes a branch they all do, with operations being masked off as needed.
It's not all that different conceptually to AVX-512 with mask registers, except the vector size is even larger and of course the programming model differs.
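To make the comparison concrete, here's a minimal sketch of predication with AVX-512 mask registers (assuming AVX-512F; the function is invented for illustration): it's the CPU-side analogue of a warp where the threads that didn't take the branch are masked off.

    #include <immintrin.h>

    // c = a + b only in the lanes where a > 0; the other lanes keep
    // a's value, much like inactive threads in a warp on a branch.
    __m512 add_where_positive(__m512 a, __m512 b) {
        __mmask16 m = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GT_OQ);
        return _mm512_mask_add_ps(a, m, a, b);
    }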
> He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.
I have a simplistic explanation - maybe not what you're looking for but it is the best I can do...
At 12m23s in the video he says, "If you're working in a layer and the layers are well constructed (abstracted) you really can make a lot of progress. But if the top layer says, 'to make this really fast, go change the bottom layer', then it's going to get all tangled up."
That's what implementing an algorithm on a SIMD architecture feels like to me. I have to figure out a way of filling my SIMD width with data each clock cycle, while in contrast, the specification of the algorithm deals with data one piece at a time.
Take insertion sort as a (bad) example.
    i ← 1
    while i < length(A)
        j ← i
        while j > 0 and A[j-1] > A[j]
            swap A[j] and A[j-1]
            j ← j - 1
        end while
        i ← i + 1
    end while
That algorithm cannot easily take advantage of SIMD. You have to change the algorithm to make it work with the architecture.
We'd probably say the algorithm is the top level of the abstraction stack, and the SIMD architecture is a level near the bottom. So this problem is the opposite way around to how Jim phrased it, but the point is that we have NOT got clean abstraction - an implementation in one layer depends on the implementation in another.
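For contrast, here's a minimal sketch (AVX intrinsics; the function is invented, and the array length is assumed to be a multiple of 8) of a loop that does fill the SIMD width naturally: every iteration feeds eight independent elements to the same operation, with no interaction between lanes.

    #include <immintrin.h>
    #include <cstddef>

    // Clamp an array to a ceiling, 8 floats per iteration. Each lane
    // is independent, which is exactly the shape SIMD wants.
    void clamp_to(float* data, std::size_t n, float ceiling) {
        const __m256 c = _mm256_set1_ps(ceiling);
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(data + i);
            _mm256_storeu_ps(data + i, _mm256_min_ps(v, c));
        }
    }

Insertion sort has a loop-carried dependency (each swap depends on the previous comparison), so there is no such supply of independent elements to feed the lanes; you end up reaching for a different algorithm (sorting networks, merge-based approaches) to get SIMD to help.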
Are GPUs really easier to program than scalar w/ SIMD (or vector insns)? The programming models you have to work with for GPGPU seem quite obscure, whereas with CPU and SIMD, flipping a compiler switch gets you most of the way there, and self-contained intrinsics do the rest.
GPU programming is easy enough; the complexity comes from the separate memory system and the tedious (and not portable) API you need to use to access the GPU.
I prefer intrinsics as they give more control than shader languages and they can be written in C++ instead of fiddling with some garbage GPU API that runs async.
Disclaimer: I work on AMD ROCm, but my opinions are my own.
There's also HIP[1], which can be used as a thin wrapper around CUDA, or with the ROCm backend on AMD platforms. It doesn't yet match CUDA in either breadth of features or maturity, but it's getting closer every day.
As I understand it, that has to work for the CORAL 2 US "exascale" systems, so people who've been proved fairly right so far obviously have some confidence in it. (de Supinski of Livermore said he'd be out of a job if conventional wisdom was right, though it was pretty obvious at the time that it wasn't.)
Free software too, praise be.
At least part of the problem is that computing mostly depends on moving data. Memory bandwidth is relatively low, so it's difficult to sustain enough actual floating-point intensity, at least for "large" arrays, even when it's theoretically available. A classic example is GEMM (general matrix multiplication), where you should expect a good implementation to get around 90% of peak performance, but also expect it to jump through various tricky hoops to get there. With, say, vector multiplication the hoops aren't available, and you're ultimately memory-bound.
Yes, there's more to it than that, and SIMD has non-FP applications etc.
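To put rough numbers on that (back-of-the-envelope, not a measurement): element-wise vector multiplication c[i] = a[i] * b[i] on doubles does 1 flop per 24 bytes moved (two 8-byte loads and one 8-byte store), so ~100 GB/s of memory bandwidth sustains only ~4 GFLOP/s no matter how wide the SIMD units are. GEMM, by contrast, does O(n^3) flops on O(n^2) data, which is exactly why the blocking/tiling hoops can push it near peak.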
The power problem is solved by having cores more suited to a task. A CPU is completely general, but power inefficient. Dedicated HW is as efficient as it gets, but in the extreme is not flexible and only does one task well. With loads of extra silicon available, we can now use that for more specific engines/accelerators and of course not all of these would be active at once. So in a way the scaling / density does allow us to get more efficiency in some cases. The trick is finding the balance for a given process node.
> Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much.
Kids these days get 8 cores for a 100W TDP.
When I was a boy, 100W got you a single core. And you didn't get dynamic frequency scaling, so it'd be putting out that heat all the time.
(We also had to walk to school barefoot in the snow, uphill both ways)
The Pentium II was not as efficient as the III. I remember setting up a dual-socket machine where the power supply started to matter. The best thing was that the web browser would only suck 100% from one processor.
What's the problem? My old school pentiums kept my dorm room nice and toasty. Could keep my window cracked in the winter for fresh air while gentoo compiled...
The main problem is software, with GPGPUs you need to explicitly program for them, while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.
Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.
> the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.
With how AVX512 is implemented, there isn't much point in a compiler auto optimizing general purpose code to use it, because even if there is a theoretical speedup, it may well be slower in practice.
> while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms
No. I recently could really, really have used the packed saturated integer arithmetic and horizontal addition in AVX2 (but my old machine doesn't support it) and even better, the same but 512 bits wide on AVX512. It would only have been 6 or 7 instructions, if that, but it was inner loop, and mattered. Using compiler intrinsics would have been fine. I think you're looking at things too narrowly.
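Roughly the shape of what I needed, sketched with AVX2 intrinsics (assuming 16-bit elements; untested, so treat it as illustrative rather than the actual code):

    #include <immintrin.h>
    #include <cstdint>

    // Saturating add of 16 x int16 lanes, then horizontal reduction
    // of the result down to a single 32-bit sum.
    int32_t saturating_add_then_sum(__m256i a, __m256i b) {
        __m256i s = _mm256_adds_epi16(a, b);                      // packed saturating add
        __m256i w = _mm256_madd_epi16(s, _mm256_set1_epi16(1));   // widen pairs to int32
        w = _mm256_hadd_epi32(w, w);                              // horizontal adds...
        w = _mm256_hadd_epi32(w, w);                              // ...within each 128-bit half
        __m128i lo = _mm256_castsi256_si128(w);
        __m128i hi = _mm256_extracti128_si256(w, 1);
        return _mm_cvtsi128_si32(_mm_add_epi32(lo, hi));          // combine the two halves
    }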
I am looking at it from the point of view of the joe/jane developer who cannot tell head from tail regarding vector programming, doesn't even know what compiler intrinsics are for, and uses languages that don't expose them anyway.
Which is the whole point of "this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms": not only do those people not get it, there is a general decline in using languages that expose vector intrinsics, like C and C++, for regular LOB applications.
In my ideal world you'd be able to mark a function "this should compile to / run on GPGPU" and the compiler would potentially tell you why it can't do that. I'm not even sure anything is stopping us from implementing that, apart from the effort required. Sure, many ways to write that code will result in terrible performance, but it would still be closer to the auto-vectorisation experience.
The current OpenMP spec has GPU offload features specifically for what was expected of the Sierra supercomputer. I'm not sure how relevant a paper that old (relatively, I hasten to add) is.
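For what it's worth, the grandparent's wish looks a lot like OpenMP target offload already; a minimal sketch (the saxpy kernel is my own toy example, and whether it actually lands on a GPU depends on the compiler and its offload flags):

    #include <cstddef>

    // 'target teams distribute parallel for' asks the compiler to build a
    // device version of this loop and to move the mapped arrays for us.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }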
> Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.
I think you got this backwards - the lack of developers' interest is what leads to the mistaken impression that GPU compute is only good for multimedia and FP-crunching workloads. Even looking at the success of GPU compute in mining cryptocoins (only ASICs do better) ought to be enough to tell you that we could do a lot more with them if we cared to.
> What are the forces in chip design that are at play here?
The "weak form" of Moore's Law--"Performance doubles every 12-18 months"--is dead and buried.
The "strong form" of Moore's Law is still active--"Transistor cost halves every 12-18 months".
This means that you can't make the primary paths any faster. So, all you can do is add functionality and pray that someone magically can make that functionality relevant to the primary use cases.
AVX is not a "special purpose block", it's Intel's answer to not adding special purpose blocks on customer demand, like you can do with ARM.
Crypto or video decoding comes to mind, those would be much faster with dedicated silicon, but more general AVX instructions can get you halfway there. Well, maybe a quarter. People point out that AVX uses a lot of power, but they ignore that the same algorithm running instead on more but simpler cores would use even more power.
They exist today, but they were added after AVX. Every year we figure out how to cram more transistors into a square centimeter, and once the low-hanging fruit was done and we knew how to add even more transistors, we started putting in more and more specific functions.
That is Linus' point: he would have preferred to use that increase in transistor count for other things, like more cache.
More cache has diminishing returns, because cache wants to be as close as possible to the core logic. And modern CPUs are mostly cache anyway. Special-purpose blocks for common compute tasks are quite cheap.
Back in the day, CPUs didn't come with FPUs and the latter were optional co-processors.
The idea in the x86-world always was to "outsource" special requirements to dedicated hardware (FP co-processors, GPUs, sound cards, network cards, hardware codec cards, etc.), instead of putting them on the CPU package (like ARM-based SoCs).
So it's different philosophies entirely - tightly integrated SoCs vs versatile and flexible component-based hardware.
It's The One Ring ([ARM-based] SoCs) vs freedom of choice and modularity (PC). If I don't do simulations or 3d-modelling/rendering, I am free to choose a cheap display adapter without powerful 3D-acceleration and choose a better audio interface instead (e.g. for music production).
The SoC approach forces me to buy that fancy AI/ML-accelerator, various video codecs, and powerful graphics hardware with my CPU regardless of my needs, because the benevolent system provider (e.g. Apple) deems it fit for all...
Torvalds is just old-school in that he prefers freedom of choice and the "traditional" PC over highly integrated SoCs.
FP coprocessors "only" existed because the processes weren't advanced enough to have them inside the chip, but they were a natural extension (they were married to the instruction set of the chip - it wasn't a product, it was a feature)
> So it's not "just benchmarks", people actually want to do stuff with it
IIRC when Bulldozer was released and Intel's propaganda machine started spewing stories about how AMD's core count was fake because two cores shared an FP unit, there was a flurry of scientific papers on the subject.
IIRC, it was determined that even the hot path of FP-intensive code only executed a single FP op for every 7 non-FP operations. To put it differently, between FP ops all code has to execute operations to move data around.
Consequently, Bulldozer's FP benchmarks scaled linearly with respect to cores: even when multiple cores had to share an FP unit to run FP operations, those operations were so relatively scarce, even in number-crunching applications, that cores didn't block, and overall performance was not affected.
That's the relevance of FP in real-world benchmarks.
I for one would be delighted by having more caches or wider backends instead of AVX512, but I don't want SIMD to be pushed into GPUs. It'd be better to do the reverse - to push forward the asymmetric core idea and move more GPU functionality into lots of simpler cores tuned for SIMD at the cost of single thread performance.
It seems like they just keep that area mostly empty in processors without that feature, at least for the processors related to the one pictured. I'm not really sure how much effective cache could fit there without a major overhaul, but a chip designer or enthusiast likely would. This could be why Linus focused on computational enhancements when he discussed transistor budget.
From a quick glance at the proportions and considering not only the register files are halved, but also the vector EUs, I'd expect a 25% increase in L3 or a 50% in L2. That and some lessened thermal constraints.
If all cores see a single unified and consistent memory image (some scratchpad memory excepted), it's best if they all share the same basic ISA (with unimplemented instructions trapping to process migration or software emulation).
AVX-512's richness relative to x86 is like C++'s relative to C. Linus makes a summary assessment of how he can leverage these technologies to his advantage, and if the cost of learning the technology and all its intricacies outweighs the perceived advantage, that technology is garbage. This reaction from Linus appears to fit his conservative pattern. I think where Linus gets things wrong stems from his facts rather than his philosophy.
AVX-512's fantastic breadth is born out of an actual need to free compilers from constraints imposed by programs in virtually every mainstream language. All of these describe programs for an abstract machine rooted in a scalar instruction model. Without any further performance from increasing clock speed over time, the target has to become instructions-per-cycle and even operations-per-instruction. The limitations on ILP and the expense of powering circuitry to achieve it have been well studied for the past two decades; the failure to realize it is evident in the failure of NetBurst. Linus believes that the frontends of CPUs have a lot more to give, perhaps best exhibited by his refutation of CMOV (https://yarchive.net/comp/linux/cmov.html).
Today's programming languages haven't evolved to make it easier for programmers to describe non-scalar code. On the other hand, power constraints, and now security constraints, haven't made it easier for hardware to efficiently execute scalar code. Perhaps AVX-512 is as naive a bet as Itanium; if not, it might be just the missing piece compilers need that they didn't have twenty years ago.
Are Intel just delaying the inevitable? Is it safe to say (even today) that a slow GPU will crunch big matrices faster than a fast CPU? And that's before we get to price/performance. So all that's left is the bottleneck around PCIe which, in theory, leaves the CPU with an advantage only for small datasets - which we don't really care about anyway (because they happen quickly).
Maybe the tradeoff is somewhere interesting from a latency perspective - SDR or similar. I dunno, am I barking up the wrong tree?
AVX-512 is the surviving heritage of Larrabee, the chip that was supposed to overtake GPGPU and has so far failed to do so.
The only thing that Intel has going for their GPUs is that as typically happens with the underdog companies, they decided to play nice with FOSS drivers and with integrated GPUs they own the low budget laptop market.
Everyone who has done any serious 3D programming is painfully aware of how bad their OpenGL drivers used to be. They even used to fake OpenGL queries, reporting features as supported when they were actually implemented in software, thus making some games unusable.
That is why they started the campaign about optimizing games for Intel GPUs, and about how to make best use of the Graphics Performance Analyzers, which ironically in the old days were DirectX only.
The bottleneck you mention is only an issue when there isn't any shared memory available. If the hardware allows for unified memory models then there is no data transfer and the GPU can work right away; naturally there are still some synchronization points that need to happen.
In the article they quote Linus speculating that the increased core count of CPUs will achieve the same thing as AVX512 without the problems. I have read comments on HN that if cores keep increasing on CPUs they might be able to replace GPUs for some of the tasks as GPUs (or CUDA in particular) have quirks that CPUs don't have.
AVX512 in particular has issues. Using it slows down the CPU so actual wall clock benefits depends heavily on how it is used.
For general purpose computing maybe. For gaming the GPUs contain special operations for texture lookup and what not that would be very expensive in a CPU.
This may just be too obvious for you to mention, but GPUs only work well for (massively) parallel tasks such as matrix multiplication.
Of any day's computational workload, only graphics, (parts of) ML, and maybe space heating masquerading as financial innovation are amenable to being run in such a fashion. And those workloads are, as far as I can tell, already being run on GPUs (and similar) almost universally.
So I don't think there actually are major workloads that will shift away from Intel to GPUs in the near future?
SVE isn't in a co-processor, I guess is the point. There's a lot more to Fugaku than SVE (whether or not you think that's a version of avx512), though. No DDR is suggestive.
I read that, but putting it in a co-processor makes little sense if you care about memory bandwidth and latency - which you should, to keep the vector unit fed. Fujitsu have carefully designed this stuff with such considerations in mind, as far as I can tell.
AVX512 is both integer and floating point, not just FP, so this rant about FP comes across as ill informed.
Despite that I'd agree most people probably see no benefit from these units today. But that could change. For workloads with parallelism, wide SIMD is very efficient - more so than multiple threads anyway. The only way to get people to write vector code is to have vector processing available. Once it's ubiquitously available people might code for it and the benefits may become more apparent.
The very wide AVX stuff with integer ops, like these from the wiki:
- AVX-512 Byte and Word Instructions (BW) - extends AVX-512 to cover 8-bit and 16-bit integer operations
- AVX-512 Integer Fused Multiply Add (IFMA) - fused multiply-add of integers using 52-bit precision.
could be very useful. I could have done with those recently. They also don't (AFAIK) cause CPU frequency scaling (a polite term for downclocking). He may well be right about FP, though.
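For a flavour of what those extensions buy you, a minimal sketch (assuming AVX-512BW/IFMA hardware; the wrappers are invented and untested):

    #include <immintrin.h>

    // AVX-512BW: saturating add of 64 unsigned bytes per instruction,
    // the kind of 8/16-bit integer work base AVX-512F can't express.
    __m512i add_sat_u8(__m512i a, __m512i b) {
        return _mm512_adds_epu8(a, b);
    }

    // AVX-512IFMA: multiply 52-bit chunks and accumulate the low 52 bits
    // of each product into 64-bit lanes (handy for big-integer work).
    __m512i madd52lo(__m512i acc, __m512i a, __m512i b) {
        return _mm512_madd52lo_epu64(acc, a, b);
    }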
If he was right with FP, he'd know better than the business analysts at Intel. Instead, his opinion is based on what the market looked like thirty years ago.
Nine years ago, AMD tested the hypothesis that really more "cores" and higher integer throughput were all that was needed and that FP performance didn't matter. The resulting architecture (Bulldozer) was a near-fatal disaster. It didn't even work out in the datacenter, where you might expect that hypothesis to hold.
If Intel had their shit together they would have released AVX512 years ago with Skylake desktop, but they prefer to artificially segment the market, and have still not managed to release a desktop chip with AVX512--allowing AMD to catch up and now in many ways surpass them.
More than that, Bulldozer didn't even have good single thread integer performance. What it gave you was 8 cores that might be able to keep up with 4 of Intel's cores on something that has 8 threads. The market was not particularly interested in this, especially since at the time even fewer things could actually use 8 threads than they do now.
Bulldozer significantly outperformed Sandy Bridge on the workloads it was designed to be good at, namely multi-threaded integer workloads like compiling the Linux kernel.
If Linus' attitude of "I'd rather have more cores" and "FP doesn't really matter" were representative of market demand, you'd have expected Bulldozer to do well at least somewhere, as opposed to nowhere.
Are we looking at the same benchmarks? In the first they're comparing an 8-core Bulldozer to Sandy Bridge with 4 cores and no hyperthreading and it's basically even, sometimes it wins by a small margin on the threaded ones. In the second the 3770K has 4 cores with hyperthreading and that makes it look even worse.
If they were actually getting twice the integer performance per module as Intel was getting per core then it might've been interesting, but being the same or only slightly better when comparing modules to cores wasn't enough to overcome the single thread performance deficit which people still care about a lot.
You have to look for them, but there are benchmarks where AMD outperforms significantly. I can't find the Linux compilation benchmark now, but the difference was not small.
The Bulldozer really did have a big advantage in integer throughput per dollar, but that does not translate to a 2x speedup in pretty much any benchmark. FP throughput on the other hand shows up a lot.
I think we'd have been rather better off buying a load of Magny-Cours rather than Sandy Bridge for a university HPC system whose procurement I wasn't sufficiently involved in.
"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.
Because absolutely nobody cares outside of benchmarks."
That was back in the stone age when a lot of applications for FP math weren't mainstream. Most of AVX-512 doesn't even concern FP, there's lots of integer and bit twiddling stuff there.
Furthermore, people really do care about these benchmarks. It influences their purchasing, which is really the thing that matters most to Intel. A lot of people don't actually care about hypothetical security issues or the fact that the CPU is 14nm when it still outperforms 7nm in single-threaded code.
Also, it's not like you can just trade off IPC or extra cores for wider SIMD. It's not like "just add more cores" is just as good for throughput, otherwise GPUs wouldn't exist. Wider SIMD is cheap in terms of die area, for the throughput it gives you.
Lastly, these are just instructions, nothing says that an AVX-512 instruction needs to go through a physical 512-bit wide unit, it just says that you can take advantage of those semantics, if possible.
He has the history correct. Most of the CPUs that x86 beat in the market had superior FP performance; SPARCs, Alphas, PA-RISC, Itanium, etc.
> When's the last time he actually did anything with a computer?
According to Linus, he completes about 30 pull requests a day, plus some multiple of that in kernel builds. His $1900 32-core Threadripper speeds up that process a great deal, and FP contributes little to nothing.
Today people stream video+audio, encrypt+decrypt and render graphics. All of these have specialized silicon. If their AVX-512 vanished in the night almost no one would notice the next day.
Maybe we should all be astronomers and thermodynamicists writing bespoke finite element simulations and have a deep appreciation for the wonders of floating point ISAs, but that's just not the real world.
Speaking as someone who does scientific computing all day long, in part with FEM simulations, even for me AVX512 isn't usually worth it in terms of wall-clock time.
They wouldn't notice AVX-512 vanishing because they never had it in the first place, as Intel hasn't shipped it in the CPUs people actually use for those tasks--just servers and random laptops.
As for the rest, you are wrong: AVX-512 is not just floating point by any means, and floating point is used by more than just scientific workloads.
Games, simulations, modeling software, etc. can all make heavy use of floating point.
This discussion is about avx512, which has shown to have some issues compared to other solutions. Nobody is claiming FP is garbage or that we don't ever need SIMD.
Regarding Linus: he is almost always right. He has had more hands-on experience than everyone in this thread combined. I trust his judgement, he has earned it by being consistently correct while his opposition has just complained and eventually failed to deliver themselves.
The evidence for "more hands-on experience than everyone in this thread combined"? I've seen him sound off about things he obviously doesn't have hands on experience with, and not obviously be proved right about things he does (like compilation). That said, avx512 is over-rated in the area I know best.
First of all, there is indeed SIMD code in the kernel. Check out the very beginning of your boot message.
More importantly, the kernel needs to support context switch between userspace applications that use SIMD registers. So it touches a bunch of critical data structures and event handlers.
Fixed point audio decoding is common, actually. In general, floating point math makes sense when dealing with computations in a range spanning different orders of magnitude, where one cares about relative precision. This describes a lot of what we use computers for, but fixed point math is a lot more efficient and makes sense for simpler cases.
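A minimal sketch of what fixed point means here (Q15, i.e. 16-bit values with 15 fractional bits; illustrative, not any particular codec's code):

    #include <cstdint>

    // Q15 fixed point: value = raw / 32768. A multiply is an integer
    // multiply plus a shift; no FPU, and the precision is fixed.
    int16_t q15_mul(int16_t a, int16_t b) {
        int32_t p = static_cast<int32_t>(a) * b;   // Q30 intermediate
        p = (p + (1 << 14)) >> 15;                 // round and rescale to Q15
        if (p >  32767) p =  32767;                // saturate on overflow
        if (p < -32768) p = -32768;
        return static_cast<int16_t>(p);
    }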
No, it isn't common. Source: someone who actually does this stuff.
Audio processing and decoding is all about maintaining intermediate results at appropriate precision. The magnitudes involved far exceed the bit width at the output of the pipeline. The only reason you would ever use fixed point is for speed... which is no longer necessary, and needs to stay that way.
I think you read it wrong, slightly helped by either Linus misspeaking or Phoronix misquoting him. The "matter" in "and it matter not one iota" clearly should be past tense.
As I see it, what he's saying is that back in the day, the majority of those buying CPUs did not care about FP code. And he thinks that today the same is true of AVX-512, the majority of those that buy CPUs don't care about AVX-512.