The AVX-512 instructions can cause surprising, system-wide performance regressions.
“One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.” - 3JPLW and https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The processor does not immediately downclock when encountering heavy AVX512 instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Light 512-bit instructions will move the core to a slightly lower clock.
* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).
* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).
As per https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
> Can other SIMD instructions (AVX2, say) do the same?
On Intel CPUs, yes. There's even a BIOS/UEFI setting, called "AVX offset", that specifies how much the clock frequency should drop when running AVX code. AMD CPUs don't do that, though, as far as I know.
The thermal hit of using wider vectors decreases with every node shrink, though, so expect the issue to fade over time (which also explains why it doesn't apply to AMD: their only µarch with 256-bit execution units, Zen 2, is on a better node than Intel's).
The AVX offset is only available on motherboards that support overclocking, owing to how much higher Intel CPUs can be pushed relative to their advertised base and boost clocks.
Both Zen and Intel lower their clocks under load, especially AVX load. Keep in mind that Zen 2 doesn't even reach its advertised boost clocks under any load; some CPUs come within 100 MHz or so, but overall they all clock down rather fast once TMax or PMax is reached.
What are the forces in chip design that are at play here? Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much. As a result, if you fill a modern chip with compute gates, you cannot use them all at once because the chip will melt; at the very least you can't have them all running at max clock rates. One solution is to increase the proportion of the chip used for SRAM (it uses less power per unit area than compute gates), which is what Graphcore have done. Another is to put down multiple different compute blocks, each designed for a different purpose, and only use them a few at a time. The big.LITTLE Arm designs in smartphones are an example of that, but I feel like AVX-512 might be an example too. When ML accelerator blocks are added next, they likewise will not be able to run flat out at the same time as the rest of the cores' resources.
I'm sure Intel should fix the problems Linus is complaining about, but I feel like chip vendors are being forced into this "add special purpose blocks" approach, as the only way to make their new chips better than their old ones.
Jim Keller had an interesting talk recently [1] about ways of doing parallel processing to better use the billions of transistors we have - assuming the task is parallelizable. There's the scalar core (i.e. the basic CPU), which is relatively easy to program. Then a scalar core with vector instructions - difficult to program efficiently. Then there are arrays of scalar cores, i.e. GPUs, so relatively easy to program again, and now a lot of startups are building arrays of scalar cores each with vector engines, which is expected to be the most difficult to program. He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.
Vectorization:
I'm not an expert in this area, so I can only tell you what I've personally found difficult when dealing with vectorization. Usually it all comes down to alignment and vector lanes. To utilize the vector instructions you basically have to partition your memory into separate (but interleaved) regions that can be mapped to distinct vector lanes efficiently. Everything is fine as long as no two elements from separate lanes have to be mixed in some way; as soon as your computation requires that, you incur a heavy cost.
Dealing with these issues might require you to know the corners of the instruction set really well, or sometimes the solution lies outside the instruction set and relates to how your data structure is laid out in memory, leading you to AoS vs. SoA analysis, etc.
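To make the AoS vs. SoA point concrete, here's a minimal C++ sketch (the particle layout and function are invented for illustration): only the SoA layout gives the compiler contiguous runs of one field to map onto vector lanes.

    #include <cstddef>
    #include <vector>

    // Array of Structures: x, y, z interleaved in memory, so loading
    // eight consecutive x values needs strided or gather accesses.
    struct ParticleAoS { float x, y, z; };

    // Structure of Arrays: each field is contiguous, so eight
    // consecutive x values map straight onto the lanes of a register.
    struct ParticlesSoA { std::vector<float> x, y, z; };

    // With SoA a loop like this is trivially auto-vectorizable:
    // every lane does the same work on adjacent elements.
    void scale_x(ParticlesSoA& p, float s) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] *= s;
    }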
Compilers and vectorization:
Based on reading a lot of assembly output, I think what compilers usually struggle with are assumptions that the human programmer knows hold for a given piece of code, but that the compiler has no right to make. Some of this is basic alignment; gcc and clang have intrinsics for these. Sometimes it's related to the memory model of the programming language disallowing a load or a store at specific points.
GPGPU programmability:
GPUs being easy to program is something I take with a grain of salt. Yes, it's easy to get up and running with CUDA. Making an _efficient_ CUDA program, however, is easily as challenging as writing an efficient AVX program, if not more so.
> as long as vectorization can fail (and it will), […] you must come to deeply understand the auto-vectorizer. […] This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation
GPUs aren't really arrays of scalar cores. All threads in a warp run in lock step. If one takes a branch they all do, with operations being masked off as needed.
It's not all that different conceptually to AVX-512 with mask registers, except the vector size is even larger and of course the programming model differs.
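To make the comparison concrete, here's a minimal sketch of predication with AVX-512 mask registers (assuming AVX-512F; the function is invented for illustration): it's the CPU-side analogue of a warp where the threads that didn't take the branch are masked off.

    #include <immintrin.h>

    // c = a + b only in the lanes where a > 0; the other lanes keep
    // a's value, much like inactive threads in a warp on a branch.
    __m512 add_where_positive(__m512 a, __m512 b) {
        __mmask16 m = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GT_OQ);
        return _mm512_mask_add_ps(a, m, a, b);
    }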
> He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.
I have a simplistic explanation - maybe not what you're looking for but it is the best I can do...
At 12m23s in the video he says, "If you're working in a layer and the layers are well constructed (abstracted) you really can make a lot of progress. But if the top layer says, 'to make this really fast, go change the bottom layer', then it's going to get all tangled up."
That's what implementing an algorithm on a SIMD architecture feels like to me. I have to figure out a way of filling my SIMD width with data each clock cycle, while in contrast, the specification of the algorithm deals with data one piece at a time.
Take insertion sort as a (bad) example.
    i ← 1
    while i < length(A)
        j ← i
        while j > 0 and A[j-1] > A[j]
            swap A[j] and A[j-1]
            j ← j - 1
        end while
        i ← i + 1
    end while
That algorithm cannot easily take advantage of SIMD. You have to change the algorithm to make it work with the architecture.
We'd probably say the algorithm is the top level of the abstraction stack, and the SIMD architecture is a level near the bottom. So this problem is the opposite way around to how Jim phrased it, but the point is that we have NOT got clean abstraction - an implementation in one layer depends on the implementation in another.
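For contrast, here's a minimal sketch (AVX intrinsics; the function is invented, and the array length is assumed to be a multiple of 8) of a loop that does fill the SIMD width naturally: every iteration feeds eight independent elements to the same operation, with no interaction between lanes.

    #include <immintrin.h>
    #include <cstddef>

    // Clamp an array to a ceiling, 8 floats per iteration. Each lane
    // is independent, which is exactly the shape SIMD wants.
    void clamp_to(float* data, std::size_t n, float ceiling) {
        const __m256 c = _mm256_set1_ps(ceiling);
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(data + i);
            _mm256_storeu_ps(data + i, _mm256_min_ps(v, c));
        }
    }

Insertion sort has a loop-carried dependency (each swap depends on the previous comparison), so there is no such supply of independent elements to feed the lanes; you end up reaching for a different algorithm (sorting networks, merge-based approaches) to get SIMD to help.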
Are GPUs really easier to program than scalar w/ SIMD (or vector insns)? The programming models you have to work with for GPGPU seem quite obscure, whereas with CPU and SIMD, flipping a compiler switch gets you most of the way there, and self-contained intrinsics do the rest.
GPU programming is easy enough; the complexity comes from the separate memory system and the tedious (and not portable) API you need to use to access the GPU.
I prefer intrinsics as they give more control than shader languages and they can be written in C++ instead of fiddling with some garbage GPU API that runs async.
Disclaimer: I work on AMD ROCm, but my opinions are my own.
There's also HIP[1], which can be used as a thin wrapper around CUDA, or with the ROCm backend on AMD platforms. It doesn't yet match CUDA in either breadth of features or maturity, but it's getting closer every day.
As I understand it, that has to work for the CORAL 2 US "exascale" systems, so people who've been proved fairly right so far obviously have some confidence in it. (de Supinski of Livermore said he'd be out of a job if conventional wisdom was right, though it was pretty obvious at the time that it wasn't.)
Free software too, praise be.
At least part of the problem is that computing mostly depends on moving data. Memory bandwidth is relatively low, so it's difficult to sustain enough actual floating-point intensity, at least for "large" arrays, even when it's theoretically available. A classic example is GEMM (general matrix multiplication), where you should expect a good implementation to get around 90% of peak performance, but also expect it to jump through various tricky hoops to get there. With, say, vector multiplication the hoops aren't available, and you're ultimately memory-bound.
Yes, there's more to it than that, and SIMD has non-FP applications etc.
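To put rough numbers on that (back-of-the-envelope, not a measurement): element-wise vector multiplication c[i] = a[i] * b[i] on doubles does 1 flop per 24 bytes moved (two 8-byte loads and one 8-byte store), so ~100 GB/s of memory bandwidth sustains only ~4 GFLOP/s no matter how wide the SIMD units are. GEMM, by contrast, does O(n^3) flops on O(n^2) data, which is exactly why the blocking/tiling hoops can push it near peak.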
The power problem is solved by having cores more suited to a task. A CPU is completely general, but power inefficient. Dedicated HW is as efficient as it gets, but in the extreme is not flexible and only does one task well. With loads of extra silicon available, we can now use that for more specific engines/accelerators and of course not all of these would be active at once. So in a way the scaling / density does allow us to get more efficiency in some cases. The trick is finding the balance for a given process node.
> Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much.
Kids these days get 8 cores for a 100W TDP.
When I was a boy, 100W got you a single core. And you didn't get dynamic frequency scaling, so it'd be putting out that heat all the time.
(We also had to walk to school barefoot in the snow, uphill both ways)
The Pentium II was not as efficient as the III. I remember setting up a dual-socket machine where the power supply started to matter. The best thing was that the web browser would only suck 100% from one processor.
What's the problem? My old school pentiums kept my dorm room nice and toasty. Could keep my window cracked in the winter for fresh air while gentoo compiled...
The main problem is software, with GPGPUs you need to explicitly program for them, while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.
Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.
> the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.
With how AVX512 is implemented, there isn't much point in a compiler auto optimizing general purpose code to use it, because even if there is a theoretical speedup, it may well be slower in practice.
> while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms
No. I recently could really, really have used the packed saturated integer arithmetic and horizontal addition in AVX2 (but my old machine doesn't support it) and even better, the same but 512 bits wide on AVX512. It would only have been 6 or 7 instructions, if that, but it was inner loop, and mattered. Using compiler intrinsics would have been fine. I think you're looking at things too narrowly.
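Roughly the shape of what I needed, sketched with AVX2 intrinsics (assuming 16-bit elements; untested, so treat it as illustrative rather than the actual code):

    #include <immintrin.h>
    #include <cstdint>

    // Saturating add of 16 x int16 lanes, then horizontal reduction
    // of the result down to a single 32-bit sum.
    int32_t saturating_add_then_sum(__m256i a, __m256i b) {
        __m256i s = _mm256_adds_epi16(a, b);                      // packed saturating add
        __m256i w = _mm256_madd_epi16(s, _mm256_set1_epi16(1));   // widen pairs to int32
        w = _mm256_hadd_epi32(w, w);                              // horizontal adds...
        w = _mm256_hadd_epi32(w, w);                              // ...within each 128-bit half
        __m128i lo = _mm256_castsi256_si128(w);
        __m128i hi = _mm256_extracti128_si256(w, 1);
        return _mm_cvtsi128_si32(_mm_add_epi32(lo, hi));          // combine the two halves
    }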
I am looking at it from the point of view of the joe/jane developer who cannot tell head from tail regarding vector programming, doesn't even know what compiler intrinsics are for, and uses languages that don't expose them anyway.
Which is the whole point of "this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms": not only do those people not get it, there is a general decline in using languages that expose vector intrinsics, like C and C++, for regular LOB applications.
In my ideal world you'd be able to mark a function "this should compile to / run on GPGPU" and the compiler would potentially tell you why it can't do that. I'm not even sure anything is stopping us from implementing that, apart from the effort required. Sure, many ways to write that code will result in terrible performance, but it would still be closer to the auto-vectorisation experience.
The current OpenMP spec has GPU offload features specifically for what was expected of the Sierra supercomputer. I'm not sure how relevant a paper that old (relatively, I hasten to add) is.
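For what it's worth, the grandparent's wish looks a lot like OpenMP target offload already; a minimal sketch (the saxpy kernel is my own toy example, and whether it actually lands on a GPU depends on the compiler and its offload flags):

    #include <cstddef>

    // 'target teams distribute parallel for' asks the compiler to build a
    // device version of this loop and to move the mapped arrays for us.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }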
> Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.
I think you got this backwards - the lack of developers' interest is what leads to the mistaken impression that GPU compute is only good for multimedia and FP-crunching workloads. Even looking at the success of GPU compute in mining cryptocoins (only ASICs do better) ought to be enough to tell you that we could do a lot more with them if we cared to.
> What are the forces in chip design that are at play here?
The "weak form" of Moore's Law--"Performance doubles every 12-18 months"--is dead and buried.
The "strong form" of Moore's Law is still active--"Transistor cost halves every 12-18 months".
This means that you can't make the primary paths any faster. So, all you can do is add functionality and pray that someone magically can make that functionality relevant to the primary use cases.
AVX is not a "special purpose block", it's Intel's answer to not adding special purpose blocks on customer demand, like you can do with ARM.
Crypto or video decoding comes to mind, those would be much faster with dedicated silicon, but more general AVX instructions can get you halfway there. Well, maybe a quarter. People point out that AVX uses a lot of power, but they ignore that the same algorithm running instead on more but simpler cores would use even more power.
They exist today, but they were added after AVX. Every year we figure out how to cram more transistors into a square centimeter, and once the low-hanging fruit was done and we knew how to add even more transistors, we started putting in more and more specific functions.
That is Linus' point: he would have preferred to use that increase in transistor count for other things, like more cache.
More cache has diminishing returns, because cache wants to be as close as possible to the core logic. And modern CPUs are mostly cache anyway. Special-purpose blocks for common compute tasks are quite cheap.
Back in the day, CPUs didn't come with FPUs and the latter were optional co-processors.
The idea in the x86-world always was to "outsource" special requirements to dedicated hardware (FP co-processors, GPUs, sound cards, network cards, hardware codec cards, etc.), instead of putting them on the CPU package (like ARM-based SoCs).
So it's different philosophies entirely - tightly integrated SoCs vs versatile and flexible component-based hardware.
It's The One Ring ([ARM-based] SoCs) vs freedom of choice and modularity (PC). If I don't do simulations or 3d-modelling/rendering, I am free to choose a cheap display adapter without powerful 3D-acceleration and choose a better audio interface instead (e.g. for music production).
The SoC approach forces me to buy that fancy AI/ML-accelerator, various video codecs, and powerful graphics hardware with my CPU regardless of my needs, because the benevolent system provider (e.g. Apple) deems it fit for all...
Torvalds is just old-school in that he prefers freedom of choice and the "traditional" PC over highly integrated SoCs.
FP coprocessors "only" existed because the processes weren't advanced enough to have them inside the chip, but they were a natural extension (they were married to the instruction set of the chip - it wasn't a product, it was a feature)
> So it's not "just benchmarks", people actually want to do stuff with it
IIRC when Bulldozer was released and Intel's propaganda machine started spewing stories about how AMD's core count was fake because two cores shared an FP unit, there was a flurry of scientific papers on the subject.
IIRC, it was determined that even the hot path of FP-intensive code only executed a single FP op for every 7 non-FP operations. To put it differently, between FP ops all code has to execute operations to move data around.
Consequently, Bulldozer's FP benchmarks scaled linearly with respect to cores: even when multiple cores had to share an FP unit to run FP operations, those operations were so relatively scarce, even in number-crunching applications, that cores didn't block, and overall performance was not affected.
That's the relevance of FP in real-world benchmarks.
I for one would be delighted by having more caches or wider backends instead of AVX512, but I don't want SIMD to be pushed into GPUs. It'd be better to do the reverse - to push forward the asymmetric core idea and move more GPU functionality into lots of simpler cores tuned for SIMD at the cost of single thread performance.
It seems like they just keep that area mostly empty in processors without that feature, at least for the processors related to the one pictured. I'm not really sure how much effective cache could fit there without a major overhaul, but a chip designer or enthusiast likely would. This could be why Linus focused on computational enhancements when he discussed transistor budget.
From a quick glance at the proportions and considering not only the register files are halved, but also the vector EUs, I'd expect a 25% increase in L3 or a 50% in L2. That and some lessened thermal constraints.
If all cores see a single unified and consistent memory image (some scratchpad memory excepted), it's best if they all share the same basic ISA (with unimplemented instructions trapping to process migration or software emulation).
AVX-512's richness relative to x86 is like C++'s relative to C. Linus makes a summary assessment of how he can leverage these technologies to his advantage, and if the cost of learning the technology and all its intricacies outweighs the perceived advantage, that technology is garbage. This reaction from Linus appears to fit his conservative pattern. I think where Linus gets things wrong stems from his facts rather than his philosophy.
AVX-512's fantastic breadth is born out of an actual need to free compilers from constraints imposed by programs in virtually every mainstream language. All of these describe programs for an abstract machine rooted in a scalar instruction model. Without any further performance from increasing clock speed over time, the target has to become instructions-per-cycle and even operations-per-instruction. The limitations on ILP and the expense of powering circuitry to achieve it have been well studied for the past two decades; the failure to realize it is evident in the failure of NetBurst. Linus believes that the frontends of CPUs have a lot more to give, perhaps best exhibited by his refutation of CMOV (https://yarchive.net/comp/linux/cmov.html).
Today's programming languages haven't evolved to make it easier for programmers to describe non-scalar code. On the other hand, power constraints, and now security constraints, haven't made it easier for hardware to efficiently execute scalar code. Perhaps AVX-512 is as naive a bet as Itanium; if not, it might be just the missing piece compilers need that they didn't have twenty years ago.
Are Intel just delaying the inevitable? Is it safe to say (even today) that a slow GPU will crunch big matrices faster than a fast CPU? And that's before we get to price/performance. So all that's left is the bottleneck around PCIe which, in theory, leaves the CPU with an advantage only for small datasets - which we don't really care about anyway (because they happen quickly).
Maybe the tradeoff is somewhere interesting from a latency perspective - SDR or similar. I dunno, am I barking up the wrong tree?
AVX-512 is the surviving heritage of Larrabee, the chip that was supposed to overtake GPGPU and has so far failed to do so.
The only thing that Intel has going for their GPUs is that as typically happens with the underdog companies, they decided to play nice with FOSS drivers and with integrated GPUs they own the low budget laptop market.
Everyone who has done any serious 3D programming is painfully aware of how bad their OpenGL drivers used to be. They even used to fake OpenGL queries, reporting features as supported when they were actually implemented in software, thus making some games unusable.
That is why they started the campaign about optimizing games for Intel GPUs, and about how to make best use of the Graphics Performance Analyzers, which ironically in the old days were DirectX only.
The bottleneck you mention is only an issue when there isn't any shared memory available. If the hardware allows for unified memory models then there is no data transfer and the GPU can work right away; naturally there are still some synchronization points that need to happen.
In the article they quote Linus speculating that the increased core count of CPUs will achieve the same thing as AVX512 without the problems. I have read comments on HN that if cores keep increasing on CPUs they might be able to replace GPUs for some of the tasks as GPUs (or CUDA in particular) have quirks that CPUs don't have.
AVX512 in particular has issues. Using it slows down the CPU so actual wall clock benefits depends heavily on how it is used.
For general purpose computing maybe. For gaming the GPUs contain special operations for texture lookup and what not that would be very expensive in a CPU.
This may just be too obvious for you to mention, but GPUs only work well for (massively) parallel tasks such as matrix multiplication.
Of any day's computational workload, only graphics, (parts of) ML, and maybe space heating masquerading as financial innovation are amenable to being run in such a fashion. And those workloads are, as far as I can tell, already being run on GPUs (and similar) almost universally.
So I don't think there actually are major workloads that will shift away from Intel to GPUs in the near future?
SVE isn't in a co-processor, I guess is the point. There's a lot more to Fugaku than SVE (whether or not you think that's a version of avx512), though. No DDR is suggestive.
I read that, but putting it in a co-processor makes little sense if you care about memory bandwidth and latency - which you should, to keep the vector unit fed. Fujitsu have carefully designed this stuff with such considerations in mind, as far as I can tell.
AVX512 is both integer and floating point, not just FP, so this rant about FP comes across as ill informed.
Despite that I'd agree most people probably see no benefit from these units today. But that could change. For workloads with parallelism, wide SIMD is very efficient - more so than multiple threads anyway. The only way to get people to write vector code is to have vector processing available. Once it's ubiquitously available people might code for it and the benefits may become more apparent.
The very wide AVX stuff with integer ops, like these from the wiki:
- AVX-512 Byte and Word Instructions (BW) - extends AVX-512 to cover 8-bit and 16-bit integer operations
- AVX-512 Integer Fused Multiply Add (IFMA) - fused multiply-add of integers using 52-bit precision.
could be very useful. I could have done with those recently. They also don't (AFAIK) cause CPU frequency scaling (a polite term for downclocking). He may well be right about FP, though.
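For a flavour of what those extensions buy you, a minimal sketch (assuming AVX-512BW/IFMA hardware; the wrappers are invented and untested):

    #include <immintrin.h>

    // AVX-512BW: saturating add of 64 unsigned bytes per instruction,
    // the kind of 8/16-bit integer work base AVX-512F can't express.
    __m512i add_sat_u8(__m512i a, __m512i b) {
        return _mm512_adds_epu8(a, b);
    }

    // AVX-512IFMA: multiply 52-bit chunks and accumulate the low 52 bits
    // of each product into 64-bit lanes (handy for big-integer work).
    __m512i madd52lo(__m512i acc, __m512i a, __m512i b) {
        return _mm512_madd52lo_epu64(acc, a, b);
    }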
If he was right with FP, he'd know better than the business analysts at Intel. Instead, his opinion is based on what the market looked like thirty years ago.
Nine years ago, AMD tested the hypothesis that really more "cores" and higher integer throughput were all that was needed and that FP performance didn't matter. The resulting architecture (Bulldozer) was a near-fatal disaster. It didn't even work out in the datacenter, where you might expect that hypothesis to hold.
If Intel had their shit together they would have released AVX512 years ago with Skylake desktop, but they prefer to artificially segment the market, and have still not managed to release a desktop chip with AVX512--allowing AMD to catch up and now in many ways surpass them.
More than that, Bulldozer didn't even have good single thread integer performance. What it gave you was 8 cores that might be able to keep up with 4 of Intel's cores on something that has 8 threads. The market was not particularly interested in this, especially since at the time even fewer things could actually use 8 threads than they do now.
Bulldozer significantly outperformed Sandy Bridge on the workloads it was designed to be good at, namely multi-threaded integer workloads like compiling the Linux kernel.
If Linus' attitude of "I'd rather have more cores" and "FP doesn't really matter" were representative of market demand, you'd have expected Bulldozer to do well at least somewhere, as opposed to nowhere.
Are we looking at the same benchmarks? In the first they're comparing an 8-core Bulldozer to Sandy Bridge with 4 cores and no hyperthreading and it's basically even, sometimes it wins by a small margin on the threaded ones. In the second the 3770K has 4 cores with hyperthreading and that makes it look even worse.
If they were actually getting twice the integer performance per module as Intel was getting per core then it might've been interesting, but being the same or only slightly better when comparing modules to cores wasn't enough to overcome the single thread performance deficit which people still care about a lot.
You have to look for them, but there are benchmarks where AMD outperforms significantly. I can't find the Linux compilation benchmark now, but the difference was not small.
The Bulldozer really did have a big advantage in integer throughput per dollar, but that does not translate to a 2x speedup in pretty much any benchmark. FP throughput on the other hand shows up a lot.
I think we'd have been rather better off buying a load of Magny-Cours rather than Sandy Bridge for a university HPC system whose procurement I wasn't sufficiently involved in.
"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.
Because absolutely nobody cares outside of benchmarks."
That was back in the stone age when a lot of applications for FP math weren't mainstream. Most of AVX-512 doesn't even concern FP, there's lots of integer and bit twiddling stuff there.
Furthermore, people really do care about these benchmarks. It influences their purchasing, which is really the thing that matters most to Intel. A lot of people don't actually care about hypothetical security issues or the fact that the CPU is 14nm when it still outperforms 7nm in single-threaded code.
Also, it's not like you can just trade off IPC or extra cores for wider SIMD. It's not like "just add more cores" is just as good for throughput, otherwise GPUs wouldn't exist. Wider SIMD is cheap in terms of die area, for the throughput it gives you.
Lastly, these are just instructions, nothing says that an AVX-512 instruction needs to go through a physical 512-bit wide unit, it just says that you can take advantage of those semantics, if possible.
He has the history correct. Most of the CPUs that x86 beat in the market had superior FP performance; SPARCs, Alphas, PA-RISC, Itanium, etc.
> When's the last time he actually did anything with a computer?
According to Linus, he completes about 30 pull requests a day, plus some multiple of that in kernel builds. His $1900 32-core Threadripper speeds up that process a great deal, and FP contributes little to nothing.
Today people stream video+audio, encrypt+decrypt and render graphics. All of these have specialized silicon. If their AVX-512 vanished in the night almost no one would notice the next day.
Maybe we should all be astronomers and thermodynamicists writing bespoke finite element simulations and have a deep appreciation for the wonders of floating point ISAs, but that's just not the real world.
Speaking as someone who does scientific computing all day long, in part with FEM simulations, even for me AVX512 isn't usually worth it in terms of wall-clock time.
They wouldn't notice AVX-512 vanishing because they never had it in the first place, as Intel hasn't shipped it in the CPUs people actually use for those tasks--just servers and random laptops.
As for the rest, you are wrong: AVX-512 is not just floating point by any means, and floating point is used by more than just scientific workloads.
Games, simulations, modeling software, etc. can all make heavy use of floating point.
This discussion is about avx512, which has shown to have some issues compared to other solutions. Nobody is claiming FP is garbage or that we don't ever need SIMD.
Regarding Linus: he is almost always right. He has had more hands-on experience than everyone in this thread combined. I trust his judgement, he has earned it by being consistently correct while his opposition has just complained and eventually failed to deliver themselves.
The evidence for "more hands-on experience than everyone in this thread combined"? I've seen him sound off about things he obviously doesn't have hands on experience with, and not obviously be proved right about things he does (like compilation). That said, avx512 is over-rated in the area I know best.
First of all, there is indeed SIMD code in the kernel. Check out the very beginning of your boot message.
More importantly, the kernel needs to support context switch between userspace applications that use SIMD registers. So it touches a bunch of critical data structures and event handlers.
Fixed point audio decoding is common, actually. In general, floating point math makes sense when dealing with computations in a range spanning different orders of magnitude, where one cares about relative precision. This describes a lot of what we use computers for, but fixed point math is a lot more efficient and makes sense for simpler cases.
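A minimal sketch of what fixed point means here (Q15, i.e. 16-bit values with 15 fractional bits; illustrative, not any particular codec's code):

    #include <cstdint>

    // Q15 fixed point: value = raw / 32768. A multiply is an integer
    // multiply plus a shift; no FPU, and the precision is fixed.
    int16_t q15_mul(int16_t a, int16_t b) {
        int32_t p = static_cast<int32_t>(a) * b;   // Q30 intermediate
        p = (p + (1 << 14)) >> 15;                 // round and rescale to Q15
        if (p >  32767) p =  32767;                // saturate on overflow
        if (p < -32768) p = -32768;
        return static_cast<int16_t>(p);
    }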
No, it isn't common. Source: someone who actually does this stuff.
Audio processing and decoding is all about maintaining intermediate results at appropriate precision. The magnitudes involved far exceed the bit width at the output of the pipeline. The only reason you would ever use fixed point is for speed... which is no longer necessary, and needs to stay that way.
I think you read it wrong, slightly helped by either Linus misspeaking or Phoronix misquoting him. The "matter" in "and it matter not one iota" clearly should be past tense.
As I see it, what he's saying is that back in the day, the majority of those buying CPUs did not care about FP code. And he thinks that today the same is true of AVX-512, the majority of those that buy CPUs don't care about AVX-512.