If you're running anything CPU-intensive these days, you're definitely going to use those AVX units.
Video editing, video games, graphics, 3D modeling, Photoshop. Even Stockfish chess uses newer instructions (not SIMD, but the bitboard popcnt and pext/pdep instructions) to accelerate chess computations.
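For illustration (this is not Stockfish's actual code, just a rough sketch of the idea), the bitboard tricks look something like this in C, assuming a POPCNT/BMI2-capable CPU and gcc or clang with `-mpopcnt -mbmi2`:

```c
#include <stdint.h>
#include <immintrin.h>   // _pext_u64 (BMI2)

// Hypothetical illustration: count pieces on a bitboard with a single POPCNT,
// and compress the relevant occupancy bits down to a dense table index with PEXT
// (the way PEXT-based sliding-piece attack lookups work).
static inline int popcount_bb(uint64_t bb) {
    return __builtin_popcountll(bb);          // compiles to one POPCNT instruction
}

static inline uint64_t occupancy_index(uint64_t occupied, uint64_t mask) {
    return _pext_u64(occupied, mask);         // gather the masked bits into the low bits
}
```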
At a bare minimum, AVX grossly accelerates memcpy and memset operations. (Setting 256 bits per assembly instruction instead of 64 bits per operation is a big improvement.) And virtually every program can benefit from faster memcpys and faster memsets.
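To make the "256 bits per instruction" point concrete, a minimal AVX2 memset inner loop might look roughly like this (just a sketch, assuming a 32-byte-aligned destination and a length that's a multiple of 32; a real memset also handles the ragged head and tail):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Sketch only: one 32-byte (256-bit) store per loop iteration.
static void memset_avx2(void *dst, uint8_t value, size_t n) {
    __m256i v = _mm256_set1_epi8((char)value);   // broadcast the byte to all 32 lanes
    uint8_t *p = dst;
    for (size_t i = 0; i < n; i += 32)
        _mm256_store_si256((__m256i *)(p + i), v);
}
```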
"Standard Software" isn't written to be very fast. But anything that's even close to CPU-bound is being upgraded to use more and more SIMD instructions.
In theory, yes. I was looking to build a new rig; I do a lot of video. I'm running a 2697v2 in one machine and a 3930K in another (both built around the same time, ~5-6 years ago). To be honest, I can't really tell much of a difference using those for heavy video editing and FX work vs. the Threadripper and newer Xeons I'm using at other places. I'm not saying they're not faster, but there isn't much of a noticeable difference, if any, in using those machines.
> At a bare minimum, AVX grossly accelerates memcpy and memset operations.
Not necessarily. For large operations, and depending on processor generation, a simple rep stos/movsb will be simpler (no alignment requirements) and will saturate your memory bandwidth just as well as any AVX sequence, with less icache pressure.
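For reference, the rep stosb approach really is just this (a minimal sketch using GCC/Clang x86-64 inline asm; on CPUs with ERMSB the hardware uses wide internal operations for you):

```c
#include <stddef.h>

/* Sketch of a rep stosb memset (x86-64 only).
   RDI = destination, RCX = byte count, AL = fill value. */
static void *memset_rep_stosb(void *dst, int c, size_t n) {
    void *d = dst;
    __asm__ volatile ("rep stosb"
                      : "+D"(d), "+c"(n)
                      : "a"(c)
                      : "memory");
    return dst;
}
```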
> At a bare minimum, AVX grossly accelerates memcpy and memset operations. (Setting 256 bits per assembly instruction instead of 64 bits per operation is a big improvement.) And virtually every program can benefit from faster memcpys and faster memsets.
How often are memcpy and memset CPU-bound, though?
Whenever it's in the L1 or L2 caches, which on Intel have 64-byte-per-clock bandwidth (96-byte-per-clock bandwidth to L1 cache, 64-byte-per-clock theoretical L2 bandwidth, dropping to ~29 bytes sustained).
L3 cache on Skylake architectures sustains ~18 bytes per clock, which is still more than 128 bits per clock of bandwidth to your L3 cache.
Soooo... I'd say for roughly any memset or memcpy smaller than 256kB or so, you're going to benefit from an AVX-based memset or memcpy.
At ~8MB or so, where you're hitting L3 cache, it's probably still a benefit, but not much of one. The important thing is that Skylake only has one store unit, so you can only write once per clock cycle.
So... do you write one 64-bit value, or do you write one 256-bit (32-byte) value?
----------
To be fair: I'm pretty sure every compiler's default settings today output SSE-based memcpys and memsets (128-bit). So AVX doubles that to 256-bit.
The issue - and one that a compiler often can't really reason about very well - is that in using AVX (and especially AVX-512) you incur other performance costs due to throttling the CPU down to lower frequencies and having additional latency issues. Deciding when it's actually worth doing so for a memcpy is, in general, a very hard problem.
This is the problem when I have a Threadripper and can't see these issues, lol. One day I'll get a proper AVX512 machine and try this stuff for myself.
Still, it seems like the documentation elsewhere says that 128-bit AVX doesn't cause any clocking issues. So AVX512 instructions applied to 128-bit registers should still lead to good speeds without any clocking problems.
Right, if you stick to 128 bit registers, you avoid most of the issues. Does that get you much benefit for a memcpy over an SSE implementation on a SKX CPU? I don't think so, but I haven't looked into that particular problem, so maybe I'm missing something obvious here.
I checked the generated assembly to confirm they use `xmm` and `zmm` registers, respectively. (The 512-bit version also uses xmm to finish off a remainder. Ideally, it would use masked loads/stores, but I haven't seen any auto-vectorizers actually do that.)
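To give an idea of what that would look like (hand-written, not compiler output; assumes AVX-512F and a double array), a masked remainder copy is roughly:

```c
#include <immintrin.h>
#include <stddef.h>

// Sketch: copy the last (n % 8) doubles with a masked load/store
// instead of a scalar or xmm tail loop. Compile with -mavx512f.
static void copy_tail(double *dst, const double *src, size_t n) {
    size_t rem = n % 8;
    if (rem) {
        __mmask8 m = (__mmask8)((1u << rem) - 1);          // enable only the low `rem` lanes
        __m512d v = _mm512_maskz_loadu_pd(m, src + n - rem);
        _mm512_mask_storeu_pd(dst + n - rem, m, v);
    }
}
```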
I'm using gcc 8.2.1. I believe the option `-mprefer-vector-width` was added recently.
Now I compiled both into shared libraries (drop the `-S` and choose an appropriate file name), and benchmarked from within Julia.
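The actual kernel source isn't shown here, so as a stand-in, the build steps were roughly along these lines (the function below is a hypothetical example, not the one benchmarked):

```c
/* copy.c -- hypothetical stand-in kernel for the benchmark described above */
void copy(double *restrict dst, const double *restrict src, long n) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Roughly (assuming gcc 8+ on an AVX-512 machine):
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S copy.c   # inspect the zmm asm
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -S copy.c   # inspect the xmm asm
 * Then drop -S and build shared libraries to call from Julia:
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC copy.c -o copy512.so
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -shared -fPIC copy.c -o copy128.so
 */
```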
The advantage wasn't very impressive here, but it persisted for vectors of length 20,000. At 200,000 we saw the memory bottleneck hit in force, when it busts the per-core L2 cache. The 7900X's per-core L2 cache is 1,048,576 bytes, which translates to 131,072 doubles.
This is a nice benchmark, thanks for doing it, but the point I was making is somewhat orthogonal to this.
Of course there's a performance advantage from using 512 bit registers for a memcpy - but a memcpy is rarely a major performance bottleneck by itself and is usually surrounded by other code. Unless that code is also AVX-512, you've just made it slower by optimizing the memcpy. My point was that a compiler can't usually decide whether it's worth making the optimization in light of the broader context.
The other point was whether using AVX-512 while sticking to xmm registers is faster than just using xmm SSE/AVX code. I don't have an AVX-512 capable machine at the moment, perhaps you'd like to check if your 128 bit version is any faster than just doing "gcc -march=skylake -O2 -mprefer-vector-width=128" (thereby retaining the microarchitecture optimizations, but sticking to AVX2)?
I don’t know about the other software types, but games are not typical users of AVX (at client runtime anyway, the tooling used to make the games is different).