If you're running anything CPU-intensive these days, you're definitely going to use those AVX units.
Video editing, video games, graphics, 3D modeling, Photoshop. Even Stockfish chess uses newer instructions (not SIMD, but the bitboard popcnt and pext/pdep instructions) to accelerate chess computations.
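For illustration (this is not Stockfish's actual code, just a rough sketch of the idea), the bitboard tricks look something like this in C, assuming a POPCNT/BMI2-capable CPU and gcc or clang with `-mpopcnt -mbmi2`:

```c
#include <stdint.h>
#include <immintrin.h>   // _pext_u64 (BMI2)

// Hypothetical illustration: count pieces on a bitboard with a single POPCNT,
// and compress the relevant occupancy bits down to a dense table index with PEXT
// (the way PEXT-based sliding-piece attack lookups work).
static inline int popcount_bb(uint64_t bb) {
    return __builtin_popcountll(bb);          // compiles to one POPCNT instruction
}

static inline uint64_t occupancy_index(uint64_t occupied, uint64_t mask) {
    return _pext_u64(occupied, mask);         // gather the masked bits into the low bits
}
```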
At a bare minimum, AVX grossly accelerates memcpy and memset operations. (Setting 256 bits per assembly instruction instead of 64 bits per operation is a big improvement.) And virtually every program can benefit from faster memcpys and faster memsets.
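To make the "256 bits per instruction" point concrete, a minimal AVX2 memset inner loop might look roughly like this (just a sketch, assuming a 32-byte-aligned destination and a length that's a multiple of 32; a real memset also handles the ragged head and tail):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Sketch only: one 32-byte (256-bit) store per loop iteration.
static void memset_avx2(void *dst, uint8_t value, size_t n) {
    __m256i v = _mm256_set1_epi8((char)value);   // broadcast the byte to all 32 lanes
    uint8_t *p = dst;
    for (size_t i = 0; i < n; i += 32)
        _mm256_store_si256((__m256i *)(p + i), v);
}
```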
"Standard Software" isn't written to be very fast. But anything that's even close to CPU-bound is being upgraded to use more and more SIMD instructions.
In theory, yes. I was looking to build a new rig; I do a lot of video. I'm running a 2697v2 in one machine and a 3930K in another (both built around the same time, ~5-6 years ago). To be honest, I can't really tell much of a difference using those for heavy video editing and FX work vs. the Threadripper and newer Xeons I'm using at other places. I'm not saying they're not faster, but there isn't much of a noticeable difference, if any, in using those machines.
> At a bare minimum, AVX grossly accelerates memcpy and memset operations.
Not necessarily. For large operations, and depending on processor generation, a simple rep stos/movsb will be simpler (no alignment requirements) and will saturate your memory bandwidth just as well as any AVX sequence, with less icache pressure.
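For reference, the rep stosb approach really is just this (a minimal sketch using GCC/Clang x86-64 inline asm; on CPUs with ERMSB the hardware uses wide internal operations for you):

```c
#include <stddef.h>

/* Sketch of a rep stosb memset (x86-64 only).
   RDI = destination, RCX = byte count, AL = fill value. */
static void *memset_rep_stosb(void *dst, int c, size_t n) {
    void *d = dst;
    __asm__ volatile ("rep stosb"
                      : "+D"(d), "+c"(n)
                      : "a"(c)
                      : "memory");
    return dst;
}
```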
> At a bare minimum, AVX grossly accelerates memcpy and memset operations. (Setting 256 bits per assembly instruction instead of 64 bits per operation is a big improvement.) And virtually every program can benefit from faster memcpys and faster memsets.
How often are memcpy and memset CPU-bound, though?
Whenever it's in the L1 or L2 caches, which on Intel have 64-byte-per-clock bandwidth (96-byte-per-clock bandwidth to L1 cache, 64-byte-per-clock theoretical L2 bandwidth, dropping to ~29 bytes sustained).
L3 cache on Skylake architectures sustains ~18 bytes per clock, which is still more than 128 bits per clock of bandwidth to your L3 cache.
Soooo... I'd say for roughly any memset or memcpy smaller than 256kB or so, you're going to benefit from an AVX-based memset or memcpy.
At ~8MB or so, where you're hitting L3 cache, it's probably still a benefit, but not much of one. The important thing is that Skylake only has one store unit, so you can only write once per clock cycle.
So... do you write one 64-bit value, or do you write one 256-bit (32-byte) value?
----------
To be fair: I'm pretty sure every compiler's default settings today output SSE-based memcpys and memsets (128-bit). So AVX doubles that to 256-bit.
The issue - and one that a compiler often can't really reason about very well - is that in using AVX (and especially AVX-512) you incur other performance costs due to throttling the CPU down to lower frequencies and having additional latency issues. Deciding when it's actually worth doing so for a memcpy is, in general, a very hard problem.
This is the problem when I have a Threadripper and can't see these issues, lol. One day I'll get a proper AVX512 machine and try this stuff for myself.
Still, it seems like the documentation elsewhere says that 128-bit AVX doesn't cause any clocking issues. So AVX512 instructions applied to 128-bit registers should still lead to good speeds without any clocking problems.
Right, if you stick to 128 bit registers, you avoid most of the issues. Does that get you much benefit for a memcpy over an SSE implementation on a SKX CPU? I don't think so, but I haven't looked into that particular problem, so maybe I'm missing something obvious here.
I checked the generated assembly to confirm they use `xmm` and `zmm` registers, respectively. (The 512-bit version also uses xmm to finish off a remainder. Ideally, it would use masked loads/stores, but I haven't seen any auto-vectorizers actually do that.)
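To give an idea of what that would look like (hand-written, not compiler output; assumes AVX-512F and a double array), a masked remainder copy is roughly:

```c
#include <immintrin.h>
#include <stddef.h>

// Sketch: copy the last (n % 8) doubles with a masked load/store
// instead of a scalar or xmm tail loop. Compile with -mavx512f.
static void copy_tail(double *dst, const double *src, size_t n) {
    size_t rem = n % 8;
    if (rem) {
        __mmask8 m = (__mmask8)((1u << rem) - 1);          // enable only the low `rem` lanes
        __m512d v = _mm512_maskz_loadu_pd(m, src + n - rem);
        _mm512_mask_storeu_pd(dst + n - rem, m, v);
    }
}
```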
I'm using gcc 8.2.1. I believe the option `-mprefer-vector-width` was added recently.
Now I compiled both into shared libraries (drop the `-S` and choose an appropriate file name), and benchmarked from within Julia.
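The actual kernel source isn't shown here, so as a stand-in, the build steps were roughly along these lines (the function below is a hypothetical example, not the one benchmarked):

```c
/* copy.c -- hypothetical stand-in kernel for the benchmark described above */
void copy(double *restrict dst, const double *restrict src, long n) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Roughly (assuming gcc 8+ on an AVX-512 machine):
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -S copy.c   # inspect the zmm asm
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -S copy.c   # inspect the xmm asm
 * Then drop -S and build shared libraries to call from Julia:
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC copy.c -o copy512.so
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -shared -fPIC copy.c -o copy128.so
 */
```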
The advantage wasn't very impressive here, but it persisted for vectors of length 20,000. At 200,000 we saw the memory bottleneck hit in force, when it busts the per-core L2 cache. The 7900X's per-core L2 cache is 1,048,576 bytes, which translates to 131,072 doubles.
This is a nice benchmark, thanks for doing it, but the point I was making is somewhat orthogonal to this.
Of course there's a performance advantage from using 512 bit registers for a memcpy - but a memcpy is rarely a major performance bottleneck by itself and is usually surrounded by other code. Unless that code is also AVX-512, you've just made it slower by optimizing the memcpy. My point was that a compiler can't usually decide whether it's worth making the optimization in light of the broader context.
The other point was whether using AVX-512 while sticking to xmm registers is faster than just using xmm SSE/AVX code. I don't have an AVX-512 capable machine at the moment, perhaps you'd like to check if your 128 bit version is any faster than just doing "gcc -march=skylake -O2 -mprefer-vector-width=128" (thereby retaining the microarchitecture optimizations, but sticking to AVX2)?
I don’t know about the other software types, but games are not typical users of AVX (at client runtime anyway, the tooling used to make the games is different).