I was experimenting a while ago with something that I think is related to this. ...

phkahler · on July 22, 2022

While reading your post I kept thinking "de-interleave the bits of a single counter" since I have used that before to great benefit. It does suffer from issues if your sizes are not powers of two, so your grey code modification seems interesting to me. I'll be looking in to that.

FYI this also extends to higher dimensions. I once used it to split a single loop variable into 3 for a "normal" matrix multiplication to get a huge speedup. Unrolling the inner 3 bits is possible too but a bit painful.

Also used it in ray tracing to scan pixels in Z-order, which gave about 10 percent performance improvement at the time.

But yes, I must look at this grey code step to see if it makes sense to me and understand why its not helping you even with reduced cache misses.

robalni · on July 23, 2022

Interestingly I get more cache misses with the gray code solution but still fewer switches between cache lines. This has to be because the cache line prefetcher can predict the memory usage in the simple version but it's harder to predict in the gray code or de-interleave version.

Also, I have never thought about de-interleaving bits without using gray code but that seems to work similarly. The main reason I used gray code is that then we don't have to switch memory blocks as often since only one of the indexes changes every time so one of the current memory blocks can always stay the same as in the previous iteration.

samsquire · on July 22, 2022

This is similar to what I'm working on.

I am working on machine sympathetic systems. One of my ideas is concurrent loops.

Nested loops are equivalent to the Nth product of the loop × loop × loop or the Cartesian product.

  for letter in letters:
   for number in numbers:
    for symbol in symbols:
     print(letter + number + symbol)

If len(letters) == 3, len(numbers) == 3, len(symbols) == 3. If you think of this as loop indexes "i", "j" and "k" and they begin at 000 and go up in kind of base 3 001, 001, 002, 010, 011, 012, 020, 100 and so on.

The formula for "i", "j" and "k" is

  N = self.N
  for index, item in enumerate(self.sets):
    N, r = divmod(N, len(item))
    combo.append(r)

So the self.N is the product number of the nested loops, you can increment this by 1 each time to get each product of the nested loops.

We can load balance the loops and schedule the N to not starve the outer loops. We can independently prioritize each inner loop. If you represent your GUI or backend as extremely nested loops, you can write easy to understand code with one abstraction, by acting as if loops were independent processes.

So rather than that sequence, we can generate 000, 111, 222, 010, 230, 013 and so on.

Load balanced loops means you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel loops with multithreading. I plan it to be a pattern to create incredibly reactive frontends and backends with low latency to processing. Many frontends lag when encrypting, compressing or resizing images, it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.

See https://github.com/samsquire/multiversion-concurrency-contro...

Loops are rarely first class citizens in most languages I have used and I plan to change that.

I think I can combine your inspiration of gray codes with this idea.

I think memory layout and data structure and algorithm can be 3 separate decisions. I am yet to see any developers talk of this. Most of the time the CPU is idling waiting for memory, disk or network. We can arrange and iterate data to be efficient.

samsquire · on July 22, 2022

Link to my concurrent loop repository.

https://github.com/samsquire/multiversion-concurrency-contro...

samsquire · on July 22, 2022

I decided to try implement get_gray_xy. I am wondering how you can generalise it to any number of loops and apply it to nested loops of varying sizes, that is if you have three sets and the sizes are 8, 12, 81.

Wikipedia says graycodes can be found by num ^ (num >> 1)

  a = [1, 2, 3, 4]
  b = [2, 4, 8, 16]
  indexes = set()
  correct = set()

  print("graycode loop indexes")
  for index in range(0, len(a) * len(b)):
    code = index ^ (index >> 1)
    left = code & 0x0003
    right = code >> 0x0002 & 0x0003
  
    print(left, right)
    indexes.add((left, right))
  
  assert len(indexes) == 16
  
  print("regular nested loop indexes")
  
  for x in range(0, len(a)):
    for y in range(0, len(b)):
      correct.add((x,y))
      print(x, y)
  
  
  assert correct == indexes

This produces the sequence:

    graycode loop indexes
    0 0
    1 0
    3 0
    2 0
    2 1
    3 1
    1 1
    0 1
    0 3
    1 3
    3 3
    2 3
    2 2
    3 2
    1 2
    0 2
    regular nested loop indexes
    0 0
    0 1
    0 2
    0 3
    1 0
    1 1
    1 2
    1 3
    2 0
    2 1
    2 2
    2 3
    3 0
    3 1
    3 2
    3 3

robalni · on July 22, 2022

Here is the implementation of get_gray_xy that I used:

    static uint64_t deinterleave(uint64_t x) {
        x = x & 0x5555555555555555;
        x = (x | (x >>  1)) & 0x3333333333333333;
        x = (x | (x >>  2)) & 0x0f0f0f0f0f0f0f0f;
        x = (x | (x >>  4)) & 0x00ff00ff00ff00ff;
        x = (x | (x >>  8)) & 0x0000ffff0000ffff;
        x = (x | (x >> 16)) & 0x00000000ffffffff;
        return x;
    }
    static void get_gray_xy(uint64_t n, uint64_t *x, uint64_t *y) {
        uint64_t gray = n ^ (n >> 1);
        *x = deinterleave(gray);
        *y = deinterleave(gray >> 1);
    }

I don't think the gray code solution can work well for sets of different sizes because the area of indexes that you go through grows as a square.

But it does work for different number of dimensions or different number of nested loops. For 3 dimensions each coordinate is constructed from every 3rd bit instead of 2nd.

So if you have index 14, then gray code is 9 (1001) which means in two dimensions x=1,y=2 and in three dimensions, x=3,y=0,z=0.

samsquire · on July 22, 2022

Thank you for sharing your idea and your code and thoughts and thanks for your reply.

I would like to generalise it for sets of different sizes. Maybe it's something different to a gray codes which you taught me today, it feels that it should be simple.

Maybe someone kind can step in and help me or I can ask it on Stackoverflow.

I need to think it through.

At one point in time I was reading a section on Intel's website of cache optimisation using a tool that shows you the cache behaviour of the CPU and column or row major iteration it had a GUI. I found it very interesting but I cannot seem to find it at this time. I did find some medium article of Cachediff.

phkahler · on July 22, 2022

deinterleave doesn't actually produce gray code does it? The GP said to convert to gray code before doing the deinterleave.

jve · on July 22, 2022

> but the algorithm didn't get faster

Could it be branch predictor at play? https://stackoverflow.com/questions/11227809/why-is-processi...