Hacker News

I was experimenting a while ago with something that I think is related to this.

If you have a quadratic algorithm where you need to compare each pair of objects in a large array of objects, you might use a loop in a loop like this:

    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            do_something(arr[i], arr[j]);
        }
    }
When this algorithm runs, it will access every cache line or memory page in the array for each item, because j goes through the whole array for each i.

I thought that a better algorithm might do all possible comparisons inside the first block of memory before accessing any other memory. I wanted this to work independently of the size of cache lines or pages, so the solution was that for each power of 2, we compare all items at indexes below that power of 2 before moving on to the second half of the next power of 2.

This means that we first compare indexes (0,0), then (1,0),(1,1),(0,1), then all pairs of {0,1,2,3} that haven't yet been compared, and so on.

It turns out that the sequence (0,0),(1,0),(1,1),(0,1) that we went through (bottom-left, bottom-right, top-right, top-left in a coordinate system) can be repeated recursively, by treating the whole sequence done so far as the first item in the next iteration of the same sequence. So the next step is to do (0,0),(1,0),(1,1),(0,1) again (but backwards this time), starting on the (2,0) block, so it becomes (2,1),(3,1),(3,0),(2,0); then we do it again starting on the (2,2) block, and then on the (0,2) block. Now we have done a new complete block that we can repeat again, starting on the (4,0) block, then (4,4), then (0,4). We continue like this through the whole array.

As you can see, every time we move, only one bit in one index number is changed. What we have here is actually two halves of a gray code that is counting up, half of it in the x coordinate and the other half in y. This means that we can iterate through the whole array in the described way by making only one loop that goes up to the square of the size and then we get the two indexes (x and y) by calculating the gray code of i and then de-interleaving the result so that bits 0,2,4... make the x coordinate and bits 1,3,5... make y. Then we compare indexes x and y:

    for (int i = 0; i < size * size; i++) {
        get_gray_xy(i, &x, &y);
        do_something(arr[x], arr[y]);
    }
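The same mapping can be sketched in Python (the function name gray_xy is mine; it mirrors the C get_gray_xy shown further down the thread):

```python
def gray_xy(i):
    """Map a single loop counter to (x, y): gray-code i, then de-interleave."""
    g = i ^ (i >> 1)  # binary-reflected gray code of i
    x = y = bit = 0
    while g:
        if bit % 2 == 0:
            x |= (g & 1) << (bit // 2)  # even bits 0,2,4... build x
        else:
            y |= (g & 1) << (bit // 2)  # odd bits 1,3,5... build y
        g >>= 1
        bit += 1
    return x, y

# First 8 steps: (0,0) (1,0) (1,1) (0,1) (2,1) (3,1) (3,0) (2,0)
print([gray_xy(i) for i in range(8)])
```

Because consecutive gray codes differ in exactly one bit, exactly one of the two coordinates changes at each step, which is what keeps one of the two memory blocks fixed between iterations.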
The result of all this was that the number of cache line/page switches went down by orders of magnitude, but the algorithm didn't get faster. This is probably for two reasons: incremental linear memory access is very fast on modern hardware, and calculating the gray code and all that added overhead.


While reading your post I kept thinking "de-interleave the bits of a single counter", since I have used that before to great benefit. It does suffer from issues if your sizes are not powers of two, so your gray code modification seems interesting to me. I'll be looking into that.

FYI this also extends to higher dimensions. I once used it to split a single loop variable into 3 for a "normal" matrix multiplication to get a huge speedup. Unrolling the inner 3 bits is possible too but a bit painful.

Also used it in ray tracing to scan pixels in Z-order, which gave about 10 percent performance improvement at the time.

But yes, I must look at this gray code step to see if it makes sense to me and understand why it's not helping you even with reduced cache misses.


Interestingly I get more cache misses with the gray code solution but still fewer switches between cache lines. This has to be because the cache line prefetcher can predict the memory usage in the simple version but it's harder to predict in the gray code or de-interleave version.

Also, I have never thought about de-interleaving bits without using gray code but that seems to work similarly. The main reason I used gray code is that then we don't have to switch memory blocks as often since only one of the indexes changes every time so one of the current memory blocks can always stay the same as in the previous iteration.


This is similar to what I'm working on.

I am working on machine sympathetic systems. One of my ideas is concurrent loops.

Nested loops are equivalent to the Cartesian product of their iteration spaces: loop × loop × loop.

  for letter in letters:
    for number in numbers:
      for symbol in symbols:
        print(letter + number + symbol)
Say len(letters) == 3, len(numbers) == 3 and len(symbols) == 3. If you think of the loop indexes as "i", "j" and "k", they begin at 000 and count up in a kind of base 3: 001, 002, 010, 011, 012, 020, 021, 022, 100 and so on.

The formula for "i", "j" and "k" is

  N = self.N
  combo = []
  for item in self.sets:
    N, r = divmod(N, len(item))
    combo.append(r)
So self.N is the index of one product of the nested loops; you can increment it by 1 each time to step through every product of the nested loops.
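A self-contained version of that decode step might look like this (nth_product is my own name for it; the first set listed varies fastest):

```python
def nth_product(sets, n):
    """Decode flat index n into one element from each set via mixed-radix divmod."""
    combo = []
    for s in sets:
        n, r = divmod(n, len(s))  # r selects the element, n carries to the next "digit"
        combo.append(s[r])
    return combo

letters, numbers, symbols = "ab", "12", "!?"
for n in range(len(letters) * len(numbers) * len(symbols)):
    print("".join(nth_product([letters, numbers, symbols], n)))
```

Every value of n from 0 to the product of the set sizes minus 1 yields a distinct combination, so a single counter replaces the three nested loops.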

We can load balance the loops and schedule the N to not starve the outer loops. We can independently prioritize each inner loop. If you represent your GUI or backend as extremely nested loops, you can write easy to understand code with one abstraction, by acting as if loops were independent processes.

So rather than that sequence, we can generate 000, 111, 222, 010, 230, 013 and so on.

Load-balanced loops mean you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel multithreaded loops. I plan for it to be a pattern for creating incredibly reactive frontends and backends with low latency between input and processing. Many frontends lag when encrypting, compressing or resizing images; it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.
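One way to sketch the "loops as independent processes" idea with plain generators (a toy round-robin scheduler of my own, not the linked implementation):

```python
from collections import deque

def round_robin(*iterables):
    """Interleave several loops so none starves: take one step from each in turn."""
    queue = deque(iter(it) for it in iterables)
    while queue:
        it = queue.popleft()
        try:
            yield next(it)
        except StopIteration:
            continue  # this loop is finished; drop it from the rotation
        queue.append(it)

# Three "loops" make progress together instead of running to completion one by one.
print(list(round_robin("abc", "123", "xy")))  # a 1 x b 2 y c 3
```

Prioritising an inner loop would then amount to scheduling it more than once per round, rather than restructuring the nesting.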

See https://github.com/samsquire/multiversion-concurrency-contro...

Loops are rarely first class citizens in most languages I have used and I plan to change that.

I think I can combine your inspiration of gray codes with this idea.

I think memory layout, data structure and algorithm can be 3 separate decisions. I have yet to see developers talk about this. Most of the time the CPU is idling, waiting for memory, disk or network. We can arrange and iterate data to be efficient.



I decided to try implementing get_gray_xy. I am wondering how you can generalise it to any number of loops and apply it to nested loops of varying sizes, that is, if you have three sets and the sizes are 8, 12 and 81.

Wikipedia says a gray code can be computed by num ^ (num >> 1):

  a = [1, 2, 3, 4]
  b = [2, 4, 8, 16]
  indexes = set()
  correct = set()

  print("graycode loop indexes")
  for index in range(0, len(a) * len(b)):
  code = index ^ (index >> 1)       # gray code of the counter
    left = code & 0x0003              # low two bits
    right = code >> 0x0002 & 0x0003   # high two bits
  
    print(left, right)
    indexes.add((left, right))
  
  assert len(indexes) == 16
  
  print("regular nested loop indexes")
  
  for x in range(0, len(a)):
    for y in range(0, len(b)):
      correct.add((x,y))
      print(x, y)
  
  
  assert correct == indexes
This produces the sequence:

    graycode loop indexes
    0 0
    1 0
    3 0
    2 0
    2 1
    3 1
    1 1
    0 1
    0 3
    1 3
    3 3
    2 3
    2 2
    3 2
    1 2
    0 2
    regular nested loop indexes
    0 0
    0 1
    0 2
    0 3
    1 0
    1 1
    1 2
    1 3
    2 0
    2 1
    2 2
    2 3
    3 0
    3 1
    3 2
    3 3


Here is the implementation of get_gray_xy that I used:

    /* Collect the even-numbered bits of x into the low half of the result. */
    static uint64_t deinterleave(uint64_t x) {
        x = x & 0x5555555555555555;
        x = (x | (x >>  1)) & 0x3333333333333333;
        x = (x | (x >>  2)) & 0x0f0f0f0f0f0f0f0f;
        x = (x | (x >>  4)) & 0x00ff00ff00ff00ff;
        x = (x | (x >>  8)) & 0x0000ffff0000ffff;
        x = (x | (x >> 16)) & 0x00000000ffffffff;
        return x;
    }
    /* Gray-code the loop counter n, then split its even bits into x and odd bits into y. */
    static void get_gray_xy(uint64_t n, uint64_t *x, uint64_t *y) {
        uint64_t gray = n ^ (n >> 1);
        *x = deinterleave(gray);
        *y = deinterleave(gray >> 1);
    }
I don't think the gray code solution can work well for sets of different sizes because the area of indexes that you go through grows as a square.

But it does work for a different number of dimensions, i.e. a different number of nested loops. For 3 dimensions each coordinate is constructed from every 3rd bit instead of every 2nd.

So if you have index 14, then gray code is 9 (1001) which means in two dimensions x=1,y=2 and in three dimensions, x=3,y=0,z=0.
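That worked example can be checked with a dimension-generic de-interleave (a sketch of my own; the 64-bit mask version above is the fast path for 2D):

```python
def gray_coords(n, dims):
    """Gray-code n, then send bit k to coordinate k % dims, bit position k // dims."""
    g = n ^ (n >> 1)  # gray code of n
    coords = [0] * dims
    bit = 0
    while g:
        coords[bit % dims] |= (g & 1) << (bit // dims)
        g >>= 1
        bit += 1
    return coords

print(gray_coords(14, 2))  # [1, 2]    -> x=1, y=2
print(gray_coords(14, 3))  # [3, 0, 0] -> x=3, y=0, z=0
```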


Thank you for sharing your idea, your code and your thoughts, and thanks for your reply.

I would like to generalise it for sets of different sizes. Maybe it's something different from a gray code, which you taught me about today; it feels like it should be simple.

Maybe someone kind can step in and help me, or I can ask on Stack Overflow.

I need to think it through.

At one point I was reading a section on Intel's website about cache optimisation, featuring a tool with a GUI that shows you the cache behaviour of the CPU under column-major versus row-major iteration. I found it very interesting, but I cannot seem to find it at this time. I did find a Medium article about Cachediff.


deinterleave doesn't actually produce a gray code, does it? The GP said to convert to gray code before doing the deinterleave.


> but the algorithm didn't get faster

Could it be branch predictor at play? https://stackoverflow.com/questions/11227809/why-is-processi...



