Show HN: Rectfillcurve – generate rectangle-filling curves (sr.ht)
24 points by dalke on July 18, 2023 | 10 comments
How do you visit every coordinate in an NxM grid once? The easiest is to process line-by-line, from the first to last column, but if you want better caching you might try alternating the column direction for each row. The Morton/Z-order and Hilbert order give even better cache coherency for some tasks, although the classic versions only work on squares with power-of-two length sides.

Luckily for me, people have developed generalized versions of those algorithms which can handle arbitrary-sized rectangles.

I've taken those and packaged all of those curves into "rectfillcurve", with an iterator API for generating them, plus a bonus "mlcg curve" with a pseudo-random visit order that should have poor cache behavior. Implemented in stand-alone C, and also available as a Python module.



This feels like a huge misunderstanding of the memory layout of 2D arrays. Memory locality is about how close values are in 1D memory, not their Euclidean distance in 2D.

Left to right, top to bottom matches memory order exactly and is always going to be fastest if your algorithm allows it. The benchmarks show that.

Zigzag is slightly slower because big CPUs have fetch predictors that recognise backwards access patterns so it's basically the same as going left to right except you trip the predictor up at the start of each row.


It's definitely not showing the effect of memory fetch prediction, as the entire benchmark should fit in L1 cache.

The point of the benchmark was to show how much overhead there is for using the method. It doesn't do any memory traversal at all - it's the iterator equivalent of:

  /* INVALID is a sentinel return value defined elsewhere. */
  long long benchmark_for_loop(int w, int h, int bad_i, int bad_j) {
    long long count = 0;
    for (int i=0; i<w; i++) {
      for (int j=0; j<h; j++) {
        if (i == bad_i || j == bad_j) {
          return INVALID;
        }
        count++;
      }
    }
    return count;
  }
What the zigzag benchmark shows is the overhead for the few additional operations needed to figure out if it's traversing left or right; more specifically the "if (row % 2 == 0)" test and the else clause of the following:

    *i = row = local.pos / local.h;
    col = local.pos % local.h;
    if (row % 2 == 0) {
      *j = col;
    } else {
      *j = local.h - col - 1;
    }
The factor of 4 performance difference between the for-loop and the row iterator is all due to function call overhead for the iterator API.


The description in this HN post is indeed inaccurate, but the readme on sourcehut has more info on why you might want a different order for, e.g., matrix multiplication.


No it doesn't?


I thought the README was long enough that I wasn't going to go into those details, especially as 1) I am not an expert on the topic, 2) the choice of curve depends on your application, and 3) there are a lot of resources on the topic. I listed a few.

If you want to learn more, see "Space-Filling Curves: An Introduction with Applications in Scientific Computing", listed as one of my sources, e.g., "Algorithm 13.3: Matrix-vector multiplication based on Hilbert order matrix traversal" and "Matrix Multiplication Using Peano Curves".

One of the sources I cited is "Improved Data Locality Using Morton-order Curve on the Example of LU Decomposition" at https://ieeexplore.ieee.org/document/9378385 :

> Here, we look at the block factorization algorithm, where the LU decomposition performance depends on the performance of the matrix multiplication. In both cases, the LU decomposition and the matrix multiplication, such a matrix is traversed by three nested loops. In this paper, we propose to traverse such loops in an order defined by a space-filling curve. This traversal dramatically improves data locality and offers effective exploitation of the memory hierarchy. Besides the canonical (or line-by-line) access pattern, we demonstrate the traversal in Hilbert-, Peano and Morton order. Our extensive experiments show that the Morton order (or Z-order) and the inverse Morton order (or И-order) have a better runtime performance compared to the others.

I first learned about this issue from "The Anatomy of High-Performance 2D Similarity Calculations" at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4839782/ :

> Certain self-similar space-filling curves provide a natural hierarchy and iteration order to optimize locality in a cache-oblivious algorithm. In particular, Morton ordering, based on the hierarchical Z-order curve, has been previously used to optimize cache locality in matrix multiplication,22 suggesting that it may also be useful for similarity matrices. .... Other space-filling curves, such as the Hilbert curve, also have excellent cache-locality properties; however, the calculations required to implement the Z-order curve are particularly simple.

I want to reproduce their result, then see how other traversal orders affect the timings. Careful reading of that paper shows they only used sizes which are a power of two, while real-world data sets aren't so convenient.

I couldn't find an off-the-shelf library with non-powers-of-two/rectangular generalizations, so I wrote one.


The space-filling curves are also useful in arranging GPU compute workloads, particularly the Morton order. This is because, among other things, textures are stored in blocks, so the Z-curve makes your caches happier. Another reason is that the spatial nature of the curve also aligns well with screen-space and such techniques where the data has a natural spatial order to it. BVH construction is another. You could mention some of that in your readme.

Some refs:

https://www.nocentino.com/Nocentino10.pdf

http://graphics.snu.ac.kr/class/graphics2011/materials/paper...


If you look at this the opposite way round, then you discover why the Apple II and Sinclair ZX Spectrum had such a bonkers screen layout.

The 4116 DRAM chips were arranged as a 128x128 grid with seven-bit row and column addresses. By scattering the RAM like that, different "pages" could be hit by the screen updates, solving the problem of refreshing the DRAM. If you read them sequentially, the RAM would have decayed by the time you got to the bottom of the screen.


For people interested in this, a very interesting algorithm is the H-curve. It is described in this paper: http://hint.fm/papers/158-wattenberg-final3.pdf . It has the neat property that it ends where it starts if the grid has even width and height.


Does sourcehut actually work still? Clicking on any of the git links on that page just gives a 502 error.


What git links? I use Mercurial.

Oh, I see. Some of the commit links result in "502 Bad Gateway" from nginx. How odd!

Even stranger, I can't reproduce it on my other repos.

I so rarely use the web interface I never noticed. I mostly use it via the command-line, for solo projects, effectively as off-site backup.

EDIT: Yeah, looks like that issue was reported 10 days ago, at https://todo.sr.ht/~sircmpwn/hg.sr.ht/46 , with no response from the operators.

As a Mercurial user and fellow traveller to free software, it's hard to find a good and responsive hosting provider.



