The README covers this. Its argument is persuasive. If your point is that the constant is badly tuned for theoretical 1000-core machines that don't exist, I'm not sure I care. A 100ns stall at most once every 100us becoming somewhat more likely as you approach several hundred cores is hardly a disaster. In the context of the comment I replied to, the difference between 8 and 16 workers is literally zero, since the wakeups are spaced so the locks never conflict.
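As a rough back-of-envelope sketch of that last claim (the 100ns/100us figures are from above; the worker counts are my own picks for illustration), evenly staggered wakeups leave the lock uncontended until you get to roughly a thousand workers:

```rust
fn main() {
    let critical_section_ns: u64 = 100;    // worst-case stall while holding the lock
    let wakeup_interval_ns: u64 = 100_000; // 100us between wakeups per worker

    for workers in [8u64, 16, 256, 1024] {
        // With evenly spaced wakeups, adjacent workers hit the lock this far apart.
        let spacing_ns = wakeup_interval_ns / workers;
        let conflict_possible = spacing_ns < critical_section_ns;
        println!(
            "{workers:5} workers: {spacing_ns:6} ns between wakeups -> conflict possible: {conflict_possible}"
        );
    }
}
```

Only the 1024-worker case even gets close, which is the "machines that don't exist" point again.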
Actually, if you somehow did have a 32k-core machine with magical, sufficiently uniform memory for microthreading to make sense on it, I don't think it's even hard to extend the algorithm to handle it: put the workers on a 3D torus and only share work orthogonally. You lose perfect work sharing, but I'm also pretty sure that doesn't matter.
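A hypothetical sketch of what I mean (none of this is from the project; the 32x32x32 layout is just 32^3 = 32k): each worker shares only with its six orthogonal neighbours, so lock fan-in stays constant no matter how many cores you add.

```rust
const SIDE: usize = 32; // 32^3 = 32,768 workers

/// Map a worker id to (x, y, z) coordinates on the torus.
fn coords(id: usize) -> (usize, usize, usize) {
    (id % SIDE, (id / SIDE) % SIDE, id / (SIDE * SIDE))
}

/// Map (x, y, z) back to a worker id.
fn id_of(x: usize, y: usize, z: usize) -> usize {
    x + y * SIDE + z * SIDE * SIDE
}

/// The six workers a given worker shares with: +/-1 along each axis, wrapping around.
fn orthogonal_neighbours(id: usize) -> [usize; 6] {
    let (x, y, z) = coords(id);
    let dec = |v: usize| (v + SIDE - 1) % SIDE;
    let inc = |v: usize| (v + 1) % SIDE;
    [
        id_of(dec(x), y, z),
        id_of(inc(x), y, z),
        id_of(x, dec(y), z),
        id_of(x, inc(y), z),
        id_of(x, y, dec(z)),
        id_of(x, y, inc(z)),
    ]
}

fn main() {
    // Worker 0 sits at a "corner" of the grid but still has six neighbours thanks to wraparound.
    println!("{:?}", orthogonal_neighbours(0));
}
```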