The README covers this. Its argument is persuasive. If your point is that the constant is badly tuned for theoretical 1000-core machines that don't exist, I'm not sure I care. A 100ns stall at most once every 100us becoming somewhat more likely as you approach several hundred cores is hardly a disaster. In the context of the comment I replied to, the difference between 8 and 16 workers is literally zero, since the wakeups are spaced so the locks never conflict.
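As a rough back-of-envelope sketch of that last claim (the 100ns/100us figures are from above; the worker counts are my own picks for illustration), evenly staggered wakeups leave the lock uncontended until you get to roughly a thousand workers:

```rust
fn main() {
    let critical_section_ns: u64 = 100;    // worst-case stall while holding the lock
    let wakeup_interval_ns: u64 = 100_000; // 100us between wakeups per worker

    for workers in [8u64, 16, 256, 1024] {
        // With evenly spaced wakeups, adjacent workers hit the lock this far apart.
        let spacing_ns = wakeup_interval_ns / workers;
        let conflict_possible = spacing_ns < critical_section_ns;
        println!(
            "{workers:5} workers: {spacing_ns:6} ns between wakeups -> conflict possible: {conflict_possible}"
        );
    }
}
```

Only the 1024-worker case even gets close, which is the "machines that don't exist" point again.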
Actually, if you somehow did have a 32k-core machine with magical, sufficiently uniform memory for microthreading to make sense on it, I don't think it's even hard to extend the algorithm to handle it: put the workers on a 3D torus and only share work orthogonally. You lose perfect work sharing, but I'm also pretty sure that doesn't matter.
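A hypothetical sketch of what I mean (none of this is from the project; the 32x32x32 layout is just 32^3 = 32k): each worker shares only with its six orthogonal neighbours, so lock fan-in stays constant no matter how many cores you add.

```rust
const SIDE: usize = 32; // 32^3 = 32,768 workers

/// Map a worker id to (x, y, z) coordinates on the torus.
fn coords(id: usize) -> (usize, usize, usize) {
    (id % SIDE, (id / SIDE) % SIDE, id / (SIDE * SIDE))
}

/// Map (x, y, z) back to a worker id.
fn id_of(x: usize, y: usize, z: usize) -> usize {
    x + y * SIDE + z * SIDE * SIDE
}

/// The six workers a given worker shares with: +/-1 along each axis, wrapping around.
fn orthogonal_neighbours(id: usize) -> [usize; 6] {
    let (x, y, z) = coords(id);
    let dec = |v: usize| (v + SIDE - 1) % SIDE;
    let inc = |v: usize| (v + 1) % SIDE;
    [
        id_of(dec(x), y, z),
        id_of(inc(x), y, z),
        id_of(x, dec(y), z),
        id_of(x, inc(y), z),
        id_of(x, y, dec(z)),
        id_of(x, y, inc(z)),
    ]
}

fn main() {
    // Worker 0 sits at a "corner" of the grid but still has six neighbours thanks to wraparound.
    println!("{:?}", orthogonal_neighbours(0));
}
```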