Cool stuff! This method is very similar to how AVX-512-optimized RSA implementations work too, as they also have to do Very Large Exponentiations. This paper[1] covers how RSA does its windowing, which includes the formula showing how the window size can be arbitrary. One additional thing AVX-512 RSA implementations do, though, is store the results of the multiplications for the range [0..2^{window-size}) in a table; then, for each window, that result is retrieved from the table[2] and only the shifting/repositioning has to be done.
Ooh interesting, I should have looked at this while developing.... Looks like that code could definitely use another version for e.g. Zen 5 where using zmm registers would lead to a 2x multiplication throughput. Also the mask registers are bounced to GPRs for arithmetic but that's suboptimal on Zen 4/5.
Separately, I'm wondering whether the carries really need to be propagated in one step. (At least I think that's what's going on?) The chance that a carry in leads to an additional carry out beyond what's already there in the high 12 bits is very small, so in my code, I assume that carries only happen once and then loop back if necessary. That reduces the latency in the common case. I guess with a branch there could be timing attack issues though
Ye, hence a separate version for CPUs which don't have that problem. Although, maintaining so many of these RSA kernels does seem like a pain. Didn't realize u wrote that code; super cool that it's used in practice!
1. https://dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal)
2. https://github.com/aws/aws-lc/blob/9c8bd6d7b8adccdd8af4242e0...