Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Cool stuff! This method is very similar to how AVX-512-optimized RSA implementations work too, as they also have to do Very Large Exponentiations. This paper[1] covers how RSA does its windowing, which includes the formula showing how the window size can be arbitrary. One additional thing AVX-512 RSA implementations do, though, is store the results of the multiplications for the range [0..2^{window-size}) in a table; then, for each window, that result is retrieved from the table[2] and only the shifting/repositioning has to be done.

1. https://dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal)

2. https://github.com/aws/aws-lc/blob/9c8bd6d7b8adccdd8af4242e0...



Ooh interesting, I should have looked at this while developing.... Looks like that code could definitely use another version for e.g. Zen 5 where using zmm registers would lead to a 2x multiplication throughput. Also the mask registers are bounced to GPRs for arithmetic but that's suboptimal on Zen 4/5.

Separately, I'm wondering whether the carries really need to be propagated in one step. (At least I think that's what's going on?) The chance that a carry in leads to an additional carry out beyond what's already there in the high 12 bits is very small, so in my code, I assume that carries only happen once and then loop back if necessary. That reduces the latency in the common case. I guess with a branch there could be timing attack issues though


ymms were used here on purpose! With full-width registers, the IFMA insns have a deleterious effect on frequency, at least in the Icelake timeframe.


Ye, hence a separate version for CPUs which don't have that problem. Although, maintaining so many of these RSA kernels does seem like a pain. Didn't realize u wrote that code; super cool that it's used in practice!


I am not the original author—this is adapted from an implementation by Shay Gueron, the author of that paper I linked, but I do agree that it's cool!


zen5 can run avx512 at near full boost clocks: https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavi...


> dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal

One can also upload to archive.org: https://archive.org/download/sime_20250531/sime.pdf




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: