Cool stuff! This method is very similar to how AVX-512-optimized RSA implementat...

anematode · 2025-05-30T18:59:38 1748631578

Ooh interesting, I should have looked at this while developing... Looks like that code could definitely use another version for e.g. Zen 5, where using zmm registers would double the multiplication throughput. Also, the mask registers are bounced to GPRs for arithmetic, but that's suboptimal on Zen 4/5.

Separately, I'm wondering whether the carries really need to be propagated in one step. (At least I think that's what's going on?) The chance that a carry-in leads to an additional carry-out beyond what's already there in the high 12 bits is very small, so in my code I assume that carries only happen once and then loop back if necessary. That reduces the latency in the common case. I guess with a branch there could be timing attack issues, though.
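To make the "propagate once, loop back if needed" idea concrete, here is a minimal scalar sketch in Python. It assumes a radix-2^52 representation (52-bit limbs held in 64-bit words, so the high 12 bits hold pending carries), which matches the "high 12 bits" mentioned above. The names `carry_pass`, `normalize`, and `to_int` are hypothetical and not taken from either codebase; a real kernel would do each pass across SIMD lanes rather than with Python lists.

```python
LIMB_BITS = 52                      # assumed limb width (64-bit word, 12 spare bits)
MASK = (1 << LIMB_BITS) - 1


def to_int(limbs, top=0):
    """Value of little-endian limbs, plus a carry-out word above the top limb."""
    value = sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))
    return value + (top << (LIMB_BITS * len(limbs)))


def carry_pass(limbs):
    """One data-parallel pass: every limb hands its high bits to its neighbour.

    All carries are computed from the *input* limbs, so a carry that lands on
    a limb already holding MASK is not resolved until the next pass.
    """
    carries = [limb >> LIMB_BITS for limb in limbs]
    low = [limb & MASK for limb in limbs]
    out = [low[0]] + [low[i] + carries[i - 1] for i in range(1, len(limbs))]
    return out, carries[-1]


def normalize(limbs):
    """Repeat carry_pass until every limb fits in LIMB_BITS.

    Returns (limbs, top, passes). The loop count depends on the data, which is
    exactly the timing-side-channel concern raised above.
    """
    top = 0
    passes = 0
    while True:
        limbs, carry_out = carry_pass(limbs)
        top += carry_out
        passes += 1
        if all(limb <= MASK for limb in limbs):
            return limbs, top, passes
```

In this sketch, `normalize([MASK + 5, 7, 3])` finishes in one pass, while `normalize([MASK + 5, MASK, 3])` needs a second pass because the incoming carry pushes the middle limb over 52 bits. The second case requires a limb to be exactly all-ones when a carry arrives, which is why the retry branch is rarely taken on random-looking data.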

pittma · 2025-05-30T19:21:52 1748632912

ymms were used here on purpose! With full-width registers, the IFMA instructions have a deleterious effect on frequency, at least in the Ice Lake timeframe.

anematode · 2025-05-30T19:37:13 1748633833

Yeah, hence a separate version for CPUs which don't have that problem. Although, maintaining so many of these RSA kernels does seem like a pain. Didn't realize you wrote that code; super cool that it's used in practice!

pittma · 2025-05-30T19:43:50 1748634230

I am not the original author—this is adapted from an implementation by Shay Gueron, the author of that paper I linked, but I do agree that it's cool!

hnuser123456 · 2025-05-31T01:15:20 1748654120

Zen 5 can run AVX-512 at near full boost clocks: https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavi...

ignoramous · 2025-05-31T20:12:00 1748722320

> dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal

One can also upload to archive.org: https://archive.org/download/sime_20250531/sime.pdf