I've been using this on my M1 Max and it works pretty well: about 1.65 iterations per second at full precision, whereas my PC's 3080 can only do half precision because of its limited memory. A 50-iteration image takes about 40 seconds or so.
Thanks to you and smoldesu for letting me know it should work; I'll take a closer look at what's going on. It didn't work right away on Windows at full precision (probably a batch-size issue, as you suggested), and I gave up...
I shouldn't have given up so easily, but my tolerance for annoyances on Windows is pretty low. That machine is kept for gaming; the last time I used Windows for anything other than launching Steam was when Windows 2000 was the hot new thing...
This worked fine for me, and running side by side with an Intel CPU + Nvidia 2070 machine, it actually doesn't take much longer (and, as a sibling comment said, it seems to be running at full precision). It's one of the first things I've done that has really made my M1 Max's fan spin up hard, though!
And https://github.com/magnusviri/stable-diffusion/tree/apple-si...