> But when we switch to longer context, something interesting happens: WMMA + FA loses basically no performance at this longer context length!
> Vulkan + FA still has better pp, but tg is significantly lower. More data points would be better, but it seems like Vulkan performance may continue to degrade as context grows, while the HIP+rocWMMA backend should hold up better.
> (The annoying part is that basically every single model has a different optimal backend, and most of them have different optimal backends for pp (prompt processing) vs tg (token generation).)
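
(In case it's useful, here's roughly how I'd reproduce that comparison with llama-bench. This is a sketch, not my exact commands: the model path is a placeholder, the cmake options are the usual compile-time backend switches, exact flag syntax can vary between llama.cpp builds, and the `-d` depth flag assumes a reasonably recent build — on older ones you can approximate longer context by just raising `-p`.)

```bash
# Build once per backend (it's a compile-time choice), e.g.:
#   Vulkan:        cmake -B build -DGGML_VULKAN=ON
#   HIP + rocWMMA: cmake -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON
# then run the same benchmark against each build and compare pp/tg numbers.

MODEL=./models/model.gguf   # placeholder path

# Short context: pp512 / tg128, flash attention on, all layers offloaded
./build/bin/llama-bench -m "$MODEL" -ngl 99 -fa 1 -p 512 -n 128

# Longer context: same test, but measured at an 8192-token depth
# (-d/--n-depth needs a recent llama-bench; otherwise just increase -p)
./build/bin/llama-bench -m "$MODEL" -ngl 99 -fa 1 -p 512 -n 128 -d 8192
```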
Anyway, for me, the greatest thing about the Strix Halo + llama.cpp combo is that we can throw one or more eGPUs into the mix, as echoed by the Level1Techs video (https://youtu.be/ziZDzrDI7AM?t=485), which should help a lot with pp performance.