llama.cpp can be run with a speedup on AMD GPUs when compiled with `LLAMA_CLBLAST=1`, and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking at how hard it would be to get an AMD card running w/ acceleration on Linux, and was pleasantly surprised, it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...
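For reference, here's a minimal build sketch. The `LLAMA_CLBLAST=1` make flag is documented in mainline llama.cpp; the `LLAMA_HIPBLAS=1` flag is my best recollection of what the fork uses, so treat that one as an assumption:

    # OpenCL/CLBlast build (mainline llama.cpp)
    make clean && make LLAMA_CLBLAST=1

    # hipBLAS build (SlyEcho's fork; flag name is my best recollection)
    git clone -b hipblas https://github.com/SlyEcho/llama.cpp
    cd llama.cpp && make LLAMA_HIPBLAS=1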
The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures. Nvidia does this too, but notably (and this is something Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or on their non-"compute" cards. In practice, this means that community/open-source developers will never have AMD hardware to tinker with, port to, or develop on, while on Nvidia you can start with the GTX/RTX card in your laptop and run the same code all the way up to an H100 or DGX.
llama.cpp is a super-high-profile project with almost 200 contributors now, but AFAIK no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending out free hardware to top open source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled", if not "supported", in ROCm, on both Linux and Windows).
I saw there was already an answer in your issue, but if you plan on doing a lot of inferencing on your GPU, I'd highly recommend you consider dual-booting into Linux. It turns out exllama merged ROCm support last week, and it's more than 2X faster than the CLBlast code. A 13B GPTQ model at full context clocks in at 15 t/s on my old Radeon VII. (Rumor has it that ROCm 5.6 may add Windows support, although it remains to be seen what exactly that entails.)
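If you do go the Linux route, a quick sanity check that ROCm actually sees your card (these are the standard ROCm CLI tools; exact output varies by version):

    rocminfo | grep -i gfx    # should list your GPU's gfx target (e.g. gfx906 for Radeon VII)
    rocm-smi                  # shows temps, clocks, and VRAM usage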
That being said, it's important to note that ROCm is Linux-only. Not only that, but ROCm's GPU support has actually been shrinking over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...
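For cards that have fallen off the official list, a commonly shared workaround is to spoof a supported gfx target via an environment variable. This is unofficial and not guaranteed to work for your specific card, so treat it as a sketch:

    # Unofficial workaround: pretend the card is a supported gfx target
    # (10.3.0 maps to gfx1030, i.e. RDNA2; pick the closest target for your card)
    export HSA_OVERRIDE_GFX_VERSION=10.3.0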
[1] https://github.com/SlyEcho/llama.cpp/tree/hipblas