llama.cpp can be run with a speedup on AMD GPUs when compiled with `LLAMA_CLBLAST=1`, and there is also a HIPified fork [1] being worked on by a community contributor. The other week I was poking at how hard it would be to get an AMD card running w/ acceleration on Linux, and was pleasantly surprised, it wasn't too bad: https://mostlyobvious.org/?link=/Reference%2FSoftware%2FGene...
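For reference, here's a minimal build sketch. The `LLAMA_CLBLAST=1` make flag is documented in mainline llama.cpp; the `LLAMA_HIPBLAS=1` flag is my best recollection of what the fork uses, so treat that one as an assumption:

    # OpenCL/CLBlast build (mainline llama.cpp)
    make clean && make LLAMA_CLBLAST=1

    # hipBLAS build (SlyEcho's fork; flag name is my best recollection)
    git clone -b hipblas https://github.com/SlyEcho/llama.cpp
    cd llama.cpp && make LLAMA_HIPBLAS=1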
The ELI5 is that a few years back, AMD split their graphics (RDNA) and compute (CDNA) architectures. Nvidia does this too, but notably (and this is something Nvidia definitely doesn't do, and a key to their success IMO) AMD also decided they would simply not support any CUDA-parity compute features on Windows or on their non-"compute" cards. In practice, this means that community/open-source developers will never have AMD hardware to tinker with, port to, or develop on, while on Nvidia you can start with the GTX/RTX card in your laptop and run the same code all the way up to an H100 or DGX.
llama.cpp is a super-high-profile project with almost 200 contributors now, but AFAIK no contributors from AMD. If AMD doesn't have the manpower, IMO they should simply be sending out free hardware to top open source project/library developers (and on the software side, their #1 priority should be making sure every single current GPU they sell is at least "enabled", if not "supported", in ROCm, on both Linux and Windows).
I saw there was already an answer in your issue, but if you plan on doing a lot of inferencing on your GPU, I'd highly recommend you consider dual-booting into Linux. It turns out exllama merged ROCm support last week, and it's more than 2X faster than the CLBlast code. A 13B GPTQ model at full context clocks in at 15 t/s on my old Radeon VII. (Rumor has it that ROCm 5.6 may add Windows support, although it remains to be seen what exactly that entails.)
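If you do go the Linux route, a quick sanity check that ROCm actually sees your card (these are the standard ROCm CLI tools; exact output varies by version):

    rocminfo | grep -i gfx    # should list your GPU's gfx target (e.g. gfx906 for Radeon VII)
    rocm-smi                  # shows temps, clocks, and VRAM usage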
That being said, it's important to note that ROCm is Linux-only. Not only that, but ROCm's GPU support has actually been shrinking over the past few years. The current list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... Previously (2022): https://docs.amd.com/bundle/Hardware_and_Software_Reference_...
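For cards that have fallen off the official list, a commonly shared workaround is to spoof a supported gfx target via an environment variable. This is unofficial and not guaranteed to work for your specific card, so treat it as a sketch:

    # Unofficial workaround: pretend the card is a supported gfx target
    # (10.3.0 maps to gfx1030, i.e. RDNA2; pick the closest target for your card)
    export HSA_OVERRIDE_GFX_VERSION=10.3.0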
[1] https://github.com/SlyEcho/llama.cpp/tree/hipblas