I'd assume that llama.cpp's existing ability to offload a subset of layers to the GPU still applies, so you could keep some fraction of the model in VRAM and speed up those layers.
Memory bandwidth might be a bottleneck, and the offloaded layers would be a pretty small fraction of the model, but I'd guess the speedup would still be noticeable.
Maybe not worth the few thousand for the card + more power/cooling/space, of course.
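For reference, a sketch of what partial offload looks like with llama.cpp's CLI. The `-ngl` (`--n-gpu-layers`) flag is the real mechanism for this; the binary name, model path, and layer count below are placeholders you'd adjust to your setup.

```shell
# Offload the first 20 transformer layers to the GPU; the rest run on CPU.
# Model path and layer count are placeholders, not a recommendation.
./llama-cli -m ./models/model.gguf --n-gpu-layers 20 -p "Hello"
```

Watching the startup log shows how many layers landed on the GPU, so you can raise the count until VRAM runs out.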