> [...] llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1)
> [...] I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1)
Ok, so you agree that llama.cpp etc. are great for batch==1, right?
And I agree their targeted use case is not batch==32 (because who is doing that, really?)
But if we extended llama.cpp or some other faster batch==1 implementation to support batch==32, why do you suppose it wouldn't still be faster than MKML? It seems to me that if you can do batch==1 faster, you could do batch>>1 faster too -- it's just that no one has really needed that (yet?)
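For concreteness, here's a rough numpy sketch of what I mean by "extending" to batch==32 -- the layer shapes are made up for illustration, not taken from llama.cpp or MKML, and it's only meant to show how the same weights get reused across the batch:

```python
import numpy as np

# Illustrative shapes for one transformer weight matrix (assumed, not
# taken from llama.cpp or MKML).
d_model, d_ff = 4096, 11008
W = np.random.randn(d_ff, d_model).astype(np.float32)

# batch == 1: one matrix-vector product per decode step.
x = np.random.randn(d_model).astype(np.float32)
y = W @ x                       # shape (d_ff,)

# batch == 32: conceptually just an extra batch dimension -- the same
# weights applied to 32 activations at once.
X = np.random.randn(d_model, 32).astype(np.float32)
Y = W @ X                       # shape (d_ff, 32)

# Back-of-envelope: FLOPs performed per weight byte read. At batch == 1
# each weight is used once per step; at batch == 32 it's reused 32 times.
print(2 * W.size * 1 / W.nbytes)    # batch == 1
print(2 * W.size * 32 / W.nbytes)   # batch == 32
```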