You can use ONXX versions of embedding models. Those run faster on CPU. Also, do...

rdli · on May 16, 2024

Thanks! I should have been clearer -- embeddings are pretty fast (relatively) -- it's inference that's slow (I'm at 5 tokens/second on AKS).

jnnnthnn · on May 16, 2024

Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full text search) and let the user find what they need?

https://ollama.com models also works really well on most modern hardware

rdli · on May 16, 2024

I'm running ollama, but it's still slow (it's actually quite fast on my M2). My working theory is that with standard cloud VMs, memory <-> CPU bandwidth is an issue. I'm looking into vLLM.

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.

jnnnthnn · on May 16, 2024

Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)