The problem with self-hosting is that it increases the friction of swapping models, whether to whatever is SOTA or whatever best fits your purpose.
Also, I've heard from others that the Qwen models are a bit overfit to the benchmarks, and that their real-life performance isn't as impressive as the benchmark numbers suggest.
Switching models when running locally is fairly easy: as long as you have them downloaded, you can swap them in and out with just a config setting. I can't quite remember, but you may need to rebuild the vectorstore when switching.
LangChain has embedding classes for the major providers:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def build_vectorstore(docs):
    """
    Create a vectorstore from documents using the configured embedding model.
    """
    # Choose embedding model based on cfg.EMBED_MODEL (project config)
    if cfg.EMBED_MODEL.lower() == "openai":
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    elif cfg.EMBED_MODEL.lower() == "huggingface":
        from langchain_community.embeddings import HuggingFaceEmbeddings
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    elif cfg.EMBED_MODEL.lower() == "nomic-embed-text":
        from langchain_ollama import OllamaEmbeddings
        embeddings = OllamaEmbeddings(model=cfg.EMBED_MODEL)
    else:
        raise ValueError(f"Unknown embedding model: {cfg.EMBED_MODEL}")

    # Embed the documents and build the index (FAISS here, but any vectorstore works)
    return FAISS.from_documents(docs, embeddings)
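
So switching providers is just a config change plus a rebuild of the index. A minimal sketch of what that looks like, assuming the cfg object and build_vectorstore above (load_docs is a placeholder for however you load your documents):

# Change the configured embedding model, then re-embed everything with it.
cfg.EMBED_MODEL = "nomic-embed-text"   # was "openai"
docs = load_docs()                     # hypothetical loader for your source documents
vectorstore = build_vectorstore(docs)  # rebuilds the index with the new embeddings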