Hacker News

I'm curious about the point on embedding lookup cost. In my experience, for an embedding lookup to be accurate you have to embed your entire document dataset and query against it, and that can be just as expensive as querying a full cloud model if your dataset is very large. Interested if anyone has thoughts about this.


Yes. I think the point is that even though the price per token for creating embeddings via e.g. OpenAI's text-embedding-ada-002 API might be low, it still adds up to a significant cost for a large document corpus. The suggestion to roll your own based on freely available embedding models is sound IMHO.
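A quick back-of-envelope sketch of how the cost scales (the $0.0001 per 1K tokens figure is an assumption for illustration; check the current pricing page):

```python
# Rough one-time cost to embed a corpus, assuming a hypothetical
# price of $0.0001 per 1K tokens (verify against current pricing).
PRICE_PER_1K_TOKENS = 0.0001

def embedding_cost(num_docs: int, avg_tokens_per_doc: int) -> float:
    """Estimated dollar cost to embed the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# 10 million documents at ~500 tokens each comes to about $500,
# and that's per re-embedding (model upgrades, re-chunking, etc.).
print(f"${embedding_cost(10_000_000, 500):,.2f}")
```

Small per-token prices look negligible until you multiply by corpus size, and the cost recurs every time you have to re-embed.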

Now, how to chunk those documents into semantically coherent pieces for context retrieval — that is the real challenge.
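A minimal sketch of the usual starting point: fixed-size chunks split on sentence boundaries. Sizes are counted in words here for simplicity; real pipelines typically count model tokens, and "semantic" chunkers go further by splitting where embedding similarity between adjacent sentences drops.

```python
import re

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Greedy sentence-boundary chunker: pack whole sentences into
    chunks of at most ~max_words words, never splitting mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Flush the current chunk if adding this sentence would overflow.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Respecting sentence boundaries is the cheapest way to keep chunks coherent; overlap between chunks and heading-aware splitting are common next steps.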


There are very efficient algorithms for doing this, though of course it may still be expensive if your dataset is very large. See https://ann-benchmarks.com/ for comparisons of the algorithms.
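For context, the baseline those libraries beat is exact brute-force search, which is O(N) per query. A minimal sketch of that baseline (the ANN indexes benchmarked at ann-benchmarks.com, e.g. HNSW or IVF variants, answer the same query approximately in sub-linear time):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], vectors: list[list[float]]) -> int:
    """Exact nearest neighbour: scan every vector, O(N) per query."""
    return max(range(len(vectors)), key=lambda i: cosine(query, vectors[i]))

# Toy 2-d "embeddings"; the query is closest to the first vector.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(nearest([0.9, 0.1], corpus))  # → 0
```

The trade-off the ANN libraries make is recall for speed: they return the true nearest neighbour only with high probability, in exchange for not touching all N vectors.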



