Fixed-size chunking is holding back a bunch of RAG projects on my backlog. Will be...

refulgentis · 2025-02-05T21:46:34 1738791994

FWIW, you might be doing it / ruled it out already:

- BM25 to eliminate the "zero results even though it's in the source data" problem

- Longer term, take a peek at Gwern's recent hierarchical embedding article. Got decent early returns even with fixed-size chunks
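
For anyone who hasn't looked at BM25 up close, here is a minimal sketch of the scoring over a handful of chunks. It is not what anyone in this thread runs in production. The whitespace tokenizer, the `k1`/`b` defaults, the smoothed (Lucene-style) IDF, and the `CHUNKS` sample data are all assumptions made for illustration.

```python
import math
from collections import Counter

# Invented sample chunks, for illustration only.
CHUNKS = [
    "the sso rollout for project falcon is blocked",
    "quarterly revenue grew in the last period",
    "falcon uses saml for sso",
]


def tokenize(text):
    # Naive lowercase + whitespace split; a real pipeline would do more.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.tokens]
        # Document frequency: in how many chunks does each term appear?
        df = Counter()
        for t in self.tokens:
            df.update(set(t))
        n = len(docs)
        # Smoothed IDF variant that never goes negative.
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5))
            for term, f in df.items()
        }

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue  # term absent from this chunk (or from the corpus)
            length_norm = 1 - self.b + self.b * self.doc_len[i] / self.avgdl
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def search(self, query, top_k=3):
        scored = [(self.score(query, i), i) for i in range(len(self.tokens))]
        scored.sort(reverse=True)
        # Only chunks that share at least one term with the query are returned.
        return [(i, s) for s, i in scored[:top_k] if s > 0]
```

`BM25(CHUNKS).search("SSO Falcon")` returns the two chunks that literally contain those terms, which is the exact-match behavior that helps when an embedding search misses a rare acronym or project name.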

thelittleone · 2025-02-05T21:53:56 1738792436

Much appreciated.

For others interested in BM25 for the use case above, I found this thread informative.

https://news.ycombinator.com/item?id=41034297

mediaman · 2025-02-05T22:10:22 1738793422

Agree, BM25 honestly does an amazing job on its own sometimes, especially if content is technical.

We use it in combination with semantic search, but sometimes we turn off the semantic part to see what happens, and we're surprised by how robust the results are.

This would work less well for cross-language or less technical content, however. It's great for acronyms, company or industry specific terms, project names, people, technical phrases, and so on.
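
A rough sketch of the "BM25 plus semantic, with a switch to turn semantic off" setup. The comment above doesn't say how the two result lists are combined, so reciprocal rank fusion (RRF) is swapped in here as one common choice. The function names, the `use_semantic` flag, and `k=60` are assumptions, and the two input rankings are assumed to come from whatever BM25 and embedding searches you already have.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(bm25_ranking, semantic_ranking, use_semantic=True):
    """Fuse the two rankings, or fall back to BM25 only when the flag is off."""
    rankings = [bm25_ranking]
    if use_semantic:
        rankings.append(semantic_ranking)
    return rrf(rankings)


# Hypothetical doc ids, best match first in each list.
bm25_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]

fused = hybrid_search(bm25_hits, semantic_hits)
bm25_only = hybrid_search(bm25_hits, semantic_hits, use_semantic=False)
```

Comparing `fused` against `bm25_only` on real queries is a cheap way to see how much the semantic side is actually contributing.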

jacobr1 · 2025-02-05T23:43:54 1738799034

Also consider methods that use reasoning to dispatch additional searches based on analysis of the returned data.
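
A minimal sketch of that loop, assuming two pluggable callables: `search` (any retriever) and `propose_followups` (in practice probably an LLM call that reads the hits so far and suggests new queries). Both names, the `max_rounds` cap, and the toy index are hypothetical; the stub below uses a hard-coded rule in place of a model.

```python
def iterative_search(query, search, propose_followups, max_rounds=3):
    """Run a search, then let a reasoning step dispatch follow-up searches."""
    seen_queries = set()
    results = []
    pending = [query]
    for _ in range(max_rounds):
        # Drop duplicates and anything already searched.
        new_queries = [q for q in dict.fromkeys(pending) if q not in seen_queries]
        if not new_queries:
            break
        for q in new_queries:
            seen_queries.add(q)
            for hit in search(q):
                if hit not in results:
                    results.append(hit)
        # The "reasoning" step: inspect results, decide what to look up next.
        pending = propose_followups(query, results)
    return results


# Toy stand-ins with invented data, so the sketch runs without a model.
TOY_INDEX = {
    "falcon outage": ["Falcon outage traced to SSO misconfiguration"],
    "sso misconfiguration": ["SSO misconfiguration fixed in a later release"],
}


def toy_search(q):
    return TOY_INDEX.get(q.lower(), [])


def toy_followups(query, results):
    # A real system would ask a model; here, one fixed rule.
    if any("SSO" in r for r in results):
        return ["sso misconfiguration"]
    return []


found = iterative_search("Falcon outage", toy_search, toy_followups)
```

The second hit is only reachable because the first result mentioned SSO, which is the kind of chain a single fixed-query retrieval pass would miss.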

nnurmanov · 2025-02-06T04:09:13 1738814953

This is my problem as well; do you have lots of documents?