When I started working in search 10+ years ago, people would build a beautiful UI, and then, only on shipping, realize the search results were trash + irrelevant. They imagined a search system like Elasticsearch was basically Google. When in reality, Elasticsearch is just a bit of infrastructure. A framework, not a solution.
There's a similar thing happening on RAG. Where people think building the chat interaction is the hard thing. The hard thing is extracting + searching to get relevant context. A lot of founders I talk to suddenly realize this at the last minute, right before shipping, similar to search back in the day. It's harder than just throwing chunks in a vector DB. It involves a lot of different backend data sources potentially, and is in many ways harder than a standard search relevance problem (which is itself hard enough).
Yep, we're doing RAG-ish search and ranking across many context types and modalities. You definitely can't just use a vector DB and do some chunking/search; there's a wide variety of search-like ranking, clustering, etc. and domain-specific work for relevance, and it's very hard to measure and prove improvements.
It's going to just evolve into recreating the various search and ranking processes of old just on top of a bit more semantic understanding with some smarter NLG layered in :). It won't be just LLMs, we'll have intent classification, named entity recognition, a personalization layer, reranking, all that fun stuff again.
Especially considering the additional logic that some queries require. Stacked questions, comparative questions, recommendations, questions that assume information found in previous statements / questions.
It becomes a very frustrating experience matching the inherent chaos of a conversation.
Great observation. I've seen it often in tech, across the board. It's no better than, maybe a step up from, the 'idea guy' who 'just' needs someone to build his idea. Hand-waving or complete lack of awareness of the actual value (hard) part.
I spent 8 months telling people this before I got laid off while the CEO continues to chase LLM money with no new ideas or even the talent to solve the problem.
They spent so much time on the UI and basically left the actual search to the last minute, and it was a hilarious failure on launch.
Very good points. Have you seen any examples of systems (or projects) that successfully combine multiple backend data sources, including databases, that perform better than the single backend alone? This seems like an important enough question that it ought to have been documented somewhere.
Hmm, RAG is not "the chat interaction", that's GPT or any other "brain" you choose.
Last week I finished building my 3rd RAG stack for legal document retrieval. Almost-vanilla RAG got me 90-95% of the way. Only drawback is cost, still 10x-100x above the ideal price point; but that will only improve in the future.
This is a great comment. Good search is really hard. RAG is much harder. At least with search the user can pick the best result manually or refine their search. With RAG you pass topK to the LLM and assume they're good results. The assumption is that it's "semantic search" with vectors so it will just work... wrong.
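Roughly, the pattern being criticized looks like the sketch below; the model name and prompt format are just illustrative placeholders, not anyone's actual stack:

    # Minimal "retrieve topK, stuff it in the prompt, hope it's relevant" loop.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)

    def retrieve(query, k=2):
        # Plain cosine similarity against every chunk; no reranking, no filtering.
        query_vec = model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(query_vec, chunk_vecs, top_k=k)[0]
        return [chunks[h["corpus_id"]] for h in hits]

    context = "\n".join(retrieve("what does the contract say about termination?"))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
    # Whatever topK returned is assumed relevant and sent to the LLM as-is.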
This is a post that summarizes some reading that I had done in the space of LLMs + Knowledge Graphs with the goal of identifying technically deep and interesting directions. The post covers retrieval augmented generation (RAG) systems that use unstructured data (RAG-U) and the role folks envision knowledge graphs to play in them. Briefly, the design spectrum of RAG-U systems has two dimensions:
1) What additional data to put into LLM prompts: such as, documents, or triples extracted from documents.
2) How to store and fetch that data: such as a vector index, a graph DBMS, or both.
The standard RAG-U uses vector embeddings of chunks, which are fetched from a vector index. An envisioned role of knowledge graphs is to improve standard RAG-U by explicitly linking the chunks through the entities they mention. This is a promising idea but one that needs to be subjected to rigorous evaluation as done in prominent IR publications, e.g., SIGIR.
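A minimal sketch of that chunk-linking idea, assuming spaCy NER and networkx as stand-ins (a real system would use a proper entity linker and a graph DBMS):

    import networkx as nx
    import spacy

    nlp = spacy.load("en_core_web_sm")
    chunks = ["Acme Corp acquired Widgets Inc in 2021.",
              "Widgets Inc was founded in Berlin.",
              "The weather was nice."]

    graph = nx.Graph()
    for i, chunk in enumerate(chunks):
        graph.add_node(i, text=chunk)
        for ent in nlp(chunk).ents:
            graph.add_node(ent.text, kind="entity")
            graph.add_edge(i, ent.text)  # chunk --mentions--> entity

    def expand(hit_ids):
        # Add chunks that share an entity with a vector-retrieved chunk.
        extra = set()
        for i in hit_ids:
            for entity in graph.neighbors(i):
                extra.update(n for n in graph.neighbors(entity) if isinstance(n, int))
        return extra - set(hit_ids)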
The post then discusses the scenario where an enterprise does not have a knowledge graph and the idea of automatically extracting knowledge graphs from unstructured PDFs and text documents. It covers the recent work that uses LLMs for this task (they're not yet competitive with specialized models) and highlights many interesting open questions.
Hope this is interesting to people who are interested in the area but intimidated because of the flood of activity (but don't be; I think the area is easier to digest than it may look.)
Knowledge graphs improve vector search by providing a "back of the book" index for the content. This can be done using knowledge extraction from an LLM during indexing, such as pulling out keyterms of a given chunk before embedding, or asking a question of the content and then answering it using the keyterms in addition to the embeddings. One challenge I found with this is determining keyterms to use with prompts that have light context, but using a time window helps with this, as does hitting the vector store for related content, then finding the keyterms for THAT content to use with the current query.
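Here is one rough way to do that at indexing time, with KeyBERT standing in for the LLM-based keyterm extraction (names and the model choice are illustrative):

    from keybert import KeyBERT
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    kw_model = KeyBERT(model=encoder)

    index = []
    for chunk in ["Some chunk of source text ..."]:
        keyterms = [kw for kw, _ in kw_model.extract_keywords(chunk, top_n=5)]
        index.append({
            "text": chunk,
            "embedding": encoder.encode(chunk),
            "keyterms": keyterms,  # the "back of the book" entries
        })
    # At query time the keyterms can pre-filter candidates or be fed back into
    # the prompt when the query itself carries little context.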
OpenNRE (https://github.com/thunlp/OpenNRE) is another good approach to neural relation extraction, though it's slightly dated. What would be particularly interesting is to combine models like OpenNRE or SpanMarker with entity-linking models to construct KG triples. And a solid, scalable graph database underneath would make for a great knowledge base that can be constructed from unstructured text.
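A hedged sketch of what those triples could look like with OpenNRE, roughly following its README; the entity spans are hard-coded here, whereas in practice they'd come from the NER/entity-linking step:

    import opennre

    model = opennre.get_model("wiki80_cnn_softmax")  # downloads weights on first use
    sentence = "Bill Gates founded Microsoft in Albuquerque."
    head = {"pos": (0, 10)}   # character span of "Bill Gates"
    tail = {"pos": (19, 28)}  # character span of "Microsoft"

    relation, score = model.infer({"text": sentence, "h": head, "t": tail})
    triple = ("Bill Gates", relation, "Microsoft")  # candidate edge for the KG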
By this I presume you mean build a search index that can retrieve results based on keywords? I know certain databases use Lucene to build a keyword-based index on top of unstructured blobs of data. Another alternative is to use Tantivy (https://github.com/quickwit-oss/tantivy), a Rust version of Lucene, if building search indices via Java isn't your cup of tea :)
Both libraries offer multilingual support for keywords, I believe, so that's a benefit over vector search, where multilingual embedding models are rather expensive.
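For anyone curious, a keyword index with the Python bindings for Tantivy looks roughly like this (from memory of the tantivy-py README, so treat the details as approximate):

    import tantivy

    schema_builder = tantivy.SchemaBuilder()
    schema_builder.add_text_field("title", stored=True)
    schema_builder.add_text_field("body", stored=True)
    schema = schema_builder.build()

    index = tantivy.Index(schema)
    writer = index.writer()
    writer.add_document(tantivy.Document(title="Contract A", body="Termination clause ..."))
    writer.commit()

    index.reload()
    searcher = index.searcher()
    query = index.parse_query("termination", ["title", "body"])
    for score, address in searcher.search(query, 10).hits:
        print(score, searcher.doc(address))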
Having just started from zero, I agree on the easy to digest point. You can get a pretty good understanding of how most things work in a couple days, and the field is moving so fast that a lot of papers are just exploring different iterative improvements on basic concepts.
I really liked the idea of creating linked data to connect chunks. That is an idea that deserves some play time (I just added it to my TODO list). Thanks for the good ideas!
Note for those who aren't aware, a "Semantic Graph" means a knowledge graph built using a "sentence(pooled) transformer" language model to draw edges between the vertices (text data at whatever granularity the user decides) according to semantic similarity.
What's awesome about them is that they essentially form, in my mind, the "extractive" analogue to LLMs' "generative" nature.
Semantic Graphs give every single graph theory algorithm a unique epistemological twist given any particular dataset. In my case, I've built and released pre-trained semantic graphs for my debate evidence. I observe that path traversals form "debate cases", and that graph centrality in this case finds the most "generic/universally applicable" evidence. Given a different dataset, the same algorithms will have different interpretations.
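For concreteness, a toy semantic graph along those lines, assuming sentence-transformers and networkx; the similarity threshold and model are arbitrary choices:

    import networkx as nx
    from sentence_transformers import SentenceTransformer, util

    texts = ["evidence card A ...", "evidence card B ...", "evidence card C ..."]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(vectors, vectors)

    graph = nx.Graph()
    graph.add_nodes_from(range(len(texts)))
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i][j] > 0.5:  # draw an edge when semantically similar enough
                graph.add_edge(i, j, weight=float(sims[i][j]))

    # Centrality ~ the most "generic/universally applicable" text; paths ~ chains
    # of semantically adjacent evidence, per the interpretation above.
    centrality = nx.degree_centrality(graph)
    path = nx.shortest_path(graph, 0, 2) if nx.has_path(graph, 0, 2) else []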
What makes txtai so awesome is that it creates a synchronized interface between an underlying vector DB, SQL DB, and a semantic knowledge graph. The flexibility and power this offers compared to other vector DB solutions is simply unparalleled. I have seen zero meaningful competition from a vectorDB industry which is flooded with money despite little product differentiation among themselves.
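For a rough sense of what that synchronized interface looks like (config keys are from memory of the txtai docs, so double-check against a current release):

    from txtai import Embeddings

    # content=True keeps the text in a SQL store; the graph config builds the
    # semantic knowledge graph on top of the same vector index.
    embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2",
                            content=True,
                            graph={"approximate": False, "minscore": 0.5})
    embeddings.index(["first document about debate", "second document about rag"])

    embeddings.search("rag", 1)                                                   # vector view
    embeddings.search("SELECT id, text, score FROM txtai WHERE similar('rag')")   # SQL view
    graph = embeddings.graph                                                      # graph view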
This is really cool, I'm surprised I never heard of this project before. The examples look really clean.
Most RAG tools seem to start with the LLM and add Vector building and retrieval around it, while this tool seems like it started with Vector / Graph building and retrieval, then added LLM support later.
The article is a good summary of RAG in the enterprise. It shed some light for me on the quality of building KGs using LLMs, an approach that Neo4j has recently been proposing [0].
According to the article, it is either costly (if using OpenAI) or slow (if using open-source models). In both cases, predicting the quality of a KG generated using LLMs is hard.
This is an excellent article that asks some much-needed questions on the literature that exists connecting LLMs and RAGs on unstructured data, with knowledge graphs in between. We've seen plenty of articles that speculate on how one can build a simple retrieval system on top of a KG, but there are two challenges: a) constructing a high quality KG isn't easy, and b) keyword or phrase embedding on metadata for pre-filtering on relevant sections of the graph is required.
As some others here have pointed out, information extraction and searching with relevant context are the hardest parts of any search system, and it's clear that simply chunking vectors up and throwing them into a vector DB has limitations, no matter what the vector DB vendors tell you. Just like this article says, I hope that 2024 is the year where we actually get some papers that perform more rigorous evaluations of systems that use vector DBs, graph DBs, or a combination of them for building RAGs.
Totally agree! The wave of blog posts and examples one sees where it's just text-to-SQL or text-to-Cypher or any other query lang aren't really exploring the topic at any level of technical depth, and we need to see more evaluations and technical papers that characterize them, so that we can understand how to build better systems.
I think even in the LLMs + KGs space the work isn't very deep. In fact, there is more technical depth in text-to-SQL than anything else I have seen on LLMs. Maybe ColBERT-like matrix models are another topic where there is good technical depth.
One quick check for any RAG system is to ask what all the bot can answer about. Generating scalable metadata at ingestion, along with knowledge graphs, makes for a good closed-domain experience.
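A rough sketch of that ingestion step; call_llm is a hypothetical stand-in for whatever completion API is in use, and the prompt/fields are illustrative:

    import json

    def call_llm(prompt):
        # Hypothetical placeholder; plug in your completion API here.
        raise NotImplementedError

    def ingest(chunks):
        catalog = []
        for chunk in chunks:
            meta = json.loads(call_llm(
                "Return JSON with 'topics' (list) and 'entities' (list) "
                f"for this passage:\n{chunk}"))
            catalog.append({"text": chunk, **meta})
        return catalog

    # The union of the catalog's topics doubles as the answer to "what can the
    # bot answer about?", which is the quick check suggested above.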