When I started working in search 10+ years ago, people would build a beautiful UI, and then, only on shipping, realize the search results were trash + irrelevant. They imagined a search system like Elasticsearch was basically Google. When in reality, Elasticsearch is just a bit of infrastructure. A framework, not a solution.
There's a similar thing happening on RAG. Where people think building the chat interaction is the hard thing. The hard thing is extracting + searching to get relevant context. A lot of founders I talk to suddenly realize this at the last minute, right before shipping, similar to search back in the day. It's harder than just throwing chunks in a vector DB. It involves a lot of different backend data sources potentially, and is in many ways harder than a standard search relevance problem (which is itself hard enough).
Yep, we're doing RAG-ish search and ranking across many context types and modalities. You definitely can't just use a vector DB and do some chunking/search; there's a wide variety of search-like ranking, clustering, etc. and domain-specific work for relevance, and it's very hard to measure and prove improvements.
It's going to just evolve into recreating the various search and ranking processes of old just on top of a bit more semantic understanding with some smarter NLG layered in :). It won't be just LLMs, we'll have intent classification, named entity recognition, a personalization layer, reranking, all that fun stuff again.
Especially considering the additional logic that some queries require. Stacked questions, comparative questions, recommendations, questions that assume information found in previous statements / questions.
It becomes a very frustrating experience matching the inherent chaos of a conversation.
Great observation. I've seen it often in tech, across the board. It's no better than, maybe a step up from, the 'idea guy' who 'just' needs someone to build his idea. Hand-waving or complete lack of awareness of the actual value (hard) part.
I spent 8 months telling people this before I got laid off while the CEO continues to chase LLM money with no new ideas or even the talent to solve the problem.
They spent so much time on the UI and basically left the actual search to the last minute, and it was a hilarious failure on launch.
Very good points. Have you seen any examples of systems (or projects) that successfully combine multiple backend data sources, including databases, that perform better than the single backend alone? This seems like an important enough question that it ought to have been documented somewhere.
Hmm, RAG is not "the chat interaction", that's GPT or any other "brain" you choose.
Last week I finished building my 3rd RAG stack for legal document retrieval. Almost-vanilla RAG got me 90-95% of the way. Only drawback is cost, still 10x-100x above the ideal price point; but that will only improve in the future.
This is a great comment. Good search is really hard. RAG is much harder. At least with search the user can pick the best result manually or refine their search. With RAG you pass topK to the LLM and assume they're good results. The assumption is that it's "semantic search" with vectors so it will just work... wrong.
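Roughly, the pattern being criticized looks like the sketch below; the model name and prompt format are just illustrative placeholders, not anyone's actual stack:

    # Minimal "retrieve topK, stuff it in the prompt, hope it's relevant" loop.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)

    def retrieve(query, k=2):
        # Plain cosine similarity against every chunk; no reranking, no filtering.
        query_vec = model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(query_vec, chunk_vecs, top_k=k)[0]
        return [chunks[h["corpus_id"]] for h in hits]

    context = "\n".join(retrieve("what does the contract say about termination?"))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
    # Whatever topK returned is assumed relevant and sent to the LLM as-is.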
This is a post that summarizes some reading that I had done in the space of LLMs + Knowledge Graphs with the goal of identifying technically deep and interesting directions. The post covers retrieval augmented generation (RAG) systems that use unstructured data (RAG-U) and the role folks envision knowledge graphs to play in them. Briefly, the design spectrum of RAG-U systems has two dimensions:
1) What additional data to put into LLM prompts: such as, documents, or triples extracted from documents.
2) How to store and fetch that data: such as a vector index, a graph DBMS, or both.
The standard RAG-U uses vector embeddings of chunks, which are fetched from a vector index. An envisioned role of knowledge graphs is to improve standard RAG-U by explicitly linking the chunks through the entities they mention. This is a promising idea but one that needs to be subjected to rigorous evaluation as done in prominent IR publications, e.g., SIGIR.
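A minimal sketch of that chunk-linking idea, assuming spaCy NER and networkx as stand-ins (a real system would use a proper entity linker and a graph DBMS):

    import networkx as nx
    import spacy

    nlp = spacy.load("en_core_web_sm")
    chunks = ["Acme Corp acquired Widgets Inc in 2021.",
              "Widgets Inc was founded in Berlin.",
              "The weather was nice."]

    graph = nx.Graph()
    for i, chunk in enumerate(chunks):
        graph.add_node(i, text=chunk)
        for ent in nlp(chunk).ents:
            graph.add_node(ent.text, kind="entity")
            graph.add_edge(i, ent.text)  # chunk --mentions--> entity

    def expand(hit_ids):
        # Add chunks that share an entity with a vector-retrieved chunk.
        extra = set()
        for i in hit_ids:
            for entity in graph.neighbors(i):
                extra.update(n for n in graph.neighbors(entity) if isinstance(n, int))
        return extra - set(hit_ids)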
The post then discusses the scenario where an enterprise does not have a knowledge graph and the idea of automatically extracting knowledge graphs from unstructured PDFs and text documents. It covers the recent work that uses LLMs for this task (they're not yet competitive with specialized models) and highlights many interesting open questions.
Hope this is interesting to people who are interested in the area but intimidated because of the flood of activity (but don't be; I think the area is easier to digest than it may look.)
Knowledge graphs improve vector search by providing a "back of the book" index for the content. This can be done using knowledge extraction from an LLM during indexing, such as pulling out keyterms of a given chunk before embedding, or asking a question of the content and then answering it using the keyterms in addition to the embeddings. One challenge I found with this is determining keyterms to use with prompts that have light context, but using a time window helps with this, as does hitting the vector store for related content, then finding the keyterms for THAT content to use with the current query.
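Here is one rough way to do that at indexing time, with KeyBERT standing in for the LLM-based keyterm extraction (names and the model choice are illustrative):

    from keybert import KeyBERT
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    kw_model = KeyBERT(model=encoder)

    index = []
    for chunk in ["Some chunk of source text ..."]:
        keyterms = [kw for kw, _ in kw_model.extract_keywords(chunk, top_n=5)]
        index.append({
            "text": chunk,
            "embedding": encoder.encode(chunk),
            "keyterms": keyterms,  # the "back of the book" entries
        })
    # At query time the keyterms can pre-filter candidates or be fed back into
    # the prompt when the query itself carries little context.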
OpenNRE (https://github.com/thunlp/OpenNRE) is another good approach to neural relation extraction, though it's slightly dated. What would be particularly interesting is to combine models like OpenNRE or SpanMarker with entity-linking models to construct KG triples. And a solid, scalable graph database underneath would make for a great knowledge base that can be constructed from unstructured text.
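A hedged sketch of what those triples could look like with OpenNRE, roughly following its README; the entity spans are hard-coded here, whereas in practice they'd come from the NER/entity-linking step:

    import opennre

    model = opennre.get_model("wiki80_cnn_softmax")  # downloads weights on first use
    sentence = "Bill Gates founded Microsoft in Albuquerque."
    head = {"pos": (0, 10)}   # character span of "Bill Gates"
    tail = {"pos": (19, 28)}  # character span of "Microsoft"

    relation, score = model.infer({"text": sentence, "h": head, "t": tail})
    triple = ("Bill Gates", relation, "Microsoft")  # candidate edge for the KG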
By this I presume you mean build a search index that can retrieve results based on keywords? I know certain databases use Lucene to build a keyword-based index on top of unstructured blobs of data. Another alternative is to use Tantivy (https://github.com/quickwit-oss/tantivy), a Rust version of Lucene, if building search indices via Java isn't your cup of tea :)
Both libraries offer multilingual support for keywords, I believe, so that's a benefit over vector search, where multilingual embedding models are rather expensive.
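For anyone curious, a keyword index with the Python bindings for Tantivy looks roughly like this (from memory of the tantivy-py README, so treat the details as approximate):

    import tantivy

    schema_builder = tantivy.SchemaBuilder()
    schema_builder.add_text_field("title", stored=True)
    schema_builder.add_text_field("body", stored=True)
    schema = schema_builder.build()

    index = tantivy.Index(schema)
    writer = index.writer()
    writer.add_document(tantivy.Document(title="Contract A", body="Termination clause ..."))
    writer.commit()

    index.reload()
    searcher = index.searcher()
    query = index.parse_query("termination", ["title", "body"])
    for score, address in searcher.search(query, 10).hits:
        print(score, searcher.doc(address))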
Having just started from zero, I agree on the easy to digest point. You can get a pretty good understanding of how most things work in a couple days, and the field is moving so fast that a lot of papers are just exploring different iterative improvements on basic concepts.
I really liked the idea of creating linked data to connect chunks. That is an idea that deserves some play time (I just added it to my TODO list). Thanks for the good ideas!
Note for those who aren't aware, a "Semantic Graph" means a knowledge graph built using a "sentence(pooled) transformer" language model to draw edges between the vertices (text data at whatever granularity the user decides) according to semantic similarity.
What's awesome about them is that they essentially form, in my mind, the "extractive" analogue to LLMs' "generative" nature.
Semantic Graphs give every single graph theory algorithm a unique epistemological twist given any particular dataset. In my case, I've built and released pre-trained semantic graphs for my debate evidence. I observe that path traversals form "debate cases", and that graph centrality in this case finds the most "generic/universally applicable" evidence. Given a different dataset, the same algorithms will have different interpretations.
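For concreteness, a toy semantic graph along those lines, assuming sentence-transformers and networkx; the similarity threshold and model are arbitrary choices:

    import networkx as nx
    from sentence_transformers import SentenceTransformer, util

    texts = ["evidence card A ...", "evidence card B ...", "evidence card C ..."]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(vectors, vectors)

    graph = nx.Graph()
    graph.add_nodes_from(range(len(texts)))
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i][j] > 0.5:  # draw an edge when semantically similar enough
                graph.add_edge(i, j, weight=float(sims[i][j]))

    # Centrality ~ the most "generic/universally applicable" text; paths ~ chains
    # of semantically adjacent evidence, per the interpretation above.
    centrality = nx.degree_centrality(graph)
    path = nx.shortest_path(graph, 0, 2) if nx.has_path(graph, 0, 2) else []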
What makes txtai so awesome is that it creates a synchronized interface between an underlying vector DB, SQL DB, and a semantic knowledge graph. The flexibility and power this offers compared to other vector DB solutions is simply unparalleled. I have seen zero meaningful competition from a vectorDB industry which is flooded with money despite little product differentiation among themselves.
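For a rough sense of what that synchronized interface looks like (config keys are from memory of the txtai docs, so double-check against a current release):

    from txtai import Embeddings

    # content=True keeps the text in a SQL store; the graph config builds the
    # semantic knowledge graph on top of the same vector index.
    embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2",
                            content=True,
                            graph={"approximate": False, "minscore": 0.5})
    embeddings.index(["first document about debate", "second document about rag"])

    embeddings.search("rag", 1)                                                   # vector view
    embeddings.search("SELECT id, text, score FROM txtai WHERE similar('rag')")   # SQL view
    graph = embeddings.graph                                                      # graph view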
This is really cool, I'm surprised I never heard of this project before. The examples look really clean.
Most RAG tools seem to start with the LLM and add Vector building and retrieval around it, while this tool seems like it started with Vector / Graph building and retrieval, then added LLM support later.
The article is a good summary of RAG in the enterprise. It shed some light for me on the quality of building KGs using LLMs, an approach that Neo4j has recently been proposing [0].
According to the article, it is either costly (if using OpenAI) or slow (if using open-source models). In both cases, predicting the quality of a KG generated using LLMs is hard.
This is an excellent article that asks some much-needed questions on the literature that exists connecting LLMs and RAGs on unstructured data, with knowledge graphs in between. We've seen plenty of articles that speculate on how one can build a simple retrieval system on top of a KG, but there are two challenges: a) constructing a high quality KG isn't easy, and b) keyword or phrase embedding on metadata for pre-filtering on relevant sections of the graph is required.
As some others here have pointed out, information extraction and searching with relevant context are the hardest parts of any search system, and it's clear that simply chunking vectors up and throwing them into a vector DB has limitations, no matter what the vector DB vendors tell you. Just like this article says, I hope that 2024 is the year where we actually get some papers that perform more rigorous evaluations of systems that use vector DBs, graph DBs, or a combination of them for building RAGs.
Totally agree! The wave of blog posts and examples one sees where it's just text-to-SQL or text-to-Cypher or any other query lang aren't really exploring the topic at any level of technical depth, and we need to see more evaluations and technical papers that characterize them, so that we can understand how to build better systems.
I think even in the LLMs + KGs space the work isn't very deep. In fact, there is more technical depth in text-to-SQL than anything else I have seen on LLMs. Maybe ColBERT-like matrix models are another topic where there is good technical depth.
One quick check for any RAG system is to ask what all the bot can answer about. Generating scalable metadata at ingestion, along with knowledge graphs, makes for a good closed-domain experience.
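A rough sketch of that ingestion step; call_llm is a hypothetical stand-in for whatever completion API is in use, and the prompt/fields are illustrative:

    import json

    def call_llm(prompt):
        # Hypothetical placeholder; plug in your completion API here.
        raise NotImplementedError

    def ingest(chunks):
        catalog = []
        for chunk in chunks:
            meta = json.loads(call_llm(
                "Return JSON with 'topics' (list) and 'entities' (list) "
                f"for this passage:\n{chunk}"))
            catalog.append({"text": chunk, **meta})
        return catalog

    # The union of the catalog's topics doubles as the answer to "what can the
    # bot answer about?", which is the quick check suggested above.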