I wrote a blog post on the HNSW vector index design, perhaps the most popular vector index design adopted by databases at this point. The post is based on several lectures I gave in a graduate course at UWaterloo last fall. This is intended for people who are interested in understanding how these indices work internally.
My goal was to explain the intuitions behind HNSW indices as a natural relaxation of two prior index designs: kd-trees and the (under-appreciated) SA-trees.
I also place these three vector indices in a framework that I call the "FES Theorem", which states that any vector index design can provide at most two of the following three properties:
- Fast: returns vectors that are similar to a query vector q quickly.
- Exact: correctly returns the most similar vectors to q (unlike "approximate" indices, which can make mistakes).
- Scalable: can index vectors with a large number of dimensions, e.g., thousands of dimensions.
Kd-trees, SA-trees, and HNSW each satisfy a different one of the three possible two-property combinations.
Needless to say, I intentionally picked the term "FES Theorem" to sound like the famous "CAP Theorem". Fes (Turkish) or a fez (English), just like cap, is a headdress. You can see a picture in the post.
I hope you find the explanation of HNSW as a sequence of relaxations of kd-trees useful.
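If you want to poke at an HNSW index directly, here is a minimal sketch using the hnswlib library (the random data and parameter values are illustrative, not recommendations from the post):

```python
import numpy as np
import hnswlib

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build the HNSW graph; M and ef_construction trade construction cost for recall.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef controls the search beam width at query time: higher ef means better
# recall ("Exact"), lower ef means faster queries ("Fast"); you can't max out both.
index.set_ef(50)
query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
print(labels[0], distances[0])
```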
I wrote a blog post on something that I find is often misunderstood and under-appreciated: the Resource Description Framework (RDF). I explain what RDF is, what it is not, when you may need it, and its virtues and vices. RDF is good to know about in our increasingly AI-dominated world, since it has its roots in knowledge representation and reasoning (KRR), a field of AI known as good old-fashioned symbolic AI.
I explain RDF first as a data model and compare its pros and cons with the relational and property graph models. I then explain RDF and the standards around it, such as RDFS and OWL, as a "knowledge representation system". I cover RDF's roots in KRR and traditional symbolic AI systems. I also discuss several directions I have seen people pursue to improve LLMs with RDF-based or, more broadly, KRR-based technology (see especially the link to Doug Lenat's last article (I think) on the subject before he passed on).
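For readers who have never touched RDF, here is a tiny, hedged sketch of the data model using the rdflib Python library; the example.org namespace and triples are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# RDF data is just a set of (subject, predicate, object) triples.
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.worksFor, EX.acme))
g.add((EX.alice, EX.name, Literal("Alice")))

# SPARQL queries pattern-match over the triples.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?name WHERE { ?p a ex:Person ; ex:name ?name . }
""")
for row in results:
    print(row.name)   # -> Alice
```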
It's a bit of a long read but I hope people find it useful to think about RDF.
That's correct, though OWL also provides ways to constrain what can be encoded (e.g., the cardinality constraint example I gave). But yes, SHACL is primarily for constraints. In general, there are several other standards besides RDFS and OWL that I didn't mention in the post. I wanted to give a few example standards to show how RDF plus its standards forms something more than the regular data models developers usually think of.
I mentioned SHACL specifically because it resolves the lack-of-integrity issue you raise: it serves the same role as schema definitions in RDF databases that support it. If an attempted insert fails the SHACL constraints, it will be rejected.
OWL, however, is only used when doing selects. It expands queries to allow them to access things they didn't know to ask for. When used on top of a relational database, OWL expresses the kind of statements that exist in a T-box for ontology-based access.
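To make the SHACL point concrete, here is a hedged sketch using the pyshacl library with a toy shape; a SHACL-aware RDF store would perform the equivalent check at insert time:

```python
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(format="turtle", data="""
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/> .
    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;
        sh:property [ sh:path ex:name ; sh:minCount 1 ; sh:maxCount 1 ] .
""")

data = Graph().parse(format="turtle", data="""
    @prefix ex: <http://example.org/> .
    ex:alice a ex:Person .   # no ex:name, so the shape is violated
""")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: a store enforcing SHACL would reject this insert
print(report)
```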
I think even in the LLMs + KGs space the work is not very deep. In fact, there is more technical depth in text-to-SQL than in anything else I have seen on LLMs. Maybe the ColBERT-like matrix models are another topic where there is good technical depth.
This is a post that summarizes some reading I had done in the space of LLMs + knowledge graphs, with the goal of identifying technically deep and interesting directions. The post covers retrieval augmented generation (RAG) systems that use unstructured data (RAG-U) and the role folks envision knowledge graphs playing in them. Briefly, the design spectrum of RAG-U systems has two dimensions:
1) What additional data to put into LLM prompts: such as documents, or triples extracted from documents.
2) How to store and fetch that data: such as a vector index, a graph DBMS, or both.
The standard RAG-U uses vector embeddings of chunks, which are fetched from a vector index. An envisioned role of knowledge graphs is to improve standard RAG-U by explicitly linking the chunks through the entities they mention. This is a promising idea, but one that needs to be subjected to rigorous evaluation, as done in prominent IR publications, e.g., SIGIR.
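To make the chunk-linking idea concrete, here is a toy, hedged sketch; embed() is a stand-in for any embedding model, and the entity mentions would in practice come from an entity extractor/linker:

```python
import numpy as np
from collections import defaultdict

def embed(text: str) -> np.ndarray:
    """Placeholder: stand-in for any sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

chunks = {
    "c1": "Alice joined Acme Corp in 2019.",
    "c2": "Acme Corp acquired Beta Labs last year.",
    "c3": "Beta Labs researches battery chemistry.",
}
# Entity mentions per chunk; in practice these come from an entity extractor.
mentions = {"c1": {"Alice", "Acme Corp"},
            "c2": {"Acme Corp", "Beta Labs"},
            "c3": {"Beta Labs"}}

# "Knowledge graph" view: entity -> chunks that mention it.
entity_to_chunks = defaultdict(set)
for cid, ents in mentions.items():
    for e in ents:
        entity_to_chunks[e].add(cid)

def retrieve(query: str, k: int = 1) -> set[str]:
    q = embed(query)
    scores = {cid: float(q @ embed(t)) for cid, t in chunks.items()}
    hits = set(sorted(scores, key=scores.get, reverse=True)[:k])
    # Expand the vector hits with chunks linked through shared entities.
    for cid in list(hits):
        for e in mentions[cid]:
            hits |= entity_to_chunks[e]
    return hits

print(retrieve("Who did Acme acquire?"))
```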
The post then discusses the scenario where an enterprise does not have a knowledge graph, and the goal of automatically extracting knowledge graphs from unstructured PDFs and text documents. It covers the recent work that uses LLMs for this task (they're not yet competitive with specialized models) and highlights many interesting open questions.
Hope this is interesting to people who are interested in the area but intimidated because of the flood of activity (but don't be; I think the area is easier to digest than it may look.)
Knowledge graphs improve vector search by providing a "back of the book" index for the content. This can be done using knowledge extraction from an LLM during indexing, such as pulling out keyterms of a given chunk before embedding, or asking a question of the content and then answering it using the keyterms in addition to the embeddings. One challenge I found with this is determining keyterms to use with prompts that have light context, but using a time window helps with this, as does hitting the vector store for related content, then finding the keyterms for THAT content to use with the current query.
OpenNRE (https://github.com/thunlp/OpenNRE) is another good approach to neural relation extraction, though it's slightly dated. What would be particularly interesting is to combine models like OpenNRE or SpanMarker with entity-linking models to construct KG triples. And a solid, scalable graph database underneath would make for a great knowledge base that can be constructed from unstructured text.
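As a rough, hedged sketch of what that looks like with OpenNRE (the model name is one of their pretrained wiki80 models; the sentence and character spans are made up for illustration):

```python
import opennre

# Load a pretrained relation classifier (weights are downloaded on first use).
model = opennre.get_model('wiki80_cnn_softmax')

text = 'Alice Smith works for Acme Corp in Toronto.'
# Head and tail entities are given as character-offset spans into the text;
# the model predicts the relation between them.
relation, score = model.infer({
    'text': text,
    'h': {'pos': (0, 11)},    # "Alice Smith"
    't': {'pos': (22, 31)},   # "Acme Corp"
})
print(relation, score)

# Combined with an entity linker, (head, predicted relation, tail) becomes a
# KG triple that can be loaded into a graph database.
```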
By this I presume you mean build a search index that can retrieve results based on keywords? I know certain databases use Lucene to build a keyword-based index on top of unstructured blobs of data. Another alternative is to use Tantivy (https://github.com/quickwit-oss/tantivy), a Rust version of Lucene, if building search indices via Java isn't your cup of tea :)
Both libraries offer multilingual support for keywords, I believe, so that's a benefit over vector search, where multilingual embedding models are rather expensive.
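For completeness, here is a hedged sketch of what a keyword index looks like through Tantivy's Python bindings (API details may differ slightly across versions):

```python
import tantivy

# Define a schema with the text fields we want to index.
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# In-memory index; pass a path to persist it on disk instead.
index = tantivy.Index(schema)
writer = index.writer()
writer.add_document(tantivy.Document(
    title=["Graph databases and LLMs"],
    body=["Notes on retrieval augmented generation over knowledge graphs."],
))
writer.commit()

index.reload()
searcher = index.searcher()
query = index.parse_query("knowledge graphs", ["title", "body"])
for score, address in searcher.search(query, 3).hits:
    print(score, searcher.doc(address)["title"])
```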
Having just started from zero, I agree on the easy to digest point. You can get a pretty good understanding of how most things work in a couple days, and the field is moving so fast that a lot of papers are just exploring different iterative improvements on basic concepts.
I really liked the idea of creating linked data to connect chunks. That is an idea that deserves some play time (I just added it to my TODO list). Thanks for the good ideas!
This is the first post in a series I plan to write on the role graph DBMSs and knowledge graphs play in LLM applications, and on the recent text-to-high-level-query-language work I read up on over the holiday season.
These blogs have two goals:
(i) give an overview of what I learned as an outsider looking for technical depth;
(ii) discuss some avenues of work that I ran into that looked important.
This first post is on "retrieval augmented generation using structured data", i.e., private records stored in relational or graph DBMSs. The post is long and full of links to some of the important material I read (given my academic background, many of these are papers), but it should be an easy read, especially if you are an outsider intimidated by this fast-moving space.
tl;dr for this post:
- I provide an overview of RAG.
- Compared to pre-LLM work, the simplicity and effectiveness of developing a natural language interface over your database using LLMs is impressive.
- There is little work that studies LLMs' ability to generate Cypher or SPARQL. I also hope to see more work on nested, recursive and union-of-join queries.
- Everyone is studying how to prompt LLMs so they generate correct DBMS queries. Here, I hope to see work studying the effects of data modeling (normalization, views, graph modeling) on the accuracy of LLM-generated queries.
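To make the prompting point concrete, here is a hedged sketch of a bare-bones text-to-Cypher setup; the schema string, model name, and prompt wording are all illustrative, not what any particular paper or system uses:

```python
from openai import OpenAI

# Illustrative graph schema; in a real system this would be generated
# from the DBMS catalog (node/rel tables, properties, types).
schema = """
Node tables: Person(name STRING), Movie(title STRING, year INT64)
Rel tables:  ActedIn(FROM Person TO Movie)
"""

client = OpenAI()
question = "Which movies released after 2010 did Tom Hanks act in?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Translate the user's question into a single Cypher query "
                    "for this schema. Return only the query.\n" + schema},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```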
So this post is about using RAG/LLM to generate queries (Cypher in this case, to be consumed by Kuzu). That way you could ask natural-language questions to be answered by the result of the query.
I wonder if you could comment about other areas of AI+Graphs (I think this is mostly Graph Neural Networks, not sure if anything else?).
For instance, I found PyG and the Deep Graph Library, but the use cases are so jargon-heavy [1], [2] that I'm not sure about the real-world applications, in layman's terms.
Ok, using ChatGPT and Bard (the irony lol) I learned a bit more about GNNs:
GNNs are probabilistic and can be trained to learn representations of graph-structured data and to handle complex relationships, while classical graph algorithms are specialized for specific graph-analysis tasks and operate based on predefined rules/steps.
* Why is PyG called "Geometric" and not "Topologic"?
Properties like connectivity, neighborhoods, and even geodesic distances can all be considered topological features of a graph. These features remain unchanged under continuous deformations like stretching or bending, which is the defining characteristic of topological equivalence. In this sense, "PyTorch Topologic" might be a more accurate reflection of the library's focus on analyzing the intrinsic structure and connections within graphs.
However, the term "geometric" still has some merit in the context of PyG. While most GNN operations rely on topological principles, some do incorporate notions of Euclidean geometry, such as:
- Node embeddings: Many GNNs learn low-dimensional vectors for each node, which can be interpreted as points in a vector space, allowing geometric operations like distances and angles to be applied.
- Spectral GNNs: These models leverage the eigenvalues and eigenvectors of the graph Laplacian, which encodes information about the geometric structure and distances between nodes.
- Manifold learning: Certain types of graphs can be seen as low-dimensional representations of high-dimensional manifolds. Applying GNNs in this context involves learning geometric properties on the manifold itself.
Therefore, although topology plays a primary role in understanding and analyzing graphs, geometry can still be relevant in certain contexts and GNN operations.
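To make the "spectral" point concrete, here is a small sketch (my own toy example, not from PyG) that computes a geometric embedding of a graph from its Laplacian:

```python
import numpy as np

# Tiny undirected graph (a 4-node path) as an adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # unnormalized graph Laplacian

# Eigenvectors of L encode the graph's structure; using the smallest
# non-trivial ones as coordinates places each node in a low-dimensional
# Euclidean space, which is where the "geometric" view comes in.
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, 1:3]   # a 2-D coordinate per node
print(embedding)
```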
* Real world applications:
- HuggingFace has a few models [0] around things like computational chemistry [1] or weather forecasting.
- PyGod [2] can be used for Outlier Detection (Anomaly Detection).
- Apparently ULTRA [3] can "infer" (in the knowledge graph sense), that Michael Jackson released some disco music :-p (see the paper).
- RGCN [4] can be used for knowledge graph link prediction (recovery of missing facts, i.e. subject-predicate-object triples) and entity classification (recovery of missing entity attributes).
- GreatX [5] tackles removing inherent noise, "Distribution Shift" and "Adversarial Attacks" (e.g., noise purposely introduced to hide a node's presence) from networks. Apparently this is a thing and the field is called "Graph Reliability" or "Reliable Deep Graph Learning". The author even has a bunch of "awesome" style lists of links! [6]
- Finally this repo has a nice explanation of how/why to run machine learning algorithms "outside of the DB":
"Pytorch Geometric (PyG) has a whole arsenal of neural network layers and techniques to approach machine learning on graphs (aka graph representation learning, graph machine learning, deep graph learning) and has been used in this repo [7] to learn link patterns, also known as link or edge predictions."
You seem to have done some research already but let me answer briefly: GNNs and what I covered in the blog post, "RAG over structured data", are not connected. They are approaches to solve 2 different problems.
GNNs: Let's forget about LLMs completely. GNN is a term for a specific class of ML models whose layers follow a graph structure. Suppose you have data that represents real-world entities, and for each entity you have features, i.e., a vector of floating-point numbers representing its properties. Now suppose you want to run some predictive task on these entities, e.g., your entities are customers and products, and you want to predict who might buy a product so you can recommend products to customers. There is a suite of ML tools you can use if you can represent your entities as vectors themselves: you can then use distances between these vector representations as an indication of closeness/similarity, and you could recommend to a customer A the products bought by customers whose vectors are close to A's. This is what it means to embed these entities in a vector space.
One way to embed your entities is to run the nodes through an ML model that takes their features as input and produces another set of vectors (you could use the features alone as embeddings, but they are not really trained and often have a higher dimension than the embedded vectors). GNNs are a specific version of such ML models where the entities and the relationships between them are modeled as a graph, and the model's architecture, i.e., the operations it performs on the feature vectors, depends on the structure of this graph.
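As a minimal illustration of what "the model's operations follow the graph structure" means, here is a hedged sketch with PyTorch Geometric on a made-up 4-node graph:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph: 4 entities with 8-dimensional feature vectors.
x = torch.randn(4, 8)
# Edges as a 2 x num_edges index tensor (0-1, 1-2, 2-3, both directions).
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]], dtype=torch.long)
data = Data(x=x, edge_index=edge_index)

# Two graph-convolution layers: each node's new vector mixes its own
# features with those of its neighbors, following the graph structure.
class TinyGNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(8, 16)
        self.conv2 = GCNConv(16, 4)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        return self.conv2(h, data.edge_index)  # 4-dim embedding per node

print(TinyGNN()(data).shape)  # torch.Size([4, 4])
```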
In short, GNNs are not deeply connected to LLMs.
GNNs became very popular several years ago because they were the only ML architectures where you could incorporate into the model and the training objective not just the features but also the connections between the entities. And they dominated academia until LLMs. In practice, I don't think they're as popular as they are in academia, but afaik several major companies, such as Pinterest, based their recommendation engines on models with GNN architectures.
But one can imagine building applications that use a mix of these technologies. You can use GNNs to create embeddings of KGs and then use these embeddings to extract information during retrieval in a RAG system. All these combinations are possible.
I am also amused that Ontario, which is the most populous province, has the following statistics:
Area: 1,076,395 km^2
Population: 15.3M
Istanbul, which is the biggest city in my home country Turkiye, has these:
Area: 5,343 km^2 (so ~200x smaller)
Population: 18M (the official census is ~16M, but even the mayor cites 18M)
I'm not even comparing the much denser places in the world (e.g., Gaza).
The provincial area figure is meaningless for computing density. It’s like calculating the density of beer in a glass that’s been poured with 99% head. The real density is in the quarter inch of beer at the bottom of the glass, not the giant blob of foam above it.
I believe those liquid cash barons are now all international conglomerates. Also, I can tell you that the LCBO is actually quite a bit cheaper than BC Liquor and NSLC. Though compared to Alberta it is way more expensive.
Ontario is culturally 2 provinces, southern and northern Ontario.
The north is what you normally think of Canada: rocks, trees and lakes. The south is similar to the neighbouring US states of Michigan, Ohio, and New York but with a distinct Canadian identity and a French region in the east.
Most of the population (13.4M) lives in the southern part (114,000 km^2)
Southwestern Ontario (Windsor/London) is very culturally and demographically different from Toronto/Golden Horseshoe and Eastern Ontario (Ottawa/Kingston)
Michigan is also culturally different from New York.
The north-south split is the most striking, but yes south can be further divided into the GTA, SW (farm country and a small part of the rust belt, similar to the Midwest) and eastern (Ottawa and the townships: government and farms)
Michigan and Ohio are most similar to SW but upstate NY also bleeds culturally into the region, both around Niagara and the thousand islands.
It's worth noting that roughly half of Canada's population lives in the Quebec-Windsor Corridor [0] (the land between Quebec City, QC and Windsor, ON). This region is really the Canadian equivalent of the Boston-Washington corridor in the States, so imagine if nearly 150 million lived in BosWash.
~6.954 million (as of 2016, so actually higher) of those 15M people live in 8,244 km^2 in the Greater Toronto Hamilton Area, which works out to 843 people per square kilometer (vs 3,368 for Istanbul).
Yes, 1/4 the density, but nowhere close to the bizarre comparison you gave, where you used literally uninhabitable/barely-settled areas of the province (Ontario is huge and mostly rock and lakes) and compared them to one of the densest cities in the world. At least compare urban area to urban area.
Toronto is the 3rd or 4th most populated city in North America, depending on how you count. Not some quaint arctic getaway. Also sits at the same latitude as e.g. Marseille in the south of France, or Florence in Italy. So not exactly northern at all.
Also if we restrict to just the actual technical city of Toronto proper, the density is higher than what you gave for Istanbul: 2,794,356 people in 630.20km^2 == 4435 people per square kilometer.
I might be wrong in the numbers as I just got them from Wikipedia. But to be clear: Istanbul is the name of both the city and the "province" in Turkiye. And I used the larger area number from Wikipedia, which is for the province: https://en.wikipedia.org/wiki/Istanbul
So even when using the official numbers, the urban density of Istanbul is > 6,000 according to Wikipedia.
ps: I live in Toronto, so I know Toronto is incomparably dense compared to other parts of Ontario. I could have made the same point by comparing Toronto's density to Ontario. Though, I'm curious how much of Ontario is uninhabitable. I thought most of it (say > 50%) seemed habitable.
The right comparison would be with the area and population of southern & maybe eastern Ontario and use that. Northern Ontario is really just something else. Anything north of the line where the Canadian shield begins becomes basically unarable and extremely low population density. And once you get past e.g. Timmins, there's really ... "nothing." (Pretty though.)
I don't have the patience to go look this up, but obviously it will be below Istanbul; likely about equivalent to most of the US Atlantic region and maybe even parts of the UK, etc.
Istanbul/Constantinople, ... a heavily populated city for like 2000 years, not to mention the region going back to the, uh, paleolithic and it's literally basically the origin of ... farming and settled agriculture. Kinda weird comparison.
What do you mean by "maybe"? Eastern Ontario contains the 4th largest metropolitan area in Canada (Ottawa), which is a very reasonable comparator; the density is 195/km^2.
It's just that there are large parts of what is classified as "eastern Ontario" that are kinda similar to the near north: not arable, shield and cottage country, and sparsely populated. So it doesn't feel fair. But yes, Ottawa is clearly a major centre.
Name me a major western metro area with jobs and a good standard of living that doesn't. I remember people banging the drums 10 years ago: "come to Berlin, there are so many empty houses". Yeah, that changed quickly.
This is a good point. There must be one though that other cities can look up to. I would guess Singapore does a good job on these problems. I would be curious to hear of the best examples of large cities that have done a better job on housing.
Housing problems here are related to pricing, and only tangentially to quantity of construction. The issue is how cheap housing loans have been, how deregulated the real estate sales market, and how stupid the last 20 years of governments have been about a) how they regulate the housing market and b) what kind of housing gets built (mostly either sprawly suburban single family mc-homes way out in e.g. Vaughan or Mississauga etc. or tiny condos in Toronto proper which just get turned into speculative investments or airbnbs).
Mortgages here were too cheap for too long. Really, what people "buy" when they buy a home is a mortgage, not a house.
Because the government is importing a massive number of people relative to the country’s population and not building anywhere near enough housing. They ain’t slowing down any time soon so it’s gonna get even worse.