Very curious to follow your journey through embedding search.
If I want 100 close matches that match a filter, is it better to filter first then find vector similarity within that, or find 1000 similar vectors and then filter that subset?
I experimented with that a few months ago. Building a fresh FAISS index for a few thousand matches is really quick, so I think it's often better to filter first, build a scratch index and then use that for similarity: https://github.com/simonw/datasette-faiss/issues/3
... although on thinking about this more I realize that a better approach may well be to just filter down to ~1,000 and then run a brute-force score across all of them, rather than messing around with an index.
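For what it's worth, the brute-force version is only a few lines. A minimal sketch, assuming the filtered rows' embeddings are already loaded into a NumPy array (the function and variable names here are illustrative, not from the library):

    import numpy as np

    def top_matches(query_embedding, candidate_embeddings, candidate_ids, k=100):
        # Normalise everything so a plain dot product is a cosine similarity
        query = query_embedding / np.linalg.norm(query_embedding)
        candidates = candidate_embeddings / np.linalg.norm(
            candidate_embeddings, axis=1, keepdims=True
        )
        scores = candidates @ query
        best = np.argsort(scores)[::-1][:k]
        return [(candidate_ids[i], float(scores[i])) for i in best]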
Just have to say, I’m enjoying playing with llm so much. Being able to easily play with this technology at the command line feels like magic. Thanks for the awesome work!
Curious to know what value you've seen out of these clusters. In my experience, k-means clustering was very lackluster. Having to define the number of clusters was a big pain point too.
You almost certainly want a graph like structure (overlapping communities rather than clusters).
But unsupervised clustering was almost entirely ineffective for every use case I had :/
I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.
I mainly like it as another example of the kind of things you can use embeddings for.
There are iterative methods for optimizing the number of clusters in k-means (silhouette and knee/elbow are common), but in practice I prefer density-based methods like HDBSCAN and OPTICS. There's a very basic visual comparison at https://scikit-learn.org/stable/auto_examples/cluster/plot_c....
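As a rough illustration of both approaches, here is a sketch with random placeholder vectors standing in for real embeddings:

    import numpy as np
    from sklearn.cluster import KMeans, HDBSCAN  # HDBSCAN needs scikit-learn >= 1.3
    from sklearn.metrics import silhouette_score

    embeddings = np.random.rand(500, 384)  # stand-in for real embedding vectors

    # Silhouette sweep: try a range of k values and keep the best-scoring one
    scores = {
        k: silhouette_score(
            embeddings, KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        )
        for k in range(2, 15)
    }
    best_k = max(scores, key=scores.get)

    # Density-based alternative: no k required, noise points get the label -1
    hdbscan_labels = HDBSCAN(min_cluster_size=10).fit_predict(embeddings)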
You could also use a Bayesian version of k-means. It puts a Dirichlet process prior over an infinite (in practice truncated) set of clusters, so that the most probable number of clusters k is found automatically.
I found one implementation here: https://github.com/vsmolyakov/DP_means
Alternatively, there is a Bayesian GMM in sklearn. When you restrict it to diagonal covariance matrices, you should be fine in high dimensions.
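A minimal sketch of that sklearn option (the component cap and iteration count are arbitrary choices, and the data is a random placeholder):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    embeddings = np.random.rand(500, 384)  # stand-in for real embedding vectors

    # Dirichlet-process prior over (up to) 20 components; superfluous components
    # get weights near zero, so the effective number of clusters is inferred.
    bgmm = BayesianGaussianMixture(
        n_components=20,
        covariance_type="diag",  # diagonal covariances stay tractable in high dimensions
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500,
    )
    labels = bgmm.fit_predict(embeddings)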
Well, this is timely: just yesterday I was needing a tool that would easily do embeddings for me so I could do RAG. LLM seems ideal, thanks Simon!
One question: I saw a comment here that doing RAG efficiently entails some more trickery, like chunking documents before embedding them. In your experience, is stuff like that necessary, or do I just pass the returned documents to GPT-4 and that's it for my RAG?
For an example of what I'm doing now, I bought an ESP32-Box (think basically an OSS Amazon Echo) and want to ask it questions about my (Markdown) notes. What would be the easiest way to do that?
I'm still trying to figure out the answer to that question myself.
The absolute easiest approach right now is to use Claude, since it has a 100,000 token limit - so you can stuff a ton of documentation into it at once and start asking questions.
Doing RAG with smaller models requires much more cleverness, which I'm only just starting to explore.
That's fair, thanks! Do you plan to integrate the cleverness into LLM, so we can benefit from it too? I'm not sure whether LLM can currently be used as a library; I've only been using it as a CLI, but it would be great if I could use it in my programs without shelling out.
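For what it's worth, the generation half is already usable from Python; a minimal RAG loop might look like this sketch, where retrieve_top_docs is a hypothetical stand-in for whatever embedding search you use (for example the brute-force scoring sketched above), and only llm.get_model() and model.prompt() are actual llm calls:

    import llm

    def answer(question, retrieve_top_docs):
        # retrieve_top_docs is a hypothetical callable: question -> list of document strings
        context = "\n\n".join(retrieve_top_docs(question))
        model = llm.get_model("gpt-4")
        response = model.prompt(
            f"Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
        )
        return response.text()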
Still pondering on using embeddings for classification. Yes, we can group similars with embeddings through clustering, but how do you extract the label for the groups?
What I've come up with is either a) after indexing, ask an LLM for a common label based on samples from the grouped set ("what keyterm best describes the relationship between these documents?"), or b) determine the label (or keyterms) while indexing, by having the LLM find the keyterms ahead of time, then use set overlap on the grouped set's keyterms afterwards to settle on a label for the group.
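Option (a) is roughly what the cluster-summarising step mentioned elsewhere in this thread does; a hand-rolled sketch with the llm Python API (the prompt wording and sample size are my own choices) might look like:

    import llm

    def label_cluster(documents, sample_size=5):
        # Ask a model for a short label describing a sample drawn from one cluster
        sample = "\n---\n".join(documents[:sample_size])
        model = llm.get_model("gpt-4")
        response = model.prompt(
            "What short label best describes the common theme of these documents?\n\n" + sample
        )
        return response.text().strip()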
This is a fantastic library. I plan to use some of the search functionality with a system that tries to figure out how to manipulate/work with/add features to existing code.
This looks absolutely awesome. Does this handle prompt/instruct format using the plugins? It's been the biggest pain point for me using llama.cpp directly.
I'm still iterating on that. Plugins get complete control over the prompts, so they can handle the various weirdnesses of them. Here's some relevant code:
I'm not completely happy with this yet. Part of the problem is that different models on the same architecture may have completely different prompting styles.
I expect I'll eventually evolve the plugins to allow them to be configured in an easier and more flexible way. Ideally I'd like you to be able to run new models on existing architectures using an existing plugin.
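To illustrate why this is fiddly, here's a hypothetical sketch (not the actual plugin code) of the kind of per-model template table a plugin ends up needing; two models can share the llama.cpp architecture yet expect completely different formats:

    # Hypothetical per-model prompt templates; not taken from any llm plugin
    PROMPT_TEMPLATES = {
        "llama-2-chat": "[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{prompt} [/INST]",
        "alpaca": "### Instruction:\n{prompt}\n\n### Response:\n",
    }

    def build_prompt(model_name, prompt, system="You are a helpful assistant."):
        # str.format ignores unused keyword arguments, so templates without
        # a {system} slot simply drop it
        return PROMPT_TEMPLATES[model_name].format(prompt=prompt, system=system)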
Is there a natural embedding for text? Why do we need to make up random extremely precise numbers and optimize them expensively and keep that thing hanging around?
LLM is to transformers as the Django ORM or SQLAlchemy is to the Python psycopg2 and MySQLdb client libraries.
It's a higher-level abstraction that runs on top of multiple libraries like HF transformers, in some cases calling plugins that use transformers directly.
Python Library "llm" now provides tools for working with embeddings
I was initially trying to parse that, thinking "is this an OpenAI thing?". Of course the answer is just a click away, but people might miss this if they are interested in Python coding and AI.
It's meant to be general - the (ambitious) scope of the tool is "run any large language model on your own device, as a CLI or Python library or Web UI (still in development)".
The fact that it's written in Python is, I think, one of the least interesting aspects of the project.
Sure, I understand the reluctance to associate too closely with Python, especially if it's a CLI+lib and Python is otherwise an implementation detail. But isn't it similar to a Web framework calling itself `web`? That would be confusing, which is probably why Python has `web.py` :)
It's a great name from a "glad you got there first" perspective but it's also so general as to be ungoogleable. Like if I want to find documentation for this library/CLI in Google, what would I search? I'd probably end up putting "Python" in the query just to disambiguate it from all the results about LLMs in general. So IMO you may as well include "py" (or some other disambiguator) in the name, since people are going to need to include it in their search queries anyway.
I'm taking a bit of a risk here, but I actually think the acronym may still be obscure enough that I'm in with a chance. I'm on the second page of Google already.
Don't miss the new llm-cluster plugin, which can both calculate clusters from embeddings and use another LLM call to generate a name for each cluster: https://github.com/simonw/llm-cluster
Example usage:
Fetch all issues, embed them and store the embeddings and content in SQLite:
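Something along these lines, adapted from the llm-cluster README (the issue-fetching step and exact flags may vary; here curl and jq pull one page of issues from the GitHub API):

    curl 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' | \
      jq '[.[] | {id: .id, title: .title}]' | \
      llm embed-multi llm-issues - --database issues.db --store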
Group those in 10 clusters and generate a summary for each one using a call to GPT-4:
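And then something like this, assuming the plugin's --summary and --model options work the way I remember from its README:

    llm cluster llm-issues 10 --database issues.db --summary --model gpt-4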