LLM Python/CLI tool adds support for embeddings (simonwillison.net)
161 points by simonw on Sept 4, 2023 | 46 comments


There's a lot of stuff in this release.

Don't miss the new llm-cluster plugin, which can both calculate clusters from embeddings and use another LLM call to generate a name for each cluster: https://github.com/simonw/llm-cluster

Example usage:

Fetch all issues, embed them and store the embeddings and content in SQLite:

    paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
      | jq '[.[] | {id: .id, title: .title}]' \
      | llm embed-multi llm-issues - \
        --database issues.db \
        --model sentence-transformers/all-MiniLM-L6-v2 \
        --store
Group those into 10 clusters and generate a summary for each one using a call to GPT-4:

    llm cluster llm-issues --database issues.db 10 --summary --model gpt-4


Very curious to follow your journey through embedding search.

If I want 100 close matches that match a filter, is it better to filter first then find vector similarity within that, or find 1000 similar vectors and then filter that subset?


I experimented with that a few months ago. Building a fresh FAISS index for a few thousand matches is really quick, so I think it's often better to filter first, build a scratch index and then use that for similarity: https://github.com/simonw/datasette-faiss/issues/3

... although on thinking about this more I realize that a better approach may well be to just filter down to ~1,000 and then run a brute-force score across all of them, rather than messing around with an index.
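
For illustration, here's a minimal sketch of that brute-force approach, assuming the pre-filtered embeddings are already loaded as a NumPy array (the query_vec, vectors and ids names are hypothetical):

    import numpy as np
    def top_matches(query_vec, vectors, ids, n=100):
        # Cosine similarity between the query and every filtered candidate
        vectors = np.asarray(vectors, dtype=np.float32)
        query = np.asarray(query_vec, dtype=np.float32)
        sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-10)
        # Return the n highest-scoring (id, score) pairs
        order = np.argsort(-sims)[:n]
        return [(ids[i], float(sims[i])) for i in order]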


For 1000 points, brute force is super quick. Actually, up to 100k (on my machine), brute force takes less than 1 second.


> jq '[.[] | {id: .id, title: .title}]'

Can be simplified to:

    jq 'map({id, title})'


Just have to say, I’m enjoying playing with llm so much. Being able to easily play with this technology at the command line feels like magic. Thanks for the awesome work!


Curious to know what value you've seen out of these clusters. In my experience, k-means clustering was very lackluster. Having to define the number of clusters was a big pain point too.

You almost certainly want a graph-like structure (overlapping communities rather than clusters).

But unsupervised clustering was almost entirely ineffective for every use case I had :/


I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.

I mainly like it as another example of the kind of things you can use embeddings for.

My implementation is very naive - it's just this:

    import sklearn.cluster
    sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")
I imagine there are all kinds of improvements that could be made to this kind of thing.

I'd love to understand if there's a good way to automatically pick an interesting number of clusters, as opposed to picking a number at the start.

https://github.com/simonw/llm-cluster/blob/main/llm_cluster....


There are iterative methods for optimizing the number of clusters in k-means (silhouette and knee/elbow are common), but in practice I prefer density-based methods like HDBSCAN and OPTICS. There's a very basic visual comparison at https://scikit-learn.org/stable/auto_examples/cluster/plot_c....
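
As a concrete example of the silhouette approach, here's a rough sketch that sweeps a range of k values and keeps the best-scoring one, assuming `embeddings` is an array of vectors (the helper name is hypothetical):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import silhouette_score
    def pick_k(embeddings, k_range=range(2, 20)):
        best_k, best_score = None, -1.0
        for k in k_range:
            # Cluster with this candidate k and score how well-separated the clusters are
            labels = MiniBatchKMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)
            score = silhouette_score(embeddings, labels)
            if score > best_score:
                best_k, best_score = k, score
        return best_k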


You could also use a Bayesian version of kmeans. It applies a Dirichlet process as a prior to an infinite (truncated) set of clusters such that the most probable number k is automatically found. I found one implementation here: https://github.com/vsmolyakov/DP_means

Alternatively, there is a Bayesian GMM in sklearn. When you restrict it to diagonal covariance matrices, you should be fine in high dimensions.
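
A rough sketch of the sklearn variant, assuming `embeddings` is an array of vectors; with a Dirichlet-process prior you only set an upper bound on the number of components and the unneeded ones end up with weights near zero:

    from sklearn.mixture import BayesianGaussianMixture
    # n_components is just an upper bound; the Dirichlet process prior
    # shrinks the weights of components it doesn't need towards zero
    bgm = BayesianGaussianMixture(
        n_components=50,
        covariance_type="diag",
        weight_concentration_prior_type="dirichlet_process",
    )
    labels = bgm.fit_predict(embeddings)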


Having close centers might help with the labeling. Let me know if I can help.


Switch to using HDBSCAN. It's good.


The elbow method is a good place to start for finding the number of clusters.


That's a useful hint, thanks. I fed it through GPT-4 and got some interesting leads: https://chat.openai.com/share/400f76ae-b53b-4d07-ac31-adcef2... and https://chat.openai.com/share/48650db8-5a29-49c5-84b2-574f53...


Use bottom-up clustering; you get the whole tree. See fclusterdata in scipy.
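
Something like this, if I understand the suggestion, assuming `embeddings` is a NumPy array; fclusterdata builds the full hierarchy and then cuts it into flat clusters:

    from scipy.cluster.hierarchy import fclusterdata
    # Build the full bottom-up (agglomerative) tree, then cut it into at most 10 flat clusters
    labels = fclusterdata(embeddings, t=10, criterion="maxclust", method="ward")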


Well, this is timely: just yesterday I was needing a tool that would easily do embeddings for me so I could do RAG. LLM seems ideal, thanks Simon!

One question: I saw a comment here that doing RAG efficiently entails some more trickery, like chunking the documents before embedding them. In your experience, is stuff like that necessary, or do I just pass the returned documents to GPT-4 and that's it for my RAG?

For an example of what I'm doing now, I bought an ESP32-Box (think basically an OSS Amazon Echo) and want to ask it questions about my (Markdown) notes. What would be the easiest way to do that?


I'm still trying to figure out the answer to that question myself.

The absolute easiest approach right now is to use Claude, since it has a 100,000 token limit - so you can stuff a ton of documentation into it at once and start asking questions.

Doing RAG with smaller models requires much more cleverness, which I'm only just starting to explore.


That's fair, thanks! Do you plan to integrate the cleverness into LLM, so we can benefit from it too? I'm not sure if LLM can be used as a library; currently I've only been using it as a CLI, but it would be great if I could use it in my programs without shelling out.


Yes, all of this stuff will end up in LLM - either in core or as a plugin for it.

LLM works as a library already, but there's definitely room for improvement there:

https://llm.datasette.io/en/stable/python-api.html

https://llm.datasette.io/en/stable/embeddings/python-api.htm...


This is great news, thanks!


Still pondering using embeddings for classification. Yes, we can group similar items with embeddings through clustering, but how do you extract a label for the groups?

What I've come up with is either a) ask an LLM for the common label from samples of the grouped set after indexing (what keyterm best describes the relationship between these documents?), or b) determine the label (or keyterms) while indexing (by having the LLM find the keyterms ahead of time), then use set overlap on the grouped set's keyterms afterwards to determine a label for the group.
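
A toy sketch of option b), assuming each document in a cluster already has a set of keyterms extracted at indexing time (the keyterm_sets name and the simple counting heuristic are mine, not from any particular library):

    from collections import Counter
    def label_for_cluster(keyterm_sets, top_n=3):
        # keyterm_sets: one set of keyterms per document in the cluster
        counts = Counter(term for terms in keyterm_sets for term in terms)
        # The terms shared by the most documents become the cluster's label
        return ", ".join(term for term, _ in counts.most_common(top_n))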


You seem to be looking for a topic model. BERTopic might help: https://maartengr.github.io/BERTopic/index.html#quick-start


This is great...thank you!


This is a fantastic library. I plan to use some of the search functionality with a system that tries to figure out how to manipulate/work with/add features to existing code.


Really thrilled to see such a simple handle for complex LLM tools.

I wonder how well it would work for building an "apropos" tool for searching man pages (and infotext)?


This looks absolutely awesome. Does this handle prompt/instruct format using the plugins? It's been the biggest pain point for me using llama.cpp directly.


I'm still iterating on that. Plugins get complete control over the prompts, so they can handle the various weirdnesses of them. Here's some relevant code:

https://github.com/simonw/llm-gpt4all/blob/0046e2bf5d0a9c369...

https://github.com/simonw/llm-mlc/blob/b05eec9ba008e700ecc42...

https://github.com/simonw/llm-llama-cpp/blob/29ee8d239f5cfbf...

I'm not completely happy with this yet. Part of the problem is that different models on the same architecture may have completely different prompting styles.

I expect I'll eventually evolve the plugins to allow them to be configured in an easier and more flexible way. Ideally I'd like you to be able to run new models on existing architectures using an existing plugin.


Simon, you are a force of nature!


> # Generate and store embeddings for every README.md in your home directory, recursively

Do I have to run it against my own corpus? Are there "standard" embeddings that many people use that I could use with llm?


I found this project recently which is really interesting: https://alex.macrocosm.so/download

They've embedded all papers on Arxiv and made the results available for anyone to use.

I believe they used the Instructor XL model - I need to build an LLM plugin for that.


Is there a natural embedding for text? Why do we need to make up random extremely precise numbers and optimize them expensively and keep that thing hanging around?


What is the benefit of this lib over hf transformers?


LLM is to transformers what the Django ORM or SQLAlchemy is to the Python psycopg2 and MySQLdb client libraries.

It's a higher-level abstraction that runs on top of multiple libraries like HF transformers, in some cases calling plugins that use transformers directly.


I would change the title to:

    Python Library "llm" now provides tools for working with embeddings
I was initially trying to parse that, thinking "is this an OpenAI thing?". Of course the answer is just a click away, but people might miss this if they are interested in Python coding and AI.


It's not just a Python library though: it's also a CLI tool.

I put a bunch of work into getting it into Homebrew so that people who aren't Python developers can "brew install llm" and start using it.

Details on the CLI here: https://llm.datasette.io/en/stable/usage.html and https://llm.datasette.io/en/stable/embeddings/cli.html


Personally, I think you should change the name; it's way too general. Or at least refer to it as llm.py.


It's meant to be general - the (ambitious) scope of the tool is "run any large language model on your own device, as a CLI or Python library or Web UI (still in development)".

The fact that it's written in Python is, I think, one of the least interesting aspects of the project.


Sure, I understand the reluctance to associate too closely with Python, especially if it's a CLI+lib and Python is otherwise an implementation detail. But isn't it similar to a Web framework calling itself `web`? That would be confusing, which is probably why Python has `web.py` :)

It's a great name from a "glad you got there first" perspective but it's also so general as to be ungoogleable. Like if I want to find documentation for this library/CLI in Google, what would I search? I'd probably end up putting "Python" in the query just to disambiguate it from all the results about LLMs in general. So IMO you may as well include "py" (or some other disambiguator) in the name, since people are going to need to include it in their search queries anyway.


I got it in the two namespaces that matter most to me - https://pypi.org/project/llm/ and https://formulae.brew.sh/formula/llm

I'm taking a bit of a risk here, but I actually think the acronym is still obscure enough that I'm in with a chance. I'm on the second page of Google already.


OK, we've put Python library up there.


Looks like you missed my reply by seconds pointing out that it's not just a Python library, it's also a CLI tool: https://news.ycombinator.com/item?id=37385788


Aah! Sorry about that both of you. I didn't think dang would see this and simon would update the title and sanity check it.


Ok - but we need some qualifier. If the title just says "LLM" that's far too general. Any suggestions?


How about "LLM CLI tool and Python library adds support for embeddings"?


Ok, I've shortened that a bit and put it up there.


Thanks!



