
Easiest example is taking three words: Universe, University, College.

- University and Universe are similar in spelling (they sort next to each other alphabetically).

- University and College are similar in meaning.

Take embeddings for those three words and `University` will be near `College`, while `Universe` will be further away, because embeddings capture meaning:

University<-->College<-------------->Universe


With old-school keyword search you'd need to special-case University and College as similar terms, but embeddings already handle it.

With embeddings you can compute how similar two results are from how close their vectors are (typically via cosine similarity or a dot product). The closer the embeddings, the closer the meaning.
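A minimal sketch of that comparison, using the sentence-transformers library (the model name is just an illustrative choice; any text-embedding model behaves the same way):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

words = ["University", "College", "Universe"]
vectors = model.encode(words)  # one embedding vector per word

# Cosine similarity: closer to 1.0 means closer in meaning.
print("University vs College :", util.cos_sim(vectors[0], vectors[1]).item())
print("University vs Universe:", util.cos_sim(vectors[0], vectors[2]).item())
```

With most models the University/College pair scores noticeably higher than University/Universe.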



Another interesting point is that math can be performed on embedding vectors: emb("king") - emb("man") + emb("woman") = emb("queen").
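For what it's worth, the classic way to reproduce that analogy is gensim's `most_similar`, which performs exactly this add-and-subtract on the vectors (the pretrained model name below is one common choice, and it's a large download):

```python
import gensim.downloader as api

# Pretrained Word2Vec vectors trained on Google News.
w2v = api.load("word2vec-google-news-300")

# emb("king") - emb("man") + emb("woman") ~= ?
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically the top hit.
```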


That's a property of Word2Vec specifically, due to how it's trained (a shallow network where most of the "logic" ends up contained in the embeddings themselves). Trying the same arithmetic on embeddings generated by LLMs or embedding layers won't give results that are as fun; in practice the only things you can do are average or cluster them.
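For example, a common pattern with modern embedding models is to average sentence embeddings into a single document vector, or to cluster them. A rough sketch (the model choice and cluster count are arbitrary here):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The universe is expanding.",
    "Galaxies drift apart over time.",
    "The university opened a new college of engineering.",
    "Students enrolled at the college this fall.",
]
vectors = model.encode(sentences, normalize_embeddings=True)

# Averaging: one crude "document" vector for the whole set.
doc_vector = vectors.mean(axis=0)

# Clustering: group the sentences by topic (two clusters assumed).
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)
print(labels)  # e.g. [0, 0, 1, 1] -- astronomy vs. education
```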


> That's a property of Word2Vec specifically due to how it's trained (a shallow network where most of the "logic" would be contained within the embeddings themselves).

Is it, though? I thought LLM-based embeddings are even more fun for this, since you have many more interesting directions to move in. I.e. not just:

emb("king") - emb("man") + emb("woman") = emb("queen")

But also e.g.:

emb(<insert a couple-paragraph-long positive book review>) + a*v(sad) + b*v(short) - c*v(positive) = emb(<a single-paragraph, negative and depressing review>)

Where a, b, c are some constants to tweak, and v(X) is a vector for quality X, which you can get by embedding a bunch of texts expressing quality X and averaging them out (or doing some other dimensionality-reduction trickery).
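A rough sketch of building such a v(X) by averaging embeddings (the model and the 0.8 coefficient are arbitrary assumptions; whether the shifted vector decodes back into coherent text depends entirely on the model, as discussed below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def direction(examples, counter_examples):
    """Average embedding of texts with a quality, minus the average without it."""
    pos = model.encode(examples).mean(axis=0)
    neg = model.encode(counter_examples).mean(axis=0)
    return pos - neg

v_sad = direction(
    ["This is heartbreaking.", "I felt hollow and miserable."],
    ["This is delightful.", "I felt light and happy."],
)

review = "A warm, generous novel that left me grinning for days."
shifted = model.encode([review])[0] + 0.8 * v_sad  # a = 0.8, tweak to taste

# `shifted` is just a vector; turning it back into text needs an
# encoder-decoder embedding model (see the talk in [0]).
print(shifted.shape)
```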

I suggested this on HN some time ago, but was only told that I'm confused and the idea is not even wrong. But then there was a talk at an AI conference recently[0], where the speaker demonstrated exactly this kind of latent-space translation of text in a language model.

--

[0] - https://www.youtube.com/watch?v=veShHxQYPzo&t=13980s - "The Hidden Life of Embeddings", by Linus Lee from Notion.


That talk used a novel embedding model trained by the speaker which does exhibit this kind of property - but that was a new (extremely cool) thing, not something other embedding models can do.


Interesting video. When he says "we decode the embedding", does he essentially mean that he is searching a vector database or something else?


The model is an encoder-decoder, which encodes some text into a latent embedding, and can then decode it back into text. It’s a feature of the model itself.
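Conceptually it's an autoencoder-style round trip. A sketch of the interface involved (this is a hypothetical Protocol for illustration, not the actual API of the model from the talk):

```python
from typing import Protocol
import numpy as np

class LatentTextModel(Protocol):
    """Hypothetical encode/decode interface, for illustration only."""
    def encode(self, text: str) -> np.ndarray: ...   # text -> latent embedding
    def decode(self, z: np.ndarray) -> str: ...      # latent embedding -> text

# Intended usage, assuming some `model` implementing the protocol:
#   z = model.encode("A glowing, upbeat review of the book.")
#   z = z + 0.8 * v_sad          # edit the vector in latent space
#   print(model.decode(z))       # text regenerated from the edited embedding
```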



