I'm actually genuinely interested to know what you're referring to, because as an I.T. professional who's doing their best to keep up with how things work, that's more or less how I understood it as well. When words are mapped to a continuous vector space, placing semantically related words near one another gives their vector coordinates similar values (see the excerpt from IBM below).
However, I really don't understand how that necessarily enables one to perform arithmetic on two sets of vector coordinates and expect the result to be something meaningfully related to the original two words. I understand how using a model to create embeddings with semantically correlated values can be achieved, and why that would be so fundamental for LLMs. My math skills aren't advanced enough to confidently land on either side of this question, but my instinct is that such an elegant relationship would be unlikely. Then again, mathematics is replete with counterintuitive but elegantly clever connections, so I could absolutely understand why this is eminently believable, especially in the context of AI and language models.
According to the descriptions from IBM [0]:
> Word embeddings capture the semantic relationships and contextual meanings of words based on their usage patterns in a given language corpus. Each word is represented as a fixed-sized dense vector of real numbers. It is the opposite of a sparse vector, such as one-hot encoding, which has many zero entries.
> The use of word embedding has significantly improved the performance of natural language processing (NLP) models by providing a more meaningful and efficient representation of words. These embeddings enable machines to understand and process language in a way that captures semantic nuances and contextual relationships, making them valuable for a wide range of applications, including sentiment analysis, machine translation and information retrieval.
> Popular word embedding models include Word2Vec, GloVe (Global Vectors for Word Representation), FastText and embeddings derived from transformer-based models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
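For what it's worth, here's a tiny numpy sketch of the sparse-vs-dense distinction the IBM excerpt is drawing. The vocabulary and the embedding matrix below are made up purely for illustration; in a real model the matrix is learned from usage patterns in a corpus.

```python
import numpy as np

# Toy vocabulary; a real corpus would have tens of thousands of entries.
vocab = ["king", "queen", "man", "woman", "apple"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Sparse representation: one-hot encoding, almost all zeros.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Dense representation: each word is a row of an embedding matrix.
# Random here for illustration; in practice it is learned from co-occurrence data.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))  # 4 dimensions for readability

def embed(word):
    return embedding_matrix[word_to_id[word]]

print(one_hot("king"))  # [1. 0. 0. 0. 0.], dimension equals vocabulary size
print(embed("king"))    # a dense 4-dimensional vector of real numbers
```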
That's my understanding of how embeddings work as well. The dot product (or, once normalised, the cosine similarity) of man · woman ends up being very similar to that of king · queen. This similarity could be considered subtraction if you stretch the metaphor enough.
If this is not a valid way to think about embedding vectors, would you care to elaborate?
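To make the comparison concrete, here's a rough sketch using cosine similarity on made-up 3-dimensional stand-in vectors (real word2vec/GloVe vectors are typically 100-300 dimensional, and the numbers below are constructed so the pattern is visible):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalised by the vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up stand-in vectors, constructed so the pattern shows up.
man   = np.array([0.9, 0.1, 0.4])
woman = np.array([0.8, 0.2, 0.9])
king  = np.array([0.9, 0.8, 0.4])
queen = np.array([0.8, 0.9, 0.9])

print(cosine_similarity(man, woman))    # similarity within the "person" pair
print(cosine_similarity(king, queen))   # similarity within the "royalty" pair

# The "subtraction" framing: the offsets king - man and queen - woman
# point in (nearly) the same direction; exactly so here, by construction.
print(cosine_similarity(king - man, queen - woman))
```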
Embedding vectors are what you get when you take a high-dimensionality vector of word IDs and reduce its dimensionality to something more manageable. (Think something like principal component analysis or singular value decomposition.)
The "similarity" here means word IDs that commonly occur together in those vectors.
There is no logical reasoning or attempts at semantic analysis here.
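Roughly like this. A minimal sketch of that SVD-style construction, with an invented co-occurrence matrix standing in for real corpus counts:

```python
import numpy as np

# Invented word-word co-occurrence counts (rows and columns share one vocabulary).
vocab = ["king", "queen", "man", "woman", "crown"]
cooc = np.array([
    [ 0, 12,  9,  2, 15],
    [12,  0,  2,  8, 14],
    [ 9,  2,  0, 20,  1],
    [ 2,  8, 20,  0,  1],
    [15, 14,  1,  1,  0],
], dtype=float)

# Truncated SVD: keep the top-k components as low-dimensional word vectors.
U, S, Vt = np.linalg.svd(cooc)
k = 2
embeddings = U[:, :k] * S[:k]  # each row is now a 2-dimensional embedding

for word, vec in zip(vocab, embeddings):
    print(word, vec)
```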
> The dot products (cosine similarity) of man . woman end up being very similar to king . queen
This is because 'king' and 'man' occur together in a distribution similar to that of 'queen' and 'woman'.
The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend. Embeddings don't carry semantic meanings, they aren't dictionaries or encyclopedias. They are only statistical features about word co-occurrences.
This blog post [0] thinks that it is indeed possible to do arithmetic operations on them. My intuition is that they're vectors after all, and can be added and subtracted like any other vector. A word is just a location in that high-dimensional vector space.
EDIT: I guess there are different forms of word embeddings, and apparently modern LLMs don't use static word embeddings like word2vec; their embeddings are contextual instead. Tokens aren't 1:1 with words either, of course. I guess it's more complex than "LLMs represent words as vectors". Still, it's a neat trick, and it is indeed that simple with something like word2vec, as in the sketch below.
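For anyone who wants to try it themselves, here's a sketch using gensim's downloader and pretrained GloVe vectors (the specific model name and the download on first run are assumptions about your setup; any static embedding set works the same way). This is the classic analogy test from the original word2vec papers:

```python
# Requires gensim (pip install gensim); the vectors are downloaded on first run.
import gensim.downloader as api

# Static word embeddings, as discussed above; LLMs use contextual embeddings instead.
model = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman": most_similar does the vector arithmetic and returns
# the nearest neighbours by cosine similarity, excluding the input words.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```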
> The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend.
And I will: that passage is intentionally written to mislead in a self-serving way. If you know the math behind it, you understand that it's technically correct, but a layman, or anyone giving it a cursory reading, will come away with a wildly incorrect idea.
It's above my pay grade for sure, but you can get in touch with him at Microsoft Research, the Royal Academy of Engineering, or the Royal Society, where he holds fellowships. Might want to cc: Yoshua Bengio, who seems to be laboring under similar misapprehensions.
The idea that you can supposedly take "king", subtract "man", and add "woman" to get "queen", and that this kind of thing is the mathematical basis of LLMs.
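Mechanically, the claim amounts to something like this: do the vector arithmetic, then look for the nearest vocabulary vector by cosine similarity while excluding the query words. The toy vectors below are constructed so the analogy works out exactly; with real embeddings the result is only approximate:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target, vectors, exclude):
    # Rank vocabulary vectors by cosine similarity to the target, skipping the query words.
    scores = {w: cos(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# Toy vectors, constructed so the analogy holds exactly.
vectors = {
    "king":  np.array([0.9, 0.8, 0.4]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.9, 0.1, 0.4]),
    "woman": np.array([0.8, 0.2, 0.9]),
    "crown": np.array([0.5, 0.7, 0.1]),
}

target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # -> queen
```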