I'm actually genuinely interested to know what you're referring to, because as an I.T. professional who's doing their best to keep up with how things work, that's more or less how I understood it as well. When words are mapped to a continuous vector space, placing semantically related words near one another gives their vector coordinates similar values (see the excerpt from IBM below).
However, I really don't understand how that necessarily enables one to perform arithmetic on two sets of vector coordinates and expect the result to be something meaningfully related to the original two words. I understand how using a model to create embeddings with semantically correlated values can be achieved, and why that would be so fundamental for LLMs. My math skills aren't advanced enough to confidently land on either side of this question, but my instinct is that such an elegant relationship would be unlikely. Then again, mathematics is replete with counterintuitive but elegantly clever connections, so I could absolutely understand why this is eminently believable, especially in the context of AI and language models.
According to the descriptions from IBM [0]:
> Word embeddings capture the semantic relationships and contextual meanings of words based on their usage patterns in a given language corpus. Each word is represented as a fixed-sized dense vector of real numbers. It is the opposite of a sparse vector, such as one-hot encoding, which has many zero entries.
> The use of word embedding has significantly improved the performance of natural language processing (NLP) models by providing a more meaningful and efficient representation of words. These embeddings enable machines to understand and process language in a way that captures semantic nuances and contextual relationships, making them valuable for a wide range of applications, including sentiment analysis, machine translation and information retrieval.
> Popular word embedding models include Word2Vec, GloVe (Global Vectors for Word Representation), FastText and embeddings derived from transformer-based models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
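For what it's worth, here's a tiny numpy sketch of the sparse-vs-dense distinction the IBM excerpt is drawing. The vocabulary and the embedding matrix below are made up purely for illustration; in a real model the matrix is learned from usage patterns in a corpus.

```python
import numpy as np

# Toy vocabulary; a real corpus would have tens of thousands of entries.
vocab = ["king", "queen", "man", "woman", "apple"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Sparse representation: one-hot encoding, almost all zeros.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Dense representation: each word is a row of an embedding matrix.
# Random here for illustration; in practice it is learned from co-occurrence data.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))  # 4 dimensions for readability

def embed(word):
    return embedding_matrix[word_to_id[word]]

print(one_hot("king"))  # [1. 0. 0. 0. 0.], dimension equals vocabulary size
print(embed("king"))    # a dense 4-dimensional vector of real numbers
```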
That's my understanding of how embeddings work as well. The dot product (or, once normalised, the cosine similarity) of man · woman ends up being very similar to that of king · queen. This similarity could be considered subtraction if you stretch the metaphor enough.
If this is not a valid way to think about embedding vectors, would you care to elaborate?
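To make the comparison concrete, here's a rough sketch using cosine similarity on made-up 3-dimensional stand-in vectors (real word2vec/GloVe vectors are typically 100-300 dimensional, and the numbers below are constructed so the pattern is visible):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalised by the vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up stand-in vectors, constructed so the pattern shows up.
man   = np.array([0.9, 0.1, 0.4])
woman = np.array([0.8, 0.2, 0.9])
king  = np.array([0.9, 0.8, 0.4])
queen = np.array([0.8, 0.9, 0.9])

print(cosine_similarity(man, woman))    # similarity within the "person" pair
print(cosine_similarity(king, queen))   # similarity within the "royalty" pair

# The "subtraction" framing: the offsets king - man and queen - woman
# point in (nearly) the same direction; exactly so here, by construction.
print(cosine_similarity(king - man, queen - woman))
```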
Embedding vectors are what you get when you take a high-dimensionality vector of word IDs and reduce its dimensionality to something more manageable. (Think something like principal component analysis or singular value decomposition.)
The "similarity" here means word IDs that commonly occur together in those vectors.
There is no logical reasoning or attempts at semantic analysis here.
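Roughly like this. A minimal sketch of that SVD-style construction, with an invented co-occurrence matrix standing in for real corpus counts:

```python
import numpy as np

# Invented word-word co-occurrence counts (rows and columns share one vocabulary).
vocab = ["king", "queen", "man", "woman", "crown"]
cooc = np.array([
    [ 0, 12,  9,  2, 15],
    [12,  0,  2,  8, 14],
    [ 9,  2,  0, 20,  1],
    [ 2,  8, 20,  0,  1],
    [15, 14,  1,  1,  0],
], dtype=float)

# Truncated SVD: keep the top-k components as low-dimensional word vectors.
U, S, Vt = np.linalg.svd(cooc)
k = 2
embeddings = U[:, :k] * S[:k]  # each row is now a 2-dimensional embedding

for word, vec in zip(vocab, embeddings):
    print(word, vec)
```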
> The dot products (cosine similarity) of man . woman end up being very similar to king . queen
This is because 'king' and 'man' occur together in a distribution similar to that of 'queen' and 'woman'.
The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend. Embeddings don't carry semantic meanings, they aren't dictionaries or encyclopedias. They are only statistical features about word co-occurrences.
This blog post [0] thinks that it is indeed possible to do arithmetic operations on them. My intuition is that they're vectors after all, and can be added and subtracted like any other vector. A word is just a location in that high-dimensional vector space.
EDIT: I guess there are different forms of word embeddings, and apparently modern LLMs don't use static word embeddings like word2vec; their embeddings are contextual instead. Tokens aren't 1:1 with words either, of course. I guess it's more complex than "LLMs represent words as vectors". Still, it's a neat trick, and it is indeed that simple with something like word2vec, as in the sketch below.
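For anyone who wants to try it themselves, here's a sketch using gensim's downloader and pretrained GloVe vectors (the specific model name and the download on first run are assumptions about your setup; any static embedding set works the same way). This is the classic analogy test from the original word2vec papers:

```python
# Requires gensim (pip install gensim); the vectors are downloaded on first run.
import gensim.downloader as api

# Static word embeddings, as discussed above; LLMs use contextual embeddings instead.
model = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman": most_similar does the vector arithmetic and returns
# the nearest neighbours by cosine similarity, excluding the input words.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```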
> The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend.
And I will: that passage is intentionally written to mislead in a self-serving way. If you know the math behind it, you understand that it's technically correct, but a layman, or anyone giving it a cursory reading, will come away with a wildly incorrect idea.
It's above my pay grade for sure, but you can get in touch with him at Microsoft Research, the Royal Academy of Engineering, or the Royal Society, where he holds fellowships. Might want to cc: Yoshua Bengio, who seems to be laboring under similar misapprehensions.
The idea that you can supposedly take "king", subtract "man", and add "woman" to get "queen", and that this kind of thing is the mathematical basis of LLMs.
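Mechanically, the claim amounts to something like this: do the vector arithmetic, then look for the nearest vocabulary vector by cosine similarity while excluding the query words. The toy vectors below are constructed so the analogy works out exactly; with real embeddings the result is only approximate:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target, vectors, exclude):
    # Rank vocabulary vectors by cosine similarity to the target, skipping the query words.
    scores = {w: cos(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# Toy vectors, constructed so the analogy holds exactly.
vectors = {
    "king":  np.array([0.9, 0.8, 0.4]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.9, 0.1, 0.4]),
    "woman": np.array([0.8, 0.2, 0.9]),
    "crown": np.array([0.5, 0.7, 0.1]),
}

target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # -> queen
```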