The most significant bits of a floating point representation are basically a logarithm of the number. Logarithms have this relation between multiplication and addition.
> You can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this will actually be much better and get within 7.5% of the right answer. Why?
You can multiply two numbers by multiplying their mantissas and adding their exponents.
Since the numbers are stored in base 2, the mantissa values range from 1 to 2 (subnormals notwithstanding), but they're effectively stored as a 0..1 value. The effective ranges produced by adding vs. multiplying them have considerable overlap, and on average the difference won't be much. Because of how the mantissa and exponent fields are arranged (a deliberate choice by the standard, AFAIK), the mantissa effectively adds additional bits to the approximation of the logarithm given by the exponent.
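Here's a quick sketch of that bit-level view (my own illustration, not from the parent comment): reinterpreting a float's bits as an integer gives a scaled, biased approximation of its log2, so adding two bit patterns (and subtracting the bias once) approximately multiplies the floats, with errors in the ballpark of the ~7.5% mentioned above.

```python
import struct

def float_to_bits(x: float) -> int:
    # Reinterpret the bits of a 32-bit float as an unsigned integer.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

# IEEE 754 single precision: exponent bias 127, mantissa width 23 bits.
BIAS = 127 << 23

def approx_multiply(a: float, b: float) -> float:
    # Adding the raw bit patterns adds the exponents (the integer part of
    # the logarithm) and the mantissas; subtracting BIAS once corrects for
    # the exponent bias being counted twice.
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - BIAS)

for a, b in [(3.0, 5.0), (1.7, 2.3), (10.0, 0.125)]:
    approx, exact = approx_multiply(a, b), a * b
    print(f"{a} * {b} ~ {approx:.4f} (exact {exact}, "
          f"error {abs(approx - exact) / exact:.2%})")
```

Overflow from the mantissa field carries straight into the exponent field, which is exactly the "deliberate arrangement" doing its job.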
To give a "yes and" side-track to your comment: saying "logarithms have this relation between multiplication and addition" is even underselling what logarithms are, because reducing multiplication to an additive operation was the whole motivation for John Napier[0] to discover/invent logarithms:
> “…nothing is more tedious, fellow mathematicians, in the practice of the mathematical arts, than the great delays suffered in the tedium of lengthy multiplications and divisions, the finding of ratios, and in the extraction of square and cube roots… [with] the many slippery errors that can arise…I have found an amazing way of shortening the proceedings [in which]… all the numbers associated with the multiplications, and divisions of numbers, and with the long arduous tasks of extracting square and cube roots are themselves rejected from the work, and in their place other numbers are substituted, which perform the tasks of these rejected by means of addition, subtraction, and division by two or three only.”[1]
Logarithms were honestly an enormous breakthrough in optimization; computers wouldn't be remotely as useful without them, even if most of us don't "see" the logarithms being used.
In fact I'd argue that they are the second-biggest computational optimization in use today, with only positional notation being a bigger deal. Which, funnily enough, works kind of similarly: imagine you could only count by tallying (so, unary). Adding two numbers M and N would take M+N operations, e.g. 1234 + 5678 would require counting out all 6,912 individual tally marks. Unary math scales O(n) in both data and computation. Systems like Roman numerals almost work, but as soon as we reach values larger than the largest symbol (M, for 1,000) it's O(n) again, just with a better constant factor.
With positional notation, numbers require only log(n) symbols to write down and log(n) operations to add, e.g. 1234 + 5678 requires one or two additions for each digit pair in a given position: one addition if there's no carry from the previous position, two if there is. So addition takes at most 2 × ceil(max(log(M), log(N))) operations, i.e. O(log n).
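To make that operation count concrete, here's a toy sketch (my own, with a hypothetical op counter) of grade-school addition:

```python
def add_positional(m: int, n: int, base: int = 10) -> tuple[int, int]:
    """Grade-school addition, counting single-digit operations.

    Returns (sum, number of digit-level additions performed).
    """
    total, carry, ops, place = 0, 0, 0, 1
    while m or n or carry:
        d = (m % base) + (n % base)  # one digit addition per position
        ops += 1
        if carry:
            d += carry               # a second addition only when carrying
            ops += 1
        carry, digit = divmod(d, base)
        total += digit * place
        place *= base
        m //= base
        n //= base
    return total, ops

print(add_positional(1234, 5678))  # (6912, 6): well under 2 * 4 digit ops
```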
Logarithms take that idea and "recursively" apply it to the notation, making the same optimization work for multiplication. Without it, the naive algorithm for the multiplication of two numbers requires iterating over each digit, e.g. 1234 × 5678 requires multiplying each of the four digits of the first number with each of the digits of the second number, and then adding all the resulting numbers. It scales O(di×dj), where di and dj are the number of digits of each number. If they're the same we can simplify that to O(d²). When the numbers are represented as two logarithms the operation is reduced to adding two numbers again, so O(log(d) + [whatever the log/inverse log conversion cost is]). Of course d is a different value here, and the number of digits used affects the precision.
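As a minimal sketch of that reduction (using the built-in log/exp in place of Napier's tables):

```python
import math

def multiply_via_logs(a: float, b: float) -> float:
    # The multiplication itself becomes a single addition in log space;
    # all the remaining cost sits in the log/exp conversions.
    return math.exp(math.log(a) + math.log(b))

print(multiply_via_logs(1234, 5678))  # ~7006652.0 (exact: 7006652)
```

Division works the same way, with a subtraction in log space instead.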
I think the craziest thing of all this is that we're so used to positional notation that nobody ever seems to consider it a data compression technique. Even though almost no other data compression method would work without it as a building block (run-length encoding, Lempel-Ziv, Arithmetic coding? Useless without positional notation's O(log(n)) scaling factor). The only exceptions are data compression methods that are based on inventing their own numerical notation[2].
We do this every day, ever since we first learned addition and subtraction as kids. Or as David Bessis[3] puts it in his book "Mathematica": ask almost any adult what one billion minus one is and they know the answer instantaneously, so most adults would appear to have mental superpowers in the eyes of pretty much all mathematicians who lived before positional notation was invented (well, everyone except Archimedes maybe[4]). Positional notation is magical, we're all math wizards, and it's so normalized that we don't even realize it.
But to get back to your original point: yes, you are entirely correct. IEEE floats are a form of lossy compression of fractions, and the basis of that lossy compression is logarithmic notation (but with a fixed number of binary digits and some curious rules for encoding other values like NaN and infinity).
As a historical tidbit I'll add that the Romans did develop two ways to write larger numbers.
1. Writing a line (vinculum) over a numeral to multiply its value by 1,000. This was in fact extended to writing a line to the left and above a numeral to multiply its value by 1,000,000, and could in principle be extended to lines below and to the right to multiply by 10^9 and 10^12, and even nested boxes for larger powers.
2. The use of |), |)), |))), ... for 500, 5,000, 50,000, ... and (|), ((|)), (((|))), ... for 1,000, 10,000, 100,000, ... These can be continued indefinitely.
Both require an ever-increasing number of marks just to write the increasing powers, as well as an ever-increasing number of powers being summed, but both grow only logarithmically, so we end up using O((log n)²) marks to write n. This is quadratically worse than positional notation, but exponentially better than just writing M over and over.
Symbols like that can represent all sorts of things, and numbers are no exception! In fact the Romans had been using other letters to represent their numbers since Etruscan times. Wait until you hear the torrid history of the character V (and its "misrepresentation" U).
And if I may take this yet a step further: this is the whole point of mathematical transformations, to find a domain in which the desirable operations are so much easier that they offset the cost of the round trip into and out of that domain.
In the past, transformations like logarithms, Fourier transforms, and wavelets had to be proposed by hand. Enabled by advances in computers, machine learning automated much of that away by composing differentiable building blocks into universal estimators. The parameters of these building blocks are estimated through gradient descent in conjunction with a user-chosen loss function, which guides the optimization of the transform. Good representations can be manipulated through basic algebra (added, averaged, or compared for similarity, depending on the task) in a way that corresponds to semantic operations, where their raw, untransformed counterparts cannot.
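A minimal sketch of that recipe, with toy data, a single linear building block, and mean squared error as the user-chosen loss (all assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: recover an unknown linear transform from examples.
X = rng.normal(size=(256, 8))     # raw, untransformed inputs
true_W = rng.normal(size=(8, 2))  # the transform we hope to learn
Y = X @ true_W                    # targets defining the task

W = np.zeros((8, 2))              # learnable parameters
lr = 0.1
for _ in range(500):
    err = X @ W - Y               # forward pass through the building block
    grad = X.T @ err / len(X)     # gradient of the mean squared error loss
    W -= lr * grad                # one gradient descent step

print(np.abs(W - true_W).max())   # ~0: the transform has been estimated
```

Real models stack many nonlinear blocks, but the loop (forward pass, loss gradient, parameter update) is the same.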
I'm actually genuinely interested to know what you're referring to, because as an I.T. professional who's doing their best to keep up with how things work, that's more or less how I understood it as well. When words are mapped to a continuous vector space, placing semantically related words near one another gives their vector coordinates similar values (see the excerpt from IBM below).
However, I really don't understand how that necessarily enables one to perform arithmetic functions on two sets of vector coordinates and expect the result to be something tangential to the original two words. I understand how using a model to create embeddings with semantically correlated values can be achieved, and why that would be so fundamental for LLMs. My math skills aren't advanced enough to confidently land on either side of this question, but my instinct is that such an elegant relationship would be unlikely. Then again, mathematics is replete with counterintuitive but elegantly clever connections, so I could absolutely understand why this is eminently believable--especially in the context of AI and language models.
According to the descriptions from IBM [0]:
> Word embeddings capture the semantic relationships and contextual meanings of words based on their usage patterns in a given language corpus. Each word is represented as a fixed-sized dense vector of real numbers. It is the opposite of a sparse vector, such as one-hot encoding, which has many zero entries.
> The use of word embedding has significantly improved the performance of natural language processing (NLP) models by providing a more meaningful and efficient representation of words. These embeddings enable machines to understand and process language in a way that captures semantic nuances and contextual relationships, making them valuable for a wide range of applications, including sentiment analysis, machine translation and information retrieval.
> Popular word embedding models include Word2Vec, GloVe (Global Vectors for Word Representation), FastText and embeddings derived from transformer-based models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
That's my understanding of how embeddings work, as well. The dot products (cosine similarity, once the vectors are normalized) of man · woman end up being very similar to king · queen. This similarity could be considered subtraction if you stretch the metaphor enough.
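To make that concrete with entirely made-up toy vectors (axes chosen by hand so the geometry is visible; real embeddings are learned, not designed like this):

```python
import numpy as np

# Hypothetical 3-d "embeddings", axes roughly [royalty, gender, person-ness].
man   = np.array([0.1, -0.4, 1.0])
woman = np.array([0.1,  0.4, 1.0])
king  = np.array([1.0, -0.4, 1.0])
queen = np.array([1.0,  0.4, 1.0])

def cosine(a, b):
    # Cosine similarity: the dot product of the unit-normalized vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(man, woman), cosine(king, queen))  # ~0.73 vs ~0.85: comparable
print(woman - man, queen - king)                # identical offsets
```

The similarities come out comparable, and the offsets woman - man and queen - king coincide exactly here; that shared offset is the "subtraction" reading of the metaphor.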
If this is not a valid way to think about embedding vectors, would you care to elaborate?
Embedding vectors are what you get when you take a high-dimensional vector of word IDs and reduce its dimensionality to something more manageable. (Think something like principal component analysis or singular value decomposition.)
The "similarity" here means word ID's that commonly occur together in vectors.
There is no logical reasoning or attempts at semantic analysis here.
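A toy sketch of that reduction (hypothetical counts, with SVD standing in for whatever factorization a real model effectively performs):

```python
import numpy as np

# Toy word-by-context count matrix: rows are target words, columns are
# context words they appear next to. "cat" and "dog" share contexts,
# as do "car" and "truck".
#            purr  bark  fur  engine  wheel
counts = np.array([
    [9, 0, 7, 0, 0],   # cat
    [0, 9, 7, 0, 0],   # dog
    [0, 0, 0, 8, 7],   # car
    [0, 0, 0, 7, 8],   # truck
], dtype=float)

# Truncated SVD compresses the co-occurrence statistics into 2 dims.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
emb = U[:, :2] * S[:2]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(emb[0], emb[1]), cos(emb[0], emb[2]))  # ~1.0 vs ~0.0
```

"cat" and "dog" end up near each other purely because their co-occurrence rows look alike; no meaning was consulted at any point.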
> The dot products (cosine similarity) of man . woman end up being very similar to king . queen
This is because 'king' and 'man' occur together in a distribution similar to that of 'queen' and 'woman'.
The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend. Embeddings don't carry semantic meanings, they aren't dictionaries or encyclopedias. They are only statistical features about word co-occurrences.
This blog post [0] argues that it is indeed possible to do arithmetic operations on them. My intuition is that they're vectors after all, and can be added and subtracted like any other vector. A word is just a location in that high-dimensional vector space.
EDIT: I guess there are different forms of word embeddings, and apparently modern LLMs don't use static word embeddings like word2vec; theirs are contextual. Tokens aren't 1:1 with words either, of course. I guess it's more complex than "LLMs represent words as vectors". Still, it's a neat trick, and it is indeed that simple with something like word2vec.
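For what it's worth, the classic word2vec demo really is a few lines with gensim (assuming you've downloaded the pretrained GoogleNews vectors; any word2vec-format file would do):

```python
from gensim.models import KeyedVectors

# The file path is an assumption: the classic pretrained vectors
# published alongside the original word2vec work.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# king - man + woman: plain vector arithmetic, then nearest
# neighbours ranked by cosine similarity.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# With these vectors, 'queen' typically comes out on top (~0.71 similarity).
```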
> The idea that the embedding of 'king' is somehow a sum of 'autarch' and 'man' and that subtracting 'man' from 'king' and adding 'woman' somehow gives you 'queen' is an urban legend.
And I will: that passage is intentionally written to mislead in a self-serving way. If you know the math behind it, you understand that it's technically correct, but a layman or a cursory reading will be left with a wildly incorrect idea.
It's above my pay grade for sure, but you can get in touch with him at either Microsoft Research, the Royal Academy of Engineering, or the Royal Society, where he holds fellowships. Might want to cc: Yoshua Bengio, who seems to be laboring under similar misapprehensions.
I remember the glee I felt when I first used logarithms to represent large numbers in Excel. Suddenly, I could easily approximate large factorials, and they were fast! I wondered how many other people have discovered this trick. I couldn't find many references on the Internet. Over time, I realized this method of handling large numbers is used so often that most people who know about it don't bother to talk about it.
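For anyone curious, the trick is just to keep log10(n!) = log10(2) + ... + log10(n) instead of the product itself. A quick sketch (mine, in Python rather than Excel):

```python
import math

def log10_factorial(n: int) -> float:
    # Keep the logarithm instead of the astronomically large product.
    return sum(math.log10(k) for k in range(2, n + 1))

lg = log10_factorial(1000)
exp10, frac = divmod(lg, 1.0)          # split into exponent and mantissa
print(f"1000! ~ {10 ** frac:.4f}e{int(exp10)}")  # ~4.0239e2567
```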
Logarithms pop up (IIRC) in 10th grade, and ln(x*y) = ln(x) + ln(y) is usually explained as part of that. What many teachers completely fail to teach is that it's not just a formula to memorize, but that it has profound implications for how you can do math. As you discovered for yourself. (Kudos!)
It is, with the right teacher, a super-exciting story with lots of colorful history & characters. You can even guide your students to come to that same crucial insight themselves. Alas, few teachers have the necessary amount of passion to make it come alive.
But that's probably why few people write about it - for most, either their math interest got murdered in HS so they never get to look at it, or they did learn it in 10th grade and so consider it not worth mentioning.
(Corollary: we should all write more about the cool realizations we've had, even if in retrospect they're "well known".)
I must admit I wanted to reply that all of that is true, but that calculating logarithms is difficult. Then I looked it up and found that the slide rule was invented shortly thereafter:
> When the numbers are represented as two logarithms the operation is reduced to adding two numbers again, so O(log(d) + [whatever the log/inverse log conversion cost is]).
This is pretty confused. First, if d is the number of digits of a number, then you will need O(d) memory to store its logarithm, otherwise you will lose a lot of precision. Thus, the addition of logarithms is still O(d), not O(log(d)).
Second, calculating log(x) and its inverse exp(x) is much more expensive than a multiplication. For example, if x is between 0 and 1, you can approximate exp(x) pretty well as 1 + x + x^2/2 + x^3/6 + x^4/24. That's 3 multiplications and 3 divisions just to convert.
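That cost is easy to see in code; a sketch of the 4-term series, evaluated Horner-style so it hits exactly 3 multiplications and 3 divisions:

```python
import math

def exp_taylor4(x: float) -> float:
    # 1 + x + x^2/2 + x^3/6 + x^4/24, rewritten Horner-style:
    # 3 multiplications and 3 divisions, as claimed above.
    return 1 + x * (1 + (x / 2) * (1 + (x / 3) * (1 + x / 4)))

print(exp_taylor4(0.5), math.exp(0.5))  # 1.6484375 vs 1.6487212...
```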
> Thus, the addition of logarithms is still O(d), not O(log(d)).
Oh dear, yes that's wrong, thank you for catching that.
Still, O(d) is a lot better than O(d²). And perhaps more importantly: division (which is even more painful to compute) becomes a simple matter of subtracting the logarithmic transformations and then exponentiating the result.
> Second, calculating log(x) and its inverse, exp(x) is much more expensive than multiplication.
You're not wrong, but you're talking about the costs of a computer doing the calculation. I was talking about a human doing calculations by hand. What I forgot to mention is that Mirifici Logarithmorum Canonis Descriptio contained 90 pages of precomputed tables. Multiplying two numbers is then a matter of two look-ups, one addition, and another look-up. For large enough numbers that's worth it (and "large enough" is reached pretty quickly in the case of calculations done by hand).
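A toy version of that workflow, with a hypothetical three-decimal table playing the part of Napier's 90 pages (inputs assumed to lie on the table's grid, in [1, 10)):

```python
import bisect
import math

# Toy "log table": base-10 logs of 1.000 .. 9.999 in steps of 0.001.
MANTISSAS = [k / 1000 for k in range(1000, 10000)]
LOGS = [math.log10(m) for m in MANTISSAS]

def table_log(x: float) -> float:
    # One look-up (inputs assumed to lie on the table's grid).
    return LOGS[round(x * 1000) - 1000]

def table_antilog(s: float) -> float:
    # Inverse look-up: split off the power of ten, find nearest mantissa.
    power, frac = divmod(s, 1.0)
    i = bisect.bisect_left(LOGS, frac)
    return MANTISSAS[min(i, len(MANTISSAS) - 1)] * 10 ** int(power)

def multiply_by_table(a: float, b: float) -> float:
    # Two look-ups, one addition, one inverse look-up.
    return table_antilog(table_log(a) + table_log(b))

print(multiply_by_table(2.345, 6.789))  # ~15.93 (exact: 15.920205)
```

The precision is limited by the table's grid, which is exactly the trade-off Napier's readers accepted.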
And shortly after the introduction of logarithms, William Oughtred invented the slide rule, using two log scales to create an analog computer, removing the need for a table as well (depending on the precision required).
Yeah, if we are talking about humans, then this is definitely a big win: a big table look-up is faster for a human than multiplication (not really true for computers), and slide rules make all the difference.
Wish I could upvote this more than once :-). It also explains why people had books of 'logs' back in the day, which were just the log10 values of given numbers.
John Napier is my current math nerd hero! I've been trying to promote "magnitude notation"[0] as the next evolution in numeric notation. If we used mag numbers in science and journalism and even finance, we'd get the same increase in mental superpowers as we got with positional notation.
Notation that makes it easier to intuit logarithmic scales does sound like a good idea in a world where human societies partly struggle to come to action because of things like the majority of people not intuitively understanding exponential growth, or the difference in scale between "millionaire" and "billionaire". It would be a good tool for the mind.
yeah. because positional notation is a general mathematical concept, and it's only in practical application that its power is realized. binary is the most efficient base for computation with the technology we have at hand. positional notation exists for any/every base, but binary is the foundational optimization in practice.
if I were to argue it, that is. In reality they're all really important, and it's not like there's a twisted math genie forcing us to choose one or the other. I just felt like pointing out that choosing base two is also a very important, fundamental optimization decision, because it's led us to the world we have today.
Yes, and overflowing the mantissa moves the exponent in the right direction by the right amount; it's just that the least significant bits are not quite in the right place anymore. But that's by definition a small error.
Anyone who did high school before calculators (late 70s) would have carried around a book of logarithmic tables so they could do the multiplication necessary for many problems.