
Radix economy is all about which base represents a given number most efficiently. It is simple to show that, for large N, the cost of writing N in base b is proportional to b·log_b(N) = b·ln(N)/ln(b), so the N-independent figure of merit is b/ln(b). Simple calculus shows this is minimized at e (Euler's number), or at 3 if restricted to integers (closely followed by 2).
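A minimal sketch of that figure of merit (the function name is mine, not standard terminology):

```python
import math

def radix_economy_factor(b):
    """Asymptotic cost of representing a large N in base b:
    it takes ~log_b(N) = ln(N)/ln(b) digits, each holding one
    of b symbols, so cost ~ b*ln(N)/ln(b).  The N-independent
    factor is b/ln(b)."""
    return b / math.log(b)

# e minimizes b/ln(b); among integers, 3 wins, with 2 and 4 tied just behind.
print(f"e: {radix_economy_factor(math.e):.4f}")  # 2.7183
for b in (2, 3, 4, 10):
    print(f"{b}: {radix_economy_factor(b):.4f}")
```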

It sounds like you have something to add, but you are already dictating the base by saying "bit", literally from "binary digit". Anyway, quantization is not about which number system is best; virtually all computer systems we use today represent numbers in base 2. Quantization, at its core, is lossy compression: how do you go from a large model trained at high precision to a smaller model without hindering performance? That question can be studied without needing to know the base.
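To make the "lossy compression, independent of base" point concrete, here is a toy uniform quantizer. The function names and the [-1, 1] range are illustrative assumptions, not any particular library's API; note that nothing in it depends on the machine being binary:

```python
def quantize(w, k):
    """Map a float in [-1, 1] to one of k integer levels (lossy)."""
    step = 2.0 / (k - 1)
    return round((w + 1.0) / step)

def dequantize(q, k):
    """Map an integer level back to an approximate float in [-1, 1]."""
    step = 2.0 / (k - 1)
    return q * step - 1.0

w = 0.3237
q = quantize(w, 256)       # 256 levels; "8 bits" only if you store q in base 2
print(dequantize(q, 256))  # close to w, but the detail is gone
```

The round trip loses information (that is the compression), and the analysis of how much precision a network actually needs is the same whether k levels are stored in binary, decimal, or anything else.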

Suppose you are using a decimal computer. You can ask yourself: I have 128-decimal-precision numbers, do I need that much precision? What happens if I just use 1-decimal precision by chopping off the 127 digits after the first? You realize that an operation has two parts: the numbers involved (the operands) and the operation itself. You then ask yourself, if I keep one of the operands fixed (the original input), can I represent my 128-decimal-precision neural network simply as a series of operations without the other operand? Perhaps only the most basic ones, like no-ops (add 0 or multiply by 1), increments (add 1), decrements (subtract 1), negations (multiply by -1), and clears (multiply by 0)?

You count those numbers (-1, 0, and 1). There are 3, so you proudly proclaim you've made a neural network that only uses 0.477 dits. People get excited and confused because that is less than 1 dit, which seems like a fundamental point. You further surprise the scientific field by finding a clever trick for getting rid of negations. You beat your previous record, and now you only need 0.301 dits to represent your network.

You are about to accept your Turing Award when the ghost of Claude Shannon appears and says, "Why are you using a unit that measures entropy to mean how many symbols you have? If you insist, at least realize 0.301 dits is 1 bit." You are shocked when you realize 10^0.301 = 2^1. Reviewing Shannon's seminal paper [1], you are awestruck by his prescient comment: "Change from the base a to base b merely requires multiplication by log_b(a)." You humbly give your award to Shannon. You keep the $1M, since ghosts aren't as fast as a new NVIDIA DGX. No matter how quantized the ghost is.
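The story's arithmetic, checked in a few lines (the `convert` helper is my own name for Shannon's rule):

```python
import math

# Counting symbols vs. measuring information, in "dits" (base-10 units).
dits_for_3_symbols = math.log10(3)  # 0.477 dits for the alphabet {-1, 0, 1}
dits_for_2_symbols = math.log10(2)  # 0.301 dits after dropping negations

# Shannon: "Change from the base a to base b merely requires
# multiplication by log_b(a)."
def convert(value, base_from, base_to):
    return value * math.log(base_from, base_to)

print(round(dits_for_3_symbols, 3))        # 0.477
print(convert(dits_for_2_symbols, 10, 2))  # 0.301 dits is exactly 1 bit
# And the three-symbol alphabet is log2(3) ~ 1.585 bits per symbol,
# the same quantity behind the widely quoted "1.58-bit" figure.
```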

[1] - https://people.math.harvard.edu/~ctm/home/text/others/shanno...


