qq, can you explain a few noob thoughts? Why is surprisal defined as `log(1/p(X=x))`? I'm lacking the intuition for 1. why did someone think to use 1/p? and 2. why is there a log term here?
Log shows up a lot, but I don't think I ever really understood when/how people decide to use it in their formulas. One KL divergence youtube video said, `let's normalize the p(P)/p(Q) using (log(p(P)/q(P)))^N`, and then shows how to derive the KL formula. I'd also appreciate it if you know how the N (sample size, I guess) is being used here.
For a lot of math stuff I end up being confused about how people use operations. Like, using 1/p(X=x) is like magic to me, because I don't understand the context in which someone thinks about the problem and then decides to do something like that. What's their thinking process here? What tool or process do I not know about that leaves me confused?
It's not quite answering the question you asked, but Shannon - who invented much of this stuff - has some really nice practical arguments at the start of his amazing paper that introduced "information theory" [1]. It's a really readable paper, much less intimidating than you might think, and worth a look.
Hartley was (as far as I know) the first person to recognise the usefulness of log probabilities in the context of measuring information [2]. It's a really amazing paper, for several reasons ... but one thing that really strikes me about it is how it's aged: the first half has an essentially timeless presentation of the essence of information, and the second has a now-quite-irrelevant presentation of how to make a better TV. I guess that was the really interesting problem at the time!
Up to the choice of base, negative log is the only continuous decreasing function which satisfies f(x * y) = f(x) + f(y). If you have two independent events, it makes sense that the total surprisal (information content) should be the sum of the surprisals of the two events.
Another way to see it is as a continuous generalization of the idea that you need n bits to represent 2^n equally likely alternatives.
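A tiny sketch of that additivity in Python (the coin/die numbers are just an example I picked):

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

p_coin = 0.5        # fair coin lands heads
p_die = 1.0 / 6.0   # fair die shows a six

# Probabilities of independent events multiply...
p_both = p_coin * p_die

# ...so their surprisals add: -log2(a*b) = -log2(a) - log2(b)
print(surprisal(p_coin) + surprisal(p_die))  # ~3.585 bits
print(surprisal(p_both))                     # ~3.585 bits, same number
```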
> If you have two independent events it makes sense that the total surprisal (information content) should be the sum of the surprisals of the two events.
I think the parent is also asking why we would expect surprisal to be additive rather than multiplicative like probabilities.
I think in part because it's really nice to measure information in bits rather than tiny probabilities. And it aligns well with how information is stored.
I think "information" is a better name than surprisal.
If your distribution has N equally-likely values, then `p(x) = 1/N` and the information/surprisal is `I(x) = log(N)`. In base 2, this is how many bits are required to specify exactly WHICH of the N values you're talking about.
If `x` is not a single one of the N states but an event consisting of `n(x)` states, then `I(x) = log(N) - log(n(x))`, suggesting it takes _somewhat less information_ to specify this particular event, and it correctly gives 0 if `n(x) = N`, i.e. the event covers every state and tells you nothing.
Exactly what this "less information" means is vague, but you might think of it in terms of compression: if you compress some stream of data which is sampled from `X` with probability `p(x)`, you could use the shortest codes (0, 10, 11, etc) for the most common values, with some "stop word" to say when the end of a datum is reached. `I(x)` captures this sense in general, but it might only become literally true in the limit of a very large stream of data with a very large dictionary.
Good question. It's because -log P(X) is the number of bits of information I would need to eliminate any surprise about the value of X.
For example, let's say X is the value of an 8-bit register, and each of the 2^8=256 possible values has a probability of 1/256. I would need to learn all 8 bits to know the value of X: so Surprisal(X=x) = -log P(X=x) = -log 1/256 = 8.
Suppose I learned only that X is odd, which is equivalent to learning that the last bit of the register is equal to 1. The probability that X is odd is 1/2, and so I would need to be provided with -log P(X is odd) = -log 1/2 = 1 bit of information to learn that X is odd.
Information can be thought of as simply the elimination of uncertainty. So Surprisal (uncertainty) is basically a measure of information, but instead of how much information you have, it's how much information you need to eliminate all uncertainty.
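If it helps, here's the register example above as a quick numeric check (a sketch, using base-2 logs):

```python
import math

def surprisal_bits(p):
    return -math.log2(p)

# Uniformly random 8-bit register: each of the 256 values has p = 1/256.
print(surprisal_bits(1 / 256))  # 8.0 -- you need all 8 bits to pin down X

# "X is odd" (last bit is 1) is a p = 1/2 event: worth exactly 1 bit.
print(surprisal_bits(1 / 2))    # 1.0

# After learning X is odd, 128 candidates remain: 7 bits of surprise left.
print(surprisal_bits(1 / 128))  # 7.0
```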
Forgive my peasant mind, I've heard about K-L divergence before, I just don't know what it is used for nor what's special about it compared to other metrics?
I think people who love math expressed in nomenclature will never understand those who don't ...
K-L divergence is pretty much the dumb, obvious thing you would probably think up if someone asked you to compare two probability distributions.
You would start by going "oh I suppose I will look at how far apart they are at all the different values" and add it up. Then you would say, "oh, but 0.1 and 0.01 are really an order of magnitude apart, just like 0.01 and 0.001. Perhaps it will work better if I use the log of the probability". Then you would pause for thought and say, "Hmmmm hang on some of the values are really extreme but almost never happen, shouldn't it be weighted by how frequently they occur?".
But of course paragraphs of mathematical symbols are the way many people prefer to express this.
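For what it's worth, those three steps in code land you more or less on the textbook formula (a sketch; the distributions p and q here are toy numbers I made up):

```python
import math

# Two made-up distributions over the same four outcomes.
p = [0.50, 0.25, 0.15, 0.10]   # the "true" frequencies
q = [0.40, 0.30, 0.20, 0.10]   # the distribution you're comparing against

# Step 1: compare the distributions value by value.
# Step 2: do it on a log scale, so 0.1 vs 0.01 counts the same as 0.01 vs 0.001.
# Step 3: weight each term by how often that outcome actually happens (p).
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(kl)   # KL(P || Q) in bits
```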
> K-L divergence is pretty much the dumb, obvious thing you would probably think up if someone asked you to compare two probability distributions.
I don't think if anyone was asked simply to "compare two probability distributions", they'd come up with something asymmetric like the KL divergence. You'd need to at least add that one of the distributions is the "real" distribution.
I agree; however, the next step would be to symmetrize the KL divergence, because otherwise we get a result that depends on the order in which we compare the distributions.
Practically speaking, it's a simple measure of how similar two probability distributions are, minimised (with value zero) when they are the same. So it's often used as a loss term in optimisations when you want two distributions to be pushed towards being similar. Sometimes this is motivated by clever reasoning about information/probability ... but often it's more just "slap a KL on it", because it tends to work.
Here is the simplest way of explaining the KL divergence:
The KL divergence yields a concrete value that tells you how many actual bits of space on disk you will waste if you try to use an encoding table from one ZIP file of data to encode another ZIP file of data. It's not just theoretical, this is exactly the type of task that it's used for.
The closer the two files are to each other in content, the fewer wasted bits. So, we can use this to measure how similar two sets of information are, in a manner of speaking.
These 'wasted bits' are also known as relative entropy, since entropy basically is a measure of how disordered something can be. The more disordered, the more possibilities we have to choose from, thus the more information possible.
Entropy does not guarantee that the information is usable. It only guarantees how much of this quantity we can get, much like pipes serving water. Yes, they will likely serve water, but you can accidentally have sludge come through instead. Still, their capacity is the same.
One thing to note is that with our ZIP files, if you use the encoding tables from one to encode the other, then you will end up with different relative entropy (i.e. our 'wasted bits') numbers than if you did the vice versa. This is because the KL is not what's called symmetric. That is, it can have different meaning based upon which direction it goes.
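To make the asymmetry concrete, here's a toy calculation (the symbol frequencies are made up, standing in for the two files' contents):

```python
import math

def wasted_bits(data_dist, table_dist):
    """Expected extra bits per symbol when data following data_dist is coded
    with a table built for table_dist -- i.e. KL(data_dist || table_dist)."""
    return sum(p * math.log2(p / q) for p, q in zip(data_dist, table_dist) if p > 0)

file_a = [0.90, 0.05, 0.05]   # toy symbol frequencies for file A
file_b = [1/3, 1/3, 1/3]      # toy symbol frequencies for file B

print(wasted_bits(file_a, file_b))   # coding A with B's table: ~1.02 bits/symbol
print(wasted_bits(file_b, file_a))   # coding B with A's table: ~1.35 bits/symbol
```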
Can you pull out a piece of paper, make yourself an example problem, and tease out an intuition as to why?
First -- thank you for the thanks! It means a lot. :) Secondly -- same. It's the way my brain works. In a particular test I was administered a while back, 'coding' (i.e. mapping information transiently to symbols) was by far one of my weakest skills, oddly enough.
To me, I think in shapes. Ideas have shapes. Some ideas have very similar shapes.
This makes me very good at neural network engineering and research.
We use KL-divergence to calculate how surprising a time-series anomaly is and rank them for aviation safety, e.g., give me a ranked list of the most surprising increases in a safety metric. It's quite handy!
Would you be interested in chatting with me about this topic for a book I'm working on? If so please reach out at mathintersectprogramming@gmail.com and we can set up a time to chat!
My theory was: calculate the surprisal of the words used in a language (in my case, from an NYT corpus), then calculate the KL-divergence between a given piece of prose and the collection of surprisals for each candidate author. The author to whom the prose had the lowest KL-divergence was assumed to be the author. I think it has been used in stylometry a bit.
For me, the intuitive way of understanding it is: "how badly would a gambler lose in the long term if they keep betting on a game believing the probability distribution is Y, when it is in actual fact X?" It also explains why KL divergence is asymmetric, and why it goes to infinity / is undefined when the expected probability distribution has zeros where the true distribution has non-zeros.

Suppose an urn can have red, blue and green balls. If the true distribution (X) is that there are no red balls at all, but the gambler believes (Y) that there is a small fraction of red balls, the gambler would lose a bit of money with every bet on red, but overall the loss is finite. But suppose the gambler believes (Y) there are absolutely no red balls in the urn, but in actual fact (X) there is some small fraction of them. According to the gambler's beliefs it would be rational to gamble potentially infinite money on the ball not being red, so the loss is potentially infinite.

There is a parallel here to data compression, transmission, etc. (KL divergence between expected and actual distributions in information theory): if you believe a certain bit sequence will never occur in the input, you won't assign it a code, and so if it ever does occur you won't be able to transmit it at all ("infinite loss"). If you believe it will occur very infrequently, you will assign it a very long code, and so if it actually occurs very frequently your output data will be very long (large loss, large KL divergence).
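A minimal sketch of that gambler intuition, using per-bet log-loss as a stand-in for betting losses (the urn probabilities are toy numbers):

```python
import math

def expected_extra_loss(true_dist, believed_dist):
    """Average extra log-loss per draw when betting according to believed_dist
    while draws actually follow true_dist -- i.e. KL(true || believed)."""
    total = 0.0
    for p, q in zip(true_dist, believed_dist):
        if p == 0:
            continue            # outcomes that never happen cost nothing
        if q == 0:
            return math.inf     # the gambler staked everything on this never happening
        total += p * math.log2(p / q)
    return total

# Urn with (red, blue, green) balls.
print(expected_extra_loss([0.00, 0.50, 0.50], [0.02, 0.49, 0.49]))  # small, finite loss
print(expected_extra_loss([0.02, 0.49, 0.49], [0.00, 0.50, 0.50]))  # inf: believed red impossible
```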
One intuition is that KL-divergence represents a sort of “distance” between probability distributions. However, this isn’t quite right, as it doesn’t satisfy some basic properties a real distance (a norm) would satisfy, including the fact that it isn’t symmetric: KL(Q, P) != KL(P, Q), and it does not satisfy the triangle inequality. Nonetheless, KL(P, Q) gives you a good idea of how “far” P is from Q: in the context of encoding, if you wanted to come up with an ideal encoding of symbols coming from P, but you guessed Q as the distribution of these symbols, then KL(P, Q) is the extra number of bits you’d have to use. One nice property is that if KL(P, Q) = 0, then P and Q are equal (almost everywhere, which for most applications is irrelevant). This makes it useful in the ML context, as you can minimize KL divergence and know that the resulting “guessed” distribution is getting closer to the data distribution you’re trying to approximate with some parametrized function (an NN).
> it doesn’t satisfy some basic properties a real distance (a norm) would satisfy, including the fact that it isn’t symmetric [...] and it does not satisfy the triangle inequality.
Not sure about "real" but one can have useful distances which are not symmetric like the distance between cities measured in time or in gallons.
It just needs to be clarified that KL divergence isn’t a proper mathematical norm, so it doesn’t behave the way we intuitively think a distance should. As mentioned, it doesn’t satisfy the triangle inequality, which is a basic property for any distance-like function.
In comparison, both of your examples are much closer to norms as they both satisfy the triangle inequality.
For reference, this is what I’m referring to when I say a “norm”:
I was just clarifying that norm and distance are not the same as you seemed to imply with "a real distance (a norm)". (And I think one can also intuitively understand that the distance from A to B may not be the same as the distance from B to A as soon as we step outside of geometry.)
In addition to what others have said, it fixes a “bug” in Shannon’s continuous version of the entropy. Shannon assumed you could just replace the sum with an integral and call it a day. However, two bad things happen:
1. This definition of entropy is not invariant to coordinate transforms. If you change the parameters of your distribution, you get a different value for the entropy, despite the change of parameters not adding or removing information.
2. You can get negative values for the entropy.
Jaynes argued (in a way that’s quite readable if you find his original paper, I’m on mobile) that really you should pick a base reference measure q and define the entropy as -KL(p; q). This fixes the first bug which is the critical one. The second bug is halfway fixed because this quantity is always non-positive. However that’s alright because often we really care about the change in entropy not the absolute value (like how we care about the change in potential energy not the absolute value).
This gets at why the KL divergence is often called the relative entropy. It is the entropy relative to the reference measure q.
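For reference, the two definitions side by side (my transcription of the idea, not a quote from Jaynes):

```latex
% Shannon's naive continuous entropy: not invariant under a change of
% variables x -> y(x), and can come out negative.
h(p) = -\int p(x)\,\log p(x)\,dx

% Jaynes' fix: entropy relative to a reference measure q, i.e. -KL(p || q),
% which is reparametrization-invariant and always <= 0.
H_q(p) = -\int p(x)\,\log\frac{p(x)}{q(x)}\,dx = -\,\mathrm{KL}(p\,\|\,q)
```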
My favorite physical interpretation (discrete version):
How many extra bits per character your data compression algorithm would need to store text from distribution P if it (mistakenly) assumed it were drawn from Q.
That is, reserve the shortest “words” for the most common characters based on the assumption that the data will be drawn from Q. Then KL(P||Q) is how much bigger the compressed data will be (per input character) if the data is actually drawn from P.
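A small numeric check of this reading (toy character frequencies; I'm using ideal -log2 code lengths rather than a real compressor):

```python
import math

P = {"a": 0.50, "b": 0.25, "c": 0.25}   # what the data actually looks like
Q = {"a": 0.25, "b": 0.25, "c": 0.50}   # what the compressor was tuned for

# Average bits/char if codes are tuned to Q but data follows P (cross-entropy),
# versus the best achievable for P (its entropy).
cross_entropy = -sum(P[c] * math.log2(Q[c]) for c in P)
entropy       = -sum(P[c] * math.log2(P[c]) for c in P)

print(cross_entropy - entropy)                        # extra bits per character
print(sum(P[c] * math.log2(P[c] / Q[c]) for c in P))  # KL(P || Q): the same number
```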
Lots of different ways of motivating KL, but I think one that's frequently neglected in the ML era is its relationship to a likelihood ratio test. Kullback and Leibler originally characterized the KL divergence as a measure of the ability to discriminate between two distributions (i.e. perform a hypothesis test between them), given some set of observations (the thing you're averaging over). You can read their original paper for free if you want to hear it in their own words, Kullback and Leibler 1951, although they go kind of deep on the stat theory.
Say you're flipping a coin N times, and you get outcomes x_1, x_2, ...., x_N. You want to determine whether the coin comes up heads with probability P or probability Q (kind of a weird fiction that there are only two options, but roll with it). The classic way to do this would be a likelihood (log) ratio test, you compute this test statistic:
Y = log( Pr[x_1,...,x_N | P] / Pr[x_1,...,x_N | Q] )
Depending on the value of Y you make a decision about whether to go with P or Q, and IMO the definition makes intuitive sense for this purpose: if Y is positive then you favor P, if Y is negative you favor Q. If it's 0 then you can't pick between them. Simple and easy, plus the Neyman-Pearson lemma basically says that it's the best you could do (among the set of decision making criteria which satisfy a certain set of desirable properties).
Having defined your test statistic Y, you might then ask what kind of values you can expect to get, even before you make any experimental measurements. Well, if you assume that P is the true probability (the null hypothesis), then E_X[Y|P,Q] = N * KL[P||Q], i.e. the KL divergence per observation. Basically, the expected value of the log-likelihood ratio, the thing you would use to decide between P and Q, characterizes how similar or different they are. When the expected value is close to 0, and hence P and Q are similar, then before conducting the experiment you can expect that it will be hard to distinguish between them. When the divergence is very large then you can expect it will be easy.
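Here's a quick simulation of that claim with a coin (P, Q, N and the repetition count are arbitrary choices of mine); the average of the test statistic Y over many repeated experiments comes out near N times the KL divergence:

```python
import math
import random

random.seed(0)
P, Q, N = 0.7, 0.5, 100   # true heads probability, alternative, flips per experiment

def log_lik(heads, tails, p):
    return heads * math.log(p) + tails * math.log(1 - p)

ys = []
for _ in range(2000):
    heads = sum(random.random() < P for _ in range(N))
    ys.append(log_lik(heads, N - heads, P) - log_lik(heads, N - heads, Q))

kl = P * math.log(P / Q) + (1 - P) * math.log((1 - P) / (1 - Q))
print(sum(ys) / len(ys))   # empirical mean of the test statistic Y
print(N * kl)              # N times KL(P || Q), in nats -- should be close
```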
This likelihood ratio approach highlights the fact that KL divergence is a member of the family of Csiszár F-divergences. These are measures of "distance" between distributions of the form E_Q[ F(p(x)/q(x)) ] where F is any convex function with F(1) = 0. This is kind of a generalization of log-likelihood where F "weights" the badness of ratios different from 1. When F(t) = t log t you get KL(P||Q); when F is -log you get the KL divergence in the other direction, KL(Q||P).
Another curious fact about KL divergences is that they are also a Bregman divergence: take a convex function H and define B_H(P, Q) = H(P) - H(Q) - <∇H(Q), P - Q>. These generalize pointwise squared Euclidean distance. KL is obtained when H(P) is the negative entropy \sum_x p(x) log p(x).
I spent a bunch of time studying divergences over distributions (e.g., see my blog post[1]) and in particular these two classes and the really neat fact about KL divergence is that it is essentially the only divergence that is both an F-divergence and a Bregman divergence. This is basically due to the property of log that turns logs of products into sums.
As for practical use cases, one is approximately optimizing a distribution that's too complicated to work with directly:
- You want to find the min/max of some probability distribution P(x)
- P(x) is too complicated to find a closed-form min, but you can draw samples from it.
- So instead, you carefully construct some OTHER probability distribution Q(x|θ) that you claim is structurally similar "enough" to P(x), parameterized by θ.
- Now you find the θ which minimizes the KL divergence KL(P(x) || Q(x|θ)); this delivers the parameters θ of Q(x|θ) that make it [approximately] "most" similar to P(x), without ever having optimized P(x) directly.
It was a trick that came up a lot when AI consisted of giant Bayesian plate models for each specific task that you had to hand-optimize.
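A minimal sketch of that recipe for the simplest possible case, where Q(x|θ) is a single Gaussian and P(x) is stood in for by a mixture we can only sample from (all the specifics here are toy choices of mine):

```python
import math
import random

random.seed(0)

# Stand-in for a complicated P(x) we can only sample from:
# a mixture of two Gaussians (a toy choice).
def sample_p():
    if random.random() < 0.3:
        return random.gauss(-2.0, 0.5)
    return random.gauss(1.0, 1.0)

samples = [sample_p() for _ in range(50_000)]

# Q(x | theta) is a single Gaussian, theta = (mu, sigma).  In KL(P || Q),
# the E_P[log P] term doesn't depend on theta, so minimizing the divergence
# means maximizing E_P[log Q(x | theta)] -- which, for a Gaussian Q, is
# solved by matching the sample mean and standard deviation.
mu = sum(samples) / len(samples)
sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))
print(mu, sigma)   # the single Gaussian "most similar" to P in the KL(P || Q) sense
```

With a richer family for Q you'd do the same minimization by gradient descent on θ rather than in closed form.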
I think KL divergence is best understood through the lens of variational inference. Inference in the Bayesian regime is a balancing act between best explaining the data and “keeping it simple”, by staying close to the prior. KL divergence is the (only) divergence measure that places just the right amount of weight on the prior to make variational inference proper Bayesian inference. I’ve written about it here: http://www.openias.org/variational-coin-toss
Intuitively, it measures the difference between two probability distributions. It's not symmetric, so it's not quite that, but in my opinion, it's good intuition.
As motivation, say you're an internet provider, providing internet service to a business. You naturally want to save money, so you perhaps want to compress packets before they go over the wire. Let's say the business you're providing service to also compresses their data, but they've made a mistake and do it inefficiently.
Let's say the business has, incorrectly, determined the probability distribution for their data to be $q(x)$. That is, they assign probability $q(x)$ to seeing symbol $x$. Let's say you've determined the "true" distribution to be $p(x)$. The expected number of bits they transmit per packet/symbol will be $-\sum p(x) lg(q(x))$. Meaning, they'll compress their stream under the assumption that the distribution is $q(x)$, but the actual probability of seeing a packet, $x$, is $p(x)$, which is why the term $p(x) lg(q(x))$ shows up.
The number of bits you're transmitting is just $-\sum p(x) lg(p(x))$. Now we ask: how many bits, per packet, does your method save over the business's? This is $-\sum p(x) lg(q(x)/p(x))$, which is exactly the Kullback-Leibler divergence (maybe up to a sign difference).
In other words, given a "guess" at a distribution and the "true" distribution, how much does the bad guess cost? This is the Kullback-Leibler divergence, and why it shows up (I believe) in machine learning and fitness functions.
As a more concrete example, I just ran across a paper [0] talking about using WFC [1] to assess how well it, and other algorithms, do when trying to create generative "super mario brothers"-like levels. Take a 2x2 or 3x3 grid, make a library of tiles, use that to generate a random level, then use the K-L divergence to determine how well your generative algorithm has done compared to the observed distribution from an example image.
Your characterization (originally due to Cover, I think) is a strong one because it concretely ties KL divergence to nature. I also like Sanov's theorem for this.
I have sat through many frustrating anti-explanations of the following sort:
>What is KL divergence you ask? Why, it's simply a quantitative difference between distributions. The further away distributions are, the higher KL divergence is... It's like a distance-squared between distributions... but it isn't symmetric and it doesn't obey any usual triangle inequality, so this analogy isn't helpful for analysis... Pinsker's inequality gives a useful lower bound. A useful general upper bound is, uhh,... uh...
This class of answer is totally uninformative (and discrediting if given, IMO) because it does not provide a useful, unique characterization of KL divergence, only fundamentally inaccurate descriptions of it.
It just describes how two probability distributions are different. If they are the same then it’s 0.
Example: an LLM gives a probability distribution of the next word. If it is perfectly accurate at predicting the next word then divergence is 0 (100% probability on the actual next word). If it is slightly off or unsure then the divergence goes up.
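Concretely, the "true" distribution in that setting is a one-hot vector on the word that actually came next, so the KL term reduces to -log of the probability the model gave that word (an illustration with made-up numbers):

```python
import math

def kl_from_onehot(predicted, actual_word):
    """KL(one-hot(actual) || predicted) collapses to -log2 p(actual)."""
    return -math.log2(predicted[actual_word])

actual_next_word = "sat"
confident = {"the": 0.01, "cat": 0.01, "sat": 0.97, "mat": 0.01}
unsure    = {"the": 0.25, "cat": 0.25, "sat": 0.25, "mat": 0.25}

print(kl_from_onehot(confident, actual_next_word))  # ~0.04 bits: nearly zero
print(kl_from_onehot(unsure, actual_next_word))     # 2.0 bits: divergence goes up
```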
It's a useful way of measuring how different two probability distributions are. If F and G are distributions, KL(F||G) is a non-negative real number which is larger if F and G are less similar.
In a lot of statistical estimation procedures, you have some kind of "current estimate" distribution which has nice properties and some kind of "true distribution" which you'd like to use your nice distribution to approximate. It's then common to create a system which manipulates the parameters of your current estimate distribution to minimize the KL-divergence with the true distribution.
A relatively simple example of this is fitting a Gaussian mixture model. If you look up how that process is derived you'll see it depends centrally on minimizing the KL-divergence between two distributions.
There are other ways to measure the difference between two probability distributions, but the KL-divergence has some nice properties. It shows up as the answer to lots of well-motivated questions around statistical inference (Neyman-Pearson testing, information geometry, Bayesian inference, entropy) and it has a form which is somewhat amenable to algebraic manipulation.
In some sense it's popular because it keeps showing up and working. People recognize its form, consider it relatively simple, and find it meaningful to talk about. It's common enough that you might even begin considering a problem by asking if you can minimize the KL-divergence between your estimate and some goal just knowing that outright it will likely lead to a successful solution to your problem.
Here is another view, which also explains where the logarithm comes from.
The KL between two distributions of a random variable, say KL[p||q], tells you: if you made a perfect compression algorithm for samples from distribution q, how many extra bits/nats would you expect to need to code samples that actually come from p using that compression algorithm?
And compression is all about keeping only the true information that is encoded in a sample.
One use case is in the training of Diffusion Models. In the original formulation, the likelihood-maximizing objective is recast in terms of KL divergences. This is done because the KL divergence between Gaussians has a closed form, and the transition distributions in Diffusion Models are taken to be Gaussian, which makes the problem tractable.
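For reference, the closed form being relied on here, for two univariate Gaussians (the multivariate version is analogous):

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\middle\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
  = \log\frac{\sigma_2}{\sigma_1}
  + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
  - \frac{1}{2}
```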
My intro to K-L divergence was in the context of maximum likelihood estimation with a misspecified model. The asymptotic expected value of a parametric MLE is the parameter value that minimizes the K-L divergence from the true distribution to the model. That means that when the parametric model contains the true distribution, the MLE is a consistent estimator (as is well known). But it also gives you a way of analytically finding the (asymptotic) bias of the MLE when the model does not contain the true distribution.
The connection between all of these manifestations of KL divergence is that a system far from equilibrium contains more information (in the Shannon sense) than a system in equilibirum. That "excess information" is what drives fitness within some environment.
I'm not sure if I follow the fitness-driving argument. Without an informational sieve of some kind, isn't it just random entropy?
In this case, sure, far from equilibrium has more information, but that just means the system has more flexibility to move, not that the system actually inherently contains more information (stored information vs inbound throughput).
Unless it's some kind of Nash game, or problem with multiple similarly-deep valleys, I don't see how that would do much more from a fitness perspective other than allow us to effect something approaching an unbiased estimator of the minima density of whatever problem we're working on is. Or...whatever, I'm not quite sure.
Anywho, however it is, I'm a bit confused on the last part there.
I will confess that at least the biology work is still somewhat mysterious to me. John Baez used to be active on Twitter. I believe he's still active on Mathstodon now. If you're curious, you might ask him.
I learnt about KL Divergence recently and it was pretty cool to know that cross-entropy loss originated from KL Divergence.
But could someone give me the cases where it is preferred to use mean-squared error loss vs cross-entropy loss? Are there any merits or demerits to using either?
This is the NN 101 explanation: mean-square loss is for regression, cross-entropy is for classification.
NN 201 explanation: mean-square is about finite but continuous errors (residuals) of predicted value vs true value of the output, cross-entropy is for distributions over discrete sets of categories.
NN 501 explanation: the task of the NN and the form of its outputs should be defined in terms of the "shape" or nature of the residuals of its predictions. Mean-square corresponds to predicting means with Gaussian residuals, cross-entropy corresponds to predicting over discrete outputs with multinomial (mutually exclusive) structure. Indeed you can derive any loss function you want by first defining the expected form of the residuals and then deriving the negative log-likelihood of the associated distribution.
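A compact version of that 501 derivation for the two common cases (a sketch; constants that don't depend on the prediction are dropped):

```latex
% Gaussian residuals with fixed variance:
-\log p(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \text{const}
    \quad\Rightarrow\quad \text{mean-squared error}

% Multinomial over K classes, one-hot target y:
-\log p(y \mid \hat{p}) = -\sum_{k=1}^{K} y_k \log \hat{p}_k
    \quad\Rightarrow\quad \text{cross-entropy}
```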
If you look deep enough they are not exclusive. From the deeplearning book:
"Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-
entropy between the empirical distribution defined by the training set and the
probability distribution defined by model. For example, mean squared error is the
cross-entropy between the empirical distribution and a Gaussian model"
Also, the gradients of "softmax loss" and mean-squared-error loss with respect to the pre-activation outputs have the same form (prediction minus target). The network learns to optimize the same function, up until the output activation.
In NN training, minimizing cross entropy is equivalent to minimizing KL divergence. This is because cross entropy is equal to (entropy of the true distribution) + (KL divergence from the true distribution to the model). Obviously by changing the model you can't change the first term, only the second. So when you minimize cross entropy, you are minimizing KL divergence.
Minimizing mean-squared-error loss is equivalent to minimizing KL divergence (and thus cross entropy) under the assumption that your model produces a vector that parameterizes the mean of a multivariate Gaussian distribution (with fixed covariance) which is then used to predict your data. This is the most natural way to set up a model that predicts continuous data.
TL;DR: One is about distance in space, the other is about spread in space.
KL-Divergence is not a metric: it's not symmetric, and it's biased toward your reference distribution. It does, however, give a probabilistic / information view of the difference between distributions. One consequence is that KL will highlight the tails of your distributions.
Euclidean distance, L2, is a metric, so it is suited when you need a metric. However, it does not give insight into any distributional phenomena except the means of the distributions.
For example, you are a teacher, you have two classes. You want to compare the grades.
L2 can be a summary of how far apart the grades are. The length of the tails of the two grade distributions won't have an impact if they have the same mean. That's good if you want to know the average level of both classes.
KL will tell you how alike the spreads of the class grades are. Two classes can have a small KL divergence if their grade distributions have the same shape. If your classes are very different - one very homogeneous and the other very heterogeneous - then your KL will be big, even if the averages are very close.
Btw, KL divergence isn’t symmetrical. So D(P,Q) != D(Q,P). If you need a symmetrical version of it, you can use Jensen–Shannon divergence which is the mean of D(P,Q) and D(Q,P). Or if you only care about relative distances you can just use the sum.
surprisal: how surprised I am when I learn the value of X
entropy: how surprised I expect to be
cross-entropy: how surprised I expect Bob to be (if Bob's beliefs are q instead of p)
KL divergence: how much *more* surprised I expect Bob to be than me
information gain: how much less surprised I expect Bob to be if he knew that Y=y
mutual information: how much information I expect to gain about X from learning the value of Y