Bayes is guaranteed to overfit (yulingyao.com)
157 points by Ambolia on May 28, 2023 | 46 comments


As the author admits at the end, this is rather misleading. In normal usage, "overfit" is by definition a bad thing (it wouldn't be "over" if it were good). And the argument given does nothing to show that Bayesian inference is doing anything bad.

To take a trivial example, suppose you have a uniform(0,1) prior for the probability of a coin landing heads. Integrating over this gives a probability for heads of 1/2. You flip the coin once, and it lands heads. If you integrate over the posterior given this observation, you'll find that the predictive probability of the observed value, heads, is now 2/3, greater than it was under the prior.
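
A quick sketch of that arithmetic (assuming the standard conjugate Beta-Bernoulli update, where the uniform(0,1) prior is Beta(1,1); the names are mine):

  # Predictive probability of heads under a Beta(a, b) belief,
  # where a and b are pseudo-counts of heads and tails.
  def predictive_heads(a, b):
      return a / (a + b)

  print(predictive_heads(1, 1))      # 0.5    - prior predictive, Beta(1,1)
  print(predictive_heads(1 + 1, 1))  # 0.666... - posterior predictive after one observed head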

And that's OVERFITTING, according to the definition in the blog post.

Not according to any sensible definition, however.


I was writing another comment based on that same example and his leave-one-out calculations (at least as far as I understood them).

Comparing posterior to prior is the extreme case of a leave-one-out procedure: when you leave out the only data point, there is nothing left.

The divergence between the data and the model goes down when we include information about the data in the model. That doesn't seem a controversial opinion. (That's how the blog post is introduced here: https://twitter.com/YulingYao/status/1662284440603619328)

---

If the data consist of two flips, they are either equal or different (the former becomes more likely as the true probability diverges from 0.5).

a) If the two flips are the same, the posterior predictive probability of that outcome is 3/4. The in-sample log score is 2 log(3/4) ≈ -0.6.

When we check the out-of-sample log score for each flip, using the 2/3 posterior predictive obtained from the other, we get log(2/3) ≈ -0.4 in each case.

b) If the two flips are different, the posterior predictive probability of each outcome is still 1/2. The in-sample log score is 2 log(1/2) = 2 × (-0.7) ≈ -1.4.

When we check the out-of-sample log score for each flip, using the 1/3 posterior predictive for that outcome obtained from the other, we get log(1/3) ≈ -1.1 in each case.
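
A small script to check those numbers (assuming the conjugate Beta-Bernoulli update for the uniform-prior coin; variable names are my own):

  import math

  def predictive(a, b, heads):
      # probability of one flip under a Beta(a, b) posterior
      return a / (a + b) if heads else b / (a + b)

  def scores(flips):
      # in-sample and leave-one-out log scores, uniform Beta(1,1) prior
      a = 1 + sum(flips)                   # prior pseudo-count + observed heads
      b = 1 + len(flips) - sum(flips)      # prior pseudo-count + observed tails
      in_sample = sum(math.log(predictive(a, b, f)) for f in flips)
      loo = sum(math.log(predictive(a - f, b - (not f), f)) for f in flips)
      return in_sample, loo

  print(scores([True, True]))    # a) 2*log(3/4) ~ -0.58 in-sample, 2*log(2/3) ~ -0.81 LOO
  print(scores([True, False]))   # b) 2*log(1/2) ~ -1.39 in-sample, 2*log(1/3) ~ -2.20 LOO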


When there is a small amount of information, the variance of any estimate is very big, and this explains what happens in that example. Overfitting implies different behavior in training and in test, and this is related to a big variance in the estimate of the error. So a small amount of information implies that any model suffers from overfitting and big variance; it's a general result not related specifically to Bayes.


I'm on my phone so I haven't tried to work through the math to see where the error is, but the author's conclusion is wrong and the counterexample is simple.

> Unless in degenerating cases (the posterior density is point mass), then the harmonic mean inequality guarantees a strict inequality p ( y i | y − i ) < p ( y i | y ) , for any point i and any model.

Let y_1, ... y_n be iid from a Uniform(0,theta) distribution, with some nice prior on theta (e.g. Exponential(1)). Then the posterior for theta, and hence the predictive density for a new y_i, depends only on max(y_1, ..., y_n). So for all but one of the n observations the author's strict inequality does not hold.


The author mentions he defines overfitting as “Test error is always larger than training error”. Is there an algorithm or model where that’s not the case?


Yeah, that's a crummy definition. You can easily force "Test error is always larger than training error" for any model type.


It's not even about "forcing". This is such common and expected behavior that it's surprising (and suspicious) when it isn't the case.


yes, that's better said.


A pedantic example: a model that ignored the training data would do just as well on the training set as on the test set.


If you do not train on the training set, then there is no training set, and your example is degenerate.


We've trained on the training set, but our fit() implementation was, "accept the input, do nothing with it, and then return." When we then evaluate the model on the test set and training set, assuming there isn't a large covariate shift between the two or something, we'll get approximately equal values.

So it's a degenerate case, but not of the training set. (And presumably that's partly what they meant by "pedantic".)
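
As a toy illustration (hypothetical code, not from the thread):

  import numpy as np

  class DoNothingModel:
      # accept the training data, do nothing with it, and return
      def fit(self, X, y):
          return self

      def predict(self, X):
          return np.zeros(len(X))   # same constant prediction everywhere

  rng = np.random.default_rng(0)
  X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
  X_test, y_test = rng.normal(size=(100, 3)), rng.normal(size=100)

  model = DoNothingModel().fit(X_train, y_train)
  mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))
  # train and test error come out approximately equal, since the model never used either
  print(mse(y_train, model.predict(X_train)), mse(y_test, model.predict(X_test)))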


You see this regularly in practice when aggressive data augmentation is used, which obviously is only applied to the training data. But, of course, you'd still look 'overfit' if you fed in unaugmented training data.


I think they mean "always (statistically) significantly larger". They're probably imagining that something like cross-validation would make test errors approximately equal to training errors, but if you consistently see significantly larger errors, then you've overfit.


When you find the minimum of your validation curve, rarely if ever is your test loss lower than your training loss. I don't think this necessarily means you're overfitting.


Under the Probably Approximately Correct (PAC) framework, there's an x% chance your error on the empirical distribution deviates more than ε in either direction from your error on the actual distribution, so one can assume similar things between two empirical distributions.
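
The standard single-hypothesis version of that statement (Hoeffding's inequality for a fixed hypothesis h and a loss bounded in [0,1], with empirical error \hat{R}_n on n samples and true error R) is

  P\left(\,\lvert \hat{R}_n(h) - R(h) \rvert > \epsilon\,\right) \le 2 \exp\!\left(-2 n \epsilon^2\right)

and PAC-style generalization guarantees are built by union-bounding this over the hypothesis class (or a cover of it).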


Yes and no. Suppose, for example, you give MNIST to an SVM and fit the model, then test it only on 0 and 1 digits, which are generally well discriminated: you'll get almost 100% accuracy in test versus 97% or less in training. (You'd probably need some preprocessing, like PaCMAP or UMAP or whatever, but the point is the same.)

However, that's just because I chose the right data to test it on. So you can't really say much about a model using that definition.
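
A rough sketch of that experiment (using scikit-learn's small digits dataset as a stand-in for MNIST; the exact numbers will depend on the kernel and preprocessing):

  from sklearn.datasets import load_digits
  from sklearn.model_selection import train_test_split
  from sklearn.svm import SVC

  X, y = load_digits(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  clf = SVC().fit(X_tr, y_tr)          # fit on all ten digit classes
  mask = (y_te == 0) | (y_te == 1)     # but evaluate only on the easy 0/1 test digits
  print("train accuracy, all digits:", clf.score(X_tr, y_tr))
  print("test accuracy, 0/1 only:   ", clf.score(X_te[mask], y_te[mask]))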


It seems that the post is comparing a predictive distribution conditioned on N data points to one conditioned on N-1 data points. The latter is a biased estimate of the former (e.g., https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2...)


This. Leave-one-out doesn't turn an in-sample prediction problem into an equivalent out-of-sample problem, but into a slightly less skilful one.


I don't follow the math. WLOG, for N total datapoints, let y_i = y_N. Then the leave-one-out posterior predictive is

  \int p(y_N|θ)p(θ|{y_1...y_{N-1}}) dθ = p(y_N|{y_1...y_{N-1}})
by the law of total probability.

Expanding the leave-one-out posterior (via Bayes' rule), we have

  p(θ|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|θ)p(θ)/\int p({y_1...y_{N-1}}|θ')p(θ') dθ'
which when plugged back into the first equation is

  \int p(y_N|θ) p({y_1...y_{N-1}}|θ)p(θ) dθ/(\int p({y_1...y_{N-1}}|θ')p(θ') dθ')
I don't see how this simplifies to the harmonic mean expression in the post.

Regardless, the author is asserting that

  p(y_N|{y_1...y_{N-1}}) ≤ p(y_N|{y_1...y_N})
which seems intuitively plausible for any trained model — given a model trained on data {y_1...y_N}, performing inference on any datapoint y_1...y_N in the training set will generally be more accurate than performing inference on a datapoint y_{N+1} not in the training set.
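
A quick numerical check of that inequality on a toy conjugate model (Normal data with known variance and a Normal prior on the mean; my own example, not the post's):

  import numpy as np
  from scipy.stats import norm

  def log_predictive(y_i, y_cond, prior_var=1.0, noise_var=1.0):
      # log p(y_i | y_cond) for y ~ N(mu, noise_var), prior mu ~ N(0, prior_var)
      n = len(y_cond)
      post_var = 1.0 / (1.0 / prior_var + n / noise_var)
      post_mean = post_var * np.sum(y_cond) / noise_var
      return norm.logpdf(y_i, loc=post_mean, scale=np.sqrt(post_var + noise_var))

  rng = np.random.default_rng(0)
  y = rng.normal(loc=0.3, scale=1.0, size=10)

  for i in range(len(y)):
      loo = log_predictive(y[i], np.delete(y, i))   # conditioned on the other N-1 points
      full = log_predictive(y[i], y)                # conditioned on all N points (in-sample)
      print(i, loo < full)                          # holds for every point here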


It's reassuring to see that I'm not the only one who finds those equations far from obvious. I didn't spend much time trying to understand the derivation though - as you wrote the result doesn't seem interesting anyway.


I got lost in the second equation, when the author says

p(y_i|y_{-i}) = \int p(y_i|\theta) p(\theta|y) \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta^\prime)^{-1} p(\theta^\prime|y) \, d\theta^\prime} d\theta

why is that? Can someone explain the rationale behind this?



Here's a good recent paper that looks at this problem and provides remedies in a Bayesian manner. https://arxiv.org/abs/2202.11678


Quote

|How do we compare between hypotheses that are entirely consistent with observations?

|... Occam's razor

The answer is: build an experiment that makes it obvious. According to Occam's razor, the Sun was burning coal, until we found out it doesn't.


Their paper is more for the case of "we can't gather more data, so what to do?" but your solution is in line with optimal experimental design and choosing a utility function to distinguish between models using as little data as possible.


This is a pretty silly blog post. The complaint comes down to "a Bayesian model will always place a lower probability on outcomes that have not been observed than on outcomes which have been observed already" which... of course it would! In what situation where you're trying to understand an exchangeable set of outcomes would you think you should put more probability on things that you haven't seen than on those you have? The only things I can dream up violate exchangeability (e.g. a finite bag of different-colored marbles, where you draw without replacement).


So the argument is essentially that "not only if you pick the best thing fitting your finite data, but even if you take a weighted average over things that fit your finite data, proportionally to how well they fit it, you still almost surely end up with something that fits your finite sample better than the general population (that this sample was drawn from)"?


Typically, we think of overfitting and underfitting as exclusive properties. IMO a large problem here is that the author's definition of overfitting is not inconsistent with underfitting. (Underfitting, in general, means a poor fit on both the training and test sets.)

For example, a simple model might underfit in general, yet still fit the training set better than the test set. If both are poor fits, it is clearly underfitting and not overfitting. Yet by the article's definition, it would be underfitting and overfitting simultaneously. So I suspect this is not an ideal definition.


Am I missing something or is this argument only as strong as the (perfectly reasonable) claim that all models overfit unless they have a regularisation term?


The author is using a slightly different (but not wrong) definition of overfitting that possibly you are not used to.


We can safely say that Bayes is guaranteed to underfit compared to MLE.


Bayesian statistics is dumb. What's the point of using prior assumptions that are based on speculation? It's better to admit your lack of knowledge, not jump to conclusions without sufficient data, and process data in an unbiased manner.


The prior represents your current state of knowledge, whatever that may be. If you have a total lack of knowledge, your prior is uniform over all possible states.

It is never proper to fabricate a biased (e.g. non-uniform) prior without any knowledge/information; that is not in any way part of Bayesian inference. If you do have some extremely weak evidence, you use it accordingly with an extremely weak prior.

To paraphrase E.T. Jaynes, the rules of probability theory (e.g. Bayes Theorem) are the unique logically consistent way to reason about uncertainty.


I'm not sure total lack of knowledge can ever be encoded into a prior distribution, and if it can, I think the Jeffreys prior makes a more compelling case than a uniform prior. After all, to believe that all possible values of the prior parameter are equally likely is to believe that all possible values of its square are not.
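
(To make that last point concrete: if \theta \sim Uniform(0,1), the change-of-variables formula gives the density of \phi = \theta^2 as

  p(\phi) = p(\theta)\left|\frac{d\theta}{d\phi}\right| = \frac{1}{2\sqrt{\phi}}, \qquad 0 < \phi < 1,

which is far from uniform. The Jeffreys prior is constructed to be invariant under exactly this kind of reparameterization.)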


How can your prior be uniform if the hypothesis class is unbounded?


How can the hypothesis class be unbounded?

Anyway, you can define a sequence of solutions with bounded uniform priors and calculate the limiting solution. For any given data set, as the endpoints of the intervals go to +/- infinity, the solution will converge to the uniform-prior one, if it exists.


I discussed this in another reply in this thread, but I can't personally think of any examples where the hypothesis class is unbounded. The laws of physics, computation, the human mind, measuring instruments, etc. all impose bounds on real world problems.

As someone who does Bayesian inference a lot for my work (computational biology), I very often use uniform priors, but the structure of all real world problems I have ever encountered allows me to specify hard bounds on the edges of non-zero probability.


This isn't always possible, but sometimes you can define what's called an improper prior p(µ), such that even if ∫p(µ)dµ is not finite, the posterior distribution p(µ|x) is.

A common example is when p(x|µ,v) is a Gaussian dist. with a prior on the mean set to p(µ)=1.
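
Spelling that example out (known variance v, flat prior p(\mu) = 1 over the whole real line; the standard textbook calculation):

  p(\mu \mid x_1,\dots,x_n) \propto 1 \cdot \prod_{i=1}^n \mathcal{N}(x_i \mid \mu, v) \propto \exp\!\left(-\frac{n(\mu - \bar{x})^2}{2v}\right),

so the posterior is \mathcal{N}(\bar{x}, v/n), which is perfectly proper even though the prior doesn't integrate to a finite value.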


Too bad that the prior is fixed and you cannot change it to represent your lack of knowledge, eh?


The prior is a way to condense knowledge into math. A prior doesn't make sense if there is a lack of knowledge.


What does “lack of knowledge” mean? If it means “any value is as plausible as any other as far as I know” there is a prior for that. Etc.


> “any value is as plausible as any other as far as I know”

That isn't possible to use as a prior, as a uniform distribution from negative to positive infinity would have to be zero everywhere.


Can you give an example of a problem where there is no possible prior information, and states all the way from negative to positive infinity are equally plausible? I don't think the laws of physics (not to mention, the practical limits of computation) allow for real world inference problems where there are no bounds on a prior of any kind.

Failing that, I don't think there are any human usable data sources that could report observations over an infinite interval.


It’s easy to use a uniform prior in a suitably huge interval - say from minus one gazillion to plus one gazillion - and you can look at the limit when the endpoints go to minus/plus infinity if you really have doubts about whether the interval was huge enough for your problem to fit comfortably within it.

If you don’t think that this is possible that says more about you than about the shortcomings of Bayesian statistics.


What would you do if you see an observation outside of the interval?


Use a larger interval?

As I said, you can define the improper uniform prior solution as the limit of a sequence of solutions corresponding to a sequence of increasingly wider intervals with endpoints that go to +/-infinity.

(And as I said, you can start with a suitably huge region. Say that you want to determine the position of something and use a uniform prior that extends to a distance of 10^27m - a perfectly bounded prior from a mathematical point of view that covers the whole observable universe. If you observe something outside it, it’s not with the prior that you have a problem.)



